rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-05-20 14:40:37 +00:00

Author	SHA1	Message	Date
Konstantin Knizhnik	f63c8e5a8c	Update Postgres versions to use smgrexists() instead of access() to check if Oid is used (#8597 ) ## Problem PR #7992 was merged without correspondent changes in Postgres submodules and this is why test_oid_overflow.py is failed now. ## Summary of changes Bump Postgres versions ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-08-05 14:24:54 +03:00
Alex Chi Z.	200fa56b04	feat(pageserver): support split delta layers (#8599 ) part of https://github.com/neondatabase/neon/issues/8002 Similar to https://github.com/neondatabase/neon/pull/8574, we add auto-split support for delta layers. Tests are reused from image layer split writers. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-08-05 10:30:49 +00:00
dotdister	0f3dac265b	safekeeper: remove unused partial_backup_enabled option (#8547 ) ## Problem There is an unused safekeeper option `partial_backup_enabled`. `partial_backup_enabled` was implemented in #6530, but this option was always turned into enabled in #8022. If you intended to keep this option for a specific reason, I will close this PR. ## Summary of changes I removed an unused safekeeper option `partial_backup_enabled`.	2024-08-05 09:23:59 +02:00
Alex Chi Z.	1dc496a2c9	feat(pageserver): support auto split layers based on size (#8574 ) part of https://github.com/neondatabase/neon/issues/8002 ## Summary of changes Add a `SplitImageWriter` that automatically splits image layer based on estimated target image layer size. This does not consider compression and we might need a better metrics. --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2024-08-05 06:55:36 +01:00
Alex Chi Z.	6814bdd30b	fix(pageserver): deadlock in gc-compaction (#8590 ) We need both compaction and gc lock for gc-compaction. The lock order should be the same everywhere, otherwise there could be a deadlock where A waits for B and B waits for A. We also had a double-lock issue. The compaction lock gets acquired in the outer `compact` function. Note that the unit tests directly call `compact_with_gc`, and therefore not triggering the issue. ## Summary of changes Ensure all places acquire compact lock and then gc lock. Remove an extra compact lock acqusition. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-08-03 00:52:04 +01:00
John Spray	0a667bc8ef	tests: add test_historic_storage_formats (#8423 ) ## Problem Currently, our backward compatibility tests only look one release back. That means, for example, that when we switch on image layer compression by default, we'll test reading of uncompressed layers for one release, and then stop doing it. When we make an index_part.json format change, we'll test against the old format for a week, then stop (unless we write separate unit tests for each old format). The reality in the field is that data in old formats will continue to exist for weeks/months/years. When we make major format changes, we should retain examples of the old format data, and continuously verify that the latest code can still read them. This test uses contents from a new path in the public S3 bucket, `compatibility-data-snapshots/`. It is populated by hand. The first important artifact is one from before we switch on compression, so that we will keep testing reads of uncompressed data. We will generate more artifacts ahead of other key changes, like when we update remote storage format for archival timelines. Closes: https://github.com/neondatabase/cloud/issues/15576	2024-08-02 18:28:23 +01:00
Arthur Petukhovsky	f3acfb2d80	Improve safekeepers eviction rate limiting (#8456 ) This commit tries to fix regular load spikes on staging, caused by too many eviction and partial upload operations running at the same time. Usually it was hapenning after restart, for partial backup the load was delayed. - Add a semaphore for evictions (2 permits by default) - Rename `resident_since` to `evict_not_before` and smooth out the curve by using random duration - Use random duration in partial uploads as well related to https://github.com/neondatabase/neon/issues/6338 some discussion in https://neondb.slack.com/archives/C033RQ5SPDH/p1720601531744029	2024-08-02 15:26:46 +01:00
Arpad Müller	8c828c586e	Wait for completion of the upload queue in flush_frozen_layer (#8550 ) Makes `flush_frozen_layer` add a barrier to the upload queue and makes it wait for that barrier to be reached until it lets the flushing be completed. This gives us backpressure and ensures that writes can't build up in an unbounded fashion. Fixes #7317	2024-08-02 13:07:12 +02:00
John Spray	2334fed762	storage_controller: start adding chaos hooks (#7946 ) Chaos injection bridges the gap between automated testing (where we do lots of different things with small, short-lived tenants), and staging (where we do many fewer things, but with larger, long-lived tenants). This PR adds a first type of chaos which isn't really very chaotic: it's live migration of tenants between healthy pageservers. This nevertheless provides continuous checks that things like clean, prompt shutdown of tenants works for realistically deployed pageservers with realistically large tenants.	2024-08-02 09:37:44 +01:00
John Spray	c53799044d	pageserver: refine how we delete timelines after shard split (#8436 ) ## Problem Previously, when we do a timeline deletion, shards will delete layers that belong to an ancestor. That is not a correctness issue, because when we delete a timeline, we're always deleting it from all shards, and destroying data for that timeline is clearly fine. However, there exists a race where one shard might start doing this deletion while another shard has not yet received the deletion request, and might try to access an ancestral layer. This creates ambiguity over the "all layers referenced by my index should always exist" invariant, which is important to detecting and reporting corruption. Now that we have a GC mode for clearing up ancestral layers, we can rely on that to clean up such layers, and avoid deleting them right away. This makes things easier to reason about: there are now no cases where a shard will delete a layer that belongs to a ShardIndex other than itself. ## Summary of changes - Modify behavior of RemoteTimelineClient::delete_all - Add `test_scrubber_physical_gc_timeline_deletion` to exercise this case - Tweak AWS SDK config in the scrubber to enable retries. Motivated by seeing the test for this feature encounter some transient "service error" S3 errors (which are probably nothing to do with the changes in this PR)	2024-08-02 08:00:46 +01:00
Alexander Bayandin	e7477855b7	test_runner: don't create artifacts if Allure is not enabled (#8580 ) ## Problem `allure_attach_from_dir` method might create `tar.zst` archives even if `--alluredir` is not set (i.e. Allure results collection is disabled) ## Summary of changes - Don't run `allure_attach_from_dir` if `--alluredir` is not set	2024-08-01 15:55:43 +00:00
Alex Chi Z.	f4a668a27d	fix(pageserver): skip existing layers for btm-gc-compaction (#8498 ) part of https://github.com/neondatabase/neon/issues/8002 Due to the limitation of the current layer map implementation, we cannot directly replace a layer. It's interpreted as an insert and a deletion, and there will be file exist error when renaming the newly-created layer to replace the old layer. We work around that by changing the end key of the image layer. A long-term fix would involve a refactor around the layer file naming. For delta layers, we simply skip layers with the same key range produced, though it is possible to add an extra key as an alternative solution. * The image layer range for the layers generated from gc-compaction will be Key::MIN..(Key..MAX-1), to avoid being recognized as an L0 delta layer. * Skip existing layers if it turns out that we need to generate a layer with the same persistent key in the same generation. Note that it is possible that the newly-generated layer has different content from the existing layer. For example, when the user drops a retain_lsn, the compaction could have combined or dropped some records, therefore creating a smaller layer than the existing one. We discard the "optimized" layer for now because we cannot deal with such rewrites within the same generation. --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-08-01 15:00:06 +01:00
Alex Chi Z.	970f2923b2	storage-scrubber: log version on start (#8571 ) Helps us better identify which version of storage scrubber is running. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-08-01 13:52:34 +00:00
John Spray	1678dea20f	pageserver: add layer visibility calculation (#8511 ) ## Problem We recently added a "visibility" state to layers, but nothing initializes it. Part of: - #8398 ## Summary of changes - Add a dependency on `range-set-blaze`, which is used as a fast incrementally updated alternative to KeySpace. We could also use this to replace the internals of KeySpaceRandomAccum if we wanted to. Writing a type that does this kind of "BtreeMap & merge overlapping entries" thing isn't super complicated, but no reason to write this ourselves when there's a third party impl available. - Add a function to layermap to calculate visibilities for each layer - Add a function to Timeline to call into layermap and then apply these visibilities to the Layer objects. - Invoke the calculation during startup, after image layer creations, and when removing branches. Branch removal and image layer creation are the two ways that a layer can go from Visible to Covered. - Add unit test & benchmark for the visibility calculation - Expose `pageserver_visible_physical_size` metric, which should always be <= `pageserver_remote_physical_size`. - This metric will feed into the /v1/utilization endpoint later: the visible size indicates how much space we would like to use on this pageserver for this tenant. - When `pageserver_visible_physical_size` is greater than `pageserver_resident_physical_size`, this is a sign that the tenant has long-idle branches, which result in layers that are visible in principle, but not used in practice. This does not keep visibility hints up to date in all cases: particularly, when creating a child timeline, any previously covered layers will not get marked Visible until they are accessed. Updates after image layer creation could be implemented as more of a special case, but this would require more new code: the existing depth calculation code doesn't maintain+yield the list of deltas that would be covered by an image layer. ## Performance This operation is done rarely (at startup and at timeline deletion), so needs to be efficient but not ultra-fast. There is a new `visibility` bench that measures runtime for a synthetic 100k layers case (`sequential`) and a real layer map (`real_map`) with ~26k layers. The benchmark shows runtimes of single digit milliseconds (on a ryzen 7950). This confirms that the runtime shouldn't be a problem at startup (as we already incur S3-level latencies there), but that it's slow enough that we definitely shouldn't call it more often than necessary, and it may be worthwhile to optimize further later (things like: when removing a branch, only bother scanning layers below the branchpoint) ``` visibility/sequential time: [4.5087 ms 4.5894 ms 4.6775 ms] change: [+2.0826% +3.9097% +5.8995%] (p = 0.00 < 0.05) Performance has regressed. Found 24 outliers among 100 measurements (24.00%) 2 (2.00%) high mild 22 (22.00%) high severe min: 0/1696070, max: 93/1C0887F0 visibility/real_map time: [7.0796 ms 7.0832 ms 7.0871 ms] change: [+0.3900% +0.4505% +0.5164%] (p = 0.00 < 0.05) Change within noise threshold. Found 4 outliers among 100 measurements (4.00%) 3 (3.00%) high mild 1 (1.00%) high severe min: 0/1696070, max: 93/1C0887F0 visibility/real_map_many_branches time: [4.5285 ms 4.5355 ms 4.5434 ms] change: [-1.0012% -0.8004% -0.5969%] (p = 0.00 < 0.05) Change within noise threshold. ```	2024-08-01 09:25:35 +00:00
Arpad Müller	163f2eaf79	Reduce linux-raw-sys duplication (#8577 ) Before, we had four versions of linux-raw-sys in our dependency graph: ``` linux-raw-sys@0.1.4 linux-raw-sys@0.3.8 linux-raw-sys@0.4.13 linux-raw-sys@0.6.4 ``` now it's only two: ``` linux-raw-sys@0.4.13 linux-raw-sys@0.6.4 ``` The changes in this PR are minimal. In order to get to its state one only has to update procfs in Cargo.toml to 0.16 and do `cargo update -p tempfile -p is-terminal -p prometheus`.	2024-08-01 08:22:21 +00:00
Christian Schwarz	980d506bda	pageserver: shutdown all walredo managers 8s into shutdown (#8572 ) # Motivation The working theory for hung systemd during PS deploy (https://github.com/neondatabase/cloud/issues/11387) is that leftover walredo processes trigger a race condition. In https://github.com/neondatabase/neon/pull/8150 I arranged that a clean Tenant shutdown does actually kill its walredo processes. But many prod machines don't manage to shut down all their tenants until the 10s systemd timeout hits and, presumably, triggers the race condition in systemd / the Linux kernel that causes the frozen systemd # Solution This PR bolts on a rather ugly mechanism to shut down tenant managers out of order 8s after we've received the SIGTERM from systemd. # Changes - add a global registry of `Weak<WalRedoManager>` - add a special thread spawned during `shutdown_pageserver` that sleeps for 8s, then shuts down all redo managers in the registry and prevents new redo managers from being created - propagate the new failure mode of tenant spawning throughout the code base - make sure shut down tenant manager results in PageReconstructError::Cancelled so that if Timeline::get calls come in after the shutdown, they do the right thing	2024-08-01 07:57:09 +02:00
Alex Chi Z.	d6c79b77df	test(pageserver): add test_gc_feedback_with_snapshots (#8474 ) should be working after https://github.com/neondatabase/neon/pull/8328 gets merged. Part of https://github.com/neondatabase/neon/issues/8002 adds a new perf benchmark case that ensures garbages can be collected with branches --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-07-31 17:55:19 -04:00
Alexander Bayandin	3350daeb9a	CI(create-test-report): fix missing benchmark results in Allure report (#8540 ) ## Problem In https://github.com/neondatabase/neon/pull/8241 I've accidentally removed `create-test-report` dependency on `benchmarks` job ## Summary of changes - Run `create-test-report` after `benchmarks` job	2024-07-31 19:47:59 +01:00
Arpad Müller	939d50a41c	storage_scrubber: migrate FindGarbage to remote_storage (#8548 ) Uses the newly added APIs from #8541 named `stream_tenants_generic` and `stream_objects_with_retries` and extends them with `list_objects_with_retries_generic` and `stream_tenant_timelines_generic` to migrate the `find-garbage` command of the scrubber to `GenericRemoteStorage`. Part of https://github.com/neondatabase/neon/issues/7547	2024-07-31 18:24:42 +00:00
John Spray	2f9ada13c4	controller: simplify reconciler generation increment logic (#8560 ) ## Problem This code was confusing, untested and covered: - an impossible case, where intent state is AttacheStale (we never do this) - a rare edge case (going from AttachedMulti to Attached), which we were not testing, and in any case the pageserver internally does the same Tenant reset in this transition as it would do if we incremented generation. Closes: https://github.com/neondatabase/neon/issues/8367 ## Summary of changes - Simplify the logic to only skip incrementing the generation if the location already has the expected generation and the exact same mode.	2024-07-31 18:37:47 +01:00
Cihan Demirci	ff51b565d3	cicd: change Azure storage details [2/2] (#8562 ) Change Azure storage configuration to point to updated variables/secrets. Also update subscription id variable.	2024-07-31 17:42:10 +01:00
Tristan Partin	5e0409de95	Fix negative replication delay metric In some cases, we can get a negative metric for replication_delay_bytes. My best guess from all the research I've done is that we evaluate pg_last_wal_receive_lsn() before pg_last_wal_replay_lsn(), and that by the time everything is said and done, the replay LSN has advanced past the receive LSN. In this case, our lag can effectively be modeled as 0 due to the speed of the WAL reception and replay.	2024-07-31 10:16:58 -05:00
Christian Schwarz	4e3b70e308	refactor(page_service): Timeline gate guard holding + cancellation + shutdown (#8339 ) Since the introduction of sharding, the protocol handling loop in `handle_pagerequests` cannot know anymore which concrete `Tenant`/`Timeline` object any of the incoming `PagestreamFeMessage` resolves to. In fact, one message might resolve to one `Tenant`/`Timeline` while the next one may resolve to another one. To avoid going to tenant manager, we added the `shard_timelines` which acted as an ever-growing cache that held timeline gate guards open for the lifetime of the connection. The consequence of holding the gate guards open was that we had to be sensitive to every cached `Timeline::cancel` on each interaction with the network connection, so that Timeline shutdown would not have to wait for network connection interaction. We can do better than that, meaning more efficiency & better abstraction. I proposed a sketch for it in * https://github.com/neondatabase/neon/pull/8286 and this PR implements an evolution of that sketch. The main idea is is that `mod page_service` shall be solely concerned with the following: 1. receiving requests by speaking the protocol / pagestream subprotocol 2. dispatching the request to a corresponding method on the correct shard/`Timeline` object 3. sending response by speaking the protocol / pagestream subprotocol. The cancellation sensitivity responsibilities are clear cut: * while in `page_service` code, sensitivity to page_service cancellation is sufficient * while in `Timeline` code, sensitivity to `Timeline::cancel` is sufficient To enforce these responsibilities, we introduce the notion of a `timeline::handle::Handle` to a `Timeline` object that is checked out from a `timeline::handle::Cache` for each request. The `Handle` derefs to `Timeline` and is supposed to be used for a single async method invocation on `Timeline`. See the lengthy doc comment in `mod handle` for details of the design.	2024-07-31 17:05:45 +02:00
Alex Chi Z.	61a65f61f3	feat(pageserver): support btm-gc-compaction for child branches (#8519 ) part of https://github.com/neondatabase/neon/issues/8002 For child branches, we will pull the image of the modified keys from the parant into the child branch, which creates a full history for generating key retention. If there are not enough delta keys, the image won't be wrote eventually, and we will only keep the deltas inside the child branch. We could avoid the wasteful work to pull the image from the parent if we can know the number of deltas in advance, in the future (currently we always pull image for all modified keys in the child branch) --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-07-31 15:48:48 +01:00
Alexander Bayandin	d21246c8bd	CI(regress-tests): run less regression tests (#8561 ) ## Problem We run regression tests on `release` & `debug` builds for each of the three supported Postgres versions (6 in total). With upcoming ARM support and Postgres 17, the number of jobs will jump to 16, which is a lot. See the internal discussion here: https://neondb.slack.com/archives/C033A2WE6BZ/p1722365908404329 ## Summary of changes - Run `regress-tests` job in debug builds only with the latest Postgres version - Do not do `debug` builds on release branches	2024-07-31 15:10:27 +01:00
Christian Schwarz	4825b0fec3	compaction_level0_phase1: bypass PS PageCache for data blocks (#8543 ) part of https://github.com/neondatabase/neon/issues/8184 # Problem We want to bypass PS PageCache for all data block reads, but `compact_level0_phase1` currently uses `ValueRef::load` to load the WAL records from delta layers. Internally, that maps to `FileBlockReader:read_blk` which hits the PageCache [here](`e78341e1c2/pageserver/src/tenant/block_io.rs (L229-L236)`). # Solution This PR adds a mode for `compact_level0_phase1` that uses the `MergeIterator` for reading the `Value`s from the delta layer files. `MergeIterator` is a streaming k-merge that uses vectored blob_io under the hood, which bypasses the PS PageCache for data blocks. Other notable changes: * change the `DiskBtreeReader::into_stream` to buffer the node, instead of holding a `PageCache` `PageReadGuard`. * Without this, we run out of page cache slots in `test_pageserver_compaction_smoke`. * Generally, `PageReadGuard`s aren't supposed to be held across await points, so, this is a general bugfix. # Testing / Validation / Performance `MergeIterator` has not yet been used in production; it's being developed as part of * https://github.com/neondatabase/neon/issues/8002 Therefore, this PR adds a validation mode that compares the existing approach's value iterator with the new approach's stream output, item by item. If they're not identical, we log a warning / fail the unit/regression test. To avoid flooding the logs, we apply a global rate limit of once per 10 seconds. In any case, we use the existing approach's value. Expected performance impact that will be monitored in staging / nightly benchmarks / eventually pre-prod: * with validation: * increased CPU usage * ~doubled VirtualFile read bytes/second metric * no change in disk IO usage because the kernel page cache will likely have the pages buffered on the second read * without validation: * slightly higher DRAM usage because each iterator participating in the k-merge has a dedicated buffer (as opposed to before, where compactions would rely on the PS PageCaceh as a shared evicting buffer) * less disk IO if previously there were repeat PageCache misses (likely case on a busy production Pageserver) * lower CPU usage: PageCache out of the picture, fewer syscalls are made (vectored blob io batches reads) # Rollout The new code is used with validation mode enabled-by-default. This gets us validation everywhere by default, specifically in - Rust unit tests - Python tests - Nightly pagebench (shouldn't really matter) - Staging Before the next release, I'll merge the following aws.git PR that configures prod to continue using the existing behavior: * https://github.com/neondatabase/aws/pull/1663 # Interactions With Other Features This work & rollout should complete before Direct IO is enabled because Direct IO would double the IOPS & latency for each compaction read (#8240). # Future Work The streaming k-merge's memory usage is proportional to the amount of memory per participating layer. But `compact_level0_phase1` still loads all keys into memory for `all_keys_iter`. Thus, it continues to have active memory usage proportional to the number of keys involved in the compaction. Future work should replace `all_keys_iter` with a streaming keys iterator. This PR has a draft in its first commit, which I later reverted because it's not necessary to achieve the goal of this PR / issue #8184.	2024-07-31 14:17:59 +02:00
Cihan Demirci	a4df3c8488	cicd: change Azure storage details [1/2] (#8553 ) Change Azure storage configuration to point to new variables/secrets. They have the `_NEW` suffix in order not to disrupt any tests while we complete the switch.	2024-07-30 19:34:15 +00:00
Christian Schwarz	d95b46f3f3	cleanup(compact_level0_phase1): some commentary and wrapping into block expressions (#8544 ) Byproduct of scouting done for https://github.com/neondatabase/neon/issues/8184 refs https://github.com/neondatabase/neon/issues/8184	2024-07-30 18:13:18 +02:00
Yuchen Liang	85bef9f05d	feat(scrubber): post `scan_metadata` results to storage controller (#8502 ) Part of #8128, followup to #8480. closes #8421. Enable scrubber to optionally post metadata scan health results to storage controller. Signed-off-by: Yuchen Liang <yuchen@neon.tech>	2024-07-30 16:07:34 +01:00
Yuchen Liang	e374d6778e	feat(storcon): store scrubber metadata scan result (#8480 ) Part of #8128, followed by #8502. ## Problem Currently we lack mechanism to alert unhealthy `scan_metadata` status if we start running this scrubber command as part of a cronjob. With the storage controller client introduced to storage scrubber in #8196, it is viable to set up alert by storing health status in the storage controller database. We intentionally do not store the full output to the database as the json blobs potentially makes the table really huge. Instead, only a health status and a timestamp recording the last time metadata health status is posted on a tenant shard. Signed-off-by: Yuchen Liang <yuchen@neon.tech>	2024-07-30 14:32:00 +01:00
Anton Chaporgin	9ceaf9a986	[neon/acr] impr: push to ACR while building images (#8545 ) This tests the ability to push into ACR using OIDC. Proved it worked by running slightly modified YAML. In `promote-images` we push the following images `neon compute-tools {vm-,}compute-node-{v14,v15,v16}` into `neoneastus2`. https://github.com/neondatabase/cloud/issues/14640	2024-07-30 14:15:53 +01:00
Alexander Bayandin	f72fe68626	CI(benchmarking): make neonvm default provisioner (#8538 ) ## Problem We don't allow regular end-users to use `k8s-pod` provisioner, but we still use it in nightly benchmarks ## Summary of changes - Remove `provisioner` input from `neon-create-project` action, use `k8s-neonvm` as a default provioner - Change `neon-` platform prefix to `neonvm-` - Remove `neon-captest-freetier` and `neon-captest-new` as we already have their `neonvm` counterparts	2024-07-30 13:38:23 +01:00
Arpad Müller	9fabdda2dc	scrubber: add remote_storage based listing APIs and use them in find-large-objects (#8541 ) Add two new functions `stream_objects_with_retries` and `stream_tenants_generic` and use them in the `find-large-objects` subcommand, migrating it to `remote_storage`. Also adds the `size` field to the `ListingObject` struct. Part of #7547	2024-07-30 09:00:37 +00:00
Arpad Müller	1c7b06c988	Add metrics for input data considered and taken for compression (#8522 ) If compression is enabled, we currently try compressing each image larger than a specific size and if the compressed version is smaller, we write that one, otherwise we use the uncompressed image. However, this might sometimes be a wasteful process, if there is a substantial amount of images that don't compress well. The compression metrics added in #8420 `pageserver_compression_image_in_bytes_total` and `pageserver_compression_image_out_bytes_total` are well designed for answering the question how space efficient the total compression process is end-to-end, which helps one to decide whether to enable it or not. To answer the question of how much waste there is in terms of trial compression, so CPU time, we add two metrics: * one about the images that have been trial-compressed (considered), and * one about the images where the compressed image has actually been written (chosen). There is different ways of weighting them, like for example one could look at the count, or the compressed data. But the main contributor to compression CPU usage is amount of data processed, so we weight the images by their uncompressed size. In other words, the two metrics are: * `pageserver_compression_image_in_bytes_considered` * `pageserver_compression_image_in_bytes_chosen` Part of #5431	2024-07-30 09:59:15 +02:00
John Spray	52b02d95c8	scrubber: enable cleaning up garbage tenants from known deletion bugs, add object age safety check (#8461 ) ## Problem Old storage buckets can contain a lot of tenants that aren't known to the control plane at all, because they belonged to test jobs that get their control plane state cleaned up shortly after running. In general, it's somewhat unsafe to purge these, as it's hard to distinguish "control plane doesn't know about this, so it's garbage" from "control plane said it didn't know about this, which is a bug in the scrubber, control plane, or API URL configured". However, the most common case is that we see only a small husk of a tenant in S3 from a specific old behavior of the software, for example: - We had a bug where heatmaps weren't deleted on tenant delete - When WAL DR was first deployed, we didn't delete initdb.tar.zst on tenant deletion ## Summary of changes - Add a KnownBug variant for the garbage reason - Include such cases in the "safe" deletion mode (`--mode=deleted`) - Add code that inspects tenants missing in control plane to identify cases of known bugs (this is kind of slow, but should go away once we've cleaned all these up) - Add an additional `-min-age` safety check similar to physical GC, where even if everything indicates objects aren't needed, we won't delete something that has been modified too recently. --------- Co-authored-by: Yuchen Liang <70461588+yliang412@users.noreply.github.com> Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2024-07-29 17:50:44 +01:00
Christian Schwarz	4be58522fb	l0_flush: use mode=direct by default => coverage in automated tests (#8534 ) Testing in staging and pre-prod has been [going well](https://github.com/neondatabase/neon/issues/7418#issuecomment-2255474917). This PR enables mode=direct by default, thereby providing additional coverage in the automated tests: - Rust tests - Integration tests - Nightly pagebench (likely irrelevant because it's read-only) Production deployments continue to use `mode=page-cache` for the time being: https://github.com/neondatabase/aws/pull/1655 refs https://github.com/neondatabase/neon/issues/7418	2024-07-29 16:49:22 +02:00
Christian Schwarz	d09dad0ea2	pageserver: fail if `id` is present in pageserver.toml (#8489 ) Overall plan: https://www.notion.so/neondatabase/Rollout-Plan-simplified-pageserver-initialization-f935ae02b225444e8a41130b7d34e4ea?pvs=4 --- `identity.toml` is the authoritative place for `id` as of https://github.com/neondatabase/neon/pull/7766 refs https://github.com/neondatabase/neon/issues/7736	2024-07-29 15:16:32 +01:00
John Spray	5775662276	pageserver: fix return code from secondary_download_handler (#8508 ) ## Problem The secondary download HTTP API is meant to return 200 if the download is complete, and 202 if it is still in progress. In #8198 the download implementation was changed to drop out with success early if it over-runs a time budget, which resulted in 200 responses for incomplete downloads. This breaks storcon_cli's "tenant-warmup" command, which uses the OK status to indicate download complete. ## Summary of changes - Only return 200 if we get an Ok() _and_ the progress stats indicate the download is complete.	2024-07-29 15:05:30 +01:00
Joonas Koivunen	bdfc9ca7e9	test: deflake test_duplicate_creation (#8536 ) By including comparison of `remote_consistent_lsn_visible` we risk flakyness coming from outside of timeline creation. Mask out the `remote_consistent_lsn_visible` for the comparison. Evidence: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8489/10142336315/index.html#suites/ffbb7f9930a77115316b58ff32b7c719/89ff0270bf58577a	2024-07-29 13:41:06 +01:00
a-masterov	1d8cf5b3a9	Add a test for clickhouse as a logical replication consumer (#8408 ) ## Problem We need to test logical replication with 3rd-party tools regularly. ## Summary of changes Added a test using ClickHouse as a client Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2024-07-29 14:35:12 +02:00
Arpad Müller	859f019185	Adopt list_streaming in tenant deletion (#8504 ) Uses the Stream based `list_streaming` function added by #8457 in tenant deletion, as suggested in https://github.com/neondatabase/neon/pull/7932#issuecomment-2150480180 . We don't have to worry about retries, as the function is wrapped inside an outer retry block. If there is a retryable error either during the listing or during deletion, we just do a fresh start. Also adds `+ Send` bounds as they are required by the `delete_tenant_remote` function.	2024-07-29 12:05:18 +02:00
Alexander Bayandin	da6bdff893	test_runner: fix user_property usage in benchmarks (#8531 ) ## Problem After https://github.com/neondatabase/neon/pull/7990 `regress_test` job started to fail with an error: ``` ... File "/__w/neon/neon/test_runner/fixtures/benchmark_fixture.py", line 485, in pytest_terminal_summary terminalreporter.write(f"{test_report.head_line}.{recorded_property['name']}: ") TypeError: 'bool' object is not subscriptable ``` https://github.com/neondatabase/neon/actions/runs/10125750938/job/28002582582 It happens because the current implementation doesn't expect pytest's `user_properties` can be used for anything else but benchmarks (and https://github.com/neondatabase/neon/pull/7990 started to use it for tracking `preserve_database_files` parameter) ## Summary of changes - Make NeonBenchmarker use only records with`neon_benchmarker_` prefix	2024-07-29 11:00:33 +01:00
Christian Schwarz	2416da337e	safekeeper: include application name in wal service root span (#8525 ) For IDENTIFY_SYSTEM in particular, application name gives away whether the client is * walproposer => Some(wal_proposer_recovery) * safekeeper => Some(safekeeper) * pageserver => Some(pageserver) Context: https://neondb.slack.com/archives/C06SJG60FRB/p1721987794673429?thread_ts=1721981056.451599&cid=C06SJG60FRB	2024-07-28 20:36:59 +01:00
Alexander Bayandin	6cad0455b0	CI(test_runner): Upload all test artifacts if preserve_database_files is enabled (#7990 ) ## Problem There's a `NeonEnvBuilder#preserve_database_files` parameter that allows you to keep database files for debugging purposes (by default, files get cleaned up), but there's no way to get these files from a CI run. This PR adds handling of `NeonEnvBuilder#preserve_database_files` and adds the compressed test output directory to Allure reports (for tests with this parameter enabled). Ref https://github.com/neondatabase/neon/issues/6967 ## Summary of changes - Compress and add the whole test output directory to Allure reports - Currently works only with `neon_env_builder` fixture - Remove `preserve_database_files = True` from sharding tests as unneeded --------- Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-07-27 20:01:10 +01:00
Arpad Müller	b5e95f68b5	Persist archival information (#8479 ) Persists whether a timeline is archived or not in `index_part.json`. We only return success if the upload has actually worked successfully. Also introduces a new `index_part.json` version number. Fixes #8459 Part of #8088	2024-07-27 02:32:05 +00:00
Alex Chi Z.	dd40b19db4	fix(pageserver): give L0 compaction priorities over image layer creation (#8443 ) close https://github.com/neondatabase/neon/issues/8435 ## Summary of changes If L0 compaction did not include all L0 layers, skip image generation. There are multiple possible solutions to the original issue, i.e., an alternative is to wrap the partial L0 compaction in a loop until it compacts all L0 layers. However, considering that we should weight all tenants equally, the current solution can ensure everyone gets a chance to run compaction, and those who write too much won't get a chance to create image layers. This creates a natural backpressure feedback that they get a slower read due to no image layers are created, slowing down their writes, and eventually compaction could keep up with their writes + generate image layers. Consider deployment, we should add an alert on "skipping image layer generation", so that we won't run into the case that image layers are not generated => incidents again. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-07-26 18:09:55 +00:00
Christian Schwarz	68241f5a3e	raise wait_lsn timeout from 60s to 300s (#8529 ) Problem ------- wait_lsn timeouts result in a user-facing errors like ``` $ /tmp/neon/pg_install/v16/bin/pgbench -s3424 -i -I dtGvp user=neondb_owner dbname=neondb host=ep-tiny-wave-w23owa37.eastus2.azure.neon.build sslmode=require options='-cstatement_timeout=0 ' dropping old tables... NOTICE: table "pgbench_accounts" does not exist, skipping NOTICE: table "pgbench_branches" does not exist, skipping NOTICE: table "pgbench_history" does not exist, skipping NOTICE: table "pgbench_tellers" does not exist, skipping creating tables... generating data (server-side)... vacuuming... pgbench: error: query failed: ERROR: [NEON_SMGR] [shard 0] could not read block 214338 in rel 1663/16389/16839.0 from page server at lsn C/E1C12828 DETAIL: page server returned error: LSN timeout: Timed out while waiting for WAL record at LSN C/E1418528 to arrive, last_record_lsn 6/999D9CA8 disk consistent LSN=6/999D9CA8, WalReceiver status: (update 2024-07-25 08:30:07): connecting to node 25, safekeeper candidates (id\|update_time\|commit_lsn): [(21\|08:30:16\|C/E1C129E0), (23\|08:30:16\|C/E1C129E0), (25\|08:30:17\|C/E1C129E0)] CONTEXT: while scanning block 214338 of relation "public.pgbench_accounts" pgbench: detail: Query was: vacuum analyze pgbench_accounts ``` Solution -------- Its better to be slow than to fail the queries. If the app has a deadline, it can use `statement_timeout`. In the long term, we want to eliminate wait_lsn timeout. In the short term (this PR), we bump the wait_lsn timeout to a larger value to reduce the frequency at which these wait_lsn timeouts occur. We will observe SLOs and specifically `pageserver_wait_lsn_seconds_bucket` before we eliminate the timeout completely.	2024-07-26 16:44:57 +01:00
Christian Schwarz	8154e88732	refactor(layer load API): all errors are permanent (#8527 ) I am not aware of a case of "transient" VirtualFile errors as mentioned in https://github.com/neondatabase/neon/pull/5880 Private DM with Joonas discussing this: https://neondb.slack.com/archives/D049K7HJ9JM/p1721836424615799	2024-07-26 15:48:44 +01:00
Em Sharnoff	240ba7e10c	Fix sql-exporter-autoscaling for pg < 16 (#8523 ) The lfc_approximate_working_set_size_windows query was failing on pg14 and pg15 with pq: subquery in FROM must have an alias Because aliases in that position became optional only in pg16. Some context here: https://neondb.slack.com/archives/C04DGM6SMTM/p1721970322601679?thread_ts=1721921122.528849	2024-07-26 15:08:13 +01:00
Vlad Lazar	7a796a9963	storcon: introduce step down primitive (#8512 ) ## Problem We are missing the step-down primitive required to implement rolling restarts of the storage controller. ## Summary of changes Add `/control/v1/step_down` endpoint which puts the storage controller into a state where it rejects all API requests apart from `/control/v1/step_down`, `/status` and `/metrics`. When receiving the request, storage controller cancels all pending reconciles and waits for them to exit gracefully. The response contains a snapshot of the in-memory observed state. Related: * https://github.com/neondatabase/cloud/issues/14701 * https://github.com/neondatabase/neon/issues/7797 * https://github.com/neondatabase/neon/pull/8310	2024-07-26 14:54:09 +01:00

1 2 3 4 5 ...

5778 Commits