rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-05 13:10:37 +00:00

Author	SHA1	Message	Date
Alex Chi Z	933bb88694	Revert "refactor(page_service): Timeline gate guard holding + cancellation + shutdown (#8339 )" This reverts commit `4e3b70e308`.	2024-08-07 19:57:33 +08:00
Arpad Müller	00c981576a	Lower level for timeline cancellations during gc (#8626 ) Timeline cancellation running in parallel with gc yields error log lines like: ``` Gc failed 1 times, retrying in 2s: TimelineCancelled ``` They are completely harmless though and normal to occur. Therefore, only print those messages at an info level. Still print them at all so that we know what is going on if we focus on a single timeline.	2024-08-07 09:29:52 +02:00
Arpad Müller	c3f2240fbd	storage broker: only print one line for version and build tag in init (#8624 ) This makes it more consistent with pageserver and safekeeper. Also, it is easier to collect the two values into one data point.	2024-08-07 09:14:26 +02:00
Yuchen Liang	ed5724d79d	scrubber: clean up `scan_metadata` before prod (#8565 ) Part of #8128. ## Problem Currently, scrubber `scan_metadata` command will return with an error code if the metadata on remote storage is corrupted with fatal errors. To safely deploy this command in a cronjob, we want to differentiate between failures while running scrubber command and the erroneous metadata. At the same time, we also want our regression tests to catch corrupted metadata using the scrubber command. ## Summary of changes - Return with error code only when the scrubber command fails - Uses explicit checks on errors and warnings to determine metadata health in regression tests. Resolve conflict with `tenant-snapshot` command (after shard split): [`test_scrubber_tenant_snapshot`](https://github.com/neondatabase/neon/blob/yuchen/scrubber-scan-cleanup-before-prod/test_runner/regress/test_storage_scrubber.py#L23) failed before applying `422a8443dd` - When taking a snapshot, the old `index_part.json` in the unsharded tenant directory is not kept. - The current `list_timeline_blobs` implementation consider no `index_part.json` as a parse error. - During the scan, we are only analyzing shards with highest shard count, so we will not get a parse error. but we do need to add the layers to tenant object listing, otherwise we will get index is referencing a layer that is not in remote storage error. - Action: Add s3_layers from `list_timeline_blobs` regardless of parsing error Signed-off-by: Yuchen Liang <yuchen@neon.tech>	2024-08-06 18:55:42 +01:00
John Spray	ca5390a89d	pageserver: add `bench_ingest` (#7409 ) ## Problem We lack a rust bench for the inmemory layer and delta layer write paths: it is useful to benchmark these components independent of postgres & WAL decoding. Related: https://github.com/neondatabase/neon/issues/8452 ## Summary of changes - Refactor DeltaLayerWriter to avoid carrying a Timeline, so that it can be cleanly tested + benched without a Tenant/Timeline test harness. It only needed the Timeline for building `Layer`, so this can be done in a separate step. - Add `bench_ingest`, which exercises a variety of workload "shapes" (big values, small values, sequential keys, random keys) - Include a small uncontroversial optimization: in `freeze`, only exhaustively walk values to assert ordering relative to end_lsn in debug mode. These benches are limited by drive performance on a lot of machines, but still useful as a local tool for iterating on CPU/memory improvements around this code path. Anecdotal measurements on Hetzner AX102 (Ryzen 7950xd): ``` ingest-small-values/ingest 128MB/100b seq time: [1.1160 s 1.1230 s 1.1289 s] thrpt: [113.38 MiB/s 113.98 MiB/s 114.70 MiB/s] Found 1 outliers among 10 measurements (10.00%) 1 (10.00%) low mild Benchmarking ingest-small-values/ingest 128MB/100b rand: Warming up for 3.0000 s Warning: Unable to complete 10 samples in 10.0s. You may wish to increase target time to 18.9s. ingest-small-values/ingest 128MB/100b rand time: [1.9001 s 1.9056 s 1.9110 s] thrpt: [66.982 MiB/s 67.171 MiB/s 67.365 MiB/s] Benchmarking ingest-small-values/ingest 128MB/100b rand-1024keys: Warming up for 3.0000 s Warning: Unable to complete 10 samples in 10.0s. You may wish to increase target time to 11.0s. ingest-small-values/ingest 128MB/100b rand-1024keys time: [1.0715 s 1.0828 s 1.0937 s] thrpt: [117.04 MiB/s 118.21 MiB/s 119.46 MiB/s] ingest-small-values/ingest 128MB/100b seq, no delta time: [425.49 ms 429.07 ms 432.04 ms] thrpt: [296.27 MiB/s 298.32 MiB/s 300.83 MiB/s] Found 1 outliers among 10 measurements (10.00%) 1 (10.00%) low mild ingest-big-values/ingest 128MB/8k seq time: [373.03 ms 375.84 ms 379.17 ms] thrpt: [337.58 MiB/s 340.57 MiB/s 343.13 MiB/s] Found 1 outliers among 10 measurements (10.00%) 1 (10.00%) high mild ingest-big-values/ingest 128MB/8k seq, no delta time: [81.534 ms 82.811 ms 83.364 ms] thrpt: [1.4994 GiB/s 1.5095 GiB/s 1.5331 GiB/s] Found 1 outliers among 10 measurements (10.00%) ```	2024-08-06 16:39:40 +00:00
John Spray	3727c6fbbe	pageserver: use layer visibility when composing heatmap (#8616 ) ## Problem Sometimes, a layer is Covered by hasn't yet been evicted from local disk (e.g. shortly after image layer generation). It is not good use of resources to download these to a secondary location, as there's a good chance they will never be read. This follows the previous change that added layer visibility: - #8511 Part of epic: - https://github.com/neondatabase/neon/issues/8398 ## Summary of changes - When generating heatmaps, only include Visible layers - Update test_secondary_downloads to filter to visible layers when listing layers from an attached location	2024-08-06 17:15:40 +01:00
John Spray	42229aacf6	pageserver: fixes for layer visibility metric (#8603 ) ## Problem In staging, we could see that occasionally tenants were wrapping their pageserver_visible_physical_size metric past zero to 2^64. This is harmless right now, but will matter more later when we start using visible size in things like the /utilization endpoint. ## Summary of changes - Add debug asserts that detect this case. `test_gc_of_remote_layers` works as a reproducer for this issue once the asserts are added. - Tighten up the interface around access_stats so that only Layer can mutate it. - In Layer, wrap calls to `record_access` in code that will update the visible size statistic if the access implicitly marks the layer visible (this was what caused the bug) - In LayerManager::rewrite_layers, use the proper set_visibility layer function instead of directly using access_stats (this is an additional path where metrics could go bad.) - Removed unused instances of LayerAccessStats in DeltaLayer and ImageLayer which I noticed while reviewing the code paths that call record_access.	2024-08-06 14:47:01 +01:00
John Spray	b7beaa0fd7	tests: improve stability of `test_storage_controller_many_tenants` (#8607 ) ## Problem The controller scale test does random migrations. These mutate secondary locations, and therefore can cause secondary optimizations to happen in the background, violating the test's expectation that consistency_check will work as there are no reconciliations running. Example: https://neon-github-public-dev.s3.amazonaws.com/reports/main/10247161379/index.html#suites/07874de07c4a1c9effe0d92da7755ebf/6316beacd3fb3060/ ## Summary of changes - Only migrate to existing secondary locations, not randomly picked nodes, so that we can do a fast reconcile_until_idle (otherwise reconcile_until_idle is takes a long time to create new secondary locations). - Do a reconcile_until_idle before consistency_check.	2024-08-06 12:58:33 +01:00
a-masterov	16c91ff5d3	enable rum test (#8380 ) ## Problem We need to test the rum extension automatically as a path of the GitHub workflow ## Summary of changes rum test is enabled	2024-08-06 13:56:42 +02:00
a-masterov	078f941dc8	Add a test using Debezium as a client for the logical replication (#8568 ) ## Problem We need to test the logical replication with some external consumers. ## Summary of changes A test of the logical replication with Debezium as a consumer was added. --------- Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2024-08-06 13:08:55 +02:00
Arseny Sher	68bcbf8227	Add package-mode=false to poetry. We don't use it for packaging, and 'poetry install' will soon error otherwise. Also remove name and version fields as these are not required for non-packaging mode.	2024-08-06 13:53:23 +03:00
Arpad Müller	a31c95cb40	storage_scrubber: migrate scan_safekeeper_metadata to remote_storage (#8595 ) Migrates the safekeeper-specific parts of `ScanMetadata` to GenericRemoteStorage, making it Azure-ready. Part of https://github.com/neondatabase/neon/issues/7547	2024-08-06 10:51:39 +00:00
Joonas Koivunen	dc7eb5ae5a	chore: bump index part version (#8611 ) #8600 missed the hunk changing index_part.json informative version. Include it in this PR, in addition add more non-warning index_part.json versions to scrubber.	2024-08-06 11:45:41 +01:00
Vlad Lazar	44fedfd6c3	pageserver: remove legacy read path (#8601 ) ## Problem We have been maintaining two read paths (legacy and vectored) for a while now. The legacy read-path was only used for cross validation in some tests. ## Summary of changes * Tweak all tests that were using the legacy read path to use the vectored read path instead * Remove the read path dispatching based on the pageserver configs * Remove the legacy read path code We will be able to remove the single blob io code in `pageserver/src/tenant/blob_io.rs` when https://github.com/neondatabase/neon/issues/7386 is complete. Closes https://github.com/neondatabase/neon/issues/8005	2024-08-06 10:14:01 +01:00
Joonas Koivunen	138f008bab	feat: persistent gc blocking (#8600 ) Currently, we do not have facilities to persistently block GC on a tenant for whatever reason. We could do a tenant configuration update, but that is risky for generation numbers and would also be transient. Introduce a `gc_block` facility in the tenant, which manages per timeline blocking reasons. Additionally, add HTTP endpoints for enabling/disabling manual gc blocking for a specific timeline. For debugging, individual tenant status now includes a similar string representation logged when GC is skipped. Cc: #6994	2024-08-06 10:09:56 +01:00
Joonas Koivunen	6a6f30e378	fix: make Timeline::set_disk_consistent_lsn use fetch_max (#8311 ) now it is safe to use from multiple callers, as we have two callers.	2024-08-06 08:52:01 +01:00
Alex Chi Z.	8f3bc5ae35	feat(pageserver): support dry-run for gc-compaction, add statistics (#8557 ) Add dry-run mode that does not produce any image layer + delta layer. I will use this code to do some experiments and see how much space we can reclaim for tenants on staging. Part of https://github.com/neondatabase/neon/issues/8002 * Add dry-run mode that runs the full compaction process without updating the layer map. (We never call finish on the writers and the files will be removed before exiting the function). * Add compaction statistics and print them at the end of compaction. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-08-06 02:07:48 +00:00
Alexander Bayandin	e6e578821b	CI(benchmarking): set pub/sub projects for LR tests (#8483 ) ## Problem > Currently, long-running LR tests recreate endpoints every night. We'd like to have along-running buildup of history to exercise the pageserver in this case (instead of "unit-testing" the same behavior everynight). Closes #8317 ## Summary of changes - Update Postgres version for replication tests - Set `BENCHMARK_PROJECT_ID_PUB`/`BENCHMARK_PROJECT_ID_SUB` env vars to projects that were created for this purpose --------- Co-authored-by: Sasha Krassovsky <krassovskysasha@gmail.com>	2024-08-05 22:06:47 +00:00
Joonas Koivunen	c32807ac19	fix: allow awaiting logical size for root timelines (#8604 ) Currently if `GET /v1/tenant/x/timeline/y?force-await-initial-logical-size=true` is requested for a root timeline created within the current pageserver session, the request handler panics hitting the debug assertion. These timelines will always have an accurate (at initdb import) calculated logical size. Fix is to never attempt prioritizing timeline size calculation if we already have an exact value. Split off from #8528.	2024-08-05 21:21:33 +01:00
Alexander Bayandin	50daff9655	CI(trigger-e2e-tests): fix deadlock with Build and Test workflow (#8606 ) ## Problem In some cases, a deadlock between `build-and-test` and `trigger-e2e-tests` workflows can happen: ``` Build and Test Canceling since a deadlock for concurrency group 'Build and Test-8600/merge-anysha' was detected between 'top level workflow' and 'trigger-e2e-tests' ``` I don't understand the reason completely, probably `${{ github.workflow }}` got evaluated to the same value and somehow caused the issue. We don't need to limit concurrency for `trigger-e2e-tests` workflow. See https://neondb.slack.com/archives/C059ZC138NR/p1722869486708179?thread_ts=1722869027.960029&cid=C059ZC138NR	2024-08-05 19:47:59 +01:00
Alexander Bayandin	bd845c7587	CI(trigger-e2e-tests): wait for promote-images job from the last commit (#8592 ) ## Problem We don't trigger e2e tests for draft PRs, but we do trigger them once a PR is in the "Ready for review" state. Sometimes, a PR can be marked as "Ready for review" before we finish image building. In such cases, triggering e2e tests fails. ## Summary of changes - Make `trigger-e2e-tests` job poll status of `promote-images` job from the build-and-test workflow for the last commit. And trigger only if the status is `success` - Remove explicit image checking from the workflow - Add `concurrency` for `triggere-e2e-tests` workflow to make it possible to cancel jobs in progress (if PR moves from "Draft" to "Ready for review" several times in a row)	2024-08-05 12:25:23 +01:00
Konstantin Knizhnik	f63c8e5a8c	Update Postgres versions to use smgrexists() instead of access() to check if Oid is used (#8597 ) ## Problem PR #7992 was merged without correspondent changes in Postgres submodules and this is why test_oid_overflow.py is failed now. ## Summary of changes Bump Postgres versions ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-08-05 14:24:54 +03:00
Alex Chi Z.	200fa56b04	feat(pageserver): support split delta layers (#8599 ) part of https://github.com/neondatabase/neon/issues/8002 Similar to https://github.com/neondatabase/neon/pull/8574, we add auto-split support for delta layers. Tests are reused from image layer split writers. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-08-05 10:30:49 +00:00
dotdister	0f3dac265b	safekeeper: remove unused partial_backup_enabled option (#8547 ) ## Problem There is an unused safekeeper option `partial_backup_enabled`. `partial_backup_enabled` was implemented in #6530, but this option was always turned into enabled in #8022. If you intended to keep this option for a specific reason, I will close this PR. ## Summary of changes I removed an unused safekeeper option `partial_backup_enabled`.	2024-08-05 09:23:59 +02:00
Alex Chi Z.	1dc496a2c9	feat(pageserver): support auto split layers based on size (#8574 ) part of https://github.com/neondatabase/neon/issues/8002 ## Summary of changes Add a `SplitImageWriter` that automatically splits image layer based on estimated target image layer size. This does not consider compression and we might need a better metrics. --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2024-08-05 06:55:36 +01:00
Alex Chi Z.	6814bdd30b	fix(pageserver): deadlock in gc-compaction (#8590 ) We need both compaction and gc lock for gc-compaction. The lock order should be the same everywhere, otherwise there could be a deadlock where A waits for B and B waits for A. We also had a double-lock issue. The compaction lock gets acquired in the outer `compact` function. Note that the unit tests directly call `compact_with_gc`, and therefore not triggering the issue. ## Summary of changes Ensure all places acquire compact lock and then gc lock. Remove an extra compact lock acqusition. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-08-03 00:52:04 +01:00
John Spray	0a667bc8ef	tests: add test_historic_storage_formats (#8423 ) ## Problem Currently, our backward compatibility tests only look one release back. That means, for example, that when we switch on image layer compression by default, we'll test reading of uncompressed layers for one release, and then stop doing it. When we make an index_part.json format change, we'll test against the old format for a week, then stop (unless we write separate unit tests for each old format). The reality in the field is that data in old formats will continue to exist for weeks/months/years. When we make major format changes, we should retain examples of the old format data, and continuously verify that the latest code can still read them. This test uses contents from a new path in the public S3 bucket, `compatibility-data-snapshots/`. It is populated by hand. The first important artifact is one from before we switch on compression, so that we will keep testing reads of uncompressed data. We will generate more artifacts ahead of other key changes, like when we update remote storage format for archival timelines. Closes: https://github.com/neondatabase/cloud/issues/15576	2024-08-02 18:28:23 +01:00
Arthur Petukhovsky	f3acfb2d80	Improve safekeepers eviction rate limiting (#8456 ) This commit tries to fix regular load spikes on staging, caused by too many eviction and partial upload operations running at the same time. Usually it was hapenning after restart, for partial backup the load was delayed. - Add a semaphore for evictions (2 permits by default) - Rename `resident_since` to `evict_not_before` and smooth out the curve by using random duration - Use random duration in partial uploads as well related to https://github.com/neondatabase/neon/issues/6338 some discussion in https://neondb.slack.com/archives/C033RQ5SPDH/p1720601531744029	2024-08-02 15:26:46 +01:00
Arpad Müller	8c828c586e	Wait for completion of the upload queue in flush_frozen_layer (#8550 ) Makes `flush_frozen_layer` add a barrier to the upload queue and makes it wait for that barrier to be reached until it lets the flushing be completed. This gives us backpressure and ensures that writes can't build up in an unbounded fashion. Fixes #7317	2024-08-02 13:07:12 +02:00
John Spray	2334fed762	storage_controller: start adding chaos hooks (#7946 ) Chaos injection bridges the gap between automated testing (where we do lots of different things with small, short-lived tenants), and staging (where we do many fewer things, but with larger, long-lived tenants). This PR adds a first type of chaos which isn't really very chaotic: it's live migration of tenants between healthy pageservers. This nevertheless provides continuous checks that things like clean, prompt shutdown of tenants works for realistically deployed pageservers with realistically large tenants.	2024-08-02 09:37:44 +01:00
John Spray	c53799044d	pageserver: refine how we delete timelines after shard split (#8436 ) ## Problem Previously, when we do a timeline deletion, shards will delete layers that belong to an ancestor. That is not a correctness issue, because when we delete a timeline, we're always deleting it from all shards, and destroying data for that timeline is clearly fine. However, there exists a race where one shard might start doing this deletion while another shard has not yet received the deletion request, and might try to access an ancestral layer. This creates ambiguity over the "all layers referenced by my index should always exist" invariant, which is important to detecting and reporting corruption. Now that we have a GC mode for clearing up ancestral layers, we can rely on that to clean up such layers, and avoid deleting them right away. This makes things easier to reason about: there are now no cases where a shard will delete a layer that belongs to a ShardIndex other than itself. ## Summary of changes - Modify behavior of RemoteTimelineClient::delete_all - Add `test_scrubber_physical_gc_timeline_deletion` to exercise this case - Tweak AWS SDK config in the scrubber to enable retries. Motivated by seeing the test for this feature encounter some transient "service error" S3 errors (which are probably nothing to do with the changes in this PR)	2024-08-02 08:00:46 +01:00
Alexander Bayandin	e7477855b7	test_runner: don't create artifacts if Allure is not enabled (#8580 ) ## Problem `allure_attach_from_dir` method might create `tar.zst` archives even if `--alluredir` is not set (i.e. Allure results collection is disabled) ## Summary of changes - Don't run `allure_attach_from_dir` if `--alluredir` is not set	2024-08-01 15:55:43 +00:00
Alex Chi Z.	f4a668a27d	fix(pageserver): skip existing layers for btm-gc-compaction (#8498 ) part of https://github.com/neondatabase/neon/issues/8002 Due to the limitation of the current layer map implementation, we cannot directly replace a layer. It's interpreted as an insert and a deletion, and there will be file exist error when renaming the newly-created layer to replace the old layer. We work around that by changing the end key of the image layer. A long-term fix would involve a refactor around the layer file naming. For delta layers, we simply skip layers with the same key range produced, though it is possible to add an extra key as an alternative solution. * The image layer range for the layers generated from gc-compaction will be Key::MIN..(Key..MAX-1), to avoid being recognized as an L0 delta layer. * Skip existing layers if it turns out that we need to generate a layer with the same persistent key in the same generation. Note that it is possible that the newly-generated layer has different content from the existing layer. For example, when the user drops a retain_lsn, the compaction could have combined or dropped some records, therefore creating a smaller layer than the existing one. We discard the "optimized" layer for now because we cannot deal with such rewrites within the same generation. --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-08-01 15:00:06 +01:00
Alex Chi Z.	970f2923b2	storage-scrubber: log version on start (#8571 ) Helps us better identify which version of storage scrubber is running. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-08-01 13:52:34 +00:00
John Spray	1678dea20f	pageserver: add layer visibility calculation (#8511 ) ## Problem We recently added a "visibility" state to layers, but nothing initializes it. Part of: - #8398 ## Summary of changes - Add a dependency on `range-set-blaze`, which is used as a fast incrementally updated alternative to KeySpace. We could also use this to replace the internals of KeySpaceRandomAccum if we wanted to. Writing a type that does this kind of "BtreeMap & merge overlapping entries" thing isn't super complicated, but no reason to write this ourselves when there's a third party impl available. - Add a function to layermap to calculate visibilities for each layer - Add a function to Timeline to call into layermap and then apply these visibilities to the Layer objects. - Invoke the calculation during startup, after image layer creations, and when removing branches. Branch removal and image layer creation are the two ways that a layer can go from Visible to Covered. - Add unit test & benchmark for the visibility calculation - Expose `pageserver_visible_physical_size` metric, which should always be <= `pageserver_remote_physical_size`. - This metric will feed into the /v1/utilization endpoint later: the visible size indicates how much space we would like to use on this pageserver for this tenant. - When `pageserver_visible_physical_size` is greater than `pageserver_resident_physical_size`, this is a sign that the tenant has long-idle branches, which result in layers that are visible in principle, but not used in practice. This does not keep visibility hints up to date in all cases: particularly, when creating a child timeline, any previously covered layers will not get marked Visible until they are accessed. Updates after image layer creation could be implemented as more of a special case, but this would require more new code: the existing depth calculation code doesn't maintain+yield the list of deltas that would be covered by an image layer. ## Performance This operation is done rarely (at startup and at timeline deletion), so needs to be efficient but not ultra-fast. There is a new `visibility` bench that measures runtime for a synthetic 100k layers case (`sequential`) and a real layer map (`real_map`) with ~26k layers. The benchmark shows runtimes of single digit milliseconds (on a ryzen 7950). This confirms that the runtime shouldn't be a problem at startup (as we already incur S3-level latencies there), but that it's slow enough that we definitely shouldn't call it more often than necessary, and it may be worthwhile to optimize further later (things like: when removing a branch, only bother scanning layers below the branchpoint) ``` visibility/sequential time: [4.5087 ms 4.5894 ms 4.6775 ms] change: [+2.0826% +3.9097% +5.8995%] (p = 0.00 < 0.05) Performance has regressed. Found 24 outliers among 100 measurements (24.00%) 2 (2.00%) high mild 22 (22.00%) high severe min: 0/1696070, max: 93/1C0887F0 visibility/real_map time: [7.0796 ms 7.0832 ms 7.0871 ms] change: [+0.3900% +0.4505% +0.5164%] (p = 0.00 < 0.05) Change within noise threshold. Found 4 outliers among 100 measurements (4.00%) 3 (3.00%) high mild 1 (1.00%) high severe min: 0/1696070, max: 93/1C0887F0 visibility/real_map_many_branches time: [4.5285 ms 4.5355 ms 4.5434 ms] change: [-1.0012% -0.8004% -0.5969%] (p = 0.00 < 0.05) Change within noise threshold. ```	2024-08-01 09:25:35 +00:00
Arpad Müller	163f2eaf79	Reduce linux-raw-sys duplication (#8577 ) Before, we had four versions of linux-raw-sys in our dependency graph: ``` linux-raw-sys@0.1.4 linux-raw-sys@0.3.8 linux-raw-sys@0.4.13 linux-raw-sys@0.6.4 ``` now it's only two: ``` linux-raw-sys@0.4.13 linux-raw-sys@0.6.4 ``` The changes in this PR are minimal. In order to get to its state one only has to update procfs in Cargo.toml to 0.16 and do `cargo update -p tempfile -p is-terminal -p prometheus`.	2024-08-01 08:22:21 +00:00
Christian Schwarz	980d506bda	pageserver: shutdown all walredo managers 8s into shutdown (#8572 ) # Motivation The working theory for hung systemd during PS deploy (https://github.com/neondatabase/cloud/issues/11387) is that leftover walredo processes trigger a race condition. In https://github.com/neondatabase/neon/pull/8150 I arranged that a clean Tenant shutdown does actually kill its walredo processes. But many prod machines don't manage to shut down all their tenants until the 10s systemd timeout hits and, presumably, triggers the race condition in systemd / the Linux kernel that causes the frozen systemd # Solution This PR bolts on a rather ugly mechanism to shut down tenant managers out of order 8s after we've received the SIGTERM from systemd. # Changes - add a global registry of `Weak<WalRedoManager>` - add a special thread spawned during `shutdown_pageserver` that sleeps for 8s, then shuts down all redo managers in the registry and prevents new redo managers from being created - propagate the new failure mode of tenant spawning throughout the code base - make sure shut down tenant manager results in PageReconstructError::Cancelled so that if Timeline::get calls come in after the shutdown, they do the right thing	2024-08-01 07:57:09 +02:00
Alex Chi Z.	d6c79b77df	test(pageserver): add test_gc_feedback_with_snapshots (#8474 ) should be working after https://github.com/neondatabase/neon/pull/8328 gets merged. Part of https://github.com/neondatabase/neon/issues/8002 adds a new perf benchmark case that ensures garbages can be collected with branches --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-07-31 17:55:19 -04:00
Alexander Bayandin	3350daeb9a	CI(create-test-report): fix missing benchmark results in Allure report (#8540 ) ## Problem In https://github.com/neondatabase/neon/pull/8241 I've accidentally removed `create-test-report` dependency on `benchmarks` job ## Summary of changes - Run `create-test-report` after `benchmarks` job	2024-07-31 19:47:59 +01:00
Arpad Müller	939d50a41c	storage_scrubber: migrate FindGarbage to remote_storage (#8548 ) Uses the newly added APIs from #8541 named `stream_tenants_generic` and `stream_objects_with_retries` and extends them with `list_objects_with_retries_generic` and `stream_tenant_timelines_generic` to migrate the `find-garbage` command of the scrubber to `GenericRemoteStorage`. Part of https://github.com/neondatabase/neon/issues/7547	2024-07-31 18:24:42 +00:00
John Spray	2f9ada13c4	controller: simplify reconciler generation increment logic (#8560 ) ## Problem This code was confusing, untested and covered: - an impossible case, where intent state is AttacheStale (we never do this) - a rare edge case (going from AttachedMulti to Attached), which we were not testing, and in any case the pageserver internally does the same Tenant reset in this transition as it would do if we incremented generation. Closes: https://github.com/neondatabase/neon/issues/8367 ## Summary of changes - Simplify the logic to only skip incrementing the generation if the location already has the expected generation and the exact same mode.	2024-07-31 18:37:47 +01:00
Cihan Demirci	ff51b565d3	cicd: change Azure storage details [2/2] (#8562 ) Change Azure storage configuration to point to updated variables/secrets. Also update subscription id variable.	2024-07-31 17:42:10 +01:00
Tristan Partin	5e0409de95	Fix negative replication delay metric In some cases, we can get a negative metric for replication_delay_bytes. My best guess from all the research I've done is that we evaluate pg_last_wal_receive_lsn() before pg_last_wal_replay_lsn(), and that by the time everything is said and done, the replay LSN has advanced past the receive LSN. In this case, our lag can effectively be modeled as 0 due to the speed of the WAL reception and replay.	2024-07-31 10:16:58 -05:00
Christian Schwarz	4e3b70e308	refactor(page_service): Timeline gate guard holding + cancellation + shutdown (#8339 ) Since the introduction of sharding, the protocol handling loop in `handle_pagerequests` cannot know anymore which concrete `Tenant`/`Timeline` object any of the incoming `PagestreamFeMessage` resolves to. In fact, one message might resolve to one `Tenant`/`Timeline` while the next one may resolve to another one. To avoid going to tenant manager, we added the `shard_timelines` which acted as an ever-growing cache that held timeline gate guards open for the lifetime of the connection. The consequence of holding the gate guards open was that we had to be sensitive to every cached `Timeline::cancel` on each interaction with the network connection, so that Timeline shutdown would not have to wait for network connection interaction. We can do better than that, meaning more efficiency & better abstraction. I proposed a sketch for it in * https://github.com/neondatabase/neon/pull/8286 and this PR implements an evolution of that sketch. The main idea is is that `mod page_service` shall be solely concerned with the following: 1. receiving requests by speaking the protocol / pagestream subprotocol 2. dispatching the request to a corresponding method on the correct shard/`Timeline` object 3. sending response by speaking the protocol / pagestream subprotocol. The cancellation sensitivity responsibilities are clear cut: * while in `page_service` code, sensitivity to page_service cancellation is sufficient * while in `Timeline` code, sensitivity to `Timeline::cancel` is sufficient To enforce these responsibilities, we introduce the notion of a `timeline::handle::Handle` to a `Timeline` object that is checked out from a `timeline::handle::Cache` for each request. The `Handle` derefs to `Timeline` and is supposed to be used for a single async method invocation on `Timeline`. See the lengthy doc comment in `mod handle` for details of the design.	2024-07-31 17:05:45 +02:00
Alex Chi Z.	61a65f61f3	feat(pageserver): support btm-gc-compaction for child branches (#8519 ) part of https://github.com/neondatabase/neon/issues/8002 For child branches, we will pull the image of the modified keys from the parant into the child branch, which creates a full history for generating key retention. If there are not enough delta keys, the image won't be wrote eventually, and we will only keep the deltas inside the child branch. We could avoid the wasteful work to pull the image from the parent if we can know the number of deltas in advance, in the future (currently we always pull image for all modified keys in the child branch) --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-07-31 15:48:48 +01:00
Alexander Bayandin	d21246c8bd	CI(regress-tests): run less regression tests (#8561 ) ## Problem We run regression tests on `release` & `debug` builds for each of the three supported Postgres versions (6 in total). With upcoming ARM support and Postgres 17, the number of jobs will jump to 16, which is a lot. See the internal discussion here: https://neondb.slack.com/archives/C033A2WE6BZ/p1722365908404329 ## Summary of changes - Run `regress-tests` job in debug builds only with the latest Postgres version - Do not do `debug` builds on release branches	2024-07-31 15:10:27 +01:00
Christian Schwarz	4825b0fec3	compaction_level0_phase1: bypass PS PageCache for data blocks (#8543 ) part of https://github.com/neondatabase/neon/issues/8184 # Problem We want to bypass PS PageCache for all data block reads, but `compact_level0_phase1` currently uses `ValueRef::load` to load the WAL records from delta layers. Internally, that maps to `FileBlockReader:read_blk` which hits the PageCache [here](`e78341e1c2/pageserver/src/tenant/block_io.rs (L229-L236)`). # Solution This PR adds a mode for `compact_level0_phase1` that uses the `MergeIterator` for reading the `Value`s from the delta layer files. `MergeIterator` is a streaming k-merge that uses vectored blob_io under the hood, which bypasses the PS PageCache for data blocks. Other notable changes: * change the `DiskBtreeReader::into_stream` to buffer the node, instead of holding a `PageCache` `PageReadGuard`. * Without this, we run out of page cache slots in `test_pageserver_compaction_smoke`. * Generally, `PageReadGuard`s aren't supposed to be held across await points, so, this is a general bugfix. # Testing / Validation / Performance `MergeIterator` has not yet been used in production; it's being developed as part of * https://github.com/neondatabase/neon/issues/8002 Therefore, this PR adds a validation mode that compares the existing approach's value iterator with the new approach's stream output, item by item. If they're not identical, we log a warning / fail the unit/regression test. To avoid flooding the logs, we apply a global rate limit of once per 10 seconds. In any case, we use the existing approach's value. Expected performance impact that will be monitored in staging / nightly benchmarks / eventually pre-prod: * with validation: * increased CPU usage * ~doubled VirtualFile read bytes/second metric * no change in disk IO usage because the kernel page cache will likely have the pages buffered on the second read * without validation: * slightly higher DRAM usage because each iterator participating in the k-merge has a dedicated buffer (as opposed to before, where compactions would rely on the PS PageCaceh as a shared evicting buffer) * less disk IO if previously there were repeat PageCache misses (likely case on a busy production Pageserver) * lower CPU usage: PageCache out of the picture, fewer syscalls are made (vectored blob io batches reads) # Rollout The new code is used with validation mode enabled-by-default. This gets us validation everywhere by default, specifically in - Rust unit tests - Python tests - Nightly pagebench (shouldn't really matter) - Staging Before the next release, I'll merge the following aws.git PR that configures prod to continue using the existing behavior: * https://github.com/neondatabase/aws/pull/1663 # Interactions With Other Features This work & rollout should complete before Direct IO is enabled because Direct IO would double the IOPS & latency for each compaction read (#8240). # Future Work The streaming k-merge's memory usage is proportional to the amount of memory per participating layer. But `compact_level0_phase1` still loads all keys into memory for `all_keys_iter`. Thus, it continues to have active memory usage proportional to the number of keys involved in the compaction. Future work should replace `all_keys_iter` with a streaming keys iterator. This PR has a draft in its first commit, which I later reverted because it's not necessary to achieve the goal of this PR / issue #8184.	2024-07-31 14:17:59 +02:00
Cihan Demirci	a4df3c8488	cicd: change Azure storage details [1/2] (#8553 ) Change Azure storage configuration to point to new variables/secrets. They have the `_NEW` suffix in order not to disrupt any tests while we complete the switch.	2024-07-30 19:34:15 +00:00
Christian Schwarz	d95b46f3f3	cleanup(compact_level0_phase1): some commentary and wrapping into block expressions (#8544 ) Byproduct of scouting done for https://github.com/neondatabase/neon/issues/8184 refs https://github.com/neondatabase/neon/issues/8184	2024-07-30 18:13:18 +02:00
Yuchen Liang	85bef9f05d	feat(scrubber): post `scan_metadata` results to storage controller (#8502 ) Part of #8128, followup to #8480. closes #8421. Enable scrubber to optionally post metadata scan health results to storage controller. Signed-off-by: Yuchen Liang <yuchen@neon.tech>	2024-07-30 16:07:34 +01:00

1 2 3 4 5 ...

5799 Commits