From 49b63570c84a9cd25e0476aca0b65da6e326ce13 Mon Sep 17 00:00:00 2001
From: Christian Schwarz <christian@neon.tech>
Date: Tue, 28 Nov 2023 15:03:28 +0000
Subject: [PATCH] idea: concurrency-limit initial logical size calculation

Before this patch, there was no concurrency limit on initial logical
size computations.

In an experiment with a PS with 20k tenants, 1 timeline each,
all tenants inactive in SKs / not present in storage broker,
all logical size calculations are spawned by MetricsCollection,
i.e., consumption metrics worker.

Before this patch, these timelines would all do their initial logical
size calculation in parallel, leading to extreme thrashing in page cache
and virtual file cache.

With this patch, the virtual file cache thrashing is reduced
signficantly (from 80k `open`-system-calls/second to ~500
`open`-system-calls/second during loading).

This patch uses the existing background tasks semaphore to limit
concurrency, which generally is the right call for background activity.
However, due to logical size's involvement in PageserverFeedback towards
safekeepers, I think we need a priority-boosting mechanism, e.g., if
we're still calculating but walreceiver is actively asking, skip the
semaphore. That's fairly easy to implement, but, want to some feedback
on the general idea first before implementing it.
See also the FIXME in the block comment added in this commit.

NB: when evaluating, keep in mind that consumption metrics worker
persists its interval across restarts; delete the state file on disk
to get predictable (and I believe worst-case in terms of concurrency
during PS restart) behavior.
---
 pageserver/src/tenant/tasks.rs    |  1 +
 pageserver/src/tenant/timeline.rs | 23 +++++++++++++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/pageserver/src/tenant/tasks.rs b/pageserver/src/tenant/tasks.rs
index 860bb255ca..f572ef1e60 100644
--- a/pageserver/src/tenant/tasks.rs
+++ b/pageserver/src/tenant/tasks.rs
@@ -44,6 +44,7 @@ pub(crate) enum BackgroundLoopKind {
     Eviction,
     ConsumptionMetricsCollectMetrics,
     ConsumptionMetricsSyntheticSizeWorker,
+    InitialLogicalSizeCalculation,
 }
 
 impl BackgroundLoopKind {
diff --git a/pageserver/src/tenant/timeline.rs b/pageserver/src/tenant/timeline.rs
index 9493ed1c9a..1285c53cbc 100644
--- a/pageserver/src/tenant/timeline.rs
+++ b/pageserver/src/tenant/timeline.rs
@@ -1822,6 +1822,29 @@ impl Timeline {
                 // delay will be terminated by a timeout regardless.
                 let _completion = { self_clone.initial_logical_size_attempt.lock().expect("unexpected initial_logical_size_attempt poisoned").take() };
 
+                // In prod, initial logical size calucalation is spawned either by
+                // WalReceiverConnectionHandler if the timeline is active according to storage broker,
+                // or by the first consumption metrics worker (MetricsCollection).
+                // The latter runs every `metric_collection_interval` and checkpoints to disk, i.e.,
+                // if pageserver gets restarted, the consumption metrics worker will resume waiting
+                // for the correct remaining time, as if the pageserver had not been restarted.
+                //
+                // FIXME: with the current code, walreceiver requests would also hit this semaphore
+                // and get queued behind other background operations. That's bad because walreceiver_connection
+                // will push the not-precise value as `current_timeline_size` in the `PageserverFeedback`
+                // while this calculation is stuck.
+                // We need some way to priority-boost the initial size calculation if walreceiver is asking.
+                // Or, should we maybe revisit the use of logical size in `PageserverFeedback`?
+                // It seems broken the way it is.
+                //
+                // Example query to show different causes of initial size calculation spawning:
+                //
+                // https://neonprod.grafana.net/explore?panes=%7B%22wSx%22:%7B%22datasource%22:%22grafanacloud-logs%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22sum%20by%20%28task_kind%29%20%28count_over_time%28%7Bneon_service%3D%5C%22pageserver%5C%22,%20neon_region%3D%5C%22us-west-2%5C%22%7D%20%7C%3D%20%60logical%20size%20computation%20from%20context%20of%20task%20kind%60%20%7C%20regexp%20%60logical%20size%20computation%20from%20context%20of%20task%20kind%20%28%3FP%3Ctask_kind%3E.%2A%29%60%20%5B1m%5D%29%29%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22grafanacloud-logs%22%7D,%22editorMode%22:%22code%22,%22step%22:%221m%22%7D%5D,%22range%22:%7B%22from%22:%221700637500615%22,%22to%22:%221700639648743%22%7D%7D%7D&schemaVersion=1&orgId=1
+                let _permit = match crate::tenant::tasks::concurrent_background_tasks_rate_limit(BackgroundLoopKind::InitialLogicalSizeCalculation,&background_ctx, &cancel).await {
+                    Ok(permit) => permit,
+                    Err(RateLimitError::Cancelled) => return Ok(()),
+                };
+
                 let calculated_size = match self_clone
                     .logical_size_calculation_task(lsn, LogicalSizeCalculationCause::Initial, &background_ctx)
                     .await