
39 Commits

Author SHA1 Message Date
Christian Schwarz
140ef67dd8 refactor: replace LD_PRELOAD with a baked-in solution 2023-03-30 15:51:26 +02:00
Christian Schwarz
a698ddb8a4 fix: avoid needless timeline.clone() 2023-03-29 10:58:59 +02:00
Christian Schwarz
57d215e6bb fix: suggestions committed from GitHub web didn't compile 2023-03-29 10:55:57 +02:00
Christian Schwarz
83813f2cb1 fix: remove unneeded clippy allow 2023-03-29 10:55:45 +02:00
Christian Schwarz
9a55e4f909 fix: structured logging of tenant_id
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-03-29 10:51:14 +02:00
Christian Schwarz
0b9a44a879 fix: structured logging of tenant_id
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-03-29 10:50:36 +02:00
Christian Schwarz
a6f9ebf178 fix: repeat tenant_id in debug message 2023-03-29 10:49:06 +02:00
Christian Schwarz
370b3637db doc: add explainer to debug_assert 2023-03-29 10:47:15 +02:00
Christian Schwarz
d6c2867b46 doc: add debug_assert for self-documenting candidates.sort_unstable_by_key() 2023-03-28 19:16:26 +02:00
Christian Schwarz
386c2d0112 refactor: go back to a single list
The MinResidentSizePartition is effectively what `overage` was earlier,
but more expressive and outside of EvictionCandidates.

So switch the code back to a single list,
but use (MinResidentSizePartition, EvictionCandidates) tuples.

That eliminates the need for iter_in_eviction_order() altogether.

It consumes 8 bytes more memory per candidate, but that doesn't matter
for now.
2023-03-28 18:25:44 +02:00
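
The single-list approach described in 386c2d0112 can be sketched roughly as follows. This is a minimal illustration, not the pageserver code: the variant names `Above` and `Below`, the fields of `EvictionCandidate`, and the helper `sort_for_eviction` are assumptions made for the example. The idea is that one sort over `(partition, candidate)` tuples yields the eviction order, so no separate iterator over two lists is needed.

```rust
/// Whether a layer lies above or below its tenant's min_resident_size
/// reservation. Variant names are hypothetical. The derived `Ord` follows
/// declaration order, so `Above` sorts before `Below`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum MinResidentSizePartition {
    /// Not covered by the reservation: evicted first.
    Above,
    /// Covered by the reservation: evicted only as a fallback.
    Below,
}

/// Hypothetical candidate record for this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct EvictionCandidate {
    pub tenant: &'static str,
    pub layer: &'static str,
    /// Logical timestamp of the last access; lower means older.
    pub last_access: u64,
}

/// Sort one list so that unreserved layers come first, and within each
/// partition the least recently used layer comes first.
pub fn sort_for_eviction(candidates: &mut [(MinResidentSizePartition, EvictionCandidate)]) {
    candidates.sort_unstable_by_key(|(partition, c)| (*partition, c.last_access));
}
```

With this ordering, walking the list front to back first drains the layers outside every tenant's reservation and only then touches reserved layers.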
Christian Schwarz
704d4f4640 doc: improve comment on min_resident_size 2023-03-28 18:06:44 +02:00
Christian Schwarz
dc72a9534e doc: update doc comment for collect_eviction_candidates
And move the impl of MinResidentSizePartitionedCandidates
below it because it makes sense when reading the code top-down.
2023-03-28 18:05:31 +02:00
Christian Schwarz
07c44f9151 doc: hint that usage_assumed is modified in the loop 2023-03-28 17:47:11 +02:00
Christian Schwarz
0c10e6d3e7 feat: demote info logs to debug
These would be per tenant, we don't want to emit thousands of log lines
when this code runs.
2023-03-28 17:36:59 +02:00
Christian Schwarz
85becb148f feat: bring back min_resident=max(all layers) behavior 2023-03-28 17:36:59 +02:00
Christian Schwarz
ea3c76a9d6 refactor: instead of 'overage', have two separate lists 2023-03-28 17:36:59 +02:00
Christian Schwarz
799576ab1e Merge branch 'problame/disk-usage-eviction' into heikki/disk-usage-eviction 2023-03-28 15:24:52 +02:00
Christian Schwarz
0428b6822a doc: more comment on lru_candidates to address questions from review 2023-03-28 11:53:48 +02:00
Christian Schwarz
42d63270a5 doc: add comment on extend_lru_candidates 2023-03-28 11:44:10 +02:00
Christian Schwarz
0056108c45 doc: remove stray comment 2023-03-28 11:44:10 +02:00
Christian Schwarz
54cc1d5064 doc: sub-headings for mechanics & policy in module comment 2023-03-28 11:44:10 +02:00
Joonas Koivunen
70c837a4b2 refactor: simplify max_layer_size as u64 2023-03-28 12:44:01 +03:00
Christian Schwarz
38d3061143 add comment on not needing to be 100% accurate about max_layer_size 2023-03-28 12:44:01 +03:00
Joonas Koivunen
75759f709f doc: explain "bug" message, log layers 2023-03-28 12:44:01 +03:00
Joonas Koivunen
03ab5df081 chore: remove dead code 2023-03-28 12:44:01 +03:00
Joonas Koivunen
6e8d7b449f refactor: combine nested matches 2023-03-28 12:44:01 +03:00
Joonas Koivunen
17b5c8d1c4 refactor: get rid of ApproxAccurate 2023-03-28 12:44:01 +03:00
Joonas Koivunen
0943dd30eb chore: clippy 2023-03-28 12:44:01 +03:00
Joonas Koivunen
b599755042 refactor: rename DiskUsageEvictionState => State 2023-03-28 12:44:01 +03:00
Joonas Koivunen
244185e6e6 doc: comment changes
Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-03-28 12:44:01 +03:00
Joonas Koivunen
0a5043fae5 refactor: fewer static mutexes 2023-03-28 12:44:01 +03:00
Heikki Linnakangas
041b708dc6 rustfmt 2023-03-28 11:08:35 +03:00
Heikki Linnakangas
b6b8265450 Rewrite to make the algorithm more understandable (I hope).
The algorithm is the same (with two small exceptions), but rewrite the
way it's implemented to make it easier to follow.

The exceptions:
1. 'min_resident_size' now protects at least that much data in the first
   "respectful" phase of the algorithm. Previously, it would evict layers
   until the resident size fell below min_resident_size. In other words,
   we now protect one more layer of each tenant, so that the resident
   size stays just above min_resident_size, while previously we would
   evict enough to bring the resident size just under min_resident_size.

2. Previously, the "max layer size" that's used as the default
   min_resident_size was calculated from *all* layers in the tenant,
   including remote layers. Now it's only calculated across all
   locally-present layers. I don't know if that was a deliberate choice,
   but this is slightly simpler.
2023-03-28 01:28:27 +03:00
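
The first exception in b6b8265450 can be illustrated with a small sketch for a single tenant. Both function names are hypothetical, and the sketch assumes that the respectful phase evicts as much as it is allowed to, oldest layer first. It only contrasts the two stopping conditions and is not the actual implementation.

```rust
/// Old behavior (as described in the commit message): keep evicting the
/// oldest layer until the resident size has fallen below min_resident_size.
/// Returns the resident size left after the respectful phase.
pub fn resident_after_old(sizes_oldest_first: &[u64], min_resident_size: u64) -> u64 {
    let mut resident: u64 = sizes_oldest_first.iter().sum();
    for size in sizes_oldest_first {
        if resident < min_resident_size {
            break;
        }
        resident -= size;
    }
    resident
}

/// New behavior: stop before an eviction would take the resident size
/// below min_resident_size, so at least that much data stays protected.
pub fn resident_after_new(sizes_oldest_first: &[u64], min_resident_size: u64) -> u64 {
    let mut resident: u64 = sizes_oldest_first.iter().sum();
    for size in sizes_oldest_first {
        // `resident` is the sum of the layers not yet evicted, so it is
        // always >= `size` here and the subtraction cannot underflow.
        if resident - size < min_resident_size {
            break;
        }
        resident -= size;
    }
    resident
}
```

For three layers of 10 bytes each and a `min_resident_size` of 15, the old rule leaves 10 bytes resident (just under the limit) while the new rule leaves 20 bytes (just above it), which is the "one more layer" the message refers to.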
Christian Schwarz
0462563d31 doc: more explanation of the added config knobs 2023-03-27 19:52:48 +02:00
Christian Schwarz
ef39b3f067 rename: dedupe comment about allow(dead_code) 2023-03-27 19:27:41 +02:00
Christian Schwarz
504256a0cc rename: serde_percent::{Value => Percent} 2023-03-27 18:54:32 +02:00
Joonas Koivunen
e2b7e2962e fixup: missing .get() as u64 2023-03-27 19:39:24 +03:00
Christian Schwarz
9d78e98467 refactor: move the percentage value deserialization to newtype in utils 2023-03-27 18:27:29 +02:00
Christian Schwarz
5aef192bf2 disk-usage-based layer eviction
This patch adds a pageserver-global background loop that evicts layers
in response to a shortage of available bytes in the $repo/tenants
directory's filesystem.

The loop runs periodically at a configurable `period`.

Each loop iteration uses `statvfs` to determine filesystem-level space
usage.  It compares the returned usage data against two different types
of thresholds. The iteration tries to evict layers until app-internal
accounting says we should be below the thresholds.  We cross-check this
internal accounting with the real world by making another `statvfs` at
the end of the iteration.  We're good if that second statvfs shows that
we're _actually_ below the configured thresholds.  If we're still above
one or more thresholds, we emit a warning log message, leaving it to the
operator to investigate further.

There are two thresholds: `max_usage_pct` is the maximum relative used
space, expressed in percent of the total filesystem space. If the actual
usage is higher, the threshold is exceeded.  `min_avail_bytes` is the
minimum absolute available space in bytes. If the actual available space
is lower, the threshold is exceeded.
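
A minimal sketch of the two threshold checks, assuming `max_usage_pct` bounds the used share of the filesystem and `min_avail_bytes` bounds the available bytes. The `Usage` and `Thresholds` structs and the method names are hypothetical. In the real loop the numbers would come from `statvfs`, which this example deliberately leaves out.

```rust
/// Filesystem usage as a statvfs-style query might report it
/// (hypothetical struct for this sketch).
#[derive(Debug, Clone, Copy)]
pub struct Usage {
    pub total_bytes: u64,
    pub avail_bytes: u64,
}

/// The two configured thresholds named in the commit message.
#[derive(Debug, Clone, Copy)]
pub struct Thresholds {
    pub max_usage_pct: u8,
    pub min_avail_bytes: u64,
}

impl Usage {
    /// Used space in percent of the total, rounded down.
    pub fn usage_pct(&self) -> u64 {
        if self.total_bytes == 0 {
            return 0;
        }
        let used = self.total_bytes.saturating_sub(self.avail_bytes);
        // Widen to u128 so that `used * 100` cannot overflow.
        (used as u128 * 100 / self.total_bytes as u128) as u64
    }

    /// True if at least one of the two thresholds is exceeded.
    pub fn has_pressure(&self, t: &Thresholds) -> bool {
        self.usage_pct() > u64::from(t.max_usage_pct)
            || self.avail_bytes < t.min_avail_bytes
    }
}
```

Either condition alone counts as pressure, so a large, mostly full disk can trigger on the percentage while a small disk can trigger on the absolute byte count.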

The iteration evicts layers in LRU fashion with a reservation of up to
`min_resident_size` bytes of the most recent layers per tenant.
The layers not part of the per-tenant reservation are evicted
least-recently-used first until we're below all thresholds.
If the above doesn't relieve enough pressure, we fall back to Global LRU.
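
The eviction order described above might look roughly like this. It is a sketch under stated assumptions, not the pageserver implementation: `Layer`, `plan_eviction`, and the `bytes_to_free` parameter are hypothetical, timestamps are plain integers, and the reservation is taken newest-first and stops at the first layer that would push it past `min_resident_size`.

```rust
use std::cmp::Reverse;
use std::collections::HashMap;

/// Hypothetical layer record for this sketch.
#[derive(Debug, Clone)]
pub struct Layer {
    pub tenant: u32,
    pub name: &'static str,
    pub size: u64,
    /// Logical timestamp of the last access; lower means older.
    pub last_access: u64,
}

/// Return layer names in the order they would be evicted to free at least
/// `bytes_to_free` bytes (or everything, if that is not enough).
pub fn plan_eviction(
    layers: &[Layer],
    min_resident_size: u64,
    bytes_to_free: u64,
) -> Vec<&'static str> {
    let mut by_tenant: HashMap<u32, Vec<&Layer>> = HashMap::new();
    for layer in layers {
        by_tenant.entry(layer.tenant).or_default().push(layer);
    }

    let mut unreserved: Vec<&Layer> = Vec::new();
    let mut reserved: Vec<&Layer> = Vec::new();
    for tenant_layers in by_tenant.values_mut() {
        // Walk each tenant's layers newest-first and reserve up to
        // min_resident_size bytes of them.
        tenant_layers.sort_unstable_by_key(|l| Reverse(l.last_access));
        let mut reserved_bytes = 0u64;
        let mut reserving = true;
        for &layer in tenant_layers.iter() {
            if reserving && reserved_bytes + layer.size <= min_resident_size {
                reserved_bytes += layer.size;
                reserved.push(layer);
            } else {
                reserving = false;
                unreserved.push(layer);
            }
        }
    }

    // Least recently used first; the name breaks ties deterministically.
    unreserved.sort_unstable_by_key(|l| (l.last_access, l.name));
    reserved.sort_unstable_by_key(|l| (l.last_access, l.name));

    // Phase 1 drains unreserved layers; the global LRU fallback then
    // continues into the reserved ones if pressure is not yet relieved.
    let mut freed = 0u64;
    let mut plan = Vec::new();
    for layer in unreserved.into_iter().chain(reserved) {
        if freed >= bytes_to_free {
            break;
        }
        freed += layer.size;
        plan.push(layer.name);
    }
    plan
}
```

The reserved layers only appear in the plan once every unreserved layer has been chosen, which mirrors the fallback to global LRU.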

In addition to the loop, there is also an HTTP endpoint to perform
one loop iteration synchronous to the request.
The endpoint takes an absolute number of bytes that the iteration
needs to evict before pressure is relieved.
The tests use this endpoint, which is a great simplification over
setting up loopback-mounts in the tests, which would be required to
test the statvfs part of the implementation.
We will rely on manual testing in staging to test the statvfs parts.

The HTTP endpoint is also handy in emergencies where an operator wants
the pageserver to evict a given amount of space _now_.
Hence, its arguments are documented in openapi_spec.yml.
The response type isn't documented, though, because we don't consider
it stable. The endpoint should _not_ be used by Console.

Co-authored-by: Joonas Koivunen <joonas@neon.tech>

fixes https://github.com/neondatabase/neon/issues/3728
2023-03-20 16:51:09 +01:00