The MinResidentSizePartition is effectively what `overage` was earlier,
but more expressive and outside of EvictionCandidates.
So switch the code back to a single list,
but use (MinResidentSizePartition, EvictionCandidates) tuples.
That eliminates the need for iter_in_eviction_order() altogether.
It consumes 8 bytes more memory per candidate, but that doesn't matter
for now.
The algorithm is the same (with two small exceptions), but rewrite the
way it's implemented to make it easier to follow.
The exceptions:
1. 'min_resident_size' now protects at least that much data in the first
"respectful" phase of the algorithm. Previously, it would evict layers
until the resident size fell below min_resident_size. In other words,
   we now protect one more layer of each tenant, so that the resident
size stays just above min_resident_size, while previously we would
evict enough to bring the resident size just under min_resident_size.
2. Previously, the "max layer size" that's used as the default
min_resident_size was calculated from *all* layers in the tenant,
including remote layers. Now it's only calculated across all
locally-present layers. I don't know if that was a deliberate choice,
but this is slightly simpler.
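As a rough illustration of the tuple approach and of exception 1, here is a
minimal sketch. The type and field names below (`EvictionCandidate`,
`last_activity_ts`, `partition_tenant`, the `Above`/`Below` variants) are
assumptions made for this example and are not taken from the patch.

```rust
/// Hypothetical stand-in for the partition tag described above.
/// Derived `Ord` follows declaration order, so `Above` sorts before `Below`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum MinResidentSizePartition {
    Above, // outside the reservation: evictable in the "respectful" phase
    Below, // inside the tenant's min_resident_size reservation
}

/// Hypothetical candidate; field order matters for the derived `Ord`.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
struct EvictionCandidate {
    last_activity_ts: u64, // smaller means less recently used
    tenant: u32,
    file_size: u64,
}

/// Tag one tenant's resident layers with their partition.
fn partition_tenant(
    mut layers: Vec<EvictionCandidate>,
    min_resident_size: u64,
) -> Vec<(MinResidentSizePartition, EvictionCandidate)> {
    // Walk the layers most-recently-used first.
    layers.sort_by(|a, b| b.last_activity_ts.cmp(&a.last_activity_ts));
    let mut protected = 0u64;
    layers
        .into_iter()
        .map(|layer| {
            // Keep protecting layers until at least min_resident_size
            // bytes are covered, so the reservation ends just above it.
            let partition = if protected < min_resident_size {
                MinResidentSizePartition::Below
            } else {
                MinResidentSizePartition::Above
            };
            protected += layer.file_size;
            (partition, layer)
        })
        .collect()
}
```

Sorting the combined list of tuples from all tenants then yields the eviction
order directly: `Above` entries first, least recently used first within each
partition.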
This patch adds a pageserver-global background loop that evicts layers
in response to a shortage of available bytes in the $repo/tenants
directory's filesystem.
The loop runs periodically at a configurable `period`.
Each loop iteration uses `statvfs` to determine filesystem-level space
usage. It compares the returned usage data against two different types
of thresholds. The iteration tries to evict layers until app-internal
accounting says we should be below the thresholds. We cross-check this
internal accounting with the real world by making another `statvfs` at
the end of the iteration. We're good if that second statvfs shows that
we're _actually_ below the configured thresholds. If we're still above
one or more thresholds, we emit a warning log message, leaving it to the
operator to investigate further.
There are two thresholds: `max_usage_pct` is the relative used
space, expressed in percent of the total filesystem space. If the actual
usage is higher, the threshold is exceeded. `min_avail_bytes` is the
absolute available space in bytes. If the actual available space is
lower, the threshold is exceeded.
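The threshold check can be sketched as follows. This is an illustration only:
the `Usage` and `Thresholds` structs and both function names are hypothetical,
the `statvfs` call itself is left out, and the rounding of the percentage
(truncating here) is an assumption.

```rust
/// Hypothetical snapshot of filesystem usage, as could be derived from `statvfs`.
struct Usage {
    total_bytes: u64,
    avail_bytes: u64,
}

/// The two configured thresholds described above.
struct Thresholds {
    max_usage_pct: u64,
    min_avail_bytes: u64,
}

/// Used space as a percentage of total space, truncated to an integer.
fn usage_pct(usage: &Usage) -> u64 {
    if usage.total_bytes == 0 {
        return 0;
    }
    let used = usage.total_bytes.saturating_sub(usage.avail_bytes);
    // Widen to u128 so the multiplication cannot overflow.
    (used as u128 * 100 / usage.total_bytes as u128) as u64
}

/// True if at least one of the two thresholds is exceeded.
fn has_pressure(usage: &Usage, thresholds: &Thresholds) -> bool {
    usage_pct(usage) > thresholds.max_usage_pct
        || usage.avail_bytes < thresholds.min_avail_bytes
}
```

Either condition alone is enough to trigger eviction in this sketch, which
matches the "one or more thresholds" wording above.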
The iteration evicts layers in LRU fashion with a reservation of up to
`min_resident_size` bytes of the most recent layers per tenant.
The layers not part of the per-tenant reservation are evicted
least-recently-used first until we're below all thresholds.
If the above doesn't relieve enough pressure, we fall back to Global LRU.
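The two phases described above can be expressed as a single pass over one
sorted list. The sketch below is a simplification under stated assumptions: the
`Candidate` struct, its `protected` flag, and `select_for_eviction` are
hypothetical names, and pressure is modeled as a plain byte target rather than
the thresholds.

```rust
/// Hypothetical, simplified eviction candidate.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Candidate {
    protected: bool,       // inside its tenant's min_resident_size reservation
    last_activity_ts: u64, // smaller means less recently used
    file_size: u64,
}

/// Pick layers to evict until `bytes_to_free` is covered.
/// Returns the chosen layers and whether reserved layers had to be touched.
fn select_for_eviction(
    mut candidates: Vec<Candidate>,
    bytes_to_free: u64,
) -> (Vec<Candidate>, bool) {
    // `false < true`, so unprotected layers sort first;
    // within each group the least recently used layer comes first.
    candidates.sort_by_key(|c| (c.protected, c.last_activity_ts));
    let mut freed = 0u64;
    let mut fell_back = false;
    let mut chosen = Vec::new();
    for c in candidates {
        if freed >= bytes_to_free {
            break;
        }
        // Reaching a protected layer means the respectful phase was not enough.
        fell_back |= c.protected;
        freed += c.file_size;
        chosen.push(c);
    }
    (chosen, fell_back)
}
```

Protected layers are only reached once every unprotected layer has been
selected, which corresponds to the fallback case.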
In addition to the loop, there is also an HTTP endpoint to perform
one loop iteration synchronously with the request.
The endpoint takes an absolute number of bytes that the iteration
needs to evict before pressure is relieved.
The tests use this endpoint, which is a great simplification over
setting up loopback-mounts in the tests, which would be required to
test the statvfs part of the implementation.
We will rely on manual testing in staging to test the statvfs parts.
The HTTP endpoint is also handy in emergencies where an operator wants
the pageserver to evict a given amount of space _now_.
Hence, its arguments are documented in openapi_spec.yml.
The response type isn't documented though because we don't consider
it stable. The endpoint should _not_ be used by Console.
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
fixes https://github.com/neondatabase/neon/issues/3728