motivation

2026-07-13 17:10:39 +00:00 · 2024-07-08 10:01:31 +00:00
parent 82c30ac757
commit b3c95a5b32
1 changed files with 36 additions and 4 deletions
--- a/docs/rfcs/direct-io-for-reads.md
+++ b/docs/rfcs/direct-io-for-reads.md
@@ -4,7 +4,7 @@

 This document is a proposal and implementation plan for direct IO in Pageserver.

-## Terminology
+## Terminology / Glossary

 **kernel page cache**: the kernel's page cache is a write-back cache for filesystem contents.
 The cached unit is memory-page-sized & aligned chunks of the files that are being cached (typically 4k).
@@ -23,11 +23,12 @@ asynchronously writes back dirtied pages based on a variety of conditions. For u
 ones are a) explicit request by userspace (`fsync`) and b) memory pressure.

 **Memory pressure**: the kernel page cache is a best effort service and a user of spare memory capacity.
-The kernel page allocator will take pages used by page cache if there is no other free memory available.
+If there is no free memory, the kernel page allocator will take pages used by page cache to satisfy allocations.
 Before reusing a page like that, the page has to be written back (writeback, see above).
 The far-reaching consequence of this is that **any allocation of anonymous memory can do IO** if the only
 way to get that memory is by eviction & re-using a dirty page cache page.
 Notably, this includes a simple `malloc` in userspace, because eventually that boils down to `mmap(..., MAP_ANON, ...)`.
+I refer to this effect as the "malloc latency backscatter" caused by buffered IO.

 **Direct IO** allows application's read/write system calls to bypass the kernel page cache. The filesystem
 is still involved because it is ultimately in charge of mapping the concept of files & offsets within them
@@ -45,7 +46,7 @@ Its caching unit is 8KiB which is the Postgres page size.
 Currently, it is tiny (128MiB), very much like Postgres's `shared_buffers`.
 A miss in PageCache is filled from the filesystem using buffered IO, issued through the `VirtualFile` layer in Pageserver.

-**VirtualFIle** is Pageserver's abstraction for file IO, very similar to the faciltiy in Postgres that bears the same name.
+**VirtualFile** is Pageserver's abstraction for file IO, very similar to the facility in Postgres that bears the same name.
 Its historical purpose appears to be working around open file descriptor limitations, which is practically irrelevant on Linux.
 However, the facility in Pageserver is useful as an intermediary layer for metrics and abstracts over the different kinds of
 IO engines that Pageserver supports (`std-fs` vs `tokio-epoll-uring`).
@@ -78,5 +79,36 @@ In the future, we may elminate the `PageCache` even for indirect blocks.
 For example with an LRU cache that has as unit the entire disk btree content
 instead of individual blocks.

-##
+## Motivation
+
+Even though we have eliminated Pageserver `PageCache` complexities and overheads, we are still using the kernel page cache for all IO.
+
+In this RFC, we propose switching to direct IO and lay out a plan to do it.
+
+The motivation for using direct IO:
+
+Predictable VirtualFile operation latencies.
+* for reads: currently kernel page cache hit/miss determines fast/slow
+* for appends: immediate back-pressure from disk instead of kernel page cache
+* for in-place updates: we don't do in-place updates in Pageserver
+* file fsync: will become practically constant cost because no writeback needs to happen
+
+Predictable latencies, generally.
+* avoid *malloc latency backscatter* caused by buffered writes (see glossary section)
+
+Efficiency
+* Direct IO avoids one layer of memory-to-memory copy.
+* We already do not rely / do not want to rely on the kernel page cache for batching of small IOs into bigger ones:
+    * writes: we do large streaming writes and/or have implemented batching in userspace.
+    * reads:
+        * intra-request: vectored get (RFC 30) takes care of merging reads => no block is read twice
+        * inter-request, e.g., getpage request for adjacent pages last-modified at nearly the same time
+            * (ideally these would come in as one vectored get request)
+            * generally, we accept making such reads *predictably* slow rather than *maybe* fast,
+                depending on how busy the kernel page cache is.
+
+Explicitness & Tangibility of resource usage.
+* It is desirable and valuable to be *explicit* about the main resources we use. For example:
+* We can build true observability of resource usage ("what tenant is causing the actual IOs that are sent to the disk?").
+* We can build accounting & QoS by implementing an IO scheduler that is tenant aware.