mirror of
https://github.com/neondatabase/neon.git
synced 2026-05-19 06:00:38 +00:00
motivation
@@ -4,7 +4,7 @@
This document is a proposal and implementation plan for direct IO in Pageserver.

## Terminology
## Terminology / Glossary

**kernel page cache**: the kernel's page cache is a write-back cache for filesystem contents.
The cached unit is memory-page-sized & aligned chunks of the files that are being cached (typically 4k).
@@ -23,11 +23,12 @@ asynchronously writes back dirtied pages based on a variety of conditions. For u
ones are a) explicit request by userspace (`fsync`) and b) memory pressure.
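
To make the explicit writeback trigger concrete, here is a minimal sketch using only the Rust standard library. `append_durably` is a hypothetical helper invented for illustration, not a Pageserver function. It assumes a Linux-like system where `File::sync_all` maps to `fsync`.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Appends `data` to the file at `path` with buffered IO, then requests writeback.
/// Returns the file length after the append.
/// Hypothetical helper for illustration only.
pub fn append_durably(path: &Path, data: &[u8]) -> io::Result<u64> {
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;

    // With buffered IO, this write typically only copies the bytes into the
    // kernel page cache and marks the affected pages dirty.
    file.write_all(data)?;

    // Explicit writeback request from userspace (case a in the text above).
    // The call returns once the kernel reports the dirty pages as written.
    file.sync_all()?;

    Ok(file.metadata()?.len())
}
```

The cost of the `sync_all` call depends on how many dirty pages the file has accumulated, which is why its latency is hard to predict under buffered IO.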

**Memory pressure**: the kernel page cache is a best effort service and a user of spare memory capacity.
The kernel page allocator will take pages used by page cache if there is no other free memory available.
If there is no free memory, the kernel page allocator will take pages used by page cache to satisfy allocations.
Before reusing a page like that, the page has to be written back (writeback, see above).
The far-reaching consequence of this is that **any allocation of anonymous memory can do IO** if the only
way to get that memory is by eviction & re-using a dirty page cache page.
Notably, this includes a simple `malloc` in userspace, because eventually that boils down to `mmap(..., MAP_ANON, ...)`.
I refer to this effect as the "malloc latency backscatter" caused by buffered IO.

**Direct IO** allows application's read/write system calls to bypass the kernel page cache. The filesystem
is still involved because it is ultimately in charge of mapping the concept of files & offsets within them
|
||||
@@ -45,7 +46,7 @@ Its caching unit is 8KiB which is the Postgres page size.
Currently, it is tiny (128MiB), very much like Postgres's `shared_buffers`.
A miss in PageCache is filled from the filesystem using buffered IO, issued through the `VirtualFile` layer in Pageserver.

**VirtualFIle** is Pageserver's abstraction for file IO, very similar to the faciltiy in Postgres that bears the same name.
**VirtualFile** is Pageserver's abstraction for file IO, very similar to the facility in Postgres that bears the same name.
Its historical purpose appears to be working around open file descriptor limitations, which is practically irrelevant on Linux.
However, the facility in Pageserver is useful as an intermediary layer for metrics and abstracts over the different kinds of
IO engines that Pageserver supports (`std-fs` vs `tokio-epoll-uring`).
|
||||
@@ -78,5 +79,36 @@ In the future, we may elminate the `PageCache` even for indirect blocks.
For example with an LRU cache that has as unit the entire disk btree content
instead of individual blocks.

##
## Motivation

Even though we have eliminated PS `PageCache` complexities and overheads, we are still using the kernel page cache for all IO.

In this RFC, we propose switching to direct IO and lay out a plan to do it.

The motivation for using direct IO:

Predictable VirtualFile operation latencies.
* for reads: currently kernel page cache hit/miss determines fast/slow
* for appends: immediate back-pressure from disk instead of kernel page cache
* for in-place updates: we don't do in-place updates in Pageserver
* file fsync: will become practically constant cost because no writeback needs to happen

Predictable latencies, generally.
* avoid *malloc latency backscatter* caused by buffered writes (see glossary section)

Efficiency
* Direct IO avoids one layer of memory-to-memory copy.
* We already do not rely / do not want to rely on the kernel page cache for batching of small IOs into bigger ones:
  * writes: we do large streaming writes and/or have implemented batching in userspace.
  * reads:
    * intra-request: vectored get (RFC 30) takes care of merging reads => no block is read twice
    * inter-request, e.g., getpage request for adjacent pages last-modified at nearly the same time
      * (ideally these would come in as one vectored get request)
      * generally, we accept making such reads *predictably* slow rather than *maybe* fast,
        depending on how busy the kernel page cache is.
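
The idea of merging reads in userspace, so that no block is read twice within a request, can be sketched as follows. This is an illustration under stated assumptions, not the vectored get implementation from RFC 30: `coalesce_blocks` is a hypothetical function, and blocks are modeled as plain `u64` block numbers.

```rust
use std::ops::Range;

/// Merges the requested block numbers into sorted, contiguous, half-open ranges.
/// Duplicate requests for the same block collapse into one, so each block is
/// covered by exactly one range. Hypothetical helper for illustration only.
pub fn coalesce_blocks(mut blocks: Vec<u64>) -> Vec<Range<u64>> {
    blocks.sort_unstable();
    blocks.dedup();

    let mut ranges: Vec<Range<u64>> = Vec::new();
    for block in blocks {
        match ranges.last_mut() {
            // The block directly follows the previous range: extend that range.
            Some(last) if last.end == block => last.end = block + 1,
            // Otherwise there is a gap: start a new range.
            _ => ranges.push(block..block + 1),
        }
    }
    ranges
}
```

Each resulting range can then be served by a single larger read instead of several small ones, which is the batching that the kernel page cache would otherwise provide implicitly.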

Explicitness & Tangibility of resource usage.
* It is desirable and valuable to be *explicit* about the main resources we use. For example:
  * We can build true observability of resource usage ("what tenant is causing the actual IOs that are sent to the disk?").
  * We can build accounting & QoS by implementing an IO scheduler that is tenant aware.
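
As a rough illustration of what tenant-aware accounting could look like once every disk IO is issued explicitly from userspace, consider the sketch below. All names (`IoStats`, `TenantIoAccounting`, the string tenant IDs) are hypothetical and not taken from Pageserver. A real IO scheduler would need more than counters.

```rust
use std::collections::HashMap;

/// Per-tenant counters. Hypothetical type for illustration only.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct IoStats {
    pub ops: u64,
    pub bytes: u64,
}

/// Tracks which tenant caused which IOs. Hypothetical type for illustration only.
#[derive(Debug, Default)]
pub struct TenantIoAccounting {
    per_tenant: HashMap<String, IoStats>,
}

impl TenantIoAccounting {
    /// Records one IO of `bytes` bytes issued on behalf of `tenant`.
    pub fn record(&mut self, tenant: &str, bytes: u64) {
        let stats = self.per_tenant.entry(tenant.to_string()).or_default();
        stats.ops += 1;
        stats.bytes += bytes;
    }

    /// Returns the counters for `tenant`, or zeroes if it issued no IO.
    pub fn stats(&self, tenant: &str) -> IoStats {
        self.per_tenant.get(tenant).copied().unwrap_or_default()
    }

    /// Returns the tenant that caused the most bytes of IO, if any.
    pub fn top_by_bytes(&self) -> Option<(&str, IoStats)> {
        self.per_tenant
            .iter()
            .max_by_key(|(_, stats)| stats.bytes)
            .map(|(tenant, stats)| (tenant.as_str(), *stats))
    }
}
```

With buffered IO, counters like these would only describe calls into the kernel page cache. With direct IO, they describe the IOs that are actually sent to the disk.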