From 7e165f5288db3186ab3335a47af387b03a92b653 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas
Date: Tue, 13 Dec 2022 15:11:46 +0200
Subject: [PATCH] RFC on compute cache and autoscaling

---
 docs/rfcs/021-autoscaling-compute-cache.md | 201 +++++++++++++++++++++
 1 file changed, 201 insertions(+)
 create mode 100644 docs/rfcs/021-autoscaling-compute-cache.md

diff --git a/docs/rfcs/021-autoscaling-compute-cache.md b/docs/rfcs/021-autoscaling-compute-cache.md
new file mode 100644
index 0000000000..02dd0563eb
--- /dev/null
+++ b/docs/rfcs/021-autoscaling-compute-cache.md
@@ -0,0 +1,201 @@
+# Context
+
+In Neon, one host runs a lot of compute nodes. Most of the compute
+nodes are fairly small, 1-2 vCPUs and 1-2 GB of memory. But some
+compute nodes can be larger, and we also want to autoscale from small
+to large, and back again, without shutting down the compute.
+
+    +----------------------------+
+    | Host                       |
+    | +------------------------+ |
+    | | Container VM           | |    +------------+
+    | | +--------------------+ | |    |            |
+    | | | Postgres           | | |    | Pageserver |
+    | | |                    | | |    |            |
+    | | |  ################  | | |    |            |
+    | | |  #   shared_    #  | | |    +------------+
+    | | |  #   buffers    #  | | |
+    | | |  ################  | | |
+    | | |                    | | |
+    | | +--------------------+ | |
+    | +------------------------+ |
+    +----------------------------+
+
+Each Postgres runs in a QEMU VM, which in turn runs in a Kubernetes
+pod. Kubernetes manages the placement of these Postgres compute nodes
+to hosts.
+
+# Problem
+
+PostgreSQL normally relies heavily on the kernel page cache for
+performance. PostgreSQL has its own buffer cache, configured by the
+shared_buffers setting, but the usual advice is to set shared_buffers
+to around 10-20% of the overall system RAM available, leaving the rest
+for the kernel page cache.
+However, with Neon, the I/O operations don't
+go through the kernel filesystem layer, so we bypass the kernel page
+cache and rely solely on the Postgres shared buffer cache for caching
+in the compute node.
+
+Because we don't make use of the kernel page cache, we either have to
+set shared_buffers larger than we would with normal PostgreSQL, or
+send a lot more I/O requests to the pageserver than we otherwise
+would. However, in PostgreSQL the shared_buffers setting cannot be
+changed while the server is running.
+
+Furthermore, we have fast local SSDs available in the compute hosts
+that we could also utilize for caching.
+
+
+
+# Solution 1: Scale shared buffers
+
+This solution consists of:
+
+- Core PostgreSQL changes to allow changing shared_buffers on the fly
+
+- New code to orchestrate changing the memory size of the VM, and to tell
+  PostgreSQL to change the shared_buffers setting accordingly.
+
+- New code in Postgres that measures current shared buffer cache usage
+  to determine what the "cache pressure" is, i.e. how useful it would
+  be to have a larger shared buffer cache. This could be in an
+  extension.
+
+- A new governor in the host that decides how much memory to allocate
+  to each VM.
+
+Picture:
+
+    +----------------------------+
+    | Host                       |
+    | +------------------------+ |
+    | | Container VM           | |    +------------+
+    | | +--------------------+ | |    |            |
+    | | | Postgres           | | |    | Pageserver |
+    | | |                    | | |    |            |
+    | | |  ################  | | |    |            |
+    | | |  #   shared_    #  | | |    +------------+
+    | | |  #   buffers    #  | | |
+    | | |  #              #  | | |
+    | | |  .      |       .  | | |
+    | | |  .      |       .  | | |
+    | | |  .      V       .  | | |
+    | | |  ################  | | |
+    | | |                    | | |
+    | | +--------------------+ | |
+    | +------------------------+ |
+    +----------------------------+
+
+Pros:
+
+- best possible performance for the cached data
+
+Cons:
+
+- Scales only memory, cannot take advantage of local SSDs in host machine
+- Needs explicit operations to scale.
+  Won't dynamically share resources
+  between tenants; we'll need to start a resizing process to change
+  the allocations.
+- Needs patches to core PostgreSQL
+
+
+# Alternative 2: Local filesystem cache
+
+Add code to the Neon Postgres extension to use a local file on disk for
+caching. When a page is evicted from the Postgres buffer cache, write it
+to the local file, and read it back if it's requested again. Rely on
+the kernel page cache to keep the hottest part of that file in memory.
+
+Like in Solution 1, this needs a governor in the host to allocate local
+disk space to each VM, and orchestration to scale it up and down.
+
+
+    +----------------------------+
+    | Host                       |
+    | +------------------------+ |
+    | | Container VM           | |    +------------+
+    | | +--------------------+ | |    |            |
+    | | | Postgres           | | |    | Pageserver |
+    | | |                    | | |    |            |
+    | | |  ################  | | |    |            |
+    | | |  #   shared_    #  | | |    +------------+
+    | | |  #   buffers    #  | | |
+    | | |  ################  | | |
+    | | |                    | | |
+    | | |  ################  | | |
+    | | |  #   Local FS   #  | | |
+    | | |  #    cache     #  | | |
+    | | |  #              #  | | |
+    | | |  ################  | | |
+    | | |                    | | |
+    | | +--------------------+ | |
+    | +------------------------+ |
+    +----------------------------+
+
+Pros:
+
+- No PostgreSQL core changes required
+- Automatically takes advantage of local SSDs, not just memory
+
+Cons:
+
+- Needs explicit operations to change the size of the cache file
+- Needs support for live migration of the filesystem
+
+
+Question:
+How is the page cache shared between the host kernel and the VMs? Does
+each VM maintain its own page cache? I think that depends on the
+filesystem and QEMU driver we choose. If we use a raw block device and
+let the VM manage it, I believe the VM will maintain the page
+cache. But if we use a driver to access the host filesystem directly,
+or use something like NFS, I'm not sure.
+
+
+# Alternative 3: Host cache
+
+Implement a new host cache process that intercepts all the I/O
+requests from all VMs running on the host, and manages a local cache.
+Postgres can communicate with the host cache using TCP, or via a custom
+virtio driver.
+
+    +----------------------------+
+    | Host                       |
+    |                            |
+    |   ######################   |
+    |   #                    #   |
+    |   # shared host cache  #   |
+    |   #                    #   |
+    |   #                    #   |
+    |   #                    #   |
+    |   #                    #   |
+    |   ######################   |
+    |                            |
+    | +------------------------+ |
+    | | Container VM           | |    +------------+
+    | | +--------------------+ | |    |            |
+    | | | Postgres           | | |    | Pageserver |
+    | | |                    | | |    |            |
+    | | |  ################  | | |    |            |
+    | | |  #   shared_    #  | | |    +------------+
+    | | |  #   buffers    #  | | |
+    | | |  ################  | | |
+    | | |                    | | |
+    | | +--------------------+ | |
+    | +------------------------+ |
+    +----------------------------+
+
+Pros:
+- dynamic sharing between all tenants (one cache and eviction policy for all)
+- No PostgreSQL core changes required
+- Takes advantage of local SSDs, not just memory
+
+Cons:
+
+- Whole new component to write
+
+
+One way to achieve this would be to collocate the pageserver on the
+host itself. That would eliminate the network roundtrip between
+Postgres and the pageserver, effectively making the pageserver itself
+the shared host cache.
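
Reviewer note (not part of the patch): the "one cache and eviction policy for all" idea in Alternative 3 might look roughly like the sketch below. Everything in it is an assumption made for illustration: the names `SharedHostCache`, `PageKey` and `fetch_from_pageserver` are hypothetical, the RFC does not name an eviction policy (plain LRU is used here), and page versions (LSNs), writes, concurrency and the SSD tier are all left out.

```python
from collections import OrderedDict
from typing import Callable, Tuple

# Hypothetical cache key: (tenant id, relation id, block number).
PageKey = Tuple[str, int, int]


class SharedHostCache:
    """Illustrative LRU cache shared by every tenant on one host."""

    def __init__(self, capacity_pages: int,
                 fetch_from_pageserver: Callable[[PageKey], bytes]):
        if capacity_pages <= 0:
            raise ValueError("capacity_pages must be positive")
        self.capacity_pages = capacity_pages
        # Assumed callback standing in for a real pageserver request.
        self.fetch_from_pageserver = fetch_from_pageserver
        # Ordered from least recently used to most recently used.
        self.pages: "OrderedDict[PageKey, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read_page(self, key: PageKey) -> bytes:
        page = self.pages.get(key)
        if page is not None:
            # Hit: mark the page as most recently used.
            self.pages.move_to_end(key)
            self.hits += 1
            return page
        # Miss: fetch the page, then evict the globally least recently
        # used page if the cache is over capacity, whichever tenant owns it.
        self.misses += 1
        page = self.fetch_from_pageserver(key)
        self.pages[key] = page
        if len(self.pages) > self.capacity_pages:
            self.pages.popitem(last=False)
        return page

    def pages_held_by(self, tenant: str) -> int:
        """Number of cached pages currently belonging to one tenant."""
        return sum(1 for (owner, _, _) in self.pages if owner == tenant)
```

Because eviction is global, a tenant that stops reading gradually gives up its share to active tenants with no explicit resize step, which is the dynamic sharing listed under Pros. The flip side is that one busy tenant can push out everyone else's pages, so a real design would probably want per-tenant limits or weights on top.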