neon/libs/vm_monitor
Em Sharnoff 1cac923af8 vm-monitor: Rate-limit upscale requests (#5263)
Some VMs, when already scaled up as much as possible, end up spamming
the autoscaler-agent with upscale requests that will never be fulfilled.
If postgres is using memory greater than the cgroup's memory.high, it
can emit new memory.high events 1000 times per second, which... just
means unnecessary load on the rest of the system.

This changes the vm-monitor so that we skip sending upscale requests if
we already sent one within the last second, to avoid spamming the
autoscaler-agent. This matches previous behavior that the vm-informant
had.
2023-09-10 20:33:53 +03:00
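
The rate limit described in the commit message can be sketched as follows. This is a minimal illustration, not the code from the commit: the `UpscaleRateLimiter` type and its `should_send` method are hypothetical names, and the one-second interval is passed in by the caller.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper: allows at most one upscale request per `min_interval`.
pub struct UpscaleRateLimiter {
    min_interval: Duration,
    last_sent: Option<Instant>,
}

impl UpscaleRateLimiter {
    pub fn new(min_interval: Duration) -> Self {
        Self { min_interval, last_sent: None }
    }

    /// Returns true if a request may be sent at `now` and records the send time.
    /// Returns false (and records nothing) if the previous request is too recent.
    pub fn should_send(&mut self, now: Instant) -> bool {
        match self.last_sent {
            Some(prev) if now.saturating_duration_since(prev) < self.min_interval => false,
            _ => {
                self.last_sent = Some(now);
                true
            }
        }
    }
}
```

Taking `now` as a parameter rather than calling `Instant::now()` internally keeps the check deterministic, so a burst of memory.high events can be simulated without sleeping.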

vm-monitor

The vm-monitor (or just monitor) is a core component of the autoscaling system, along with the autoscale-scheduler and the autoscaler-agents. The monitor has two primary roles: 1) notifying agents when immediate upscaling is necessary due to memory conditions and 2) managing Postgres' file cache and a cgroup to carry out upscaling and downscaling decisions.

More on scaling

We scale CPU and memory using NeonVM, our in-house QEMU tool for use with Kubernetes. To control thresholds for receiving memory usage notifications, we start Postgres in the neon-postgres cgroup and set its memory.{max,high}.
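
As a rough sketch of what setting memory.{max,high} involves: on a cgroup v2 hierarchy, each limit is a file inside the cgroup's directory that accepts a byte count. The function name `set_memory_limits` is hypothetical, and the location of the neon-postgres cgroup directory (for example somewhere under `/sys/fs/cgroup`) is an assumption the caller must supply. The monitor's real implementation may differ.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper: writes byte limits to a cgroup v2 directory.
/// `cgroup_dir` is assumed to be the directory of the target cgroup.
pub fn set_memory_limits(cgroup_dir: &Path, high_bytes: u64, max_bytes: u64) -> io::Result<()> {
    // memory.high is the throttling threshold; exceeding it generates "high" events.
    fs::write(cgroup_dir.join("memory.high"), high_bytes.to_string())?;
    // memory.max is the hard limit for the cgroup.
    fs::write(cgroup_dir.join("memory.max"), max_bytes.to_string())?;
    Ok(())
}
```

Keeping memory.high below memory.max leaves headroom between the point where notifications start and the point where the kernel enforces the hard limit.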

Structure

The vm-monitor is loosely composed of a few systems. These are:

  • the server: a simple axum server that accepts requests and upgrades them to WebSocket connections. The server allows only one connection at a time: upon receiving a new connection, it terminates the old one if it exists.
  • the filecache: a struct that allows communication with the Postgres file cache. On startup, we connect to the filecache and hold on to the connection for the entire monitor lifetime.
  • the cgroup watcher: the CgroupWatcher manages the neon-postgres cgroup by listening for memory.high events and setting its memory.{high,max} values.
  • the runner: the runner marries the filecache and cgroup watcher together, communicating with the agent through the Dispatcher and calling filecache and cgroup watcher functions as needed to upscale and downscale.
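
The server's one-connection-at-a-time policy can be illustrated with a small standard-library sketch. It is not taken from the vm-monitor source: `ConnectionSlot` and `CancelToken` are hypothetical names, and a real axum server would more likely signal the old connection's task through an async channel. The idea is the same: registering a new connection cancels the previous one.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};

/// Hypothetical token held by a connection; it should stop once cancelled.
#[derive(Clone)]
pub struct CancelToken(Arc<AtomicBool>);

impl CancelToken {
    pub fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

/// Hypothetical slot that tracks at most one active connection.
#[derive(Default)]
pub struct ConnectionSlot {
    current: Mutex<Option<Arc<AtomicBool>>>,
}

impl ConnectionSlot {
    pub fn new() -> Self {
        Self::default()
    }

    /// Registers a new connection and cancels the previous one, if any.
    pub fn replace(&self) -> CancelToken {
        let flag = Arc::new(AtomicBool::new(false));
        let mut guard = self.current.lock().unwrap();
        // Swap in the new flag; if an old connection existed, mark it cancelled.
        if let Some(old) = guard.replace(flag.clone()) {
            old.store(true, Ordering::SeqCst);
        }
        CancelToken(flag)
    }
}
```

Each connection handler would poll its token (or, in async code, await a cancellation signal) and shut down once a newer connection has taken the slot.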