RFC: Rewrite Postgres <-> Pageserver communication

This is not ready; I'm still collecting and organizing my own thoughts. I will update when it's ready for review. That said, feel free to leave comments already if you wish.
2026-01-16 09:52:54 +00:00 · 2025-02-13 02:39:39 +02:00
parent e38694742c
commit 87a9afbc64
1 changed files with 197 additions and 0 deletions
--- a/docs/rfcs/042-compute-pageserver-communicator.md
+++ b/docs/rfcs/042-compute-pageserver-communicator.md
@@ -0,0 +1,197 @@
+# Compute <-> Pageserver Communicator Rewrite
+
+## Summary
+
+## Motivation
+
+- prefetching logic is complicated
+- handling async libpq connections in C code is difficult and error-prone
+- only a few people are comfortable working on the code
+- new AIO (maybe) coming up in Postgres v18
+- cannot process prefetch replies until another I/O function is called. Makes it impossible to accurately measure when a reply was received
+- every backend opens a separate connection to pageservers -> lots of connections, first query in backend is slow
+
+- desire for better protocol, not libpq-based
+
+- By writing the "communicator" as a separate rust module, it can be reused in
+  tests, outside PostgreSQL.
+
+## Non Goals (if relevant)
+
+- We will keep LFC unmodified for now. It might be a good idea to rewrite it
+  too, but it's out of scope here.
+
+- We should consider a similar rewrite for the walproposer - safekeeper
+  communication, but it's out of scope for this RFC
+
+## Impacted components (e.g. pageserver, safekeeper, console, etc)
+
+- Most changes are to the neon extension in compute.
+
+- Pageserver, to implement the new protocol.
+
+## Proposed implementation
+
+- we will use the new implementation with all PostgreSQL versions.
+- we will have a feature flag to switch between old and new communicator. Once we're
+  comfortable with the new communicator, remove old code and protocol.
+
+- What about the relation size cache? Out of scope? Or move it to the communicator process,
+  and have smgrnblocks() requests always go through the communicator process?
+
+### Communicator process
+
+There is one communicator process in the Postgres server. It's a background
+worker process. It handles all communication with the pageservers.
+
+The communicator process is written in a mix of C and Rust. Mostly in Rust, but
+some C code is needed to interface with the Postgres facilities. For example:
+- logging
+- error handling (in a very limited form, we don't want to ereport() on most errors)
+- expose a shared memory area for IPC
+- signal other processes
+
+We will write unsafe Rust or C glue code for those facilities, which allows us to
+write the rest of the communicator in safe Rust.
+
+The Rust parts of the communicator process can use multiple threads and
+tokio. The glue code is written with that in mind, so that it is safe to call
+from multiple threads.
+
+pgrx is a Rust crate for writing Postgres extensions in Rust. We will _not_ use
+it. It's a fine crate, good for most extensions, but I don't think we need
+most of the facilities that it provides. Our wrappers are more low-level than
+what most extensions need. We don't expose SQL functions or types from this
+extension, for example.
+
+### Communicator <-> backend interface
+
+The backends and the communicator process communicate via shared memory. Each
+backend has a fixed number of "request slots", forming a ring. When a backend
+wants to perform an I/O, it writes the request details, such as the block number
+and LSN, to the next available slot. The request also includes a pointer or
+buffer ID where the resulting page should be written. The backend then wakes up
+the communicator with a signal, futex, latch, or a similar mechanism, telling it
+that it has work to do.
+
+The communicator picks up the request from the backend's ring, and performs
+it. It writes the resulting page to the address requested by the backend (most likely a
+shared buffer that the backend is holding a lock on), marks the request as
+completed, and wakes up the backend.
+
+In this design, because each backend has its own small ring, a backend doesn't
+need to do any locking to manipulate the request slots. Similarly, after a
+request has been submitted, the communicator has temporary ownership of the
+request slot, and doesn't need to do locking on it.
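
The ownership handoff described above can be sketched in Rust roughly as follows. This is only an illustration under assumptions, not the proposed implementation: the names `RequestSlot`, `BackendRing`, `submit`, `complete` and `reap`, the three slot states, and the use of atomics in an ordinary `Vec` instead of a real shared memory segment are all hypothetical. Wakeups (signal/futex/latch) are left out entirely.

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, AtomicU8, Ordering};

// Hypothetical slot states. Ownership follows the state: the backend owns
// IDLE and COMPLETED slots, the communicator owns SUBMITTED slots.
const IDLE: u8 = 0;
const SUBMITTED: u8 = 1;
const COMPLETED: u8 = 2;

/// One request slot. In the real design this would live in shared memory.
#[derive(Default)]
pub struct RequestSlot {
    state: AtomicU8,
    blkno: AtomicU32,
    lsn: AtomicU64,
    /// Stand-in for the buffer ID where the page should be written.
    dest_buffer: AtomicU32,
}

/// A small fixed-size ring of slots that belongs to a single backend.
pub struct BackendRing {
    slots: Vec<RequestSlot>,
    next: usize, // only the owning backend touches this index
}

impl BackendRing {
    pub fn new(nslots: usize) -> Self {
        assert!(nslots > 0, "a ring needs at least one slot");
        BackendRing {
            slots: (0..nslots).map(|_| RequestSlot::default()).collect(),
            next: 0,
        }
    }

    /// Backend side: fill the next slot and hand it to the communicator.
    /// Returns the slot index, or None if the next slot is still in use.
    pub fn submit(&mut self, blkno: u32, lsn: u64, dest_buffer: u32) -> Option<usize> {
        let idx = self.next;
        let slot = &self.slots[idx];
        if slot.state.load(Ordering::Acquire) != IDLE {
            return None;
        }
        slot.blkno.store(blkno, Ordering::Relaxed);
        slot.lsn.store(lsn, Ordering::Relaxed);
        slot.dest_buffer.store(dest_buffer, Ordering::Relaxed);
        // Release makes the fields above visible before the state change.
        slot.state.store(SUBMITTED, Ordering::Release);
        self.next = (idx + 1) % self.slots.len();
        Some(idx)
    }

    /// Communicator side: take a submitted request and mark it completed.
    /// Returns the (blkno, lsn, dest_buffer) that was handled.
    pub fn complete(&self, idx: usize) -> Option<(u32, u64, u32)> {
        let slot = &self.slots[idx];
        if slot.state.load(Ordering::Acquire) != SUBMITTED {
            return None;
        }
        let req = (
            slot.blkno.load(Ordering::Relaxed),
            slot.lsn.load(Ordering::Relaxed),
            slot.dest_buffer.load(Ordering::Relaxed),
        );
        // A real communicator would fetch the page and write it out here.
        slot.state.store(COMPLETED, Ordering::Release);
        Some(req)
    }

    /// Backend side: consume a completed result and free the slot for reuse.
    pub fn reap(&self, idx: usize) -> bool {
        let slot = &self.slots[idx];
        if slot.state.load(Ordering::Acquire) != COMPLETED {
            return false;
        }
        slot.state.store(IDLE, Ordering::Release);
        true
    }
}
```

The point of the sketch is that only one side writes to a slot in any given state, so a single atomic state field is enough to pass ownership back and forth without a lock.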
+
+This design is somewhat similar to how the upcoming AIO patches in PostgreSQL
+will work. That should make it easy to adapt to new PostgreSQL versions.
+
+In the above example, I assumed a GetPage request, but everything applies to
+other request types like "smgrnblocks" too.
+
+### Prefetching
+
+A backend can also issue a "blind" prefetch request. When the communicator
+processes a blind prefetch request, it starts the GetPage request and writes the
+result to a local buffer within the communicator process. But it could also
+decide to do nothing, or to schedule the request with a lower priority. It
+doesn't give any result back to the requesting backend, hence it's "blind".
+Later, when the backend - or a different backend - requests the page that was
+prefetched and the prefetch was performed and completed, the communicator
+process can satisfy the request quickly from the private buffer.
+
+In this design, the "prefetch" slots are shared by all backends. If one backend
+issues a prefetch request but never consumes it, but another backend reads the
+same page, the prefetch can be used to satisfy the request. (In our current
+implementation, it would be wasted, and the same GetPage is performed
+twice. https://github.com/neondatabase/neon/pull/10442 helps with that, but
+doesn't fully fix the problem.)
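
As a rough illustration of the shared prefetch slots, the communicator could keep one map of prefetched pages that serves reads from any backend. The sketch below is an assumption, not the planned data structure: `PageKey`, `PrefetchState`, `PrefetchCache` and their methods are hypothetical names, and the key layout (relation, block number, LSN) is simplified.

```rust
use std::collections::HashMap;

/// Identifies one page version (hypothetical, simplified key layout).
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct PageKey {
    pub rel: u32,
    pub blkno: u32,
    pub lsn: u64,
}

/// State of one prefetch entry inside the communicator.
#[derive(Debug, PartialEq)]
pub enum PrefetchState {
    InFlight,
    Ready(Vec<u8>),
}

/// Prefetch entries shared by all backends; lives in the communicator.
#[derive(Default)]
pub struct PrefetchCache {
    entries: HashMap<PageKey, PrefetchState>,
}

impl PrefetchCache {
    /// Blind prefetch: record that a GetPage was started for this key.
    /// Returns false if the page is already tracked, so the caller can
    /// skip sending a duplicate request.
    pub fn start_prefetch(&mut self, key: PageKey) -> bool {
        if self.entries.contains_key(&key) {
            return false;
        }
        self.entries.insert(key, PrefetchState::InFlight);
        true
    }

    /// Called when the pageserver's reply for a prefetch arrives.
    pub fn complete_prefetch(&mut self, key: PageKey, page: Vec<u8>) {
        self.entries.insert(key, PrefetchState::Ready(page));
    }

    /// A real read from any backend: returns the page if a completed
    /// prefetch exists, and removes the entry. In-flight entries are left
    /// alone; the caller would have to wait or issue a normal request.
    pub fn take(&mut self, key: &PageKey) -> Option<Vec<u8>> {
        if !matches!(self.entries.get(key), Some(PrefetchState::Ready(_))) {
            return None;
        }
        match self.entries.remove(key) {
            Some(PrefetchState::Ready(page)) => Some(page),
            _ => None,
        }
    }
}
```

Because the map is keyed by page rather than by requesting backend, it does not matter which backend issued the prefetch and which one ends up reading the page.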
+
+### PostgreSQL versions < 17
+
+In version 16 and below, prefetch calls are made without the buffer being
+pinned. Backends will perform "blind" prefetch requests for smgrprefetch().
+
+### PostgreSQL version 17
+
+In version 17, when prefetching is requested, the pages are already pinned in
+the buffer manager. We possibly could write the page directly to the shared
+buffer, but there's a risk that the backend stops the scan and releases the pins
+without ever performing the real I/Os. Because of that, backends will perform
+blind prefetch requests like in v16; we can't easily take advantage of the
+pinned buffer.
+
+### PostgreSQL version 18, if the AIO patches are committed
+
+With the AIO patches, prefetching is no longer performed with posix_fadvise
+calls. The backends will start the prefetch I/Os "for real", into the locked
+shared buffer. On completion of an AIO, the process that processes the
+completion will have to call a small callback routine that releases the buffer
+lock and wakes up any processes waiting on the I/O. It'll require some care to
+execute that safely from the communicator process.
+
+### Compute <-> Pageserver Protocol
+
+As part of the project, we will change the protocol. Desired properties of the new protocol:
+
+- Use protobuf or something else more standard. Maybe gRPC. So that we can use
+  standard tools like Wireshark to easily analyze the traffic.
+
+- Batching. Have capability to request more than one page in one request.
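
To make the batching point concrete, here is a minimal sketch of grouping pending page requests into multi-page requests. Everything in it is an assumption for illustration: `GetPageRequest`, `GetPageBatch`, `make_batches` and the `max_batch` limit are hypothetical, and the actual wire format (protobuf, gRPC or something else) is not decided in this RFC.

```rust
/// One page request (hypothetical, simplified fields).
#[derive(Clone, Debug, PartialEq)]
pub struct GetPageRequest {
    pub rel: u32,
    pub blkno: u32,
    pub lsn: u64,
}

/// A single protocol message carrying several page requests.
#[derive(Debug, PartialEq)]
pub struct GetPageBatch {
    pub requests: Vec<GetPageRequest>,
}

/// Split the pending requests into batches of at most `max_batch` pages,
/// preserving the order in which they were queued.
pub fn make_batches(pending: &[GetPageRequest], max_batch: usize) -> Vec<GetPageBatch> {
    assert!(max_batch > 0, "max_batch must be at least 1");
    pending
        .chunks(max_batch)
        .map(|chunk| GetPageBatch { requests: chunk.to_vec() })
        .collect()
}
```

A real implementation would probably also group requests by the pageserver or shard that serves them before batching; that is not shown here.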
+
+In principle, changing the protocol is an independent change from the new
+communicator process. But it makes sense to do at the same time:
+
+- Switching to Rust in the communicator process makes it possible to use
+  existing libraries
+  
+- Using a library might help with managing the pool of pageserver connections,
+  so we won't need to implement that ourselves
+
+### Reliability, failure modes and corner cases (if relevant)
+
+### Interaction/Sequence diagram (if relevant)
+
+### Scalability (if relevant)
+
+- Could the single communicator process become a bottleneck? In the new v18 AIO
+  system, the process needs to execute all the I/O completion callbacks. They're
+  very short, but I still wonder if a single process can handle it.
+
+### Security implications (if relevant)
+
+- We currently use libpq authentication with a JWT token. We can continue to use
+  the token for authentication in the new protocol.
+
+### Unresolved questions (if relevant)
+
+## Alternative implementation (if relevant)
+
+I think UDP might also be a good fit for the protocol. No overhead of
+establishing or holding a connection. No head-of-line blocking; prefetch
+requests can be processed with lower priority. We would control our own
+destiny. But it has its own set of challenges: congestion control,
+authentication & encryption.
+
+## Pros/cons of proposed approaches (if relevant)
+
+## Definition of Done (if relevant)
+
+The new communicator has replaced the old code and is deployed in production,
+and support for the old protocol has been removed.
+
+Implementation phases:
+
+- Implement new protocol in pageserver. In first prototype, maybe just
+  wrap/convert the existing message types into HTTP+protobuf, to keep it simple.
+
+- Implement the C/Rust wrappers needed to launch the communicator as a background
+  worker process, with access to shared memory.
+
+- Implement a simple request / response interface in shared memory between the
+  backends and the communicator.
+  
+- Implement a minimalistic communicator: hold one connection to
+  pageserver/shard. No prefetching. Process one request at a time.
+
+- Improve the communicator: multiple threads, multiple connections, prefetching