From c15aa04714e82af1542b8ade1b6d8c1453474dee Mon Sep 17 00:00:00 2001
From: Anastasia Lubennikova
Date: Thu, 14 Apr 2022 12:56:46 +0300
Subject: [PATCH] Move Cluster size limit RFC from rfcs repo

---
 docs/rfcs/cluster-size-limits.md | 79 ++++++++++++++++++++++++++++++++
 1 file changed, 79 insertions(+)
 create mode 100644 docs/rfcs/cluster-size-limits.md

diff --git a/docs/rfcs/cluster-size-limits.md b/docs/rfcs/cluster-size-limits.md
new file mode 100644
index 0000000000..4696f2c7f0
--- /dev/null
+++ b/docs/rfcs/cluster-size-limits.md
@@ -0,0 +1,79 @@

Cluster size limits
===================

## Summary

One of the resource consumption limits for free-tier users is a cluster size limit.

To enforce it, we need to calculate the timeline size and check whether the limit has been reached before relation create/extend operations.
If the limit is reached, the query must fail with a meaningful error/warning.
We may want to exempt some operations from the quota so that users can free space and fit back under the limit.

The stateless compute node that performs the validation is separate from the storage that calculates the usage, so we need to exchange cluster size information between those components.

## Motivation

We want to cap the maximum size of a PostgreSQL instance for free-tier users (and other tiers in the future).
First of all, this is needed to control our free-tier production costs.
Another reason to limit resources is risk management: we haven't (fully) tested and optimized zenith for big clusters,
so we don't want to give users access to functionality that we don't consider ready.

## Components

* pageserver - calculate the size consumed by a timeline and add it to the feedback message.
* safekeeper - pass the feedback message from the pageserver to compute.
* compute - receive the feedback message and enforce the size limit based on the GUC `zenith.max_cluster_size`.
* console - set and update the `zenith.max_cluster_size` setting.

## Proposed implementation

First of all, it's necessary to define timeline size.

The current approach is to count all data, including SLRUs, but not WAL.
Here we think of the timeline as the physical disk underneath the Postgres cluster.
This is how the `LOGICAL_TIMELINE_SIZE` metric is implemented in the pageserver.

Alternatively, we could count only relation data, as `pg_database_size()` does.
This approach is somewhat more user-friendly, because that is the data the user actually affects.
On the other hand, it puts us in a weaker position than other services, e.g., RDS.
Implementing it would require refactoring the timeline_size counter or adding another counter.

The timeline size is updated during WAL digestion. It is not versioned and is valid as of the last_received_lsn.
This size then needs to be reported to the compute node.

The `current_timeline_size` value is included in the walreceiver's custom feedback message, `ZenithFeedback`
(see the PR about the protocol changes: https://github.com/zenithdb/zenith/pull/1037).

This message is received by the safekeeper and propagated to the compute node as part of `AppendResponse`.

Finally, when the compute node receives `current_timeline_size` from the safekeeper (or directly from the pageserver), it updates a global variable.

Then every `zenith_extend()` operation checks whether the limit has been reached (`current_timeline_size > zenith.max_cluster_size`) and throws an `ERRCODE_DISK_FULL` error if so
(see the Postgres error codes appendix: https://www.postgresql.org/docs/devel/errcodes-appendix.html).
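A minimal sketch of what that compute-side check could look like, assuming a global `current_timeline_size` kept up to date from the feedback messages and an integer `zenith.max_cluster_size` GUC measured in megabytes — the names, units, and call site here are assumptions for illustration, not the actual extension code. The autovacuum exemption from the TODO below is included:

```c
/*
 * Hypothetical sketch of the extend-time quota check; variable and
 * function names are assumptions, not the real zenith extension code.
 */
#include "postgres.h"
#include "postmaster/autovacuum.h"

/* Updated whenever a ZenithFeedback message arrives from the safekeeper. */
extern uint64 current_timeline_size;

/* Backing variable for the zenith.max_cluster_size GUC, in MB; -1 = no limit. */
extern int max_cluster_size;

static void
check_cluster_size_limit(void)
{
	/* Let autovacuum through so it can reclaim space (see TODO below). */
	if (IsAutoVacuumWorkerProcess())
		return;

	if (max_cluster_size >= 0 &&
		current_timeline_size > (uint64) max_cluster_size * 1024 * 1024)
		ereport(ERROR,
				(errcode(ERRCODE_DISK_FULL),
				 errmsg("could not extend relation: cluster size limit %d MB reached",
						max_cluster_size)));
}
```

Calling such a check at the top of the extend path would leave read-only queries unaffected while failing any operation that needs to allocate new pages.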
TODO:
We can allow autovacuum processes to bypass this check by simply testing `IsAutoVacuumWorkerProcess()` (as in the sketch above).
It would be nice to allow manual VACUUM and VACUUM FULL to bypass the check too, but it is hard to distinguish these operations at this low level.
See issues https://github.com/neondatabase/neon/issues/1245
and https://github.com/zenithdb/zenith/issues/1445.

TODO:
We should warn users when the limit is about to be reached.

### **Reliability, failure modes and corner cases**

1. `current_timeline_size` is valid as of the last LSN received and digested by the pageserver.

   If the pageserver lags behind the compute node, `current_timeline_size` will lag too. The lag can be tuned using backpressure, but it is not expected to be zero at all times.

   So transactions that happen in this LSN range may overshoot the limit, especially operations that allocate (e.g., CREATE DATABASE) or free (e.g., TRUNCATE) many data pages while generating only a small amount of WAL. Are there other operations like this?

   Currently, CREATE DATABASE operations are restricted in the console, so this is not an issue.


### **Security implications**

We treat compute as an untrusted component. That's why we try to isolate it with a secure container runtime or a VM.
A malicious user may change `zenith.max_cluster_size`, so we need an extra size limit check.
To cover this case, we also monitor the compute node size in the console.
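For illustration, here is one way the GUC could be registered on the compute side — a sketch under the same naming and unit assumptions as above, not the actual extension code. Declaring it `PGC_SUSET` keeps ordinary users from raising the limit, but since the compute node itself is untrusted, the console-side monitoring described above remains the real backstop:

```c
/* Hypothetical GUC registration; names, units, and context are assumptions. */
#include "postgres.h"
#include <limits.h>
#include "utils/guc.h"

int			max_cluster_size = -1;	/* MB; -1 disables the limit */

void
_PG_init(void)
{
	DefineCustomIntVariable("zenith.max_cluster_size",
							"Maximum cluster size in MB (-1 disables the limit).",
							NULL,
							&max_cluster_size,
							-1,			/* boot value: unlimited */
							-1, INT_MAX,
							PGC_SUSET,	/* only superusers may change it */
							GUC_UNIT_MB,
							NULL, NULL, NULL);
}
```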