From c15aa04714e82af1542b8ade1b6d8c1453474dee Mon Sep 17 00:00:00 2001
From: Anastasia Lubennikova
Date: Thu, 14 Apr 2022 12:56:46 +0300
Subject: [PATCH] Move Cluster size limit RFC from rfcs repo

---
 docs/rfcs/cluster-size-limits.md | 79 ++++++++++++++++++++++++++++++++
 1 file changed, 79 insertions(+)
 create mode 100644 docs/rfcs/cluster-size-limits.md

diff --git a/docs/rfcs/cluster-size-limits.md b/docs/rfcs/cluster-size-limits.md
new file mode 100644
index 0000000000..4696f2c7f0
--- /dev/null
+++ b/docs/rfcs/cluster-size-limits.md
@@ -0,0 +1,79 @@

Cluster size limits
===================

## Summary

One of the resource consumption limits for free-tier users is a cluster size limit.

To enforce it, we need to calculate the timeline size and check whether the limit has been reached before relation create/extend operations.
If the limit is reached, the query must fail with a meaningful error/warning.
We may want to exempt some operations from the quota so that users can free space and fit back under the limit.

The stateless compute node that performs the validation is separate from the storage that calculates the usage, so we need to exchange cluster size information between those components.

## Motivation

We want to cap the maximum size of a PostgreSQL instance for free-tier users (and other tiers in the future).
First of all, this is needed to control our free-tier production costs.
Another reason to limit resources is risk management: we haven't (fully) tested and optimized zenith for big clusters,
so we don't want to give users access to functionality that we don't consider ready.

## Components

* pageserver - calculate the size consumed by a timeline and add it to the feedback message.
* safekeeper - pass the feedback message from the pageserver to compute.
* compute - receive the feedback message and enforce the size limit based on the GUC `zenith.max_cluster_size`.
* console - set and update the `zenith.max_cluster_size` setting.

## Proposed implementation

First of all, it's necessary to define timeline size.

The current approach is to count all data, including SLRUs, but not WAL.
Here we think of the timeline as the physical disk underneath the Postgres cluster.
This is how the `LOGICAL_TIMELINE_SIZE` metric is implemented in the pageserver.

Alternatively, we could count only relation data, as `pg_database_size()` does.
This approach is somewhat more user-friendly, because that is the data the user actually affects.
On the other hand, it puts us in a weaker position than other services, e.g., RDS.
Implementing it would require refactoring the timeline_size counter or adding another counter.

The timeline size is updated during WAL digestion. It is not versioned and is valid as of the last_received_lsn.
This size then needs to be reported to the compute node.

The `current_timeline_size` value is included in the walreceiver's custom feedback message, `ZenithFeedback`
(see the PR about the protocol changes: https://github.com/zenithdb/zenith/pull/1037).

This message is received by the safekeeper and propagated to the compute node as part of `AppendResponse`.

Finally, when the compute node receives `current_timeline_size` from the safekeeper (or directly from the pageserver), it updates a global variable.

Then every `zenith_extend()` operation checks whether the limit has been reached (`current_timeline_size > zenith.max_cluster_size`) and throws an `ERRCODE_DISK_FULL` error if so
(see the Postgres error codes appendix: https://www.postgresql.org/docs/devel/errcodes-appendix.html).
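A minimal sketch of what that compute-side check could look like, assuming a global `current_timeline_size` kept up to date from the feedback messages and an integer `zenith.max_cluster_size` GUC measured in megabytes — the names, units, and call site here are assumptions for illustration, not the actual extension code. The autovacuum exemption from the TODO below is included:

```c
/*
 * Hypothetical sketch of the extend-time quota check; variable and
 * function names are assumptions, not the real zenith extension code.
 */
#include "postgres.h"
#include "postmaster/autovacuum.h"

/* Updated whenever a ZenithFeedback message arrives from the safekeeper. */
extern uint64 current_timeline_size;

/* Backing variable for the zenith.max_cluster_size GUC, in MB; -1 = no limit. */
extern int max_cluster_size;

static void
check_cluster_size_limit(void)
{
	/* Let autovacuum through so it can reclaim space (see TODO below). */
	if (IsAutoVacuumWorkerProcess())
		return;

	if (max_cluster_size >= 0 &&
		current_timeline_size > (uint64) max_cluster_size * 1024 * 1024)
		ereport(ERROR,
				(errcode(ERRCODE_DISK_FULL),
				 errmsg("could not extend relation: cluster size limit %d MB reached",
						max_cluster_size)));
}
```

Calling such a check at the top of the extend path would leave read-only queries unaffected while failing any operation that needs to allocate new pages.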
TODO:
We can allow autovacuum processes to bypass this check by simply testing `IsAutoVacuumWorkerProcess()` (as in the sketch above).
It would be nice to allow manual VACUUM and VACUUM FULL to bypass the check too, but it is hard to distinguish these operations at this low level.
See issues https://github.com/neondatabase/neon/issues/1245
and https://github.com/zenithdb/zenith/issues/1445.

TODO:
We should warn users when the limit is about to be reached.

### **Reliability, failure modes and corner cases**

1. `current_timeline_size` is valid as of the last LSN received and digested by the pageserver.

   If the pageserver lags behind the compute node, `current_timeline_size` will lag too. The lag can be tuned using backpressure, but it is not expected to be zero at all times.

   So transactions that happen in this LSN range may overshoot the limit, especially operations that allocate (e.g., CREATE DATABASE) or free (e.g., TRUNCATE) many data pages while generating only a small amount of WAL. Are there other operations like this?

   Currently, CREATE DATABASE operations are restricted in the console, so this is not an issue.


### **Security implications**

We treat compute as an untrusted component. That's why we try to isolate it with a secure container runtime or a VM.
A malicious user may change `zenith.max_cluster_size`, so we need an extra size limit check.
To cover this case, we also monitor the compute node size in the console.
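For illustration, here is one way the GUC could be registered on the compute side — a sketch under the same naming and unit assumptions as above, not the actual extension code. Declaring it `PGC_SUSET` keeps ordinary users from raising the limit, but since the compute node itself is untrusted, the console-side monitoring described above remains the real backstop:

```c
/* Hypothetical GUC registration; names, units, and context are assumptions. */
#include "postgres.h"
#include <limits.h>
#include "utils/guc.h"

int			max_cluster_size = -1;	/* MB; -1 disables the limit */

void
_PG_init(void)
{
	DefineCustomIntVariable("zenith.max_cluster_size",
							"Maximum cluster size in MB (-1 disables the limit).",
							NULL,
							&max_cluster_size,
							-1,			/* boot value: unlimited */
							-1, INT_MAX,
							PGC_SUSET,	/* only superusers may change it */
							GUC_UNIT_MB,
							NULL, NULL, NULL);
}
```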