User management RFC

2026-05-26 09:30:37 +00:00 · 2022-07-15 13:19:02 +03:00
parent c68336a246
commit fb8935305a
1 changed files with 80 additions and 0 deletions
--- a/docs/rfcs/017-user-management.md
+++ b/docs/rfcs/017-user-management.md
@@ -0,0 +1,80 @@
+# Postgres user and database management
+
+We've accumulated a bunch of problems with our approach to role and database management, namely:
+
+1. we don't allow role and database creation from Postgres, and users are complaining about that
+2. fine-grained role management is not possible both from Postgres and console
+3. web_access and @user are different roles, which creates object access problems in some cases
+
+Right now, we do store users and databases both in console and Postgres, and there are two main reasons for
+that:
+
+* we want to be able to authenticate users in proxy against the console without Postgres involvement. Otherwise,
+malicious brute force attempts will wake up Postgres (expensive) and may exhaust the Postgres connection pool (deny of service).
+* it is handy when we can render console UI without waking up compute (e.g., show database list)
+
+Storing the same information in two systems is a form of replication. And in the current scheme
+the console is primary, and Postgres catalog is a replica.
+
+This RFC proposes to address problems 1. and 2. by making Postgres a source of truth for roles/databases and
+only caching this info in Postgres. So using the replication analogy, now the Postgres catalog will be primary, and
+the console will be a replica. Problem 3 is a bit different and could be addressed by ditching the web_access
+user and using, e.g., JWT auth for the @username user so that we do not introduce a new user (JWT is needed
+since we don't know users password).
+
+This RFC doesn't talk about giving root access to the database, which is blocked by a secure runtime setup.
+
+## Overview
+
+* Add `/tenant/$tenant/branch/$branch/refresh_catalog` endpoint to console management API which asks `/get_catalog` and updates cached roles/databases info.
+* Whenever user edits list of databases or users postgres signals `compute_ctl` to call `/<...>/refresh_catalog` in the console
+* Add password strenght check in our extension
+
+## Postgres behavior
+
+Default user role (@username) should have `CREATE ROLE`, `CREATE DB`, and `BYPASSRLS` privileges. We expose Postgres port
+to the open internet, so we need to check passwords strength. We can use the `passwordcheck` extension or do the same
+from our extension.
+
+Whenever a user edits a list of databases or users, Postgres sends SIGHUP to `compute_ctl`. `compute_ctl` should write PID to `compute_ctl.pid` file.
+
+
+## Compute_ctl behavior
+
+Upon `SIGHUP` signal `compute_ctl` should call `/tenant/$tenant/branch/$branch/refresh_catalog` to inform console about changes in the database. The console will circle back and load the data from `/get_catalog` on compute (see next section on why this approach instead of direct PUT/PATH to the console). In the case of `/refresh_catalog` failure, we should retry it N times.
+
+Also `compute_ctl` listens for http `/get_catalog` and returns list of databases and users upon request:
+```
+/get_catalog: -> {
+    databases: [{
+        name: "db1",
+        owner: "jack"
+    }],
+    roles: [{
+        name: "jack",
+        rolepassword: "SCRAM-SHA-256..."
+    }]
+}
+```
+
+## Console behavior
+
+Whenever the console receives `/refresh_catalog` on the management API it goes to compute and asks for `/get_catalog`. I suggest using this way instead of accepting a list of databases/roles directly to the console endpoint for the following reasons:
+
+* we, anyway, will need console originated call to compute's `/get_catalog` after historical branch creation
+* If an intruder gains access to some other `/tenant/$tenant/.../refresh_catalog` he won't be able to change the roles list and will just force an unnecessary reload.
+
+`/refresh_catalog` returns HTTP 200 OK on success.
+
+We should have a button in the admin UI to manually force `/refresh_catalog` in case of data desync.
+
+# Scalability
+
+On my laptop, I can create 4200 roles per second. That corresponds to 363 million roles per day. So both `/get_catalog` can become expensive, and our roles database can snowball. While we can address `/get_catalog` size by catching only the latest changes (e.g., maintain the audit table and drain it by the console), it is still not nice that a single tenant can blow up a multi-tenant console database. I would instead propose to limit the number of databases and roles by some big number like 1000 and bump this limit if somebody asks for it with a legit use case. 
+
+
+# QA:
+
+- Why implement `/get_catalog` instead of sending an SQL query from the console to the compute?
+
+- So far, we do not allow remote superuser access to Postgres, and exposing only endpoints with fixed queries beneath them reduces the attack surface.