From 1ad48b2eaf643ace40ed523c00044622ba4e7f59 Mon Sep 17 00:00:00 2001
From: John Spray
Date: Mon, 14 Apr 2025 11:57:17 +0100
Subject: [PATCH] docs/rfcs: add storage encryption key RFC

---
 docs/rfcs/2025-04-14-storage-keys.md | 244 +++++++++++++++++++++++++++
 1 file changed, 244 insertions(+)
 create mode 100644 docs/rfcs/2025-04-14-storage-keys.md

diff --git a/docs/rfcs/2025-04-14-storage-keys.md b/docs/rfcs/2025-04-14-storage-keys.md
new file mode 100644
index 0000000000..32494e03e3
--- /dev/null
+++ b/docs/rfcs/2025-04-14-storage-keys.md
@@ -0,0 +1,244 @@

# Storage Encryption Key Management

## Summary

As a precursor to adding new encryption capabilities to Neon's storage services, this RFC proposes
mechanisms for creating and storing fine-grained encryption keys for user data in Neon. We aim
to provide at least tenant granularity, but will use timeline granularity when it is simpler to do so.

Out of scope:
- We describe the lifecycle of keys here, but not the encryption of user data with these keys.
- We describe an abstract KMS interface, but not particular platform implementations (such as how
  to authenticate with a KMS).

## Terminology

_wrapped/unwrapped_: a wrapped encryption key is a key encrypted by another key. For example, the key for
encrypting a timeline's pageserver data might be wrapped by some "root" key for the tenant's user account,
stored in a KMS system.

_key hierarchy_: the relationships between keys which wrap each other. For example, a layer file key might
be wrapped by a pageserver timeline key, which is in turn wrapped by a tenant's root key.

## Design Choices

Storage: S3 will be the store of record for wrapped keys.

Separate keys: Safekeeper and Pageserver will use independent keys.

AES256: rather than building a generic system for keys, we will assume that all the keys
we manage are AES256 keys; this is the de facto standard for enterprise data storage.

Per-object keys: rather than encrypting data objects (layer files and segment files) with
the tenant keys directly, they will be encrypted with separate keys. This avoids cryptographic
safety issues from re-using the same key for large quantities of potentially repetitive plaintext.

Key storage is optional at a per-tenant granularity: eventually this would be on by default, but:
- Initially, only some environments will have a KMS set up.
- Encryption has some overhead, and some tenants may not want or need it.

## Design

### Summary of format changes

- Pageserver layer files and safekeeper segment objects get new metadata fields to
  store the wrapped key and the version of the wrapping key
- Pageserver timeline index gets a new `keys` field to store timeline keys
- Safekeeper gets a new per-timeline manifest object in S3 to store timeline keys
- Pageserver timeline index gets per-layer metadata for the wrapped key and wrapping version

### Summary of API changes

- Pageserver TenantConf API gets a new field for account ID
- Pageserver TenantConf API gets a new field for encryption mode
- Safekeeper timeline creation API gets a new field for account ID
- Controller, pageserver & safekeeper get a new timeline-scoped `rotate_key` API

### KMS interface

Neon will interoperate with different KMS APIs on different platforms. We will implement a generic interface,
similar to how `remote_storage` wraps different object storage APIs:
- `generate(accountId, keyType, alias) -> (wrapped key, plaintext key)`
- `unwrap(accountId, ciphertext key) -> plaintext key`

Hereafter, when we talk about generating or unwrapping a key, this means a call into the KMS API.

The KMS deals with abstract "account IDs", which are not equal to tenant IDs and may not be
1:1 with tenants. The account ID will be provided as part of tenant configuration, along
with a field to identify an encryption mode.
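As a minimal sketch of what this interface might look like in Rust, in the style of `remote_storage`'s
backend abstraction; all trait and type names here are illustrative, not an implemented API:

```rust
use anyhow::Result;
use async_trait::async_trait;

/// All keys we manage are AES256, i.e. 32 bytes of key material.
pub struct PlaintextKey(pub [u8; 32]);
/// A key encrypted by the account's root key; opaque to us, safe to persist.
pub struct WrappedKey(pub Vec<u8>);

pub enum KeyType {
    Aes256,
}

/// Platform-independent KMS abstraction. Concrete implementations (and how
/// they authenticate with the KMS) are out of scope for this RFC.
#[async_trait]
pub trait Kms: Send + Sync {
    /// Generate a new key under the given account's root key, returning both
    /// the wrapped form (to persist in S3) and the plaintext form (for
    /// immediate use, e.g. wrapping per-object keys).
    async fn generate(
        &self,
        account_id: &str,
        key_type: KeyType,
        alias: &str,
    ) -> Result<(WrappedKey, PlaintextKey)>;

    /// Unwrap a previously generated key using the account's root key.
    async fn unwrap(&self, account_id: &str, wrapped: &WrappedKey) -> Result<PlaintextKey>;
}
```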
### Pageserver key storage

The wrapped pageserver timeline key will be stored in the timeline index object. Because of
key rotation, multiple keys will be stored in an array, with each key carrying a counter version.

```
"keys": [
    {
        # The key version: a new key with the next version is generated when rekeying
        "version": 1,
        # The wrapped key: this is unwrapped by a KMS API call when the key is to be used
        "wrapped": "<wrapped key, base64>",
        # The time the key was generated: this may be used to implement rekeying/key rotation
        # policies.
        "ctime": "<timestamp>",
    },
    ...
]
```

Wrapped pageserver layer file keys will be stored in the `index_part` file, as part
of the layer metadata.

```
# LayerFileMetadata
{
    "key": {
        # The version of the timeline key that wraps this layer's key
        "version": <wrapping key version>,
        # The wrapped per-layer key
        "wrapped": "<wrapped key, base64>"
    }
}
```

To enable the re-key procedure to drop deleted versions that use old keys, and to avoid mistakes
in index_part leading to irretrievable data loss, the wrapped key & version will also be stored
in the object store metadata of uploaded objects.
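In the pageserver, these fields might be modelled roughly as follows. This is a sketch assuming
serde and chrono (with its `serde` feature); names are illustrative rather than final:

```rust
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};

/// One entry in the timeline's `keys` array in `index_part`.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct TimelineKey {
    /// Counter: a new key with the next version is generated when rekeying.
    pub version: u64,
    /// The wrapped key as returned by the KMS, base64-encoded in JSON.
    pub wrapped: String,
    /// Generation time, used to drive rotation/re-keying policies.
    pub ctime: DateTime<Utc>,
}

/// Per-layer addition to `LayerFileMetadata`: the wrapped per-layer key and
/// the version of the timeline key that wraps it.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct LayerKey {
    pub version: u64,
    pub wrapped: String,
}
```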
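The duplication into object store metadata could look like this at upload time. This sketch uses
the AWS SDK directly for brevity; in practice the call would go through `remote_storage`, and the
metadata key names are hypothetical:

```rust
use aws_sdk_s3::primitives::ByteStream;
use aws_sdk_s3::Client;

/// Attach the wrapped key and wrapping-key version to an uploaded layer as S3
/// object metadata, so that a scrub of object versions can identify
/// stale-keyed objects without consulting index_part.
async fn upload_layer_with_key_metadata(
    client: &Client,
    bucket: &str,
    object_key: &str,
    body: ByteStream,
    wrapped_key_b64: &str,
    wrapping_version: u64,
) -> anyhow::Result<()> {
    client
        .put_object()
        .bucket(bucket)
        .key(object_key)
        .metadata("encryption-wrapped-key", wrapped_key_b64)
        .metadata("encryption-key-version", wrapping_version.to_string())
        .body(body)
        .send()
        .await?;
    Ok(())
}
```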
### Safekeeper key storage

All safekeeper storage is per-timeline. The only concept of a tenant in the safekeeper
is as a namespace for timelines.

As the safekeeper doesn't currently have a flexible metadata object in remote storage,
we will add one. This will initially contain:
- A configuration object that contains the accountId
- An array of keys identical to those used in the pageserver's index.

Because multiple safekeeper processes share the same remote storage path, we must be
sure to handle write races safely. To avoid giving safekeepers a pageserver-like generation
concept (not to be confused with the safekeeper's configuration generation), we may use
the conditional write primitive available on S3 and ABS to implement a safe
read-then-write for operations such as key rotation, such that a given key version is
only ever created once.

### Key rotation

The process of key rotation is:
1. Load the version of the existing key
2. Generate a new key
3. Store the new key with the previous version incremented by 1
4. **Only once durably stored**, use the new key for subsequent generation of object keys

This is the same for safekeepers and pageservers (a sketch follows at the end of this section).

A storage controller API will be exposed for re-keying.

For the pageserver, it is very important that re-key
operations respect generation safety rules, the same as timeline CRUD operations: i.e.
the operation is only durable if the generation of the updated tenant location is still
the latest generation when the operation completes.

For the safekeeper, it is very important that ... **TODO**: rules on racing key updates
with configuration changes?
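Reusing the illustrative types from the sketches above, the rotation procedure might look like
the following. The persistence step is left to the caller: it stands in for a generation-checked
index upload on the pageserver, or a conditional PUT on the safekeeper:

```rust
use base64::engine::general_purpose::STANDARD as BASE64;
use base64::Engine as _;

/// Sketch of key rotation for one timeline. `keys` is the array stored in the
/// pageserver's index_part or the safekeeper's manifest.
async fn rotate_key(
    kms: &dyn Kms,
    account_id: &str,
    keys: &mut Vec<TimelineKey>,
) -> anyhow::Result<u64> {
    // 1. Load the version of the existing key (0 if the timeline has none yet).
    let current = keys.iter().map(|k| k.version).max().unwrap_or(0);

    // 2. Generate a new key via the KMS.
    let (wrapped, _plaintext) = kms
        .generate(account_id, KeyType::Aes256, "timeline-key")
        .await?;

    // 3. Store the new key with the previous version incremented by 1.
    let new_version = current + 1;
    keys.push(TimelineKey {
        version: new_version,
        wrapped: BASE64.encode(&wrapped.0),
        ctime: chrono::Utc::now(),
    });

    // 4. The caller must durably persist `keys` (generation-checked on the
    //    pageserver, conditional write on the safekeeper) before the new
    //    version may be used to wrap per-object keys.
    Ok(new_version)
}
```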
### Re-keying

While re-keying and key rotation are sometimes used synonymously, we distinguish them:
- Key rotation is generating a new key to use for new data
- Re-keying is rewriting existing data so that old keys are no longer used at all

Re-keying is a bulk data operation, and is not fully defined in this RFC: it can be defined
quite simply as "for each object, if its key version is below the re-keying horizon,
then do a read/write cycle on the object using the latest key". This is a simple but potentially
very expensive operation, so we discuss efficiency here.

#### Pageserver re-key

For pageservers, occasional re-keying may be implemented efficiently if one tolerates using
the last few keys and doesn't insist on the latest, because pageservers periodically rewrite
their data for GC-compaction anyway. Thus an API call to re-key any data with an overly old
key would often be a no-op, because all data was rewritten recently in the normal course of
compaction.

When object versioning is enabled in storage, re-keying is not fully accomplished by just
re-writing live data: old versions would still contain user data encrypted with older keys. To
fully re-key, an extra step is needed to purge old objects. Ideally, we should only purge
old objects which were encrypted using old keys. To this end, it is useful to store
the encryption key version as metadata on objects, so that a scrub of deleted object versions
can efficiently select those objects that should be purged during re-key.

Checks on object versions should not be limited to deleted objects: because the pageserver can
emit "orphan" objects not referenced in the index under some circumstances, re-key must also
check non-deleted objects.

To summarize, the pageserver re-key operation is:
- Iterate over the index of layer files, select those with a too-old key and rewrite them
- Iterate over all versions in object storage, select those with a too-old key version
  in their metadata and purge them (with a safety check that these are not referenced
  by the latest index).

It would be wise to combine the re-key procedure with an exhaustive read of a timeline's data,
to ensure that when testing & rolling this feature out we are not rendering anything unreadable
due to bugs in the implementation. Since we are deleting old versions in object storage, our
time travel recovery tool will not be any help if we get something wrong in this process.

#### Safekeeper re-key

Re-keying a safekeeper timeline requires an exhaustive walk of segment objects, reading the
metadata on each one and deciding whether it requires a rewrite.

The safekeeper currently keeps historic objects forever, so re-keying this data will get
more expensive as time goes on. This would be a good time to add cleanup of old safekeeper
segments, but doing so is beyond the scope of this RFC.

### Enabling encryption for existing tenants

To enable encryption for an existing tenant, we may simply call the key-rotation API (to
generate a key), and then the re-key API (to rewrite existing data using this key).

## Observability

- To enable some external service to implement re-keying, we should publish per-timeline metrics
  on the age of the latest encryption key.
- Calls to the KMS should be tracked with typical request rate/result/latency histograms, to
  enable detection of a slow KMS server and/or errors.

## Alternatives considered

### Use the same tenant key for safekeeper and pageserver

We could halve the number of keys in circulation by having the safekeeper and pageserver
share a key rather than working independently.

However, this would be substantially more complex to implement, as safekeepers and pageservers
currently share no storage, so some new communication path would be needed. There is minimal
upside in sharing a key.

### No KMS dependency

We could choose to do all key management ourselves. However, the industry-standard approach
to letting users of cloud SaaS software self-manage keys is to use the KMS as the intermediary
between our system and the user's control of their key. Although this RFC does not propose
user-managed keys, we should design with this in mind.

### Do all key generation/wrapping in the KMS service

We could avoid generating and wrapping/unwrapping object keys in our storage
services by delegating all responsibility for key operations to the KMS. However,
KMS services have limited throughput and in some cases may charge per operation, so
it is useful to avoid doing KMS operations per object, restricting them to per-timeline
frequency.

### Per-tenant instead of per-timeline pageserver keys

For tenants with many timelines, we could reduce load on the KMS service by
using per-tenant instead of per-timeline keys, so that we could do operations
such as creating a timeline without needing a KMS unwrap operation.

However, per-timeline key management is much simpler to implement on the safekeeper,
which currently has no concept of a tenant (other than as a namespace for timelines).
It is also slightly simpler to implement on the pageserver, as it avoids implementing
a tenant-scoped creation operation to initialize keys (instead, we may initialize keys
during timeline creation).

As a side benefit, per-timeline key management also enables implementing secure deletion
at a per-timeline granularity in the future.