docs/rfcs: add storage encryption key RFC

John Spray
2025-04-14 11:57:17 +01:00
parent a338984dc7
commit 1ad48b2eaf

# Storage Encryption Key Management
## Summary
As a precursor to adding new encryption capabilities to Neon's storage services, this RFC proposes
mechanisms for creating and storing fine-grained encryption keys for user data in Neon. We aim
to provide at least tenant granularity, but will use timeline granularity when it is simpler to do
so.
Out of scope:
- We describe the lifecycle of keys here, but not the encryption of user data with these keys.
- We describe an abstract KMS interface, but not particular platform implementations (such as how
to authenticate with KMS).
## Terminology
_wrapped/unwrapped_: a wrapped encryption key is a key encrypted by another key. For example, the key for
encrypting a timeline's pageserver data might be wrapped by some "root" key for the tenant's user account, stored in a KMS.

_key hierarchy_: the relationships between keys which wrap each other. For example, a layer file key might
be wrapped by a pageserver timeline key, which is wrapped by a tenant's root key.
## Design Choices
- **Storage**: S3 will be the store of record for wrapped keys.
- **Separate keys**: Safekeeper and Pageserver will use independent keys.
- **AES256**: rather than building a generic system for keys, we will assume that all the keys
  we manage are AES256 keys -- this is the de-facto standard for enterprise data storage.
- **Per-object keys**: rather than encrypting data objects (layer files and segment files) with
  the tenant keys directly, they will be encrypted with separate keys. This avoids cryptographic
  safety issues from re-using the same key for large quantities of potentially repetitive plaintext.
  (A sketch follows this list.)
- **Key storage is optional at per-tenant granularity**: eventually this would be on by default, but:
  - initially, only some environments will have a KMS set up.
  - encryption has some overhead, and some tenants may not want or need it.
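
To make the per-object key choice concrete, here is a minimal sketch of envelope encryption using the RustCrypto `aes-gcm` crate. The crate choice and function shape are assumptions for illustration, not part of this proposal.

```rust
use aes_gcm::{
    aead::{Aead, AeadCore, KeyInit, OsRng},
    Aes256Gcm, Key,
};

/// Sketch of the per-object key scheme (envelope encryption): each object
/// gets a fresh AES256 key, which is itself wrapped (encrypted) by the
/// timeline key. Illustrative only; the real design wraps the timeline
/// key via the KMS as described below.
fn encrypt_object(
    timeline_key: &Key<Aes256Gcm>,
    plaintext: &[u8],
) -> Result<(Vec<u8>, Vec<u8>), aes_gcm::Error> {
    // A fresh per-object key, never reused: repetitive plaintext across
    // many objects never meets the same key.
    let object_key = Aes256Gcm::generate_key(OsRng);

    // Encrypt the object body with the per-object key (nonce prepended).
    let cipher = Aes256Gcm::new(&object_key);
    let nonce = Aes256Gcm::generate_nonce(&mut OsRng);
    let mut ciphertext = nonce.as_slice().to_vec();
    ciphertext.extend(cipher.encrypt(&nonce, plaintext)?);

    // Wrap the per-object key under the timeline key, so the wrapped key
    // can be stored alongside the object's metadata.
    let wrapper = Aes256Gcm::new(timeline_key);
    let wrap_nonce = Aes256Gcm::generate_nonce(&mut OsRng);
    let mut wrapped_key = wrap_nonce.as_slice().to_vec();
    wrapped_key.extend(wrapper.encrypt(&wrap_nonce, object_key.as_slice())?);

    Ok((ciphertext, wrapped_key))
}
```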
## Design
### Summary of format changes
- Pageserver layer files and safekeeper segment objects get new metadata fields to
store wrapped key and version of the wrapping key
- Pageserver timeline index gets a new `keys` field to store timeline keys
- Safekeeper gets a new per-timeline manifest object in S3 to store timeline keys
- Pageserver timeline index gets per-layer metadata for wrapped key and wrapping version
### Summary of API changes
- Pageserver TenantConf API gets a new field for account ID
- Pageserver TenantConf API gets a new field for encryption mode
- Safekeeper timeline creation API gets a new field for account ID
- Controller, pageserver & safekeeper get a new timeline-scoped `rotate_key` API
### KMS interface
Neon will interoperate with different KMS APIs on different platforms. We will implement a generic interface,
similar to how `remote_storage` wraps different object storage APIs:
- `generate(accountId, keyType, alias) -> wrapped key`
- `unwrap(accountId, ciphertext key) -> plaintext key`

Hereafter, when we talk about generating or unwrapping a key, this means a call into the KMS API.

The KMS deals with abstract "account IDs", which are not equal to tenant IDs and may not be
1:1 with tenants. The account ID will be provided as part of tenant configuration, along
with a field identifying an encryption mode.
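
As a sketch, the generic interface might look like the following Rust trait. All names here are illustrative, not a final API.

```rust
/// Sketch of a platform-agnostic KMS interface, analogous to how
/// `remote_storage` abstracts over object storage backends.
#[async_trait::async_trait]
pub trait Kms: Send + Sync {
    /// Ask the KMS to generate a new key under the given account's root
    /// key, returning it in wrapped (encrypted) form for storage.
    async fn generate(
        &self,
        account_id: &str,
        key_type: KeyType,
        alias: &str,
    ) -> Result<WrappedKey, KmsError>;

    /// Ask the KMS to unwrap a previously generated key so it can be used
    /// for encryption/decryption. The plaintext key only ever lives in memory.
    async fn unwrap(
        &self,
        account_id: &str,
        wrapped: &WrappedKey,
    ) -> Result<PlaintextKey, KmsError>;
}

pub enum KeyType {
    Aes256,
}

/// A key encrypted by the account's root key; safe to persist.
pub struct WrappedKey(pub Vec<u8>);

/// An unwrapped AES256 key; must never be persisted.
pub struct PlaintextKey(pub [u8; 32]);

pub enum KmsError {
    Unavailable(String),
    Denied(String),
}
```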
### Pageserver key storage
The wrapped pageserver timeline key will be stored in the timeline index object. Because of
key rotation, multiple keys may exist at once: they will be stored in an array, each entry carrying a
monotonically increasing version.
```
"keys": [
{
# The key version: a new key with the next version is generated when rekeying
"version": 1,
# The wrapped key: this is unwrapped by a KMS API call when the key is to be used
"wrapped": "<base64 string>",
# The time the key was generated: this may be used to implement rekeying/key rotation
# policies.
"ctime": "<ISO 8601 timestamp>",
},
...
]
```
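
As an illustration, these entries might map to a serde type like the following (names illustrative, not the actual `index_part` code):

```rust
use serde::{Deserialize, Serialize};

/// Sketch of one entry of the `keys` array above as a serde type.
#[derive(Serialize, Deserialize)]
pub struct TimelineKey {
    /// Monotonic version: rekeying appends a new entry with version + 1.
    pub version: u32,
    /// The wrapped key, base64-encoded; unwrapped via a KMS call on use.
    pub wrapped: String,
    /// ISO 8601 generation time, used to drive rotation policies.
    pub ctime: String,
}
```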
Wrapped pageserver layer file keys will be stored in the `index_part` file, as part
of the layer metadata.
```
# LayerFileMetadata
{
    "key": {
        # The layer's wrapped key
        "wrapped": "<base64 string>",
        # The version of the timeline key that wraps it
        "version": <integer>
    }
}
```
To enable the re-key procedure to drop deleted versions with old keys, and to avoid mistakes in index_part leading to irretrievable data loss, wrapped keys & versions will also be stored
in the object store metadata of uploaded objects.
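
As a sketch of that upload path, assuming the AWS Rust SDK and illustrative metadata key names:

```rust
use aws_sdk_s3::{primitives::ByteStream, Client};

/// Sketch: upload a layer file with its wrapped key and the wrapping key's
/// version recorded as object metadata, so that a scrubber can identify
/// stale-key objects without consulting index_part.
async fn upload_layer(
    client: &Client,
    bucket: &str,
    key: &str,
    body: Vec<u8>,
    wrapped_key_b64: &str,
    wrapping_version: u32,
) -> Result<(), aws_sdk_s3::Error> {
    client
        .put_object()
        .bucket(bucket)
        .key(key)
        // Illustrative metadata key names, not a finalized schema.
        .metadata("encryption-key-wrapped", wrapped_key_b64)
        .metadata("encryption-key-version", wrapping_version.to_string())
        .body(ByteStream::from(body))
        .send()
        .await?;
    Ok(())
}
```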
### Safekeeper key storage
All safekeeper storage is per-timeline. The only concept of a tenant in the safekeeper
is as a namespace for timelines.
As the safekeeper doesn't currently have a flexible metadata object in remote storage,
we will add one. This will initially contain:
- A configuration object that contains the accountId
- An array of keys identical to those used in the pageserver's index.

Because multiple safekeeper processes share the same remote storage path, we must be
sure to handle write races safely. To avoid giving safekeepers a pageserver-like generation
concept (not to be confused with the safekeeper's configuration generation), we may use
the conditional write primitive available on S3 and ABS to implement a safe
read-then-write for operations such as key rotation, such that a given key version is
only ever written once.
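
A sketch of such a conditional read-then-write loop, assuming a recent AWS Rust SDK that exposes S3's `If-Match` precondition on `PutObject` (the error classification helper is hypothetical):

```rust
use aws_sdk_s3::{error::SdkError, primitives::ByteStream, Client};

/// Sketch of a compare-and-swap update of the safekeeper manifest using
/// S3 conditional writes: a racing safekeeper's put fails its precondition
/// and must retry against the new manifest contents, so a given key
/// version can only be written once.
async fn cas_update_manifest(
    client: &Client,
    bucket: &str,
    key: &str,
    update: impl Fn(&[u8]) -> Vec<u8>,
) -> anyhow::Result<()> {
    loop {
        // Read the current manifest and remember its ETag.
        let got = client.get_object().bucket(bucket).key(key).send().await?;
        let etag = got.e_tag().unwrap_or_default().to_string();
        let body = got.body.collect().await?.into_bytes();

        let new_body = update(&body);

        // Write back only if the manifest is unchanged since we read it.
        match client
            .put_object()
            .bucket(bucket)
            .key(key)
            .if_match(etag)
            .body(ByteStream::from(new_body))
            .send()
            .await
        {
            Ok(_) => return Ok(()),
            Err(err) if is_precondition_failed(&err) => continue, // lost the race
            Err(err) => return Err(err.into()),
        }
    }
}

/// Hypothetical classifier: true iff the error is S3's HTTP 412
/// PreconditionFailed response (detailed matching elided in this sketch).
fn is_precondition_failed<E, R>(_err: &SdkError<E, R>) -> bool {
    false // real code would inspect the raw HTTP status
}
```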
### Key rotation
The process of key rotation is:
1. Load the version of the existing key
2. Generate a new key
3. Store the new key with a version one greater than the previous version
4. **Only once durably stored**, use the new key for subsequent generation of object keys

This is the same for safekeepers and pageservers; a sketch follows below.
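
A minimal sketch of this sequence, reusing the illustrative `Kms` and `TimelineKey` types from earlier sections. Durably persisting the updated array (step 4's precondition) is left to the caller.

```rust
use base64::Engine as _;

/// Sketch of key rotation against the `keys` array described earlier.
async fn rotate_key(
    kms: &dyn Kms,
    account_id: &str,
    keys: &mut Vec<TimelineKey>,
) -> Result<(), KmsError> {
    // 1. Load the version of the existing (latest) key.
    let current = keys.iter().map(|k| k.version).max().unwrap_or(0);

    // 2. Generate a new key via the KMS; we only ever see it wrapped here.
    let wrapped = kms
        .generate(account_id, KeyType::Aes256, "timeline-key")
        .await?;

    // 3. Append it with the version incremented by one. The caller must
    //    durably persist this (index upload / conditional manifest write)
    //    before step 4: only then may the new key encrypt new objects.
    keys.push(TimelineKey {
        version: current + 1,
        wrapped: base64::engine::general_purpose::STANDARD.encode(&wrapped.0),
        ctime: chrono::Utc::now().to_rfc3339(),
    });
    Ok(())
}
```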
A storage controller API will be exposed for re-keying.

For the pageserver, it is very important that re-key
operations respect generation safety rules, the same as timeline CRUD operations: i.e.
the operation is only durable if the generation of the updated tenant location is still
the latest generation when the operation completes.

For the safekeeper, it is very important that ... **TODO** rules on racing key updates
with configuration changes?
### Re-keying
While re-keying and key-rotation are sometimes used synonymously, we distinguish them:
- Key rotation is generating a new key to use for new data
- Re-keying is rewriting existing data so that old keys are no longer used at all
Re-keying is a bulk data operation, and is not fully specified in this RFC. It can be described
quite simply as "for each object, if the object's key version is below the rekeying horizon,
do a read/write cycle on the object using the latest key". This is a simple but potentially very
expensive operation, so we discuss efficiency here.
#### Pageserver re-key
For pageservers, occasional rekeying may be implemented efficiently if one tolerates data encrypted with
the last few keys and doesn't insist on the latest, because pageservers periodically rewrite
their data for GC-compaction anyway. An API call to re-key any data with an overly old
key would therefore often be a no-op, because all data was rewritten recently.
When object versioning is enabled in storage, re-keying is not fully accomplished by just
re-writing live data: old versions would still contain user data encrypted with older keys. To
fully re-key, an extra step is needed to purge old objects. Ideally, we should only purge
old objects which were encrypted using old keys. To this end, it would be useful to store
the encryption key version as metadata on objects, so that a scrub of deleted object versions
can efficiently select those objects that should be purged during re-key.
Checks on object versions should not be limited to deleted objects: because the pageserver can emit
"orphan" objects not referenced in the index under some circumstances, re-key must also
check non-deleted objects.
To summarize, the pageserver re-key operation is:
- Iterate over index of layer files, select those with too-old key and rewrite them
- Iterate over all versions in object storage, select those with a too-old key version
in their metadata and purge them (with a safety check that these are not referenced
by the latest index).
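
A sketch of the index-driven first step, with illustrative stand-ins for the real index types and a hypothetical `rewrite_layer` helper:

```rust
use std::collections::HashMap;

/// Illustrative stand-ins for the real pageserver index types.
struct KeyMeta {
    version: u32,
}
struct LayerFileMetadata {
    key: KeyMeta,
}
struct IndexPart {
    layer_metadata: HashMap<String, LayerFileMetadata>,
}

/// Sketch: rewrite any layer whose wrapping key version is older than
/// the rekeying horizon.
async fn rekey_from_index(index: &IndexPart, horizon: u32) -> anyhow::Result<()> {
    for (name, meta) in &index.layer_metadata {
        if meta.key.version < horizon {
            // Often a no-op in practice: GC-compaction may already have
            // rewritten this layer under a recent key.
            rewrite_layer(name, meta).await?;
        }
    }
    Ok(())
}

/// Hypothetical helper: read the layer, re-encrypt under a fresh
/// per-object key wrapped by the latest timeline key, and re-upload.
async fn rewrite_layer(_name: &str, _meta: &LayerFileMetadata) -> anyhow::Result<()> {
    todo!("read, re-encrypt with latest key, upload")
}
```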
It would be wise to combine the re-key procedure with an exhaustive read of a timeline's data,
to ensure that when testing & rolling this feature out we are not rendering anything unreadable
due to bugs in implementation. Since we are deleting old versions in object storage, our
time travel recovery tool will not be any help if we get something wrong in this process.
#### Safekeeper re-key
Re-keying a safekeeper timeline requires an exhaustive walk of the segment objects, reading the
metadata on each one and deciding whether it requires a rewrite.
Safekeeper currently keeps historic objects forever, so re-keying this data will get
more expensive as time goes on. This would be a good time to add cleanup of old safekeeper
segments, but doing so is beyond the scope of this RFC.
### Enabling encryption for existing tenants
To enable encryption for an existing tenant, we may simply call the key-rotation API (to generate a key),
and then the re-key API (to rewrite existing data using this key).
## Observability
- To enable an external service to implement re-keying, we should publish per-timeline metrics
on the age of each timeline's latest encryption key.
- Calls to KMS should be tracked with typical request rate/result/latency histograms to enable
detection of a slow KMS server and/or errors.
## Alternatives considered
### Use same tenant key for safekeeper and pageserver
We could halve the number of keys in circulation by having the safekeeper and pageserver
share a key rather than working independently.
However, this would be substantially more complex to implement: safekeepers and pageservers
currently share no storage, so some new communication path would be needed. There is minimal
upside in sharing a key.
### No KMS dependency
We could choose to do all key management ourselves. However, the industry standard approach
to enabling users of cloud SaaS software to self-manage keys is to use the KMS as the intermediary
between our system and the user's control of their key. Although this RFC does not propose user-managed keys, we should design with this in mind.
### Do all key generation/wrapping in KMS service
We could avoid generating and wrapping/unwrapping object keys in our storage
services by delegating all responsibility for key operations to the KMS. However,
KMS services have limited throughput and in some cases may charge per operation, so
it is useful to avoid doing KMS operations per-object, and restrict them to per-timeline
frequency.
### Per-tenant instead of per-timeline pageserver keys
For tenants with many timelines, we may reduce load on KMS service by
using per-tenant instead of per-timeline keys, so that we may do operations
such as creating a timeline without needing to do a KMS unwrap operation.
However, per-timeline key management is much simpler to implement on the safekeeper,
which currently has no concept of a tenant (other than as a namespace for timelines).
It is also slightly simpler to implement on the pageserver, as it avoids implementing
a tenant-scoped creation operation to initialize keys (instead, we may initialize keys
during timeline creation).
As a side benefit, per-timeline key management also enables implementing secure deletion at
per-timeline granularity in the future.