# Memo: Endpoint Persistent Unlogged Files Storage
Created on 2024-11-05
Implemented on N/A
## Summary
A design for a storage system that stores the files Neon's Endpoints need for a better experience at or after a reboot.
## Motivation
Several systems inside PostgreSQL (and Neon) need some persistent storage for optimal behavior across reboots and restarts, but still work without it. Examples are the cumulative statistics file in `pg_stat/global.stat`, the query-level statistics file of `pg_stat_statements` in `pg_stat/pg_stat_statements.stat`, and `pg_prewarm`'s `autoprewarm.blocks`.

We need a storage system that can store and manage these files for each Endpoint, without necessarily granting users access to an unlimited storage device.
## Goals
- Store known files for Endpoints with reasonable persistence. Data loss in this service, while annoying and bad for UX, won't lose any customer's data.
## Non Goals (if relevant)
- This storage system does not need branching, file versioning, or other such features. The files are as ephemeral to the timeline of the data as the Endpoints that host the data.
- This storage system does not need to store all user files, only 'known' user files.
- This storage system does not need to be hosted fully inside Computes. Instead, this will be a separate component similar to Pageserver, SafeKeeper, the S3 proxy used for dynamically loaded extensions, etc.
## Impacted components
- Compute needs new code to load and store these files in its lifetime.
- Control Plane needs to consider this new storage system when signalling the deletion of an Endpoint, Timeline, or Tenant.
- Control Plane needs to consider this new storage system when it resets or re-assigns an endpoint's timeline/branch state.
A new service is created: the Endpoint Persistent Unlogged Files Storage (EPUFS) service. This could be integrated into an existing service such as Pageserver or Control Plane, or hosted separately.
## Proposed implementation
Endpoint-related data files are managed by a newly designed service (optionally integrated into an existing service like Pageserver or Control Plane), which stores data directly in S3 or any blob storage of choice.
Upon deletion of the Endpoint, or reassignment of the Endpoint to a different branch, this ephemeral data is dropped: after reassignment the stored data may not match the state of the branch's data, and after Endpoint deletion the data is of no further use to the user.
Compute gets credentials (a JWT token with Tenant, Timeline & Endpoint claims) which it can use to authenticate to this new service, and to retrieve and store the data associated with its endpoint. This limited scope reduces data leaks across endpoints and timeline resets, and limits the ability of endpoints to mess with other endpoints' data.
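For illustration, a minimal sketch of how such a scoped token could be minted. The claim names (`tenant_id`, `timeline_id`, `endpoint_id`) and HS256 signing are assumptions for the sketch; the actual claim layout is whatever Compute's JWT infrastructure defines.

```python
# Hypothetical sketch: minting a compute-scoped EPUFS token.
# Claim names are illustrative, not Neon's actual claim names.
import time
import jwt  # pyjwt

def mint_epufs_token(signing_key: str, tenant_id: str, timeline_id: str,
                     endpoint_id: str, ttl_seconds: int = 3600) -> str:
    now = int(time.time())
    claims = {
        "tenant_id": tenant_id,      # scopes the token to one tenant...
        "timeline_id": timeline_id,  # ...one timeline...
        "endpoint_id": endpoint_id,  # ...and one endpoint
        "iat": now,
        "exp": now + ttl_seconds,    # short-lived, to limit the impact of leaks
    }
    return jwt.encode(claims, signing_key, algorithm="HS256")
```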
The path of this endpoint data in S3 is initially as follows:

```
s3://<regional-epufs-bucket>/
    tenants/
        <hex-tenant-id>/
            timelines/
                <hex-timeline-id>/
                    endpoints/
                        <endpoint-id>/
                            pgdata/
                                <file_path_in_pgdatadir>
```

For other blob storages an equivalent or similar path can be constructed.
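As a sketch of the layout above, a hypothetical helper that maps the three IDs plus a pgdata-relative file path onto an object key (the function name and validation are illustrative):

```python
def epufs_object_key(tenant_id: str, timeline_id: str, endpoint_id: str,
                     pgdata_path: str) -> str:
    """Build the object key following the layout above.

    `pgdata_path` is the file's path relative to the PGDATA directory,
    e.g. "pg_stat/pg_stat_statements.stat".
    """
    # Reject path traversal early: the path must stay inside pgdata/.
    assert not pgdata_path.startswith("/")
    assert ".." not in pgdata_path.split("/")
    return (f"tenants/{tenant_id}/timelines/{timeline_id}/"
            f"endpoints/{endpoint_id}/pgdata/{pgdata_path}")
```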
## Reliability, failure modes and corner cases (if relevant)
Reliability is important, but not critical to the workings of Neon. When lost, the data stored in this service will reduce performance, but won't cause permanent data loss: only operational metadata is stored.

Most, if not all, blob storage services have sufficiently high persistence guarantees to cover our needs for persistence and uptime. The only concern with blob storages is that their access latency is generally higher than local disk's, but for the object types stored (cache state, ...) I don't think this will be much of an issue.
## Interaction/Sequence diagram (if relevant)
In these diagrams you can substitute any persistent storage of choice for S3; S3 is chosen as the representative name because it is the well-known, short name of AWS's blob storage. Azure Blob Storage should work too, but its much longer name is less practical for the diagrams.
Write data:

```
POST /tenants/<tenant-id>/timelines/<tl-id>/endpoints/<endpoint-id>/pgdata/<the-pgdata-path>
Host: epufs.svc.neon.local
<<<
200 OK
{
    "version": "<opaque>", # opaque file version token, changes when the file contents change
    "size": <bytes>
}
```
```mermaid
sequenceDiagram
    autonumber
    participant co as Compute
    participant ep as EPUFS
    participant s3 as Blob Storage
    co-->ep: Connect with credentials
    co->>+ep: Store Unlogged Persistent File
    opt is authenticated
        ep->>s3: Write UPF to S3
    end
    ep->>-co: OK / Failure / Auth Failure
    co-->ep: Cancel connection
```
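A minimal compute-side sketch of this store call, assuming bearer-token authentication and using the illustrative `epufs.svc.neon.local` host from the example above (the `store_upf` helper name is hypothetical):

```python
import requests

EPUFS_BASE = "https://epufs.svc.neon.local"  # illustrative host, as in the example

def store_upf(token: str, tenant_id: str, timeline_id: str, endpoint_id: str,
              pgdata_path: str, data: bytes) -> dict:
    """POST one unlogged persistent file; returns {"version": ..., "size": ...}."""
    url = (f"{EPUFS_BASE}/tenants/{tenant_id}/timelines/{timeline_id}"
           f"/endpoints/{endpoint_id}/pgdata/{pgdata_path}")
    resp = requests.post(url, data=data,
                         headers={"Authorization": f"Bearer {token}"},
                         timeout=30)
    resp.raise_for_status()  # surfaces auth failures / storage errors
    return resp.json()
```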
Read data (optionally with cache-relevant request headers, e.g. If-Modified-Since):

```
GET /tenants/<tenant-id>/timelines/<tl-id>/endpoints/<endpoint-id>/pgdata/<the-pgdata-path>
Host: epufs.svc.neon.local
<<<
200 OK
<file data>
```
```mermaid
sequenceDiagram
    autonumber
    participant co as Compute
    participant ep as EPUFS
    participant s3 as Blob Storage
    co->>+ep: Read Unlogged Persistent File
    opt is authenticated
        ep->>+s3: Request UPF from storage
        s3->>-ep: Receive UPF from storage
    end
    ep->>-co: OK(response) / Failure(storage, auth, ...)
```
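Correspondingly, a sketch of a conditional read using `If-Modified-Since`, where a 304 response means the locally cached copy is still current (the helper name is hypothetical):

```python
import requests

def read_upf(token: str, url: str, last_modified: str | None = None) -> bytes | None:
    """GET one UPF; returns None if our local copy is still fresh (304)."""
    headers = {"Authorization": f"Bearer {token}"}
    if last_modified is not None:
        headers["If-Modified-Since"] = last_modified  # cache-relevant header
    resp = requests.get(url, headers=headers, timeout=30)
    if resp.status_code == 304:
        return None  # not modified; keep using the local copy
    resp.raise_for_status()
    return resp.content
```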
Compute Startup:
```mermaid
sequenceDiagram
    autonumber
    participant co as Compute
    participant ps as Pageserver
    participant ep as EPUFS
    participant es as Extension server
    note over co: Bind endpoint ep-xxx
    par Get basebackup
        co->>+ps: Request basebackup @ LSN
        ps-)ps: Construct basebackup
        ps->>-co: Receive basebackup TAR @ LSN
    and Get startup-critical Unlogged Persistent Files
        co->>+ep: Get all UPFs of endpoint ep-xxx
        ep-)ep: Retrieve and gather all UPFs
        ep->>-co: TAR of UPFs
    and Get startup-critical extensions
        loop For every startup-critical extension
            co->>es: Get critical extension
            es->>co: Receive critical extension
        end
    end
    note over co: Start compute
```
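A sketch of the `par` block above in compute terms: the three fetches are independent, so they can run concurrently. All function names here are illustrative stand-ins, not actual compute code:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical fetchers standing in for the three parallel arms of the diagram.
def fetch_basebackup(): ...
def fetch_upf_tar(): ...
def fetch_critical_extensions(): ...

def compute_startup_prefetch():
    # Run all three fetches concurrently, mirroring the `par` block above.
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(fetch_basebackup),
                   pool.submit(fetch_upf_tar),
                   pool.submit(fetch_critical_extensions)]
        return [f.result() for f in futures]  # propagates any fetch error
```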
CPlane ops:
```
DELETE /tenants/<tenant-id>/timelines/<timeline-id>/endpoints/<endpoint-id>
Host: epufs.svc.neon.local
<<<
200 OK
{
    "tenant": "<tenant-id>",
    "timeline": "<timeline-id>",
    "endpoint": "<endpoint-id>",
    "deleted": {
        "files": <count>,
        "bytes": <count>
    }
}
```

```
DELETE /tenants/<tenant-id>/timelines/<timeline-id>
Host: epufs.svc.neon.local
<<<
200 OK
{
    "tenant": "<tenant-id>",
    "timeline": "<timeline-id>",
    "deleted": {
        "files": <count>,
        "bytes": <count>
    }
}
```

```
DELETE /tenants/<tenant-id>
Host: epufs.svc.neon.local
<<<
200 OK
{
    "tenant": "<tenant-id>",
    "deleted": {
        "files": <count>,
        "bytes": <count>
    }
}
```
```mermaid
sequenceDiagram
    autonumber
    participant cp as Control Plane
    participant ep as EPUFS
    participant s3 as Blob Storage
    alt Tenant deleted
        cp-)ep: Tenant deleted
        loop For every object associated with removed tenant
            ep->>s3: Remove data of deleted tenant from Storage
        end
        opt
            ep-)cp: Tenant cleanup complete
        end
    else Timeline deleted
        cp-)ep: Timeline deleted
        loop For every object associated with removed timeline
            ep->>s3: Remove data of deleted timeline from Storage
        end
        opt
            ep-)cp: Timeline cleanup complete
        end
    else Endpoint reassigned or removed
        cp->>+ep: Endpoint reassigned
        loop For every object associated with reassigned/removed endpoint
            ep->>s3: Remove data from Storage
        end
        ep->>-cp: Cleanup complete
    end
```
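All three cleanup operations map naturally onto a prefix deletion in the blob store, which also yields the file and byte counts reported in the DELETE responses above. A sketch using boto3 (bucket and helper names are illustrative):

```python
import boto3

def delete_prefix(bucket: str, prefix: str) -> tuple[int, int]:
    """Delete every object under `prefix`; returns (files, bytes) deleted."""
    s3 = boto3.client("s3")
    files = size = 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        contents = page.get("Contents", [])
        if not contents:
            continue
        # Pages hold at most 1000 keys, within delete_objects' batch limit.
        s3.delete_objects(
            Bucket=bucket,
            Delete={"Objects": [{"Key": o["Key"]} for o in contents]},
        )
        files += len(contents)
        size += sum(o["Size"] for o in contents)
    return files, size

# Endpoint removal would be, e.g.:
# delete_prefix(bucket, f"tenants/{tenant}/timelines/{tl}/endpoints/{ep}/")
```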
## Scalability (if relevant)
Provisionally: as this service is part of the compute startup path, it should be able to respond quickly to all requests. Therefore this service is deployed in every AZ we host Computes in, and Computes (generally) communicate only with the EPUFS endpoint of the AZ they're hosted in.
Local caching of frequently restarted endpoints' data or metadata may be needed for best performance. However, because the stored data is regional while the service deployment is zonal, we should be careful when implementing any local caching: computes in AZ 1 may update data originally written, and thus cached, by AZ 2. Cache version checks and invalidation are therefore required if we want to roll out caching to this service, which is too broad a scope for an MVP. This is why caching is left out of scope for this RFC, and should be considered separately after this RFC is implemented.
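For illustration only, since caching is out of scope here: a zonal cache could validate its entries against the opaque version token returned by writes, at the cost of one metadata round-trip per read (`fetch_current_version` is a hypothetical cheap lookup against the regional backing store, e.g. a HEAD-style request):

```python
# Illustration only: caching is explicitly out of scope for this RFC.
cache: dict[str, tuple[str, bytes]] = {}  # key -> (version token, data)

def cached_read(key: str, fetch_current_version, fetch_object) -> bytes:
    entry = cache.get(key)
    current = fetch_current_version(key)  # hypothetical metadata lookup
    if entry is not None and entry[0] == current:
        return entry[1]  # version token matches: cached entry is still valid
    data = fetch_object(key)  # stale or missing: refill from the region
    cache[key] = (current, data)
    return data
```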
## Security implications (if relevant)
This service must be able to authenticate users at least by Tenant ID, Timeline ID and Endpoint ID. This will use the existing JWT infrastructure of Compute, which will be upgraded to the extent needed to support Timeline- and Endpoint-based claims.
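A sketch of the service-side check this implies: verify the token, then require its claims to match the IDs in the request path. Claim names are illustrative, as in the earlier minting sketch:

```python
# Hypothetical service-side authorization check.
import jwt  # pyjwt

def authorize(token: str, verify_key: str, tenant_id: str,
              timeline_id: str, endpoint_id: str) -> bool:
    try:
        # Rejects expired tokens and bad signatures.
        claims = jwt.decode(token, verify_key, algorithms=["HS256"])
    except jwt.InvalidTokenError:
        return False
    # The token's scope must match the IDs in the request path exactly.
    return (claims.get("tenant_id") == tenant_id
            and claims.get("timeline_id") == timeline_id
            and claims.get("endpoint_id") == endpoint_id)
```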
The service requires unlimited access to (a prefix of) a blob storage bucket, and thus must be hosted outside the Compute VM sandbox.
A service that generates pre-signed request URLs for Compute to download the data is likely problematic, too: Compute would be able to write unlimited data to the bucket, or exfiltrate the signed URL to get read/write access to specific objects in the bucket, which would still effectively give users access to the S3 bucket (though with improved access logging).
There may be a use case for transferring data associated with one endpoint to another endpoint (e.g. to make one endpoint warm its caches with the state of another endpoint), but that's not currently in scope, and specific needs may be solved through out-of-band communication of data or pre-signed URLs.
## Unresolved questions (if relevant)
Caching of files is not in the implementation scope of this document, but should be considered at some future point to maximize performance.
## Alternative implementation (if relevant)
Several ideas have come up to solve this issue:
### Use AUXfile
One prevalent idea was to WAL-log the files using our AUXfile mechanism.
Benefits:
- We already have this storage mechanism
Demerits:
- It isn't available on read replicas
- Additional WAL will be consumed during shutdown and after the shutdown checkpoint, which needs PG modifications to work without panics.
- It increases the data we need to manage in our versioned storage, thus causing higher storage costs with higher retention due to duplication at the storage layer.
### Sign URLs for read/write operations, instead of proxying them
Benefits:
- The service can be implemented with a much reduced IO budget
Demerits:
- Users could get access to these signed credentials
- Not all blob storage services may implement URL signing
### Give endpoints each their own directly accessed block volume
Benefits:
- Easier to integrate for PostgreSQL
Demerits:
- Little control over data size and contents
- Potentially problematic, as we'd need to store data from all across the pgdata directory.
- EBS is not a good candidate:
  - Attaches in tens of seconds, if not more; i.e. too slow for fast compute starts.
  - Shared EBS volumes are a no-go, as you'd have to schedule the endpoint with users of the same EBS volumes, which can't work with VM migration.
  - EBS storage costs are very high (>$80 per 1000 tenants when using a volume per tenant).
  - EBS volumes can't be mounted across AZ boundaries.
- A bucket per endpoint is unfeasible:
  - S3 buckets are priced at $20/month per 1000, which we could better spend on developers.
  - Allocating service accounts takes time (100s of ms), and service accounts are a limited resource, too; so they're not a good candidate to allocate on a per-endpoint basis.
  - Giving credentials limited to a prefix has similar issues as the pre-signed URL approach.
  - Bucket DNS lookups will fill DNS caches and put much more pressure on DNS lookup than our current systems would.
- Volumes bound by the hypervisor are unlikely:
  - This requires significant investment and additional software on the hypervisor.
  - It is unclear whether we can attach volumes after boot, i.e. for pooled instances.
### Put the files into a table
Benefits:
- Mostly already available in PostgreSQL
Demerits:
- Uses WAL
- Can't be used after shutdown checkpoint
- Needs a RW endpoint, and table & catalog access to write to this data
- Gets hit with DB size limitations
- Depending on user access:
  - Inaccessible: the user doesn't have control over database size caused by these systems.
  - Accessible: the user can corrupt these files and cause the system to crash while user-corrupted files are present, thus increasing on-call overhead.
## Definition of Done (if relevant)
This project is done if we have:
- One S3 bucket equivalent per region, which stores this per-endpoint data.
- A new service endpoint in at least every AZ, which indirectly grants endpoints access to the data stored for these endpoints in these buckets.
- Compute writes & reads this temporary data at shutdown and startup, respectively, for at least the `pg_prewarm` or `lfc_prewarm` state files.
- Cleanup of endpoint data is triggered when the endpoint is deleted or is detached from its current timeline.