neon/docs/rfcs/038-independent-compute-release.md

# Independent compute release

Created at: 2024-08-30. Author: Alexey Kondratov (@ololobus)

## Summary

This document proposes an approach to fully independent compute release flow. It attempts to
cover the following features:

- Process is automated as much as possible to minimize human errors.
- Compute<->storage protocol compatibility is ensured.
- A transparent release history is available with an easy rollback strategy.
- Although not in the scope of this document, there is a viable way to extend the proposed release
  flow to achieve the canary and/or blue-green deployment strategies.

## Motivation

Previously, the compute release was tightly coupled to the storage release. This meant that once
some storage nodes got restarted with a newer version, all new compute starts using these nodes
automatically got a new version. Thus, two releases happen in parallel, which increases the blast
radius and makes ownership fuzzy.

Now, we practice a manual v0 independent compute release flow -- after getting a new compute release
image and tag, we pin it region by region using Admin UI. It's better, but it still has its own flaws:

1. It's a simple but fairly manual process, as you need to click through a few pages.
2. It's prone to human errors, e.g., you could mistype or copy the wrong compute tag.
3. We now require an additional approval in the Admin UI, which partially solves the 2.,
   but also makes the whole process pretty annoying, as you constantly need to go back
   and forth between two people.

## Non-goals

It's not the goal of this document to propose a design for some general-purpose release tool like Helm.
The document considers how the current compute fleet is orchestrated at Neon. Even if we later
decide to split the control plane further (e.g., introduce a separate compute controller), the proposed
release process shouldn't change much, i.e., the releases table and API will reside in
one of the parts.

Achieving the canary and/or blue-green deploy strategies is out of the scope of this document. They
were kept in mind, though, so it's expected that the proposed approach will lay down the foundation
for implementing them in future iterations.

## Impacted components

Compute, control plane, CI, observability (some Grafana dashboards may require changes).

## Prior art

One of the very close examples is how Helm tracks [releases history](https://helm.sh/docs/helm/helm_history/).

In the code:

- [Release](https://github.com/helm/helm/blob/2b30cf4b61d587d3f7594102bb202b787b9918db/pkg/release/release.go#L20-L43)
- [Release info](https://github.com/helm/helm/blob/2b30cf4b61d587d3f7594102bb202b787b9918db/pkg/release/info.go#L24-L40)
- [Release status](https://github.com/helm/helm/blob/2b30cf4b61d587d3f7594102bb202b787b9918db/pkg/release/status.go#L18-L42)

TL;DR it has several important attributes:

- Revision -- unique release ID/primary key. It is not the same as the application version,
  because the same version can be deployed several times, e.g., after a newer version rollback.
- App version -- version of the application chart/code.
- Config -- set of overrides to the default config of the application.
- Status -- current status of the release in the history.
- Timestamps -- tracks when a release was created and deployed.

## Proposed implementation

### Separate release branch

We will use a separate release branch, `release-compute`, to have a clean history for releases and commits.
In order to avoid confusion with storage releases, we will use a different prefix for compute [git release
tags](https://github.com/neondatabase/neon/releases) -- `release-compute-XXXX`. We will use the same tag for
Docker images as well. The `neondatabase/compute-node-v16:release-compute-XXXX` looks longer and a bit redundant,
but it's better to have image and git tags in sync.

Currently, control plane relies on the numeric compute and storage release versions to decide on compute->storage
compatibility. Once we implement this proposal, we should drop this code as release numbers will be completely
independent. The only constraint we want is that it must monotonically increase within the same release branch.

### Compute config/settings manifest

We will create a new sub-directory `compute` and file `compute/manifest.yaml` with a structure:

```yaml
pg_settings:
  # Common settings for primaries and secondaries of all versions.
  common:
    wal_log_hints: "off"
    max_wal_size: "1024"

  per_version:
    14:
      # Common settings for both replica and primary of version PG 14
      common:
        shared_preload_libraries: "neon,pg_stat_statements,extension_x"
    15:
      common:
        shared_preload_libraries: "neon,pg_stat_statements,extension_x"
      # Settings that should be applied only to
      replica:
        # Available only starting Postgres 15th
        recovery_prefetch: "off"
    # ...
    17:
      common:
        # For example, if third-party `extension_x` is not yet available for PG 17
        shared_preload_libraries: "neon,pg_stat_statements"
      replica:
        recovery_prefetch: "off"
```

**N.B.** Setting value should be a string with `on|off` for booleans and a number (as a string)
without units for all numeric settings. That's how the control plane currently operates.

The priority of settings will be (a higher number is a higher priority):

1. Any static and hard-coded settings in the control plane
2. `pg_settings->common`
3. Per-version `common`
4. Per-version `replica`
5. Any per-user/project/endpoint overrides in the control plane
6. Any dynamic setting calculated based on the compute size

**N.B.** For simplicity, we do not do any custom logic for `shared_preload_libraries`, so it's completely
overridden if specified on some level. Make sure that you include all necessary extensions in it when you
do any overrides.

**N.B.** There is a tricky question about what to do with custom compute image pinning we sometimes
do for particular projects and customers. That's usually some ad-hoc work and images are based on
the latest compute image, so it's relatively safe to assume that we could use settings from the latest compute
release. If for some reason that's not true, and further overrides are needed, it's also possible to do
on the project level together with pinning the image, so it's on-call/engineer/support responsibility to
ensure that compute starts with the specified custom image. The only real risk is that compute image will get
stale and settings from new releases will drift away, so eventually it will get something incompatible,
but i) this is some operational issue, as we do not want stale images anyway, and ii) base settings
receive something really new so rarely that the chance of this happening is very low. If we want to solve it completely,
then together with pinning the image we could also pin the matching release revision in the control plane.

The compute team will own the content of `compute/manifest.yaml`.

### Control plane: releases table

In order to store information about releases, the control plane will use a table `compute_releases` with the following
schema:

```sql
CREATE TABLE compute_releases (
  -- Unique release ID
  -- N.B. Revision won't by synchronized across all regions, because all control planes are technically independent
  -- services. We have the same situation with Helm releases as well because they could be deployed and rolled back
  -- independently in different clusters.
  revision BIGSERIAL PRIMARY KEY,
  -- Numeric version of the compute image, e.g. 9057
  version BIGINT NOT NULL,
  -- Compute image tag, e.g. `release-9057`
  tag TEXT NOT NULL,
  -- Current release status. Currently, it will be a simple enum
  -- * `deployed` -- release is deployed and used for new compute starts.
  --                 Exactly one release can have this status at a time.
  -- * `superseded` -- release has been replaced by a newer one.
  -- But we can always extend it in the future when we need more statuses
  -- for more complex deployment strategies.
  status TEXT NOT NULL,
  -- Any additional metadata for compute in the corresponding release
  manifest JSONB NOT NULL,
  -- Timestamp when release record was created in the control plane database
  created_at TIMESTAMP NOT NULL DEFAULT now(),
  -- Timestamp when release deployment was finished
  deployed_at TIMESTAMP
);
```

We keep track of the old releases not only for the sake of audit, but also because we usually have ~30% of
old computes started using the image from one of the previous releases. Yet, when users want to reconfigure
them without restarting, the control plane needs to know what settings are applicable to them, so we also need
information about the previous releases that are readily available. There could be some other auxiliary info
needed as well: supported extensions, compute flags, etc.

**N.B.** Here, we can end up in an ambiguous situation when the same compute image is deployed twice, e.g.,
it was deployed once, then rolled back, and then deployed again, potentially with a different manifest. Yet,
we could've started some computes with the first deployment and some with the second. Thus, when we need to
look up the manifest for the compute by its image tag, we will see two records in the table with the same tag,
but different revision numbers. We can assume that this could happen only in case of rollbacks, so we
can just take the latest revision for the given tag.

### Control plane: management API

The control plane will implement new API methods to manage releases:

1. `POST /management/api/v2/compute_releases` to create a new release. With payload

   ```json
    {
      "version": 9057,
      "tag": "release-9057",
      "manifest": {}
    }
   ```

   and response

   ```json
    {
      "revision": 53,
      "version": 9057,
      "tag": "release-9057",
      "status": "deployed",
      "manifest": {},
      "created_at": "2024-08-15T15:52:01.0000Z",
      "deployed_at": "2024-08-15T15:52:01.0000Z",
    }
   ```

   Here, we can actually mix-in custom (remote) extensions metadata into the `manifest`, so that the control plane
   will get information about all available extensions not bundled into compute image. The corresponding
   workflow in `neondatabase/build-custom-extensions` should produce it as an artifact and make
   it accessible to the workflow in the `neondatabase/infra`. See the complete release flow below. Doing that,
   we put a constraint that new custom extension requires new compute release, which is good for the safety,
   but is not exactly what we want operational-wise (we want to be able to deploy new extensions without new
   images). Yet, it can be solved incrementally: v0 -- do not do anything with extensions at all;
   v1 -- put them into the same manifest; v2 -- make them separate entities with their own lifecycle.

   **N.B.** This method is intended to be used in CI workflows, and CI/network can be flaky. It's reasonable
   to assume that we could retry the request several times, even though it's already succeeded. Although it's
   not a big deal to create several identical releases one-by-one, it's better to avoid it, so the control plane
   should check if the latest release is identical and just return `304 Not Modified` in this case.

2. `POST /management/api/v2/compute_releases/rollback` to rollback to any previously deployed release. With payload
   including the revision of the release to rollback to:

   ```json
   {
      "revision": 52
   }
   ```

   Rollback marks the current release as `superseded` and creates a new release with all the same data as the
   requested revision, but with a new revision number.

   This rollback API is not strictly needed, as we can just use `infra` repo workflow to deploy any
   available tag. It's still nice to have for on-call and any urgent matters, for example, if we need
   to rollback and GitHub is down. It's much easier to specify only the revision number vs. crafting
   all the necessary data for the new release payload.

### Compute->storage compatibility tests

In order to safely release new compute versions independently from storage, we need to ensure that the currently
deployed storage is compatible with the new compute version. Currently, we maintain backward compatibility
in storage, but newer computes may require a newer storage version.

Remote end-to-end (e2e) tests [already accept](https://github.com/neondatabase/cloud/blob/e3468d433e0d73d02b7d7e738d027f509b522408/.github/workflows/testing.yml#L43-L48)
`storage_image_tag` and `compute_image_tag` as separate inputs. That means that we could reuse e2e tests to ensure
compatibility between storage and compute:

1. Pick the latest storage release tag and use it as `storage_image_tag`.
2. Pick a new compute tag built in the current compute release PR and use it as `compute_image_tag`.
   Here, we should use a temporary ECR image tag, because the final tag will be known only after the release PR is merged.
3. Trigger e2e tests as usual.

### Release flow

```mermaid
  sequenceDiagram

  actor oncall as Compute on-call person
  participant neon as neondatabase/neon

  box private
    participant cloud as neondatabase/cloud
    participant exts as neondatabase/build-custom-extensions
    participant infra as neondatabase/infra
  end

  box cloud
    participant preprod as Pre-prod control plane
    participant prod as Production control plane
    participant k8s as Compute k8s
  end

  oncall ->> neon: Open release PR into release-compute

  activate neon
  neon ->> cloud: CI: trigger e2e compatibility tests
  activate cloud
  cloud -->> neon: CI: e2e tests pass
  deactivate cloud
  neon ->> neon: CI: pass PR checks, get approvals
  deactivate neon

  oncall ->> neon: Merge release PR into release-compute

  activate neon
  neon ->> neon: CI: pass checks, build and push images
  neon ->> exts: CI: trigger extensions build
  activate exts
  exts -->> neon: CI: extensions are ready
  deactivate exts
  neon ->> neon: CI: create release tag
  neon ->> infra: Trigger release workflow using the produced tag
  deactivate neon

  activate infra
  infra ->> infra: CI: pass checks
  infra ->> preprod: Release new compute image to pre-prod automatically <br/> POST /management/api/v2/compute_releases
  activate preprod
  preprod -->> infra: 200 OK
  deactivate preprod

  infra ->> infra: CI: wait for per-region production deploy approvals
  oncall ->> infra: CI: approve deploys region by region
  infra ->> k8s: Prewarm new compute image
  infra ->> prod: POST /management/api/v2/compute_releases
  activate prod
  prod -->> infra: 200 OK
  deactivate prod
  deactivate infra
```

## Further work

As briefly mentioned in other sections, eventually, we would like to use more complex deployment strategies.
For example, we can pass a fraction of the total compute starts that should use the new release. Then we can
mark the release as `partial` or `canary` and monitor its performance. If everything is fine, we can promote it
to `deployed` status. If not, we can roll back to the previous one.

## Alternatives

In theory, we can try using Helm as-is:

1. Write a compute Helm chart. That will actually have only some config map, which the control plane can access and read.
   N.B. We could reuse the control plane chart as well, but then it's not a fully independent release again and even more fuzzy.
2. The control plane will read it and start using the new compute version for new starts.

Drawbacks:

1. Helm releases work best if the workload is controlled by the Helm chart itself. Then you can have different
   deployment strategies like rolling update or canary or blue/green deployments. At Neon, the compute starts are controlled
   by control plane, so it makes it much more tricky.
2. Releases visibility will suffer, i.e. instead of a nice table in the control plane and Admin UI, we would need to use
   `helm` cli and/or K8s UIs like K8sLens.
3. We do not restart all computes shortly after the new version release. This means that for some features and compatibility
   purpose (see above) control plane may need some auxiliary info from the previous releases.