Not a user-facing change, but can break any existing `.neon` directories
created by neon_local, as the name of the database used by the storage
controller changes.
This PR changes all the locations apart from the path of
`control_plane/attachment_service` (waiting for an opportune moment to
do that one, because it's the most conflict-ish wrt ongoing PRs like
#6676 )
## Problem
Attachment service does not do auth based on JWT scopes.
## Summary of changes
Do JWT based permission checking for requests coming into the attachment
service.
Requests into the attachment service must use different tokens based on
the endpoint:
* `/control` and `/debug` require `admin` scope
* `/upcall` requires `generations_api` scope
* `/v1/...` requires `pageserverapi` scope
Requests into the pageserver from the attachment service must use
`pageserverapi` scope.
- Remove the neon.safekeeper_token_env GUC. It was used to set the
name of an environment variable, which was then used in pageserver
and safekeeper connection strings to in place of the
password. Instead, always look up the environment variable called
NEON_AUTH_TOKEN. That's what neon.safekeeper_token_env was always
set to in practice, and I don't see the need for the extra level of
indirection or configurability.
- Instead of substituting $NEON_AUTH_TOKEN in the connection strings,
pass $NEON_AUTH_TOKEN "out-of-band" as the password, when we connect
to the pageserver or safekeepers. That's simpler.
- Also use the password from $NEON_AUTH_TOKEN in compute_ctl, when it
connects to the pageserver to get the "base backup".
The control plane currently only supports EdDSA. We need to either teach
the storage to use EdDSA, or the control plane to use RSA. EdDSA is more
modern, so let's use that.
We could support both, but it would require a little more code and tests,
and we don't really need the flexibility since we control both sides.
This makes it possible to enable authentication only for the mgmt HTTP
API or the compute API. The HTTP API doesn't need to be directly
accessible from compute nodes, and it can be secured through network
policies. This also allows rolling out authentication in a piecemeal
fashion.
Changes are:
* Pageserver: start reading from NEON_AUTH_TOKEN by default.
Warn if ZENITH_AUTH_TOKEN is used instead.
* Compute, Docs: fix the default token name.
* Control plane: change name of the token in configs and start
sequences.
Compatibility:
* Control plane in tests: works, no compatibility expected.
* Control plane for local installations: never officially supported
auth anyways. If someone did enable it, `pageserver.toml` should be updated
with the new `neon.pageserver_connstring` and `neon.safekeeper_token_env`.
* Pageserver is backward compatible: you can run new Pageserver with old
commands and environment configurations, but not vice-versa.
The culprit is the hard-coded `NEON_AUTH_TOKEN`.
* Compute has no code changes. As long as you update its configuration
file with `pageserver_connstring` in sync with the start up scripts,
you are good to go.
* Safekeeper has no code changes and has never used `ZENITH_AUTH_TOKEN` in
the first place.
* Fix https://github.com/neondatabase/neon/issues/1854
* Never log Safekeeper::conninfo in walproposer as it now contains a secret token
* control_panel, test_runner: generate and pass JWT tokens for Safekeeper to compute and pageserver
* Compute: load JWT token for Safekepeer from the environment variable. Do not reuse the token from
pageserver_connstring because it's embedded in there weirdly.
* Pageserver: load JWT token for Safekeeper from the environment variable.
* Rewrite docs/authentication.md
Current state with authentication.
Page server validates JWT token passed as a password during connection
phase and later when performing an action such as create branch tenant
parameter of an operation is validated to match one submitted in token.
To allow access from console there is dedicated scope: PageServerApi,
this scope allows access to all tenants. See code for access validation in:
PageServerHandler::check_permission.
Because we are in progress of refactoring of communication layer
involving wal proposer protocol, and safekeeper<->pageserver. Safekeeper
now doesn’t check token passed from compute, and uses “hardcoded” token
passed via environment variable to communicate with pageserver.
Compute postgres now takes token from environment variable and passes it
as a password field in pageserver connection. It is not passed through
settings because then user will be able to retrieve it using pg_settings
or SHOW ..
I’ve added basic test in test_auth.py. Probably after we add
authentication to remaining network paths we should enable it by default
and switch all existing tests to use it.