## Problem
The gRPC API does not provide LSN leases.
## Summary of changes
* Add LSN lease support to the gRPC API.
* Use gRPC LSN leases for static computes with `grpc://` connstrings.
* Move `PageserverProtocol` into the `compute_api::spec` module and
reuse it.
## Problem
Location config changes can currently result in changes to the shard
identity. Such changes will cause data corruption, as seen with #12217.
Resolves#12227.
Requires #12377.
## Summary of changes
Assert that the shard identity does not change on location config
updates and on (re)attach.
This is currently asserted with `critical!`, in case it misfires in
production. Later, we should reject such requests with an error and turn
this into a proper assertion.
## Problem
Similarly to #12217, the following endpoints may result in a stripe size
mismatch between the storage controller and Pageserver if an unsharded
tenant has a different stripe size set than the default. This can lead
to data corruption if the tenant is later manually split without
specifying an explicit stripe size, since the storage controller and
Pageserver will apply different defaults. This commonly happens with
tenants that were created before the default stripe size was changed
from 32k to 2k.
* `PUT /v1/tenant/config`
* `PATCH /v1/tenant/config`
These endpoints are no longer in regular production use (they were used
when cplane still managed Pageserver directly), but can still be called
manually or by tests.
## Summary of changes
Retain the current shard parameters when updating the location config in
`PUT | PATCH /v1/tenant/config`.
Also opportunistically derive `Copy` for `ShardParameters`.
Add a new 'pageserver_connection_info' field in the compute spec. It
replaces the old 'pageserver_connstring' field with a more complicated
struct that includes both libpq and grpc URLs, for each shard (or only
one of the the URLs, depending on the configuration). It also includes
a flag suggesting which one to use; compute_ctl now uses it to decide
which protocol to use for the basebackup.
This is compatible with everything that's in production, because the
control plane never used the 'pageserver_connstring' field. That was
added a long time ago with the idea that it would replace the code
that digs the 'neon.pageserver_connstring' GUC from the list of
Postgres settings, but we never got around to do that in the control
plane. Hence, it was only used with neon_local. But the plan now is to
pass the 'pageserver_connection_info' from the control plane, and once
that's fully deployed everywhere, the code to parse
'neon.pageserver_connstring' in compute_ctl can be removed.
The 'grpc' flag on an endpoint in endpoint config is now more of a
suggestion. Compute_ctl gets both URLs, so it can choose to use libpq
or grpc as it wishes. It currently always obeys the 'prefer_grpc' flag
that's part of the connection info though. Postgres however uses grpc
iff the new rust-based communicator is enabled.
TODO/plan for the control plane:
- Start to pass `pageserver_connection_info` in the spec file.
- Also keep the current `neon.pageserver_connstring` setting for now,
for backwards compatibility with old computes
After that, the `pageserver_connection_info.prefer_grpc` flag in the
spec file can be used to control whether compute_ctl uses grpc or
libpq. The actual compute's grpc usage will be controlled by the
`neon.enable_new_communicator` GUC. It can be set separately from
'prefer_grpc'.
Later:
- Once all old computes are gone, remove the code to pass
`neon.pageserver_connstring`
## Problem
The problem has been well described in already-commited PR #11853.
tl;dr: BufferedWriter is sensitive to cancellation, which the previous
approach was not.
The write path was most affected (ingest & compaction), which was mostly
fixed in #11853:
it introduced `PutError` and mapped instances of `PutError` that were
due to cancellation of underlying buffered writer into
`CreateImageLayersError::Cancelled`.
However, there is a long tail of remaining errors that weren't caught by
#11853 that result in `CompactionError::Other`s, which we log with great
noise.
## Solution
The stack trace logging for CompactionError::Other added in #11853
allows us to chop away at that long tail using the following pattern:
- look at the stack trace
- from leaf up, identify the place where we incorrectly map from the
distinguished variant X indicating cancellation to an `anyhow::Error`
- follow that anyhow further up, ensuring it stays the same anyhow all
the way up in the `CompactionError::Other`
- since it stayed one anyhow chain all the way up, root_cause() will
yield us X
- so, in `log_compaction_error`, add an additional `downcast_ref` check
for X
This PR specifically adds checks for
- the flush task cancelling (FlushTaskError, BlobWriterError)
- opening of the layer writer (GateError)
That should cover all the reports in issues
- https://github.com/neondatabase/cloud/issues/29434
- https://github.com/neondatabase/neon/issues/12162
## Refs
- follow-up to #11853
- fixup of / fixes https://github.com/neondatabase/neon/issues/11762
- fixes https://github.com/neondatabase/neon/issues/12162
- refs https://github.com/neondatabase/cloud/issues/29434
## Problem
Some pageservers hit `max_size_entries` limit in staging with only ~25
MiB storage used by basebackup cache. The limit is too strict. It should
be safe to relax it.
- Part of https://github.com/neondatabase/cloud/issues/29353
## Summary of changes
- Increase the default `max_size_entries` from 1000 to 10000
## Problem
In our infra config, we have to split server_api_key and other fields in
two files: the former one in the sops file, and the latter one in the
normal config. It creates the situation that we might misconfigure some
regions that it only has part of the fields available, causing
storcon/pageserver refuse to start.
## Summary of changes
Allow PostHog config to have part of the fields available. Parse it
later.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Basebackup cache now uses unbounded channel for prepare requests. In
theory it can grow large if the cache is hung and does not process the
requests.
- Part of https://github.com/neondatabase/cloud/issues/29353
## Summary of changes
- Replace an unbounded channel with a bounded one, the size is
configurable.
- Add `pageserver_basebackup_cache_prepare_queue_size` to observe the
size of the queue.
- Refactor a bit to move all metrics logic to `basebackup_cache.rs`
## Problem
Fix for https://github.com/neondatabase/neon/pull/12324
## Summary of changes
Need `serde(default)` to allow this field not present in the config,
otherwise there will be a config deserialization error.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
part of https://github.com/neondatabase/neon/issues/11813
## Summary of changes
It costs $$$ to directly retrieve the feature flags from the pageserver.
Therefore, this patch adds new APIs to retrieve the spec from the
storcon and updates it via pageserver.
* Storcon retrieves the feature flag and send it to the pageservers.
* If the feature flag gets updated outside of the normal refresh loop of
the pageserver, pageserver won't fetch the flags on its own as long as
the last updated time <= refresh_period.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
While working more on TLS to compute, I realised that Console Redirect
-> pg-sni-router -> compute would break if channel binding was set to
prefer. This is because the channel binding data would differ between
Console Redirect -> pg-sni-router vs pg-sni-router -> compute.
I also noticed that I actually disabled channel binding in #12145, since
`connect_raw` would think that the connection didn't support TLS.
## Summary of changes
Make sure we specify the channel binding.
Make sure that `connect_raw` can see if we have TLS support.
This makes it possible for the compiler to validate that a match block
matched all PostgreSQL versions we support.
## Problem
We did not have a complete picture about which places we had to test
against PG versions, and what format these versions were: The full PG
version ID format (Major/minor/bugfix `MMmmbb`) as transfered in
protocol messages, or only the Major release version (`MM`). This meant
type confusion was rampant.
With this change, it becomes easier to develop new version-dependent
features, by making type and niche confusion impossible.
## Summary of changes
Every use of `pg_version` is now typed as either `PgVersionId` (u32,
valued in decimal `MMmmbb`) or PgMajorVersion (an enum, with a value for
every major version we support, serialized and stored like a u32 with
the value of that major version)
---------
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
The 1.88.0 stable release is near (this Thursday). We'd like to fix most
warnings beforehand so that the compiler upgrade doesn't require
approval from too many teams.
This is therefore a preparation PR (like similar PRs before it).
There is a lot of changes for this release, mostly because the
`uninlined_format_args` lint has been added to the `style` lint group.
One can read more about the lint
[here](https://rust-lang.github.io/rust-clippy/master/#/uninlined_format_args).
The PR is the result of `cargo +beta clippy --fix` and `cargo fmt`. One
remaining warning is left for the proxy team.
---------
Co-authored-by: Conrad Ludgate <conrad@neon.tech>
## Problem
Part of #11813
## Summary of changes
The current interval is 30s and it costs a lot of $$$. This patch
reduced it to 600s refresh interval (which means that it takes 10min for
feature flags to propagate from UI to the pageserver). In the future we
can let storcon retrieve the feature flags and push it to pageservers.
We can consider creating a new release or we can postpone this to the
week after the next week.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
In #12217, we began passing the stripe size in reattach responses, and
persisting it in the on-disk state. This is necessary to ensure the
storage controller and Pageserver have a consistent view of the intended
stripe size of unsharded tenants, which will be used for splits that do
not specify a stripe size. However, for backwards compatibility, these
stripe sizes were optional.
## Summary of changes
Make the stripe sizes required for reattach responses and on-disk
location configs. These will always be provided by the previous
(current) release.
Enable it across tests and set it as default. Marks the first milestone
of https://github.com/neondatabase/neon/issues/9114. We already enabled
it in all AWS regions and planning to enable it in all Azure regions
next week.
will merge after we roll out in all regions.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
See #11942
Idea:
* if connections are short lived, they can get enqueued and then also
remove themselves later if they never made it to redis. This reduces the
load on the queue.
* short lived connections (<10m, most?) will only issue 1 command, we
remove the delete command and rely on ttl.
* we can enqueue as many commands as we want, as we can always cancel
the enqueue, thanks to the ~~intrusive linked lists~~ `BTreeMap`.
Introduce a separate `postgres_ffi_types` crate which contains a few
types and functions that were used in the API. `postgres_ffi_types` is a
much small crate than `postgres_ffi`, and it doesn't depend on bindgen
or the Postgres C headers.
Move NeonWalRecord and Value types to wal_decoder crate. They are only
used in the pageserver-safekeeper "ingest" API. The rest of the ingest
API types are defined in wal_decoder, so move these there as well.
- Add optional `?mode=fast|immediate` to `/terminate`, `fast` is
default. Immediate avoids waiting 30
seconds before returning from `terminate`.
- Add `TerminateMode` to `ComputeStatus::TerminationPending`
- Use `/terminate?mode=immediate` in `neon_local` instead of `pg_ctl
stop` for `test_replica_promotes`.
- Change `test_replica_promotes` to check returned LSN
- Annotate `finish_sync_safekeepers` as `noreturn`.
https://github.com/neondatabase/cloud/issues/29807
## Problem
The cache was introduced as a hackathon project and the only supported
limit was the number of entries.
The basebackup entry size may vary. We need to have more control over
disk space usage to ship it to production.
- Part of https://github.com/neondatabase/cloud/issues/29353
## Summary of changes
- Store the size of entries in the cache and use it to limit
`max_total_size_bytes`
- Add the size of the cache in bytes to metrics.
## Problem
Pageservers now expose a gRPC API on a separate address and port. This
must be registered with the storage controller such that it can be
plumbed through to the compute via cplane.
Touches #11926.
## Summary of changes
This patch registers the gRPC address and port with the storage
controller:
* Add gRPC address to `nodes` database table and `NodePersistence`, with
a Diesel migration.
* Add gRPC address in `NodeMetadata`, `NodeRegisterRequest`,
`NodeDescribeResponse`, and `TenantLocateResponseShard`.
* Add gRPC address flags to `storcon_cli node-register`.
These changes are backwards-compatible, since all structs will ignore
unknown fields during deserialization.
## Problem
We would easily hit this limit for a tenant running for enough long
time.
## Summary of changes
Remove the max key limit for time-travel recovery if the command is
running locally.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Part of #11813
PostHog has two endpoints to retrieve feature flags: the old project ID
one that uses personal API token, and the new one using a special
feature flag secure token that can only retrieve feature flag. The new
API I added in this patch is not documented in the PostHog API doc but
it's used in their Python SDK.
## Summary of changes
Add support for "feature flag secure token API". The API has no way of
providing a project ID so we verify if the retrieved spec is consistent
with the project ID specified by comparing the `team_id` field.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
The location config (which includes the stripe size) is stored on
pageserver disk.
For unsharded tenants we [do not include the shard identity in the
serialized
description](ad88ec9257/pageserver/src/tenant/config.rs (L64-L66)).
When the pageserver restarts, it reads that configuration and will use
the stripe size from there
and rely on storcon input from reattach for generation and mode.
The default deserialization is ShardIdentity::unsharded. This has the
new default stripe size of 2048.
Hence, for unsharded tenants we can be running with a stripe size
different from that the one in the
storcon observed state. This is not a problem until we shard split
without specifying a stripe size (i.e. manual splits via the UI or
storcon_cli). When that happens the new shards will use the 2048 stripe
size until storcon realises and switches them back. At that point it's
too late, since we've ingested data with the wrong stripe sizes.
## Summary of changes
Ideally, we would always have the full shard identity on disk. To
achieve this over two releases we do:
1. Always persist the shard identity in the location config on the PS.
2. Storage controller includes the stripe size to use in the re attach
response.
After the first release, we will start persisting correct stripe sizes
for any tenant shard that the storage controller
explicitly sends a location_conf. After the second release, the
re-attach change kicks in and we'll persist the
shard identity for all shards.
## Problem
Base64 0.13 is outdated.
## Summary of changes
Update base64 to 0.22. Affects mostly proxy and proxy libs. Also upgrade
serde_with to remove another dep on base64 0.13 from dep tree.
## Problem
part of https://github.com/neondatabase/neon/issues/11813
## Summary of changes
Collect pageserver hostname property so that we can use it in the
PostHog UI. Not sure if this is the best way to do that -- open to
suggestions.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
part of https://github.com/neondatabase/neon/issues/7546
Add Azure time travel recovery support. The tricky thing is how Azure
handles deletes in its blob version API. For the following sequence:
```
upload file_1 = a
upload file_1 = b
delete file_1
upload file_1 = c
```
The "delete file_1" won't be stored as a version (as AWS did).
Therefore, we can never rollback to a state where file_1 is temporarily
invisible. If we roll back to the time before file_1 gets created for
the first time, it will be removed correctly.
However, this is fine for pageservers, because (1) having extra files in
the tenant storage is usually fine (2) for things like
timelines/X/index_part-Y.json, it will only be deleted once, so it can
always be recovered to a correct state. Therefore, I don't expect any
issues when this functionality is used on pageserver recovery.
TODO: unit tests for time-travel recovery.
## Summary of changes
Add Azure blob storage time-travel recovery support.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
PGLB/Neonkeeper needs to separate the concerns of connecting to compute,
and authenticating to compute.
Additionally, the code within `connect_to_compute` is rather messy,
spending effort on recovering the authentication info after
wake_compute.
## Summary of changes
Split `ConnCfg` into `ConnectInfo` and `AuthInfo`. `wake_compute` only
returns `ConnectInfo` and `AuthInfo` is determined separately from the
`handshake`/`authenticate` process.
Additionally, `ConnectInfo::connect_raw` is in-charge or establishing
the TLS connection, and the `postgres_client::Config::connect_raw` is
configured to use `NoTls` which will force it to skip the TLS
negotiation. This should just work.
## Problem
Removed nodes can re-add themselves on restart if not properly
tombstoned. We need a mechanism (e.g. soft-delete flag) to prevent this,
especially in cases where the node is unreachable.
More details there: #12036
## Summary of changes
- Introduced `NodeLifecycle` enum to represent node lifecycle states.
- Added a string representation of `NodeLifecycle` to the `nodes` table.
- Implemented node removal using a tombstone mechanism.
- Introduced `/debug/v1/tombstone*` handlers to manage the tombstone
state.
neon_local's timeline import subcommand creates timelines manually, but
doesn't create them on the safekeepers. If a test then tries to open an
endpoint to read from the timeline, it will error in the new world with
`--timelines-onto-safekeepers`.
Therefore, if that flag is enabled, create the timelines on the
safekeepers.
Note that this import functionality is different from the fast import
feature (https://github.com/neondatabase/neon/issues/10188, #11801).
Part of #11670
As well as part of #11712
## Problem
We support two ingest protocols on the pageserver: vanilla and
interpreted.
Interpreted has been the only protocol in use for a long time.
## Summary of changes
* Remove the ingest handling of the vanilla protocol
* Remove tenant and pageserver configuration for it
* Update all tests that tweaked the ingest protocol
## Compatibility
Backward compatibility:
* The new pageserver version can read the existing pageserver
configuration and it will ignore the unknown field.
* When the tenant config is read from the storcon db or from the
pageserver disk, the extra field will be ignored.
Forward compatiblity:
* Both the pageserver config and the tenant config map missing fields to
their default value.
I'm not aware of any tenant level override that was made for this knob.
## Problem
Part of https://github.com/neondatabase/neon/issues/11813
In PostHog UI, we need to create the properties before using them as a
filter. We report all variants automatically when we start the
pageserver. In the future, we can report all real tenants instead of
fake tenants (we do that now to save money + we don't need real tenants
in the UI).
## Summary of changes
* Collect `region`, `availability_zone`, `pageserver_id` properties and
use them in the feature evaluation.
* Report 10 fake tenants on each pageserver startup.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Inbetween adding the TLS config for compute-ctl, and adding the TLS
config in controlplane, we switched from using a provision flag to a
bind flag. This happened to work in all of my testing in preview regions
as they have no VM pool, so each bind was also a provision. However, in
staging I found that the TLS config is still only processed during
provision, even though it's only sent on bind.
## Summary of changes
* Add a new feature flag value, `tls_experimental`, which tells
postgres/pgbouncer/local_proxy to use the TLS certificates on bind.
* compute_ctl on provision will be told where the certificates are,
instead of being told on bind.
Url::to_string() adds a trailing slash on the base URL, so when we did
the format!(), we were adding a double forward slash.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
We want to repro an OOM situation, but large partial reads are required.
## Summary of Changes
Make the max partial read size configurable for import jobs.