## Problem
Previous attempt https://github.com/neondatabase/neon/pull/10548 caused
some issues in staging and we reverted it. This is a re-attempt to
address https://github.com/neondatabase/neon/issues/11063.
Currently we create image layers at latest record LSN. We would create
"future image layers" (i.e., image layers with LSN larger than disk
consistent LSN) that need special handling at startup. We also waste a
lot of read operations to reconstruct from L0 layers while we could have
compacted all of the L0 layers and operate on a flat level of historic
layers.
## Summary of changes
* Run repartition at L0-L1 boundary.
* Roll out with feature flags.
* Piggyback a change that downgrades "image layer creating below
gc_cutoff" to debug level.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
There is a new API that I plan to use. We generate client from the spec
so it should be in the spec
## Summary of changes
Document the existing API in openAPI format
## Problem
Currently, we collect metrics of what extensions are installed on
computes at start up time. We do not have a mechanism that does this at
runtime.
## Summary of changes
Added a background thread that queries all DBs at regular intervals and
collects a list of installed extensions.
## Problem
Part of https://github.com/neondatabase/neon/issues/11813
## Summary of changes
* Support evaluate boolean flags.
* Add docs on how to handle errors.
* Add test cases based on real PostHog config.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We have some gaps in our traces. This indicates missing spans.
## Summary of changes
This PR adds two new spans:
* WAIT_EXECUTOR: time a batched request spends in the batch waiting to
be picked up
* FLUSH_RESPONSE: time a get page request spends flushing the response
to the compute

## Problem
The page API gRPC errors need a few tweaks to integrate better with the
GetPage machinery.
Touches https://github.com/neondatabase/neon/issues/11728.
## Summary of changes
* Add `GetPageStatus::InternalError` for internal server errors.
* Rename `GetPageStatus::Invalid` to `InvalidRequest` for clarity.
* Rename `status` and `GetPageStatus` to `status_code` and
`GetPageStatusCode`.
* Add an `Into<tonic::Status>` implementation for `ProtocolError`.
Some tests still explicitly specify version 3 of the safekeeper
walproposer protocol. Remove the explicit opt in from the tests as v3 is
the default now since #11518.
We don't touch the places where a test exercises both v2 and v3. Those
we leave for #12021.
Part of https://github.com/neondatabase/neon/issues/10326
## Problem
Test coverage of timeline imports is lacking.
## Summary of changes
This PR adds a chaos import test. It runs an import while injecting
various chaos events
in the environment. All the commits that follow the test fix various
issues that were surfaced by it.
Closes https://github.com/neondatabase/neon/issues/10191
## Problem
The regression test for the extension online_advisor fails on the
staging instance due to a lack of permission to alter the database.
## Summary of changes
A script was added to work around this problem.
---------
Co-authored-by: Alexander Lakhin <alexander.lakhin@neon.tech>
In order to enable TLS connections between computes and safekeepers, we
need to provide the control plane with a way to configure the various
libpq keyword parameters, sslmode and sslrootcert. neon.safekeepers is a
comma separated list of safekeepers formatted as host:port, so isn't
available for extension in the same way that neon.pageserver_connstring
is. This could be remedied in a future PR.
Part-of: https://github.com/neondatabase/cloud/issues/25823
Link:
https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-PARAMKEYWORDS
Signed-off-by: Tristan Partin <tristan@neon.tech>
Support timeline creations on the storage controller to opt out from
their creation on the safekeepers, introducing the read-only timelines
concept. Read only timelines:
* will never receive WAL of their own, so it's fine to not create them
on the safekeepers
* the property is non-transitive. children of read-only timelines aren't
neccessarily read-only themselves.
This feature can be used for snapshots, to prevent the safekeepers from
being overloaded by empty timelines that won't ever get written to. In
the current world, this is not a problem, because timelines are created
implicitly by the compute connecting to a safekeeper that doesn't have
the timeline yet. In the future however, where the storage controller
creates timelines eagerly, we should watch out for that.
We represent read-only timelines in the storage controller database so
that we ensure that they never touch the safekeepers at all. Especially
we don't want them to cause a mess during the importing process of the
timelines from the cplane to the storcon database.
In a hypothetical future where we have a feature to detach timelines
from safekeepers, we'll either need to find a way to distinguish the
two, or if not, asking safekeepers to list the (empty) timeline prefix
and delete everything from it isn't a big issue either.
This patch will unconditionally hit the new safekeeper timeline creation
path for read-only timelines, without them needing the
`--timelines-onto-safekeepers` flag enabled. This is done because it's
lower risk (no safekeepers or computes involved at all) and gives us
some initial way to verify at least some parts of that code in prod.
https://github.com/neondatabase/cloud/issues/29435https://github.com/neondatabase/neon/issues/11670
Previously we were using project-id/endpoint-id as SYSLOGTAG, which has
a
limit of 32 characters, so the endpoint-id got truncated.
The output is now in RFC5424 format, where the message is json encoded
with additional metadata `endpoint_id` and `project_id`
Also as pgaudit logs multiline messages, we now detect this by parsing
the timestamp in the specific format, and consider non-matching lines to
belong in the previous log message.
Using syslog structured-data would be an alternative, but leaning
towards json
due to being somewhat more generic.
## Problem
part of https://github.com/neondatabase/neon/issues/11813
## Summary of changes
* Integrate feature store with tenant structure.
* gc-compaction picks up the current strategy from the feature store.
* We only log them for now for testing purpose. They will not be used
until we have more patches to support different strategies defined in
PostHog.
* We don't support property-based evaulation for now; it will be
implemented later.
* Evaluating result of the feature flag is not cached -- it's not
efficient and cannot be used on hot path right now.
* We don't report the evaluation result back to PostHog right now.
I plan to enable it in staging once we get the patch merged.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We see unexpected basebackup error alerts in the alert channel.
https://github.com/neondatabase/neon/pull/11778 only fixed the alerts
for shutdown errors. However, another path is that tenant shutting down
while waiting LSN -> WaitLsnError::BadState -> QueryError::Reconnect.
Therefore, the reconnect error should also be discarded from the
ok/error counter.
## Summary of changes
Do not increase ok/err counter for reconnect errors.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We didn't test the h3 extension in our test suite.
## Summary of changes
Added tests for h3 and h3-postgis extensions
Includes upgrade test for h3
---------
Co-authored-by: Tristan Partin <tristan@neon.tech>
## Problem
For the gRPC Pageserver API, we should convert the Protobuf types to
stricter, canonical Rust types.
Touches https://github.com/neondatabase/neon/issues/11728.
## Summary of changes
Adds Rust domain types that mirror the Protobuf types, with conversion
and validation.
## Problem
We need authentication for the gRPC server.
Requires #11972.
Touches #11728.
## Summary of changes
Add two request interceptors that decode the tenant/timeline/shard
metadata and authenticate the JWT token against them.
## Problem
We want to expose the page service over gRPC, for use with the
communicator.
Requires #11995.
Touches #11728.
## Summary of changes
This patch wires up a gRPC server in the Pageserver, using Tonic. It
does not yet implement the actual page service.
* Adds `listen_grpc_addr` and `grpc_auth_type` config options (disabled
by default).
* Enables gRPC by default with `neon_local`.
* Stub implementation of `page_api.PageService`, returning unimplemented
errors.
* gRPC reflection service for use with e.g. `grpcurl`.
Subsequent PRs will implement the actual page service, including
authentication and observability.
Notably, TLS support is not yet implemented. Certificate reloading
requires us to reimplement the entire Tonic gRPC server.
## Problem
For #11992 I realised we need to get the type info before executing the
query. This is important to know how to decode rows with custom types,
eg the following query:
```sql
CREATE TYPE foo AS ENUM ('foo','bar','baz');
SELECT ARRAY['foo'::foo, 'bar'::foo, 'baz'::foo] AS data;
```
Getting that to work was harder that it seems. The original
tokio-postgres setup has a split between `Client` and `Connection`,
where messages are passed between. Because multiple clients were
supported, each client message included a dedicated response channel.
Each request would be terminated by the `ReadyForQuery` message.
The flow I opted to use for parsing types early would not trigger a
`ReadyForQuery`. The flow is as follows:
```
PARSE "" // parse the user provided query
DESCRIBE "" // describe the query, returning param/result type oids
FLUSH // force postgres to flush the responses early
// wait for descriptions
// check if we know the types, if we don't then
// setup the typeinfo query and execute it against each OID:
PARSE typeinfo // prepare our typeinfo query
DESCRIBE typeinfo
FLUSH // force postgres to flush the responses early
// wait for typeinfo statement
// for each OID we don't know:
BIND typeinfo
EXECUTE
FLUSH
// wait for type info, might reveal more OIDs to inspect
// close the typeinfo query, we cache the OID->type map and this is kinder to pgbouncer.
CLOSE typeinfo
// finally once we know all the OIDs:
BIND "" // bind the user provided query - already parsed - to the user provided params
EXECUTE // run the user provided query
SYNC // commit the transaction
```
## Summary of changes
Please review commit by commit. The main challenge was allowing one
query to issue multiple sub-queries. To do this I first made sure that
the client could fully own the connection, which required removing any
shared client state. I then had to replace the way responses are sent to
the client, by using only a single permanent channel. This required some
additional effort to track which query is being processed. Lastly I had
to modify the query/typeinfo functions to not issue `sync` commands, so
it would fit into the desired flow above.
To note: the flow above does force an extra roundtrip into each query. I
don't know yet if this has a measurable latency overhead.
## Problem
- Benchmark periodic pagebench had inconsistent benchmarking results
even when run with the same commit hash.
Hypothesis is this was due to running on dedicated but virtualized EC
instance with varying CPU frequency.
- the dedicated instance type used for the benchmark is quite "old" and
we increasingly get `An error occurred (InsufficientInstanceCapacity)
when calling the StartInstances operation (reached max retries: 2):
Insufficient capacity.`
- periodic pagebench uses a snapshot of pageserver timelines to have the
same layer structure in each run and get consistent performance.
Re-creating the snapshot was a painful manual process (see
https://github.com/neondatabase/cloud/issues/27051 and
https://github.com/neondatabase/cloud/issues/27653)
## Summary of changes
- Run the periodic pagebench on a custom hetzner GitHub runner with
large nvme disk and governor set to defined perf profile
- provide a manual dispatch option for the workflow that allows to
create a new snapshot
- keep the manual dispatch option to specify a commit hash useful for
bi-secting regressions
- always use the newest created snapshot (S3 bucket uses date suffix in
S3 key, example
`s3://neon-github-public-dev/performance/pagebench/shared-snapshots-2025-05-17/`
- `--ignore`
`test_runner/performance/pageserver/pagebench/test_pageserver_max_throughput_getpage_at_latest_lsn.py`
in regular benchmarks run for each commit
- improve perf copying snapshot by using `cp` subprocess instead of
traversing tree in python
## Example runs with code in this PR:
- run which creates new snapshot
https://github.com/neondatabase/neon/actions/runs/15083408849/job/42402986376#step:19:55
- run which uses latest snapshot
-
https://github.com/neondatabase/neon/actions/runs/15084907676/job/42406240745#step:11:65
## Problem
We're about to implement a gRPC interface for Pageserver. Let's upgrade
Tonic first, to avoid a more painful migration later. It's currently
only used by storage-broker.
Touches #11728.
## Summary of changes
Upgrade Tonic 0.12.3 → 0.13.1. Also opportunistically upgrade Prost
0.13.3 → 0.13.5. This transitively pulls in Indexmap 2.0.1 → 2.9.0, but
it doesn't appear to be used in any particularly critical code paths.
## Problem
See https://github.com/neondatabase/neon/issues/11997
This guard prevents race condition with pump prefetch state (initiated
by timeout).
Assert checks that prefetching is also done under guard.
But prewarm knows nothing about it.
## Summary of changes
Pump prefetch state only in regular backends.
Prewarming is done by background workers now.
Also it seems to have not sense to pump prefetch state in any other
background workers: parallel executors, vacuum,... because they are
short living and can not leave unconsumed responses in socket.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
Detect problems with Postgres optimiser: lack of indexes and statistics
## Summary of changes
https://github.com/knizhnik/online_advisor
Add online_advistor extension to docker image
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
https://github.com/neondatabase/neon/pull/11988 waits only for max
~200ms, so we still see failures, which self-resolve after several
operation retries.
## Summary of changes
Change it to waiting for at least 5 seconds, starting with 2 ms sleep
between iterations and x2 sleep on each next iteration. It could be that
it's not a problem with a slow `rsyslog` start, but a longer wait won't
hurt. If it won't start, we should debug why `inittab` doesn't start it,
or maybe there is another problem.
Add retry loop around waiting for rsyslog start
## Problem
## Summary of changes
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@garret.ru>
Co-authored-by: Matthias van de Meent <matthias@neon.tech>
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
Basebackup cache is on the hot path of compute startup and is generated
on every request (may be slow).
- Issue: https://github.com/neondatabase/cloud/issues/29353
## Summary of changes
- Add `BasebackupCache` which stores basebackups on local disk.
- Basebackup prepare requests are triggered by
`XLOG_CHECKPOINT_SHUTDOWN` records in the log.
- Limit the size of the cache by number of entries.
- Add `basebackup_cache_enabled` feature flag to TenantConfig.
- Write tests for the cache
## Not implemented yet
- Limit the size of the cache by total size in bytes
---------
Co-authored-by: Aleksandr Sarantsev <aleksandr@neon.tech>
## Problem
For billing, we'd like per-branch consumption metrics.
Requires https://github.com/neondatabase/neon/pull/11984.
Resolves https://github.com/neondatabase/cloud/issues/28155.
## Summary of changes
This patch adds two new consumption metrics:
* `written_size_since_parent`: `written_size - ancestor_lsn`
* `pitr_history_size_since_parent`: `written_size - max(pitr_cutoff,
ancestor_lsn)`
Note that `pitr_history_size_since_parent` will not be emitted until the
PITR cutoff has been computed, and may or may not increase ~immediately
when a user increases their PITR window (depending on how much history
we have available and whether the tenant is restarted/migrated).
## Problem
When testing local proxy the auth-endpoint password shows up in command
line and log
```bash
RUST_LOG=proxy LOGFMT=text cargo run --release --package proxy --bin proxy --features testing -- \
--auth-backend postgres \
--auth-endpoint 'postgresql://postgres:secret_password@127.0.0.1:5432/postgres' \
--tls-cert server.crt \
--tls-key server.key \
--wss 0.0.0.0:4444
```
## Summary of changes
- Allow to set env variable PGPASSWORD
- fall back to use PGPASSWORD env variable when auth-endpoint does not
contain password
- remove auth-endpoint password from logs in `--features testing` mode
Example
```bash
export PGPASSWORD=secret_password
RUST_LOG=proxy LOGFMT=text cargo run --package proxy --bin proxy --features testing -- \
--auth-backend postgres \
--auth-endpoint 'postgresql://postgres@127.0.0.1:5432/postgres' \
--tls-cert server.crt \
--tls-key server.key \
--wss 0.0.0.0:4444
```
## Problem
It is not currently possible to disambiguate a timeline with an
uninitialized PITR cutoff from one that was created within the PITR
window -- both of these have `GcCutoffs::time == Lsn(0)`. For billing
metrics, we need to disambiguate these to avoid accidentally billing the
entire history when a tenant is initially loaded.
Touches https://github.com/neondatabase/cloud/issues/28155.
## Summary of changes
Make `GcCutoffs::time` an `Option<Lsn>`, and only set it to `Some` when
initialized. A `pitr_interval` of 0 will yield `Some(last_record_lsn)`.
This PR takes a conservative approach, and mostly retains the old
behavior of consumers by using `unwrap_or_default()` to yield 0 when
uninitialized, to avoid accidentally introducing bugs -- except in cases
where there is high confidence that the change is beneficial (e.g. for
the `pageserver_pitr_history_size` Prometheus metric and to return early
during GC).
## Problem
Hitting max_client_conn from pgbouncer would lead to invalidation of the
conn info cache.
Customers would hit the limit on wake_compute.
## Summary of changes
`should_retry_wake_compute` detects this specific error from pgbouncer
as non-retriable,
meaning we won't try to wake up the compute again.
## Problem
When using an incorrect endpoint string - `"localhost:4317"`, it's a
runtime error, but it can be a config error
- Closes: https://github.com/neondatabase/neon/issues/11394
## Summary of changes
Add config parse time check via `request::Url::parse` validation.
---------
Co-authored-by: Aleksandr Sarantsev <ephemeralsad@gmail.com>
#11962
Please review each commit separately.
Each commit is rather small in goal. The overall goal of this PR is to
keep the behaviour identical, but shave away small inefficiencies here
and there.
## Problem
See
Discussion:
https://neondb.slack.com/archives/C033RQ5SPDH/p1746645666075799
Issue: https://github.com/neondatabase/cloud/issues/28609
Relation size cache is not correctly updated at PS in case of replicas.
## Summary of changes
1. Have two caches for relation size in timeline:
`rel_size_primary_cache` and `rel_size_replica_cache`.
2. `rel_size_primary_cache` is actually what we have now. The only
difference is that it is not updated in `get_rel_size`, only by WAL
ingestion
3. `rel_size_replica_cache` has limited size (LruCache) and it's key is
`(Lsn,RelTag)` . It is updated in `get_rel_size`. Only strict LSN
matches are accepted as cache hit.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
In the escaping path we were checking that `${tag}$` or `${outer_tag}$`
are present in the string, but that's not enough, as original string
surrounded by `$` can also form a 'tag', like `$x$xx$x$`, which is fine
on it's own, but cannot be used in the string escaped with `$xx$`.
## Summary of changes
Remove `$` from the checks, just check if `{tag}` or `{outer_tag}` are
present. Add more test cases and change the catalog test to stress the
`drop_subscriptions_before_start: true` path as well.
Fixes https://github.com/neondatabase/cloud/issues/29198
## Problem
The gRPC page service API will require decoupling the `PageHandler` from
the libpq protocol implementation. As preparation for this, avoid
passing in the entire server config to `PageHandler`, and instead
explicitly pass in the relevant fields.
Touches https://github.com/neondatabase/neon/issues/11728.
## Summary of changes
* Change `PageHandler` to take a `GetVectoredConcurrentIo` instead of
the entire config.
* Change `IoConcurrency::spawn_from_conf` to take a
`GetVectoredConcurrentIo`.
## Problem
A binary Protobuf schema descriptor can be used to expose an API
reflection service, which in turn allows convenient usage of e.g.
`grpcurl` against the gRPC server.
Touches #11728.
## Summary of changes
* Generate a binary schema descriptor as
`pageserver_page_api::proto::FILE_DESCRIPTOR_SET`.
* Opportunistically rename the Protobuf package from `page_service` to
`page_api`.
## Problem
For the [communicator
project](https://github.com/neondatabase/company_projects/issues/352),
we want to move to gRPC for the page service protocol.
Touches #11728.
## Summary of changes
This patch adds an experimental gRPC Protobuf schema for the page
service. It is equivalent to the current page service, but with several
improvements, e.g.:
* Connection multiplexing.
* Reduced head-of-line blocking.
* Client-side batching.
* Explicit tenant shard routing.
* GetPage request classification (normal vs. prefetch).
* Explicit rate limiting ("slow down" response status).
The API is exposed as a new `pageserver/page_api` package. This is
separate from the `pageserver_api` package to reduce the dependency
footprint for the communicator. The longer-term plan is to also split
out e.g. the WAL ingestion service to a separate gRPC package, e.g.
`pageserver/wal_api`.
Subsequent PRs will: add Rust domain types for the Protobuf types,
expose a gRPC server, and implement the page service.
Preliminary prototype benchmarks of this gRPC API is within 10% of
baseline libpq performance. We'll do further benchmarking and
optimization as the implementation lands in `main` and is deployed to
staging.
## Problem
Currently the `logger` library throws annoying deprecation warnings:
```python
DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
```
## Summary of changes
This small PR resolves the annoying deprecation warnings by migrating to
`.warning` as suggested.
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>