There are a couple of places that call CompactionError::is_cancel but
don't check the Other variant for a cancellation root cause. They should,
because some cancellations are observed by code that surfaces them as
::Other errors.
I don't think there's a _serious_ case where this causes problems.
The worst one is the circuit breaker, which we currently trip on ::Other
errors that are due to cancellation. Tripping the circuit breaker for
timelines that are shutting down doesn't matter much in practice, but it's
unaesthetic and might cause noise down the line, so this PR fixes that at least.
In any case, this PR forces future callers of is_cancel() to explicitly
acknowledge the suboptimal state of error handling in compaction, thereby
hopefully preventing errors of this kind from creeping in.
(The _right_ solution for the compaction code is probably the approach
I took in #11853: keep using anyhow but have a unified way / pattern
of bubbling up cancellation, so that we don't need the downcast trickery.)
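Below is a minimal, self-contained sketch of the check this pushes callers toward; the variant and type names are simplified stand-ins for the actual pageserver definitions, and the `Cancelled` error is hypothetical. It only illustrates how a cancellation can hide behind the Other variant's anyhow chain:

```rust
#[derive(Debug)]
struct Cancelled;

impl std::fmt::Display for Cancelled {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "operation cancelled")
    }
}

impl std::error::Error for Cancelled {}

enum CompactionError {
    ShuttingDown,
    Other(anyhow::Error),
}

impl CompactionError {
    fn is_cancel(&self) -> bool {
        matches!(self, CompactionError::ShuttingDown)
    }

    /// What callers such as the circuit breaker actually want to know:
    /// was this a cancellation, whether surfaced directly or wrapped in
    /// an anyhow chain inside the Other variant?
    fn is_cancel_or_cancelled_root_cause(&self) -> bool {
        match self {
            CompactionError::ShuttingDown => true,
            CompactionError::Other(err) => {
                err.root_cause().downcast_ref::<Cancelled>().is_some()
            }
        }
    }
}

fn main() {
    let wrapped =
        CompactionError::Other(anyhow::Error::new(Cancelled).context("compaction failed"));
    assert!(!wrapped.is_cancel()); // the naive check misses the cancellation
    assert!(wrapped.is_cancel_or_cancelled_root_cause());
}
```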
## Problem
We want to reproduce an OOM situation, which requires large partial reads.
## Summary of Changes
Make the max partial read size configurable for import jobs.
## Problem
We don't currently run tests for PostGIS in our test environment.
## Summary of Changes
- Added PostGIS test support for PostgreSQL v16 and v17
- Configured different PostGIS versions based on PostgreSQL version:
- PostgreSQL v17: PostGIS 3.5.0
- PostgreSQL v14/v15/v16: PostGIS 3.3.3
- Added necessary test scripts and configurations
This ensures our PostgreSQL implementation remains compatible with this
widely-used extension.
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
## Problem
Part of https://github.com/neondatabase/neon/issues/11813
## Summary of changes
Add a counter on the feature evaluation outcome; we will set up
alerts for too many failed evaluations in the future.
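A rough sketch of the counter being described, assuming a Prometheus-style counter vec; the metric and label names here are illustrative, not the actual ones:

```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_counter_vec, IntCounterVec};

static FEATURE_FLAG_EVALUATION: Lazy<IntCounterVec> = Lazy::new(|| {
    register_int_counter_vec!(
        "feature_flag_evaluation_total",
        "Outcome of feature flag evaluations",
        &["flag", "outcome"] // outcome: "ok" or "error"
    )
    .expect("failed to register metric")
});

/// Record one evaluation; an alert can later fire on the rate of the
/// `outcome="error"` series.
fn record_evaluation(flag: &str, ok: bool) {
    let outcome = if ok { "ok" } else { "error" };
    FEATURE_FLAG_EVALUATION
        .with_label_values(&[flag, outcome])
        .inc();
}
```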
Signed-off-by: Alex Chi Z <chi@neon.tech>
JsonResponse::error() properly logs an error message which can be read
in the compute logs. invalid_status() was not going through that helper
function, thus not logging anything.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
This is already the default in production and in our test suite.
## Summary of changes
Set the default proto to interpreted to reduce friction when spinning up
new regions or cells.
## Problem
Currently, `page_api` domain types validate message invariants both when
converting Protobuf → domain and domain → Protobuf. This is annoying for
clients, because they can't use stream combinators to convert streamed
requests (needed for hot path performance), and also performs the
validation twice in the common case.
Blocks #12099.
## Summary of changes
Only validate the Protobuf → domain type conversion, i.e. on the
receiver side, and make domain → Protobuf infallible. This is where it
matters -- the Protobuf types are less strict than the domain types, and
receivers should expect all sorts of junk from senders (they're not
required to validate anyway, and can just construct an invalid message
manually).
Also adds a missing `impl From<CheckRelExistsRequest> for
proto::CheckRelExistsRequest`.
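An illustrative sketch of the asymmetry, using `CheckRelExistsRequest` as an example; the `rel_kind` field and its validation rule are made up, only the direction of fallibility mirrors the change:

```rust
mod proto {
    // Stand-in for a Protobuf-generated type: loose wire representation.
    pub struct CheckRelExistsRequest {
        pub rel_kind: u32,
    }
}

// Stand-in domain type: stricter invariants than the wire type.
pub struct CheckRelExistsRequest {
    pub rel_kind: u8,
}

// Receiver side: Protobuf -> domain validates, because senders may put
// arbitrary junk on the wire.
impl TryFrom<proto::CheckRelExistsRequest> for CheckRelExistsRequest {
    type Error = String;

    fn try_from(p: proto::CheckRelExistsRequest) -> Result<Self, Self::Error> {
        let rel_kind =
            u8::try_from(p.rel_kind).map_err(|_| format!("invalid rel_kind: {}", p.rel_kind))?;
        Ok(Self { rel_kind })
    }
}

// Sender side: domain -> Protobuf is infallible, since the domain value is
// already valid and the wire type is strictly more permissive.
impl From<CheckRelExistsRequest> for proto::CheckRelExistsRequest {
    fn from(d: CheckRelExistsRequest) -> Self {
        Self {
            rel_kind: d.rel_kind.into(),
        }
    }
}
```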
## Problem
Setting `max_batch_size` to anything higher than
`Timeline::MAX_GET_VECTORED_KEYS` will cause a runtime error. We should
instead fail fast at startup if this is the case.
## Summary of changes
* Create `max_get_vectored_keys` as a new configuration option (defaulting to 32);
* Validate `max_batch_size` against `max_get_vectored_keys` right at
config parsing and validation time.
Closes https://github.com/neondatabase/neon/issues/11994
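A hedged sketch of the fail-fast check; the surrounding config struct is simplified, and only the two fields named above come from the PR:

```rust
struct PageServerConf {
    max_get_vectored_keys: usize, // new knob, default 32
    max_batch_size: usize,
}

impl PageServerConf {
    /// Called during config parsing/validation so a bad combination fails at
    /// startup instead of at request time.
    fn validate(&self) -> Result<(), String> {
        if self.max_batch_size > self.max_get_vectored_keys {
            return Err(format!(
                "max_batch_size ({}) must not exceed max_get_vectored_keys ({})",
                self.max_batch_size, self.max_get_vectored_keys
            ));
        }
        Ok(())
    }
}
```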
## Problem
The gRPC `page_api` domain types used smallvecs to avoid heap
allocations in the common case where a single page is requested.
However, this is pointless: the Protobuf types use a normal vec, and
converting a smallvec into a vec always causes a heap allocation anyway.
## Summary of changes
Use a normal `Vec` instead of a `SmallVec` in `page_api` domain types.
## Problem
We print a backtrace in an info level log every 10 seconds while waiting
for the import data to land in the bucket.
## Summary of changes
The backtrace is not useful. Remove it.
## Problem
We should use the full host name for computes, according to
https://github.com/neondatabase/cloud/issues/26005, but currently a
truncated host name is used.
## Summary of changes
The URL for REMOTE_ONNX is rewritten using the FQDN.
## Problem
Fixes https://github.com/neondatabase/neon/issues/12101; this is a quick
hack and we need a better API in the future.
In `get_db_size`, we call `get_reldir_size` for every relation. However,
this deserializes the reldir directory again for every relation, which
creates huge CPU overhead.
## Summary of changes
Get and deserialize the reldir v1 key once and use it across all
get_rel_size requests.
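A simplified sketch of the caching, with stand-in types; the real code fetches and deserializes the reldir v1 key, and the point is only that this now happens once per `get_db_size` call:

```rust
type RelTag = u32;

struct RelDirectory {
    rels: Vec<RelTag>,
}

fn load_and_deserialize_reldir(_db_oid: u32) -> RelDirectory {
    // In the real code this fetches and deserializes the reldir v1 key.
    RelDirectory { rels: vec![1, 2, 3] }
}

fn get_rel_size(_rel: RelTag, _dir: &RelDirectory) -> u64 {
    // Per-relation size lookup no longer re-deserializes the directory.
    8192
}

fn get_db_size(db_oid: u32) -> u64 {
    let dir = load_and_deserialize_reldir(db_oid); // deserialize once
    dir.rels.iter().map(|rel| get_rel_size(*rel, &dir)).sum()
}
```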
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We should expose the page service over gRPC.
Requires #12093.
Touches #11728.
## Summary of changes
This patch adds an initial page service implementation over gRPC. It
ties in with the existing `PageServerHandler` request logic, to avoid
the implementations drifting apart for the core read path.
This is just a bare-bones functional implementation. Several important
aspects have been omitted, and will be addressed in follow-up PRs:
* Limited observability: minimal tracing, no logging, limited metrics
and timing, etc.
* Rate limiting will currently block.
* No performance optimization.
* No cancellation handling.
* No tests.
I've only done rudimentary testing of this, but Pagebench passes at
least.
A smaller version of #12066 that is somewhat easier to review.
Now that I've been using https://crates.io/crates/top-type-sizes, I've
found a lot more low-hanging fruit that can be tweaked to reduce
memory usage.
Some context for the optimisations:
Rust's stack allocation in futures is quite naive. Stack variables, even
if moved, often still end up taking space in the future. Rearranging the
order in which variables are defined, and properly scoping them can go a
long way.
One consequence of `async fn` and `async move {}` is that they always
duplicate their "upvars" (aka captures). All captures are permanently
allocated in the future, even if they are moved, so when writing futures
we can be mindful to capture as little as possible.
`TlsStream` is massive; it needs boxing so it doesn't contribute to the
above issue.
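A toy illustration of both points, assuming a hypothetical oversized stream type; this is not the actual proxy code, it just shows how boxing a large capture and scoping temporaries keep a future's layout small:

```rust
// A large value we'd rather not embed in every future that touches it.
struct BigTlsStream {
    _state: [u8; 4096],
}

async fn handle_client(stream: Box<BigTlsStream>) {
    // Boxing means this future's layout holds a pointer, not the 4 KiB state.
    let _stream = stream;

    {
        // A properly scoped temporary: it is dropped before the await below,
        // which helps keep it out of the future's persistent state.
        let scratch = [0u8; 1024];
        let _ = scratch;
    }

    std::future::ready(()).await;
}

fn main() {
    let fut = handle_client(Box::new(BigTlsStream { _state: [0; 4096] }));
    // With the Box, the stream lives behind a pointer, so the future stays small.
    println!("future size: {} bytes", std::mem::size_of_val(&fut));
}
```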
## Measurements from `top-type-sizes`:
### Before
```
10328 {async block@proxy::proxy::task_main::{closure#0}::{closure#0}} align=8
6120 {async fn body of proxy::proxy::handle_client<proxy::protocol2::ChainRW<tokio::net::TcpStream>>()} align=8
```
### After
```
4040 {async block@proxy::proxy::task_main::{closure#0}::{closure#0}}
4704 {async fn body of proxy::proxy::handle_client<proxy::protocol2::ChainRW<tokio::net::TcpStream>>()} align=8
```
## Problem
We need gRPC support in Pagebench to benchmark the new gRPC Pageserver
implementation.
Touches #11728.
## Summary of changes
Adds a `Client` trait to make the client transport swappable, and a gRPC
client selected via a `--protocol grpc` parameter. The connstring must then
also specify the gRPC port:
```
pagebench get-page-latest-lsn --protocol grpc --page-service-connstring grpc://localhost:51051
```
The client is implemented using the raw Tonic-generated gRPC client, to
minimize client overhead.
## Problem
The page service logic asserts that a tracing span is present with
tenant/timeline/shard IDs. An initial gRPC page service implementation
thus requires a tracing span.
Touches https://github.com/neondatabase/neon/issues/11728.
## Summary of changes
Adds an `ObservabilityLayer` middleware that generates a tracing span
and decorates it with IDs from the gRPC metadata.
This is a minimal implementation to address the tracing span assertion.
It will be extended with additional observability in later PRs.
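A hedged sketch of the span decoration; the metadata key names are assumptions, and the actual `ObservabilityLayer` is a tower middleware that instruments the request future with a span like this:

```rust
use tonic::metadata::MetadataMap;
use tracing::{info_span, Span};

// Build a tracing span from the IDs carried in the gRPC request metadata.
fn request_span(metadata: &MetadataMap) -> Span {
    let get = |key: &str| -> String {
        metadata
            .get(key)
            .and_then(|v| v.to_str().ok())
            .unwrap_or("unknown")
            .to_string()
    };
    // The middleware enters/instruments with this span, so the downstream
    // handler's span assertion is satisfied.
    info_span!(
        "grpc_page_service_request",
        tenant_id = %get("neon-tenant-id"),
        timeline_id = %get("neon-timeline-id"),
        shard_id = %get("neon-shard-id"),
    )
}
```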
## Problem
If the wal receiver is cancelled, there's a 50% chance that it will
ingest yet more WAL.
## Summary of Changes
Always check cancellation first.
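A sketch of the pattern, assuming the 50/50 behaviour comes from an unbiased `tokio::select!` (which polls ready branches in random order); this is not the actual walreceiver code:

```rust
use tokio_util::sync::CancellationToken;

// `biased;` makes the cancellation branch win whenever it is ready, instead
// of racing it against a ready WAL item.
async fn ingest_loop(cancel: CancellationToken, mut wal: tokio::sync::mpsc::Receiver<Vec<u8>>) {
    loop {
        tokio::select! {
            biased;
            _ = cancel.cancelled() => break,
            maybe_batch = wal.recv() => match maybe_batch {
                Some(batch) => {
                    // ingest the batch
                    let _ = batch;
                }
                None => break, // sender dropped
            },
        }
    }
}
```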
Precursor to https://github.com/neondatabase/cloud/issues/28333.
We want per-endpoint configuration for rate limits, which will be
distributed via the `GetEndpointAccessControl` API. This lays some of
the groundwork.
1. Allow the endpoint rate limiter to accept a custom leaky bucket
config on check (see the sketch at the end of this section).
2. Remove the unused auth rate limiter, as I don't want to think about
how it fits into this.
3. Refactor the caching of `GetEndpointAccessControl`, as it adds
friction for adding new cached data to the API.
That third one was rather large; I couldn't find any way to split it up.
The core idea is that there are now only two cache APIs:
`get_endpoint_access_controls` and `get_role_access_controls`.
I'm pretty sure the behaviour is unchanged, except for a drive-by change
to fix #8989 because it felt harmless. The change in question is that when
password validation fails, we eagerly expire the role cache entry if the
role has been cached for 5 minutes. This allows for edge cases where a user
tries to connect with a reset password but the cache never expires the
entry due to some Redis-related quirk (lag, misconfiguration, or a cplane
error).
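A rough sketch of item 1 above; the type and field names are illustrative, not proxy's actual rate limiter API:

```rust
/// Per-endpoint leaky bucket parameters; field names are illustrative.
#[derive(Clone, Copy)]
struct LeakyBucketConfig {
    rps: f64,
    max_burst: f64,
}

const DEFAULT_BUCKET: LeakyBucketConfig = LeakyBucketConfig {
    rps: 100.0,
    max_burst: 200.0,
};

struct EndpointRateLimiter {
    // per-endpoint bucket state elided
}

impl EndpointRateLimiter {
    /// The check now accepts an optional per-endpoint config (eventually
    /// distributed via GetEndpointAccessControl), falling back to the
    /// global default when none is supplied.
    fn check(&self, endpoint_id: &str, config: Option<LeakyBucketConfig>) -> bool {
        let config = config.unwrap_or(DEFAULT_BUCKET);
        // ... consult/update the bucket for `endpoint_id` using `config` ...
        let _ = (endpoint_id, config.rps, config.max_burst);
        true
    }
}
```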
libs/pqproto is designed for the safekeeper/pageserver with maximum
throughput in mind.
Proxy only needs it for handshakes/authentication, where throughput is
not a concern but memory efficiency is. For this reason, we switch to
using read_exact and only allocate as much memory as we need.
All reads return a `&'a [u8]` instead of a `Bytes` because accidental
sharing of bytes can cause fragmentation. Returning the reference
enforces that callers only hold onto the bytes they absolutely need. For
example, before this change, `pqproto` was allocating 8KiB for the
initial read `BytesMut`, and proxy was holding the `Bytes` in the
`StartupMessageParams` for the entire connection through to passthrough.
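A minimal sketch of the read pattern, assuming startup-packet-style framing; this is not the actual pqproto API, just an illustration of `read_exact` plus a borrowed return:

```rust
use std::io::{self, Read};

// Read one length-prefixed message: a 4-byte big-endian length (which, as in
// the postgres startup packet, includes the length field itself), then exactly
// that many body bytes. The caller gets a borrowed slice, so the bytes cannot
// accidentally be retained past the next read.
fn read_message<'a, R: Read>(reader: &mut R, buf: &'a mut Vec<u8>) -> io::Result<&'a [u8]> {
    let mut len_bytes = [0u8; 4];
    reader.read_exact(&mut len_bytes)?;
    let body_len = (u32::from_be_bytes(len_bytes) as usize).saturating_sub(4);

    buf.resize(body_len, 0); // allocate only what this message needs
    reader.read_exact(buf)?;
    Ok(&buf[..])
}
```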
## Problem
The release pipeline is failing because some tools cannot be installed.
## Summary of changes
Install with `--locked`.
Signed-off-by: Alex Chi Z <chi@neon.tech>
The expected operating range for the production NVMe drives is
50 to 250us.
The bucket boundaries before this PR were not well suited
for reasoning about the utilization / queuing / latency variability
of those devices.
# Performance
There was some concern about the perf impact of having so many buckets,
considering the impl does a linear search on each observe().
I added a benchmark and measured on the relevant machines.
In any case, this PR ends up with 40 buckets, so it won't make a
meaningful difference on production machines (im4gn.2xlarge):
observe() goes from 30ns to 35ns.
## Problem
There are a bunch of TODOs in the import code.
## Summary of changes
1. Bound max import byte range to 128MiB. This might still be too high,
given the default job concurrency, but it needs to be balanced with
going back and forth to S3.
2. Prevent unsigned overflow when determining key range splits for
concurrent jobs.
3. Use sharded ranges to estimate task size when splitting jobs.
4. Bubble up errors that we might hit due to invalid data in the bucket
back to the storage controller.
5. Tweak the import bucket S3 client configuration.
## Problem
I noticed a small percentage of flakes on some import tests.
They were all instances of the storage controller being too eager with
finalization.
As a refresher: the pageserver notifies the storage controller from the
import task when it's done, and the storage controller has to call back
into it in order to finalize the import. The pageserver checks that the
import task is done before serving that request. Hence, we can get this
race.
In practice, this has no impact since the storage controller will simply
retry.
## Summary of changes
Allow-list such cases.
This patch is a fixup for
- https://github.com/neondatabase/neon/pull/6788
Background
----------
That PR 6788 added artificial advancement of `disk_consistent_lsn`
and `remote_consistent_lsn` for shards that weren't written to
while other shards _were_ written to.
See the PR description for more context.
At the time of that PR, Pageserver shards were doing WAL filtering.
Nowadays, the WAL filtering happens in Safekeepers.
Shards learn about the WAL gaps via
`InterpretedWalRecords::next_record_lsn`.
The Bug
-------
That artificial advancement code also runs if the flush failed.
So, we advance disk_consistent_lsn / remote_consistent_lsn
without having added the corresponding L0 to `index_part.json`.
The frozen layer remains in the layer map until detach,
so we continue to serve data correctly.
We're not advancing the flush loop variable `flushed_to_lsn` either,
so subsequent flush requests will retry the flush and repair the
situation if they succeed.
But if there aren't any successful retries, eventually the tenant
will be detached, and when it is attached somewhere else, the
`index_part.json` and therefore the layer map...
1. ... do not contain the frozen layer that failed to flush, and
2. ... the WAL won't be re-ingested either, because walreceiver
starts up with the advanced disk_consistent_lsn/remote_consistent_lsn.
The result is that the read path will have a gap in the reconstruct
data for the keys whose modifications were lost, resulting in
a) either walredo failure
b) or an incorrect page@lsn image if walredo doesn't error.
The Fix
-------
The fix is to only do the artificial advancement if `result.is_ok()`.
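A minimal sketch of the guard, with LSNs and the flush result type simplified; the real logic lives in the pageserver flush loop:

```rust
fn maybe_advance_after_flush<E>(
    flush_result: &Result<(), E>,
    disk_consistent_lsn: &mut u64,
    frozen_to_lsn: u64,
) {
    // Only do the artificial advancement when the flush succeeded; otherwise
    // we would record an LSN for which no L0 made it into index_part.json.
    if flush_result.is_ok() {
        *disk_consistent_lsn = (*disk_consistent_lsn).max(frozen_to_lsn);
    }
}
```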
Misc
----
As an aside, I took some time to re-review the flush loop and its
callers.
I found one more bug related to error handling that I filed here:
- https://github.com/neondatabase/neon/issues/12025
## Problem
## Summary of changes
## Problem
Checking the most recent state of pageservers was insufficient to
evaluate whether another pageserver may read in a particular generation,
since the latest state might mask some earlier AttachedSingle state.
Related: https://github.com/neondatabase/neon/issues/11348
## Summary of changes
- Maintain a history of all attachments
- Write out explicit rules for when a pageserver may read
## Problem
Part of #11813. This pull request adds misc observability improvements
for the functionality.
## Summary of changes
* Info span for the PostHog feature background loop.
* New evaluate feature flag API.
* Put the request error into the error message.
* Log when feature flag gets updated.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
I have a hypothesis that import might be using a lower number of jobs than
the maximum for the VM where the job is running. This change will help us
find that out from the logs.
## Summary of changes
Added logging of the number of jobs passed into both `pg_dump` and
`pg_restore`.
Adds two metrics to the storcon that are related to the safekeeper
reconciler:
* `storage_controller_safkeeper_reconciles_queued` to indicate current
queue depth
* `storage_controller_safkeeper_reconciles_complete` to indicate the
number of complete reconciles
Both metrics operate on a per-safekeeper basis (as reconcilers run on a
per-safekeeper basis too).
These metrics mirror the `storage_controller_pending_reconciles` and
`storage_controller_reconcile_complete` metrics, although those are not
scoped on a per-pageserver basis but are global for the entire storage
controller.
Part of #11670
## Problem
Disk usage eviction isn't sensitive to layers of imported timelines.
## Summary of changes
Hook importing timelines up into eviction and add a test for it.
I don't think we need any special eviction logic for this. These layers
will all be visible and
their access time will be their creation time. Hence, we'll remove
covered layers first
and get to the imported layers if there's still disk pressure.
## Problem
Importing timelines can't currently be deleted. This is problematic
because:
1. Cplane cannot delete failed imports and we leave the timeline behind.
2. The flow does not support user-driven cancellation of the import.
## Summary of changes
On the pageserver: I've taken the path of least resistance, extended
`TimelineOrOffloaded`
with a new variant and added handling in the right places. I'm open to
thoughts here,
but I think it turned out better than I was envisioning.
On the storage controller: Again, fairly simple business: when a DELETE
timeline request is
received, we remove the import from the DB and stop any finalization
tasks/futures. In order
to stop finalizations, we track them in-memory. For each finalizing
import, we associate a gate
and a cancellation token.
Note that we delete the entry from the database before cancelling any
finalizations. This is such
that a concurrent request can't progress the import into finalize state
and race with the deletion.
This concern about deleting an import with ongoing finalization is
theoretical in the near term: we are only going to delete importing
timelines after the storage controller reports the failure to cplane.
Still, the design works for user-driven cancellation too.
Closes https://github.com/neondatabase/neon/issues/11897
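A rough sketch of the in-memory tracking, simplified to tokio primitives (a join handle stands in for the gate mentioned above); names are illustrative:

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use tokio::task::JoinHandle;
use tokio_util::sync::CancellationToken;

type TimelineId = u64;

#[derive(Default)]
struct FinalizingImports {
    inner: Mutex<HashMap<TimelineId, (CancellationToken, JoinHandle<()>)>>,
}

impl FinalizingImports {
    fn register(&self, id: TimelineId, handle: JoinHandle<()>) -> CancellationToken {
        let token = CancellationToken::new();
        self.inner.lock().unwrap().insert(id, (token.clone(), handle));
        token
    }

    async fn cancel_and_wait(&self, id: TimelineId) {
        // Called after the DB entry has been removed, so a concurrent request
        // can no longer move the import into the finalize state.
        let entry = self.inner.lock().unwrap().remove(&id);
        if let Some((token, handle)) = entry {
            token.cancel();
            let _ = handle.await;
        }
    }
}
```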
## Summary
The optimiser background loop could get delayed a lot by waiting for
timeouts trying to talk to offline nodes.
Fixes: #12056
## Solution
- Skip offline nodes in `get_top_tenant_shards`
Link to Devin run:
https://app.devin.ai/sessions/065afd6756734d33bbd4d012428c4b6e
Requested by: John Spray (john@neon.tech)
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: John Spray <john@neon.tech>
## Problem
Temporarily reduce the concurrency of gc-compaction to 1 job at a time.
We are going to roll out in the largest AWS region next week. Having one
job running at a time makes it easier to identify which tenant causes
problems if it's not running well, and to pause gc-compaction for that
specific tenant.
(We can make this configurable via the pageserver config in the future!)
## Summary of changes
Reduce `CONCURRENT_GC_COMPACTION_TASKS` from 2 to 1.
Signed-off-by: Alex Chi Z <chi@neon.tech>
The `test_storcon_create_delete_sk_down` test is still flaky. This patch
addresses two possible causes of flakiness. Both are related to deletion
racing with a `pull_timeline` that hasn't finished yet.
* the first cause is timeline deletion racing with `pull_timeline`:
* the first deletion attempt doesn't contain the line because the
timeline doesn't exist yet
* the subsequent deletion attempts don't contain it either, only a note
that the timeline is already deleted.
* so this patch adds the note that the timeline is already deleted to
the regex
* the second cause is about tenant deletion racing with `pull_timeline`:
* there were no tenant-specific tombstones, so if a tenant was deleted,
we only added tombstones for the specific timelines being deleted, not
for the tenant itself.
* This patch changes this, so we now have tenant-specific tombstones as
well as timeline-specific ones, and creation of a timeline checks both.
* we also don't see any retries of the tenant deletion in the logs: once
it's done, it's done. So extend the regex to contain the tenant deletion
message as well.
One could wonder why we use the regex instead of the API to check
whether the timeline is just "gone". The issue with the API is that it
doesn't allow one to distinguish between "deleted" and "has never
existed", and the latter case might race with `pull_timeline`. I.e. the
second flakiness cause helped in the discovery of a real bug (no tenant
tombstones), so the more precise check was helpful.
Before, I could easily reproduce 2-9 occurrences of flakiness when
running the test with an additional `range(128)` parameter (i.e. 218
times 4 times). With this patch, I ran it three times without a single
failure.
Fixes #11838
## Problem
Import planning takes a job size limit as its input. Previously, the job
size came from a pageserver config field. This field may change while
imports are in progress. If this happens, plans will no longer be
identical and the import would fail permanently.
## Summary of Changes
Bake the job size into the import progress reported to the storage
controller. For new imports, use the value from the pageserver config,
and, for existing imports, use the value present in the shard progress.
This value is identical for all shards, but we want it to be versioned
since future versions of the planner might split the jobs up
differently. Hence, it ends up in `ShardImportProgress`.
Closes https://github.com/neondatabase/neon/issues/11983
## Problem
Previous attempt https://github.com/neondatabase/neon/pull/10548 caused
some issues in staging and we reverted it. This is a re-attempt to
address https://github.com/neondatabase/neon/issues/11063.
Currently we create image layers at the latest record LSN. We would create
"future image layers" (i.e., image layers with an LSN larger than the disk
consistent LSN) that need special handling at startup. We also waste a
lot of read operations reconstructing from L0 layers when we could have
compacted all of the L0 layers and operated on a flat level of historic
layers.
## Summary of changes
* Run repartition at L0-L1 boundary.
* Roll out with feature flags.
* Piggyback a change that downgrades "image layer creating below
gc_cutoff" to debug level.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
There is a new API that I plan to use. We generate the client from the
spec, so the API should be in the spec.
## Summary of changes
Document the existing API in OpenAPI format.
## Problem
Currently, we collect metrics on what extensions are installed on
computes at startup time. We do not have a mechanism that does this at
runtime.
## Summary of changes
Added a background thread that queries all DBs at regular intervals and
collects a list of installed extensions.
## Problem
Part of https://github.com/neondatabase/neon/issues/11813
## Summary of changes
* Support evaluating boolean flags.
* Add docs on how to handle errors.
* Add test cases based on real PostHog config.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We have some gaps in our traces. This indicates missing spans.
## Summary of changes
This PR adds two new spans:
* WAIT_EXECUTOR: time a batched request spends in the batch waiting to
be picked up
* FLUSH_RESPONSE: time a get page request spends flushing the response
to the compute

## Problem
The page API gRPC errors need a few tweaks to integrate better with the
GetPage machinery.
Touches https://github.com/neondatabase/neon/issues/11728.
## Summary of changes
* Add `GetPageStatus::InternalError` for internal server errors.
* Rename `GetPageStatus::Invalid` to `InvalidRequest` for clarity.
* Rename `status` and `GetPageStatus` to `status_code` and
`GetPageStatusCode`.
* Add an `Into<tonic::Status>` implementation for `ProtocolError`.
Some tests still explicitly specify version 3 of the safekeeper
walproposer protocol. Remove the explicit opt-in from the tests, as v3
has been the default since #11518.
We don't touch the places where a test exercises both v2 and v3. Those
we leave for #12021.
Part of https://github.com/neondatabase/neon/issues/10326
## Problem
Test coverage of timeline imports is lacking.
## Summary of changes
This PR adds a chaos import test. It runs an import while injecting
various chaos events
in the environment. All the commits that follow the test fix various
issues that were surfaced by it.
Closes https://github.com/neondatabase/neon/issues/10191