Commit Graph

468 Commits

Author SHA1 Message Date
Christian Schwarz
aac1f8efb1 refactor(compaction): eliminate CompactionError::CollectKeyspaceError variant (#12517)
The only differentiated handling of it is for `is_critical`, which in
turn is a `matches!()` on several variants of the `enum
CollectKeyspaceError`
which is the value contained insided
`CompactionError::CollectKeyspaceError`.

This PR introduces a new error for `repartition()`, allowing its
immediate
callers to inspect it like `is_critical` did.

A drive-by fix is more precise classification of WaitLsnError::BadState
when mapping to `tonic::Status`.

refs
- https://databricks.atlassian.net/browse/LKB-182
2025-07-09 08:41:36 +00:00
HaoyuHuang
3dad4698ec PS changes #1 (#12467)
# TLDR
All changes are no-op except 
1. publishing additional metrics. 
2. problem VI

## Problem I

It has come to my attention that the Neon Storage Controller doesn't
correctly update its "observed" state of tenants previously associated
with PSs that has come back up after a local data loss. It would still
think that the old tenants are still attached to page servers and won't
ask more questions. The pageserver has enough information from the
reattach request/response to tell that something is wrong, but it
doesn't do anything about it either. We need to detect this situation in
production while I work on a fix.

(I think there is just some misunderstanding about how Neon manages
their pageserver deployments which got me confused about all the
invariants.)

## Summary of changes I

Added a `pageserver_local_data_loss_suspected` gauge metric that will be
set to 1 if we detect a problematic situation from the reattch response.
The problematic situation is when the PS doesn't have any local tenants
but received a reattach response containing tenants.

We can set up an alert using this metric. The alert should be raised
whenever this metric reports non-zero number.

Also added a HTTP PUT
`http://pageserver/hadron-internal/reset_alert_gauges` API on the
pageserver that can be used to reset the gauge and the alert once we
manually rectify the situation (by restarting the HCC).

## Problem II
Azure upload is 3x slower than AWS. -> 3x slower ingestion. 

The reason for the slower upload is that Azure upload in page server is
much slower => higher flush latency => higher disk consistent LSN =>
higher back pressure.

## Summary of changes II
Use Azure put_block API to uploads a 1 GB layer file in 8 blocks in
parallel.

I set the put_block block size to be 128 MB by default in azure config. 

To minimize neon changes, upload function passes the layer file path to
the azure upload code through the storage metadata. This allows the
azure put block to use FileChunkStreamRead to stream read from one
partition in the file instead of loading all file data in memory and
split it into 8 128 MB chunks.

## How is this tested? II
1. rust test_real_azure tests the put_block change. 
3. I deployed the change in azure dev and saw flush latency reduces from
~30 seconds to 10 seconds.
4. I also did a bunch of stress test using sqlsmith and 100 GB TPCDS
runs.

## Problem III
Currently Neon limits the compaction tasks as 3/4 * CPU cores. This
limits the overall compaction throughput and it can easily cause
head-of-the-line blocking problems when a few large tenants are
compacting.

## Summary of changes III
This PR increases the limit of compaction tasks as `BG_TASKS_PER_THREAD`
(default 4) * CPU cores. Note that `CONCURRENT_BACKGROUND_TASKS` also
limits some other tasks `logical_size_calculation` and `layer eviction`
. But compaction should be the most frequent and time-consuming task.

## Summary of changes IV
This PR adds the following PageServer metrics:
1. `pageserver_disk_usage_based_eviction_evicted_bytes_total`: captures
the total amount of bytes evicted. It's more straightforward to see the
bytes directly instead of layers.
2. `pageserver_active_storage_operations_count`: captures the active
storage operation, e.g., flush, L0 compaction, image creation etc. It's
useful to visualize these active operations to get a better idea of what
PageServers are spending cycles on in the background.

## Summary of changes V
When investigating data corruptions, it's useful to search the base
image and all WAL records of a page up to an LSN, i.e., a breakdown of
GetPage@LSN request. This PR implements this functionality with two
tools:

1. Extended `pagectl` with a new command to search the layer files for a
given key up to a given LSN from the `index_part.json` file. The output
can be used to download the files from S3 and then search the file
contents using the second tool.
Example usage:
```
cargo run --bin pagectl index-part search --tenant-id 09b99ea3239bbb3b2d883a59f087659d --timeline-id 7bedf4a6995baff7c0421ff9aebbcdab --path ~/Downloads/corruption/index_part.json-0000000c-formatted --key 000000067F000080140000802100000D61BD --lsn 70C/BF3D61D8
```
Example output:
```
tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F0000801400008028000002FEFF__000007089F0B5381-0000070C7679EEB9-0000000c
tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000000000000000000000000000000000-000000067F0000801400008028000002F3F1__000006DD95B6F609-000006E2BA14C369-0000000c
tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F000080140000802100001B0973__000006D33429F539-000006DD95B6F609-0000000c
tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F00008014000080210000164D81__000006C6343B2D31-000006D33429F539-0000000b
tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F0000801400008021000017687B__000006BA344FA7F1-000006C6343B2D31-0000000b
tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F00008014000080210000165BAB__000006AD34613D19-000006BA344FA7F1-0000000b
tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F00008014000080210000137A39__0000069F34773461-000006AD34613D19-0000000b
tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F000080140000802100000D4000-000000067F000080140000802100000F0000__0000069F34773460-0000000b
```

2. Added a unit test to search the layer file contents. It's not
implemented part of `pagectl` because it depends on some test harness
code, which can only be used by unit tests.

Example usage:
```
cargo test --package pageserver --lib -- tenant::debug::test_search_key --exact --nocapture -- --tenant-id 09b99ea3239bbb3b2d883a59f087659d --timeline-id 7bedf4a6995baff7c0421ff9aebbcdab --data-dir /Users/chen.luo/Downloads/corruption --key 000000067F000080140000802100000D61BD --lsn 70C/BF3D61D8
```
Example output:
```
# omitted image for brievity
delta: 69F/769D8180: will_init: false, "OgAAALGkuwXwYp12nwYAAECGAAASIqLHAAAAAH8GAAAUgAAAIYAAAL1hDQD/DLGkuwUDAAAAEAAWAA=="
delta: 69F/769CB6D8: will_init: false, "PQAAALGkuwXotZx2nwYAABAJAAAFk7tpACAGAH8GAAAUgAAAIYAAAL1hDQD/CQUAEAASALExuwUBAAAAAA=="
```

## Problem VI
Currently when page service resolves shards from page numbers, it
doesn't fully support the case that the shard could be split in the
middle. This will lead to query failures during the tenant split for
either commit or abort cases (it's mostly for abort).

## Summary of changes VI
This PR adds retry logic in `Cache::get()` to deal with shard resolution
errors more gracefully. Specifically, it'll clear the cache and retry,
instead of failing the query immediately. It also reduces the internal
timeout to make retries faster.

The PR also fixes a very obvious bug in
`TenantManager::resolve_attached_shard` where the code tries to cache
the computed the shard number, but forgot to recompute when the shard
count is different.

---------

Co-authored-by: William Huang <william.huang@databricks.com>
Co-authored-by: Haoyu Huang <haoyu.huang@databricks.com>
Co-authored-by: Chen Luo <chen.luo@databricks.com>
Co-authored-by: Vlad Lazar <vlad.lazar@databricks.com>
Co-authored-by: Vlad Lazar <vlad@neon.tech>
2025-07-08 19:43:01 +00:00
Christian Schwarz
f9b05a42d7 refactor(compaction): remove CompactionError::AlreadyRunning variant, use ::Other instead (#12512)
The only call stack that can emit the `::AlreadyRunning` variant is
```
-> iteration_inner
	-> iteration
		-> compaction_iteration
			-> compaction_loop
				-> start_background_loops
```

And on that call stack, the only differentiated handling of it is its
invocations of
`log_compaction_error -> CompactionError::is_cancel()`, which returns
`true` for
`::AlreadyRunning`.

I think the condition of `AlreadyRunning` is severe; it really shouldn't
happen.
So, this PR starts treating it as something that is to be logged at
`ERROR` / `WARN`
level, depending on the `degrate_to_warning` argument to
`log_compaction_error`.

refs
- https://databricks.atlassian.net/browse/LKB-182
2025-07-08 16:45:34 +00:00
Christian Schwarz
8a042fb8ed refactor(compaction): eliminate CompactionError::Offload variant, map to ::Other (#12505)
Looks can be deceiving: the match blocks in
`maybe_trip_compaction_breaker`
and at the end of `compact_with_options` seem like differentiated error
handling, but in reality, these branches are unreachable at runtime
because the only source of `CompactionError::Offload` within the
compaction code is at the end of `Tenant::compaction_iteration`.

We can simply map offload cancellation to CompactionError::Cancelled and
all other offload errors to ::Other, since there's no differentiated
handling for them in the compaction code.

Also, the OffloadError::RemoteStorage variant has no differentiated
handling, but was wrapping the remote storage anyhow::Error in a
`anyhow(thiserror(anyhow))` sandwich. This PR removes that variant,
mapping all RemoteStorage errors to `OffloadError::Other`.
Thereby, the sandwich is gone and we will get a proper anyhow backtrace
to the remote storage error location if when we debug-print the
OffloadError (or the CompactionError if we map it to that).

refs
- https://databricks.atlassian.net/browse/LKB-182
- the observation that there's no need for differentiated handling of
CompactionError::Offload was made in
https://databricks.slack.com/archives/C09254R641L/p1751286453930269?thread_ts=1751284317.955159&cid=C09254R641L
2025-07-08 15:03:32 +00:00
Arpad Müller
efd7e52812 Don't error if timeline offload is already in progress (#12428)
Don't print errors like:
```
Compaction failed 1 times, retrying in 2s: Failed to offload timeline: Unexpected offload error: Timeline deletion is already in progress
```

Print it at info log level instead.

https://github.com/neondatabase/cloud/issues/30666
2025-07-02 12:06:55 +00:00
Alex Chi Z.
5ec8881c0b feat(pageserver): resolve feature flag based on remote size (#12400)
## Problem

Part of #11813 

## Summary of changes

* Compute tenant remote size in the housekeeping loop.
* Add a new `TenantFeatureResolver` struct to cache the tenant-specific
properties.
* Evaluate feature flag based on the remote size.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-07-01 18:11:24 +00:00
Erik Grinaker
a384d7d501 pageserver: assert no changes to shard identity (#12379)
## Problem

Location config changes can currently result in changes to the shard
identity. Such changes will cause data corruption, as seen with #12217.

Resolves #12227.
Requires #12377.

## Summary of changes

Assert that the shard identity does not change on location config
updates and on (re)attach.

This is currently asserted with `critical!`, in case it misfires in
production. Later, we should reject such requests with an error and turn
this into a proper assertion.
2025-06-30 12:36:45 +00:00
Erik Grinaker
1d43f3bee8 pageserver: fix stripe size persistence in legacy HTTP handlers (#12377)
## Problem

Similarly to #12217, the following endpoints may result in a stripe size
mismatch between the storage controller and Pageserver if an unsharded
tenant has a different stripe size set than the default. This can lead
to data corruption if the tenant is later manually split without
specifying an explicit stripe size, since the storage controller and
Pageserver will apply different defaults. This commonly happens with
tenants that were created before the default stripe size was changed
from 32k to 2k.

* `PUT /v1/tenant/config`
* `PATCH /v1/tenant/config`

These endpoints are no longer in regular production use (they were used
when cplane still managed Pageserver directly), but can still be called
manually or by tests.

## Summary of changes

Retain the current shard parameters when updating the location config in
`PUT | PATCH /v1/tenant/config`.

Also opportunistically derive `Copy` for `ShardParameters`.
2025-06-30 09:08:44 +00:00
Alex Chi Z.
6c77638ea1 feat(storcon): retrieve feature flag and pass to pageservers (#12324)
## Problem

part of https://github.com/neondatabase/neon/issues/11813

## Summary of changes

It costs $$$ to directly retrieve the feature flags from the pageserver.
Therefore, this patch adds new APIs to retrieve the spec from the
storcon and updates it via pageserver.

* Storcon retrieves the feature flag and send it to the pageservers.
* If the feature flag gets updated outside of the normal refresh loop of
the pageserver, pageserver won't fetch the flags on its own as long as
the last updated time <= refresh_period.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-06-25 14:58:18 +00:00
Matthias van de Meent
6c6de6382a Use enum-typed PG versions (#12317)
This makes it possible for the compiler to validate that a match block
matched all PostgreSQL versions we support.

## Problem
We did not have a complete picture about which places we had to test
against PG versions, and what format these versions were: The full PG
version ID format (Major/minor/bugfix `MMmmbb`) as transfered in
protocol messages, or only the Major release version (`MM`). This meant
type confusion was rampant.

With this change, it becomes easier to develop new version-dependent
features, by making type and niche confusion impossible.

## Summary of changes
Every use of `pg_version` is now typed as either `PgVersionId` (u32,
valued in decimal `MMmmbb`) or PgMajorVersion (an enum, with a value for
every major version we support, serialized and stored like a u32 with
the value of that major version)

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2025-06-24 17:25:31 +00:00
Arpad Müller
552249607d apply clippy fixes for 1.88.0 beta (#12331)
The 1.88.0 stable release is near (this Thursday). We'd like to fix most
warnings beforehand so that the compiler upgrade doesn't require
approval from too many teams.

This is therefore a preparation PR (like similar PRs before it).

There is a lot of changes for this release, mostly because the
`uninlined_format_args` lint has been added to the `style` lint group.
One can read more about the lint
[here](https://rust-lang.github.io/rust-clippy/master/#/uninlined_format_args).

The PR is the result of `cargo +beta clippy --fix` and `cargo fmt`. One
remaining warning is left for the proxy team.

---------

Co-authored-by: Conrad Ludgate <conrad@neon.tech>
2025-06-24 10:12:42 +00:00
Alex Chi Z.
85164422d0 feat(pageserver): support force overriding feature flags (#12233)
## Problem

Part of #11813 

## Summary of changes

Add a test API to make it easier to manipulate the feature flags within
tests.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-06-23 17:31:53 +00:00
Vlad Lazar
28e882a80f pageserver: warn on long layer manager locking intervals (#12194)
## Problem

We hold the layer map for too long on occasion.

## Summary of changes

This should help us identify the places where it's happening from.

Related https://github.com/neondatabase/neon/issues/12182
2025-06-11 16:16:30 +00:00
Alex Chi Z.
d8ebd1d771 feat(pageserver): report tenant properties to posthog (#12113)
## Problem

Part of https://github.com/neondatabase/neon/issues/11813

In PostHog UI, we need to create the properties before using them as a
filter. We report all variants automatically when we start the
pageserver. In the future, we can report all real tenants instead of
fake tenants (we do that now to save money + we don't need real tenants
in the UI).

## Summary of changes

* Collect `region`, `availability_zone`, `pageserver_id` properties and
use them in the feature evaluation.
* Report 10 fake tenants on each pageserver startup.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-06-05 07:48:36 +00:00
Alex Chi Z.
af429b4a62 feat(pageserver): observability for feature flags (#12034)
## Problem

Part of #11813. This pull request adds misc observability improvements
for the functionality.

## Summary of changes

* Info span for the PostHog feature background loop.
* New evaluate feature flag API.
* Put the request error into the error message.
* Log when feature flag gets updated.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-05-30 08:02:25 +00:00
Nikita Kalyanov
541fcd8d2f chore: expose new mark_invisible API in openAPI spec for use in cplane (#12032)
## Problem
There is a new API that I plan to use. We generate client from the spec
so it should be in the spec
## Summary of changes
Document the existing API in openAPI format
2025-05-28 03:39:59 +00:00
Vlad Lazar
9657fbc194 pageserver: add and stabilize import chaos test (#11982)
## Problem

Test coverage of timeline imports is lacking.

## Summary of changes

This PR adds a chaos import test. It runs an import while injecting
various chaos events
in the environment. All the commits that follow the test fix various
issues that were surfaced by it.

Closes https://github.com/neondatabase/neon/issues/10191
2025-05-27 09:52:59 +00:00
Arpad Müller
3e86008e66 read-only timelines (#12015)
Support timeline creations on the storage controller to opt out from
their creation on the safekeepers, introducing the read-only timelines
concept. Read only timelines:

* will never receive WAL of their own, so it's fine to not create them
on the safekeepers
* the property is non-transitive. children of read-only timelines aren't
neccessarily read-only themselves.

This feature can be used for snapshots, to prevent the safekeepers from
being overloaded by empty timelines that won't ever get written to. In
the current world, this is not a problem, because timelines are created
implicitly by the compute connecting to a safekeeper that doesn't have
the timeline yet. In the future however, where the storage controller
creates timelines eagerly, we should watch out for that.

We represent read-only timelines in the storage controller database so
that we ensure that they never touch the safekeepers at all. Especially
we don't want them to cause a mess during the importing process of the
timelines from the cplane to the storcon database.

In a hypothetical future where we have a feature to detach timelines
from safekeepers, we'll either need to find a way to distinguish the
two, or if not, asking safekeepers to list the (empty) timeline prefix
and delete everything from it isn't a big issue either.

This patch will unconditionally hit the new safekeeper timeline creation
path for read-only timelines, without them needing the
`--timelines-onto-safekeepers` flag enabled. This is done because it's
lower risk (no safekeepers or computes involved at all) and gives us
some initial way to verify at least some parts of that code in prod.

https://github.com/neondatabase/cloud/issues/29435
https://github.com/neondatabase/neon/issues/11670
2025-05-26 23:23:58 +00:00
Erik Grinaker
95a5f749c8 pageserver: use an Option for GcCutoffs::time (#11984)
## Problem

It is not currently possible to disambiguate a timeline with an
uninitialized PITR cutoff from one that was created within the PITR
window -- both of these have `GcCutoffs::time == Lsn(0)`. For billing
metrics, we need to disambiguate these to avoid accidentally billing the
entire history when a tenant is initially loaded.

Touches https://github.com/neondatabase/cloud/issues/28155.

## Summary of changes

Make `GcCutoffs::time` an `Option<Lsn>`, and only set it to `Some` when
initialized. A `pitr_interval` of 0 will yield `Some(last_record_lsn)`.

This PR takes a conservative approach, and mostly retains the old
behavior of consumers by using `unwrap_or_default()` to yield 0 when
uninitialized, to avoid accidentally introducing bugs -- except in cases
where there is high confidence that the change is beneficial (e.g. for
the `pageserver_pitr_history_size` Prometheus metric and to return early
during GC).
2025-05-21 15:42:11 +00:00
Erik Grinaker
f4150614d0 pageserver: don't pass config to PageHandler (#11973)
## Problem

The gRPC page service API will require decoupling the `PageHandler` from
the libpq protocol implementation. As preparation for this, avoid
passing in the entire server config to `PageHandler`, and instead
explicitly pass in the relevant fields.

Touches https://github.com/neondatabase/neon/issues/11728.

## Summary of changes

* Change `PageHandler` to take a `GetVectoredConcurrentIo` instead of
the entire config.
* Change `IoConcurrency::spawn_from_conf` to take a
`GetVectoredConcurrentIo`.
2025-05-19 15:47:40 +00:00
Vlad Lazar
045ae13e06 pageserver: make imports work with tenant shut downs (#11855)
## Problem

Lifetime of imported timelines (and implicitly the import background
task) has some shortcomings:
1. Timeline activation upon import completion is tricky. Previously, a
timeline that finished importing
after a tenant detach would not get activated and there's concerns about
the safety of activating
concurrently with shut-down.
2. Import jobs can prevent tenant shut down since they hold the tenant
gate

## Summary of Changes

Track the import tasks in memory and abort them explicitly on tenant
shutdown.

Integrate more closely with the storage controller:
1. When an import task has finished all of its jobs, it notifies the
storage controller, but **does not** mark the import as done in the
index_part. When all shards have finished importing, the storage
controller will call the `/activate_post_import` idempotent endpoint for
all of them. The handler, marks the import complete in index part,
resets the tenant if required and checks if the timeline is active yet.
2. Not directly related, but the import job now gets the starting state
from the storage controller instead of the import bucket. This paves the
way for progress checkpointing.

Related: https://github.com/neondatabase/neon/issues/11568
2025-05-13 17:49:49 +00:00
John Spray
0a27973584 pageserver: rename Tenant to TenantShard (#11589)
## Problem

`Tenant` isn't really a whole tenant: it's just one shard of a tenant.

## Summary of changes

- Automated rename of Tenant to TenantShard
- Followup commit to change references in comments
2025-04-17 13:29:16 +00:00
Arpad Müller
88f01c1ca1 Introduce WalIngestError (#11506)
Introduces a `WalIngestError` struct together with a
`WalIngestErrorKind` enum, to be used for walingest related failures and
errors.

* the enum captures backtraces, so we don't regress in comparison to
`anyhow::Error`s (backtraces might be a bit shorter if we use one of the
`anyhow::Error` wrappers)
* it explicitly lists most/all of the potential cases that can occur.

I've originally been inspired to do this in #11496, but it's a
longer-term TODO.
2025-04-11 14:08:46 +00:00
Alex Chi Z.
405a17bf0b fix(pageserver): ensure gc-compaction gets preempted by L0 (#11512)
## Problem

Part of #9114 

## Summary of changes

Gc-compaction flag was not correctly set, causing it not getting
preempted by L0.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-04-09 20:57:50 +00:00
Conrad Ludgate
72832b3214 chore: fix clippy lints from nightly-2025-03-16 (#11273)
I like to run nightly clippy every so often to make our future rust
upgrades easier. Some notable changes:

* Prefer `next_back()` over `last()`. Generic iterators will implement
`last()` to run forward through the iterator until the end.

* Prefer `io::Error::other()`.

* Use implicit returns

One case where I haven't dealt with the issues is the now
[more-sensitive "large enum variant"
lint](https://github.com/rust-lang/rust-clippy/pull/13833). I chose not
to take any decisions around it here, and simply marked them as allow
for now.
2025-04-09 15:04:42 +00:00
Dmitrii Kovalkov
e7502a3d63 pageserver: return 412 PreconditionFailed in get_timestamp_of_lsn if timestamp is not found (#11491)
## Problem
Now `get_timestamp_of_lsn` returns `404 NotFound` if there is no clog
pages for given LSN, and it's difficult to distinguish from other 404
errors. A separate status code for this error will allow the control
plane to handle this case.
- Closes: https://github.com/neondatabase/neon/issues/11439
- Corresponding PR in control plane:
https://github.com/neondatabase/cloud/pull/27125

## Summary of changes
- Return `412 PreconditionFailed` instead of `404 NotFound` if no
timestamp is fond for given LSN.

I looked briefly through the current error handling code in cloud.git
and the status code change should not affect anything for the existing
code. Change from the corresponding PR also looks fine and should work
with the current PS status code. Additionally, here is OK to merge it
from control plane team:
https://github.com/neondatabase/neon/issues/11439#issuecomment-2789327552

---------

Co-authored-by: John Spray <john@neon.tech>
2025-04-09 13:16:15 +00:00
Erik Grinaker
7679b63a2c pageserver: persist stripe size in tenant manifest for tenant_import (#11181)
## Problem

`tenant_import`, used to import an existing tenant from remote storage
into a storage controller for support and debugging, assumed
`DEFAULT_STRIPE_SIZE` since this can't be recovered from remote storage.
In #11168, we are changing the stripe size, which will break
`tenant_import`.

Resolves #11175.

## Summary of changes

* Add `stripe_size` to the tenant manifest.
* Add `TenantScanRemoteStorageShard::stripe_size` and return from
`tenant_scan_remote` if present.
* Recover the stripe size during`tenant_import`, or fall back to 32768
(the original default stripe size).
* Add tenant manifest compatibility snapshot:
`2025-04-08-pgv17-tenant-manifest-v1.tar.zst`

There are no cross-version concerns here, since unknown fields are
ignored during deserialization where relevant.
2025-04-08 20:43:27 +00:00
Vlad Lazar
9db63fea7a pageserver: optionally export perf traces in OTEL format (#11140)
Based on https://github.com/neondatabase/neon/pull/11139

## Problem

We want to export performance traces from the pageserver in OTEL format.
End goal is to see them in Grafana.

## Summary of changes

https://github.com/neondatabase/neon/pull/11139 introduces the
infrastructure required to run the otel collector alongside the
pageserver.

### Design

Requirements:
1. We'd like to avoid implementing our own performance tracing stack if
possible and use the `tracing` crate if possible.
2. Ideally, we'd like zero overhead of a sampling rate of zero and be a
be able to change the tracing config for a tenant on the fly.
3. We should leave the current span hierarchy intact. This includes
adding perf traces without modifying existing tracing.

To satisfy (3) (and (2) in part) a separate span hierarchy is used.
`RequestContext` gains an optional `perf_span` member
that's only set when the request was chosen by sampling. All perf span
related methods added to `RequestContext` are no-ops for requests that
are not sampled.

This on its own is not enough for (3), so performance spans use a
separate tracing subscriber. The `tracing` crate doesn't have great
support for this, so there's a fair amount of boilerplate to override
the subscriber at all points of the perf span lifecycle.

### Perf Impact

[Periodic
pagebench](https://neonprod.grafana.net/d/ddqtbfykfqfi8d/e904990?orgId=1&from=2025-02-08T14:15:59.362Z&to=2025-03-10T14:15:59.362Z&timezone=utc)
shows no statistically significant regression with a sample ratio of 0.
There's an annotation on the dashboard on 2025-03-06.

### Overview of changes:
1. Clean up the `RequestContext` API a bit. Namely, get rid of the
`RequestContext::extend` API and use the builder instead.
2. Add pageserver level configs for tracing: sampling ratio, otel
endpoint, etc.
3. Introduce some perf span tracking utilities and expose them via
`RequestContext`. We add a `tracing::Span` wrapper to be used for perf
spans and a `tracing::Instrumented` equivalent for it. See doc comments
for reason.
4. Set up OTEL tracing infra according to configuration. A separate
runtime is used for the collector.
5. Add perf traces to the read path.

## Refs

- epic https://github.com/neondatabase/neon/issues/9873

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
2025-04-03 17:56:51 +00:00
Alex Chi Z.
131b32ef48 fix(pageserver): clean up aux files before detaching (#11299)
## Problem

Related to https://github.com/neondatabase/cloud/issues/26091 and
https://github.com/neondatabase/cloud/issues/25840

Close https://github.com/neondatabase/neon/issues/11297

Discussion on Slack:
https://neondb.slack.com/archives/C033RQ5SPDH/p1742320666313969

## Summary of changes

* When detaching, scan all aux files within
`sparse_non_inherited_keyspace` in the ancestor timeline and create an
image layer exactly at the ancestor LSN. All scanned keys will map to an
empty value, which is a delete tombstone.
- Note that end_lsn for rewritten delta layers = ancestor_lsn + 1, so
the image layer will have image_end_lsn=end_lsn. With the current
`select_layer` logic, the read path will always first read the image
layer.
* Add a test case.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
2025-04-03 15:55:22 +00:00
Alex Chi Z.
dd1299f337 feat(storcon): passthrough mark invisible and add tests (#11401)
## Problem

close https://github.com/neondatabase/neon/issues/11279

## Summary of changes

* Allow passthrough of other methods in tenant timeline shard0
passthrough of storcon.
* Passthrough mark invisible API in storcon.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-04-02 17:11:49 +00:00
John Spray
cb19e4e05d pageserver: remove legacy TimelineInfo::latest_gc_cutoff field (2/2) (#11136)
## Problem

This field was retained for backward compat only in #10707.

Once https://github.com/neondatabase/cloud/pull/25233 is released,
nothing will be reading this field.

Related: https://github.com/neondatabase/cloud/issues/24250

## Summary of changes

- Remove TimelineInfo::latest_gc_cutoff_lsn
2025-04-02 15:21:58 +00:00
Alex Chi Z.
c4fc602115 feat(pageserver): support synthetic size calculation for invisible branches (#11335)
## Problem

ref https://github.com/neondatabase/neon/issues/11279


Imagine we have a branch with 3 snapshots A, B, and C:
```
base---+---+---+---main
        \-A \-B \-C
base=100G, base-A=1G, A-B=1G, B-C=1G, C-main=1G
```
at this point, the synthetic size should be 100+1+1+1+1=104G.

after the deletion, the structure looks like:
```
base---+---+---+
       \-A \-B \-C
```
If we simply assume main never exists, the size will be calculated as
size(A) + size(B) + size(C)=300GB, which obviously is not what the user
would expect.

The correct way to do this is to assume part of main still exists, that
is to say, set C-main=1G:
```
base---+---+---+main
       \-A \-B \-C
```
And we will get the correct synthetic size of 100G+1+1+1=103G.


## Summary of changes

* Do not generate gc cutoff point for invisible branches.
* Use the same LSN as the last branchpoint for branch end.
* Remove test_api_handler for mark_invisible.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-04-01 18:50:58 +00:00
Arpad Müller
016068b966 API spec: add safekeepers field returned by storcon (#11385)
Add an optional `safekeepers` field to `TimelineInfo` which is returned
by the storcon upon timeline creation if the
`--timelines-onto-safekeepers` flag is enabled. It contains the list of
safekeepers chosen.

Other contexts where we return `TimelineInfo` do not contain the
`safekeepers` field, sadly I couldn't make this more type safe like done
in Rust via `TimelineCreateResponseStorcon`, as there is no way of
flattening or inheritance (and I don't that duplicating the entire type
for some minor type safety improvements is worth it).

The storcon side has been done in #11058.

Part of https://github.com/neondatabase/cloud/issues/16176
cc https://github.com/neondatabase/cloud/issues/16796
2025-04-01 12:39:10 +00:00
Erik Grinaker
80596feeaa pageserver: invert CompactFlags::NoYield as YieldForL0 (#11382)
## Problem

`CompactFlags::NoYield` was a bit inconvenient, since every caller
except for the background compaction loop should generally set it (e.g.
HTTP API calls, tests, etc). It was also inconsistent with
`CompactionOutcome::YieldForL0`.

## Summary of changes

Invert `CompactFlags::NoYield` as `CompactFlags::YieldForL0`. There
should be no behavioral changes.
2025-04-01 11:43:58 +00:00
Alex Chi Z.
bae9b9acdc feat(pageserver): persist timeline invisible flag (#11331)
## Problem

part of https://github.com/neondatabase/neon/issues/11279

## Summary of changes

The invisible flag is used to exclude a timeline from synthetic size
calculation. For the first step, let's persist this flag. Most of the
code are following the `is_archived` modification flow.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-03-20 18:39:08 +00:00
Nikita Kalyanov
53f54ba37a chore: expose detach_v2 (#11325)
we need this exposed in the spec to use it in cplane. extracted from
https://github.com/neondatabase/cloud/pull/26167

## Problem

## Summary of changes
2025-03-20 18:04:17 +00:00
Arpad Müller
5dd60933d3 Mark min_readable_lsn as required in the pageserver API spec (#11324)
Fully applies the changes of
https://github.com/neondatabase/cloud/pull/25233 to neon.git. The field
is always present in the Rust struct definition, so it can be marked as
required.

cc #10707
2025-03-20 15:25:09 +00:00
Christian Schwarz
0f20dae3c3 impr: merge pageserver_api::models::TenantConfig and pageserver::tenant::config::TenantConfOpt (#11298)
The only difference between
- `pageserver_api::models::TenantConfig` and
- `pageserver::tenant::config::TenantConfOpt`

at this point is that `TenantConfOpt` serializes with
`skip_serializing_if = Option::is_none`.
That is an efficiency improvement for all the places that currently
serde `models::TenantConfig` because new serializations will no longer
write `$fieldname: null` for each field that is `None` at runtime.

This should be particularly beneficial for Storcon, which stores
JSON-serialized `models::TenantConfig` in its DB.

# Behavior Changes


This PR changes the serialization behavior: we omit `None` fields
instead of serializing `$fieldname: null`).

So it's a data format change (see section on compatibility below).

And it changes API responses from Storcon and Pageserver.

## API Response Compatibility

Storcon returns the location description.
Afaik it is passed through into
- storcon_cli output
- storcon UI in console admin UI

These outputs will no longer contain `$fieldname: null` values,
which de-bloats the output (good).
But in storcon UI, it also serves as an editor "default", which
will be eliminated after a storcon with this PR is released.


## Data Format Compatibility


Backwards compat: new software reading old serialized data will
deserialize to the same runtime value because all the field types
are exactly the same and `skip_serializing_if` does not affect
deserialization.

Forward compat: old software reading data serialized by new software
will map absence fields in the serialized form to runtime value
`Option::None`. This is serde default behavior, see this playground
to convince yourself:

https://play.rust-lang.org/?version=stable&mode=debug&edition=2024&gist=f7f4e1a169959a3085b6158c022a05eb

The `serde(with="humantime_serde")` however behaves strangely:
if used on an `Option<Duration>`, it still requires the field to be
present,
unlike the serde default behavior shown in the previous paragraph.
The workaround is to set `serde(default)`.
Previously it was set on each individual field, but, we do have the
container attribute, so, set it there.
This requires deriving a `Default` impl, which, because all fields are
`Option`,
is non-magic.
See my notes here:
https://gist.github.com/problame/eddbc225a5d12617e9f2c6413e0cf799

# Future Work

We should have separate types (& crates) for
- runtime types configuration (e.g. PageServerConf::tenant_config,
AttachedLocationConf)
- `config-v1` file pageserver local disk file format
- `mgmt API`
- `pageserver.toml`

Right now they all use the same, which is convenient but makes it hard
to reason about compatibility breakage.

# Refs

- corresponding docs.neon.build PR
https://github.com/neondatabase/docs/pull/470
2025-03-19 12:47:17 +00:00
Alex Chi Z.
23b713900e feat(storcon): passthrough ancestor detach behavior (#11199)
## Problem

https://github.com/neondatabase/neon/issues/10310
https://github.com/neondatabase/neon/pull/11158

## Summary of changes

We need to passthrough the new detach behavior through the storcon API.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-03-13 20:21:23 +00:00
Erik Grinaker
c036fec065 pageserver: enable compaction_l0_first by default (#11212)
## Problem

`compaction_l0_first` has already been enabled in production for a
couple of weeks.

## Summary of changes

Enable `compaction_l0_first` by default.

Also set `CompactFlags::NoYield` in `timeline_checkpoint_handler`, to
ensure explicitly requested compaction runs to completion. This endpoint
is mainly used in tests, and caused some flakiness where tests expected
compaction to complete.
2025-03-13 14:28:42 +00:00
Alex Chi Z.
c3b3b507f7 feat(pageserver): support detaching behavior v2 (#11158)
## Problem

close https://github.com/neondatabase/neon/issues/10310

## Summary of changes

This patch adds a new behavior for the detach_ancestor API: detach with
multi-level ancestor and no reparenting. Though we can potentially
support multi-level + do reparenting / single-level + no-reparenting in
the future, as it's not required for the recovery/snapshot epic, I'd
prefer keeping things simple now that we only handle the old one and the
new one instead of supporting the full feature matrix.

I only added a test case of successful detaching instead of testing
failures. I'd like to make this into staging and add more tests in the
future.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-03-12 22:27:23 +00:00
John Spray
7bf6397334 pageserver: remove legacy TimelineInfo::latest_gc_cutoff field (1/2) (#11149)
## Problem

This field was retained for backward compat only in
https://github.com/neondatabase/neon/pull/10707.

Once https://github.com/neondatabase/cloud/pull/25233 is released,
nothing external will be reading this field.

Internally, this was a mandatory field so storage controller is still
trying to decode it, so we must do this removal in two steps: this PR
makes the field optional, and after one release we can fully remove it.

Related: https://github.com/neondatabase/cloud/issues/24250

## Summary of changes

- Rename field to `_unused`
- Remove field from swagger
- Make field optional
2025-03-12 10:23:41 +00:00
Erik Grinaker
f466c01995 pageserver: add max_logical_size_per_shard for get_top_tenants (#11157)
## Problem

In #11122, we want to split shards once the logical size of the largest
timeline exceeds a split threshold. However, `get_top_tenants` currently
only returns `max_logical_size`, which tracks the max _total_ logical
size of a timeline across all shards.

This is problematic, because the storage controller needs to fetch a
list of N tenants that are eligible for splits, but the API doesn't
currently have a way to express this. For example, with a split
threshold of 1 GB, a tenant with `max_logical_size` of 4 GB is eligible
to split if it has 1 or 2 shards, but not if it already has 4 shards. We
need to express this in per-shard terms, otherwise the `get_top_tenants`
endpoint may end up only returning tenants that can't be split, blocking
splits entirely.

Touches https://github.com/neondatabase/neon/pull/11122.
Touches https://github.com/neondatabase/cloud/issues/22532.

## Summary of changes

Add `TenantShardItem::max_logical_size_per_shard` containing
`max_logical_size / shard_count`, and
`TenantSorting::MaxLogicalSizePerShard` to order and filter by it.
2025-03-11 11:43:55 +00:00
Christian Schwarz
7c462b3417 impr: propagate VirtualFile metrics via RequestContext (#7202)
# Refs

- fixes https://github.com/neondatabase/neon/issues/6107

# Problem

`VirtualFile` currently parses the path it is opened with to identify
the `tenant,shard,timeline` labels to be used for the `STORAGE_IO_SIZE`
metric.

Further, for each read or write call to VirtualFile, it uses
`with_label_values` to retrieve the correct metrics object, which under
the hood is a global hashmap guarded by a parking_lot mutex.

We perform tens of thousands of reads and writes per second on every
pageserver instance; thus, doing the mutex lock + hashmap lookup is
wasteful.

# Changes

Apply the technique we use for all other timeline-scoped metrics to
avoid the repeat `with_label_values`: add it to `TimelineMetrics`.

Wrap `TimelineMetrics` into an `Arc`.

Propagate the `Arc<TimelineMetrics>` down do `VirtualFile`, and use
`Timeline::metrics::storage_io_size`.

To avoid contention on the `Arc<TimelineMetrics>`'s refcount atomics
between different connection handlers for the same timeline, we wrap it
into another Arc.

To avoid frequent allocations, we store that Arc<Arc<TimelineMetrics>>
inside the per-connection timeline cache.

Preliminary refactorings to enable this change:
- https://github.com/neondatabase/neon/pull/11001
- https://github.com/neondatabase/neon/pull/11030


# Performance

I ran the benchmarks in
`test_runner/performance/pageserver/pagebench/test_pageserver_max_throughput_getpage_at_latest_lsn.py`
on an `i3en.3xlarge` because that's what we currently run them on.

None of the benchmarks shows a meaningful difference in latency or
throughput or CPU utilization.

I would have expected some improvement in the
many-tenants-one-client-each workload because they all hit that hashmap
constantly, and clone the same `UintCounter` / `Arc` inside of it.

But apparently the overhead is miniscule compared to the remaining work
we do per getpage.

Yet, since the changes are already made, the added complexity is
manageable, and the perf overhead of `with_label_values` demonstrable in
micro-benchmarks, let's have this change anyway.
Also, propagating TimelineMetrics through RequestContext might come in
handy down the line.

The micro-benchmark that demonstrates perf impact of
`with_label_values`, along with other pitfalls and mitigation techniques
around the `metrics`/`prometheus` crate:
- https://github.com/neondatabase/neon/pull/11019

# Alternative Designs

An earlier iteration of this PR stored an `Arc<Arc<Timeline>>` inside
`RequestContext`.
The problem is that this risks reference cycles if the RequestContext
gets stored in an object that is owned directly or indirectly by
`Timeline`.

Ideally, we wouldn't be using this mess of Arc's at all and propagate
Rust references instead.
But tokio requires tasks to be `'static`, and so, we wouldn't be able to
propagate references across task boundaries, which is incompatible with
any sort of fan-out code we already have (e.g. concurrent IO) or future
code (parallel compaction).
So, opt for Arc for now.
2025-03-11 07:23:06 +00:00
Alex Chi Z.
cd438406fb feat(pageserver): add force patch index_part API (#11119)
## Problem

As part of the disaster recovery tool. Partly for
https://github.com/neondatabase/neon/issues/9114.

## Summary of changes

* Add a new pageserver API to force patch the fields in index_part and
modify the timeline internal structures.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-03-07 17:42:52 +00:00
Alex Chi Z.
6d0976dad5 feat(pageserver): persist reldir v2 migration status (#10980)
## Problem

part of https://github.com/neondatabase/neon/issues/9516

## Summary of changes

Similar to the aux v2 migration, we persist the relv2 migration status
into index_part, so that even the config item is set to false, we will
still read from the v2 storage to avoid loss of data.

Note that only the two variants `None` and
`Some(RelSizeMigration::Migrating)` are used for now. We don't have full
migration implemented so it will never be set to
`RelSizeMigration::Migrated`.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-03-03 21:05:43 +00:00
Vlad Lazar
5197e43396 pageserver: add recurse flag to layer download spec (#11068)
I missed updating the open api spec in the original PR.
We need this so that the cplane auto-generated client sees the flag.
2025-03-03 19:04:01 +00:00
Vlad Lazar
0d6d58bd3e pageserver: make heatmap layer download API more cplane friendly (#10957)
## Problem

We intend for cplane to use the heatmap layer download API to warm up
timelines after unarchival. It's tricky for them to recurse in the
ancestors,
and the current implementation doesn't work well when unarchiving a
chain
of branches and warming them up.

## Summary of changes

* Add a `recurse` flag to the API. When the flag is set, the operation
recurses into the parent
timeline after the current one is done.
* Be resilient to warming up a chain of unarchived branches. Let's say
we unarchived `B` and `C` from
the `A -> B -> C` branch hierarchy. `B` got unarchived first. We
generated the unarchival heatmaps
and stash them in `A` and `B`. When `C` unarchived, it dropped it's
unarchival heatmap since `A` and `B`
already had one. If `C` needed layers from `A` and `B`, it was out of
luck. Now, when choosing whether
to keep an unarchival heatmap we look at its end LSN. If it's more
inclusive than what we currently have,
keep it.
2025-02-28 10:36:53 +00:00
Christian Schwarz
f09843ef17 refactor(pageserver): propagate RequestContext to layer downloads (#11001)
For some reason the layer download API never fully got
`RequestContext`-infected.

This PR fixes that as a precursor to
- https://github.com/neondatabase/neon/issues/6107
2025-02-27 09:26:25 +00:00
Arpad Müller
920040e402 Update storage components to edition 2024 (#10919)
Updates storage components to edition 2024. We like to stay on the
latest edition if possible. There is no functional changes, however some
code changes had to be done to accommodate the edition's breaking
changes.

The PR has two commits:

* the first commit updates storage crates to edition 2024 and appeases
`cargo clippy` by changing code. i have accidentially ran the formatter
on some files that had other edits.
* the second commit performs a `cargo fmt`

I would recommend a closer review of the first commit and a less close
review of the second one (as it just runs `cargo fmt`).

part of https://github.com/neondatabase/neon/issues/10918
2025-02-25 23:51:37 +00:00