A few PS changes (#12540)

# TLDR
All changes are no-ops except for the addition of some metrics.

## Summary of changes I
### Pageserver
Added a new global counter metric
`pageserver_pagestream_handler_results_total` that categorizes
pagestream request results according to their outcomes:
1. Success
2. Internal errors
3. Other errors

Internal errors include:
1. Page reconstruction error: This probably indicates a pageserver
bug/corruption
2. LSN timeout error: Could indicate overload or bugs with PS's ability
to reach other components
3. Misrouted request error: Indicates bugs in the Storage Controller/HCC

Other errors include transient errors that are expected during normal
operation or errors indicating bugs with other parts of the system
(e.g., malformed requests, errors due to cancelled operations during PS
shutdown, etc.)    
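The three-way split described above can be sketched as follows. This is a minimal illustration, not the pageserver's implementation (which is in Rust): the error names in `INTERNAL_ERRORS`, the `Outcome` enum, and the `record` helper are all hypothetical, chosen only to mirror the categories listed above.

```python
from collections import Counter
from enum import Enum
from typing import Optional


class Outcome(Enum):
    SUCCESS = "success"
    INTERNAL_ERROR = "internal_error"
    OTHER_ERROR = "other_error"


# Hypothetical error names standing in for the three internal error kinds:
# page reconstruction, LSN timeout, and misrouted request.
INTERNAL_ERRORS = {"reconstruct", "lsn_timeout", "misrouted"}

# Stand-in for a global counter with one series per outcome.
results_total = Counter()


def classify(error: Optional[str]) -> Outcome:
    """Map a request result to one of the three outcome categories."""
    if error is None:
        return Outcome.SUCCESS
    if error in INTERNAL_ERRORS:
        return Outcome.INTERNAL_ERROR
    # Everything else (malformed request, cancellation on shutdown, ...).
    return Outcome.OTHER_ERROR


def record(error: Optional[str]) -> None:
    """Increment the counter for the category this result falls into."""
    results_total[classify(error)] += 1
```

Keeping "internal" as an explicit allowlist means any error kind added later falls into "other" by default, so the internal-error series only counts cases that were deliberately classified as such.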


## Summary of changes II
This PR adds a pageserver endpoint, and its counterpart in the storage
controller, to list the visible size of all tenant shards. This is a
prerequisite for the tenant rebalance command.
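As a rough sketch of how a consumer such as a rebalance command might use this listing, the snippet below sums per-shard visible sizes into per-tenant totals. It assumes the response is a JSON object mapping a shard ID string to an integer size, and that a sharded ID is the tenant ID followed by `-` and a shard suffix; the key format and the helper name `visible_size_by_tenant` are assumptions, not taken from this PR.

```python
from collections import defaultdict
from typing import Dict


def visible_size_by_tenant(shard_sizes: Dict[str, int]) -> Dict[str, int]:
    """Sum per-shard visible sizes into one total per tenant.

    Assumes each key is either a bare tenant ID (unsharded tenant) or
    "<tenant_id>-<shard suffix>" (sharded tenant).
    """
    totals: Dict[str, int] = defaultdict(int)
    for shard_id, size in shard_sizes.items():
        # Everything before the first "-" is treated as the tenant ID.
        tenant_id = shard_id.split("-", 1)[0]
        totals[tenant_id] += size
    return dict(totals)
```

For example, with the hypothetical input `{"aa-0002": 10, "aa-0102": 5, "bb": 7}`, the two `aa` shards are combined and `bb` is reported as is.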


## Problem III
We need a way to download WAL segments and layer files from S3 and to
replay WAL records. We cannot access production S3 directly from our
laptops, and GDPR compliance prevents us from transferring any user data
out of production systems, so we need tooling that works within
production.

## Summary of changes III

This PR adds a couple of tools to support the debugging
workflow in production:
1. A new `pagectl download-remote-object` command that can be used to
download remote storage objects assuming the correct access is set up.

## Summary of changes IV
This PR adds a command to list all visible delta and image layers from
index_part. This is useful for debugging compaction issues, as index_part
often contains many covered layers due to PITR.

---------

Co-authored-by: William Huang <william.huang@databricks.com>
Co-authored-by: Chen Luo <chen.luo@databricks.com>
Co-authored-by: Vlad Lazar <vlad@neon.tech>
HaoyuHuang
2025-07-10 07:39:38 -07:00
committed by GitHub
parent be5bbaecad
commit 2c6b327be6
15 changed files with 404 additions and 25 deletions


@@ -333,6 +333,13 @@ class PageserverHttpClient(requests.Session, MetricsGetter):
        res = self.post(f"http://localhost:{self.port}/v1/reload_auth_validation_keys")
        self.verbose_error(res)

    def list_tenant_visible_size(self) -> dict[TenantShardId, int]:
        res = self.get(f"http://localhost:{self.port}/v1/list_tenant_visible_size")
        self.verbose_error(res)
        res_json = res.json()
        assert isinstance(res_json, dict)
        return res_json

    def tenant_list(self) -> list[dict[Any, Any]]:
        res = self.get(f"http://localhost:{self.port}/v1/tenant")
        self.verbose_error(res)


@@ -3,6 +3,7 @@ from __future__ import annotations
from typing import TYPE_CHECKING
from fixtures.common_types import Lsn, TenantId, TimelineId
from fixtures.log_helper import log
from fixtures.neon_fixtures import (
    DEFAULT_BRANCH_NAME,
    NeonEnv,
@@ -164,3 +165,15 @@ def test_pageserver_http_index_part_force_patch(neon_env_builder: NeonEnvBuilder
        {"rel_size_migration": "legacy"},
    )
    assert client.timeline_detail(tenant_id, timeline_id)["rel_size_migration"] == "legacy"


def test_pageserver_get_tenant_visible_size(neon_env_builder: NeonEnvBuilder):
    neon_env_builder.num_pageservers = 1
    env = neon_env_builder.init_start()
    env.create_tenant(shard_count=4)
    env.create_tenant(shard_count=2)
    json = env.pageserver.http_client().list_tenant_visible_size()
    log.info(f"{json}")
    # initial tenant (1 shard) + 2 newly created tenants (4 + 2 shards)
    assert len(json) == 7