mirror of
https://github.com/neondatabase/neon.git
synced 2026-06-03 13:30:38 +00:00
## Problem - When we scheduled locations, we were doing it without any context about other shards in the same tenant - After a shard split, there wasn't an automatic mechanism to migrate the attachments away from the split location - After a shard split and the migration away from the split location, there wasn't an automatic mechanism to pick new secondary locations so that the end state has no concentration of locations on the nodes where the split happened. Partially completes: https://github.com/neondatabase/neon/issues/7139 ## Summary of changes - Scheduler now takes a `ScheduleContext` object that can be populated with information about other shards - During tenant creation and shard split, we incrementally build up the ScheduleContext, updating it for each shard as we proceed. - When scheduling new locations, the ScheduleContext is used to apply a soft anti-affinity to nodes where a tenant already has shards. - The background reconciler task now has an extra phase `optimize_all`, which runs only if the primary `reconcile_all` phase didn't generate any work. The separation is that `reconcile_all` is needed for availability, but optimize_all is purely "nice to have" work to balance work across the nodes better. - optimize_all calls into two new TenantState methods called optimize_attachment and optimize_secondary, which seek out opportunities to improve placment: - optimize_attachment: if the node where we're currently attached has an excess of attached shard locations for this tenant compared with the node where we have a secondary location, then cut over to the secondary location. - optimize_secondary: if the node holding our secondary location has an excessive number of locations for this tenant compared with some other node where we don't currently have a location, then create a new secondary location on that other node. - a new debug API endpoint is provided to run background tasks on-demand. This returns a number of reconciliations in progress, so callers can keep calling until they get a `0` to advance the system to its final state without waiting for many iterations of the background task. Optimization is run at an implicitly low priority by: - Omitting the phase entirely if reconcile_all has work to do - Skipping optimization of any tenant that has reconciles in flight - Limiting the total number of optimizations that will be run from one call to optimize_all to a constant (currently 2). The idea of that low priority execution is to minimize the operational risk that optimization work overloads any part of the system. It happens to also make the system easier to observe and debug, as we avoid running large numbers of concurrent changes. Eventually we may relax these limitations: there is no correctness problem with optimizing lots of tenants concurrently, and optimizing multiple shards in one tenant just requires housekeeping changes to update ShardContext with the result of one optimization before proceeding to the next shard.
289 lines
9.9 KiB
Rust
289 lines
9.9 KiB
Rust
//!
|
|
//! This module provides metric definitions for the storage controller.
|
|
//!
|
|
//! All metrics are grouped in [`StorageControllerMetricGroup`]. [`StorageControllerMetrics`] holds
|
|
//! the mentioned metrics and their encoder. It's globally available via the [`METRICS_REGISTRY`]
|
|
//! constant.
|
|
//!
|
|
//! The rest of the code defines label group types and deals with converting outer types to labels.
|
|
//!
|
|
use bytes::Bytes;
|
|
use measured::{
|
|
label::{LabelValue, StaticLabelSet},
|
|
FixedCardinalityLabel, MetricGroup,
|
|
};
|
|
use once_cell::sync::Lazy;
|
|
use std::sync::Mutex;
|
|
|
|
use crate::persistence::{DatabaseError, DatabaseOperation};
|
|
|
|
pub(crate) static METRICS_REGISTRY: Lazy<StorageControllerMetrics> =
|
|
Lazy::new(StorageControllerMetrics::default);
|
|
|
|
pub fn preinitialize_metrics() {
|
|
Lazy::force(&METRICS_REGISTRY);
|
|
}
|
|
|
|
pub(crate) struct StorageControllerMetrics {
|
|
pub(crate) metrics_group: StorageControllerMetricGroup,
|
|
encoder: Mutex<measured::text::TextEncoder>,
|
|
}
|
|
|
|
#[derive(measured::MetricGroup)]
|
|
pub(crate) struct StorageControllerMetricGroup {
|
|
/// Count of how many times we spawn a reconcile task
|
|
pub(crate) storage_controller_reconcile_spawn: measured::Counter,
|
|
/// Reconciler tasks completed, broken down by success/failure/cancelled
|
|
pub(crate) storage_controller_reconcile_complete:
|
|
measured::CounterVec<ReconcileCompleteLabelGroupSet>,
|
|
|
|
/// Count of how many times we make an optimization change to a tenant's scheduling
|
|
pub(crate) storage_controller_schedule_optimization: measured::Counter,
|
|
|
|
/// HTTP request status counters for handled requests
|
|
pub(crate) storage_controller_http_request_status:
|
|
measured::CounterVec<HttpRequestStatusLabelGroupSet>,
|
|
/// HTTP request handler latency across all status codes
|
|
pub(crate) storage_controller_http_request_latency:
|
|
measured::HistogramVec<HttpRequestLatencyLabelGroupSet, 5>,
|
|
|
|
/// Count of HTTP requests to the pageserver that resulted in an error,
|
|
/// broken down by the pageserver node id, request name and method
|
|
pub(crate) storage_controller_pageserver_request_error:
|
|
measured::CounterVec<PageserverRequestLabelGroupSet>,
|
|
|
|
/// Latency of HTTP requests to the pageserver, broken down by pageserver
|
|
/// node id, request name and method. This include both successful and unsuccessful
|
|
/// requests.
|
|
pub(crate) storage_controller_pageserver_request_latency:
|
|
measured::HistogramVec<PageserverRequestLabelGroupSet, 5>,
|
|
|
|
/// Count of pass-through HTTP requests to the pageserver that resulted in an error,
|
|
/// broken down by the pageserver node id, request name and method
|
|
pub(crate) storage_controller_passthrough_request_error:
|
|
measured::CounterVec<PageserverRequestLabelGroupSet>,
|
|
|
|
/// Latency of pass-through HTTP requests to the pageserver, broken down by pageserver
|
|
/// node id, request name and method. This include both successful and unsuccessful
|
|
/// requests.
|
|
pub(crate) storage_controller_passthrough_request_latency:
|
|
measured::HistogramVec<PageserverRequestLabelGroupSet, 5>,
|
|
|
|
/// Count of errors in database queries, broken down by error type and operation.
|
|
pub(crate) storage_controller_database_query_error:
|
|
measured::CounterVec<DatabaseQueryErrorLabelGroupSet>,
|
|
|
|
/// Latency of database queries, broken down by operation.
|
|
pub(crate) storage_controller_database_query_latency:
|
|
measured::HistogramVec<DatabaseQueryLatencyLabelGroupSet, 5>,
|
|
}
|
|
|
|
impl StorageControllerMetrics {
|
|
pub(crate) fn encode(&self) -> Bytes {
|
|
let mut encoder = self.encoder.lock().unwrap();
|
|
self.metrics_group.collect_into(&mut *encoder);
|
|
encoder.finish()
|
|
}
|
|
}
|
|
|
|
impl Default for StorageControllerMetrics {
|
|
fn default() -> Self {
|
|
Self {
|
|
metrics_group: StorageControllerMetricGroup::new(),
|
|
encoder: Mutex::new(measured::text::TextEncoder::new()),
|
|
}
|
|
}
|
|
}
|
|
|
|
impl StorageControllerMetricGroup {
|
|
pub(crate) fn new() -> Self {
|
|
Self {
|
|
storage_controller_reconcile_spawn: measured::Counter::new(),
|
|
storage_controller_reconcile_complete: measured::CounterVec::new(
|
|
ReconcileCompleteLabelGroupSet {
|
|
status: StaticLabelSet::new(),
|
|
},
|
|
),
|
|
storage_controller_schedule_optimization: measured::Counter::new(),
|
|
storage_controller_http_request_status: measured::CounterVec::new(
|
|
HttpRequestStatusLabelGroupSet {
|
|
path: lasso::ThreadedRodeo::new(),
|
|
method: StaticLabelSet::new(),
|
|
status: StaticLabelSet::new(),
|
|
},
|
|
),
|
|
storage_controller_http_request_latency: measured::HistogramVec::new(
|
|
measured::metric::histogram::Thresholds::exponential_buckets(0.1, 2.0),
|
|
),
|
|
storage_controller_pageserver_request_error: measured::CounterVec::new(
|
|
PageserverRequestLabelGroupSet {
|
|
pageserver_id: lasso::ThreadedRodeo::new(),
|
|
path: lasso::ThreadedRodeo::new(),
|
|
method: StaticLabelSet::new(),
|
|
},
|
|
),
|
|
storage_controller_pageserver_request_latency: measured::HistogramVec::new(
|
|
measured::metric::histogram::Thresholds::exponential_buckets(0.1, 2.0),
|
|
),
|
|
storage_controller_passthrough_request_error: measured::CounterVec::new(
|
|
PageserverRequestLabelGroupSet {
|
|
pageserver_id: lasso::ThreadedRodeo::new(),
|
|
path: lasso::ThreadedRodeo::new(),
|
|
method: StaticLabelSet::new(),
|
|
},
|
|
),
|
|
storage_controller_passthrough_request_latency: measured::HistogramVec::new(
|
|
measured::metric::histogram::Thresholds::exponential_buckets(0.1, 2.0),
|
|
),
|
|
storage_controller_database_query_error: measured::CounterVec::new(
|
|
DatabaseQueryErrorLabelGroupSet {
|
|
operation: StaticLabelSet::new(),
|
|
error_type: StaticLabelSet::new(),
|
|
},
|
|
),
|
|
storage_controller_database_query_latency: measured::HistogramVec::new(
|
|
measured::metric::histogram::Thresholds::exponential_buckets(0.1, 2.0),
|
|
),
|
|
}
|
|
}
|
|
}
|
|
|
|
#[derive(measured::LabelGroup)]
|
|
#[label(set = ReconcileCompleteLabelGroupSet)]
|
|
pub(crate) struct ReconcileCompleteLabelGroup {
|
|
pub(crate) status: ReconcileOutcome,
|
|
}
|
|
|
|
#[derive(measured::LabelGroup)]
|
|
#[label(set = HttpRequestStatusLabelGroupSet)]
|
|
pub(crate) struct HttpRequestStatusLabelGroup<'a> {
|
|
#[label(dynamic_with = lasso::ThreadedRodeo)]
|
|
pub(crate) path: &'a str,
|
|
pub(crate) method: Method,
|
|
pub(crate) status: StatusCode,
|
|
}
|
|
|
|
#[derive(measured::LabelGroup)]
|
|
#[label(set = HttpRequestLatencyLabelGroupSet)]
|
|
pub(crate) struct HttpRequestLatencyLabelGroup<'a> {
|
|
#[label(dynamic_with = lasso::ThreadedRodeo)]
|
|
pub(crate) path: &'a str,
|
|
pub(crate) method: Method,
|
|
}
|
|
|
|
impl Default for HttpRequestLatencyLabelGroupSet {
|
|
fn default() -> Self {
|
|
Self {
|
|
path: lasso::ThreadedRodeo::new(),
|
|
method: StaticLabelSet::new(),
|
|
}
|
|
}
|
|
}
|
|
|
|
#[derive(measured::LabelGroup, Clone)]
|
|
#[label(set = PageserverRequestLabelGroupSet)]
|
|
pub(crate) struct PageserverRequestLabelGroup<'a> {
|
|
#[label(dynamic_with = lasso::ThreadedRodeo)]
|
|
pub(crate) pageserver_id: &'a str,
|
|
#[label(dynamic_with = lasso::ThreadedRodeo)]
|
|
pub(crate) path: &'a str,
|
|
pub(crate) method: Method,
|
|
}
|
|
|
|
impl Default for PageserverRequestLabelGroupSet {
|
|
fn default() -> Self {
|
|
Self {
|
|
pageserver_id: lasso::ThreadedRodeo::new(),
|
|
path: lasso::ThreadedRodeo::new(),
|
|
method: StaticLabelSet::new(),
|
|
}
|
|
}
|
|
}
|
|
|
|
#[derive(measured::LabelGroup)]
|
|
#[label(set = DatabaseQueryErrorLabelGroupSet)]
|
|
pub(crate) struct DatabaseQueryErrorLabelGroup {
|
|
pub(crate) error_type: DatabaseErrorLabel,
|
|
pub(crate) operation: DatabaseOperation,
|
|
}
|
|
|
|
#[derive(measured::LabelGroup)]
|
|
#[label(set = DatabaseQueryLatencyLabelGroupSet)]
|
|
pub(crate) struct DatabaseQueryLatencyLabelGroup {
|
|
pub(crate) operation: DatabaseOperation,
|
|
}
|
|
|
|
#[derive(FixedCardinalityLabel)]
|
|
pub(crate) enum ReconcileOutcome {
|
|
#[label(rename = "ok")]
|
|
Success,
|
|
Error,
|
|
Cancel,
|
|
}
|
|
|
|
#[derive(FixedCardinalityLabel, Clone)]
|
|
pub(crate) enum Method {
|
|
Get,
|
|
Put,
|
|
Post,
|
|
Delete,
|
|
Other,
|
|
}
|
|
|
|
impl From<hyper::Method> for Method {
|
|
fn from(value: hyper::Method) -> Self {
|
|
if value == hyper::Method::GET {
|
|
Method::Get
|
|
} else if value == hyper::Method::PUT {
|
|
Method::Put
|
|
} else if value == hyper::Method::POST {
|
|
Method::Post
|
|
} else if value == hyper::Method::DELETE {
|
|
Method::Delete
|
|
} else {
|
|
Method::Other
|
|
}
|
|
}
|
|
}
|
|
|
|
pub(crate) struct StatusCode(pub(crate) hyper::http::StatusCode);
|
|
|
|
impl LabelValue for StatusCode {
|
|
fn visit<V: measured::label::LabelVisitor>(&self, v: V) -> V::Output {
|
|
v.write_int(self.0.as_u16() as u64)
|
|
}
|
|
}
|
|
|
|
impl FixedCardinalityLabel for StatusCode {
|
|
fn cardinality() -> usize {
|
|
(100..1000).len()
|
|
}
|
|
|
|
fn encode(&self) -> usize {
|
|
self.0.as_u16() as usize
|
|
}
|
|
|
|
fn decode(value: usize) -> Self {
|
|
Self(hyper::http::StatusCode::from_u16(u16::try_from(value).unwrap()).unwrap())
|
|
}
|
|
}
|
|
|
|
#[derive(FixedCardinalityLabel)]
|
|
pub(crate) enum DatabaseErrorLabel {
|
|
Query,
|
|
Connection,
|
|
ConnectionPool,
|
|
Logical,
|
|
}
|
|
|
|
impl DatabaseError {
|
|
pub(crate) fn error_label(&self) -> DatabaseErrorLabel {
|
|
match self {
|
|
Self::Query(_) => DatabaseErrorLabel::Query,
|
|
Self::Connection(_) => DatabaseErrorLabel::Connection,
|
|
Self::ConnectionPool(_) => DatabaseErrorLabel::ConnectionPool,
|
|
Self::Logical(_) => DatabaseErrorLabel::Logical,
|
|
}
|
|
}
|
|
}
|