mirror of
https://github.com/neondatabase/neon.git
synced 2026-01-07 13:32:57 +00:00
## Problem

To test sharding, we need something to control it. We could write Python code for doing this from the test runner, but this wouldn't be usable with neon_local run directly, and when we want to write tests with a large number of shards/tenants, Rust is a better fit for efficiently handling all the required state.

This service enables automated tests to easily get a system with sharding/HA without the test itself having to set this all up by hand: existing tests can be run against sharded tenants just by setting a shard count when creating the tenant.

## Summary of changes

Attachment service was previously a map of TenantId->TenantState, where the principal state stored for each tenant was the generation and the last attached pageserver. This enabled it to serve the re-attach and validate requests that the pageserver requires.

In this PR, the scope of the service is extended substantially to do overall management of tenants in the pageserver, including tenant/timeline creation, live migration, evacuation of offline pageservers etc. This is done using synchronous code to make declarative changes to the tenant's intended state (`TenantState.policy` and `TenantState.intent`), which are then translated into calls into the pageserver by the `Reconciler`.

Top level summary of modules within `control_plane/attachment_service/src`:

- `tenant_state`: structure that represents one tenant shard.
- `service`: implements the main high-level operations such as tenant/timeline creation, marking a node offline, etc.
- `scheduler`: for operations that need to pick a pageserver for a tenant, construct a scheduler and call into it.
- `compute_hook`: receive notifications when a tenant shard is attached somewhere new. Once we have locations for all the shards in a tenant, emit an update to postgres configuration via the neon_local `LocalEnv`.
- `http`: HTTP stubs. These mostly map to methods on `Service`, but are separated for readability and so that it'll be easier to adapt if/when we switch to another RPC layer.
- `node`: structure that describes a pageserver node. The most important attribute of a node is its availability: marking a node offline causes tenant shards to reschedule away from it.

This PR is a precursor to implementing the full sharding service for prod (#6342). What's the difference between this and a production-ready controller for pageservers?

- JSON file persistence, to be replaced with a database.
- Limited observability.
- No concurrency limits. Marking a pageserver offline will try to migrate every tenant to a new pageserver concurrently, even if there are thousands.
- Very simple scheduler that only knows to pick the pageserver with the fewest tenants, and to place secondary locations on a different pageserver than attached locations: it does not try to place shards for the same tenant on different pageservers. This matters little in tests, because picking the least-used pageserver usually results in round-robin placement.
- Scheduler state is rebuilt exhaustively for each operation that requires a scheduler.
- Relies on neon_local mechanisms for updating postgres: in production this would be something that flows through the real control plane.

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
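The scheduling rule described above (pick the pageserver with the fewest tenants, and put the secondary location on a different pageserver than the attached one) can be illustrated with a small Python sketch. This is not the Rust `scheduler` module from the PR: the function names `pick_least_loaded` and `schedule_shard`, the use of integer node IDs, and the choice to count secondary locations toward a node's load are all assumptions made for illustration.

```python
from typing import Dict, Optional, Set, Tuple


def pick_least_loaded(
    tenant_counts: Dict[int, int],
    exclude: Optional[Set[int]] = None,
) -> int:
    """Return the node ID holding the fewest tenant shards (hypothetical helper)."""
    excluded = exclude or set()
    candidates = {n: c for n, c in tenant_counts.items() if n not in excluded}
    if not candidates:
        raise RuntimeError("no eligible pageserver")
    # Break ties by node ID so the choice is deterministic.
    return min(candidates, key=lambda n: (candidates[n], n))


def schedule_shard(tenant_counts: Dict[int, int]) -> Tuple[int, int]:
    """Pick an attached and a secondary location on two different nodes.

    Assumption: both locations count toward a node's load.
    """
    attached = pick_least_loaded(tenant_counts)
    tenant_counts[attached] += 1
    secondary = pick_least_loaded(tenant_counts, exclude={attached})
    tenant_counts[secondary] += 1
    return attached, secondary


if __name__ == "__main__":
    counts = {1: 0, 2: 0, 3: 0}
    for _ in range(3):
        print(schedule_shard(counts), counts)
```

With equal starting loads, repeated calls spread shards across the nodes in turn, which is the round-robin effect the description mentions. Nothing in this sketch keeps shards of the same tenant apart, matching the limitation listed above.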
78 lines
3.1 KiB
Python
from contextlib import closing

from fixtures.benchmark_fixture import MetricReport
from fixtures.compare_fixtures import NeonCompare, PgCompare
from fixtures.pageserver.utils import wait_tenant_status_404
from fixtures.pg_version import PgVersion
from fixtures.types import Lsn


#
# Run bulk INSERT test.
#
# Collects metrics:
#
# 1. Time to INSERT 5 million rows
# 2. Disk writes
# 3. Disk space used
# 4. Peak memory usage
#
def test_bulk_insert(neon_with_baseline: PgCompare):
    env = neon_with_baseline

    start_lsn = Lsn(env.pg.safe_psql("SELECT pg_current_wal_lsn()")[0][0])

    with closing(env.pg.connect()) as conn:
        with conn.cursor() as cur:
            cur.execute("create table huge (i int, j int);")

            # Run INSERT, recording the time and I/O it takes
            with env.record_pageserver_writes("pageserver_writes"):
                with env.record_duration("insert"):
                    cur.execute("insert into huge values (generate_series(1, 5000000), 0);")
                    env.flush()

            env.report_peak_memory_use()
            env.report_size()

    # Report amount of wal written. Useful for comparing vanilla wal format vs
    # neon wal format, measuring neon write amplification, etc.
    end_lsn = Lsn(env.pg.safe_psql("SELECT pg_current_wal_lsn()")[0][0])
    wal_written_bytes = end_lsn - start_lsn
    wal_written_mb = round(wal_written_bytes / (1024 * 1024))
    env.zenbenchmark.record("wal_written", wal_written_mb, "MB", MetricReport.TEST_PARAM)

    # When testing neon, also check how long it takes the pageserver to reingest the
    # wal from safekeepers. If this number is close to total runtime, then the pageserver
    # is the bottleneck.
    if isinstance(env, NeonCompare):
        measure_recovery_time(env)


def measure_recovery_time(env: NeonCompare):
    client = env.env.pageserver.http_client()
    pg_version = PgVersion(client.timeline_detail(env.tenant, env.timeline)["pg_version"])

    # Delete the Tenant in the pageserver: this will drop local and remote layers, such that
    # when we "create" the Tenant again, we will replay the WAL from the beginning.
    #
    # This is a "weird" thing to do, and can confuse the attachment service as we're re-using
    # the same tenant ID for a tenant that is logically different from the pageserver's point
    # of view, but the same as far as the safekeeper/WAL is concerned. To work around that,
    # we will explicitly create the tenant in the same generation that it was previously
    # attached in.
    attach_status = env.env.attachment_service.inspect(tenant_shard_id=env.tenant)
    assert attach_status is not None
    (attach_gen, _) = attach_status

    client.tenant_delete(env.tenant)
    wait_tenant_status_404(client, env.tenant, iterations=60, interval=0.5)
    env.env.pageserver.tenant_create(tenant_id=env.tenant, generation=attach_gen)

    # Measure recovery time
    with env.record_duration("wal_recovery"):
        client.timeline_create(pg_version, env.tenant, env.timeline)

        # Flush, which will also wait for lsn to catch up
        env.flush()
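The `wal_written` metric above is the difference between two values returned by `pg_current_wal_lsn()`. PostgreSQL prints an LSN as two hexadecimal numbers separated by a slash, the high and low 32 bits of a 64-bit byte position, so subtracting two LSNs yields a byte count. The sketch below shows that arithmetic on its own, without the test fixtures. `parse_lsn` and `wal_written_mb` are hypothetical stand-ins written for this illustration, not the `fixtures.types.Lsn` class the test uses.

```python
def parse_lsn(text: str) -> int:
    """Convert an LSN in PostgreSQL's 'hi/lo' hex text form to a byte position.

    Hypothetical helper for illustration only.
    """
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def wal_written_mb(start_lsn: str, end_lsn: str) -> int:
    """Return the WAL written between two LSNs, rounded to whole MiB."""
    written_bytes = parse_lsn(end_lsn) - parse_lsn(start_lsn)
    return round(written_bytes / (1024 * 1024))


if __name__ == "__main__":
    # 0x9000000 - 0x1000000 = 0x8000000 bytes = 128 MiB
    print(wal_written_mb("0/1000000", "0/9000000"))
```

The second hex field wraps at 32 bits, so the high word must be included in the subtraction. Comparing only the low words would give wrong results whenever the WAL position crosses a 4 GiB boundary during the insert.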