Files
neon/test_runner/regress/test_physical_replication.py
Christian Schwarz 79401638df remove materialized page cache (#8105)
part of Epic https://github.com/neondatabase/neon/issues/7386

# Motivation

The materialized page cache adds complexity to the code base, which
increases the maintenance burden and risk for subtle and hard to
reproduce bugs such as #8050.

Further, the best hit rate that we currently achieve in production is ca
1% of materialized page cache lookups for
`task_kind=PageRequestHandler`. Other task kinds have hit rates <0.2%.

Last, caching page images in Pageserver rewards under-sized caches in
Computes because reading from Pageserver's materialized page cache over
the network is often sufficiently fast (low hundreds of microseconds).
Such Computes should upscale their local caches to fit their working
set, rather than repeatedly requesting the same page from Pageserver.

Some more discussion and context in internal thread
https://neondb.slack.com/archives/C033RQ5SPDH/p1718714037708459

# Changes

This PR removes the materialized page cache code & metrics.

The infrastructure for different key kinds in `PageCache` is left in
place, even though the "Immutable" key kind is the only remaining one.
This can be further simplified in a future commit.

Some tests started failing because their total runtime was dependent on
high materialized page cache hit rates. This test makes them
fixed-runtime or raises pytest timeouts:
* test_local_file_cache_unlink
* test_physical_replication
* test_pg_regress

# Performance

I focussed on ensuring that this PR will not result in a performance
regression in prod.

* **getpage** requests: our production metrics have shown the
materialized page cache to be irrelevant (low hit rate). Also,
Pageserver is the wrong place to cache page images, it should happen in
compute.
* **ingest** (`task_kind=WalReceiverConnectionHandler`): prod metrics
show 0 percent hit rate, so, removing will not be a regression.
* **get_lsn_by_timestamp**: important API for branch creation, used by
control pane. The clog pages that this code uses are not
materialize-page-cached because they're not 8k. No risk of introducing a
regression here.

We will watch the various nightly benchmarks closely for more results
before shipping to prod.
2024-06-20 11:56:14 +02:00

41 lines
2.0 KiB
Python

import random
import time
from fixtures.neon_fixtures import NeonEnv
def test_physical_replication(neon_simple_env: NeonEnv):
env = neon_simple_env
with env.endpoints.create_start(
branch_name="main",
endpoint_id="primary",
) as primary:
with primary.connect() as p_con:
with p_con.cursor() as p_cur:
p_cur.execute(
"CREATE TABLE t(pk bigint primary key, payload text default repeat('?',200))"
)
time.sleep(1)
with env.endpoints.new_replica_start(origin=primary, endpoint_id="secondary") as secondary:
with primary.connect() as p_con:
with p_con.cursor() as p_cur:
with secondary.connect() as s_con:
with s_con.cursor() as s_cur:
runtime_secs = 30
started_at = time.time()
pk = 0
while True:
pk += 1
now = time.time()
if now - started_at > runtime_secs:
break
p_cur.execute("insert into t (pk) values (%s)", (pk,))
# an earlier version of this test was based on a fixed number of loop iterations
# and selected for pk=(random.randrange(1, fixed number of loop iterations)).
# => the probability of selection for a value that was never inserted changed from 99.9999% to 0% over the course of the test.
#
# We changed the test to where=(random.randrange(1, 2*pk)), which means the probability is now fixed to 50%.
s_cur.execute(
"select * from t where pk=%s", (random.randrange(1, 2 * pk),)
)