mirror of
https://github.com/neondatabase/neon.git
synced 2026-01-10 23:12:54 +00:00
part of Epic https://github.com/neondatabase/neon/issues/7386

# Motivation

The materialized page cache adds complexity to the code base, which increases the maintenance burden and the risk of subtle, hard-to-reproduce bugs such as #8050.

Further, the best hit rate that we currently achieve in production is ca. 1% of materialized page cache lookups for `task_kind=PageRequestHandler`. Other task kinds have hit rates below 0.2%.

Last, caching page images in Pageserver rewards under-sized caches in Computes, because reading from Pageserver's materialized page cache over the network is often sufficiently fast (low hundreds of microseconds). Such Computes should upscale their local caches to fit their working set, rather than repeatedly requesting the same page from Pageserver.

Some more discussion and context in internal thread https://neondb.slack.com/archives/C033RQ5SPDH/p1718714037708459

# Changes

This PR removes the materialized page cache code & metrics. The infrastructure for different key kinds in `PageCache` is left in place, even though the "Immutable" key kind is the only remaining one. This can be further simplified in a future commit.

Some tests started failing because their total runtime depended on high materialized page cache hit rates. This PR makes them fixed-runtime or raises their pytest timeouts:

* test_local_file_cache_unlink
* test_physical_replication
* test_pg_regress

# Performance

I focused on ensuring that this PR will not cause a performance regression in prod.

* **getpage** requests: our production metrics have shown the materialized page cache to be irrelevant (low hit rate). Also, Pageserver is the wrong place to cache page images; that should happen in Compute.
* **ingest** (`task_kind=WalReceiverConnectionHandler`): prod metrics show a 0% hit rate, so removing the cache is not a regression.
* **get_lsn_by_timestamp**: an important API for branch creation, used by the control plane. The clog pages that this code uses are not materialized-page-cached because they are not 8k. No risk of introducing a regression here.

We will watch the various nightly benchmarks closely for more results before shipping to prod.
41 lines
2.0 KiB
Python
import random
import time

from fixtures.neon_fixtures import NeonEnv


def test_physical_replication(neon_simple_env: NeonEnv):
    env = neon_simple_env
    with env.endpoints.create_start(
        branch_name="main",
        endpoint_id="primary",
    ) as primary:
        with primary.connect() as p_con:
            with p_con.cursor() as p_cur:
                p_cur.execute(
                    "CREATE TABLE t(pk bigint primary key, payload text default repeat('?',200))"
                )
        time.sleep(1)
        with env.endpoints.new_replica_start(origin=primary, endpoint_id="secondary") as secondary:
            with primary.connect() as p_con:
                with p_con.cursor() as p_cur:
                    with secondary.connect() as s_con:
                        with s_con.cursor() as s_cur:
                            runtime_secs = 30
                            started_at = time.time()
                            pk = 0
                            while True:
                                pk += 1
                                now = time.time()
                                if now - started_at > runtime_secs:
                                    break
                                p_cur.execute("insert into t (pk) values (%s)", (pk,))
                                # An earlier version of this test ran a fixed number of loop
                                # iterations and selected pk=random.randrange(1, <fixed number
                                # of loop iterations>).
                                # => the probability of selecting a value that was never inserted
                                #    changed from 99.9999% to 0% over the course of the test.
                                #
                                # We changed the select to pk=random.randrange(1, 2*pk), which
                                # fixes the probability at 50%.
                                s_cur.execute(
                                    "select * from t where pk=%s", (random.randrange(1, 2 * pk),)
                                )
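The comment's 50% claim can be sanity-checked in isolation: after inserting pks 1..pk, `random.randrange(1, 2*pk)` draws uniformly from [1, 2*pk), of which exactly pk values exist in the table. A quick standalone simulation (not part of the test suite) illustrates this:

```python
import random

# After inserting pks 1..pk, the select key is random.randrange(1, 2*pk),
# i.e. uniform over the 2*pk - 1 integers in [1, 2*pk). Exactly pk of them
# were inserted, so the hit probability is pk / (2*pk - 1) ~= 50%.
pk = 1000
trials = 100_000
hits = sum(1 for _ in range(trials) if random.randrange(1, 2 * pk) <= pk)
print(round(hits / trials, 2))  # ~0.5
```

This is why the rewritten test has a workload whose select hit rate stays constant over the whole run, making its runtime independent of cache behavior.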