rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-05 13:10:37 +00:00

Author	SHA1	Message	Date
Christian Schwarz	dfcbb139fb	the `None` configuration in the benchmark would use the default instead of the serial configuration; fix that	2024-11-29 13:35:24 +01:00
Christian Schwarz	a2a3613185	reintroduce task-based execution	2024-11-28 20:50:06 +01:00
Christian Schwarz	6bd39f95f5	rn benchmark on hetzner runner
Benchmark results:
test_throughput[release-pg16-50-None-30-1-128-not batchable None]
  tablesize_mib: 50 MiB
  pipelining_enabled: 0
  effective_io_concurrency: 1
  readhead_buffer_size: 128
  counters.time: 0.8905
  counters.pageserver_getpage_count: 6,403.0000
  counters.pageserver_vectored_get_count: 6,403.0000
  counters.compute_getpage_count: 6,403.0000
  counters.pageserver_cpu_seconds_total: 0.8633
  perfmetric.batching_factor: 1.0000
test_throughput[release-pg16-50-pipelining_config1-30-1-128-not batchable {'max_batch_size': 1}]
  tablesize_mib: 50 MiB
  pipelining_enabled: 1
  effective_io_concurrency: 1
  readhead_buffer_size: 128
  pipelining_config.max_batch_size: 1
  counters.time: 0.9195
  counters.pageserver_getpage_count: 6,403.0000
  counters.pageserver_vectored_get_count: 6,403.0000
  counters.compute_getpage_count: 6,403.0000
  counters.pageserver_cpu_seconds_total: 0.8925
  perfmetric.batching_factor: 1.0000
test_throughput[release-pg16-50-pipelining_config2-30-1-128-not batchable {'max_batch_size': 32}]
  tablesize_mib: 50 MiB
  pipelining_enabled: 1
  effective_io_concurrency: 1
  readhead_buffer_size: 128
  pipelining_config.max_batch_size: 32
  counters.time: 0.8724
  counters.pageserver_getpage_count: 6,403.0000
  counters.pageserver_vectored_get_count: 6,403.0000
  counters.compute_getpage_count: 6,403.0000
  counters.pageserver_cpu_seconds_total: 0.8406
  perfmetric.batching_factor: 1.0000
test_throughput[release-pg16-50-None-30-100-128-batchable None]
  tablesize_mib: 50 MiB
  pipelining_enabled: 0
  effective_io_concurrency: 100
  readhead_buffer_size: 128
  counters.time: 0.2576
  counters.pageserver_getpage_count: 6,401.5259
  counters.pageserver_vectored_get_count: 307.8534
  counters.compute_getpage_count: 6,401.5259
  counters.pageserver_cpu_seconds_total: 0.3043
  perfmetric.batching_factor: 20.7941
test_throughput[release-pg16-50-pipelining_config4-30-100-128-batchable {'max_batch_size': 1}]
  tablesize_mib: 50 MiB
  pipelining_enabled: 1
  effective_io_concurrency: 100
  readhead_buffer_size: 128
  pipelining_config.max_batch_size: 1
  counters.time: 0.6187
  counters.pageserver_getpage_count: 6,403.0000
  counters.pageserver_vectored_get_count: 6,403.0000
  counters.compute_getpage_count: 6,403.0000
  counters.pageserver_cpu_seconds_total: 0.7473
  perfmetric.batching_factor: 1.0000
test_throughput[release-pg16-50-pipelining_config5-30-100-128-batchable {'max_batch_size': 2}]
  tablesize_mib: 50 MiB
  pipelining_enabled: 1
  effective_io_concurrency: 100
  readhead_buffer_size: 128
  pipelining_config.max_batch_size: 2
  counters.time: 0.4419
  counters.pageserver_getpage_count: 6,402.6418
  counters.pageserver_vectored_get_count: 3,207.7015
  counters.compute_getpage_count: 6,402.6418
  counters.pageserver_cpu_seconds_total: 0.5391
  perfmetric.batching_factor: 1.9960
test_throughput[release-pg16-50-pipelining_config6-30-100-128-batchable {'max_batch_size': 4}]
  tablesize_mib: 50 MiB
  pipelining_enabled: 1
  effective_io_concurrency: 100
  readhead_buffer_size: 128
  pipelining_config.max_batch_size: 4
  counters.time: 0.3569
  counters.pageserver_getpage_count: 6,402.1071
  counters.pageserver_vectored_get_count: 1,660.0952
  counters.compute_getpage_count: 6,402.1071
  counters.pageserver_cpu_seconds_total: 0.4244
  perfmetric.batching_factor: 3.8565
test_throughput[release-pg16-50-pipelining_config7-30-100-128-batchable {'max_batch_size': 8}]
  tablesize_mib: 50 MiB
  pipelining_enabled: 1
  effective_io_concurrency: 100
  readhead_buffer_size: 128
  pipelining_config.max_batch_size: 8
  counters.time: 0.2977
  counters.pageserver_getpage_count: 6,401.7700
  counters.pageserver_vectored_get_count: 886.6900
  counters.compute_getpage_count: 6,401.7700
  counters.pageserver_cpu_seconds_total: 0.3511
  perfmetric.batching_factor: 7.2199
test_throughput[release-pg16-50-pipelining_config8-30-100-128-batchable {'max_batch_size': 16}]
  tablesize_mib: 50 MiB
  pipelining_enabled: 1
  effective_io_concurrency: 100
  readhead_buffer_size: 128
  pipelining_config.max_batch_size: 16
  counters.time: 0.2697
  counters.pageserver_getpage_count: 6,401.5946
  counters.pageserver_vectored_get_count: 500.5766
  counters.compute_getpage_count: 6,401.5946
  counters.pageserver_cpu_seconds_total: 0.3195
  perfmetric.batching_factor: 12.7884
test_throughput[release-pg16-50-pipelining_config9-30-100-128-batchable {'max_batch_size': 32}]
  tablesize_mib: 50 MiB
  pipelining_enabled: 1
  effective_io_concurrency: 100
  readhead_buffer_size: 128
  pipelining_config.max_batch_size: 32
  counters.time: 0.2548
  counters.pageserver_getpage_count: 6,401.5128
  counters.pageserver_vectored_get_count: 307.7692
  counters.compute_getpage_count: 6,401.5128
  counters.pageserver_cpu_seconds_total: 0.3015
  perfmetric.batching_factor: 20.7997
test_latency[release-pg16-None-None]
  latency_mean: 0.127 ms
  latency_percentiles.p95: 0.166 ms
  latency_percentiles.p99: 0.187 ms
  latency_percentiles.p99.9: 0.292 ms
  latency_percentiles.p99.99: 0.624 ms
test_latency[release-pg16-pipelining_config1-{'max_batch_size': 1}]
  latency_mean: 0.139 ms
  latency_percentiles.p95: 0.175 ms
  latency_percentiles.p99: 0.200 ms
  latency_percentiles.p99.9: 0.444 ms
  latency_percentiles.p99.99: 0.658 ms
test_latency[release-pg16-pipelining_config2-{'max_batch_size': 32}]
  latency_mean: 0.119 ms
  latency_percentiles.p95: 0.155 ms
  latency_percentiles.p99: 0.172 ms
  latency_percentiles.p99.9: 0.267 ms
  latency_percentiles.p99.99: 0.587 ms	2024-11-28 20:24:01 +01:00
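The `perfmetric.batching_factor` values reported in commit `6bd39f95f5` appear consistent with dividing `pageserver_getpage_count` by `pageserver_vectored_get_count`. The commit message does not state that definition, so it is an assumption here; the sketch below only checks it against the reported numbers for the batchable runs. The names `REPORTED`, `batching_factor` and `max_deviation` are illustrative and not taken from the benchmark code.

```python
# Counters for the "batchable" runs, copied from the benchmark output of 6bd39f95f5.
# Key: max_batch_size (None = pipelining disabled).
# Value: (pageserver_getpage_count, pageserver_vectored_get_count, reported batching_factor)
REPORTED = {
    None: (6401.5259, 307.8534, 20.7941),
    1: (6403.0000, 6403.0000, 1.0000),
    2: (6402.6418, 3207.7015, 1.9960),
    4: (6402.1071, 1660.0952, 3.8565),
    8: (6401.7700, 886.6900, 7.2199),
    16: (6401.5946, 500.5766, 12.7884),
    32: (6401.5128, 307.7692, 20.7997),
}


def batching_factor(getpage_count, vectored_get_count):
    """Assumed definition: getpage requests served per vectored get call."""
    return getpage_count / vectored_get_count


def max_deviation(reported=REPORTED):
    """Largest gap between the assumed formula and the reported factor."""
    return max(
        abs(batching_factor(getpages, vectored) - factor)
        for getpages, vectored, factor in reported.values()
    )


if __name__ == "__main__":
    for size, (getpages, vectored, factor) in REPORTED.items():
        print(size, round(batching_factor(getpages, vectored), 4), factor)
```

Under this reading, the serial configuration (`None`) and `max_batch_size: 32` reach nearly the same factor, while `max_batch_size: 1` disables batching entirely.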
Christian Schwarz	07358dea89	converge on approach that pushes read Result through pipeline	2024-11-28 20:06:15 +01:00
Christian Schwarz	990e44dda4	longer target runtime	2024-11-22 14:37:01 +01:00
Christian Schwarz	5796f3ba57	fix test	2024-11-22 14:27:59 +01:00
Christian Schwarz	11dc7135b1	rename test file to test_page_service_batching	2024-11-22 13:19:12 +01:00
Christian Schwarz	39e45f9e51	improve tests	2024-11-22 12:27:38 +01:00
Christian Schwarz	c1e8347160	make configurable whether pipelining should use concurrent futures or tasks	2024-11-22 11:27:23 +01:00
Christian Schwarz	c1040bc25d	task-based mode	2024-11-22 09:36:45 +01:00
Christian Schwarz	89d9d16130	cherry-pick from problame/batching-benchmark while it's waiting for merge	2024-11-22 08:17:30 +01:00
Christian Schwarz	09e7485004	Merge branch 'problame/merge-getpage-test' into problame/batching-timer	2024-11-21 11:28:12 +01:00
Christian Schwarz	ff0aa152f1	Merge remote-tracking branch 'origin/main' into problame/batching-benchmark	2024-11-21 11:25:23 +01:00
Christian Schwarz	3375f28990	pytest.approx; https://github.com/neondatabase/neon/pull/9820#discussion_r1850679974	2024-11-21 11:21:50 +01:00
Christian Schwarz	e82deb2ccc	high-resolution CPU usage	2024-11-21 11:16:00 +01:00
Christian Schwarz	c68661dfb3	Revert "undo local modifications to benchmark" This reverts commit `7be13bc5a6`.	2024-11-20 19:53:06 +01:00
Christian Schwarz	21866faa8a	Revert "try async-timer 1.0.0-beta15 (still signal-based timers)" This reverts commit `c73e9e40e9`.	2024-11-20 19:37:51 +01:00
Christian Schwarz	c73e9e40e9	try async-timer 1.0.0-beta15 (still signal-based timers) Results unchanged to 0.7.4 https://www.notion.so/neondatabase/benchmarking-notes-143f189e004780c4a630cb5f426e39ba?pvs=4#144f189e004780e18416cc0faf2aca65	2024-11-20 18:32:53 +01:00
John Spray	5ff2f1ee7d	pageserver: enable compaction to proceed while live-migrating (#5397 ) ## Problem Long ago, in #5299 the tenant states for migration are added, but respected only in a coarse-grained way: when hinted not to do deletions, tenants will just avoid doing all GC or compaction. Skipping compaction is not necessary for AttachedMulti, as we will soon become the primary attached location, and it is not a waste of resources to proceed with compaction. Instead, per the RFC https://github.com/neondatabase/neon/pull/5029/files), deletions should be queued up in this state, and executed later when we switch to AttachedSingle. Avoiding compaction in AttachedMulti can have an operational impact if a tenant is under significant write load, as a long-running migration can result in a large accumulation of delta layers with commensurate impact on read latency. Closes: https://github.com/neondatabase/neon/issues/5396 ## Summary of changes - Add a 'config' part to RemoteTimelineClient so that it can be aware of the mode of the tenant it belongs to, and wire this through for construction + updates - Add a special buffer for delayed deletions, and when in AttachedMulti route deletions here instead of into the main remote client queue. This is drained when transitioning to AttachedSingle. If the tenant is detached or our process dies before then, then these objects are leaked. - As a quality of life improvement, also use the remote timeline client's knowledge of the tenant state to avoid submitting remote consistent LSN updates for validation when in AttachedStale (as we know these will fail)	2024-11-20 17:31:55 +00:00
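The delayed-deletion behaviour described in commit `5ff2f1ee7d` can be illustrated with a small model: deletions are held in a side buffer while the location is AttachedMulti and moved to the main queue on the transition to AttachedSingle. This is a Python sketch of the idea only, not the pageserver's Rust implementation; `Mode`, `DeletionRouter`, `schedule_deletion` and `set_mode` are hypothetical names.

```python
from enum import Enum, auto


class Mode(Enum):
    """Simplified stand-ins for the attachment modes named in the commit."""
    ATTACHED_SINGLE = auto()
    ATTACHED_MULTI = auto()


class DeletionRouter:
    """Hypothetical model of routing deletions by attachment mode."""

    def __init__(self, mode):
        self.mode = mode
        self.delayed = []  # buffer for deletions held back in AttachedMulti
        self.queue = []    # stands in for the main remote client queue

    def schedule_deletion(self, key):
        if self.mode is Mode.ATTACHED_MULTI:
            self.delayed.append(key)
        else:
            self.queue.append(key)

    def set_mode(self, mode):
        self.mode = mode
        if mode is Mode.ATTACHED_SINGLE:
            # Drain the buffer once this location is the only attached one.
            self.queue.extend(self.delayed)
            self.delayed.clear()


if __name__ == "__main__":
    router = DeletionRouter(Mode.ATTACHED_MULTI)
    router.schedule_deletion("layer-a")
    router.set_mode(Mode.ATTACHED_SINGLE)
    print(router.queue)
```

Because the buffer lives only in memory in this model, anything still in `delayed` when the object is dropped is never deleted, which mirrors the leak the commit message accepts for a detach or process death.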
John Spray	67f5f83edc	pageserver: avoid reading SLRU blocks for GC on shards >0 (#9423 ) ## Problem SLRU blocks, which can add up to several gigabytes, are currently ingested by all shards, multiplying their capacity cost by the shard count and slowing down ingest. We do this because all shards need the SLRU pages to do timestamp->LSN lookup for GC. Related: https://github.com/neondatabase/neon/issues/7512 ## Summary of changes - On non-zero shards, learn the GC offset from shard 0's index instead of calculating it. - Add a test `test_sharding_gc` that exercises this - Do GC in test_pg_regress as a general smoke test that GC functions run (e.g. this would fail if we were using SLRUs we didn't have) In this PR we are still ingesting SLRUs everywhere, but not using them any more. Part 2 PR (https://github.com/neondatabase/neon/pull/9786) makes the change to not store them at all.	2024-11-20 15:56:14 +00:00
Christian Schwarz	7be13bc5a6	undo local modifications to benchmark	2024-11-20 16:00:19 +01:00
John Spray	593e35027a	tests: use fewer pageservers in test_sharding_split_smoke (#9804 ) ## Problem This test uses a gratuitous number of pageservers (16). This works fine when there are plenty of system resources, but causes issues on test runners that have limited resources and run many tests concurrently. Related: https://github.com/neondatabase/neon/issues/9802 ## Summary of changes - Split from 2 shards to 4, instead of 4 to 8 - Don't give every shard a separate pageserver, let two locations share each pageserver. Net result is 4 pageservers instead of 16	2024-11-20 14:57:59 +00:00
Christian Schwarz	689788cbba	async-timer based approach (again, with data) Yep, it's clearly the best one with best batching factor at lowest CPU usage. https://www.notion.so/neondatabase/benchmarking-notes-143f189e004780c4a630cb5f426e39ba?pvs=4#144f189e004780d0a205e081458b46db	2024-11-20 15:36:10 +01:00
Christian Schwarz	f9bf038d2c	Revert "tokio_timerfd::Interval" This reverts commit `12124b28d0`.	2024-11-20 15:25:52 +01:00
Christian Schwarz	12124b28d0	tokio_timerfd::Interval Resolution not high enough to do _any_ batching at 10us or 20us https://www.notion.so/neondatabase/benchmarking-notes-143f189e004780c4a630cb5f426e39ba?pvs=4#144f189e0047800fb74bd8f4ab6cf8e2	2024-11-20 15:25:14 +01:00
Christian Schwarz	1d85bec0ea	Revert "tokio::time::Interval based approach" This reverts commit `81d99704ee`.	2024-11-20 15:13:26 +01:00
Christian Schwarz	81d99704ee	tokio::time::Interval based approach batching at 10us doesn't work well enough, prob the future is ready too soon. batching factor is just 1.5 https://www.notion.so/neondatabase/benchmarking-notes-143f189e004780c4a630cb5f426e39ba?pvs=4#144f189e004780b79c8dd6d007dbb120	2024-11-20 15:13:11 +01:00
Christian Schwarz	b299eb19e2	fixup whitespace stuff	2024-11-20 14:23:55 +01:00
Christian Schwarz	88d52b31b7	Merge branch 'problame/batching-benchmark' into problame/merge-getpage-test	2024-11-20 14:23:22 +01:00
Christian Schwarz	b695907752	page_service: add benchmark for batching This PR adds a benchmark to demonstrate the effect of server-side getpage request batching added in https://github.com/neondatabase/neon/pull/9321. Refs: - Epic: https://github.com/neondatabase/neon/issues/9376 - Extracted from https://github.com/neondatabase/neon/pull/9792	2024-11-20 14:18:42 +01:00
Christian Schwarz	75041cb61b	bench fixups	2024-11-20 13:59:21 +01:00
Christian Schwarz	e80ce970f7	collect CPU utilization	2024-11-20 13:55:05 +01:00
Christian Schwarz	f2de5b504f	make it a proper benchmark	2024-11-20 13:52:05 +01:00
Alexander Bayandin	899933e159	scan_log_for_errors: check that regex is correct (#9815 ) ## Problem I've noticed that we have 2 flaky tests which failed with error: ``` re.error: missing ), unterminated subpattern at position 21 ``` - `test_timeline_archival_chaos` — has been already fixed - `test_sharded_tad_interleaved_after_partial_success` — I didn't manage to find the incorrect regex [Internal link](https://neonprod.grafana.net/goto/yfmVHV7NR?orgId=1) ## Summary of changes - Wrap `re.match` in `try..except` block and print incorrect regex	2024-11-20 12:48:21 +00:00
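The change described in commit `899933e159` (wrap `re.match` in `try..except` and print the offending regex) might look roughly like the following. This is a sketch, not the actual `scan_log_for_errors` code: the function name `line_matches_any` is hypothetical, and re-raising after printing is an assumption.

```python
import re


def line_matches_any(line, patterns):
    """Return True if `line` matches any regex in `patterns`.

    An invalid pattern is printed before the error propagates, so a
    failure points at the regex instead of only at the match call.
    """
    for pattern in patterns:
        try:
            if re.match(pattern, line):
                return True
        except re.error:
            print(f"invalid regex: {pattern!r}")
            raise
    return False


if __name__ == "__main__":
    print(line_matches_any("ERROR: boom", [r"ERROR: .*"]))
```

Printing the pattern matters because `re.error` messages such as "missing ), unterminated subpattern" report only a position, not which pattern in the list was malformed.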
Christian Schwarz	5cc0059088	parametrize more test	2024-11-20 12:48:47 +01:00
John Spray	33dce25af8	safekeeper: block deletion on protocol handler shutdown (#9364 ) ## Problem Two recently observed log errors indicate safekeeper tasks for a timeline running after that timeline's deletion has started. - https://github.com/neondatabase/neon/issues/8972 - https://github.com/neondatabase/neon/issues/8974 These code paths do not have a mechanism that coordinates task shutdown with the overall shutdown of the timeline. ## Summary of changes - Add a `Gate` to `Timeline` - Take the gate as part of resident timeline guard: any code that holds a guard over a timeline staying resident should also hold a guard over the timeline's total lifetime. - Take the gate from the wal removal task - Respect Timeline::cancel in WAL send/recv code, so that we do not block shutdown indefinitely. - Add a test that deletes timelines with open pageserver+compute connections, to check these get torn down as expected. There is some risk to introducing gates: if there is code holding a gate which does not properly respect a cancellation token, it can cause shutdown hangs. The risk of this for safekeepers is lower in practice than it is for other services, because in a healthy timeline deletion, the compute is shutdown first, then the timeline is deleted on the pageserver, and finally it is deleted on the safekeepers -- that makes it much less likely that some protocol handler will still be running. Closes: #8972 Closes: #8974	2024-11-20 11:07:45 +00:00
Konstantin Knizhnik	770ac34ae6	Register custom xlog reader callbacks for on-demand WAL download in StartupDecodingContext (#9007 ) ## Problem See https://github.com/neondatabase/neon/issues/8931 On-demand WAL download are not set in all cases where WAL is accessed by logical replication ## Summary of changes Set customer xlog reader handles in StartupDecodingContext Related changes in Postgres modules: https://github.com/neondatabase/postgres/pull/495 https://github.com/neondatabase/postgres/pull/496 https://github.com/neondatabase/postgres/pull/497 https://github.com/neondatabase/postgres/pull/498 --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-11-19 22:29:57 +02:00
Alex Chi Z.	b22a84a7bf	feat(pageserver): support key range for manual compaction trigger (#9723 ) part of https://github.com/neondatabase/neon/issues/9114, we want to be able to run partial gc-compaction in tests. In the future, we can also expand this functionality to legacy compaction, so that we can trigger compaction for a specific key range. ## Summary of changes * Support passing compaction key range through pageserver routes. * Refactor input parameters of compact related function to take the new `CompactOptions`. * Add tests for partial compaction. Note that the test may or may not trigger compaction based on GC horizon. We need to improve the test case to ensure things always get below the gc_horizon and the gc-compaction can be triggered. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-11-19 19:38:41 +00:00
Alex Chi Z.	5e3fbef721	fix(pageserver): queue stopped error should be ignored during create timeline (#9767 ) close https://github.com/neondatabase/neon/issues/9730 The test case tests if anything goes wrong during pageserver restart + during timeline creation not complete. Therefore, queue is stopped error is normal in this case, except that it should be categorized as a shutdown error instead of a real error. ## Summary of changes * More comments for the test case. * Queue stopped error will now be forwarded as CreateTimelineError::ShuttingDown. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-11-19 14:10:09 -05:00
Christian Schwarz	61ff84a3a2	compiles	2024-11-19 18:40:56 +01:00
Peter Bendel	982cb1c15d	Move logic for ingest benchmark from GitHub workflow into python testcase (#9762 ) ## Problem The first version of the ingest benchmark had some parsing and reporting logic in shell script inside GitHub workflow. it is better to move that logic into a python testcase so that we can also run it locally. ## Summary of changes - Create new python testcase - invoke pgcopydb inside python test case - move the following logic into python testcase - determine backpressure - invoke pgcopydb and report its progress - parse pgcopydb log and extract metrics - insert metrics into perf test database - add additional column to perf test database that can receive endpoint ID used for pgcopydb run to have it available in grafana dashboard when retrieving other metrics for an endpoint ## Example run https://github.com/neondatabase/neon/actions/runs/11860622170/job/33056264386	2024-11-19 09:46:46 +00:00
Christian Schwarz	15e21c714b	got it working and turn it more into a benchmark	2024-11-18 23:57:14 +01:00
Christian Schwarz	0689965282	WIP: page_service: add basic testcase for merging The steps in the test work in neon_local + psql but for some reason they don't work in the test. Asked compute team on Slack for help: https://neondb.slack.com/archives/C04DGM6SMTM/p1731952688386789	2024-11-18 23:57:14 +01:00
Alex Chi Z.	e5c89f3da3	feat(pageserver): drop disposable keys during gc-compaction (#9765 ) close https://github.com/neondatabase/neon/issues/9552, close https://github.com/neondatabase/neon/issues/8920, part of https://github.com/neondatabase/neon/issues/9114 ## Summary of changes * Drop keys not belonging to this shard during gc-compaction to avoid constructing history that might have been truncated during shard compaction. * Run gc-compaction at the end of shard compaction test. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-11-18 19:27:52 +00:00
Alexander Bayandin	44f33b2bd6	Bump default Postgres version for tests to v17 (#9777 ) ## Problem Tests that are marked with `run_only_on_default_postgres` do not run on debug builds on CI because we run debug builds only for the latest Postgres version (which is 17) ## Summary of changes - Bump `PgVersion.DEFAULT` to `v17` - Skip `test_timeline_archival_chaos` in debug builds	2024-11-18 15:06:24 +00:00
John Spray	3f401a328f	tests: mitigate bug to stabilize test_storage_controller_many_tenants (#9771 ) ## Problem Due to #9471 , the scale test occasionally gets 404s while trying to modify the config of a timeline that belongs to a tenant being migrated. We rarely see this narrow race in the field, but the test is quite good at reproducing it. ## Summary of changes - Ignore 404 errors in this test.	2024-11-18 11:33:27 +00:00
Arpad Müller	7880c246f1	Correct mistakes in offloaded timeline retain_lsn management (#9760 ) PR #9308 has modified tenant activation code to take offloaded child timelines into account for populating the list of `retain_lsn` values. However, there are more places than just tenant activation where one needs to update the `retain_lsn`s. This PR fixes some bugs of the current code that could lead to corruption in the worst case: 1. Deleting of an offloaded timeline would not get its `retain_lsn` purged from its parent. With the patch we now do it, but as the parent can be offloaded as well, the situation is a bit trickier than for non-offloaded timelines which can just keep a pointer to their parent. Here we can't keep a pointer because the parent might get offloaded, then unoffloaded again, creating a dangling pointer situation. Keeping a pointer to the tenant is not good either, because we might drop the offloaded timeline in a context where a `offloaded_timelines` lock is already held: so we don't want to acquire a lock in the drop code of OffloadedTimeline. 2. Unoffloading a timeline would not get its `retain_lsn` values populated, leading to it maybe garbage collecting values that its children might need. We now call `initialize_gc_info` on the parent. 3. Offloading of a timeline would not get its `retain_lsn` values registered as offloaded at the parent. So if we drop the `Timeline` object, and its registration is removed, the parent would not have any of the child's `retain_lsn`s around. Also, before, the `Timeline` object would delete anything related to its timeline ID, now it only deletes `retain_lsn`s that have `MaybeOffloaded::No` set. Incorporates Chi's reproducer from #9753. cc https://github.com/neondatabase/cloud/issues/20199 The `test_timeline_retain_lsn` test is extended: 1. it gains a new dimension, duplicating each mode, to either have the "main" branch be the direct parent of the timeline we archive, or the "test_archived_parent" branch intermediary, creating a three timeline structure. This doesn't test anything fixed by this PR in particular, just explores the vast space of possible configurations a little bit more. 2. it gains two new modes, `offload-parent`, which tests the second point, and `offload-no-restart` which tests the third point. It's easy to verify the test actually is "sharp" by removing one of the respective `self.initialize_gc_info()`, `gc_info.insert_child()` or `ancestor_children.push()`. Part of #8088 --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Alex Chi Z <chi@neon.tech>	2024-11-15 14:22:29 +01:00
John Spray	04938d9d55	tests: tolerate pageserver 500s in test_timeline_archival_chaos (#9769 ) ## Problem Test exposes cases where pageserver gives 500 responses, causing failures like https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9766/11844529470/index.html#suites/d1acc79950edeb0563fc86236c620898/3546be2ffed99ba6 ## Summary of changes - Tolerate such messages, and link an issue for cleaning up the pageserver not to return such 500s.	2024-11-15 13:22:05 +00:00
John Spray	93939f123f	tests: add test_timeline_archival_chaos (#9609 ) ## Problem - We lack test coverage of cases where multiple timelines fight for updates to the same manifest (https://github.com/neondatabase/neon/pull/9557), and in timeline archival changes while dual-attached (https://github.com/neondatabase/neon/pull/9555) ## Summary of changes - Add a chaos test for timeline creation->archival->offload->deletion	2024-11-14 17:31:35 +00:00
Tristan Partin	49b599c113	Remove the replication slot in test_snap_files at the end of the test Analysis of the LR benchmarking tests indicates that in the duration of test_subscriber_lag, a leftover 'slotter' replication slot can lead to retained WAL growing on the publisher. This replication slot is not used by any subscriber. The only purpose of the slot is to generate snapshot files for the purpose of test_snap_files. Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-11-14 10:59:15 -06:00

1751 Commits