rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-08 05:52:55 +00:00

Author	SHA1	Message	Date
Peter Bendel	af12647b9d	large tenant oltp benchmark: reindex with downtime (remove concurrently) (#11498 ) ## Problem Our large OLTP benchmark runs very long, and we want to reduce the duration of the reindex step. We don't run a concurrent workload anyhow and added "concurrently" only to have a "prod-like" approach. But if it just doubles the time we report, because it requires two full table scans instead of one, we can remove it. ## Summary of changes Remove the keyword `concurrently` from the reindex step.	2025-04-09 17:11:00 +00:00
Vlad Lazar	66f80e77ba	tests/performance: reconcile until idle before benchmark (#11435 ) We'd like to run benchmarks starting from a steady state. To this end, do a reconciliation round before proceeding with the benchmark. This is useful for benchmarks that use tenant dir snapshots since a non-standard tenant configuration is used to generate the snapshot. The storage controller is not aware of the non default tenant configuration and will reconcile while the bench is running.	2025-04-09 16:32:19 +00:00
Tristan Partin	e581b670f4	Improve nightly physical replication benchmark (#11389 ) Log the created project and endpoint IDs and improve typing in the source code to improve readability. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-04-03 23:00:58 +00:00
Vlad Lazar	9db63fea7a	pageserver: optionally export perf traces in OTEL format (#11140 ) Based on https://github.com/neondatabase/neon/pull/11139 ## Problem We want to export performance traces from the pageserver in OTEL format. End goal is to see them in Grafana. ## Summary of changes https://github.com/neondatabase/neon/pull/11139 introduces the infrastructure required to run the otel collector alongside the pageserver. ### Design Requirements: 1. We'd like to avoid implementing our own performance tracing stack and use the `tracing` crate if possible. 2. Ideally, we'd like zero overhead at a sampling rate of zero and to be able to change the tracing config for a tenant on the fly. 3. We should leave the current span hierarchy intact. This includes adding perf traces without modifying existing tracing. To satisfy (3) (and (2) in part) a separate span hierarchy is used. `RequestContext` gains an optional `perf_span` member that's only set when the request was chosen by sampling. All perf span related methods added to `RequestContext` are no-ops for requests that are not sampled. This on its own is not enough for (3), so performance spans use a separate tracing subscriber. The `tracing` crate doesn't have great support for this, so there's a fair amount of boilerplate to override the subscriber at all points of the perf span lifecycle. ### Perf Impact [Periodic pagebench](https://neonprod.grafana.net/d/ddqtbfykfqfi8d/e904990?orgId=1&from=2025-02-08T14:15:59.362Z&to=2025-03-10T14:15:59.362Z&timezone=utc) shows no statistically significant regression with a sample ratio of 0. There's an annotation on the dashboard on 2025-03-06. ### Overview of changes: 1. Clean up the `RequestContext` API a bit. Namely, get rid of the `RequestContext::extend` API and use the builder instead. 2. Add pageserver level configs for tracing: sampling ratio, otel endpoint, etc. 3. Introduce some perf span tracking utilities and expose them via `RequestContext`. 
We add a `tracing::Span` wrapper to be used for perf spans and a `tracing::Instrumented` equivalent for it. See doc comments for reason. 4. Set up OTEL tracing infra according to configuration. A separate runtime is used for the collector. 5. Add perf traces to the read path. ## Refs - epic https://github.com/neondatabase/neon/issues/9873 --------- Co-authored-by: Christian Schwarz <christian@neon.tech>	2025-04-03 17:56:51 +00:00
Erik Grinaker	17193d6a33	test_runner: fix pagebench tenant configs (#11420 ) ## Problem Pagebench creates a bunch of tenants by first creating a template tenant and copying its remote storage, then attaching the copies to the Pageserver. These tenants had custom configurations to disable GC and compaction. However, these configs were only picked up by the Pageserver on attach, and not registered with the storage controller. This caused the storage controller to replace the tenant configs with the default tenant config, re-enabling GC and compaction which interferes with benchmark performance. Resolves #11381. ## Summary of changes Register the copied tenants with the storage controller, instead of directly attaching them to the Pageserver.	2025-04-02 20:11:39 +00:00
Peter Bendel	4bc6dbdd5f	use a prod-like shared_buffers size for some perf unit tests (#11373 ) ## Problem In Neon DBaaS we adjust shared_buffers to the size of the compute. More precisely, we adjust the maximum number of connections to the compute size, and we adjust the shared_buffers size to the maximum number of connections, roughly according to the following sizes: `2 CU: 225mb; 4 CU: 450mb; 8 CU: 900mb` [see](`877e33b428/goapp/controlplane/internal/pkg/compute/computespec/pg_settings.go (L405)`) ## Summary of changes We should run perf unit tests with settings that are realistic for a paying customer, so we select 8 CU as the reference for those tests.	2025-04-02 10:43:05 +00:00
Alexander Bayandin	30a7dd630c	ruff: enable TC — flake8-type-checking (#11368 ) ## Problem `TYPE_CHECKING` is used inconsistently across Python tests. ## Summary of changes - Update `ruff`: 0.7.0 -> 0.11.2 - Enable TC (flake8-type-checking): https://docs.astral.sh/ruff/rules/#flake8-type-checking-tc - (auto)fix all new issues	2025-03-30 18:58:33 +00:00
Vlad Lazar	9fc7c22cc9	storcon: add use_local_compute_notifications flag (#11333 ) ## Problem While working on bulk import, I want to use the `control-plane-url` flag for a different request. Currently, the local compute hook is used whenever no control plane is specified in the config. My test requires local compute notifications and a configured `control-plane-url` which isn't supported. ## Summary of changes Add a `use-local-compute-notifications` flag. When this is set, we use the local flow regardless of other config values. It's enabled by default in neon_local and disabled by default in all other envs. I had to turn the flag off in tests that wish to bypass the local flow, but that's expected. --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2025-03-21 15:31:06 +00:00
Erik Grinaker	65d690b21d	storcon: add repeated auto-splits and initial splits (#11122 ) ## Problem Currently, we only split tenants into 8 shards once, at the 64 GB split threshold. For very large tenants, we need to keep splitting to avoid huge shards. And we also want to eagerly split at a lower threshold to improve throughput during initial ingestion. See https://github.com/neondatabase/cloud/issues/22532#issuecomment-2706215907 for details. Touches https://github.com/neondatabase/cloud/issues/22532. Requires #11157. ## Summary of changes This adds parameters and logic to enable repeated splits when a tenant's largest timeline divided by shard count exceeds `split_threshold`, as well as eager initial splits at a lower threshold to speed up initial ingestion. The default parameters are all set such that they retain the current behavior in production (only split into 8 shards once, at 64 GB). * `split_threshold` now specifies a maximum shard size. When a shard exceeds it, all tenant shards are split by powers of 2 such that all tenant shards fall below `split_threshold`. Disabled by default, like today. * Add `max_split_shards` to specify a max shard count for autosplits. Defaults to 8 to retain current behavior. * Add `initial_split_threshold` and `initial_split_shards` to specify a threshold and target count for eager splits of unsharded tenants. Defaults to 64 GB and 8 shards to retain current production behavior. Because this PR sets `initial_split_threshold` to 64 GB by default, it has the effect of enabling autosplits by default. This was not the case previously, since `split_threshold` defaults to None, but it is already enabled across production and staging. This is temporary until we complete the production rollout. For more details, see code comments. This must wait until #11157 has been deployed to Pageservers. 
Once this has been deployed to production, we plan to change the parameters to: * `split-threshold`: 256 GB * `initial-split-threshold`: 16 GB * `initial-split-shards`: 4 * `max-split-shards`: 16 The final split points will thus be: * Start: 1 shard * 16 GB: 4 shards * 1 TB: 8 shards * 2 TB: 16 shards We will then change the default settings to be disabled by default. --------- Co-authored-by: John Spray <john@neon.tech>	2025-03-20 15:43:57 +00:00
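The planned split points in the commit above can be reproduced with a short sketch. This is only an illustration in Python, not the storage controller's logic: the function name `target_shard_count`, the use of GB as the unit, and the strict `>` comparisons are assumptions, while the parameter names and planned values come from the commit message.

```python
from typing import Optional


def target_shard_count(
    size_gb: float,
    current_shards: int,
    split_threshold: Optional[float] = 256,         # planned `split-threshold`, in GB
    max_split_shards: int = 16,                     # planned `max-split-shards`
    initial_split_threshold: Optional[float] = 16,  # planned `initial-split-threshold`, in GB
    initial_split_shards: int = 4,                  # planned `initial-split-shards`
) -> int:
    """Hypothetical helper: shard count a tenant would end up with."""
    shards = current_shards

    # Eager initial split: only applies to unsharded tenants.
    if (
        shards == 1
        and initial_split_threshold is not None
        and size_gb > initial_split_threshold
    ):
        shards = initial_split_shards

    # Repeated splits: keep doubling while the per-shard size exceeds the
    # threshold and the shard count cap allows another doubling.
    if split_threshold is not None:
        while size_gb / shards > split_threshold and shards * 2 <= max_split_shards:
            shards *= 2

    return shards


if __name__ == "__main__":
    for size_gb, shards in [(10, 1), (17, 1), (1025, 4), (2049, 8)]:
        print(size_gb, "GB,", shards, "shard(s) ->", target_shard_count(size_gb, shards))
```

With the planned parameters, a tenant just above 16 GB goes to 4 shards, just above 1 TB to 8, and just above 2 TB to 16, which matches the split points listed in the commit message.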
Alexey Kondratov	518269ea6a	feat(compute): Add perf test for compute startup time breakdown (#11198 ) ## Problem We had a recent Postgres startup latency (`start_postgres_ms`) degradation, but it was only caught with SLO alerts. There was actually an existing test for the same purpose -- `start_postgres_ms`, but it's doing only two starts, so it's a bit noisy. ## Summary of changes Add new compute startup latency test that does 100 iterations and reports p50, p90 and p99 latencies. Part of https://github.com/neondatabase/cloud/issues/24882	2025-03-19 16:11:33 +00:00
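Reporting p50, p90 and p99 over many iterations, as the commit above describes, could look roughly like the following. This is a minimal sketch assuming a nearest-rank percentile; the helper names `percentile` and `summarize` are hypothetical and the test's actual aggregation may differ.

```python
from typing import Dict, Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100 avoids floating-point rounding surprises.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(samples: Sequence[float]) -> Dict[str, float]:
    """Return the three latency percentiles named in the commit message."""
    return {f"p{p}": percentile(samples, p) for p in (50, 90, 99)}


if __name__ == "__main__":
    # 100 synthetic "startup latencies" in milliseconds.
    print(summarize(list(range(1, 101))))
```

With only two samples, p50 and p99 collapse onto the two measurements, which is why a two-start test is noisy compared with 100 iterations.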
Peter Bendel	228bb75354	Extend large tenant OLTP workload ... (#11166 ) ... to better match the workload characteristics of real Neon customers ## Problem We analyzed workloads of large Neon users and want to extend the oltp workload to include characteristics seen in those workloads. ## Summary of changes - for re-use branch delete inserted rows from last run - adjust expected run-time (time-outs) in GitHub workflow - add queries that exposes the prefetch getpages path - add I/U/D transactions for another table (so far the workload was insert/append-only) - add an explicit vacuum analyze step and measure its time - add reindex concurrently step and measure its time (and take care that this step succeeds even if prior reindex runs have failed or were canceled) - create a second connection string for the pooled connection that removes the `-pooler` suffix from the hostname because we want to run long-running statements (database maintenance) and bypass the pooler which doesn't support unlimited statement timeout ## Test run https://github.com/neondatabase/neon/actions/runs/13851772887/job/38760172415	2025-03-16 14:04:48 +00:00
Arpad Müller	b1a1be6a4c	switch pytests and neon_local to control_plane_hooks_api (#11195 ) We want to switch away from and deprecate the `--compute-hook-url` param for the storcon in favour of `--control-plane-url` because it allows us to construct urls with `notify-safekeepers`. This PR switches the pytests and neon_local from a `control_plane_compute_hook_api` to a new param named `control_plane_hooks_api` which is supposed to point to the parent of the `notify-attach` URL. We still support reading the old url from disk to not be too disruptive with existing deployments, but we just ignore it. Also add docs for the `notify-safekeepers` upcall API. Follow-up of #11173 Part of https://github.com/neondatabase/neon/issues/11163	2025-03-13 19:50:52 +00:00
devin-ai-integration[bot]	efb1df4362	fix: Change metric_unit from 'microseconds' to 'μs' in test_compute_ctl_api.py (#11209 ) # Fix metric_unit length in test_compute_ctl_api.py ## Description This PR changes the metric_unit from "microseconds" to "μs" in test_compute_ctl_api.py to fix the issue where perf test results were not being stored in the database due to the string exceeding the 10 character limit of the metric_unit column in the perf_test_results table. ## Problem As reported in Slack, the perf test results were not being uploaded to the database because the "microseconds" string (12 characters) exceeds the 10 character limit of the metric_unit column in the perf_test_results table. ## Solution Replace "microseconds" with "μs" in all metric_unit parameters in the test_compute_ctl_api.py file. ## Testing The changes have been committed and pushed. The PR is ready for review. Link to Devin run: https://app.devin.ai/sessions/e29edd672bd34114b059915820e8a853 Requested by: Peter Bendel Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com> Co-authored-by: peterbendel@neon.tech <peterbendel@neon.tech>	2025-03-13 10:17:01 +00:00
Tristan Partin	5eed0e4b94	Add docs to performance/test_logical_replication.py on how to run the suite (#10175 ) These docs are in tandem with what was recently published on the internal docs site. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-03-12 17:31:09 +00:00
Alex Chi Z.	3451bdd3d2	fix(test): force L0 compaction before gc-compaction (#11143 ) ## Problem Fix test flakiness of `test_gc_feedback` Closes: https://github.com/neondatabase/neon/issues/11153 ## Summary of changes Looking at the log, gc-compaction is interrupted by L0 compaction. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-03-10 20:03:49 +00:00
John Spray	87e6117dfd	storage controller: API-driven graceful migrations (#10913 ) ## Problem The current migration API does a live migration, but if the destination doesn't already have a secondary, that live migration is unlikely to be able to warm up a tenant properly within its timeout (full warmup of a big tenant can take tens of minutes). Background optimisation code knows how to do this gracefully by creating a secondary first, but we don't currently give a human a way to trigger that. Closes: https://github.com/neondatabase/neon/issues/10540 ## Summary of changes - Add `preferred_node` parameter to TenantShard, which is respected by optimize_attachment - Modify migration API to have optional prewarm=true mode, in which we set preferred_node and call optimize_attachment, rather than directly modifying intentstate - Require override_scheduler=true flag if migrating somewhere that is a less-than-optimal scheduling location (e.g. wrong AZ) - Add `origin_node_id` to migration API so that callers can ensure they're moving from where they think they're moving from - Add tests for the above The storcon_cli wrapper for this has a 'watch' mode that waits for eventual cutover. This doesn't show how the warmth of the secondary evolves because we don't currently have an API for that in the controller, as the passthrough API only targets attached locations, not secondaries. It would be straightforward to add later as a dedicated endpoint for getting secondary status, then extend the storcon_cli to consume that and print a nice progress indicator.	2025-03-07 17:02:38 +00:00
Alexey Kondratov	f5aa8c3eac	feat(compute_ctl): Add a basic HTTP API benchmark (#11123 ) ## Problem We just had a regression reported at https://neondb.slack.com/archives/C08EXUJF554/p1741102467515599, which clearly came with one of the releases. It's not a huge problem yet, but it's annoying that we cannot quickly attribute it to a specific commit. ## Summary of changes Add a very simple `compute_ctl` HTTP API benchmark that does 10k requests to `/status` and `metrics.json` and reports p50 and p99. --------- Co-authored-by: Peter Bendel <peterbendel@neon.tech>	2025-03-07 12:35:42 +00:00
Alex Chi Z.	6d0976dad5	feat(pageserver): persist reldir v2 migration status (#10980 ) ## Problem part of https://github.com/neondatabase/neon/issues/9516 ## Summary of changes Similar to the aux v2 migration, we persist the relv2 migration status into index_part, so that even the config item is set to false, we will still read from the v2 storage to avoid loss of data. Note that only the two variants `None` and `Some(RelSizeMigration::Migrating)` are used for now. We don't have full migration implemented so it will never be set to `RelSizeMigration::Migrated`. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-03-03 21:05:43 +00:00
Peter Bendel	a07599949f	First version of a new benchmark to test larger OLTP workload (#11053 ) ## Problem We want to support larger tenants (regarding logical database size, number of transactions per second etc.) and should increase our test coverage of OLTP transactions at larger scale. ## Summary of changes Start a new benchmark that over time will add more OLTP tests at larger scale. This PR covers the first version and will be extended in further PRs. Also fix some infrastructure: - default for new connections and large tenants is to use connection pooler pgbouncer, however our fixture always added `statement_timeout=120` which is not compatible with pooler [see](https://neon.tech/docs/connect/connection-errors#unsupported-startup-parameter) - action to create branch timed out after 10 seconds and 10 retries but for large tenants it can take longer so use increasing back-off for retries ## Test run https://github.com/neondatabase/neon/actions/runs/13593446706	2025-03-03 15:25:48 +00:00
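The "increasing back-off for retries" mentioned in the commit above is a general pattern. The sketch below illustrates it in Python under stated assumptions: the function name, the doubling factor and the delay cap are hypothetical, and this is not the code of the branch-creation action itself.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    action: Callable[[], T],
    max_attempts: int = 10,
    initial_delay_s: float = 1.0,
    factor: float = 2.0,
    max_delay_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action` until it succeeds, waiting longer after each failure."""
    delay = initial_delay_s
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(delay)
            delay = min(delay * factor, max_delay_s)
    raise AssertionError("unreachable")


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky() -> str:
        # Fails twice, then succeeds, to show the growing delays.
        calls["n"] += 1
        if calls["n"] < 3:
            raise RuntimeError("branch not ready yet")
        return "ok"

    print(retry_with_backoff(flaky, sleep=lambda s: print(f"sleeping {s}s")))
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting.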
Peter Bendel	3a3d62dc4f	Bodobolero/test cum stats persistence (#10995 ) ## Problem So far cumulative statistics have not been persisted when Neon scales to zero (suspends endpoint). With PR https://github.com/neondatabase/neon/pull/6560 the cumulative statistics should now survive endpoint restarts and correctly trigger the auto-vacuum and auto-analyze maintenance. So far we did not have a test case that validates that improvement in our dev cloud environment with a real project. ## Summary of changes Introduce test case `test_cumulative_statistics_persistence` in the benchmarking workflow running daily: - Verifies that the cumulative statistics are correctly persisted across restarts. - Cumulative statistics are important to persist across restarts because they are used when auto-vacuum and auto-analyze trigger conditions are met. - The test performs the following steps: - seed a new project using pgbench - insert tuples that by themselves are not enough to trigger auto-vacuum - suspend the endpoint - resume the endpoint - insert additional tuples that by themselves are not enough to trigger auto-vacuum but in combination with the previous tuples are - verify that autovacuum is triggered by the combination of tuples inserted before and after endpoint suspension ## Test run https://github.com/neondatabase/neon/actions/runs/13546879714/job/37860609089#step:6:282	2025-02-27 10:45:13 +00:00
Alex Chi Z.	b7fcf2c7a7	test(pageserver): add reldir v2 into tests (#10750 ) ## Problem We have `test_perf_many_relations` but it only runs on remote clusters, and we cannot directly modify tenant config. Therefore, I patched one of the current tests to benchmark relv2 performance. close https://github.com/neondatabase/neon/issues/9986 ## Summary of changes * Add `v1/v2` selector to `test_tx_abort_with_many_relations`. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-02-25 14:50:22 +00:00
Tristan Partin	da9c101939	Implement a second HTTP server within compute_ctl (#10574 ) The compute_ctl HTTP server has the following purposes: - Allow management via the control plane - Provide an endpoint for scraping metrics - Provide APIs for compute internal clients - Neon Postgres extension for installing remote extensions - local_proxy for installing extensions and adding grants The first two purposes require the HTTP server to be available outside the compute. The Neon threat model is a bad actor within our internal network. We need to reduce the surface area of attack. By exposing unnecessary unauthenticated HTTP endpoints to the internal network, we increase the surface area of attack. For endpoints described in the third bullet point, we can just run an extra HTTP server, which is only bound to the loopback interface since all consumers of those endpoints are within the compute.	2025-02-11 18:02:22 +00:00
John Spray	95220ba43e	tests: fix flaky endpoint in test_ingest_logical_message (#10700 ) ## Problem Endpoint kept running while timeline was deleted, causing forbidden warnings on the pageserver when the tenant is not found. ## Summary of changes - Explicitly stop the endpoint before the end of the test, so that it isn't trying to talk to the pageserver in the background while things are torn down	2025-02-07 14:51:36 +00:00
Peter Bendel	839f41f5bb	fix pgcopydb seg fault and -c idle_in_transaction_session_timeout=0 (#10692 ) ## Problem During ingest_benchmark, which uses `pgcopydb` ([see](https://github.com/dimitri/pgcopydb)), we sometimes had outages. - when the PostgreSQL COPY step failed we got a segfault (reported [here](https://github.com/dimitri/pgcopydb/issues/899)) - the root cause was that Neon's idle_in_transaction_session_timeout is set to 5 minutes, which is suboptimal for long-running tasks like project import (reported [here](https://github.com/dimitri/pgcopydb/issues/900)) ## Summary of changes Patch pgcopydb to avoid the segfault. Override idle_in_transaction_session_timeout and set it to "unlimited".	2025-02-06 14:39:45 +00:00
Arseny Sher	fba22a7123	Record more timings in test_layer_map (#10670 ) ## Problem It is not very clear how much time different operations take. ## Summary of changes Record more timings. ref https://github.com/neondatabase/neon/issues/10409	2025-02-05 17:00:26 +00:00
Anastasia Lubennikova	b1bc33eb4d	Fix logical_replication_sync test fixture (#10531 ) Fixes flaky test_lr_with_slow_safekeeper test #10242 Fix query to `pg_catalog.pg_stat_subscription` catalog to handle table synchronization and parallel LR correctly.	2025-02-03 12:44:47 +00:00
Erik Grinaker	afbcebe7f7	test_runner: force-compact in `test_sharding_autosplit` (#10605 ) ## Problem This test may not fully detect data corruption during splits, since we don't force-compact the entire keyspace. ## Summary of changes Force-compact all data in `test_sharding_autosplit`.	2025-01-31 11:31:58 +00:00
Alex Chi Z.	5bcefb4ee1	fix(pageserver): compaction perftest wrt upper limit (#10564 ) ## Problem The config is added in https://github.com/neondatabase/neon/pull/10550 causing behavior change for l0 compaction. close https://github.com/neondatabase/neon/issues/10562 ## Summary of changes Fix the test case to consider the effect of upper_limit. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-01-29 18:43:39 +00:00
Arseny Sher	9ab13d6e2c	Log statements in test_layer_map (#10554 ) ## Problem test_layer_map doesn't log statements and it is not clear how long they take. ## Summary of changes Do log them. ref https://github.com/neondatabase/neon/issues/10409	2025-01-29 09:16:00 +00:00
Erik Grinaker	ddb9ae1214	pageserver: add compaction backpressure for layer flushes (#10405 ) ## Problem There is no direct backpressure for compaction and L0 read amplification. This allows a large buildup of compaction debt and read amplification. Resolves #5415. Requires #10402. ## Summary of changes Delay layer flushes based on the number of level 0 delta layers: * `l0_flush_delay_threshold`: delay flushes such that they take 2x as long (default `2 * compaction_threshold`). * `l0_flush_stall_threshold`: stall flushes until level 0 delta layers drop below threshold (default `4 * compaction_threshold`). If either threshold is reached, ephemeral layer rolls also synchronously wait for layer flushes to propagate this backpressure up into WAL ingestion. This will bound the number of frozen layers to 1 once backpressure kicks in, since all other frozen layers must flush before the rolled layer. ## Analysis This will significantly change the compute backpressure characteristics. Recall the three compute backpressure knobs: * `max_replication_write_lag`: 500 MB (based on Pageserver `last_received_lsn`). * `max_replication_flush_lag`: 10 GB (based on Pageserver `disk_consistent_lsn`). * `max_replication_apply_lag`: disabled (based on Pageserver `remote_consistent_lsn`). Previously, the Pageserver would keep ingesting WAL and build up ephemeral layers and L0 layers until the compute hit `max_replication_flush_lag` at 10 GB and began backpressuring. Now, once we delay/stall WAL ingestion, the compute will begin backpressuring after `max_replication_write_lag`, i.e. 500 MB. This is probably a good thing (we're not building up a ton of compaction debt), but we should consider tuning these settings. `max_replication_flush_lag` probably doesn't serve a purpose anymore, and we should consider removing it. Furthermore, the removal of the upload barrier in #10402 will mean that we no longer backpressure flushes based on S3 uploads, since `max_replication_apply_lag` is disabled. 
We should consider enabling this as well. ### When and what do we compact? Default compaction settings: * `compaction_threshold`: 10 L0 delta layers. * `compaction_period`: 20 seconds (between each compaction loop check). * `checkpoint_distance`: 256 MB (size of L0 delta layers). * `l0_flush_delay_threshold`: 20 L0 delta layers. * `l0_flush_stall_threshold`: 40 L0 delta layers. Compaction characteristics: * Minimum compaction volume: 10 layers * 256 MB = 2.5 GB. * Additional compaction volume (assuming 128 MB/s WAL): 128 MB/s * 20 seconds = 2.5 GB (10 L0 layers). * Required compaction bandwidth: 5.0 GB / 20 seconds = 256 MB/s. ### When do we hit `max_replication_write_lag`? Depending on how fast compaction and flushes happens, the compute will backpressure somewhere between `l0_flush_delay_threshold` or `l0_flush_stall_threshold` + `max_replication_write_lag`. * Minimum compute backpressure lag: 20 layers * 256 MB + 500 MB = 5.6 GB * Maximum compute backpressure lag: 40 layers * 256 MB + 500 MB = 10.0 GB This seems like a reasonable range to me.	2025-01-24 09:47:28 +00:00
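The back-of-the-envelope figures in the commit above can be rechecked with a few lines of arithmetic. The sketch below only restates the defaults quoted in the message and the assumed 128 MB/s WAL rate; the variable names are illustrative. It keeps everything in MB so the raw sums can be compared with the rounded GB figures in the text.

```python
# Defaults quoted in the commit message.
compaction_threshold = 10            # L0 delta layers
checkpoint_distance_mb = 256         # size of one L0 delta layer
compaction_period_s = 20             # seconds between compaction loop checks
l0_flush_delay_threshold = 2 * compaction_threshold   # 20 layers
l0_flush_stall_threshold = 4 * compaction_threshold   # 40 layers
max_replication_write_lag_mb = 500

# Assumption taken from the message's example: WAL arrives at 128 MB/s.
wal_rate_mb_s = 128

# Compaction volume per loop iteration.
min_compaction_mb = compaction_threshold * checkpoint_distance_mb   # 2560
additional_mb = wal_rate_mb_s * compaction_period_s                 # 2560
required_bandwidth_mb_s = (min_compaction_mb + additional_mb) / compaction_period_s  # 256.0

# Lag at which the compute starts to backpressure.
min_lag_mb = l0_flush_delay_threshold * checkpoint_distance_mb + max_replication_write_lag_mb  # 5620
max_lag_mb = l0_flush_stall_threshold * checkpoint_distance_mb + max_replication_write_lag_mb  # 10740

if __name__ == "__main__":
    print("required compaction bandwidth (MB/s):", required_bandwidth_mb_s)
    print("min backpressure lag (MB):", min_lag_mb)
    print("max backpressure lag (MB):", max_lag_mb)
```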
John Spray	fd1368d31e	storcon: rework scheduler optimisation, prioritize AZ (#9916 ) ## Problem We want to do a more robust job of scheduling tenants into their home AZ: https://github.com/neondatabase/neon/issues/8264. Closes: https://github.com/neondatabase/neon/issues/8969 ## Summary of changes ### Scope This PR combines prioritizing AZ with a larger rework of how we do optimisation. The rationale is that just bumping AZ in the order of Score attributes is a very tiny change: the interesting part is lining up all the optimisation logic to respect this properly, which means rewriting it to use the same scores as the scheduler, rather than the fragile hand-crafted logic that we had before. Separating these changes out is possible, but would involve doing two rounds of test updates instead of one. ### Scheduling optimisation `TenantShard`'s `optimize_attachment` and `optimize_secondary` methods now both use the scheduler to pick a new "favourite" location. Then there is some refined logic for whether + how to migrate to it: - To decide if a new location is sufficiently "better", we generate scores using some projected ScheduleContexts that exclude the shard under consideration, so that we avoid migrating from a node with AffinityScore(2) to a node with AffinityScore(1), only to migrate back later. - Score types get a `for_optimization` method so that when we compare scores, we will only do an optimisation if the scores differ by their highest-ranking attributes, not just because one pageserver is lower in utilization. Eventually we _will_ want a mode that does this, but doing it here would make scheduling logic unstable and harder to test, and to do this correctly one needs to know the size of the tenant that one is migrating. - When we find a new attached location that we would like to move to, we will create a new secondary location there, even if we already had one on some other node. 
This handles the case where we have a home AZ A, and want to migrate the attachment between pageservers in that AZ while retaining a secondary location in some other AZ as well. - A unit test is added for https://github.com/neondatabase/neon/issues/8969, which is implicitly fixed by reworking optimisation to use the same scheduling scores as scheduling.	2025-01-13 19:33:00 +00:00
John Spray	105f66c4ce	tests: move test_parallel_copy into performance tree (#10343 ) ## Problem This test writes ~5GB of data. It is not suitable to run in parallel with all the other small tests in test_runner/regress. via #9537 ## Summary of changes - Move test_parallel_copy into the performance directory, so that it does not run in parallel with other tests	2025-01-10 13:57:26 +00:00
Peter Bendel	43dc03459d	Run pgbench on 10 GB scale factor on database with n relations (e.g. 10k) (#10172 ) ## Problem We want to verify whether, and by how much, pgbench throughput and latency on Neon suffer if the database also contains many other relations. ## Summary of changes Modify the benchmarking.yml pgbench-compare job to - create an additional project at scale factor 10 GiB - before running pgbench add n tables (initially 10k) to the database - then compare the pgbench throughput and latency to the existing pgbench-compare at 10 GiB scale factor We use a realistic template for the n relations that is a partitioned table with some realistic data types, indexes and constraints - similar to a table that we use internally. Example run: https://github.com/neondatabase/neon/actions/runs/12377565956/job/34547386959	2024-12-19 10:25:44 +00:00
Alex Chi Z.	6d3e8096fc	refactor(test): tighten up test_gc_feedback (#10126 ) ## Problem In https://github.com/neondatabase/neon/pull/8103 we changed the test case to have more test coverage of gc_compaction. Now that we have `test_gc_compaction_smoke`, we can revert this test case to serve its original purpose and revert the parameter changes. part of https://github.com/neondatabase/neon/issues/9114 ## Summary of changes * Revert pitr_interval from 60s to 10s. * Assert the physical/logical size ratio in the benchmark. --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2024-12-18 18:10:05 +00:00
Erik Grinaker	14c4fae64a	test_runner/performance: add improved bulk insert benchmark (#9812 ) Adds an improved bulk insert benchmark, including S3 uploads. Touches #9789.	2024-12-06 15:17:15 +00:00
John Spray	60c0d19f57	tests: make storcon scale test AZ-aware (#9952 ) ## Problem We have a scale test for the storage controller which also acts as a good stress test for scheduling stability. However, it created nodes with no AZs set. ## Summary of changes - Bump node count to 6 and set AZs on them. This is a precursor to other AZ-related PRs, to make sure any new code that's landed is getting scale tested in an AZ-aware environment.	2024-12-04 15:04:04 +00:00
Peter Bendel	9d75218ba7	fix parsing human time output like "50m37s" (#10001 ) ## Problem In ingest_benchmark.yml workflow we use pgcopydb tool to migrate project. pgcopydb logs human time. Our parsing of the human time doesn't work for times like "50m37s". [Example workflow](https://github.com/neondatabase/neon/actions/runs/12145539948/job/33867418065#step:10:479) contains "57m45s" but we [reported](https://github.com/neondatabase/neon/actions/runs/12145539948/job/33867418065#step:10:500) only the seconds part: 45.000 s ## Summary of changes add a regex pattern for Minute/Second combination	2024-12-04 11:37:24 +00:00
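The parsing problem described above (only the seconds of "57m45s" being reported) can be illustrated with a small sketch. The regex and the function name `parse_human_time` are hypothetical, not the pattern added to the workflow, and the set of accepted forms (optional hours, minutes and seconds) is an assumption about the log format.

```python
import re

# Optional hours, minutes and seconds, e.g. "45s", "57m45s", "1h02m03s".
_HUMAN_TIME = re.compile(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+(?:\.\d+)?)s)?")


def parse_human_time(text: str) -> float:
    """Convert a human-readable duration into seconds."""
    match = _HUMAN_TIME.fullmatch(text.strip())
    # An empty string also matches the pattern, so require at least one part.
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes, seconds = match.groups()
    return int(hours or 0) * 3600 + int(minutes or 0) * 60 + float(seconds or 0)


if __name__ == "__main__":
    for sample in ("45s", "50m37s", "57m45s"):
        print(sample, "->", parse_human_time(sample), "s")
```

Using `fullmatch` rather than a search is the point of the sketch: a pattern that only looks for a trailing seconds component silently drops the minutes.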
Peter Bendel	1b3558df7a	optimize parms for ingest bench (#9999 ) ## Problem we tried different parallelism settings for ingest bench ## Summary of changes the following settings seem optimal after merging - SK side Wal filtering - batched getpages Settings: - effective_io_concurrency 100 - concurrency limit 200 (different from Prod!) - jobs 4, maintenance workers 7 - 10 GB chunk size	2024-12-04 11:07:22 +00:00
Alexander Bayandin	023821a80c	test_page_service_batching: fix non-numeric metrics (#9998 ) ## Problem ``` 2024-12-03T15:42:46.5978335Z + poetry run python /__w/neon/neon/scripts/ingest_perf_test_result.py --ingest /__w/neon/neon/test_runner/perf-report-local 2024-12-03T15:42:49.5325077Z Traceback (most recent call last): 2024-12-03T15:42:49.5325603Z File "/__w/neon/neon/scripts/ingest_perf_test_result.py", line 165, in <module> 2024-12-03T15:42:49.5326029Z main() 2024-12-03T15:42:49.5326316Z File "/__w/neon/neon/scripts/ingest_perf_test_result.py", line 155, in main 2024-12-03T15:42:49.5326739Z ingested = ingest_perf_test_result(cur, item, recorded_at_timestamp) 2024-12-03T15:42:49.5327488Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2024-12-03T15:42:49.5327914Z File "/__w/neon/neon/scripts/ingest_perf_test_result.py", line 99, in ingest_perf_test_result 2024-12-03T15:42:49.5328321Z psycopg2.extras.execute_values( 2024-12-03T15:42:49.5328940Z File "/github/home/.cache/pypoetry/virtualenvs/non-package-mode-_pxWMzVK-py3.11/lib/python3.11/site-packages/psycopg2/extras.py", line 1299, in execute_values 2024-12-03T15:42:49.5335618Z cur.execute(b''.join(parts)) 2024-12-03T15:42:49.5335967Z psycopg2.errors.InvalidTextRepresentation: invalid input syntax for type numeric: "concurrent-futures" 2024-12-03T15:42:49.5336287Z LINE 57: 'concurrent-futures', 2024-12-03T15:42:49.5336462Z ^ ``` ## Summary of changes - `test_page_service_batching`: save non-numeric params as `labels` - Add a runtime check that `metric_value` is NUMERIC	2024-12-03 22:46:18 +00:00
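The fix described above, keeping non-numeric parameters out of `metric_value`, can be sketched as follows. The helper names and the example parameter keys are hypothetical; only the value "concurrent-futures" and the idea of storing such values as `labels` come from the commit message.

```python
from numbers import Real
from typing import Any, Dict, Tuple


def is_numeric(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, Real) and not isinstance(value, bool)


def split_params(params: Dict[str, Any]) -> Tuple[Dict[str, float], Dict[str, str]]:
    """Separate numeric metrics from non-numeric values, which become labels."""
    metrics: Dict[str, float] = {}
    labels: Dict[str, str] = {}
    for name, value in params.items():
        if is_numeric(value):
            metrics[name] = value
        else:
            labels[name] = str(value)
    return metrics, labels


def check_metric_value(value: Any) -> Any:
    """Runtime guard: fail early instead of at database insert time."""
    if not is_numeric(value):
        raise TypeError(f"metric_value must be numeric, got {value!r}")
    return value


if __name__ == "__main__":
    # Hypothetical parameter names; the string value is from the traceback above.
    print(split_params({"max_batch_size": 32, "execution": "concurrent-futures"}))
```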
Christian Schwarz	944c1adc4c	tests & benchmarks: unify the way we customize the default tenant config (#9992 ) Before this PR, some override callbacks used `.default()`, others used `.setdefault()`. As of this PR, all callbacks use `.setdefault()` which I think is least prone to failure. Aligning on a single way will set the right example for future tests that need such customization. The `test_pageserver_getpage_throttle.py` technically is a change in behavior: before, it replaced the `tenant_config` field, now it just configures the throttle. This is what I believe is intended anyway.	2024-12-03 22:07:03 +00:00
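The behavioral difference mentioned in the commit above, replacing the `tenant_config` field versus adding to it via `.setdefault()`, can be sketched with plain dictionaries. The callback names and the config keys `throttle`, `rate`, and `gc_period` as used here are illustrative assumptions, not the test suite's actual code.

```python
def replace_tenant_config(ps_cfg: dict) -> None:
    # Overwrites the whole field: any tenant settings configured earlier are lost.
    ps_cfg["tenant_config"] = {"throttle": {"rate": 100}}


def configure_throttle(ps_cfg: dict) -> None:
    # Creates the field only if it is missing, then sets a single key in it.
    tenant_config = ps_cfg.setdefault("tenant_config", {})
    tenant_config["throttle"] = {"rate": 100}


before = {"tenant_config": {"gc_period": "0s"}}

replaced = {"tenant_config": dict(before["tenant_config"])}
replace_tenant_config(replaced)

merged = {"tenant_config": dict(before["tenant_config"])}
configure_throttle(merged)
```

In this sketch `replaced` ends up without the earlier `gc_period` entry, while `merged` keeps it alongside the throttle setting.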
Erik Grinaker	bbe4dfa991	test_runner: use immediate shutdown in `test_sharded_ingest` (#9984 ) ## Problem `test_sharded_ingest` ingests a lot of data, which can cause shutdown to be slow e.g. due to local "S3 uploads" or compactions. This can cause test flakes during teardown. Resolves #9740. ## Summary of changes Perform an immediate shutdown of the cluster.	2024-12-03 14:33:31 +00:00
Christian Schwarz	cb10be710d	page_service: batching observability & include throttled time in smgr metrics (#9870 ) This PR - fixes smgr metrics https://github.com/neondatabase/neon/issues/9925 - adds an additional startup log line logging the current batching config - adds a histogram of batch sizes global and per-tenant - adds a metric exposing the current batching config The issue described in #9925 is that before this PR, request latency was only observed after batching. This means that smgr latency metrics (most importantly getpage latency) don't account for - `wait_lsn` time - time spent waiting for batch to fill up / the executor stage to pick up the batch. The fix is to use a per-request batching timer, like we did before the initial batching PR. We funnel those timers through the entire request lifecycle. I noticed that even before the initial batching changes, we weren't accounting for the time spent writing & flushing the response to the wire. This PR drive-by fixes that deficiency by dropping the timers at the very end of processing the batch, i.e., after the `pgb.flush()` call. I was unable to maintain the behavior that we deduct time-spent-in-throttle from various latency metrics. The reason is that we're using a single counter in `RequestContext` to track micros spent in throttle. But there are N metrics timers in the batch, one per request. As a consequence, the practice of consuming the counter in the drop handler of each timer no longer works because all but the first timer will encounter error `close() called on closed state`. A failed attempt to maintain the current behavior can be found in https://github.com/neondatabase/neon/pull/9951. So, this PR removes the deduction behavior from all metrics.
I started a discussion on Slack about the implications this has for our internal SLO calculation: https://neondb.slack.com/archives/C033RQ5SPDH/p1732910861704029 # Refs - fixes https://github.com/neondatabase/neon/issues/9925 - sub-issue https://github.com/neondatabase/neon/issues/9377 - epic: https://github.com/neondatabase/neon/issues/9376	2024-12-03 11:03:23 +00:00
Erik Grinaker	5330122049	test_runner: improve `wait_until` (#9936 ) Improves `wait_until` by: * Use `timeout` instead of `iterations`. This allows changing the timeout/interval parameters independently. * Make `timeout` and `interval` optional (default 20s and 0.5s). Most callers don't care. * Only output status every 1s by default, and add optional `status_interval` parameter. * Remove `show_intermediate_error`, this was always emitted anyway. Most callers have been updated to use the defaults, except where they had good reason otherwise.	2024-12-02 10:26:15 +00:00
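A polling helper with the shape described in the commit above might look like the following sketch. It is not the test runner's implementation: only the parameter names `timeout`, `interval`, and `status_interval` and their defaults come from the commit message, and everything else (the exception handling, the printed status line, the `TimeoutError`) is an assumption.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_until(
    func: Callable[[], T],
    timeout: float = 20.0,
    interval: float = 0.5,
    status_interval: float = 1.0,
) -> T:
    """Call func until it stops raising, or fail once timeout seconds elapse."""
    start = time.monotonic()
    last_status = start
    while True:
        try:
            return func()
        except Exception as exc:
            now = time.monotonic()
            if now - start >= timeout:
                raise TimeoutError(f"timed out after {timeout}s") from exc
            # Rate-limit status output independently of the polling interval.
            if now - last_status >= status_interval:
                print(f"still waiting: {exc}")
                last_status = now
            time.sleep(interval)
```

Because the deadline is expressed as a duration rather than an iteration count, callers can shorten `interval` without also changing how long the helper waits overall.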
Christian Schwarz	aa4ec11af9	page_service: rewrite batching to work without a timeout (#9851 ) # Problem The timeout-based batching adds latency to unbatchable workloads. We can choose a short batching timeout (e.g. 10us) but that requires high-resolution timers, which tokio doesn't have. I thoroughly explored options to use OS timers (see [this](https://github.com/neondatabase/neon/pull/9822) abandoned PR). In short, it's not an attractive option because any timer implementation adds non-trivial overheads. # Solution The insight is that, in the steady state of a batchable workload, the time we spend in `get_vectored` will be hundreds of microseconds anyway. If we prepare the next batch concurrently to `get_vectored`, we will have a sizeable batch ready once `get_vectored` of the current batch is done and do not need an explicit timeout. This can be reasonably described as pipelining of the protocol handler. # Implementation We model the sub-protocol handler for pagestream requests (`handle_pagerequests`) as two futures that form a pipeline: 2. Batching: read requests from the connection and fill the current batch 3. Execution: `take` the current batch, execute it using `get_vectored`, and send the response. The Reading and Batching stages are connected through a new type of channel called `spsc_fold`. See the long comment in `handle_pagerequests_pipelined` for details.
# Changes - Refactor `handle_pagerequests` - separate functions for - reading one protocol message; produces a `BatchedFeMessage` with just one page request in it - batching; tries to merge an incoming `BatchedFeMessage` into an existing `BatchedFeMessage`; returns `None` on success and returns back the incoming message in case merging isn't possible - execution of a batched message - unify the timeline handle acquisition & request span construction; it now happens in the function that reads the protocol message - Implement serial and pipelined model - serial: what we had before any of the batching changes - read one protocol message - execute protocol messages - pipelined: the design described above - optionality for execution of the pipeline: either via concurrent futures or tokio tasks - Pageserver config - remove batching timeout field - add ability to configure pipelining mode - add ability to limit max batch size for pipelined configurations (required for the rollout, cf https://github.com/neondatabase/cloud/issues/20620 ) - ability to configure execution mode - Tests - remove `batch_timeout` parametrization - rename `test_getpage_merge_smoke` to `test_throughput` - add parametrization to test different max batch sizes and execution modes - rename `test_timer_precision` to `test_latency` - rename the test case file to `test_page_service_batching.py` - better descriptions of what the tests actually do ## On holding the `TimelineHandle` in the pending batch While batching, we hold the `TimelineHandle` in the pending batch. Therefore, the timeline will not finish shutting down while we're batching. This is not a problem in practice because the concurrently ongoing `get_vectored` call will fail quickly with an error indicating that the timeline is shutting down. This results in the Execution stage returning a `QueryError::Shutdown`, which causes the pipeline / entire page service connection to shut down.
This drops all references to the `Arc<Mutex<Option<Box<BatchedFeMessage>>>>` object, thereby dropping the contained `TimelineHandle`s. - => fixes https://github.com/neondatabase/neon/issues/9850 # Performance Local run of the benchmarks, results in [this empty commit](`1cf5b1463f`) in the PR branch. Key take-aways: * `concurrent-futures` and `tasks` deliver identical `batching_factor` * tail latency impact unknown, cf https://github.com/neondatabase/neon/issues/9837 * `concurrent-futures` has higher throughput than `tasks` in all workloads (=lower `time` metric) * In unbatchable workloads, `concurrent-futures` has 5% higher `CPU-per-throughput` than that of `tasks`, and 15% higher than that of `serial`. * In batchable-32 workload, `concurrent-futures` has 8% lower `CPU-per-throughput` than that of `tasks` (comparison to tput of `serial` is irrelevant) * in unbatchable workloads, mean and tail latencies of `concurrent-futures` are practically identical to `serial`, whereas `tasks` adds 20-30us of overhead Overall, `concurrent-futures` seems like a slightly more attractive choice. # Rollout This change is disabled-by-default. Rollout plan: - https://github.com/neondatabase/cloud/issues/20620 # Refs - epic: https://github.com/neondatabase/neon/issues/9376 - this sub-task: https://github.com/neondatabase/neon/issues/9377 - the abandoned attempt to improve batching timeout resolution: https://github.com/neondatabase/neon/pull/9820 - closes https://github.com/neondatabase/neon/issues/9850 - fixes https://github.com/neondatabase/neon/issues/9835	2024-11-30 00:16:24 +00:00
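The timeout-free batching idea from the commit above (the next batch fills up while the current one executes) can be sketched with `asyncio`. This is a loose analogy, not the pageserver's Rust implementation: it uses a plain `asyncio.Queue` instead of the `spsc_fold` channel, and the names `reader`, `executor`, `run`, and the sleep standing in for `get_vectored` are all assumptions.

```python
import asyncio

_DONE = object()  # sentinel marking the end of the request stream


async def reader(requests, queue: asyncio.Queue) -> None:
    # Stage 1: read requests and hand them to the executor without waiting for it.
    for request in requests:
        await queue.put(request)
        await asyncio.sleep(0)  # yield, as a real reader would while awaiting the socket
    await queue.put(_DONE)


async def executor(queue: asyncio.Queue, max_batch_size: int, execute) -> list:
    # Stage 2: take whatever accumulated while the previous batch was executing.
    batch_sizes = []
    finished = False
    while not finished:
        first = await queue.get()
        if first is _DONE:
            break
        batch = [first]
        while len(batch) < max_batch_size and not queue.empty():
            item = queue.get_nowait()
            if item is _DONE:
                finished = True
                break
            batch.append(item)
        await execute(batch)  # no timer: the batch is bounded only by size
        batch_sizes.append(len(batch))
    return batch_sizes


async def run(requests, max_batch_size: int):
    queue: asyncio.Queue = asyncio.Queue()
    served = []

    async def execute(batch):
        await asyncio.sleep(0.001)  # stand-in for the time spent in get_vectored
        served.extend(batch)

    _, batch_sizes = await asyncio.gather(
        reader(requests, queue), executor(queue, max_batch_size, execute)
    )
    return served, batch_sizes
```

Calling `asyncio.run(run(list(range(100)), 8))` serves all requests in order in batches of at most 8. The batch size emerges from how long execution takes rather than from a timer, which is the property the commit relies on.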
Erik Grinaker	3ffe6de0b9	test_runner/performance: add logical message ingest benchmark (#9749 ) Adds a benchmark for logical message WAL ingestion throughput end-to-end. Logical messages are essentially noops, and thus ignored by the Pageserver. Example results from my MacBook, with fsync enabled: ``` postgres_ingest: 14.445 s safekeeper_ingest: 29.948 s pageserver_ingest: 30.013 s pageserver_recover_ingest: 8.633 s wal_written: 10,340 MB message_count: 1310720 messages postgres_throughput: 715 MB/s safekeeper_throughput: 345 MB/s pageserver_throughput: 344 MB/s pageserver_recover_throughput: 1197 MB/s ``` See https://github.com/neondatabase/neon/issues/9642#issuecomment-2475995205 for running analysis. Touches #9642.	2024-11-29 09:40:08 +00:00
Vlad Lazar	9e0148de11	safekeeper: use protobuf for sending compressed records to pageserver (#9821 ) ## Problem https://github.com/neondatabase/neon/pull/9746 lifted decoding and interpretation of WAL to the safekeeper. This reduced the ingested amount on the pageservers by around 10x for a tenant with 8 shards, but doubled the ingested amount for single sharded tenants. Also, https://github.com/neondatabase/neon/pull/9746 uses bincode which doesn't support schema evolution. Technically the schema can be evolved, but it's very cumbersome. ## Summary of changes This patch set addresses both problems by adding protobuf support for the interpreted wal records and adding compression support. Compressed protobuf reduced the ingested amount by 100x on the 32 shards `test_sharded_ingest` case (compared to non-interpreted proto). For the 1 shard case the reduction is 5x. Sister change to `rust-postgres` is [here](https://github.com/neondatabase/rust-postgres/pull/33). ## Links Related: https://github.com/neondatabase/neon/issues/9336 Epic: https://github.com/neondatabase/neon/issues/9329	2024-11-27 12:12:21 +00:00
Peter Bendel	277c33ba3f	ingest benchmark: after effective_io_concurrency = 100 we can increase compute side parallelism (#9904 ) ## Problem ingest benchmark tests project migration to Neon involving steps - COPY relation data - create indexes - create constraints Previously we used only 4 copy jobs, 4 create index jobs and 7 maintenance workers. After increasing effective_io_concurrency on compute we see that we can sustain more parallelism in the ingest bench ## Summary of changes Increase copy jobs to 8, create index jobs to 8 and maintenance workers to 16	2024-11-27 10:09:01 +00:00
Vlad Lazar	7a2f0ed8d4	safekeeper: lift decoding and interpretation of WAL to the safekeeper (#9746 ) ## Problem For any given tenant shard, pageservers receive all of the tenant's WAL from the safekeeper. This soft-blocks us from using larger shard counts due to bandwidth concerns and CPU overhead of filtering out the records. ## Summary of changes This PR lifts the decoding and interpretation of WAL from the pageserver into the safekeeper. A customised PG replication protocol is used where instead of sending raw WAL, the safekeeper sends filtered, interpreted records. The receiver drives the protocol selection, so, on the pageserver side, usage of the new protocol is gated by a new pageserver config: `wal_receiver_protocol`. More granularly the changes are: 1. Optionally inject the protocol and shard identity into the arguments used for starting replication 2. On the safekeeper side, implement a new wal sending primitive which decodes and interprets records before sending them over 3. On the pageserver side, implement the ingestion of this new replication message type. It's very similar to what we already have for raw wal (minus decoding and interpreting). ## Notes * This PR currently uses my [branch of rust-postgres](https://github.com/neondatabase/rust-postgres/tree/vlad/interpreted-wal-record-replication-support) which includes the deserialization logic for the new replication message type. PR for that is open [here](https://github.com/neondatabase/rust-postgres/pull/32). * This PR contains changes for both pageservers and safekeepers. It's safe to merge because the new protocol is disabled by default on the pageserver side. We can gradually start enabling it in subsequent releases. * CI tests are running on https://github.com/neondatabase/neon/pull/9747 ## Links Related: https://github.com/neondatabase/neon/issues/9336 Epic: https://github.com/neondatabase/neon/issues/9329	2024-11-25 17:29:28 +00:00
Christian Schwarz	5c2356988e	page_service: add benchmark for batching (#9820 ) This PR adds two benchmarks to demonstrate the effect of server-side getpage request batching added in https://github.com/neondatabase/neon/pull/9321. For the CPU usage, I found that the `prometheus` crate's built-in CPU usage accounts the seconds at integer granularity. That's not enough if you reduce the target benchmark runtime for local iteration. So, add a new `libmetrics` metric and report that. The benchmarks are disabled because [on our benchmark nodes, timer resolution isn't high enough](https://neondb.slack.com/archives/C059ZC138NR/p1732264223207449). They work (no statement about quality) on my bare-metal devbox. They will be refined and enabled once we find a fix. Candidates at time of writing are: - https://github.com/neondatabase/neon/pull/9822 - https://github.com/neondatabase/neon/pull/9851 Refs: - Epic: https://github.com/neondatabase/neon/issues/9376 - Extracted from https://github.com/neondatabase/neon/pull/9792	2024-11-25 15:52:39 +00:00
Alex Chi Z.	4630b70962	fix(pageserver): ensure all layers are flushed before measuring RSS (#9861 ) ## Problem close https://github.com/neondatabase/neon/issues/9761 The test assumed that no new L0 layers are flushed throughout the process, which is not true. ## Summary of changes Fix the test case `test_compaction_l0_memory` by flushing in-memory layers before compaction. Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-11-25 14:25:18 +00:00

255 Commits