
rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-08 22:12:56 +00:00

Author	SHA1	Message	Date
Alexander Bayandin	4778b6a12e	Switch to querying new tests results DB (#5616 ) ## Problem We started to store test results in a new format in https://github.com/neondatabase/neon/pull/4549. This PR switches scripts to query this db. (we can completely remove old DB/ingestions scripts in a couple of weeks after the PR merged) ## Summary of changes - `scripts/benchmark_durations.py` query new database - `scripts/flaky_tests.py` query new database	2023-10-25 14:25:13 +01:00
John Spray	8b8be7bed4	tests: don't fail tests on torn log lines (#5655 ) ## Problem Tests that force-kill and restart a service can generate torn log lines that might match WARN\|ERROR, but not match the allow expression that a test has loaded, e.g. https://neon-github-public-dev.s3.amazonaws.com/reports/pr-5651/6638398772/index.html#suites/7538959189f4501983ddd9e167836c8b/d272ba8a73e6945c ## Summary of changes Ignore log lines which match a regex for torn log lines on restart: they have two timestamps and the second line is an "INFO version"... message.	2023-10-25 13:29:30 +01:00
Conrad Ludgate	a461c459d8	fix http pool test (#5653 ) ## Problem We defer the returning of connections to the connection pool. It's possible for our test to be faster than the returning of connections - which then gets a differing process ID because it opens a new connection. ## Summary of changes 1. Delay the tests just a little (20ms) to give more chance for connections to return. 2. Correlate connection IDs with the connection logs a bit more	2023-10-25 13:20:45 +01:00
Joonas Koivunen	4ae2d1390d	refactor(remote_timeline_client): Split deletion into unlinking + deletion (#5645 ) Quest: #4745. Prerequisite for #4938. Original https://github.com/neondatabase/neon/pull/4938#issuecomment-1777150665. The new Layer implementation has so far been using `RemoteTimelineClient::schedule_layer_file_deletion` from `Layer::drop`, but it was noticed that this could mean that the L0s that compaction wanted to remove could linger in the index part for a longer time. The solution is to split `RemoteTimelineClient::schedule_layer_file_deletion` into two parts: - unlinking from index_part.json, to be called from the end of compaction and gc - scheduling of actual deletions, to be called from `Layer::drop` The new methods are added unused.	2023-10-25 15:01:19 +03:00
Joonas Koivunen	c5949e1fd6	misc smaller improvements (#5527 ) - finally add an `#[instrument]` to Timeline::create_image_layers, making it easier to see that something is happening because we create image layers - format some macro context code - add a warning not to create new validation functions a la parse do not validate Split off from #5198.	2023-10-25 14:59:43 +03:00
John Spray	127837abb0	tests: de-flake test_eviction_across_generations (#5650 ) ## Problem There was an edge case where initial logical size calculation can be downloading a layer that wasn't hit by the test's `SELECT`, and it's on-disk but still marked as remote in the pageserver's internal state, so evicting it fails. https://neon-github-public-dev.s3.amazonaws.com/reports/pr-5648/6630099807/index.html#categories/dee044ec96f666edb90a77c01099a941/e38e97a2735ffa8c/ ## Summary of changes Use pageserver API to learn about layers, instead of inspecting local disk, so that we will always agree with the pageserver about which layers are local.	2023-10-25 10:55:45 +01:00
Conrad Ludgate	b2c96047d0	move wake compute after the auth quirks logic (#5642 ) ## Problem https://github.com/neondatabase/neon/issues/5568#issuecomment-1777015606 ## Summary of changes Make the auth_quirks_creds return the authentication information, and push the wake_compute loop to after, inside `auth_quirks`	2023-10-25 08:30:47 +01:00
Em Sharnoff	44202eeb3b	Bump vm-builder v0.18.1 -> v0.18.2 (#5646 ) Only applicable change was neondatabase/autoscaling#571, removing the postgres_exporter flags `--auto-discover-databases` and `--exclude-databases=...`	2023-10-24 16:04:28 -07:00
Arpad Müller	4bef977c56	Use tuples instead of manual comparison chain (#5637 ) Makes code a little bit simpler	2023-10-24 17:16:23 +00:00
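
The tuple technique named in #5637 above can be sketched as follows. This is a minimal illustration, not the code from that PR: the `Key` type and its fields are hypothetical, and the only fact relied on is that Rust tuples implement `Ord` lexicographically.

```rust
use std::cmp::Ordering;

/// Hypothetical three-field key, used only for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Key {
    major: u32,
    minor: u32,
    patch: u32,
}

/// Manual comparison chain: compare each field in turn.
fn cmp_manual(a: &Key, b: &Key) -> Ordering {
    if a.major != b.major {
        a.major.cmp(&b.major)
    } else if a.minor != b.minor {
        a.minor.cmp(&b.minor)
    } else {
        a.patch.cmp(&b.patch)
    }
}

/// Tuple comparison: tuples compare element by element, left to right,
/// so this yields the same ordering as the chain above.
fn cmp_tuple(a: &Key, b: &Key) -> Ordering {
    (a.major, a.minor, a.patch).cmp(&(b.major, b.minor, b.patch))
}
```

Both functions return the same `Ordering` for any pair of keys; the tuple form simply leaves less room for a mistyped field in one branch.
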
John Spray	a0b862a8bd	pageserver: schedule frozen layer uploads inside the layers lock (#5639 ) ## Problem Compaction's source of truth for what layers exist is the LayerManager. `flush_frozen_layer` updates LayerManager before it has scheduled upload of the frozen layer. Compaction can then "see" the new layer, decide to delete it, schedule uploads of replacement layers, all before `flush_frozen_layer` wakes up again and schedules the upload. When the upload is scheduled, the local layer file may be gone, in which case we end up with no such layer in remote storage, but an entry still added to IndexPart pointing to the missing layer. ## Summary of changes Schedule layer uploads inside the `self.layers` lock, so that whenever a frozen layer is present in LayerManager, it is also present in RemoteTimelineClient's metadata. Closes: #5635	2023-10-24 13:57:01 +01:00
Conrad Ludgate	767ef29390	proxy: filter out more quota exceeded errors (#5640 ) ## Problem Looking at logs, I saw more retries being performed for other quota exceeded errors ## Summary of changes Filter out all quota exceeded family of errors	2023-10-24 13:13:23 +01:00
Alexander Bayandin	a8a800af51	Run real Azure tests on CI (#5627 ) ## Problem We do not run real Azure-related tests on CI ## Summary of changes - Set required env variables to run real Azure blob storage tests on CI	2023-10-24 12:12:11 +01:00
Arpad Müller	1e250cd90a	Cleanup in azure_upload_download_works test (#5636 ) The `azure_upload_download_works` test is not cleaning up after itself, leaving behind the files it is uploading. I found these files when looking at the contents of the bucket in #5627. We now clean up the file we uploaded, as the other tests do. Follow-up of #5546	2023-10-23 19:08:56 +01:00
John Spray	eaaa18f6ed	attachment_service: graceful SIGQUIT (#5626 ) `attachment_service` doesn't explicitly handle signals, which causes a backtrace when `neon_local` kills it with SIGQUIT. Closes: https://github.com/neondatabase/neon/issues/5613	2023-10-23 17:30:25 +01:00
John Spray	188f67e1df	pageserver: forward compat: be tolerant of deletion marker in `timelines/` (#5632 ) ## Problem https://github.com/neondatabase/neon/pull/5580 will move the remote deletion marker into the `timelines/` path. This would cause old pageserver code to fail loading the tenant due to an apparently invalid timeline ID. That would be a problem if we had to roll back after deploying #5580 ## Summary of changes If a `deleted` file is in `timelines/` just ignore it.	2023-10-23 17:51:38 +02:00
John Spray	7e805200bb	pageserver: parallel load of configs (#5607 ) ## Problem When the number of tenants is large, sequentially issuing the open/read calls for their config files is a ~1000ms delay during startup. It's not a lot, but it's simple to fix. ## Summary of changes Put all the config loads into spawn_blocking() tasks and run them in a JoinSet. We can simplify this a bit later when we have full async disk I/O. --------- Co-authored-by: Shany Pozin <shany@neon.tech>	2023-10-23 15:32:34 +01:00
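
The idea in #5607 above, issuing blocking config reads concurrently instead of one after another, can be sketched with the standard library alone. This is an assumption-laden stand-in: the PR uses `spawn_blocking()` tasks in a `JoinSet`, whereas this sketch swaps in `std::thread::scope`, and the function name `load_configs_parallel` is hypothetical.

```rust
use std::fs;
use std::io;
use std::path::PathBuf;
use std::thread;

/// Read every config file on its own scoped thread and return the
/// results in input order. Stand-in for `spawn_blocking` + `JoinSet`;
/// not the pageserver's actual implementation.
fn load_configs_parallel(paths: &[PathBuf]) -> Vec<io::Result<String>> {
    thread::scope(|s| {
        // Spawn all reads first so they overlap...
        let handles: Vec<_> = paths
            .iter()
            .map(|p| s.spawn(move || fs::read_to_string(p)))
            .collect();
        // ...then join them in order.
        handles
            .into_iter()
            .map(|h| h.join().expect("config loader thread panicked"))
            .collect()
    })
}
```

One thread per file is only reasonable for a sketch; with many tenants a bounded pool (which is what a blocking-task pool provides) would be the safer choice.
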
Christian Schwarz	c6ca1d76d2	consumption_metrics: fix periodicness behavior & reporting (#5625 ) Before this PR, the ticker was running at default miss behavior `Delay`. For example, here is the startup output with 25k tenants: ``` 2023-10-19T09:57:21.682466Z INFO synthetic_size_worker: starting calculate_synthetic_size_worker 2023-10-19T10:50:44.678202Z WARN synthetic_size_worker: task iteration took longer than the configured period elapsed=3202.995707156s period=10m task=ConsumptionMetricsSyntheticSizeWorker 2023-10-19T10:52:17.408056Z WARN synthetic_size_worker: task iteration took longer than the configured period elapsed=2695.72556035s period=10m task=ConsumptionMetricsSyntheticSizeWorker ``` The first message's `elapsed` value is correct. It matches the delta between the log line timestamps. The second one is logged ca 1.5min after, though, but reports a much larger `elapsed` than 1.5min. This PR fixes the behavior by copying what `eviction_task.rs` does.	2023-10-23 16:31:38 +02:00
Conrad Ludgate	94b4e76e13	proxy: latency connect outcome (#5588 ) ## Problem I recently updated the latency timers to include cache miss and pool miss, as well as connection protocol. By moving the latency timer to start before authentication, we count a lot more failures and it's messed up the latency dashboard. ## Summary of changes Add another label to LatencyTimer metrics for outcome. Explicitly report on success	2023-10-23 15:17:28 +01:00
khanova	b514da90cb	Set up timeout for scram protocol execution (#5551 ) ## Problem Context: https://github.com/neondatabase/neon/issues/5511#issuecomment-1759649679 Some of our scram protocol executions timed out only after 17 minutes. ## Summary of changes Make timeout for scram execution meaningful and configurable.	2023-10-23 15:11:05 +01:00
Conrad Ludgate	7d17f1719f	reduce cancel map contention (#5555 ) ## Problem Every database request locks this cancel map rwlock. At high requests per second this would have high contention ## Summary of changes Switch to dashmap which has a sharded rwlock to reduce contention	2023-10-23 14:12:41 +01:00
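
To illustrate why a sharded lock reduces contention, as #5555 above describes, here is a minimal std-only sketch of a sharded map. It is not dashmap and not the proxy's cancel map; `ShardedMap` and its methods are hypothetical names, and the shard selection by hash is an assumption about how such a structure is typically built.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::RwLock;

/// Minimal sharded map: each shard has its own `RwLock`, so operations
/// on keys in different shards never wait on the same lock.
struct ShardedMap<K, V> {
    shards: Vec<RwLock<HashMap<K, V>>>,
}

impl<K: Hash + Eq, V: Clone> ShardedMap<K, V> {
    fn new(shard_count: usize) -> Self {
        assert!(shard_count > 0, "need at least one shard");
        let shards = (0..shard_count)
            .map(|_| RwLock::new(HashMap::new()))
            .collect();
        ShardedMap { shards }
    }

    /// Pick the shard for a key from its hash.
    fn shard_for(&self, key: &K) -> &RwLock<HashMap<K, V>> {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        let idx = (hasher.finish() as usize) % self.shards.len();
        &self.shards[idx]
    }

    fn insert(&self, key: K, value: V) -> Option<V> {
        self.shard_for(&key).write().unwrap().insert(key, value)
    }

    fn get(&self, key: &K) -> Option<V> {
        self.shard_for(key).read().unwrap().get(key).cloned()
    }

    fn remove(&self, key: &K) -> Option<V> {
        self.shard_for(key).write().unwrap().remove(key)
    }
}
```

With a single `RwLock<HashMap>`, every insert or remove takes the one write lock; here a write only locks the shard that owns the key.
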
John Spray	41ee75bc71	pageserver: do config writes in a spawn_blocking (#5603 ) ## Problem We now persist tenant configuration every time we spawn a tenant. The persist_tenant_config function is doing a series of non-async filesystem I/O, because `crashsafe::` isn't async yet. This isn't a demonstrated problem, but is a source of uncertainty when reasoning about what's happening with our startup times. ## Summary of changes - Wrap `crashsafe_overwrite` in `spawn_blocking`. - Although I think this change makes sense, it does not have a measurable impact on load time when testing with 10k tenants. - This can be reverted when we have full async I/O	2023-10-23 09:19:01 +01:00
Christian Schwarz	11e523f503	walredo: fix EGAGAIN/"os error 11" false page reconstruction failures (#5560 ) Stacked atop https://github.com/neondatabase/neon/pull/5559 Before this PR, there was the following race condition: ``` T1: polls for writeable stdin T1: writes to stdin T1: enters poll for stdout/stderr T2: enters poll for stdin write WALREDO: writes to stderr KERNEL: wakes up T1 and T2 Tx: reads stderr and prints it Ty: reads stderr and gets EAGAIN (valid values for (x, y) are (1, 2) or (2, 1)) ``` The concrete symptom that we observed repeatedly was with PG16, which started logging `registered custom resource manager` to stderr always, during startup, thereby giving us repeated opportunity to hit above race condition. PG14 and PG15 didn't log anything to stderr, hence we could have only hit this race condition if there was an actual error happening. This PR fixes the race by moving the reading of stderr into a tokio task. It exits when the stderr is closed by the child process, which in turn happens when the child exits, either by itself or because we killed it. The downside is that the async scheduling can reorder the log messages, which can be seen in the new `test_stderr`, which runs in a single-threaded runtime. I included the output below. Overall I think we should move the entire walredo to async, as Joonas proposed many months ago. This PR's asyncification is just the first step to resolve these false page reconstruction errors. After this is fixed, we should stop printing that annoying stderr message on walredo startup; it causes noise in the pageserver logs. That work is tracked in #5399 . 
``` 2023-10-13T19:05:21.878858Z ERROR apply_wal_records{tenant_id=d546fb76ba529195392fb4d19e243991 pid=753986}: failed to write out the walredo errored input: No such file or directory (os error 2) target=walredo-1697223921878-1132-0.walredo length=1132 2023-10-13T19:05:21.878932Z DEBUG postgres applied 2 WAL records (1062 bytes) in 114666 us to reconstruct page image at LSN 0/0 2023-10-13T19:05:21.878942Z ERROR error applying 2 WAL records 0/16A9388..0/16D4080 (1062 bytes) to base image with LSN 0/0 to reconstruct page image at LSN 0/0 n_attempts=0: apply_wal_records Caused by: WAL redo process closed its stdout unexpectedly 2023-10-13T19:05:21.879027Z INFO kill_and_wait_impl{pid=753986}: wait successful exit_status=signal: 11 (SIGSEGV) (core dumped) 2023-10-13T19:05:21.879079Z DEBUG wal-redo-postgres-stderr{pid=753986 tenant_id=d546fb76ba529195392fb4d19e243991 pg_version=16}: wal-redo-postgres stderr_logger_task started 2023-10-13T19:05:21.879104Z ERROR wal-redo-postgres-stderr{pid=753986 tenant_id=d546fb76ba529195392fb4d19e243991 pg_version=16}: received output output="2023-10-13 19:05:21.769 GMT [753986] LOG: registered custom resource manager \"neon\" with ID 134\n" 2023-10-13T19:05:21.879116Z DEBUG wal-redo-postgres-stderr{pid=753986 tenant_id=d546fb76ba529195392fb4d19e243991 pg_version=16}: wal-redo-postgres stderr_logger_task finished 2023-10-13T19:05:22.004439Z ERROR apply_wal_records{tenant_id=d546fb76ba529195392fb4d19e243991 pid=754000}: failed to write out the walredo errored input: No such file or directory (os error 2) target=walredo-1697223922004-1132-0.walredo length=1132 2023-10-13T19:05:22.004493Z DEBUG postgres applied 2 WAL records (1062 bytes) in 125344 us to reconstruct page image at LSN 0/0 2023-10-13T19:05:22.004501Z ERROR error applying 2 WAL records 0/16A9388..0/16D4080 (1062 bytes) to base image with LSN 0/0 to reconstruct page image at LSN 0/0 n_attempts=1: apply_wal_records Caused by: WAL redo process closed its stdout unexpectedly 
2023-10-13T19:05:22.004588Z INFO kill_and_wait_impl{pid=754000}: wait successful exit_status=signal: 11 (SIGSEGV) (core dumped) 2023-10-13T19:05:22.004624Z DEBUG wal-redo-postgres-stderr{pid=754000 tenant_id=d546fb76ba529195392fb4d19e243991 pg_version=16}: wal-redo-postgres stderr_logger_task started 2023-10-13T19:05:22.004653Z ERROR wal-redo-postgres-stderr{pid=754000 tenant_id=d546fb76ba529195392fb4d19e243991 pg_version=16}: received output output="2023-10-13 19:05:21.884 GMT [754000] LOG: registered custom resource manager \"neon\" with ID 134\n" 2023-10-13T19:05:22.004666Z DEBUG wal-redo-postgres-stderr{pid=754000 tenant_id=d546fb76ba529195392fb4d19e243991 pg_version=16}: wal-redo-postgres stderr_logger_task finished ```	2023-10-23 09:00:13 +01:00
Konstantin Knizhnik	b1a1126152	Grant replication permission to newly created users (#5615 ) ## Problem ## Summary of changes ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2023-10-20 21:29:17 +03:00
John Spray	a8899e1e0f	pageserver: apply timeout when waiting for tenant loads (#5601 ) ## Problem Loading tenants shouldn't hang. However, if it does, we shouldn't let one hung tenant prevent the entire process from starting background jobs. ## Summary of changes Generalize the timeout mechanism that we already applied to loading initial logical sizes: each phase in startup where we wait for a barrier is subject to a timeout, and startup will proceed if it doesn't complete within timeout. Startup metrics will still reflect the time when a phase actually completed, rather than when we skipped it. The code isn't the most beautiful, but that kind of reflects the awkwardness of await'ing on a future and then stashing it to await again later if we time out. I could imagine making this cleaner in future by waiting on a structure that doesn't self-destruct on wait() the way Barrier does, then make InitializationOrder into a structure that manages the series of waits etc.	2023-10-20 09:15:34 +01:00
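
The pattern described in #5601 above (wait for a startup phase, proceed on timeout, and still be able to observe the real completion later) can be sketched as below. This is a simplified model under stated assumptions: it uses a `std::sync::mpsc` channel rather than the async `Barrier` the PR discusses, and `PhaseOutcome` and `wait_phase` are hypothetical names.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

#[derive(Debug, PartialEq, Eq)]
enum PhaseOutcome {
    Completed,
    TimedOut,
}

/// Wait for a phase-complete signal, but give up after `timeout`.
/// The receiver is only borrowed, so the caller can wait on it again
/// later to learn when the phase actually finished.
fn wait_phase(done: &Receiver<()>, timeout: Duration) -> PhaseOutcome {
    match done.recv_timeout(timeout) {
        Ok(()) => PhaseOutcome::Completed,
        // All senders dropped: nothing can signal any more, so do not block.
        Err(RecvTimeoutError::Disconnected) => PhaseOutcome::Completed,
        Err(RecvTimeoutError::Timeout) => PhaseOutcome::TimedOut,
    }
}
```

Because the wait does not consume the receiver, a timed-out phase can be re-checked after startup has moved on, which is the property the PR notes a self-destructing barrier wait lacks.
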
Arseny Sher	2fbd5ab075	Add safekeeper test_late_init.	2023-10-20 10:57:59 +03:00
Arseny Sher	702382e99a	Add check that WAL segments are identical after recovery.	2023-10-20 10:57:59 +03:00
Arseny Sher	1b53b3e200	Make test_pageserver_http_get_wal_receiver_success not wait for keepalive.	2023-10-20 10:57:59 +03:00
Arseny Sher	b332268cec	Introduce safekeeper peer recovery. Implements fetching of WAL by safekeeper from another safekeeper by imitating behaviour of last elected leader. This avoids WAL accumulation on compute and facilitates faster compute startup as it doesn't need to download any WAL. Actually removing WAL download in walproposer is a matter of another patch though. There is a per timeline task which always runs, checking regularly if it should start recovery from someone, meaning there is something to fetch and there is no streaming compute. It then proceeds with fetching, finishing when there is nothing more to receive. Implements https://github.com/neondatabase/neon/pull/4875	2023-10-20 10:57:59 +03:00
Arseny Sher	76c702219c	Don't use AppendRequestHeader.epoch_start_lsn. It is simpler to get it once from ProposerElected.	2023-10-20 10:57:59 +03:00
Arthur Petukhovsky	ba856140e7	Fix neon_extra_build.yml (#5605 ) Build walproposer-lib in gather-rust-build-stats, fix nproc usage, fix walproposer-lib on macos.	2023-10-19 22:20:39 +01:00
Em Sharnoff	2cf6a47cca	vm-monitor: Deny not fail downscale if no memory stats yet (#5606 ) Fixes an issue we observed on staging that happens when the autoscaler-agent attempts to immediately downscale the VM after binding, which is typical for pooled computes. The issue was occurring because the autoscaler-agent was requesting downscaling before the vm-monitor had gathered sufficient cgroup memory stats to be confident in approving it. When the vm-monitor returned an internal error instead of denying downscaling, the autoscaler-agent retried the connection and immediately hit the same issue (in part because cgroup stats are collected per-connection, rather than globally).	2023-10-19 19:09:37 +01:00
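
The behavioural change in #5606 above, answering "denied" rather than failing when no statistics exist yet, can be modelled as below. Every name here (`MemoryStats`, `DownscaleDecision`, `decide_downscale`) and the threshold rule are hypothetical; the real vm-monitor logic is not shown in the commit message.

```rust
/// Hypothetical snapshot of cgroup memory statistics.
struct MemoryStats {
    avg_non_reclaimable_bytes: u64,
}

#[derive(Debug, PartialEq, Eq)]
enum DownscaleDecision {
    Approved,
    Denied { reason: &'static str },
}

/// Decide on a downscale request. Missing stats produce a normal
/// `Denied` answer instead of an internal error, so the caller has
/// no reason to drop and retry the connection.
fn decide_downscale(stats: Option<&MemoryStats>, target_bytes: u64) -> DownscaleDecision {
    match stats {
        None => DownscaleDecision::Denied {
            reason: "no memory stats collected yet",
        },
        Some(s) if s.avg_non_reclaimable_bytes > target_bytes => DownscaleDecision::Denied {
            reason: "memory usage above target",
        },
        Some(_) => DownscaleDecision::Approved,
    }
}
```

Modelling "not enough data" as a regular denial keeps it out of the error path, which is what stops the retry loop described in the commit message.
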
Konstantin Knizhnik	5a8bcdccb0	Fix elog format error in wallog_mapping_file (#5602 ) ## Problem Fix elog format error in wallog_mapping_file ## Summary of changes Use proper case to avoid compilation warning=error in C at MacOS. ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2023-10-19 20:24:35 +03:00
Em Sharnoff	2c8741a5ed	vm-monitor: Log full error on message handling failure (#5604 ) There's currently an issue with the vm-monitor on staging that's not really feasible to debug because the current display impl gives no context to the errors (just says "failed to downscale"). Logging the full error should help. For communications with the autoscaler-agent, it's ok to only provide the outermost cause, because we can cross-reference with the VM logs. At some point in the future, we may want to change that.	2023-10-19 18:10:33 +02:00
Shany Pozin	893b7bac9a	Fix neon_extra_builds.yml : nproc is not supported in mac os (#5598 ) ## Problem nproc is not supported in mac os, use sysctl -n hw.ncpu instead	2023-10-19 15:24:23 +01:00
Arthur Petukhovsky	66f8f5f1c8	Call walproposer from Rust (#5403 ) Create Rust bindings for C functions from walproposer. This allows to write better tests with real walproposer code without spawning multiple processes and starting up the whole environment. `make walproposer-lib` stage was added to build static libraries `libwalproposer.a`, `libpgport.a`, `libpgcommon.a`. These libraries can be statically linked to any executable to call walproposer functions. `libs/walproposer/src/walproposer.rs` contains `test_simple_sync_safekeepers` to test that walproposer can be called from Rust to emulate sync_safekeepers logic. It can also be used as a usage example.	2023-10-19 14:17:15 +01:00
Alexander Bayandin	3a19da1066	build(deps): bump rustix from 0.37.19 to 0.37.25 (#5596 ) ## Problem @dependabot has bumped `rustix` 0.36 version to the latest in https://github.com/neondatabase/neon/pull/5591, but didn't bump 0.37. Also, update all Rust dependencies for `test_runner/pg_clients/rust/tokio-postgres`. Fixes - https://github.com/neondatabase/neon/security/dependabot/39 - https://github.com/neondatabase/neon/security/dependabot/40 ## Summary of changes - `cargo update -p rustix@0.37.19` - Update all dependencies for `test_runner/pg_clients/rust/tokio-postgres`	2023-10-19 13:49:06 +01:00
Conrad Ludgate	572eda44ee	update tokio-postgres (#5597 ) https://github.com/neondatabase/rust-postgres/pull/23	2023-10-19 14:32:19 +02:00
Arpad Müller	b1d6af5ebe	Azure blobs: Simplify error conversion by addition of to_download_error (#5575 ) There is a bunch of duplication and manual Result handling that can be simplified by moving the error conversion into a shared function, using `map_err`, and the question mark operator.	2023-10-19 14:31:09 +02:00
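
The refactoring style described in #5575 above (one shared conversion function, applied with `map_err` and the question mark operator) looks roughly like this. The sketch is an assumption: it converts `std::io::Error` rather than Azure SDK errors, and the `DownloadError` variants and `read_blob` helper are hypothetical. Only the name `to_download_error` comes from the commit title.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical download error type with two cases.
#[derive(Debug)]
enum DownloadError {
    NotFound,
    Other(String),
}

/// Shared conversion, written once instead of repeating a `match`
/// at every call site. Here it maps `io::Error`; the PR maps Azure errors.
fn to_download_error(err: io::Error) -> DownloadError {
    match err.kind() {
        io::ErrorKind::NotFound => DownloadError::NotFound,
        _ => DownloadError::Other(err.to_string()),
    }
}

/// Call site: `map_err` converts the error, `?` propagates it.
fn read_blob(path: &Path) -> Result<Vec<u8>, DownloadError> {
    let bytes = fs::read(path).map_err(to_download_error)?;
    Ok(bytes)
}
```

Each fallible call then costs one `.map_err(to_download_error)?` instead of a hand-written `match` on the `Result`.
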
Arpad Müller	f842b22b90	Add endpoint for querying time info for lsn (#5497 ) ## Problem See #5468. ## Summary of changes Add a new `get_timestamp_of_lsn` endpoint, returning the timestamp associated with the given lsn. Fixes #5468. --------- Co-authored-by: Shany Pozin <shany@neon.tech>	2023-10-19 04:50:49 +02:00
dependabot[bot]	d444d4dcea	build(deps): bump rustix from 0.36.14 to 0.36.16 (#5591 ) Bumps [rustix](https://github.com/bytecodealliance/rustix) from 0.36.14 to 0.36.16. Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2023-10-19 03:43:49 +01:00
Tristan Partin	c8637f3736	Remove specific file references in NOTICE Seems like a burden to update this file with each major release.	2023-10-18 14:58:48 -05:00
John Spray	ecf759be6d	tests: allow-list S3 500 on DeleteObjects key (#5586 ) ## Problem S3 can give us a 500 whenever it likes: when this happens at request level we eat it in `backoff::retry`, but when it happens for a key inside a DeleteObjects request, we log it at warn level. ## Summary of changes Allow-list this class of log message in all tests.	2023-10-18 15:16:58 +00:00
Arthur Petukhovsky	9a9d9eba42	Add test_idle_reconnections	2023-10-18 17:09:26 +03:00
Arseny Sher	1f4805baf8	Remove remnants of num_computes field. Fixes https://github.com/neondatabase/neon/issues/5581	2023-10-18 17:09:26 +03:00
Konstantin Knizhnik	5c88213eaf	Logical replication (#5271 ) ## Problem See https://github.com/neondatabase/company_projects/issues/111 ## Summary of changes Save logical replication files in WAL at compute and include them in basebackup at pageserver. ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Arseny Sher <sher-ars@yandex.ru>	2023-10-18 16:42:22 +03:00
John Spray	607d19f0e0	pageserver: clean up page service Result handling for shutdown/disconnect (#5504 ) ## Problem - QueryError always logged at error severity, even though disconnections are not true errors. - QueryError type is not expressive enough to distinguish actual errors from shutdowns. - In some functions we're returning Ok(()) on shutdown, in others we're returning an error ## Summary of changes - Add QueryError::Shutdown and use it in places we check for cancellation - Adopt consistent Result behavior: disconnects and shutdowns are always QueryError, not ok - Transform shutdown+disconnect errors to Ok(()) at the very top of the task that runs query handler - Use the postgres protocol error code for "admin shutdown" in responses to clients when we are shutting down. Closes: #5517	2023-10-18 13:28:38 +01:00
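
The Result convention summarised in #5504 above can be sketched as follows: inner code always reports shutdown and disconnect as errors, and only the top of the task turns them into `Ok(())`. `QueryError::Shutdown` is named in the commit message; the other variants and both function names are assumptions made for illustration.

```rust
/// Simplified error type. Only `Shutdown` is named in the commit
/// message; `Disconnected` and `Other` are assumed for this sketch.
#[derive(Debug, PartialEq, Eq)]
enum QueryError {
    Disconnected,
    Shutdown,
    Other(String),
}

/// Inner handler: shutdown and disconnect are always reported as
/// errors, never as `Ok(())`.
fn handle_query(cancelled: bool, client_gone: bool) -> Result<(), QueryError> {
    if cancelled {
        return Err(QueryError::Shutdown);
    }
    if client_gone {
        return Err(QueryError::Disconnected);
    }
    Ok(())
}

/// Top of the task: expected terminations become `Ok(())`, so only
/// genuine errors are left to be logged at error severity.
fn finish_task(result: Result<(), QueryError>) -> Result<(), QueryError> {
    match result {
        Err(QueryError::Shutdown) | Err(QueryError::Disconnected) => Ok(()),
        other => other,
    }
}
```

Keeping the translation in one place means intermediate functions never have to decide whether a shutdown counts as success.
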
dependabot[bot]	1fa0478980	build(deps): bump urllib3 from 1.26.17 to 1.26.18 (#5582 )	2023-10-18 12:21:54 +01:00
Christian Schwarz	9da67c4f19	walredo: make request_redo() an async fn (#5559 ) Stacked atop https://github.com/neondatabase/neon/pull/5557 Prep work for https://github.com/neondatabase/neon/pull/5560 These changes have a 2% impact on `bench_walredo`. That's likely because of the `block_on() in the innermost piece of benchmark-only code. So, it doesn't affect production code. The use of closures in the benchmarking code prevents a straightforward conversion of the whole benchmarking code to async. before: ``` $ cargo bench --features testing --bench bench_walredo Compiling pageserver v0.1.0 (/home/cs/src/neon/pageserver) Finished bench [optimized + debuginfo] target(s) in 2m 11s Running benches/bench_walredo.rs (target/release/deps/bench_walredo-d99a324337dead70) Gnuplot not found, using plotters backend short/short/1 time: [26.363 µs 27.451 µs 28.573 µs] Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high mild short/short/2 time: [64.340 µs 64.927 µs 65.485 µs] Found 2 outliers among 100 measurements (2.00%) 2 (2.00%) low mild short/short/4 time: [101.98 µs 104.06 µs 106.13 µs] short/short/8 time: [151.42 µs 152.74 µs 154.03 µs] short/short/16 time: [296.30 µs 297.53 µs 298.88 µs] Found 14 outliers among 100 measurements (14.00%) 10 (10.00%) high mild 4 (4.00%) high severe medium/medium/1 time: [225.12 µs 225.90 µs 226.66 µs] Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) low mild medium/medium/2 time: [490.80 µs 491.64 µs 492.49 µs] Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) low mild medium/medium/4 time: [934.47 µs 936.49 µs 938.52 µs] Found 5 outliers among 100 measurements (5.00%) 3 (3.00%) low mild 1 (1.00%) high mild 1 (1.00%) high severe medium/medium/8 time: [1.8364 ms 1.8412 ms 1.8463 ms] Found 4 outliers among 100 measurements (4.00%) 4 (4.00%) high mild medium/medium/16 time: [3.6694 ms 3.6896 ms 3.7104 ms] ``` after: ``` $ cargo bench --features testing --bench bench_walredo Compiling pageserver v0.1.0 
(/home/cs/src/neon/pageserver) Finished bench [optimized + debuginfo] target(s) in 2m 11s Running benches/bench_walredo.rs (target/release/deps/bench_walredo-d99a324337dead70) Gnuplot not found, using plotters backend short/short/1 time: [28.345 µs 28.529 µs 28.699 µs] change: [-0.2201% +3.9276% +8.2451%] (p = 0.07 > 0.05) No change in performance detected. Found 17 outliers among 100 measurements (17.00%) 4 (4.00%) low severe 5 (5.00%) high mild 8 (8.00%) high severe short/short/2 time: [66.145 µs 66.719 µs 67.274 µs] change: [+1.5467% +2.7605% +3.9927%] (p = 0.00 < 0.05) Performance has regressed. Found 5 outliers among 100 measurements (5.00%) 5 (5.00%) low mild short/short/4 time: [105.51 µs 107.52 µs 109.49 µs] change: [+0.5023% +3.3196% +6.1986%] (p = 0.02 < 0.05) Change within noise threshold. short/short/8 time: [151.90 µs 153.16 µs 154.41 µs] change: [-1.0001% +0.2779% +1.4221%] (p = 0.65 > 0.05) No change in performance detected. short/short/16 time: [297.38 µs 298.26 µs 299.20 µs] change: [-0.2953% +0.2462% +0.7763%] (p = 0.37 > 0.05) No change in performance detected. Found 2 outliers among 100 measurements (2.00%) 2 (2.00%) high mild medium/medium/1 time: [229.76 µs 230.72 µs 231.69 µs] change: [+1.5804% +2.1354% +2.6635%] (p = 0.00 < 0.05) Performance has regressed. medium/medium/2 time: [501.14 µs 502.31 µs 503.64 µs] change: [+1.8730% +2.1709% +2.5199%] (p = 0.00 < 0.05) Performance has regressed. Found 7 outliers among 100 measurements (7.00%) 1 (1.00%) low mild 1 (1.00%) high mild 5 (5.00%) high severe medium/medium/4 time: [954.15 µs 956.74 µs 959.33 µs] change: [+1.7962% +2.1627% +2.4905%] (p = 0.00 < 0.05) Performance has regressed. medium/medium/8 time: [1.8726 ms 1.8785 ms 1.8848 ms] change: [+1.5858% +2.0240% +2.4626%] (p = 0.00 < 0.05) Performance has regressed. 
Found 6 outliers among 100 measurements (6.00%) 1 (1.00%) low mild 3 (3.00%) high mild 2 (2.00%) high severe medium/medium/16 time: [3.7565 ms 3.7746 ms 3.7934 ms] change: [+1.5503% +2.3044% +3.0818%] (p = 0.00 < 0.05) Performance has regressed. Found 3 outliers among 100 measurements (3.00%) 3 (3.00%) high mild ```	2023-10-18 11:23:06 +01:00
Em Sharnoff	16c87b5bda	Bump vm-builder v0.17.12 -> v0.18.1 (#5583 ) Only applicable change was neondatabase/autoscaling#566, updating pgbouncer to 1.21.0 and enabling support for prepared statements.	2023-10-18 11:10:01 +02:00
Em Sharnoff	9fe5cc6a82	vm-monitor: Switch from memory.high to polling memory.stat (#5524 ) tl;dr it's really hard to avoid throttling from memory.high, and it counts tmpfs & page cache usage, so it's also hard to make sense of. In the interest of fixing things quickly with something that should be good enough, this PR switches to instead periodically fetch memory statistics from the cgroup's memory.stat and use that data to determine if and when we should upscale. This PR fixes #5444, which has a lot more detail on the difficulties we've hit with memory.high. This PR also supersedes #5488.	2023-10-17 15:30:40 -07:00


3921 Commits