rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-05 21:20:37 +00:00

Author	SHA1	Message	Date
Em Sharnoff	e617a3a075	vm-monitor: Improve error display (#10542 ) Logging errors with the debug format specifier causes multi-line errors, which are sometimes a pain to deal with. Instead, we should use anyhow's alternate display format, which shows the same information on a single line. Also adjusted a couple of error messages that were stale. Fixes neondatabase/cloud#14710.	2025-02-03 13:34:11 +00:00
Fedor Dikarev	23ca8b061b	Use actions/checkout for checkout (#10630 ) ## Problem 1. First of all it's more correct 2. Current usage allows ` Time-of-Check-Time-of-Use (TOCTOU) 'Pwn Request' vulnerabilities`. Please check security slack channel or reach me for more details. I will update PR description after merge. ## Summary of changes 1. Use `actions/checkout` with `ref: ${{ github.event.pull_request.head.sha }}` Discovered by and Co-author: @varunsh-coder	2025-02-03 12:55:48 +00:00
Anastasia Lubennikova	b1bc33eb4d	Fix logical_replication_sync test fixture (#10531 ) Fixes flaky test_lr_with_slow_safekeeper test #10242 Fix query to `pg_catalog.pg_stat_subscription` catalog to handle table synchronization and parallel LR correctly.	2025-02-03 12:44:47 +00:00
OBBO67	b1e451091a	pageserver: clean up references to timeline delete marker, uninit marker (#5718 ) (#10627 ) ## Problem Since [#5580](https://github.com/neondatabase/neon/pull/5580) the delete and uninit file markers are no longer needed. ## Summary of changes Remove the remaining code for the delete and uninit markers. Additionally removes the `ends_with_suffix` function as it is no longer required. Closes [#5718](https://github.com/neondatabase/neon/issues/5718).	2025-02-03 11:54:07 +00:00
Arpad Müller	87ad50c925	storcon: use diesel-async again, now with tls support (#10614 ) Successor of #10280 after it was reverted in #10592. Re-introduce the usage of diesel-async again, but now also add TLS support so that we connect to the storcon database using TLS. By default, diesel-async doesn't support TLS, so add some code to make us explicitly request TLS. cc https://github.com/neondatabase/cloud/issues/23583	2025-02-03 11:53:51 +00:00
Alexander Bayandin	89b9f74077	CI(pre-merge-checks): do not run `conclusion` job for PRs (#10619 ) ## Problem While working on https://github.com/neondatabase/neon/pull/10617 I (unintentionally) merged the PR before the main CI pipeline has finished. I suspect this happens because we have received all the required job results from the pre-merge-checks workflow, which runs on PRs that include changes to relevant files. ## Summary of changes - Skip the `conclusion` job in `pre-merge-checks` workflows for PRs	2025-02-03 09:40:12 +00:00
John Spray	f071800979	tests: stabilize shard locations earlier in test_scrubber_tenant_snapshot (#10606 ) ## Problem This test would sometimes emit unexpected logs from the storage controller's requests to do migrations, which overlap with the test's restarts of pageservers, where those migrations are happening some time after a shard split as the controller moves load around. Example: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10602/13067323736/index.html#testresult/f66f1329557a1fc5/retries ## Summary of changes - Do a reconcile_until_idle after shard split, so that the rest of the test doesn't run concurrently with migrations	2025-02-03 09:02:21 +00:00
Peter Bendel	4dfe60e2ad	revert https://github.com/neondatabase/neon/pull/10616 (#10631 ) ## Problem https://github.com/neondatabase/neon/pull/10616 was only intended temporarily during the weekend; we want to reset to the prior state ## Summary of changes revert https://github.com/neondatabase/neon/pull/10616 but keep fixes in https://github.com/neondatabase/neon/pull/10622	2025-02-03 09:00:23 +00:00
Arpad Müller	8ae6f656a6	Don't require partial backup semaphore capacity for deletions (#10628 ) In the safekeeper, we block deletions on the timeline's gate closing, and any `WalResidentTimeline` keeps the gate open (because it owns a gate lock object). Thus, until the `main_task` function of a partial backup returns, we can't delete the associated timeline. In order to make these tasks exit early, we trigger the cancellation token of the timeline upon its shutdown. However, the partial backup task wasn't looking for the cancellation while waiting to acquire a partial backup permit. On a staging safekeeper we have been in a situation in the past where the semaphore was already empty for a duration of many hours, rendering all attempted deletions unable to proceed until a restart where the semaphore was reset: https://neondb.slack.com/archives/C03H1K0PGKH/p1738416586442029	2025-02-03 04:11:06 +00:00
Peter Bendel	b9e1a67246	fix generate matrix for olap for saturdays (#10622 ) ## Problem when introducing pg17 for job step `Generate matrix for OLAP benchmarks` I introduced a syntax error that only hits on Saturdays. ## Summary of changes Remove trailing comma ## successful test run https://github.com/neondatabase/neon/actions/runs/13086363907	2025-02-01 11:09:45 +00:00
Folke Behrens	6318828c63	Update rust to 1.84.1 (#10618 ) We keep the practice of keeping the compiler up to date, pointing to the latest release. This is done by many other projects in the Rust ecosystem as well. [Release notes](https://releases.rs/docs/1.84.1/). Prior update was in https://github.com/neondatabase/neon/pull/10328. Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2025-01-31 20:52:17 +00:00
Stefan Radig	6dd48ba148	feat(proxy): Implement access control with VPC endpoint checks and block for public internet / VPC (#10143 ) - Wired up filtering on VPC endpoints - Wired up block access from public internet / VPC depending on per project flag - Added cache invalidation for VPC endpoints (partially based on PR from Raphael) - Removed BackendIpAllowlist trait --------- Co-authored-by: Ivan Efremov <ivan@neon.tech>	2025-01-31 20:32:57 +00:00
Conrad Ludgate	ad1a41157a	feat(proxy): optimizing the chances of large write in copy_bidirectional (#10608 ) We forked copy_bidirectional to solve some issues like fast-shutdown (disallowing half-open connections) and to introduce better error tracking (which side of the conn closed down). A change recently made its way upstream offering performance improvements: https://github.com/tokio-rs/tokio/pull/6532. These seem applicable to our fork, thus it makes sense to apply them here as well.	2025-01-31 19:14:27 +00:00
Tristan Partin	fcd195c2b6	Migrate compute_ctl arg parsing to clap derive (#10497 ) The primary benefit is that all the ad hoc get_matches() calls are no longer necessary. Now all it takes to get at the CLI arguments is referencing a struct member. It's also great that we can replace the ad hoc CLI struct we had with this more formal solution. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-01-31 19:04:26 +00:00
Peter Bendel	bc7822d90c	temporarily disable some steps and run more often to expose more pgbench --initialize in benchmarking workflow (#10616 ) ## Problem we want to disable some steps in benchmarking workflow that do not initialize new projects and instead run the test more frequently Test run https://github.com/neondatabase/neon/actions/runs/13077737888	2025-01-31 18:41:17 +00:00
Alexander Bayandin	48c87dc458	CI(pre-merge-checks): fix condition (#10617 ) ## Problem Merge Queue fails if changes include Rust code. ## Summary of changes - Fix condition for `build-build-tools-image` - Add a couple of no-op `false \|\|` to make predicates look symmetric	2025-01-31 18:07:26 +00:00
John Spray	aedeb1c7c2	pageserver: revise logging of cancelled request results (#10604 ) ## Problem When a client dropped before a request completed, and a handler returned an ApiError, we would log that at error severity. That was excessive in the case of a request erroring on a shutdown, and could cause test flakes. example: https://neon-github-public-dev.s3.amazonaws.com/reports/main/13067651123/index.html#suites/ad9c266207b45eafe19909d1020dd987/6021ce86a0d72ae7/ ``` Cancelled request finished with an error: ShuttingDown ``` ## Summary of changes - Log at info level, rather than error, for ShuttingDown and ResourceUnavailable API errors from cancelled requests	2025-01-31 17:43:54 +00:00
John Spray	a93e9f22fc	pageserver: remove faulty debug assertion in compaction (#10610 ) ## Problem This assertion is incorrect: it is legal to see another shard's data at this point, after a shard split. Closes: https://github.com/neondatabase/neon/issues/10609 ## Summary of changes - Remove faulty assertion	2025-01-31 17:43:31 +00:00
JC Grünhage	10cf5e7a38	Move cargo-deny into a separate workflow on a schedule (#10289 ) ## Problem There are two (related) problems with the previous handling of `cargo-deny`: - When a new advisory is added to rustsec that affects a dependency, unrelated pull requests will fail. - New advisories rely on pushes or PRs to be surfaced. Problems that already exist on main will only be found if we try to merge new things into main. ## Summary of changes We split out `cargo-deny` into a separate workflow that runs on all PRs that touch `Cargo.lock`, and on a schedule on `main`, `release`, `release-compute` and `release-proxy` to find new advisories.	2025-01-31 13:42:59 +00:00
Arpad Müller	dce617fe07	Update to rebased rust-postgres (#10584 ) Update to a rebased version of our rust-postgres patches, rebased on [this](`98f5a11bc0`) commit this time. With #10280 reapplied, this means that the rust-postgres crates will be deduplicated, as the new crate versions are finally compatible with the requirements of diesel-async. Earlier update: #10561 rust-postgres PR: https://github.com/neondatabase/rust-postgres/pull/39	2025-01-31 12:40:20 +00:00
Alexander Bayandin	503bc72d31	CI: add `diesel print-schema` check (#10527 ) ## Problem We want to check that `diesel print-schema` doesn't generate any changes (`storage_controller/src/schema.rs`) in comparison with the list of migration. ## Summary of changes - Add `diesel_cli` to `build-tools` image - Add `Check diesel schema` step to `build-neon` job, at this stage we have all required binaries, so don't need to compile anything additionally - Check runs only on x86 release builds to be sure we do it at least once per CI run.	2025-01-31 11:48:46 +00:00
Fedor Dikarev	89cff08354	unify pg-build-nonroot-with-cargo base layer and config retries in curl (#10575 ) Ref: https://github.com/neondatabase/cloud/issues/23461 ## Problem While making changes in this area, I saw that these 2 base layers could be optimised. After a review comment from @myrrc, also set up timeouts and retries in the `alpine/curl` image. ## Summary of changes	2025-01-31 11:46:33 +00:00
Erik Grinaker	afbcebe7f7	test_runner: force-compact in `test_sharding_autosplit` (#10605 ) ## Problem This test may not fully detect data corruption during splits, since we don't force-compact the entire keyspace. ## Summary of changes Force-compact all data in `test_sharding_autosplit`.	2025-01-31 11:31:58 +00:00
Arpad Müller	7d5c70c717	Update AWS SDK crates (#10588 ) We want to keep the AWS SDK up to date as that way we benefit from new developments and improvements. Prior update was in #10056	2025-01-31 11:23:12 +00:00
John Spray	f09cfd11cb	pageserver: exclude archived timelines from freeze+flush on shutdown (#10594 ) ## Problem If offloading races with normal shutdown, we get a "failed to freeze and flush: cannot flush frozen layers when flush_loop is not running, state is Exited". This is harmless but points to it being quite strange to try and freeze and flush such a timeline. flushing on shutdown for an archived timeline isn't useful. Related: https://github.com/neondatabase/neon/issues/10389 ## Summary of changes - During Timeline::shutdown, ignore ShutdownMode::FreezeAndFlush if the timeline is archived	2025-01-31 10:54:14 +00:00
Arseny Sher	765ba43438	Allow pageserver unreachable errors in test_scrubber_tenant_snapshot (#10585 ) ## Problem test_scrubber_tenant_snapshot restarts pageservers, but log validation fails tests on any non white listed storcon warnings, making the test flaky. ## Summary of changes Allow warns like 2025-01-29T12:37:42.622179Z WARN reconciler{seq=1 tenant_id=2011077aea9b4e8a60e8e8a19407634c shard_id=0004}: Call to node 2 (localhost:15352) management API failed, will retry (attempt 1): receive body: error sending request for url (http://localhost:15352/v1/tenant/2011077aea9b4e8a60e8e8a19407634c-0004/location_config): client error (Connect) ref https://github.com/neondatabase/neon/issues/10462	2025-01-31 10:33:24 +00:00
Folke Behrens	6041a93591	Update tokio base crates (#10556 ) Update `tokio` base crates and their deps. Pin `tokio` to at least 1.41 which stabilized task ID APIs. To dedup `mio` dep the `notify` crate is updated. It's used in `compute_tools`. `9f81828429/compute_tools/src/pg_helpers.rs (L258-L367)`	2025-01-31 09:54:31 +00:00
Conrad Ludgate	738bf83583	chore: replace dashmap with clashmap (#10582 ) ## Problem Because dashmap 6 switched to hashbrown RawTable API, it required us to use unsafe code in the upgrade: https://github.com/neondatabase/neon/pull/8107 ## Summary of changes Switch to clashmap, a fork maintained by me which removes much of the unsafe and ultimately switches to HashTable instead of RawTable to remove much of the unsafe requirement on us.	2025-01-31 09:53:43 +00:00
Anna Stepanyan	423e239617	[infra/notes] impr: add issue types to issue templates (#10018 ) refs #0000 --------- Co-authored-by: Fedor Dikarev <fedor@neon.tech>	2025-01-31 06:29:06 +00:00
Heikki Linnakangas	df87a55609	tests: Speed up test_pgdata_import_smoke on Postgres v17 (#10567 ) The test runs this query: select count(*), sum(data::bigint)::bigint from t to validate the test results between each part of the test. It performs a simple sequential scan and aggregation, but was taking an order of magnitude longer on v17 than on previous Postgres versions, which sometimes caused the test to time out. There were two reasons for that: 1. On v17, the planner estimates the table to have only one row. In reality it has 305790 rows, and older versions estimated it at 611580, which is not too bad given that the table has not been analyzed so the planner bases that estimate just on the number of pages and the widths of the datatypes. The new estimate of 1 row is much worse, and it leads the planner to disregard parallel plans, whereas on older versions you got a Parallel Seq Scan. I tracked this down to upstream commit 29cf61ade3, "Consider fillfactor when estimating relation size". With that commit, table_block_relation_estimate_size() function calculates that each page accommodates less than 1 row when the fillfactor is taken into account, which rounds down to 0. In reality, the executor will always place at least one row on a page regardless of fillfactor, but the new estimation formula doesn't take that into account. I reported this to pgsql-hackers (https://www.postgresql.org/message-id/2bf9d973-7789-4937-a7ca-0af9fb49c71e%40iki.fi), we don't need to do anything more about it in neon. It's OK to not use parallel scans here; once issue 2. below is addressed, the queries are fast enough without parallelism. 2. On v17, prefetching was not happening for the sequential scan. That's because starting with v17, buffers are reserved in the shared buffer cache before prefetching is initiated, and we use a tiny shared_buffers=1MB setting in the tests. The prefetching is effectively disabled with such a small shared_buffers setting, to protect the system from completely starving out of buffers. To address that, simply bump up shared_buffers in the test. This patch addresses the second issue, which is enough to fix the problem.	2025-01-30 22:55:17 +00:00
John Spray	5e0c40709f	storcon: refine chaos selection logic (#10600 ) ## Problem In https://github.com/neondatabase/neon/pull/10438 it was pointed out that it would be good to avoid picking tenants in ID order, and also to avoid situations where we might double-select the same tenant. There was an initial swing at this in https://github.com/neondatabase/neon/pull/10443, where Chi suggested a simpler approach which is done in this PR ## Summary of changes - Split total set of tenants into in and out of home AZ - Consume out of home AZ first, and if necessary shuffle + consume from in home AZ	2025-01-30 22:45:43 +00:00
John Spray	e1273acdb1	pageserver: handle shutdown cleanly in layer download API (#10598 ) ## Problem This API is used in tests and occasionally for support. It cast all errors to 500. That can cause a failure on the log checks: https://neon-github-public-dev.s3.amazonaws.com/reports/main/13056992876/index.html#suites/ad9c266207b45eafe19909d1020dd987/683a7031d877f3db/ ## Summary of changes - Avoid using generic anyhow::Error for layer downloads - Map shutdown cases to 503 in http route	2025-01-30 22:43:36 +00:00
John Spray	d18f6198e1	storcon: fix AZ-driven tenant selection in chaos (#10443 ) ## Problem In https://github.com/neondatabase/neon/pull/10438 I had got the function for picking tenants backwards, and it was preferring to move things _away_ from their preferred AZ. ## Summary of changes - Fix condition in `is_attached_outside_preferred_az`	2025-01-30 22:17:07 +00:00
John Spray	6da7c556c2	pageserver: fix race cleaning up timeline files when shut down during bootstrap (#10532 ) ## Problem Timeline bootstrap starts a flush loop, but doesn't reliably shut down the timeline (incl. waiting for flush loop to exit) before destroying UninitializedTimeline, and that destructor tries to clean up local storage. If local storage is still being written to, then this is unsound. Currently the symptom is that we see a "Directory not empty" error log, e.g. https://neon-github-public-dev.s3.amazonaws.com/reports/main/12966756686/index.html#testresult/5523f7d15f46f7f7/retries ## Summary of changes - Move fallible IO part of bootstrap into a function (notably, this is fallible in the case of the tenant being shut down while creation is happening) - When that function returns an error, call shutdown() on the timeline	2025-01-30 20:33:22 +00:00
a-masterov	bf6d5e93ba	Run tests of the contrib extensions (#10392 ) ## Problem We don't test the extensions shipped with contrib ## Summary of changes The tests are now running	2025-01-30 19:32:35 +00:00
Arpad Müller	4d2c2e9460	Revert "storcon: switch to diesel-async and tokio-postgres (#10280 )" (#10592 ) There was a regression of #10280, tracked in [#23583](https://github.com/neondatabase/cloud/issues/23583). I have ideas how to fix the issue, but we are too close to the release cutoff, so revert #10280 for now. We can revert the revert later :).	2025-01-30 19:23:25 +00:00
John Spray	bae0de643e	tests: relax constraints on test_timeline_archival_chaos (#10595 ) ## Problem The test asserts that it completes at least 10 full timeline lifecycles, but the noisy CI environment sometimes doesn't meet that goal. Related: https://github.com/neondatabase/neon/issues/10389 ## Summary of changes - Sleep for longer between pageserver restarts, so that the timeline workers have more chance to make progress - Sleep for shorter between retries from timeline worker, so that they have better chance to get in while a pageserver is up between restarts - Relax the success condition to complete at least 5 iterations instead of 10	2025-01-30 19:22:59 +00:00
Cheng Chen	8293b252b2	chore(compute): pg_mooncake v0.1.1 (#10578 ) ## Problem Upgrade pg_mooncake to v0.1.1 ## Summary of changes https://github.com/Mooncake-Labs/pg_mooncake/blob/main/CHANGELOG.md#011-2025-01-29	2025-01-30 18:33:25 +00:00
Peter Bendel	6c8fc909d6	Benchmarking PostgreSQL17: for OLAP need specific connstr secrets (#10587 ) ## Problem for OLAP benchmarks we need specific connstr secrets with different database names for each job step This is a follow-up for https://github.com/neondatabase/neon/pull/10536 In previous PR we used a common GitHub secret for a shared re-use project that has 4 databases: neondb, tpch, clickbench and userexamples. [Failure example](https://neon-github-public-dev.s3.amazonaws.com/reports/main/13044872855/index.html#suites/54d0af6f403f1d8611e8894c2e07d023/fc029330265e9f6e/): ```log # /tmp/neon/pg_install/v17/bin/psql user=neondb_owner dbname=neondb host=ep-broad-brook-w2luwzzv.us-east-2.aws.neon.build sslmode=require options='-cstatement_timeout=0 ' -c -- $ID$ -- TPC-H/TPC-R Pricing Summary Report Query (Q1) -- Functional Query Definition -- Approved February 1998 ... ERROR: relation "lineitem" does not exist ``` ## Summary of changes We need dedicated GitHub secrets and dedicated connection strings for each of the use cases. ## Test run https://github.com/neondatabase/neon/actions/runs/13053968231	2025-01-30 16:41:46 +00:00
Heikki Linnakangas	efe42db264	tests: test_pgdata_import_smoke requires the 'testing' cargo feature (#10569 ) It took me ages to figure out why it was failing on my laptop. What I saw was that when the test makes the 'import_pgdata' request in the pageserver, the pageserver actually performs a regular 'bootstrap' timeline creation by running initdb, with no importing. It boiled down to the json request that the test uses: ``` { "new_timeline_id": str(timeline_id), "import_pgdata": { "idempotency_key": str(idempotency), "location": {"LocalFs": {"path": str(importbucket.absolute())}}, }, }, ``` and how serde deserializes into rust structs. The 'LocalFs' enum variant in `models.rs` is gated on the 'testing' cargo feature. On a non-testing build, that got deserialized into the default Bootstrap enum variant, as a valid TimelineCreateRequestModeImportPgdata variant could not be formed. PS. IMHO we should get rid of the testing feature, compile in all the functionality, and have a runtime flag to disable anything dangerous. With that, you would've gotten a nice "feature only enabled in testing mode" error in this case, or the test would've simply worked. But that's another story.	2025-01-30 16:11:26 +00:00
Alex Chi Z.	cf6dee946e	fix(pageserver): gc-compaction race with read (#10543 ) ## Problem close https://github.com/neondatabase/neon/issues/10482 ## Summary of changes Add an extra lock on the read path to protect against races. The read path has an implication that only certain kinds of compactions can be performed. Garbage keys must first have an image layer covering the range, and then be gc-ed -- they cannot be done in one operation. An alternative to fix this is to move the layers read guard to be acquired at the beginning of `get_vectored_reconstruct_data_timeline`, but that was intentionally optimized out and I don't want to regress. The race is not limited to image layers. Gc-compaction will consolidate deltas automatically and produce a flat delta layer (i.e., when we have retain_lsns below the gc-horizon). The same race would also cause behaviors like getting an un-replayable key history as in https://github.com/neondatabase/neon/issues/10049. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-01-30 15:25:29 +00:00
Alexey Kondratov	be51b10da7	chore(compute): Print some compute_ctl errors in debug mode (#10586 ) ## Problem In some cases, we were returning a very shallow error like `error sending request for url (XXX)`, which made it very hard to figure out the actual error. ## Summary of changes Use `{:?}` in a few places, and remove it from places where we were printing a string anyway.	2025-01-30 14:31:49 +00:00
Arpad Müller	93714c4c7b	secondary downloader: load metadata on loading of timeline (#10539 ) Related to #10308, we might have legitimate changes in file size or generation. Those changes should not cause warn log lines. In order to detect changes of the generation number while the file size stayed the same, load the metadata that we store on disk on loading of the timeline. Still do a comparison with the on-disk layer sizes to find any discrepancies that might occur due to race conditions (new metadata file gets written but layer file has not been updated yet, and PS shuts down). However, as it's possible to hit it in a race condition, downgrade it to a warning. Also fix a mistake in #10529: we want to compare the old with the new metadata, not the old metadata with itself.	2025-01-30 12:03:36 +00:00
John Spray	ab627ad9fd	storcon_cli: fix spurious error setting preferred AZ (#10568 ) ## Problem The client code for `tenant-set-preferred-az` declared response type `()`, so printed a spurious error on each use: ``` Error: receive body: error decoding response body: invalid type: map, expected unit at line 1 column 0 ``` The requests were successful anyway. ## Summary of changes - Declare the proper return type, so that the command succeeds quietly.	2025-01-30 11:54:02 +00:00
Erik Grinaker	6a2afa0c02	pageserver: add per-timeline read amp histogram (#10566 ) ## Problem We don't have per-timeline observability for read amplification. Touches https://github.com/neondatabase/cloud/issues/23283. ## Summary of changes Add a per-timeline `pageserver_layers_per_read` histogram. NB: per-timeline histograms are expensive, but probably worth it in this case.	2025-01-30 11:24:49 +00:00
Alexander Bayandin	8804d58943	Nightly Benchmarks: use pgbench from artifacts (#10370 ) We don't use statically linked OpenSSL anymore (#10302), it's ok to switch to Neon's pgbench for pgvector benchmarks	2025-01-30 11:18:07 +00:00
Erik Grinaker	d3db96c211	pageserver: add `pageserver_deltas_per_read_global` metric (#10570 ) ## Problem We suspect that Postgres checkpoints will limit the number of page deltas necessary to reconstruct a page, but don't know for certain. Touches https://github.com/neondatabase/cloud/issues/23283. ## Summary of changes Add `pageserver_deltas_per_read_global` metric. This pairs with `pageserver_layers_per_read_global` from #10573.	2025-01-30 10:55:07 +00:00
Erik Grinaker	b24727134c	pageserver: improve read amp metric (#10573 ) ## Problem The current global `pageserver_layers_visited_per_vectored_read_global` metric does not appear to accurately measure read amplification. It divides the layer count by the number of reads in a batch, but this means that e.g. 10 reads with 100 L0 layers will only measure a read amp of 10 per read, while the actual read amp was 100. While the cost of layer visits is amortized across the batch, and some layers may not intersect with a given key, each visited layer contributes directly to the observed latency for every read in the batch, which is what we care about. Touches https://github.com/neondatabase/cloud/issues/23283. Extracted from #10566. ## Summary of changes * Count the number of layers visited towards each read in the batch, instead of the average across the batch. * Rename `pageserver_layers_visited_per_vectored_read_global` to `pageserver_layers_per_read_global`. * Reduce the read amp log warning threshold down from 512 to 100.	2025-01-30 09:27:40 +00:00
Alexander Lakhin	a7a706cff7	Fix submodule reference after #10473 (#10577 )	2025-01-30 09:09:43 +00:00
Alex Chi Z.	77ea9b16fe	fix(pageserver): use the larger one of upper limit and threshold (#10571 ) ## Problem Follow up of https://github.com/neondatabase/neon/pull/10550 in case the upper limit is set larger than threshold. It does not make sense for someone to enforce the behavior like "if there are >= 50 L0s, only compact 10 of them". ## Summary of changes Use the maximum of compaction threshold and upper limit when selecting L0 files to compact. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-01-30 00:05:40 +00:00
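
The rule described in 77ea9b16fe (#10571), "use the maximum of compaction threshold and upper limit when selecting L0 files to compact", can be sketched roughly as below. This is only an illustration under stated assumptions, not the pageserver's code: the function names `l0_compaction_limit` and `select_l0_layers` and both parameter names are hypothetical, and layers are modelled as plain slice elements.

```rust
/// Hypothetical helper: how many L0 layers one compaction pass may pick up.
///
/// `threshold` is the L0 count that triggers compaction and `upper_limit`
/// is the configured cap per pass (both names are assumptions).
/// Taking the maximum means a pass triggered at `threshold` layers is
/// never capped below `threshold`.
pub fn l0_compaction_limit(threshold: usize, upper_limit: usize) -> usize {
    threshold.max(upper_limit)
}

/// Hypothetical selection step: take at most the computed limit from the
/// front of the candidate list, or everything if fewer layers exist.
pub fn select_l0_layers<T>(layers: &[T], threshold: usize, upper_limit: usize) -> &[T] {
    let limit = l0_compaction_limit(threshold, upper_limit).min(layers.len());
    &layers[..limit]
}
```

With a threshold of 50 and an upper limit of 10, the case the commit message calls out, the sketch selects up to 50 layers rather than 10.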

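The difference between the old and new read amplification metrics in b24727134c (#10573) can be made concrete with a small model. The sketch below is an assumption-laden illustration, not the pageserver implementation: the `Read` type and the functions `batch_average` and `per_read` are hypothetical names, and each read is assumed to list every distinct layer visited on its behalf.

```rust
use std::collections::BTreeSet;

/// Hypothetical model of one read in a vectored batch: the IDs of the
/// layers visited for it (assumed to contain no duplicates).
pub struct Read {
    pub layers_visited: Vec<u32>,
}

/// Old-style metric as the commit message describes it: distinct layers
/// visited by the batch, divided by the number of reads in the batch.
pub fn batch_average(batch: &[Read]) -> f64 {
    if batch.is_empty() {
        return 0.0;
    }
    let distinct: BTreeSet<u32> = batch
        .iter()
        .flat_map(|r| r.layers_visited.iter().copied())
        .collect();
    distinct.len() as f64 / batch.len() as f64
}

/// New-style metric: one observation per read, counting every layer
/// visited towards that read.
pub fn per_read(batch: &[Read]) -> Vec<usize> {
    batch.iter().map(|r| r.layers_visited.len()).collect()
}
```

For the commit's own example of 10 reads that each visit the same 100 L0 layers, `batch_average` reports 10 while `per_read` reports 100 for every read, which is the latency-relevant number.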

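Commit e617a3a075 (#10542) switches vm-monitor to anyhow's alternate display format (`{:#}`), which prints an error and its causes on one line. To show the idea without depending on anyhow, the sketch below walks the standard `Error::source` chain by hand. It is a stdlib-only approximation, not what the commit does: `ContextError` and `single_line` are hypothetical names introduced for illustration.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical error type: a message plus an optional underlying cause.
#[derive(Debug)]
pub struct ContextError {
    pub context: String,
    pub cause: Option<Box<dyn Error + 'static>>,
}

impl fmt::Display for ContextError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Display shows only this level's message, not the causes.
        f.write_str(&self.context)
    }
}

impl Error for ContextError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        self.cause.as_deref()
    }
}

/// Join an error and all of its causes with ": " so the result fits on
/// a single log line (assuming the individual messages are single-line).
pub fn single_line(err: &(dyn Error + 'static)) -> String {
    let mut out = err.to_string();
    let mut current = err.source();
    while let Some(cause) = current {
        out.push_str(": ");
        out.push_str(&cause.to_string());
        current = cause.source();
    }
    out
}
```

Wrapping a "connection refused" cause in a "failed to connect to monitor" context yields `failed to connect to monitor: connection refused`, in contrast to the multi-line output that debug formatting of an error chain typically produces.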
7088 Commits