rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-14 17:02:56 +00:00

Author	SHA1	Message	Date
Christian Schwarz	09e7485004	Merge branch 'problame/merge-getpage-test' into problame/batching-timer	2024-11-21 11:28:12 +01:00
Christian Schwarz	058b35f884	Merge branch 'problame/batching-benchmark' into problame/merge-getpage-test	2024-11-21 11:27:16 +01:00
Christian Schwarz	ff0aa152f1	Merge remote-tracking branch 'origin/main' into problame/batching-benchmark	2024-11-21 11:25:23 +01:00
Christian Schwarz	3375f28990	pytest.approx; https://github.com/neondatabase/neon/pull/9820#discussion_r1850679974	2024-11-21 11:21:50 +01:00
Christian Schwarz	e82deb2ccc	high-resolution CPU usage	2024-11-21 11:16:00 +01:00
Christian Schwarz	fa7ce2ca07	the final choice: async-timer 1.0beta15 with features=["tokio1"]	2024-11-21 11:15:02 +01:00
John Spray	42bda5d632	pageserver: revise metrics lifetime for SecondaryTenant (#9818 ) ## Problem We saw a scale test failure when one shard went secondary->attached->secondary in a short period of time -- the metrics for the shard failed a validation assertion that is meant to ensure the size metric matches the sum of layer sizes in the SecondaryDetail struct. This appears to be due to two SecondaryTenants being alive at the same time -- the first one was shut down but still had its contributions to the metrics. Closes: https://github.com/neondatabase/neon/issues/9628 ## Summary of changes - Refactor code for validating metrics and call it in shutdown as well as during downloads - Move code for dropping per-tenant secondary metrics from drop() into shutdown(), so that once shutdown() completes it is definitely safe to instantiate another SecondaryTenant for the same tenant.	2024-11-21 08:31:24 +00:00
Arpad Müller	59c2c3f8ad	compute_ctl: print OpenTelemetry errors via tracing, not stdout (#9830 ) Before, `OpenTelemetry` errors were printed to stdout/stderr directly, causing one of the few log lines without a timestamp, like: ``` OpenTelemetry trace error occurred. error sending request for url (http://localhost:4318/v1/traces) ``` Now, we print: ``` 2024-11-21T02:24:20.511160Z INFO OpenTelemetry error: error sending request for url (http://localhost:4318/v1/traces) ``` I found this while investigating #9731.	2024-11-21 04:46:01 +00:00
Ivan Efremov	2d6bf176a0	proxy: Refactor http conn pool (#9785 ) - Use the same ConnPoolEntry for http connection pool. - Rename EndpointConnPool to the HttpConnPool. - Narrow clone bound for client Fixes #9284	2024-11-20 19:36:29 +00:00
Vadim Kharitonov	313ebfdb88	[proxy] chore: allow bypassing empty `params` to `/sql` endpoint (#9827 ) ## Problem ``` curl -H "Neon-Connection-String: postgresql://neondb_owner:PASSWORD@ep-autumn-rain-a58lubg0.us-east-2.aws.neon.tech/neondb?sslmode=require" https://ep-autumn-rain-a58lubg0.us-east-2.aws.neon.tech/sql -d '{"query":"SELECT 1","params":[]}' ``` For such a query, I also need to send `params`. Do I really need it? ## Summary of changes I've marked `params` as optional	2024-11-20 19:36:23 +00:00
Arpad Müller	811fab136f	scrubber: allow restricting find_garbage to a partial tenant id prefix (#9814 ) Adds support to the `find_garbage` command to restrict itself to a partial tenant ID prefix, say `a`, and then it only traverses tenants with IDs starting with `a`. One can now pass the `--tenant-id-prefix` parameter. That way, one can shard the `find_garbage` command and make it run in parallel. The PR also does a change of how `remote_storage` first removes trailing `/`s, only to then add them in the listing function. It turns out that this isn't necessary and it prevents the prefix functionality from working. S3 doesn't do this either.	2024-11-20 19:31:02 +00:00
Christian Schwarz	89b6cb8eba	Revert "vanilla tokio based timer impl based on tokio::time::Sleep" This reverts commit `517dda849f`.	2024-11-20 20:17:49 +01:00
Christian Schwarz	c68661dfb3	Revert "undo local modifications to benchmark" This reverts commit `7be13bc5a6`.	2024-11-20 19:53:06 +01:00
Christian Schwarz	517dda849f	vanilla tokio based timer impl based on tokio::time::Sleep	2024-11-20 19:52:47 +01:00
Christian Schwarz	f22ad868cf	Revert "tokio_timerfd::Delay based impl" This reverts commit `fcda7a72c6`.	2024-11-20 19:45:37 +01:00
Christian Schwarz	fcda7a72c6	tokio_timerfd::Delay based impl Performs identically great to the async-timer::Timer features=tokio1 impl Makes sense because it's the same thing that's happening under the hood. https://www.notion.so/neondatabase/benchmarking-notes-143f189e004780c4a630cb5f426e39ba?pvs=4#144f189e004780ea9decc82281f6b8d1	2024-11-20 19:42:00 +01:00
Christian Schwarz	469ce810fc	Revert "async-timer based approach (again, with data)" This reverts commit `689788cbba`.	2024-11-20 19:40:24 +01:00
Christian Schwarz	21866faa8a	Revert "try async-timer 1.0.0-beta15 (still signal-based timers)" This reverts commit `c73e9e40e9`.	2024-11-20 19:37:51 +01:00
Christian Schwarz	cbb5817997	Revert "async-timer 1.0.0-beta15 with features=tokio1" This reverts commit `68550f0f50`.	2024-11-20 19:37:44 +01:00
Vlad Lazar	ee26f09e45	pageserver: remove shard split hard link assertion (#9829 ) ## Problem We were hitting this assertion in debug mode tests sometimes. This case was being hit when the parent shard has no resident layers. For instance, this is the case on split retry where the previous attempt shut down the parent and deleted local state for it. If the logical size calculation does not download some layers before we get to the hardlinking, then the assertion is hit. ## Summary of Changes Remove the assertion. It's fine for the ancestor to not have any resident layers at the time of the split. Closes https://github.com/neondatabase/neon/issues/9412	2024-11-20 18:33:05 +00:00
Christian Schwarz	5f3e6f398c	Revert "try interval-based impl to cross-chec" This reverts commit `721643beed`.	2024-11-20 18:52:55 +01:00
Christian Schwarz	721643beed	try interval-based impl to cross-chec => zero batching https://www.notion.so/neondatabase/benchmarking-notes-143f189e004780c4a630cb5f426e39ba?pvs=4#144f189e00478065a9b3e51726082885	2024-11-20 18:50:48 +01:00
Conrad Ludgate	f36f0068b8	chore(proxy): demote more logs during successful connection attempts (#9828 ) Follow up to #9803 See https://github.com/neondatabase/cloud/issues/14378 In collaboration with @cloneable and @awarus, we sifted through logs and simply demoted some logs to debug. This is not at all finished and there are more logs to review, but we ran out of time in the session we organised. In any slightly more nuanced cases, we didn't touch the log, instead leaving a TODO comment. I've also slightly refactored the sql-over-http body read/length reject code. I can split that into a separate PR. It just felt natural after I switched to `read_body_with_limit` as we discussed during the meet.	2024-11-20 17:50:39 +00:00
Christian Schwarz	68550f0f50	async-timer 1.0.0-beta15 with features=tokio1 Best batching factor so far with no worse degradation of un-batchable workloads than the other candidates. https://www.notion.so/neondatabase/benchmarking-notes-143f189e004780c4a630cb5f426e39ba?pvs=4#144f189e004780c0921fe99e1da0e8c9	2024-11-20 18:41:31 +01:00
Christian Schwarz	c73e9e40e9	try async-timer 1.0.0-beta15 (still signal-based timers) Results unchanged to 0.7.4 https://www.notion.so/neondatabase/benchmarking-notes-143f189e004780c4a630cb5f426e39ba?pvs=4#144f189e004780e18416cc0faf2aca65	2024-11-20 18:32:53 +01:00
John Spray	5ff2f1ee7d	pageserver: enable compaction to proceed while live-migrating (#5397 ) ## Problem Long ago, in #5299 the tenant states for migration were added, but respected only in a coarse-grained way: when hinted not to do deletions, tenants will just avoid doing all GC or compaction. Skipping compaction is not necessary for AttachedMulti, as we will soon become the primary attached location, and it is not a waste of resources to proceed with compaction. Instead, per the RFC (https://github.com/neondatabase/neon/pull/5029/files), deletions should be queued up in this state, and executed later when we switch to AttachedSingle. Avoiding compaction in AttachedMulti can have an operational impact if a tenant is under significant write load, as a long-running migration can result in a large accumulation of delta layers with commensurate impact on read latency. Closes: https://github.com/neondatabase/neon/issues/5396 ## Summary of changes - Add a 'config' part to RemoteTimelineClient so that it can be aware of the mode of the tenant it belongs to, and wire this through for construction + updates - Add a special buffer for delayed deletions, and when in AttachedMulti route deletions here instead of into the main remote client queue. This is drained when transitioning to AttachedSingle. If the tenant is detached or our process dies before then, then these objects are leaked. - As a quality of life improvement, also use the remote timeline client's knowledge of the tenant state to avoid submitting remote consistent LSN updates for validation when in AttachedStale (as we know these will fail)	2024-11-20 17:31:55 +00:00
John Spray	67f5f83edc	pageserver: avoid reading SLRU blocks for GC on shards >0 (#9423 ) ## Problem SLRU blocks, which can add up to several gigabytes, are currently ingested by all shards, multiplying their capacity cost by the shard count and slowing down ingest. We do this because all shards need the SLRU pages to do timestamp->LSN lookup for GC. Related: https://github.com/neondatabase/neon/issues/7512 ## Summary of changes - On non-zero shards, learn the GC offset from shard 0's index instead of calculating it. - Add a test `test_sharding_gc` that exercises this - Do GC in test_pg_regress as a general smoke test that GC functions run (e.g. this would fail if we were using SLRUs we didn't have) In this PR we are still ingesting SLRUs everywhere, but not using them any more. Part 2 PR (https://github.com/neondatabase/neon/pull/9786) makes the change to not store them at all.	2024-11-20 15:56:14 +00:00
Christian Schwarz	7be13bc5a6	undo local modifications to benchmark	2024-11-20 16:00:19 +01:00
John Spray	593e35027a	tests: use fewer pageservers in test_sharding_split_smoke (#9804 ) ## Problem This test uses a gratuitous number of pageservers (16). This works fine when there are plenty of system resources, but causes issues on test runners that have limited resources and run many tests concurrently. Related: https://github.com/neondatabase/neon/issues/9802 ## Summary of changes - Split from 2 shards to 4, instead of 4 to 8 - Don't give every shard a separate pageserver, let two locations share each pageserver. Net result is 4 pageservers instead of 16	2024-11-20 14:57:59 +00:00
Christian Schwarz	689788cbba	async-timer based approach (again, with data) Yep, it's clearly the best one with best batching factor at lowest CPU usage. https://www.notion.so/neondatabase/benchmarking-notes-143f189e004780c4a630cb5f426e39ba?pvs=4#144f189e004780d0a205e081458b46db	2024-11-20 15:36:10 +01:00
Christian Schwarz	f9bf038d2c	Revert "tokio_timerfd::Interval" This reverts commit `12124b28d0`.	2024-11-20 15:25:52 +01:00
Christian Schwarz	12124b28d0	tokio_timerfd::Interval Resolution not high enough to do _any_ batching at 10us or 20us https://www.notion.so/neondatabase/benchmarking-notes-143f189e004780c4a630cb5f426e39ba?pvs=4#144f189e0047800fb74bd8f4ab6cf8e2	2024-11-20 15:25:14 +01:00
Christian Schwarz	1d85bec0ea	Revert "tokio::time::Interval based approach" This reverts commit `81d99704ee`.	2024-11-20 15:13:26 +01:00
Christian Schwarz	81d99704ee	tokio::time::Interval based approach batching at 10us doesn't work well enough, prob the future is ready too soon. batching factor is just 1.5 https://www.notion.so/neondatabase/benchmarking-notes-143f189e004780c4a630cb5f426e39ba?pvs=4#144f189e004780b79c8dd6d007dbb120	2024-11-20 15:13:11 +01:00
Christian Schwarz	f3ed5692ea	Revert "async-timer based approach" This reverts commit `1639b26002`.	2024-11-20 14:49:09 +01:00
Christian Schwarz	1639b26002	async-timer based approach With this, 10us batching timeout works, but it has some other wrinkles: - it uses the signal-based timer APIs instead of going through epoll (=> timerfd) = it needs to make a syscall for each batch, which costs around 1-2us, so, probably significant CPU time wasted on this.	2024-11-20 14:49:01 +01:00
Christian Schwarz	af95320a8c	Revert "Revert "switch back to tokio::time::sleep, to get the numbers"" This reverts commit `aa695b2ad7`.	2024-11-20 14:25:05 +01:00
Christian Schwarz	b299eb19e2	fixup whitespace stuff	2024-11-20 14:23:55 +01:00
Christian Schwarz	88d52b31b7	Merge branch 'problame/batching-benchmark' into problame/merge-getpage-test	2024-11-20 14:23:22 +01:00
Christian Schwarz	aa695b2ad7	Revert "switch back to tokio::time::sleep, to get the numbers" This reverts commit `b9746168ff`.	2024-11-20 14:22:31 +01:00
Christian Schwarz	b695907752	page_service: add benchmark for batching This PR adds a benchmark to demonstrate the effect of server-side getpage request batching added in https://github.com/neondatabase/neon/pull/9321. Refs: - Epic: https://github.com/neondatabase/neon/issues/9376 - Extracted from https://github.com/neondatabase/neon/pull/9792	2024-11-20 14:18:42 +01:00
Christian Schwarz	75041cb61b	bench fixups	2024-11-20 13:59:21 +01:00
Christian Schwarz	e80ce970f7	collect CPU utilization	2024-11-20 13:55:05 +01:00
Christian Schwarz	f2de5b504f	make it a proper benchmark	2024-11-20 13:52:05 +01:00
Folke Behrens	bf7d859a8b	proxy: Rename RequestMonitoring to RequestContext (#9805 ) ## Problem It is called context/ctx everywhere and the Monitoring suffix needlessly confuses with proper monitoring code. ## Summary of changes * Rename RequestMonitoring to RequestContext * Rename RequestMonitoringInner to RequestContextInner	2024-11-20 12:50:36 +00:00
Alexander Bayandin	899933e159	scan_log_for_errors: check that regex is correct (#9815 ) ## Problem I've noticed that we have 2 flaky tests which failed with error: ``` re.error: missing ), unterminated subpattern at position 21 ``` - `test_timeline_archival_chaos` — has been already fixed - `test_sharded_tad_interleaved_after_partial_success` — I didn't manage to find the incorrect regex [Internal link](https://neonprod.grafana.net/goto/yfmVHV7NR?orgId=1) ## Summary of changes - Wrap `re.match` in `try..except` block and print incorrect regex	2024-11-20 12:48:21 +00:00
Alexander Bayandin	46beecacce	CI(benchmarking): route test failures to on-call-qa-staging-stream (#9813 ) ## Problem We want to keep `#on-call-staging-stream` channel close to the prod one and redirect notifications from failing benchmarks to another channel for investigation. ## Summary of changes - Send notifications regarding failures in `benchmarking` job to `#on-call-qa-staging-stream` - Send notifications regarding failures in `periodic_pagebench` job to `#on-call-qa-staging-stream`	2024-11-20 12:23:41 +00:00
Fedor Dikarev	94e4a0e2a0	update macos version for runner (#9817 ) Closes: https://github.com/neondatabase/neon/issues/9816 Run macOS builds on `macos-15`. As `pkg-config` is bundled in the runner image, don't install it with `brew`	2024-11-20 13:04:14 +01:00
Christian Schwarz	b9746168ff	switch back to tokio::time::sleep, to get the numbers => https://www.notion.so/neondatabase/benchmarking-notes-143f189e004780c4a630cb5f426e39ba?pvs=4#144f189e00478054b8a3e325735ffa19 => unacceptable	2024-11-20 12:50:29 +01:00
Christian Schwarz	5cc0059088	parametrize more test	2024-11-20 12:48:47 +01:00

6633 Commits