Optimizing top K using Adrien Grand's ideas

https://jpountz.github.io/2025/08/28/compiled-vs-vectorized-search-engine-edition.html
Fix clippy warnings: deprecated gen_range, manual div_ceil, legacy import (#2860)
2026-03-26 23:20:42 +00:00 · 2026-03-26 15:23:40 -04:00 · 2026-03-26 07:37:26 -04:00 · 2026-03-24 08:02:12 +01:00 · 2026-03-24 02:02:30 +01:00
11 changed files with 765 additions and 34 deletions
--- a/.claude/skills/update-changelog/SKILL.md
+++ b/.claude/skills/update-changelog/SKILL.md
@@ -0,0 +1,87 @@
+---
+name: update-changelog
+description: Update CHANGELOG.md with merged PRs since the last changelog update, categorized by type
+---
+
+# Update Changelog
+
+This skill updates CHANGELOG.md with merged PRs that aren't already listed.
+
+## Step 1: Determine the changelog scope
+
+Read `CHANGELOG.md` to identify the current unreleased version section at the top (e.g., `Tantivy 0.26 (Unreleased)`).
+
+Collect all PR numbers already mentioned in the unreleased section by extracting `#NNNN` references.
+
+## Step 2: Find merged PRs not yet in the changelog
+
+Use `gh` to list recently merged PRs from the upstream repo:
+
+```bash
+gh pr list --repo quickwit-oss/tantivy --state merged --limit 100 --json number,title,author,labels,mergedAt
+```
+
+Filter out any PRs whose number already appears in the unreleased section of the changelog.
+
+## Step 3: Consolidate related PRs
+
+Before categorizing, group PRs that belong to the same logical change. This is critical for producing a clean changelog. Use PR descriptions, titles, cross-references, and the files touched to identify relationships.
+
+**Merge follow-up PRs into the original:**
+- If a PR is a bugfix, refinement, or follow-up to another PR in the same unreleased cycle, combine them into a single changelog entry with multiple `[#N](url)` links.
+- Also consolidate PRs that touch the same feature area even if not explicitly linked — e.g., a PR fixing an edge case in a new API should be folded into the entry for the PR that introduced that API.
+
+**Filter out bugfixes on unreleased features:**
+- If a bugfix PR fixes something introduced by another PR in the **same unreleased version**, it must NOT appear as a separate Bugfixes entry. Instead, silently fold it into the original feature/improvement entry. The changelog should describe the final shipped state, not the development history.
+- To detect this: check if the bugfix PR references or reverts changes from another PR in the same release cycle, or if it touches code that was newly added (not present in the previous release).
+
+## Step 4: Review the actual code diff
+
+**Do not rely on PR titles or descriptions alone.** For every candidate PR, run `gh pr diff <number> --repo quickwit-oss/tantivy` and read the actual changes. PR titles are often misleading — the diff is the source of truth.
+
+**What to look for in the diff:**
+- Does it change observable behavior, public API surface, or performance characteristics?
+- Is the change something a user of the library would notice or need to know about?
+- Could the change break existing code (API changes, removed features)?
+
+**Skip PRs where the diff reveals the change is not meaningful enough for the changelog** — e.g., cosmetic renames, trivial visibility tweaks, test-only changes, etc.
+
+## Step 5: Categorize each PR group
+
+For each PR (or consolidated group) that survived the diff review, determine its category:
+
+- **Bugfixes** — fixes to behavior that existed in the **previous release**. NOT fixes to features introduced in this release cycle.
+- **Features/Improvements** — new features, API additions, new options, improvements that change user-facing behavior or add new capabilities.
+- **Performance** — optimizations, speed improvements, memory reductions. **If a PR adds new API whose primary purpose is enabling a performance optimization, categorize it as Performance, not Features.** The deciding question is: does a user benefit from this because of new functionality, or because things got faster/leaner? For example, a new trait method that exists solely to enable cheaper intersection ordering is Performance, not a Feature.
+
+If a PR doesn't clearly fit any category (e.g., CI-only changes, internal refactors with no user-facing impact, dependency bumps with no behavior change), skip it — not everything belongs in the changelog.
+
+When unclear, use your best judgment or ask the user.
+
+## Step 6: Format entries
+
+Each entry must follow this exact format:
+
+```
+- Description [#NUMBER](https://github.com/quickwit-oss/tantivy/pull/NUMBER)(@author)
+```
+
+Rules:
+- The description should be concise and describe the user-facing change (not the implementation). Describe the final shipped state, not the incremental development steps.
+- Use sub-categories with bold headers when multiple entries relate to the same area (e.g., `- **Aggregation**` with indented entries beneath). Follow the existing grouping style in the changelog.
+- Author is the GitHub username from the PR, prefixed with `@`. For consolidated entries, include all contributing authors.
+- For consolidated PRs, list all PR links in a single entry: `[#100](url) [#110](url)` (see existing entries for examples).
+
+## Step 7: Present changes to the user
+
+Show the user the proposed changelog entries grouped by category **before** editing the file. Ask for confirmation or adjustments.
+
+## Step 8: Update CHANGELOG.md
+
+Insert the new entries into the appropriate sections of the unreleased version block. If a section doesn't exist yet, create it following the order: Bugfixes, Features/Improvements, Performance.
+
+Append new entries at the end of each section (before the next section header or version header).
+
+## Step 9: Verify
+
+Read back the updated unreleased section and display it to the user for final review.
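
The entry format from Step 6 of the skill can be sketched as a small helper. This is only an illustration: `format_entry` and its parameters are hypothetical names, not part of the repository. The output layout follows the format string and the consolidated entries shown in this change (space-separated PR links, then space-separated `@` handles in one pair of parentheses).

```rust
/// Hypothetical helper: renders one changelog entry in the format required by
/// the skill. `authors` are GitHub usernames without the leading `@`.
fn format_entry(description: &str, pr_numbers: &[u32], authors: &[&str]) -> String {
    // One Markdown link per PR, e.g. `[#2763](https://.../pull/2763)`.
    let links: Vec<String> = pr_numbers
        .iter()
        .map(|n| format!("[#{n}](https://github.com/quickwit-oss/tantivy/pull/{n})"))
        .collect();
    // Prefix every author with `@`.
    let handles: Vec<String> = authors.iter().map(|a| format!("@{a}")).collect();
    format!("- {} {}({})", description, links.join(" "), handles.join(" "))
}
```

A consolidated entry is produced by passing several PR numbers and authors to the same call.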
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,3 +1,51 @@
+Tantivy 0.26 (Unreleased)
+================================
+
+## Bugfixes
+- Align float query coercion during search with the columnar coercion rules [#2692](https://github.com/quickwit-oss/tantivy/pull/2692)(@fulmicoton)
+- Fix lenient elastic range queries with trailing closing parentheses [#2816](https://github.com/quickwit-oss/tantivy/pull/2816)(@evance-br)
+- Fix intersection `seek()` advancing below current doc id [#2812](https://github.com/quickwit-oss/tantivy/pull/2812)(@fulmicoton)
+- Fix phrase query prefixed with `*` [#2751](https://github.com/quickwit-oss/tantivy/pull/2751)(@Darkheir)
+- Fix `vint` buffer overflow during index creation [#2778](https://github.com/quickwit-oss/tantivy/pull/2778)(@rebasedming)
+- Fix integer overflow in `ExpUnrolledLinkedList` for large datasets [#2735](https://github.com/quickwit-oss/tantivy/pull/2735)(@mdashti)
+- Fix integer overflow in segment sorting and merge policy truncation [#2846](https://github.com/quickwit-oss/tantivy/pull/2846)(@anaslimem)
+- Fix merging of intermediate aggregation results [#2719](https://github.com/quickwit-oss/tantivy/pull/2719)(@PSeitz)
+- Fix doc counts in term aggregation for multi-valued fields by deduplicating values per document [#2854](https://github.com/quickwit-oss/tantivy/pull/2854)(@nuri-yoo)
+
+## Features/Improvements
+- **Aggregation**
+    - Add filter aggregation [#2711](https://github.com/quickwit-oss/tantivy/pull/2711)(@mdashti)
+    - Add include/exclude filtering for term aggregations [#2717](https://github.com/quickwit-oss/tantivy/pull/2717)(@PSeitz)
+    - Add public accessors for intermediate aggregation results [#2829](https://github.com/quickwit-oss/tantivy/pull/2829)(@congx4)
+    - Replace HyperLogLog++ with Apache DataSketches HLL for cardinality aggregation [#2837](https://github.com/quickwit-oss/tantivy/pull/2837) [#2842](https://github.com/quickwit-oss/tantivy/pull/2842)(@congx4)
+    - Add composite aggregation [#2856](https://github.com/quickwit-oss/tantivy/pull/2856)(@fulmicoton)
+- **Fast Fields**
+    - Add fast field fallback for `TermQuery` when the field is not indexed [#2693](https://github.com/quickwit-oss/tantivy/pull/2693)(@PSeitz-dd)
+    - Add fast field support for `Bytes` values [#2830](https://github.com/quickwit-oss/tantivy/pull/2830)(@mdashti)
+- **Query Parser**
+    - Add support for regexes in the query grammar [#2677](https://github.com/quickwit-oss/tantivy/pull/2677) [#2818](https://github.com/quickwit-oss/tantivy/pull/2818)(@Darkheir)
+    - Deduplicate queries in query parser [#2698](https://github.com/quickwit-oss/tantivy/pull/2698)(@PSeitz-dd)
+- Add erased `SortKeyComputer` for sorting on column types unknown until runtime [#2770](https://github.com/quickwit-oss/tantivy/pull/2770) [#2790](https://github.com/quickwit-oss/tantivy/pull/2790)(@stuhood @PSeitz)
+- Add natural-order-with-none-highest support in `TopDocs::order_by` [#2780](https://github.com/quickwit-oss/tantivy/pull/2780)(@stuhood)
+- Move stemming behind `stemmer` feature flag [#2791](https://github.com/quickwit-oss/tantivy/pull/2791)(@fulmicoton)
+- Make `DeleteMeta`, `AddOperation`, `advance_deletes`, `with_max_doc`, `serializer` module, and `delete_queue` public [#2762](https://github.com/quickwit-oss/tantivy/pull/2762) [#2765](https://github.com/quickwit-oss/tantivy/pull/2765) [#2766](https://github.com/quickwit-oss/tantivy/pull/2766) [#2835](https://github.com/quickwit-oss/tantivy/pull/2835)(@philippemnoel @PSeitz)
+- Make `Language` hashable [#2763](https://github.com/quickwit-oss/tantivy/pull/2763)(@philippemnoel)
+- Improve `space_usage` reporting for JSON fields and columnar data [#2761](https://github.com/quickwit-oss/tantivy/pull/2761)(@PSeitz-dd)
+- Split `Term` into `Term` and `IndexingTerm` [#2744](https://github.com/quickwit-oss/tantivy/pull/2744) [#2750](https://github.com/quickwit-oss/tantivy/pull/2750)(@PSeitz-dd @PSeitz)
+
+## Performance
+- **Aggregation**
+    - Large speed up and memory reduction for nested high cardinality aggregations by using one collector per request instead of one per bucket, and adding `PagedTermMap` for faster medium cardinality term aggregations [#2715](https://github.com/quickwit-oss/tantivy/pull/2715) [#2759](https://github.com/quickwit-oss/tantivy/pull/2759)(@PSeitz @PSeitz-dd)
+    - Optimize low-cardinality term aggregations by using a `Vec` instead of a `HashMap` [#2740](https://github.com/quickwit-oss/tantivy/pull/2740)(@fulmicoton-dd)
+- Optimize `ExistsQuery` for a high number of dynamic columns [#2694](https://github.com/quickwit-oss/tantivy/pull/2694)(@PSeitz-dd)
+- Add lazy scorers to stop score evaluation early when a doc won't reach the top-K threshold [#2726](https://github.com/quickwit-oss/tantivy/pull/2726) [#2777](https://github.com/quickwit-oss/tantivy/pull/2777)(@fulmicoton @stuhood)
+- Add `DocSet::cost()` and use it to order scorers in intersections [#2707](https://github.com/quickwit-oss/tantivy/pull/2707)(@PSeitz)
+- Add `collect_block` support for collector wrappers [#2727](https://github.com/quickwit-oss/tantivy/pull/2727)(@stuhood)
+- Optimize saturated posting lists by replacing them with `AllScorer` in boolean queries [#2745](https://github.com/quickwit-oss/tantivy/pull/2745) [#2760](https://github.com/quickwit-oss/tantivy/pull/2760) [#2774](https://github.com/quickwit-oss/tantivy/pull/2774)(@fulmicoton @mdashti @trinity-1686a)
+- Add `seek_danger` on `DocSet` for more efficient intersections [#2538](https://github.com/quickwit-oss/tantivy/pull/2538) [#2810](https://github.com/quickwit-oss/tantivy/pull/2810)(@PSeitz @stuhood @fulmicoton)
+- Skip column traversal in `RangeDocSet` when query range does not overlap with column bounds [#2783](https://github.com/quickwit-oss/tantivy/pull/2783)(@ChangRui-Ryan)
+- Speed up exclude queries by supporting multiple excluded `DocSet`s without intermediate union [#2825](https://github.com/quickwit-oss/tantivy/pull/2825)(@PSeitz)
+
 Tantivy 0.25
 ================================

--- a/Cargo.toml
+++ b/Cargo.toml
@@ -27,7 +27,7 @@ regex = { version = "1.5.5", default-features = false, features = [
 aho-corasick = "1.0"
 tantivy-fst = "0.5"
 memmap2 = { version = "0.9.0", optional = true }
-lz4_flex = { version = "0.12", default-features = false, optional = true }
+lz4_flex = { version = "0.13", default-features = false, optional = true }
 zstd = { version = "0.13", optional = true, default-features = false }
 tempfile = { version = "3.12.0", optional = true }
 log = "0.4.16"
@@ -64,7 +64,7 @@ query-grammar = { version = "0.25.0", path = "./query-grammar", package = "tanti
 tantivy-bitpacker = { version = "0.9", path = "./bitpacker" }
 common = { version = "0.10", path = "./common/", package = "tantivy-common" }
 tokenizer-api = { version = "0.6", path = "./tokenizer-api", package = "tantivy-tokenizer-api" }
-sketches-ddsketch = { git = "https://github.com/quickwit-oss/rust-sketches-ddsketch.git", rev = "555caf1", features = ["use_serde"] }
+sketches-ddsketch = { version = "0.4", features = ["use_serde"] }
 datasketches = "0.2.0"
 futures-util = { version = "0.3.28", optional = true }
 futures-channel = { version = "0.3.28", optional = true }
--- a/benches/str_search_and_get.rs
+++ b/benches/str_search_and_get.rs
@@ -45,7 +45,7 @@ fn build_shared_indices(num_docs: usize, distribution: &str) -> BenchIndex {
        match distribution {
            "dense_random" => {
                for _doc_id in 0..num_docs {
-                    let suffix = rng.gen_range(0u64..1000u64);
+                    let suffix = rng.random_range(0u64..1000u64);
                    let str_val = format!("str_{:03}", suffix);

                    writer
@@ -71,7 +71,7 @@ fn build_shared_indices(num_docs: usize, distribution: &str) -> BenchIndex {
            }
            "sparse_random" => {
                for _doc_id in 0..num_docs {
-                    let suffix = rng.gen_range(0u64..1000000u64);
+                    let suffix = rng.random_range(0u64..1000000u64);
                    let str_val = format!("str_{:07}", suffix);

                    writer
--- a/columnar/src/block_accessor.rs
+++ b/columnar/src/block_accessor.rs
@@ -58,6 +58,78 @@ impl<T: PartialOrd + Copy + std::fmt::Debug + Send + Sync + 'static + Default>
        }
    }

+    /// Like `fetch_block_with_missing`, but deduplicates (doc_id, value) pairs
+    /// so that each unique value per document is returned only once.
+    ///
+    /// This is necessary for correct document counting in aggregations,
+    /// where multi-valued fields can produce duplicate entries that inflate counts.
+    #[inline]
+    pub fn fetch_block_with_missing_unique_per_doc(
+        &mut self,
+        docs: &[u32],
+        accessor: &Column<T>,
+        missing: Option<T>,
+    ) where
+        T: Ord,
+    {
+        self.fetch_block_with_missing(docs, accessor, missing);
+        if accessor.index.get_cardinality().is_multivalue() {
+            self.dedup_docid_val_pairs();
+        }
+    }
+
+    /// Removes duplicate (doc_id, value) pairs from the caches.
+    ///
+    /// After `fetch_block`, entries are sorted by doc_id, but values within
+    /// the same doc may not be sorted (e.g. `(0,1), (0,2), (0,1)`).
+    /// We group consecutive entries by doc_id, sort values within each group
+    /// if it has more than 2 elements, then deduplicate adjacent pairs.
+    ///
+    /// Skips entirely if no doc_id appears more than once in the block.
+    fn dedup_docid_val_pairs(&mut self)
+    where T: Ord {
+        if self.docid_cache.len() <= 1 {
+            return;
+        }
+
+        // Quick check: if no consecutive doc_ids are equal, no dedup needed.
+        let has_multivalue = self.docid_cache.windows(2).any(|w| w[0] == w[1]);
+        if !has_multivalue {
+            return;
+        }
+
+        // Sort values within each doc_id group so duplicates become adjacent.
+        let mut start = 0;
+        while start < self.docid_cache.len() {
+            let doc = self.docid_cache[start];
+            let mut end = start + 1;
+            while end < self.docid_cache.len() && self.docid_cache[end] == doc {
+                end += 1;
+            }
+            if end - start > 2 {
+                self.val_cache[start..end].sort();
+            }
+            start = end;
+        }
+
+        // Now duplicates are adjacent — deduplicate in place.
+        let mut write = 0;
+        for read in 1..self.docid_cache.len() {
+            if self.docid_cache[read] != self.docid_cache[write]
+                || self.val_cache[read] != self.val_cache[write]
+            {
+                write += 1;
+                if write != read {
+                    self.docid_cache[write] = self.docid_cache[read];
+                    self.val_cache[write] = self.val_cache[read];
+                }
+            }
+        }
+        let new_len = write + 1;
+        self.docid_cache.truncate(new_len);
+        self.val_cache.truncate(new_len);
+    }
+
    #[inline]
    pub fn iter_vals(&self) -> impl Iterator<Item = T> + '_ {
        self.val_cache.iter().cloned()
@@ -163,4 +235,56 @@ mod tests {

        assert_eq!(missing_docs, vec![1, 2, 3, 4, 5]);
    }
+
+    #[test]
+    fn test_dedup_docid_val_pairs_consecutive() {
+        let mut accessor = ColumnBlockAccessor::<u64>::default();
+        accessor.docid_cache = vec![0, 0, 2, 3];
+        accessor.val_cache = vec![10, 10, 10, 10];
+        accessor.dedup_docid_val_pairs();
+        assert_eq!(accessor.docid_cache, vec![0, 2, 3]);
+        assert_eq!(accessor.val_cache, vec![10, 10, 10]);
+    }
+
+    #[test]
+    fn test_dedup_docid_val_pairs_non_consecutive() {
+        // (0,1), (0,2), (0,1) — duplicate value not adjacent
+        let mut accessor = ColumnBlockAccessor::<u64>::default();
+        accessor.docid_cache = vec![0, 0, 0];
+        accessor.val_cache = vec![1, 2, 1];
+        accessor.dedup_docid_val_pairs();
+        assert_eq!(accessor.docid_cache, vec![0, 0]);
+        assert_eq!(accessor.val_cache, vec![1, 2]);
+    }
+
+    #[test]
+    fn test_dedup_docid_val_pairs_multi_doc() {
+        // doc 0: values [3, 1, 3], doc 1: values [5, 5]
+        let mut accessor = ColumnBlockAccessor::<u64>::default();
+        accessor.docid_cache = vec![0, 0, 0, 1, 1];
+        accessor.val_cache = vec![3, 1, 3, 5, 5];
+        accessor.dedup_docid_val_pairs();
+        assert_eq!(accessor.docid_cache, vec![0, 0, 1]);
+        assert_eq!(accessor.val_cache, vec![1, 3, 5]);
+    }
+
+    #[test]
+    fn test_dedup_docid_val_pairs_no_duplicates() {
+        let mut accessor = ColumnBlockAccessor::<u64>::default();
+        accessor.docid_cache = vec![0, 0, 1];
+        accessor.val_cache = vec![1, 2, 3];
+        accessor.dedup_docid_val_pairs();
+        assert_eq!(accessor.docid_cache, vec![0, 0, 1]);
+        assert_eq!(accessor.val_cache, vec![1, 2, 3]);
+    }
+
+    #[test]
+    fn test_dedup_docid_val_pairs_single_element() {
+        let mut accessor = ColumnBlockAccessor::<u64>::default();
+        accessor.docid_cache = vec![0];
+        accessor.val_cache = vec![1];
+        accessor.dedup_docid_val_pairs();
+        assert_eq!(accessor.docid_cache, vec![0]);
+        assert_eq!(accessor.val_cache, vec![1]);
+    }
 }
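
The deduplication above can be tried outside of `ColumnBlockAccessor` with a minimal standalone sketch. It assumes two parallel vectors with `doc_ids` sorted ascending, as the doc comment in the diff describes. The function name `dedup_pairs` is hypothetical, and unlike the code in the diff it sorts every group unconditionally instead of only groups with more than two elements.

```rust
/// Removes duplicate (doc_id, value) pairs from two parallel vectors.
/// Assumes `doc_ids` is sorted ascending; values within a doc may be unsorted.
fn dedup_pairs(doc_ids: &mut Vec<u32>, vals: &mut Vec<u64>) {
    assert_eq!(doc_ids.len(), vals.len());
    if doc_ids.len() <= 1 {
        return;
    }
    // Sort the values inside each run of equal doc ids so duplicates become adjacent.
    let mut start = 0;
    while start < doc_ids.len() {
        let mut end = start + 1;
        while end < doc_ids.len() && doc_ids[end] == doc_ids[start] {
            end += 1;
        }
        vals[start..end].sort_unstable();
        start = end;
    }
    // Compact in place: keep a pair only if it differs from the last kept pair.
    let mut write = 0;
    for read in 1..doc_ids.len() {
        if doc_ids[read] != doc_ids[write] || vals[read] != vals[write] {
            write += 1;
            doc_ids[write] = doc_ids[read];
            vals[write] = vals[read];
        }
    }
    doc_ids.truncate(write + 1);
    vals.truncate(write + 1);
}
```

With doc ids `[0, 0, 0, 1, 1]` and values `[3, 1, 3, 5, 5]`, the same input as `test_dedup_docid_val_pairs_multi_doc`, the vectors become `[0, 0, 1]` and `[1, 3, 5]`.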
--- a/src/aggregation/bucket/composite/calendar_interval.rs
+++ b/src/aggregation/bucket/composite/calendar_interval.rs
@@ -54,8 +54,6 @@ fn month_bucket_using_time_crate(timestamp_ns: i64) -> Result<i64, time::Error>

 #[cfg(test)]
 mod tests {
-    use std::i64;
-
    use time::format_description::well_known::Iso8601;
    use time::UtcDateTime;

--- a/src/aggregation/bucket/composite/mod.rs
+++ b/src/aggregation/bucket/composite/mod.rs
@@ -533,7 +533,7 @@ mod tests {
        let expected_buckets_vec = expected_buckets.as_array().unwrap();

        for page_size in 1..=expected_buckets_vec.len() {
-            let page_count = (expected_buckets_vec.len() + page_size - 1) / page_size;
+            let page_count = expected_buckets_vec.len().div_ceil(page_size);
            let mut after_key = None;
            for page_idx in 0..page_count {
                let mut agg_req_json = json!({
@@ -565,7 +565,7 @@ mod tests {
                        "expected after_key on all but last page"
                    );
                    after_key = Some(res["my_composite"]["after_key"].clone());
-                } else if let Some(_) = res["my_composite"].get("after_key") {
+                } else if res["my_composite"].get("after_key").is_some() {
                    // currently we sometime have an after_key on the last page,
                    // check that the next "page" is empty
                    let agg_req_json = json!({
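
For reference, the `div_ceil` replacement in this hunk computes the same page count as the manual formula. The sketch below uses hypothetical function names and assumes a non-zero `page_size`. One practical difference: the manual form can overflow when `total + page_size - 1` exceeds `usize::MAX`, which `usize::div_ceil` avoids.

```rust
/// Number of pages needed to hold `total` items at `page_size` items per page.
/// `page_size` must be non-zero.
fn page_count(total: usize, page_size: usize) -> usize {
    total.div_ceil(page_size)
}

/// The manual formulation that clippy's `manual_div_ceil` lint flags.
/// Equivalent as long as `total + page_size - 1` does not overflow.
fn page_count_manual(total: usize, page_size: usize) -> usize {
    (total + page_size - 1) / page_size
}
```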
--- a/src/aggregation/bucket/term_agg.rs
+++ b/src/aggregation/bucket/term_agg.rs
@@ -807,11 +807,13 @@ impl<TermMap: TermAggregationMap, C: SubAggCache> SegmentAggregationCollector

        let req_data = &mut self.terms_req_data;

-        agg_data.column_block_accessor.fetch_block_with_missing(
-            docs,
-            &req_data.accessor,
-            req_data.missing_value_for_accessor,
-        );
+        agg_data
+            .column_block_accessor
+            .fetch_block_with_missing_unique_per_doc(
+                docs,
+                &req_data.accessor,
+                req_data.missing_value_for_accessor,
+            );

        if let Some(sub_agg) = &mut self.sub_agg {
            let term_buckets = &mut self.parent_buckets[parent_bucket_id as usize];
@@ -2347,7 +2349,7 @@ mod tests {

        // text field
        assert_eq!(res["my_texts"]["buckets"][0]["key"], "Hello Hello");
-        assert_eq!(res["my_texts"]["buckets"][0]["doc_count"], 5);
+        assert_eq!(res["my_texts"]["buckets"][0]["doc_count"], 4);
        assert_eq!(res["my_texts"]["buckets"][1]["key"], "Empty");
        assert_eq!(res["my_texts"]["buckets"][1]["doc_count"], 2);
        assert_eq!(
@@ -2356,7 +2358,7 @@ mod tests {
        );
        // text field with number as missing fallback
        assert_eq!(res["my_texts2"]["buckets"][0]["key"], "Hello Hello");
-        assert_eq!(res["my_texts2"]["buckets"][0]["doc_count"], 5);
+        assert_eq!(res["my_texts2"]["buckets"][0]["doc_count"], 4);
        assert_eq!(res["my_texts2"]["buckets"][1]["key"], 1337.0);
        assert_eq!(res["my_texts2"]["buckets"][1]["doc_count"], 2);
        assert_eq!(
@@ -2370,7 +2372,7 @@ mod tests {
        assert_eq!(res["my_ids"]["buckets"][0]["key"], 1337.0);
        assert_eq!(res["my_ids"]["buckets"][0]["doc_count"], 4);
        assert_eq!(res["my_ids"]["buckets"][1]["key"], 1.0);
-        assert_eq!(res["my_ids"]["buckets"][1]["doc_count"], 3);
+        assert_eq!(res["my_ids"]["buckets"][1]["doc_count"], 2);
        assert_eq!(res["my_ids"]["buckets"][2]["key"], serde_json::Value::Null);

        Ok(())
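
The block-max pruning idea behind `block_wand_intersection` (added in `src/query/boolean_query/block_wand_intersection.rs` in this change) can be illustrated with a toy model. This is a simplified sketch under stated assumptions, not tantivy's implementation: posting lists are plain sorted vectors of `(doc_id, score)` with non-negative scores, there is one secondary list, the secondary's upper bound is its global maximum rather than a per-block maximum, and only the top-1 result is tracked. All names (`Postings`, `BLOCK`, `best_in_intersection`) are hypothetical.

```rust
/// Toy posting list: (doc_id, score) pairs sorted by doc_id, scores >= 0.
type Postings = Vec<(u32, f32)>;

/// Hypothetical block size for this sketch (the real code uses 128-doc windows).
const BLOCK: usize = 4;

/// Largest score in a slice of postings (0.0 for an empty slice).
fn max_score(entries: &[(u32, f32)]) -> f32 {
    entries.iter().map(|&(_, s)| s).fold(0.0, f32::max)
}

/// Returns the best-scoring doc present in both lists, plus the number of
/// leader blocks that were skipped without looking at their documents.
fn best_in_intersection(
    leader: &Postings,
    secondary: &Postings,
) -> (Option<(u32, f32)>, usize) {
    let secondary_max = max_score(secondary);
    let mut best: Option<(u32, f32)> = None;
    let mut threshold = f32::MIN;
    let mut skipped_blocks = 0;

    for block in leader.chunks(BLOCK) {
        // Block-level pruning: even the best doc of this block cannot win.
        if max_score(block) + secondary_max <= threshold {
            skipped_blocks += 1;
            continue;
        }
        for &(doc, leader_score) in block {
            // Doc-level pruning: skip the membership check if this doc cannot win.
            if leader_score + secondary_max <= threshold {
                continue;
            }
            // Intersection membership check in the secondary list.
            if let Ok(idx) = secondary.binary_search_by_key(&doc, |&(d, _)| d) {
                let total = leader_score + secondary[idx].1;
                if total > threshold {
                    threshold = total;
                    best = Some((doc, total));
                }
            }
        }
    }
    (best, skipped_blocks)
}
```

Once a high-scoring match raises the threshold, later blocks whose upper bound falls at or below it are skipped entirely, which is the effect the block-level check in the diff aims for.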
--- a/src/query/boolean_query/block_wand_intersection.rs
+++ b/src/query/boolean_query/block_wand_intersection.rs
@@ -0,0 +1,418 @@
+use crate::query::term_query::TermScorer;
+use crate::query::Scorer;
+use crate::{DocId, DocSet, Score, TERMINATED};
+
+/// Block-max pruning for top-K over intersection of term scorers.
+///
+/// Uses the least-frequent term as "leader" to define 128-doc processing windows.
+/// For each window, the sum of block_max_scores is compared to the current threshold;
+/// if the block can't beat it, the entire block is skipped.
+///
+/// Within non-skipped blocks, individual documents are pruned by checking whether
+/// leader_score + sum(secondary block_max_scores) can exceed the threshold before
+/// performing the expensive intersection membership check (seeking into secondary scorers).
+///
+/// # Preconditions
+/// - `scorers` has at least 2 elements
+/// - All scorers read frequencies (`FreqReadingOption::ReadFreq`)
+pub fn block_wand_intersection(
+    mut scorers: Vec<TermScorer>,
+    mut threshold: Score,
+    callback: &mut dyn FnMut(DocId, Score) -> Score,
+) {
+    assert!(scorers.len() >= 2);
+
+    // Sort by size hint (ascending). scorers[0] becomes the "leader" (rarest term).
+    scorers.sort_by_key(TermScorer::size_hint);
+
+    let (leader, secondaries) = scorers.split_first_mut().unwrap();
+
+    // Precompute global max scores for early termination checks.
+    let secondaries_global_max_sum: Score = secondaries.iter().map(|s| s.max_score()).sum();
+
+    // Early exit: no document can possibly beat the threshold.
+    if leader.max_score() + secondaries_global_max_sum <= threshold {
+        return;
+    }
+
+    let mut doc = leader.doc();
+    if doc == TERMINATED {
+        return;
+    }
+
+    loop {
+        // --- Phase 1: Block-level pruning ---
+        //
+        // Position all skip readers on the block containing `doc`.
+        // seek_block is cheap: it only advances the skip reader, no block decompression.
+        leader.seek_block(doc);
+        let leader_block_max: Score = leader.block_max_score();
+
+        // Compute the window end as the minimum last_doc_in_block across all scorers.
+        // This ensures the block_max values are valid for all docs in [doc, window_end].
+        // Different scorers have independently aligned blocks, so we must use the
+        // smallest window where all block_max values hold.
+        let mut window_end: DocId = leader.last_doc_in_block();
+
+        let mut secondary_block_max_sum: Score = 0.0;
+        for secondary in secondaries.iter_mut() {
+            secondary.seek_block(doc);
+            secondary_block_max_sum += secondary.block_max_score();
+            window_end = window_end.min(secondary.last_doc_in_block());
+        }
+
+        if leader_block_max + secondary_block_max_sum <= threshold {
+            // The entire window cannot beat the threshold. Skip past it.
+            if window_end == TERMINATED {
+                return;
+            }
+            doc = window_end + 1;
+            continue;
+        }
+
+        // --- Phase 2: Doc-level processing within the window ---
+        //
+        // Load the leader's block and iterate through its documents up to window_end.
+        doc = leader.seek(doc);
+        if doc == TERMINATED {
+            return;
+        }
+
+        'next_doc: while doc <= window_end {
+            let leader_score: Score = leader.score();
+
+            // Doc-level pruning: can leader_score + best possible secondary contribution
+            // beat the threshold?
+            if leader_score + secondary_block_max_sum <= threshold {
+                doc = leader.advance();
+                if doc == TERMINATED {
+                    return;
+                }
+                continue;
+            }
+
+            // Check intersection membership in secondaries.
+            let mut total_score: Score = leader_score;
+            for secondary in secondaries.iter_mut() {
+                // seek() requires target >= self.doc(). If the secondary is already
+                // past `doc` from a previous seek, this doc is not in the intersection.
+                let secondary_doc = secondary.doc();
+                let seek_result = if secondary_doc <= doc {
+                    secondary.seek(doc)
+                } else {
+                    secondary_doc
+                };
+                if seek_result != doc {
+                    doc = leader.advance();
+                    if doc == TERMINATED {
+                        return;
+                    }
+                    continue 'next_doc;
+                }
+                total_score += secondary.score();
+            }
+
+            // All secondaries matched.
+            if total_score > threshold {
+                threshold = callback(doc, total_score);
+
+                // Re-check global early termination after threshold update.
+                if leader.max_score() + secondaries_global_max_sum <= threshold {
+                    return;
+                }
+            }
+
+            doc = leader.advance();
+            if doc == TERMINATED {
+                return;
+            }
+        }
+        // `doc` is now past window_end but not TERMINATED.
+        // Loop back to Phase 1 with this new doc.
+    }
+}
+
+#[cfg(test)]
+mod tests {
+    use std::cmp::Ordering;
+    use std::collections::BinaryHeap;
+
+    use proptest::prelude::*;
+
+    use crate::query::term_query::TermScorer;
+    use crate::query::{Bm25Weight, Scorer};
+    use crate::{DocId, DocSet, Score, TERMINATED};
+
+    struct Float(Score);
+
+    impl Eq for Float {}
+
+    impl PartialEq for Float {
+        fn eq(&self, other: &Self) -> bool {
+            self.cmp(other) == Ordering::Equal
+        }
+    }
+
+    impl PartialOrd for Float {
+        fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
+            Some(self.cmp(other))
+        }
+    }
+
+    impl Ord for Float {
+        fn cmp(&self, other: &Self) -> Ordering {
+            other.0.partial_cmp(&self.0).unwrap_or(Ordering::Equal)
+        }
+    }
+
+    fn nearly_equals(left: Score, right: Score) -> bool {
+        (left - right).abs() < 0.0001 * (left + right).abs()
+    }
+
+    /// Run block_wand_intersection and collect (doc, score) pairs above threshold.
+    fn compute_checkpoints_block_wand_intersection(
+        term_scorers: Vec<TermScorer>,
+        top_k: usize,
+    ) -> Vec<(DocId, Score)> {
+        let mut heap: BinaryHeap<Float> = BinaryHeap::with_capacity(top_k);
+        let mut checkpoints: Vec<(DocId, Score)> = Vec::new();
+        let mut limit: Score = 0.0;
+
+        let callback = &mut |doc, score| {
+            heap.push(Float(score));
+            if heap.len() > top_k {
+                heap.pop().unwrap();
+            }
+            if heap.len() == top_k {
+                limit = heap.peek().unwrap().0;
+            }
+            if !nearly_equals(score, limit) {
+                checkpoints.push((doc, score));
+            }
+            limit
+        };
+
+        super::block_wand_intersection(term_scorers, Score::MIN, callback);
+        checkpoints
+    }
+
+    /// Naive baseline: intersect by iterating all docs.
+    fn compute_checkpoints_naive_intersection(
+        mut term_scorers: Vec<TermScorer>,
+        top_k: usize,
+    ) -> Vec<(DocId, Score)> {
+        let mut heap: BinaryHeap<Float> = BinaryHeap::with_capacity(top_k);
+        let mut checkpoints: Vec<(DocId, Score)> = Vec::new();
+        let mut limit = Score::MIN;
+
+        // Sort by cost to use the cheapest as driver.
+        term_scorers.sort_by_key(|s| s.cost());
+
+        let (leader, secondaries) = term_scorers.split_first_mut().unwrap();
+
+        let mut doc = leader.doc();
+        while doc != TERMINATED {
+            let mut all_match = true;
+            for secondary in secondaries.iter_mut() {
+                let secondary_doc = secondary.doc();
+                let seek_result = if secondary_doc <= doc {
+                    secondary.seek(doc)
+                } else {
+                    secondary_doc
+                };
+                if seek_result != doc {
+                    all_match = false;
+                    break;
+                }
+            }
+
+            if all_match {
+                let score: Score =
+                    leader.score() + secondaries.iter_mut().map(|s| s.score()).sum::<Score>();
+
+                if score > limit {
+                    heap.push(Float(score));
+                    if heap.len() > top_k {
+                        heap.pop().unwrap();
+                    }
+                    if heap.len() == top_k {
+                        limit = heap.peek().unwrap().0;
+                    }
+                    if !nearly_equals(score, limit) {
+                        checkpoints.push((doc, score));
+                    }
+                }
+            }
+            doc = leader.advance();
+        }
+        checkpoints
+    }
+
+    const MAX_TERM_FREQ: u32 = 100u32;
+
+    fn posting_list(max_doc: u32) -> BoxedStrategy<Vec<(DocId, u32)>> {
+        (1..max_doc + 1)
+            .prop_flat_map(move |doc_freq| {
+                (
+                    proptest::bits::bitset::sampled(doc_freq as usize, 0..max_doc as usize),
+                    proptest::collection::vec(1u32..MAX_TERM_FREQ, doc_freq as usize),
+                )
+            })
+            .prop_map(|(docset, term_freqs)| {
+                docset
+                    .iter()
+                    .map(|doc| doc as u32)
+                    .zip(term_freqs.iter().cloned())
+                    .collect::<Vec<_>>()
+            })
+            .boxed()
+    }
+
+    #[expect(clippy::type_complexity)]
+    fn gen_term_scorers(num_scorers: usize) -> BoxedStrategy<(Vec<Vec<(DocId, u32)>>, Vec<u32>)> {
+        (1u32..100u32)
+            .prop_flat_map(move |max_doc: u32| {
+                (
+                    proptest::collection::vec(posting_list(max_doc), num_scorers),
+                    proptest::collection::vec(2u32..10u32 * MAX_TERM_FREQ, max_doc as usize),
+                )
+            })
+            .boxed()
+    }
+
+    fn test_block_wand_intersection_aux(posting_lists: &[Vec<(DocId, u32)>], fieldnorms: &[u32]) {
+        // Repeat docs 64 times to create multi-block scenarios, matching block_wand.rs test
+        // strategy.
+        const REPEAT: usize = 64;
+        let fieldnorms_expanded: Vec<u32> = fieldnorms
+            .iter()
+            .cloned()
+            .flat_map(|fieldnorm| std::iter::repeat_n(fieldnorm, REPEAT))
+            .collect();
+
+        let postings_lists_expanded: Vec<Vec<(DocId, u32)>> = posting_lists
+            .iter()
+            .map(|posting_list| {
+                posting_list
+                    .iter()
+                    .cloned()
+                    .flat_map(|(doc, term_freq)| {
+                        (0_u32..REPEAT as u32).map(move |offset| {
+                            (
+                                doc * (REPEAT as u32) + offset,
+                                if offset == 0 { term_freq } else { 1 },
+                            )
+                        })
+                    })
+                    .collect::<Vec<(DocId, u32)>>()
+            })
+            .collect();
+
+        let total_fieldnorms: u64 = fieldnorms_expanded
+            .iter()
+            .cloned()
+            .map(|fieldnorm| fieldnorm as u64)
+            .sum();
+        let average_fieldnorm = (total_fieldnorms as Score) / (fieldnorms_expanded.len() as Score);
+        let max_doc = fieldnorms_expanded.len();
+
+        let make_scorers = || -> Vec<TermScorer> {
+            postings_lists_expanded
+                .iter()
+                .map(|postings| {
+                    let bm25_weight = Bm25Weight::for_one_term(
+                        postings.len() as u64,
+                        max_doc as u64,
+                        average_fieldnorm,
+                    );
+                    TermScorer::create_for_test(postings, &fieldnorms_expanded[..], bm25_weight)
+                })
+                .collect()
+        };
+
+        for top_k in 1..4 {
+            let checkpoints_optimized =
+                compute_checkpoints_block_wand_intersection(make_scorers(), top_k);
+            let checkpoints_naive = compute_checkpoints_naive_intersection(make_scorers(), top_k);
+            assert_eq!(
+                checkpoints_optimized.len(),
+                checkpoints_naive.len(),
+                "Mismatch in checkpoint count for top_k={top_k}"
+            );
+            for (&(left_doc, left_score), &(right_doc, right_score)) in
+                checkpoints_optimized.iter().zip(checkpoints_naive.iter())
+            {
+                assert_eq!(left_doc, right_doc);
+                assert!(
+                    nearly_equals(left_score, right_score),
+                    "Score mismatch for doc {left_doc}: {left_score} vs {right_score}"
+                );
+            }
+        }
+    }
+
+    proptest! {
+        #![proptest_config(ProptestConfig::with_cases(500))]
+        #[test]
+        fn test_block_wand_intersection_two_scorers(
+            (posting_lists, fieldnorms) in gen_term_scorers(2)
+        ) {
+            test_block_wand_intersection_aux(&posting_lists[..], &fieldnorms[..]);
+        }
+    }
+
+    proptest! {
+        #![proptest_config(ProptestConfig::with_cases(500))]
+        #[test]
+        fn test_block_wand_intersection_three_scorers(
+            (posting_lists, fieldnorms) in gen_term_scorers(3)
+        ) {
+            test_block_wand_intersection_aux(&posting_lists[..], &fieldnorms[..]);
+        }
+    }
+
+    #[test]
+    fn test_block_wand_intersection_disjoint() {
+        // Two posting lists with no overlap — intersection is empty.
+        let fieldnorms: Vec<u32> = vec![10; 200];
+        let average_fieldnorm = 10.0;
+        let postings_a: Vec<(DocId, u32)> = (0..100).map(|d| (d, 1)).collect();
+        let postings_b: Vec<(DocId, u32)> = (100..200).map(|d| (d, 1)).collect();
+
+        let scorer_a = TermScorer::create_for_test(
+            &postings_a,
+            &fieldnorms,
+            Bm25Weight::for_one_term(100, 200, average_fieldnorm),
+        );
+        let scorer_b = TermScorer::create_for_test(
+            &postings_b,
+            &fieldnorms,
+            Bm25Weight::for_one_term(100, 200, average_fieldnorm),
+        );
+
+        let checkpoints = compute_checkpoints_block_wand_intersection(vec![scorer_a, scorer_b], 10);
+        assert!(checkpoints.is_empty());
+    }
+
+    #[test]
+    fn test_block_wand_intersection_all_overlap() {
+        // Two posting lists with full overlap.
+        let fieldnorms: Vec<u32> = vec![10; 50];
+        let average_fieldnorm = 10.0;
+        let postings: Vec<(DocId, u32)> = (0..50).map(|d| (d, 3)).collect();
+
+        let make_scorer = || {
+            TermScorer::create_for_test(
+                &postings,
+                &fieldnorms,
+                Bm25Weight::for_one_term(50, 50, average_fieldnorm),
+            )
+        };
+
+        let checkpoints_opt =
+            compute_checkpoints_block_wand_intersection(vec![make_scorer(), make_scorer()], 5);
+        let checkpoints_naive =
+            compute_checkpoints_naive_intersection(vec![make_scorer(), make_scorer()], 5);
+        assert_eq!(checkpoints_opt.len(), checkpoints_naive.len());
+    }
+}
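The baseline used by these tests can be sketched without any tantivy types. The snippet below is a minimal illustration, assuming posting lists are plain sorted `(doc, score)` vectors; `Postings`, `MinScore` and `naive_top_k_checkpoints` are hypothetical names, not part of the diff. It drives the intersection from the shortest list and records a checkpoint each time a document enters the top-K heap. Unlike the test helper above, it does not filter out scores that are nearly equal to the new threshold.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// Hypothetical stand-in for a posting list: `(doc, score)` pairs sorted by doc.
type Postings = Vec<(u32, f32)>;

/// Wrapper giving `f32` a total order, reversed so that `BinaryHeap`
/// (a max-heap) keeps the smallest score on top.
#[derive(Clone, Copy)]
struct MinScore(f32);

impl Ord for MinScore {
    fn cmp(&self, other: &Self) -> Ordering {
        other.0.total_cmp(&self.0)
    }
}
impl PartialOrd for MinScore {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl PartialEq for MinScore {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for MinScore {}

/// Intersects all lists and returns every `(doc, score)` that entered the top-K heap.
fn naive_top_k_checkpoints(mut lists: Vec<Postings>, top_k: usize) -> Vec<(u32, f32)> {
    let mut checkpoints = Vec::new();
    if lists.is_empty() || top_k == 0 {
        return checkpoints;
    }
    // The shortest list drives the intersection.
    lists.sort_by_key(|list| list.len());
    let (leader, secondaries) = lists.split_first().unwrap();
    let mut cursors = vec![0usize; secondaries.len()];
    let mut heap = BinaryHeap::with_capacity(top_k + 1);
    let mut limit = f32::MIN;

    'docs: for &(doc, leader_score) in leader {
        let mut score = leader_score;
        for (list, cursor) in secondaries.iter().zip(cursors.iter_mut()) {
            // Advance this list to the first entry with a doc id >= `doc`.
            while *cursor < list.len() && list[*cursor].0 < doc {
                *cursor += 1;
            }
            match list.get(*cursor) {
                Some(&(d, s)) if d == doc => score += s,
                _ => continue 'docs, // Missing from one list: not in the intersection.
            }
        }
        if score > limit {
            heap.push(MinScore(score));
            if heap.len() > top_k {
                heap.pop();
            }
            if heap.len() == top_k {
                // The smallest retained score becomes the new threshold.
                limit = heap.peek().unwrap().0;
            }
            checkpoints.push((doc, score));
        }
    }
    checkpoints
}
```

Comparing the checkpoint sequence of a pruning implementation against a baseline like this one is what lets the property tests above assert on documents and scores rather than only on the final top-K.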
--- a/src/query/boolean_query/boolean_weight.rs
+++ b/src/query/boolean_query/boolean_weight.rs
@@ -16,6 +16,7 @@ use crate::{DocId, Score};

 enum SpecializedScorer {
    TermUnion(Vec<TermScorer>),
+    TermIntersection(Vec<TermScorer>),
    Other(Box<dyn Scorer>),
 }

@@ -93,6 +94,13 @@ fn into_box_scorer<TScoreCombiner: ScoreCombiner>(
                BufferedUnionScorer::build(term_scorers, score_combiner_fn, num_docs);
            Box::new(union_scorer)
        }
+        SpecializedScorer::TermIntersection(term_scorers) => {
+            let boxed_scorers: Vec<Box<dyn Scorer>> = term_scorers
+                .into_iter()
+                .map(|s| Box::new(s) as Box<dyn Scorer>)
+                .collect();
+            intersect_scorers(boxed_scorers, num_docs)
+        }
        SpecializedScorer::Other(scorer) => scorer,
    }
 }
@@ -297,14 +305,43 @@ impl<TScoreCombiner: ScoreCombiner> BooleanWeight<TScoreCombiner> {
                // Result depends entirely on MUST + any removed AllScorers.
                let combined_all_scorer_count = must_special_scorer_counts.num_all_scorers
                    + should_special_scorer_counts.num_all_scorers;
-                let boxed_scorer: Box<dyn Scorer> = effective_must_scorer(
-                    must_scorers,
-                    combined_all_scorer_count,
-                    reader.max_doc(),
-                    num_docs,
-                )
-                .unwrap_or_else(|| Box::new(EmptyScorer));
-                SpecializedScorer::Other(boxed_scorer)
+
+                // Try to detect a pure TermScorer intersection for block-max optimization.
+                // Preconditions: no removed AllScorers, at least 2 scorers, all TermScorer
+                // with frequency reading enabled.
+                if combined_all_scorer_count == 0
+                    && must_scorers.len() >= 2
+                    && must_scorers.iter().all(|s| s.is::<TermScorer>())
+                {
+                    let term_scorers: Vec<TermScorer> = must_scorers
+                        .into_iter()
+                        .map(|s| *(s.downcast::<TermScorer>().map_err(|_| ()).unwrap()))
+                        .collect();
+                    if term_scorers
+                        .iter()
+                        .all(|s| s.freq_reading_option() == FreqReadingOption::ReadFreq)
+                    {
+                        SpecializedScorer::TermIntersection(term_scorers)
+                    } else {
+                        let must_scorers: Vec<Box<dyn Scorer>> = term_scorers
+                            .into_iter()
+                            .map(|s| Box::new(s) as Box<dyn Scorer>)
+                            .collect();
+                        let boxed_scorer: Box<dyn Scorer> =
+                            effective_must_scorer(must_scorers, 0, reader.max_doc(), num_docs)
+                                .unwrap_or_else(|| Box::new(EmptyScorer));
+                        SpecializedScorer::Other(boxed_scorer)
+                    }
+                } else {
+                    let boxed_scorer: Box<dyn Scorer> = effective_must_scorer(
+                        must_scorers,
+                        combined_all_scorer_count,
+                        reader.max_doc(),
+                        num_docs,
+                    )
+                    .unwrap_or_else(|| Box::new(EmptyScorer));
+                    SpecializedScorer::Other(boxed_scorer)
+                }
            }
            (ShouldScorersCombinationMethod::Optional(should_scorer), must_scorers) => {
                // Optional SHOULD: contributes to scoring but not required for matching.
@@ -463,15 +500,21 @@ impl<TScoreCombiner: ScoreCombiner + Sync> Weight for BooleanWeight<TScoreCombin
        callback: &mut dyn FnMut(DocId, Score),
    ) -> crate::Result<()> {
        let scorer = self.complex_scorer(reader, 1.0, &self.score_combiner_fn)?;
+        let num_docs = reader.num_docs();
        match scorer {
            SpecializedScorer::TermUnion(term_scorers) => {
-                let mut union_scorer = BufferedUnionScorer::build(
-                    term_scorers,
-                    &self.score_combiner_fn,
-                    reader.num_docs(),
-                );
+                let mut union_scorer =
+                    BufferedUnionScorer::build(term_scorers, &self.score_combiner_fn, num_docs);
                for_each_scorer(&mut union_scorer, callback);
            }
+            SpecializedScorer::TermIntersection(term_scorers) => {
+                let mut intersection = into_box_scorer(
+                    SpecializedScorer::TermIntersection(term_scorers),
+                    &self.score_combiner_fn,
+                    num_docs,
+                );
+                for_each_scorer(intersection.as_mut(), callback);
+            }
            SpecializedScorer::Other(mut scorer) => {
                for_each_scorer(scorer.as_mut(), callback);
            }
@@ -485,17 +528,23 @@ impl<TScoreCombiner: ScoreCombiner + Sync> Weight for BooleanWeight<TScoreCombin
        callback: &mut dyn FnMut(&[DocId]),
    ) -> crate::Result<()> {
        let scorer = self.complex_scorer(reader, 1.0, || DoNothingCombiner)?;
+        let num_docs = reader.num_docs();
        let mut buffer = [0u32; COLLECT_BLOCK_BUFFER_LEN];

        match scorer {
            SpecializedScorer::TermUnion(term_scorers) => {
-                let mut union_scorer = BufferedUnionScorer::build(
-                    term_scorers,
-                    &self.score_combiner_fn,
-                    reader.num_docs(),
-                );
+                let mut union_scorer =
+                    BufferedUnionScorer::build(term_scorers, &self.score_combiner_fn, num_docs);
                for_each_docset_buffered(&mut union_scorer, &mut buffer, callback);
            }
+            SpecializedScorer::TermIntersection(term_scorers) => {
+                let mut intersection = into_box_scorer(
+                    SpecializedScorer::TermIntersection(term_scorers),
+                    DoNothingCombiner::default,
+                    num_docs,
+                );
+                for_each_docset_buffered(intersection.as_mut(), &mut buffer, callback);
+            }
            SpecializedScorer::Other(mut scorer) => {
                for_each_docset_buffered(scorer.as_mut(), &mut buffer, callback);
            }
@@ -524,6 +573,9 @@ impl<TScoreCombiner: ScoreCombiner + Sync> Weight for BooleanWeight<TScoreCombin
            SpecializedScorer::TermUnion(term_scorers) => {
                super::block_wand(term_scorers, threshold, callback);
            }
+            SpecializedScorer::TermIntersection(term_scorers) => {
+                super::block_wand_intersection(term_scorers, threshold, callback);
+            }
            SpecializedScorer::Other(mut scorer) => {
                for_each_pruning_scorer(scorer.as_mut(), threshold, callback);
            }
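The detection logic in `complex_scorer` follows a common pattern: check that every boxed trait object is one concrete type, then downcast the whole vector. The sketch below illustrates that pattern with `std::any::Any` only; `FastScorer`, `Specialized` and `specialize` are hypothetical stand-ins, not tantivy types. As a design variation it checks both preconditions up front with `downcast_ref`, so the fallback path never has to re-box scorers that were already downcast.

```rust
use std::any::Any;

/// Hypothetical stand-in for `TermScorer`; `has_freqs` mimics the
/// `FreqReadingOption::ReadFreq` precondition.
#[derive(Debug, PartialEq)]
struct FastScorer {
    id: u32,
    has_freqs: bool,
}

/// Hypothetical counterpart of `SpecializedScorer`.
enum Specialized {
    /// Every scorer had the concrete type and reads frequencies.
    Fast(Vec<FastScorer>),
    /// Anything else stays boxed and takes the generic path.
    Generic(Vec<Box<dyn Any>>),
}

fn specialize(scorers: Vec<Box<dyn Any>>) -> Specialized {
    // Inspect by reference first, so nothing is consumed unless all checks pass.
    let eligible = scorers.len() >= 2
        && scorers.iter().all(|scorer| {
            scorer
                .downcast_ref::<FastScorer>()
                .is_some_and(|fast| fast.has_freqs)
        });
    if !eligible {
        return Specialized::Generic(scorers);
    }
    let fast = scorers
        .into_iter()
        .map(|scorer| *scorer.downcast::<FastScorer>().expect("type checked above"))
        .collect();
    Specialized::Fast(fast)
}
```

Whether this single-pass check is preferable in the real code depends on how cheap the equivalent by-reference downcast is for `Box<dyn Scorer>`; that is an assumption here, not something the diff establishes.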
--- a/src/query/boolean_query/mod.rs
+++ b/src/query/boolean_query/mod.rs
@@ -1,8 +1,10 @@
 mod block_wand;
+mod block_wand_intersection;
 mod boolean_query;
 mod boolean_weight;

 pub(crate) use self::block_wand::{block_wand, block_wand_single_scorer};
+pub(crate) use self::block_wand_intersection::block_wand_intersection;
 pub use self::boolean_query::BooleanQuery;
 pub use self::boolean_weight::BooleanWeight;
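For readers unfamiliar with block-max pruning on conjunctions, the idea can be sketched in a few lines. This is a deliberately simplified illustration and not the code of `block_wand_intersection`: it assumes two lists of sorted `(doc, score)` pairs, tracks only the single best hit, and bounds the second list by its global maximum rather than by per-block maxima. `BLOCK`, `max_score` and `best_match_with_block_pruning` are hypothetical names.

```rust
/// Hypothetical block size; real posting blocks are larger.
const BLOCK: usize = 4;

/// Largest score in a slice of `(doc, score)` entries.
fn max_score(entries: &[(u32, f32)]) -> f32 {
    entries.iter().map(|&(_, s)| s).fold(f32::MIN, f32::max)
}

/// Returns the best-scoring doc present in both lists, plus the number of
/// leader blocks that were skipped without looking at any of their docs.
fn best_match_with_block_pruning(
    leader: &[(u32, f32)],
    other: &[(u32, f32)],
) -> (Option<(u32, f32)>, usize) {
    // Coarse upper bound for the second list: its global maximum score.
    let other_max = max_score(other);
    let mut best: Option<(u32, f32)> = None;
    let mut skipped_blocks = 0;
    let mut cursor = 0;

    for block in leader.chunks(BLOCK) {
        let threshold = best.map_or(f32::MIN, |(_, score)| score);
        // No doc in this block can beat the current best: skip it entirely.
        if max_score(block) + other_max <= threshold {
            skipped_blocks += 1;
            continue;
        }
        for &(doc, leader_score) in block {
            // Advance the second list to the first entry with doc id >= `doc`.
            while cursor < other.len() && other[cursor].0 < doc {
                cursor += 1;
            }
            if let Some(&(d, s)) = other.get(cursor) {
                if d == doc {
                    let score = leader_score + s;
                    if best.map_or(true, |(_, b)| score > b) {
                        best = Some((doc, score));
                    }
                }
            }
        }
    }
    (best, skipped_blocks)
}
```

The bound is sound because a document's score can never exceed the sum of the per-list maxima over the ranges it falls in. Tighter bounds skip more blocks but cost more to compute.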
Author	SHA1	Message	Date
Paul Masurel	4bdbc013ba	Optimizing top K using Adrien Grand's ideas https://jpountz.github.io/2025/08/28/compiled-vs-vectorized-search-engine-edition.html	2026-03-26 15:23:40 -04:00
Charlie Tonneslan	a9535156b1	Fix clippy warnings: deprecated gen_range, manual div_ceil, legacy import (#2860) - Replace deprecated rand::Rng::gen_range with random_range in benchmarks - Use usize::div_ceil instead of manual (len + size - 1) / size - Remove unused legacy std::i64 import - Replace 'if let Some(_)' with '.is_some()'	2026-03-26 07:37:26 -04:00
PSeitz	993ef97814	update CHANGELOG for tantivy 0.26 release (#2857) * update CHANGELOG for tantivy 0.26 release * add CHANGELOG skill Signed-off-by: Pascal Seitz <pascal.seitz@gmail.com> * update CHANGELOG, add CHANGELOG skill Signed-off-by: Pascal Seitz <pascal.seitz@gmail.com> * use sketches from crates.io * update lz4_flex * update CHANGELOG.md --------- Signed-off-by: Pascal Seitz <pascal.seitz@gmail.com>	2026-03-24 08:02:12 +01:00
nuri	3859cc8699	fix: deduplicate doc counts in term aggregation for multi-valued fields (#2854) * fix: deduplicate doc counts in term aggregation for multi-valued fields Term aggregation was counting term occurrences instead of documents for multi-valued fields. A document with the same value appearing multiple times would inflate doc_count. Add `fetch_block_with_missing_unique_per_doc` to ColumnBlockAccessor that deduplicates (doc_id, value) pairs, and use it in term aggregation. Fixes #2721 * refactor: only deduplicate for multivalue cardinality Duplicates can only occur with multivalue columns, so narrow the check from !is_full() to is_multivalue(). * fix: handle non-consecutive duplicate values in dedup Sort values within each doc_id group before deduplicating, so that non-adjacent duplicates are correctly handled. Add unit tests for dedup_docid_val_pairs: consecutive duplicates, non-consecutive duplicates, multi-doc groups, no duplicates, and single element. * perf: skip dedup when block has no multivalue entries Add early return when no consecutive doc_ids are equal, avoiding unnecessary sort and dedup passes. Remove the 2-element swap optimization as it is not needed by the dedup algorithm. --------- Co-authored-by: nryoo <nryoo@nryooui-MacBookPro.local>	2026-03-24 02:02:30 +01:00