Compare commits


4 Commits

PSeitz
993ef97814 update CHANGELOG for tantivy 0.26 release (#2857)
* update CHANGELOG for tantivy 0.26 release

* add CHANGELOG skill

Signed-off-by: Pascal Seitz <pascal.seitz@gmail.com>

* update CHANGELOG, add CHANGELOG skill

Signed-off-by: Pascal Seitz <pascal.seitz@gmail.com>

* use sketches from crates.io

* update lz4_flex

* update CHANGELOG.md

---------

Signed-off-by: Pascal Seitz <pascal.seitz@gmail.com>
2026-03-24 08:02:12 +01:00
nuri
3859cc8699 fix: deduplicate doc counts in term aggregation for multi-valued fields (#2854)
* fix: deduplicate doc counts in term aggregation for multi-valued fields

Term aggregation was counting term occurrences instead of documents
for multi-valued fields. A document with the same value appearing
multiple times would inflate doc_count.

Add `fetch_block_with_missing_unique_per_doc` to ColumnBlockAccessor
that deduplicates (doc_id, value) pairs, and use it in term aggregation.

Fixes #2721

* refactor: only deduplicate for multivalue cardinality

Duplicates can only occur with multivalue columns, so narrow the
check from !is_full() to is_multivalue().

* fix: handle non-consecutive duplicate values in dedup

Sort values within each doc_id group before deduplicating, so that
non-adjacent duplicates are correctly handled.

Add unit tests for dedup_docid_val_pairs: consecutive duplicates,
non-consecutive duplicates, multi-doc groups, no duplicates, and
single element.

* perf: skip dedup when block has no multivalue entries

Add early return when no consecutive doc_ids are equal, avoiding
unnecessary sort and dedup passes. Remove the 2-element swap
optimization as it is not needed by the dedup algorithm.

---------

Co-authored-by: nryoo <nryoo@nryooui-MacBookPro.local>
2026-03-24 02:02:30 +01:00
Paul Masurel
545169c0d8 Composite agg merge (#2856)
Add composite aggregation

Co-authored-by: Remi Dettai <remi.dettai@sekoia.io>
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
2026-03-18 17:28:59 +01:00
Paul Masurel
68a9066d13 Fix format (#2852)
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
2026-03-16 10:43:39 +01:00
18 changed files with 562 additions and 290 deletions


@@ -0,0 +1,87 @@
---
name: update-changelog
description: Update CHANGELOG.md with merged PRs since the last changelog update, categorized by type
---
# Update Changelog
This skill updates CHANGELOG.md with merged PRs that aren't already listed.
## Step 1: Determine the changelog scope
Read `CHANGELOG.md` to identify the current unreleased version section at the top (e.g., `Tantivy 0.26 (Unreleased)`).
Collect all PR numbers already mentioned in the unreleased section by extracting `#NNNN` references.
## Step 2: Find merged PRs not yet in the changelog
Use `gh` to list recently merged PRs from the upstream repo:
```bash
gh pr list --repo quickwit-oss/tantivy --state merged --limit 100 --json number,title,author,labels,mergedAt
```
Filter out any PRs whose number already appears in the unreleased section of the changelog.
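Steps 1–2 can be sketched as a small Rust helper (a hypothetical illustration, not part of the skill): extract every `#NNNN` reference from the unreleased section, then filter a merged-PR number list against it.

```rust
use std::collections::HashSet;

/// Extract all `#NNNN` PR references from a changelog section.
fn extract_pr_numbers(changelog: &str) -> HashSet<u32> {
    let mut out = HashSet::new();
    let bytes = changelog.as_bytes();
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'#' {
            // Consume the digit run following '#', if any.
            let start = i + 1;
            let mut end = start;
            while end < bytes.len() && bytes[end].is_ascii_digit() {
                end += 1;
            }
            if end > start {
                if let Ok(n) = changelog[start..end].parse() {
                    out.insert(n);
                }
            }
            i = end.max(i + 1);
        } else {
            i += 1;
        }
    }
    out
}

fn main() {
    let unreleased = "- Add composite aggregation [#2856](url)(@fulmicoton)\n\
                      - Fix format [#2852](url)";
    let known = extract_pr_numbers(unreleased);
    // PR numbers as fetched via `gh pr list` (hard-coded here for illustration).
    let merged = vec![2852u32, 2854, 2856, 2857];
    let missing: Vec<u32> = merged.into_iter().filter(|n| !known.contains(n)).collect();
    assert_eq!(missing, vec![2854, 2857]);
    println!("{missing:?}");
}
```

The real skill works on `gh` JSON output rather than a hard-coded list; only the set-difference idea is shown here.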
## Step 3: Consolidate related PRs
Before categorizing, group PRs that belong to the same logical change. This is critical for producing a clean changelog. Use PR descriptions, titles, cross-references, and the files touched to identify relationships.
**Merge follow-up PRs into the original:**
- If a PR is a bugfix, refinement, or follow-up to another PR in the same unreleased cycle, combine them into a single changelog entry with multiple `[#N](url)` links.
- Also consolidate PRs that touch the same feature area even if not explicitly linked — e.g., a PR fixing an edge case in a new API should be folded into the entry for the PR that introduced that API.
**Filter out bugfixes on unreleased features:**
- If a bugfix PR fixes something introduced by another PR in the **same unreleased version**, it must NOT appear as a separate Bugfixes entry. Instead, silently fold it into the original feature/improvement entry. The changelog should describe the final shipped state, not the development history.
- To detect this: check if the bugfix PR references or reverts changes from another PR in the same release cycle, or if it touches code that was newly added (not present in the previous release).
## Step 4: Review the actual code diff
**Do not rely on PR titles or descriptions alone.** For every candidate PR, run `gh pr diff <number> --repo quickwit-oss/tantivy` and read the actual changes. PR titles are often misleading — the diff is the source of truth.
**What to look for in the diff:**
- Does it change observable behavior, public API surface, or performance characteristics?
- Is the change something a user of the library would notice or need to know about?
- Could the change break existing code (API changes, removed features)?
**Skip PRs where the diff reveals the change is not meaningful enough for the changelog** — e.g., cosmetic renames, trivial visibility tweaks, test-only changes, etc.
## Step 5: Categorize each PR group
For each PR (or consolidated group) that survived the diff review, determine its category:
- **Bugfixes** — fixes to behavior that existed in the **previous release**. NOT fixes to features introduced in this release cycle.
- **Features/Improvements** — new features, API additions, new options, improvements that change user-facing behavior or add new capabilities.
- **Performance** — optimizations, speed improvements, memory reductions. **If a PR adds new API whose primary purpose is enabling a performance optimization, categorize it as Performance, not Features.** The deciding question is: does a user benefit from this because of new functionality, or because things got faster/leaner? For example, a new trait method that exists solely to enable cheaper intersection ordering is Performance, not a Feature.
If a PR doesn't clearly fit any category (e.g., CI-only changes, internal refactors with no user-facing impact, dependency bumps with no behavior change), skip it — not everything belongs in the changelog.
When unclear, use your best judgment or ask the user.
## Step 6: Format entries
Each entry must follow this exact format:
```
- Description [#NUMBER](https://github.com/quickwit-oss/tantivy/pull/NUMBER)(@author)
```
Rules:
- The description should be concise and describe the user-facing change (not the implementation). Describe the final shipped state, not the incremental development steps.
- Use sub-categories with bold headers when multiple entries relate to the same area (e.g., `- **Aggregation**` with indented entries beneath). Follow the existing grouping style in the changelog.
- Author is the GitHub username from the PR, prefixed with `@`. For consolidated entries, include all contributing authors.
- For consolidated PRs, list all PR links in a single entry: `[#100](url) [#110](url)` (see existing entries for examples).
## Step 7: Present changes to the user
Show the user the proposed changelog entries grouped by category **before** editing the file. Ask for confirmation or adjustments.
## Step 8: Update CHANGELOG.md
Insert the new entries into the appropriate sections of the unreleased version block. If a section doesn't exist yet, create it following the order: Bugfixes, Features/Improvements, Performance.
Append new entries at the end of each section (before the next section header or version header).
## Step 9: Verify
Read back the updated unreleased section and display it to the user for final review.


@@ -1,3 +1,51 @@
Tantivy 0.26 (Unreleased)
================================
## Bugfixes
- Align float query coercion during search with the columnar coercion rules [#2692](https://github.com/quickwit-oss/tantivy/pull/2692)(@fulmicoton)
- Fix lenient elastic range queries with trailing closing parentheses [#2816](https://github.com/quickwit-oss/tantivy/pull/2816)(@evance-br)
- Fix intersection `seek()` advancing below current doc id [#2812](https://github.com/quickwit-oss/tantivy/pull/2812)(@fulmicoton)
- Fix phrase query prefixed with `*` [#2751](https://github.com/quickwit-oss/tantivy/pull/2751)(@Darkheir)
- Fix `vint` buffer overflow during index creation [#2778](https://github.com/quickwit-oss/tantivy/pull/2778)(@rebasedming)
- Fix integer overflow in `ExpUnrolledLinkedList` for large datasets [#2735](https://github.com/quickwit-oss/tantivy/pull/2735)(@mdashti)
- Fix integer overflow in segment sorting and merge policy truncation [#2846](https://github.com/quickwit-oss/tantivy/pull/2846)(@anaslimem)
- Fix merging of intermediate aggregation results [#2719](https://github.com/quickwit-oss/tantivy/pull/2719)(@PSeitz)
- Fix duplicated doc counts in term aggregation for multi-valued fields [#2854](https://github.com/quickwit-oss/tantivy/pull/2854)(@nuri-yoo)
## Features/Improvements
- **Aggregation**
- Add filter aggregation [#2711](https://github.com/quickwit-oss/tantivy/pull/2711)(@mdashti)
- Add include/exclude filtering for term aggregations [#2717](https://github.com/quickwit-oss/tantivy/pull/2717)(@PSeitz)
- Add public accessors for intermediate aggregation results [#2829](https://github.com/quickwit-oss/tantivy/pull/2829)(@congx4)
- Replace HyperLogLog++ with Apache DataSketches HLL for cardinality aggregation [#2837](https://github.com/quickwit-oss/tantivy/pull/2837) [#2842](https://github.com/quickwit-oss/tantivy/pull/2842)(@congx4)
- Add composite aggregation [#2856](https://github.com/quickwit-oss/tantivy/pull/2856)(@fulmicoton)
- **Fast Fields**
- Add fast field fallback for `TermQuery` when the field is not indexed [#2693](https://github.com/quickwit-oss/tantivy/pull/2693)(@PSeitz-dd)
- Add fast field support for `Bytes` values [#2830](https://github.com/quickwit-oss/tantivy/pull/2830)(@mdashti)
- **Query Parser**
- Add support for regexes in the query grammar [#2677](https://github.com/quickwit-oss/tantivy/pull/2677) [#2818](https://github.com/quickwit-oss/tantivy/pull/2818)(@Darkheir)
- Deduplicate queries in query parser [#2698](https://github.com/quickwit-oss/tantivy/pull/2698)(@PSeitz-dd)
- Add erased `SortKeyComputer` for sorting on column types unknown until runtime [#2770](https://github.com/quickwit-oss/tantivy/pull/2770) [#2790](https://github.com/quickwit-oss/tantivy/pull/2790)(@stuhood @PSeitz)
- Add natural-order-with-none-highest support in `TopDocs::order_by` [#2780](https://github.com/quickwit-oss/tantivy/pull/2780)(@stuhood)
- Move stemming behind `stemmer` feature flag [#2791](https://github.com/quickwit-oss/tantivy/pull/2791)(@fulmicoton)
- Make `DeleteMeta`, `AddOperation`, `advance_deletes`, `with_max_doc`, `serializer` module, and `delete_queue` public [#2762](https://github.com/quickwit-oss/tantivy/pull/2762) [#2765](https://github.com/quickwit-oss/tantivy/pull/2765) [#2766](https://github.com/quickwit-oss/tantivy/pull/2766) [#2835](https://github.com/quickwit-oss/tantivy/pull/2835)(@philippemnoel @PSeitz)
- Make `Language` hashable [#2763](https://github.com/quickwit-oss/tantivy/pull/2763)(@philippemnoel)
- Improve `space_usage` reporting for JSON fields and columnar data [#2761](https://github.com/quickwit-oss/tantivy/pull/2761)(@PSeitz-dd)
- Split `Term` into `Term` and `IndexingTerm` [#2744](https://github.com/quickwit-oss/tantivy/pull/2744) [#2750](https://github.com/quickwit-oss/tantivy/pull/2750)(@PSeitz-dd @PSeitz)
## Performance
- **Aggregation**
- Large speed up and memory reduction for nested high cardinality aggregations by using one collector per request instead of one per bucket, and adding `PagedTermMap` for faster medium cardinality term aggregations [#2715](https://github.com/quickwit-oss/tantivy/pull/2715) [#2759](https://github.com/quickwit-oss/tantivy/pull/2759)(@PSeitz @PSeitz-dd)
- Optimize low-cardinality term aggregations by using a `Vec` instead of a `HashMap` [#2740](https://github.com/quickwit-oss/tantivy/pull/2740)(@fulmicoton-dd)
- Optimize `ExistsQuery` for a high number of dynamic columns [#2694](https://github.com/quickwit-oss/tantivy/pull/2694)(@PSeitz-dd)
- Add lazy scorers to stop score evaluation early when a doc won't reach the top-K threshold [#2726](https://github.com/quickwit-oss/tantivy/pull/2726) [#2777](https://github.com/quickwit-oss/tantivy/pull/2777)(@fulmicoton @stuhood)
- Add `DocSet::cost()` and use it to order scorers in intersections [#2707](https://github.com/quickwit-oss/tantivy/pull/2707)(@PSeitz)
- Add `collect_block` support for collector wrappers [#2727](https://github.com/quickwit-oss/tantivy/pull/2727)(@stuhood)
- Optimize saturated posting lists by replacing them with `AllScorer` in boolean queries [#2745](https://github.com/quickwit-oss/tantivy/pull/2745) [#2760](https://github.com/quickwit-oss/tantivy/pull/2760) [#2774](https://github.com/quickwit-oss/tantivy/pull/2774)(@fulmicoton @mdashti @trinity-1686a)
- Add `seek_danger` on `DocSet` for more efficient intersections [#2538](https://github.com/quickwit-oss/tantivy/pull/2538) [#2810](https://github.com/quickwit-oss/tantivy/pull/2810)(@PSeitz @stuhood @fulmicoton)
- Skip column traversal in `RangeDocSet` when query range does not overlap with column bounds [#2783](https://github.com/quickwit-oss/tantivy/pull/2783)(@ChangRui-Ryan)
- Speed up exclude queries by supporting multiple excluded `DocSet`s without intermediate union [#2825](https://github.com/quickwit-oss/tantivy/pull/2825)(@PSeitz)
Tantivy 0.25
================================


@@ -11,7 +11,7 @@ repository = "https://github.com/quickwit-oss/tantivy"
readme = "README.md"
keywords = ["search", "information", "retrieval"]
edition = "2021"
rust-version = "1.85"
rust-version = "1.86"
exclude = ["benches/*.json", "benches/*.txt"]
[dependencies]
@@ -27,7 +27,7 @@ regex = { version = "1.5.5", default-features = false, features = [
aho-corasick = "1.0"
tantivy-fst = "0.5"
memmap2 = { version = "0.9.0", optional = true }
lz4_flex = { version = "0.12", default-features = false, optional = true }
lz4_flex = { version = "0.13", default-features = false, optional = true }
zstd = { version = "0.13", optional = true, default-features = false }
tempfile = { version = "3.12.0", optional = true }
log = "0.4.16"
@@ -64,7 +64,7 @@ query-grammar = { version = "0.25.0", path = "./query-grammar", package = "tanti
tantivy-bitpacker = { version = "0.9", path = "./bitpacker" }
common = { version = "0.10", path = "./common/", package = "tantivy-common" }
tokenizer-api = { version = "0.6", path = "./tokenizer-api", package = "tantivy-tokenizer-api" }
sketches-ddsketch = { git = "https://github.com/quickwit-oss/rust-sketches-ddsketch.git", rev = "555caf1", features = ["use_serde"] }
sketches-ddsketch = { version = "0.4", features = ["use_serde"] }
datasketches = "0.2.0"
futures-util = { version = "0.3.28", optional = true }
futures-channel = { version = "0.3.28", optional = true }


@@ -58,6 +58,78 @@ impl<T: PartialOrd + Copy + std::fmt::Debug + Send + Sync + 'static + Default>
}
}
/// Like `fetch_block_with_missing`, but deduplicates (doc_id, value) pairs
/// so that each unique value per document is returned only once.
///
/// This is necessary for correct document counting in aggregations,
/// where multi-valued fields can produce duplicate entries that inflate counts.
#[inline]
pub fn fetch_block_with_missing_unique_per_doc(
&mut self,
docs: &[u32],
accessor: &Column<T>,
missing: Option<T>,
) where
T: Ord,
{
self.fetch_block_with_missing(docs, accessor, missing);
if accessor.index.get_cardinality().is_multivalue() {
self.dedup_docid_val_pairs();
}
}
/// Removes duplicate (doc_id, value) pairs from the caches.
///
/// After `fetch_block`, entries are sorted by doc_id, but values within
/// the same doc may not be sorted (e.g. `(0,1), (0,2), (0,1)`).
/// We group consecutive entries by doc_id, sort values within each group
/// if it has more than 2 elements, then deduplicate adjacent pairs.
///
/// Skips entirely if no doc_id appears more than once in the block.
fn dedup_docid_val_pairs(&mut self)
where T: Ord {
if self.docid_cache.len() <= 1 {
return;
}
// Quick check: if no consecutive doc_ids are equal, no dedup needed.
let has_multivalue = self.docid_cache.windows(2).any(|w| w[0] == w[1]);
if !has_multivalue {
return;
}
// Sort values within each doc_id group so duplicates become adjacent.
let mut start = 0;
while start < self.docid_cache.len() {
let doc = self.docid_cache[start];
let mut end = start + 1;
while end < self.docid_cache.len() && self.docid_cache[end] == doc {
end += 1;
}
if end - start > 2 {
self.val_cache[start..end].sort();
}
start = end;
}
// Now duplicates are adjacent — deduplicate in place.
let mut write = 0;
for read in 1..self.docid_cache.len() {
if self.docid_cache[read] != self.docid_cache[write]
|| self.val_cache[read] != self.val_cache[write]
{
write += 1;
if write != read {
self.docid_cache[write] = self.docid_cache[read];
self.val_cache[write] = self.val_cache[read];
}
}
}
let new_len = write + 1;
self.docid_cache.truncate(new_len);
self.val_cache.truncate(new_len);
}
#[inline]
pub fn iter_vals(&self) -> impl Iterator<Item = T> + '_ {
self.val_cache.iter().cloned()
@@ -163,4 +235,56 @@ mod tests {
assert_eq!(missing_docs, vec![1, 2, 3, 4, 5]);
}
#[test]
fn test_dedup_docid_val_pairs_consecutive() {
let mut accessor = ColumnBlockAccessor::<u64>::default();
accessor.docid_cache = vec![0, 0, 2, 3];
accessor.val_cache = vec![10, 10, 10, 10];
accessor.dedup_docid_val_pairs();
assert_eq!(accessor.docid_cache, vec![0, 2, 3]);
assert_eq!(accessor.val_cache, vec![10, 10, 10]);
}
#[test]
fn test_dedup_docid_val_pairs_non_consecutive() {
// (0,1), (0,2), (0,1) — duplicate value not adjacent
let mut accessor = ColumnBlockAccessor::<u64>::default();
accessor.docid_cache = vec![0, 0, 0];
accessor.val_cache = vec![1, 2, 1];
accessor.dedup_docid_val_pairs();
assert_eq!(accessor.docid_cache, vec![0, 0]);
assert_eq!(accessor.val_cache, vec![1, 2]);
}
#[test]
fn test_dedup_docid_val_pairs_multi_doc() {
// doc 0: values [3, 1, 3], doc 1: values [5, 5]
let mut accessor = ColumnBlockAccessor::<u64>::default();
accessor.docid_cache = vec![0, 0, 0, 1, 1];
accessor.val_cache = vec![3, 1, 3, 5, 5];
accessor.dedup_docid_val_pairs();
assert_eq!(accessor.docid_cache, vec![0, 0, 1]);
assert_eq!(accessor.val_cache, vec![1, 3, 5]);
}
#[test]
fn test_dedup_docid_val_pairs_no_duplicates() {
let mut accessor = ColumnBlockAccessor::<u64>::default();
accessor.docid_cache = vec![0, 0, 1];
accessor.val_cache = vec![1, 2, 3];
accessor.dedup_docid_val_pairs();
assert_eq!(accessor.docid_cache, vec![0, 0, 1]);
assert_eq!(accessor.val_cache, vec![1, 2, 3]);
}
#[test]
fn test_dedup_docid_val_pairs_single_element() {
let mut accessor = ColumnBlockAccessor::<u64>::default();
accessor.docid_cache = vec![0];
accessor.val_cache = vec![1];
accessor.dedup_docid_val_pairs();
assert_eq!(accessor.docid_cache, vec![0]);
assert_eq!(accessor.val_cache, vec![1]);
}
}
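The dedup pass above can be sketched standalone as a free function over parallel vectors (an assumed simplification of the `ColumnBlockAccessor` method, same logic):

```rust
/// Deduplicate parallel (doc_id, value) vectors in place: sort values
/// within each doc_id group so duplicates become adjacent, then compact.
fn dedup_docid_val_pairs(docids: &mut Vec<u32>, vals: &mut Vec<u64>) {
    if docids.len() <= 1 || !docids.windows(2).any(|w| w[0] == w[1]) {
        return; // no doc_id appears twice: nothing to deduplicate
    }
    // Sort values inside each doc_id group. Groups of <= 2 need no sort:
    // two equal values are already adjacent.
    let mut start = 0;
    while start < docids.len() {
        let mut end = start + 1;
        while end < docids.len() && docids[end] == docids[start] {
            end += 1;
        }
        if end - start > 2 {
            vals[start..end].sort_unstable();
        }
        start = end;
    }
    // Single pass: keep only the first occurrence of each adjacent pair.
    let mut write = 0;
    for read in 1..docids.len() {
        if docids[read] != docids[write] || vals[read] != vals[write] {
            write += 1;
            docids[write] = docids[read];
            vals[write] = vals[read];
        }
    }
    docids.truncate(write + 1);
    vals.truncate(write + 1);
}

fn main() {
    // doc 0 has values [3, 1, 3]; doc 1 has [5, 5].
    let mut docids = vec![0, 0, 0, 1, 1];
    let mut vals = vec![3, 1, 3, 5, 5];
    dedup_docid_val_pairs(&mut docids, &mut vals);
    assert_eq!(docids, vec![0, 0, 1]);
    assert_eq!(vals, vec![1, 3, 5]);
}
```

Note the early `windows(2)` check mirrors the PR's "skip dedup when block has no multivalue entries" optimization.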


@@ -31,7 +31,7 @@ pub use u64_based::{
serialize_and_load_u64_based_column_values, serialize_u64_based_column_values,
};
pub use u128_based::{
CompactSpaceU64Accessor, open_u128_as_compact_u64, open_u128_mapped,
CompactHit, CompactSpaceU64Accessor, open_u128_as_compact_u64, open_u128_mapped,
serialize_column_values_u128,
};
pub use vec_column::VecColumn;


@@ -292,6 +292,19 @@ impl BinarySerializable for IPCodecParams {
}
}
/// Represents the result of looking up a u128 value in the compact space.
///
/// If a value is outside the compact space, the next compact value is returned.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CompactHit {
/// The value exists in the compact space
Exact(u32),
/// The value does not exist in the compact space, but the next higher value does
Next(u32),
/// The value is greater than the maximum compact value
AfterLast,
}
/// Exposes the compact space compressed values as u64.
///
/// This allows faster access to the values, as u64 is faster to work with than u128.
@@ -309,6 +322,11 @@ impl CompactSpaceU64Accessor {
pub fn compact_to_u128(&self, compact: u32) -> u128 {
self.0.compact_to_u128(compact)
}
/// Finds the next compact space value for a given u128 value.
pub fn u128_to_next_compact(&self, value: u128) -> CompactHit {
self.0.u128_to_next_compact(value)
}
}
impl ColumnValues<u64> for CompactSpaceU64Accessor {
@@ -441,6 +459,21 @@ impl CompactSpaceDecompressor {
self.params.compact_space.u128_to_compact(value)
}
/// Finds the next compact space value for a given u128 value.
pub fn u128_to_next_compact(&self, value: u128) -> CompactHit {
match self.u128_to_compact(value) {
Ok(compact) => CompactHit::Exact(compact),
Err(pos) => {
if pos >= self.params.compact_space.ranges_mapping.len() {
CompactHit::AfterLast
} else {
let next_range = &self.params.compact_space.ranges_mapping[pos];
CompactHit::Next(next_range.compact_start)
}
}
}
}
fn compact_to_u128(&self, compact: u32) -> u128 {
self.params.compact_space.compact_to_u128(compact)
}
@@ -823,6 +856,41 @@ mod tests {
let _data = test_aux_vals(vals);
}
#[test]
fn test_u128_to_next_compact() {
let vals = &[100u128, 200u128, 1_000_000_000u128, 1_000_000_100u128];
let mut data = test_aux_vals(vals);
let _header = U128Header::deserialize(&mut data);
let decomp = CompactSpaceDecompressor::open(data).unwrap();
// Test value that's already in a range
let compact_100 = decomp.u128_to_compact(100).unwrap();
assert_eq!(
decomp.u128_to_next_compact(100),
CompactHit::Exact(compact_100)
);
// Test value between two ranges
let compact_million = decomp.u128_to_compact(1_000_000_000).unwrap();
assert_eq!(
decomp.u128_to_next_compact(250),
CompactHit::Next(compact_million)
);
// Test value before the first range
assert_eq!(
decomp.u128_to_next_compact(50),
CompactHit::Next(compact_100)
);
// Test value after the last range
assert_eq!(
decomp.u128_to_next_compact(10_000_000_000),
CompactHit::AfterLast
);
}
use proptest::prelude::*;
fn num_strategy() -> impl Strategy<Value = u128> {


@@ -7,7 +7,7 @@ mod compact_space;
use common::{BinarySerializable, OwnedBytes, VInt};
pub use compact_space::{
CompactSpaceCompressor, CompactSpaceDecompressor, CompactSpaceU64Accessor,
CompactHit, CompactSpaceCompressor, CompactSpaceDecompressor, CompactSpaceU64Accessor,
};
use crate::column_values::monotonic_map_column;


@@ -1,6 +1,6 @@
use std::net::Ipv6Addr;
use columnar::column_values::CompactSpaceU64Accessor;
use columnar::column_values::{CompactHit, CompactSpaceU64Accessor};
use columnar::{Column, ColumnType, MonotonicallyMappableToU64, StrColumn, TermOrdHit};
use crate::aggregation::accessor_helpers::get_numeric_or_date_column_types;
@@ -150,7 +150,7 @@ impl CompositeSourceAccessors {
{
match source_after_key_opt {
Some(after_key) => PrecomputedAfterKey::precompute(
&first_col,
first_col,
after_key,
&source.field,
source.missing_order,
@@ -342,7 +342,7 @@ impl PrecomputedDateInterval {
.to_string(),
)),
(Some(fixed_interval), None) => {
let fixed_interval_ms = parse_into_milliseconds(&fixed_interval)?;
let fixed_interval_ms = parse_into_milliseconds(fixed_interval)?;
Ok(PrecomputedDateInterval::FixedNanoseconds(
fixed_interval_ms * 1_000_000,
))
@@ -370,6 +370,16 @@ pub enum PrecomputedAfterKey {
AfterLast,
}
impl From<CompactHit> for PrecomputedAfterKey {
fn from(hit: CompactHit) -> Self {
match hit {
CompactHit::Exact(ord) => PrecomputedAfterKey::Exact(ord as u64),
CompactHit::Next(ord) => PrecomputedAfterKey::Next(ord as u64),
CompactHit::AfterLast => PrecomputedAfterKey::AfterLast,
}
}
}
impl From<TermOrdHit> for PrecomputedAfterKey {
fn from(hit: TermOrdHit) -> Self {
match hit {
@@ -418,58 +428,18 @@ impl PrecomputedAfterKey {
}
fn precompute_ip_addr(column: &Column<u64>, key: &Ipv6Addr) -> crate::Result<Self> {
// For IP addresses we need to find the compact space value.
// We try to convert via the column's min/max range scan.
// Since CompactSpaceU64Accessor::u128_to_compact is not public,
// we search linearly for the exact u64 value by scanning column values.
let compact_space_accessor = column
.values
.clone()
.downcast_arc::<CompactSpaceU64Accessor>()
.map_err(|_| {
TantivyError::AggregationError(crate::aggregation::AggregationError::InternalError(
"type mismatch: could not downcast to CompactSpaceU64Accessor".to_string(),
))
})?;
let ip_u128 = key.to_bits();
// Scan for matching value - IP columns are typically small
let num_vals = column.values.num_vals();
let mut found_exact = false;
let mut exact_compact = 0u64;
let mut best_next: Option<u64> = None;
for doc_id in 0..num_vals {
let val = column.values.get_val(doc_id);
// We need the CompactSpaceU64Accessor to convert compact to u128
let compact_accessor = column
.values
.clone()
.downcast_arc::<CompactSpaceU64Accessor>()
.map_err(|_| {
TantivyError::AggregationError(
crate::aggregation::AggregationError::InternalError(
"type mismatch: could not downcast to CompactSpaceU64Accessor"
.to_string(),
),
)
})?;
let val_u128 = compact_accessor.compact_to_u128(val as u32);
if val_u128 == ip_u128 {
found_exact = true;
exact_compact = val;
break;
} else if val_u128 > ip_u128 {
match best_next {
None => best_next = Some(val),
Some(current_best) => {
let current_u128 = compact_accessor.compact_to_u128(current_best as u32);
if val_u128 < current_u128 {
best_next = Some(val);
}
}
}
}
}
if found_exact {
Ok(PrecomputedAfterKey::Exact(exact_compact))
} else if let Some(next) = best_next {
Ok(PrecomputedAfterKey::Next(next))
} else {
Ok(PrecomputedAfterKey::AfterLast)
}
let ip_next_compact = compact_space_accessor.u128_to_next_compact(ip_u128);
Ok(ip_next_compact.into())
}
fn precompute_term_ord(


@@ -8,9 +8,8 @@ const NS_IN_DAY: i64 = Nanosecond::per_t::<i128>(Day) as i64;
pub(super) fn try_year_bucket(timestamp_ns: i64) -> crate::Result<i64> {
year_bucket_using_time_crate(timestamp_ns).map_err(|e| {
crate::TantivyError::InvalidArgument(format!(
"Failed to compute year bucket for timestamp {}: {}",
timestamp_ns,
e.to_string()
"Failed to compute year bucket for timestamp {}: {e}",
timestamp_ns
))
})
}
@@ -20,9 +19,8 @@ pub(super) fn try_year_bucket(timestamp_ns: i64) -> crate::Result<i64> {
pub(super) fn try_month_bucket(timestamp_ns: i64) -> crate::Result<i64> {
month_bucket_using_time_crate(timestamp_ns).map_err(|e| {
crate::TantivyError::InvalidArgument(format!(
"Failed to compute month bucket for timestamp {}: {}",
timestamp_ns,
e.to_string()
"Failed to compute month bucket for timestamp {}: {e}",
timestamp_ns
))
})
}


@@ -137,7 +137,7 @@ impl SegmentAggregationCollector for SegmentCompositeCollector {
.name
.clone();
let buckets = self.into_intermediate_bucket_result(agg_data, parent_bucket_id)?;
let buckets = self.add_intermediate_bucket_result(agg_data, parent_bucket_id)?;
results.push(
name,
IntermediateAggregationResult::Bucket(IntermediateBucketResult::Composite { buckets }),
@@ -156,17 +156,15 @@ impl SegmentAggregationCollector for SegmentCompositeCollector {
let composite_agg_data = agg_data.take_composite_req_data(self.accessor_idx);
for doc in docs {
let mut sub_level_values = SmallVec::new();
recursive_key_visitor(
*doc,
&composite_agg_data,
0,
&mut sub_level_values,
&mut self.parent_buckets[parent_bucket_id as usize],
true,
&mut self.sub_agg,
&mut self.bucket_id_provider,
)?;
let mut visitor = CompositeKeyVisitor {
doc_id: *doc,
composite_agg_data: &composite_agg_data,
buckets: &mut self.parent_buckets[parent_bucket_id as usize],
sub_agg: &mut self.sub_agg,
bucket_id_provider: &mut self.bucket_id_provider,
sub_level_values: SmallVec::new(),
};
visitor.visit(0, true)?;
}
agg_data.put_back_composite_req_data(self.accessor_idx, composite_agg_data);
@@ -238,7 +236,7 @@ impl SegmentCompositeCollector {
}
#[inline]
fn into_intermediate_bucket_result(
fn add_intermediate_bucket_result(
&mut self,
agg_data: &AggregationsSegmentCtx,
parent_bucket_id: BucketId,
@@ -305,6 +303,13 @@ fn validate_req(req_data: &mut AggregationsSegmentCtx, accessor_idx: usize) -> c
"composite aggregation 'size' must be > 0".to_string(),
));
}
if composite_data.composite_accessors.len() > MAX_DYN_ARRAY_SIZE {
return Err(TantivyError::InvalidArgument(format!(
"composite aggregation source supports maximum {MAX_DYN_ARRAY_SIZE} sources",
)));
}
let column_types_for_sources = composite_data.composite_accessors.iter().map(|item| {
item.accessors
.iter()
@@ -313,11 +318,6 @@ fn validate_req(req_data: &mut AggregationsSegmentCtx, accessor_idx: usize) -> c
});
for column_types in column_types_for_sources {
if column_types.len() > MAX_DYN_ARRAY_SIZE {
return Err(TantivyError::InvalidArgument(format!(
"composite aggregation source supports maximum {MAX_DYN_ARRAY_SIZE} sources",
)));
}
if column_types.contains(&ColumnType::Bytes) {
return Err(TantivyError::InvalidArgument(
"composite aggregation does not support 'bytes' field type".to_string(),
@@ -480,195 +480,173 @@ fn resolve_term(
Ok(key)
}
/// Depth-first walk of the accessors to build the composite key combinations
/// and update the buckets.
fn recursive_key_visitor(
/// Walks the Cartesian product of the values yielded by the doc's composite key
/// sources.
///
/// For each of those tuple keys that comes after the limit key, we call collect_bucket_with_limit.
struct CompositeKeyVisitor<'a> {
doc_id: crate::DocId,
composite_agg_data: &CompositeAggReqData,
source_idx_for_recursion: usize,
sub_level_values: &mut SmallVec<[InternalValueRepr; MAX_DYN_ARRAY_SIZE]>,
buckets: &mut DynArrayHeapMap<InternalValueRepr, CompositeBucketCollector>,
// whether we need to consider the after_key in the following levels
is_on_after_key: bool,
sub_agg: &mut Option<CachedSubAggs<HighCardSubAggCache>>,
bucket_id_provider: &mut BucketIdProvider,
) -> crate::Result<()> {
if source_idx_for_recursion == composite_agg_data.req.sources.len() {
if !is_on_after_key {
collect_bucket_with_limit(
doc_id,
composite_agg_data.req.size as usize,
buckets,
sub_level_values,
sub_agg,
bucket_id_provider,
);
}
return Ok(());
}
composite_agg_data: &'a CompositeAggReqData,
buckets: &'a mut DynArrayHeapMap<InternalValueRepr, CompositeBucketCollector>,
sub_agg: &'a mut Option<CachedSubAggs<HighCardSubAggCache>>,
bucket_id_provider: &'a mut BucketIdProvider,
sub_level_values: SmallVec<[InternalValueRepr; MAX_DYN_ARRAY_SIZE]>,
}
let current_level_accessors = &composite_agg_data.composite_accessors[source_idx_for_recursion];
let current_level_source = &composite_agg_data.req.sources[source_idx_for_recursion];
let mut missing = true;
for (accessor_idx, accessor) in current_level_accessors.accessors.iter().enumerate() {
let values = accessor.column.values_for_doc(doc_id);
for value in values {
missing = false;
match current_level_source {
CompositeAggregationSource::Terms(_) => {
let precedes_after_key_type =
accessor_idx < current_level_accessors.after_key_accessor_idx;
if is_on_after_key && precedes_after_key_type {
break;
}
let matches_after_key_type =
accessor_idx == current_level_accessors.after_key_accessor_idx;
if matches_after_key_type && is_on_after_key {
let should_skip = match current_level_source.order() {
Order::Asc => current_level_accessors.after_key.gt(value),
Order::Desc => current_level_accessors.after_key.lt(value),
};
if should_skip {
continue;
}
}
sub_level_values.push(InternalValueRepr::new_term(
value,
accessor_idx as u8,
current_level_source.order(),
));
let still_on_after_key =
matches_after_key_type && current_level_accessors.after_key.equals(value);
recursive_key_visitor(
doc_id,
composite_agg_data,
source_idx_for_recursion + 1,
sub_level_values,
buckets,
is_on_after_key && still_on_after_key,
sub_agg,
bucket_id_provider,
)?;
sub_level_values.pop();
}
CompositeAggregationSource::Histogram(source) => {
let float_value = match accessor.column_type {
ColumnType::U64 => value as f64,
ColumnType::I64 => i64::from_u64(value) as f64,
ColumnType::DateTime => i64::from_u64(value) as f64 / 1_000_000.,
ColumnType::F64 => f64::from_u64(value),
_ => {
panic!(
"unexpected type {:?}. This should not happen",
accessor.column_type
)
}
};
let bucket_index = (float_value / source.interval).floor() as i64;
let bucket_value = i64::to_u64(bucket_index);
if is_on_after_key {
let should_skip = match current_level_source.order() {
Order::Asc => current_level_accessors.after_key.gt(bucket_value),
Order::Desc => current_level_accessors.after_key.lt(bucket_value),
};
if should_skip {
continue;
}
}
sub_level_values.push(InternalValueRepr::new_histogram(
bucket_value,
current_level_source.order(),
));
let still_on_after_key = current_level_accessors.after_key.equals(bucket_value);
recursive_key_visitor(
doc_id,
composite_agg_data,
source_idx_for_recursion + 1,
sub_level_values,
buckets,
is_on_after_key && still_on_after_key,
sub_agg,
bucket_id_provider,
)?;
sub_level_values.pop();
}
CompositeAggregationSource::DateHistogram(_) => {
let value_ns = match accessor.column_type {
ColumnType::DateTime => i64::from_u64(value),
_ => {
panic!(
"unexpected type {:?}. This should not happen",
accessor.column_type
)
}
};
let bucket_index = match accessor.date_histogram_interval {
PrecomputedDateInterval::FixedNanoseconds(fixed_interval_ns) => {
(value_ns / fixed_interval_ns) * fixed_interval_ns
}
PrecomputedDateInterval::Calendar(CalendarInterval::Year) => {
calendar_interval::try_year_bucket(value_ns)?
}
PrecomputedDateInterval::Calendar(CalendarInterval::Month) => {
calendar_interval::try_month_bucket(value_ns)?
}
PrecomputedDateInterval::Calendar(CalendarInterval::Week) => {
calendar_interval::week_bucket(value_ns)
}
PrecomputedDateInterval::NotApplicable => {
panic!("interval not precomputed for date histogram source")
}
};
let bucket_value = i64::to_u64(bucket_index);
if is_on_after_key {
let should_skip = match current_level_source.order() {
Order::Asc => current_level_accessors.after_key.gt(bucket_value),
Order::Desc => current_level_accessors.after_key.lt(bucket_value),
};
if should_skip {
continue;
}
}
sub_level_values.push(InternalValueRepr::new_histogram(
bucket_value,
current_level_source.order(),
));
let still_on_after_key = current_level_accessors.after_key.equals(bucket_value);
recursive_key_visitor(
doc_id,
composite_agg_data,
source_idx_for_recursion + 1,
sub_level_values,
buckets,
is_on_after_key && still_on_after_key,
sub_agg,
bucket_id_provider,
)?;
sub_level_values.pop();
}
};
}
}
if missing && current_level_source.missing_bucket() {
if is_on_after_key && current_level_accessors.skip_missing {
impl CompositeKeyVisitor<'_> {
/// Depth-first walk of the accessors to build the composite key combinations
/// and update the buckets.
///
/// `source_idx` is the current source index in the recursion.
/// `is_on_after_key` tracks whether we still need to consider the after_key
/// for pruning at this level and below.
fn visit(&mut self, source_idx: usize, is_on_after_key: bool) -> crate::Result<()> {
if source_idx == self.composite_agg_data.req.sources.len() {
if !is_on_after_key {
collect_bucket_with_limit(
self.doc_id,
self.composite_agg_data.req.size as usize,
self.buckets,
&self.sub_level_values,
self.sub_agg,
self.bucket_id_provider,
);
}
return Ok(());
}
let current_level_accessors = &self.composite_agg_data.composite_accessors[source_idx];
let current_level_source = &self.composite_agg_data.req.sources[source_idx];
let mut missing = true;
for (accessor_idx, accessor) in current_level_accessors.accessors.iter().enumerate() {
let values = accessor.column.values_for_doc(self.doc_id);
for value in values {
missing = false;
match current_level_source {
CompositeAggregationSource::Terms(_) => {
let precedes_after_key_type =
accessor_idx < current_level_accessors.after_key_accessor_idx;
if is_on_after_key && precedes_after_key_type {
break;
}
let matches_after_key_type =
accessor_idx == current_level_accessors.after_key_accessor_idx;
if matches_after_key_type && is_on_after_key {
let should_skip = match current_level_source.order() {
Order::Asc => current_level_accessors.after_key.gt(value),
Order::Desc => current_level_accessors.after_key.lt(value),
};
if should_skip {
continue;
}
}
self.sub_level_values.push(InternalValueRepr::new_term(
value,
accessor_idx as u8,
current_level_source.order(),
));
let still_on_after_key = matches_after_key_type
&& current_level_accessors.after_key.equals(value);
self.visit(source_idx + 1, is_on_after_key && still_on_after_key)?;
self.sub_level_values.pop();
}
CompositeAggregationSource::Histogram(source) => {
let float_value = match accessor.column_type {
ColumnType::U64 => value as f64,
ColumnType::I64 => i64::from_u64(value) as f64,
ColumnType::DateTime => i64::from_u64(value) as f64 / 1_000_000.,
ColumnType::F64 => f64::from_u64(value),
_ => {
panic!(
"unexpected type {:?}. This should not happen",
accessor.column_type
)
}
};
let bucket_index = (float_value / source.interval).floor() as i64;
let bucket_value = i64::to_u64(bucket_index);
if is_on_after_key {
let should_skip = match current_level_source.order() {
Order::Asc => current_level_accessors.after_key.gt(bucket_value),
Order::Desc => current_level_accessors.after_key.lt(bucket_value),
};
if should_skip {
continue;
}
}
self.sub_level_values.push(InternalValueRepr::new_histogram(
bucket_value,
current_level_source.order(),
));
let still_on_after_key =
current_level_accessors.after_key.equals(bucket_value);
self.visit(source_idx + 1, is_on_after_key && still_on_after_key)?;
self.sub_level_values.pop();
}
CompositeAggregationSource::DateHistogram(_) => {
let value_ns = match accessor.column_type {
ColumnType::DateTime => i64::from_u64(value),
_ => {
panic!(
"unexpected type {:?}. This should not happen",
accessor.column_type
)
}
};
let bucket_index = match accessor.date_histogram_interval {
PrecomputedDateInterval::FixedNanoseconds(fixed_interval_ns) => {
(value_ns / fixed_interval_ns) * fixed_interval_ns
}
PrecomputedDateInterval::Calendar(CalendarInterval::Year) => {
calendar_interval::try_year_bucket(value_ns)?
}
PrecomputedDateInterval::Calendar(CalendarInterval::Month) => {
calendar_interval::try_month_bucket(value_ns)?
}
PrecomputedDateInterval::Calendar(CalendarInterval::Week) => {
calendar_interval::week_bucket(value_ns)
}
PrecomputedDateInterval::NotApplicable => {
panic!("interval not precomputed for date histogram source")
}
};
let bucket_value = i64::to_u64(bucket_index);
if is_on_after_key {
let should_skip = match current_level_source.order() {
Order::Asc => current_level_accessors.after_key.gt(bucket_value),
Order::Desc => current_level_accessors.after_key.lt(bucket_value),
};
if should_skip {
continue;
}
}
self.sub_level_values.push(InternalValueRepr::new_histogram(
bucket_value,
current_level_source.order(),
));
let still_on_after_key =
current_level_accessors.after_key.equals(bucket_value);
self.visit(source_idx + 1, is_on_after_key && still_on_after_key)?;
self.sub_level_values.pop();
}
};
}
}
if missing && current_level_source.missing_bucket() {
if is_on_after_key && current_level_accessors.skip_missing {
return Ok(());
}
self.sub_level_values.push(InternalValueRepr::new_missing(
current_level_source.order(),
current_level_source.missing_order(),
));
self.visit(
source_idx + 1,
is_on_after_key && current_level_accessors.is_after_key_explicit_missing,
)?;
self.sub_level_values.pop();
}
Ok(())
}
}
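Both the `Histogram` and `DateHistogram` arms above reduce a value to a bucket via floor division by the interval. A standalone sketch of that computation (the function name is illustrative, not tantivy's API); the point is that `floor()` keeps negative values in their lower bucket, where a plain `as i64` cast would truncate toward zero:

```rust
// Illustrative sketch of the histogram bucketing used by the composite
// sources: floor division, not truncation, so negative values land in
// the correct lower bucket.
fn bucket_index(value: f64, interval: f64) -> i64 {
    (value / interval).floor() as i64
}

fn main() {
    assert_eq!(bucket_index(7.5, 5.0), 1); // 7.5 falls in [5, 10)
    assert_eq!(bucket_index(-0.5, 5.0), -1); // -0.5 falls in [-5, 0)
    assert_eq!(bucket_index(0.0, 5.0), 0);
}
```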

View File

@@ -323,7 +323,7 @@ mod tests {
let mut iter = map.into_iter();
let (k, v) = iter.next().unwrap();
assert_eq!(k.as_slice(), &key1);
-assert_eq!(v, "c");
+assert_eq!(v, "a");
assert_eq!(iter.next(), None);
}
}

View File

@@ -820,7 +820,7 @@ mod tests {
{"key": {"myterm": "apple"}, "doc_count": 1}
])
);
-assert!(res["my_composite"].get("after_key").is_none());
+assert!(res["fruity_aggreg"].get("after_key").is_none());
Ok(())
}
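The assertion above checks that `after_key` is omitted once every bucket fits in a single page. For orientation, a composite request producing keys of the form `{"myterm": "apple"}` would look roughly like this (the field name and size are illustrative; the shape follows the Elasticsearch-style composite DSL that this PR implements):

```json
{
  "fruity_aggreg": {
    "composite": {
      "size": 10,
      "sources": [
        { "myterm": { "terms": { "field": "fruit" } } }
      ]
    }
  }
}
```

Paging continues by re-sending the request with `"after"` set to the previous response's `after_key`, until a response carries no `after_key`.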

View File

@@ -1,4 +1,4 @@
-/// This modules helps comparing numerical values of different types (i64, u64
+/// This module helps comparing numerical values of different types (i64, u64
/// and f64).
pub(super) mod num_cmp {
use std::cmp::Ordering;
@@ -93,7 +93,7 @@ pub(super) mod num_cmp {
}
}
-/// This modules helps projecting numerical values to other numerical types.
+/// This module helps projecting numerical values to other numerical types.
/// When the target value space cannot exactly represent the source value, the
/// next representable value is returned (or AfterLast if the source value is
/// larger than the largest representable value).
@@ -138,9 +138,9 @@ pub(super) mod num_proj {
pub fn f64_to_i64(value: f64) -> ProjectedNumber<i64> {
if value < (i64::MIN as f64) {
-return ProjectedNumber::Next(i64::MIN);
+ProjectedNumber::Next(i64::MIN)
} else if value >= (i64::MAX as f64) {
-return ProjectedNumber::AfterLast;
+ProjectedNumber::AfterLast
} else if value.fract() == 0.0 {
ProjectedNumber::Exact(value as i64)
} else if value > 0.0 {

View File

@@ -807,11 +807,13 @@ impl<TermMap: TermAggregationMap, C: SubAggCache> SegmentAggregationCollector
let req_data = &mut self.terms_req_data;
-agg_data.column_block_accessor.fetch_block_with_missing(
-docs,
-&req_data.accessor,
-req_data.missing_value_for_accessor,
-);
+agg_data
+.column_block_accessor
+.fetch_block_with_missing_unique_per_doc(
+docs,
+&req_data.accessor,
+req_data.missing_value_for_accessor,
+);
if let Some(sub_agg) = &mut self.sub_agg {
let term_buckets = &mut self.parent_buckets[parent_bucket_id as usize];
@@ -2347,7 +2349,7 @@ mod tests {
// text field
assert_eq!(res["my_texts"]["buckets"][0]["key"], "Hello Hello");
-assert_eq!(res["my_texts"]["buckets"][0]["doc_count"], 5);
+assert_eq!(res["my_texts"]["buckets"][0]["doc_count"], 4);
assert_eq!(res["my_texts"]["buckets"][1]["key"], "Empty");
assert_eq!(res["my_texts"]["buckets"][1]["doc_count"], 2);
assert_eq!(
@@ -2356,7 +2358,7 @@ mod tests {
);
// text field with number as missing fallback
assert_eq!(res["my_texts2"]["buckets"][0]["key"], "Hello Hello");
-assert_eq!(res["my_texts2"]["buckets"][0]["doc_count"], 5);
+assert_eq!(res["my_texts2"]["buckets"][0]["doc_count"], 4);
assert_eq!(res["my_texts2"]["buckets"][1]["key"], 1337.0);
assert_eq!(res["my_texts2"]["buckets"][1]["doc_count"], 2);
assert_eq!(
@@ -2370,7 +2372,7 @@ mod tests {
assert_eq!(res["my_ids"]["buckets"][0]["key"], 1337.0);
assert_eq!(res["my_ids"]["buckets"][0]["doc_count"], 4);
assert_eq!(res["my_ids"]["buckets"][1]["key"], 1.0);
-assert_eq!(res["my_ids"]["buckets"][1]["doc_count"], 3);
+assert_eq!(res["my_ids"]["buckets"][1]["doc_count"], 2);
assert_eq!(res["my_ids"]["buckets"][2]["key"], serde_json::Value::Null);
Ok(())
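The lowered doc counts above come from the new per-document dedup. An illustrative sketch of the approach described in the commit message (names and signatures are mine, not tantivy's internals): skip the work when no doc_id repeats in the block, otherwise sort values within each doc group so non-adjacent duplicates become adjacent, then dedup the (doc_id, value) pairs:

```rust
// Illustrative sketch, not tantivy's actual internals.
fn dedup_docid_val_pairs(pairs: &mut Vec<(u32, u64)>) {
    // Fast path: no consecutive equal doc_ids means no multi-value
    // entries in this block, so there is nothing to deduplicate.
    if pairs.windows(2).all(|w| w[0].0 != w[1].0) {
        return;
    }
    // Doc ids arrive grouped and ascending, so sorting by (doc, val)
    // preserves doc order while making duplicate values adjacent.
    pairs.sort_unstable_by_key(|&(doc, val)| (doc, val));
    pairs.dedup();
}

fn main() {
    // Doc 0 lists value 5 twice, non-adjacently: it must count once.
    let mut pairs = vec![(0u32, 5u64), (0, 3), (0, 5), (1, 2)];
    dedup_docid_val_pairs(&mut pairs);
    assert_eq!(pairs, vec![(0, 3), (0, 5), (1, 2)]);
}
```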

View File

@@ -167,6 +167,7 @@ impl CompositeFile {
.map(|byte_range| self.data.slice(byte_range.clone()))
}
+/// Returns the space usage per field in this composite file.
pub fn space_usage(&self, schema: &Schema) -> PerFieldSpaceUsage {
let mut fields = Vec::new();
for (&field_addr, byte_range) in &self.offsets_index {

View File

@@ -649,9 +649,6 @@ impl SegmentUpdater {
merge_operation.segment_ids(),
advance_deletes_err
);
-assert!(!cfg!(test), "Merge failed.");
// ... cancel merge
// `merge_operations` are tracked. As it is dropped, the
// the segment_ids will be available again for merge.
return Err(advance_deletes_err);
@@ -719,7 +716,7 @@ mod tests {
// Regression test: -(max_doc as i32) overflows for max_doc >= 2^31.
// Using std::cmp::Reverse avoids this.
let inventory = SegmentMetaInventory::default();
-let mut metas = vec![
+let mut metas = [
inventory.new_segment_meta(SegmentId::generate_random(), 100),
inventory.new_segment_meta(SegmentId::generate_random(), (1u32 << 31) - 1),
inventory.new_segment_meta(SegmentId::generate_random(), 50_000),

View File

@@ -48,8 +48,7 @@ impl BinarySerializable for TermInfoBlockMeta {
}
impl FixedSize for TermInfoBlockMeta {
-const SIZE_IN_BYTES: usize =
-u64::SIZE_IN_BYTES + TermInfo::SIZE_IN_BYTES + 3 * u8::SIZE_IN_BYTES;
+const SIZE_IN_BYTES: usize = u64::SIZE_IN_BYTES + TermInfo::SIZE_IN_BYTES + 3;
}
impl TermInfoBlockMeta {

View File

@@ -553,7 +553,7 @@ impl FixedSize for BlockAddrBlockMetadata {
const SIZE_IN_BYTES: usize = u64::SIZE_IN_BYTES
+ BlockStartAddr::SIZE_IN_BYTES
+ 2 * u32::SIZE_IN_BYTES
-+ 2 * u8::SIZE_IN_BYTES
++ 2
+ u16::SIZE_IN_BYTES;
}
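The simplification in the two hunks above relies on a `u8` serializing to exactly one byte, so `2 * u8::SIZE_IN_BYTES` and `2` denote the same size. A minimal sketch of the `FixedSize` idea (the trait name mirrors tantivy's; the impls here are illustrative, not the crate's):

```rust
// Illustrative sketch of a FixedSize-style trait: each type reports its
// serialized width as an associated constant.
trait FixedSize {
    const SIZE_IN_BYTES: usize;
}
impl FixedSize for u8 {
    const SIZE_IN_BYTES: usize = 1; // a u8 always serializes to one byte
}
impl FixedSize for u64 {
    const SIZE_IN_BYTES: usize = 8;
}

fn main() {
    // `2 * u8::SIZE_IN_BYTES` collapses to the literal 2.
    assert_eq!(2 * <u8 as FixedSize>::SIZE_IN_BYTES, 2);
    assert_eq!(<u64 as FixedSize>::SIZE_IN_BYTES + 2, 10);
}
```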