mirror of
https://github.com/quickwit-oss/tantivy.git
synced 2026-05-24 04:00:40 +00:00
* fix: deduplicate doc counts in term aggregation for multi-valued fields

Term aggregation was counting term occurrences instead of documents for
multi-valued fields. A document with the same value appearing multiple
times would inflate doc_count.

Add `fetch_block_with_missing_unique_per_doc` to ColumnBlockAccessor that
deduplicates (doc_id, value) pairs, and use it in term aggregation.

Fixes #2721

* refactor: only deduplicate for multivalue cardinality

Duplicates can only occur with multivalue columns, so narrow the check
from !is_full() to is_multivalue().

* fix: handle non-consecutive duplicate values in dedup

Sort values within each doc_id group before deduplicating, so that
non-adjacent duplicates are correctly handled.

Add unit tests for dedup_docid_val_pairs: consecutive duplicates,
non-consecutive duplicates, multi-doc groups, no duplicates, and single
element.

* perf: skip dedup when block has no multivalue entries

Add early return when no consecutive doc_ids are equal, avoiding
unnecessary sort and dedup passes.

Remove the 2-element swap optimization as it is not needed by the dedup
algorithm.

---------

Co-authored-by: nryoo <nryoo@nryooui-MacBookPro.local>
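
Illustrative sketch (not the code from this change): the dedup steps
described above, written as a standalone function. The name
`dedup_pairs_sketch`, the `u32`/`u64` element types, and the two-parallel-
vectors layout are assumptions made for the example; the actual signature of
`dedup_docid_val_pairs` is not shown in this message. The sketch also assumes
that entries belonging to the same document are contiguous in `doc_ids`.

```rust
/// Hypothetical sketch of per-document (doc_id, value) deduplication.
/// Assumes `doc_ids` and `vals` are parallel vectors of equal length and
/// that all entries for one document sit next to each other.
pub fn dedup_pairs_sketch(doc_ids: &mut Vec<u32>, vals: &mut Vec<u64>) {
    assert_eq!(doc_ids.len(), vals.len());

    // Early return: if no two neighbouring doc_ids are equal, every document
    // contributes at most one value and there is nothing to deduplicate.
    if !doc_ids.windows(2).any(|w| w[0] == w[1]) {
        return;
    }

    let len = doc_ids.len();
    let mut write = 0; // next slot for a kept (doc_id, value) pair
    let mut start = 0; // first index of the current doc_id group

    while start < len {
        let doc = doc_ids[start];

        // Find the end (exclusive) of the run of entries for this document.
        let mut end = start + 1;
        while end < len && doc_ids[end] == doc {
            end += 1;
        }

        // Sort the group's values so non-adjacent duplicates become adjacent.
        vals[start..end].sort_unstable();

        // Keep only the first occurrence of each value within the group.
        let mut prev: Option<u64> = None;
        for read in start..end {
            let v = vals[read];
            if prev != Some(v) {
                doc_ids[write] = doc;
                vals[write] = v;
                write += 1;
                prev = Some(v);
            }
        }

        start = end;
    }

    doc_ids.truncate(write);
    vals.truncate(write);
}
```

In this sketch, a block with doc_ids `[1, 1, 1, 2, 3, 3]` and values
`[5, 7, 5, 9, 4, 4]` is reduced to doc_ids `[1, 1, 2, 3]` and values
`[5, 7, 9, 4]`, so each (document, value) pair is counted once. A side effect
of the sort is that values within a document come out in ascending order.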