Files
tantivy/columnar/src
nuri 3859cc8699 fix: deduplicate doc counts in term aggregation for multi-valued fields (#2854)
* fix: deduplicate doc counts in term aggregation for multi-valued fields

Term aggregation was counting term occurrences instead of documents
for multi-valued fields. A document with the same value appearing
multiple times would inflate doc_count.

Add `fetch_block_with_missing_unique_per_doc` to ColumnBlockAccessor
that deduplicates (doc_id, value) pairs, and use it in term aggregation.

Fixes #2721

* refactor: only deduplicate for multivalue cardinality

Duplicates can only occur with multivalue columns, so narrow the
check from !is_full() to is_multivalue().

* fix: handle non-consecutive duplicate values in dedup

Sort values within each doc_id group before deduplicating, so that
non-adjacent duplicates are correctly handled.

Add unit tests for dedup_docid_val_pairs: consecutive duplicates,
non-consecutive duplicates, multi-doc groups, no duplicates, and
single element.

* perf: skip dedup when block has no multivalue entries

Add early return when no consecutive doc_ids are equal, avoiding
unnecessary sort and dedup passes. Remove the 2-element swap
optimization as it is not needed by the dedup algorithm.

---------

Co-authored-by: nryoo <nryoo@nryooui-MacBookPro.local>
2026-03-24 02:02:30 +01:00
..
2025-09-19 18:04:25 +02:00
2026-03-18 17:28:59 +01:00
2025-09-19 18:04:25 +02:00
2025-04-18 04:56:31 +02:00
2024-11-29 16:08:21 +08:00
2026-03-18 17:28:59 +01:00
2024-10-10 09:55:37 +08:00
2023-01-07 17:37:00 +09:00