tantivy

mirror of https://github.com/quickwit-oss/tantivy.git synced 2026-05-24 04:00:40 +00:00

Files

nuri 3859cc8699 fix: deduplicate doc counts in term aggregation for multi-valued fields (#2854)

* fix: deduplicate doc counts in term aggregation for multi-valued fields

Term aggregation was counting term occurrences instead of documents
for multi-valued fields. A document with the same value appearing
multiple times would inflate doc_count.

Add `fetch_block_with_missing_unique_per_doc` to ColumnBlockAccessor
that deduplicates (doc_id, value) pairs, and use it in term aggregation.

Fixes #2721

* refactor: only deduplicate for multivalue cardinality

Duplicates can only occur with multivalue columns, so narrow the
check from !is_full() to is_multivalue().

* fix: handle non-consecutive duplicate values in dedup

Sort values within each doc_id group before deduplicating, so that
non-adjacent duplicates are correctly handled.

Add unit tests for dedup_docid_val_pairs: consecutive duplicates,
non-consecutive duplicates, multi-doc groups, no duplicates, and
single element.

* perf: skip dedup when block has no multivalue entries

Add early return when no consecutive doc_ids are equal, avoiding
unnecessary sort and dedup passes. Remove the 2-element swap
optimization as it is not needed by the dedup algorithm.
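
The dedup steps described in the commits above (early return, per-doc sort, then removal of repeated pairs) can be sketched roughly as follows. This is an illustration under stated assumptions, not tantivy's actual code: only the name `dedup_docid_val_pairs` comes from the commit message, while the signature (parallel `Vec<u32>` doc ids and `Vec<u64>` values, with equal doc ids already adjacent) is hypothetical.

```rust
/// Hypothetical sketch: removes duplicate (doc_id, value) pairs in place.
/// Assumes `doc_ids` and `vals` are parallel and that equal doc ids are adjacent.
fn dedup_docid_val_pairs(doc_ids: &mut Vec<u32>, vals: &mut Vec<u64>) {
    assert_eq!(doc_ids.len(), vals.len());

    // Early return: if no two neighbouring doc ids are equal, each doc has
    // at most one value in this block, so no duplicates are possible.
    if !doc_ids.windows(2).any(|w| w[0] == w[1]) {
        return;
    }

    // Sort the values inside each run of equal doc ids, so that
    // non-adjacent duplicates become adjacent.
    let mut start = 0;
    while start < doc_ids.len() {
        let mut end = start + 1;
        while end < doc_ids.len() && doc_ids[end] == doc_ids[start] {
            end += 1;
        }
        vals[start..end].sort_unstable();
        start = end;
    }

    // Compact in place: keep a pair only if it differs from the last kept pair.
    let mut write = 0;
    for read in 0..doc_ids.len() {
        let is_new = write == 0
            || doc_ids[read] != doc_ids[write - 1]
            || vals[read] != vals[write - 1];
        if is_new {
            doc_ids[write] = doc_ids[read];
            vals[write] = vals[read];
            write += 1;
        }
    }
    doc_ids.truncate(write);
    vals.truncate(write);
}
```

With this sketch, a document that carries the same value several times contributes one pair per distinct value, so a term's count reflects documents rather than occurrences.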

---------

Co-authored-by: nryoo <nryoo@nryooui-MacBookPro.local>

2026-03-24 02:02:30 +01:00

column

fix Column.first method parameter type (#2792)

2026-01-05 10:03:01 +01:00

column_index

clippy (#2700)

2025-09-19 18:04:25 +02:00

column_values

Composite agg merge (#2856)

2026-03-18 17:28:59 +01:00

columnar

clippy (#2700)