The early return for `scorers.len() == 1` in `scorer_union` short-circuits a single TermScorer into `SpecializedScorer::Other`, bypassing the `TermUnion` path that enables block-max WAND (BMW) in `for_each_pruning`.
This was originally addressed in PR #2898 (backed out), which added a special case in `BooleanWeight::for_each_pruning`. PR #2912 (merged as d27ca164a) added a single-scorer fast path inside `block_wand` itself, but did not remove this early return — so a single SHOULD TermScorer still never reaches the BMW path.
Removing the early return lets a single TermScorer with freq reading flow through to `SpecializedScorer::TermUnion`, where `block_wand` → `block_wand_single_scorer` handles it efficiently.
* Optimizing top K using Adrien Grand's ideas
https://jpountz.github.io/2025/08/28/compiled-vs-vectorized-search-engine-edition.html
* Suffix-sum pruning for multi-term intersection candidates
After scoring each secondary in Phase 2, check whether remaining
secondaries' block_max scores can still beat the threshold. Skip
to the next candidate early if impossible, avoiding expensive seeks
into later secondaries.
Improves three-term intersection by ~8% on the balanced benchmark
while keeping two-term performance neutral.
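A rough sketch of the pruning check, assuming the secondaries' block maxima are available up front (function and variable names here are illustrative, not the actual tantivy code):
```rust
// Illustrative sketch of the suffix-sum pruning check. `block_max[i]` is the
// block-max score of the i-th secondary; `suffix_max[i]` is the most the
// secondaries i.. can still add to the candidate's score.
fn score_candidate(
    lead_score: f32,
    secondary_scores: &[f32], // looked up lazily via seeks in the real code
    block_max: &[f32],
    threshold: f32,
) -> Option<f32> {
    debug_assert_eq!(secondary_scores.len(), block_max.len());
    let n = block_max.len();
    // suffix_max[i] = block_max[i] + block_max[i+1] + ... + block_max[n-1]
    let mut suffix_max = vec![0.0f32; n + 1];
    for i in (0..n).rev() {
        suffix_max[i] = suffix_max[i + 1] + block_max[i];
    }

    let mut score = lead_score;
    for i in 0..n {
        // If even the block maxima of the remaining secondaries cannot push this
        // candidate past the threshold, abandon it before the next (expensive) seek.
        if score + suffix_max[i] <= threshold {
            return None;
        }
        score += secondary_scores[i];
    }
    (score > threshold).then_some(score)
}
```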
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Claude CR comment
* Removed 16 term scorer limit.
---------
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a document has the exact registered facet path (not a child),
compute_collapse_mapping_one maps it to a sentinel (u64::MAX, 0).
Without filtering, harvest() passes u64::MAX to ord_to_term which
resolves to the last dictionary entry, producing a spurious facet
from an unrelated branch.
Skip entries where facet_ord == u64::MAX in harvest().
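A minimal, self-contained sketch of the filtering idea (the function and parameter names are stand-ins, not the actual harvest() signature):
```rust
// Stand-in for harvest(): drop sentinel entries before resolving ordinals,
// so u64::MAX is never handed to the dictionary lookup.
fn harvest_counts(
    collapsed_counts: &[(u64, u64)],     // (facet_ord, count) pairs
    ord_to_term: impl Fn(u64) -> String, // term dictionary lookup stand-in
) -> Vec<(String, u64)> {
    collapsed_counts
        .iter()
        .filter(|&&(facet_ord, _)| facet_ord != u64::MAX) // exact-path sentinel
        .map(|&(facet_ord, count)| (ord_to_term(facet_ord), count))
        .collect()
}
```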
Closes #2494
Signed-off-by: majiayu000 <1835304752@qq.com>
## Bug Overview
Under certain conditions, a `terms` aggregation request can cause a
bounds-check panic. Those conditions are:
- The queried field must be a text field
- There must be a segment where the number of distinct terms in its
dictionary for the queried field is divisible by 64 (i.e. where
`count(term_dict.keys) % 64 == 0`)
- That same segment must contain at least one document that does not
contain this field.
- The request must contain a `missing` key that is a string.
- The request must contain an `include` or `exclude` filter.
For example:
```json
{
  "my_bool": {
    "terms": {
      "field": "title",
      "include": "foo",
      "missing": "__NULL__"
    }
  }
}
```
Check out the added tests in `src/aggregation/bucket/term_agg.rs` to see
this in action.
## How the bug happens
### Preparation
While preparing the aggregation nodes:
1) When a `missing` key is provided, we derive a missing sentinel.
For string keys this is the column's max value (which for string
columns is always the number of terms in the segment) + 1.
2) For string columns only, we optionally prepare an "allowed" `BitSet`
of allowed term ids (`build_allowed_term_ids_for_str` in
`src/aggregation/agg_data.rs`).
- If no `include` or `exclude` filter is provided, this just returns
`None`, causing this check to be skipped down the line
- Otherwise the bitset is initialized to hold exactly the number of
terms in the segment's term dictionary, and the bits are set to
signify which terms are to be included in the results.
### Collection
If we have an "allowed" `BitSet`, we filter documents against it. For
each document, we check whether the `BitSet` contains the document's
term id. For documents without the field, this is the missing sentinel
we derived earlier, minus 1 (to account for zero-based indexing):
`(num_terms + 1) - 1`. However, the `BitSet`'s size is only `num_terms`.
Normally, this slips by without a problem, but if `num_terms % 64 == 0`,
this will cause a panic.
### Why `BitSet` panics
`BitSet` is represented under the hood by a boxed slice of `u64`s. When
you go to check a bit using `BitSet::contains`, it must determine which
of those `u64`s the bit is in, and then the position within that `u64`
of the bit.
In cases where the number of terms is not divisible by 64, the `BitSet`
must waste some bits. When we then look up the missing sentinel's bit,
it happens to be one of those wasted bits, whose value `BitSet` is happy
to return. For example, if the number of terms were 63:
```rust
let bitset_init_size = 63; // so BitSet's boxed slice has a length of 1, capable of holding 64 bits, term id [0, 62]
let missing_sentinel = 63; // num_terms + 1 - 1;
let word_pos = missing_sentinel / 64; // 0 - within the valid slice
let bit_pos = missing_sentinel % 64; // 63 - hits the 1 wasted bit
```
But if the number of terms is indeed divisible by 64, then the `BitSet`
is perfectly aligned to the u64 word boundary, with no wasted bits:
```rust
let bitset_init_size = 64; // so BitSet's boxed slice has a length of 1, capable of holding 64 bits, term ids [0, 63]
let missing_sentinel = 64; // num_terms + 1 - 1,
let word_pos = missing_sentinel / 64; // 1 - idx 1 >= slice length 1
let bit_pos = missing_sentinel % 64; // 0
```
We try to access a word outside the bounds of the boxed slice, and the
bounds check panics.
## Fixing it
The fix is simple. If we need to account for the missing sentinel,
initialize the `BitSet` with capacity for one more bit.
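A toy illustration of the sizing change, using plain arithmetic rather than the actual `BitSet` API:
```rust
// Number of u64 words needed to hold `num_bits` bits.
fn words_needed(num_bits: usize) -> usize {
    (num_bits + 63) / 64
}

fn main() {
    let num_terms = 64;
    let missing_sentinel = num_terms; // (num_terms + 1) - 1, zero-based

    // Before the fix: sized for `num_terms` bits only.
    assert_eq!(words_needed(num_terms), 1);
    assert_eq!(missing_sentinel / 64, 1); // word index 1 >= slice length 1 -> panic

    // After the fix: one extra bit of capacity covers the sentinel.
    assert_eq!(words_needed(num_terms + 1), 2);
    assert!(missing_sentinel / 64 < words_needed(num_terms + 1)); // in bounds
}
```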
## Tests
- Added a bunch of unit tests that hit these conditions. I ensured they
failed without the fix, and that they now pass.
- All unit tests pass with the fix in place
## Other
- The investigation that led to finding this bug began with
https://github.com/paradedb/paradedb/issues/4746.
Same change as 26a589e, applied to SegmentCompositeCollector: get_memory_consumption
summed across all parent_buckets on every block, so its cost scaled with outer bucket
cardinality. Now pass parent_bucket_id and index only that single bucket.
Previously, coupons were computed via murmurhash32 and fed as raw u32
to the HLL sketch, bypassing the sketch's internal hashing and producing
invalid (slot, value) pairs. Switch to Coupon::from_hash from the
datasketches crate which correctly derives coupons, and drop the
now-unused murmurhash32 dependency.
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a string cardinality aggregation is nested, it ends up being applied to different buckets.
Dictionary encoding relies on a different dictionary for each segment.
As a result, during segment collection, we only collect term ordinals in a HashSet, and decode them in the
term dictionary at the end of collection.
Before this PR, this decoding phase was done once for each bucket, causing the same work to be done over and over. This PR introduces a coupon cache. The HLL sketch relies on a hash of the string values.
We populate the cache before bucket collection, and get our values from it.
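Roughly, the cache looks like this (a simplified sketch with made-up names, using std's hasher as a stand-in for the hash the sketch actually consumes):
```rust
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// One dictionary decode + hash per distinct term ordinal, done once per segment;
// every bucket then reads the cached hash instead of repeating the decode.
struct CouponCache {
    hash_by_term_ord: HashMap<u64, u64>,
}

fn hash64(bytes: &[u8]) -> u64 {
    let mut hasher = std::collections::hash_map::DefaultHasher::new();
    bytes.hash(&mut hasher);
    hasher.finish()
}

impl CouponCache {
    fn populate(term_ords: &[u64], decode_term: impl Fn(u64) -> String) -> Self {
        let hash_by_term_ord = term_ords
            .iter()
            .map(|&ord| (ord, hash64(decode_term(ord).as_bytes())))
            .collect();
        CouponCache { hash_by_term_ord }
    }

    fn hash(&self, term_ord: u64) -> Option<u64> {
        self.hash_by_term_ord.get(&term_ord).copied()
    }
}
```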
This PR also renames "caching" to "buffering" in aggregation (it was never caching), and does several cleanups.
These items need to be accessible from the tantivy-datafusion crate:
- BucketEntries::iter() for iterating aggregation bucket results
- PercentileValuesVecEntry.key/.value for reading percentile results
- TopNComputer.threshold for Block-WAND score pruning in the inverted index provider
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Paul Masurel <paul@quickwit.io>
* Use BinaryHeap for score-based top-K collection
* Use peek_mut and add proptest for TopNHeap
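The pattern, roughly (names and types are illustrative; the real TopNHeap stores score/doc pairs):
```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// A min-heap of at most K entries: once full, `peek_mut` replaces the current
// minimum in place (it sifts down when dropped) instead of a pop + push.
fn top_k(scores: impl IntoIterator<Item = u64>, k: usize) -> Vec<u64> {
    let mut heap: BinaryHeap<Reverse<u64>> = BinaryHeap::with_capacity(k);
    for score in scores {
        if heap.len() < k {
            heap.push(Reverse(score));
        } else if let Some(mut min) = heap.peek_mut() {
            if score > min.0 {
                *min = Reverse(score);
            }
        }
    }
    // Largest scores first.
    heap.into_sorted_vec().into_iter().map(|r| r.0).collect()
}
```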
---------
Co-authored-by: nryoo <nryoo@nryooui-MacBookPro.local>
* fix: deduplicate doc counts in term aggregation for multi-valued fields
Term aggregation was counting term occurrences instead of documents
for multi-valued fields. A document with the same value appearing
multiple times would inflate doc_count.
Add `fetch_block_with_missing_unique_per_doc` to ColumnBlockAccessor
that deduplicates (doc_id, value) pairs, and use it in term aggregation.
Fixes #2721
* refactor: only deduplicate for multivalue cardinality
Duplicates can only occur with multivalue columns, so narrow the
check from !is_full() to is_multivalue().
* fix: handle non-consecutive duplicate values in dedup
Sort values within each doc_id group before deduplicating, so that
non-adjacent duplicates are correctly handled.
Add unit tests for dedup_docid_val_pairs: consecutive duplicates,
non-consecutive duplicates, multi-doc groups, no duplicates, and
single element.
* perf: skip dedup when block has no multivalue entries
Add early return when no consecutive doc_ids are equal, avoiding
unnecessary sort and dedup passes. Remove the 2-element swap
optimization as it is not needed by the dedup algorithm.
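Putting the pieces together, the dedup step looks roughly like this (an illustrative stand-in, not the actual ColumnBlockAccessor code):
```rust
// Within each doc_id group, sort the values and drop repeated (doc_id, value)
// pairs, bailing out early when no doc_id appears twice (i.e. the block has no
// multi-valued entries).
fn dedup_docid_val_pairs(pairs: &mut Vec<(u32, u64)>) {
    // Early return: no two consecutive doc_ids are equal -> nothing to dedup.
    if pairs.windows(2).all(|w| w[0].0 != w[1].0) {
        return;
    }
    // Pairs arrive grouped by doc_id; sorting values inside each group makes
    // non-adjacent duplicates adjacent so `dedup` can remove them.
    let mut start = 0;
    while start < pairs.len() {
        let doc = pairs[start].0;
        let mut end = start + 1;
        while end < pairs.len() && pairs[end].0 == doc {
            end += 1;
        }
        pairs[start..end].sort_unstable_by_key(|&(_, val)| val);
        start = end;
    }
    pairs.dedup();
}
```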
---------
Co-authored-by: nryoo <nryoo@nryooui-MacBookPro.local>
Keep use_serde on sketches-ddsketch so DDSketch derives
Serialize/Deserialize, removing the need for custom impls
on PercentilesCollector.
Co-authored-by: Cursor <cursoragent@cursor.com>
Replace the derived Serialize/Deserialize on PercentilesCollector with
custom impls that use DDSketch's Java-compatible binary encoding
(encode_to_java_bytes / decode_from_java_bytes). This removes the need
for the use_serde feature on sketches-ddsketch entirely.
Also restore original float test values and use assert_nearly_equals!
for all float comparisons in percentile tests, since DDSketch quantile
estimates can have minor precision differences across platforms.
Co-authored-by: Cursor <cursoragent@cursor.com>
Address review feedback: replace assert_eq! with assert_nearly_equals!
for float values that go through JSON serialization roundtrips, which
can introduce minor precision differences.
Co-authored-by: Cursor <cursoragent@cursor.com>
Fork sketches-ddsketch as a workspace member to add native Java binary
serialization (to_java_bytes/from_java_bytes) for DDSketch. This enables
pomsky to return raw DDSketch bytes that event-query can deserialize via
DDSketchWithExactSummaryStatistics.decode().
Key changes:
- Vendor sketches-ddsketch crate with encoding.rs implementing VarEncoding,
flag bytes, and INDEX_DELTAS_AND_COUNTS store format
- Align Config::key() to floor-based indexing matching Java's LogarithmicMapping
- Add PercentilesCollector::to_sketch_bytes() for pomsky integration
- Cross-language golden byte tests verified byte-identical with Java output
Co-authored-by: Cursor <cursoragent@cursor.com>
Switch tantivy's cardinality aggregation from the hyperloglogplus crate
(HyperLogLog++ with p=16) to the official Apache DataSketches HLL
implementation (datasketches crate v0.2.0 with lg_k=11, Hll4).
This enables returning raw HLL sketch bytes from pomsky to Datadog's
event query, where they can be properly deserialized and merged using
the same DataSketches library (Java). The previous implementation
required pomsky to fabricate fake HLL sketches from scalar cardinality
estimates, which produced incorrect results when merged.
Changes:
- Cargo.toml: hyperloglogplus 0.4.1 -> datasketches 0.2.0
- CardinalityCollector: HyperLogLogPlus<u64, BuildSaltedHasher> -> HllSketch
- Custom Serde impl using HllSketch binary format (cross-shard compat)
- New to_sketch_bytes() for external consumers (pomsky)
- Salt preserved via (salt, value) tuple hashing for column type disambiguation
- Removed BuildSaltedHasher struct
- Added 4 new unit tests (serde roundtrip, merge, binary compat, salt)
Otherwise, there is no way to access these fields when not using the
JSON-serialized form of the aggregation results.
This simple data struct is part of the public API,
so its fields should be accessible as well.
## What
Enable range queries and TopN sorting on `Bytes` fast fields, bringing them to parity with `Str` fields.
## Why
`BytesColumn` uses the same dictionary encoding as `StrColumn` internally, but range queries and TopN sorting were explicitly disabled for `Bytes`. This prevented use cases like storing lexicographically sortable binary data (e.g., arbitrary-precision decimals) that need efficient range filtering.
## How
1. **Enable range queries for Bytes** - Changed `is_type_valid_for_fastfield_range_query()` to return `true` for `Type::Bytes`
2. **Add BytesColumn handling in scorer** - Added a branch in `FastFieldRangeWeight::scorer()` to handle bytes fields using dictionary ordinal lookup (mirrors the existing `StrColumn` logic)
3. **Add SortByBytes** - New sort key computer for TopN queries on bytes columns
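The dictionary-ordinal lookup in step 2 can be sketched as follows (a simplified, self-contained illustration; the real code resolves bounds against the segment's term dictionary and handles inclusive/exclusive and unbounded ends):
```rust
// Translate byte-range bounds into a term-ordinal range once; documents are then
// matched by comparing their stored ordinal against this range, exactly as for
// Str columns. (Inclusive bounds shown; a sorted slice stands in for the dictionary.)
fn ord_range_for_bytes(sorted_terms: &[&[u8]], lower: &[u8], upper: &[u8]) -> std::ops::Range<u64> {
    let start = sorted_terms.partition_point(|t| *t < lower) as u64;
    let end = sorted_terms.partition_point(|t| *t <= upper) as u64;
    start..end
}
```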
## Tests
- `test_bytes_field_ff_range_query` - Tests inclusive/exclusive bounds and unbounded ranges
- `test_sort_by_bytes_asc` / `test_sort_by_bytes_desc` - Tests lexicographic ordering in both directions
* faster exclude queries
Faster exclude queries with multiple terms.
Changes `Exclude` to be able to exclude multiple DocSets, instead of
putting the docsets into a union.
Use `seek_danger` in `Exclude`.
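A toy sketch of the multi-DocSet exclusion (the real code seeks DocSets, using `seek_danger`, rather than binary-searching slices):
```rust
// A candidate doc survives only if none of the excluded sets contains it,
// checked set by set instead of first merging the excluded sets into a union.
fn is_excluded(doc: u32, excluded_sets: &[&[u32]]) -> bool {
    excluded_sets
        .iter()
        .any(|set| set.binary_search(&doc).is_ok())
}
```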
closes #2822
* replace unwrap with match
The intersection algorithm made it possible for .seek(..) to be called with values
lower than the current doc id, breaking the DocSet contract.
The fix removes the optimization that caused left.seek(..) to be replaced
by a simpler left.advance(..).
Simply doing so led to a performance regression.
I therefore integrated that idea within SegmentPostings.seek.
We now attempt to check the next doc systematically on seek,
PROVIDED the block is already loaded.
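A toy, self-contained sketch of that seek fast path (the real SegmentPostings works on compressed blocks and skip data; the names here are illustrative):
```rust
struct ToyPostings {
    block: Vec<u32>, // docs of the currently loaded block, sorted ascending
    cursor: usize,
}

impl ToyPostings {
    fn doc(&self) -> u32 {
        self.block.get(self.cursor).copied().unwrap_or(u32::MAX) // MAX ~ exhausted
    }

    fn seek(&mut self, target: u32) -> u32 {
        // DocSet contract: seeking to a target at or before the current doc must
        // never move the cursor backwards.
        if self.doc() >= target {
            return self.doc();
        }
        // Fast path: the block is already decoded and the very next doc reaches
        // the target, so a single step forward is enough.
        if self.cursor + 1 < self.block.len() && self.block[self.cursor + 1] >= target {
            self.cursor += 1;
            return self.doc();
        }
        // Fallback: regular search within (and, in the real code, across) blocks.
        self.cursor += self.block[self.cursor..].partition_point(|&d| d < target);
        self.doc()
    }
}
```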
Closes #2811
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>