The early return for `scorers.len() == 1` in `scorer_union` short-circuits a single TermScorer into `SpecializedScorer::Other`, bypassing the `TermUnion` path that enables block-max WAND (BMW) in `for_each_pruning`.
This was originally addressed in PR #2898 (backed out), which added a special case in `BooleanWeight::for_each_pruning`. PR #2912 (merged as d27ca164a) added a single-scorer fast path inside `block_wand` itself, but did not remove this early return — so a single SHOULD TermScorer still never reaches the BMW path.
Removing the early return lets a single TermScorer with freq reading flow through to `SpecializedScorer::TermUnion`, where `block_wand` → `block_wand_single_scorer` handles it efficiently.
* Optimizing top K using Adrien Grand's ideas
https://jpountz.github.io/2025/08/28/compiled-vs-vectorized-search-engine-edition.html
* Suffix-sum pruning for multi-term intersection candidates
After scoring each secondary in Phase 2, check whether remaining
secondaries' block_max scores can still beat the threshold. Skip
to the next candidate early if impossible, avoiding expensive seeks
into later secondaries.
Improves three-term intersection by ~8% on the balanced benchmark
while keeping two-term performance neutral.
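The pruning idea above can be sketched in a few lines (illustrative Rust; `suffix_sums` and `can_still_beat` are hypothetical names, not the actual intersection code): precompute suffix sums of the remaining secondaries' block-max scores, and after scoring secondary `i`, skip the candidate if even the maximal remaining contribution cannot beat the threshold.

```rust
// Suffix sums over block-max scores: sums[i] is the most that secondaries
// i..n can still contribute to a candidate's score.
fn suffix_sums(block_max_scores: &[f32]) -> Vec<f32> {
    let mut sums = vec![0.0; block_max_scores.len() + 1];
    for i in (0..block_max_scores.len()).rev() {
        sums[i] = sums[i + 1] + block_max_scores[i];
    }
    sums
}

/// After accumulating `score_so_far` from the primary and secondaries 0..i,
/// decide whether it is still possible to beat `threshold`.
fn can_still_beat(score_so_far: f32, suffix: &[f32], i: usize, threshold: f32) -> bool {
    score_so_far + suffix[i] > threshold
}

fn main() {
    let block_max = [1.5_f32, 0.8, 0.3];
    let suffix = suffix_sums(&block_max);
    // After scoring the first secondary, at most 0.8 + 0.3 = 1.1 can be added.
    assert!(!can_still_beat(2.0, &suffix, 1, 3.5)); // 2.0 + 1.1 <= 3.5: skip early
    assert!(can_still_beat(2.0, &suffix, 1, 3.0)); // 2.0 + 1.1 > 3.0: keep scoring
}
```

The suffix sums cost one pass per block but let Phase 2 abandon a candidate before any further seeks into later secondaries.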
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Claude CR comment
* Removed 16 term scorer limit.
---------
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a document has the exact registered facet path (not a child),
compute_collapse_mapping_one maps it to a sentinel (u64::MAX, 0).
Without filtering, harvest() passes u64::MAX to ord_to_term, which
resolves to the last dictionary entry, producing a spurious facet
from an unrelated branch.
Skip entries where facet_ord == u64::MAX in harvest().
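A minimal sketch of that filter (names follow the description above and are hypothetical, not the exact tantivy code): exact-path matches were mapped to the sentinel `(u64::MAX, 0)`, so harvest() must drop those entries instead of resolving the ordinal.

```rust
// Sentinel facet ordinal used for exact-path (non-child) matches.
const COLLAPSE_SENTINEL: u64 = u64::MAX;

// Filter out sentinel entries before ordinals are resolved through the
// term dictionary; otherwise u64::MAX resolves to the last dictionary entry.
fn harvest(entries: Vec<(u64, u64)>) -> Vec<(u64, u64)> {
    entries
        .into_iter()
        .filter(|&(facet_ord, _count)| facet_ord != COLLAPSE_SENTINEL)
        .collect()
}

fn main() {
    let entries = vec![(3, 10), (u64::MAX, 2), (7, 1)];
    // The sentinel entry is dropped rather than resolved to a spurious facet.
    assert_eq!(harvest(entries), vec![(3, 10), (7, 1)]);
}
```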
Closes #2494
Signed-off-by: majiayu000 <1835304752@qq.com>
## Bug Overview
Under certain conditions, a `terms` aggregation request can cause a
bounds-check panic. Those conditions are:
- The queried field must be a text field
- There must be a segment where the number of distinct terms in its
dictionary for the queried field is divisible by 64 (i.e. where
`count(term_dict.keys) % 64 == 0`)
- That same segment must contain at least one document that does not
contain this field.
- The request must contain a `missing` key that is a string.
- The request must contain an `include` or `exclude` filter.
For example:
```json
{
  "my_bool": {
    "terms": {
      "field": "title",
      "include": "foo",
      "missing": "__NULL__"
    }
  }
}
```
Check out the added tests in `src/aggregation/bucket/term_agg.rs` to see
this in action.
## How the bug happens
### Preparation
While preparing the aggregation nodes:
1) When we've provided a `missing` key, we derive a missing sentinel.
For string keys this is the column's max value (which for string columns
is always the number of terms in this segment) plus 1.
2) For string columns only, we optionally prepare an "allowed" `BitSet`
of allowed term ids (`build_allowed_term_ids_for_str` in
`src/aggregation/agg_data.rs`)
- If no `include` or `exclude` filter is provided, this just returns
`None`, causing this check to be skipped down the line
- Otherwise the bitset is sized to hold exactly the number of terms in
the segment's term dictionary, and the bits are set to signify which
terms are to be included in the results.
### Collection
If we have an "allowed" `BitSet`, we filter documents against it: for
each document, we check whether the `BitSet` contains the document's
term id. For documents without the field, this is the missing sentinel
we derived earlier, minus 1 (to account for zero-based indexing):
`(num_terms + 1) - 1`. However, the `BitSet`'s size is only `num_terms`.
Normally this slips by without a problem, but if `num_terms % 64 == 0`,
it causes a panic.
### Why `BitSet` panics
`BitSet` is represented under the hood by a boxed slice of `u64`s. When
you go to check a bit using `BitSet::contains`, it must determine which
of those `u64`s the bit is in, and then the position within that `u64`
of the bit.
In cases where the number of terms is not divisible by 64, the `BitSet`
must waste some bits. When we then look up the missing sentinel's bit,
it happens to be one of those wasted bits, whose value `BitSet` happily
returns. For example, if the number of terms was 63:
```rust
let bitset_init_size = 63; // BitSet's boxed slice has length 1, holding 64 bits for term ids [0, 62]
let missing_sentinel = 63; // num_terms + 1 - 1
let word_pos = missing_sentinel / 64; // 0 - within the valid slice
let bit_pos = missing_sentinel % 64; // 63 - hits the one wasted bit
```
But if the number of terms is indeed divisible by 64, then the `BitSet`
is perfectly aligned to the `u64` word boundary:
```rust
let bitset_init_size = 64; // BitSet's boxed slice has length 1, holding 64 bits for term ids [0, 63]
let missing_sentinel = 64; // num_terms + 1 - 1
let word_pos = missing_sentinel / 64; // 1 - index 1 >= slice length 1
let bit_pos = missing_sentinel % 64; // 0
```
We try to access a word outside the bounds of the boxed slice, and the
bounds check panics.
## Fixing it
The fix is simple. If we need to account for the missing sentinel,
initialize the `BitSet` with capacity for one more bit.
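A toy reproduction of the mechanism and the fix (illustrative; `ToyBitSet` is not tantivy's actual `BitSet`, but it rounds its backing slice up to whole `u64` words the same way):

```rust
// Minimal bitset backed by a boxed-slice-like Vec<u64>, mirroring the layout
// described above.
struct ToyBitSet {
    words: Vec<u64>,
}

impl ToyBitSet {
    fn with_capacity(num_bits: usize) -> Self {
        // ceil(num_bits / 64) words, same rounding as the real BitSet.
        ToyBitSet { words: vec![0u64; (num_bits + 63) / 64] }
    }

    fn contains(&self, bit: usize) -> bool {
        // Indexing panics when `bit / 64` falls outside `words`.
        self.words[bit / 64] & (1u64 << (bit % 64)) != 0
    }
}

fn main() {
    // Silence the default panic message; we expect one controlled panic below.
    std::panic::set_hook(Box::new(|_| {}));

    let num_terms = 64;
    let missing_sentinel = num_terms; // num_terms + 1 - 1

    // Buggy sizing: capacity == num_terms, so word index 1 is out of bounds.
    let buggy = ToyBitSet::with_capacity(num_terms);
    assert!(std::panic::catch_unwind(|| buggy.contains(missing_sentinel)).is_err());

    // Fixed sizing: one extra bit keeps the sentinel lookup in bounds.
    let fixed = ToyBitSet::with_capacity(num_terms + 1);
    assert!(!fixed.contains(missing_sentinel));
}
```

With `num_terms = 63` the buggy sizing happens to work, because the sentinel lands in the wasted bit of word 0; only the 64-aligned case pushes the lookup into a nonexistent word.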
## Tests
- Added a bunch of unit tests that hit these conditions. I ensured they
failed without the fix, and that they now pass.
- All unit tests pass with the fix in place
## Other
- The investigation that led to finding this bug began with
https://github.com/paradedb/paradedb/issues/4746.
* Fix O(2^n) query parser regression for deeply-nested queries
The top-level `ast()` parser used `alt((boolean_expr, single_leaf))` at
every group level. When the group contained a single leaf with no
trailing operand, `boolean_expr` would parse `occur_leaf` (recursing
into the inner group), fail at `multispace1`, backtrack, and then
`single_leaf` would re-parse `occur_leaf` from scratch. Every nesting
level doubled the work, giving O(2^n) time for queries like
`(((((title:test)))))`.
Parse `occur_leaf` once and peek ahead for a trailing operand instead
of backtracking. This keeps parsing O(n) and also avoids the duplicate
parse for simple single-leaf queries.
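The shape of the fix can be shown on a toy grammar (illustrative Rust over nested parens and `AND`; not the real nom-based parser). The old structure was `alt((leaf WS tail, leaf))`: on a bare `(leaf)` group the first branch parses the whole nested leaf, fails at the whitespace, backtracks, and the second branch re-parses it, doubling the work at every nesting level. Parsing the leaf once and peeking for a tail keeps it linear:

```rust
// expr := leaf (" AND " expr)?   -- parse the leaf once, then peek for a tail.
fn parse_expr(input: &str) -> Option<(&str, String)> {
    // Parse the operand exactly once...
    let (rest, leaf) = parse_leaf(input)?;
    // ...then peek ahead for a trailing operand instead of backtracking.
    if let Some(tail) = rest.strip_prefix(" AND ") {
        if let Some((rest2, right)) = parse_expr(tail) {
            return Some((rest2, format!("({leaf} AND {right})")));
        }
    }
    Some((rest, leaf))
}

// leaf := "(" expr ")" | bare term (alphanumeric and ':').
fn parse_leaf(input: &str) -> Option<(&str, String)> {
    if let Some(inner) = input.strip_prefix('(') {
        let (rest, expr) = parse_expr(inner)?;
        Some((rest.strip_prefix(')')?, expr))
    } else {
        let end = input
            .find(|c: char| !c.is_alphanumeric() && c != ':')
            .unwrap_or(input.len());
        (end > 0).then(|| (&input[end..], input[..end].to_string()))
    }
}

fn main() {
    // Deep nesting stays linear: each level parses its leaf exactly once.
    let deep = format!("{}title:test{}", "(".repeat(60), ")".repeat(60));
    assert_eq!(parse_expr(&deep), Some(("", "title:test".to_string())));
}
```

The real fix additionally has to preserve nom's `Err::Error` vs `Err::Failure` distinction, as described in a later commit below; this toy parser has no cut points, so plain `Option` suffices.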
Fixes #2498.
Measured on the issue reproducer (release build):
depth   before    after
20      0.87 s    <1 us
25      28.23 s   <1 us
60      (years)   ~5 us
Non-pathological queries are unaffected or slightly faster:
query                   before    after
hello                   650 ns    308 ns
a AND b AND c           1380 ns   1364 ns
title:rust AND (...)    3426 ns   3460 ns
All 53 existing grammar tests and 56 query_parser tests pass. Adds a
regression test at depth 60 that would not complete under the old
parser.
* Add ignored benchmark for nested query parsing at depth 20/21
Matches the depths from issue #2498 which reported 0.87 s / 1.72 s
under the regression. With the fix these parse in single-digit
microseconds. Runs via:
cargo test -p tantivy-query-grammar --release bench_deeply_nested \
-- --ignored --nocapture
* Propagate Err::Failure and Err::Incomplete from operand parser
`alt((boolean_expr, single_leaf))` only retried on `Err::Error` and
propagated `Err::Failure` and `Err::Incomplete`. The replacement was
catching all three with `Err(_)`, which would silently fall back to
a single leaf if any cut point were ever added to `operand_leaf` or
its descendants. Match specifically on `Err::Error` to preserve the
original `alt` semantics.
* Replace inline bench with binggan bench in benches/
Move the nested-query benchmark out of the query-grammar test module
and into a proper binggan benchmark at benches/query_parser_nested.rs,
registered as a harnessless bench in Cargo.toml. Keeps the correctness
regression test (depth 60) in place.
Run with: cargo bench --bench query_parser_nested
* Fix rustfmt import ordering in query_parser_nested bench
Runs weekly security analysis and uploads SARIF results to GitHub code
scanning. Third-party actions are pinned by commit SHA. Adds the Scorecard
badge to the README.
Based on quickwit-oss/quickwit#5969.
Same change as 26a589e for SegmentCompositeCollector: get_memory_consumption
summed across all parent_buckets on every block, scaling with outer bucket
cardinality. Pass parent_bucket_id and index the single bucket.
The previous commit pinned actions to commit SHAs but used stale
version tags (v4.2.2, v2.7.5, old nextest/cargo-llvm-cov refs).
Update to the actual current HEAD of each pinned tag:
actions/checkout v4.2.2 → v4.3.1 (34e114876b0b...)
Swatinem/rust-cache v2.7.5 → v2.9.1 (c19371144df3...)
taiki-e/install-action nextest (56cc9adf3a3e...)
taiki-e/install-action cargo-llvm-cov (e4b3a0453201...)
actions-rs/toolchain, actions-rs/clippy-check, and
codecov/codecov-action SHAs were already correct.
https://claude.ai/code/session_01VD7Bo8upj3cQwWDf9ni2Ln
Previously, coupons were computed via murmurhash32 and fed as raw u32
to the HLL sketch, bypassing the sketch's internal hashing and producing
invalid (slot, value) pairs. Switch to Coupon::from_hash from the
datasketches crate which correctly derives coupons, and drop the
now-unused murmurhash32 dependency.
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a string cardinality aggregation is nested, it ends up being applied to different buckets.
Dictionary encoding relies on a different dictionary for each segment.
As a result, during segment collection, we only collect term ordinals in a HashSet, and decode them in the
term dictionary at the end of collection.
Before this PR, this decoding phase was done once per bucket, repeating the same work over and over. This PR introduces a coupon cache. The HLL sketch relies on a hash of the string values:
we populate the cache before bucket collection, and read our values from it.
This PR also renames "caching" to "buffering" in aggregation (it was never caching), and does several cleanups.
These items need to be accessible from the tantivy-datafusion crate:
- BucketEntries::iter() for iterating aggregation bucket results
- PercentileValuesVecEntry.key/.value for reading percentile results
- TopNComputer.threshold for Block-WAND score pruning in the inverted index provider
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Paul Masurel <paul@quickwit.io>
* Use BinaryHeap for score-based top-K collection
* Use peek_mut and add proptest for TopNHeap
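A sketch of the approach (illustrative of the technique, not tantivy's actual `TopNComputer`): wrapping entries in `Reverse` turns `BinaryHeap`'s max-heap into a min-heap whose root is the current threshold, and `peek_mut` replaces the root in place instead of paying for a pop-then-push.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

struct TopK {
    k: usize,
    heap: BinaryHeap<Reverse<(u64, u32)>>, // (quantized score, doc id), min-heap
}

impl TopK {
    fn new(k: usize) -> Self {
        TopK { k, heap: BinaryHeap::with_capacity(k) }
    }

    fn push(&mut self, score: u64, doc: u32) {
        if self.heap.len() < self.k {
            self.heap.push(Reverse((score, doc)));
        } else if let Some(mut head) = self.heap.peek_mut() {
            // Overwrite the weakest entry in place; the heap sifts it down
            // when `head` (the PeekMut guard) is dropped.
            if (score, doc) > head.0 {
                *head = Reverse((score, doc));
            }
        }
    }

    /// Once full, the root is the score a candidate must beat (the threshold
    /// Block-WAND-style pruning reads).
    fn threshold(&self) -> Option<u64> {
        if self.heap.len() == self.k {
            self.heap.peek().map(|r| (r.0).0)
        } else {
            None
        }
    }
}

fn main() {
    let mut top = TopK::new(2);
    for (score, doc) in [(5, 0), (1, 1), (9, 2), (3, 3)] {
        top.push(score, doc);
    }
    let mut result: Vec<_> = top.heap.into_iter().map(|r| r.0).collect();
    result.sort();
    assert_eq!(result, vec![(5, 0), (9, 2)]);
}
```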
---------
Co-authored-by: nryoo <nryoo@nryooui-MacBookPro.local>
* fix: deduplicate doc counts in term aggregation for multi-valued fields
Term aggregation was counting term occurrences instead of documents
for multi-valued fields. A document with the same value appearing
multiple times would inflate doc_count.
Add `fetch_block_with_missing_unique_per_doc` to ColumnBlockAccessor
that deduplicates (doc_id, value) pairs, and use it in term aggregation.
Fixes #2721
* refactor: only deduplicate for multivalue cardinality
Duplicates can only occur with multivalue columns, so narrow the
check from !is_full() to is_multivalue().
* fix: handle non-consecutive duplicate values in dedup
Sort values within each doc_id group before deduplicating, so that
non-adjacent duplicates are correctly handled.
Add unit tests for dedup_docid_val_pairs: consecutive duplicates,
non-consecutive duplicates, multi-doc groups, no duplicates, and
single element.
* perf: skip dedup when block has no multivalue entries
Add early return when no consecutive doc_ids are equal, avoiding
unnecessary sort and dedup passes. Remove the 2-element swap
optimization as it is not needed by the dedup algorithm.
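The dedup pass described across these commits can be sketched as follows (illustrative; `dedup_docid_val_pairs` here is a standalone version, not the exact `ColumnBlockAccessor` code): an early return when no doc_id repeats, a per-doc sort so non-adjacent duplicates become adjacent, then a single dedup.

```rust
// Deduplicate (doc_id, value) pairs so a multi-valued document counts each
// value once. Input is grouped by doc_id (collection order), but values within
// a doc group may repeat non-adjacently.
fn dedup_docid_val_pairs(pairs: &mut Vec<(u32, u64)>) {
    // Fast path: without consecutive equal doc_ids there can be no duplicates.
    if pairs.windows(2).all(|w| w[0].0 != w[1].0) {
        return;
    }
    let mut start = 0;
    while start < pairs.len() {
        let doc = pairs[start].0;
        let end = start + pairs[start..].iter().take_while(|&&(d, _)| d == doc).count();
        // Sort values inside the doc group so duplicates become adjacent.
        pairs[start..end].sort_unstable_by_key(|&(_, v)| v);
        start = end;
    }
    pairs.dedup();
}

fn main() {
    // doc 1 has value 7 twice, non-adjacently.
    let mut pairs: Vec<(u32, u64)> = vec![(1, 7), (1, 3), (1, 7), (2, 7)];
    dedup_docid_val_pairs(&mut pairs);
    assert_eq!(pairs, vec![(1, 3), (1, 7), (2, 7)]);
}
```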
---------
Co-authored-by: nryoo <nryoo@nryooui-MacBookPro.local>