After scoring each secondary in Phase 2, check whether remaining
secondaries' block_max scores can still beat the threshold. Skip
to the next candidate early if impossible, avoiding expensive seeks
into later secondaries.
Improves three-term intersection by ~8% on the balanced benchmark
while keeping two-term performance neutral.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## What
Enable range queries and TopN sorting on `Bytes` fast fields, bringing them to parity with `Str` fields.
## Why
`BytesColumn` uses the same dictionary encoding as `StrColumn` internally, but range queries and TopN sorting were explicitly disabled for `Bytes`. This prevented use cases like storing lexicographically sortable binary data (e.g., arbitrary-precision decimals) that need efficient range filtering.
## How
1. **Enable range queries for Bytes** - Changed `is_type_valid_for_fastfield_range_query()` to return `true` for `Type::Bytes`
2. **Add BytesColumn handling in scorer** - Added a branch in `FastFieldRangeWeight::scorer()` to handle bytes fields using dictionary ordinal lookup (mirrors the existing `StrColumn` logic)
3. **Add SortByBytes** - New sort key computer for TopN queries on bytes columns
## Tests
- `test_bytes_field_ff_range_query` - Tests inclusive/exclusive bounds and unbounded ranges
- `test_sort_by_bytes_asc` / `test_sort_by_bytes_desc` - Tests lexicographic ordering in both directions
* faster exclude queries
Faster exclude queries with multiple terms.
Changes `Exclude` to be able to exclude multiple DocSets, instead of
putting the docsets into a union.
Use `seek_danger` in `Exclude`.
closes#2822
* replace unwrap with match
The intersection algorithm made it possible for .seek(..) with values
lower than the current doc id, breaking the DocSet contract.
The fix removes the optimization that caused left.seek(..) to be replaced
by a simpler left.advance(..).
Simply doing so lead to a performance regression.
I therefore integrated that idea within SegmentPostings.seek.
We now attempt to check the next doc systematically on seek,
PROVIDED the block is already loaded.
Closes#2811
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
A bug was added with the `seek_into_the_danger_zone()` optimization
(Spotted and fixed by Stu)
The contract says seek_into_the_danger_zone returns true if do is part of the docset.
The blanket implementation goes like this.
```
let current_doc = self.doc();
if current_doc < target {
self.seek(target);
}
self.doc() == target
```
So it will return true if target is TERMINATED, where really TERMINATED does not belong to the docset.
The fix tries to clarify the contracts and fixes the intersection algorithm.
We observe a small but all over the board improvement in intersection performance.
---------
Co-authored-by: Stu Hood <stuhood@gmail.com>
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
* seek_exact + cost based intersection
Adds `seek_exact` and `cost` to `DocSet` for a more efficient intersection.
Unlike `seek`, `seek_exact` does not require the DocSet to advance to the next hit, if the target does not exist.
`cost` allows to address the different DocSet types and their cost
model and is used to determine the DocSet that drives the intersection.
E.g. fast field range queries may do a full scan. Phrase queries load the positions to check if a we have a hit.
They both have a higher cost than their size_hint would suggest.
Improves `size_hint` estimation for intersection and union, by having a
estimation based on random distribution with a co-location factor.
Refactor range query benchmark.
Closes#2531
*Future Work*
Implement `seek_exact` for BufferedUnionScorer and RangeDocSet (fast field range queries)
Evaluate replacing `seek` with `seek_exact` to reduce code complexity
* Apply suggestions from code review
Co-authored-by: Paul Masurel <paul@quickwit.io>
* add API contract verfication
* impl seek_exact on union
* rename seek_exact
* add mixed AND OR test, fix buffered_union
* Add a proptest of BooleanQuery. (#2690)
* fix build
* Increase the document count.
* fix merge conflict
* fix debug assert
* Fix compilation errors after rebase
- Remove duplicate proptest_boolean_query module
- Remove duplicate cost() method implementations
- Fix TopDocs API usage (add .order_by_score())
- Remove duplicate imports
- Remove unused variable assignments
---------
Co-authored-by: Paul Masurel <paul@quickwit.io>
Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
Co-authored-by: Stu Hood <stuhood@gmail.com>
* Fixed the range issue.
* Fixed the second all scorer issue
* Improved docs + tests
* Improved code.
* Fixed lint issues.
* Improved tests + logic based on PR comments.
* Fixed lint issues.
* Increase the document count.
* Improved the prop-tests
* Expand the index size, and remove unused parameter.
---------
Co-authored-by: Stu Hood <stuhood@gmail.com>
* Refactoring of the score tweaker into `SortKeyComputer`s to unlock two features.
- Allow lazy evaluation of score. As soon as we identified that a doc won't
reach the topK threshold, we can stop the evaluation.
- Allow for a different segment level score, segment level score and their conversion.
This PR breaks public API, but fixing code is straightforward.
* Bumping tantivy version
---------
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
* Optimization when posting list are saturated.
If a posting list doc freq is the segment reader's
max_doc, and if scoring does not matter, we can replace it
by a AllScorer.
In turn, in a boolean query, we can dismiss all scorers and
empty scorers, to accelerate the request.
* Added range query optimization
* CR comment
* CR comments
* CR comment
---------
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
This introduce an optimization of top level term aggregation on field with a low cardinality.
We then use a Vec as the underlying map.
In addition, we buffer subaggregations.
---------
Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
Co-authored-by: Paul Masurel <paul@quickwit.io>
* query: add DocSet cost hint and use it for intersection ordering
- Add DocSet::cost()
- Use cost() instead of size_hint() to order scorers in intersect_scorers
This isolates cost-related changes without the new seek APIs from
PR #2538
* add comments
---------
Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
It applies the same logic on floats as for u64 or i64.
In all case, the idea is (for the inverted index) to coerce number
to their canonical representation, before indexing and before searching.
That way a document with the float 1.0 will be searchable when the user
searches for 1.
Note that contrary to the columnar, we do not attempt to coerce all of the
terms associated to a given json path to a single numerical type.
We simply rely on this "point-wise" canonicalization.
* add RegexPhraseQuery
RegexPhraseQuery supports phrase queries with regex. It supports regex
and wildcards. E.g. a query with wildcards:
"b* b* wolf" matches "big bad wolf"
Slop is supported as well:
"b* wolf"~2 matches "big bad wolf"
Regex queries may match a lot of terms where we still need to
keep track which term hit to load the positions.
The phrase query algorithm groups terms by their frequency
together in the union to prefilter groups early.
This PR comes with some new datastructures:
SimpleUnion - A union docset for a list of docsets. It doesn't do any
caching and is therefore well suited for datasets with lots of skipping.
(phrase search, but intersections in general)
LoadedPostings - Like SegmentPostings, but all docs and positions are loaded in
memory. SegmentPostings uses 1840 bytes per instance with its caches,
which is equivalent to 460 docids.
LoadedPostings is used for terms which have less than 100 docs.
LoadedPostings is only used to reduce memory consumption.
BitSetPostingUnion - Creates a `Posting` that uses the bitset for docid
hits and the docsets for positions. The BitSet is the precalculated
union of the docsets
In the RegexPhraseQuery there is a size limit of 512 docsets per PreAggregatedUnion,
before creating a new one.
Renamed Union to BufferedUnionScorer
Added proptests to test different union types.
* cleanup
* use Box instead of Vec
* use RefCell instead of term_freq(&mut)
* remove wildcard mode
* move RefCell to outer
* clippy
(a OR b) OR (c OR d) can be simplified to (a OR b OR c OR d)
(a AND b) AND (c AND d) can be simplified to (a AND b AND c AND d)
This directly affects how queries are executed
remove unused SumWithCoordsCombiner
the number of fields is unused and private
* support ff range queries on json fields
* fix term date truncation
* use inverted index range query for phrase prefix queries
* rename to InvertedIndexRangeQuery
* fix column filter, add mixed column test