Convert SegmentReader, InvertedIndexReader and postinglists to traits.
Add special functions to pushdown certain performance methods to keep
them strictly typed.
We rely on a ObjectSafeCodec contraption to avoid the proliferation of generics.
That object's point is to make sure we can build TermScorer with a concrete
codec specific type before reboxing it. (same thing for PhraseScorer).
fix performance regression: fix incorrect scorer cast for buffered union
bock wand
Otherwise, there is no way to access these fields when not using the
json serialized form of the aggregation results.
This simple data struct is part of the public api,
so its fields should be accessible as well.
## What
Enable range queries and TopN sorting on `Bytes` fast fields, bringing them to parity with `Str` fields.
## Why
`BytesColumn` uses the same dictionary encoding as `StrColumn` internally, but range queries and TopN sorting were explicitly disabled for `Bytes`. This prevented use cases like storing lexicographically sortable binary data (e.g., arbitrary-precision decimals) that need efficient range filtering.
## How
1. **Enable range queries for Bytes** - Changed `is_type_valid_for_fastfield_range_query()` to return `true` for `Type::Bytes`
2. **Add BytesColumn handling in scorer** - Added a branch in `FastFieldRangeWeight::scorer()` to handle bytes fields using dictionary ordinal lookup (mirrors the existing `StrColumn` logic)
3. **Add SortByBytes** - New sort key computer for TopN queries on bytes columns
## Tests
- `test_bytes_field_ff_range_query` - Tests inclusive/exclusive bounds and unbounded ranges
- `test_sort_by_bytes_asc` / `test_sort_by_bytes_desc` - Tests lexicographic ordering in both directions
* faster exclude queries
Faster exclude queries with multiple terms.
Changes `Exclude` to be able to exclude multiple DocSets, instead of
putting the docsets into a union.
Use `seek_danger` in `Exclude`.
closes#2822
* replace unwrap with match
The intersection algorithm made it possible for .seek(..) with values
lower than the current doc id, breaking the DocSet contract.
The fix removes the optimization that caused left.seek(..) to be replaced
by a simpler left.advance(..).
Simply doing so lead to a performance regression.
I therefore integrated that idea within SegmentPostings.seek.
We now attempt to check the next doc systematically on seek,
PROVIDED the block is already loaded.
Closes#2811
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
A bug was added with the `seek_into_the_danger_zone()` optimization
(Spotted and fixed by Stu)
The contract says seek_into_the_danger_zone returns true if do is part of the docset.
The blanket implementation goes like this.
```
let current_doc = self.doc();
if current_doc < target {
self.seek(target);
}
self.doc() == target
```
So it will return true if target is TERMINATED, where really TERMINATED does not belong to the docset.
The fix tries to clarify the contracts and fixes the intersection algorithm.
We observe a small but all over the board improvement in intersection performance.
---------
Co-authored-by: Stu Hood <stuhood@gmail.com>
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
Removes the Write generics argument in PostingsSerializer.
This removes useless generic.
Prepares the path for codecs.
Removes one useless CountingWrite layer.
etc.
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
* improve bench
* add more tests for new collection type
* one collector per agg request instead per bucket
In this refactoring a collector knows in which bucket of the parent
their data is in. This allows to convert the previous approach of one
collector per bucket to one collector per request.
low card bucket optimization
* reduce dynamic dispatch, faster term agg
* use radix map, fix prepare_max_bucket
use paged term map in term agg
use special no sub agg term map impl
* specialize columntype in stats
* remove stacktrace bloat, use &mut helper
increase cache to 2048
* cleanup
remove clone
move data in term req, single doc opt for stats
* add comment
* share column block accessor
* simplify fetch block in column_block_accessor
* split subaggcache into two trait impls
* move partitions to heap
* fix name, add comment
---------
Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>
* Remove PartialOrd bound on compared values.
* Fix declared `SortKey` type of `impl<..> SortKeyComputer for (HeadSortKeyComputer, TailSortKeyComputer)`
* Add a SortByOwnedValue implementation to provide a type-erased column.
* Add support for comparing mismatched `OwnedValue` types.
* Support JSON columns.
* Refer to https://github.com/quickwit-oss/tantivy/issues/2776
* Rename to `SortByErasedType`.
* Comment on transitivity.
Co-authored-by: Paul Masurel <paul@quickwit.io>
* Fix clippy warnings in new code.
---------
Co-authored-by: Paul Masurel <paul@quickwit.io>
* seek_exact + cost based intersection
Adds `seek_exact` and `cost` to `DocSet` for a more efficient intersection.
Unlike `seek`, `seek_exact` does not require the DocSet to advance to the next hit, if the target does not exist.
`cost` allows to address the different DocSet types and their cost
model and is used to determine the DocSet that drives the intersection.
E.g. fast field range queries may do a full scan. Phrase queries load the positions to check if a we have a hit.
They both have a higher cost than their size_hint would suggest.
Improves `size_hint` estimation for intersection and union, by having a
estimation based on random distribution with a co-location factor.
Refactor range query benchmark.
Closes#2531
*Future Work*
Implement `seek_exact` for BufferedUnionScorer and RangeDocSet (fast field range queries)
Evaluate replacing `seek` with `seek_exact` to reduce code complexity
* Apply suggestions from code review
Co-authored-by: Paul Masurel <paul@quickwit.io>
* add API contract verfication
* impl seek_exact on union
* rename seek_exact
* add mixed AND OR test, fix buffered_union
* Add a proptest of BooleanQuery. (#2690)
* fix build
* Increase the document count.
* fix merge conflict
* fix debug assert
* Fix compilation errors after rebase
- Remove duplicate proptest_boolean_query module
- Remove duplicate cost() method implementations
- Fix TopDocs API usage (add .order_by_score())
- Remove duplicate imports
- Remove unused variable assignments
---------
Co-authored-by: Paul Masurel <paul@quickwit.io>
Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
Co-authored-by: Stu Hood <stuhood@gmail.com>
* Remove `(Partial)Ord` from `ComparableDoc`, and unify comparison between `TopNComputer` and `Comparator`.
* Doc cleanups.
* Require Ord for `ComparableDoc`.
* Semantics are actually _ascending_ DocId order.
* Adjust docs again for ascending DocId order.
* minor change
---------
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>