This PR modifies internal API signatures and implementation details so that `FileSlice`s are passed down into the innards of (at least) the `BlockwiseLinearCodec`. This allows tantivy to defer dereferencing large slices of bytes when reading numeric fast fields, and instead dereference only the slice of bytes it needs for any given compressed Block.
The motivation here is for external `Directory` implementations where it's not exactly efficient to dereference large slices of bytes.
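
A minimal sketch of the idea, not tantivy's actual code: `LazyBytes` is a hypothetical stand-in for `FileSlice`, and `BlockwiseColumn` is a toy column with fixed-size blocks of one-byte values. It only illustrates reading the bytes of the one block that holds the requested value instead of materializing the whole column up front.

```rust
use std::cell::Cell;
use std::ops::Range;

/// Hypothetical stand-in for a lazily read byte source (not tantivy's `FileSlice`).
struct LazyBytes {
    data: Vec<u8>,
    /// Counts how many bytes were actually materialized by `read_bytes`.
    bytes_read: Cell<usize>,
}

impl LazyBytes {
    fn new(data: Vec<u8>) -> Self {
        LazyBytes { data, bytes_read: Cell::new(0) }
    }

    fn len(&self) -> usize {
        self.data.len()
    }

    /// Materializes only the requested byte range.
    fn read_bytes(&self, range: Range<usize>) -> Vec<u8> {
        self.bytes_read.set(self.bytes_read.get() + range.len());
        self.data[range].to_vec()
    }
}

/// Number of one-byte values per block in this toy layout (an assumption).
const BLOCK_LEN: usize = 4;

/// Toy blockwise column that keeps the lazy byte source around.
struct BlockwiseColumn {
    bytes: LazyBytes,
}

impl BlockwiseColumn {
    /// Reads the value at `idx`, touching only the block that contains it.
    fn get(&self, idx: usize) -> u8 {
        let start = (idx / BLOCK_LEN) * BLOCK_LEN;
        let end = (start + BLOCK_LEN).min(self.bytes.len());
        let block = self.bytes.read_bytes(start..end);
        block[idx % BLOCK_LEN]
    }
}
```

With a 16-byte column, `get(9)` reads 4 bytes (the third block) rather than all 16.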
* Refactoring of the score tweaker into `SortKeyComputer`s to unlock two features.
- Allow lazy evaluation of the score. As soon as we identify that a doc won't
reach the top-K threshold, we can stop the evaluation.
- Allow for a different segment level score, segment level score and their conversion.
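
One way to picture the lazy evaluation, as a sketch under assumptions: the sort key is a hypothetical `(first, second)` pair compared lexicographically, and `lazy_sort_key` is an illustrative helper, not part of the `SortKeyComputer` API. The expensive second component is computed only if the first one does not already rule the doc out.

```rust
/// Returns the full sort key only if it beats the current top-K threshold.
/// `second` is evaluated lazily: it is skipped when `first` alone shows
/// that the doc cannot enter the top K.
fn lazy_sort_key(
    first: u32,
    second: impl FnOnce() -> u32,
    threshold: Option<(u32, u32)>,
) -> Option<(u32, u32)> {
    if let Some((threshold_first, _)) = threshold {
        if first < threshold_first {
            // The doc loses on the first component: stop the evaluation here.
            return None;
        }
    }
    let key = (first, second());
    match threshold {
        // Keys that do not strictly beat the threshold are rejected.
        Some(t) if key <= t => None,
        _ => Some(key),
    }
}
```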
This PR breaks the public API, but fixing affected code is straightforward.
* Bumping tantivy version
---------
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
* Optimization when posting lists are saturated.
If a posting list's doc freq equals the segment reader's
max_doc, and if scoring does not matter, we can replace it
with an AllScorer.
In turn, in a boolean query, we can dismiss AllScorers and
empty scorers, to accelerate the request.
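
The rule above can be sketched as follows. `SimpleScorer`, `term_scorer` and `simplify_intersection` are hypothetical names for illustration, not tantivy types, and the sketch assumes a non-empty list of scorers and ignores deleted docs.

```rust
/// Toy scorer description (hypothetical, not tantivy's scorer types).
#[derive(Debug, PartialEq, Clone)]
enum SimpleScorer {
    /// Matches every document of the segment.
    All,
    /// Matches no document.
    Empty,
    /// Matches the documents of one term.
    Term { doc_freq: u32 },
}

/// Picks the cheapest scorer for a term, given its doc frequency.
fn term_scorer(doc_freq: u32, max_doc: u32, scoring_enabled: bool) -> SimpleScorer {
    if doc_freq == 0 {
        SimpleScorer::Empty
    } else if !scoring_enabled && doc_freq == max_doc {
        // Saturated posting list: every doc matches and scores are not needed.
        SimpleScorer::All
    } else {
        SimpleScorer::Term { doc_freq }
    }
}

/// Simplifies the clauses of an intersection (assumes at least one scorer).
fn simplify_intersection(scorers: Vec<SimpleScorer>) -> Vec<SimpleScorer> {
    // One empty clause makes the whole intersection empty.
    if scorers.iter().any(|s| *s == SimpleScorer::Empty) {
        return vec![SimpleScorer::Empty];
    }
    // `All` clauses do not restrict the result, so they can be dropped.
    let remaining: Vec<SimpleScorer> = scorers
        .into_iter()
        .filter(|s| *s != SimpleScorer::All)
        .collect();
    if remaining.is_empty() {
        vec![SimpleScorer::All]
    } else {
        remaining
    }
}
```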
* Added range query optimization
* CR comment
* CR comments
* CR comment
---------
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
This introduces an optimization for top-level term aggregations on fields with low cardinality.
We then use a Vec as the underlying map.
In addition, we buffer subaggregations.
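
A minimal sketch of why a `Vec` can serve as the map when cardinality is low: term ordinals are dense small integers, so they can index directly into the vector. `count_terms` is a hypothetical helper, and it assumes every ordinal is smaller than `cardinality`.

```rust
/// Counts occurrences per term ordinal, using a Vec indexed by ordinal
/// instead of a hash map. Assumes every ordinal is < `cardinality`.
fn count_terms(term_ords: &[u32], cardinality: usize) -> Vec<u64> {
    let mut counts = vec![0u64; cardinality];
    for &ord in term_ords {
        counts[ord as usize] += 1;
    }
    counts
}
```

Compared with a hash map, this avoids hashing and probing on every document, at the price of allocating one slot per distinct term.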
---------
Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
Co-authored-by: Paul Masurel <paul@quickwit.io>
* query: add DocSet cost hint and use it for intersection ordering
- Add DocSet::cost()
- Use cost() instead of size_hint() to order scorers in intersect_scorers
This isolates cost-related changes without the new seek APIs from
PR #2538
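
To illustrate why a cost hint can differ from a size hint, here is a sketch with a hypothetical `DocSetLike` trait (not tantivy's `DocSet`; the default `cost()` and the `cost_per_doc` field are assumptions). A docset that matches few documents but is expensive to advance should not necessarily lead the intersection.

```rust
/// Hypothetical trait mirroring the idea of a size hint plus a cost hint.
trait DocSetLike {
    /// Estimated number of matching docs.
    fn size_hint(&self) -> u32;
    /// Estimated cost of driving this docset; defaults to the size hint.
    fn cost(&self) -> u64 {
        self.size_hint() as u64
    }
}

/// Cheap to iterate: cost equals the number of docs.
struct TermDocs {
    num_docs: u32,
}

impl DocSetLike for TermDocs {
    fn size_hint(&self) -> u32 {
        self.num_docs
    }
}

/// Few matches, but each step is expensive (for example a column scan).
struct RangeDocs {
    num_docs: u32,
    cost_per_doc: u64,
}

impl DocSetLike for RangeDocs {
    fn size_hint(&self) -> u32 {
        self.num_docs
    }
    fn cost(&self) -> u64 {
        self.num_docs as u64 * self.cost_per_doc
    }
}

/// Orders scorers so that the cheapest one drives the intersection.
fn order_for_intersection(mut scorers: Vec<Box<dyn DocSetLike>>) -> Vec<Box<dyn DocSetLike>> {
    scorers.sort_by_key(|scorer| scorer.cost());
    scorers
}
```

Sorting by `size_hint()` would put the `RangeDocs` with 10 docs first. Sorting by `cost()` puts a `TermDocs` with 500 docs ahead of it when each range step costs 100.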
* add comments
---------
Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
It applies the same logic on floats as for u64 or i64.
In all cases, the idea is (for the inverted index) to coerce numbers
to their canonical representation, before indexing and before searching.
That way a document with the float 1.0 will be searchable when the user
searches for 1.
Note that contrary to the columnar, we do not attempt to coerce all of the
terms associated to a given json path to a single numerical type.
We simply rely on this "point-wise" canonicalization.
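
A sketch of point-wise canonicalization. The `CanonicalNumber` enum and the preference order (i64 first, then u64, then f64) are assumptions made for illustration. The commit message only states that each number is coerced to a canonical representation.

```rust
/// Hypothetical canonical form of a number.
#[derive(Debug, PartialEq)]
enum CanonicalNumber {
    I64(i64),
    U64(u64),
    F64(f64),
}

/// Canonicalizes a float: integral values that fit become integers.
fn canonicalize_f64(val: f64) -> CanonicalNumber {
    // NaN and infinities have a NaN fractional part and stay floats.
    if val.fract() == 0.0 {
        // `i64::MAX as f64` rounds up to 2^63, hence the strict comparison.
        if val >= i64::MIN as f64 && val < i64::MAX as f64 {
            return CanonicalNumber::I64(val as i64);
        }
        // `u64::MAX as f64` rounds up to 2^64, hence the strict comparison.
        if val >= 0.0 && val < u64::MAX as f64 {
            return CanonicalNumber::U64(val as u64);
        }
    }
    CanonicalNumber::F64(val)
}

/// Canonicalizes an unsigned integer: values that fit in i64 become i64.
fn canonicalize_u64(val: u64) -> CanonicalNumber {
    if val <= i64::MAX as u64 {
        CanonicalNumber::I64(val as i64)
    } else {
        CanonicalNumber::U64(val)
    }
}
```

Because `1.0` and `1` map to the same canonical value, a document indexed with one is found by a search for the other.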
* add RegexPhraseQuery
RegexPhraseQuery supports phrase queries with regex. It supports regex
and wildcards. E.g. a query with wildcards:
"b* b* wolf" matches "big bad wolf"
Slop is supported as well:
"b* wolf"~2 matches "big bad wolf"
Regex queries may match a lot of terms, and we still need to
keep track of which term hit in order to load the positions.
The phrase query algorithm groups terms by their frequency
together in the union to prefilter groups early.
This PR comes with some new datastructures:
SimpleUnion - A union docset for a list of docsets. It doesn't do any
caching and is therefore well suited for datasets with lots of skipping.
(phrase search, but intersections in general)
LoadedPostings - Like SegmentPostings, but all docs and positions are loaded in
memory. SegmentPostings uses 1840 bytes per instance with its caches,
which is equivalent to 460 docids.
LoadedPostings is used for terms which have fewer than 100 docs.
LoadedPostings is only used to reduce memory consumption.
BitSetPostingUnion - Creates a `Posting` that uses the bitset for docid
hits and the docsets for positions. The BitSet is the precalculated
union of the docsets.
In the RegexPhraseQuery there is a size limit of 512 docsets per PreAggregatedUnion,
before creating a new one.
Renamed Union to BufferedUnionScorer
Added proptests to test different union types.
* cleanup
* use Box instead of Vec
* use RefCell instead of term_freq(&mut)
* remove wildcard mode
* move RefCell to outer
* clippy
(a OR b) OR (c OR d) can be simplified to (a OR b OR c OR d)
(a AND b) AND (c AND d) can be simplified to (a AND b AND c AND d)
This directly affects how queries are executed
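
The simplification can be sketched on a toy query tree. The `Query` enum and `flatten` function below are hypothetical and only illustrate the rewrite rule, not tantivy's `BooleanQuery` handling.

```rust
/// Toy query tree (hypothetical, for illustration only).
#[derive(Debug, PartialEq, Clone)]
enum Query {
    Term(&'static str),
    Or(Vec<Query>),
    And(Vec<Query>),
}

/// Merges nested OR-in-OR and AND-in-AND nodes into a single level.
fn flatten(query: Query) -> Query {
    match query {
        Query::Or(children) => {
            let mut flat = Vec::new();
            for child in children {
                match flatten(child) {
                    // A nested OR contributes its children directly.
                    Query::Or(grandchildren) => flat.extend(grandchildren),
                    other => flat.push(other),
                }
            }
            Query::Or(flat)
        }
        Query::And(children) => {
            let mut flat = Vec::new();
            for child in children {
                match flatten(child) {
                    // A nested AND contributes its children directly.
                    Query::And(grandchildren) => flat.extend(grandchildren),
                    other => flat.push(other),
                }
            }
            Query::And(flat)
        }
        term => term,
    }
}
```

An AND nested inside an OR (or the reverse) is left in place, since merging it would change the meaning of the query.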
remove unused SumWithCoordsCombiner
the number of fields is unused and private
* support ff range queries on json fields
* fix term date truncation
* use inverted index range query for phrase prefix queries
* rename to InvertedIndexRangeQuery
* fix column filter, add mixed column test
* add support for str fast field range query
Add support for range queries on str fast fields, by converting term bounds to
term ordinal bounds.
closes https://github.com/quickwit-oss/tantivy/issues/2023
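
The conversion from term bounds to term ordinal bounds can be sketched with a sorted slice standing in for the term dictionary, where a term's ordinal is its index. `term_bounds_to_ord_range` is a hypothetical helper, not the function added by this change.

```rust
use std::ops::{Bound, Range};

/// Converts bounds on terms into a half-open range of term ordinals.
/// `dict` must be sorted; the ordinal of a term is its index in `dict`.
fn term_bounds_to_ord_range(
    dict: &[&str],
    lower: Bound<&str>,
    upper: Bound<&str>,
) -> Range<usize> {
    let start = match lower {
        // First ordinal whose term is >= the bound.
        Bound::Included(bound) => dict.partition_point(|term| *term < bound),
        // First ordinal whose term is > the bound.
        Bound::Excluded(bound) => dict.partition_point(|term| *term <= bound),
        Bound::Unbounded => 0,
    };
    let end = match upper {
        // One past the last ordinal whose term is <= the bound.
        Bound::Included(bound) => dict.partition_point(|term| *term <= bound),
        // One past the last ordinal whose term is < the bound.
        Bound::Excluded(bound) => dict.partition_point(|term| *term < bound),
        Bound::Unbounded => dict.len(),
    };
    // An inverted pair of bounds yields an empty range.
    start..end.max(start)
}
```

The bounds do not need to be terms that exist in the dictionary: the binary search maps them to the nearest ordinals, after which the range query only compares integers.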
* extend tests, rename
* update comment
* update comment
As preparation of #2023 and #1709
* Use Term to pass parameters
* merge u64 and ip fast field range query
Side note: I did not rename range_query_u64_fastfield, because then git can't track the changes.
* feat(query): Make `BooleanQuery` support `minimum_number_should_match`. see issue #2398
In this commit, a novel scorer named DisjunctionScorer is introduced, which performs the union of inverted chains with the minimal required elements. BTW, it's implemented via a min-heap. Necessary modifications on `BooleanQuery` and `BooleanWeight` are performed as well.
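
A minimal sketch of a min-heap based disjunction with a minimum number of matches. This is not the `DisjunctionScorer` implementation: posting lists are plain in-memory vectors, assumed sorted and free of duplicates, and `disjunction_with_min_match` is a hypothetical name.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Returns the docs that appear in at least `min_match` posting lists.
/// Each posting list must be sorted and strictly increasing.
fn disjunction_with_min_match(postings: &[Vec<u32>], min_match: usize) -> Vec<u32> {
    // Min-heap of (current doc, list index, position within the list).
    let mut heap = BinaryHeap::new();
    for (list_idx, list) in postings.iter().enumerate() {
        if let Some(&doc) = list.first() {
            heap.push(Reverse((doc, list_idx, 0usize)));
        }
    }
    let mut result = Vec::new();
    // The smallest head across all lists is the next candidate doc.
    while let Some(Reverse((doc, _, _))) = heap.peek().copied() {
        let mut matches = 0;
        // Pop every list currently positioned on `doc` and advance it.
        while let Some(&Reverse((head, list_idx, pos))) = heap.peek() {
            if head != doc {
                break;
            }
            heap.pop();
            matches += 1;
            if let Some(&next) = postings[list_idx].get(pos + 1) {
                heap.push(Reverse((next, list_idx, pos + 1)));
            }
        }
        if matches >= min_match {
            result.push(doc);
        }
    }
    result
}
```

With lists `[1, 3, 5]`, `[3, 5, 7]` and `[5, 7, 9]`, requiring two matches keeps docs 3, 5 and 7, while requiring three keeps only doc 5.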
* fixup! fix test
* fixup!: refactor code.
1. More meaningful names.
2. Add Cache for `Disjunction`'s scorers, and fix bug.
3. Optimize `BooleanWeight::complex_scorer`
Thanks
Paul Masurel <paul@quickwit.io>
* squash!: come up with better variable naming.
* squash!: fix naming issues.
* squash!: fix typo.
* squash!: Remove CombinationMethod::FullIntersection
* fix ReferenceValue API flaw
Remove `Facet` and `TokenizedString` values from the `ReferenceValue` API,
as this requires the trait value to have them stored somewhere.
Since `TokenizedString` is quite niche, I just copy it into a Box,
instead of designing a reference API around it.
* fix comment link
* reduce number of allocations
Explanation makes up around 50% of all allocations (allocation count, not a performance measurement).
It's created during serialization but not called.
- Make Explanation optional in BM25
- Avoid allocations when using Explanation
* use Cow