* seek_exact + cost-based intersection
Adds `seek_exact` and `cost` to `DocSet` for a more efficient intersection.
Unlike `seek`, `seek_exact` does not require the DocSet to advance to the next hit if the target does not exist.
`cost` accounts for the different DocSet types and their cost
model, and is used to determine which DocSet drives the intersection.
E.g. fast field range queries may do a full scan, and phrase queries load positions to check whether we have a hit.
Both have a higher cost than their `size_hint` would suggest.
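A minimal sketch of the two trait additions as described above (the signatures are illustrative; the `DocSet` trait in the source is authoritative):
```rust
type DocId = u32;

trait DocSet {
    /// Advances to the next hit and returns it (pre-existing method).
    fn advance(&mut self) -> DocId;

    /// Seeks to `target` and reports whether it is a hit. Unlike `seek`,
    /// the DocSet is not required to land on the next hit when `target`
    /// is absent, which can be implemented much more cheaply.
    fn seek_exact(&mut self, target: DocId) -> bool;

    /// Estimated cost of driving this DocSet. Unlike `size_hint`, it
    /// reflects per-hit work (full scans for fast field range queries,
    /// position checks for phrase queries) and is used to pick the
    /// DocSet that drives the intersection.
    fn cost(&self) -> u64;
}
```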
Improves the `size_hint` estimation for intersection and union by basing the
estimate on a random distribution with a co-location factor.
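As a toy illustration of the idea (the actual formula and co-location factor in tantivy may differ):
```rust
// Under a random-distribution assumption, two DocSets with `a` and `b`
// hits over `max_doc` documents overlap in about a * b / max_doc docs.
// A co-location factor > 1.0 models terms that tend to co-occur.
fn estimate_intersection_len(a: u64, b: u64, max_doc: u64, co_location: f64) -> u64 {
    let expected = (a as f64) * (b as f64) / (max_doc as f64);
    // The intersection can never exceed the smaller of the two sets.
    (expected * co_location).min(a.min(b) as f64) as u64
}
```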
Refactor range query benchmark.
Closes #2531
*Future Work*
- Implement `seek_exact` for BufferedUnionScorer and RangeDocSet (fast field range queries)
- Evaluate replacing `seek` with `seek_exact` to reduce code complexity
* Apply suggestions from code review
Co-authored-by: Paul Masurel <paul@quickwit.io>
* add API contract verification
* impl seek_exact on union
* rename seek_exact
* add mixed AND/OR test, fix buffered_union
* Add a proptest of BooleanQuery. (#2690)
* fix build
* Increase the document count.
* fix merge conflict
* fix debug assert
* Fix compilation errors after rebase
- Remove duplicate proptest_boolean_query module
- Remove duplicate cost() method implementations
- Fix TopDocs API usage (add .order_by_score())
- Remove duplicate imports
- Remove unused variable assignments
---------
Co-authored-by: Paul Masurel <paul@quickwit.io>
Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
Co-authored-by: Stu Hood <stuhood@gmail.com>
* Refactoring of the score tweaker into `SortKeyComputer`s to unlock two features.
- Allow lazy evaluation of the score: as soon as we identify that a doc won't
reach the top-K threshold, we can stop the evaluation (sketched below).
- Allow for different segment-level and index-level scores, and a conversion between them.
This PR breaks the public API, but fixing client code is straightforward.
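A hypothetical sketch of the lazy-evaluation idea (the trait and method names are invented for illustration and do not mirror the real `SortKeyComputer` API):
```rust
// Invented names, for illustration only.
trait LazySortKey {
    /// Cheap upper bound on the sort key for `doc`.
    fn upper_bound(&self, doc: u32) -> f32;
    /// Full (potentially expensive) sort key for `doc`.
    fn compute(&self, doc: u32) -> f32;
}

/// Returns the sort key only if `doc` can still enter the top K.
fn collect_doc<C: LazySortKey>(computer: &C, doc: u32, threshold: f32) -> Option<f32> {
    if computer.upper_bound(doc) <= threshold {
        return None; // evaluation stops early, as described above
    }
    Some(computer.compute(doc))
}
```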
* Bumping tantivy version
---------
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
* Optimization when posting lists are saturated.
If a posting list's doc freq equals the segment reader's
max_doc, and scoring does not matter, we can replace it
with an AllScorer.
In turn, in a boolean query, we can dismiss AllScorers and
empty scorers to accelerate the request.
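A simplified sketch of the decision (not the actual tantivy code):
```rust
// A term that occurs in every document filters nothing, so when
// scores are irrelevant a match-all scorer can stand in for it.
enum ScorerKind {
    All,     // matches every doc in [0, max_doc)
    Posting, // decodes the posting list as usual
}

fn scorer_for_posting(doc_freq: u32, max_doc: u32, scoring_enabled: bool) -> ScorerKind {
    if !scoring_enabled && doc_freq == max_doc {
        ScorerKind::All
    } else {
        ScorerKind::Posting
    }
}
```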
* Added range query optimization
* CR comment
* CR comments
* CR comment
---------
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
This introduces an optimization for top-level term aggregations on fields with a low cardinality.
In that case we use a Vec as the underlying map.
In addition, we buffer the sub-aggregations.
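A sketch of the Vec-as-map idea, assuming terms are addressed by a dense ordinal in `[0, cardinality)` (sub-aggregation buffering is not shown):
```rust
// For low-cardinality fields, a Vec indexed by term ordinal replaces
// a hash map: each bucket update is an array index, not a hash lookup.
struct LowCardinalityTermAgg {
    counts: Vec<u64>, // one slot per term ordinal
}

impl LowCardinalityTermAgg {
    fn new(cardinality: usize) -> Self {
        Self { counts: vec![0; cardinality] }
    }

    fn collect(&mut self, term_ord: u64) {
        self.counts[term_ord as usize] += 1;
    }
}
```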
---------
Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
Co-authored-by: Paul Masurel <paul@quickwit.io>
* reduce number of allocations
Explanation makes up around 50% of all allocations (by allocation count, not perf impact).
It is created during serialization but never consumed (see the sketch below).
- Make Explanation optional in BM25
- Avoid allocations when using Explanation
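A sketch of the pattern, not the exact BM25 code:
```rust
// Only allocate an Explanation when the caller actually asked for one.
fn score_with_explanation(score: f32, explain: bool) -> (f32, Option<String>) {
    let explanation = if explain {
        Some(format!("bm25 score = {score}")) // allocated only on demand
    } else {
        None // hot path: no allocation at all
    };
    (score, explanation)
}
```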
* use Cow
* add support for delta-1 encoded posting lists
* encode term frequency minus one
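The encoding relies on term frequencies always being at least 1; a minimal sketch:
```rust
// Since tf >= 1 always holds, storing tf - 1 makes the very common
// tf == 1 case encode as 0, which bit-packs into fewer bits.
fn encode_tf(tf: u32) -> u32 {
    debug_assert!(tf >= 1);
    tf - 1
}

fn decode_tf(encoded: u32) -> u32 {
    encoded + 1
}
```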
* don't emit tf for json integer terms
* make skipreader not pub(crate) mutable
* fix windows build (#1)
* Fix windows build
* Add doc traits
* Add field value iter
* Add value and serialization
* Adjust order
* Fix bug
* Correct type
* Fix generic bugs
* Reformat code
* Add generic to index writer which I forgot about
* Fix missing generics on single segment writer
* Add missing type export
* Add default methods for convenience
* Cleanup
* Fix more-like-this query to use standard types
* Update API and fix tests
* Rebase main and fix conflicts
* Merge upstream
* Add tokenizer improvements from previous commits
* Reformat
* Fix unit tests
* Use enum in changes
* Stage changes
* Add new deserializer logic
* Add serializer integration
* Add document deserializer
* Implement new (de)serialization api for existing types
* Fix bugs and type errors
* Add helper implementations
* Fix errors
* Reformat code
* Add unit tests and some code organisation for serialization
* Add unit tests to deserializer
* Add some small docs
* Add support for deserializing serde values
* Reformat
* Fix typo
* Fix typo
* Change repr of facet
* Remove unused trait methods
* Add child value type
* Resolve comments
* Fix build
* Fix more build errors
* Fix more build errors
* Fix the tests I missed
* Fix examples
* fix numerical order, serialize PreTok Str
* fix coverage
* rename Document to TantivyDocument, rename DocumentAccess to Document
add Binary prefix to binary de/serialization
* fix coverage
---------
Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>
With tantivy 0.20 the minimum memory consumption per SegmentWriter increased to
12MB. 7MB of that goes to the different fast field collector types (they could be
created lazily). Increase the minimum memory budget from 3MB to 15MB.
Change the memory variable naming from arena to budget.
Closes #2156
Applied this command to the code, making it a bit shorter and slightly
more readable.
```
cargo +nightly clippy --all-features --benches --tests --workspace --fix -- -A clippy::all -W clippy::uninlined_format_args
cargo +nightly fmt --all
```
Prev: thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: IoError(Custom { kind: InvalidData, error: "Reach end of buffer while reading VInt" })', src/main.rs:46:14
Now: automatic downgrade to the next available level
* Use a more general but still object-safe signature for Query::query_terms.
* Further constrain the generalized Query::query_terms signature to allow extracting references to terms.
* Removes all usage of block_on and uses a oneshot channel instead.
Calling `block_on` panics in certain contexts.
For instance, it panics when called in the context of another
call to block_on.
Using it in tantivy is unnecessary. We replace it with a thin wrapper
around a oneshot channel that supports both async and sync use.
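A minimal sketch of the synchronous half of the pattern using a std channel (tantivy's actual wrapper also supports async consumption, which is omitted here):
```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    // A sync_channel of capacity 1 acts as a oneshot channel here.
    let (tx, rx) = mpsc::sync_channel(1);
    thread::spawn(move || {
        // ... produce the result off-thread ...
        let _ = tx.send(42u32);
    });
    // Synchronous consumer: block on the channel, not on an executor,
    // so there is no block_on to panic in a nested-blocking context.
    let result = rx.recv().unwrap();
    assert_eq!(result, 42);
}
```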
* Removing needless uses of async in the API.
Co-authored-by: PSeitz <PSeitz@users.noreply.github.com>
* Added sstable, enabled it by default, and added parallel boolean query.
* Added async API for FileSlice.
* Added async get_doc
* Reduce blocksize to 32_000
* Added debug logs
Quickwit-specific features are hidden behind the quickwit feature flag.
* Terms are now typed.
This change is backward compatible:
while a Term's byte representation is modified, a Term itself
is a transient object that is not serialized as-is in the index.
Its `.field()` and `.value_bytes()` accessors, on the other hand, are unchanged.
This change offers better Debug information for terms.
While not strictly necessary, it will also help with the support for JSON types.
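For illustration, constructing typed terms (the exact Debug output is not reproduced here):
```rust
use tantivy::schema::{Schema, INDEXED, TEXT};
use tantivy::Term;

fn main() {
    let mut schema_builder = Schema::builder();
    let title = schema_builder.add_text_field("title", TEXT);
    let count = schema_builder.add_u64_field("count", INDEXED);
    let _schema = schema_builder.build();

    // Each constructor records the value type in the term's byte
    // representation, so the Debug output is type-aware.
    let text_term = Term::from_field_text(title, "tantivy");
    let u64_term = Term::from_field_u64(count, 42);
    println!("{text_term:?} {u64_term:?}");
}
```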
* Renamed Hierarchical Facet -> Facet
* Add a NORMED option on fields
Make fieldnorm indexing optional:
- for all types except text => added a NORMED option
- for text fields:
  - if STRING, the field keeps no fieldnorm
  - if TEXT, the field has its fieldnorm computed
* Finalize making fieldnorm optional for all field types.
- Use an Option for fieldnorm readers.
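A schema sketch of the STRING vs TEXT distinction described above (the NORMED flag itself is not shown, since its exact spelling in the released API is not reproduced here):
```rust
use tantivy::schema::{Schema, STRING, TEXT};

fn main() {
    let mut builder = Schema::builder();
    // STRING: raw, untokenized; no fieldnorm is retained.
    let _id = builder.add_text_field("id", STRING);
    // TEXT: tokenized; fieldnorms are computed (needed for BM25).
    let _body = builder.add_text_field("body", TEXT);
    let _schema = builder.build();
}
```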
* Added an optimisation using block WAND for a single TermScorer.
A proptest was also added.
* Fix the block WAND algorithm by taking the last doc id among the scorers up to the pivot scorer (included).
* In block WAND, when the block max score is lower than the threshold, advance the scorer with the best score.
* Fix a wrong condition in block_wand_single_scorer and add a debug_assert enforcing an equality check on doc before breaking the loop.
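A conceptual sketch of the block-level pruning step these fixes concern (simplified, not tantivy's implementation):
```rust
// If the sum of block-max scores of the scorers up to and including
// the pivot cannot beat the current top-K threshold, the block can
// be skipped and the best candidate scorer advanced instead.
fn can_skip_block(block_max_scores: &[f32], pivot: usize, threshold: f32) -> bool {
    let upper_bound: f32 = block_max_scores[..=pivot].iter().sum();
    upper_bound <= threshold
}
```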