* Remove `(Partial)Ord` from `ComparableDoc`, and unify comparison between `TopNComputer` and `Comparator`.
* Doc cleanups.
* Require Ord for `ComparableDoc`.
* Semantics are actually _ascending_ DocId order.
* Adjust docs again for ascending DocId order.
* minor change
---------
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
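A minimal sketch of the ordering described above, assuming a simplified `ComparableDoc` with a single generic sort key (the real tantivy type has more parameters): equal sort keys are broken by _ascending_ `DocId`.

```rust
use std::cmp::Ordering;

type DocId = u32;

/// Simplified stand-in for tantivy's `ComparableDoc`: a sort key plus the doc it belongs to.
#[derive(Debug, PartialEq, Eq)]
struct ComparableDoc<T> {
    sort_key: T,
    doc: DocId,
}

impl<T: Ord> Ord for ComparableDoc<T> {
    fn cmp(&self, other: &Self) -> Ordering {
        // Compare on the sort key first; break ties by ascending DocId,
        // so among equal keys the lowest doc id comes first.
        self.sort_key
            .cmp(&other.sort_key)
            .then_with(|| self.doc.cmp(&other.doc))
    }
}

impl<T: Ord> PartialOrd for ComparableDoc<T> {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
```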
* Fixed the range issue.
* Fixed the second all scorer issue
* Improved docs + tests
* Improved code.
* Fixed lint issues.
* Improved tests + logic based on PR comments.
* Fixed lint issues.
* Increase the document count.
* Improved the prop-tests
* Expand the index size, and remove unused parameter.
---------
Co-authored-by: Stu Hood <stuhood@gmail.com>
* Refactoring of the score tweaker into `SortKeyComputer`s to unlock two features.
- Allow lazy evaluation of the score: as soon as we identify that a doc won't
reach the top-K threshold, we can stop the evaluation.
- Allow the segment-level sort key and the final sort key to differ, with a conversion between them.
This PR breaks the public API, but fixing calling code is straightforward.
* Bumping tantivy version
---------
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
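A rough sketch of the lazy-evaluation idea behind this refactoring. The trait and method names below are illustrative only, not the actual `SortKeyComputer` API: the point is that a cheap check against the current top-K threshold lets us skip the expensive sort-key computation for most docs.

```rust
type DocId = u32;
type Score = f32;

/// Illustrative segment-level computer: produces a sort key for a doc,
/// but only when the doc might still make it into the top-K.
trait SegmentSortKeyComputer {
    type SortKey: PartialOrd;

    /// Cheap pre-check. If it already tells us the doc cannot beat the
    /// current top-K threshold, we skip the (possibly expensive) key computation.
    fn may_beat(&self, doc: DocId, score: Score, threshold: Option<&Self::SortKey>) -> bool;

    /// Full sort-key evaluation, only called for surviving docs.
    fn sort_key(&mut self, doc: DocId, score: Score) -> Self::SortKey;
}

fn collect_doc<C: SegmentSortKeyComputer>(
    computer: &mut C,
    doc: DocId,
    score: Score,
    threshold: Option<&C::SortKey>,
) -> Option<C::SortKey> {
    // Lazy evaluation: stop as soon as we know the doc cannot reach the top-K.
    if !computer.may_beat(doc, score, threshold) {
        return None;
    }
    Some(computer.sort_key(doc, score))
}
```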
* Optimization when posting lists are saturated.
If a posting list's doc freq is the segment reader's
max_doc, and if scoring does not matter, we can replace it
with an AllScorer.
In turn, in a boolean query, we can dismiss all-scorers and
empty scorers to accelerate the request.
* Added range query optimization
* CR comment
* CR comments
* CR comment
---------
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
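A hedged sketch of the saturation check described in the entry above, using stand-in types rather than tantivy's internals: when a term's doc frequency equals the segment's `max_doc` and scores are not needed, the posting list matches every document and can be replaced by an all-docs scorer.

```rust
type DocId = u32;

/// Stand-ins for the statistics involved in the check.
struct PostingStats {
    doc_freq: u32, // number of docs containing the term in this segment
}

struct SegmentStats {
    max_doc: DocId, // number of docs in the segment (deletes not considered here)
}

enum SpecializedScorer {
    /// The posting list covers every doc in the segment: match everything
    /// without decoding the posting list.
    All,
    /// Regular term scorer.
    Term,
}

/// Saturation check: if the posting list contains every document of the segment
/// and scoring does not matter, it can be replaced by an "all docs" scorer.
/// In a boolean query, such scorers (and empty ones) can then be dropped to
/// accelerate the request.
fn specialize(posting: &PostingStats, segment: &SegmentStats, needs_scoring: bool) -> SpecializedScorer {
    if !needs_scoring && posting.doc_freq == segment.max_doc {
        SpecializedScorer::All
    } else {
        SpecializedScorer::Term
    }
}
```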
This introduces an optimization for top-level term aggregations on fields with a low cardinality.
In that case we use a Vec as the underlying map.
In addition, we buffer sub-aggregations.
---------
Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
Co-authored-by: Paul Masurel <paul@quickwit.io>
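A minimal sketch of the low-cardinality fast path mentioned above; the type below is illustrative, not the actual aggregation internals. When term ordinals are dense and few, a `Vec` indexed by ordinal can stand in for a hash map.

```rust
/// Bucket counts for a term aggregation whose term dictionary is small:
/// term ordinals are dense in `0..cardinality`, so a Vec indexed by ordinal
/// avoids hashing entirely.
struct VecTermBuckets {
    counts: Vec<u64>,
}

impl VecTermBuckets {
    fn with_cardinality(cardinality: usize) -> Self {
        Self { counts: vec![0; cardinality] }
    }

    #[inline]
    fn increment(&mut self, term_ord: u64) {
        // With a known, low cardinality the ordinal is always in bounds.
        self.counts[term_ord as usize] += 1;
    }
}

fn main() {
    let mut buckets = VecTermBuckets::with_cardinality(4);
    for &term_ord in &[0u64, 2, 2, 3, 1, 2] {
        buckets.increment(term_ord);
    }
    assert_eq!(buckets.counts, vec![1, 1, 3, 1]);
}
```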
* Initial impl
* Added `Filter` impl in `build_single_agg_segment_collector_with_reader` + Added tests
* Added `Filter(FilterBucketResult)` + Made tests work.
* Fixed type issues.
* Fixed a test.
* 8a7a73a: Pass `segment_reader`
* Added more tests.
* Improved parsing + tests
* refactoring
* Added more tests.
* refactoring: moved parsing code under QueryParser
* Use Tantivy syntax instead of ES
* Added a sanity check test.
* Simplified impl + tests
* Added back tests in a more maintainable way
* nits.
* nits
* implemented very simple fast-path
* improved a comment
* implemented fast field support
* Used `BoundsRange`
* Improved fast field impl + tests
* Simplified execution.
* Fixed exports + nits
* Improved the tests to check against the expected result.
* Improved test by checking the whole result JSON
* Removed brittle perf checks.
* Added efficiency verification tests.
* Added one more efficiency check test.
* Improved the efficiency tests.
* Removed unnecessary parsing code + added direct Query obj
* Fixed tests.
* Improved tests
* Fixed code structure
* Fixed lint issues
* nits.
* nits
* nits.
* nits.
* nits.
* Added an example
* Fixed PR comments.
* Applied PR comments + nits
* nits.
* Improved the code.
* Fixed a perf issue.
* Added batch processing.
* Made the example more interesting
* Fixed bucket count
* Renamed Direct to CustomQuery
* Fixed lint issues.
* No need for scorer to be an `Option`
* nits
* Used BitSet
* Added an optimization for AllQuery
* Fixed merge issues.
* Fixed lint issues.
* Added benchmark for FILTER
* Removed the Option wrapper.
* nits.
* Applied PR comments.
* Fixed the AllQuery optimization
* Applied PR comments.
* feat: used `erased_serde` to allow filter query to be serialized
* further improved a comment
* Added back tests.
* removed an unused method
* removed an unused method
* Added documentation
* nits.
* Added query builder.
* Fixed a comment.
* Applied PR comments.
* Fixed doctest issues.
* Added ser/de
* Removed bench in test
* Fixed a lint issue.
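A hedged sketch of the filter-bucket collection flow suggested by the entries above (bitset of matching docs, batch processing); the types and names are illustrative stand-ins, not the actual collector.

```rust
/// Illustrative filter bucket: the filter query's matches for the segment are
/// materialized into a bitset once, then each incoming doc is counted (and
/// forwarded to sub-aggregations) only if its bit is set.
struct FilterBucket {
    matching: Vec<bool>, // stand-in for a real BitSet over the segment's docs
    doc_count: u64,
}

impl FilterBucket {
    fn collect_batch(&mut self, docs: &[u32]) {
        for &doc in docs {
            if self.matching[doc as usize] {
                self.doc_count += 1;
                // ...forward `doc` to the sub-aggregation collectors here.
            }
        }
    }
}
```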
Previously, merging relied on the order of the results, which has been invalid since https://github.com/quickwit-oss/tantivy/pull/2035.
This bug is only hit in specific scenarios, when the aggregation collectors are built in a different order on different segments.
Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
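A small sketch of the idea behind the fix, with simplified bucket types: intermediate buckets from different segments are merged by key rather than by position, so the per-segment ordering no longer matters.

```rust
use std::collections::HashMap;

/// Merge term buckets by key: correct even if the two segments produced
/// their buckets in different orders.
fn merge_buckets(left: Vec<(String, u64)>, right: Vec<(String, u64)>) -> HashMap<String, u64> {
    let mut merged: HashMap<String, u64> = left.into_iter().collect();
    for (key, count) in right {
        *merged.entry(key).or_insert(0) += count;
    }
    merged
}
```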
* add nested histogram-termagg benchmark
* Replace AggregationsWithAccessor with AggData
With `AggregationsWithAccessor`, pre-computation and caching were done at the collector level.
If you have 10000 sub-collectors (e.g. a term aggregation with sub-aggregations), this is very inefficient.
`AggData` instead moves the data from the collector to a node which reflects the cardinality of the request tree rather than the cardinality of the segment collectors.
It also moves the global struct shared by all aggregations into aggregation-specific structs, so each aggregation has its own space to store cached data and aggregation-specific information.
This also somewhat breaks up the dependency on the Elasticsearch aggregation structure.
Due to lifetime issues, we move the agg-request-specific object out of `AggData` during collection and move it back at the end (for now). That is some unnecessary work, which costs CPU.
This allows better caching and also paves the way for another potential optimization: separating the collector from its storage. Currently we allocate a new collector for each sub-aggregation bucket (for nested aggregations), but ideally we would have just one collector instance.
* renames
* move request data to agg request files
---------
Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
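A rough sketch of the shape of this change, with illustrative types only: per-request cached data lives in a node of the aggregation request tree, while the (potentially thousands of) per-bucket collectors stay lightweight.

```rust
/// One node per aggregation in the *request* tree. Pre-computed data (column
/// accessors, caches, aggregation-specific settings) is stored here once,
/// instead of once per segment collector.
struct AggDataNode {
    /// e.g. a fast-field column handle, missing-value config, buffered sub-aggregation docs...
    cached: Vec<u64>,
    sub_aggs: Vec<AggDataNode>,
}

/// One collector per bucket (there can be thousands for a term aggregation
/// with sub-aggregations). It only carries its own counters and reads the
/// shared data from the request-tree node.
struct BucketCollector {
    count: u64,
    sub_collectors: Vec<BucketCollector>,
}

impl BucketCollector {
    fn collect(&mut self, doc: u32, data: &AggDataNode) {
        // Read whatever pre-computed data the aggregation needs from the shared node...
        let _ = (doc, &data.cached);
        self.count += 1;
        // ...and recurse into sub-aggregations with their matching request-tree nodes.
        for (sub, sub_data) in self.sub_collectors.iter_mut().zip(&data.sub_aggs) {
            sub.collect(doc, sub_data);
        }
    }
}
```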
* query: add DocSet cost hint and use it for intersection ordering
- Add DocSet::cost()
- Use cost() instead of size_hint() to order scorers in intersect_scorers
This isolates the cost-related changes from PR #2538, without the new seek APIs.
* add comments
---------
Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
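A hedged sketch of the ordering step, with a stand-in trait rather than the real `DocSet`: scorers are sorted by ascending cost so the cheapest docset drives the intersection and the more expensive ones are only seeked.

```rust
/// Minimal stand-in for a DocSet/Scorer exposing a cost estimate.
trait CostedDocSet {
    /// Estimated cost of fully driving this docset (e.g. how many docs it will
    /// have to decode). Unlike `size_hint()`, it can account for work that can
    /// be skipped, nested children, etc.
    fn cost(&self) -> u64;
}

/// Order scorers before building the intersection: the cheapest docset goes
/// first and drives the loop; the others only get `seek`ed to its candidates.
fn order_for_intersection<S: CostedDocSet>(mut scorers: Vec<S>) -> Vec<S> {
    scorers.sort_by_key(|scorer| scorer.cost());
    scorers
}
```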
It applies the same logic to floats as for u64 or i64.
In all cases, the idea is (for the inverted index) to coerce numbers
to their canonical representation, before indexing and before searching.
That way a document with the float 1.0 will be searchable when the user
searches for 1.
Note that contrary to the columnar, we do not attempt to coerce all of the
terms associated with a given JSON path to a single numerical type.
We simply rely on this "point-wise" canonicalization.
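A minimal sketch of this point-wise canonicalization, with an illustrative enum rather than tantivy's actual term encoding; the exact coercion order (i64 before u64) is an assumption here. The same function runs at indexing time and at query time, so `1.0` and `1` map to the same term.

```rust
/// Illustrative canonical numeric representation for the inverted index.
#[derive(Debug, PartialEq)]
enum CanonicalNumber {
    I64(i64),
    U64(u64),
    F64(f64),
}

/// Point-wise canonicalization: applied independently to every value at
/// indexing time and to every numeric literal at query time.
fn canonicalize(value: f64) -> CanonicalNumber {
    if value.fract() == 0.0 {
        if value >= i64::MIN as f64 && value <= i64::MAX as f64 {
            return CanonicalNumber::I64(value as i64);
        }
        if value >= 0.0 && value <= u64::MAX as f64 {
            return CanonicalNumber::U64(value as u64);
        }
    }
    CanonicalNumber::F64(value)
}

fn main() {
    // The float 1.0 and the integer 1 end up as the same canonical term.
    assert_eq!(canonicalize(1.0), CanonicalNumber::I64(1));
    assert_eq!(canonicalize(1.5), CanonicalNumber::F64(1.5));
}
```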