* add RegexPhraseQuery
RegexPhraseQuery supports phrase queries whose terms are regexes or
wildcards. E.g. a query with wildcards:
"b* b* wolf" matches "big bad wolf"
Slop is supported as well:
"b* wolf"~2 matches "big bad wolf"
Regex queries may match a large number of terms, and we still need to
keep track of which term hit in order to load its positions.
The phrase query algorithm groups terms by frequency in the union, so
that groups can be pre-filtered early.
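A minimal usage sketch (the constructor `RegexPhraseQuery::new(field, regexes)` and the `set_slop` setter are assumptions based on this description; check the actual API):
```rust
use tantivy::collector::Count;
use tantivy::query::RegexPhraseQuery;
use tantivy::schema::{Schema, TEXT};
use tantivy::{doc, Index};

fn main() -> tantivy::Result<()> {
    let mut schema_builder = Schema::builder();
    let text = schema_builder.add_text_field("text", TEXT);
    let index = Index::create_in_ram(schema_builder.build());

    let mut writer = index.writer(50_000_000)?;
    writer.add_document(doc!(text => "big bad wolf"))?;
    writer.commit()?;

    // "b* wolf"~2: each phrase term is a regex, with a slop of 2.
    let mut query = RegexPhraseQuery::new(text, vec!["b.*".to_string(), "wolf".to_string()]);
    query.set_slop(2);

    let searcher = index.reader()?.searcher();
    assert_eq!(searcher.search(&query, &Count)?, 1);
    Ok(())
}
```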
This PR comes with some new data structures:
SimpleUnion - A union docset over a list of docsets. It doesn't do any
caching and is therefore well suited for workloads with lots of skipping
(phrase search, and intersections in general); a minimal sketch of the
idea is shown after this list.
LoadedPostings - Like SegmentPostings, but with all docs and positions loaded
into memory. SegmentPostings uses 1840 bytes per instance for its caches,
which is equivalent to 460 docids.
LoadedPostings is used for terms with fewer than 100 docs, solely to
reduce memory consumption.
BitSetPostingUnion - Creates a `Posting` that uses a bitset for the docid
hits and the underlying docsets for positions. The bitset is the
precalculated union of the docsets.
In RegexPhraseQuery there is a limit of 512 docsets per PreAggregatedUnion
before a new one is created.
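To illustrate the `SimpleUnion` idea, here is a minimal, uncached union over a list of docsets (the `DocSet` trait here is a stripped-down stand-in for tantivy's):
```rust
const TERMINATED: u32 = u32::MAX;

/// Minimal stand-in for tantivy's DocSet trait.
trait DocSet {
    /// Current doc id, or TERMINATED if exhausted.
    fn doc(&self) -> u32;
    /// Advance to the first doc id >= target and return it.
    fn seek(&mut self, target: u32) -> u32;
}

/// Uncached union: the current doc is simply the minimum of the
/// children's current docs. No buffering, so skipping (as in phrase
/// intersections) stays cheap.
struct SimpleUnion<D: DocSet> {
    docsets: Vec<D>,
    doc: u32,
}

impl<D: DocSet> SimpleUnion<D> {
    fn new(docsets: Vec<D>) -> Self {
        let mut union = SimpleUnion { docsets, doc: 0 };
        union.doc = union.min_doc();
        union
    }

    fn min_doc(&self) -> u32 {
        self.docsets.iter().map(|d| d.doc()).min().unwrap_or(TERMINATED)
    }

    fn seek(&mut self, target: u32) -> u32 {
        for docset in &mut self.docsets {
            if docset.doc() < target {
                docset.seek(target);
            }
        }
        self.doc = self.min_doc();
        self.doc
    }
}

/// Toy docset over a sorted vec, for the usage example below.
struct VecDocSet {
    docs: Vec<u32>,
    cursor: usize,
}

impl DocSet for VecDocSet {
    fn doc(&self) -> u32 {
        self.docs.get(self.cursor).copied().unwrap_or(TERMINATED)
    }
    fn seek(&mut self, target: u32) -> u32 {
        while self.doc() < target {
            self.cursor += 1;
        }
        self.doc()
    }
}

fn main() {
    let a = VecDocSet { docs: vec![1, 5, 9], cursor: 0 };
    let b = VecDocSet { docs: vec![3, 5], cursor: 0 };
    let mut union = SimpleUnion::new(vec![a, b]);
    assert_eq!(union.doc, 1);
    assert_eq!(union.seek(4), 5);
    assert_eq!(union.seek(6), 9);
}
```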
Renamed Union to BufferedUnionScorer
Added proptests to test different union types.
* cleanup
* use Box instead of Vec
* use RefCell instead of term_freq(&mut)
* remove wildcard mode
* move RefCell to outer
* clippy
* store DateTime as nanoseconds in doc store
The doc store DateTime was previously truncated to microseconds. This
removes the truncation, while still keeping backwards compatibility.
This is done by adding the trait `ConfigurableBinarySerializable`, which
works like `BinarySerializable` but carries a config that currently allows
de/serializing with different date-time precisions.
Bump the format version to 7.
Add a compat test to check the date-time truncation.
* remove configurable binary serialize, add enum for doc store version
* test doc store version ord
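A rough sketch of the version-gated deserialization (all names here are illustrative, not the actual tantivy types):
```rust
use std::io;

/// Illustrative doc store version enum; older versions stored DateTime
/// truncated to microseconds, V7 stores full nanoseconds.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum DocStoreVersion {
    V6 = 6,
    V7 = 7,
}

fn deserialize_datetime_nanos(
    reader: &mut impl io::Read,
    version: DocStoreVersion,
) -> io::Result<i64> {
    let mut buf = [0u8; 8];
    reader.read_exact(&mut buf)?;
    let value = i64::from_le_bytes(buf);
    match version {
        // Old format: the stored value is microseconds since epoch.
        DocStoreVersion::V6 => Ok(value * 1_000),
        // New format: the stored value is already nanoseconds.
        DocStoreVersion::V7 => Ok(value),
    }
}

fn main() {
    // The Ord impl matters for compat checks like `version >= V7`,
    // hence the dedicated "test doc store version ord" commit.
    assert!(DocStoreVersion::V7 > DocStoreVersion::V6);
}
```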
This is required by the `fs4` dependency, and other dependencies also
need a Rust version later than 1.66.
Both quickwit and the Python binding already require a newer version.
* change AggregationLimits behavior
This fixes an issue encountered with the current behaviour of
AggregationLimits.
Previously we had AggregationLimits and RessourceLimitGuard, which both
tracked memory, but only RessourceLimitGuard released its memory when
dropped; AggregationLimits did not.
This PR changes AggregationLimits to be a guard itself and removes
RessourceLimitGuard.
* rename AggregationLimits to AggregationLimitsGuard
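A minimal sketch of the guard pattern this moves to: a shared counter that is released on `Drop` (names and fields are illustrative):
```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Illustrative guard: tracks allocated bytes against a shared counter
/// and releases its share when dropped, so memory accounting can no
/// longer leak when an aggregation is abandoned midway.
struct AggregationLimitsGuard {
    memory_consumed: Arc<AtomicU64>,
    allocated_by_this_guard: u64,
    memory_limit: u64,
}

impl AggregationLimitsGuard {
    fn add_memory_consumed(&mut self, bytes: u64) -> Result<(), String> {
        let total = self.memory_consumed.fetch_add(bytes, Ordering::Relaxed) + bytes;
        self.allocated_by_this_guard += bytes;
        if total > self.memory_limit {
            return Err(format!("aggregation memory limit exceeded: {total} bytes"));
        }
        Ok(())
    }
}

impl Drop for AggregationLimitsGuard {
    fn drop(&mut self) {
        // Release this guard's share of the global counter.
        self.memory_consumed
            .fetch_sub(self.allocated_by_this_guard, Ordering::Relaxed);
    }
}
```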
* Fix: Improve collapse_overlapped_ranges function
- Refactor into separate sort_and_deduplicate_ranges and merge_overlapping_ranges functions
- Enhance sorting to consider both start and end of ranges
- Optimize merging logic to handle adjacent ranges
- Add comprehensive examples in function documentation
- Ensure proper handling of duplicate and unsorted input ranges
- Improve overall efficiency and readability of range collapsing algorithm
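A compact sketch of the two-step approach on `Range<usize>` values (the real function operates on tantivy's snippet ranges):
```rust
use std::ops::Range;

/// Sort by (start, end) and drop exact duplicates.
fn sort_and_deduplicate_ranges(mut ranges: Vec<Range<usize>>) -> Vec<Range<usize>> {
    ranges.sort_by_key(|r| (r.start, r.end));
    ranges.dedup();
    ranges
}

/// Merge ranges that overlap or are adjacent, e.g. [0..3, 3..5, 4..8] -> [0..8].
fn merge_overlapping_ranges(ranges: Vec<Range<usize>>) -> Vec<Range<usize>> {
    let mut merged: Vec<Range<usize>> = Vec::new();
    for range in ranges {
        match merged.last_mut() {
            Some(last) if range.start <= last.end => last.end = last.end.max(range.end),
            _ => merged.push(range),
        }
    }
    merged
}

fn main() {
    // Duplicate and unsorted input is handled by the first step.
    let ranges = vec![4..8, 0..3, 3..5, 0..3];
    let collapsed = merge_overlapping_ranges(sort_and_deduplicate_ranges(ranges));
    assert_eq!(collapsed, vec![0..8]);
}
```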
* move debug_assert
---------
Co-authored-by: PSeitz <PSeitz@users.noreply.github.com>
(a OR b) OR (c OR d) can be simplified to (a OR b OR c OR d)
(a AND b) AND (c AND d) can be simplified to (a AND b AND c AND d)
This directly affects how queries are executed
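A sketch of the flattening rule on a toy query tree (the real code works on tantivy's boolean query AST):
```rust
#[derive(Debug, PartialEq)]
enum Query {
    Term(&'static str),
    Or(Vec<Query>),
    And(Vec<Query>),
}

/// Flatten nested OR-of-OR and AND-of-AND nodes into a single level.
fn flatten(query: Query) -> Query {
    match query {
        Query::Or(children) => {
            let mut flat = Vec::new();
            for child in children.into_iter().map(flatten) {
                match child {
                    Query::Or(grandchildren) => flat.extend(grandchildren),
                    other => flat.push(other),
                }
            }
            Query::Or(flat)
        }
        Query::And(children) => {
            let mut flat = Vec::new();
            for child in children.into_iter().map(flatten) {
                match child {
                    Query::And(grandchildren) => flat.extend(grandchildren),
                    other => flat.push(other),
                }
            }
            Query::And(flat)
        }
        leaf => leaf,
    }
}

fn main() {
    use Query::*;
    // (a OR b) OR (c OR d)  ==>  (a OR b OR c OR d)
    let q = Or(vec![Or(vec![Term("a"), Term("b")]), Or(vec![Term("c"), Term("d")])]);
    assert_eq!(flatten(q), Or(vec![Term("a"), Term("b"), Term("c"), Term("d")]));
}
```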
remove unused SumWithCoordsCombiner
the number of fields is unused and private
* support ff range queries on json fields
* fix term date truncation
* use inverted index range query for phrase prefix queries
* rename to InvertedIndexRangeQuery
* fix column filter, add mixed column test
* add Key::I64 and Key::U64 variants in aggregation
Currently, all numerical `Key` values are returned as f64, which in some
cases causes problems with precision and with the way f64 is serialized.
This PR adds `Key::I64` and `Key::U64` variants and uses them in the term
aggregation.
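A sketch of the extended enum and the precision issue it avoids (serialization details omitted; the actual variants' wire format may differ):
```rust
/// Aggregation bucket key: integer values keep their exact
/// representation instead of being coerced to f64.
#[derive(Debug, Clone, PartialEq)]
enum Key {
    Str(String),
    F64(f64),
    U64(u64),
    I64(i64),
}

fn main() {
    // 2^63 + 1 is not exactly representable as f64 ...
    let exact: u64 = (1 << 63) + 1;
    assert_ne!(exact as f64 as u64, exact);
    // ... but survives unchanged as a Key::U64 term bucket key.
    let key = Key::U64(exact);
    assert_eq!(key, Key::U64(9_223_372_036_854_775_809));
}
```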
* add clarification comment
Avoid single-segment lists without deletes as merge candidates, as they
would be moved into a merge operation and then filtered out for merging
in the next consider_merge_options call. In rare cases this could end in
an endless merge loop in which only single segments, with nothing to be
done, are merged.
* add support for str fast field range query
Add support for range queries on str fast fields by converting term
bounds to term ordinal bounds.
closes https://github.com/quickwit-oss/tantivy/issues/2023
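The core idea, sketched against a hypothetical minimal term dictionary (the real code resolves the bounds against the fast field's dictionary):
```rust
use std::ops::Bound;

/// Hypothetical, minimal term dictionary: sorted terms, ordinal = index.
struct TermDict {
    sorted_terms: Vec<String>,
}

impl TermDict {
    /// Ordinal of the first term >= `term`.
    fn first_ord_at_or_after(&self, term: &str) -> u64 {
        self.sorted_terms.partition_point(|t| t.as_str() < term) as u64
    }

    /// Ordinal of the first term > `term`.
    fn first_ord_after(&self, term: &str) -> u64 {
        self.sorted_terms.partition_point(|t| t.as_str() <= term) as u64
    }
}

/// Convert a term bound into an inclusive lower term-ordinal bound.
fn lower_term_bound_to_ord(dict: &TermDict, bound: Bound<&str>) -> u64 {
    match bound {
        Bound::Included(term) => dict.first_ord_at_or_after(term),
        Bound::Excluded(term) => dict.first_ord_after(term),
        Bound::Unbounded => 0,
    }
}

fn main() {
    let dict = TermDict {
        sorted_terms: vec!["apple".into(), "banana".into(), "cherry".into()],
    };
    // A lower bound of "b" maps to the ordinal of "banana" (ord 1).
    assert_eq!(lower_term_bound_to_ord(&dict, Bound::Included("b")), 1);
}
```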
* extend tests, rename
* update comment
* update comment
As preparation for #2023 and #1709
* Use Term to pass parameters
* merge u64 and ip fast field range query
Side note: I did not rename range_query_u64_fastfield, because git would then be unable to track the changes.
The previous way to address the problem was to replace \u{0000}
with '0' in different places.
This logic had several flaws:
Done on the serializer side (as it was for the columnar), there was
a collision problem: if one document in the segment contained a JSON
field with a \0 and another doc contained the same JSON field but with
a '0', we were sending the same field path twice to the serializer.
Another option would have been to normalize all values on the writer
side.
This PR simplifies the logic and simply ignores JSON paths containing a
\0, both in the columnar and the inverted index.
Closes #2442
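The fix boils down to a filter like the following when accepting JSON paths (illustrative; \0 is reserved internally as a path separator):
```rust
/// Skip JSON paths containing a NUL byte instead of rewriting them,
/// since the \0 byte is used internally as a path separator.
fn accept_json_path(path: &str) -> bool {
    !path.as_bytes().contains(&0u8)
}

fn main() {
    assert!(accept_json_path("attributes.color"));
    assert!(!accept_json_path("attributes\u{0000}color"));
}
```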
* feat(query): Make `BooleanQuery` support `minimum_number_should_match`. See issue #2398
This commit introduces a new scorer named DisjunctionScorer, which performs the union of posting lists while enforcing the minimum number of required matching clauses; it is implemented via a min-heap. The necessary modifications to `BooleanQuery` and `BooleanWeight` are included as well.
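A condensed sketch of the min-heap union with a minimum match count (toy posting lists as sorted doc id slices; the real scorer also combines scores):
```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Union of sorted posting lists that only emits docs matched by at
/// least `min_should_match` of them, driven by a min-heap keyed on
/// each list's current doc id.
fn disjunction(postings: &[&[u32]], min_should_match: usize) -> Vec<u32> {
    // Heap entries: (current doc, list index, position within list).
    let mut heap: BinaryHeap<Reverse<(u32, usize, usize)>> = postings
        .iter()
        .enumerate()
        .filter(|(_, p)| !p.is_empty())
        .map(|(i, p)| Reverse((p[0], i, 0)))
        .collect();

    let mut hits = Vec::new();
    while let Some(Reverse((doc, _, _))) = heap.peek().copied() {
        // Pop every list currently positioned on `doc`, counting them.
        let mut match_count = 0;
        while let Some(&Reverse((d, i, pos))) = heap.peek() {
            if d != doc {
                break;
            }
            heap.pop();
            match_count += 1;
            if pos + 1 < postings[i].len() {
                heap.push(Reverse((postings[i][pos + 1], i, pos + 1)));
            }
        }
        if match_count >= min_should_match {
            hits.push(doc);
        }
    }
    hits
}

fn main() {
    let a: &[u32] = &[1, 3, 5];
    let b: &[u32] = &[1, 2, 3];
    let c: &[u32] = &[3, 5];
    // Docs matched by at least 2 of the 3 lists: 1 (a,b), 3 (all), 5 (a,c).
    assert_eq!(disjunction(&[a, b, c], 2), vec![1, 3, 5]);
}
```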
* fixup! fix test
* fixup!: refactor code.
1. More meaningful names.
2. Add a cache for `Disjunction`'s scorers, and fix a bug.
3. Optimize `BooleanWeight::complex_scorer`.
Thanks to Paul Masurel <paul@quickwit.io>.
* squash!: come up with better variable naming.
* squash!: fix naming issues.
* squash!: fix typo.
* squash!: Remove CombinationMethod::FullIntersection
* WiP: cardinality aggregation
* Collect unique entries first, then insert into HyperLogLog
* Handle `missing`
* Hybrid approach
* Review changes
- insert `missing` value at most once
- `term_id` -> `term_ord`
- iterate directly over entries without collecting first
* Use salted hasher to include column type
* fix: formatting
* More review fixes
* Add cardinality to test_aggregation_flushing
* Formatting
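A sketch of the hybrid flow: dedupe term ordinals first, then insert salted hashes into a (here deliberately simplified) HyperLogLog; the salt encodes the column type:
```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Tiny HyperLogLog (2^P registers, no low-range bias correction) --
/// just enough to show where the salted hashes end up.
const P: u32 = 12;

struct Hll {
    registers: Vec<u8>,
}

impl Hll {
    fn new() -> Self {
        Hll { registers: vec![0; 1 << P] }
    }
    fn insert_hash(&mut self, hash: u64) {
        // First P bits pick the register, the rest contribute the rank.
        let idx = (hash >> (64 - P)) as usize;
        let rho = ((hash << P) | 1).leading_zeros() as u8 + 1;
        self.registers[idx] = self.registers[idx].max(rho);
    }
    fn estimate(&self) -> f64 {
        let m = self.registers.len() as f64;
        let sum: f64 = self.registers.iter().map(|&r| 2f64.powi(-(r as i32))).sum();
        0.7213 / (1.0 + 1.079 / m) * m * m / sum
    }
}

/// Salt the hash with the column type so equal raw values from
/// different column types count as distinct entries.
fn salted_hash(column_type_salt: u64, value: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    column_type_salt.hash(&mut hasher);
    value.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    // Hybrid approach: collect unique term ordinals first ...
    let term_ords: Vec<u64> = (0..10_000).map(|i| i % 1_000).collect();
    let unique: HashSet<u64> = term_ords.into_iter().collect();
    // ... then insert the salted hashes into the HyperLogLog.
    let mut hll = Hll::new();
    for ord in unique {
        hll.insert_hash(salted_hash(/* column type */ 1, ord));
    }
    println!("estimated cardinality: {:.0}", hll.estimate()); // ~1000
}
```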