* tokenizer-api: reduce Tokenizer overhead
Previously, a new `Token` was created for each text encountered, each of
which contains a `String::with_capacity(200)`.
In the new API, `token_stream` gets mutable access to the tokenizer, which
allows state to be shared (in this PR the `Token` is shared); a minimal
sketch of the idea follows this entry.
Ideally the allocation for the BoxTokenStream would also be removed, but
this may require some lifetime tricks.
* simplify api
* move lowercase and ascii folding buffer to global
* empty Token text as default
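A minimal sketch of the shared-state idea, using hypothetical types rather than the actual tantivy trait (the real signatures differ): the tokenizer owns a single reusable `Token`, and creating a token stream borrows the tokenizer mutably so the buffer is cleared and refilled instead of being reallocated for every text.

```rust
// Hypothetical illustration only; the real tantivy API differs in details.
#[derive(Default)]
struct Token {
    text: String,
}

#[derive(Default)]
struct WhitespaceTokenizer {
    // Allocated once, reused for every text passed to `token_stream`.
    token: Token,
}

struct WhitespaceTokenStream<'a> {
    token: &'a mut Token,
    words: std::str::SplitWhitespace<'a>,
}

impl WhitespaceTokenizer {
    // Taking `&mut self` lets the stream reuse the tokenizer-owned buffer.
    fn token_stream<'a>(&'a mut self, text: &'a str) -> WhitespaceTokenStream<'a> {
        WhitespaceTokenStream {
            token: &mut self.token,
            words: text.split_whitespace(),
        }
    }
}

impl<'a> WhitespaceTokenStream<'a> {
    fn advance(&mut self) -> bool {
        match self.words.next() {
            Some(word) => {
                // Reuse the existing allocation instead of creating a new String.
                self.token.text.clear();
                self.token.text.push_str(word);
                true
            }
            None => false,
        }
    }
}

fn main() {
    let mut tokenizer = WhitespaceTokenizer::default();
    for text in ["hello world", "shared buffer, no realloc"] {
        let mut stream = tokenizer.token_stream(text);
        while stream.advance() {
            println!("{}", stream.token.text);
        }
    }
}
```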
* Expose phrase-prefix queries via the built-in query parser
This proposes the less-than-imaginative syntax `field:"phrase ter"*` to
perform a phrase-prefix query against `field`, using `phrase` and `ter` as the
terms (see the usage sketch after this entry). The aim is to make this type of
query more discoverable and to simplify manual testing.
I did consider exposing the `max_expansions` parameter similarly to how slop is
handled, but I think that is something that should rather be configured on the
query parser (similar to `set_field_boost` and `set_field_fuzzy`), as
choosing it requires rather intimate knowledge of the backing index.
* Prevent construction of zero or one term phrase-prefix queries via the query parser.
* Add example using phrase-prefix search via surface API to improve feature discoverability.
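A hedged usage sketch of the new syntax; the schema and field name are made up, but the query string follows the proposed `field:"phrase ter"*` form:

```rust
use tantivy::query::QueryParser;
use tantivy::schema::{Schema, TEXT};
use tantivy::Index;

fn main() {
    let mut schema_builder = Schema::builder();
    // TEXT indexes positions, which phrase(-prefix) queries require.
    let title = schema_builder.add_text_field("title", TEXT);
    let index = Index::create_in_ram(schema_builder.build());

    let parser = QueryParser::for_index(&index, vec![title]);
    // Matches documents containing "phrase" followed by a term starting with "ter".
    let query = parser
        .parse_query(r#"title:"phrase ter"*"#)
        .expect("phrase-prefix query should parse");
    println!("{query:?}");
}
```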
* This fixes an issue where, with min_doc==0, terms loaded from the dictionary of
one segment are merged with the same term carrying a subaggregation from another
segment.
Previously, the empty structure was not correctly initialized to contain
the subaggregation, so the merge was incorrect.
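A hedged sketch of the kind of request that was affected, assuming tantivy's JSON aggregation DSL; the field names are made up and `min_doc_count` is assumed to be the parameter referred to as min_doc above:

```rust
use tantivy::aggregation::agg_req::Aggregations;

fn main() {
    // Terms aggregation that also emits empty buckets, with a sub-aggregation
    // that must be merged correctly across segments.
    let agg_req: Aggregations = serde_json::from_value(serde_json::json!({
        "by_category": {
            "terms": { "field": "category", "min_doc_count": 0 },
            "aggs": {
                "avg_price": { "avg": { "field": "price" } }
            }
        }
    }))
    .expect("valid aggregation request");
    println!("{agg_req:?}");
}
```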
* enable tokenizer on json fields
enable configuring a tokenizer on JSON fields for values of type text (a hedged
schema sketch follows this entry)
* Avoid making the tokenizer within the TextAnalyzer pub(crate)
* Moving BoxableTokenizer to tantivy.
---------
Co-authored-by: Paul Masurel <paul@quickwit.io>
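A hedged schema sketch of what this enables; the field name is made up and the exact builder methods are assumed rather than confirmed against the final API:

```rust
use tantivy::schema::{JsonObjectOptions, Schema, TextFieldIndexing};

fn main() {
    let mut schema_builder = Schema::builder();
    // Assumed API: text values inside the JSON object are tokenized with the
    // named tokenizer (here the default one).
    let json_options = JsonObjectOptions::default()
        .set_stored()
        .set_indexing_options(TextFieldIndexing::default().set_tokenizer("default"));
    let _attributes = schema_builder.add_json_field("attributes", json_options);
    let _schema = schema_builder.build();
}
```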
* switch to ms in histogram for date type
switch to ms in the histogram by adding a normalization step that converts
to nanosecond precision when creating the collector (a hedged request sketch
follows this entry).
closes #2028
related to #2026
* add missing unit long variants
* use a single thread to avoid special handling in the test case
* fix docs
* revert CI
* cleanup
* improve docs
* Update src/aggregation/bucket/histogram/histogram.rs
Co-authored-by: Paul Masurel <paul@quickwit.io>
---------
Co-authored-by: Paul Masurel <paul@quickwit.io>
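A hedged request sketch, assuming tantivy's JSON aggregation DSL and a made-up field name: the `interval` of a histogram over a date field is now given in milliseconds and normalized to the index's nanosecond precision when the collector is created.

```rust
use tantivy::aggregation::agg_req::Aggregations;

fn main() {
    let agg_req: Aggregations = serde_json::from_value(serde_json::json!({
        "events_over_time": {
            "histogram": {
                "field": "timestamp",
                // One day expressed in milliseconds.
                "interval": 86_400_000.0
            }
        }
    }))
    .expect("valid aggregation request");
    println!("{agg_req:?}");
}
```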
* Change in the query grammar.
Quotation marks can now be used for phrase queries.
The delimiter is part of the `UserInputLeaf`.
That information is meant to be used in Quickwit to solve #3364.
This PR also adds support for escaping quotation marks in phrase
queries (a hedged sketch follows this entry).
* Apply suggestions from code review
Applied these commands to the code, making it a bit shorter and slightly
more readable.
```
cargo +nightly clippy --all-features --benches --tests --workspace --fix -- -A clippy::all -W clippy::uninlined_format_args
cargo +nightly fmt --all
```
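A hedged sketch of the parser-facing effect; the field name is made up and backslash escaping inside the quotes is an assumption about the new escaping support:

```rust
use tantivy::query::QueryParser;
use tantivy::schema::{Schema, TEXT};
use tantivy::Index;

fn main() {
    let mut schema_builder = Schema::builder();
    let title = schema_builder.add_text_field("title", TEXT);
    let index = Index::create_in_ram(schema_builder.build());
    let parser = QueryParser::for_index(&index, vec![title]);

    // A regular phrase query, plus one with an escaped quotation mark inside
    // the phrase (the exact escape syntax is an assumption).
    for q in [r#"title:"big wolf""#, r#"title:"she said \"hi\"""#] {
        println!("{q} -> {:?}", parser.parse_query(q));
    }
}
```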
* clear memory consumption in AggregationLimits
clear the tracked memory consumption in AggregationLimits at the end of segment
collection (a simplified guard sketch follows this entry)
* switch to ResourceLimitGuard
* deduplicate code
* merge methods
* Apply suggestions from code review
Co-authored-by: Paul Masurel <paul@quickwit.io>
---------
Co-authored-by: Paul Masurel <paul@quickwit.io>
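A simplified, hedged sketch of the guard idea with names only loosely based on the real types: memory accounted against the shared limits is released automatically when the guard is dropped at the end of segment collection.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Shared accounting of memory used by all aggregation collectors.
#[derive(Default, Clone)]
struct AggregationLimits {
    memory_consumption: Arc<AtomicU64>,
}

impl AggregationLimits {
    fn new_guard(&self) -> ResourceLimitGuard {
        ResourceLimitGuard {
            limits: self.clone(),
            allocated: 0,
        }
    }
}

/// Tracks the memory added by one segment collector and gives it back on drop.
struct ResourceLimitGuard {
    limits: AggregationLimits,
    allocated: u64,
}

impl ResourceLimitGuard {
    fn add_memory_consumed(&mut self, num_bytes: u64) {
        self.allocated += num_bytes;
        self.limits
            .memory_consumption
            .fetch_add(num_bytes, Ordering::Relaxed);
    }
}

impl Drop for ResourceLimitGuard {
    fn drop(&mut self) {
        // End of segment collection: release what this segment accounted for.
        self.limits
            .memory_consumption
            .fetch_sub(self.allocated, Ordering::Relaxed);
    }
}

fn main() {
    let limits = AggregationLimits::default();
    {
        let mut guard = limits.new_guard();
        guard.add_memory_consumed(4096);
        assert_eq!(limits.memory_consumption.load(Ordering::Relaxed), 4096);
    } // guard dropped -> consumption cleared
    assert_eq!(limits.memory_consumption.load(Ordering::Relaxed), 0);
}
```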
* allow slop in both directions
allow slop in both directions, so `"big wolf"~3` can also match "wolf big"
(a hedged end-to-end sketch follows this entry).
This also fixes #1934, where the docsets were reordered by size and no longer
matched the terms.
* remove count
* add test for repeating tokens, deduplicate tests
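A hedged end-to-end sketch with a made-up document: with slop applied in both directions, `"big wolf"~3` also matches a document where the terms appear as "wolf big".

```rust
use tantivy::collector::Count;
use tantivy::query::QueryParser;
use tantivy::schema::{Schema, TEXT};
use tantivy::{doc, Index};

fn main() -> tantivy::Result<()> {
    let mut schema_builder = Schema::builder();
    let body = schema_builder.add_text_field("body", TEXT);
    let index = Index::create_in_ram(schema_builder.build());

    let mut writer = index.writer(50_000_000)?;
    // The terms appear in reverse order compared to the query phrase.
    writer.add_document(doc!(body => "the wolf big and bad"))?;
    writer.commit()?;

    let searcher = index.reader()?.searcher();
    let parser = QueryParser::for_index(&index, vec![body]);

    // The slop of 3 absorbs the transposition, so the document still matches.
    let query = parser.parse_query(r#"body:"big wolf"~3"#)?;
    let count = searcher.search(&query, &Count)?;
    println!("matched {count} document(s)");
    Ok(())
}
```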