With tantivy 0.20 the minimum memory consumption per SegmentWriter increased to
12MB. 7MB are for the different fast field collectors types (they could be
lazily created). Increase the minimum memory from 3MB to 15MB.
Change memory variable naming from arena to budget.
closes#2156
* Improve aggregation error message
Improve aggregation error message by wrapping the deserialization with a
custom struct. This deserialization variant is slower, since we need to
keep the deserialized data around twice with this approach.
For now the valid variants list is manually updated. This could be
replaced with a proc macro.
closes#2143
* Simpler implementation
---------
Co-authored-by: Paul Masurel <paul@quickwit.io>
* move query parser to nom
* add suupport for term grouping
* initial work on infallible parser
* fmt
* add tests and fix minor parsing bugs
* address review comments
* add support for lenient queries in tantivy
* make lenient parser report errors
* allow mixing occur and bool in query
* tokenizer-api: reduce Tokenizer overhead
Previously a new `Token` for each text encountered was created, which
contains `String::with_capacity(200)`
In the new API the token_stream gets mutable access to the tokenizer,
this allows state to be shared (in this PR Token is shared).
Ideally the allocation for the BoxTokenStream would also be removed, but
this may require some lifetime tricks.
* simplify api
* move lowercase and ascii folding buffer to global
* empty Token text as default
* Expose phrase-prefix queries via the built-in query parser
This proposes the less-than-imaginative syntax `field:"phrase ter"*` to
perform a phrase prefix query against `field` using `phrase` and `ter` as the
terms. The aim of this is to make this type of query more discoverable and
simplify manual testing.
I did consider exposing the `max_expansions` parameter similar to how slop is
handled, but I think that this is rather something that should be configured via
the querser parser (similar to `set_field_boost` and `set_field_fuzzy`) as
choosing it requires rather intimiate knowledge of the backing index.
* Prevent construction of zero or one term phrase-prefix queries via the query parser.
* Add example using phrase-prefix search via surface API to improve feature discoverability.
* Change in the query grammar.
Quotation mark can now be used for phrase queries.
The delimiter is part of the `UserInputLeaf`.
That information is meant to be used in Quickwit to solve #3364.
This PR also adds support for quotation marks escaping in phrase
queries.
* Apply suggestions from code review
Applied this command to the code, making it a bit shorter and slightly
more readable.
```
cargo +nightly clippy --all-features --benches --tests --workspace --fix -- -A clippy::all -W clippy::uninlined_format_args
cargo +nightly fmt --all
```
* allow slop in both directions
allow slop in both directions
so "big wolf"~3 can also match "wolf big"
This also fixes#1934, when the docsets were reordered by size and didn't
match the terms.
* remove count
* add test for repeating tokens, unduplicate tests
* Better mixed types support in aggs and fix serialization issue
- Improve support for mixed types in JSON field aggregations (pick the right field, #1913)
- Resolve the issue with JSON serialization for numeric keys (fixes#1967)
- Add JSON round-trip test for term buckets
- Remove `u64_lenient`, as this is a footgun without the type
- move aggregation benchmarks
* remove shadowing
Prev: thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: IoError(Custom { kind: InvalidData, error: "Reach end of buffer while reading VInt" })', src/main.rs:46:14
Now: Automatic downgrade to next available level