* Fix O(2^n) query parser regression for deeply-nested queries
The top-level `ast()` parser used `alt((boolean_expr, single_leaf))` at
every group level. When the group contained a single leaf with no
trailing operand, `boolean_expr` would parse `occur_leaf` (recursing
into the inner group), fail at `multispace1`, backtrack, and then
`single_leaf` would re-parse `occur_leaf` from scratch. Every nesting
level doubled the work, giving O(2^n) time for queries like
`(((((title:test)))))`.
Parse `occur_leaf` once and peek ahead for a trailing operand instead
of backtracking. This keeps parsing O(n) and also avoids the duplicate
parse for simple single-leaf queries.
Fixes #2498.
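The doubling described above can be reproduced with a toy recursive-descent parser. This is a minimal sketch and not the tantivy-query-grammar code: the grammar (`leaf := '(' ast ')' | word`), the `Parser` struct, and `count_leaf_parses` are all hypothetical names invented for illustration. It counts how often a leaf is parsed under the backtracking shape versus the parse-once-and-peek shape.

```rust
/// Toy parser (hypothetical, not tantivy's): leaf := '(' ast ')' | word
struct Parser<'a> {
    input: &'a [u8],
    leaf_calls: u64,
    backtrack: bool,
}

impl<'a> Parser<'a> {
    fn new(input: &'a str, backtrack: bool) -> Self {
        Parser { input: input.as_bytes(), leaf_calls: 0, backtrack }
    }

    /// Parses one leaf at `pos` and returns the position just past it.
    fn leaf(&mut self, pos: usize) -> Option<usize> {
        self.leaf_calls += 1;
        if self.input.get(pos) == Some(&b'(') {
            let end = self.ast(pos + 1)?;
            if self.input.get(end) == Some(&b')') { Some(end + 1) } else { None }
        } else {
            let mut end = pos;
            while end < self.input.len() && self.input[end].is_ascii_alphanumeric() {
                end += 1;
            }
            if end > pos { Some(end) } else { None }
        }
    }

    fn ast(&mut self, pos: usize) -> Option<usize> {
        if self.backtrack { self.ast_backtracking(pos) } else { self.ast_single_pass(pos) }
    }

    /// Old shape: try a boolean expression, then re-parse the leaf from scratch.
    fn ast_backtracking(&mut self, pos: usize) -> Option<usize> {
        if let Some(end) = self.boolean_expr(pos) {
            return Some(end);
        }
        self.leaf(pos)
    }

    /// Requires at least one space-separated trailing operand.
    fn boolean_expr(&mut self, pos: usize) -> Option<usize> {
        let mut end = self.leaf(pos)?;
        if self.input.get(end) != Some(&b' ') {
            return None; // the work done by `leaf` above is thrown away
        }
        while self.input.get(end) == Some(&b' ') {
            end = self.leaf(end + 1)?;
        }
        Some(end)
    }

    /// New shape: parse the leaf once, then peek for trailing operands.
    fn ast_single_pass(&mut self, pos: usize) -> Option<usize> {
        let mut end = self.leaf(pos)?;
        while self.input.get(end) == Some(&b' ') {
            end = self.leaf(end + 1)?;
        }
        Some(end)
    }
}

/// Number of leaf parses needed for `test` wrapped in `depth` parentheses.
fn count_leaf_parses(depth: usize, backtrack: bool) -> Option<u64> {
    let query = format!("{}test{}", "(".repeat(depth), ")".repeat(depth));
    let mut parser = Parser::new(&query, backtrack);
    let end = parser.ast(0)?;
    if end == query.len() { Some(parser.leaf_calls) } else { None }
}
```

In this toy model the backtracking variant performs 2^(d+2) - 2 leaf parses at depth d, while the single-pass variant performs d + 1.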
Measured on the issue reproducer (release build):
    depth   before     after
    20      0.87 s     <1 us
    25      28.23 s    <1 us
    60      (years)    ~5 us
Non-pathological queries are unaffected or slightly faster:
    query                   before     after
    hello                   650 ns     308 ns
    a AND b AND c           1380 ns    1364 ns
    title:rust AND (...)    3426 ns    3460 ns
All 53 existing grammar tests and 56 query_parser tests pass. Adds a
regression test at depth 60 that would not complete under the old
parser.
* Add ignored benchmark for nested query parsing at depth 20/21
Matches the depths from issue #2498 which reported 0.87 s / 1.72 s
under the regression. With the fix these parse in single-digit
microseconds. Runs via:
    cargo test -p tantivy-query-grammar --release bench_deeply_nested \
        -- --ignored --nocapture
* Propagate Err::Failure and Err::Incomplete from operand parser
`alt((boolean_expr, single_leaf))` only retried on `Err::Error` and
propagated `Err::Failure` and `Err::Incomplete`. The replacement was
catching all three with `Err(_)`, which would silently fall back to
a single leaf if any cut point were ever added to `operand_leaf` or
its descendants. Match specifically on `Err::Error` to preserve the
original `alt` semantics.
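The distinction can be sketched without depending on nom. The `ParseErr` enum below is a local stand-in that only mirrors the three variants named above, and `combine` and `Ast` are hypothetical names, not the actual parser code. Only the recoverable error falls back to a single leaf; the other two propagate.

```rust
/// Local stand-in mirroring the three error kinds named above (not nom's type).
#[derive(Debug, PartialEq)]
enum ParseErr {
    Incomplete,
    Error(&'static str),
    Failure(&'static str),
}

/// Hypothetical result shape: a lone leaf or a leaf with a trailing operand.
#[derive(Debug, PartialEq)]
enum Ast {
    Single(&'static str),
    Boolean(&'static str, &'static str),
}

/// Combines an already-parsed leaf with the outcome of the operand peek.
fn combine(
    leaf: &'static str,
    trailing: Result<&'static str, ParseErr>,
) -> Result<Ast, ParseErr> {
    match trailing {
        Ok(operand) => Ok(Ast::Boolean(leaf, operand)),
        // Recoverable: there is simply no trailing operand.
        Err(ParseErr::Error(_)) => Ok(Ast::Single(leaf)),
        // Failure and Incomplete must not be swallowed by a catch-all `Err(_)`.
        Err(other) => Err(other),
    }
}
```

With a catch-all `Err(_)` arm in place of the `Err(ParseErr::Error(_))` arm, the last two cases would both collapse into `Ast::Single`, which is the silent fallback this commit removes.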
* Replace inline bench with binggan bench in benches/
Move the nested-query benchmark out of the query-grammar test module
and into a proper binggan benchmark at benches/query_parser_nested.rs,
registered as a harnessless bench in Cargo.toml. Keeps the correctness
regression test (depth 60) in place.
Run with: cargo bench --bench query_parser_nested
* Fix rustfmt import ordering in query_parser_nested bench
* move query parser to nom
* add support for term grouping
* initial work on infallible parser
* fmt
* add tests and fix minor parsing bugs
* address review comments
* add support for lenient queries in tantivy
* make lenient parser report errors
* allow mixing occur and bool in query
* Expose phrase-prefix queries via the built-in query parser
This proposes the less-than-imaginative syntax `field:"phrase ter"*` to
perform a phrase prefix query against `field` using `phrase` and `ter` as the
terms. The aim of this is to make this type of query more discoverable and
simplify manual testing.
I did consider exposing the `max_expansions` parameter similar to how slop is
handled, but I think that this is rather something that should be configured via
the query parser (similar to `set_field_boost` and `set_field_fuzzy`) as
choosing it requires rather intimate knowledge of the backing index.
* Prevent construction of zero or one term phrase-prefix queries via the query parser.
* Add example using phrase-prefix search via surface API to improve feature discoverability.
* Change in the query grammar.
Quotation mark can now be used for phrase queries.
The delimiter is part of the `UserInputLeaf`.
That information is meant to be used in Quickwit to solve #3364.
This PR also adds support for escaping quotation marks in phrase
queries.
* Apply suggestions from code review
Applied this command to the code, making it a bit shorter and slightly
more readable.
```
cargo +nightly clippy --all-features --benches --tests --workspace --fix -- -A clippy::all -W clippy::uninlined_format_args
cargo +nightly fmt --all
```
Quickwit still relies heavily on generating field names
containing a '.' for nested objects, yet allows
user-defined field names to contain a dot.
In order to reuse the tantivy query parser, we will end up
using Quickwit field names directly in tantivy.
Only '.' will be escaped.
This PR makes minor changes in how the tantivy query parser parses
a field name and resolves it to a field.
Some of the new edge-case behavior is hacky.
Closes #1355
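To illustrate the escaping rule, here is a minimal sketch of splitting a field path on unescaped dots. It assumes the escape character is a backslash, which the text above does not state, and `split_field_path` is a hypothetical helper rather than a function from tantivy or Quickwit.

```rust
/// Splits `path` on '.' separators, treating `\.` as a literal dot.
/// Assumption: backslash is the escape character; only '.' is escapable.
fn split_field_path(path: &str) -> Vec<String> {
    let mut segments = Vec::new();
    let mut current = String::new();
    let mut chars = path.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            // An escaped dot belongs to the current segment.
            '\\' if chars.peek() == Some(&'.') => {
                chars.next();
                current.push('.');
            }
            // An unescaped dot ends the current segment.
            '.' => segments.push(std::mem::take(&mut current)),
            other => current.push(other),
        }
    }
    segments.push(current);
    segments
}
```

Under this sketch, `attributes.color` resolves to two segments, while `k8s\.node.name` resolves to the segments `k8s.node` and `name`.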
For date values `chrono` has been replaced with `time`
- The `time` crate is re-exported as `tantivy::time` instead of `tantivy::chrono`.
- The type alias `tantivy::DateTime` has been removed.
- `Value::Date` wraps `time::PrimitiveDateTime` without time zone information.
- Internally date/time values are stored as seconds since UNIX epoch in UTC.
- Converting a `time::OffsetDateTime` to `Value::Date` implicitly converts the value into UTC.
If this is not desired, do the time zone conversion yourself and use `time::PrimitiveDateTime`
directly instead.
Closes #1304
* Handle field names containing arbitrary characters, using a known set of special characters and an escape character
* Update field name validation rule to check only that the name has at least one character and does not start with `-`
Closes #1087.