* Refactoring of the score tweaker into `SortKeyComputer`s to unlock two features.
- Allow lazy evaluation of the score: as soon as we identify that a doc won't
reach the top-K threshold, we can stop the evaluation (see the sketch below).
- Allow for different segment-level and final score types, and a conversion between them.
This PR breaks the public API, but fixing calling code is straightforward.
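A minimal sketch of the lazy-evaluation idea (illustrative only; `cheap_component`, `expensive_component`, and `MAX_EXPENSIVE_BONUS` are hypothetical stand-ins, not the actual `SortKeyComputer` API):
```rust
/// Upper bound on how much the expensive part can add to the score.
const MAX_EXPENSIVE_BONUS: f32 = 1.0;

fn cheap_component(doc: u32) -> f32 {
    (doc % 10) as f32 // stand-in for an inexpensive per-doc signal
}

fn expensive_component(doc: u32) -> f32 {
    (doc % 7) as f32 / 7.0 // stand-in for a costly per-doc signal
}

/// Returns `None` when the doc provably cannot beat the current
/// top-K threshold, so the expensive component is never evaluated.
fn lazy_sort_key(doc: u32, threshold: f32) -> Option<f32> {
    let cheap = cheap_component(doc);
    if cheap + MAX_EXPENSIVE_BONUS <= threshold {
        return None; // early exit: best case still misses the top-K
    }
    Some(cheap + expensive_component(doc))
}

fn main() {
    assert_eq!(lazy_sort_key(2, 9.0), None); // 2.0 + 1.0 <= 9.0
    assert!(lazy_sort_key(9, 3.0).is_some()); // 9.0 + 1.0 > 3.0
}
```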
* Bumping tantivy version
---------
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
* Initial impl
* Added `Filter` impl in `build_single_agg_segment_collector_with_reader` + Added tests
* Added `Filter(FilterBucketResult)` + Made tests work.
* Fixed type issues.
* Fixed a test.
* 8a7a73a: Pass `segment_reader`
* Added more tests.
* Improved parsing + tests
* refactoring
* Added more tests.
* refactoring: moved parsing code under QueryParser
* Use Tantivy syntax instead of ES
* Added a sanity check test.
* Simplified impl + tests
* Added back tests in a more maintainable way
* nits.
* implemented very simple fast-path
* improved a comment
* implemented fast field support
* Used `BoundsRange`
* Improved fast field impl + tests
* Simplified execution.
* Fixed exports + nits
* Improved the tests to check against the expected result.
* Improved test by checking the whole result JSON
* Removed brittle perf checks.
* Added efficiency verification tests.
* Added one more efficiency check test.
* Improved the efficiency tests.
* Removed unnecessary parsing code + added direct Query obj
* Fixed tests.
* Improved tests
* Fixed code structure
* Fixed lint issues
* nits.
* Added an example
* Fixed PR comments.
* Applied PR comments + nits
* nits.
* Improved the code.
* Fixed a perf issue.
* Added batch processing.
* Made the example more interesting
* Fixed bucket count
* Renamed Direct to CustomQuery
* Fixed lint issues.
* No need for scorer to be an `Option`
* nits
* Used BitSet
* Added an optimization for AllQuery
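Illustrative only (not the actual collector code): the shape of such an `AllQuery` fast path, where a filter known to match every doc skips per-doc evaluation entirely.
```rust
/// Hypothetical filter executor: either everything matches (AllQuery)
/// or a per-doc predicate has to run.
enum FilterExec {
    All,
    PerDoc(Box<dyn Fn(u32) -> bool>),
}

fn run_filter(exec: &FilterExec, docs: &[u32]) -> Vec<u32> {
    match exec {
        // Fast path: no per-doc work, every doc passes the filter.
        FilterExec::All => docs.to_vec(),
        FilterExec::PerDoc(pred) => docs.iter().copied().filter(|&d| pred(d)).collect(),
    }
}

fn main() {
    let docs = [1, 2, 3, 4];
    let all = FilterExec::All;
    let even = FilterExec::PerDoc(Box::new(|d| d % 2 == 0));
    assert_eq!(run_filter(&all, &docs), vec![1, 2, 3, 4]);
    assert_eq!(run_filter(&even, &docs), vec![2, 4]);
}
```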
* Fixed merge issues.
* Fixed lint issues.
* Added benchmark for FILTER
* Removed the Option wrapper.
* nits.
* Applied PR comments.
* Fixed the AllQuery optimization
* Applied PR comments.
* feat: used `erased_serde` to allow filter query to be serialized
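A hedged sketch of the `erased_serde` pattern (trait and type names here are illustrative, not tantivy's actual filter-query types): the trait stays object-safe, yet a boxed trait object still serializes.
```rust
use erased_serde::serialize_trait_object;
use serde::Serialize;

/// An object-safe query trait that remains serializable behind a Box.
trait FilterQuery: erased_serde::Serialize {
    fn matches(&self, doc_id: u32) -> bool;
}

// Implements `serde::Serialize` for `dyn FilterQuery`.
serialize_trait_object!(FilterQuery);

#[derive(Serialize)]
struct RangeFilter {
    field: String,
    low: u64,
    high: u64,
}

impl FilterQuery for RangeFilter {
    fn matches(&self, _doc_id: u32) -> bool {
        true // placeholder logic
    }
}

fn main() -> serde_json::Result<()> {
    let query: Box<dyn FilterQuery> = Box::new(RangeFilter {
        field: "price".into(),
        low: 10,
        high: 99,
    });
    // The trait object serializes exactly like the concrete type.
    println!("{}", serde_json::to_string(&query)?);
    Ok(())
}
```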
* further improved a comment
* Added back tests.
* removed unused methods
* Added documentation
* nits.
* Added query builder.
* Fixed a comment.
* Applied PR comments.
* Fixed doctest issues.
* Added ser/de
* Removed bench in test
* Fixed a lint issue.
In preparation for #2023 and #1709
* Use Term to pass parameters
* merge u64 and ip fast field range query
Side note: I did not rename `range_query_u64_fastfield`, because git would then be unable to track the changes.
* compact doc
* add any value type
* pass references when building CompactDoc
* remove OwnedValue from API
* clippy
* fail on large documents
* fmt
* cleanup
* implement Value for different types
fix serde_json date Value implementation
* fmt
* cleanup
* fmt
* cleanup
* store positions instead of pos+len
* remove nodes array
* remove mediumvec
* cleanup
* infallible serialize into vec
* remove positions indirection
* remove the 24 MB size limitation in documents
Use `u32` for `Addr`; remove the 3-byte addressing limitation and use VInt instead.
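A hedged sketch of the variable-length encoding behind this (standard LEB128-style VInt, not the exact tantivy implementation): small values stay at one byte while the full `u32` range fits in at most five, lifting the old 3-byte / ~16 MB ceiling.
```rust
/// Writes `v` as a variable-length integer: 7 payload bits per byte,
/// with the high bit set on every byte except the last.
fn write_vint(mut v: u32, out: &mut Vec<u8>) {
    while v >= 0x80 {
        out.push((v as u8 & 0x7F) | 0x80);
        v >>= 7;
    }
    out.push(v as u8);
}

fn main() {
    let mut buf = Vec::new();
    write_vint(5, &mut buf); // 1 byte
    write_vint(40_000_000, &mut buf); // beyond the old 3-byte range
    assert_eq!(buf.len(), 1 + 4);
}
```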
* cleanup
* extend test
* cleanup, add comments
* rename, remove pub
* Fix windows build (#1)
* Add doc traits
* Add field value iter
* Add value and serialization
* Adjust order
* Fix bug
* Correct type
* Fix generic bugs
* Reformat code
* Add generic to index writer which I forgot about
* Fix missing generics on single segment writer
* Add missing type export
* Add default methods for convenience
* Cleanup
* Fix more-like-this query to use standard types
* Update API and fix tests
* Rebase main and fix conflicts
* Merge upstream
* Add tokenizer improvements from previous commits
* Reformat
* Fix unit tests
* Use enum in changes
* Stage changes
* Add new deserializer logic
* Add serializer integration
* Add document deserializer
* Implement new (de)serialization api for existing types
* Fix bugs and type errors
* Add helper implementations
* Fix errors
* Reformat code
* Add unit tests and some code organisation for serialization
* Add unit tests to deserializer
* Add some small docs
* Add support for deserializing serde values
* Reformat
* Fix typos
* Change repr of facet
* Remove unused trait methods
* Add child value type
* Resolve comments
* Fix build
* Fix more build errors
* Fix the tests I missed
* Fix examples
* fix numerical order, serialize PreTok Str
* fix coverage
* rename `Document` to `TantivyDocument`, rename `DocumentAccess` to `Document`
add a `Binary` prefix to binary de/serialization
* fix coverage
---------
Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>
* tokenizer-api: reduce Tokenizer overhead
Previously, a new `Token` was created for each text encountered, each carrying a
`String::with_capacity(200)` allocation.
In the new API the token_stream gets mutable access to the tokenizer;
this allows state to be shared (in this PR the `Token` is shared).
Ideally the allocation for the `BoxTokenStream` would also be removed, but
this may require some lifetime tricks.
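A simplified sketch of the shape of the new API (illustrative, not the exact tantivy traits): `token_stream` takes `&mut self`, so the stream borrows a `Token` owned by the tokenizer and its `String` allocation is reused across calls.
```rust
#[derive(Default)]
struct Token {
    text: String,
    position: usize,
}

#[derive(Default)]
struct SimpleTokenizer {
    token: Token, // reused across calls to `token_stream`
}

struct SimpleTokenStream<'a> {
    token: &'a mut Token,
    words: std::str::SplitWhitespace<'a>,
    position: usize,
}

impl SimpleTokenizer {
    /// `&mut self` lets the stream share the tokenizer's `Token`.
    fn token_stream<'a>(&'a mut self, text: &'a str) -> SimpleTokenStream<'a> {
        SimpleTokenStream {
            token: &mut self.token,
            words: text.split_whitespace(),
            position: 0,
        }
    }
}

impl SimpleTokenStream<'_> {
    fn advance(&mut self) -> bool {
        if let Some(word) = self.words.next() {
            self.token.text.clear(); // keep the existing allocation
            self.token.text.push_str(word);
            self.token.position = self.position;
            self.position += 1;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut tokenizer = SimpleTokenizer::default();
    let mut stream = tokenizer.token_stream("reuse one token buffer");
    while stream.advance() {
        println!("{} @ {}", stream.token.text, stream.token.position);
    }
}
```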
* simplify api
* move lowercase and ascii folding buffer to global
* empty Token text as default
* Expose phrase-prefix queries via the built-in query parser
This proposes the less-than-imaginative syntax `field:"phrase ter"*` to
perform a phrase prefix query against `field` using `phrase` and `ter` as the
terms. The aim of this is to make this type of query more discoverable and
simplify manual testing.
I did consider exposing the `max_expansions` parameter similar to how slop is
handled, but I think that this is rather something that should be configured via
the query parser (similar to `set_field_boost` and `set_field_fuzzy`), as
choosing it requires rather intimate knowledge of the backing index.
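A short sketch of the resulting surface (schema and field names are illustrative; assumes the built-in query parser as of this change):
```rust
use tantivy::query::QueryParser;
use tantivy::schema::{Schema, TEXT};
use tantivy::Index;

fn main() {
    let mut schema_builder = Schema::builder();
    // TEXT records positions, which phrase(-prefix) queries require.
    let title = schema_builder.add_text_field("title", TEXT);
    let index = Index::create_in_ram(schema_builder.build());
    let parser = QueryParser::for_index(&index, vec![title]);
    // Matches docs containing the phrase "quick bro…", where the last
    // term only needs to start with "bro" (e.g. "quick brown").
    let query = parser.parse_query(r#"title:"quick bro"*"#).unwrap();
    println!("{query:?}");
}
```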
* Prevent construction of zero or one term phrase-prefix queries via the query parser.
* Add example using phrase-prefix search via surface API to improve feature discoverability.
* Applied this command to the code, making it a bit shorter and slightly
more readable.
more readable.
```
cargo +nightly clippy --all-features --benches --tests --workspace --fix -- -A clippy::all -W clippy::uninlined_format_args
cargo +nightly fmt --all
```
* tokenizer option on text fast fields
Allow setting the tokenizer option on text fast fields (fixes #1901).
Handle `PreTokenized` strings in fast fields.
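A hedged sketch of the schema side (assuming `TextOptions::set_fast` taking an optional normalizer name, as introduced here; check the exact signature for your version):
```rust
use tantivy::schema::{Schema, TextFieldIndexing, TextOptions};

fn main() {
    let mut schema_builder = Schema::builder();
    let opts = TextOptions::default()
        .set_indexing_options(TextFieldIndexing::default())
        // Also store the field as a text fast field, normalized with
        // the "raw" tokenizer (one unprocessed token per value).
        .set_fast(Some("raw"));
    schema_builder.add_text_field("category", opts);
    let _schema = schema_builder.build();
}
```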
* change visibility
* remove custom de/serialization
* add memory limit for aggregations
Introduce `AggregationLimits` to set a memory consumption limit and a bucket limit.
The memory limit is checked during aggregation; the bucket limit is checked before returning the aggregation result.
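A hedged sketch of wiring the limits into a collector (constructor signatures may differ slightly between versions; the budget values are arbitrary):
```rust
use tantivy::aggregation::agg_req::Aggregations;
use tantivy::aggregation::{AggregationCollector, AggregationLimits};

/// Hypothetical helper: a collector with a 500 MB memory budget and
/// a cap of 65_000 buckets.
fn collector_with_limits(aggs: Aggregations) -> AggregationCollector {
    let limits = AggregationLimits::new(Some(500_000_000), Some(65_000));
    AggregationCollector::from_aggs(aggs, limits)
}
```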
* Apply suggestions from code review
Co-authored-by: Paul Masurel <paul@quickwit.io>
* add ByteCount with human readable format
---------
Co-authored-by: Paul Masurel <paul@quickwit.io>
* add aggregation support for date type
fixes #1332
* serialize key_as_string as rfc3339 in date histogram
* update docs
* enable date for range aggregation
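For example (shape is illustrative), a date histogram request and the rough form of a returned bucket, with `key_as_string` in RFC 3339:
```rust
fn main() {
    // Elasticsearch-compatible aggregation JSON that tantivy accepts.
    let agg_req_json = r#"{
      "docs_per_day": {
        "date_histogram": { "field": "timestamp", "fixed_interval": "1d" }
      }
    }"#;
    // A bucket in the response looks roughly like:
    // { "key": 1420070400000.0,
    //   "key_as_string": "2015-01-01T00:00:00Z",
    //   "doc_count": 3 }
    println!("{agg_req_json}");
}
```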