* Refactoring of the score tweaker into `SortKeyComputer`s to unlock two features.
- Allow lazy evaluation of score. As soon as we identified that a doc won't
reach the topK threshold, we can stop the evaluation.
- Allow for a different segment level score, segment level score and their conversion.
This PR breaks public API, but fixing code is straightforward.
* Bumping tantivy version
---------
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
It applies the same logic on floats as for u64 or i64.
In all case, the idea is (for the inverted index) to coerce number
to their canonical representation, before indexing and before searching.
That way a document with the float 1.0 will be searchable when the user
searches for 1.
Note that contrary to the columnar, we do not attempt to coerce all of the
terms associated to a given json path to a single numerical type.
We simply rely on this "point-wise" canonicalization.
* Added per-field size details.
This also does a bunch of refactoring.
merging field metadata does not silently asserts that arguments should be sorted.
merging does not set `stored`.
We do not rely on a hashmap to group fields, but instead rely on the fact that
the term dictionary is sorted.
The inverted level method that exposes field metadata is not exposed
as public anymore.
* CR comment
---------
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
* store DateTime as nanoseconds in doc store
The doc store DateTime was truncated to microseconds previously. This
removes this truncation, while still keeping backwards compatibility.
This is done by adding the trait `ConfigurableBinarySerializable`, which
works like `BinarySerializable`, but with a config that allows de/serialize
as different date time precision currently.
bump version format to 7.
add compat test to check the date time truncation.
* remove configurable binary serialize, add enum for doc store version
* test doc store version ord
* first version of extended stats along with its tests
* using IntermediateExtendStats instead of IntermediateStats with all tests passing
* Created struct for request and response
* first test with extended_stats
* kahan summation and tests with approximate equality
* version ready for merge
* removed approx dependency
* refactor for using ExtendedStats only when needed
* interim version
* refined version with code formatted
* refactored a struct
* cosmetic refactor
* fix after merge
* fix format
* added extended_stat bench
* merge and new benchmark for extended stats
* split stat segment collectors
* wrapped intermediate extended stat with a box to limit memory usage
* Revert "wrapped intermediate extended stat with a box to limit memory usage"
This reverts commit 5b4aa9f393.
* some code reformat, commented kahan summation
* refactor after review
* refactor after code review
* fix after incorrectly restoring kahan summation
* modifications for code review + bug fix in merge_fruit
* refactor assert_nearly_equals macro
* update after code review
---------
Co-authored-by: Giovanni Cuccu <gcuccu@imolainformatica.it>
* compact doc
* add any value type
* pass references when building CompactDoc
* remove OwnedValue from API
* clippy
* clippy
* fail on large documents
* fmt
* cleanup
* cleanup
* implement Value for different types
fix serde_json date Value implementation
* fmt
* cleanup
* fmt
* cleanup
* store positions instead of pos+len
* remove nodes array
* remove mediumvec
* cleanup
* infallible serialize into vec
* remove positions indirection
* remove 24MB limitation in document
use u32 for Addr
Remove the 3 byte addressing limitation and use VInt instead
* cleanup
* extend test
* cleanup, add comments
* rename, remove pub
* add method to fetch block of first vals in columnar
add method to fetch block of first vals in columnar (this is way faster
than single calls for full columns)
add benchmark
fix import warnings
```
test bench_get_block_first_on_full_column ... bench: 56 ns/iter (+/- 26)
test bench_get_block_first_on_full_column_single_calls ... bench: 311 ns/iter (+/- 6)
test bench_get_block_first_on_multi_column ... bench: 378 ns/iter (+/- 15)
test bench_get_block_first_on_multi_column_single_calls ... bench: 546 ns/iter (+/- 13)
test bench_get_block_first_on_optional_column ... bench: 291 ns/iter (+/- 6)
test bench_get_block_first_on_optional_column_single_calls ... bench: 362 ns/iter (+/- 8)
```
* use remainder
* feat(aggregators/metric): Implement a top_hits aggregator
* fix: Expose get_fields
* fix: Serializer for top_hits request
Also removes extraneous the extraneous third-party
serialization helper.
* chore: Avert panick on parsing invalid top_hits query
* refactor: Allow multiple field names from aggregations
* perf: Replace binary heap with TopNComputer
* fix: Avoid comparator inversion by ComparableDoc
* fix: Rank missing field values lower than present values
* refactor: Make KeyOrder a struct
* feat: Rough attempt at docvalue_fields
* feat: Complete stab at docvalue_fields
- Rename "SearchResult*" => "Retrieval*"
- Revert Vec => HashMap for aggregation accessors.
- Split accessors for core aggregation and field retrieval.
- Resolve globbed field names in docvalue_fields retrieval.
- Handle strings/bytes and other column types with DynamicColumn
* test(unit): Add tests for top_hits aggregator
* fix: docfield_value field globbing
* test(unit): Include dynamic fields
* fix: Value -> OwnedValue
* fix: Use OwnedValue's native Null variant
* chore: Improve readability of test asserts
* chore: Remove DocAddress from top_hits result
* docs: Update aggregator doc
* revert: accidental doc test
* chore: enable time macros only for tests
* chore: Apply suggestions from review
* chore: Apply suggestions from review
* fix: Retrieve all values for fields
* test(unit): Update for multi-value retrieval
* chore: Assert term existence
* feat: Include all columns for a column name
Since a (name, type) constitutes a unique column.
* fix: Resolve json fields
Introduces a translation step to bridge the difference between
ColumnarReaders null `\0` separated json field keys to the common
`.` separated used by SegmentReader. Although, this should probably
be the default behavior for ColumnarReader's public API perhaps.
* chore: Address review on mutability
* chore: s/segment_id/segment_ordinal instances of SegmentOrdinal
* chore: Revert erroneous grammar change
* add fields_metadata to SegmentReader, add columnar docs
* use schema to resolve field, add test
* normalize paths
* merge for FieldsMetadata, add fields_metadata on Index
* Update src/core/segment_reader.rs
Co-authored-by: Paul Masurel <paul@quickwit.io>
* merge code paths
* add Hash
* move function oustide
---------
Co-authored-by: Paul Masurel <paul@quickwit.io>
* fix windows build (#1)
* Fix windows build
* Add doc traits
* Add field value iter
* Add value and serialization
* Adjust order
* Fix bug
* Correct type
* Fix generic bugs
* Reformat code
* Add generic to index writer which I forgot about
* Fix missing generics on single segment writer
* Add missing type export
* Add default methods for convenience
* Cleanup
* Fix more-like-this query to use standard types
* Update API and fix tests
* Add doc traits
* Add field value iter
* Add value and serialization
* Adjust order
* Fix bug
* Correct type
* Rebase main and fix conflicts
* Reformat code
* Merge upstream
* Fix missing generics on single segment writer
* Add missing type export
* Add default methods for convenience
* Cleanup
* Fix more-like-this query to use standard types
* Update API and fix tests
* Add tokenizer improvements from previous commits
* Add tokenizer improvements from previous commits
* Reformat
* Fix unit tests
* Fix unit tests
* Use enum in changes
* Stage changes
* Add new deserializer logic
* Add serializer integration
* Add document deserializer
* Implement new (de)serialization api for existing types
* Fix bugs and type errors
* Add helper implementations
* Fix errors
* Reformat code
* Add unit tests and some code organisation for serialization
* Add unit tests to deserializer
* Add some small docs
* Add support for deserializing serde values
* Reformat
* Fix typo
* Fix typo
* Change repr of facet
* Remove unused trait methods
* Add child value type
* Resolve comments
* Fix build
* Fix more build errors
* Fix more build errors
* Fix the tests I missed
* Fix examples
* fix numerical order, serialize PreTok Str
* fix coverage
* rename Document to TantivyDocument, rename DocumentAccess to Document
add Binary prefix to binary de/serialization
* fix coverage
---------
Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>
* Include only built-in compression algorithms as enum variants
This enables compile-time errors when a compression algorithm is requested which
is not actually enabled for the current Cargo project. The cost is that indexes
using other compression algorithms cannot even be loaded (even though they
are not fully accessible in any case).
As a drive-by, this also fixes `--no-default-features` on `cfg(unix)`.
* Provide more instructive error messages for unsupported, but not unknown compression variants.
* Do some Clippy- and Cargo-related boy-scouting.
* Add BytesFilterCollector to support filtering based on a bytes fast field
This is basically a copy of the existing FilterCollector but modified and
specialised to work on a bytes fast field.
* Changed semantics of filter collectors to consider multi-valued fields
The new fast field code, based on columnar, had a larger minimum memory
footprint, causing the first docuemnt to trigger a flush of the asegment
in this unit test.
This PR prevents the allocation of a large capacity for the different hashmap tables
using in the columnar writer.
Closes#1859
Introduce MakeZero trait, remove make_zero from FastValue
Merge two multivalue fastfield implementations into one
prepare range query on fastfield for different types