* add support for delta-1 encoding posting list
* encode term frequency minus one
* don't emit tf for json integer terms
* make skipreader not pub(crate) mutable
* run coverage only after merge
coverage is a quite slow step in CI. It can be run only after merging
* Apply suggestions from code review
Co-authored-by: Paul Masurel <paul@quickwit.io>
---------
Co-authored-by: Paul Masurel <paul@quickwit.io>
* remove Document: DocumentDeserialize dependency
The dependency requires users to implement an API they may not use.
* remove unnecessary Document bounds
* fix windows build (#1)
* Fix windows build
* Add doc traits
* Add field value iter
* Add value and serialization
* Adjust order
* Fix bug
* Correct type
* Fix generic bugs
* Reformat code
* Add generic to index writer which I forgot about
* Fix missing generics on single segment writer
* Add missing type export
* Add default methods for convenience
* Cleanup
* Fix more-like-this query to use standard types
* Update API and fix tests
* Add doc traits
* Add field value iter
* Add value and serialization
* Adjust order
* Fix bug
* Correct type
* Rebase main and fix conflicts
* Reformat code
* Merge upstream
* Fix missing generics on single segment writer
* Add missing type export
* Add default methods for convenience
* Cleanup
* Fix more-like-this query to use standard types
* Update API and fix tests
* Add tokenizer improvements from previous commits
* Add tokenizer improvements from previous commits
* Reformat
* Fix unit tests
* Fix unit tests
* Use enum in changes
* Stage changes
* Add new deserializer logic
* Add serializer integration
* Add document deserializer
* Implement new (de)serialization api for existing types
* Fix bugs and type errors
* Add helper implementations
* Fix errors
* Reformat code
* Add unit tests and some code organisation for serialization
* Add unit tests to deserializer
* Add some small docs
* Add support for deserializing serde values
* Reformat
* Fix typo
* Fix typo
* Change repr of facet
* Remove unused trait methods
* Add child value type
* Resolve comments
* Fix build
* Fix more build errors
* Fix more build errors
* Fix the tests I missed
* Fix examples
* fix numerical order, serialize PreTok Str
* fix coverage
* rename Document to TantivyDocument, rename DocumentAccess to Document
add Binary prefix to binary de/serialization
* fix coverage
---------
Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>
* replace BinaryHeap for TopN
replace BinaryHeap for TopN with variant that selects the median with QuickSort,
which runs in O(n) time.
add merge_fruits fast path
* call truncate unconditionally, extend test
* remove special early exit
* add TODO, fmt
* truncate top n instead median, return vec
* simplify code
* Fix DateHistogram bucket gap
Fixes a computation issue of the number of buckets needed in the
DateHistogram.
This is due to a missing normalization from request values (ms) to fast field
values (ns), when converting an intermediate result to the final result.
This results in a wrong computation by a factor 1_000_000.
The Histogram normalizes values to nanoseconds, to make the user input like
extended_bounds (ms precision) and the values from the fast field (ns precision for date type) compatible.
This normalization happens only for date type fields, as other field types don't have precision settings.
The normalization does not happen due a missing `column_type`, which is not
correctly passed after merging an empty aggregation (which does not have a `column_type` set), with a regular aggregation.
Another related issue is an empty aggregation, which will not have
`column_type` set, will not convert the result to human readable format.
This PR fixes the issue by:
- Limit the allowed field types of DateHistogram to DateType
- Instead of passing the column_type, which is only available on the segment level, we flag the aggregation as `is_date_agg`.
- Fix the merge logic
Add a flag to to normalization only once. This is not an issue
currently, but it could become easily one.
closes https://github.com/quickwit-oss/quickwit/issues/3837
* use older nightly for time crate (breaks build)
With tantivy 0.20 the minimum memory consumption per SegmentWriter increased to
12MB. 7MB are for the different fast field collectors types (they could be
lazily created). Increase the minimum memory from 3MB to 15MB.
Change memory variable naming from arena to budget.
closes#2156
* add missing aggregation part 2
Add missing support for:
- Mixed types columns
- Key of type string on numerical fields
The special aggregation is slower than the integrated one in TermsAggregation and therefore not
chosen by default, although it can cover all use cases.
* simplify, add num_docs to empty