* fix windows build (#1)
* Fix windows build
* Add doc traits
* Add field value iter
* Add value and serialization
* Adjust order
* Fix bug
* Correct type
* Fix generic bugs
* Reformat code
* Add generic to index writer which I forgot about
* Fix missing generics on single segment writer
* Add missing type export
* Add default methods for convenience
* Cleanup
* Fix more-like-this query to use standard types
* Update API and fix tests
* Add doc traits
* Add field value iter
* Add value and serialization
* Adjust order
* Fix bug
* Correct type
* Rebase main and fix conflicts
* Reformat code
* Merge upstream
* Fix missing generics on single segment writer
* Add missing type export
* Add default methods for convenience
* Cleanup
* Fix more-like-this query to use standard types
* Update API and fix tests
* Add tokenizer improvements from previous commits
* Reformat
* Fix unit tests
* Use enum in changes
* Stage changes
* Add new deserializer logic
* Add serializer integration
* Add document deserializer
* Implement new (de)serialization api for existing types
* Fix bugs and type errors
* Add helper implementations
* Fix errors
* Reformat code
* Add unit tests and some code organisation for serialization
* Add unit tests to deserializer
* Add some small docs
* Add support for deserializing serde values
* Reformat
* Fix typo
* Change repr of facet
* Remove unused trait methods
* Add child value type
* Resolve comments
* Fix build
* Fix more build errors
* Fix the tests I missed
* Fix examples
* fix numerical order, serialize PreTok Str
* fix coverage
* rename Document to TantivyDocument, rename DocumentAccess to Document
add Binary prefix to binary de/serialization
* fix coverage
---------
Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>
* tokenizer-api: reduce Tokenizer overhead
Previously, a new `Token` was created for each text encountered, and each
`Token` contains a `String::with_capacity(200)`.
In the new API, the token_stream gets mutable access to the tokenizer.
This allows state to be shared (in this PR, the Token is shared).
Ideally the allocation for the BoxTokenStream would also be removed, but
this may require some lifetime tricks.
* simplify api
* move lowercase and ascii folding buffer to global
* empty Token text as default
* enable tokenizer on json fields
enable tokenizer on json fields for type text
* Avoid making the tokenizer within the TextAnalyzer pub(crate)
* Moving BoxableTokenizer to tantivy.
---------
Co-authored-by: Paul Masurel <paul@quickwit.io>
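
A minimal sketch of the idea behind the tokenizer change, not the actual tantivy API: the tokenizer owns one `Token`, and `token_stream` takes `&mut self` so the stream can reuse that token's `String` buffer instead of allocating a new one per text. The names `WhitespaceTokenizer` and `WhitespaceTokenStream`, and the two-field `Token`, are assumptions made for illustration.

```rust
/// Simplified token (assumption: the real `Token` carries more fields).
#[derive(Debug, Clone)]
pub struct Token {
    pub text: String,
    pub position: usize,
}

/// Hypothetical tokenizer that owns a single reusable `Token`.
pub struct WhitespaceTokenizer {
    token: Token,
}

impl Default for WhitespaceTokenizer {
    fn default() -> Self {
        // The buffer is allocated once, when the tokenizer is built.
        WhitespaceTokenizer {
            token: Token { text: String::with_capacity(200), position: 0 },
        }
    }
}

/// Stream that borrows the tokenizer's token instead of owning one.
pub struct WhitespaceTokenStream<'a> {
    words: std::str::SplitWhitespace<'a>,
    token: &'a mut Token,
    next_position: usize,
}

impl WhitespaceTokenizer {
    /// `&mut self` is what lets the stream share the tokenizer's state.
    pub fn token_stream<'a>(&'a mut self, text: &'a str) -> WhitespaceTokenStream<'a> {
        self.token.text.clear();
        self.token.position = 0;
        WhitespaceTokenStream {
            words: text.split_whitespace(),
            token: &mut self.token,
            next_position: 0,
        }
    }
}

impl<'a> WhitespaceTokenStream<'a> {
    /// Moves to the next word; returns `false` when the text is exhausted.
    pub fn advance(&mut self) -> bool {
        match self.words.next() {
            Some(word) => {
                // `clear` keeps the capacity, so no new allocation is needed
                // as long as the word fits in the existing buffer.
                self.token.text.clear();
                self.token.text.push_str(word);
                self.token.position = self.next_position;
                self.next_position += 1;
                true
            }
            None => false,
        }
    }

    /// Read access to the shared token.
    pub fn token(&self) -> &Token {
        &*self.token
    }
}
```

Because the stream holds a mutable borrow of the tokenizer, only one stream per tokenizer can be alive at a time; that is the trade-off for sharing the buffer.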
* compress sstable with zstd
* add some details to sstable readme
* compress only blocks which benefit from it
* multiple changes to sstable
make compression optional
use OwnedBytes instead of impl Read in sstable, required for next point
use zstd bulk api, which is much faster on small records
* cleanup and use bulk api for compression
* use dedicated byte for compression
* switch block len and compression flag
* change default zstd level in sstable
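
A rough sketch of "compress only blocks which benefit from it" combined with a dedicated compression byte. The on-disk layout used here (`u32` little-endian payload length, then one flag byte, then the payload) is an assumption, not the documented sstable format, and the toy run-length coder merely stands in for zstd so the example needs no external crate.

```rust
pub const FLAG_RAW: u8 = 0;
pub const FLAG_COMPRESSED: u8 = 1;
const HEADER_LEN: usize = 5; // 4 length bytes + 1 flag byte (assumed layout)

/// Writes one block; the compressed form is kept only if it is strictly smaller.
/// Returns `true` when the block was stored compressed.
pub fn write_block<F>(raw: &[u8], compress: F, out: &mut Vec<u8>) -> bool
where
    F: Fn(&[u8]) -> Vec<u8>,
{
    let compressed = compress(raw);
    let use_compressed = compressed.len() < raw.len();
    let payload: &[u8] = if use_compressed { compressed.as_slice() } else { raw };
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.push(if use_compressed { FLAG_COMPRESSED } else { FLAG_RAW });
    out.extend_from_slice(payload);
    use_compressed
}

/// Reads one block; returns the decoded bytes and the number of bytes consumed.
pub fn read_block<F>(data: &[u8], decompress: F) -> Option<(Vec<u8>, usize)>
where
    F: Fn(&[u8]) -> Vec<u8>,
{
    if data.len() < HEADER_LEN {
        return None;
    }
    let len = u32::from_le_bytes([data[0], data[1], data[2], data[3]]) as usize;
    let flag = data[4];
    let payload = data.get(HEADER_LEN..HEADER_LEN + len)?;
    let block = match flag {
        FLAG_RAW => payload.to_vec(),
        FLAG_COMPRESSED => decompress(payload),
        _ => return None, // unknown flag value
    };
    Some((block, HEADER_LEN + len))
}

/// Toy run-length coder standing in for zstd: emits (count, byte) pairs.
pub fn toy_rle_compress(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < input.len() {
        let byte = input[i];
        let mut run = 1usize;
        while i + run < input.len() && input[i + run] == byte && run < 255 {
            run += 1;
        }
        out.push(run as u8);
        out.push(byte);
        i += run;
    }
    out
}

/// Inverse of `toy_rle_compress`.
pub fn toy_rle_decompress(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    for pair in input.chunks_exact(2) {
        out.extend(std::iter::repeat(pair[1]).take(pair[0] as usize));
    }
    out
}
```

Keeping the flag in its own byte means the reader never has to guess whether a payload is compressed, and incompressible blocks pay only one byte of overhead.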
* tokenizer option on text fastfield
allow setting the tokenizer option on text fastfield (fixes #1901)
handle PreTokenized strings in fast field
* change visibility
* remove custom de/serialization
* document a new sstable format
* add support for changing target block size
* use new format for sstable index
* handle sstable version error
* use very small blocks for proptests
* add a footer structure
* add memory limit for aggregations
introduce AggregationLimits to set memory consumption limit and bucket limits
memory limit is checked during aggregation, bucket limit is checked before returning the aggregation request.
* Apply suggestions from code review
Co-authored-by: Paul Masurel <paul@quickwit.io>
* add ByteCount with human readable format
---------
Co-authored-by: Paul Masurel <paul@quickwit.io>
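
The two limits described above can be sketched as follows. This is an illustration under assumptions, not the `AggregationLimits` API itself: `LimitsSketch`, `ByteCountSketch`, `LimitError`, the method names, and the decimal units are all hypothetical. The memory check runs on every reported allocation, while the bucket check runs once at the end.

```rust
use std::fmt;

/// Hypothetical byte count with a human-readable `Display` (decimal units assumed).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ByteCountSketch(pub u64);

impl fmt::Display for ByteCountSketch {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        const UNITS: [(&str, u64); 3] =
            [("GB", 1_000_000_000), ("MB", 1_000_000), ("KB", 1_000)];
        for &(suffix, size) in UNITS.iter() {
            if self.0 >= size {
                return write!(f, "{:.2} {}", self.0 as f64 / size as f64, suffix);
            }
        }
        write!(f, "{} B", self.0)
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum LimitError {
    MemoryExceeded { used: ByteCountSketch, limit: ByteCountSketch },
    BucketLimitExceeded { buckets: u32, limit: u32 },
}

/// Hypothetical tracker for one aggregation request.
pub struct LimitsSketch {
    memory_used: u64,
    memory_limit: u64,
    bucket_limit: u32,
}

impl LimitsSketch {
    pub fn new(memory_limit: u64, bucket_limit: u32) -> Self {
        LimitsSketch { memory_used: 0, memory_limit, bucket_limit }
    }

    /// Called during aggregation, each time more memory is taken.
    pub fn add_memory_consumed(&mut self, bytes: u64) -> Result<(), LimitError> {
        self.memory_used = self.memory_used.saturating_add(bytes);
        if self.memory_used > self.memory_limit {
            return Err(LimitError::MemoryExceeded {
                used: ByteCountSketch(self.memory_used),
                limit: ByteCountSketch(self.memory_limit),
            });
        }
        Ok(())
    }

    /// Called once, before the final result is returned.
    pub fn validate_bucket_count(&self, buckets: u32) -> Result<(), LimitError> {
        if buckets > self.bucket_limit {
            return Err(LimitError::BucketLimitExceeded {
                buckets,
                limit: self.bucket_limit,
            });
        }
        Ok(())
    }
}
```

Checking memory incrementally lets a runaway aggregation fail early, whereas the bucket count is only meaningful once the result has been assembled.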
* handle missing column for aggs
add empty column fallback for missing column in aggs.
Fix sort for term agg on sub-agg with missing value (null is smallest)
* add error when field is not fast
The new fast field code, based on columnar, had a larger minimum memory
footprint, causing the first document to trigger a flush of the segment
in this unit test.
This PR prevents the allocation of a large capacity for the different hashmap tables
used in the columnar writer.
Closes #1859
* Improvement on the scalar / random bitpacker code.
Added proptesting
Added simple benchmark
Added assert and comments on the highly non-trivial hidden contract
Remove the need for an extra padding.
The last point introduces a small performance regression (~10%).
* Fixing unit tests
Introduce MakeZero trait, remove make_zero from FastValue
Merge two multivalue fastfield implementations into one
prepare range query on fastfield for different types
* prepare fastfield format for null index
* add format version for fastfield
* Update fastfield_codecs/src/compact_space/mod.rs
* switch to variable size footer
* serialize delta of end
* add aggregation support for date type
fixes #1332
* serialize key_as_string as rfc3339 in date histogram
* update docs
* enable date for range aggregation
* Rename BlockwiseLinear to BlockwiseLinearLegacy
Reimplements the blockwise multilinear codec using integer arithmetic.
Added comments
* add estimate for blockwise
* Added one unit test
* use integer-based linear interpolation
* fix merge conflicts
* reuse code
* cargo fmt
* fix clippy
* fix test
* fix off by one
fix off by one to accurately interpolate autoincrement fields
* extend test, fix estimate
* remove legacy codec
Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>
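
A minimal sketch of integer-based linear interpolation, assuming a fixed-point slope with 32 fractional bits; this is not the codec's actual code, and `FixedPointLine`, `residuals`, and `reconstruct` are hypothetical names. The slope is computed over the number of intervals (`len - 1`), which makes an autoincrement column interpolate exactly; whether this matches the specific off-by-one fixed above is not stated in the log.

```rust
/// Line in fixed-point form: value(i) is approximated by
/// `intercept + ((slope * i) >> 32)`, using integer arithmetic only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FixedPointLine {
    intercept: u64,
    slope: i64, // signed, 32 fractional bits
}

impl FixedPointLine {
    /// Fits a line through the first and last value of the block.
    pub fn fit(values: &[u64]) -> Self {
        let first = values.first().copied().unwrap_or(0);
        let last = values.last().copied().unwrap_or(0);
        // Divide by the number of intervals, not the number of values.
        let num_intervals = values.len().saturating_sub(1) as i128;
        let slope = if num_intervals == 0 {
            0
        } else {
            // Very steep lines may be truncated by the cast; that only makes
            // the prediction worse, the round trip below stays exact.
            (((last as i128 - first as i128) << 32) / num_intervals) as i64
        };
        FixedPointLine { intercept: first, slope }
    }

    /// Predicted value at `idx`.
    pub fn eval(&self, idx: u32) -> u64 {
        let offset = ((self.slope as i128 * idx as i128) >> 32) as i64;
        self.intercept.wrapping_add(offset as u64)
    }
}

/// Returns the fitted line and the wrapping difference of each value
/// from the prediction.
pub fn residuals(values: &[u64]) -> (FixedPointLine, Vec<u64>) {
    let line = FixedPointLine::fit(values);
    let res = values
        .iter()
        .enumerate()
        .map(|(i, &v)| v.wrapping_sub(line.eval(i as u32)))
        .collect();
    (line, res)
}

/// Rebuilds the original values from the line and the residuals.
pub fn reconstruct(line: &FixedPointLine, residuals: &[u64]) -> Vec<u64> {
    residuals
        .iter()
        .enumerate()
        .map(|(i, &r)| line.eval(i as u32).wrapping_add(r))
        .collect()
}
```

Integer arithmetic gives the same prediction on every platform, so encoder and decoder cannot drift apart the way they might with floating-point rounding. A real codec would additionally shift the residuals to be non-negative and bitpack them, which is not shown here.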