* Drop additional Arc-layer as the automaton itself is now cheap-to-clone.
* Drop state ID type parameter as it is not exposed by the library any more.
* add term hashmap benchmark
* refactor arena hashmap
add inlines
remove occupied array and use table_entry.is_empty instead (saves 4 bytes per entry)
reduce saturation threshold from 1/3 to 1/2 to reduce memory
use u32 for UnorderedId (we have the 4billion limit anyways on the Columnar stuff)
fix naming LinearProbing
remove byteorder dependency
memory consumption went down from 2Gb to 1.8GB on indexing wikipedia dataset in tantivy
* Update stacker/src/arena_hashmap.rs
Co-authored-by: Paul Masurel <paul@quickwit.io>
---------
Co-authored-by: Paul Masurel <paul@quickwit.io>
* compress sstable with zstd
* add some details to sstable readme
* compress only block which benefit from it
* multiple changes to sstable
make compression optional
use OwnedBytes instead of impl Read in sstable, required for next point
use zstd bulk api, which is much faster on small records
* cleanup and use bulk api for compression
* use dedicated byte for compression
* switch block len and compression flag
* change default zstd level in sstable
* re-export a few sstable functions on dicitonary
* Update documentation
Co-authored-by: François Massot <francois.massot@gmail.com>
---------
Co-authored-by: François Massot <francois.massot@gmail.com>
* Update benchmarks section in READEME.md to link to the bench repo
* Apply suggestions from code review
---------
Co-authored-by: Paul Masurel <paul@quickwit.io>
* Added proptest on columnar merge with a shuffle
Made column serialization more explicit.
Bugfix when a bytes column is missing, and with a shuffle.
Improved the cardinality detection logic / column detection.
* Code review
* CR comments
* Following CR
* tokenizer option on text fastfield
allow to set tokenizer option on text fastfield (fixes#1901)
handle PreTokenized strings in fast field
* change visibility
* remove custom de/serialization
* Better mixed types support in aggs and fix serialization issue
- Improve support for mixed types in JSON field aggregations (pick the right field, #1913)
- Resolve the issue with JSON serialization for numeric keys (fixes#1967)
- Add JSON round-trip test for term buckets
- Remove `u64_lenient`, as this is a footgun without the type
- move aggregation benchmarks
* remove shadowing
* Faster range queries
This PR does several changes
- ip compact space now uses u32
- the bitunpacker now gets a get_batch function
- we push down range filtering, removing GCD / shift in the bitpacking
codec.
- we rely on AVX2 routine to do the filtering.
* Apply suggestions from code review
* Apply suggestions from code review
* CR comments
* document a new sstable format
* add support for changing target block size
* use new format for sstable index
* handle sstable version errror
* use very small blocks for proptests
* add a footer structure
* add memory limit for aggregations
introduce AggregationLimits to set memory consumption limit and bucket limits
memory limit is checked during aggregation, bucket limit is checked before returning the aggregation request.
* Apply suggestions from code review
Co-authored-by: Paul Masurel <paul@quickwit.io>
* add ByteCount with human readable format
---------
Co-authored-by: Paul Masurel <paul@quickwit.io>
* handle missing column for aggs
add empty column fallback for missing column in aggs.
Fix sort for term agg on sub-agg with missing value (null is smallest)
* add error when field is not fast