Applied this command to the code, making it a bit shorter and slightly
more readable.
```
cargo +nightly clippy --all-features --benches --tests --workspace --fix -- -A clippy::all -W clippy::uninlined_format_args
cargo +nightly fmt --all
```
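For illustration, the kind of rewrite `clippy::uninlined_format_args` suggests looks like the sketch below. The function and variable names are made up for this example and are not taken from the code base.

```rust
// Hypothetical helper showing the same message formatted both ways.
#[allow(clippy::uninlined_format_args)]
fn describe(name: &str, count: usize) -> (String, String) {
    // Before: positional arguments, which the lint flags.
    let before = format!("{} has {} docs", name, count);
    // After: identifiers captured directly inside the format string.
    let after = format!("{name} has {count} docs");
    (before, after)
}
```

Both forms produce the same string; only the call site gets shorter.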
* clear memory consumption in AggregationLimits
clear memory consumption in AggregationLimits at the end of segment collection
* switch to ResourceLimitGuard
* unduplicate code
* merge methods
* Apply suggestions from code review
Co-authored-by: Paul Masurel <paul@quickwit.io>
---------
Co-authored-by: Paul Masurel <paul@quickwit.io>
* allow slop in both directions
allow slop in both directions
so "big wolf"~3 can also match "wolf big"
This also fixes #1934, where the docsets were reordered by size and no longer
matched the terms.
* remove count
* add test for repeating tokens, unduplicate tests
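As a rough illustration of slop in both directions, the sketch below checks a two-term phrase against term positions. It is a simplified model, not the actual matching code: the function names and the distance definition (offset from where an exact phrase match would put the second term) are assumptions made for this example.

```rust
/// Forward-only variant: the second term may only drift to the right
/// of the position an exact phrase match would require.
fn matches_forward_only(first: &[u32], second: &[u32], slop: u32) -> bool {
    first.iter().any(|&f| {
        let expected = f + 1;
        second.iter().any(|&s| s >= expected && s - expected <= slop)
    })
}

/// Bidirectional variant: the second term may drift left or right.
fn matches_both_directions(first: &[u32], second: &[u32], slop: u32) -> bool {
    first.iter().any(|&f| {
        let expected = f + 1;
        second.iter().any(|&s| s.abs_diff(expected) <= slop)
    })
}
```

With the document "wolf big" (`wolf` at position 0, `big` at position 1), the query "big wolf"~3 fails the forward-only check but passes the bidirectional one in this model.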
* chore!: drop JSON support on intermediate agg result
add support for other formats by removing skip_serialize and untagged
JSON support is broken anyway due to its lack of handling for f64::INF etc.
* Update src/aggregation/intermediate_agg_result.rs
Co-authored-by: Paul Masurel <paul@quickwit.io>
* move from impl
---------
Co-authored-by: Paul Masurel <paul@quickwit.io>
- improve performance of vint
vint serialization shows up in performance profiles during indexing.
It would also make sense to limit the value space to u29 and operate on 4 bytes only.
- remove unused code
- add missing inlines
- fix regex test
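For readers unfamiliar with variable-length integers, here is a minimal sketch of a 7-bits-per-byte encoding. It uses the common LEB128-style layout (high bit set means more bytes follow), which is an assumption for illustration and not necessarily the byte layout used in this code base; `write_vint` and `read_vint` are hypothetical names.

```rust
/// Appends `value` to `out`, 7 payload bits per byte.
/// A set high bit means that more bytes follow.
#[inline]
fn write_vint(mut value: u32, out: &mut Vec<u8>) {
    while value >= 0x80 {
        out.push(((value as u8) & 0x7F) | 0x80);
        value >>= 7;
    }
    out.push(value as u8);
}

/// Reads one value from the front of `data`.
/// Returns the value and the number of bytes consumed, or `None`
/// if no terminating byte is found within 5 bytes.
#[inline]
fn read_vint(data: &[u8]) -> Option<(u32, usize)> {
    let mut result = 0u32;
    for (i, &byte) in data.iter().enumerate().take(5) {
        result |= u32::from(byte & 0x7F) << (7 * i);
        if byte & 0x80 == 0 {
            return Some((result, i + 1));
        }
    }
    None
}
```

Small values take one byte, which is why this kind of encoding is hot during indexing: it runs once per posting or offset.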
* Drop additional Arc-layer as the automaton itself is now cheap-to-clone.
* Drop state ID type parameter as it is not exposed by the library any more.
* add term hashmap benchmark
* refactor arena hashmap
add inlines
remove occupied array and use table_entry.is_empty instead (saves 4 bytes per entry)
raise saturation threshold from 1/3 to 1/2 to reduce memory
use u32 for UnorderedId (we have the 4 billion limit anyway on the Columnar stuff)
fix naming LinearProbing
remove byteorder dependency
memory consumption went down from 2GB to 1.8GB when indexing the Wikipedia dataset in tantivy
* Update stacker/src/arena_hashmap.rs
Co-authored-by: Paul Masurel <paul@quickwit.io>
---------
Co-authored-by: Paul Masurel <paul@quickwit.io>
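The idea behind dropping the occupied array can be sketched as follows: each entry carries a sentinel that marks it as empty, and the table grows once it is half full. This is a simplified stand-in, not the arena hashmap itself; `ProbingTable`, the `u32::MAX` sentinel, and the hash function are assumptions made for this example.

```rust
/// Sentinel marking an unused slot (assumption for this sketch).
const EMPTY: u32 = u32::MAX;

#[derive(Clone, Copy)]
struct Entry {
    key: u32,
    value: u32,
}

impl Entry {
    #[inline]
    fn is_empty(&self) -> bool {
        self.value == EMPTY
    }
}

struct ProbingTable {
    entries: Vec<Entry>, // length is always a power of two
    len: usize,
}

impl ProbingTable {
    fn new() -> Self {
        ProbingTable { entries: vec![Entry { key: 0, value: EMPTY }; 8], len: 0 }
    }

    /// Linear probing: returns the slot holding `key`, or the first empty slot.
    fn slot(&self, key: u32) -> usize {
        let mask = self.entries.len() - 1;
        let mut idx = (key.wrapping_mul(0x9E37_79B1) as usize) & mask;
        while !self.entries[idx].is_empty() && self.entries[idx].key != key {
            idx = (idx + 1) & mask;
        }
        idx
    }

    /// Saturated once at least half of the slots are in use.
    fn is_saturated(&self) -> bool {
        self.len * 2 >= self.entries.len()
    }

    fn insert(&mut self, key: u32, value: u32) {
        assert_ne!(value, EMPTY, "EMPTY is reserved as the sentinel");
        if self.is_saturated() {
            self.grow();
        }
        let idx = self.slot(key);
        if self.entries[idx].is_empty() {
            self.len += 1;
        }
        self.entries[idx] = Entry { key, value };
    }

    fn get(&self, key: u32) -> Option<u32> {
        let entry = self.entries[self.slot(key)];
        if entry.is_empty() { None } else { Some(entry.value) }
    }

    /// Doubles the table and re-inserts every occupied entry.
    fn grow(&mut self) {
        let new_cap = self.entries.len() * 2;
        let fresh = vec![Entry { key: 0, value: EMPTY }; new_cap];
        let old = std::mem::replace(&mut self.entries, fresh);
        for entry in old.into_iter().filter(|e| !e.is_empty()) {
            let idx = self.slot(entry.key);
            self.entries[idx] = entry;
        }
    }
}
```

A higher saturation threshold means fewer unused slots per stored key, at the cost of somewhat longer probe sequences.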
* compress sstable with zstd
* add some details to sstable readme
* compress only blocks which benefit from it
* multiple changes to sstable
make compression optional
use OwnedBytes instead of impl Read in sstable, required for next point
use zstd bulk api, which is much faster on small records
* cleanup and use bulk api for compression
* use dedicated byte for compression
* switch block len and compression flag
* change default zstd level in sstable
* re-export a few sstable functions on dictionary
* Update documentation
Co-authored-by: François Massot <francois.massot@gmail.com>
---------
Co-authored-by: François Massot <francois.massot@gmail.com>
* Update benchmarks section in README.md to link to the bench repo
* Apply suggestions from code review
---------
Co-authored-by: Paul Masurel <paul@quickwit.io>
* Added proptest on columnar merge with a shuffle
Made column serialization more explicit.
Bugfix when a bytes column is missing, and with a shuffle.
Improved the cardinality detection logic / column detection.
* Code review
* CR comments
* Following CR
* tokenizer option on text fastfield
allow setting the tokenizer option on text fastfield (fixes #1901)
handle PreTokenized strings in fast field
* change visibility
* remove custom de/serialization
* Better mixed types support in aggs and fix serialization issue
- Improve support for mixed types in JSON field aggregations (pick the right field, #1913)
- Resolve the issue with JSON serialization for numeric keys (fixes #1967)
- Add JSON round-trip test for term buckets
- Remove `u64_lenient`, as this is a footgun without the type
- move aggregation benchmarks
* remove shadowing
* Faster range queries
This PR makes several changes:
- ip compact space now uses u32
- the bitunpacker now gets a get_batch function
- we push down range filtering, removing GCD / shift in the bitpacking
codec.
- we rely on an AVX2 routine to do the filtering.
* Apply suggestions from code review
* Apply suggestions from code review
* CR comments
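To illustrate what pushing down range filtering means, here is a scalar sketch. It assumes a codec where each stored value decodes as `min_value + raw * gcd`; the `Codec` struct and function names are hypothetical, and the real code filters batches with an AVX2 routine rather than a plain loop.

```rust
/// Hypothetical codec parameters: value = min_value + raw * gcd.
struct Codec {
    min_value: u64,
    gcd: u64,
}

/// Baseline: decode every raw value, then compare against the range.
fn filter_decode_then_compare(codec: &Codec, raws: &[u64], lo: u64, hi: u64) -> Vec<u32> {
    let mut hits = Vec::new();
    for (idx, &raw) in raws.iter().enumerate() {
        let value = codec.min_value + raw * codec.gcd;
        if lo <= value && value <= hi {
            hits.push(idx as u32);
        }
    }
    hits
}

/// Translates the inclusive value range into an inclusive raw range.
/// Returns `None` when no raw value can match.
fn raw_range(codec: &Codec, lo: u64, hi: u64) -> Option<(u64, u64)> {
    if lo > hi || hi < codec.min_value {
        return None;
    }
    let raw_lo = if lo <= codec.min_value {
        0
    } else {
        let delta = lo - codec.min_value;
        // Round up so that the decoded value is >= lo.
        delta / codec.gcd + u64::from(delta % codec.gcd != 0)
    };
    // Round down so that the decoded value is <= hi.
    let raw_hi = (hi - codec.min_value) / codec.gcd;
    if raw_lo > raw_hi { None } else { Some((raw_lo, raw_hi)) }
}

/// Pushed down: compare raw values directly, with no multiply or add per value.
fn filter_pushed_down(codec: &Codec, raws: &[u64], lo: u64, hi: u64) -> Vec<u32> {
    match raw_range(codec, lo, hi) {
        None => Vec::new(),
        Some((raw_lo, raw_hi)) => raws
            .iter()
            .enumerate()
            .filter(|&(_, &raw)| raw_lo <= raw && raw <= raw_hi)
            .map(|(idx, _)| idx as u32)
            .collect(),
    }
}
```

The range is converted once per query, so the per-value work shrinks to two comparisons on the unpacked integers.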