Truncate keys to u16::MAX, instead e.g. storing 0 bytes for keys with length u16::MAX + 1
The term hashmap has a hidden API contract to only accept terms with lenght up u16::MAX.
* Do some Clippy- and Cargo-related boy-scouting.
* Add BytesFilterCollector to support filtering based on a bytes fast field
This is basically a copy of the existing FilterCollector but modified and
specialised to work on a bytes fast field.
* Changed semantics of filter collectors to consider multi-valued fields
Applied this command to the code, making it a bit shorter and slightly
more readable.
```
cargo +nightly clippy --all-features --benches --tests --workspace --fix -- -A clippy::all -W clippy::uninlined_format_args
cargo +nightly fmt --all
```
- improve performance of vint
vint serialization shows up in performance profiles during indexing.
It would also make sense to limit the value space to u29 and operate on 4 bytes only.
- remove unused code
- add missing inlines
- fix regex test
* add term hashmap benchmark
* refactor arena hashmap
add inlines
remove occupied array and use table_entry.is_empty instead (saves 4 bytes per entry)
reduce saturation threshold from 1/3 to 1/2 to reduce memory
use u32 for UnorderedId (we have the 4billion limit anyways on the Columnar stuff)
fix naming LinearProbing
remove byteorder dependency
memory consumption went down from 2Gb to 1.8GB on indexing wikipedia dataset in tantivy
* Update stacker/src/arena_hashmap.rs
Co-authored-by: Paul Masurel <paul@quickwit.io>
---------
Co-authored-by: Paul Masurel <paul@quickwit.io>
The new fast field code, based on columnar, had a larger minimum memory
footprint, causing the first docuemnt to trigger a flush of the asegment
in this unit test.
This PR prevents the allocation of a large capacity for the different hashmap tables
using in the columnar writer.
Closes#1859