If the field names are statically known, `Cow::Borrowed(&'static str)` can
handle them without allocations. The general case is still handled by
`Cow::Owned(String)`.
Truncate keys to u16::MAX, instead e.g. storing 0 bytes for keys with length u16::MAX + 1
The term hashmap has a hidden API contract to only accept terms with lenght up u16::MAX.
* Fix bug that can cause `get_docids_for_value_range` to panic.
When `selected_docid_range.end == num_rows`, we would get a panic
as we try to access a non-existing blockmeta.
This PR accepts calls to rank with any value.
For any value above num_rows we simply return non_null_rows.
Fixes#2293
* add tests, merge variables
---------
Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>
Root cause was the positions buffer had residue positions from the
previous term, when the terms were alternating between having and not
having positions in JSON (terms have positions, but not numerics).
Fixes#2283
* read path for new fst based index
* implement BlockAddrStoreWriter
* extract slop/derivation computation
* use better linear approximator and allow negative correction to approximator
* document format and reorder some fields
* optimize single block sstable size
* plug backward compat
* add fields_metadata to SegmentReader, add columnar docs
* use schema to resolve field, add test
* normalize paths
* merge for FieldsMetadata, add fields_metadata on Index
* Update src/core/segment_reader.rs
Co-authored-by: Paul Masurel <paul@quickwit.io>
* merge code paths
* add Hash
* move function oustide
---------
Co-authored-by: Paul Masurel <paul@quickwit.io>
* reduce number of allocations
Explanation makes up around 50% of all allocations (numbers not perf).
It's created during serialization but not called.
- Make Explanation optional in BM25
- Avoid allocations when using Explanation
* use Cow
In JSON Object field the presence of term frequencies depend on the
field.
Typically, a string with postiions indexed will have positions
while numbers won't.
The presence or absence of term freqs for a given term is unfortunately
encoded in a very passive way.
It is given by the presence of extra information in the skip info, or
the lack of term freqs after decoding vint blocks.
Before, after writing a segment, we would encode the segment correctly
(without any term freq for number in json object field).
However during merge, we would get the default term freq=1 value.
(this is default in the absence of encoded term freqs)
The merger would then proceed and attempt to decode 1 position when
there are in fact none.
This PR requires to explictly tell the posting serialize whether
term frequencies should be serialized for each new term.
Closes#2251
* docid deltas while indexing
storing deltas is especially helpful for repetitive data like logs.
In those cases, recording a doc on a term costed 4 bytes instead of 1
byte now.
HDFS Indexing 1.1GB Total memory consumption:
Before: 760 MB
Now: 590 MB
* use scan for delta decoding
* add support for delta-1 encoding posting list
* encode term frequency minus one
* don't emit tf for json integer terms
* make skipreader not pub(crate) mutable
* run coverage only after merge
coverage is a quite slow step in CI. It can be run only after merging
* Apply suggestions from code review
Co-authored-by: Paul Masurel <paul@quickwit.io>
---------
Co-authored-by: Paul Masurel <paul@quickwit.io>
* remove Document: DocumentDeserialize dependency
The dependency requires users to implement an API they may not use.
* remove unnecessary Document bounds