* Fix trait bound of StoreReader::iter
Similar to `StoreReader::get`, `StoreReader::iter` should only require
`DocumentDeserialize` and not `Document`.
* Mark the iterator returned by SegmentReader::doc_ids_alive as Send so it can be used in impls of Stream/AsyncIterator.
* add method to fetch block of first vals in columnar
add method to fetch block of first vals in columnar (this is way faster
than single calls for full columns)
add benchmark
fix import warnings
```
test bench_get_block_first_on_full_column ... bench: 56 ns/iter (+/- 26)
test bench_get_block_first_on_full_column_single_calls ... bench: 311 ns/iter (+/- 6)
test bench_get_block_first_on_multi_column ... bench: 378 ns/iter (+/- 15)
test bench_get_block_first_on_multi_column_single_calls ... bench: 546 ns/iter (+/- 13)
test bench_get_block_first_on_optional_column ... bench: 291 ns/iter (+/- 6)
test bench_get_block_first_on_optional_column_single_calls ... bench: 362 ns/iter (+/- 8)
```
* use remainder
* handle ip adresses in term aggregation
Stores IpAdresses during the segment term aggregation via u64 representation
and convert to u128(IpV6Adress) via downcast when converting to intermediate results.
Enable Downcasting on `ColumnValues`
Expose u64 variant for u128 encoded data via `open_u64_lenient` method.
Remove lifetime in VecColumn, to avoid 'static lifetime requirement coming
from downcast trait.
* rename method
Spotted in `range_date_histogram` query in quickwit benchmark:
5% of time copying docs around, which is not needed in the full index case
remove Column to ColumnIndex deref
* Fix serde for TopNComputer
The top hits aggregation changed the TopNComputer to be serializable,
but capacity needs to be carried over, as it contains logic which is
checked against when pushing elements (capacity == 0 is not allowed).
* use serde from deser
* remove pub, clippy
* feat(aggregators/metric): Implement a top_hits aggregator
* fix: Expose get_fields
* fix: Serializer for top_hits request
Also removes extraneous the extraneous third-party
serialization helper.
* chore: Avert panick on parsing invalid top_hits query
* refactor: Allow multiple field names from aggregations
* perf: Replace binary heap with TopNComputer
* fix: Avoid comparator inversion by ComparableDoc
* fix: Rank missing field values lower than present values
* refactor: Make KeyOrder a struct
* feat: Rough attempt at docvalue_fields
* feat: Complete stab at docvalue_fields
- Rename "SearchResult*" => "Retrieval*"
- Revert Vec => HashMap for aggregation accessors.
- Split accessors for core aggregation and field retrieval.
- Resolve globbed field names in docvalue_fields retrieval.
- Handle strings/bytes and other column types with DynamicColumn
* test(unit): Add tests for top_hits aggregator
* fix: docfield_value field globbing
* test(unit): Include dynamic fields
* fix: Value -> OwnedValue
* fix: Use OwnedValue's native Null variant
* chore: Improve readability of test asserts
* chore: Remove DocAddress from top_hits result
* docs: Update aggregator doc
* revert: accidental doc test
* chore: enable time macros only for tests
* chore: Apply suggestions from review
* chore: Apply suggestions from review
* fix: Retrieve all values for fields
* test(unit): Update for multi-value retrieval
* chore: Assert term existence
* feat: Include all columns for a column name
Since a (name, type) constitutes a unique column.
* fix: Resolve json fields
Introduces a translation step to bridge the difference between
ColumnarReaders null `\0` separated json field keys to the common
`.` separated used by SegmentReader. Although, this should probably
be the default behavior for ColumnarReader's public API perhaps.
* chore: Address review on mutability
* chore: s/segment_id/segment_ordinal instances of SegmentOrdinal
* chore: Revert erroneous grammar change
Truncate keys to u16::MAX, instead e.g. storing 0 bytes for keys with length u16::MAX + 1
The term hashmap has a hidden API contract to only accept terms with lenght up u16::MAX.
* Fix bug that can cause `get_docids_for_value_range` to panic.
When `selected_docid_range.end == num_rows`, we would get a panic
as we try to access a non-existing blockmeta.
This PR accepts calls to rank with any value.
For any value above num_rows we simply return non_null_rows.
Fixes#2293
* add tests, merge variables
---------
Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>
Root cause was the positions buffer had residue positions from the
previous term, when the terms were alternating between having and not
having positions in JSON (terms have positions, but not numerics).
Fixes#2283
* read path for new fst based index
* implement BlockAddrStoreWriter
* extract slop/derivation computation
* use better linear approximator and allow negative correction to approximator
* document format and reorder some fields
* optimize single block sstable size
* plug backward compat
* add fields_metadata to SegmentReader, add columnar docs
* use schema to resolve field, add test
* normalize paths
* merge for FieldsMetadata, add fields_metadata on Index
* Update src/core/segment_reader.rs
Co-authored-by: Paul Masurel <paul@quickwit.io>
* merge code paths
* add Hash
* move function oustide
---------
Co-authored-by: Paul Masurel <paul@quickwit.io>