For secondary indexes, it is often necessary to read all documents, compute
some function on them, and associate the result with a document ID.
Currently, this requires something like
```rust
let reader = segment.get_store_reader(1)?;
for doc_id in segment.doc_ids_alive() {
    let doc = reader.get(doc_id)?;
    // Use doc and doc_id here ...
}
```
which can be simplified to
```rust
let reader = segment.get_store_reader(1)?;
for res in reader.enumerate() {
    let (doc_id, doc) = res?;
    // Use doc and doc_id here ...
}
```
using the method proposed here.
(I added a new method instead of modifying `StoreReader::iter` to make the
change backwards compatible, i.e. possible to include in a point release.)
* compact doc
* add any value type
* pass references when building CompactDoc
* remove OwnedValue from API
* clippy
* clippy
* fail on large documents
* fmt
* cleanup
* cleanup
* implement Value for different types
fix serde_json date Value implementation
* fmt
* cleanup
* fmt
* cleanup
* store positions instead of pos+len
* remove nodes array
* remove mediumvec
* cleanup
* infallible serialize into vec
* remove positions indirection
* remove 24MB limitation in document
use u32 for Addr
Remove the 3-byte addressing limitation and use VInt instead (see the sketch after this list)
* cleanup
* extend test
* cleanup, add comments
* rename, remove pub
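For context on the VInt change above, here is a minimal sketch of a LEB128-style variable-length integer, one common layout for such a scheme (the exact on-disk format is defined by tantivy's serializer, not by this sketch): 7 payload bits per byte and a continuation bit mean small addresses cost one byte, while large documents are no longer capped by a fixed 3-byte offset.
```rust
// Sketch only: LEB128-style VInt. The high bit marks "more bytes follow",
// so small values stay small and there is no fixed 24MB (3-byte) ceiling.
fn write_vint(mut val: u64, out: &mut Vec<u8>) {
    while val >= 0x80 {
        out.push((val as u8 & 0x7f) | 0x80);
        val >>= 7;
    }
    out.push(val as u8);
}

fn read_vint(bytes: &[u8]) -> (u64, usize) {
    let mut val = 0u64;
    for (i, &byte) in bytes.iter().enumerate() {
        val |= u64::from(byte & 0x7f) << (7 * i);
        if byte & 0x80 == 0 {
            return (val, i + 1); // decoded value and bytes consumed
        }
    }
    panic!("truncated VInt");
}
```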
PR https://github.com/quickwit-oss/quickwit/pull/4962 fixes an issue
where the AggregationLimits were not passed correctly. Now that the
AggregationLimits are shared properly, we run into contention issues.
This PR includes some straightforward improvements to reduce contention,
by only updating the shared counter when the tracked memory actually changed
and by avoiding the second read. We probably need sharding across multiple
counters, or local caching that only updates the global counter after some
threshold (sketched below).
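A hedged sketch of the local-caching idea (all names are hypothetical, not the actual quickwit/tantivy types): each worker accumulates deltas locally and only touches the shared atomic once a threshold is crossed, turning many contended updates into few.
```rust
use std::sync::atomic::{AtomicI64, Ordering};

// Flush to the contended global counter only every 64 KiB of change.
const FLUSH_THRESHOLD: i64 = 64 * 1024;

struct LocalMemCounter<'a> {
    global: &'a AtomicI64,
    pending: i64,
}

impl LocalMemCounter<'_> {
    fn record(&mut self, delta: i64) {
        self.pending += delta;
        // Only update (and read) the shared counter when enough changed.
        if self.pending.abs() >= FLUSH_THRESHOLD {
            self.global.fetch_add(self.pending, Ordering::Relaxed);
            self.pending = 0;
        }
    }
}

impl Drop for LocalMemCounter<'_> {
    fn drop(&mut self) {
        // Don't lose the remainder when the worker finishes.
        if self.pending != 0 {
            self.global.fetch_add(self.pending, Ordering::Relaxed);
        }
    }
}
```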
Achieved by moving the boxes out of the temporary reference wrappers, which are
themselves cloneable; if required, the caller can clone them up front or
consume them to reuse the existing allocations.
* fix ReferenceValue API flaw
Remove `Facet` and `TokenizedString` values from the `ReferenceValue` API,
as they require the trait value to have them stored somewhere.
Since `TokenizedString` is quite niche, I just copy it into a `Box`
instead of designing a reference API around it.
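An illustrative shape of the resulting API (types simplified, not the exact tantivy definitions): the niche tokenized variant owns a boxed copy, so implementors of the value trait no longer need backing storage just so the reference enum can borrow from it.
```rust
// Simplified stand-ins for the real types.
struct TokenizedString {
    text: String,
    tokens: Vec<String>,
}

// Reference wrapper over a document value. Most variants borrow, but the
// niche tokenized case carries an owned boxed copy instead of `&'a _`,
// removing the "must be stored somewhere" requirement from the trait.
enum ReferenceValue<'a> {
    Str(&'a str),
    U64(u64),
    TokenizedStr(Box<TokenizedString>),
}
```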
* fix comment link
This changes three things:
- Reuse the positions_per_path hashmap instead of allocating one per
indexed JSON value
- Try to cast u64 values to i64 to align with search behaviour (see the
sketch after this list)
- Allow top-level JSON values to be of any type instead of limiting them
to JSON objects. Remove the special JSON object handling method.
TODO: We should probably also try to convert f64 to i64 and u64 when
indexing, as values may get converted to f64 by the JSON parser.
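A minimal sketch of the u64-to-i64 cast mentioned above (`NumValue` and `coerce_u64` are illustrative names, not the actual tantivy API):
```rust
enum NumValue {
    I64(i64),
    U64(u64),
}

// JSON numbers that parse as u64 but fit into i64 are indexed as i64,
// so the indexed type lines up with how searches interpret the value.
fn coerce_u64(val: u64) -> NumValue {
    match i64::try_from(val) {
        Ok(v) => NumValue::I64(v),
        Err(_) => NumValue::U64(val),
    }
}
```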
* Fix trait bound of StoreReader::iter
Similar to `StoreReader::get`, `StoreReader::iter` should only require
`DocumentDeserialize` and not `Document`.
* Mark the iterator returned by SegmentReader::doc_ids_alive as Send so it can be used in impls of Stream/AsyncIterator.
* add method to fetch block of first vals in columnar
add method to fetch block of first vals in columnar (this is way faster
than single calls for full columns)
add benchmark
fix import warnings
```
test bench_get_block_first_on_full_column                  ... bench:  56 ns/iter (+/- 26)
test bench_get_block_first_on_full_column_single_calls     ... bench: 311 ns/iter (+/- 6)
test bench_get_block_first_on_multi_column                 ... bench: 378 ns/iter (+/- 15)
test bench_get_block_first_on_multi_column_single_calls    ... bench: 546 ns/iter (+/- 13)
test bench_get_block_first_on_optional_column              ... bench: 291 ns/iter (+/- 6)
test bench_get_block_first_on_optional_column_single_calls ... bench: 362 ns/iter (+/- 8)
```
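Roughly, these are the two access patterns being compared, assuming a `Column<u64>` and the block-fetch method added here (shown as `first_vals`; check the columnar crate for the exact name and signature):
```rust
use tantivy_columnar::Column; // crate/type names as assumed here

// One `first` call per doc vs. one block call filling a slice in a single
// pass over the column (the faster rows in the benchmark above).
fn fetch_firsts(column: &Column<u64>, docs: &[u32]) -> Vec<Option<u64>> {
    let singles: Vec<Option<u64>> =
        docs.iter().map(|&doc| column.first(doc)).collect();

    let mut block = vec![None; docs.len()];
    column.first_vals(docs, &mut block);
    debug_assert_eq!(singles, block);
    block
}
```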
* use remainder
* handle IP addresses in term aggregation
Stores IP addresses during the segment term aggregation via their u64 representation
and converts them to u128 (Ipv6Addr) via downcast when building intermediate results.
Enable downcasting on `ColumnValues`.
Expose a u64 variant for u128-encoded data via the `open_u64_lenient` method.
Remove the lifetime in VecColumn to avoid the 'static lifetime requirement coming
from the downcast trait.
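The u128 (IPv6) form referred to above can be sketched with std alone: IPv4 addresses widen to their IPv6-mapped form, so every address round-trips through a single u128 column value.
```rust
use std::net::{IpAddr, Ipv6Addr};

// Widen any IP to an Ipv6Addr, then to u128; `Ipv6Addr::from` reverses it.
fn ip_to_u128(ip: IpAddr) -> u128 {
    let v6 = match ip {
        IpAddr::V4(v4) => v4.to_ipv6_mapped(),
        IpAddr::V6(v6) => v6,
    };
    u128::from(v6)
}

fn u128_to_ip(val: u128) -> Ipv6Addr {
    Ipv6Addr::from(val)
}
```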
* rename method
Spotted in the `range_date_histogram` query of the quickwit benchmark:
5% of the time went into copying docs around, which is not needed in the full-index case.
remove Column to ColumnIndex deref
* Fix serde for TopNComputer
The top hits aggregation changed the TopNComputer to be serializable,
but the capacity needs to be carried over, as it is checked
when pushing elements (capacity == 0 is not allowed).
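An illustrative reduction of the bug (not the actual TopNComputer code): if `capacity` did not survive the serde round-trip it would come back as 0, which the push path rejects.
```rust
use serde::{Deserialize, Serialize};

// Derive over *all* fields so `capacity` is carried across serialization;
// a defaulted 0 would trip the assertion on the next push.
#[derive(Serialize, Deserialize)]
struct TopN<T> {
    buffer: Vec<T>,
    capacity: usize,
}

impl<T: Ord> TopN<T> {
    fn push(&mut self, item: T) {
        assert!(self.capacity > 0, "capacity == 0 is not allowed");
        self.buffer.push(item);
        if self.buffer.len() >= 2 * self.capacity {
            // Keep only the top `capacity` elements.
            self.buffer.sort_unstable_by(|a, b| b.cmp(a));
            self.buffer.truncate(self.capacity);
        }
    }
}
```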
* use serde from deser
* remove pub, clippy
* feat(aggregators/metric): Implement a top_hits aggregator
* fix: Expose get_fields
* fix: Serializer for top_hits request
Also removes the extraneous third-party
serialization helper.
* chore: Avert panic on parsing invalid top_hits query
* refactor: Allow multiple field names from aggregations
* perf: Replace binary heap with TopNComputer
* fix: Avoid comparator inversion by ComparableDoc
* fix: Rank missing field values lower than present values
* refactor: Make KeyOrder a struct
* feat: Rough attempt at docvalue_fields
* feat: Complete stab at docvalue_fields
- Rename "SearchResult*" => "Retrieval*"
- Revert Vec => HashMap for aggregation accessors.
- Split accessors for core aggregation and field retrieval.
- Resolve globbed field names in docvalue_fields retrieval.
- Handle strings/bytes and other column types with DynamicColumn
* test(unit): Add tests for top_hits aggregator
* fix: docfield_value field globbing
* test(unit): Include dynamic fields
* fix: Value -> OwnedValue
* fix: Use OwnedValue's native Null variant
* chore: Improve readability of test asserts
* chore: Remove DocAddress from top_hits result
* docs: Update aggregator doc
* revert: accidental doc test
* chore: enable time macros only for tests
* chore: Apply suggestions from review
* chore: Apply suggestions from review
* fix: Retrieve all values for fields
* test(unit): Update for multi-value retrieval
* chore: Assert term existence
* feat: Include all columns for a column name
Since a (name, type) pair constitutes a unique column.
* fix: Resolve json fields
Introduces a translation step to bridge the difference between
ColumnarReader's `\0`-separated JSON field keys and the common
`.`-separated form used by SegmentReader (see the sketch after this list).
This should probably be the default behavior of ColumnarReader's public API, though.
* chore: Address review on mutability
* chore: s/segment_id/segment_ordinal instances of SegmentOrdinal
* chore: Revert erroneous grammar change
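The translation step from the `fix: Resolve json fields` entry above can be sketched as a one-liner (function name is illustrative):
```rust
// Columnar keys separate JSON path segments with `\0`; callers expect `.`.
fn normalize_json_field_key(columnar_key: &str) -> String {
    columnar_key.replace('\0', ".")
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn normalizes_separator() {
        assert_eq!(normalize_json_field_key("attrs\0color"), "attrs.color");
    }
}
```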
Root cause: the positions buffer contained residual positions from the
previous term when terms alternated between having and not having
positions in JSON (text terms have positions, numerics do not).
Fixes #2283
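The general shape of this class of bug and fix, as a sketch (names illustrative): any buffer reused across terms must be cleared unconditionally, not only on the code path that writes new positions.
```rust
struct Recorder {
    positions_buffer: Vec<u32>,
}

impl Recorder {
    fn record_term(&mut self, positions: Option<&[u32]>) {
        // The fix: reset state from the previous term unconditionally;
        // otherwise a term without positions silently inherits the
        // previous term's positions.
        self.positions_buffer.clear();
        if let Some(pos) = positions {
            self.positions_buffer.extend_from_slice(pos);
        }
    }
}
```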
* read path for new fst based index
* implement BlockAddrStoreWriter
* extract slope/deviation computation
* use better linear approximator and allow negative correction to approximator (see the sketch after this list)
* document format and reorder some fields
* optimize single block sstable size
* plug backward compat
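A hedged sketch of the approximator idea behind these commits (not the actual sstable format): block addresses grow roughly linearly, so storing a slope plus small per-entry corrections beats storing full offsets, and allowing negative corrections lets the line cut through the middle of the points, keeping the corrections' bit width down.
```rust
// Fit addrs[i] ≈ base + slope * i and keep signed corrections.
// Assumes a non-empty, monotonically increasing address list.
fn approximate(addrs: &[u64]) -> (u64, f64, Vec<i64>) {
    let base = addrs[0];
    let steps = (addrs.len() - 1).max(1) as f64;
    // Line through the first and last address.
    let slope = (addrs[addrs.len() - 1] - base) as f64 / steps;
    let corrections = addrs
        .iter()
        .enumerate()
        .map(|(i, &addr)| addr as i64 - (base as f64 + slope * i as f64) as i64)
        .collect();
    (base, slope, corrections)
}
```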
* add fields_metadata to SegmentReader, add columnar docs
* use schema to resolve field, add test
* normalize paths
* merge for FieldsMetadata, add fields_metadata on Index
* Update src/core/segment_reader.rs
Co-authored-by: Paul Masurel <paul@quickwit.io>
* merge code paths
* add Hash
* move function outside
---------
Co-authored-by: Paul Masurel <paul@quickwit.io>