Precompute stats for the merge of multivalue indices + disable Line encoding for
multivalue indices. That combination allows skipping the first estimation
pass, which gives up to a 2x improvement in merge performance on multivalue indices.
This change may decrease compression, as the Line codec compresses
very well for documents that have a fixed number of values in each doc.
The Line codec should be replaced.
```
merge_multi_and_multi Avg: 22.7880ms (-47.15%) Median: 22.5469ms (-47.38%) [22.3691ms .. 25.8392ms]
merge_dense_and_dense Avg: 14.4398ms (+2.18%) Median: 14.2465ms (+0.74%) [14.1620ms .. 16.1270ms]
merge_sparse_and_sparse Avg: 10.6559ms (+1.10%) Median: 10.6318ms (+0.91%) [10.5527ms .. 11.2848ms]
merge_sparse_and_dense Avg: 12.4886ms (+1.52%) Median: 12.4044ms (+0.84%) [12.3261ms .. 13.9439ms]
merge_multi_and_dense Avg: 25.6686ms (-45.56%) Median: 25.4851ms (-45.84%) [25.1618ms .. 27.6226ms]
merge_multi_and_sparse Avg: 24.3278ms (-47.00%) Median: 24.1917ms (-47.34%) [23.7159ms .. 27.0513ms]
```
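A minimal sketch of the one-pass idea (names such as `ColumnStats` are hypothetical, not tantivy's actual merge API): when the header stats are merged from precomputed per-segment stats, the serializer can write the header and the values in a single pass, with no estimation pass over the values.
```rust
struct ColumnStats {
    min: u64,
    max: u64,
    num_vals: u64,
}

/// Merge precomputed per-segment stats; no pass over the values is needed.
fn merge_stats(segments: &[ColumnStats]) -> ColumnStats {
    ColumnStats {
        min: segments.iter().map(|s| s.min).min().unwrap_or(0),
        max: segments.iter().map(|s| s.max).max().unwrap_or(0),
        num_vals: segments.iter().map(|s| s.num_vals).sum(),
    }
}

fn serialize_merged(
    segments: &[ColumnStats],
    values: impl Iterator<Item = u64>,
    out: &mut Vec<u8>,
) {
    let stats = merge_stats(segments);
    // Single pass: the header can be written up front because the stats
    // are already known, so the values are only iterated once.
    out.extend_from_slice(&stats.num_vals.to_le_bytes());
    out.extend_from_slice(&stats.min.to_le_bytes());
    out.extend_from_slice(&stats.max.to_le_bytes());
    for v in values {
        out.extend_from_slice(&v.to_le_bytes());
    }
}
```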
* first version of extended stats along with its tests
* using IntermediateExtendStats instead of IntermediateStats with all tests passing
* Created struct for request and response
* first test with extended_stats
* Kahan summation and tests with approximate equality (see the sketch after this list)
* version ready for merge
* removed approx dependency
* refactor for using ExtendedStats only when needed
* interim version
* refined version with code formatted
* refactored a struct
* cosmetic refactor
* fix after merge
* fix format
* added extended_stat bench
* merge and new benchmark for extended stats
* split stat segment collectors
* wrapped intermediate extended stat with a box to limit memory usage
* Revert "wrapped intermediate extended stat with a box to limit memory usage"
This reverts commit 5b4aa9f393.
* some code reformat, commented Kahan summation
* refactor after review
* refactor after code review
* fix after incorrectly restoring kahan summation
* modifications for code review + bug fix in merge_fruit
* refactor assert_nearly_equals macro
* update after code review
---------
Co-authored-by: Giovanni Cuccu <gcuccu@imolainformatica.it>
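The extended-stats work above relies on Kahan (compensated) summation to keep floating-point error bounded while accumulating many values. A standalone sketch of the algorithm (not the crate's actual implementation):
```rust
/// Kahan summation: a compensation term recovers the low-order bits that
/// plain `f64` addition drops when adding small values to a large sum.
fn kahan_sum(values: impl IntoIterator<Item = f64>) -> f64 {
    let mut sum = 0.0;
    let mut c = 0.0; // running compensation for lost low-order bits
    for v in values {
        let y = v - c; // apply the correction from the previous step
        let t = sum + y; // low-order bits of `y` may be lost here...
        c = (t - sum) - y; // ...and are recovered into `c`
        sum = t;
    }
    sum
}
```
Because the accumulated error stays bounded, tests can assert approximate equality with a tight tolerance, which is what the approximate-equality tests above check.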
* compact doc
* add any value type
* pass references when building CompactDoc
* remove OwnedValue from API
* clippy
* clippy
* fail on large documents
* fmt
* cleanup
* cleanup
* implement Value for different types
fix serde_json date Value implementation
* fmt
* cleanup
* fmt
* cleanup
* store positions instead of pos+len
* remove nodes array
* remove mediumvec
* cleanup
* infallible serialize into vec
* remove positions indirection
* remove 24MB limitation in document
use u32 for Addr
Remove the 3 byte addressing limitation and use VInt instead
* cleanup
* extend test
* cleanup, add comments
* rename, remove pub
PR https://github.com/quickwit-oss/quickwit/pull/4962 fixes an issue
where the AggregationLimits are not passed correctly. Now that the
AggregationLimits are shared properly, we run into contention issues.
This PR includes some straightforward improvements to reduce contention:
only touching the shared counter when the memory consumption changed, and avoiding the second read.
We will probably need sharding across multiple counters, or local caching that only updates the global counter after some threshold.
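A hypothetical sketch of both micro-optimizations (the type and names are illustrative, not quickwit's actual code): skip the shared atomic entirely when nothing changed, and use the value returned by `fetch_add` instead of performing a second read.
```rust
use std::sync::atomic::{AtomicI64, Ordering};

struct MemoryCounter {
    global: AtomicI64,
    limit: i64,
}

impl MemoryCounter {
    fn add_memory(&self, delta: i64) -> Result<(), &'static str> {
        if delta == 0 {
            // Avoid touching the contended cache line if nothing changed.
            return Ok(());
        }
        // `fetch_add` returns the previous value, so no second read is needed.
        let prev = self.global.fetch_add(delta, Ordering::AcqRel);
        if prev + delta > self.limit {
            return Err("aggregation memory limit exceeded");
        }
        Ok(())
    }
}
```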
Achieved by moving the boxes out of the temporary reference wrappers, which are
themselves cloneable; i.e., if required, the caller can already clone them or
consume them to reuse the existing allocations.
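A simplified sketch of the pattern with placeholder types (not the crate's actual `ReferenceValue` definitions): the wrapper owns its `Box` instead of borrowing it, so a caller can either clone the wrapper or consume it and keep the allocation.
```rust
#[derive(Clone)]
enum ValueRef<'a> {
    Str(&'a str),
    // Owning the Box here (rather than holding `&'a Box<...>`) lets the
    // caller take it out without copying the underlying buffer.
    Bytes(Box<[u8]>),
}

impl ValueRef<'_> {
    /// Consume the wrapper, reusing the existing allocation when possible.
    fn into_bytes(self) -> Box<[u8]> {
        match self {
            ValueRef::Str(s) => s.as_bytes().to_vec().into_boxed_slice(),
            ValueRef::Bytes(b) => b, // no copy: the allocation is reused
        }
    }
}
```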
* fix ReferenceValue API flaw
Remove `Facet` and `TokenizedString` values from the `ReferenceValue` API,
as this requires the trait value to have them stored somewhere.
Since `TokenizedString` is quite niche, I just copy it into a Box,
instead of designing a reference API around it.
* fix comment link
This changes three things:
- Reuse the positions_per_path hashmap instead of allocating one per indexed JSON value
- Try to cast u64 values to i64 to match search behaviour (see the sketch below)
- Allow top-level JSON values to be of any type, instead of limiting them to JSON objects. Remove the special JSON object handling method.
TODO: We should probably also try to cast f64 to i64 and u64 when indexing, as values may get converted to f64 by the JSON parser
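A sketch of the u64-to-i64 coercion mentioned above (the helper name is hypothetical, and `serde_json::Value` stands in for the indexed value type):
```rust
/// Index u64 values that fit into i64 as i64, matching search-side behaviour.
fn coerce_u64(val: u64) -> serde_json::Value {
    match i64::try_from(val) {
        Ok(as_i64) => serde_json::Value::from(as_i64),
        Err(_) => serde_json::Value::from(val), // out of i64 range: keep as u64
    }
}
```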
* Fix trait bound of StoreReader::iter
Similar to `StoreReader::get`, `StoreReader::iter` should only require
`DocumentDeserialize` and not `Document`.
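A minimal sketch of the principle with hypothetical traits (not tantivy's actual definitions): an iterator over stored documents should require only the capability it uses, deserialization, so types that implement just that trait are not ruled out.
```rust
trait DocDeserialize: Sized {
    fn deserialize(bytes: &[u8]) -> Self;
}

// Over-constrained variant for contrast: `FullDoc` would also demand
// capabilities the iterator never uses.
trait FullDoc: DocDeserialize {
    fn field_count(&self) -> usize;
}

// Requiring only `DocDeserialize` mirrors the relaxed bound on
// `StoreReader::iter`.
fn iter_docs<'a, D: DocDeserialize + 'a>(
    chunks: &'a [Vec<u8>],
) -> impl Iterator<Item = D> + 'a {
    chunks.iter().map(|chunk| D::deserialize(chunk))
}
```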
Mark the iterator returned by `SegmentReader::doc_ids_alive` as `Send` so it can be used in impls of `Stream`/`AsyncIterator`.
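A usage sketch showing why `Send` matters here (assumes the `futures` and `tokio` crates; the doc-id type is simplified to `u32`): a future passed to `tokio::spawn` must be `Send`, which in turn requires any iterator held across `.await` points to be `Send`.
```rust
use futures::StreamExt;

async fn drive<I>(doc_ids: I)
where
    // `Send` is what allows this future to be handed to `tokio::spawn`.
    I: Iterator<Item = u32> + Send + 'static,
{
    let mut stream = futures::stream::iter(doc_ids);
    while let Some(doc_id) = stream.next().await {
        let _ = doc_id; // process the live doc id here
    }
}
```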