* add method to fetch block of first vals in columnar
add method to fetch block of first vals in columnar (this is way faster
than single calls for full columns)
add benchmark
fix import warnings
```
test bench_get_block_first_on_full_column ... bench: 56 ns/iter (+/- 26)
test bench_get_block_first_on_full_column_single_calls ... bench: 311 ns/iter (+/- 6)
test bench_get_block_first_on_multi_column ... bench: 378 ns/iter (+/- 15)
test bench_get_block_first_on_multi_column_single_calls ... bench: 546 ns/iter (+/- 13)
test bench_get_block_first_on_optional_column ... bench: 291 ns/iter (+/- 6)
test bench_get_block_first_on_optional_column_single_calls ... bench: 362 ns/iter (+/- 8)
```
* use remainder
* handle ip adresses in term aggregation
Stores IpAdresses during the segment term aggregation via u64 representation
and convert to u128(IpV6Adress) via downcast when converting to intermediate results.
Enable Downcasting on `ColumnValues`
Expose u64 variant for u128 encoded data via `open_u64_lenient` method.
Remove lifetime in VecColumn, to avoid 'static lifetime requirement coming
from downcast trait.
* rename method
Spotted in `range_date_histogram` query in quickwit benchmark:
5% of time copying docs around, which is not needed in the full index case
remove Column to ColumnIndex deref
* Fix serde for TopNComputer
The top hits aggregation changed the TopNComputer to be serializable,
but capacity needs to be carried over, as it contains logic which is
checked against when pushing elements (capacity == 0 is not allowed).
* use serde from deser
* remove pub, clippy
* feat(aggregators/metric): Implement a top_hits aggregator
* fix: Expose get_fields
* fix: Serializer for top_hits request
Also removes extraneous the extraneous third-party
serialization helper.
* chore: Avert panick on parsing invalid top_hits query
* refactor: Allow multiple field names from aggregations
* perf: Replace binary heap with TopNComputer
* fix: Avoid comparator inversion by ComparableDoc
* fix: Rank missing field values lower than present values
* refactor: Make KeyOrder a struct
* feat: Rough attempt at docvalue_fields
* feat: Complete stab at docvalue_fields
- Rename "SearchResult*" => "Retrieval*"
- Revert Vec => HashMap for aggregation accessors.
- Split accessors for core aggregation and field retrieval.
- Resolve globbed field names in docvalue_fields retrieval.
- Handle strings/bytes and other column types with DynamicColumn
* test(unit): Add tests for top_hits aggregator
* fix: docfield_value field globbing
* test(unit): Include dynamic fields
* fix: Value -> OwnedValue
* fix: Use OwnedValue's native Null variant
* chore: Improve readability of test asserts
* chore: Remove DocAddress from top_hits result
* docs: Update aggregator doc
* revert: accidental doc test
* chore: enable time macros only for tests
* chore: Apply suggestions from review
* chore: Apply suggestions from review
* fix: Retrieve all values for fields
* test(unit): Update for multi-value retrieval
* chore: Assert term existence
* feat: Include all columns for a column name
Since a (name, type) constitutes a unique column.
* fix: Resolve json fields
Introduces a translation step to bridge the difference between
ColumnarReaders null `\0` separated json field keys to the common
`.` separated used by SegmentReader. Although, this should probably
be the default behavior for ColumnarReader's public API perhaps.
* chore: Address review on mutability
* chore: s/segment_id/segment_ordinal instances of SegmentOrdinal
* chore: Revert erroneous grammar change
* fix windows build (#1)
* Fix windows build
* Add doc traits
* Add field value iter
* Add value and serialization
* Adjust order
* Fix bug
* Correct type
* Fix generic bugs
* Reformat code
* Add generic to index writer which I forgot about
* Fix missing generics on single segment writer
* Add missing type export
* Add default methods for convenience
* Cleanup
* Fix more-like-this query to use standard types
* Update API and fix tests
* Add doc traits
* Add field value iter
* Add value and serialization
* Adjust order
* Fix bug
* Correct type
* Rebase main and fix conflicts
* Reformat code
* Merge upstream
* Fix missing generics on single segment writer
* Add missing type export
* Add default methods for convenience
* Cleanup
* Fix more-like-this query to use standard types
* Update API and fix tests
* Add tokenizer improvements from previous commits
* Add tokenizer improvements from previous commits
* Reformat
* Fix unit tests
* Fix unit tests
* Use enum in changes
* Stage changes
* Add new deserializer logic
* Add serializer integration
* Add document deserializer
* Implement new (de)serialization api for existing types
* Fix bugs and type errors
* Add helper implementations
* Fix errors
* Reformat code
* Add unit tests and some code organisation for serialization
* Add unit tests to deserializer
* Add some small docs
* Add support for deserializing serde values
* Reformat
* Fix typo
* Fix typo
* Change repr of facet
* Remove unused trait methods
* Add child value type
* Resolve comments
* Fix build
* Fix more build errors
* Fix more build errors
* Fix the tests I missed
* Fix examples
* fix numerical order, serialize PreTok Str
* fix coverage
* rename Document to TantivyDocument, rename DocumentAccess to Document
add Binary prefix to binary de/serialization
* fix coverage
---------
Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>
* Fix DateHistogram bucket gap
Fixes a computation issue of the number of buckets needed in the
DateHistogram.
This is due to a missing normalization from request values (ms) to fast field
values (ns), when converting an intermediate result to the final result.
This results in a wrong computation by a factor 1_000_000.
The Histogram normalizes values to nanoseconds, to make the user input like
extended_bounds (ms precision) and the values from the fast field (ns precision for date type) compatible.
This normalization happens only for date type fields, as other field types don't have precision settings.
The normalization does not happen due a missing `column_type`, which is not
correctly passed after merging an empty aggregation (which does not have a `column_type` set), with a regular aggregation.
Another related issue is an empty aggregation, which will not have
`column_type` set, will not convert the result to human readable format.
This PR fixes the issue by:
- Limit the allowed field types of DateHistogram to DateType
- Instead of passing the column_type, which is only available on the segment level, we flag the aggregation as `is_date_agg`.
- Fix the merge logic
Add a flag to to normalization only once. This is not an issue
currently, but it could become easily one.
closes https://github.com/quickwit-oss/quickwit/issues/3837
* use older nightly for time crate (breaks build)
* add missing aggregation part 2
Add missing support for:
- Mixed types columns
- Key of type string on numerical fields
The special aggregation is slower than the integrated one in TermsAggregation and therefore not
chosen by default, although it can cover all use cases.
* simplify, add num_docs to empty
* Improve aggregation error message
Improve aggregation error message by wrapping the deserialization with a
custom struct. This deserialization variant is slower, since we need to
keep the deserialized data around twice with this approach.
For now the valid variants list is manually updated. This could be
replaced with a proc macro.
closes#2143
* Simpler implementation
---------
Co-authored-by: Paul Masurel <paul@quickwit.io>
* alternative mixed field aggregation collection
instead of having multiple accessor in one AggregationWithAccessor split it into
multiple independent AggregationWithAccessor
* Update src/aggregation/agg_req_with_accessor.rs
Co-authored-by: Paul Masurel <paul@quickwit.io>
---------
Co-authored-by: Paul Masurel <paul@quickwit.io>
This fixes an issue when min_doc==0 loads terms from the dictionary from
one segment and merges the same term with a subaggregation from another
segment.
Previously the empty structure was not correctly initialized to contain
the subaggregation so the merge was incorrect.
* switch to ms in histogram for date type
switch to ms in histogram, by adding a normalization step that converts
to nanoseconds precision when creating the collector.
closes#2028
related to #2026
* add missing unit long variants
* use single thread to avoid handling test case
* fix docs
* revert CI
* cleanup
* improve docs
* Update src/aggregation/bucket/histogram/histogram.rs
Co-authored-by: Paul Masurel <paul@quickwit.io>
---------
Co-authored-by: Paul Masurel <paul@quickwit.io>
Applied this command to the code, making it a bit shorter and slightly
more readable.
```
cargo +nightly clippy --all-features --benches --tests --workspace --fix -- -A clippy::all -W clippy::uninlined_format_args
cargo +nightly fmt --all
```
* clear memory consumption in AggregationLimits
clear memory consumption in AggregationLimits at the end of segment collection
* switch to ResourceLimitGuard
* unduplicate code
* merge methods
* Apply suggestions from code review
Co-authored-by: Paul Masurel <paul@quickwit.io>
---------
Co-authored-by: Paul Masurel <paul@quickwit.io>
* chore!:drop JSON support on intermediate agg result
add support for other formats by removing skip_serialize and untagged
JSON support is broken anyway due it's lack on f64::INF etc. handling
* Update src/aggregation/intermediate_agg_result.rs
Co-authored-by: Paul Masurel <paul@quickwit.io>
* move from impl
---------
Co-authored-by: Paul Masurel <paul@quickwit.io>