* Initial impl
* Added `Filter` impl in `build_single_agg_segment_collector_with_reader` + Added tests
* Added `Filter(FilterBucketResult)` + Made tests work.
* Fixed type issues.
* Fixed a test.
* 8a7a73a: Pass `segment_reader`
* Added more tests.
* Improved parsing + tests
* refactoring
* Added more tests.
* refactoring: moved parsing code under QueryParser
* Use Tantivy syntax instead of ES
* Added a sanity check test.
* Simplified impl + tests
* Added back tests in a more maintable way
* nitz.
* nitz
* implemented very simple fast-path
* improved a comment
* implemented fast field support
* Used `BoundsRange`
* Improved fast field impl + tests
* Simplified execution.
* Fixed exports + nitz
* Improved the tests to check to the expected result.
* Improved test by checking the whole result JSON
* Removed brittle perf checks.
* Added efficiency verification tests.
* Added one more efficiency check test.
* Improved the efficiency tests.
* Removed unnecessary parsing code + added direct Query obj
* Fixed tests.
* Improved tests
* Fixed code structure
* Fixed lint issues
* nitz.
* nitz
* nitz.
* nitz.
* nitz.
* Added an example
* Fixed PR comments.
* Applied PR comments + nitz
* nitz.
* Improved the code.
* Fixed a perf issue.
* Added batch processing.
* Made the example more interesting
* Fixed bucket count
* Renamed Direct to CustomQuery
* Fixed lint issues.
* No need for scorer to be an `Option`
* nitz
* Used BitSet
* Added an optimization for AllQuery
* Fixed merge issues.
* Fixed lint issues.
* Added benchmark for FILTER
* Removed the Option wrapper.
* nitz.
* Applied PR comments.
* Fixed the AllQuery optimization
* Applied PR comments.
* feat: used `erased_serde` to allow filter query to be serialized
* further improved a comment
* Added back tests.
* removed an unused method
* removed an unused method
* Added documentation
* nitz.
* Added query builder.
* Fixed a comment.
* Applied PR comments.
* Fixed doctest issues.
* Added ser/de
* Removed bench in test
* Fixed a lint issue.
* add nested histogram-termagg benchmark
* Replace AggregationsWithAccessor with AggData
With AggregationsWithAccessor pre-computation and caching was done on the collector level.
If you have 10000 sub collectors (e.g. a term aggregation with sub aggregations) this is very inefficient.
`AggData` instead moves the data from the collector to a node which reflects the cardinality of the request tree instead of the cardinality of the segment collector.
It also moves the global struct shared with all aggregations in to aggregation specific structs. So each aggregation has its own space to store cached data and aggregation specific information.
This also breaks up the dependency to the elastic search aggregation structure somewhat.
Due to lifetime issues, we move the agg request specific object out of `AggData` during the collection and move it back at the end (for now). That's some unnecessary work, which costs CPU.
This allows better caching and will also pave the way for another potential optimization, by separating the collector and its storage. Currently we allocate a new collector for each sub aggregation bucket (for nested aggregations), but ideally we would have just one collector instance.
* renames
* move request data to agg request files
---------
Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
* WiP: cardinality aggregation
* Collect unique entries first, then insert into HyperLogLog
* Handle `missing`
* Hybrid approach
* Review changes
- insert `missing` value at most once
- `term_id` -> `term_ord`
- iterate directly over entries without collecting first
* Use salted hasher to include column type
* fix: formatting
* More review fixes
* Add cardinality to test_aggregation_flushing
* Formatting
* first version of extended stats along with its tests
* using IntermediateExtendStats instead of IntermediateStats with all tests passing
* Created struct for request and response
* first test with extended_stats
* kahan summation and tests with approximate equality
* version ready for merge
* removed approx dependency
* refactor for using ExtendedStats only when needed
* interim version
* refined version with code formatted
* refactored a struct
* cosmetic refactor
* fix after merge
* fix format
* added extended_stat bench
* merge and new benchmark for extended stats
* split stat segment collectors
* wrapped intermediate extended stat with a box to limit memory usage
* Revert "wrapped intermediate extended stat with a box to limit memory usage"
This reverts commit 5b4aa9f393.
* some code reformat, commented kahan summation
* refactor after review
* refactor after code review
* fix after incorrectly restoring kahan summation
* modifications for code review + bug fix in merge_fruit
* refactor assert_nearly_equals macro
* update after code review
---------
Co-authored-by: Giovanni Cuccu <gcuccu@imolainformatica.it>
* feat(aggregators/metric): Implement a top_hits aggregator
* fix: Expose get_fields
* fix: Serializer for top_hits request
Also removes extraneous the extraneous third-party
serialization helper.
* chore: Avert panick on parsing invalid top_hits query
* refactor: Allow multiple field names from aggregations
* perf: Replace binary heap with TopNComputer
* fix: Avoid comparator inversion by ComparableDoc
* fix: Rank missing field values lower than present values
* refactor: Make KeyOrder a struct
* feat: Rough attempt at docvalue_fields
* feat: Complete stab at docvalue_fields
- Rename "SearchResult*" => "Retrieval*"
- Revert Vec => HashMap for aggregation accessors.
- Split accessors for core aggregation and field retrieval.
- Resolve globbed field names in docvalue_fields retrieval.
- Handle strings/bytes and other column types with DynamicColumn
* test(unit): Add tests for top_hits aggregator
* fix: docfield_value field globbing
* test(unit): Include dynamic fields
* fix: Value -> OwnedValue
* fix: Use OwnedValue's native Null variant
* chore: Improve readability of test asserts
* chore: Remove DocAddress from top_hits result
* docs: Update aggregator doc
* revert: accidental doc test
* chore: enable time macros only for tests
* chore: Apply suggestions from review
* chore: Apply suggestions from review
* fix: Retrieve all values for fields
* test(unit): Update for multi-value retrieval
* chore: Assert term existence
* feat: Include all columns for a column name
Since a (name, type) constitutes a unique column.
* fix: Resolve json fields
Introduces a translation step to bridge the difference between
ColumnarReaders null `\0` separated json field keys to the common
`.` separated used by SegmentReader. Although, this should probably
be the default behavior for ColumnarReader's public API perhaps.
* chore: Address review on mutability
* chore: s/segment_id/segment_ordinal instances of SegmentOrdinal
* chore: Revert erroneous grammar change
* Improve aggregation error message
Improve aggregation error message by wrapping the deserialization with a
custom struct. This deserialization variant is slower, since we need to
keep the deserialized data around twice with this approach.
For now the valid variants list is manually updated. This could be
replaced with a proc macro.
closes#2143
* Simpler implementation
---------
Co-authored-by: Paul Masurel <paul@quickwit.io>
* switch to ms in histogram for date type
switch to ms in histogram, by adding a normalization step that converts
to nanoseconds precision when creating the collector.
closes#2028
related to #2026
* add missing unit long variants
* use single thread to avoid handling test case
* fix docs
* revert CI
* cleanup
* improve docs
* Update src/aggregation/bucket/histogram/histogram.rs
Co-authored-by: Paul Masurel <paul@quickwit.io>
---------
Co-authored-by: Paul Masurel <paul@quickwit.io>
* improve validation in aggregation, extend invalid field test
improve validation in aggregation
extend invalid field test
Fixes#1291
* collect fast field names on request structure
* fix visibility of AggregationSegmentCollector