* fix: deduplicate doc counts in term aggregation for multi-valued fields
Term aggregation was counting term occurrences instead of documents
for multi-valued fields. A document with the same value appearing
multiple times would inflate doc_count.
Add `fetch_block_with_missing_unique_per_doc` to ColumnBlockAccessor
that deduplicates (doc_id, value) pairs, and use it in term aggregation.
Fixes#2721
* refactor: only deduplicate for multivalue cardinality
Duplicates can only occur with multivalue columns, so narrow the
check from !is_full() to is_multivalue().
* fix: handle non-consecutive duplicate values in dedup
Sort values within each doc_id group before deduplicating, so that
non-adjacent duplicates are correctly handled.
Add unit tests for dedup_docid_val_pairs: consecutive duplicates,
non-consecutive duplicates, multi-doc groups, no duplicates, and
single element.
* perf: skip dedup when block has no multivalue entries
Add early return when no consecutive doc_ids are equal, avoiding
unnecessary sort and dedup passes. Remove the 2-element swap
optimization as it is not needed by the dedup algorithm.
---------
Co-authored-by: nryoo <nryoo@nryooui-MacBookPro.local>
Keep use_serde on sketches-ddsketch so DDSketch derives
Serialize/Deserialize, removing the need for custom impls
on PercentilesCollector.
Co-authored-by: Cursor <cursoragent@cursor.com>
Replace the derived Serialize/Deserialize on PercentilesCollector with
custom impls that use DDSketch's Java-compatible binary encoding
(encode_to_java_bytes / decode_from_java_bytes). This removes the need
for the use_serde feature on sketches-ddsketch entirely.
Also restore original float test values and use assert_nearly_equals!
for all float comparisons in percentile tests, since DDSketch quantile
estimates can have minor precision differences across platforms.
Co-authored-by: Cursor <cursoragent@cursor.com>
Address review feedback: replace assert_eq! with assert_nearly_equals!
for float values that go through JSON serialization roundtrips, which
can introduce minor precision differences.
Co-authored-by: Cursor <cursoragent@cursor.com>
Fork sketches-ddsketch as a workspace member to add native Java binary
serialization (to_java_bytes/from_java_bytes) for DDSketch. This enables
pomsky to return raw DDSketch bytes that event-query can deserialize via
DDSketchWithExactSummaryStatistics.decode().
Key changes:
- Vendor sketches-ddsketch crate with encoding.rs implementing VarEncoding,
flag bytes, and INDEX_DELTAS_AND_COUNTS store format
- Align Config::key() to floor-based indexing matching Java's LogarithmicMapping
- Add PercentilesCollector::to_sketch_bytes() for pomsky integration
- Cross-language golden byte tests verified byte-identical with Java output
Co-authored-by: Cursor <cursoragent@cursor.com>
Switch tantivy's cardinality aggregation from the hyperloglogplus crate
(HyperLogLog++ with p=16) to the official Apache DataSketches HLL
implementation (datasketches crate v0.2.0 with lg_k=11, Hll4).
This enables returning raw HLL sketch bytes from pomsky to Datadog's
event query, where they can be properly deserialized and merged using
the same DataSketches library (Java). The previous implementation
required pomsky to fabricate fake HLL sketches from scalar cardinality
estimates, which produced incorrect results when merged.
Changes:
- Cargo.toml: hyperloglogplus 0.4.1 -> datasketches 0.2.0
- CardinalityCollector: HyperLogLogPlus<u64, BuildSaltedHasher> -> HllSketch
- Custom Serde impl using HllSketch binary format (cross-shard compat)
- New to_sketch_bytes() for external consumers (pomsky)
- Salt preserved via (salt, value) tuple hashing for column type disambiguation
- Removed BuildSaltedHasher struct
- Added 4 new unit tests (serde roundtrip, merge, binary compat, salt)
Otherwise, there is no way to access these fields when not using the
json serialized form of the aggregation results.
This simple data struct is part of the public api,
so its fields should be accessible as well.
* improve bench
* add more tests for new collection type
* one collector per agg request instead per bucket
In this refactoring a collector knows in which bucket of the parent
their data is in. This allows to convert the previous approach of one
collector per bucket to one collector per request.
low card bucket optimization
* reduce dynamic dispatch, faster term agg
* use radix map, fix prepare_max_bucket
use paged term map in term agg
use special no sub agg term map impl
* specialize columntype in stats
* remove stacktrace bloat, use &mut helper
increase cache to 2048
* cleanup
remove clone
move data in term req, single doc opt for stats
* add comment
* share column block accessor
* simplify fetch block in column_block_accessor
* split subaggcache into two trait impls
* move partitions to heap
* fix name, add comment
---------
Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>
* Refactoring of the score tweaker into `SortKeyComputer`s to unlock two features.
- Allow lazy evaluation of score. As soon as we identified that a doc won't
reach the topK threshold, we can stop the evaluation.
- Allow for a different segment level score, segment level score and their conversion.
This PR breaks public API, but fixing code is straightforward.
* Bumping tantivy version
---------
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
* Optimization when posting list are saturated.
If a posting list doc freq is the segment reader's
max_doc, and if scoring does not matter, we can replace it
by a AllScorer.
In turn, in a boolean query, we can dismiss all scorers and
empty scorers, to accelerate the request.
* Added range query optimization
* CR comment
* CR comments
* CR comment
---------
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
This introduce an optimization of top level term aggregation on field with a low cardinality.
We then use a Vec as the underlying map.
In addition, we buffer subaggregations.
---------
Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
Co-authored-by: Paul Masurel <paul@quickwit.io>
* Initial impl
* Added `Filter` impl in `build_single_agg_segment_collector_with_reader` + Added tests
* Added `Filter(FilterBucketResult)` + Made tests work.
* Fixed type issues.
* Fixed a test.
* 8a7a73a: Pass `segment_reader`
* Added more tests.
* Improved parsing + tests
* refactoring
* Added more tests.
* refactoring: moved parsing code under QueryParser
* Use Tantivy syntax instead of ES
* Added a sanity check test.
* Simplified impl + tests
* Added back tests in a more maintable way
* nitz.
* nitz
* implemented very simple fast-path
* improved a comment
* implemented fast field support
* Used `BoundsRange`
* Improved fast field impl + tests
* Simplified execution.
* Fixed exports + nitz
* Improved the tests to check to the expected result.
* Improved test by checking the whole result JSON
* Removed brittle perf checks.
* Added efficiency verification tests.
* Added one more efficiency check test.
* Improved the efficiency tests.
* Removed unnecessary parsing code + added direct Query obj
* Fixed tests.
* Improved tests
* Fixed code structure
* Fixed lint issues
* nitz.
* nitz
* nitz.
* nitz.
* nitz.
* Added an example
* Fixed PR comments.
* Applied PR comments + nitz
* nitz.
* Improved the code.
* Fixed a perf issue.
* Added batch processing.
* Made the example more interesting
* Fixed bucket count
* Renamed Direct to CustomQuery
* Fixed lint issues.
* No need for scorer to be an `Option`
* nitz
* Used BitSet
* Added an optimization for AllQuery
* Fixed merge issues.
* Fixed lint issues.
* Added benchmark for FILTER
* Removed the Option wrapper.
* nitz.
* Applied PR comments.
* Fixed the AllQuery optimization
* Applied PR comments.
* feat: used `erased_serde` to allow filter query to be serialized
* further improved a comment
* Added back tests.
* removed an unused method
* removed an unused method
* Added documentation
* nitz.
* Added query builder.
* Fixed a comment.
* Applied PR comments.
* Fixed doctest issues.
* Added ser/de
* Removed bench in test
* Fixed a lint issue.
Previously the merging relied on the order of the results, which is invalid since https://github.com/quickwit-oss/tantivy/pull/2035.
This bug is only hit in specific scenarios, when the aggregation collectors are built in a different order on different segments.
Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
* add nested histogram-termagg benchmark
* Replace AggregationsWithAccessor with AggData
With AggregationsWithAccessor pre-computation and caching was done on the collector level.
If you have 10000 sub collectors (e.g. a term aggregation with sub aggregations) this is very inefficient.
`AggData` instead moves the data from the collector to a node which reflects the cardinality of the request tree instead of the cardinality of the segment collector.
It also moves the global struct shared with all aggregations in to aggregation specific structs. So each aggregation has its own space to store cached data and aggregation specific information.
This also breaks up the dependency to the elastic search aggregation structure somewhat.
Due to lifetime issues, we move the agg request specific object out of `AggData` during the collection and move it back at the end (for now). That's some unnecessary work, which costs CPU.
This allows better caching and will also pave the way for another potential optimization, by separating the collector and its storage. Currently we allocate a new collector for each sub aggregation bucket (for nested aggregations), but ideally we would have just one collector instance.
* renames
* move request data to agg request files
---------
Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
* change AggregationLimits behavior
This fixes an issue encountered with the current behaviour of
AggregationLimits.
Previously we had AggregationLimits and RessourceLimitGuard, which both
track the memory, but only RessourceLimitGuard released memory when
dropped, while AggregationLimits did not.
This PR changes AggregationLimits to be a guard itself and removes the
RessourceLimitGuard.
* rename AggregationLimits to AggregationLimitsGuard
* add Key::I64 and Key::U64 variants in aggregation
Currently all `Key` numerical values are returned as f64. This causes problems in some
cases with the precision and the way f64 is serialized.
This PR adds `Key::I64` and `Key::U64` variants and uses them in the term
aggregation.
* add clarification comment
* WiP: cardinality aggregation
* Collect unique entries first, then insert into HyperLogLog
* Handle `missing`
* Hybrid approach
* Review changes
- insert `missing` value at most once
- `term_id` -> `term_ord`
- iterate directly over entries without collecting first
* Use salted hasher to include column type
* fix: formatting
* More review fixes
* Add cardinality to test_aggregation_flushing
* Formatting
* first version of extended stats along with its tests
* using IntermediateExtendStats instead of IntermediateStats with all tests passing
* Created struct for request and response
* first test with extended_stats
* kahan summation and tests with approximate equality
* version ready for merge
* removed approx dependency
* refactor for using ExtendedStats only when needed
* interim version
* refined version with code formatted
* refactored a struct
* cosmetic refactor
* fix after merge
* fix format
* added extended_stat bench
* merge and new benchmark for extended stats
* split stat segment collectors
* wrapped intermediate extended stat with a box to limit memory usage
* Revert "wrapped intermediate extended stat with a box to limit memory usage"
This reverts commit 5b4aa9f393.
* some code reformat, commented kahan summation
* refactor after review
* refactor after code review
* fix after incorrectly restoring kahan summation
* modifications for code review + bug fix in merge_fruit
* refactor assert_nearly_equals macro
* update after code review
---------
Co-authored-by: Giovanni Cuccu <gcuccu@imolainformatica.it>
PR https://github.com/quickwit-oss/quickwit/pull/4962 fixes an issue
where the AggregationLimits are not passed correctly. Since the
AggregationLimits are shared properly we run into contention issues.
This PR includes some straightforward improvement to reduce contention,
by only calling if the memory changed and avoiding the second read.
We probably need some sharding with multiple counters or local caching before updating the
global after some threshold.