Keep use_serde on sketches-ddsketch so DDSketch derives
Serialize/Deserialize, removing the need for custom impls
on PercentilesCollector.
Co-authored-by: Cursor <cursoragent@cursor.com>
Replace the derived Serialize/Deserialize on PercentilesCollector with
custom impls that use DDSketch's Java-compatible binary encoding
(encode_to_java_bytes / decode_from_java_bytes). This removes the need
for the use_serde feature on sketches-ddsketch entirely.
Also restore original float test values and use assert_nearly_equals!
for all float comparisons in percentile tests, since DDSketch quantile
estimates can have minor precision differences across platforms.
Co-authored-by: Cursor <cursoragent@cursor.com>
Address review feedback: replace assert_eq! with assert_nearly_equals!
for float values that go through JSON serialization roundtrips, which
can introduce minor precision differences.
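For illustration, a tolerance-based comparison is roughly this shape (a standalone sketch; the actual assert_nearly_equals! macro may choose its epsilon policy differently):
```
fn nearly_equals(a: f64, b: f64) -> bool {
    // Relative tolerance: absorb the low-bit differences that a JSON
    // roundtrip or platform-specific floating point can introduce.
    let scale = a.abs().max(b.abs()).max(1.0);
    (a - b).abs() <= scale * 1e-10
}

fn main() {
    assert!(nearly_equals(42.500000000000007, 42.5));
    assert!(!nearly_equals(42.5, 42.6));
}
```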
Co-authored-by: Cursor <cursoragent@cursor.com>
Move the vendored sketches-ddsketch crate (with Java-compatible binary
encoding) to its own repo at quickwit-oss/rust-sketches-ddsketch and
reference it via git+rev in Cargo.toml.
Co-authored-by: Cursor <cursoragent@cursor.com>
- Replace approximate PI/E constants with non-famous value in test
- Fix reversed empty range (2048..0) → (0..2048).rev() in store test
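For context on the second fix: a Rust range with start >= end is simply empty, so the original loop body never ran; iterating downward requires reversing an ascending range.
```
fn main() {
    // A range with start >= end iterates zero times.
    assert_eq!((2048..0).count(), 0);

    // To count down, reverse an ascending range instead.
    let descending: Vec<u32> = (0..2048).rev().collect();
    assert_eq!(descending.first(), Some(&2047)); // 2047, 2046, ..., 0
    assert_eq!(descending.len(), 2048);
}
```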
Co-authored-by: Cursor <cursoragent@cursor.com>
- Replace bare constants with FlagType and BinEncodingMode enums
- Use const fn for flag byte construction instead of raw bit ops
- Replace if-else chain with nested match in decode_from_java_bytes
- Use split_first() in read_byte for idiomatic slice consumption
- Use split_at in read_f64_le to avoid TryInto on edition 2018 (both helpers are sketched after this list)
- Use u64::from(next) instead of `next as u64` casts
- Extract assert_golden, assert_quantiles_match, bytes_to_hex helpers
to reduce duplication across golden byte tests
- Fix edition-2018 assert! format string compatibility
- Clean up is_valid_flag_byte with let-else and match
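The two slice-consumption items might look roughly like this (a sketch with assumed signatures; the vendored crate's actual error handling may differ):
```
// Consume one byte from the front of the input slice.
fn read_byte(input: &mut &[u8]) -> Option<u8> {
    let (&first, rest) = input.split_first()?;
    *input = rest;
    Some(first)
}

// Consume a little-endian f64. split_at + copy_from_slice sidesteps
// TryInto, which is not in the prelude on edition 2018.
fn read_f64_le(input: &mut &[u8]) -> Option<f64> {
    if input.len() < 8 {
        return None;
    }
    let (head, rest) = input.split_at(8);
    let mut buf = [0u8; 8];
    buf.copy_from_slice(head);
    *input = rest;
    Some(f64::from_le_bytes(buf))
}
```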
Co-authored-by: Cursor <cursoragent@cursor.com>
- manual_range_contains: use !(0.0..=1.0).contains(&q)
- identity_op: simplify (0 << 2) | FLAG_TYPE to just FLAG_TYPE
- manual_clamp: use .clamp(0, 8) instead of .max(0).min(8)
- manual_repeat_n: use repeat_n() instead of repeat().take()
- cast_abs_to_unsigned: use .unsigned_abs() instead of .abs() as usize
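Taken together, the fixes look like this (the contexts are illustrative, not the crate's actual code; the identity_op fix simply deletes the `(0 << 2) |` no-op):
```
fn main() {
    // manual_range_contains
    let q = 1.5_f64;
    if !(0.0..=1.0).contains(&q) {
        println!("quantile out of range");
    }

    // manual_clamp
    let shift: i32 = 12;
    assert_eq!(shift.clamp(0, 8), 8);

    // manual_repeat_n
    let zeros: Vec<u8> = std::iter::repeat_n(0u8, 4).collect();
    assert_eq!(zeros.len(), 4);

    // cast_abs_to_unsigned
    let delta: i64 = -3;
    assert_eq!(delta.unsigned_abs() as usize, 3);
}
```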
Co-authored-by: Cursor <cursoragent@cursor.com>
Reference the exact Java source files in DataDog/sketches-java for
Config::new(), Config::key(), Config::value(), Config::from_gamma(),
and Store::add_count() so readers can verify the alignment.
Co-authored-by: Cursor <cursoragent@cursor.com>
Fork sketches-ddsketch as a workspace member to add native Java binary
serialization (to_java_bytes/from_java_bytes) for DDSketch. This enables
pomsky to return raw DDSketch bytes that event-query can deserialize via
DDSketchWithExactSummaryStatistics.decode().
Key changes:
- Vendor sketches-ddsketch crate with encoding.rs implementing VarEncoding,
flag bytes, and INDEX_DELTAS_AND_COUNTS store format
- Align Config::key() to floor-based indexing matching Java's LogarithmicMapping (sketched after this list)
- Add PercentilesCollector::to_sketch_bytes() for pomsky integration
- Cross-language golden byte tests verified byte-identical with Java output
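The floor-based indexing item is the crux of byte-compatibility. In isolation the mapping is roughly the following sketch, which ignores the real Config's offset and min-value handling:
```
// Java's LogarithmicMapping computes index(v) = floor(log(v) / log(gamma)).
// A different rounding mode shifts some values into a neighboring bucket
// and breaks byte-identical output.
fn key(value: f64, gamma: f64) -> i32 {
    (value.ln() / gamma.ln()).floor() as i32
}

fn main() {
    let gamma = 1.02_f64; // relative accuracy of roughly 1%
    assert_eq!(key(1.0, gamma), 0);
    assert!(key(100.0, gamma) > 0);
}
```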
Co-authored-by: Cursor <cursoragent@cursor.com>
Switch tantivy's cardinality aggregation from the hyperloglogplus crate
(HyperLogLog++ with p=16) to the official Apache DataSketches HLL
implementation (datasketches crate v0.2.0 with lg_k=11, Hll4).
This enables returning raw HLL sketch bytes from pomsky to Datadog's
event query, where they can be properly deserialized and merged using
the same DataSketches library (Java). The previous implementation
required pomsky to fabricate fake HLL sketches from scalar cardinality
estimates, which produced incorrect results when merged.
Changes:
- Cargo.toml: hyperloglogplus 0.4.1 -> datasketches 0.2.0
- CardinalityCollector: HyperLogLogPlus<u64, BuildSaltedHasher> -> HllSketch
- Custom Serde impl using HllSketch binary format (cross-shard compat)
- New to_sketch_bytes() for external consumers (pomsky)
- Salt preserved via (salt, value) tuple hashing for column type disambiguation (see the sketch after this list)
- Removed BuildSaltedHasher struct
- Added 4 new unit tests (serde roundtrip, merge, binary compat, salt)
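The salt item in isolation: hashing the (salt, value) pair keeps equal values of different column types from colliding in the sketch. A std-only sketch (the HllSketch update call is omitted, as its exact signature is the crate's detail):
```
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash the (salt, value) pair so the same raw u64 observed under two
// different column types produces two distinct HLL inputs.
fn salted_hash(salt: u32, value: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    (salt, value).hash(&mut hasher);
    hasher.finish()
}

fn main() {
    assert_ne!(salted_hash(0, 42), salted_hash(1, 42));
}
```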
Otherwise, there is no way to access these fields when not using the
JSON-serialized form of the aggregation results.
This simple data struct is part of the public API,
so its fields should be accessible as well.
## What
Enable range queries and TopN sorting on `Bytes` fast fields, bringing them to parity with `Str` fields.
## Why
`BytesColumn` uses the same dictionary encoding as `StrColumn` internally, but range queries and TopN sorting were explicitly disabled for `Bytes`. This prevented use cases like storing lexicographically sortable binary data (e.g., arbitrary-precision decimals) that need efficient range filtering.
## How
1. **Enable range queries for Bytes** - Changed `is_type_valid_for_fastfield_range_query()` to return `true` for `Type::Bytes`
2. **Add BytesColumn handling in scorer** - Added a branch in `FastFieldRangeWeight::scorer()` to handle bytes fields using dictionary ordinal lookup (mirrors the existing `StrColumn` logic; a rough sketch follows this list)
3. **Add SortByBytes** - New sort key computer for TopN queries on bytes columns
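A rough sketch of the ordinal-lookup idea from step 2 (`ords_for_doc` is a hypothetical stand-in for the bytes column's dictionary-ordinal accessor): resolve the byte bounds to term ordinals once, then test each doc's ordinals against that ordinal range.
```
// Illustrates the approach, not tantivy's real scorer code.
fn doc_matches<I: Iterator<Item = u64>>(
    mut ords_for_doc: I,
    lower_ord: u64,
    upper_ord: u64,
) -> bool {
    ords_for_doc.any(|ord| (lower_ord..=upper_ord).contains(&ord))
}

fn main() {
    // Doc with ordinals {3, 17}; the ordinal range [5, 20] matches via 17.
    assert!(doc_matches(vec![3u64, 17].into_iter(), 5, 20));
    assert!(!doc_matches(vec![3u64].into_iter(), 5, 20));
}
```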
## Tests
- `test_bytes_field_ff_range_query` - Tests inclusive/exclusive bounds and unbounded ranges
- `test_sort_by_bytes_asc` / `test_sort_by_bytes_desc` - Tests lexicographic ordering in both directions
* faster exclude queries
Faster exclude queries with multiple terms.
Changes `Exclude` so it can exclude multiple DocSets directly, instead of
putting the docsets into a union.
Use `seek_danger` in `Exclude`.
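A minimal sketch of the multi-DocSet shape (an illustrative trait, not tantivy's actual Exclude code): a candidate doc survives only if no excluded set can position itself exactly on it.
```
// Illustrative trait standing in for tantivy's DocSet.
trait DocSetLike {
    /// Best-effort seek; returns true if `target` is in the set.
    fn seek_danger(&mut self, target: u32) -> bool;
}

fn is_excluded(excluded: &mut [Box<dyn DocSetLike>], target: u32) -> bool {
    // No union needed: probe each excluded set directly and short-circuit.
    excluded.iter_mut().any(|set| set.seek_danger(target))
}
```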
closes #2822
* replace unwrap with match
The intersection algorithm made it possible to call .seek(..) with values
lower than the current doc id, breaking the DocSet contract.
The fix removes the optimization that caused left.seek(..) to be replaced
by a simpler left.advance(..).
Simply doing so led to a performance regression.
I therefore integrated that idea into SegmentPostings.seek.
We now systematically check the next doc on seek,
PROVIDED the block is already loaded.
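The shape of that idea as a standalone sketch (illustrative structure, not SegmentPostings' actual code):
```
struct Postings {
    // The already-decoded block; assumed to end with a u32::MAX sentinel
    // standing in for TERMINATED, so the slow path always stops.
    docs: Vec<u32>,
    cursor: usize,
}

impl Postings {
    fn doc(&self) -> u32 {
        self.docs[self.cursor]
    }

    // On seek, first check the immediate successor in the decoded block:
    // in an intersection the target is very often the very next doc, so
    // this avoids the skip-list machinery on the hot path.
    fn seek(&mut self, target: u32) -> u32 {
        if self.doc() >= target {
            return self.doc();
        }
        if let Some(&next) = self.docs.get(self.cursor + 1) {
            if next >= target {
                self.cursor += 1;
                return next;
            }
        }
        self.seek_slow(target)
    }

    // Stand-in for the real skip-list / next-block path.
    fn seek_slow(&mut self, target: u32) -> u32 {
        while self.doc() < target {
            self.cursor += 1;
        }
        self.doc()
    }
}
```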
Closes #2811
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
A bug was introduced with the `seek_into_the_danger_zone()` optimization
(spotted and fixed by Stu).
The contract says `seek_into_the_danger_zone(target)` returns true if the target doc is part of the docset.
The blanket implementation goes like this:
```
let current_doc = self.doc();
if current_doc < target {
    self.seek(target);
}
// BUG: when target is TERMINATED, seek() runs off the end of the docset,
// self.doc() also becomes TERMINATED, and the comparison returns true.
self.doc() == target
```
So it will return true when target is TERMINATED, even though TERMINATED does not belong to the docset.
The fix tries to clarify the contracts and fixes the intersection algorithm.
We observe a small but across-the-board improvement in intersection performance.
---------
Co-authored-by: Stu Hood <stuhood@gmail.com>
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
Removes the Write generic parameter from PostingsSerializer.
This removes a useless generic.
Prepares the path for codecs.
Removes one useless CountingWrite layer.
etc.
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
* improve bench
* add more tests for new collection type
* one collector per agg request instead per bucket
In this refactoring, a collector knows which bucket of the parent its
data belongs to. This allows converting the previous approach of one
collector per bucket into one collector per request.
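A toy model of that idea (illustrative types, not tantivy's aggregation code): keying each entry by its parent bucket lets one collector serve every bucket.
```
use std::collections::HashMap;

// One collector for the whole request: entries are keyed by the parent
// bucket they belong to, instead of one collector instance per bucket.
struct TermAggCollector {
    counts: HashMap<(u32, u64), u64>, // (parent_bucket, term_ord) -> count
}

impl TermAggCollector {
    fn collect(&mut self, parent_bucket: u32, term_ord: u64) {
        *self.counts.entry((parent_bucket, term_ord)).or_insert(0) += 1;
    }
}
```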
low card bucket optimization
* reduce dynamic dispatch, faster term agg
* use radix map, fix prepare_max_bucket
use paged term map in term agg
use special no sub agg term map impl
* specialize columntype in stats
* remove stacktrace bloat, use &mut helper
increase cache to 2048
* cleanup
remove clone
move data in term req, single doc opt for stats
* add comment
* share column block accessor
* simplify fetch block in column_block_accessor
* split subaggcache into two trait impls
* move partitions to heap
* fix name, add comment
---------
Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>