tantivy

mirror of https://github.com/quickwit-oss/tantivy.git synced 2026-07-04 08:10:42 +00:00

Author	SHA1	Message	Date
Pascal Seitz	1e859fd78d	fix term aggregation u32::MAX overflow issue	2026-06-18 17:07:43 +08:00
Pascal Seitz	c096b2ad89	aggregation/terms: charge fused term_counts to the memory limit term_counts (one u32/term) was allocated but not charged to AggregationLimitsGuard, so a memory limit could be exceeded silently. Charge it, skip allocating it when unbounded, and add a regression test.	2026-06-16 21:23:23 +08:00
Pascal Seitz	ac7a3d347c	add comment, hoist variables	2026-06-16 21:23:23 +08:00
Pascal Seitz	03520a0719	add top level comment	2026-06-16 21:23:23 +08:00
Pascal Seitz	86a4c47bed	merge loops, histo with bounds may benefit from single vec opt	2026-06-16 21:23:23 +08:00
Pascal Seitz	3ca510dff0	aggregation/terms: tidy fused term×histogram grid construction Rename the value threaded through build_segment_term_collector and maybe_build_collector from max_term_id to col_max_val/max_column_val — it is the column's max value, only later reused as the max term id. Make the grid-size arithmetic overflow-/zero-safe (saturating_add, checked_div).	2026-06-16 21:23:23 +08:00
Pascal Seitz	3cb400c300	clarify counts/term_counts field docs Spell out that `counts` is the flattened per-term × time-bucket grid (each term's own contiguous slice) and that `term_counts` is only needed when the per-term total can't be derived from that grid (i.e. with hard bounds).	2026-06-16 21:23:23 +08:00
Pascal Seitz	ef13489d63	skip hard_bounds that can't exclude any value When a histogram's hard_bounds are wider than the column's value range, the per-doc `bounds.contains` check can never fail. Collapse such bounds to the unbounded sentinel in `normalize_histogram_req`, so both the general histogram hot loop and the fused term×histogram path skip the check — the latter then derives per-term counts from the grid (the ~17% win) instead of falling back to per-doc counting just because `bounds != [MIN, MAX]`. Only the collect-time filter is affected: empty-bucket emission reads `req.hard_bounds` directly, and hard_bounds only ever clips that range, so a wider-than-data bound leaves results unchanged. Covered by new tests on the general and fused paths, including mid-interval (bucket-splitting) bounds. Also tighten the fused-path u32-overflow guard to bound on `num_vals()` (the per-value increment count) rather than `num_docs()`, and document why the fused collector's hot-loop fields are hoisted into locals (re-reading them from memory each iteration measured ~15% slower).	2026-06-16 21:23:23 +08:00
Pascal Seitz	9f7aea4765	derive term counts	2026-06-16 21:23:23 +08:00
Pascal Seitz	2c8536ab11	add specialized TermHistogram	2026-06-16 21:23:23 +08:00
Pascal Seitz	05f4c02ac5	add dense histogram, optional sub-buckets	2026-06-16 21:23:23 +08:00
Pascal Seitz	d137779219	add no sub-agg fastpath	2026-06-16 21:23:23 +08:00
Pascal Seitz	8f9846ac80	use get_range when possible	2026-06-16 21:23:23 +08:00
Mohammad Dashti	799f7b4646	Built SUM final result in each branch directly. Keeps the empty-bucket coercion visible at the boundary instead of a shared binding, following the reviewer's suggested shape.	2026-06-16 03:10:30 +08:00
Mohammad Dashti	fc88d80726	docs: drop downstream-specific name from none_if_no_match doc The flag's purpose is described well enough by "SQL-style consumers"; no need to call out a specific downstream.	2026-06-16 03:10:30 +08:00
Mohammad Dashti	6a684e7c38	feat: opt-in none_if_no_match flag on SumAggregation for SQL-style null Switch the default serialized output of `sum` on empty / all-missing buckets back to `"value": 0` to match Elasticsearch, and gate the SQL-style `"value": null` behavior behind a new `none_if_no_match: Option<bool>` flag on `SumAggregation`. `IntermediateSum::finalize` still returns `Option<f64>` internally so the Rust API stays parallel to min/max/avg, but the ES-vs-SQL choice is made at the boundary in `IntermediateMetricResult::into_final_metric_result`: `None` is coerced to `Some(0.0)` unless `none_if_no_match` is set on the aggregation request. Adds `AggregationVariants::as_sum()` accessor for that boundary check and two end-to-end tests covering both the default ES behavior and the opt-in null behavior on an empty index.	2026-06-16 03:10:30 +08:00
Mohammad Dashti	94fe52cc67	docs: clarify SUM finalize returning None diverges from Elasticsearch Surface the trade-off in the doc comment so future reviewers see why this differs from ES (which returns "value": 0 for sum over empty/all-missing buckets) and what consumers (ParadeDB SQL NULL) the None variant is meant to serve.	2026-06-16 03:10:30 +08:00
Mohammad Dashti	2ff39f6f7f	fix: return None from SUM when no values were collected IntermediateSum::finalize() returned Some(0.0) even when count==0 (all documents had missing/NULL values). This differs from MIN, MAX, and AVG which all return None for count==0. The 0.0 came from IntermediateStats' default sum initialization. Consumers (like ParadeDB) that map None to SQL NULL were incorrectly getting 0 for SUM on all-NULL groups. Fixes paradedb/paradedb#4621	2026-06-16 03:10:30 +08:00
Pascal Seitz	b19f0ddc77	fix clippy	2026-06-09 23:14:12 +08:00
Pascal Seitz	b4acfcf881	cleanup AggregationsSegmentCtx The metric/cardinality/histogram _mut getters had no callers needing mutation; their two uses already pass the resulting reference as &T. simplify req_data ownership: clone into collectors, Rc only for filter BitSet Replace Vec<Option<Box<T>>> + take/put-back round-trip with Vec<T> + direct clone into collector. Collectors now own their per-segment request data outright, removing the borrow-checker dance that the take/put-back pattern existed to satisfy. The structural clones are cheap (Column<u64> is Arc-internal) except for the filter aggregation, whose DocumentQueryEvaluator carries a precomputed per-segment BitSet sized by max_doc. Wrap that in Rc<DocumentQueryEvaluator> so FilterAggReqData::clone() bumps a refcount instead of duplicating the BitSet. Move SegmentFilterCollector's matching_docs_buffer out of FilterAggReqData so its pre-allocated capacity is preserved per collector instead of being lost on every clone.	2026-06-09 23:14:12 +08:00
Paul Masurel	46b3fb9ed3	Relying on upstream version of datasketch and stop using HLL 4. (#2936) We were relying on a fork for: a bugfix in LIST serialization; a better API exposing a new Coupon type, required for caching coupons. We also stop using HLL8 in the hope of fixing https://datadoghq.atlassian.net/browse/CLOUDPREM-625 Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>	2026-05-19 13:29:35 +02:00
Mohammad Dashti	d99a5d4e91	Rename validate_aggregation_fields to validate_aggregation_fields_exist Applies @PSeitz's review suggestion to make the function name more descriptive of what it checks. Also adds a doc note clarifying why validation is opt-in rather than enforced by default.	2026-05-16 15:45:20 +08:00
Mohammad Dashti	2de6f075ce	Fixed the example	2026-05-16 15:45:20 +08:00
Mohammad Dashti	18080067c7	Applied PR comment: I would move it outside of the aggregation. You can fetch the fields from the aggregation request and do a validation in a helper function	2026-05-16 15:45:20 +08:00
Mohammad Dashti	95db7d2e5c	Revert "Revert all impl." This reverts commit d5e0991549a05bf80f19f853f7689ad69f96e7e5.	2026-05-16 15:45:20 +08:00
Mohammad Dashti	fc017c4c74	Applied PR comments.	2026-05-16 15:45:20 +08:00
Mohammad Dashti	141c91d028	Added a flag: strict_validation	2026-05-16 15:45:20 +08:00
Mohammad Dashti	36a83e7c1a	Fixed agg validation	2026-05-16 15:45:20 +08:00
Pascal Seitz	edfb02b47e	switch to enum, fix mixed types for cardinality agg	2026-05-05 16:39:51 +08:00
Pascal Seitz	d0fad88bac	use bitsets for card agg	2026-05-05 16:39:51 +08:00
Pascal Seitz	d47abdf104	early cut off for order by sub agg in term agg	2026-04-28 16:59:59 +02:00
trinity-1686a	ca139d8eb1	Merge pull request #2910 from quickwit-oss/abdul.andha/composite-agg-after Composite aggregations: send after key on last page	2026-04-27 23:38:52 +02:00
Abdul Andha	ac508108aa	address pr comment	2026-04-27 12:39:38 -04:00
RJ Barman	73ad18fa1e	fix: Add space for missing sentinel in allowed bitset when a missing key is provided (#119) (#2907) ## Bug Overview Under certain conditions, a `terms` aggregation request can cause a bounds-check panic. Those conditions are: - The queried field must be a text field - There must be a segment where the number of distinct terms in its dictionary for the queried field is divisible by 64 (i.e. where `count(term_dict.keys) % 64 == 0`) - That same segment must contain at least one document that does not contain this field. - The request must contain a `missing` key that is a string. - The request must contain an `include` or `exclude` filter. For example: ```json { "my_bool": { "terms": { "field": "title", "include": "foo", "missing": "__NULL__" } } } ``` Check out the added tests in `src/aggregation/bucket/term_agg.rs` to see this in action ## How the bug happens ### Preparation While preparing the aggregation nodes: 1) When we've provided a `missing` key, we derive a missing sentinel. For string keys this is the column's max value (which for string keys is always the number of terms in this segment) + 1. 2) For string columns only, we optionally prep an "allowed" `BitSet` for allowed term ids. (`build_allowed_term_ids_for_str` in `src/aggregation/agg_data.rs`) - If no `include` or `exclude` filter is provided, this just returns `None`, causing this check to be skipped down the line - Otherwise the bitset is initialized to be able to hold the exact number of terms in the segment's term dictionary, and the bits are set to signify which terms are to be included in the results. ### Collection If we have an "allowed" `BitSet`, filter documents against that. For each document, we check if the `BitSet` contains the document's term id. For documents without the field, this is the missing sentinel we derived earlier, minus 1 (to account for zero-based indexing): `(num_terms + 1) - 1`. However, the `BitSet`'s size is only `num_terms`. Normally, this slips by without a problem, but if `num_terms % 64 == 0`, this will cause a panic. ### Why `BitSet` panics `BitSet` is represented under the hood by a boxed slice of `u64`s. When you go to check a bit using `BitSet::contains`, it must determine which of those `u64`s the bit is in, and then the position within that `u64` of the bit. In cases where the number of terms is not divisible by 64, the `BitSet` must waste some bits. When we then look up the missing sentinel's bit, it happens to be one of those wasted bits, whose value `BitSet` is happy to return. For example, if the number of terms was 63: ```rust let bitset_init_size = 63; // so BitSet's boxed slice has a length of 1, capable of holding 64 bits, term ids [0, 62] let missing_sentinel = 63; // num_terms + 1 - 1; let byte_pos = missing_sentinel / 64; // 0 - within the valid slice let bit_pos = missing_sentinel % 64; // 63 - hits the 1 wasted bit ``` But if the number of terms is indeed divisible by 64, then the `BitSet` is perfectly aligned to the byte boundary: ```rust let bitset_init_size = 64; // so BitSet's boxed slice has a length of 1, capable of holding 64 bits, term ids [0, 63] let missing_sentinel = 64; // num_terms + 1 - 1, let byte_pos = missing_sentinel / 64; // 1 - idx 1 >= slice length 1 let bit_pos = missing_sentinel % 64; // 0 ``` We try to access a byte outside of the bounds of the boxed slice, causing the bounds check to fail and panic. ## Fixing it The fix is simple. If we need to account for the missing sentinel, initialize the `BitSet` with capacity for one more bit. ## Tests - Added a bunch of unit tests that hit these conditions. I ensured they failed without the fix, and that they now pass. - All unit tests pass with the fix in place ## Other - The investigation that led to finding this bug began with https://github.com/paradedb/paradedb/issues/4746.	2026-04-25 14:11:47 +02:00
Abdul Andha	4fbae92187	send after key on last page	2026-04-24 15:33:26 -04:00
Pascal Seitz	2e16243f9a	fix memory consumption for histogram	2026-04-21 13:58:39 +02:00
Pascal Seitz	73c711ec74	perf(agg): only measure active parent bucket in composite collect Same change as `26a589e` for SegmentCompositeCollector: get_memory_consumption summed across all parent_buckets on every block, scaling with outer bucket cardinality. Pass parent_bucket_id and index the single bucket.	2026-04-21 07:26:58 +02:00
Pascal Seitz	cb037c8079	add inline	2026-04-21 07:26:58 +02:00
Pascal Seitz	ed3453606b	agg fix: compute memory consumption only for current bucket	2026-04-21 07:26:58 +02:00
Paul Masurel	58aa4b7074	Fix cardinality aggregation using invalid coupons (#2893) Previously, coupons were computed via murmurhash32 and fed as raw u32 to the HLL sketch, bypassing the sketch's internal hashing and producing invalid (slot, value) pairs. Switch to Coupon::from_hash from the datasketches crate which correctly derives coupons, and drop the now-unused murmurhash32 dependency. Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 19:14:30 +02:00
Paul Masurel	04beab3b29	Performance improvement for nested cardinality aggregation When a string cardinality aggregation is nested, it ends up being applied to different buckets. Dictionary encoding relies on a different dictionary for each segment. As a result, during segment collection, we only collect term ordinals in a HashSet, and decode them in the term dictionary at the end of collection. Before this PR, this decoding phase was done once for each bucket, causing the same work to be done over and over. This PR introduces a coupon cache. The HLL sketch relies on a hash of the string values. We populate the cache before bucket collection, and get our values from it. This PR also renames "caching" to "buffering" in aggregation (it was never caching), and does several cleanups.	2026-04-10 14:51:00 +02:00
alexanderbianchi	3cd9011f87	Make BucketEntries::iter, PercentileValuesVecEntry fields, and TopNComputer::threshold public (#2890) These items need to be accessible from the tantivy-datafusion crate: - BucketEntries::iter() for iterating aggregation bucket results - PercentileValuesVecEntry.key/.value for reading percentile results - TopNComputer.threshold for Block-WAND score pruning in the inverted index provider Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Paul Masurel <paul@quickwit.io>	2026-04-09 13:32:31 +02:00
Charlie Tonneslan	a9535156b1	Fix clippy warnings: deprecated gen_range, manual div_ceil, legacy import (#2860) - Replace deprecated rand::Rng::gen_range with random_range in benchmarks - Use usize::div_ceil instead of manual (len + size - 1) / size - Remove unused legacy std::i64 import - Replace 'if let Some(_)' with '.is_some()'	2026-03-26 07:37:26 -04:00
nuri	3859cc8699	fix: deduplicate doc counts in term aggregation for multi-valued fields (#2854) * fix: deduplicate doc counts in term aggregation for multi-valued fields Term aggregation was counting term occurrences instead of documents for multi-valued fields. A document with the same value appearing multiple times would inflate doc_count. Add `fetch_block_with_missing_unique_per_doc` to ColumnBlockAccessor that deduplicates (doc_id, value) pairs, and use it in term aggregation. Fixes #2721 * refactor: only deduplicate for multivalue cardinality Duplicates can only occur with multivalue columns, so narrow the check from !is_full() to is_multivalue(). * fix: handle non-consecutive duplicate values in dedup Sort values within each doc_id group before deduplicating, so that non-adjacent duplicates are correctly handled. Add unit tests for dedup_docid_val_pairs: consecutive duplicates, non-consecutive duplicates, multi-doc groups, no duplicates, and single element. * perf: skip dedup when block has no multivalue entries Add early return when no consecutive doc_ids are equal, avoiding unnecessary sort and dedup passes. Remove the 2-element swap optimization as it is not needed by the dedup algorithm. --------- Co-authored-by: nryoo <nryoo@nryooui-MacBookPro.local>	2026-03-24 02:02:30 +01:00
Paul Masurel	545169c0d8	Composite agg merge (#2856) Add composite aggregation Co-authored-by: Remi Dettai <remi.dettai@sekoia.io> Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>	2026-03-18 17:28:59 +01:00
cong.xie	18fedd9384	Fix nightly fmt: merge crate imports in percentiles tests Co-authored-by: Cursor <cursoragent@cursor.com>	2026-02-19 14:18:54 -05:00
cong.xie	2098fca47f	Restore use_serde feature and simplify PercentilesCollector Keep use_serde on sketches-ddsketch so DDSketch derives Serialize/Deserialize, removing the need for custom impls on PercentilesCollector. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-02-19 14:13:17 -05:00
cong.xie	1251b40c93	Drop use_serde feature; use Java binary encoding for PercentilesCollector Replace the derived Serialize/Deserialize on PercentilesCollector with custom impls that use DDSketch's Java-compatible binary encoding (encode_to_java_bytes / decode_from_java_bytes). This removes the need for the use_serde feature on sketches-ddsketch entirely. Also restore original float test values and use assert_nearly_equals! for all float comparisons in percentile tests, since DDSketch quantile estimates can have minor precision differences across platforms. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-02-19 13:32:28 -05:00
cong.xie	09a49b872c	Use assert_nearly_equals! for float comparisons in percentile test Address review feedback: replace assert_eq! with assert_nearly_equals! for float values that go through JSON serialization roundtrips, which can introduce minor precision differences. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-02-19 13:21:48 -05:00
cong.xie	cf760fd5b6	fix: remove internal reference from code comment Co-authored-by: Cursor <cursoragent@cursor.com>	2026-02-18 12:59:25 -05:00
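
The off-by-one described in commit 73ad18fa1e can be reproduced outside tantivy with a few lines of Rust. The sketch below is only an illustration: `TinyBitSet`, `checked_contains`, and `lookup_missing` are hypothetical names, not tantivy's `BitSet` API, and the lookup returns `None` at the point where a plain slice index would panic.

```rust
/// Hypothetical stand-in for a word-backed bitset; NOT tantivy's `BitSet`.
struct TinyBitSet {
    words: Box<[u64]>,
}

impl TinyBitSet {
    /// Allocates just enough 64-bit words to hold bits `0..num_bits`.
    fn with_num_bits(num_bits: u32) -> Self {
        let num_words = num_bits.div_ceil(64) as usize;
        TinyBitSet {
            words: vec![0u64; num_words].into_boxed_slice(),
        }
    }

    fn insert(&mut self, bit: u32) {
        self.words[(bit / 64) as usize] |= 1u64 << (bit % 64);
    }

    /// Returns `None` where an unchecked `self.words[idx]` would panic.
    fn checked_contains(&self, bit: u32) -> Option<bool> {
        let word = self.words.get((bit / 64) as usize)?;
        Some(word & (1u64 << (bit % 64)) != 0)
    }
}

/// Looks up the bit used for documents without the field (index `num_terms`).
/// `reserve_missing_slot` models the fix: one extra bit of capacity.
fn lookup_missing(num_terms: u32, reserve_missing_slot: bool) -> Option<bool> {
    let num_bits = if reserve_missing_slot { num_terms + 1 } else { num_terms };
    let mut allowed = TinyBitSet::with_num_bits(num_bits);
    if num_terms > 0 {
        allowed.insert(0); // pretend term id 0 matched the `include` filter
    }
    allowed.checked_contains(num_terms)
}

fn main() {
    // 63 terms: the sentinel lands on the one wasted bit of the single word.
    println!("{:?}", lookup_missing(63, false));
    // 64 terms: the sentinel needs a second word that was never allocated.
    println!("{:?}", lookup_missing(64, false));
    // With the extra slot reserved, the lookup is in bounds again.
    println!("{:?}", lookup_missing(64, true));
}
```

The middle case is the one the commit message calls out: only when the term count is a multiple of 64 does the sentinel fall past the last allocated word.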


285 Commits
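
Commit 3859cc8699 describes deduplicating (doc_id, value) pairs in three steps: return early when no two adjacent entries share a doc_id, sort the values inside each doc_id group, then drop adjacent duplicates. A minimal sketch of that sequence follows. The function name `dedup_pairs_sketch`, the `(u32, u64)` pair type, and the assumption that the input is already grouped by doc_id are illustrative choices, not the signature of tantivy's `dedup_docid_val_pairs`.

```rust
/// Removes duplicate (doc_id, value) pairs in place.
/// Assumes (for this sketch) that `pairs` is already grouped by doc_id.
fn dedup_pairs_sketch(pairs: &mut Vec<(u32, u64)>) {
    // Fast path: if no two neighbours share a doc_id, every doc has at most
    // one value in this block, so there is nothing to deduplicate.
    if pairs.windows(2).all(|w| w[0].0 != w[1].0) {
        return;
    }

    // Sort values within each doc_id group so that non-adjacent duplicates
    // (e.g. 5, 3, 5) become adjacent (3, 5, 5).
    let mut start = 0;
    while start < pairs.len() {
        let doc = pairs[start].0;
        let mut end = start + 1;
        while end < pairs.len() && pairs[end].0 == doc {
            end += 1;
        }
        pairs[start..end].sort_unstable_by_key(|&(_, value)| value);
        start = end;
    }

    // Adjacent equal pairs are now true duplicates; keep one of each.
    pairs.dedup();
}

fn main() {
    let mut pairs = vec![(1, 5), (1, 3), (1, 5), (2, 3), (2, 3), (3, 7)];
    dedup_pairs_sketch(&mut pairs);
    println!("{:?}", pairs);
}
```

After this pass each document contributes a given value at most once, which is what keeps `doc_count` from being inflated on multi-valued fields.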