The early return for `scorers.len() == 1` in `scorer_union` short-circuits a single TermScorer into `SpecializedScorer::Other`, bypassing the `TermUnion` path that enables block-max WAND (BMW) in `for_each_pruning`.
This was originally addressed in PR #2898 (backed out), which added a special case in `BooleanWeight::for_each_pruning`. PR #2912 (merged as d27ca164a) added a single-scorer fast path inside `block_wand` itself, but did not remove this early return — so a single SHOULD TermScorer still never reaches the BMW path.
Removing the early return lets a single TermScorer with freq reading flow through to `SpecializedScorer::TermUnion`, where `block_wand` → `block_wand_single_scorer` handles it efficiently.
* Optimizing top K using Adrien Grand's ideas
https://jpountz.github.io/2025/08/28/compiled-vs-vectorized-search-engine-edition.html
* Suffix-sum pruning for multi-term intersection candidates
After scoring each secondary in Phase 2, check whether remaining
secondaries' block_max scores can still beat the threshold. Skip
to the next candidate early if impossible, avoiding expensive seeks
into later secondaries.
Improves three-term intersection by ~8% on the balanced benchmark
while keeping two-term performance neutral.
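A rough sketch of the pruning check, assuming the secondaries' block maxima are available up front (function and variable names here are illustrative, not the actual tantivy code):
```rust
// Illustrative sketch of the suffix-sum pruning check. `block_max[i]` is the
// block-max score of the i-th secondary; `suffix_max[i]` is the most the
// secondaries i.. can still add to the candidate's score.
fn score_candidate(
    lead_score: f32,
    secondary_scores: &[f32], // looked up lazily via seeks in the real code
    block_max: &[f32],
    threshold: f32,
) -> Option<f32> {
    debug_assert_eq!(secondary_scores.len(), block_max.len());
    let n = block_max.len();
    // suffix_max[i] = block_max[i] + block_max[i+1] + ... + block_max[n-1]
    let mut suffix_max = vec![0.0f32; n + 1];
    for i in (0..n).rev() {
        suffix_max[i] = suffix_max[i + 1] + block_max[i];
    }

    let mut score = lead_score;
    for i in 0..n {
        // If even the block maxima of the remaining secondaries cannot push this
        // candidate past the threshold, abandon it before the next (expensive) seek.
        if score + suffix_max[i] <= threshold {
            return None;
        }
        score += secondary_scores[i];
    }
    (score > threshold).then_some(score)
}
```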
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Claude CR comment
* Removed 16 term scorer limit.
---------
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a document has the exact registered facet path (not a child),
compute_collapse_mapping_one maps it to a sentinel (u64::MAX, 0).
Without filtering, harvest() passes u64::MAX to ord_to_term which
resolves to the last dictionary entry, producing a spurious facet
from an unrelated branch.
Skip entries where facet_ord == u64::MAX in harvest().
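A minimal, self-contained sketch of the filtering idea (the function and parameter names are stand-ins, not the actual harvest() signature):
```rust
// Stand-in for harvest(): drop sentinel entries before resolving ordinals,
// so u64::MAX is never handed to the dictionary lookup.
fn harvest_counts(
    collapsed_counts: &[(u64, u64)],     // (facet_ord, count) pairs
    ord_to_term: impl Fn(u64) -> String, // term dictionary lookup stand-in
) -> Vec<(String, u64)> {
    collapsed_counts
        .iter()
        .filter(|&&(facet_ord, _)| facet_ord != u64::MAX) // exact-path sentinel
        .map(|&(facet_ord, count)| (ord_to_term(facet_ord), count))
        .collect()
}
```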
Closes #2494
Signed-off-by: majiayu000 <1835304752@qq.com>
## Bug Overview
Under certain conditions, a `terms` aggregation request can cause a
bounds-check panic. Those conditions are:
- The queried field must be a text field
- There must be a segment where the number of distinct terms in its
dictionary for the queried field is divisible by 64 (i.e. where
`count(term_dict.keys) % 64 == 0`)
- That same segment must contain at least one document that does not
contain this field.
- The request must contain a `missing` key that is a string.
- The request must contain an `include` or `exclude` filter.
For example:
```json
{
  "my_bool": {
    "terms": {
      "field": "title",
      "include": "foo",
      "missing": "__NULL__"
    }
  }
}
```
Check out the added tests in `src/aggregation/bucket/term_agg.rs` to see
this in action.
## How the bug happens
### Preparation
While preparing the aggregation nodes:
1) When a `missing` key is provided, we derive a missing sentinel.
For string keys this is the column's max value (which for string
columns is always the number of terms in the segment) + 1.
2) For string columns only, we optionally prepare an "allowed" `BitSet`
of allowed term ids (`build_allowed_term_ids_for_str` in
`src/aggregation/agg_data.rs`).
- If no `include` or `exclude` filter is provided, this just returns
`None`, causing this check to be skipped down the line
- Otherwise the bitset is initialized to hold exactly the number of
terms in the segment's term dictionary, and the bits are set to
signify which terms are to be included in the results.
### Collection
If we have an "allowed" `BitSet`, we filter documents against it. For
each document, we check whether the `BitSet` contains the document's
term id. For documents without the field, this is the missing sentinel
we derived earlier, minus 1 (to account for zero-based indexing):
`(num_terms + 1) - 1`. However, the `BitSet`'s size is only `num_terms`.
Normally, this slips by without a problem, but if `num_terms % 64 == 0`,
this will cause a panic.
### Why `BitSet` panics
`BitSet` is represented under the hood by a boxed slice of `u64`s. When
you go to check a bit using `BitSet::contains`, it must determine which
of those `u64`s the bit is in, and then the position within that `u64`
of the bit.
In cases where the number of terms is not divisible by 64, the `BitSet`
must waste some bits. When we then look up the missing sentinel's bit,
it happens to be one of those wasted bits, whose value `BitSet` is happy
to return. For example, if the number of terms were 63:
```rust
let bitset_init_size = 63; // so BitSet's boxed slice has a length of 1, capable of holding 64 bits, term id [0, 62]
let missing_sentinel = 63; // num_terms + 1 - 1;
let word_pos = missing_sentinel / 64; // 0 - within the valid slice
let bit_pos = missing_sentinel % 64; // 63 - hits the 1 wasted bit
```
But if the number of terms is indeed divisible by 64, then the `BitSet`
is perfectly aligned to the u64 word boundary, with no wasted bits:
```rust
let bitset_init_size = 64; // so BitSet's boxed slice has a length of 1, capable of holding 64 bits, term ids [0, 63]
let missing_sentinel = 64; // num_terms + 1 - 1,
let word_pos = missing_sentinel / 64; // 1 - idx 1 >= slice length 1
let bit_pos = missing_sentinel % 64; // 0
```
We try to access a word outside the bounds of the boxed slice, and the
bounds check panics.
## Fixing it
The fix is simple. If we need to account for the missing sentinel,
initialize the `BitSet` with capacity for one more bit.
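A toy illustration of the sizing change, using plain arithmetic rather than the actual `BitSet` API:
```rust
// Number of u64 words needed to hold `num_bits` bits.
fn words_needed(num_bits: usize) -> usize {
    (num_bits + 63) / 64
}

fn main() {
    let num_terms = 64;
    let missing_sentinel = num_terms; // (num_terms + 1) - 1, zero-based

    // Before the fix: sized for `num_terms` bits only.
    assert_eq!(words_needed(num_terms), 1);
    assert_eq!(missing_sentinel / 64, 1); // word index 1 >= slice length 1 -> panic

    // After the fix: one extra bit of capacity covers the sentinel.
    assert_eq!(words_needed(num_terms + 1), 2);
    assert!(missing_sentinel / 64 < words_needed(num_terms + 1)); // in bounds
}
```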
## Tests
- Added a bunch of unit tests that hit these conditions. I ensured they
failed without the fix, and that they now pass.
- All unit tests pass with the fix in place
## Other
- The investigation that led to finding this bug began with
https://github.com/paradedb/paradedb/issues/4746.
Same change as 26a589e, applied to SegmentCompositeCollector: get_memory_consumption
summed across all parent_buckets on every block, so its cost scaled with outer bucket
cardinality. Now pass parent_bucket_id and index only that single bucket.
Previously, coupons were computed via murmurhash32 and fed as raw u32
to the HLL sketch, bypassing the sketch's internal hashing and producing
invalid (slot, value) pairs. Switch to Coupon::from_hash from the
datasketches crate which correctly derives coupons, and drop the
now-unused murmurhash32 dependency.
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a string cardinality aggregation is nested, it ends up being applied to different buckets.
Dictionary encoding relies on a different dictionary for each segment.
As a result, during segment collection, we only collect term ordinals in a HashSet, and decode them in the
term dictionary at the end of collection.
Before this PR, this decoding phase was done once for each bucket, causing the same work to be done over and over. This PR introduces a coupon cache. The HLL sketch relies on a hash of the string values.
We populate the cache before bucket collection, and get our values from it.
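Roughly, the cache looks like this (a simplified sketch with made-up names, using std's hasher as a stand-in for the hash the sketch actually consumes):
```rust
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// One dictionary decode + hash per distinct term ordinal, done once per segment;
// every bucket then reads the cached hash instead of repeating the decode.
struct CouponCache {
    hash_by_term_ord: HashMap<u64, u64>,
}

fn hash64(bytes: &[u8]) -> u64 {
    let mut hasher = std::collections::hash_map::DefaultHasher::new();
    bytes.hash(&mut hasher);
    hasher.finish()
}

impl CouponCache {
    fn populate(term_ords: &[u64], decode_term: impl Fn(u64) -> String) -> Self {
        let hash_by_term_ord = term_ords
            .iter()
            .map(|&ord| (ord, hash64(decode_term(ord).as_bytes())))
            .collect();
        CouponCache { hash_by_term_ord }
    }

    fn hash(&self, term_ord: u64) -> Option<u64> {
        self.hash_by_term_ord.get(&term_ord).copied()
    }
}
```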
This PR also renames "caching" to "buffering" in aggregation (it was never caching), and does several cleanups.
These items need to be accessible from the tantivy-datafusion crate:
- BucketEntries::iter() for iterating aggregation bucket results
- PercentileValuesVecEntry.key/.value for reading percentile results
- TopNComputer.threshold for Block-WAND score pruning in the inverted index provider
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Paul Masurel <paul@quickwit.io>
* Use BinaryHeap for score-based top-K collection
* Use peek_mut and add proptest for TopNHeap
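The pattern, roughly (names and types are illustrative; the real TopNHeap stores score/doc pairs):
```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// A min-heap of at most K entries: once full, `peek_mut` replaces the current
// minimum in place (it sifts down when dropped) instead of a pop + push.
fn top_k(scores: impl IntoIterator<Item = u64>, k: usize) -> Vec<u64> {
    let mut heap: BinaryHeap<Reverse<u64>> = BinaryHeap::with_capacity(k);
    for score in scores {
        if heap.len() < k {
            heap.push(Reverse(score));
        } else if let Some(mut min) = heap.peek_mut() {
            if score > min.0 {
                *min = Reverse(score);
            }
        }
    }
    // Largest scores first.
    heap.into_sorted_vec().into_iter().map(|r| r.0).collect()
}
```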
---------
Co-authored-by: nryoo <nryoo@nryooui-MacBookPro.local>
* fix: deduplicate doc counts in term aggregation for multi-valued fields
Term aggregation was counting term occurrences instead of documents
for multi-valued fields. A document with the same value appearing
multiple times would inflate doc_count.
Add `fetch_block_with_missing_unique_per_doc` to ColumnBlockAccessor
that deduplicates (doc_id, value) pairs, and use it in term aggregation.
Fixes #2721
* refactor: only deduplicate for multivalue cardinality
Duplicates can only occur with multivalue columns, so narrow the
check from !is_full() to is_multivalue().
* fix: handle non-consecutive duplicate values in dedup
Sort values within each doc_id group before deduplicating, so that
non-adjacent duplicates are correctly handled.
Add unit tests for dedup_docid_val_pairs: consecutive duplicates,
non-consecutive duplicates, multi-doc groups, no duplicates, and
single element.
* perf: skip dedup when block has no multivalue entries
Add early return when no consecutive doc_ids are equal, avoiding
unnecessary sort and dedup passes. Remove the 2-element swap
optimization as it is not needed by the dedup algorithm.
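Putting the pieces together, the dedup step looks roughly like this (an illustrative stand-in, not the actual ColumnBlockAccessor code):
```rust
// Within each doc_id group, sort the values and drop repeated (doc_id, value)
// pairs, bailing out early when no doc_id appears twice (i.e. the block has no
// multi-valued entries).
fn dedup_docid_val_pairs(pairs: &mut Vec<(u32, u64)>) {
    // Early return: no two consecutive doc_ids are equal -> nothing to dedup.
    if pairs.windows(2).all(|w| w[0].0 != w[1].0) {
        return;
    }
    // Pairs arrive grouped by doc_id; sorting values inside each group makes
    // non-adjacent duplicates adjacent so `dedup` can remove them.
    let mut start = 0;
    while start < pairs.len() {
        let doc = pairs[start].0;
        let mut end = start + 1;
        while end < pairs.len() && pairs[end].0 == doc {
            end += 1;
        }
        pairs[start..end].sort_unstable_by_key(|&(_, val)| val);
        start = end;
    }
    pairs.dedup();
}
```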
---------
Co-authored-by: nryoo <nryoo@nryooui-MacBookPro.local>
Keep use_serde on sketches-ddsketch so DDSketch derives
Serialize/Deserialize, removing the need for custom impls
on PercentilesCollector.
Co-authored-by: Cursor <cursoragent@cursor.com>
Replace the derived Serialize/Deserialize on PercentilesCollector with
custom impls that use DDSketch's Java-compatible binary encoding
(encode_to_java_bytes / decode_from_java_bytes). This removes the need
for the use_serde feature on sketches-ddsketch entirely.
Also restore original float test values and use assert_nearly_equals!
for all float comparisons in percentile tests, since DDSketch quantile
estimates can have minor precision differences across platforms.
Co-authored-by: Cursor <cursoragent@cursor.com>
Address review feedback: replace assert_eq! with assert_nearly_equals!
for float values that go through JSON serialization roundtrips, which
can introduce minor precision differences.
Co-authored-by: Cursor <cursoragent@cursor.com>
Fork sketches-ddsketch as a workspace member to add native Java binary
serialization (to_java_bytes/from_java_bytes) for DDSketch. This enables
pomsky to return raw DDSketch bytes that event-query can deserialize via
DDSketchWithExactSummaryStatistics.decode().
Key changes:
- Vendor sketches-ddsketch crate with encoding.rs implementing VarEncoding,
flag bytes, and INDEX_DELTAS_AND_COUNTS store format
- Align Config::key() to floor-based indexing matching Java's LogarithmicMapping
- Add PercentilesCollector::to_sketch_bytes() for pomsky integration
- Cross-language golden byte tests verified byte-identical with Java output
Co-authored-by: Cursor <cursoragent@cursor.com>
Switch tantivy's cardinality aggregation from the hyperloglogplus crate
(HyperLogLog++ with p=16) to the official Apache DataSketches HLL
implementation (datasketches crate v0.2.0 with lg_k=11, Hll4).
This enables returning raw HLL sketch bytes from pomsky to Datadog's
event query, where they can be properly deserialized and merged using
the same DataSketches library (Java). The previous implementation
required pomsky to fabricate fake HLL sketches from scalar cardinality
estimates, which produced incorrect results when merged.
Changes:
- Cargo.toml: hyperloglogplus 0.4.1 -> datasketches 0.2.0
- CardinalityCollector: HyperLogLogPlus<u64, BuildSaltedHasher> -> HllSketch
- Custom Serde impl using HllSketch binary format (cross-shard compat)
- New to_sketch_bytes() for external consumers (pomsky)
- Salt preserved via (salt, value) tuple hashing for column type disambiguation
- Removed BuildSaltedHasher struct
- Added 4 new unit tests (serde roundtrip, merge, binary compat, salt)
Otherwise, there is no way to access these fields when not using the
JSON-serialized form of the aggregation results.
This simple data struct is part of the public API,
so its fields should be accessible as well.
## What
Enable range queries and TopN sorting on `Bytes` fast fields, bringing them to parity with `Str` fields.
## Why
`BytesColumn` uses the same dictionary encoding as `StrColumn` internally, but range queries and TopN sorting were explicitly disabled for `Bytes`. This prevented use cases like storing lexicographically sortable binary data (e.g., arbitrary-precision decimals) that need efficient range filtering.
## How
1. **Enable range queries for Bytes** - Changed `is_type_valid_for_fastfield_range_query()` to return `true` for `Type::Bytes`
2. **Add BytesColumn handling in scorer** - Added a branch in `FastFieldRangeWeight::scorer()` to handle bytes fields using dictionary ordinal lookup (mirrors the existing `StrColumn` logic)
3. **Add SortByBytes** - New sort key computer for TopN queries on bytes columns
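The dictionary-ordinal lookup in step 2 can be sketched as follows (a simplified, self-contained illustration; the real code resolves bounds against the segment's term dictionary and handles inclusive/exclusive and unbounded ends):
```rust
// Translate byte-range bounds into a term-ordinal range once; documents are then
// matched by comparing their stored ordinal against this range, exactly as for
// Str columns. (Inclusive bounds shown; a sorted slice stands in for the dictionary.)
fn ord_range_for_bytes(sorted_terms: &[&[u8]], lower: &[u8], upper: &[u8]) -> std::ops::Range<u64> {
    let start = sorted_terms.partition_point(|t| *t < lower) as u64;
    let end = sorted_terms.partition_point(|t| *t <= upper) as u64;
    start..end
}
```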
## Tests
- `test_bytes_field_ff_range_query` - Tests inclusive/exclusive bounds and unbounded ranges
- `test_sort_by_bytes_asc` / `test_sort_by_bytes_desc` - Tests lexicographic ordering in both directions
* faster exclude queries
Faster exclude queries with multiple terms.
Changes `Exclude` to be able to exclude multiple DocSets, instead of
putting the docsets into a union.
Use `seek_danger` in `Exclude`.
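A toy sketch of the multi-DocSet exclusion (the real code seeks DocSets, using `seek_danger`, rather than binary-searching slices):
```rust
// A candidate doc survives only if none of the excluded sets contains it,
// checked set by set instead of first merging the excluded sets into a union.
fn is_excluded(doc: u32, excluded_sets: &[&[u32]]) -> bool {
    excluded_sets
        .iter()
        .any(|set| set.binary_search(&doc).is_ok())
}
```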
closes #2822
* replace unwrap with match
The intersection algorithm made it possible for .seek(..) to be called with values
lower than the current doc id, breaking the DocSet contract.
The fix removes the optimization that caused left.seek(..) to be replaced
by a simpler left.advance(..).
Simply doing so led to a performance regression.
I therefore integrated that idea within SegmentPostings.seek.
We now attempt to check the next doc systematically on seek,
PROVIDED the block is already loaded.
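A toy, self-contained sketch of that seek fast path (the real SegmentPostings works on compressed blocks and skip data; the names here are illustrative):
```rust
struct ToyPostings {
    block: Vec<u32>, // docs of the currently loaded block, sorted ascending
    cursor: usize,
}

impl ToyPostings {
    fn doc(&self) -> u32 {
        self.block.get(self.cursor).copied().unwrap_or(u32::MAX) // MAX ~ exhausted
    }

    fn seek(&mut self, target: u32) -> u32 {
        // DocSet contract: seeking to a target at or before the current doc must
        // never move the cursor backwards.
        if self.doc() >= target {
            return self.doc();
        }
        // Fast path: the block is already decoded and the very next doc reaches
        // the target, so a single step forward is enough.
        if self.cursor + 1 < self.block.len() && self.block[self.cursor + 1] >= target {
            self.cursor += 1;
            return self.doc();
        }
        // Fallback: regular search within (and, in the real code, across) blocks.
        self.cursor += self.block[self.cursor..].partition_point(|&d| d < target);
        self.doc()
    }
}
```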
Closes #2811
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>