Commit Graph

3604 Commits

Author SHA1 Message Date
trinity.pointard
73547027e7 Merge branch 'main' into trinity.pointard/multi-terms 2026-07-03 16:29:39 +00:00
Pascal Seitz
6b8bd7b884 reorder if block 2026-07-03 17:44:44 +02:00
Pascal Seitz
9db05b660e fix overflow issue 2026-07-03 17:44:44 +02:00
Pascal Seitz
057e9d6618 make BucketId optional in aggregations
If term aggregations don't have sub-aggregations, we don't need to carry
BucketId(u32).

```
full
terms_7                                               Memory: 46.5 KB             Avg: 2.4024ms (+1.96%)      Median: 2.4024ms (+1.96%)      [2.4024ms .. 2.4024ms]
terms_all_unique                                      Memory: 11.5 MB (-9.93%)    Avg: 4.7910ms (-22.62%)     Median: 4.7910ms (-22.62%)     [4.7910ms .. 4.7910ms]
terms_all_unique_order_by_key                         Memory: 11.5 MB (-9.94%)    Avg: 4.8056ms (-21.31%)     Median: 4.8056ms (-21.31%)     [4.8056ms .. 4.8056ms]
terms_150_000                                         Memory: 2.7 MB (-9.90%)     Avg: 5.7312ms (-5.96%)      Median: 5.7312ms (-5.96%)      [5.7312ms .. 5.7312ms]
terms_many_top_1000                                   Memory: 5.3 MB              Avg: 8.5912ms (-3.34%)      Median: 8.5912ms (-3.34%)      [8.5912ms .. 8.5912ms]
terms_many_order_by_term                              Memory: 2.7 MB (-9.95%)     Avg: 4.5581ms (-10.79%)     Median: 4.5581ms (-10.79%)     [4.5581ms .. 4.5581ms]
terms_many_with_top_hits                              Memory: 48.9 MB             Avg: 102.3672ms (-2.30%)    Median: 102.3672ms (-2.30%)    [102.3672ms .. 102.3672ms]
terms_all_unique_with_avg_sub_agg                     Memory: 54.7 MB             Avg: 18.4644ms (-0.11%)     Median: 18.4644ms (-0.11%)     [18.4644ms .. 18.4644ms]
terms_many_with_avg_sub_agg                           Memory: 13.5 MB             Avg: 16.7345ms (-4.09%)     Median: 16.7345ms (-4.09%)     [16.7345ms .. 16.7345ms]
terms_status_with_avg_sub_agg                         Memory: 92.1 KB             Avg: 5.1590ms (+0.33%)      Median: 5.1590ms (+0.33%)      [5.1590ms .. 5.1590ms]
terms_status_with_terms_zipf_1000_sub_agg             Memory: 213.2 KB            Avg: 3.9562ms (+1.49%)      Median: 3.9562ms (+1.49%)      [3.9562ms .. 3.9562ms]
terms_zipf_1000_with_terms_status_sub_agg             Memory: 724.4 KB            Avg: 11.7206ms (-4.68%)     Median: 11.7206ms (-4.68%)     [11.7206ms .. 11.7206ms]
terms_status_with_histogram                           Memory: 139.0 KB            Avg: 2.3985ms (-1.07%)      Median: 2.3985ms (-1.07%)      [2.3985ms .. 2.3985ms]
terms_status_with_date_histogram                      Memory: 145.9 KB            Avg: 2.2963ms (-1.86%)      Median: 2.2963ms (-1.86%)      [2.2963ms .. 2.2963ms]
terms_status_with_date_histogram_hard_bounds          Memory: 145.4 KB            Avg: 2.4312ms (-1.37%)      Median: 2.4312ms (-1.37%)      [2.4312ms .. 2.4312ms]
terms_status_with_date_histogram_and_sibling_terms    Memory: 143.3 KB            Avg: 3.8401ms (+0.44%)      Median: 3.8401ms (+0.44%)      [3.8401ms .. 3.8401ms]
terms_zipf_1000                                       Memory: 75.8 KB             Avg: 2.1611ms (-2.70%)      Median: 2.1611ms (-2.70%)      [2.1611ms .. 2.1611ms]
terms_zipf_1000_with_histogram                        Memory: 1.2 MB              Avg: 20.1369ms (-1.16%)     Median: 20.1369ms (-1.16%)     [20.1369ms .. 20.1369ms]
terms_zipf_1000_with_avg_sub_agg                      Memory: 472.7 KB            Avg: 8.5851ms (-4.93%)      Median: 8.5851ms (-4.93%)      [8.5851ms .. 8.5851ms]
terms_many_json_mixed_type_with_avg_sub_agg           Memory: 17.9 MB             Avg: 26.3988ms (-6.61%)     Median: 26.3988ms (-6.61%)     [26.3988ms .. 26.3988ms]
```
2026-07-03 17:44:44 +02:00
trinity.pointard
911fe1f883 more codex review 2026-07-03 15:39:32 +00:00
trinity.pointard
be7f09a0b2 fix handling of bytes for cardinality agg 2026-07-03 14:56:33 +00:00
trinity.pointard
ba1a979824 codex review 2026-07-03 13:33:23 +00:00
Pascal Seitz
95172089e7 cleanup benchmark 2026-07-03 13:42:33 +02:00
Pascal Seitz
f313f22df7 Speed up range-query intersections via seek_danger on RangeDocSet (up to ~50x faster)
A regular seek on RangeDocSet is costly: on a miss it fetches blocks and
scans the column forward to materialize the next matching doc. As a
non-leading docset in an intersection that work is wasted — the driver only
asks "does this candidate match?". seek_danger answers that with a cheap
point lookup via Column::values_for_doc, returning a lower bound on a miss
and leaving forward progress to the caller.

Forward seek_danger through ConstScorer.

Benchmarks (bool_queries_with_range, _all_results / DocSetCollector):

```
dense and 0.1% a
a_AND_num_rand:[0_TO_9]_all_results                                 Avg: 0.0827ms (-4.60%)     Median: 0.0825ms (-4.82%)     [0.0809ms .. 0.0891ms]    Output: 43
a_AND_num_asc:[0_TO_9]_all_results                                  Avg: 0.1937ms (-3.70%)     Median: 0.1930ms (-3.59%)     [0.1806ms .. 0.2044ms]    Output: 100
a_AND_num_rand_fast:[0_TO_9]_all_results                            Avg: 0.0367ms (-92.67%)    Median: 0.0365ms (-92.65%)    [0.0340ms .. 0.0398ms]    Output: 43
a_AND_num_asc_fast:[0_TO_9]_all_results                             Avg: 0.1052ms (-98.05%)    Median: 0.1050ms (-97.98%)    [0.1009ms .. 0.1117ms]    Output: 100
num_rand_fast:[0_TO_9]_AND_num_asc_fast:[0_TO_9]_all_results        Avg: 2.7147ms (-51.42%)    Median: 2.7075ms (-49.58%)    [2.6806ms .. 2.7799ms]    Output: 968
dense and 1% a
a_AND_num_rand:[0_TO_9]_all_results                                 Avg: 0.4373ms (-9.71%)     Median: 0.4357ms (-10.12%)    [0.4117ms .. 0.4711ms]    Output: 463
a_AND_num_asc:[0_TO_9]_all_results                                  Avg: 0.2342ms (-2.50%)     Median: 0.2338ms (-2.56%)     [0.2247ms .. 0.2452ms]    Output: 1_054
a_AND_num_rand_fast:[0_TO_9]_all_results                            Avg: 0.3956ms (-82.86%)    Median: 0.3943ms (-82.90%)    [0.3815ms .. 0.4119ms]    Output: 463
a_AND_num_asc_fast:[0_TO_9]_all_results                             Avg: 0.4896ms (-91.16%)    Median: 0.4862ms (-90.81%)    [0.4797ms .. 0.5084ms]    Output: 1_054
num_rand_fast:[0_TO_9]_AND_num_asc_fast:[0_TO_9]_all_results        Avg: 2.7108ms (-50.81%)    Median: 2.6925ms (-49.51%)    [2.6688ms .. 2.7868ms]    Output: 968
dense and 10% a
a_AND_num_rand:[0_TO_9]_all_results                                 Avg: 0.9869ms (-3.71%)     Median: 0.9833ms (-3.83%)     [0.9518ms .. 1.1218ms]    Output: 4_914
a_AND_num_asc:[0_TO_9]_all_results                                  Avg: 0.6352ms (-3.74%)     Median: 0.6363ms (-3.32%)     [0.6158ms .. 0.6488ms]    Output: 10_152
a_AND_num_rand_fast:[0_TO_9]_all_results                            Avg: 3.1264ms (+0.39%)     Median: 3.1466ms (+1.34%)     [3.0261ms .. 3.2051ms]    Output: 4_914
a_AND_num_asc_fast:[0_TO_9]_all_results                             Avg: 4.1547ms (-31.12%)    Median: 4.0933ms (-28.55%)    [3.7648ms .. 4.7600ms]    Output: 10_152
num_rand_fast:[0_TO_9]_AND_num_asc_fast:[0_TO_9]_all_results        Avg: 2.6973ms (-52.30%)    Median: 2.6901ms (-49.86%)    [2.6689ms .. 2.7677ms]    Output: 968
```

Gains are largest when the range query is the non-leading docset of a low-cardinality intersection.
2026-07-03 13:42:33 +02:00
trinity.pointard
d7e9b31977 rustfmt 2026-07-03 08:35:06 +00:00
trinity.pointard
7d4f8454be Merge branch 'main' into trinity.pointard/multi-terms 2026-07-03 08:35:05 +00:00
trinity.pointard
091f6fadf1 fix bool serialization 2026-07-03 08:15:27 +00:00
trinity.pointard
a81109226d first pass of optimising multi-terms agg 2026-07-02 21:55:57 +00:00
Pascal Seitz
82bee54a00 block_search: drop unsafe indexing, remove K=64
For K∈{2,4,8,16,32} LLVM proves the index bounds and elides the checks,
so get_unchecked buys nothing on the production K=8 path. K=64 was the
only value that defeated bounds-check elision (one check in the tail
scan) and it was instantiated in tests only — drop it.

block_search: cite the k-ary search paper

Document that kary_search is the 'k-ary search on a sorted array' variant
from Schlegel, Gemulla & Lehner (DaMoN 2009), specialized to a lower-bound.
2026-07-02 18:27:40 +02:00
Pascal Seitz
6892995d02 10% faster intersections: Use k-ary in block search 2026-07-02 18:27:40 +02:00
trinity-1686a
9704e6c0e3 Merge pull request #2983 from quickwit-oss/trinity.pointard/quickselect-term-agg
use select-nth instead of full sort in segment level agg top-k selection
2026-07-02 15:38:14 +02:00
trinity.pointard
715590b357 rename local var 2026-07-02 12:00:00 +00:00
trinity.pointard
d496e402ca rustfmt 2026-07-02 07:12:04 +00:00
trinity.pointard
348ca1e309 don't count matching doc twice 2026-06-30 16:09:11 +00:00
trinity.pointard
5e4fe3520c better handle sorted buckets 2026-06-30 14:56:24 +00:00
trinity.pointard
e7ad06f42f allow extracting values from IntermediateMultiTermsBucketResult 2026-06-29 17:42:15 +00:00
pascal
a9733ba8c2 Keep buffered union refill out of line
BufferedUnionScorer is the hot path for full union traversal, including (TopDocs, Count) where Count forces all matches to be visited. After the block-wand intersection changes, LLVM started inlining the refill helper into the advance path, which regressed TOP_100_COUNT union queries even though the union algorithm did not change.

Force the refill helper out of line so the advance loop stays small and stable while pruning collectors continue to use Block-WAND.

Benchmark on search-benchmark-game TOP_100_COUNT union query set (301 queries, sum of per-query medians):
- tantivy 0.26: 0.853646s
- main before: 0.918605s
- this change: 0.841659s
2026-06-29 19:33:50 +02:00
pascal
874d54a63a Remove union wrapping for single-terms
search-benchmark-game shows TOP_100_COUNT regression on queries tagged intersection_union.

The regression came from allowing single-term boolean unions to become TermUnion for Block-WAND. https://github.com/quickwit-oss/tantivy/pull/2915
When such a scorer is used as the optional side of RequiredOptionalScorer, boxing converted the lone term into BufferedUnionScorer.

Keep the TermUnion representation available for pruning collection, but unwrap one-term unions when boxing or doing non-pruning iteration.
2026-06-29 19:33:50 +02:00
trinity.pointard
74a510cb56 try to use select-nth instead of full sort in segment level agg top-k selection 2026-06-29 09:13:21 +00:00
trinity.pointard
d7f69903fa first swab at multi-terms impl 2026-06-26 21:48:18 +00:00
trinity-1686a
02e34508e2 Merge pull request #2971 from quickwit-oss/trinity.pointard/fix-slop-overflow
fix overflow on large jumps in linear sequence
2026-06-23 10:18:29 +02:00
trinity-1686a
4031d97bac fix overflow on large jumps in linear sequence
new limit prevent an overflow in eval which caused the residual to be 64b when a slop of zero would give a smaller one
2026-06-23 00:13:27 +02:00
Ming
384f31d350 feat: Restore index sorting (#2959)
We ([ParadeDB](https://github.com/paradedb/paradedb)) have restored and been using the removed [index sorting](https://github.com/quickwit-oss/tantivy/issues/2352) feature in our Tantivy fork.

Our use case is sorting the index by Postgres' internal `ctid` identifier. Results returned from Tantivy must be checked against Postgres' visibility map, and checking them in ctid order is much more cache friendly, resulting in up to 80% speedups for certain queries.

This PR is split into 5 commits, corresponding to the index sorting reversal plus bug fixes we uncovered during our usage of index sorting.

| Commit | Maps to | What it does |
|---|---|---|
| `2aea0ad9f` | foundation ([#104](https://github.com/paradedb/tantivy/pull/104)) | Restore `SegmentComponent::TempStore` (revert of upstream #2815). Subsumes fork PR [#104](https://github.com/paradedb/tantivy/pull/104)'s CI fix. |
| `9205bcb0c` | [#92](https://github.com/paradedb/tantivy/pull/92) | Restore sort-by-field (single-segment + merge paths). |
| `39c790f0f` | [#101](https://github.com/paradedb/tantivy/pull/101) | Enable `sort_by` for `Str`/`Bytes` fast fields. |
| `9c4341a87` | [#105](https://github.com/paradedb/tantivy/pull/105) | Native typed numeric sort-key comparison (precision/NULL fix). |
| `2d9ba2418` | [#106](https://github.com/paradedb/tantivy/pull/106) | Preserve NULL ordering in numeric segment merges. |

We have discussed with the Tantivy maintainers and they indicated they would be open to this PR. Another motivation for landing this PR is we are planning on contributing a significant refactor that makes Tantivy's segment components extensible, and landing that without index sorting leads to too many conflicts.
2026-06-22 11:22:25 -07:00
Pascal Seitz
1e859fd78d fix term aggregation u32::MAX overflow issue 2026-06-18 17:07:43 +08:00
Pascal Seitz
f451fa938f explain why naive scorer must accumulate scores in WAND order 2026-06-17 18:58:58 +08:00
Pascal Seitz
2a82dd6f64 fix flaky test 2026-06-17 18:58:58 +08:00
Pascal Seitz
c096b2ad89 aggregation/terms: charge fused term_counts to the memory limit
term_counts (one u32/term) was allocated but not charged to
AggregationLimitsGuard, so a memory limit could be exceeded silently.
Charge it, skip allocating it when unbounded, and add a regression test.
2026-06-16 21:23:23 +08:00
Pascal Seitz
ac7a3d347c add comment, hoist variables 2026-06-16 21:23:23 +08:00
Pascal Seitz
03520a0719 add top level comment 2026-06-16 21:23:23 +08:00
Pascal Seitz
86a4c47bed merge loops, histo with bounds may benefit from single vec opt 2026-06-16 21:23:23 +08:00
Pascal Seitz
fb23e8908f add histogram with bounds 2026-06-16 21:23:23 +08:00
Pascal Seitz
3ca510dff0 aggregation/terms: tidy fused term×histogram grid construction
Rename the value threaded through build_segment_term_collector and
maybe_build_collector from max_term_id to col_max_val/max_column_val — it
is the column's max value, only later reused as the max term id. Make the
grid-size arithmetic overflow-/zero-safe (saturating_add, checked_div).
2026-06-16 21:23:23 +08:00
Pascal Seitz
3cb400c300 clarify counts/term_counts field docs
Spell out that `counts` is the flattened per-term × time-bucket grid (each
term's own contiguous slice) and that `term_counts` is only needed when the
per-term total can't be derived from that grid (i.e. with hard bounds).
2026-06-16 21:23:23 +08:00
Pascal Seitz
ef13489d63 skip hard_bounds that can't exclude any value
When a histogram's hard_bounds are wider than the column's value range, the
per-doc `bounds.contains` check can never fail. Collapse such bounds to the
unbounded sentinel in `normalize_histogram_req`, so both the general histogram
hot loop and the fused term×histogram path skip the check — the latter then
derives per-term counts from the grid (the ~17% win) instead of falling back to
per-doc counting just because `bounds != [MIN, MAX]`.

Only the collect-time filter is affected: empty-bucket emission reads
`req.hard_bounds` directly, and hard_bounds only ever clips that range, so a
wider-than-data bound leaves results unchanged. Covered by new tests on the
general and fused paths, including mid-interval (bucket-splitting) bounds.

Also tighten the fused-path u32-overflow guard to bound on `num_vals()` (the
per-value increment count) rather than `num_docs()`, and document why the fused
collector's hot-loop fields are hoisted into locals (re-reading them from memory
each iteration measured ~15% slower).
2026-06-16 21:23:23 +08:00
Pascal Seitz
9f7aea4765 derive term counts 2026-06-16 21:23:23 +08:00
Pascal Seitz
2c8536ab11 add specialized TermHistogram 2026-06-16 21:23:23 +08:00
Pascal Seitz
05f4c02ac5 add dense histogram, optional sub-buckets 2026-06-16 21:23:23 +08:00
Pascal Seitz
d137779219 add no sub-gg fastpath 2026-06-16 21:23:23 +08:00
Pascal Seitz
8f9846ac80 use get_range when possible 2026-06-16 21:23:23 +08:00
Pascal Seitz
52e24a9757 add status -> date histogram bench 2026-06-16 21:23:23 +08:00
trinity-1686a
00714326af Merge pull request #2960 from Darkheir/fix/query_grammar_boost_and_escape
fix(query-grammar): Fix issues on boosted and regex queries
2026-06-16 12:03:23 +02:00
Mohammad Dashti
799f7b4646 Built SUM final result in each branch directly.
Keeps the empty-bucket coercion visible at the boundary instead of a
shared binding, following the reviewer's suggested shape.
2026-06-16 03:10:30 +08:00
Mohammad Dashti
fc88d80726 docs: drop downstream-specific name from none_if_no_match doc
The flag's purpose is described well enough by "SQL-style consumers";
no need to call out a specific downstream.
2026-06-16 03:10:30 +08:00
Mohammad Dashti
6a684e7c38 feat: opt-in none_if_no_match flag on SumAggregation for SQL-style null
Switch the default serialized output of `sum` on empty / all-missing
buckets back to `"value": 0` to match Elasticsearch, and gate the
SQL-style `"value": null` behavior behind a new
`none_if_no_match: Option<bool>` flag on `SumAggregation`.

`IntermediateSum::finalize` still returns `Option<f64>` internally so
the Rust API stays parallel to min/max/avg, but the ES-vs-SQL choice is
made at the boundary in `IntermediateMetricResult::into_final_metric_result`:
`None` is coerced to `Some(0.0)` unless `none_if_no_match` is set on the
aggregation request.

Adds `AggregationVariants::as_sum()` accessor for that boundary check
and two end-to-end tests covering both the default ES behavior and the
opt-in null behavior on an empty index.
2026-06-16 03:10:30 +08:00
Mohammad Dashti
94fe52cc67 docs: clarify SUM finalize returning None diverges from Elasticsearch
Surface the trade-off in the doc comment so future reviewers see why
this differs from ES (which returns "value": 0 for sum over
empty/all-missing buckets) and what consumers (ParadeDB SQL NULL) the
None variant is meant to serve.
2026-06-16 03:10:30 +08:00