tantivy

mirror of https://github.com/quickwit-oss/tantivy.git synced 2026-07-04 00:00:46 +00:00

Author	SHA1	Message	Date
trinity.pointard	73547027e7	Merge branch 'main' into trinity.pointard/multi-terms	2026-07-03 16:29:39 +00:00
Pascal Seitz	6b8bd7b884	reorder if block	2026-07-03 17:44:44 +02:00
Pascal Seitz	9db05b660e	fix overflow issue	2026-07-03 17:44:44 +02:00
Pascal Seitz	057e9d6618	make BucketId optional in aggregations If term aggregations don't have sub-aggregations, we don't need to carry BucketId(u32). ``` full terms_7 Memory: 46.5 KB Avg: 2.4024ms (+1.96%) Median: 2.4024ms (+1.96%) [2.4024ms .. 2.4024ms] terms_all_unique Memory: 11.5 MB (-9.93%) Avg: 4.7910ms (-22.62%) Median: 4.7910ms (-22.62%) [4.7910ms .. 4.7910ms] terms_all_unique_order_by_key Memory: 11.5 MB (-9.94%) Avg: 4.8056ms (-21.31%) Median: 4.8056ms (-21.31%) [4.8056ms .. 4.8056ms] terms_150_000 Memory: 2.7 MB (-9.90%) Avg: 5.7312ms (-5.96%) Median: 5.7312ms (-5.96%) [5.7312ms .. 5.7312ms] terms_many_top_1000 Memory: 5.3 MB Avg: 8.5912ms (-3.34%) Median: 8.5912ms (-3.34%) [8.5912ms .. 8.5912ms] terms_many_order_by_term Memory: 2.7 MB (-9.95%) Avg: 4.5581ms (-10.79%) Median: 4.5581ms (-10.79%) [4.5581ms .. 4.5581ms] terms_many_with_top_hits Memory: 48.9 MB Avg: 102.3672ms (-2.30%) Median: 102.3672ms (-2.30%) [102.3672ms .. 102.3672ms] terms_all_unique_with_avg_sub_agg Memory: 54.7 MB Avg: 18.4644ms (-0.11%) Median: 18.4644ms (-0.11%) [18.4644ms .. 18.4644ms] terms_many_with_avg_sub_agg Memory: 13.5 MB Avg: 16.7345ms (-4.09%) Median: 16.7345ms (-4.09%) [16.7345ms .. 16.7345ms] terms_status_with_avg_sub_agg Memory: 92.1 KB Avg: 5.1590ms (+0.33%) Median: 5.1590ms (+0.33%) [5.1590ms .. 5.1590ms] terms_status_with_terms_zipf_1000_sub_agg Memory: 213.2 KB Avg: 3.9562ms (+1.49%) Median: 3.9562ms (+1.49%) [3.9562ms .. 3.9562ms] terms_zipf_1000_with_terms_status_sub_agg Memory: 724.4 KB Avg: 11.7206ms (-4.68%) Median: 11.7206ms (-4.68%) [11.7206ms .. 11.7206ms] terms_status_with_histogram Memory: 139.0 KB Avg: 2.3985ms (-1.07%) Median: 2.3985ms (-1.07%) [2.3985ms .. 2.3985ms] terms_status_with_date_histogram Memory: 145.9 KB Avg: 2.2963ms (-1.86%) Median: 2.2963ms (-1.86%) [2.2963ms .. 2.2963ms] terms_status_with_date_histogram_hard_bounds Memory: 145.4 KB Avg: 2.4312ms (-1.37%) Median: 2.4312ms (-1.37%) [2.4312ms .. 2.4312ms] terms_status_with_date_histogram_and_sibling_terms Memory: 143.3 KB Avg: 3.8401ms (+0.44%) Median: 3.8401ms (+0.44%) [3.8401ms .. 3.8401ms] terms_zipf_1000 Memory: 75.8 KB Avg: 2.1611ms (-2.70%) Median: 2.1611ms (-2.70%) [2.1611ms .. 2.1611ms] terms_zipf_1000_with_histogram Memory: 1.2 MB Avg: 20.1369ms (-1.16%) Median: 20.1369ms (-1.16%) [20.1369ms .. 20.1369ms] terms_zipf_1000_with_avg_sub_agg Memory: 472.7 KB Avg: 8.5851ms (-4.93%) Median: 8.5851ms (-4.93%) [8.5851ms .. 8.5851ms] terms_many_json_mixed_type_with_avg_sub_agg Memory: 17.9 MB Avg: 26.3988ms (-6.61%) Median: 26.3988ms (-6.61%) [26.3988ms .. 26.3988ms] ```	2026-07-03 17:44:44 +02:00
trinity.pointard	911fe1f883	more codex review	2026-07-03 15:39:32 +00:00
trinity.pointard	be7f09a0b2	fix handling of bytes for cardinality agg	2026-07-03 14:56:33 +00:00
trinity.pointard	ba1a979824	codex review	2026-07-03 13:33:23 +00:00
Pascal Seitz	95172089e7	cleanup benchmark	2026-07-03 13:42:33 +02:00
Pascal Seitz	f313f22df7	Speed up range-query intersections via seek_danger on RangeDocSet (up to ~50x faster) A regular seek on RangeDocSet is costly: on a miss it fetches blocks and scans the column forward to materialize the next matching doc. As a non-leading docset in an intersection that work is wasted — the driver only asks "does this candidate match?". seek_danger answers that with a cheap point lookup via Column::values_for_doc, returning a lower bound on a miss and leaving forward progress to the caller. Forward seek_danger through ConstScorer. Benchmarks (bool_queries_with_range, _all_results / DocSetCollector): ``` dense and 0.1% a a_AND_num_rand:[0_TO_9]_all_results Avg: 0.0827ms (-4.60%) Median: 0.0825ms (-4.82%) [0.0809ms .. 0.0891ms] Output: 43 a_AND_num_asc:[0_TO_9]_all_results Avg: 0.1937ms (-3.70%) Median: 0.1930ms (-3.59%) [0.1806ms .. 0.2044ms] Output: 100 a_AND_num_rand_fast:[0_TO_9]_all_results Avg: 0.0367ms (-92.67%) Median: 0.0365ms (-92.65%) [0.0340ms .. 0.0398ms] Output: 43 a_AND_num_asc_fast:[0_TO_9]_all_results Avg: 0.1052ms (-98.05%) Median: 0.1050ms (-97.98%) [0.1009ms .. 0.1117ms] Output: 100 num_rand_fast:[0_TO_9]_AND_num_asc_fast:[0_TO_9]_all_results Avg: 2.7147ms (-51.42%) Median: 2.7075ms (-49.58%) [2.6806ms .. 2.7799ms] Output: 968 dense and 1% a a_AND_num_rand:[0_TO_9]_all_results Avg: 0.4373ms (-9.71%) Median: 0.4357ms (-10.12%) [0.4117ms .. 0.4711ms] Output: 463 a_AND_num_asc:[0_TO_9]_all_results Avg: 0.2342ms (-2.50%) Median: 0.2338ms (-2.56%) [0.2247ms .. 0.2452ms] Output: 1_054 a_AND_num_rand_fast:[0_TO_9]_all_results Avg: 0.3956ms (-82.86%) Median: 0.3943ms (-82.90%) [0.3815ms .. 0.4119ms] Output: 463 a_AND_num_asc_fast:[0_TO_9]_all_results Avg: 0.4896ms (-91.16%) Median: 0.4862ms (-90.81%) [0.4797ms .. 0.5084ms] Output: 1_054 num_rand_fast:[0_TO_9]_AND_num_asc_fast:[0_TO_9]_all_results Avg: 2.7108ms (-50.81%) Median: 2.6925ms (-49.51%) [2.6688ms .. 2.7868ms] Output: 968 dense and 10% a a_AND_num_rand:[0_TO_9]_all_results Avg: 0.9869ms (-3.71%) Median: 0.9833ms (-3.83%) [0.9518ms .. 1.1218ms] Output: 4_914 a_AND_num_asc:[0_TO_9]_all_results Avg: 0.6352ms (-3.74%) Median: 0.6363ms (-3.32%) [0.6158ms .. 0.6488ms] Output: 10_152 a_AND_num_rand_fast:[0_TO_9]_all_results Avg: 3.1264ms (+0.39%) Median: 3.1466ms (+1.34%) [3.0261ms .. 3.2051ms] Output: 4_914 a_AND_num_asc_fast:[0_TO_9]_all_results Avg: 4.1547ms (-31.12%) Median: 4.0933ms (-28.55%) [3.7648ms .. 4.7600ms] Output: 10_152 num_rand_fast:[0_TO_9]_AND_num_asc_fast:[0_TO_9]_all_results Avg: 2.6973ms (-52.30%) Median: 2.6901ms (-49.86%) [2.6689ms .. 2.7677ms] Output: 968 ``` Gains are largest when the range query is the non-leading docset of a low-cardinality intersection.	2026-07-03 13:42:33 +02:00
trinity.pointard	d7e9b31977	rustfmt	2026-07-03 08:35:06 +00:00
trinity.pointard	7d4f8454be	Merge branch 'main' into trinity.pointard/multi-terms	2026-07-03 08:35:05 +00:00
trinity.pointard	091f6fadf1	fix bool serialization	2026-07-03 08:15:27 +00:00
trinity.pointard	a81109226d	first pass of optimising multi-terms agg	2026-07-02 21:55:57 +00:00
Pascal Seitz	82bee54a00	block_search: drop unsafe indexing, remove K=64 For K∈{2,4,8,16,32} LLVM proves the index bounds and elides the checks, so get_unchecked buys nothing on the production K=8 path. K=64 was the only value that defeated bounds-check elision (one check in the tail scan) and it was instantiated in tests only — drop it. block_search: cite the k-ary search paper Document that kary_search is the 'k-ary search on a sorted array' variant from Schlegel, Gemulla & Lehner (DaMoN 2009), specialized to a lower-bound.	2026-07-02 18:27:40 +02:00
Pascal Seitz	6892995d02	10% faster intersections: Use k-ary in block search	2026-07-02 18:27:40 +02:00
trinity-1686a	9704e6c0e3	Merge pull request #2983 from quickwit-oss/trinity.pointard/quickselect-term-agg use select-nth instead of full sort in segment level agg top-k selection	2026-07-02 15:38:14 +02:00
trinity.pointard	715590b357	rename local var	2026-07-02 12:00:00 +00:00
trinity.pointard	d496e402ca	rustfmt	2026-07-02 07:12:04 +00:00
trinity.pointard	348ca1e309	don't count matching doc twice	2026-06-30 16:09:11 +00:00
trinity.pointard	5e4fe3520c	better handle sorted buckets	2026-06-30 14:56:24 +00:00
trinity.pointard	e7ad06f42f	allow extracting values from IntermediateMultiTermsBucketResult	2026-06-29 17:42:15 +00:00
pascal	a9733ba8c2	Keep buffered union refill out of line BufferedUnionScorer is the hot path for full union traversal, including (TopDocs, Count) where Count forces all matches to be visited. After the block-wand intersection changes, LLVM started inlining the refill helper into the advance path, which regressed TOP_100_COUNT union queries even though the union algorithm did not change. Force the refill helper out of line so the advance loop stays small and stable while pruning collectors continue to use Block-WAND. Benchmark on search-benchmark-game TOP_100_COUNT union query set (301 queries, sum of per-query medians): - tantivy 0.26: 0.853646s - main before: 0.918605s - this change: 0.841659s	2026-06-29 19:33:50 +02:00
pascal	874d54a63a	Remove union wrapping for single-terms search-benchmark-game shows TOP_100_COUNT regression on queries tagged intersection_union. The regression came from allowing single-term boolean unions to become TermUnion for Block-WAND. https://github.com/quickwit-oss/tantivy/pull/2915 When such a scorer is used as the optional side of RequiredOptionalScorer, boxing converted the lone term into BufferedUnionScorer. Keep the TermUnion representation available for pruning collection, but unwrap one-term unions when boxing or doing non-pruning iteration.	2026-06-29 19:33:50 +02:00
trinity.pointard	74a510cb56	try to use select-nth instead of full sort in segment level agg top-k selection	2026-06-29 09:13:21 +00:00
trinity.pointard	d7f69903fa	first swab at multi-terms impl	2026-06-26 21:48:18 +00:00
trinity-1686a	02e34508e2	Merge pull request #2971 from quickwit-oss/trinity.pointard/fix-slop-overflow fix overflow on large jumps in linear sequence	2026-06-23 10:18:29 +02:00
trinity-1686a	4031d97bac	fix overflow on large jumps in linear sequence new limit prevent an overflow in eval which caused the residual to be 64b when a slop of zero would give a smaller one	2026-06-23 00:13:27 +02:00
Ming	384f31d350	feat: Restore index sorting (#2959 ) We ([ParadeDB](https://github.com/paradedb/paradedb)) have restored and been using the removed [index sorting](https://github.com/quickwit-oss/tantivy/issues/2352) feature in our Tantivy fork. Our use case is sorting the index by Postgres' internal `ctid` identifier. Results returned from Tantivy must be checked against Postgres' visibility map, and checking them in ctid order is much more cache friendly, resulting in up to 80% speedups for certain queries. This PR is split into 5 commits, corresponding to the index sorting reversal plus bug fixes we uncovered during our usage of index sorting. \| Commit \| Maps to \| What it does \| \|---\|---\|---\| \| `2aea0ad9f` \| foundation ([#104](https://github.com/paradedb/tantivy/pull/104)) \| Restore `SegmentComponent::TempStore` (revert of upstream #2815). Subsumes fork PR [#104](https://github.com/paradedb/tantivy/pull/104)'s CI fix. \| \| `9205bcb0c` \| [#92](https://github.com/paradedb/tantivy/pull/92) \| Restore sort-by-field (single-segment + merge paths). \| \| `39c790f0f` \| [#101](https://github.com/paradedb/tantivy/pull/101) \| Enable `sort_by` for `Str`/`Bytes` fast fields. \| \| `9c4341a87` \| [#105](https://github.com/paradedb/tantivy/pull/105) \| Native typed numeric sort-key comparison (precision/NULL fix). \| \| `2d9ba2418` \| [#106](https://github.com/paradedb/tantivy/pull/106) \| Preserve NULL ordering in numeric segment merges. \| We have discussed with the Tantivy maintainers and they indicated they would be open to this PR. Another motivation for landing this PR is we are planning on contributing a significant refactor that makes Tantivy's segment components extensible, and landing that without index sorting leads to too many conflicts.	2026-06-22 11:22:25 -07:00
Pascal Seitz	1e859fd78d	fix term aggregation u32::MAX overflow issue	2026-06-18 17:07:43 +08:00
Pascal Seitz	f451fa938f	explain why naive scorer must accumulate scores in WAND order	2026-06-17 18:58:58 +08:00
Pascal Seitz	2a82dd6f64	fix flaky test	2026-06-17 18:58:58 +08:00
Pascal Seitz	c096b2ad89	aggregation/terms: charge fused term_counts to the memory limit term_counts (one u32/term) was allocated but not charged to AggregationLimitsGuard, so a memory limit could be exceeded silently. Charge it, skip allocating it when unbounded, and add a regression test.	2026-06-16 21:23:23 +08:00
Pascal Seitz	ac7a3d347c	add comment, hoist variables	2026-06-16 21:23:23 +08:00
Pascal Seitz	03520a0719	add top level comment	2026-06-16 21:23:23 +08:00
Pascal Seitz	86a4c47bed	merge loops, histo with bounds may benefit from single vec opt	2026-06-16 21:23:23 +08:00
Pascal Seitz	fb23e8908f	add histogram with bounds	2026-06-16 21:23:23 +08:00
Pascal Seitz	3ca510dff0	aggregation/terms: tidy fused term×histogram grid construction Rename the value threaded through build_segment_term_collector and maybe_build_collector from max_term_id to col_max_val/max_column_val — it is the column's max value, only later reused as the max term id. Make the grid-size arithmetic overflow-/zero-safe (saturating_add, checked_div).	2026-06-16 21:23:23 +08:00
Pascal Seitz	3cb400c300	clarify counts/term_counts field docs Spell out that `counts` is the flattened per-term × time-bucket grid (each term's own contiguous slice) and that `term_counts` is only needed when the per-term total can't be derived from that grid (i.e. with hard bounds).	2026-06-16 21:23:23 +08:00
Pascal Seitz	ef13489d63	skip hard_bounds that can't exclude any value When a histogram's hard_bounds are wider than the column's value range, the per-doc `bounds.contains` check can never fail. Collapse such bounds to the unbounded sentinel in `normalize_histogram_req`, so both the general histogram hot loop and the fused term×histogram path skip the check — the latter then derives per-term counts from the grid (the ~17% win) instead of falling back to per-doc counting just because `bounds != [MIN, MAX]`. Only the collect-time filter is affected: empty-bucket emission reads `req.hard_bounds` directly, and hard_bounds only ever clips that range, so a wider-than-data bound leaves results unchanged. Covered by new tests on the general and fused paths, including mid-interval (bucket-splitting) bounds. Also tighten the fused-path u32-overflow guard to bound on `num_vals()` (the per-value increment count) rather than `num_docs()`, and document why the fused collector's hot-loop fields are hoisted into locals (re-reading them from memory each iteration measured ~15% slower).	2026-06-16 21:23:23 +08:00
Pascal Seitz	9f7aea4765	derive term counts	2026-06-16 21:23:23 +08:00
Pascal Seitz	2c8536ab11	add specialized TermHistogram	2026-06-16 21:23:23 +08:00
Pascal Seitz	05f4c02ac5	add dense histogram, optional sub-buckets	2026-06-16 21:23:23 +08:00
Pascal Seitz	d137779219	add no sub-gg fastpath	2026-06-16 21:23:23 +08:00
Pascal Seitz	8f9846ac80	use get_range when possible	2026-06-16 21:23:23 +08:00
Pascal Seitz	52e24a9757	add status -> date histogram bench	2026-06-16 21:23:23 +08:00
trinity-1686a	00714326af	Merge pull request #2960 from Darkheir/fix/query_grammar_boost_and_escape fix(query-grammar): Fix issues on boosted and regex queries	2026-06-16 12:03:23 +02:00
Mohammad Dashti	799f7b4646	Built SUM final result in each branch directly. Keeps the empty-bucket coercion visible at the boundary instead of a shared binding, following the reviewer's suggested shape.	2026-06-16 03:10:30 +08:00
Mohammad Dashti	fc88d80726	docs: drop downstream-specific name from none_if_no_match doc The flag's purpose is described well enough by "SQL-style consumers"; no need to call out a specific downstream.	2026-06-16 03:10:30 +08:00
Mohammad Dashti	6a684e7c38	feat: opt-in none_if_no_match flag on SumAggregation for SQL-style null Switch the default serialized output of `sum` on empty / all-missing buckets back to `"value": 0` to match Elasticsearch, and gate the SQL-style `"value": null` behavior behind a new `none_if_no_match: Option<bool>` flag on `SumAggregation`. `IntermediateSum::finalize` still returns `Option<f64>` internally so the Rust API stays parallel to min/max/avg, but the ES-vs-SQL choice is made at the boundary in `IntermediateMetricResult::into_final_metric_result`: `None` is coerced to `Some(0.0)` unless `none_if_no_match` is set on the aggregation request. Adds `AggregationVariants::as_sum()` accessor for that boundary check and two end-to-end tests covering both the default ES behavior and the opt-in null behavior on an empty index.	2026-06-16 03:10:30 +08:00
Mohammad Dashti	94fe52cc67	docs: clarify SUM finalize returning None diverges from Elasticsearch Surface the trade-off in the doc comment so future reviewers see why this differs from ES (which returns "value": 0 for sum over empty/all-missing buckets) and what consumers (ParadeDB SQL NULL) the None variant is meant to serve.	2026-06-16 03:10:30 +08:00

1 2 3 4 5 ...

3604 Commits