tantivy

mirror of https://github.com/quickwit-oss/tantivy.git synced 2026-07-03 15:50:44 +00:00

Author	SHA1	Message	Date
Pascal Seitz	42d721214b	clippy	2026-07-02 15:25:33 +02:00
pascal	a9733ba8c2	Keep buffered union refill out of line BufferedUnionScorer is the hot path for full union traversal, including (TopDocs, Count) where Count forces all matches to be visited. After the block-wand intersection changes, LLVM started inlining the refill helper into the advance path, which regressed TOP_100_COUNT union queries even though the union algorithm did not change. Force the refill helper out of line so the advance loop stays small and stable while pruning collectors continue to use Block-WAND. Benchmark on search-benchmark-game TOP_100_COUNT union query set (301 queries, sum of per-query medians): - tantivy 0.26: 0.853646s - main before: 0.918605s - this change: 0.841659s	2026-06-29 19:33:50 +02:00
pascal	874d54a63a	Remove union wrapping for single-terms search-benchmark-game shows TOP_100_COUNT regression on queries tagged intersection_union. The regression came from allowing single-term boolean unions to become TermUnion for Block-WAND. https://github.com/quickwit-oss/tantivy/pull/2915 When such a scorer is used as the optional side of RequiredOptionalScorer, boxing converted the lone term into BufferedUnionScorer. Keep the TermUnion representation available for pruning collection, but unwrap one-term unions when boxing or doing non-pruning iteration.	2026-06-29 19:33:50 +02:00
trinity-1686a	02e34508e2	Merge pull request #2971 from quickwit-oss/trinity.pointard/fix-slop-overflow fix overflow on large jumps in linear sequence	2026-06-23 10:18:29 +02:00
trinity-1686a	4031d97bac	fix overflow on large jumps in linear sequence new limit prevent an overflow in eval which caused the residual to be 64b when a slop of zero would give a smaller one	2026-06-23 00:13:27 +02:00
Ming	384f31d350	feat: Restore index sorting (#2959) We ([ParadeDB](https://github.com/paradedb/paradedb)) have restored and been using the removed [index sorting](https://github.com/quickwit-oss/tantivy/issues/2352) feature in our Tantivy fork. Our use case is sorting the index by Postgres' internal `ctid` identifier. Results returned from Tantivy must be checked against Postgres' visibility map, and checking them in ctid order is much more cache friendly, resulting in up to 80% speedups for certain queries. This PR is split into 5 commits, corresponding to the index sorting reversal plus bug fixes we uncovered during our usage of index sorting. \| Commit \| Maps to \| What it does \| \|---\|---\|---\| \| `2aea0ad9f` \| foundation ([#104](https://github.com/paradedb/tantivy/pull/104)) \| Restore `SegmentComponent::TempStore` (revert of upstream #2815). Subsumes fork PR [#104](https://github.com/paradedb/tantivy/pull/104)'s CI fix. \| \| `9205bcb0c` \| [#92](https://github.com/paradedb/tantivy/pull/92) \| Restore sort-by-field (single-segment + merge paths). \| \| `39c790f0f` \| [#101](https://github.com/paradedb/tantivy/pull/101) \| Enable `sort_by` for `Str`/`Bytes` fast fields. \| \| `9c4341a87` \| [#105](https://github.com/paradedb/tantivy/pull/105) \| Native typed numeric sort-key comparison (precision/NULL fix). \| \| `2d9ba2418` \| [#106](https://github.com/paradedb/tantivy/pull/106) \| Preserve NULL ordering in numeric segment merges. \| We have discussed with the Tantivy maintainers and they indicated they would be open to this PR. Another motivation for landing this PR is we are planning on contributing a significant refactor that makes Tantivy's segment components extensible, and landing that without index sorting leads to too many conflicts.	2026-06-22 11:22:25 -07:00
Pascal Seitz	1e859fd78d	fix term aggregation u32::MAX overflow issue	2026-06-18 17:07:43 +08:00
Pascal Seitz	f451fa938f	explain why naive scorer must accumulate scores in WAND order	2026-06-17 18:58:58 +08:00
Pascal Seitz	2a82dd6f64	fix flaky test	2026-06-17 18:58:58 +08:00
Pascal Seitz	c096b2ad89	aggregation/terms: charge fused term_counts to the memory limit term_counts (one u32/term) was allocated but not charged to AggregationLimitsGuard, so a memory limit could be exceeded silently. Charge it, skip allocating it when unbounded, and add a regression test.	2026-06-16 21:23:23 +08:00
Pascal Seitz	ac7a3d347c	add comment, hoist variables	2026-06-16 21:23:23 +08:00
Pascal Seitz	03520a0719	add top level comment	2026-06-16 21:23:23 +08:00
Pascal Seitz	86a4c47bed	merge loops, histo with bounds may benefit from single vec opt	2026-06-16 21:23:23 +08:00
Pascal Seitz	fb23e8908f	add histogram with bounds	2026-06-16 21:23:23 +08:00
Pascal Seitz	3ca510dff0	aggregation/terms: tidy fused term×histogram grid construction Rename the value threaded through build_segment_term_collector and maybe_build_collector from max_term_id to col_max_val/max_column_val — it is the column's max value, only later reused as the max term id. Make the grid-size arithmetic overflow-/zero-safe (saturating_add, checked_div).	2026-06-16 21:23:23 +08:00
Pascal Seitz	3cb400c300	clarify counts/term_counts field docs Spell out that `counts` is the flattened per-term × time-bucket grid (each term's own contiguous slice) and that `term_counts` is only needed when the per-term total can't be derived from that grid (i.e. with hard bounds).	2026-06-16 21:23:23 +08:00
Pascal Seitz	ef13489d63	skip hard_bounds that can't exclude any value When a histogram's hard_bounds are wider than the column's value range, the per-doc `bounds.contains` check can never fail. Collapse such bounds to the unbounded sentinel in `normalize_histogram_req`, so both the general histogram hot loop and the fused term×histogram path skip the check — the latter then derives per-term counts from the grid (the ~17% win) instead of falling back to per-doc counting just because `bounds != [MIN, MAX]`. Only the collect-time filter is affected: empty-bucket emission reads `req.hard_bounds` directly, and hard_bounds only ever clips that range, so a wider-than-data bound leaves results unchanged. Covered by new tests on the general and fused paths, including mid-interval (bucket-splitting) bounds. Also tighten the fused-path u32-overflow guard to bound on `num_vals()` (the per-value increment count) rather than `num_docs()`, and document why the fused collector's hot-loop fields are hoisted into locals (re-reading them from memory each iteration measured ~15% slower).	2026-06-16 21:23:23 +08:00
Pascal Seitz	9f7aea4765	derive term counts	2026-06-16 21:23:23 +08:00
Pascal Seitz	2c8536ab11	add specialized TermHistogram	2026-06-16 21:23:23 +08:00
Pascal Seitz	05f4c02ac5	add dense histogram, optional sub-buckets	2026-06-16 21:23:23 +08:00
Pascal Seitz	d137779219	add no sub-gg fastpath	2026-06-16 21:23:23 +08:00
Pascal Seitz	8f9846ac80	use get_range when possible	2026-06-16 21:23:23 +08:00
Pascal Seitz	52e24a9757	add status -> date histogram bench	2026-06-16 21:23:23 +08:00
trinity-1686a	00714326af	Merge pull request #2960 from Darkheir/fix/query_grammar_boost_and_escape fix(query-grammar): Fix issues on boosted and regex queries	2026-06-16 12:03:23 +02:00
Mohammad Dashti	799f7b4646	Built SUM final result in each branch directly. Keeps the empty-bucket coercion visible at the boundary instead of a shared binding, following the reviewer's suggested shape.	2026-06-16 03:10:30 +08:00
Mohammad Dashti	fc88d80726	docs: drop downstream-specific name from none_if_no_match doc The flag's purpose is described well enough by "SQL-style consumers"; no need to call out a specific downstream.	2026-06-16 03:10:30 +08:00
Mohammad Dashti	6a684e7c38	feat: opt-in none_if_no_match flag on SumAggregation for SQL-style null Switch the default serialized output of `sum` on empty / all-missing buckets back to `"value": 0` to match Elasticsearch, and gate the SQL-style `"value": null` behavior behind a new `none_if_no_match: Option<bool>` flag on `SumAggregation`. `IntermediateSum::finalize` still returns `Option<f64>` internally so the Rust API stays parallel to min/max/avg, but the ES-vs-SQL choice is made at the boundary in `IntermediateMetricResult::into_final_metric_result`: `None` is coerced to `Some(0.0)` unless `none_if_no_match` is set on the aggregation request. Adds `AggregationVariants::as_sum()` accessor for that boundary check and two end-to-end tests covering both the default ES behavior and the opt-in null behavior on an empty index.	2026-06-16 03:10:30 +08:00
Mohammad Dashti	94fe52cc67	docs: clarify SUM finalize returning None diverges from Elasticsearch Surface the trade-off in the doc comment so future reviewers see why this differs from ES (which returns "value": 0 for sum over empty/all-missing buckets) and what consumers (ParadeDB SQL NULL) the None variant is meant to serve.	2026-06-16 03:10:30 +08:00
Mohammad Dashti	2ff39f6f7f	fix: return None from SUM when no values were collected IntermediateSum::finalize() returned Some(0.0) even when count==0 (all documents had missing/NULL values). This differs from MIN, MAX, and AVG which all return None for count==0. The 0.0 came from IntermediateStats' default sum initialization. Consumers (like ParadeDB) that map None to SQL NULL were incorrectly getting 0 for SUM on all-NULL groups. Fixes paradedb/paradedb#4621	2026-06-16 03:10:30 +08:00
Windforce17	1d06328cb3	Add BlockSegmentPostings::rank() for skip-list-based positional counting Add a public rank(target) method on BlockSegmentPostings that returns the number of docs with a doc id strictly smaller than target. It jumps to the candidate block through the skip list and decodes a single block, so the cost is O(skip-list entries) + one block decode rather than O(doc_freq). This is a useful primitive for range counting over a posting list (e.g. number of matches in a [lo, hi) doc-id window) without iterating every matched doc. To support it, expose SkipReader::remaining_docs() (pub(crate)). Like seek(), rank() advances the cursor forward only and must be called with non-decreasing, valid (<= TERMINATED) targets. Adds a unit test covering multi-block lists and the below-first / above-last / empty edge cases.	2026-06-15 18:56:49 +08:00
Darkheir	7fd1dbe9f5	fix(query-grammar): Fix issues on boosted and regex queries Signed-off-by: Darkheir <raphael.cohen@sekoia.io>	2026-06-15 10:50:07 +02:00
Pascal Seitz	b19f0ddc77	fix clippy	2026-06-09 23:14:12 +08:00
Pascal Seitz	b4acfcf881	cleanup AggregationsSegmentCtx The metric/cardinality/histogram _mut getters had no callers needing mutation; their two uses already pass the resulting reference as &T. simplify req_data ownership: clone into collectors, Rc only for filter BitSet Replace Vec<Option<Box<T>>> + take/put-back round-trip with Vec<T> + direct clone into collector. Collectors now own their per-segment request data outright, removing the borrow-checker dance that the take/put-back pattern existed to satisfy. The structural clones are cheap (Column<u64> is Arc-internal) except for the filter aggregation, whose DocumentQueryEvaluator carries a precomputed per-segment BitSet sized by max_doc. Wrap that in Rc<DocumentQueryEvaluator> so FilterAggReqData::clone() bumps a refcount instead of duplicating the BitSet. Move SegmentFilterCollector's matching_docs_buffer out of FilterAggReqData so its pre-allocated capacity is preserved per collector instead of being lost on every clone.	2026-06-09 23:14:12 +08:00
dependabot[bot]	3a8240b123	Bump codecov/codecov-action from 6.0.0 to 7.0.0 Bumps [codecov/codecov-action](https://github.com/codecov/codecov-action) from 6.0.0 to 7.0.0. - [Release notes](https://github.com/codecov/codecov-action/releases) - [Changelog](https://github.com/codecov/codecov-action/blob/main/CHANGELOG.md) - [Commits](`57e3a136b7...fb8b3582c8`) --- updated-dependencies: - dependency-name: codecov/codecov-action dependency-version: 7.0.0 dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com>	2026-06-09 14:48:17 +08:00
dependabot[bot]	fd9713e1ca	Bump actions/checkout from 6.0.2 to 6.0.3 (#2949) Bumps [actions/checkout](https://github.com/actions/checkout) from 6.0.2 to 6.0.3. - [Release notes](https://github.com/actions/checkout/releases) - [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md) - [Commits](`de0fac2e45...df4cb1c069`) --- updated-dependencies: - dependency-name: actions/checkout dependency-version: 6.0.3 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-06-08 10:55:54 +02:00
dependabot[bot]	96f3784f79	Bump github/codeql-action from 4.35.2 to 4.36.1 (#2948) Bumps [github/codeql-action](https://github.com/github/codeql-action) from 4.35.2 to 4.36.1. - [Release notes](https://github.com/github/codeql-action/releases) - [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md) - [Commits](`95e58e9a2c...87557b9c84`) --- updated-dependencies: - dependency-name: github/codeql-action dependency-version: 4.36.1 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-06-08 10:49:04 +02:00
dependabot[bot]	87a6679a79	Bump actions/upload-artifact from 7.0.0 to 7.0.1 (#2917) Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 7.0.0 to 7.0.1. - [Release notes](https://github.com/actions/upload-artifact/releases) - [Commits](`bbbca2ddaa...043fb46d1a`) --- updated-dependencies: - dependency-name: actions/upload-artifact dependency-version: 7.0.1 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-06-08 10:48:48 +02:00
dependabot[bot]	864a6aa72c	Update murmurhash32 requirement from 0.3 to 0.4 (#2894) Updates the requirements on [murmurhash32](https://github.com/quickwit-inc/murmurhash32) to permit the latest version. - [Commits](https://github.com/quickwit-inc/murmurhash32/commits) --- updated-dependencies: - dependency-name: murmurhash32 dependency-version: 0.4.0 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-06-08 10:48:32 +02:00
Paul Masurel	abcf6754a2	CR comments from https://github.com/quickwit-oss/tantivy/pull/2940 (#2952) Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>	2026-06-08 10:47:58 +02:00
Kanishk Sachan	70a8e56ee5	test(postings): add unit tests for TermFrequencyRecorder Closes #2285 The TermFrequencyRecorder was completely untested. Add five focused tests: - term_frequency_recorder_has_term_freq: verifies the recorder correctly advertises term-frequency support via has_term_freq() - term_frequency_recorder_zero_docs: term_doc_freq() returns Some(0) before any documents are recorded - term_frequency_recorder_term_doc_freq_single_doc: one document with two occurrences yields term_doc_freq() == Some(1) - term_frequency_recorder_term_doc_freq_multiple_docs: three documents with varying term frequencies yield term_doc_freq() == Some(3), confirming the count tracks documents, not occurrences - term_frequency_recorder_single_occurrence_per_doc: each of three documents has exactly one occurrence - term_frequency_recorder_high_frequency_doc: a single document with 1000 occurrences still yields term_doc_freq() == Some(1)	2026-06-06 14:44:51 +08:00
Paul Masurel	62705526e8	Add sve + neon filter vec implementation as spotted by Adam (#2940) * Add filter_vec benchmarks (dense, sparse, full coverage) Uses get_ids_for_value_range to exercise both the bitpacking decode and the filter_vec SIMD path together under realistic cache conditions. * Add NEON and SVE implementations for filter_vec Adds aarch64-specific SIMD paths (NEON always available on aarch64; SVE gated on nightly + non-Apple target) with routing logic in mod.rs that selects the best available instruction set at runtime. * Using asm! to workaround the lack of stabilized SVE intrinsics * showing instruction set * improved proptesting * removing build.rs --------- Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>	2026-06-04 17:51:26 +02:00
Paul Masurel	a27c64998f	Cargo clippy fix (#2943) Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>	2026-06-01 14:39:44 +02:00
Paul Masurel	46b3fb9ed3	Relying on upstream version of datasketch and stop using HLL 4. (#2936) We were relying on a fork for: a bugfix in LIST serialization a better API exposing a new Coupon type, required for caching coupons. We also stop using HLL8 in hope to fix https://datadoghq.atlassian.net/browse/CLOUDPREM-625 Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>	2026-05-19 13:29:35 +02:00
trinity-1686a	fbe620b9b4	Merge pull request #2933 from quickwit-oss/1686a/sstable-opt optimise sstable index access pattern	2026-05-19 11:43:17 +02:00
trinity-1686a	95d8a3989a	cr	2026-05-19 11:38:48 +02:00
trinity-1686a	ea61a68db4	skip sstable index binary search when ordinal is in same block	2026-05-16 11:35:38 +02:00
trinity-1686a	c367df37c1	refactor sstable index	2026-05-16 11:30:02 +02:00
Mohammad Dashti	d99a5d4e91	Rename validate_aggregation_fields to validate_aggregation_fields_exist Applies @PSeitz's review suggestion to make the function name more descriptive of what it checks. Also adds a doc note clarifying why validation is opt-in rather than enforced by default.	2026-05-16 15:45:20 +08:00
Mohammad Dashti	2de6f075ce	Fixed the example	2026-05-16 15:45:20 +08:00
Mohammad Dashti	18080067c7	Applied PR comment: I would move it outside of the aggregation. You can fetch the fields from the aggregation request and do a validation in a helper function	2026-05-16 15:45:20 +08:00
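
Commit 1d06328cb3 in the list above describes `rank(target)` as counting the docs whose id is strictly smaller than `target` by walking skip-list entries and decoding only one block. The following is a minimal, self-contained sketch of that idea, not tantivy's implementation: `ToyPostings`, `BlockInfo`, and the in-memory layout are hypothetical stand-ins, and unlike the forward-only cursor the commit describes, this toy version is stateless and can be called with targets in any order.

```rust
/// Hypothetical stand-in for one skip-list entry: the last doc id stored in
/// a block and the number of docs the block holds.
struct BlockInfo {
    last_doc: u32,
    len: u32,
}

/// Toy posting list: sorted doc ids split into fixed-size blocks.
/// (Assumption for illustration; real posting lists are compressed.)
struct ToyPostings {
    docs: Vec<u32>,
    skip: Vec<BlockInfo>,
    block_size: usize,
}

impl ToyPostings {
    /// Builds the skip entries from a sorted list of doc ids.
    fn new(docs: Vec<u32>, block_size: usize) -> Self {
        assert!(block_size > 0, "block_size must be positive");
        let skip: Vec<BlockInfo> = docs
            .chunks(block_size)
            .map(|block| BlockInfo {
                last_doc: *block.last().unwrap(),
                len: block.len() as u32,
            })
            .collect();
        ToyPostings { docs, skip, block_size }
    }

    /// Returns the number of docs with an id strictly smaller than `target`.
    fn rank(&self, target: u32) -> u32 {
        let mut before = 0u32;
        for (i, block) in self.skip.iter().enumerate() {
            if block.last_doc < target {
                // The whole block lies below the target: add its length
                // without looking at the individual doc ids.
                before += block.len;
            } else {
                // Candidate block: inspect only this one block.
                let start = i * self.block_size;
                let end = start + block.len as usize;
                let in_block = self.docs[start..end].partition_point(|&d| d < target);
                return before + in_block as u32;
            }
        }
        // Every block ended below the target.
        before
    }
}
```

Under these assumptions, a range count over a `[lo, hi)` doc-id window is `rank(hi) - rank(lo)`, which is the use case the commit message mentions.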


3582 Commits