BufferedUnionScorer is the hot path for full union traversal, including (TopDocs, Count) where Count forces all matches to be visited. After the block-wand intersection changes, LLVM started inlining the refill helper into the advance path, which regressed TOP_100_COUNT union queries even though the union algorithm did not change.
Force the refill helper out of line so the advance loop stays small and stable while pruning collectors continue to use Block-WAND.
Benchmark on search-benchmark-game TOP_100_COUNT union query set (301 queries, sum of per-query medians):
- tantivy 0.26: 0.853646s
- main before: 0.918605s
- this change: 0.841659s
search-benchmark-game shows TOP_100_COUNT regression on queries tagged intersection_union.
The regression came from allowing single-term boolean unions to become TermUnion for Block-WAND. https://github.com/quickwit-oss/tantivy/pull/2915
When such a scorer is used as the optional side of RequiredOptionalScorer, boxing converted the lone term into BufferedUnionScorer.
Keep the TermUnion representation available for pruning collection, but unwrap one-term unions when boxing or doing non-pruning iteration.
We ([ParadeDB](https://github.com/paradedb/paradedb)) have restored and been using the removed [index sorting](https://github.com/quickwit-oss/tantivy/issues/2352) feature in our Tantivy fork.
Our use case is sorting the index by Postgres' internal `ctid` identifier. Results returned from Tantivy must be checked against Postgres' visibility map, and checking them in ctid order is much more cache friendly, resulting in up to 80% speedups for certain queries.
This PR is split into 5 commits, corresponding to the index sorting reversal plus bug fixes we uncovered during our usage of index sorting.
| Commit | Maps to | What it does |
|---|---|---|
| `2aea0ad9f` | foundation ([#104](https://github.com/paradedb/tantivy/pull/104)) | Restore `SegmentComponent::TempStore` (revert of upstream #2815). Subsumes fork PR [#104](https://github.com/paradedb/tantivy/pull/104)'s CI fix. |
| `9205bcb0c` | [#92](https://github.com/paradedb/tantivy/pull/92) | Restore sort-by-field (single-segment + merge paths). |
| `39c790f0f` | [#101](https://github.com/paradedb/tantivy/pull/101) | Enable `sort_by` for `Str`/`Bytes` fast fields. |
| `9c4341a87` | [#105](https://github.com/paradedb/tantivy/pull/105) | Native typed numeric sort-key comparison (precision/NULL fix). |
| `2d9ba2418` | [#106](https://github.com/paradedb/tantivy/pull/106) | Preserve NULL ordering in numeric segment merges. |
We have discussed with the Tantivy maintainers and they indicated they would be open to this PR. Another motivation for landing this PR is we are planning on contributing a significant refactor that makes Tantivy's segment components extensible, and landing that without index sorting leads to too many conflicts.
term_counts (one u32/term) was allocated but not charged to
AggregationLimitsGuard, so a memory limit could be exceeded silently.
Charge it, skip allocating it when unbounded, and add a regression test.
Rename the value threaded through build_segment_term_collector and
maybe_build_collector from max_term_id to col_max_val/max_column_val — it
is the column's max value, only later reused as the max term id. Make the
grid-size arithmetic overflow-/zero-safe (saturating_add, checked_div).
Spell out that `counts` is the flattened per-term × time-bucket grid (each
term's own contiguous slice) and that `term_counts` is only needed when the
per-term total can't be derived from that grid (i.e. with hard bounds).
When a histogram's hard_bounds are wider than the column's value range, the
per-doc `bounds.contains` check can never fail. Collapse such bounds to the
unbounded sentinel in `normalize_histogram_req`, so both the general histogram
hot loop and the fused term×histogram path skip the check — the latter then
derives per-term counts from the grid (the ~17% win) instead of falling back to
per-doc counting just because `bounds != [MIN, MAX]`.
Only the collect-time filter is affected: empty-bucket emission reads
`req.hard_bounds` directly, and hard_bounds only ever clips that range, so a
wider-than-data bound leaves results unchanged. Covered by new tests on the
general and fused paths, including mid-interval (bucket-splitting) bounds.
Also tighten the fused-path u32-overflow guard to bound on `num_vals()` (the
per-value increment count) rather than `num_docs()`, and document why the fused
collector's hot-loop fields are hoisted into locals (re-reading them from memory
each iteration measured ~15% slower).
Switch the default serialized output of `sum` on empty / all-missing
buckets back to `"value": 0` to match Elasticsearch, and gate the
SQL-style `"value": null` behavior behind a new
`none_if_no_match: Option<bool>` flag on `SumAggregation`.
`IntermediateSum::finalize` still returns `Option<f64>` internally so
the Rust API stays parallel to min/max/avg, but the ES-vs-SQL choice is
made at the boundary in `IntermediateMetricResult::into_final_metric_result`:
`None` is coerced to `Some(0.0)` unless `none_if_no_match` is set on the
aggregation request.
Adds `AggregationVariants::as_sum()` accessor for that boundary check
and two end-to-end tests covering both the default ES behavior and the
opt-in null behavior on an empty index.
Surface the trade-off in the doc comment so future reviewers see why
this differs from ES (which returns "value": 0 for sum over
empty/all-missing buckets) and what consumers (ParadeDB SQL NULL) the
None variant is meant to serve.
IntermediateSum::finalize() returned Some(0.0) even when count==0
(all documents had missing/NULL values). This differs from MIN, MAX,
and AVG which all return None for count==0.
The 0.0 came from IntermediateStats' default sum initialization.
Consumers (like ParadeDB) that map None to SQL NULL were incorrectly
getting 0 for SUM on all-NULL groups.
Fixesparadedb/paradedb#4621
Add a public rank(target) method on BlockSegmentPostings that returns the
number of docs with a doc id strictly smaller than target. It jumps to the
candidate block through the skip list and decodes a single block, so the cost
is O(skip-list entries) + one block decode rather than O(doc_freq).
This is a useful primitive for range counting over a posting list (e.g. number
of matches in a [lo, hi) doc-id window) without iterating every matched doc.
To support it, expose SkipReader::remaining_docs() (pub(crate)). Like seek(),
rank() advances the cursor forward only and must be called with non-decreasing,
valid (<= TERMINATED) targets. Adds a unit test covering multi-block lists and
the below-first / above-last / empty edge cases.
The metric/cardinality/histogram _mut getters had no callers needing
mutation; their two uses already pass the resulting reference as &T.
simplify req_data ownership: clone into collectors, Rc only for filter BitSet
Replace Vec<Option<Box<T>>> + take/put-back round-trip with Vec<T> +
direct clone into collector. Collectors now own their per-segment
request data outright, removing the borrow-checker dance that the
take/put-back pattern existed to satisfy.
The structural clones are cheap (Column<u64> is Arc-internal) except
for the filter aggregation, whose DocumentQueryEvaluator carries a
precomputed per-segment BitSet sized by max_doc. Wrap that in
Rc<DocumentQueryEvaluator> so FilterAggReqData::clone() bumps a
refcount instead of duplicating the BitSet. Move SegmentFilterCollector's
matching_docs_buffer out of FilterAggReqData so its pre-allocated
capacity is preserved per collector instead of being lost on every clone.
Closes#2285
The TermFrequencyRecorder was completely untested. Add five focused tests:
- term_frequency_recorder_has_term_freq: verifies the recorder
correctly advertises term-frequency support via has_term_freq()
- term_frequency_recorder_zero_docs: term_doc_freq() returns Some(0)
before any documents are recorded
- term_frequency_recorder_term_doc_freq_single_doc: one document with
two occurrences yields term_doc_freq() == Some(1)
- term_frequency_recorder_term_doc_freq_multiple_docs: three documents
with varying term frequencies yield term_doc_freq() == Some(3),
confirming the count tracks documents, not occurrences
- term_frequency_recorder_single_occurrence_per_doc: each of three
documents has exactly one occurrence
- term_frequency_recorder_high_frequency_doc: a single document with
1000 occurrences still yields term_doc_freq() == Some(1)
* Add filter_vec benchmarks (dense, sparse, full coverage)
Uses get_ids_for_value_range to exercise both the bitpacking decode and
the filter_vec SIMD path together under realistic cache conditions.
* Add NEON and SVE implementations for filter_vec
Adds aarch64-specific SIMD paths (NEON always available on aarch64;
SVE gated on nightly + non-Apple target) with routing logic in mod.rs
that selects the best available instruction set at runtime.
* Using asm! to workaround the lack of stabilized SVE intrinsics
* showing instruction set
* improved proptesting
* removing build.rs
---------
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
Applies @PSeitz's review suggestion to make the function name more
descriptive of what it checks. Also adds a doc note clarifying why
validation is opt-in rather than enforced by default.