Commit Graph

3527 Commits

Author SHA1 Message Date
PSeitz
042490479a Update CHANGELOG for Tantivy 0.26 release 2026-05-11 13:42:26 +02:00
Pascal Seitz
edfb02b47e switch to enum, fix mixed types for cardinality agg 2026-05-05 16:39:51 +08:00
Pascal Seitz
d0fad88bac use bitsets for card agg 2026-05-05 16:39:51 +08:00
Pascal Seitz
351280c0b4 add card bench for high card 2026-05-05 16:39:51 +08:00
James Sewell
4480cf0a98 Enable BMW for single-scorer boolean queries by removing early return in scorer_union (#2915)
The early return for `scorers.len() == 1` in `scorer_union` short-circuits a single TermScorer into `SpecializedScorer::Other`, bypassing the `TermUnion` path that enables block-max WAND (BMW) in `for_each_pruning`.

This was originally addressed in PR #2898 (backed out), which added a special case in `BooleanWeight::for_each_pruning`. PR #2912 (merged as d27ca164a) added a single-scorer fast path inside `block_wand` itself, but did not remove this early return — so a single SHOULD TermScorer still never reaches the BMW path.

Removing the early return lets a single TermScorer with freq reading flow through to `SpecializedScorer::TermUnion`, where `block_wand` → `block_wand_single_scorer` handles it efficiently.
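The dispatch described in this message can be sketched with toy stand-ins (the enum variants mirror the names above, but `scorer_union`'s real signature and the scorer types are simplified here, not tantivy's actual API):

```rust
// Toy sketch of the scorer_union dispatch, assuming simplified types.
#[derive(Debug, PartialEq)]
enum SpecializedScorer {
    TermUnion(Vec<&'static str>), // all scorers are TermScorers with freqs
    Other(Vec<&'static str>),
}

fn scorer_union(scorers: Vec<&'static str>, all_term_scorers: bool) -> SpecializedScorer {
    // The removed early return looked roughly like:
    // if scorers.len() == 1 { return SpecializedScorer::Other(scorers); }
    if all_term_scorers {
        // TermUnion is the only variant that reaches the block-max WAND path.
        SpecializedScorer::TermUnion(scorers)
    } else {
        SpecializedScorer::Other(scorers)
    }
}

fn main() {
    // After the fix, a single TermScorer flows into TermUnion,
    // so for_each_pruning can use BMW.
    let single = scorer_union(vec!["term_scorer(title:foo)"], true);
    assert_eq!(
        single,
        SpecializedScorer::TermUnion(vec!["term_scorer(title:foo)"])
    );
}
```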
2026-04-28 14:49:53 -07:00
Pascal Seitz
d47abdf104 early cut off for order by sub agg in term agg 2026-04-28 16:59:59 +02:00
Pascal Seitz
c11952eb7c add order by agg benchmark 2026-04-28 16:59:59 +02:00
trinity-1686a
09667ee9c8 Merge pull request #2909 from osyniakov/claude/add-ossf-scorecard-1z6Vn
Add OpenSSF Scorecard workflow
2026-04-28 11:57:36 +02:00
trinity-1686a
333ccf5300 Merge pull request #2896 from osyniakov/claude/fix-issues-5945-5937-eQm1Q
ci: pin GitHub Actions to full commit SHAs and restrict token permissions
2026-04-28 11:57:18 +02:00
Oleksii Syniakov
60a39a4689 Merge branch 'main' into claude/fix-issues-5945-5937-eQm1Q 2026-04-28 10:28:23 +02:00
Oleksii Syniakov
f8f3e4277f remove unneeded permissions for the public repo 2026-04-28 10:09:30 +02:00
Oleksii Syniakov
ff1433713a bump upload-sarif -> 4.35.2
Co-authored-by: trinity-1686a <trinity.pointard@gmail.com>
2026-04-28 10:07:45 +02:00
trinity-1686a
ca139d8eb1 Merge pull request #2910 from quickwit-oss/abdul.andha/composite-agg-after
Composite aggregations: send after key on last page
2026-04-27 23:38:52 +02:00
Abdul Andha
ac508108aa address pr comment 2026-04-27 12:39:38 -04:00
Paul Masurel
63da5a21b2 Optimizing top K using Adrien Grand's ideas (#2865)
* Optimizing top K using Adrien Grand's ideas

https://jpountz.github.io/2025/08/28/compiled-vs-vectorized-search-engine-edition.html

* Suffix-sum pruning for multi-term intersection candidates

After scoring each secondary in Phase 2, check whether remaining
secondaries' block_max scores can still beat the threshold. Skip
to the next candidate early if impossible, avoiding expensive seeks
into later secondaries.

Improves three-term intersection by ~8% on the balanced benchmark
while keeping two-term performance neutral.
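The suffix-sum pruning step can be illustrated with a minimal sketch (hypothetical names; the real code walks block-max postings, not plain score slices):

```rust
// Precompute, for each secondary position i, the sum of block-max scores of
// secondaries i..end. While scoring left to right, if the score so far plus
// the remaining suffix sum cannot beat the threshold, skip the candidate.
fn suffix_sums(block_max: &[f32]) -> Vec<f32> {
    let mut suffix = vec![0.0; block_max.len() + 1];
    for i in (0..block_max.len()).rev() {
        suffix[i] = suffix[i + 1] + block_max[i];
    }
    suffix
}

/// Returns Some(total score) if the candidate can still beat `threshold`,
/// or None as soon as the remaining upper bound proves it cannot.
fn score_candidate(
    primary_score: f32,
    secondary_scores: &[f32],
    block_max: &[f32],
    threshold: f32,
) -> Option<f32> {
    let suffix = suffix_sums(block_max);
    let mut score = primary_score;
    for (i, &s) in secondary_scores.iter().enumerate() {
        score += s;
        // Upper bound on what the remaining secondaries can still add.
        if score + suffix[i + 1] <= threshold {
            return None; // prune: move to the next candidate early
        }
    }
    Some(score)
}

fn main() {
    // Remaining block maxes are too small to reach the threshold: pruned early.
    assert_eq!(score_candidate(1.0, &[0.5, 0.1], &[2.0, 0.1], 4.0), None);
    // Here the candidate survives all secondaries.
    assert_eq!(score_candidate(2.0, &[1.5, 1.0], &[2.0, 1.5], 4.0), Some(4.5));
}
```

The pruned case never seeks into the later secondaries, which is where the ~8% win on three-term intersections comes from.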

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Claude CR comment

* Removed 16 term scorer limit.

---------

Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-26 12:14:40 +02:00
lif
54cd5bba98 fix: skip sentinel facet ords in harvest to prevent wrong root (#2867)
When a document has the exact registered facet path (not a child),
compute_collapse_mapping_one maps it to a sentinel (u64::MAX, 0).
Without filtering, harvest() passes u64::MAX to ord_to_term which
resolves to the last dictionary entry, producing a spurious facet
from an unrelated branch.

Skip entries where facet_ord == u64::MAX in harvest().
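A minimal sketch of that filter, with made-up stand-ins for `harvest` and `ord_to_term` (the clamping lookup only mimics how `u64::MAX` would resolve to the last dictionary entry):

```rust
// Drop sentinel entries before resolving ordinals, so u64::MAX is never
// passed to the term dictionary.
const SENTINEL: u64 = u64::MAX;

fn ord_to_term<'a>(ord: u64, dict: &[&'a str]) -> &'a str {
    // Stand-in for the real dictionary lookup; clamping mimics how u64::MAX
    // resolved to the last entry and produced a spurious facet.
    dict[(ord as usize).min(dict.len() - 1)]
}

fn harvest(facet_ords: &[u64], dict: &[&str]) -> Vec<String> {
    facet_ords
        .iter()
        .filter(|&&ord| ord != SENTINEL) // the added filter
        .map(|&ord| ord_to_term(ord, dict).to_string())
        .collect()
}

fn main() {
    let dict = ["/cat/a", "/cat/b", "/other/z"];
    // Without the filter, u64::MAX would resolve to "/other/z" (wrong root).
    assert_eq!(harvest(&[0, SENTINEL, 1], &dict), vec!["/cat/a", "/cat/b"]);
}
```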

Closes #2494

Signed-off-by: majiayu000 <1835304752@qq.com>
2026-04-25 22:23:30 +02:00
Paul Masurel
d27ca164a9 block_wand: use single-scorer path when there is only one scorer 2026-04-25 16:35:00 +02:00
dependabot[bot]
2f5a48e8b1 Update criterion requirement from 0.5 to 0.8 (#2873)
Updates the requirements on [criterion](https://github.com/criterion-rs/criterion.rs) to permit the latest version.
- [Release notes](https://github.com/criterion-rs/criterion.rs/releases)
- [Changelog](https://github.com/criterion-rs/criterion.rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/criterion-rs/criterion.rs/compare/0.5.0...criterion-v0.8.2)

---
updated-dependencies:
- dependency-name: criterion
  dependency-version: 0.8.2
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-25 14:15:53 +02:00
dependabot[bot]
ae0ab907fe Bump actions/checkout from 4 to 6 (#2875)
Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 6.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v4...v6)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-25 14:15:27 +02:00
dependabot[bot]
7d62e084e7 Bump codecov/codecov-action from 3 to 6 (#2876)
Bumps [codecov/codecov-action](https://github.com/codecov/codecov-action) from 3 to 6.
- [Release notes](https://github.com/codecov/codecov-action/releases)
- [Changelog](https://github.com/codecov/codecov-action/blob/main/CHANGELOG.md)
- [Commits](https://github.com/codecov/codecov-action/compare/v3...v6)

---
updated-dependencies:
- dependency-name: codecov/codecov-action
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-25 14:14:54 +02:00
James Sewell
322286ee16 Tighten Block-Max in single-scorer (#2897)
The Block-Max WAND single-scorer uses `block_max_score() < threshold`,
whereas the multi-term one uses `block_max_score_upperbound <= threshold`.

As both of these are guarded later on with `if score > threshold`, we can
use the more efficient form in the single-scorer.

Single-scorer block skip (<, should be <=): https://github.com/quickwit-oss/tantivy/blob/main/src/query/boolean_query/block_wand.rs#L231
Multi-scorer block skip (already <=): https://github.com/quickwit-oss/tantivy/blob/main/src/query/boolean_query/block_wand.rs#L179
Single-scorer per-doc guard (>): https://github.com/quickwit-oss/tantivy/blob/main/src/query/boolean_query/block_wand.rs#L246
Multi-scorer per-doc guard (>): https://github.com/quickwit-oss/tantivy/blob/main/src/query/boolean_query/block_wand.rs#L206

This will improve performance when there are many identical scores.
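A toy illustration of why the comparison change matters (illustrative names, not tantivy's code): a block whose block-max score exactly equals the threshold can never contain a doc with `score > threshold`, so the `<=` skip condition may safely discard it.

```rust
// Count how many blocks survive the skip check under each comparison.
fn scored_blocks(block_max: &[f32], threshold: f32, skip_if_le: bool) -> usize {
    block_max
        .iter()
        .filter(|&&bm| {
            if skip_if_le {
                bm > threshold // new: skip when block_max <= threshold
            } else {
                bm >= threshold // old: skip only when block_max < threshold
            }
        })
        .count()
}

fn main() {
    // Many identical scores: most blocks' max equals the current threshold.
    let block_max = [3.0_f32, 3.0, 3.0, 4.0];
    let threshold = 3.0;
    assert_eq!(scored_blocks(&block_max, threshold, false), 4); // old: scores all
    assert_eq!(scored_blocks(&block_max, threshold, true), 1); // new: only the 4.0 block
}
```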
2026-04-25 14:13:07 +02:00
RJ Barman
73ad18fa1e fix: Add space for missing sentinel in allowed bitset when a missing key is provided (#119) (#2907)
## Bug Overview
Under certain conditions, a `terms` aggregation request can cause a
bounds-check panic. Those conditions are:
- The queried field must be a text field
- There must be a segment where the number of distinct terms in it's
dictionary for the queried field is divisible by 64 (i.e.e where
`count(term_dict.keys) % 64 == 0`)
- That same segment must contain at least one document that does not
contain this field.
- The request contain a `missing` key that is a string.
- The request must contain an `include` or `exclude` filter.
For example:
```json
{
    "my_bool": {
        "terms": {
            "field": "title",
            "include": "foo",
            "missing": "__NULL__",
        }
    }
}
```
Check out the added tests in `src/aggregation/bucket/term_agg.rs` to see
this in action.

## How the bug happens
### Preparation
While preparing the aggregation nodes:
1) When we've provided a `missing` key, we derive a missing sentinel.
For string keys this is the column's max value (which for string columns
is always the number of terms in this segment) plus 1.
2) For string columns only, we optionally prepare an "allowed" `BitSet`
of allowed term ids (`build_allowed_term_ids_for_str` in
`src/aggregation/agg_data.rs`).
- If no `include` or `exclude` filter is provided, this just returns
`None`, causing this check to be skipped down the line.
- Otherwise the bitset is initialized to hold exactly the number of
terms in the segment's term dictionary, and the bits are set to signify
which terms are to be included in the results.

### Collection
If we have an "allowed" `BitSet`, we filter documents against it. For
each document, we check whether the `BitSet` contains the document's
term id. For documents without the field, this is the missing sentinel
we derived earlier, minus 1 (to account for zero-based indexing):
`(num_terms + 1) - 1`. However, the `BitSet`'s size is only `num_terms`.
Normally, this slips by without a problem, but if `num_terms % 64 == 0`,
this will cause a panic.

### Why `BitSet` panics
`BitSet` is represented under the hood by a boxed slice of `u64`s. When
you check a bit using `BitSet::contains`, it must determine which of
those `u64` words the bit is in, and then the bit's position within that
word.

In cases where the number of terms is not divisible by 64, the `BitSet`
must waste some bits. When we then look up the missing sentinel's bit,
it happens to be one of those wasted bits, whose value `BitSet` happily
returns. For example, if the number of terms was 63:
```rust
let bitset_init_size = 63; // BitSet's boxed slice has a length of 1, capable of holding 64 bits, term ids [0, 62]
let missing_sentinel = 63; // num_terms + 1 - 1
let word_pos = missing_sentinel / 64; // 0 - within the valid slice
let bit_pos = missing_sentinel % 64; // 63 - hits the 1 wasted bit
```

But if the number of terms is indeed divisible by 64, then the `BitSet`
is perfectly aligned to the word boundary:
```rust
let bitset_init_size = 64; // BitSet's boxed slice has a length of 1, capable of holding 64 bits, term ids [0, 63]
let missing_sentinel = 64; // num_terms + 1 - 1
let word_pos = missing_sentinel / 64; // 1 - idx 1 >= slice length 1
let bit_pos = missing_sentinel % 64; // 0
```
We try to access a word outside the bounds of the boxed slice, causing
the bounds check to fail and panic.

## Fixing it
The fix is simple. If we need to account for the missing sentinel,
initialize the `BitSet` with capacity for one more bit.
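The fix can be sketched with a toy bitset (`TinyBitSet` is a stand-in, not tantivy's `BitSet`): reserve one extra bit for the missing sentinel so `contains(num_terms)` stays in bounds even when `num_terms % 64 == 0`.

```rust
struct TinyBitSet {
    words: Vec<u64>,
}

impl TinyBitSet {
    fn with_capacity(num_bits: usize) -> Self {
        // Enough u64 words to hold `num_bits` bits.
        TinyBitSet { words: vec![0; num_bits.div_ceil(64)] }
    }
    fn insert(&mut self, bit: usize) {
        self.words[bit / 64] |= 1u64 << (bit % 64);
    }
    fn contains(&self, bit: usize) -> bool {
        // Panics, like the real BitSet, if bit / 64 >= words.len().
        self.words[bit / 64] & (1u64 << (bit % 64)) != 0
    }
}

fn main() {
    let num_terms = 64; // divisible by 64: the pathological case
    let missing_sentinel = num_terms; // num_terms + 1 - 1
    // The fix: allocate num_terms + 1 bits instead of num_terms.
    let mut allowed = TinyBitSet::with_capacity(num_terms + 1);
    allowed.insert(3);
    assert!(!allowed.contains(missing_sentinel)); // now in bounds: no panic
    assert!(allowed.contains(3));
}
```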

## Tests
- Added a bunch of unit tests that hit these conditions. I ensured they
failed without the fix, and that they now pass.
- All unit tests pass with the fix in place

## Other
- The investigation that led to finding this bug began with
https://github.com/paradedb/paradedb/issues/4746.
2026-04-25 14:11:47 +02:00
Abdul Andha
4fbae92187 send after key on last page 2026-04-24 15:33:26 -04:00
Cameron
89f0cef807 Fix O(2^n) query parser regression for deeply-nested queries (#2905)
* Fix O(2^n) query parser regression for deeply-nested queries

The top-level `ast()` parser used `alt((boolean_expr, single_leaf))` at
every group level. When the group contained a single leaf with no
trailing operand, `boolean_expr` would parse `occur_leaf` (recursing
into the inner group), fail at `multispace1`, backtrack, and then
`single_leaf` would re-parse `occur_leaf` from scratch. Every nesting
level doubled the work, giving O(2^n) time for queries like
`(((((title:test)))))`.

Parse `occur_leaf` once and peek ahead for a trailing operand instead
of backtracking. This keeps parsing O(n) and also avoids the duplicate
parse for simple single-leaf queries.
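A toy model of the blow-up (not the real nom grammar): a parser that fully parses the inner group on the "boolean expression" branch, fails at the missing trailing operator, backtracks, and re-parses the same group as a single leaf. Each nesting level doubles the work; parsing once and peeking keeps it linear.

```rust
fn parse_backtracking(input: &str, calls: &mut u64) -> bool {
    *calls += 1;
    if let Some(inner) = input.strip_prefix('(').and_then(|s| s.strip_suffix(')')) {
        // Branch 1: parse the group, then require a trailing operator
        // (never present here), so this work is thrown away.
        let _ = parse_backtracking(inner, calls);
        // Branch 2 (backtrack): re-parse the same group as a single leaf.
        return parse_backtracking(inner, calls);
    }
    !input.is_empty()
}

fn parse_peeking(input: &str, calls: &mut u64) -> bool {
    *calls += 1;
    if let Some(inner) = input.strip_prefix('(').and_then(|s| s.strip_suffix(')')) {
        // Parse once; a peek for a trailing operator needs no re-parse.
        return parse_peeking(inner, calls);
    }
    !input.is_empty()
}

fn nested(depth: usize) -> String {
    format!("{}x{}", "(".repeat(depth), ")".repeat(depth))
}

fn main() {
    let (mut a, mut b) = (0, 0);
    assert!(parse_backtracking(&nested(10), &mut a));
    assert!(parse_peeking(&nested(10), &mut b));
    assert_eq!(a, 2047); // 2^(depth+1) - 1 calls: exponential
    assert_eq!(b, 11); // depth + 1 calls: linear
}
```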

Fixes #2498.

Measured on the issue reproducer (release build):

    depth   before     after
       20   0.87 s   <1 us
       25  28.23 s   <1 us
       60  (years)   ~5 us

Non-pathological queries are unaffected or slightly faster:

    query                     before     after
    hello                     650 ns     308 ns
    a AND b AND c            1380 ns    1364 ns
    title:rust AND (...)     3426 ns    3460 ns

All 53 existing grammar tests and 56 query_parser tests pass. Adds a
regression test at depth 60 that would not complete under the old
parser.

* Add ignored benchmark for nested query parsing at depth 20/21

Matches the depths from issue #2498 which reported 0.87 s / 1.72 s
under the regression. With the fix these parse in single-digit
microseconds. Runs via:

  cargo test -p tantivy-query-grammar --release bench_deeply_nested \
      -- --ignored --nocapture

* Propagate Err::Failure and Err::Incomplete from operand parser

`alt((boolean_expr, single_leaf))` only retried on `Err::Error` and
propagated `Err::Failure` and `Err::Incomplete`. The replacement was
catching all three with `Err(_)`, which would silently fall back to
a single leaf if any cut point were ever added to `operand_leaf` or
its descendants. Match specifically on `Err::Error` to preserve the
original `alt` semantics.

* Replace inline bench with binggan bench in benches/

Move the nested-query benchmark out of the query-grammar test module
and into a proper binggan benchmark at benches/query_parser_nested.rs,
registered as a harnessless bench in Cargo.toml. Keeps the correctness
regression test (depth 60) in place.

Run with: cargo bench --bench query_parser_nested

* Fix rustfmt import ordering in query_parser_nested bench
2026-04-24 03:54:00 -04:00
Claude
a5d297c75f Add OpenSSF Scorecard workflow
Runs weekly security analysis and uploads SARIF results to GitHub code
scanning. Third-party actions are pinned by commit SHA. Adds the Scorecard
badge to the README.

Based on quickwit-oss/quickwit#5969.
2026-04-24 06:56:58 +00:00
Pascal Seitz
2e16243f9a fix memory consumption for histogram 2026-04-21 13:58:39 +02:00
Pascal Seitz
e015abab8e docs: add 0.26.1 changelog entry for aggregation perf fix 2026-04-21 11:12:37 +02:00
Pascal Seitz
73c711ec74 perf(agg): only measure active parent bucket in composite collect
Same change as 26a589e for SegmentCompositeCollector: get_memory_consumption
summed across all parent_buckets on every block, scaling with outer bucket
cardinality. Pass parent_bucket_id and index the single bucket.
2026-04-21 07:26:58 +02:00
Pascal Seitz
cb037c8079 add inline 2026-04-21 07:26:58 +02:00
Pascal Seitz
ed3453606b agg fix: compute memory consumption only for current bucket 2026-04-21 07:26:58 +02:00
Pascal Seitz
e9641f99c5 add nested term benchmark 2026-04-21 07:26:58 +02:00
Paul Masurel
13d74c3c20 Update binggan requirement from 0.16.0 to 0.16.1 (#2899) 2026-04-20 11:59:47 +02:00
Claude
3a6a3de8d7 ci: update pinned Action SHAs to current latest versions
The previous commit pinned actions to commit SHAs but used stale
version tags (v4.2.2, v2.7.5, old nextest/cargo-llvm-cov refs).
Update to the actual current HEAD of each pinned tag:

  actions/checkout        v4.2.2 → v4.3.1  (34e114876b0b...)
  Swatinem/rust-cache     v2.7.5 → v2.9.1  (c19371144df3...)
  taiki-e/install-action  nextest           (56cc9adf3a3e...)
  taiki-e/install-action  cargo-llvm-cov    (e4b3a0453201...)

actions-rs/toolchain, actions-rs/clippy-check, and
codecov/codecov-action SHAs were already correct.

https://claude.ai/code/session_01VD7Bo8upj3cQwWDf9ni2Ln
2026-04-16 06:49:47 +00:00
Claude
af3c6c0070 ci: pin GitHub Actions to full commit SHAs and restrict token permissions
Fixes two supply chain / token security issues:

- Pin all third-party Actions to immutable full commit SHAs instead of
  mutable version tags (addresses unpinned-dependencies risk, analogous
  to quickwit-oss/quickwit#5937):
    actions/checkout v4.2.2
    actions-rs/toolchain v1.0.7
    Swatinem/rust-cache v2.7.5
    taiki-e/install-action nextest / cargo-llvm-cov
    actions-rs/clippy-check v1.0.7
    codecov/codecov-action v3.1.6

- Add explicit least-privilege `permissions` blocks at workflow and job
  level (addresses excessive GITHUB_TOKEN permissions, analogous to
  quickwit-oss/quickwit#5945):
    default: contents: read
    check job: also grants checks: write (required by clippy-check)

https://claude.ai/code/session_01VD7Bo8upj3cQwWDf9ni2Ln
2026-04-15 20:55:43 +00:00
dependabot[bot]
058afff8b7 Update binggan requirement from 0.15.3 to 0.16.0
Updates the requirements on [binggan](https://github.com/pseitz/binggan) to permit the latest version.
- [Changelog](https://github.com/PSeitz/binggan/blob/main/CHANGELOG.md)
- [Commits](https://github.com/pseitz/binggan/commits)

---
updated-dependencies:
- dependency-name: binggan
  dependency-version: 0.16.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-04-15 08:58:03 +02:00
Paul Masurel
58aa4b7074 Fix cardinality aggregation using invalid coupons (#2893)
Previously, coupons were computed via murmurhash32 and fed as raw u32
to the HLL sketch, bypassing the sketch's internal hashing and producing
invalid (slot, value) pairs. Switch to Coupon::from_hash from the
datasketches crate which correctly derives coupons, and drop the
now-unused murmurhash32 dependency.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 19:14:30 +02:00
Paul Masurel
04beab3b29 Performance improvement for nested cardinality aggregation
When a string cardinality aggregation is nested, it ends up being applied to different buckets.
Dictionary encoding relies on a different dictionary for each segment.

As a result, during segment collection, we only collect term ordinals in a HashSet, and decode them in the
term dictionary at the end of collection.

Before this PR, this decoding phase was done once for each bucket, causing the same work to be done over and over. This PR introduces a coupon cache. The HLL sketch relies on a hash of the string values.

We populate the cache before bucket collection, and get our values from it.

This PR also renames "caching" to "buffering" in aggregation (it was never caching), and does several cleanups.
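The coupon-cache idea can be sketched with made-up types (a plain `HashMap` and the standard hasher stand in for the real ordinal-to-coupon table and the sketch's hash function): hash each term ordinal's string once per segment, then let every bucket look coupons up instead of re-decoding and re-hashing the same terms.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

fn coupon(term: &str) -> u64 {
    // Stand-in for the HLL sketch's hash of the string value.
    let mut h = DefaultHasher::new();
    term.hash(&mut h);
    h.finish()
}

fn build_coupon_cache(dict: &[&str]) -> HashMap<u64, u64> {
    // ordinal -> coupon, computed once before bucket collection.
    dict.iter()
        .enumerate()
        .map(|(ord, t)| (ord as u64, coupon(t)))
        .collect()
}

fn main() {
    let dict = ["apple", "banana", "cherry"];
    let cache = build_coupon_cache(&dict);
    // Every bucket resolves its collected ordinals from the cache,
    // instead of decoding the term dictionary again per bucket.
    let bucket_ordinals: Vec<u64> = vec![0, 2];
    let coupons: Vec<u64> = bucket_ordinals.iter().map(|o| cache[o]).collect();
    assert_eq!(coupons, vec![coupon("apple"), coupon("cherry")]);
}
```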
2026-04-10 14:51:00 +02:00
alexanderbianchi
3cd9011f87 Make BucketEntries::iter, PercentileValuesVecEntry fields, and TopNComputer::threshold public (#2890)
These items need to be accessible from the tantivy-datafusion crate:
- BucketEntries::iter() for iterating aggregation bucket results
- PercentileValuesVecEntry.key/.value for reading percentile results
- TopNComputer.threshold for Block-WAND score pruning in the inverted index provider

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Paul Masurel <paul@quickwit.io>
2026-04-09 13:32:31 +02:00
Paul Masurel
d2c1b8bc2c Optimized intersection count using a bitset when the first leg is dense 2026-04-06 12:01:52 -04:00
nuri
a65107135a Use BinaryHeap for score-based top-K collection (#2881)
* Use BinaryHeap for score-based top-K collection

* Use peek_mut and add proptest for TopNHeap

---------

Co-authored-by: nryoo <nryoo@nryooui-MacBookPro.local>
2026-04-04 19:49:05 +02:00
Pascal Seitz
5c344db1bf chore: Release 2026-03-31 17:15:34 +08:00
Pascal Seitz
dc0f31554d unbump for release and update Changelog.md 2026-03-31 17:15:34 +08:00
trinity-1686a
a28ce3ee54 Merge pull request #2869 from quickwit-oss/trinity.pointard/maint
add dependabot cooldown
2026-03-31 09:52:22 +02:00
dependabot[bot]
3abc137bfe Update binggan requirement from 0.14.2 to 0.15.3 (#2870)
Updates the requirements on [binggan](https://github.com/pseitz/binggan) to permit the latest version.
- [Commits](https://github.com/pseitz/binggan/commits)

---
updated-dependencies:
- dependency-name: binggan
  dependency-version: 0.15.3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-31 07:59:02 +08:00
trinity Pointard
cf9800f981 add dependabot cooldown 2026-03-30 11:36:04 +02:00
PSeitz
129c40f8ec Improve Union Performance for non-score unions (#2863)
* enhance and_or_queries bench

* optimize unions for count/non-score, bitset fix for ARM

Benchmarks run on M4 Max
```
single_field_only_union_5%_OR_1%
count                Avg: 0.1100ms (-17.46%)    Median: 0.1079ms (-14.08%)    [0.1045ms .. 0.1410ms]    Output: 54_110
top10_inv_idx        Avg: 0.1663ms (+0.79%)     Median: 0.1660ms (+0.75%)     [0.1634ms .. 0.1702ms]    Output: 10
count+top10          Avg: 0.2639ms (-1.24%)     Median: 0.2634ms (-0.31%)     [0.2512ms .. 0.2813ms]    Output: 54_110
top10_by_ff          Avg: 0.2875ms (-8.67%)     Median: 0.2852ms (-8.80%)     [0.2737ms .. 0.3083ms]    Output: 10
top10_by_2ff         Avg: 0.3137ms (-5.79%)     Median: 0.3128ms (-0.35%)     [0.3044ms .. 0.3313ms]    Output: 10
single_field_only_union_5%_OR_1%_OR_15%
count                Avg: 0.4122ms (-33.05%)    Median: 0.4140ms (-32.20%)    [0.3940ms .. 0.4341ms]    Output: 181_663
top10_inv_idx        Avg: 0.3999ms (+2.39%)     Median: 0.3987ms (+2.02%)     [0.3939ms .. 0.4160ms]    Output: 10
count+top10          Avg: 0.8520ms (-8.63%)     Median: 0.8516ms (-8.65%)     [0.8413ms .. 0.8676ms]    Output: 181_663
top10_by_ff          Avg: 0.9694ms (-13.06%)    Median: 0.9645ms (-13.77%)    [0.9403ms .. 1.0122ms]    Output: 10
top10_by_2ff         Avg: 0.9880ms (-13.01%)    Median: 0.9838ms (-13.59%)    [0.9781ms .. 1.0306ms]    Output: 10
single_field_only_union_5%_OR_30%
count                Avg: 0.7364ms (-33.11%)    Median: 0.7347ms (-33.19%)    [0.7233ms .. 0.7547ms]    Output: 303_337
top10_inv_idx        Avg: 0.8932ms (-0.89%)     Median: 0.8919ms (-0.75%)     [0.8861ms .. 0.9249ms]    Output: 10
count+top10          Avg: 1.3611ms (-9.23%)     Median: 1.3598ms (-9.39%)     [1.3426ms .. 1.3891ms]    Output: 303_337
top10_by_ff          Avg: 1.6575ms (-18.64%)    Median: 1.6224ms (-20.81%)    [1.6051ms .. 1.7560ms]    Output: 10
top10_by_2ff         Avg: 1.6800ms (-16.24%)    Median: 1.6769ms (-15.72%)    [1.6661ms .. 1.7229ms]    Output: 10
single_field_only_union_30%_OR_0.01%
count                Avg: 0.6471ms (-33.73%)    Median: 0.6464ms (-33.46%)    [0.6375ms .. 0.6604ms]    Output: 270_268
top10_inv_idx        Avg: 0.0338ms (-0.27%)     Median: 0.0338ms (+0.11%)     [0.0331ms .. 0.0351ms]    Output: 10
count+top10          Avg: 1.2209ms (-9.27%)     Median: 1.2207ms (-9.25%)     [1.2158ms .. 1.2351ms]    Output: 270_268
top10_by_ff          Avg: 1.4808ms (-17.20%)    Median: 1.4690ms (-17.91%)    [1.4384ms .. 1.5553ms]    Output: 10
top10_by_2ff         Avg: 1.5011ms (-14.30%)    Median: 1.4992ms (-13.88%)    [1.4891ms .. 1.5320ms]    Output: 10
multi_field_only_union_5%_OR_1%
count                Avg: 0.1196ms (-17.67%)    Median: 0.1166ms (-14.83%)    [0.1123ms .. 0.1462ms]    Output: 60_183
top10_inv_idx        Avg: 0.2356ms (-0.21%)     Median: 0.2355ms (+0.23%)     [0.2330ms .. 0.2406ms]    Output: 10
count+top10          Avg: 0.2985ms (-5.06%)     Median: 0.2957ms (-5.79%)     [0.2875ms .. 0.3186ms]    Output: 60_183
top10_by_ff          Avg: 0.3102ms (-9.44%)     Median: 0.3031ms (-11.09%)    [0.2994ms .. 0.3324ms]    Output: 10
top10_by_2ff         Avg: 0.3435ms (-0.91%)     Median: 0.3447ms (-0.62%)     [0.3342ms .. 0.3530ms]    Output: 10
multi_field_only_union_5%_OR_1%_OR_15%
count                Avg: 0.4465ms (-35.41%)    Median: 0.4456ms (-36.25%)    [0.4250ms .. 0.4936ms]    Output: 201_114
top10_inv_idx        Avg: 1.1542ms (+2.38%)     Median: 1.1560ms (+2.96%)     [1.1193ms .. 1.1912ms]    Output: 10
count+top10          Avg: 0.9334ms (-8.89%)     Median: 0.9330ms (-8.95%)     [0.9191ms .. 0.9542ms]    Output: 201_114
top10_by_ff          Avg: 1.0590ms (-14.10%)    Median: 1.0424ms (-15.08%)    [1.0304ms .. 1.1174ms]    Output: 10
top10_by_2ff         Avg: 1.0779ms (-17.06%)    Median: 1.0754ms (-17.40%)    [1.0650ms .. 1.1155ms]    Output: 10
multi_field_only_union_5%_OR_30%
count                Avg: 0.8137ms (-33.48%)    Median: 0.7976ms (-34.84%)    [0.7734ms .. 1.0855ms]    Output: 335_682
top10_inv_idx        Avg: 1.5108ms (+0.36%)     Median: 1.4943ms (-0.72%)     [1.4805ms .. 1.5865ms]    Output: 10
count+top10          Avg: 1.4985ms (-9.75%)     Median: 1.4936ms (-9.63%)     [1.4784ms .. 1.5472ms]    Output: 335_682
top10_by_ff          Avg: 1.8531ms (-15.70%)    Median: 1.8583ms (-16.30%)    [1.7467ms .. 2.2297ms]    Output: 10
top10_by_2ff         Avg: 1.8735ms (-16.67%)    Median: 1.8421ms (-18.05%)    [1.8146ms .. 2.3650ms]    Output: 10
multi_field_only_union_30%_OR_0.01%
count                Avg: 0.7020ms (-34.40%)    Median: 0.7004ms (-34.05%)    [0.6943ms .. 0.7156ms]    Output: 300_315
top10_inv_idx        Avg: 0.1445ms (-1.57%)     Median: 0.1442ms (-1.35%)     [0.1426ms .. 0.1478ms]    Output: 10
count+top10          Avg: 1.3309ms (-9.84%)     Median: 1.3284ms (-9.71%)     [1.3234ms .. 1.3549ms]    Output: 300_315
top10_by_ff          Avg: 1.6152ms (-17.39%)    Median: 1.6037ms (-18.72%)    [1.5778ms .. 1.7227ms]    Output: 10
top10_by_2ff         Avg: 1.6479ms (-17.10%)    Median: 1.6444ms (-15.46%)    [1.6307ms .. 1.6901ms]    Output: 10
```

* add comment

* fix comment

* remove inline(never), bounds check
2026-03-27 08:00:26 +01:00
Charlie Tonneslan
a9535156b1 Fix clippy warnings: deprecated gen_range, manual div_ceil, legacy import (#2860)
- Replace deprecated rand::Rng::gen_range with random_range in benchmarks
- Use usize::div_ceil instead of manual (len + size - 1) / size
- Remove unused legacy std::i64 import
- Replace 'if let Some(_)' with '.is_some()'
2026-03-26 07:37:26 -04:00
PSeitz
993ef97814 update CHANGELOG for tantivy 0.26 release (#2857)
* update CHANGELOG for tantivy 0.26 release

* add CHANGELOG skill

Signed-off-by: Pascal Seitz <pascal.seitz@gmail.com>

* update CHANGELOG, add CHANGELOG skill

Signed-off-by: Pascal Seitz <pascal.seitz@gmail.com>

* use sketches from crates.io

* update lz4_flex

* update CHANGELOG.md

---------

Signed-off-by: Pascal Seitz <pascal.seitz@gmail.com>
2026-03-24 08:02:12 +01:00
nuri
3859cc8699 fix: deduplicate doc counts in term aggregation for multi-valued fields (#2854)
* fix: deduplicate doc counts in term aggregation for multi-valued fields

Term aggregation was counting term occurrences instead of documents
for multi-valued fields. A document with the same value appearing
multiple times would inflate doc_count.

Add `fetch_block_with_missing_unique_per_doc` to ColumnBlockAccessor
that deduplicates (doc_id, value) pairs, and use it in term aggregation.

Fixes #2721

* refactor: only deduplicate for multivalue cardinality

Duplicates can only occur with multivalue columns, so narrow the
check from !is_full() to is_multivalue().

* fix: handle non-consecutive duplicate values in dedup

Sort values within each doc_id group before deduplicating, so that
non-adjacent duplicates are correctly handled.

Add unit tests for dedup_docid_val_pairs: consecutive duplicates,
non-consecutive duplicates, multi-doc groups, no duplicates, and
single element.
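The dedup step described above can be sketched as follows (illustrative, not tantivy's exact code; it assumes the pairs arrive grouped by `doc_id`, as in a fetched block):

```rust
// Within each doc_id group, sort values so non-adjacent duplicates become
// adjacent, then dedup (doc_id, value) pairs.
fn dedup_docid_val_pairs(pairs: &mut Vec<(u32, u64)>) {
    // Early return when no consecutive doc_ids are equal: no multivalue entries.
    if pairs.windows(2).all(|w| w[0].0 != w[1].0) {
        return;
    }
    // Sort values within each doc_id group.
    let mut start = 0;
    for i in 1..=pairs.len() {
        if i == pairs.len() || pairs[i].0 != pairs[start].0 {
            pairs[start..i].sort_by_key(|&(_, v)| v);
            start = i;
        }
    }
    pairs.dedup();
}

fn main() {
    // doc 1 has value 7 twice, non-consecutively: counted once after dedup.
    let mut pairs = vec![(1, 7), (1, 3), (1, 7), (2, 3)];
    dedup_docid_val_pairs(&mut pairs);
    assert_eq!(pairs, vec![(1, 3), (1, 7), (2, 3)]);
}
```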

* perf: skip dedup when block has no multivalue entries

Add early return when no consecutive doc_ids are equal, avoiding
unnecessary sort and dedup passes. Remove the 2-element swap
optimization as it is not needed by the dedup algorithm.

---------

Co-authored-by: nryoo <nryoo@nryooui-MacBookPro.local>
2026-03-24 02:02:30 +01:00
Paul Masurel
545169c0d8 Composite agg merge (#2856)
Add composite aggregation

Co-authored-by: Remi Dettai <remi.dettai@sekoia.io>
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
2026-03-18 17:28:59 +01:00