mirror of
https://github.com/quickwit-oss/tantivy.git
synced 2026-03-31 09:30:45 +00:00
Compare commits
8 Commits
optim
...
dependabot
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
076fdf7407 | ||
|
|
a28ce3ee54 | ||
|
|
3abc137bfe | ||
|
|
cf9800f981 | ||
|
|
129c40f8ec | ||
|
|
a9535156b1 | ||
|
|
993ef97814 | ||
|
|
3859cc8699 |
87
.claude/skills/update-changelog/SKILL.md
Normal file
87
.claude/skills/update-changelog/SKILL.md
Normal file
@@ -0,0 +1,87 @@
|
||||
---
|
||||
name: update-changelog
|
||||
description: Update CHANGELOG.md with merged PRs since the last changelog update, categorized by type
|
||||
---
|
||||
|
||||
# Update Changelog
|
||||
|
||||
This skill updates CHANGELOG.md with merged PRs that aren't already listed.
|
||||
|
||||
## Step 1: Determine the changelog scope
|
||||
|
||||
Read `CHANGELOG.md` to identify the current unreleased version section at the top (e.g., `Tantivy 0.26 (Unreleased)`).
|
||||
|
||||
Collect all PR numbers already mentioned in the unreleased section by extracting `#NNNN` references.
|
||||
|
||||
## Step 2: Find merged PRs not yet in the changelog
|
||||
|
||||
Use `gh` to list recently merged PRs from the upstream repo:
|
||||
|
||||
```bash
|
||||
gh pr list --repo quickwit-oss/tantivy --state merged --limit 100 --json number,title,author,labels,mergedAt
|
||||
```
|
||||
|
||||
Filter out any PRs whose number already appears in the unreleased section of the changelog.
|
||||
|
||||
## Step 3: Consolidate related PRs
|
||||
|
||||
Before categorizing, group PRs that belong to the same logical change. This is critical for producing a clean changelog. Use PR descriptions, titles, cross-references, and the files touched to identify relationships.
|
||||
|
||||
**Merge follow-up PRs into the original:**
|
||||
- If a PR is a bugfix, refinement, or follow-up to another PR in the same unreleased cycle, combine them into a single changelog entry with multiple `[#N](url)` links.
|
||||
- Also consolidate PRs that touch the same feature area even if not explicitly linked — e.g., a PR fixing an edge case in a new API should be folded into the entry for the PR that introduced that API.
|
||||
|
||||
**Filter out bugfixes on unreleased features:**
|
||||
- If a bugfix PR fixes something introduced by another PR in the **same unreleased version**, it must NOT appear as a separate Bugfixes entry. Instead, silently fold it into the original feature/improvement entry. The changelog should describe the final shipped state, not the development history.
|
||||
- To detect this: check if the bugfix PR references or reverts changes from another PR in the same release cycle, or if it touches code that was newly added (not present in the previous release).
|
||||
|
||||
## Step 4: Review the actual code diff
|
||||
|
||||
**Do not rely on PR titles or descriptions alone.** For every candidate PR, run `gh pr diff <number> --repo quickwit-oss/tantivy` and read the actual changes. PR titles are often misleading — the diff is the source of truth.
|
||||
|
||||
**What to look for in the diff:**
|
||||
- Does it change observable behavior, public API surface, or performance characteristics?
|
||||
- Is the change something a user of the library would notice or need to know about?
|
||||
- Could the change break existing code (API changes, removed features)?
|
||||
|
||||
**Skip PRs where the diff reveals the change is not meaningful enough for the changelog** — e.g., cosmetic renames, trivial visibility tweaks, test-only changes, etc.
|
||||
|
||||
## Step 5: Categorize each PR group
|
||||
|
||||
For each PR (or consolidated group) that survived the diff review, determine its category:
|
||||
|
||||
- **Bugfixes** — fixes to behavior that existed in the **previous release**. NOT fixes to features introduced in this release cycle.
|
||||
- **Features/Improvements** — new features, API additions, new options, improvements that change user-facing behavior or add new capabilities.
|
||||
- **Performance** — optimizations, speed improvements, memory reductions. **If a PR adds new API whose primary purpose is enabling a performance optimization, categorize it as Performance, not Features.** The deciding question is: does a user benefit from this because of new functionality, or because things got faster/leaner? For example, a new trait method that exists solely to enable cheaper intersection ordering is Performance, not a Feature.
|
||||
|
||||
If a PR doesn't clearly fit any category (e.g., CI-only changes, internal refactors with no user-facing impact, dependency bumps with no behavior change), skip it — not everything belongs in the changelog.
|
||||
|
||||
When unclear, use your best judgment or ask the user.
|
||||
|
||||
## Step 6: Format entries
|
||||
|
||||
Each entry must follow this exact format:
|
||||
|
||||
```
|
||||
- Description [#NUMBER](https://github.com/quickwit-oss/tantivy/pull/NUMBER)(@author)
|
||||
```
|
||||
|
||||
Rules:
|
||||
- The description should be concise and describe the user-facing change (not the implementation). Describe the final shipped state, not the incremental development steps.
|
||||
- Use sub-categories with bold headers when multiple entries relate to the same area (e.g., `- **Aggregation**` with indented entries beneath). Follow the existing grouping style in the changelog.
|
||||
- Author is the GitHub username from the PR, prefixed with `@`. For consolidated entries, include all contributing authors.
|
||||
- For consolidated PRs, list all PR links in a single entry: `[#100](url) [#110](url)` (see existing entries for examples).
|
||||
|
||||
## Step 7: Present changes to the user
|
||||
|
||||
Show the user the proposed changelog entries grouped by category **before** editing the file. Ask for confirmation or adjustments.
|
||||
|
||||
## Step 8: Update CHANGELOG.md
|
||||
|
||||
Insert the new entries into the appropriate sections of the unreleased version block. If a section doesn't exist yet, create it following the order: Bugfixes, Features/Improvements, Performance.
|
||||
|
||||
Append new entries at the end of each section (before the next section header or version header).
|
||||
|
||||
## Step 9: Verify
|
||||
|
||||
Read back the updated unreleased section and display it to the user for final review.
|
||||
4
.github/dependabot.yml
vendored
4
.github/dependabot.yml
vendored
@@ -6,6 +6,8 @@ updates:
|
||||
interval: daily
|
||||
time: "20:00"
|
||||
open-pull-requests-limit: 10
|
||||
cooldown:
|
||||
default-days: 2
|
||||
|
||||
- package-ecosystem: "github-actions"
|
||||
directory: "/"
|
||||
@@ -13,3 +15,5 @@ updates:
|
||||
interval: daily
|
||||
time: "20:00"
|
||||
open-pull-requests-limit: 10
|
||||
cooldown:
|
||||
default-days: 2
|
||||
|
||||
2
.github/workflows/coverage.yml
vendored
2
.github/workflows/coverage.yml
vendored
@@ -21,7 +21,7 @@ jobs:
|
||||
- name: Generate code coverage
|
||||
run: cargo +nightly-2025-12-01 llvm-cov --all-features --workspace --doctests --lcov --output-path lcov.info
|
||||
- name: Upload coverage to Codecov
|
||||
uses: codecov/codecov-action@v3
|
||||
uses: codecov/codecov-action@v6
|
||||
continue-on-error: true
|
||||
with:
|
||||
token: ${{ secrets.CODECOV_TOKEN }} # not required for public repos
|
||||
|
||||
48
CHANGELOG.md
48
CHANGELOG.md
@@ -1,3 +1,51 @@
|
||||
Tantivy 0.26 (Unreleased)
|
||||
================================
|
||||
|
||||
## Bugfixes
|
||||
- Align float query coercion during search with the columnar coercion rules [#2692](https://github.com/quickwit-oss/tantivy/pull/2692)(@fulmicoton)
|
||||
- Fix lenient elastic range queries with trailing closing parentheses [#2816](https://github.com/quickwit-oss/tantivy/pull/2816)(@evance-br)
|
||||
- Fix intersection `seek()` advancing below current doc id [#2812](https://github.com/quickwit-oss/tantivy/pull/2812)(@fulmicoton)
|
||||
- Fix phrase query prefixed with `*` [#2751](https://github.com/quickwit-oss/tantivy/pull/2751)(@Darkheir)
|
||||
- Fix `vint` buffer overflow during index creation [#2778](https://github.com/quickwit-oss/tantivy/pull/2778)(@rebasedming)
|
||||
- Fix integer overflow in `ExpUnrolledLinkedList` for large datasets [#2735](https://github.com/quickwit-oss/tantivy/pull/2735)(@mdashti)
|
||||
- Fix integer overflow in segment sorting and merge policy truncation [#2846](https://github.com/quickwit-oss/tantivy/pull/2846)(@anaslimem)
|
||||
- Fix merging of intermediate aggregation results [#2719](https://github.com/quickwit-oss/tantivy/pull/2719)(@PSeitz)
|
||||
- Fix deduplicate doc counts in term aggregation for multi-valued fields [#2854](https://github.com/quickwit-oss/tantivy/pull/2854)(@nuri-yoo)
|
||||
|
||||
## Features/Improvements
|
||||
- **Aggregation**
|
||||
- Add filter aggregation [#2711](https://github.com/quickwit-oss/tantivy/pull/2711)(@mdashti)
|
||||
- Add include/exclude filtering for term aggregations [#2717](https://github.com/quickwit-oss/tantivy/pull/2717)(@PSeitz)
|
||||
- Add public accessors for intermediate aggregation results [#2829](https://github.com/quickwit-oss/tantivy/pull/2829)(@congx4)
|
||||
- Replace HyperLogLog++ with Apache DataSketches HLL for cardinality aggregation [#2837](https://github.com/quickwit-oss/tantivy/pull/2837) [#2842](https://github.com/quickwit-oss/tantivy/pull/2842)(@congx4)
|
||||
- Add composite aggregation [#2856](https://github.com/quickwit-oss/tantivy/pull/2856)(@fulmicoton)
|
||||
- **Fast Fields**
|
||||
- Add fast field fallback for `TermQuery` when the field is not indexed [#2693](https://github.com/quickwit-oss/tantivy/pull/2693)(@PSeitz-dd)
|
||||
- Add fast field support for `Bytes` values [#2830](https://github.com/quickwit-oss/tantivy/pull/2830)(@mdashti)
|
||||
- **Query Parser**
|
||||
- Add support for regexes in the query grammar [#2677](https://github.com/quickwit-oss/tantivy/pull/2677) [#2818](https://github.com/quickwit-oss/tantivy/pull/2818)(@Darkheir)
|
||||
- Deduplicate queries in query parser [#2698](https://github.com/quickwit-oss/tantivy/pull/2698)(@PSeitz-dd)
|
||||
- Add erased `SortKeyComputer` for sorting on column types unknown until runtime [#2770](https://github.com/quickwit-oss/tantivy/pull/2770) [#2790](https://github.com/quickwit-oss/tantivy/pull/2790)(@stuhood @PSeitz)
|
||||
- Add natural-order-with-none-highest support in `TopDocs::order_by` [#2780](https://github.com/quickwit-oss/tantivy/pull/2780)(@stuhood)
|
||||
- Move stemming behing `stemmer` feature flag [#2791](https://github.com/quickwit-oss/tantivy/pull/2791)(@fulmicoton)
|
||||
- Make `DeleteMeta`, `AddOperation`, `advance_deletes`, `with_max_doc`, `serializer` module, and `delete_queue` public [#2762](https://github.com/quickwit-oss/tantivy/pull/2762) [#2765](https://github.com/quickwit-oss/tantivy/pull/2765) [#2766](https://github.com/quickwit-oss/tantivy/pull/2766) [#2835](https://github.com/quickwit-oss/tantivy/pull/2835)(@philippemnoel @PSeitz)
|
||||
- Make `Language` hashable [#2763](https://github.com/quickwit-oss/tantivy/pull/2763)(@philippemnoel)
|
||||
- Improve `space_usage` reporting for JSON fields and columnar data [#2761](https://github.com/quickwit-oss/tantivy/pull/2761)(@PSeitz-dd)
|
||||
- Split `Term` into `Term` and `IndexingTerm` [#2744](https://github.com/quickwit-oss/tantivy/pull/2744) [#2750](https://github.com/quickwit-oss/tantivy/pull/2750)(@PSeitz-dd @PSeitz)
|
||||
|
||||
## Performance
|
||||
- **Aggregation**
|
||||
- Large speed up and memory reduction for nested high cardinality aggregations by using one collector per request instead of one per bucket, and adding `PagedTermMap` for faster medium cardinality term aggregations [#2715](https://github.com/quickwit-oss/tantivy/pull/2715) [#2759](https://github.com/quickwit-oss/tantivy/pull/2759)(@PSeitz @PSeitz-dd)
|
||||
- Optimize low-cardinality term aggregations by using a `Vec` instead of a `HashMap` [#2740](https://github.com/quickwit-oss/tantivy/pull/2740)(@fulmicoton-dd)
|
||||
- Optimize `ExistsQuery` for a high number of dynamic columns [#2694](https://github.com/quickwit-oss/tantivy/pull/2694)(@PSeitz-dd)
|
||||
- Add lazy scorers to stop score evaluation early when a doc won't reach the top-K threshold [#2726](https://github.com/quickwit-oss/tantivy/pull/2726) [#2777](https://github.com/quickwit-oss/tantivy/pull/2777)(@fulmicoton @stuhood)
|
||||
- Add `DocSet::cost()` and use it to order scorers in intersections [#2707](https://github.com/quickwit-oss/tantivy/pull/2707)(@PSeitz)
|
||||
- Add `collect_block` support for collector wrappers [#2727](https://github.com/quickwit-oss/tantivy/pull/2727)(@stuhood)
|
||||
- Optimize saturated posting lists by replacing them with `AllScorer` in boolean queries [#2745](https://github.com/quickwit-oss/tantivy/pull/2745) [#2760](https://github.com/quickwit-oss/tantivy/pull/2760) [#2774](https://github.com/quickwit-oss/tantivy/pull/2774)(@fulmicoton @mdashti @trinity-1686a)
|
||||
- Add `seek_danger` on `DocSet` for more efficient intersections [#2538](https://github.com/quickwit-oss/tantivy/pull/2538) [#2810](https://github.com/quickwit-oss/tantivy/pull/2810)(@PSeitz @stuhood @fulmicoton)
|
||||
- Skip column traversal in `RangeDocSet` when query range does not overlap with column bounds [#2783](https://github.com/quickwit-oss/tantivy/pull/2783)(@ChangRui-Ryan)
|
||||
- Speed up exclude queries by supporting multiple excluded `DocSet`s without intermediate union [#2825](https://github.com/quickwit-oss/tantivy/pull/2825)(@PSeitz)
|
||||
|
||||
Tantivy 0.25
|
||||
================================
|
||||
|
||||
|
||||
@@ -27,7 +27,7 @@ regex = { version = "1.5.5", default-features = false, features = [
|
||||
aho-corasick = "1.0"
|
||||
tantivy-fst = "0.5"
|
||||
memmap2 = { version = "0.9.0", optional = true }
|
||||
lz4_flex = { version = "0.12", default-features = false, optional = true }
|
||||
lz4_flex = { version = "0.13", default-features = false, optional = true }
|
||||
zstd = { version = "0.13", optional = true, default-features = false }
|
||||
tempfile = { version = "3.12.0", optional = true }
|
||||
log = "0.4.16"
|
||||
@@ -64,7 +64,7 @@ query-grammar = { version = "0.25.0", path = "./query-grammar", package = "tanti
|
||||
tantivy-bitpacker = { version = "0.9", path = "./bitpacker" }
|
||||
common = { version = "0.10", path = "./common/", package = "tantivy-common" }
|
||||
tokenizer-api = { version = "0.6", path = "./tokenizer-api", package = "tantivy-tokenizer-api" }
|
||||
sketches-ddsketch = { git = "https://github.com/quickwit-oss/rust-sketches-ddsketch.git", rev = "555caf1", features = ["use_serde"] }
|
||||
sketches-ddsketch = { version = "0.4", features = ["use_serde"] }
|
||||
datasketches = "0.2.0"
|
||||
futures-util = { version = "0.3.28", optional = true }
|
||||
futures-channel = { version = "0.3.28", optional = true }
|
||||
@@ -75,7 +75,7 @@ typetag = "0.2.21"
|
||||
winapi = "0.3.9"
|
||||
|
||||
[dev-dependencies]
|
||||
binggan = "0.14.2"
|
||||
binggan = "0.15.3"
|
||||
rand = "0.9"
|
||||
maplit = "1.0.2"
|
||||
matches = "0.1.9"
|
||||
|
||||
@@ -22,7 +22,7 @@ use rand::rngs::StdRng;
|
||||
use rand::SeedableRng;
|
||||
use tantivy::collector::sort_key::SortByStaticFastValue;
|
||||
use tantivy::collector::{Collector, Count, TopDocs};
|
||||
use tantivy::query::{Query, QueryParser};
|
||||
use tantivy::query::QueryParser;
|
||||
use tantivy::schema::{Schema, FAST, TEXT};
|
||||
use tantivy::{doc, Index, Order, ReloadPolicy, Searcher};
|
||||
|
||||
@@ -38,7 +38,7 @@ struct BenchIndex {
|
||||
/// return two BenchIndex views:
|
||||
/// - single_field: QueryParser defaults to only "body"
|
||||
/// - multi_field: QueryParser defaults to ["title", "body"]
|
||||
fn build_shared_indices(num_docs: usize, p_a: f32, p_b: f32, p_c: f32) -> (BenchIndex, BenchIndex) {
|
||||
fn build_index(num_docs: usize, terms: &[(&str, f32)]) -> (BenchIndex, BenchIndex) {
|
||||
// Unified schema (two text fields)
|
||||
let mut schema_builder = Schema::builder();
|
||||
let f_title = schema_builder.add_text_field("title", TEXT);
|
||||
@@ -55,32 +55,17 @@ fn build_shared_indices(num_docs: usize, p_a: f32, p_b: f32, p_c: f32) -> (Bench
|
||||
{
|
||||
let mut writer = index.writer_with_num_threads(1, 500_000_000).unwrap();
|
||||
for _ in 0..num_docs {
|
||||
let has_a = rng.random_bool(p_a as f64);
|
||||
let has_b = rng.random_bool(p_b as f64);
|
||||
let has_c = rng.random_bool(p_c as f64);
|
||||
let score = rng.random_range(0u64..100u64);
|
||||
let score2 = rng.random_range(0u64..100_000u64);
|
||||
let mut title_tokens: Vec<&str> = Vec::new();
|
||||
let mut body_tokens: Vec<&str> = Vec::new();
|
||||
if has_a {
|
||||
if rng.random_bool(0.1) {
|
||||
title_tokens.push("a");
|
||||
} else {
|
||||
body_tokens.push("a");
|
||||
}
|
||||
}
|
||||
if has_b {
|
||||
if rng.random_bool(0.1) {
|
||||
title_tokens.push("b");
|
||||
} else {
|
||||
body_tokens.push("b");
|
||||
}
|
||||
}
|
||||
if has_c {
|
||||
if rng.random_bool(0.1) {
|
||||
title_tokens.push("c");
|
||||
} else {
|
||||
body_tokens.push("c");
|
||||
for &(tok, prob) in terms {
|
||||
if rng.random_bool(prob as f64) {
|
||||
if rng.random_bool(0.1) {
|
||||
title_tokens.push(tok);
|
||||
} else {
|
||||
body_tokens.push(tok);
|
||||
}
|
||||
}
|
||||
}
|
||||
if title_tokens.is_empty() && body_tokens.is_empty() {
|
||||
@@ -110,59 +95,97 @@ fn build_shared_indices(num_docs: usize, p_a: f32, p_b: f32, p_c: f32) -> (Bench
|
||||
let qp_single = QueryParser::for_index(&index, vec![f_body]);
|
||||
let qp_multi = QueryParser::for_index(&index, vec![f_title, f_body]);
|
||||
|
||||
let single_view = BenchIndex {
|
||||
let only_title = BenchIndex {
|
||||
index: index.clone(),
|
||||
searcher: searcher.clone(),
|
||||
query_parser: qp_single,
|
||||
};
|
||||
let multi_view = BenchIndex {
|
||||
let title_and_body = BenchIndex {
|
||||
index,
|
||||
searcher,
|
||||
query_parser: qp_multi,
|
||||
};
|
||||
(single_view, multi_view)
|
||||
(only_title, title_and_body)
|
||||
}
|
||||
|
||||
fn format_pct(p: f32) -> String {
|
||||
let pct = (p as f64) * 100.0;
|
||||
let rounded = (pct * 1_000_000.0).round() / 1_000_000.0;
|
||||
if rounded.fract() <= 0.001 {
|
||||
format!("{}%", rounded as u64)
|
||||
} else {
|
||||
format!("{}%", rounded)
|
||||
}
|
||||
}
|
||||
|
||||
fn query_label(query_str: &str, term_pcts: &[(&str, String)]) -> String {
|
||||
let mut label = query_str.to_string();
|
||||
for (term, pct) in term_pcts {
|
||||
label = label.replace(term, pct);
|
||||
}
|
||||
label.replace(' ', "_")
|
||||
}
|
||||
|
||||
fn main() {
|
||||
// Prepare corpora with varying selectivity. Build one index per corpus
|
||||
// and derive two views (single-field vs multi-field) from it.
|
||||
let scenarios = vec![
|
||||
// terms with varying selectivity, ordered from rarest to most common.
|
||||
// With 1M docs, we expect:
|
||||
// a: 0.01% (100), b: 1% (10k), c: 5% (50k), d: 15% (150k), e: 30% (300k)
|
||||
let num_docs = 1_000_000;
|
||||
let terms: &[(&str, f32)] = &[
|
||||
("a", 0.0001),
|
||||
("b", 0.01),
|
||||
("c", 0.05),
|
||||
("d", 0.15),
|
||||
("e", 0.30),
|
||||
];
|
||||
|
||||
let queries: &[(&str, &[&str])] = &[
|
||||
(
|
||||
"N=1M, p(a)=5%, p(b)=1%, p(c)=15%".to_string(),
|
||||
1_000_000,
|
||||
0.05,
|
||||
0.01,
|
||||
0.15,
|
||||
"only_union",
|
||||
&["c OR b", "c OR b OR d", "c OR e", "e OR a"] as &[&str],
|
||||
),
|
||||
(
|
||||
"N=1M, p(a)=1%, p(b)=1%, p(c)=15%".to_string(),
|
||||
1_000_000,
|
||||
0.01,
|
||||
0.01,
|
||||
0.15,
|
||||
"only_intersection",
|
||||
&["+c +b", "+c +b +d", "+c +e", "+e +a"] as &[&str],
|
||||
),
|
||||
(
|
||||
"union_intersection",
|
||||
&["+c +(b OR d)", "+e +(c OR a)", "+(c OR b) +(d OR e)"] as &[&str],
|
||||
),
|
||||
];
|
||||
|
||||
let queries = &["a", "+a +b", "+a +b +c", "a OR b", "a OR b OR c"];
|
||||
|
||||
let mut runner = BenchRunner::new();
|
||||
for (label, n, pa, pb, pc) in scenarios {
|
||||
let (single_view, multi_view) = build_shared_indices(n, pa, pb, pc);
|
||||
let (only_title, title_and_body) = build_index(num_docs, terms);
|
||||
let term_pcts: Vec<(&str, String)> = terms
|
||||
.iter()
|
||||
.map(|&(term, p)| (term, format_pct(p)))
|
||||
.collect();
|
||||
|
||||
for (view_name, bench_index) in [("single_field", single_view), ("multi_field", multi_view)]
|
||||
{
|
||||
// Single-field group: default field is body only
|
||||
let mut group = runner.new_group();
|
||||
group.set_name(format!("{} — {}", view_name, label));
|
||||
for query_str in queries {
|
||||
for (view_name, bench_index) in [
|
||||
("single_field", only_title),
|
||||
("multi_field", title_and_body),
|
||||
] {
|
||||
for (category_name, category_queries) in queries {
|
||||
for query_str in *category_queries {
|
||||
let mut group = runner.new_group();
|
||||
let query_label = query_label(query_str, &term_pcts);
|
||||
group.set_name(format!("{}_{}_{}", view_name, category_name, query_label));
|
||||
add_bench_task(&mut group, &bench_index, query_str, Count, "count");
|
||||
add_bench_task(
|
||||
&mut group,
|
||||
&bench_index,
|
||||
query_str,
|
||||
TopDocs::with_limit(10).order_by_score(),
|
||||
"top10",
|
||||
"top10_inv_idx",
|
||||
);
|
||||
add_bench_task(
|
||||
&mut group,
|
||||
&bench_index,
|
||||
query_str,
|
||||
(Count, TopDocs::with_limit(10).order_by_score()),
|
||||
"count+top10",
|
||||
);
|
||||
|
||||
add_bench_task(
|
||||
&mut group,
|
||||
&bench_index,
|
||||
@@ -180,39 +203,47 @@ fn main() {
|
||||
)),
|
||||
"top10_by_2ff",
|
||||
);
|
||||
|
||||
group.run();
|
||||
}
|
||||
group.run();
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
trait FruitCount {
|
||||
fn count(&self) -> usize;
|
||||
}
|
||||
|
||||
impl FruitCount for usize {
|
||||
fn count(&self) -> usize {
|
||||
*self
|
||||
}
|
||||
}
|
||||
|
||||
impl<T> FruitCount for Vec<T> {
|
||||
fn count(&self) -> usize {
|
||||
self.len()
|
||||
}
|
||||
}
|
||||
|
||||
impl<A: FruitCount, B> FruitCount for (A, B) {
|
||||
fn count(&self) -> usize {
|
||||
self.0.count()
|
||||
}
|
||||
}
|
||||
|
||||
fn add_bench_task<C: Collector + 'static>(
|
||||
bench_group: &mut BenchGroup,
|
||||
bench_index: &BenchIndex,
|
||||
query_str: &str,
|
||||
collector: C,
|
||||
collector_name: &str,
|
||||
) {
|
||||
let task_name = format!("{}_{}", query_str.replace(" ", "_"), collector_name);
|
||||
) where
|
||||
C::Fruit: FruitCount,
|
||||
{
|
||||
let query = bench_index.query_parser.parse_query(query_str).unwrap();
|
||||
let search_task = SearchTask {
|
||||
searcher: bench_index.searcher.clone(),
|
||||
collector,
|
||||
query,
|
||||
};
|
||||
bench_group.register(task_name, move |_| black_box(search_task.run()));
|
||||
}
|
||||
|
||||
struct SearchTask<C: Collector> {
|
||||
searcher: Searcher,
|
||||
collector: C,
|
||||
query: Box<dyn Query>,
|
||||
}
|
||||
|
||||
impl<C: Collector> SearchTask<C> {
|
||||
#[inline(never)]
|
||||
pub fn run(&self) -> usize {
|
||||
self.searcher.search(&self.query, &self.collector).unwrap();
|
||||
1
|
||||
}
|
||||
let searcher = bench_index.searcher.clone();
|
||||
bench_group.register(collector_name.to_string(), move |_| {
|
||||
black_box(searcher.search(&query, &collector).unwrap().count())
|
||||
});
|
||||
}
|
||||
|
||||
@@ -45,7 +45,7 @@ fn build_shared_indices(num_docs: usize, distribution: &str) -> BenchIndex {
|
||||
match distribution {
|
||||
"dense_random" => {
|
||||
for _doc_id in 0..num_docs {
|
||||
let suffix = rng.gen_range(0u64..1000u64);
|
||||
let suffix = rng.random_range(0u64..1000u64);
|
||||
let str_val = format!("str_{:03}", suffix);
|
||||
|
||||
writer
|
||||
@@ -71,7 +71,7 @@ fn build_shared_indices(num_docs: usize, distribution: &str) -> BenchIndex {
|
||||
}
|
||||
"sparse_random" => {
|
||||
for _doc_id in 0..num_docs {
|
||||
let suffix = rng.gen_range(0u64..1000000u64);
|
||||
let suffix = rng.random_range(0u64..1000000u64);
|
||||
let str_val = format!("str_{:07}", suffix);
|
||||
|
||||
writer
|
||||
|
||||
@@ -23,7 +23,7 @@ downcast-rs = "2.0.1"
|
||||
proptest = "1"
|
||||
more-asserts = "0.3.1"
|
||||
rand = "0.9"
|
||||
binggan = "0.14.0"
|
||||
binggan = "0.15.3"
|
||||
|
||||
[[bench]]
|
||||
name = "bench_merge"
|
||||
|
||||
@@ -58,6 +58,78 @@ impl<T: PartialOrd + Copy + std::fmt::Debug + Send + Sync + 'static + Default>
|
||||
}
|
||||
}
|
||||
|
||||
/// Like `fetch_block_with_missing`, but deduplicates (doc_id, value) pairs
|
||||
/// so that each unique value per document is returned only once.
|
||||
///
|
||||
/// This is necessary for correct document counting in aggregations,
|
||||
/// where multi-valued fields can produce duplicate entries that inflate counts.
|
||||
#[inline]
|
||||
pub fn fetch_block_with_missing_unique_per_doc(
|
||||
&mut self,
|
||||
docs: &[u32],
|
||||
accessor: &Column<T>,
|
||||
missing: Option<T>,
|
||||
) where
|
||||
T: Ord,
|
||||
{
|
||||
self.fetch_block_with_missing(docs, accessor, missing);
|
||||
if accessor.index.get_cardinality().is_multivalue() {
|
||||
self.dedup_docid_val_pairs();
|
||||
}
|
||||
}
|
||||
|
||||
/// Removes duplicate (doc_id, value) pairs from the caches.
|
||||
///
|
||||
/// After `fetch_block`, entries are sorted by doc_id, but values within
|
||||
/// the same doc may not be sorted (e.g. `(0,1), (0,2), (0,1)`).
|
||||
/// We group consecutive entries by doc_id, sort values within each group
|
||||
/// if it has more than 2 elements, then deduplicate adjacent pairs.
|
||||
///
|
||||
/// Skips entirely if no doc_id appears more than once in the block.
|
||||
fn dedup_docid_val_pairs(&mut self)
|
||||
where T: Ord {
|
||||
if self.docid_cache.len() <= 1 {
|
||||
return;
|
||||
}
|
||||
|
||||
// Quick check: if no consecutive doc_ids are equal, no dedup needed.
|
||||
let has_multivalue = self.docid_cache.windows(2).any(|w| w[0] == w[1]);
|
||||
if !has_multivalue {
|
||||
return;
|
||||
}
|
||||
|
||||
// Sort values within each doc_id group so duplicates become adjacent.
|
||||
let mut start = 0;
|
||||
while start < self.docid_cache.len() {
|
||||
let doc = self.docid_cache[start];
|
||||
let mut end = start + 1;
|
||||
while end < self.docid_cache.len() && self.docid_cache[end] == doc {
|
||||
end += 1;
|
||||
}
|
||||
if end - start > 2 {
|
||||
self.val_cache[start..end].sort();
|
||||
}
|
||||
start = end;
|
||||
}
|
||||
|
||||
// Now duplicates are adjacent — deduplicate in place.
|
||||
let mut write = 0;
|
||||
for read in 1..self.docid_cache.len() {
|
||||
if self.docid_cache[read] != self.docid_cache[write]
|
||||
|| self.val_cache[read] != self.val_cache[write]
|
||||
{
|
||||
write += 1;
|
||||
if write != read {
|
||||
self.docid_cache[write] = self.docid_cache[read];
|
||||
self.val_cache[write] = self.val_cache[read];
|
||||
}
|
||||
}
|
||||
}
|
||||
let new_len = write + 1;
|
||||
self.docid_cache.truncate(new_len);
|
||||
self.val_cache.truncate(new_len);
|
||||
}
|
||||
|
||||
#[inline]
|
||||
pub fn iter_vals(&self) -> impl Iterator<Item = T> + '_ {
|
||||
self.val_cache.iter().cloned()
|
||||
@@ -163,4 +235,56 @@ mod tests {
|
||||
|
||||
assert_eq!(missing_docs, vec![1, 2, 3, 4, 5]);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_dedup_docid_val_pairs_consecutive() {
|
||||
let mut accessor = ColumnBlockAccessor::<u64>::default();
|
||||
accessor.docid_cache = vec![0, 0, 2, 3];
|
||||
accessor.val_cache = vec![10, 10, 10, 10];
|
||||
accessor.dedup_docid_val_pairs();
|
||||
assert_eq!(accessor.docid_cache, vec![0, 2, 3]);
|
||||
assert_eq!(accessor.val_cache, vec![10, 10, 10]);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_dedup_docid_val_pairs_non_consecutive() {
|
||||
// (0,1), (0,2), (0,1) — duplicate value not adjacent
|
||||
let mut accessor = ColumnBlockAccessor::<u64>::default();
|
||||
accessor.docid_cache = vec![0, 0, 0];
|
||||
accessor.val_cache = vec![1, 2, 1];
|
||||
accessor.dedup_docid_val_pairs();
|
||||
assert_eq!(accessor.docid_cache, vec![0, 0]);
|
||||
assert_eq!(accessor.val_cache, vec![1, 2]);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_dedup_docid_val_pairs_multi_doc() {
|
||||
// doc 0: values [3, 1, 3], doc 1: values [5, 5]
|
||||
let mut accessor = ColumnBlockAccessor::<u64>::default();
|
||||
accessor.docid_cache = vec![0, 0, 0, 1, 1];
|
||||
accessor.val_cache = vec![3, 1, 3, 5, 5];
|
||||
accessor.dedup_docid_val_pairs();
|
||||
assert_eq!(accessor.docid_cache, vec![0, 0, 1]);
|
||||
assert_eq!(accessor.val_cache, vec![1, 3, 5]);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_dedup_docid_val_pairs_no_duplicates() {
|
||||
let mut accessor = ColumnBlockAccessor::<u64>::default();
|
||||
accessor.docid_cache = vec![0, 0, 1];
|
||||
accessor.val_cache = vec![1, 2, 3];
|
||||
accessor.dedup_docid_val_pairs();
|
||||
assert_eq!(accessor.docid_cache, vec![0, 0, 1]);
|
||||
assert_eq!(accessor.val_cache, vec![1, 2, 3]);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_dedup_docid_val_pairs_single_element() {
|
||||
let mut accessor = ColumnBlockAccessor::<u64>::default();
|
||||
accessor.docid_cache = vec![0];
|
||||
accessor.val_cache = vec![1];
|
||||
accessor.dedup_docid_val_pairs();
|
||||
assert_eq!(accessor.docid_cache, vec![0]);
|
||||
assert_eq!(accessor.val_cache, vec![1]);
|
||||
}
|
||||
}
|
||||
|
||||
@@ -19,6 +19,6 @@ time = { version = "0.3.47", features = ["serde-well-known"] }
|
||||
serde = { version = "1.0.136", features = ["derive"] }
|
||||
|
||||
[dev-dependencies]
|
||||
binggan = "0.14.0"
|
||||
binggan = "0.15.3"
|
||||
proptest = "1.0.0"
|
||||
rand = "0.9"
|
||||
|
||||
@@ -153,7 +153,22 @@ impl TinySet {
|
||||
None
|
||||
} else {
|
||||
let lowest = self.0.trailing_zeros();
|
||||
self.0 ^= TinySet::singleton(lowest).0;
|
||||
// Kernighan's trick: `n &= n - 1` clears the lowest set bit
|
||||
// without depending on `lowest`. This lets the CPU execute
|
||||
// `trailing_zeros` and the bit-clear in parallel instead of
|
||||
// serializing them.
|
||||
//
|
||||
// The previous form `self.0 ^= 1 << lowest` needs the result of
|
||||
// `trailing_zeros` before it can shift, creating a dependency chain:
|
||||
// ARM64: rbit → clz → lsl → eor
|
||||
// x86: tzcnt → btc
|
||||
//
|
||||
// With Kernighan's trick the clear path is independent of the count:
|
||||
// ARM64: sub → and (trailing_zeros runs in parallel)
|
||||
// x86: blsr (tzcnt runs in parallel)
|
||||
//
|
||||
// https://godbolt.org/z/fnfrP1T5f
|
||||
self.0 &= self.0 - 1;
|
||||
Some(lowest)
|
||||
}
|
||||
}
|
||||
|
||||
@@ -54,8 +54,6 @@ fn month_bucket_using_time_crate(timestamp_ns: i64) -> Result<i64, time::Error>
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use std::i64;
|
||||
|
||||
use time::format_description::well_known::Iso8601;
|
||||
use time::UtcDateTime;
|
||||
|
||||
|
||||
@@ -533,7 +533,7 @@ mod tests {
|
||||
let expected_buckets_vec = expected_buckets.as_array().unwrap();
|
||||
|
||||
for page_size in 1..=expected_buckets_vec.len() {
|
||||
let page_count = (expected_buckets_vec.len() + page_size - 1) / page_size;
|
||||
let page_count = expected_buckets_vec.len().div_ceil(page_size);
|
||||
let mut after_key = None;
|
||||
for page_idx in 0..page_count {
|
||||
let mut agg_req_json = json!({
|
||||
@@ -565,7 +565,7 @@ mod tests {
|
||||
"expected after_key on all but last page"
|
||||
);
|
||||
after_key = Some(res["my_composite"]["after_key"].clone());
|
||||
} else if let Some(_) = res["my_composite"].get("after_key") {
|
||||
} else if res["my_composite"].get("after_key").is_some() {
|
||||
// currently we sometime have an after_key on the last page,
|
||||
// check that the next "page" is empty
|
||||
let agg_req_json = json!({
|
||||
|
||||
@@ -807,11 +807,13 @@ impl<TermMap: TermAggregationMap, C: SubAggCache> SegmentAggregationCollector
|
||||
|
||||
let req_data = &mut self.terms_req_data;
|
||||
|
||||
agg_data.column_block_accessor.fetch_block_with_missing(
|
||||
docs,
|
||||
&req_data.accessor,
|
||||
req_data.missing_value_for_accessor,
|
||||
);
|
||||
agg_data
|
||||
.column_block_accessor
|
||||
.fetch_block_with_missing_unique_per_doc(
|
||||
docs,
|
||||
&req_data.accessor,
|
||||
req_data.missing_value_for_accessor,
|
||||
);
|
||||
|
||||
if let Some(sub_agg) = &mut self.sub_agg {
|
||||
let term_buckets = &mut self.parent_buckets[parent_bucket_id as usize];
|
||||
@@ -2347,7 +2349,7 @@ mod tests {
|
||||
|
||||
// text field
|
||||
assert_eq!(res["my_texts"]["buckets"][0]["key"], "Hello Hello");
|
||||
assert_eq!(res["my_texts"]["buckets"][0]["doc_count"], 5);
|
||||
assert_eq!(res["my_texts"]["buckets"][0]["doc_count"], 4);
|
||||
assert_eq!(res["my_texts"]["buckets"][1]["key"], "Empty");
|
||||
assert_eq!(res["my_texts"]["buckets"][1]["doc_count"], 2);
|
||||
assert_eq!(
|
||||
@@ -2356,7 +2358,7 @@ mod tests {
|
||||
);
|
||||
// text field with number as missing fallback
|
||||
assert_eq!(res["my_texts2"]["buckets"][0]["key"], "Hello Hello");
|
||||
assert_eq!(res["my_texts2"]["buckets"][0]["doc_count"], 5);
|
||||
assert_eq!(res["my_texts2"]["buckets"][0]["doc_count"], 4);
|
||||
assert_eq!(res["my_texts2"]["buckets"][1]["key"], 1337.0);
|
||||
assert_eq!(res["my_texts2"]["buckets"][1]["doc_count"], 2);
|
||||
assert_eq!(
|
||||
@@ -2370,7 +2372,7 @@ mod tests {
|
||||
assert_eq!(res["my_ids"]["buckets"][0]["key"], 1337.0);
|
||||
assert_eq!(res["my_ids"]["buckets"][0]["doc_count"], 4);
|
||||
assert_eq!(res["my_ids"]["buckets"][1]["key"], 1.0);
|
||||
assert_eq!(res["my_ids"]["buckets"][1]["doc_count"], 3);
|
||||
assert_eq!(res["my_ids"]["buckets"][1]["doc_count"], 2);
|
||||
assert_eq!(res["my_ids"]["buckets"][2]["key"], serde_json::Value::Null);
|
||||
|
||||
Ok(())
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
use common::TinySet;
|
||||
|
||||
use crate::docset::{DocSet, SeekDangerResult, TERMINATED};
|
||||
use crate::docset::{DocSet, SeekDangerResult, COLLECT_BLOCK_BUFFER_LEN, TERMINATED};
|
||||
use crate::query::score_combiner::{DoNothingCombiner, ScoreCombiner};
|
||||
use crate::query::size_hint::estimate_union;
|
||||
use crate::query::Scorer;
|
||||
@@ -172,6 +172,46 @@ where
|
||||
self.doc
|
||||
}
|
||||
|
||||
fn fill_buffer(&mut self, buffer: &mut [DocId; COLLECT_BLOCK_BUFFER_LEN]) -> usize {
|
||||
if self.doc == TERMINATED {
|
||||
return 0;
|
||||
}
|
||||
// The current doc (self.doc) has already been popped from the bitsets,
|
||||
// so the loop below won't yield it. Emit it here first.
|
||||
buffer[0] = self.doc;
|
||||
let mut count = 1;
|
||||
|
||||
loop {
|
||||
// Drain docs directly from the pre-computed bitsets.
|
||||
while self.bucket_idx < HORIZON_NUM_TINYBITSETS {
|
||||
// Move bitset to a local variable to avoid read/store on self.bitsets while
|
||||
// iterating through the bits.
|
||||
let mut tinyset: TinySet = self.bitsets[self.bucket_idx];
|
||||
|
||||
while let Some(val) = tinyset.pop_lowest() {
|
||||
let delta = val + (self.bucket_idx as u32) * 64;
|
||||
self.doc = self.window_start_doc + delta;
|
||||
|
||||
if count >= COLLECT_BLOCK_BUFFER_LEN {
|
||||
// Buffer full; put remaining bits back.
|
||||
self.bitsets[self.bucket_idx] = tinyset;
|
||||
return COLLECT_BLOCK_BUFFER_LEN;
|
||||
}
|
||||
buffer[count] = self.doc;
|
||||
count += 1;
|
||||
}
|
||||
self.bitsets[self.bucket_idx] = TinySet::empty();
|
||||
self.bucket_idx += 1;
|
||||
}
|
||||
|
||||
// Current window exhausted, refill.
|
||||
if !self.refill() {
|
||||
self.doc = TERMINATED;
|
||||
return count;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
fn seek(&mut self, target: DocId) -> DocId {
|
||||
if self.doc >= target {
|
||||
return self.doc;
|
||||
|
||||
@@ -27,7 +27,7 @@ rand = "0.9"
|
||||
zipf = "7.0.0"
|
||||
rustc-hash = "2.1.0"
|
||||
proptest = "1.2.0"
|
||||
binggan = { version = "0.14.0" }
|
||||
binggan = { version = "0.15.3" }
|
||||
rand_distr = "0.5"
|
||||
|
||||
[features]
|
||||
|
||||
Reference in New Issue
Block a user