seek_exact + cost based intersection (#2538)

* seek_exact + cost based intersection

Adds `seek_exact` and `cost` to `DocSet` for a more efficient intersection.
Unlike `seek`, `seek_exact` does not require the DocSet to advance to the next hit, if the target does not exist.

`cost` allows to address the different DocSet types and their cost
model and is used to determine the DocSet that drives the intersection.
E.g. fast field range queries may do a full scan. Phrase queries load the positions to check if a we have a hit.
They both have a higher cost than their size_hint would suggest.

Improves `size_hint` estimation for intersection and union, by having a
estimation based on random distribution with a co-location factor.

Refactor range query benchmark.

Closes #2531

*Future Work*

Implement `seek_exact` for BufferedUnionScorer and RangeDocSet (fast field range queries)
Evaluate replacing `seek` with `seek_exact` to reduce code complexity

* Apply suggestions from code review

Co-authored-by: Paul Masurel <paul@quickwit.io>

* add API contract verfication

* impl seek_exact on union

* rename seek_exact

* add mixed AND OR test, fix buffered_union

* Add a proptest of BooleanQuery. (#2690)

* fix build

* Increase the document count.

* fix merge conflict

* fix debug assert

* Fix compilation errors after rebase

- Remove duplicate proptest_boolean_query module
- Remove duplicate cost() method implementations
- Fix TopDocs API usage (add .order_by_score())
- Remove duplicate imports
- Remove unused variable assignments

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
Co-authored-by: Stu Hood <stuhood@gmail.com>
This commit is contained in:
PSeitz
2025-12-30 21:43:25 +08:00
committed by GitHub
parent e0b62e00ac
commit 923f0508f2
19 changed files with 581 additions and 487 deletions

View File

@@ -483,7 +483,7 @@ mod tests {
let checkpoints_for_each_pruning =
compute_checkpoints_for_each_pruning(term_scorers.clone(), top_k);
let checkpoints_manual =
compute_checkpoints_manual(term_scorers.clone(), top_k, 100_000);
compute_checkpoints_manual(term_scorers.clone(), top_k, max_doc as u32);
assert_eq!(checkpoints_for_each_pruning.len(), checkpoints_manual.len());
for (&(left_doc, left_score), &(right_doc, right_score)) in checkpoints_for_each_pruning
.iter()

View File

@@ -817,7 +817,6 @@ mod proptest_boolean_query {
fn proptest_boolean_query() {
// In the presence of optimizations around buffering, it can take large numbers of
// documents to uncover some issues.
let num_docs = 10000;
let num_fields = 8;
let num_docs = 1 << num_fields;
let (index, fields, range_field) =