Commit Graph

161 Commits

Author SHA1 Message Date
Moe
eda9aa437f fix: boolean query incorrectly dropping documents when AllScorer is present (#84)
Co-authored-by: Stu Hood <stuhood@gmail.com>
2025-12-10 10:17:28 -08:00
Eric Ridge
526afc6111 chore: internal API visibility adjustments (#53) 2025-12-10 10:17:27 -08:00
Eric Ridge
bb4c4b8522 perf: push FileSlices down through most of fast fields (#19)
This PR modifies internal API signatures and implementation details so that `FileSlice`s are passed down into the innards of (at least) the `BlockwiseLinearCodec`.  This allows tantivy to defer dereferencing large slices of bytes when reading numeric fast fields, and instead dereference only the slice of bytes it needs for any given compressed Block.

The motivation here is for external `Directory` implementations where it's not exactly efficient to dereference large slices of bytes.
2025-12-10 10:17:25 -08:00
Paul Masurel
63c66005db Lazy scorers (#2726)
* Refactoring of the score tweaker into `SortKeyComputer`s to unlock two features.

- Allow lazy evaluation of score. As soon as we identified that a doc won't
reach the topK threshold, we can stop the evaluation.
- Allow for a different segment level score, segment level score and their conversion.

This PR breaks public API, but fixing code is straightforward.

* Bumping tantivy version

---------

Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
2025-12-01 15:38:57 +01:00
Paul Masurel
f88b7200b2 Optimization when posting list are saturated. (#2745)
* Optimization when posting list are saturated.

If a posting list doc freq is the segment reader's
max_doc, and if scoring does not matter, we can replace it
by a AllScorer.

In turn, in a boolean query, we can dismiss  all scorers and
empty scorers, to accelerate the request.

* Added range query optimization

* CR comment

* CR comments

* CR comment

---------

Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
2025-11-26 15:50:57 +01:00
PSeitz
d410a3b0c0 Add Filtering for Term Aggregations (#2717)
* Add Filtering for Term Aggregations

Closes #2702

* add AggregationsSegmentCtx memory consumption

---------

Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
2025-10-15 17:39:53 +02:00
PSeitz
33835b6a01 Add DocSet::cost() (#2707)
* query: add DocSet cost hint and use it for intersection ordering

- Add DocSet::cost()
- Use cost() instead of size_hint() to order scorers in intersect_scorers

This isolates cost-related changes without the new seek APIs from
PR #2538

* add comments

---------

Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
2025-10-13 16:25:49 +02:00
PSeitz
270ca5123c refactor postings (#2709)
rename shallow_seek to seek_block
remove full_block from public postings API

This is as preparation to optionally handle Bitsets in the postings
2025-10-08 16:55:25 +02:00
PSeitz-dd
2340dca628 fix compiler warnings (#2699)
* fix compiler warnings

* fix import
2025-09-19 15:55:04 +02:00
PSeitz
945af922d1 clippy (#2661)
* clippy

* use readable version

---------

Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
2025-07-02 11:25:03 +02:00
PSeitz-dd
295d07e55c fix union performance regression (#2663)
closes https://github.com/quickwit-oss/tantivy/issues/2656
2025-07-01 20:32:25 +02:00
PSeitz
21d057059e clippy (#2527)
* clippy

* clippy

* clippy

* clippy

* convert allow to expect and remove unused

* cargo fmt

* cleanup

* export sample

* clippy
2024-10-22 09:26:54 +08:00
PSeitz
aebae9965d add RegexPhraseQuery (#2516)
* add RegexPhraseQuery

RegexPhraseQuery supports phrase queries with regex. It supports regex
and wildcards. E.g. a query with wildcards:
"b* b* wolf" matches "big bad wolf"
Slop is supported as well:
"b* wolf"~2 matches "big bad wolf"

Regex queries may match a lot of terms where we still need to
keep track which term hit to load the positions.
The phrase query algorithm groups terms by their frequency
together in the union to prefilter groups early.

This PR comes with some new datastructures:

SimpleUnion - A union docset for a list of docsets. It doesn't do any
caching and is therefore well suited for datasets with lots of skipping.
(phrase search, but intersections in general)

LoadedPostings - Like SegmentPostings, but all docs and positions are loaded in
memory. SegmentPostings uses 1840 bytes per instance with its caches,
which is equivalent to 460 docids.
LoadedPostings is used for terms which have less than 100 docs.
LoadedPostings is only used to reduce memory consumption.

BitSetPostingUnion - Creates a `Posting` that uses the bitset for docid
hits and the docsets for positions. The BitSet is the precalculated
union of the docsets
In the RegexPhraseQuery there is a size limit of 512 docsets per PreAggregatedUnion,
before creating a new one.

Renamed Union to BufferedUnionScorer
Added proptests to test different union types.

* cleanup

* use Box instead of Vec

* use RefCell instead of term_freq(&mut)

* remove wildcard mode

* move RefCell to outer

* clippy
2024-10-21 18:29:17 +08:00
trinity-1686a
85395d942a fix clippy lints from 1.80-1.81 (#2488)
* fix some clippy lints

* fix clippy::doc_lazy_continuation

* fix some lints for 1.82
2024-09-05 14:33:05 +02:00
PSeitz
27be6aed91 lift clauses in LogicalAst (#2449)
(a OR b) OR (c OR d) can be simplified to (a OR b OR c OR d)
(a AND b) AND (c AND d) can be simplified to (a AND b AND c AND d)

This directly affects how queries are executed

remove unused SumWithCoordsCombiner
the number of fields is unused and private
2024-08-14 19:21:26 +02:00
PSeitz
56d79cb203 fix cardinality aggregation performance (#2446)
* fix cardinality aggregation performance

fix cardinality performance by fetching multiple terms at once. This
avoids decompressing the same block and keeps the buffer state between
terms.

add cardinality aggregation benchmark

bump rust version to 1.66

Performance comparison to before (AllQuery)
```
full
cardinality_agg                   Memory: 3.5 MB (-0.00%)    Avg: 21.2256ms (-97.78%)    Median: 21.0042ms (-97.82%)    [20.4717ms .. 23.6206ms]
terms_few_with_cardinality_agg    Memory: 10.6 MB            Avg: 81.9293ms (-97.37%)    Median: 81.5526ms (-97.38%)    [79.7564ms .. 88.0374ms]
dense
cardinality_agg                   Memory: 3.6 MB (-0.00%)    Avg: 25.9372ms (-97.24%)    Median: 25.7744ms (-97.25%)    [24.7241ms .. 27.8793ms]
terms_few_with_cardinality_agg    Memory: 10.6 MB            Avg: 93.9897ms (-96.91%)    Median: 92.7821ms (-96.94%)    [90.3312ms .. 117.4076ms]
sparse
cardinality_agg                   Memory: 895.4 KB (-0.00%)    Avg: 22.5113ms (-95.01%)    Median: 22.5629ms (-94.99%)    [22.1628ms .. 22.9436ms]
terms_few_with_cardinality_agg    Memory: 680.2 KB             Avg: 26.4250ms (-94.85%)    Median: 26.4135ms (-94.86%)    [26.3210ms .. 26.6774ms]
```

* clippy

* assert for sorted ordinals
2024-07-02 15:29:00 +08:00
落叶乌龟
f9ae295507 feat(query): Make BooleanQuery supports minimum_number_should_match (#2405)
* feat(query): Make `BooleanQuery` supports `minimum_number_should_match`. see issue #2398

In this commit, a novel scorer named DisjunctionScorer is introduced, which performs the union of inverted chains with the minimal required elements. BTW, it's implemented via a min-heap. Necessary modifications on `BooleanQuery` and `BooleanWeight` are performed as well.

* fixup! fix test

* fixup!: refactor code.

1. More meaningful names.
2. Add Cache for `Disjunction`'s scorers, and fix bug.
3. Optimize `BooleanWeight::complex_scorer`

Thanks
 Paul Masurel <paul@quickwit.io>

* squash!: come up with better variable naming.

* squash!: fix naming issues.

* squash!: fix typo.

* squash!: Remove CombinationMethod::FullIntersection
2024-07-01 15:39:41 +08:00
PSeitz
67ebba3c3c expose collect_block buffer size (#2326)
* expose buffer of collect_block

* flip shard_size segment_size
2024-03-15 08:02:08 +01:00
PSeitz
48630ceec9 move into new index module (#2259)
move core modules to index module
2024-01-31 10:30:04 +01:00
Harrison Burt
1c7c6fd591 POC: Tantivy documents as a trait (#2071)
* fix windows build (#1)

* Fix windows build

* Add doc traits

* Add field value iter

* Add value and serialization

* Adjust order

* Fix bug

* Correct type

* Fix generic bugs

* Reformat code

* Add generic to index writer which I forgot about

* Fix missing generics on single segment writer

* Add missing type export

* Add default methods for convenience

* Cleanup

* Fix more-like-this query to use standard types

* Update API and fix tests

* Add doc traits

* Add field value iter

* Add value and serialization

* Adjust order

* Fix bug

* Correct type

* Rebase main and fix conflicts

* Reformat code

* Merge upstream

* Fix missing generics on single segment writer

* Add missing type export

* Add default methods for convenience

* Cleanup

* Fix more-like-this query to use standard types

* Update API and fix tests

* Add tokenizer improvements from previous commits

* Add tokenizer improvements from previous commits

* Reformat

* Fix unit tests

* Fix unit tests

* Use enum in changes

* Stage changes

* Add new deserializer logic

* Add serializer integration

* Add document deserializer

* Implement new (de)serialization api for existing types

* Fix bugs and type errors

* Add helper implementations

* Fix errors

* Reformat code

* Add unit tests and some code organisation for serialization

* Add unit tests to deserializer

* Add some small docs

* Add support for deserializing serde values

* Reformat

* Fix typo

* Fix typo

* Change repr of facet

* Remove unused trait methods

* Add child value type

* Resolve comments

* Fix build

* Fix more build errors

* Fix more build errors

* Fix the tests I missed

* Fix examples

* fix numerical order, serialize PreTok Str

* fix coverage

* rename Document to TantivyDocument, rename DocumentAccess to Document

add Binary prefix to binary de/serialization

* fix coverage

---------

Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>
2023-10-02 10:01:16 +02:00
PSeitz
2d7390341c increase min memory to 15MB for indexing (#2176)
With tantivy 0.20 the minimum memory consumption per SegmentWriter increased to
12MB. 7MB are for the different fast field collectors types (they could be
lazily created). Increase the minimum memory from 3MB to 15MB.

Change memory variable naming from arena to budget.

closes #2156
2023-09-13 07:38:34 +02:00
Adam Reichold
22c35b1e00 Fix explanation of boost queries seeking beyond query result. (#2142)
* Make current nightly Clippy happy.

* Fix explanation of boost queries seeking beyond query result.
2023-08-14 11:59:11 +09:00
RT_Enzyme
ff3d3313c4 fix BooleanQuery document (#1999)
* fix BooleanQuery document

* Update src/query/boolean_query/boolean_query.rs

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2023-04-20 11:37:20 +02:00
Paul Masurel
4b01cc4c49 Made BooleanWeight and BoostWeight public (#1991) 2023-04-12 10:26:30 +09:00
PSeitz
6a7a1106d6 work in batches of docs (#1937)
* work in batches of docs

* add fill_buffer test
2023-03-21 06:57:44 +01:00
Alex Cole
f2f38c43ce Make BM25 scoring more flexible (#1855)
* Introduce Bm25StatisticsProvider to inject statistics

* fix formatting I accidentally changed
2023-02-16 19:14:12 +09:00
Shikhar Bhushan
2650111b76 EnableScoring::Disabled - optional Searcher (#1780) 2023-01-12 09:26:50 -05:00
Adam Reichold
2080c370c2 Enable usage of FuzzyTermQuery for specific fields via QueryParser (#1750)
* Make nightly Clippy mostly happy.

* Document how to produce TermSetQuery queries using QueryParser.

* Enable construction of queries using FuzzyTermQuery via the QueryParser

* Use FxHashMap instead of HashMap in the QueryParser as these hash tables are not exposed to DoS attacks.

* Use a struct instead of a tuple to improve readability.
2023-01-04 18:11:27 +09:00
Paul Masurel
3edf0a2724 Using the manual reload policy in IndexWriter. (#1667) 2022-11-09 11:20:41 +01:00
PSeitz
2af6b01c17 Update src/query/boolean_query/boolean_weight.rs
Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-11-01 16:13:00 +08:00
Pascal Seitz
dfab201191 for_each_docset to iterate without score 2022-10-26 17:25:05 +08:00
Pascal Seitz
af839753e0 No score calls if score is not requested 2022-10-26 12:18:35 +08:00
Pascal Seitz
6800fdec9d add indexing for ip field
Closes #1595
2022-10-18 10:07:48 +08:00
Bruce Mitchener
44e03791f9 Fix warnings when doc'ing private items. (#1579)
This also fixes a couple of typos, but plenty remain!
2022-10-03 14:24:00 +09:00
Bruce Mitchener
a24ae8d924 clippy: Fix needless-borrow warnings. (#1581)
These show on nightly clippy.
2022-10-03 14:15:09 +09:00
Bruce Mitchener
cf02e32578 Improvements to doc linking, grammar, etc. 2022-09-19 18:10:22 +07:00
Adam Reichold
71ab482720 RFC: Use a more general but still object-safe signature for Query::query_terms. (#1468)
* Use a more general but still object-safe signature for Query::query_terms.

* Further constraint the generalized Query::query_terms signature to allow extracting references to terms.
2022-08-24 06:34:07 +09:00
PSeitz
8edcd6f958 Merge pull request #1428 from izihawa/feature/dismax
[feat] Implement `DisjunctionMaxQuery` and refactor `ScoreCombiner`
2022-08-22 06:15:30 -07:00
Kian-Meng Ang
014b1adc3e cargo +nightly fmt 2022-08-17 22:33:44 +08:00
Kian-Meng Ang
84295d5b35 cargo fmt 2022-08-15 21:07:01 +08:00
Kian-Meng Ang
625bcb4877 Fix typos and markdowns
Found via these commands:

    codespell -L crate,ser,panting,beauti,hart,ue,atleast,childs,ond,pris,hel,mot
    markdownlint *.md doc/src/*.md --disable MD013 MD025 MD033 MD001 MD024 MD036 MD041 MD003
2022-08-13 18:25:47 +08:00
Pasha Podolsky
09aae134e6 [feat] Implement DisjunctionMaxQuery and refactor ScoreCombiner 2022-07-28 20:47:20 +03:00
Ryan Russell
b33b4c0092 Fix various occurrence var names and references (#1385)
Thank you Ryan!

Signed-off-by: Ryan Russell <git@ryanrussell.org>
2022-06-07 11:08:19 +09:00
PSeitz
4b62f7907d Merge pull request #1297 from PSeitz/fix_clippy
fix clippy issues
2022-03-02 10:11:56 +01:00
Antoine G
e37775fe21 iff->if or if and only if (#1298)
* has_xxx is_xxx -> if, these function usualy define equivalence
xxx returns bool -> specify equivalence when appropriate

* fix doc
2022-03-02 11:00:00 +09:00
Pascal Seitz
091b668624 fix clippy issues 2022-03-01 08:58:51 +01:00
Paul Masurel
d7b46d2137 Added JSON Type (#1270)
- Removed useless copy when ingesting JSON.
- Bugfix in phrase query with a missing field norms.
- Disabled range query on default fields

Closes #1251
2022-02-24 16:25:22 +09:00
Paul Masurel
2069e3e52b Fixing clippy comments 2022-02-01 10:24:05 +09:00
Paul Masurel
eca6628b3c Minor refactoring (#1266) 2022-01-28 15:55:55 +09:00
Paul Masurel
7234bef0eb Issue/1198 (#1201)
* Unit test reproducing #1198
* Fixing unit test to handle the error from add_document.
* Bump project version
2021-11-11 16:42:19 +09:00