tantivy

mirror of https://github.com/quickwit-oss/tantivy.git synced 2026-05-22 19:20:43 +00:00

Author	SHA1	Message	Date
Paul Masurel	63da5a21b2	Optimizing top K using Adrien Grand's ideas (#2865 ) * Optimizing top K using Adrien Grand's ideas https://jpountz.github.io/2025/08/28/compiled-vs-vectorized-search-engine-edition.html * Suffix-sum pruning for multi-term intersection candidates After scoring each secondary in Phase 2, check whether remaining secondaries' block_max scores can still beat the threshold. Skip to the next candidate early if impossible, avoiding expensive seeks into later secondaries. Improves three-term intersection by ~8% on the balanced benchmark while keeping two-term performance neutral. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Claude CR comment * Removed 16 term scorer limit. --------- Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-26 12:14:40 +02:00
Paul Masurel	b11605f045	Addressing clippy comments (#2789 ) Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>	2025-12-31 18:02:00 +01:00
PSeitz	923f0508f2	seek_exact + cost based intersection (#2538 ) * seek_exact + cost based intersection Adds `seek_exact` and `cost` to `DocSet` for a more efficient intersection. Unlike `seek`, `seek_exact` does not require the DocSet to advance to the next hit, if the target does not exist. `cost` allows to address the different DocSet types and their cost model and is used to determine the DocSet that drives the intersection. E.g. fast field range queries may do a full scan. Phrase queries load the positions to check if a we have a hit. They both have a higher cost than their size_hint would suggest. Improves `size_hint` estimation for intersection and union, by having a estimation based on random distribution with a co-location factor. Refactor range query benchmark. Closes #2531 Future Work Implement `seek_exact` for BufferedUnionScorer and RangeDocSet (fast field range queries) Evaluate replacing `seek` with `seek_exact` to reduce code complexity * Apply suggestions from code review Co-authored-by: Paul Masurel <paul@quickwit.io> * add API contract verfication * impl seek_exact on union * rename seek_exact * add mixed AND OR test, fix buffered_union * Add a proptest of BooleanQuery. (#2690) * fix build * Increase the document count. * fix merge conflict * fix debug assert * Fix compilation errors after rebase - Remove duplicate proptest_boolean_query module - Remove duplicate cost() method implementations - Fix TopDocs API usage (add .order_by_score()) - Remove duplicate imports - Remove unused variable assignments --------- Co-authored-by: Paul Masurel <paul@quickwit.io> Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com> Co-authored-by: Stu Hood <stuhood@gmail.com>	2025-12-30 14:43:25 +01:00
Moe	e3c9be1f92	fix: boolean query incorrectly dropping documents when AllScorer is present (#2760 ) * Fixed the range issue. * Fixed the second all scorer issue * Improved docs + tests * Improved code. * Fixed lint issues. * Improved tests + logic based on PR comments. * Fixed lint issues. * Increase the document count. * Improved the prop-tests * Expand the index size, and remove unused parameter. --------- Co-authored-by: Stu Hood <stuhood@gmail.com>	2025-12-16 22:52:02 +01:00
trinity-1686a	d0e1600135	fix bug with minimum_should_match and AllScorer (#2774 )	2025-12-14 10:10:45 +01:00
Paul Masurel	63c66005db	Lazy scorers (#2726 ) * Refactoring of the score tweaker into `SortKeyComputer`s to unlock two features. - Allow lazy evaluation of score. As soon as we identified that a doc won't reach the topK threshold, we can stop the evaluation. - Allow for a different segment level score, segment level score and their conversion. This PR breaks public API, but fixing code is straightforward. * Bumping tantivy version --------- Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>	2025-12-01 15:38:57 +01:00
Paul Masurel	f88b7200b2	Optimization when posting list are saturated. (#2745 ) * Optimization when posting list are saturated. If a posting list doc freq is the segment reader's max_doc, and if scoring does not matter, we can replace it by a AllScorer. In turn, in a boolean query, we can dismiss all scorers and empty scorers, to accelerate the request. * Added range query optimization * CR comment * CR comments * CR comment --------- Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>	2025-11-26 15:50:57 +01:00
PSeitz	27be6aed91	lift clauses in LogicalAst (#2449 ) (a OR b) OR (c OR d) can be simplified to (a OR b OR c OR d) (a AND b) AND (c AND d) can be simplified to (a AND b AND c AND d) This directly affects how queries are executed remove unused SumWithCoordsCombiner the number of fields is unused and private	2024-08-14 19:21:26 +02:00
Harrison Burt	1c7c6fd591	POC: Tantivy documents as a trait (#2071 ) * fix windows build (#1) * Fix windows build * Add doc traits * Add field value iter * Add value and serialization * Adjust order * Fix bug * Correct type * Fix generic bugs * Reformat code * Add generic to index writer which I forgot about * Fix missing generics on single segment writer * Add missing type export * Add default methods for convenience * Cleanup * Fix more-like-this query to use standard types * Update API and fix tests * Add doc traits * Add field value iter * Add value and serialization * Adjust order * Fix bug * Correct type * Rebase main and fix conflicts * Reformat code * Merge upstream * Fix missing generics on single segment writer * Add missing type export * Add default methods for convenience * Cleanup * Fix more-like-this query to use standard types * Update API and fix tests * Add tokenizer improvements from previous commits * Add tokenizer improvements from previous commits * Reformat * Fix unit tests * Fix unit tests * Use enum in changes * Stage changes * Add new deserializer logic * Add serializer integration * Add document deserializer * Implement new (de)serialization api for existing types * Fix bugs and type errors * Add helper implementations * Fix errors * Reformat code * Add unit tests and some code organisation for serialization * Add unit tests to deserializer * Add some small docs * Add support for deserializing serde values * Reformat * Fix typo * Fix typo * Change repr of facet * Remove unused trait methods * Add child value type * Resolve comments * Fix build * Fix more build errors * Fix more build errors * Fix the tests I missed * Fix examples * fix numerical order, serialize PreTok Str * fix coverage * rename Document to TantivyDocument, rename DocumentAccess to Document add Binary prefix to binary de/serialization * fix coverage --------- Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>	2023-10-02 10:01:16 +02:00
PSeitz	2d7390341c	increase min memory to 15MB for indexing (#2176 ) With tantivy 0.20 the minimum memory consumption per SegmentWriter increased to 12MB. 7MB are for the different fast field collectors types (they could be lazily created). Increase the minimum memory from 3MB to 15MB. Change memory variable naming from arena to budget. closes #2156	2023-09-13 07:38:34 +02:00
Paul Masurel	4b01cc4c49	Made BooleanWeight and BoostWeight public (#1991 )	2023-04-12 10:26:30 +09:00
Alex Cole	f2f38c43ce	Make BM25 scoring more flexible (#1855 ) * Introduce Bm25StatisticsProvider to inject statistics * fix formatting I accidentally changed	2023-02-16 19:14:12 +09:00
Shikhar Bhushan	2650111b76	EnableScoring::Disabled - optional Searcher (#1780 )	2023-01-12 09:26:50 -05:00
Paul Masurel	3edf0a2724	Using the manual reload policy in IndexWriter. (#1667 )	2022-11-09 11:20:41 +01:00
Pasha Podolsky	09aae134e6	[feat] Implement `DisjunctionMaxQuery` and refactor `ScoreCombiner`	2022-07-28 20:47:20 +03:00
Paul Masurel	eca6628b3c	Minor refactoring (#1266 )	2022-01-28 15:55:55 +09:00
Paul Masurel	7234bef0eb	Issue/1198 (#1201 ) * Unit test reproducing #1198 * Fixing unit test to handle the error from add_document. * Bump project version	2021-11-11 16:42:19 +09:00
François Massot	0462754673	Optimize block wand for one and several TermScorer. (#1190 ) * Added optimisation using block wand for single TermScorer. A proptest was also added. * Fix block wand algorithm by taking the last doc id of scores until the pivot scorer (included). * In block wand, when block max score is lower than the threshold, advance the scorer with best score. * Fix wrong condition in block_wand_single_scorer and add debug_assert to have an equality check on doc to break the loop.	2021-11-01 09:18:05 +09:00
sigaloid	096ce7488e	Resolve some clippys, format (#1144 ) * cargo +nightly clippy --fix -Z unstable-options	2021-08-26 08:46:00 +09:00
Stéphane Campinas	a0ec6e1e9d	Expand the DocAddress struct with named fields	2021-03-28 19:00:23 +02:00
Paul Masurel	36a0520a48	Added failing proptest and fixed it.	2020-11-05 15:40:00 +09:00
Paul Masurel	730ccefffb	Fixes a bug in TermQuery::explain. Closes #915	2020-10-28 22:29:15 +09:00
Adrien Guillo	7f373f232a	Add helper methods for BooleanQuery	2020-10-28 17:35:34 +09:00
Paul Masurel	439d6956a9	Returning Result in some of the API (#880 ) * Returning Result in some of the API * Introducing `.writer_for_test(..)`	2020-09-07 15:52:34 +09:00
Paul Masurel	2481c87be8	Block wand (#856 )	2020-08-19 22:36:36 +09:00
Paul Masurel	6db8bb49d6	Assert nearly equals macro (#853 ) * Assert nearly equals macro * Renamed specialized_scorer in TermScorer	2020-07-17 16:40:41 +09:00
Paul Masurel	f71b04acb0	Bugfix. (#849 ) go_to_first_doc was typically calling seek with a target smaller than doc. Since SegmentPostings typically do a linear search on the full block, regardless of the current position, it could have our segment postings go backward.	2020-07-16 10:57:51 +09:00
Ype Kingma	7d773abc92	Boolean query: do not combine excluded scores. (#840 ) * Do nothing when combining score values of excluded scores. * Add test case for two excluded. * Test score for two excluded terms. * Use TopDocs in test_boolean_query_two_excluded	2020-06-08 20:01:19 +09:00
Paul Masurel	e25284bafe	Major change in the DocSet/Scorer API (#824 ) - Change in the DocSet and Scorer API. (@fulmicoton). A freshly created DocSet point directly to their first doc. A sentinel value called TERMINATED marks the end of a DocSet. `.advance()` returns the new DocId. `Scorer::skip(target)` has been replaced by `Scorer::seek(target)` and returns the resulting DocId. As a result, iterating through DocSet now looks as follows ```rust let mut doc = docset.doc(); while doc != TERMINATED { // ... doc = docset.advance(); } ``` The change made it possible to greatly simplify a lot of the docset's code. - Misc internal optimization and introduction of the `Scorer::for_each_pruning` function. (@fulmicoton)	2020-05-16 16:33:36 +09:00
Paul Masurel	d04368b1d4	Closes #788 . OR not working when using conjunction by default. (#802 )	2020-03-31 21:13:50 +09:00
Paul Masurel	7d6cfa58e1	[WIP] Alternative take on boosted queries (#772 ) * Alternative take on boosted queries * Fixing unit test * Added boosting to the query grammar. * Made BoostQuery public. * Added support for boosting field in QueryParser Closes #547	2020-02-19 11:04:38 +09:00
Paul Masurel	5c6580eb15	fmt (#661 )	2019-10-04 12:10:01 +09:00
Paul Masurel	c3231ca252	Added phrase query tests (#601 )	2019-07-22 13:43:00 +09:00
Paul Masurel	462774b15c	Tiqb feature/2018 (#583 ) * rust 2018 * Added CHANGELOG comment	2019-07-01 10:01:46 +09:00
Paul Masurel	4822940b19	Issue/36 (#559 ) * Added explanation * Explain * Splitting weight and idf * Added comments Closes #36	2019-06-06 10:03:54 +09:00
Paul Masurel	663dd89c05	Feature/reader (#517 ) Adding IndexReader to the API. Making it possible to watch for changes. * Closes #500	2019-03-20 08:39:22 +09:00
Paul Masurel	50ed6fb534	Code cleanup Fixed compilation without the mmap directory	2019-02-05 12:39:30 +01:00
Paul Masurel	4c93b096eb	Rustfmt	2019-01-29 11:45:30 +01:00
Paul Masurel	6a547b0b5f	Issue/483 (#484 ) * Downcast_ref * fixing unit test	2019-01-28 11:43:42 +01:00
Paul Masurel	63b593bd0a	Lower RAM usage in tests.	2019-01-24 09:10:38 +09:00
Paul Masurel	a6e767c877	Cargo fmt	2018-11-30 22:52:45 +09:00
Paul Masurel	07d87e154b	Collector refactoring and multithreaded search (#437 ) * Split Collector into an overall Collector and a per-segment SegmentCollector. Precursor to cross-segment parallelism, and as a side benefit cleans up any per-segment fields from being Option<T> to just T. * Attempt to add MultiCollector back * working. Chained collector is broken though * Fix chained collector * Fix test * Make Weight Send+Sync for parallelization purposes * Expose parameters of RangeQuery for external usage * Removed &mut self * fixing tests * Restored TestCollectors * blop * multicollector working * chained collector working * test broken * fixing unit test * blop * blop * Blop * simplifying APi * blop * better syntax * Simplifying top_collector * refactoring * blop * Sync with master * Added multithread search * Collector refactoring * Schema::builder * CR and rustdoc * CR comments * blop * Added an executor * Sorted the segment readers in the searcher * Update searcher.rs * Fixed unit testst * changed the place where we have the sort-segment-by-count heuristic * using crossbeam::channel * inlining * Comments about panics propagating * Added unit test for executor panicking * Readded default * Removed Default impl * Added unit test for executor	2018-11-30 22:46:59 +09:00
Paul Masurel	0ba1cf93f7	Remove Searcher dereference (#419 )	2018-09-14 09:54:26 +09:00
Paul Masurel	78673172d0	Cargo fmt	2018-04-21 20:05:36 +09:00
Paul Masurel	e44782bf14	No more	2018-04-12 13:01:11 +09:00
Paul Masurel	1d9566e73c	Making mmap a feature	2018-03-31 13:23:43 +09:00
Paul Masurel	ffa03bad71	TermScorer does not handle deletes	2018-03-27 17:35:20 +09:00
Paul Masurel	9712a75399	Added unit test for intersection score	2018-03-25 12:58:24 +09:00
Paul Masurel	b0e5e1f61d	Back merged master	2018-03-19 12:19:08 +09:00
Paul Masurel	a3b44773bb	Bugfix and rustfmt	2018-03-10 12:21:50 +09:00

1 2

78 Commits