* seek_exact + cost-based intersection
Adds `seek_exact` and `cost` to `DocSet` for a more efficient intersection.
Unlike `seek`, `seek_exact` does not require the DocSet to advance to the next hit if the target does not exist.
`cost` accounts for the different DocSet types and their cost
model, and is used to determine which DocSet drives the intersection.
E.g. fast field range queries may do a full scan, and phrase queries load positions to check whether we have a hit.
Both have a higher cost than their `size_hint` would suggest.
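A minimal sketch of the two trait additions as described above (the signatures are illustrative; the `DocSet` trait in the source is authoritative):
```rust
type DocId = u32;

trait DocSet {
    /// Advances to the next hit and returns it (pre-existing method).
    fn advance(&mut self) -> DocId;

    /// Seeks to `target` and reports whether it is a hit. Unlike `seek`,
    /// the DocSet is not required to land on the next hit when `target`
    /// is absent, which can be implemented much more cheaply.
    fn seek_exact(&mut self, target: DocId) -> bool;

    /// Estimated cost of driving this DocSet. Unlike `size_hint`, it
    /// reflects per-hit work (full scans for fast field range queries,
    /// position checks for phrase queries) and is used to pick the
    /// DocSet that drives the intersection.
    fn cost(&self) -> u64;
}
```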
Improves the `size_hint` estimation for intersection and union by basing the
estimate on a random distribution with a co-location factor.
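As a toy illustration of the idea (the actual formula and co-location factor in tantivy may differ):
```rust
// Under a random-distribution assumption, two DocSets with `a` and `b`
// hits over `max_doc` documents overlap in about a * b / max_doc docs.
// A co-location factor > 1.0 models terms that tend to co-occur.
fn estimate_intersection_len(a: u64, b: u64, max_doc: u64, co_location: f64) -> u64 {
    let expected = (a as f64) * (b as f64) / (max_doc as f64);
    // The intersection can never exceed the smaller of the two sets.
    (expected * co_location).min(a.min(b) as f64) as u64
}
```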
Refactor range query benchmark.
Closes #2531
*Future Work*
- Implement `seek_exact` for BufferedUnionScorer and RangeDocSet (fast field range queries)
- Evaluate replacing `seek` with `seek_exact` to reduce code complexity
* Apply suggestions from code review
Co-authored-by: Paul Masurel <paul@quickwit.io>
* add API contract verification
* impl seek_exact on union
* rename seek_exact
* add mixed AND/OR test, fix buffered_union
* Add a proptest of BooleanQuery. (#2690)
* fix build
* Increase the document count.
* fix merge conflict
* fix debug assert
* Fix compilation errors after rebase
- Remove duplicate proptest_boolean_query module
- Remove duplicate cost() method implementations
- Fix TopDocs API usage (add .order_by_score())
- Remove duplicate imports
- Remove unused variable assignments
---------
Co-authored-by: Paul Masurel <paul@quickwit.io>
Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
Co-authored-by: Stu Hood <stuhood@gmail.com>
* Refactoring of the score tweaker into `SortKeyComputer`s to unlock two features.
- Allow lazy evaluation of the score: as soon as we identify that a doc won't
reach the top-K threshold, we can stop the evaluation (sketched below).
- Allow for different segment-level and index-level scores, and a conversion between them.
This PR breaks the public API, but fixing client code is straightforward.
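A hypothetical sketch of the lazy-evaluation idea (the trait and method names are invented for illustration and do not mirror the real `SortKeyComputer` API):
```rust
// Invented names, for illustration only.
trait LazySortKey {
    /// Cheap upper bound on the sort key for `doc`.
    fn upper_bound(&self, doc: u32) -> f32;
    /// Full (potentially expensive) sort key for `doc`.
    fn compute(&self, doc: u32) -> f32;
}

/// Returns the sort key only if `doc` can still enter the top K.
fn collect_doc<C: LazySortKey>(computer: &C, doc: u32, threshold: f32) -> Option<f32> {
    if computer.upper_bound(doc) <= threshold {
        return None; // evaluation stops early, as described above
    }
    Some(computer.compute(doc))
}
```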
* Bumping tantivy version
---------
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
* Optimization when posting lists are saturated.
If a posting list's doc freq equals the segment reader's
max_doc, and scoring does not matter, we can replace it
with an AllScorer.
In turn, in a boolean query, we can dismiss AllScorers and
empty scorers to accelerate the request.
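A simplified sketch of the decision (not the actual tantivy code):
```rust
// A term that occurs in every document filters nothing, so when
// scores are irrelevant a match-all scorer can stand in for it.
enum ScorerKind {
    All,     // matches every doc in [0, max_doc)
    Posting, // decodes the posting list as usual
}

fn scorer_for_posting(doc_freq: u32, max_doc: u32, scoring_enabled: bool) -> ScorerKind {
    if !scoring_enabled && doc_freq == max_doc {
        ScorerKind::All
    } else {
        ScorerKind::Posting
    }
}
```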
* Added range query optimization
* CR comment
* CR comments
* CR comment
---------
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
This introduces an optimization for top-level term aggregations on fields with a low cardinality.
In that case we use a Vec as the underlying map.
In addition, we buffer the sub-aggregations.
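A sketch of the Vec-as-map idea, assuming terms are addressed by a dense ordinal in `[0, cardinality)` (sub-aggregation buffering is not shown):
```rust
// For low-cardinality fields, a Vec indexed by term ordinal replaces
// a hash map: each bucket update is an array index, not a hash lookup.
struct LowCardinalityTermAgg {
    counts: Vec<u64>, // one slot per term ordinal
}

impl LowCardinalityTermAgg {
    fn new(cardinality: usize) -> Self {
        Self { counts: vec![0; cardinality] }
    }

    fn collect(&mut self, term_ord: u64) {
        self.counts[term_ord as usize] += 1;
    }
}
```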
---------
Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
Co-authored-by: Paul Masurel <paul@quickwit.io>
* reduce number of allocations
Explanation makes up around 50% of all allocations (by allocation count, not perf impact).
It is created during serialization but never consumed (see the sketch below).
- Make Explanation optional in BM25
- Avoid allocations when using Explanation
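A sketch of the pattern, not the exact BM25 code:
```rust
// Only allocate an Explanation when the caller actually asked for one.
fn score_with_explanation(score: f32, explain: bool) -> (f32, Option<String>) {
    let explanation = if explain {
        Some(format!("bm25 score = {score}")) // allocated only on demand
    } else {
        None // hot path: no allocation at all
    };
    (score, explanation)
}
```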
* use Cow
* add support for delta-1 encoded posting lists
* encode term frequency minus one
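The encoding relies on term frequencies always being at least 1; a minimal sketch:
```rust
// Since tf >= 1 always holds, storing tf - 1 makes the very common
// tf == 1 case encode as 0, which bit-packs into fewer bits.
fn encode_tf(tf: u32) -> u32 {
    debug_assert!(tf >= 1);
    tf - 1
}

fn decode_tf(encoded: u32) -> u32 {
    encoded + 1
}
```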
* don't emit tf for json integer terms
* make skipreader not pub(crate) mutable
* fix windows build (#1)
* Fix windows build
* Add doc traits
* Add field value iter
* Add value and serialization
* Adjust order
* Fix bug
* Correct type
* Fix generic bugs
* Reformat code
* Add generic to index writer which I forgot about
* Fix missing generics on single segment writer
* Add missing type export
* Add default methods for convenience
* Cleanup
* Fix more-like-this query to use standard types
* Update API and fix tests
* Rebase main and fix conflicts
* Merge upstream
* Add tokenizer improvements from previous commits
* Reformat
* Fix unit tests
* Use enum in changes
* Stage changes
* Add new deserializer logic
* Add serializer integration
* Add document deserializer
* Implement new (de)serialization api for existing types
* Fix bugs and type errors
* Add helper implementations
* Fix errors
* Reformat code
* Add unit tests and some code organisation for serialization
* Add unit tests to deserializer
* Add some small docs
* Add support for deserializing serde values
* Reformat
* Fix typo
* Fix typo
* Change repr of facet
* Remove unused trait methods
* Add child value type
* Resolve comments
* Fix build
* Fix more build errors
* Fix more build errors
* Fix the tests I missed
* Fix examples
* fix numerical order, serialize PreTok Str
* fix coverage
* rename Document to TantivyDocument, rename DocumentAccess to Document
add Binary prefix to binary de/serialization
* fix coverage
---------
Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>
With tantivy 0.20 the minimum memory consumption per SegmentWriter increased to
12MB. 7MB of that goes to the different fast field collector types (they could be
created lazily). Increase the minimum memory budget from 3MB to 15MB.
Change the memory variable naming from arena to budget.
Closes #2156
Applied this command to the code, making it a bit shorter and slightly
more readable.
```
cargo +nightly clippy --all-features --benches --tests --workspace --fix -- -A clippy::all -W clippy::uninlined_format_args
cargo +nightly fmt --all
```
Prev: thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: IoError(Custom { kind: InvalidData, error: "Reach end of buffer while reading VInt" })', src/main.rs:46:14
Now: automatic downgrade to the next available level
* Use a more general but still object-safe signature for Query::query_terms.
* Further constrain the generalized Query::query_terms signature to allow extracting references to terms.
* Removes all usage of block_on and uses a oneshot channel instead.
Calling `block_on` panics in certain contexts.
For instance, it panics when called in the context of another
call to block_on.
Using it in tantivy is unnecessary. We replace it with a thin wrapper
around a oneshot channel that supports both async and sync use.
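A minimal sketch of the synchronous half of the pattern using a std channel (tantivy's actual wrapper also supports async consumption, which is omitted here):
```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    // A sync_channel of capacity 1 acts as a oneshot channel here.
    let (tx, rx) = mpsc::sync_channel(1);
    thread::spawn(move || {
        // ... produce the result off-thread ...
        let _ = tx.send(42u32);
    });
    // Synchronous consumer: block on the channel, not on an executor,
    // so there is no block_on to panic in a nested-blocking context.
    let result = rx.recv().unwrap();
    assert_eq!(result, 42);
}
```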
* Removing needless uses of async in the API.
Co-authored-by: PSeitz <PSeitz@users.noreply.github.com>
* Added sstable, enabled it by default, and added parallel boolean query.
* Added async API for FileSlice.
* Added async get_doc
* Reduce blocksize to 32_000
* Added debug logs
Quickwit-specific features are hidden behind the quickwit feature flag.
* Terms are now typed.
This change is backward compatible:
while a Term's byte representation is modified, a Term itself
is a transient object that is not serialized as-is in the index.
Its `.field()` and `.value_bytes()` accessors, on the other hand, are unchanged.
This change offers better Debug information for terms.
While not strictly necessary, it will also help with the support for JSON types.
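For illustration, constructing typed terms (the exact Debug output is not reproduced here):
```rust
use tantivy::schema::{Schema, INDEXED, TEXT};
use tantivy::Term;

fn main() {
    let mut schema_builder = Schema::builder();
    let title = schema_builder.add_text_field("title", TEXT);
    let count = schema_builder.add_u64_field("count", INDEXED);
    let _schema = schema_builder.build();

    // Each constructor records the value type in the term's byte
    // representation, so the Debug output is type-aware.
    let text_term = Term::from_field_text(title, "tantivy");
    let u64_term = Term::from_field_u64(count, 42);
    println!("{text_term:?} {u64_term:?}");
}
```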
* Renamed Hierarchical Facet -> Facet
* Add a NORMED option on fields
Make fieldnorm indexing optional:
- for all types except text => added a NORMED option
- for text fields:
  - if STRING, the field keeps no fieldnorm
  - if TEXT, the field has its fieldnorm computed
* Finalize making fieldnorm optional for all field types.
- Use an Option for fieldnorm readers.
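A schema sketch of the STRING vs TEXT distinction described above (the NORMED flag itself is not shown, since its exact spelling in the released API is not reproduced here):
```rust
use tantivy::schema::{Schema, STRING, TEXT};

fn main() {
    let mut builder = Schema::builder();
    // STRING: raw, untokenized; no fieldnorm is retained.
    let _id = builder.add_text_field("id", STRING);
    // TEXT: tokenized; fieldnorms are computed (needed for BM25).
    let _body = builder.add_text_field("body", TEXT);
    let _schema = builder.build();
}
```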
* Added an optimisation using block WAND for a single TermScorer.
A proptest was also added.
* Fix the block WAND algorithm by taking the last doc id among the scorers up to the pivot scorer (included).
* In block WAND, when the block max score is lower than the threshold, advance the scorer with the best score.
* Fix a wrong condition in block_wand_single_scorer and add a debug_assert enforcing an equality check on doc before breaking the loop.
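A conceptual sketch of the block-level pruning step these fixes concern (simplified, not tantivy's implementation):
```rust
// If the sum of block-max scores of the scorers up to and including
// the pivot cannot beat the current top-K threshold, the block can
// be skipped and the best candidate scorer advanced instead.
fn can_skip_block(block_max_scores: &[f32], pivot: usize, threshold: f32) -> bool {
    let upper_bound: f32 = block_max_scores[..=pivot].iter().sum();
    upper_bound <= threshold
}
```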