tantivy

mirror of https://github.com/quickwit-oss/tantivy.git synced 2026-01-08 01:52:54 +00:00

Author	SHA1	Message	Date
PSeitz	21d057059e	clippy (#2527 ) * clippy * clippy * clippy * clippy * convert allow to expect and remove unused * cargo fmt * cleanup * export sample * clippy	2024-10-22 09:26:54 +08:00
PSeitz	dca508b4ca	remove read_postings_no_deletes (#2526 ) closes #2525	2024-10-22 09:52:43 +09:00
PSeitz	aebae9965d	add RegexPhraseQuery (#2516 ) * add RegexPhraseQuery RegexPhraseQuery supports phrase queries with regex. It supports regex and wildcards. E.g. a query with wildcards: "b* b* wolf" matches "big bad wolf" Slop is supported as well: "b* wolf"~2 matches "big bad wolf" Regex queries may match a lot of terms where we still need to keep track which term hit to load the positions. The phrase query algorithm groups terms by their frequency together in the union to prefilter groups early. This PR comes with some new datastructures: SimpleUnion - A union docset for a list of docsets. It doesn't do any caching and is therefore well suited for datasets with lots of skipping. (phrase search, but intersections in general) LoadedPostings - Like SegmentPostings, but all docs and positions are loaded in memory. SegmentPostings uses 1840 bytes per instance with its caches, which is equivalent to 460 docids. LoadedPostings is used for terms which have less than 100 docs. LoadedPostings is only used to reduce memory consumption. BitSetPostingUnion - Creates a `Posting` that uses the bitset for docid hits and the docsets for positions. The BitSet is the precalculated union of the docsets In the RegexPhraseQuery there is a size limit of 512 docsets per PreAggregatedUnion, before creating a new one. Renamed Union to BufferedUnionScorer Added proptests to test different union types. * cleanup * use Box instead of Vec * use RefCell instead of term_freq(&mut) * remove wildcard mode * move RefCell to outer * clippy	2024-10-21 18:29:17 +08:00
Bruce Mitchener	c17e513377	Reduce typo count. (#2510 )	2024-10-10 09:55:37 +08:00
trinity-1686a	85395d942a	fix clippy lints from 1.80-1.81 (#2488 ) * fix some clippy lints * fix clippy::doc_lazy_continuation * fix some lints for 1.82	2024-09-05 14:33:05 +02:00
PSeitz	a206c3ccd3	add compat tests (#2485 )	2024-09-04 18:26:57 +08:00
Chaya	dc5d31c116	grammar and misspellings (#2483 ) * grammar * grammar * misspelling	2024-09-04 12:45:31 +08:00
PSeitz	c71ec8086d	add FastFieldRangeQuery, rename (#2477 ) * add FastFieldRangeQuery, rename * remove Query impl	2024-08-19 09:02:00 +02:00
PSeitz	27be6aed91	lift clauses in LogicalAst (#2449 ) (a OR b) OR (c OR d) can be simplified to (a OR b OR c OR d) (a AND b) AND (c AND d) can be simplified to (a AND b AND c AND d) This directly affects how queries are executed remove unused SumWithCoordsCombiner the number of fields is unused and private	2024-08-14 19:21:26 +02:00
PSeitz	3d1c4b313a	support ff range queries on json fields (#2456 ) * support ff range queries on json fields * fix term date truncation * use inverted index range query for phrase prefix queries * rename to InvertedIndexRangeQuery * fix column filter, add mixed column test	2024-08-02 00:06:50 +08:00
PSeitz	d8843c608c	make FastFieldRangeWeight::new pub (#2460 )	2024-07-29 10:39:27 +08:00
PSeitz	7ebcc15b17	add support for str fast field range query (#2453 ) * add support for str fast field range query Add support for range queries on fast fields, by converting term bounds to term ordinals bounds. closes https://github.com/quickwit-oss/tantivy/issues/2023 * extend tests, rename * update comment * update comment	2024-07-17 09:31:42 +08:00
PSeitz	1b4076691f	refactor fast field query (#2452 ) As preparation of #2023 and #1709 * Use Term to pass parameters * merge u64 and ip fast field range query Side note: I did not rename range_query_u64_fastfield, because then git can't track the changes.	2024-07-15 18:08:05 +08:00
PSeitz	56d79cb203	fix cardinality aggregation performance (#2446 ) * fix cardinality aggregation performance fix cardinality performance by fetching multiple terms at once. This avoids decompressing the same block and keeps the buffer state between terms. add cardinality aggregation benchmark bump rust version to 1.66 Performance comparison to before (AllQuery) ``` full cardinality_agg Memory: 3.5 MB (-0.00%) Avg: 21.2256ms (-97.78%) Median: 21.0042ms (-97.82%) [20.4717ms .. 23.6206ms] terms_few_with_cardinality_agg Memory: 10.6 MB Avg: 81.9293ms (-97.37%) Median: 81.5526ms (-97.38%) [79.7564ms .. 88.0374ms] dense cardinality_agg Memory: 3.6 MB (-0.00%) Avg: 25.9372ms (-97.24%) Median: 25.7744ms (-97.25%) [24.7241ms .. 27.8793ms] terms_few_with_cardinality_agg Memory: 10.6 MB Avg: 93.9897ms (-96.91%) Median: 92.7821ms (-96.94%) [90.3312ms .. 117.4076ms] sparse cardinality_agg Memory: 895.4 KB (-0.00%) Avg: 22.5113ms (-95.01%) Median: 22.5629ms (-94.99%) [22.1628ms .. 22.9436ms] terms_few_with_cardinality_agg Memory: 680.2 KB Avg: 26.4250ms (-94.85%) Median: 26.4135ms (-94.86%) [26.3210ms .. 26.6774ms] ``` * clippy * assert for sorted ordinals	2024-07-02 15:29:00 +08:00
落叶乌龟	f9ae295507	feat(query): Make `BooleanQuery` supports `minimum_number_should_match` (#2405 ) * feat(query): Make `BooleanQuery` supports `minimum_number_should_match`. see issue #2398 In this commit, a novel scorer named DisjunctionScorer is introduced, which performs the union of inverted chains with the minimal required elements. BTW, it's implemented via a min-heap. Necessary modifications on `BooleanQuery` and `BooleanWeight` are performed as well. * fixup! fix test * fixup!: refactor code. 1. More meaningful names. 2. Add Cache for `Disjunction`'s scorers, and fix bug. 3. Optimize `BooleanWeight::complex_scorer` Thanks Paul Masurel <paul@quickwit.io> * squash!: come up with better variable naming. * squash!: fix naming issues. * squash!: fix typo. * squash!: Remove CombinationMethod::FullIntersection	2024-07-01 15:39:41 +08:00
Paul Masurel	e453848134	Recycling buffer in PrefixPhraseScorer (#2443 )	2024-06-24 17:11:53 +09:00
PSeitz	c3b92a5412	fix compiler warning, cleanup (#2393 ) fix compiler warning for missing feature flag remove unused variables cleanup unused methods	2024-06-11 16:03:50 +08:00
Hamir Mahal	0c634adbe1	style: simplify strings with string interpolation (#2412 ) * style: simplify strings with string interpolation * fix: formatting	2024-05-27 09:16:47 +02:00
PSeitz	71f3b4e4e3	fix ReferenceValue API flaw (#2372 ) * fix ReferenceValue API flaw Remove `Facet` and `TokenizedString` values from the `ReferenceValue` API, as this requires the trait value to have them stored somewhere. Since `TokenizedString` is quite niche, I just copy it into a Box, instead of designing a reference API around it. * fix comment link	2024-05-09 06:14:42 +02:00
PSeitz	eea70030bf	cleanup top level exports (#2382 ) remove some top level exports	2024-05-07 09:59:41 +02:00
trinity-1686a	6a66a71cbb	modify fastfield range query heuristic (#2375 )	2024-04-25 10:06:11 +02:00
PSeitz	047da20b5b	add json path constructor to term (#2367 )	2024-04-22 12:23:35 +02:00
PSeitz	0e9fced336	remove JsonTermWriter (#2238 ) * remove JsonTermWriter remove JsonTermWriter remove path truncation logic, add assertion * fix json_path_writer add sep logic	2024-04-18 16:28:05 +02:00
PSeitz	74940e9345	clippy (#2349 ) * fix clippy * fix clippy * fix duplicate imports	2024-04-09 07:54:44 +02:00
PSeitz	67ebba3c3c	expose collect_block buffer size (#2326 ) * expose buffer of collect_block * flip shard_size segment_size	2024-03-15 08:02:08 +01:00
PSeitz	48630ceec9	move into new index module (#2259 ) move core modules to index module	2024-01-31 10:30:04 +01:00
Adam Reichold	53f2fe1fbe	Forward regex parser errors to enable understandin their reason. (#2288 )	2023-12-22 11:01:10 +01:00
PSeitz	0aae31d7d7	reduce number of allocations (#2257 ) * reduce number of allocations Explanation makes up around 50% of all allocations (numbers not perf). It's created during serialization but not called. - Make Explanation optional in BM25 - Avoid allocations when using Explanation * use Cow	2023-11-16 13:47:36 +01:00
Paul Masurel	7bc5bf78e2	Fixing functional tests. (#2239 )	2023-11-05 18:18:39 +09:00
PSeitz	83af14caa4	Fix range query (#2226 ) Fix range query end check in advance Rename vars to reduce ambiguity add tests Fixes #2225	2023-10-25 09:17:31 +02:00
trinity-1686a	0d4589219b	encode some part of posting list as -1 instead of direct values (#2185 ) * add support for delta-1 encoding posting list * encode term frequency minus one * don't emit tf for json integer terms * make skipreader not pub(crate) mutable	2023-10-20 16:58:26 +02:00
PSeitz	03a1f40767	rename DocValue to Value (#2197 ) rename DocValue to Value to avoid confusion with lucene DocValues rename Value to OwnedValue	2023-10-02 17:03:00 +02:00
Harrison Burt	1c7c6fd591	POC: Tantivy documents as a trait (#2071 ) * fix windows build (#1) * Fix windows build * Add doc traits * Add field value iter * Add value and serialization * Adjust order * Fix bug * Correct type * Fix generic bugs * Reformat code * Add generic to index writer which I forgot about * Fix missing generics on single segment writer * Add missing type export * Add default methods for convenience * Cleanup * Fix more-like-this query to use standard types * Update API and fix tests * Add doc traits * Add field value iter * Add value and serialization * Adjust order * Fix bug * Correct type * Rebase main and fix conflicts * Reformat code * Merge upstream * Fix missing generics on single segment writer * Add missing type export * Add default methods for convenience * Cleanup * Fix more-like-this query to use standard types * Update API and fix tests * Add tokenizer improvements from previous commits * Add tokenizer improvements from previous commits * Reformat * Fix unit tests * Fix unit tests * Use enum in changes * Stage changes * Add new deserializer logic * Add serializer integration * Add document deserializer * Implement new (de)serialization api for existing types * Fix bugs and type errors * Add helper implementations * Fix errors * Reformat code * Add unit tests and some code organisation for serialization * Add unit tests to deserializer * Add some small docs * Add support for deserializing serde values * Reformat * Fix typo * Fix typo * Change repr of facet * Remove unused trait methods * Add child value type * Resolve comments * Fix build * Fix more build errors * Fix more build errors * Fix the tests I missed * Fix examples * fix numerical order, serialize PreTok Str * fix coverage * rename Document to TantivyDocument, rename DocumentAccess to Document add Binary prefix to binary de/serialization * fix coverage --------- Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>	2023-10-02 10:01:16 +02:00
PSeitz	832f1633de	handle exclusive out of bounds ranges on fastfield range queries (#2174 ) closes https://github.com/quickwit-oss/quickwit/issues/3790	2023-09-26 08:00:40 +02:00
trinity-1686a	0241a05b90	add support for exists query syntax in query parser (#2170 ) * add support for exists query syntax in query parser * rustfmt * make Exists require a field	2023-09-19 11:10:39 +02:00
PSeitz	2d7390341c	increase min memory to 15MB for indexing (#2176 ) With tantivy 0.20 the minimum memory consumption per SegmentWriter increased to 12MB. 7MB are for the different fast field collectors types (they could be lazily created). Increase the minimum memory from 3MB to 15MB. Change memory variable naming from arena to budget. closes #2156	2023-09-13 07:38:34 +02:00
Ping Xia	e4e416ac42	extend FuzzyTermQuery to support json field (#2173 ) * extend fuzzy search for json field * comments * comments * fmt fix * comments	2023-09-11 05:59:40 +02:00
Igor Motov	19325132b7	Fast-field based implementation of ExistsQuery (#2160 ) Adds an implementation of ExistsQuery that takes advantage of fast fields. Fixes #2159	2023-09-07 11:51:49 +09:00
PSeitz	c4e2708901	fix clippy, fmt (#2162 )	2023-08-30 08:04:26 +02:00
PSeitz	48d4847b38	Improve aggregation error message (#2150 ) * Improve aggregation error message Improve aggregation error message by wrapping the deserialization with a custom struct. This deserialization variant is slower, since we need to keep the deserialized data around twice with this approach. For now the valid variants list is manually updated. This could be replaced with a proc macro. closes #2143 * Simpler implementation --------- Co-authored-by: Paul Masurel <paul@quickwit.io>	2023-08-23 20:52:15 +02:00
PSeitz	480763db0d	track memory arena memory usage (#2148 )	2023-08-16 18:19:42 +02:00
Caleb Hattingh	47b315ff18	doc: escape the backslash (#2144 )	2023-08-14 19:10:07 +02:00
Adam Reichold	22c35b1e00	Fix explanation of boost queries seeking beyond query result. (#2142 ) * Make current nightly Clippy happy. * Fix explanation of boost queries seeking beyond query result.	2023-08-14 11:59:11 +09:00
trinity-1686a	b92082b748	implement lenient parser (#2129 ) * move query parser to nom * add suupport for term grouping * initial work on infallible parser * fmt * add tests and fix minor parsing bugs * address review comments * add support for lenient queries in tantivy * make lenient parser report errors * allow mixing occur and bool in query	2023-08-08 15:41:29 +02:00
Adam Reichold	42acd334f4	Fixes the new deny-by-default incorrect_partial_ord_impl_on_ord_type Clippy lint (#2131 )	2023-07-21 11:36:17 +09:00
Adam Reichold	5fafe4b1ab	Add missing query_terms impl for TermSetQuery. (#2120 )	2023-07-13 14:54:29 +02:00
François Massot	0a23201338	Fix stackoverflow and add docs.	2023-07-03 22:05:11 +09:00
Paul Masurel	ad4c940fa3	proof of concept for dynamic tokenizer.	2023-07-03 22:05:10 +09:00
Paul Masurel	910b0b0c61	Cargo fmt	2023-07-03 22:03:31 +09:00
PSeitz	3fef052bf1	fix flaky test (#2107 ) closes #2099	2023-06-29 14:30:56 +08:00

1 2 3 4 5 ...

563 Commits