* add RegexPhraseQuery
RegexPhraseQuery supports phrase queries whose terms are regexes or
wildcards. E.g. a query with wildcards:
"b* b* wolf" matches "big bad wolf"
Slop is supported as well:
"b* wolf"~2 matches "big bad wolf"
Regex queries may match a large number of terms, and we still need to
keep track of which term hit in order to load its positions.
The phrase query algorithm groups terms by frequency in the union, so
that groups can be pre-filtered early.
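A minimal usage sketch (the constructor `RegexPhraseQuery::new(field, regexes)` and the `set_slop` setter are assumptions based on this description; check the actual API):
```rust
use tantivy::collector::Count;
use tantivy::query::RegexPhraseQuery;
use tantivy::schema::{Schema, TEXT};
use tantivy::{doc, Index};

fn main() -> tantivy::Result<()> {
    let mut schema_builder = Schema::builder();
    let text = schema_builder.add_text_field("text", TEXT);
    let index = Index::create_in_ram(schema_builder.build());

    let mut writer = index.writer(50_000_000)?;
    writer.add_document(doc!(text => "big bad wolf"))?;
    writer.commit()?;

    // "b* wolf"~2: each phrase term is a regex, with a slop of 2.
    let mut query = RegexPhraseQuery::new(text, vec!["b.*".to_string(), "wolf".to_string()]);
    query.set_slop(2);

    let searcher = index.reader()?.searcher();
    assert_eq!(searcher.search(&query, &Count)?, 1);
    Ok(())
}
```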
This PR comes with some new data structures:
SimpleUnion - A union docset over a list of docsets. It doesn't do any
caching and is therefore well suited for workloads with lots of skipping
(phrase search, and intersections in general); a minimal sketch of the
idea is shown after this list.
LoadedPostings - Like SegmentPostings, but with all docs and positions loaded
into memory. SegmentPostings uses 1840 bytes per instance for its caches,
which is equivalent to 460 docids.
LoadedPostings is used for terms with fewer than 100 docs, solely to
reduce memory consumption.
BitSetPostingUnion - Creates a `Posting` that uses a bitset for the docid
hits and the underlying docsets for positions. The bitset is the
precalculated union of the docsets.
In RegexPhraseQuery there is a limit of 512 docsets per PreAggregatedUnion
before a new one is created.
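To illustrate the `SimpleUnion` idea, here is a minimal, uncached union over a list of docsets (the `DocSet` trait here is a stripped-down stand-in for tantivy's):
```rust
const TERMINATED: u32 = u32::MAX;

/// Minimal stand-in for tantivy's DocSet trait.
trait DocSet {
    /// Current doc id, or TERMINATED if exhausted.
    fn doc(&self) -> u32;
    /// Advance to the first doc id >= target and return it.
    fn seek(&mut self, target: u32) -> u32;
}

/// Uncached union: the current doc is simply the minimum of the
/// children's current docs. No buffering, so skipping (as in phrase
/// intersections) stays cheap.
struct SimpleUnion<D: DocSet> {
    docsets: Vec<D>,
    doc: u32,
}

impl<D: DocSet> SimpleUnion<D> {
    fn new(docsets: Vec<D>) -> Self {
        let mut union = SimpleUnion { docsets, doc: 0 };
        union.doc = union.min_doc();
        union
    }

    fn min_doc(&self) -> u32 {
        self.docsets.iter().map(|d| d.doc()).min().unwrap_or(TERMINATED)
    }

    fn seek(&mut self, target: u32) -> u32 {
        for docset in &mut self.docsets {
            if docset.doc() < target {
                docset.seek(target);
            }
        }
        self.doc = self.min_doc();
        self.doc
    }
}

/// Toy docset over a sorted vec, for the usage example below.
struct VecDocSet {
    docs: Vec<u32>,
    cursor: usize,
}

impl DocSet for VecDocSet {
    fn doc(&self) -> u32 {
        self.docs.get(self.cursor).copied().unwrap_or(TERMINATED)
    }
    fn seek(&mut self, target: u32) -> u32 {
        while self.doc() < target {
            self.cursor += 1;
        }
        self.doc()
    }
}

fn main() {
    let a = VecDocSet { docs: vec![1, 5, 9], cursor: 0 };
    let b = VecDocSet { docs: vec![3, 5], cursor: 0 };
    let mut union = SimpleUnion::new(vec![a, b]);
    assert_eq!(union.doc, 1);
    assert_eq!(union.seek(4), 5);
    assert_eq!(union.seek(6), 9);
}
```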
Renamed Union to BufferedUnionScorer
Added proptests to test different union types.
* cleanup
* use Box instead of Vec
* use RefCell instead of term_freq(&mut)
* remove wildcard mode
* move RefCell to outer
* clippy
* store DateTime as nanoseconds in doc store
The doc store DateTime was previously truncated to microseconds. This
removes the truncation, while still keeping backwards compatibility.
This is done by adding the trait `ConfigurableBinarySerializable`, which
works like `BinarySerializable` but carries a config that currently allows
de/serializing with different date-time precisions.
Bump the format version to 7.
Add a compat test to check the date-time truncation.
* remove configurable binary serialize, add enum for doc store version
* test doc store version ord
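A rough sketch of the version-gated deserialization (all names here are illustrative, not the actual tantivy types):
```rust
use std::io;

/// Illustrative doc store version enum; older versions stored DateTime
/// truncated to microseconds, V7 stores full nanoseconds.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum DocStoreVersion {
    V6 = 6,
    V7 = 7,
}

fn deserialize_datetime_nanos(
    reader: &mut impl io::Read,
    version: DocStoreVersion,
) -> io::Result<i64> {
    let mut buf = [0u8; 8];
    reader.read_exact(&mut buf)?;
    let value = i64::from_le_bytes(buf);
    match version {
        // Old format: the stored value is microseconds since epoch.
        DocStoreVersion::V6 => Ok(value * 1_000),
        // New format: the stored value is already nanoseconds.
        DocStoreVersion::V7 => Ok(value),
    }
}

fn main() {
    // The Ord impl matters for compat checks like `version >= V7`,
    // hence the dedicated "test doc store version ord" commit.
    assert!(DocStoreVersion::V7 > DocStoreVersion::V6);
}
```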
This is required by the `fs4` dependency, and other dependencies also
need a Rust version later than 1.66.
Both quickwit and the Python binding already require a newer version.
* change AggregationLimits behavior
This fixes an issue encountered with the current behaviour of
AggregationLimits.
Previously we had AggregationLimits and RessourceLimitGuard, which both
tracked memory, but only RessourceLimitGuard released its memory when
dropped; AggregationLimits did not.
This PR changes AggregationLimits to be a guard itself and removes
RessourceLimitGuard.
* rename AggregationLimits to AggregationLimitsGuard
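A minimal sketch of the guard pattern this moves to: a shared counter that is released on `Drop` (names and fields are illustrative):
```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Illustrative guard: tracks allocated bytes against a shared counter
/// and releases its share when dropped, so memory accounting can no
/// longer leak when an aggregation is abandoned midway.
struct AggregationLimitsGuard {
    memory_consumed: Arc<AtomicU64>,
    allocated_by_this_guard: u64,
    memory_limit: u64,
}

impl AggregationLimitsGuard {
    fn add_memory_consumed(&mut self, bytes: u64) -> Result<(), String> {
        let total = self.memory_consumed.fetch_add(bytes, Ordering::Relaxed) + bytes;
        self.allocated_by_this_guard += bytes;
        if total > self.memory_limit {
            return Err(format!("aggregation memory limit exceeded: {total} bytes"));
        }
        Ok(())
    }
}

impl Drop for AggregationLimitsGuard {
    fn drop(&mut self) {
        // Release this guard's share of the global counter.
        self.memory_consumed
            .fetch_sub(self.allocated_by_this_guard, Ordering::Relaxed);
    }
}
```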
* Fix: Improve collapse_overlapped_ranges function
- Refactor into separate sort_and_deduplicate_ranges and merge_overlapping_ranges functions
- Enhance sorting to consider both start and end of ranges
- Optimize merging logic to handle adjacent ranges
- Add comprehensive examples in function documentation
- Ensure proper handling of duplicate and unsorted input ranges
- Improve overall efficiency and readability of range collapsing algorithm
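A compact sketch of the two-step approach on `Range<usize>` values (the real function operates on tantivy's snippet ranges):
```rust
use std::ops::Range;

/// Sort by (start, end) and drop exact duplicates.
fn sort_and_deduplicate_ranges(mut ranges: Vec<Range<usize>>) -> Vec<Range<usize>> {
    ranges.sort_by_key(|r| (r.start, r.end));
    ranges.dedup();
    ranges
}

/// Merge ranges that overlap or are adjacent, e.g. [0..3, 3..5, 4..8] -> [0..8].
fn merge_overlapping_ranges(ranges: Vec<Range<usize>>) -> Vec<Range<usize>> {
    let mut merged: Vec<Range<usize>> = Vec::new();
    for range in ranges {
        match merged.last_mut() {
            Some(last) if range.start <= last.end => last.end = last.end.max(range.end),
            _ => merged.push(range),
        }
    }
    merged
}

fn main() {
    // Duplicate and unsorted input is handled by the first step.
    let ranges = vec![4..8, 0..3, 3..5, 0..3];
    let collapsed = merge_overlapping_ranges(sort_and_deduplicate_ranges(ranges));
    assert_eq!(collapsed, vec![0..8]);
}
```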
* move debug_assert
---------
Co-authored-by: PSeitz <PSeitz@users.noreply.github.com>
(a OR b) OR (c OR d) can be simplified to (a OR b OR c OR d)
(a AND b) AND (c AND d) can be simplified to (a AND b AND c AND d)
This directly affects how queries are executed
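A sketch of the flattening rule on a toy query tree (the real code works on tantivy's boolean query AST):
```rust
#[derive(Debug, PartialEq)]
enum Query {
    Term(&'static str),
    Or(Vec<Query>),
    And(Vec<Query>),
}

/// Flatten nested OR-of-OR and AND-of-AND nodes into a single level.
fn flatten(query: Query) -> Query {
    match query {
        Query::Or(children) => {
            let mut flat = Vec::new();
            for child in children.into_iter().map(flatten) {
                match child {
                    Query::Or(grandchildren) => flat.extend(grandchildren),
                    other => flat.push(other),
                }
            }
            Query::Or(flat)
        }
        Query::And(children) => {
            let mut flat = Vec::new();
            for child in children.into_iter().map(flatten) {
                match child {
                    Query::And(grandchildren) => flat.extend(grandchildren),
                    other => flat.push(other),
                }
            }
            Query::And(flat)
        }
        leaf => leaf,
    }
}

fn main() {
    use Query::*;
    // (a OR b) OR (c OR d)  ==>  (a OR b OR c OR d)
    let q = Or(vec![Or(vec![Term("a"), Term("b")]), Or(vec![Term("c"), Term("d")])]);
    assert_eq!(flatten(q), Or(vec![Term("a"), Term("b"), Term("c"), Term("d")]));
}
```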
remove unused SumWithCoordsCombiner
the number of fields is unused and private
* support ff range queries on json fields
* fix term date truncation
* use inverted index range query for phrase prefix queries
* rename to InvertedIndexRangeQuery
* fix column filter, add mixed column test
* add Key::I64 and Key::U64 variants in aggregation
Currently, all numerical `Key` values are returned as f64, which in some
cases causes problems with precision and with the way f64 is serialized.
This PR adds `Key::I64` and `Key::U64` variants and uses them in the term
aggregation.
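A sketch of the extended enum and the precision issue it avoids (serialization details omitted; the actual variants' wire format may differ):
```rust
/// Aggregation bucket key: integer values keep their exact
/// representation instead of being coerced to f64.
#[derive(Debug, Clone, PartialEq)]
enum Key {
    Str(String),
    F64(f64),
    U64(u64),
    I64(i64),
}

fn main() {
    // 2^63 + 1 is not exactly representable as f64 ...
    let exact: u64 = (1 << 63) + 1;
    assert_ne!(exact as f64 as u64, exact);
    // ... but survives unchanged as a Key::U64 term bucket key.
    let key = Key::U64(exact);
    assert_eq!(key, Key::U64(9_223_372_036_854_775_809));
}
```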
* add clarification comment
Avoid single-segment lists without deletes as merge candidates, as they
would be moved into a merge operation and then filtered out for merging
in the next consider_merge_options call. In rare cases this could end in
an endless merge loop in which only single segments, with nothing to be
done, are merged.
* add support for str fast field range query
Add support for range queries on str fast fields by converting term
bounds to term ordinal bounds.
closes https://github.com/quickwit-oss/tantivy/issues/2023
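The core idea, sketched against a hypothetical minimal term dictionary (the real code resolves the bounds against the fast field's dictionary):
```rust
use std::ops::Bound;

/// Hypothetical, minimal term dictionary: sorted terms, ordinal = index.
struct TermDict {
    sorted_terms: Vec<String>,
}

impl TermDict {
    /// Ordinal of the first term >= `term`.
    fn first_ord_at_or_after(&self, term: &str) -> u64 {
        self.sorted_terms.partition_point(|t| t.as_str() < term) as u64
    }

    /// Ordinal of the first term > `term`.
    fn first_ord_after(&self, term: &str) -> u64 {
        self.sorted_terms.partition_point(|t| t.as_str() <= term) as u64
    }
}

/// Convert a term bound into an inclusive lower term-ordinal bound.
fn lower_term_bound_to_ord(dict: &TermDict, bound: Bound<&str>) -> u64 {
    match bound {
        Bound::Included(term) => dict.first_ord_at_or_after(term),
        Bound::Excluded(term) => dict.first_ord_after(term),
        Bound::Unbounded => 0,
    }
}

fn main() {
    let dict = TermDict {
        sorted_terms: vec!["apple".into(), "banana".into(), "cherry".into()],
    };
    // A lower bound of "b" maps to the ordinal of "banana" (ord 1).
    assert_eq!(lower_term_bound_to_ord(&dict, Bound::Included("b")), 1);
}
```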
* extend tests, rename
* update comment
* update comment
As preparation for #2023 and #1709
* Use Term to pass parameters
* merge u64 and ip fast field range query
Side note: I did not rename range_query_u64_fastfield, because git would then be unable to track the changes.
The previous way to address the problem was to replace \u{0000}
with '0' in different places.
This logic had several flaws:
Done on the serializer side (as it was for the columnar), there was
a collision problem: if one document in the segment contained a JSON
field with a \0 and another doc contained the same JSON field but with
a '0', we were sending the same field path twice to the serializer.
Another option would have been to normalize all values on the writer
side.
This PR simplifies the logic and simply ignores JSON paths containing a
\0, both in the columnar and the inverted index.
Closes #2442
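The fix boils down to a filter like the following when accepting JSON paths (illustrative; \0 is reserved internally as a path separator):
```rust
/// Skip JSON paths containing a NUL byte instead of rewriting them,
/// since the \0 byte is used internally as a path separator.
fn accept_json_path(path: &str) -> bool {
    !path.as_bytes().contains(&0u8)
}

fn main() {
    assert!(accept_json_path("attributes.color"));
    assert!(!accept_json_path("attributes\u{0000}color"));
}
```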
* feat(query): Make `BooleanQuery` support `minimum_number_should_match`. See issue #2398
This commit introduces a new scorer named DisjunctionScorer, which performs the union of posting lists while enforcing the minimum number of required matching clauses; it is implemented via a min-heap. The necessary modifications to `BooleanQuery` and `BooleanWeight` are included as well.
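A condensed sketch of the min-heap union with a minimum match count (toy posting lists as sorted doc id slices; the real scorer also combines scores):
```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Union of sorted posting lists that only emits docs matched by at
/// least `min_should_match` of them, driven by a min-heap keyed on
/// each list's current doc id.
fn disjunction(postings: &[&[u32]], min_should_match: usize) -> Vec<u32> {
    // Heap entries: (current doc, list index, position within list).
    let mut heap: BinaryHeap<Reverse<(u32, usize, usize)>> = postings
        .iter()
        .enumerate()
        .filter(|(_, p)| !p.is_empty())
        .map(|(i, p)| Reverse((p[0], i, 0)))
        .collect();

    let mut hits = Vec::new();
    while let Some(Reverse((doc, _, _))) = heap.peek().copied() {
        // Pop every list currently positioned on `doc`, counting them.
        let mut match_count = 0;
        while let Some(&Reverse((d, i, pos))) = heap.peek() {
            if d != doc {
                break;
            }
            heap.pop();
            match_count += 1;
            if pos + 1 < postings[i].len() {
                heap.push(Reverse((postings[i][pos + 1], i, pos + 1)));
            }
        }
        if match_count >= min_should_match {
            hits.push(doc);
        }
    }
    hits
}

fn main() {
    let a: &[u32] = &[1, 3, 5];
    let b: &[u32] = &[1, 2, 3];
    let c: &[u32] = &[3, 5];
    // Docs matched by at least 2 of the 3 lists: 1 (a,b), 3 (all), 5 (a,c).
    assert_eq!(disjunction(&[a, b, c], 2), vec![1, 3, 5]);
}
```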
* fixup! fix test
* fixup!: refactor code.
1. More meaningful names.
2. Add a cache for `Disjunction`'s scorers, and fix a bug.
3. Optimize `BooleanWeight::complex_scorer`.
Thanks to Paul Masurel <paul@quickwit.io>.
* squash!: come up with better variable naming.
* squash!: fix naming issues.
* squash!: fix typo.
* squash!: Remove CombinationMethod::FullIntersection
* WiP: cardinality aggregation
* Collect unique entries first, then insert into HyperLogLog
* Handle `missing`
* Hybrid approach
* Review changes
- insert `missing` value at most once
- `term_id` -> `term_ord`
- iterate directly over entries without collecting first
* Use salted hasher to include column type
* fix: formatting
* More review fixes
* Add cardinality to test_aggregation_flushing
* Formatting
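A sketch of the hybrid flow: dedupe term ordinals first, then insert salted hashes into a (here deliberately simplified) HyperLogLog; the salt encodes the column type:
```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Tiny HyperLogLog (2^P registers, no low-range bias correction) --
/// just enough to show where the salted hashes end up.
const P: u32 = 12;

struct Hll {
    registers: Vec<u8>,
}

impl Hll {
    fn new() -> Self {
        Hll { registers: vec![0; 1 << P] }
    }
    fn insert_hash(&mut self, hash: u64) {
        // First P bits pick the register, the rest contribute the rank.
        let idx = (hash >> (64 - P)) as usize;
        let rho = ((hash << P) | 1).leading_zeros() as u8 + 1;
        self.registers[idx] = self.registers[idx].max(rho);
    }
    fn estimate(&self) -> f64 {
        let m = self.registers.len() as f64;
        let sum: f64 = self.registers.iter().map(|&r| 2f64.powi(-(r as i32))).sum();
        0.7213 / (1.0 + 1.079 / m) * m * m / sum
    }
}

/// Salt the hash with the column type so equal raw values from
/// different column types count as distinct entries.
fn salted_hash(column_type_salt: u64, value: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    column_type_salt.hash(&mut hasher);
    value.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    // Hybrid approach: collect unique term ordinals first ...
    let term_ords: Vec<u64> = (0..10_000).map(|i| i % 1_000).collect();
    let unique: HashSet<u64> = term_ords.into_iter().collect();
    // ... then insert the salted hashes into the HyperLogLog.
    let mut hll = Hll::new();
    for ord in unique {
        hll.insert_hash(salted_hash(/* column type */ 1, ord));
    }
    println!("estimated cardinality: {:.0}", hll.estimate()); // ~1000
}
```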