tantivy

mirror of https://github.com/quickwit-oss/tantivy.git synced 2026-01-10 11:02:55 +00:00

Author	SHA1	Message	Date
PSeitz	fdecb79273	tokenizer-api: reduce Tokenizer overhead (#2062 ) * tokenizer-api: reduce Tokenizer overhead Previously a new `Token` for each text encountered was created, which contains `String::with_capacity(200)` In the new API the token_stream gets mutable access to the tokenizer, this allows state to be shared (in this PR Token is shared). Ideally the allocation for the BoxTokenStream would also be removed, but this may require some lifetime tricks. * simplify api * move lowercase and ascii folding buffer to global * empty Token text as default	2023-06-08 18:37:58 +08:00
PSeitz	27f202083c	Improve Termmap Indexing Performance +~30% (#2058 ) * update benchmark * Improve Termmap Indexing Performance +~30% This contains many small changes to improve Termmap performance. Most notably: * Specialized byte compare and equality versions, instead of glibc calls. * ExpUnrolledLinkedList to not contain inline items. Allow compare hash only via a feature flag compare_hash_only: 64bits should be enough with a good hash function to compare strings by their hashes instead of comparing the strings. Disabled by default CreateHashMap/alice/174693 time: [642.23 µs 643.80 µs 645.24 µs] thrpt: [258.20 MiB/s 258.78 MiB/s 259.41 MiB/s] change: time: [-14.429% -13.303% -12.348%] (p = 0.00 < 0.05) thrpt: [+14.088% +15.344% +16.862%] Performance has improved. CreateHashMap/alice_expull/174693 time: [877.03 µs 880.44 µs 884.67 µs] thrpt: [188.32 MiB/s 189.22 MiB/s 189.96 MiB/s] change: time: [-26.460% -26.274% -26.091%] (p = 0.00 < 0.05) thrpt: [+35.301% +35.637% +35.981%] Performance has improved. CreateHashMap/numbers_zipf/8000000 time: [9.1198 ms 9.1573 ms 9.1961 ms] thrpt: [829.64 MiB/s 833.15 MiB/s 836.57 MiB/s] change: time: [-35.229% -34.828% -34.384%] (p = 0.00 < 0.05) thrpt: [+52.403% +53.440% +54.390%] Performance has improved. * clippy * add bench for ids * inline(always) to inline whole block with bounds checks * cleanup	2023-06-08 11:13:52 +02:00
PSeitz	ccb09aaa83	allow histogram bounds to be passed as Rfc3339 (#2076 )	2023-06-08 09:07:08 +02:00
Valerii	4b7c485a08	feat: add stop words for Hungarian language (#2069 )	2023-06-02 07:26:03 +02:00
Adam Reichold	b325d569ad	Expose phrase-prefix queries via the built-in query parser (#2044 ) * Expose phrase-prefix queries via the built-in query parser This proposes the less-than-imaginative syntax `field:"phrase ter"` to perform a phrase prefix query against `field` using `phrase` and `ter` as the terms. The aim of this is to make this type of query more discoverable and simplify manual testing. I did consider exposing the `max_expansions` parameter similar to how slop is handled, but I think that this is rather something that should be configured via the query parser (similar to `set_field_boost` and `set_field_fuzzy`) as choosing it requires rather intimate knowledge of the backing index. Prevent construction of zero or one term phrase-prefix queries via the query parser. * Add example using phrase-prefix search via surface API to improve feature discoverability.	2023-06-01 13:03:16 +02:00
Paul Masurel	7ee78bda52	Readding s in datetime precision variant names (#2065 ) There is no clear win and it change some serialization in quickwit.	2023-06-01 06:39:46 +02:00
PSeitz	3af456972e	Fix min doc_count empty merge bug (#2057 ) This fixes an issue when min_doc==0 loads terms from the dictionary from one segment and merges the same term with a subaggregation from another segment. Previously the empty structure was not correctly initialized to contain the subaggregation so the merge was incorrect.	2023-05-29 14:20:50 +08:00
PSeitz	e56addc63e	enable tokenizer on json fields (#2053 ) * enable tokenizer on json fields enable tokenizer on json fields for type text * Avoid making the tokenizer within the TextAnalyzer pub(crate) * Moving BoxableTokenizer to tantivy. --------- Co-authored-by: Paul Masurel <paul@quickwit.io>	2023-05-24 10:47:39 +02:00
Adrien Guillo	a789ad9aee	Rename `DatePrecision` to `DateTimePrecision` (#2051 )	2023-05-23 17:09:11 +02:00
Sergei Lavrentev	8cf26da4b2	Add possibility to set up highlight prefix and postfix for snippet (#1422 ) * add possibility to change highlight prefix and postfix * add comment to Snippet::new * add test for highlighted elements * add default highlight prefix and postfix constants * fix spelling * fix tests * fix spelling * do fixes after code review * reduce test_snippet_generator_custom_highlighted_elements code * fix fmt * change names to more convenient --------- Co-authored-by: Sergei Lavrentev <23312691+lavrxxx@users.noreply.github.com>	2023-05-23 15:09:24 +02:00
trinity-1686a	a3f001360f	add support for warming up range of terms (#2042 ) * add support for warming up range of terms * simplify handling of limit	2023-05-22 14:29:35 +02:00
trinity-1686a	6564e0c467	fix phrase prefix query (#2043 ) * fix phrase prefix query it would fail spectacularly when no doc in the segment would match the phrase part of the query * clippy	2023-05-22 12:36:20 +02:00
Paul Masurel	d7e97331e5	Minor refactoring find field (#2055 ) * Minor refactoring Moving find_field_with_default to Schema. * Clippy comments	2023-05-22 15:00:48 +09:00
Paul Masurel	4417be165d	Minor refactoring (#2054 ) Moving find_field_with_default to Schema.	2023-05-22 14:56:38 +09:00
PSeitz	6239697a02	switch to ms in histogram for date type (#2045 ) * switch to ms in histogram for date type switch to ms in histogram, by adding a normalization step that converts to nanoseconds precision when creating the collector. closes #2028 related to #2026 * add missing unit long variants * use single thread to avoid handling test case * fix docs * revert CI * cleanup * improve docs * Update src/aggregation/bucket/histogram/histogram.rs Co-authored-by: Paul Masurel <paul@quickwit.io> --------- Co-authored-by: Paul Masurel <paul@quickwit.io>	2023-05-19 08:15:44 +02:00
Paul Masurel	62709b8094	Change in the query grammar. (#2050 ) * Change in the query grammar. Quotation mark can now be used for phrase queries. The delimiter is part of the `UserInputLeaf`. That information is meant to be used in Quickwit to solve #3364. This PR also adds support for quotation marks escaping in phrase queries. * Apply suggestions from code review	2023-05-19 12:07:10 +09:00
PSeitz	04562c0318	add fastfield tokenizer to IndexBuilder (#2046 )	2023-05-18 04:33:42 +02:00
PSeitz	2dfe37940d	handle multiple types in term aggregation (#2041 )	2023-05-15 11:57:38 +02:00
Adam Reichold	fedd9559e7	Expose create a query from a user input AST. (#2039 )	2023-05-11 21:53:18 +09:00
Paul Masurel	fe3ecf9567	Added support for madvise (#2036 ) Added support for madvise	2023-05-11 05:39:17 +02:00
PSeitz	ba3a885a3b	handle multiple agg results (#2035 ) handle multiple intermediate aggregation results with the same name.	2023-05-10 15:00:38 +02:00
PSeitz	d1988be8e9	fix and extend benchmark (#2030 ) * add benchmark, add missing inlines * fix stacker bench * add wiki benchmark * move line split out of bench	2023-05-10 13:01:56 +02:00
PSeitz	0eafbaab8e	fix slop (#2031 ) Fix slop by carrying slop so far for multiterms. Define slop contract in the API	2023-05-10 11:45:14 +02:00
Yuri Astrakhan	74275b76a6	Inline format arguments where makes sense (#2038 ) Applied this command to the code, making it a bit shorter and slightly more readable. ``` cargo +nightly clippy --all-features --benches --tests --workspace --fix -- -A clippy::all -W clippy::uninlined_format_args cargo +nightly fmt --all ```	2023-05-10 18:03:59 +09:00
PSeitz	4ee1b5cda0	add separate tokenizer manager for fast fields (#2019 ) * add separate tokenizer manager for fast fields * rename	2023-05-08 11:22:31 +02:00
PSeitz	45ff0e3c5c	clear memory consumption in AggregationLimits (#2022 ) * clear memory consumption in AggregationLimits clear memory consumption in AggregationLimits at the end of segment collection * switch to ResourceLimitGuard * unduplicate code * merge methods * Apply suggestions from code review Co-authored-by: Paul Masurel <paul@quickwit.io> --------- Co-authored-by: Paul Masurel <paul@quickwit.io>	2023-05-08 10:15:09 +02:00
PSeitz	4c58b0086d	allow slop in both directions (#2020 ) * allow slop in both directions allow slop in both directions so "big wolf"~3 can also match "wolf big" This also fixes #1934, when the docsets were reordered by size and didn't match the terms. * remove count * add test for repeating tokens, unduplicate tests	2023-05-07 12:05:21 +09:00
François Massot	992f755298	Fix clippy.	2023-05-05 10:51:29 +02:00
François Massot	c8df843f96	Fix date histogram bounds and field name.	2023-05-05 00:52:55 +02:00
Paul Masurel	f28ddb711e	Exposing u64-based FastFieldRangeWeight (#2024 )	2023-05-03 18:32:00 +09:00
tottoto	73452284ae	Remove unused crates from dependencies (#2018 ) * Remove unused crates from dependencies * Revert rand to columnar * Revert criterion to stacker	2023-05-02 12:34:20 +02:00
PSeitz	ba309e18a1	switch to nanosecond precision (#2016 )	2023-05-01 03:32:20 +02:00
PSeitz	cbf2bdc75b	change bucket count type (#2013 ) * change bucket count type closes #2012 * Update src/aggregation/agg_limits.rs Co-authored-by: Paul Masurel <paul@quickwit.io> * Update src/directory/managed_directory.rs Co-authored-by: Paul Masurel <paul@quickwit.io> * fix test --------- Co-authored-by: Paul Masurel <paul@quickwit.io>	2023-04-27 15:47:31 +08:00
PSeitz	1f06997d04	fix single collector special case (#2014 )	2023-04-27 09:30:19 +02:00
PSeitz	c599bf3b6c	chore!:drop JSON support on intermediate agg result (#1992 ) * chore!:drop JSON support on intermediate agg result add support for other formats by removing skip_serialize and untagged JSON support is broken anyway due to its lack of f64::INF etc. handling * Update src/aggregation/intermediate_agg_result.rs Co-authored-by: Paul Masurel <paul@quickwit.io> * move from impl --------- Co-authored-by: Paul Masurel <paul@quickwit.io>	2023-04-26 13:05:16 +02:00
PSeitz	80df1d9835	Handle error for exists on MMapDirectory (#1988 ) `exists` will return false in case of other io errors, like permission denied	2023-04-25 09:20:33 +02:00
PSeitz	2e369db936	switch to Aggregation without serde_untagged (#2003 ) * refactor result handling * remove Internal stuff * merge different accessors * switch to Aggregation without serde_untagged * fix doctests	2023-04-25 08:54:51 +02:00
PSeitz	7b31100208	refactor vint (#2010 ) - improve performance of vint vint serialization shows up in performance profiles during indexing. It would also make sense to limit the value space to u29 and operate on 4 bytes only. - remove unused code - add missing inlines - fix regex test	2023-04-25 08:49:36 +02:00
trinity-1686a	9c93bfeb51	optimise warmup code path (#2007 ) * optimise warmup code path * better function naming	2023-04-21 11:23:09 +02:00
PSeitz	74f9eafefc	refactor Term (#2006 ) * refactor Term add ValueBytes for serialized term values add missing debug for ip skip unnecessary json path validation remove code duplication add DATE_TIME_PRECISION_INDEXED constant add missing Term clarification remove weird value_bytes_mut() API * fix naming	2023-04-20 15:31:43 +02:00
RT_Enzyme	ff3d3313c4	fix BooleanQuery document (#1999 ) * fix BooleanQuery document * Update src/query/boolean_query/boolean_query.rs --------- Co-authored-by: Paul Masurel <paul@quickwit.io>	2023-04-20 11:37:20 +02:00
Paul Masurel	fbda511a1a	Making more things public for quickwit. (#2005 )	2023-04-20 11:37:45 +09:00
Adam Reichold	c1defdda05	Bump aho-corasick dependency to version 1.0 and adjust to API changes (#2002 ) * Drop additional Arc-layer as the automaton itself is now cheap-to-clone. * Drop state ID type parameter as it is not exposed by the library any more.	2023-04-18 07:34:30 +02:00
PSeitz	e522163a1c	use json in agg tests (#1998 ) * switch to JSON in tests, add flat aggregation types * use method * clippy * remove commented file	2023-04-17 14:08:48 +02:00
PSeitz	e83abbfe4a	perf: faster term hash map (#1940 ) * add term hashmap benchmark * refactor arena hashmap add inlines remove occupied array and use table_entry.is_empty instead (saves 4 bytes per entry) reduce saturation threshold from 1/3 to 1/2 to reduce memory use u32 for UnorderedId (we have the 4billion limit anyways on the Columnar stuff) fix naming LinearProbing remove byteorder dependency memory consumption went down from 2Gb to 1.8GB on indexing wikipedia dataset in tantivy * Update stacker/src/arena_hashmap.rs Co-authored-by: Paul Masurel <paul@quickwit.io> --------- Co-authored-by: Paul Masurel <paul@quickwit.io>	2023-04-17 09:07:33 +02:00
trinity-1686a	780e26331d	sstable compression (#1946 ) * compress sstable with zstd * add some details to sstable readme * compress only block which benefit from it * multiple changes to sstable make compression optional use OwnedBytes instead of impl Read in sstable, required for next point use zstd bulk api, which is much faster on small records * cleanup and use bulk api for compression * use dedicated byte for compression * switch block len and compression flag * change default zstd level in sstable	2023-04-14 16:25:50 +02:00
trinity-1686a	0286ecea09	re-export a few sstable functions on dictionary (#1996 ) * re-export a few sstable functions on dictionary * Update documentation Co-authored-by: François Massot <francois.massot@gmail.com> --------- Co-authored-by: François Massot <francois.massot@gmail.com>	2023-04-14 11:13:48 +02:00
François Massot	36138c493b	Merge pull request #1994 from quickwit-oss/fmassot/expose-simple-token-stream Expose `SimpleTokenStream` to use it in quickwit for the multilanguage tokenizer	2023-04-13 18:55:02 +02:00
François Massot	64bce340b2	Expose to use it in quickwit.	2023-04-13 18:28:53 +02:00
trinity-1686a	205e8a0a92	encode dictionary type in fst footer (#1968 ) * encode additional footer for dictionary kind in fst	2023-04-12 09:43:01 +02:00
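
Commit 4c58b0086d in the list above (allow slop in both directions, #2020) says that "big wolf"~3 can also match "wolf big". The sketch below illustrates that idea for a two-term phrase only. It is not tantivy's implementation: the function names are hypothetical, whitespace splitting stands in for a tokenizer, and the way slop is counted (absolute displacement of the second term from its exact-phrase position) is an assumption that may differ from tantivy's actual accounting.

```rust
/// Positions (token indices) at which `term` occurs in `doc`.
/// Whitespace splitting stands in for a real tokenizer (assumption).
fn positions(doc: &str, term: &str) -> Vec<u32> {
    doc.split_whitespace()
        .enumerate()
        .filter(|(_, token)| *token == term)
        .map(|(index, _)| index as u32)
        .collect()
}

/// Hypothetical check for a two-term phrase with slop.
/// In an exact phrase the second term sits at `first_pos + 1`.
/// The displacement from that slot may be negative (second term
/// before the first), so the absolute value is compared to `slop`.
fn two_term_slop_match(first_pos: u32, second_pos: u32, slop: u32) -> bool {
    let displacement = i64::from(second_pos) - (i64::from(first_pos) + 1);
    displacement.unsigned_abs() <= u64::from(slop)
}

/// True if any pair of occurrences of `first` and `second` in `doc`
/// is within `slop`, in either order.
fn matches_with_slop(doc: &str, first: &str, second: &str, slop: u32) -> bool {
    let first_positions = positions(doc, first);
    let second_positions = positions(doc, second);
    first_positions.iter().any(|&a| {
        second_positions
            .iter()
            .any(|&b| two_term_slop_match(a, b, slop))
    })
}
```

Under these assumptions, `matches_with_slop("wolf big", "big", "wolf", 3)` holds because the reversed pair has a displacement of -2. A one-directional check would instead require the displacement to be non-negative and would reject that document.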

2261 Commits
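
Commit 7b31100208 (refactor vint, #2010) notes that vint serialization shows up in indexing profiles. For readers unfamiliar with the term, the sketch below shows a generic variable-length integer encoding with 7 payload bits per byte. The byte layout is an assumption: the high bit here marks "more bytes follow", as in LEB128, and tantivy's own format may use a different convention. The function names are hypothetical.

```rust
/// Append `value` to `out` using 7 payload bits per byte, least
/// significant group first. The high bit marks "more bytes follow"
/// (LEB128-style; an assumed layout, not necessarily tantivy's).
fn encode_vint(mut value: u64, out: &mut Vec<u8>) {
    loop {
        let low = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            // Last byte: high bit clear.
            out.push(low);
            return;
        }
        out.push(low | 0x80);
    }
}

/// Decode one value from the start of `bytes`.
/// Returns the value and the number of bytes consumed, or `None` if
/// no terminating byte is found within 10 bytes (the most a u64 needs).
/// Excess high bits in a 10th byte are silently dropped in this sketch.
fn decode_vint(bytes: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &byte) in bytes.iter().enumerate().take(10) {
        value |= u64::from(byte & 0x7f) << (7 * i);
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}
```

With this layout, values below 128 take a single byte and values below 2^28 take at most four, which is why small integers such as deltas compress well.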