tantivy

mirror of https://github.com/quickwit-oss/tantivy.git synced 2026-05-25 04:30:40 +00:00

Author	SHA1	Message	Date
PSeitz	1e7cd48cfa	remove allocations in split compound words (#2080 ) * remove allocations in split compound words * clear reused data	2023-07-13 09:43:02 +09:00
François Massot	b91d3f6be4	Clean comment on 'TextAnalyzerBuilder::filter_dynamic' method.	2023-07-03 18:45:59 +02:00
François Massot	a8e76513bb	Remove useless clone.	2023-07-03 22:05:11 +09:00
François Massot	0a23201338	Fix stackoverflow and add docs.	2023-07-03 22:05:11 +09:00
François Massot	81330aaf89	WIP	2023-07-03 22:05:10 +09:00
Paul Masurel	98a3b01992	Removing the BoxedTokenizer	2023-07-03 22:05:10 +09:00
Paul Masurel	d341520938	Dynamic follow up	2023-07-03 22:05:10 +09:00
François Massot	5c9af73e41	Followup fulmicoton poc.	2023-07-03 22:05:10 +09:00
Paul Masurel	ad4c940fa3	proof of concept for dynamic tokenizer.	2023-07-03 22:05:10 +09:00
François Massot	d73706dede	Ngram tokenizer now returns an error with invalid arguments.	2023-06-25 20:13:24 +02:00
Adam Reichold	4aa131c3db	Make TextAnalyzerBuilder publicly accessible (#2097 ) This way, client code can name the type to e.g. store it inside structs without resorting to generics and it means that its documentation is part of the crate documentation generated by `cargo doc`.	2023-06-22 11:24:21 +09:00
PSeitz	fdecb79273	tokenizer-api: reduce Tokenizer overhead (#2062 ) * tokenizer-api: reduce Tokenizer overhead Previously a new `Token` for each text encountered was created, which contains `String::with_capacity(200)` In the new API the token_stream gets mutable access to the tokenizer, this allows state to be shared (in this PR Token is shared). Ideally the allocation for the BoxTokenStream would also be removed, but this may require some lifetime tricks. * simplify api * move lowercase and ascii folding buffer to global * empty Token text as default	2023-06-08 18:37:58 +08:00
Valerii	4b7c485a08	feat: add stop words for Hungarian language (#2069 )	2023-06-02 07:26:03 +02:00
PSeitz	e56addc63e	enable tokenizer on json fields (#2053 ) * enable tokenizer on json fields enable tokenizer on json fields for type text * Avoid making the tokenizer within the TextAnalyzer pub(crate) * Moving BoxableTokenizer to tantivy. --------- Co-authored-by: Paul Masurel <paul@quickwit.io>	2023-05-24 10:47:39 +02:00
Yuri Astrakhan	74275b76a6	Inline format arguments where makes sense (#2038 ) Applied this command to the code, making it a bit shorter and slightly more readable. ``` cargo +nightly clippy --all-features --benches --tests --workspace --fix -- -A clippy::all -W clippy::uninlined_format_args cargo +nightly fmt --all ```	2023-05-10 18:03:59 +09:00
PSeitz	7b31100208	refactor vint (#2010 ) - improve performance of vint vint serialization shows up in performance profiles during indexing. It would also make sense to limit the value space to u29 and operate on 4 bytes only. - remove unused code - add missing inlines - fix regex test	2023-04-25 08:49:36 +02:00
Paul Masurel	fbda511a1a	Making more things public for quickwit. (#2005 )	2023-04-20 11:37:45 +09:00
Adam Reichold	c1defdda05	Bump aho-corasick dependency to version 1.0 and adjust to API changes (#2002 ) * Drop additional Arc-layer as the automaton itself is now cheap-to-clone. * Drop state ID type parameter as it is not exposed by the library any more.	2023-04-18 07:34:30 +02:00
François Massot	64bce340b2	Expose to use it in quickwit.	2023-04-13 18:28:53 +02:00
trinity-1686a	064518156f	refactor tokenization pipeline to use GATs (#1924 ) * refactor tokenization pipeline to use GATs * fix doctests * fix clippy lints * remove commented code	2023-03-09 09:39:37 +01:00
Paul Masurel	8ea97e7d6b	Minor refactoring preparing for getting columnar integrated in quickwit. (#1911 )	2023-02-27 14:23:30 +09:00
Paul Masurel	097fd6138d	Fix clippy comments (#1872 )	2023-02-14 23:12:45 +09:00
trinity-1686a	3ac973bea4	fix invalid endianness in documentation (#1845 ) * fix doc about term endianness * rustfmt	2023-02-09 15:36:38 +01:00
Paul Masurel	bd5eea9852	Integrated columnar work.	2023-02-09 13:14:31 +01:00
Paul Masurel	7a8fce0ae7	Minor mini fixes	2023-01-10 14:15:30 +09:00
Michael Kleen	196e42f33e	Add regex tokenizer (#1759 ) This adds a regex tokenizer which tokenizes the text by using a regex pattern to split. Co-authored-by: Michael Kleen <mkleen@gmailw.com>	2023-01-10 13:38:37 +09:00
PSeitz	514d23a20c	move tokenizer API to separate crate (#1767 ) closes #1766 Finding tantivy tokenizers is a frustrating experience currently, since they need to be updated for each tantivy version. That's unnecessary since the API is rather stable anyway.	2023-01-09 06:37:38 +01:00
Adam Reichold	2080c370c2	Enable usage of FuzzyTermQuery for specific fields via QueryParser (#1750 ) * Make nightly Clippy mostly happy. * Document how to produce TermSetQuery queries using QueryParser. * Enable construction of queries using FuzzyTermQuery via the QueryParser * Use FxHashMap instead of HashMap in the QueryParser as these hash tables are not exposed to DoS attacks. * Use a struct instead of a tuple to improve readability.	2023-01-04 18:11:27 +09:00
Adam Reichold	ca6231170e	Make the built-in stop word lists selectable via the Language enum already used by the Stemmer filter. (#1671 )	2022-11-15 17:40:25 +09:00
Adam Reichold	a4b759d2fe	Include stop word lists from Lucene and the Snowball project (#1666 )	2022-11-09 16:57:35 +09:00
Adam Reichold	c32ab66bbd	Small improvements to StopWordFilter (#1657 ) * Do not copy the whole set of stop words for each stream * Make construction of StopWordFilter more flexible.	2022-11-01 16:47:34 +09:00
PSeitz	4e46f4f8c4	Merge pull request #1649 from adamreichold/split-compound-words RFC: Add dictionary-based SplitCompoundWords token filter.	2022-10-27 17:12:48 +08:00
PSeitz	6647362464	Merge pull request #1648 from adamreichold/stemmer-todo-alloc Avoid unconditional allocation in StemmerTokenStream.	2022-10-27 16:50:41 +08:00
Adam Reichold	cd952429d2	Add dictionary-based SplitCompoundWords token filter.	2022-10-27 08:30:33 +02:00
Adam Reichold	bbb058d976	Replace FNV by rustc-hash Both constructions have similar goals, but rustc-hash is better suited for contemporary CPUs as it works one word at a time instead of byte by byte.	2022-10-27 00:35:09 +02:00
Adam Reichold	5f7d027a52	Avoid unconditional allocation in StemmerTokenStream. This fixes the TODO in two ways: If the stemmer already yields an owned string, it is used directly as the new text of the token. Otherwise, a temporary buffer is used to copy the stemmed text (just as before) and then swapping it into the token to reuse its existing buffer.	2022-10-26 18:11:15 +02:00
Bruce Mitchener	44e03791f9	Fix warnings when doc'ing private items. (#1579 ) This also fixes a couple of typos, but plenty remain!	2022-10-03 14:24:00 +09:00
Bruce Mitchener	cf02e32578	Improvements to doc linking, grammar, etc.	2022-09-19 18:10:22 +07:00
Bruce Mitchener	6a88ac3fe3	Documentation improvements. Fix some linking, some grammar, some typos, etc.	2022-09-18 18:05:37 +07:00
Kanji Yomoda	af84e74284	Replace deprecated std package's constants on floats and integers (#1420 )	2022-07-22 08:05:08 +09:00
Antoine G	11e4225f23	doc fix (#1391 ) Documentation fix.	2022-06-21 15:53:33 +09:00
PSeitz	7f45a6ac96	allow setting tokenizer manager on index (#1362 ) handle json in tokenizer_for_field	2022-05-09 18:15:45 +09:00
Paul Masurel	46d5de920d	Removes all usage of block_on, and use a oneshot channel instead. (#1315 ) * Removes all usage of block_on, and use a oneshot channel instead. Calling `block_on` panics in certain contexts. For instance, it panics when it is called in the context of another call to block. Using it in tantivy is unnecessary. We replace it by a thin wrapper around a oneshot channel that supports both async/sync. * Removing needless uses of async in the API. Co-authored-by: PSeitz <PSeitz@users.noreply.github.com>	2022-03-18 16:54:58 +09:00
Paul Masurel	958b2bee08	Clippy comments (#1316 )	2022-03-17 18:57:55 +09:00
Paul Masurel	848b795b9f	Apply suggestions from code review	2022-03-01 18:37:51 +09:00
Pascal Seitz	091b668624	fix clippy issues	2022-03-01 08:58:51 +01:00
Paul Masurel	d7b46d2137	Added JSON Type (#1270 ) - Removed useless copy when ingesting JSON. - Bugfix in phrase query with a missing field norms. - Disabled range query on default fields Closes #1251	2022-02-24 16:25:22 +09:00
Paul Masurel	4dc80cfa25	Removes TokenStream chain. (#1283 ) This change is mostly motivated by the introduction of json object. We need to be able to inject a position object to make the position shift.	2022-02-21 09:51:27 +09:00
Paul Masurel	2069e3e52b	Fixing clippy comments	2022-02-01 10:24:05 +09:00
Paul Masurel	eca6628b3c	Minor refactoring (#1266 )	2022-01-28 15:55:55 +09:00

121 Commits