tantivy

mirror of https://github.com/quickwit-oss/tantivy.git synced 2026-01-06 17:22:54 +00:00

Author	SHA1	Message	Date
François Massot	b91d3f6be4	Clean comment on 'TextAnalyzerBuilder::filter_dynamic' method.	2023-07-03 18:45:59 +02:00
François Massot	a8e76513bb	Remove useless clone.	2023-07-03 22:05:11 +09:00
François Massot	0a23201338	Fix stackoverflow and add docs.	2023-07-03 22:05:11 +09:00
François Massot	81330aaf89	WIP	2023-07-03 22:05:10 +09:00
Paul Masurel	98a3b01992	Removing the BoxedTokenizer	2023-07-03 22:05:10 +09:00
Paul Masurel	d341520938	Dynamic follow up	2023-07-03 22:05:10 +09:00
François Massot	5c9af73e41	Followup fulmicoton poc.	2023-07-03 22:05:10 +09:00
Paul Masurel	ad4c940fa3	proof of concept for dynamic tokenizer.	2023-07-03 22:05:10 +09:00
PSeitz	fdecb79273	tokenizer-api: reduce Tokenizer overhead (#2062) * tokenizer-api: reduce Tokenizer overhead Previously a new `Token` for each text encountered was created, which contains `String::with_capacity(200)` In the new API the token_stream gets mutable access to the tokenizer, this allows state to be shared (in this PR Token is shared). Ideally the allocation for the BoxTokenStream would also be removed, but this may require some lifetime tricks. * simplify api * move lowercase and ascii folding buffer to global * empty Token text as default	2023-06-08 18:37:58 +08:00
PSeitz	e56addc63e	enable tokenizer on json fields (#2053) * enable tokenizer on json fields enable tokenizer on json fields for type text * Avoid making the tokenizer within the TextAnalyzer pub(crate) * Moving BoxableTokenizer to tantivy. --------- Co-authored-by: Paul Masurel <paul@quickwit.io>	2023-05-24 10:47:39 +02:00
trinity-1686a	064518156f	refactor tokenization pipeline to use GATs (#1924) * refactor tokenization pipeline to use GATs * fix doctests * fix clippy lints * remove commented code	2023-03-09 09:39:37 +01:00
PSeitz	514d23a20c	move tokenizer API to seperate crate (#1767) closes #1766 Finding tantivy tokenizers is a frustrating experience currently, since they need be updated for each tantivy version. That's unnecessary since the API is rather stable anyway.	2023-01-09 06:37:38 +01:00
Bruce Mitchener	cf02e32578	Improvements to doc linking, grammar, etc.	2022-09-19 18:10:22 +07:00
Kanji Yomoda	af84e74284	Replace deprecated std package's constants on floats and integers (#1420)	2022-07-22 08:05:08 +09:00
Antoine G	11e4225f23	doc fix (#1391) Documentation fix.	2022-06-21 15:53:33 +09:00
Paul Masurel	d7b46d2137	Added JSON Type (#1270) - Removed useless copy when ingesting JSON. - Bugfix in phrase query with a missing field norms. - Disabled range query on default fields Closes #1251	2022-02-24 16:25:22 +09:00
Paul Masurel	4dc80cfa25	Removes TokenStream chain. (#1283) This change is mostly motivated by the introduction of json object. We need to be able to inject a position object to make the position shift.	2022-02-21 09:51:27 +09:00
Paul Masurel	2069e3e52b	Fixing clippy comments	2022-02-01 10:24:05 +09:00
Paul Masurel	eca6628b3c	Minor refactoring (#1266)	2022-01-28 15:55:55 +09:00
Paul Masurel	486b8fa9c5	Removing serde-derive dependency (#786)	2020-03-06 23:33:58 +09:00
Paul Masurel	811fd0cb9e	Dynamic analyzer (#755) * Removed generics in tokenizers * lowercaser * Added TokenizerExt * Introducing BoxedTokenizer * Introducing BoxXXXXX helper struct * Closes #762. * Introducing a TextAnalyzer	2020-01-29 18:23:37 +09:00
Paul Masurel	ef3eddf3da	clippy first stab (#711)	2019-11-22 13:09:35 +09:00
kkoziara	0519056bd8	Added handling of pre-tokenized text fields (#642). (#669) * Added handling of pre-tokenized text fields (#642). * * Updated changelog and examples concerning #642. * Added tokenized_text method to Value implementation. * Implemented From<TokenizedString> for TokenizedStream. * * Removed tokenized flag from TextOptions and code reliance on the flag. * Changed naming to use word "pre-tokenized" instead of "tokenized". * Updated example code. * Fixed comments. * Minor code refactoring. Test improvements.	2019-11-07 10:10:56 +09:00
Paul Masurel	039c0a0863	Introducing a wrapper struct instead of Boxed<BoxableTokenizer> (#631) Closes #629	2019-08-15 16:37:04 +09:00
Paul Masurel	462774b15c	Tiqb feature/2018 (#583) * rust 2018 * Added CHANGELOG comment	2019-07-01 10:01:46 +09:00
Paul Masurel	0b0bf59a32	Allow stemmers in languages other than English (#478) Allow users to create stemmers for languages other than English. Add a default stemmer for English. Closes #478	2019-01-23 22:21:00 +09:00
Paul Masurel	37e4280c0a	Cargo Format (#420)	2018-09-15 07:44:22 +09:00
Vignesh Sarma K	09e00f1d42	add position_length to Token (#337) * add position_length to Token refer #291 * Add term offset to `PhraseQuery` ref #291 * Add new constructor for `PhraseQuery` that allows custom offset * fix the method name as per pr comment * Closes #291 Added unit test. Using offsets from the analyzer in QueryParser.	2018-08-13 10:14:50 +09:00
Paul Masurel	b59132966f	Better heap (#311) * Changed the heap to a paged memory arena. * Trying to simplify the indexing term hashmap * Exploding datastruct * Removed some complexity in bitpacker	2018-06-04 09:39:18 +09:00
Paul Masurel	9a0b7f9855	Rustfmt	2018-05-07 19:50:35 -07:00
Paul Masurel	99c0b84036	Integrating #274, #280, #289 into master (#290) * Integrating bugfixes into master Closes #274 Closes #280 Closes #289 * Next version will be 0.6	2018-05-06 09:48:25 -07:00
Dru Sellers	ca74c14647	Simple Implementation of NGram Tokenizer (#278) * Simple Implementation of NGram Tokenizer It does not yet support edges It could probably be better in many "rusty" ways But the test is passing, so I'll call this a good stopping point for the day. * Remove Ngram from manager. Too many variations * Basic configuration model Should the extensive tests exist here? * Add Sample to provide an End to End testing * Basic Edgegram support * cleanup * code feedback * More code review feedback processed	2018-05-06 09:47:49 -07:00
Paul Masurel	78673172d0	Cargo fmt	2018-04-21 20:05:36 +09:00
Paul Masurel	e44782bf14	No more	2018-04-12 13:01:11 +09:00
Paul Masurel	df53dc4ceb	Format	2018-02-03 00:21:05 +09:00
Paul Masurel	49519c3f61	added comments	2018-01-04 12:53:20 +09:00
Paul Masurel	cb11b92505	Added comments	2018-01-04 12:27:14 +09:00
Paul Masurel	db7d784573	Issue 227 Faster merge when there are no deletes	2017-12-21 22:04:05 +09:00
Paul Masurel	1e55189db1	NOBUG rustfmt	2017-12-14 19:30:31 +09:00
Paul Masurel	8b1b389a76	NOBUG Clippy	2017-12-14 19:25:12 +09:00
Paul Masurel	8023445b63	docs	2017-11-26 11:52:03 +09:00
Paul Masurel	05ce093f97	doc	2017-11-26 11:43:11 +09:00
Paul Masurel	6937e23a56	fixing doctest	2017-11-26 11:06:34 +09:00
Paul Masurel	974c321153	cargo fmt	2017-11-26 11:02:02 +09:00
Paul Masurel	f30ec9b36b	Merge branch 'master' of github.com:tantivy-search/tantivy Conflicts: src/analyzer/mod.rs src/schema/index_record_option.rs src/tokenizer/lower_caser.rs src/tokenizer/tokenizer.rs	2017-11-26 10:54:05 +09:00
Paul Masurel	acd7c1ea2d	Added comments	2017-11-26 10:44:49 +09:00
Paul Masurel	ac4d433fad	Renamed analyzer to tokenizer	2017-11-24 16:50:32 +09:00

47 Commits