tantivy

mirror of https://github.com/quickwit-oss/tantivy.git synced 2026-01-13 20:42:55 +00:00

Author	SHA1	Message	Date
Adam Reichold	ca6231170e	Make the built-in stop word lists selectable via the Language enum already used by the Stemmer filter. (#1671 )	2022-11-15 17:40:25 +09:00
Adam Reichold	a4b759d2fe	Include stop word lists from Lucene and the Snowball project (#1666 )	2022-11-09 16:57:35 +09:00
Adam Reichold	c32ab66bbd	Small improvements to StopWordFilter (#1657 ) * Do not copy the whole set of stop words for each stream * Make construction of StopWordFilter more flexible.	2022-11-01 16:47:34 +09:00
PSeitz	4e46f4f8c4	Merge pull request #1649 from adamreichold/split-compound-words RFC: Add dictionary-based SplitCompoundWords token filter.	2022-10-27 17:12:48 +08:00
PSeitz	6647362464	Merge pull request #1648 from adamreichold/stemmer-todo-alloc Avoid unconditional allocation in StemmerTokenStream.	2022-10-27 16:50:41 +08:00
Adam Reichold	cd952429d2	Add dictionary-based SplitCompoundWords token filter.	2022-10-27 08:30:33 +02:00
Adam Reichold	bbb058d976	Replace FNV by rustc-hash Both constructions have similar goals, but rustc-hash is better suited for contemporary CPUs as it works one word at a time instead of byte by byte.	2022-10-27 00:35:09 +02:00
Adam Reichold	5f7d027a52	Avoid unconditional allocation in StemmerTokenStream. This fixes the TODO in two ways: If the stemmer already yields an owned string, it is used directly as the new text of the token. Otherwise, a temporary buffer is used to copy the stemmed text (just as before) and then swapping it into the token to reuse its existing buffer.	2022-10-26 18:11:15 +02:00
Bruce Mitchener	44e03791f9	Fix warnings when doc'ing private items. (#1579 ) This also fixes a couple of typos, but plenty remain!	2022-10-03 14:24:00 +09:00
Bruce Mitchener	cf02e32578	Improvements to doc linking, grammar, etc.	2022-09-19 18:10:22 +07:00
Bruce Mitchener	6a88ac3fe3	Documentation improvements. Fix some linking, some grammar, some typos, etc.	2022-09-18 18:05:37 +07:00
Kanji Yomoda	af84e74284	Replace deprecated std package's constants on floats and integers (#1420 )	2022-07-22 08:05:08 +09:00
Antoine G	11e4225f23	doc fix (#1391 ) Documentation fix.	2022-06-21 15:53:33 +09:00
PSeitz	7f45a6ac96	allow setting tokenizer manager on index (#1362 ) handle json in tokenizer_for_field	2022-05-09 18:15:45 +09:00
Paul Masurel	46d5de920d	Removes all usage of block_on, and use a oneshot channel instead. (#1315 ) * Removes all usage of block_on, and use a oneshot channel instead. Calling `block_on` panics in certain contexts. For instance, it panics when it is called in the context of another call to block. Using it in tantivy is unnecessary. We replace it by a thin wrapper around a oneshot channel that supports both async/sync. * Removing needless uses of async in the API. Co-authored-by: PSeitz <PSeitz@users.noreply.github.com>	2022-03-18 16:54:58 +09:00
Paul Masurel	958b2bee08	Clippy comments (#1316 )	2022-03-17 18:57:55 +09:00
Paul Masurel	848b795b9f	Apply suggestions from code review	2022-03-01 18:37:51 +09:00
Pascal Seitz	091b668624	fix clippy issues	2022-03-01 08:58:51 +01:00
Paul Masurel	d7b46d2137	Added JSON Type (#1270 ) - Removed useless copy when ingesting JSON. - Bugfix in phrase query with a missing field norms. - Disabled range query on default fields Closes #1251	2022-02-24 16:25:22 +09:00
Paul Masurel	4dc80cfa25	Removes TokenStream chain. (#1283 ) This change is mostly motivated by the introduction of json object. We need to be able to inject a position object to make the position shift.	2022-02-21 09:51:27 +09:00
Paul Masurel	2069e3e52b	Fixing clippy comments	2022-02-01 10:24:05 +09:00
Paul Masurel	eca6628b3c	Minor refactoring (#1266 )	2022-01-28 15:55:55 +09:00
Paul Masurel	732f6847c0	Field type with codes (#1255 ) * Terms are now typed. This change is backward compatible: While the Term has a byte representation that is modified, a Term itself is a transient object that is not serialized as is in the index. Its .field() and .value_bytes() on the other hand are unchanged. This change offers better Debug information for terms. While not necessary it also will help in the support for JSON types. * Renamed Hierarchical Facet -> Facet	2022-01-07 20:49:00 +09:00
Paul Masurel	3ea6800ac5	Pleasing clippy (#1253 )	2022-01-06 16:41:24 +09:00
Tomoko Uchida	74e36c7e97	Add unit tests for tokenizers and filters (#1156 ) * add unit test for SimpleTokenizer * add unit tests for tokenizers and filters.	2021-09-27 10:22:01 +09:00
Tomoko Uchida	dd81e38e53	Add WhitespaceTokenizer (#1147 ) * Add WhitespaceTokenizer.	2021-08-29 18:20:49 +09:00
Pascal Seitz	9b3e508753	fix clippy	2021-07-01 18:06:09 +02:00
Pascal Seitz	1e4df54ab3	fix clippy	2021-07-01 17:41:53 +02:00
Paul Masurel	486b8fa9c5	Removing serde-derive dependency (#786 )	2020-03-06 23:33:58 +09:00
Paul Masurel	811fd0cb9e	Dynamic analyzer (#755 ) * Removed generics in tokenizers * lowercaser * Added TokenizerExt * Introducing BoxedTokenizer * Introducing BoxXXXXX helper struct * Closes #762. * Introducing a TextAnalyzer	2020-01-29 18:23:37 +09:00
Christian Hunstad	02af28b3b7	add norwegian stemmer (#717 )	2019-11-27 21:08:59 +09:00
Paul Masurel	ef3eddf3da	clippy first stab (#711 )	2019-11-22 13:09:35 +09:00
kkoziara	0519056bd8	Added handling of pre-tokenized text fields (#642 ). (#669 ) * Added handling of pre-tokenized text fields (#642). * * Updated changelog and examples concerning #642. * Added tokenized_text method to Value implementation. * Implemented From<TokenizedString> for TokenizedStream. * * Removed tokenized flag from TextOptions and code reliance on the flag. * Changed naming to use word "pre-tokenized" instead of "tokenized". * Updated example code. * Fixed comments. * Minor code refactoring. Test improvements.	2019-11-07 10:10:56 +09:00
Paul Masurel	5c6580eb15	fmt (#661 )	2019-10-04 12:10:01 +09:00
Joshua Dutton	9f74786db2	Update import statements in examples, doctests (#633 ) Update import statements to edition 2018, including removing `extern crate` and `#[macro_use]`. Alphabetize the statements.	2019-08-19 07:26:35 +09:00
Paul Masurel	039c0a0863	Introducing a wrapper struct instead of Boxed<BoxableTokenizer> (#631 ) Closes #629	2019-08-15 16:37:04 +09:00
Paul Masurel	498057c5b7	Refactor deletes (#597 ) * Refactor deletes * Removing generation from SegmentUpdater. These have been obsolete for a long time * Number literal clippy * Removed clippy useless allow statement	2019-07-17 13:06:44 +09:00
Paul Masurel	462774b15c	Tiqb feature/2018 (#583 ) * rust 2018 * Added CHANGELOG comment	2019-07-01 10:01:46 +09:00
Paul Masurel	66b4615e4e	Issue/542 (#543 ) * Closes 542. Fast fields are all loaded when the segment reader is created.	2019-05-05 13:52:43 +09:00
Paul Masurel	dac50c6aeb	Dds merged (#539 ) * add ascii folding support * Minor change and added Changelog. * add additional tests * Add tests for ascii folding (#533) * first tests for ascii folding * use a `RawTokenizer` for tokens using punctuation * add test for all (?) folding, inspired by Lucene * Simplification of the unit test code	2019-04-26 10:25:08 +09:00
Paul Masurel	96a4f503ec	Closes #526 (#535 )	2019-04-24 20:59:48 +09:00
Panagiotis Ktistakis	2cd31bcda2	Fix non english stemmers (#521 )	2019-03-27 08:54:16 +09:00
Panagiotis Ktistakis	76609deadf	Add Greek stemmer (#486 )	2019-02-01 06:30:49 +01:00
Paul Masurel	bf94fd77db	Issue/471 (#481 ) * Closes 471 Removing writing_segments in the segment manager as it is now useless. Removing the target merged segment id as it is useless as well. * RAII for tracking which segment is in merge. Closes #471 * fmt * Using Inventory::default().	2019-01-27 12:18:59 +09:00
Paul Masurel	1fd46c1e9b	Clippy	2019-01-28 03:46:23 +01:00
Paul Masurel	63b593bd0a	Lower RAM usage in tests.	2019-01-24 09:10:38 +09:00
Paul Masurel	0b0bf59a32	Allow stemmers in languages other than English (#478 ) Allow users to create stemmers for languages other than English. Add a default stemmer for English. Closes #478	2019-01-23 22:21:00 +09:00
Paul Masurel	a3042e956b	Facet remove unsafe (#454 ) * Removing some unsafe * Removing some unsafe (2)	2018-12-17 09:31:09 +09:00
Paul Masurel	a6e767c877	Cargo fmt	2018-11-30 22:52:45 +09:00
Paul Masurel	07d87e154b	Collector refactoring and multithreaded search (#437 ) * Split Collector into an overall Collector and a per-segment SegmentCollector. Precursor to cross-segment parallelism, and as a side benefit cleans up any per-segment fields from being Option<T> to just T. * Attempt to add MultiCollector back * working. Chained collector is broken though * Fix chained collector * Fix test * Make Weight Send+Sync for parallelization purposes * Expose parameters of RangeQuery for external usage * Removed &mut self * fixing tests * Restored TestCollectors * blop * multicollector working * chained collector working * test broken * fixing unit test * blop * blop * Blop * simplifying API * blop * better syntax * Simplifying top_collector * refactoring * blop * Sync with master * Added multithread search * Collector refactoring * Schema::builder * CR and rustdoc * CR comments * blop * Added an executor * Sorted the segment readers in the searcher * Update searcher.rs * Fixed unit tests * changed the place where we have the sort-segment-by-count heuristic * using crossbeam::channel * inlining * Comments about panics propagating * Added unit test for executor panicking * Readded default * Removed Default impl * Added unit test for executor	2018-11-30 22:46:59 +09:00

93 Commits