PSeitz
fdecb79273
tokenizer-api: reduce Tokenizer overhead ( #2062 )
* tokenizer-api: reduce Tokenizer overhead
Previously, a new `Token` was created for each text encountered, and each
one contains `String::with_capacity(200)`.
In the new API, the token_stream gets mutable access to the tokenizer,
which allows state to be shared (in this PR, the Token is shared).
Ideally, the allocation for the BoxTokenStream would also be removed, but
this may require some lifetime tricks.
* simplify api
* move lowercase and ascii folding buffer to global
* empty Token text as default
2023-06-08 18:37:58 +08:00
trinity-1686a
064518156f
refactor tokenization pipeline to use GATs ( #1924 )
* refactor tokenization pipeline to use GATs
* fix doctests
* fix clippy lints
* remove commented code
2023-03-09 09:39:37 +01:00
Antoine G
11e4225f23
doc fix ( #1391 )
Documentation fix.
2022-06-21 15:53:33 +09:00
PSeitz
7f45a6ac96
allow setting tokenizer manager on index ( #1362 )
handle json in tokenizer_for_field
2022-05-09 18:15:45 +09:00
Paul Masurel
eca6628b3c
Minor refactoring ( #1266 )
2022-01-28 15:55:55 +09:00
Tomoko Uchida
dd81e38e53
Add WhitespaceTokenizer ( #1147 )
* Add WhitespaceTokenizer.
2021-08-29 18:20:49 +09:00
Paul Masurel
811fd0cb9e
Dynamic analyzer ( #755 )
* Removed generics in tokenizers
* lowercaser
* Added TokenizerExt
* Introducing BoxedTokenizer
* Introducing BoxXXXXX helper struct
* Closes #762.
* Introducing a TextAnalyzer
2020-01-29 18:23:37 +09:00
Paul Masurel
039c0a0863
Introducing a wrapper struct instead of Boxed<BoxableTokenizer> ( #631 )
Closes #629
2019-08-15 16:37:04 +09:00
Paul Masurel
462774b15c
Tiqb feature/2018 ( #583 )
* rust 2018
* Added CHANGELOG comment
2019-07-01 10:01:46 +09:00
Paul Masurel
66b4615e4e
Issue/542 ( #543 )
* Closes #542.
Fast fields are all loaded when the segment reader is created.
2019-05-05 13:52:43 +09:00
Paul Masurel
63b593bd0a
Lower RAM usage in tests.
2019-01-24 09:10:38 +09:00
Paul Masurel
0b0bf59a32
Allow stemmers in languages other than English ( #478 )
Allow users to create stemmers for languages other than English. Add a
default stemmer for English.
Closes #478
2019-01-23 22:21:00 +09:00
Paul Masurel
dd37e109f2
Merge branch 'issue/368b'
2018-09-11 20:16:14 +09:00
Paul Masurel
63868733a3
Added SnippetGenerator
2018-09-11 09:45:27 +09:00
Paul Masurel
7e5f697d00
Closes #387
2018-09-09 16:23:56 +09:00
Paul Masurel
ede97eded6
Removed use
2018-08-28 09:54:04 +09:00
Dru Sellers
af593b1116
Add default EN stopwords to the default analyzer ( #381 )
* Add a default list of en stopwords
* Add the default en stopword filter to the standard tokenizers
* code review feedback
2018-08-22 10:49:39 +09:00
Paul Masurel
78673172d0
Cargo fmt
2018-04-21 20:05:36 +09:00
Paul Masurel
49519c3f61
added comments
2018-01-04 12:53:20 +09:00
Paul Masurel
1e55189db1
NOBUG rustfmt
2017-12-14 19:30:31 +09:00
Paul Masurel
f24e5f405e
NOBUG intellij misc lint
2017-12-14 18:23:35 +09:00
Paul Masurel
05ce093f97
doc
2017-11-26 11:43:11 +09:00
Paul Masurel
974c321153
cargo fmt
2017-11-26 11:02:02 +09:00
Paul Masurel
aaeeda2bc5
Editing rustdoc
2017-11-25 13:23:32 +09:00
Paul Masurel
ac4d433fad
Renamed analyzer to tokenizer
2017-11-24 16:50:32 +09:00