15 Commits

Author SHA1 Message Date
PSeitz
fdecb79273 tokenizer-api: reduce Tokenizer overhead (#2062)
* tokenizer-api: reduce Tokenizer overhead

Previously, a new `Token` was created for each text encountered, and each one
contained a `String::with_capacity(200)`.
In the new API, the token_stream gets mutable access to the tokenizer, which
allows state to be shared (in this PR, the Token is shared).
Ideally, the allocation for the BoxTokenStream would also be removed, but
this may require some lifetime tricks.

* simplify api

* move lowercase and ascii folding buffer to global

* empty Token text as default
2023-06-08 18:37:58 +08:00
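
The idea described in the #2062 commit message above can be illustrated with a minimal sketch. It does not reproduce the crate's actual API: `WhitespaceTokenizer`, `WhitespaceTokenStream`, `collect_tokens`, and the simplified `Token` fields are hypothetical names chosen for illustration. The point is only that the tokenizer owns one `Token`, and each stream borrows it mutably and reuses its `String` buffer instead of allocating a fresh one per text.

```rust
// Hypothetical, simplified types for illustration; not the crate's real API.
#[allow(dead_code)]
#[derive(Default)]
struct Token {
    offset_from: usize,
    offset_to: usize,
    position: usize,
    text: String,
}

// The tokenizer owns the Token, so its text buffer survives across calls.
#[derive(Default)]
struct WhitespaceTokenizer {
    token: Token,
}

// The stream borrows the tokenizer's Token mutably instead of creating one.
struct WhitespaceTokenStream<'a> {
    text: &'a str,
    cursor: usize,
    token: &'a mut Token,
}

impl WhitespaceTokenizer {
    fn token_stream<'a>(&'a mut self, text: &'a str) -> WhitespaceTokenStream<'a> {
        // Reset the shared token; clear() keeps the allocated capacity.
        self.token.text.clear();
        self.token.position = usize::MAX; // wraps to 0 on the first advance
        WhitespaceTokenStream { text, cursor: 0, token: &mut self.token }
    }
}

impl<'a> WhitespaceTokenStream<'a> {
    // Moves to the next whitespace-separated token; returns false at the end.
    fn advance(&mut self) -> bool {
        let trimmed = self.text[self.cursor..].trim_start();
        if trimmed.is_empty() {
            return false;
        }
        let start = self.text.len() - trimmed.len();
        let len = trimmed.find(char::is_whitespace).unwrap_or(trimmed.len());
        let end = start + len;
        self.token.text.clear();
        self.token.text.push_str(&self.text[start..end]);
        self.token.offset_from = start;
        self.token.offset_to = end;
        self.token.position = self.token.position.wrapping_add(1);
        self.cursor = end;
        true
    }

    fn token(&self) -> &Token {
        &*self.token
    }
}

// Helper that drains a stream into owned strings (for demonstration only).
fn collect_tokens(tokenizer: &mut WhitespaceTokenizer, text: &str) -> Vec<String> {
    let mut stream = tokenizer.token_stream(text);
    let mut out = Vec::new();
    while stream.advance() {
        out.push(stream.token().text.clone());
    }
    out
}
```

Because `token_stream` takes `&mut self`, only one stream per tokenizer can exist at a time; that restriction is what makes sharing the buffer safe in this sketch.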
trinity-1686a
064518156f refactor tokenization pipeline to use GATs (#1924)
* refactor tokenization pipeline to use GATs

* fix doctests

* fix clippy lints

* remove commented code
2023-03-09 09:39:37 +01:00
Paul Masurel
eca6628b3c Minor refactoring (#1266) 2022-01-28 15:55:55 +09:00
Tomoko Uchida
74e36c7e97 Add unit tests for tokenizers and filters (#1156)
* add unit test for SimpleTokenizer
* add unit tests for tokenizers and filters.
2021-09-27 10:22:01 +09:00
Paul Masurel
811fd0cb9e Dynamic analyzer (#755)
* Removed generics in tokenizers

* lowercaser

* Added TokenizerExt

* Introducing BoxedTokenizer

* Introducing BoxXXXXX helper struct

* Closes #762.

* Introducing a TextAnalyzer
2020-01-29 18:23:37 +09:00
Paul Masurel
ef3eddf3da clippy first stab (#711) 2019-11-22 13:09:35 +09:00
Joshua Dutton
9f74786db2 Update import statements in examples, doctests (#633)
Update import statements to edition 2018, including removing
`extern crate` and `#[macro_use]`. Alphabetize the statements.
2019-08-19 07:26:35 +09:00
Paul Masurel
dac50c6aeb Dds merged (#539)
* add ascii folding support

* Minor change and added Changelog.

* add additional tests

* Add tests for ascii folding (#533)

* first tests for ascii folding

* use a `RawTokenizer` for tokens using punctuation

* add test for all (?) folding, inspired by Lucene

* Simplification of the unit test code
2019-04-26 10:25:08 +09:00
Dru Sellers
82d87416c2 Implement StopWords Filter (#292)
* Implement StopWords Filter

- added example doctest for alphanum_only.rs so that I could
drive my own test of the stopword filter

* Style Cop

* Switch HashSet Hasher to FNV for speed

* Update Change Log

* fix missed location renaming
2018-05-09 18:40:41 -07:00
Paul Masurel
cb11b92505 Added comments 2018-01-04 12:27:14 +09:00
Paul Masurel
1e55189db1 NOBUG rustfmt 2017-12-14 19:30:31 +09:00
Paul Masurel
f24e5f405e NOBUG intellij misc lint 2017-12-14 18:23:35 +09:00
Paul Masurel
974c321153 cargo fmt 2017-11-26 11:02:02 +09:00
Paul Masurel
acd7c1ea2d Added comments 2017-11-26 10:44:49 +09:00
Paul Masurel
ac4d433fad Renamed analyzer to tokenizer 2017-11-24 16:50:32 +09:00