Commit Graph

68 Commits

Author SHA1 Message Date
Tomoko Uchida
dd81e38e53 Add WhitespaceTokenizer (#1147)
* Add WhitespaceTokenizer.
2021-08-29 18:20:49 +09:00
Pascal Seitz
9b3e508753 fix clippy 2021-07-01 18:06:09 +02:00
Pascal Seitz
1e4df54ab3 fix clippy 2021-07-01 17:41:53 +02:00
Paul Masurel
486b8fa9c5 Removing serde-derive dependency (#786) 2020-03-06 23:33:58 +09:00
Paul Masurel
811fd0cb9e Dynamic analyzer (#755)
* Removed generics in tokenizers

* lowercaser

* Added TokenizerExt

* Introducing BoxedTokenizer

* Introducing BoxXXXXX helper struct

* Closes #762.

* Introducing a TextAnalyzer
2020-01-29 18:23:37 +09:00
Christian Hunstad
02af28b3b7 add norwegian stemmer (#717) 2019-11-27 21:08:59 +09:00
Paul Masurel
ef3eddf3da clippy first stab (#711) 2019-11-22 13:09:35 +09:00
kkoziara
0519056bd8 Added handling of pre-tokenized text fields (#642). (#669)
* Added handling of pre-tokenized text fields (#642).

* * Updated changelog and examples concerning #642.
* Added tokenized_text method to Value implementation.
* Implemented From<TokenizedString> for TokenizedStream.

* * Removed tokenized flag from TextOptions and code reliance on the flag.
* Changed naming to use word "pre-tokenized" instead of "tokenized".
* Updated example code.
* Fixed comments.

* Minor code refactoring. Test improvements.
2019-11-07 10:10:56 +09:00
Paul Masurel
5c6580eb15 fmt (#661) 2019-10-04 12:10:01 +09:00
Joshua Dutton
9f74786db2 Update import statements in examples, doctests (#633)
Update import statements to edition 2018, including removing
`extern crate` and  `#[macro_use]`. Alphabetize the statements.
2019-08-19 07:26:35 +09:00
Paul Masurel
039c0a0863 Introducing a wrapper struct instead of Boxed<BoxableTokenizer> (#631)
Closes #629
2019-08-15 16:37:04 +09:00
Paul Masurel
498057c5b7 Refactor deletes (#597)
* Refactor deletes

* Removing generation from SegmentUpdater. These have been obsolete for a long time

* Number literal clippy

* Removed clippy useless allow statement
2019-07-17 13:06:44 +09:00
Paul Masurel
462774b15c Tiqb feature/2018 (#583)
* rust 2018

* Added CHANGELOG comment
2019-07-01 10:01:46 +09:00
Paul Masurel
66b4615e4e Issue/542 (#543)
* Closes 542.

Fast fields are all loaded when the segment reader is created.
2019-05-05 13:52:43 +09:00
Paul Masurel
dac50c6aeb Dds merged (#539)
* add ascii folding support

* Minor change and added Changelog.

* add additional tests

* Add tests for ascii folding (#533)

* first tests for ascii folding

* use a `RawTokenizer` for tokens using punctuation

* add test for all (?) folding, inspired by Lucene

* Simplification of the unit test code
2019-04-26 10:25:08 +09:00
Paul Masurel
96a4f503ec Closes #526 (#535) 2019-04-24 20:59:48 +09:00
Panagiotis Ktistakis
2cd31bcda2 Fix non english stemmers (#521) 2019-03-27 08:54:16 +09:00
Panagiotis Ktistakis
76609deadf Add Greek stemmer (#486) 2019-02-01 06:30:49 +01:00
Paul Masurel
bf94fd77db Issue/471 (#481)
* Closes 471

Removing writing_segments in the segment manager as it is now useless.
Removing the target merged segment id as it is useless as well.

* RAII for tracking which segment is in merge.

Closes #471

* fmt

* Using Inventory::default().
2019-01-27 12:18:59 +09:00
Paul Masurel
1fd46c1e9b Clippy 2019-01-28 03:46:23 +01:00
Paul Masurel
63b593bd0a Lower RAM usage in tests. 2019-01-24 09:10:38 +09:00
Paul Masurel
0b0bf59a32 Allow stemmers in languages other than English (#478)
Allow users to create stemmers for languages other than English. Add a
default stemmer for English.

Closes #478
2019-01-23 22:21:00 +09:00
Paul Masurel
a3042e956b Facet remove unsafe (#454)
* Removing some unsafe

* Removing some unsafe (2)
2018-12-17 09:31:09 +09:00
Paul Masurel
a6e767c877 Cargo fmt 2018-11-30 22:52:45 +09:00
Paul Masurel
07d87e154b Collector refactoring and multithreaded search (#437)
* Split Collector into an overall Collector and a per-segment SegmentCollector. Precursor to cross-segment parallelism, and as a side benefit cleans up any per-segment fields from being Option<T> to just T.

* Attempt to add MultiCollector back

* working. Chained collector is broken though

* Fix chained collector

* Fix test

* Make Weight Send+Sync for parallelization purposes

* Expose parameters of RangeQuery for external usage

* Removed &mut self

* fixing tests

* Restored TestCollectors

* blop

* multicollector working

* chained collector working

* test broken

* fixing unit test

* blop

* blop

* Blop

* simplifying APi

* blop

* better syntax

* Simplifying top_collector

* refactoring

* blop

* Sync with master

* Added multithread search

* Collector refactoring

* Schema::builder

* CR and rustdoc

* CR comments

* blop

* Added an executor

* Sorted the segment readers in the searcher

* Update searcher.rs

* Fixed unit testst

* changed the place where we have the sort-segment-by-count heuristic

* using crossbeam::channel

* inlining

* Comments about panics propagating

* Added unit test for executor panicking

* Readded default

* Removed Default impl

* Added unit test for executor
2018-11-30 22:46:59 +09:00
Dru Sellers
e75bb1d6a1 Fix NGram processing of non-ascii characters (#430)
* A working version

* optimize the ngram parsing

* Decoding codepoint only once.

* Closes #429

* using leading_zeros to make code less cryptic

* lookup in a table
2018-10-31 08:35:27 +09:00
Paul Masurel
10f6c07c53 Clippy (#422)
* Cargo Format
* Clippy
2018-09-15 20:20:22 +09:00
Paul Masurel
37e4280c0a Cargo Format (#420) 2018-09-15 07:44:22 +09:00
Paul Masurel
dd37e109f2 Merge branch 'issue/368b' 2018-09-11 20:16:14 +09:00
Paul Masurel
63868733a3 Added SnippetGenerator 2018-09-11 09:45:27 +09:00
Paul Masurel
7e5f697d00 Closes #387 2018-09-09 16:23:56 +09:00
Vignesh Sarma K
9ccba9f864 Merge branch 'master' into issue/368 2018-09-07 20:27:38 +05:30
Paul Masurel
c64972e039 Apply unicode lowercasing. (#408)
Checks if the str is ASCII, and uses a fast track if it is the case.
If not, the std's definition of a lowercase character.

Closes #406
2018-09-05 09:43:56 +09:00
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
835cdc2fe8 Initial version of snippet
refer #368
2018-08-28 20:41:41 +05:30
Paul Masurel
ede97eded6 Removed use 2018-08-28 09:54:04 +09:00
Dru Sellers
af593b1116 Add default EN stopwords to the default analyzer (#381)
* Add a default list of en stopwords

* Add the default en stopword filter to the standard tokenizers

* code review feedback
2018-08-22 10:49:39 +09:00
Vignesh Sarma K
09e00f1d42 add position_length to Token (#337)
* add position_length to Token

refer #291

* Add term offset to `PhraseQuery`

ref #291

* Add new constructor for `PhraseQuery` that allows custom offset

* fix the method name as per pr comment

* Closes #291

Added unit test.
Using offsets from the analyzer in QueryParser.
2018-08-13 10:14:50 +09:00
Paul Masurel
811ddf2226 Closes #364 (#365)
* Closes #364

* Trying to raise the recursion limit

* Better unit test and bug fix on token offsets
2018-08-08 11:15:20 +09:00
Paul Masurel
b59132966f Better heap (#311)
* Changed the heap to a paged memory arena.
* Trying to simplify the indexing term hashmap
* Exploding datastruct
* Removed some complexity in bitpacker
2018-06-04 09:39:18 +09:00
Paul Masurel
bc69dab822 cargo fmt 2018-05-18 10:08:05 +09:00
Dru Sellers
82d87416c2 Implement StopWords Filter (#292)
* Implement StopWords Filter

- added example doctest for alphanum_only.rs so that I could
drive my own test of the stopword filter

* Style Cop

* Switch HashSet Hasher to FNV for speed

* Update Change Log

* fix missed location renaming
2018-05-09 18:40:41 -07:00
Paul Masurel
24050d0eb5 Remove some unsafe stuff, justified some of it. 2018-05-07 23:57:53 -07:00
Paul Masurel
9a0b7f9855 Rustfmt 2018-05-07 19:50:35 -07:00
Paul Masurel
99c0b84036 Integrating #274, #280, #289 into master (#290)
* Integrating bugfixes into master

Closes #274
Closes #280
Closes #289

* Next version will be 0.6
2018-05-06 09:48:25 -07:00
Dru Sellers
ca74c14647 Simple Implementation of NGram Tokenizer (#278)
* Simple Implementation of NGram Tokenizer

It does not yet support edges
It could probably be better in many "rusty" ways
But the test is passing, so I'll call this a good stopping point for
the day.

* Remove Ngram from manager. Too many variations

* Basic configuration model

Should the extensive tests exist here?

* Add Sample to provide an End to End testing

* Basic Edgegram support

* cleanup

* code feedback

* More code review feedback processed
2018-05-06 09:47:49 -07:00
Paul Masurel
78673172d0 Cargo fmt 2018-04-21 20:05:36 +09:00
Paul Masurel
e44782bf14 No more 2018-04-12 13:01:11 +09:00
Paul Masurel
0cf274135b Clippy 2018-03-10 13:07:18 +09:00
Paul Masurel
a7ffc0e610 Rustfmt 2018-02-12 10:31:29 +09:00
Paul Masurel
df53dc4ceb Format 2018-02-03 00:21:05 +09:00