Commit Graph

106 Commits

Author SHA1 Message Date
PSeitz
7b31100208 refactor vint (#2010)
- improve performance of vint
vint serialization shows up in performance profiles during indexing.
It would also make sense to limit the value space to u29 and operate on 4 bytes only.
- remove unused code
- add missing inlines
- fix regex test
2023-04-25 08:49:36 +02:00
Paul Masurel
fbda511a1a Making more things public for quickwit. (#2005) 2023-04-20 11:37:45 +09:00
Adam Reichold
c1defdda05 Bump aho-corasick dependency to version 1.0 and adjust to API changes (#2002)
* Drop additional Arc-layer as the automaton itself is now cheap-to-clone.
* Drop state ID type parameter as it is not exposed by the library any more.
2023-04-18 07:34:30 +02:00
François Massot
64bce340b2 Expose to use it in quickwit. 2023-04-13 18:28:53 +02:00
trinity-1686a
064518156f refactor tokenization pipeline to use GATs (#1924)
* refactor tokenization pipeline to use GATs

* fix doctests

* fix clippy lints

* remove commented code
2023-03-09 09:39:37 +01:00
Paul Masurel
8ea97e7d6b Minor refactoring preparing for getting columnar integrated in quickwit. (#1911) 2023-02-27 14:23:30 +09:00
Paul Masurel
097fd6138d Fix clippy comments (#1872) 2023-02-14 23:12:45 +09:00
trinity-1686a
3ac973bea4 fix invalid endianness in documentation (#1845)
* fix doc about term endianness

* rustfmt
2023-02-09 15:36:38 +01:00
Paul Masurel
bd5eea9852 Integrated columnar work. 2023-02-09 13:14:31 +01:00
Paul Masurel
7a8fce0ae7 Minor mini fixes 2023-01-10 14:15:30 +09:00
Michael Kleen
196e42f33e Add regex tokenizer (#1759)
This adds a regex tokenizer which tokenizes the text by using a
regex pattern to split.

Co-authored-by: Michael Kleen <mkleen@gmailw.com>
2023-01-10 13:38:37 +09:00
PSeitz
514d23a20c move tokenizer API to seperate crate (#1767)
closes #1766

Finding tantivy tokenizers is a frustrating experience currently, since
they need be updated for each tantivy version. That's unnecessary since
the API is rather stable anyway.
2023-01-09 06:37:38 +01:00
Adam Reichold
2080c370c2 Enable usage of FuzzyTermQuery for specific fields via QueryParser (#1750)
* Make nightly Clippy mostly happy.

* Document how to produce TermSetQuery queries using QueryParser.

* Enable construction of queries using FuzzyTermQuery via the QueryParser

* Use FxHashMap instead of HashMap in the QueryParser as these hash tables are not exposed to DoS attacks.

* Use a struct instead of a tuple to improve readability.
2023-01-04 18:11:27 +09:00
Adam Reichold
ca6231170e Make the built-in stop word lists selectable via the Language enum already used by the Stemmer filter. (#1671) 2022-11-15 17:40:25 +09:00
Adam Reichold
a4b759d2fe Include stop word lists from Lucene and the Snowball project (#1666) 2022-11-09 16:57:35 +09:00
Adam Reichold
c32ab66bbd Small improvements to StopWorldFilter (#1657)
* Do not copy the whole set of stop words for each stream

* Make construction of StopWordFilter more flexible.
2022-11-01 16:47:34 +09:00
PSeitz
4e46f4f8c4 Merge pull request #1649 from adamreichold/split-compound-words
RFC: Add dictionary-based SplitCompoundWords token filter.
2022-10-27 17:12:48 +08:00
PSeitz
6647362464 Merge pull request #1648 from adamreichold/stemmer-todo-alloc
Avoid unconditional allocation in StemmerTokenStream.
2022-10-27 16:50:41 +08:00
Adam Reichold
cd952429d2 Add dictionary-based SplitCompoundWords token filter. 2022-10-27 08:30:33 +02:00
Adam Reichold
bbb058d976 Replace FNV by rustc-hash
Both construction have similar goals but rustc-hash ist better suited for
contemporary CPU as it works one word at a time instead of byte per byte.
2022-10-27 00:35:09 +02:00
Adam Reichold
5f7d027a52 Avoid unconditional allocation in StemmerTokenStream.
This fixes the TODO in two ways: If the stemmer already yields an owned string,
it is used directly as the new text of the token. Otherwise, a temporary buffer
is used to copy the stemmed text (just as before) and then swapping it into the
token to reuse its existing buffer.
2022-10-26 18:11:15 +02:00
Bruce Mitchener
44e03791f9 Fix warnings when doc'ing private items. (#1579)
This also fixes a couple of typos, but plenty remain!
2022-10-03 14:24:00 +09:00
Bruce Mitchener
cf02e32578 Improvements to doc linking, grammar, etc. 2022-09-19 18:10:22 +07:00
Bruce Mitchener
6a88ac3fe3 Documentation improvements.
Fix some linking, some grammar, some typos, etc.
2022-09-18 18:05:37 +07:00
Kanji Yomoda
af84e74284 Replace deprecated std package's constants on floats and integers (#1420) 2022-07-22 08:05:08 +09:00
Antoine G
11e4225f23 doc fix (#1391)
Documentation fix.
2022-06-21 15:53:33 +09:00
PSeitz
7f45a6ac96 allow setting tokenizer manager on index (#1362)
handle json in tokenizer_for_field
2022-05-09 18:15:45 +09:00
Paul Masurel
46d5de920d Removes all usage of block_on, and use a oneshot channel instead. (#1315)
* Removes all usage of block_on, and use a oneshot channel instead.

Calling `block_on` panics in certain context.
For instance, it panics when it is called in a the context of another
call to block.

Using it in tantivy is unnecessary. We replace it by a thin wrapper
around a oneshot channel that supports both async/sync.

* Removing needless uses of async in the API.

Co-authored-by: PSeitz <PSeitz@users.noreply.github.com>
2022-03-18 16:54:58 +09:00
Paul Masurel
958b2bee08 Clippy comments (#1316) 2022-03-17 18:57:55 +09:00
Paul Masurel
848b795b9f Apply suggestions from code review 2022-03-01 18:37:51 +09:00
Pascal Seitz
091b668624 fix clippy issues 2022-03-01 08:58:51 +01:00
Paul Masurel
d7b46d2137 Added JSON Type (#1270)
- Removed useless copy when ingesting JSON.
- Bugfix in phrase query with a missing field norms.
- Disabled range query on default fields

Closes #1251
2022-02-24 16:25:22 +09:00
Paul Masurel
4dc80cfa25 Removes TokenStream chain. (#1283)
This change is mostly motivated by the introduction of json object.

We need to be able to inject a position object to make the position
shift.
2022-02-21 09:51:27 +09:00
Paul Masurel
2069e3e52b Fixing clippy comments 2022-02-01 10:24:05 +09:00
Paul Masurel
eca6628b3c Minor refactoring (#1266) 2022-01-28 15:55:55 +09:00
Paul Masurel
732f6847c0 Field type with codes (#1255)
* Term are now typed.

This change is backward compatible:
While the Term has a byte representation that is modified, a Term itself
is a transient object that is not serialized as is in the index.

Its .field() and .value_bytes() on the other hand are unchanged.
This change offers better Debug information for terms.

While not necessary it also will help in the support for JSON types.

* Renamed Hierarchical Facet -> Facet
2022-01-07 20:49:00 +09:00
Paul Masurel
3ea6800ac5 Pleasing clippy (#1253) 2022-01-06 16:41:24 +09:00
Tomoko Uchida
74e36c7e97 Add unit tests for tokenizers and filters (#1156)
* add unit test for SimpleTokenizer
* add unit tests for tokenizers and filters.
2021-09-27 10:22:01 +09:00
Tomoko Uchida
dd81e38e53 Add WhitespaceTokenizer (#1147)
* Add WhitespaceTokenizer.
2021-08-29 18:20:49 +09:00
Pascal Seitz
9b3e508753 fix clippy 2021-07-01 18:06:09 +02:00
Pascal Seitz
1e4df54ab3 fix clippy 2021-07-01 17:41:53 +02:00
Paul Masurel
486b8fa9c5 Removing serde-derive dependency (#786) 2020-03-06 23:33:58 +09:00
Paul Masurel
811fd0cb9e Dynamic analyzer (#755)
* Removed generics in tokenizers

* lowercaser

* Added TokenizerExt

* Introducing BoxedTokenizer

* Introducing BoxXXXXX helper struct

* Closes #762.

* Introducing a TextAnalyzer
2020-01-29 18:23:37 +09:00
Christian Hunstad
02af28b3b7 add norwegian stemmer (#717) 2019-11-27 21:08:59 +09:00
Paul Masurel
ef3eddf3da clippy first stab (#711) 2019-11-22 13:09:35 +09:00
kkoziara
0519056bd8 Added handling of pre-tokenized text fields (#642). (#669)
* Added handling of pre-tokenized text fields (#642).

* * Updated changelog and examples concerning #642.
* Added tokenized_text method to Value implementation.
* Implemented From<TokenizedString> for TokenizedStream.

* * Removed tokenized flag from TextOptions and code reliance on the flag.
* Changed naming to use word "pre-tokenized" instead of "tokenized".
* Updated example code.
* Fixed comments.

* Minor code refactoring. Test improvements.
2019-11-07 10:10:56 +09:00
Paul Masurel
5c6580eb15 fmt (#661) 2019-10-04 12:10:01 +09:00
Joshua Dutton
9f74786db2 Update import statements in examples, doctests (#633)
Update import statements to edition 2018, including removing
`extern crate` and  `#[macro_use]`. Alphabetize the statements.
2019-08-19 07:26:35 +09:00
Paul Masurel
039c0a0863 Introducing a wrapper struct instead of Boxed<BoxableTokenizer> (#631)
Closes #629
2019-08-15 16:37:04 +09:00
Paul Masurel
498057c5b7 Refactor deletes (#597)
* Refactor deletes

* Removing generation from SegmentUpdater. These have been obsolete for a long time

* Number literal clippy

* Removed clippy useless allow statement
2019-07-17 13:06:44 +09:00