Compare commits


26 Commits

Author SHA1 Message Date
Paul Masurel
784717749f Removing unused imports. 2021-02-05 23:04:17 +09:00
Paul Masurel
945bcc5bd3 Bump tantivy-grammar version 2021-02-05 22:58:21 +09:00
Paul Masurel
51aa9c319e Bumped version to 0.14 2021-02-05 22:55:26 +09:00
Paul Masurel
74d8d2946b Merge pull request #980 from lengyijun/patch-8
Update segment_postings.rs
2021-02-05 22:52:29 +09:00
lyj
0a160cc16e Update segment_postings.rs 2021-02-05 21:32:25 +08:00
Paul Masurel
f099f97daa Merge pull request #979 from slckl/main
FacetCounts are now pub use in tantivy::collector (Closes #978)
2021-02-05 17:05:20 +09:00
alif
769e9ba14d added simple docs to FacetCounts now-public API 2021-02-05 09:18:20 +02:00
alif
a482c0e966 pub use FacetCounts in tantivy::collector module 2021-02-05 09:00:48 +02:00
Paul Masurel
86d92a72e7 Renaming MultiValueIntFastField* to MultiValuedIntFastField* 2021-01-21 22:47:00 +09:00
Paul Masurel
ef618a5999 Made fast field reader clonable. 2021-01-21 22:15:24 +09:00
Paul Masurel
94d3d7a89a Rename FastFieldReaders::load_all 2021-01-21 18:38:48 +09:00
Paul Masurel
aa9e79f957 Clippy warnings. 2021-01-21 18:23:20 +09:00
Paul Masurel
84a2f534db Merge pull request #976 from tantivy-search/issue/fastfield_no_load
Fast fields are not loaded on the opening of a segment.
2021-01-21 18:14:55 +09:00
Paul Masurel
1b4be24dca Fast fields are not loaded on the opening of a segment.
They are instead loaded lazily when they are requested.
2021-01-21 18:13:08 +09:00
Paul Masurel
824ccc37ae Merge pull request #975 from jamescorbett/patch-1
Change from serde::export to std::marker
2021-01-12 10:04:23 +09:00
Paul Masurel
5231651020 Closes #974 2021-01-12 10:03:37 +09:00
James Corbett
fa2c6f80c7 Change from serde::export to std::marker
For some reason, under a Docker build only, I get a build error saying that `serde::export` is private. This fixes it for me.

```
error[E0603]: module `export` is private
   --> /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/tantivy-0.13.2/src/collector/top_collector.rs:5:12
    |
5   | use serde::export::PhantomData;
    |            ^^^^^^ private module
    |
note: the module `export` is defined here
   --> /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/serde-1.0.119/src/lib.rs:275:5
    |
275 | use self::__private as export;
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^
```
2021-01-12 00:25:54 +00:00
Paul Masurel
43c7b3bfec Bugfix in the RAMDirectory.
There was a state where the meta.json was empty.
2021-01-11 14:11:42 +09:00
Paul Masurel
b17a10546a Minor change in unit test. 2021-01-11 11:33:59 +09:00
Paul Masurel
bf6e6e8a7c Merge pull request #972 from tantivy-search/issue/969
Issue/969
2021-01-07 22:49:31 +09:00
Paul Masurel
203b0256a3 Minor renaming 2021-01-07 22:47:57 +09:00
Paul Masurel
caf2a38b7e Closes #969.
The segment stacking optimization was not updating "first_doc_in_block".
2021-01-07 22:43:56 +09:00
Paul Masurel
96f24b078e Added failing unit test. 2021-01-07 22:43:28 +09:00
Paul Masurel
332b50a4eb Merge pull request #970 from tantivy-search/functional-test-store
Added a functional long running test to test store merging.
2021-01-07 14:27:08 +09:00
Paul Masurel
8ca0954b3b Added a functional long running test to test store merging. 2021-01-07 14:07:15 +09:00
Paul Masurel
36343e2de8 Merge pull request #968 from tantivy-search/add-bench-analyzer
added a simple bench for the default analyzer
2021-01-06 21:33:39 +09:00
56 changed files with 1426 additions and 1160 deletions

View File

@@ -1,22 +1,23 @@
Tantivy 0.14.0
=========================
-- Remove dependency to atomicwrites #833. (Implemented by @pmasurel upon suggestion and research from @asafigan).
+- Remove dependency to atomicwrites #833. (Implemented by @fulmicoton upon suggestion and research from @asafigan).
- Migrated tantivy error from the now deprecated `failure` crate to `thiserror` #760. (@hirevo)
- API Change. Accessing the typed value off a `Schema::Value` now returns an Option instead of panicking if the type does not match.
- Large API Change in the Directory API. Tantivy used to assume that all files could be somehow memory mapped. After this change, Directory returns a `FileSlice` that can be reduced and eventually read into an `OwnedBytes` object. Long and blocking IO operations are still required, but they do not span over the entire file.
- Added support for Brotli compression in the DocStore. (@ppodolsky)
- Added helper for building intersections and unions in BooleanQuery (@guilload)
- Bugfix in `Query::explain`
- Removed dependency on `notify` #924. Replaced with `FileWatcher` struct that polls meta file every 500ms in background thread. (@halvorboe @guilload)
- Added `FilterCollector`, which wraps another collector and filters docs using a predicate over a fast field (@barrotsteindev)
-- Simplified the encoding of the skip reader struct. BlockWAND max tf is now encoded over a single byte. (@pmasurel)
+- Simplified the encoding of the skip reader struct. BlockWAND max tf is now encoded over a single byte. (@fulmicoton)
- `FilterCollector` now supports all Fast Field value types (@barrotsteindev)
+- FastField are not all loaded when opening the segment reader. (@fulmicoton)
This version breaks compatibility and requires users to reindex everything.
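The `FileSlice` entry above is the headline API break of this release. A rough sketch of the read path it describes, with `directory` and `path` as placeholders and 0.14 signatures assumed rather than taken from this diff:

```rust
// open_read hands back a cheap FileSlice handle; no bytes are read yet.
let file_slice = directory.open_read(path)?;
// The long/blocking read happens on demand and yields an OwnedBytes object,
// so it no longer has to span the entire file.
let bytes: OwnedBytes = file_slice.read_bytes()?;
```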
Tantivy 0.13.2
===================
Bugfix. Acquiring a facet reader on a segment that does not contain any
doc with this facet returns `None`. (#896)
Tantivy 0.13.1
@@ -27,7 +28,7 @@ Updated misc dependency versions.
Tantivy 0.13.0
======================
Tantivy 0.13 introduces a change in the index format that will require
you to reindex your index (BlockWAND information is added in the skiplist).
The index size increase is minor as this information is only added for
full blocks.
If you have a massive index for which reindexing is not an option, please contact me
@@ -36,7 +37,7 @@ so that we can discuss possible solutions.
- Bugfix in `FuzzyTermQuery` not matching terms by prefix when it should (@Peachball)
- Relaxed constraints on the custom/tweak score functions. At the segment level, they can be mut, and they are not required to be Sync + Send.
- `MMapDirectory::open` does not return a `Result` anymore.
- Change in the DocSet and Scorer API. (@fulmicoton).
A freshly created DocSet point directly to their first doc. A sentinel value called TERMINATED marks the end of a DocSet.
`.advance()` returns the new DocId. `Scorer::skip(target)` has been replaced by `Scorer::seek(target)` and returns the resulting DocId.
As a result, iterating through DocSet now looks as follows
@@ -50,7 +51,7 @@ while doc != TERMINATED {
The change made it possible to greatly simplify a lot of the docset's code.
- Misc internal optimization and introduction of the `Scorer::for_each_pruning` function. (@fulmicoton)
- Added an offset option to the Top(.*)Collectors. (@robyoung)
- Added Block WAND. Performance on TOP-K on term-unions should be greatly increased. (@fulmicoton, and special thanks
to the PISA team for answering all my questions!)
Tantivy 0.12.0
@@ -58,14 +59,14 @@ Tantivy 0.12.0
- Removing static dispatch in tokenizers for simplicity. (#762)
- Added backward iteration for `TermDictionary` stream. (@halvorboe)
- Fixed a performance issue when searching for the posting lists of a missing term (@audunhalland)
- Added a configurable maximum number of docs (10M by default) for a segment to be considered for merge (@hntd187, landed by @halvorboe #713)
- Important Bugfix #777, causing tantivy to retain memory mapping. (diagnosed by @poljar)
- Added support for field boosting. (#547, @fulmicoton)
## How to update?
-Crates relying on custom tokenizer, or registering tokenizer in the manager will require some
-minor changes. Check https://github.com/tantivy-search/tantivy/blob/master/examples/custom_tokenizer.rs
+Crates relying on custom tokenizer, or registering tokenizer in the manager will require some
+minor changes. Check https://github.com/tantivy-search/tantivy/blob/main/examples/custom_tokenizer.rs
to check for some code sample.
Tantivy 0.11.3
@@ -101,7 +102,7 @@ Tantivy 0.11.0
## How to update?
- The index format is changed. You are required to reindex your data to use tantivy 0.11.
- `Box<dyn BoxableTokenizer>` has been replaced by a `BoxedTokenizer` struct.
- Regex are now compiled when the `RegexQuery` instance is built. As a result, it can now return
an error and handling the `Result` is required.
@@ -125,26 +126,26 @@ Tantivy 0.10.0
*Tantivy 0.10.0 index format is compatible with the index format in 0.9.0.*
-- Added an API to easily tweak or entirely replace the
-default score. See `TopDocs::tweak_score`and `TopScore::custom_score` (@pmasurel)
+- Added an API to easily tweak or entirely replace the
+default score. See `TopDocs::tweak_score`and `TopScore::custom_score` (@fulmicoton)
- Added an ASCII folding filter (@drusellers)
-- Bugfix in `query.count` in presence of deletes (@pmasurel)
-- Added `.explain(...)` in `Query` and `Weight` to (@pmasurel)
-- Added an efficient way to `delete_all_documents` in `IndexWriter` (@petr-tik).
+- Bugfix in `query.count` in presence of deletes (@fulmicoton)
+- Added `.explain(...)` in `Query` and `Weight` to (@fulmicoton)
+- Added an efficient way to `delete_all_documents` in `IndexWriter` (@petr-tik).
All segments are simply removed.
Minor
---------
- Switched to Rust 2018 (@uvd)
- Small simplification of the code.
Calling .freq() or .doc() when .advance() has never been called
on segment postings should panic from now on.
- Tokens exceeding `u16::max_value() - 4` chars are discarded silently instead of panicking.
- Fast fields are now preloaded when the `SegmentReader` is created.
- `IndexMeta` is now public. (@hntd187)
- `IndexWriter` `add_document`, `delete_term`. `IndexWriter` is `Sync`, making it possible to use it with a `
-Arc<RwLock<IndexWriter>>`. `add_document` and `delete_term` can
-only require a read lock. (@pmasurel)
+Arc<RwLock<IndexWriter>>`. `add_document` and `delete_term` can
+only require a read lock. (@fulmicoton)
- Introducing `Opstamp` as an expressive type alias for `u64`. (@petr-tik)
- Stamper now relies on `AtomicU64` on all platforms (@petr-tik)
- Bugfix - Files get deleted slightly earlier
@@ -158,7 +159,7 @@ Your program should be usable as is.
Fast fields used to be accessed directly from the `SegmentReader`.
The API changed, you are now required to acquire your fast field reader via the
`segment_reader.fast_fields()`, and use one of the typed method:
- `.u64()`, `.i64()` if your field is single-valued ;
- `.u64s()`, `.i64s()` if your field is multi-valued ;
- `.bytes()` if your field is bytes fast field.
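For illustration, a hedged sketch of that accessor pattern (field and doc id are placeholders; in 0.10 the typed accessors returned an `Option`):

```rust
let fast_fields = segment_reader.fast_fields();
let popularity_reader = fast_fields.u64(popularity_field).unwrap(); // Option in 0.10
let popularity: u64 = popularity_reader.get(doc_id);
```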
@@ -167,16 +168,16 @@ The API changed, you are now required to acquire your fast field reader via the
Tantivy 0.9.0
=====================
*0.9.0 index format is not compatible with the
previous index format.*
- MAJOR BUGFIX :
Some `Mmap` objects were being leaked, and would never get released. (@fulmicoton)
- Removed most unsafe (@fulmicoton)
- Indexer memory footprint improved. (VInt comp, inlining the first block. (@fulmicoton)
- Stemming in other language possible (@pentlander)
- Segments with no docs are deleted earlier (@barrotsteindev)
- Added grouped add and delete operations.
They are guaranteed to happen together (i.e. they cannot be split by a commit).
In addition, adds are guaranteed to happen on the same segment. (@elbow-jason)
- Removed `INT_STORED` and `INT_INDEXED`. It is now possible to use `STORED` and `INDEXED`
for int fields. (@fulmicoton)
@@ -190,26 +191,26 @@ tantivy 0.9 brought some API breaking change.
To update from tantivy 0.8, you will need to go through the following steps.
- `schema::INT_INDEXED` and `schema::INT_STORED` should be replaced by `schema::INDEXED` and `schema::STORED`.
- The index now does not hold the pool of searcher anymore. You are required to create an intermediary object called
`IndexReader` for this.
```rust
// create the reader. You typically need to create 1 reader for the entire
// lifetime of your program.
let reader = index.reader()?;
// Acquire a searcher (previously `index.searcher()`) is now written:
let searcher = reader.searcher();
// With the default setting of the reader, you are not required to
// call `index.load_searchers()` anymore.
//
// The IndexReader will pick up that change automatically, regardless
// of whether the update was done in a different process or not.
// If this behavior is not wanted, you can create your reader with
// the `ReloadPolicy::Manual`, and manually decide when to reload the index
// by calling `reader.reload()?`.
```
@@ -224,7 +225,7 @@ Tantivy 0.8.1
=====================
Hotfix of #476.
Merge was reflecting deletes before commit was passed.
Thanks @barrotsteindev for reporting the bug.
@@ -232,7 +233,7 @@ Tantivy 0.8.0
=====================
*No change in the index format*
- API Breaking change in the collector API. (@jwolfe, @fulmicoton)
- Multithreaded search (@jwolfe, @fulmicoton)
Tantivy 0.7.1
@@ -260,7 +261,7 @@ Tantivy 0.6.1
- Exclusive `field:{startExcl to endExcl}`
- Mixed `field:[startIncl to endExcl}` and vice versa
- Unbounded `field:[start to *]`, `field:[* to end]`
Tantivy 0.6
==========================
@@ -268,10 +269,10 @@ Tantivy 0.6
Special thanks to @drusellers and @jason-wolfe for their contributions
to this release!
-- Removed C code. Tantivy is now pure Rust. (@pmasurel)
-- BM25 (@pmasurel)
-- Approximate field norms encoded over 1 byte. (@pmasurel)
-- Compiles on stable rust (@pmasurel)
+- Removed C code. Tantivy is now pure Rust. (@fulmicoton)
+- BM25 (@fulmicoton)
+- Approximate field norms encoded over 1 byte. (@fulmicoton)
+- Compiles on stable rust (@fulmicoton)
- Add &[u8] fastfield for associating arbitrary bytes to each document (@jason-wolfe) (#270)
- Completely uncompressed
- Internally: One u64 fast field for indexes, one fast field for the bytes themselves.
@@ -279,7 +280,7 @@ to this release!
- Add Stopword Filter support (@drusellers)
- Add a FuzzyTermQuery (@drusellers)
- Add a RegexQuery (@drusellers)
-- Various performance improvements (@pmasurel)
+- Various performance improvements (@fulmicoton)
Tantivy 0.5.2

View File

@@ -1,6 +1,6 @@
[package]
name = "tantivy"
-version = "0.14.0-dev"
+version = "0.14.0"
authors = ["Paul Masurel <paul.masurel@gmail.com>"]
license = "MIT"
categories = ["database-implementations", "data-structures"]
@@ -33,7 +33,7 @@ levenshtein_automata = "0.2"
uuid = { version = "0.8", features = ["v4", "serde"] }
crossbeam = "0.8"
futures = {version = "0.3", features=["thread-pool"] }
-tantivy-query-grammar = { version="0.14.0-dev", path="./query-grammar" }
+tantivy-query-grammar = { version="0.14.0", path="./query-grammar" }
stable_deref_trait = "1"
rust-stemmers = "1"
downcast-rs = "1"

View File

@@ -1,9 +1,9 @@
-[![Build Status](https://travis-ci.org/tantivy-search/tantivy.svg?branch=master)](https://travis-ci.org/tantivy-search/tantivy)
-[![codecov](https://codecov.io/gh/tantivy-search/tantivy/branch/master/graph/badge.svg)](https://codecov.io/gh/tantivy-search/tantivy)
+[![Build Status](https://travis-ci.org/tantivy-search/tantivy.svg?branch=main)](https://travis-ci.org/tantivy-search/tantivy)
+[![codecov](https://codecov.io/gh/tantivy-search/tantivy/branch/main/graph/badge.svg)](https://codecov.io/gh/tantivy-search/tantivy)
[![Join the chat at https://gitter.im/tantivy-search/tantivy](https://badges.gitter.im/tantivy-search/tantivy.svg)](https://gitter.im/tantivy-search/tantivy?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
-[![Build status](https://ci.appveyor.com/api/projects/status/r7nb13kj23u8m9pj/branch/master?svg=true)](https://ci.appveyor.com/project/fulmicoton/tantivy/branch/master)
+[![Build status](https://ci.appveyor.com/api/projects/status/r7nb13kj23u8m9pj/branch/main?svg=true)](https://ci.appveyor.com/project/fulmicoton/tantivy/branch/main)
[![Crates.io](https://img.shields.io/crates/v/tantivy.svg)](https://crates.io/crates/tantivy)
![Tantivy](https://tantivy-search.github.io/logo/tantivy-logo.png)

View File

@@ -10,7 +10,7 @@ pub fn criterion_benchmark(c: &mut Criterion) {
b.iter(|| {
let mut word_count = 0;
let mut token_stream = tokenizer.token_stream(ALICE_TXT);
-for token in token_stream {
+while token_stream.advance() {
word_count += 1;
}
assert_eq!(word_count, 30_731);
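The hunk above reflects the `TokenStream` API on the new side of this compare: streams are pulled with `advance()` rather than iterated. A hedged sketch of the full pattern (the tokenizer value is assumed):

```rust
let mut token_stream = tokenizer.token_stream(ALICE_TXT);
while token_stream.advance() {
    let token = token_stream.token(); // borrow the current token
    // ... inspect token.text, token.offset_from, token.offset_to ...
}
```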

View File

@@ -14,7 +14,7 @@ use tantivy::fastfield::FastFieldReader;
use tantivy::query::QueryParser;
use tantivy::schema::Field;
use tantivy::schema::{Schema, FAST, INDEXED, TEXT};
-use tantivy::{doc, Index, Score, SegmentReader, TantivyError};
+use tantivy::{doc, Index, Score, SegmentReader};
#[derive(Default)]
struct Stats {
@@ -72,16 +72,7 @@ impl Collector for StatsCollector {
_segment_local_id: u32,
segment_reader: &SegmentReader,
) -> tantivy::Result<StatsSegmentCollector> {
-let fast_field_reader = segment_reader
-.fast_fields()
-.u64(self.field)
-.ok_or_else(|| {
-let field_name = segment_reader.schema().get_field_name(self.field);
-TantivyError::SchemaError(format!(
-"Field {:?} is not a u64 fast field.",
-field_name
-))
-})?;
+let fast_field_reader = segment_reader.fast_fields().u64(self.field)?;
Ok(StatsSegmentCollector {
fast_field_reader,
stats: Stats::default(),

View File

@@ -17,7 +17,12 @@ use tantivy::{doc, Index, ReloadPolicy};
use tempfile::TempDir;
fn pre_tokenize_text(text: &str) -> Vec<Token> {
-SimpleTokenizer.token_stream(text).collect()
+let mut token_stream = SimpleTokenizer.token_stream(text);
+let mut tokens = vec![];
+while token_stream.advance() {
+tokens.push(token_stream.token().clone());
+}
+tokens
}
fn main() -> tantivy::Result<()> {

View File

@@ -51,7 +51,7 @@ fn main() -> tantivy::Result<()> {
let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
-let mut snippet_generator = SnippetGenerator::create(&searcher, &*query, body)?;
+let snippet_generator = SnippetGenerator::create(&searcher, &*query, body)?;
for (score, doc_address) in top_docs {
let doc = searcher.doc(doc_address)?;

View File

@@ -50,13 +50,12 @@ fn main() -> tantivy::Result<()> {
// This tokenizer lowers all of the text (to help with stop word matching)
// then removes all instances of `the` and `and` from the corpus
-let tokenizer = analyzer_builder(SimpleTokenizer)
-.filter(LowerCaser::new())
+let tokenizer = TextAnalyzer::from(SimpleTokenizer)
+.filter(LowerCaser)
.filter(StopWordFilter::remove(vec![
"the".to_string(),
"and".to_string(),
-]))
-.build();
+]));
index.tokenizers().register("stoppy", tokenizer);

View File

@@ -1,6 +1,6 @@
[package]
name = "tantivy-query-grammar"
-version = "0.14.0-dev"
+version = "0.14.0"
authors = ["Paul Masurel <paul.masurel@gmail.com>"]
license = "MIT"
categories = ["database-implementations", "data-structures"]

View File

@@ -398,6 +398,8 @@ impl<'a> Iterator for FacetChildIterator<'a> {
}
impl FacetCounts {
+/// Returns an iterator over all of the facet count pairs inside this result.
+/// See the documentation for `FacetCollector` for a usage example.
pub fn get<T>(&self, facet_from: T) -> FacetChildIterator<'_>
where
Facet: From<T>,
@@ -417,6 +419,8 @@ impl FacetCounts {
FacetChildIterator { underlying }
}
+/// Returns a vector of top `k` facets with their counts, sorted highest-to-lowest by counts.
+/// See the documentation for `FacetCollector` for a usage example.
pub fn top_k<T>(&self, facet: T, k: usize) -> Vec<(&Facet, u64)>
where
Facet: From<T>,
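Since #979 re-exports `FacetCounts` from `tantivy::collector`, here is a hypothetical usage of the two methods documented above (the field, query, and searcher are assumptions):

```rust
use tantivy::collector::{FacetCollector, FacetCounts};

let mut facet_collector = FacetCollector::for_field(category_field);
facet_collector.add_facet("/category");
let facet_counts: FacetCounts = searcher.search(&query, &facet_collector)?;
// Iterate over the (facet, count) pairs under a root facet...
for (facet, count) in facet_counts.get("/category") {
    println!("{}: {}", facet, count);
}
// ...or ask directly for the k most frequent ones.
let top5 = facet_counts.top_k("/category", 5);
```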

View File

@@ -124,13 +124,7 @@ where
let fast_field_reader = segment_reader
.fast_fields()
-.typed_fast_field_reader(self.field)
-.ok_or_else(|| {
-TantivyError::SchemaError(format!(
-"{:?} is not declared as a fast field in the schema.",
-self.field
-))
-})?;
+.typed_fast_field_reader(self.field)?;
let segment_collector = self
.collector

View File

@@ -109,6 +109,7 @@ pub use self::tweak_score_top_collector::{ScoreSegmentTweaker, ScoreTweaker};
mod facet_collector;
pub use self::facet_collector::FacetCollector;
+pub use self::facet_collector::FacetCounts;
use crate::query::Weight;
mod docset_collector;

View File

@@ -240,12 +240,7 @@ impl Collector for BytesFastFieldTestCollector {
_segment_local_id: u32,
segment_reader: &SegmentReader,
) -> crate::Result<BytesFastFieldSegmentCollector> {
-let reader = segment_reader
-.fast_fields()
-.bytes(self.field)
-.ok_or_else(|| {
-crate::TantivyError::InvalidArgument("Field is not a bytes fast field.".to_string())
-})?;
+let reader = segment_reader.fast_fields().bytes(self.field)?;
Ok(BytesFastFieldSegmentCollector {
vals: Vec::new(),
reader,

View File

@@ -2,9 +2,9 @@ use crate::DocAddress;
use crate::DocId;
use crate::SegmentLocalId;
use crate::SegmentReader;
-use serde::export::PhantomData;
use std::cmp::Ordering;
use std::collections::BinaryHeap;
+use std::marker::PhantomData;
/// Contains a feature (field, score, etc.) of a document along with the document address.
///

View File

@@ -146,15 +146,14 @@ impl CustomScorer<u64> for ScorerByField {
type Child = ScorerByFastFieldReader;
fn segment_scorer(&self, segment_reader: &SegmentReader) -> crate::Result<Self::Child> {
-let ff_reader = segment_reader
+// We interpret this field as u64, regardless of its type, that way,
+// we avoid needless conversion. Regardless of the fast field type, the
+// mapping is monotonic, so it is sufficient to compute our top-K docs.
+//
+// The conversion will then happen only on the top-K docs.
+let ff_reader: FastFieldReader<u64> = segment_reader
.fast_fields()
-.u64_lenient(self.field)
-.ok_or_else(|| {
-crate::TantivyError::SchemaError(format!(
-"Field requested ({:?}) is not a fast field.",
-self.field
-))
-})?;
+.typed_fast_field_reader(self.field)?;
Ok(ScorerByFastFieldReader { ff_reader })
}
}
@@ -232,7 +231,7 @@ impl TopDocs {
/// # let title = schema_builder.add_text_field("title", TEXT);
/// # let rating = schema_builder.add_u64_field("rating", FAST);
/// # let schema = schema_builder.build();
/// #
/// # let index = Index::create_in_ram(schema);
/// # let mut index_writer = index.writer_with_num_threads(1, 10_000_000)?;
/// # index_writer.add_document(doc!(title => "The Name of the Wind", rating => 92u64));
@@ -262,7 +261,7 @@ impl TopDocs {
/// let top_books_by_rating = TopDocs
/// ::with_limit(10)
/// .order_by_u64_field(rating_field);
///
/// // ... and here are our documents. Note this is a simple vec.
/// // The `u64` in the pair is the value of our fast field for
/// // each documents.
@@ -272,13 +271,13 @@ impl TopDocs {
/// // query.
/// let resulting_docs: Vec<(u64, DocAddress)> =
/// searcher.search(query, &top_books_by_rating)?;
///
/// Ok(resulting_docs)
/// }
/// ```
///
/// # See also
///
/// To confortably work with `u64`s, `i64`s, `f64`s, or `date`s, please refer to
/// [.order_by_fast_field(...)](#method.order_by_fast_field) method.
pub fn order_by_u64_field(
@@ -290,7 +289,7 @@ impl TopDocs {
/// Set top-K to rank documents by a given fast field.
///
/// If the field is not a fast field, or its field type does not match the generic type, this method does not panic,
/// but an explicit error will be returned at the moment of collection.
///
/// Note that this method is a generic. The requested fast field type will be often
@@ -314,7 +313,7 @@ impl TopDocs {
/// # let title = schema_builder.add_text_field("company", TEXT);
/// # let rating = schema_builder.add_i64_field("revenue", FAST);
/// # let schema = schema_builder.build();
/// #
/// # let index = Index::create_in_ram(schema);
/// # let mut index_writer = index.writer_with_num_threads(1, 10_000_000)?;
/// # index_writer.add_document(doc!(title => "MadCow Inc.", rating => 92_000_000i64));
@@ -343,7 +342,7 @@ impl TopDocs {
/// let top_company_by_revenue = TopDocs
/// ::with_limit(2)
/// .order_by_fast_field(revenue_field);
///
/// // ... and here are our documents. Note this is a simple vec.
/// // The `i64` in the pair is the value of our fast field for
/// // each documents.
@@ -353,7 +352,7 @@ impl TopDocs {
/// // query.
/// let resulting_docs: Vec<(i64, DocAddress)> =
/// searcher.search(query, &top_company_by_revenue)?;
///
/// Ok(resulting_docs)
/// }
/// ```
@@ -392,7 +391,7 @@ impl TopDocs {
///
/// In the following example will will tweak our ranking a bit by
/// boosting popular products a notch.
///
/// In more serious application, this tweaking could involved running a
/// learning-to-rank model over various features
///
@@ -523,7 +522,7 @@ impl TopDocs {
/// # let index = Index::create_in_ram(schema);
/// # let mut index_writer = index.writer_with_num_threads(1, 10_000_000)?;
/// # let product_name = index.schema().get_field("product_name").unwrap();
/// #
/// let popularity: Field = index.schema().get_field("popularity").unwrap();
/// let boosted: Field = index.schema().get_field("boosted").unwrap();
/// # index_writer.add_document(doc!(boosted=>1u64, product_name => "The Diary of Muadib", popularity => 1u64));
@@ -557,7 +556,7 @@ impl TopDocs {
/// segment_reader.fast_fields().u64(popularity).unwrap();
/// let boosted_reader =
/// segment_reader.fast_fields().u64(boosted).unwrap();
///
/// // We can now define our actual scoring function
/// move |doc: DocId| {
/// let popularity: u64 = popularity_reader.get(doc);
@@ -994,9 +993,7 @@ mod tests {
let segment = searcher.segment_reader(0);
let top_collector = TopDocs::with_limit(4).order_by_u64_field(size);
let err = top_collector.for_segment(0, segment).err().unwrap();
-assert!(
-matches!(err, crate::TantivyError::SchemaError(msg) if msg == "Field requested (Field(0)) is not a fast field.")
-);
+assert!(matches!(err, crate::TantivyError::SchemaError(_)));
Ok(())
}

View File

@@ -20,7 +20,7 @@ use crate::reader::IndexReaderBuilder;
use crate::schema::Field;
use crate::schema::FieldType;
use crate::schema::Schema;
-use crate::tokenizer::{TextAnalyzerT, TokenizerManager};
+use crate::tokenizer::{TextAnalyzer, TokenizerManager};
use crate::IndexWriter;
use std::collections::HashSet;
use std::fmt;
@@ -35,12 +35,21 @@ fn load_metas(
inventory: &SegmentMetaInventory,
) -> crate::Result<IndexMeta> {
let meta_data = directory.atomic_read(&META_FILEPATH)?;
-let meta_string = String::from_utf8_lossy(&meta_data);
+let meta_string = String::from_utf8(meta_data).map_err(|_utf8_err| {
+error!("Meta data is not valid utf8.");
+DataCorruption::new(
+META_FILEPATH.to_path_buf(),
+"Meta file does not contain valid utf8 file.".to_string(),
+)
+})?;
IndexMeta::deserialize(&meta_string, &inventory)
.map_err(|e| {
DataCorruption::new(
META_FILEPATH.to_path_buf(),
-format!("Meta file cannot be deserialized. {:?}.", e),
+format!(
+"Meta file cannot be deserialized. {:?}. Content: {:?}",
+e, meta_string
+),
)
})
.map_err(From::from)
@@ -119,12 +128,13 @@ impl Index {
return Index::create(dir, schema);
}
let index = Index::open(dir)?;
-if index.schema() != schema {
-return Err(TantivyError::SchemaError(
+if index.schema() == schema {
+Ok(index)
+} else {
+Err(TantivyError::SchemaError(
"An index exists but the schema does not match.".to_string(),
-));
+))
}
-Ok(index)
}
/// Creates a new index in a temp directory.
@@ -180,11 +190,11 @@ impl Index {
}
/// Helper to access the tokenizer associated to a specific field.
-pub fn tokenizer_for_field(&self, field: Field) -> crate::Result<Box<dyn TextAnalyzerT>> {
+pub fn tokenizer_for_field(&self, field: Field) -> crate::Result<TextAnalyzer> {
let field_entry = self.schema.get_field_entry(field);
let field_type = field_entry.field_type();
let tokenizer_manager: &TokenizerManager = self.tokenizers();
-let tokenizer_name_opt: Option<Box<dyn TextAnalyzerT>> = match field_type {
+let tokenizer_name_opt: Option<TextAnalyzer> = match field_type {
FieldType::Str(text_options) => text_options
.get_indexing_options()
.map(|text_indexing_options| text_indexing_options.tokenizer().to_string())
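The schema check refactored earlier in this file backs `Index::open_or_create`. A hypothetical call site (directory path and schema are assumptions):

```rust
// Opens the existing index, creates it if absent, and returns an error
// if an index exists whose schema does not match.
let index = Index::open_or_create(MmapDirectory::open(index_path)?, schema.clone())?;
```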

View File

@@ -114,12 +114,7 @@ impl SegmentReader {
field_entry.name()
)));
}
-let term_ords_reader = self.fast_fields().u64s(field).ok_or_else(|| {
-DataCorruption::comment_only(format!(
-"Cannot find data for hierarchical facet {:?}",
-field_entry.name()
-))
-})?;
+let term_ords_reader = self.fast_fields().u64s(field)?;
let termdict = self
.termdict_composite
.open_read(field)
@@ -183,8 +178,10 @@ impl SegmentReader {
let fast_fields_data = segment.open_read(SegmentComponent::FASTFIELDS)?;
let fast_fields_composite = CompositeFile::open(&fast_fields_data)?;
-let fast_field_readers =
-Arc::new(FastFieldReaders::load_all(&schema, &fast_fields_composite)?);
+let fast_field_readers = Arc::new(FastFieldReaders::new(
+schema.clone(),
+fast_fields_composite,
+)?);
let fieldnorm_data = segment.open_read(SegmentComponent::FIELDNORMS)?;
let fieldnorm_readers = FieldNormReaders::open(fieldnorm_data)?;

View File

@@ -226,13 +226,9 @@ impl Directory for RAMDirectory {
)));
let path_buf = PathBuf::from(path);
// Reserve the path to prevent calls to .write() to succeed.
-self.fs.write().unwrap().write(path_buf.clone(), &[]);
+self.fs.write().unwrap().write(path_buf, data);
-let mut vec_writer = VecWriter::new(path_buf, self.clone());
-vec_writer.write_all(data)?;
-vec_writer.flush()?;
-if path == Path::new(&*META_FILEPATH) {
+if path == *META_FILEPATH {
let _ = self.fs.write().unwrap().watch_router.broadcast();
}
Ok(())
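For context on commit 43c7b3bfec, a paraphrased before/after of `atomic_write` (not the exact code):

```rust
// Before: the path was first reserved with empty content and filled through a
// VecWriter, so a concurrent atomic_read could observe an empty meta.json.
//     fs.write(path_buf.clone(), &[]);   // file exists, but is empty
//     vec_writer.write_all(data)?;       // payload arrives later
// After: the payload is inserted in a single write, so readers never see the
// empty intermediate state.
//     fs.write(path_buf, data);
```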

View File

@@ -1,4 +1,4 @@
-use super::MultiValueIntFastFieldReader;
+use super::MultiValuedFastFieldReader;
use crate::error::DataCorruption;
use crate::schema::Facet;
use crate::termdict::TermDictionary;
@@ -20,7 +20,7 @@ use std::str;
/// list of facets. This ordinal is segment local and
/// only makes sense for a given segment.
pub struct FacetReader {
-term_ords: MultiValueIntFastFieldReader<u64>,
+term_ords: MultiValuedFastFieldReader<u64>,
term_dict: TermDictionary,
buffer: Vec<u8>,
}
@@ -29,12 +29,12 @@ impl FacetReader {
/// Creates a new `FacetReader`.
///
/// A facet reader just wraps :
-/// - a `MultiValueIntFastFieldReader` that makes it possible to
+/// - a `MultiValuedFastFieldReader` that makes it possible to
/// access the list of facet ords for a given document.
/// - a `TermDictionary` that helps associating a facet to
/// an ordinal and vice versa.
pub fn new(
-term_ords: MultiValueIntFastFieldReader<u64>,
+term_ords: MultiValuedFastFieldReader<u64>,
term_dict: TermDictionary,
) -> FacetReader {
FacetReader {
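A hypothetical sketch of reading through this wrapper (field name assumed; `facet_reader` returns a `crate::Result` after this change):

```rust
let mut facet_reader = segment_reader.facet_reader(category_field)?;
let mut facet_ords: Vec<u64> = Vec::new();
// Segment-local ordinals of all facets of one document; the wrapped
// TermDictionary can map each ordinal back to a Facet.
facet_reader.facet_ords(doc_id, &mut facet_ords);
```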

View File

@@ -28,7 +28,7 @@ pub use self::delete::write_delete_bitset;
pub use self::delete::DeleteBitSet;
pub use self::error::{FastFieldNotAvailableError, Result};
pub use self::facet_reader::FacetReader;
-pub use self::multivalued::{MultiValueIntFastFieldReader, MultiValueIntFastFieldWriter};
+pub use self::multivalued::{MultiValuedFastFieldReader, MultiValuedFastFieldWriter};
pub use self::reader::FastFieldReader;
pub use self::readers::FastFieldReaders;
pub use self::serializer::FastFieldSerializer;

View File

@@ -1,8 +1,8 @@
mod reader;
mod writer;
-pub use self::reader::MultiValueIntFastFieldReader;
-pub use self::writer::MultiValueIntFastFieldWriter;
+pub use self::reader::MultiValuedFastFieldReader;
+pub use self::writer::MultiValuedFastFieldWriter;
#[cfg(test)]
mod tests {

View File

@@ -10,29 +10,22 @@ use crate::DocId;
/// The `idx_reader` associated, for each document, the index of its first value.
///
#[derive(Clone)]
-pub struct MultiValueIntFastFieldReader<Item: FastValue> {
+pub struct MultiValuedFastFieldReader<Item: FastValue> {
idx_reader: FastFieldReader<u64>,
vals_reader: FastFieldReader<Item>,
}
-impl<Item: FastValue> MultiValueIntFastFieldReader<Item> {
+impl<Item: FastValue> MultiValuedFastFieldReader<Item> {
pub(crate) fn open(
idx_reader: FastFieldReader<u64>,
vals_reader: FastFieldReader<Item>,
-) -> MultiValueIntFastFieldReader<Item> {
-MultiValueIntFastFieldReader {
+) -> MultiValuedFastFieldReader<Item> {
+MultiValuedFastFieldReader {
idx_reader,
vals_reader,
}
}
-pub(crate) fn into_u64s_reader(self) -> MultiValueIntFastFieldReader<u64> {
-MultiValueIntFastFieldReader {
-idx_reader: self.idx_reader,
-vals_reader: self.vals_reader.into_u64_reader(),
-}
-}
/// Returns `(start, stop)`, such that the values associated
/// to the given document are `start..stop`.
fn range(&self, doc: DocId) -> (u64, u64) {
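A hedged sketch of how a multi-valued reader is consumed (field name assumed; `get_vals` materializes the `start..stop` range described above into the caller's buffer):

```rust
let term_ords: MultiValuedFastFieldReader<u64> = segment_reader.fast_fields().u64s(tags_field)?;
let mut vals: Vec<u64> = Vec::new();
term_ords.get_vals(doc_id, &mut vals); // all values for this doc, in order
```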

View File

@@ -18,7 +18,7 @@ use std::io;
/// in your schema
/// - add your document simply by calling `.add_document(...)`.
///
-/// The `MultiValueIntFastFieldWriter` can be acquired from the
+/// The `MultiValuedFastFieldWriter` can be acquired from the
/// fastfield writer, by calling [`.get_multivalue_writer(...)`](./struct.FastFieldsWriter.html#method.get_multivalue_writer).
///
/// Once acquired, writing is done by calling calls to
@@ -29,17 +29,17 @@ use std::io;
/// This makes it possible to push unordered term ids,
/// during indexing and remap them to their respective
/// term ids when the segment is getting serialized.
-pub struct MultiValueIntFastFieldWriter {
+pub struct MultiValuedFastFieldWriter {
field: Field,
vals: Vec<UnorderedTermId>,
doc_index: Vec<u64>,
is_facet: bool,
}
-impl MultiValueIntFastFieldWriter {
+impl MultiValuedFastFieldWriter {
/// Creates a new `IntFastFieldWriter`
pub(crate) fn new(field: Field, is_facet: bool) -> Self {
-MultiValueIntFastFieldWriter {
+MultiValuedFastFieldWriter {
field,
vals: Vec::new(),
doc_index: Vec::new(),
@@ -47,7 +47,7 @@ impl MultiValueIntFastFieldWriter {
}
}
-/// Access the field associated to the `MultiValueIntFastFieldWriter`
+/// Access the field associated to the `MultiValuedFastFieldWriter`
pub fn field(&self) -> Field {
self.field
}

View File

@@ -42,24 +42,6 @@ impl<Item: FastValue> FastFieldReader<Item> {
})
}
-pub(crate) fn into_u64_reader(self) -> FastFieldReader<u64> {
-FastFieldReader {
-bit_unpacker: self.bit_unpacker,
-min_value_u64: self.min_value_u64,
-max_value_u64: self.max_value_u64,
-_phantom: PhantomData,
-}
-}
-pub(crate) fn cast<TFastValue: FastValue>(self) -> FastFieldReader<TFastValue> {
-FastFieldReader {
-bit_unpacker: self.bit_unpacker,
-min_value_u64: self.min_value_u64,
-max_value_u64: self.max_value_u64,
-_phantom: PhantomData,
-}
-}
/// Return the value associated to the given document.
///
/// This accessor should return as fast as possible.

View File

@@ -1,28 +1,22 @@
use crate::common::CompositeFile;
-use crate::fastfield::MultiValueIntFastFieldReader;
+use crate::directory::FileSlice;
+use crate::fastfield::MultiValuedFastFieldReader;
use crate::fastfield::{BytesFastFieldReader, FastValue};
use crate::fastfield::{FastFieldNotAvailableError, FastFieldReader};
use crate::schema::{Cardinality, Field, FieldType, Schema};
use crate::space_usage::PerFieldSpaceUsage;
-use std::collections::HashMap;
+use crate::TantivyError;
/// Provides access to all of the FastFieldReader.
///
-/// Internally, `FastFieldReaders` have preloaded fast field readers,
-/// and just wraps several `HashMap`.
#[derive(Clone)]
pub struct FastFieldReaders {
-fast_field_i64: HashMap<Field, FastFieldReader<i64>>,
-fast_field_u64: HashMap<Field, FastFieldReader<u64>>,
-fast_field_f64: HashMap<Field, FastFieldReader<f64>>,
-fast_field_date: HashMap<Field, FastFieldReader<crate::DateTime>>,
-fast_field_i64s: HashMap<Field, MultiValueIntFastFieldReader<i64>>,
-fast_field_u64s: HashMap<Field, MultiValueIntFastFieldReader<u64>>,
-fast_field_f64s: HashMap<Field, MultiValueIntFastFieldReader<f64>>,
-fast_field_dates: HashMap<Field, MultiValueIntFastFieldReader<crate::DateTime>>,
-fast_bytes: HashMap<Field, BytesFastFieldReader>,
+schema: Schema,
fast_fields_composite: CompositeFile,
}
+#[derive(Eq, PartialEq, Debug)]
enum FastType {
I64,
U64,
@@ -50,236 +44,167 @@ fn type_and_cardinality(field_type: &FieldType) -> Option<(FastType, Cardinality
}
impl FastFieldReaders {
-pub(crate) fn load_all(
-schema: &Schema,
-fast_fields_composite: &CompositeFile,
+pub(crate) fn new(
+schema: Schema,
+fast_fields_composite: CompositeFile,
) -> crate::Result<FastFieldReaders> {
-let mut fast_field_readers = FastFieldReaders {
-fast_field_i64: Default::default(),
-fast_field_u64: Default::default(),
-fast_field_f64: Default::default(),
-fast_field_date: Default::default(),
-fast_field_i64s: Default::default(),
-fast_field_u64s: Default::default(),
-fast_field_f64s: Default::default(),
-fast_field_dates: Default::default(),
-fast_bytes: Default::default(),
-fast_fields_composite: fast_fields_composite.clone(),
-};
-for (field, field_entry) in schema.fields() {
-let field_type = field_entry.field_type();
-if let FieldType::Bytes(bytes_option) = field_type {
-if !bytes_option.is_fast() {
-continue;
-}
-let fast_field_idx_file = fast_fields_composite
-.open_read_with_idx(field, 0)
-.ok_or_else(|| FastFieldNotAvailableError::new(field_entry))?;
-let idx_reader = FastFieldReader::open(fast_field_idx_file)?;
-let data = fast_fields_composite
-.open_read_with_idx(field, 1)
-.ok_or_else(|| FastFieldNotAvailableError::new(field_entry))?;
-let bytes_fast_field_reader = BytesFastFieldReader::open(idx_reader, data)?;
-fast_field_readers
-.fast_bytes
-.insert(field, bytes_fast_field_reader);
-} else if let Some((fast_type, cardinality)) = type_and_cardinality(field_type) {
-match cardinality {
-Cardinality::SingleValue => {
-if let Some(fast_field_data) = fast_fields_composite.open_read(field) {
-match fast_type {
-FastType::U64 => {
-let fast_field_reader = FastFieldReader::open(fast_field_data)?;
-fast_field_readers
-.fast_field_u64
-.insert(field, fast_field_reader);
-}
-FastType::I64 => {
-let fast_field_reader =
-FastFieldReader::open(fast_field_data.clone())?;
-fast_field_readers
-.fast_field_i64
-.insert(field, fast_field_reader);
-}
-FastType::F64 => {
-let fast_field_reader =
-FastFieldReader::open(fast_field_data.clone())?;
-fast_field_readers
-.fast_field_f64
-.insert(field, fast_field_reader);
-}
-FastType::Date => {
-let fast_field_reader =
-FastFieldReader::open(fast_field_data.clone())?;
-fast_field_readers
-.fast_field_date
-.insert(field, fast_field_reader);
-}
-}
-} else {
-return Err(From::from(FastFieldNotAvailableError::new(field_entry)));
-}
-}
-Cardinality::MultiValues => {
-let idx_opt = fast_fields_composite.open_read_with_idx(field, 0);
-let data_opt = fast_fields_composite.open_read_with_idx(field, 1);
-if let (Some(fast_field_idx), Some(fast_field_data)) = (idx_opt, data_opt) {
-let idx_reader = FastFieldReader::open(fast_field_idx)?;
-match fast_type {
-FastType::I64 => {
-let vals_reader = FastFieldReader::open(fast_field_data)?;
-let multivalued_int_fast_field =
-MultiValueIntFastFieldReader::open(idx_reader, vals_reader);
-fast_field_readers
-.fast_field_i64s
-.insert(field, multivalued_int_fast_field);
-}
-FastType::U64 => {
-let vals_reader = FastFieldReader::open(fast_field_data)?;
-let multivalued_int_fast_field =
-MultiValueIntFastFieldReader::open(idx_reader, vals_reader);
-fast_field_readers
-.fast_field_u64s
-.insert(field, multivalued_int_fast_field);
-}
-FastType::F64 => {
-let vals_reader = FastFieldReader::open(fast_field_data)?;
-let multivalued_int_fast_field =
-MultiValueIntFastFieldReader::open(idx_reader, vals_reader);
-fast_field_readers
-.fast_field_f64s
-.insert(field, multivalued_int_fast_field);
-}
-FastType::Date => {
-let vals_reader = FastFieldReader::open(fast_field_data)?;
-let multivalued_int_fast_field =
-MultiValueIntFastFieldReader::open(idx_reader, vals_reader);
-fast_field_readers
-.fast_field_dates
-.insert(field, multivalued_int_fast_field);
-}
-}
-} else {
-return Err(From::from(FastFieldNotAvailableError::new(field_entry)));
-}
-}
-}
-}
-}
-Ok(fast_field_readers)
+Ok(FastFieldReaders {
+fast_fields_composite,
+schema,
+})
}
pub(crate) fn space_usage(&self) -> PerFieldSpaceUsage {
self.fast_fields_composite.space_usage()
}
-/// Returns the `u64` fast field reader reader associated to `field`.
-///
-/// If `field` is not a u64 fast field, this method returns `None`.
-pub fn u64(&self, field: Field) -> Option<FastFieldReader<u64>> {
-self.fast_field_u64.get(&field).cloned()
+fn fast_field_data(&self, field: Field, idx: usize) -> crate::Result<FileSlice> {
+self.fast_fields_composite
+.open_read_with_idx(field, idx)
+.ok_or_else(|| {
+let field_name = self.schema.get_field_entry(field).name();
+TantivyError::SchemaError(format!("Field({}) data was not found", field_name))
+})
}
-/// If the field is a u64-fast field return the associated reader.
-/// If the field is a i64-fast field, return the associated u64 reader. Values are
-/// mapped from i64 to u64 using a (well the, it is unique) monotonic mapping.
-///
-/// This method is useful when merging segment reader.
-pub(crate) fn u64_lenient(&self, field: Field) -> Option<FastFieldReader<u64>> {
-if let Some(u64_ff_reader) = self.u64(field) {
-return Some(u64_ff_reader);
+fn check_type(
+&self,
+field: Field,
+expected_fast_type: FastType,
+expected_cardinality: Cardinality,
+) -> crate::Result<()> {
+let field_entry = self.schema.get_field_entry(field);
+let (fast_type, cardinality) =
+type_and_cardinality(field_entry.field_type()).ok_or_else(|| {
+crate::TantivyError::SchemaError(format!(
+"Field {:?} is not a fast field.",
+field_entry.name()
+))
+})?;
+if fast_type != expected_fast_type {
+return Err(crate::TantivyError::SchemaError(format!(
+"Field {:?} is of type {:?}, expected {:?}.",
+field_entry.name(),
+fast_type,
+expected_fast_type
+)));
}
-if let Some(i64_ff_reader) = self.i64(field) {
-return Some(i64_ff_reader.into_u64_reader());
+if cardinality != expected_cardinality {
+return Err(crate::TantivyError::SchemaError(format!(
+"Field {:?} is of cardinality {:?}, expected {:?}.",
+field_entry.name(),
+cardinality,
+expected_cardinality
+)));
}
-if let Some(f64_ff_reader) = self.f64(field) {
-return Some(f64_ff_reader.into_u64_reader());
-}
-if let Some(date_ff_reader) = self.date(field) {
-return Some(date_ff_reader.into_u64_reader());
-}
-None
+Ok(())
}
pub(crate) fn typed_fast_field_reader<TFastValue: FastValue>(
&self,
field: Field,
-) -> Option<FastFieldReader<TFastValue>> {
-self.u64_lenient(field)
-.map(|fast_field_reader| fast_field_reader.cast())
+) -> crate::Result<FastFieldReader<TFastValue>> {
+let fast_field_slice = self.fast_field_data(field, 0)?;
+FastFieldReader::open(fast_field_slice)
}
+pub(crate) fn typed_fast_field_multi_reader<TFastValue: FastValue>(
+&self,
+field: Field,
+) -> crate::Result<MultiValuedFastFieldReader<TFastValue>> {
+let fast_field_slice_idx = self.fast_field_data(field, 0)?;
+let fast_field_slice_vals = self.fast_field_data(field, 1)?;
+let idx_reader = FastFieldReader::open(fast_field_slice_idx)?;
+let vals_reader: FastFieldReader<TFastValue> =
+FastFieldReader::open(fast_field_slice_vals)?;
+Ok(MultiValuedFastFieldReader::open(idx_reader, vals_reader))
+}
+/// Returns the `u64` fast field reader reader associated to `field`.
+///
+/// If `field` is not a u64 fast field, this method returns `None`.
+pub fn u64(&self, field: Field) -> crate::Result<FastFieldReader<u64>> {
+self.check_type(field, FastType::U64, Cardinality::SingleValue)?;
+self.typed_fast_field_reader(field)
}
/// Returns the `i64` fast field reader reader associated to `field`.
///
/// If `field` is not a i64 fast field, this method returns `None`.
-pub fn i64(&self, field: Field) -> Option<FastFieldReader<i64>> {
-self.fast_field_i64.get(&field).cloned()
+pub fn i64(&self, field: Field) -> crate::Result<FastFieldReader<i64>> {
+self.check_type(field, FastType::I64, Cardinality::SingleValue)?;
+self.typed_fast_field_reader(field)
}
/// Returns the `i64` fast field reader reader associated to `field`.
///
/// If `field` is not a i64 fast field, this method returns `None`.
-pub fn date(&self, field: Field) -> Option<FastFieldReader<crate::DateTime>> {
-self.fast_field_date.get(&field).cloned()
+pub fn date(&self, field: Field) -> crate::Result<FastFieldReader<crate::DateTime>> {
+self.check_type(field, FastType::Date, Cardinality::SingleValue)?;
+self.typed_fast_field_reader(field)
}
/// Returns the `f64` fast field reader reader associated to `field`.
///
/// If `field` is not a f64 fast field, this method returns `None`.
-pub fn f64(&self, field: Field) -> Option<FastFieldReader<f64>> {
-self.fast_field_f64.get(&field).cloned()
+pub fn f64(&self, field: Field) -> crate::Result<FastFieldReader<f64>> {
+self.check_type(field, FastType::F64, Cardinality::SingleValue)?;
+self.typed_fast_field_reader(field)
}
/// Returns a `u64s` multi-valued fast field reader reader associated to `field`.
///
/// If `field` is not a u64 multi-valued fast field, this method returns `None`.
-pub fn u64s(&self, field: Field) -> Option<MultiValueIntFastFieldReader<u64>> {
-self.fast_field_u64s.get(&field).cloned()
-}
-/// If the field is a u64s-fast field return the associated reader.
-/// If the field is a i64s-fast field, return the associated u64s reader. Values are
-/// mapped from i64 to u64 using a (well the, it is unique) monotonic mapping.
-///
-/// This method is useful when merging segment reader.
-pub(crate) fn u64s_lenient(&self, field: Field) -> Option<MultiValueIntFastFieldReader<u64>> {
-if let Some(u64s_ff_reader) = self.u64s(field) {
-return Some(u64s_ff_reader);
-}
-if let Some(i64s_ff_reader) = self.i64s(field) {
-return Some(i64s_ff_reader.into_u64s_reader());
-}
-if let Some(f64s_ff_reader) = self.f64s(field) {
-return Some(f64s_ff_reader.into_u64s_reader());
-}
-None
+pub fn u64s(&self, field: Field) -> crate::Result<MultiValuedFastFieldReader<u64>> {
+self.check_type(field, FastType::U64, Cardinality::MultiValues)?;
+self.typed_fast_field_multi_reader(field)
}
/// Returns a `i64s` multi-valued fast field reader reader associated to `field`.
///
/// If `field` is not a i64 multi-valued fast field, this method returns `None`.
-pub fn i64s(&self, field: Field) -> Option<MultiValueIntFastFieldReader<i64>> {
-self.fast_field_i64s.get(&field).cloned()
+pub fn i64s(&self, field: Field) -> crate::Result<MultiValuedFastFieldReader<i64>> {
+self.check_type(field, FastType::I64, Cardinality::MultiValues)?;
+self.typed_fast_field_multi_reader(field)
}
/// Returns a `f64s` multi-valued fast field reader reader associated to `field`.
///
/// If `field` is not a f64 multi-valued fast field, this method returns `None`.
-pub fn f64s(&self, field: Field) -> Option<MultiValueIntFastFieldReader<f64>> {
-self.fast_field_f64s.get(&field).cloned()
+pub fn f64s(&self, field: Field) -> crate::Result<MultiValuedFastFieldReader<f64>> {
+self.check_type(field, FastType::F64, Cardinality::MultiValues)?;
+self.typed_fast_field_multi_reader(field)
}
/// Returns a `crate::DateTime` multi-valued fast field reader reader associated to `field`.
///
/// If `field` is not a `crate::DateTime` multi-valued fast field, this method returns `None`.
-pub fn dates(&self, field: Field) -> Option<MultiValueIntFastFieldReader<crate::DateTime>> {
-self.fast_field_dates.get(&field).cloned()
+pub fn dates(
+&self,
+field: Field,
+) -> crate::Result<MultiValuedFastFieldReader<crate::DateTime>> {
+self.check_type(field, FastType::Date, Cardinality::MultiValues)?;
+self.typed_fast_field_multi_reader(field)
}
/// Returns the `bytes` fast field reader associated to `field`.
///
/// If `field` is not a bytes fast field, returns `None`.
-pub fn bytes(&self, field: Field) -> Option<BytesFastFieldReader> {
-self.fast_bytes.get(&field).cloned()
+pub fn bytes(&self, field: Field) -> crate::Result<BytesFastFieldReader> {
+let field_entry = self.schema.get_field_entry(field);
+if let FieldType::Bytes(bytes_option) = field_entry.field_type() {
+if !bytes_option.is_fast() {
+return Err(crate::TantivyError::SchemaError(format!(
+"Field {:?} is not a fast field.",
+field_entry.name()
+)));
+}
+let fast_field_idx_file = self.fast_field_data(field, 0)?;
+let idx_reader = FastFieldReader::open(fast_field_idx_file)?;
+let data = self.fast_field_data(field, 1)?;
+BytesFastFieldReader::open(idx_reader, data)
+} else {
+Err(FastFieldNotAvailableError::new(field_entry).into())
+}
}
}
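Net effect of the rewrite above (#976): accessors open their reader on demand from the composite file and return `crate::Result` instead of `Option`. A hypothetical call site (field name assumed):

```rust
let fast_fields = segment_reader.fast_fields();
// Opened on first use rather than at segment-open time; a schema mismatch
// now surfaces as an explicit error instead of `None`.
let popularity_reader = fast_fields.u64(popularity_field)?;
let popularity: u64 = popularity_reader.get(doc_id);
```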

View File

@@ -1,4 +1,4 @@
-use super::multivalued::MultiValueIntFastFieldWriter;
+use super::multivalued::MultiValuedFastFieldWriter;
use crate::common;
use crate::common::BinarySerializable;
use crate::common::VInt;
@@ -13,7 +13,7 @@ use std::io;
/// The fastfieldswriter regroup all of the fast field writers.
pub struct FastFieldsWriter {
single_value_writers: Vec<IntFastFieldWriter>,
-multi_values_writers: Vec<MultiValueIntFastFieldWriter>,
+multi_values_writers: Vec<MultiValuedFastFieldWriter>,
bytes_value_writers: Vec<BytesFastFieldWriter>,
}
@@ -46,14 +46,14 @@ impl FastFieldsWriter {
single_value_writers.push(fast_field_writer);
}
Some(Cardinality::MultiValues) => {
-let fast_field_writer = MultiValueIntFastFieldWriter::new(field, false);
+let fast_field_writer = MultiValuedFastFieldWriter::new(field, false);
multi_values_writers.push(fast_field_writer);
}
None => {}
}
}
FieldType::HierarchicalFacet => {
-let fast_field_writer = MultiValueIntFastFieldWriter::new(field, true);
+let fast_field_writer = MultiValuedFastFieldWriter::new(field, true);
multi_values_writers.push(fast_field_writer);
}
FieldType::Bytes(bytes_option) => {
@@ -87,7 +87,7 @@ impl FastFieldsWriter {
pub fn get_multivalue_writer(
&mut self,
field: Field,
-) -> Option<&mut MultiValueIntFastFieldWriter> {
+) -> Option<&mut MultiValuedFastFieldWriter> {
// TODO optimize
self.multi_values_writers
.iter_mut()

View File

@@ -1,31 +1,76 @@
-use rand::thread_rng;
-use std::collections::HashSet;
-use crate::schema::*;
use crate::Index;
use crate::Searcher;
+use crate::{doc, schema::*};
+use rand::thread_rng;
+use rand::Rng;
+use std::collections::HashSet;
-fn check_index_content(searcher: &Searcher, vals: &HashSet<u64>) {
+fn check_index_content(searcher: &Searcher, vals: &[u64]) -> crate::Result<()> {
assert!(searcher.segment_readers().len() < 20);
assert_eq!(searcher.num_docs() as usize, vals.len());
+for segment_reader in searcher.segment_readers() {
+let store_reader = segment_reader.get_store_reader()?;
+for doc_id in 0..segment_reader.max_doc() {
+let _doc = store_reader.get(doc_id)?;
+}
+}
+Ok(())
}
#[test]
#[ignore]
-fn test_indexing() {
+fn test_functional_store() -> crate::Result<()> {
+let mut schema_builder = Schema::builder();
+let id_field = schema_builder.add_u64_field("id", INDEXED | STORED);
+let schema = schema_builder.build();
+let index = Index::create_in_ram(schema);
+let reader = index.reader()?;
+let mut rng = thread_rng();
+let mut index_writer = index.writer_with_num_threads(3, 12_000_000)?;
+let mut doc_set: Vec<u64> = Vec::new();
+let mut doc_id = 0u64;
+for iteration in 0..500 {
+dbg!(iteration);
+let num_docs: usize = rng.gen_range(0..4);
+if doc_set.len() >= 1 {
+let doc_to_remove_id = rng.gen_range(0..doc_set.len());
+let removed_doc_id = doc_set.swap_remove(doc_to_remove_id);
+index_writer.delete_term(Term::from_field_u64(id_field, removed_doc_id));
+}
+for _ in 0..num_docs {
+doc_set.push(doc_id);
+index_writer.add_document(doc!(id_field=>doc_id));
+doc_id += 1;
+}
+index_writer.commit()?;
+reader.reload()?;
+let searcher = reader.searcher();
+check_index_content(&searcher, &doc_set)?;
+}
+Ok(())
+}
+#[test]
+#[ignore]
+fn test_functional_indexing() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
let id_field = schema_builder.add_u64_field("id", INDEXED);
let multiples_field = schema_builder.add_u64_field("multiples", INDEXED);
let schema = schema_builder.build();
-let index = Index::create_from_tempdir(schema).unwrap();
-let reader = index.reader().unwrap();
+let index = Index::create_from_tempdir(schema)?;
+let reader = index.reader()?;
let mut rng = thread_rng();
-let mut index_writer = index.writer_with_num_threads(3, 120_000_000).unwrap();
+let mut index_writer = index.writer_with_num_threads(3, 120_000_000)?;
let mut committed_docs: HashSet<u64> = HashSet::new();
let mut uncommitted_docs: HashSet<u64> = HashSet::new();
@@ -33,13 +78,16 @@ fn test_indexing() {
for _ in 0..200 {
let random_val = rng.gen_range(0..20);
if random_val == 0 {
-index_writer.commit().expect("Commit failed");
+index_writer.commit()?;
committed_docs.extend(&uncommitted_docs);
uncommitted_docs.clear();
-reader.reload().unwrap();
+reader.reload()?;
let searcher = reader.searcher();
// check that everything is correct.
-check_index_content(&searcher, &committed_docs);
+check_index_content(
+&searcher,
+&committed_docs.iter().cloned().collect::<Vec<u64>>(),
+)?;
} else {
if committed_docs.remove(&random_val) || uncommitted_docs.remove(&random_val) {
let doc_id_term = Term::from_field_u64(id_field, random_val);
@@ -55,4 +103,5 @@ fn test_indexing() {
}
}
}
+Ok(())
}

View File

@@ -7,7 +7,7 @@ use crate::fastfield::BytesFastFieldReader;
use crate::fastfield::DeleteBitSet;
use crate::fastfield::FastFieldReader;
use crate::fastfield::FastFieldSerializer;
-use crate::fastfield::MultiValueIntFastFieldReader;
+use crate::fastfield::MultiValuedFastFieldReader;
use crate::fieldnorm::FieldNormsSerializer;
use crate::fieldnorm::FieldNormsWriter;
use crate::fieldnorm::{FieldNormReader, FieldNormReaders};
@@ -246,7 +246,7 @@ impl IndexMerger {
for reader in &self.readers {
let u64_reader: FastFieldReader<u64> = reader
.fast_fields()
-.u64_lenient(field)
+.typed_fast_field_reader(field)
.expect("Failed to find a reader for single fast field. This is a tantivy bug and it should never happen.");
if let Some((seg_min_val, seg_max_val)) =
compute_min_max_val(&u64_reader, reader.max_doc(), reader.delete_bitset())
@@ -290,7 +290,7 @@ impl IndexMerger {
fast_field_serializer: &mut FastFieldSerializer,
) -> crate::Result<()> {
let mut total_num_vals = 0u64;
-let mut u64s_readers: Vec<MultiValueIntFastFieldReader<u64>> = Vec::new();
+let mut u64s_readers: Vec<MultiValuedFastFieldReader<u64>> = Vec::new();
// In the first pass, we compute the total number of vals.
//
@@ -298,9 +298,8 @@ impl IndexMerger {
// what should be the bit length use for bitpacking.
for reader in &self.readers {
let u64s_reader = reader.fast_fields()
-.u64s_lenient(field)
+.typed_fast_field_multi_reader(field)
.expect("Failed to find index for multivalued field. This is a bug in tantivy, please report.");
if let Some(delete_bitset) = reader.delete_bitset() {
for doc in 0u32..reader.max_doc() {
if delete_bitset.is_alive(doc) {
@@ -353,7 +352,7 @@ impl IndexMerger {
for (segment_ord, segment_reader) in self.readers.iter().enumerate() {
let term_ordinal_mapping: &[TermOrdinal] =
term_ordinal_mappings.get_segment(segment_ord);
-let ff_reader: MultiValueIntFastFieldReader<u64> = segment_reader
+let ff_reader: MultiValuedFastFieldReader<u64> = segment_reader
.fast_fields()
.u64s(field)
.expect("Could not find multivalued u64 fast value reader.");
@@ -397,8 +396,10 @@ impl IndexMerger {
// We go through a complete first pass to compute the minimum and the
// maximum value and initialize our Serializer.
for reader in &self.readers {
-let ff_reader: MultiValueIntFastFieldReader<u64> =
-reader.fast_fields().u64s_lenient(field).expect(
+let ff_reader: MultiValuedFastFieldReader<u64> = reader
+.fast_fields()
+.typed_fast_field_multi_reader(field)
+.expect(
"Failed to find multivalued fast field reader. This is a bug in \
tantivy. Please report.",
);
@@ -445,11 +446,7 @@ impl IndexMerger {
let mut bytes_readers: Vec<BytesFastFieldReader> = Vec::new();
for reader in &self.readers {
-let bytes_reader = reader.fast_fields().bytes(field).ok_or_else(|| {
-crate::TantivyError::InvalidArgument(
-"Bytes fast field {:?} not found in segment.".to_string(),
-)
-})?;
+let bytes_reader = reader.fast_fields().bytes(field)?;
if let Some(delete_bitset) = reader.delete_bitset() {
for doc in 0u32..reader.max_doc() {
if delete_bitset.is_alive(doc) {

View File

@@ -10,9 +10,10 @@ use crate::schema::FieldType;
use crate::schema::Schema;
use crate::schema::Term;
use crate::schema::Value;
-use crate::tokenizer::PreTokenizedStream;
-use crate::tokenizer::{DynTokenStreamChain, Tokenizer};
-use crate::tokenizer::{FacetTokenizer, TextAnalyzerT, Token};
+use crate::schema::{Field, FieldEntry};
+use crate::tokenizer::{BoxTokenStream, PreTokenizedStream};
+use crate::tokenizer::{FacetTokenizer, TextAnalyzer};
+use crate::tokenizer::{TokenStreamChain, Tokenizer};
use crate::Opstamp;
use crate::{DocId, SegmentComponent};
@@ -22,7 +23,7 @@ use crate::{DocId, SegmentComponent};
fn initial_table_size(per_thread_memory_budget: usize) -> crate::Result<usize> {
let table_memory_upper_bound = per_thread_memory_budget / 3;
if let Some(limit) = (10..)
.take_while(|&num_bits| compute_table_size(num_bits) < table_memory_upper_bound)
.take_while(|num_bits: &usize| compute_table_size(*num_bits) < table_memory_upper_bound)
.last()
{
Ok(limit.min(19)) // we cap it at 2^19 = 512K.
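For readers who want to sanity-check the capping logic, here is a self-contained sketch of the same search, with a hypothetical `compute_table_size` standing in for the real one (the 16-bytes-per-bucket cost is an assumption):

    // hypothetical stand-in: assume each hash-table bucket costs 16 bytes
    fn compute_table_size(num_bits: usize) -> usize {
        (1 << num_bits) * 16
    }

    fn initial_num_bits(per_thread_memory_budget: usize) -> Option<usize> {
        let table_memory_upper_bound = per_thread_memory_budget / 3;
        (10..)
            .take_while(|&num_bits| compute_table_size(num_bits) < table_memory_upper_bound)
            .last()
            .map(|num_bits| num_bits.min(19)) // cap at 2^19 = 512K, as above
    }

With a large budget the walk runs past 19 and the cap kicks in; a budget too small for even 2^10 buckets yields `None`, which the real function reports as an error.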
@@ -44,8 +45,7 @@ pub struct SegmentWriter {
fast_field_writers: FastFieldsWriter,
fieldnorms_writer: FieldNormsWriter,
doc_opstamps: Vec<Opstamp>,
// TODO: change type
tokenizers: Vec<Option<Box<dyn TextAnalyzerT>>>,
tokenizers: Vec<Option<TextAnalyzer>>,
term_buffer: Term,
}
@@ -70,17 +70,17 @@ impl SegmentWriter {
let multifield_postings = MultiFieldPostingsWriter::new(schema, table_num_bits);
let tokenizers = schema
.fields()
.map(|(_, field_entry)| match field_entry.field_type() {
FieldType::Str(text_options) => {
text_options
.map(
|(_, field_entry): (Field, &FieldEntry)| match field_entry.field_type() {
FieldType::Str(ref text_options) => text_options
.get_indexing_options()
.and_then(|text_index_option| {
let tokenizer_name = &text_index_option.tokenizer();
tokenizer_manager.get(tokenizer_name)
})
}
_ => None,
})
}),
_ => None,
},
)
.collect();
Ok(SegmentWriter {
max_doc: 0,
@@ -141,13 +141,13 @@ impl SegmentWriter {
}
let (term_buffer, multifield_postings) =
(&mut self.term_buffer, &mut self.multifield_postings);
match field_entry.field_type() {
match *field_entry.field_type() {
FieldType::HierarchicalFacet => {
term_buffer.set_field(field);
let facets =
field_values
.iter()
.flat_map(|field_value| match field_value.value() {
.flat_map(|field_value| match *field_value.value() {
Value::Facet(ref facet) => Some(facet.encoded_str()),
_ => {
panic!("Expected hierarchical facet");
@@ -157,13 +157,12 @@ impl SegmentWriter {
let mut unordered_term_id_opt = None;
FacetTokenizer
.token_stream(facet_str)
.map(|token| {
.process(&mut |token| {
term_buffer.set_text(&token.text);
let unordered_term_id =
multifield_postings.subscribe(doc_id, &term_buffer);
unordered_term_id_opt = Some(unordered_term_id);
})
.count();
});
if let Some(unordered_term_id) = unordered_term_id_opt {
self.fast_field_writers
.get_multivalue_writer(field)
@@ -173,38 +172,37 @@ impl SegmentWriter {
}
}
FieldType::Str(_) => {
let mut streams_with_offsets = vec![];
let mut token_streams: Vec<BoxTokenStream> = vec![];
let mut offsets = vec![];
let mut total_offset = 0;
for field_value in field_values {
match field_value.value() {
Value::PreTokStr(tok_str) => {
streams_with_offsets.push((
Box::new(PreTokenizedStream::from(tok_str.clone()))
as Box<dyn Iterator<Item = Token>>,
total_offset,
));
offsets.push(total_offset);
if let Some(last_token) = tok_str.tokens.last() {
total_offset += last_token.offset_to;
}
token_streams
.push(PreTokenizedStream::from(tok_str.clone()).into());
}
Value::Str(text) => {
Value::Str(ref text) => {
if let Some(ref mut tokenizer) =
self.tokenizers[field.field_id() as usize]
{
streams_with_offsets
.push((tokenizer.token_stream(text), total_offset));
offsets.push(total_offset);
total_offset += text.len();
token_streams.push(tokenizer.token_stream(text));
}
}
_ => (),
}
}
let num_tokens = if streams_with_offsets.is_empty() {
let num_tokens = if token_streams.is_empty() {
0
} else {
let mut token_stream = DynTokenStreamChain::from_vec(streams_with_offsets);
let mut token_stream = TokenStreamChain::new(offsets, token_streams);
multifield_postings.index_text(
doc_id,
field,
@@ -215,62 +213,71 @@ impl SegmentWriter {
self.fieldnorms_writer.record(doc_id, field, num_tokens);
}
FieldType::U64(int_option) if int_option.is_indexed() => {
for field_value in field_values {
term_buffer.set_field(field_value.field());
let u64_val = field_value
.value()
.u64_value()
.ok_or_else(make_schema_error)?;
term_buffer.set_u64(u64_val);
multifield_postings.subscribe(doc_id, &term_buffer);
FieldType::U64(ref int_option) => {
if int_option.is_indexed() {
for field_value in field_values {
term_buffer.set_field(field_value.field());
let u64_val = field_value
.value()
.u64_value()
.ok_or_else(make_schema_error)?;
term_buffer.set_u64(u64_val);
multifield_postings.subscribe(doc_id, &term_buffer);
}
}
}
FieldType::Date(int_option) if int_option.is_indexed() => {
for field_value in field_values {
term_buffer.set_field(field_value.field());
let date_val = field_value
.value()
.date_value()
.ok_or_else(make_schema_error)?;
term_buffer.set_i64(date_val.timestamp());
multifield_postings.subscribe(doc_id, &term_buffer);
FieldType::Date(ref int_option) => {
if int_option.is_indexed() {
for field_value in field_values {
term_buffer.set_field(field_value.field());
let date_val = field_value
.value()
.date_value()
.ok_or_else(make_schema_error)?;
term_buffer.set_i64(date_val.timestamp());
multifield_postings.subscribe(doc_id, &term_buffer);
}
}
}
FieldType::I64(int_option) if int_option.is_indexed() => {
for field_value in field_values {
term_buffer.set_field(field_value.field());
let i64_val = field_value
.value()
.i64_value()
.ok_or_else(make_schema_error)?;
term_buffer.set_i64(i64_val);
multifield_postings.subscribe(doc_id, &term_buffer);
FieldType::I64(ref int_option) => {
if int_option.is_indexed() {
for field_value in field_values {
term_buffer.set_field(field_value.field());
let i64_val = field_value
.value()
.i64_value()
.ok_or_else(make_schema_error)?;
term_buffer.set_i64(i64_val);
multifield_postings.subscribe(doc_id, &term_buffer);
}
}
}
FieldType::F64(int_option) if int_option.is_indexed() => {
for field_value in field_values {
term_buffer.set_field(field_value.field());
let f64_val = field_value
.value()
.f64_value()
.ok_or_else(make_schema_error)?;
term_buffer.set_f64(f64_val);
multifield_postings.subscribe(doc_id, &term_buffer);
FieldType::F64(ref int_option) => {
if int_option.is_indexed() {
for field_value in field_values {
term_buffer.set_field(field_value.field());
let f64_val = field_value
.value()
.f64_value()
.ok_or_else(make_schema_error)?;
term_buffer.set_f64(f64_val);
multifield_postings.subscribe(doc_id, &term_buffer);
}
}
}
FieldType::Bytes(option) if option.is_indexed() => {
for field_value in field_values {
term_buffer.set_field(field_value.field());
let bytes = field_value
.value()
.bytes_value()
.ok_or_else(make_schema_error)?;
term_buffer.set_bytes(bytes);
self.multifield_postings.subscribe(doc_id, &term_buffer);
FieldType::Bytes(ref option) => {
if option.is_indexed() {
for field_value in field_values {
term_buffer.set_field(field_value.field());
let bytes = field_value
.value()
.bytes_value()
.ok_or_else(make_schema_error)?;
term_buffer.set_bytes(bytes);
self.multifield_postings.subscribe(doc_id, &term_buffer);
}
}
}
_ => {}
}
}
doc.filter_fields(|field| schema.get_field_entry(field).is_stored());

View File

@@ -96,7 +96,7 @@
//! A good place for you to get started is to check out
//! the example code (
//! [literate programming](https://tantivy-search.github.io/examples/basic_search.html) /
//! [source code](https://github.com/tantivy-search/tantivy/blob/master/examples/basic_search.rs))
//! [source code](https://github.com/tantivy-search/tantivy/blob/main/examples/basic_search.rs))
#[cfg_attr(test, macro_use)]
extern crate serde_json;
@@ -866,39 +866,39 @@ mod tests {
let searcher = reader.searcher();
let segment_reader: &SegmentReader = searcher.segment_reader(0);
{
let fast_field_reader_opt = segment_reader.fast_fields().u64(text_field);
assert!(fast_field_reader_opt.is_none());
let fast_field_reader_res = segment_reader.fast_fields().u64(text_field);
assert!(fast_field_reader_res.is_err());
}
{
let fast_field_reader_opt = segment_reader.fast_fields().u64(stored_int_field);
assert!(fast_field_reader_opt.is_none());
assert!(fast_field_reader_opt.is_err());
}
{
let fast_field_reader_opt = segment_reader.fast_fields().u64(fast_field_signed);
assert!(fast_field_reader_opt.is_none());
assert!(fast_field_reader_opt.is_err());
}
{
let fast_field_reader_opt = segment_reader.fast_fields().u64(fast_field_float);
assert!(fast_field_reader_opt.is_none());
assert!(fast_field_reader_opt.is_err());
}
{
let fast_field_reader_opt = segment_reader.fast_fields().u64(fast_field_unsigned);
assert!(fast_field_reader_opt.is_some());
assert!(fast_field_reader_opt.is_ok());
let fast_field_reader = fast_field_reader_opt.unwrap();
assert_eq!(fast_field_reader.get(0), 4u64)
}
{
let fast_field_reader_opt = segment_reader.fast_fields().i64(fast_field_signed);
assert!(fast_field_reader_opt.is_some());
let fast_field_reader = fast_field_reader_opt.unwrap();
let fast_field_reader_res = segment_reader.fast_fields().i64(fast_field_signed);
assert!(fast_field_reader_res.is_ok());
let fast_field_reader = fast_field_reader_res.unwrap();
assert_eq!(fast_field_reader.get(0), 4i64)
}
{
let fast_field_reader_opt = segment_reader.fast_fields().f64(fast_field_float);
assert!(fast_field_reader_opt.is_some());
let fast_field_reader = fast_field_reader_opt.unwrap();
let fast_field_reader_res = segment_reader.fast_fields().f64(fast_field_float);
assert!(fast_field_reader_res.is_ok());
let fast_field_reader = fast_field_reader_res.unwrap();
assert_eq!(fast_field_reader.get(0), 4f64)
}
Ok(())
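Migration note, since these assertions pin down the new contract: the `u64`, `i64` and `f64` accessors on `fast_fields()` now return `crate::Result` instead of `Option`. A hedged before/after sketch (the `?` assumes an enclosing function returning `crate::Result`):

    // 0.13: if let Some(r) = segment_reader.fast_fields().u64(fast_field_unsigned) { .. }
    // 0.14:
    let fast_field_reader = segment_reader.fast_fields().u64(fast_field_unsigned)?;
    assert_eq!(fast_field_reader.get(0), 4u64);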

View File

@@ -109,9 +109,9 @@ impl BlockSearcher {
/// The results should be equivalent to
/// ```compile_fail
/// block[..]
/// .iter()
/// .take_while(|&&val| val < target)
/// .count()
// .iter()
// .take_while(|&&val| val < target)
// .count()
/// ```
///
/// The `start` argument is just used to hint that the response is

View File

@@ -9,6 +9,7 @@ use crate::postings::{FieldSerializer, InvertedIndexSerializer};
use crate::schema::IndexRecordOption;
use crate::schema::{Field, FieldEntry, FieldType, Schema, Term};
use crate::termdict::TermOrdinal;
use crate::tokenizer::TokenStream;
use crate::tokenizer::{Token, MAX_TOKEN_LEN};
use crate::DocId;
use fnv::FnvHashMap;
@@ -99,10 +100,12 @@ impl MultiFieldPostingsWriter {
&mut self,
doc: DocId,
field: Field,
token_stream: &mut dyn Iterator<Item = Token>,
token_stream: &mut dyn TokenStream,
term_buffer: &mut Term,
) -> u32 {
self.per_field_postings_writers[field.field_id() as usize].index_text(
let postings_writer =
self.per_field_postings_writers[field.field_id() as usize].deref_mut();
postings_writer.index_text(
&mut self.term_index,
doc,
field,
@@ -214,7 +217,7 @@ pub trait PostingsWriter {
term_index: &mut TermHashMap,
doc_id: DocId,
field: Field,
token_stream: &mut dyn Iterator<Item = Token>,
token_stream: &mut dyn TokenStream,
heap: &mut MemoryArena,
term_buffer: &mut Term,
) -> u32 {
@@ -239,7 +242,7 @@ pub trait PostingsWriter {
);
}
};
token_stream.map(|tok| sink(&tok)).count() as u32
token_stream.process(&mut sink)
}
fn total_num_tokens(&self) -> u64;
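`process` replaces the old iterator plumbing: it drives every remaining token through the closure and returns the number of tokens seen, which is why the `map(..).count() as u32` dance disappears. A minimal sketch of a sink in the new style (the per-token body is a placeholder):

    let num_tokens: u32 = token_stream.process(&mut |token: &Token| {
        // build a Term from token.text and subscribe it here
    });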

View File

@@ -1,14 +1,11 @@
use crate::common::HasLen;
use crate::directory::FileSlice;
use crate::docset::DocSet;
use crate::fastfield::DeleteBitSet;
use crate::positions::PositionReader;
use crate::postings::compression::COMPRESSION_BLOCK_SIZE;
use crate::postings::serializer::PostingsSerializer;
use crate::postings::BlockSearcher;
use crate::postings::BlockSegmentPostings;
use crate::postings::Postings;
use crate::schema::IndexRecordOption;
use crate::{DocId, TERMINATED};
/// `SegmentPostings` represents the inverted list or postings associated with
@@ -68,7 +65,11 @@ impl SegmentPostings {
/// It serializes the doc ids using tantivy's codec
/// and returns a `SegmentPostings` object that embeds a
/// buffer with the serialized data.
#[cfg(test)]
pub fn create_from_docs(docs: &[u32]) -> SegmentPostings {
use crate::directory::FileSlice;
use crate::postings::serializer::PostingsSerializer;
use crate::schema::IndexRecordOption;
let mut buffer = Vec::new();
{
let mut postings_serializer =
@@ -97,6 +98,9 @@ impl SegmentPostings {
doc_and_tfs: &[(u32, u32)],
fieldnorms: Option<&[u32]>,
) -> SegmentPostings {
use crate::directory::FileSlice;
use crate::postings::serializer::PostingsSerializer;
use crate::schema::IndexRecordOption;
use crate::fieldnorm::FieldNormReader;
use crate::Score;
let mut buffer: Vec<u8> = Vec::new();

View File

@@ -289,7 +289,7 @@ impl QueryParser {
let field_name = field_entry.name().to_string();
return Err(QueryParserError::FieldNotIndexed(field_name));
}
match field_type {
match *field_type {
FieldType::I64(_) => {
let val: i64 = i64::from_str(phrase)?;
let term = Term::from_field_i64(field, val);
@@ -312,7 +312,7 @@ impl QueryParser {
let term = Term::from_field_u64(field, val);
Ok(vec![(0, term)])
}
FieldType::Str(str_options) => {
FieldType::Str(ref str_options) => {
if let Some(option) = str_options.get_indexing_options() {
let tokenizer =
self.tokenizer_manager
@@ -323,14 +323,15 @@ impl QueryParser {
option.tokenizer().to_string(),
)
})?;
let token_stream = tokenizer.token_stream(phrase);
let terms: Vec<_> = token_stream
.map(|token| {
let term = Term::from_field_text(field, &token.text);
(token.position, term)
})
.collect();
if terms.len() <= 1 {
let mut terms: Vec<(usize, Term)> = Vec::new();
let mut token_stream = tokenizer.token_stream(phrase);
token_stream.process(&mut |token| {
let term = Term::from_field_text(field, &token.text);
terms.push((token.position, term));
});
if terms.is_empty() {
Ok(vec![])
} else if terms.len() == 1 {
Ok(terms)
} else {
let field_entry = self.schema.get_field_entry(field);
@@ -413,7 +414,7 @@ impl QueryParser {
&self,
given_field: &Option<String>,
) -> Result<Cow<'_, [Field]>, QueryParserError> {
match given_field {
match *given_field {
None => {
if self.default_fields.is_empty() {
Err(QueryParserError::NoDefaultFieldDeclared)
@@ -421,7 +422,7 @@ impl QueryParser {
Ok(Cow::from(&self.default_fields[..]))
}
}
Some(field) => Ok(Cow::from(vec![self.resolve_field_name(&*field)?])),
Some(ref field) => Ok(Cow::from(vec![self.resolve_field_name(&*field)?])),
}
}
@@ -573,12 +574,15 @@ fn convert_to_query(logical_ast: LogicalAST) -> Box<dyn Query> {
#[cfg(test)]
mod test {
use super::super::logical_ast::*;
use super::*;
use super::QueryParser;
use super::QueryParserError;
use crate::query::Query;
use crate::schema::Field;
use crate::schema::{IndexRecordOption, TextFieldIndexing, TextOptions};
use crate::schema::{Schema, Term, INDEXED, STORED, STRING, TEXT};
use crate::tokenizer::{analyzer_builder, LowerCaser, SimpleTokenizer, StopWordFilter};
use crate::tokenizer::{
LowerCaser, SimpleTokenizer, StopWordFilter, TextAnalyzer, TokenizerManager,
};
use crate::Index;
use matches::assert_matches;
@@ -616,10 +620,9 @@ mod test {
let tokenizer_manager = TokenizerManager::default();
tokenizer_manager.register(
"en_with_stop_words",
analyzer_builder(SimpleTokenizer)
.filter(LowerCaser::new())
.filter(StopWordFilter::remove(vec!["the".to_string()]))
.build(),
TextAnalyzer::from(SimpleTokenizer)
.filter(LowerCaser)
.filter(StopWordFilter::remove(vec!["the".to_string()])),
);
QueryParser::new(schema, default_fields, tokenizer_manager)
}

View File

@@ -1,7 +1,7 @@
use crate::query::Query;
use crate::schema::Field;
use crate::schema::Value;
use crate::tokenizer::{TextAnalyzerT, Token};
use crate::tokenizer::{TextAnalyzer, Token};
use crate::Searcher;
use crate::{Document, Score};
use htmlescape::encode_minimal;
@@ -139,9 +139,9 @@ impl Snippet {
///
/// Fragments must be valid in the sense that `&text[fragment.start..fragment.stop]`\
/// has to be a valid string.
fn search_fragments(
tokenizer: &dyn TextAnalyzerT,
text: &str,
fn search_fragments<'a>(
tokenizer: &TextAnalyzer,
text: &'a str,
terms: &BTreeMap<String, Score>,
max_num_chars: usize,
) -> Vec<FragmentCandidate> {
@@ -155,7 +155,7 @@ fn search_fragments(
};
fragment = FragmentCandidate::new(next.offset_from);
}
fragment.try_add_token(&next, &terms);
fragment.try_add_token(next, &terms);
}
if fragment.score > 0.0 {
fragments.push(fragment)
@@ -249,7 +249,7 @@ fn select_best_fragment_combination(fragments: &[FragmentCandidate], text: &str)
/// ```
pub struct SnippetGenerator {
terms_text: BTreeMap<String, Score>,
tokenizer: Box<dyn TextAnalyzerT>,
tokenizer: TextAnalyzer,
field: Field,
max_num_chars: usize,
}
@@ -297,37 +297,33 @@ impl SnippetGenerator {
///
/// This method extracts the text associated with the `SnippetGenerator`'s field
/// and computes a snippet.
pub fn snippet_from_doc(&mut self, doc: &Document) -> Snippet {
pub fn snippet_from_doc(&self, doc: &Document) -> Snippet {
let text: String = doc
.get_all(self.field)
.flat_map(Value::text)
.collect::<Vec<&str>>()
.join(" ");
self.snippet(text.as_ref())
self.snippet(&text)
}
/// Generates a snippet for the given text.
pub fn snippet(&mut self, text: &str) -> Snippet {
let fragment_candidates = search_fragments(
&mut *self.tokenizer,
text,
&self.terms_text,
self.max_num_chars,
);
select_best_fragment_combination(&fragment_candidates[..], text)
pub fn snippet(&self, text: &str) -> Snippet {
let fragment_candidates =
search_fragments(&self.tokenizer, &text, &self.terms_text, self.max_num_chars);
select_best_fragment_combination(&fragment_candidates[..], &text)
}
}
#[cfg(test)]
mod tests {
use super::*;
use super::{search_fragments, select_best_fragment_combination};
use crate::query::QueryParser;
use crate::schema::{IndexRecordOption, Schema, TextFieldIndexing, TextOptions, TEXT};
use crate::tokenizer::SimpleTokenizer;
use crate::tokenizer::TextAnalyzer;
use crate::Index;
use crate::SnippetGenerator;
use maplit::btreemap;
use std::collections::BTreeMap;
use std::iter::Iterator;
const TEST_TEXT: &'static str = r#"Rust is a systems programming language sponsored by
@@ -350,13 +346,7 @@ Survey in 2016, 2017, and 2018."#;
String::from("rust") => 1.0,
String::from("language") => 0.9
};
let fragments = search_fragments(
&Into::<TextAnalyzer<_>>::into(SimpleTokenizer),
TEST_TEXT,
&terms,
100,
);
let fragments = search_fragments(&From::from(SimpleTokenizer), TEST_TEXT, &terms, 100);
assert_eq!(fragments.len(), 7);
{
let first = &fragments[0];
@@ -383,12 +373,7 @@ Survey in 2016, 2017, and 2018."#;
String::from("rust") =>1.0,
String::from("language") => 0.9
};
let fragments = search_fragments(
&Into::<TextAnalyzer<_>>::into(SimpleTokenizer),
TEST_TEXT,
&terms,
20,
);
let fragments = search_fragments(&From::from(SimpleTokenizer), TEST_TEXT, &terms, 20);
{
let first = &fragments[0];
assert_eq!(first.score, 1.0);
@@ -402,12 +387,7 @@ Survey in 2016, 2017, and 2018."#;
String::from("rust") =>0.9,
String::from("language") => 1.0
};
let fragments = search_fragments(
&Into::<TextAnalyzer<_>>::into(SimpleTokenizer),
TEST_TEXT,
&terms,
20,
);
let fragments = search_fragments(&From::from(SimpleTokenizer), TEST_TEXT, &terms, 20);
//assert_eq!(fragments.len(), 7);
{
let first = &fragments[0];
@@ -426,12 +406,7 @@ Survey in 2016, 2017, and 2018."#;
let mut terms = BTreeMap::new();
terms.insert(String::from("c"), 1.0);
let fragments = search_fragments(
&Into::<TextAnalyzer<_>>::into(SimpleTokenizer),
&text,
&terms,
3,
);
let fragments = search_fragments(&From::from(SimpleTokenizer), &text, &terms, 3);
assert_eq!(fragments.len(), 1);
{
@@ -453,12 +428,7 @@ Survey in 2016, 2017, and 2018."#;
let mut terms = BTreeMap::new();
terms.insert(String::from("f"), 1.0);
let fragments = search_fragments(
&Into::<TextAnalyzer<_>>::into(SimpleTokenizer),
&text,
&terms,
3,
);
let fragments = search_fragments(&From::from(SimpleTokenizer), &text, &terms, 3);
assert_eq!(fragments.len(), 2);
{
@@ -481,12 +451,7 @@ Survey in 2016, 2017, and 2018."#;
terms.insert(String::from("f"), 1.0);
terms.insert(String::from("a"), 0.9);
let fragments = search_fragments(
&Into::<TextAnalyzer<_>>::into(SimpleTokenizer),
&text,
&terms,
7,
);
let fragments = search_fragments(&From::from(SimpleTokenizer), &text, &terms, 7);
assert_eq!(fragments.len(), 2);
{
@@ -508,12 +473,7 @@ Survey in 2016, 2017, and 2018."#;
let mut terms = BTreeMap::new();
terms.insert(String::from("z"), 1.0);
let fragments = search_fragments(
&Into::<TextAnalyzer<_>>::into(SimpleTokenizer),
&text,
&terms,
3,
);
let fragments = search_fragments(&From::from(SimpleTokenizer), &text, &terms, 3);
assert_eq!(fragments.len(), 0);
@@ -527,12 +487,7 @@ Survey in 2016, 2017, and 2018."#;
let text = "a b c d";
let terms = BTreeMap::new();
let fragments = search_fragments(
&Into::<TextAnalyzer<_>>::into(SimpleTokenizer),
&text,
&terms,
3,
);
let fragments = search_fragments(&From::from(SimpleTokenizer), &text, &terms, 3);
assert_eq!(fragments.len(), 0);
let snippet = select_best_fragment_combination(&fragments[..], &text);
@@ -617,12 +572,12 @@ Survey in 2016, 2017, and 2018."#;
let mut snippet_generator =
SnippetGenerator::create(&searcher, &*query, text_field).unwrap();
{
let snippet = snippet_generator.snippet(TEST_TEXT.into());
let snippet = snippet_generator.snippet(TEST_TEXT);
assert_eq!(snippet.to_html(), "imperative-procedural paradigms. <b>Rust</b> is syntactically similar to C++[according to whom?],\nbut its <b>designers</b> intend it to provide better memory safety");
}
{
snippet_generator.set_max_num_chars(90);
let snippet = snippet_generator.snippet(TEST_TEXT.into());
let snippet = snippet_generator.snippet(TEST_TEXT);
assert_eq!(snippet.to_html(), "<b>Rust</b> is syntactically similar to C++[according to whom?],\nbut its <b>designers</b> intend it to");
}
}
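Since `snippet` and `snippet_from_doc` now take `&self`, a generator can be built once and reused, or shared by reference across threads; mutation is only needed for knobs like `set_max_num_chars`. A minimal usage sketch (searcher, `query` and `text_field` setup elided):

    let snippet_generator = SnippetGenerator::create(&searcher, &*query, text_field)?;
    let snippet = snippet_generator.snippet("Rust is a systems programming language");
    println!("{}", snippet.to_html());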

View File

@@ -43,6 +43,9 @@ impl CheckpointBlock {
/// Adds another checkpoint to the block.
pub fn push(&mut self, checkpoint: Checkpoint) {
if let Some(prev_checkpoint) = self.checkpoints.last() {
assert!(checkpoint.follows(prev_checkpoint));
}
self.checkpoints.push(checkpoint);
}

View File

@@ -26,6 +26,12 @@ pub struct Checkpoint {
pub end_offset: u64,
}
impl Checkpoint {
pub(crate) fn follows(&self, other: &Checkpoint) -> bool {
(self.start_doc == other.end_doc) && (self.start_offset == other.end_offset)
}
}
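In plain terms, a checkpoint `follows` another when it starts exactly where the previous one ends, in both doc ids and byte offsets. A small illustration using the fields above (the values are made up):

    let a = Checkpoint { start_doc: 0, end_doc: 3, start_offset: 0, end_offset: 9 };
    let b = Checkpoint { start_doc: 3, end_doc: 7, start_offset: 9, end_offset: 25 };
    assert!(b.follows(&a));  // contiguous
    assert!(!a.follows(&b)); // order matters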
impl fmt::Debug for Checkpoint {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
write!(
@@ -39,13 +45,16 @@ impl fmt::Debug for Checkpoint {
#[cfg(test)]
mod tests {
use std::io;
use std::{io, iter};
use futures::executor::block_on;
use proptest::strategy::{BoxedStrategy, Strategy};
use crate::directory::OwnedBytes;
use crate::indexer::NoMergePolicy;
use crate::schema::{SchemaBuilder, STORED, STRING};
use crate::store::index::Checkpoint;
use crate::DocId;
use crate::{DocAddress, DocId, Index, Term};
use super::{SkipIndex, SkipIndexBuilder};
@@ -54,7 +63,7 @@ mod tests {
let mut output: Vec<u8> = Vec::new();
let skip_index_builder: SkipIndexBuilder = SkipIndexBuilder::new();
skip_index_builder.write(&mut output)?;
let skip_index: SkipIndex = SkipIndex::from(OwnedBytes::new(output));
let skip_index: SkipIndex = SkipIndex::open(OwnedBytes::new(output));
let mut skip_cursor = skip_index.checkpoints();
assert!(skip_cursor.next().is_none());
Ok(())
@@ -72,7 +81,7 @@ mod tests {
};
skip_index_builder.insert(checkpoint);
skip_index_builder.write(&mut output)?;
let skip_index: SkipIndex = SkipIndex::from(OwnedBytes::new(output));
let skip_index: SkipIndex = SkipIndex::open(OwnedBytes::new(output));
let mut skip_cursor = skip_index.checkpoints();
assert_eq!(skip_cursor.next(), Some(checkpoint));
assert_eq!(skip_cursor.next(), None);
@@ -86,7 +95,7 @@ mod tests {
Checkpoint {
start_doc: 0,
end_doc: 3,
start_offset: 4,
start_offset: 0,
end_offset: 9,
},
Checkpoint {
@@ -121,7 +130,7 @@ mod tests {
}
skip_index_builder.write(&mut output)?;
let skip_index: SkipIndex = SkipIndex::from(OwnedBytes::new(output));
let skip_index: SkipIndex = SkipIndex::open(OwnedBytes::new(output));
assert_eq!(
&skip_index.checkpoints().collect::<Vec<_>>()[..],
&checkpoints[..]
@@ -133,6 +142,40 @@ mod tests {
(doc as u64) * (doc as u64)
}
#[test]
fn test_merge_store_with_stacking_reproducing_issue969() -> crate::Result<()> {
let mut schema_builder = SchemaBuilder::default();
let text = schema_builder.add_text_field("text", STORED | STRING);
let body = schema_builder.add_text_field("body", STORED);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests()?;
index_writer.set_merge_policy(Box::new(NoMergePolicy));
let long_text: String = iter::repeat("abcdefghijklmnopqrstuvwxyz")
.take(1_000)
.collect();
for _ in 0..20 {
index_writer.add_document(doc!(body=>long_text.clone()));
}
index_writer.commit()?;
index_writer.add_document(doc!(text=>"testb"));
for _ in 0..10 {
index_writer.add_document(doc!(text=>"testd", body=>long_text.clone()));
}
index_writer.commit()?;
index_writer.delete_term(Term::from_field_text(text, "testb"));
index_writer.commit()?;
let segment_ids = index.searchable_segment_ids()?;
block_on(index_writer.merge(&segment_ids))?;
let reader = index.reader()?;
let searcher = reader.searcher();
assert_eq!(searcher.num_docs(), 30);
for i in 0..searcher.num_docs() as u32 {
let _doc = searcher.doc(DocAddress(0u32, i))?;
}
Ok(())
}
#[test]
fn test_skip_index_long() -> io::Result<()> {
let mut output: Vec<u8> = Vec::new();
@@ -150,26 +193,28 @@ mod tests {
}
skip_index_builder.write(&mut output)?;
assert_eq!(output.len(), 4035);
let resulting_checkpoints: Vec<Checkpoint> = SkipIndex::from(OwnedBytes::new(output))
let resulting_checkpoints: Vec<Checkpoint> = SkipIndex::open(OwnedBytes::new(output))
.checkpoints()
.collect();
assert_eq!(&resulting_checkpoints, &checkpoints);
Ok(())
}
fn integrate_delta(mut vals: Vec<u64>) -> Vec<u64> {
fn integrate_delta(vals: Vec<u64>) -> Vec<u64> {
let mut output = Vec::with_capacity(vals.len() + 1);
output.push(0u64);
let mut prev = 0u64;
for val in vals.iter_mut() {
let new_val = *val + prev;
for val in vals {
let new_val = val + prev;
prev = new_val;
*val = new_val;
output.push(new_val);
}
vals
output
}
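Concretely, the rewritten helper now maps deltas to absolute values with a leading zero, so consecutive output pairs form valid (start, end) intervals:

    assert_eq!(integrate_delta(vec![1, 2, 3]), vec![0, 1, 3, 6]);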
// Generates a sequence of n valid checkpoints, with n < max_len.
fn monotonic_checkpoints(max_len: usize) -> BoxedStrategy<Vec<Checkpoint>> {
(1..max_len)
(0..max_len)
.prop_flat_map(move |len: usize| {
(
proptest::collection::vec(1u64..20u64, len as usize).prop_map(integrate_delta),
@@ -221,7 +266,7 @@ mod tests {
}
let mut buffer = Vec::new();
skip_index_builder.write(&mut buffer).unwrap();
let skip_index = SkipIndex::from(OwnedBytes::new(buffer));
let skip_index = SkipIndex::open(OwnedBytes::new(buffer));
let iter_checkpoints: Vec<Checkpoint> = skip_index.checkpoints().collect();
assert_eq!(&checkpoints[..], &iter_checkpoints[..]);
test_skip_index_aux(skip_index, &checkpoints[..]);

View File

@@ -59,6 +59,24 @@ pub struct SkipIndex {
}
impl SkipIndex {
pub fn open(mut data: OwnedBytes) -> SkipIndex {
let offsets: Vec<u64> = Vec::<VInt>::deserialize(&mut data)
.unwrap()
.into_iter()
.map(|el| el.0)
.collect();
let mut start_offset = 0;
let mut layers = Vec::new();
for end_offset in offsets {
let layer = Layer {
data: data.slice(start_offset as usize, end_offset as usize),
};
layers.push(layer);
start_offset = end_offset;
}
SkipIndex { layers }
}
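For orientation, `open` assumes the layout written by the builder: a `VInt`-encoded list of cumulative layer end offsets, followed by the concatenated layer bytes. A comment-sketch of how the slicing plays out (offset values are made up):

    // header: VInt vec [4, 10]  -> two layers
    // body:   bytes 0..4 = layer 0, bytes 4..10 = layer 1
    // i.e. data.slice(0, 4) and data.slice(4, 10)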
pub(crate) fn checkpoints(&self) -> impl Iterator<Item = Checkpoint> + '_ {
self.layers
.last()
@@ -90,22 +108,3 @@ impl SkipIndex {
Some(cur_checkpoint)
}
}
impl From<OwnedBytes> for SkipIndex {
fn from(mut data: OwnedBytes) -> SkipIndex {
let offsets: Vec<u64> = Vec::<VInt>::deserialize(&mut data)
.unwrap()
.into_iter()
.map(|el| el.0)
.collect();
let mut start_offset = 0;
let mut layers = Vec::new();
for end_offset in offsets {
layers.push(Layer {
data: data.slice(start_offset as usize, end_offset as usize),
});
start_offset = end_offset;
}
SkipIndex { layers }
}
}

View File

@@ -28,18 +28,20 @@ impl LayerBuilder {
///
/// If the block was empty to begin with, simply return None.
fn flush_block(&mut self) -> Option<Checkpoint> {
self.block.doc_interval().map(|(start_doc, end_doc)| {
if let Some((start_doc, end_doc)) = self.block.doc_interval() {
let start_offset = self.buffer.len() as u64;
self.block.serialize(&mut self.buffer);
let end_offset = self.buffer.len() as u64;
self.block.clear();
Checkpoint {
Some(Checkpoint {
start_doc,
end_doc,
start_offset,
end_offset,
}
})
})
} else {
None
}
}
fn push(&mut self, checkpoint: Checkpoint) {
@@ -48,7 +50,7 @@ impl LayerBuilder {
fn insert(&mut self, checkpoint: Checkpoint) -> Option<Checkpoint> {
self.push(checkpoint);
let emit_skip_info = (self.block.len() % CHECKPOINT_PERIOD) == 0;
let emit_skip_info = self.block.len() >= CHECKPOINT_PERIOD;
if emit_skip_info {
self.flush_block()
} else {

View File

@@ -35,7 +35,7 @@ impl StoreReader {
let (data_file, offset_index_file) = split_file(store_file)?;
let index_data = offset_index_file.read_bytes()?;
let space_usage = StoreSpaceUsage::new(data_file.len(), offset_index_file.len());
let skip_index = SkipIndex::from(index_data);
let skip_index = SkipIndex::open(index_data);
Ok(StoreReader {
data: data_file,
cache: Arc::new(Mutex::new(LruCache::new(LRU_CACHE_CAPACITY))),

View File

@@ -72,6 +72,7 @@ impl StoreWriter {
if !self.current_block.is_empty() {
self.write_and_compress_block()?;
}
assert_eq!(self.first_doc_in_block, self.doc);
let doc_shift = self.doc;
let start_shift = self.writer.written_bytes() as u64;
@@ -86,12 +87,17 @@ impl StoreWriter {
checkpoint.end_doc += doc_shift;
checkpoint.start_offset += start_shift;
checkpoint.end_offset += start_shift;
self.offset_index_writer.insert(checkpoint);
self.doc = checkpoint.end_doc;
self.register_checkpoint(checkpoint);
}
Ok(())
}
fn register_checkpoint(&mut self, checkpoint: Checkpoint) {
self.offset_index_writer.insert(checkpoint);
self.first_doc_in_block = checkpoint.end_doc;
self.doc = checkpoint.end_doc;
}
fn write_and_compress_block(&mut self) -> io::Result<()> {
assert!(self.doc > 0);
self.intermediary_buffer.clear();
@@ -100,14 +106,13 @@ impl StoreWriter {
self.writer.write_all(&self.intermediary_buffer)?;
let end_offset = self.writer.written_bytes();
let end_doc = self.doc;
self.offset_index_writer.insert(Checkpoint {
self.register_checkpoint(Checkpoint {
start_doc: self.first_doc_in_block,
end_doc,
start_offset,
end_offset,
});
self.current_block.clear();
self.first_doc_in_block = self.doc;
Ok(())
}
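Reading this hunk together with the `assert_eq!` added to `stack` above explains the fix: the stacking path used to advance `self.doc` without touching `first_doc_in_block`, so the next flushed block claimed a doc interval starting at a stale value. A comment-trace of the failure that `test_merge_store_with_stacking_reproducing_issue969` exercises (an interpretation of this diff, not verbatim source):

    // before: stack(store ending at doc 20) -> doc = 20, first_doc_in_block still 0
    // next write_and_compress_block()       -> checkpoint start_doc = 0 instead of 20
    // after: register_checkpoint moves both counters, and stack() asserts they agree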

View File

@@ -2,16 +2,16 @@
//! ```rust
//! use tantivy::tokenizer::*;
//!
//! let tokenizer = analyzer_builder(RawTokenizer)
//! .filter(AlphaNumOnlyFilter).build();
//! let tokenizer = TextAnalyzer::from(RawTokenizer)
//! .filter(AlphaNumOnlyFilter);
//!
//! let mut stream = tokenizer.token_stream("hello there");
//! // is none because the raw filter emits one token that
//! // contains a space
//! assert!(stream.next().is_none());
//!
//! let tokenizer = analyzer_builder(SimpleTokenizer)
//! .filter(AlphaNumOnlyFilter).build();
//! let tokenizer = TextAnalyzer::from(SimpleTokenizer)
//! .filter(AlphaNumOnlyFilter);
//!
//! let mut stream = tokenizer.token_stream("hello there 💣");
//! assert!(stream.next().is_some());
@@ -19,18 +19,45 @@
//! // the "emoji" is dropped because it's not alphanumeric
//! assert!(stream.next().is_none());
//! ```
use super::{Token, TokenFilter};
use super::{BoxTokenStream, Token, TokenFilter, TokenStream};
/// `TokenFilter` that removes all tokens containing a character
/// that is not ASCII alphanumeric.
#[derive(Clone, Debug, Default)]
#[derive(Clone)]
pub struct AlphaNumOnlyFilter;
impl TokenFilter for AlphaNumOnlyFilter {
fn transform(&mut self, token: Token) -> Option<Token> {
if token.text.chars().all(|c| c.is_ascii_alphanumeric()) {
return Some(token);
}
None
pub struct AlphaNumOnlyFilterStream<'a> {
tail: BoxTokenStream<'a>,
}
impl<'a> AlphaNumOnlyFilterStream<'a> {
fn predicate(&self, token: &Token) -> bool {
token.text.chars().all(|c| c.is_ascii_alphanumeric())
}
}
impl TokenFilter for AlphaNumOnlyFilter {
fn transform<'a>(&self, token_stream: BoxTokenStream<'a>) -> BoxTokenStream<'a> {
BoxTokenStream::from(AlphaNumOnlyFilterStream { tail: token_stream })
}
}
impl<'a> TokenStream for AlphaNumOnlyFilterStream<'a> {
fn advance(&mut self) -> bool {
while self.tail.advance() {
if self.predicate(self.tail.token()) {
return true;
}
}
false
}
fn token(&self) -> &Token {
self.tail.token()
}
fn token_mut(&mut self) -> &mut Token {
self.tail.token_mut()
}
}
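The shape above (a unit filter struct whose `transform` wraps the tail stream, delegating `token`/`token_mut` downstream) is the template for any custom filter under the new API. A minimal sketch of a hypothetical `SkipEmptyFilter` that drops empty tokens, mirroring `AlphaNumOnlyFilter`:

    use tantivy::tokenizer::{BoxTokenStream, Token, TokenFilter, TokenStream};

    #[derive(Clone)]
    pub struct SkipEmptyFilter;

    impl TokenFilter for SkipEmptyFilter {
        fn transform<'a>(&self, token_stream: BoxTokenStream<'a>) -> BoxTokenStream<'a> {
            BoxTokenStream::from(SkipEmptyFilterStream { tail: token_stream })
        }
    }

    pub struct SkipEmptyFilterStream<'a> {
        tail: BoxTokenStream<'a>,
    }

    impl<'a> TokenStream for SkipEmptyFilterStream<'a> {
        fn advance(&mut self) -> bool {
            // pull from the tail until a non-empty token shows up
            while self.tail.advance() {
                if !self.tail.token().text.is_empty() {
                    return true;
                }
            }
            false
        }

        fn token(&self) -> &Token {
            self.tail.token()
        }

        fn token_mut(&mut self) -> &mut Token {
            self.tail.token_mut()
        }
    }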

View File

@@ -1,31 +1,45 @@
use super::{Token, TokenFilter};
use super::{BoxTokenStream, Token, TokenFilter, TokenStream};
use std::mem;
/// This filter converts alphabetic, numeric, and symbolic Unicode characters
/// which are not in the first 127 ASCII characters (the "Basic Latin" Unicode
/// block) into their ASCII equivalents, if one exists.
#[derive(Clone, Debug, Default)]
pub struct AsciiFolding {
buffer: String,
}
#[derive(Clone)]
pub struct AsciiFoldingFilter;
impl AsciiFolding {
/// Construct a new `AsciiFolding` filter.
pub fn new() -> Self {
Self {
impl TokenFilter for AsciiFoldingFilter {
fn transform<'a>(&self, token_stream: BoxTokenStream<'a>) -> BoxTokenStream<'a> {
From::from(AsciiFoldingFilterTokenStream {
tail: token_stream,
buffer: String::with_capacity(100),
}
})
}
}
impl TokenFilter for AsciiFolding {
fn transform(&mut self, mut token: Token) -> Option<Token> {
if !token.text.is_ascii() {
// nothing to do: it's already ascii
to_ascii(&token.text, &mut self.buffer);
mem::swap(&mut token.text, &mut self.buffer);
pub struct AsciiFoldingFilterTokenStream<'a> {
buffer: String,
tail: BoxTokenStream<'a>,
}
impl<'a> TokenStream for AsciiFoldingFilterTokenStream<'a> {
fn advance(&mut self) -> bool {
if !self.tail.advance() {
return false;
}
Some(token)
if !self.token_mut().text.is_ascii() {
// nothing to do: it's already ascii
to_ascii(&mut self.tail.token_mut().text, &mut self.buffer);
mem::swap(&mut self.tail.token_mut().text, &mut self.buffer);
}
true
}
fn token(&self) -> &Token {
self.tail.token()
}
fn token_mut(&mut self) -> &mut Token {
self.tail.token_mut()
}
}
@@ -1512,7 +1526,7 @@ fn fold_non_ascii_char(c: char) -> Option<&'static str> {
}
// https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.java#L187
fn to_ascii(text: &String, output: &mut String) {
fn to_ascii(text: &mut String, output: &mut String) {
output.clear();
for c in text.chars() {
@@ -1526,8 +1540,11 @@ fn to_ascii(text: &String, output: &mut String) {
#[cfg(test)]
mod tests {
use super::super::*;
use super::*;
use super::to_ascii;
use crate::tokenizer::AsciiFoldingFilter;
use crate::tokenizer::RawTokenizer;
use crate::tokenizer::SimpleTokenizer;
use crate::tokenizer::TextAnalyzer;
use std::iter;
#[test]
@@ -1543,22 +1560,22 @@ mod tests {
}
fn folding_helper(text: &str) -> Vec<String> {
let tokens = analyzer_builder(SimpleTokenizer)
.filter(AsciiFolding::new())
.build()
let mut tokens = Vec::new();
TextAnalyzer::from(SimpleTokenizer)
.filter(AsciiFoldingFilter)
.token_stream(text)
.map(|token| token.text.clone())
.collect();
.process(&mut |token| {
tokens.push(token.text.clone());
});
tokens
}
fn folding_using_raw_tokenizer_helper(text: &str) -> String {
let mut token_stream = analyzer_builder(RawTokenizer)
.filter(AsciiFolding::new())
.build()
let mut token_stream = TextAnalyzer::from(RawTokenizer)
.filter(AsciiFoldingFilter)
.token_stream(text);
let Token { text, .. } = token_stream.next().unwrap();
text
token_stream.advance();
token_stream.token().text.clone()
}
#[test]
@@ -1609,9 +1626,9 @@ mod tests {
#[test]
fn test_to_ascii() {
let input = "Rámon".to_string();
let mut input = "Rámon".to_string();
let mut buffer = String::new();
to_ascii(&input, &mut buffer);
to_ascii(&mut input, &mut buffer);
assert_eq!("Ramon", buffer);
}

View File

@@ -1,4 +1,4 @@
use super::{Token, Tokenizer};
use super::{BoxTokenStream, Token, TokenStream, Tokenizer};
use crate::schema::FACET_SEP_BYTE;
/// The `FacetTokenizer` processes a `Facet` binary representation
@@ -9,63 +9,72 @@ use crate::schema::FACET_SEP_BYTE;
/// - `/america/north_america/canada`
/// - `/america/north_america`
/// - `/america`
#[derive(Clone, Debug, Default)]
#[derive(Clone)]
pub struct FacetTokenizer;
#[derive(Clone, Debug)]
#[derive(Debug)]
enum State {
RootFacetNotEmitted,
UpToPosition(usize), //< we already emitted facet prefix up to &text[..cursor]
Terminated,
}
#[derive(Clone, Debug)]
pub struct FacetTokenStream {
text: String,
pub struct FacetTokenStream<'a> {
text: &'a str,
state: State,
token: Token,
}
impl Tokenizer for FacetTokenizer {
type Iter = FacetTokenStream;
fn token_stream(&self, text: &str) -> Self::Iter {
fn token_stream<'a>(&self, text: &'a str) -> BoxTokenStream<'a> {
FacetTokenStream {
text: text.to_string(),
text,
state: State::RootFacetNotEmitted, //< pos is the first char that has not been processed yet.
token: Token::default(),
}
.into()
}
}
impl Iterator for FacetTokenStream {
type Item = Token;
fn next(&mut self) -> Option<Self::Item> {
self.state = match self.state {
impl<'a> TokenStream for FacetTokenStream<'a> {
fn advance(&mut self) -> bool {
match self.state {
State::RootFacetNotEmitted => {
if self.text.is_empty() {
self.state = if self.text.is_empty() {
State::Terminated
} else {
State::UpToPosition(0)
}
};
true
}
State::UpToPosition(cursor) => {
if let Some(next_sep_pos) = self.text.as_bytes()[cursor + 1..]
let bytes: &[u8] = self.text.as_bytes();
if let Some(next_sep_pos) = bytes[cursor + 1..]
.iter()
.position(|&b| b == FACET_SEP_BYTE)
.cloned()
.position(|b| b == FACET_SEP_BYTE)
.map(|pos| cursor + 1 + pos)
{
let facet_part = &self.text[cursor..next_sep_pos];
self.token.text.push_str(facet_part);
State::UpToPosition(next_sep_pos)
self.state = State::UpToPosition(next_sep_pos);
} else {
let facet_part = &self.text[cursor..];
self.token.text.push_str(facet_part);
State::Terminated
self.state = State::Terminated;
}
true
}
State::Terminated => return None,
};
Some(self.token.clone())
State::Terminated => false,
}
}
fn token(&self) -> &Token {
&self.token
}
fn token_mut(&mut self) -> &mut Token {
&mut self.token
}
}
@@ -74,19 +83,21 @@ mod tests {
use super::FacetTokenizer;
use crate::schema::Facet;
use crate::tokenizer::Tokenizer;
use crate::tokenizer::{Token, Tokenizer};
#[test]
fn test_facet_tokenizer() {
let facet = Facet::from_path(vec!["top", "a", "b"]);
let tokens: Vec<_> = FacetTokenizer
.token_stream(facet.encoded_str())
.map(|token| {
Facet::from_encoded(token.text.as_bytes().to_owned())
.unwrap()
.to_string()
})
.collect();
let mut tokens = vec![];
{
let mut add_token = |token: &Token| {
let facet = Facet::from_encoded(token.text.as_bytes().to_owned()).unwrap();
tokens.push(format!("{}", facet));
};
FacetTokenizer
.token_stream(facet.encoded_str())
.process(&mut add_token);
}
assert_eq!(tokens.len(), 4);
assert_eq!(tokens[0], "/");
assert_eq!(tokens[1], "/top");
@@ -97,14 +108,16 @@ mod tests {
#[test]
fn test_facet_tokenizer_root_facets() {
let facet = Facet::root();
let tokens: Vec<_> = FacetTokenizer
.token_stream(facet.encoded_str())
.map(|token| {
Facet::from_encoded(token.text.as_bytes().to_owned())
.unwrap()
.to_string()
})
.collect();
let mut tokens = vec![];
{
let mut add_token = |token: &Token| {
let facet = Facet::from_encoded(token.text.as_bytes().to_owned()).unwrap(); // ok test
tokens.push(format!("{}", facet));
};
FacetTokenizer
.token_stream(facet.encoded_str()) // ok test
.process(&mut add_token);
}
assert_eq!(tokens.len(), 1);
assert_eq!(tokens[0], "/");
}

View File

@@ -1,36 +1,27 @@
use super::{Token, TokenFilter};
use super::{Token, TokenFilter, TokenStream};
use crate::tokenizer::BoxTokenStream;
use std::mem;
impl TokenFilter for LowerCaser {
fn transform(&mut self, mut token: Token) -> Option<Token> {
if token.text.is_ascii() {
// fast track for ascii.
token.text.make_ascii_lowercase();
} else {
to_lowercase_unicode(&token.text, &mut self.buffer);
mem::swap(&mut token.text, &mut self.buffer);
}
Some(token)
fn transform<'a>(&self, token_stream: BoxTokenStream<'a>) -> BoxTokenStream<'a> {
BoxTokenStream::from(LowerCaserTokenStream {
tail: token_stream,
buffer: String::with_capacity(100),
})
}
}
/// Token filter that lowercases terms.
#[derive(Clone, Debug, Default)]
pub struct LowerCaser {
buffer: String,
}
#[derive(Clone)]
pub struct LowerCaser;
impl LowerCaser {
/// Initialize the `LowerCaser`
pub fn new() -> Self {
LowerCaser {
buffer: String::with_capacity(100),
}
}
pub struct LowerCaserTokenStream<'a> {
buffer: String,
tail: BoxTokenStream<'a>,
}
// writes a lowercased version of text into output.
fn to_lowercase_unicode(text: &String, output: &mut String) {
fn to_lowercase_unicode(text: &mut String, output: &mut String) {
output.clear();
for c in text.chars() {
// Contrary to the std, we do not take care of sigma special case.
@@ -39,31 +30,57 @@ fn to_lowercase_unicode(text: &String, output: &mut String) {
}
}
impl<'a> TokenStream for LowerCaserTokenStream<'a> {
fn advance(&mut self) -> bool {
if !self.tail.advance() {
return false;
}
if self.token_mut().text.is_ascii() {
// fast track for ascii.
self.token_mut().text.make_ascii_lowercase();
} else {
to_lowercase_unicode(&mut self.tail.token_mut().text, &mut self.buffer);
mem::swap(&mut self.tail.token_mut().text, &mut self.buffer);
}
true
}
fn token(&self) -> &Token {
self.tail.token()
}
fn token_mut(&mut self) -> &mut Token {
self.tail.token_mut()
}
}
#[cfg(test)]
mod tests {
use super::*;
use crate::tokenizer::{analyzer_builder, LowerCaser, SimpleTokenizer, TextAnalyzerT};
use crate::tokenizer::{LowerCaser, SimpleTokenizer, TextAnalyzer};
#[test]
fn test_to_lower_case() {
assert_eq!(lowercase_helper("Русский текст"), vec!["русский", "текст"]);
assert_eq!(
lowercase_helper("Русский текст"),
vec!["русский".to_string(), "текст".to_string()]
);
}
fn lowercase_helper(text: &str) -> Vec<String> {
analyzer_builder(SimpleTokenizer)
.filter(LowerCaser::new())
.build()
.token_stream(text)
.map(|token| {
let Token { text, .. } = token;
text
})
.collect()
let mut tokens = vec![];
let mut token_stream = TextAnalyzer::from(SimpleTokenizer)
.filter(LowerCaser)
.token_stream(text);
while token_stream.advance() {
let token_text = token_stream.token().text.clone();
tokens.push(token_text);
}
tokens
}
#[test]
fn test_lowercaser() {
assert_eq!(lowercase_helper("Tree"), vec!["tree"]);
assert_eq!(lowercase_helper("Русский"), vec!["русский"]);
assert_eq!(lowercase_helper("Tree"), vec!["tree".to_string()]);
assert_eq!(lowercase_helper("Русский"), vec!["русский".to_string()]);
}
}

View File

@@ -64,10 +64,10 @@
//! ```rust
//! use tantivy::tokenizer::*;
//!
//! let en_stem = analyzer_builder(SimpleTokenizer)
//! let en_stem = TextAnalyzer::from(SimpleTokenizer)
//! .filter(RemoveLongFilter::limit(40))
//! .filter(LowerCaser::new())
//! .filter(Stemmer::new(Language::English)).build();
//! .filter(LowerCaser)
//! .filter(Stemmer::new(Language::English));
//! ```
//!
//! Once your tokenizer is defined, you need to
@@ -109,9 +109,9 @@
//! let index = Index::create_in_ram(schema);
//!
//! // We need to register our tokenizer :
//! let custom_en_tokenizer = analyzer_builder(SimpleTokenizer)
//! let custom_en_tokenizer = TextAnalyzer::from(SimpleTokenizer)
//! .filter(RemoveLongFilter::limit(40))
//! .filter(LowerCaser::new()).build();
//! .filter(LowerCaser);
//! index
//! .tokenizers()
//! .register("custom_en", custom_en_tokenizer);
@@ -133,7 +133,7 @@ mod tokenizer;
mod tokenizer_manager;
pub use self::alphanum_only::AlphaNumOnlyFilter;
pub use self::ascii_folding_filter::AsciiFolding;
pub use self::ascii_folding_filter::AsciiFoldingFilter;
pub use self::facet_tokenizer::FacetTokenizer;
pub use self::lower_caser::LowerCaser;
pub use self::ngram_tokenizer::NgramTokenizer;
@@ -142,11 +142,11 @@ pub use self::remove_long::RemoveLongFilter;
pub use self::simple_tokenizer::SimpleTokenizer;
pub use self::stemmer::{Language, Stemmer};
pub use self::stop_word_filter::StopWordFilter;
pub(crate) use self::token_stream_chain::{DynTokenStreamChain, TokenStreamChain};
pub(crate) use self::token_stream_chain::TokenStreamChain;
pub use self::tokenized_string::{PreTokenizedStream, PreTokenizedString};
pub use self::tokenizer::{
analyzer_builder, Identity, TextAnalyzer, TextAnalyzerT, Token, TokenFilter, Tokenizer,
BoxTokenFilter, BoxTokenStream, TextAnalyzer, Token, TokenFilter, TokenStream, Tokenizer,
};
pub use self::tokenizer_manager::TokenizerManager;
@@ -160,7 +160,10 @@ pub const MAX_TOKEN_LEN: usize = u16::max_value() as usize - 4;
#[cfg(test)]
pub mod tests {
use super::*;
use super::{
Language, LowerCaser, RemoveLongFilter, SimpleTokenizer, Stemmer, Token, TokenizerManager,
};
use crate::tokenizer::TextAnalyzer;
/// This is a function that can be used in tests and doc tests
/// to assert a token's correctness.
@@ -187,9 +190,15 @@ pub mod tests {
fn test_raw_tokenizer() {
let tokenizer_manager = TokenizerManager::default();
let en_tokenizer = tokenizer_manager.get("raw").unwrap();
let tokens: Vec<Token> = en_tokenizer
.token_stream("Hello, happy tax payer!")
.collect();
let mut tokens: Vec<Token> = vec![];
{
let mut add_token = |token: &Token| {
tokens.push(token.clone());
};
en_tokenizer
.token_stream("Hello, happy tax payer!")
.process(&mut add_token);
}
assert_eq!(tokens.len(), 1);
assert_token(&tokens[0], 0, "Hello, happy tax payer!", 0, 23);
}
@@ -199,9 +208,15 @@ pub mod tests {
let tokenizer_manager = TokenizerManager::default();
assert!(tokenizer_manager.get("en_doesnotexist").is_none());
let en_tokenizer = tokenizer_manager.get("en_stem").unwrap();
let tokens: Vec<Token> = en_tokenizer
.token_stream("Hello, happy tax payer!")
.collect();
let mut tokens: Vec<Token> = vec![];
{
let mut add_token = |token: &Token| {
tokens.push(token.clone());
};
en_tokenizer
.token_stream("Hello, happy tax payer!")
.process(&mut add_token);
}
assert_eq!(tokens.len(), 4);
assert_token(&tokens[0], 0, "hello", 0, 5);
@@ -215,16 +230,21 @@ pub mod tests {
let tokenizer_manager = TokenizerManager::default();
tokenizer_manager.register(
"el_stem",
analyzer_builder(SimpleTokenizer)
TextAnalyzer::from(SimpleTokenizer)
.filter(RemoveLongFilter::limit(40))
.filter(LowerCaser::new())
.filter(Stemmer::new(Language::Greek))
.build(),
.filter(LowerCaser)
.filter(Stemmer::new(Language::Greek)),
);
let en_tokenizer = tokenizer_manager.get("el_stem").unwrap();
let tokens: Vec<Token> = en_tokenizer
.token_stream("Καλημέρα, χαρούμενε φορολογούμενε!")
.collect();
let mut tokens: Vec<Token> = vec![];
{
let mut add_token = |token: &Token| {
tokens.push(token.clone());
};
en_tokenizer
.token_stream("Καλημέρα, χαρούμενε φορολογούμενε!")
.process(&mut add_token);
}
assert_eq!(tokens.len(), 3);
assert_token(&tokens[0], 0, "καλημερ", 0, 16);
@@ -236,9 +256,25 @@ pub mod tests {
fn test_tokenizer_empty() {
let tokenizer_manager = TokenizerManager::default();
let en_tokenizer = tokenizer_manager.get("en_stem").unwrap();
let tokens: Vec<Token> = en_tokenizer.token_stream(" ").collect();
assert!(tokens.is_empty());
let tokens: Vec<Token> = en_tokenizer.token_stream(" ").collect();
assert!(tokens.is_empty());
{
let mut tokens: Vec<Token> = vec![];
{
let mut add_token = |token: &Token| {
tokens.push(token.clone());
};
en_tokenizer.token_stream(" ").process(&mut add_token);
}
assert!(tokens.is_empty());
}
{
let mut tokens: Vec<Token> = vec![];
{
let mut add_token = |token: &Token| {
tokens.push(token.clone());
};
en_tokenizer.token_stream(" ").process(&mut add_token);
}
assert!(tokens.is_empty());
}
}
}

View File

@@ -1,4 +1,5 @@
use super::{Token, Tokenizer};
use super::{Token, TokenStream, Tokenizer};
use crate::tokenizer::BoxTokenStream;
/// Tokenize the text by splitting words into n-grams of the given size(s)
///
@@ -78,7 +79,7 @@ use super::{Token, Tokenizer};
/// }
/// assert!(stream.next().is_none());
/// ```
#[derive(Clone, Debug, Default)]
#[derive(Clone)]
pub struct NgramTokenizer {
/// min size of the n-gram
min_gram: usize,
@@ -118,48 +119,54 @@ impl NgramTokenizer {
}
/// TokenStream associated with the `NgramTokenizer`
pub struct NgramTokenStream {
pub struct NgramTokenStream<'a> {
/// parameters
ngram_charidx_iterator: StutteringIterator<CodepointFrontiers>,
ngram_charidx_iterator: StutteringIterator<CodepointFrontiers<'a>>,
/// true if the NgramTokenStream is in prefix mode.
prefix_only: bool,
/// input
text: String,
text: &'a str,
/// output
token: Token,
}
impl Tokenizer for NgramTokenizer {
type Iter = NgramTokenStream;
fn token_stream(&self, text: &str) -> Self::Iter {
NgramTokenStream {
fn token_stream<'a>(&self, text: &'a str) -> BoxTokenStream<'a> {
From::from(NgramTokenStream {
ngram_charidx_iterator: StutteringIterator::new(
CodepointFrontiers::for_str(text),
self.min_gram,
self.max_gram,
),
prefix_only: self.prefix_only,
text: text.to_string(),
text,
token: Token::default(),
}
})
}
}
impl Iterator for NgramTokenStream {
type Item = Token;
fn next(&mut self) -> Option<Self::Item> {
impl<'a> TokenStream for NgramTokenStream<'a> {
fn advance(&mut self) -> bool {
if let Some((offset_from, offset_to)) = self.ngram_charidx_iterator.next() {
if self.prefix_only && offset_from > 0 {
return None;
return false;
}
self.token.position = 0;
self.token.offset_from = offset_from;
self.token.offset_to = offset_to;
self.token.text.clear();
self.token.text.push_str(&self.text[offset_from..offset_to]);
return Some(self.token.clone());
};
None
true
} else {
false
}
}
fn token(&self) -> &Token {
&self.token
}
fn token_mut(&mut self) -> &mut Token {
&mut self.token
}
}
@@ -250,21 +257,21 @@ where
/// or a codepoint ends.
///
/// By convention, we emit [0] for the empty string.
struct CodepointFrontiers {
s: String,
struct CodepointFrontiers<'a> {
s: &'a str,
next_el: Option<usize>,
}
impl CodepointFrontiers {
fn for_str(s: &str) -> Self {
impl<'a> CodepointFrontiers<'a> {
fn for_str(s: &'a str) -> Self {
CodepointFrontiers {
s: s.to_string(),
s,
next_el: Some(0),
}
}
}
impl<'a> Iterator for CodepointFrontiers {
impl<'a> Iterator for CodepointFrontiers<'a> {
type Item = usize;
fn next(&mut self) -> Option<usize> {
@@ -273,7 +280,7 @@ impl<'a> Iterator for CodepointFrontiers {
self.next_el = None;
} else {
let first_codepoint_width = utf8_codepoint_width(self.s.as_bytes()[0]);
self.s = (&self.s[first_codepoint_width..]).to_string();
self.s = &self.s[first_codepoint_width..];
self.next_el = Some(offset + first_codepoint_width);
}
offset
@@ -294,8 +301,20 @@ fn utf8_codepoint_width(b: u8) -> usize {
#[cfg(test)]
mod tests {
use super::*;
use super::utf8_codepoint_width;
use super::CodepointFrontiers;
use super::NgramTokenizer;
use super::StutteringIterator;
use crate::tokenizer::tests::assert_token;
use crate::tokenizer::tokenizer::Tokenizer;
use crate::tokenizer::{BoxTokenStream, Token};
fn test_helper(mut tokenizer: BoxTokenStream) -> Vec<Token> {
let mut tokens: Vec<Token> = vec![];
tokenizer.process(&mut |token: &Token| tokens.push(token.clone()));
tokens
}
#[test]
fn test_utf8_codepoint_width() {
@@ -332,9 +351,7 @@ mod tests {
#[test]
fn test_ngram_tokenizer_1_2_false() {
let tokens: Vec<_> = NgramTokenizer::all_ngrams(1, 2)
.token_stream("hello")
.collect();
let tokens = test_helper(NgramTokenizer::all_ngrams(1, 2).token_stream("hello"));
assert_eq!(tokens.len(), 9);
assert_token(&tokens[0], 0, "h", 0, 1);
assert_token(&tokens[1], 0, "he", 0, 2);
@@ -349,9 +366,7 @@ mod tests {
#[test]
fn test_ngram_tokenizer_min_max_equal() {
let tokens: Vec<_> = NgramTokenizer::all_ngrams(3, 3)
.token_stream("hello")
.collect();
let tokens = test_helper(NgramTokenizer::all_ngrams(3, 3).token_stream("hello"));
assert_eq!(tokens.len(), 3);
assert_token(&tokens[0], 0, "hel", 0, 3);
assert_token(&tokens[1], 0, "ell", 1, 4);
@@ -360,9 +375,7 @@ mod tests {
#[test]
fn test_ngram_tokenizer_2_5_prefix() {
let tokens: Vec<_> = NgramTokenizer::prefix_only(2, 5)
.token_stream("frankenstein")
.collect();
let tokens = test_helper(NgramTokenizer::prefix_only(2, 5).token_stream("frankenstein"));
assert_eq!(tokens.len(), 4);
assert_token(&tokens[0], 0, "fr", 0, 2);
assert_token(&tokens[1], 0, "fra", 0, 3);
@@ -372,9 +385,7 @@ mod tests {
#[test]
fn test_ngram_non_ascii_1_2() {
let tokens: Vec<_> = NgramTokenizer::all_ngrams(1, 2)
.token_stream("hεllo")
.collect();
let tokens = test_helper(NgramTokenizer::all_ngrams(1, 2).token_stream("hεllo"));
assert_eq!(tokens.len(), 9);
assert_token(&tokens[0], 0, "h", 0, 1);
assert_token(&tokens[1], 0, "hε", 0, 3);
@@ -389,9 +400,7 @@ mod tests {
#[test]
fn test_ngram_non_ascii_2_5_prefix() {
let tokens: Vec<_> = NgramTokenizer::prefix_only(2, 5)
.token_stream("hεllo")
.collect();
let tokens = test_helper(NgramTokenizer::prefix_only(2, 5).token_stream("hεllo"));
assert_eq!(tokens.len(), 4);
assert_token(&tokens[0], 0, "hε", 0, 3);
assert_token(&tokens[1], 0, "hεl", 0, 4);
@@ -401,16 +410,16 @@ mod tests {
#[test]
fn test_ngram_empty() {
let tokens: Vec<_> = NgramTokenizer::all_ngrams(1, 5).token_stream("").collect();
let tokens = test_helper(NgramTokenizer::all_ngrams(1, 5).token_stream(""));
assert!(tokens.is_empty());
let tokens: Vec<_> = NgramTokenizer::all_ngrams(2, 5).token_stream("").collect();
let tokens = test_helper(NgramTokenizer::all_ngrams(2, 5).token_stream(""));
assert!(tokens.is_empty());
}
#[test]
#[should_panic(expected = "min_gram must be greater than 0")]
fn test_ngram_min_max_interval_empty() {
NgramTokenizer::all_ngrams(0, 2).token_stream("hellossss");
test_helper(NgramTokenizer::all_ngrams(0, 2).token_stream("hellossss"));
}
#[test]

View File

@@ -1,17 +1,17 @@
use super::{Token, Tokenizer};
use super::{Token, TokenStream, Tokenizer};
use crate::tokenizer::BoxTokenStream;
/// For each value of the field, emit a single unprocessed token.
#[derive(Clone, Debug, Default)]
#[derive(Clone)]
pub struct RawTokenizer;
#[derive(Clone, Debug)]
pub struct RawTokenStream {
token: Option<Token>,
token: Token,
has_token: bool,
}
impl Tokenizer for RawTokenizer {
type Iter = RawTokenStream;
fn token_stream(&self, text: &str) -> Self::Iter {
fn token_stream<'a>(&self, text: &'a str) -> BoxTokenStream<'a> {
let token = Token {
offset_from: 0,
offset_to: text.len(),
@@ -19,13 +19,26 @@ impl Tokenizer for RawTokenizer {
text: text.to_string(),
position_length: 1,
};
RawTokenStream { token: Some(token) }
RawTokenStream {
token,
has_token: true,
}
.into()
}
}
impl Iterator for RawTokenStream {
type Item = Token;
fn next(&mut self) -> Option<Token> {
self.token.take()
impl TokenStream for RawTokenStream {
fn advance(&mut self) -> bool {
let result = self.has_token;
self.has_token = false;
result
}
fn token(&self) -> &Token {
&self.token
}
fn token_mut(&mut self) -> &mut Token {
&mut self.token
}
}
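Consumed through the new interface, the raw stream yields its single token and then reports exhaustion. A usage sketch (mirroring the raw-tokenizer test elsewhere in this module):

    let mut stream = RawTokenizer.token_stream("Hello, happy tax payer!");
    assert!(stream.advance());
    assert_eq!(stream.token().text, "Hello, happy tax payer!");
    assert!(!stream.advance()); // the single token was consumed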

View File

@@ -2,8 +2,8 @@
//! ```rust
//! use tantivy::tokenizer::*;
//!
//! let tokenizer = analyzer_builder(SimpleTokenizer)
//! .filter(RemoveLongFilter::limit(5)).build();
//! let tokenizer = TextAnalyzer::from(SimpleTokenizer)
//! .filter(RemoveLongFilter::limit(5));
//!
//! let mut stream = tokenizer.token_stream("toolong nice");
//! // because `toolong` is more than 5 characters, it is filtered
@@ -12,30 +12,61 @@
//! assert!(stream.next().is_none());
//! ```
//!
use super::{Token, TokenFilter};
use super::{Token, TokenFilter, TokenStream};
use crate::tokenizer::BoxTokenStream;
/// `RemoveLongFilter` removes tokens that are longer
/// than a given number of bytes (in UTF-8 representation).
///
/// It is especially useful when indexing unconstrained content.
/// e.g. Mail containing base-64 encoded pictures etc.
#[derive(Clone, Debug)]
#[derive(Clone)]
pub struct RemoveLongFilter {
limit: usize,
length_limit: usize,
}
impl RemoveLongFilter {
/// Creates a `RemoveLongFilter` given a limit in bytes of the UTF-8 representation.
pub fn limit(limit: usize) -> RemoveLongFilter {
RemoveLongFilter { limit }
pub fn limit(length_limit: usize) -> RemoveLongFilter {
RemoveLongFilter { length_limit }
}
}
impl<'a> RemoveLongFilterStream<'a> {
fn predicate(&self, token: &Token) -> bool {
token.text.len() < self.token_length_limit
}
}
impl TokenFilter for RemoveLongFilter {
fn transform(&mut self, token: Token) -> Option<Token> {
if token.text.len() >= self.limit {
return None;
}
Some(token)
fn transform<'a>(&self, token_stream: BoxTokenStream<'a>) -> BoxTokenStream<'a> {
BoxTokenStream::from(RemoveLongFilterStream {
token_length_limit: self.length_limit,
tail: token_stream,
})
}
}
pub struct RemoveLongFilterStream<'a> {
token_length_limit: usize,
tail: BoxTokenStream<'a>,
}
impl<'a> TokenStream for RemoveLongFilterStream<'a> {
fn advance(&mut self) -> bool {
while self.tail.advance() {
if self.predicate(self.tail.token()) {
return true;
}
}
false
}
fn token(&self) -> &Token {
self.tail.token()
}
fn token_mut(&mut self) -> &mut Token {
self.tail.token_mut()
}
}

View File

@@ -1,74 +1,59 @@
use super::{Token, Tokenizer};
use super::BoxTokenStream;
use super::{Token, TokenStream, Tokenizer};
use std::str::CharIndices;
/// Tokenize the text by splitting on whitespace and punctuation.
#[derive(Clone, Debug)]
#[derive(Clone)]
pub struct SimpleTokenizer;
pub struct SimpleTokenStream<'a> {
text: &'a str,
chars: CharIndices<'a>,
token: Token,
}
impl Tokenizer for SimpleTokenizer {
type Iter = SimpleTokenizerStream;
fn token_stream(&self, text: &str) -> Self::Iter {
let vec: Vec<_> = text.char_indices().collect();
SimpleTokenizerStream {
text: text.to_string(),
chars: vec.into_iter(),
position: usize::max_value(),
}
fn token_stream<'a>(&self, text: &'a str) -> BoxTokenStream<'a> {
BoxTokenStream::from(SimpleTokenStream {
text,
chars: text.char_indices(),
token: Token::default(),
})
}
}
#[derive(Clone, Debug)]
pub struct SimpleTokenizerStream {
text: String,
chars: std::vec::IntoIter<(usize, char)>,
position: usize,
}
impl SimpleTokenizerStream {
impl<'a> SimpleTokenStream<'a> {
// search for the end of the current token.
fn search_token_end(&mut self) -> usize {
(&mut self.chars)
.filter(|&(_, c)| !c.is_alphanumeric())
.filter(|&(_, ref c)| !c.is_alphanumeric())
.map(|(offset, _)| offset)
.next()
.unwrap_or_else(|| self.text.len())
}
}
impl Iterator for SimpleTokenizerStream {
type Item = Token;
fn next(&mut self) -> Option<Self::Item> {
self.position = self.position.wrapping_add(1);
impl<'a> TokenStream for SimpleTokenStream<'a> {
fn advance(&mut self) -> bool {
self.token.text.clear();
self.token.position = self.token.position.wrapping_add(1);
while let Some((offset_from, c)) = self.chars.next() {
if c.is_alphanumeric() {
let offset_to = self.search_token_end();
let token = Token {
text: self.text[offset_from..offset_to].into(),
offset_from,
offset_to,
position: self.position,
..Default::default()
};
return Some(token);
self.token.offset_from = offset_from;
self.token.offset_to = offset_to;
self.token.text.push_str(&self.text[offset_from..offset_to]);
return true;
}
}
None
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_empty() {
let mut empty = SimpleTokenizer.token_stream("");
assert_eq!(empty.next(), None);
}
#[test]
fn simple_tokenizer() {
let mut simple = SimpleTokenizer.token_stream("tokenizer hello world");
assert_eq!(simple.next().unwrap().text, "tokenizer");
assert_eq!(simple.next().unwrap().text, "hello");
assert_eq!(simple.next().unwrap().text, "world");
false
}
fn token(&self) -> &Token {
&self.token
}
fn token_mut(&mut self) -> &mut Token {
&mut self.token
}
}
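
The iterator-based unit tests shown above move out of this file; under the new API the equivalent check reads roughly as follows (a sketch using the `next()` helper that `TokenStream` provides):

```rust
use tantivy::tokenizer::{SimpleTokenizer, TokenStream, Tokenizer};

fn main() {
    let mut stream = SimpleTokenizer.token_stream("tokenizer hello world");
    let mut texts = Vec::new();
    // `next()` is the provided helper combining `advance()` and `token()`.
    while let Some(token) = stream.next() {
        texts.push(token.text.clone());
    }
    assert_eq!(texts, vec!["tokenizer", "hello", "world"]);
}
```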


@@ -1,6 +1,5 @@
use std::sync::Arc;
use super::{Token, TokenFilter};
use super::{Token, TokenFilter, TokenStream};
use crate::tokenizer::BoxTokenStream;
use rust_stemmers::{self, Algorithm};
use serde::{Deserialize, Serialize};
@@ -59,15 +58,14 @@ impl Language {
/// Tokens are expected to be lowercased beforehand.
#[derive(Clone)]
pub struct Stemmer {
stemmer: Arc<rust_stemmers::Stemmer>,
stemmer_algorithm: Algorithm,
}
impl Stemmer {
/// Creates a new Stemmer `TokenFilter` for a given language algorithm.
pub fn new(language: Language) -> Stemmer {
let stemmer = rust_stemmers::Stemmer::create(language.algorithm());
Stemmer {
stemmer: Arc::new(stemmer),
stemmer_algorithm: language.algorithm(),
}
}
}
@@ -80,12 +78,37 @@ impl Default for Stemmer {
}
impl TokenFilter for Stemmer {
fn transform(&mut self, mut token: Token) -> Option<Token> {
// TODO remove allocation
let stemmed_str: String = self.stemmer.stem(&token.text).into_owned();
// TODO remove clear
token.text.clear();
token.text.push_str(&stemmed_str);
Some(token)
fn transform<'a>(&self, token_stream: BoxTokenStream<'a>) -> BoxTokenStream<'a> {
let inner_stemmer = rust_stemmers::Stemmer::create(self.stemmer_algorithm);
BoxTokenStream::from(StemmerTokenStream {
tail: token_stream,
stemmer: inner_stemmer,
})
}
}
pub struct StemmerTokenStream<'a> {
tail: BoxTokenStream<'a>,
stemmer: rust_stemmers::Stemmer,
}
impl<'a> TokenStream for StemmerTokenStream<'a> {
fn advance(&mut self) -> bool {
if !self.tail.advance() {
return false;
}
// TODO remove allocation
let stemmed_str: String = self.stemmer.stem(&self.token().text).into_owned();
self.token_mut().text.clear();
self.token_mut().text.push_str(&stemmed_str);
true
}
fn token(&self) -> &Token {
self.tail.token()
}
fn token_mut(&mut self) -> &mut Token {
self.tail.token_mut()
}
}
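
A sketch of the stemmer in an analyzer chain; the expected stems assume the snowball English algorithm that `rust_stemmers` implements:

```rust
use tantivy::tokenizer::{Language, LowerCaser, SimpleTokenizer, Stemmer, TextAnalyzer, TokenStream};

fn main() {
    // Each `token_stream` call builds its own `rust_stemmers::Stemmer`,
    // which is why the filter itself now only stores the `Algorithm`.
    let analyzer = TextAnalyzer::from(SimpleTokenizer)
        .filter(LowerCaser)
        .filter(Stemmer::new(Language::English));
    let mut stream = analyzer.token_stream("Walking cats");
    assert_eq!(stream.next().unwrap().text, "walk");
    assert_eq!(stream.next().unwrap().text, "cat");
    assert!(stream.next().is_none());
}
```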


@@ -2,15 +2,16 @@
//! ```rust
//! use tantivy::tokenizer::*;
//!
//! let tokenizer = analyzer_builder(SimpleTokenizer)
//! .filter(StopWordFilter::remove(vec!["the".to_string(), "is".to_string()])).build();
//! let tokenizer = TextAnalyzer::from(SimpleTokenizer)
//! .filter(StopWordFilter::remove(vec!["the".to_string(), "is".to_string()]));
//!
//! let mut stream = tokenizer.token_stream("the fox is crafty");
//! assert_eq!(stream.next().unwrap().text, "fox");
//! assert_eq!(stream.next().unwrap().text, "crafty");
//! assert!(stream.next().is_none());
//! ```
use super::{Token, TokenFilter};
use super::{Token, TokenFilter, TokenStream};
use crate::tokenizer::BoxTokenStream;
use fnv::FnvHasher;
use std::collections::HashSet;
use std::hash::BuildHasherDefault;
@@ -48,12 +49,42 @@ impl StopWordFilter {
}
}
pub struct StopWordFilterStream<'a> {
words: StopWordHashSet,
tail: BoxTokenStream<'a>,
}
impl TokenFilter for StopWordFilter {
fn transform(&mut self, token: Token) -> Option<Token> {
if self.words.contains(&token.text) {
return None;
fn transform<'a>(&self, token_stream: BoxTokenStream<'a>) -> BoxTokenStream<'a> {
BoxTokenStream::from(StopWordFilterStream {
words: self.words.clone(),
tail: token_stream,
})
}
}
impl<'a> StopWordFilterStream<'a> {
fn predicate(&self, token: &Token) -> bool {
!self.words.contains(&token.text)
}
}
impl<'a> TokenStream for StopWordFilterStream<'a> {
fn advance(&mut self) -> bool {
while self.tail.advance() {
if self.predicate(self.tail.token()) {
return true;
}
}
Some(token)
false
}
fn token(&self) -> &Token {
self.tail.token()
}
fn token_mut(&mut self) -> &mut Token {
self.tail.token_mut()
}
}
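
A sketch of the filter in action, also exercising the `process` helper from the `TokenStream` trait (shown later in this diff):

```rust
use tantivy::tokenizer::{SimpleTokenizer, StopWordFilter, TextAnalyzer, TokenStream};

fn main() {
    let analyzer = TextAnalyzer::from(SimpleTokenizer)
        .filter(StopWordFilter::remove(vec!["the".to_string(), "is".to_string()]));
    let mut stream = analyzer.token_stream("the fox is crafty");
    // `process` drains the stream into a sink and returns the token count.
    let mut texts = Vec::new();
    let count = stream.process(&mut |token| texts.push(token.text.clone()));
    assert_eq!(count, 2);
    assert_eq!(texts, vec!["fox", "crafty"]);
}
```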


@@ -1,121 +1,95 @@
use crate::tokenizer::Token;
use crate::tokenizer::{BoxTokenStream, Token, TokenStream};
use std::ops::DerefMut;
const POSITION_GAP: usize = 2;
pub(crate) struct TokenStreamChain<Inner, Outer> {
streams_with_offsets: Outer,
current: Option<(Inner, usize)>,
position: usize,
pub(crate) struct TokenStreamChain<'a> {
offsets: Vec<usize>,
token_streams: Vec<BoxTokenStream<'a>>,
position_shift: usize,
stream_idx: usize,
token: Token,
}
impl<'a, Inner, Outer> TokenStreamChain<Inner, Outer>
where
Inner: Iterator<Item = Token>,
Outer: Iterator<Item = (Inner, usize)>,
{
pub fn new(mut streams_with_offsets: Outer) -> TokenStreamChain<Inner, Outer> {
let current = streams_with_offsets.next();
impl<'a> TokenStreamChain<'a> {
pub fn new(
offsets: Vec<usize>,
token_streams: Vec<BoxTokenStream<'a>>,
) -> TokenStreamChain<'a> {
TokenStreamChain {
streams_with_offsets: streams_with_offsets,
current,
position: usize::max_value(),
offsets,
stream_idx: 0,
token_streams,
position_shift: 0,
token: Token::default(),
}
}
}
impl<'a, Inner, Outer> Iterator for TokenStreamChain<Inner, Outer>
where
Inner: Iterator<Item = Token>,
Outer: Iterator<Item = (Inner, usize)>,
{
type Item = Token;
fn next(&mut self) -> Option<Token> {
while let Some((ref mut token_stream, offset_offset)) = self.current {
if let Some(mut token) = token_stream.next() {
token.offset_from += offset_offset;
token.offset_to += offset_offset;
token.position += self.position_shift;
self.position = token.position;
return Some(token);
impl<'a> TokenStream for TokenStreamChain<'a> {
fn advance(&mut self) -> bool {
while self.stream_idx < self.token_streams.len() {
let token_stream = self.token_streams[self.stream_idx].deref_mut();
if token_stream.advance() {
let token = token_stream.token();
let offset_offset = self.offsets[self.stream_idx];
self.token.offset_from = token.offset_from + offset_offset;
self.token.offset_to = token.offset_to + offset_offset;
self.token.position = token.position + self.position_shift;
self.token.text.clear();
self.token.text.push_str(token.text.as_str());
return true;
} else {
self.stream_idx += 1;
self.position_shift = self.token.position.wrapping_add(POSITION_GAP);
}
self.position_shift = self.position.wrapping_add(POSITION_GAP);
self.current = self.streams_with_offsets.next();
}
None
false
}
}
impl DynTokenStreamChain {
pub fn from_vec(
streams_with_offsets: Vec<(Box<dyn Iterator<Item = Token>>, usize)>,
) -> impl Iterator<Item = Token> {
DynTokenStreamChain {
streams_with_offsets,
idx: 0,
position: usize::max_value(),
position_shift: 0,
}
fn token(&self) -> &Token {
assert!(
self.stream_idx <= self.token_streams.len(),
"You called .token(), after the end of the token stream has been reached"
);
&self.token
}
}
pub(crate) struct DynTokenStreamChain {
streams_with_offsets: Vec<(Box<dyn Iterator<Item = Token>>, usize)>,
idx: usize,
position: usize,
position_shift: usize,
}
impl Iterator for DynTokenStreamChain {
type Item = Token;
fn next(&mut self) -> Option<Token> {
while let Some((token_stream, offset_offset)) = self.streams_with_offsets.get_mut(self.idx)
{
if let Some(mut token) = token_stream.next() {
token.offset_from += *offset_offset;
token.offset_to += *offset_offset;
token.position += self.position_shift;
self.position = token.position;
return Some(token);
}
self.idx += 1;
self.position_shift = self.position.wrapping_add(POSITION_GAP);
}
None
fn token_mut(&mut self) -> &mut Token {
assert!(
self.stream_idx <= self.token_streams.len(),
"You called .token(), after the end of the token stream has been reached"
);
&mut self.token
}
}
#[cfg(test)]
mod tests {
use super::super::tokenizer::Tokenizer;
use super::super::SimpleTokenizer;
use super::*;
use super::super::{SimpleTokenizer, TokenStream, Tokenizer};
use super::TokenStreamChain;
use super::POSITION_GAP;
#[test]
fn test_chain_first_emits_no_tokens() {
let token_streams = vec![
(SimpleTokenizer.token_stream(""), 0),
(SimpleTokenizer.token_stream("hello world"), 0),
SimpleTokenizer.token_stream(""),
SimpleTokenizer.token_stream("hello world"),
];
let mut token_chain = TokenStreamChain::new(token_streams.into_iter());
let token = token_chain.next();
let mut token_chain = TokenStreamChain::new(vec![0, 0], token_streams);
let expect = Token {
offset_from: 0,
offset_to: 5,
position: POSITION_GAP - 1,
text: "hello".into(),
..Token::default()
};
assert_eq!(token.unwrap(), expect);
assert!(token_chain.advance());
assert_eq!(token_chain.token().text, "hello");
assert_eq!(token_chain.token().offset_from, 0);
assert_eq!(token_chain.token().offset_to, 5);
assert_eq!(token_chain.token().position, POSITION_GAP - 1);
let token = token_chain.next().unwrap();
assert_eq!(token.text, "world");
assert_eq!(token.offset_from, 6);
assert_eq!(token.offset_to, 11);
assert_eq!(token.position, POSITION_GAP);
assert!(token_chain.advance());
assert_eq!(token_chain.token().text, "world");
assert_eq!(token_chain.token().offset_from, 6);
assert_eq!(token_chain.token().offset_to, 11);
assert_eq!(token_chain.token().position, POSITION_GAP);
assert!(token_chain.next().is_none());
assert!(!token_chain.advance());
}
}
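
`TokenStreamChain` stays `pub(crate)`, so the public entry point to this behaviour is `TextAnalyzer::token_stream_texts`. A sketch of the resulting position gap:

```rust
use tantivy::tokenizer::{SimpleTokenizer, TextAnalyzer, TokenStream};

fn main() {
    let analyzer = TextAnalyzer::from(SimpleTokenizer);
    let fields = ["hello world", "goodbye"];
    // The two fields are chained with POSITION_GAP (2), so a `PhraseQuery`
    // cannot accidentally match across the field boundary.
    let mut stream = analyzer.token_stream_texts(&fields);
    let mut positions = Vec::new();
    while let Some(token) = stream.next() {
        positions.push(token.position);
    }
    // "goodbye" lands at 1 (last position) + 2 (gap) = 3.
    assert_eq!(positions, vec![0, 1, 3]);
}
```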


@@ -1,4 +1,4 @@
use crate::tokenizer::{Token, TokenStreamChain};
use crate::tokenizer::{BoxTokenStream, Token, TokenStream, TokenStreamChain};
use serde::{Deserialize, Serialize};
use std::cmp::Ordering;
@@ -26,14 +26,14 @@ impl PartialOrd for PreTokenizedString {
/// TokenStream implementation which wraps PreTokenizedString
pub struct PreTokenizedStream {
tokenized_string: PreTokenizedString,
current_token: usize,
current_token: i64,
}
impl From<PreTokenizedString> for PreTokenizedStream {
fn from(s: PreTokenizedString) -> PreTokenizedStream {
PreTokenizedStream {
tokenized_string: s,
current_token: 0,
current_token: -1,
}
}
}
@@ -41,28 +41,49 @@ impl From<PreTokenizedString> for PreTokenizedStream {
impl PreTokenizedStream {
/// Creates a `TokenStream` from an array of `PreTokenizedString`s
pub fn chain_tokenized_strings<'a>(
tok_strings: &'a [&PreTokenizedString],
) -> impl Iterator<Item = Token> + 'a {
let streams_with_offsets = tok_strings.iter().scan(0, |total_offset, tok_string| {
let next = Some((
PreTokenizedStream::from((*tok_string).to_owned()),
*total_offset,
));
if let Some(last_token) = tok_string.tokens.last() {
*total_offset += last_token.offset_to;
tok_strings: &'a [&'a PreTokenizedString],
) -> BoxTokenStream {
if tok_strings.len() == 1 {
PreTokenizedStream::from((*tok_strings[0]).clone()).into()
} else {
let mut offsets = vec![];
let mut total_offset = 0;
for &tok_string in tok_strings {
offsets.push(total_offset);
if let Some(last_token) = tok_string.tokens.last() {
total_offset += last_token.offset_to;
}
}
next
});
TokenStreamChain::new(streams_with_offsets)
// TODO remove the string cloning.
let token_streams: Vec<BoxTokenStream<'static>> = tok_strings
.iter()
.map(|&tok_string| PreTokenizedStream::from((*tok_string).clone()).into())
.collect();
TokenStreamChain::new(offsets, token_streams).into()
}
}
}
impl Iterator for PreTokenizedStream {
type Item = Token;
fn next(&mut self) -> Option<Token> {
let token = self.tokenized_string.tokens.get(self.current_token)?;
impl TokenStream for PreTokenizedStream {
fn advance(&mut self) -> bool {
self.current_token += 1;
Some(token.clone())
self.current_token < self.tokenized_string.tokens.len() as i64
}
fn token(&self) -> &Token {
assert!(
self.current_token >= 0,
"TokenStream not initialized. You should call advance() at least once."
);
&self.tokenized_string.tokens[self.current_token as usize]
}
fn token_mut(&mut self) -> &mut Token {
assert!(
self.current_token >= 0,
"TokenStream not initialized. You should call advance() at least once."
);
&mut self.tokenized_string.tokens[self.current_token as usize]
}
}
@@ -98,9 +119,10 @@ mod tests {
let mut token_stream = PreTokenizedStream::from(tok_text.clone());
for expected_token in tok_text.tokens {
assert_eq!(token_stream.next().unwrap(), expected_token);
assert!(token_stream.advance());
assert_eq!(token_stream.token(), &expected_token);
}
assert!(token_stream.next().is_none());
assert!(!token_stream.advance());
}
#[test]
@@ -161,8 +183,9 @@ mod tests {
];
for expected_token in expected_tokens {
assert_eq!(token_stream.next().unwrap(), expected_token);
assert!(token_stream.advance());
assert_eq!(token_stream.token(), &expected_token);
}
assert!(token_stream.next().is_none());
assert!(!token_stream.advance());
}
}
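
A sketch of feeding externally produced tokens through `PreTokenizedStream`, assuming `PreTokenizedString` exposes public `text` and `tokens` fields:

```rust
use tantivy::tokenizer::{PreTokenizedString, PreTokenizedStream, Token, TokenStream};

fn main() {
    // Tokens produced by some external tokenizer; offsets are in bytes.
    let pre_tokenized = PreTokenizedString {
        text: "hello world".to_string(),
        tokens: vec![
            Token {
                offset_from: 0,
                offset_to: 5,
                position: 0,
                text: "hello".to_string(),
                position_length: 1,
            },
            Token {
                offset_from: 6,
                offset_to: 11,
                position: 1,
                text: "world".to_string(),
                position_length: 1,
            },
        ],
    };
    // The cursor starts at -1: `token()` panics until `advance()` has
    // moved the stream onto the first token.
    let mut stream = PreTokenizedStream::from(pre_tokenized);
    while let Some(token) = stream.next() {
        println!("{:?}", token);
    }
}
```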


@@ -2,23 +2,8 @@ use crate::tokenizer::TokenStreamChain;
use serde::{Deserialize, Serialize};
/// The tokenizer module contains all of the tools used to process
/// text in `tantivy`.
pub trait TextAnalyzerClone {
fn box_clone(&self) -> Box<dyn TextAnalyzerT>;
}
/// 'Top-level' trait hiding concrete types, below which static dispatch occurs.
pub trait TextAnalyzerT: 'static + Send + Sync + TextAnalyzerClone {
/// 'Top-level' dynamic dispatch function hiding concrete types of the statically
/// dispatched `token_stream` from the `Tokenizer` trait.
fn token_stream(&self, text: &str) -> Box<dyn Iterator<Item = Token>>;
}
impl Clone for Box<dyn TextAnalyzerT> {
fn clone(&self) -> Self {
(**self).box_clone()
}
}
use std::borrow::{Borrow, BorrowMut};
use std::ops::{Deref, DerefMut};
/// Token
#[derive(Debug, Clone, Serialize, Deserialize, Eq, PartialEq)]
@@ -50,116 +35,35 @@ impl Default for Token {
}
}
/// Trait for the pluggable components of `Tokenizer`s.
pub trait TokenFilter: 'static + Send + Sync + Clone {
/// Take a `Token` and transform it or return `None` if it's to be removed
/// from the output stream.
fn transform(&mut self, token: Token) -> Option<Token>;
}
/// `Tokenizer`s are in charge of splitting text into a stream of tokens
/// before indexing.
///
/// See the [module documentation](./index.html) for more detail.
pub trait Tokenizer: 'static + Send + Sync + Clone {
/// An iterable type is returned.
type Iter: Iterator<Item = Token>;
/// Creates a token stream for a given `str`.
fn token_stream(&self, text: &str) -> Self::Iter;
/// Tokenize an array of `&str`s.
///
/// The resulting `Token` stream is equivalent to what would be obtained if the `&str`s were
/// one concatenated `&str`, with an artificial position gap of `2` between the different fields
/// to prevent an accidental `PhraseQuery` from matching across two terms.
fn token_stream_texts<'a>(&'a self, texts: &'a [&str]) -> Box<dyn Iterator<Item = Token> + 'a> {
let streams_with_offsets = texts.iter().scan(0, move |total_offset, &text| {
let temp = *total_offset;
*total_offset += text.len();
Some((self.token_stream(text), temp))
});
Box::new(TokenStreamChain::new(streams_with_offsets))
}
}
/// `TextAnalyzer` wraps the tokenization of an input text and its modification by any filters applied onto it.
/// `TextAnalyzer` tokenizes an input text into tokens and modifies the resulting `TokenStream`.
///
/// It simply wraps a `Tokenizer` and a list of `TokenFilter` that are applied sequentially.
#[derive(Clone, Debug, Default)]
pub struct TextAnalyzer<T>(T);
pub struct TextAnalyzer {
tokenizer: Box<dyn Tokenizer>,
token_filters: Vec<BoxTokenFilter>,
}
impl<T: Tokenizer> From<T> for TextAnalyzer<T> {
fn from(src: T) -> TextAnalyzer<T> {
TextAnalyzer(src)
impl<T: Tokenizer> From<T> for TextAnalyzer {
fn from(tokenizer: T) -> Self {
TextAnalyzer::new(tokenizer, Vec::new())
}
}
impl<T: Tokenizer> TextAnalyzerClone for TextAnalyzer<T> {
fn box_clone(&self) -> Box<dyn TextAnalyzerT> {
Box::new(TextAnalyzer(self.0.clone()))
}
}
impl<T: Tokenizer> TextAnalyzerT for TextAnalyzer<T> {
fn token_stream(&self, text: &str) -> Box<dyn Iterator<Item = Token>> {
Box::new(self.0.token_stream(text))
}
}
/// Identity `TokenFilter`
#[derive(Clone, Debug, Default)]
pub struct Identity;
impl TokenFilter for Identity {
fn transform(&mut self, token: Token) -> Option<Token> {
Some(token)
}
}
/// `Filter` is a wrapper around a `Token` stream and a `TokenFilter` which modifies it.
#[derive(Clone, Default, Debug)]
pub struct Filter<I, F> {
iter: I,
f: F,
}
impl<I, F> Iterator for Filter<I, F>
where
I: Iterator<Item = Token>,
F: TokenFilter,
{
type Item = Token;
fn next(&mut self) -> Option<Token> {
while let Some(token) = self.iter.next() {
if let Some(tok) = self.f.transform(token) {
return Some(tok);
}
impl TextAnalyzer {
/// Creates a new `TextAnalyzer` given a tokenizer and a vector of `BoxTokenFilter`.
///
/// When creating a `TextAnalyzer` from a `Tokenizer` alone, prefer using
/// `TextAnalyzer::from(tokenizer)`.
pub fn new<T: Tokenizer>(tokenizer: T, token_filters: Vec<BoxTokenFilter>) -> TextAnalyzer {
TextAnalyzer {
tokenizer: Box::new(tokenizer),
token_filters,
}
None
}
}
#[derive(Clone, Debug, Default)]
pub struct AnalyzerBuilder<T, F> {
tokenizer: T,
f: F,
}
/// Construct an `AnalyzerBuilder` on which to apply `TokenFilter`.
pub fn analyzer_builder<T: Tokenizer>(tokenizer: T) -> AnalyzerBuilder<T, Identity> {
AnalyzerBuilder {
tokenizer,
f: Identity,
}
}
impl<T, F> AnalyzerBuilder<T, F>
where
T: Tokenizer,
F: TokenFilter,
{
/// Appends a token filter to the current tokenizer.
///
/// The method consumes the current `Token` and returns a
/// The method consumes the current `TokenStream` and returns a
/// new one.
///
/// # Example
@@ -167,35 +71,248 @@ where
/// ```rust
/// use tantivy::tokenizer::*;
///
/// let en_stem = analyzer_builder(SimpleTokenizer)
/// let en_stem = TextAnalyzer::from(SimpleTokenizer)
/// .filter(RemoveLongFilter::limit(40))
/// .filter(LowerCaser::new())
/// .filter(Stemmer::default()).build();
/// .filter(LowerCaser)
/// .filter(Stemmer::default());
/// ```
///
pub fn filter<G: TokenFilter>(self, f: G) -> AnalyzerBuilder<AnalyzerBuilder<T, F>, G> {
AnalyzerBuilder { tokenizer: self, f }
pub fn filter<F: Into<BoxTokenFilter>>(mut self, token_filter: F) -> Self {
self.token_filters.push(token_filter.into());
self
}
/// Finalize the build process.
pub fn build(self) -> TextAnalyzer<AnalyzerBuilder<T, F>> {
TextAnalyzer(self)
/// Tokenize an array of `&str`s.
///
/// The resulting `BoxTokenStream` is equivalent to what would be obtained if the `&str`s were
/// one concatenated `&str`, with an artificial position gap of `2` between the different fields
/// to prevent an accidental `PhraseQuery` from matching across two terms.
pub fn token_stream_texts<'a>(&self, texts: &'a [&'a str]) -> BoxTokenStream<'a> {
assert!(!texts.is_empty());
if texts.len() == 1 {
self.token_stream(texts[0])
} else {
let mut offsets = vec![];
let mut total_offset = 0;
for &text in texts {
offsets.push(total_offset);
total_offset += text.len();
}
let token_streams: Vec<BoxTokenStream<'a>> = texts
.iter()
.cloned()
.map(|text| self.token_stream(text))
.collect();
From::from(TokenStreamChain::new(offsets, token_streams))
}
}
/// Creates a token stream for a given `str`.
pub fn token_stream<'a>(&self, text: &'a str) -> BoxTokenStream<'a> {
let mut token_stream = self.tokenizer.token_stream(text);
for token_filter in &self.token_filters {
token_stream = token_filter.transform(token_stream);
}
token_stream
}
}
impl<T: Tokenizer, F: TokenFilter> Tokenizer for AnalyzerBuilder<T, F> {
type Iter = Filter<T::Iter, F>;
fn token_stream(&self, text: &str) -> Self::Iter {
Filter {
iter: self.tokenizer.token_stream(text),
f: self.f.clone(),
impl Clone for TextAnalyzer {
fn clone(&self) -> Self {
TextAnalyzer {
tokenizer: self.tokenizer.box_clone(),
token_filters: self
.token_filters
.iter()
.map(|token_filter| token_filter.box_clone())
.collect(),
}
}
}
/// `Tokenizer`s are in charge of splitting text into a stream of tokens
/// before indexing.
///
/// See the [module documentation](./index.html) for more detail.
///
/// # Warning
///
/// This API may change to use associated types.
pub trait Tokenizer: 'static + Send + Sync + TokenizerClone {
/// Creates a token stream for a given `str`.
fn token_stream<'a>(&self, text: &'a str) -> BoxTokenStream<'a>;
}
pub trait TokenizerClone {
fn box_clone(&self) -> Box<dyn Tokenizer>;
}
impl<T: Tokenizer + Clone> TokenizerClone for T {
fn box_clone(&self) -> Box<dyn Tokenizer> {
Box::new(self.clone())
}
}
impl<'a> TokenStream for Box<dyn TokenStream + 'a> {
fn advance(&mut self) -> bool {
let token_stream: &mut dyn TokenStream = self.borrow_mut();
token_stream.advance()
}
fn token<'b>(&'b self) -> &'b Token {
let token_stream: &'b (dyn TokenStream + 'a) = self.borrow();
token_stream.token()
}
fn token_mut<'b>(&'b mut self) -> &'b mut Token {
let token_stream: &'b mut (dyn TokenStream + 'a) = self.borrow_mut();
token_stream.token_mut()
}
}
/// Simple wrapper of `Box<dyn TokenStream + 'a>`.
///
/// See `TokenStream` for more information.
pub struct BoxTokenStream<'a>(Box<dyn TokenStream + 'a>);
impl<'a, T> From<T> for BoxTokenStream<'a>
where
T: TokenStream + 'a,
{
fn from(token_stream: T) -> BoxTokenStream<'a> {
BoxTokenStream(Box::new(token_stream))
}
}
impl<'a> Deref for BoxTokenStream<'a> {
type Target = dyn TokenStream + 'a;
fn deref(&self) -> &Self::Target {
&*self.0
}
}
impl<'a> DerefMut for BoxTokenStream<'a> {
fn deref_mut(&mut self) -> &mut Self::Target {
&mut *self.0
}
}
/// Simple wrapper of `Box<dyn TokenFilter>`.
///
/// See `TokenFilter` for more information.
pub struct BoxTokenFilter(Box<dyn TokenFilter>);
impl Deref for BoxTokenFilter {
type Target = dyn TokenFilter;
fn deref(&self) -> &dyn TokenFilter {
&*self.0
}
}
impl<T: TokenFilter> From<T> for BoxTokenFilter {
fn from(tokenizer: T) -> BoxTokenFilter {
BoxTokenFilter(Box::new(tokenizer))
}
}
/// `TokenStream` is the result of the tokenization.
///
/// It consists of a consumable stream of `Token`s.
///
/// # Example
///
/// ```
/// use tantivy::tokenizer::*;
///
/// let tokenizer = TextAnalyzer::from(SimpleTokenizer)
/// .filter(RemoveLongFilter::limit(40))
/// .filter(LowerCaser);
/// let mut token_stream = tokenizer.token_stream("Hello, happy tax payer");
/// {
/// let token = token_stream.next().unwrap();
/// assert_eq!(&token.text, "hello");
/// assert_eq!(token.offset_from, 0);
/// assert_eq!(token.offset_to, 5);
/// assert_eq!(token.position, 0);
/// }
/// {
/// let token = token_stream.next().unwrap();
/// assert_eq!(&token.text, "happy");
/// assert_eq!(token.offset_from, 7);
/// assert_eq!(token.offset_to, 12);
/// assert_eq!(token.position, 1);
/// }
/// ```
///
pub trait TokenStream {
/// Advance to the next token
///
/// Returns false if there are no other tokens.
fn advance(&mut self) -> bool;
/// Returns a reference to the current token.
fn token(&self) -> &Token;
/// Returns a mutable reference to the current token.
fn token_mut(&mut self) -> &mut Token;
/// Helper to iterate over tokens. It
/// simply combines a call to `.advance()`
/// and `.token()`.
///
/// ```
/// use tantivy::tokenizer::*;
///
/// let tokenizer = TextAnalyzer::from(SimpleTokenizer)
/// .filter(RemoveLongFilter::limit(40))
/// .filter(LowerCaser);
/// let mut token_stream = tokenizer.token_stream("Hello, happy tax payer");
/// while let Some(token) = token_stream.next() {
/// println!("Token {:?}", token.text);
/// }
/// ```
fn next(&mut self) -> Option<&Token> {
if self.advance() {
Some(self.token())
} else {
None
}
}
/// Helper function to consume the entire `TokenStream`
/// and push the tokens to a sink function.
///
/// TODO: remove this helper.
fn process(&mut self, sink: &mut dyn FnMut(&Token)) -> u32 {
let mut num_tokens_pushed = 0u32;
while self.advance() {
sink(self.token());
num_tokens_pushed += 1u32;
}
num_tokens_pushed
}
}
pub trait TokenFilterClone {
fn box_clone(&self) -> BoxTokenFilter;
}
/// Trait for the pluggable components of `Tokenizer`s.
pub trait TokenFilter: 'static + Send + Sync + TokenFilterClone {
/// Wraps a token stream and returns the modified one.
fn transform<'a>(&self, token_stream: BoxTokenStream<'a>) -> BoxTokenStream<'a>;
}
impl<T: TokenFilter + Clone> TokenFilterClone for T {
fn box_clone(&self) -> BoxTokenFilter {
BoxTokenFilter::from(self.clone())
}
}
#[cfg(test)]
mod test {
use super::*;
use crate::tokenizer::SimpleTokenizer;
use super::Token;
#[test]
fn clone() {
@@ -213,15 +330,4 @@ mod test {
assert_eq!(t1.offset_to, t2.offset_to);
assert_eq!(t1.text, t2.text);
}
#[test]
fn text_analyzer() {
let mut stream = SimpleTokenizer.token_stream("tokenizer hello world");
dbg!(stream.next());
dbg!(stream.next());
dbg!(stream.next());
dbg!(stream.next());
dbg!(stream.next());
dbg!(stream.next());
}
}
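
Putting the new traits together, a hedged sketch of a hypothetical whitespace tokenizer (not in tantivy) that computes its tokens eagerly and walks them with a cursor:

```rust
use tantivy::tokenizer::{BoxTokenStream, Token, TokenStream, Tokenizer};

/// Hypothetical tokenizer splitting on whitespace. All tokens are built
/// eagerly inside `token_stream`; the stream then just walks the `Vec`.
#[derive(Clone)]
pub struct WhitespaceTokenizer;

impl Tokenizer for WhitespaceTokenizer {
    fn token_stream<'a>(&self, text: &'a str) -> BoxTokenStream<'a> {
        let tokens = text
            .split_whitespace()
            .enumerate()
            .map(|(position, word)| {
                // `split_whitespace` yields subslices of `text`, so byte
                // offsets can be recovered with pointer arithmetic.
                let offset_from = word.as_ptr() as usize - text.as_ptr() as usize;
                Token {
                    offset_from,
                    offset_to: offset_from + word.len(),
                    position,
                    text: word.to_string(),
                    position_length: 1,
                }
            })
            .collect();
        BoxTokenStream::from(VecTokenStream { tokens, index: 0 })
    }
}

pub struct VecTokenStream {
    tokens: Vec<Token>,
    index: usize,
}

impl TokenStream for VecTokenStream {
    fn advance(&mut self) -> bool {
        if self.index < self.tokens.len() {
            self.index += 1;
            true
        } else {
            false
        }
    }

    // Panics if called before the first `advance()`, matching the contract
    // that the other streams in this diff assert on.
    fn token(&self) -> &Token {
        &self.tokens[self.index - 1]
    }

    fn token_mut(&mut self) -> &mut Token {
        &mut self.tokens[self.index - 1]
    }
}
```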


@@ -1,5 +1,5 @@
use crate::tokenizer::stemmer::Language;
use crate::tokenizer::tokenizer::{analyzer_builder, TextAnalyzer, TextAnalyzerT, Tokenizer};
use crate::tokenizer::tokenizer::TextAnalyzer;
use crate::tokenizer::LowerCaser;
use crate::tokenizer::RawTokenizer;
use crate::tokenizer::RemoveLongFilter;
@@ -22,23 +22,24 @@ use std::sync::{Arc, RwLock};
/// search engine.
#[derive(Clone)]
pub struct TokenizerManager {
tokenizers: Arc<RwLock<HashMap<String, Box<dyn TextAnalyzerT>>>>,
tokenizers: Arc<RwLock<HashMap<String, TextAnalyzer>>>,
}
impl TokenizerManager {
/// Registers a new tokenizer associated with a given name.
pub fn register<U: Tokenizer, T>(&self, tokenizer_name: &str, tokenizer: T)
pub fn register<T>(&self, tokenizer_name: &str, tokenizer: T)
where
T: Into<TextAnalyzer<U>>,
TextAnalyzer: From<T>,
{
let boxed_tokenizer: TextAnalyzer = TextAnalyzer::from(tokenizer);
self.tokenizers
.write()
.expect("Acquiring the lock should never fail")
.insert(tokenizer_name.to_string(), Box::new(tokenizer.into()));
.insert(tokenizer_name.to_string(), boxed_tokenizer);
}
/// Accessing a tokenizer given its name.
pub fn get(&self, tokenizer_name: &str) -> Option<Box<dyn TextAnalyzerT>> {
pub fn get(&self, tokenizer_name: &str) -> Option<TextAnalyzer> {
self.tokenizers
.read()
.expect("Acquiring the lock should never fail")
@@ -53,25 +54,23 @@ impl Default for TokenizerManager {
/// - simple
/// - en_stem
/// - ja
fn default() -> Self {
fn default() -> TokenizerManager {
let manager = TokenizerManager {
tokenizers: Arc::new(RwLock::new(HashMap::new())),
};
manager.register("raw", RawTokenizer);
manager.register(
"default",
analyzer_builder(SimpleTokenizer)
TextAnalyzer::from(SimpleTokenizer)
.filter(RemoveLongFilter::limit(40))
.filter(LowerCaser::new())
.build(),
.filter(LowerCaser),
);
manager.register(
"en_stem",
analyzer_builder(SimpleTokenizer)
TextAnalyzer::from(SimpleTokenizer)
.filter(RemoveLongFilter::limit(40))
.filter(LowerCaser::new())
.filter(Stemmer::new(Language::English))
.build(),
.filter(LowerCaser)
.filter(Stemmer::new(Language::English)),
);
manager
}
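
A sketch of registering and retrieving a custom analyzer through the manager; `get` now hands back a cloneable `TextAnalyzer` instead of a boxed trait object:

```rust
use tantivy::tokenizer::{RemoveLongFilter, SimpleTokenizer, TextAnalyzer, TokenStream, TokenizerManager};

fn main() {
    let manager = TokenizerManager::default();
    // `register` accepts anything convertible into a `TextAnalyzer`.
    manager.register(
        "short_only",
        TextAnalyzer::from(SimpleTokenizer).filter(RemoveLongFilter::limit(10)),
    );
    let analyzer = manager.get("short_only").expect("registered above");
    let mut stream = analyzer.token_stream("short enormouslylongtoken");
    // The 19-byte token exceeds the 10-byte limit and is filtered out.
    assert_eq!(stream.next().unwrap().text, "short");
    assert!(stream.next().is_none());
}
```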