Fixing bench compilation

fmt
2026-05-05 10:50:39 +00:00 · 2019-10-04 16:36:17 +09:00 · 2019-10-02 09:50:20 +09:00
39 changed files with 180 additions and 1026 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -9,11 +9,7 @@ Tantivy 0.11.0
 - API change around `Box<BoxableTokenizer>`. See detail in #629
 - Avoid rebuilding Regex automaton whenever a regex query is reused. #639 (@brainlock)
 - Add footer with some metadata to index files. #605 (@fdb-hiroshima)
- TopDocs collector: ensure stable sorting on equal score. #671 (@brainlock)
- Added handling of pre-tokenized text fields (#642), which will enable users to
-  load tokens created outside tantivy. See usage in examples/pre_tokenized_text. (@kkoziara)
- Fix crash when committing multiple times with deleted documents. #681 (@brainlock)
-
+ 
 ## How to update?

 - `Box<dyn BoxableTokenizer>` has been replaced by a `BoxedTokenizer` struct.
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -13,7 +13,7 @@ keywords = ["search", "information", "retrieval"]
 edition = "2018"

 [dependencies]
-base64 = "0.11.0"
+base64 = "0.10.0"
 byteorder = "1.0"
 crc32fast = "1.2.0"
 once_cell = "1.0"
@@ -34,7 +34,7 @@ itertools = "0.8"
 levenshtein_automata = {version="0.1", features=["fst_automaton"]}
 notify = {version="4", optional=true}
 bit-set = "0.5"
-uuid = { version = "0.8", features = ["v4", "serde"] }
+uuid = { version = "0.7.2", features = ["v4", "serde"] }
 crossbeam = "0.7"
 futures = "0.1"
 futures-cpupool = "0.1"
--- a/README.md
+++ b/README.md
@@ -21,9 +21,9 @@
 [![Become a patron](https://c5.patreon.com/external/logo/become_a_patron_button.png)](https://www.patreon.com/fulmicoton)


-**Tantivy** is a **full text search engine library** written in Rust.
+**Tantivy** is a **full text search engine library** written in rust.

-It is closer to [Apache Lucene](https://lucene.apache.org/) than to [Elasticsearch](https://www.elastic.co/products/elasticsearch) or [Apache Solr](https://lucene.apache.org/solr/) in the sense it is not
+It is closer to [Apache Lucene](https://lucene.apache.org/) than to [Elasticsearch](https://www.elastic.co/products/elasticsearch) and [Apache Solr](https://lucene.apache.org/solr/) in the sense it is not
 an off-the-shelf search engine server, but rather a crate that can be used
 to build such a search engine.

@@ -31,7 +31,7 @@ Tantivy is, in fact, strongly inspired by Lucene's design.

 # Benchmark

-Tantivy is typically faster than Lucene, but the results depend on 
+Tantivy is typically faster than Lucene, but the results will depend on 
 the nature of the queries in your workload.

 The following [benchmark](https://tantivy-search.github.io/bench/) break downs 
@@ -40,19 +40,19 @@ performance for different type of queries / collection.
 # Features

 - Full-text search
- Configurable tokenizer (stemming available for 17 Latin languages with third party support for Chinese ([tantivy-jieba](https://crates.io/crates/tantivy-jieba) and [cang-jie](https://crates.io/crates/cang-jie)) and [Japanese](https://crates.io/crates/tantivy-tokenizer-tiny-segmenter))
+- Configurable tokenizer. (stemming available for 17 latin languages. Third party support for Chinese ([tantivy-jieba](https://crates.io/crates/tantivy-jieba) and [cang-jie](https://crates.io/crates/cang-jie)) and [Japanese](https://crates.io/crates/tantivy-tokenizer-tiny-segmenter)
 - Fast (check out the :racehorse: :sparkles: [benchmark](https://tantivy-search.github.io/bench/) :sparkles: :racehorse:)
 - Tiny startup time (<10ms), perfect for command line tools
- BM25 scoring (the same as Lucene)
- Natural query language (e.g. `(michael AND jackson) OR "king of pop"`)
- Phrase queries search (e.g. `"michael jackson"`)
+- BM25 scoring (the same as lucene)
+- Natural query language `(michael AND jackson) OR "king of pop"`
+- Phrase queries search (`"michael jackson"`)
 - Incremental indexing
 - Multithreaded indexing (indexing English Wikipedia takes < 3 minutes on my desktop)
 - Mmap directory
- SIMD integer compression when the platform/CPU includes the SSE2 instruction set
- Single valued and multivalued u64, i64, and f64 fast fields (equivalent of doc values in Lucene)
+- SIMD integer compression when the platform/CPU includes the SSE2 instruction set.
+- Single valued and multivalued u64, i64 and f64 fast fields (equivalent of doc values in Lucene)
 - `&[u8]` fast fields
- Text, i64, u64, f64, dates, and hierarchical facet fields
+- Text, i64, u64, f64, dates and hierarchical facet fields
 - LZ4 compressed document store
 - Range queries
 - Faceted search
@@ -61,42 +61,43 @@ performance for different type of queries / collection.

 # Non-features

- Distributed search is out of the scope of Tantivy. That being said, Tantivy is a
+- Distributed search is out of the scope of tantivy. That being said, tantivy is meant as a
 library upon which one could build a distributed search. Serializable/mergeable collector state for instance, 
-are within the scope of Tantivy.
+are within the scope of tantivy.

 # Supported OS and compiler

-Tantivy works on stable Rust (>= 1.27) and supports Linux, MacOS, and Windows.
+Tantivy works on stable rust (>= 1.27) and supports Linux, MacOS and Windows.

 # Getting started

- [Tantivy's simple search example](https://tantivy-search.github.io/examples/basic_search.html)
- [tantivy-cli and its tutorial](https://github.com/tantivy-search/tantivy-cli) - `tantivy-cli` is an actual command line interface that makes it easy for you to create a search engine,
-index documents, and search via the CLI or a small server with a REST API.
-It walks you through getting a wikipedia search engine up and running in a few minutes.
- [Reference doc for the last released version](https://docs.rs/tantivy/)
+- [tantivy's simple search example](https://tantivy-search.github.io/examples/basic_search.html)
+- [tantivy-cli and its tutorial](https://github.com/tantivy-search/tantivy-cli).
+`tantivy-cli` is an actual command line interface that makes it easy for you to create a search engine,
+index documents and search via the CLI or a small server with a REST API.
+It will walk you through getting a wikipedia search engine up and running in a few minutes.
+- [reference doc for the last released version](https://docs.rs/tantivy/)

 # How can I support this project?

 There are many ways to support this project. 

- Use Tantivy and tell us about your experience on [Gitter](https://gitter.im/tantivy-search/tantivy) or by email (paul.masurel@gmail.com)
+- Use tantivy and tell us about your experience on [gitter](https://gitter.im/tantivy-search/tantivy) or by email (paul.masurel@gmail.com)
 - Report bugs
 - Write a blog post
 - Help with documentation by asking questions or submitting PRs
- Contribute code (you can join [our Gitter](https://gitter.im/tantivy-search/tantivy))
- Talk about Tantivy around you
+- Contribute code (you can join [our gitter](https://gitter.im/tantivy-search/tantivy) )
+- Talk about tantivy around you
 - Drop a word on on [![Say Thanks!](https://img.shields.io/badge/Say%20Thanks-!-1EAEDB.svg)](https://saythanks.io/to/fulmicoton) or even [![Become a patron](https://c5.patreon.com/external/logo/become_a_patron_button.png)](https://www.patreon.com/fulmicoton)

 # Contributing code

-We use the GitHub Pull Request workflow: reference a GitHub ticket and/or include a comprehensive commit message when opening a PR.
+We use the GitHub Pull Request workflow - reference a GitHub ticket and/or include a comprehensive commit message when opening a PR.

 ## Clone and build locally

-Tantivy compiles on stable Rust but requires `Rust >= 1.27`.
-To check out and run tests, you can simply run:
+Tantivy compiles on stable rust but requires `Rust >= 1.27`.
+To check out and run tests, you can simply run :

 ```bash
    git clone https://github.com/tantivy-search/tantivy.git
@@ -107,7 +108,7 @@ To check out and run tests, you can simply run:
 ## Run tests

 Some tests will not run with just `cargo test` because of `fail-rs`.
-To run the tests exhaustively, run `./run-tests.sh`.
+To run the tests exhaustively, run `./run-tests.sh`

 ## Debug

@@ -115,13 +116,13 @@ You might find it useful to step through the programme with a debugger.

 ### A failing test

-Make sure you haven't run `cargo clean` after the most recent `cargo test` or `cargo build` to guarantee that the `target/` directory exists. Use this bash script to find the name of the most recent debug build of Tantivy and run it under `rust-gdb`:
+Make sure you haven't run `cargo clean` after the most recent `cargo test` or `cargo build` to guarantee that `target/` dir exists. Use this bash script to find the most name of the most recent debug build of tantivy and run it under rust-gdb.

 ```bash
 find target/debug/ -maxdepth 1 -executable -type f -name "tantivy*" -printf '%TY-%Tm-%Td %TT %p\n' | sort -r | cut -d " " -f 3 | xargs -I RECENT_DBG_TANTIVY rust-gdb RECENT_DBG_TANTIVY
 ```

-Now that you are in `rust-gdb`, you can set breakpoints on lines and methods that match your source code and run the debug executable with flags that you normally pass to `cargo test` like this:
+Now that you are in rust-gdb, you can set breakpoints on lines and methods that match your source-code and run the debug executable with flags that you normally pass to `cargo test` to like this

 ```bash
 $gdb run --test-threads 1 --test $NAME_OF_TEST
@@ -129,7 +130,7 @@ $gdb run --test-threads 1 --test $NAME_OF_TEST

 ### An example

-By default, `rustc` compiles everything in the `examples/` directory in debug mode. This makes it easy for you to make examples to reproduce bugs:
+By default, rustc compiles everything in the `examples/` dir in debug mode. This makes it easy for you to make examples to reproduce bugs.

 ```bash
 rust-gdb target/debug/examples/$EXAMPLE_NAME
--- a/examples/pre_tokenized_text.rs
+++ b/examples/pre_tokenized_text.rs
@@ -1,140 +0,0 @@
-// # Pre-tokenized text example
-//
-// This example shows how to use pre-tokenized text. Sometimes yout might
-// want to index and search through text which is already split into
-// tokens by some external tool.
-//
-// In this example we will:
-// - use tantivy tokenizer to create tokens and load them directly into tantivy,
-// - import tokenized text straight from json,
-// - perform a search on documents with pre-tokenized text
-
-use tantivy::tokenizer::{PreTokenizedString, SimpleTokenizer, Token, TokenStream, Tokenizer};
-
-use tantivy::collector::{Count, TopDocs};
-use tantivy::query::TermQuery;
-use tantivy::schema::*;
-use tantivy::{doc, Index, ReloadPolicy};
-use tempfile::TempDir;
-
-fn pre_tokenize_text(text: &str) -> Vec<Token> {
-    let mut token_stream = SimpleTokenizer.token_stream(text);
-    let mut tokens = vec![];
-    while token_stream.advance() {
-        tokens.push(token_stream.token().clone());
-    }
-    tokens
-}
-
-fn main() -> tantivy::Result<()> {
-    let index_path = TempDir::new()?;
-
-    let mut schema_builder = Schema::builder();
-
-    schema_builder.add_text_field("title", TEXT | STORED);
-    schema_builder.add_text_field("body", TEXT);
-
-    let schema = schema_builder.build();
-
-    let index = Index::create_in_dir(&index_path, schema.clone())?;
-
-    let mut index_writer = index.writer(50_000_000)?;
-
-    // We can create a document manually, by setting the fields
-    // one by one in a Document object.
-    let title = schema.get_field("title").unwrap();
-    let body = schema.get_field("body").unwrap();
-
-    let title_text = "The Old Man and the Sea";
-    let body_text = "He was an old man who fished alone in a skiff in the Gulf Stream";
-
-    // Content of our first document
-    // We create `PreTokenizedString` which contains original text and vector of tokens
-    let title_tok = PreTokenizedString {
-        text: String::from(title_text),
-        tokens: pre_tokenize_text(title_text),
-    };
-
-    println!(
-        "Original text: \"{}\" and tokens: {:?}",
-        title_tok.text, title_tok.tokens
-    );
-
-    let body_tok = PreTokenizedString {
-        text: String::from(body_text),
-        tokens: pre_tokenize_text(body_text),
-    };
-
-    // Now lets create a document and add our `PreTokenizedString` using
-    // `add_pre_tokenized_text` method of `Document`
-    let mut old_man_doc = Document::default();
-    old_man_doc.add_pre_tokenized_text(title, &title_tok);
-    old_man_doc.add_pre_tokenized_text(body, &body_tok);
-
-    // ... now let's just add it to the IndexWriter
-    index_writer.add_document(old_man_doc);
-
-    // Pretokenized text can also be fed as JSON
-    let short_man_json = r#"{
-        "title":[{
-            "text":"The Old Man",
-            "tokens":[
-                {"offset_from":0,"offset_to":3,"position":0,"text":"The","position_length":1},
-                {"offset_from":4,"offset_to":7,"position":1,"text":"Old","position_length":1},
-                {"offset_from":8,"offset_to":11,"position":2,"text":"Man","position_length":1}
-            ]
-        }]
-    }"#;
-
-    let short_man_doc = schema.parse_document(&short_man_json)?;
-
-    index_writer.add_document(short_man_doc);
-
-    // Let's commit changes
-    index_writer.commit()?;
-
-    // ... and now is the time to query our index
-
-    let reader = index
-        .reader_builder()
-        .reload_policy(ReloadPolicy::OnCommit)
-        .try_into()?;
-
-    let searcher = reader.searcher();
-
-    // We want to get documents with token "Man", we will use TermQuery to do it
-    // Using PreTokenizedString means the tokens are stored as is avoiding stemming
-    // and lowercasing, which preserves full words in their original form
-    let query = TermQuery::new(
-        Term::from_field_text(title, "Man"),
-        IndexRecordOption::Basic,
-    );
-
-    let (top_docs, count) = searcher
-        .search(&query, &(TopDocs::with_limit(2), Count))
-        .unwrap();
-
-    assert_eq!(count, 2);
-
-    for (_score, doc_address) in top_docs {
-        let retrieved_doc = searcher.doc(doc_address)?;
-        println!("Document: {}", schema.to_json(&retrieved_doc));
-    }
-
-    // In contrary to the previous query, when we search for the "man" term we
-    // should get no results, as it's not one of the indexed tokens. SimpleTokenizer
-    // only splits text on whitespace / punctuation.
-
-    let query = TermQuery::new(
-        Term::from_field_text(title, "man"),
-        IndexRecordOption::Basic,
-    );
-
-    let (_top_docs, count) = searcher
-        .search(&query, &(TopDocs::with_limit(2), Count))
-        .unwrap();
-
-    assert_eq!(count, 0);
-
-    Ok(())
-}
--- a/query-grammar/src/occur.rs
+++ b/query-grammar/src/occur.rs
@@ -2,7 +2,7 @@ use std::fmt;
 use std::fmt::Write;

 /// Defines whether a term in a query must be present,
-/// should be present or must be not present.
+/// should be present or must not be present.
 #[derive(Debug, Clone, Hash, Copy, Eq, PartialEq)]
 pub enum Occur {
    /// For a given document to be considered for scoring,
--- a/src/collector/facet_collector.rs
+++ b/src/collector/facet_collector.rs
@@ -515,7 +515,7 @@ mod tests {
    #[should_panic(expected = "Tried to add a facet which is a descendant of \
                               an already added facet.")]
    fn test_misused_facet_collector() {
-        let mut facet_collector = FacetCollector::for_field(Field::from_field_id(0));
+        let mut facet_collector = FacetCollector::for_field(Field(0));
        facet_collector.add_facet(Facet::from("/country"));
        facet_collector.add_facet(Facet::from("/country/europe"));
    }
@@ -546,7 +546,7 @@ mod tests {

    #[test]
    fn test_non_used_facet_collector() {
-        let mut facet_collector = FacetCollector::for_field(Field::from_field_id(0));
+        let mut facet_collector = FacetCollector::for_field(Field(0));
        facet_collector.add_facet(Facet::from("/country"));
        facet_collector.add_facet(Facet::from("/countryeurope"));
    }
--- a/src/collector/top_collector.rs
+++ b/src/collector/top_collector.rs
@@ -12,9 +12,6 @@ use std::collections::BinaryHeap;
 /// It has a custom implementation of `PartialOrd` that reverses the order. This is because the
 /// default Rust heap is a max heap, whereas a min heap is needed.
 ///
-/// Additionally, it guarantees stable sorting: in case of a tie on the feature, the document
-/// address is used.
-///
 /// WARNING: equality is not what you would expect here.
 /// Two elements are equal if their feature is equal, and regardless of whether `doc`
 /// is equal. This should be perfectly fine for this usage, but let's make sure this
@@ -24,37 +21,29 @@ struct ComparableDoc<T, D> {
    doc: D,
 }

-impl<T: PartialOrd, D: PartialOrd> PartialOrd for ComparableDoc<T, D> {
+impl<T: PartialOrd, D> PartialOrd for ComparableDoc<T, D> {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
 }

-impl<T: PartialOrd, D: PartialOrd> Ord for ComparableDoc<T, D> {
+impl<T: PartialOrd, D> Ord for ComparableDoc<T, D> {
    #[inline]
    fn cmp(&self, other: &Self) -> Ordering {
-        // Reversed to make BinaryHeap work as a min-heap
-        let by_feature = other
+        other
            .feature
            .partial_cmp(&self.feature)
-            .unwrap_or(Ordering::Equal);
-
-        let lazy_by_doc_address = || self.doc.partial_cmp(&other.doc).unwrap_or(Ordering::Equal);
-
-        // In case of a tie on the feature, we sort by ascending
-        // `DocAddress` in order to ensure a stable sorting of the
-        // documents.
-        by_feature.then_with(lazy_by_doc_address)
+            .unwrap_or_else(|| Ordering::Equal)
    }
 }

-impl<T: PartialOrd, D: PartialOrd> PartialEq for ComparableDoc<T, D> {
+impl<T: PartialOrd, D> PartialEq for ComparableDoc<T, D> {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
 }

-impl<T: PartialOrd, D: PartialOrd> Eq for ComparableDoc<T, D> {}
+impl<T: PartialOrd, D> Eq for ComparableDoc<T, D> {}

 pub(crate) struct TopCollector<T> {
    limit: usize,
@@ -225,94 +214,4 @@ mod tests {
            ]
        );
    }
-
-    #[test]
-    fn test_top_segment_collector_stable_ordering_for_equal_feature() {
-        // given that the documents are collected in ascending doc id order,
-        // when harvesting we have to guarantee stable sorting in case of a tie
-        // on the score
-        let doc_ids_collection = [4, 5, 6];
-        let score = 3.14;
-
-        let mut top_collector_limit_2 = TopSegmentCollector::new(0, 2);
-        for id in &doc_ids_collection {
-            top_collector_limit_2.collect(*id, score);
-        }
-
-        let mut top_collector_limit_3 = TopSegmentCollector::new(0, 3);
-        for id in &doc_ids_collection {
-            top_collector_limit_3.collect(*id, score);
-        }
-
-        assert_eq!(
-            top_collector_limit_2.harvest(),
-            top_collector_limit_3.harvest()[..2].to_vec(),
-        );
-    }
-}
-
-#[cfg(all(test, feature = "unstable"))]
-mod bench {
-    use super::TopSegmentCollector;
-    use test::Bencher;
-
-    #[bench]
-    fn bench_top_segment_collector_collect_not_at_capacity(b: &mut Bencher) {
-        let mut top_collector = TopSegmentCollector::new(0, 400);
-
-        b.iter(|| {
-            for i in 0..100 {
-                top_collector.collect(i, 0.8);
-            }
-        });
-    }
-
-    #[bench]
-    fn bench_top_segment_collector_collect_at_capacity(b: &mut Bencher) {
-        let mut top_collector = TopSegmentCollector::new(0, 100);
-
-        for i in 0..100 {
-            top_collector.collect(i, 0.8);
-        }
-
-        b.iter(|| {
-            for i in 0..100 {
-                top_collector.collect(i, 0.8);
-            }
-        });
-    }
-
-    #[bench]
-    fn bench_top_segment_collector_collect_and_harvest_many_ties(b: &mut Bencher) {
-        b.iter(|| {
-            let mut top_collector = TopSegmentCollector::new(0, 100);
-
-            for i in 0..100 {
-                top_collector.collect(i, 0.8);
-            }
-
-            // it would be nice to be able to do the setup N times but still
-            // measure only harvest(). We can't since harvest() consumes
-            // the top_collector.
-            top_collector.harvest()
-        });
-    }
-
-    #[bench]
-    fn bench_top_segment_collector_collect_and_harvest_no_tie(b: &mut Bencher) {
-        b.iter(|| {
-            let mut top_collector = TopSegmentCollector::new(0, 100);
-            let mut score = 1.0;
-
-            for i in 0..100 {
-                score += 1.0;
-                top_collector.collect(i, score);
-            }
-
-            // it would be nice to be able to do the setup N times but still
-            // measure only harvest(). We can't since harvest() consumes
-            // the top_collector.
-            top_collector.harvest()
-        });
-    }
 }
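
The `ComparableDoc` hunk above drops the tie-break on the document address, so equal scores are no longer ordered deterministically. As a rough, self-contained illustration (not tantivy's actual code: `Scored` and `top_k` are hypothetical names, and the score is fixed to `f32`), the sketch below shows how reversing the comparison turns `std::collections::BinaryHeap` into a min-heap, and how an ascending tie-break on the doc id keeps results stable across different limits.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// Hypothetical stand-in for a (feature, doc) pair; not tantivy's type.
#[derive(Debug, Clone, Copy)]
struct Scored {
    score: f32,
    doc: u32,
}

impl Ord for Scored {
    fn cmp(&self, other: &Self) -> Ordering {
        // Reversed on the score: the heap's "greatest" element is the
        // lowest-scoring one, so `pop()` evicts the current worst entry.
        other
            .score
            .partial_cmp(&self.score)
            .unwrap_or(Ordering::Equal)
            // Tie-break by ascending doc id: among equal scores, the
            // largest doc id is evicted first.
            .then_with(|| self.doc.cmp(&other.doc))
    }
}

impl PartialOrd for Scored {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl PartialEq for Scored {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}

impl Eq for Scored {}

/// Returns the `k` best `(score, doc)` pairs, best first.
fn top_k(docs: &[(u32, f32)], k: usize) -> Vec<(f32, u32)> {
    let mut heap = BinaryHeap::with_capacity(k + 1);
    for &(doc, score) in docs {
        heap.push(Scored { score, doc });
        if heap.len() > k {
            heap.pop(); // drop the current worst entry
        }
    }
    // Ascending order under the reversed `Ord` means best first.
    heap.into_sorted_vec()
        .into_iter()
        .map(|s| (s.score, s.doc))
        .collect()
}

fn main() {
    let docs = [(4u32, 0.8f32), (5, 0.8), (6, 0.8)];
    println!("{:?}", top_k(&docs, 2));
    println!("{:?}", top_k(&docs, 3));
}
```

With the tie-break in place, the first two results for a limit of 3 match the results for a limit of 2, which is the property the removed `test_top_segment_collector_stable_ordering_for_equal_feature` test checked. Without it, which of the tied documents survives depends on the heap's internal layout.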
--- a/src/collector/top_score_collector.rs
+++ b/src/collector/top_score_collector.rs
@@ -15,16 +15,13 @@ use crate::SegmentLocalId;
 use crate::SegmentReader;
 use std::fmt;

-/// The `TopDocs` collector keeps track of the top `K` documents
+/// The Top Score Collector keeps track of the K documents
 /// sorted by their score.
 ///
 /// The implementation is based on a `BinaryHeap`.
 /// The theorical complexity for collecting the top `K` out of `n` documents
 /// is `O(n log K)`.
 ///
-/// This collector guarantees a stable sorting in case of a tie on the
-/// document score. As such, it is suitable to implement pagination.
-///
 /// ```rust
 /// use tantivy::collector::TopDocs;
 /// use tantivy::query::QueryParser;
@@ -431,13 +428,12 @@ impl SegmentCollector for TopScoreSegmentCollector {
 mod tests {
    use super::TopDocs;
    use crate::collector::Collector;
-    use crate::query::{AllQuery, Query, QueryParser};
+    use crate::query::{Query, QueryParser};
    use crate::schema::{Field, Schema, FAST, STORED, TEXT};
    use crate::DocAddress;
    use crate::Index;
    use crate::IndexWriter;
    use crate::Score;
-    use itertools::Itertools;

    fn make_index() -> Index {
        let mut schema_builder = Schema::builder();
@@ -498,29 +494,6 @@ mod tests {
        );
    }

-    #[test]
-    fn test_top_collector_stable_sorting() {
-        let index = make_index();
-
-        // using AllQuery to get a constant score
-        let searcher = index.reader().unwrap().searcher();
-
-        let page_1 = searcher.search(&AllQuery, &TopDocs::with_limit(2)).unwrap();
-
-        let page_2 = searcher.search(&AllQuery, &TopDocs::with_limit(3)).unwrap();
-
-        // precondition for the test to be meaningful: we did get documents
-        // with the same score
-        assert!(page_1.iter().map(|result| result.0).all_equal());
-        assert!(page_2.iter().map(|result| result.0).all_equal());
-
-        // sanity check since we're relying on make_index()
-        assert_eq!(page_1.len(), 2);
-        assert_eq!(page_2.len(), 3);
-
-        assert_eq!(page_1, &page_2[..page_1.len()]);
-    }
-
    #[test]
    #[should_panic]
    fn test_top_0() {
@@ -578,7 +551,7 @@ mod tests {
            ));
        });
        let searcher = index.reader().unwrap().searcher();
-        let top_collector = TopDocs::with_limit(4).order_by_u64_field(Field::from_field_id(2));
+        let top_collector = TopDocs::with_limit(4).order_by_u64_field(Field(2));
        let segment_reader = searcher.segment_reader(0u32);
        top_collector
            .for_segment(0, segment_reader)
--- a/src/common/composite_file.rs
+++ b/src/common/composite_file.rs
@@ -199,13 +199,13 @@ mod test {
            let w = directory.open_write(path).unwrap();
            let mut composite_write = CompositeWrite::wrap(w);
            {
-                let mut write_0 = composite_write.for_field(Field::from_field_id(0u32));
+                let mut write_0 = composite_write.for_field(Field(0u32));
                VInt(32431123u64).serialize(&mut write_0).unwrap();
                write_0.flush().unwrap();
            }

            {
-                let mut write_4 = composite_write.for_field(Field::from_field_id(4u32));
+                let mut write_4 = composite_write.for_field(Field(4u32));
                VInt(2).serialize(&mut write_4).unwrap();
                write_4.flush().unwrap();
            }
@@ -215,18 +215,14 @@ mod test {
            let r = directory.open_read(path).unwrap();
            let composite_file = CompositeFile::open(&r).unwrap();
            {
-                let file0 = composite_file
-                    .open_read(Field::from_field_id(0u32))
-                    .unwrap();
+                let file0 = composite_file.open_read(Field(0u32)).unwrap();
                let mut file0_buf = file0.as_slice();
                let payload_0 = VInt::deserialize(&mut file0_buf).unwrap().0;
                assert_eq!(file0_buf.len(), 0);
                assert_eq!(payload_0, 32431123u64);
            }
            {
-                let file4 = composite_file
-                    .open_read(Field::from_field_id(4u32))
-                    .unwrap();
+                let file4 = composite_file.open_read(Field(4u32)).unwrap();
                let mut file4_buf = file4.as_slice();
                let payload_4 = VInt::deserialize(&mut file4_buf).unwrap().0;
                assert_eq!(file4_buf.len(), 0);
--- a/src/core/index_meta.rs
+++ b/src/core/index_meta.rs
@@ -150,21 +150,6 @@ impl SegmentMeta {
        self.num_deleted_docs() > 0
    }

-    /// Updates the max_doc value from the `SegmentMeta`.
-    ///
-    /// This method is only used when updating `max_doc` from 0
-    /// as we finalize a fresh new segment.
-    pub(crate) fn with_max_doc(self, max_doc: u32) -> SegmentMeta {
-        assert_eq!(self.tracked.max_doc, 0);
-        assert!(self.tracked.deletes.is_none());
-        let tracked = self.tracked.map(move |inner_meta| InnerSegmentMeta {
-            segment_id: inner_meta.segment_id,
-            max_doc,
-            deletes: None,
-        });
-        SegmentMeta { tracked }
-    }
-
    #[doc(hidden)]
    pub fn with_delete_meta(self, num_deleted_docs: u32, opstamp: Opstamp) -> SegmentMeta {
        let delete_meta = DeleteMeta {
--- a/src/core/segment.rs
+++ b/src/core/segment.rs
@@ -50,17 +50,6 @@ impl Segment {
        &self.meta
    }

-    /// Updates the max_doc value from the `SegmentMeta`.
-    ///
-    /// This method is only used when updating `max_doc` from 0
-    /// as we finalize a fresh new segment.
-    pub(crate) fn with_max_doc(self, max_doc: u32) -> Segment {
-        Segment {
-            index: self.index,
-            meta: self.meta.with_max_doc(max_doc),
-        }
-    }
-
    #[doc(hidden)]
    pub fn with_delete_meta(self, num_deleted_docs: u32, opstamp: Opstamp) -> Segment {
        Segment {
--- a/src/core/segment_id.rs
+++ b/src/core/segment_id.rs
@@ -76,7 +76,7 @@ impl SegmentId {
 }

 /// Error type used when parsing a `SegmentId` from a string fails.
-pub struct SegmentIdParseError(uuid::Error);
+pub struct SegmentIdParseError(uuid::parser::ParseError);

 impl Error for SegmentIdParseError {}

--- a/src/directory/managed_directory.rs
+++ b/src/directory/managed_directory.rs
@@ -327,7 +327,8 @@ mod tests_mmap_specific {
                .unwrap();
            assert!(managed_directory.exists(test_path1));
            assert!(managed_directory.exists(test_path2));
-            let living_files: HashSet<PathBuf> = [test_path1.to_owned()].iter().cloned().collect();
+            let living_files: HashSet<PathBuf> =
+                [test_path1.to_owned()].into_iter().cloned().collect();
            managed_directory.garbage_collect(|| living_files);
            assert!(managed_directory.exists(test_path1));
            assert!(!managed_directory.exists(test_path2));
--- a/src/fastfield/delete.rs
+++ b/src/fastfield/delete.rs
@@ -10,14 +10,11 @@ use std::io::Write;
 /// Write a delete `BitSet`
 ///
 /// where `delete_bitset` is the set of deleted `DocId`.
-pub fn write_delete_bitset(
-    delete_bitset: &BitSet,
-    max_doc: u32,
-    writer: &mut WritePtr,
-) -> io::Result<()> {
+pub fn write_delete_bitset(delete_bitset: &BitSet, writer: &mut WritePtr) -> io::Result<()> {
+    let max_doc = delete_bitset.capacity();
    let mut byte = 0u8;
    let mut shift = 0u8;
-    for doc in 0..(max_doc as usize) {
+    for doc in 0..max_doc {
        if delete_bitset.contains(doc) {
            byte |= 1 << shift;
        }
@@ -89,17 +86,18 @@ mod tests {
    use bit_set::BitSet;
    use std::path::PathBuf;

-    fn test_delete_bitset_helper(bitset: &BitSet, max_doc: u32) {
+    fn test_delete_bitset_helper(bitset: &BitSet) {
        let test_path = PathBuf::from("test");
        let mut directory = RAMDirectory::create();
        {
            let mut writer = directory.open_write(&*test_path).unwrap();
-            write_delete_bitset(bitset, max_doc, &mut writer).unwrap();
+            write_delete_bitset(bitset, &mut writer).unwrap();
        }
        {
            let source = directory.open_read(&test_path).unwrap();
            let delete_bitset = DeleteBitSet::open(source);
-            for doc in 0..max_doc as usize {
+            let n = bitset.capacity();
+            for doc in 0..n {
                assert_eq!(bitset.contains(doc), delete_bitset.is_deleted(doc as DocId));
            }
            assert_eq!(delete_bitset.len(), bitset.len());
@@ -112,7 +110,7 @@ mod tests {
            let mut bitset = BitSet::with_capacity(10);
            bitset.insert(1);
            bitset.insert(9);
-            test_delete_bitset_helper(&bitset, 10);
+            test_delete_bitset_helper(&bitset);
        }
        {
            let mut bitset = BitSet::with_capacity(8);
@@ -121,7 +119,7 @@ mod tests {
            bitset.insert(3);
            bitset.insert(5);
            bitset.insert(7);
-            test_delete_bitset_helper(&bitset, 8);
+            test_delete_bitset_helper(&bitset);
        }
    }
 }
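
The `write_delete_bitset` hunk above only shows the start of the packing loop. The following is a minimal sketch of that kind of bit packing, under stated assumptions: bits are packed least-significant-bit first, a trailing partial byte is flushed, and a `HashSet<usize>` stands in for the `bit_set::BitSet` used in the diff. `pack_delete_bits` and `is_deleted` are hypothetical helpers, not tantivy functions.

```rust
use std::collections::HashSet;

/// Packs one bit per doc id in `0..max_doc`, LSB first within each byte.
/// Assumption: this layout is illustrative, not tantivy's on-disk format.
fn pack_delete_bits(deleted: &HashSet<usize>, max_doc: usize) -> Vec<u8> {
    let mut bytes = Vec::with_capacity((max_doc + 7) / 8);
    let mut byte = 0u8;
    let mut shift = 0u8;
    for doc in 0..max_doc {
        if deleted.contains(&doc) {
            byte |= 1 << shift;
        }
        if shift == 7 {
            // The byte is full: emit it and start a new one.
            bytes.push(byte);
            byte = 0;
            shift = 0;
        } else {
            shift += 1;
        }
    }
    // Flush the last, partially filled byte if there is one.
    if shift != 0 {
        bytes.push(byte);
    }
    bytes
}

/// Reads back the bit for `doc`; ids past the end count as not deleted.
fn is_deleted(bytes: &[u8], doc: usize) -> bool {
    bytes
        .get(doc / 8)
        .map_or(false, |&b| (b & (1u8 << (doc % 8))) != 0)
}

fn main() {
    let deleted: HashSet<usize> = [1, 9].iter().cloned().collect();
    let bytes = pack_delete_bits(&deleted, 10);
    println!("{:?}", bytes);
    println!("{}", is_deleted(&bytes, 9));
}
```

Passing `max_doc` explicitly, as the removed signature did, bounds the loop by the segment's document count. Deriving the bound from the bitset's capacity, as the added line does, ties the output length to however the bitset was allocated.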
--- a/src/fastfield/readers.rs
+++ b/src/fastfield/readers.rs
@@ -59,7 +59,8 @@ impl FastFieldReaders {
            fast_bytes: Default::default(),
            fast_fields_composite: fast_fields_composite.clone(),
        };
-        for (field, field_entry) in schema.fields() {
+        for (field_id, field_entry) in schema.fields().iter().enumerate() {
+            let field = Field(field_id as u32);
            let field_type = field_entry.field_type();
            if field_type == &FieldType::Bytes {
                let idx_reader = fast_fields_composite
--- a/src/fastfield/writer.rs
+++ b/src/fastfield/writer.rs
@@ -24,7 +24,8 @@ impl FastFieldsWriter {
        let mut multi_values_writers = Vec::new();
        let mut bytes_value_writers = Vec::new();

-        for (field, field_entry) in schema.fields() {
+        for (field_id, field_entry) in schema.fields().iter().enumerate() {
+            let field = Field(field_id as u32);
            let default_value = match *field_entry.field_type() {
                FieldType::I64(_) => common::i64_to_u64(0i64),
                FieldType::F64(_) => common::f64_to_u64(0.0f64),
--- a/src/fieldnorm/writer.rs
+++ b/src/fieldnorm/writer.rs
@@ -22,14 +22,11 @@ impl FieldNormsWriter {
    pub(crate) fn fields_with_fieldnorm(schema: &Schema) -> Vec<Field> {
        schema
            .fields()
-            .filter_map(|(field, field_entry)| {
-                if field_entry.is_indexed() {
-                    Some(field)
-                } else {
-                    None
-                }
-            })
-            .collect::<Vec<_>>()
+            .iter()
+            .enumerate()
+            .filter(|&(_, field_entry)| field_entry.is_indexed())
+            .map(|(field, _)| Field(field as u32))
+            .collect::<Vec<Field>>()
    }

    /// Initialize with state for tracking the field norm fields
@@ -38,7 +35,7 @@ impl FieldNormsWriter {
        let fields = FieldNormsWriter::fields_with_fieldnorm(schema);
        let max_field = fields
            .iter()
-            .map(Field::field_id)
+            .map(|field| field.0)
            .max()
            .map(|max_field_id| max_field_id as usize + 1)
            .unwrap_or(0);
@@ -53,8 +50,8 @@ impl FieldNormsWriter {
    ///
    /// Will extend with 0-bytes for documents that have not been seen.
    pub fn fill_up_to_max_doc(&mut self, max_doc: DocId) {
-        for field in self.fields.iter() {
-            self.fieldnorms_buffer[field.field_id() as usize].resize(max_doc as usize, 0u8);
+        for &field in self.fields.iter() {
+            self.fieldnorms_buffer[field.0 as usize].resize(max_doc as usize, 0u8);
        }
    }

@@ -67,7 +64,7 @@ impl FieldNormsWriter {
    /// * field     - the field being set
    /// * fieldnorm - the number of terms present in document `doc` in field `field`
    pub fn record(&mut self, doc: DocId, field: Field, fieldnorm: u32) {
-        let fieldnorm_buffer: &mut Vec<u8> = &mut self.fieldnorms_buffer[field.field_id() as usize];
+        let fieldnorm_buffer: &mut Vec<u8> = &mut self.fieldnorms_buffer[field.0 as usize];
        assert!(
            fieldnorm_buffer.len() <= doc as usize,
            "Cannot register a given fieldnorm twice"
@@ -80,7 +77,7 @@ impl FieldNormsWriter {
    /// Serialize the seen fieldnorm values to the serializer for all fields.
    pub fn serialize(&self, fieldnorms_serializer: &mut FieldNormsSerializer) -> io::Result<()> {
        for &field in self.fields.iter() {
-            let fieldnorm_values: &[u8] = &self.fieldnorms_buffer[field.field_id() as usize][..];
+            let fieldnorm_values: &[u8] = &self.fieldnorms_buffer[field.0 as usize][..];
            fieldnorms_serializer.serialize_field(field, fieldnorm_values)?;
        }
        Ok(())
--- a/src/indexer/delete_queue.rs
+++ b/src/indexer/delete_queue.rs
@@ -258,7 +258,7 @@ mod tests {
        let delete_queue = DeleteQueue::new();

        let make_op = |i: usize| {
-            let field = Field::from_field_id(1u32);
+            let field = Field(1u32);
            DeleteOperation {
                opstamp: i as u64,
                term: Term::from_field_u64(field, i as u64),
--- a/src/indexer/index_writer.rs
+++ b/src/indexer/index_writer.rs
@@ -148,6 +148,7 @@ pub(crate) fn advance_deletes(
        };

        let delete_cursor = segment_entry.delete_cursor();
+
        compute_deleted_bitset(
            &mut delete_bitset,
            &segment_reader,
@@ -167,7 +168,7 @@ pub(crate) fn advance_deletes(
        if num_deleted_docs > 0 {
            segment = segment.with_delete_meta(num_deleted_docs as u32, target_opstamp);
            let mut delete_file = segment.open_write(SegmentComponent::DELETE)?;
-            write_delete_bitset(&delete_bitset, max_doc, &mut delete_file)?;
+            write_delete_bitset(&delete_bitset, &mut delete_file)?;
            delete_file.terminate()?;
        }
    }
@@ -177,13 +178,13 @@ pub(crate) fn advance_deletes(

 fn index_documents(
    memory_budget: usize,
-    segment: Segment,
+    segment: &Segment,
    grouped_document_iterator: &mut dyn Iterator<Item = OperationGroup>,
    segment_updater: &mut SegmentUpdater,
    mut delete_cursor: DeleteCursor,
 ) -> Result<bool> {
    let schema = segment.schema();
-
+    let segment_id = segment.id();
    let mut segment_writer = SegmentWriter::for_segment(memory_budget, segment.clone(), &schema)?;
    for document_group in grouped_document_iterator {
        for doc in document_group {
@@ -203,30 +204,21 @@ fn index_documents(
        return Ok(false);
    }

-    let max_doc = segment_writer.max_doc();
+    let num_docs = segment_writer.max_doc();

    // this is ensured by the call to peek before starting
    // the worker thread.
-    assert!(max_doc > 0);
+    assert!(num_docs > 0);

    let doc_opstamps: Vec<Opstamp> = segment_writer.finalize()?;
-
-    let segment_with_max_doc = segment.with_max_doc(max_doc);
+    let segment_meta = segment.index().new_segment_meta(segment_id, num_docs);

    let last_docstamp: Opstamp = *(doc_opstamps.last().unwrap());

-    let delete_bitset_opt = apply_deletes(
-        &segment_with_max_doc,
-        &mut delete_cursor,
-        &doc_opstamps,
-        last_docstamp,
-    )?;
+    let delete_bitset_opt =
+        apply_deletes(&segment, &mut delete_cursor, &doc_opstamps, last_docstamp)?;

-    let segment_entry = SegmentEntry::new(
-        segment_with_max_doc.meta().clone(),
-        delete_cursor,
-        delete_bitset_opt,
-    );
+    let segment_entry = SegmentEntry::new(segment_meta, delete_cursor, delete_bitset_opt);
    Ok(segment_updater.add_segment(segment_entry))
 }

@@ -243,9 +235,7 @@ fn apply_deletes(
    }
    let segment_reader = SegmentReader::open(segment)?;
    let doc_to_opstamps = DocToOpstampMapping::from(doc_opstamps);
-
-    let max_doc = segment.meta().max_doc();
-    let mut deleted_bitset = BitSet::with_capacity(max_doc as usize);
+    let mut deleted_bitset = BitSet::with_capacity(segment_reader.max_doc() as usize);
    let may_have_deletes = compute_deleted_bitset(
        &mut deleted_bitset,
        &segment_reader,
@@ -417,7 +407,7 @@ impl IndexWriter {
                    let segment = index.new_segment();
                    index_documents(
                        mem_budget,
-                        segment,
+                        &segment,
                        &mut document_iterator,
                        &mut segment_updater,
                        delete_cursor.clone(),
--- a/src/indexer/merger.rs
+++ b/src/indexer/merger.rs
@@ -190,7 +190,8 @@ impl IndexMerger {
        fast_field_serializer: &mut FastFieldSerializer,
        mut term_ord_mappings: HashMap<Field, TermOrdinalMapping>,
    ) -> Result<()> {
-        for (field, field_entry) in self.schema.fields() {
+        for (field_id, field_entry) in self.schema.fields().iter().enumerate() {
+            let field = Field(field_id as u32);
            let field_type = field_entry.field_type();
            match *field_type {
                FieldType::HierarchicalFacet => {
@@ -648,12 +649,15 @@ impl IndexMerger {
        serializer: &mut InvertedIndexSerializer,
    ) -> Result<HashMap<Field, TermOrdinalMapping>> {
        let mut term_ordinal_mappings = HashMap::new();
-        for (field, field_entry) in self.schema.fields() {
+        for (field_ord, field_entry) in self.schema.fields().iter().enumerate() {
            if field_entry.is_indexed() {
-                if let Some(term_ordinal_mapping) =
-                    self.write_postings_for_field(field, field_entry.field_type(), serializer)?
-                {
-                    term_ordinal_mappings.insert(field, term_ordinal_mapping);
+                let indexed_field = Field(field_ord as u32);
+                if let Some(term_ordinal_mapping) = self.write_postings_for_field(
+                    indexed_field,
+                    field_entry.field_type(),
+                    serializer,
+                )? {
+                    term_ordinal_mappings.insert(indexed_field, term_ordinal_mapping);
                }
            }
        }
--- a/src/indexer/mod.rs
+++ b/src/indexer/mod.rs
@@ -28,25 +28,3 @@ pub use self::segment_writer::SegmentWriter;

 /// Alias for the default merge policy, which is the `LogMergePolicy`.
 pub type DefaultMergePolicy = LogMergePolicy;
-
-#[cfg(test)]
-mod tests {
-    use crate::schema::{self, Schema};
-    use crate::{Index, Term};
-    #[test]
-    fn test_advance_delete_bug() {
-        let mut schema_builder = Schema::builder();
-        let text_field = schema_builder.add_text_field("text", schema::TEXT);
-        let index = Index::create_from_tempdir(schema_builder.build()).unwrap();
-        let mut index_writer = index.writer_with_num_threads(1, 3_000_000).unwrap();
-        // there must be one deleted document in the segment
-        index_writer.add_document(doc!(text_field=>"b"));
-        index_writer.delete_term(Term::from_field_text(text_field, "b"));
-        // we need enough data to trigger the bug (at least 32 documents)
-        for _ in 0..32 {
-            index_writer.add_document(doc!(text_field=>"c"));
-        }
-        index_writer.commit().unwrap();
-        index_writer.commit().unwrap();
-    }
-}
--- a/src/indexer/segment_writer.rs
+++ b/src/indexer/segment_writer.rs
@@ -6,15 +6,14 @@ use crate::fieldnorm::FieldNormsWriter;
 use crate::indexer::segment_serializer::SegmentSerializer;
 use crate::postings::compute_table_size;
 use crate::postings::MultiFieldPostingsWriter;
+use crate::schema::FieldEntry;
 use crate::schema::FieldType;
 use crate::schema::Schema;
 use crate::schema::Term;
 use crate::schema::Value;
-use crate::schema::{Field, FieldEntry};
 use crate::tokenizer::BoxedTokenizer;
 use crate::tokenizer::FacetTokenizer;
-use crate::tokenizer::PreTokenizedStream;
-use crate::tokenizer::{TokenStream, TokenStreamChain, Tokenizer};
+use crate::tokenizer::{TokenStream, Tokenizer};
 use crate::DocId;
 use crate::Opstamp;
 use crate::Result;
@@ -71,10 +70,12 @@ impl SegmentWriter {
        let table_num_bits = initial_table_size(memory_budget)?;
        let segment_serializer = SegmentSerializer::for_segment(&mut segment)?;
        let multifield_postings = MultiFieldPostingsWriter::new(schema, table_num_bits);
-        let tokenizers = schema
-            .fields()
-            .map(
-                |(_, field_entry): (Field, &FieldEntry)| match field_entry.field_type() {
+        let tokenizers =
+            schema
+                .fields()
+                .iter()
+                .map(FieldEntry::field_type)
+                .map(|field_type| match *field_type {
                    FieldType::Str(ref text_options) => text_options
                        .get_indexing_options()
                        .and_then(|text_index_option| {
@@ -82,9 +83,8 @@ impl SegmentWriter {
                            segment.index().tokenizers().get(tokenizer_name)
                        }),
                    _ => None,
-                },
-            )
-            .collect();
+                })
+                .collect();
        Ok(SegmentWriter {
            max_doc: 0,
            multifield_postings,
@@ -159,43 +159,26 @@ impl SegmentWriter {
                    }
                }
                FieldType::Str(_) => {
-                    let mut token_streams: Vec<Box<dyn TokenStream>> = vec![];
-                    let mut offsets = vec![];
-                    let mut total_offset = 0;
-
-                    for field_value in field_values {
-                        match field_value.value() {
-                            Value::PreTokStr(tok_str) => {
-                                offsets.push(total_offset);
-                                if let Some(last_token) = tok_str.tokens.last() {
-                                    total_offset += last_token.offset_to;
-                                }
-                                token_streams
-                                    .push(Box::new(PreTokenizedStream::from(tok_str.clone())));
-                            }
-                            Value::Str(ref text) => {
-                                if let Some(ref mut tokenizer) =
-                                    self.tokenizers[field.field_id() as usize]
-                                {
-                                    offsets.push(total_offset);
-                                    total_offset += text.len();
-
-                                    token_streams.push(tokenizer.token_stream(text));
-                                }
-                            }
-                            _ => (),
+                    let num_tokens = if let Some(ref mut tokenizer) =
+                        self.tokenizers[field.0 as usize]
+                    {
+                        let texts: Vec<&str> = field_values
+                            .iter()
+                            .flat_map(|field_value| match *field_value.value() {
+                                Value::Str(ref text) => Some(text.as_str()),
+                                _ => None,
+                            })
+                            .collect();
+                        if texts.is_empty() {
+                            0
+                        } else {
+                            let mut token_stream = tokenizer.token_stream_texts(&texts[..]);
+                            self.multifield_postings
+                                .index_text(doc_id, field, &mut token_stream)
                        }
-                    }
-
-                    let num_tokens = if token_streams.is_empty() {
-                        0
                    } else {
-                        let mut token_stream: Box<dyn TokenStream> =
-                            Box::new(TokenStreamChain::new(offsets, token_streams));
-                        self.multifield_postings
-                            .index_text(doc_id, field, &mut token_stream)
+                        0
                    };
-
                    self.fieldnorms_writer.record(doc_id, field, num_tokens);
                }
                FieldType::U64(ref int_option) => {
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -212,13 +212,15 @@ pub type Score = f32;
 pub type SegmentLocalId = u32;

 impl DocAddress {
-    /// Return the segment ordinal id that identifies the segment
-    /// hosting the document in the `Searcher` it is called from.
+    /// Return the segment ordinal.
+    /// The segment ordinal is an id identifying the segment
+    /// hosting the document. It is only meaningful in the context
+    /// of a searcher.
    pub fn segment_ord(self) -> SegmentLocalId {
        self.0
    }

-    /// Return the segment-local `DocId`
+    /// Return the segment local `DocId`
    pub fn doc(self) -> DocId {
        self.1
    }
@@ -227,11 +229,11 @@ impl DocAddress {
 /// `DocAddress` contains all the necessary information
 /// to identify a document given a `Searcher` object.
 ///
-/// It consists of an id identifying its segment, and
-/// a segment-local `DocId`.
+/// It consists in an id identifying its segment, and
+/// its segment-local `DocId`.
 ///
 /// The id used for the segment is actually an ordinal
-/// in the list of `Segment`s held by a `Searcher`.
+/// in the list of segments held by a `Searcher`.
 #[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
 pub struct DocAddress(pub SegmentLocalId, pub DocId);

--- a/src/postings/mod.rs
+++ b/src/postings/mod.rs
@@ -356,9 +356,9 @@ pub mod tests {

    #[test]
    fn test_skip_next() {
-        let term_0 = Term::from_field_u64(Field::from_field_id(0), 0);
-        let term_1 = Term::from_field_u64(Field::from_field_id(0), 1);
-        let term_2 = Term::from_field_u64(Field::from_field_id(0), 2);
+        let term_0 = Term::from_field_u64(Field(0), 0);
+        let term_1 = Term::from_field_u64(Field(0), 1);
+        let term_2 = Term::from_field_u64(Field(0), 2);

        let num_docs = 300u32;

@@ -511,19 +511,19 @@ pub mod tests {
    }

    pub static TERM_A: Lazy<Term> = Lazy::new(|| {
-        let field = Field::from_field_id(0);
+        let field = Field(0);
        Term::from_field_text(field, "a")
    });
    pub static TERM_B: Lazy<Term> = Lazy::new(|| {
-        let field = Field::from_field_id(0);
+        let field = Field(0);
        Term::from_field_text(field, "b")
    });
    pub static TERM_C: Lazy<Term> = Lazy::new(|| {
-        let field = Field::from_field_id(0);
+        let field = Field(0);
        Term::from_field_text(field, "c")
    });
    pub static TERM_D: Lazy<Term> = Lazy::new(|| {
-        let field = Field::from_field_id(0);
+        let field = Field(0);
        Term::from_field_text(field, "d")
    });

--- a/src/postings/postings_writer.rs
+++ b/src/postings/postings_writer.rs
@@ -61,12 +61,12 @@ fn make_field_partition(
        .iter()
        .map(|(key, _, _)| Term::wrap(key).field())
        .enumerate();
-    let mut prev_field_opt = None;
+    let mut prev_field = Field(u32::max_value());
    let mut fields = vec![];
    let mut offsets = vec![];
    for (offset, field) in term_offsets_it {
-        if Some(field) != prev_field_opt {
-            prev_field_opt = Some(field);
+        if field != prev_field {
+            prev_field = field;
            fields.push(field);
            offsets.push(offset);
        }
@@ -86,7 +86,8 @@ impl MultiFieldPostingsWriter {
        let term_index = TermHashMap::new(table_bits);
        let per_field_postings_writers: Vec<_> = schema
            .fields()
-            .map(|(_, field_entry)| posting_from_field_entry(field_entry))
+            .iter()
+            .map(|field_entry| posting_from_field_entry(field_entry))
            .collect();
        MultiFieldPostingsWriter {
            heap: MemoryArena::new(),
@@ -106,8 +107,7 @@ impl MultiFieldPostingsWriter {
        field: Field,
        token_stream: &mut dyn TokenStream,
    ) -> u32 {
-        let postings_writer =
-            self.per_field_postings_writers[field.field_id() as usize].deref_mut();
+        let postings_writer = self.per_field_postings_writers[field.0 as usize].deref_mut();
        postings_writer.index_text(
            &mut self.term_index,
            doc,
@@ -118,8 +118,7 @@ impl MultiFieldPostingsWriter {
    }

    pub fn subscribe(&mut self, doc: DocId, term: &Term) -> UnorderedTermId {
-        let postings_writer =
-            self.per_field_postings_writers[term.field().field_id() as usize].deref_mut();
+        let postings_writer = self.per_field_postings_writers[term.field().0 as usize].deref_mut();
        postings_writer.subscribe(&mut self.term_index, doc, 0u32, term, &mut self.heap)
    }

@@ -161,7 +160,7 @@ impl MultiFieldPostingsWriter {
                FieldType::Bytes => {}
            }

-            let postings_writer = &self.per_field_postings_writers[field.field_id() as usize];
+            let postings_writer = &self.per_field_postings_writers[field.0 as usize];
            let mut field_serializer =
                serializer.new_field(field, postings_writer.total_num_tokens())?;
            postings_writer.serialize(
--- a/src/query/boolean_query/boolean_query.rs
+++ b/src/query/boolean_query/boolean_query.rs
@@ -9,8 +9,7 @@ use crate::Result;
 use crate::Searcher;
 use std::collections::BTreeSet;

-/// The boolean query returns a set of documents
-/// that matches the Boolean combination of constituent subqueries.
+/// The boolean query combines a set of queries
 ///
 /// The documents matched by the boolean query are
 /// those which
@@ -20,113 +19,6 @@ use std::collections::BTreeSet;
 /// `MustNot` occurence.
 /// * match at least one of the subqueries that is not
 /// a `MustNot` occurence.
-///
-///
-/// You can combine other query types and their `Occur`ances into one `BooleanQuery`
-///
-/// ```rust
-///use tantivy::collector::Count;
-///use tantivy::doc;
-///use tantivy::query::{BooleanQuery, Occur, PhraseQuery, Query, TermQuery};
-///use tantivy::schema::{IndexRecordOption, Schema, TEXT};
-///use tantivy::Term;
-///use tantivy::{Index, Result};
-///
-///fn main() -> Result<()> {
-///    let mut schema_builder = Schema::builder();
-///    let title = schema_builder.add_text_field("title", TEXT);
-///    let body = schema_builder.add_text_field("body", TEXT);
-///    let schema = schema_builder.build();
-///    let index = Index::create_in_ram(schema);
-///    {
-///        let mut index_writer = index.writer(3_000_000)?;
-///        index_writer.add_document(doc!(
-///            title => "The Name of the Wind",
-///        ));
-///        index_writer.add_document(doc!(
-///            title => "The Diary of Muadib",
-///        ));
-///        index_writer.add_document(doc!(
-///            title => "A Dairy Cow",
-///            body => "hidden",
-///        ));
-///        index_writer.add_document(doc!(
-///            title => "A Dairy Cow",
-///            body => "found",
-///        ));
-///        index_writer.add_document(doc!(
-///            title => "The Diary of a Young Girl",
-///        ));
-///        index_writer.commit().unwrap();
-///    }
-///
-///    let reader = index.reader()?;
-///    let searcher = reader.searcher();
-///
-///    // Make TermQuery's for "girl" and "diary" in the title
-///    let girl_term_query: Box<dyn Query> = Box::new(TermQuery::new(
-///        Term::from_field_text(title, "girl"),
-///        IndexRecordOption::Basic,
-///    ));
-///    let diary_term_query: Box<dyn Query> = Box::new(TermQuery::new(
-///        Term::from_field_text(title, "diary"),
-///        IndexRecordOption::Basic,
-///    ));
-///    // A TermQuery with "found" in the body
-///    let body_term_query: Box<dyn Query> = Box::new(TermQuery::new(
-///        Term::from_field_text(body, "found"),
-///        IndexRecordOption::Basic,
-///    ));
-///    // TermQuery "diary" must and "girl" must not be present
-///    let queries_with_occurs1 = vec![
-///        (Occur::Must, diary_term_query.box_clone()),
-///        (Occur::MustNot, girl_term_query),
-///    ];
-///    // Make a BooleanQuery equivalent to
-///    // title:+diary title:-girl
-///    let diary_must_and_girl_mustnot = BooleanQuery::from(queries_with_occurs1);
-///    let count1 = searcher.search(&diary_must_and_girl_mustnot, &Count)?;
-///    assert_eq!(count1, 1);
-///
-///    // TermQuery for "cow" in the title
-///    let cow_term_query: Box<dyn Query> = Box::new(TermQuery::new(
-///        Term::from_field_text(title, "cow"),
-///        IndexRecordOption::Basic,
-///    ));
-///    // "title:diary OR title:cow"
-///    let title_diary_or_cow = BooleanQuery::from(vec![
-///        (Occur::Should, diary_term_query.box_clone()),
-///        (Occur::Should, cow_term_query),
-///    ]);
-///    let count2 = searcher.search(&title_diary_or_cow, &Count)?;
-///    assert_eq!(count2, 4);
-///
-///    // Make a `PhraseQuery` from a vector of `Term`s
-///    let phrase_query: Box<dyn Query> = Box::new(PhraseQuery::new(vec![
-///        Term::from_field_text(title, "dairy"),
-///        Term::from_field_text(title, "cow"),
-///    ]));
-///    // You can combine subqueries of different types into 1 BooleanQuery:
-///    // `TermQuery` and `PhraseQuery`
-///    // "title:diary OR "dairy cow"
-///    let term_of_phrase_query = BooleanQuery::from(vec![
-///        (Occur::Should, diary_term_query.box_clone()),
-///        (Occur::Should, phrase_query.box_clone()),
-///    ]);
-///    let count3 = searcher.search(&term_of_phrase_query, &Count)?;
-///    assert_eq!(count3, 4);
-///
-///    // You can nest one BooleanQuery inside another
-///    // body:found AND ("title:diary OR "dairy cow")
-///    let nested_query = BooleanQuery::from(vec![
-///        (Occur::Must, body_term_query),
-///        (Occur::Must, Box::new(term_of_phrase_query))
-///    ]);
-///    let count4 = searcher.search(&nested_query, &Count)?;
-///    assert_eq!(count4, 1);
-///    Ok(())
-///}
-/// ```
 #[derive(Debug)]
 pub struct BooleanQuery {
    subqueries: Vec<(Occur, Box<dyn Query>)>,
--- a/src/query/phrase_query/phrase_query.rs
+++ b/src/query/phrase_query/phrase_query.rs
@@ -40,7 +40,7 @@ impl PhraseQuery {
        PhraseQuery::new_with_offset(terms_with_offset)
    }

-    /// Creates a new `PhraseQuery` given a list of terms and their offsets.
+    /// Creates a new `PhraseQuery` given a list of terms and there offsets.
    ///
    /// Can be used to provide custom offset for each term.
    pub fn new_with_offset(mut terms: Vec<(usize, Term)>) -> PhraseQuery {
@@ -73,7 +73,7 @@ impl PhraseQuery {
            .collect::<Vec<Term>>()
    }

-    /// Returns the `PhraseWeight` for the given phrase query given a specific `searcher`.
+    /// Returns the `PhraseWeight` for the given phrase query given a specific `searcher`.  
    ///
    /// This function is the same as `.weight(...)` except it returns
    /// a specialized type `PhraseWeight` instead of a Boxed trait.
--- a/src/query/query_parser/query_parser.rs
+++ b/src/query/query_parser/query_parser.rs
@@ -674,19 +674,13 @@ mod test {

        test_parse_query_to_logical_ast_helper(
            "signed:-2324",
-            &format!(
-                "{:?}",
-                Term::from_field_i64(Field::from_field_id(2u32), -2324)
-            ),
+            &format!("{:?}", Term::from_field_i64(Field(2u32), -2324)),
            false,
        );

        test_parse_query_to_logical_ast_helper(
            "float:2.5",
-            &format!(
-                "{:?}",
-                Term::from_field_f64(Field::from_field_id(10u32), 2.5)
-            ),
+            &format!("{:?}", Term::from_field_f64(Field(10u32), 2.5)),
            false,
        );
    }
--- a/src/query/term_query/mod.rs
+++ b/src/query/term_query/mod.rs
@@ -118,7 +118,7 @@ mod tests {
    #[test]
    fn test_term_query_debug() {
        let term_query = TermQuery::new(
-            Term::from_field_text(Field::from_field_id(1), "hello"),
+            Term::from_field_text(Field(1), "hello"),
            IndexRecordOption::WithFreqs,
        );
        assert_eq!(
--- a/src/schema/document.rs
+++ b/src/schema/document.rs
@@ -1,7 +1,6 @@
 use super::*;
 use crate::common::BinarySerializable;
 use crate::common::VInt;
-use crate::tokenizer::PreTokenizedString;
 use crate::DateTime;
 use itertools::Itertools;
 use std::io::{self, Read, Write};
@@ -30,8 +29,8 @@ impl From<Vec<FieldValue>> for Document {
 impl PartialEq for Document {
    fn eq(&self, other: &Document) -> bool {
        // super slow, but only here for tests
-        let mut self_field_values: Vec<&_> = self.field_values.iter().collect();
-        let mut other_field_values: Vec<&_> = other.field_values.iter().collect();
+        let mut self_field_values = self.field_values.clone();
+        let mut other_field_values = other.field_values.clone();
        self_field_values.sort();
        other_field_values.sort();
        self_field_values.eq(&other_field_values)
@@ -79,16 +78,6 @@ impl Document {
        self.add(FieldValue::new(field, value));
    }

-    /// Add a pre-tokenized text field.
-    pub fn add_pre_tokenized_text(
-        &mut self,
-        field: Field,
-        pre_tokenized_text: &PreTokenizedString,
-    ) {
-        let value = Value::PreTokStr(pre_tokenized_text.clone());
-        self.add(FieldValue::new(field, value));
-    }
-
    /// Add a u64 field
    pub fn add_u64(&mut self, field: Field, value: u64) {
        self.add(FieldValue::new(field, Value::U64(value)));
--- a/src/schema/field.rs
+++ b/src/schema/field.rs
@@ -3,23 +3,14 @@ use std::io;
 use std::io::Read;
 use std::io::Write;

-/// `Field` is represented by an unsigned 32-bit integer type
-/// The schema holds the mapping between field names and `Field` objects.
+/// `Field` is a `u32` identifying a field within a schema.
+/// The schema is in charge of holding the mapping from field names
+/// to `Field` objects.
+///
+/// Because the field id is a `u32`, a schema can hold at most `u32::max_value()` fields.
+/// The value `u32::max_value()` itself is reserved.
 #[derive(Copy, Clone, Debug, PartialEq, PartialOrd, Eq, Ord, Hash, Serialize, Deserialize)]
-pub struct Field(u32);
-
-impl Field {
-    /// Create a new field object for the given FieldId.
-    pub fn from_field_id(field_id: u32) -> Field {
-        Field(field_id)
-    }
-
-    /// Returns a u32 identifying uniquely a field within a schema.
-    #[allow(clippy::trivially_copy_pass_by_ref)]
-    pub fn field_id(&self) -> u32 {
-        self.0
-    }
-}
+pub struct Field(pub u32);

 impl BinarySerializable for Field {
    fn serialize<W: Write>(&self, writer: &mut W) -> io::Result<()> {
--- a/src/schema/field_type.rs
+++ b/src/schema/field_type.rs
@@ -1,11 +1,11 @@
 use base64::decode;

+use crate::schema::{IntOptions, TextOptions};
+
 use crate::schema::Facet;
 use crate::schema::IndexRecordOption;
 use crate::schema::TextFieldIndexing;
 use crate::schema::Value;
-use crate::schema::{IntOptions, TextOptions};
-use crate::tokenizer::PreTokenizedString;
 use serde_json::Value as JsonValue;

 /// Possible error that may occur while parsing a field value
@@ -169,28 +169,6 @@ impl FieldType {
                    Err(ValueParsingError::TypeError(msg))
                }
            },
-            JsonValue::Object(_) => match *self {
-                FieldType::Str(_) => {
-                    if let Ok(tok_str_val) =
-                        serde_json::from_value::<PreTokenizedString>(json.clone())
-                    {
-                        Ok(Value::PreTokStr(tok_str_val))
-                    } else {
-                        let msg = format!(
-                            "Json value {:?} cannot be translated to PreTokenizedString.",
-                            json
-                        );
-                        Err(ValueParsingError::TypeError(msg))
-                    }
-                }
-                _ => {
-                    let msg = format!(
-                        "Json value not supported error {:?}. Expected {:?}",
-                        json, self
-                    );
-                    Err(ValueParsingError::TypeError(msg))
-                }
-            },
            _ => {
                let msg = format!(
                    "Json value not supported error {:?}. Expected {:?}",
@@ -206,9 +184,7 @@ impl FieldType {
 mod tests {
    use super::FieldType;
    use crate::schema::field_type::ValueParsingError;
-    use crate::schema::TextOptions;
    use crate::schema::Value;
-    use crate::tokenizer::{PreTokenizedString, Token};

    #[test]
    fn test_bytes_value_from_json() {
@@ -229,71 +205,4 @@ mod tests {
            _ => panic!("Expected parse failure for invalid base64"),
        }
    }
-
-    #[test]
-    fn test_pre_tok_str_value_from_json() {
-        let pre_tokenized_string_json = r#"{
-  "text": "The Old Man",
-  "tokens": [
-    {
-      "offset_from": 0,
-      "offset_to": 3,
-      "position": 0,
-      "text": "The",
-      "position_length": 1
-    },
-    {
-      "offset_from": 4,
-      "offset_to": 7,
-      "position": 1,
-      "text": "Old",
-      "position_length": 1
-    },
-    {
-      "offset_from": 8,
-      "offset_to": 11,
-      "position": 2,
-      "text": "Man",
-      "position_length": 1
-    }
-  ]
-}"#;
-
-        let expected_value = Value::PreTokStr(PreTokenizedString {
-            text: String::from("The Old Man"),
-            tokens: vec![
-                Token {
-                    offset_from: 0,
-                    offset_to: 3,
-                    position: 0,
-                    text: String::from("The"),
-                    position_length: 1,
-                },
-                Token {
-                    offset_from: 4,
-                    offset_to: 7,
-                    position: 1,
-                    text: String::from("Old"),
-                    position_length: 1,
-                },
-                Token {
-                    offset_from: 8,
-                    offset_to: 11,
-                    position: 2,
-                    text: String::from("Man"),
-                    position_length: 1,
-                },
-            ],
-        });
-
-        let deserialized_value = FieldType::Str(TextOptions::default())
-            .value_from_json(&serde_json::from_str(pre_tokenized_string_json).unwrap())
-            .unwrap();
-
-        assert_eq!(deserialized_value, expected_value);
-
-        let serialized_value_json = serde_json::to_string_pretty(&expected_value).unwrap();
-
-        assert_eq!(serialized_value_json, pre_tokenized_string_json);
-    }
 }
--- a/src/schema/schema.rs
+++ b/src/schema/schema.rs
@@ -167,7 +167,7 @@ impl SchemaBuilder {

    /// Adds a field entry to the schema in build.
    fn add_field(&mut self, field_entry: FieldEntry) -> Field {
-        let field = Field::from_field_id(self.fields.len() as u32);
+        let field = Field(self.fields.len() as u32);
        let field_name = field_entry.name().to_string();
        self.fields.push(field_entry);
        self.fields_map.insert(field_name, field);
@@ -223,7 +223,7 @@ pub struct Schema(Arc<InnerSchema>);
 impl Schema {
    /// Return the `FieldEntry` associated to a `Field`.
    pub fn get_field_entry(&self, field: Field) -> &FieldEntry {
-        &self.0.fields[field.field_id() as usize]
+        &self.0.fields[field.0 as usize]
    }

    /// Return the field name for a given `Field`.
@@ -232,12 +232,8 @@ impl Schema {
    }

    /// Return the list of all the `Field`s.
-    pub fn fields(&self) -> impl Iterator<Item = (Field, &FieldEntry)> {
-        self.0
-            .fields
-            .iter()
-            .enumerate()
-            .map(|(field_id, field_entry)| (Field::from_field_id(field_id as u32), field_entry))
+    pub fn fields(&self) -> &[FieldEntry] {
+        &self.0.fields
    }

    /// Creates a new builder.
@@ -489,32 +485,13 @@ mod tests {

        let schema: Schema = serde_json::from_str(expected).unwrap();

-        let mut fields = schema.fields();
-        {
-            let (field, field_entry) = fields.next().unwrap();
-            assert_eq!("title", field_entry.name());
-            assert_eq!(0, field.field_id());
-        }
-        {
-            let (field, field_entry) = fields.next().unwrap();
-            assert_eq!("author", field_entry.name());
-            assert_eq!(1, field.field_id());
-        }
-        {
-            let (field, field_entry) = fields.next().unwrap();
-            assert_eq!("count", field_entry.name());
-            assert_eq!(2, field.field_id());
-        }
-        {
-            let (field, field_entry) = fields.next().unwrap();
-            assert_eq!("popularity", field_entry.name());
-            assert_eq!(3, field.field_id());
-        }
-        {
-            let (field, field_entry) = fields.next().unwrap();
-            assert_eq!("score", field_entry.name());
-            assert_eq!(4, field.field_id());
-        }
+        let mut fields = schema.fields().iter();
+
+        assert_eq!("title", fields.next().unwrap().name());
+        assert_eq!("author", fields.next().unwrap().name());
+        assert_eq!("count", fields.next().unwrap().name());
+        assert_eq!("popularity", fields.next().unwrap().name());
+        assert_eq!("score", fields.next().unwrap().name());
        assert!(fields.next().is_none());
    }
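Review note on the `Schema::fields()` change above: the `+` side returns a plain `&[FieldEntry]`, so callers that relied on the `(Field, &FieldEntry)` pairs from the `-` side have to rebuild the handle from the slice index. The following is a minimal sketch of that pattern. `Field`, `FieldEntry` and `Schema` here are hypothetical stand-ins that model only the shapes visible in this diff, and `from_names` and `fields_with_ids` are helper names invented for illustration; none of this is tantivy's actual code.

```rust
// Hypothetical stand-ins for tantivy's `Field`, `FieldEntry` and `Schema`.
// Only the shapes visible in the diff are modelled.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Field(pub u32);

#[derive(Debug)]
pub struct FieldEntry {
    name: String,
}

impl FieldEntry {
    pub fn name(&self) -> &str {
        &self.name
    }
}

pub struct Schema {
    fields: Vec<FieldEntry>,
}

impl Schema {
    /// Builds a schema from field names in declaration order.
    /// (Helper for this sketch only, not part of the diff.)
    pub fn from_names(names: &[&str]) -> Schema {
        Schema {
            fields: names
                .iter()
                .map(|name| FieldEntry { name: name.to_string() })
                .collect(),
        }
    }

    /// Slice-returning accessor, shaped like the `+` side of the diff.
    pub fn fields(&self) -> &[FieldEntry] {
        &self.fields
    }
}

/// Pairs each entry name with a `Field` built from its position in the slice,
/// mirroring what the iterator on the `-` side produced.
pub fn fields_with_ids(schema: &Schema) -> Vec<(Field, &str)> {
    schema
        .fields()
        .iter()
        .enumerate()
        .map(|(field_id, entry)| (Field(field_id as u32), entry.name()))
        .collect()
}
```

This works because `add_field` assigns ids from `self.fields.len()`, so in this model a field's id always equals its index in the slice.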

--- a/src/schema/term.rs
+++ b/src/schema/term.rs
@@ -105,7 +105,7 @@ impl Term {
        if self.0.len() < 4 {
            self.0.resize(4, 0u8);
        }
-        BigEndian::write_u32(&mut self.0[0..4], field.field_id());
+        BigEndian::write_u32(&mut self.0[0..4], field.0);
    }

    /// Sets a u64 value in the term.
@@ -157,7 +157,7 @@ where

    /// Returns the field.
    pub fn field(&self) -> Field {
-        Field::from_field_id(BigEndian::read_u32(&self.0.as_ref()[..4]))
+        Field(BigEndian::read_u32(&self.0.as_ref()[..4]))
    }

    /// Returns the `u64` value stored in a term.
@@ -227,7 +227,7 @@ impl fmt::Debug for Term {
        write!(
            f,
            "Term(field={},bytes={:?})",
-            self.field().field_id(),
+            self.field().0,
            self.value_bytes()
        )
    }
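Review note on the `term.rs` hunks above: both sides of the diff keep the same byte layout, with the field id stored big-endian in the first four bytes of the term buffer; only the accessor (`field.field_id()` versus `field.0`) changes. The sketch below illustrates that layout on a bare `Vec<u8>`. It deliberately swaps the `byteorder::BigEndian` calls from the diff for the standard library's `u32::to_be_bytes` and `u32::from_be_bytes` so that it needs no external crate, and the free functions `set_field` and `field` are illustrative names rather than the methods on `Term`.

```rust
/// Writes `field_id` into the first four bytes of `term`, big-endian,
/// growing the buffer to four bytes first if it is shorter.
pub fn set_field(term: &mut Vec<u8>, field_id: u32) {
    if term.len() < 4 {
        term.resize(4, 0u8);
    }
    term[0..4].copy_from_slice(&field_id.to_be_bytes());
}

/// Reads the field id back from the four-byte header.
/// Panics if `term` holds fewer than four bytes.
pub fn field(term: &[u8]) -> u32 {
    let mut header = [0u8; 4];
    header.copy_from_slice(&term[..4]);
    u32::from_be_bytes(header)
}
```

Rewriting the header leaves any value bytes after offset 4 untouched, which is why the field of an existing term can be changed in place.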
--- a/src/schema/value.rs
+++ b/src/schema/value.rs
@@ -1,5 +1,4 @@
 use crate::schema::Facet;
-use crate::tokenizer::PreTokenizedString;
 use crate::DateTime;
 use serde::de::Visitor;
 use serde::{Deserialize, Deserializer, Serialize, Serializer};
@@ -11,8 +10,6 @@ use std::{cmp::Ordering, fmt};
 pub enum Value {
    /// The str type is used for any text information.
    Str(String),
-    /// Pre-tokenized str type,
-    PreTokStr(PreTokenizedString),
    /// Unsigned 64-bits Integer `u64`
    U64(u64),
    /// Signed 64-bits Integer `i64`
@@ -32,7 +29,6 @@ impl Ord for Value {
    fn cmp(&self, other: &Self) -> Ordering {
        match (self, other) {
            (Value::Str(l), Value::Str(r)) => l.cmp(r),
-            (Value::PreTokStr(l), Value::PreTokStr(r)) => l.cmp(r),
            (Value::U64(l), Value::U64(r)) => l.cmp(r),
            (Value::I64(l), Value::I64(r)) => l.cmp(r),
            (Value::Date(l), Value::Date(r)) => l.cmp(r),
@@ -48,8 +44,6 @@ impl Ord for Value {
            }
            (Value::Str(_), _) => Ordering::Less,
            (_, Value::Str(_)) => Ordering::Greater,
-            (Value::PreTokStr(_), _) => Ordering::Less,
-            (_, Value::PreTokStr(_)) => Ordering::Greater,
            (Value::U64(_), _) => Ordering::Less,
            (_, Value::U64(_)) => Ordering::Greater,
            (Value::I64(_), _) => Ordering::Less,
@@ -71,7 +65,6 @@ impl Serialize for Value {
    {
        match *self {
            Value::Str(ref v) => serializer.serialize_str(v),
-            Value::PreTokStr(ref v) => v.serialize(serializer),
            Value::U64(u) => serializer.serialize_u64(u),
            Value::I64(u) => serializer.serialize_i64(u),
            Value::F64(u) => serializer.serialize_f64(u),
@@ -131,15 +124,6 @@ impl Value {
        }
    }

-    /// Returns the tokenized text, provided the value is of the `PreTokStr` type.
-    /// (Returns None if the value is not of the `PreTokStr` type).
-    pub fn tokenized_text(&self) -> Option<&PreTokenizedString> {
-        match *self {
-            Value::PreTokStr(ref tok_text) => Some(tok_text),
-            _ => None,
-        }
-    }
-
    /// Returns the u64-value, provided the value is of the `U64` type.
    ///
    /// # Panics
@@ -237,7 +221,6 @@ mod binary_serialize {
    use super::Value;
    use crate::common::{f64_to_u64, u64_to_f64, BinarySerializable};
    use crate::schema::Facet;
-    use crate::tokenizer::PreTokenizedString;
    use chrono::{TimeZone, Utc};
    use std::io::{self, Read, Write};

@@ -248,11 +231,6 @@ mod binary_serialize {
    const BYTES_CODE: u8 = 4;
    const DATE_CODE: u8 = 5;
    const F64_CODE: u8 = 6;
-    const EXT_CODE: u8 = 7;
-
-    // extended types
-
-    const TOK_STR_CODE: u8 = 0;

    impl BinarySerializable for Value {
        fn serialize<W: Write>(&self, writer: &mut W) -> io::Result<()> {
@@ -261,18 +239,6 @@ mod binary_serialize {
                    TEXT_CODE.serialize(writer)?;
                    text.serialize(writer)
                }
-                Value::PreTokStr(ref tok_str) => {
-                    EXT_CODE.serialize(writer)?;
-                    TOK_STR_CODE.serialize(writer)?;
-                    if let Ok(text) = serde_json::to_string(tok_str) {
-                        text.serialize(writer)
-                    } else {
-                        Err(io::Error::new(
-                            io::ErrorKind::Other,
-                            "Failed to dump Value::PreTokStr(_) to json.",
-                        ))
-                    }
-                }
                Value::U64(ref val) => {
                    U64_CODE.serialize(writer)?;
                    val.serialize(writer)
@@ -324,30 +290,6 @@ mod binary_serialize {
                }
                HIERARCHICAL_FACET_CODE => Ok(Value::Facet(Facet::deserialize(reader)?)),
                BYTES_CODE => Ok(Value::Bytes(Vec::<u8>::deserialize(reader)?)),
-                EXT_CODE => {
-                    let ext_type_code = u8::deserialize(reader)?;
-                    match ext_type_code {
-                        TOK_STR_CODE => {
-                            let str_val = String::deserialize(reader)?;
-                            if let Ok(value) = serde_json::from_str::<PreTokenizedString>(&str_val)
-                            {
-                                Ok(Value::PreTokStr(value))
-                            } else {
-                                Err(io::Error::new(
-                                    io::ErrorKind::Other,
-                                    "Failed to parse string data as Value::PreTokStr(_).",
-                                ))
-                            }
-                        }
-                        _ => Err(io::Error::new(
-                            io::ErrorKind::InvalidData,
-                            format!(
-                                "No extened field type is associated with code {:?}",
-                                ext_type_code
-                            ),
-                        )),
-                    }
-                }
                _ => Err(io::Error::new(
                    io::ErrorKind::InvalidData,
                    format!("No field type is associated with code {:?}", type_code),
--- a/src/tokenizer/mod.rs
+++ b/src/tokenizer/mod.rs
@@ -136,7 +136,6 @@ mod simple_tokenizer;
 mod stemmer;
 mod stop_word_filter;
 mod token_stream_chain;
-mod tokenized_string;
 mod tokenizer;
 mod tokenizer_manager;

@@ -153,9 +152,7 @@ pub use self::stop_word_filter::StopWordFilter;
 pub(crate) use self::token_stream_chain::TokenStreamChain;
 pub use self::tokenizer::BoxedTokenizer;

-pub use self::tokenized_string::{PreTokenizedStream, PreTokenizedString};
 pub use self::tokenizer::{Token, TokenFilter, TokenStream, Tokenizer};
-
 pub use self::tokenizer_manager::TokenizerManager;

 /// Maximum authorized len (in bytes) for a token.
--- a/src/tokenizer/tokenized_string.rs
+++ b/src/tokenizer/tokenized_string.rs
@@ -1,191 +0,0 @@
-use crate::tokenizer::{Token, TokenStream, TokenStreamChain};
-use std::cmp::Ordering;
-
-/// Struct representing pre-tokenized text
-#[derive(Debug, Clone, Serialize, Deserialize, Eq, PartialEq)]
-pub struct PreTokenizedString {
-    /// Original text
-    pub text: String,
-    /// Tokens derived from the text
-    pub tokens: Vec<Token>,
-}
-
-impl Ord for PreTokenizedString {
-    fn cmp(&self, other: &Self) -> Ordering {
-        self.text.cmp(&other.text)
-    }
-}
-
-impl PartialOrd for PreTokenizedString {
-    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
-        Some(self.cmp(other))
-    }
-}
-
-/// TokenStream implementation which wraps PreTokenizedString
-pub struct PreTokenizedStream {
-    tokenized_string: PreTokenizedString,
-    current_token: i64,
-}
-
-impl From<PreTokenizedString> for PreTokenizedStream {
-    fn from(s: PreTokenizedString) -> PreTokenizedStream {
-        PreTokenizedStream {
-            tokenized_string: s,
-            current_token: -1,
-        }
-    }
-}
-
-impl PreTokenizedStream {
-    /// Creates a TokenStream from PreTokenizedString array
-    pub fn chain_tokenized_strings<'a>(
-        tok_strings: &'a [&'a PreTokenizedString],
-    ) -> Box<dyn TokenStream + 'a> {
-        if tok_strings.len() == 1 {
-            return Box::new(PreTokenizedStream::from((*tok_strings[0]).clone()));
-        }
-        let mut offsets = vec![];
-        let mut total_offset = 0;
-        for &tok_string in tok_strings {
-            offsets.push(total_offset);
-            if let Some(last_token) = tok_string.tokens.last() {
-                total_offset += last_token.offset_to;
-            }
-        }
-        let token_streams: Vec<_> = tok_strings
-            .iter()
-            .map(|tok_string| PreTokenizedStream::from((*tok_string).clone()))
-            .collect();
-        Box::new(TokenStreamChain::new(offsets, token_streams))
-    }
-}
-
-impl TokenStream for PreTokenizedStream {
-    fn advance(&mut self) -> bool {
-        if self.current_token >= self.tokenized_string.tokens.len() as i64 - 1 {
-            // This was our last token.
-            return false;
-        }
-        self.current_token += 1;
-        true
-    }
-
-    fn token(&self) -> &Token {
-        assert!(
-            self.current_token >= 0,
-            "TokenStream not initialized. You should call advance() at least once."
-        );
-        &self.tokenized_string.tokens[self.current_token as usize]
-    }
-
-    fn token_mut(&mut self) -> &mut Token {
-        assert!(
-            self.current_token >= 0,
-            "TokenStream not initialized. You should call advance() at least once."
-        );
-        &mut self.tokenized_string.tokens[self.current_token as usize]
-    }
-}
-
-#[cfg(test)]
-mod tests {
-
-    use super::*;
-
-    use crate::tokenizer::Token;
-
-    #[test]
-    fn test_tokenized_stream() {
-        let tok_text = PreTokenizedString {
-            text: String::from("A a"),
-            tokens: vec![
-                Token {
-                    offset_from: 0,
-                    offset_to: 1,
-                    position: 0,
-                    text: String::from("A"),
-                    position_length: 1,
-                },
-                Token {
-                    offset_from: 2,
-                    offset_to: 3,
-                    position: 1,
-                    text: String::from("a"),
-                    position_length: 1,
-                },
-            ],
-        };
-
-        let mut tok_stream = PreTokenizedStream::from(tok_text.clone());
-
-        let mut i = 0;
-        while tok_stream.advance() {
-            assert!(*tok_stream.token() == tok_text.tokens[i]);
-            i += 1;
-        }
-    }
-
-    #[test]
-    fn test_chain_tokenized_strings() {
-        let tok_text = PreTokenizedString {
-            text: String::from("A a"),
-            tokens: vec![
-                Token {
-                    offset_from: 0,
-                    offset_to: 1,
-                    position: 0,
-                    text: String::from("A"),
-                    position_length: 1,
-                },
-                Token {
-                    offset_from: 2,
-                    offset_to: 3,
-                    position: 1,
-                    text: String::from("a"),
-                    position_length: 1,
-                },
-            ],
-        };
-
-        let chain_parts = vec![&tok_text, &tok_text];
-
-        let mut token_stream = PreTokenizedStream::chain_tokenized_strings(&chain_parts[..]);
-
-        let expected_tokens = vec![
-            Token {
-                offset_from: 0,
-                offset_to: 1,
-                position: 0,
-                text: String::from("A"),
-                position_length: 1,
-            },
-            Token {
-                offset_from: 2,
-                offset_to: 3,
-                position: 1,
-                text: String::from("a"),
-                position_length: 1,
-            },
-            Token {
-                offset_from: 3,
-                offset_to: 4,
-                position: 3,
-                text: String::from("A"),
-                position_length: 1,
-            },
-            Token {
-                offset_from: 5,
-                offset_to: 6,
-                position: 4,
-                text: String::from("a"),
-                position_length: 1,
-            },
-        ];
-        for expected_token in expected_tokens {
-            assert!(token_stream.advance());
-            assert_eq!(token_stream.token(), &expected_token);
-        }
-        assert!(!token_stream.advance());
-    }
-}
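Review note on the removed `chain_tokenized_strings`: it gives each pre-tokenized string a start offset equal to the running sum of the previous strings' last `offset_to`, then hands those offsets to `TokenStreamChain`. The sketch below reproduces only that offset bookkeeping. The trimmed `Token` struct and the `chain_offsets` and `shift_tokens` names are hypothetical. The shifting step is inferred from the expected tokens in `test_chain_tokenized_strings`, because `TokenStreamChain` itself is not part of this diff. Token positions, which that test shows jumping from 1 to 3, are left out.

```rust
/// Hypothetical, trimmed-down token: only the byte offsets and text are kept.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Token {
    pub offset_from: usize,
    pub offset_to: usize,
    pub text: String,
}

/// Start offset of each part: the sum of the last `offset_to` of every
/// preceding part. Empty parts contribute nothing to the running total.
pub fn chain_offsets(parts: &[Vec<Token>]) -> Vec<usize> {
    let mut offsets = Vec::with_capacity(parts.len());
    let mut total_offset = 0;
    for tokens in parts {
        offsets.push(total_offset);
        if let Some(last_token) = tokens.last() {
            total_offset += last_token.offset_to;
        }
    }
    offsets
}

/// Flattens the parts into one token list, shifting every byte offset by
/// the start offset of the part the token came from.
/// (Assumed behaviour, inferred from the removed test's expected offsets.)
pub fn shift_tokens(parts: &[Vec<Token>]) -> Vec<Token> {
    let offsets = chain_offsets(parts);
    parts
        .iter()
        .zip(offsets)
        .flat_map(|(tokens, offset)| {
            tokens.iter().map(move |token| Token {
                offset_from: token.offset_from + offset,
                offset_to: token.offset_to + offset,
                text: token.text.clone(),
            })
        })
        .collect()
}
```

With the two-token input `"A a"` chained twice, as in the removed test, the second copy's tokens land at byte ranges 3..4 and 5..6.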
--- a/src/tokenizer/tokenizer.rs
+++ b/src/tokenizer/tokenizer.rs
@@ -4,7 +4,7 @@ use crate::tokenizer::TokenStreamChain;
 use std::borrow::{Borrow, BorrowMut};

 /// Token
-#[derive(Debug, Clone, Serialize, Deserialize, Eq, PartialEq)]
+#[derive(Debug, Clone)]
 pub struct Token {
    /// Offset (byte index) of the first character of the token.
    /// Offsets shall not be modified by token filters.
--- a/tests/failpoints/mod.rs
+++ b/tests/failpoints/mod.rs
@@ -1,4 +1,5 @@
 use fail;
+use std::io::Write;
 use std::path::Path;
 use tantivy::directory::{Directory, ManagedDirectory, RAMDirectory, TerminatingWrite};
 use tantivy::doc;
Author	SHA1	Message	Date
Paul Masurel	9fd23f3abf	Fixing bench compilation	2019-10-04 16:36:17 +09:00
Paul Masurel	c030990d00	fmt	2019-10-02 09:50:20 +09:00