mirror of https://github.com/quickwit-oss/tantivy.git
synced 2025-12-30 05:52:54 +00:00

Compare commits: `remove_ran...` → `0.16.1` (42 commits)
Commit SHA1s:

46b86a7976, 3bc177e69d, 319609e9c1, 9d87b89718, dd81e38e53, 9f32b22602,
096ce7488e, a1782dd172, 000d76b11a, abd29f6646, b4ecf0ab2f, 798f7dbf67,
06a2e47c8d, e0b83eb291, 13401f46ea, 1a45b030dc, 62052bcc2d, 3265f7bec3,
ee0881712a, 483e0336b6, 3e8f267e33, 3b247fd968, 750f6e6479, 5b475e6603,
0ca7f73dc5, 47ed18845e, dc141cdb29, f6cf6e889b, f379a80233, 4a320fd1ff,
85d23e8e3b, 022ab9d298, 605e8603dc, 70f160b329, 6d265e6bed, fdc512391b,
108714c934, 44e8cf98a5, f0ee69d9e9, b8a10c8406, ff4813529e, 470bc18e9b
.github/workflows/coverage.yml (vendored, 34 changed lines):

```diff
@@ -1,27 +1,25 @@
-name: coverage
+name: Coverage

 on:
   push:
     branches: [ main ]
   pull_request:
     branches: [ main ]

 jobs:
-  test:
-    name: coverage
+  coverage:
     runs-on: ubuntu-latest
-    container:
-      image: xd009642/tarpaulin:develop-nightly
-      options: --security-opt seccomp=unconfined
     steps:
-      - name: Checkout repository
-        uses: actions/checkout@v2
-      - name: Generate code coverage
-        run: |
-          cargo +nightly tarpaulin --verbose --all-features --workspace --timeout 120 --out Xml
-      - name: Upload to codecov.io
-        uses: codecov/codecov-action@v1
+      - uses: actions/checkout@v2
+      - name: Install Rust
+        run: rustup toolchain install nightly --component llvm-tools-preview
+      - name: Install cargo-llvm-cov
+        run: curl -LsSf https://github.com/taiki-e/cargo-llvm-cov/releases/latest/download/cargo-llvm-cov-x86_64-unknown-linux-gnu.tar.gz | tar xzf - -C ~/.cargo/bin
+      - name: Generate code coverage
+        run: cargo llvm-cov --all-features --workspace --lcov --output-path lcov.info
+      - name: Upload coverage to Codecov
+        uses: codecov/codecov-action@v1
         with:
-          # token: ${{secrets.CODECOV_TOKEN}} # not required for public repos
-          fail_ci_if_error: true
+          token: ${{ secrets.CODECOV_TOKEN }} # not required for public repos
+          files: lcov.info
+          fail_ci_if_error: true
```
.github/workflows/long_running.yml (vendored, new file, 24 lines):

```yaml
name: Rust

on:
  push:
    branches: [ main ]

env:
  CARGO_TERM_COLOR: always
  NUM_FUNCTIONAL_TEST_ITERATIONS: 20000

jobs:
  functional_test_unsorted:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Run indexing_unsorted
        run: cargo test indexing_unsorted -- --ignored
  functional_test_sorted:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Run indexing_sorted
        run: cargo test indexing_sorted -- --ignored
```
.github/workflows/test.yml (vendored, 2 changed lines):

```diff
@@ -10,7 +10,7 @@ env:
   CARGO_TERM_COLOR: always

 jobs:
-  build:
+  test:

     runs-on: ubuntu-latest
```
CHANGELOG.md:

```diff
@@ -1,3 +1,12 @@
+Tantivy 0.16.1
+========================
+- Major Bugfix on multivalued fastfield. #1151
+
+Tantivy 0.16.0
+=========================
+- Bugfix in the filesum check. (@evanxg852000) #1127
+- Bugfix in positions when the index is sorted by a field. (@appaquet) #1125
+
 Tantivy 0.15.3
 =========================
 - Major bugfix. Deleting documents was broken when the index was sorted by a field. (@appaquet, @fulmicoton) #1101
```
Cargo.toml (14 changed lines):

```diff
@@ -1,6 +1,6 @@
 [package]
 name = "tantivy"
-version = "0.16.0-dev"
+version = "0.16.1"
 authors = ["Paul Masurel <paul.masurel@gmail.com>"]
 license = "MIT"
 categories = ["database-implementations", "data-structures"]
@@ -19,7 +19,7 @@ crc32fast = "1.2.1"
 once_cell = "1.7.2"
 regex ={ version = "1.5.4", default-features = false, features = ["std"] }
 tantivy-fst = "0.3"
-memmap = {version = "0.7", optional=true}
+memmap2 = {version = "0.3", optional=true}
 lz4_flex = { version = "0.8.0", default-features = false, features = ["checked-decode"], optional = true }
 brotli = { version = "3.3", optional = true }
 snap = { version = "1.0.5", optional = true }
@@ -31,11 +31,11 @@ num_cpus = "1.13"
 fs2={ version = "0.4.3", optional = true }
 levenshtein_automata = "0.2"
 uuid = { version = "0.8.2", features = ["v4", "serde"] }
-crossbeam = "0.8"
+crossbeam = "0.8.1"
 futures = { version = "0.3.15", features = ["thread-pool"] }
 tantivy-query-grammar = { version="0.15.0", path="./query-grammar" }
 tantivy-bitpacker = { version="0.1", path="./bitpacker" }
-common = { version="0.1", path="./common" }
+common = { version = "0.1", path = "./common/", package = "tantivy-common" }
 fastfield_codecs = { version="0.1", path="./fastfield_codecs", default-features = false }
 ownedbytes = { version="0.1", path="./ownedbytes" }
 stable_deref_trait = "1.2"
@@ -64,7 +64,9 @@ rand = "0.8.3"
 maplit = "1.0.2"
 matches = "0.1.8"
 proptest = "1.0"
-criterion = "0.3.4"
+criterion = "0.3.5"
+test-env-log = "0.2.7"
+env_logger = "0.9.0"

 [dev-dependencies.fail]
 version = "0.4"
@@ -81,7 +83,7 @@ overflow-checks = true

 [features]
 default = ["mmap", "lz4-compression" ]
-mmap = ["fs2", "tempfile", "memmap"]
+mmap = ["fs2", "tempfile", "memmap2"]

 brotli-compression = ["brotli"]
 lz4-compression = ["lz4_flex"]
```
README badges:

```diff
@@ -1,9 +1,9 @@
-[](https://travis-ci.org/tantivy-search/tantivy)
 [](https://docs.rs/crate/tantivy/)
+[](https://github.com/tantivy-search/tantivy/actions/workflows/test.yml)
 [](https://codecov.io/gh/tantivy-search/tantivy)
 [](https://gitter.im/tantivy-search/tantivy?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
 [](https://opensource.org/licenses/MIT)
 [](https://ci.appveyor.com/project/fulmicoton/tantivy/branch/main)
 [](https://crates.io/crates/tantivy)
```
tantivy-bitpacker Cargo.toml:

```diff
@@ -1,6 +1,6 @@
 [package]
 name = "tantivy-bitpacker"
-version = "0.1.0"
+version = "0.1.1"
 edition = "2018"
 authors = ["Paul Masurel <paul.masurel@gmail.com>"]
 license = "MIT"
```
tantivy-bitpacker tests:

```diff
@@ -50,3 +50,32 @@ where
     }
     None
 }
+
+#[test]
+fn test_compute_num_bits() {
+    assert_eq!(compute_num_bits(1), 1u8);
+    assert_eq!(compute_num_bits(0), 0u8);
+    assert_eq!(compute_num_bits(2), 2u8);
+    assert_eq!(compute_num_bits(3), 2u8);
+    assert_eq!(compute_num_bits(4), 3u8);
+    assert_eq!(compute_num_bits(255), 8u8);
+    assert_eq!(compute_num_bits(256), 9u8);
+    assert_eq!(compute_num_bits(5_000_000_000), 33u8);
+}
+
+#[test]
+fn test_minmax_empty() {
+    let vals: Vec<u32> = vec![];
+    assert_eq!(minmax(vals.into_iter()), None);
+}
+
+#[test]
+fn test_minmax_one() {
+    assert_eq!(minmax(vec![1].into_iter()), Some((1, 1)));
+}
+
+#[test]
+fn test_minmax_two() {
+    assert_eq!(minmax(vec![1, 2].into_iter()), Some((1, 2)));
+    assert_eq!(minmax(vec![2, 1].into_iter()), Some((1, 2)));
+}
```
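These assertions pin down `compute_num_bits` as the bit width needed to bit-pack any value in `0..=amplitude`, i.e. `floor(log2(amplitude)) + 1`, with 0 mapping to 0 bits. A minimal stand-alone implementation consistent with the tests above (a sketch, not necessarily the crate's exact code):

```rust
/// Bit width needed to represent `amplitude`: position of its highest set bit.
fn compute_num_bits(amplitude: u64) -> u8 {
    // 0 has no set bits and needs 0 bits; 256 = 2^8 needs 9 bits.
    (64 - amplitude.leading_zeros()) as u8
}

fn main() {
    assert_eq!(compute_num_bits(0), 0);
    assert_eq!(compute_num_bits(255), 8);
    assert_eq!(compute_num_bits(256), 9);
    assert_eq!(compute_num_bits(5_000_000_000), 33);
}
```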
tantivy-common Cargo.toml:

```diff
@@ -1,5 +1,5 @@
 [package]
-name = "common"
+name = "tantivy-common"
 version = "0.1.0"
 authors = ["Paul Masurel <paul@quickwit.io>", "Pascal Seitz <pascal@quickwit.io>"]
 license = "MIT"
@@ -10,3 +10,7 @@ description = "common traits and utility functions used by multiple tantivy subc
 
 [dependencies]
 byteorder = "1.4.3"
+
+[dev-dependencies]
+proptest = "1.0.0"
+rand = "0.8.4"
```
common bitset module:

```diff
@@ -2,7 +2,7 @@ use std::fmt;
 use std::u64;
 
 #[derive(Clone, Copy, Eq, PartialEq)]
-pub(crate) struct TinySet(u64);
+pub struct TinySet(u64);
 
 impl fmt::Debug for TinySet {
     fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
@@ -178,7 +178,7 @@ impl BitSet {
     ///
     /// Reminder: the tiny set with the bucket `bucket`, represents the
     /// elements from `bucket * 64` to `(bucket+1) * 64`.
-    pub(crate) fn first_non_empty_bucket(&self, bucket: u32) -> Option<u32> {
+    pub fn first_non_empty_bucket(&self, bucket: u32) -> Option<u32> {
         self.tinysets[bucket as usize..]
             .iter()
             .cloned()
@@ -193,7 +193,7 @@ impl BitSet {
     /// Returns the tiny bitset representing the
     /// the set restricted to the number range from
     /// `bucket * 64` to `(bucket + 1) * 64`.
-    pub(crate) fn tinyset(&self, bucket: u32) -> TinySet {
+    pub fn tinyset(&self, bucket: u32) -> TinySet {
         self.tinysets[bucket as usize]
     }
 }
@@ -203,11 +203,9 @@ mod tests {
 
     use super::BitSet;
     use super::TinySet;
-    use crate::docset::{DocSet, TERMINATED};
-    use crate::query::BitSetDocSet;
-    use crate::tests;
-    use crate::tests::generate_nonunique_unsorted;
-    use std::collections::BTreeSet;
+    use rand::distributions::Bernoulli;
+    use rand::rngs::StdRng;
+    use rand::{Rng, SeedableRng};
     use std::collections::HashSet;
 
     #[test]
@@ -263,29 +261,6 @@ mod tests {
         test_against_hashset(&[62u32, 63u32], 64);
     }
 
-    #[test]
-    fn test_bitset_large() {
-        let arr = generate_nonunique_unsorted(100_000, 5_000);
-        let mut btreeset: BTreeSet<u32> = BTreeSet::new();
-        let mut bitset = BitSet::with_max_value(100_000);
-        for el in arr {
-            btreeset.insert(el);
-            bitset.insert(el);
-        }
-        for i in 0..100_000 {
-            assert_eq!(btreeset.contains(&i), bitset.contains(i));
-        }
-        assert_eq!(btreeset.len(), bitset.len());
-        let mut bitset_docset = BitSetDocSet::from(bitset);
-        let mut remaining = true;
-        for el in btreeset.into_iter() {
-            assert!(remaining);
-            assert_eq!(bitset_docset.doc(), el);
-            remaining = bitset_docset.advance() != TERMINATED;
-        }
-        assert!(!remaining);
-    }
-
     #[test]
     fn test_bitset_num_buckets() {
         use super::num_buckets;
@@ -340,10 +315,23 @@ mod tests {
         assert_eq!(bitset.len(), 3);
     }
 
+    pub fn sample_with_seed(n: u32, ratio: f64, seed_val: u8) -> Vec<u32> {
+        StdRng::from_seed([seed_val; 32])
+            .sample_iter(&Bernoulli::new(ratio).unwrap())
+            .take(n as usize)
+            .enumerate()
+            .filter_map(|(val, keep)| if keep { Some(val as u32) } else { None })
+            .collect()
+    }
+
+    pub fn sample(n: u32, ratio: f64) -> Vec<u32> {
+        sample_with_seed(n, ratio, 4)
+    }
+
     #[test]
     fn test_bitset_clear() {
         let mut bitset = BitSet::with_max_value(1_000);
-        let els = tests::sample(1_000, 0.01f64);
+        let els = sample(1_000, 0.01f64);
         for &el in &els {
             bitset.insert(el);
         }
```
common crate lib:

```diff
@@ -1,9 +1,167 @@
+use std::ops::Deref;
+
 pub use byteorder::LittleEndian as Endianness;
 
+mod bitset;
 mod serialize;
 mod vint;
 mod writer;
 
+pub use bitset::*;
 pub use serialize::{BinarySerializable, DeserializeFrom, FixedSize};
 pub use vint::{read_u32_vint, read_u32_vint_no_advance, serialize_vint_u32, write_u32_vint, VInt};
 pub use writer::{AntiCallToken, CountingWriter, TerminatingWrite};
+
+/// Has length trait
+pub trait HasLen {
+    /// Return length
+    fn len(&self) -> usize;
+
+    /// Returns true iff empty.
+    fn is_empty(&self) -> bool {
+        self.len() == 0
+    }
+}
+
+impl<T: Deref<Target = [u8]>> HasLen for T {
+    fn len(&self) -> usize {
+        self.deref().len()
+    }
+}
+
+const HIGHEST_BIT: u64 = 1 << 63;
+
+/// Maps a `i64` to `u64`
+///
+/// For simplicity, tantivy internally handles `i64` as `u64`.
+/// The mapping is defined by this function.
+///
+/// Maps `i64` to `u64` so that
+/// `-2^63 .. 2^63-1` is mapped
+/// to
+/// `0 .. 2^64-1`
+/// in that order.
+///
+/// This is more suited than simply casting (`val as u64`)
+/// because of bitpacking.
+///
+/// Imagine a list of `i64` ranging from -10 to 10.
+/// When casting negative values, the negative values are projected
+/// to values over 2^63, and all values end up requiring 64 bits.
+///
+/// # See also
+/// The [reverse mapping is `u64_to_i64`](./fn.u64_to_i64.html).
+#[inline]
+pub fn i64_to_u64(val: i64) -> u64 {
+    (val as u64) ^ HIGHEST_BIT
+}
+
+/// Reverse the mapping given by [`i64_to_u64`](./fn.i64_to_u64.html).
+#[inline]
+pub fn u64_to_i64(val: u64) -> i64 {
+    (val ^ HIGHEST_BIT) as i64
+}
+
+/// Maps a `f64` to `u64`
+///
+/// For simplicity, tantivy internally handles `f64` as `u64`.
+/// The mapping is defined by this function.
+///
+/// Maps `f64` to `u64` in a monotonic manner, so that bytes lexical order is preserved.
+///
+/// This is more suited than simply casting (`val as u64`)
+/// which would truncate the result
+///
+/// # Reference
+///
+/// Daniel Lemire's [blog post](https://lemire.me/blog/2020/12/14/converting-floating-point-numbers-to-integers-while-preserving-order/)
+/// explains the mapping in a clear manner.
+///
+/// # See also
+/// The [reverse mapping is `u64_to_f64`](./fn.u64_to_f64.html).
+#[inline]
+pub fn f64_to_u64(val: f64) -> u64 {
+    let bits = val.to_bits();
+    if val.is_sign_positive() {
+        bits ^ HIGHEST_BIT
+    } else {
+        !bits
+    }
+}
+
+/// Reverse the mapping given by [`f64_to_u64`](./fn.f64_to_u64.html).
+#[inline]
+pub fn u64_to_f64(val: u64) -> f64 {
+    f64::from_bits(if val & HIGHEST_BIT != 0 {
+        val ^ HIGHEST_BIT
+    } else {
+        !val
+    })
+}
+
+#[cfg(test)]
+pub mod test {
+
+    use super::{f64_to_u64, i64_to_u64, u64_to_f64, u64_to_i64};
+    use super::{BinarySerializable, FixedSize};
+    use proptest::prelude::*;
+    use std::f64;
+
+    fn test_i64_converter_helper(val: i64) {
+        assert_eq!(u64_to_i64(i64_to_u64(val)), val);
+    }
+
+    fn test_f64_converter_helper(val: f64) {
+        assert_eq!(u64_to_f64(f64_to_u64(val)), val);
+    }
+
+    pub fn fixed_size_test<O: BinarySerializable + FixedSize + Default>() {
+        let mut buffer = Vec::new();
+        O::default().serialize(&mut buffer).unwrap();
+        assert_eq!(buffer.len(), O::SIZE_IN_BYTES);
+    }
+
+    proptest! {
+        #[test]
+        fn test_f64_converter_monotonicity_proptest((left, right) in (proptest::num::f64::NORMAL, proptest::num::f64::NORMAL)) {
+            let left_u64 = f64_to_u64(left);
+            let right_u64 = f64_to_u64(right);
+            assert_eq!(left_u64 < right_u64, left < right);
+        }
+    }
+
+    #[test]
+    fn test_i64_converter() {
+        assert_eq!(i64_to_u64(i64::min_value()), u64::min_value());
+        assert_eq!(i64_to_u64(i64::max_value()), u64::max_value());
+        test_i64_converter_helper(0i64);
+        test_i64_converter_helper(i64::min_value());
+        test_i64_converter_helper(i64::max_value());
+        for i in -1000i64..1000i64 {
+            test_i64_converter_helper(i);
+        }
+    }
+
+    #[test]
+    fn test_f64_converter() {
+        test_f64_converter_helper(f64::INFINITY);
+        test_f64_converter_helper(f64::NEG_INFINITY);
+        test_f64_converter_helper(0.0);
+        test_f64_converter_helper(-0.0);
+        test_f64_converter_helper(1.0);
+        test_f64_converter_helper(-1.0);
+    }
+
+    #[test]
+    fn test_f64_order() {
+        assert!(!(f64_to_u64(f64::NEG_INFINITY)..f64_to_u64(f64::INFINITY))
+            .contains(&f64_to_u64(f64::NAN))); // nan is not a number
+        assert!(f64_to_u64(1.5) > f64_to_u64(1.0)); // same exponent, different mantissa
+        assert!(f64_to_u64(2.0) > f64_to_u64(1.0)); // same mantissa, different exponent
+        assert!(f64_to_u64(2.0) > f64_to_u64(1.5)); // different exponent and mantissa
+        assert!(f64_to_u64(1.0) > f64_to_u64(-1.0)); // pos > neg
+        assert!(f64_to_u64(-1.5) < f64_to_u64(-1.0));
+        assert!(f64_to_u64(-2.0) < f64_to_u64(1.0));
+        assert!(f64_to_u64(-2.0) < f64_to_u64(-1.5));
+    }
+}
```
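The doc comment's bitpacking argument can be made concrete. For values in `-10..=10`, a plain cast projects the negatives above 2^63, so the amplitude of the set spans the whole `u64` range, while the xor mapping keeps the values contiguous. A self-contained sketch (`compute_num_bits` here mirrors the assertions in the bitpacker tests above):

```rust
fn compute_num_bits(amplitude: u64) -> u8 {
    (64 - amplitude.leading_zeros()) as u8
}

fn i64_to_u64(val: i64) -> u64 {
    (val as u64) ^ (1 << 63)
}

fn main() {
    let vals: Vec<i64> = (-10..=10).collect();

    // Plain cast: 0 maps to 0 and -1 maps to u64::MAX, so the
    // min..max amplitude needs all 64 bits per element.
    let cast: Vec<u64> = vals.iter().map(|&v| v as u64).collect();
    let cast_amplitude = cast.iter().max().unwrap() - cast.iter().min().unwrap();
    assert_eq!(compute_num_bits(cast_amplitude), 64);

    // The xor mapping is monotonic: -10..=10 stays a contiguous range of
    // 21 values (amplitude 20) and bitpacks into 5 bits per element.
    let mapped: Vec<u64> = vals.iter().map(|&v| i64_to_u64(v)).collect();
    let mapped_amplitude = mapped.iter().max().unwrap() - mapped.iter().min().unwrap();
    assert_eq!(compute_num_bits(mapped_amplitude), 5);
}
```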
Documentation summary:

```diff
@@ -7,6 +7,7 @@
 - [Segments](./basis.md)
 - [Defining your schema](./schema.md)
 - [Facetting](./facetting.md)
+- [Index Sorting](./index_sorting.md)
 - [Innerworkings](./innerworkings.md)
 - [Inverted index](./inverted_index.md)
 - [Best practise](./inverted_index.md)
```
doc/src/index_sorting.md (new file, 61 lines):

- [Index Sorting](#index-sorting)
  + [Why Sorting](#why-sorting)
    * [Compression](#compression)
    * [Top-N Optimization](#top-n-optimization)
    * [Pruning](#pruning)
    * [Other](#other)
  + [Usage](#usage)

# Index Sorting

Tantivy allows you to sort the index according to a property.

## Why Sorting

Presorting an index has several advantages:

###### Compression

Sorted data is easier to compress. E.g. the number sequence [5, 2, 3, 1, 4] sorts to [1, 2, 3, 4, 5]. With delta encoding, the unsorted sequence becomes [5, -3, 1, -2, 3], while the sorted one becomes [1, 1, 1, 1, 1], which compresses far better.
The compression ratio mainly improves for the fast field of the sorted property; everything else is likely unaffected.
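A rough illustration of the effect (a minimal sketch, not tantivy's actual codec; `bits_required` is a toy stand-in for the bitpacker's bit-width computation):

```rust
/// Toy measure: bits for the delta's magnitude plus one sign bit.
fn bits_required(val: i64) -> u32 {
    64 - val.unsigned_abs().leading_zeros() + 1
}

/// Widest delta between consecutive elements determines the packed width.
fn max_delta_bits(vals: &[i64]) -> u32 {
    vals.windows(2)
        .map(|w| bits_required(w[1] - w[0]))
        .max()
        .unwrap_or(0)
}

fn main() {
    let unsorted = [5i64, 2, 3, 1, 4];
    let mut sorted = unsorted;
    sorted.sort_unstable();
    // Unsorted deltas span [-3, 3]; the sorted list's deltas are all 1,
    // so the sorted list bitpacks into fewer bits per element.
    assert!(max_delta_bits(&sorted) < max_delta_bits(&unsorted));
}
```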
###### Top-N Optimization

When data is presorted by a field and a search query requests sorting by that same field, we can leverage the natural order of the documents.
E.g. if the data is sorted by timestamp and we want the top-n newest docs containing a term, we can simply leverage the order of the doc ids.

Note: Tantivy 0.16 does not do this optimization yet.

###### Pruning

Let's say we want all documents matching the filter `>= 2010-08-11`. When the data is sorted, we can do a lookup in the fast field to find the docid range and use that range as the filter.

Note: Tantivy 0.16 does not do this optimization yet.

###### Other?

In principle, many algorithms could exploit the monotonically increasing nature of the sorted field (aggregations maybe?).

## Usage

Index sorting is configured by setting [`sort_by_field`](https://github.com/tantivy-search/tantivy/blob/000d76b11a139a84b16b9b95060a1c93e8b9851c/src/core/index_meta.rs#L238) on `IndexSettings` and passing the settings to an `IndexBuilder`. As of tantivy 0.16, only fast fields are allowed as the sort field.

```rust
let settings = IndexSettings {
    sort_by_field: Some(IndexSortByField {
        field: "intval".to_string(),
        order: Order::Desc,
    }),
    ..Default::default()
};
let mut index_builder = Index::builder().schema(schema);
index_builder = index_builder.settings(settings);
let index = index_builder.create_in_ram().unwrap();
```

## Implementation details

Sorting an index is applied in the serialization step. In general there are two serialization steps: [finishing a single segment](https://github.com/tantivy-search/tantivy/blob/000d76b11a139a84b16b9b95060a1c93e8b9851c/src/indexer/segment_writer.rs#L338) and [merging multiple segments](https://github.com/tantivy-search/tantivy/blob/000d76b11a139a84b16b9b95060a1c93e8b9851c/src/indexer/merger.rs#L1073).

In both cases we generate a docid mapping reflecting the sort. This mapping is used when serializing the different components (doc store, fastfields, posting list, normfield, facets).
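A minimal sketch of how such a docid mapping can be derived (hypothetical helper name; the real type is the `DocIdMapping` imported in the fast field writer diff further below): sort the old doc ids by their fast field value, producing a new-docid → old-docid permutation used during serialization.

```rust
/// Builds a permutation: new_docid -> old_docid, ordering docs by a fast
/// field value in descending order (matching `Order::Desc` above).
fn docid_mapping_desc(fast_field_vals: &[u64]) -> Vec<u32> {
    let mut mapping: Vec<u32> = (0..fast_field_vals.len() as u32).collect();
    mapping.sort_by_key(|&old_doc| std::cmp::Reverse(fast_field_vals[old_doc as usize]));
    mapping
}

fn main() {
    // Old docids 0..5 carry these fast field values:
    let vals = [10u64, 50, 30, 20, 40];
    // Serialization would then write the old docs in this order:
    assert_eq!(docid_mapping_desc(&vals), vec![1, 4, 2, 3, 0]);
}
```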
```diff
@@ -86,12 +86,10 @@ impl Collector for StatsCollector {

     fn merge_fruits(&self, segment_stats: Vec<Option<Stats>>) -> tantivy::Result<Option<Stats>> {
         let mut stats = Stats::default();
-        for segment_stats_opt in segment_stats {
-            if let Some(segment_stats) = segment_stats_opt {
-                stats.count += segment_stats.count;
-                stats.sum += segment_stats.sum;
-                stats.squared_sum += segment_stats.squared_sum;
-            }
+        for segment_stats in segment_stats.into_iter().flatten() {
+            stats.count += segment_stats.count;
+            stats.sum += segment_stats.sum;
+            stats.squared_sum += segment_stats.squared_sum;
         }
         Ok(stats.non_zero_count())
     }
```
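The rewrite relies on a standard-library property: `Option<T>` is itself an iterator of zero or one items, so `Iterator::flatten` over a `Vec<Option<T>>` silently skips the `None`s, making the explicit `if let Some(..)` unnecessary. A standalone illustration:

```rust
fn main() {
    let fruits: Vec<Option<u32>> = vec![Some(1), None, Some(2)];
    // flatten drops the Nones and yields 1 and 2.
    let sum: u32 = fruits.into_iter().flatten().sum();
    assert_eq!(sum, 3);
}
```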
```diff
@@ -9,8 +9,8 @@ description = "Fast field codecs used by tantivy"
 # See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

 [dependencies]
-common = { path = "../common/" }
-tantivy-bitpacker = { path = "../bitpacker/" }
+common = { version = "0.1", path = "../common/", package = "tantivy-common" }
+tantivy-bitpacker = { version="0.1.1", path = "../bitpacker/" }
 prettytable-rs = {version="0.8.0", optional= true}
 rand = {version="0.8.3", optional= true}
```
```diff
@@ -1,3 +1,17 @@
+/*!
+
+The MultiLinearInterpol compressor uses linear interpolation to guess values and stores the offset from that guess, in blocks of 512.
+
+With a CHUNK_SIZE of 512 and 29 bytes (232 bits) of metadata per block, the metadata overhead is 232 / 512 ≈ 0.45 bits per element.
+The additional space required per element in a block is determined by the maximum deviation from the linear interpolation estimation function within that block.
+
+E.g. if the maximum deviation within a block is 12, every element in it costs 4 bits.
+
+Size per block:
+num_elements * num_bits(maximum deviation from interpolation) + 29 bytes metadata
+
+*/
 use crate::FastFieldCodecReader;
 use crate::FastFieldCodecSerializer;
 use crate::FastFieldDataAccess;
```
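A worked instance of that size formula (a sketch; `num_bits` mirrors the behavior pinned down by the bitpacker tests earlier in this diff):

```rust
/// Bits needed to bit-pack values in 0..=amplitude.
fn num_bits(amplitude: u64) -> u8 {
    (64 - amplitude.leading_zeros()) as u8
}

fn main() {
    // A full block of 512 elements whose worst interpolation error is 12:
    assert_eq!(num_bits(12), 4);
    let block_bits = 512 * num_bits(12) as u64 + 29 * 8;
    // 512 * 4 bits of deviations + 232 bits of metadata = 2280 bits ≈ 285 bytes.
    assert_eq!(block_bits, 2280);
}
```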
```diff
@@ -4,6 +4,7 @@ name = "ownedbytes"
 version = "0.1.0"
 edition = "2018"
 description = "Expose data as static slice"
+license = "MIT"
 # See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

 [dependencies]
```
Deleted module (`@@ -1,203 +0,0 @@`; its contents moved into the `common` crate and the `directory` module):

```rust
mod bitset;
mod composite_file;

pub use self::bitset::BitSet;
pub(crate) use self::bitset::TinySet;
pub(crate) use self::composite_file::{CompositeFile, CompositeWrite};
pub use byteorder::LittleEndian as Endianness;
pub use common::CountingWriter;
pub use common::{
    read_u32_vint, read_u32_vint_no_advance, serialize_vint_u32, write_u32_vint, VInt,
};
pub use common::{BinarySerializable, DeserializeFrom, FixedSize};

/// Segment's max doc must be `< MAX_DOC_LIMIT`.
///
/// We do not allow segments with more than
pub const MAX_DOC_LIMIT: u32 = 1 << 31;

/// Has length trait
pub trait HasLen {
    /// Return length
    fn len(&self) -> usize;

    /// Returns true iff empty.
    fn is_empty(&self) -> bool {
        self.len() == 0
    }
}

const HIGHEST_BIT: u64 = 1 << 63;

/// Maps a `i64` to `u64`
///
/// For simplicity, tantivy internally handles `i64` as `u64`.
/// The mapping is defined by this function.
///
/// Maps `i64` to `u64` so that
/// `-2^63 .. 2^63-1` is mapped
/// to
/// `0 .. 2^64-1`
/// in that order.
///
/// This is more suited than simply casting (`val as u64`)
/// because of bitpacking.
///
/// Imagine a list of `i64` ranging from -10 to 10.
/// When casting negative values, the negative values are projected
/// to values over 2^63, and all values end up requiring 64 bits.
///
/// # See also
/// The [reverse mapping is `u64_to_i64`](./fn.u64_to_i64.html).
#[inline]
pub fn i64_to_u64(val: i64) -> u64 {
    (val as u64) ^ HIGHEST_BIT
}

/// Reverse the mapping given by [`i64_to_u64`](./fn.i64_to_u64.html).
#[inline]
pub fn u64_to_i64(val: u64) -> i64 {
    (val ^ HIGHEST_BIT) as i64
}

/// Maps a `f64` to `u64`
///
/// For simplicity, tantivy internally handles `f64` as `u64`.
/// The mapping is defined by this function.
///
/// Maps `f64` to `u64` in a monotonic manner, so that bytes lexical order is preserved.
///
/// This is more suited than simply casting (`val as u64`)
/// which would truncate the result
///
/// # Reference
///
/// Daniel Lemire's [blog post](https://lemire.me/blog/2020/12/14/converting-floating-point-numbers-to-integers-while-preserving-order/)
/// explains the mapping in a clear manner.
///
/// # See also
/// The [reverse mapping is `u64_to_f64`](./fn.u64_to_f64.html).
#[inline]
pub fn f64_to_u64(val: f64) -> u64 {
    let bits = val.to_bits();
    if val.is_sign_positive() {
        bits ^ HIGHEST_BIT
    } else {
        !bits
    }
}

/// Reverse the mapping given by [`i64_to_u64`](./fn.i64_to_u64.html).
#[inline]
pub fn u64_to_f64(val: u64) -> f64 {
    f64::from_bits(if val & HIGHEST_BIT != 0 {
        val ^ HIGHEST_BIT
    } else {
        !val
    })
}

#[cfg(test)]
pub(crate) mod test {

    use super::{f64_to_u64, i64_to_u64, u64_to_f64, u64_to_i64};
    use common::{BinarySerializable, FixedSize};
    use proptest::prelude::*;
    use std::f64;
    use tantivy_bitpacker::compute_num_bits;
    pub use tantivy_bitpacker::minmax;

    fn test_i64_converter_helper(val: i64) {
        assert_eq!(u64_to_i64(i64_to_u64(val)), val);
    }

    fn test_f64_converter_helper(val: f64) {
        assert_eq!(u64_to_f64(f64_to_u64(val)), val);
    }

    pub fn fixed_size_test<O: BinarySerializable + FixedSize + Default>() {
        let mut buffer = Vec::new();
        O::default().serialize(&mut buffer).unwrap();
        assert_eq!(buffer.len(), O::SIZE_IN_BYTES);
    }

    proptest! {
        #[test]
        fn test_f64_converter_monotonicity_proptest((left, right) in (proptest::num::f64::NORMAL, proptest::num::f64::NORMAL)) {
            let left_u64 = f64_to_u64(left);
            let right_u64 = f64_to_u64(right);
            assert_eq!(left_u64 < right_u64, left < right);
        }
    }

    #[test]
    fn test_i64_converter() {
        assert_eq!(i64_to_u64(i64::min_value()), u64::min_value());
        assert_eq!(i64_to_u64(i64::max_value()), u64::max_value());
        test_i64_converter_helper(0i64);
        test_i64_converter_helper(i64::min_value());
        test_i64_converter_helper(i64::max_value());
        for i in -1000i64..1000i64 {
            test_i64_converter_helper(i);
        }
    }

    #[test]
    fn test_f64_converter() {
        test_f64_converter_helper(f64::INFINITY);
        test_f64_converter_helper(f64::NEG_INFINITY);
        test_f64_converter_helper(0.0);
        test_f64_converter_helper(-0.0);
        test_f64_converter_helper(1.0);
        test_f64_converter_helper(-1.0);
    }

    #[test]
    fn test_f64_order() {
        assert!(!(f64_to_u64(f64::NEG_INFINITY)..f64_to_u64(f64::INFINITY))
            .contains(&f64_to_u64(f64::NAN))); // nan is not a number
        assert!(f64_to_u64(1.5) > f64_to_u64(1.0)); // same exponent, different mantissa
        assert!(f64_to_u64(2.0) > f64_to_u64(1.0)); // same mantissa, different exponent
        assert!(f64_to_u64(2.0) > f64_to_u64(1.5)); // different exponent and mantissa
        assert!(f64_to_u64(1.0) > f64_to_u64(-1.0)); // pos > neg
        assert!(f64_to_u64(-1.5) < f64_to_u64(-1.0));
        assert!(f64_to_u64(-2.0) < f64_to_u64(1.0));
        assert!(f64_to_u64(-2.0) < f64_to_u64(-1.5));
    }

    #[test]
    fn test_compute_num_bits() {
        assert_eq!(compute_num_bits(1), 1u8);
        assert_eq!(compute_num_bits(0), 0u8);
        assert_eq!(compute_num_bits(2), 2u8);
        assert_eq!(compute_num_bits(3), 2u8);
        assert_eq!(compute_num_bits(4), 3u8);
        assert_eq!(compute_num_bits(255), 8u8);
        assert_eq!(compute_num_bits(256), 9u8);
        assert_eq!(compute_num_bits(5_000_000_000), 33u8);
    }

    #[test]
    fn test_max_doc() {
        // this is the first time I write a unit test for a constant.
        assert!(((super::MAX_DOC_LIMIT - 1) as i32) >= 0);
        assert!((super::MAX_DOC_LIMIT as i32) < 0);
    }

    #[test]
    fn test_minmax_empty() {
        let vals: Vec<u32> = vec![];
        assert_eq!(minmax(vals.into_iter()), None);
    }

    #[test]
    fn test_minmax_one() {
        assert_eq!(minmax(vec![1].into_iter()), Some((1, 1)));
    }

    #[test]
    fn test_minmax_two() {
        assert_eq!(minmax(vec![1, 2].into_iter()), Some((1, 2)));
        assert_eq!(minmax(vec![2, 1].into_iter()), Some((1, 2)));
    }
}
```
```diff
@@ -120,7 +120,7 @@ impl IndexBuilder {
     /// Creates a new index in a given filepath.
     /// The index will use the `MMapDirectory`.
     ///
-    /// If a previous index was in this directory, then its meta file will be destroyed.
+    /// If a previous index was in this directory, it returns an `IndexAlreadyExists` error.
     #[cfg(feature = "mmap")]
     pub fn create_in_dir<P: AsRef<Path>>(self, directory_path: P) -> crate::Result<Index> {
         let mmap_directory = MmapDirectory::open(directory_path)?;
@@ -229,7 +229,8 @@ impl Index {
     /// Creates a new index using the `RamDirectory`.
     ///
     /// The index will be allocated in anonymous memory.
-    /// This should only be used for unit tests.
+    /// This is useful for indexing small set of documents
+    /// for instances like unit test or temporary in memory index.
     pub fn create_in_ram(schema: Schema) -> Index {
         IndexBuilder::new().schema(schema).create_in_ram().unwrap()
     }
@@ -237,7 +238,7 @@ impl Index {
     /// Creates a new index in a given filepath.
     /// The index will use the `MMapDirectory`.
     ///
-    /// If a previous index was in this directory, then its meta file will be destroyed.
+    /// If a previous index was in this directory, then it returns an `IndexAlreadyExists` error.
     #[cfg(feature = "mmap")]
     pub fn create_in_dir<P: AsRef<Path>>(
         directory_path: P,
@@ -523,7 +524,22 @@ impl Index {

     /// Returns the set of corrupted files
     pub fn validate_checksum(&self) -> crate::Result<HashSet<PathBuf>> {
-        self.directory.list_damaged().map_err(Into::into)
+        let managed_files = self.directory.list_managed_files();
+        let active_segments_files: HashSet<PathBuf> = self
+            .searchable_segment_metas()?
+            .iter()
+            .flat_map(|segment_meta| segment_meta.list_files())
+            .collect();
+        let active_existing_files: HashSet<&PathBuf> =
+            active_segments_files.intersection(&managed_files).collect();
+
+        let mut damaged_files = HashSet::new();
+        for path in active_existing_files {
+            if !self.directory.validate_checksum(path)? {
+                damaged_files.insert((*path).clone());
+            }
+        }
+        Ok(damaged_files)
     }
 }
```
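The reworked check only hashes files that belong to live, searchable segments, so files of deleted or not-yet-committed segments can no longer produce false positives. A small usage sketch (hypothetical call site; `validate_checksum` is the method changed above):

```rust
// `index` is any tantivy::Index opened elsewhere.
fn report_corruption(index: &tantivy::Index) -> tantivy::Result<()> {
    let damaged = index.validate_checksum()?;
    for path in &damaged {
        eprintln!("corrupted file in an active segment: {:?}", path);
    }
    Ok(())
}
```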
```diff
@@ -101,6 +101,7 @@ impl SegmentMeta {

     /// Returns the list of files that
     /// are required for the segment meta.
+    /// Note: Some of the returned files may not exist depending on the state of the segment.
     ///
     /// This is useful as the way tantivy removes files
     /// is by removing all files that have been created by tantivy
```
```diff
@@ -1,6 +1,5 @@
 use std::io;

-use crate::common::BinarySerializable;
 use crate::directory::FileSlice;
 use crate::positions::PositionReader;
 use crate::postings::TermInfo;
@@ -8,6 +7,7 @@ use crate::postings::{BlockSegmentPostings, SegmentPostings};
 use crate::schema::IndexRecordOption;
 use crate::schema::Term;
 use crate::termdict::TermDictionary;
+use common::BinarySerializable;

 /// The inverted index reader is in charge of accessing
 /// the inverted index associated to a specific field.
```
```diff
@@ -2,7 +2,9 @@ use crate::core::InvertedIndexReader;
 use crate::core::Segment;
 use crate::core::SegmentComponent;
 use crate::core::SegmentId;
+use crate::directory::CompositeFile;
 use crate::directory::FileSlice;
+use crate::error::DataCorruption;
 use crate::fastfield::DeleteBitSet;
 use crate::fastfield::FacetReader;
 use crate::fastfield::FastFieldReaders;
@@ -14,7 +16,6 @@ use crate::space_usage::SegmentSpaceUsage;
 use crate::store::StoreReader;
 use crate::termdict::TermDictionary;
 use crate::DocId;
-use crate::{common::CompositeFile, error::DataCorruption};
 use fail::fail_point;
 use std::fmt;
 use std::sync::Arc;
```
```diff
@@ -1,18 +1,17 @@
-use crate::common::BinarySerializable;
-use crate::common::CountingWriter;
-use crate::common::VInt;
 use crate::directory::FileSlice;
 use crate::directory::{TerminatingWrite, WritePtr};
 use crate::schema::Field;
 use crate::space_usage::FieldUsage;
 use crate::space_usage::PerFieldSpaceUsage;
+use common::BinarySerializable;
+use common::CountingWriter;
+use common::HasLen;
+use common::VInt;
 use std::collections::HashMap;
 use std::io::{self, Read, Write};
 use std::iter::ExactSizeIterator;
 use std::ops::Range;

-use super::HasLen;
-
 #[derive(Eq, PartialEq, Hash, Copy, Ord, PartialOrd, Clone, Debug)]
 pub struct FileAddr {
     field: Field,
@@ -188,10 +187,10 @@ impl CompositeFile {
 mod test {

     use super::{CompositeFile, CompositeWrite};
-    use crate::common::BinarySerializable;
-    use crate::common::VInt;
     use crate::directory::{Directory, RamDirectory};
     use crate::schema::Field;
+    use common::BinarySerializable;
+    use common::VInt;
     use std::io::Write;
     use std::path::Path;
```
```diff
@@ -1,7 +1,7 @@
 use stable_deref_trait::StableDeref;

-use crate::common::HasLen;
 use crate::directory::OwnedBytes;
+use common::HasLen;
 use std::fmt;
 use std::ops::Range;
 use std::sync::{Arc, Weak};
@@ -32,12 +32,6 @@ impl FileHandle for &'static [u8] {
     }
 }

-impl<T: Deref<Target = [u8]>> HasLen for T {
-    fn len(&self) -> usize {
-        self.deref().len()
-    }
-}
-
 impl<B> From<B> for FileSlice
 where
     B: StableDeref + Deref<Target = [u8]> + 'static + Send + Sync,
@@ -178,7 +172,7 @@ impl HasLen for FileSlice {
 #[cfg(test)]
 mod tests {
     use super::{FileHandle, FileSlice};
-    use crate::common::HasLen;
+    use common::HasLen;
     use std::io;

     #[test]
```
```diff
@@ -1,10 +1,10 @@
 use crate::directory::error::Incompatibility;
 use crate::directory::FileSlice;
 use crate::{
-    common::{BinarySerializable, CountingWriter, DeserializeFrom, FixedSize, HasLen},
     directory::{AntiCallToken, TerminatingWrite},
     Version, INDEX_FORMAT_VERSION,
 };
+use common::{BinarySerializable, CountingWriter, DeserializeFrom, FixedSize, HasLen};
 use crc32fast::Hasher;
 use serde::{Deserialize, Serialize};
 use std::io;
@@ -156,10 +156,8 @@ mod tests {

     use crate::directory::footer::Footer;
     use crate::directory::OwnedBytes;
-    use crate::{
-        common::BinarySerializable,
-        directory::{footer::FOOTER_MAGIC_NUMBER, FileSlice},
-    };
+    use crate::directory::{footer::FOOTER_MAGIC_NUMBER, FileSlice};
+    use common::BinarySerializable;
     use std::io;

     #[test]
```
```diff
@@ -1,4 +1,4 @@
-use crate::core::{MANAGED_FILEPATH, META_FILEPATH};
+use crate::core::MANAGED_FILEPATH;
 use crate::directory::error::{DeleteError, LockError, OpenReadError, OpenWriteError};
 use crate::directory::footer::{Footer, FooterProxy};
 use crate::directory::GarbageCollectionResult;
@@ -248,24 +248,15 @@ impl ManagedDirectory {
         Ok(footer.crc() == crc)
     }

-    /// List files for which checksum does not match content
-    pub fn list_damaged(&self) -> result::Result<HashSet<PathBuf>, OpenReadError> {
-        let mut managed_paths = self
+    /// List all managed files
+    pub fn list_managed_files(&self) -> HashSet<PathBuf> {
+        let managed_paths = self
             .meta_informations
             .read()
             .expect("Managed directory rlock poisoned in list damaged.")
             .managed_paths
             .clone();

-        managed_paths.remove(*META_FILEPATH);
-
-        let mut damaged_files = HashSet::new();
-        for path in managed_paths {
-            if !self.validate_checksum(&path)? {
-                damaged_files.insert(path);
-            }
-        }
-        Ok(damaged_files)
+        managed_paths
     }
 }
@@ -336,7 +327,6 @@ mod tests_mmap_specific {

     use crate::directory::{Directory, ManagedDirectory, MmapDirectory, TerminatingWrite};
-    use std::collections::HashSet;
     use std::fs::OpenOptions;
     use std::io::Write;
     use std::path::{Path, PathBuf};
     use tempfile::TempDir;
@@ -405,39 +395,4 @@ mod tests_mmap_specific {
         }
         assert!(!managed_directory.exists(test_path1).unwrap());
     }
-
-    #[test]
-    fn test_checksum() -> crate::Result<()> {
-        let test_path1: &'static Path = Path::new("some_path_for_test");
-        let test_path2: &'static Path = Path::new("other_test_path");
-
-        let tempdir = TempDir::new().unwrap();
-        let tempdir_path = PathBuf::from(tempdir.path());
-
-        let mmap_directory = MmapDirectory::open(&tempdir_path)?;
-        let managed_directory = ManagedDirectory::wrap(mmap_directory)?;
-        let mut write = managed_directory.open_write(test_path1)?;
-        write.write_all(&[0u8, 1u8])?;
-        write.terminate()?;
-
-        let mut write = managed_directory.open_write(test_path2)?;
-        write.write_all(&[3u8, 4u8, 5u8])?;
-        write.terminate()?;
-
-        let read_file = managed_directory.open_read(test_path2)?.read_bytes()?;
-        assert_eq!(read_file.as_slice(), &[3u8, 4u8, 5u8]);
-        assert!(managed_directory.list_damaged().unwrap().is_empty());
-
-        let mut corrupted_path = tempdir_path;
-        corrupted_path.push(test_path2);
-        let mut file = OpenOptions::new().write(true).open(&corrupted_path)?;
-        file.write_all(&[255u8])?;
-        file.flush()?;
-        drop(file);
-
-        let damaged = managed_directory.list_damaged()?;
-        assert_eq!(damaged.len(), 1);
-        assert!(damaged.contains(test_path2));
-        Ok(())
-    }
 }
```
```diff
@@ -11,7 +11,7 @@ use crate::directory::{AntiCallToken, FileHandle, OwnedBytes};
 use crate::directory::{ArcBytes, WeakArcBytes};
 use crate::directory::{TerminatingWrite, WritePtr};
 use fs2::FileExt;
-use memmap::Mmap;
+use memmap2::Mmap;
 use serde::{Deserialize, Serialize};
 use stable_deref_trait::StableDeref;
 use std::convert::From;
@@ -53,7 +53,7 @@ fn open_mmap(full_path: &Path) -> result::Result<Option<Mmap>, OpenReadError> {
         return Ok(None);
     }
     unsafe {
-        memmap::Mmap::map(&file)
+        memmap2::Mmap::map(&file)
             .map(Some)
             .map_err(|io_err| OpenReadError::wrap_io_error(io_err, full_path.to_path_buf()))
     }
@@ -485,13 +485,14 @@ mod tests {
     // The following tests are specific to the MmapDirectory

     use super::*;
+    use crate::indexer::LogMergePolicy;
     use crate::Index;
     use crate::ReloadPolicy;
-    use crate::{common::HasLen, indexer::LogMergePolicy};
     use crate::{
         schema::{Schema, SchemaBuilder, TEXT},
         IndexSettings,
     };
+    use common::HasLen;

     #[test]
     fn test_open_non_existent_path() {
```
```diff
@@ -20,6 +20,9 @@ mod watch_event_router;
 /// Errors specific to the directory module.
 pub mod error;

+mod composite_file;
+
+pub(crate) use self::composite_file::{CompositeFile, CompositeWrite};
 pub use self::directory::DirectoryLock;
 pub use self::directory::{Directory, DirectoryClone};
 pub use self::directory_lock::{Lock, INDEX_WRITER_LOCK, META_LOCK};
```
```diff
@@ -1,9 +1,10 @@
+use crate::core::META_FILEPATH;
 use crate::directory::error::{DeleteError, OpenReadError, OpenWriteError};
 use crate::directory::AntiCallToken;
 use crate::directory::WatchCallbackList;
 use crate::directory::{Directory, FileSlice, WatchCallback, WatchHandle};
 use crate::directory::{TerminatingWrite, WritePtr};
-use crate::{common::HasLen, core::META_FILEPATH};
+use common::HasLen;
 use fail::fail_point;
 use std::collections::HashMap;
 use std::fmt;
```
```diff
@@ -1,9 +1,10 @@
-use crate::common::{BitSet, HasLen};
 use crate::directory::FileSlice;
 use crate::directory::OwnedBytes;
 use crate::directory::WritePtr;
 use crate::space_usage::ByteCount;
 use crate::DocId;
+use common::BitSet;
+use common::HasLen;
 use std::io;
 use std::io::Write;
@@ -110,7 +111,7 @@ impl HasLen for DeleteBitSet {
 #[cfg(test)]
 mod tests {
     use super::DeleteBitSet;
-    use crate::common::HasLen;
+    use common::HasLen;

     #[test]
     fn test_delete_bitset_empty() {
```
```diff
@@ -40,11 +40,11 @@ pub use self::writer::{FastFieldsWriter, IntFastFieldWriter};
 use crate::schema::Cardinality;
 use crate::schema::FieldType;
 use crate::schema::Value;
-use crate::DocId;
 use crate::{
     chrono::{NaiveDateTime, Utc},
     schema::Type,
 };
+use crate::{common, DocId};

 mod bytes;
 mod delete;
@@ -213,8 +213,7 @@ fn value_to_u64(value: &Value) -> u64 {
 mod tests {

     use super::*;
-    use crate::common::CompositeFile;
-    use crate::common::HasLen;
+    use crate::directory::CompositeFile;
     use crate::directory::{Directory, RamDirectory, WritePtr};
     use crate::merge_policy::NoMergePolicy;
     use crate::schema::Field;
@@ -222,6 +221,7 @@ mod tests {
     use crate::schema::FAST;
     use crate::schema::{Document, IntOptions};
     use crate::{Index, SegmentId, SegmentReader};
+    use common::HasLen;
     use once_cell::sync::Lazy;
     use rand::prelude::SliceRandom;
     use rand::rngs::StdRng;
@@ -588,7 +588,7 @@ mod bench {
     use super::tests::FIELD;
     use super::tests::{generate_permutation, SCHEMA};
     use super::*;
-    use crate::common::CompositeFile;
+    use crate::directory::CompositeFile;
     use crate::directory::{Directory, RamDirectory, WritePtr};
     use crate::fastfield::FastFieldReader;
     use std::collections::HashMap;
```
```diff
@@ -8,14 +8,22 @@ pub use self::writer::MultiValuedFastFieldWriter;
 mod tests {

     use crate::collector::TopDocs;
+    use crate::indexer::NoMergePolicy;
     use crate::query::QueryParser;
+    use crate::schema::Cardinality;
     use crate::schema::Facet;
+    use crate::schema::IntOptions;
     use crate::schema::Schema;
     use crate::schema::INDEXED;
     use crate::Document;
     use crate::Index;
     use crate::Term;
     use chrono::Duration;
+    use futures::executor::block_on;
+    use proptest::prop_oneof;
+    use proptest::proptest;
+    use proptest::strategy::Strategy;
+    use test_env_log::test;

     #[test]
     fn test_multivalued_u64() {
@@ -225,6 +233,111 @@ mod tests {
         multi_value_reader.get_vals(3, &mut vals);
         assert_eq!(&vals, &[-5i64, -20i64, 1i64]);
     }

+    fn test_multivalued_no_panic(ops: &[IndexingOp]) {
+        let mut schema_builder = Schema::builder();
+        let field = schema_builder.add_u64_field(
+            "multifield",
+            IntOptions::default()
+                .set_fast(Cardinality::MultiValues)
+                .set_indexed(),
+        );
+        let schema = schema_builder.build();
+        let index = Index::create_in_ram(schema);
+        let mut index_writer = index.writer_for_tests().unwrap();
+        index_writer.set_merge_policy(Box::new(NoMergePolicy));
+
+        for &op in ops {
+            match op {
+                IndexingOp::AddDoc { id } => {
+                    match id % 3 {
+                        0 => {
+                            index_writer.add_document(doc!());
+                        }
+                        1 => {
+                            let mut doc = Document::new();
+                            for _ in 0..5001 {
+                                doc.add_u64(field, id as u64);
+                            }
+                            index_writer.add_document(doc);
+                        }
+                        _ => {
+                            let mut doc = Document::new();
+                            doc.add_u64(field, id as u64);
+                            index_writer.add_document(doc);
+                        }
+                    };
+                }
+                IndexingOp::DeleteDoc { id } => {
+                    index_writer.delete_term(Term::from_field_u64(field, id as u64));
+                }
+                IndexingOp::Commit => {
+                    index_writer.commit().unwrap();
+                }
+                IndexingOp::Merge => {
+                    let segment_ids = index
+                        .searchable_segment_ids()
+                        .expect("Searchable segments failed.");
+                    if segment_ids.len() >= 2 {
+                        block_on(index_writer.merge(&segment_ids)).unwrap();
+                        assert!(index_writer.segment_updater().wait_merging_thread().is_ok());
+                    }
+                }
+            }
+        }
+
+        assert!(index_writer.commit().is_ok());
+
+        // Merging the segments
+        {
+            let segment_ids = index
+                .searchable_segment_ids()
+                .expect("Searchable segments failed.");
+            if !segment_ids.is_empty() {
+                block_on(index_writer.merge(&segment_ids)).unwrap();
+                assert!(index_writer.wait_merging_threads().is_ok());
+            }
+        }
+    }
+
+    #[derive(Debug, Clone, Copy)]
+    enum IndexingOp {
+        AddDoc { id: u32 },
+        DeleteDoc { id: u32 },
+        Commit,
+        Merge,
+    }
+
+    fn operation_strategy() -> impl Strategy<Value = IndexingOp> {
+        prop_oneof![
+            (0u32..10u32).prop_map(|id| IndexingOp::DeleteDoc { id }),
+            (0u32..10u32).prop_map(|id| IndexingOp::AddDoc { id }),
+            (0u32..2u32).prop_map(|_| IndexingOp::Commit),
+            (0u32..1u32).prop_map(|_| IndexingOp::Merge),
+        ]
+    }
+
+    proptest! {
+        #[test]
+        fn test_multivalued_proptest(ops in proptest::collection::vec(operation_strategy(), 1..10)) {
+            test_multivalued_no_panic(&ops[..]);
+        }
+    }
+
+    #[test]
+    fn test_multivalued_proptest_off_by_one_bug_1151() {
+        use IndexingOp::*;
+        let ops = [
+            AddDoc { id: 3 },
+            AddDoc { id: 1 },
+            AddDoc { id: 3 },
+            Commit,
+            Merge,
+        ];
+
+        test_multivalued_no_panic(&ops[..]);
+    }
+
     #[test]
     #[ignore]
     fn test_many_facets() {
```
```diff
@@ -1,6 +1,5 @@
 use super::FastValue;
-use crate::common::BinarySerializable;
-use crate::common::CompositeFile;
+use crate::directory::CompositeFile;
 use crate::directory::FileSlice;
 use crate::directory::OwnedBytes;
 use crate::directory::{Directory, RamDirectory, WritePtr};
@@ -8,6 +7,7 @@ use crate::fastfield::{CompositeFastFieldSerializer, FastFieldsWriter};
 use crate::schema::Schema;
 use crate::schema::FAST;
 use crate::DocId;
+use common::BinarySerializable;
 use fastfield_codecs::bitpacked::BitpackedFastFieldReader as BitpackedReader;
 use fastfield_codecs::bitpacked::BitpackedFastFieldSerializer;
 use fastfield_codecs::linearinterpol::LinearInterpolFastFieldReader;
```
```diff
@@ -1,4 +1,4 @@
-use crate::common::CompositeFile;
+use crate::directory::CompositeFile;
 use crate::directory::FileSlice;
 use crate::fastfield::MultiValuedFastFieldReader;
 use crate::fastfield::{BitpackedFastFieldReader, FastFieldNotAvailableError};
```
```diff
@@ -1,8 +1,8 @@
-use crate::common::BinarySerializable;
-use crate::common::CompositeWrite;
-use crate::common::CountingWriter;
+use crate::directory::CompositeWrite;
 use crate::directory::WritePtr;
 use crate::schema::Field;
+use common::BinarySerializable;
+use common::CountingWriter;
 pub use fastfield_codecs::bitpacked::BitpackedFastFieldSerializer;
 pub use fastfield_codecs::bitpacked::BitpackedFastFieldSerializerLegacy;
 use fastfield_codecs::linearinterpol::LinearInterpolFastFieldSerializer;
@@ -105,9 +105,7 @@ impl CompositeFastFieldSerializer {
             &fastfield_accessor,
             &mut estimations,
         );
-        if let Some(broken_estimation) = estimations
-            .iter()
-            .find(|estimation| estimation.0 == f32::NAN)
+        if let Some(broken_estimation) = estimations.iter().find(|estimation| estimation.0.is_nan())
         {
             warn!(
                 "broken estimation for fast field codec {}",
```
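The old predicate could never match: IEEE-754 NaN compares unequal to everything, including itself, so `x == f32::NAN` is always false and `is_nan()` is the correct check. A one-line demonstration:

```rust
fn main() {
    let x = f32::NAN;
    assert!(x != f32::NAN); // NaN != NaN, so the old `== f32::NAN` check never fired
    assert!(x.is_nan());
}
```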
```diff
@@ -1,12 +1,12 @@
 use super::multivalued::MultiValuedFastFieldWriter;
 use super::serializer::FastFieldStats;
 use super::FastFieldDataAccess;
-use crate::common;
 use crate::fastfield::{BytesFastFieldWriter, CompositeFastFieldSerializer};
 use crate::indexer::doc_id_mapping::DocIdMapping;
 use crate::postings::UnorderedTermId;
 use crate::schema::{Cardinality, Document, Field, FieldEntry, FieldType, Schema};
 use crate::termdict::TermOrdinal;
+use common;
 use fnv::FnvHashMap;
 use std::collections::HashMap;
 use std::io;
```
```diff
@@ -1,5 +1,5 @@
 use super::{fieldnorm_to_id, id_to_fieldnorm};
-use crate::common::CompositeFile;
+use crate::directory::CompositeFile;
 use crate::directory::FileSlice;
 use crate::directory::OwnedBytes;
 use crate::schema::Field;
```
```diff
@@ -1,4 +1,4 @@
-use crate::common::CompositeWrite;
+use crate::directory::CompositeWrite;
 use crate::directory::WritePtr;
 use crate::schema::Field;
 use std::io;
```
```diff
@@ -1,4 +1,8 @@
 use crate::schema;
 use crate::Index;
+use crate::IndexSettings;
+use crate::IndexSortByField;
+use crate::Order;
 use crate::Searcher;
 use crate::{doc, schema::*};
 use rand::thread_rng;
@@ -35,7 +39,7 @@ fn test_functional_store() -> crate::Result<()> {
     let mut doc_set: Vec<u64> = Vec::new();

     let mut doc_id = 0u64;
-    for iteration in 0..500 {
+    for iteration in 0..get_num_iterations() {
         dbg!(iteration);
         let num_docs: usize = rng.gen_range(0..4);
         if !doc_set.is_empty() {
@@ -56,16 +60,37 @@ fn test_functional_store() -> crate::Result<()> {
     Ok(())
 }

+fn get_num_iterations() -> usize {
+    std::env::var("NUM_FUNCTIONAL_TEST_ITERATIONS")
+        .map(|str| str.parse().unwrap())
+        .unwrap_or(2000)
+}
+
 #[test]
 #[ignore]
-fn test_functional_indexing() -> crate::Result<()> {
+fn test_functional_indexing_sorted() -> crate::Result<()> {
     let mut schema_builder = Schema::builder();

-    let id_field = schema_builder.add_u64_field("id", INDEXED);
+    let id_field = schema_builder.add_u64_field("id", INDEXED | FAST);
     let multiples_field = schema_builder.add_u64_field("multiples", INDEXED);
     let text_field_options = TextOptions::default()
         .set_indexing_options(
             TextFieldIndexing::default()
                 .set_index_option(schema::IndexRecordOption::WithFreqsAndPositions),
         )
         .set_stored();
     let text_field = schema_builder.add_text_field("text_field", text_field_options);
     let schema = schema_builder.build();

-    let index = Index::create_from_tempdir(schema)?;
+    let mut index_builder = Index::builder().schema(schema);
+    index_builder = index_builder.settings(IndexSettings {
+        sort_by_field: Some(IndexSortByField {
+            field: "id".to_string(),
+            order: Order::Desc,
+        }),
+        ..Default::default()
+    });
+    let index = index_builder.create_from_tempdir().unwrap();

     let reader = index.reader()?;

     let mut rng = thread_rng();
@@ -75,7 +100,7 @@ fn test_functional_indexing() -> crate::Result<()> {
     let mut committed_docs: HashSet<u64> = HashSet::new();
     let mut uncommitted_docs: HashSet<u64> = HashSet::new();

-    for _ in 0..200 {
+    for _ in 0..get_num_iterations() {
         let random_val = rng.gen_range(0..20);
         if random_val == 0 {
             index_writer.commit()?;
@@ -98,6 +123,84 @@ fn test_functional_indexing() -> crate::Result<()> {
             for i in 1u64..10u64 {
                 doc.add_u64(multiples_field, random_val * i);
             }
+            doc.add_text(text_field, get_text());
             index_writer.add_document(doc);
         }
     }
     Ok(())
 }

+const LOREM: &str = "Doc Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed \
+                     do eiusmod tempor incididunt ut labore et dolore magna aliqua. \
+                     Ut enim ad minim veniam, quis nostrud exercitation ullamco \
+                     laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure \
+                     dolor in reprehenderit in voluptate velit esse cillum dolore eu \
+                     fugiat nulla pariatur. Excepteur sint occaecat cupidatat non \
+                     proident, sunt in culpa qui officia deserunt mollit anim id est \
+                     laborum.";
+
+fn get_text() -> String {
+    use rand::seq::SliceRandom;
+    let mut rng = thread_rng();
+    let tokens: Vec<_> = LOREM.split(' ').collect();
+    let random_val = rng.gen_range(0..20);
+
+    (0..random_val)
+        .map(|_| tokens.choose(&mut rng).unwrap())
+        .cloned()
+        .collect::<Vec<_>>()
+        .join(" ")
+}
+
+#[test]
+#[ignore]
+fn test_functional_indexing_unsorted() -> crate::Result<()> {
+    let mut schema_builder = Schema::builder();
+
+    let id_field = schema_builder.add_u64_field("id", INDEXED);
+    let multiples_field = schema_builder.add_u64_field("multiples", INDEXED);
+    let text_field_options = TextOptions::default()
+        .set_indexing_options(
+            TextFieldIndexing::default()
+                .set_index_option(schema::IndexRecordOption::WithFreqsAndPositions),
+        )
+        .set_stored();
+    let text_field = schema_builder.add_text_field("text_field", text_field_options);
+    let schema = schema_builder.build();
+
+    let index = Index::create_from_tempdir(schema)?;
+    let reader = index.reader()?;
+
+    let mut rng = thread_rng();
+
+    let mut index_writer = index.writer_with_num_threads(3, 120_000_000)?;
+
+    let mut committed_docs: HashSet<u64> = HashSet::new();
+    let mut uncommitted_docs: HashSet<u64> = HashSet::new();
+
+    for _ in 0..get_num_iterations() {
+        let random_val = rng.gen_range(0..20);
+        if random_val == 0 {
+            index_writer.commit()?;
+            committed_docs.extend(&uncommitted_docs);
+            uncommitted_docs.clear();
+            reader.reload()?;
+            let searcher = reader.searcher();
+            // check that everything is correct.
+            check_index_content(
+                &searcher,
+                &committed_docs.iter().cloned().collect::<Vec<u64>>(),
+            )?;
+        } else if committed_docs.remove(&random_val) || uncommitted_docs.remove(&random_val) {
+            let doc_id_term = Term::from_field_u64(id_field, random_val);
+            index_writer.delete_term(doc_id_term);
+        } else {
+            uncommitted_docs.insert(random_val);
+            let mut doc = Document::new();
+            doc.add_u64(id_field, random_val);
+            for i in 1u64..10u64 {
+                doc.add_u64(multiples_field, random_val * i);
+            }
+            doc.add_text(text_field, get_text());
+            index_writer.add_document(doc);
+        }
+    }
```
@@ -1,7 +1,6 @@
use super::operation::{AddOperation, UserOperation};
use super::segment_updater::SegmentUpdater;
use super::PreparedCommit;
use crate::common::BitSet;
use crate::core::Index;
use crate::core::Segment;
use crate::core::SegmentComponent;
@@ -24,6 +23,7 @@ use crate::schema::Document;
use crate::schema::IndexRecordOption;
use crate::schema::Term;
use crate::Opstamp;
use common::BitSet;
use crossbeam::channel;
use futures::executor::block_on;
use futures::future::Future;
@@ -1361,6 +1361,7 @@ mod tests {
        AddDoc { id: u64 },
        DeleteDoc { id: u64 },
        Commit,
        Merge,
    }

    fn operation_strategy() -> impl Strategy<Value = IndexingOp> {
@@ -1368,6 +1369,7 @@ mod tests {
            (0u64..10u64).prop_map(|id| IndexingOp::DeleteDoc { id }),
            (0u64..10u64).prop_map(|id| IndexingOp::AddDoc { id }),
            (0u64..2u64).prop_map(|_| IndexingOp::Commit),
            (0u64..1u64).prop_map(|_| IndexingOp::Merge),
        ]
    }

@@ -1393,7 +1395,7 @@ mod tests {
    fn test_operation_strategy(
        ops: &[IndexingOp],
        sort_index: bool,
        force_merge: bool,
        force_end_merge: bool,
    ) -> crate::Result<()> {
        let mut schema_builder = schema::Schema::builder();
        let id_field = schema_builder.add_u64_field("id", FAST | INDEXED | STORED);
@@ -1435,6 +1437,8 @@ mod tests {
            .settings(settings)
            .create_in_ram()?;
        let mut index_writer = index.writer_for_tests()?;
        index_writer.set_merge_policy(Box::new(NoMergePolicy));

        for &op in ops {
            match op {
                IndexingOp::AddDoc { id } => {
@@ -1448,12 +1452,21 @@ mod tests {
                IndexingOp::Commit => {
                    index_writer.commit()?;
                }
                IndexingOp::Merge => {
                    let segment_ids = index
                        .searchable_segment_ids()
                        .expect("Searchable segments failed.");
                    if segment_ids.len() >= 2 {
                        block_on(index_writer.merge(&segment_ids)).unwrap();
                        assert!(index_writer.segment_updater().wait_merging_thread().is_ok());
                    }
                }
            }
        }
        index_writer.commit()?;

        let searcher = index.reader()?.searcher();
        if force_merge {
            if force_end_merge {
                index_writer.wait_merging_threads()?;
                let mut index_writer = index.writer_for_tests()?;
                let segment_ids = index

@@ -5,6 +5,7 @@ use crate::fastfield::DynamicFastFieldReader;
use crate::fastfield::FastFieldDataAccess;
use crate::fastfield::FastFieldReader;
use crate::fastfield::FastFieldStats;
use crate::fastfield::MultiValueLength;
use crate::fastfield::MultiValuedFastFieldReader;
use crate::fieldnorm::FieldNormsSerializer;
use crate::fieldnorm::FieldNormsWriter;
@@ -19,9 +20,8 @@ use crate::schema::{Field, Schema};
use crate::store::StoreWriter;
use crate::termdict::TermMerger;
use crate::termdict::TermOrdinal;
use crate::IndexSettings;
use crate::IndexSortByField;
use crate::{common::HasLen, fastfield::MultiValueLength};
use crate::{common::MAX_DOC_LIMIT, IndexSettings};
use crate::{core::Segment, indexer::doc_id_mapping::expect_field_id_for_sort_field};
use crate::{core::SegmentReader, Order};
use crate::{
@@ -29,6 +29,7 @@ use crate::{
    SegmentOrdinal,
};
use crate::{DocId, InvertedIndexReader, SegmentComponent};
use common::HasLen;
use itertools::Itertools;
use measure_time::debug_time;
use std::cmp;
@@ -36,6 +37,11 @@ use std::collections::HashMap;
use std::sync::Arc;
use tantivy_bitpacker::minmax;

/// Segment's max doc must be `< MAX_DOC_LIMIT`.
///
/// We do not allow segments with more documents than this limit,
/// so that a doc id always fits in 31 bits.
pub const MAX_DOC_LIMIT: u32 = 1 << 31;

fn compute_total_num_tokens(readers: &[SegmentReader], field: Field) -> crate::Result<u64> {
    let mut total_tokens = 0u64;
    let mut count: [usize; 256] = [0; 256];
@@ -144,7 +150,7 @@ impl TermOrdinalMapping {
            .iter()
            .flat_map(|term_ordinals| term_ordinals.iter().cloned().max())
            .max()
            .unwrap_or_else(TermOrdinal::default)
            .unwrap_or_default()
    }
}

@@ -495,10 +501,10 @@ impl IndexMerger {
        //
        // This is required by the bitpacker, as it needs to know
        // what should be the bit length used for bitpacking.
        let mut idx_num_vals = 0;
        let mut num_docs = 0;
        for (reader, u64s_reader) in reader_and_field_accessors.iter() {
            if let Some(delete_bitset) = reader.delete_bitset() {
                idx_num_vals += reader.max_doc() as u64 - delete_bitset.len() as u64;
                num_docs += reader.max_doc() as u64 - delete_bitset.len() as u64;
                for doc in 0u32..reader.max_doc() {
                    if delete_bitset.is_alive(doc) {
                        let num_vals = u64s_reader.get_len(doc) as u64;
@@ -506,14 +512,15 @@ impl IndexMerger {
                    }
                }
            } else {
                idx_num_vals += reader.max_doc() as u64;
                num_docs += reader.max_doc() as u64;
                total_num_vals += u64s_reader.get_total_len();
            }
        }

        let stats = FastFieldStats {
            max_value: total_num_vals,
            num_vals: idx_num_vals,
            // The fastfield offset index contains (num_docs + 1) values.
            num_vals: num_docs + 1,
            min_value: 0,
        };
        // We can now create our `idx` serializer, and in a second pass,
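
The invariant behind this `num_vals` fix deserves a note: the fastfield offset index stores one cumulative offset per document plus a final end offset, so it must hold num_docs + 1 entries, and its largest entry equals the total number of values. A minimal sketch under those assumptions (hypothetical standalone helper, not tantivy's actual serializer):

fn offsets(value_counts_per_doc: &[u64]) -> Vec<u64> {
    // One starting offset per doc, plus the final end offset.
    let mut acc = 0u64;
    let mut out = Vec::with_capacity(value_counts_per_doc.len() + 1);
    out.push(0);
    for &count in value_counts_per_doc {
        acc += count;
        out.push(acc);
    }
    out
}

// Three docs holding 2, 0 and 5 values: offsets(&[2, 0, 5]) == [0, 2, 2, 7],
// i.e. num_docs + 1 == 4 entries whose maximum (7) is the total value count.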
@@ -958,12 +965,13 @@ impl IndexMerger {
        }
        if !doc_id_mapping.is_trivial() {
            doc_id_and_positions.sort_unstable_by_key(|&(doc_id, _, _)| doc_id);

            for (doc_id, term_freq, positions) in &doc_id_and_positions {
                field_serializer.write_doc(*doc_id, *term_freq, positions);
                let delta_positions = delta_computer.compute_delta(positions);
                field_serializer.write_doc(*doc_id, *term_freq, delta_positions);
            }
            doc_id_and_positions.clear();
        }

        // closing the term.
        field_serializer.close_term()?;
    }
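
The replacement of the plain `write_doc` call above fixes a real bug: after re-sorting by doc id, positions were written as absolute values, while the positions file expects gaps. A minimal sketch of that gap encoding (hypothetical free function; the diff itself goes through `delta_computer`):

fn compute_delta(positions: &[u32]) -> Vec<u32> {
    // Absolute positions [2, 5, 9] become the gaps [2, 3, 4].
    let mut prev = 0u32;
    positions
        .iter()
        .map(|&pos| {
            let delta = pos - prev;
            prev = pos;
            delta
        })
        .collect()
}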
@@ -2074,4 +2082,11 @@ mod tests {

        Ok(())
    }

    #[test]
    fn test_max_doc() {
        // this is the first time I write a unit test for a constant.
        assert!(((super::MAX_DOC_LIMIT - 1) as i32) >= 0);
        assert!((super::MAX_DOC_LIMIT as i32) < 0);
    }
}

@@ -1,6 +1,7 @@
#[cfg(test)]
mod tests {
    use crate::fastfield::FastFieldReader;
    use crate::fastfield::{DeleteBitSet, FastFieldReader};
    use crate::schema::IndexRecordOption;
    use crate::{
        collector::TopDocs,
        schema::{Cardinality, TextFieldIndexing},
@@ -16,7 +17,7 @@ mod tests {
        schema::{self, BytesOptions},
        DocAddress,
    };
    use crate::{IndexSettings, Term};
    use crate::{DocSet, IndexSettings, Postings, Term};
    use futures::executor::block_on;

    fn create_test_index_posting_list_issue(index_settings: Option<IndexSettings>) -> Index {
@@ -104,9 +105,11 @@ mod tests {
        index_writer.add_document(
            doc!(int_field=>3_u64, multi_numbers => 3_u64, multi_numbers => 4_u64, bytes_field => vec![1, 2, 3], text_field => "some text", facet_field=> Facet::from("/book/crime")),
        );
        index_writer.add_document(doc!(int_field=>1_u64, text_field=> "deleteme"));
        index_writer.add_document(
            doc!(int_field=>2_u64, multi_numbers => 2_u64, multi_numbers => 3_u64),
            doc!(int_field=>1_u64, text_field=> "deleteme", text_field => "ok text more text"),
        );
        index_writer.add_document(
            doc!(int_field=>2_u64, multi_numbers => 2_u64, multi_numbers => 3_u64, text_field => "ok text more text"),
        );

        assert!(index_writer.commit().is_ok());
@@ -118,7 +121,7 @@ mod tests {
        } else {
            1
        };
        index_writer.add_document(doc!(int_field=>in_val, text_field=> "deleteme", facet_field=> Facet::from("/book/crime")));
        index_writer.add_document(doc!(int_field=>in_val, text_field=> "deleteme" , text_field => "ok text more text", facet_field=> Facet::from("/book/crime")));
        assert!(index_writer.commit().is_ok());
        // segment 3 - range 5-1000, with force_disjunct_segment_sort_values 50-1000
        let int_vals = if force_disjunct_segment_sort_values {
@@ -243,6 +246,36 @@ mod tests {
            assert_eq!(do_search("biggest"), vec![0]);
        }

        // postings file
        {
            let my_text_field = index.schema().get_field("text_field").unwrap();
            let term_a = Term::from_field_text(my_text_field, "text");
            let inverted_index = segment_reader.inverted_index(my_text_field).unwrap();
            let mut postings = inverted_index
                .read_postings(&term_a, IndexRecordOption::WithFreqsAndPositions)
                .unwrap()
                .unwrap();

            assert_eq!(postings.doc_freq(), 2);
            let fallback_bitset = DeleteBitSet::for_test(&[0], 100);
            assert_eq!(
                postings.doc_freq_given_deletes(
                    segment_reader.delete_bitset().unwrap_or(&fallback_bitset)
                ),
                2
            );

            assert_eq!(postings.term_freq(), 1);
            let mut output = vec![];
            postings.positions(&mut output);
            assert_eq!(output, vec![1]);
            postings.advance();

            assert_eq!(postings.term_freq(), 2);
            postings.positions(&mut output);
            assert_eq!(output, vec![1, 3]);
        }

        // access doc store
        {
            let blubber_pos = if force_disjunct_segment_sort_values {
@@ -260,6 +293,69 @@ mod tests {
        }
    }

    #[test]
    fn test_merge_unsorted_index() {
        let index = create_test_index(
            Some(IndexSettings {
                ..Default::default()
            }),
            false,
        );

        let reader = index.reader().unwrap();
        let searcher = reader.searcher();
        assert_eq!(searcher.segment_readers().len(), 1);
        let segment_reader = searcher.segment_readers().last().unwrap();

        let searcher = index.reader().unwrap().searcher();
        {
            let my_text_field = index.schema().get_field("text_field").unwrap();

            let do_search = |term: &str| {
                let query = QueryParser::for_index(&index, vec![my_text_field])
                    .parse_query(term)
                    .unwrap();
                let top_docs: Vec<(f32, DocAddress)> =
                    searcher.search(&query, &TopDocs::with_limit(3)).unwrap();

                top_docs.iter().map(|el| el.1.doc_id).collect::<Vec<_>>()
            };

            assert_eq!(do_search("some"), vec![1]);
            assert_eq!(do_search("blubber"), vec![3]);
            assert_eq!(do_search("biggest"), vec![4]);
        }

        // postings file
        {
            let my_text_field = index.schema().get_field("text_field").unwrap();
            let term_a = Term::from_field_text(my_text_field, "text");
            let inverted_index = segment_reader.inverted_index(my_text_field).unwrap();
            let mut postings = inverted_index
                .read_postings(&term_a, IndexRecordOption::WithFreqsAndPositions)
                .unwrap()
                .unwrap();
            assert_eq!(postings.doc_freq(), 2);
            let fallback_bitset = DeleteBitSet::for_test(&[0], 100);
            assert_eq!(
                postings.doc_freq_given_deletes(
                    segment_reader.delete_bitset().unwrap_or(&fallback_bitset)
                ),
                2
            );

            assert_eq!(postings.term_freq(), 1);
            let mut output = vec![];
            postings.positions(&mut output);
            assert_eq!(output, vec![1]);
            postings.advance();

            assert_eq!(postings.term_freq(), 2);
            postings.positions(&mut output);
            assert_eq!(output, vec![1, 3]);
        }
    }

    #[test]
    fn test_merge_sorted_index_asc() {
        let index = create_test_index(
@@ -314,7 +410,7 @@ mod tests {
        let my_text_field = index.schema().get_field("text_field").unwrap();
        let fieldnorm_reader = segment_reader.get_fieldnorms_reader(my_text_field).unwrap();
        assert_eq!(fieldnorm_reader.fieldnorm(0), 0);
        assert_eq!(fieldnorm_reader.fieldnorm(1), 0);
        assert_eq!(fieldnorm_reader.fieldnorm(1), 4);
        assert_eq!(fieldnorm_reader.fieldnorm(2), 2); // some text
        assert_eq!(fieldnorm_reader.fieldnorm(3), 1);
        assert_eq!(fieldnorm_reader.fieldnorm(5), 3); // the biggest num
@@ -339,6 +435,34 @@ mod tests {
            assert_eq!(do_search("biggest"), vec![5]);
        }

        // postings file
        {
            let my_text_field = index.schema().get_field("text_field").unwrap();
            let term_a = Term::from_field_text(my_text_field, "text");
            let inverted_index = segment_reader.inverted_index(my_text_field).unwrap();
            let mut postings = inverted_index
                .read_postings(&term_a, IndexRecordOption::WithFreqsAndPositions)
                .unwrap()
                .unwrap();

            assert_eq!(postings.doc_freq(), 2);
            let fallback_bitset = DeleteBitSet::for_test(&[0], 100);
            assert_eq!(
                postings.doc_freq_given_deletes(
                    segment_reader.delete_bitset().unwrap_or(&fallback_bitset)
                ),
                2
            );

            let mut output = vec![];
            postings.positions(&mut output);
            assert_eq!(output, vec![1, 3]);
            postings.advance();

            postings.positions(&mut output);
            assert_eq!(output, vec![1]);
        }

        // access doc store
        {
            let doc = searcher.doc(DocAddress::new(0, 0)).unwrap();

@@ -1,7 +1,7 @@
use crate::common::BitSet;
use crate::core::SegmentId;
use crate::core::SegmentMeta;
use crate::indexer::delete_queue::DeleteCursor;
use common::BitSet;
use std::fmt;

/// A segment entry describes the state of

32
src/lib.rs
@@ -135,7 +135,6 @@ pub type Result<T> = std::result::Result<T, TantivyError>;
/// Tantivy DateTime
pub type DateTime = chrono::DateTime<chrono::Utc>;

mod common;
mod core;
mod indexer;

@@ -163,8 +162,6 @@ pub use self::snippet::{Snippet, SnippetGenerator};

mod docset;
pub use self::docset::{DocSet, TERMINATED};
pub use crate::common::HasLen;
pub use crate::common::{f64_to_u64, i64_to_u64, u64_to_f64, u64_to_i64};
pub use crate::core::{Executor, SegmentComponent};
pub use crate::core::{
    Index, IndexBuilder, IndexMeta, IndexSettings, IndexSortByField, Order, Searcher, Segment,
@@ -178,6 +175,8 @@ pub use crate::indexer::IndexWriter;
pub use crate::postings::Postings;
pub use crate::reader::LeasedItem;
pub use crate::schema::{Document, Term};
pub use common::HasLen;
pub use common::{f64_to_u64, i64_to_u64, u64_to_f64, u64_to_i64};
use std::fmt;

use once_cell::sync::Lazy;
@@ -293,7 +292,7 @@ pub struct DocAddress {
}

#[cfg(test)]
mod tests {
pub mod tests {
    use crate::collector::tests::TEST_COLLECTOR_WITH_SCORE;
    use crate::core::SegmentReader;
    use crate::docset::{DocSet, TERMINATED};
@@ -304,11 +303,18 @@ mod tests {
    use crate::Index;
    use crate::Postings;
    use crate::ReloadPolicy;
    use common::{BinarySerializable, FixedSize};
    use rand::distributions::Bernoulli;
    use rand::distributions::Uniform;
    use rand::rngs::StdRng;
    use rand::{Rng, SeedableRng};

    pub fn fixed_size_test<O: BinarySerializable + FixedSize + Default>() {
        let mut buffer = Vec::new();
        O::default().serialize(&mut buffer).unwrap();
        assert_eq!(buffer.len(), O::SIZE_IN_BYTES);
    }

    /// Checks if left and right are close to each other.
    /// Panics if the two values are more than 0.5% apart.
    #[macro_export]
@@ -993,8 +999,24 @@ mod tests {
    #[test]
    fn test_validate_checksum() -> crate::Result<()> {
        let index_path = tempfile::tempdir().expect("dir");
        let schema = Schema::builder().build();
        let mut builder = Schema::builder();
        let body = builder.add_text_field("body", TEXT | STORED);
        let schema = builder.build();
        let index = Index::create_in_dir(&index_path, schema)?;
        let mut writer = index.writer(50_000_000)?;
        for _ in 0..5000 {
            writer.add_document(doc!(body => "foo"));
            writer.add_document(doc!(body => "boo"));
        }
        writer.commit()?;
        assert!(index.validate_checksum()?.is_empty());

        // delete a few docs
        writer.delete_term(Term::from_field_text(body, "foo"));
        writer.commit()?;
        let segment_ids = index.searchable_segment_ids()?;
        let _ = futures::executor::block_on(writer.merge(&segment_ids));

        assert!(index.validate_checksum()?.is_empty());
        Ok(())
    }

@@ -1,9 +1,9 @@
use std::io;

use crate::common::{BinarySerializable, VInt};
use crate::directory::OwnedBytes;
use crate::positions::COMPRESSION_BLOCK_SIZE;
use crate::postings::compression::{BlockDecoder, VIntDecoder};
use common::{BinarySerializable, VInt};

/// When accessing the position of a term, we get a positions_idx from the `Terminfo`.
/// This means we need to skip to the `nth` positions efficiently.

@@ -1,7 +1,7 @@
use crate::common::{BinarySerializable, CountingWriter, VInt};
use crate::positions::COMPRESSION_BLOCK_SIZE;
use crate::postings::compression::BlockEncoder;
use crate::postings::compression::VIntEncoder;
use common::{BinarySerializable, CountingWriter, VInt};
use std::io::{self, Write};

/// The PositionSerializer is in charge of serializing all of the positions

@@ -1,241 +1,109 @@
use std::ops::Range;
use crate::postings::compression::COMPRESSION_BLOCK_SIZE;

use crate::postings::compression::AlignedBuffer;
unsafe fn binary_search_step(ptr: *const u32, target: u32, half_size: isize) -> *const u32 {
    let mid = ptr.offset(half_size);
    if *mid < target {
        mid.offset(1)
    } else {
        ptr
    }
}

/// This module defines the logic used to search for a doc in a given
/// block (at most 128 docs).
/// Search the first index containing an element greater or equal to
/// the target.
///
/// Searching within a block is a hotspot when running intersection,
/// so it was worth defining it in its own module.

#[cfg(target_arch = "x86_64")]
mod sse2 {
    use crate::postings::compression::{AlignedBuffer, COMPRESSION_BLOCK_SIZE};
    use std::arch::x86_64::__m128i as DataType;
    use std::arch::x86_64::_mm_add_epi32 as op_add;
    use std::arch::x86_64::_mm_cmplt_epi32 as op_lt;
    use std::arch::x86_64::_mm_load_si128 as op_load; // requires 128-bits alignment
    use std::arch::x86_64::_mm_set1_epi32 as set1;
    use std::arch::x86_64::_mm_setzero_si128 as set0;
    use std::arch::x86_64::_mm_sub_epi32 as op_sub;
    use std::arch::x86_64::{_mm_cvtsi128_si32, _mm_shuffle_epi32};

    const MASK1: i32 = 78;
    const MASK2: i32 = 177;

    /// Performs an exhaustive linear search over the block.
    ///
    /// There is no early exit here. We simply count the
    /// number of elements that are `< target`.
    pub(crate) fn linear_search_sse2_128(arr: &AlignedBuffer, target: u32) -> usize {
        unsafe {
            let ptr = arr as *const AlignedBuffer as *const DataType;
            let vkey = set1(target as i32);
            let mut cnt = set0();
            // We work over 4 `__m128i` at a time.
            // A single `__m128i` actually contains 4 `u32`.
            for i in 0..(COMPRESSION_BLOCK_SIZE as isize) / (4 * 4) {
                let cmp1 = op_lt(op_load(ptr.offset(i * 4)), vkey);
                let cmp2 = op_lt(op_load(ptr.offset(i * 4 + 1)), vkey);
                let cmp3 = op_lt(op_load(ptr.offset(i * 4 + 2)), vkey);
                let cmp4 = op_lt(op_load(ptr.offset(i * 4 + 3)), vkey);
                let sum = op_add(op_add(cmp1, cmp2), op_add(cmp3, cmp4));
                cnt = op_sub(cnt, sum);
            }
            cnt = op_add(cnt, _mm_shuffle_epi32(cnt, MASK1));
            cnt = op_add(cnt, _mm_shuffle_epi32(cnt, MASK2));
            _mm_cvtsi128_si32(cnt) as usize
        }
    }

    #[cfg(test)]
    mod test {
        use super::linear_search_sse2_128;
        use crate::postings::compression::{AlignedBuffer, COMPRESSION_BLOCK_SIZE};

        #[test]
        fn test_linear_search_sse2_128_u32() {
            let mut block = [0u32; COMPRESSION_BLOCK_SIZE];
            for el in 0u32..128u32 {
                block[el as usize] = (el * 2 + 1) << 18;
            }
            let target = block[64] + 1;
            assert_eq!(linear_search_sse2_128(&AlignedBuffer(block), target), 65);
        }
    }
}

/// This `linear search` browses exhaustively through the array,
/// but the early exit is very difficult to predict.
/// The results should be equivalent to
/// ```compile_fail
/// block[..]
// .iter()
// .take_while(|&&val| val < target)
// .count()
/// ```
///
/// Coupled with `exponential search` this function is likely
/// to be called with the same `len`.
fn linear_search(arr: &[u32], target: u32) -> usize {
    arr.iter().map(|&el| if el < target { 1 } else { 0 }).sum()
}

fn exponential_search(arr: &[u32], target: u32) -> Range<usize> {
    let end = arr.len();
    let mut begin = 0;
    for &pivot in &[1, 3, 7, 15, 31, 63] {
        if pivot >= end {
            break;
        }
        if arr[pivot] > target {
            return begin..pivot;
        }
        begin = pivot;
    }
    begin..end
}

#[inline(never)]
fn galloping(block_docs: &[u32], target: u32) -> usize {
    let range = exponential_search(block_docs, target);
    range.start + linear_search(&block_docs[range], target)
}

/// Tantivy may rely on SIMD instructions to search for a specific document within
/// a given block.
#[derive(Clone, Copy, PartialEq)]
pub enum BlockSearcher {
    #[cfg(target_arch = "x86_64")]
    Sse2,
    Scalar,
}

impl BlockSearcher {
    /// Search the first index containing an element greater or equal to
    /// the target.
    ///
    /// The results should be equivalent to
    /// ```compile_fail
    /// block[..]
    // .iter()
    // .take_while(|&&val| val < target)
    // .count()
    /// ```
    ///
    /// The `start` argument is just used to hint that the response is
    /// beyond `start`. The implementation may or may not use
    /// it for optimization.
    ///
    /// # Assumption
    ///
    /// The array len is > start.
    /// The block is sorted.
    /// The target is assumed greater or equal to `arr[start]`.
    /// The target is assumed smaller or equal to the last element of the block.
    ///
    /// Currently the scalar implementation starts with an exponential search, and
    /// then runs a linear search in the resulting subarray.
    ///
    /// If SSE2 instructions are available on the `(platform, running CPU)`,
    /// then we use a different implementation that does an exhaustive linear search over
    /// the block regardless of whether the block is full or not.
    ///
    /// Indeed, if the block is not full, the remaining items are TERMINATED.
    /// It is surprisingly faster, most likely because of the lack of branch misprediction.
    pub(crate) fn search_in_block(self, block_docs: &AlignedBuffer, target: u32) -> usize {
        #[cfg(target_arch = "x86_64")]
        {
            if self == BlockSearcher::Sse2 {
                return sse2::linear_search_sse2_128(block_docs, target);
            }
        }
        galloping(&block_docs.0[..], target)
    }
}

impl Default for BlockSearcher {
    fn default() -> BlockSearcher {
        #[cfg(target_arch = "x86_64")]
        {
            if is_x86_feature_detected!("sse2") {
                return BlockSearcher::Sse2;
            }
        }
        BlockSearcher::Scalar
/// The `start` argument is just used to hint that the response is
/// beyond `start`. The implementation may or may not use
/// it for optimization.
///
/// # Assumption
///
/// - The block is sorted. Some elements may appear several times. This is the case at the
///   end of the last block for instance.
/// - The target is assumed smaller or equal to the last element of the block.
pub fn branchless_binary_search(arr: &[u32; COMPRESSION_BLOCK_SIZE], target: u32) -> usize {
    let start_ptr: *const u32 = &arr[0] as *const u32;
    unsafe {
        let mut ptr = start_ptr;
        ptr = binary_search_step(ptr, target, 63);
        ptr = binary_search_step(ptr, target, 31);
        ptr = binary_search_step(ptr, target, 15);
        ptr = binary_search_step(ptr, target, 7);
        ptr = binary_search_step(ptr, target, 3);
        ptr = binary_search_step(ptr, target, 1);
        let extra = if *ptr < target { 1 } else { 0 };
        (ptr.offset_from(start_ptr) as usize) + extra
    }
}
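
A usage sketch for the new routine, assuming a full 128-entry sorted block padded with TERMINATED (u32::MAX), which is what the postings layer guarantees; the values below are made up for illustration:

let mut block = [u32::MAX; 128]; // COMPRESSION_BLOCK_SIZE slots, TERMINATED padding
for (i, slot) in block.iter_mut().take(100).enumerate() {
    *slot = (i as u32) * 2; // 0, 2, 4, ..., 198
}
// Four elements (0, 2, 4, 6) are < 7, so the first index with a value >= 7 is 4.
assert_eq!(branchless_binary_search(&block, 7), 4);

The six fixed `binary_search_step` calls (half sizes 63, 31, 15, 7, 3, 1) plus the final comparison make seven comparisons in total, enough to cover all 128 slots without a single data-dependent branch on the search path.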

#[cfg(test)]
mod tests {
    use super::exponential_search;
    use super::linear_search;
    use super::BlockSearcher;
    use super::branchless_binary_search;
    use crate::docset::TERMINATED;
    use crate::postings::compression::{AlignedBuffer, COMPRESSION_BLOCK_SIZE};

    #[test]
    fn test_linear_search() {
        let len: usize = 50;
        let arr: Vec<u32> = (0..len).map(|el| 1u32 + (el as u32) * 2).collect();
        for target in 1..*arr.last().unwrap() {
            let res = linear_search(&arr[..], target);
            if res > 0 {
                assert!(arr[res - 1] < target);
            }
            if res < len {
                assert!(arr[res] >= target);
            }
        }
    }

    #[test]
    fn test_exponentiel_search() {
        assert_eq!(exponential_search(&[1, 2], 0), 0..1);
        assert_eq!(exponential_search(&[1, 2], 1), 0..1);
        assert_eq!(
            exponential_search(&[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], 7),
            3..7
        );
    }

    fn util_test_search_in_block(block_searcher: BlockSearcher, block: &[u32], target: u32) {
        let cursor = search_in_block_trivial_but_slow(block, target);
        assert!(block.len() < COMPRESSION_BLOCK_SIZE);
        let mut output_buffer = [TERMINATED; COMPRESSION_BLOCK_SIZE];
        output_buffer[..block.len()].copy_from_slice(block);
        assert_eq!(
            block_searcher.search_in_block(&AlignedBuffer(output_buffer), target),
            cursor
        );
    }

    fn util_test_search_in_block_all(block_searcher: BlockSearcher, block: &[u32]) {
        use std::collections::HashSet;
        let mut targets = HashSet::new();
        for (i, val) in block.iter().cloned().enumerate() {
            if i > 0 {
                targets.insert(val - 1);
            }
            targets.insert(val);
        }
        for target in targets {
            util_test_search_in_block(block_searcher, block, target);
        }
    }
    use crate::postings::compression::COMPRESSION_BLOCK_SIZE;
    use proptest::prelude::*;
    use std::collections::HashSet;

    fn search_in_block_trivial_but_slow(block: &[u32], target: u32) -> usize {
        block.iter().take_while(|&&val| val < target).count()
    }

    fn test_search_in_block_util(block_searcher: BlockSearcher) {
        for len in 1u32..128u32 {
            let v: Vec<u32> = (0..len).map(|i| i * 2).collect();
            util_test_search_in_block_all(block_searcher, &v[..]);
    fn util_test_search_in_block(block: &[u32], target: u32) {
        let cursor = search_in_block_trivial_but_slow(block, target);
        assert!(cursor < COMPRESSION_BLOCK_SIZE);
        assert!(block[cursor] >= target);
        if cursor > 0 {
            assert!(block[cursor - 1] < target);
        }
        assert_eq!(block.len(), COMPRESSION_BLOCK_SIZE);
        let mut output_buffer = [TERMINATED; COMPRESSION_BLOCK_SIZE];
        output_buffer[..block.len()].copy_from_slice(block);
        assert_eq!(branchless_binary_search(&output_buffer, target), cursor);
    }

    fn util_test_search_in_block_all(block: &[u32]) {
        let mut targets = HashSet::new();
        targets.insert(0);
        for &val in block {
            if val > 0 {
                targets.insert(val - 1);
            }
            targets.insert(val);
        }
        for target in targets {
            util_test_search_in_block(block, target);
        }
    }

    #[test]
    fn test_search_in_block_scalar() {
        test_search_in_block_util(BlockSearcher::Scalar);
    fn test_search_in_branchless_binary_search() {
        let v: Vec<u32> = (0..COMPRESSION_BLOCK_SIZE).map(|i| i as u32 * 2).collect();
        util_test_search_in_block_all(&v[..]);
    }

    #[cfg(target_arch = "x86_64")]
    #[test]
    fn test_search_in_block_sse2() {
        test_search_in_block_util(BlockSearcher::Sse2);
    fn monotonous_block() -> impl Strategy<Value = Vec<u32>> {
        prop::collection::vec(0u32..5u32, COMPRESSION_BLOCK_SIZE).prop_map(|mut deltas| {
            let mut el = 0;
            for i in 0..COMPRESSION_BLOCK_SIZE {
                el += deltas[i];
                deltas[i] = el;
            }
            deltas
        })
    }

    proptest! {
        #[test]
        fn test_proptest_branchless_binary_search(block in monotonous_block()) {
            util_test_search_in_block_all(&block[..]);
        }
    }
}

@@ -1,16 +1,14 @@
use std::io;

use crate::common::{BinarySerializable, VInt};
use crate::directory::FileSlice;
use crate::directory::OwnedBytes;
use crate::fieldnorm::FieldNormReader;
use crate::postings::compression::{
    AlignedBuffer, BlockDecoder, VIntDecoder, COMPRESSION_BLOCK_SIZE,
};
use crate::postings::compression::{BlockDecoder, VIntDecoder, COMPRESSION_BLOCK_SIZE};
use crate::postings::{BlockInfo, FreqReadingOption, SkipReader};
use crate::query::Bm25Weight;
use crate::schema::IndexRecordOption;
use crate::{DocId, Score, TERMINATED};
use common::{BinarySerializable, VInt};

fn max_score<I: Iterator<Item = Score>>(mut it: I) -> Option<Score> {
    it.next().map(|first| it.fold(first, Score::max))
@@ -209,9 +207,9 @@ impl BlockSegmentPostings {
    ///
    /// This method is useful to run SSE2 linear search.
    #[inline]
    pub(crate) fn docs_aligned(&self) -> &AlignedBuffer {
    pub(crate) fn full_block(&self) -> &[DocId; COMPRESSION_BLOCK_SIZE] {
        debug_assert!(self.block_is_loaded());
        self.doc_decoder.output_aligned()
        self.doc_decoder.full_output()
    }

    /// Return the document at index `idx` of the block.
@@ -349,7 +347,6 @@ impl BlockSegmentPostings {
#[cfg(test)]
mod tests {
    use super::BlockSegmentPostings;
    use crate::common::HasLen;
    use crate::core::Index;
    use crate::docset::{DocSet, TERMINATED};
    use crate::postings::compression::COMPRESSION_BLOCK_SIZE;
@@ -360,6 +357,7 @@ mod tests {
    use crate::schema::Term;
    use crate::schema::INDEXED;
    use crate::DocId;
    use common::HasLen;

    #[test]
    fn test_empty_segment_postings() {

@@ -1,5 +1,5 @@
use crate::common::FixedSize;
use bitpacking::{BitPacker, BitPacker4x};
use common::FixedSize;

pub const COMPRESSION_BLOCK_SIZE: usize = BitPacker4x::BLOCK_LEN;
const COMPRESSED_BLOCK_MAX_SIZE: usize = COMPRESSION_BLOCK_SIZE * u32::SIZE_IN_BYTES;
@@ -49,16 +49,10 @@ impl BlockEncoder {
    }
}

/// We ensure that the OutputBuffer is aligned on 128 bits
/// in order to run SSE2 linear search on it.
#[repr(align(128))]
#[derive(Clone)]
pub(crate) struct AlignedBuffer(pub [u32; COMPRESSION_BLOCK_SIZE]);

#[derive(Clone)]
pub struct BlockDecoder {
    bitpacker: BitPacker4x,
    output: AlignedBuffer,
    output: [u32; COMPRESSION_BLOCK_SIZE],
    pub output_len: usize,
}

@@ -72,7 +66,7 @@ impl BlockDecoder {
    pub fn with_val(val: u32) -> BlockDecoder {
        BlockDecoder {
            bitpacker: BitPacker4x::new(),
            output: AlignedBuffer([val; COMPRESSION_BLOCK_SIZE]),
            output: [val; COMPRESSION_BLOCK_SIZE],
            output_len: 0,
        }
    }
@@ -85,28 +79,28 @@ impl BlockDecoder {
    ) -> usize {
        self.output_len = COMPRESSION_BLOCK_SIZE;
        self.bitpacker
            .decompress_sorted(offset, compressed_data, &mut self.output.0, num_bits)
            .decompress_sorted(offset, compressed_data, &mut self.output, num_bits)
    }

    pub fn uncompress_block_unsorted(&mut self, compressed_data: &[u8], num_bits: u8) -> usize {
        self.output_len = COMPRESSION_BLOCK_SIZE;
        self.bitpacker
            .decompress(compressed_data, &mut self.output.0, num_bits)
            .decompress(compressed_data, &mut self.output, num_bits)
    }

    #[inline]
    pub fn output_array(&self) -> &[u32] {
        &self.output.0[..self.output_len]
        &self.output[..self.output_len]
    }

    #[inline]
    pub(crate) fn output_aligned(&self) -> &AlignedBuffer {
    pub(crate) fn full_output(&self) -> &[u32; COMPRESSION_BLOCK_SIZE] {
        &self.output
    }

    #[inline]
    pub fn output(&self, idx: usize) -> u32 {
        self.output.0[idx]
        self.output[idx]
    }
}

@@ -190,8 +184,8 @@ impl VIntDecoder for BlockDecoder {
        padding: u32,
    ) -> usize {
        self.output_len = num_els;
        self.output.0.iter_mut().for_each(|el| *el = padding);
        vint::uncompress_sorted(compressed_data, &mut self.output.0[..num_els], offset)
        self.output.iter_mut().for_each(|el| *el = padding);
        vint::uncompress_sorted(compressed_data, &mut self.output[..num_els], offset)
    }

    fn uncompress_vint_unsorted(
@@ -201,12 +195,12 @@ impl VIntDecoder for BlockDecoder {
        padding: u32,
    ) -> usize {
        self.output_len = num_els;
        self.output.0.iter_mut().for_each(|el| *el = padding);
        vint::uncompress_unsorted(compressed_data, &mut self.output.0[..num_els])
        self.output.iter_mut().for_each(|el| *el = padding);
        vint::uncompress_unsorted(compressed_data, &mut self.output[..num_els])
    }

    fn uncompress_vint_unsorted_until_end(&mut self, compressed_data: &[u8]) {
        let num_els = vint::uncompress_unsorted_until_end(compressed_data, &mut self.output.0);
        let num_els = vint::uncompress_unsorted_until_end(compressed_data, &mut self.output);
        self.output_len = num_els;
    }
}

|
||||
*/
|
||||
|
||||
mod block_search;
|
||||
|
||||
pub(crate) use self::block_search::branchless_binary_search;
|
||||
|
||||
mod block_segment_postings;
|
||||
pub(crate) mod compression;
|
||||
mod postings;
|
||||
@@ -14,7 +17,6 @@ mod skip;
|
||||
mod stacker;
|
||||
mod term_info;
|
||||
|
||||
pub(crate) use self::block_search::BlockSearcher;
|
||||
pub use self::block_segment_postings::BlockSegmentPostings;
|
||||
pub use self::postings::Postings;
|
||||
pub(crate) use self::postings_writer::MultiFieldPostingsWriter;
|
||||
|
||||
@@ -11,7 +11,7 @@ use crate::docset::DocSet;
|
||||
/// but other implementations mocking `SegmentPostings` exist,
|
||||
/// for merging segments or for testing.
|
||||
pub trait Postings: DocSet + 'static {
|
||||
/// Returns the term frequency
|
||||
/// The number of times the term appears in the document.
|
||||
fn term_freq(&self) -> u32;
|
||||
|
||||
/// Returns the positions offseted with a given value.
|
||||
|
||||
@@ -1,10 +1,8 @@
|
||||
use super::stacker::{ExpUnrolledLinkedList, MemoryArena};
|
||||
use crate::indexer::doc_id_mapping::DocIdMapping;
|
||||
use crate::postings::FieldSerializer;
|
||||
use crate::DocId;
|
||||
use crate::{
|
||||
common::{read_u32_vint, write_u32_vint},
|
||||
indexer::doc_id_mapping::DocIdMapping,
|
||||
};
|
||||
use common::{read_u32_vint, write_u32_vint};
|
||||
|
||||
const POSITION_END: u32 = 0;
|
||||
|
||||
|
||||
@@ -1,12 +1,12 @@
use crate::common::HasLen;
use crate::docset::DocSet;
use crate::fastfield::DeleteBitSet;
use crate::positions::PositionReader;
use crate::postings::branchless_binary_search;
use crate::postings::compression::COMPRESSION_BLOCK_SIZE;
use crate::postings::BlockSearcher;
use crate::postings::BlockSegmentPostings;
use crate::postings::Postings;
use crate::{DocId, TERMINATED};
use common::HasLen;

/// `SegmentPostings` represents the inverted list or postings associated to
/// a term in a `Segment`.
@@ -18,7 +18,6 @@ pub struct SegmentPostings {
    pub(crate) block_cursor: BlockSegmentPostings,
    cur: usize,
    position_reader: Option<PositionReader>,
    block_searcher: BlockSearcher,
}

impl SegmentPostings {
@@ -28,7 +27,6 @@ impl SegmentPostings {
            block_cursor: BlockSegmentPostings::empty(),
            cur: 0,
            position_reader: None,
            block_searcher: BlockSearcher::default(),
        }
    }

@@ -154,7 +152,6 @@ impl SegmentPostings {
            block_cursor: segment_block_postings,
            cur: 0, // cursor within the block
            position_reader,
            block_searcher: BlockSearcher::default(),
        }
    }
}
@@ -183,8 +180,8 @@ impl DocSet for SegmentPostings {
        self.block_cursor.seek(target);

        // At this point we are on the block that might contain our document.
        let output = self.block_cursor.docs_aligned();
        self.cur = self.block_searcher.search_in_block(output, target);
        let output = self.block_cursor.full_block();
        self.cur = branchless_binary_search(output, target);

        // The last block is not full and padded with the value TERMINATED,
        // so we are guaranteed to have at least one doc in the block (a real one or the padding).
@@ -197,7 +194,7 @@ impl DocSet for SegmentPostings {
        // with the value `TERMINATED`.
        //
        // After the search, the cursor should point to the first value of TERMINATED.
        let doc = output.0[self.cur];
        let doc = output[self.cur];
        debug_assert!(doc >= target);
        debug_assert_eq!(doc, self.doc());
        doc
@@ -268,7 +265,7 @@ impl Postings for SegmentPostings {
mod tests {

    use super::SegmentPostings;
    use crate::common::HasLen;
    use common::HasLen;

    use crate::docset::{DocSet, TERMINATED};
    use crate::fastfield::DeleteBitSet;

@@ -1,7 +1,6 @@
use super::TermInfo;
use crate::common::{BinarySerializable, VInt};
use crate::common::{CompositeWrite, CountingWriter};
use crate::core::Segment;
use crate::directory::CompositeWrite;
use crate::directory::WritePtr;
use crate::fieldnorm::FieldNormReader;
use crate::positions::PositionSerializer;
@@ -12,6 +11,8 @@ use crate::schema::{Field, FieldEntry, FieldType};
use crate::schema::{IndexRecordOption, Schema};
use crate::termdict::{TermDictionaryBuilder, TermOrdinal};
use crate::{DocId, Score};
use common::CountingWriter;
use common::{BinarySerializable, VInt};
use std::cmp::Ordering;
use std::io::{self, Write};

@@ -442,10 +443,8 @@ impl<W: Write> PostingsSerializer<W> {
            let skip_data = self.skip_write.data();
            VInt(skip_data.len() as u64).serialize(&mut self.output_write)?;
            self.output_write.write_all(skip_data)?;
            self.output_write.write_all(&self.postings_write[..])?;
        } else {
            self.output_write.write_all(&self.postings_write[..])?;
        }
        self.output_write.write_all(&self.postings_write[..])?;
        self.skip_write.clear();
        self.postings_write.clear();
        self.bm25_weight = None;

@@ -1,4 +1,4 @@
use crate::common::{BinarySerializable, FixedSize};
use common::{BinarySerializable, FixedSize};
use std::io;
use std::iter::ExactSizeIterator;
use std::ops::Range;
@@ -67,7 +67,7 @@ impl BinarySerializable for TermInfo {
mod tests {

    use super::TermInfo;
    use crate::common::test::fixed_size_test;
    use crate::tests::fixed_size_test;

    // TODO add serialize/deserialize test for terminfo

@@ -1,4 +1,3 @@
use crate::common::BitSet;
use crate::core::SegmentReader;
use crate::query::ConstScorer;
use crate::query::{BitSetDocSet, Explanation};
@@ -7,6 +6,7 @@ use crate::schema::{Field, IndexRecordOption};
use crate::termdict::{TermDictionary, TermStreamer};
use crate::TantivyError;
use crate::{DocId, Score};
use common::BitSet;
use std::io;
use std::sync::Arc;
use tantivy_fst::Automaton;
@@ -121,10 +121,7 @@ mod tests {
    }

    fn is_match(&self, state: &Self::State) -> bool {
        match *state {
            State::AfterA => true,
            _ => false,
        }
        matches!(*state, State::AfterA)
    }

    fn accept(&self, state: &Self::State, byte: u8) -> Self::State {

@@ -1,6 +1,6 @@
use crate::common::{BitSet, TinySet};
use crate::docset::{DocSet, TERMINATED};
use crate::DocId;
use common::{BitSet, TinySet};

/// A `BitSetDocSet` makes it possible to iterate through a bitset as if it was a `DocSet`.
///
@@ -96,10 +96,13 @@ impl DocSet for BitSetDocSet {

#[cfg(test)]
mod tests {
    use std::collections::BTreeSet;

    use super::BitSetDocSet;
    use crate::common::BitSet;
    use crate::docset::{DocSet, TERMINATED};
    use crate::tests::generate_nonunique_unsorted;
    use crate::DocId;
    use common::BitSet;

    fn create_docbitset(docs: &[DocId], max_doc: DocId) -> BitSetDocSet {
        let mut docset = BitSet::with_max_value(max_doc);
@@ -109,6 +112,29 @@ mod tests {
        BitSetDocSet::from(docset)
    }

    #[test]
    fn test_bitset_large() {
        let arr = generate_nonunique_unsorted(100_000, 5_000);
        let mut btreeset: BTreeSet<u32> = BTreeSet::new();
        let mut bitset = BitSet::with_max_value(100_000);
        for el in arr {
            btreeset.insert(el);
            bitset.insert(el);
        }
        for i in 0..100_000 {
            assert_eq!(btreeset.contains(&i), bitset.contains(i));
        }
        assert_eq!(btreeset.len(), bitset.len());
        let mut bitset_docset = BitSetDocSet::from(bitset);
        let mut remaining = true;
        for el in btreeset.into_iter() {
            assert!(remaining);
            assert_eq!(bitset_docset.doc(), el);
            remaining = bitset_docset.advance() != TERMINATED;
        }
        assert!(!remaining);
    }

    #[test]
    fn test_empty() {
        let bitset = BitSet::with_max_value(1000);

@@ -310,7 +310,7 @@ mod tests {
        ));
        let query = BooleanQuery::from(vec![(Occur::Should, term_a), (Occur::Should, term_b)]);
        let explanation = query.explain(&searcher, DocAddress::new(0, 0u32))?;
        assert_nearly_equals!(explanation.value(), 0.6931472);
        assert_nearly_equals!(explanation.value(), std::f32::consts::LN_2);
        Ok(())
    }
}

@@ -1,4 +1,3 @@
use crate::common::BitSet;
use crate::core::Searcher;
use crate::core::SegmentReader;
use crate::error::TantivyError;
@@ -10,6 +9,7 @@ use crate::schema::Type;
use crate::schema::{Field, IndexRecordOption, Term};
use crate::termdict::{TermDictionary, TermStreamer};
use crate::{DocId, Score};
use common::BitSet;
use std::io;
use std::ops::{Bound, Range};

@@ -10,6 +10,9 @@ use tantivy_fst::Regex;
/// containing a specific term that matches
/// a regex pattern.
///
/// Wildcard queries (e.g. ho*se) can be achieved
/// by converting them to their regex counterparts.
///
/// ```rust
/// use tantivy::collector::Count;
/// use tantivy::query::RegexQuery;

@@ -1,9 +1,9 @@
use crate::common::TinySet;
use crate::docset::{DocSet, TERMINATED};
use crate::query::score_combiner::{DoNothingCombiner, ScoreCombiner};
use crate::query::Scorer;
use crate::DocId;
use crate::Score;
use common::TinySet;

const HORIZON_NUM_TINYBITSETS: usize = 64;
const HORIZON: u32 = 64u32 * HORIZON_NUM_TINYBITSETS as u32;

@@ -1,8 +1,8 @@
#![allow(dead_code)]

use crate::common::HasLen;
use crate::docset::{DocSet, TERMINATED};
use crate::DocId;
use common::HasLen;

/// Simulates a `Postings` object from a `VecPostings`.
/// `VecPostings` only exists for testing purposes.

@@ -1,8 +1,8 @@
use super::*;
use crate::common::BinarySerializable;
use crate::common::VInt;
use crate::tokenizer::PreTokenizedString;
use crate::DateTime;
use common::BinarySerializable;
use common::VInt;
use std::io::{self, Read, Write};
use std::mem;

@@ -1,4 +1,4 @@
use crate::common::BinarySerializable;
use common::BinarySerializable;
use once_cell::sync::Lazy;
use regex::Regex;
use serde::{Deserialize, Deserializer, Serialize, Serializer};

@@ -1,4 +1,4 @@
use crate::common::BinarySerializable;
use common::BinarySerializable;
use std::io;
use std::io::Read;
use std::io::Write;

@@ -109,17 +109,11 @@ impl FieldEntry {
        &self.field_type
    }

    /// Returns true iff the field is indexed
    /// Returns true iff the field is indexed.
    ///
    /// An indexed field is searchable.
    pub fn is_indexed(&self) -> bool {
        match self.field_type {
            FieldType::Str(ref options) => options.get_indexing_options().is_some(),
            FieldType::U64(ref options)
            | FieldType::I64(ref options)
            | FieldType::F64(ref options)
            | FieldType::Date(ref options) => options.is_indexed(),
            FieldType::HierarchicalFacet(ref options) => options.is_indexed(),
            FieldType::Bytes(ref options) => options.is_indexed(),
        }
        self.field_type.is_indexed()
    }

    /// Returns true iff the field is an int (signed or unsigned) fast field

@@ -1,6 +1,6 @@
use crate::common::BinarySerializable;
use crate::schema::Field;
use crate::schema::Value;
use common::BinarySerializable;
use std::io::{self, Read, Write};

/// `FieldValue` holds together a `Field` and its `Value`.

@@ -20,7 +20,7 @@ pub const STORED: SchemaFlagList<StoredFlag, ()> = SchemaFlagList {

#[derive(Clone)]
pub struct IndexedFlag;
/// Flag to mark the field as indexed.
/// Flag to mark the field as indexed. An indexed field is searchable.
///
/// The `INDEXED` flag can only be used when building `IntOptions` (`u64`, `i64` and `f64` fields).
/// Of course, text fields can also be indexed... But this is expressed by using either the

@@ -29,7 +29,7 @@ impl IntOptions {
        self.stored
    }

    /// Returns true iff the value is indexed.
    /// Returns true iff the value is indexed and therefore searchable.
    pub fn is_indexed(&self) -> bool {
        self.indexed
    }
@@ -52,6 +52,8 @@ impl IntOptions {
    ///
    /// Setting an integer as indexed will generate
    /// a posting list for each value taken by the integer.
    ///
    /// This is required for the field to be searchable.
    pub fn set_indexed(mut self) -> IntOptions {
        self.indexed = true;
        self

@@ -157,7 +157,7 @@ pub use self::int_options::IntOptions;
/// A field name can contain any character, must have at least one character,
/// and must not start with a `-`.
pub fn is_valid_field_name(field_name: &str) -> bool {
    field_name.len() > 0 && !field_name.starts_with('-')
    !field_name.is_empty() && !field_name.starts_with('-')
}
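
For illustration, the intended behavior of the rewritten predicate (hypothetical asserts mirroring the doc comment above):

assert!(is_valid_field_name("title"));
assert!(!is_valid_field_name(""));        // must have at least one character
assert!(!is_valid_field_name("-hidden")); // must not start with a `-`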

#[cfg(test)]

@@ -1,9 +1,9 @@
use std::fmt;

use super::Field;
use crate::common;
use crate::schema::Facet;
use crate::DateTime;
use common;
use std::str;

/// Size (in bytes) of the buffer of an int field.

@@ -94,7 +94,7 @@ impl TextFieldIndexing {
    }
}

/// The field will be untokenized and indexed
/// The field will be untokenized and indexed.
pub const STRING: TextOptions = TextOptions {
    indexing: Some(TextFieldIndexing {
        tokenizer: Cow::Borrowed("raw"),
@@ -103,7 +103,7 @@ pub const STRING: TextOptions = TextOptions {
    stored: false,
};

/// The field will be tokenized and indexed
/// The field will be tokenized and indexed.
pub const TEXT: TextOptions = TextOptions {
    indexing: Some(TextFieldIndexing {
        tokenizer: Cow::Borrowed("default"),

@@ -276,10 +276,10 @@ impl From<PreTokenizedString> for Value {

mod binary_serialize {
    use super::Value;
    use crate::common::{f64_to_u64, u64_to_f64, BinarySerializable};
    use crate::schema::Facet;
    use crate::tokenizer::PreTokenizedString;
    use chrono::{TimeZone, Utc};
    use common::{f64_to_u64, u64_to_f64, BinarySerializable};
    use std::io::{self, Read, Write};

    const TEXT_CODE: u8 = 0;

@@ -1,8 +1,5 @@
use crate::{
    common::{BinarySerializable, FixedSize, HasLen},
    directory::FileSlice,
    store::Compressor,
};
use crate::{directory::FileSlice, store::Compressor};
use common::{BinarySerializable, FixedSize, HasLen};
use std::io;

#[derive(Debug, Clone, PartialEq)]

@@ -1,6 +1,6 @@
use crate::common::VInt;
use crate::store::index::{Checkpoint, CHECKPOINT_PERIOD};
use crate::DocId;
use common::VInt;
use std::io;
use std::ops::Range;

@@ -1,8 +1,8 @@
use crate::common::{BinarySerializable, VInt};
use crate::directory::OwnedBytes;
use crate::store::index::block::CheckpointBlock;
use crate::store::index::Checkpoint;
use crate::DocId;
use common::{BinarySerializable, VInt};

pub struct LayerCursor<'a> {
    remaining: &'a [u8],

@@ -1,6 +1,6 @@
use crate::common::{BinarySerializable, VInt};
use crate::store::index::block::CheckpointBlock;
use crate::store::index::{Checkpoint, CHECKPOINT_PERIOD};
use common::{BinarySerializable, VInt};
use std::io;
use std::io::Write;

@@ -5,11 +5,8 @@ use crate::schema::Document;
use crate::space_usage::StoreSpaceUsage;
use crate::store::index::Checkpoint;
use crate::DocId;
use crate::{
    common::{BinarySerializable, HasLen, VInt},
    error::DataCorruption,
    fastfield::DeleteBitSet,
};
use crate::{error::DataCorruption, fastfield::DeleteBitSet};
use common::{BinarySerializable, HasLen, VInt};
use lru::LruCache;
use std::io;
use std::sync::atomic::{AtomicUsize, Ordering};

@@ -1,13 +1,13 @@
use super::index::SkipIndexBuilder;
use super::StoreReader;
use super::{compressors::Compressor, footer::DocStoreFooter};
use crate::common::CountingWriter;
use crate::common::{BinarySerializable, VInt};
use crate::directory::TerminatingWrite;
use crate::directory::WritePtr;
use crate::schema::Document;
use crate::store::index::Checkpoint;
use crate::DocId;
use common::CountingWriter;
use common::{BinarySerializable, VInt};
use std::io::{self, Write};

const BLOCK_SIZE: usize = 16_384;

@@ -1,8 +1,8 @@
use crate::common::{BinarySerializable, FixedSize};
use crate::directory::{FileSlice, OwnedBytes};
use crate::postings::TermInfo;
use crate::termdict::TermOrdinal;
use byteorder::{ByteOrder, LittleEndian};
use common::{BinarySerializable, FixedSize};
use std::cmp;
use std::io::{self, Read, Write};
use tantivy_bitpacker::compute_num_bits;
@@ -290,16 +290,16 @@ mod tests {
    use super::extract_bits;
    use super::TermInfoBlockMeta;
    use super::{TermInfoStore, TermInfoStoreWriter};
    use crate::common;
    use crate::common::BinarySerializable;
    use crate::directory::FileSlice;
    use crate::postings::TermInfo;
    use common;
    use common::BinarySerializable;
    use tantivy_bitpacker::compute_num_bits;
    use tantivy_bitpacker::BitPacker;

    #[test]
    fn test_term_info_block() {
        common::test::fixed_size_test::<TermInfoBlockMeta>();
        crate::tests::fixed_size_test::<TermInfoBlockMeta>();
    }

    #[test]

@@ -1,10 +1,10 @@
use super::term_info_store::{TermInfoStore, TermInfoStoreWriter};
use super::{TermStreamer, TermStreamerBuilder};
use crate::common::{BinarySerializable, CountingWriter};
use crate::directory::{FileSlice, OwnedBytes};
use crate::error::DataCorruption;
use crate::postings::TermInfo;
use crate::termdict::TermOrdinal;
use common::{BinarySerializable, CountingWriter};
use once_cell::sync::Lazy;
use std::io::{self, Write};
use tantivy_fst::raw::Fst;

@@ -131,6 +131,7 @@ mod token_stream_chain;
mod tokenized_string;
mod tokenizer;
mod tokenizer_manager;
mod whitespace_tokenizer;

pub use self::alphanum_only::AlphaNumOnlyFilter;
pub use self::ascii_folding_filter::AsciiFoldingFilter;
@@ -143,6 +144,7 @@ pub use self::simple_tokenizer::SimpleTokenizer;
pub use self::stemmer::{Language, Stemmer};
pub use self::stop_word_filter::StopWordFilter;
pub(crate) use self::token_stream_chain::TokenStreamChain;
pub use self::whitespace_tokenizer::WhitespaceTokenizer;

pub use self::tokenized_string::{PreTokenizedStream, PreTokenizedString};
pub use self::tokenizer::{
@@ -277,4 +279,25 @@ pub mod tests {
        assert!(tokens.is_empty());
    }
}

    #[test]
    fn test_whitespace_tokenizer() {
        let tokenizer_manager = TokenizerManager::default();
        let ws_tokenizer = tokenizer_manager.get("whitespace").unwrap();
        let mut tokens: Vec<Token> = vec![];
        {
            let mut add_token = |token: &Token| {
                tokens.push(token.clone());
            };
            ws_tokenizer
                .token_stream("Hello, happy tax payer!")
                .process(&mut add_token);
        }

        assert_eq!(tokens.len(), 4);
        assert_token(&tokens[0], 0, "Hello,", 0, 6);
        assert_token(&tokens[1], 1, "happy", 7, 12);
        assert_token(&tokens[2], 2, "tax", 13, 16);
        assert_token(&tokens[3], 3, "payer!", 17, 23);
    }
}

@@ -5,6 +5,7 @@ use crate::tokenizer::RawTokenizer;
use crate::tokenizer::RemoveLongFilter;
use crate::tokenizer::SimpleTokenizer;
use crate::tokenizer::Stemmer;
use crate::tokenizer::WhitespaceTokenizer;
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

@@ -72,6 +73,7 @@ impl Default for TokenizerManager {
            .filter(LowerCaser)
            .filter(Stemmer::new(Language::English)),
        );
        manager.register("whitespace", WhitespaceTokenizer);
        manager
    }
}

59
src/tokenizer/whitespace_tokenizer.rs
Normal file
@@ -0,0 +1,59 @@
use super::BoxTokenStream;
use super::{Token, TokenStream, Tokenizer};
use std::str::CharIndices;

/// Tokenize the text by splitting on whitespaces.
#[derive(Clone)]
pub struct WhitespaceTokenizer;

pub struct WhitespaceTokenStream<'a> {
    text: &'a str,
    chars: CharIndices<'a>,
    token: Token,
}

impl Tokenizer for WhitespaceTokenizer {
    fn token_stream<'a>(&self, text: &'a str) -> BoxTokenStream<'a> {
        BoxTokenStream::from(WhitespaceTokenStream {
            text,
            chars: text.char_indices(),
            token: Token::default(),
        })
    }
}

impl<'a> WhitespaceTokenStream<'a> {
    // search for the end of the current token.
    fn search_token_end(&mut self) -> usize {
        (&mut self.chars)
            .filter(|&(_, ref c)| c.is_ascii_whitespace())
            .map(|(offset, _)| offset)
            .next()
            .unwrap_or_else(|| self.text.len())
    }
}

impl<'a> TokenStream for WhitespaceTokenStream<'a> {
    fn advance(&mut self) -> bool {
        self.token.text.clear();
        self.token.position = self.token.position.wrapping_add(1);
        while let Some((offset_from, c)) = self.chars.next() {
            if !c.is_ascii_whitespace() {
                let offset_to = self.search_token_end();
                self.token.offset_from = offset_from;
                self.token.offset_to = offset_to;
                self.token.text.push_str(&self.text[offset_from..offset_to]);
                return true;
            }
        }
        false
    }

    fn token(&self) -> &Token {
        &self.token
    }

    fn token_mut(&mut self) -> &mut Token {
        &mut self.token
    }
}
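
A short usage sketch for the new tokenizer (standalone; assumes the re-exports added in src/tokenizer/mod.rs, and relies on `BoxTokenStream` dereferencing to the underlying `TokenStream`):

use tantivy::tokenizer::{TokenStream, Tokenizer, WhitespaceTokenizer};

let mut stream = WhitespaceTokenizer.token_stream("Hello, happy tax payer!");
let mut words: Vec<String> = Vec::new();
while stream.advance() {
    words.push(stream.token().text.clone());
}
// Only ASCII whitespace splits tokens, so punctuation stays attached.
assert_eq!(words, vec!["Hello,", "happy", "tax", "payer!"]);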