chore: Release

Merge pull request #2618 from quickwit-oss/release_tantivy
2026-01-04 16:22:55 +00:00 · 2025-04-09 16:58:45 +08:00 · 2025-04-09 16:57:04 +08:00 · 2025-04-09 09:54:09 +02:00 · 2025-04-09 14:35:23 +08:00 · 2025-04-09 08:09:41 +02:00
179 changed files with 4130 additions and 1172 deletions
--- a/ARCHITECTURE.md
+++ b/ARCHITECTURE.md
@@ -46,7 +46,7 @@ The file of a segment has the format

 ```segment-id . ext```

-The extension signals which data structure (or [`SegmentComponent`](src/core/segment_component.rs)) is stored in the file.
+The extension signals which data structure (or [`SegmentComponent`](src/index/segment_component.rs)) is stored in the file.

 A small `meta.json` file is in charge of keeping track of the list of segments, as well as the schema.

@@ -102,7 +102,7 @@ but users can extend tantivy with their own implementation.

 Tantivy's document follows a very strict schema, decided before building any index.

-The schema defines all of the fields that the indexes [`Document`](src/schema/document.rs) may and should contain, their types (`text`, `i64`, `u64`, `Date`, ...) as well as how it should be indexed / represented in tantivy.
+The schema defines all of the fields that the indexes [`Document`](src/schema/document/mod.rs) may and should contain, their types (`text`, `i64`, `u64`, `Date`, ...) as well as how it should be indexed / represented in tantivy.

 Depending on the type of the field, you can decide to

--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,11 +1,14 @@
 Tantivy 0.23 - Unreleased
 ================================
-Tantivy 0.23 will be backwards compatible with indices created with v0.22 and v0.21.
+Tantivy 0.23 will be backwards compatible with indices created with v0.22 and v0.21. The new minimum rust version will be 1.75.

 #### Bugfixes
 - fix potential endless loop in merge [#2457](https://github.com/quickwit-oss/tantivy/pull/2457)(@PSeitz)
 - fix bug that causes out-of-order sstable key. [#2445](https://github.com/quickwit-oss/tantivy/pull/2445)(@fulmicoton)
 - fix ReferenceValue API flaw [#2372](https://github.com/quickwit-oss/tantivy/pull/2372)(@PSeitz)
+- fix `OwnedBytes` debug panic [#2512](https://github.com/quickwit-oss/tantivy/pull/2512)(@b41sh)
+- catch panics during merges [#2582](https://github.com/quickwit-oss/tantivy/pull/2582)(@rdettai)
+- switch from u32 to usize in bitpacker. This enables multivalued columns larger than 4GB, which crashed during merge before. [#2581](https://github.com/quickwit-oss/tantivy/pull/2581) [#2586](https://github.com/quickwit-oss/tantivy/pull/2586)(@fulmicoton-dd @PSeitz)

 #### Breaking API Changes
 - remove index sorting [#2434](https://github.com/quickwit-oss/tantivy/pull/2434)(@PSeitz)
@@ -23,6 +26,7 @@ Tantivy 0.23 will be backwards compatible with indices created with v0.22 and v0
    - reduce top hits memory consumption [#2426](https://github.com/quickwit-oss/tantivy/pull/2426)(@PSeitz)
    - check unsupported parameters top_hits [#2351](https://github.com/quickwit-oss/tantivy/pull/2351)(@PSeitz)
    - Change AggregationLimits to AggregationLimitsGuard [#2495](https://github.com/quickwit-oss/tantivy/pull/2495)(@PSeitz)
+    - add support for counting non integer in aggregation [#2547](https://github.com/quickwit-oss/tantivy/pull/2547)(@trinity-1686a)
 - **Range Queries**
    - Support fast field range queries on json fields [#2456](https://github.com/quickwit-oss/tantivy/pull/2456)(@PSeitz)
    - Add support for str fast field range query [#2460](https://github.com/quickwit-oss/tantivy/pull/2460) [#2452](https://github.com/quickwit-oss/tantivy/pull/2452) [#2453](https://github.com/quickwit-oss/tantivy/pull/2453)(@PSeitz)
@@ -33,9 +37,18 @@ Tantivy 0.23 will be backwards compatible with indices created with v0.22 and v0
 - add columnar format compatibility tests [#2433](https://github.com/quickwit-oss/tantivy/pull/2433)(@PSeitz)
 - Improved snippet ranges algorithm [#2474](https://github.com/quickwit-oss/tantivy/pull/2474)(@gezihuzi)
 - make find_field_with_default return json fields without path [#2476](https://github.com/quickwit-oss/tantivy/pull/2476)(@trinity-1686a)
-- feat(query): Make `BooleanQuery` support `minimum_number_should_match` [#2405](https://github.com/quickwit-oss/tantivy/pull/2405)(@LebranceBW)
+- Make `BooleanQuery` support `minimum_number_should_match` [#2405](https://github.com/quickwit-oss/tantivy/pull/2405)(@LebranceBW)
+- Make `NUM_MERGE_THREADS` configurable [#2535](https://github.com/quickwit-oss/tantivy/pull/2535)(@Barre)

-- **Optional Index in Multivalue Columnar Index** For mostly empty multivalued indices there was a large overhead during creation when iterating all docids (merge case). This is alleviated by placing an optional index in the multivalued index to mark documents that have values. This will slightly increase space and access time. [#2439](https://github.com/quickwit-oss/tantivy/pull/2439)(@PSeitz)
+- **RegexPhraseQuery** 
+`RegexPhraseQuery` supports phrase queries with regex. E.g. query "b.* b.* wolf" matches "big bad wolf". Slop is supported as well: "b.* wolf"~2 matches "big bad wolf" [#2516](https://github.com/quickwit-oss/tantivy/pull/2516)(@PSeitz)
+
+- **Optional Index in Multivalue Columnar Index** 
+For mostly empty multivalued indices there was a large overhead during creation when iterating all docids (merge case). 
+This is alleviated by placing an optional index in the multivalued index to mark documents that have values. 
+This will slightly increase space and access time. [#2439](https://github.com/quickwit-oss/tantivy/pull/2439)(@PSeitz)
+
+- **Store DateTime as nanoseconds in doc store** DateTime in the doc store was truncated to microseconds previously. This removes this truncation, while still keeping backwards compatibility. [#2486](https://github.com/quickwit-oss/tantivy/pull/2486)(@PSeitz)

 - **Performace/Memory**
    - lift clauses in LogicalAst for optimized ast during execution [#2449](https://github.com/quickwit-oss/tantivy/pull/2449)(@PSeitz)
@@ -51,18 +64,21 @@ Tantivy 0.23 will be backwards compatible with indices created with v0.22 and v0
    - fix de-escaping too much in query parser [#2427](https://github.com/quickwit-oss/tantivy/pull/2427)(@trinity-1686a)
    - improve query parser [#2416](https://github.com/quickwit-oss/tantivy/pull/2416)(@trinity-1686a)
    - Support field grouping `title:(return AND "pink panther")` [#2333](https://github.com/quickwit-oss/tantivy/pull/2333)(@trinity-1686a)
+    - allow term starting with wildcard [#2568](https://github.com/quickwit-oss/tantivy/pull/2568)(@trinity-1686a)

+- Exist queries match subpath fields [#2558](https://github.com/quickwit-oss/tantivy/pull/2558)(@rdettai)
 - add access benchmark for columnar [#2432](https://github.com/quickwit-oss/tantivy/pull/2432)(@PSeitz)
 - extend indexwriter proptests [#2342](https://github.com/quickwit-oss/tantivy/pull/2342)(@PSeitz)
 - add bench & test for columnar merging [#2428](https://github.com/quickwit-oss/tantivy/pull/2428)(@PSeitz)
 - Change in Executor API [#2391](https://github.com/quickwit-oss/tantivy/pull/2391)(@fulmicoton)
 - Removed usage of num_cpus [#2387](https://github.com/quickwit-oss/tantivy/pull/2387)(@fulmicoton)
-- use bingang for agg benchmark [#2378](https://github.com/quickwit-oss/tantivy/pull/2378)(@PSeitz)
+- use bingang for agg and stacker benchmark [#2378](https://github.com/quickwit-oss/tantivy/pull/2378)[#2492](https://github.com/quickwit-oss/tantivy/pull/2492)(@PSeitz) 
 - cleanup top level exports [#2382](https://github.com/quickwit-oss/tantivy/pull/2382)(@PSeitz)
 - make convert_to_fast_value_and_append_to_json_term pub [#2370](https://github.com/quickwit-oss/tantivy/pull/2370)(@PSeitz)
 - remove JsonTermWriter [#2238](https://github.com/quickwit-oss/tantivy/pull/2238)(@PSeitz)
 - validate sort by field type [#2336](https://github.com/quickwit-oss/tantivy/pull/2336)(@PSeitz)
 - Fix trait bound of StoreReader::iter [#2360](https://github.com/quickwit-oss/tantivy/pull/2360)(@adamreichold)
+- remove read_postings_no_deletes [#2526](https://github.com/quickwit-oss/tantivy/pull/2526)(@PSeitz)

 Tantivy 0.22
 ================================
@@ -717,7 +733,7 @@ Tantivy 0.4.0
 - Raise the limit of number of fields (previously 256 fields) (@fulmicoton)
 - Removed u32 fields. They are replaced by u64 and i64 fields (#65) (@fulmicoton)
 - Optimized skip in SegmentPostings (#130) (@lnicola)
-- Replacing rustc_serialize by serde. Kudos to @KodrAus and @lnicola
+- Replacing rustc_serialize by serde. Kudos to  benchmark@KodrAus and @lnicola
 - Using error-chain (@KodrAus)
 - QueryParser: (@fulmicoton)
  - Explicit error returned when searched for a term that is not indexed
--- a/CITATION.cff
+++ b/CITATION.cff
@@ -0,0 +1,10 @@
+cff-version: 1.2.0
+message: "If you use this software, please cite it as below."
+authors:
+  - alias: Quickwit Inc.
+    website: "https://quickwit.io"
+title: "tantivy"
+version: 0.22.0
+doi: 10.5281/zenodo.13942948
+date-released: 2024-10-17
+url: "https://github.com/quickwit-oss/tantivy"
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "tantivy"
-version = "0.23.0"
+version = "0.24.0"
 authors = ["Paul Masurel <paul.masurel@gmail.com>"]
 license = "MIT"
 categories = ["database-implementations", "data-structures"]
@@ -11,7 +11,7 @@ repository = "https://github.com/quickwit-oss/tantivy"
 readme = "README.md"
 keywords = ["search", "information", "retrieval"]
 edition = "2021"
-rust-version = "1.66"
+rust-version = "1.75"
 exclude = ["benches/*.json", "benches/*.txt"]

 [dependencies]
@@ -31,20 +31,20 @@ lz4_flex = { version = "0.11", default-features = false, optional = true }
 zstd = { version = "0.13", optional = true, default-features = false }
 tempfile = { version = "3.12.0", optional = true }
 log = "0.4.16"
-serde = { version = "1.0.136", features = ["derive"] }
-serde_json = "1.0.79"
+serde = { version = "1.0.219", features = ["derive"] }
+serde_json = "1.0.140"
 fs4 = { version = "0.8.0", optional = true }
 levenshtein_automata = "0.2.1"
 uuid = { version = "1.0.0", features = ["v4", "serde"] }
 crossbeam-channel = "0.5.4"
 rust-stemmers = "1.2.0"
-downcast-rs = "1.2.1"
+downcast-rs = "2.0.1"
 bitpacking = { version = "0.9.2", default-features = false, features = [
    "bitpacker4x",
 ] }
 census = "0.4.2"
-rustc-hash = "1.1.0"
-thiserror = "1.0.30"
+rustc-hash = "2.0.0"
+thiserror = "2.0.1"
 htmlescape = "0.3.1"
 fail = { version = "0.5.0", optional = true }
 time = { version = "0.3.35", features = ["serde-well-known"] }
@@ -52,27 +52,29 @@ smallvec = "1.8.0"
 rayon = "1.5.2"
 lru = "0.12.0"
 fastdivide = "0.4.0"
-itertools = "0.13.0"
-measure_time = "0.8.2"
+itertools = "0.14.0"
+measure_time = "0.9.0"
 arc-swap = "1.5.0"
+bon = "3.3.1"

-columnar = { version = "0.3", path = "./columnar", package = "tantivy-columnar" }
-sstable = { version = "0.3", path = "./sstable", package = "tantivy-sstable", optional = true }
-stacker = { version = "0.3", path = "./stacker", package = "tantivy-stacker" }
-query-grammar = { version = "0.22.0", path = "./query-grammar", package = "tantivy-query-grammar" }
-tantivy-bitpacker = { version = "0.6", path = "./bitpacker" }
-common = { version = "0.7", path = "./common/", package = "tantivy-common" }
-tokenizer-api = { version = "0.3", path = "./tokenizer-api", package = "tantivy-tokenizer-api" }
+columnar = { version = "0.5", path = "./columnar", package = "tantivy-columnar" }
+sstable = { version = "0.5", path = "./sstable", package = "tantivy-sstable", optional = true }
+stacker = { version = "0.5", path = "./stacker", package = "tantivy-stacker" }
+query-grammar = { version = "0.24.0", path = "./query-grammar", package = "tantivy-query-grammar" }
+tantivy-bitpacker = { version = "0.8", path = "./bitpacker" }
+common = { version = "0.9", path = "./common/", package = "tantivy-common" }
+tokenizer-api = { version = "0.5", path = "./tokenizer-api", package = "tantivy-tokenizer-api" }
 sketches-ddsketch = { version = "0.3.0", features = ["use_serde"] }
 hyperloglogplus = { version = "0.4.1", features = ["const-loop"] }
 futures-util = { version = "0.3.28", optional = true }
+futures-channel = { version = "0.3.28", optional = true }
 fnv = "1.0.7"

 [target.'cfg(windows)'.dependencies]
 winapi = "0.3.9"

 [dev-dependencies]
-binggan = "0.12.0"
+binggan = "0.14.0"
 rand = "0.8.5"
 maplit = "1.0.2"
 matches = "0.1.9"
@@ -120,7 +122,7 @@ zstd-compression = ["zstd"]
 failpoints = ["fail", "fail/failpoints"]
 unstable = []                            # useful for benches.

-quickwit = ["sstable", "futures-util"]
+quickwit = ["sstable", "futures-util", "futures-channel"]

 # Compares only the hash of a string when indexing data.
 # Increases indexing speed, but may lead to extremely rare missing terms, when there's a hash collision.
--- a/benches/agg_bench.rs
+++ b/benches/agg_bench.rs
@@ -20,7 +20,6 @@ macro_rules! register {
    ($runner:expr, $func:ident) => {
        $runner.register(stringify!($func), move |index| {
            $func(index);
-            None
        })
    };
 }
--- a/bitpacker/Cargo.toml
+++ b/bitpacker/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "tantivy-bitpacker"
-version = "0.6.0"
+version = "0.8.0"
 edition = "2021"
 authors = ["Paul Masurel <paul.masurel@gmail.com>"]
 license = "MIT"
--- a/bitpacker/src/bitpacker.rs
+++ b/bitpacker/src/bitpacker.rs
@@ -65,7 +65,7 @@ impl BitPacker {

 #[derive(Clone, Debug, Default, Copy)]
 pub struct BitUnpacker {
-    num_bits: u32,
+    num_bits: usize,
    mask: u64,
 }

@@ -83,7 +83,7 @@ impl BitUnpacker {
            (1u64 << num_bits) - 1u64
        };
        BitUnpacker {
-            num_bits: u32::from(num_bits),
+            num_bits: usize::from(num_bits),
            mask,
        }
    }
@@ -94,14 +94,14 @@ impl BitUnpacker {

    #[inline]
    pub fn get(&self, idx: u32, data: &[u8]) -> u64 {
-        let addr_in_bits = idx * self.num_bits;
-        let addr = (addr_in_bits >> 3) as usize;
+        let addr_in_bits = idx as usize * self.num_bits;
+        let addr = addr_in_bits >> 3;
        if addr + 8 > data.len() {
            if self.num_bits == 0 {
                return 0;
            }
            let bit_shift = addr_in_bits & 7;
-            return self.get_slow_path(addr, bit_shift, data);
+            return self.get_slow_path(addr, bit_shift as u32, data);
        }
        let bit_shift = addr_in_bits & 7;
        let bytes: [u8; 8] = (&data[addr..addr + 8]).try_into().unwrap();
@@ -134,12 +134,13 @@ impl BitUnpacker {
            "Bitwidth must be <= 32 to use this method."
        );

-        let end_idx = start_idx + output.len() as u32;
+        let end_idx: u32 = start_idx + output.len() as u32;

-        let end_bit_read = end_idx * self.num_bits;
+        // We use `usize` here to avoid overflow issues.
+        let end_bit_read = (end_idx as usize) * self.num_bits;
        let end_byte_read = (end_bit_read + 7) / 8;
        assert!(
-            end_byte_read as usize <= data.len(),
+            end_byte_read <= data.len(),
            "Requested index is out of bounds."
        );

@@ -159,24 +160,24 @@ impl BitUnpacker {
        // We want the start of the fast track to start align with bytes.
        // A sufficient condition is to start with an idx that is a multiple of 8,
        // so highway start is the closest multiple of 8 that is >= start_idx.
-        let entrance_ramp_len = 8 - (start_idx % 8) % 8;
+        let entrance_ramp_len: u32 = 8 - (start_idx % 8) % 8;

        let highway_start: u32 = start_idx + entrance_ramp_len;

-        if highway_start + BitPacker1x::BLOCK_LEN as u32 > end_idx {
+        if highway_start + (BitPacker1x::BLOCK_LEN as u32) > end_idx {
            // We don't have enough values to have even a single block of highway.
            // Let's just supply the values the simple way.
            get_batch_ramp(start_idx, output);
            return;
        }

-        let num_blocks: u32 = (end_idx - highway_start) / BitPacker1x::BLOCK_LEN as u32;
+        let num_blocks: usize = (end_idx - highway_start) as usize / BitPacker1x::BLOCK_LEN;

        // Entrance ramp
        get_batch_ramp(start_idx, &mut output[..entrance_ramp_len as usize]);

        // Highway
-        let mut offset = (highway_start * self.num_bits) as usize / 8;
+        let mut offset = (highway_start as usize * self.num_bits) / 8;
        let mut output_cursor = (highway_start - start_idx) as usize;
        for _ in 0..num_blocks {
            offset += BitPacker1x.decompress(
@@ -188,7 +189,7 @@ impl BitUnpacker {
        }

        // Exit ramp
-        let highway_end = highway_start + num_blocks * BitPacker1x::BLOCK_LEN as u32;
+        let highway_end: u32 = highway_start + (num_blocks * BitPacker1x::BLOCK_LEN) as u32;
        get_batch_ramp(highway_end, &mut output[output_cursor..]);
    }
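
The `u32` to `usize` switch in `bitpacker/src/bitpacker.rs` above moves the `idx * num_bits` multiplication out of 32-bit arithmetic. The following is a minimal standalone sketch of why that matters, not code from the crate: the function names `addr_in_bits_u32` and `addr_in_bits_usize` and the sample values are hypothetical, and a 64-bit target is assumed so that `usize` is wider than `u32`.

```rust
/// Bit address computed in `u32`, as before the change.
/// `checked_mul` returns `None` where a plain `*` would overflow
/// (a panic in debug builds, a silent wrap-around in release builds).
fn addr_in_bits_u32(idx: u32, num_bits: u32) -> Option<u32> {
    idx.checked_mul(num_bits)
}

/// Bit address computed after widening the index to `usize`.
/// Assumption: 64-bit target, so the product below fits.
fn addr_in_bits_usize(idx: u32, num_bits: usize) -> usize {
    idx as usize * num_bits
}

fn main() {
    // Hypothetical column: value number 200 million, 32 bits per value.
    let idx: u32 = 200_000_000;
    let num_bits: u32 = 32;

    // 200_000_000 * 32 = 6_400_000_000 bits, which exceeds u32::MAX.
    assert_eq!(addr_in_bits_u32(idx, num_bits), None);

    // The widened multiplication yields the full bit address.
    let bits = addr_in_bits_usize(idx, num_bits as usize);
    assert_eq!(bits as u64, 6_400_000_000u64);

    // Same shift as in `BitUnpacker::get`: bits to bytes.
    println!("byte address: {}", bits >> 3);
}
```

The index itself stays a `u32` in the public signature; only the intermediate bit address is widened, which is the same shape as the `idx as usize * self.num_bits` line in the diff.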

--- a/bitpacker/src/blocked_bitpacker.rs
+++ b/bitpacker/src/blocked_bitpacker.rs
@@ -34,7 +34,7 @@ struct BlockedBitpackerEntryMetaData {

 impl BlockedBitpackerEntryMetaData {
    fn new(offset: u64, num_bits: u8, base_value: u64) -> Self {
-        let encoded = offset | (num_bits as u64) << (64 - 8);
+        let encoded = offset | (u64::from(num_bits) << (64 - 8));
        Self {
            encoded,
            base_value,
--- a/bitpacker/src/filter_vec/mod.rs
+++ b/bitpacker/src/filter_vec/mod.rs
@@ -35,8 +35,8 @@ const IMPLS: [FilterImplPerInstructionSet; 2] = [
 const IMPLS: [FilterImplPerInstructionSet; 1] = [FilterImplPerInstructionSet::Scalar];

 impl FilterImplPerInstructionSet {
-    #[allow(unused_variables)]
    #[inline]
+    #[allow(unused_variables)] // on non-x86_64, code is unused.
    fn from(code: u8) -> FilterImplPerInstructionSet {
        #[cfg(target_arch = "x86_64")]
        if code == FilterImplPerInstructionSet::AVX2 as u8 {
--- a/cliff.toml
+++ b/cliff.toml
@@ -16,14 +16,14 @@ body = """

 {%- if version %} in {{ version }}{%- endif -%}
 {% for commit in commits %}
-  {% if commit.github.pr_title -%}
-    {%- set commit_message = commit.github.pr_title -%}
+  {% if commit.remote.pr_title -%}
+    {%- set commit_message = commit.remote.pr_title -%}
  {%- else -%}
    {%- set commit_message = commit.message -%}
  {%- endif -%}
  - {{ commit_message | split(pat="\n") | first | trim }}\
-    {% if commit.github.pr_number %} \
-      [#{{ commit.github.pr_number }}]({{ self::remote_url() }}/pull/{{ commit.github.pr_number }}){% if commit.github.username %}(@{{ commit.github.username }}){%- endif -%} \
+    {% if commit.remote.pr_number %} \
+      [#{{ commit.remote.pr_number }}]({{ self::remote_url() }}/pull/{{ commit.remote.pr_number }}){% if commit.remote.username %}(@{{ commit.remote.username }}){%- endif -%} \
    {%- endif %}
 {%- endfor -%}

--- a/columnar/Cargo.toml
+++ b/columnar/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "tantivy-columnar"
-version = "0.3.0"
+version = "0.5.0"
 edition = "2021"
 license = "MIT"
 homepage = "https://github.com/quickwit-oss/tantivy"
@@ -9,21 +9,21 @@ description = "column oriented storage for tantivy"
 categories = ["database-implementations", "data-structures", "compression"]

 [dependencies]
-itertools = "0.13.0"
+itertools = "0.14.0"
 fastdivide = "0.4.0"

-stacker = { version= "0.3", path = "../stacker", package="tantivy-stacker"}
-sstable = { version= "0.3", path = "../sstable", package = "tantivy-sstable" }
-common = { version= "0.7", path = "../common", package = "tantivy-common" }
-tantivy-bitpacker = { version= "0.6", path = "../bitpacker/" }
+stacker = { version= "0.5", path = "../stacker", package="tantivy-stacker"}
+sstable = { version= "0.5", path = "../sstable", package = "tantivy-sstable" }
+common = { version= "0.9", path = "../common", package = "tantivy-common" }
+tantivy-bitpacker = { version= "0.8", path = "../bitpacker/" }
 serde = "1.0.152"
-downcast-rs = "1.2.0"
+downcast-rs = "2.0.1"

 [dev-dependencies]
 proptest = "1"
 more-asserts = "0.3.1"
 rand = "0.8"
-binggan = "0.12.0"
+binggan = "0.14.0"

 [[bench]]
 name = "bench_merge"
--- a/columnar/benches/bench_access.rs
+++ b/columnar/benches/bench_access.rs
@@ -42,7 +42,6 @@ fn bench_group(mut runner: InputGroup<Column>) {
            }
        }
        black_box(sum);
-        None
    });
    runner.register("access_first_vals", |column| {
        let mut sum = 0;
@@ -63,7 +62,6 @@ fn bench_group(mut runner: InputGroup<Column>) {
        }

        black_box(sum);
-        None
    });
    runner.run();
 }
--- a/columnar/benches/bench_merge.rs
+++ b/columnar/benches/bench_merge.rs
@@ -1,6 +1,6 @@
 pub mod common;

-use binggan::{black_box, BenchRunner};
+use binggan::BenchRunner;
 use common::{generate_columnar_with_name, Card};
 use tantivy_columnar::*;

@@ -29,7 +29,7 @@ fn main() {
    add_combo(Card::Multi, Card::Dense);
    add_combo(Card::Multi, Card::Sparse);

-    let runner: BenchRunner = BenchRunner::new();
+    let mut runner: BenchRunner = BenchRunner::new();
    let mut group = runner.new_group();
    for (input_name, columnar_readers) in inputs.iter() {
        group.register_with_input(
--- a/columnar/columnar-cli-inspect/Cargo.toml
+++ b/columnar/columnar-cli-inspect/Cargo.toml
@@ -0,0 +1,18 @@
+[package]
+name = "tantivy-columnar-inspect"
+version = "0.1.0"
+edition = "2021"
+license = "MIT"
+
+[dependencies]
+tantivy = {path="../..", package="tantivy"}
+columnar = {path="../", package="tantivy-columnar"}
+common = {path="../../common", package="tantivy-common"}
+
+[workspace]
+members = []
+
+[profile.release]
+debug = true
+#debug-assertions = true
+#overflow-checks = true
--- a/columnar/columnar-cli-inspect/src/main.rs
+++ b/columnar/columnar-cli-inspect/src/main.rs
@@ -0,0 +1,54 @@
+use columnar::ColumnarReader;
+use common::file_slice::{FileSlice, WrapFile};
+use std::io;
+use std::path::Path;
+use tantivy::directory::footer::Footer;
+
+fn main() -> io::Result<()> {
+    println!("Opens a columnar file written by tantivy and validates it.");
+    let path = std::env::args().nth(1).unwrap();
+
+    let path = Path::new(&path);
+    println!("Reading {:?}", path);
+    let _reader = open_and_validate_columnar(path.to_str().unwrap())?;
+
+    Ok(())
+}
+
+pub fn validate_columnar_reader(reader: &ColumnarReader) {
+    let num_rows = reader.num_rows();
+    println!("num_rows: {}", num_rows);
+    let columns = reader.list_columns().unwrap();
+    println!("num columns: {:?}", columns.len());
+    for (col_name, dynamic_column_handle) in columns {
+        let col = dynamic_column_handle.open().unwrap();
+        match col {
+            columnar::DynamicColumn::Bool(_)
+            | columnar::DynamicColumn::I64(_)
+            | columnar::DynamicColumn::U64(_)
+            | columnar::DynamicColumn::F64(_)
+            | columnar::DynamicColumn::IpAddr(_)
+            | columnar::DynamicColumn::DateTime(_)
+            | columnar::DynamicColumn::Bytes(_) => {}
+            columnar::DynamicColumn::Str(str_column) => {
+                let num_vals = str_column.ords().values.num_vals();
+                let num_terms_dict = str_column.num_terms() as u64;
+                let max_ord = str_column.ords().values.iter().max().unwrap_or_default();
+                println!("{col_name:35}  num_vals {num_vals:10} \t num_terms_dict {num_terms_dict:8} max_ord: {max_ord:8}",);
+                for ord in str_column.ords().values.iter() {
+                    assert!(ord < num_terms_dict);
+                }
+            }
+        }
+    }
+}
+
+/// Opens a columnar file that was written by tantivy and validates it.
+pub fn open_and_validate_columnar(path: &str) -> io::Result<ColumnarReader> {
+    let wrap_file = WrapFile::new(std::fs::File::open(path)?)?;
+    let slice = FileSlice::new(std::sync::Arc::new(wrap_file));
+    let (_footer, slice) = Footer::extract_footer(slice.clone()).unwrap();
+    let reader = ColumnarReader::open(slice).unwrap();
+    validate_columnar_reader(&reader);
+    Ok(reader)
+}
--- a/columnar/src/block_accessor.rs
+++ b/columnar/src/block_accessor.rs
@@ -66,7 +66,7 @@ impl<T: PartialOrd + Copy + std::fmt::Debug + Send + Sync + 'static + Default>
        &'a self,
        docs: &'a [u32],
        accessor: &Column<T>,
-    ) -> impl Iterator<Item = (DocId, T)> + '_ {
+    ) -> impl Iterator<Item = (DocId, T)> + 'a {
        if accessor.index.get_cardinality().is_full() {
            docs.iter().cloned().zip(self.val_cache.iter().cloned())
        } else {
@@ -139,7 +139,7 @@ mod tests {
            missing_docs.push(missing_doc);
        });

-        assert_eq!(missing_docs, vec![]);
+        assert_eq!(missing_docs, Vec::<u32>::new());
    }

    #[test]
--- a/columnar/src/column_index/merge/shuffled.rs
+++ b/columnar/src/column_index/merge/shuffled.rs
@@ -58,7 +58,7 @@ struct ShuffledIndex<'a> {
    merge_order: &'a ShuffleMergeOrder,
 }

-impl<'a> Iterable<u32> for ShuffledIndex<'a> {
+impl Iterable<u32> for ShuffledIndex<'_> {
    fn boxed_iter(&self) -> Box<dyn Iterator<Item = u32> + '_> {
        Box::new(
            self.merge_order
@@ -127,7 +127,7 @@ fn integrate_num_vals(num_vals: impl Iterator<Item = u32>) -> impl Iterator<Item
    )
 }

-impl<'a> Iterable<u32> for ShuffledMultivaluedIndex<'a> {
+impl Iterable<u32> for ShuffledMultivaluedIndex<'_> {
    fn boxed_iter(&self) -> Box<dyn Iterator<Item = u32> + '_> {
        let num_vals_per_row = iter_num_values(self.column_indexes, self.merge_order);
        Box::new(integrate_num_vals(num_vals_per_row))
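
Several hunks in this commit replace `impl<'a> Trait for Type<'a>` with `impl Trait for Type<'_>`. The sketch below illustrates that the two spellings are interchangeable when the lifetime is never referred to by name inside the impl. It uses a simplified stand-in trait and a hypothetical `Offsets` type, not the crate's own `Iterable` or its index types.

```rust
/// Simplified stand-in for the `Iterable` trait used in the diff.
trait Iterable<T> {
    fn boxed_iter(&self) -> Box<dyn Iterator<Item = T> + '_>;
}

/// Hypothetical wrapper borrowing a slice, standing in for types
/// such as `ShuffledIndex<'a>`.
struct Offsets<'a> {
    values: &'a [u32],
}

// Previously this would have been written as:
//     impl<'a> Iterable<u32> for Offsets<'a> { ... }
// The named lifetime is not used in the body, so `'_` is enough.
impl Iterable<u32> for Offsets<'_> {
    fn boxed_iter(&self) -> Box<dyn Iterator<Item = u32> + '_> {
        Box::new(self.values.iter().copied())
    }
}

/// Collects the values through the trait method.
fn collect_offsets(values: &[u32]) -> Vec<u32> {
    let offsets = Offsets { values };
    let collected: Vec<u32> = offsets.boxed_iter().collect();
    collected
}

fn main() {
    assert_eq!(collect_offsets(&[0, 2, 5]), vec![0, 2, 5]);
}
```

Impls that do need the name, for example to tie an associated type to the borrow, keep the explicit `impl<'a>` form, which matches the hunks where `impl<'a> Set<u16> for DenseBlock<'a>` is left unchanged.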
--- a/columnar/src/column_index/merge/stacked.rs
+++ b/columnar/src/column_index/merge/stacked.rs
@@ -56,7 +56,7 @@ fn get_doc_ids_with_values<'a>(
        ColumnIndex::Full => Box::new(doc_range),
        ColumnIndex::Optional(optional_index) => Box::new(
            optional_index
-                .iter_rows()
+                .iter_docs()
                .map(move |row| row + doc_range.start),
        ),
        ColumnIndex::Multivalued(multivalued_index) => match multivalued_index {
@@ -73,7 +73,7 @@ fn get_doc_ids_with_values<'a>(
            MultiValueIndex::MultiValueIndexV2(multivalued_index) => Box::new(
                multivalued_index
                    .optional_index
-                    .iter_rows()
+                    .iter_docs()
                    .map(move |row| row + doc_range.start),
            ),
        },
@@ -123,7 +123,7 @@ fn get_num_values_iterator<'a>(
    }
 }

-impl<'a> Iterable<u32> for StackedStartOffsets<'a> {
+impl Iterable<u32> for StackedStartOffsets<'_> {
    fn boxed_iter(&self) -> Box<dyn Iterator<Item = u32> + '_> {
        let num_values_it = (0..self.column_indexes.len()).flat_map(|columnar_id| {
            let num_docs = self.stack_merge_order.columnar_range(columnar_id).len() as u32;
@@ -177,7 +177,7 @@ impl<'a> Iterable<RowId> for StackedOptionalIndex<'a> {
                        ColumnIndex::Full => Box::new(columnar_row_range),
                        ColumnIndex::Optional(optional_index) => Box::new(
                            optional_index
-                                .iter_rows()
+                                .iter_docs()
                                .map(move |row_id: RowId| columnar_row_range.start + row_id),
                        ),
                        ColumnIndex::Multivalued(_) => {
--- a/columnar/src/column_index/optional_index/mod.rs
+++ b/columnar/src/column_index/optional_index/mod.rs
@@ -80,23 +80,23 @@ impl BlockVariant {
 /// index is the block index. For each block `byte_start` and `offset` is computed.
 #[derive(Clone)]
 pub struct OptionalIndex {
-    num_rows: RowId,
-    num_non_null_rows: RowId,
+    num_docs: RowId,
+    num_non_null_docs: RowId,
    block_data: OwnedBytes,
    block_metas: Arc<[BlockMeta]>,
 }

-impl<'a> Iterable<u32> for &'a OptionalIndex {
+impl Iterable<u32> for &OptionalIndex {
    fn boxed_iter(&self) -> Box<dyn Iterator<Item = u32> + '_> {
-        Box::new(self.iter_rows())
+        Box::new(self.iter_docs())
    }
 }

 impl std::fmt::Debug for OptionalIndex {
    fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
        f.debug_struct("OptionalIndex")
-            .field("num_rows", &self.num_rows)
-            .field("num_non_null_rows", &self.num_non_null_rows)
+            .field("num_docs", &self.num_docs)
+            .field("num_non_null_docs", &self.num_non_null_docs)
            .finish_non_exhaustive()
    }
 }
@@ -123,7 +123,7 @@ enum BlockSelectCursor<'a> {
    Sparse(<SparseBlock<'a> as Set<u16>>::SelectCursor<'a>),
 }

-impl<'a> BlockSelectCursor<'a> {
+impl BlockSelectCursor<'_> {
    fn select(&mut self, rank: u16) -> u16 {
        match self {
            BlockSelectCursor::Dense(dense_select_cursor) => dense_select_cursor.select(rank),
@@ -141,7 +141,7 @@ pub struct OptionalIndexSelectCursor<'a> {
    num_null_rows_before_block: RowId,
 }

-impl<'a> OptionalIndexSelectCursor<'a> {
+impl OptionalIndexSelectCursor<'_> {
    fn search_and_load_block(&mut self, rank: RowId) {
        if rank < self.current_block_end_rank {
            // we are already in the right block
@@ -165,7 +165,7 @@ impl<'a> OptionalIndexSelectCursor<'a> {
    }
 }

-impl<'a> SelectCursor<RowId> for OptionalIndexSelectCursor<'a> {
+impl SelectCursor<RowId> for OptionalIndexSelectCursor<'_> {
    fn select(&mut self, rank: RowId) -> RowId {
        self.search_and_load_block(rank);
        let index_in_block = (rank - self.num_null_rows_before_block) as u16;
@@ -271,17 +271,17 @@ impl OptionalIndex {
    }

    pub fn num_docs(&self) -> RowId {
-        self.num_rows
+        self.num_docs
    }

    pub fn num_non_nulls(&self) -> RowId {
-        self.num_non_null_rows
+        self.num_non_null_docs
    }

-    pub fn iter_rows(&self) -> impl Iterator<Item = RowId> + '_ {
+    pub fn iter_docs(&self) -> impl Iterator<Item = RowId> + '_ {
        // TODO optimize
        let mut select_batch = self.select_cursor();
-        (0..self.num_non_null_rows).map(move |rank| select_batch.select(rank))
+        (0..self.num_non_null_docs).map(move |rank| select_batch.select(rank))
    }
    pub fn select_batch(&self, ranks: &mut [RowId]) {
        let mut select_cursor = self.select_cursor();
@@ -505,7 +505,7 @@ fn deserialize_optional_index_block_metadatas(
        non_null_rows_before_block += num_non_null_rows;
    }
    block_metas.resize(
-        ((num_rows + ELEMENTS_PER_BLOCK - 1) / ELEMENTS_PER_BLOCK) as usize,
+        num_rows.div_ceil(ELEMENTS_PER_BLOCK) as usize,
        BlockMeta {
            non_null_rows_before_block,
            start_byte_offset,
@@ -519,15 +519,15 @@ pub fn open_optional_index(bytes: OwnedBytes) -> io::Result<OptionalIndex> {
    let (mut bytes, num_non_empty_blocks_bytes) = bytes.rsplit(2);
    let num_non_empty_block_bytes =
        u16::from_le_bytes(num_non_empty_blocks_bytes.as_slice().try_into().unwrap());
-    let num_rows = VInt::deserialize_u64(&mut bytes)? as u32;
+    let num_docs = VInt::deserialize_u64(&mut bytes)? as u32;
    let block_metas_num_bytes =
        num_non_empty_block_bytes as usize * SERIALIZED_BLOCK_META_NUM_BYTES;
    let (block_data, block_metas) = bytes.rsplit(block_metas_num_bytes);
-    let (block_metas, num_non_null_rows) =
-        deserialize_optional_index_block_metadatas(block_metas.as_slice(), num_rows);
+    let (block_metas, num_non_null_docs) =
+        deserialize_optional_index_block_metadatas(block_metas.as_slice(), num_docs);
    let optional_index = OptionalIndex {
-        num_rows,
-        num_non_null_rows,
+        num_docs,
+        num_non_null_docs,
        block_data,
        block_metas: block_metas.into(),
    };
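
The `div_ceil` change in `deserialize_optional_index_block_metadatas` above replaces the manual round-up idiom with the standard-library method. A small sketch of the equivalence follows. The constant value `1 << 16` is an assumption made for illustration, and the helper names `num_blocks_manual` and `num_blocks_div_ceil` are hypothetical.

```rust
/// Assumed block size for illustration only.
const ELEMENTS_PER_BLOCK: u32 = 1 << 16;

/// Manual round-up division, as on the removed line.
/// The intermediate sum overflows when `num_rows` is close to `u32::MAX`.
fn num_blocks_manual(num_rows: u32) -> u32 {
    (num_rows + ELEMENTS_PER_BLOCK - 1) / ELEMENTS_PER_BLOCK
}

/// Round-up division with `u32::div_ceil`, as on the added line.
fn num_blocks_div_ceil(num_rows: u32) -> u32 {
    num_rows.div_ceil(ELEMENTS_PER_BLOCK)
}

fn main() {
    // Both forms agree wherever the manual sum does not overflow.
    for num_rows in [0u32, 1, 65_535, 65_536, 65_537, 1_000_000] {
        assert_eq!(num_blocks_manual(num_rows), num_blocks_div_ceil(num_rows));
    }

    // `div_ceil` has no intermediate addition, so the largest input is fine.
    assert_eq!(num_blocks_div_ceil(u32::MAX), 65_536);
}
```

`u32::div_ceil` is available on the stable toolchain named by the new `rust-version = "1.75"` in `Cargo.toml`.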
--- a/columnar/src/column_index/optional_index/set_block/dense.rs
+++ b/columnar/src/column_index/optional_index/set_block/dense.rs
@@ -23,7 +23,6 @@ fn set_bit_at(input: &mut u64, n: u16) {
 ///
 /// When translating a dense index to the original index, we can use the offset to find the correct
 /// block. Direct computation is not possible, but we can employ a linear or binary search.
-
 const ELEMENTS_PER_MINI_BLOCK: u16 = 64;
 const MINI_BLOCK_BITVEC_NUM_BYTES: usize = 8;
 const MINI_BLOCK_OFFSET_NUM_BYTES: usize = 2;
@@ -109,7 +108,7 @@ pub struct DenseBlockSelectCursor<'a> {
    dense_block: DenseBlock<'a>,
 }

-impl<'a> SelectCursor<u16> for DenseBlockSelectCursor<'a> {
+impl SelectCursor<u16> for DenseBlockSelectCursor<'_> {
    #[inline]
    fn select(&mut self, rank: u16) -> u16 {
        self.block_id = self
@@ -175,7 +174,7 @@ impl<'a> Set<u16> for DenseBlock<'a> {
    }
 }

-impl<'a> DenseBlock<'a> {
+impl DenseBlock<'_> {
    #[inline]
    fn mini_block(&self, mini_block_id: u16) -> DenseMiniBlock {
        let data_start_pos = mini_block_id as usize * MINI_BLOCK_NUM_BYTES;
--- a/columnar/src/column_index/optional_index/set_block/sparse.rs
+++ b/columnar/src/column_index/optional_index/set_block/sparse.rs
@@ -31,7 +31,7 @@ impl<'a> SelectCursor<u16> for SparseBlock<'a> {
    }
 }

-impl<'a> Set<u16> for SparseBlock<'a> {
+impl Set<u16> for SparseBlock<'_> {
    type SelectCursor<'b>
        = Self
    where Self: 'b;
@@ -69,7 +69,7 @@ fn get_u16(data: &[u8], byte_position: usize) -> u16 {
    u16::from_le_bytes(bytes)
 }

-impl<'a> SparseBlock<'a> {
+impl SparseBlock<'_> {
    #[inline(always)]
    fn value_at_idx(&self, data: &[u8], idx: u16) -> u16 {
        let start_offset: usize = idx as usize * 2;
@@ -82,7 +82,7 @@ impl<'a> SparseBlock<'a> {
    }

    #[inline]
-    #[allow(clippy::comparison_chain)]
+    #[expect(clippy::comparison_chain)]
    // Looks for the element in the block. Returns the positions if found.
    fn binary_search(&self, target: u16) -> Result<u16, u16> {
        let data = &self.0;
--- a/columnar/src/column_index/optional_index/tests.rs
+++ b/columnar/src/column_index/optional_index/tests.rs
@@ -164,7 +164,7 @@ fn test_optional_index_large() {
 fn test_optional_index_iter_aux(row_ids: &[RowId], num_rows: RowId) {
    let optional_index = OptionalIndex::for_test(num_rows, row_ids);
    assert_eq!(optional_index.num_docs(), num_rows);
-    assert!(optional_index.iter_rows().eq(row_ids.iter().copied()));
+    assert!(optional_index.iter_docs().eq(row_ids.iter().copied()));
 }

 #[test]
--- a/columnar/src/column_index/serialize.rs
+++ b/columnar/src/column_index/serialize.rs
@@ -31,7 +31,7 @@ pub enum SerializableColumnIndex<'a> {
    Multivalued(SerializableMultivalueIndex<'a>),
 }

-impl<'a> SerializableColumnIndex<'a> {
+impl SerializableColumnIndex<'_> {
    pub fn get_cardinality(&self) -> Cardinality {
        match self {
            SerializableColumnIndex::Full => Cardinality::Full,
--- a/columnar/src/column_values/merge.rs
+++ b/columnar/src/column_values/merge.rs
@@ -10,7 +10,7 @@ pub(crate) struct MergedColumnValues<'a, T> {
    pub(crate) merge_row_order: &'a MergeRowOrder,
 }

-impl<'a, T: Copy + PartialOrd + Debug + 'static> Iterable<T> for MergedColumnValues<'a, T> {
+impl<T: Copy + PartialOrd + Debug + 'static> Iterable<T> for MergedColumnValues<'_, T> {
    fn boxed_iter(&self) -> Box<dyn Iterator<Item = T> + '_> {
        match self.merge_row_order {
            MergeRowOrder::Stack(_) => Box::new(
--- a/columnar/src/column_values/u128_based/mod.rs
+++ b/columnar/src/column_values/u128_based/mod.rs
@@ -128,7 +128,7 @@ pub fn open_u128_as_compact_u64(mut bytes: OwnedBytes) -> io::Result<Arc<dyn Col
 }

 #[cfg(test)]
-pub mod tests {
+pub(crate) mod tests {
    use super::*;
    use crate::column_values::u64_based::{
        serialize_and_load_u64_based_column_values, serialize_u64_based_column_values,
--- a/columnar/src/column_values/u64_based/blockwise_linear.rs
+++ b/columnar/src/column_values/u64_based/blockwise_linear.rs
@@ -39,7 +39,7 @@ impl BinarySerializable for Block {
 }

 fn compute_num_blocks(num_vals: u32) -> u32 {
-    (num_vals + BLOCK_SIZE - 1) / BLOCK_SIZE
+    num_vals.div_ceil(BLOCK_SIZE)
 }

 pub struct BlockwiseLinearEstimator {
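
Note on the `div_ceil` replacements in this diff: the sketch below compares the manual ceiling-division idiom with `u32::div_ceil` from the standard library. The function names and the block size of 512 are hypothetical and chosen only for illustration; they are not taken from the crate. The two forms agree whenever `n + block - 1` does not overflow, which appears to be the case for the values used here.

```rust
/// Manual ceiling division, in the style of the removed lines.
/// Caveat: `n + block - 1` overflows when `n` is close to `u32::MAX`.
fn ceil_div_manual(n: u32, block: u32) -> u32 {
    (n + block - 1) / block
}

/// Ceiling division through the standard library.
/// No intermediate addition, so it cannot overflow for a non-zero `block`.
fn ceil_div_std(n: u32, block: u32) -> u32 {
    n.div_ceil(block)
}

/// Hypothetical helper: true when both forms give the same block count.
fn forms_agree(n: u32, block: u32) -> bool {
    ceil_div_manual(n, block) == ceil_div_std(n, block)
}
```

Besides being shorter, `div_ceil` stays correct at the top of the `u32` range, where the manual form would overflow.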
--- a/columnar/src/columnar/merge/merge_dict_column.rs
+++ b/columnar/src/columnar/merge/merge_dict_column.rs
@@ -3,7 +3,7 @@ use std::io::{self, Write};
 use common::{BitSet, CountingWriter, ReadOnlyBitSet};
 use sstable::{SSTable, Streamer, TermOrdinal, VoidSSTable};

-use super::term_merger::TermMerger;
+use super::term_merger::{TermMerger, TermsWithSegmentOrd};
 use crate::column::serialize_column_mappable_to_u64;
 use crate::column_index::SerializableColumnIndex;
 use crate::iterable::Iterable;
@@ -39,7 +39,7 @@ struct RemappedTermOrdinalsValues<'a> {
    merge_row_order: &'a MergeRowOrder,
 }

-impl<'a> Iterable for RemappedTermOrdinalsValues<'a> {
+impl Iterable for RemappedTermOrdinalsValues<'_> {
    fn boxed_iter(&self) -> Box<dyn Iterator<Item = u64> + '_> {
        match self.merge_row_order {
            MergeRowOrder::Stack(_) => self.boxed_iter_stacked(),
@@ -50,7 +50,7 @@ impl<'a> Iterable for RemappedTermOrdinalsValues<'a> {
    }
 }

-impl<'a> RemappedTermOrdinalsValues<'a> {
+impl RemappedTermOrdinalsValues<'_> {
    fn boxed_iter_stacked(&self) -> Box<dyn Iterator<Item = u64> + '_> {
        let iter = self
            .bytes_columns
@@ -126,14 +126,17 @@ fn serialize_merged_dict(
    let mut term_ord_mapping = TermOrdinalMapping::default();

    let mut field_term_streams = Vec::new();
-    for column_opt in bytes_columns.iter() {
+    for (segment_ord, column_opt) in bytes_columns.iter().enumerate() {
        if let Some(column) = column_opt {
            term_ord_mapping.add_segment(column.dictionary.num_terms());
            let terms: Streamer<VoidSSTable> = column.dictionary.stream()?;
-            field_term_streams.push(terms);
+            field_term_streams.push(TermsWithSegmentOrd { terms, segment_ord });
        } else {
            term_ord_mapping.add_segment(0);
-            field_term_streams.push(Streamer::empty());
+            field_term_streams.push(TermsWithSegmentOrd {
+                terms: Streamer::empty(),
+                segment_ord,
+            });
        }
    }

@@ -191,6 +194,7 @@ fn serialize_merged_dict(

 #[derive(Default, Debug)]
 struct TermOrdinalMapping {
+    /// Contains the new term ordinals for each segment.
    per_segment_new_term_ordinals: Vec<Vec<TermOrdinal>>,
 }

@@ -205,6 +209,6 @@ impl TermOrdinalMapping {
    }

    fn get_segment(&self, segment_ord: u32) -> &[TermOrdinal] {
-        &(self.per_segment_new_term_ordinals[segment_ord as usize])[..]
+        &self.per_segment_new_term_ordinals[segment_ord as usize]
    }
 }
--- a/columnar/src/columnar/merge/merge_mapping.rs
+++ b/columnar/src/columnar/merge/merge_mapping.rs
@@ -26,7 +26,7 @@ impl StackMergeOrder {
        let mut cumulated_row_ids: Vec<RowId> = Vec::with_capacity(columnars.len());
        let mut cumulated_row_id = 0;
        for columnar in columnars {
-            cumulated_row_id += columnar.num_rows();
+            cumulated_row_id += columnar.num_docs();
            cumulated_row_ids.push(cumulated_row_id);
        }
        StackMergeOrder { cumulated_row_ids }
--- a/columnar/src/columnar/merge/mod.rs
+++ b/columnar/src/columnar/merge/mod.rs
@@ -80,13 +80,12 @@ pub fn merge_columnar(
    output: &mut impl io::Write,
 ) -> io::Result<()> {
    let mut serializer = ColumnarSerializer::new(output);
-    let num_rows_per_columnar = columnar_readers
+    let num_docs_per_columnar = columnar_readers
        .iter()
-        .map(|reader| reader.num_rows())
+        .map(|reader| reader.num_docs())
        .collect::<Vec<u32>>();

-    let columns_to_merge =
-        group_columns_for_merge(columnar_readers, required_columns, &merge_row_order)?;
+    let columns_to_merge = group_columns_for_merge(columnar_readers, required_columns)?;
    for res in columns_to_merge {
        let ((column_name, _column_type_category), grouped_columns) = res;
        let grouped_columns = grouped_columns.open(&merge_row_order)?;
@@ -94,15 +93,18 @@ pub fn merge_columnar(
            continue;
        }

-        let column_type = grouped_columns.column_type_after_merge();
+        let column_type_after_merge = grouped_columns.column_type_after_merge();
        let mut columns = grouped_columns.columns;
-        coerce_columns(column_type, &mut columns)?;
+        // Make sure the number of columns matches the number of columnar readers,
+        // otherwise num_docs_per_columnar would be incorrect.
+        assert_eq!(columns.len(), columnar_readers.len());
+        coerce_columns(column_type_after_merge, &mut columns)?;

        let mut column_serializer =
-            serializer.start_serialize_column(column_name.as_bytes(), column_type);
+            serializer.start_serialize_column(column_name.as_bytes(), column_type_after_merge);
        merge_column(
-            column_type,
-            &num_rows_per_columnar,
+            column_type_after_merge,
+            &num_docs_per_columnar,
            columns,
            &merge_row_order,
            &mut column_serializer,
@@ -128,7 +130,7 @@ fn dynamic_column_to_u64_monotonic(dynamic_column: DynamicColumn) -> Option<Colu
 fn merge_column(
    column_type: ColumnType,
    num_docs_per_column: &[u32],
-    columns: Vec<Option<DynamicColumn>>,
+    columns_to_merge: Vec<Option<DynamicColumn>>,
    merge_row_order: &MergeRowOrder,
    wrt: &mut impl io::Write,
 ) -> io::Result<()> {
@@ -138,10 +140,10 @@ fn merge_column(
        | ColumnType::F64
        | ColumnType::DateTime
        | ColumnType::Bool => {
-            let mut column_indexes: Vec<ColumnIndex> = Vec::with_capacity(columns.len());
+            let mut column_indexes: Vec<ColumnIndex> = Vec::with_capacity(columns_to_merge.len());
            let mut column_values: Vec<Option<Arc<dyn ColumnValues>>> =
-                Vec::with_capacity(columns.len());
-            for (i, dynamic_column_opt) in columns.into_iter().enumerate() {
+                Vec::with_capacity(columns_to_merge.len());
+            for (i, dynamic_column_opt) in columns_to_merge.into_iter().enumerate() {
                if let Some(Column { index: idx, values }) =
                    dynamic_column_opt.and_then(dynamic_column_to_u64_monotonic)
                {
@@ -164,10 +166,10 @@ fn merge_column(
            serialize_column_mappable_to_u64(merged_column_index, &merge_column_values, wrt)?;
        }
        ColumnType::IpAddr => {
-            let mut column_indexes: Vec<ColumnIndex> = Vec::with_capacity(columns.len());
+            let mut column_indexes: Vec<ColumnIndex> = Vec::with_capacity(columns_to_merge.len());
            let mut column_values: Vec<Option<Arc<dyn ColumnValues<Ipv6Addr>>>> =
-                Vec::with_capacity(columns.len());
-            for (i, dynamic_column_opt) in columns.into_iter().enumerate() {
+                Vec::with_capacity(columns_to_merge.len());
+            for (i, dynamic_column_opt) in columns_to_merge.into_iter().enumerate() {
                if let Some(DynamicColumn::IpAddr(Column { index: idx, values })) =
                    dynamic_column_opt
                {
@@ -192,9 +194,10 @@ fn merge_column(
            serialize_column_mappable_to_u128(merged_column_index, &merge_column_values, wrt)?;
        }
        ColumnType::Bytes | ColumnType::Str => {
-            let mut column_indexes: Vec<ColumnIndex> = Vec::with_capacity(columns.len());
-            let mut bytes_columns: Vec<Option<BytesColumn>> = Vec::with_capacity(columns.len());
-            for (i, dynamic_column_opt) in columns.into_iter().enumerate() {
+            let mut column_indexes: Vec<ColumnIndex> = Vec::with_capacity(columns_to_merge.len());
+            let mut bytes_columns: Vec<Option<BytesColumn>> =
+                Vec::with_capacity(columns_to_merge.len());
+            for (i, dynamic_column_opt) in columns_to_merge.into_iter().enumerate() {
                match dynamic_column_opt {
                    Some(DynamicColumn::Str(str_column)) => {
                        column_indexes.push(str_column.term_ord_column.index.clone());
@@ -248,7 +251,7 @@ impl GroupedColumns {
        if column_type.len() == 1 {
            return column_type.into_iter().next().unwrap();
        }
-        // At the moment, only the numerical categorical column type has more than one possible
+        // At the moment, only the numerical column type category has more than one possible
        // column type.
        assert!(self
            .columns
@@ -361,7 +364,7 @@ fn is_empty_after_merge(
                    ColumnIndex::Empty { .. } => true,
                    ColumnIndex::Full => alive_bitset.len() == 0,
                    ColumnIndex::Optional(optional_index) => {
-                        for doc in optional_index.iter_rows() {
+                        for doc in optional_index.iter_docs() {
                            if alive_bitset.contains(doc) {
                                return false;
                            }
@@ -391,7 +394,6 @@ fn is_empty_after_merge(
 fn group_columns_for_merge<'a>(
    columnar_readers: &'a [&'a ColumnarReader],
    required_columns: &'a [(String, ColumnType)],
-    _merge_row_order: &'a MergeRowOrder,
 ) -> io::Result<BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle>> {
    let mut columns: BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle> = BTreeMap::new();

--- a/columnar/src/columnar/merge/term_merger.rs
+++ b/columnar/src/columnar/merge/term_merger.rs
@@ -5,28 +5,29 @@ use sstable::TermOrdinal;

 use crate::Streamer;

-pub struct HeapItem<'a> {
-    pub streamer: Streamer<'a>,
+/// The terms of a column with the ordinal of the segment.
+pub struct TermsWithSegmentOrd<'a> {
+    pub terms: Streamer<'a>,
    pub segment_ord: usize,
 }

-impl<'a> PartialEq for HeapItem<'a> {
+impl PartialEq for TermsWithSegmentOrd<'_> {
    fn eq(&self, other: &Self) -> bool {
        self.segment_ord == other.segment_ord
    }
 }

-impl<'a> Eq for HeapItem<'a> {}
+impl Eq for TermsWithSegmentOrd<'_> {}

-impl<'a> PartialOrd for HeapItem<'a> {
-    fn partial_cmp(&self, other: &HeapItem<'a>) -> Option<Ordering> {
+impl<'a> PartialOrd for TermsWithSegmentOrd<'a> {
+    fn partial_cmp(&self, other: &TermsWithSegmentOrd<'a>) -> Option<Ordering> {
        Some(self.cmp(other))
    }
 }

-impl<'a> Ord for HeapItem<'a> {
-    fn cmp(&self, other: &HeapItem<'a>) -> Ordering {
-        (&other.streamer.key(), &other.segment_ord).cmp(&(&self.streamer.key(), &self.segment_ord))
+impl<'a> Ord for TermsWithSegmentOrd<'a> {
+    fn cmp(&self, other: &TermsWithSegmentOrd<'a>) -> Ordering {
+        (&other.terms.key(), &other.segment_ord).cmp(&(&self.terms.key(), &self.segment_ord))
    }
 }

@@ -37,39 +38,32 @@ impl<'a> Ord for HeapItem<'a> {
 /// - the term
 /// - a slice with the ordinal of the segments containing the terms.
 pub struct TermMerger<'a> {
-    heap: BinaryHeap<HeapItem<'a>>,
-    current_streamers: Vec<HeapItem<'a>>,
+    heap: BinaryHeap<TermsWithSegmentOrd<'a>>,
+    term_streams_with_segment: Vec<TermsWithSegmentOrd<'a>>,
 }

 impl<'a> TermMerger<'a> {
    /// Stream of merged term dictionary
-    pub fn new(streams: Vec<Streamer<'a>>) -> TermMerger<'a> {
+    pub fn new(term_streams_with_segment: Vec<TermsWithSegmentOrd<'a>>) -> TermMerger<'a> {
        TermMerger {
            heap: BinaryHeap::new(),
-            current_streamers: streams
-                .into_iter()
-                .enumerate()
-                .map(|(ord, streamer)| HeapItem {
-                    streamer,
-                    segment_ord: ord,
-                })
-                .collect(),
+            term_streams_with_segment,
        }
    }

    pub(crate) fn matching_segments<'b: 'a>(
        &'b self,
    ) -> impl 'b + Iterator<Item = (usize, TermOrdinal)> {
-        self.current_streamers
+        self.term_streams_with_segment
            .iter()
-            .map(|heap_item| (heap_item.segment_ord, heap_item.streamer.term_ord()))
+            .map(|heap_item| (heap_item.segment_ord, heap_item.terms.term_ord()))
    }

    fn advance_segments(&mut self) {
-        let streamers = &mut self.current_streamers;
+        let streamers = &mut self.term_streams_with_segment;
        let heap = &mut self.heap;
        for mut heap_item in streamers.drain(..) {
-            if heap_item.streamer.advance() {
+            if heap_item.terms.advance() {
                heap.push(heap_item);
            }
        }
@@ -81,13 +75,13 @@ impl<'a> TermMerger<'a> {
    pub fn advance(&mut self) -> bool {
        self.advance_segments();
        if let Some(head) = self.heap.pop() {
-            self.current_streamers.push(head);
+            self.term_streams_with_segment.push(head);
            while let Some(next_streamer) = self.heap.peek() {
-                if self.current_streamers[0].streamer.key() != next_streamer.streamer.key() {
+                if self.term_streams_with_segment[0].terms.key() != next_streamer.terms.key() {
                    break;
                }
                let next_heap_it = self.heap.pop().unwrap(); // safe : we peeked beforehand
-                self.current_streamers.push(next_heap_it);
+                self.term_streams_with_segment.push(next_heap_it);
            }
            true
        } else {
@@ -101,6 +95,6 @@ impl<'a> TermMerger<'a> {
    /// if and only if advance() has been called before
    /// and "true" was returned.
    pub fn key(&self) -> &[u8] {
-        self.current_streamers[0].streamer.key()
+        self.term_streams_with_segment[0].terms.key()
    }
 }
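
The reversed `Ord` on `TermsWithSegmentOrd` turns the standard max-heap into a min-heap over `(key, segment_ord)`, which is what lets `TermMerger` visit terms in sorted order and group the segments that share a term. The following is a minimal, self-contained sketch of that idea. `Entry` and `merge_terms` are hypothetical stand-ins that work on in-memory vectors instead of `Streamer`; they are not the crate's API, and the input lists are assumed to be sorted and free of duplicates.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// Hypothetical stand-in for `TermsWithSegmentOrd`: the current key of one
/// sorted term list, the ordinal of its segment, and the position in the list.
#[derive(PartialEq, Eq)]
struct Entry {
    key: Vec<u8>,
    segment_ord: usize,
    pos: usize,
}

impl Ord for Entry {
    fn cmp(&self, other: &Self) -> Ordering {
        // Reversed comparison: the max-heap then pops the smallest
        // (key, segment_ord) pair first.
        (&other.key, other.segment_ord).cmp(&(&self.key, self.segment_ord))
    }
}

impl PartialOrd for Entry {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

/// Merges sorted, deduplicated per-segment term lists. Each distinct term is
/// returned once, with the ordinals of the segments that contain it.
fn merge_terms(segments: &[Vec<Vec<u8>>]) -> Vec<(Vec<u8>, Vec<usize>)> {
    let mut heap = BinaryHeap::new();
    for (segment_ord, terms) in segments.iter().enumerate() {
        if let Some(first) = terms.first() {
            heap.push(Entry { key: first.clone(), segment_ord, pos: 0 });
        }
    }
    let mut merged: Vec<(Vec<u8>, Vec<usize>)> = Vec::new();
    while let Some(Entry { key, segment_ord, pos }) = heap.pop() {
        // Advance the list we just consumed from.
        if let Some(next) = segments[segment_ord].get(pos + 1) {
            heap.push(Entry { key: next.clone(), segment_ord, pos: pos + 1 });
        }
        let same_as_last = merged.last().map_or(false, |(last_key, _)| *last_key == key);
        if same_as_last {
            merged.last_mut().unwrap().1.push(segment_ord);
        } else {
            merged.push((key, vec![segment_ord]));
        }
    }
    merged
}
```

Because the segment ordinal is part of the comparison, ties on the key are broken by segment, so the ordinals attached to a term come out in ascending order. Passing the ordinal in explicitly, as this diff does, keeps that ordinal meaningful even when some segments contribute an empty stream.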
--- a/columnar/src/columnar/merge/tests.rs
+++ b/columnar/src/columnar/merge/tests.rs
@@ -1,7 +1,10 @@
 use itertools::Itertools;
+use proptest::collection::vec;
+use proptest::prelude::*;

 use super::*;
-use crate::{Cardinality, ColumnarWriter, HasAssociatedColumnType, RowId};
+use crate::columnar::{merge_columnar, ColumnarReader, MergeRowOrder, StackMergeOrder};
+use crate::{Cardinality, ColumnarWriter, DynamicColumn, HasAssociatedColumnType, RowId};

 fn make_columnar<T: Into<NumericalValue> + HasAssociatedColumnType + Copy>(
    column_name: &str,
@@ -26,9 +29,8 @@ fn test_column_coercion_to_u64() {
    // u64 type
    let columnar2 = make_columnar("numbers", &[u64::MAX]);
    let columnars = &[&columnar1, &columnar2];
-    let merge_order = StackMergeOrder::stack(columnars).into();
    let column_map: BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle> =
-        group_columns_for_merge(columnars, &[], &merge_order).unwrap();
+        group_columns_for_merge(columnars, &[]).unwrap();
    assert_eq!(column_map.len(), 1);
    assert!(column_map.contains_key(&("numbers".to_string(), ColumnTypeCategory::Numerical)));
 }
@@ -38,9 +40,8 @@ fn test_column_coercion_to_i64() {
    let columnar1 = make_columnar("numbers", &[-1i64]);
    let columnar2 = make_columnar("numbers", &[2u64]);
    let columnars = &[&columnar1, &columnar2];
-    let merge_order = StackMergeOrder::stack(columnars).into();
    let column_map: BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle> =
-        group_columns_for_merge(columnars, &[], &merge_order).unwrap();
+        group_columns_for_merge(columnars, &[]).unwrap();
    assert_eq!(column_map.len(), 1);
    assert!(column_map.contains_key(&("numbers".to_string(), ColumnTypeCategory::Numerical)));
 }
@@ -63,14 +64,8 @@ fn test_group_columns_with_required_column() {
    let columnar1 = make_columnar("numbers", &[1i64]);
    let columnar2 = make_columnar("numbers", &[2u64]);
    let columnars = &[&columnar1, &columnar2];
-    let merge_order = StackMergeOrder::stack(columnars).into();
    let column_map: BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle> =
-        group_columns_for_merge(
-            &[&columnar1, &columnar2],
-            &[("numbers".to_string(), ColumnType::U64)],
-            &merge_order,
-        )
-        .unwrap();
+        group_columns_for_merge(columnars, &[("numbers".to_string(), ColumnType::U64)]).unwrap();
    assert_eq!(column_map.len(), 1);
    assert!(column_map.contains_key(&("numbers".to_string(), ColumnTypeCategory::Numerical)));
 }
@@ -80,13 +75,9 @@ fn test_group_columns_required_column_with_no_existing_columns() {
    let columnar1 = make_columnar("numbers", &[2u64]);
    let columnar2 = make_columnar("numbers", &[2u64]);
    let columnars = &[&columnar1, &columnar2];
-    let merge_order = StackMergeOrder::stack(columnars).into();
-    let column_map: BTreeMap<_, _> = group_columns_for_merge(
-        columnars,
-        &[("required_col".to_string(), ColumnType::Str)],
-        &merge_order,
-    )
-    .unwrap();
+    let column_map: BTreeMap<_, _> =
+        group_columns_for_merge(columnars, &[("required_col".to_string(), ColumnType::Str)])
+            .unwrap();
    assert_eq!(column_map.len(), 2);
    let columns = &column_map
        .get(&("required_col".to_string(), ColumnTypeCategory::Str))
@@ -102,14 +93,8 @@ fn test_group_columns_required_column_is_above_all_columns_have_the_same_type_ru
    let columnar1 = make_columnar("numbers", &[2i64]);
    let columnar2 = make_columnar("numbers", &[2i64]);
    let columnars = &[&columnar1, &columnar2];
-    let merge_order = StackMergeOrder::stack(columnars).into();
    let column_map: BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle> =
-        group_columns_for_merge(
-            columnars,
-            &[("numbers".to_string(), ColumnType::U64)],
-            &merge_order,
-        )
-        .unwrap();
+        group_columns_for_merge(columnars, &[("numbers".to_string(), ColumnType::U64)]).unwrap();
    assert_eq!(column_map.len(), 1);
    assert!(column_map.contains_key(&("numbers".to_string(), ColumnTypeCategory::Numerical)));
 }
@@ -119,9 +104,8 @@ fn test_missing_column() {
    let columnar1 = make_columnar("numbers", &[-1i64]);
    let columnar2 = make_columnar("numbers2", &[2u64]);
    let columnars = &[&columnar1, &columnar2];
-    let merge_order = StackMergeOrder::stack(columnars).into();
    let column_map: BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle> =
-        group_columns_for_merge(columnars, &[], &merge_order).unwrap();
+        group_columns_for_merge(columnars, &[]).unwrap();
    assert_eq!(column_map.len(), 2);
    assert!(column_map.contains_key(&("numbers".to_string(), ColumnTypeCategory::Numerical)));
    {
@@ -224,7 +208,7 @@ fn test_merge_columnar_numbers() {
    )
    .unwrap();
    let columnar_reader = ColumnarReader::open(buffer).unwrap();
-    assert_eq!(columnar_reader.num_rows(), 3);
+    assert_eq!(columnar_reader.num_docs(), 3);
    assert_eq!(columnar_reader.num_columns(), 1);
    let cols = columnar_reader.read_columns("numbers").unwrap();
    let dynamic_column = cols[0].open().unwrap();
@@ -252,7 +236,7 @@ fn test_merge_columnar_texts() {
    )
    .unwrap();
    let columnar_reader = ColumnarReader::open(buffer).unwrap();
-    assert_eq!(columnar_reader.num_rows(), 3);
+    assert_eq!(columnar_reader.num_docs(), 3);
    assert_eq!(columnar_reader.num_columns(), 1);
    let cols = columnar_reader.read_columns("texts").unwrap();
    let dynamic_column = cols[0].open().unwrap();
@@ -301,7 +285,7 @@ fn test_merge_columnar_byte() {
    )
    .unwrap();
    let columnar_reader = ColumnarReader::open(buffer).unwrap();
-    assert_eq!(columnar_reader.num_rows(), 4);
+    assert_eq!(columnar_reader.num_docs(), 4);
    assert_eq!(columnar_reader.num_columns(), 1);
    let cols = columnar_reader.read_columns("bytes").unwrap();
    let dynamic_column = cols[0].open().unwrap();
@@ -357,7 +341,7 @@ fn test_merge_columnar_byte_with_missing() {
    )
    .unwrap();
    let columnar_reader = ColumnarReader::open(buffer).unwrap();
-    assert_eq!(columnar_reader.num_rows(), 3 + 2 + 3);
+    assert_eq!(columnar_reader.num_docs(), 3 + 2 + 3);
    assert_eq!(columnar_reader.num_columns(), 2);
    let cols = columnar_reader.read_columns("col").unwrap();
    let dynamic_column = cols[0].open().unwrap();
@@ -409,7 +393,7 @@ fn test_merge_columnar_different_types() {
    )
    .unwrap();
    let columnar_reader = ColumnarReader::open(buffer).unwrap();
-    assert_eq!(columnar_reader.num_rows(), 4);
+    assert_eq!(columnar_reader.num_docs(), 4);
    assert_eq!(columnar_reader.num_columns(), 2);
    let cols = columnar_reader.read_columns("mixed").unwrap();

@@ -419,11 +403,11 @@ fn test_merge_columnar_different_types() {
        panic!()
    };
    assert_eq!(vals.get_cardinality(), Cardinality::Optional);
-    assert_eq!(vals.values_for_doc(0).collect_vec(), vec![]);
-    assert_eq!(vals.values_for_doc(1).collect_vec(), vec![]);
-    assert_eq!(vals.values_for_doc(2).collect_vec(), vec![]);
+    assert_eq!(vals.values_for_doc(0).collect_vec(), Vec::<i64>::new());
+    assert_eq!(vals.values_for_doc(1).collect_vec(), Vec::<i64>::new());
+    assert_eq!(vals.values_for_doc(2).collect_vec(), Vec::<i64>::new());
    assert_eq!(vals.values_for_doc(3).collect_vec(), vec![1]);
-    assert_eq!(vals.values_for_doc(4).collect_vec(), vec![]);
+    assert_eq!(vals.values_for_doc(4).collect_vec(), Vec::<i64>::new());

    // text column
    let dynamic_column = cols[1].open().unwrap();
@@ -474,7 +458,7 @@ fn test_merge_columnar_different_empty_cardinality() {
    )
    .unwrap();
    let columnar_reader = ColumnarReader::open(buffer).unwrap();
-    assert_eq!(columnar_reader.num_rows(), 2);
+    assert_eq!(columnar_reader.num_docs(), 2);
    assert_eq!(columnar_reader.num_columns(), 2);
    let cols = columnar_reader.read_columns("mixed").unwrap();

@@ -486,3 +470,119 @@ fn test_merge_columnar_different_empty_cardinality() {
    let dynamic_column = cols[1].open().unwrap();
    assert_eq!(dynamic_column.get_cardinality(), Cardinality::Optional);
 }
+
+#[derive(Debug, Clone)]
+struct ColumnSpec {
+    column_name: String,
+    /// (row_id, term)
+    terms: Vec<(RowId, Vec<u8>)>,
+}
+
+#[derive(Clone, Debug)]
+struct ColumnarSpec {
+    columns: Vec<ColumnSpec>,
+}
+
+/// Generate a random (row_id, term) pair:
+///  - row_id in [0..10]
+///  - term is either from POSSIBLE_TERMS or random bytes
+fn rowid_and_term_strategy() -> impl Strategy<Value = (RowId, Vec<u8>)> {
+    const POSSIBLE_TERMS: &[&[u8]] = &[b"a", b"b", b"allo"];
+
+    let term_strat = prop_oneof![
+        // pick from the fixed list
+        (0..POSSIBLE_TERMS.len()).prop_map(|i| POSSIBLE_TERMS[i].to_vec()),
+        // or random bytes (length 0..10)
+        prop::collection::vec(any::<u8>(), 0..10),
+    ];
+
+    (0u32..11, term_strat)
+}
+
+/// Generate one ColumnSpec, with a random name and a random list of (row_id, term).
+/// We sort it by row_id so that data is in ascending order.
+fn column_spec_strategy() -> impl Strategy<Value = ColumnSpec> {
+    let column_name = prop_oneof![
+        Just("col".to_string()),
+        Just("col2".to_string()),
+        "col.*".prop_map(|s| s),
+    ];
+
+    // We'll produce 0..8 (rowid,term) entries for this column
+    let data_strat = vec(rowid_and_term_strategy(), 0..8).prop_map(|mut pairs| {
+        // Sort by row_id
+        pairs.sort_by_key(|(row_id, _)| *row_id);
+        pairs
+    });
+
+    (column_name, data_strat).prop_map(|(name, data)| ColumnSpec {
+        column_name: name,
+        terms: data,
+    })
+}
+
+/// Strategy to generate a `ColumnarSpec`.
+fn columnar_strategy() -> impl Strategy<Value = ColumnarSpec> {
+    vec(column_spec_strategy(), 0..3).prop_map(|columns| ColumnarSpec { columns })
+}
+
+/// Strategy to generate multiple ColumnarSpecs, each of which we will treat
+/// as one "columnar" to be merged together.
+fn columnars_strategy() -> impl Strategy<Value = Vec<ColumnarSpec>> {
+    vec(columnar_strategy(), 1..4)
+}
+
+/// Build a `ColumnarReader` from a `ColumnarSpec`
+fn build_columnar(spec: &ColumnarSpec) -> ColumnarReader {
+    let mut writer = ColumnarWriter::default();
+    let mut max_row_id = 0;
+    for col in &spec.columns {
+        for &(row_id, ref term) in &col.terms {
+            writer.record_bytes(row_id, &col.column_name, term);
+            max_row_id = max_row_id.max(row_id);
+        }
+    }
+
+    let mut buffer = Vec::new();
+    writer.serialize(max_row_id + 1, &mut buffer).unwrap();
+    ColumnarReader::open(buffer).unwrap()
+}
+
+proptest! {
+    // We just test that the merge_columnar function doesn't crash.
+    #![proptest_config(ProptestConfig::with_cases(256))]
+    #[test]
+    fn test_merge_columnar_bytes_no_crash(columnars in columnars_strategy(), second_merge_columnars in columnars_strategy()) {
+        let columnars: Vec<ColumnarReader> = columnars.iter()
+            .map(build_columnar)
+            .collect();
+
+        let mut out = Vec::new();
+        let columnar_refs: Vec<&ColumnarReader> = columnars.iter().collect();
+        let stack_merge_order = StackMergeOrder::stack(&columnar_refs);
+        merge_columnar(
+            &columnar_refs,
+            &[],
+            MergeRowOrder::Stack(stack_merge_order),
+            &mut out,
+        ).unwrap();
+
+        let merged_reader = ColumnarReader::open(out).unwrap();
+
+        // Merge the second set of columnars with the result of the first merge
+        let mut columnars: Vec<ColumnarReader> = second_merge_columnars.iter()
+            .map(build_columnar)
+            .collect();
+        columnars.push(merged_reader);
+        let mut out = Vec::new();
+        let columnar_refs: Vec<&ColumnarReader> = columnars.iter().collect();
+        let stack_merge_order = StackMergeOrder::stack(&columnar_refs);
+        merge_columnar(
+            &columnar_refs,
+            &[],
+            MergeRowOrder::Stack(stack_merge_order),
+            &mut out,
+        ).unwrap();
+
+    }
+}
--- a/columnar/src/columnar/reader/mod.rs
+++ b/columnar/src/columnar/reader/mod.rs
@@ -1,6 +1,7 @@
 use std::{fmt, io, mem};

 use common::file_slice::FileSlice;
+use common::json_path_writer::JSON_PATH_SEGMENT_SEP;
 use common::BinarySerializable;
 use sstable::{Dictionary, RangeSSTable};

@@ -18,13 +19,13 @@ fn io_invalid_data(msg: String) -> io::Error {
 pub struct ColumnarReader {
    column_dictionary: Dictionary<RangeSSTable>,
    column_data: FileSlice,
-    num_rows: RowId,
+    num_docs: RowId,
    format_version: Version,
 }

 impl fmt::Debug for ColumnarReader {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
-        let num_rows = self.num_rows();
+        let num_rows = self.num_docs();
        let columns = self.list_columns().unwrap();
        let num_cols = columns.len();
        let mut debug_struct = f.debug_struct("Columnar");
@@ -76,6 +77,19 @@ fn read_all_columns_in_stream(
    Ok(results)
 }

+fn column_dictionary_prefix_for_column_name(column_name: &str) -> String {
+    // Each column is associated with a given `column_key`,
+    // that starts with `column_name\0column_header`.
+    //
+    // Listing the columns associated to the given column name is therefore equivalent to
+    // listing `column_key` with the prefix `column_name\0`.
+    format!("{}{}", column_name, '\0')
+}
+
+fn column_dictionary_prefix_for_subpath(root_path: &str) -> String {
+    format!("{}{}", root_path, JSON_PATH_SEGMENT_SEP as char)
+}
+
 impl ColumnarReader {
    /// Opens a new Columnar file.
    pub fn open<F>(file_slice: F) -> io::Result<ColumnarReader>
@@ -98,13 +112,13 @@ impl ColumnarReader {
        Ok(ColumnarReader {
            column_dictionary,
            column_data,
-            num_rows,
+            num_docs: num_rows,
            format_version,
        })
    }

-    pub fn num_rows(&self) -> RowId {
-        self.num_rows
+    pub fn num_docs(&self) -> RowId {
+        self.num_docs
    }
    // Iterate over the columns in a sorted way
    pub fn iter_columns(
@@ -144,32 +158,14 @@ impl ColumnarReader {
        Ok(self.iter_columns()?.collect())
    }

-    fn stream_for_column_range(&self, column_name: &str) -> sstable::StreamerBuilder<RangeSSTable> {
-        // Each column is a associated to a given `column_key`,
-        // that starts by `column_name\0column_header`.
-        //
-        // Listing the columns associated to the given column name is therefore equivalent to
-        // listing `column_key` with the prefix `column_name\0`.
-        //
-        // This is in turn equivalent to searching for the range
-        // `[column_name,\0`..column_name\1)`.
-        // TODO can we get some more generic `prefix(..)` logic in the dictionary.
-        let mut start_key = column_name.to_string();
-        start_key.push('\0');
-        let mut end_key = column_name.to_string();
-        end_key.push(1u8 as char);
-        self.column_dictionary
-            .range()
-            .ge(start_key.as_bytes())
-            .lt(end_key.as_bytes())
-    }
-
    pub async fn read_columns_async(
        &self,
        column_name: &str,
    ) -> io::Result<Vec<DynamicColumnHandle>> {
+        let prefix = column_dictionary_prefix_for_column_name(column_name);
        let stream = self
-            .stream_for_column_range(column_name)
+            .column_dictionary
+            .prefix_range(prefix)
            .into_stream_async()
            .await?;
        read_all_columns_in_stream(stream, &self.column_data, self.format_version)
@@ -180,7 +176,35 @@ impl ColumnarReader {
    /// There can be more than one column associated to a given column name, provided they have
    /// different types.
    pub fn read_columns(&self, column_name: &str) -> io::Result<Vec<DynamicColumnHandle>> {
-        let stream = self.stream_for_column_range(column_name).into_stream()?;
+        let prefix = column_dictionary_prefix_for_column_name(column_name);
+        let stream = self.column_dictionary.prefix_range(prefix).into_stream()?;
+        read_all_columns_in_stream(stream, &self.column_data, self.format_version)
+    }
+
+    pub async fn read_subpath_columns_async(
+        &self,
+        root_path: &str,
+    ) -> io::Result<Vec<DynamicColumnHandle>> {
+        let prefix = column_dictionary_prefix_for_subpath(root_path);
+        let stream = self
+            .column_dictionary
+            .prefix_range(prefix)
+            .into_stream_async()
+            .await?;
+        read_all_columns_in_stream(stream, &self.column_data, self.format_version)
+    }
+
+    /// Gets all inner columns for a given JSON prefix, i.e. columns whose name starts with the
+    /// prefix immediately followed by [`JSON_PATH_SEGMENT_SEP`].
+    ///
+    /// There can be more than one column associated to each path within the JSON structure,
+    /// provided they have different types.
+    pub fn read_subpath_columns(&self, root_path: &str) -> io::Result<Vec<DynamicColumnHandle>> {
+        let prefix = column_dictionary_prefix_for_subpath(root_path);
+        let stream = self
+            .column_dictionary
+            .prefix_range(prefix.as_bytes())
+            .into_stream()?;
        read_all_columns_in_stream(stream, &self.column_data, self.format_version)
    }

@@ -192,6 +216,8 @@ impl ColumnarReader {

 #[cfg(test)]
 mod tests {
+    use common::json_path_writer::JSON_PATH_SEGMENT_SEP;
+
    use crate::{ColumnType, ColumnarReader, ColumnarWriter};

    #[test]
@@ -224,6 +250,64 @@ mod tests {
        assert_eq!(columns[0].1.column_type(), ColumnType::U64);
    }

+    #[test]
+    fn test_read_columns() {
+        let mut columnar_writer = ColumnarWriter::default();
+        columnar_writer.record_column_type("col", ColumnType::U64, false);
+        columnar_writer.record_numerical(1, "col", 1u64);
+        let mut buffer = Vec::new();
+        columnar_writer.serialize(2, &mut buffer).unwrap();
+        let columnar = ColumnarReader::open(buffer).unwrap();
+        {
+            let columns = columnar.read_columns("col").unwrap();
+            assert_eq!(columns.len(), 1);
+            assert_eq!(columns[0].column_type(), ColumnType::U64);
+        }
+        {
+            let columns = columnar.read_columns("other").unwrap();
+            assert_eq!(columns.len(), 0);
+        }
+    }
+
+    #[test]
+    fn test_read_subpath_columns() {
+        let mut columnar_writer = ColumnarWriter::default();
+        columnar_writer.record_str(
+            0,
+            &format!("col1{}subcol1", JSON_PATH_SEGMENT_SEP as char),
+            "hello",
+        );
+        columnar_writer.record_numerical(
+            0,
+            &format!("col1{}subcol2", JSON_PATH_SEGMENT_SEP as char),
+            1i64,
+        );
+        columnar_writer.record_str(1, "col1", "hello");
+        columnar_writer.record_str(0, "col2", "hello");
+        let mut buffer = Vec::new();
+        columnar_writer.serialize(2, &mut buffer).unwrap();
+
+        let columnar = ColumnarReader::open(buffer).unwrap();
+        {
+            let columns = columnar.read_subpath_columns("col1").unwrap();
+            assert_eq!(columns.len(), 2);
+            assert_eq!(columns[0].column_type(), ColumnType::Str);
+            assert_eq!(columns[1].column_type(), ColumnType::I64);
+        }
+        {
+            let columns = columnar.read_subpath_columns("col1.subcol1").unwrap();
+            assert_eq!(columns.len(), 0);
+        }
+        {
+            let columns = columnar.read_subpath_columns("col2").unwrap();
+            assert_eq!(columns.len(), 0);
+        }
+        {
+            let columns = columnar.read_subpath_columns("other").unwrap();
+            assert_eq!(columns.len(), 0);
+        }
+    }
+
    #[test]
    #[should_panic(expected = "Input type forbidden")]
    fn test_list_columns_strict_typing_panics_on_wrong_types() {
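
The prefix-based lookup that replaces `stream_for_column_range` in the hunks above can be pictured with an ordinary sorted set. What follows is a minimal sketch under stated assumptions, not tantivy's implementation: `keys_with_prefix`, `sample_keys`, the `PATH_SEP` constant, and the column headers (`str`, `i64`, `u64`) are hypothetical stand-ins, and the separator is assumed to be byte `\x01`, as described in `doc/src/json.md`. It illustrates why the prefix `column_name\0` selects exactly one column name, while `root_path\x01` selects only the nested JSON columns.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for `JSON_PATH_SEGMENT_SEP` (assumed to be byte 1).
const PATH_SEP: char = '\u{1}';

/// Prefix shared by all keys of one column name: `column_name\0`.
fn prefix_for_column_name(column_name: &str) -> String {
    format!("{}\0", column_name)
}

/// Prefix shared by all keys nested under a JSON path: `root_path\x01`.
fn prefix_for_subpath(root_path: &str) -> String {
    format!("{}{}", root_path, PATH_SEP)
}

/// Lists the keys starting with `prefix`. In a sorted set these keys are
/// contiguous, so we seek to the first key >= `prefix` and stop at the first
/// key that no longer matches.
fn keys_with_prefix(keys: &BTreeSet<String>, prefix: &str) -> Vec<String> {
    keys.range(prefix.to_string()..)
        .take_while(|key| key.starts_with(prefix))
        .cloned()
        .collect()
}

/// Hypothetical dictionary keys, shaped as `column_name\0column_header`.
fn sample_keys() -> BTreeSet<String> {
    [
        "col1\0str",
        "col1\u{1}subcol1\0str",
        "col1\u{1}subcol2\0i64",
        "col10\0u64",
        "col2\0str",
    ]
    .iter()
    .map(|key| key.to_string())
    .collect()
}
```

With these sample keys, `keys_with_prefix(&sample_keys(), &prefix_for_column_name("col1"))` skips both `col10` and the nested `col1` paths, because the byte following `col1` must be `\0`. The subpath prefix for `col1` returns the two nested keys and nothing else.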
--- a/columnar/src/columnar/writer/column_operation.rs
+++ b/columnar/src/columnar/writer/column_operation.rs
@@ -122,7 +122,6 @@ impl<T> From<T> for ColumnOperation<T> {
 // In order to limit memory usage, and in order
 // to benefit from the stacker, we do this by serialization our data
 // as "Symbols".
-#[allow(clippy::from_over_into)]
 pub(super) trait SymbolValue: Clone + Copy {
    // Serializes the symbol into the given buffer.
    // Returns the number of bytes written into the buffer.
--- a/columnar/src/columnar/writer/mod.rs
+++ b/columnar/src/columnar/writer/mod.rs
@@ -285,7 +285,6 @@ impl ColumnarWriter {
                .map(|(column_name, addr)| (column_name, ColumnType::DateTime, addr)),
        );
        columns.sort_unstable_by_key(|(column_name, col_type, _)| (*column_name, *col_type));
-
        let (arena, buffers, dictionaries) = (&self.arena, &mut self.buffers, &self.dictionaries);
        let mut symbol_byte_buffer: Vec<u8> = Vec::new();
        for (column_name, column_type, addr) in columns {
@@ -392,7 +391,7 @@ impl ColumnarWriter {

 // Serialize [Dictionary, Column, dictionary num bytes U32::LE]
 // Column: [Column Index, Column Values, column index num bytes U32::LE]
-#[allow(clippy::too_many_arguments)]
+#[expect(clippy::too_many_arguments)]
 fn serialize_bytes_or_str_column(
    cardinality: Cardinality,
    num_docs: RowId,
--- a/columnar/src/columnar/writer/serializer.rs
+++ b/columnar/src/columnar/writer/serializer.rs
@@ -67,7 +67,7 @@ pub struct ColumnSerializer<'a, W: io::Write> {
    start_offset: u64,
 }

-impl<'a, W: io::Write> ColumnSerializer<'a, W> {
+impl<W: io::Write> ColumnSerializer<'_, W> {
    pub fn finalize(self) -> io::Result<()> {
        let end_offset: u64 = self.columnar_serializer.wrt.written_bytes();
        let byte_range = self.start_offset..end_offset;
@@ -80,7 +80,7 @@ impl<'a, W: io::Write> ColumnSerializer<'a, W> {
    }
 }

-impl<'a, W: io::Write> io::Write for ColumnSerializer<'a, W> {
+impl<W: io::Write> io::Write for ColumnSerializer<'_, W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.columnar_serializer.wrt.write(buf)
    }
--- a/columnar/src/iterable.rs
+++ b/columnar/src/iterable.rs
@@ -7,7 +7,7 @@ pub trait Iterable<T = u64> {
    fn boxed_iter(&self) -> Box<dyn Iterator<Item = T> + '_>;
 }

-impl<'a, T: Copy> Iterable<T> for &'a [T] {
+impl<T: Copy> Iterable<T> for &[T] {
    fn boxed_iter(&self) -> Box<dyn Iterator<Item = T> + '_> {
        Box::new(self.iter().copied())
    }
--- a/columnar/src/tests.rs
+++ b/columnar/src/tests.rs
@@ -380,7 +380,7 @@ fn assert_columnar_eq(
    right: &ColumnarReader,
    lenient_on_numerical_value: bool,
 ) {
-    assert_eq!(left.num_rows(), right.num_rows());
+    assert_eq!(left.num_docs(), right.num_docs());
    let left_columns = left.list_columns().unwrap();
    let right_columns = right.list_columns().unwrap();
    assert_eq!(left_columns.len(), right_columns.len());
@@ -588,7 +588,7 @@ proptest! {
    #[test]
    fn test_single_columnar_builder_proptest(docs in columnar_docs_strategy()) {
        let columnar = build_columnar(&docs[..]);
-        assert_eq!(columnar.num_rows() as usize, docs.len());
+        assert_eq!(columnar.num_docs() as usize, docs.len());
        let mut expected_columns: HashMap<(&str, ColumnTypeCategory), HashMap<u32, Vec<&ColumnValue>> > = Default::default();
        for (doc_id, doc_vals) in docs.iter().enumerate() {
            for (col_name, col_val) in doc_vals {
@@ -715,6 +715,7 @@ fn test_columnar_merging_number_columns() {
 // TODO test required_columns
 // TODO document edge case: required_columns incompatible with values.

+#[allow(clippy::type_complexity)]
 fn columnar_docs_and_remap(
 ) -> impl Strategy<Value = (Vec<Vec<Vec<(&'static str, ColumnValue)>>>, Vec<RowAddr>)> {
    proptest::collection::vec(columnar_docs_strategy(), 2..=3).prop_flat_map(
@@ -819,7 +820,7 @@ fn test_columnar_merge_empty() {
    )
    .unwrap();
    let merged_columnar = ColumnarReader::open(output).unwrap();
-    assert_eq!(merged_columnar.num_rows(), 0);
+    assert_eq!(merged_columnar.num_docs(), 0);
    assert_eq!(merged_columnar.num_columns(), 0);
 }

@@ -845,7 +846,7 @@ fn test_columnar_merge_single_str_column() {
    )
    .unwrap();
    let merged_columnar = ColumnarReader::open(output).unwrap();
-    assert_eq!(merged_columnar.num_rows(), 1);
+    assert_eq!(merged_columnar.num_docs(), 1);
    assert_eq!(merged_columnar.num_columns(), 1);
 }

@@ -877,7 +878,7 @@ fn test_delete_decrease_cardinality() {
    )
    .unwrap();
    let merged_columnar = ColumnarReader::open(output).unwrap();
-    assert_eq!(merged_columnar.num_rows(), 1);
+    assert_eq!(merged_columnar.num_docs(), 1);
    assert_eq!(merged_columnar.num_columns(), 1);
    let cols = merged_columnar.read_columns("c").unwrap();
    assert_eq!(cols.len(), 1);
--- a/common/Cargo.toml
+++ b/common/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "tantivy-common"
-version = "0.7.0"
+version = "0.9.0"
 authors = ["Paul Masurel <paul@quickwit.io>", "Pascal Seitz <pascal@quickwit.io>"]
 license = "MIT"
 edition = "2021"
@@ -13,13 +13,13 @@ repository = "https://github.com/quickwit-oss/tantivy"

 [dependencies]
 byteorder = "1.4.3"
-ownedbytes = { version= "0.7", path="../ownedbytes" }
+ownedbytes = { version= "0.9", path="../ownedbytes" }
 async-trait = "0.1"
 time = { version = "0.3.10", features = ["serde-well-known"] }
 serde = { version = "1.0.136", features = ["derive"] }

 [dev-dependencies]
-binggan = "0.12.0"
+binggan = "0.14.0"
 proptest = "1.0.0"
 rand = "0.8.4"

--- a/common/benches/bench.rs
+++ b/common/benches/bench.rs
@@ -15,7 +15,6 @@ fn bench_vint() {
            out += u64::from(buf[0]);
        }
        black_box(out);
-        None
    });

    let vals: Vec<u32> = (0..20_000).choose_multiple(&mut thread_rng(), 100_000);
@@ -27,7 +26,6 @@ fn bench_vint() {
            out += u64::from(buf[0]);
        }
        black_box(out);
-        None
    });
 }

@@ -43,24 +41,20 @@ fn bench_bitset() {
        tinyset.pop_lowest();
        tinyset.pop_lowest();
        black_box(tinyset);
-        None
    });

    let tiny_set = TinySet::empty().insert(10u32).insert(14u32).insert(21u32);
    runner.bench_function("bench_tinyset_sum", move |_| {
        assert_eq!(black_box(tiny_set).into_iter().sum::<u32>(), 45u32);
-        None
    });

    let v = [10u32, 14u32, 21u32];
    runner.bench_function("bench_tinyarr_sum", move |_| {
        black_box(v.iter().cloned().sum::<u32>());
-        None
    });

    runner.bench_function("bench_bitset_initialize", move |_| {
        black_box(BitSet::with_max_value(1_000_000));
-        None
    });
 }

--- a/common/src/file_slice.rs
+++ b/common/src/file_slice.rs
@@ -1,5 +1,6 @@
 use std::fs::File;
 use std::ops::{Deref, Range, RangeBounds};
+use std::path::Path;
 use std::sync::Arc;
 use std::{fmt, io};

@@ -177,6 +178,12 @@ fn combine_ranges<R: RangeBounds<usize>>(orig_range: Range<usize>, rel_range: R)
 }

 impl FileSlice {
+    /// Creates a FileSlice from a path.
+    pub fn open(path: &Path) -> io::Result<FileSlice> {
+        let wrap_file = WrapFile::new(File::open(path)?)?;
+        Ok(FileSlice::new(Arc::new(wrap_file)))
+    }
+
    /// Wraps a FileHandle.
    pub fn new(file_handle: Arc<dyn FileHandle>) -> Self {
        let num_bytes = file_handle.len();
--- a/common/src/lib.rs
+++ b/common/src/lib.rs
@@ -130,11 +130,11 @@ pub fn replace_in_place(needle: u8, replacement: u8, bytes: &mut [u8]) {
 }

 #[cfg(test)]
-pub mod test {
+pub(crate) mod test {

    use proptest::prelude::*;

-    use super::{f64_to_u64, i64_to_u64, u64_to_f64, u64_to_i64, BinarySerializable, FixedSize};
+    use super::{f64_to_u64, i64_to_u64, u64_to_f64, u64_to_i64};

    fn test_i64_converter_helper(val: i64) {
        assert_eq!(u64_to_i64(i64_to_u64(val)), val);
@@ -144,12 +144,6 @@ pub mod test {
        assert_eq!(u64_to_f64(f64_to_u64(val)), val);
    }

-    pub fn fixed_size_test<O: BinarySerializable + FixedSize + Default>() {
-        let mut buffer = Vec::new();
-        O::default().serialize(&mut buffer).unwrap();
-        assert_eq!(buffer.len(), O::SIZE_IN_BYTES);
-    }
-
    proptest! {
        #[test]
        fn test_f64_converter_monotonicity_proptest((left, right) in (proptest::num::f64::NORMAL, proptest::num::f64::NORMAL)) {
--- a/common/src/serialize.rs
+++ b/common/src/serialize.rs
@@ -74,14 +74,14 @@ impl FixedSize for () {

 impl<T: BinarySerializable> BinarySerializable for Vec<T> {
    fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
-        VInt(self.len() as u64).serialize(writer)?;
+        BinarySerializable::serialize(&VInt(self.len() as u64), writer)?;
        for it in self {
            it.serialize(writer)?;
        }
        Ok(())
    }
    fn deserialize<R: Read>(reader: &mut R) -> io::Result<Vec<T>> {
-        let num_items = VInt::deserialize(reader)?.val();
+        let num_items = <VInt as BinarySerializable>::deserialize(reader)?.val();
        let mut items: Vec<T> = Vec::with_capacity(num_items as usize);
        for _ in 0..num_items {
            let item = T::deserialize(reader)?;
@@ -236,12 +236,12 @@ impl FixedSize for bool {
 impl BinarySerializable for String {
    fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
        let data: &[u8] = self.as_bytes();
-        VInt(data.len() as u64).serialize(writer)?;
+        BinarySerializable::serialize(&VInt(data.len() as u64), writer)?;
        writer.write_all(data)
    }

    fn deserialize<R: Read>(reader: &mut R) -> io::Result<String> {
-        let string_length = VInt::deserialize(reader)?.val() as usize;
+        let string_length = <VInt as BinarySerializable>::deserialize(reader)?.val() as usize;
        let mut result = String::with_capacity(string_length);
        reader
            .take(string_length as u64)
@@ -253,12 +253,12 @@ impl BinarySerializable for String {
 impl<'a> BinarySerializable for Cow<'a, str> {
    fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
        let data: &[u8] = self.as_bytes();
-        VInt(data.len() as u64).serialize(writer)?;
+        BinarySerializable::serialize(&VInt(data.len() as u64), writer)?;
        writer.write_all(data)
    }

    fn deserialize<R: Read>(reader: &mut R) -> io::Result<Cow<'a, str>> {
-        let string_length = VInt::deserialize(reader)?.val() as usize;
+        let string_length = <VInt as BinarySerializable>::deserialize(reader)?.val() as usize;
        let mut result = String::with_capacity(string_length);
        reader
            .take(string_length as u64)
@@ -269,18 +269,18 @@ impl<'a> BinarySerializable for Cow<'a, str> {

 impl<'a> BinarySerializable for Cow<'a, [u8]> {
    fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
-        VInt(self.len() as u64).serialize(writer)?;
+        BinarySerializable::serialize(&VInt(self.len() as u64), writer)?;
        for it in self.iter() {
-            it.serialize(writer)?;
+            BinarySerializable::serialize(it, writer)?;
        }
        Ok(())
    }

    fn deserialize<R: Read>(reader: &mut R) -> io::Result<Cow<'a, [u8]>> {
-        let num_items = VInt::deserialize(reader)?.val();
+        let num_items = <VInt as BinarySerializable>::deserialize(reader)?.val();
        let mut items: Vec<u8> = Vec::with_capacity(num_items as usize);
        for _ in 0..num_items {
-            let item = u8::deserialize(reader)?;
+            let item = <u8 as BinarySerializable>::deserialize(reader)?;
            items.push(item);
        }
        Ok(Cow::Owned(items))
--- a/common/src/writer.rs
+++ b/common/src/writer.rs
@@ -87,7 +87,7 @@ impl<W: TerminatingWrite> TerminatingWrite for BufWriter<W> {
    }
 }

-impl<'a> TerminatingWrite for &'a mut Vec<u8> {
+impl TerminatingWrite for &mut Vec<u8> {
    fn terminate_ref(&mut self, _a: AntiCallToken) -> io::Result<()> {
        self.flush()
    }
--- a/doc/src/avant-propos.md
+++ b/doc/src/avant-propos.md
@@ -2,7 +2,7 @@

 > Tantivy is a **search** engine **library** for Rust.

-If you are familiar with Lucene, it's an excellent approximation to consider tantivy as Lucene for rust. tantivy is heavily inspired by Lucene's design and
+If you are familiar with Lucene, it's an excellent approximation to consider tantivy as Lucene for Rust. Tantivy is heavily inspired by Lucene's design and
 they both have the same scope and targeted use cases.

 If you are not familiar with Lucene, let's break down our little tagline.
@@ -17,7 +17,7 @@ relevancy, collapsing, highlighting, spatial search.
  experience. But keep in mind this is just a toolbox.
  Which bring us to the second keyword...

- **Library** means that you will have to write code. tantivy is not an *all-in-one* server solution like elastic search for instance.
+- **Library** means that you will have to write code. Tantivy is not an *all-in-one* server solution like Elasticsearch for instance.

  Sometimes a functionality will not be available in tantivy because it is too
  specific to your use case. By design, tantivy should make it possible to extend
@@ -31,4 +31,4 @@ relevancy, collapsing, highlighting, spatial search.
  index from a different format.

  Tantivy exposes a lot of low level API to do all of these things.
-  
+  
--- a/doc/src/basis.md
+++ b/doc/src/basis.md
@@ -11,7 +11,7 @@ directory shipped with tantivy is the `MmapDirectory`.
 While this design has some downsides, this greatly simplifies the source code of
 tantivy. Caching is also entirely delegated to the OS.

-`tantivy` works entirely (or almost) by directly reading the datastructures as they are laid on disk. As a result, the act of opening an indexing does not involve loading different datastructures from the disk into random access memory : starting a process, opening an index, and performing your first query can typically be done in a matter of milliseconds.
+Tantivy works entirely (or almost) by directly reading the data structures as they are laid out on disk. As a result, opening an index does not involve loading different data structures from the disk into random access memory: starting a process, opening an index, and performing your first query can typically be done in a matter of milliseconds.

 This is an interesting property for a command line search engine, or for some multi-tenant log search engine : spawning a new process for each new query can be a perfectly sensible solution in some use case.

--- a/doc/src/index_sorting.md
+++ b/doc/src/index_sorting.md
@@ -31,13 +31,13 @@ Compression ratio is mainly affected on the fast field of the sorted property, e
 When data is presorted by a field and search queries request sorting by the same field, we can leverage the natural order of the documents.
 E.g. if the data is sorted by timestamp and want the top n newest docs containing a term, we can simply leveraging the order of the docids.

-Note: Tantivy 0.16 does not do this optimization yet.
+Note: tantivy 0.16 does not do this optimization yet.

 ### Pruning

 Let's say we want all documents and want to apply the filter `>= 2010-08-11`. When the data is sorted, we could make a lookup in the fast field to find the docid range and use this as the filter.

-Note: Tantivy 0.16 does not do this optimization yet.
+Note: tantivy 0.16 does not do this optimization yet.

 ### Other?

@@ -45,7 +45,7 @@ In principle there are many algorithms possible that exploit the monotonically i

 ## Usage

-The index sorting can be configured setting [`sort_by_field`](https://github.com/quickwit-oss/tantivy/blob/000d76b11a139a84b16b9b95060a1c93e8b9851c/src/core/index_meta.rs#L238) on `IndexSettings` and passing it to a `IndexBuilder`. As of Tantivy 0.16 only fast fields are allowed to be used.
+The index sorting can be configured by setting [`sort_by_field`](https://github.com/quickwit-oss/tantivy/blob/000d76b11a139a84b16b9b95060a1c93e8b9851c/src/core/index_meta.rs#L238) on `IndexSettings` and passing it to an `IndexBuilder`. As of tantivy 0.16, only fast fields can be used.

 ```rust
 let settings = IndexSettings {
--- a/doc/src/json.md
+++ b/doc/src/json.md
@@ -39,7 +39,7 @@ Its representation is done by separating segments by a unicode char `\x01`, and
 - `value`: The value representation is just the regular Value representation.

 This representation is designed to align the natural sort of Terms with the lexicographical sort
-of their binary representation (Tantivy's dictionary (whether fst or sstable) is sorted and does prefix encoding).
+of their binary representation (tantivy's dictionary (whether fst or sstable) is sorted and does prefix encoding).

 In the example above, the terms will be sorted as

--- a/ownedbytes/Cargo.toml
+++ b/ownedbytes/Cargo.toml
@@ -1,7 +1,7 @@
 [package]
 authors = ["Paul Masurel <paul@quickwit.io>", "Pascal Seitz <pascal@quickwit.io>"]
 name = "ownedbytes"
-version = "0.7.0"
+version = "0.9.0"
 edition = "2021"
 description = "Expose data as static slice"
 license = "MIT"
--- a/ownedbytes/src/lib.rs
+++ b/ownedbytes/src/lib.rs
@@ -151,7 +151,7 @@ impl fmt::Debug for OwnedBytes {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        // We truncate the bytes in order to make sure the debug string
        // is not too long.
-        let bytes_truncated: &[u8] = if self.len() > 8 {
+        let bytes_truncated: &[u8] = if self.len() > 10 {
            &self.as_slice()[..10]
        } else {
            self.as_slice()
@@ -252,6 +252,11 @@ mod tests {
            format!("{short_bytes:?}"),
            "OwnedBytes([97, 98, 99, 100], len=4)"
        );
+        let medium_bytes = OwnedBytes::new(b"abcdefghi".as_ref());
+        assert_eq!(
+            format!("{medium_bytes:?}"),
+            "OwnedBytes([97, 98, 99, 100, 101, 102, 103, 104, 105], len=9)"
+        );
        let long_bytes = OwnedBytes::new(b"abcdefghijklmnopq".as_ref());
        assert_eq!(
            format!("{long_bytes:?}"),
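
The `> 8` to `> 10` change above makes the length check agree with the slice bound. With the old threshold, a 9-byte value passed the check and was then sliced with `[..10]`, which panics in Rust because the slice is out of range. The sketch below restates that invariant in isolation; `truncate_for_debug` and its `max_len` parameter are hypothetical names, not part of `ownedbytes`.

```rust
/// Returns at most the first `max_len` bytes of `bytes`.
///
/// The comparison and the slice bound use the same value, so the slice can
/// never extend past the end of the input.
fn truncate_for_debug(bytes: &[u8], max_len: usize) -> &[u8] {
    if bytes.len() > max_len {
        &bytes[..max_len]
    } else {
        bytes
    }
}
```

Deriving both the threshold and the slice bound from one parameter keeps the two from drifting apart again; an equivalent one-liner is `&bytes[..bytes.len().min(max_len)]`.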
--- a/query-grammar/Cargo.toml
+++ b/query-grammar/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "tantivy-query-grammar"
-version = "0.22.0"
+version = "0.24.0"
 authors = ["Paul Masurel <paul.masurel@gmail.com>"]
 license = "MIT"
 categories = ["database-implementations", "data-structures"]
@@ -13,3 +13,5 @@ edition = "2021"

 [dependencies]
 nom = "7"
+serde = { version = "1.0.219", features = ["derive"] }
+serde_json = "1.0.140"
--- a/query-grammar/src/infallible.rs
+++ b/query-grammar/src/infallible.rs
@@ -3,6 +3,7 @@
 use std::convert::Infallible;

 use nom::{AsChar, IResult, InputLength, InputTakeAtPosition};
+use serde::Serialize;

 pub(crate) type ErrorList = Vec<LenientErrorInternal>;
 pub(crate) type JResult<I, O> = IResult<I, (O, ErrorList), Infallible>;
@@ -15,7 +16,8 @@ pub(crate) struct LenientErrorInternal {
 }

 /// A recoverable error and the position it happened at
-#[derive(Debug, PartialEq)]
+#[derive(Debug, PartialEq, Serialize)]
+#[serde(rename_all = "snake_case")]
 pub struct LenientError {
    pub pos: usize,
    pub message: String,
@@ -111,7 +113,6 @@ where F: nom::Parser<I, (O, ErrorList), Infallible> {
        Err(Err::Incomplete(needed)) => Err(Err::Incomplete(needed)),
        // old versions don't understand this is uninhabited and need the empty match to help,
        // newer versions warn because this arm is unreachable (which it is indeed).
-        #[allow(unreachable_patterns)]
        Err(Err::Error(val)) | Err(Err::Failure(val)) => match val {},
    }
 }
@@ -354,3 +355,21 @@ where
 {
    move |i: I| l.choice(i.clone()).unwrap_or_else(|| default.parse(i))
 }
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    #[test]
+    fn test_lenient_error_serialization() {
+        let error = LenientError {
+            pos: 42,
+            message: "test error message".to_string(),
+        };
+
+        assert_eq!(
+            serde_json::to_string(&error).unwrap(),
+            "{\"pos\":42,\"message\":\"test error message\"}"
+        );
+    }
+}
--- a/query-grammar/src/lib.rs
+++ b/query-grammar/src/lib.rs
@@ -1,5 +1,7 @@
 #![allow(clippy::derive_partial_eq_without_eq)]

+use serde::Serialize;
+
 mod infallible;
 mod occur;
 mod query_grammar;
@@ -12,6 +14,8 @@ pub use crate::user_input_ast::{
    Delimiter, UserInputAst, UserInputBound, UserInputLeaf, UserInputLiteral,
 };

+#[derive(Debug, Serialize)]
+#[serde(rename_all = "snake_case")]
 pub struct Error;

 /// Parse a query
@@ -24,3 +28,31 @@ pub fn parse_query(query: &str) -> Result<UserInputAst, Error> {
 pub fn parse_query_lenient(query: &str) -> (UserInputAst, Vec<LenientError>) {
    parse_to_ast_lenient(query)
 }
+
+#[cfg(test)]
+mod tests {
+    use crate::{parse_query, parse_query_lenient};
+
+    #[test]
+    fn test_parse_query_serialization() {
+        let ast = parse_query("title:hello OR title:x").unwrap();
+        let json = serde_json::to_string(&ast).unwrap();
+        assert_eq!(
+            json,
+            r#"{"type":"bool","clauses":[["should",{"type":"literal","field_name":"title","phrase":"hello","delimiter":"none","slop":0,"prefix":false}],["should",{"type":"literal","field_name":"title","phrase":"x","delimiter":"none","slop":0,"prefix":false}]]}"#
+        );
+    }
+
+    #[test]
+    fn test_parse_query_wrong_query() {
+        assert!(parse_query("title:").is_err());
+    }
+
+    #[test]
+    fn test_parse_query_lenient_wrong_query() {
+        let (_, errors) = parse_query_lenient("title:");
+        assert_eq!(errors.len(), 1);
+        let json = serde_json::to_string(&errors).unwrap();
+        assert_eq!(json, r#"[{"pos":6,"message":"expected word"}]"#);
+    }
+}
--- a/query-grammar/src/occur.rs
+++ b/query-grammar/src/occur.rs
@@ -1,9 +1,12 @@
 use std::fmt;
 use std::fmt::Write;

+use serde::Serialize;
+
 /// Defines whether a term in a query must be present,
 /// should be present or must not be present.
-#[derive(Debug, Clone, Hash, Copy, Eq, PartialEq)]
+#[derive(Debug, Clone, Hash, Copy, Eq, PartialEq, Serialize)]
+#[serde(rename_all = "snake_case")]
 pub enum Occur {
    /// For a given document to be considered for scoring,
    /// at least one of the queries with the Should or the Must
--- a/query-grammar/src/query_grammar.rs
+++ b/query-grammar/src/query_grammar.rs
@@ -321,7 +321,17 @@ fn exists(inp: &str) -> IResult<&str, UserInputLeaf> {
        UserInputLeaf::Exists {
            field: String::new(),
        },
-        tuple((multispace0, char('*'))),
+        tuple((
+            multispace0,
+            char('*'),
+            peek(alt((
+                value(
+                    "",
+                    satisfy(|c: char| c.is_whitespace() || ESCAPE_IN_WORD.contains(&c)),
+                ),
+                eof,
+            ))),
+        )),
    )(inp)
 }

@@ -331,7 +341,14 @@ fn exists_precond(inp: &str) -> IResult<&str, (), ()> {
        peek(tuple((
            field_name,
            multispace0,
-            char('*'), // when we are here, we know it can't be anything but a exists
+            char('*'),
+            peek(alt((
+                value(
+                    "",
+                    satisfy(|c: char| c.is_whitespace() || ESCAPE_IN_WORD.contains(&c)),
+                ),
+                eof,
+            ))), // we need to check this isn't a wildcard query
        ))),
    )(inp)
    .map_err(|e| e.map(|_| ()))
@@ -767,7 +784,7 @@ fn occur_leaf(inp: &str) -> IResult<&str, (Option<Occur>, UserInputAst)> {
    tuple((fallible(occur_symbol), boosted_leaf))(inp)
 }

-#[allow(clippy::type_complexity)]
+#[expect(clippy::type_complexity)]
 fn operand_occur_leaf_infallible(
    inp: &str,
 ) -> JResult<&str, (Option<BinaryOperand>, Option<Occur>, Option<UserInputAst>)> {
@@ -1497,6 +1514,11 @@ mod test {
        test_is_parse_err(r#"field:(+a -"b c""#, r#"(+"field":a -"field":"b c")"#);
    }

+    #[test]
+    fn field_re_specification() {
+        test_parse_query_to_ast_helper(r#"field:(abc AND b:cde)"#, r#"(+"field":abc +"b":cde)"#);
+    }
+
    #[test]
    fn test_parse_query_single_term() {
        test_parse_query_to_ast_helper("abc", "abc");
@@ -1619,13 +1641,19 @@ mod test {

    #[test]
    fn test_exist_query() {
-        test_parse_query_to_ast_helper("a:*", "\"a\":*");
-        test_parse_query_to_ast_helper("a: *", "\"a\":*");
-        // an exist followed by default term being b
-        test_is_parse_err("a:*b", "(*\"a\":* *b)");
+        test_parse_query_to_ast_helper("a:*", "$exists(\"a\")");
+        test_parse_query_to_ast_helper("a: *", "$exists(\"a\")");

-        // this is a term query (not a phrase prefix)
+        test_parse_query_to_ast_helper(
+            "(hello AND toto:*) OR happy",
+            "(?(+hello +$exists(\"toto\")) ?happy)",
+        );
+        test_parse_query_to_ast_helper("(a:*)", "$exists(\"a\")");
+
+        // these are term/wildcard queries (not phrase prefixes)
        test_parse_query_to_ast_helper("a:b*", "\"a\":b*");
+        test_parse_query_to_ast_helper("a:*b", "\"a\":*b");
+        test_parse_query_to_ast_helper(r#"a:*def*"#, "\"a\":*def*");
    }

    #[test]
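
The extra `peek` added to `exists` and `exists_precond` separates an exists query (`a:*`) from a wildcard term (`a:*b`). The rule is that the `*` must be followed by whitespace, a word-terminating character, or the end of input. Below is a dependency-free sketch of that rule, assuming a hypothetical `WORD_TERMINATORS` set in place of the real `ESCAPE_IN_WORD` (whose contents are not shown in this diff); it does not use `nom` and is not the parser's actual code.

```rust
/// Hypothetical subset of word-terminating characters; the real
/// `ESCAPE_IN_WORD` constant may contain different characters.
const WORD_TERMINATORS: [char; 3] = ['(', ')', '"'];

/// Decides whether the text following `field:` denotes an exists query.
///
/// Mirrors the lookahead in the diff: optional whitespace, a `*`, and then
/// either end of input, whitespace, or a word terminator.
fn is_exists_value(value: &str) -> bool {
    let rest = match value.trim_start().strip_prefix('*') {
        Some(rest) => rest,
        None => return false,
    };
    match rest.chars().next() {
        None => true,
        Some(c) => c.is_whitespace() || WORD_TERMINATORS.contains(&c),
    }
}
```

Under this rule `*` and ` *` are treated as exists queries, while `*b`, `b*`, and `*def*` fall through to the term/wildcard path, which matches the expectations in `test_exist_query`.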
--- a/query-grammar/src/user_input_ast.rs
+++ b/query-grammar/src/user_input_ast.rs
@@ -1,9 +1,13 @@
 use std::fmt;
 use std::fmt::{Debug, Formatter};

+use serde::Serialize;
+
 use crate::Occur;

-#[derive(PartialEq, Clone)]
+#[derive(PartialEq, Clone, Serialize)]
+#[serde(tag = "type")]
+#[serde(rename_all = "snake_case")]
 pub enum UserInputLeaf {
    Literal(UserInputLiteral),
    All,
@@ -101,20 +105,22 @@ impl Debug for UserInputLeaf {
            }
            UserInputLeaf::All => write!(formatter, "*"),
            UserInputLeaf::Exists { field } => {
-                write!(formatter, "\"{field}\":*")
+                write!(formatter, "$exists(\"{field}\")")
            }
        }
    }
 }

-#[derive(Copy, Clone, Eq, PartialEq, Debug)]
+#[derive(Copy, Clone, Eq, PartialEq, Debug, Serialize)]
+#[serde(rename_all = "snake_case")]
 pub enum Delimiter {
    SingleQuotes,
    DoubleQuotes,
    None,
 }

-#[derive(PartialEq, Clone)]
+#[derive(PartialEq, Clone, Serialize)]
+#[serde(rename_all = "snake_case")]
 pub struct UserInputLiteral {
    pub field_name: Option<String>,
    pub phrase: String,
@@ -152,7 +158,9 @@ impl fmt::Debug for UserInputLiteral {
    }
 }

-#[derive(PartialEq, Debug, Clone)]
+#[derive(PartialEq, Debug, Clone, Serialize)]
+#[serde(tag = "type", content = "value")]
+#[serde(rename_all = "snake_case")]
 pub enum UserInputBound {
    Inclusive(String),
    Exclusive(String),
@@ -187,11 +195,38 @@ impl UserInputBound {
    }
 }

-#[derive(PartialEq, Clone)]
+#[derive(PartialEq, Clone, Serialize)]
+#[serde(into = "UserInputAstSerde")]
 pub enum UserInputAst {
    Clause(Vec<(Option<Occur>, UserInputAst)>),
-    Leaf(Box<UserInputLeaf>),
    Boost(Box<UserInputAst>, f64),
+    Leaf(Box<UserInputLeaf>),
+}
+
+#[derive(Serialize)]
+#[serde(tag = "type", rename_all = "snake_case")]
+enum UserInputAstSerde {
+    Bool {
+        clauses: Vec<(Option<Occur>, UserInputAst)>,
+    },
+    Boost {
+        underlying: Box<UserInputAst>,
+        boost: f64,
+    },
+    #[serde(untagged)]
+    Leaf(Box<UserInputLeaf>),
+}
+
+impl From<UserInputAst> for UserInputAstSerde {
+    fn from(ast: UserInputAst) -> Self {
+        match ast {
+            UserInputAst::Clause(clause) => UserInputAstSerde::Bool { clauses: clause },
+            UserInputAst::Boost(underlying, boost) => {
+                UserInputAstSerde::Boost { underlying, boost }
+            }
+            UserInputAst::Leaf(leaf) => UserInputAstSerde::Leaf(leaf),
+        }
+    }
 }

 impl UserInputAst {
@@ -285,3 +320,126 @@ impl fmt::Debug for UserInputAst {
        }
    }
 }
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    #[test]
+    fn test_all_leaf_serialization() {
+        let ast = UserInputAst::Leaf(Box::new(UserInputLeaf::All));
+        let json = serde_json::to_string(&ast).unwrap();
+        assert_eq!(json, r#"{"type":"all"}"#);
+    }
+
+    #[test]
+    fn test_literal_leaf_serialization() {
+        let literal = UserInputLiteral {
+            field_name: Some("title".to_string()),
+            phrase: "hello".to_string(),
+            delimiter: Delimiter::None,
+            slop: 0,
+            prefix: false,
+        };
+        let ast = UserInputAst::Leaf(Box::new(UserInputLeaf::Literal(literal)));
+        let json = serde_json::to_string(&ast).unwrap();
+        assert_eq!(
+            json,
+            r#"{"type":"literal","field_name":"title","phrase":"hello","delimiter":"none","slop":0,"prefix":false}"#
+        );
+    }
+
+    #[test]
+    fn test_range_leaf_serialization() {
+        let range = UserInputLeaf::Range {
+            field: Some("price".to_string()),
+            lower: UserInputBound::Inclusive("10".to_string()),
+            upper: UserInputBound::Exclusive("100".to_string()),
+        };
+        let ast = UserInputAst::Leaf(Box::new(range));
+        let json = serde_json::to_string(&ast).unwrap();
+        assert_eq!(
+            json,
+            r#"{"type":"range","field":"price","lower":{"type":"inclusive","value":"10"},"upper":{"type":"exclusive","value":"100"}}"#
+        );
+    }
+
+    #[test]
+    fn test_range_leaf_unbounded_serialization() {
+        let range = UserInputLeaf::Range {
+            field: Some("price".to_string()),
+            lower: UserInputBound::Inclusive("10".to_string()),
+            upper: UserInputBound::Unbounded,
+        };
+        let ast = UserInputAst::Leaf(Box::new(range));
+        let json = serde_json::to_string(&ast).unwrap();
+        assert_eq!(
+            json,
+            r#"{"type":"range","field":"price","lower":{"type":"inclusive","value":"10"},"upper":{"type":"unbounded"}}"#
+        );
+    }
+
+    #[test]
+    fn test_boost_serialization() {
+        let inner_ast = UserInputAst::Leaf(Box::new(UserInputLeaf::All));
+        let boost_ast = UserInputAst::Boost(Box::new(inner_ast), 2.5);
+        let json = serde_json::to_string(&boost_ast).unwrap();
+        assert_eq!(
+            json,
+            r#"{"type":"boost","underlying":{"type":"all"},"boost":2.5}"#
+        );
+    }
+
+    #[test]
+    fn test_boost_serialization2() {
+        let boost_ast = UserInputAst::Boost(
+            Box::new(UserInputAst::Clause(vec![
+                (
+                    Some(Occur::Must),
+                    UserInputAst::Leaf(Box::new(UserInputLeaf::All)),
+                ),
+                (
+                    Some(Occur::Should),
+                    UserInputAst::Leaf(Box::new(UserInputLeaf::Literal(UserInputLiteral {
+                        field_name: Some("title".to_string()),
+                        phrase: "hello".to_string(),
+                        delimiter: Delimiter::None,
+                        slop: 0,
+                        prefix: false,
+                    }))),
+                ),
+            ])),
+            2.5,
+        );
+        let json = serde_json::to_string(&boost_ast).unwrap();
+        assert_eq!(
+            json,
+            r#"{"type":"boost","underlying":{"type":"bool","clauses":[["must",{"type":"all"}],["should",{"type":"literal","field_name":"title","phrase":"hello","delimiter":"none","slop":0,"prefix":false}]]},"boost":2.5}"#
+        );
+    }
+
+    #[test]
+    fn test_clause_serialization() {
+        let clause = UserInputAst::Clause(vec![
+            (
+                Some(Occur::Must),
+                UserInputAst::Leaf(Box::new(UserInputLeaf::All)),
+            ),
+            (
+                Some(Occur::Should),
+                UserInputAst::Leaf(Box::new(UserInputLeaf::Literal(UserInputLiteral {
+                    field_name: Some("title".to_string()),
+                    phrase: "hello".to_string(),
+                    delimiter: Delimiter::None,
+                    slop: 0,
+                    prefix: false,
+                }))),
+            ),
+        ]);
+        let json = serde_json::to_string(&clause).unwrap();
+        assert_eq!(
+            json,
+            r#"{"type":"bool","clauses":[["must",{"type":"all"}],["should",{"type":"literal","field_name":"title","phrase":"hello","delimiter":"none","slop":0,"prefix":false}]]}"#
+        );
+    }
+}
--- a/src/aggregation/agg_req_with_accessor.rs
+++ b/src/aggregation/agg_req_with_accessor.rs
@@ -271,10 +271,6 @@ impl AggregationWithAccessor {
                field: ref field_name,
                ..
            })
-            | Count(CountAggregation {
-                field: ref field_name,
-                ..
-            })
            | Max(MaxAggregation {
                field: ref field_name,
                ..
@@ -299,6 +295,24 @@ impl AggregationWithAccessor {
                    get_ff_reader(reader, field_name, Some(get_numeric_or_date_column_types()))?;
                add_agg_with_accessor(&agg, accessor, column_type, &mut res)?;
            }
+            Count(CountAggregation {
+                field: ref field_name,
+                ..
+            }) => {
+                let allowed_column_types = [
+                    ColumnType::I64,
+                    ColumnType::U64,
+                    ColumnType::F64,
+                    ColumnType::Str,
+                    ColumnType::DateTime,
+                    ColumnType::Bool,
+                    ColumnType::IpAddr,
+                    // ColumnType::Bytes Unsupported
+                ];
+                let (accessor, column_type) =
+                    get_ff_reader(reader, field_name, Some(&allowed_column_types))?;
+                add_agg_with_accessor(&agg, accessor, column_type, &mut res)?;
+            }
            Percentiles(ref percentiles) => {
                let (accessor, column_type) = get_ff_reader(
                    reader,
--- a/src/aggregation/agg_result.rs
+++ b/src/aggregation/agg_result.rs
@@ -1,4 +1,5 @@
 //! Contains the final aggregation tree.
+//!
 //! This tree can be converted via the `into()` method from `IntermediateAggregationResults`.
 //! This conversion computes the final result. For example: The intermediate result contains
 //! intermediate average results, which is the sum and the number of values. The actual average is
@@ -187,7 +188,7 @@ pub enum BucketEntries<T> {
 }

 impl<T> BucketEntries<T> {
-    fn iter<'a>(&'a self) -> Box<dyn Iterator<Item = &T> + 'a> {
+    fn iter<'a>(&'a self) -> Box<dyn Iterator<Item = &'a T> + 'a> {
        match self {
            BucketEntries::Vec(vec) => Box::new(vec.iter()),
            BucketEntries::HashMap(map) => Box::new(map.values()),
--- a/src/aggregation/bucket/histogram/date_histogram.rs
+++ b/src/aggregation/bucket/histogram/date_histogram.rs
@@ -34,10 +34,10 @@ use crate::aggregation::*;
 pub struct DateHistogramAggregationReq {
    #[doc(hidden)]
    /// Only for validation
-    interval: Option<String>,
+    pub interval: Option<String>,
    #[doc(hidden)]
    /// Only for validation
-    calendar_interval: Option<String>,
+    pub calendar_interval: Option<String>,
    /// The field to aggregate on.
    pub field: String,
    /// The format to format dates. Unsupported currently.
@@ -244,7 +244,7 @@ fn parse_into_milliseconds(input: &str) -> Result<i64, AggregationError> {
 }

 #[cfg(test)]
-pub mod tests {
+pub(crate) mod tests {
    use pretty_assertions::assert_eq;

    use super::*;
--- a/src/aggregation/bucket/range.rs
+++ b/src/aggregation/bucket/range.rs
@@ -16,6 +16,7 @@ use crate::aggregation::*;
 use crate::TantivyError;

 /// Provide user-defined buckets to aggregate on.
+///
 /// Two special buckets will automatically be created to cover the whole range of values.
 /// The provided buckets have to be continuous.
 /// During the aggregation, the values extracted from the fast_field `field` will be checked
--- a/src/aggregation/bucket/term_agg.rs
+++ b/src/aggregation/bucket/term_agg.rs
@@ -1232,8 +1232,8 @@ mod tests {
    #[test]
    fn terms_aggregation_min_doc_count_special_case() -> crate::Result<()> {
        let terms_per_segment = vec![
-            vec!["terma", "terma", "termb", "termb", "termb", "termc"],
-            vec!["terma", "terma", "termb", "termc", "termc"],
+            vec!["terma", "terma", "termb", "termb", "termb"],
+            vec!["terma", "terma", "termb"],
        ];

        let index = get_test_index_from_terms(false, &terms_per_segment)?;
@@ -1255,8 +1255,6 @@ mod tests {
        assert_eq!(res["my_texts"]["buckets"][0]["doc_count"], 4);
        assert_eq!(res["my_texts"]["buckets"][1]["key"], "termb");
        assert_eq!(res["my_texts"]["buckets"][1]["doc_count"], 0);
-        assert_eq!(res["my_texts"]["buckets"][2]["key"], "termc");
-        assert_eq!(res["my_texts"]["buckets"][2]["doc_count"], 0);
        assert_eq!(res["my_texts"]["sum_other_doc_count"], 0);
        assert_eq!(res["my_texts"]["doc_count_error_upper_bound"], 0);

--- a/src/aggregation/metric/stats.rs
+++ b/src/aggregation/metric/stats.rs
@@ -220,9 +220,23 @@ impl SegmentStatsCollector {
                .column_block_accessor
                .fetch_block(docs, &agg_accessor.accessor);
        }
-        for val in agg_accessor.column_block_accessor.iter_vals() {
-            let val1 = f64_from_fastfield_u64(val, &self.field_type);
-            self.stats.collect(val1);
+        if [
+            ColumnType::I64,
+            ColumnType::U64,
+            ColumnType::F64,
+            ColumnType::DateTime,
+        ]
+        .contains(&self.field_type)
+        {
+            for val in agg_accessor.column_block_accessor.iter_vals() {
+                let val1 = f64_from_fastfield_u64(val, &self.field_type);
+                self.stats.collect(val1);
+            }
+        } else {
+            for _val in agg_accessor.column_block_accessor.iter_vals() {
+                // we ignore the value and simply record that we got something
+                self.stats.collect(0.0);
+            }
        }
    }
 }
@@ -435,6 +449,11 @@ mod tests {
                    "field": "score",
                },
            },
+            "count_str": {
+                "value_count": {
+                    "field": "text",
+                },
+            },
            "range": range_agg
        }))
        .unwrap();
@@ -500,6 +519,13 @@ mod tests {
            })
        );

+        assert_eq!(
+            res["count_str"],
+            json!({
+                "value": 7.0,
+            })
+        );
+
        Ok(())
    }

--- a/src/aggregation/mod.rs
+++ b/src/aggregation/mod.rs
@@ -180,7 +180,7 @@ pub(crate) fn deserialize_option_f64<'de, D>(deserializer: D) -> Result<Option<f
 where D: Deserializer<'de> {
    struct StringOrFloatVisitor;

-    impl<'de> Visitor<'de> for StringOrFloatVisitor {
+    impl Visitor<'_> for StringOrFloatVisitor {
        type Value = Option<f64>;

        fn expecting(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
@@ -226,7 +226,7 @@ pub(crate) fn deserialize_f64<'de, D>(deserializer: D) -> Result<f64, D::Error>
 where D: Deserializer<'de> {
    struct StringOrFloatVisitor;

-    impl<'de> Visitor<'de> for StringOrFloatVisitor {
+    impl Visitor<'_> for StringOrFloatVisitor {
        type Value = f64;

        fn expecting(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
@@ -366,8 +366,12 @@ impl PartialEq for Key {
    fn eq(&self, other: &Self) -> bool {
        match (self, other) {
            (Self::Str(l), Self::Str(r)) => l == r,
-            (Self::F64(l), Self::F64(r)) => l == r,
-            _ => false,
+            (Self::F64(l), Self::F64(r)) => l.to_bits() == r.to_bits(),
+            (Self::I64(l), Self::I64(r)) => l == r,
+            (Self::U64(l), Self::U64(r)) => l == r,
+            // we list all variants of the left operand to make sure this gets updated when we add
+            // variants to the enum
+            (Self::Str(_) | Self::F64(_) | Self::I64(_) | Self::U64(_), _) => false,
        }
    }
 }
@@ -578,7 +582,7 @@ mod tests {
            .set_indexing_options(
                TextFieldIndexing::default().set_index_option(IndexRecordOption::WithFreqs),
            )
-            .set_fast(None)
+            .set_fast(Some("raw"))
            .set_stored();
        let text_field = schema_builder.add_text_field("text", text_fieldtype);
        let date_field = schema_builder.add_date_field("date", FAST);
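Review note on the `Key` equality change above: comparing `f64` keys through `to_bits()` makes equality reflexive (a NaN key equals itself) at the cost of treating `0.0` and `-0.0` as different keys. The sketch below illustrates this on a stand-in enum; the `Key` defined here is a hypothetical copy limited to the four variants visible in the hunk, not tantivy's actual definition.

```rust
/// Hypothetical stand-in for the aggregation key type (assumption: only the
/// four variants that appear in the `PartialEq` impl of the hunk above).
#[derive(Debug, Clone)]
pub enum Key {
    Str(String),
    I64(i64),
    U64(u64),
    F64(f64),
}

impl PartialEq for Key {
    fn eq(&self, other: &Self) -> bool {
        match (self, other) {
            (Self::Str(l), Self::Str(r)) => l == r,
            // Bitwise comparison: NaN == NaN holds, but 0.0 != -0.0.
            (Self::F64(l), Self::F64(r)) => l.to_bits() == r.to_bits(),
            (Self::I64(l), Self::I64(r)) => l == r,
            (Self::U64(l), Self::U64(r)) => l == r,
            // Listing every left-hand variant forces a compile error here
            // whenever a new variant is added to the enum.
            (Self::Str(_) | Self::F64(_) | Self::I64(_) | Self::U64(_), _) => false,
        }
    }
}
```

With this definition, `Key::F64(f64::NAN) == Key::F64(f64::NAN)` holds, whereas the plain `l == r` comparison it replaces would return `false` for that pair. Keys of different variants, such as `Key::I64(1)` and `Key::U64(1)`, stay unequal.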
--- a/src/collector/facet_collector.rs
+++ b/src/collector/facet_collector.rs
@@ -13,7 +13,7 @@ struct Hit<'a> {
    facet: &'a Facet,
 }

-impl<'a> Eq for Hit<'a> {}
+impl Eq for Hit<'_> {}

 impl<'a> PartialEq<Hit<'a>> for Hit<'a> {
    fn eq(&self, other: &Hit<'_>) -> bool {
@@ -27,7 +27,7 @@ impl<'a> PartialOrd<Hit<'a>> for Hit<'a> {
    }
 }

-impl<'a> Ord for Hit<'a> {
+impl Ord for Hit<'_> {
    fn cmp(&self, other: &Self) -> Ordering {
        other
            .count
--- a/src/collector/filter_collector_wrapper.rs
+++ b/src/collector/filter_collector_wrapper.rs
@@ -182,6 +182,7 @@ where
 }

 /// A variant of the [`FilterCollector`] specialized for bytes fast fields, i.e.
+///
 /// it transparently wraps an inner [`Collector`] but filters documents
 /// based on the result of applying the predicate to the bytes fast field.
 ///
--- a/src/collector/mod.rs
+++ b/src/collector/mod.rs
@@ -495,4 +495,4 @@ where
 impl_downcast!(Fruit);

 #[cfg(test)]
-pub mod tests;
+pub(crate) mod tests;
--- a/src/collector/multi_collector.rs
+++ b/src/collector/multi_collector.rs
@@ -161,7 +161,7 @@ impl<TFruit: Fruit> FruitHandle<TFruit> {
 /// # Ok(())
 /// # }
 /// ```
-#[allow(clippy::type_complexity)]
+#[expect(clippy::type_complexity)]
 #[derive(Default)]
 pub struct MultiCollector<'a> {
    collector_wrappers: Vec<
@@ -190,7 +190,7 @@ impl<'a> MultiCollector<'a> {
    }
 }

-impl<'a> Collector for MultiCollector<'a> {
+impl Collector for MultiCollector<'_> {
    type Fruit = MultiFruit;
    type Child = MultiCollectorChild;

--- a/src/compat_tests.rs
+++ b/src/compat_tests.rs
@@ -44,8 +44,19 @@ fn test_format_6() {
    assert_date_time_precision(&index, DateTimePrecision::Microseconds);
 }

+/// The `quickwit` feature uses a different dictionary type, so this test is disabled for it.
+#[test]
 #[cfg(not(feature = "quickwit"))]
-fn assert_date_time_precision(index: &Index, precision: DateTimePrecision) {
+fn test_format_7() {
+    let path = path_for_version("7");
+
+    let index = Index::open_in_dir(path).expect("Failed to open index");
+    // dates are not truncated in v7 in the docstore
+    assert_date_time_precision(&index, DateTimePrecision::Nanoseconds);
+}
+
+#[cfg(not(feature = "quickwit"))]
+fn assert_date_time_precision(index: &Index, doc_store_precision: DateTimePrecision) {
    use collector::TopDocs;
    let reader = index.reader().expect("Failed to create reader");
    let searcher = reader.searcher();
@@ -75,6 +86,6 @@ fn assert_date_time_precision(index: &Index, precision: DateTimePrecision) {
        .as_datetime()
        .unwrap();

-    let expected = DateTime::from_timestamp_nanos(123456).truncate(precision);
+    let expected = DateTime::from_timestamp_nanos(123456).truncate(doc_store_precision);
    assert_eq!(date_value, expected,);
 }
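Review note on the compat tests above: the expected value is the nanosecond timestamp `123456` truncated to the precision the doc store kept for that format version. A minimal sketch of that arithmetic follows, assuming truncation rounds toward zero to a multiple of the unit. `Precision` and `truncate_nanos` are hypothetical names for illustration, not tantivy's API.

```rust
/// Hypothetical precision levels, mirroring the names used in the tests above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Precision {
    Seconds,
    Milliseconds,
    Microseconds,
    Nanoseconds,
}

/// Truncates a timestamp expressed in nanoseconds to the given precision.
/// Assumption: integer division, i.e. rounding toward zero.
pub fn truncate_nanos(timestamp_nanos: i64, precision: Precision) -> i64 {
    let unit = match precision {
        Precision::Seconds => 1_000_000_000,
        Precision::Milliseconds => 1_000_000,
        Precision::Microseconds => 1_000,
        Precision::Nanoseconds => 1,
    };
    (timestamp_nanos / unit) * unit
}
```

Under these assumptions, microsecond precision (as in `test_format_6`) turns `123456` into `123000`, while nanosecond precision (as in `test_format_7`) leaves it unchanged. This is why the v7 test passes `DateTimePrecision::Nanoseconds`.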
--- a/src/core/executor.rs
+++ b/src/core/executor.rs
@@ -41,16 +41,12 @@ impl Executor {
    ///
    /// Regardless of the executor (`SingleThread` or `ThreadPool`), panics in the task
    /// will propagate to the caller.
-    pub fn map<
+    pub fn map<A, R, F>(&self, f: F, args: impl Iterator<Item = A>) -> crate::Result<Vec<R>>
+    where
        A: Send,
        R: Send,
-        AIterator: Iterator<Item = A>,
        F: Sized + Sync + Fn(A) -> crate::Result<R>,
-    >(
-        &self,
-        f: F,
-        args: AIterator,
-    ) -> crate::Result<Vec<R>> {
+    {
        match self {
            Executor::SingleThread => args.map(f).collect::<crate::Result<_>>(),
            Executor::ThreadPool(pool) => {
--- a/src/core/json_utils.rs
+++ b/src/core/json_utils.rs
@@ -71,7 +71,7 @@ pub fn json_path_sep_to_dot(path: &mut str) {
    }
 }

-#[allow(clippy::too_many_arguments)]
+#[expect(clippy::too_many_arguments)]
 fn index_json_object<'a, V: Value<'a>>(
    doc: DocId,
    json_visitor: V::ObjectIter,
@@ -101,7 +101,7 @@ fn index_json_object<'a, V: Value<'a>>(
    }
 }

-#[allow(clippy::too_many_arguments)]
+#[expect(clippy::too_many_arguments)]
 pub(crate) fn index_json_value<'a, V: Value<'a>>(
    doc: DocId,
    json_value: V,
--- a/src/directory/directory.rs
+++ b/src/directory/directory.rs
@@ -39,7 +39,7 @@ impl RetryPolicy {
 /// The `DirectoryLock` is an object that represents a file lock.
 ///
 /// It is associated with a lock file, that gets deleted on `Drop.`
-#[allow(dead_code)]
+#[expect(dead_code)]
 pub struct DirectoryLock(Box<dyn Send + Sync + 'static>);

 struct DirectoryLockGuard {
--- a/src/directory/directory_lock.rs
+++ b/src/directory/directory_lock.rs
@@ -48,6 +48,7 @@ pub static INDEX_WRITER_LOCK: Lazy<Lock> = Lazy::new(|| Lock {
 });
 /// The meta lock file is here to protect the segment files being opened by
 /// `IndexReader::reload()` from being garbage collected.
+///
 /// It makes it possible for another process to safely consume
 /// our index in-writing. Ideally, we may have preferred `RWLock` semantics
 /// here, but it is difficult to achieve on Windows.
--- a/src/directory/footer.rs
+++ b/src/directory/footer.rs
@@ -1,3 +1,9 @@
+//! The footer is a small metadata structure that is appended at the end of every file.
+//!
+//! The footer is used to store a checksum of the file content.
+//! The footer also stores the version of the index format.
+//! This version is used to detect incompatibility between the index and the library version.
+
 use std::io;
 use std::io::Write;

@@ -20,20 +26,22 @@ type CrcHashU32 = u32;
 /// A Footer is appended to every file
 #[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
 pub struct Footer {
+    /// The version of the index format
    pub version: Version,
+    /// The crc32 hash of the body
    pub crc: CrcHashU32,
 }

 impl Footer {
-    pub fn new(crc: CrcHashU32) -> Self {
+    pub(crate) fn new(crc: CrcHashU32) -> Self {
        let version = crate::VERSION.clone();
        Footer { version, crc }
    }

-    pub fn crc(&self) -> CrcHashU32 {
+    pub(crate) fn crc(&self) -> CrcHashU32 {
        self.crc
    }
-    pub fn append_footer<W: io::Write>(&self, mut write: &mut W) -> io::Result<()> {
+    pub(crate) fn append_footer<W: io::Write>(&self, mut write: &mut W) -> io::Result<()> {
        let mut counting_write = CountingWriter::wrap(&mut write);
        counting_write.write_all(serde_json::to_string(&self)?.as_ref())?;
        let footer_payload_len = counting_write.written_bytes();
@@ -42,6 +50,7 @@ impl Footer {
        Ok(())
    }

+    /// Extracts the tantivy Footer from the file and returns the footer and the rest of the file
    pub fn extract_footer(file: FileSlice) -> io::Result<(Footer, FileSlice)> {
        if file.len() < 4 {
            return Err(io::Error::new(
--- a/src/directory/mmap_directory.rs
+++ b/src/directory/mmap_directory.rs
@@ -244,7 +244,7 @@ impl MmapDirectory {
                directory_path,
            )));
        }
-        #[allow(clippy::bind_instead_of_map)]
+        #[expect(clippy::bind_instead_of_map)]
        let canonical_path: PathBuf = directory_path.canonicalize().or_else(|io_err| {
            let directory_path = directory_path.to_owned();

--- a/src/directory/mod.rs
+++ b/src/directory/mod.rs
@@ -6,7 +6,7 @@ mod mmap_directory;
 mod directory;
 mod directory_lock;
 mod file_watcher;
-mod footer;
+pub mod footer;
 mod managed_directory;
 mod ram_directory;
 mod watch_event_router;
--- a/src/directory/watch_event_router.rs
+++ b/src/directory/watch_event_router.rs
@@ -32,7 +32,7 @@ pub struct WatchCallbackList {
 /// file change is detected.
 #[must_use = "This `WatchHandle` controls the lifetime of the watch and should therefore be used."]
 #[derive(Clone)]
-#[allow(dead_code)]
+#[expect(dead_code)]
 pub struct WatchHandle(Arc<WatchCallback>);

 impl WatchHandle {
--- a/src/docset.rs
+++ b/src/docset.rs
@@ -117,7 +117,7 @@ pub trait DocSet: Send {
    }
 }

-impl<'a> DocSet for &'a mut dyn DocSet {
+impl DocSet for &mut dyn DocSet {
    fn advance(&mut self) -> u32 {
        (**self).advance()
    }
--- a/src/fastfield/mod.rs
+++ b/src/fastfield/mod.rs
@@ -942,7 +942,7 @@ mod tests {

        let numbers = [100, 200, 300];
        let test_range = |range: RangeInclusive<u64>| {
-            let expected_count = numbers.iter().filter(|num| range.contains(num)).count();
+            let expected_count = numbers.iter().filter(|num| range.contains(*num)).count();
            let mut vec = vec![];
            field.get_row_ids_for_value_range(range, 0..u32::MAX, &mut vec);
            assert_eq!(vec.len(), expected_count);
@@ -1020,7 +1020,7 @@ mod tests {

        let numbers = [1000, 1001, 1003];
        let test_range = |range: RangeInclusive<u64>| {
-            let expected_count = numbers.iter().filter(|num| range.contains(num)).count();
+            let expected_count = numbers.iter().filter(|num| range.contains(*num)).count();
            let mut vec = vec![];
            field.get_row_ids_for_value_range(range, 0..u32::MAX, &mut vec);
            assert_eq!(vec.len(), expected_count);
--- a/src/fastfield/readers.rs
+++ b/src/fastfield/readers.rs
@@ -217,7 +217,7 @@ impl FastFieldReaders {
        Ok(dynamic_column.into())
    }

-    /// Returning a `dynamic_column_handle`.
+    /// Returns a `dynamic_column_handle`.
    pub fn dynamic_column_handle(
        &self,
        field_name: &str,
@@ -234,7 +234,7 @@ impl FastFieldReaders {
        Ok(dynamic_column_handle_opt)
    }

-    /// Returning all `dynamic_column_handle`.
+    /// Returns all `dynamic_column_handle` that match the given field name.
    pub fn dynamic_column_handles(
        &self,
        field_name: &str,
@@ -250,6 +250,22 @@ impl FastFieldReaders {
        Ok(dynamic_column_handles)
    }

+    /// Returns all `dynamic_column_handle` that are inner fields of the provided JSON path.
+    pub fn dynamic_subpath_column_handles(
+        &self,
+        root_path: &str,
+    ) -> crate::Result<Vec<DynamicColumnHandle>> {
+        let Some(resolved_field_name) = self.resolve_field(root_path)? else {
+            return Ok(Vec::new());
+        };
+        let dynamic_column_handles = self
+            .columnar
+            .read_subpath_columns(&resolved_field_name)?
+            .into_iter()
+            .collect();
+        Ok(dynamic_column_handles)
+    }
+
    #[doc(hidden)]
    pub async fn list_dynamic_column_handles(
        &self,
@@ -265,6 +281,21 @@ impl FastFieldReaders {
        Ok(columns)
    }

+    #[doc(hidden)]
+    pub async fn list_subpath_dynamic_column_handles(
+        &self,
+        root_path: &str,
+    ) -> crate::Result<Vec<DynamicColumnHandle>> {
+        let Some(resolved_field_name) = self.resolve_field(root_path)? else {
+            return Ok(Vec::new());
+        };
+        let columns = self
+            .columnar
+            .read_subpath_columns_async(&resolved_field_name)
+            .await?;
+        Ok(columns)
+    }
+
    /// Returns the `u64` column used to represent any `u64`-mapped typed (String/Bytes term ids,
    /// i64, u64, f64, DateTime).
    ///
@@ -476,6 +507,15 @@ mod tests {
            .iter()
            .any(|column| column.column_type() == ColumnType::Str));

-        println!("*** {:?}", fast_fields.columnar().list_columns());
+        let json_columns = fast_fields.dynamic_column_handles("json").unwrap();
+        assert_eq!(json_columns.len(), 0);
+
+        let json_subcolumns = fast_fields.dynamic_subpath_column_handles("json").unwrap();
+        assert_eq!(json_subcolumns.len(), 3);
+
+        let foo_subcolumns = fast_fields
+            .dynamic_subpath_column_handles("json.foo")
+            .unwrap();
+        assert_eq!(foo_subcolumns.len(), 0);
    }
 }
--- a/src/fieldnorm/reader.rs
+++ b/src/fieldnorm/reader.rs
@@ -149,7 +149,7 @@ impl FieldNormReader {
    }

    #[cfg(test)]
-    pub fn for_test(field_norms: &[u32]) -> FieldNormReader {
+    pub(crate) fn for_test(field_norms: &[u32]) -> FieldNormReader {
        let field_norms_id = field_norms
            .iter()
            .cloned()
--- a/src/functional_test.rs
+++ b/src/functional_test.rs
@@ -1,12 +1,9 @@
-#![allow(deprecated)] // Remove with index sorting
-
 use std::collections::HashSet;

 use rand::{thread_rng, Rng};

 use crate::indexer::index_writer::MEMORY_BUDGET_NUM_BYTES_MIN;
 use crate::schema::*;
-#[allow(deprecated)]
 use crate::{doc, schema, Index, IndexWriter, Searcher};

 fn check_index_content(searcher: &Searcher, vals: &[u64]) -> crate::Result<()> {
--- a/src/index/index.rs
+++ b/src/index/index.rs
@@ -15,7 +15,9 @@ use crate::directory::MmapDirectory;
 use crate::directory::{Directory, ManagedDirectory, RamDirectory, INDEX_WRITER_LOCK};
 use crate::error::{DataCorruption, TantivyError};
 use crate::index::{IndexMeta, SegmentId, SegmentMeta, SegmentMetaInventory};
-use crate::indexer::index_writer::{MAX_NUM_THREAD, MEMORY_BUDGET_NUM_BYTES_MIN};
+use crate::indexer::index_writer::{
+    IndexWriterOptions, MAX_NUM_THREAD, MEMORY_BUDGET_NUM_BYTES_MIN,
+};
 use crate::indexer::segment_updater::save_metas;
 use crate::indexer::{IndexWriter, SingleSegmentIndexWriter};
 use crate::reader::{IndexReader, IndexReaderBuilder};
@@ -519,6 +521,43 @@ impl Index {
        load_metas(self.directory(), &self.inventory)
    }

+    /// Open a new index writer with the given options. Attempts to acquire a lockfile.
+    ///
+    /// The lockfile should be deleted on drop, but it is possible
+    /// that due to a panic or other error, a stale lockfile will be
+    /// left in the index directory. If you are sure that no other
+    /// `IndexWriter` on the system is accessing the index directory,
+    /// it is safe to manually delete the lockfile.
+    ///
+    /// - `options` defines the writer configuration which includes things like buffer sizes,
+    ///   indexer threads, etc...
+    ///
+    /// # Errors
+    /// If the lockfile already exists, returns `TantivyError::LockFailure`.
+    /// If the memory arena per thread is too small or too big, returns
+    /// `TantivyError::InvalidArgument`
+    pub fn writer_with_options<D: Document>(
+        &self,
+        options: IndexWriterOptions,
+    ) -> crate::Result<IndexWriter<D>> {
+        let directory_lock = self
+            .directory
+            .acquire_lock(&INDEX_WRITER_LOCK)
+            .map_err(|err| {
+                TantivyError::LockFailure(
+                    err,
+                    Some(
+                        "Failed to acquire index lock. If you are using a regular directory, this \
+                         means there is already an `IndexWriter` working on this `Directory`, in \
+                         this process or in a different process."
+                            .to_string(),
+                    ),
+                )
+            })?;
+
+        IndexWriter::new(self, options, directory_lock)
+    }
+
    /// Open a new index writer. Attempts to acquire a lockfile.
    ///
    /// The lockfile should be deleted on drop, but it is possible
@@ -543,27 +582,12 @@ impl Index {
        num_threads: usize,
        overall_memory_budget_in_bytes: usize,
    ) -> crate::Result<IndexWriter<D>> {
-        let directory_lock = self
-            .directory
-            .acquire_lock(&INDEX_WRITER_LOCK)
-            .map_err(|err| {
-                TantivyError::LockFailure(
-                    err,
-                    Some(
-                        "Failed to acquire index lock. If you are using a regular directory, this \
-                         means there is already an `IndexWriter` working on this `Directory`, in \
-                         this process or in a different process."
-                            .to_string(),
-                    ),
-                )
-            })?;
        let memory_arena_in_bytes_per_thread = overall_memory_budget_in_bytes / num_threads;
-        IndexWriter::new(
-            self,
-            num_threads,
-            memory_arena_in_bytes_per_thread,
-            directory_lock,
-        )
+        let options = IndexWriterOptions::builder()
+            .num_worker_threads(num_threads)
+            .memory_budget_per_thread(memory_arena_in_bytes_per_thread)
+            .build();
+        self.writer_with_options(options)
    }

    /// Helper to create an index writer for tests.
--- a/src/index/inverted_index_reader.rs
+++ b/src/index/inverted_index_reader.rs
@@ -3,6 +3,12 @@ use std::io;
 use common::json_path_writer::JSON_END_OF_PATH;
 use common::BinarySerializable;
 use fnv::FnvHashSet;
+#[cfg(feature = "quickwit")]
+use futures_util::{FutureExt, StreamExt, TryStreamExt};
+#[cfg(feature = "quickwit")]
+use itertools::Itertools;
+#[cfg(feature = "quickwit")]
+use tantivy_fst::automaton::{AlwaysMatch, Automaton};

 use crate::directory::FileSlice;
 use crate::positions::PositionReader;
@@ -31,7 +37,6 @@ pub struct InvertedIndexReader {
 }

 impl InvertedIndexReader {
-    #[allow(clippy::needless_pass_by_value)] // for symmetry
    pub(crate) fn new(
        termdict: TermDictionary,
        postings_file_slice: FileSlice,
@@ -205,16 +210,6 @@ impl InvertedIndexReader {
            .transpose()
    }

-    pub(crate) fn read_postings_no_deletes(
-        &self,
-        term: &Term,
-        option: IndexRecordOption,
-    ) -> io::Result<Option<SegmentPostings>> {
-        self.get_term_info(term)?
-            .map(|term_info| self.read_postings_from_terminfo(&term_info, option))
-            .transpose()
-    }
-
    /// Returns the number of documents containing the term.
    pub fn doc_freq(&self, term: &Term) -> io::Result<u32> {
        Ok(self
@@ -230,13 +225,18 @@ impl InvertedIndexReader {
        self.termdict.get_async(term.serialized_value_bytes()).await
    }

-    async fn get_term_range_async(
-        &self,
+    async fn get_term_range_async<'a, A: Automaton + 'a>(
+        &'a self,
        terms: impl std::ops::RangeBounds<Term>,
+        automaton: A,
        limit: Option<u64>,
-    ) -> io::Result<impl Iterator<Item = TermInfo> + '_> {
+        merge_holes_under_bytes: usize,
+    ) -> io::Result<impl Iterator<Item = TermInfo> + 'a>
+    where
+        A::State: Clone,
+    {
        use std::ops::Bound;
-        let range_builder = self.termdict.range();
+        let range_builder = self.termdict.search(automaton);
        let range_builder = match terms.start_bound() {
            Bound::Included(bound) => range_builder.ge(bound.serialized_value_bytes()),
            Bound::Excluded(bound) => range_builder.gt(bound.serialized_value_bytes()),
@@ -253,7 +253,9 @@ impl InvertedIndexReader {
            range_builder
        };

-        let mut stream = range_builder.into_stream_async().await?;
+        let mut stream = range_builder
+            .into_stream_async_merging_holes(merge_holes_under_bytes)
+            .await?;

        let iter = std::iter::from_fn(move || stream.next().map(|(_k, v)| v.clone()));

@@ -299,7 +301,9 @@ impl InvertedIndexReader {
        limit: Option<u64>,
        with_positions: bool,
    ) -> io::Result<bool> {
-        let mut term_info = self.get_term_range_async(terms, limit).await?;
+        let mut term_info = self
+            .get_term_range_async(terms, AlwaysMatch, limit, 0)
+            .await?;

        let Some(first_terminfo) = term_info.next() else {
            // no key matches, nothing more to load
@@ -326,6 +330,84 @@ impl InvertedIndexReader {
        Ok(true)
    }

+    /// Warms up the block postings of all terms matched by the given automaton.
+    /// This method is for an advanced usage only.
+    ///
+    /// Returns `true` if at least one matching term was found in the dictionary.
+    pub async fn warm_postings_automaton<
+        A: Automaton + Clone + Send + 'static,
+        E: FnOnce(Box<dyn FnOnce() -> io::Result<()> + Send>) -> F,
+        F: std::future::Future<Output = io::Result<()>>,
+    >(
+        &self,
+        automaton: A,
+        // with_positions: bool, at the moment we have no use for it, and supporting it would add
+        // complexity to the coalesce
+        executor: E,
+    ) -> io::Result<bool>
+    where
+        A::State: Clone,
+    {
+        // merge holes under 4MiB, that's how many bytes we can hope to receive during a TTFB from
+        // S3 (~80MiB/s, and 50ms latency)
+        const MERGE_HOLES_UNDER_BYTES: usize = (80 * 1024 * 1024 * 50) / 1000;
+        // we build a first iterator to download everything. Simply calling the function already
+        // downloads everything we need from the sstable, but doesn't start iterating over it.
+        let _term_info_iter = self
+            .get_term_range_async(.., automaton.clone(), None, MERGE_HOLES_UNDER_BYTES)
+            .await?;
+
+        let (sender, posting_ranges_to_load_stream) = futures_channel::mpsc::unbounded();
+        let termdict = self.termdict.clone();
+        let cpu_bound_task = move || {
+            // then we build a 2nd iterator, this one with no holes, so we don't go through blocks
+            // we can't match.
+            // This makes the assumption there is a caching layer below us, which gives sync read
+            // for free after the initial async access. This might not always be true, but is in
+            // Quickwit.
+            // We build things from this closure otherwise we get into lifetime issues that can only
+            // be solved with self referential structs. Returning an io::Result from here is a bit
+            // more leaky abstraction-wise, but a lot better than the alternative
+            let mut stream = termdict.search(automaton).into_stream()?;
+
+            // we could do without an iterator, but this gives us access to `coalesce`, which
+            // simplifies things
+            let posting_ranges_iter =
+                std::iter::from_fn(move || stream.next().map(|(_k, v)| v.postings_range.clone()));
+
+            let merged_posting_ranges_iter = posting_ranges_iter.coalesce(|range1, range2| {
+                if range1.end + MERGE_HOLES_UNDER_BYTES >= range2.start {
+                    Ok(range1.start..range2.end)
+                } else {
+                    Err((range1, range2))
+                }
+            });
+
+            for posting_range in merged_posting_ranges_iter {
+                if let Err(_) = sender.unbounded_send(posting_range) {
+                    // this should happen only when search is cancelled
+                    return Err(io::Error::other("failed to send posting range back"));
+                }
+            }
+            Ok(())
+        };
+        let task_handle = executor(Box::new(cpu_bound_task));
+
+        let posting_downloader = posting_ranges_to_load_stream
+            .map(|posting_slice| {
+                self.postings_file_slice
+                    .read_bytes_slice_async(posting_slice)
+                    .map(|result| result.map(|_slice| ()))
+            })
+            .buffer_unordered(5)
+            .try_collect::<Vec<()>>();
+
+        let (_, slices_downloaded) =
+            futures_util::future::try_join(task_handle, posting_downloader).await?;
+
+        Ok(!slices_downloaded.is_empty())
+    }
+
    /// Warmup the block postings for all terms.
    /// This method is for an advanced usage only.
    ///
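Note on the hole-merging step in `warm_postings_automaton` above: the `coalesce` closure joins two consecutive posting ranges whenever the gap between them is at most `MERGE_HOLES_UNDER_BYTES`, trading a few wasted bytes for fewer round trips. The following is a minimal std-only sketch of that idea, assuming the input ranges are sorted by start offset. `coalesce_ranges` is a hypothetical helper written for illustration and is not part of tantivy; it also takes the larger of the two end offsets, which the diff does not need because posting ranges do not overlap.

```rust
use std::ops::Range;

/// Merges sorted byte ranges whenever the hole between two neighbours is at
/// most `max_hole` bytes. Hypothetical helper for illustration, not tantivy API.
fn coalesce_ranges(ranges: &[Range<usize>], max_hole: usize) -> Vec<Range<usize>> {
    let mut merged: Vec<Range<usize>> = Vec::new();
    for range in ranges {
        if let Some(last) = merged.last_mut() {
            // Same test as in the diff: `range1.end + hole >= range2.start`.
            if last.end + max_hole >= range.start {
                // Extend the previous range instead of starting a new one.
                last.end = last.end.max(range.end);
                continue;
            }
        }
        // The hole is too large (or this is the first range): keep it separate.
        merged.push(range.clone());
    }
    merged
}
```

With a threshold of 5 bytes, the ranges `0..10`, `12..20` and `100..110` collapse into `0..20` and `100..110`: the 2-byte hole is absorbed, the 80-byte hole is not.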
--- a/src/index/segment_component.rs
+++ b/src/index/segment_component.rs
@@ -1,6 +1,7 @@
 use std::slice;

 /// Enum describing each component of a tantivy segment.
+///
 /// Each component is stored in its own file,
 /// using the pattern `segment_uuid`.`component_extension`,
 /// except the delete component that takes an `segment_uuid`.`delete_opstamp`.`component_extension`
--- a/src/index/segment_reader.rs
+++ b/src/index/segment_reader.rs
@@ -478,7 +478,7 @@ pub fn merge_field_meta_data(
        .into_iter()
        .kmerge_by(|left, right| left < right)
        // TODO: Remove allocation
-        .group_by(|el| (el.field_name.to_string(), el.typ))
+        .chunk_by(|el| (el.field_name.to_string(), el.typ))
    {
        let mut merged: FieldMetadata = group.next().unwrap();
        for el in group {
--- a/src/indexer/delete_queue.rs
+++ b/src/indexer/delete_queue.rs
@@ -187,7 +187,6 @@ impl DeleteCursor {
        }
    }

-    #[allow(clippy::wrong_self_convention)]
    fn is_behind_opstamp(&mut self, target_opstamp: Opstamp) -> bool {
        self.get()
            .map(|operation| operation.opstamp < target_opstamp)
--- a/src/indexer/doc_opstamp_mapping.rs
+++ b/src/indexer/doc_opstamp_mapping.rs
@@ -21,7 +21,7 @@ pub enum DocToOpstampMapping<'a> {
    None,
 }

-impl<'a> DocToOpstampMapping<'a> {
+impl DocToOpstampMapping<'_> {
    /// Assess whether a document should be considered deleted given that it contains
    /// a deleted term that was deleted at the opstamp: `delete_opstamp`.
    ///
--- a/src/indexer/index_writer.rs
+++ b/src/indexer/index_writer.rs
@@ -45,6 +45,23 @@ fn error_in_index_worker_thread(context: &str) -> TantivyError {
    ))
 }

+#[derive(Clone, bon::Builder)]
+/// A builder for creating a new [IndexWriter] for an index.
+pub struct IndexWriterOptions {
+    #[builder(default = MEMORY_BUDGET_NUM_BYTES_MIN)]
+    /// The memory budget per indexer thread.
+    ///
+    /// When an indexer thread has buffered this much data in memory
+    /// it will flush the segment to disk (although this is not searchable until commit is called.)
+    memory_budget_per_thread: usize,
+    #[builder(default = 1)]
+    /// The number of indexer worker threads to use.
+    num_worker_threads: usize,
+    #[builder(default = 4)]
+    /// Defines the number of merger threads to use.
+    num_merge_threads: usize,
+}
+
 /// `IndexWriter` is the user entry-point to add document to an index.
 ///
 /// It manages a small number of indexing thread, as well as a shared
@@ -58,8 +75,7 @@ pub struct IndexWriter<D: Document = TantivyDocument> {

    index: Index,

-    // The memory budget per thread, after which a commit is triggered.
-    memory_budget_in_bytes_per_thread: usize,
+    options: IndexWriterOptions,

    workers_join_handle: Vec<JoinHandle<crate::Result<()>>>,

@@ -70,8 +86,6 @@ pub struct IndexWriter<D: Document = TantivyDocument> {

    worker_id: usize,

-    num_threads: usize,
-
    delete_queue: DeleteQueue,

    stamper: Stamper,
@@ -265,23 +279,27 @@ impl<D: Document> IndexWriter<D> {
    /// `TantivyError::InvalidArgument`
    pub(crate) fn new(
        index: &Index,
-        num_threads: usize,
-        memory_budget_in_bytes_per_thread: usize,
+        options: IndexWriterOptions,
        directory_lock: DirectoryLock,
    ) -> crate::Result<Self> {
-        if memory_budget_in_bytes_per_thread < MEMORY_BUDGET_NUM_BYTES_MIN {
+        if options.memory_budget_per_thread < MEMORY_BUDGET_NUM_BYTES_MIN {
            let err_msg = format!(
                "The memory arena in bytes per thread needs to be at least \
                 {MEMORY_BUDGET_NUM_BYTES_MIN}."
            );
            return Err(TantivyError::InvalidArgument(err_msg));
        }
-        if memory_budget_in_bytes_per_thread >= MEMORY_BUDGET_NUM_BYTES_MAX {
+        if options.memory_budget_per_thread >= MEMORY_BUDGET_NUM_BYTES_MAX {
            let err_msg = format!(
                "The memory arena in bytes per thread cannot exceed {MEMORY_BUDGET_NUM_BYTES_MAX}"
            );
            return Err(TantivyError::InvalidArgument(err_msg));
        }
+        if options.num_worker_threads == 0 {
+            let err_msg = "At least one worker thread is required, got 0".to_string();
+            return Err(TantivyError::InvalidArgument(err_msg));
+        }
+
        let (document_sender, document_receiver) =
            crossbeam_channel::bounded(PIPELINE_MAX_SIZE_IN_DOCS);

@@ -291,13 +309,17 @@ impl<D: Document> IndexWriter<D> {

        let stamper = Stamper::new(current_opstamp);

-        let segment_updater =
-            SegmentUpdater::create(index.clone(), stamper.clone(), &delete_queue.cursor())?;
+        let segment_updater = SegmentUpdater::create(
+            index.clone(),
+            stamper.clone(),
+            &delete_queue.cursor(),
+            options.num_merge_threads,
+        )?;

        let mut index_writer = Self {
            _directory_lock: Some(directory_lock),

-            memory_budget_in_bytes_per_thread,
+            options: options.clone(),
            index: index.clone(),
            index_writer_status: IndexWriterStatus::from(document_receiver),
            operation_sender: document_sender,
@@ -305,7 +327,6 @@ impl<D: Document> IndexWriter<D> {
            segment_updater,

            workers_join_handle: vec![],
-            num_threads,

            delete_queue,

@@ -398,7 +419,7 @@ impl<D: Document> IndexWriter<D> {

        let mut delete_cursor = self.delete_queue.cursor();

-        let mem_budget = self.memory_budget_in_bytes_per_thread;
+        let mem_budget = self.options.memory_budget_per_thread;
        let index = self.index.clone();
        let join_handle: JoinHandle<crate::Result<()>> = thread::Builder::new()
            .name(format!("thrd-tantivy-index{}", self.worker_id))
@@ -451,7 +472,7 @@ impl<D: Document> IndexWriter<D> {
    }

    fn start_workers(&mut self) -> crate::Result<()> {
-        for _ in 0..self.num_threads {
+        for _ in 0..self.options.num_worker_threads {
            self.add_indexing_worker()?;
        }
        Ok(())
@@ -553,12 +574,7 @@ impl<D: Document> IndexWriter<D> {
            .take()
            .expect("The IndexWriter does not have any lock. This is a bug, please report.");

-        let new_index_writer = IndexWriter::new(
-            &self.index,
-            self.num_threads,
-            self.memory_budget_in_bytes_per_thread,
-            directory_lock,
-        )?;
+        let new_index_writer = IndexWriter::new(&self.index, self.options.clone(), directory_lock)?;

        // the current `self` is dropped right away because of this call.
        //
@@ -812,7 +828,7 @@ mod tests {
    use crate::directory::error::LockError;
    use crate::error::*;
    use crate::indexer::index_writer::MEMORY_BUDGET_NUM_BYTES_MIN;
-    use crate::indexer::NoMergePolicy;
+    use crate::indexer::{IndexWriterOptions, NoMergePolicy};
    use crate::query::{QueryParser, TermQuery};
    use crate::schema::{
        self, Facet, FacetOptions, IndexRecordOption, IpAddrOptions, JsonObjectOptions,
@@ -2533,4 +2549,36 @@ mod tests {
        index_writer.commit().unwrap();
        Ok(())
    }
+
+    #[test]
+    fn test_writer_options_validation() {
+        let mut schema_builder = Schema::builder();
+        let _field = schema_builder.add_bool_field("example", STORED);
+        let index = Index::create_in_ram(schema_builder.build());
+
+        let opt_wo_threads = IndexWriterOptions::builder().num_worker_threads(0).build();
+        let result = index.writer_with_options::<TantivyDocument>(opt_wo_threads);
+        assert!(result.is_err(), "Writer should reject 0 thread count");
+        assert!(matches!(result, Err(TantivyError::InvalidArgument(_))));
+
+        let opt_with_low_memory = IndexWriterOptions::builder()
+            .memory_budget_per_thread(10 << 10)
+            .build();
+        let result = index.writer_with_options::<TantivyDocument>(opt_with_low_memory);
+        assert!(
+            result.is_err(),
+            "Writer should reject options with too low memory size"
+        );
+        assert!(matches!(result, Err(TantivyError::InvalidArgument(_))));
+
+        let opt_with_high_memory = IndexWriterOptions::builder()
+            .memory_budget_per_thread(5 << 30)
+            .build();
+        let result = index.writer_with_options::<TantivyDocument>(opt_with_high_memory);
+        assert!(
+            result.is_err(),
+            "Writer should reject options with too high memory size"
+        );
+        assert!(matches!(result, Err(TantivyError::InvalidArgument(_))));
+    }
 }
--- a/src/indexer/log_merge_policy.rs
+++ b/src/indexer/log_merge_policy.rs
@@ -104,7 +104,7 @@ impl MergePolicy for LogMergePolicy {

        let mut current_max_log_size = f64::MAX;
        let mut levels = vec![];
-        for (_, merge_group) in &size_sorted_segments.into_iter().group_by(|segment| {
+        for (_, merge_group) in &size_sorted_segments.into_iter().chunk_by(|segment| {
            let segment_log_size = f64::from(self.clip_min_size(segment.num_docs())).log2();
            if segment_log_size < (current_max_log_size - self.level_log_size) {
                // update current_max_log_size to create a new group
--- a/src/indexer/merge_policy.rs
+++ b/src/indexer/merge_policy.rs
@@ -36,7 +36,7 @@ impl MergePolicy for NoMergePolicy {
 }

 #[cfg(test)]
-pub mod tests {
+pub(crate) mod tests {

    use super::*;

--- a/src/indexer/mod.rs
+++ b/src/indexer/mod.rs
@@ -31,7 +31,7 @@ mod stamper;
 use crossbeam_channel as channel;
 use smallvec::SmallVec;

-pub use self::index_writer::IndexWriter;
+pub use self::index_writer::{IndexWriter, IndexWriterOptions};
 pub use self::log_merge_policy::LogMergePolicy;
 pub use self::merge_operation::MergeOperation;
 pub use self::merge_policy::{MergeCandidate, MergePolicy, NoMergePolicy};
--- a/src/indexer/segment_updater.rs
+++ b/src/indexer/segment_updater.rs
@@ -1,3 +1,4 @@
+use std::any::Any;
 use std::borrow::BorrowMut;
 use std::collections::HashSet;
 use std::io::Write;
@@ -23,9 +24,9 @@ use crate::indexer::{
    DefaultMergePolicy, MergeCandidate, MergeOperation, MergePolicy, SegmentEntry,
    SegmentSerializer,
 };
-use crate::{FutureResult, Opstamp};
+use crate::{FutureResult, Opstamp, TantivyError};

-const NUM_MERGE_THREADS: usize = 4;
+const PANIC_CAUGHT: &str = "Panic caught in merge thread";

 /// Save the index meta file.
 /// This operation is atomic:
@@ -273,6 +274,7 @@ impl SegmentUpdater {
        index: Index,
        stamper: Stamper,
        delete_cursor: &DeleteCursor,
+        num_merge_threads: usize,
    ) -> crate::Result<SegmentUpdater> {
        let segments = index.searchable_segment_metas()?;
        let segment_manager = SegmentManager::from_segments(segments, delete_cursor);
@@ -287,7 +289,16 @@ impl SegmentUpdater {
            })?;
        let merge_thread_pool = ThreadPoolBuilder::new()
            .thread_name(|i| format!("merge_thread_{i}"))
-            .num_threads(NUM_MERGE_THREADS)
+            .num_threads(num_merge_threads)
+            .panic_handler(move |panic| {
+                // We don't print the panic content itself,
+                // it is already printed during the unwinding
+                if let Some(message) = panic.downcast_ref::<&str>() {
+                    if *message != PANIC_CAUGHT {
+                        error!("uncaught merge panic")
+                    }
+                }
+            })
            .build()
            .map_err(|_| {
                crate::TantivyError::SystemError(
@@ -507,11 +518,34 @@ impl SegmentUpdater {
            // Its lifetime is used to track how many merging thread are currently running,
            // as well as which segment is currently in merge and therefore should not be
            // candidate for another merge.
-            match merge(
-                &segment_updater.index,
-                segment_entries,
-                merge_operation.target_opstamp(),
-            ) {
+            let merge_panic_res = std::panic::catch_unwind(std::panic::AssertUnwindSafe(|| {
+                merge(
+                    &segment_updater.index,
+                    segment_entries,
+                    merge_operation.target_opstamp(),
+                )
+            }));
+            let merge_res = match merge_panic_res {
+                Ok(merge_res) => merge_res,
+                Err(panic_err) => {
+                    let panic_str = if let Some(msg) = panic_err.downcast_ref::<&str>() {
+                        *msg
+                    } else if let Some(msg) = panic_err.downcast_ref::<String>() {
+                        msg.as_str()
+                    } else {
+                        "UNKNOWN"
+                    };
+                    let _send_result = merging_future_send.send(Err(TantivyError::SystemError(
+                        format!("Merge thread panicked: {panic_str}"),
+                    )));
+                    // Resume unwinding because we forced unwind safety with
+                    // `std::panic::AssertUnwindSafe`. Use a specific message so
+                    // the panic_handler can double check that we properly caught the panic.
+                    let boxed_panic_message: Box<dyn Any + Send> = Box::new(PANIC_CAUGHT);
+                    std::panic::resume_unwind(boxed_panic_message);
+                }
+            };
+            match merge_res {
                Ok(after_merge_segment_entry) => {
                    let res = segment_updater.end_merge(merge_operation, after_merge_segment_entry);
                    let _send_result = merging_future_send.send(res);
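Note on the panic handling in the merge closure above: the payload returned by `std::panic::catch_unwind` is a `Box<dyn Any + Send>`, which is why the diff tries `&str` first, then `String`, and falls back to `"UNKNOWN"`. The sketch below isolates that pattern using only the standard library. `panic_message` and `run_catching` are hypothetical names chosen for illustration; unlike the diff, this version returns the error instead of resuming the unwind.

```rust
use std::any::Any;
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Extracts a readable message from a panic payload, mirroring the
/// `&str` / `String` / fallback chain used in the diff.
fn panic_message(payload: &(dyn Any + Send)) -> &str {
    if let Some(msg) = payload.downcast_ref::<&str>() {
        *msg
    } else if let Some(msg) = payload.downcast_ref::<String>() {
        msg.as_str()
    } else {
        "UNKNOWN"
    }
}

/// Runs `task` and converts a panic into an `Err` carrying the panic message.
/// Hypothetical helper: the real code sends the error through a channel and
/// then calls `resume_unwind`.
fn run_catching<T>(task: impl FnOnce() -> T) -> Result<T, String> {
    catch_unwind(AssertUnwindSafe(task))
        .map_err(|payload| format!("Merge thread panicked: {}", panic_message(&*payload)))
}
```

A payload that is neither a `&str` nor a `String`, for example one raised with `std::panic::panic_any(42u8)`, ends up in the `"UNKNOWN"` branch.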
--- a/src/indexer/segment_writer.rs
+++ b/src/indexer/segment_writer.rs
@@ -150,7 +150,7 @@ impl SegmentWriter {
        let vals_grouped_by_field = doc
            .iter_fields_and_values()
            .sorted_by_key(|(field, _)| *field)
-            .group_by(|(field, _)| *field);
+            .chunk_by(|(field, _)| *field);

        for (field, field_values) in &vals_grouped_by_field {
            let values = field_values.map(|el| el.1);
@@ -422,6 +422,7 @@ mod tests {
    use std::collections::BTreeMap;
    use std::path::{Path, PathBuf};

+    use columnar::ColumnType;
    use tempfile::TempDir;

    use crate::collector::{Count, TopDocs};
@@ -431,15 +432,15 @@ mod tests {
    use crate::query::{PhraseQuery, QueryParser};
    use crate::schema::{
        Document, IndexRecordOption, OwnedValue, Schema, TextFieldIndexing, TextOptions, Value,
-        DATE_TIME_PRECISION_INDEXED, STORED, STRING, TEXT,
+        DATE_TIME_PRECISION_INDEXED, FAST, STORED, STRING, TEXT,
    };
    use crate::store::{Compressor, StoreReader, StoreWriter};
    use crate::time::format_description::well_known::Rfc3339;
    use crate::time::OffsetDateTime;
    use crate::tokenizer::{PreTokenizedString, Token};
    use crate::{
-        DateTime, Directory, DocAddress, DocSet, Index, IndexWriter, TantivyDocument, Term,
-        TERMINATED,
+        DateTime, Directory, DocAddress, DocSet, Index, IndexWriter, SegmentReader,
+        TantivyDocument, Term, TERMINATED,
    };

    #[test]
@@ -841,6 +842,75 @@ mod tests {
        assert_eq!(searcher.search(&phrase_query, &Count).unwrap(), 0);
    }

+    #[test]
+    fn test_json_fast() {
+        let mut schema_builder = Schema::builder();
+        let json_field = schema_builder.add_json_field("json", FAST);
+        let schema = schema_builder.build();
+        let json_val: serde_json::Value = serde_json::from_str(
+            r#"{
+            "toto": "titi",
+            "float": -0.2,
+            "bool": true,
+            "unsigned": 1,
+            "signed": -2,
+            "complexobject": {
+                "field.with.dot": 1
+            },
+            "date": "1985-04-12T23:20:50.52Z",
+            "my_arr": [2, 3, {"my_key": "two tokens"}, 4]
+        }"#,
+        )
+        .unwrap();
+        let doc = doc!(json_field=>json_val.clone());
+        let index = Index::create_in_ram(schema.clone());
+        let mut writer = index.writer_for_tests().unwrap();
+        writer.add_document(doc).unwrap();
+        writer.commit().unwrap();
+        let reader = index.reader().unwrap();
+        let searcher = reader.searcher();
+        let segment_reader = searcher.segment_reader(0u32);
+
+        fn assert_type(reader: &SegmentReader, field: &str, typ: ColumnType) {
+            let cols = reader.fast_fields().dynamic_column_handles(field).unwrap();
+            assert_eq!(cols.len(), 1, "{}", field);
+            assert_eq!(cols[0].column_type(), typ, "{}", field);
+        }
+        assert_type(segment_reader, "json.toto", ColumnType::Str);
+        assert_type(segment_reader, "json.float", ColumnType::F64);
+        assert_type(segment_reader, "json.bool", ColumnType::Bool);
+        assert_type(segment_reader, "json.unsigned", ColumnType::I64);
+        assert_type(segment_reader, "json.signed", ColumnType::I64);
+        assert_type(
+            segment_reader,
+            "json.complexobject.field\\.with\\.dot",
+            ColumnType::I64,
+        );
+        assert_type(segment_reader, "json.date", ColumnType::DateTime);
+        assert_type(segment_reader, "json.my_arr", ColumnType::I64);
+        assert_type(segment_reader, "json.my_arr.my_key", ColumnType::Str);
+
+        fn assert_empty(reader: &SegmentReader, field: &str) {
+            let cols = reader.fast_fields().dynamic_column_handles(field).unwrap();
+            assert_eq!(cols.len(), 0);
+        }
+        assert_empty(segment_reader, "unknown");
+        assert_empty(segment_reader, "json");
+        assert_empty(segment_reader, "json.toto.titi");
+
+        let sub_columns = segment_reader
+            .fast_fields()
+            .dynamic_subpath_column_handles("json")
+            .unwrap();
+        assert_eq!(sub_columns.len(), 9);
+
+        let subsub_columns = segment_reader
+            .fast_fields()
+            .dynamic_subpath_column_handles("json.complexobject")
+            .unwrap();
+        assert_eq!(subsub_columns.len(), 1);
+    }
+
    #[test]
    fn test_json_term_with_numeric_merge_panic_regression_bug_2283() {
        // https://github.com/quickwit-oss/tantivy/issues/2283
--- a/src/indexer/stamper.rs
+++ b/src/indexer/stamper.rs
@@ -1,79 +1,19 @@
 use std::ops::Range;
-use std::sync::atomic::Ordering;
+use std::sync::atomic::{AtomicU64, Ordering};
 use std::sync::Arc;

 use crate::Opstamp;

-#[cfg(not(target_arch = "arm"))]
-mod atomic_impl {
-
-    use std::sync::atomic::{AtomicU64, Ordering};
-
-    use crate::Opstamp;
-
-    #[derive(Default)]
-    pub struct AtomicU64Wrapper(AtomicU64);
-
-    impl AtomicU64Wrapper {
-        pub fn new(first_opstamp: Opstamp) -> AtomicU64Wrapper {
-            AtomicU64Wrapper(AtomicU64::new(first_opstamp))
-        }
-
-        pub fn fetch_add(&self, val: u64, order: Ordering) -> u64 {
-            self.0.fetch_add(val, order)
-        }
-
-        pub fn revert(&self, val: u64, order: Ordering) -> u64 {
-            self.0.store(val, order);
-            val
-        }
-    }
-}
-
-#[cfg(target_arch = "arm")]
-mod atomic_impl {
-
-    /// Under other architecture, we rely on a mutex.
-    use std::sync::atomic::Ordering;
-    use std::sync::RwLock;
-
-    use crate::Opstamp;
-
-    #[derive(Default)]
-    pub struct AtomicU64Wrapper(RwLock<u64>);
-
-    impl AtomicU64Wrapper {
-        pub fn new(first_opstamp: Opstamp) -> AtomicU64Wrapper {
-            AtomicU64Wrapper(RwLock::new(first_opstamp))
-        }
-
-        pub fn fetch_add(&self, incr: u64, _order: Ordering) -> u64 {
-            let mut lock = self.0.write().unwrap();
-            let previous_val = *lock;
-            *lock = previous_val + incr;
-            previous_val
-        }
-
-        pub fn revert(&self, val: u64, _order: Ordering) -> u64 {
-            let mut lock = self.0.write().unwrap();
-            *lock = val;
-            val
-        }
-    }
-}
-
-use self::atomic_impl::AtomicU64Wrapper;
-
 /// Stamper provides Opstamps, which is just an auto-increment id to label
 /// an operation.
 ///
 /// Cloning does not "fork" the stamp generation. The stamper actually wraps an `Arc`.
 #[derive(Clone, Default)]
-pub struct Stamper(Arc<AtomicU64Wrapper>);
+pub struct Stamper(Arc<AtomicU64>);

 impl Stamper {
    pub fn new(first_opstamp: Opstamp) -> Stamper {
-        Stamper(Arc::new(AtomicU64Wrapper::new(first_opstamp)))
+        Stamper(Arc::new(AtomicU64::new(first_opstamp)))
    }

    pub fn stamp(&self) -> Opstamp {
@@ -92,7 +32,8 @@ impl Stamper {

    /// Reverts the stamper to a given `Opstamp` value and returns it
    pub fn revert(&self, to_opstamp: Opstamp) -> Opstamp {
-        self.0.revert(to_opstamp, Ordering::SeqCst)
+        self.0.store(to_opstamp, Ordering::SeqCst);
+        to_opstamp
    }
 }

@@ -101,7 +42,7 @@ mod test {

    use super::Stamper;

-    #[allow(clippy::redundant_clone)]
+    #[expect(clippy::redundant_clone)]
    #[test]
    fn test_stamper() {
        let stamper = Stamper::new(7u64);
@@ -117,7 +58,7 @@ mod test {
        assert_eq!(stamper.stamp(), 15u64);
    }

-    #[allow(clippy::redundant_clone)]
+    #[expect(clippy::redundant_clone)]
    #[test]
    fn test_stamper_revert() {
        let stamper = Stamper::new(7u64);
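Note on the simplified `Stamper` above: once the architecture-specific wrapper is gone, the type is a shared `AtomicU64` behind an `Arc`, so clones draw from the same counter. The sketch below is a standalone approximation. The body of `stamp` is not shown in this diff, so its `fetch_add(1, ...)` implementation is an assumption, and `u64` stands in for the `Opstamp` alias.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Standalone approximation of the stamper; cloning shares the counter.
#[derive(Clone, Default)]
struct Stamper(Arc<AtomicU64>);

impl Stamper {
    fn new(first_opstamp: u64) -> Stamper {
        Stamper(Arc::new(AtomicU64::new(first_opstamp)))
    }

    /// Assumed implementation: returns the current value, then increments it.
    fn stamp(&self) -> u64 {
        self.0.fetch_add(1, Ordering::SeqCst)
    }

    /// Resets the counter to `to_opstamp` and returns it, as in the diff.
    fn revert(&self, to_opstamp: u64) -> u64 {
        self.0.store(to_opstamp, Ordering::SeqCst);
        to_opstamp
    }
}
```

Because the `Arc` is shared, a `revert` issued through one handle is visible to every clone on its next `stamp` call.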
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -178,10 +178,8 @@ pub use crate::future_result::FutureResult;
 pub type Result<T> = std::result::Result<T, TantivyError>;

 mod core;
-#[allow(deprecated)] // Remove with index sorting
 pub mod indexer;

-#[allow(unused_doc_comments)]
 pub mod error;
 pub mod tokenizer;

@@ -190,7 +188,6 @@ pub mod collector;
 pub mod directory;
 pub mod fastfield;
 pub mod fieldnorm;
-#[allow(deprecated)] // Remove with index sorting
 pub mod index;
 pub mod positions;
 pub mod postings;
@@ -223,7 +220,6 @@ pub use self::docset::{DocSet, COLLECT_BLOCK_BUFFER_LEN, TERMINATED};
 pub use crate::core::json_utils;
 pub use crate::core::{Executor, Searcher, SearcherGeneration};
 pub use crate::directory::Directory;
-#[allow(deprecated)] // Remove with index sorting
 pub use crate::index::{
    Index, IndexBuilder, IndexMeta, IndexSettings, InvertedIndexReader, Order, Segment,
    SegmentMeta, SegmentReader,
@@ -232,7 +228,7 @@ pub use crate::indexer::{IndexWriter, SingleSegmentIndexWriter};
 pub use crate::schema::{Document, TantivyDocument, Term};

 /// Index format version.
-pub const INDEX_FORMAT_VERSION: u32 = 6;
+pub const INDEX_FORMAT_VERSION: u32 = 7;
 /// Oldest index format version this tantivy version can read.
 pub const INDEX_FORMAT_OLDEST_SUPPORTED_VERSION: u32 = 4;

@@ -371,6 +367,7 @@ macro_rules! fail_point {
    }};
 }

+/// Common test utilities.
 #[cfg(test)]
 pub mod tests {
    use common::{BinarySerializable, FixedSize};
@@ -389,6 +386,7 @@ pub mod tests {
    use crate::schema::*;
    use crate::{DateTime, DocAddress, Index, IndexWriter, ReloadPolicy};

+    /// Asserts that the serialized size of `O` matches the size declared by its `FixedSize` impl.
    pub fn fixed_size_test<O: BinarySerializable + FixedSize + Default>() {
        let mut buffer = Vec::new();
        O::default().serialize(&mut buffer).unwrap();
@@ -421,6 +419,7 @@ pub mod tests {
        }};
    }

+    /// Generates `n_elems` random numbers from a fixed seed; values may repeat and are unsorted.
    pub fn generate_nonunique_unsorted(max_value: u32, n_elems: usize) -> Vec<u32> {
        let seed: [u8; 32] = [1; 32];
        StdRng::from_seed(seed)
@@ -429,6 +428,7 @@ pub mod tests {
            .collect::<Vec<u32>>()
    }

+    /// Sample `n` elements with Bernoulli distribution.
    pub fn sample_with_seed(n: u32, ratio: f64, seed_val: u8) -> Vec<u32> {
        StdRng::from_seed([seed_val; 32])
            .sample_iter(&Bernoulli::new(ratio).unwrap())
@@ -438,6 +438,7 @@ pub mod tests {
            .collect()
    }

+    /// Sample `n` elements with Bernoulli distribution.
    pub fn sample(n: u32, ratio: f64) -> Vec<u32> {
        sample_with_seed(n, ratio, 4)
    }
--- a/src/macros.rs
+++ b/src/macros.rs
@@ -41,7 +41,6 @@
 /// );
 /// # }
 /// ```
-
 #[macro_export]
 macro_rules! doc(
    () => {
--- a/src/positions/mod.rs
+++ b/src/positions/mod.rs
@@ -1,4 +1,5 @@
 //! Tantivy can (if instructed to do so in the schema) store the term positions in a given field.
+//!
 //! This position is expressed as token ordinal. For instance,
 //! In "The beauty and the beast", the term "the" appears in position 0 and position 3.
 //! This information is useful to run phrase queries.
@@ -38,7 +39,7 @@ pub use self::serializer::PositionSerializer;
 const COMPRESSION_BLOCK_SIZE: usize = BitPacker4x::BLOCK_LEN;

 #[cfg(test)]
-pub mod tests {
+pub(crate) mod tests {

    use std::iter;

--- a/src/postings/compression/mod.rs
+++ b/src/postings/compression/mod.rs
@@ -264,7 +264,7 @@ impl VIntDecoder for BlockDecoder {
 }

 #[cfg(test)]
-pub mod tests {
+pub(crate) mod tests {

    use super::*;
    use crate::TERMINATED;
--- a/src/postings/loaded_postings.rs
+++ b/src/postings/loaded_postings.rs
@@ -0,0 +1,155 @@
+use crate::docset::{DocSet, TERMINATED};
+use crate::postings::{Postings, SegmentPostings};
+use crate::DocId;
+
+/// `LoadedPostings` is a `DocSet` and `Postings` implementation.
+/// It is used to represent the postings of a term in memory.
+/// It is suitable if there are few documents for a term.
+///
+/// It exists mainly to reduce memory usage.
+/// `SegmentPostings` uses 1840 bytes per instance due to its caches.
+/// If you need to keep many terms around with few docs, it's cheaper to load all the
+/// postings in memory.
+///
+/// This is relevant for `RegexPhraseQuery`, which may have a lot of
+/// terms.
+/// E.g. 100_000 terms would need 184MB due to SegmentPostings.
+pub struct LoadedPostings {
+    doc_ids: Box<[DocId]>,
+    position_offsets: Box<[u32]>,
+    positions: Box<[u32]>,
+    cursor: usize,
+}
+
+impl LoadedPostings {
+    /// Creates a new `LoadedPostings` from a `SegmentPostings`.
+    ///
+    /// It will also preload positions, if positions are available in the SegmentPostings.
+    pub fn load(segment_postings: &mut SegmentPostings) -> LoadedPostings {
+        let num_docs = segment_postings.doc_freq() as usize;
+        let mut doc_ids = Vec::with_capacity(num_docs);
+        let mut positions = Vec::with_capacity(num_docs);
+        let mut position_offsets = Vec::with_capacity(num_docs);
+        while segment_postings.doc() != TERMINATED {
+            position_offsets.push(positions.len() as u32);
+            doc_ids.push(segment_postings.doc());
+            segment_postings.append_positions_with_offset(0, &mut positions);
+            segment_postings.advance();
+        }
+        position_offsets.push(positions.len() as u32);
+        LoadedPostings {
+            doc_ids: doc_ids.into_boxed_slice(),
+            positions: positions.into_boxed_slice(),
+            position_offsets: position_offsets.into_boxed_slice(),
+            cursor: 0,
+        }
+    }
+}
+
+#[cfg(test)]
+impl From<(Vec<DocId>, Vec<Vec<u32>>)> for LoadedPostings {
+    fn from(doc_ids_and_positions: (Vec<DocId>, Vec<Vec<u32>>)) -> LoadedPostings {
+        let mut position_offsets = Vec::new();
+        let mut all_positions = Vec::new();
+        let (doc_ids, docid_positions) = doc_ids_and_positions;
+        for positions in docid_positions {
+            position_offsets.push(all_positions.len() as u32);
+            all_positions.extend_from_slice(&positions);
+        }
+        position_offsets.push(all_positions.len() as u32);
+        LoadedPostings {
+            doc_ids: doc_ids.into_boxed_slice(),
+            positions: all_positions.into_boxed_slice(),
+            position_offsets: position_offsets.into_boxed_slice(),
+            cursor: 0,
+        }
+    }
+}
+
+impl DocSet for LoadedPostings {
+    fn advance(&mut self) -> DocId {
+        self.cursor += 1;
+        if self.cursor >= self.doc_ids.len() {
+            self.cursor = self.doc_ids.len();
+            return TERMINATED;
+        }
+        self.doc()
+    }
+
+    fn doc(&self) -> DocId {
+        if self.cursor >= self.doc_ids.len() {
+            return TERMINATED;
+        }
+        self.doc_ids[self.cursor]
+    }
+
+    fn size_hint(&self) -> u32 {
+        self.doc_ids.len() as u32
+    }
+}
+impl Postings for LoadedPostings {
+    fn term_freq(&self) -> u32 {
+        let start = self.position_offsets[self.cursor] as usize;
+        let end = self.position_offsets[self.cursor + 1] as usize;
+        (end - start) as u32
+    }
+
+    fn append_positions_with_offset(&mut self, offset: u32, output: &mut Vec<u32>) {
+        let start = self.position_offsets[self.cursor] as usize;
+        let end = self.position_offsets[self.cursor + 1] as usize;
+        for pos in &self.positions[start..end] {
+            output.push(*pos + offset);
+        }
+    }
+}
+
+#[cfg(test)]
+pub(crate) mod tests {
+
+    use super::*;
+
+    #[test]
+    pub fn test_vec_postings() {
+        let doc_ids: Vec<DocId> = (0u32..1024u32).map(|e| e * 3).collect();
+        let mut postings = LoadedPostings::from((doc_ids, vec![]));
+        assert_eq!(postings.doc(), 0u32);
+        assert_eq!(postings.advance(), 3u32);
+        assert_eq!(postings.doc(), 3u32);
+        assert_eq!(postings.seek(14u32), 15u32);
+        assert_eq!(postings.doc(), 15u32);
+        assert_eq!(postings.seek(300u32), 300u32);
+        assert_eq!(postings.doc(), 300u32);
+        assert_eq!(postings.seek(6000u32), TERMINATED);
+    }
+
+    #[test]
+    pub fn test_vec_postings2() {
+        let doc_ids: Vec<DocId> = (0u32..1024u32).map(|e| e * 3).collect();
+        let mut positions = Vec::new();
+        positions.resize(1024, Vec::new());
+        positions[0] = vec![1u32, 2u32, 3u32];
+        positions[1] = vec![30u32];
+        positions[2] = vec![10u32];
+        positions[4] = vec![50u32];
+        let mut postings = LoadedPostings::from((doc_ids, positions));
+
+        let load = |postings: &mut LoadedPostings| {
+            let mut loaded_positions = Vec::new();
+            postings.positions(loaded_positions.as_mut());
+            loaded_positions
+        };
+        assert_eq!(postings.doc(), 0u32);
+        assert_eq!(load(&mut postings), vec![1u32, 2u32, 3u32]);
+
+        assert_eq!(postings.advance(), 3u32);
+        assert_eq!(postings.doc(), 3u32);
+
+        assert_eq!(load(&mut postings), vec![30u32]);
+
+        assert_eq!(postings.seek(14u32), 15u32);
+        assert_eq!(postings.doc(), 15u32);
+        assert_eq!(postings.seek(300u32), 300u32);
+        assert_eq!(postings.doc(), 300u32);
+        assert_eq!(postings.seek(6000u32), TERMINATED);
+    }
+}
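
The flat layout used by `LoadedPostings` above (one `positions` buffer plus a `position_offsets` array with a trailing sentinel) can be sketched on its own, without any tantivy types. The following is a minimal illustration, not the crate's implementation: `PackedPostings`, `positions_of`, `doc_id` and `heap_bytes` are hypothetical names, and unlike the test-only `From` impl in the diff it asserts one position list per document.

```rust
/// Simplified stand-in (hypothetical) for the layout used by `LoadedPostings`:
/// all positions live in one flat buffer, and
/// `position_offsets[i]..position_offsets[i + 1]` delimits the positions of the i-th document.
pub struct PackedPostings {
    doc_ids: Box<[u32]>,
    position_offsets: Box<[u32]>,
    positions: Box<[u32]>,
}

impl PackedPostings {
    /// Builds the packed layout from one position list per document.
    pub fn new(doc_ids: Vec<u32>, positions_per_doc: Vec<Vec<u32>>) -> Self {
        // Assumption of this sketch: exactly one position list per doc id.
        assert_eq!(doc_ids.len(), positions_per_doc.len());
        let mut position_offsets = Vec::with_capacity(doc_ids.len() + 1);
        let mut positions = Vec::new();
        for doc_positions in &positions_per_doc {
            position_offsets.push(positions.len() as u32);
            positions.extend_from_slice(doc_positions);
        }
        // Trailing sentinel so that `position_offsets[i + 1]` is valid for the last document.
        position_offsets.push(positions.len() as u32);
        PackedPostings {
            doc_ids: doc_ids.into_boxed_slice(),
            position_offsets: position_offsets.into_boxed_slice(),
            positions: positions.into_boxed_slice(),
        }
    }

    /// Doc id stored at index `idx`.
    pub fn doc_id(&self, idx: usize) -> u32 {
        self.doc_ids[idx]
    }

    /// Positions of the document at index `idx`.
    pub fn positions_of(&self, idx: usize) -> &[u32] {
        let start = self.position_offsets[idx] as usize;
        let end = self.position_offsets[idx + 1] as usize;
        &self.positions[start..end]
    }

    /// Term frequency is the number of positions between two consecutive offsets.
    pub fn term_freq(&self, idx: usize) -> u32 {
        self.positions_of(idx).len() as u32
    }

    /// Heap bytes held by the three buffers (every element is a `u32`).
    pub fn heap_bytes(&self) -> usize {
        (self.doc_ids.len() + self.position_offsets.len() + self.positions.len())
            * std::mem::size_of::<u32>()
    }
}
```

For a term with only a handful of documents, `heap_bytes` stays far below the 1840 bytes per `SegmentPostings` instance quoted in the doc comment, which is the trade-off the struct is built around.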
--- a/src/postings/mod.rs
+++ b/src/postings/mod.rs
@@ -8,6 +8,7 @@ mod block_segment_postings;
 pub(crate) mod compression;
 mod indexing_context;
 mod json_postings_writer;
+mod loaded_postings;
 mod per_field_postings_writer;
 mod postings;
 mod postings_writer;
@@ -17,6 +18,7 @@ mod serializer;
 mod skip;
 mod term_info;

+pub(crate) use loaded_postings::LoadedPostings;
 pub(crate) use stacker::compute_table_memory_size;

 pub use self::block_segment_postings::BlockSegmentPostings;
@@ -29,7 +31,7 @@ pub use self::serializer::{FieldSerializer, InvertedIndexSerializer};
 pub(crate) use self::skip::{BlockInfo, SkipReader};
 pub use self::term_info::TermInfo;

-#[allow(clippy::enum_variant_names)]
+#[expect(clippy::enum_variant_names)]
 #[derive(Debug, PartialEq, Clone, Copy, Eq)]
 pub(crate) enum FreqReadingOption {
    NoFreq,
@@ -38,7 +40,7 @@ pub(crate) enum FreqReadingOption {
 }

 #[cfg(test)]
-pub mod tests {
+pub(crate) mod tests {
    use std::mem;

    use super::{InvertedIndexSerializer, Postings};
--- a/src/postings/postings.rs
+++ b/src/postings/postings.rs
@@ -17,7 +17,14 @@ pub trait Postings: DocSet + 'static {
    /// Returns the positions offsetted with a given value.
    /// It is not necessary to clear the `output` before calling this method.
    /// The output vector will be resized to the `term_freq`.
-    fn positions_with_offset(&mut self, offset: u32, output: &mut Vec<u32>);
+    fn positions_with_offset(&mut self, offset: u32, output: &mut Vec<u32>) {
+        output.clear();
+        self.append_positions_with_offset(offset, output);
+    }
+
+    /// Appends the positions of the term in the current document to `output`,
+    /// each shifted by `offset`. Existing content of `output` is kept.
+    fn append_positions_with_offset(&mut self, offset: u32, output: &mut Vec<u32>);

    /// Returns the positions of the term in the given document.
    /// The output vector will be resized to the `term_freq`.
@@ -25,3 +32,13 @@ pub trait Postings: DocSet + 'static {
        self.positions_with_offset(0u32, output);
    }
 }
+
+impl Postings for Box<dyn Postings> {
+    fn term_freq(&self) -> u32 {
+        (**self).term_freq()
+    }
+
+    fn append_positions_with_offset(&mut self, offset: u32, output: &mut Vec<u32>) {
+        (**self).append_positions_with_offset(offset, output);
+    }
+}
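
The change to the `Postings` trait above follows a common pattern: the required method appends, the provided method clears and then delegates, and a forwarding impl lets `Box<dyn Trait>` satisfy generic bounds. The sketch below reproduces that shape in isolation. `PositionSource`, `FixedPositions` and `collect_all` are hypothetical names introduced for illustration and are not part of tantivy.

```rust
/// Hypothetical, trimmed-down trait mirroring the shape of the change to `Postings`.
pub trait PositionSource {
    /// Required: appends positions shifted by `offset` to `output`.
    fn append_positions_with_offset(&mut self, offset: u32, output: &mut Vec<u32>);

    /// Provided: clears `output` first, then delegates to the append variant.
    fn positions_with_offset(&mut self, offset: u32, output: &mut Vec<u32>) {
        output.clear();
        self.append_positions_with_offset(offset, output);
    }
}

/// Forwarding impl so a boxed trait object can be used where `P: PositionSource` is expected.
impl PositionSource for Box<dyn PositionSource> {
    fn append_positions_with_offset(&mut self, offset: u32, output: &mut Vec<u32>) {
        // `**self` is the `dyn PositionSource` inside the box, so this dispatches dynamically.
        (**self).append_positions_with_offset(offset, output);
    }
}

/// Toy implementation (hypothetical) that always yields the same positions.
pub struct FixedPositions(pub Vec<u32>);

impl PositionSource for FixedPositions {
    fn append_positions_with_offset(&mut self, offset: u32, output: &mut Vec<u32>) {
        output.extend(self.0.iter().map(|pos| pos + offset));
    }
}

/// Collects the positions of several sources into one buffer without intermediate vectors.
pub fn collect_all<P: PositionSource>(sources: &mut [P], offset: u32) -> Vec<u32> {
    let mut output = Vec::new();
    for source in sources.iter_mut() {
        source.append_positions_with_offset(offset, &mut output);
    }
    output
}
```

Making the append variant the required method means each implementor writes a single method, and callers that merge positions from many sources (as `LoadedPostings::load` does) can reuse one buffer.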
Author	SHA1	Message	Date
Pascal Seitz	c6e77d27c6	chore: Release	2025-04-09 16:58:45 +08:00
Pascal Seitz	db6587ed9b	chore: Release	2025-04-09 16:57:04 +08:00
Paul Masurel	3fa90e70e2	Merge pull request #2618 from quickwit-oss/release_tantivy fix tantivy-query-grammar version	2025-04-09 09:54:09 +02:00
Pascal Seitz	6ab4102253	fix tantivy-query-grammar version	2025-04-09 14:35:23 +08:00
PSeitz	11c6329ca5	temp unbump version (#2501 ) temp unbump to 0.22 for easier release with `cargo release`	2025-04-09 08:09:41 +02:00
PSeitz	ab8bb93928	update changelog (#2617 )	2025-04-09 03:31:30 +02:00
PSeitz	2b668bd2bf	readability improvement on executor (#2615 )	2025-04-08 18:28:49 +02:00
Paul Masurel	97a7137ef8	Merge pull request #2606 from katlim-br/add_serde_serialize Add serde json serialize to UserInputAst	2025-04-03 15:57:03 +02:00
Kat Lim Ruiz	ffa7cdf397	agreed with Remi, about the final json structure, having "type" tag and using "clauses" is more accurate	2025-04-03 08:35:16 -05:00
Kat Lim Ruiz	caf1275e60	Merge pull request #1 from quickwit-oss/tagged-user-input-ast Tag UserInputAst	2025-04-03 08:30:07 -05:00
Remi Dettai	fb12b7be28	Tag UserInputAst	2025-04-03 10:07:34 +02:00
Kat Lim Ruiz	6f77083493	create more complex unit test	2025-04-02 18:06:20 -05:00
Kat Lim Ruiz	cd7745da7a	set Leaf untagged, leave clause and boost the same (with own property)	2025-04-02 17:52:18 -05:00
Kat Lim Ruiz	eb8304dee9	remove untitled file	2025-04-02 08:47:58 -05:00
Kat Lim Ruiz	e5638112a9	all json should be snake_case	2025-04-02 08:45:33 -05:00
Kat Lim Ruiz	81110152fb	add unit test for unbounded	2025-04-01 18:08:04 -05:00
Kat Lim Ruiz	ae88a7ece5	add tag type and content value to UserInputBound	2025-04-01 18:06:40 -05:00
Kat Lim Ruiz	bdd5f80fd9	add clause unit test	2025-04-01 18:04:19 -05:00
Kat Lim Ruiz	3f62ef22e5	set tag=type only for Leaf	2025-04-01 17:52:36 -05:00
Kat Lim Ruiz	8102e19e48	set Error as serializable because is part of the possible outcomes (however, I think using this empty Error struct is not a good pattern)	2025-04-01 17:43:24 -05:00
Kat Lim Ruiz	175c853ea7	add serialization test for LenientError	2025-04-01 17:38:23 -05:00
Kat Lim Ruiz	c992cf3f37	Revert "set all enum to be snake_case when serializing" This reverts commit `83f6c2f265`.	2025-04-01 17:27:28 -05:00
Kat Lim Ruiz	83f6c2f265	set all enum to be snake_case when serializing	2025-04-01 17:13:04 -05:00
Kat Lim Ruiz	17bf8aa092	Merge branch 'quickwit-oss:main' into add_serde_serialize	2025-04-01 08:32:08 -05:00
trinity-1686a	6fc0e96ff8	Merge pull request #2610 from quickwit-oss/fix-compilation-stability Fix compilation stability	2025-04-01 10:45:58 +02:00
Remi Dettai	06d2dcf469	Further fix type inference tests	2025-04-01 09:52:22 +02:00
Remi Dettai	b681ec9335	Fix compilation stability	2025-04-01 09:33:33 +02:00
Kat Lim Ruiz	da2ff5712a	fix fmt nightly	2025-03-31 08:21:54 -05:00
Kat Lim Ruiz	18da402e27	cargo fmt	2025-03-30 22:10:38 -05:00
Kat Lim Ruiz	18ae3ffe94	uniformize root cargo.toml	2025-03-30 21:55:51 -05:00
Kat Lim Ruiz	0a37b7acaa	update to latest serde and serde_json (and follow the pattern to use patch versions)	2025-03-30 11:35:58 -05:00
Kat Lim Ruiz	1a9fd885dd	allow LenientError to be serializable too	2025-03-30 11:26:20 -05:00
Kat Lim Ruiz	3e660905a7	unit test parse_query_lenient	2025-03-30 11:22:22 -05:00
Kat Lim Ruiz	0c2b984cb4	add tests	2025-03-30 11:12:15 -05:00
Kat Lim Ruiz	a69b1c609c	add error to be debuggable	2025-03-30 11:12:12 -05:00
Kat Lim Ruiz	8d4a6fcaba	deserialize is not needed	2025-03-30 11:11:55 -05:00
Kat Lim Ruiz	feced4762f	update root cargo.toml	2025-03-30 11:01:22 -05:00
Kat Lim Ruiz	0149317c5a	set 0.23	2025-03-30 10:55:48 -05:00
Kat Lim Ruiz	3fcb6f9597	add unit tests	2025-03-30 10:41:43 -05:00
Kat Lim Ruiz	388fcd763b	add serde, and allow UserInputAst to be json serialized/deserialized	2025-03-30 10:36:43 -05:00
trinity-1686a	e488f9e6a2	Merge pull request #2598 from quickwit-oss/1686a/agg-key-eq fix invalid impl of Eq on Key	2025-03-14 15:24:31 +01:00
trinity Pointard	9426d5be7b	fix agg Key PartialEq impl	2025-03-14 14:57:45 +01:00
PSeitz	d5d2d41264	merge column: small refactors (#2579 ) * merge column: small refactors * make ord dependency more explicit * add columnar merge crashtest proptest * fix naming	2025-03-07 18:52:34 +08:00
Paul Masurel	80f5f1ecd4	Merge pull request #2586 from quickwit-oss/issue/2577-get_batch_multiply_overflow follow up on the fix of multiply with overflow	2025-03-05 11:17:12 +01:00
Paul Masurel	519e5d2ed1	clippy warnings	2025-03-05 11:15:06 +01:00
Paul Masurel	df2d52a84e	follow up on the fix of multiply with overflow	2025-03-05 11:15:05 +01:00
Paul Masurel	371dba9414	Merge pull request #2591 from quickwit-oss/cargo-fmt Cargo fmt	2025-03-05 11:08:06 +01:00
Paul Masurel	0afabad494	Cargo fmt	2025-03-05 11:07:46 +01:00
Remi Dettai	89b052cd42	Catch panics during merges (#2582 ) * Adding panic handler for the rayon merge thread pool * Return panic message in error --------- Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>	2025-03-05 10:36:48 +01:00
SteveLauC	c48c649436	refactor: use std AtomicU64 and remove wrapper (#2585 )	2025-02-24 03:56:15 +01:00
Paul Masurel	58c0739953	Merge pull request #2581 from quickwit-oss/merge_dict_column_repro use usize in bitpacker	2025-02-21 10:53:07 +09:00
Pascal Seitz	e7daf69de9	use usize in bitpacker use usize in bitpacker to enable larger columns in the columnar store Godbolt comparison with u32 vs u64 for get access: https://godbolt.org/z/cjf7nenYP Add a mini-tool to inspect columnar files created by tantivy. (very basic functionality which can be extended later)	2025-02-20 15:39:10 +01:00
trinity-1686a	f060e86bc6	Merge pull request #2578 from quickwit-oss/1686a/buildable-histo-agg make DateHistogramAggregationReq buildable	2025-02-18 15:30:54 +01:00
trinity Pointard	0368162ef0	make DateHistogramAggregationReq buildable	2025-02-18 11:45:24 +01:00
trinity-1686a	e843c71015	Merge pull request #2568 from quickwit-oss/trinity/wildcard-query-parser allow term starting with wildcard in query parser	2025-02-12 16:47:25 +01:00
trinity Pointard	5cea16ef9f	improve handling of spcial char after exist query	2025-01-22 16:04:31 +01:00
dependabot[bot]	4aa8cd2470	Update downcast-rs requirement from 1.2.1 to 2.0.1 (#2566 ) Updates the requirements on [downcast-rs](https://github.com/marcianx/downcast-rs) to permit the latest version. - [Changelog](https://github.com/marcianx/downcast-rs/blob/master/CHANGELOG.md) - [Commits](https://github.com/marcianx/downcast-rs/compare/v1.2.1...v2.0.1) --- updated-dependencies: - dependency-name: downcast-rs dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-01-22 10:32:24 +01:00
trinity Pointard	4d4ee1b0ac	allow term starting with wildcard in query parser	2025-01-15 10:27:48 +01:00
dependabot[bot]	43c89b4360	Update itertools requirement from 0.13.0 to 0.14.0 (#2563 ) Updates the requirements on [itertools](https://github.com/rust-itertools/itertools) to permit the latest version. - [Changelog](https://github.com/rust-itertools/itertools/blob/master/CHANGELOG.md) - [Commits](https://github.com/rust-itertools/itertools/compare/v0.13.0...v0.14.0) --- updated-dependencies: - dependency-name: itertools dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-01-08 17:11:46 +01:00
trinity-1686a	d281ca3e65	Merge pull request #2559 from quickwit-oss/trinity/sstable-partial-automaton allow warming partially an sstable for an automaton	2025-01-08 16:35:35 +01:00
trinity Pointard	be17daf658	split iterator	2025-01-08 16:24:34 +01:00
trinity Pointard	6ca84a61fa	make termdict always clone	2025-01-08 16:19:54 +01:00
trinity Pointard	037d12c9c9	fix deadlocking on automaton warmup	2025-01-06 11:58:58 +01:00
Remi Dettai	71cf19870b	Exist queries match subpath fields (#2558 ) * Exist queries match subpath fields * Make subpath check optional * Add async subpath listing	2025-01-06 10:17:39 +01:00
trinity Pointard	175a529c41	use executor for cpu-heavy sstable decompression for automaton	2025-01-03 19:14:07 +01:00
trinity Pointard	fe0c7c5408	change rangebound style	2025-01-02 11:56:05 +01:00
Harrison Burt	148594f0f9	Improve `IndexWriter` customisation via builder (#2562 ) * Improve `IndexWriter` customisation via builder * Remove change noise from PR * Correct documentation * Resolve comments and add test	2025-01-02 09:43:22 +01:00
dependabot[bot]	8edb439440	Update rustc-hash requirement from 1.1.0 to 2.1.0 (#2551 ) --- updated-dependencies: - dependency-name: rustc-hash dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2024-12-26 10:25:05 +01:00
trinity Pointard	dfff5f3bcb	rename merge_holes_under => merge_holes_under_bytes	2024-12-23 16:17:44 +01:00
trinity-1686a	ebf4d84553	add comment about cpu-intensive operation in async context	2024-12-20 12:23:49 +01:00
trinity-1686a	42efc7f7c8	clippy	2024-12-20 11:00:11 +01:00
trinity-1686a	192395c311	attempt at simplifying can_block_match_automaton	2024-12-20 10:25:38 +01:00
trinity-1686a	a1447cc9c2	remove breaking change in sstable public api	2024-12-19 17:30:05 +01:00
trinity-1686a	c39d91f827	Merge pull request #2547 from quickwit-oss/trinity/count-str add support for counting non integer in aggregation	2024-12-17 15:27:30 +01:00
trinity Pointard	32b6e9711b	add tests	2024-12-13 16:06:24 +01:00
trinity-1686a	24c5dc2398	allow warming up automaton	2024-12-10 13:32:12 +01:00
trinity-1686a	9e2ddec4b3	merge adjacent block when building delta for automaton	2024-12-10 13:32:12 +01:00
trinity-1686a	1f6a8e74bb	support iterating over partially loaded sstable	2024-12-10 13:32:12 +01:00
trinity-1686a	7e901f523b	get iter for blocks of sstable matching automaton	2024-12-10 13:32:12 +01:00
trinity-1686a	3c30a41c14	add helper to figure if block can match automaton	2024-12-10 13:32:12 +01:00
dependabot[bot]	0f99d4f420	Update measure_time requirement from 0.8.2 to 0.9.0 (#2557 ) --- updated-dependencies: - dependency-name: measure_time dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2024-12-09 21:39:01 +01:00
Pierre Barre	6e02c5cb25	Make `NUM_MERGE_THREADS` configurable (#2535 ) * Make `NUM_MERGE_THREADS` configurable * Remove unused import * Reword comment src/index/index.rs Co-authored-by: PSeitz <PSeitz@users.noreply.github.com> --------- Co-authored-by: PSeitz <PSeitz@users.noreply.github.com>	2024-12-09 16:53:11 +08:00
PSeitz	876a579e5d	queryparser: add field respecification test (#2550 )	2024-12-02 14:17:12 +01:00
PSeitz	4c52499622	clippy (#2549 )	2024-11-29 16:08:21 +08:00
trinity-1686a	0bac391291	add support for counting non integer in aggregation	2024-11-28 19:52:47 +01:00
PSeitz	52d4e81e70	update CHANGELOG (#2546 )	2024-11-27 20:49:35 +08:00
dependabot[bot]	c71ea7b2ef	Update thiserror requirement from 1.0.30 to 2.0.1 (#2542 ) Updates the requirements on [thiserror](https://github.com/dtolnay/thiserror) to permit the latest version. - [Release notes](https://github.com/dtolnay/thiserror/releases) - [Commits](https://github.com/dtolnay/thiserror/compare/1.0.30...2.0.1) --- updated-dependencies: - dependency-name: thiserror dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2024-11-09 08:08:34 +08:00
Paul Masurel	c35a782747	Updating rustc-hash and clippy fixes (#2532 ) * Updating rustc-hash and clippy fixes * fix terms_aggregation_min_doc_count_special_case --------- Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>	2024-11-01 13:46:26 +08:00
dependabot[bot]	c66af2c0a9	Update binggan requirement from 0.12.0 to 0.14.0 (#2530 ) * Update binggan requirement from 0.12.0 to 0.14.0 --- updated-dependencies: - dependency-name: binggan dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> * fix build --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>	2024-10-24 09:41:35 +08:00
Joan Antoni RE	f9ac055847	Fix some links in architecture docs (#2528 )	2024-10-23 21:06:54 +09:00
PSeitz	21d057059e	clippy (#2527 ) * clippy * clippy * clippy * clippy * convert allow to expect and remove unused * cargo fmt * cleanup * export sample * clippy	2024-10-22 09:26:54 +08:00
PSeitz	dca508b4ca	remove read_postings_no_deletes (#2526 ) closes #2525	2024-10-22 09:52:43 +09:00
PSeitz	aebae9965d	add RegexPhraseQuery (#2516 ) * add RegexPhraseQuery RegexPhraseQuery supports phrase queries with regex. It supports regex and wildcards. E.g. a query with wildcards: "b* b* wolf" matches "big bad wolf" Slop is supported as well: "b* wolf"~2 matches "big bad wolf" Regex queries may match a lot of terms where we still need to keep track which term hit to load the positions. The phrase query algorithm groups terms by their frequency together in the union to prefilter groups early. This PR comes with some new datastructures: SimpleUnion - A union docset for a list of docsets. It doesn't do any caching and is therefore well suited for datasets with lots of skipping. (phrase search, but intersections in general) LoadedPostings - Like SegmentPostings, but all docs and positions are loaded in memory. SegmentPostings uses 1840 bytes per instance with its caches, which is equivalent to 460 docids. LoadedPostings is used for terms which have less than 100 docs. LoadedPostings is only used to reduce memory consumption. BitSetPostingUnion - Creates a `Posting` that uses the bitset for docid hits and the docsets for positions. The BitSet is the precalculated union of the docsets In the RegexPhraseQuery there is a size limit of 512 docsets per PreAggregatedUnion, before creating a new one. Renamed Union to BufferedUnionScorer Added proptests to test different union types. * cleanup * use Box instead of Vec * use RefCell instead of term_freq(&mut) * remove wildcard mode * move RefCell to outer * clippy	2024-10-21 18:29:17 +08:00
Marvin	e7e3e3f44c	make casing in docs more consistent (#2524 ) * make casing in docs more consistent * more * lowercase tantivy	2024-10-21 17:59:41 +09:00
PSeitz	2f2db16ec1	store DateTime as nanoseconds in doc store (#2486 ) * store DateTime as nanoseconds in doc store The doc store DateTime was truncated to microseconds previously. This removes this truncation, while still keeping backwards compatibility. This is done by adding the trait `ConfigurableBinarySerializable`, which works like `BinarySerializable`, but with a config that allows de/serialize as different date time precision currently. bump version format to 7. add compat test to check the date time truncation. * remove configurable binary serialize, add enum for doc store version * test doc store version ord	2024-10-18 10:50:20 +08:00
Paul Masurel	d152e29687	Fixed citation (#2523 )	2024-10-17 10:19:50 +09:00
Paul Masurel	285bcc25c9	Added citation.cff (#2522 )	2024-10-17 09:43:35 +09:00
PSeitz	7b65ad922d	use binggan for stacker bench (#2492 ) * use binggan for stacker bench ``` alice (num terms: 174693) hashmap Memory: 1.3 MB Avg: 367.19 MiB/s (-1.34%) Median: 368.10 MiB/s (-1.34%) [378.75 MiB/s .. 352.81 MiB/s] hasmap with postings Memory: 2.4 MB Avg: 237.29 MiB/s (-2.19%) Median: 240.22 MiB/s (-1.61%) [248.26 MiB/s .. 210.66 MiB/s] fxhashmap ref postings Memory: 2.9 MB Avg: 171.94 MiB/s (-3.22%) Median: 174.13 MiB/s (-2.69%) [185.94 MiB/s .. 152.43 MiB/s] fxhasmap owned postings Memory: 3.5 MB Avg: 96.993 MiB/s (-4.20%) Median: 97.410 MiB/s (-4.48%) [102.78 MiB/s .. 82.745 MiB/s] numbers unique 100k hashmap Memory: 5.2 MB Avg: 334.17 MiB/s (-3.06%) Median: 352.61 MiB/s (+0.77%) [362.60 MiB/s .. 213.03 MiB/s] hasmap with postings Memory: 6.3 MB Avg: 316.96 MiB/s (-0.02%) Median: 325.16 MiB/s (-0.04%) [338.36 MiB/s .. 218.60 MiB/s] zipfs numbers 100k hashmap Memory: 1.3 MB Avg: 1.2342 GiB/s (+2.87%) Median: 1.2677 GiB/s (+4.66%) [1.3130 GiB/s .. 915.93 MiB/s] hasmap with postings Memory: 2.4 MB Avg: 485.16 MiB/s (+2.68%) Median: 494.70 MiB/s (+4.42%) [505.31 MiB/s .. 413.14 MiB/s] numbers unique 1mio hashmap Memory: 35.7 MB Avg: 169.68 MiB/s (-1.08%) Median: 166.80 MiB/s (-3.87%) [201.33 MiB/s .. 154.26 MiB/s] hasmap with postings Memory: 39.8 MB Avg: 149.49 MiB/s (-3.07%) Median: 150.85 MiB/s (-1.45%) [160.76 MiB/s .. 130.94 MiB/s] zipfs numbers 1mio hashmap Memory: 1.3 MB Avg: 1.2185 GiB/s (-2.33%) Median: 1.2291 GiB/s (-2.33%) [1.2905 GiB/s .. 1.0742 GiB/s] hasmap with postings Memory: 5.5 MB Avg: 358.43 MiB/s (-11.63%) Median: 356.95 MiB/s (-12.85%) [444.94 MiB/s .. 302.46 MiB/s] numbers unique 2mio hashmap Memory: 70.3 MB Avg: 163.65 MiB/s (+8.37%) Median: 162.83 MiB/s (+8.80%) [190.20 MiB/s .. 144.70 MiB/s] hasmap with postings Memory: 78.6 MB Avg: 148.00 MiB/s (+7.75%) Median: 151.53 MiB/s (+9.11%) [166.92 MiB/s .. 
120.09 MiB/s] zipfs numbers 2mio hashmap Memory: 1.3 MB Avg: 1.2535 GiB/s (+2.59%) Median: 1.2654 GiB/s (+0.36%) [1.2938 GiB/s .. 1.0592 GiB/s] hasmap with postings Memory: 9.7 MB Avg: 377.96 MiB/s (-4.94%) Median: 381.82 MiB/s (-3.67%) [426.14 MiB/s .. 335.66 MiB/s] numbers unique 5mio hashmap Memory: 277.9 MB Avg: 121.30 MiB/s (+2.00%) Median: 121.99 MiB/s (+2.99%) [132.51 MiB/s .. 110.32 MiB/s] hasmap with postings Memory: 295.7 MB Avg: 114.23 MiB/s (+2.13%) Median: 115.26 MiB/s (+2.94%) [124.08 MiB/s .. 103.38 MiB/s] zipfs numbers 5mio hashmap Memory: 1.3 MB Avg: 1.2326 GiB/s (+0.63%) Median: 1.2400 GiB/s (+0.71%) [1.2755 GiB/s .. 1.0923 GiB/s] hasmap with postings Memory: 25.4 MB Avg: 360.49 MiB/s (+1.07%) Median: 363.44 MiB/s (+1.27%) [404.88 MiB/s .. 300.38 MiB/s] ``` * rename bench * update binggan * rename to HASHMAP_CAPACITY	2024-10-16 11:41:33 +08:00
dependabot[bot]	99be20cedd	Update binggan requirement from 0.10.0 to 0.12.0 (#2519 ) * Update binggan requirement from 0.10.0 to 0.12.0 --- updated-dependencies: - dependency-name: binggan dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> * fix build --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>	2024-10-16 11:36:04 +08:00
Bruce Mitchener	5f026901b8	Update MSRV to 1.75 (#2515 ) This is required by the `fs4` dependency. There are other things that need something later than 1.66. Both quickwit and the Python binding already require something newer.	2024-10-16 10:32:16 +08:00
baishen	6dfa2df06f	fix OwnedBytes debug panic (#2512 )	2024-10-16 10:31:40 +08:00