Make allocating field names avoidable for range and exists queries.

If the field names are statically known, `Cow::Borrowed(&'static str)` can handle them without allocations. The general case is still handled by `Cow::Owned(String)`.
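
As a rough illustration of the idea (not the code from this change), the sketch below stores a field name as `Cow<'static, str>`. The `ExistsQuerySketch` type and its methods are hypothetical names invented for this example; only `std::borrow::Cow` is real API.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for a query that keeps a field name.
/// This is NOT tantivy's actual type; it only illustrates the pattern.
struct ExistsQuerySketch {
    field: Cow<'static, str>,
}

impl ExistsQuerySketch {
    /// Accepts both `&'static str` (stored borrowed, no allocation)
    /// and `String` (stored owned, allocated by the caller).
    fn new(field: impl Into<Cow<'static, str>>) -> Self {
        Self { field: field.into() }
    }

    /// Returns the field name regardless of how it is stored.
    fn field_name(&self) -> &str {
        &self.field
    }

    /// True when the name is stored as a borrowed static string.
    fn is_borrowed(&self) -> bool {
        matches!(self.field, Cow::Borrowed(_))
    }
}

fn main() {
    // Statically known name: stored as Cow::Borrowed.
    let static_query = ExistsQuerySketch::new("score");
    // Name built at runtime: stored as Cow::Owned.
    let dynamic_query = ExistsQuerySketch::new(format!("attributes.{}", "color"));

    println!("{} borrowed: {}", static_query.field_name(), static_query.is_borrowed());
    println!("{} borrowed: {}", dynamic_query.field_name(), dynamic_query.is_borrowed());
}
```

Taking `impl Into<Cow<'static, str>>` lets callers pass either a string literal or a `String`, so existing call sites that build names at runtime keep working.
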
Bumped census version from 0.4.0 to 0.4.2.
2026-02-23 16:20:36 +00:00 · 2024-01-26 17:31:44 +01:00 · 2024-01-26 19:32:02 +09:00 · 2024-01-23 16:27:34 +01:00 · 2024-01-19 17:46:48 +09:00 · 2024-01-18 05:58:24 +01:00
25 changed files with 314 additions and 336 deletions
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -38,7 +38,7 @@ crossbeam-channel = "0.5.4"
 rust-stemmers = "1.2.0"
 downcast-rs = "1.2.0"
 bitpacking = { version = "0.9.2", default-features = false, features = ["bitpacker4x"] }
-census = "0.4.0"
+census = "0.4.2"
 rustc-hash = "1.1.0"
 thiserror = "1.0.30"
 htmlescape = "0.3.1"
--- a/README.md
+++ b/README.md
@@ -5,19 +5,18 @@
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 [![Crates.io](https://img.shields.io/crates/v/tantivy.svg)](https://crates.io/crates/tantivy)

-![Tantivy](https://tantivy-search.github.io/logo/tantivy-logo.png)
+<img src="https://tantivy-search.github.io/logo/tantivy-logo.png" alt="Tantivy, the fastest full-text search engine library written in Rust" height="250">

-**Tantivy** is a **full-text search engine library** written in Rust.
+## Fast full-text search engine library written in Rust

-It is closer to [Apache Lucene](https://lucene.apache.org/) than to [Elasticsearch](https://www.elastic.co/products/elasticsearch) or [Apache Solr](https://lucene.apache.org/solr/) in the sense it is not
-an off-the-shelf search engine server, but rather a crate that can be used
-to build such a search engine.
+**If you are looking for an alternative to Elasticsearch or Apache Solr, check out [Quickwit](https://github.com/quickwit-oss/quickwit), our distributed search engine built on top of Tantivy.**
+
+Tantivy is closer to [Apache Lucene](https://lucene.apache.org/) than to [Elasticsearch](https://www.elastic.co/products/elasticsearch) or [Apache Solr](https://lucene.apache.org/solr/) in the sense it is not
+an off-the-shelf search engine server, but rather a crate that can be used to build such a search engine.

 Tantivy is, in fact, strongly inspired by Lucene's design.

-If you are looking for an alternative to Elasticsearch or Apache Solr, check out [Quickwit](https://github.com/quickwit-oss/quickwit), our search engine built on top of Tantivy.
-
-# Benchmark
+## Benchmark

 The following [benchmark](https://tantivy-search.github.io/bench/) breakdowns
 performance for different types of queries/collections.
@@ -28,7 +27,7 @@ Your mileage WILL vary depending on the nature of queries and their load.

 Details about the benchmark can be found at this [repository](https://github.com/quickwit-oss/search-benchmark-game).

-# Features
+## Features

 - Full-text search
 - Configurable tokenizer (stemming available for 17 Latin languages) with third party support for Chinese ([tantivy-jieba](https://crates.io/crates/tantivy-jieba) and [cang-jie](https://crates.io/crates/cang-jie)), Japanese ([lindera](https://github.com/lindera-morphology/lindera-tantivy), [Vaporetto](https://crates.io/crates/vaporetto_tantivy), and [tantivy-tokenizer-tiny-segmenter](https://crates.io/crates/tantivy-tokenizer-tiny-segmenter)) and Korean ([lindera](https://github.com/lindera-morphology/lindera-tantivy) + [lindera-ko-dic-builder](https://github.com/lindera-morphology/lindera-ko-dic-builder))
@@ -54,11 +53,11 @@ Details about the benchmark can be found at this [repository](https://github.com
 - Searcher Warmer API
 - Cheesy logo with a horse

-## Non-features
+### Non-features

 Distributed search is out of the scope of Tantivy, but if you are looking for this feature, check out [Quickwit](https://github.com/quickwit-oss/quickwit/).

-# Getting started
+## Getting started

 Tantivy works on stable Rust and supports Linux, macOS, and Windows.

@@ -68,7 +67,7 @@ index documents, and search via the CLI or a small server with a REST API.
 It walks you through getting a Wikipedia search engine up and running in a few minutes.
 - [Reference doc for the last released version](https://docs.rs/tantivy/)

-# How can I support this project?
+## How can I support this project?

 There are many ways to support this project.

@@ -79,16 +78,16 @@ There are many ways to support this project.
 - Contribute code (you can join [our Discord server](https://discord.gg/MT27AG5EVE))
 - Talk about Tantivy around you

-# Contributing code
+## Contributing code

 We use the GitHub Pull Request workflow: reference a GitHub ticket and/or include a comprehensive commit message when opening a PR.
 Feel free to update CHANGELOG.md with your contribution.

-## Tokenizer
+### Tokenizer

 When implementing a tokenizer for tantivy depend on the `tantivy-tokenizer-api` crate.

-## Clone and build locally
+### Clone and build locally

 Tantivy compiles on stable Rust.
 To check out and run tests, you can simply run:
@@ -99,7 +98,7 @@ cd tantivy
 cargo test
 ```

-# Companies Using Tantivy
+## Companies Using Tantivy

 <p align="left">
 <img align="center" src="doc/assets/images/etsy.png" alt="Etsy" height="25" width="auto" />&nbsp;
@@ -111,7 +110,7 @@ cargo test
 <img align="center" src="doc/assets/images/element-dark-theme.png#gh-dark-mode-only" alt="Element.io" height="25" width="auto" />
 </p>

-# FAQ
+## FAQ

 ### Can I use Tantivy in other languages?

--- a/columnar/src/column_index/mod.rs
+++ b/columnar/src/column_index/mod.rs
@@ -126,18 +126,18 @@ impl ColumnIndex {
        }
    }

-    pub fn docid_range_to_rowids(&self, doc_id: Range<DocId>) -> Range<RowId> {
+    pub fn docid_range_to_rowids(&self, doc_id_range: Range<DocId>) -> Range<RowId> {
        match self {
            ColumnIndex::Empty { .. } => 0..0,
-            ColumnIndex::Full => doc_id,
+            ColumnIndex::Full => doc_id_range,
            ColumnIndex::Optional(optional_index) => {
-                let row_start = optional_index.rank(doc_id.start);
-                let row_end = optional_index.rank(doc_id.end);
+                let row_start = optional_index.rank(doc_id_range.start);
+                let row_end = optional_index.rank(doc_id_range.end);
                row_start..row_end
            }
            ColumnIndex::Multivalued(multivalued_index) => {
-                let end_docid = doc_id.end.min(multivalued_index.num_docs() - 1) + 1;
-                let start_docid = doc_id.start.min(end_docid);
+                let end_docid = doc_id_range.end.min(multivalued_index.num_docs() - 1) + 1;
+                let start_docid = doc_id_range.start.min(end_docid);

                let row_start = multivalued_index.start_index_column.get_val(start_docid);
                let row_end = multivalued_index.start_index_column.get_val(end_docid);
--- a/columnar/src/column_index/optional_index/mod.rs
+++ b/columnar/src/column_index/optional_index/mod.rs
@@ -21,8 +21,6 @@ const DENSE_BLOCK_THRESHOLD: u32 =

 const ELEMENTS_PER_BLOCK: u32 = u16::MAX as u32 + 1;

-const BLOCK_SIZE: RowId = 1 << 16;
-
 #[derive(Copy, Clone, Debug)]
 struct BlockMeta {
    non_null_rows_before_block: u32,
@@ -109,8 +107,8 @@ struct RowAddr {
 #[inline(always)]
 fn row_addr_from_row_id(row_id: RowId) -> RowAddr {
    RowAddr {
-        block_id: (row_id / BLOCK_SIZE) as u16,
-        in_block_row_id: (row_id % BLOCK_SIZE) as u16,
+        block_id: (row_id / ELEMENTS_PER_BLOCK) as u16,
+        in_block_row_id: (row_id % ELEMENTS_PER_BLOCK) as u16,
    }
 }

@@ -185,8 +183,13 @@ impl Set<RowId> for OptionalIndex {
        }
    }

+    /// Any value doc_id is allowed.
+    /// In particular, doc_id = num_rows.
    #[inline]
    fn rank(&self, doc_id: DocId) -> RowId {
+        if doc_id >= self.num_docs() {
+            return self.num_non_nulls();
+        }
        let RowAddr {
            block_id,
            in_block_row_id,
@@ -200,13 +203,15 @@ impl Set<RowId> for OptionalIndex {
        block_meta.non_null_rows_before_block + block_offset_row_id
    }

+    /// Any value doc_id is allowed.
+    /// In particular, doc_id = num_rows.
    #[inline]
    fn rank_if_exists(&self, doc_id: DocId) -> Option<RowId> {
        let RowAddr {
            block_id,
            in_block_row_id,
        } = row_addr_from_row_id(doc_id);
-        let block_meta = self.block_metas[block_id as usize];
+        let block_meta = *self.block_metas.get(block_id as usize)?;
        let block = self.block(block_meta);
        let block_offset_row_id = match block {
            Block::Dense(dense_block) => dense_block.rank_if_exists(in_block_row_id),
@@ -491,7 +496,7 @@ fn deserialize_optional_index_block_metadatas(
        non_null_rows_before_block += num_non_null_rows;
    }
    block_metas.resize(
-        ((num_rows + BLOCK_SIZE - 1) / BLOCK_SIZE) as usize,
+        ((num_rows + ELEMENTS_PER_BLOCK - 1) / ELEMENTS_PER_BLOCK) as usize,
        BlockMeta {
            non_null_rows_before_block,
            start_byte_offset,
--- a/columnar/src/column_index/optional_index/set.rs
+++ b/columnar/src/column_index/optional_index/set.rs
@@ -39,7 +39,8 @@ pub trait Set<T> {
    ///
    /// # Panics
    ///
-    /// May panic if rank is greater than the number of elements in the Set.
+    /// May panic if rank is greater than or equal to the number of
+    /// elements in the Set.
    fn select(&self, rank: T) -> T;

    /// Creates a brand new select cursor.
--- a/columnar/src/column_index/optional_index/tests.rs
+++ b/columnar/src/column_index/optional_index/tests.rs
@@ -3,6 +3,30 @@ use proptest::strategy::Strategy;
 use proptest::{prop_oneof, proptest};

 use super::*;
+use crate::{ColumnarReader, ColumnarWriter, DynamicColumnHandle};
+
+#[test]
+fn test_optional_index_bug_2293() {
+    // tests for panic in docid_range_to_rowids for docid == num_docs
+    test_optional_index_with_num_docs(ELEMENTS_PER_BLOCK - 1);
+    test_optional_index_with_num_docs(ELEMENTS_PER_BLOCK);
+    test_optional_index_with_num_docs(ELEMENTS_PER_BLOCK + 1);
+}
+fn test_optional_index_with_num_docs(num_docs: u32) {
+    let mut dataframe_writer = ColumnarWriter::default();
+    dataframe_writer.record_numerical(100, "score", 80i64);
+    let mut buffer: Vec<u8> = Vec::new();
+    dataframe_writer
+        .serialize(num_docs, None, &mut buffer)
+        .unwrap();
+    let columnar = ColumnarReader::open(buffer).unwrap();
+    assert_eq!(columnar.num_columns(), 1);
+    let cols: Vec<DynamicColumnHandle> = columnar.read_columns("score").unwrap();
+    assert_eq!(cols.len(), 1);
+
+    let col = cols[0].open().unwrap();
+    col.column_index().docid_range_to_rowids(0..num_docs);
+}

 #[test]
 fn test_dense_block_threshold() {
@@ -35,7 +59,7 @@ proptest! {

 #[test]
 fn test_with_random_sets_simple() {
-    let vals = 10..BLOCK_SIZE * 2;
+    let vals = 10..ELEMENTS_PER_BLOCK * 2;
    let mut out: Vec<u8> = Vec::new();
    serialize_optional_index(&vals, 100, &mut out).unwrap();
    let null_index = open_optional_index(OwnedBytes::new(out)).unwrap();
@@ -171,7 +195,7 @@ fn test_optional_index_rank() {
    test_optional_index_rank_aux(&[0u32, 1u32]);
    let mut block = Vec::new();
    block.push(3u32);
-    block.extend((0..BLOCK_SIZE).map(|i| i + BLOCK_SIZE + 1));
+    block.extend((0..ELEMENTS_PER_BLOCK).map(|i| i + ELEMENTS_PER_BLOCK + 1));
    test_optional_index_rank_aux(&block);
 }

@@ -185,8 +209,8 @@ fn test_optional_index_iter_empty_one() {
 fn test_optional_index_iter_dense_block() {
    let mut block = Vec::new();
    block.push(3u32);
-    block.extend((0..BLOCK_SIZE).map(|i| i + BLOCK_SIZE + 1));
-    test_optional_index_iter_aux(&block, 3 * BLOCK_SIZE);
+    block.extend((0..ELEMENTS_PER_BLOCK).map(|i| i + ELEMENTS_PER_BLOCK + 1));
+    test_optional_index_iter_aux(&block, 3 * ELEMENTS_PER_BLOCK);
 }

 #[test]
--- a/columnar/src/column_values/mod.rs
+++ b/columnar/src/column_values/mod.rs
@@ -101,7 +101,7 @@ pub trait ColumnValues<T: PartialOrd = u64>: Send + Sync {
        row_id_hits: &mut Vec<RowId>,
    ) {
        let row_id_range = row_id_range.start..row_id_range.end.min(self.num_vals());
-        for idx in row_id_range.start..row_id_range.end {
+        for idx in row_id_range {
            let val = self.get_val(idx);
            if value_range.contains(&val) {
                row_id_hits.push(idx);
--- a/query-grammar/src/infallible.rs
+++ b/query-grammar/src/infallible.rs
@@ -81,8 +81,8 @@ where
    T: InputTakeAtPosition + Clone,
    <T as InputTakeAtPosition>::Item: AsChar + Clone,
 {
-    opt_i(nom::character::complete::space0)(input)
-        .map(|(left, (spaces, errors))| (left, (spaces.expect("space0 can't fail"), errors)))
+    opt_i(nom::character::complete::multispace0)(input)
+        .map(|(left, (spaces, errors))| (left, (spaces.expect("multispace0 can't fail"), errors)))
 }

 pub(crate) fn space1_infallible<T>(input: T) -> JResult<T, Option<T>>
@@ -90,7 +90,7 @@ where
    T: InputTakeAtPosition + Clone + InputLength,
    <T as InputTakeAtPosition>::Item: AsChar + Clone,
 {
-    opt_i(nom::character::complete::space1)(input).map(|(left, (spaces, mut errors))| {
+    opt_i(nom::character::complete::multispace1)(input).map(|(left, (spaces, mut errors))| {
        if spaces.is_none() {
            errors.push(LenientErrorInternal {
                pos: left.input_len(),
--- a/query-grammar/src/query_grammar.rs
+++ b/query-grammar/src/query_grammar.rs
@@ -3,7 +3,7 @@ use std::iter::once;
 use nom::branch::alt;
 use nom::bytes::complete::tag;
 use nom::character::complete::{
-    anychar, char, digit1, none_of, one_of, satisfy, space0, space1, u32,
+    anychar, char, digit1, multispace0, multispace1, none_of, one_of, satisfy, u32,
 };
 use nom::combinator::{eof, map, map_res, opt, peek, recognize, value, verify};
 use nom::error::{Error, ErrorKind};
@@ -65,7 +65,7 @@ fn word_infallible(delimiter: &str) -> impl Fn(&str) -> JResult<&str, Option<&st
    |inp| {
        opt_i_err(
            preceded(
-                space0,
+                multispace0,
                recognize(many1(satisfy(|c| {
                    !c.is_whitespace() && !delimiter.contains(c)
                }))),
@@ -225,10 +225,10 @@ fn term_group(inp: &str) -> IResult<&str, UserInputAst> {

    map(
        tuple((
-            terminated(field_name, space0),
+            terminated(field_name, multispace0),
            delimited(
-                tuple((char('('), space0)),
-                separated_list0(space1, tuple((opt(occur_symbol), term_or_phrase))),
+                tuple((char('('), multispace0)),
+                separated_list0(multispace1, tuple((opt(occur_symbol), term_or_phrase))),
                char(')'),
            ),
        )),
@@ -250,7 +250,7 @@ fn term_group_precond(inp: &str) -> IResult<&str, (), ()> {
        (),
        peek(tuple((
            field_name,
-            space0,
+            multispace0,
            char('('), // when we are here, we know it can't be anything but a term group
        ))),
    )(inp)
@@ -259,7 +259,7 @@ fn term_group_precond(inp: &str) -> IResult<&str, (), ()> {

 fn term_group_infallible(inp: &str) -> JResult<&str, UserInputAst> {
    let (mut inp, (field_name, _, _, _)) =
-        tuple((field_name, space0, char('('), space0))(inp).expect("precondition failed");
+        tuple((field_name, multispace0, char('('), multispace0))(inp).expect("precondition failed");

    let mut terms = Vec::new();
    let mut errs = Vec::new();
@@ -305,7 +305,7 @@ fn exists(inp: &str) -> IResult<&str, UserInputLeaf> {
        UserInputLeaf::Exists {
            field: String::new(),
        },
-        tuple((space0, char('*'))),
+        tuple((multispace0, char('*'))),
    )(inp)
 }

@@ -314,7 +314,7 @@ fn exists_precond(inp: &str) -> IResult<&str, (), ()> {
        (),
        peek(tuple((
            field_name,
-            space0,
+            multispace0,
            char('*'), // when we are here, we know it can't be anything but a exists
        ))),
    )(inp)
@@ -323,7 +323,7 @@ fn exists_precond(inp: &str) -> IResult<&str, (), ()> {

 fn exists_infallible(inp: &str) -> JResult<&str, UserInputAst> {
    let (inp, (field_name, _, _)) =
-        tuple((field_name, space0, char('*')))(inp).expect("precondition failed");
+        tuple((field_name, multispace0, char('*')))(inp).expect("precondition failed");

    let exists = UserInputLeaf::Exists { field: field_name }.into();
    Ok((inp, (exists, Vec::new())))
@@ -349,7 +349,7 @@ fn literal_no_group_infallible(inp: &str) -> JResult<&str, Option<UserInputAst>>
            alt_infallible(
                (
                    (
-                        value((), tuple((tag("IN"), space0, char('[')))),
+                        value((), tuple((tag("IN"), multispace0, char('[')))),
                        map(set_infallible, |(set, errs)| (Some(set), errs)),
                    ),
                    (
@@ -430,8 +430,8 @@ fn range(inp: &str) -> IResult<&str, UserInputLeaf> {
    // check for unbounded range in the form of <5, <=10, >5, >=5
    let elastic_unbounded_range = map(
        tuple((
-            preceded(space0, alt((tag(">="), tag("<="), tag("<"), tag(">")))),
-            preceded(space0, range_term_val()),
+            preceded(multispace0, alt((tag(">="), tag("<="), tag("<"), tag(">")))),
+            preceded(multispace0, range_term_val()),
        )),
        |(comparison_sign, bound)| match comparison_sign {
            ">=" => (UserInputBound::Inclusive(bound), UserInputBound::Unbounded),
@@ -444,7 +444,7 @@ fn range(inp: &str) -> IResult<&str, UserInputLeaf> {
    );

    let lower_bound = map(
-        separated_pair(one_of("{["), space0, range_term_val()),
+        separated_pair(one_of("{["), multispace0, range_term_val()),
        |(boundary_char, lower_bound)| {
            if lower_bound == "*" {
                UserInputBound::Unbounded
@@ -457,7 +457,7 @@ fn range(inp: &str) -> IResult<&str, UserInputLeaf> {
    );

    let upper_bound = map(
-        separated_pair(range_term_val(), space0, one_of("}]")),
+        separated_pair(range_term_val(), multispace0, one_of("}]")),
        |(upper_bound, boundary_char)| {
            if upper_bound == "*" {
                UserInputBound::Unbounded
@@ -469,8 +469,11 @@ fn range(inp: &str) -> IResult<&str, UserInputLeaf> {
        },
    );

-    let lower_to_upper =
-        separated_pair(lower_bound, tuple((space1, tag("TO"), space1)), upper_bound);
+    let lower_to_upper = separated_pair(
+        lower_bound,
+        tuple((multispace1, tag("TO"), multispace1)),
+        upper_bound,
+    );

    map(
        alt((elastic_unbounded_range, lower_to_upper)),
@@ -490,13 +493,16 @@ fn range_infallible(inp: &str) -> JResult<&str, UserInputLeaf> {
            word_infallible("]}"),
            space1_infallible,
            opt_i_err(
-                terminated(tag("TO"), alt((value((), space1), value((), eof)))),
+                terminated(tag("TO"), alt((value((), multispace1), value((), eof)))),
                "missing keyword TO",
            ),
            word_infallible("]}"),
            opt_i_err(one_of("]}"), "missing range delimiter"),
        )),
-        |((lower_bound_kind, _space0, lower, _space1, to, upper, upper_bound_kind), errs)| {
+        |(
+            (lower_bound_kind, _multispace0, lower, _multispace1, to, upper, upper_bound_kind),
+            errs,
+        )| {
            let lower_bound = match (lower_bound_kind, lower) {
                (_, Some("*")) => UserInputBound::Unbounded,
                (_, None) => UserInputBound::Unbounded,
@@ -596,10 +602,10 @@ fn range_infallible(inp: &str) -> JResult<&str, UserInputLeaf> {
 fn set(inp: &str) -> IResult<&str, UserInputLeaf> {
    map(
        preceded(
-            tuple((space0, tag("IN"), space1)),
+            tuple((multispace0, tag("IN"), multispace1)),
            delimited(
-                tuple((char('['), space0)),
-                separated_list0(space1, map(simple_term, |(_, term)| term)),
+                tuple((char('['), multispace0)),
+                separated_list0(multispace1, map(simple_term, |(_, term)| term)),
                char(']'),
            ),
        ),
@@ -667,7 +673,7 @@ fn leaf(inp: &str) -> IResult<&str, UserInputAst> {
    alt((
        delimited(char('('), ast, char(')')),
        map(char('*'), |_| UserInputAst::from(UserInputLeaf::All)),
-        map(preceded(tuple((tag("NOT"), space1)), leaf), negate),
+        map(preceded(tuple((tag("NOT"), multispace1)), leaf), negate),
        literal,
    ))(inp)
 }
@@ -919,17 +925,17 @@ fn aggregate_infallible_expressions(

 fn operand_leaf(inp: &str) -> IResult<&str, (BinaryOperand, UserInputAst)> {
    tuple((
-        terminated(binary_operand, space0),
-        terminated(boosted_leaf, space0),
+        terminated(binary_operand, multispace0),
+        terminated(boosted_leaf, multispace0),
    ))(inp)
 }

 fn ast(inp: &str) -> IResult<&str, UserInputAst> {
    let boolean_expr = map(
-        separated_pair(boosted_leaf, space1, many1(operand_leaf)),
+        separated_pair(boosted_leaf, multispace1, many1(operand_leaf)),
        |(left, right)| aggregate_binary_expressions(left, right),
    );
-    let whitespace_separated_leaves = map(separated_list1(space1, occur_leaf), |subqueries| {
+    let whitespace_separated_leaves = map(separated_list1(multispace1, occur_leaf), |subqueries| {
        if subqueries.len() == 1 {
            let (occur_opt, ast) = subqueries.into_iter().next().unwrap();
            match occur_opt.unwrap_or(Occur::Should) {
@@ -942,9 +948,9 @@ fn ast(inp: &str) -> IResult<&str, UserInputAst> {
    });

    delimited(
-        space0,
+        multispace0,
        alt((boolean_expr, whitespace_separated_leaves)),
-        space0,
+        multispace0,
    )(inp)
 }

@@ -969,7 +975,7 @@ fn ast_infallible(inp: &str) -> JResult<&str, UserInputAst> {
 }

 pub fn parse_to_ast(inp: &str) -> IResult<&str, UserInputAst> {
-    map(delimited(space0, opt(ast), eof), |opt_ast| {
+    map(delimited(multispace0, opt(ast), eof), |opt_ast| {
        rewrite_ast(opt_ast.unwrap_or_else(UserInputAst::empty_query))
    })(inp)
 }
@@ -1145,6 +1151,7 @@ mod test {
    #[test]
    fn test_parse_query_to_ast_binary_op() {
        test_parse_query_to_ast_helper("a AND b", "(+a +b)");
+        test_parse_query_to_ast_helper("a\nAND b", "(+a +b)");
        test_parse_query_to_ast_helper("a OR b", "(?a ?b)");
        test_parse_query_to_ast_helper("a OR b AND c", "(?a ?(+b +c))");
        test_parse_query_to_ast_helper("a AND b         AND c", "(+a +b +c)");
--- a/src/aggregation/bucket/histogram/histogram.rs
+++ b/src/aggregation/bucket/histogram/histogram.rs
@@ -596,10 +596,13 @@ mod tests {

    use super::*;
    use crate::aggregation::agg_req::Aggregations;
+    use crate::aggregation::agg_result::AggregationResults;
    use crate::aggregation::tests::{
        exec_request, exec_request_with_query, exec_request_with_query_and_memory_limit,
        get_test_index_2_segments, get_test_index_from_values, get_test_index_with_num_docs,
    };
+    use crate::aggregation::AggregationCollector;
+    use crate::query::AllQuery;

    #[test]
    fn histogram_test_crooked_values() -> crate::Result<()> {
@@ -1351,6 +1354,35 @@ mod tests {
            })
        );

+        Ok(())
+    }
+    #[test]
+    fn test_aggregation_histogram_empty_index() -> crate::Result<()> {
+        // test index without segments
+        let values = vec![];
+
+        let index = get_test_index_from_values(false, &values)?;
+
+        let agg_req_1: Aggregations = serde_json::from_value(json!({
+            "myhisto": {
+                "histogram": {
+                    "field": "score",
+                    "interval": 10.0
+                },
+            }
+        }))
+        .unwrap();
+
+        let collector = AggregationCollector::from_aggs(agg_req_1, Default::default());
+
+        let reader = index.reader()?;
+        let searcher = reader.searcher();
+        let agg_res: AggregationResults = searcher.search(&AllQuery, &collector).unwrap();
+
+        let res: Value = serde_json::from_str(&serde_json::to_string(&agg_res)?)?;
+        // Make sure the result structure is correct
+        assert_eq!(res["myhisto"]["buckets"].as_array().unwrap().len(), 0);
+
        Ok(())
    }
 }
--- a/src/collector/top_score_collector.rs
+++ b/src/collector/top_score_collector.rs
@@ -309,7 +309,7 @@ impl TopDocs {
    ///
    /// To comfortably work with `u64`s, `i64`s, `f64`s, or `date`s, please refer to
    /// the [.order_by_fast_field(...)](TopDocs::order_by_fast_field) method.
-    fn order_by_u64_field(
+    pub fn order_by_u64_field(
        self,
        field: impl ToString,
        order: Order,
--- a/src/core/searcher.rs
+++ b/src/core/searcher.rs
@@ -1,6 +1,4 @@
 use std::collections::BTreeMap;
-#[cfg(feature = "quickwit")]
-use std::future::Future;
 use std::sync::Arc;
 use std::{fmt, io};

@@ -114,108 +112,6 @@ impl Searcher {
        store_reader.get_async(doc_address.doc_id).await
    }

-    /// Fetches multiple documents in an asynchronous manner.
-    ///
-    /// This method is more efficient than calling [`doc_async`](Self::doc_async) multiple times,
-    /// as it groups overlapping requests to segments and blocks and avoids concurrent requests
-    /// trashing the caches of each other. However, it does so using intermediate data structures
-    /// and independent block caches so it will be slower if documents from very few blocks are
-    /// fetched which would have fit into the global block cache.
-    ///
-    /// The caller is expected to poll these futures concurrently (e.g. using `FuturesUnordered`)
-    /// or in parallel (e.g. using `JoinSet`) as fits best with the given use case, i.e. whether
-    /// it is predominately I/O-bound or rather CPU-bound.
-    ///
-    /// Note that any blocks brought into any of the per-segment-and-block groups will not be pulled
-    /// into the global block cache and hence not be available for subsequent calls.
-    ///
-    /// Note that there is no synchronous variant of this method as the same degree of efficiency
-    /// can be had by accessing documents in address order.
-    ///
-    /// # Example
-    ///
-    /// ```rust,no_run
-    /// # use futures::executor::block_on;
-    /// # use futures::stream::{FuturesUnordered, StreamExt};
-    /// #
-    /// # use tantivy::schema::Schema;
-    /// # use tantivy::{DocAddress, Index, TantivyDocument, TantivyError};
-    /// #
-    /// # let index = Index::create_in_ram(Schema::builder().build());
-    /// # let searcher = index.reader()?.searcher();
-    /// #
-    /// # let doc_addresses = (0..10).map(|_| DocAddress::new(0, 0));
-    /// #
-    /// let mut groups: FuturesUnordered<_> = searcher
-    ///     .docs_async::<TantivyDocument>(doc_addresses)?
-    ///     .collect();
-    ///
-    /// let mut docs = Vec::new();
-    ///
-    /// block_on(async {
-    ///     while let Some(group) = groups.next().await {
-    ///         docs.extend(group?);
-    ///     }
-    ///
-    ///     Ok::<_, TantivyError>(())
-    /// })?;
-    /// #
-    /// # Ok::<_, TantivyError>(())
-    /// ```
-    #[cfg(feature = "quickwit")]
-    pub fn docs_async<D: DocumentDeserialize>(
-        &self,
-        doc_addresses: impl IntoIterator<Item = DocAddress>,
-    ) -> crate::Result<
-        impl Iterator<Item = impl Future<Output = crate::Result<Vec<(DocAddress, D)>>>> + '_,
-    > {
-        use rustc_hash::FxHashMap;
-
-        use crate::store::CacheKey;
-        use crate::{DocId, SegmentOrdinal};
-
-        let mut groups: FxHashMap<(SegmentOrdinal, CacheKey), Vec<DocId>> = Default::default();
-
-        for doc_address in doc_addresses {
-            let store_reader = &self.inner.store_readers[doc_address.segment_ord as usize];
-            let cache_key = store_reader.cache_key(doc_address.doc_id)?;
-
-            groups
-                .entry((doc_address.segment_ord, cache_key))
-                .or_default()
-                .push(doc_address.doc_id);
-        }
-
-        let futures = groups
-            .into_iter()
-            .map(|((segment_ord, cache_key), doc_ids)| {
-                // Each group fetches documents from exactly one block and
-                // therefore gets an independent block cache of size one.
-                let store_reader =
-                    self.inner.store_readers[segment_ord as usize].fork_cache(1, &[cache_key]);
-
-                async move {
-                    let mut docs = Vec::new();
-
-                    for doc_id in doc_ids {
-                        let doc = store_reader.get_async(doc_id).await?;
-
-                        docs.push((
-                            DocAddress {
-                                segment_ord,
-                                doc_id,
-                            },
-                            doc,
-                        ));
-                    }
-
-                    Ok(docs)
-                }
-            });
-
-        Ok(futures)
-    }
-
    /// Access the schema associated with the index of this searcher.
    pub fn schema(&self) -> &Schema {
        &self.inner.schema
--- a/src/core/tests.rs
+++ b/src/core/tests.rs
@@ -424,7 +424,7 @@ fn test_non_text_json_term_freq() {
    json_term_writer.set_fast_value(75u64);
    let postings = inv_idx
        .read_postings(
-            json_term_writer.term(),
+            &json_term_writer.term(),
            IndexRecordOption::WithFreqsAndPositions,
        )
        .unwrap()
@@ -462,7 +462,7 @@ fn test_non_text_json_term_freq_bitpacked() {
    json_term_writer.set_fast_value(75u64);
    let mut postings = inv_idx
        .read_postings(
-            json_term_writer.term(),
+            &json_term_writer.term(),
            IndexRecordOption::WithFreqsAndPositions,
        )
        .unwrap()
@@ -474,60 +474,3 @@ fn test_non_text_json_term_freq_bitpacked() {
        assert_eq!(postings.term_freq(), 1u32);
    }
 }
-
-#[cfg(feature = "quickwit")]
-#[test]
-fn test_get_many_docs() -> crate::Result<()> {
-    use futures::executor::block_on;
-    use futures::stream::{FuturesUnordered, StreamExt};
-
-    use crate::schema::{OwnedValue, STORED};
-    use crate::{DocAddress, TantivyError};
-
-    let mut schema_builder = Schema::builder();
-    let num_field = schema_builder.add_u64_field("num", STORED);
-    let schema = schema_builder.build();
-    let index = Index::create_in_ram(schema);
-    let mut index_writer: IndexWriter = index.writer_for_tests()?;
-    index_writer.set_merge_policy(Box::new(NoMergePolicy));
-    for i in 0..10u64 {
-        let doc = doc!(num_field=>i);
-        index_writer.add_document(doc)?;
-    }
-
-    index_writer.commit()?;
-    let segment_ids = index.searchable_segment_ids()?;
-    index_writer.merge(&segment_ids).wait().unwrap();
-
-    let searcher = index.reader()?.searcher();
-    assert_eq!(searcher.num_docs(), 10);
-
-    let doc_addresses = (0..10).map(|i| DocAddress::new(0, i));
-
-    let mut groups: FuturesUnordered<_> = searcher
-        .docs_async::<TantivyDocument>(doc_addresses)?
-        .collect();
-
-    let mut doc_nums = Vec::new();
-
-    block_on(async {
-        while let Some(group) = groups.next().await {
-            for (_doc_address, doc) in group? {
-                let num_value = doc.get_first(num_field).unwrap();
-
-                if let OwnedValue::U64(num) = num_value {
-                    doc_nums.push(*num);
-                } else {
-                    panic!("Expected u64 value");
-                }
-            }
-        }
-
-        Ok::<_, TantivyError>(())
-    })?;
-
-    doc_nums.sort();
-    assert_eq!(doc_nums, (0..10).collect::<Vec<u64>>());
-
-    Ok(())
-}
--- a/src/indexer/index_writer.rs
+++ b/src/indexer/index_writer.rs
@@ -1651,6 +1651,7 @@ mod tests {
        force_end_merge: bool,
    ) -> crate::Result<Index> {
        let mut schema_builder = schema::Schema::builder();
+        let json_field = schema_builder.add_json_field("json", FAST | TEXT | STORED);
        let ip_field = schema_builder.add_ip_addr_field("ip", FAST | INDEXED | STORED);
        let ips_field = schema_builder
            .add_ip_addr_field("ips", IpAddrOptions::default().set_fast().set_indexed());
@@ -1729,7 +1730,9 @@ mod tests {
                            id_field=>id,
                        ))?;
                    } else {
+                        let json = json!({"date1": format!("2022-{id}-01T00:00:01Z"), "date2": format!("{id}-05-01T00:00:01Z"), "id": id, "ip": ip.to_string()});
                        index_writer.add_document(doc!(id_field=>id,
+                                json_field=>json,
                                bytes_field => id.to_le_bytes().as_slice(),
                                id_opt_field => id,
                                ip_field => ip,
--- a/src/indexer/merger.rs
+++ b/src/indexer/merger.rs
@@ -605,6 +605,10 @@ impl IndexMerger {
                            segment_postings.positions(&mut positions_buffer);
                            segment_postings.term_freq()
                        } else {
+                            // The positions_buffer may contain positions from the previous term
+                            // Existence of positions depend on the value type in JSON fields.
+                            // https://github.com/quickwit-oss/tantivy/issues/2283
+                            positions_buffer.clear();
                            0u32
                        };

--- a/src/indexer/segment_writer.rs
+++ b/src/indexer/segment_writer.rs
@@ -879,6 +879,31 @@ mod tests {
        assert_eq!(searcher.search(&phrase_query, &Count).unwrap(), 0);
    }

+    #[test]
+    fn test_json_term_with_numeric_merge_panic_regression_bug_2283() {
+        // https://github.com/quickwit-oss/tantivy/issues/2283
+        let mut schema_builder = Schema::builder();
+        let json = schema_builder.add_json_field("json", TEXT);
+        let schema = schema_builder.build();
+        let index = Index::create_in_ram(schema);
+        let mut writer = index.writer_for_tests().unwrap();
+        let doc = json!({"field": "a"});
+        writer.add_document(doc!(json=>doc)).unwrap();
+        writer.commit().unwrap();
+        let doc = json!({"field": "a", "id": 1});
+        writer.add_document(doc!(json=>doc.clone())).unwrap();
+        writer.commit().unwrap();
+
+        // Force Merge
+        writer.wait_merging_threads().unwrap();
+        let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
+        let segment_ids = index
+            .searchable_segment_ids()
+            .expect("Searchable segments failed.");
+        index_writer.merge(&segment_ids).wait().unwrap();
+        assert!(index_writer.wait_merging_threads().is_ok());
+    }
+
    #[test]
    fn test_bug_regression_1629_position_when_array_with_a_field_value_that_does_not_contain_any_token(
    ) {
--- a/src/query/exist_query.rs
+++ b/src/query/exist_query.rs
@@ -1,4 +1,4 @@
-use core::fmt::Debug;
+use std::borrow::Cow;

 use columnar::{ColumnIndex, DynamicColumn};

@@ -14,7 +14,7 @@ use crate::{DocId, Score, TantivyError};
 /// All of the matched documents get the score 1.0.
 #[derive(Clone, Debug)]
 pub struct ExistsQuery {
-    field_name: String,
+    field: Cow<'static, str>,
 }

 impl ExistsQuery {
@@ -23,40 +23,42 @@ impl ExistsQuery {
    /// This query matches all documents with at least one non-null value in the specified field.
    /// This constructor never fails, but executing the search with this query will return an
    /// error if the specified field doesn't exists or is not a fast field.
-    pub fn new_exists_query(field: String) -> ExistsQuery {
-        ExistsQuery { field_name: field }
+    pub fn new_exists_query<F: Into<Cow<'static, str>>>(field: F) -> ExistsQuery {
+        ExistsQuery {
+            field: field.into(),
+        }
    }
 }

 impl Query for ExistsQuery {
    fn weight(&self, enable_scoring: EnableScoring) -> crate::Result<Box<dyn Weight>> {
        let schema = enable_scoring.schema();
-        let Some((field, _path)) = schema.find_field(&self.field_name) else {
-            return Err(TantivyError::FieldNotFound(self.field_name.clone()));
+        let Some((field, _path)) = schema.find_field(&self.field) else {
+            return Err(TantivyError::FieldNotFound(self.field.to_string()));
        };
        let field_type = schema.get_field_entry(field).field_type();
        if !field_type.is_fast() {
            return Err(TantivyError::SchemaError(format!(
                "Field {} is not a fast field.",
-                self.field_name
+                self.field
            )));
        }
        Ok(Box::new(ExistsWeight {
-            field_name: self.field_name.clone(),
+            field: self.field.clone(),
        }))
    }
 }

 /// Weight associated with the `ExistsQuery` query.
 pub struct ExistsWeight {
-    field_name: String,
+    field: Cow<'static, str>,
 }

 impl Weight for ExistsWeight {
    fn scorer(&self, reader: &SegmentReader, boost: Score) -> crate::Result<Box<dyn Scorer>> {
        let fast_field_reader = reader.fast_fields();
        let dynamic_columns: crate::Result<Vec<DynamicColumn>> = fast_field_reader
-            .dynamic_column_handles(&self.field_name)?
+            .dynamic_column_handles(&self.field)?
            .into_iter()
            .map(|handle| handle.open().map_err(|io_error| io_error.into()))
            .collect();
--- a/src/query/range_query/range_query.rs
+++ b/src/query/range_query/range_query.rs
@@ -1,3 +1,4 @@
+use std::borrow::Cow;
 use std::io;
 use std::net::Ipv6Addr;
 use std::ops::{Bound, Range};
@@ -68,7 +69,7 @@ use crate::{DateTime, DocId, Score};
 /// ```
 #[derive(Clone, Debug)]
 pub struct RangeQuery {
-    field: String,
+    field: Cow<'static, str>,
    value_type: Type,
    lower_bound: Bound<Vec<u8>>,
    upper_bound: Bound<Vec<u8>>,
@@ -80,15 +81,15 @@ impl RangeQuery {
    ///
    /// If the value type is not correct, something may go terribly wrong when
    /// the `Weight` object is created.
-    pub fn new_term_bounds(
-        field: String,
+    pub fn new_term_bounds<F: Into<Cow<'static, str>>>(
+        field: F,
        value_type: Type,
        lower_bound: &Bound<Term>,
        upper_bound: &Bound<Term>,
    ) -> RangeQuery {
        let verify_and_unwrap_term = |val: &Term| val.serialized_value_bytes().to_owned();
        RangeQuery {
-            field,
+            field: field.into(),
            value_type,
            lower_bound: map_bound(lower_bound, verify_and_unwrap_term),
            upper_bound: map_bound(upper_bound, verify_and_unwrap_term),
@@ -100,7 +101,7 @@ impl RangeQuery {
    ///
    /// If the field is not of the type `i64`, tantivy
    /// will panic when the `Weight` object is created.
-    pub fn new_i64(field: String, range: Range<i64>) -> RangeQuery {
+    pub fn new_i64<F: Into<Cow<'static, str>>>(field: F, range: Range<i64>) -> RangeQuery {
        RangeQuery::new_i64_bounds(
            field,
            Bound::Included(range.start),
@@ -115,8 +116,8 @@ impl RangeQuery {
    ///
    /// If the field is not of the type `i64`, tantivy
    /// will panic when the `Weight` object is created.
-    pub fn new_i64_bounds(
-        field: String,
+    pub fn new_i64_bounds<F: Into<Cow<'static, str>>>(
+        field: F,
        lower_bound: Bound<i64>,
        upper_bound: Bound<i64>,
    ) -> RangeQuery {
@@ -126,7 +127,7 @@ impl RangeQuery {
                .to_owned()
        };
        RangeQuery {
-            field,
+            field: field.into(),
            value_type: Type::I64,
            lower_bound: map_bound(&lower_bound, make_term_val),
            upper_bound: map_bound(&upper_bound, make_term_val),
@@ -138,7 +139,7 @@ impl RangeQuery {
    ///
    /// If the field is not of the type `f64`, tantivy
    /// will panic when the `Weight` object is created.
-    pub fn new_f64(field: String, range: Range<f64>) -> RangeQuery {
+    pub fn new_f64<F: Into<Cow<'static, str>>>(field: F, range: Range<f64>) -> RangeQuery {
        RangeQuery::new_f64_bounds(
            field,
            Bound::Included(range.start),
@@ -153,8 +154,8 @@ impl RangeQuery {
    ///
    /// If the field is not of the type `f64`, tantivy
    /// will panic when the `Weight` object is created.
-    pub fn new_f64_bounds(
-        field: String,
+    pub fn new_f64_bounds<F: Into<Cow<'static, str>>>(
+        field: F,
        lower_bound: Bound<f64>,
        upper_bound: Bound<f64>,
    ) -> RangeQuery {
@@ -164,7 +165,7 @@ impl RangeQuery {
                .to_owned()
        };
        RangeQuery {
-            field,
+            field: field.into(),
            value_type: Type::F64,
            lower_bound: map_bound(&lower_bound, make_term_val),
            upper_bound: map_bound(&upper_bound, make_term_val),
@@ -179,8 +180,8 @@ impl RangeQuery {
    ///
    /// If the field is not of the type `u64`, tantivy
    /// will panic when the `Weight` object is created.
-    pub fn new_u64_bounds(
-        field: String,
+    pub fn new_u64_bounds<F: Into<Cow<'static, str>>>(
+        field: F,
        lower_bound: Bound<u64>,
        upper_bound: Bound<u64>,
    ) -> RangeQuery {
@@ -190,7 +191,7 @@ impl RangeQuery {
                .to_owned()
        };
        RangeQuery {
-            field,
+            field: field.into(),
            value_type: Type::U64,
            lower_bound: map_bound(&lower_bound, make_term_val),
            upper_bound: map_bound(&upper_bound, make_term_val),
@@ -202,8 +203,8 @@ impl RangeQuery {
    ///
    /// If the field is not of the type `ip`, tantivy
    /// will panic when the `Weight` object is created.
-    pub fn new_ip_bounds(
-        field: String,
+    pub fn new_ip_bounds<F: Into<Cow<'static, str>>>(
+        field: F,
        lower_bound: Bound<Ipv6Addr>,
        upper_bound: Bound<Ipv6Addr>,
    ) -> RangeQuery {
@@ -213,7 +214,7 @@ impl RangeQuery {
                .to_owned()
        };
        RangeQuery {
-            field,
+            field: field.into(),
            value_type: Type::IpAddr,
            lower_bound: map_bound(&lower_bound, make_term_val),
            upper_bound: map_bound(&upper_bound, make_term_val),
@@ -225,7 +226,7 @@ impl RangeQuery {
    ///
    /// If the field is not of the type `u64`, tantivy
    /// will panic when the `Weight` object is created.
-    pub fn new_u64(field: String, range: Range<u64>) -> RangeQuery {
+    pub fn new_u64<F: Into<Cow<'static, str>>>(field: F, range: Range<u64>) -> RangeQuery {
        RangeQuery::new_u64_bounds(
            field,
            Bound::Included(range.start),
@@ -240,8 +241,8 @@ impl RangeQuery {
    ///
    /// If the field is not of the type `date`, tantivy
    /// will panic when the `Weight` object is created.
-    pub fn new_date_bounds(
-        field: String,
+    pub fn new_date_bounds<F: Into<Cow<'static, str>>>(
+        field: F,
        lower_bound: Bound<DateTime>,
        upper_bound: Bound<DateTime>,
    ) -> RangeQuery {
@@ -251,7 +252,7 @@ impl RangeQuery {
                .to_owned()
        };
        RangeQuery {
-            field,
+            field: field.into(),
            value_type: Type::Date,
            lower_bound: map_bound(&lower_bound, make_term_val),
            upper_bound: map_bound(&upper_bound, make_term_val),
@@ -263,7 +264,7 @@ impl RangeQuery {
    ///
    /// If the field is not of the type `date`, tantivy
    /// will panic when the `Weight` object is created.
-    pub fn new_date(field: String, range: Range<DateTime>) -> RangeQuery {
+    pub fn new_date<F: Into<Cow<'static, str>>>(field: F, range: Range<DateTime>) -> RangeQuery {
        RangeQuery::new_date_bounds(
            field,
            Bound::Included(range.start),
@@ -278,14 +279,14 @@ impl RangeQuery {
    ///
    /// If the field is not of the type `Str`, tantivy
    /// will panic when the `Weight` object is created.
-    pub fn new_str_bounds(
-        field: String,
+    pub fn new_str_bounds<F: Into<Cow<'static, str>>>(
+        field: F,
        lower_bound: Bound<&str>,
        upper_bound: Bound<&str>,
    ) -> RangeQuery {
        let make_term_val = |val: &&str| val.as_bytes().to_vec();
        RangeQuery {
-            field,
+            field: field.into(),
            value_type: Type::Str,
            lower_bound: map_bound(&lower_bound, make_term_val),
            upper_bound: map_bound(&upper_bound, make_term_val),
@@ -297,7 +298,7 @@ impl RangeQuery {
    ///
    /// If the field is not of the type `Str`, tantivy
    /// will panic when the `Weight` object is created.
-    pub fn new_str(field: String, range: Range<&str>) -> RangeQuery {
+    pub fn new_str<F: Into<Cow<'static, str>>>(field: F, range: Range<&str>) -> RangeQuery {
        RangeQuery::new_str_bounds(
            field,
            Bound::Included(range.start),
@@ -358,7 +359,7 @@ impl Query for RangeQuery {
                let lower_bound = map_bound_res(&self.lower_bound, parse_ip_from_bytes)?;
                let upper_bound = map_bound_res(&self.upper_bound, parse_ip_from_bytes)?;
                Ok(Box::new(IPFastFieldRangeWeight::new(
-                    self.field.to_string(),
+                    self.field.clone(),
                    lower_bound,
                    upper_bound,
                )))
@@ -373,14 +374,14 @@ impl Query for RangeQuery {
                let lower_bound = map_bound(&self.lower_bound, parse_from_bytes);
                let upper_bound = map_bound(&self.upper_bound, parse_from_bytes);
                Ok(Box::new(FastFieldRangeWeight::new_u64_lenient(
-                    self.field.to_string(),
+                    self.field.clone(),
                    lower_bound,
                    upper_bound,
                )))
            }
        } else {
            Ok(Box::new(RangeWeight {
-                field: self.field.to_string(),
+                field: self.field.clone(),
                lower_bound: self.lower_bound.clone(),
                upper_bound: self.upper_bound.clone(),
                limit: self.limit,
@@ -390,7 +391,7 @@ impl Query for RangeQuery {
 }

 pub struct RangeWeight {
-    field: String,
+    field: Cow<'static, str>,
    lower_bound: Bound<Vec<u8>>,
    upper_bound: Bound<Vec<u8>>,
    limit: Option<u64>,
--- a/src/query/range_query/range_query_ip_fastfield.rs
+++ b/src/query/range_query/range_query_ip_fastfield.rs
@@ -2,6 +2,7 @@
 //! We use this variant only if the fastfield exists, otherwise the default in `range_query` is
 //! used, which uses the term dictionary + postings.

+use std::borrow::Cow;
 use std::net::Ipv6Addr;
 use std::ops::{Bound, RangeInclusive};

@@ -13,14 +14,18 @@ use crate::{DocId, DocSet, Score, SegmentReader, TantivyError};

 /// `IPFastFieldRangeWeight` uses the ip address fast field to execute range queries.
 pub struct IPFastFieldRangeWeight {
-    field: String,
+    field: Cow<'static, str>,
    lower_bound: Bound<Ipv6Addr>,
    upper_bound: Bound<Ipv6Addr>,
 }

 impl IPFastFieldRangeWeight {
    /// Creates a new IPFastFieldRangeWeight.
-    pub fn new(field: String, lower_bound: Bound<Ipv6Addr>, upper_bound: Bound<Ipv6Addr>) -> Self {
+    pub fn new(
+        field: Cow<'static, str>,
+        lower_bound: Bound<Ipv6Addr>,
+        upper_bound: Bound<Ipv6Addr>,
+    ) -> Self {
        Self {
            field,
            lower_bound,
@@ -171,7 +176,7 @@ pub mod tests {
        writer.commit().unwrap();
        let searcher = index.reader().unwrap().searcher();
        let range_weight = IPFastFieldRangeWeight {
-            field: "ips".to_string(),
+            field: Cow::Borrowed("ips"),
            lower_bound: Bound::Included(ip_addrs[1]),
            upper_bound: Bound::Included(ip_addrs[2]),
        };
--- a/src/query/range_query/range_query_u64_fastfield.rs
+++ b/src/query/range_query/range_query_u64_fastfield.rs
@@ -2,6 +2,7 @@
 //! We use this variant only if the fastfield exists, otherwise the default in `range_query` is
 //! used, which uses the term dictionary + postings.

+use std::borrow::Cow;
 use std::ops::{Bound, RangeInclusive};

 use columnar::{ColumnType, HasAssociatedColumnType, MonotonicallyMappableToU64};
@@ -14,7 +15,7 @@ use crate::{DocId, DocSet, Score, SegmentReader, TantivyError};
 /// `FastFieldRangeWeight` uses the fast field to execute range queries.
 #[derive(Clone, Debug)]
 pub struct FastFieldRangeWeight {
-    field: String,
+    field: Cow<'static, str>,
    lower_bound: Bound<u64>,
    upper_bound: Bound<u64>,
    column_type_opt: Option<ColumnType>,
@@ -23,7 +24,7 @@ pub struct FastFieldRangeWeight {
 impl FastFieldRangeWeight {
    /// Create a new FastFieldRangeWeight, using the u64 representation of any fast field.
    pub(crate) fn new_u64_lenient(
-        field: String,
+        field: Cow<'static, str>,
        lower_bound: Bound<u64>,
        upper_bound: Bound<u64>,
    ) -> Self {
@@ -39,7 +40,7 @@ impl FastFieldRangeWeight {

    /// Create a new `FastFieldRangeWeight` for a range of a u64-mappable type .
    pub fn new<T: HasAssociatedColumnType + MonotonicallyMappableToU64>(
-        field: String,
+        field: Cow<'static, str>,
        lower_bound: Bound<T>,
        upper_bound: Bound<T>,
    ) -> Self {
@@ -130,6 +131,7 @@ fn bound_to_value_range<T: MonotonicallyMappableToU64>(

 #[cfg(test)]
 pub mod tests {
+    use std::borrow::Cow;
    use std::ops::{Bound, RangeInclusive};

    use proptest::prelude::*;
@@ -214,7 +216,7 @@ pub mod tests {
        writer.commit().unwrap();
        let searcher = index.reader().unwrap().searcher();
        let range_query = FastFieldRangeWeight::new_u64_lenient(
-            "test_field".to_string(),
+            Cow::Borrowed("test_field"),
            Bound::Included(50_000),
            Bound::Included(50_002),
        );
--- a/src/query/regex_query.rs
+++ b/src/query/regex_query.rs
@@ -63,7 +63,7 @@ impl RegexQuery {
    /// Creates a new RegexQuery from a given pattern
    pub fn from_pattern(regex_pattern: &str, field: Field) -> crate::Result<Self> {
        let regex = Regex::new(regex_pattern)
-            .map_err(|_| TantivyError::InvalidArgument(regex_pattern.to_string()))?;
+            .map_err(|err| TantivyError::InvalidArgument(format!("RegexQueryError: {err}")))?;
        Ok(RegexQuery::from_regex(regex, field))
    }

@@ -176,4 +176,16 @@ mod test {
        verify_regex_query(matching_one, matching_zero, reader);
        Ok(())
    }
+
+    #[test]
+    pub fn test_pattern_error() {
+        let (_reader, field) = build_test_index().unwrap();
+
+        match RegexQuery::from_pattern(r"(foo", field) {
+            Err(crate::TantivyError::InvalidArgument(msg)) => {
+                assert!(msg.contains("error: unclosed group"))
+            }
+            res => panic!("unexpected result: {:?}", res),
+        }
+    }
 }
--- a/src/store/mod.rs
+++ b/src/store/mod.rs
@@ -37,8 +37,6 @@ mod reader;
 mod writer;
 pub use self::compressors::{Compressor, ZstdCompressor};
 pub use self::decompressors::Decompressor;
-#[cfg(feature = "quickwit")]
-pub(crate) use self::reader::CacheKey;
 pub(crate) use self::reader::DOCSTORE_CACHE_CAPACITY;
 pub use self::reader::{CacheStats, StoreReader};
 pub use self::writer::StoreWriter;
--- a/src/store/reader.rs
+++ b/src/store/reader.rs
@@ -40,15 +40,6 @@ struct BlockCache {
 }

 impl BlockCache {
-    fn new(cache_num_blocks: usize) -> Self {
-        Self {
-            cache: NonZeroUsize::new(cache_num_blocks)
-                .map(|cache_num_blocks| Mutex::new(LruCache::new(cache_num_blocks))),
-            cache_hits: Default::default(),
-            cache_misses: Default::default(),
-        }
-    }
-
    fn get_from_cache(&self, pos: usize) -> Option<Block> {
        if let Some(block) = self
            .cache
@@ -90,10 +81,6 @@ impl BlockCache {
    }
 }

-/// Opaque cache key which indicates which documents are cached together.
-#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
-pub(crate) struct CacheKey(usize);
-
 #[derive(Debug, Default)]
 /// CacheStats for the `StoreReader`.
 pub struct CacheStats {
@@ -141,35 +128,17 @@ impl StoreReader {
        Ok(StoreReader {
            decompressor: footer.decompressor,
            data: data_file,
-            cache: BlockCache::new(cache_num_blocks),
+            cache: BlockCache {
+                cache: NonZeroUsize::new(cache_num_blocks)
+                    .map(|cache_num_blocks| Mutex::new(LruCache::new(cache_num_blocks))),
+                cache_hits: Default::default(),
+                cache_misses: Default::default(),
+            },
            skip_index: Arc::new(skip_index),
            space_usage,
        })
    }

-    /// Clones the given store reader with an independent block cache of the given size.
-    ///
-    /// `cache_keys` is used to seed the forked cache from the current cache
-    /// if some blocks are already available.
-    #[cfg(feature = "quickwit")]
-    pub(crate) fn fork_cache(&self, cache_num_blocks: usize, cache_keys: &[CacheKey]) -> Self {
-        let forked = Self {
-            decompressor: self.decompressor,
-            data: self.data.clone(),
-            cache: BlockCache::new(cache_num_blocks),
-            skip_index: Arc::clone(&self.skip_index),
-            space_usage: self.space_usage.clone(),
-        };
-
-        for &CacheKey(pos) in cache_keys {
-            if let Some(block) = self.cache.get_from_cache(pos) {
-                forked.cache.put_into_cache(pos, block);
-            }
-        }
-
-        forked
-    }
-
    pub(crate) fn block_checkpoints(&self) -> impl Iterator<Item = Checkpoint> + '_ {
        self.skip_index.checkpoints()
    }
@@ -183,21 +152,6 @@ impl StoreReader {
        self.cache.stats()
    }

-    /// Returns the cache key for a given document
-    ///
-    /// These keys are opaque and are not used with the public API,
-    /// but having the same cache key means that the documents
-    /// will only require one I/O and decompression operation
-    /// when retrieve from the same store reader consecutively.
-    ///
-    /// Note that looking up the cache key of a document
-    /// will not yet pull anything into the block cache.
-    #[cfg(feature = "quickwit")]
-    pub(crate) fn cache_key(&self, doc_id: DocId) -> crate::Result<CacheKey> {
-        let checkpoint = self.block_checkpoint(doc_id)?;
-        Ok(CacheKey(checkpoint.byte_range.start))
-    }
-
    /// Get checkpoint for `DocId`. The checkpoint can be used to load a block containing the
    /// document.
    ///
--- a/stacker/src/memory_arena.rs
+++ b/stacker/src/memory_arena.rs
@@ -189,6 +189,11 @@ struct Page {

 impl Page {
    fn new(page_id: usize) -> Page {
+        // We use 32-bits addresses.
+        // - 20 bits for the in-page addressing
+        // - 12 bits for the page id.
+        // This limits us to 2^12 - 1=4095 for the page id.
+        assert!(page_id < 4096);
        Page {
            page_id,
            len: 0,
@@ -238,6 +243,7 @@ impl Page {
 mod tests {

    use super::MemoryArena;
+    use crate::memory_arena::PAGE_SIZE;

    #[test]
    fn test_arena_allocate_slice() {
@@ -255,6 +261,31 @@ mod tests {
        assert_eq!(arena.slice(addr_b, b.len()), b);
    }

+    #[test]
+    fn test_arena_allocate_end_of_page() {
+        let mut arena = MemoryArena::default();
+
+        // A big block
+        let len_a = PAGE_SIZE - 2;
+        let addr_a = arena.allocate_space(len_a);
+        *arena.slice_mut(addr_a, len_a).last_mut().unwrap() = 1;
+
+        // Single bytes
+        let addr_b = arena.allocate_space(1);
+        arena.slice_mut(addr_b, 1)[0] = 2;
+
+        let addr_c = arena.allocate_space(1);
+        arena.slice_mut(addr_c, 1)[0] = 3;
+
+        let addr_d = arena.allocate_space(1);
+        arena.slice_mut(addr_d, 1)[0] = 4;
+
+        assert_eq!(arena.slice(addr_a, len_a)[len_a - 1], 1);
+        assert_eq!(arena.slice(addr_b, 1)[0], 2);
+        assert_eq!(arena.slice(addr_c, 1)[0], 3);
+        assert_eq!(arena.slice(addr_d, 1)[0], 4);
+    }
+
    #[derive(Clone, Copy, Debug, Eq, PartialEq)]
    struct MyTest {
        pub a: usize,
--- a/stacker/src/shared_arena_hashmap.rs
+++ b/stacker/src/shared_arena_hashmap.rs
@@ -295,6 +295,8 @@ impl SharedArenaHashMap {
    /// will be in charge of returning a default value.
    /// If the key already as an associated value, then it will be passed
    /// `Some(previous_value)`.
+    ///
+    /// The key will be truncated to u16::MAX bytes.
    #[inline]
    pub fn mutate_or_create<V>(
        &mut self,
@@ -308,6 +310,8 @@ impl SharedArenaHashMap {
        if self.is_saturated() {
            self.resize();
        }
+        // Limit the key size to u16::MAX
+        let key = &key[..std::cmp::min(key.len(), u16::MAX as usize)];
        let hash = self.get_hash(key);
        let mut probe = self.probe(hash);
        let mut bucket = probe.next_probe();
@@ -379,6 +383,36 @@ mod tests {
        }
        assert_eq!(vanilla_hash_map.len(), 2);
    }
+
+    #[test]
+    fn test_long_key_truncation() {
+        // Keys longer than u16::MAX are truncated.
+        let mut memory_arena = MemoryArena::default();
+        let mut hash_map: SharedArenaHashMap = SharedArenaHashMap::default();
+        let key1 = (0..u16::MAX as usize).map(|i| i as u8).collect::<Vec<_>>();
+        hash_map.mutate_or_create(&key1, &mut memory_arena, |opt_val: Option<u32>| {
+            assert_eq!(opt_val, None);
+            4u32
+        });
+        // Due to truncation, this key is the same as key1
+        let key2 = (0..u16::MAX as usize + 1)
+            .map(|i| i as u8)
+            .collect::<Vec<_>>();
+        hash_map.mutate_or_create(&key2, &mut memory_arena, |opt_val: Option<u32>| {
+            assert_eq!(opt_val, Some(4));
+            3u32
+        });
+        let mut vanilla_hash_map = HashMap::new();
+        let iter_values = hash_map.iter(&memory_arena);
+        for (key, addr) in iter_values {
+            let val: u32 = memory_arena.read(addr);
+            vanilla_hash_map.insert(key.to_owned(), val);
+            assert_eq!(key.len(), key1[..].len());
+            assert_eq!(key, &key1[..])
+        }
+        assert_eq!(vanilla_hash_map.len(), 1); // Both map to the same key
+    }
+
    #[test]
    fn test_empty_hashmap() {
        let memory_arena = MemoryArena::default();
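The key truncation added to `mutate_or_create` above can be looked at in isolation. The following is a minimal sketch, not the stacker implementation: `truncate_key` and `wrapped_len` are hypothetical helpers, and the assumption that the key length ends up in a `u16` is inferred from the commit message ("storing 0 bytes for keys with length u16::MAX + 1").

```rust
/// Hypothetical helper mirroring the slice expression used in the patch:
/// keep at most `u16::MAX` bytes of the key.
fn truncate_key(key: &[u8]) -> &[u8] {
    &key[..std::cmp::min(key.len(), u16::MAX as usize)]
}

/// Hypothetical illustration of the failure mode the patch avoids,
/// assuming the key length is narrowed to a `u16` somewhere downstream:
/// a plain `as` cast wraps around instead of saturating.
fn wrapped_len(len: usize) -> u16 {
    len as u16
}

/// Small usage example: a key of `u16::MAX + 1` bytes would be recorded
/// with length 0 after a wrapping cast, whereas truncation keeps the
/// first `u16::MAX` bytes.
fn example_usage() -> (u16, usize) {
    let long_key = vec![7u8; u16::MAX as usize + 1];
    let recorded_without_truncation = wrapped_len(long_key.len());
    let kept_with_truncation = truncate_key(&long_key).len();
    (recorded_without_truncation, kept_with_truncation)
}
```

One consequence, which `test_long_key_truncation` in the diff checks, is that two keys sharing their first `u16::MAX` bytes become the same entry.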
Author	SHA1	Message	Date
Adam Reichold	4fd2b22b69	Make allocating field names avoidable for range and exists queries. If the field names are statically known, `Cow::Borrowed(&'static str)` can handle them without allocations. The general case is still handled by `Cow::Owned(String)`.	2024-01-26 17:31:44 +01:00
Paul Masurel	9b7f3a55cf	Bumped census version	2024-01-26 19:32:02 +09:00
PSeitz	1dacdb6c85	add histogram agg test on empty index (#2306)	2024-01-23 16:27:34 +01:00
François Massot	30483310ca	Minor improvement of README.md (#2305) * Update README.md * Remove useless paragraph * Wording.	2024-01-19 17:46:48 +09:00
Tushar	e1d18b5114	chore: Expose TopDocs::order_by_u64_field again (#2282)	2024-01-18 05:58:24 +01:00
trinity-1686a	108f30ba23	allow newline where we allow space in query parser (#2302) fix regression from the new parser	2024-01-17 14:38:35 +01:00
PSeitz	5943ee46bd	Truncate keys to u16::MAX in term hashmap (#2299) Truncate keys to u16::MAX instead of, e.g., storing 0 bytes for keys with length u16::MAX + 1. The term hashmap has a hidden API contract to only accept terms with length up to u16::MAX.	2024-01-11 10:19:12 +01:00
PSeitz	f95a76293f	add memory arena test (#2298) * add memory arena test * add assert * Update stacker/src/memory_arena.rs Co-authored-by: Paul Masurel <paul@quickwit.io> --------- Co-authored-by: Paul Masurel <paul@quickwit.io>	2024-01-11 07:18:48 +01:00
Paul Masurel	014328e378	Fix bug that can cause `get_docids_for_value_range` to panic. (#2295) * Fix bug that can cause `get_docids_for_value_range` to panic. When `selected_docid_range.end == num_rows`, we would get a panic as we try to access a non-existing blockmeta. This PR accepts calls to rank with any value. For any value above num_rows we simply return non_null_rows. Fixes #2293 * add tests, merge variables --------- Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>	2024-01-09 14:52:20 +01:00
Adam Reichold	53f2fe1fbe	Forward regex parser errors to enable understanding their reason. (#2288)	2023-12-22 11:01:10 +01:00
PSeitz	9c75942aaf	fix merge panic for JSON fields (#2284) Root cause was the positions buffer had residue positions from the previous term, when the terms were alternating between having and not having positions in JSON (terms have positions, but not numerics). Fixes #2283	2023-12-21 11:05:34 +01:00
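
The `Into<Cow<'static, str>>` bound introduced for the range and exists query constructors can be illustrated with a small standalone sketch. `FieldQuery` and its methods are hypothetical stand-ins, not tantivy types; only the `std::borrow::Cow` behaviour shown is standard library API.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for a query that stores a field name,
/// using the same field type as the patched `RangeQuery`/`ExistsQuery`.
#[derive(Clone, Debug)]
struct FieldQuery {
    field: Cow<'static, str>,
}

impl FieldQuery {
    /// Accepts anything convertible into `Cow<'static, str>`:
    /// a `&'static str` becomes `Cow::Borrowed` (no allocation),
    /// a `String` becomes `Cow::Owned` (reuses the existing allocation).
    fn new<F: Into<Cow<'static, str>>>(field: F) -> Self {
        FieldQuery {
            field: field.into(),
        }
    }

    /// Reports whether the field name is held by reference.
    fn is_borrowed(&self) -> bool {
        matches!(self.field, Cow::Borrowed(_))
    }
}

/// Small usage example covering both the static and the dynamic case.
fn example_usage() -> (bool, bool, bool) {
    // Statically known name: stored as a borrowed string slice.
    let static_query = FieldQuery::new("price");
    // Name built at runtime: stored as an owned `String`.
    let dynamic_query = FieldQuery::new(format!("attributes.{}", "color"));
    // Cloning a borrowed `Cow` copies the reference, not the string data.
    let cloned = static_query.clone();
    (
        static_query.is_borrowed(),
        dynamic_query.is_borrowed(),
        cloned.is_borrowed(),
    )
}
```

Under this signature, callers that previously passed a `String` keep compiling, while callers with a string literal can drop the `.to_string()` call.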