Compare commits


6 Commits

Author SHA1 Message Date
Pascal Seitz
18de6f477b add failing test check_num_columnar_fields 2023-07-13 18:19:02 +09:00
PSeitz
1e7cd48cfa remove allocations in split compound words (#2080)
* remove allocations in split compound words

* clear reused data
2023-07-13 09:43:02 +09:00
dependabot[bot]
7f51d85bbd Update lru requirement from 0.10.0 to 0.11.0 (#2117)
Updates the requirements on [lru](https://github.com/jeromefroe/lru-rs) to permit the latest version.
- [Changelog](https://github.com/jeromefroe/lru-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/jeromefroe/lru-rs/compare/0.10.0...0.11.0)

---
updated-dependencies:
- dependency-name: lru
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-07-13 09:42:21 +09:00
PSeitz
ad76e32398 Update CHANGELOG.md (#2091)
* Update CHANGELOG.md

* Update CHANGELOG.md
2023-07-11 13:58:49 +08:00
dependabot[bot]
7575f9bf1c Update itertools requirement from 0.10.3 to 0.11.0 (#2098)
Updates the requirements on [itertools](https://github.com/rust-itertools/itertools) to permit the latest version.
- [Changelog](https://github.com/rust-itertools/itertools/blob/master/CHANGELOG.md)
- [Commits](https://github.com/rust-itertools/itertools/compare/v0.10.5...v0.11.0)

---
updated-dependencies:
- dependency-name: itertools
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-07-07 11:14:46 +02:00
Naveen Aiathurai
67bdf3f5f6 fixes #1676: order_by_u64_field and order_by_fast_field should allow sorting in ascending order (#2111)
* feat: order_by_fast_field allows sorting using parameter order

* chore: change the corresponding values to the original ones

* chore: fix formatting issues

* fix: first_or_default_col should also sort by order

* chore: add empty doc to testcase and fix doctests

* chore: fix failing tests

* chore: add empty document without fastfield

* chore: fix fmt

* chore: change variable name
2023-07-06 05:10:10 +02:00
6 changed files with 152 additions and 42 deletions
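
The ordering change (#2111) threads an explicit `Order` argument through `TopDocs::order_by_fast_field`, which previously always sorted descending. Below is a minimal usage sketch of the new signature, assuming a schema with a FAST u64 field named `rating`; the field and document values are illustrative, not taken from the diff below.

```rust
use tantivy::collector::TopDocs;
use tantivy::query::AllQuery;
use tantivy::schema::{Schema, FAST, TEXT};
use tantivy::{doc, DocAddress, Index, Order};

fn main() -> tantivy::Result<()> {
    let mut schema_builder = Schema::builder();
    let title = schema_builder.add_text_field("title", TEXT);
    let rating = schema_builder.add_u64_field("rating", FAST);
    let index = Index::create_in_ram(schema_builder.build());

    let mut writer = index.writer(20_000_000)?;
    writer.add_document(doc!(title => "pale ale", rating => 3u64))?;
    writer.add_document(doc!(title => "stout", rating => 5u64))?;
    writer.add_document(doc!(title => "pilsner", rating => 4u64))?;
    writer.commit()?;

    let searcher = index.reader()?.searcher();
    // The sort direction is now an explicit argument; before this change,
    // order_by_fast_field always returned the highest values first.
    let asc = TopDocs::with_limit(3).order_by_fast_field("rating", Order::Asc);
    let hits: Vec<(u64, DocAddress)> = searcher.search(&AllQuery, &asc)?;
    assert_eq!(hits[0].0, 3); // lowest rating first
    Ok(())
}
```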

View File

@@ -1,5 +1,14 @@
Tantivy 0.20 [Unreleased]
Tantivy 0.20.2
================================
- Align numerical type priority order on the search side. [#2088](https://github.com/quickwit-oss/tantivy/issues/2088) (@fmassot)
- Fix is_child_of function not considering the root facet. [#2086](https://github.com/quickwit-oss/tantivy/issues/2086) (@adamreichhold)
Tantivy 0.20.1
================================
- Fix building on windows with mmap [#2070](https://github.com/quickwit-oss/tantivy/issues/2070) (@ChillFish8)
Tantivy 0.20
================================
#### Bugfixes
- Fix phrase queries with slop (slop now supports transpositions; the algorithm carries the slop consumed so far for queries with more than two terms) [#2031](https://github.com/quickwit-oss/tantivy/issues/2031) [#2020](https://github.com/quickwit-oss/tantivy/issues/2020) (@PSeitz)
@@ -38,12 +47,14 @@ Tantivy 0.20 [Unreleased]
- Add aggregation support for JSON type [#1888](https://github.com/quickwit-oss/tantivy/issues/1888) (@PSeitz)
- Mixed types support on JSON fields in aggs [#1971](https://github.com/quickwit-oss/tantivy/issues/1971) (@PSeitz)
- Perf: Fetch blocks of vals in aggregation for all cardinality [#1950](https://github.com/quickwit-oss/tantivy/issues/1950) (@PSeitz)
- Allow histogram bounds to be passed as Rfc3339 [#2076](https://github.com/quickwit-oss/tantivy/issues/2076) (@PSeitz)
- `Searcher` with disabled scoring via `EnableScoring::Disabled` [#1780](https://github.com/quickwit-oss/tantivy/issues/1780) (@shikhar)
- Enable tokenizer on json fields [#2053](https://github.com/quickwit-oss/tantivy/issues/2053) (@PSeitz)
- Enforcing "NOT" and "-" queries consistency in UserInputAst [#1609](https://github.com/quickwit-oss/tantivy/issues/1609) (@bazhenov)
- Faster indexing
- Refactor tokenization pipeline to use GATs [#1924](https://github.com/quickwit-oss/tantivy/issues/1924) (@trinity-1686a)
- Faster term hash map [#2058](https://github.com/quickwit-oss/tantivy/issues/2058)[#1940](https://github.com/quickwit-oss/tantivy/issues/1940) (@PSeitz)
- tokenizer-api: reduce Tokenizer allocation overhead [#2062](https://github.com/quickwit-oss/tantivy/issues/2062) (@PSeitz)
- Refactor vint [#2010](https://github.com/quickwit-oss/tantivy/issues/2010) (@PSeitz)
- Faster search
- Work in batches of docs on the SegmentCollector (Only for cases without score for now) [#1937](https://github.com/quickwit-oss/tantivy/issues/1937) (@PSeitz)

View File

@@ -49,9 +49,9 @@ murmurhash32 = "0.3.0"
time = { version = "0.3.10", features = ["serde-well-known"] }
smallvec = "1.8.0"
rayon = "1.5.2"
lru = "0.10.0"
lru = "0.11.0"
fastdivide = "0.4.0"
itertools = "0.10.3"
itertools = "0.11.0"
measure_time = "0.8.2"
async-trait = "0.1.53"
arc-swap = "1.5.0"

View File

@@ -9,7 +9,7 @@ description = "column oriented storage for tantivy"
categories = ["database-implementations", "data-structures", "compression"]
[dependencies]
itertools = "0.10.5"
itertools = "0.11.0"
fnv = "1.0.7"
fastdivide = "0.4.0"

View File

@@ -14,7 +14,7 @@ use crate::collector::{
};
use crate::fastfield::{FastFieldNotAvailableError, FastValue};
use crate::query::Weight;
use crate::{DocAddress, DocId, Score, SegmentOrdinal, SegmentReader, TantivyError};
use crate::{DocAddress, DocId, Order, Score, SegmentOrdinal, SegmentReader, TantivyError};
struct FastFieldConvertCollector<
TCollector: Collector<Fruit = Vec<(u64, DocAddress)>>,
@@ -23,6 +23,7 @@ struct FastFieldConvertCollector<
pub collector: TCollector,
pub field: String,
pub fast_value: std::marker::PhantomData<TFastValue>,
order: Order,
}
impl<TCollector, TFastValue> Collector for FastFieldConvertCollector<TCollector, TFastValue>
@@ -70,7 +71,13 @@ where
let raw_result = self.collector.merge_fruits(segment_fruits)?;
let transformed_result = raw_result
.into_iter()
.map(|(score, doc_address)| (TFastValue::from_u64(score), doc_address))
.map(|(score, doc_address)| {
if self.order.is_desc() {
(TFastValue::from_u64(score), doc_address)
} else {
(TFastValue::from_u64(u64::MAX - score), doc_address)
}
})
.collect::<Vec<_>>();
Ok(transformed_result)
}
@@ -131,16 +138,23 @@ impl fmt::Debug for TopDocs {
struct ScorerByFastFieldReader {
sort_column: Arc<dyn ColumnValues<u64>>,
order: Order,
}
impl CustomSegmentScorer<u64> for ScorerByFastFieldReader {
fn score(&mut self, doc: DocId) -> u64 {
self.sort_column.get_val(doc)
let value = self.sort_column.get_val(doc);
if self.order.is_desc() {
value
} else {
u64::MAX - value
}
}
}
struct ScorerByField {
field: String,
order: Order,
}
impl CustomScorer<u64> for ScorerByField {
@@ -157,8 +171,13 @@ impl CustomScorer<u64> for ScorerByField {
sort_column_opt.ok_or_else(|| FastFieldNotAvailableError {
field_name: self.field.clone(),
})?;
let mut default_value = 0u64;
if self.order.is_asc() {
default_value = u64::MAX;
}
Ok(ScorerByFastFieldReader {
sort_column: sort_column.first_or_default_col(0u64),
sort_column: sort_column.first_or_default_col(default_value),
order: self.order.clone(),
})
}
}
@@ -230,7 +249,7 @@ impl TopDocs {
///
/// ```rust
/// # use tantivy::schema::{Schema, FAST, TEXT};
/// # use tantivy::{doc, Index, DocAddress};
/// # use tantivy::{doc, Index, DocAddress, Order};
/// # use tantivy::query::{Query, QueryParser};
/// use tantivy::Searcher;
/// use tantivy::collector::TopDocs;
@@ -268,7 +287,7 @@ impl TopDocs {
/// // Note the `rating_field` needs to be a FAST field here.
/// let top_books_by_rating = TopDocs
/// ::with_limit(10)
/// .order_by_u64_field("rating");
/// .order_by_fast_field("rating", Order::Desc);
///
/// // ... and here are our documents. Note this is a simple vec.
/// // The `u64` in the pair is the value of our fast field for
@@ -288,13 +307,15 @@ impl TopDocs {
///
/// To comfortably work with `u64`s, `i64`s, `f64`s, or `date`s, please refer to
/// the [.order_by_fast_field(...)](TopDocs::order_by_fast_field) method.
pub fn order_by_u64_field(
fn order_by_u64_field(
self,
field: impl ToString,
order: Order,
) -> impl Collector<Fruit = Vec<(u64, DocAddress)>> {
CustomScoreTopCollector::new(
ScorerByField {
field: field.to_string(),
order,
},
self.0.into_tscore(),
)
@@ -316,7 +337,7 @@ impl TopDocs {
///
/// ```rust
/// # use tantivy::schema::{Schema, FAST, TEXT};
/// # use tantivy::{doc, Index, DocAddress};
/// # use tantivy::{doc, Index, DocAddress,Order};
/// # use tantivy::query::{Query, AllQuery};
/// use tantivy::Searcher;
/// use tantivy::collector::TopDocs;
@@ -354,7 +375,7 @@ impl TopDocs {
/// // type `sort_by_field`. revenue_field here is a FAST i64 field.
/// let top_company_by_revenue = TopDocs
/// ::with_limit(2)
/// .order_by_fast_field("revenue");
/// .order_by_fast_field("revenue", Order::Desc);
///
/// // ... and here are our documents. Note this is a simple vec.
/// // The `i64` in the pair is the value of our fast field for
@@ -372,15 +393,17 @@ impl TopDocs {
pub fn order_by_fast_field<TFastValue>(
self,
fast_field: impl ToString,
order: Order,
) -> impl Collector<Fruit = Vec<(TFastValue, DocAddress)>>
where
TFastValue: FastValue,
{
let u64_collector = self.order_by_u64_field(fast_field.to_string());
let u64_collector = self.order_by_u64_field(fast_field.to_string(), order.clone());
FastFieldConvertCollector {
collector: u64_collector,
field: fast_field.to_string(),
fast_value: PhantomData,
order,
}
}
@@ -721,7 +744,7 @@ mod tests {
use crate::schema::{Field, Schema, FAST, STORED, TEXT};
use crate::time::format_description::well_known::Rfc3339;
use crate::time::OffsetDateTime;
use crate::{DateTime, DocAddress, DocId, Index, IndexWriter, Score, SegmentReader};
use crate::{DateTime, DocAddress, DocId, Index, IndexWriter, Order, Score, SegmentReader};
fn make_index() -> crate::Result<Index> {
let mut schema_builder = Schema::builder();
@@ -882,7 +905,7 @@ mod tests {
});
let searcher = index.reader()?.searcher();
let top_collector = TopDocs::with_limit(4).order_by_u64_field(SIZE);
let top_collector = TopDocs::with_limit(4).order_by_u64_field(SIZE, Order::Desc);
let top_docs: Vec<(u64, DocAddress)> = searcher.search(&query, &top_collector)?;
assert_eq!(
&top_docs[..],
@@ -921,7 +944,7 @@ mod tests {
))?;
index_writer.commit()?;
let searcher = index.reader()?.searcher();
let top_collector = TopDocs::with_limit(3).order_by_fast_field("birthday");
let top_collector = TopDocs::with_limit(3).order_by_fast_field("birthday", Order::Desc);
let top_docs: Vec<(DateTime, DocAddress)> = searcher.search(&AllQuery, &top_collector)?;
assert_eq!(
&top_docs[..],
@@ -951,7 +974,7 @@ mod tests {
))?;
index_writer.commit()?;
let searcher = index.reader()?.searcher();
let top_collector = TopDocs::with_limit(3).order_by_fast_field("altitude");
let top_collector = TopDocs::with_limit(3).order_by_fast_field("altitude", Order::Desc);
let top_docs: Vec<(i64, DocAddress)> = searcher.search(&AllQuery, &top_collector)?;
assert_eq!(
&top_docs[..],
@@ -981,7 +1004,7 @@ mod tests {
))?;
index_writer.commit()?;
let searcher = index.reader()?.searcher();
let top_collector = TopDocs::with_limit(3).order_by_fast_field("altitude");
let top_collector = TopDocs::with_limit(3).order_by_fast_field("altitude", Order::Desc);
let top_docs: Vec<(f64, DocAddress)> = searcher.search(&AllQuery, &top_collector)?;
assert_eq!(
&top_docs[..],
@@ -1009,7 +1032,7 @@ mod tests {
.unwrap();
});
let searcher = index.reader().unwrap().searcher();
let top_collector = TopDocs::with_limit(4).order_by_u64_field("missing_field");
let top_collector = TopDocs::with_limit(4).order_by_u64_field("missing_field", Order::Desc);
let segment_reader = searcher.segment_reader(0u32);
top_collector
.for_segment(0, segment_reader)
@@ -1027,7 +1050,7 @@ mod tests {
index_writer.commit()?;
let searcher = index.reader()?.searcher();
let segment = searcher.segment_reader(0);
let top_collector = TopDocs::with_limit(4).order_by_u64_field(SIZE);
let top_collector = TopDocs::with_limit(4).order_by_u64_field(SIZE, Order::Desc);
let err = top_collector.for_segment(0, segment).err().unwrap();
assert!(matches!(err, crate::TantivyError::InvalidArgument(_)));
Ok(())
@@ -1044,7 +1067,7 @@ mod tests {
index_writer.commit()?;
let searcher = index.reader()?.searcher();
let segment = searcher.segment_reader(0);
let top_collector = TopDocs::with_limit(4).order_by_fast_field::<i64>(SIZE);
let top_collector = TopDocs::with_limit(4).order_by_fast_field::<i64>(SIZE, Order::Desc);
let err = top_collector.for_segment(0, segment).err().unwrap();
assert!(
matches!(err, crate::TantivyError::SchemaError(msg) if msg == "Field \"size\" is not a fast field.")
@@ -1106,4 +1129,50 @@ mod tests {
let query = query_parser.parse_query(query).unwrap();
(index, query)
}
#[test]
fn test_fast_field_ascending_order() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
let title = schema_builder.add_text_field(TITLE, TEXT);
let size = schema_builder.add_u64_field(SIZE, FAST);
let schema = schema_builder.build();
let (index, query) = index("beer", title, schema, |index_writer| {
index_writer
.add_document(doc!(
title => "bottle of beer",
size => 12u64,
))
.unwrap();
index_writer
.add_document(doc!(
title => "growler of beer",
size => 64u64,
))
.unwrap();
index_writer
.add_document(doc!(
title => "pint of beer",
size => 16u64,
))
.unwrap();
index_writer
.add_document(doc!(
title => "empty beer",
))
.unwrap();
});
let searcher = index.reader()?.searcher();
let top_collector = TopDocs::with_limit(4).order_by_fast_field(SIZE, Order::Asc);
let top_docs: Vec<(u64, DocAddress)> = searcher.search(&query, &top_collector)?;
assert_eq!(
&top_docs[..],
&[
(12, DocAddress::new(0, 0)),
(16, DocAddress::new(0, 2)),
(64, DocAddress::new(0, 1)),
(18446744073709551615, DocAddress::new(0, 3)),
]
);
Ok(())
}
}
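
The ascending mode reuses the existing descending top-k machinery: for `Order::Asc` the segment scorer stores the complement `u64::MAX - value`, so the collector's keep-the-largest heap effectively keeps the smallest values, and `merge_fruits` maps the keys back. Documents missing the fast field default to `u64::MAX` under `Asc`, which is why the empty document surfaces as `18446744073709551615` in the test above. Here is a standalone sketch of the inversion, not tantivy code:

```rust
// Standalone illustration of the complement trick used above: a collector
// that always keeps the *largest* u64 keys can serve ascending sorts by
// scoring each document with `u64::MAX - value` and undoing the mapping
// when results are merged.
fn to_key(value: u64, ascending: bool) -> u64 {
    if ascending { u64::MAX - value } else { value }
}

fn from_key(key: u64, ascending: bool) -> u64 {
    if ascending { u64::MAX - key } else { key }
}

fn main() {
    let values = [12u64, 64, 16];
    let mut keys: Vec<u64> = values.iter().map(|&v| to_key(v, true)).collect();
    // Emulate the top-k collector, which ranks by descending key.
    keys.sort_unstable_by(|a, b| b.cmp(a));
    let restored: Vec<u64> = keys.iter().map(|&k| from_key(k, true)).collect();
    assert_eq!(restored, vec![12, 16, 64]); // ascending order recovered
}
```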

View File

@@ -1291,4 +1291,28 @@ mod tests {
let vals: Vec<i64> = column.values_for_doc(0u32).collect();
assert_eq!(&vals, &[33]);
}
#[test]
fn check_num_columnar_fields() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
let id_field = schema_builder.add_text_field("id", FAST);
let index = Index::create_in_ram(schema_builder.build());
{
let mut index_writer = index.writer_with_num_threads(1, 20_000_000)?;
index_writer.set_merge_policy(Box::new(NoMergePolicy));
index_writer.add_document(doc!(
id_field => 1u64,
))?;
index_writer.commit()?;
}
let reader = index.reader().unwrap();
let searcher = reader.searcher();
let ff_reader = searcher.segment_reader(0).fast_fields();
let fields = ff_reader.u64_lenient_for_type_all(None, "id").unwrap();
assert_eq!(fields.len(), 1);
Ok(())
}
}

View File

@@ -86,6 +86,8 @@ impl TokenFilter for SplitCompoundWords {
SplitCompoundWordsFilter {
dict: self.dict,
inner: tokenizer,
cuts: Vec::new(),
parts: Vec::new(),
}
}
}
@@ -94,29 +96,33 @@ impl TokenFilter for SplitCompoundWords {
pub struct SplitCompoundWordsFilter<T> {
dict: AhoCorasick,
inner: T,
}
impl<T: Tokenizer> Tokenizer for SplitCompoundWordsFilter<T> {
type TokenStream<'a> = SplitCompoundWordsTokenStream<T::TokenStream<'a>>;
fn token_stream<'a>(&'a mut self, text: &'a str) -> Self::TokenStream<'a> {
SplitCompoundWordsTokenStream {
dict: self.dict.clone(),
tail: self.inner.token_stream(text),
cuts: Vec::new(),
parts: Vec::new(),
}
}
}
pub struct SplitCompoundWordsTokenStream<T> {
dict: AhoCorasick,
tail: T,
cuts: Vec<usize>,
parts: Vec<Token>,
}
impl<T: TokenStream> SplitCompoundWordsTokenStream<T> {
impl<T: Tokenizer> Tokenizer for SplitCompoundWordsFilter<T> {
type TokenStream<'a> = SplitCompoundWordsTokenStream<'a, T::TokenStream<'a>>;
fn token_stream<'a>(&'a mut self, text: &'a str) -> Self::TokenStream<'a> {
self.cuts.clear();
self.parts.clear();
SplitCompoundWordsTokenStream {
dict: self.dict.clone(),
tail: self.inner.token_stream(text),
cuts: &mut self.cuts,
parts: &mut self.parts,
}
}
}
pub struct SplitCompoundWordsTokenStream<'a, T> {
dict: AhoCorasick,
tail: T,
cuts: &'a mut Vec<usize>,
parts: &'a mut Vec<Token>,
}
impl<'a, T: TokenStream> SplitCompoundWordsTokenStream<'a, T> {
// Will use `self.cuts` to fill `self.parts` if `self.tail.token()`
// can fully be split into consecutive matches against `self.dict`.
fn split(&mut self) {
@@ -152,7 +158,7 @@ impl<T: TokenStream> SplitCompoundWordsTokenStream<T> {
}
}
impl<T: TokenStream> TokenStream for SplitCompoundWordsTokenStream<T> {
impl<'a, T: TokenStream> TokenStream for SplitCompoundWordsTokenStream<'a, T> {
fn advance(&mut self) -> bool {
self.parts.pop();
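
The tokenizer change above (#2080) applies a standard allocation-reuse pattern: the scratch buffers `cuts` and `parts` move from the per-call token stream into the long-lived filter, the stream holds `&'a mut` borrows instead of owning fresh `Vec`s, and each `token_stream` call clears the buffers rather than reallocating them. A minimal sketch of the pattern, with hypothetical `Analyzer`/`Stream` types standing in for the tantivy tokenizer traits:

```rust
// Hypothetical types illustrating the reuse pattern; not the tantivy API.
struct Analyzer {
    scratch: Vec<String>, // lives as long as the analyzer itself
}

struct Stream<'a> {
    scratch: &'a mut Vec<String>, // borrowed per call, no fresh allocation
}

impl Analyzer {
    fn stream(&mut self) -> Stream<'_> {
        self.scratch.clear(); // keep the capacity from the previous call
        Stream { scratch: &mut self.scratch }
    }
}

fn main() {
    let mut analyzer = Analyzer { scratch: Vec::new() };
    for text in ["donaudampfschiff", "dampfschiffkapitän"] {
        let stream = analyzer.stream();
        stream.scratch.push(text.to_string());
        // The Vec's backing allocation is reused across iterations.
        println!("{} buffered, capacity {}", stream.scratch.len(), stream.scratch.capacity());
    }
}
```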