diff --git a/CHANGELOG.md b/CHANGELOG.md
index 51a425a68..76a837439 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,22 +1,23 @@
 Tantivy 0.14.0
 =========================
-- Remove dependency to atomicwrites #833 .Implemented by @pmasurel upon suggestion and research from @asafigan).
+- Remove dependency to atomicwrites #833 .Implemented by @fulmicoton upon suggestion and research from @asafigan).
 - Migrated tantivy error from the now deprecated `failure` crate to `thiserror` #760. (@hirevo)
-- API Change. Accessing the typed value off a `Schema::Value` now returns an Option instead of panicking if the type does not match.
+- API Change. Accessing the typed value off a `Schema::Value` now returns an Option instead of panicking if the type does not match.
 - Large API Change in the Directory API. Tantivy used to assume that all files could be somehow memory mapped. After this change, Directory return a `FileSlice` that can be reduced and eventually read into an `OwnedBytes` object. Long and blocking io operation are still required by they do not span over the entire file.
 - Added support for Brotli compression in the DocStore. (@ppodolsky)
 - Added helper for building intersections and unions in BooleanQuery (@guilload)
 - Bugfix in `Query::explain`
 - Removed dependency on `notify` #924. Replaced with `FileWatcher` struct that polls meta file every 500ms in background thread. (@halvorboe @guilload)
 - Added `FilterCollector`, which wraps another collector and filters docs using a predicate over a fast field (@barrotsteindev)
-- Simplified the encoding of the skip reader struct. BlockWAND max tf is now encoded over a single byte. (@pmasurel)
+- Simplified the encoding of the skip reader struct. BlockWAND max tf is now encoded over a single byte. (@fulmicoton)
 - `FilterCollector` now supports all Fast Field value types (@barrotsteindev)
+- FastField are not all loaded when opening the segment reader. (@fulmicoton)

 This version breaks compatibility and requires users to reindex everything.

 Tantivy 0.13.2
 ===================
-Bugfix. Acquiring a facet reader on a segment that does not contain any
+Bugfix. Acquiring a facet reader on a segment that does not contain any
 doc with this facet returns `None`. (#896)

 Tantivy 0.13.1
@@ -27,7 +28,7 @@ Updated misc dependency versions.
 Tantivy 0.13.0
 ======================
 Tantivy 0.13 introduce a change in the index format that will require
-you to reindex your index (BlockWAND information are added in the skiplist).
+you to reindex your index (BlockWAND information are added in the skiplist).
 The index size increase is minor as this information is only added for
 full blocks.
 If you have a massive index for which reindexing is not an option, please contact me
@@ -36,7 +37,7 @@ so that we can discuss possible solutions.
 - Bugfix in `FuzzyTermQuery` not matching terms by prefix when it should (@Peachball)
 - Relaxed constraints on the custom/tweak score functions. At the segment level, they can be mut, and they are not required to be Sync + Send.
 - `MMapDirectory::open` does not return a `Result` anymore.
-- Change in the DocSet and Scorer API. (@fulmicoton).
+- Change in the DocSet and Scorer API. (@fulmicoton).
 A freshly created DocSet point directly to their first doc. A sentinel value called TERMINATED marks the end of a DocSet.
 `.advance()` returns the new DocId. `Scorer::skip(target)` has been replaced by `Scorer::seek(target)` and returns the resulting DocId.
 As a result, iterating through DocSet now looks as follows
@@ -50,7 +51,7 @@ while doc != TERMINATED {
 The change made it possible to greatly simplify a lot of the docset's code.
 - Misc internal optimization and introduction of the `Scorer::for_each_pruning` function. (@fulmicoton)
 - Added an offset option to the Top(.*)Collectors. (@robyoung)
-- Added Block WAND. Performance on TOP-K on term-unions should be greatly increased. (@fulmicoton, and special thanks
+- Added Block WAND. Performance on TOP-K on term-unions should be greatly increased. (@fulmicoton, and special thanks
 to the PISA team for answering all my questions!)

 Tantivy 0.12.0
@@ -58,13 +59,13 @@ Tantivy 0.12.0
 - Removing static dispatch in tokenizers for simplicity. (#762)
 - Added backward iteration for `TermDictionary` stream. (@halvorboe)
 - Fixed a performance issue when searching for the posting lists of a missing term (@audunhalland)
-- Added a configurable maximum number of docs (10M by default) for a segment to be considered for merge (@hntd187, landed by @halvorboe #713)
+- Added a configurable maximum number of docs (10M by default) for a segment to be considered for merge (@hntd187, landed by @halvorboe #713)
 - Important Bugfix #777, causing tantivy to retain memory mapping. (diagnosed by @poljar)
 - Added support for field boosting. (#547, @fulmicoton)

 ## How to update?

-Crates relying on custom tokenizer, or registering tokenizer in the manager will require some
+Crates relying on custom tokenizer, or registering tokenizer in the manager will require some
 minor changes. Check https://github.com/tantivy-search/tantivy/blob/main/examples/custom_tokenizer.rs
 to check for some code sample.

@@ -101,7 +102,7 @@ Tantivy 0.11.0

 ## How to update?

-- The index format is changed. You are required to reindex your data to use tantivy 0.11.
+- The index format is changed. You are required to reindex your data to use tantivy 0.11.
 - `Box` has been replaced by a `BoxedTokenizer` struct.
 - Regex are now compiled when the `RegexQuery` instance is built. As a result, it can now return
 an error and handling the `Result` is required.
@@ -125,26 +126,26 @@ Tantivy 0.10.0

 *Tantivy 0.10.0 index format is compatible with the index format in 0.9.0.*

-- Added an API to easily tweak or entirely replace the
-  default score. See `TopDocs::tweak_score`and `TopScore::custom_score` (@pmasurel)
+- Added an API to easily tweak or entirely replace the
+  default score. See `TopDocs::tweak_score`and `TopScore::custom_score` (@fulmicoton)
 - Added an ASCII folding filter (@drusellers)
-- Bugfix in `query.count` in presence of deletes (@pmasurel)
-- Added `.explain(...)` in `Query` and `Weight` to (@pmasurel)
-- Added an efficient way to `delete_all_documents` in `IndexWriter` (@petr-tik).
+- Bugfix in `query.count` in presence of deletes (@fulmicoton)
+- Added `.explain(...)` in `Query` and `Weight` to (@fulmicoton)
+- Added an efficient way to `delete_all_documents` in `IndexWriter` (@petr-tik).
   All segments are simply removed.

 Minor
 ---------
 - Switched to Rust 2018 (@uvd)
-- Small simplification of the code.
+- Small simplification of the code.
 Calling .freq() or .doc() when .advance() has never been called
 on segment postings should panic from now on.
 - Tokens exceeding `u16::max_value() - 4` chars are discarded silently instead of panicking.
 - Fast fields are now preloaded when the `SegmentReader` is created.
 - `IndexMeta` is now public. (@hntd187)
 - `IndexWriter` `add_document`, `delete_term`. `IndexWriter` is `Sync`, making it possible to use it with a `
-Arc>`. `add_document` and `delete_term` can
-only require a read lock. (@pmasurel)
+Arc>`. `add_document` and `delete_term` can
+only require a read lock. (@fulmicoton)
 - Introducing `Opstamp` as an expressive type alias for `u64`. (@petr-tik)
 - Stamper now relies on `AtomicU64` on all platforms (@petr-tik)
 - Bugfix - Files get deleted slightly earlier
@@ -158,7 +159,7 @@ Your program should be usable as is.

 Fast fields used to be accessed directly from the `SegmentReader`.
 The API changed, you are now required to acquire your fast field reader via the
-`segment_reader.fast_fields()`, and use one of the typed method:
+`segment_reader.fast_fields()`, and use one of the typed method:
 - `.u64()`, `.i64()` if your field is single-valued ;
 - `.u64s()`, `.i64s()` if your field is multi-valued ;
 - `.bytes()` if your field is bytes fast field.
@@ -167,16 +168,16 @@ The API changed, you are now required to acquire your fast field reader via the

 Tantivy 0.9.0
 =====================
-*0.9.0 index format is not compatible with the
+*0.9.0 index format is not compatible with the
 previous index format.*
-- MAJOR BUGFIX :
+- MAJOR BUGFIX :
   Some `Mmap` objects were being leaked, and would never get released. (@fulmicoton)
 - Removed most unsafe (@fulmicoton)
 - Indexer memory footprint improved. (VInt comp, inlining the first block. (@fulmicoton)
 - Stemming in other language possible (@pentlander)
 - Segments with no docs are deleted earlier (@barrotsteindev)
-- Added grouped add and delete operations.
-  They are guaranteed to happen together (i.e. they cannot be split by a commit).
+- Added grouped add and delete operations.
+  They are guaranteed to happen together (i.e. they cannot be split by a commit).
   In addition, adds are guaranteed to happen on the same segment. (@elbow-jason)
 - Removed `INT_STORED` and `INT_INDEXED`. It is now possible to use `STORED` and `INDEXED`
   for int fields. (@fulmicoton)
@@ -190,26 +191,26 @@ tantivy 0.9 brought some API breaking change.
 To update from tantivy 0.8, you will need to go through the following steps.

 - `schema::INT_INDEXED` and `schema::INT_STORED` should be replaced by `schema::INDEXED` and `schema::INT_STORED`.
-- The index now does not hold the pool of searcher anymore. You are required to create an intermediary object called
-`IndexReader` for this.
-
+- The index now does not hold the pool of searcher anymore. You are required to create an intermediary object called
+`IndexReader` for this.
+
 ```rust
 // create the reader. You typically need to create 1 reader for the entire
 // lifetime of you program.
 let reader = index.reader()?;
-
+
 // Acquire a searcher (previously `index.searcher()`) is now written:
 let searcher = reader.searcher();
-
-// With the default setting of the reader, you are not required to
+
+// With the default setting of the reader, you are not required to
 // call `index.load_searchers()` anymore.
 //
 // The IndexReader will pick up that change automatically, regardless
 // of whether the update was done in a different process or not.
-// If this behavior is not wanted, you can create your reader with
+// If this behavior is not wanted, you can create your reader with
 // the `ReloadPolicy::Manual`, and manually decide when to reload the index
 // by calling `reader.reload()?`.
-
+
 ```


@@ -224,7 +225,7 @@ Tantivy 0.8.1
 =====================
 Hotfix of #476.

-Merge was reflecting deletes before commit was passed.
+Merge was reflecting deletes before commit was passed.
 Thanks @barrotsteindev for reporting the bug.


@@ -232,7 +233,7 @@ Tantivy 0.8.0
 =====================
 *No change in the index format*
 - API Breaking change in the collector API. (@jwolfe, @fulmicoton)
-- Multithreaded search (@jwolfe, @fulmicoton)
+- Multithreaded search (@jwolfe, @fulmicoton)


 Tantivy 0.7.1
@@ -260,7 +261,7 @@ Tantivy 0.6.1
 - Exclusive `field:{startExcl to endExcl}`
 - Mixed `field:[startIncl to endExcl}` and vice versa
 - Unbounded `field:[start to *]`, `field:[* to end]`
-
+

 Tantivy 0.6
 ==========================
@@ -268,10 +269,10 @@ Tantivy 0.6
 Special thanks to @drusellers and @jason-wolfe for their contributions
 to this release!

-- Removed C code. Tantivy is now pure Rust. (@pmasurel)
-- BM25 (@pmasurel)
-- Approximate field norms encoded over 1 byte. (@pmasurel)
-- Compiles on stable rust (@pmasurel)
+- Removed C code. Tantivy is now pure Rust. (@fulmicoton)
+- BM25 (@fulmicoton)
+- Approximate field norms encoded over 1 byte. (@fulmicoton)
+- Compiles on stable rust (@fulmicoton)
 - Add &[u8] fastfield for associating arbitrary bytes to each document (@jason-wolfe) (#270)
   - Completely uncompressed
   - Internally: One u64 fast field for indexes, one fast field for the bytes themselves.
@@ -279,7 +280,7 @@ to this release!
 - Add Stopword Filter support (@drusellers)
 - Add a FuzzyTermQuery (@drusellers)
 - Add a RegexQuery (@drusellers)
-- Various performance improvements (@pmasurel)_
+- Various performance improvements (@fulmicoton)_


 Tantivy 0.5.2
diff --git a/examples/custom_collector.rs b/examples/custom_collector.rs
index 4e7148060..0f79b869d 100644
--- a/examples/custom_collector.rs
+++ b/examples/custom_collector.rs
@@ -14,7 +14,7 @@ use tantivy::fastfield::FastFieldReader;
 use tantivy::query::QueryParser;
 use tantivy::schema::Field;
 use tantivy::schema::{Schema, FAST, INDEXED, TEXT};
-use tantivy::{doc, Index, Score, SegmentReader, TantivyError};
+use tantivy::{doc, Index, Score, SegmentReader};

 #[derive(Default)]
 struct Stats {
@@ -72,16 +72,7 @@ impl Collector for StatsCollector {
         _segment_local_id: u32,
         segment_reader: &SegmentReader,
     ) -> tantivy::Result {
-        let fast_field_reader = segment_reader
-            .fast_fields()
-            .u64(self.field)
-            .ok_or_else(|| {
-                let field_name = segment_reader.schema().get_field_name(self.field);
-                TantivyError::SchemaError(format!(
-                    "Field {:?} is not a u64 fast field.",
-                    field_name
-                ))
-            })?;
+        let fast_field_reader = segment_reader.fast_fields().u64(self.field)?;
         Ok(StatsSegmentCollector {
             fast_field_reader,
             stats: Stats::default(),
diff --git a/src/collector/filter_collector_wrapper.rs b/src/collector/filter_collector_wrapper.rs
index 34a8c6744..907fc6510 100644
--- a/src/collector/filter_collector_wrapper.rs
+++ b/src/collector/filter_collector_wrapper.rs
@@ -124,13 +124,7 @@ where

         let fast_field_reader = segment_reader
             .fast_fields()
-            .typed_fast_field_reader(self.field)
-            .ok_or_else(|| {
-                TantivyError::SchemaError(format!(
-                    "{:?} is not declared as a fast field in the schema.",
-                    self.field
-                ))
-            })?;
+            .typed_fast_field_reader(self.field)?;

         let segment_collector = self
             .collector
diff --git a/src/collector/tests.rs b/src/collector/tests.rs
index 7a7750c9e..ab03a723a 100644
--- a/src/collector/tests.rs
+++ b/src/collector/tests.rs
@@ -240,12 +240,7 @@ impl Collector for BytesFastFieldTestCollector {
         _segment_local_id: u32,
         segment_reader: &SegmentReader,
     ) -> crate::Result {
-        let reader = segment_reader
-            .fast_fields()
-            .bytes(self.field)
-            .ok_or_else(|| {
-                crate::TantivyError::InvalidArgument("Field is not a bytes fast field.".to_string())
-            })?;
+        let reader = segment_reader.fast_fields().bytes(self.field)?;
         Ok(BytesFastFieldSegmentCollector {
             vals: Vec::new(),
             reader,
diff --git a/src/collector/top_collector.rs b/src/collector/top_collector.rs
index e6af9f017..a32ff5290 100644
--- a/src/collector/top_collector.rs
+++ b/src/collector/top_collector.rs
@@ -2,9 +2,9 @@ use crate::DocAddress;
 use crate::DocId;
 use crate::SegmentLocalId;
 use crate::SegmentReader;
-use std::marker::PhantomData;
 use std::cmp::Ordering;
 use std::collections::BinaryHeap;
+use std::marker::PhantomData;

 /// Contains a feature (field, score, etc.) of a document along with the document address.
 ///
diff --git a/src/collector/top_score_collector.rs b/src/collector/top_score_collector.rs
index 85f857b68..f58095b21 100644
--- a/src/collector/top_score_collector.rs
+++ b/src/collector/top_score_collector.rs
@@ -146,15 +146,14 @@ impl CustomScorer for ScorerByField {
     type Child = ScorerByFastFieldReader;

     fn segment_scorer(&self, segment_reader: &SegmentReader) -> crate::Result {
-        let ff_reader = segment_reader
+        // We interpret this field as u64, regardless of its type, that way,
+        // we avoid needless conversion. Regardless of the fast field type, the
+        // mapping is monotonic, so it is sufficient to compute our top-K docs.
+        //
+        // The conversion will then happen only on the top-K docs.
+        let ff_reader: FastFieldReader = segment_reader
             .fast_fields()
-            .u64_lenient(self.field)
-            .ok_or_else(|| {
-                crate::TantivyError::SchemaError(format!(
-                    "Field requested ({:?}) is not a fast field.",
-                    self.field
-                ))
-            })?;
+            .typed_fast_field_reader(self.field)?;
         Ok(ScorerByFastFieldReader { ff_reader })
     }
 }
@@ -232,7 +231,7 @@ impl TopDocs {
     /// # let title = schema_builder.add_text_field("title", TEXT);
     /// # let rating = schema_builder.add_u64_field("rating", FAST);
     /// # let schema = schema_builder.build();
-    /// #
+    /// #
     /// # let index = Index::create_in_ram(schema);
     /// # let mut index_writer = index.writer_with_num_threads(1, 10_000_000)?;
     /// # index_writer.add_document(doc!(title => "The Name of the Wind", rating => 92u64));
@@ -262,7 +261,7 @@ impl TopDocs {
     /// let top_books_by_rating = TopDocs
     /// ::with_limit(10)
     /// .order_by_u64_field(rating_field);
-    ///
+    ///
     /// // ... and here are our documents. Note this is a simple vec.
     /// // The `u64` in the pair is the value of our fast field for
     /// // each documents.
@@ -272,13 +271,13 @@ impl TopDocs {
     /// // query.
     /// let resulting_docs: Vec<(u64, DocAddress)> =
     /// searcher.search(query, &top_books_by_rating)?;
-    ///
+    ///
     /// Ok(resulting_docs)
     /// }
     /// ```
     ///
     /// # See also
-    ///
+    ///
     /// To confortably work with `u64`s, `i64`s, `f64`s, or `date`s, please refer to
     /// [.order_by_fast_field(...)](#method.order_by_fast_field) method.
     pub fn order_by_u64_field(
@@ -290,7 +289,7 @@ impl TopDocs {

     /// Set top-K to rank documents by a given fast field.
     ///
-    /// If the field is not a fast field, or its field type does not match the generic type, this method does not panic,
+    /// If the field is not a fast field, or its field type does not match the generic type, this method does not panic,
     /// but an explicit error will be returned at the moment of collection.
     ///
     /// Note that this method is a generic. The requested fast field type will be often
@@ -314,7 +313,7 @@ impl TopDocs {
     /// # let title = schema_builder.add_text_field("company", TEXT);
     /// # let rating = schema_builder.add_i64_field("revenue", FAST);
     /// # let schema = schema_builder.build();
-    /// #
+    /// #
     /// # let index = Index::create_in_ram(schema);
     /// # let mut index_writer = index.writer_with_num_threads(1, 10_000_000)?;
     /// # index_writer.add_document(doc!(title => "MadCow Inc.", rating => 92_000_000i64));
@@ -343,7 +342,7 @@ impl TopDocs {
     /// let top_company_by_revenue = TopDocs
     /// ::with_limit(2)
     /// .order_by_fast_field(revenue_field);
-    ///
+    ///
     /// // ... and here are our documents. Note this is a simple vec.
     /// // The `i64` in the pair is the value of our fast field for
     /// // each documents.
@@ -353,7 +352,7 @@ impl TopDocs {
     /// // query.
     /// let resulting_docs: Vec<(i64, DocAddress)> =
     /// searcher.search(query, &top_company_by_revenue)?;
-    ///
+    ///
     /// Ok(resulting_docs)
     /// }
     /// ```
@@ -392,7 +391,7 @@ impl TopDocs {
     ///
     /// In the following example will will tweak our ranking a bit by
     /// boosting popular products a notch.
-    ///
+    ///
     /// In more serious application, this tweaking could involved running a
     /// learning-to-rank model over various features
     ///
@@ -523,7 +522,7 @@ impl TopDocs {
     /// # let index = Index::create_in_ram(schema);
     /// # let mut index_writer = index.writer_with_num_threads(1, 10_000_000)?;
     /// # let product_name = index.schema().get_field("product_name").unwrap();
-    /// #
+    /// #
     /// let popularity: Field = index.schema().get_field("popularity").unwrap();
     /// let boosted: Field = index.schema().get_field("boosted").unwrap();
     /// # index_writer.add_document(doc!(boosted=>1u64, product_name => "The Diary of Muadib", popularity => 1u64));
@@ -557,7 +556,7 @@ impl TopDocs {
     /// segment_reader.fast_fields().u64(popularity).unwrap();
     /// let boosted_reader =
     /// segment_reader.fast_fields().u64(boosted).unwrap();
-    ///
+    ///
     /// // We can now define our actual scoring function
     /// move |doc: DocId| {
     /// let popularity: u64 = popularity_reader.get(doc);
@@ -994,9 +993,7 @@ mod tests {
         let segment = searcher.segment_reader(0);
         let top_collector = TopDocs::with_limit(4).order_by_u64_field(size);
         let err = top_collector.for_segment(0, segment).err().unwrap();
-        assert!(
-            matches!(err, crate::TantivyError::SchemaError(msg) if msg == "Field requested (Field(0)) is not a fast field.")
-        );
+        assert!(matches!(err, crate::TantivyError::SchemaError(_)));
         Ok(())
     }

diff --git a/src/core/segment_reader.rs b/src/core/segment_reader.rs
index e2ce45291..f91b5f5f1 100644
--- a/src/core/segment_reader.rs
+++ b/src/core/segment_reader.rs
@@ -114,12 +114,7 @@ impl SegmentReader {
                 field_entry.name()
             )));
         }
-        let term_ords_reader = self.fast_fields().u64s(field).ok_or_else(|| {
-            DataCorruption::comment_only(format!(
-                "Cannot find data for hierarchical facet {:?}",
-                field_entry.name()
-            ))
-        })?;
+        let term_ords_reader = self.fast_fields().u64s(field)?;
         let termdict = self
             .termdict_composite
             .open_read(field)
@@ -183,8 +178,10 @@ impl SegmentReader {

         let fast_fields_data = segment.open_read(SegmentComponent::FASTFIELDS)?;
         let fast_fields_composite = CompositeFile::open(&fast_fields_data)?;
-        let fast_field_readers =
-            Arc::new(FastFieldReaders::load_all(&schema, &fast_fields_composite)?);
+        let fast_field_readers = Arc::new(FastFieldReaders::load_all(
+            schema.clone(),
+            &fast_fields_composite,
+        )?);

         let fieldnorm_data = segment.open_read(SegmentComponent::FIELDNORMS)?;
         let fieldnorm_readers = FieldNormReaders::open(fieldnorm_data)?;
diff --git a/src/fastfield/facet_reader.rs b/src/fastfield/facet_reader.rs
index ced6d0654..6f802c153 100644
--- a/src/fastfield/facet_reader.rs
+++ b/src/fastfield/facet_reader.rs
@@ -1,4 +1,4 @@
-use super::MultiValueIntFastFieldReader;
+use super::MultiValuedFastFieldReader;
 use crate::error::DataCorruption;
 use crate::schema::Facet;
 use crate::termdict::TermDictionary;
@@ -20,7 +20,7 @@ use std::str;
 /// list of facets. This ordinal is segment local and
 /// only makes sense for a given segment.
 pub struct FacetReader {
-    term_ords: MultiValueIntFastFieldReader,
+    term_ords: MultiValuedFastFieldReader,
     term_dict: TermDictionary,
     buffer: Vec,
 }
@@ -29,12 +29,12 @@ impl FacetReader {
     /// Creates a new `FacetReader`.
     ///
     /// A facet reader just wraps :
-    /// - a `MultiValueIntFastFieldReader` that makes it possible to
+    /// - a `MultiValuedFastFieldReader` that makes it possible to
     /// access the list of facet ords for a given document.
     /// - a `TermDictionary` that helps associating a facet to
     /// an ordinal and vice versa.
     pub fn new(
-        term_ords: MultiValueIntFastFieldReader,
+        term_ords: MultiValuedFastFieldReader,
         term_dict: TermDictionary,
     ) -> FacetReader {
         FacetReader {
diff --git a/src/fastfield/mod.rs b/src/fastfield/mod.rs
index 80b5bab1f..8f1109863 100644
--- a/src/fastfield/mod.rs
+++ b/src/fastfield/mod.rs
@@ -28,7 +28,7 @@ pub use self::delete::write_delete_bitset;
 pub use self::delete::DeleteBitSet;
 pub use self::error::{FastFieldNotAvailableError, Result};
 pub use self::facet_reader::FacetReader;
-pub use self::multivalued::{MultiValueIntFastFieldReader, MultiValueIntFastFieldWriter};
+pub use self::multivalued::{MultiValueIntFastFieldWriter, MultiValuedFastFieldReader};
 pub use self::reader::FastFieldReader;
 pub use self::readers::FastFieldReaders;
 pub use self::serializer::FastFieldSerializer;
diff --git a/src/fastfield/multivalued/mod.rs b/src/fastfield/multivalued/mod.rs
index 9594f53f3..02ad2495e 100644
--- a/src/fastfield/multivalued/mod.rs
+++ b/src/fastfield/multivalued/mod.rs
@@ -1,7 +1,7 @@
 mod reader;
 mod writer;

-pub use self::reader::MultiValueIntFastFieldReader;
+pub use self::reader::MultiValuedFastFieldReader;
 pub use self::writer::MultiValueIntFastFieldWriter;

 #[cfg(test)]
diff --git a/src/fastfield/multivalued/reader.rs b/src/fastfield/multivalued/reader.rs
index efc9ed5c8..ac0d7775d 100644
--- a/src/fastfield/multivalued/reader.rs
+++ b/src/fastfield/multivalued/reader.rs
@@ -10,29 +10,22 @@ use crate::DocId;
 /// The `idx_reader` associated, for each document, the index of its first value.
 ///
 #[derive(Clone)]
-pub struct MultiValueIntFastFieldReader {
+pub struct MultiValuedFastFieldReader {
     idx_reader: FastFieldReader,
     vals_reader: FastFieldReader,
 }

-impl MultiValueIntFastFieldReader {
+impl MultiValuedFastFieldReader {
     pub(crate) fn open(
         idx_reader: FastFieldReader,
         vals_reader: FastFieldReader,
-    ) -> MultiValueIntFastFieldReader {
-        MultiValueIntFastFieldReader {
+    ) -> MultiValuedFastFieldReader {
+        MultiValuedFastFieldReader {
             idx_reader,
             vals_reader,
         }
     }

-    pub(crate) fn into_u64s_reader(self) -> MultiValueIntFastFieldReader {
-        MultiValueIntFastFieldReader {
-            idx_reader: self.idx_reader,
-            vals_reader: self.vals_reader.into_u64_reader(),
-        }
-    }
-
     /// Returns `(start, stop)`, such that the values associated
     /// to the given document are `start..stop`.
     fn range(&self, doc: DocId) -> (u64, u64) {
diff --git a/src/fastfield/reader.rs b/src/fastfield/reader.rs
index 34cccc8a1..c5e4d8936 100644
--- a/src/fastfield/reader.rs
+++ b/src/fastfield/reader.rs
@@ -42,24 +42,6 @@ impl FastFieldReader {
         })
     }

-    pub(crate) fn into_u64_reader(self) -> FastFieldReader {
-        FastFieldReader {
-            bit_unpacker: self.bit_unpacker,
-            min_value_u64: self.min_value_u64,
-            max_value_u64: self.max_value_u64,
-            _phantom: PhantomData,
-        }
-    }
-
-    pub(crate) fn cast(self) -> FastFieldReader {
-        FastFieldReader {
-            bit_unpacker: self.bit_unpacker,
-            min_value_u64: self.min_value_u64,
-            max_value_u64: self.max_value_u64,
-            _phantom: PhantomData,
-        }
-    }
-
     /// Return the value associated to the given document.
     ///
     /// This accessor should return as fast as possible.
diff --git a/src/fastfield/readers.rs b/src/fastfield/readers.rs
index 1e832fdf9..d537dabfb 100644
--- a/src/fastfield/readers.rs
+++ b/src/fastfield/readers.rs
@@ -1,28 +1,21 @@
 use crate::common::CompositeFile;
-use crate::fastfield::MultiValueIntFastFieldReader;
+use crate::directory::FileSlice;
+use crate::fastfield::MultiValuedFastFieldReader;
 use crate::fastfield::{BytesFastFieldReader, FastValue};
 use crate::fastfield::{FastFieldNotAvailableError, FastFieldReader};
 use crate::schema::{Cardinality, Field, FieldType, Schema};
 use crate::space_usage::PerFieldSpaceUsage;
-use std::collections::HashMap;
+use crate::TantivyError;

 /// Provides access to all of the FastFieldReader.
 ///
 /// Internally, `FastFieldReaders` have preloaded fast field readers,
 /// and just wraps several `HashMap`.
 pub struct FastFieldReaders {
-    fast_field_i64: HashMap>,
-    fast_field_u64: HashMap>,
-    fast_field_f64: HashMap>,
-    fast_field_date: HashMap>,
-    fast_field_i64s: HashMap>,
-    fast_field_u64s: HashMap>,
-    fast_field_f64s: HashMap>,
-    fast_field_dates: HashMap>,
-    fast_bytes: HashMap,
+    schema: Schema,
     fast_fields_composite: CompositeFile,
 }
-
+#[derive(Eq, PartialEq, Debug)]
 enum FastType {
     I64,
     U64,
@@ -51,235 +44,166 @@ fn type_and_cardinality(field_type: &FieldType) -> Option<(FastType, Cardinality

 impl FastFieldReaders {
     pub(crate) fn load_all(
-        schema: &Schema,
+        schema: Schema,
         fast_fields_composite: &CompositeFile,
     ) -> crate::Result {
-        let mut fast_field_readers = FastFieldReaders {
-            fast_field_i64: Default::default(),
-            fast_field_u64: Default::default(),
-            fast_field_f64: Default::default(),
-            fast_field_date: Default::default(),
-            fast_field_i64s: Default::default(),
-            fast_field_u64s: Default::default(),
-            fast_field_f64s: Default::default(),
-            fast_field_dates: Default::default(),
-            fast_bytes: Default::default(),
+        Ok(FastFieldReaders {
             fast_fields_composite: fast_fields_composite.clone(),
-        };
-        for (field, field_entry) in schema.fields() {
-            let field_type = field_entry.field_type();
-            if let FieldType::Bytes(bytes_option) = field_type {
-                if !bytes_option.is_fast() {
-                    continue;
-                }
-                let fast_field_idx_file = fast_fields_composite
-                    .open_read_with_idx(field, 0)
-                    .ok_or_else(|| FastFieldNotAvailableError::new(field_entry))?;
-                let idx_reader = FastFieldReader::open(fast_field_idx_file)?;
-                let data = fast_fields_composite
-                    .open_read_with_idx(field, 1)
-                    .ok_or_else(|| FastFieldNotAvailableError::new(field_entry))?;
-                let bytes_fast_field_reader = BytesFastFieldReader::open(idx_reader, data)?;
-                fast_field_readers
-                    .fast_bytes
-                    .insert(field, bytes_fast_field_reader);
-            } else if let Some((fast_type, cardinality)) = type_and_cardinality(field_type) {
-                match cardinality {
-                    Cardinality::SingleValue => {
-                        if let Some(fast_field_data) = fast_fields_composite.open_read(field) {
-                            match fast_type {
-                                FastType::U64 => {
-                                    let fast_field_reader = FastFieldReader::open(fast_field_data)?;
-                                    fast_field_readers
-                                        .fast_field_u64
-                                        .insert(field, fast_field_reader);
-                                }
-                                FastType::I64 => {
-                                    let fast_field_reader =
-                                        FastFieldReader::open(fast_field_data.clone())?;
-                                    fast_field_readers
-                                        .fast_field_i64
-                                        .insert(field, fast_field_reader);
-                                }
-                                FastType::F64 => {
-                                    let fast_field_reader =
-                                        FastFieldReader::open(fast_field_data.clone())?;
-                                    fast_field_readers
-                                        .fast_field_f64
-                                        .insert(field, fast_field_reader);
-                                }
-                                FastType::Date => {
-                                    let fast_field_reader =
-                                        FastFieldReader::open(fast_field_data.clone())?;
-                                    fast_field_readers
-                                        .fast_field_date
-                                        .insert(field, fast_field_reader);
-                                }
-                            }
-                        } else {
-                            return Err(From::from(FastFieldNotAvailableError::new(field_entry)));
-                        }
-                    }
-                    Cardinality::MultiValues => {
-                        let idx_opt = fast_fields_composite.open_read_with_idx(field, 0);
-                        let data_opt = fast_fields_composite.open_read_with_idx(field, 1);
-                        if let (Some(fast_field_idx), Some(fast_field_data)) = (idx_opt, data_opt) {
-                            let idx_reader = FastFieldReader::open(fast_field_idx)?;
-                            match fast_type {
-                                FastType::I64 => {
-                                    let vals_reader = FastFieldReader::open(fast_field_data)?;
-                                    let multivalued_int_fast_field =
-                                        MultiValueIntFastFieldReader::open(idx_reader, vals_reader);
-                                    fast_field_readers
-                                        .fast_field_i64s
-                                        .insert(field, multivalued_int_fast_field);
-                                }
-                                FastType::U64 => {
-                                    let vals_reader = FastFieldReader::open(fast_field_data)?;
-                                    let multivalued_int_fast_field =
-                                        MultiValueIntFastFieldReader::open(idx_reader, vals_reader);
-                                    fast_field_readers
-                                        .fast_field_u64s
-                                        .insert(field, multivalued_int_fast_field);
-                                }
-                                FastType::F64 => {
-                                    let vals_reader = FastFieldReader::open(fast_field_data)?;
-                                    let multivalued_int_fast_field =
-                                        MultiValueIntFastFieldReader::open(idx_reader, vals_reader);
-                                    fast_field_readers
-                                        .fast_field_f64s
-                                        .insert(field, multivalued_int_fast_field);
-                                }
-                                FastType::Date => {
-                                    let vals_reader = FastFieldReader::open(fast_field_data)?;
-                                    let multivalued_int_fast_field =
-                                        MultiValueIntFastFieldReader::open(idx_reader, vals_reader);
-                                    fast_field_readers
-                                        .fast_field_dates
-                                        .insert(field, multivalued_int_fast_field);
-                                }
-                            }
-                        } else {
-                            return Err(From::from(FastFieldNotAvailableError::new(field_entry)));
-                        }
-                    }
-                }
-            }
-        }
-        Ok(fast_field_readers)
+            schema: schema.clone(),
+        })
     }

     pub(crate) fn space_usage(&self) -> PerFieldSpaceUsage {
         self.fast_fields_composite.space_usage()
     }

-    /// Returns the `u64` fast field reader reader associated to `field`.
-    ///
-    /// If `field` is not a u64 fast field, this method returns `None`.
-    pub fn u64(&self, field: Field) -> Option> {
-        self.fast_field_u64.get(&field).cloned()
+    fn fast_field_data(&self, field: Field, idx: usize) -> crate::Result {
+        self.fast_fields_composite
+            .open_read_with_idx(field, idx)
+            .ok_or_else(|| {
+                let field_name = self.schema.get_field_entry(field).name();
+                TantivyError::SchemaError(format!("Field({}) data was not found", field_name))
+            })
     }

-    /// If the field is a u64-fast field return the associated reader.
-    /// If the field is a i64-fast field, return the associated u64 reader. Values are
-    /// mapped from i64 to u64 using a (well the, it is unique) monotonic mapping.
     ///
-    ///
-    /// This method is useful when merging segment reader.
-    pub(crate) fn u64_lenient(&self, field: Field) -> Option> {
-        if let Some(u64_ff_reader) = self.u64(field) {
-            return Some(u64_ff_reader);
+    fn check_type(
+        &self,
+        field: Field,
+        expected_fast_type: FastType,
+        expected_cardinality: Cardinality,
+    ) -> crate::Result<()> {
+        let field_entry = self.schema.get_field_entry(field);
+        let (fast_type, cardinality) =
+            type_and_cardinality(field_entry.field_type()).ok_or_else(|| {
+                crate::TantivyError::SchemaError(format!(
+                    "Field {:?} is not a fast field.",
+                    field_entry.name()
+                ))
+            })?;
+        if fast_type != expected_fast_type {
+            return Err(crate::TantivyError::SchemaError(format!(
+                "Field {:?} is of type {:?}, expected {:?}.",
+                field_entry.name(),
+                fast_type,
+                expected_fast_type
+            )));
         }
-        if let Some(i64_ff_reader) = self.i64(field) {
-            return Some(i64_ff_reader.into_u64_reader());
+        if cardinality != expected_cardinality {
+            return Err(crate::TantivyError::SchemaError(format!(
+                "Field {:?} is of cardinality {:?}, expected {:?}.",
+                field_entry.name(),
+                cardinality,
+                expected_cardinality
+            )));
         }
-        if let Some(f64_ff_reader) = self.f64(field) {
-            return Some(f64_ff_reader.into_u64_reader());
-        }
-        if let Some(date_ff_reader) = self.date(field) {
-            return Some(date_ff_reader.into_u64_reader());
-        }
-        None
+        Ok(())
     }

     pub(crate) fn typed_fast_field_reader(
         &self,
         field: Field,
-    ) -> Option> {
-        self.u64_lenient(field)
-            .map(|fast_field_reader| fast_field_reader.cast())
+    ) -> crate::Result> {
+        let fast_field_slice = self.fast_field_data(field, 0)?;
+        FastFieldReader::open(fast_field_slice)
+    }
+
+    pub(crate) fn typed_fast_field_multi_reader(
+        &self,
+        field: Field,
+    ) -> crate::Result> {
+        let fast_field_slice_idx = self.fast_field_data(field, 0)?;
+        let fast_field_slice_vals = self.fast_field_data(field, 1)?;
+        let idx_reader = FastFieldReader::open(fast_field_slice_idx)?;
+        let vals_reader: FastFieldReader =
+            FastFieldReader::open(fast_field_slice_vals)?;
+        Ok(MultiValuedFastFieldReader::open(idx_reader, vals_reader))
+    }
+
+    /// Returns the `u64` fast field reader reader associated to `field`.
+    ///
+    /// If `field` is not a u64 fast field, this method returns `None`.
+    pub fn u64(&self, field: Field) -> crate::Result> {
+        self.check_type(field, FastType::U64, Cardinality::SingleValue)?;
+        self.typed_fast_field_reader(field)
     }

     /// Returns the `i64` fast field reader reader associated to `field`.
     ///
     /// If `field` is not a i64 fast field, this method returns `None`.
-    pub fn i64(&self, field: Field) -> Option> {
-        self.fast_field_i64.get(&field).cloned()
+    pub fn i64(&self, field: Field) -> crate::Result> {
+        self.check_type(field, FastType::I64, Cardinality::SingleValue)?;
+        self.typed_fast_field_reader(field)
     }

     /// Returns the `i64` fast field reader reader associated to `field`.
     ///
     /// If `field` is not a i64 fast field, this method returns `None`.
-    pub fn date(&self, field: Field) -> Option> {
-        self.fast_field_date.get(&field).cloned()
+    pub fn date(&self, field: Field) -> crate::Result> {
+        self.check_type(field, FastType::Date, Cardinality::SingleValue)?;
+        self.typed_fast_field_reader(field)
     }

     /// Returns the `f64` fast field reader reader associated to `field`.
     ///
     /// If `field` is not a f64 fast field, this method returns `None`.
-    pub fn f64(&self, field: Field) -> Option> {
-        self.fast_field_f64.get(&field).cloned()
+    pub fn f64(&self, field: Field) -> crate::Result> {
+        self.check_type(field, FastType::F64, Cardinality::SingleValue)?;
+        self.typed_fast_field_reader(field)
     }

     /// Returns a `u64s` multi-valued fast field reader reader associated to `field`.
     ///
     /// If `field` is not a u64 multi-valued fast field, this method returns `None`.
-    pub fn u64s(&self, field: Field) -> Option> {
-        self.fast_field_u64s.get(&field).cloned()
-    }
-
-    /// If the field is a u64s-fast field return the associated reader.
-    /// If the field is a i64s-fast field, return the associated u64s reader. Values are
-    /// mapped from i64 to u64 using a (well the, it is unique) monotonic mapping.
-    ///
-    /// This method is useful when merging segment reader.
-    pub(crate) fn u64s_lenient(&self, field: Field) -> Option> {
-        if let Some(u64s_ff_reader) = self.u64s(field) {
-            return Some(u64s_ff_reader);
-        }
-        if let Some(i64s_ff_reader) = self.i64s(field) {
-            return Some(i64s_ff_reader.into_u64s_reader());
-        }
-        if let Some(f64s_ff_reader) = self.f64s(field) {
-            return Some(f64s_ff_reader.into_u64s_reader());
-        }
-        None
+    pub fn u64s(&self, field: Field) -> crate::Result> {
+        self.check_type(field, FastType::U64, Cardinality::MultiValues)?;
+        self.typed_fast_field_multi_reader(field)
     }

     /// Returns a `i64s` multi-valued fast field reader reader associated to `field`.
     ///
     /// If `field` is not a i64 multi-valued fast field, this method returns `None`.
-    pub fn i64s(&self, field: Field) -> Option> {
-        self.fast_field_i64s.get(&field).cloned()
+    pub fn i64s(&self, field: Field) -> crate::Result> {
+        self.check_type(field, FastType::I64, Cardinality::MultiValues)?;
+        self.typed_fast_field_multi_reader(field)
     }

     /// Returns a `f64s` multi-valued fast field reader reader associated to `field`.
     ///
     /// If `field` is not a f64 multi-valued fast field, this method returns `None`.
-    pub fn f64s(&self, field: Field) -> Option> {
-        self.fast_field_f64s.get(&field).cloned()
+    pub fn f64s(&self, field: Field) -> crate::Result> {
+        self.check_type(field, FastType::F64, Cardinality::MultiValues)?;
+        self.typed_fast_field_multi_reader(field)
     }

     /// Returns a `crate::DateTime` multi-valued fast field reader reader associated to `field`.
     ///
     /// If `field` is not a `crate::DateTime` multi-valued fast field, this method returns `None`.
- pub fn dates(&self, field: Field) -> Option> { - self.fast_field_dates.get(&field).cloned() + pub fn dates( + &self, + field: Field, + ) -> crate::Result> { + self.check_type(field, FastType::Date, Cardinality::MultiValues)?; + self.typed_fast_field_multi_reader(field) } /// Returns the `bytes` fast field reader associated to `field`. /// /// If `field` is not a bytes fast field, returns `None`. - pub fn bytes(&self, field: Field) -> Option { - self.fast_bytes.get(&field).cloned() + pub fn bytes(&self, field: Field) -> crate::Result { + let field_entry = self.schema.get_field_entry(field); + if let FieldType::Bytes(bytes_option) = field_entry.field_type() { + if !bytes_option.is_fast() { + return Err(crate::TantivyError::SchemaError(format!( + "Field {:?} is not a fast field.", + field_entry.name() + ))); + } + let fast_field_idx_file = self.fast_field_data(field, 0)?; + let idx_reader = FastFieldReader::open(fast_field_idx_file)?; + let data = self.fast_field_data(field, 1)?; + BytesFastFieldReader::open(idx_reader, data) + } else { + Err(FastFieldNotAvailableError::new(field_entry).into()) + } } } diff --git a/src/indexer/merger.rs b/src/indexer/merger.rs index b263cac09..d91f2f8e3 100644 --- a/src/indexer/merger.rs +++ b/src/indexer/merger.rs @@ -7,7 +7,7 @@ use crate::fastfield::BytesFastFieldReader; use crate::fastfield::DeleteBitSet; use crate::fastfield::FastFieldReader; use crate::fastfield::FastFieldSerializer; -use crate::fastfield::MultiValueIntFastFieldReader; +use crate::fastfield::MultiValuedFastFieldReader; use crate::fieldnorm::FieldNormsSerializer; use crate::fieldnorm::FieldNormsWriter; use crate::fieldnorm::{FieldNormReader, FieldNormReaders}; @@ -246,7 +246,7 @@ impl IndexMerger { for reader in &self.readers { let u64_reader: FastFieldReader = reader .fast_fields() - .u64_lenient(field) + .typed_fast_field_reader(field) .expect("Failed to find a reader for single fast field. 
This is a tantivy bug and it should never happen."); if let Some((seg_min_val, seg_max_val)) = compute_min_max_val(&u64_reader, reader.max_doc(), reader.delete_bitset()) @@ -290,7 +290,7 @@ impl IndexMerger { fast_field_serializer: &mut FastFieldSerializer, ) -> crate::Result<()> { let mut total_num_vals = 0u64; - let mut u64s_readers: Vec> = Vec::new(); + let mut u64s_readers: Vec> = Vec::new(); // In the first pass, we compute the total number of vals. // @@ -298,9 +298,8 @@ impl IndexMerger { // what should be the bit length use for bitpacking. for reader in &self.readers { let u64s_reader = reader.fast_fields() - .u64s_lenient(field) + .typed_fast_field_multi_reader(field) .expect("Failed to find index for multivalued field. This is a bug in tantivy, please report."); - if let Some(delete_bitset) = reader.delete_bitset() { for doc in 0u32..reader.max_doc() { if delete_bitset.is_alive(doc) { @@ -353,7 +352,7 @@ impl IndexMerger { for (segment_ord, segment_reader) in self.readers.iter().enumerate() { let term_ordinal_mapping: &[TermOrdinal] = term_ordinal_mappings.get_segment(segment_ord); - let ff_reader: MultiValueIntFastFieldReader = segment_reader + let ff_reader: MultiValuedFastFieldReader = segment_reader .fast_fields() .u64s(field) .expect("Could not find multivalued u64 fast value reader."); @@ -397,8 +396,10 @@ impl IndexMerger { // We go through a complete first pass to compute the minimum and the // maximum value and initialize our Serializer. for reader in &self.readers { - let ff_reader: MultiValueIntFastFieldReader = - reader.fast_fields().u64s_lenient(field).expect( + let ff_reader: MultiValuedFastFieldReader = reader + .fast_fields() + .typed_fast_field_multi_reader(field) + .expect( "Failed to find multivalued fast field reader. This is a bug in \ tantivy. 
Please report.", ); @@ -445,11 +446,7 @@ impl IndexMerger { let mut bytes_readers: Vec = Vec::new(); for reader in &self.readers { - let bytes_reader = reader.fast_fields().bytes(field).ok_or_else(|| { - crate::TantivyError::InvalidArgument( - "Bytes fast field {:?} not found in segment.".to_string(), - ) - })?; + let bytes_reader = reader.fast_fields().bytes(field)?; if let Some(delete_bitset) = reader.delete_bitset() { for doc in 0u32..reader.max_doc() { if delete_bitset.is_alive(doc) { diff --git a/src/lib.rs b/src/lib.rs index f23be985d..39e21c27f 100644 --- a/src/lib.rs +++ b/src/lib.rs @@ -866,39 +866,39 @@ mod tests { let searcher = reader.searcher(); let segment_reader: &SegmentReader = searcher.segment_reader(0); { - let fast_field_reader_opt = segment_reader.fast_fields().u64(text_field); - assert!(fast_field_reader_opt.is_none()); + let fast_field_reader_res = segment_reader.fast_fields().u64(text_field); + assert!(fast_field_reader_res.is_err()); } { let fast_field_reader_opt = segment_reader.fast_fields().u64(stored_int_field); - assert!(fast_field_reader_opt.is_none()); + assert!(fast_field_reader_opt.is_err()); } { let fast_field_reader_opt = segment_reader.fast_fields().u64(fast_field_signed); - assert!(fast_field_reader_opt.is_none()); + assert!(fast_field_reader_opt.is_err()); } { let fast_field_reader_opt = segment_reader.fast_fields().u64(fast_field_float); - assert!(fast_field_reader_opt.is_none()); + assert!(fast_field_reader_opt.is_err()); } { let fast_field_reader_opt = segment_reader.fast_fields().u64(fast_field_unsigned); - assert!(fast_field_reader_opt.is_some()); + assert!(fast_field_reader_opt.is_ok()); let fast_field_reader = fast_field_reader_opt.unwrap(); assert_eq!(fast_field_reader.get(0), 4u64) } { - let fast_field_reader_opt = segment_reader.fast_fields().i64(fast_field_signed); - assert!(fast_field_reader_opt.is_some()); - let fast_field_reader = fast_field_reader_opt.unwrap(); + let fast_field_reader_res = 
segment_reader.fast_fields().i64(fast_field_signed); + assert!(fast_field_reader_res.is_ok()); + let fast_field_reader = fast_field_reader_res.unwrap(); assert_eq!(fast_field_reader.get(0), 4i64) } { - let fast_field_reader_opt = segment_reader.fast_fields().f64(fast_field_float); - assert!(fast_field_reader_opt.is_some()); - let fast_field_reader = fast_field_reader_opt.unwrap(); + let fast_field_reader_res = segment_reader.fast_fields().f64(fast_field_float); + assert!(fast_field_reader_res.is_ok()); + let fast_field_reader = fast_field_reader_res.unwrap(); assert_eq!(fast_field_reader.get(0), 4f64) } Ok(())