mirror of
https://github.com/quickwit-oss/tantivy.git
synced 2026-01-08 10:02:55 +00:00
Compare commits
71 Commits
barrotstei
...
0.14
| Author | SHA1 | Date | |
|---|---|---|---|
| | 784717749f | | |
| | 945bcc5bd3 | | |
| | 51aa9c319e | | |
| | 74d8d2946b | | |
| | 0a160cc16e | | |
| | f099f97daa | | |
| | 769e9ba14d | | |
| | a482c0e966 | | |
| | 86d92a72e7 | | |
| | ef618a5999 | | |
| | 94d3d7a89a | | |
| | aa9e79f957 | | |
| | 84a2f534db | | |
| | 1b4be24dca | | |
| | 824ccc37ae | | |
| | 5231651020 | | |
| | fa2c6f80c7 | | |
| | 43c7b3bfec | | |
| | b17a10546a | | |
| | bf6e6e8a7c | | |
| | 203b0256a3 | | |
| | caf2a38b7e | | |
| | 96f24b078e | | |
| | 332b50a4eb | | |
| | 8ca0954b3b | | |
| | 36343e2de8 | | |
| | 2f14a892ca | | |
| | 9c3cabce40 | | |
| | f8d71c2b10 | | |
| | 394dfb24f1 | | |
| | b0549a229d | | |
| | 670b6eaff6 | | |
| | a4f33d3823 | | |
| | c7841e3da5 | | |
| | e7b4a12bba | | |
| | 0aaa929d6e | | |
| | 1112797c18 | | |
| | 920481e1c1 | | |
| | 55f7b84966 | | |
| | 09ab4df1fe | | |
| | 0c2cf81b37 | | |
| | d864430bda | | |
| | de60540e06 | | |
| | c3e311e6b8 | | |
| | be626083a0 | | |
| | b68fcca1e0 | | |
| | af6dfa1856 | | |
| | 654c400a0b | | |
| | 80a99539ce | | |
| | 4b1c770e5e | | |
| | 3491645e69 | | |
| | e72c8287f8 | | |
| | b4b3bc7acd | | |
| | 521c7b271b | | |
| | acd888c999 | | |
| | 3ab1ba0b2f | | |
| | b344c0ac05 | | |
| | 1741619c7f | | |
| | 067ba3dff0 | | |
| | f79250f665 | | |
| | 5a33b8d533 | | |
| | d165655fb1 | | |
| | b478ed747a | | |
| | e9aa27dace | | |
| | c079133f3a | | |
| | 30c5f7c5f0 | | |
| | 6f26871c0f | | |
| | f93cc5b5e3 | | |
| | 5a25c8dfd3 | | |
| | f5c079159d | | |
| | 1cfdce3437 | | |
83
CHANGELOG.md
@@ -1,18 +1,23 @@
|
||||
Tantivy 0.14.0
|
||||
=========================
|
||||
- Remove dependency to atomicwrites #833 .Implemented by @pmasurel upon suggestion and research from @asafigan).
|
||||
- Remove dependency to atomicwrites #833 .Implemented by @fulmicoton upon suggestion and research from @asafigan).
|
||||
- Migrated tantivy error from the now deprecated `failure` crate to `thiserror` #760. (@hirevo)
|
||||
- API Change. Accessing the typed value off a `Schema::Value` now returns an Option instead of panicking if the type does not match.
|
||||
- API Change. Accessing the typed value off a `Schema::Value` now returns an Option instead of panicking if the type does not match.
|
||||
- Large API Change in the Directory API. Tantivy used to assume that all files could be somehow memory mapped. After this change, Directory return a `FileSlice` that can be reduced and eventually read into an `OwnedBytes` object. Long and blocking io operation are still required by they do not span over the entire file.
|
||||
- Added support for Brotli compression in the DocStore. (@ppodolsky)
|
||||
- Added helper for building intersections and unions in BooleanQuery (@guilload)
|
||||
- Bugfix in `Query::explain`
|
||||
- Removed dependency on `notify` #924. Replaced with `FileWatcher` struct that polls meta file every 500ms in background thread. (@halvorboe @guilload)
|
||||
- Added `FilterCollector`, which wraps another collector and filters docs using a predicate over a fast field (@barrotsteindev)
|
||||
- Simplified the encoding of the skip reader struct. BlockWAND max tf is now encoded over a single byte. (@fulmicoton)
|
||||
- `FilterCollector` now supports all Fast Field value types (@barrotsteindev)
|
||||
- FastField are not all loaded when opening the segment reader. (@fulmicoton)
|
||||
|
||||
This version breaks compatibility and requires users to reindex everything.
|
||||
|
||||
Tantivy 0.13.2
|
||||
===================
|
||||
Bugfix. Acquiring a facet reader on a segment that does not contain any
|
||||
Bugfix. Acquiring a facet reader on a segment that does not contain any
|
||||
doc with this facet returns `None`. (#896)
|
||||
|
||||
Tantivy 0.13.1
|
||||
@@ -23,7 +28,7 @@ Updated misc dependency versions.
|
||||
Tantivy 0.13.0
|
||||
======================
|
||||
Tantivy 0.13 introduce a change in the index format that will require
|
||||
you to reindex your index (BlockWAND information are added in the skiplist).
|
||||
you to reindex your index (BlockWAND information are added in the skiplist).
|
||||
The index size increase is minor as this information is only added for
|
||||
full blocks.
|
||||
If you have a massive index for which reindexing is not an option, please contact me
|
||||
@@ -32,7 +37,7 @@ so that we can discuss possible solutions.
|
||||
- Bugfix in `FuzzyTermQuery` not matching terms by prefix when it should (@Peachball)
|
||||
- Relaxed constraints on the custom/tweak score functions. At the segment level, they can be mut, and they are not required to be Sync + Send.
|
||||
- `MMapDirectory::open` does not return a `Result` anymore.
|
||||
- Change in the DocSet and Scorer API. (@fulmicoton).
|
||||
- Change in the DocSet and Scorer API. (@fulmicoton).
|
||||
A freshly created DocSet point directly to their first doc. A sentinel value called TERMINATED marks the end of a DocSet.
|
||||
`.advance()` returns the new DocId. `Scorer::skip(target)` has been replaced by `Scorer::seek(target)` and returns the resulting DocId.
|
||||
As a result, iterating through DocSet now looks as follows
|
||||
@@ -46,7 +51,7 @@ while doc != TERMINATED {
|
||||
The change made it possible to greatly simplify a lot of the docset's code.
|
||||
- Misc internal optimization and introduction of the `Scorer::for_each_pruning` function. (@fulmicoton)
|
||||
- Added an offset option to the Top(.*)Collectors. (@robyoung)
|
||||
- Added Block WAND. Performance on TOP-K on term-unions should be greatly increased. (@fulmicoton, and special thanks
|
||||
- Added Block WAND. Performance on TOP-K on term-unions should be greatly increased. (@fulmicoton, and special thanks
|
||||
to the PISA team for answering all my questions!)
|
||||
|
||||
Tantivy 0.12.0
|
||||
@@ -54,14 +59,14 @@ Tantivy 0.12.0
|
||||
- Removing static dispatch in tokenizers for simplicity. (#762)
|
||||
- Added backward iteration for `TermDictionary` stream. (@halvorboe)
|
||||
- Fixed a performance issue when searching for the posting lists of a missing term (@audunhalland)
|
||||
- Added a configurable maximum number of docs (10M by default) for a segment to be considered for merge (@hntd187, landed by @halvorboe #713)
|
||||
- Added a configurable maximum number of docs (10M by default) for a segment to be considered for merge (@hntd187, landed by @halvorboe #713)
|
||||
- Important Bugfix #777, causing tantivy to retain memory mapping. (diagnosed by @poljar)
|
||||
- Added support for field boosting. (#547, @fulmicoton)
|
||||
|
||||
## How to update?
|
||||
|
||||
Crates relying on custom tokenizer, or registering tokenizer in the manager will require some
|
||||
minor changes. Check https://github.com/tantivy-search/tantivy/blob/master/examples/custom_tokenizer.rs
|
||||
Crates relying on custom tokenizer, or registering tokenizer in the manager will require some
|
||||
minor changes. Check https://github.com/tantivy-search/tantivy/blob/main/examples/custom_tokenizer.rs
|
||||
to check for some code sample.
|
||||
|
||||
Tantivy 0.11.3
|
||||
@@ -97,7 +102,7 @@ Tantivy 0.11.0
|
||||
|
||||
## How to update?
|
||||
|
||||
- The index format is changed. You are required to reindex your data to use tantivy 0.11.
|
||||
- The index format is changed. You are required to reindex your data to use tantivy 0.11.
|
||||
- `Box<dyn BoxableTokenizer>` has been replaced by a `BoxedTokenizer` struct.
|
||||
- Regex are now compiled when the `RegexQuery` instance is built. As a result, it can now return
|
||||
an error and handling the `Result` is required.
|
||||
@@ -121,26 +126,26 @@ Tantivy 0.10.0
|
||||
|
||||
*Tantivy 0.10.0 index format is compatible with the index format in 0.9.0.*
|
||||
|
||||
- Added an API to easily tweak or entirely replace the
|
||||
default score. See `TopDocs::tweak_score`and `TopScore::custom_score` (@pmasurel)
|
||||
- Added an API to easily tweak or entirely replace the
|
||||
default score. See `TopDocs::tweak_score`and `TopScore::custom_score` (@fulmicoton)
|
||||
- Added an ASCII folding filter (@drusellers)
|
||||
- Bugfix in `query.count` in presence of deletes (@pmasurel)
|
||||
- Added `.explain(...)` in `Query` and `Weight` to (@pmasurel)
|
||||
- Added an efficient way to `delete_all_documents` in `IndexWriter` (@petr-tik).
|
||||
- Bugfix in `query.count` in presence of deletes (@fulmicoton)
|
||||
- Added `.explain(...)` in `Query` and `Weight` to (@fulmicoton)
|
||||
- Added an efficient way to `delete_all_documents` in `IndexWriter` (@petr-tik).
|
||||
All segments are simply removed.
|
||||
|
||||
Minor
|
||||
---------
|
||||
- Switched to Rust 2018 (@uvd)
|
||||
- Small simplification of the code.
|
||||
- Small simplification of the code.
|
||||
Calling .freq() or .doc() when .advance() has never been called
|
||||
on segment postings should panic from now on.
|
||||
- Tokens exceeding `u16::max_value() - 4` chars are discarded silently instead of panicking.
|
||||
- Fast fields are now preloaded when the `SegmentReader` is created.
|
||||
- `IndexMeta` is now public. (@hntd187)
|
||||
- `IndexWriter` `add_document`, `delete_term`. `IndexWriter` is `Sync`, making it possible to use it with a `
|
||||
Arc<RwLock<IndexWriter>>`. `add_document` and `delete_term` can
|
||||
only require a read lock. (@pmasurel)
|
||||
Arc<RwLock<IndexWriter>>`. `add_document` and `delete_term` can
|
||||
only require a read lock. (@fulmicoton)
|
||||
- Introducing `Opstamp` as an expressive type alias for `u64`. (@petr-tik)
|
||||
- Stamper now relies on `AtomicU64` on all platforms (@petr-tik)
|
||||
- Bugfix - Files get deleted slightly earlier
|
||||
@@ -154,7 +159,7 @@ Your program should be usable as is.
|
||||
|
||||
Fast fields used to be accessed directly from the `SegmentReader`.
|
||||
The API changed, you are now required to acquire your fast field reader via the
|
||||
`segment_reader.fast_fields()`, and use one of the typed method:
|
||||
`segment_reader.fast_fields()`, and use one of the typed method:
|
||||
- `.u64()`, `.i64()` if your field is single-valued ;
|
||||
- `.u64s()`, `.i64s()` if your field is multi-valued ;
|
||||
- `.bytes()` if your field is bytes fast field.
|
||||
@@ -163,16 +168,16 @@ The API changed, you are now required to acquire your fast field reader via the
|
||||
|
||||
Tantivy 0.9.0
|
||||
=====================
|
||||
*0.9.0 index format is not compatible with the
|
||||
*0.9.0 index format is not compatible with the
|
||||
previous index format.*
|
||||
- MAJOR BUGFIX :
|
||||
- MAJOR BUGFIX :
|
||||
Some `Mmap` objects were being leaked, and would never get released. (@fulmicoton)
|
||||
- Removed most unsafe (@fulmicoton)
|
||||
- Indexer memory footprint improved. (VInt comp, inlining the first block. (@fulmicoton)
|
||||
- Stemming in other language possible (@pentlander)
|
||||
- Segments with no docs are deleted earlier (@barrotsteindev)
|
||||
- Added grouped add and delete operations.
|
||||
They are guaranteed to happen together (i.e. they cannot be split by a commit).
|
||||
- Added grouped add and delete operations.
|
||||
They are guaranteed to happen together (i.e. they cannot be split by a commit).
|
||||
In addition, adds are guaranteed to happen on the same segment. (@elbow-jason)
|
||||
- Removed `INT_STORED` and `INT_INDEXED`. It is now possible to use `STORED` and `INDEXED`
|
||||
for int fields. (@fulmicoton)
|
||||
@@ -186,26 +191,26 @@ tantivy 0.9 brought some API breaking change.
|
||||
To update from tantivy 0.8, you will need to go through the following steps.
|
||||
|
||||
- `schema::INT_INDEXED` and `schema::INT_STORED` should be replaced by `schema::INDEXED` and `schema::INT_STORED`.
|
||||
- The index now does not hold the pool of searcher anymore. You are required to create an intermediary object called
|
||||
`IndexReader` for this.
|
||||
|
||||
- The index now does not hold the pool of searcher anymore. You are required to create an intermediary object called
|
||||
`IndexReader` for this.
|
||||
|
||||
```rust
|
||||
// create the reader. You typically need to create 1 reader for the entire
|
||||
// lifetime of you program.
|
||||
let reader = index.reader()?;
|
||||
|
||||
|
||||
// Acquire a searcher (previously `index.searcher()`) is now written:
|
||||
let searcher = reader.searcher();
|
||||
|
||||
// With the default setting of the reader, you are not required to
|
||||
|
||||
// With the default setting of the reader, you are not required to
|
||||
// call `index.load_searchers()` anymore.
|
||||
//
|
||||
// The IndexReader will pick up that change automatically, regardless
|
||||
// of whether the update was done in a different process or not.
|
||||
// If this behavior is not wanted, you can create your reader with
|
||||
// If this behavior is not wanted, you can create your reader with
|
||||
// the `ReloadPolicy::Manual`, and manually decide when to reload the index
|
||||
// by calling `reader.reload()?`.
|
||||
|
||||
|
||||
```
|
||||
|
||||
|
||||
@@ -220,7 +225,7 @@ Tantivy 0.8.1
|
||||
=====================
|
||||
Hotfix of #476.
|
||||
|
||||
Merge was reflecting deletes before commit was passed.
|
||||
Merge was reflecting deletes before commit was passed.
|
||||
Thanks @barrotsteindev for reporting the bug.
|
||||
|
||||
|
||||
@@ -228,7 +233,7 @@ Tantivy 0.8.0
|
||||
=====================
|
||||
*No change in the index format*
|
||||
- API Breaking change in the collector API. (@jwolfe, @fulmicoton)
|
||||
- Multithreaded search (@jwolfe, @fulmicoton)
|
||||
- Multithreaded search (@jwolfe, @fulmicoton)
|
||||
|
||||
|
||||
Tantivy 0.7.1
|
||||
@@ -256,7 +261,7 @@ Tantivy 0.6.1
|
||||
- Exclusive `field:{startExcl to endExcl}`
|
||||
- Mixed `field:[startIncl to endExcl}` and vice versa
|
||||
- Unbounded `field:[start to *]`, `field:[* to end]`
|
||||
|
||||
|
||||
|
||||
Tantivy 0.6
|
||||
==========================
|
||||
@@ -264,10 +269,10 @@ Tantivy 0.6
|
||||
Special thanks to @drusellers and @jason-wolfe for their contributions
|
||||
to this release!
|
||||
|
||||
- Removed C code. Tantivy is now pure Rust. (@pmasurel)
|
||||
- BM25 (@pmasurel)
|
||||
- Approximate field norms encoded over 1 byte. (@pmasurel)
|
||||
- Compiles on stable rust (@pmasurel)
|
||||
- Removed C code. Tantivy is now pure Rust. (@fulmicoton)
|
||||
- BM25 (@fulmicoton)
|
||||
- Approximate field norms encoded over 1 byte. (@fulmicoton)
|
||||
- Compiles on stable rust (@fulmicoton)
|
||||
- Add &[u8] fastfield for associating arbitrary bytes to each document (@jason-wolfe) (#270)
|
||||
- Completely uncompressed
|
||||
- Internally: One u64 fast field for indexes, one fast field for the bytes themselves.
|
||||
@@ -275,7 +280,7 @@ to this release!
|
||||
- Add Stopword Filter support (@drusellers)
|
||||
- Add a FuzzyTermQuery (@drusellers)
|
||||
- Add a RegexQuery (@drusellers)
|
||||
- Various performance improvements (@pmasurel)_
|
||||
- Various performance improvements (@fulmicoton)_
|
||||
|
||||
|
||||
Tantivy 0.5.2
|
||||
|
||||
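The 0.13.0 entry in the CHANGELOG hunks above describes the reworked `DocSet` contract: a freshly created `DocSet` already points at its first document, `.advance()` returns the new `DocId`, `seek(target)` replaces `skip(target)`, and a sentinel called `TERMINATED` marks the end. The following is a minimal, self-contained sketch of that contract. It does not use tantivy itself: `VecDocSet`, its fields, `collect_docs`, and the sentinel value `u32::MAX` are assumptions made for illustration and are not taken from the crate.

```rust
/// Document identifier; a plain `u32` is assumed here for illustration.
type DocId = u32;

/// Sentinel marking the end of a doc set (value chosen for this sketch).
const TERMINATED: DocId = u32::MAX;

/// Hypothetical doc set backed by a sorted vector of doc ids.
struct VecDocSet {
    docs: Vec<DocId>,
    cursor: usize,
}

impl VecDocSet {
    /// A freshly created doc set already points at its first doc.
    fn new(docs: Vec<DocId>) -> Self {
        VecDocSet { docs, cursor: 0 }
    }

    /// Current doc, or `TERMINATED` once the set is exhausted.
    fn doc(&self) -> DocId {
        self.docs.get(self.cursor).copied().unwrap_or(TERMINATED)
    }

    /// Moves to the next doc and returns it.
    fn advance(&mut self) -> DocId {
        if self.cursor < self.docs.len() {
            self.cursor += 1;
        }
        self.doc()
    }

    /// Moves to the first doc greater than or equal to `target` and returns it.
    fn seek(&mut self, target: DocId) -> DocId {
        while self.doc() < target {
            self.advance();
        }
        self.doc()
    }
}

/// Collects every remaining doc using the loop shape quoted in the changelog.
fn collect_docs(docset: &mut VecDocSet) -> Vec<DocId> {
    let mut out = Vec::new();
    let mut doc = docset.doc();
    while doc != TERMINATED {
        out.push(doc);
        doc = docset.advance();
    }
    out
}
```

Because `doc()` is valid immediately after construction, the loop needs no priming call to `advance()`, which is the simplification the changelog entry refers to.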
11
Cargo.toml
@@ -1,6 +1,6 @@
|
||||
[package]
|
||||
name = "tantivy"
|
||||
version = "0.14.0-dev"
|
||||
version = "0.14.0"
|
||||
authors = ["Paul Masurel <paul.masurel@gmail.com>"]
|
||||
license = "MIT"
|
||||
categories = ["database-implementations", "data-structures"]
|
||||
@@ -33,7 +33,7 @@ levenshtein_automata = "0.2"
|
||||
uuid = { version = "0.8", features = ["v4", "serde"] }
|
||||
crossbeam = "0.8"
|
||||
futures = {version = "0.3", features=["thread-pool"] }
|
||||
tantivy-query-grammar = { version="0.14.0-dev", path="./query-grammar" }
|
||||
tantivy-query-grammar = { version="0.14.0", path="./query-grammar" }
|
||||
stable_deref_trait = "1"
|
||||
rust-stemmers = "1"
|
||||
downcast-rs = "1"
|
||||
@@ -53,10 +53,11 @@ lru = "0.6"
|
||||
winapi = "0.3"
|
||||
|
||||
[dev-dependencies]
|
||||
rand = "0.7"
|
||||
rand = "0.8"
|
||||
maplit = "1"
|
||||
matches = "0.1.8"
|
||||
proptest = "0.10"
|
||||
criterion = "0.3"
|
||||
|
||||
[dev-dependencies.fail]
|
||||
version = "0.4"
|
||||
@@ -97,3 +98,7 @@ travis-ci = { repository = "tantivy-search/tantivy" }
|
||||
name = "failpoints"
|
||||
path = "tests/failpoints/mod.rs"
|
||||
required-features = ["fail/failpoints"]
|
||||
|
||||
[[bench]]
|
||||
name = "analyzer"
|
||||
harness = false
|
||||
|
||||
@@ -1,9 +1,9 @@
|
||||
|
||||
[](https://travis-ci.org/tantivy-search/tantivy)
|
||||
[](https://codecov.io/gh/tantivy-search/tantivy)
|
||||
[](https://travis-ci.org/tantivy-search/tantivy)
|
||||
[](https://codecov.io/gh/tantivy-search/tantivy)
|
||||
[](https://gitter.im/tantivy-search/tantivy?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
|
||||
[](https://opensource.org/licenses/MIT)
|
||||
[](https://ci.appveyor.com/project/fulmicoton/tantivy/branch/master)
|
||||
[](https://ci.appveyor.com/project/fulmicoton/tantivy/branch/main)
|
||||
[](https://crates.io/crates/tantivy)
|
||||
|
||||

|
||||
|
||||
3774
benches/alice.txt
Normal file
File diff suppressed because it is too large
22
benches/analyzer.rs
Normal file
@@ -0,0 +1,22 @@
use criterion::{criterion_group, criterion_main, Criterion};
use tantivy::tokenizer::TokenizerManager;

const ALICE_TXT: &'static str = include_str!("alice.txt");

pub fn criterion_benchmark(c: &mut Criterion) {
    let tokenizer_manager = TokenizerManager::default();
    let tokenizer = tokenizer_manager.get("default").unwrap();
    c.bench_function("default-tokenize-alice", |b| {
        b.iter(|| {
            let mut word_count = 0;
            let mut token_stream = tokenizer.token_stream(ALICE_TXT);
            while token_stream.advance() {
                word_count += 1;
            }
            assert_eq!(word_count, 30_731);
        })
    });
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
@@ -14,7 +14,7 @@ use tantivy::fastfield::FastFieldReader;
|
||||
use tantivy::query::QueryParser;
|
||||
use tantivy::schema::Field;
|
||||
use tantivy::schema::{Schema, FAST, INDEXED, TEXT};
|
||||
use tantivy::{doc, Index, Score, SegmentReader, TantivyError};
|
||||
use tantivy::{doc, Index, Score, SegmentReader};
|
||||
|
||||
#[derive(Default)]
|
||||
struct Stats {
|
||||
@@ -72,16 +72,7 @@ impl Collector for StatsCollector {
|
||||
_segment_local_id: u32,
|
||||
segment_reader: &SegmentReader,
|
||||
) -> tantivy::Result<StatsSegmentCollector> {
|
||||
let fast_field_reader = segment_reader
|
||||
.fast_fields()
|
||||
.u64(self.field)
|
||||
.ok_or_else(|| {
|
||||
let field_name = segment_reader.schema().get_field_name(self.field);
|
||||
TantivyError::SchemaError(format!(
|
||||
"Field {:?} is not a u64 fast field.",
|
||||
field_name
|
||||
))
|
||||
})?;
|
||||
let fast_field_reader = segment_reader.fast_fields().u64(self.field)?;
|
||||
Ok(StatsSegmentCollector {
|
||||
fast_field_reader,
|
||||
stats: Stats::default(),
|
||||
|
||||
@@ -61,7 +61,7 @@ fn main() -> tantivy::Result<()> {
|
||||
|
||||
let query_ords: HashSet<u64> = facets
|
||||
.iter()
|
||||
.filter_map(|key| facet_dict.term_ord(key.encoded_str()))
|
||||
.filter_map(|key| facet_dict.term_ord(key.encoded_str()).unwrap())
|
||||
.collect();
|
||||
|
||||
let mut facet_ords_buffer: Vec<u64> = Vec::with_capacity(20);
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
[package]
|
||||
name = "tantivy-query-grammar"
|
||||
version = "0.14.0-dev"
|
||||
version = "0.14.0"
|
||||
authors = ["Paul Masurel <paul.masurel@gmail.com>"]
|
||||
license = "MIT"
|
||||
categories = ["database-implementations", "data-structures"]
|
||||
|
||||
@@ -274,7 +274,7 @@ impl Collector for FacetCollector {
|
||||
let mut collapse_facet_it = self.facets.iter().peekable();
|
||||
collapse_facet_ords.push(0);
|
||||
{
|
||||
let mut facet_streamer = facet_reader.facet_dict().range().into_stream();
|
||||
let mut facet_streamer = facet_reader.facet_dict().range().into_stream()?;
|
||||
if facet_streamer.advance() {
|
||||
'outer: loop {
|
||||
// at the begining of this loop, facet_streamer
|
||||
@@ -368,9 +368,12 @@ impl SegmentCollector for FacetSegmentCollector {
|
||||
}
|
||||
let mut facet = vec![];
|
||||
let facet_ord = self.collapse_facet_ords[collapsed_facet_ord];
|
||||
facet_dict.ord_to_term(facet_ord as u64, &mut facet);
|
||||
// TODO
|
||||
facet_counts.insert(Facet::from_encoded(facet).unwrap(), count);
|
||||
// TODO handle errors.
|
||||
if facet_dict.ord_to_term(facet_ord as u64, &mut facet).is_ok() {
|
||||
if let Ok(facet) = Facet::from_encoded(facet) {
|
||||
facet_counts.insert(facet, count);
|
||||
}
|
||||
}
|
||||
}
|
||||
FacetCounts { facet_counts }
|
||||
}
|
||||
@@ -395,6 +398,8 @@ impl<'a> Iterator for FacetChildIterator<'a> {
|
||||
}
|
||||
|
||||
impl FacetCounts {
|
||||
/// Returns an iterator over all of the facet count pairs inside this result.
|
||||
/// See the documentation for `FacetCollector` for a usage example.
|
||||
pub fn get<T>(&self, facet_from: T) -> FacetChildIterator<'_>
|
||||
where
|
||||
Facet: From<T>,
|
||||
@@ -414,6 +419,8 @@ impl FacetCounts {
|
||||
FacetChildIterator { underlying }
|
||||
}
|
||||
|
||||
/// Returns a vector of top `k` facets with their counts, sorted highest-to-lowest by counts.
|
||||
/// See the documentation for `FacetCollector` for a usage example.
|
||||
pub fn top_k<T>(&self, facet: T, k: usize) -> Vec<(&Facet, u64)>
|
||||
where
|
||||
Facet: From<T>,
|
||||
|
||||
@@ -11,9 +11,9 @@
|
||||
// Importing tantivy...
|
||||
use std::marker::PhantomData;
|
||||
|
||||
use crate::fastfield::{FastValue, FastFieldReader};
|
||||
use crate::schema::Field;
|
||||
use crate::collector::{Collector, SegmentCollector};
|
||||
use crate::fastfield::{FastFieldReader, FastValue};
|
||||
use crate::schema::Field;
|
||||
use crate::{Score, SegmentReader, TantivyError};
|
||||
|
||||
/// The `FilterCollector` collector filters docs using a u64 fast field value and a predicate.
|
||||
@@ -22,23 +22,20 @@ use crate::{Score, SegmentReader, TantivyError};
|
||||
/// ```rust
|
||||
/// use tantivy::collector::{TopDocs, FilterCollector};
|
||||
/// use tantivy::query::QueryParser;
|
||||
/// use tantivy::schema::{Schema, FAST, TEXT};
|
||||
/// use tantivy::DateTime;
|
||||
/// use std::str::FromStr;
|
||||
/// use tantivy::schema::{Schema, TEXT, INDEXED, FAST};
|
||||
/// use tantivy::{doc, DocAddress, Index};
|
||||
///
|
||||
/// let mut schema_builder = Schema::builder();
|
||||
/// let title = schema_builder.add_text_field("title", TEXT);
|
||||
/// let price = schema_builder.add_u64_field("price", FAST);
|
||||
/// let date = schema_builder.add_date_field("date", FAST);
|
||||
/// let price = schema_builder.add_u64_field("price", INDEXED | FAST);
|
||||
/// let schema = schema_builder.build();
|
||||
/// let index = Index::create_in_ram(schema);
|
||||
///
|
||||
/// let mut index_writer = index.writer_with_num_threads(1, 10_000_000).unwrap();
|
||||
/// index_writer.add_document(doc!(title => "The Name of the Wind", price => 30_200u64, date => DateTime::from_str("1898-04-09T00:00:00+00:00").unwrap()));
|
||||
/// index_writer.add_document(doc!(title => "The Diary of Muadib", price => 29_240u64, date => DateTime::from_str("2020-04-09T00:00:00+00:00").unwrap()));
|
||||
/// index_writer.add_document(doc!(title => "A Dairy Cow", price => 21_240u64, date => DateTime::from_str("2019-04-09T00:00:00+00:00").unwrap()));
|
||||
/// index_writer.add_document(doc!(title => "The Diary of a Young Girl", price => 20_120u64, date => DateTime::from_str("2018-04-09T00:00:00+00:00").unwrap()));
|
||||
/// index_writer.add_document(doc!(title => "The Name of the Wind", price => 30_200u64));
|
||||
/// index_writer.add_document(doc!(title => "The Diary of Muadib", price => 29_240u64));
|
||||
/// index_writer.add_document(doc!(title => "A Dairy Cow", price => 21_240u64));
|
||||
/// index_writer.add_document(doc!(title => "The Diary of a Young Girl", price => 20_120u64));
|
||||
/// assert!(index_writer.commit().is_ok());
|
||||
///
|
||||
/// let reader = index.reader().unwrap();
|
||||
@@ -46,8 +43,8 @@ use crate::{Score, SegmentReader, TantivyError};
|
||||
///
|
||||
/// let query_parser = QueryParser::for_index(&index, vec![title]);
|
||||
/// let query = query_parser.parse_query("diary").unwrap();
|
||||
/// let filter_some_collector = FilterCollector::new(price, &|value: u64| value > 20_120u64, TopDocs::with_limit(2));
|
||||
/// let top_docs = searcher.search(&query, &filter_some_collector).unwrap();
|
||||
/// let no_filter_collector = FilterCollector::new(price, &|value: u64| value > 20_120u64, TopDocs::with_limit(2));
|
||||
/// let top_docs = searcher.search(&query, &no_filter_collector).unwrap();
|
||||
///
|
||||
/// assert_eq!(top_docs.len(), 1);
|
||||
/// assert_eq!(top_docs[0].1, DocAddress(0, 1));
|
||||
@@ -56,17 +53,6 @@ use crate::{Score, SegmentReader, TantivyError};
|
||||
/// let filtered_top_docs = searcher.search(&query, &filter_all_collector).unwrap();
|
||||
///
|
||||
/// assert_eq!(filtered_top_docs.len(), 0);
|
||||
///
|
||||
/// fn date_debug(value: DateTime) -> bool {
|
||||
/// println!("date: {:?}", value);
|
||||
/// assert_eq!(value, DateTime::from_str("1000-04-09T00:00:00+00:00").unwrap());
|
||||
/// (value - DateTime::from_str("2019-04-09T00:00:00+00:00").unwrap()).num_weeks() > 0
|
||||
/// }
|
||||
///
|
||||
/// let filter_dates_collector = FilterCollector::new(date, &date_debug, TopDocs::with_limit(2));
|
||||
/// let filtered_date_docs = searcher.search(&query, &filter_all_collector).unwrap();
|
||||
///
|
||||
/// assert_eq!(filtered_date_docs.len(), 5);
|
||||
/// ```
|
||||
pub struct FilterCollector<TCollector, TPredicate, TPredicateValue: FastValue>
|
||||
where
|
||||
@@ -125,42 +111,28 @@ where
|
||||
field_entry.name()
|
||||
)));
|
||||
}
|
||||
let schema_type = TPredicateValue::to_type();
|
||||
let requested_type = field_entry.field_type().value_type();
|
||||
if schema_type != requested_type {
|
||||
let requested_type = TPredicateValue::to_type();
|
||||
let field_schema_type = field_entry.field_type().value_type();
|
||||
if requested_type != field_schema_type {
|
||||
return Err(TantivyError::SchemaError(format!(
|
||||
"Field {:?} is of type {:?}!={:?}",
|
||||
field_entry.name(),
|
||||
schema_type,
|
||||
requested_type
|
||||
requested_type,
|
||||
field_schema_type
|
||||
)));
|
||||
}
|
||||
|
||||
let err_closure = || {
|
||||
let field_name = segment_reader.schema().get_field_name(self.field);
|
||||
TantivyError::SchemaError(format!(
|
||||
"Field {:?} is not a u64 fast field.",
|
||||
field_name
|
||||
))
|
||||
};
|
||||
let fast_fields = segment_reader.fast_fields();
|
||||
let fast_filed_reader: crate::Result<FastFieldReader<TPredicateValue>> = match schema_type {
|
||||
crate::schema::Type::U64 => {fast_fields.u64(self.field).ok_or_else(err_closure)}
|
||||
crate::schema::Type::I64 => {fast_fields.i64(self.field).ok_or_else(err_closure)}
|
||||
crate::schema::Type::F64 => {fast_fields.f64(self.field).ok_or_else(err_closure)}
|
||||
crate::schema::Type::Date => {fast_fields.date(self.field).ok_or_else(err_closure)}
|
||||
crate::schema::Type::Bytes => {fast_fields.bytes(self.field).ok_or_else(err_closure)}
|
||||
crate::schema::Type::Str | crate::schema::Type::HierarchicalFacet => {Err(TantivyError::SchemaError(format!("Field {:?} uses an unsupported type", segment_reader.schema().get_field_name(self.field))))}
|
||||
};
|
||||
let fast_field_reader = segment_reader
|
||||
.fast_fields()
|
||||
.typed_fast_field_reader(self.field)?;
|
||||
|
||||
let segment_collector = self
|
||||
.collector
|
||||
.for_segment(segment_local_id, segment_reader)?;
|
||||
|
||||
let a = fast_filed_reader?;
|
||||
Ok(FilterSegmentCollector {
|
||||
fast_field_reader: a,
|
||||
segment_collector: segment_collector,
|
||||
fast_field_reader,
|
||||
segment_collector,
|
||||
predicate: self.predicate,
|
||||
t_predicate_value: PhantomData,
|
||||
})
|
||||
|
||||
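The hunks above show `FilterCollector` resolving a typed fast field reader for the segment and then building a `FilterSegmentCollector` that holds the reader, the wrapped segment collector, and the predicate. The idea can be sketched without tantivy as follows. This is only an illustration under stated assumptions: the `SegmentCollector` trait here is a simplified stand-in, `DocIdCollector` and `filter_docs` are hypothetical helpers, and the fast field is modelled as a slice indexed by doc id rather than a real `FastFieldReader`.

```rust
/// Simplified stand-in for a per-segment collector: it receives doc ids.
trait SegmentCollector {
    fn collect(&mut self, doc: u32);
}

/// Inner collector used in this sketch: it simply records doc ids.
#[derive(Default)]
struct DocIdCollector {
    docs: Vec<u32>,
}

impl SegmentCollector for DocIdCollector {
    fn collect(&mut self, doc: u32) {
        self.docs.push(doc);
    }
}

/// Wraps another collector and forwards only the docs whose fast field
/// value satisfies the predicate. The fast field is modelled as a slice
/// indexed by doc id, which is an assumption of this sketch.
struct FilterSegmentCollector<'a, C, P> {
    fast_field_values: &'a [u64],
    predicate: P,
    inner: C,
}

impl<'a, C, P> SegmentCollector for FilterSegmentCollector<'a, C, P>
where
    C: SegmentCollector,
    P: Fn(u64) -> bool,
{
    fn collect(&mut self, doc: u32) {
        // Look up the value for this doc, then decide whether to forward it.
        let value = self.fast_field_values[doc as usize];
        if (self.predicate)(value) {
            self.inner.collect(doc);
        }
    }
}

/// Feeds the matching docs through the filter and returns the doc ids kept.
fn filter_docs(prices: &[u64], matching_docs: &[u32], min_price: u64) -> Vec<u32> {
    let mut collector = FilterSegmentCollector {
        fast_field_values: prices,
        predicate: |price: u64| price > min_price,
        inner: DocIdCollector::default(),
    };
    for &doc in matching_docs {
        collector.collect(doc);
    }
    collector.inner.docs
}
```

With the four prices from the doc example (30_200, 29_240, 21_240, 20_120) and the two "diary" titles at doc ids 1 and 3, a strict `> 20_120` predicate keeps only doc 1, which mirrors the single hit asserted in the doc test.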
@@ -109,6 +109,7 @@ pub use self::tweak_score_top_collector::{ScoreSegmentTweaker, ScoreTweaker};
|
||||
|
||||
mod facet_collector;
|
||||
pub use self::facet_collector::FacetCollector;
|
||||
pub use self::facet_collector::FacetCounts;
|
||||
use crate::query::Weight;
|
||||
|
||||
mod docset_collector;
|
||||
|
||||
@@ -8,12 +8,12 @@ use crate::DocId;
|
||||
use crate::Score;
|
||||
use crate::SegmentLocalId;
|
||||
|
||||
use crate::collector::{TopDocs, FilterCollector};
|
||||
use crate::collector::{FilterCollector, TopDocs};
|
||||
use crate::query::QueryParser;
|
||||
use crate::schema::{Schema, FAST, TEXT};
|
||||
use crate::DateTime;
|
||||
use std::str::FromStr;
|
||||
use crate::{doc, Index};
|
||||
use std::str::FromStr;
|
||||
|
||||
pub const TEST_COLLECTOR_WITH_SCORE: TestCollector = TestCollector {
|
||||
compute_score: true,
|
||||
@@ -25,7 +25,6 @@ pub const TEST_COLLECTOR_WITHOUT_SCORE: TestCollector = TestCollector {
|
||||
|
||||
#[test]
|
||||
pub fn test_filter_collector() {
|
||||
|
||||
let mut schema_builder = Schema::builder();
|
||||
let title = schema_builder.add_text_field("title", TEXT);
|
||||
let price = schema_builder.add_u64_field("price", FAST);
|
||||
@@ -36,6 +35,7 @@ pub fn test_filter_collector() {
|
||||
let mut index_writer = index.writer_with_num_threads(1, 10_000_000).unwrap();
|
||||
index_writer.add_document(doc!(title => "The Name of the Wind", price => 30_200u64, date => DateTime::from_str("1898-04-09T00:00:00+00:00").unwrap()));
|
||||
index_writer.add_document(doc!(title => "The Diary of Muadib", price => 29_240u64, date => DateTime::from_str("2020-04-09T00:00:00+00:00").unwrap()));
|
||||
index_writer.add_document(doc!(title => "The Diary of Anne Frank", price => 18_240u64, date => DateTime::from_str("2019-04-20T00:00:00+00:00").unwrap()));
|
||||
index_writer.add_document(doc!(title => "A Dairy Cow", price => 21_240u64, date => DateTime::from_str("2019-04-09T00:00:00+00:00").unwrap()));
|
||||
index_writer.add_document(doc!(title => "The Diary of a Young Girl", price => 20_120u64, date => DateTime::from_str("2018-04-09T00:00:00+00:00").unwrap()));
|
||||
assert!(index_writer.commit().is_ok());
|
||||
@@ -45,27 +45,30 @@ pub fn test_filter_collector() {
|
||||
|
||||
let query_parser = QueryParser::for_index(&index, vec![title]);
|
||||
let query = query_parser.parse_query("diary").unwrap();
|
||||
let filter_some_collector = FilterCollector::new(price, &|value: u64| value > 20_120u64, TopDocs::with_limit(2));
|
||||
let filter_some_collector = FilterCollector::new(
|
||||
price,
|
||||
&|value: u64| value > 20_120u64,
|
||||
TopDocs::with_limit(2),
|
||||
);
|
||||
let top_docs = searcher.search(&query, &filter_some_collector).unwrap();
|
||||
|
||||
assert_eq!(top_docs.len(), 1);
|
||||
assert_eq!(top_docs[0].1, DocAddress(0, 1));
|
||||
|
||||
let filter_all_collector: FilterCollector<_, _, u64> = FilterCollector::new(price, &|value| value < 5u64, TopDocs::with_limit(2));
|
||||
let filter_all_collector: FilterCollector<_, _, u64> =
|
||||
FilterCollector::new(price, &|value| value < 5u64, TopDocs::with_limit(2));
|
||||
let filtered_top_docs = searcher.search(&query, &filter_all_collector).unwrap();
|
||||
|
||||
assert_eq!(filtered_top_docs.len(), 0);
|
||||
|
||||
fn date_debug(value: DateTime) -> bool {
|
||||
println!("date: {:?}", value);
|
||||
assert_eq!(value, DateTime::from_str("1000-04-09T00:00:00+00:00").unwrap());
|
||||
(value - DateTime::from_str("2019-04-09T00:00:00+00:00").unwrap()).num_weeks() > 0
|
||||
fn date_filter(value: DateTime) -> bool {
|
||||
(value - DateTime::from_str("2019-04-09T00:00:00+00:00").unwrap()).num_weeks() > 0
|
||||
}
|
||||
|
||||
let filter_dates_collector = FilterCollector::new(date, &date_debug, TopDocs::with_limit(2));
|
||||
let filter_dates_collector = FilterCollector::new(date, &date_filter, TopDocs::with_limit(5));
|
||||
let filtered_date_docs = searcher.search(&query, &filter_dates_collector).unwrap();
|
||||
|
||||
assert_eq!(filtered_date_docs.len(), 5);
|
||||
assert_eq!(filtered_date_docs.len(), 2);
|
||||
}
|
||||
|
||||
/// Stores all of the doc ids.
|
||||
@@ -237,12 +240,7 @@ impl Collector for BytesFastFieldTestCollector {
|
||||
_segment_local_id: u32,
|
||||
segment_reader: &SegmentReader,
|
||||
) -> crate::Result<BytesFastFieldSegmentCollector> {
|
||||
let reader = segment_reader
|
||||
.fast_fields()
|
||||
.bytes(self.field)
|
||||
.ok_or_else(|| {
|
||||
crate::TantivyError::InvalidArgument("Field is not a bytes fast field.".to_string())
|
||||
})?;
|
||||
let reader = segment_reader.fast_fields().bytes(self.field)?;
|
||||
Ok(BytesFastFieldSegmentCollector {
|
||||
vals: Vec::new(),
|
||||
reader,
|
||||
|
||||
@@ -2,9 +2,9 @@ use crate::DocAddress;
|
||||
use crate::DocId;
|
||||
use crate::SegmentLocalId;
|
||||
use crate::SegmentReader;
|
||||
use serde::export::PhantomData;
|
||||
use std::cmp::Ordering;
|
||||
use std::collections::BinaryHeap;
|
||||
use std::marker::PhantomData;
|
||||
|
||||
/// Contains a feature (field, score, etc.) of a document along with the document address.
|
||||
///
|
||||
|
||||
@@ -146,15 +146,14 @@ impl CustomScorer<u64> for ScorerByField {
|
||||
type Child = ScorerByFastFieldReader;
|
||||
|
||||
fn segment_scorer(&self, segment_reader: &SegmentReader) -> crate::Result<Self::Child> {
|
||||
let ff_reader = segment_reader
|
||||
// We interpret this field as u64, regardless of its type, that way,
|
||||
// we avoid needless conversion. Regardless of the fast field type, the
|
||||
// mapping is monotonic, so it is sufficient to compute our top-K docs.
|
||||
//
|
||||
// The conversion will then happen only on the top-K docs.
|
||||
let ff_reader: FastFieldReader<u64> = segment_reader
|
||||
.fast_fields()
|
||||
.u64_lenient(self.field)
|
||||
.ok_or_else(|| {
|
||||
crate::TantivyError::SchemaError(format!(
|
||||
"Field requested ({:?}) is not a fast field.",
|
||||
self.field
|
||||
))
|
||||
})?;
|
||||
.typed_fast_field_reader(self.field)?;
|
||||
Ok(ScorerByFastFieldReader { ff_reader })
|
||||
}
|
||||
}
|
||||
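The comment in this hunk argues that top-K selection can run on the raw `u64` representation because the mapping is monotonic, with conversion deferred to the K survivors. A minimal sketch of that idea for `i64` follows. The helper `top_k_i64` is hypothetical and not part of tantivy, and the sign-bit flip is assumed to match the `i64_to_u64` mapping (it is consistent with `i64::min_value()` mapping to `u64::min_value()` in the tests further down).

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

const HIGHEST_BIT: u64 = 1 << 63;

/// Monotonic mapping: flipping the sign bit preserves the order of `i64` values.
pub fn i64_to_u64(val: i64) -> u64 {
    (val as u64) ^ HIGHEST_BIT
}

/// Inverse of `i64_to_u64`.
pub fn u64_to_i64(val: u64) -> i64 {
    (val ^ HIGHEST_BIT) as i64
}

/// Hypothetical helper: returns the `k` largest values, in descending order.
/// The heap only ever compares raw `u64`s; the conversion back to `i64`
/// happens once per surviving value.
pub fn top_k_i64(raw: &[u64], k: usize) -> Vec<i64> {
    // Min-heap (via `Reverse`) holding the best `k` values seen so far.
    let mut heap: BinaryHeap<Reverse<u64>> = BinaryHeap::new();
    for &val in raw {
        heap.push(Reverse(val));
        if heap.len() > k {
            // Drop the smallest value currently in the heap.
            heap.pop();
        }
    }
    let mut top: Vec<u64> = heap.into_iter().map(|Reverse(v)| v).collect();
    top.sort_unstable_by(|a, b| b.cmp(a));
    top.into_iter().map(u64_to_i64).collect()
}
```

Because the order of the mapped values equals the order of the originals, the result is the same as converting every value first, only with fewer conversions.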
@@ -232,7 +231,7 @@ impl TopDocs {
|
||||
/// # let title = schema_builder.add_text_field("title", TEXT);
|
||||
/// # let rating = schema_builder.add_u64_field("rating", FAST);
|
||||
/// # let schema = schema_builder.build();
|
||||
/// #
|
||||
/// #
|
||||
/// # let index = Index::create_in_ram(schema);
|
||||
/// # let mut index_writer = index.writer_with_num_threads(1, 10_000_000)?;
|
||||
/// # index_writer.add_document(doc!(title => "The Name of the Wind", rating => 92u64));
|
||||
@@ -262,7 +261,7 @@ impl TopDocs {
|
||||
/// let top_books_by_rating = TopDocs
|
||||
/// ::with_limit(10)
|
||||
/// .order_by_u64_field(rating_field);
|
||||
///
|
||||
///
|
||||
/// // ... and here are our documents. Note this is a simple vec.
|
||||
/// // The `u64` in the pair is the value of our fast field for
|
||||
/// // each document.
|
||||
@@ -272,13 +271,13 @@ impl TopDocs {
|
||||
/// // query.
|
||||
/// let resulting_docs: Vec<(u64, DocAddress)> =
|
||||
/// searcher.search(query, &top_books_by_rating)?;
|
||||
///
|
||||
///
|
||||
/// Ok(resulting_docs)
|
||||
/// }
|
||||
/// ```
|
||||
///
|
||||
/// # See also
|
||||
///
|
||||
///
|
||||
/// To comfortably work with `u64`s, `i64`s, `f64`s, or `date`s, please refer to
|
||||
/// [.order_by_fast_field(...)](#method.order_by_fast_field) method.
|
||||
pub fn order_by_u64_field(
|
||||
@@ -290,7 +289,7 @@ impl TopDocs {
|
||||
|
||||
/// Set top-K to rank documents by a given fast field.
|
||||
///
|
||||
/// If the field is not a fast field, or its field type does not match the generic type, this method does not panic,
|
||||
/// If the field is not a fast field, or its field type does not match the generic type, this method does not panic,
|
||||
/// but an explicit error will be returned at the moment of collection.
|
||||
///
|
||||
/// Note that this method is a generic. The requested fast field type will be often
|
||||
@@ -314,7 +313,7 @@ impl TopDocs {
|
||||
/// # let title = schema_builder.add_text_field("company", TEXT);
|
||||
/// # let rating = schema_builder.add_i64_field("revenue", FAST);
|
||||
/// # let schema = schema_builder.build();
|
||||
/// #
|
||||
/// #
|
||||
/// # let index = Index::create_in_ram(schema);
|
||||
/// # let mut index_writer = index.writer_with_num_threads(1, 10_000_000)?;
|
||||
/// # index_writer.add_document(doc!(title => "MadCow Inc.", rating => 92_000_000i64));
|
||||
@@ -343,7 +342,7 @@ impl TopDocs {
|
||||
/// let top_company_by_revenue = TopDocs
|
||||
/// ::with_limit(2)
|
||||
/// .order_by_fast_field(revenue_field);
|
||||
///
|
||||
///
|
||||
/// // ... and here are our documents. Note this is a simple vec.
|
||||
/// // The `i64` in the pair is the value of our fast field for
|
||||
/// // each document.
|
||||
@@ -353,7 +352,7 @@ impl TopDocs {
|
||||
/// // query.
|
||||
/// let resulting_docs: Vec<(i64, DocAddress)> =
|
||||
/// searcher.search(query, &top_company_by_revenue)?;
|
||||
///
|
||||
///
|
||||
/// Ok(resulting_docs)
|
||||
/// }
|
||||
/// ```
|
||||
@@ -392,7 +391,7 @@ impl TopDocs {
|
||||
///
|
||||
/// In the following example we will tweak our ranking a bit by
|
||||
/// boosting popular products a notch.
|
||||
///
|
||||
///
|
||||
/// In a more serious application, this tweaking could involve running a
|
||||
/// learning-to-rank model over various features
|
||||
///
|
||||
@@ -523,7 +522,7 @@ impl TopDocs {
|
||||
/// # let index = Index::create_in_ram(schema);
|
||||
/// # let mut index_writer = index.writer_with_num_threads(1, 10_000_000)?;
|
||||
/// # let product_name = index.schema().get_field("product_name").unwrap();
|
||||
/// #
|
||||
/// #
|
||||
/// let popularity: Field = index.schema().get_field("popularity").unwrap();
|
||||
/// let boosted: Field = index.schema().get_field("boosted").unwrap();
|
||||
/// # index_writer.add_document(doc!(boosted=>1u64, product_name => "The Diary of Muadib", popularity => 1u64));
|
||||
@@ -557,7 +556,7 @@ impl TopDocs {
|
||||
/// segment_reader.fast_fields().u64(popularity).unwrap();
|
||||
/// let boosted_reader =
|
||||
/// segment_reader.fast_fields().u64(boosted).unwrap();
|
||||
///
|
||||
///
|
||||
/// // We can now define our actual scoring function
|
||||
/// move |doc: DocId| {
|
||||
/// let popularity: u64 = popularity_reader.get(doc);
|
||||
@@ -728,7 +727,7 @@ mod tests {
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_top_collector_not_at_capacity() {
|
||||
fn test_top_collector_not_at_capacity_without_offset() {
|
||||
let index = make_index();
|
||||
let field = index.schema().get_field("text").unwrap();
|
||||
let query_parser = QueryParser::for_index(&index, vec![field]);
|
||||
@@ -994,9 +993,7 @@ mod tests {
|
||||
let segment = searcher.segment_reader(0);
|
||||
let top_collector = TopDocs::with_limit(4).order_by_u64_field(size);
|
||||
let err = top_collector.for_segment(0, segment).err().unwrap();
|
||||
assert!(
|
||||
matches!(err, crate::TantivyError::SchemaError(msg) if msg == "Field requested (Field(0)) is not a fast field.")
|
||||
);
|
||||
assert!(matches!(err, crate::TantivyError::SchemaError(_)));
|
||||
Ok(())
|
||||
}
|
||||
|
||||
|
||||
@@ -20,9 +20,10 @@ impl<W: Write> CountingWriter<W> {
|
||||
self.written_bytes
|
||||
}
|
||||
|
||||
pub fn finish(mut self) -> io::Result<(W, u64)> {
|
||||
self.flush()?;
|
||||
Ok((self.underlying, self.written_bytes))
|
||||
/// Returns the underlying write object.
|
||||
/// Note that this method does not trigger any flushing.
|
||||
pub fn finish(self) -> W {
|
||||
self.underlying
|
||||
}
|
||||
}
|
||||
|
||||
@@ -46,7 +47,6 @@ impl<W: Write> Write for CountingWriter<W> {
|
||||
|
||||
impl<W: TerminatingWrite> TerminatingWrite for CountingWriter<W> {
|
||||
fn terminate_ref(&mut self, token: AntiCallToken) -> io::Result<()> {
|
||||
self.flush()?;
|
||||
self.underlying.terminate_ref(token)
|
||||
}
|
||||
}
|
||||
@@ -63,8 +63,9 @@ mod test {
|
||||
let mut counting_writer = CountingWriter::wrap(buffer);
|
||||
let bytes = (0u8..10u8).collect::<Vec<u8>>();
|
||||
counting_writer.write_all(&bytes).unwrap();
|
||||
let (w, len): (Vec<u8>, u64) = counting_writer.finish().unwrap();
|
||||
let len = counting_writer.written_bytes();
|
||||
let buffer_restituted: Vec<u8> = counting_writer.finish();
|
||||
assert_eq!(len, 10u64);
|
||||
assert_eq!(w.len(), 10);
|
||||
assert_eq!(buffer_restituted.len(), 10);
|
||||
}
|
||||
}
|
||||
|
||||
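The change above makes `finish` return the wrapped writer without flushing, while the byte count is read separately through `written_bytes()`. A minimal, self-contained sketch of such a wrapper follows. It is a simplified stand-in written for illustration and assumes nothing about the rest of tantivy's `CountingWriter`.

```rust
use std::io::{self, Write};

/// Simplified counting wrapper: forwards writes and tracks how many bytes
/// the underlying writer accepted.
pub struct CountingWriter<W> {
    underlying: W,
    written_bytes: u64,
}

impl<W: Write> CountingWriter<W> {
    pub fn wrap(underlying: W) -> Self {
        CountingWriter {
            underlying,
            written_bytes: 0,
        }
    }

    pub fn written_bytes(&self) -> u64 {
        self.written_bytes
    }

    /// Returns the underlying writer. No flush is triggered here, so the
    /// caller decides when (and whether) to flush.
    pub fn finish(self) -> W {
        self.underlying
    }
}

impl<W: Write> Write for CountingWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        // Count only what the underlying writer reports as written.
        let written = self.underlying.write(buf)?;
        self.written_bytes += written as u64;
        Ok(written)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.underlying.flush()
    }
}
```

With this shape, a caller that needs both values reads `written_bytes()` first and then calls `finish()`, as the updated test does.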
@@ -115,11 +115,16 @@ pub fn u64_to_i64(val: u64) -> i64 {
|
||||
/// For simplicity, tantivy internally handles `f64` as `u64`.
|
||||
/// The mapping is defined by this function.
|
||||
///
|
||||
/// Maps `f64` to `u64` so that lexical order is preserved.
|
||||
/// Maps `f64` to `u64` in a monotonic manner, so that bytes lexical order is preserved.
|
||||
///
|
||||
/// This is more suited than simply casting (`val as u64`)
|
||||
/// which would truncate the result
|
||||
///
|
||||
/// # Reference
|
||||
///
|
||||
/// Daniel Lemire's [blog post](https://lemire.me/blog/2020/12/14/converting-floating-point-numbers-to-integers-while-preserving-order/)
|
||||
/// explains the mapping in a clear manner.
|
||||
///
|
||||
/// # See also
|
||||
/// The [reverse mapping is `u64_to_f64`](./fn.u64_to_f64.html).
|
||||
#[inline(always)]
|
||||
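The doc comment above describes a monotonic `f64` to `u64` mapping. The sketch below follows the bit trick described in the referenced blog post. It is an illustration and may differ in detail from the function body, which this diff does not show.

```rust
/// Mask selecting the sign bit of a 64-bit word.
const HIGHEST_BIT: u64 = 1 << 63;

/// Maps an `f64` to a `u64` such that `a < b` implies
/// `f64_to_u64(a) < f64_to_u64(b)` (NaN is not considered here).
pub fn f64_to_u64(val: f64) -> u64 {
    let bits = val.to_bits();
    if val.is_sign_positive() {
        // Non-negative floats already sort by bit pattern; setting the sign
        // bit places them above every negative float.
        bits ^ HIGHEST_BIT
    } else {
        // Negative floats sort in reverse bit order, so flip every bit.
        !bits
    }
}

/// Reverse mapping of `f64_to_u64`.
pub fn u64_to_f64(val: u64) -> f64 {
    f64::from_bits(if val & HIGHEST_BIT != 0 {
        val ^ HIGHEST_BIT
    } else {
        !val
    })
}
```

A plain cast such as `-2.0f64 as u64` would saturate to `0` and lose both the sign and the fractional part, which is why the bit-level mapping is used instead.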
@@ -148,6 +153,7 @@ pub(crate) mod test {
|
||||
pub use super::minmax;
|
||||
pub use super::serialize::test::fixed_size_test;
|
||||
use super::{compute_num_bits, f64_to_u64, i64_to_u64, u64_to_f64, u64_to_i64};
|
||||
use proptest::prelude::*;
|
||||
use std::f64;
|
||||
|
||||
fn test_i64_converter_helper(val: i64) {
|
||||
@@ -158,6 +164,15 @@ pub(crate) mod test {
|
||||
assert_eq!(u64_to_f64(f64_to_u64(val)), val);
|
||||
}
|
||||
|
||||
proptest! {
|
||||
#[test]
|
||||
fn test_f64_converter_monotonicity_proptest((left, right) in (proptest::num::f64::NORMAL, proptest::num::f64::NORMAL)) {
|
||||
let left_u64 = f64_to_u64(left);
|
||||
let right_u64 = f64_to_u64(right);
|
||||
assert_eq!(left_u64 < right_u64, left < right);
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_i64_converter() {
|
||||
assert_eq!(i64_to_u64(i64::min_value()), u64::min_value());
|
||||
|
||||
@@ -35,12 +35,21 @@ fn load_metas(
|
||||
inventory: &SegmentMetaInventory,
|
||||
) -> crate::Result<IndexMeta> {
|
||||
let meta_data = directory.atomic_read(&META_FILEPATH)?;
|
||||
let meta_string = String::from_utf8_lossy(&meta_data);
|
||||
let meta_string = String::from_utf8(meta_data).map_err(|_utf8_err| {
|
||||
error!("Meta data is not valid utf8.");
|
||||
DataCorruption::new(
|
||||
META_FILEPATH.to_path_buf(),
|
||||
"Meta file does not contain valid utf8 file.".to_string(),
|
||||
)
|
||||
})?;
|
||||
IndexMeta::deserialize(&meta_string, &inventory)
|
||||
.map_err(|e| {
|
||||
DataCorruption::new(
|
||||
META_FILEPATH.to_path_buf(),
|
||||
format!("Meta file cannot be deserialized. {:?}.", e),
|
||||
format!(
|
||||
"Meta file cannot be deserialized. {:?}. Content: {:?}",
|
||||
e, meta_string
|
||||
),
|
||||
)
|
||||
})
|
||||
.map_err(From::from)
|
||||
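The hunk above replaces `String::from_utf8_lossy` with a strict `String::from_utf8` so that invalid bytes surface as a corruption error instead of being silently replaced. A minimal sketch of the difference follows. `CorruptMeta` and `decode_meta` are hypothetical names standing in for tantivy's `DataCorruption` handling.

```rust
/// Hypothetical error type standing in for tantivy's `DataCorruption`.
#[derive(Debug, PartialEq)]
pub struct CorruptMeta(pub String);

/// Strict decoding: invalid UTF-8 becomes an explicit error.
pub fn decode_meta(raw: Vec<u8>) -> Result<String, CorruptMeta> {
    String::from_utf8(raw)
        .map_err(|_utf8_err| CorruptMeta("meta file is not valid utf8".to_string()))
}

/// Lossy decoding, for comparison: invalid bytes are replaced by U+FFFD,
/// so a later JSON parse error would hide the real cause.
pub fn decode_meta_lossy(raw: &[u8]) -> String {
    String::from_utf8_lossy(raw).into_owned()
}
```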
@@ -511,28 +520,28 @@ mod tests {
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_index_manual_policy_mmap() {
|
||||
fn test_index_manual_policy_mmap() -> crate::Result<()> {
|
||||
let schema = throw_away_schema();
|
||||
let field = schema.get_field("num_likes").unwrap();
|
||||
let mut index = Index::create_from_tempdir(schema).unwrap();
|
||||
let mut writer = index.writer_for_tests().unwrap();
|
||||
writer.commit().unwrap();
|
||||
let mut index = Index::create_from_tempdir(schema)?;
|
||||
let mut writer = index.writer_for_tests()?;
|
||||
writer.commit()?;
|
||||
let reader = index
|
||||
.reader_builder()
|
||||
.reload_policy(ReloadPolicy::Manual)
|
||||
.try_into()
|
||||
.unwrap();
|
||||
.try_into()?;
|
||||
assert_eq!(reader.searcher().num_docs(), 0);
|
||||
writer.add_document(doc!(field=>1u64));
|
||||
let (sender, receiver) = crossbeam::channel::unbounded();
|
||||
let _handle = index.directory_mut().watch(WatchCallback::new(move || {
|
||||
let _ = sender.send(());
|
||||
}));
|
||||
writer.commit().unwrap();
|
||||
writer.commit()?;
|
||||
assert!(receiver.recv().is_ok());
|
||||
assert_eq!(reader.searcher().num_docs(), 0);
|
||||
reader.reload().unwrap();
|
||||
reader.reload()?;
|
||||
assert_eq!(reader.searcher().num_docs(), 1);
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
|
||||
@@ -66,7 +66,7 @@ impl InvertedIndexReader {
|
||||
}
|
||||
|
||||
/// Returns the term info associated with the term.
|
||||
pub fn get_term_info(&self, term: &Term) -> Option<TermInfo> {
|
||||
pub fn get_term_info(&self, term: &Term) -> io::Result<Option<TermInfo>> {
|
||||
self.termdict.get(term.value_bytes())
|
||||
}
|
||||
|
||||
@@ -106,10 +106,9 @@ impl InvertedIndexReader {
|
||||
term: &Term,
|
||||
option: IndexRecordOption,
|
||||
) -> io::Result<Option<BlockSegmentPostings>> {
|
||||
Ok(self
|
||||
.get_term_info(term)
|
||||
self.get_term_info(term)?
|
||||
.map(move |term_info| self.read_block_postings_from_terminfo(&term_info, option))
|
||||
.transpose()?)
|
||||
.transpose()
|
||||
}
|
||||
|
||||
/// Returns a block postings given a `term_info`.
|
||||
@@ -181,7 +180,7 @@ impl InvertedIndexReader {
|
||||
term: &Term,
|
||||
option: IndexRecordOption,
|
||||
) -> io::Result<Option<SegmentPostings>> {
|
||||
self.get_term_info(term)
|
||||
self.get_term_info(term)?
|
||||
.map(move |term_info| self.read_postings_from_terminfo(&term_info, option))
|
||||
.transpose()
|
||||
}
|
||||
@@ -191,7 +190,7 @@ impl InvertedIndexReader {
|
||||
term: &Term,
|
||||
option: IndexRecordOption,
|
||||
) -> io::Result<Option<SegmentPostings>> {
|
||||
self.get_term_info(term)
|
||||
self.get_term_info(term)?
|
||||
.map(|term_info| self.read_postings_from_terminfo(&term_info, option))
|
||||
.transpose()
|
||||
}
|
||||
@@ -199,7 +198,7 @@ impl InvertedIndexReader {
|
||||
/// Returns the number of documents containing the term.
|
||||
pub fn doc_freq(&self, term: &Term) -> io::Result<u32> {
|
||||
Ok(self
|
||||
.get_term_info(term)
|
||||
.get_term_info(term)?
|
||||
.map(|term_info| term_info.doc_freq)
|
||||
.unwrap_or(0u32))
|
||||
}
|
||||
|
||||
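Once `get_term_info` returns `io::Result<Option<TermInfo>>`, callers in this file follow a `?` then `map` then `transpose` pattern. The sketch below isolates that pattern with hypothetical stand-ins: `get_term_info` and `read_postings` here are toy functions, not tantivy's.

```rust
use std::io;

/// Toy lookup that can fail on I/O or simply not find the term.
fn get_term_info(term: &str) -> io::Result<Option<u32>> {
    match term {
        "broken" => Err(io::Error::new(io::ErrorKind::Other, "read failed")),
        "diary" => Ok(Some(2)),
        _ => Ok(None),
    }
}

/// Toy postings reader: produces `doc_freq` consecutive doc ids.
fn read_postings(doc_freq: u32) -> io::Result<Vec<u32>> {
    Ok((0..doc_freq).collect())
}

pub fn postings(term: &str) -> io::Result<Option<Vec<u32>>> {
    get_term_info(term)? // an I/O error propagates immediately
        .map(read_postings) // Option<io::Result<Vec<u32>>>
        .transpose() // io::Result<Option<Vec<u32>>>
}

pub fn doc_freq(term: &str) -> io::Result<u32> {
    // A missing term is not an error: it has a document frequency of 0.
    Ok(get_term_info(term)?.unwrap_or(0))
}
```

The three outcomes stay distinct: an I/O failure is `Err`, an absent term is `Ok(None)`, and a found term is `Ok(Some(..))`.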
@@ -12,7 +12,7 @@ pub use self::executor::Executor;
|
||||
pub use self::index::Index;
|
||||
pub use self::index_meta::{IndexMeta, SegmentMeta, SegmentMetaInventory};
|
||||
pub use self::inverted_index_reader::InvertedIndexReader;
|
||||
pub use self::searcher::{FieldSearcher, Searcher};
|
||||
pub use self::searcher::Searcher;
|
||||
pub use self::segment::Segment;
|
||||
pub use self::segment::SerializableSegment;
|
||||
pub use self::segment_component::SegmentComponent;
|
||||
|
||||
@@ -1,17 +1,16 @@
|
||||
use crate::collector::Collector;
|
||||
use crate::core::Executor;
|
||||
use crate::core::InvertedIndexReader;
|
||||
|
||||
use crate::core::SegmentReader;
|
||||
use crate::query::Query;
|
||||
use crate::schema::Document;
|
||||
use crate::schema::Schema;
|
||||
use crate::schema::{Field, Term};
|
||||
use crate::schema::Term;
|
||||
use crate::space_usage::SearcherSpaceUsage;
|
||||
use crate::store::StoreReader;
|
||||
use crate::termdict::TermMerger;
|
||||
use crate::DocAddress;
|
||||
use crate::Index;
|
||||
use std::sync::Arc;
|
||||
|
||||
use std::{fmt, io};
|
||||
|
||||
/// Holds a list of `SegmentReader`s ready for search.
|
||||
@@ -148,16 +147,6 @@ impl Searcher {
|
||||
collector.merge_fruits(fruits)
|
||||
}
|
||||
|
||||
/// Return the field searcher associated to a `Field`.
|
||||
pub fn field(&self, field: Field) -> crate::Result<FieldSearcher> {
|
||||
let inv_index_readers: Vec<Arc<InvertedIndexReader>> = self
|
||||
.segment_readers
|
||||
.iter()
|
||||
.map(|segment_reader| segment_reader.inverted_index(field))
|
||||
.collect::<crate::Result<Vec<_>>>()?;
|
||||
Ok(FieldSearcher::new(inv_index_readers))
|
||||
}
|
||||
|
||||
/// Summarize total space usage of this searcher.
|
||||
pub fn space_usage(&self) -> io::Result<SearcherSpaceUsage> {
|
||||
let mut space_usage = SearcherSpaceUsage::new();
|
||||
@@ -168,32 +157,6 @@ impl Searcher {
|
||||
}
|
||||
}
|
||||
|
||||
/// **Experimental API** `FieldSearcher` only gives access to a stream over the terms of a field.
|
||||
pub struct FieldSearcher {
|
||||
inv_index_readers: Vec<Arc<InvertedIndexReader>>,
|
||||
}
|
||||
|
||||
impl FieldSearcher {
|
||||
fn new(inv_index_readers: Vec<Arc<InvertedIndexReader>>) -> FieldSearcher {
|
||||
FieldSearcher { inv_index_readers }
|
||||
}
|
||||
|
||||
/// Returns a Stream over all of the sorted unique terms of
|
||||
/// for the given field.
|
||||
///
|
||||
/// This method does not take into account which documents are deleted, so
|
||||
/// in presence of deletes some terms may not actually exist in any document
|
||||
/// anymore.
|
||||
pub fn terms(&self) -> TermMerger {
|
||||
let term_streamers: Vec<_> = self
|
||||
.inv_index_readers
|
||||
.iter()
|
||||
.map(|inverted_index| inverted_index.terms().stream())
|
||||
.collect();
|
||||
TermMerger::new(term_streamers)
|
||||
}
|
||||
}
|
||||
|
||||
impl fmt::Debug for Searcher {
|
||||
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
|
||||
let segment_ids = self
|
||||
|
||||
@@ -114,12 +114,7 @@ impl SegmentReader {
|
||||
field_entry.name()
|
||||
)));
|
||||
}
|
||||
let term_ords_reader = self.fast_fields().u64s(field).ok_or_else(|| {
|
||||
DataCorruption::comment_only(format!(
|
||||
"Cannot find data for hierarchical facet {:?}",
|
||||
field_entry.name()
|
||||
))
|
||||
})?;
|
||||
let term_ords_reader = self.fast_fields().u64s(field)?;
|
||||
let termdict = self
|
||||
.termdict_composite
|
||||
.open_read(field)
|
||||
@@ -183,8 +178,10 @@ impl SegmentReader {
|
||||
|
||||
let fast_fields_data = segment.open_read(SegmentComponent::FASTFIELDS)?;
|
||||
let fast_fields_composite = CompositeFile::open(&fast_fields_data)?;
|
||||
let fast_field_readers =
|
||||
Arc::new(FastFieldReaders::load_all(&schema, &fast_fields_composite)?);
|
||||
let fast_field_readers = Arc::new(FastFieldReaders::new(
|
||||
schema.clone(),
|
||||
fast_fields_composite,
|
||||
)?);
|
||||
|
||||
let fieldnorm_data = segment.open_read(SegmentComponent::FIELDNORMS)?;
|
||||
let fieldnorm_readers = FieldNormReaders::open(fieldnorm_data)?;
|
||||
@@ -310,7 +307,7 @@ impl SegmentReader {
|
||||
}
|
||||
|
||||
/// Returns an iterator that will iterate over the alive document ids
|
||||
pub fn doc_ids_alive<'a>(&'a self) -> impl Iterator<Item = DocId> + 'a {
|
||||
pub fn doc_ids_alive(&self) -> impl Iterator<Item = DocId> + '_ {
|
||||
(0u32..self.max_doc).filter(move |doc| !self.is_deleted(*doc))
|
||||
}
|
||||
|
||||
|
||||
@@ -1,8 +1,8 @@
|
||||
use crate::directory::directory_lock::Lock;
|
||||
use crate::directory::error::LockError;
|
||||
use crate::directory::error::{DeleteError, OpenReadError, OpenWriteError};
|
||||
use crate::directory::WatchCallback;
|
||||
use crate::directory::WatchHandle;
|
||||
use crate::directory::{FileHandle, WatchCallback};
|
||||
use crate::directory::{FileSlice, WritePtr};
|
||||
use std::fmt;
|
||||
use std::io;
|
||||
@@ -108,10 +108,13 @@ fn retry_policy(is_blocking: bool) -> RetryPolicy {
|
||||
/// should be your default choice.
|
||||
/// - The [`RAMDirectory`](struct.RAMDirectory.html), which
|
||||
/// should be used mostly for tests.
|
||||
///
|
||||
pub trait Directory: DirectoryClone + fmt::Debug + Send + Sync + 'static {
|
||||
/// Opens a virtual file for read.
|
||||
/// Opens a file and returns a boxed `FileHandle`.
|
||||
///
|
||||
/// Users of `Directory` should typically call `Directory::open_read(...)`,
|
||||
/// while `Directory` implementor should implement `get_file_handle()`.
|
||||
fn get_file_handle(&self, path: &Path) -> Result<Box<dyn FileHandle>, OpenReadError>;
|
||||
|
||||
/// Once a virtual file is open, its data may not
|
||||
/// change.
|
||||
///
|
||||
@@ -119,7 +122,10 @@ pub trait Directory: DirectoryClone + fmt::Debug + Send + Sync + 'static {
|
||||
/// have no effect on the returned `FileSlice` object.
|
||||
///
|
||||
/// You should only use this to read files created with [Directory::open_write].
|
||||
fn open_read(&self, path: &Path) -> Result<FileSlice, OpenReadError>;
|
||||
fn open_read(&self, path: &Path) -> Result<FileSlice, OpenReadError> {
|
||||
let file_handle = self.get_file_handle(path)?;
|
||||
Ok(FileSlice::new(file_handle))
|
||||
}
|
||||
|
||||
/// Removes a file
|
||||
///
|
||||
|
||||
@@ -58,7 +58,8 @@ pub enum OpenWriteError {
|
||||
}
|
||||
|
||||
impl OpenWriteError {
|
||||
pub(crate) fn wrap_io_error(io_error: io::Error, filepath: PathBuf) -> Self {
|
||||
/// Wraps an io error.
|
||||
pub fn wrap_io_error(io_error: io::Error, filepath: PathBuf) -> Self {
|
||||
Self::IOError { io_error, filepath }
|
||||
}
|
||||
}
|
||||
@@ -143,7 +144,8 @@ pub enum OpenReadError {
|
||||
}
|
||||
|
||||
impl OpenReadError {
|
||||
pub(crate) fn wrap_io_error(io_error: io::Error, filepath: PathBuf) -> Self {
|
||||
/// Wraps an io error.
|
||||
pub fn wrap_io_error(io_error: io::Error, filepath: PathBuf) -> Self {
|
||||
Self::IOError { io_error, filepath }
|
||||
}
|
||||
}
|
||||
|
||||
@@ -2,10 +2,11 @@ use stable_deref_trait::StableDeref;
|
||||
|
||||
use crate::common::HasLen;
|
||||
use crate::directory::OwnedBytes;
|
||||
use std::sync::Arc;
|
||||
use std::sync::{Arc, Weak};
|
||||
use std::{io, ops::Deref};
|
||||
|
||||
pub type BoxedData = Box<dyn Deref<Target = [u8]> + Send + Sync + 'static>;
|
||||
pub type ArcBytes = Arc<dyn Deref<Target = [u8]> + Send + Sync + 'static>;
|
||||
pub type WeakArcBytes = Weak<dyn Deref<Target = [u8]> + Send + Sync + 'static>;
|
||||
|
||||
/// Objects that represents files sections in tantivy.
|
||||
///
|
||||
@@ -40,7 +41,7 @@ where
|
||||
B: StableDeref + Deref<Target = [u8]> + 'static + Send + Sync,
|
||||
{
|
||||
fn from(bytes: B) -> FileSlice {
|
||||
FileSlice::new(OwnedBytes::new(bytes))
|
||||
FileSlice::new(Box::new(OwnedBytes::new(bytes)))
|
||||
}
|
||||
}
|
||||
|
||||
@@ -50,22 +51,25 @@ where
|
||||
///
|
||||
#[derive(Clone)]
|
||||
pub struct FileSlice {
|
||||
data: Arc<Box<dyn FileHandle>>,
|
||||
data: Arc<dyn FileHandle>,
|
||||
start: usize,
|
||||
stop: usize,
|
||||
}
|
||||
|
||||
impl FileSlice {
|
||||
/// Wraps a FileHandle.
|
||||
pub fn new<D>(data: D) -> Self
|
||||
where
|
||||
D: FileHandle,
|
||||
{
|
||||
let len = data.len();
|
||||
pub fn new(file_handle: Box<dyn FileHandle>) -> Self {
|
||||
let num_bytes = file_handle.len();
|
||||
FileSlice::new_with_num_bytes(file_handle, num_bytes)
|
||||
}
|
||||
|
||||
/// Wraps a FileHandle.
|
||||
#[doc(hidden)]
|
||||
pub fn new_with_num_bytes(file_handle: Box<dyn FileHandle>, num_bytes: usize) -> Self {
|
||||
FileSlice {
|
||||
data: Arc::new(Box::new(data)),
|
||||
data: Arc::from(file_handle),
|
||||
start: 0,
|
||||
stop: len,
|
||||
stop: num_bytes,
|
||||
}
|
||||
}
|
||||
|
||||
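Two related changes appear in these hunks: `Directory` implementors now provide `get_file_handle` while `open_read` becomes a default method, and `FileSlice` stores an `Arc<dyn FileHandle>` built from a `Box<dyn FileHandle>` with `Arc::from`. The sketch below shows how the pieces fit together under simplifying assumptions: paths are `&str`, reads return `Vec<u8>` rather than `OwnedBytes`, errors are plain `io::Error`, and `OneFileDirectory` is a hypothetical implementor.

```rust
use std::io;
use std::sync::Arc;

/// Simplified stand-in for tantivy's `FileHandle`.
pub trait FileHandle: Send + Sync {
    fn len(&self) -> usize;
    fn read_bytes(&self, from: usize, to: usize) -> io::Result<Vec<u8>>;
}

impl FileHandle for Vec<u8> {
    fn len(&self) -> usize {
        self.as_slice().len()
    }
    fn read_bytes(&self, from: usize, to: usize) -> io::Result<Vec<u8>> {
        Ok(self[from..to].to_vec())
    }
}

/// A cheaply clonable view over a section of a file.
#[derive(Clone)]
pub struct FileSlice {
    data: Arc<dyn FileHandle>,
    start: usize,
    stop: usize,
}

impl FileSlice {
    pub fn new(file_handle: Box<dyn FileHandle>) -> Self {
        let num_bytes = file_handle.len();
        FileSlice {
            // `Arc::from(Box<T>)` avoids the extra `Arc<Box<..>>` indirection.
            data: Arc::from(file_handle),
            start: 0,
            stop: num_bytes,
        }
    }

    pub fn len(&self) -> usize {
        self.stop - self.start
    }

    pub fn slice(&self, from: usize, to: usize) -> FileSlice {
        assert!(from <= to && to <= self.len());
        FileSlice {
            data: self.data.clone(),
            start: self.start + from,
            stop: self.start + to,
        }
    }

    pub fn read_bytes(&self) -> io::Result<Vec<u8>> {
        self.data.read_bytes(self.start, self.stop)
    }
}

pub trait Directory {
    /// Implementors provide this method...
    fn get_file_handle(&self, path: &str) -> io::Result<Box<dyn FileHandle>>;

    /// ...and users call this one, which has a default implementation.
    fn open_read(&self, path: &str) -> io::Result<FileSlice> {
        let file_handle = self.get_file_handle(path)?;
        Ok(FileSlice::new(file_handle))
    }
}

/// Hypothetical in-memory directory holding a single file.
pub struct OneFileDirectory {
    pub path: String,
    pub content: Vec<u8>,
}

impl Directory for OneFileDirectory {
    fn get_file_handle(&self, path: &str) -> io::Result<Box<dyn FileHandle>> {
        if path == self.path {
            Ok(Box::new(self.content.clone()))
        } else {
            Err(io::Error::new(io::ErrorKind::NotFound, "no such file"))
        }
    }
}
```

A wrapper directory can still override `open_read`, which is what `ManagedDirectory` does later in this diff to strip the footer.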
@@ -146,6 +150,12 @@ impl FileSlice {
|
||||
}
|
||||
}
|
||||
|
||||
impl FileHandle for FileSlice {
|
||||
fn read_bytes(&self, from: usize, to: usize) -> io::Result<OwnedBytes> {
|
||||
self.read_bytes_slice(from, to)
|
||||
}
|
||||
}
|
||||
|
||||
impl HasLen for FileSlice {
|
||||
fn len(&self) -> usize {
|
||||
self.stop - self.start
|
||||
@@ -160,7 +170,7 @@ mod tests {
|
||||
|
||||
#[test]
|
||||
fn test_file_slice() -> io::Result<()> {
|
||||
let file_slice = FileSlice::new(b"abcdef".as_ref());
|
||||
let file_slice = FileSlice::new(Box::new(b"abcdef".as_ref()));
|
||||
assert_eq!(file_slice.len(), 6);
|
||||
assert_eq!(file_slice.slice_from(2).read_bytes()?.as_slice(), b"cdef");
|
||||
assert_eq!(file_slice.slice_to(2).read_bytes()?.as_slice(), b"ab");
|
||||
@@ -204,7 +214,7 @@ mod tests {
|
||||
|
||||
#[test]
|
||||
fn test_slice_simple_read() -> io::Result<()> {
|
||||
let slice = FileSlice::new(&b"abcdef"[..]);
|
||||
let slice = FileSlice::new(Box::new(&b"abcdef"[..]));
|
||||
assert_eq!(slice.len(), 6);
|
||||
assert_eq!(slice.read_bytes()?.as_ref(), b"abcdef");
|
||||
assert_eq!(slice.slice(1, 4).read_bytes()?.as_ref(), b"bcd");
|
||||
@@ -213,7 +223,7 @@ mod tests {
|
||||
|
||||
#[test]
|
||||
fn test_slice_read_slice() -> io::Result<()> {
|
||||
let slice_deref = FileSlice::new(&b"abcdef"[..]);
|
||||
let slice_deref = FileSlice::new(Box::new(&b"abcdef"[..]));
|
||||
assert_eq!(slice_deref.read_bytes_slice(1, 4)?.as_ref(), b"bcd");
|
||||
Ok(())
|
||||
}
|
||||
@@ -221,14 +231,14 @@ mod tests {
|
||||
#[test]
|
||||
#[should_panic(expected = "assertion failed: from <= to")]
|
||||
fn test_slice_read_slice_invalid_range() {
|
||||
let slice_deref = FileSlice::new(&b"abcdef"[..]);
|
||||
let slice_deref = FileSlice::new(Box::new(&b"abcdef"[..]));
|
||||
assert_eq!(slice_deref.read_bytes_slice(1, 0).unwrap().as_ref(), b"bcd");
|
||||
}
|
||||
|
||||
#[test]
|
||||
#[should_panic(expected = "`to` exceeds the fileslice length")]
|
||||
fn test_slice_read_slice_invalid_range_exceeds() {
|
||||
let slice_deref = FileSlice::new(&b"abcdef"[..]);
|
||||
let slice_deref = FileSlice::new(Box::new(&b"abcdef"[..]));
|
||||
assert_eq!(
|
||||
slice_deref.read_bytes_slice(0, 10).unwrap().as_ref(),
|
||||
b"bcd"
|
||||
|
||||
@@ -3,7 +3,7 @@ use crc32fast::Hasher;
|
||||
use std::fs;
|
||||
use std::io;
|
||||
use std::io::BufRead;
|
||||
use std::path::PathBuf;
|
||||
use std::path::Path;
|
||||
use std::sync::atomic::{AtomicUsize, Ordering};
|
||||
use std::sync::Arc;
|
||||
use std::thread;
|
||||
@@ -13,15 +13,15 @@ pub const POLLING_INTERVAL: Duration = Duration::from_millis(if cfg!(test) { 1 }
|
||||
|
||||
// Watches a file and executes registered callbacks when the file is modified.
|
||||
pub struct FileWatcher {
|
||||
path: Arc<PathBuf>,
|
||||
path: Arc<Path>,
|
||||
callbacks: Arc<WatchCallbackList>,
|
||||
state: Arc<AtomicUsize>, // 0: new, 1: runnable, 2: terminated
|
||||
}
|
||||
|
||||
impl FileWatcher {
|
||||
pub fn new(path: &PathBuf) -> FileWatcher {
|
||||
pub fn new(path: &Path) -> FileWatcher {
|
||||
FileWatcher {
|
||||
path: Arc::new(path.clone()),
|
||||
path: Arc::from(path),
|
||||
callbacks: Default::default(),
|
||||
state: Default::default(),
|
||||
}
|
||||
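Switching the field from `Arc<PathBuf>` to `Arc<Path>` and the parameter from `&PathBuf` to `&Path` removes one level of indirection and lets callers pass any borrowed path. A small sketch of the conversion follows, using a hypothetical `Watched` struct in place of `FileWatcher`.

```rust
use std::path::Path;
use std::sync::Arc;

/// Hypothetical stand-in for the path-holding part of `FileWatcher`.
#[derive(Clone)]
pub struct Watched {
    path: Arc<Path>,
}

impl Watched {
    /// Accepts any `&Path`; a `&PathBuf` coerces automatically.
    pub fn new(path: &Path) -> Self {
        Watched {
            // Copies the path bytes once into a shared allocation.
            path: Arc::from(path),
        }
    }

    pub fn path(&self) -> &Path {
        &self.path
    }
}
```

Cloning `Watched` afterwards only bumps a reference count, which suits a watcher that hands the path to a background thread.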
@@ -63,7 +63,7 @@ impl FileWatcher {
|
||||
handle
|
||||
}
|
||||
|
||||
fn compute_checksum(path: &PathBuf) -> Result<u32, io::Error> {
|
||||
fn compute_checksum(path: &Path) -> Result<u32, io::Error> {
|
||||
let reader = match fs::File::open(path) {
|
||||
Ok(f) => io::BufReader::new(f),
|
||||
Err(e) => {
|
||||
|
||||
@@ -115,6 +115,18 @@ impl Footer {
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
VersionedFooter::V3 {
|
||||
crc32: _crc,
|
||||
store_compression,
|
||||
} => {
|
||||
if &library_version.store_compression != store_compression {
|
||||
return Err(Incompatibility::CompressionMismatch {
|
||||
library_compression_format: library_version.store_compression.to_string(),
|
||||
index_compression_format: store_compression.to_string(),
|
||||
});
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
VersionedFooter::UnknownVersion => Err(Incompatibility::IndexMismatch {
|
||||
library_version: library_version.clone(),
|
||||
index_version: self.version.clone(),
|
||||
@@ -136,24 +148,31 @@ pub enum VersionedFooter {
|
||||
crc32: CrcHashU32,
|
||||
store_compression: String,
|
||||
},
|
||||
// Block wand max termfreq on 1 byte
|
||||
V3 {
|
||||
crc32: CrcHashU32,
|
||||
store_compression: String,
|
||||
},
|
||||
}
|
||||
|
||||
impl BinarySerializable for VersionedFooter {
|
||||
fn serialize<W: io::Write>(&self, writer: &mut W) -> io::Result<()> {
|
||||
let mut buf = Vec::new();
|
||||
match self {
|
||||
VersionedFooter::V2 {
|
||||
VersionedFooter::V3 {
|
||||
crc32,
|
||||
store_compression: compression,
|
||||
} => {
|
||||
// Serializes a valid `VersionedFooter` or panics if the version is unknown
|
||||
// [ version | crc_hash | compression_mode ]
|
||||
// [ 0..4 | 4..8 | variable ]
|
||||
BinarySerializable::serialize(&2u32, &mut buf)?;
|
||||
BinarySerializable::serialize(&3u32, &mut buf)?;
|
||||
BinarySerializable::serialize(crc32, &mut buf)?;
|
||||
BinarySerializable::serialize(compression, &mut buf)?;
|
||||
}
|
||||
VersionedFooter::V1 { .. } | VersionedFooter::UnknownVersion => {
|
||||
VersionedFooter::V2 { .. }
|
||||
| VersionedFooter::V1 { .. }
|
||||
| VersionedFooter::UnknownVersion => {
|
||||
return Err(io::Error::new(
|
||||
io::ErrorKind::InvalidInput,
|
||||
"Cannot serialize an unknown versioned footer ",
|
||||
@@ -182,7 +201,7 @@ impl BinarySerializable for VersionedFooter {
|
||||
reader.read_exact(&mut buf[..])?;
|
||||
let mut cursor = &buf[..];
|
||||
let version = u32::deserialize(&mut cursor)?;
|
||||
if version != 1 && version != 2 {
|
||||
if version > 3 {
|
||||
return Ok(VersionedFooter::UnknownVersion);
|
||||
}
|
||||
let crc32 = u32::deserialize(&mut cursor)?;
|
||||
@@ -192,12 +211,17 @@ impl BinarySerializable for VersionedFooter {
|
||||
crc32,
|
||||
store_compression,
|
||||
}
|
||||
} else {
|
||||
assert_eq!(version, 2);
|
||||
} else if version == 2 {
|
||||
VersionedFooter::V2 {
|
||||
crc32,
|
||||
store_compression,
|
||||
}
|
||||
} else {
|
||||
assert_eq!(version, 3);
|
||||
VersionedFooter::V3 {
|
||||
crc32,
|
||||
store_compression,
|
||||
}
|
||||
})
|
||||
}
|
||||
}
|
||||
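The footer hunks above add a `V3` variant, keep older versions readable, and treat any version above 3 as unknown. The sketch below mirrors that version dispatch over the `[ version | crc_hash | compression_mode ]` layout. It is a simplification: the string is assumed to carry a plain one-byte length prefix (tantivy's own string encoding may differ), version 0 is mapped to `UnknownVersion` by choice, and the free functions `serialize_v3` and `deserialize` are hypothetical.

```rust
use std::convert::TryInto;

#[derive(Debug, PartialEq)]
pub enum VersionedFooter {
    UnknownVersion,
    V1 { crc32: u32, store_compression: String },
    V2 { crc32: u32, store_compression: String },
    V3 { crc32: u32, store_compression: String },
}

/// Writes the newest format only, as the serializer in the diff does.
pub fn serialize_v3(crc32: u32, compression: &str) -> Vec<u8> {
    let mut buf = Vec::new();
    buf.extend_from_slice(&3u32.to_le_bytes());
    buf.extend_from_slice(&crc32.to_le_bytes());
    // Assumption: a single length byte, enough for short names like "lz4".
    buf.push(compression.len() as u8);
    buf.extend_from_slice(compression.as_bytes());
    buf
}

fn read_u32(buf: &[u8], at: usize) -> Option<u32> {
    let bytes: [u8; 4] = buf.get(at..at + 4)?.try_into().ok()?;
    Some(u32::from_le_bytes(bytes))
}

/// Returns `None` if the buffer is truncated or not valid UTF-8.
pub fn deserialize(buf: &[u8]) -> Option<VersionedFooter> {
    let version = read_u32(buf, 0)?;
    if version == 0 || version > 3 {
        // Written by a format this reader does not know.
        return Some(VersionedFooter::UnknownVersion);
    }
    let crc32 = read_u32(buf, 4)?;
    let len = *buf.get(8)? as usize;
    let store_compression = String::from_utf8(buf.get(9..9 + len)?.to_vec()).ok()?;
    Some(match version {
        1 => VersionedFooter::V1 { crc32, store_compression },
        2 => VersionedFooter::V2 { crc32, store_compression },
        _ => VersionedFooter::V3 { crc32, store_compression },
    })
}
```

Reading stays backward compatible while writing always emits the latest version, so the compatibility check can decide separately whether an older footer is acceptable.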
@@ -205,6 +229,7 @@ impl BinarySerializable for VersionedFooter {
|
||||
impl VersionedFooter {
|
||||
pub fn crc(&self) -> Option<CrcHashU32> {
|
||||
match self {
|
||||
VersionedFooter::V3 { crc32, .. } => Some(*crc32),
|
||||
VersionedFooter::V2 { crc32, .. } => Some(*crc32),
|
||||
VersionedFooter::V1 { crc32, .. } => Some(*crc32),
|
||||
VersionedFooter::UnknownVersion { .. } => None,
|
||||
@@ -243,7 +268,7 @@ impl<W: TerminatingWrite> Write for FooterProxy<W> {
|
||||
impl<W: TerminatingWrite> TerminatingWrite for FooterProxy<W> {
|
||||
fn terminate_ref(&mut self, _: AntiCallToken) -> io::Result<()> {
|
||||
let crc32 = self.hasher.take().unwrap().finalize();
|
||||
let footer = Footer::new(VersionedFooter::V2 {
|
||||
let footer = Footer::new(VersionedFooter::V3 {
|
||||
crc32,
|
||||
store_compression: crate::store::COMPRESSION.to_string(),
|
||||
});
|
||||
@@ -278,7 +303,7 @@ mod tests {
|
||||
let footer = Footer::deserialize(&mut &vec[..]).unwrap();
|
||||
assert!(matches!(
|
||||
footer.versioned_footer,
|
||||
VersionedFooter::V2 { store_compression, .. }
|
||||
VersionedFooter::V3 { store_compression, .. }
|
||||
if store_compression == crate::store::COMPRESSION
|
||||
));
|
||||
assert_eq!(&footer.version, crate::version());
|
||||
@@ -288,7 +313,7 @@ mod tests {
|
||||
fn test_serialize_deserialize_footer() {
|
||||
let mut buffer = Vec::new();
|
||||
let crc32 = 123456u32;
|
||||
let footer: Footer = Footer::new(VersionedFooter::V2 {
|
||||
let footer: Footer = Footer::new(VersionedFooter::V3 {
|
||||
crc32,
|
||||
store_compression: "lz4".to_string(),
|
||||
});
|
||||
@@ -300,7 +325,7 @@ mod tests {
|
||||
#[test]
|
||||
fn footer_length() {
|
||||
let crc32 = 1111111u32;
|
||||
let versioned_footer = VersionedFooter::V2 {
|
||||
let versioned_footer = VersionedFooter::V3 {
|
||||
crc32,
|
||||
store_compression: "lz4".to_string(),
|
||||
};
|
||||
@@ -321,7 +346,7 @@ mod tests {
|
||||
// versioned footer length
|
||||
12 | 128,
|
||||
// index format version
|
||||
2,
|
||||
3,
|
||||
0,
|
||||
0,
|
||||
0,
|
||||
@@ -340,7 +365,7 @@ mod tests {
|
||||
let versioned_footer = VersionedFooter::deserialize(&mut cursor).unwrap();
|
||||
assert!(cursor.is_empty());
|
||||
let expected_crc: u32 = LittleEndian::read_u32(&v_footer_bytes[5..9]) as CrcHashU32;
|
||||
let expected_versioned_footer: VersionedFooter = VersionedFooter::V2 {
|
||||
let expected_versioned_footer: VersionedFooter = VersionedFooter::V3 {
|
||||
crc32: expected_crc,
|
||||
store_compression: "lz4".to_string(),
|
||||
};
|
||||
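The `footer_length` test above expects a length byte of `12 | 128`, a 4-byte little-endian format version, and the CRC at bytes `5..9`. Here is a minimal standalone sketch of a layout consistent with those expectations. The helper names `write_vint` and `serialize_versioned_footer` are hypothetical. The assumption that the compression name is stored as a length-prefixed string is inferred from the length of 12 (4 + 4 + 1 + 3 for `"lz4"`), not confirmed by this diff.

```rust
/// Hypothetical helper: writes `val` 7 bits at a time, low bits first,
/// and sets the high bit (128) on the final byte as a stop marker.
fn write_vint(val: u64, out: &mut Vec<u8>) {
    let mut remaining = val;
    loop {
        let low = (remaining & 0x7f) as u8;
        remaining >>= 7;
        if remaining == 0 {
            out.push(low | 0x80);
            return;
        }
        out.push(low);
    }
}

/// Hypothetical sketch of a versioned footer: length prefix, format version,
/// CRC32, then the compression name as a length-prefixed string (assumed).
fn serialize_versioned_footer(version: u32, crc32: u32, store_compression: &str) -> Vec<u8> {
    let mut payload = Vec::new();
    payload.extend_from_slice(&version.to_le_bytes());
    payload.extend_from_slice(&crc32.to_le_bytes());
    write_vint(store_compression.len() as u64, &mut payload);
    payload.extend_from_slice(store_compression.as_bytes());

    let mut out = Vec::new();
    write_vint(payload.len() as u64, &mut out);
    out.extend_from_slice(&payload);
    out
}
```

With version `3`, the first five bytes come out as `12 | 128, 3, 0, 0, 0`, which is the part of the expected buffer that this diff changes from `2` to `3`.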
|
||||
@@ -1,10 +1,10 @@
|
||||
use crate::core::{MANAGED_FILEPATH, META_FILEPATH};
|
||||
use crate::directory::error::{DeleteError, LockError, OpenReadError, OpenWriteError};
|
||||
use crate::directory::footer::{Footer, FooterProxy};
|
||||
use crate::directory::DirectoryLock;
|
||||
use crate::directory::GarbageCollectionResult;
|
||||
use crate::directory::Lock;
|
||||
use crate::directory::META_LOCK;
|
||||
use crate::directory::{DirectoryLock, FileHandle};
|
||||
use crate::directory::{FileSlice, WritePtr};
|
||||
use crate::directory::{WatchCallback, WatchHandle};
|
||||
use crate::error::DataCorruption;
|
||||
@@ -274,6 +274,11 @@ impl ManagedDirectory {
|
||||
}
|
||||
|
||||
impl Directory for ManagedDirectory {
|
||||
fn get_file_handle(&self, path: &Path) -> Result<Box<dyn FileHandle>, OpenReadError> {
|
||||
let file_slice = self.open_read(path)?;
|
||||
Ok(Box::new(file_slice))
|
||||
}
|
||||
|
||||
fn open_read(&self, path: &Path) -> result::Result<FileSlice, OpenReadError> {
|
||||
let file_slice = self.directory.open_read(path)?;
|
||||
let (footer, reader) = Footer::extract_footer(file_slice)
|
||||
|
||||
@@ -2,14 +2,13 @@ use crate::core::META_FILEPATH;
|
||||
use crate::directory::error::LockError;
|
||||
use crate::directory::error::{DeleteError, OpenDirectoryError, OpenReadError, OpenWriteError};
|
||||
use crate::directory::file_watcher::FileWatcher;
|
||||
use crate::directory::AntiCallToken;
|
||||
use crate::directory::BoxedData;
|
||||
use crate::directory::Directory;
|
||||
use crate::directory::DirectoryLock;
|
||||
use crate::directory::FileSlice;
|
||||
use crate::directory::Lock;
|
||||
use crate::directory::WatchCallback;
|
||||
use crate::directory::WatchHandle;
|
||||
use crate::directory::{AntiCallToken, FileHandle, OwnedBytes};
|
||||
use crate::directory::{ArcBytes, WeakArcBytes};
|
||||
use crate::directory::{TerminatingWrite, WritePtr};
|
||||
use fs2::FileExt;
|
||||
use memmap::Mmap;
|
||||
@@ -25,7 +24,6 @@ use std::path::{Path, PathBuf};
|
||||
use std::result;
|
||||
use std::sync::Arc;
|
||||
use std::sync::RwLock;
|
||||
use std::sync::Weak;
|
||||
use std::{collections::HashMap, ops::Deref};
|
||||
use tempfile::TempDir;
|
||||
|
||||
@@ -78,7 +76,7 @@ pub struct CacheInfo {
|
||||
|
||||
struct MmapCache {
|
||||
counters: CacheCounters,
|
||||
cache: HashMap<PathBuf, Weak<BoxedData>>,
|
||||
cache: HashMap<PathBuf, WeakArcBytes>,
|
||||
}
|
||||
|
||||
impl Default for MmapCache {
|
||||
@@ -112,7 +110,7 @@ impl MmapCache {
|
||||
}
|
||||
|
||||
// Returns None if the file exists but has a len of 0 (and hence is not mmappable).
|
||||
fn get_mmap(&mut self, full_path: &Path) -> Result<Option<Arc<BoxedData>>, OpenReadError> {
|
||||
fn get_mmap(&mut self, full_path: &Path) -> Result<Option<ArcBytes>, OpenReadError> {
|
||||
if let Some(mmap_weak) = self.cache.get(full_path) {
|
||||
if let Some(mmap_arc) = mmap_weak.upgrade() {
|
||||
self.counters.hit += 1;
|
||||
@@ -123,7 +121,7 @@ impl MmapCache {
|
||||
self.counters.miss += 1;
|
||||
let mmap_opt = open_mmap(full_path)?;
|
||||
Ok(mmap_opt.map(|mmap| {
|
||||
let mmap_arc: Arc<BoxedData> = Arc::new(Box::new(mmap));
|
||||
let mmap_arc: ArcBytes = Arc::new(mmap);
|
||||
let mmap_weak = Arc::downgrade(&mmap_arc);
|
||||
self.cache.insert(full_path.to_owned(), mmap_weak);
|
||||
mmap_arc
|
||||
@@ -161,7 +159,7 @@ impl MmapDirectoryInner {
|
||||
mmap_cache: Default::default(),
|
||||
_temp_directory: temp_directory,
|
||||
watcher: FileWatcher::new(&root_path.join(*META_FILEPATH)),
|
||||
root_path: root_path,
|
||||
root_path,
|
||||
}
|
||||
}
|
||||
|
||||
@@ -316,7 +314,7 @@ impl TerminatingWrite for SafeFileWriter {
|
||||
}
|
||||
|
||||
#[derive(Clone)]
|
||||
struct MmapArc(Arc<Box<dyn Deref<Target = [u8]> + Send + Sync>>);
|
||||
struct MmapArc(Arc<dyn Deref<Target = [u8]> + Send + Sync>);
|
||||
|
||||
impl Deref for MmapArc {
|
||||
type Target = [u8];
|
||||
@@ -346,7 +344,7 @@ pub(crate) fn atomic_write(path: &Path, content: &[u8]) -> io::Result<()> {
|
||||
}
|
||||
|
||||
impl Directory for MmapDirectory {
|
||||
fn open_read(&self, path: &Path) -> result::Result<FileSlice, OpenReadError> {
|
||||
fn get_file_handle(&self, path: &Path) -> result::Result<Box<dyn FileHandle>, OpenReadError> {
|
||||
debug!("Open Read {:?}", path);
|
||||
let full_path = self.resolve_path(path);
|
||||
|
||||
@@ -359,11 +357,16 @@ impl Directory for MmapDirectory {
|
||||
let io_err = make_io_err(msg);
|
||||
OpenReadError::wrap_io_error(io_err, path.to_path_buf())
|
||||
})?;
|
||||
if let Some(mmap_arc) = mmap_cache.get_mmap(&full_path)? {
|
||||
Ok(FileSlice::from(MmapArc(mmap_arc)))
|
||||
} else {
|
||||
Ok(FileSlice::empty())
|
||||
}
|
||||
|
||||
let owned_bytes = mmap_cache
|
||||
.get_mmap(&full_path)?
|
||||
.map(|mmap_arc| {
|
||||
let mmap_arc_obj = MmapArc(mmap_arc);
|
||||
OwnedBytes::new(mmap_arc_obj)
|
||||
})
|
||||
.unwrap_or_else(OwnedBytes::empty);
|
||||
|
||||
Ok(Box::new(owned_bytes))
|
||||
}
|
||||
|
||||
/// Any entry associated to the path in the mmap will be
|
||||
@@ -446,7 +449,8 @@ impl Directory for MmapDirectory {
|
||||
fn atomic_write(&self, path: &Path, content: &[u8]) -> io::Result<()> {
|
||||
debug!("Atomic Write {:?}", path);
|
||||
let full_path = self.resolve_path(path);
|
||||
atomic_write(&full_path, content)
|
||||
atomic_write(&full_path, content)?;
|
||||
self.sync_directory()
|
||||
}
|
||||
|
||||
fn acquire_lock(&self, lock: &Lock) -> Result<DirectoryLock, LockError> {
|
||||
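The `MmapCache` hunks above keep only `Weak` handles in the map, so a cached mapping is released once no reader holds it. A minimal sketch of that pattern with plain `Vec<u8>` buffers in place of memory maps; `WeakCache` and the `load` closure are hypothetical names, not part of tantivy.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Weak};

/// Hypothetical cache that never keeps its entries alive by itself.
#[derive(Default)]
struct WeakCache {
    hit: usize,
    miss: usize,
    cache: HashMap<String, Weak<Vec<u8>>>,
}

impl WeakCache {
    /// Returns the cached buffer if some caller still holds it,
    /// otherwise loads it again and stores a weak handle.
    fn get(&mut self, key: &str, load: impl FnOnce() -> Vec<u8>) -> Arc<Vec<u8>> {
        if let Some(weak) = self.cache.get(key) {
            if let Some(strong) = weak.upgrade() {
                self.hit += 1;
                return strong;
            }
        }
        self.miss += 1;
        let strong = Arc::new(load());
        self.cache.insert(key.to_owned(), Arc::downgrade(&strong));
        strong
    }
}
```

The trade-off is that an entry can be reloaded many times if readers come and go, in exchange for the cache never pinning memory on its own.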
|
||||
@@ -23,7 +23,7 @@ pub mod error;
|
||||
pub use self::directory::DirectoryLock;
|
||||
pub use self::directory::{Directory, DirectoryClone};
|
||||
pub use self::directory_lock::{Lock, INDEX_WRITER_LOCK, META_LOCK};
|
||||
pub(crate) use self::file_slice::BoxedData;
|
||||
pub(crate) use self::file_slice::{ArcBytes, WeakArcBytes};
|
||||
pub use self::file_slice::{FileHandle, FileSlice};
|
||||
pub use self::owned_bytes::OwnedBytes;
|
||||
pub use self::ram_directory::RAMDirectory;
|
||||
|
||||
@@ -1,5 +1,6 @@
|
||||
use crate::directory::FileHandle;
|
||||
use stable_deref_trait::StableDeref;
|
||||
use std::convert::TryInto;
|
||||
use std::mem;
|
||||
use std::ops::Deref;
|
||||
use std::sync::Arc;
|
||||
@@ -95,6 +96,24 @@ impl OwnedBytes {
|
||||
pub fn advance(&mut self, advance_len: usize) {
|
||||
self.data = &self.data[advance_len..]
|
||||
}
|
||||
|
||||
/// Reads a `u8` from the `OwnedBytes` and advances by one byte.
|
||||
pub fn read_u8(&mut self) -> u8 {
|
||||
assert!(!self.is_empty());
|
||||
|
||||
let byte = self.as_slice()[0];
|
||||
self.advance(1);
|
||||
byte
|
||||
}
|
||||
|
||||
/// Reads a `u64` encoded as little-endian from the `OwnedBytes` and advances by 8 bytes.
|
||||
pub fn read_u64(&mut self) -> u64 {
|
||||
assert!(self.len() > 7);
|
||||
|
||||
let octlet: [u8; 8] = self.as_slice()[..8].try_into().unwrap();
|
||||
self.advance(8);
|
||||
u64::from_le_bytes(octlet)
|
||||
}
|
||||
}
|
||||
|
||||
impl fmt::Debug for OwnedBytes {
|
||||
@@ -230,6 +249,22 @@ mod tests {
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_owned_bytes_read_u8() -> io::Result<()> {
|
||||
let mut bytes = OwnedBytes::new(b"\xFF".as_ref());
|
||||
assert_eq!(bytes.read_u8(), 255);
|
||||
assert_eq!(bytes.len(), 0);
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_owned_bytes_read_u64() -> io::Result<()> {
|
||||
let mut bytes = OwnedBytes::new(b"\0\xFF\xFF\xFF\xFF\xFF\xFF\xFF".as_ref());
|
||||
assert_eq!(bytes.read_u64(), u64::MAX - 255);
|
||||
assert_eq!(bytes.len(), 0);
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_owned_bytes_split() {
|
||||
let bytes = OwnedBytes::new(b"abcdefghi".as_ref());
|
||||
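The `read_u8` and `read_u64` methods added above consume bytes from the front of the buffer. The same idea can be sketched over a plain borrowed slice; `ByteCursor` is a hypothetical stand-in for `OwnedBytes` so that the example needs no tantivy types.

```rust
use std::convert::TryInto;

/// Hypothetical stand-in for `OwnedBytes`: a view that shrinks as it is read.
struct ByteCursor<'a> {
    data: &'a [u8],
}

impl<'a> ByteCursor<'a> {
    fn new(data: &'a [u8]) -> Self {
        ByteCursor { data }
    }

    fn len(&self) -> usize {
        self.data.len()
    }

    /// Drops the first `n` bytes from the view.
    fn advance(&mut self, n: usize) {
        self.data = &self.data[n..];
    }

    /// Reads one byte and advances by one byte.
    fn read_u8(&mut self) -> u8 {
        assert!(!self.data.is_empty());
        let byte = self.data[0];
        self.advance(1);
        byte
    }

    /// Reads a little-endian `u64` and advances by eight bytes.
    fn read_u64(&mut self) -> u64 {
        assert!(self.data.len() >= 8);
        let octlet: [u8; 8] = self.data[..8].try_into().unwrap();
        self.advance(8);
        u64::from_le_bytes(octlet)
    }
}
```

As in the diff, both readers panic on a short buffer instead of returning an error, so callers are expected to know the layout in advance.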
|
||||
@@ -12,6 +12,8 @@ use std::path::{Path, PathBuf};
|
||||
use std::result;
|
||||
use std::sync::{Arc, RwLock};
|
||||
|
||||
use super::FileHandle;
|
||||
|
||||
/// Writer associated with the `RAMDirectory`
|
||||
///
|
||||
/// The Writer just writes a buffer.
|
||||
@@ -163,6 +165,11 @@ impl RAMDirectory {
|
||||
}
|
||||
|
||||
impl Directory for RAMDirectory {
|
||||
fn get_file_handle(&self, path: &Path) -> Result<Box<dyn FileHandle>, OpenReadError> {
|
||||
let file_slice = self.open_read(path)?;
|
||||
Ok(Box::new(file_slice))
|
||||
}
|
||||
|
||||
fn open_read(&self, path: &Path) -> result::Result<FileSlice, OpenReadError> {
|
||||
self.fs.read().unwrap().open_read(path)
|
||||
}
|
||||
@@ -219,13 +226,9 @@ impl Directory for RAMDirectory {
|
||||
)));
|
||||
let path_buf = PathBuf::from(path);
|
||||
|
||||
// Reserve the path to prevent calls to .write() from succeeding.
|
||||
self.fs.write().unwrap().write(path_buf.clone(), &[]);
|
||||
self.fs.write().unwrap().write(path_buf, data);
|
||||
|
||||
let mut vec_writer = VecWriter::new(path_buf, self.clone());
|
||||
vec_writer.write_all(data)?;
|
||||
vec_writer.flush()?;
|
||||
if path == Path::new(&*META_FILEPATH) {
|
||||
if path == *META_FILEPATH {
|
||||
let _ = self.fs.write().unwrap().watch_router.broadcast();
|
||||
}
|
||||
Ok(())
|
||||
|
||||
@@ -6,12 +6,12 @@ use std::sync::Weak;
|
||||
|
||||
/// Cloneable wrapper for callbacks registered when watching files of a `Directory`.
|
||||
#[derive(Clone)]
|
||||
pub struct WatchCallback(Arc<Box<dyn Fn() + Sync + Send>>);
|
||||
pub struct WatchCallback(Arc<dyn Fn() + Sync + Send>);
|
||||
|
||||
impl WatchCallback {
|
||||
/// Wraps a `Fn()` to create a WatchCallback.
|
||||
pub fn new<F: Fn() + Sync + Send + 'static>(op: F) -> Self {
|
||||
WatchCallback(Arc::new(Box::new(op)))
|
||||
WatchCallback(Arc::new(op))
|
||||
}
|
||||
|
||||
fn call(&self) {
|
||||
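The `WatchCallback` change above replaces `Arc<Box<dyn Fn()>>` with `Arc<dyn Fn()>`, which removes one heap allocation and one pointer hop per call. A minimal sketch of the single-indirection form; `Callback` and `demo_call_count` are hypothetical names used only for illustration.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical cloneable callback: the `Arc` points directly at the closure.
#[derive(Clone)]
struct Callback(Arc<dyn Fn() + Send + Sync>);

impl Callback {
    fn new<F: Fn() + Send + Sync + 'static>(op: F) -> Self {
        // `Arc<F>` coerces to `Arc<dyn Fn() + Send + Sync>` here.
        Callback(Arc::new(op))
    }

    fn call(&self) {
        (self.0)()
    }
}

/// Calls the same closure through the original and through a clone,
/// then reports how many times it ran.
fn demo_call_count() -> usize {
    let counter = Arc::new(AtomicUsize::new(0));
    let seen = counter.clone();
    let callback = Callback::new(move || {
        seen.fetch_add(1, Ordering::SeqCst);
    });
    let cloned = callback.clone();
    callback.call();
    cloned.call();
    counter.load(Ordering::SeqCst)
}
```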
|
||||
@@ -10,7 +10,7 @@ use std::borrow::BorrowMut;
|
||||
pub const TERMINATED: DocId = std::i32::MAX as u32;
|
||||
|
||||
/// Represents an iterable set of sorted doc ids.
|
||||
pub trait DocSet {
|
||||
pub trait DocSet: Send {
|
||||
/// Goes to the next element.
|
||||
///
|
||||
/// The DocId of the next element is returned.
|
||||
@@ -129,6 +129,14 @@ impl<'a> DocSet for &'a mut dyn DocSet {
|
||||
fn size_hint(&self) -> u32 {
|
||||
(**self).size_hint()
|
||||
}
|
||||
|
||||
fn count(&mut self, delete_bitset: &DeleteBitSet) -> u32 {
|
||||
(**self).count(delete_bitset)
|
||||
}
|
||||
|
||||
fn count_including_deleted(&mut self) -> u32 {
|
||||
(**self).count_including_deleted()
|
||||
}
|
||||
}
|
||||
|
||||
impl<TDocSet: DocSet + ?Sized> DocSet for Box<TDocSet> {
|
||||
|
||||
@@ -1,4 +1,5 @@
|
||||
use super::MultiValueIntFastFieldReader;
|
||||
use super::MultiValuedFastFieldReader;
|
||||
use crate::error::DataCorruption;
|
||||
use crate::schema::Facet;
|
||||
use crate::termdict::TermDictionary;
|
||||
use crate::termdict::TermOrdinal;
|
||||
@@ -19,7 +20,7 @@ use std::str;
|
||||
/// list of facets. This ordinal is segment local and
|
||||
/// only makes sense for a given segment.
|
||||
pub struct FacetReader {
|
||||
term_ords: MultiValueIntFastFieldReader<u64>,
|
||||
term_ords: MultiValuedFastFieldReader<u64>,
|
||||
term_dict: TermDictionary,
|
||||
buffer: Vec<u8>,
|
||||
}
|
||||
@@ -28,12 +29,12 @@ impl FacetReader {
|
||||
/// Creates a new `FacetReader`.
|
||||
///
|
||||
/// A facet reader just wraps :
|
||||
/// - a `MultiValueIntFastFieldReader` that makes it possible to
|
||||
/// - a `MultiValuedFastFieldReader` that makes it possible to
|
||||
/// access the list of facet ords for a given document.
|
||||
/// - a `TermDictionary` that helps associating a facet to
|
||||
/// an ordinal and vice versa.
|
||||
pub fn new(
|
||||
term_ords: MultiValueIntFastFieldReader<u64>,
|
||||
term_ords: MultiValuedFastFieldReader<u64>,
|
||||
term_dict: TermDictionary,
|
||||
) -> FacetReader {
|
||||
FacetReader {
|
||||
@@ -62,12 +63,13 @@ impl FacetReader {
|
||||
&mut self,
|
||||
facet_ord: TermOrdinal,
|
||||
output: &mut Facet,
|
||||
) -> Result<(), str::Utf8Error> {
|
||||
) -> crate::Result<()> {
|
||||
let found_term = self
|
||||
.term_dict
|
||||
.ord_to_term(facet_ord as u64, &mut self.buffer);
|
||||
.ord_to_term(facet_ord as u64, &mut self.buffer)?;
|
||||
assert!(found_term, "Term ordinal {} not found.", facet_ord);
|
||||
let facet_str = str::from_utf8(&self.buffer[..])?;
|
||||
let facet_str = str::from_utf8(&self.buffer[..])
|
||||
.map_err(|utf8_err| DataCorruption::comment_only(utf8_err.to_string()))?;
|
||||
output.set_facet_str(facet_str);
|
||||
Ok(())
|
||||
}
|
||||
|
||||
@@ -28,7 +28,7 @@ pub use self::delete::write_delete_bitset;
|
||||
pub use self::delete::DeleteBitSet;
|
||||
pub use self::error::{FastFieldNotAvailableError, Result};
|
||||
pub use self::facet_reader::FacetReader;
|
||||
pub use self::multivalued::{MultiValueIntFastFieldReader, MultiValueIntFastFieldWriter};
|
||||
pub use self::multivalued::{MultiValuedFastFieldReader, MultiValuedFastFieldWriter};
|
||||
pub use self::reader::FastFieldReader;
|
||||
pub use self::readers::FastFieldReaders;
|
||||
pub use self::serializer::FastFieldSerializer;
|
||||
|
||||
@@ -1,8 +1,8 @@
|
||||
mod reader;
|
||||
mod writer;
|
||||
|
||||
pub use self::reader::MultiValueIntFastFieldReader;
|
||||
pub use self::writer::MultiValueIntFastFieldWriter;
|
||||
pub use self::reader::MultiValuedFastFieldReader;
|
||||
pub use self::writer::MultiValuedFastFieldWriter;
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
|
||||
@@ -10,29 +10,22 @@ use crate::DocId;
|
||||
/// The `idx_reader` associates, for each document, the index of its first value.
|
||||
///
|
||||
#[derive(Clone)]
|
||||
pub struct MultiValueIntFastFieldReader<Item: FastValue> {
|
||||
pub struct MultiValuedFastFieldReader<Item: FastValue> {
|
||||
idx_reader: FastFieldReader<u64>,
|
||||
vals_reader: FastFieldReader<Item>,
|
||||
}
|
||||
|
||||
impl<Item: FastValue> MultiValueIntFastFieldReader<Item> {
|
||||
impl<Item: FastValue> MultiValuedFastFieldReader<Item> {
|
||||
pub(crate) fn open(
|
||||
idx_reader: FastFieldReader<u64>,
|
||||
vals_reader: FastFieldReader<Item>,
|
||||
) -> MultiValueIntFastFieldReader<Item> {
|
||||
MultiValueIntFastFieldReader {
|
||||
) -> MultiValuedFastFieldReader<Item> {
|
||||
MultiValuedFastFieldReader {
|
||||
idx_reader,
|
||||
vals_reader,
|
||||
}
|
||||
}
|
||||
|
||||
pub(crate) fn into_u64s_reader(self) -> MultiValueIntFastFieldReader<u64> {
|
||||
MultiValueIntFastFieldReader {
|
||||
idx_reader: self.idx_reader,
|
||||
vals_reader: self.vals_reader.into_u64_reader(),
|
||||
}
|
||||
}
|
||||
|
||||
/// Returns `(start, stop)`, such that the values associated
|
||||
/// to the given document are `start..stop`.
|
||||
fn range(&self, doc: DocId) -> (u64, u64) {
|
||||
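The reader above stores, for each document, the index of its first value, and `range` turns that into a `start..stop` window. A minimal sketch with plain slices in place of the two fast field readers. It assumes the index holds one trailing entry equal to the total number of values, which this diff does not show. `values_for_doc` is a hypothetical name.

```rust
/// Hypothetical lookup: `doc_index[doc]` is the position of the first value
/// of `doc`, and `doc_index[doc + 1]` is one past its last value.
/// Assumes `doc_index` has one more entry than there are documents.
fn values_for_doc<'a>(doc_index: &[u64], vals: &'a [u64], doc: usize) -> &'a [u64] {
    let start = doc_index[doc] as usize;
    let stop = doc_index[doc + 1] as usize;
    &vals[start..stop]
}
```

For example, with `doc_index = [0, 2, 2, 5]`, document 1 has `start == stop` and therefore no values, while document 2 owns the last three entries of `vals`.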
|
||||
@@ -18,7 +18,7 @@ use std::io;
|
||||
/// in your schema
|
||||
/// - add your document simply by calling `.add_document(...)`.
|
||||
///
|
||||
/// The `MultiValueIntFastFieldWriter` can be acquired from the
|
||||
/// The `MultiValuedFastFieldWriter` can be acquired from the
|
||||
/// fastfield writer, by calling [`.get_multivalue_writer(...)`](./struct.FastFieldsWriter.html#method.get_multivalue_writer).
|
||||
///
|
||||
/// Once acquired, writing is done by calling calls to
|
||||
@@ -29,17 +29,17 @@ use std::io;
|
||||
/// This makes it possible to push unordered term ids,
|
||||
/// during indexing and remap them to their respective
|
||||
/// term ids when the segment is getting serialized.
|
||||
pub struct MultiValueIntFastFieldWriter {
|
||||
pub struct MultiValuedFastFieldWriter {
|
||||
field: Field,
|
||||
vals: Vec<UnorderedTermId>,
|
||||
doc_index: Vec<u64>,
|
||||
is_facet: bool,
|
||||
}
|
||||
|
||||
impl MultiValueIntFastFieldWriter {
|
||||
impl MultiValuedFastFieldWriter {
|
||||
/// Creates a new `MultiValuedFastFieldWriter`
|
||||
pub(crate) fn new(field: Field, is_facet: bool) -> Self {
|
||||
MultiValueIntFastFieldWriter {
|
||||
MultiValuedFastFieldWriter {
|
||||
field,
|
||||
vals: Vec::new(),
|
||||
doc_index: Vec::new(),
|
||||
@@ -47,7 +47,7 @@ impl MultiValueIntFastFieldWriter {
|
||||
}
|
||||
}
|
||||
|
||||
/// Access the field associated to the `MultiValueIntFastFieldWriter`
|
||||
/// Access the field associated to the `MultiValuedFastFieldWriter`
|
||||
pub fn field(&self) -> Field {
|
||||
self.field
|
||||
}
|
||||
|
||||
@@ -42,15 +42,6 @@ impl<Item: FastValue> FastFieldReader<Item> {
|
||||
})
|
||||
}
|
||||
|
||||
pub(crate) fn into_u64_reader(self) -> FastFieldReader<u64> {
|
||||
FastFieldReader {
|
||||
bit_unpacker: self.bit_unpacker,
|
||||
min_value_u64: self.min_value_u64,
|
||||
max_value_u64: self.max_value_u64,
|
||||
_phantom: PhantomData,
|
||||
}
|
||||
}
|
||||
|
||||
/// Return the value associated to the given document.
|
||||
///
|
||||
/// This accessor should return as fast as possible.
|
||||
|
||||
@@ -1,28 +1,22 @@
|
||||
use crate::common::CompositeFile;
|
||||
use crate::fastfield::BytesFastFieldReader;
|
||||
use crate::fastfield::MultiValueIntFastFieldReader;
|
||||
use crate::directory::FileSlice;
|
||||
use crate::fastfield::MultiValuedFastFieldReader;
|
||||
use crate::fastfield::{BytesFastFieldReader, FastValue};
|
||||
use crate::fastfield::{FastFieldNotAvailableError, FastFieldReader};
|
||||
use crate::schema::{Cardinality, Field, FieldType, Schema};
|
||||
use crate::space_usage::PerFieldSpaceUsage;
|
||||
use std::collections::HashMap;
|
||||
use crate::TantivyError;
|
||||
|
||||
/// Provides access to all of the FastFieldReader.
|
||||
///
|
||||
/// Internally, `FastFieldReaders` have preloaded fast field readers,
|
||||
/// and just wraps several `HashMap`.
|
||||
#[derive(Clone)]
|
||||
pub struct FastFieldReaders {
|
||||
fast_field_i64: HashMap<Field, FastFieldReader<i64>>,
|
||||
fast_field_u64: HashMap<Field, FastFieldReader<u64>>,
|
||||
fast_field_f64: HashMap<Field, FastFieldReader<f64>>,
|
||||
fast_field_date: HashMap<Field, FastFieldReader<crate::DateTime>>,
|
||||
fast_field_i64s: HashMap<Field, MultiValueIntFastFieldReader<i64>>,
|
||||
fast_field_u64s: HashMap<Field, MultiValueIntFastFieldReader<u64>>,
|
||||
fast_field_f64s: HashMap<Field, MultiValueIntFastFieldReader<f64>>,
|
||||
fast_field_dates: HashMap<Field, MultiValueIntFastFieldReader<crate::DateTime>>,
|
||||
fast_bytes: HashMap<Field, BytesFastFieldReader>,
|
||||
schema: Schema,
|
||||
fast_fields_composite: CompositeFile,
|
||||
}
|
||||
|
||||
#[derive(Eq, PartialEq, Debug)]
|
||||
enum FastType {
|
||||
I64,
|
||||
U64,
|
||||
@@ -50,228 +44,167 @@ fn type_and_cardinality(field_type: &FieldType) -> Option<(FastType, Cardinality
|
||||
}
|
||||
|
||||
impl FastFieldReaders {
|
||||
pub(crate) fn load_all(
|
||||
schema: &Schema,
|
||||
fast_fields_composite: &CompositeFile,
|
||||
pub(crate) fn new(
|
||||
schema: Schema,
|
||||
fast_fields_composite: CompositeFile,
|
||||
) -> crate::Result<FastFieldReaders> {
|
||||
let mut fast_field_readers = FastFieldReaders {
|
||||
fast_field_i64: Default::default(),
|
||||
fast_field_u64: Default::default(),
|
||||
fast_field_f64: Default::default(),
|
||||
fast_field_date: Default::default(),
|
||||
fast_field_i64s: Default::default(),
|
||||
fast_field_u64s: Default::default(),
|
||||
fast_field_f64s: Default::default(),
|
||||
fast_field_dates: Default::default(),
|
||||
fast_bytes: Default::default(),
|
||||
fast_fields_composite: fast_fields_composite.clone(),
|
||||
};
|
||||
for (field, field_entry) in schema.fields() {
|
||||
let field_type = field_entry.field_type();
|
||||
if let FieldType::Bytes(bytes_option) = field_type {
|
||||
if !bytes_option.is_fast() {
|
||||
continue;
|
||||
}
|
||||
let fast_field_idx_file = fast_fields_composite
|
||||
.open_read_with_idx(field, 0)
|
||||
.ok_or_else(|| FastFieldNotAvailableError::new(field_entry))?;
|
||||
let idx_reader = FastFieldReader::open(fast_field_idx_file)?;
|
||||
let data = fast_fields_composite
|
||||
.open_read_with_idx(field, 1)
|
||||
.ok_or_else(|| FastFieldNotAvailableError::new(field_entry))?;
|
||||
let bytes_fast_field_reader = BytesFastFieldReader::open(idx_reader, data)?;
|
||||
fast_field_readers
|
||||
.fast_bytes
|
||||
.insert(field, bytes_fast_field_reader);
|
||||
} else if let Some((fast_type, cardinality)) = type_and_cardinality(field_type) {
|
||||
match cardinality {
|
||||
Cardinality::SingleValue => {
|
||||
if let Some(fast_field_data) = fast_fields_composite.open_read(field) {
|
||||
match fast_type {
|
||||
FastType::U64 => {
|
||||
let fast_field_reader = FastFieldReader::open(fast_field_data)?;
|
||||
fast_field_readers
|
||||
.fast_field_u64
|
||||
.insert(field, fast_field_reader);
|
||||
}
|
||||
FastType::I64 => {
|
||||
let fast_field_reader =
|
||||
FastFieldReader::open(fast_field_data.clone())?;
|
||||
fast_field_readers
|
||||
.fast_field_i64
|
||||
.insert(field, fast_field_reader);
|
||||
}
|
||||
FastType::F64 => {
|
||||
let fast_field_reader =
|
||||
FastFieldReader::open(fast_field_data.clone())?;
|
||||
fast_field_readers
|
||||
.fast_field_f64
|
||||
.insert(field, fast_field_reader);
|
||||
}
|
||||
FastType::Date => {
|
||||
let fast_field_reader =
|
||||
FastFieldReader::open(fast_field_data.clone())?;
|
||||
fast_field_readers
|
||||
.fast_field_date
|
||||
.insert(field, fast_field_reader);
|
||||
}
|
||||
}
|
||||
} else {
|
||||
return Err(From::from(FastFieldNotAvailableError::new(field_entry)));
|
||||
}
|
||||
}
|
||||
Cardinality::MultiValues => {
|
||||
let idx_opt = fast_fields_composite.open_read_with_idx(field, 0);
|
||||
let data_opt = fast_fields_composite.open_read_with_idx(field, 1);
|
||||
if let (Some(fast_field_idx), Some(fast_field_data)) = (idx_opt, data_opt) {
|
||||
let idx_reader = FastFieldReader::open(fast_field_idx)?;
|
||||
match fast_type {
|
||||
FastType::I64 => {
|
||||
let vals_reader = FastFieldReader::open(fast_field_data)?;
|
||||
let multivalued_int_fast_field =
|
||||
MultiValueIntFastFieldReader::open(idx_reader, vals_reader);
|
||||
fast_field_readers
|
||||
.fast_field_i64s
|
||||
.insert(field, multivalued_int_fast_field);
|
||||
}
|
||||
FastType::U64 => {
|
||||
let vals_reader = FastFieldReader::open(fast_field_data)?;
|
||||
let multivalued_int_fast_field =
|
||||
MultiValueIntFastFieldReader::open(idx_reader, vals_reader);
|
||||
fast_field_readers
|
||||
.fast_field_u64s
|
||||
.insert(field, multivalued_int_fast_field);
|
||||
}
|
||||
FastType::F64 => {
|
||||
let vals_reader = FastFieldReader::open(fast_field_data)?;
|
||||
let multivalued_int_fast_field =
|
||||
MultiValueIntFastFieldReader::open(idx_reader, vals_reader);
|
||||
fast_field_readers
|
||||
.fast_field_f64s
|
||||
.insert(field, multivalued_int_fast_field);
|
||||
}
|
||||
FastType::Date => {
|
||||
let vals_reader = FastFieldReader::open(fast_field_data)?;
|
||||
let multivalued_int_fast_field =
|
||||
MultiValueIntFastFieldReader::open(idx_reader, vals_reader);
|
||||
fast_field_readers
|
||||
.fast_field_dates
|
||||
.insert(field, multivalued_int_fast_field);
|
||||
}
|
||||
}
|
||||
} else {
|
||||
return Err(From::from(FastFieldNotAvailableError::new(field_entry)));
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
Ok(fast_field_readers)
|
||||
Ok(FastFieldReaders {
|
||||
fast_fields_composite,
|
||||
schema,
|
||||
})
|
||||
}
|
||||
|
||||
pub(crate) fn space_usage(&self) -> PerFieldSpaceUsage {
|
||||
self.fast_fields_composite.space_usage()
|
||||
}
|
||||
|
||||
fn fast_field_data(&self, field: Field, idx: usize) -> crate::Result<FileSlice> {
|
||||
self.fast_fields_composite
|
||||
.open_read_with_idx(field, idx)
|
||||
.ok_or_else(|| {
|
||||
let field_name = self.schema.get_field_entry(field).name();
|
||||
TantivyError::SchemaError(format!("Field({}) data was not found", field_name))
|
||||
})
|
||||
}
|
||||
|
||||
fn check_type(
|
||||
&self,
|
||||
field: Field,
|
||||
expected_fast_type: FastType,
|
||||
expected_cardinality: Cardinality,
|
||||
) -> crate::Result<()> {
|
||||
let field_entry = self.schema.get_field_entry(field);
|
||||
let (fast_type, cardinality) =
|
||||
type_and_cardinality(field_entry.field_type()).ok_or_else(|| {
|
||||
crate::TantivyError::SchemaError(format!(
|
||||
"Field {:?} is not a fast field.",
|
||||
field_entry.name()
|
||||
))
|
||||
})?;
|
||||
if fast_type != expected_fast_type {
|
||||
return Err(crate::TantivyError::SchemaError(format!(
|
||||
"Field {:?} is of type {:?}, expected {:?}.",
|
||||
field_entry.name(),
|
||||
fast_type,
|
||||
expected_fast_type
|
||||
)));
|
||||
}
|
||||
if cardinality != expected_cardinality {
|
||||
return Err(crate::TantivyError::SchemaError(format!(
|
||||
"Field {:?} is of cardinality {:?}, expected {:?}.",
|
||||
field_entry.name(),
|
||||
cardinality,
|
||||
expected_cardinality
|
||||
)));
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
|
||||
pub(crate) fn typed_fast_field_reader<TFastValue: FastValue>(
|
||||
&self,
|
||||
field: Field,
|
||||
) -> crate::Result<FastFieldReader<TFastValue>> {
|
||||
let fast_field_slice = self.fast_field_data(field, 0)?;
|
||||
FastFieldReader::open(fast_field_slice)
|
||||
}
|
||||
|
||||
pub(crate) fn typed_fast_field_multi_reader<TFastValue: FastValue>(
|
||||
&self,
|
||||
field: Field,
|
||||
) -> crate::Result<MultiValuedFastFieldReader<TFastValue>> {
|
||||
let fast_field_slice_idx = self.fast_field_data(field, 0)?;
|
||||
let fast_field_slice_vals = self.fast_field_data(field, 1)?;
|
||||
let idx_reader = FastFieldReader::open(fast_field_slice_idx)?;
|
||||
let vals_reader: FastFieldReader<TFastValue> =
|
||||
FastFieldReader::open(fast_field_slice_vals)?;
|
||||
Ok(MultiValuedFastFieldReader::open(idx_reader, vals_reader))
|
||||
}
|
||||
|
||||
/// Returns the `u64` fast field reader associated to `field`.
|
||||
///
|
||||
/// If `field` is not a u64 fast field, this method returns `None`.
|
||||
pub fn u64(&self, field: Field) -> Option<FastFieldReader<u64>> {
|
||||
self.fast_field_u64.get(&field).cloned()
|
||||
}
|
||||
|
||||
/// If the field is a u64-fast field return the associated reader.
|
||||
/// If the field is a i64-fast field, return the associated u64 reader. Values are
|
||||
/// mapped from i64 to u64 using a (well the, it is unique) monotonic mapping. ///
|
||||
///
|
||||
/// This method is useful when merging segment reader.
|
||||
pub(crate) fn u64_lenient(&self, field: Field) -> Option<FastFieldReader<u64>> {
|
||||
if let Some(u64_ff_reader) = self.u64(field) {
|
||||
return Some(u64_ff_reader);
|
||||
}
|
||||
if let Some(i64_ff_reader) = self.i64(field) {
|
||||
return Some(i64_ff_reader.into_u64_reader());
|
||||
}
|
||||
if let Some(f64_ff_reader) = self.f64(field) {
|
||||
return Some(f64_ff_reader.into_u64_reader());
|
||||
}
|
||||
if let Some(date_ff_reader) = self.date(field) {
|
||||
return Some(date_ff_reader.into_u64_reader());
|
||||
}
|
||||
None
|
||||
pub fn u64(&self, field: Field) -> crate::Result<FastFieldReader<u64>> {
|
||||
self.check_type(field, FastType::U64, Cardinality::SingleValue)?;
|
||||
self.typed_fast_field_reader(field)
|
||||
}
|
||||
|
||||
/// Returns the `i64` fast field reader associated to `field`.
|
||||
///
|
||||
/// If `field` is not a i64 fast field, this method returns `None`.
|
||||
pub fn i64(&self, field: Field) -> Option<FastFieldReader<i64>> {
|
||||
self.fast_field_i64.get(&field).cloned()
|
||||
pub fn i64(&self, field: Field) -> crate::Result<FastFieldReader<i64>> {
|
||||
self.check_type(field, FastType::I64, Cardinality::SingleValue)?;
|
||||
self.typed_fast_field_reader(field)
|
||||
}
|
||||
|
||||
/// Returns the `date` fast field reader associated to `field`.
|
||||
///
|
||||
/// If `field` is not a i64 fast field, this method returns `None`.
|
||||
pub fn date(&self, field: Field) -> Option<FastFieldReader<crate::DateTime>> {
|
||||
self.fast_field_date.get(&field).cloned()
|
||||
pub fn date(&self, field: Field) -> crate::Result<FastFieldReader<crate::DateTime>> {
|
||||
self.check_type(field, FastType::Date, Cardinality::SingleValue)?;
|
||||
self.typed_fast_field_reader(field)
|
||||
}
|
||||
|
||||
/// Returns the `f64` fast field reader associated to `field`.
|
||||
///
|
||||
/// If `field` is not a f64 fast field, this method returns `None`.
|
||||
pub fn f64(&self, field: Field) -> Option<FastFieldReader<f64>> {
|
||||
self.fast_field_f64.get(&field).cloned()
|
||||
pub fn f64(&self, field: Field) -> crate::Result<FastFieldReader<f64>> {
|
||||
self.check_type(field, FastType::F64, Cardinality::SingleValue)?;
|
||||
self.typed_fast_field_reader(field)
|
||||
}
|
||||
|
||||
/// Returns a `u64s` multi-valued fast field reader associated to `field`.
|
||||
///
|
||||
/// If `field` is not a u64 multi-valued fast field, this method returns `None`.
|
||||
pub fn u64s(&self, field: Field) -> Option<MultiValueIntFastFieldReader<u64>> {
|
||||
self.fast_field_u64s.get(&field).cloned()
|
||||
}
|
||||
|
||||
/// If the field is a u64s-fast field return the associated reader.
|
||||
/// If the field is a i64s-fast field, return the associated u64s reader. Values are
|
||||
/// mapped from i64 to u64 using a (well the, it is unique) monotonic mapping.
|
||||
///
|
||||
/// This method is useful when merging segment reader.
|
||||
pub(crate) fn u64s_lenient(&self, field: Field) -> Option<MultiValueIntFastFieldReader<u64>> {
|
||||
if let Some(u64s_ff_reader) = self.u64s(field) {
|
||||
return Some(u64s_ff_reader);
|
||||
}
|
||||
if let Some(i64s_ff_reader) = self.i64s(field) {
|
||||
return Some(i64s_ff_reader.into_u64s_reader());
|
||||
}
|
||||
if let Some(f64s_ff_reader) = self.f64s(field) {
|
||||
return Some(f64s_ff_reader.into_u64s_reader());
|
||||
}
|
||||
None
|
||||
pub fn u64s(&self, field: Field) -> crate::Result<MultiValuedFastFieldReader<u64>> {
|
||||
self.check_type(field, FastType::U64, Cardinality::MultiValues)?;
|
||||
self.typed_fast_field_multi_reader(field)
|
||||
}
|
||||
|
||||
/// Returns a `i64s` multi-valued fast field reader associated to `field`.
|
||||
///
|
||||
/// If `field` is not a i64 multi-valued fast field, this method returns `None`.
|
||||
pub fn i64s(&self, field: Field) -> Option<MultiValueIntFastFieldReader<i64>> {
|
||||
self.fast_field_i64s.get(&field).cloned()
|
||||
pub fn i64s(&self, field: Field) -> crate::Result<MultiValuedFastFieldReader<i64>> {
|
||||
self.check_type(field, FastType::I64, Cardinality::MultiValues)?;
|
||||
self.typed_fast_field_multi_reader(field)
|
||||
}
|
||||
|
||||
/// Returns a `f64s` multi-valued fast field reader associated to `field`.
|
||||
///
|
||||
/// If `field` is not a f64 multi-valued fast field, this method returns `None`.
|
||||
pub fn f64s(&self, field: Field) -> Option<MultiValueIntFastFieldReader<f64>> {
|
||||
self.fast_field_f64s.get(&field).cloned()
|
||||
pub fn f64s(&self, field: Field) -> crate::Result<MultiValuedFastFieldReader<f64>> {
|
||||
self.check_type(field, FastType::F64, Cardinality::MultiValues)?;
|
||||
self.typed_fast_field_multi_reader(field)
|
||||
}
|
||||
|
||||
/// Returns a `crate::DateTime` multi-valued fast field reader associated to `field`.
|
||||
///
|
||||
/// If `field` is not a `crate::DateTime` multi-valued fast field, this method returns `None`.
|
||||
pub fn dates(&self, field: Field) -> Option<MultiValueIntFastFieldReader<crate::DateTime>> {
|
||||
self.fast_field_dates.get(&field).cloned()
|
||||
pub fn dates(
|
||||
&self,
|
||||
field: Field,
|
||||
) -> crate::Result<MultiValuedFastFieldReader<crate::DateTime>> {
|
||||
self.check_type(field, FastType::Date, Cardinality::MultiValues)?;
|
||||
self.typed_fast_field_multi_reader(field)
|
||||
}
|
||||
|
||||
/// Returns the `bytes` fast field reader associated to `field`.
|
||||
///
|
||||
/// If `field` is not a bytes fast field, returns `None`.
|
||||
pub fn bytes(&self, field: Field) -> Option<BytesFastFieldReader> {
|
||||
self.fast_bytes.get(&field).cloned()
|
||||
pub fn bytes(&self, field: Field) -> crate::Result<BytesFastFieldReader> {
|
||||
let field_entry = self.schema.get_field_entry(field);
|
||||
if let FieldType::Bytes(bytes_option) = field_entry.field_type() {
|
||||
if !bytes_option.is_fast() {
|
||||
return Err(crate::TantivyError::SchemaError(format!(
|
||||
"Field {:?} is not a fast field.",
|
||||
field_entry.name()
|
||||
)));
|
||||
}
|
||||
let fast_field_idx_file = self.fast_field_data(field, 0)?;
|
||||
let idx_reader = FastFieldReader::open(fast_field_idx_file)?;
|
||||
let data = self.fast_field_data(field, 1)?;
|
||||
BytesFastFieldReader::open(idx_reader, data)
|
||||
} else {
|
||||
Err(FastFieldNotAvailableError::new(field_entry).into())
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
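The accessors above move from returning `Option` to returning `crate::Result`, so a caller can tell "not a fast field" apart from "wrong type". The following is a minimal sketch of that pattern under stated assumptions: `Readers`, `FastType`, `FastFieldError`, and `sample_readers` are hypothetical stand-ins keyed by field name, not tantivy's actual types.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the declared type of a fast field.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum FastType {
    U64,
    I64,
}

/// Hypothetical error type; the diff itself uses `crate::TantivyError`.
#[derive(Debug, PartialEq, Eq)]
enum FastFieldError {
    NotAFastField(String),
    WrongType { field: String, expected: FastType, actual: FastType },
}

struct Readers {
    // field name -> (declared type, raw values)
    fields: HashMap<String, (FastType, Vec<u64>)>,
}

impl Readers {
    /// Validates the field before handing out its data, so every typed
    /// accessor can propagate a descriptive error with `?`.
    fn check_type(&self, field: &str, expected: FastType) -> Result<&Vec<u64>, FastFieldError> {
        let (actual, values) = self
            .fields
            .get(field)
            .ok_or_else(|| FastFieldError::NotAFastField(field.to_string()))?;
        if *actual != expected {
            return Err(FastFieldError::WrongType {
                field: field.to_string(),
                expected,
                actual: *actual,
            });
        }
        Ok(values)
    }

    /// Typed accessor: check first, then return the values.
    fn u64(&self, field: &str) -> Result<&[u64], FastFieldError> {
        Ok(self.check_type(field, FastType::U64)?.as_slice())
    }
}

/// Builds a small example with one u64 field and one i64 field.
fn sample_readers() -> Readers {
    let mut fields = HashMap::new();
    fields.insert("count".to_string(), (FastType::U64, vec![4, 7]));
    fields.insert("delta".to_string(), (FastType::I64, vec![1]));
    Readers { fields }
}
```

With this shape, callers such as the merger can write `readers.u64("count")?` instead of building their own error from a `None`.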
@@ -1,4 +1,4 @@
|
||||
use super::multivalued::MultiValueIntFastFieldWriter;
|
||||
use super::multivalued::MultiValuedFastFieldWriter;
|
||||
use crate::common;
|
||||
use crate::common::BinarySerializable;
|
||||
use crate::common::VInt;
|
||||
@@ -13,7 +13,7 @@ use std::io;
|
||||
/// The `FastFieldsWriter` regroups all of the fast field writers.
|
||||
pub struct FastFieldsWriter {
|
||||
single_value_writers: Vec<IntFastFieldWriter>,
|
||||
multi_values_writers: Vec<MultiValueIntFastFieldWriter>,
|
||||
multi_values_writers: Vec<MultiValuedFastFieldWriter>,
|
||||
bytes_value_writers: Vec<BytesFastFieldWriter>,
|
||||
}
|
||||
|
||||
@@ -46,14 +46,14 @@ impl FastFieldsWriter {
|
||||
single_value_writers.push(fast_field_writer);
|
||||
}
|
||||
Some(Cardinality::MultiValues) => {
|
||||
let fast_field_writer = MultiValueIntFastFieldWriter::new(field, false);
|
||||
let fast_field_writer = MultiValuedFastFieldWriter::new(field, false);
|
||||
multi_values_writers.push(fast_field_writer);
|
||||
}
|
||||
None => {}
|
||||
}
|
||||
}
|
||||
FieldType::HierarchicalFacet => {
|
||||
let fast_field_writer = MultiValueIntFastFieldWriter::new(field, true);
|
||||
let fast_field_writer = MultiValuedFastFieldWriter::new(field, true);
|
||||
multi_values_writers.push(fast_field_writer);
|
||||
}
|
||||
FieldType::Bytes(bytes_option) => {
|
||||
@@ -87,7 +87,7 @@ impl FastFieldsWriter {
|
||||
pub fn get_multivalue_writer(
|
||||
&mut self,
|
||||
field: Field,
|
||||
) -> Option<&mut MultiValueIntFastFieldWriter> {
|
||||
) -> Option<&mut MultiValuedFastFieldWriter> {
|
||||
// TODO optimize
|
||||
self.multi_values_writers
|
||||
.iter_mut()
|
||||
|
||||
@@ -61,16 +61,38 @@ impl FieldNormReaders {
|
||||
/// precompute computationally expensive functions of the fieldnorm
|
||||
/// in a very short array.
|
||||
#[derive(Clone)]
|
||||
pub struct FieldNormReader {
|
||||
data: OwnedBytes,
|
||||
pub struct FieldNormReader(ReaderImplEnum);
|
||||
|
||||
impl From<ReaderImplEnum> for FieldNormReader {
|
||||
fn from(reader_enum: ReaderImplEnum) -> FieldNormReader {
|
||||
FieldNormReader(reader_enum)
|
||||
}
|
||||
}
|
||||
|
||||
#[derive(Clone)]
|
||||
enum ReaderImplEnum {
|
||||
FromData(OwnedBytes),
|
||||
Const {
|
||||
num_docs: u32,
|
||||
fieldnorm_id: u8,
|
||||
fieldnorm: u32,
|
||||
},
|
||||
}
|
||||
|
||||
impl FieldNormReader {
|
||||
/// Creates a `FieldNormReader` with a constant fieldnorm.
|
||||
///
|
||||
/// The fieldnorm will be subjected to compression as if it was coming
|
||||
/// from an array-backed fieldnorm reader.
|
||||
pub fn constant(num_docs: u32, fieldnorm: u32) -> FieldNormReader {
|
||||
let fieldnorm_id = fieldnorm_to_id(fieldnorm);
|
||||
let field_norms_data = OwnedBytes::new(vec![fieldnorm_id; num_docs as usize]);
|
||||
FieldNormReader::new(field_norms_data)
|
||||
let fieldnorm = id_to_fieldnorm(fieldnorm_id);
|
||||
ReaderImplEnum::Const {
|
||||
num_docs,
|
||||
fieldnorm_id,
|
||||
fieldnorm,
|
||||
}
|
||||
.into()
|
||||
}
|
||||
|
||||
/// Opens a field norm reader given its file.
|
||||
@@ -80,12 +102,15 @@ impl FieldNormReader {
|
||||
}
|
||||
|
||||
fn new(data: OwnedBytes) -> Self {
|
||||
FieldNormReader { data }
|
||||
ReaderImplEnum::FromData(data).into()
|
||||
}
|
||||
|
||||
/// Returns the number of documents in this segment.
|
||||
pub fn num_docs(&self) -> u32 {
|
||||
self.data.len() as u32
|
||||
match &self.0 {
|
||||
ReaderImplEnum::FromData(data) => data.len() as u32,
|
||||
ReaderImplEnum::Const { num_docs, .. } => *num_docs,
|
||||
}
|
||||
}
|
||||
|
||||
/// Returns the `fieldnorm` associated to a doc id.
|
||||
@@ -98,14 +123,25 @@ impl FieldNormReader {
|
||||
/// The fieldnorm is effectively decoded from the
|
||||
/// `fieldnorm_id` by doing a simple table lookup.
|
||||
pub fn fieldnorm(&self, doc_id: DocId) -> u32 {
|
||||
let fieldnorm_id = self.fieldnorm_id(doc_id);
|
||||
id_to_fieldnorm(fieldnorm_id)
|
||||
match &self.0 {
|
||||
ReaderImplEnum::FromData(data) => {
|
||||
let fieldnorm_id = data.as_slice()[doc_id as usize];
|
||||
id_to_fieldnorm(fieldnorm_id)
|
||||
}
|
||||
ReaderImplEnum::Const { fieldnorm, .. } => *fieldnorm,
|
||||
}
|
||||
}
|
||||
|
||||
/// Returns the `fieldnorm_id` associated to a document.
|
||||
#[inline(always)]
|
||||
pub fn fieldnorm_id(&self, doc_id: DocId) -> u8 {
|
||||
self.data.as_slice()[doc_id as usize]
|
||||
match &self.0 {
|
||||
ReaderImplEnum::FromData(data) => {
|
||||
let fieldnorm_id = data.as_slice()[doc_id as usize];
|
||||
fieldnorm_id
|
||||
}
|
||||
ReaderImplEnum::Const { fieldnorm_id, .. } => *fieldnorm_id,
|
||||
}
|
||||
}
|
||||
|
||||
/// Converts a `fieldnorm_id` into a fieldnorm.
|
||||
@@ -129,9 +165,7 @@ impl FieldNormReader {
|
||||
.map(FieldNormReader::fieldnorm_to_id)
|
||||
.collect::<Vec<u8>>();
|
||||
let field_norms_data = OwnedBytes::new(field_norms_id);
|
||||
FieldNormReader {
|
||||
data: field_norms_data,
|
||||
}
|
||||
FieldNormReader::new(field_norms_data)
|
||||
}
|
||||
}
|
||||
|
||||
@@ -150,4 +184,20 @@ mod tests {
|
||||
assert_eq!(fieldnorm_reader.fieldnorm(3), 4);
|
||||
assert_eq!(fieldnorm_reader.fieldnorm(4), 983_064);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_const_fieldnorm_reader_small_fieldnorm_id() {
|
||||
let fieldnorm_reader = FieldNormReader::constant(1_000_000u32, 10u32);
|
||||
assert_eq!(fieldnorm_reader.num_docs(), 1_000_000u32);
|
||||
assert_eq!(fieldnorm_reader.fieldnorm(0u32), 10u32);
|
||||
assert_eq!(fieldnorm_reader.fieldnorm_id(0u32), 10u8);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_const_fieldnorm_reader_large_fieldnorm_id() {
|
||||
let fieldnorm_reader = FieldNormReader::constant(1_000_000u32, 300u32);
|
||||
assert_eq!(fieldnorm_reader.num_docs(), 1_000_000u32);
|
||||
assert_eq!(fieldnorm_reader.fieldnorm(0u32), 280u32);
|
||||
assert_eq!(fieldnorm_reader.fieldnorm_id(0u32), 72u8);
|
||||
}
|
||||
}
|
||||
|
||||
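The `FieldNormReader` change above replaces a buffer of `num_docs` identical bytes with a `Const` variant that stores the value once. Below is a minimal sketch of that idea, assuming a toy decoding function: `toy_id_to_fieldnorm` and `NormReader` are hypothetical, and the real id-to-fieldnorm table in tantivy is different.

```rust
/// Toy stand-in for the id -> fieldnorm table lookup (hypothetical mapping).
fn toy_id_to_fieldnorm(id: u8) -> u32 {
    (id as u32) * 4
}

/// Either one byte per document, or a single value shared by all documents.
enum NormReader {
    FromData(Vec<u8>),
    Const {
        num_docs: u32,
        fieldnorm_id: u8,
        fieldnorm: u32,
    },
}

impl NormReader {
    /// Constant reader: O(1) memory regardless of `num_docs`.
    /// The decoded fieldnorm is precomputed once here.
    fn constant(num_docs: u32, fieldnorm_id: u8) -> NormReader {
        NormReader::Const {
            num_docs,
            fieldnorm_id,
            fieldnorm: toy_id_to_fieldnorm(fieldnorm_id),
        }
    }

    fn num_docs(&self) -> u32 {
        match self {
            NormReader::FromData(data) => data.len() as u32,
            NormReader::Const { num_docs, .. } => *num_docs,
        }
    }

    fn fieldnorm_id(&self, doc: u32) -> u8 {
        match self {
            NormReader::FromData(data) => data[doc as usize],
            NormReader::Const { fieldnorm_id, .. } => *fieldnorm_id,
        }
    }

    fn fieldnorm(&self, doc: u32) -> u32 {
        match self {
            NormReader::FromData(data) => toy_id_to_fieldnorm(data[doc as usize]),
            NormReader::Const { fieldnorm, .. } => *fieldnorm,
        }
    }
}
```

The trade-off in this sketch is one extra `match` per lookup in exchange for not allocating a byte per document when every document has the same fieldnorm.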
@@ -1,45 +1,93 @@
|
||||
use rand::thread_rng;
|
||||
use std::collections::HashSet;
|
||||
|
||||
use crate::schema::*;
|
||||
use crate::Index;
|
||||
use crate::Searcher;
|
||||
use crate::{doc, schema::*};
|
||||
use rand::thread_rng;
|
||||
use rand::Rng;
|
||||
use std::collections::HashSet;
|
||||
|
||||
fn check_index_content(searcher: &Searcher, vals: &HashSet<u64>) {
|
||||
fn check_index_content(searcher: &Searcher, vals: &[u64]) -> crate::Result<()> {
|
||||
assert!(searcher.segment_readers().len() < 20);
|
||||
assert_eq!(searcher.num_docs() as usize, vals.len());
|
||||
for segment_reader in searcher.segment_readers() {
|
||||
let store_reader = segment_reader.get_store_reader()?;
|
||||
for doc_id in 0..segment_reader.max_doc() {
|
||||
let _doc = store_reader.get(doc_id)?;
|
||||
}
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
#[ignore]
|
||||
fn test_indexing() {
|
||||
fn test_functional_store() -> crate::Result<()> {
|
||||
let mut schema_builder = Schema::builder();
|
||||
|
||||
let id_field = schema_builder.add_u64_field("id", INDEXED | STORED);
|
||||
let schema = schema_builder.build();
|
||||
|
||||
let index = Index::create_in_ram(schema);
|
||||
let reader = index.reader()?;
|
||||
|
||||
let mut rng = thread_rng();
|
||||
|
||||
let mut index_writer = index.writer_with_num_threads(3, 12_000_000)?;
|
||||
|
||||
let mut doc_set: Vec<u64> = Vec::new();
|
||||
|
||||
let mut doc_id = 0u64;
|
||||
for iteration in 0..500 {
|
||||
dbg!(iteration);
|
||||
let num_docs: usize = rng.gen_range(0..4);
|
||||
if doc_set.len() >= 1 {
|
||||
let doc_to_remove_id = rng.gen_range(0..doc_set.len());
|
||||
let removed_doc_id = doc_set.swap_remove(doc_to_remove_id);
|
||||
index_writer.delete_term(Term::from_field_u64(id_field, removed_doc_id));
|
||||
}
|
||||
for _ in 0..num_docs {
|
||||
doc_set.push(doc_id);
|
||||
index_writer.add_document(doc!(id_field=>doc_id));
|
||||
doc_id += 1;
|
||||
}
|
||||
index_writer.commit()?;
|
||||
reader.reload()?;
|
||||
let searcher = reader.searcher();
|
||||
check_index_content(&searcher, &doc_set)?;
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
#[ignore]
|
||||
fn test_functional_indexing() -> crate::Result<()> {
|
||||
let mut schema_builder = Schema::builder();
|
||||
|
||||
let id_field = schema_builder.add_u64_field("id", INDEXED);
|
||||
let multiples_field = schema_builder.add_u64_field("multiples", INDEXED);
|
||||
let schema = schema_builder.build();
|
||||
|
||||
let index = Index::create_from_tempdir(schema).unwrap();
|
||||
let reader = index.reader().unwrap();
|
||||
let index = Index::create_from_tempdir(schema)?;
|
||||
let reader = index.reader()?;
|
||||
|
||||
let mut rng = thread_rng();
|
||||
|
||||
let mut index_writer = index.writer_with_num_threads(3, 120_000_000).unwrap();
|
||||
let mut index_writer = index.writer_with_num_threads(3, 120_000_000)?;
|
||||
|
||||
let mut committed_docs: HashSet<u64> = HashSet::new();
|
||||
let mut uncommitted_docs: HashSet<u64> = HashSet::new();
|
||||
|
||||
for _ in 0..200 {
|
||||
let random_val = rng.gen_range(0, 20);
|
||||
let random_val = rng.gen_range(0..20);
|
||||
if random_val == 0 {
|
||||
index_writer.commit().expect("Commit failed");
|
||||
index_writer.commit()?;
|
||||
committed_docs.extend(&uncommitted_docs);
|
||||
uncommitted_docs.clear();
|
||||
reader.reload().unwrap();
|
||||
reader.reload()?;
|
||||
let searcher = reader.searcher();
|
||||
// check that everything is correct.
|
||||
check_index_content(&searcher, &committed_docs);
|
||||
check_index_content(
|
||||
&searcher,
|
||||
&committed_docs.iter().cloned().collect::<Vec<u64>>(),
|
||||
)?;
|
||||
} else {
|
||||
if committed_docs.remove(&random_val) || uncommitted_docs.remove(&random_val) {
|
||||
let doc_id_term = Term::from_field_u64(id_field, random_val);
|
||||
@@ -55,4 +103,5 @@ fn test_indexing() {
|
||||
}
|
||||
}
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
|
||||
@@ -53,7 +53,7 @@ impl DeleteQueue {
|
||||
return block;
|
||||
}
|
||||
let block = Arc::new(Block {
|
||||
operations: Arc::default(),
|
||||
operations: Arc::new([]),
|
||||
next: NextBlock::from(self.clone()),
|
||||
});
|
||||
wlock.last_block = Arc::downgrade(&block);
|
||||
@@ -108,7 +108,7 @@ impl DeleteQueue {
|
||||
let delete_operations = mem::replace(&mut self_wlock.writer, vec![]);
|
||||
|
||||
let new_block = Arc::new(Block {
|
||||
operations: Arc::new(delete_operations.into_boxed_slice()),
|
||||
operations: Arc::from(delete_operations.into_boxed_slice()),
|
||||
next: NextBlock::from(self.clone()),
|
||||
});
|
||||
|
||||
@@ -167,7 +167,7 @@ impl NextBlock {
|
||||
}
|
||||
|
||||
struct Block {
|
||||
operations: Arc<Box<[DeleteOperation]>>,
|
||||
operations: Arc<[DeleteOperation]>,
|
||||
next: NextBlock,
|
||||
}
|
||||
|
||||
|
||||
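The `Block` change above swaps `Arc<Box<[DeleteOperation]>>` for `Arc<[DeleteOperation]>`, which removes one level of pointer indirection. A minimal sketch of the conversion, using `u64` as a stand-in element type (the helper names `share_ops` and `empty_ops` are hypothetical):

```rust
use std::sync::Arc;

/// Moves a vector's elements into a single reference-counted slice.
/// `Arc::from(Box<[T]>)` is a standard-library conversion; readers then
/// reach the elements through one pointer instead of two.
fn share_ops(ops: Vec<u64>) -> Arc<[u64]> {
    Arc::from(ops.into_boxed_slice())
}

/// An empty shared slice, built through the `From<Vec<T>>` conversion.
fn empty_ops() -> Arc<[u64]> {
    Arc::from(Vec::<u64>::new())
}
```

Cloning the resulting `Arc<[u64]>` only bumps the reference count, so handing the same block to several readers stays cheap.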
@@ -449,7 +449,7 @@ impl IndexWriter {
|
||||
}
|
||||
|
||||
/// Accessor to the merge policy.
|
||||
pub fn get_merge_policy(&self) -> Arc<Box<dyn MergePolicy>> {
|
||||
pub fn get_merge_policy(&self) -> Arc<dyn MergePolicy> {
|
||||
self.segment_updater.get_merge_policy()
|
||||
}
|
||||
|
||||
|
||||
@@ -8,7 +8,7 @@ const DEFAULT_MIN_LAYER_SIZE: u32 = 10_000;
|
||||
const DEFAULT_MIN_MERGE_SIZE: usize = 8;
|
||||
const DEFAULT_MAX_MERGE_SIZE: usize = 10_000_000;
|
||||
|
||||
/// `LogMergePolicy` tries tries to merge segments that have a similar number of
|
||||
/// `LogMergePolicy` tries to merge segments that have a similar number of
|
||||
/// documents.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct LogMergePolicy {
|
||||
|
||||
@@ -7,7 +7,7 @@ use crate::fastfield::BytesFastFieldReader;
|
||||
use crate::fastfield::DeleteBitSet;
|
||||
use crate::fastfield::FastFieldReader;
|
||||
use crate::fastfield::FastFieldSerializer;
|
||||
use crate::fastfield::MultiValueIntFastFieldReader;
|
||||
use crate::fastfield::MultiValuedFastFieldReader;
|
||||
use crate::fieldnorm::FieldNormsSerializer;
|
||||
use crate::fieldnorm::FieldNormsWriter;
|
||||
use crate::fieldnorm::{FieldNormReader, FieldNormReaders};
|
||||
@@ -246,7 +246,7 @@ impl IndexMerger {
|
||||
for reader in &self.readers {
|
||||
let u64_reader: FastFieldReader<u64> = reader
|
||||
.fast_fields()
|
||||
.u64_lenient(field)
|
||||
.typed_fast_field_reader(field)
|
||||
.expect("Failed to find a reader for single fast field. This is a tantivy bug and it should never happen.");
|
||||
if let Some((seg_min_val, seg_max_val)) =
|
||||
compute_min_max_val(&u64_reader, reader.max_doc(), reader.delete_bitset())
|
||||
@@ -290,7 +290,7 @@ impl IndexMerger {
|
||||
fast_field_serializer: &mut FastFieldSerializer,
|
||||
) -> crate::Result<()> {
|
||||
let mut total_num_vals = 0u64;
|
||||
let mut u64s_readers: Vec<MultiValueIntFastFieldReader<u64>> = Vec::new();
|
||||
let mut u64s_readers: Vec<MultiValuedFastFieldReader<u64>> = Vec::new();
|
||||
|
||||
// In the first pass, we compute the total number of vals.
|
||||
//
|
||||
@@ -298,9 +298,8 @@ impl IndexMerger {
|
||||
// what should be the bit length used for bitpacking.
|
||||
for reader in &self.readers {
|
||||
let u64s_reader = reader.fast_fields()
|
||||
.u64s_lenient(field)
|
||||
.typed_fast_field_multi_reader(field)
|
||||
.expect("Failed to find index for multivalued field. This is a bug in tantivy, please report.");
|
||||
|
||||
if let Some(delete_bitset) = reader.delete_bitset() {
|
||||
for doc in 0u32..reader.max_doc() {
|
||||
if delete_bitset.is_alive(doc) {
|
||||
@@ -353,7 +352,7 @@ impl IndexMerger {
|
||||
for (segment_ord, segment_reader) in self.readers.iter().enumerate() {
|
||||
let term_ordinal_mapping: &[TermOrdinal] =
|
||||
term_ordinal_mappings.get_segment(segment_ord);
|
||||
let ff_reader: MultiValueIntFastFieldReader<u64> = segment_reader
|
||||
let ff_reader: MultiValuedFastFieldReader<u64> = segment_reader
|
||||
.fast_fields()
|
||||
.u64s(field)
|
||||
.expect("Could not find multivalued u64 fast value reader.");
|
||||
@@ -397,8 +396,10 @@ impl IndexMerger {
|
||||
// We go through a complete first pass to compute the minimum and the
|
||||
// maximum value and initialize our Serializer.
|
||||
for reader in &self.readers {
|
||||
let ff_reader: MultiValueIntFastFieldReader<u64> =
|
||||
reader.fast_fields().u64s_lenient(field).expect(
|
||||
let ff_reader: MultiValuedFastFieldReader<u64> = reader
|
||||
.fast_fields()
|
||||
.typed_fast_field_multi_reader(field)
|
||||
.expect(
|
||||
"Failed to find multivalued fast field reader. This is a bug in \
|
||||
tantivy. Please report.",
|
||||
);
|
||||
@@ -445,11 +446,7 @@ impl IndexMerger {
|
||||
let mut bytes_readers: Vec<BytesFastFieldReader> = Vec::new();
|
||||
|
||||
for reader in &self.readers {
|
||||
let bytes_reader = reader.fast_fields().bytes(field).ok_or_else(|| {
|
||||
crate::TantivyError::InvalidArgument(
|
||||
"Bytes fast field {:?} not found in segment.".to_string(),
|
||||
)
|
||||
})?;
|
||||
let bytes_reader = reader.fast_fields().bytes(field)?;
|
||||
if let Some(delete_bitset) = reader.delete_bitset() {
|
||||
for doc in 0u32..reader.max_doc() {
|
||||
if delete_bitset.is_alive(doc) {
|
||||
@@ -503,7 +500,6 @@ impl IndexMerger {
|
||||
let mut positions_buffer: Vec<u32> = Vec::with_capacity(1_000);
|
||||
let mut delta_computer = DeltaComputer::new();
|
||||
|
||||
let mut field_term_streams = Vec::new();
|
||||
let mut max_term_ords: Vec<TermOrdinal> = Vec::new();
|
||||
|
||||
let field_readers: Vec<Arc<InvertedIndexReader>> = self
|
||||
@@ -512,9 +508,10 @@ impl IndexMerger {
|
||||
.map(|reader| reader.inverted_index(indexed_field))
|
||||
.collect::<crate::Result<Vec<_>>>()?;
|
||||
|
||||
let mut field_term_streams = Vec::new();
|
||||
for field_reader in &field_readers {
|
||||
let terms = field_reader.terms();
|
||||
field_term_streams.push(terms.stream());
|
||||
field_term_streams.push(terms.stream()?);
|
||||
max_term_ords.push(terms.num_terms() as u64);
|
||||
}
|
||||
|
||||
|
||||
@@ -9,6 +9,15 @@ pub struct DeleteOperation {
|
||||
pub term: Term,
|
||||
}
|
||||
|
||||
impl Default for DeleteOperation {
|
||||
fn default() -> Self {
|
||||
DeleteOperation {
|
||||
opstamp: 0u64,
|
||||
term: Term::new(),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// Timestamped Add operation.
|
||||
#[derive(Eq, PartialEq, Debug)]
|
||||
pub struct AddOperation {
|
||||
|
||||
@@ -154,7 +154,7 @@ pub(crate) struct InnerSegmentUpdater {
|
||||
|
||||
index: Index,
|
||||
segment_manager: SegmentManager,
|
||||
merge_policy: RwLock<Arc<Box<dyn MergePolicy>>>,
|
||||
merge_policy: RwLock<Arc<dyn MergePolicy>>,
|
||||
killed: AtomicBool,
|
||||
stamper: Stamper,
|
||||
merge_operations: MergeOperationInventory,
|
||||
@@ -193,19 +193,19 @@ impl SegmentUpdater {
|
||||
merge_thread_pool,
|
||||
index,
|
||||
segment_manager,
|
||||
merge_policy: RwLock::new(Arc::new(Box::new(DefaultMergePolicy::default()))),
|
||||
merge_policy: RwLock::new(Arc::new(DefaultMergePolicy::default())),
|
||||
killed: AtomicBool::new(false),
|
||||
stamper,
|
||||
merge_operations: Default::default(),
|
||||
})))
|
||||
}
|
||||
|
||||
pub fn get_merge_policy(&self) -> Arc<Box<dyn MergePolicy>> {
|
||||
pub fn get_merge_policy(&self) -> Arc<dyn MergePolicy> {
|
||||
self.merge_policy.read().unwrap().clone()
|
||||
}
|
||||
|
||||
pub fn set_merge_policy(&self, merge_policy: Box<dyn MergePolicy>) {
|
||||
let arc_merge_policy = Arc::new(merge_policy);
|
||||
let arc_merge_policy = Arc::from(merge_policy);
|
||||
*self.merge_policy.write().unwrap() = arc_merge_policy;
|
||||
}
|
||||
|
||||
|
||||
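The merge-policy change above follows the same idea for trait objects: `Arc<Box<dyn MergePolicy>>` becomes `Arc<dyn MergePolicy>`, and the setter converts the incoming `Box` with `Arc::from`. A minimal sketch, assuming a hypothetical `Policy` trait and `Updater` struct in place of tantivy's own types:

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for a merge policy trait.
trait Policy: Send + Sync {
    fn name(&self) -> &'static str;
}

struct DefaultPolicy;
struct NoMerge;

impl Policy for DefaultPolicy {
    fn name(&self) -> &'static str {
        "default"
    }
}

impl Policy for NoMerge {
    fn name(&self) -> &'static str {
        "no-merge"
    }
}

struct Updater {
    policy: RwLock<Arc<dyn Policy>>,
}

impl Updater {
    fn new() -> Updater {
        // Unsized coercion: Arc<DefaultPolicy> -> Arc<dyn Policy>.
        let policy: Arc<dyn Policy> = Arc::new(DefaultPolicy);
        Updater {
            policy: RwLock::new(policy),
        }
    }

    /// Cloning the Arc shares the policy without copying it.
    fn get(&self) -> Arc<dyn Policy> {
        self.policy.read().unwrap().clone()
    }

    /// The public signature still accepts a Box; `Arc::from` converts it
    /// into an Arc of the trait object without wrapping the Box itself.
    fn set(&self, policy: Box<dyn Policy>) {
        *self.policy.write().unwrap() = Arc::from(policy);
    }
}
```

Keeping `Box<dyn Policy>` in the setter, as the diff does, means existing callers do not need to change.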
30
src/lib.rs
@@ -96,7 +96,7 @@
|
||||
//! A good place for you to get started is to check out
|
||||
//! the example code (
|
||||
//! [literate programming](https://tantivy-search.github.io/examples/basic_search.html) /
|
||||
//! [source code](https://github.com/tantivy-search/tantivy/blob/master/examples/basic_search.rs))
|
||||
//! [source code](https://github.com/tantivy-search/tantivy/blob/main/examples/basic_search.rs))
|
||||
|
||||
#[cfg_attr(test, macro_use)]
|
||||
extern crate serde_json;
|
||||
@@ -160,7 +160,7 @@ pub use self::docset::{DocSet, TERMINATED};
|
||||
pub use crate::common::HasLen;
|
||||
pub use crate::common::{f64_to_u64, i64_to_u64, u64_to_f64, u64_to_i64};
|
||||
pub use crate::core::{Executor, SegmentComponent};
|
||||
pub use crate::core::{FieldSearcher, Index, IndexMeta, Searcher, Segment, SegmentId, SegmentMeta};
|
||||
pub use crate::core::{Index, IndexMeta, Searcher, Segment, SegmentId, SegmentMeta};
|
||||
pub use crate::core::{InvertedIndexReader, SegmentReader};
|
||||
pub use crate::directory::Directory;
|
||||
pub use crate::indexer::operation::UserOperation;
|
||||
@@ -174,7 +174,7 @@ use once_cell::sync::Lazy;
|
||||
use serde::{Deserialize, Serialize};
|
||||
|
||||
/// Index format version.
|
||||
const INDEX_FORMAT_VERSION: u32 = 2;
|
||||
const INDEX_FORMAT_VERSION: u32 = 3;
|
||||
|
||||
/// Structure version for the index.
|
||||
#[derive(Clone, PartialEq, Eq, Serialize, Deserialize)]
|
||||
@@ -866,39 +866,39 @@ mod tests {
|
||||
let searcher = reader.searcher();
|
||||
let segment_reader: &SegmentReader = searcher.segment_reader(0);
|
||||
{
|
||||
let fast_field_reader_opt = segment_reader.fast_fields().u64(text_field);
|
||||
assert!(fast_field_reader_opt.is_none());
|
||||
let fast_field_reader_res = segment_reader.fast_fields().u64(text_field);
|
||||
assert!(fast_field_reader_res.is_err());
|
||||
}
|
||||
{
|
||||
let fast_field_reader_opt = segment_reader.fast_fields().u64(stored_int_field);
|
||||
assert!(fast_field_reader_opt.is_none());
|
||||
assert!(fast_field_reader_opt.is_err());
|
||||
}
|
||||
{
|
||||
let fast_field_reader_opt = segment_reader.fast_fields().u64(fast_field_signed);
|
||||
assert!(fast_field_reader_opt.is_none());
|
||||
assert!(fast_field_reader_opt.is_err());
|
||||
}
|
||||
{
|
||||
let fast_field_reader_opt = segment_reader.fast_fields().u64(fast_field_float);
|
||||
assert!(fast_field_reader_opt.is_none());
|
||||
assert!(fast_field_reader_opt.is_err());
|
||||
}
|
||||
{
|
||||
let fast_field_reader_opt = segment_reader.fast_fields().u64(fast_field_unsigned);
|
||||
assert!(fast_field_reader_opt.is_some());
|
||||
assert!(fast_field_reader_opt.is_ok());
|
||||
let fast_field_reader = fast_field_reader_opt.unwrap();
|
||||
assert_eq!(fast_field_reader.get(0), 4u64)
|
||||
}
|
||||
|
||||
{
|
||||
let fast_field_reader_opt = segment_reader.fast_fields().i64(fast_field_signed);
|
||||
assert!(fast_field_reader_opt.is_some());
|
||||
let fast_field_reader = fast_field_reader_opt.unwrap();
|
||||
let fast_field_reader_res = segment_reader.fast_fields().i64(fast_field_signed);
|
||||
assert!(fast_field_reader_res.is_ok());
|
||||
let fast_field_reader = fast_field_reader_res.unwrap();
|
||||
assert_eq!(fast_field_reader.get(0), 4i64)
|
||||
}
|
||||
|
||||
{
|
||||
let fast_field_reader_opt = segment_reader.fast_fields().f64(fast_field_float);
|
||||
assert!(fast_field_reader_opt.is_some());
|
||||
let fast_field_reader = fast_field_reader_opt.unwrap();
|
||||
let fast_field_reader_res = segment_reader.fast_fields().f64(fast_field_float);
|
||||
assert!(fast_field_reader_res.is_ok());
|
||||
let fast_field_reader = fast_field_reader_res.unwrap();
|
||||
assert_eq!(fast_field_reader.get(0), 4f64)
|
||||
}
|
||||
Ok(())
|
||||
|
||||
@@ -132,7 +132,7 @@ impl PositionReader {
|
||||
"offset arguments should be increasing."
|
||||
);
|
||||
let delta_to_block_offset = offset as i64 - self.block_offset as i64;
|
||||
if delta_to_block_offset < 0 || delta_to_block_offset >= 128 {
|
||||
if !(0..128).contains(&delta_to_block_offset) {
|
||||
// The first position is not within the first block.
|
||||
// We need to decompress the first block.
|
||||
let delta_to_anchor_offset = offset - self.anchor_offset;
|
||||
|
||||
@@ -8,7 +8,7 @@ use std::io::{self, Write};
|
||||
pub struct PositionSerializer<W: io::Write> {
|
||||
bit_packer: BitPacker4x,
|
||||
write_stream: CountingWriter<W>,
|
||||
write_skiplist: W,
|
||||
write_skip_index: W,
|
||||
block: Vec<u32>,
|
||||
buffer: Vec<u8>,
|
||||
num_ints: u64,
|
||||
@@ -16,11 +16,11 @@ pub struct PositionSerializer<W: io::Write> {
|
||||
}
|
||||
|
||||
impl<W: io::Write> PositionSerializer<W> {
|
||||
pub fn new(write_stream: W, write_skiplist: W) -> PositionSerializer<W> {
|
||||
pub fn new(write_stream: W, write_skip_index: W) -> PositionSerializer<W> {
|
||||
PositionSerializer {
|
||||
bit_packer: BitPacker4x::new(),
|
||||
write_stream: CountingWriter::wrap(write_stream),
|
||||
write_skiplist,
|
||||
write_skip_index,
|
||||
block: Vec::with_capacity(128),
|
||||
buffer: vec![0u8; 128 * 4],
|
||||
num_ints: 0u64,
|
||||
@@ -52,7 +52,7 @@ impl<W: io::Write> PositionSerializer<W> {
|
||||
|
||||
fn flush_block(&mut self) -> io::Result<()> {
|
||||
let num_bits = self.bit_packer.num_bits(&self.block[..]);
|
||||
self.write_skiplist.write_all(&[num_bits])?;
|
||||
self.write_skip_index.write_all(&[num_bits])?;
|
||||
let written_len = self
|
||||
.bit_packer
|
||||
.compress(&self.block[..], &mut self.buffer, num_bits);
|
||||
@@ -70,10 +70,10 @@ impl<W: io::Write> PositionSerializer<W> {
|
||||
self.flush_block()?;
|
||||
}
|
||||
for &long_skip in &self.long_skips {
|
||||
long_skip.serialize(&mut self.write_skiplist)?;
|
||||
long_skip.serialize(&mut self.write_skip_index)?;
|
||||
}
|
||||
(self.long_skips.len() as u32).serialize(&mut self.write_skiplist)?;
|
||||
self.write_skiplist.flush()?;
|
||||
(self.long_skips.len() as u32).serialize(&mut self.write_skip_index)?;
|
||||
self.write_skip_index.flush()?;
|
||||
self.write_stream.flush()?;
|
||||
Ok(())
|
||||
}
|
||||
|
||||
@@ -469,7 +469,7 @@ mod tests {
|
||||
let segment_reader = searcher.segment_reader(0);
|
||||
let inverted_index = segment_reader.inverted_index(int_field).unwrap();
|
||||
let term = Term::from_field_u64(int_field, 0u64);
|
||||
let term_info = inverted_index.get_term_info(&term).unwrap();
|
||||
let term_info = inverted_index.get_term_info(&term).unwrap().unwrap();
|
||||
inverted_index
|
||||
.read_block_postings_from_terminfo(&term_info, IndexRecordOption::Basic)
|
||||
.unwrap()
|
||||
@@ -513,7 +513,7 @@ mod tests {
|
||||
{
|
||||
let term = Term::from_field_u64(int_field, 0u64);
|
||||
let inverted_index = segment_reader.inverted_index(int_field)?;
|
||||
let term_info = inverted_index.get_term_info(&term).unwrap();
|
||||
let term_info = inverted_index.get_term_info(&term)?.unwrap();
|
||||
block_segments = inverted_index
|
||||
.read_block_postings_from_terminfo(&term_info, IndexRecordOption::Basic)?;
|
||||
}
|
||||
@@ -521,7 +521,7 @@ mod tests {
|
||||
{
|
||||
let term = Term::from_field_u64(int_field, 1u64);
|
||||
let inverted_index = segment_reader.inverted_index(int_field)?;
|
||||
let term_info = inverted_index.get_term_info(&term).unwrap();
|
||||
let term_info = inverted_index.get_term_info(&term)?.unwrap();
|
||||
inverted_index.reset_block_postings_from_terminfo(&term_info, &mut block_segments)?;
|
||||
}
|
||||
assert_eq!(block_segments.docs(), &[1, 3, 5]);
|
||||
|
||||
@@ -54,7 +54,7 @@ pub mod tests {
|
||||
use crate::DocId;
|
||||
use crate::HasLen;
|
||||
use crate::Score;
|
||||
use std::iter;
|
||||
use std::{iter, mem};
|
||||
|
||||
#[test]
|
||||
pub fn test_position_write() -> crate::Result<()> {
|
||||
@@ -71,6 +71,7 @@ pub mod tests {
|
||||
field_serializer.write_doc(doc_id, 4, &delta_positions)?;
|
||||
}
|
||||
field_serializer.close_term()?;
|
||||
mem::drop(field_serializer);
|
||||
posting_serializer.close()?;
|
||||
let read = segment.open_read(SegmentComponent::POSITIONS)?;
|
||||
assert!(read.len() <= 140);
|
||||
@@ -179,7 +180,7 @@ pub mod tests {
|
||||
let inverted_index = segment_reader.inverted_index(text_field)?;
|
||||
assert_eq!(inverted_index.terms().num_terms(), 1);
|
||||
let mut bytes = vec![];
|
||||
assert!(inverted_index.terms().ord_to_term(0, &mut bytes));
|
||||
assert!(inverted_index.terms().ord_to_term(0, &mut bytes)?);
|
||||
assert_eq!(&bytes, b"hello");
|
||||
}
|
||||
{
|
||||
@@ -191,7 +192,7 @@ pub mod tests {
|
||||
let inverted_index = segment_reader.inverted_index(text_field)?;
|
||||
assert_eq!(inverted_index.terms().num_terms(), 1);
|
||||
let mut bytes = vec![];
|
||||
assert!(inverted_index.terms().ord_to_term(0, &mut bytes));
|
||||
assert!(inverted_index.terms().ord_to_term(0, &mut bytes)?);
|
||||
assert_eq!(&bytes[..], ok_token_text.as_bytes());
|
||||
}
|
||||
Ok(())
|
||||
|
||||
@@ -1,14 +1,11 @@
|
||||
use crate::common::HasLen;
|
||||
use crate::directory::FileSlice;
|
||||
use crate::docset::DocSet;
|
||||
use crate::fastfield::DeleteBitSet;
|
||||
use crate::positions::PositionReader;
|
||||
use crate::postings::compression::COMPRESSION_BLOCK_SIZE;
|
||||
use crate::postings::serializer::PostingsSerializer;
|
||||
use crate::postings::BlockSearcher;
|
||||
use crate::postings::BlockSegmentPostings;
|
||||
use crate::postings::Postings;
|
||||
use crate::schema::IndexRecordOption;
|
||||
use crate::{DocId, TERMINATED};
|
||||
|
||||
/// `SegmentPostings` represents the inverted list or postings associated to
|
||||
@@ -68,7 +65,11 @@ impl SegmentPostings {
|
||||
/// It serializes the doc ids using tantivy's codec
|
||||
/// and returns a `SegmentPostings` object that embeds a
|
||||
/// buffer with the serialized data.
|
||||
#[cfg(test)]
|
||||
pub fn create_from_docs(docs: &[u32]) -> SegmentPostings {
|
||||
use crate::directory::FileSlice;
|
||||
use crate::postings::serializer::PostingsSerializer;
|
||||
use crate::schema::IndexRecordOption;
|
||||
let mut buffer = Vec::new();
|
||||
{
|
||||
let mut postings_serializer =
|
||||
@@ -97,6 +98,9 @@ impl SegmentPostings {
|
||||
doc_and_tfs: &[(u32, u32)],
|
||||
fieldnorms: Option<&[u32]>,
|
||||
) -> SegmentPostings {
|
||||
use crate::directory::FileSlice;
|
||||
use crate::postings::serializer::PostingsSerializer;
|
||||
use crate::schema::IndexRecordOption;
|
||||
use crate::fieldnorm::FieldNormReader;
|
||||
use crate::Score;
|
||||
let mut buffer: Vec<u8> = Vec::new();
|
||||
|
||||
@@ -1,32 +1,46 @@
|
||||
use crate::common::{read_u32_vint_no_advance, serialize_vint_u32, BinarySerializable};
|
||||
use std::convert::TryInto;
|
||||
|
||||
use crate::directory::OwnedBytes;
|
||||
use crate::postings::compression::{compressed_block_size, COMPRESSION_BLOCK_SIZE};
|
||||
use crate::query::BM25Weight;
|
||||
use crate::schema::IndexRecordOption;
|
||||
use crate::{DocId, Score, TERMINATED};
|
||||
|
||||
#[inline(always)]
|
||||
fn encode_block_wand_max_tf(max_tf: u32) -> u8 {
|
||||
max_tf.min(u8::MAX as u32) as u8
|
||||
}
|
||||
|
||||
#[inline(always)]
|
||||
fn decode_block_wand_max_tf(max_tf_code: u8) -> u32 {
|
||||
if max_tf_code == u8::MAX {
|
||||
u32::MAX
|
||||
} else {
|
||||
max_tf_code as u32
|
||||
}
|
||||
}
|
||||
|
||||
#[inline(always)]
|
||||
fn read_u32(data: &[u8]) -> u32 {
|
||||
u32::from_le_bytes(data[..4].try_into().unwrap())
|
||||
}
|
||||
|
||||
#[inline(always)]
|
||||
fn write_u32(val: u32, buf: &mut Vec<u8>) {
|
||||
buf.extend_from_slice(&val.to_le_bytes());
|
||||
}
|
||||
|
||||
pub struct SkipSerializer {
|
||||
buffer: Vec<u8>,
|
||||
prev_doc: DocId,
|
||||
}
|
||||
|
||||
impl SkipSerializer {
|
||||
pub fn new() -> SkipSerializer {
|
||||
SkipSerializer {
|
||||
buffer: Vec::new(),
|
||||
prev_doc: 0u32,
|
||||
}
|
||||
SkipSerializer { buffer: Vec::new() }
|
||||
}
|
||||
|
||||
pub fn write_doc(&mut self, last_doc: DocId, doc_num_bits: u8) {
|
||||
assert!(
|
||||
last_doc > self.prev_doc,
|
||||
"write_doc(...) called with non-increasing doc ids. \
|
||||
Did you forget to call clear maybe?"
|
||||
);
|
||||
let delta_doc = last_doc - self.prev_doc;
|
||||
self.prev_doc = last_doc;
|
||||
delta_doc.serialize(&mut self.buffer).unwrap();
|
||||
write_u32(last_doc, &mut self.buffer);
|
||||
self.buffer.push(doc_num_bits);
|
||||
}
|
||||
|
||||
@@ -35,16 +49,13 @@ impl SkipSerializer {
|
||||
}
|
||||
|
||||
pub fn write_total_term_freq(&mut self, tf_sum: u32) {
|
||||
tf_sum
|
||||
.serialize(&mut self.buffer)
|
||||
.expect("Should never fail");
|
||||
write_u32(tf_sum, &mut self.buffer);
|
||||
}
|
||||
|
||||
pub fn write_blockwand_max(&mut self, fieldnorm_id: u8, term_freq: u32) {
|
||||
self.buffer.push(fieldnorm_id);
|
||||
let mut buf = [0u8; 8];
|
||||
let bytes = serialize_vint_u32(term_freq, &mut buf);
|
||||
self.buffer.extend_from_slice(bytes);
|
||||
let block_wand_tf = encode_block_wand_max_tf(term_freq);
|
||||
self.buffer
|
||||
.extend_from_slice(&[fieldnorm_id, block_wand_tf]);
|
||||
}
|
||||
|
||||
pub fn data(&self) -> &[u8] {
|
||||
@@ -52,7 +63,6 @@ impl SkipSerializer {
|
||||
}
|
||||
|
||||
pub fn clear(&mut self) {
|
||||
self.prev_doc = 0u32;
|
||||
self.buffer.clear();
|
||||
}
|
||||
}
|
||||
@@ -159,18 +169,13 @@ impl SkipReader {
|
||||
}
|
||||
|
||||
fn read_block_info(&mut self) {
|
||||
let doc_delta = {
|
||||
let bytes = self.owned_read.as_slice();
|
||||
let mut buf = [0; 4];
|
||||
buf.copy_from_slice(&bytes[..4]);
|
||||
u32::from_le_bytes(buf)
|
||||
};
|
||||
self.last_doc_in_block += doc_delta as DocId;
|
||||
let doc_num_bits = self.owned_read.as_slice()[4];
|
||||
|
||||
let bytes = self.owned_read.as_slice();
|
||||
let advance_len: usize;
|
||||
self.last_doc_in_block = read_u32(bytes);
|
||||
let doc_num_bits = bytes[4];
|
||||
match self.skip_info {
|
||||
IndexRecordOption::Basic => {
|
||||
self.owned_read.advance(5);
|
||||
advance_len = 5;
|
||||
self.block_info = BlockInfo::BitPacked {
|
||||
doc_num_bits,
|
||||
tf_num_bits: 0,
|
||||
@@ -180,11 +185,10 @@ impl SkipReader {
|
||||
};
|
||||
}
|
||||
IndexRecordOption::WithFreqs => {
|
||||
let bytes = self.owned_read.as_slice();
|
||||
let tf_num_bits = bytes[5];
|
||||
let block_wand_fieldnorm_id = bytes[6];
|
||||
let (block_wand_term_freq, num_bytes) = read_u32_vint_no_advance(&bytes[7..]);
|
||||
self.owned_read.advance(7 + num_bytes);
|
||||
let block_wand_term_freq = decode_block_wand_max_tf(bytes[7]);
|
||||
advance_len = 8;
|
||||
self.block_info = BlockInfo::BitPacked {
|
||||
doc_num_bits,
|
||||
tf_num_bits,
|
||||
@@ -194,16 +198,11 @@ impl SkipReader {
|
||||
};
|
||||
}
|
||||
IndexRecordOption::WithFreqsAndPositions => {
|
||||
let bytes = self.owned_read.as_slice();
|
||||
let tf_num_bits = bytes[5];
|
||||
let tf_sum = {
|
||||
let mut buf = [0; 4];
|
||||
buf.copy_from_slice(&bytes[6..10]);
|
||||
u32::from_le_bytes(buf)
|
||||
};
|
||||
let tf_sum = read_u32(&bytes[6..10]);
|
||||
let block_wand_fieldnorm_id = bytes[10];
|
||||
let (block_wand_term_freq, num_bytes) = read_u32_vint_no_advance(&bytes[11..]);
|
||||
self.owned_read.advance(11 + num_bytes);
|
||||
let block_wand_term_freq = decode_block_wand_max_tf(bytes[11]);
|
||||
advance_len = 12;
|
||||
self.block_info = BlockInfo::BitPacked {
|
||||
doc_num_bits,
|
||||
tf_num_bits,
|
||||
@@ -213,6 +212,7 @@ impl SkipReader {
|
||||
};
|
||||
}
|
||||
}
|
||||
self.owned_read.advance(advance_len);
|
||||
}
|
||||
|
||||
pub fn block_info(&self) -> BlockInfo {
|
||||
@@ -274,6 +274,24 @@ mod tests {
|
||||
use crate::directory::OwnedBytes;
|
||||
use crate::postings::compression::COMPRESSION_BLOCK_SIZE;
|
||||
|
||||
#[test]
|
||||
fn test_encode_block_wand_max_tf() {
|
||||
for tf in 0..255 {
|
||||
assert_eq!(super::encode_block_wand_max_tf(tf), tf as u8);
|
||||
}
|
||||
for &tf in &[255, 256, 1_000_000, u32::MAX] {
|
||||
assert_eq!(super::encode_block_wand_max_tf(tf), 255);
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_decode_block_wand_max_tf() {
|
||||
for tf in 0..255 {
|
||||
assert_eq!(super::decode_block_wand_max_tf(tf), tf as u32);
|
||||
}
|
||||
assert_eq!(super::decode_block_wand_max_tf(255), u32::MAX);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_skip_with_freq() {
|
||||
let buf = {
|
||||
|
||||
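The two tests above pin down the behaviour of the block-WAND term-frequency codec without showing its body. A minimal sketch consistent with those assertions follows. The function bodies are an assumption reconstructed from the tests, not the code in this commit: frequencies below 255 are stored verbatim in one byte, and anything larger saturates to 255, which decodes to `u32::MAX`.

```rust
/// Assumed body: saturating one-byte encoding of a block's maximum term frequency.
fn encode_block_wand_max_tf(max_tf: u32) -> u8 {
    // Values 0..=254 are kept as-is; everything else collapses to the sentinel 255.
    max_tf.min(u32::from(u8::MAX)) as u8
}

/// Assumed body: inverse of the encoding above.
fn decode_block_wand_max_tf(max_tf_code: u8) -> u32 {
    if max_tf_code == u8::MAX {
        // The sentinel only says "at least 255", so decode to the largest possible value.
        u32::MAX
    } else {
        u32::from(max_tf_code)
    }
}
```

With this scheme a decoded value is never smaller than the frequency that was encoded, which is the property an upper bound on the block score needs.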
@@ -7,6 +7,7 @@ use crate::schema::{Field, IndexRecordOption};
|
||||
use crate::termdict::{TermDictionary, TermStreamer};
|
||||
use crate::TantivyError;
|
||||
use crate::{DocId, Score};
|
||||
use std::io;
|
||||
use std::sync::Arc;
|
||||
use tantivy_fst::Automaton;
|
||||
|
||||
@@ -19,6 +20,7 @@ pub struct AutomatonWeight<A> {
|
||||
impl<A> AutomatonWeight<A>
|
||||
where
|
||||
A: Automaton + Send + Sync + 'static,
|
||||
A::State: Clone,
|
||||
{
|
||||
/// Create a new AutomatonWeight
|
||||
pub fn new<IntoArcA: Into<Arc<A>>>(field: Field, automaton: IntoArcA) -> AutomatonWeight<A> {
|
||||
@@ -28,7 +30,10 @@ where
|
||||
}
|
||||
}
|
||||
|
||||
fn automaton_stream<'a>(&'a self, term_dict: &'a TermDictionary) -> TermStreamer<'a, &'a A> {
|
||||
fn automaton_stream<'a>(
|
||||
&'a self,
|
||||
term_dict: &'a TermDictionary,
|
||||
) -> io::Result<TermStreamer<'a, &'a A>> {
|
||||
let automaton: &A = &*self.automaton;
|
||||
let term_stream_builder = term_dict.search(automaton);
|
||||
term_stream_builder.into_stream()
|
||||
@@ -38,13 +43,14 @@ where
|
||||
impl<A> Weight for AutomatonWeight<A>
|
||||
where
|
||||
A: Automaton + Send + Sync + 'static,
|
||||
A::State: Clone,
|
||||
{
|
||||
fn scorer(&self, reader: &SegmentReader, boost: Score) -> crate::Result<Box<dyn Scorer>> {
|
||||
let max_doc = reader.max_doc();
|
||||
let mut doc_bitset = BitSet::with_max_value(max_doc);
|
||||
let inverted_index = reader.inverted_index(self.field)?;
|
||||
let term_dict = inverted_index.terms();
|
||||
let mut term_stream = self.automaton_stream(term_dict);
|
||||
let mut term_stream = self.automaton_stream(term_dict)?;
|
||||
while term_stream.advance() {
|
||||
let term_info = term_stream.value();
|
||||
let mut block_segment_postings = inverted_index
|
||||
@@ -98,6 +104,7 @@ mod tests {
|
||||
index
|
||||
}
|
||||
|
||||
#[derive(Clone, Copy)]
|
||||
enum State {
|
||||
Start,
|
||||
NotMatching,
|
||||
|
||||
@@ -106,7 +106,7 @@ impl BM25Weight {
|
||||
BM25Weight::new(idf_explain, avg_fieldnorm)
|
||||
}
|
||||
|
||||
fn new(idf_explain: Explanation, average_fieldnorm: Score) -> BM25Weight {
|
||||
pub(crate) fn new(idf_explain: Explanation, average_fieldnorm: Score) -> BM25Weight {
|
||||
let weight = idf_explain.value() * (1.0 + K1);
|
||||
BM25Weight {
|
||||
idf_explain,
|
||||
|
||||
@@ -11,6 +11,7 @@ use crate::schema::{Field, IndexRecordOption, Term};
|
||||
use crate::termdict::{TermDictionary, TermStreamer};
|
||||
use crate::{DocId, Score};
|
||||
use std::collections::Bound;
|
||||
use std::io;
|
||||
use std::ops::Range;
|
||||
|
||||
fn map_bound<TFrom, TTo, Transform: Fn(&TFrom) -> TTo>(
|
||||
@@ -274,7 +275,7 @@ pub struct RangeWeight {
|
||||
}
|
||||
|
||||
impl RangeWeight {
|
||||
fn term_range<'a>(&self, term_dict: &'a TermDictionary) -> TermStreamer<'a> {
|
||||
fn term_range<'a>(&self, term_dict: &'a TermDictionary) -> io::Result<TermStreamer<'a>> {
|
||||
use std::collections::Bound::*;
|
||||
let mut term_stream_builder = term_dict.range();
|
||||
term_stream_builder = match self.left_bound {
|
||||
@@ -298,7 +299,7 @@ impl Weight for RangeWeight {
|
||||
|
||||
let inverted_index = reader.inverted_index(self.field)?;
|
||||
let term_dict = inverted_index.terms();
|
||||
let mut term_range = self.term_range(term_dict);
|
||||
let mut term_range = self.term_range(term_dict)?;
|
||||
while term_range.advance() {
|
||||
let term_info = term_range.value();
|
||||
let mut block_segment_postings = inverted_index
|
||||
|
||||
@@ -12,7 +12,7 @@ use std::marker::PhantomData;
|
||||
/// This is useful for queries like `+somethingrequired somethingoptional`.
|
||||
///
|
||||
/// Note that `somethingoptional` has no impact on the `DocSet`.
|
||||
pub struct RequiredOptionalScorer<TReqScorer, TOptScorer, TScoreCombiner> {
|
||||
pub struct RequiredOptionalScorer<TReqScorer, TOptScorer, TScoreCombiner: ScoreCombiner> {
|
||||
req_scorer: TReqScorer,
|
||||
opt_scorer: TOptScorer,
|
||||
score_cache: Option<Score>,
|
||||
@@ -23,6 +23,7 @@ impl<TReqScorer, TOptScorer, TScoreCombiner>
|
||||
RequiredOptionalScorer<TReqScorer, TOptScorer, TScoreCombiner>
|
||||
where
|
||||
TOptScorer: DocSet,
|
||||
TScoreCombiner: ScoreCombiner,
|
||||
{
|
||||
/// Creates a new `RequiredOptionalScorer`.
|
||||
pub fn new(
|
||||
@@ -43,6 +44,7 @@ impl<TReqScorer, TOptScorer, TScoreCombiner> DocSet
|
||||
where
|
||||
TReqScorer: DocSet,
|
||||
TOptScorer: DocSet,
|
||||
TScoreCombiner: ScoreCombiner,
|
||||
{
|
||||
fn advance(&mut self) -> DocId {
|
||||
self.score_cache = None;
|
||||
|
||||
@@ -3,7 +3,7 @@ use crate::Score;
|
||||
|
||||
/// The `ScoreCombiner` trait defines how to compute
|
||||
/// an overall score given a list of scores.
|
||||
pub trait ScoreCombiner: Default + Clone + Copy + 'static {
|
||||
pub trait ScoreCombiner: Default + Clone + Send + Copy + 'static {
|
||||
/// Aggregates the score combiner with the given scorer.
|
||||
///
|
||||
/// The `ScoreCombiner` may decide to call `.scorer.score()`
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
use super::term_weight::TermWeight;
|
||||
use crate::query::bm25::BM25Weight;
|
||||
use crate::query::Query;
|
||||
use crate::query::Weight;
|
||||
use crate::query::{Explanation, Query};
|
||||
use crate::schema::IndexRecordOption;
|
||||
use crate::Searcher;
|
||||
use crate::Term;
|
||||
@@ -100,7 +100,13 @@ impl TermQuery {
|
||||
field_entry.name()
|
||||
)));
|
||||
}
|
||||
let bm25_weight = BM25Weight::for_terms(searcher, &[term])?;
|
||||
let bm25_weight;
|
||||
if scoring_enabled {
|
||||
bm25_weight = BM25Weight::for_terms(searcher, &[term])?;
|
||||
} else {
|
||||
bm25_weight =
|
||||
BM25Weight::new(Explanation::new("<no score>".to_string(), 1.0f32), 1.0f32);
|
||||
}
|
||||
let index_record_option = if scoring_enabled {
|
||||
self.index_record_option
|
||||
} else {
|
||||
|
||||
@@ -302,7 +302,7 @@ mod tests {
|
||||
let mut rng = rand::thread_rng();
|
||||
writer.set_merge_policy(Box::new(NoMergePolicy));
|
||||
for _ in 0..3_000 {
|
||||
let term_freq = rng.gen_range(1, 10000);
|
||||
let term_freq = rng.gen_range(1..10000);
|
||||
let words: Vec<&str> = std::iter::repeat("bbbb").take(term_freq).collect();
|
||||
let text = words.join(" ");
|
||||
writer.add_document(doc!(text_field=>text));
|
||||
|
||||
@@ -45,7 +45,7 @@ impl Weight for TermWeight {
|
||||
} else {
|
||||
let field = self.term.field();
|
||||
let inv_index = reader.inverted_index(field)?;
|
||||
let term_info = inv_index.get_term_info(&self.term);
|
||||
let term_info = inv_index.get_term_info(&self.term)?;
|
||||
Ok(term_info.map(|term_info| term_info.doc_freq).unwrap_or(0))
|
||||
}
|
||||
}
|
||||
|
||||
@@ -233,6 +233,7 @@ mod tests {
|
||||
assert_eq!(Facet::root(), Facet::from("/"));
|
||||
assert_eq!(format!("{}", Facet::root()), "/");
|
||||
assert!(Facet::root().is_root());
|
||||
assert_eq!(Facet::root().encoded_str(), "");
|
||||
}
|
||||
|
||||
#[test]
|
||||
|
||||
@@ -1,5 +1,5 @@
|
||||
use crate::schema::Value;
|
||||
use serde::Serialize;
|
||||
use serde::{Deserialize, Serialize};
|
||||
use std::collections::BTreeMap;
|
||||
|
||||
/// Internal representation of a document used for JSON
|
||||
@@ -8,5 +8,5 @@ use std::collections::BTreeMap;
|
||||
/// A `NamedFieldDocument` is a simple representation of a document
|
||||
/// as a `BTreeMap<String, Vec<Value>>`.
|
||||
///
|
||||
#[derive(Serialize)]
|
||||
#[derive(Debug, Deserialize, Serialize)]
|
||||
pub struct NamedFieldDocument(pub BTreeMap<String, Vec<Value>>);
|
||||
|
||||
@@ -3,7 +3,7 @@ use std::io::{self, Read, Write};
|
||||
/// Name of the compression scheme used in the doc store.
|
||||
///
|
||||
/// This name is appended to the version string of tantivy.
|
||||
pub const COMPRESSION: &'static str = "lz4";
|
||||
pub const COMPRESSION: &str = "lz4";
|
||||
|
||||
pub fn compress(uncompressed: &[u8], compressed: &mut Vec<u8>) -> io::Result<()> {
|
||||
compressed.clear();
|
||||
|
||||
@@ -43,6 +43,9 @@ impl CheckpointBlock {
|
||||
|
||||
/// Adding another checkpoint in the block.
|
||||
pub fn push(&mut self, checkpoint: Checkpoint) {
|
||||
if let Some(prev_checkpoint) = self.checkpoints.last() {
|
||||
assert!(checkpoint.follows(prev_checkpoint));
|
||||
}
|
||||
self.checkpoints.push(checkpoint);
|
||||
}
|
||||
|
||||
|
||||
@@ -26,6 +26,12 @@ pub struct Checkpoint {
|
||||
pub end_offset: u64,
|
||||
}
|
||||
|
||||
impl Checkpoint {
|
||||
pub(crate) fn follows(&self, other: &Checkpoint) -> bool {
|
||||
(self.start_doc == other.end_doc) && (self.start_offset == other.end_offset)
|
||||
}
|
||||
}
|
||||
|
||||
impl fmt::Debug for Checkpoint {
|
||||
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
|
||||
write!(
|
||||
@@ -39,13 +45,16 @@ impl fmt::Debug for Checkpoint {
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
|
||||
use std::io;
|
||||
use std::{io, iter};
|
||||
|
||||
use futures::executor::block_on;
|
||||
use proptest::strategy::{BoxedStrategy, Strategy};
|
||||
|
||||
use crate::directory::OwnedBytes;
|
||||
use crate::indexer::NoMergePolicy;
|
||||
use crate::schema::{SchemaBuilder, STORED, STRING};
|
||||
use crate::store::index::Checkpoint;
|
||||
use crate::DocId;
|
||||
use crate::{DocAddress, DocId, Index, Term};
|
||||
|
||||
use super::{SkipIndex, SkipIndexBuilder};
|
||||
|
||||
@@ -54,7 +63,7 @@ mod tests {
|
||||
let mut output: Vec<u8> = Vec::new();
|
||||
let skip_index_builder: SkipIndexBuilder = SkipIndexBuilder::new();
|
||||
skip_index_builder.write(&mut output)?;
|
||||
let skip_index: SkipIndex = SkipIndex::from(OwnedBytes::new(output));
|
||||
let skip_index: SkipIndex = SkipIndex::open(OwnedBytes::new(output));
|
||||
let mut skip_cursor = skip_index.checkpoints();
|
||||
assert!(skip_cursor.next().is_none());
|
||||
Ok(())
|
||||
@@ -72,7 +81,7 @@ mod tests {
|
||||
};
|
||||
skip_index_builder.insert(checkpoint);
|
||||
skip_index_builder.write(&mut output)?;
|
||||
let skip_index: SkipIndex = SkipIndex::from(OwnedBytes::new(output));
|
||||
let skip_index: SkipIndex = SkipIndex::open(OwnedBytes::new(output));
|
||||
let mut skip_cursor = skip_index.checkpoints();
|
||||
assert_eq!(skip_cursor.next(), Some(checkpoint));
|
||||
assert_eq!(skip_cursor.next(), None);
|
||||
@@ -86,7 +95,7 @@ mod tests {
|
||||
Checkpoint {
|
||||
start_doc: 0,
|
||||
end_doc: 3,
|
||||
start_offset: 4,
|
||||
start_offset: 0,
|
||||
end_offset: 9,
|
||||
},
|
||||
Checkpoint {
|
||||
@@ -121,7 +130,7 @@ mod tests {
|
||||
}
|
||||
skip_index_builder.write(&mut output)?;
|
||||
|
||||
let skip_index: SkipIndex = SkipIndex::from(OwnedBytes::new(output));
|
||||
let skip_index: SkipIndex = SkipIndex::open(OwnedBytes::new(output));
|
||||
assert_eq!(
|
||||
&skip_index.checkpoints().collect::<Vec<_>>()[..],
|
||||
&checkpoints[..]
|
||||
@@ -133,6 +142,40 @@ mod tests {
|
||||
(doc as u64) * (doc as u64)
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_merge_store_with_stacking_reproducing_issue969() -> crate::Result<()> {
|
||||
let mut schema_builder = SchemaBuilder::default();
|
||||
let text = schema_builder.add_text_field("text", STORED | STRING);
|
||||
let body = schema_builder.add_text_field("body", STORED);
|
||||
let schema = schema_builder.build();
|
||||
let index = Index::create_in_ram(schema);
|
||||
let mut index_writer = index.writer_for_tests()?;
|
||||
index_writer.set_merge_policy(Box::new(NoMergePolicy));
|
||||
let long_text: String = iter::repeat("abcdefghijklmnopqrstuvwxyz")
|
||||
.take(1_000)
|
||||
.collect();
|
||||
for _ in 0..20 {
|
||||
index_writer.add_document(doc!(body=>long_text.clone()));
|
||||
}
|
||||
index_writer.commit()?;
|
||||
index_writer.add_document(doc!(text=>"testb"));
|
||||
for _ in 0..10 {
|
||||
index_writer.add_document(doc!(text=>"testd", body=>long_text.clone()));
|
||||
}
|
||||
index_writer.commit()?;
|
||||
index_writer.delete_term(Term::from_field_text(text, "testb"));
|
||||
index_writer.commit()?;
|
||||
let segment_ids = index.searchable_segment_ids()?;
|
||||
block_on(index_writer.merge(&segment_ids))?;
|
||||
let reader = index.reader()?;
|
||||
let searcher = reader.searcher();
|
||||
assert_eq!(searcher.num_docs(), 30);
|
||||
for i in 0..searcher.num_docs() as u32 {
|
||||
let _doc = searcher.doc(DocAddress(0u32, i))?;
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_skip_index_long() -> io::Result<()> {
|
||||
let mut output: Vec<u8> = Vec::new();
|
||||
@@ -150,26 +193,28 @@ mod tests {
|
||||
}
|
||||
skip_index_builder.write(&mut output)?;
|
||||
assert_eq!(output.len(), 4035);
|
||||
let resulting_checkpoints: Vec<Checkpoint> = SkipIndex::from(OwnedBytes::new(output))
|
||||
let resulting_checkpoints: Vec<Checkpoint> = SkipIndex::open(OwnedBytes::new(output))
|
||||
.checkpoints()
|
||||
.collect();
|
||||
assert_eq!(&resulting_checkpoints, &checkpoints);
|
||||
Ok(())
|
||||
}
|
||||
|
||||
fn integrate_delta(mut vals: Vec<u64>) -> Vec<u64> {
|
||||
fn integrate_delta(vals: Vec<u64>) -> Vec<u64> {
|
||||
let mut output = Vec::with_capacity(vals.len() + 1);
|
||||
output.push(0u64);
|
||||
let mut prev = 0u64;
|
||||
for val in vals.iter_mut() {
|
||||
let new_val = *val + prev;
|
||||
for val in vals {
|
||||
let new_val = val + prev;
|
||||
prev = new_val;
|
||||
*val = new_val;
|
||||
output.push(new_val);
|
||||
}
|
||||
vals
|
||||
output
|
||||
}
|
||||
|
||||
// Generates a sequence of n valid checkpoints, with n < max_len.
|
||||
fn monotonic_checkpoints(max_len: usize) -> BoxedStrategy<Vec<Checkpoint>> {
|
||||
(1..max_len)
|
||||
(0..max_len)
|
||||
.prop_flat_map(move |len: usize| {
|
||||
(
|
||||
proptest::collection::vec(1u64..20u64, len as usize).prop_map(integrate_delta),
|
||||
@@ -221,7 +266,7 @@ mod tests {
|
||||
}
|
||||
let mut buffer = Vec::new();
|
||||
skip_index_builder.write(&mut buffer).unwrap();
|
||||
let skip_index = SkipIndex::from(OwnedBytes::new(buffer));
|
||||
let skip_index = SkipIndex::open(OwnedBytes::new(buffer));
|
||||
let iter_checkpoints: Vec<Checkpoint> = skip_index.checkpoints().collect();
|
||||
assert_eq!(&checkpoints[..], &iter_checkpoints[..]);
|
||||
test_skip_index_aux(skip_index, &checkpoints[..]);
|
||||
|
||||
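The `follows` invariant and the `integrate_delta` helper used by the property tests can be tried in isolation. The sketch below re-declares a stand-alone `Checkpoint` (the field types are assumptions: `u32` doc ids and `u64` offsets) and adds a hypothetical `checkpoints_from_deltas` helper that turns positive deltas into a contiguous sequence of checkpoints.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Checkpoint {
    start_doc: u32,
    end_doc: u32,
    start_offset: u64,
    end_offset: u64,
}

impl Checkpoint {
    /// A checkpoint follows another when it starts exactly where the other ends.
    fn follows(&self, other: &Checkpoint) -> bool {
        self.start_doc == other.end_doc && self.start_offset == other.end_offset
    }
}

/// Prefix sums with a leading zero: [3, 2] becomes [0, 3, 5].
fn integrate_delta(vals: &[u64]) -> Vec<u64> {
    let mut output = Vec::with_capacity(vals.len() + 1);
    output.push(0u64);
    let mut prev = 0u64;
    for &val in vals {
        prev += val;
        output.push(prev);
    }
    output
}

/// Hypothetical helper: pairs consecutive boundaries into contiguous checkpoints.
fn checkpoints_from_deltas(doc_deltas: &[u64], offset_deltas: &[u64]) -> Vec<Checkpoint> {
    let docs = integrate_delta(doc_deltas);
    let offsets = integrate_delta(offset_deltas);
    docs.windows(2)
        .zip(offsets.windows(2))
        .map(|(d, o)| Checkpoint {
            start_doc: d[0] as u32,
            end_doc: d[1] as u32,
            start_offset: o[0],
            end_offset: o[1],
        })
        .collect()
}
```

Because every boundary is shared between two neighbours, each generated checkpoint satisfies `follows` with respect to its predecessor, which is the condition the new assertion in `CheckpointBlock::push` checks.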
@@ -19,7 +19,7 @@ impl<'a> Iterator for LayerCursor<'a> {
|
||||
return None;
|
||||
}
|
||||
let (block_mut, remaining_mut) = (&mut self.block, &mut self.remaining);
|
||||
if let Err(_) = block_mut.deserialize(remaining_mut) {
|
||||
if block_mut.deserialize(remaining_mut).is_err() {
|
||||
return None;
|
||||
}
|
||||
self.cursor = 0;
|
||||
@@ -35,11 +35,11 @@ struct Layer {
|
||||
}
|
||||
|
||||
impl Layer {
|
||||
fn cursor<'a>(&'a self) -> impl Iterator<Item = Checkpoint> + 'a {
|
||||
fn cursor(&self) -> impl Iterator<Item = Checkpoint> + '_ {
|
||||
self.cursor_at_offset(0u64)
|
||||
}
|
||||
|
||||
fn cursor_at_offset<'a>(&'a self, start_offset: u64) -> impl Iterator<Item = Checkpoint> + 'a {
|
||||
fn cursor_at_offset(&self, start_offset: u64) -> impl Iterator<Item = Checkpoint> + '_ {
|
||||
let data = &self.data.as_slice();
|
||||
LayerCursor {
|
||||
remaining: &data[start_offset as usize..],
|
||||
@@ -50,8 +50,7 @@ impl Layer {
|
||||
|
||||
fn seek_start_at_offset(&self, target: DocId, offset: u64) -> Option<Checkpoint> {
|
||||
self.cursor_at_offset(offset)
|
||||
.filter(|checkpoint| checkpoint.end_doc > target)
|
||||
.next()
|
||||
.find(|checkpoint| checkpoint.end_doc > target)
|
||||
}
|
||||
}
|
||||
|
||||
@@ -60,7 +59,25 @@ pub struct SkipIndex {
|
||||
}
|
||||
|
||||
impl SkipIndex {
|
||||
pub(crate) fn checkpoints<'a>(&'a self) -> impl Iterator<Item = Checkpoint> + 'a {
|
||||
pub fn open(mut data: OwnedBytes) -> SkipIndex {
|
||||
let offsets: Vec<u64> = Vec::<VInt>::deserialize(&mut data)
|
||||
.unwrap()
|
||||
.into_iter()
|
||||
.map(|el| el.0)
|
||||
.collect();
|
||||
let mut start_offset = 0;
|
||||
let mut layers = Vec::new();
|
||||
for end_offset in offsets {
|
||||
let layer = Layer {
|
||||
data: data.slice(start_offset as usize, end_offset as usize),
|
||||
};
|
||||
layers.push(layer);
|
||||
start_offset = end_offset;
|
||||
}
|
||||
SkipIndex { layers }
|
||||
}
|
||||
|
||||
pub(crate) fn checkpoints(&self) -> impl Iterator<Item = Checkpoint> + '_ {
|
||||
self.layers
|
||||
.last()
|
||||
.into_iter()
|
||||
@@ -91,22 +108,3 @@ impl SkipIndex {
|
||||
Some(cur_checkpoint)
|
||||
}
|
||||
}
|
||||
|
||||
impl From<OwnedBytes> for SkipIndex {
|
||||
fn from(mut data: OwnedBytes) -> SkipIndex {
|
||||
let offsets: Vec<u64> = Vec::<VInt>::deserialize(&mut data)
|
||||
.unwrap()
|
||||
.into_iter()
|
||||
.map(|el| el.0)
|
||||
.collect();
|
||||
let mut start_offset = 0;
|
||||
let mut layers = Vec::new();
|
||||
for end_offset in offsets {
|
||||
layers.push(Layer {
|
||||
data: data.slice(start_offset as usize, end_offset as usize),
|
||||
});
|
||||
start_offset = end_offset;
|
||||
}
|
||||
SkipIndex { layers }
|
||||
}
|
||||
}
|
||||
|
||||
@@ -28,18 +28,20 @@ impl LayerBuilder {
|
||||
///
|
||||
/// If the block was empty to begin with, simply return None.
|
||||
fn flush_block(&mut self) -> Option<Checkpoint> {
|
||||
self.block.doc_interval().map(|(start_doc, end_doc)| {
|
||||
if let Some((start_doc, end_doc)) = self.block.doc_interval() {
|
||||
let start_offset = self.buffer.len() as u64;
|
||||
self.block.serialize(&mut self.buffer);
|
||||
let end_offset = self.buffer.len() as u64;
|
||||
self.block.clear();
|
||||
Checkpoint {
|
||||
Some(Checkpoint {
|
||||
start_doc,
|
||||
end_doc,
|
||||
start_offset,
|
||||
end_offset,
|
||||
}
|
||||
})
|
||||
})
|
||||
} else {
|
||||
None
|
||||
}
|
||||
}
|
||||
|
||||
fn push(&mut self, checkpoint: Checkpoint) {
|
||||
@@ -48,7 +50,7 @@ impl LayerBuilder {
|
||||
|
||||
fn insert(&mut self, checkpoint: Checkpoint) -> Option<Checkpoint> {
|
||||
self.push(checkpoint);
|
||||
let emit_skip_info = (self.block.len() % CHECKPOINT_PERIOD) == 0;
|
||||
let emit_skip_info = self.block.len() >= CHECKPOINT_PERIOD;
|
||||
if emit_skip_info {
|
||||
self.flush_block()
|
||||
} else {
|
||||
|
||||
@@ -35,7 +35,7 @@ impl StoreReader {
|
||||
let (data_file, offset_index_file) = split_file(store_file)?;
|
||||
let index_data = offset_index_file.read_bytes()?;
|
||||
let space_usage = StoreSpaceUsage::new(data_file.len(), offset_index_file.len());
|
||||
let skip_index = SkipIndex::from(index_data);
|
||||
let skip_index = SkipIndex::open(index_data);
|
||||
Ok(StoreReader {
|
||||
data: data_file,
|
||||
cache: Arc::new(Mutex::new(LruCache::new(LRU_CACHE_CAPACITY))),
|
||||
@@ -46,7 +46,7 @@ impl StoreReader {
|
||||
})
|
||||
}
|
||||
|
||||
pub(crate) fn block_checkpoints<'a>(&'a self) -> impl Iterator<Item = Checkpoint> + 'a {
|
||||
pub(crate) fn block_checkpoints(&self) -> impl Iterator<Item = Checkpoint> + '_ {
|
||||
self.skip_index.checkpoints()
|
||||
}
|
||||
|
||||
|
||||
@@ -72,6 +72,7 @@ impl StoreWriter {
|
||||
if !self.current_block.is_empty() {
|
||||
self.write_and_compress_block()?;
|
||||
}
|
||||
assert_eq!(self.first_doc_in_block, self.doc);
|
||||
let doc_shift = self.doc;
|
||||
let start_shift = self.writer.written_bytes() as u64;
|
||||
|
||||
@@ -86,12 +87,17 @@ impl StoreWriter {
|
||||
checkpoint.end_doc += doc_shift;
|
||||
checkpoint.start_offset += start_shift;
|
||||
checkpoint.end_offset += start_shift;
|
||||
self.offset_index_writer.insert(checkpoint);
|
||||
self.doc = checkpoint.end_doc;
|
||||
self.register_checkpoint(checkpoint);
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
|
||||
fn register_checkpoint(&mut self, checkpoint: Checkpoint) {
|
||||
self.offset_index_writer.insert(checkpoint);
|
||||
self.first_doc_in_block = checkpoint.end_doc;
|
||||
self.doc = checkpoint.end_doc;
|
||||
}
|
||||
|
||||
fn write_and_compress_block(&mut self) -> io::Result<()> {
|
||||
assert!(self.doc > 0);
|
||||
self.intermediary_buffer.clear();
|
||||
@@ -100,14 +106,13 @@ impl StoreWriter {
|
||||
self.writer.write_all(&self.intermediary_buffer)?;
|
||||
let end_offset = self.writer.written_bytes();
|
||||
let end_doc = self.doc;
|
||||
self.offset_index_writer.insert(Checkpoint {
|
||||
self.register_checkpoint(Checkpoint {
|
||||
start_doc: self.first_doc_in_block,
|
||||
end_doc,
|
||||
start_offset,
|
||||
end_offset,
|
||||
});
|
||||
self.current_block.clear();
|
||||
self.first_doc_in_block = self.doc;
|
||||
Ok(())
|
||||
}
|
||||
|
||||
|
||||
27
src/termdict/fst_termdict/mod.rs
Normal file
@@ -0,0 +1,27 @@
|
||||
/*!
|
||||
The term dictionary's main role is to associate the sorted [`Term`s](../struct.Term.html) with
|
||||
a [`TermInfo`](../postings/struct.TermInfo.html) struct that contains some meta-information
|
||||
about the term.
|
||||
|
||||
Internally, the term dictionary relies on the `fst` crate to store
|
||||
a sorted mapping that associates each term with its rank in lexicographical order.
|
||||
For instance, in a dictionary containing the sorted terms "abba", "bjork", "blur" and "donovan",
|
||||
the `TermOrdinal`s are respectively `0`, `1`, `2`, and `3`.
|
||||
|
||||
For `u64`-terms, tantivy explicitly uses a `BigEndian` representation to ensure that the
|
||||
lexicographical order matches the natural order of integers.
|
||||
|
||||
`i64`-terms are transformed to `u64` using a continuous mapping `val ⟶ val - i64::min_value()`
|
||||
and then treated as a `u64`.
|
||||
|
||||
`f64`-terms are transformed to `u64` using a mapping that preserves order, and are then treated
|
||||
as `u64`.
|
||||
|
||||
A second data structure makes it possible to access a [`TermInfo`](../postings/struct.TermInfo.html).
|
||||
*/
|
||||
mod streamer;
|
||||
mod term_info_store;
|
||||
mod termdict;
|
||||
|
||||
pub use self::streamer::{TermStreamer, TermStreamerBuilder};
|
||||
pub use self::termdict::{TermDictionary, TermDictionaryBuilder};
|
||||
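The order-preserving mappings described in the module documentation above can be sketched as follows. This is a minimal illustration, not tantivy's own code: the function names are hypothetical, and the `f64` mapping shown is the common sign-flip trick, assumed here because the documentation only states that the mapping preserves order.

```rust
/// Hypothetical helper: big-endian bytes compare lexicographically
/// in the same order as the integers they encode.
fn u64_to_term_bytes(val: u64) -> [u8; 8] {
    val.to_be_bytes()
}

/// `val - i64::MIN` in wrapping arithmetic, which amounts to flipping the sign bit.
fn i64_to_u64(val: i64) -> u64 {
    (val as u64) ^ (1u64 << 63)
}

/// Assumed mapping: negative floats have all bits inverted,
/// non-negative floats only have their sign bit set.
fn f64_to_u64(val: f64) -> u64 {
    let bits = val.to_bits();
    if bits >> 63 == 1 {
        !bits
    } else {
        bits ^ (1u64 << 63)
    }
}
```

A little-endian encoding would not work for the first step: `256u64` would start with a `0x00` byte and sort before `255u64`, which starts with `0xff`.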
@@ -1,3 +1,5 @@
|
||||
use std::io;
|
||||
|
||||
use super::TermDictionary;
|
||||
use crate::postings::TermInfo;
|
||||
use crate::termdict::TermOrdinal;
|
||||
@@ -59,14 +61,14 @@ where
|
||||
|
||||
/// Creates the stream corresponding to the range
|
||||
/// of terms defined using the `TermStreamerBuilder`.
|
||||
pub fn into_stream(self) -> TermStreamer<'a, A> {
|
||||
TermStreamer {
|
||||
pub fn into_stream(self) -> io::Result<TermStreamer<'a, A>> {
|
||||
Ok(TermStreamer {
|
||||
fst_map: self.fst_map,
|
||||
stream: self.stream_builder.into_stream(),
|
||||
term_ord: 0u64,
|
||||
current_key: Vec::with_capacity(100),
|
||||
current_value: TermInfo::default(),
|
||||
}
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
@@ -80,7 +80,6 @@ where
|
||||
.serialize(&mut counting_writer)?;
|
||||
let footer_size = counting_writer.written_bytes();
|
||||
(footer_size as u64).serialize(&mut counting_writer)?;
|
||||
counting_writer.flush()?;
|
||||
}
|
||||
Ok(file)
|
||||
}
|
||||
@@ -139,8 +138,8 @@ impl TermDictionary {
|
||||
}
|
||||
|
||||
/// Returns the ordinal associated with a given term.
|
||||
pub fn term_ord<K: AsRef<[u8]>>(&self, key: K) -> Option<TermOrdinal> {
|
||||
self.fst_index.get(key)
|
||||
pub fn term_ord<K: AsRef<[u8]>>(&self, key: K) -> io::Result<Option<TermOrdinal>> {
|
||||
Ok(self.fst_index.get(key))
|
||||
}
|
||||
|
||||
/// Returns the term associated with a given term ordinal.
|
||||
@@ -152,7 +151,7 @@ impl TermDictionary {
|
||||
///
|
||||
/// Regardless of whether the term is found or not,
|
||||
/// the buffer may be modified.
|
||||
pub fn ord_to_term(&self, mut ord: TermOrdinal, bytes: &mut Vec<u8>) -> bool {
|
||||
pub fn ord_to_term(&self, mut ord: TermOrdinal, bytes: &mut Vec<u8>) -> io::Result<bool> {
|
||||
bytes.clear();
|
||||
let fst = self.fst_index.as_fst();
|
||||
let mut node = fst.root();
|
||||
@@ -167,10 +166,10 @@ impl TermDictionary {
|
||||
let new_node_addr = transition.addr;
|
||||
node = fst.node(new_node_addr);
|
||||
} else {
|
||||
return false;
|
||||
return Ok(false);
|
||||
}
|
||||
}
|
||||
true
|
||||
Ok(true)
|
||||
}
|
||||
|
||||
/// Returns the number of terms in the dictionary.
|
||||
@@ -179,9 +178,10 @@ impl TermDictionary {
|
||||
}
|
||||
|
||||
/// Looks up the value corresponding to the key.
|
||||
pub fn get<K: AsRef<[u8]>>(&self, key: K) -> Option<TermInfo> {
|
||||
self.term_ord(key)
|
||||
.map(|term_ord| self.term_info_from_ord(term_ord))
|
||||
pub fn get<K: AsRef<[u8]>>(&self, key: K) -> io::Result<Option<TermInfo>> {
|
||||
Ok(self
|
||||
.term_ord(key)?
|
||||
.map(|term_ord| self.term_info_from_ord(term_ord)))
|
||||
}
|
||||
|
||||
/// Returns a range builder, to stream all of the terms
|
||||
@@ -191,7 +191,7 @@ impl TermDictionary {
|
||||
}
|
||||
|
||||
/// A stream of all the sorted terms. [See also `.stream_field()`](#method.stream_field)
|
||||
pub fn stream(&self) -> TermStreamer<'_> {
|
||||
pub fn stream(&self) -> io::Result<TermStreamer<'_>> {
|
||||
self.range().into_stream()
|
||||
}
|
||||
|
||||
@@ -20,438 +20,37 @@ as `u64`.
|
||||
A second data structure makes it possible to access a [`TermInfo`](../postings/struct.TermInfo.html).
|
||||
*/
|
||||
|
||||
use tantivy_fst::automaton::AlwaysMatch;
|
||||
|
||||
mod fst_termdict;
|
||||
use fst_termdict as termdict;
|
||||
|
||||
mod merger;
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests;
|
||||
|
||||
/// Position of the term in the sorted list of terms.
|
||||
pub type TermOrdinal = u64;
|
||||
|
||||
mod merger;
|
||||
mod streamer;
|
||||
mod term_info_store;
|
||||
mod termdict;
|
||||
/// The term dictionary contains all of the terms in
|
||||
/// `tantivy index` in a sorted manner.
|
||||
pub type TermDictionary = self::termdict::TermDictionary;
|
||||
|
||||
pub use self::merger::TermMerger;
|
||||
pub use self::streamer::{TermStreamer, TermStreamerBuilder};
|
||||
pub use self::termdict::{TermDictionary, TermDictionaryBuilder};
|
||||
/// Builder for the new term dictionary.
|
||||
///
|
||||
/// Inserting must be done in the order of the `keys`.
|
||||
pub type TermDictionaryBuilder<W> = self::termdict::TermDictionaryBuilder<W>;
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::{TermDictionary, TermDictionaryBuilder, TermStreamer};
|
||||
use crate::core::Index;
|
||||
use crate::directory::{Directory, FileSlice, RAMDirectory};
|
||||
use crate::postings::TermInfo;
|
||||
use crate::schema::{Schema, TEXT};
|
||||
use std::path::PathBuf;
|
||||
use std::str;
|
||||
/// Given a list of sorted term streams,
|
||||
/// returns an iterator over sorted unique terms.
|
||||
///
|
||||
/// Each yielded item is actually a pair with
|
||||
/// - the term
|
||||
/// - a slice with the ordinal of the segments containing
|
||||
/// the terms.
|
||||
pub type TermMerger<'a> = self::merger::TermMerger<'a>;
|
||||
|
||||
const BLOCK_SIZE: usize = 1_500;
|
||||
|
||||
fn make_term_info(term_ord: u64) -> TermInfo {
|
||||
let offset = |term_ord: u64| term_ord * 100 + term_ord * term_ord;
|
||||
TermInfo {
|
||||
doc_freq: term_ord as u32,
|
||||
postings_start_offset: offset(term_ord),
|
||||
postings_stop_offset: offset(term_ord + 1),
|
||||
positions_idx: offset(term_ord) * 2u64,
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_empty_term_dictionary() {
|
||||
let empty = TermDictionary::empty();
|
||||
assert!(empty.stream().next().is_none());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_term_ordinals() -> crate::Result<()> {
|
||||
const COUNTRIES: [&'static str; 7] = [
|
||||
"San Marino",
|
||||
"Serbia",
|
||||
"Slovakia",
|
||||
"Slovenia",
|
||||
"Spain",
|
||||
"Sweden",
|
||||
"Switzerland",
|
||||
];
|
||||
let directory = RAMDirectory::create();
|
||||
let path = PathBuf::from("TermDictionary");
|
||||
{
|
||||
let write = directory.open_write(&path)?;
|
||||
let mut term_dictionary_builder = TermDictionaryBuilder::create(write)?;
|
||||
for term in COUNTRIES.iter() {
|
||||
term_dictionary_builder.insert(term.as_bytes(), &make_term_info(0u64))?;
|
||||
}
|
||||
term_dictionary_builder.finish()?;
|
||||
}
|
||||
let term_file = directory.open_read(&path)?;
|
||||
let term_dict: TermDictionary = TermDictionary::open(term_file)?;
|
||||
for (term_ord, term) in COUNTRIES.iter().enumerate() {
|
||||
assert_eq!(term_dict.term_ord(term).unwrap(), term_ord as u64);
|
||||
let mut bytes = vec![];
|
||||
assert!(term_dict.ord_to_term(term_ord as u64, &mut bytes));
|
||||
assert_eq!(bytes, term.as_bytes());
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_term_dictionary_simple() -> crate::Result<()> {
|
||||
let directory = RAMDirectory::create();
|
||||
let path = PathBuf::from("TermDictionary");
|
||||
{
|
||||
let write = directory.open_write(&path)?;
|
||||
let mut term_dictionary_builder = TermDictionaryBuilder::create(write)?;
|
||||
term_dictionary_builder.insert("abc".as_bytes(), &make_term_info(34u64))?;
|
||||
term_dictionary_builder.insert("abcd".as_bytes(), &make_term_info(346u64))?;
|
||||
term_dictionary_builder.finish()?;
|
||||
}
|
||||
let file = directory.open_read(&path)?;
|
||||
let term_dict: TermDictionary = TermDictionary::open(file)?;
|
||||
assert_eq!(term_dict.get("abc").unwrap().doc_freq, 34u32);
|
||||
assert_eq!(term_dict.get("abcd").unwrap().doc_freq, 346u32);
|
||||
let mut stream = term_dict.stream();
|
||||
{
|
||||
{
|
||||
let (k, v) = stream.next().unwrap();
|
||||
assert_eq!(k.as_ref(), "abc".as_bytes());
|
||||
assert_eq!(v.doc_freq, 34u32);
|
||||
}
|
||||
assert_eq!(stream.key(), "abc".as_bytes());
|
||||
assert_eq!(stream.value().doc_freq, 34u32);
|
||||
}
|
||||
{
|
||||
{
|
||||
let (k, v) = stream.next().unwrap();
|
||||
assert_eq!(k, "abcd".as_bytes());
|
||||
assert_eq!(v.doc_freq, 346u32);
|
||||
}
|
||||
assert_eq!(stream.key(), "abcd".as_bytes());
|
||||
assert_eq!(stream.value().doc_freq, 346u32);
|
||||
}
|
||||
assert!(!stream.advance());
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_term_iterator() -> crate::Result<()> {
|
||||
let mut schema_builder = Schema::builder();
|
||||
let text_field = schema_builder.add_text_field("text", TEXT);
|
||||
let index = Index::create_in_ram(schema_builder.build());
|
||||
{
|
||||
let mut index_writer = index.writer_for_tests()?;
|
||||
index_writer.add_document(doc!(text_field=>"a b d f"));
|
||||
index_writer.commit()?;
|
||||
index_writer.add_document(doc!(text_field=>"a b c d f"));
|
||||
index_writer.commit()?;
|
||||
index_writer.add_document(doc!(text_field => "e f"));
|
||||
index_writer.commit()?;
|
||||
}
|
||||
let searcher = index.reader()?.searcher();
|
||||
|
||||
let field_searcher = searcher.field(text_field)?;
|
||||
let mut term_it = field_searcher.terms();
|
||||
let mut term_string = String::new();
|
||||
while term_it.advance() {
|
||||
//let term = Term::from_bytes(term_it.key());
|
||||
term_string.push_str(str::from_utf8(term_it.key()).expect("test"));
|
||||
}
|
||||
assert_eq!(&*term_string, "abcdef");
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_term_dictionary_stream() -> crate::Result<()> {
|
||||
let ids: Vec<_> = (0u32..10_000u32)
|
||||
.map(|i| (format!("doc{:0>6}", i), i))
|
||||
.collect();
|
||||
let buffer: Vec<u8> = {
|
||||
let mut term_dictionary_builder = TermDictionaryBuilder::create(vec![]).unwrap();
|
||||
for &(ref id, ref i) in &ids {
|
||||
term_dictionary_builder
|
||||
.insert(id.as_bytes(), &make_term_info(*i as u64))
|
||||
.unwrap();
|
||||
}
|
||||
term_dictionary_builder.finish().unwrap()
|
||||
};
|
||||
let term_file = FileSlice::from(buffer);
|
||||
let term_dictionary: TermDictionary = TermDictionary::open(term_file)?;
|
||||
{
|
||||
let mut streamer = term_dictionary.stream();
|
||||
let mut i = 0;
|
||||
while let Some((streamer_k, streamer_v)) = streamer.next() {
|
||||
let &(ref key, ref v) = &ids[i];
|
||||
assert_eq!(streamer_k.as_ref(), key.as_bytes());
|
||||
assert_eq!(streamer_v, &make_term_info(*v as u64));
|
||||
i += 1;
|
||||
}
|
||||
}
|
||||
|
||||
let &(ref key, ref val) = &ids[2047];
|
||||
assert_eq!(
|
||||
term_dictionary.get(key.as_bytes()),
|
||||
Some(make_term_info(*val as u64))
|
||||
);
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_stream_high_range_prefix_suffix() -> crate::Result<()> {
|
||||
let buffer: Vec<u8> = {
|
||||
let mut term_dictionary_builder = TermDictionaryBuilder::create(vec![]).unwrap();
|
||||
// term requires more than 16bits
|
||||
term_dictionary_builder.insert("abcdefghijklmnopqrstuvwxy", &make_term_info(1))?;
|
||||
term_dictionary_builder.insert("abcdefghijklmnopqrstuvwxyz", &make_term_info(2))?;
|
||||
term_dictionary_builder.insert("abr", &make_term_info(3))?;
|
||||
term_dictionary_builder.finish()?
|
||||
};
|
||||
let term_dict_file = FileSlice::from(buffer);
|
||||
let term_dictionary: TermDictionary = TermDictionary::open(term_dict_file)?;
|
||||
let mut kv_stream = term_dictionary.stream();
|
||||
assert!(kv_stream.advance());
|
||||
assert_eq!(kv_stream.key(), "abcdefghijklmnopqrstuvwxy".as_bytes());
|
||||
assert_eq!(kv_stream.value(), &make_term_info(1));
|
||||
assert!(kv_stream.advance());
|
||||
assert_eq!(kv_stream.key(), "abcdefghijklmnopqrstuvwxyz".as_bytes());
|
||||
assert_eq!(kv_stream.value(), &make_term_info(2));
|
||||
assert!(kv_stream.advance());
|
||||
assert_eq!(kv_stream.key(), "abr".as_bytes());
|
||||
assert_eq!(kv_stream.value(), &make_term_info(3));
|
||||
assert!(!kv_stream.advance());
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_stream_range() -> crate::Result<()> {
|
||||
let ids: Vec<_> = (0u32..10_000u32)
|
||||
.map(|i| (format!("doc{:0>6}", i), i))
|
||||
.collect();
|
||||
let buffer: Vec<u8> = {
|
||||
let mut term_dictionary_builder = TermDictionaryBuilder::create(vec![]).unwrap();
|
||||
for &(ref id, ref i) in &ids {
|
||||
term_dictionary_builder
|
||||
.insert(id.as_bytes(), &make_term_info(*i as u64))
|
||||
.unwrap();
|
||||
}
|
||||
term_dictionary_builder.finish().unwrap()
|
||||
};
|
||||
|
||||
let file = FileSlice::from(buffer);
|
||||
|
||||
let term_dictionary: TermDictionary = TermDictionary::open(file)?;
|
||||
{
|
||||
for i in (0..20).chain(6000..8_000) {
|
||||
let &(ref target_key, _) = &ids[i];
|
||||
let mut streamer = term_dictionary
|
||||
.range()
|
||||
.ge(target_key.as_bytes())
|
||||
.into_stream();
|
||||
for j in 0..3 {
|
||||
let (streamer_k, streamer_v) = streamer.next().unwrap();
|
||||
let &(ref key, ref v) = &ids[i + j];
|
||||
assert_eq!(str::from_utf8(streamer_k.as_ref()).unwrap(), key);
|
||||
assert_eq!(streamer_v.doc_freq, *v);
|
||||
assert_eq!(streamer_v, &make_term_info(*v as u64));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
{
|
||||
for i in (0..20).chain(BLOCK_SIZE - 10..BLOCK_SIZE + 10) {
|
||||
let &(ref target_key, _) = &ids[i];
|
||||
let mut streamer = term_dictionary
|
||||
.range()
|
||||
.gt(target_key.as_bytes())
|
||||
.into_stream();
|
||||
for j in 0..3 {
|
||||
let (streamer_k, streamer_v) = streamer.next().unwrap();
|
||||
let &(ref key, ref v) = &ids[i + j + 1];
|
||||
assert_eq!(streamer_k.as_ref(), key.as_bytes());
|
||||
assert_eq!(streamer_v.doc_freq, *v);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
{
|
||||
for i in (0..20).chain(BLOCK_SIZE - 10..BLOCK_SIZE + 10) {
|
||||
for j in 0..3 {
|
||||
let &(ref fst_key, _) = &ids[i];
|
||||
let &(ref last_key, _) = &ids[i + j];
|
||||
let mut streamer = term_dictionary
|
||||
.range()
|
||||
.ge(fst_key.as_bytes())
|
||||
.lt(last_key.as_bytes())
|
||||
.into_stream();
|
||||
for _ in 0..j {
|
||||
assert!(streamer.next().is_some());
|
||||
}
|
||||
assert!(streamer.next().is_none());
|
||||
}
|
||||
}
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_empty_string() -> crate::Result<()> {
|
||||
let buffer: Vec<u8> = {
|
||||
let mut term_dictionary_builder = TermDictionaryBuilder::create(vec![]).unwrap();
|
||||
term_dictionary_builder
|
||||
.insert(&[], &make_term_info(1 as u64))
|
||||
.unwrap();
|
||||
term_dictionary_builder
|
||||
.insert(&[1u8], &make_term_info(2 as u64))
|
||||
.unwrap();
|
||||
term_dictionary_builder.finish().unwrap()
|
||||
};
|
||||
let file = FileSlice::from(buffer);
|
||||
let term_dictionary: TermDictionary = TermDictionary::open(file)?;
|
||||
let mut stream = term_dictionary.stream();
|
||||
assert!(stream.advance());
|
||||
assert!(stream.key().is_empty());
|
||||
assert!(stream.advance());
|
||||
assert_eq!(stream.key(), &[1u8]);
|
||||
assert!(!stream.advance());
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_stream_range_boundaries() -> crate::Result<()> {
|
||||
let buffer: Vec<u8> = {
|
||||
let mut term_dictionary_builder = TermDictionaryBuilder::create(Vec::new())?;
|
||||
for i in 0u8..10u8 {
|
||||
let number_arr = [i; 1];
|
||||
term_dictionary_builder.insert(&number_arr, &make_term_info(i as u64))?;
|
||||
}
|
||||
term_dictionary_builder.finish()?
|
||||
};
|
||||
let file = FileSlice::from(buffer);
|
||||
let term_dictionary: TermDictionary = TermDictionary::open(file)?;
|
||||
|
||||
let value_list = |mut streamer: TermStreamer<'_>, backwards: bool| {
|
||||
let mut res: Vec<u32> = vec![];
|
||||
while let Some((_, ref v)) = streamer.next() {
|
||||
res.push(v.doc_freq);
|
||||
}
|
||||
if backwards {
|
||||
res.reverse();
|
||||
}
|
||||
res
|
||||
};
|
||||
{
|
||||
let range = term_dictionary.range().backward().into_stream();
|
||||
assert_eq!(
|
||||
value_list(range, true),
|
||||
vec![0u32, 1u32, 2u32, 3u32, 4u32, 5u32, 6u32, 7u32, 8u32, 9u32]
|
||||
);
|
||||
}
|
||||
{
|
||||
let range = term_dictionary.range().ge([2u8]).into_stream();
|
||||
assert_eq!(
|
||||
value_list(range, false),
|
||||
vec![2u32, 3u32, 4u32, 5u32, 6u32, 7u32, 8u32, 9u32]
|
||||
);
|
||||
}
|
||||
{
|
||||
let range = term_dictionary.range().ge([2u8]).backward().into_stream();
|
||||
assert_eq!(
|
||||
value_list(range, true),
|
||||
vec![2u32, 3u32, 4u32, 5u32, 6u32, 7u32, 8u32, 9u32]
|
||||
);
|
||||
}
|
||||
{
|
||||
let range = term_dictionary.range().gt([2u8]).into_stream();
|
||||
assert_eq!(
|
||||
value_list(range, false),
|
||||
vec![3u32, 4u32, 5u32, 6u32, 7u32, 8u32, 9u32]
|
||||
);
|
||||
}
|
||||
{
|
||||
let range = term_dictionary.range().gt([2u8]).backward().into_stream();
|
||||
assert_eq!(
|
||||
value_list(range, true),
|
||||
vec![3u32, 4u32, 5u32, 6u32, 7u32, 8u32, 9u32]
|
||||
);
|
||||
}
|
||||
{
|
||||
let range = term_dictionary.range().lt([6u8]).into_stream();
|
||||
assert_eq!(
|
||||
value_list(range, false),
|
||||
vec![0u32, 1u32, 2u32, 3u32, 4u32, 5u32]
|
||||
);
|
||||
}
|
||||
{
|
||||
let range = term_dictionary.range().lt([6u8]).backward().into_stream();
|
||||
assert_eq!(
|
||||
value_list(range, true),
|
||||
vec![0u32, 1u32, 2u32, 3u32, 4u32, 5u32]
|
||||
);
|
||||
}
|
||||
{
|
||||
let range = term_dictionary.range().le([6u8]).into_stream();
|
||||
assert_eq!(
|
||||
value_list(range, false),
|
||||
vec![0u32, 1u32, 2u32, 3u32, 4u32, 5u32, 6u32]
|
||||
);
|
||||
}
|
||||
{
|
||||
let range = term_dictionary.range().le([6u8]).backward().into_stream();
|
||||
assert_eq!(
|
||||
value_list(range, true),
|
||||
vec![0u32, 1u32, 2u32, 3u32, 4u32, 5u32, 6u32]
|
||||
);
|
||||
}
|
||||
{
|
||||
let range = term_dictionary.range().ge([0u8]).lt([5u8]).into_stream();
|
||||
assert_eq!(value_list(range, false), vec![0u32, 1u32, 2u32, 3u32, 4u32]);
|
||||
}
|
||||
{
|
||||
let range = term_dictionary
|
||||
.range()
|
||||
.ge([0u8])
|
||||
.lt([5u8])
|
||||
.backward()
|
||||
.into_stream();
|
||||
assert_eq!(value_list(range, true), vec![0u32, 1u32, 2u32, 3u32, 4u32]);
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_automaton_search() -> crate::Result<()> {
|
||||
use crate::query::DFAWrapper;
|
||||
use levenshtein_automata::LevenshteinAutomatonBuilder;
|
||||
|
||||
const COUNTRIES: [&'static str; 7] = [
|
||||
"San Marino",
|
||||
"Serbia",
|
||||
"Slovakia",
|
||||
"Slovenia",
|
||||
"Spain",
|
||||
"Sweden",
|
||||
"Switzerland",
|
||||
];
|
||||
|
||||
let directory = RAMDirectory::create();
|
||||
let path = PathBuf::from("TermDictionary");
|
||||
{
|
||||
let write = directory.open_write(&path)?;
|
||||
let mut term_dictionary_builder = TermDictionaryBuilder::create(write)?;
|
||||
for term in COUNTRIES.iter() {
|
||||
term_dictionary_builder.insert(term.as_bytes(), &make_term_info(0u64))?;
|
||||
}
|
||||
term_dictionary_builder.finish()?;
|
||||
}
|
||||
let file = directory.open_read(&path)?;
|
||||
let term_dict: TermDictionary = TermDictionary::open(file)?;
|
||||
|
||||
// We can now build an entire dfa.
|
||||
let lev_automaton_builder = LevenshteinAutomatonBuilder::new(2, true);
|
||||
let automaton = DFAWrapper(lev_automaton_builder.build_dfa("Spaen"));
|
||||
|
||||
let mut range = term_dict.search(automaton).into_stream();
|
||||
|
||||
// get the first finding
|
||||
assert!(range.advance());
|
||||
assert_eq!("Spain".as_bytes(), range.key());
|
||||
assert!(!range.advance());
|
||||
Ok(())
|
||||
}
|
||||
}
|
||||
/// `TermStreamer` acts as a cursor over a range of terms of a segment.
|
||||
/// Terms are guaranteed to be sorted.
|
||||
pub type TermStreamer<'a, A = AlwaysMatch> = self::termdict::TermStreamer<'a, A>;
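The doc comment above describes `TermStreamer` as a cursor over sorted terms, and the tests in this diff drive it through `advance()`, `key()`, `value()` and `next()`. The sketch below is a std-only model of that cursor pattern and not tantivy's implementation: `SliceCursor` and its fields are hypothetical names, and values are plain `u32` rather than `TermInfo`.

```rust
/// Hypothetical stand-in for a term streamer: a cursor over
/// (key, value) pairs that are assumed to be sorted by key already.
struct SliceCursor<'a> {
    entries: &'a [(Vec<u8>, u32)],
    // Number of entries consumed so far; 0 means "before the first entry".
    consumed: usize,
}

impl<'a> SliceCursor<'a> {
    fn new(entries: &'a [(Vec<u8>, u32)]) -> Self {
        SliceCursor { entries, consumed: 0 }
    }

    /// Moves to the next entry; returns false once the input is exhausted.
    fn advance(&mut self) -> bool {
        if self.consumed < self.entries.len() {
            self.consumed += 1;
            true
        } else {
            false
        }
    }

    /// Key of the current entry. Panics if `advance` has never returned true.
    fn key(&self) -> &[u8] {
        &self.entries[self.consumed - 1].0
    }

    /// Value of the current entry. Panics if `advance` has never returned true.
    fn value(&self) -> u32 {
        self.entries[self.consumed - 1].1
    }

    /// Convenience wrapper combining `advance`, `key` and `value`.
    fn next(&mut self) -> Option<(&[u8], u32)> {
        if self.advance() {
            Some((self.key(), self.value()))
        } else {
            None
        }
    }
}

fn main() {
    let entries = vec![(b"abc".to_vec(), 34u32), (b"abcd".to_vec(), 346u32)];
    let mut cursor = SliceCursor::new(&entries);
    while let Some((key, doc_freq)) = cursor.next() {
        println!("{:?} -> {}", key, doc_freq);
    }
    // Once exhausted, the cursor keeps reporting false.
    assert!(!cursor.advance());
}
```

In this model `key()` and `value()` still return the last entry after `next()` has yielded it, which mirrors the way `test_term_dictionary_simple` checks `stream.key()` right after `stream.next()`.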
|
||||
|
||||
431
src/termdict/tests.rs
Normal file
@@ -0,0 +1,431 @@
|
||||
use super::{TermDictionary, TermDictionaryBuilder, TermStreamer};
|
||||
|
||||
use crate::directory::{Directory, FileSlice, RAMDirectory, TerminatingWrite};
|
||||
use crate::postings::TermInfo;
|
||||
|
||||
use std::path::PathBuf;
|
||||
use std::str;
|
||||
|
||||
const BLOCK_SIZE: usize = 1_500;
|
||||
|
||||
fn make_term_info(term_ord: u64) -> TermInfo {
|
||||
let offset = |term_ord: u64| term_ord * 100 + term_ord * term_ord;
|
||||
TermInfo {
|
||||
doc_freq: term_ord as u32,
|
||||
postings_start_offset: offset(term_ord),
|
||||
postings_stop_offset: offset(term_ord + 1),
|
||||
positions_idx: offset(term_ord) * 2u64,
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_empty_term_dictionary() {
|
||||
let empty = TermDictionary::empty();
|
||||
assert!(empty.stream().unwrap().next().is_none());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_term_ordinals() -> crate::Result<()> {
|
||||
const COUNTRIES: [&'static str; 7] = [
|
||||
"San Marino",
|
||||
"Serbia",
|
||||
"Slovakia",
|
||||
"Slovenia",
|
||||
"Spain",
|
||||
"Sweden",
|
||||
"Switzerland",
|
||||
];
|
||||
let directory = RAMDirectory::create();
|
||||
let path = PathBuf::from("TermDictionary");
|
||||
{
|
||||
let write = directory.open_write(&path)?;
|
||||
let mut term_dictionary_builder = TermDictionaryBuilder::create(write)?;
|
||||
for term in COUNTRIES.iter() {
|
||||
term_dictionary_builder.insert(term.as_bytes(), &make_term_info(0u64))?;
|
||||
}
|
||||
term_dictionary_builder.finish()?.terminate()?;
|
||||
}
|
||||
let term_file = directory.open_read(&path)?;
|
||||
let term_dict: TermDictionary = TermDictionary::open(term_file)?;
|
||||
for (term_ord, term) in COUNTRIES.iter().enumerate() {
|
||||
assert_eq!(term_dict.term_ord(term)?, Some(term_ord as u64));
|
||||
let mut bytes = vec![];
|
||||
assert!(term_dict.ord_to_term(term_ord as u64, &mut bytes)?);
|
||||
assert_eq!(bytes, term.as_bytes());
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_term_dictionary_simple() -> crate::Result<()> {
|
||||
let directory = RAMDirectory::create();
|
||||
let path = PathBuf::from("TermDictionary");
|
||||
{
|
||||
let write = directory.open_write(&path)?;
|
||||
let mut term_dictionary_builder = TermDictionaryBuilder::create(write)?;
|
||||
term_dictionary_builder.insert("abc".as_bytes(), &make_term_info(34u64))?;
|
||||
term_dictionary_builder.insert("abcd".as_bytes(), &make_term_info(346u64))?;
|
||||
term_dictionary_builder.finish()?.terminate()?;
|
||||
}
|
||||
let file = directory.open_read(&path)?;
|
||||
let term_dict: TermDictionary = TermDictionary::open(file)?;
|
||||
assert_eq!(term_dict.get("abc")?.unwrap().doc_freq, 34u32);
|
||||
assert_eq!(term_dict.get("abcd")?.unwrap().doc_freq, 346u32);
|
||||
let mut stream = term_dict.stream()?;
|
||||
{
|
||||
{
|
||||
let (k, v) = stream.next().unwrap();
|
||||
assert_eq!(k.as_ref(), "abc".as_bytes());
|
||||
assert_eq!(v.doc_freq, 34u32);
|
||||
}
|
||||
assert_eq!(stream.key(), "abc".as_bytes());
|
||||
assert_eq!(stream.value().doc_freq, 34u32);
|
||||
}
|
||||
{
|
||||
{
|
||||
let (k, v) = stream.next().unwrap();
|
||||
assert_eq!(k, "abcd".as_bytes());
|
||||
assert_eq!(v.doc_freq, 346u32);
|
||||
}
|
||||
assert_eq!(stream.key(), "abcd".as_bytes());
|
||||
assert_eq!(stream.value().doc_freq, 346u32);
|
||||
}
|
||||
assert!(!stream.advance());
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_term_dictionary_stream() -> crate::Result<()> {
|
||||
let ids: Vec<_> = (0u32..10_000u32)
|
||||
.map(|i| (format!("doc{:0>6}", i), i))
|
||||
.collect();
|
||||
let buffer: Vec<u8> = {
|
||||
let mut term_dictionary_builder = TermDictionaryBuilder::create(vec![]).unwrap();
|
||||
for &(ref id, ref i) in &ids {
|
||||
term_dictionary_builder
|
||||
.insert(id.as_bytes(), &make_term_info(*i as u64))
|
||||
.unwrap();
|
||||
}
|
||||
term_dictionary_builder.finish()?
|
||||
};
|
||||
let term_file = FileSlice::from(buffer);
|
||||
let term_dictionary: TermDictionary = TermDictionary::open(term_file)?;
|
||||
{
|
||||
let mut streamer = term_dictionary.stream()?;
|
||||
let mut i = 0;
|
||||
while let Some((streamer_k, streamer_v)) = streamer.next() {
|
||||
let &(ref key, ref v) = &ids[i];
|
||||
assert_eq!(streamer_k.as_ref(), key.as_bytes());
|
||||
assert_eq!(streamer_v, &make_term_info(*v as u64));
|
||||
i += 1;
|
||||
}
|
||||
}
|
||||
|
||||
let &(ref key, ref val) = &ids[2047];
|
||||
assert_eq!(
|
||||
term_dictionary.get(key.as_bytes())?,
|
||||
Some(make_term_info(*val as u64))
|
||||
);
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_stream_high_range_prefix_suffix() -> crate::Result<()> {
|
||||
let buffer: Vec<u8> = {
|
||||
let mut term_dictionary_builder = TermDictionaryBuilder::create(vec![]).unwrap();
|
||||
// term requires more than 16bits
|
||||
term_dictionary_builder.insert("abcdefghijklmnopqrstuvwxy", &make_term_info(1))?;
|
||||
term_dictionary_builder.insert("abcdefghijklmnopqrstuvwxyz", &make_term_info(2))?;
|
||||
term_dictionary_builder.insert("abr", &make_term_info(3))?;
|
||||
term_dictionary_builder.finish()?
|
||||
};
|
||||
let term_dict_file = FileSlice::from(buffer);
|
||||
let term_dictionary: TermDictionary = TermDictionary::open(term_dict_file)?;
|
||||
let mut kv_stream = term_dictionary.stream()?;
|
||||
assert!(kv_stream.advance());
|
||||
assert_eq!(kv_stream.key(), "abcdefghijklmnopqrstuvwxy".as_bytes());
|
||||
assert_eq!(kv_stream.value(), &make_term_info(1));
|
||||
assert!(kv_stream.advance());
|
||||
assert_eq!(kv_stream.key(), "abcdefghijklmnopqrstuvwxyz".as_bytes());
|
||||
assert_eq!(kv_stream.value(), &make_term_info(2));
|
||||
assert!(kv_stream.advance());
|
||||
assert_eq!(kv_stream.key(), "abr".as_bytes());
|
||||
assert_eq!(kv_stream.value(), &make_term_info(3));
|
||||
assert!(!kv_stream.advance());
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_stream_range() -> crate::Result<()> {
|
||||
let ids: Vec<_> = (0u32..10_000u32)
|
||||
.map(|i| (format!("doc{:0>6}", i), i))
|
||||
.collect();
|
||||
let buffer: Vec<u8> = {
|
||||
let mut term_dictionary_builder = TermDictionaryBuilder::create(vec![]).unwrap();
|
||||
for &(ref id, ref i) in &ids {
|
||||
term_dictionary_builder
|
||||
.insert(id.as_bytes(), &make_term_info(*i as u64))
|
||||
.unwrap();
|
||||
}
|
||||
term_dictionary_builder.finish()?
|
||||
};
|
||||
|
||||
let file = FileSlice::from(buffer);
|
||||
|
||||
let term_dictionary: TermDictionary = TermDictionary::open(file)?;
|
||||
{
|
||||
for i in (0..20).chain(6000..8_000) {
|
||||
let &(ref target_key, _) = &ids[i];
|
||||
let mut streamer = term_dictionary
|
||||
.range()
|
||||
.ge(target_key.as_bytes())
|
||||
.into_stream()?;
|
||||
for j in 0..3 {
|
||||
let (streamer_k, streamer_v) = streamer.next().unwrap();
|
||||
let &(ref key, ref v) = &ids[i + j];
|
||||
assert_eq!(str::from_utf8(streamer_k.as_ref()).unwrap(), key);
|
||||
assert_eq!(streamer_v.doc_freq, *v);
|
||||
assert_eq!(streamer_v, &make_term_info(*v as u64));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
{
|
||||
for i in (0..20).chain(BLOCK_SIZE - 10..BLOCK_SIZE + 10) {
|
||||
let &(ref target_key, _) = &ids[i];
|
||||
let mut streamer = term_dictionary
|
||||
.range()
|
||||
.gt(target_key.as_bytes())
|
||||
.into_stream()?;
|
||||
for j in 0..3 {
|
||||
let (streamer_k, streamer_v) = streamer.next().unwrap();
|
||||
let &(ref key, ref v) = &ids[i + j + 1];
|
||||
assert_eq!(streamer_k.as_ref(), key.as_bytes());
|
||||
assert_eq!(streamer_v.doc_freq, *v);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
{
|
||||
for i in (0..20).chain(BLOCK_SIZE - 10..BLOCK_SIZE + 10) {
|
||||
for j in 0..3 {
|
||||
let &(ref fst_key, _) = &ids[i];
|
||||
let &(ref last_key, _) = &ids[i + j];
|
||||
let mut streamer = term_dictionary
|
||||
.range()
|
||||
.ge(fst_key.as_bytes())
|
||||
.lt(last_key.as_bytes())
|
||||
.into_stream()?;
|
||||
for _ in 0..j {
|
||||
assert!(streamer.next().is_some());
|
||||
}
|
||||
assert!(streamer.next().is_none());
|
||||
}
|
||||
}
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_empty_string() -> crate::Result<()> {
|
||||
let buffer: Vec<u8> = {
|
||||
let mut term_dictionary_builder = TermDictionaryBuilder::create(vec![]).unwrap();
|
||||
term_dictionary_builder
|
||||
.insert(&[], &make_term_info(1 as u64))
|
||||
.unwrap();
|
||||
term_dictionary_builder
|
||||
.insert(&[1u8], &make_term_info(2 as u64))
|
||||
.unwrap();
|
||||
term_dictionary_builder.finish()?
|
||||
};
|
||||
let file = FileSlice::from(buffer);
|
||||
let term_dictionary: TermDictionary = TermDictionary::open(file)?;
|
||||
let mut stream = term_dictionary.stream()?;
|
||||
assert!(stream.advance());
|
||||
assert!(stream.key().is_empty());
|
||||
assert!(stream.advance());
|
||||
assert_eq!(stream.key(), &[1u8]);
|
||||
assert!(!stream.advance());
|
||||
Ok(())
|
||||
}
|
||||
|
||||
fn stream_range_test_dict() -> crate::Result<TermDictionary> {
|
||||
let buffer: Vec<u8> = {
|
||||
let mut term_dictionary_builder = TermDictionaryBuilder::create(Vec::new())?;
|
||||
for i in 0u8..10u8 {
|
||||
let number_arr = [i; 1];
|
||||
term_dictionary_builder.insert(&number_arr, &make_term_info(i as u64))?;
|
||||
}
|
||||
term_dictionary_builder.finish()?
|
||||
};
|
||||
let file = FileSlice::from(buffer);
|
||||
TermDictionary::open(file)
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_stream_range_boundaries_forward() -> crate::Result<()> {
|
||||
let term_dictionary = stream_range_test_dict()?;
|
||||
let value_list = |mut streamer: TermStreamer<'_>| {
|
||||
let mut res: Vec<u32> = vec![];
|
||||
while let Some((_, ref v)) = streamer.next() {
|
||||
res.push(v.doc_freq);
|
||||
}
|
||||
res
|
||||
};
|
||||
{
|
||||
let range = term_dictionary.range().ge([2u8]).into_stream()?;
|
||||
assert_eq!(
|
||||
value_list(range),
|
||||
vec![2u32, 3u32, 4u32, 5u32, 6u32, 7u32, 8u32, 9u32]
|
||||
);
|
||||
}
|
||||
{
|
||||
let range = term_dictionary.range().gt([2u8]).into_stream()?;
|
||||
assert_eq!(
|
||||
value_list(range),
|
||||
vec![3u32, 4u32, 5u32, 6u32, 7u32, 8u32, 9u32]
|
||||
);
|
||||
}
|
||||
{
|
||||
let range = term_dictionary.range().lt([6u8]).into_stream()?;
|
||||
assert_eq!(value_list(range), vec![0u32, 1u32, 2u32, 3u32, 4u32, 5u32]);
|
||||
}
|
||||
{
|
||||
let range = term_dictionary.range().le([6u8]).into_stream()?;
|
||||
assert_eq!(
|
||||
value_list(range),
|
||||
vec![0u32, 1u32, 2u32, 3u32, 4u32, 5u32, 6u32]
|
||||
);
|
||||
}
|
||||
{
|
||||
let range = term_dictionary.range().ge([0u8]).lt([5u8]).into_stream()?;
|
||||
assert_eq!(value_list(range), vec![0u32, 1u32, 2u32, 3u32, 4u32]);
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_stream_range_boundaries_backward() -> crate::Result<()> {
|
||||
let term_dictionary = stream_range_test_dict()?;
|
||||
let value_list_backward = |mut streamer: TermStreamer<'_>| {
|
||||
let mut res: Vec<u32> = vec![];
|
||||
while let Some((_, ref v)) = streamer.next() {
|
||||
res.push(v.doc_freq);
|
||||
}
|
||||
res.reverse();
|
||||
res
|
||||
};
|
||||
{
|
||||
let range = term_dictionary.range().backward().into_stream()?;
|
||||
assert_eq!(
|
||||
value_list_backward(range),
|
||||
vec![0u32, 1u32, 2u32, 3u32, 4u32, 5u32, 6u32, 7u32, 8u32, 9u32]
|
||||
);
|
||||
}
|
||||
{
|
||||
let range = term_dictionary.range().ge([2u8]).backward().into_stream()?;
|
||||
assert_eq!(
|
||||
value_list_backward(range),
|
||||
vec![2u32, 3u32, 4u32, 5u32, 6u32, 7u32, 8u32, 9u32]
|
||||
);
|
||||
}
|
||||
{
|
||||
let range = term_dictionary.range().gt([2u8]).backward().into_stream()?;
|
||||
assert_eq!(
|
||||
value_list_backward(range),
|
||||
vec![3u32, 4u32, 5u32, 6u32, 7u32, 8u32, 9u32]
|
||||
);
|
||||
}
|
||||
{
|
||||
let range = term_dictionary.range().lt([6u8]).backward().into_stream()?;
|
||||
assert_eq!(
|
||||
value_list_backward(range),
|
||||
vec![0u32, 1u32, 2u32, 3u32, 4u32, 5u32]
|
||||
);
|
||||
}
|
||||
{
|
||||
let range = term_dictionary.range().le([6u8]).backward().into_stream()?;
|
||||
assert_eq!(
|
||||
value_list_backward(range),
|
||||
vec![0u32, 1u32, 2u32, 3u32, 4u32, 5u32, 6u32]
|
||||
);
|
||||
}
|
||||
{
|
||||
let range = term_dictionary
|
||||
.range()
|
||||
.ge([0u8])
|
||||
.lt([5u8])
|
||||
.backward()
|
||||
.into_stream()?;
|
||||
assert_eq!(
|
||||
value_list_backward(range),
|
||||
vec![0u32, 1u32, 2u32, 3u32, 4u32]
|
||||
);
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_ord_to_term() -> crate::Result<()> {
|
||||
let termdict = stream_range_test_dict()?;
|
||||
let mut bytes = vec![];
|
||||
for b in 0u8..10u8 {
|
||||
termdict.ord_to_term(b as u64, &mut bytes)?;
|
||||
assert_eq!(&bytes, &[b]);
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_stream_term_ord() -> crate::Result<()> {
|
||||
let termdict = stream_range_test_dict()?;
|
||||
let mut stream = termdict.stream()?;
|
||||
for b in 0u8..10u8 {
|
||||
assert!(stream.advance());
|
||||
assert_eq!(stream.term_ord(), b as u64);
|
||||
assert_eq!(stream.key(), &[b]);
|
||||
}
|
||||
assert!(!stream.advance());
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_automaton_search() -> crate::Result<()> {
|
||||
use crate::query::DFAWrapper;
|
||||
use levenshtein_automata::LevenshteinAutomatonBuilder;
|
||||
|
||||
const COUNTRIES: [&'static str; 7] = [
|
||||
"San Marino",
|
||||
"Serbia",
|
||||
"Slovakia",
|
||||
"Slovenia",
|
||||
"Spain",
|
||||
"Sweden",
|
||||
"Switzerland",
|
||||
];
|
||||
|
||||
let directory = RAMDirectory::create();
|
||||
let path = PathBuf::from("TermDictionary");
|
||||
{
|
||||
let write = directory.open_write(&path)?;
|
||||
let mut term_dictionary_builder = TermDictionaryBuilder::create(write)?;
|
||||
for term in COUNTRIES.iter() {
|
||||
term_dictionary_builder.insert(term.as_bytes(), &make_term_info(0u64))?;
|
||||
}
|
||||
term_dictionary_builder.finish()?.terminate()?;
|
||||
}
|
||||
let file = directory.open_read(&path)?;
|
||||
let term_dict: TermDictionary = TermDictionary::open(file)?;
|
||||
|
||||
// We can now build an entire dfa.
|
||||
let lev_automaton_builder = LevenshteinAutomatonBuilder::new(2, true);
|
||||
let automaton = DFAWrapper(lev_automaton_builder.build_dfa("Spaen"));
|
||||
|
||||
let mut range = term_dict.search(automaton).into_stream()?;
|
||||
|
||||
// get the first finding
|
||||
assert!(range.advance());
|
||||
assert_eq!("Spain".as_bytes(), range.key());
|
||||
assert!(!range.advance());
|
||||
Ok(())
|
||||
}
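The range-boundary tests above exercise `ge`, `gt`, `lt` and `le` on a dictionary of ten single-byte keys, in both forward and backward order. As a rough model of the semantics those assertions expect, the sketch below maps each builder call to a `std::ops::Bound` on a `BTreeMap`. It is an illustration only and does not use tantivy's API: `values_in_range` is a hypothetical helper, and values are plain `u32` standing in for `doc_freq`.

```rust
use std::collections::BTreeMap;
use std::ops::Bound;

/// Hypothetical helper: collects the values whose keys fall within the
/// given bounds. Forward order is ascending by key; `backward` reverses it.
fn values_in_range(
    dict: &BTreeMap<Vec<u8>, u32>,
    lower: Bound<Vec<u8>>,
    upper: Bound<Vec<u8>>,
    backward: bool,
) -> Vec<u32> {
    let mut values: Vec<u32> = dict.range((lower, upper)).map(|(_, v)| *v).collect();
    if backward {
        values.reverse();
    }
    values
}

fn main() {
    // Ten single-byte keys [0]..[9], each mapped to its own byte value,
    // mirroring the dictionary built by `stream_range_test_dict`.
    let dict: BTreeMap<Vec<u8>, u32> = (0u8..10).map(|i| (vec![i], i as u32)).collect();

    // ge([2]) corresponds to an inclusive lower bound.
    let ge2 = values_in_range(&dict, Bound::Included(vec![2]), Bound::Unbounded, false);
    assert_eq!(ge2, vec![2, 3, 4, 5, 6, 7, 8, 9]);

    // gt([2]) corresponds to an exclusive lower bound.
    let gt2 = values_in_range(&dict, Bound::Excluded(vec![2]), Bound::Unbounded, false);
    assert_eq!(gt2, vec![3, 4, 5, 6, 7, 8, 9]);

    // lt([6]) and le([6]) differ only in whether [6] itself is included.
    let lt6 = values_in_range(&dict, Bound::Unbounded, Bound::Excluded(vec![6]), false);
    assert_eq!(lt6, vec![0, 1, 2, 3, 4, 5]);
    let le6 = values_in_range(&dict, Bound::Unbounded, Bound::Included(vec![6]), false);
    assert_eq!(le6, vec![0, 1, 2, 3, 4, 5, 6]);

    // A backward stream visits the same entries in descending key order.
    let back = values_in_range(&dict, Bound::Included(vec![0]), Bound::Excluded(vec![5]), true);
    assert_eq!(back, vec![4, 3, 2, 1, 0]);
}
```

The backward tests reverse the collected values before comparing, which is why their expected vectors match the forward ones.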
|
||||