Compare commits

...

53 Commits

Author SHA1 Message Date
Paul Masurel
507e46f814 Added static directory 2018-10-04 23:28:44 +09:00
Paul Masurel
3d3da2d66f Compiling in WebAssembly 2018-10-04 08:45:04 +09:00
Paul Masurel
e0cdd3114d Fixing README (#427)
Closes #424.
2018-09-17 08:52:29 +09:00
Paul Masurel
f32b4a2ebe Removing release build from ci, disabling lto (#425) 2018-09-17 06:41:40 +09:00
Paul Masurel
6ff60b8ed8 Fixing README (#426) 2018-09-17 06:20:44 +09:00
Paul Masurel
8da28fb6cf Added iml filewq 2018-09-16 13:26:54 +09:00
Paul Masurel
0df2a221da Bump version pre-release 2018-09-16 13:24:14 +09:00
Paul Masurel
5449ec3c11 Snippet term score (#423) 2018-09-16 10:21:02 +09:00
Paul Masurel
10f6c07c53 Clippy (#422)
* Cargo Format
* Clippy
2018-09-15 20:20:22 +09:00
Paul Masurel
06e7bd18e7 Clippy (#421)
* Cargo Format

* Clippy

* bugfix

* still clippy stuff

* clippy step 2
2018-09-15 14:56:14 +09:00
Paul Masurel
37e4280c0a Cargo Format (#420) 2018-09-15 07:44:22 +09:00
Paul Masurel
0ba1cf93f7 Remove Searcher dereference (#419) 2018-09-14 09:54:26 +09:00
Paul Masurel
21a9940726 Update Changelog with #388 (#418) 2018-09-14 09:31:11 +09:00
pentlander
8600b8ea25 Top collector (#413)
* Make TopCollector generic

Make TopCollector take a generic type instead of only being tied to
score. This will allow for sharing code between a TopCollector that
sorts results by Score and a TopCollector that sorts documents by a fast
field. This commit makes no functional changes to TopCollector.

* Add TopFieldCollector and TopScoreCollector

Create two new collectors that use the refactored TopCollector.
TopFieldCollector has the same functionality that TopCollector
originally had. TopFieldCollector allows for sorting results by a given
fast field. Closes tantivy-search/tantivy#388

* Make TopCollector private

Make TopCollector package private and export TopFieldCollector as
TopCollector to maintain backwards compatibility. Mark TopCollector
as deprecated to encourage use of the non-aliased TopFieldCollector.
Remove Collector implementation for TopCollector since it is not longer
used.
2018-09-14 09:22:17 +09:00
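The refactor described in this commit splits the old score-only collector into a generic core. A minimal sketch of calling the renamed collector, using only calls that appear in the diffs below (`TopScoreCollector::with_limit`, `searcher.search`, `docs()`); the old `TopCollector` name remains available as a deprecated alias:

```rust
use tantivy::collector::TopScoreCollector;
use tantivy::query::Query;
use tantivy::{DocAddress, Index, Result};

// Collect the 10 best-scoring documents for a query.
fn top_ten(index: &Index, query: &Query) -> Result<Vec<DocAddress>> {
    let searcher = index.searcher();
    let mut top_collector = TopScoreCollector::with_limit(10);
    searcher.search(query, &mut top_collector)?;
    Ok(top_collector.docs())
}
```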
Paul Masurel
30f4f85d48 Closes #414. (#417)
Updating documentation for load_searchers.
2018-09-14 09:11:07 +09:00
Paul Masurel
82d25b8397 Fixing snippet example 2018-09-13 12:39:42 +09:00
Paul Masurel
2104c0277c Updating uuid 2018-09-13 09:13:37 +09:00
Paul Masurel
dd37e109f2 Merge branch 'issue/368b' 2018-09-11 20:16:14 +09:00
Paul Masurel
cc23194c58 Editing document 2018-09-11 20:15:38 +09:00
Paul Masurel
63868733a3 Added SnippetGenerator 2018-09-11 09:45:27 +09:00
Paul Masurel
644d8a3a10 Added snippet generator 2018-09-10 16:39:45 +09:00
Paul Masurel
e32dba1a97 Phrase weight 2018-09-10 09:26:33 +09:00
Paul Masurel
a78aa4c259 updating doc 2018-09-09 17:23:30 +09:00
Paul Masurel
7e5f697d00 Closes #387 2018-09-09 16:23:56 +09:00
Paul Masurel
a78f4cca37 Merge branch 'issue/368' into issue/368b 2018-09-09 16:04:20 +09:00
Paul Masurel
2e44f0f099 blop 2018-09-09 14:23:24 +09:00
Vignesh Sarma K
9ccba9f864 Merge branch 'master' into issue/368 2018-09-07 20:27:38 +05:30
Paul Masurel
9101bf5753 Fragments 2018-09-07 09:57:12 +09:00
Paul Masurel
23e97da9f6 Merge branch 'master' of github.com:tantivy-search/tantivy 2018-09-07 08:44:14 +09:00
Paul Masurel
1d439e96f5 Using sort unstable by key. 2018-09-07 08:43:44 +09:00
Paul Masurel
934933582e Closes #402 (#403) 2018-09-06 10:12:26 +09:00
Paul Masurel
98c7fbdc6f Issue/378 (#392)
* Added failing unit test

* Closes #378. Handling queries that end up empty after going through the analyzer.

* Fixed stop word example
2018-09-06 10:11:54 +09:00
Paul Masurel
cec9956a01 Issue/389 (#405)
* Setting up the dependency.

* Completed README
2018-09-06 10:10:40 +09:00
Paul Masurel
c64972e039 Apply unicode lowercasing. (#408)
Checks if the str is ASCII, and uses a fast track if it is the case.
If not, the std's definition of a lowercase character.

Closes #406
2018-09-05 09:43:56 +09:00
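The ASCII fast path described in this commit message can be sketched with nothing but the standard library (an illustration of the idea, not the actual tantivy token filter):

```rust
// Lowercase ASCII-only text with a cheap byte-wise pass; fall back to the
// standard library's Unicode-aware lowercasing for everything else.
fn lowercase(text: &str) -> String {
    if text.is_ascii() {
        // Fast track: ASCII lowercasing never changes the byte length.
        text.to_ascii_lowercase()
    } else {
        // Slow path: the std definition of lowercase, as the commit describes.
        text.to_lowercase()
    }
}

fn main() {
    assert_eq!(lowercase("Hello"), "hello");
    assert_eq!(lowercase("Grüße"), "grüße");
}
```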
Paul Masurel
b3b2421e8a Issue/367 (#404)
* First stab

* Closes #367
2018-09-04 09:17:00 +09:00
Paul Masurel
f570fe37d4 small changes 2018-08-31 09:03:44 +09:00
Paul Masurel
6704ab6987 Added methods to extract the matching terms. First stab 2018-08-30 09:47:19 +09:00
Paul Masurel
a12d211330 Extracting terms matching query in the document 2018-08-30 09:23:34 +09:00
Paul Masurel
ee681a4dd1 Added say thanks badge 2018-08-29 11:06:04 +09:00
petr-tik
d15efd6635 Closes #235 - adds a new error type (#398)
error message suggests possible causes

Addressed code review 1 thread + smaller heap size
2018-08-29 08:26:59 +09:00
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
18814ba0c1 add a test for second fragment having higher score 2018-08-28 22:27:56 +05:30
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
f247935bb9 Use HighlightSection::new rather than just directly creating the object 2018-08-28 22:16:22 +05:30
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
6a197e023e ran rustfmt 2018-08-28 20:41:58 +05:30
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
96a313c6dd add more tests 2018-08-28 20:41:58 +05:30
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
fb9b1c1f41 add a test and fix the bug of not calculating first token 2018-08-28 20:41:58 +05:30
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
e1bca6db9d update calculate_score to try_add_token
`try_add_token` will now update the stop_offset as well.
`FragmentCandidate::new` now just takes `start_offset`,
it expects `try_add_token` to be called to add a token.
2018-08-28 20:41:58 +05:30
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
8438eda01a use while let instead of loop and if.
as per CR comment
2018-08-28 20:41:57 +05:30
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
b373f00840 add htmlescape and update to_html fn to use it.
tests and imports also updated.
2018-08-28 20:41:57 +05:30
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
46decdb0ea compare against accumulator rather than init value 2018-08-28 20:41:41 +05:30
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
835cdc2fe8 Initial version of snippet
refer #368
2018-08-28 20:41:41 +05:30
Paul Masurel
19756bb7d6 Getting started on #368 2018-08-28 20:41:41 +05:30
CJP10
57e1f8ed28 Missed a closing bracket (#397) 2018-08-28 23:17:59 +09:00
Paul Masurel
2649c8a715 Issue/246 (#393)
* Moving Range and All to Leaves

* Parsing OR/AND

* Simplify user input ast

* AND and OR supported. Returning an error when mixing syntax

Closes #246

* Added support for NOT

* Updated changelog
2018-08-28 11:03:54 +09:00
124 changed files with 3000 additions and 1255 deletions

.gitignore

@@ -1,3 +1,4 @@
+tantivy.iml
 *.swp
 target
 target/debug

CHANGELOG.md

@@ -4,7 +4,9 @@ Tantivy 0.7
 - Skip data for doc ids and positions (@fulmicoton),
 greatly improving performance
 - Tantivy error now rely on the failure crate (@drusellers)
+- Added support for `AND`, `OR`, `NOT` syntax in addition to the `+`,`-` syntax
+- Added a snippet generator with highlight (@vigneshsarma, @fulmicoton)
+- Added a `TopFieldCollector` (@pentlander)
 Tantivy 0.6.1
 =========================

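The new query syntax listed in the changelog can be exercised through `QueryParser`, which appears unchanged in the examples further down. A small sketch, assuming `title` and `body` are field handles obtained from the schema; mixing the `+`/`-` syntax with the boolean operators in a single query is rejected, per the commit message for #393:

```rust
use tantivy::query::{Query, QueryParser};
use tantivy::schema::Field;
use tantivy::{Index, Result};

fn parse_examples(index: &Index, title: Field, body: Field) -> Result<Vec<Box<Query>>> {
    let query_parser = QueryParser::for_index(index, vec![title, body]);
    // The pre-existing `+`/`-` syntax still parses...
    let q1 = query_parser.parse_query("+michael -jackson")?;
    // ...and AND / OR (and NOT) are now accepted as well.
    let q2 = query_parser.parse_query("(michael AND jackson) OR \"king of pop\"")?;
    Ok(vec![q1, q2])
}
```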
Cargo.toml

@@ -1,6 +1,6 @@
 [package]
 name = "tantivy"
-version = "0.7.0-dev"
+version = "0.7.0"
 authors = ["Paul Masurel <paul.masurel@gmail.com>"]
 license = "MIT"
 categories = ["database-implementations", "data-structures"]
@@ -15,10 +15,9 @@ keywords = ["search", "information", "retrieval"]
 base64 = "0.9.1"
 byteorder = "1.0"
 lazy_static = "1"
-tinysegmenter = "0.1.0"
 regex = "1.0"
 fst = {version="0.3", default-features=false}
-fst-regex = { version="0.2" }
+fst-regex = { version="0.2", optional=true}
 lz4 = {version="1.20", optional=true}
 snap = {version="0.2"}
 atomicwrites = {version="0.2.2", optional=true}
@@ -33,7 +32,7 @@ num_cpus = "1.2"
 itertools = "0.7"
 levenshtein_automata = {version="0.1", features=["fst_automaton"]}
 bit-set = "0.5"
-uuid = { version = "0.6", features = ["v4", "serde"] }
+uuid = { version = "0.7", features = ["v4", "serde"] }
 crossbeam = "0.4"
 crossbeam-channel = "0.2"
 futures = "0.1"
@@ -48,24 +47,34 @@ census = "0.1"
 fnv = "1.0.6"
 owned-read = "0.4"
 failure = "0.1"
+htmlescape = "0.3.1"
+fail = "0.2"
 [target.'cfg(windows)'.dependencies]
 winapi = "0.2"
 [dev-dependencies]
 rand = "0.5"
+maplit = "1"
 [profile.release]
 opt-level = 3
 debug = false
-lto = true
 debug-assertions = false
+[profile.test]
+debug-assertions = true
+overflow-checks = true
 [features]
-default = ["mmap"]
+# by default no-fail is disabled. We manually enable it when running test.
+default = ["mmap", "no_fail", "regex_query"]
 mmap = ["fst/mmap", "atomicwrites"]
+regex_query = ["fst-regex"]
 lz4-compression = ["lz4"]
+no_fail = ["fail/no_fail"]
 [badges]
 travis-ci = { repository = "tantivy-search/tantivy" }

README.md

@@ -4,6 +4,7 @@
 [![Join the chat at https://gitter.im/tantivy-search/tantivy](https://badges.gitter.im/tantivy-search/tantivy.svg)](https://gitter.im/tantivy-search/tantivy?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 [![Build status](https://ci.appveyor.com/api/projects/status/r7nb13kj23u8m9pj/branch/master?svg=true)](https://ci.appveyor.com/project/fulmicoton/tantivy/branch/master)
+[![Say Thanks!](https://img.shields.io/badge/Say%20Thanks-!-1EAEDB.svg)](https://saythanks.io/to/fulmicoton)
 ![Tantivy](https://tantivy-search.github.io/logo/tantivy-logo.png)
@@ -32,8 +33,8 @@ Tantivy is, in fact, strongly inspired by Lucene's design.
 - Fast (check out the :racehorse: :sparkles: [benchmark](https://tantivy-search.github.io/bench/) :sparkles: :racehorse:)
 - Tiny startup time (<10ms), perfect for command line tools
 - BM25 scoring (the same as lucene)
-- Basic query language (`+michael +jackson`)
-- Phrase queries search (\"michael jackson\"`)
+- Natural query language `(michael AND jackson) OR "king of pop"`
+- Phrase queries search (`"michael jackson"`)
 - Incremental indexing
 - Multithreaded indexing (indexing English Wikipedia takes < 3 minutes on my desktop)
 - Mmap directory
@@ -43,12 +44,14 @@ Tantivy is, in fact, strongly inspired by Lucene's design.
 - LZ4 compressed document store
 - Range queries
 - Faceted search
-- Configurable indexing (optional term frequency and position indexing
+- Configurable indexing (optional term frequency and position indexing)
 - Cheesy logo with a horse
 # Non-features
-- Distributed search and will not be in the scope of tantivy.
+- Distributed search is out of the scope of tantivy. That being said, tantivy is meant as a
+library upon which one could build a distributed search. Serializable/mergeable collector state for instance,
+are within the scope of tantivy.
 # Supported OS and compiler
@@ -77,6 +80,10 @@ To check out and run tests, you can simply run :
 cd tantivy
 cargo build
+## Running tests
+Some tests will not run with just `cargo test` because of `fail-rs`.
+To run the tests exhaustively, run `./run-tests.sh`.
 # Contribute

appveyor.yml

@@ -18,5 +18,5 @@ install:
 build: false
 test_script:
-- REM SET RUST_LOG=tantivy,test & cargo test --verbose
+- REM SET RUST_LOG=tantivy,test & cargo test --verbose --no-default-features --features mmap -- --test-threads 1
 - REM SET RUST_BACKTRACE=1 & cargo build --examples

ci/script.sh

@@ -11,12 +11,11 @@ main() {
 else
 echo "Build"
 cross build --target $TARGET
-cross build --target $TARGET --release
 if [ ! -z $DISABLE_TESTS ]; then
 return
 fi
 echo "Test"
-cross test --target $TARGET
+cross test --target $TARGET --no-default-features --features mmap -- --test-threads 1
 fi
 for example in $(ls examples/*.rs)
 do

SUMMARY.md

@@ -9,6 +9,7 @@
 - [Facetting](./facetting.md)
 - [Innerworkings](./innerworkings.md)
 - [Inverted index](./inverted_index.md)
+- [Best practise](./inverted_index.md)
 [Frequently Asked Questions](./faq.md)
 [Examples](./examples.md)

@@ -2,8 +2,8 @@
 > Tantivy is a **search** engine **library** for Rust.
-If you are familiar with Lucene, tantivy is heavily inspired by Lucene's design and
-they both have the same scope and targetted users.
+If you are familiar with Lucene, it's an excellent approximation to consider tantivy as Lucene for rust. tantivy is heavily inspired by Lucene's design and
+they both have the same scope and targetted use cases.
 If you are not familiar with Lucene, let's break down our little tagline.
@@ -17,15 +17,18 @@ relevancy, collapsing, highlighting, spatial search.
 experience. But keep in mind this is just a toolbox.
 Which bring us to the second keyword...
-- **Library** means that you will have to write code. tantivy is not an *all-in-one* server solution.
-Sometimes a functionality will not be available in tantivy because it is too specific to your use case. By design, tantivy should make it possible to extend
-the available set of features using the existing rock-solid datastructures.
-Most frequently this will mean writing your own `Collector`, your own `Scorer` or your own
-`Tokenizer/TokenFilter`... But some of your requirement may also be related to
-architecture or operations. For instance, you may want to build a large corpus on Hadoop,
-fine-tune the merge policy to keep your index sharded in a time-wise fashion, or you may want
-to convert and existing index from a different format.
-Tantivy exposes its API to do all of these things.
+- **Library** means that you will have to write code. tantivy is not an *all-in-one* server solution like elastic search for instance.
+Sometimes a functionality will not be available in tantivy because it is too
+specific to your use case. By design, tantivy should make it possible to extend
+the available set of features using the existing rock-solid datastructures.
+Most frequently this will mean writing your own `Collector`, your own `Scorer` or your own
+`TokenFilter`... Some of your requirements may also be related to
+something closer to architecture or operations. For instance, you may
+want to build a large corpus on Hadoop, fine-tune the merge policy to keep your
+index sharded in a time-wise fashion, or you may want to convert and existing
+index from a different format.
+Tantivy exposes a lot of low level API to do all of these things.

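The custom `Collector` mentioned in the chapter above is a small trait to implement. A minimal sketch based on the trait methods visible in the `top_collector.rs` diff on this page (`set_segment`, `collect`, `requires_scoring`); it simply counts matching documents, much like the built-in `CountCollector`:

```rust
extern crate tantivy;

use tantivy::collector::Collector;
use tantivy::{DocId, Result, Score, SegmentLocalId, SegmentReader};

/// Counts every document matching the query, ignoring scores.
struct MatchCounter {
    count: usize,
}

impl Collector for MatchCounter {
    fn set_segment(&mut self, _segment_id: SegmentLocalId, _reader: &SegmentReader) -> Result<()> {
        // Nothing per-segment to prepare for a simple counter.
        Ok(())
    }

    fn collect(&mut self, _doc: DocId, _score: Score) {
        self.count += 1;
    }

    fn requires_scoring(&self) -> bool {
        // Counting does not need scores, so scoring work can be skipped.
        false
    }
}
```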
@@ -2,47 +2,76 @@
 ## Straight from disk
-By default, tantivy accesses its data using its `MMapDirectory`.
-While this design has some downsides, this greatly simplifies the source code of tantivy,
-and entirely delegates the caching to the OS.
+Tantivy accesses its data using an abstracting trait called `Directory`.
+In theory, one can come and override the data access logic. In practise, the
+trait somewhat assumes that your data can be mapped to memory, and tantivy
+seems deeply married to using `mmap` for its io [^1], and the only persisting
+directory shipped with tantivy is the `MmapDirectory`.
+While this design has some downsides, this greatly simplifies the source code of
+tantivy. Caching is also entirely delegated to the OS.
-`tantivy` works entirely (or almost) by directly reading the datastructures as they are layed on disk.
-As a result, the act of opening an indexing does not involve loading different datastructures
-from the disk into random access memory : starting a process, opening an index, and performing a query
-can typically be done in a matter of milliseconds.
-This is an interesting property for a command line search engine, or for some multi-tenant log search engine.
-Spawning a new process for each new query can be a perfectly sensible solution in some use case.
+`tantivy` works entirely (or almost) by directly reading the datastructures as they are layed on disk. As a result, the act of opening an indexing does not involve loading different datastructures from the disk into random access memory : starting a process, opening an index, and performing your first query can typically be done in a matter of milliseconds.
+This is an interesting property for a command line search engine, or for some multi-tenant log search engine : spawning a new process for each new query can be a perfectly sensible solution in some use case.
 In later chapters, we will discuss tantivy's inverted index data layout.
 One key take away is that to achieve great performance, search indexes are extremely compact.
 Of course this is crucial to reduce IO, and ensure that as much of our index can sit in RAM.
-Also, whenever possible the data is accessed sequentially. Of course, this is an amazing property when tantivy needs to access
-the data from your spinning hard disk, but this is also a great property when working with `SSD` or `RAM`,
-as it makes our read patterns very predictable for the CPU.
+Also, whenever possible its data is accessed sequentially. Of course, this is an amazing property when tantivy needs to access the data from your spinning hard disk, but this is also
+critical for performance, if your data is read from and an `SSD` or even already in your pagecache.
 ## Segments, and the log method
-That kind compact layout comes at one cost: it prevents our datastructures from being dynamic.
-In fact, a trait called `Directory` is in charge of abstracting all of tantivy's data access
-and its API does not even allow editing these file once they are written.
+That kind of compact layout comes at one cost: it prevents our datastructures from being dynamic.
+In fact, the `Directory` trait does not even allow you to modify part of a file.
 To allow the addition / deletion of documents, and create the illusion that
-your index is dynamic (i.e.: adding and deleting documents), tantivy uses a common database trick sometimes
-referred to as the *log method*.
+your index is dynamic (i.e.: adding and deleting documents), tantivy uses a common database trick sometimes referred to as the *log method*.
-Let's forget about deletes for a moment. As you add documents, these documents are processed and stored in
-a dedicated datastructure, in a `RAM` buffer. This datastructure is designed to be dynamic but
-cannot be accessed for search. As you add documents, this buffer will reach its capacity and tantivy will
-transparently stop adding document to it and start converting this datastructure to its final
-read-only format on disk. Once written, an brand empty buffer is available to resume adding documents.
+Let's forget about deletes for a moment.
+As you add documents, these documents are processed and stored in a dedicated datastructure, in a `RAM` buffer. This datastructure is not ready for search, but it is useful to receive your data and rearrange it very rapidly.
+As you add documents, this buffer will reach its capacity and tantivy will transparently stop adding document to it and start converting this datastructure to its final read-only format on disk. Once written, an brand empty buffer is available to resume adding documents.
 The resulting chunk of index obtained after this serialization is called a `Segment`.
-> A segment is a self-contained atomic piece of index. It is identified with a UUID, and all of its files
-are identified using the naming scheme : `<UUID>.*`.
+> A segment is a self-contained atomic piece of index. It is identified with a UUID, and all of its files are identified using the naming scheme : `<UUID>.*`.
+Which brings us to the nature of a tantivy `Index`.
+> A tantivy `Index` is a collection of `Segments`.
+Physically, this really just means and index is a bunch of segment files in a given `Directory`,
+linked together by a `meta.json` file. This transparency can become extremely handy
+to get tantivy to fit your use case:
+*Example 1* You could for instance use hadoop to build a very large search index in a timely manner, copy all of the resulting segment files in the same directory and edit the `meta.json` to get a functional index.[^2]
+*Example 2* You could also disable your merge policy and enforce daily segments. Removing data after one week can then be done very efficiently by just editing the `meta.json` and deleting the files associated to segment `D-7`.
-> A tantivy `Index` is a collection of `Segments`.
+# Merging
+As you index more and more data, your index will accumulate more and more segments.
+Having a lot of small segments is not really optimal. There is a bit of redundancy in having
+all these term dictionary. Also when searching, we will need to do term lookups as many times as we have segments. It can hurt search performance a bit.
+That's where merging or compacting comes into place. Tantivy will continuously consider merge
+opportunities and start merging segments in the background.
+# Indexing throughput, number of indexing threads
+[^1]: This may eventually change.
+[^2]: Be careful however. By default these files will not be considered as *managed* by tantivy. This means they will never be garbage collected by tantivy, regardless of whether they become obsolete or not.

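The "log method" above can be observed directly with the API calls used by the examples on this page: every commit seals the current RAM buffer into a new immutable segment. A small sketch (the segment count may be lower if a background merge has already run):

```rust
#[macro_use]
extern crate tantivy;

use tantivy::schema::{SchemaBuilder, TEXT};
use tantivy::Index;

fn main() -> tantivy::Result<()> {
    let mut schema_builder = SchemaBuilder::default();
    let title = schema_builder.add_text_field("title", TEXT);
    let schema = schema_builder.build();
    let index = Index::create_in_ram(schema);

    let mut index_writer = index.writer(50_000_000)?;
    index_writer.add_document(doc!(title => "first batch"));
    index_writer.commit()?; // serializes the first segment

    index_writer.add_document(doc!(title => "second batch"));
    index_writer.commit()?; // a second segment appears

    index.load_searchers()?;
    let searcher = index.searcher();
    println!("segments: {}", searcher.segment_readers().len());
    Ok(())
}
```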
examples.md

@@ -1 +1,3 @@
 # Examples
+- [Basic search](/examples/basic_search.html)

examples/basic_search.rs

@@ -10,7 +10,6 @@
 // - search for the best document matchings "sea whale"
 // - retrieve the best document original content.
 extern crate tempdir;
 // ---
@@ -231,13 +230,11 @@ fn main() -> tantivy::Result<()> {
 // a title.
 for doc_address in doc_addresses {
-let retrieved_doc = searcher.doc(&doc_address)?;
+let retrieved_doc = searcher.doc(doc_address)?;
 println!("{}", schema.to_json(&retrieved_doc));
 }
 Ok(())
 }
 use tempdir::TempDir;

@@ -3,7 +3,6 @@
 // In this example, we'll see how to define a tokenizer pipeline
 // by aligning a bunch of `TokenFilter`.
 #[macro_use]
 extern crate tantivy;
 use tantivy::collector::TopCollector;
@@ -12,7 +11,6 @@ use tantivy::schema::*;
 use tantivy::tokenizer::NgramTokenizer;
 use tantivy::Index;
 fn main() -> tantivy::Result<()> {
 // # Defining the schema
 //
@@ -111,7 +109,7 @@ fn main() -> tantivy::Result<()> {
 let doc_addresses = top_collector.docs();
 for doc_address in doc_addresses {
-let retrieved_doc = searcher.doc(&doc_address)?;
+let retrieved_doc = searcher.doc(doc_address)?;
 println!("{}", schema.to_json(&retrieved_doc));
 }

@@ -11,10 +11,9 @@
 #[macro_use]
 extern crate tantivy;
 use tantivy::collector::TopCollector;
+use tantivy::query::TermQuery;
 use tantivy::schema::*;
 use tantivy::Index;
-use tantivy::query::TermQuery;
 // A simple helper function to fetch a single document
 // given its id from our index.
@@ -31,8 +30,8 @@ fn extract_doc_given_isbn(index: &Index, isbn_term: &Term) -> tantivy::Result<Op
 let mut top_collector = TopCollector::with_limit(1);
 searcher.search(&term_query, &mut top_collector)?;
 if let Some(doc_address) = top_collector.docs().first() {
-let doc = searcher.doc(doc_address)?;
+let doc = searcher.doc(*doc_address)?;
 Ok(Some(doc))
 } else {
 // no doc matching this ID.
@@ -41,7 +40,6 @@ fn extract_doc_given_isbn(index: &Index, isbn_term: &Term) -> tantivy::Result<Op
 }
 fn main() -> tantivy::Result<()> {
 // # Defining the schema
 //
 // Check out the *basic_search* example if this makes
@@ -126,7 +124,6 @@ fn main() -> tantivy::Result<()> {
 isbn => "978-9176370711",
 ));
 // You are guaranteed that your clients will only observe your index in
 // the state it was in after a commit.
 // In this example, your search engine will at no point be missing the *Frankenstein* document.
@@ -143,4 +140,4 @@ fn main() -> tantivy::Result<()> {
 );
 Ok(())
 }

@@ -22,60 +22,60 @@ use tantivy::schema::*;
 use tantivy::Index;
 fn main() -> tantivy::Result<()> {
 // Let's create a temporary directory for the
 // sake of this example
 let index_path = TempDir::new("tantivy_facet_example_dir")?;
 let mut schema_builder = SchemaBuilder::default();
 schema_builder.add_text_field("name", TEXT | STORED);
 // this is our faceted field
 schema_builder.add_facet_field("tags");
 let schema = schema_builder.build();
 let index = Index::create_in_dir(&index_path, schema.clone())?;
 let mut index_writer = index.writer(50_000_000)?;
 let name = schema.get_field("name").unwrap();
 let tags = schema.get_field("tags").unwrap();
 // For convenience, tantivy also comes with a macro to
 // reduce the boilerplate above.
 index_writer.add_document(doc!(
 name => "the ditch",
 tags => Facet::from("/pools/north")
 ));
 index_writer.add_document(doc!(
 name => "little stacey",
 tags => Facet::from("/pools/south")
 ));
 index_writer.commit()?;
 index.load_searchers()?;
 let searcher = index.searcher();
 let mut facet_collector = FacetCollector::for_field(tags);
 facet_collector.add_facet("/pools");
 searcher.search(&AllQuery, &mut facet_collector).unwrap();
 let counts = facet_collector.harvest();
 // This lists all of the facet counts
 let facets: Vec<(&Facet, u64)> = counts.get("/pools").collect();
 assert_eq!(
 facets,
 vec![
 (&Facet::from("/pools/north"), 1),
-(&Facet::from("/pools/south"), 1)
+(&Facet::from("/pools/south"), 1),
 ]
 );
 Ok(())
 }
 use tempdir::TempDir;

@@ -7,18 +7,15 @@
 // the list of documents containing a term, getting
 // its term frequency, and accessing its positions.
 // ---
 // Importing tantivy...
 #[macro_use]
 extern crate tantivy;
 use tantivy::schema::*;
 use tantivy::Index;
-use tantivy::{DocSet, DocId, Postings};
+use tantivy::{DocId, DocSet, Postings};
 fn main() -> tantivy::Result<()> {
 // We first create a schema for the sake of the
 // example. Check the `basic_search` example for more information.
 let mut schema_builder = SchemaBuilder::default();
@@ -47,7 +44,6 @@ fn main() -> tantivy::Result<()> {
 // there is actually only one segment here, but let's iterate through the list
 // anyway)
 for segment_reader in searcher.segment_readers() {
 // A segment contains different data structure.
 // Inverted index stands for the combination of
 // - the term dictionary
@@ -58,19 +54,18 @@ fn main() -> tantivy::Result<()> {
 // Let's go through all docs containing the term `title:the` and access their position
 let term_the = Term::from_field_text(title, "the");
 // This segment posting object is like a cursor over the documents matching the term.
 // The `IndexRecordOption` arguments tells tantivy we will be interested in both term frequencies
 // and positions.
 //
 // If you don't need all this information, you may get better performance by decompressing less
 // information.
-if let Some(mut segment_postings) = inverted_index.read_postings(&term_the, IndexRecordOption::WithFreqsAndPositions) {
+if let Some(mut segment_postings) =
+inverted_index.read_postings(&term_the, IndexRecordOption::WithFreqsAndPositions)
+{
 // this buffer will be used to request for positions
 let mut positions: Vec<u32> = Vec::with_capacity(100);
 while segment_postings.advance() {
 // the number of time the term appears in the document.
 let doc_id: DocId = segment_postings.doc(); //< do not try to access this before calling advance once.
@@ -98,7 +93,6 @@ fn main() -> tantivy::Result<()> {
 }
 }
 // A `Term` is a text token associated with a field.
 // Let's go through all docs containing the term `title:the` and access their position
 let term_the = Term::from_field_text(title, "the");
@@ -111,7 +105,6 @@ fn main() -> tantivy::Result<()> {
 // Also, for some VERY specific high performance use case like an OLAP analysis of logs,
 // you can get better performance by accessing directly the blocks of doc ids.
 for segment_reader in searcher.segment_readers() {
 // A segment contains different data structure.
 // Inverted index stands for the combination of
 // - the term dictionary
@@ -124,7 +117,9 @@ fn main() -> tantivy::Result<()> {
 //
 // If you don't need all this information, you may get better performance by decompressing less
 // information.
-if let Some(mut block_segment_postings) = inverted_index.read_block_postings(&term_the, IndexRecordOption::Basic) {
+if let Some(mut block_segment_postings) =
+inverted_index.read_block_postings(&term_the, IndexRecordOption::Basic)
+{
 while block_segment_postings.advance() {
 // Once again these docs MAY contains deleted documents as well.
 let docs = block_segment_postings.docs();
@@ -136,4 +131,3 @@ fn main() -> tantivy::Result<()> {
 Ok(())
 }

examples/snippet.rs (new file, 71 lines)

@@ -0,0 +1,71 @@
// # Snippet example
//
// This example shows how to return a representative snippet of
// your hit result.
// Snippet are an extracted of a target document, and returned in HTML format.
// The keyword searched by the user are highlighted with a `<b>` tag.
extern crate tempdir;
// ---
// Importing tantivy...
#[macro_use]
extern crate tantivy;
use tantivy::collector::TopCollector;
use tantivy::query::QueryParser;
use tantivy::schema::*;
use tantivy::Index;
use tantivy::SnippetGenerator;
use tempdir::TempDir;
fn main() -> tantivy::Result<()> {
// Let's create a temporary directory for the
// sake of this example
let index_path = TempDir::new("tantivy_example_dir")?;
// # Defining the schema
let mut schema_builder = SchemaBuilder::default();
let title = schema_builder.add_text_field("title", TEXT | STORED);
let body = schema_builder.add_text_field("body", TEXT | STORED);
let schema = schema_builder.build();
// # Indexing documents
let index = Index::create_in_dir(&index_path, schema.clone())?;
let mut index_writer = index.writer(50_000_000)?;
// we'll only need one doc for this example.
index_writer.add_document(doc!(
title => "Of Mice and Men",
body => "A few miles south of Soledad, the Salinas River drops in close to the hillside \
bank and runs deep and green. The water is warm too, for it has slipped twinkling \
over the yellow sands in the sunlight before reaching the narrow pool. On one \
side of the river the golden foothill slopes curve up to the strong and rocky \
Gabilan Mountains, but on the valley side the water is lined with trees—willows \
fresh and green with every spring, carrying in their lower leaf junctures the \
debris of the winters flooding; and sycamores with mottled, white, recumbent \
limbs and branches that arch over the pool"
));
// ...
index_writer.commit()?;
index.load_searchers()?;
let searcher = index.searcher();
let query_parser = QueryParser::for_index(&index, vec![title, body]);
let query = query_parser.parse_query("sycamore spring")?;
let mut top_collector = TopCollector::with_limit(10);
searcher.search(&*query, &mut top_collector)?;
let snippet_generator = SnippetGenerator::new(&searcher, &*query, body)?;
let doc_addresses = top_collector.docs();
for doc_address in doc_addresses {
let doc = searcher.doc(doc_address)?;
let snippet = snippet_generator.snippet_from_doc(&doc);
println!("title: {}", doc.get_first(title).unwrap().text().unwrap());
println!("snippet: {}", snippet.to_html());
}
Ok(())
}

@@ -22,72 +22,71 @@ use tantivy::tokenizer::*;
 use tantivy::Index;
 fn main() -> tantivy::Result<()> {
 // this example assumes you understand the content in `basic_search`
-let index_path = TempDir::new("tantivy_stopwords_example_dir")?;
 let mut schema_builder = SchemaBuilder::default();
 // This configures your custom options for how tantivy will
 // store and process your content in the index; The key
 // to note is that we are setting the tokenizer to `stoppy`
 // which will be defined and registered below.
 let text_field_indexing = TextFieldIndexing::default()
 .set_tokenizer("stoppy")
 .set_index_option(IndexRecordOption::WithFreqsAndPositions);
 let text_options = TextOptions::default()
 .set_indexing_options(text_field_indexing)
 .set_stored();
 // Our first field is title.
 schema_builder.add_text_field("title", text_options);
 // Our second field is body.
 let text_field_indexing = TextFieldIndexing::default()
 .set_tokenizer("stoppy")
 .set_index_option(IndexRecordOption::WithFreqsAndPositions);
 let text_options = TextOptions::default()
 .set_indexing_options(text_field_indexing)
 .set_stored();
 schema_builder.add_text_field("body", text_options);
 let schema = schema_builder.build();
-let index = Index::create_in_dir(&index_path, schema.clone())?;
+let index = Index::create_in_ram(schema.clone());
 // This tokenizer lowers all of the text (to help with stop word matching)
 // then removes all instances of `the` and `and` from the corpus
 let tokenizer = SimpleTokenizer
 .filter(LowerCaser)
 .filter(StopWordFilter::remove(vec![
 "the".to_string(),
 "and".to_string(),
 ]));
 index.tokenizers().register("stoppy", tokenizer);
 let mut index_writer = index.writer(50_000_000)?;
 let title = schema.get_field("title").unwrap();
 let body = schema.get_field("body").unwrap();
 index_writer.add_document(doc!(
 title => "The Old Man and the Sea",
 body => "He was an old man who fished alone in a skiff in the Gulf Stream and \
 he had gone eighty-four days now without taking a fish."
 ));
 index_writer.add_document(doc!(
 title => "Of Mice and Men",
 body => "A few miles south of Soledad, the Salinas River drops in close to the hillside \
 bank and runs deep and green. The water is warm too, for it has slipped twinkling \
 over the yellow sands in the sunlight before reaching the narrow pool. On one \
 side of the river the golden foothill slopes curve up to the strong and rocky \
 Gabilan Mountains, but on the valley side the water is lined with trees—willows \
 fresh and green with every spring, carrying in their lower leaf junctures the \
 debris of the winters flooding; and sycamores with mottled, white, recumbent \
 limbs and branches that arch over the pool"
 ));
 index_writer.add_document(doc!(
 title => "Frankenstein",
 body => "You will rejoice to hear that no disaster has accompanied the commencement of an \
 enterprise which you have regarded with such evil forebodings. I arrived here \
@@ -95,35 +94,28 @@ fn main() -> tantivy::Result<()> {
 increasing confidence in the success of my undertaking."
 ));
 index_writer.commit()?;
 index.load_searchers()?;
 let searcher = index.searcher();
 let query_parser = QueryParser::for_index(&index, vec![title, body]);
-// this will have NO hits because it was filtered out
-// because the query is run through the analyzer you
-// actually will get an error here because the query becomes
-// empty
-assert!(query_parser.parse_query("the").is_err());
-// this will have hits
-let query = query_parser.parse_query("is")?;
-let mut top_collector = TopCollector::with_limit(10);
-searcher.search(&*query, &mut top_collector)?;
-let doc_addresses = top_collector.docs();
-for doc_address in doc_addresses {
-let retrieved_doc = searcher.doc(&doc_address)?;
-println!("{}", schema.to_json(&retrieved_doc));
-}
-Ok(())
+// stop words are applied on the query as well.
+// The following will be equivalent to `title:frankenstein`
+let query = query_parser.parse_query("title:\"the Frankenstein\"")?;
+let mut top_collector = TopCollector::with_limit(10);
+searcher.search(&*query, &mut top_collector)?;
+let doc_addresses = top_collector.docs();
+for doc_address in doc_addresses {
+let retrieved_doc = searcher.doc(doc_address)?;
+println!("{}", schema.to_json(&retrieved_doc));
+}
+Ok(())
 }
-use tempdir::TempDir;

run-tests.sh (new executable file, 2 lines)

@@ -0,0 +1,2 @@
#!/bin/bash
cargo test --no-default-features --features mmap -- --test-threads 1

src/collector/facet_collector.rs

@@ -342,16 +342,19 @@ impl FacetCollector {
 pub fn harvest(mut self) -> FacetCounts {
 self.finalize_segment();
-let collapsed_facet_ords: Vec<&[u64]> = self.segment_counters
+let collapsed_facet_ords: Vec<&[u64]> = self
+.segment_counters
 .iter()
 .map(|segment_counter| &segment_counter.facet_ords[..])
 .collect();
-let collapsed_facet_counts: Vec<&[u64]> = self.segment_counters
+let collapsed_facet_counts: Vec<&[u64]> = self
+.segment_counters
 .iter()
 .map(|segment_counter| &segment_counter.facet_counts[..])
 .collect();
-let facet_streams = self.segment_counters
+let facet_streams = self
+.segment_counters
 .iter()
 .map(|seg_counts| seg_counts.facet_reader.facet_dict().range().into_stream())
 .collect::<Vec<_>>();
@@ -374,10 +377,8 @@
 } else {
 collapsed_facet_counts[seg_ord][collapsed_term_id]
 }
-})
-.unwrap_or(0)
-})
-.sum();
+}).unwrap_or(0)
+}).sum();
 if count > 0u64 {
 let bytes: Vec<u8> = facet_merger.key().to_owned();
 // may create an corrupted facet if the term dicitonary is corrupted
@@ -402,7 +403,8 @@ impl Collector for FacetCollector {
 fn collect(&mut self, doc: DocId, _: Score) {
 let facet_reader: &mut FacetReader = unsafe {
-&mut *self.ff_reader
+&mut *self
+.ff_reader
 .as_ref()
 .expect("collect() was called before set_segment. This should never happen.")
 .get()
@@ -476,9 +478,8 @@ impl FacetCounts {
 heap.push(Hit { count, facet });
 }
-let mut lowest_count: u64 = heap.peek().map(|hit| hit.count)
-.unwrap_or(u64::MIN); //< the `unwrap_or` case may be triggered but the value
-// is never used in that case.
+let mut lowest_count: u64 = heap.peek().map(|hit| hit.count).unwrap_or(u64::MIN); //< the `unwrap_or` case may be triggered but the value
+// is never used in that case.
 for (facet, count) in it {
 if count > lowest_count {
@@ -526,8 +527,7 @@ mod tests {
 n /= 4;
 let leaf = n % 5;
 Facet::from(&format!("/top{}/mid{}/leaf{}", top, mid, leaf))
-})
-.collect();
+}).collect();
 for i in 0..num_facets * 10 {
 let mut doc = Document::new();
 doc.add_facet(facet_field, facets[i % num_facets].clone());
@@ -554,7 +554,8 @@ mod tests {
 ("/top1/mid1", 50),
 ("/top1/mid2", 50),
 ("/top1/mid3", 50),
-].iter()
+]
+.iter()
 .map(|&(facet_str, count)| (String::from(facet_str), count))
 .collect::<Vec<_>>()
 );
@@ -618,9 +619,13 @@ mod tests {
 let facet = Facet::from(&format!("/facet/{}", c));
 let doc = doc!(facet_field => facet);
 iter::repeat(doc).take(count)
-})
-.map(|mut doc| { doc.add_facet(facet_field, &format!("/facet/{}", thread_rng().sample(&uniform) )); doc})
-.collect();
+}).map(|mut doc| {
+doc.add_facet(
+facet_field,
+&format!("/facet/{}", thread_rng().sample(&uniform)),
+);
+doc
+}).collect();
 thread_rng().shuffle(&mut docs[..]);
 let mut index_writer = index.writer_with_num_threads(1, 3_000_000).unwrap();

src/collector/mod.rs

@@ -15,7 +15,14 @@ mod multi_collector;
 pub use self::multi_collector::MultiCollector;
 mod top_collector;
-pub use self::top_collector::TopCollector;
+mod top_score_collector;
+pub use self::top_score_collector::TopScoreCollector;
+#[deprecated]
+pub use self::top_score_collector::TopScoreCollector as TopCollector;
+mod top_field_collector;
+pub use self::top_field_collector::TopFieldCollector;
 mod facet_collector;
 pub use self::facet_collector::FacetCollector;

src/collector/multi_collector.rs

@@ -100,11 +100,11 @@ impl<'a> Collector for MultiCollector<'a> {
 mod tests {
 use super::*;
-use collector::{Collector, CountCollector, TopCollector};
+use collector::{Collector, CountCollector, TopScoreCollector};
 #[test]
 fn test_multi_collector() {
-let mut top_collector = TopCollector::with_limit(2);
+let mut top_collector = TopScoreCollector::with_limit(2);
 let mut count_collector = CountCollector::default();
 {
 let mut collectors =

src/collector/top_collector.rs

@@ -1,115 +1,61 @@
-use super::Collector;
 use std::cmp::Ordering;
 use std::collections::BinaryHeap;
 use DocAddress;
 use DocId;
-use Result;
-use Score;
 use SegmentLocalId;
-use SegmentReader;
-// Rust heap is a max-heap and we need a min heap.
+/// Contains a feature (field, score, etc.) of a document along with the document address.
+///
+/// It has a custom implementation of `PartialOrd` that reverses the order. This is because the
+/// default Rust heap is a max heap, whereas a min heap is needed.
 #[derive(Clone, Copy)]
-struct GlobalScoredDoc {
-score: Score,
+pub struct ComparableDoc<T> {
+feature: T,
 doc_address: DocAddress,
 }
-impl PartialOrd for GlobalScoredDoc {
-fn partial_cmp(&self, other: &GlobalScoredDoc) -> Option<Ordering> {
+impl<T: PartialOrd> PartialOrd for ComparableDoc<T> {
+fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
 Some(self.cmp(other))
 }
 }
-impl Ord for GlobalScoredDoc {
+impl<T: PartialOrd> Ord for ComparableDoc<T> {
 #[inline]
-fn cmp(&self, other: &GlobalScoredDoc) -> Ordering {
+fn cmp(&self, other: &Self) -> Ordering {
 other
-.score
-.partial_cmp(&self.score)
+.feature
+.partial_cmp(&self.feature)
 .unwrap_or_else(|| other.doc_address.cmp(&self.doc_address))
 }
 }
-impl PartialEq for GlobalScoredDoc {
-fn eq(&self, other: &GlobalScoredDoc) -> bool {
+impl<T: PartialOrd> PartialEq for ComparableDoc<T> {
+fn eq(&self, other: &Self) -> bool {
 self.cmp(other) == Ordering::Equal
 }
 }
-impl Eq for GlobalScoredDoc {}
+impl<T: PartialOrd> Eq for ComparableDoc<T> {}
 /// The Top Collector keeps track of the K documents
-/// with the best scores.
+/// sorted by type `T`.
 ///
 /// The implementation is based on a `BinaryHeap`.
 /// The theorical complexity for collecting the top `K` out of `n` documents
 /// is `O(n log K)`.
+pub struct TopCollector<T> {
-///
/// ```rust
/// #[macro_use]
/// extern crate tantivy;
/// use tantivy::schema::{SchemaBuilder, TEXT};
/// use tantivy::{Index, Result, DocId, Score};
/// use tantivy::collector::TopCollector;
/// use tantivy::query::QueryParser;
///
/// # fn main() { example().unwrap(); }
/// fn example() -> Result<()> {
/// let mut schema_builder = SchemaBuilder::new();
/// let title = schema_builder.add_text_field("title", TEXT);
/// let schema = schema_builder.build();
/// let index = Index::create_in_ram(schema);
/// {
/// let mut index_writer = index.writer_with_num_threads(1, 3_000_000)?;
/// index_writer.add_document(doc!(
/// title => "The Name of the Wind",
/// ));
/// index_writer.add_document(doc!(
/// title => "The Diary of Muadib",
/// ));
/// index_writer.add_document(doc!(
/// title => "A Dairy Cow",
/// ));
/// index_writer.add_document(doc!(
/// title => "The Diary of a Young Girl",
/// ));
/// index_writer.commit().unwrap();
/// }
///
/// index.load_searchers()?;
/// let searcher = index.searcher();
///
/// {
/// let mut top_collector = TopCollector::with_limit(2);
/// let query_parser = QueryParser::for_index(&index, vec![title]);
/// let query = query_parser.parse_query("diary")?;
/// searcher.search(&*query, &mut top_collector).unwrap();
///
/// let score_docs: Vec<(Score, DocId)> = top_collector
/// .score_docs()
/// .into_iter()
/// .map(|(score, doc_address)| (score, doc_address.doc()))
/// .collect();
///
/// assert_eq!(score_docs, vec![(0.7261542, 1), (0.6099695, 3)]);
/// }
///
/// Ok(())
/// }
/// ```
pub struct TopCollector {
limit: usize, limit: usize,
heap: BinaryHeap<GlobalScoredDoc>, heap: BinaryHeap<ComparableDoc<T>>,
segment_id: u32, segment_id: u32,
} }
impl TopCollector { impl<T: PartialOrd + Clone> TopCollector<T> {
/// Creates a top collector, with a number of documents equal to "limit". /// Creates a top collector, with a number of documents equal to "limit".
/// ///
/// # Panics /// # Panics
/// The method panics if limit is 0 /// The method panics if limit is 0
pub fn with_limit(limit: usize) -> TopCollector { pub fn with_limit(limit: usize) -> TopCollector<T> {
if limit < 1 { if limit < 1 {
panic!("Limit must be strictly greater than 0."); panic!("Limit must be strictly greater than 0.");
} }
@@ -125,23 +71,27 @@ impl TopCollector {
/// Calling this method triggers the sort. /// Calling this method triggers the sort.
/// The result of the sort is not cached. /// The result of the sort is not cached.
pub fn docs(&self) -> Vec<DocAddress> { pub fn docs(&self) -> Vec<DocAddress> {
self.score_docs() self.top_docs()
.into_iter() .into_iter()
.map(|score_doc| score_doc.1) .map(|(_feature, doc)| doc)
.collect() .collect()
} }
/// Returns K best ScoredDocument sorted in decreasing order. /// Returns K best FeatureDocuments sorted in decreasing order.
/// ///
/// Calling this method triggers the sort. /// Calling this method triggers the sort.
/// The result of the sort is not cached. /// The result of the sort is not cached.
pub fn score_docs(&self) -> Vec<(Score, DocAddress)> { pub fn top_docs(&self) -> Vec<(T, DocAddress)> {
let mut scored_docs: Vec<GlobalScoredDoc> = self.heap.iter().cloned().collect(); let mut feature_docs: Vec<ComparableDoc<T>> = self.heap.iter().cloned().collect();
scored_docs.sort(); feature_docs.sort();
scored_docs feature_docs
.into_iter() .into_iter()
.map(|GlobalScoredDoc { score, doc_address }| (score, doc_address)) .map(
.collect() |ComparableDoc {
feature,
doc_address,
}| (feature, doc_address),
).collect()
} }
/// Return true iff at least K documents have gone through /// Return true iff at least K documents have gone through
@@ -150,46 +100,45 @@ impl TopCollector {
pub fn at_capacity(&self) -> bool { pub fn at_capacity(&self) -> bool {
self.heap.len() >= self.limit self.heap.len() >= self.limit
} }
}
impl Collector for TopCollector { /// Sets the segment local ID for the collector
fn set_segment(&mut self, segment_id: SegmentLocalId, _: &SegmentReader) -> Result<()> { pub fn set_segment_id(&mut self, segment_id: SegmentLocalId) {
self.segment_id = segment_id; self.segment_id = segment_id;
Ok(())
} }
fn collect(&mut self, doc: DocId, score: Score) { /// Collects a document scored by the given feature
///
/// It collects documents until it has reached the max capacity. Once it reaches capacity, it
/// will compare the lowest scoring item with the given one and keep whichever is greater.
pub fn collect(&mut self, doc: DocId, feature: T) {
if self.at_capacity() { if self.at_capacity() {
// It's ok to unwrap as long as a limit of 0 is forbidden. // It's ok to unwrap as long as a limit of 0 is forbidden.
let limit_doc: GlobalScoredDoc = *self.heap let limit_doc: ComparableDoc<T> = self
.heap
.peek() .peek()
.expect("Top collector with size 0 is forbidden"); .expect("Top collector with size 0 is forbidden")
if limit_doc.score < score { .clone();
let mut mut_head = self.heap if limit_doc.feature < feature {
let mut mut_head = self
.heap
.peek_mut() .peek_mut()
.expect("Top collector with size 0 is forbidden"); .expect("Top collector with size 0 is forbidden");
mut_head.score = score; mut_head.feature = feature;
mut_head.doc_address = DocAddress(self.segment_id, doc); mut_head.doc_address = DocAddress(self.segment_id, doc);
} }
} else { } else {
let wrapped_doc = GlobalScoredDoc { let wrapped_doc = ComparableDoc {
score, feature,
doc_address: DocAddress(self.segment_id, doc), doc_address: DocAddress(self.segment_id, doc),
}; };
self.heap.push(wrapped_doc); self.heap.push(wrapped_doc);
} }
} }
fn requires_scoring(&self) -> bool {
true
}
} }
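The `collect` logic above is the classic bounded min-heap pattern: keep at most K entries, and once at capacity, replace the current minimum whenever a better candidate arrives. A minimal stand-alone sketch of that pattern (illustrative only, not tantivy's actual types; integer features are used to sidestep float ordering):

use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Keeps the K largest (feature, doc) pairs seen so far.
struct TopK {
    limit: usize,
    // `Reverse` turns the max-heap into a min-heap, so `peek()` is the
    // weakest entry among the current top-K candidates.
    heap: BinaryHeap<Reverse<(u64, u32)>>,
}

impl TopK {
    fn with_limit(limit: usize) -> TopK {
        assert!(limit > 0, "Limit must be strictly greater than 0.");
        TopK { limit, heap: BinaryHeap::with_capacity(limit) }
    }

    fn collect(&mut self, doc: u32, feature: u64) {
        if self.heap.len() < self.limit {
            self.heap.push(Reverse((feature, doc)));
        } else if let Some(mut head) = self.heap.peek_mut() {
            // At capacity: replace the current minimum only if the new entry beats it.
            if head.0 < (feature, doc) {
                *head = Reverse((feature, doc));
            }
        }
    }

    fn top_docs(self) -> Vec<(u64, u32)> {
        let mut out: Vec<(u64, u32)> = self.heap.into_iter().map(|Reverse(t)| t).collect();
        out.sort_by(|a, b| b.cmp(a)); // decreasing by feature
        out
    }
}

fn main() {
    let mut top = TopK::with_limit(2);
    for (doc, feature) in vec![(0u32, 12u64), (1, 97), (2, 16), (3, 80)] {
        top.collect(doc, feature);
    }
    assert_eq!(top.top_docs(), vec![(97, 1), (80, 3)]);
}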
#[cfg(test)] #[cfg(test)]
mod tests { mod tests {
use super::*; use super::*;
use collector::Collector;
use DocId; use DocId;
use Score; use Score;
@@ -201,7 +150,7 @@ mod tests {
top_collector.collect(5, 0.3); top_collector.collect(5, 0.3);
assert!(!top_collector.at_capacity()); assert!(!top_collector.at_capacity());
let score_docs: Vec<(Score, DocId)> = top_collector let score_docs: Vec<(Score, DocId)> = top_collector
.score_docs() .top_docs()
.into_iter() .into_iter()
.map(|(score, doc_address)| (score, doc_address.doc())) .map(|(score, doc_address)| (score, doc_address.doc()))
.collect(); .collect();
@@ -219,7 +168,7 @@ mod tests {
assert!(top_collector.at_capacity()); assert!(top_collector.at_capacity());
{ {
let score_docs: Vec<(Score, DocId)> = top_collector let score_docs: Vec<(Score, DocId)> = top_collector
.score_docs() .top_docs()
.into_iter() .into_iter()
.map(|(score, doc_address)| (score, doc_address.doc())) .map(|(score, doc_address)| (score, doc_address.doc()))
.collect(); .collect();
@@ -238,7 +187,7 @@ mod tests {
#[test] #[test]
#[should_panic] #[should_panic]
fn test_top_0() { fn test_top_0() {
TopCollector::with_limit(0); let _collector: TopCollector<Score> = TopCollector::with_limit(0);
} }
} }


@@ -0,0 +1,263 @@
use super::Collector;
use collector::top_collector::TopCollector;
use fastfield::FastFieldReader;
use fastfield::FastValue;
use schema::Field;
use DocAddress;
use DocId;
use Result;
use Score;
use SegmentReader;
/// The Top Field Collector keeps track of the K documents
/// sorted by a fast field in the index
///
/// The implementation is based on a `BinaryHeap`.
/// The theoretical complexity for collecting the top `K` out of `n` documents
/// is `O(n log K)`.
///
/// ```rust
/// #[macro_use]
/// extern crate tantivy;
/// use tantivy::schema::{SchemaBuilder, TEXT, FAST};
/// use tantivy::{Index, Result, DocId};
/// use tantivy::collector::TopFieldCollector;
/// use tantivy::query::QueryParser;
///
/// # fn main() { example().unwrap(); }
/// fn example() -> Result<()> {
/// let mut schema_builder = SchemaBuilder::new();
/// let title = schema_builder.add_text_field("title", TEXT);
/// let rating = schema_builder.add_u64_field("rating", FAST);
/// let schema = schema_builder.build();
/// let index = Index::create_in_ram(schema);
/// {
/// let mut index_writer = index.writer_with_num_threads(1, 3_000_000)?;
/// index_writer.add_document(doc!(
/// title => "The Name of the Wind",
/// rating => 92u64,
/// ));
/// index_writer.add_document(doc!(
/// title => "The Diary of Muadib",
/// rating => 97u64,
/// ));
/// index_writer.add_document(doc!(
/// title => "A Dairy Cow",
/// rating => 63u64,
/// ));
/// index_writer.add_document(doc!(
/// title => "The Diary of a Young Girl",
/// rating => 80u64,
/// ));
/// index_writer.commit().unwrap();
/// }
///
/// index.load_searchers()?;
/// let searcher = index.searcher();
///
/// {
/// let mut top_collector = TopFieldCollector::with_limit(rating, 2);
/// let query_parser = QueryParser::for_index(&index, vec![title]);
/// let query = query_parser.parse_query("diary")?;
/// searcher.search(&*query, &mut top_collector).unwrap();
///
/// let score_docs: Vec<(u64, DocId)> = top_collector
/// .top_docs()
/// .into_iter()
/// .map(|(field, doc_address)| (field, doc_address.doc()))
/// .collect();
///
/// assert_eq!(score_docs, vec![(97u64, 1), (80, 3)]);
/// }
///
/// Ok(())
/// }
/// ```
pub struct TopFieldCollector<T: FastValue> {
field: Field,
collector: TopCollector<T>,
fast_field: Option<FastFieldReader<T>>,
}
impl<T: FastValue + PartialOrd + Clone> TopFieldCollector<T> {
/// Creates a top field collector, with a number of documents equal to "limit".
///
/// The given field must be a fast field; otherwise the collector will fail with an error
/// while collecting results.
///
/// # Panics
/// The method panics if limit is 0
pub fn with_limit(field: Field, limit: usize) -> Self {
TopFieldCollector {
field,
collector: TopCollector::with_limit(limit),
fast_field: None,
}
}
/// Returns the K best documents, sorted by the given field in decreasing order.
///
/// Calling this method triggers the sort.
/// The result of the sort is not cached.
pub fn docs(&self) -> Vec<DocAddress> {
self.collector.docs()
}
/// Returns K best FieldDocuments sorted in decreasing order.
///
/// Calling this method triggers the sort.
/// The result of the sort is not cached.
pub fn top_docs(&self) -> Vec<(T, DocAddress)> {
self.collector.top_docs()
}
/// Return true iff at least K documents have gone through
/// the collector.
#[inline]
pub fn at_capacity(&self) -> bool {
self.collector.at_capacity()
}
}
impl<T: FastValue + PartialOrd + Clone> Collector for TopFieldCollector<T> {
fn set_segment(&mut self, segment_id: u32, segment: &SegmentReader) -> Result<()> {
self.collector.set_segment_id(segment_id);
self.fast_field = Some(segment.fast_field_reader(self.field)?);
Ok(())
}
fn collect(&mut self, doc: DocId, _score: Score) {
let field_value = self
.fast_field
.as_ref()
.expect("collect() was called before set_segment. This should never happen.")
.get(doc);
self.collector.collect(doc, field_value);
}
fn requires_scoring(&self) -> bool {
false
}
}
#[cfg(test)]
mod tests {
use super::*;
use query::Query;
use query::QueryParser;
use schema::Field;
use schema::IntOptions;
use schema::Schema;
use schema::{SchemaBuilder, FAST, TEXT};
use Index;
use IndexWriter;
use TantivyError;
const TITLE: &str = "title";
const SIZE: &str = "size";
#[test]
fn test_top_collector_not_at_capacity() {
let mut schema_builder = SchemaBuilder::new();
let title = schema_builder.add_text_field(TITLE, TEXT);
let size = schema_builder.add_u64_field(SIZE, FAST);
let schema = schema_builder.build();
let (index, query) = index("beer", title, schema, |index_writer| {
index_writer.add_document(doc!(
title => "bottle of beer",
size => 12u64,
));
index_writer.add_document(doc!(
title => "growler of beer",
size => 64u64,
));
index_writer.add_document(doc!(
title => "pint of beer",
size => 16u64,
));
});
let searcher = index.searcher();
let mut top_collector = TopFieldCollector::with_limit(size, 4);
searcher.search(&*query, &mut top_collector).unwrap();
assert!(!top_collector.at_capacity());
let score_docs: Vec<(u64, DocId)> = top_collector
.top_docs()
.into_iter()
.map(|(field, doc_address)| (field, doc_address.doc()))
.collect();
assert_eq!(score_docs, vec![(64, 1), (16, 2), (12, 0)]);
}
#[test]
#[should_panic]
fn test_field_does_not_exist() {
let mut schema_builder = SchemaBuilder::new();
let title = schema_builder.add_text_field(TITLE, TEXT);
let size = schema_builder.add_u64_field(SIZE, FAST);
let schema = schema_builder.build();
let (index, _) = index("beer", title, schema, |index_writer| {
index_writer.add_document(doc!(
title => "bottle of beer",
size => 12u64,
));
});
let searcher = index.searcher();
let segment = searcher.segment_reader(0);
let mut top_collector: TopFieldCollector<u64> = TopFieldCollector::with_limit(Field(2), 4);
let _ = top_collector.set_segment(0, segment);
}
#[test]
fn test_field_not_fast_field() {
let mut schema_builder = SchemaBuilder::new();
let title = schema_builder.add_text_field(TITLE, TEXT);
let size = schema_builder.add_u64_field(SIZE, IntOptions::default());
let schema = schema_builder.build();
let (index, _) = index("beer", title, schema, |index_writer| {
index_writer.add_document(doc!(
title => "bottle of beer",
size => 12u64,
));
});
let searcher = index.searcher();
let segment = searcher.segment_reader(0);
let mut top_collector: TopFieldCollector<u64> = TopFieldCollector::with_limit(size, 4);
assert_matches!(
top_collector.set_segment(0, segment),
Err(TantivyError::FastFieldError(_))
);
}
#[test]
#[should_panic]
fn test_collect_before_set_segment() {
let mut top_collector: TopFieldCollector<u64> = TopFieldCollector::with_limit(Field(0), 4);
top_collector.collect(0, 0f32);
}
#[test]
#[should_panic]
fn test_top_0() {
let _: TopFieldCollector<u64> = TopFieldCollector::with_limit(Field(0), 0);
}
fn index(
query: &str,
query_field: Field,
schema: Schema,
mut doc_adder: impl FnMut(&mut IndexWriter) -> (),
) -> (Index, Box<Query>) {
let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_with_num_threads(1, 3_000_000).unwrap();
doc_adder(&mut index_writer);
index_writer.commit().unwrap();
index.load_searchers().unwrap();
let query_parser = QueryParser::for_index(&index, vec![query_field]);
let query = query_parser.parse_query(query).unwrap();
(index, query)
}
}


@@ -0,0 +1,187 @@
use super::Collector;
use collector::top_collector::TopCollector;
use DocAddress;
use DocId;
use Result;
use Score;
use SegmentLocalId;
use SegmentReader;
/// The Top Score Collector keeps track of the K documents
/// sorted by their score.
///
/// The implementation is based on a `BinaryHeap`.
/// The theoretical complexity for collecting the top `K` out of `n` documents
/// is `O(n log K)`.
///
/// ```rust
/// #[macro_use]
/// extern crate tantivy;
/// use tantivy::schema::{SchemaBuilder, TEXT};
/// use tantivy::{Index, Result, DocId, Score};
/// use tantivy::collector::TopScoreCollector;
/// use tantivy::query::QueryParser;
///
/// # fn main() { example().unwrap(); }
/// fn example() -> Result<()> {
/// let mut schema_builder = SchemaBuilder::new();
/// let title = schema_builder.add_text_field("title", TEXT);
/// let schema = schema_builder.build();
/// let index = Index::create_in_ram(schema);
/// {
/// let mut index_writer = index.writer_with_num_threads(1, 3_000_000)?;
/// index_writer.add_document(doc!(
/// title => "The Name of the Wind",
/// ));
/// index_writer.add_document(doc!(
/// title => "The Diary of Muadib",
/// ));
/// index_writer.add_document(doc!(
/// title => "A Dairy Cow",
/// ));
/// index_writer.add_document(doc!(
/// title => "The Diary of a Young Girl",
/// ));
/// index_writer.commit().unwrap();
/// }
///
/// index.load_searchers()?;
/// let searcher = index.searcher();
///
/// {
/// let mut top_collector = TopScoreCollector::with_limit(2);
/// let query_parser = QueryParser::for_index(&index, vec![title]);
/// let query = query_parser.parse_query("diary")?;
/// searcher.search(&*query, &mut top_collector).unwrap();
///
/// let score_docs: Vec<(Score, DocId)> = top_collector
/// .top_docs()
/// .into_iter()
/// .map(|(score, doc_address)| (score, doc_address.doc()))
/// .collect();
///
/// assert_eq!(score_docs, vec![(0.7261542, 1), (0.6099695, 3)]);
/// }
///
/// Ok(())
/// }
/// ```
pub struct TopScoreCollector {
collector: TopCollector<Score>,
}
impl TopScoreCollector {
/// Creates a top score collector, with a number of documents equal to "limit".
///
/// # Panics
/// The method panics if limit is 0
pub fn with_limit(limit: usize) -> TopScoreCollector {
TopScoreCollector {
collector: TopCollector::with_limit(limit),
}
}
/// Returns K best scored documents sorted in decreasing order.
///
/// Calling this method triggers the sort.
/// The result of the sort is not cached.
pub fn docs(&self) -> Vec<DocAddress> {
self.collector.docs()
}
/// Returns K best ScoredDocuments sorted in decreasing order.
///
/// Calling this method triggers the sort.
/// The result of the sort is not cached.
pub fn top_docs(&self) -> Vec<(Score, DocAddress)> {
self.collector.top_docs()
}
/// Returns K best ScoredDocuments sorted in decreasing order.
///
/// Calling this method triggers the sort.
/// The result of the sort is not cached.
#[deprecated]
pub fn score_docs(&self) -> Vec<(Score, DocAddress)> {
self.collector.top_docs()
}
/// Return true iff at least K documents have gone through
/// the collector.
#[inline]
pub fn at_capacity(&self) -> bool {
self.collector.at_capacity()
}
}
impl Collector for TopScoreCollector {
fn set_segment(&mut self, segment_id: SegmentLocalId, _: &SegmentReader) -> Result<()> {
self.collector.set_segment_id(segment_id);
Ok(())
}
fn collect(&mut self, doc: DocId, score: Score) {
self.collector.collect(doc, score);
}
fn requires_scoring(&self) -> bool {
true
}
}
#[cfg(test)]
mod tests {
use super::*;
use collector::Collector;
use DocId;
use Score;
#[test]
fn test_top_collector_not_at_capacity() {
let mut top_collector = TopScoreCollector::with_limit(4);
top_collector.collect(1, 0.8);
top_collector.collect(3, 0.2);
top_collector.collect(5, 0.3);
assert!(!top_collector.at_capacity());
let score_docs: Vec<(Score, DocId)> = top_collector
.top_docs()
.into_iter()
.map(|(score, doc_address)| (score, doc_address.doc()))
.collect();
assert_eq!(score_docs, vec![(0.8, 1), (0.3, 5), (0.2, 3)]);
}
#[test]
fn test_top_collector_at_capacity() {
let mut top_collector = TopScoreCollector::with_limit(4);
top_collector.collect(1, 0.8);
top_collector.collect(3, 0.2);
top_collector.collect(5, 0.3);
top_collector.collect(7, 0.9);
top_collector.collect(9, -0.2);
assert!(top_collector.at_capacity());
{
let score_docs: Vec<(Score, DocId)> = top_collector
.top_docs()
.into_iter()
.map(|(score, doc_address)| (score, doc_address.doc()))
.collect();
assert_eq!(score_docs, vec![(0.9, 7), (0.8, 1), (0.3, 5), (0.2, 3)]);
}
{
let docs: Vec<DocId> = top_collector
.docs()
.into_iter()
.map(|doc_address| doc_address.doc())
.collect();
assert_eq!(docs, vec![7, 1, 5, 3]);
}
}
#[test]
#[should_panic]
fn test_top_0() {
TopScoreCollector::with_limit(0);
}
}


@@ -102,6 +102,7 @@ where
addr + 8 <= data.len(), addr + 8 <= data.len(),
"The fast field field should have been padded with 7 bytes." "The fast field field should have been padded with 7 bytes."
); );
#[cfg_attr(feature = "cargo-clippy", allow(clippy::cast_ptr_alignment))]
let val_unshifted_unmasked: u64 = let val_unshifted_unmasked: u64 =
u64::from_le(unsafe { ptr::read_unaligned(data[addr..].as_ptr() as *const u64) }); u64::from_le(unsafe { ptr::read_unaligned(data[addr..].as_ptr() as *const u64) });
let val_shifted = (val_unshifted_unmasked >> bit_shift) as u64; let val_shifted = (val_unshifted_unmasked >> bit_shift) as u64;
@@ -125,6 +126,7 @@ where
for output_val in output.iter_mut() { for output_val in output.iter_mut() {
let addr = addr_in_bits >> 3; let addr = addr_in_bits >> 3;
let bit_shift = addr_in_bits & 7; let bit_shift = addr_in_bits & 7;
#[cfg_attr(feature = "cargo-clippy", allow(clippy::cast_ptr_alignment))]
let val_unshifted_unmasked: u64 = let val_unshifted_unmasked: u64 =
unsafe { ptr::read_unaligned(data[addr..].as_ptr() as *const u64) }; unsafe { ptr::read_unaligned(data[addr..].as_ptr() as *const u64) };
let val_shifted = (val_unshifted_unmasked >> bit_shift) as u64; let val_shifted = (val_unshifted_unmasked >> bit_shift) as u64;
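The two annotated reads above implement dense bit-packing: a value `num_bits` wide is fetched by reading an unaligned little-endian u64 at the containing byte, shifting off the bit offset, and masking. A hedged stand-alone sketch of that trick (illustrative names, not tantivy's reader; like the asserts in the original, it assumes the buffer is padded with 7 trailing bytes):

use std::ptr;

// Returns the `idx`-th value in a buffer of densely bit-packed `num_bits`-wide
// integers. The buffer must carry 7 extra trailing bytes so the unaligned
// 8-byte read below can never run past the end.
fn get_packed(data: &[u8], idx: usize, num_bits: usize) -> u64 {
    assert!(num_bits > 0 && num_bits <= 56, "the shift/mask below assumes <= 56 bits");
    let addr_in_bits = idx * num_bits;
    let addr = addr_in_bits >> 3; // byte containing the first bit of the value
    let bit_shift = addr_in_bits & 7; // offset of the value within that byte
    assert!(addr + 8 <= data.len(), "buffer must be padded with 7 extra bytes");
    let word = u64::from_le(unsafe { ptr::read_unaligned(data[addr..].as_ptr() as *const u64) });
    (word >> bit_shift) & ((1u64 << num_bits) - 1)
}

fn main() {
    // Pack the 3-bit values [1, 5, 7, 2] by hand: 1 | 5<<3 | 7<<6 | 2<<9.
    let packed: u64 = 1 | (5 << 3) | (7 << 6) | (2 << 9);
    let mut data = packed.to_le_bytes().to_vec();
    data.extend_from_slice(&[0u8; 7]); // the padding the assert above relies on
    let values: Vec<u64> = (0..4).map(|i| get_packed(&data, i, 3)).collect();
    assert_eq!(values, vec![1, 5, 7, 2]);
}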


@@ -34,17 +34,17 @@ impl TinySet {
} }
/// Returns the complement of the set in `[0, 64[`. /// Returns the complement of the set in `[0, 64[`.
fn complement(&self) -> TinySet { fn complement(self) -> TinySet {
TinySet(!self.0) TinySet(!self.0)
} }
/// Returns true iff the `TinySet` contains the element `el`. /// Returns true iff the `TinySet` contains the element `el`.
pub fn contains(&self, el: u32) -> bool { pub fn contains(self, el: u32) -> bool {
!self.intersect(TinySet::singleton(el)).is_empty() !self.intersect(TinySet::singleton(el)).is_empty()
} }
/// Returns the intersection of `self` and `other` /// Returns the intersection of `self` and `other`
pub fn intersect(&self, other: TinySet) -> TinySet { pub fn intersect(self, other: TinySet) -> TinySet {
TinySet(self.0 & other.0) TinySet(self.0 & other.0)
} }
@@ -77,7 +77,7 @@ impl TinySet {
/// Returns true iff the `TinySet` is empty. /// Returns true iff the `TinySet` is empty.
#[inline(always)] #[inline(always)]
pub fn is_empty(&self) -> bool { pub fn is_empty(self) -> bool {
self.0 == 0u64 self.0 == 0u64
} }
@@ -114,7 +114,7 @@ impl TinySet {
self.0 = 0u64; self.0 = 0u64;
} }
pub fn len(&self) -> u32 { pub fn len(self) -> u32 {
self.0.count_ones() self.0.count_ones()
} }
} }
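The signature changes above follow clippy's trivially-copy-pass-by-ref lint: a `TinySet` is a single u64, so passing it by value is at least as cheap as passing a reference. A small illustrative re-implementation (not the real type) showing why by-value reads naturally here:

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct TinySet(u64);

impl TinySet {
    fn empty() -> TinySet { TinySet(0) }
    // `el` must be in [0, 64).
    fn singleton(el: u32) -> TinySet { TinySet(1u64 << el) }
    fn insert(self, el: u32) -> TinySet { TinySet(self.0 | (1u64 << el)) }
    fn complement(self) -> TinySet { TinySet(!self.0) }
    fn intersect(self, other: TinySet) -> TinySet { TinySet(self.0 & other.0) }
    fn contains(self, el: u32) -> bool { !self.intersect(TinySet::singleton(el)).is_empty() }
    fn is_empty(self) -> bool { self.0 == 0u64 }
    fn len(self) -> u32 { self.0.count_ones() }
}

fn main() {
    let set = TinySet::empty().insert(3).insert(17);
    assert!(set.contains(3) && !set.contains(4));
    assert_eq!(set.len(), 2);
    assert!(set.complement().contains(4));
}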
@@ -266,14 +266,14 @@ mod tests {
#[test] #[test]
fn test_bitset_large() { fn test_bitset_large() {
let arr = generate_nonunique_unsorted(1_000_000, 50_000); let arr = generate_nonunique_unsorted(100_000, 5_000);
let mut btreeset: BTreeSet<u32> = BTreeSet::new(); let mut btreeset: BTreeSet<u32> = BTreeSet::new();
let mut bitset = BitSet::with_max_value(1_000_000); let mut bitset = BitSet::with_max_value(100_000);
for el in arr { for el in arr {
btreeset.insert(el); btreeset.insert(el);
bitset.insert(el); bitset.insert(el);
} }
for i in 0..1_000_000 { for i in 0..100_000 {
assert_eq!(btreeset.contains(&i), bitset.contains(i)); assert_eq!(btreeset.contains(&i), bitset.contains(i));
} }
assert_eq!(btreeset.len(), bitset.len()); assert_eq!(btreeset.len(), bitset.len());


@@ -72,7 +72,8 @@ impl<W: Write> CompositeWrite<W> {
let footer_offset = self.write.written_bytes(); let footer_offset = self.write.written_bytes();
VInt(self.offsets.len() as u64).serialize(&mut self.write)?; VInt(self.offsets.len() as u64).serialize(&mut self.write)?;
let mut offset_fields: Vec<_> = self.offsets let mut offset_fields: Vec<_> = self
.offsets
.iter() .iter()
.map(|(file_addr, offset)| (*offset, *file_addr)) .map(|(file_addr, offset)| (*offset, *file_addr))
.collect(); .collect();


@@ -10,8 +10,6 @@ pub struct VInt(pub u64);
const STOP_BIT: u8 = 128; const STOP_BIT: u8 = 128;
impl VInt { impl VInt {
pub fn val(&self) -> u64 { pub fn val(&self) -> u64 {
self.0 self.0
} }
@@ -20,14 +18,13 @@ impl VInt {
VInt::deserialize(reader).map(|vint| vint.0) VInt::deserialize(reader).map(|vint| vint.0)
} }
pub fn serialize_into_vec(&self, output: &mut Vec<u8>){ pub fn serialize_into_vec(&self, output: &mut Vec<u8>) {
let mut buffer = [0u8; 10]; let mut buffer = [0u8; 10];
let num_bytes = self.serialize_into(&mut buffer); let num_bytes = self.serialize_into(&mut buffer);
output.extend(&buffer[0..num_bytes]); output.extend(&buffer[0..num_bytes]);
} }
fn serialize_into(&self, buffer: &mut [u8; 10]) -> usize { fn serialize_into(&self, buffer: &mut [u8; 10]) -> usize {
let mut remaining = self.0; let mut remaining = self.0;
for (i, b) in buffer.iter_mut().enumerate() { for (i, b) in buffer.iter_mut().enumerate() {
let next_byte: u8 = (remaining % 128u64) as u8; let next_byte: u8 = (remaining % 128u64) as u8;
@@ -74,7 +71,6 @@ impl BinarySerializable for VInt {
} }
} }
#[cfg(test)] #[cfg(test)]
mod tests { mod tests {
@@ -89,10 +85,10 @@ mod tests {
} }
assert!(num_bytes > 0); assert!(num_bytes > 0);
if num_bytes < 10 { if num_bytes < 10 {
assert!(1u64 << (7*num_bytes) > val); assert!(1u64 << (7 * num_bytes) > val);
} }
if num_bytes > 1 { if num_bytes > 1 {
assert!(1u64 << (7*(num_bytes-1)) <= val); assert!(1u64 << (7 * (num_bytes - 1)) <= val);
} }
let serdeser_val = VInt::deserialize(&mut &v[..]).unwrap(); let serdeser_val = VInt::deserialize(&mut &v[..]).unwrap();
assert_eq!(val, serdeser_val.0); assert_eq!(val, serdeser_val.0);
@@ -105,11 +101,11 @@ mod tests {
aux_test_vint(5); aux_test_vint(5);
aux_test_vint(u64::max_value()); aux_test_vint(u64::max_value());
for i in 1..9 { for i in 1..9 {
let power_of_128 = 1u64 << (7*i); let power_of_128 = 1u64 << (7 * i);
aux_test_vint(power_of_128 - 1u64); aux_test_vint(power_of_128 - 1u64);
aux_test_vint(power_of_128 ); aux_test_vint(power_of_128);
aux_test_vint(power_of_128 + 1u64); aux_test_vint(power_of_128 + 1u64);
} }
aux_test_vint(10); aux_test_vint(10);
} }
} }
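For reference, the vint layout these tests exercise packs 7 payload bits per byte and flags the final byte with the stop bit (0x80), which is why `num_bytes` bounds `val` by powers of 128. A hedged stand-alone sketch of such an encoder/decoder (illustrative, not the exact serializer above):

const STOP_BIT: u8 = 128;

// Encodes `val` as 7 payload bits per byte; the stop bit marks the last byte.
fn vint_encode(mut val: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (val % 128) as u8;
        val /= 128;
        if val == 0 {
            out.push(byte | STOP_BIT); // final byte: set the stop bit
            return;
        }
        out.push(byte);
    }
}

// Returns the decoded value and the number of bytes consumed.
fn vint_decode(bytes: &[u8]) -> Option<(u64, usize)> {
    let mut result = 0u64;
    for (i, &b) in bytes.iter().enumerate() {
        result |= u64::from(b & 127) << (7 * i);
        if b & STOP_BIT != 0 {
            return Some((result, i + 1));
        }
    }
    None // input ended before a stop bit was seen
}

fn main() {
    for val in vec![0u64, 5, 127, 128, 1 << 35, u64::max_value()] {
        let mut buf = Vec::new();
        vint_encode(val, &mut buf);
        assert_eq!(vint_decode(&buf), Some((val, buf.len())));
    }
}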


@@ -1,34 +1,36 @@
use core::SegmentId;
use error::TantivyError;
use schema::Schema;
use serde_json;
use std::borrow::BorrowMut;
use std::fmt;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use Result;
use super::pool::LeasedItem; use super::pool::LeasedItem;
use super::pool::Pool; use super::pool::Pool;
use super::segment::create_segment; use super::segment::create_segment;
use super::segment::Segment; use super::segment::Segment;
use core::searcher::Searcher; use core::searcher::Searcher;
use core::IndexMeta; use core::IndexMeta;
use core::SegmentId;
use core::SegmentMeta; use core::SegmentMeta;
use core::SegmentReader; use core::SegmentReader;
use core::META_FILEPATH; use core::META_FILEPATH;
use directory::ManagedDirectory;
#[cfg(feature = "mmap")] #[cfg(feature = "mmap")]
use directory::MmapDirectory; use directory::MmapDirectory;
use directory::{Directory, RAMDirectory}; use directory::{Directory, RAMDirectory};
use directory::{DirectoryClone, ManagedDirectory}; use error::TantivyError;
use indexer::index_writer::open_index_writer; use indexer::index_writer::open_index_writer;
use indexer::index_writer::HEAP_SIZE_MIN; use indexer::index_writer::HEAP_SIZE_MIN;
use indexer::segment_updater::save_new_metas; use indexer::segment_updater::save_new_metas;
use indexer::DirectoryLock; use indexer::LockType;
use num_cpus; use num_cpus;
use schema::Field;
use schema::FieldType;
use schema::Schema;
use serde_json;
use std::borrow::BorrowMut;
use std::fmt;
use std::path::Path; use std::path::Path;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use tokenizer::BoxedTokenizer;
use tokenizer::TokenizerManager; use tokenizer::TokenizerManager;
use IndexWriter; use IndexWriter;
use Result;
fn load_metas(directory: &Directory) -> Result<IndexMeta> { fn load_metas(directory: &Directory) -> Result<IndexMeta> {
let meta_data = directory.atomic_read(&META_FILEPATH)?; let meta_data = directory.atomic_read(&META_FILEPATH)?;
@@ -113,6 +115,27 @@ impl Index {
&self.tokenizers &self.tokenizers
} }
/// Helper to access the tokenizer associated to a specific field.
pub fn tokenizer_for_field(&self, field: Field) -> Result<Box<BoxedTokenizer>> {
let field_entry = self.schema.get_field_entry(field);
let field_type = field_entry.field_type();
let tokenizer_manager: &TokenizerManager = self.tokenizers();
let tokenizer_name_opt: Option<Box<BoxedTokenizer>> = match field_type {
FieldType::Str(text_options) => text_options
.get_indexing_options()
.map(|text_indexing_options| text_indexing_options.tokenizer().to_string())
.and_then(|tokenizer_name| tokenizer_manager.get(&tokenizer_name)),
_ => None,
};
match tokenizer_name_opt {
Some(tokenizer) => Ok(tokenizer),
None => Err(TantivyError::SchemaError(format!(
"{:?} is not a text field.",
field_entry.name()
))),
}
}
/// Opens a new directory from an index path. /// Opens a new directory from an index path.
#[cfg(feature = "mmap")] #[cfg(feature = "mmap")]
pub fn open_in_dir<P: AsRef<Path>>(directory_path: P) -> Result<Index> { pub fn open_in_dir<P: AsRef<Path>>(directory_path: P) -> Result<Index> {
@@ -156,7 +179,7 @@ impl Index {
num_threads: usize, num_threads: usize,
overall_heap_size_in_bytes: usize, overall_heap_size_in_bytes: usize,
) -> Result<IndexWriter> { ) -> Result<IndexWriter> {
let directory_lock = DirectoryLock::lock(self.directory().box_clone())?; let directory_lock = LockType::IndexWriterLock.acquire_lock(&self.directory)?;
let heap_size_in_bytes_per_thread = overall_heap_size_in_bytes / num_threads; let heap_size_in_bytes_per_thread = overall_heap_size_in_bytes / num_threads;
open_index_writer( open_index_writer(
self, self,
@@ -194,7 +217,8 @@ impl Index {
/// Returns the list of segments that are searchable /// Returns the list of segments that are searchable
pub fn searchable_segments(&self) -> Result<Vec<Segment>> { pub fn searchable_segments(&self) -> Result<Vec<Segment>> {
Ok(self.searchable_segment_metas()? Ok(self
.searchable_segment_metas()?
.into_iter() .into_iter()
.map(|segment_meta| self.segment(segment_meta)) .map(|segment_meta| self.segment(segment_meta))
.collect()) .collect())
@@ -229,7 +253,8 @@ impl Index {
/// Returns the list of segment ids that are searchable. /// Returns the list of segment ids that are searchable.
pub fn searchable_segment_ids(&self) -> Result<Vec<SegmentId>> { pub fn searchable_segment_ids(&self) -> Result<Vec<SegmentId>> {
Ok(self.searchable_segment_metas()? Ok(self
.searchable_segment_metas()?
.iter() .iter()
.map(|segment_meta| segment_meta.id()) .map(|segment_meta| segment_meta.id())
.collect()) .collect())
@@ -242,13 +267,18 @@ impl Index {
self.num_searchers.store(num_searchers, Ordering::Release); self.num_searchers.store(num_searchers, Ordering::Release);
} }
/// Creates a new generation of searchers after /// Update searchers so that they reflect the state of the last
/// `.commit()`.
/// a change of the set of searchable indexes.
/// ///
/// This needs to be called when a new segment has been /// If indexing happens in the same process as searching,
/// published or after a merge. /// you most likely want to call `.load_searchers()` right after each
/// successful call to `.commit()`.
///
/// If indexing and searching happen in different processes, the way to
/// get the freshest `index` at all times is to watch `meta.json` and
/// call `load_searchers` whenever a change happens.
pub fn load_searchers(&self) -> Result<()> { pub fn load_searchers(&self) -> Result<()> {
let _meta_lock = LockType::MetaLock.acquire_lock(self.directory())?;
let searchable_segments = self.searchable_segments()?; let searchable_segments = self.searchable_segments()?;
let segment_readers: Vec<SegmentReader> = searchable_segments let segment_readers: Vec<SegmentReader> = searchable_segments
.iter() .iter()
@@ -257,7 +287,7 @@ impl Index {
let schema = self.schema(); let schema = self.schema();
let num_searchers: usize = self.num_searchers.load(Ordering::Acquire); let num_searchers: usize = self.num_searchers.load(Ordering::Acquire);
let searchers = (0..num_searchers) let searchers = (0..num_searchers)
.map(|_| Searcher::new(schema.clone(), segment_readers.clone())) .map(|_| Searcher::new(schema.clone(), self.clone(), segment_readers.clone()))
.collect(); .collect();
self.searcher_pool.publish_new_generation(searchers); self.searcher_pool.publish_new_generation(searchers);
Ok(()) Ok(())
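A short usage sketch of the single-process pattern the new comment describes; `index`, `index_writer`, and the `title` field are assumed to already exist, as in the rustdoc examples earlier in this diff:

index_writer.add_document(doc!(title => "one more diary"));
index_writer.commit()?;
// Pick up the newly committed segment; without this call the existing
// searchers keep serving the previous generation.
index.load_searchers()?;
let searcher = index.searcher();
assert!(searcher.num_docs() >= 1);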
@@ -295,3 +325,24 @@ impl Clone for Index {
} }
} }
} }
#[cfg(test)]
mod tests {
use schema::{SchemaBuilder, INT_INDEXED, TEXT};
use Index;
#[test]
fn test_indexer_for_field() {
let mut schema_builder = SchemaBuilder::default();
let num_likes_field = schema_builder.add_u64_field("num_likes", INT_INDEXED);
let body_field = schema_builder.add_text_field("body", TEXT);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
assert!(index.tokenizer_for_field(body_field).is_ok());
assert_eq!(
format!("{:?}", index.tokenizer_for_field(num_likes_field).err()),
"Some(SchemaError(\"\\\"num_likes\\\" is not a text field.\"))"
);
}
}


@@ -58,7 +58,7 @@ mod tests {
}; };
let index_metas = IndexMeta { let index_metas = IndexMeta {
segments: Vec::new(), segments: Vec::new(),
schema: schema, schema,
opstamp: 0u64, opstamp: 0u64,
payload: None, payload: None,
}; };


@@ -1,13 +1,13 @@
use common::BinarySerializable; use common::BinarySerializable;
use directory::ReadOnlySource; use directory::ReadOnlySource;
use owned_read::OwnedRead;
use positions::PositionReader;
use postings::TermInfo; use postings::TermInfo;
use postings::{BlockSegmentPostings, SegmentPostings}; use postings::{BlockSegmentPostings, SegmentPostings};
use schema::FieldType; use schema::FieldType;
use schema::IndexRecordOption; use schema::IndexRecordOption;
use schema::Term; use schema::Term;
use termdict::TermDictionary; use termdict::TermDictionary;
use owned_read::OwnedRead;
use positions::PositionReader;
/// The inverted index reader is in charge of accessing /// The inverted index reader is in charge of accessing
/// the inverted index associated to a specific field. /// the inverted index associated to a specific field.
@@ -32,6 +32,10 @@ pub struct InvertedIndexReader {
} }
impl InvertedIndexReader { impl InvertedIndexReader {
#[cfg_attr(
feature = "cargo-clippy",
allow(clippy::needless_pass_by_value)
)] // for symmetry
pub(crate) fn new( pub(crate) fn new(
termdict: TermDictionary, termdict: TermDictionary,
postings_source: ReadOnlySource, postings_source: ReadOnlySource,
@@ -54,12 +58,12 @@ impl InvertedIndexReader {
/// Creates an empty `InvertedIndexReader` object, which /// Creates an empty `InvertedIndexReader` object, which
/// contains no terms at all. /// contains no terms at all.
pub fn empty(field_type: FieldType) -> InvertedIndexReader { pub fn empty(field_type: &FieldType) -> InvertedIndexReader {
let record_option = field_type let record_option = field_type
.get_index_record_option() .get_index_record_option()
.unwrap_or(IndexRecordOption::Basic); .unwrap_or(IndexRecordOption::Basic);
InvertedIndexReader { InvertedIndexReader {
termdict: TermDictionary::empty(field_type), termdict: TermDictionary::empty(&field_type),
postings_source: ReadOnlySource::empty(), postings_source: ReadOnlySource::empty(),
positions_source: ReadOnlySource::empty(), positions_source: ReadOnlySource::empty(),
positions_idx_source: ReadOnlySource::empty(), positions_idx_source: ReadOnlySource::empty(),
@@ -100,7 +104,6 @@ impl InvertedIndexReader {
block_postings.reset(term_info.doc_freq, postings_reader); block_postings.reset(term_info.doc_freq, postings_reader);
} }
/// Returns a block postings given a `Term`. /// Returns a block postings given a `Term`.
/// This method is for an advanced usage only. /// This method is for an advanced usage only.
/// ///
@@ -111,7 +114,7 @@ impl InvertedIndexReader {
option: IndexRecordOption, option: IndexRecordOption,
) -> Option<BlockSegmentPostings> { ) -> Option<BlockSegmentPostings> {
self.get_term_info(term) self.get_term_info(term)
.map(move|term_info| self.read_block_postings_from_terminfo(&term_info, option)) .map(move |term_info| self.read_block_postings_from_terminfo(&term_info, option))
} }
/// Returns a block postings given a `term_info`. /// Returns a block postings given a `term_info`.
@@ -147,7 +150,8 @@ impl InvertedIndexReader {
if option.has_positions() { if option.has_positions() {
let position_reader = self.positions_source.clone(); let position_reader = self.positions_source.clone();
let skip_reader = self.positions_idx_source.clone(); let skip_reader = self.positions_idx_source.clone();
let position_reader = PositionReader::new(position_reader, skip_reader, term_info.positions_idx); let position_reader =
PositionReader::new(position_reader, skip_reader, term_info.positions_idx);
Some(position_reader) Some(position_reader)
} else { } else {
None None


@@ -33,10 +33,4 @@ lazy_static! {
/// Removing this file is safe, but will prevent the garbage collection of all of the file that /// Removing this file is safe, but will prevent the garbage collection of all of the file that
/// are currently in the directory /// are currently in the directory
pub static ref MANAGED_FILEPATH: PathBuf = PathBuf::from(".managed.json"); pub static ref MANAGED_FILEPATH: PathBuf = PathBuf::from(".managed.json");
/// Only one process should be able to write tantivy's index at a time.
/// This file, when present, is in charge of preventing other processes to open an IndexWriter.
///
/// If the process is killed and this file remains, it is safe to remove it manually.
pub static ref LOCKFILE_FILEPATH: PathBuf = PathBuf::from(".tantivy-indexer.lock");
} }


@@ -87,7 +87,8 @@ impl<T> Deref for LeasedItem<T> {
type Target = T; type Target = T;
fn deref(&self) -> &T { fn deref(&self) -> &T {
&self.gen_item &self
.gen_item
.as_ref() .as_ref()
.expect("Unwrapping a leased item should never fail") .expect("Unwrapping a leased item should never fail")
.item // unwrap is safe here .item // unwrap is safe here
@@ -96,7 +97,8 @@ impl<T> Deref for LeasedItem<T> {
impl<T> DerefMut for LeasedItem<T> { impl<T> DerefMut for LeasedItem<T> {
fn deref_mut(&mut self) -> &mut T { fn deref_mut(&mut self) -> &mut T {
&mut self.gen_item &mut self
.gen_item
.as_mut() .as_mut()
.expect("Unwrapping a mut leased item should never fail") .expect("Unwrapping a mut leased item should never fail")
.item // unwrap is safe here .item // unwrap is safe here


@@ -9,6 +9,7 @@ use std::fmt;
use std::sync::Arc; use std::sync::Arc;
use termdict::TermMerger; use termdict::TermMerger;
use DocAddress; use DocAddress;
use Index;
use Result; use Result;
/// Holds a list of `SegmentReader`s ready for search. /// Holds a list of `SegmentReader`s ready for search.
@@ -18,23 +19,35 @@ use Result;
/// ///
pub struct Searcher { pub struct Searcher {
schema: Schema, schema: Schema,
index: Index,
segment_readers: Vec<SegmentReader>, segment_readers: Vec<SegmentReader>,
} }
impl Searcher { impl Searcher {
/// Creates a new `Searcher` /// Creates a new `Searcher`
pub(crate) fn new(schema: Schema, segment_readers: Vec<SegmentReader>) -> Searcher { pub(crate) fn new(
schema: Schema,
index: Index,
segment_readers: Vec<SegmentReader>,
) -> Searcher {
Searcher { Searcher {
schema, schema,
index,
segment_readers, segment_readers,
} }
} }
/// Returns the `Index` associated to the `Searcher`
pub fn index(&self) -> &Index {
&self.index
}
/// Fetches a document from tantivy's store given a `DocAddress`. /// Fetches a document from tantivy's store given a `DocAddress`.
/// ///
/// The searcher uses the segment ordinal to route the /// The searcher uses the segment ordinal to route the
/// the request to the right `Segment`. /// the request to the right `Segment`.
pub fn doc(&self, doc_address: &DocAddress) -> Result<Document> { pub fn doc(&self, doc_address: DocAddress) -> Result<Document> {
let DocAddress(segment_local_id, doc_id) = *doc_address; let DocAddress(segment_local_id, doc_id) = doc_address;
let segment_reader = &self.segment_readers[segment_local_id as usize]; let segment_reader = &self.segment_readers[segment_local_id as usize];
segment_reader.doc(doc_id) segment_reader.doc(doc_id)
} }
@@ -48,7 +61,7 @@ impl Searcher {
pub fn num_docs(&self) -> u64 { pub fn num_docs(&self) -> u64 {
self.segment_readers self.segment_readers
.iter() .iter()
.map(|segment_reader| segment_reader.num_docs() as u64) .map(|segment_reader| u64::from(segment_reader.num_docs()))
.sum::<u64>() .sum::<u64>()
} }
@@ -57,8 +70,9 @@ impl Searcher {
pub fn doc_freq(&self, term: &Term) -> u64 { pub fn doc_freq(&self, term: &Term) -> u64 {
self.segment_readers self.segment_readers
.iter() .iter()
.map(|segment_reader| segment_reader.inverted_index(term.field()).doc_freq(term) as u64) .map(|segment_reader| {
.sum::<u64>() u64::from(segment_reader.inverted_index(term.field()).doc_freq(term))
}).sum::<u64>()
} }
/// Return the list of segment readers /// Return the list of segment readers
@@ -78,7 +92,8 @@ impl Searcher {
/// Return the field searcher associated to a `Field`. /// Return the field searcher associated to a `Field`.
pub fn field(&self, field: Field) -> FieldSearcher { pub fn field(&self, field: Field) -> FieldSearcher {
let inv_index_readers = self.segment_readers let inv_index_readers = self
.segment_readers
.iter() .iter()
.map(|segment_reader| segment_reader.inverted_index(field)) .map(|segment_reader| segment_reader.inverted_index(field))
.collect::<Vec<_>>(); .collect::<Vec<_>>();
@@ -98,7 +113,8 @@ impl FieldSearcher {
/// Returns a Stream over all of the sorted unique terms of /// Returns a Stream over all of the sorted unique terms of
/// for the given field. /// for the given field.
pub fn terms(&self) -> TermMerger { pub fn terms(&self) -> TermMerger {
let term_streamers: Vec<_> = self.inv_index_readers let term_streamers: Vec<_> = self
.inv_index_readers
.iter() .iter()
.map(|inverted_index| inverted_index.terms().stream()) .map(|inverted_index| inverted_index.terms().stream())
.collect(); .collect();
@@ -108,7 +124,8 @@ impl FieldSearcher {
impl fmt::Debug for Searcher { impl fmt::Debug for Searcher {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result { fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
let segment_ids = self.segment_readers let segment_ids = self
.segment_readers
.iter() .iter()
.map(|segment_reader| segment_reader.segment_id()) .map(|segment_reader| segment_reader.segment_id())
.collect::<Vec<_>>(); .collect::<Vec<_>>();
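One consequence of the signature change above: `Searcher::doc` now takes `DocAddress` by value (a small, copyable pair), so call sites simply drop the borrow. A before/after sketch, assuming a `searcher` and a `doc_address` are in scope:

// before this change:
// let retrieved_doc = searcher.doc(&doc_address)?;
// after this change:
let retrieved_doc = searcher.doc(doc_address)?;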


@@ -52,12 +52,12 @@ impl SegmentId {
/// Picking the first 8 chars is ok to identify /// Picking the first 8 chars is ok to identify
/// segments in a display message. /// segments in a display message.
pub fn short_uuid_string(&self) -> String { pub fn short_uuid_string(&self) -> String {
(&self.0.simple().to_string()[..8]).to_string() (&self.0.to_simple_ref().to_string()[..8]).to_string()
} }
/// Returns a segment uuid string. /// Returns a segment uuid string.
pub fn uuid_string(&self) -> String { pub fn uuid_string(&self) -> String {
self.0.simple().to_string() self.0.to_simple_ref().to_string()
} }
} }


@@ -50,7 +50,7 @@ impl<'a> serde::Deserialize<'a> for SegmentMeta {
{ {
let inner = InnerSegmentMeta::deserialize(deserializer)?; let inner = InnerSegmentMeta::deserialize(deserializer)?;
let tracked = INVENTORY.track(inner); let tracked = INVENTORY.track(inner);
Ok(SegmentMeta { tracked: tracked }) Ok(SegmentMeta { tracked })
} }
} }


@@ -4,7 +4,6 @@ use core::InvertedIndexReader;
use core::Segment; use core::Segment;
use core::SegmentComponent; use core::SegmentComponent;
use core::SegmentId; use core::SegmentId;
use core::SegmentMeta;
use error::TantivyError; use error::TantivyError;
use fastfield::DeleteBitSet; use fastfield::DeleteBitSet;
use fastfield::FacetReader; use fastfield::FacetReader;
@@ -44,7 +43,8 @@ pub struct SegmentReader {
inv_idx_reader_cache: Arc<RwLock<HashMap<Field, Arc<InvertedIndexReader>>>>, inv_idx_reader_cache: Arc<RwLock<HashMap<Field, Arc<InvertedIndexReader>>>>,
segment_id: SegmentId, segment_id: SegmentId,
segment_meta: SegmentMeta, max_doc: DocId,
num_docs: DocId,
termdict_composite: CompositeFile, termdict_composite: CompositeFile,
postings_composite: CompositeFile, postings_composite: CompositeFile,
@@ -64,7 +64,7 @@ impl SegmentReader {
/// Today, `tantivy` does not handle deletes, so it happens /// Today, `tantivy` does not handle deletes, so it happens
/// to also be the number of documents in the index. /// to also be the number of documents in the index.
pub fn max_doc(&self) -> DocId { pub fn max_doc(&self) -> DocId {
self.segment_meta.max_doc() self.max_doc
} }
/// Returns the number of documents. /// Returns the number of documents.
@@ -73,7 +73,7 @@ impl SegmentReader {
/// Today, `tantivy` does not handle deletes so max doc and /// Today, `tantivy` does not handle deletes so max doc and
/// num_docs are the same. /// num_docs are the same.
pub fn num_docs(&self) -> DocId { pub fn num_docs(&self) -> DocId {
self.segment_meta.num_docs() self.num_docs
} }
/// Returns the schema of the index this segment belongs to. /// Returns the schema of the index this segment belongs to.
@@ -153,15 +153,17 @@ impl SegmentReader {
/// Accessor to the `BytesFastFieldReader` associated to a given `Field`. /// Accessor to the `BytesFastFieldReader` associated to a given `Field`.
pub fn bytes_fast_field_reader(&self, field: Field) -> fastfield::Result<BytesFastFieldReader> { pub fn bytes_fast_field_reader(&self, field: Field) -> fastfield::Result<BytesFastFieldReader> {
let field_entry = self.schema.get_field_entry(field); let field_entry = self.schema.get_field_entry(field);
match field_entry.field_type() { match *field_entry.field_type() {
&FieldType::Bytes => {} FieldType::Bytes => {}
_ => return Err(FastFieldNotAvailableError::new(field_entry)), _ => return Err(FastFieldNotAvailableError::new(field_entry)),
} }
let idx_reader = self.fast_fields_composite let idx_reader = self
.fast_fields_composite
.open_read_with_idx(field, 0) .open_read_with_idx(field, 0)
.ok_or_else(|| FastFieldNotAvailableError::new(field_entry)) .ok_or_else(|| FastFieldNotAvailableError::new(field_entry))
.map(FastFieldReader::open)?; .map(FastFieldReader::open)?;
let values = self.fast_fields_composite let values = self
.fast_fields_composite
.open_read_with_idx(field, 1) .open_read_with_idx(field, 1)
.ok_or_else(|| FastFieldNotAvailableError::new(field_entry))?; .ok_or_else(|| FastFieldNotAvailableError::new(field_entry))?;
Ok(BytesFastFieldReader::open(idx_reader, values)) Ok(BytesFastFieldReader::open(idx_reader, values))
@@ -175,7 +177,7 @@ impl SegmentReader {
"The field {:?} is not a \ "The field {:?} is not a \
hierarchical facet.", hierarchical facet.",
field_entry field_entry
)).into()); )));
} }
let term_ords_reader = self.multi_fast_field_reader(field)?; let term_ords_reader = self.multi_fast_field_reader(field)?;
let termdict_source = self.termdict_composite.open_read(field).ok_or_else(|| { let termdict_source = self.termdict_composite.open_read(field).ok_or_else(|| {
@@ -186,7 +188,7 @@ impl SegmentReader {
field_entry.name() field_entry.name()
)) ))
})?; })?;
let termdict = TermDictionary::from_source(termdict_source); let termdict = TermDictionary::from_source(&termdict_source);
let facet_reader = FacetReader::new(term_ords_reader, termdict); let facet_reader = FacetReader::new(term_ords_reader, termdict);
Ok(facet_reader) Ok(facet_reader)
} }
@@ -225,6 +227,8 @@ impl SegmentReader {
let store_source = segment.open_read(SegmentComponent::STORE)?; let store_source = segment.open_read(SegmentComponent::STORE)?;
let store_reader = StoreReader::from_source(store_source); let store_reader = StoreReader::from_source(store_source);
fail_point!("SegmentReader::open#middle");
let postings_source = segment.open_read(SegmentComponent::POSTINGS)?; let postings_source = segment.open_read(SegmentComponent::POSTINGS)?;
let postings_composite = CompositeFile::open(&postings_source)?; let postings_composite = CompositeFile::open(&postings_source)?;
@@ -260,7 +264,8 @@ impl SegmentReader {
let schema = segment.schema(); let schema = segment.schema();
Ok(SegmentReader { Ok(SegmentReader {
inv_idx_reader_cache: Arc::new(RwLock::new(HashMap::new())), inv_idx_reader_cache: Arc::new(RwLock::new(HashMap::new())),
segment_meta: segment.meta().clone(), max_doc: segment.meta().max_doc(),
num_docs: segment.meta().num_docs(),
termdict_composite, termdict_composite,
postings_composite, postings_composite,
fast_fields_composite, fast_fields_composite,
@@ -282,7 +287,8 @@ impl SegmentReader {
/// term dictionary associated to a specific field, /// term dictionary associated to a specific field,
/// and opening the posting list associated to any term. /// and opening the posting list associated to any term.
pub fn inverted_index(&self, field: Field) -> Arc<InvertedIndexReader> { pub fn inverted_index(&self, field: Field) -> Arc<InvertedIndexReader> {
if let Some(inv_idx_reader) = self.inv_idx_reader_cache if let Some(inv_idx_reader) = self
.inv_idx_reader_cache
.read() .read()
.expect("Lock poisoned. This should never happen") .expect("Lock poisoned. This should never happen")
.get(&field) .get(&field)
@@ -306,25 +312,28 @@ impl SegmentReader {
// As a result, no data is associated to the inverted index. // As a result, no data is associated to the inverted index.
// //
// Returns an empty inverted index. // Returns an empty inverted index.
return Arc::new(InvertedIndexReader::empty(field_type.clone())); return Arc::new(InvertedIndexReader::empty(field_type));
} }
let postings_source = postings_source_opt.unwrap(); let postings_source = postings_source_opt.unwrap();
let termdict_source = self.termdict_composite let termdict_source = self
.termdict_composite
.open_read(field) .open_read(field)
.expect("Failed to open field term dictionary in composite file. Is the field indexed"); .expect("Failed to open field term dictionary in composite file. Is the field indexed");
let positions_source = self.positions_composite let positions_source = self
.positions_composite
.open_read(field) .open_read(field)
.expect("Index corrupted. Failed to open field positions in composite file."); .expect("Index corrupted. Failed to open field positions in composite file.");
let positions_idx_source = self.positions_idx_composite let positions_idx_source = self
.positions_idx_composite
.open_read(field) .open_read(field)
.expect("Index corrupted. Failed to open field positions in composite file."); .expect("Index corrupted. Failed to open field positions in composite file.");
let inv_idx_reader = Arc::new(InvertedIndexReader::new( let inv_idx_reader = Arc::new(InvertedIndexReader::new(
TermDictionary::from_source(termdict_source), TermDictionary::from_source(&termdict_source),
postings_source, postings_source,
positions_source, positions_source,
positions_idx_source, positions_idx_source,
@@ -391,7 +400,7 @@ pub struct SegmentReaderAliveDocsIterator<'a> {
impl<'a> SegmentReaderAliveDocsIterator<'a> { impl<'a> SegmentReaderAliveDocsIterator<'a> {
pub fn new(reader: &'a SegmentReader) -> SegmentReaderAliveDocsIterator<'a> { pub fn new(reader: &'a SegmentReader) -> SegmentReaderAliveDocsIterator<'a> {
SegmentReaderAliveDocsIterator { SegmentReaderAliveDocsIterator {
reader: reader, reader,
max_doc: reader.max_doc(), max_doc: reader.max_doc(),
current: 0, current: 0,
} }


@@ -77,15 +77,15 @@ pub trait Directory: DirectoryClone + fmt::Debug + Send + Sync + 'static {
/// DirectoryClone /// DirectoryClone
pub trait DirectoryClone { pub trait DirectoryClone {
/// Clones the directory and boxes the clone /// Clones the directory and boxes the clone
fn box_clone(&self) -> Box<Directory>; fn box_clone(&self) -> Box<Directory>;
} }
impl<T> DirectoryClone for T impl<T> DirectoryClone for T
where where
T: 'static + Directory + Clone, T: 'static + Directory + Clone,
{ {
fn box_clone(&self) -> Box<Directory> { fn box_clone(&self) -> Box<Directory> {
Box::new(self.clone()) Box::new(self.clone())
} }
} }


@@ -2,6 +2,7 @@ use core::MANAGED_FILEPATH;
use directory::error::{DeleteError, IOError, OpenReadError, OpenWriteError}; use directory::error::{DeleteError, IOError, OpenReadError, OpenWriteError};
use directory::{ReadOnlySource, WritePtr}; use directory::{ReadOnlySource, WritePtr};
use error::TantivyError; use error::TantivyError;
use indexer::LockType;
use serde_json; use serde_json;
use std::collections::HashSet; use std::collections::HashSet;
use std::io; use std::io;
@@ -13,6 +14,17 @@ use std::sync::{Arc, RwLock};
use Directory; use Directory;
use Result; use Result;
/// Returns true iff the file is "managed".
/// Non-managed files are not subject to garbage collection.
///
/// Filenames that start with a "." (typically locks)
/// are not managed.
fn is_managed(path: &Path) -> bool {
path.to_str()
.map(|p_str| !p_str.starts_with('.'))
.unwrap_or(true)
}
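For illustration (the file names below are made up, not actual tantivy paths):

// Lock files and other dot-files are left alone by garbage collection...
assert!(!is_managed(Path::new(".some-writer.lock")));
// ...while ordinary files created through the directory are managed.
assert!(is_managed(Path::new("segment_data.idx")));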
/// Wrapper of directories that keeps track of files created by Tantivy. /// Wrapper of directories that keeps track of files created by Tantivy.
/// ///
/// A managed directory is just a wrapper of a directory /// A managed directory is just a wrapper of a directory
@@ -40,7 +52,7 @@ fn save_managed_paths(
wlock: &RwLockWriteGuard<MetaInformation>, wlock: &RwLockWriteGuard<MetaInformation>,
) -> io::Result<()> { ) -> io::Result<()> {
let mut w = serde_json::to_vec(&wlock.managed_paths)?; let mut w = serde_json::to_vec(&wlock.managed_paths)?;
write!(&mut w, "\n")?; writeln!(&mut w)?;
directory.atomic_write(&MANAGED_FILEPATH, &w[..])?; directory.atomic_write(&MANAGED_FILEPATH, &w[..])?;
Ok(()) Ok(())
} }
@@ -82,25 +94,35 @@ impl ManagedDirectory {
pub fn garbage_collect<L: FnOnce() -> HashSet<PathBuf>>(&mut self, get_living_files: L) { pub fn garbage_collect<L: FnOnce() -> HashSet<PathBuf>>(&mut self, get_living_files: L) {
info!("Garbage collect"); info!("Garbage collect");
let mut files_to_delete = vec![]; let mut files_to_delete = vec![];
// It is crucial to get the living files after acquiring the
// read lock of meta informations. That way, we
// avoid the following scenario.
//
// 1) we get the list of living files.
// 2) someone creates a new file.
// 3) we start garbage collection and remove this file
// even though it is a living file.
//
// releasing the lock as .delete() will use it too.
{ {
// releasing the lock as .delete() will use it too. let meta_informations_rlock = self
let meta_informations_rlock = self.meta_informations .meta_informations
.read() .read()
.expect("Managed directory rlock poisoned in garbage collect."); .expect("Managed directory rlock poisoned in garbage collect.");
// It is crucial to get the living files after acquiring the // The point of this second "file" lock is to enforce the following scenario
// read lock of meta informations. That way, we // 1) process B tries to load a new set of searcher.
// avoid the following scenario. // The list of segments is loaded
// // 2) writer change meta.json (for instance after a merge or a commit)
// 1) we get the list of living files. // 3) gc kicks in.
// 2) someone creates a new file. // 4) gc removes a file that was useful for process B, before process B opened it.
// 3) we start garbage collection and remove this file if let Ok(_meta_lock) = LockType::MetaLock.acquire_lock(self) {
// even though it is a living file. let living_files = get_living_files();
let living_files = get_living_files(); for managed_path in &meta_informations_rlock.managed_paths {
if !living_files.contains(managed_path) {
for managed_path in &meta_informations_rlock.managed_paths { files_to_delete.push(managed_path.clone());
if !living_files.contains(managed_path) { }
files_to_delete.push(managed_path.clone());
} }
} }
} }
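The reshuffled comments above encode one invariant: the set of living files must be computed only after the meta lock and the managed-paths read lock are held. In pseudocode (step names are illustrative, not tantivy's API):

// 1. acquire the meta lock, so another process cannot be caught between
//    loading a segment list and opening the files it names;
// 2. acquire the read lock on the managed-paths set;
// 3. only now ask for the set of living files;
// 4. mark every managed path that is not living as a deletion candidate;
// 5. drop the locks, then delete the candidates (.delete() re-acquires them).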
@@ -134,7 +156,8 @@ impl ManagedDirectory {
if !deleted_files.is_empty() { if !deleted_files.is_empty() {
// update the list of managed files by removing // update the list of managed files by removing
// the file that were removed. // the file that were removed.
let mut meta_informations_wlock = self.meta_informations let mut meta_informations_wlock = self
.meta_informations
.write() .write()
.expect("Managed directory wlock poisoned (2)."); .expect("Managed directory wlock poisoned (2).");
{ {
@@ -156,8 +179,17 @@ impl ManagedDirectory {
/// registering the filepath and creating the file /// registering the filepath and creating the file
/// will not lead to garbage files that will /// will not lead to garbage files that will
/// never get removed. /// never get removed.
///
/// Files starting with "." are reserved for locks.
/// They are not managed and are never subject
/// to garbage collection.
fn register_file_as_managed(&mut self, filepath: &Path) -> io::Result<()> { fn register_file_as_managed(&mut self, filepath: &Path) -> io::Result<()> {
let mut meta_wlock = self.meta_informations // Files starting by "." (e.g. lock files) are not managed.
if !is_managed(filepath) {
return Ok(());
}
let mut meta_wlock = self
.meta_informations
.write() .write()
.expect("Managed file lock poisoned"); .expect("Managed file lock poisoned");
let has_changed = meta_wlock.managed_paths.insert(filepath.to_owned()); let has_changed = meta_wlock.managed_paths.insert(filepath.to_owned());


@@ -32,7 +32,8 @@ fn open_mmap(full_path: &Path) -> result::Result<Option<MmapReadOnly>, OpenReadE
} }
})?; })?;
let meta_data = file.metadata() let meta_data = file
.metadata()
.map_err(|e| IOError::with_path(full_path.to_owned(), e))?; .map_err(|e| IOError::with_path(full_path.to_owned(), e))?;
if meta_data.len() == 0 { if meta_data.len() == 0 {
// if the file size is 0, it will not be possible // if the file size is 0, it will not be possible
@@ -309,7 +310,8 @@ impl Directory for MmapDirectory {
// when the last reference is gone. // when the last reference is gone.
mmap_cache.cache.remove(&full_path); mmap_cache.cache.remove(&full_path);
match fs::remove_file(&full_path) { match fs::remove_file(&full_path) {
Ok(_) => self.sync_directory() Ok(_) => self
.sync_directory()
.map_err(|e| IOError::with_path(path.to_owned(), e).into()), .map_err(|e| IOError::with_path(path.to_owned(), e).into()),
Err(e) => { Err(e) => {
if e.kind() == io::ErrorKind::NotFound { if e.kind() == io::ErrorKind::NotFound {


@@ -12,6 +12,7 @@ mod managed_directory;
mod ram_directory; mod ram_directory;
mod read_only_source; mod read_only_source;
mod shared_vec_slice; mod shared_vec_slice;
mod static_dictionnary;
/// Errors specific to the directory module. /// Errors specific to the directory module.
pub mod error; pub mod error;
@@ -21,6 +22,7 @@ use std::io::{BufWriter, Seek, Write};
pub use self::directory::{Directory, DirectoryClone}; pub use self::directory::{Directory, DirectoryClone};
pub use self::ram_directory::RAMDirectory; pub use self::ram_directory::RAMDirectory;
pub use self::read_only_source::ReadOnlySource; pub use self::read_only_source::ReadOnlySource;
pub use self::static_dictionnary::StaticDirectory;
#[cfg(feature = "mmap")] #[cfg(feature = "mmap")]
pub use self::mmap_directory::MmapDirectory; pub use self::mmap_directory::MmapDirectory;


@@ -100,8 +100,7 @@ impl InnerDirectory {
            );
            let io_err = make_io_err(msg);
            OpenReadError::IOError(IOError::with_path(path.to_owned(), io_err))
-       })
-           .and_then(|readable_map| {
+       }).and_then(|readable_map| {
            readable_map
                .get(path)
                .ok_or_else(|| OpenReadError::FileDoesNotExist(PathBuf::from(path)))
@@ -121,8 +120,7 @@ impl InnerDirectory {
            );
            let io_err = make_io_err(msg);
            DeleteError::IOError(IOError::with_path(path.to_owned(), io_err))
-       })
-           .and_then(|mut writable_map| match writable_map.remove(path) {
+       }).and_then(|mut writable_map| match writable_map.remove(path) {
            Some(_) => Ok(()),
            None => Err(DeleteError::FileDoesNotExist(PathBuf::from(path))),
        })
@@ -170,10 +168,10 @@ impl Directory for RAMDirectory {
        let path_buf = PathBuf::from(path);
        let vec_writer = VecWriter::new(path_buf.clone(), self.fs.clone());
-       let exists = self.fs
+       let exists = self
+           .fs
            .write(path_buf.clone(), &Vec::new())
            .map_err(|err| IOError::with_path(path.to_owned(), err))?;
        // force the creation of the file to mimic the MMap directory.
        if exists {
            Err(OpenWriteError::FileAlreadyExists(path_buf))
@@ -196,6 +194,10 @@ impl Directory for RAMDirectory {
    }
    fn atomic_write(&mut self, path: &Path, data: &[u8]) -> io::Result<()> {
+       fail_point!("RAMDirectory::atomic_write", |msg| Err(io::Error::new(
+           io::ErrorKind::Other,
+           msg.unwrap_or("Undefined".to_string())
+       )));
        let path_buf = PathBuf::from(path);
        let mut vec_writer = VecWriter::new(path_buf.clone(), self.fs.clone());
        self.fs.write(path_buf, &Vec::new())?;
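The `fail_point!` macro above comes from the `fail` crate (declared in `lib.rs` later in this compare). A test can switch the injected failure on and off by name; the sketch below simply mirrors the `test_write_commit_fails` test further down and is shown here only as a usage illustration:

    // make every RAMDirectory::atomic_write return an error...
    fail::cfg("RAMDirectory::atomic_write", "return(error_write_failed)").unwrap();
    assert!(index_writer.commit().is_err()); // commits now fail when saving the meta file
    // ...and turn the failure off again
    fail::cfg("RAMDirectory::atomic_write", "off").unwrap();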


@@ -6,6 +6,8 @@ use stable_deref_trait::{CloneStableDeref, StableDeref};
use std::ops::Deref;
+const EMPTY_SLICE: [u8; 0] = [];
/// Read object that represents files in tantivy.
///
/// These read objects are only in charge to deliver
@@ -18,6 +20,8 @@ pub enum ReadOnlySource {
    Mmap(MmapReadOnly),
    /// Wrapping a `Vec<u8>`
    Anonymous(SharedVecSlice),
+   /// Wrapping a static slice
+   Static(&'static [u8])
}
unsafe impl StableDeref for ReadOnlySource {}
@@ -34,7 +38,7 @@ impl Deref for ReadOnlySource {
impl ReadOnlySource {
    /// Creates an empty ReadOnlySource
    pub fn empty() -> ReadOnlySource {
-       ReadOnlySource::Anonymous(SharedVecSlice::empty())
+       ReadOnlySource::Static(&EMPTY_SLICE)
    }
    /// Returns the data underlying the ReadOnlySource object.
@@ -43,6 +47,7 @@ impl ReadOnlySource {
        #[cfg(feature = "mmap")]
        ReadOnlySource::Mmap(ref mmap_read_only) => mmap_read_only.as_slice(),
        ReadOnlySource::Anonymous(ref shared_vec) => shared_vec.as_slice(),
+       ReadOnlySource::Static(data) => data,
    }
}
@@ -80,6 +85,9 @@ impl ReadOnlySource {
            ReadOnlySource::Anonymous(ref shared_vec) => {
                ReadOnlySource::Anonymous(shared_vec.slice(from_offset, to_offset))
            }
+           ReadOnlySource::Static(data) => {
+               ReadOnlySource::Static(&data[from_offset..to_offset])
+           }
        }
    }
@@ -119,3 +127,9 @@ impl From<Vec<u8>> for ReadOnlySource {
        ReadOnlySource::Anonymous(shared_data)
    }
}
+impl From<&'static [u8]> for ReadOnlySource {
+   fn from(data: &'static [u8]) -> ReadOnlySource {
+       ReadOnlySource::Static(data)
+   }
+}
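A minimal usage sketch of the new `Static` variant, assuming only the API visible in this hunk (`From<&'static [u8]>`, `as_slice`, and the offset-based `slice` method the last arm belongs to):

    static BYTES: &'static [u8] = b"hello tantivy";
    // goes through the new From<&'static [u8]> impl
    let source = ReadOnlySource::from(BYTES);
    // the Static variant hands the slice back without copying
    assert_eq!(source.as_slice(), BYTES);
    // slicing a Static source re-slices the same 'static data
    assert_eq!(source.slice(0, 5).as_slice(), b"hello");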


@@ -4,6 +4,7 @@ use std::io;
use directory::error::{IOError, OpenDirectoryError, OpenReadError, OpenWriteError};
use fastfield::FastFieldNotAvailableError;
+use indexer::LockType;
use query;
use schema;
use serde_json;
@@ -19,6 +20,12 @@ pub enum TantivyError {
    /// File already exists, this is a problem when we try to write into a new file.
    #[fail(display = "file already exists: '{:?}'", _0)]
    FileAlreadyExists(PathBuf),
+   /// Failed to acquire file lock
+   #[fail(
+       display = "Failed to acquire Lockfile: {:?}. Possible causes: another IndexWriter instance or panic during previous lock drop.",
+       _0
+   )]
+   LockFailure(LockType),
    /// IO Error.
    #[fail(display = "an IO error occurred: '{}'", _0)]
    IOError(#[cause] IOError),
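For illustration, a hedged sketch of how calling code might react to the new variant; the `writer` call and memory budget mirror the tests further down in this compare, and the error text is the one declared just above:

    match index.writer(40_000_000) {
        Ok(index_writer) => { /* proceed with indexing */ }
        Err(TantivyError::LockFailure(_)) => {
            // another IndexWriter already holds the indexer lock file,
            // or a stale lock file survived a crash and has to be removed manually
        }
        Err(other) => panic!("unexpected error: {:?}", other),
    }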
@@ -46,48 +53,46 @@ pub enum TantivyError {
impl From<FastFieldNotAvailableError> for TantivyError { impl From<FastFieldNotAvailableError> for TantivyError {
fn from(fastfield_error: FastFieldNotAvailableError) -> TantivyError { fn from(fastfield_error: FastFieldNotAvailableError) -> TantivyError {
TantivyError::FastFieldError(fastfield_error).into() TantivyError::FastFieldError(fastfield_error)
} }
} }
impl From<IOError> for TantivyError { impl From<IOError> for TantivyError {
fn from(io_error: IOError) -> TantivyError { fn from(io_error: IOError) -> TantivyError {
TantivyError::IOError(io_error).into() TantivyError::IOError(io_error)
} }
} }
impl From<io::Error> for TantivyError { impl From<io::Error> for TantivyError {
fn from(io_error: io::Error) -> TantivyError { fn from(io_error: io::Error) -> TantivyError {
TantivyError::IOError(io_error.into()).into() TantivyError::IOError(io_error.into())
} }
} }
impl From<query::QueryParserError> for TantivyError { impl From<query::QueryParserError> for TantivyError {
fn from(parsing_error: query::QueryParserError) -> TantivyError { fn from(parsing_error: query::QueryParserError) -> TantivyError {
TantivyError::InvalidArgument(format!("Query is invalid. {:?}", parsing_error)).into() TantivyError::InvalidArgument(format!("Query is invalid. {:?}", parsing_error))
} }
} }
impl<Guard> From<PoisonError<Guard>> for TantivyError { impl<Guard> From<PoisonError<Guard>> for TantivyError {
fn from(_: PoisonError<Guard>) -> TantivyError { fn from(_: PoisonError<Guard>) -> TantivyError {
TantivyError::Poisoned.into() TantivyError::Poisoned
} }
} }
impl From<OpenReadError> for TantivyError { impl From<OpenReadError> for TantivyError {
fn from(error: OpenReadError) -> TantivyError { fn from(error: OpenReadError) -> TantivyError {
match error { match error {
OpenReadError::FileDoesNotExist(filepath) => { OpenReadError::FileDoesNotExist(filepath) => TantivyError::PathDoesNotExist(filepath),
TantivyError::PathDoesNotExist(filepath).into() OpenReadError::IOError(io_error) => TantivyError::IOError(io_error),
}
OpenReadError::IOError(io_error) => TantivyError::IOError(io_error).into(),
} }
} }
} }
impl From<schema::DocParsingError> for TantivyError { impl From<schema::DocParsingError> for TantivyError {
fn from(error: schema::DocParsingError) -> TantivyError { fn from(error: schema::DocParsingError) -> TantivyError {
TantivyError::InvalidArgument(format!("Failed to parse document {:?}", error)).into() TantivyError::InvalidArgument(format!("Failed to parse document {:?}", error))
} }
} }
@@ -98,7 +103,7 @@ impl From<OpenWriteError> for TantivyError {
TantivyError::FileAlreadyExists(filepath) TantivyError::FileAlreadyExists(filepath)
} }
OpenWriteError::IOError(io_error) => TantivyError::IOError(io_error), OpenWriteError::IOError(io_error) => TantivyError::IOError(io_error),
}.into() }
} }
} }
@@ -106,11 +111,11 @@ impl From<OpenDirectoryError> for TantivyError {
fn from(error: OpenDirectoryError) -> TantivyError { fn from(error: OpenDirectoryError) -> TantivyError {
match error { match error {
OpenDirectoryError::DoesNotExist(directory_path) => { OpenDirectoryError::DoesNotExist(directory_path) => {
TantivyError::PathDoesNotExist(directory_path).into() TantivyError::PathDoesNotExist(directory_path)
}
OpenDirectoryError::NotADirectory(directory_path) => {
TantivyError::InvalidArgument(format!("{:?} is not a directory", directory_path))
} }
OpenDirectoryError::NotADirectory(directory_path) => TantivyError::InvalidArgument(
format!("{:?} is not a directory", directory_path),
).into(),
} }
} }
} }
@@ -118,6 +123,6 @@ impl From<OpenDirectoryError> for TantivyError {
impl From<serde_json::Error> for TantivyError { impl From<serde_json::Error> for TantivyError {
fn from(error: serde_json::Error) -> TantivyError { fn from(error: serde_json::Error) -> TantivyError {
let io_err = io::Error::from(error); let io_err = io::Error::from(error);
TantivyError::IOError(io_err.into()).into() TantivyError::IOError(io_err.into())
} }
} }


@@ -51,7 +51,7 @@ impl BytesFastFieldWriter {
self.next_doc(); self.next_doc();
for field_value in doc.field_values() { for field_value in doc.field_values() {
if field_value.field() == self.field { if field_value.field() == self.field {
if let &Value::Bytes(ref bytes) = field_value.value() { if let Value::Bytes(ref bytes) = *field_value.value() {
self.vals.extend_from_slice(bytes); self.vals.extend_from_slice(bytes);
} else { } else {
panic!( panic!(


@@ -41,7 +41,8 @@ pub struct DeleteBitSet {
impl DeleteBitSet {
    /// Opens a delete bitset given its data source.
    pub fn open(data: ReadOnlySource) -> DeleteBitSet {
-       let num_deleted: usize = data.as_slice()
+       let num_deleted: usize = data
+           .as_slice()
            .iter()
            .map(|b| b.count_ones() as usize)
            .sum();


@@ -56,7 +56,8 @@ impl FacetReader {
/// Given a term ordinal returns the term associated to it. /// Given a term ordinal returns the term associated to it.
pub fn facet_from_ord(&self, facet_ord: TermOrdinal, output: &mut Facet) { pub fn facet_from_ord(&self, facet_ord: TermOrdinal, output: &mut Facet) {
let found_term = self.term_dict let found_term = self
.term_dict
.ord_to_term(facet_ord as u64, output.inner_buffer_mut()); .ord_to_term(facet_ord as u64, output.inner_buffer_mut());
assert!(found_term, "Term ordinal {} no found.", facet_ord); assert!(found_term, "Term ordinal {} no found.", facet_ord);
} }


@@ -370,7 +370,7 @@ mod tests {
    pub fn generate_permutation() -> Vec<u64> {
        let seed: [u8; 16] = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16];
        let mut rng = XorShiftRng::from_seed(seed);
-       let mut permutation: Vec<u64> = (0u64..1_000_000u64).collect();
+       let mut permutation: Vec<u64> = (0u64..100_000u64).collect();
        rng.shuffle(&mut permutation);
        permutation
    }


@@ -132,7 +132,8 @@ impl MultiValueIntFastFieldWriter {
); );
let mut doc_vals: Vec<u64> = Vec::with_capacity(100); let mut doc_vals: Vec<u64> = Vec::with_capacity(100);
for (start, stop) in self.doc_index for (start, stop) in self
.doc_index
.windows(2) .windows(2)
.map(|interval| (interval[0], interval[1])) .map(|interval| (interval[0], interval[1]))
.chain(Some(last_interval).into_iter()) .chain(Some(last_interval).into_iter())
@@ -148,7 +149,6 @@ impl MultiValueIntFastFieldWriter {
value_serializer.add_val(val)?; value_serializer.add_val(val)?;
} }
} }
} }
None => { None => {
let val_min_max = self.vals.iter().cloned().minmax(); let val_min_max = self.vals.iter().cloned().minmax();


@@ -11,7 +11,6 @@ use schema::SchemaBuilder;
use schema::FAST; use schema::FAST;
use std::collections::HashMap; use std::collections::HashMap;
use std::marker::PhantomData; use std::marker::PhantomData;
use std::mem;
use std::path::Path; use std::path::Path;
use DocId; use DocId;
@@ -80,7 +79,8 @@ impl<Item: FastValue> FastFieldReader<Item> {
// TODO change start to `u64`. // TODO change start to `u64`.
// For multifastfield, start is an index in a second fastfield, not a `DocId` // For multifastfield, start is an index in a second fastfield, not a `DocId`
pub fn get_range(&self, start: u32, output: &mut [Item]) { pub fn get_range(&self, start: u32, output: &mut [Item]) {
let output_u64: &mut [u64] = unsafe { mem::transmute(output) }; // ok: Item is either `u64` or `i64` // ok: Item is either `u64` or `i64`
let output_u64: &mut [u64] = unsafe { &mut *(output as *mut [Item] as *mut [u64]) };
self.bit_unpacker.get_range(start, output_u64); self.bit_unpacker.get_range(start, output_u64);
for out in output_u64.iter_mut() { for out in output_u64.iter_mut() {
*out = Item::from_u64(*out + self.min_value_u64).as_u64(); *out = Item::from_u64(*out + self.min_value_u64).as_u64();


@@ -10,27 +10,28 @@ pub fn fieldnorm_to_id(fieldnorm: u32) -> u8 {
.unwrap_or_else(|idx| idx - 1) as u8 .unwrap_or_else(|idx| idx - 1) as u8
} }
#[cfg_attr(feature = "cargo-clippy", allow(clippy::unreadable_literal))]
pub const FIELD_NORMS_TABLE: [u32; 256] = [ pub const FIELD_NORMS_TABLE: [u32; 256] = [
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 42, 44, 46, 48, 50, 52, 54, 56, 60, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 42, 44, 46, 48, 50, 52, 54, 56, 60,
64, 68, 72, 76, 80, 84, 88, 96, 104, 112, 120, 128, 136, 144, 152, 168, 184, 200, 216, 232, 64, 68, 72, 76, 80, 84, 88, 96, 104, 112, 120, 128, 136, 144, 152, 168, 184, 200, 216, 232,
248, 264, 280, 312, 344, 376, 408, 440, 472, 504, 536, 600, 664, 728, 792, 856, 920, 984, 1048, 248, 264, 280, 312, 344, 376, 408, 440, 472, 504, 536, 600, 664, 728, 792, 856, 920, 984,
1176, 1304, 1432, 1560, 1688, 1816, 1944, 2072, 2328, 2584, 2840, 3096, 3352, 3608, 3864, 4120, 1_048, 1176, 1304, 1432, 1560, 1688, 1816, 1944, 2072, 2328, 2584, 2840, 3096, 3352, 3608,
4632, 5144, 5656, 6168, 6680, 7192, 7704, 8216, 9240, 10264, 11288, 12312, 13336, 14360, 15384, 3864, 4120, 4632, 5144, 5656, 6168, 6680, 7192, 7704, 8216, 9240, 10264, 11288, 12312, 13336,
16408, 18456, 20504, 22552, 24600, 26648, 28696, 30744, 32792, 36888, 40984, 45080, 49176, 14360, 15384, 16408, 18456, 20504, 22552, 24600, 26648, 28696, 30744, 32792, 36888, 40984,
53272, 57368, 61464, 65560, 73752, 81944, 90136, 98328, 106520, 114712, 122904, 131096, 147480, 45080, 49176, 53272, 57368, 61464, 65560, 73752, 81944, 90136, 98328, 106520, 114712, 122904,
163864, 180248, 196632, 213016, 229400, 245784, 262168, 294936, 327704, 360472, 393240, 426008, 131096, 147480, 163864, 180248, 196632, 213016, 229400, 245784, 262168, 294936, 327704, 360472,
458776, 491544, 524312, 589848, 655384, 720920, 786456, 851992, 917528, 983064, 1048600, 393240, 426008, 458776, 491544, 524312, 589848, 655384, 720920, 786456, 851992, 917528, 983064,
1179672, 1310744, 1441816, 1572888, 1703960, 1835032, 1966104, 2097176, 2359320, 2621464, 1048600, 1179672, 1310744, 1441816, 1572888, 1703960, 1835032, 1966104, 2097176, 2359320,
2883608, 3145752, 3407896, 3670040, 3932184, 4194328, 4718616, 5242904, 5767192, 6291480, 2621464, 2883608, 3145752, 3407896, 3670040, 3932184, 4194328, 4718616, 5242904, 5767192,
6815768, 7340056, 7864344, 8388632, 9437208, 10485784, 11534360, 12582936, 13631512, 14680088, 6291480, 6815768, 7340056, 7864344, 8388632, 9437208, 10485784, 11534360, 12582936, 13631512,
15728664, 16777240, 18874392, 20971544, 23068696, 25165848, 27263000, 29360152, 31457304, 14680088, 15728664, 16777240, 18874392, 20971544, 23068696, 25165848, 27263000, 29360152,
33554456, 37748760, 41943064, 46137368, 50331672, 54525976, 58720280, 62914584, 67108888, 31457304, 33554456, 37748760, 41943064, 46137368, 50331672, 54525976, 58720280, 62914584,
75497496, 83886104, 92274712, 100663320, 109051928, 117440536, 125829144, 134217752, 150994968, 67108888, 75497496, 83886104, 92274712, 100663320, 109051928, 117440536, 125829144, 134217752,
167772184, 184549400, 201326616, 218103832, 234881048, 251658264, 268435480, 301989912, 150994968, 167772184, 184549400, 201326616, 218103832, 234881048, 251658264, 268435480,
335544344, 369098776, 402653208, 436207640, 469762072, 503316504, 536870936, 603979800, 301989912, 335544344, 369098776, 402653208, 436207640, 469762072, 503316504, 536870936,
671088664, 738197528, 805306392, 872415256, 939524120, 1006632984, 1073741848, 1207959576, 603979800, 671088664, 738197528, 805306392, 872415256, 939524120, 1006632984, 1073741848,
1342177304, 1476395032, 1610612760, 1744830488, 1879048216, 2013265944, 1207959576, 1342177304, 1476395032, 1610612760, 1744830488, 1879048216, 2013265944,
]; ];
#[cfg(test)] #[cfg(test)]


@@ -1,8 +1,8 @@
use rand::thread_rng;
use std::collections::HashSet;
-use rand::Rng;
use rand::distributions::Range;
+use rand::Rng;
use schema::*;
use Index;
use Searcher;


@@ -52,7 +52,8 @@ impl DeleteQueue {
// //
// Past delete operations are not accessible. // Past delete operations are not accessible.
pub fn cursor(&self) -> DeleteCursor { pub fn cursor(&self) -> DeleteCursor {
let last_block = self.inner let last_block = self
.inner
.read() .read()
.expect("Read lock poisoned when opening delete queue cursor") .expect("Read lock poisoned when opening delete queue cursor")
.last_block .last_block
@@ -92,7 +93,8 @@ impl DeleteQueue {
// be some unflushed operations. // be some unflushed operations.
// //
fn flush(&self) -> Option<Arc<Block>> { fn flush(&self) -> Option<Arc<Block>> {
let mut self_wlock = self.inner let mut self_wlock = self
.inner
.write() .write()
.expect("Failed to acquire write lock on delete queue writer"); .expect("Failed to acquire write lock on delete queue writer");
@@ -132,7 +134,8 @@ impl From<DeleteQueue> for NextBlock {
impl NextBlock { impl NextBlock {
fn next_block(&self) -> Option<Arc<Block>> { fn next_block(&self) -> Option<Arc<Block>> {
{ {
let next_read_lock = self.0 let next_read_lock = self
.0
.read() .read()
.expect("Failed to acquire write lock in delete queue"); .expect("Failed to acquire write lock in delete queue");
if let InnerNextBlock::Closed(ref block) = *next_read_lock { if let InnerNextBlock::Closed(ref block) = *next_read_lock {
@@ -141,7 +144,8 @@ impl NextBlock {
} }
let next_block; let next_block;
{ {
let mut next_write_lock = self.0 let mut next_write_lock = self
.0
.write() .write()
.expect("Failed to acquire write lock in delete queue"); .expect("Failed to acquire write lock in delete queue");
match *next_write_lock { match *next_write_lock {
@@ -182,19 +186,21 @@ impl DeleteCursor {
    /// `opstamp >= target_opstamp`.
    pub fn skip_to(&mut self, target_opstamp: u64) {
        // TODO Can be optimize as we work with block.
-       #[cfg_attr(feature = "cargo-clippy", allow(while_let_loop))]
-       loop {
-           if let Some(operation) = self.get() {
-               if operation.opstamp >= target_opstamp {
-                   break;
-               }
-           } else {
-               break;
-           }
+       while self.is_behind_opstamp(target_opstamp) {
            self.advance();
        }
    }
+   #[cfg_attr(
+       feature = "cargo-clippy",
+       allow(clippy::wrong_self_convention)
+   )]
+   fn is_behind_opstamp(&mut self, target_opstamp: u64) -> bool {
+       self.get()
+           .map(|operation| operation.opstamp < target_opstamp)
+           .unwrap_or(false)
+   }
    /// If the current block has been entirely
    /// consumed, try to load the next one.
    ///


@@ -1,26 +1,130 @@
-use core::LOCKFILE_FILEPATH;
use directory::error::OpenWriteError;
+use std::io::Write;
+use std::path::{Path, PathBuf};
+use std::thread;
+use std::time::Duration;
use Directory;
+use TantivyError;

-/// The directory lock is a mechanism used to
-/// prevent the creation of two [`IndexWriter`](struct.IndexWriter.html)
-///
-/// Only one lock can exist at a time for a given directory.
-/// The lock is release automatically on `Drop`.
-pub struct DirectoryLock {
-    directory: Box<Directory>,
+#[derive(Debug, Clone, Copy)]
+pub enum LockType {
+    /// Only one process should be able to write tantivy's index at a time.
+    /// This lock file, when present, is in charge of preventing other processes from opening an IndexWriter.
+    ///
+    /// If the process is killed and this file remains, it is safe to remove it manually.
+    ///
+    /// Failing to acquire this lock usually means a misuse of tantivy's API
+    /// (creating more than one instance of the `IndexWriter`), or a spurious
+    /// lock file remaining after a crash. In the latter case, removing the file after
+    /// checking that no process running tantivy is still alive is safe.
+    IndexWriterLock,
+    /// The meta lock file is here to protect the segment files being opened by
+    /// `.load_searchers()` from being garbage collected.
+    /// It makes it possible for another process to safely consume
+    /// our index in-writing. Ideally, we may have preferred `RWLock` semantics
+    /// here, but it is difficult to achieve on Windows.
+    ///
+    /// Opening segment readers is a very fast process.
+    /// Right now, if the lock cannot be acquired on the first attempt, the logic
+    /// is very simplistic. We retry after `100ms` until we effectively
+    /// acquire the lock.
+    /// This lock should not have much contention in normal usage.
+    MetaLock,
}

-impl DirectoryLock {
-    pub fn lock(mut directory: Box<Directory>) -> Result<DirectoryLock, OpenWriteError> {
-        directory.open_write(&*LOCKFILE_FILEPATH)?;
-        Ok(DirectoryLock { directory })
+/// The retry logic for acquiring locks is pretty simple.
+/// We just retry `n` times after a given `duration`, both
+/// depending on the type of lock.
+struct RetryPolicy {
+    num_retries: usize,
+    wait_in_ms: u64,
+}
+
+impl RetryPolicy {
+    fn no_retry() -> RetryPolicy {
+        RetryPolicy {
+            num_retries: 0,
+            wait_in_ms: 0,
+        }
    }
+
+    fn wait_and_retry(&mut self) -> bool {
+        if self.num_retries == 0 {
+            false
+        } else {
+            self.num_retries -= 1;
+            let wait_duration = Duration::from_millis(self.wait_in_ms);
+            thread::sleep(wait_duration);
+            true
+        }
+    }
+}
+
+impl LockType {
+    fn retry_policy(self) -> RetryPolicy {
+        match self {
+            LockType::IndexWriterLock => RetryPolicy::no_retry(),
+            LockType::MetaLock => RetryPolicy {
+                num_retries: 100,
+                wait_in_ms: 100,
+            },
+        }
+    }
+
+    fn try_acquire_lock(self, directory: &mut Directory) -> Result<DirectoryLock, TantivyError> {
+        let path = self.filename();
+        let mut write = directory.open_write(path).map_err(|e| match e {
+            OpenWriteError::FileAlreadyExists(_) => TantivyError::LockFailure(self),
+            OpenWriteError::IOError(io_error) => TantivyError::IOError(io_error),
+        })?;
+        write.flush()?;
+        Ok(DirectoryLock {
+            directory: directory.box_clone(),
+            path: path.to_owned(),
+        })
+    }
+
+    /// Acquire a lock in the given directory.
+    pub fn acquire_lock(self, directory: &Directory) -> Result<DirectoryLock, TantivyError> {
+        let mut box_directory = directory.box_clone();
+        let mut retry_policy = self.retry_policy();
+        loop {
+            let lock_result = self.try_acquire_lock(&mut *box_directory);
+            match lock_result {
+                Ok(result) => {
+                    return Ok(result);
+                }
+                Err(TantivyError::LockFailure(ref filepath)) => {
+                    if !retry_policy.wait_and_retry() {
+                        return Err(TantivyError::LockFailure(filepath.to_owned()));
+                    }
+                }
+                Err(_) => {}
+            }
+        }
+    }
+
+    fn filename(&self) -> &Path {
+        match *self {
+            LockType::MetaLock => Path::new(".tantivy-meta.lock"),
+            LockType::IndexWriterLock => Path::new(".tantivy-indexer.lock"),
+        }
+    }
+}
+
+/// The `DirectoryLock` is an object that represents a file lock.
+/// See [`LockType`](struct.LockType.html)
+///
+/// It is transparently associated with a lock file that gets deleted
+/// on `Drop`. The lock is released automatically on `Drop`.
+pub struct DirectoryLock {
+    directory: Box<Directory>,
+    path: PathBuf,
}

impl Drop for DirectoryLock {
    fn drop(&mut self) {
-       if let Err(e) = self.directory.delete(&*LOCKFILE_FILEPATH) {
+       if let Err(e) = self.directory.delete(&*self.path) {
            error!("Failed to remove the lock file. {:?}", e);
        }
    }
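A hedged sketch of how these locks are meant to be used, based only on the `acquire_lock` signature above; the `RAMDirectory::create()` constructor is assumed from the rest of the crate, and error handling is elided:

    // Hold the meta lock while reading the index metadata / opening segment readers.
    let directory = RAMDirectory::create();
    let _meta_guard = LockType::MetaLock.acquire_lock(&directory)?; // retries every 100ms while contended
    // ... read meta.json and open segment readers here ...
    // dropping `_meta_guard` deletes `.tantivy-meta.lock`

The indexer lock behaves the same way but never retries, which is what surfaces as `TantivyError::LockFailure` when a second `IndexWriter` is created.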


@@ -54,14 +54,14 @@ type DocumentReceiver = channel::Receiver<AddOperation>;
fn initial_table_size(per_thread_memory_budget: usize) -> usize { fn initial_table_size(per_thread_memory_budget: usize) -> usize {
let table_size_limit: usize = per_thread_memory_budget / 3; let table_size_limit: usize = per_thread_memory_budget / 3;
(1..) (1..)
.into_iter()
.take_while(|num_bits: &usize| compute_table_size(*num_bits) < table_size_limit) .take_while(|num_bits: &usize| compute_table_size(*num_bits) < table_size_limit)
.last() .last()
.expect(&format!( .unwrap_or_else(|| {
"Per thread memory is too small: {}", panic!(
per_thread_memory_budget "Per thread memory is too small: {}",
)) per_thread_memory_budget
.min(19) // we cap it at 512K )
}).min(19) // we cap it at 512K
} }
/// `IndexWriter` is the user entry-point to add document to an index. /// `IndexWriter` is the user entry-point to add document to an index.
@@ -177,7 +177,7 @@ pub fn compute_deleted_bitset(
) -> Result<bool> { ) -> Result<bool> {
let mut might_have_changed = false; let mut might_have_changed = false;
#[cfg_attr(feature = "cargo-clippy", allow(while_let_loop))] #[cfg_attr(feature = "cargo-clippy", allow(clippy::while_let_loop))]
loop { loop {
if let Some(delete_op) = delete_cursor.get() { if let Some(delete_op) = delete_cursor.get() {
if delete_op.opstamp > target_opstamp { if delete_op.opstamp > target_opstamp {
@@ -301,25 +301,29 @@ fn index_documents(
    let last_docstamp: u64 = *(doc_opstamps.last().unwrap());
-   let doc_to_opstamps = DocToOpstampMapping::from(doc_opstamps);
-   let segment_reader = SegmentReader::open(segment)?;
-   let mut deleted_bitset = BitSet::with_capacity(num_docs as usize);
-   let may_have_deletes = compute_deleted_bitset(
-       &mut deleted_bitset,
-       &segment_reader,
-       &mut delete_cursor,
-       &doc_to_opstamps,
-       last_docstamp,
-   )?;
-   let segment_entry = SegmentEntry::new(segment_meta, delete_cursor, {
+   let segment_entry: SegmentEntry = if delete_cursor.get().is_some() {
+       let doc_to_opstamps = DocToOpstampMapping::from(doc_opstamps);
+       let segment_reader = SegmentReader::open(segment)?;
+       let mut deleted_bitset = BitSet::with_capacity(num_docs as usize);
+       let may_have_deletes = compute_deleted_bitset(
+           &mut deleted_bitset,
+           &segment_reader,
+           &mut delete_cursor,
+           &doc_to_opstamps,
+           last_docstamp,
+       )?;
+       SegmentEntry::new(segment_meta, delete_cursor, {
            if may_have_deletes {
                Some(deleted_bitset)
            } else {
                None
            }
-   });
+       })
+   } else {
+       // if there are no delete operation in the queue, no need
+       // to even open the segment.
+       SegmentEntry::new(segment_meta, delete_cursor, None)
+   };
    Ok(segment_updater.add_segment(generation, segment_entry))
}
@@ -341,7 +345,8 @@ impl IndexWriter {
} }
drop(self.workers_join_handle); drop(self.workers_join_handle);
let result = self.segment_updater let result = self
.segment_updater
.wait_merging_thread() .wait_merging_thread()
.map_err(|_| TantivyError::ErrorInThread("Failed to join merging thread.".into())); .map_err(|_| TantivyError::ErrorInThread("Failed to join merging thread.".into()));
@@ -385,11 +390,9 @@ impl IndexWriter {
.name(format!( .name(format!(
"indexing thread {} for gen {}", "indexing thread {} for gen {}",
self.worker_id, generation self.worker_id, generation
)) )).spawn(move || {
.spawn(move || {
loop { loop {
let mut document_iterator = let mut document_iterator = document_receiver_clone.clone().peekable();
document_receiver_clone.clone().into_iter().peekable();
// the peeking here is to avoid // the peeking here is to avoid
// creating a new segment's files // creating a new segment's files
@@ -488,7 +491,8 @@ impl IndexWriter {
let document_receiver = self.document_receiver.clone(); let document_receiver = self.document_receiver.clone();
// take the directory lock to create a new index_writer. // take the directory lock to create a new index_writer.
let directory_lock = self._directory_lock let directory_lock = self
._directory_lock
.take() .take()
.expect("The IndexWriter does not have any lock. This is a bug, please report."); .expect("The IndexWriter does not have any lock. This is a bug, please report.");
@@ -657,11 +661,26 @@ mod tests {
        let index = Index::create_in_ram(schema_builder.build());
        let _index_writer = index.writer(40_000_000).unwrap();
        match index.writer(40_000_000) {
-           Err(TantivyError::FileAlreadyExists(_)) => {}
+           Err(TantivyError::LockFailure(_)) => {}
            _ => panic!("Expected FileAlreadyExists error"),
        }
    }
#[test]
fn test_lockfile_already_exists_error_msg() {
let schema_builder = schema::SchemaBuilder::default();
let index = Index::create_in_ram(schema_builder.build());
let _index_writer = index.writer_with_num_threads(1, 3_000_000).unwrap();
match index.writer_with_num_threads(1, 3_000_000) {
Err(err) => {
let err_msg = err.to_string();
assert!(err_msg.contains("Lockfile"));
assert!(err_msg.contains("Possible causes:"))
}
_ => panic!("Expected LockfileAlreadyExists error"),
}
}
#[test] #[test]
fn test_set_merge_policy() { fn test_set_merge_policy() {
let schema_builder = schema::SchemaBuilder::default(); let schema_builder = schema::SchemaBuilder::default();
@@ -843,4 +862,32 @@ mod tests {
assert_eq!(initial_table_size(1_000_000_000), 19); assert_eq!(initial_table_size(1_000_000_000), 19);
} }
#[cfg(not(feature = "no_fail"))]
#[test]
fn test_write_commit_fails() {
use fail;
let mut schema_builder = schema::SchemaBuilder::default();
let text_field = schema_builder.add_text_field("text", schema::TEXT);
let index = Index::create_in_ram(schema_builder.build());
let mut index_writer = index.writer_with_num_threads(1, 3_000_000).unwrap();
for _ in 0..100 {
index_writer.add_document(doc!(text_field => "a"));
}
index_writer.commit().unwrap();
fail::cfg("RAMDirectory::atomic_write", "return(error_write_failed)").unwrap();
for _ in 0..100 {
index_writer.add_document(doc!(text_field => "b"));
}
assert!(index_writer.commit().is_err());
index.load_searchers().unwrap();
let num_docs_containing = |s: &str| {
let searcher = index.searcher();
let term_a = Term::from_field_text(text_field, s);
searcher.doc_freq(&term_a)
};
assert_eq!(num_docs_containing("a"), 100);
assert_eq!(num_docs_containing("b"), 0);
fail::cfg("RAMDirectory::atomic_write", "off").unwrap();
}
} }


@@ -21,17 +21,17 @@ pub trait MergePolicy: MergePolicyClone + marker::Send + marker::Sync + Debug {
/// MergePolicyClone /// MergePolicyClone
pub trait MergePolicyClone { pub trait MergePolicyClone {
/// Returns a boxed clone of the MergePolicy. /// Returns a boxed clone of the MergePolicy.
fn box_clone(&self) -> Box<MergePolicy>; fn box_clone(&self) -> Box<MergePolicy>;
} }
impl<T> MergePolicyClone for T impl<T> MergePolicyClone for T
where where
T: 'static + MergePolicy + Clone, T: 'static + MergePolicy + Clone,
{ {
fn box_clone(&self) -> Box<MergePolicy> { fn box_clone(&self) -> Box<MergePolicy> {
Box::new(self.clone()) Box::new(self.clone())
} }
} }
/// Never merge segments. /// Never merge segments.


@@ -40,15 +40,13 @@ fn compute_total_num_tokens(readers: &[SegmentReader], field: Field) -> u64 {
total_tokens += reader.inverted_index(field).total_num_tokens(); total_tokens += reader.inverted_index(field).total_num_tokens();
} }
} }
total_tokens total_tokens + count
+ count .iter()
.iter() .cloned()
.cloned() .enumerate()
.enumerate() .map(|(fieldnorm_ord, count)| {
.map(|(fieldnorm_ord, count)| { count as u64 * u64::from(FieldNormReader::id_to_fieldnorm(fieldnorm_ord as u8))
count as u64 * FieldNormReader::id_to_fieldnorm(fieldnorm_ord as u8) as u64 }).sum::<u64>()
})
.sum::<u64>()
} }
pub struct IndexMerger { pub struct IndexMerger {
@@ -111,7 +109,7 @@ impl TermOrdinalMapping {
.iter() .iter()
.flat_map(|term_ordinals| term_ordinals.iter().cloned().max()) .flat_map(|term_ordinals| term_ordinals.iter().cloned().max())
.max() .max()
.unwrap_or(TermOrdinal::default()) .unwrap_or_else(TermOrdinal::default)
} }
} }
@@ -190,7 +188,7 @@ impl IndexMerger {
`term_ordinal_mapping`."); `term_ordinal_mapping`.");
self.write_hierarchical_facet_field( self.write_hierarchical_facet_field(
field, field,
term_ordinal_mapping, &term_ordinal_mapping,
fast_field_serializer, fast_field_serializer,
)?; )?;
} }
@@ -314,7 +312,7 @@ impl IndexMerger {
fn write_hierarchical_facet_field( fn write_hierarchical_facet_field(
&self, &self,
field: Field, field: Field,
term_ordinal_mappings: TermOrdinalMapping, term_ordinal_mappings: &TermOrdinalMapping,
fast_field_serializer: &mut FastFieldSerializer, fast_field_serializer: &mut FastFieldSerializer,
) -> Result<()> { ) -> Result<()> {
// Multifastfield consists in 2 fastfields. // Multifastfield consists in 2 fastfields.
@@ -393,8 +391,8 @@ impl IndexMerger {
// We can now initialize our serializer, and push it the different values // We can now initialize our serializer, and push it the different values
{ {
let mut serialize_vals = let mut serialize_vals = fast_field_serializer
fast_field_serializer.new_u64_fast_field_with_idx(field, min_value, max_value, 1)?; .new_u64_fast_field_with_idx(field, min_value, max_value, 1)?;
for reader in &self.readers { for reader in &self.readers {
let ff_reader: MultiValueIntFastFieldReader<u64> = let ff_reader: MultiValueIntFastFieldReader<u64> =
reader.multi_fast_field_reader(field)?; reader.multi_fast_field_reader(field)?;
@@ -440,7 +438,8 @@ impl IndexMerger {
) -> Result<Option<TermOrdinalMapping>> { ) -> Result<Option<TermOrdinalMapping>> {
let mut positions_buffer: Vec<u32> = Vec::with_capacity(1_000); let mut positions_buffer: Vec<u32> = Vec::with_capacity(1_000);
let mut delta_computer = DeltaComputer::new(); let mut delta_computer = DeltaComputer::new();
let field_readers = self.readers let field_readers = self
.readers
.iter() .iter()
.map(|reader| reader.inverted_index(indexed_field)) .map(|reader| reader.inverted_index(indexed_field))
.collect::<Vec<_>>(); .collect::<Vec<_>>();
@@ -524,8 +523,7 @@ impl IndexMerger {
} }
} }
None None
}) }).collect();
.collect();
// At this point, `segment_postings` contains the posting list // At this point, `segment_postings` contains the posting list
// of all of the segments containing the given term. // of all of the segments containing the given term.
@@ -666,8 +664,7 @@ mod tests {
TextFieldIndexing::default() TextFieldIndexing::default()
.set_tokenizer("default") .set_tokenizer("default")
.set_index_option(IndexRecordOption::WithFreqs), .set_index_option(IndexRecordOption::WithFreqs),
) ).set_stored();
.set_stored();
let text_field = schema_builder.add_text_field("text", text_fieldtype); let text_field = schema_builder.add_text_field("text", text_fieldtype);
let score_fieldtype = schema::IntOptions::default().set_fast(Cardinality::SingleValue); let score_fieldtype = schema::IntOptions::default().set_fast(Cardinality::SingleValue);
let score_field = schema_builder.add_u64_field("score", score_fieldtype); let score_field = schema_builder.add_u64_field("score", score_fieldtype);
@@ -769,24 +766,24 @@ mod tests {
); );
} }
{ {
let doc = searcher.doc(&DocAddress(0, 0)).unwrap(); let doc = searcher.doc(DocAddress(0, 0)).unwrap();
assert_eq!(doc.get_first(text_field).unwrap().text(), "af b"); assert_eq!(doc.get_first(text_field).unwrap().text(), Some("af b"));
} }
{ {
let doc = searcher.doc(&DocAddress(0, 1)).unwrap(); let doc = searcher.doc(DocAddress(0, 1)).unwrap();
assert_eq!(doc.get_first(text_field).unwrap().text(), "a b c"); assert_eq!(doc.get_first(text_field).unwrap().text(), Some("a b c"));
} }
{ {
let doc = searcher.doc(&DocAddress(0, 2)).unwrap(); let doc = searcher.doc(DocAddress(0, 2)).unwrap();
assert_eq!(doc.get_first(text_field).unwrap().text(), "a b c d"); assert_eq!(doc.get_first(text_field).unwrap().text(), Some("a b c d"));
} }
{ {
let doc = searcher.doc(&DocAddress(0, 3)).unwrap(); let doc = searcher.doc(DocAddress(0, 3)).unwrap();
assert_eq!(doc.get_first(text_field).unwrap().text(), "af b"); assert_eq!(doc.get_first(text_field).unwrap().text(), Some("af b"));
} }
{ {
let doc = searcher.doc(&DocAddress(0, 4)).unwrap(); let doc = searcher.doc(DocAddress(0, 4)).unwrap();
assert_eq!(doc.get_first(text_field).unwrap().text(), "a b c g"); assert_eq!(doc.get_first(text_field).unwrap().text(), Some("a b c g"));
} }
{ {
let get_fast_vals = |terms: Vec<Term>| { let get_fast_vals = |terms: Vec<Term>| {
@@ -821,8 +818,7 @@ mod tests {
let text_fieldtype = schema::TextOptions::default() let text_fieldtype = schema::TextOptions::default()
.set_indexing_options( .set_indexing_options(
TextFieldIndexing::default().set_index_option(IndexRecordOption::WithFreqs), TextFieldIndexing::default().set_index_option(IndexRecordOption::WithFreqs),
) ).set_stored();
.set_stored();
let text_field = schema_builder.add_text_field("text", text_fieldtype); let text_field = schema_builder.add_text_field("text", text_fieldtype);
let score_fieldtype = schema::IntOptions::default().set_fast(Cardinality::SingleValue); let score_fieldtype = schema::IntOptions::default().set_fast(Cardinality::SingleValue);
let score_field = schema_builder.add_u64_field("score", score_fieldtype); let score_field = schema_builder.add_u64_field("score", score_fieldtype);


@@ -16,6 +16,8 @@ mod segment_writer;
mod stamper; mod stamper;
pub(crate) use self::directory_lock::DirectoryLock; pub(crate) use self::directory_lock::DirectoryLock;
pub use self::directory_lock::LockType;
pub use self::index_writer::IndexWriter; pub use self::index_writer::IndexWriter;
pub use self::log_merge_policy::LogMergePolicy; pub use self::log_merge_policy::LogMergePolicy;
pub use self::merge_policy::{MergeCandidate, MergePolicy, NoMergePolicy}; pub use self::merge_policy::{MergeCandidate, MergePolicy, NoMergePolicy};


@@ -11,8 +11,8 @@ pub enum SegmentState {
} }
impl SegmentState { impl SegmentState {
pub fn letter_code(&self) -> char { pub fn letter_code(self) -> char {
match *self { match self {
SegmentState::InMerge => 'M', SegmentState::InMerge => 'M',
SegmentState::Ready => 'R', SegmentState::Ready => 'R',
} }


@@ -1,7 +1,7 @@
use super::segment_register::SegmentRegister;
use core::SegmentId;
use core::SegmentMeta;
-use core::{LOCKFILE_FILEPATH, META_FILEPATH};
+use core::META_FILEPATH;
use error::TantivyError;
use indexer::delete_queue::DeleteCursor;
use indexer::SegmentEntry;
@@ -78,10 +78,13 @@ impl SegmentManager {
        registers_lock.committed.len() + registers_lock.uncommitted.len()
    }
+   /// List the files that are useful to the index.
+   ///
+   /// This does not include lock files, or files that are obsolete
+   /// but have not yet been deleted by the garbage collector.
    pub fn list_files(&self) -> HashSet<PathBuf> {
        let mut files = HashSet::new();
        files.insert(META_FILEPATH.clone());
-       files.insert(LOCKFILE_FILEPATH.clone());
        for segment_meta in SegmentMeta::all() {
            files.extend(segment_meta.list_files());
        }


@@ -51,7 +51,8 @@ impl SegmentRegister {
} }
pub fn segment_metas(&self) -> Vec<SegmentMeta> { pub fn segment_metas(&self) -> Vec<SegmentMeta> {
let mut segment_ids: Vec<SegmentMeta> = self.segment_states let mut segment_ids: Vec<SegmentMeta> = self
.segment_states
.values() .values()
.map(|segment_entry| segment_entry.meta().clone()) .map(|segment_entry| segment_entry.meta().clone())
.collect(); .collect();


@@ -72,7 +72,7 @@ pub fn save_metas(
payload, payload,
}; };
let mut buffer = serde_json::to_vec_pretty(&metas)?; let mut buffer = serde_json::to_vec_pretty(&metas)?;
write!(&mut buffer, "\n")?; writeln!(&mut buffer)?;
directory.atomic_write(&META_FILEPATH, &buffer[..])?; directory.atomic_write(&META_FILEPATH, &buffer[..])?;
debug!("Saved metas {:?}", serde_json::to_string_pretty(&metas)); debug!("Saved metas {:?}", serde_json::to_string_pretty(&metas));
Ok(()) Ok(())
@@ -336,8 +336,7 @@ impl SegmentUpdater {
.unwrap() .unwrap()
.remove(&merging_thread_id); .remove(&merging_thread_id);
Ok(()) Ok(())
}) }).expect("Failed to spawn a thread.");
.expect("Failed to spawn a thread.");
self.0 self.0
.merging_threads .merging_threads
.write() .write()


@@ -49,20 +49,20 @@ impl SegmentWriter {
) -> Result<SegmentWriter> { ) -> Result<SegmentWriter> {
let segment_serializer = SegmentSerializer::for_segment(&mut segment)?; let segment_serializer = SegmentSerializer::for_segment(&mut segment)?;
let multifield_postings = MultiFieldPostingsWriter::new(schema, table_bits); let multifield_postings = MultiFieldPostingsWriter::new(schema, table_bits);
let tokenizers = schema let tokenizers =
.fields() schema
.iter() .fields()
.map(|field_entry| field_entry.field_type()) .iter()
.map(|field_type| match *field_type { .map(|field_entry| field_entry.field_type())
FieldType::Str(ref text_options) => text_options.get_indexing_options().and_then( .map(|field_type| match *field_type {
|text_index_option| { FieldType::Str(ref text_options) => text_options
let tokenizer_name = &text_index_option.tokenizer(); .get_indexing_options()
segment.index().tokenizers().get(tokenizer_name) .and_then(|text_index_option| {
}, let tokenizer_name = &text_index_option.tokenizer();
), segment.index().tokenizers().get(tokenizer_name)
_ => None, }),
}) _ => None,
.collect(); }).collect();
Ok(SegmentWriter { Ok(SegmentWriter {
max_doc: 0, max_doc: 0,
multifield_postings, multifield_postings,
@@ -117,8 +117,7 @@ impl SegmentWriter {
_ => { _ => {
panic!("Expected hierarchical facet"); panic!("Expected hierarchical facet");
} }
}) }).collect();
.collect();
let mut term = Term::for_field(field); // we set the Term let mut term = Term::for_field(field); // we set the Term
for facet_bytes in facets { for facet_bytes in facets {
let mut unordered_term_id_opt = None; let mut unordered_term_id_opt = None;
@@ -146,8 +145,7 @@ impl SegmentWriter {
.flat_map(|field_value| match *field_value.value() { .flat_map(|field_value| match *field_value.value() {
Value::Str(ref text) => Some(text.as_str()), Value::Str(ref text) => Some(text.as_str()),
_ => None, _ => None,
}) }).collect();
.collect();
if texts.is_empty() { if texts.is_empty() {
0 0
} else { } else {

src/lib.rs Normal file → Executable file

@@ -1,11 +1,8 @@
#![doc(html_logo_url = "http://fulmicoton.com/tantivy-logo/tantivy-logo.png")]
-#![cfg_attr(feature = "cargo-clippy", allow(module_inception))]
-#![cfg_attr(feature = "cargo-clippy", allow(inline_always))]
#![cfg_attr(all(feature = "unstable", test), feature(test))]
+#![cfg_attr(feature = "cargo-clippy", feature(tool_lints))]
+#![cfg_attr(feature = "cargo-clippy", allow(clippy::module_inception))]
#![doc(test(attr(allow(unused_variables), deny(warnings))))]
-#![allow(unknown_lints)]
-#![allow(new_without_default)]
-#![allow(decimal_literal_representation)]
#![warn(missing_docs)]
#![recursion_limit = "80"]
@@ -96,7 +93,7 @@
//! // most relevant doc ids... //! // most relevant doc ids...
//! let doc_addresses = top_collector.docs(); //! let doc_addresses = top_collector.docs();
//! for doc_address in doc_addresses { //! for doc_address in doc_addresses {
//! let retrieved_doc = searcher.doc(&doc_address)?; //! let retrieved_doc = searcher.doc(doc_address)?;
//! println!("{}", schema.to_json(&retrieved_doc)); //! println!("{}", schema.to_json(&retrieved_doc));
//! } //! }
//! //!
@@ -133,16 +130,16 @@ extern crate bit_set;
extern crate bitpacking; extern crate bitpacking;
extern crate byteorder; extern crate byteorder;
#[macro_use]
extern crate combine; extern crate combine;
extern crate crossbeam; extern crate crossbeam;
extern crate crossbeam_channel; extern crate crossbeam_channel;
extern crate fnv; extern crate fnv;
extern crate fst; extern crate fst;
extern crate fst_regex;
extern crate futures; extern crate futures;
extern crate futures_cpupool; extern crate futures_cpupool;
extern crate htmlescape;
extern crate itertools; extern crate itertools;
extern crate levenshtein_automata; extern crate levenshtein_automata;
extern crate num_cpus; extern crate num_cpus;
@@ -155,6 +152,8 @@ extern crate tempdir;
extern crate tempfile; extern crate tempfile;
extern crate uuid; extern crate uuid;
#[cfg(test)] #[cfg(test)]
#[macro_use] #[macro_use]
extern crate matches; extern crate matches;
@@ -165,14 +164,19 @@ extern crate winapi;
#[cfg(test)] #[cfg(test)]
extern crate rand; extern crate rand;
#[cfg(test)]
#[macro_use]
extern crate maplit;
#[cfg(all(test, feature = "unstable"))] #[cfg(all(test, feature = "unstable"))]
extern crate test; extern crate test;
extern crate tinysegmenter;
#[macro_use] #[macro_use]
extern crate downcast; extern crate downcast;
#[macro_use]
extern crate fail;
#[cfg(test)] #[cfg(test)]
mod functional_test; mod functional_test;
@@ -181,7 +185,10 @@ mod macros;
pub use error::TantivyError; pub use error::TantivyError;
#[deprecated(since="0.7.0", note="please use `tantivy::TantivyError` instead")] #[deprecated(
since = "0.7.0",
note = "please use `tantivy::TantivyError` instead"
)]
pub use error::TantivyError as Error; pub use error::TantivyError as Error;
extern crate census; extern crate census;
@@ -209,6 +216,9 @@ pub mod schema;
pub mod store; pub mod store;
pub mod termdict; pub mod termdict;
mod snippet;
pub use self::snippet::SnippetGenerator;
mod docset; mod docset;
pub use self::docset::{DocSet, SkipResult}; pub use self::docset::{DocSet, SkipResult};
@@ -261,12 +271,12 @@ impl DocAddress {
    /// The segment ordinal is an id identifying the segment
    /// hosting the document. It is only meaningful, in the context
    /// of a searcher.
-   pub fn segment_ord(&self) -> SegmentLocalId {
+   pub fn segment_ord(self) -> SegmentLocalId {
        self.0
    }
    /// Return the segment local `DocId`
-   pub fn doc(&self) -> DocId {
+   pub fn doc(self) -> DocId {
        self.1
    }
}
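A tiny hedged example of the by-value accessors, consistent with the `DocAddress(segment_ord, doc_id)` tuple usage in the tests elsewhere in this compare (the `searcher` is assumed to be already set up):

    let addr = DocAddress(0, 42);        // segment 0, local doc id 42
    assert_eq!(addr.segment_ord(), 0);
    assert_eq!(addr.doc(), 42);
    let retrieved_doc = searcher.doc(addr)?; // `Searcher::doc` now takes the address by value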
@@ -892,11 +902,11 @@ mod tests {
assert_eq!(document.len(), 3); assert_eq!(document.len(), 3);
let values = document.get_all(text_field); let values = document.get_all(text_field);
assert_eq!(values.len(), 2); assert_eq!(values.len(), 2);
assert_eq!(values[0].text(), "tantivy"); assert_eq!(values[0].text(), Some("tantivy"));
assert_eq!(values[1].text(), "some other value"); assert_eq!(values[1].text(), Some("some other value"));
let values = document.get_all(other_text_field); let values = document.get_all(other_text_field);
assert_eq!(values.len(), 1); assert_eq!(values.len(), 1);
assert_eq!(values[0].text(), "short"); assert_eq!(values[0].text(), Some("short"));
} }
#[test] #[test]


@@ -1,4 +1,3 @@
/// Positions are stored in three parts and over two files. /// Positions are stored in three parts and over two files.
// //
/// The `SegmentComponent::POSITIONS` file contains all of the bitpacked positions delta, /// The `SegmentComponent::POSITIONS` file contains all of the bitpacked positions delta,
@@ -24,13 +23,12 @@
/// The long skip structure makes it possible to skip rapidly to the a checkpoint close to this /// The long skip structure makes it possible to skip rapidly to the a checkpoint close to this
/// value, and then skip normally. /// value, and then skip normally.
/// ///
mod reader; mod reader;
mod serializer; mod serializer;
pub use self::reader::PositionReader; pub use self::reader::PositionReader;
pub use self::serializer::PositionSerializer; pub use self::serializer::PositionSerializer;
use bitpacking::{BitPacker4x, BitPacker}; use bitpacking::{BitPacker, BitPacker4x};
const COMPRESSION_BLOCK_SIZE: usize = BitPacker4x::BLOCK_LEN; const COMPRESSION_BLOCK_SIZE: usize = BitPacker4x::BLOCK_LEN;
const LONG_SKIP_IN_BLOCKS: usize = 1_024; const LONG_SKIP_IN_BLOCKS: usize = 1_024;
@@ -43,10 +41,10 @@ lazy_static! {
#[cfg(test)] #[cfg(test)]
pub mod tests { pub mod tests {
use std::iter; use super::{PositionReader, PositionSerializer};
use super::{PositionSerializer, PositionReader};
use directory::ReadOnlySource; use directory::ReadOnlySource;
use positions::COMPRESSION_BLOCK_SIZE; use positions::COMPRESSION_BLOCK_SIZE;
use std::iter;
fn create_stream_buffer(vals: &[u32]) -> (ReadOnlySource, ReadOnlySource) { fn create_stream_buffer(vals: &[u32]) -> (ReadOnlySource, ReadOnlySource) {
let mut skip_buffer = vec![]; let mut skip_buffer = vec![];
@@ -59,7 +57,10 @@ pub mod tests {
} }
serializer.close().unwrap(); serializer.close().unwrap();
} }
(ReadOnlySource::from(stream_buffer), ReadOnlySource::from(skip_buffer)) (
ReadOnlySource::from(stream_buffer),
ReadOnlySource::from(skip_buffer),
)
} }
#[test] #[test]
@@ -103,7 +104,7 @@ pub mod tests {
assert_eq!(skip.len(), 12); assert_eq!(skip.len(), 12);
assert_eq!(stream.len(), 1168); assert_eq!(stream.len(), 1168);
let mut position_reader = PositionReader::new(stream,skip, 0u64); let mut position_reader = PositionReader::new(stream, skip, 0u64);
let mut buf = [0u32; 7]; let mut buf = [0u32; 7];
let mut c = 0; let mut c = 0;
for _ in 0..100 { for _ in 0..100 {
@@ -125,7 +126,7 @@ pub mod tests {
let (stream, skip) = create_stream_buffer(&v[..]); let (stream, skip) = create_stream_buffer(&v[..]);
assert_eq!(skip.len(), 15_749); assert_eq!(skip.len(), 15_749);
assert_eq!(stream.len(), 1_000_000); assert_eq!(stream.len(), 1_000_000);
let mut position_reader = PositionReader::new(stream,skip, 128 * 1024); let mut position_reader = PositionReader::new(stream, skip, 128 * 1024);
let mut buf = [0u32; 1]; let mut buf = [0u32; 1];
position_reader.read(&mut buf); position_reader.read(&mut buf);
assert_eq!(buf[0], CONST_VAL); assert_eq!(buf[0], CONST_VAL);
@@ -137,12 +138,17 @@ pub mod tests {
let (stream, skip) = create_stream_buffer(&v[..]); let (stream, skip) = create_stream_buffer(&v[..]);
assert_eq!(skip.len(), 15_749); assert_eq!(skip.len(), 15_749);
assert_eq!(stream.len(), 4_987_872); assert_eq!(stream.len(), 4_987_872);
for &offset in &[10, 128 * 1024, 128 * 1024 - 1, 128 * 1024 + 7, 128 * 10 * 1024 + 10] { for &offset in &[
let mut position_reader = PositionReader::new(stream.clone(),skip.clone(), offset); 10,
128 * 1024,
128 * 1024 - 1,
128 * 1024 + 7,
128 * 10 * 1024 + 10,
] {
let mut position_reader = PositionReader::new(stream.clone(), skip.clone(), offset);
let mut buf = [0u32; 1]; let mut buf = [0u32; 1];
position_reader.read(&mut buf); position_reader.read(&mut buf);
assert_eq!(buf[0], offset as u32); assert_eq!(buf[0], offset as u32);
} }
} }
} }


@@ -1,12 +1,12 @@
use bitpacking::{BitPacker4x, BitPacker};
use owned_read::OwnedRead;
use common::{BinarySerializable, FixedSize};
use postings::compression::compressed_block_size;
use directory::ReadOnlySource;
use positions::COMPRESSION_BLOCK_SIZE;
use positions::LONG_SKIP_IN_BLOCKS;
use positions::LONG_SKIP_INTERVAL;
use super::BIT_PACKER; use super::BIT_PACKER;
use bitpacking::{BitPacker, BitPacker4x};
use common::{BinarySerializable, FixedSize};
use directory::ReadOnlySource;
use owned_read::OwnedRead;
use positions::COMPRESSION_BLOCK_SIZE;
use positions::LONG_SKIP_INTERVAL;
use positions::LONG_SKIP_IN_BLOCKS;
use postings::compression::compressed_block_size;
pub struct PositionReader { pub struct PositionReader {
skip_read: OwnedRead, skip_read: OwnedRead,
@@ -18,7 +18,6 @@ pub struct PositionReader {
// of the block of the next int to read. // of the block of the next int to read.
} }
// `ahead` represents the offset of the block currently loaded // `ahead` represents the offset of the block currently loaded
// compared to the cursor of the actual stream. // compared to the cursor of the actual stream.
// //
@@ -32,7 +31,8 @@ fn read_impl(
buffer: &mut [u32; 128], buffer: &mut [u32; 128],
mut inner_offset: usize, mut inner_offset: usize,
num_bits: &[u8], num_bits: &[u8],
output: &mut [u32]) -> usize { output: &mut [u32],
) -> usize {
let mut output_start = 0; let mut output_start = 0;
let mut output_len = output.len(); let mut output_len = output.len();
let mut ahead = 0; let mut ahead = 0;
@@ -47,8 +47,7 @@ fn read_impl(
output_start += available_len; output_start += available_len;
inner_offset = 0; inner_offset = 0;
let num_bits = num_bits[ahead]; let num_bits = num_bits[ahead];
BitPacker4x::new() BitPacker4x::new().decompress(position, &mut buffer[..], num_bits);
.decompress(position, &mut buffer[..], num_bits);
let block_len = compressed_block_size(num_bits); let block_len = compressed_block_size(num_bits);
position = &position[block_len..]; position = &position[block_len..];
ahead += 1; ahead += 1;
@@ -56,11 +55,12 @@ fn read_impl(
} }
} }
impl PositionReader { impl PositionReader {
pub fn new(position_source: ReadOnlySource, pub fn new(
skip_source: ReadOnlySource, position_source: ReadOnlySource,
offset: u64) -> PositionReader { skip_source: ReadOnlySource,
offset: u64,
) -> PositionReader {
let skip_len = skip_source.len(); let skip_len = skip_source.len();
let (body, footer) = skip_source.split(skip_len - u32::SIZE_IN_BYTES); let (body, footer) = skip_source.split(skip_len - u32::SIZE_IN_BYTES);
let num_long_skips = u32::deserialize(&mut footer.as_slice()).expect("Index corrupted"); let num_long_skips = u32::deserialize(&mut footer.as_slice()).expect("Index corrupted");
@@ -70,7 +70,8 @@ impl PositionReader {
let small_skip = (offset - (long_skip_id as u64) * (LONG_SKIP_INTERVAL as u64)) as usize; let small_skip = (offset - (long_skip_id as u64) * (LONG_SKIP_INTERVAL as u64)) as usize;
let offset_num_bytes: u64 = { let offset_num_bytes: u64 = {
if long_skip_id > 0 { if long_skip_id > 0 {
let mut long_skip_blocks: &[u8] = &long_skips.as_slice()[(long_skip_id - 1) * 8..][..8]; let mut long_skip_blocks: &[u8] =
&long_skips.as_slice()[(long_skip_id - 1) * 8..][..8];
u64::deserialize(&mut long_skip_blocks).expect("Index corrupted") * 16 u64::deserialize(&mut long_skip_blocks).expect("Index corrupted") * 16
} else { } else {
0 0
@@ -79,13 +80,13 @@ impl PositionReader {
let mut position_read = OwnedRead::new(position_source); let mut position_read = OwnedRead::new(position_source);
position_read.advance(offset_num_bytes as usize); position_read.advance(offset_num_bytes as usize);
let mut skip_read = OwnedRead::new(skip_body); let mut skip_read = OwnedRead::new(skip_body);
skip_read.advance(long_skip_id * LONG_SKIP_IN_BLOCKS); skip_read.advance(long_skip_id * LONG_SKIP_IN_BLOCKS);
let mut position_reader = PositionReader { let mut position_reader = PositionReader {
skip_read, skip_read,
position_read, position_read,
inner_offset: 0, inner_offset: 0,
buffer: Box::new([0u32; 128]), buffer: Box::new([0u32; 128]),
ahead: None ahead: None,
}; };
position_reader.skip(small_skip); position_reader.skip(small_skip);
position_reader position_reader
@@ -108,7 +109,8 @@ impl PositionReader {
self.buffer.as_mut(), self.buffer.as_mut(),
self.inner_offset, self.inner_offset,
&skip_data[1..], &skip_data[1..],
output)); output,
));
} }
/// Skip the next `skip_len` integer. /// Skip the next `skip_len` integer.
@@ -118,27 +120,25 @@ impl PositionReader {
/// ///
/// May panic if the end of the stream is reached. /// May panic if the end of the stream is reached.
pub fn skip(&mut self, skip_len: usize) { pub fn skip(&mut self, skip_len: usize) {
let skip_len_plus_inner_offset = skip_len + self.inner_offset; let skip_len_plus_inner_offset = skip_len + self.inner_offset;
let num_blocks_to_advance = skip_len_plus_inner_offset / COMPRESSION_BLOCK_SIZE; let num_blocks_to_advance = skip_len_plus_inner_offset / COMPRESSION_BLOCK_SIZE;
self.inner_offset = skip_len_plus_inner_offset % COMPRESSION_BLOCK_SIZE; self.inner_offset = skip_len_plus_inner_offset % COMPRESSION_BLOCK_SIZE;
self.ahead = self.ahead self.ahead = self.ahead.and_then(|num_blocks| {
.and_then(|num_blocks| { if num_blocks >= num_blocks_to_advance {
if num_blocks >= num_blocks_to_advance { Some(num_blocks - num_blocks_to_advance)
Some(num_blocks_to_advance - num_blocks_to_advance) } else {
} else { None
None }
} });
});
let skip_len = self.skip_read let skip_len = self.skip_read.as_ref()[..num_blocks_to_advance]
.as_ref()[..num_blocks_to_advance]
.iter() .iter()
.cloned() .cloned()
.map(|num_bit| num_bit as usize) .map(|num_bit| num_bit as usize)
.sum::<usize>() * (COMPRESSION_BLOCK_SIZE / 8); .sum::<usize>()
* (COMPRESSION_BLOCK_SIZE / 8);
self.skip_read.advance(num_blocks_to_advance); self.skip_read.advance(num_blocks_to_advance);
self.position_read.advance(skip_len); self.position_read.advance(skip_len);


@@ -1,8 +1,8 @@
use std::io;
use bitpacking::BitPacker;
use positions::{COMPRESSION_BLOCK_SIZE, LONG_SKIP_INTERVAL};
use common::BinarySerializable;
use super::BIT_PACKER; use super::BIT_PACKER;
use bitpacking::BitPacker;
use common::BinarySerializable;
use positions::{COMPRESSION_BLOCK_SIZE, LONG_SKIP_INTERVAL};
use std::io;
pub struct PositionSerializer<W: io::Write> { pub struct PositionSerializer<W: io::Write> {
write_stream: W, write_stream: W,
@@ -23,7 +23,7 @@ impl<W: io::Write> PositionSerializer<W> {
buffer: vec![0u8; 128 * 4], buffer: vec![0u8; 128 * 4],
num_ints: 0u64, num_ints: 0u64,
long_skips: Vec::new(), long_skips: Vec::new(),
cumulated_num_bits: 0u64 cumulated_num_bits: 0u64,
} }
} }
@@ -31,7 +31,6 @@ impl<W: io::Write> PositionSerializer<W> {
self.num_ints self.num_ints
} }
fn remaining_block_len(&self) -> usize { fn remaining_block_len(&self) -> usize {
COMPRESSION_BLOCK_SIZE - self.block.len() COMPRESSION_BLOCK_SIZE - self.block.len()
} }
@@ -52,8 +51,8 @@ impl<W: io::Write> PositionSerializer<W> {
fn flush_block(&mut self) -> io::Result<()> { fn flush_block(&mut self) -> io::Result<()> {
let num_bits = BIT_PACKER.num_bits(&self.block[..]); let num_bits = BIT_PACKER.num_bits(&self.block[..]);
self.cumulated_num_bits += num_bits as u64; self.cumulated_num_bits += u64::from(num_bits);
self.write_skiplist.write(&[num_bits])?; self.write_skiplist.write_all(&[num_bits])?;
let written_len = BIT_PACKER.compress(&self.block[..], &mut self.buffer, num_bits); let written_len = BIT_PACKER.compress(&self.block[..], &mut self.buffer, num_bits);
self.write_stream.write_all(&self.buffer[..written_len])?; self.write_stream.write_all(&self.buffer[..written_len])?;
self.block.clear(); self.block.clear();


@@ -28,14 +28,16 @@ impl BlockEncoder {
pub fn compress_block_sorted(&mut self, block: &[u32], offset: u32) -> (u8, &[u8]) { pub fn compress_block_sorted(&mut self, block: &[u32], offset: u32) -> (u8, &[u8]) {
let num_bits = self.bitpacker.num_bits_sorted(offset, block); let num_bits = self.bitpacker.num_bits_sorted(offset, block);
let written_size =
    self.bitpacker
        .compress_sorted(offset, block, &mut self.output[..], num_bits);
(num_bits, &self.output[..written_size]) (num_bits, &self.output[..written_size])
} }
pub fn compress_block_unsorted(&mut self, block: &[u32]) -> (u8, &[u8]) { pub fn compress_block_unsorted(&mut self, block: &[u32]) -> (u8, &[u8]) {
let num_bits = self.bitpacker.num_bits(block); let num_bits = self.bitpacker.num_bits(block);
let written_size = self
    .bitpacker
    .compress(block, &mut self.output[..], num_bits);
(num_bits, &self.output[..written_size]) (num_bits, &self.output[..written_size])
} }
@@ -62,19 +64,21 @@ impl BlockDecoder {
} }
} }
pub fn uncompress_block_sorted(
    &mut self,
    compressed_data: &[u8],
    offset: u32,
    num_bits: u8,
) -> usize {
    self.output_len = COMPRESSION_BLOCK_SIZE;
    self.bitpacker
        .decompress_sorted(offset, &compressed_data, &mut self.output, num_bits)
} }
pub fn uncompress_block_unsorted(&mut self, compressed_data: &[u8], num_bits: u8) -> usize { pub fn uncompress_block_unsorted(&mut self, compressed_data: &[u8], num_bits: u8) -> usize {
self.output_len = COMPRESSION_BLOCK_SIZE; self.output_len = COMPRESSION_BLOCK_SIZE;
self.bitpacker
    .decompress(&compressed_data, &mut self.output, num_bits)
} }
#[inline] #[inline]
@@ -88,7 +92,6 @@ impl BlockDecoder {
} }
} }
pub trait VIntEncoder { pub trait VIntEncoder {
/// Compresses an array of `u32` integers,
/// using [delta-encoding](https://en.wikipedia.org/wiki/Delta_encoding)


@@ -1,9 +1,5 @@
#[inline(always)] #[inline(always)]
pub fn compress_sorted<'a>(input: &[u32], output: &'a mut [u8], mut offset: u32) -> &'a [u8] {
let mut byte_written = 0; let mut byte_written = 0;
for &v in input { for &v in input {
let mut to_encode: u32 = v - offset; let mut to_encode: u32 = v - offset;
@@ -46,47 +42,41 @@ pub(crate) fn compress_unsorted<'a>(input: &[u32], output: &'a mut [u8]) -> &'a
} }
#[inline(always)] #[inline(always)]
pub fn uncompress_sorted<'a>(compressed_data: &'a [u8], output: &mut [u32], offset: u32) -> usize {
let mut read_byte = 0; let mut read_byte = 0;
let mut result = offset; let mut result = offset;
for output_mut in output.iter_mut() {
let mut shift = 0u32; let mut shift = 0u32;
loop { loop {
let cur_byte = compressed_data[read_byte]; let cur_byte = compressed_data[read_byte];
read_byte += 1; read_byte += 1;
result += ((cur_byte % 128u8) as u32) << shift; result += u32::from(cur_byte % 128u8) << shift;
if cur_byte & 128u8 != 0u8 { if cur_byte & 128u8 != 0u8 {
break; break;
} }
shift += 7; shift += 7;
} }
output[i] = result; *output_mut = result;
} }
read_byte read_byte
} }
#[inline(always)] #[inline(always)]
pub(crate) fn uncompress_unsorted(compressed_data: &[u8], output_arr: &mut [u32]) -> usize {
    let mut read_byte = 0;
    for output_mut in output_arr.iter_mut() {
let mut result = 0u32; let mut result = 0u32;
let mut shift = 0u32; let mut shift = 0u32;
loop { loop {
let cur_byte = compressed_data[read_byte]; let cur_byte = compressed_data[read_byte];
read_byte += 1; read_byte += 1;
result += ((cur_byte % 128u8) as u32) << shift; result += u32::from(cur_byte % 128u8) << shift;
if cur_byte & 128u8 != 0u8 { if cur_byte & 128u8 != 0u8 {
break; break;
} }
shift += 7; shift += 7;
} }
output[i] = result; *output_mut = result;
} }
read_byte read_byte
} }
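Side note (not part of the diff): the decoder above reads a delta-encoded variable-byte stream: each value is stored as its difference from the previous one, seven bits per byte, least-significant group first, with the high bit set on the last byte of a value. A self-contained encoder sketch following the same convention (an illustration inferred from the decoder, not the crate's implementation):

// Encodes sorted values as deltas in the 7-bit-per-byte format decoded above.
fn encode_sorted(values: &[u32], offset: u32) -> Vec<u8> {
    let mut out = Vec::new();
    let mut prev = offset;
    for &v in values {
        let mut delta = v - prev; // assumes values are sorted and >= offset
        prev = v;
        loop {
            let byte = (delta % 128) as u8;
            delta /= 128;
            if delta == 0 {
                out.push(byte | 128); // high bit marks the final byte of a value
                break;
            }
            out.push(byte);
        }
    }
    out
}

fn main() {
    // Deltas 3, 127 and 870 take 1, 1 and 2 bytes respectively.
    assert_eq!(encode_sorted(&[3, 130, 1000], 0).len(), 4);
}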


@@ -2,6 +2,7 @@
Postings module (also called inverted index) Postings module (also called inverted index)
*/ */
pub(crate) mod compression;
/// Postings module /// Postings module
/// ///
/// Postings, also called inverted lists, is the key datastructure /// Postings, also called inverted lists, is the key datastructure
@@ -11,18 +12,17 @@ mod postings_writer;
mod recorder;
mod segment_postings;
mod serializer;
mod skip;
mod stacker;
mod term_info;
pub(crate) use self::postings_writer::MultiFieldPostingsWriter;
pub use self::serializer::{FieldSerializer, InvertedIndexSerializer};
use self::compression::COMPRESSION_BLOCK_SIZE;
pub use self::postings::Postings;
pub(crate) use self::skip::SkipReader;
pub use self::term_info::TermInfo;
pub use self::segment_postings::{BlockSegmentPostings, SegmentPostings};
@@ -34,7 +34,7 @@ pub(crate) const USE_SKIP_INFO_LIMIT: u32 = COMPRESSION_BLOCK_SIZE as u32;
pub(crate) type UnorderedTermId = u64; pub(crate) type UnorderedTermId = u64;
#[allow(enum_variant_names)] #[cfg_attr(feature = "cargo-clippy", allow(clippy::enum_variant_names))]
#[derive(Debug, PartialEq, Clone, Copy, Eq)] #[derive(Debug, PartialEq, Clone, Copy, Eq)]
pub(crate) enum FreqReadingOption { pub(crate) enum FreqReadingOption {
NoFreq, NoFreq,
@@ -71,8 +71,7 @@ pub mod tests {
let mut segment = index.new_segment(); let mut segment = index.new_segment();
let mut posting_serializer = InvertedIndexSerializer::open(&mut segment).unwrap(); let mut posting_serializer = InvertedIndexSerializer::open(&mut segment).unwrap();
{ {
let mut field_serializer = posting_serializer.new_field(text_field, 120 * 4).unwrap();
field_serializer.new_term("abc".as_bytes()).unwrap(); field_serializer.new_term("abc".as_bytes()).unwrap();
for doc_id in 0u32..120u32 { for doc_id in 0u32..120u32 {
let delta_positions = vec![1, 2, 3, 2]; let delta_positions = vec![1, 2, 3, 2];
@@ -512,13 +511,13 @@ pub mod tests {
let mut index_writer = index.writer_with_num_threads(1, 40_000_000).unwrap(); let mut index_writer = index.writer_with_num_threads(1, 40_000_000).unwrap();
for _ in 0..posting_list_size { for _ in 0..posting_list_size {
let mut doc = Document::default(); let mut doc = Document::default();
if rng.gen_bool(1f64/ 15f64) { if rng.gen_bool(1f64 / 15f64) {
doc.add_text(text_field, "a"); doc.add_text(text_field, "a");
} }
if rng.gen_bool(1f64/ 10f64) { if rng.gen_bool(1f64 / 10f64) {
doc.add_text(text_field, "b"); doc.add_text(text_field, "b");
} }
if rng.gen_bool(1f64/ 5f64) { if rng.gen_bool(1f64 / 5f64) {
doc.add_text(text_field, "c"); doc.add_text(text_field, "c");
} }
doc.add_text(text_field, "d"); doc.add_text(text_field, "d");


@@ -15,7 +15,7 @@ use tokenizer::TokenStream;
use DocId; use DocId;
use Result; use Result;
fn posting_from_field_entry<'a>(field_entry: &FieldEntry) -> Box<PostingsWriter> { fn posting_from_field_entry(field_entry: &FieldEntry) -> Box<PostingsWriter> {
match *field_entry.field_type() { match *field_entry.field_type() {
FieldType::Str(ref text_options) => text_options FieldType::Str(ref text_options) => text_options
.get_indexing_options() .get_indexing_options()
@@ -29,8 +29,7 @@ fn posting_from_field_entry<'a>(field_entry: &FieldEntry) -> Box<PostingsWriter>
IndexRecordOption::WithFreqsAndPositions => { IndexRecordOption::WithFreqsAndPositions => {
SpecializedPostingsWriter::<TFAndPositionRecorder>::new_boxed() SpecializedPostingsWriter::<TFAndPositionRecorder>::new_boxed()
} }
}).unwrap_or_else(|| SpecializedPostingsWriter::<NothingRecorder>::new_boxed()),
FieldType::U64(_) | FieldType::I64(_) | FieldType::HierarchicalFacet => { FieldType::U64(_) | FieldType::I64(_) | FieldType::HierarchicalFacet => {
SpecializedPostingsWriter::<NothingRecorder>::new_boxed() SpecializedPostingsWriter::<NothingRecorder>::new_boxed()
} }
@@ -94,11 +93,12 @@ impl MultiFieldPostingsWriter {
&self, &self,
serializer: &mut InvertedIndexSerializer, serializer: &mut InvertedIndexSerializer,
) -> Result<HashMap<Field, HashMap<UnorderedTermId, TermOrdinal>>> { ) -> Result<HashMap<Field, HashMap<UnorderedTermId, TermOrdinal>>> {
let mut term_offsets: Vec<(&[u8], Addr, UnorderedTermId)> = self
    .term_index
.iter() .iter()
.map(|(term_bytes, addr, bucket_id)| (term_bytes, addr, bucket_id as UnorderedTermId)) .map(|(term_bytes, addr, bucket_id)| (term_bytes, addr, bucket_id as UnorderedTermId))
.collect(); .collect();
term_offsets.sort_by_key(|&(k, _, _)| k); term_offsets.sort_unstable_by_key(|&(k, _, _)| k);
let mut offsets: Vec<(Field, usize)> = vec![]; let mut offsets: Vec<(Field, usize)> = vec![];
let term_offsets_it = term_offsets let term_offsets_it = term_offsets
@@ -127,8 +127,8 @@ impl MultiFieldPostingsWriter {
let field_entry = self.schema.get_field_entry(field); let field_entry = self.schema.get_field_entry(field);
match *field_entry.field_type() {
    FieldType::Str(_) | FieldType::HierarchicalFacet => {
// populating the (unordered term ord) -> (ordered term ord) mapping // populating the (unordered term ord) -> (ordered term ord) mapping
// for the field. // for the field.
let mut unordered_term_ids = term_offsets[start..stop] let mut unordered_term_ids = term_offsets[start..stop]
@@ -138,12 +138,11 @@ impl MultiFieldPostingsWriter {
.enumerate() .enumerate()
.map(|(term_ord, unord_term_id)| { .map(|(term_ord, unord_term_id)| {
(unord_term_id as UnorderedTermId, term_ord as TermOrdinal) (unord_term_id as UnorderedTermId, term_ord as TermOrdinal)
}).collect();
unordered_term_mappings.insert(field, mapping); unordered_term_mappings.insert(field, mapping);
} }
FieldType::U64(_) | FieldType::I64(_) => {}
FieldType::Bytes => {}
} }
let postings_writer = &self.per_field_postings_writers[field.0 as usize]; let postings_writer = &self.per_field_postings_writers[field.0 as usize];
@@ -202,14 +201,11 @@ pub trait PostingsWriter {
heap: &mut MemoryArena, heap: &mut MemoryArena,
) -> u32 { ) -> u32 {
let mut term = Term::for_field(field); let mut term = Term::for_field(field);
let mut sink = |token: &Token| {
    term.set_text(token.text.as_str());
    self.subscribe(term_index, doc_id, token.position as u32, &term, heap);
};
token_stream.process(&mut sink)
} }
fn total_num_tokens(&self) -> u64; fn total_num_tokens(&self) -> u64;


@@ -107,7 +107,8 @@ impl Recorder for TermFrequencyRecorder {
fn serialize(&self, serializer: &mut FieldSerializer, heap: &MemoryArena) -> io::Result<()> { fn serialize(&self, serializer: &mut FieldSerializer, heap: &MemoryArena) -> io::Result<()> {
// the last document has not been closed... // the last document has not been closed...
// its term freq is self.current_tf. // its term freq is self.current_tf.
let mut doc_iter = self
    .stack
.iter(heap) .iter(heap)
.chain(Some(self.current_tf).into_iter()); .chain(Some(self.current_tf).into_iter());


@@ -1,20 +1,20 @@
use common::BitSet;
use common::HasLen;
use common::{BinarySerializable, VInt};
use docset::{DocSet, SkipResult};
use fst::Streamer;
use owned_read::OwnedRead;
use positions::PositionReader;
use postings::compression::compressed_block_size;
use postings::compression::{BlockDecoder, VIntDecoder, COMPRESSION_BLOCK_SIZE};
use postings::serializer::PostingsSerializer;
use postings::FreqReadingOption;
use postings::Postings;
use postings::SkipReader;
use postings::USE_SKIP_INFO_LIMIT;
use schema::IndexRecordOption;
use std::cmp::Ordering;
use DocId;
const EMPTY_ARR: [u8; 0] = []; const EMPTY_ARR: [u8; 0] = [];
@@ -98,7 +98,7 @@ impl SegmentPostings {
docs.len() as u32, docs.len() as u32,
OwnedRead::new(buffer), OwnedRead::new(buffer),
IndexRecordOption::Basic, IndexRecordOption::Basic,
IndexRecordOption::Basic IndexRecordOption::Basic,
); );
SegmentPostings::from_block_postings(block_segment_postings, None) SegmentPostings::from_block_postings(block_segment_postings, None)
} }
@@ -151,7 +151,11 @@ fn exponential_search(target: u32, arr: &[u32]) -> (usize, usize) {
/// The target is assumed smaller or equal to the last element. /// The target is assumed smaller or equal to the last element.
fn search_within_block(block_docs: &[u32], target: u32) -> usize { fn search_within_block(block_docs: &[u32], target: u32) -> usize {
let (start, end) = exponential_search(target, block_docs); let (start, end) = exponential_search(target, block_docs);
start.wrapping_add(
    block_docs[start..end]
        .binary_search(&target)
        .unwrap_or_else(|e| e),
)
} }
impl DocSet for SegmentPostings { impl DocSet for SegmentPostings {
@@ -179,21 +183,20 @@ impl DocSet for SegmentPostings {
// check if we need to go to the next block // check if we need to go to the next block
let need_positions = self.position_computer.is_some(); let need_positions = self.position_computer.is_some();
let mut sum_freqs_skipped: u32 = 0; let mut sum_freqs_skipped: u32 = 0;
if !self
    .block_cursor
    .docs()
    .last()
    .map(|doc| *doc >= target)
    .unwrap_or(false)
// there should always be at least a document in the block
// since advance returned.
{ {
// we are not in the right block. // we are not in the right block.
// //
// First compute all of the freqs skipped from the current block. // First compute all of the freqs skipped from the current block.
if need_positions { if need_positions {
sum_freqs_skipped = self.block_cursor.freqs()[self.cur..].iter().sum();
match self.block_cursor.skip_to(target) { match self.block_cursor.skip_to(target) {
BlockSegmentPostingsSkipResult::Success(block_skip_freqs) => { BlockSegmentPostingsSkipResult::Success(block_skip_freqs) => {
sum_freqs_skipped += block_skip_freqs; sum_freqs_skipped += block_skip_freqs;
@@ -202,11 +205,11 @@ impl DocSet for SegmentPostings {
return SkipResult::End; return SkipResult::End;
} }
} }
} else if self.block_cursor.skip_to(target)
    == BlockSegmentPostingsSkipResult::Terminated
{
    // no positions needed. no need to sum freqs.
    return SkipResult::End;
} }
self.cur = 0; self.cur = 0;
} }
@@ -215,9 +218,13 @@ impl DocSet for SegmentPostings {
let block_docs = self.block_cursor.docs(); let block_docs = self.block_cursor.docs();
debug_assert!(target >= self.doc()); debug_assert!(target >= self.doc());
let new_cur = self
    .cur
    .wrapping_add(search_within_block(&block_docs[self.cur..], target));
if need_positions { if need_positions {
sum_freqs_skipped += self.block_cursor.freqs()[self.cur..new_cur]
    .iter()
    .sum::<u32>();
self.position_computer self.position_computer
.as_mut() .as_mut()
.unwrap() .unwrap()
@@ -229,9 +236,9 @@ impl DocSet for SegmentPostings {
let doc = block_docs[new_cur]; let doc = block_docs[new_cur];
debug_assert!(doc >= target); debug_assert!(doc >= target);
if doc == target { if doc == target {
SkipResult::Reached
} else {
    SkipResult::OverStep
} }
} }
@@ -330,7 +337,10 @@ pub struct BlockSegmentPostings {
skip_reader: SkipReader, skip_reader: SkipReader,
} }
fn split_into_skips_and_postings(
    doc_freq: u32,
    mut data: OwnedRead,
) -> (Option<OwnedRead>, OwnedRead) {
if doc_freq >= USE_SKIP_INFO_LIMIT { if doc_freq >= USE_SKIP_INFO_LIMIT {
let skip_len = VInt::deserialize(&mut data).expect("Data corrupted").0 as usize; let skip_len = VInt::deserialize(&mut data).expect("Data corrupted").0 as usize;
let mut postings_data = data.clone(); let mut postings_data = data.clone();
@@ -345,7 +355,7 @@ fn split_into_skips_and_postings(doc_freq: u32, mut data: OwnedRead) -> (Option<
#[derive(Debug, Eq, PartialEq)] #[derive(Debug, Eq, PartialEq)]
pub enum BlockSegmentPostingsSkipResult { pub enum BlockSegmentPostingsSkipResult {
Terminated, Terminated,
Success(u32) //< number of term freqs to skip Success(u32), //< number of term freqs to skip
} }
impl BlockSegmentPostings { impl BlockSegmentPostings {
@@ -353,7 +363,7 @@ impl BlockSegmentPostings {
doc_freq: u32, doc_freq: u32,
data: OwnedRead, data: OwnedRead,
record_option: IndexRecordOption, record_option: IndexRecordOption,
requested_option: IndexRecordOption requested_option: IndexRecordOption,
) -> BlockSegmentPostings { ) -> BlockSegmentPostings {
let freq_reading_option = match (record_option, requested_option) { let freq_reading_option = match (record_option, requested_option) {
(IndexRecordOption::Basic, _) => FreqReadingOption::NoFreq, (IndexRecordOption::Basic, _) => FreqReadingOption::NoFreq,
@@ -362,11 +372,10 @@ impl BlockSegmentPostings {
}; };
let (skip_data_opt, postings_data) = split_into_skips_and_postings(doc_freq, data); let (skip_data_opt, postings_data) = split_into_skips_and_postings(doc_freq, data);
let skip_reader = match skip_data_opt {
    Some(skip_data) => SkipReader::new(skip_data, record_option),
    None => SkipReader::new(OwnedRead::new(&EMPTY_ARR[..]), record_option),
};
let doc_freq = doc_freq as usize; let doc_freq = doc_freq as usize;
let num_vint_docs = doc_freq % COMPRESSION_BLOCK_SIZE; let num_vint_docs = doc_freq % COMPRESSION_BLOCK_SIZE;
BlockSegmentPostings { BlockSegmentPostings {
@@ -450,7 +459,6 @@ impl BlockSegmentPostings {
self.doc_decoder.output_len self.doc_decoder.output_len
} }
/// Positions on a block that may contain `doc_id`.
/// Always advances the current block.
///
@@ -461,9 +469,7 @@ impl BlockSegmentPostings {
/// Returns false iff all of the remaining documents are smaller than
/// `doc_id`. In that case, all of these documents are consumed.
/// ///
pub fn skip_to(&mut self, target_doc: DocId) -> BlockSegmentPostingsSkipResult {
let mut skip_freqs = 0u32; let mut skip_freqs = 0u32;
while self.skip_reader.advance() { while self.skip_reader.advance() {
if self.skip_reader.doc() >= target_doc { if self.skip_reader.doc() >= target_doc {
@@ -472,11 +478,11 @@ impl BlockSegmentPostings {
// //
// We found our block! // We found our block!
let num_bits = self.skip_reader.doc_num_bits(); let num_bits = self.skip_reader.doc_num_bits();
let num_consumed_bytes = self.doc_decoder.uncompress_block_sorted(
    self.remaining_data.as_ref(),
    self.doc_offset,
    num_bits,
);
self.remaining_data.advance(num_consumed_bytes); self.remaining_data.advance(num_consumed_bytes);
let tf_num_bits = self.skip_reader.tf_num_bits(); let tf_num_bits = self.skip_reader.tf_num_bits();
match self.freq_reading_option { match self.freq_reading_option {
@@ -486,9 +492,9 @@ impl BlockSegmentPostings {
self.remaining_data.advance(num_bytes_to_skip); self.remaining_data.advance(num_bytes_to_skip);
} }
FreqReadingOption::ReadFreq => { FreqReadingOption::ReadFreq => {
let num_consumed_bytes = self
    .freq_decoder
    .uncompress_block_unsorted(self.remaining_data.as_ref(), tf_num_bits);
self.remaining_data.advance(num_consumed_bytes); self.remaining_data.advance(num_consumed_bytes);
} }
} }
@@ -518,7 +524,8 @@ impl BlockSegmentPostings {
} }
} }
self.num_vint_docs = 0; self.num_vint_docs = 0;
return self
    .docs()
.last() .last()
.map(|last_doc| { .map(|last_doc| {
if *last_doc >= target_doc { if *last_doc >= target_doc {
@@ -526,8 +533,7 @@ impl BlockSegmentPostings {
} else { } else {
BlockSegmentPostingsSkipResult::Terminated BlockSegmentPostingsSkipResult::Terminated
} }
}).unwrap_or(BlockSegmentPostingsSkipResult::Terminated);
} }
BlockSegmentPostingsSkipResult::Terminated BlockSegmentPostingsSkipResult::Terminated
} }
@@ -538,11 +544,11 @@ impl BlockSegmentPostings {
pub fn advance(&mut self) -> bool { pub fn advance(&mut self) -> bool {
if self.skip_reader.advance() { if self.skip_reader.advance() {
let num_bits = self.skip_reader.doc_num_bits(); let num_bits = self.skip_reader.doc_num_bits();
let num_consumed_bytes = self.doc_decoder.uncompress_block_sorted(
    self.remaining_data.as_ref(),
    self.doc_offset,
    num_bits,
);
self.remaining_data.advance(num_consumed_bytes); self.remaining_data.advance(num_consumed_bytes);
let tf_num_bits = self.skip_reader.tf_num_bits(); let tf_num_bits = self.skip_reader.tf_num_bits();
match self.freq_reading_option { match self.freq_reading_option {
@@ -552,9 +558,9 @@ impl BlockSegmentPostings {
self.remaining_data.advance(num_bytes_to_skip); self.remaining_data.advance(num_bytes_to_skip);
} }
FreqReadingOption::ReadFreq => { FreqReadingOption::ReadFreq => {
let num_consumed_bytes = self
    .freq_decoder
    .uncompress_block_unsorted(self.remaining_data.as_ref(), tf_num_bits);
self.remaining_data.advance(num_consumed_bytes); self.remaining_data.advance(num_consumed_bytes);
} }
} }
@@ -594,7 +600,6 @@ impl BlockSegmentPostings {
doc_offset: 0, doc_offset: 0,
doc_freq: 0, doc_freq: 0,
remaining_data: OwnedRead::new(vec![]), remaining_data: OwnedRead::new(vec![]),
skip_reader: SkipReader::new(OwnedRead::new(vec![]), IndexRecordOption::Basic), skip_reader: SkipReader::new(OwnedRead::new(vec![]), IndexRecordOption::Basic),
} }
@@ -616,7 +621,9 @@ impl<'b> Streamer<'b> for BlockSegmentPostings {
#[cfg(test)] #[cfg(test)]
mod tests { mod tests {
use super::search_within_block;
use super::BlockSegmentPostings; use super::BlockSegmentPostings;
use super::BlockSegmentPostingsSkipResult;
use super::SegmentPostings; use super::SegmentPostings;
use common::HasLen; use common::HasLen;
use core::Index; use core::Index;
@@ -626,9 +633,7 @@ mod tests {
use schema::SchemaBuilder; use schema::SchemaBuilder;
use schema::Term; use schema::Term;
use schema::INT_INDEXED; use schema::INT_INDEXED;
use DocId;
#[test] #[test]
fn test_empty_segment_postings() { fn test_empty_segment_postings() {
@@ -645,7 +650,6 @@ mod tests {
assert_eq!(postings.doc_freq(), 0); assert_eq!(postings.doc_freq(), 0);
} }
fn search_within_block_trivial_but_slow(block: &[u32], target: u32) -> usize { fn search_within_block_trivial_but_slow(block: &[u32], target: u32) -> usize {
block block
.iter() .iter()
@@ -653,11 +657,15 @@ mod tests {
.enumerate() .enumerate()
.filter(|&(_, ref val)| *val >= target) .filter(|&(_, ref val)| *val >= target)
.next() .next()
.unwrap()
.0
} }
fn util_test_search_within_block(block: &[u32], target: u32) { fn util_test_search_within_block(block: &[u32], target: u32) {
assert_eq!(
    search_within_block(block, target),
    search_within_block_trivial_but_slow(block, target)
);
} }
fn util_test_search_within_block_all(block: &[u32]) { fn util_test_search_within_block_all(block: &[u32]) {
@@ -677,7 +685,7 @@ mod tests {
#[test] #[test]
fn test_search_within_block() { fn test_search_within_block() {
for len in 1u32..128u32 { for len in 1u32..128u32 {
let v: Vec<u32> = (0..len).map(|i| i*2).collect(); let v: Vec<u32> = (0..len).map(|i| i * 2).collect();
util_test_search_within_block_all(&v[..]); util_test_search_within_block_all(&v[..]);
} }
} }
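Side note (not part of the diff): `search_within_block` relies on an exponential (galloping) search to narrow the window before the binary search. A stand-alone reimplementation of that idea, for illustration only (the crate's `exponential_search` may differ in its details):

// Doubles the probe index until the target is bracketed, then returns the
// window [start, end) to binary-search in. Assumes target <= last element.
fn exponential_search(target: u32, arr: &[u32]) -> (usize, usize) {
    let mut end = 1;
    while end < arr.len() && arr[end] < target {
        end *= 2;
    }
    (end / 2, end.min(arr.len()))
}

fn main() {
    let arr: Vec<u32> = (0..100).map(|i| i * 3).collect();
    let (start, end) = exponential_search(50, &arr);
    let pos = start + arr[start..end].binary_search(&50).unwrap_or_else(|e| e);
    assert_eq!(arr[pos], 51); // first element >= 50
}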
@@ -726,14 +734,22 @@ mod tests {
fn test_block_segment_postings_skip() { fn test_block_segment_postings_skip() {
for i in 0..4 { for i in 0..4 {
let mut block_postings = build_block_postings(vec![3]); let mut block_postings = build_block_postings(vec![3]);
assert_eq!(
    block_postings.skip_to(i),
    BlockSegmentPostingsSkipResult::Success(0u32)
);
assert_eq!(
    block_postings.skip_to(i),
    BlockSegmentPostingsSkipResult::Terminated
);
}
let mut block_postings = build_block_postings(vec![3]);
assert_eq!(
    block_postings.skip_to(4u32),
    BlockSegmentPostingsSkipResult::Terminated
);
} }
#[test] #[test]
fn test_block_segment_postings_skip2() { fn test_block_segment_postings_skip2() {
let mut docs = vec![0]; let mut docs = vec![0];
@@ -741,14 +757,23 @@ mod tests {
docs.push((i * i / 100) + i); docs.push((i * i / 100) + i);
} }
let mut block_postings = build_block_postings(docs.clone()); let mut block_postings = build_block_postings(docs.clone());
for i in vec![0, 424, 10000] { for i in vec![0, 424, 10000] {
assert_eq!(
    block_postings.skip_to(i),
    BlockSegmentPostingsSkipResult::Success(0u32)
);
let docs = block_postings.docs();
assert!(docs[0] <= i);
assert!(docs.last().cloned().unwrap_or(0u32) >= i);
}
assert_eq!(
    block_postings.skip_to(100_000),
    BlockSegmentPostingsSkipResult::Terminated
);
assert_eq!(
    block_postings.skip_to(101_000),
    BlockSegmentPostingsSkipResult::Terminated
);
} }
#[test] #[test]


@@ -1,18 +1,18 @@
use super::TermInfo;
use common::{BinarySerializable, VInt};
use common::{CompositeWrite, CountingWriter};
use core::Segment;
use directory::WritePtr;
use positions::PositionSerializer;
use postings::compression::{BlockEncoder, VIntEncoder, COMPRESSION_BLOCK_SIZE};
use postings::skip::SkipSerializer;
use postings::USE_SKIP_INFO_LIMIT;
use schema::Schema;
use schema::{Field, FieldEntry, FieldType};
use std::io::{self, Write};
use termdict::{TermDictionaryBuilder, TermOrdinal};
use DocId;
use Result;
/// `PostingsSerializer` is in charge of serializing /// `PostingsSerializer` is in charge of serializing
/// postings on disk, in the /// postings on disk, in the
@@ -100,11 +100,11 @@ impl InvertedIndexSerializer {
let positionsidx_write = self.positionsidx_write.for_field(field); let positionsidx_write = self.positionsidx_write.for_field(field);
let field_type: FieldType = (*field_entry.field_type()).clone(); let field_type: FieldType = (*field_entry.field_type()).clone();
FieldSerializer::new( FieldSerializer::new(
field_type, &field_type,
term_dictionary_write, term_dictionary_write,
postings_write, postings_write,
positions_write, positions_write,
positionsidx_write positionsidx_write,
) )
} }
@@ -131,11 +131,11 @@ pub struct FieldSerializer<'a> {
impl<'a> FieldSerializer<'a> { impl<'a> FieldSerializer<'a> {
fn new( fn new(
field_type: FieldType, field_type: &FieldType,
term_dictionary_write: &'a mut CountingWriter<WritePtr>, term_dictionary_write: &'a mut CountingWriter<WritePtr>,
postings_write: &'a mut CountingWriter<WritePtr>, postings_write: &'a mut CountingWriter<WritePtr>,
positions_write: &'a mut CountingWriter<WritePtr>, positions_write: &'a mut CountingWriter<WritePtr>,
positionsidx_write: &'a mut CountingWriter<WritePtr> positionsidx_write: &'a mut CountingWriter<WritePtr>,
) -> io::Result<FieldSerializer<'a>> { ) -> io::Result<FieldSerializer<'a>> {
let (term_freq_enabled, position_enabled): (bool, bool) = match field_type { let (term_freq_enabled, position_enabled): (bool, bool) = match field_type {
FieldType::Str(ref text_options) => { FieldType::Str(ref text_options) => {
@@ -152,8 +152,9 @@ impl<'a> FieldSerializer<'a> {
_ => (false, false), _ => (false, false),
}; };
let term_dictionary_builder = let term_dictionary_builder =
TermDictionaryBuilder::new(term_dictionary_write, field_type)?; TermDictionaryBuilder::new(term_dictionary_write, &field_type)?;
let postings_serializer =
    PostingsSerializer::new(postings_write, term_freq_enabled, position_enabled);
let positions_serializer_opt = if position_enabled { let positions_serializer_opt = if position_enabled {
Some(PositionSerializer::new(positions_write, positionsidx_write)) Some(PositionSerializer::new(positions_write, positionsidx_write))
} else { } else {
@@ -171,14 +172,15 @@ impl<'a> FieldSerializer<'a> {
} }
fn current_term_info(&self) -> TermInfo { fn current_term_info(&self) -> TermInfo {
let positions_idx = self
    .positions_serializer_opt
.as_ref() .as_ref()
.map(|positions_serializer| positions_serializer.positions_idx()) .map(|positions_serializer| positions_serializer.positions_idx())
.unwrap_or(0u64); .unwrap_or(0u64);
TermInfo { TermInfo {
doc_freq: 0, doc_freq: 0,
postings_offset: self.postings_serializer.addr(), postings_offset: self.postings_serializer.addr(),
positions_idx positions_idx,
} }
} }
@@ -253,7 +255,7 @@ impl<'a> FieldSerializer<'a> {
struct Block { struct Block {
doc_ids: [DocId; COMPRESSION_BLOCK_SIZE], doc_ids: [DocId; COMPRESSION_BLOCK_SIZE],
term_freqs: [u32; COMPRESSION_BLOCK_SIZE], term_freqs: [u32; COMPRESSION_BLOCK_SIZE],
len: usize len: usize,
} }
impl Block { impl Block {
@@ -261,7 +263,7 @@ impl Block {
Block { Block {
doc_ids: [0u32; COMPRESSION_BLOCK_SIZE], doc_ids: [0u32; COMPRESSION_BLOCK_SIZE],
term_freqs: [0u32; COMPRESSION_BLOCK_SIZE], term_freqs: [0u32; COMPRESSION_BLOCK_SIZE],
len: 0 len: 0,
} }
} }
@@ -312,9 +314,12 @@ pub struct PostingsSerializer<W: Write> {
termfreq_sum_enabled: bool, termfreq_sum_enabled: bool,
} }
impl<W: Write> PostingsSerializer<W> { impl<W: Write> PostingsSerializer<W> {
pub fn new(
    write: W,
    termfreq_enabled: bool,
    termfreq_sum_enabled: bool,
) -> PostingsSerializer<W> {
PostingsSerializer { PostingsSerializer {
output_write: CountingWriter::wrap(write), output_write: CountingWriter::wrap(write),
@@ -337,14 +342,16 @@ impl<W: Write> PostingsSerializer<W> {
.block_encoder .block_encoder
.compress_block_sorted(&self.block.doc_ids(), self.last_doc_id_encoded); .compress_block_sorted(&self.block.doc_ids(), self.last_doc_id_encoded);
self.last_doc_id_encoded = self.block.last_doc(); self.last_doc_id_encoded = self.block.last_doc();
self.skip_write
    .write_doc(self.last_doc_id_encoded, num_bits);
// last el block 0, offset block 1, // last el block 0, offset block 1,
self.postings_write.extend(block_encoded); self.postings_write.extend(block_encoded);
} }
if self.termfreq_enabled { if self.termfreq_enabled {
// encode the term_freqs // encode the term_freqs
let (num_bits, block_encoded): (u8, &[u8]) = self
    .block_encoder
    .compress_block_unsorted(&self.block.term_freqs());
self.postings_write.extend(block_encoded); self.postings_write.extend(block_encoded);
self.skip_write.write_term_freq(num_bits); self.skip_write.write_term_freq(num_bits);
if self.termfreq_sum_enabled { if self.termfreq_sum_enabled {
@@ -375,13 +382,15 @@ impl<W: Write> PostingsSerializer<W> {
// In that case, the remaining part is encoded // In that case, the remaining part is encoded
// using variable int encoding. // using variable int encoding.
{ {
let block_encoded = self
    .block_encoder
.compress_vint_sorted(&self.block.doc_ids(), self.last_doc_id_encoded); .compress_vint_sorted(&self.block.doc_ids(), self.last_doc_id_encoded);
self.postings_write.write_all(block_encoded)?; self.postings_write.write_all(block_encoded)?;
} }
// ... Idem for term frequencies // ... Idem for term frequencies
if self.termfreq_enabled { if self.termfreq_enabled {
let block_encoded = self
    .block_encoder
.compress_vint_unsorted(self.block.term_freqs()); .compress_vint_unsorted(self.block.term_freqs());
self.postings_write.write_all(block_encoded)?; self.postings_write.write_all(block_encoded)?;
} }
@@ -392,7 +401,6 @@ impl<W: Write> PostingsSerializer<W> {
VInt(skip_data.len() as u64).serialize(&mut self.output_write)?; VInt(skip_data.len() as u64).serialize(&mut self.output_write)?;
self.output_write.write_all(skip_data)?; self.output_write.write_all(skip_data)?;
self.output_write.write_all(&self.postings_write[..])?; self.output_write.write_all(&self.postings_write[..])?;
} else { } else {
self.output_write.write_all(&self.postings_write[..])?; self.output_write.write_all(&self.postings_write[..])?;
} }


@@ -1,8 +1,8 @@
use common::BinarySerializable;
use owned_read::OwnedRead;
use postings::compression::COMPRESSION_BLOCK_SIZE;
use schema::IndexRecordOption;
use DocId;
pub struct SkipSerializer { pub struct SkipSerializer {
buffer: Vec<u8>, buffer: Vec<u8>,
@@ -18,8 +18,11 @@ impl SkipSerializer {
} }
pub fn write_doc(&mut self, last_doc: DocId, doc_num_bits: u8) { pub fn write_doc(&mut self, last_doc: DocId, doc_num_bits: u8) {
assert!(
    last_doc > self.prev_doc,
    "write_doc(...) called with non-increasing doc ids. \
     Did you forget to call clear maybe?"
);
let delta_doc = last_doc - self.prev_doc; let delta_doc = last_doc - self.prev_doc;
self.prev_doc = last_doc; self.prev_doc = last_doc;
delta_doc.serialize(&mut self.buffer).unwrap(); delta_doc.serialize(&mut self.buffer).unwrap();
@@ -30,9 +33,10 @@ impl SkipSerializer {
self.buffer.push(tf_num_bits); self.buffer.push(tf_num_bits);
} }
pub fn write_total_term_freq(&mut self, tf_sum: u32) { pub fn write_total_term_freq(&mut self, tf_sum: u32) {
tf_sum
    .serialize(&mut self.buffer)
    .expect("Should never fail");
} }
pub fn data(&self) -> &[u8] { pub fn data(&self) -> &[u8] {
@@ -103,33 +107,32 @@ impl SkipReader {
} else { } else {
let doc_delta = u32::deserialize(&mut self.owned_read).expect("Skip data corrupted"); let doc_delta = u32::deserialize(&mut self.owned_read).expect("Skip data corrupted");
self.doc += doc_delta as DocId; self.doc += doc_delta as DocId;
self.doc_num_bits = self.owned_read.get(0); self.doc_num_bits = self.owned_read.get(0);
match self.skip_info { match self.skip_info {
IndexRecordOption::Basic => { IndexRecordOption::Basic => {
self.owned_read.advance(1); self.owned_read.advance(1);
} }
IndexRecordOption::WithFreqs=> { IndexRecordOption::WithFreqs => {
self.tf_num_bits = self.owned_read.get(1); self.tf_num_bits = self.owned_read.get(1);
self.owned_read.advance(2); self.owned_read.advance(2);
} }
IndexRecordOption::WithFreqsAndPositions => { IndexRecordOption::WithFreqsAndPositions => {
self.tf_num_bits = self.owned_read.get(1); self.tf_num_bits = self.owned_read.get(1);
self.owned_read.advance(2); self.owned_read.advance(2);
self.tf_sum =
    u32::deserialize(&mut self.owned_read).expect("Failed reading tf_sum");
} }
} }
true true
} }
} }
} }
#[cfg(test)] #[cfg(test)]
mod tests { mod tests {
use super::IndexRecordOption;
use super::{SkipReader, SkipSerializer};
use owned_read::OwnedRead; use owned_read::OwnedRead;
#[test] #[test]
@@ -171,4 +174,4 @@ mod tests {
assert_eq!(skip_reader.doc_num_bits(), 5u8); assert_eq!(skip_reader.doc_num_bits(), 5u8);
assert!(!skip_reader.advance()); assert!(!skip_reader.advance());
} }
} }


@@ -47,7 +47,7 @@ impl Addr {
} }
/// Returns the `Addr` object for `addr + offset` /// Returns the `Addr` object for `addr + offset`
pub fn offset(&self, offset: u32) -> Addr { pub fn offset(self, offset: u32) -> Addr {
Addr(self.0.wrapping_add(offset)) Addr(self.0.wrapping_add(offset))
} }
@@ -55,16 +55,16 @@ impl Addr {
Addr((page_id << NUM_BITS_PAGE_ADDR | local_addr) as u32) Addr((page_id << NUM_BITS_PAGE_ADDR | local_addr) as u32)
} }
fn page_id(&self) -> usize { fn page_id(self) -> usize {
(self.0 as usize) >> NUM_BITS_PAGE_ADDR (self.0 as usize) >> NUM_BITS_PAGE_ADDR
} }
fn page_local_addr(&self) -> usize { fn page_local_addr(self) -> usize {
(self.0 as usize) & (PAGE_SIZE - 1) (self.0 as usize) & (PAGE_SIZE - 1)
} }
/// Returns true if and only if the `Addr` is null. /// Returns true if and only if the `Addr` is null.
pub fn is_null(&self) -> bool { pub fn is_null(self) -> bool {
self.0 == u32::max_value() self.0 == u32::max_value()
} }
} }
@@ -233,12 +233,12 @@ impl Page {
#[inline(always)] #[inline(always)]
pub(crate) unsafe fn get_ptr(&self, addr: usize) -> *const u8 { pub(crate) unsafe fn get_ptr(&self, addr: usize) -> *const u8 {
self.data.as_ptr().offset(addr as isize) self.data.as_ptr().add(addr)
} }
#[inline(always)] #[inline(always)]
pub(crate) unsafe fn get_mut_ptr(&mut self, addr: usize) -> *mut u8 { pub(crate) unsafe fn get_mut_ptr(&mut self, addr: usize) -> *mut u8 {
self.data.as_mut_ptr().offset(addr as isize) self.data.as_mut_ptr().add(addr)
} }
} }


@@ -4,6 +4,7 @@ const M: u32 = 0x5bd1_e995;
#[inline(always)] #[inline(always)]
pub fn murmurhash2(key: &[u8]) -> u32 { pub fn murmurhash2(key: &[u8]) -> u32 {
#[cfg_attr(feature = "cargo-clippy", allow(clippy::cast_ptr_alignment))]
let mut key_ptr: *const u32 = key.as_ptr() as *const u32; let mut key_ptr: *const u32 = key.as_ptr() as *const u32;
let len = key.len() as u32; let len = key.len() as u32;
let mut h: u32 = SEED ^ len; let mut h: u32 = SEED ^ len;


@@ -61,7 +61,7 @@ impl Default for KeyValue {
} }
impl KeyValue { impl KeyValue {
fn is_empty(&self) -> bool { fn is_empty(self) -> bool {
self.key_value_addr.is_null() self.key_value_addr.is_null()
} }
} }


@@ -59,10 +59,10 @@ impl DocSet for AllScorer {
} }
} }
if self.doc < self.max_doc { if self.doc < self.max_doc {
return true; true
} else { } else {
self.state = State::Finished; self.state = State::Finished;
return false; false
} }
} }


@@ -17,9 +17,9 @@ fn cached_tf_component(fieldnorm: u32, average_fieldnorm: f32) -> f32 {
fn compute_tf_cache(average_fieldnorm: f32) -> [f32; 256] { fn compute_tf_cache(average_fieldnorm: f32) -> [f32; 256] {
let mut cache = [0f32; 256]; let mut cache = [0f32; 256];
for (fieldnorm_id, cache_mut) in cache.iter_mut().enumerate() {
    let fieldnorm = FieldNormReader::id_to_fieldnorm(fieldnorm_id as u8);
    *cache_mut = cached_tf_component(fieldnorm, average_fieldnorm);
} }
cache cache
} }
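Side note (not part of the diff): `compute_tf_cache` above follows a common pattern, precomputing one value per possible byte-sized fieldnorm id so that scoring only needs an array lookup. A generic sketch of the pattern with placeholder formulas (the decode function and the cached expression below are illustrative, not tantivy's actual BM25 terms):

// Builds a 256-entry lookup table keyed by a one-byte fieldnorm id.
fn precompute_cache(
    decode_id: impl Fn(u8) -> u32,
    component: impl Fn(u32) -> f32,
) -> [f32; 256] {
    let mut cache = [0f32; 256];
    for (id, slot) in cache.iter_mut().enumerate() {
        *slot = component(decode_id(id as u8));
    }
    cache
}

fn main() {
    let cache = precompute_cache(|id| u32::from(id) * 8, |fieldnorm| 1.0 / (1.0 + fieldnorm as f32));
    assert!(cache[0] > cache[255]); // longer fields get a smaller component here
}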
@@ -54,7 +54,7 @@ impl BM25Weight {
for segment_reader in searcher.segment_readers() { for segment_reader in searcher.segment_readers() {
let inverted_index = segment_reader.inverted_index(field); let inverted_index = segment_reader.inverted_index(field);
total_num_tokens += inverted_index.total_num_tokens(); total_num_tokens += inverted_index.total_num_tokens();
total_num_docs += segment_reader.max_doc() as u64; total_num_docs += u64::from(segment_reader.max_doc());
} }
let average_fieldnorm = total_num_tokens as f32 / total_num_docs as f32; let average_fieldnorm = total_num_tokens as f32 / total_num_docs as f32;
@@ -63,8 +63,7 @@ impl BM25Weight {
.map(|term| { .map(|term| {
let term_doc_freq = searcher.doc_freq(term); let term_doc_freq = searcher.doc_freq(term);
idf(term_doc_freq, total_num_docs) idf(term_doc_freq, total_num_docs)
}).sum::<f32>();
BM25Weight::new(idf, average_fieldnorm) BM25Weight::new(idf, average_fieldnorm)
} }


@@ -5,6 +5,7 @@ use query::TermQuery;
use query::Weight; use query::Weight;
use schema::IndexRecordOption; use schema::IndexRecordOption;
use schema::Term; use schema::Term;
use std::collections::BTreeSet;
use Result; use Result;
use Searcher; use Searcher;
@@ -27,7 +28,7 @@ impl Clone for BooleanQuery {
fn clone(&self) -> Self { fn clone(&self) -> Self {
self.subqueries self.subqueries
.iter() .iter()
.map(|(x, y)| (x.clone(), y.box_clone())) .map(|(occur, subquery)| (*occur, subquery.box_clone()))
.collect::<Vec<_>>() .collect::<Vec<_>>()
.into() .into()
} }
@@ -41,14 +42,20 @@ impl From<Vec<(Occur, Box<Query>)>> for BooleanQuery {
impl Query for BooleanQuery { impl Query for BooleanQuery {
fn weight(&self, searcher: &Searcher, scoring_enabled: bool) -> Result<Box<Weight>> { fn weight(&self, searcher: &Searcher, scoring_enabled: bool) -> Result<Box<Weight>> {
let sub_weights = self
    .subqueries
    .iter()
    .map(|&(ref occur, ref subquery)| {
        Ok((*occur, subquery.weight(searcher, scoring_enabled)?))
    }).collect::<Result<_>>()?;
Ok(Box::new(BooleanWeight::new(sub_weights, scoring_enabled))) Ok(Box::new(BooleanWeight::new(sub_weights, scoring_enabled)))
} }
fn query_terms(&self, term_set: &mut BTreeSet<Term>) {
for (_occur, subquery) in &self.subqueries {
subquery.query_terms(term_set);
}
}
} }
impl BooleanQuery { impl BooleanQuery {
@@ -61,8 +68,7 @@ impl BooleanQuery {
let term_query: Box<Query> = let term_query: Box<Query> =
Box::new(TermQuery::new(term, IndexRecordOption::WithFreqs)); Box::new(TermQuery::new(term, IndexRecordOption::WithFreqs));
(Occur::Should, term_query) (Occur::Should, term_query)
}).collect();
BooleanQuery::from(occur_term_queries) BooleanQuery::from(occur_term_queries)
} }


@@ -39,7 +39,7 @@ where
} }
let scorer: Box<Scorer> = Box::new(Union::<_, TScoreCombiner>::from(scorers)); let scorer: Box<Scorer> = Box::new(Union::<_, TScoreCombiner>::from(scorers));
return scorer; scorer
} }
pub struct BooleanWeight { pub struct BooleanWeight {


@@ -69,7 +69,7 @@ mod tests {
let query_parser = QueryParser::for_index(&index, vec![text_field]); let query_parser = QueryParser::for_index(&index, vec![text_field]);
let query = query_parser.parse_query("+a").unwrap(); let query = query_parser.parse_query("+a").unwrap();
let searcher = index.searcher(); let searcher = index.searcher();
let weight = query.weight(&*searcher, true).unwrap(); let weight = query.weight(&searcher, true).unwrap();
let scorer = weight.scorer(searcher.segment_reader(0u32)).unwrap(); let scorer = weight.scorer(searcher.segment_reader(0u32)).unwrap();
assert!(Downcast::<TermScorer>::is_type(&*scorer)); assert!(Downcast::<TermScorer>::is_type(&*scorer));
} }
@@ -81,13 +81,13 @@ mod tests {
let searcher = index.searcher(); let searcher = index.searcher();
{ {
let query = query_parser.parse_query("+a +b +c").unwrap(); let query = query_parser.parse_query("+a +b +c").unwrap();
let weight = query.weight(&*searcher, true).unwrap(); let weight = query.weight(&searcher, true).unwrap();
let scorer = weight.scorer(searcher.segment_reader(0u32)).unwrap(); let scorer = weight.scorer(searcher.segment_reader(0u32)).unwrap();
assert!(Downcast::<Intersection<TermScorer>>::is_type(&*scorer)); assert!(Downcast::<Intersection<TermScorer>>::is_type(&*scorer));
} }
{ {
let query = query_parser.parse_query("+a +(b c)").unwrap(); let query = query_parser.parse_query("+a +(b c)").unwrap();
let weight = query.weight(&*searcher, true).unwrap(); let weight = query.weight(&searcher, true).unwrap();
let scorer = weight.scorer(searcher.segment_reader(0u32)).unwrap(); let scorer = weight.scorer(searcher.segment_reader(0u32)).unwrap();
assert!(Downcast::<Intersection<Box<Scorer>>>::is_type(&*scorer)); assert!(Downcast::<Intersection<Box<Scorer>>>::is_type(&*scorer));
} }
@@ -100,7 +100,7 @@ mod tests {
let searcher = index.searcher(); let searcher = index.searcher();
{ {
let query = query_parser.parse_query("+a b").unwrap(); let query = query_parser.parse_query("+a b").unwrap();
let weight = query.weight(&*searcher, true).unwrap(); let weight = query.weight(&searcher, true).unwrap();
let scorer = weight.scorer(searcher.segment_reader(0u32)).unwrap(); let scorer = weight.scorer(searcher.segment_reader(0u32)).unwrap();
assert!(Downcast::< assert!(Downcast::<
RequiredOptionalScorer<Box<Scorer>, Box<Scorer>, SumWithCoordsCombiner>, RequiredOptionalScorer<Box<Scorer>, Box<Scorer>, SumWithCoordsCombiner>,
@@ -108,7 +108,7 @@ mod tests {
} }
{ {
let query = query_parser.parse_query("+a b").unwrap(); let query = query_parser.parse_query("+a b").unwrap();
let weight = query.weight(&*searcher, false).unwrap(); let weight = query.weight(&searcher, false).unwrap();
let scorer = weight.scorer(searcher.segment_reader(0u32)).unwrap(); let scorer = weight.scorer(searcher.segment_reader(0u32)).unwrap();
println!("{:?}", scorer.type_name()); println!("{:?}", scorer.type_name());
assert!(Downcast::<TermScorer>::is_type(&*scorer)); assert!(Downcast::<TermScorer>::is_type(&*scorer));


@@ -1,11 +1,11 @@
use super::Scorer;
use query::Query;
use query::Weight;
use DocId;
use DocSet;
use Result;
use Score;
use Searcher;
use SegmentReader;
/// `EmptyQuery` is a dummy `Query` in which no document matches. /// `EmptyQuery` is a dummy `Query` in which no document matches.


@@ -10,7 +10,7 @@ lazy_static! {
let mut lev_builder_cache = HashMap::new(); let mut lev_builder_cache = HashMap::new();
// TODO make population lazy on a `(distance, val)` basis // TODO make population lazy on a `(distance, val)` basis
for distance in 0..3 { for distance in 0..3 {
for &transposition in [false, true].iter() { for &transposition in &[false, true] {
let lev_automaton_builder = LevenshteinAutomatonBuilder::new(distance, transposition); let lev_automaton_builder = LevenshteinAutomatonBuilder::new(distance, transposition);
lev_builder_cache.insert((distance, transposition), lev_automaton_builder); lev_builder_cache.insert((distance, transposition), lev_automaton_builder);
} }
@@ -153,7 +153,7 @@ mod test {
let fuzzy_query = FuzzyTermQuery::new(term, 1, true); let fuzzy_query = FuzzyTermQuery::new(term, 1, true);
searcher.search(&fuzzy_query, &mut collector).unwrap(); searcher.search(&fuzzy_query, &mut collector).unwrap();
let scored_docs = collector.score_docs(); let scored_docs = collector.top_docs();
assert_eq!(scored_docs.len(), 1, "Expected only 1 document"); assert_eq!(scored_docs.len(), 1, "Expected only 1 document");
let (score, _) = scored_docs[0]; let (score, _) = scored_docs[0];
assert_nearly_equals(1f32, score); assert_nearly_equals(1f32, score);


@@ -26,10 +26,11 @@ pub fn intersect_scorers(mut scorers: Vec<Box<Scorer>>) -> Box<Scorer> {
(Some(single_docset), None) => single_docset, (Some(single_docset), None) => single_docset,
(Some(left), Some(right)) => { (Some(left), Some(right)) => {
{ {
let all_term_scorers = [&left, &right].into_iter().all(|scorer| {
    let scorer_ref: &Scorer = (*scorer).borrow();
    Downcast::<TermScorer>::is_type(scorer_ref)
});
if all_term_scorers {
let left = *Downcast::<TermScorer>::downcast(left).unwrap(); let left = *Downcast::<TermScorer>::downcast(left).unwrap();
let right = *Downcast::<TermScorer>::downcast(right).unwrap(); let right = *Downcast::<TermScorer>::downcast(right).unwrap();
return Box::new(Intersection { return Box::new(Intersection {
@@ -40,12 +41,12 @@ pub fn intersect_scorers(mut scorers: Vec<Box<Scorer>>) -> Box<Scorer> {
}); });
} }
} }
Box::new(Intersection {
    left,
    right,
    others: scorers,
    num_docsets,
})
} }
_ => { _ => {
unreachable!(); unreachable!();
@@ -99,7 +100,7 @@ impl<TDocSet: DocSet, TOtherDocSet: DocSet> Intersection<TDocSet, TOtherDocSet>
} }
impl<TDocSet: DocSet, TOtherDocSet: DocSet> DocSet for Intersection<TDocSet, TOtherDocSet> { impl<TDocSet: DocSet, TOtherDocSet: DocSet> DocSet for Intersection<TDocSet, TOtherDocSet> {
#[allow(never_loop)] #[cfg_attr(feature = "cargo-clippy", allow(clippy::never_loop))]
fn advance(&mut self) -> bool { fn advance(&mut self) -> bool {
let (left, right) = (&mut self.left, &mut self.right); let (left, right) = (&mut self.left, &mut self.right);


@@ -3,11 +3,11 @@ Query
*/ */
mod all_query; mod all_query;
mod automaton_weight;
mod bitset;
mod bm25;
mod boolean_query;
mod empty_query;
mod exclude; mod exclude;
mod fuzzy_query; mod fuzzy_query;
mod intersection; mod intersection;
@@ -16,7 +16,10 @@ mod phrase_query;
mod query; mod query;
mod query_parser; mod query_parser;
mod range_query; mod range_query;
#[cfg(feature="regex_query")]
mod regex_query; mod regex_query;
mod reqopt_scorer; mod reqopt_scorer;
mod scorer; mod scorer;
mod term_query; mod term_query;
@@ -27,7 +30,6 @@ mod weight;
mod vec_docset; mod vec_docset;
pub(crate) mod score_combiner; pub(crate) mod score_combiner;
pub use self::intersection::Intersection; pub use self::intersection::Intersection;
pub use self::union::Union; pub use self::union::Union;
@@ -35,10 +37,10 @@ pub use self::union::Union;
pub use self::vec_docset::VecDocSet; pub use self::vec_docset::VecDocSet;
pub use self::all_query::{AllQuery, AllScorer, AllWeight}; pub use self::all_query::{AllQuery, AllScorer, AllWeight};
pub use self::automaton_weight::AutomatonWeight; pub use self::automaton_weight::AutomatonWeight;
pub use self::bitset::BitSetDocSet; pub use self::bitset::BitSetDocSet;
pub use self::boolean_query::BooleanQuery; pub use self::boolean_query::BooleanQuery;
pub use self::empty_query::{EmptyQuery, EmptyScorer, EmptyWeight};
pub use self::exclude::Exclude; pub use self::exclude::Exclude;
pub use self::fuzzy_query::FuzzyTermQuery; pub use self::fuzzy_query::FuzzyTermQuery;
pub use self::intersection::intersect_scorers; pub use self::intersection::intersect_scorers;
@@ -48,9 +50,62 @@ pub use self::query::Query;
pub use self::query_parser::QueryParser; pub use self::query_parser::QueryParser;
pub use self::query_parser::QueryParserError; pub use self::query_parser::QueryParserError;
pub use self::range_query::RangeQuery; pub use self::range_query::RangeQuery;
#[cfg(feature="regex_query")]
pub use self::regex_query::RegexQuery; pub use self::regex_query::RegexQuery;
pub use self::reqopt_scorer::RequiredOptionalScorer; pub use self::reqopt_scorer::RequiredOptionalScorer;
pub use self::scorer::ConstScorer; pub use self::scorer::ConstScorer;
pub use self::scorer::Scorer; pub use self::scorer::Scorer;
pub use self::term_query::TermQuery; pub use self::term_query::TermQuery;
pub use self::weight::Weight; pub use self::weight::Weight;
#[cfg(test)]
mod tests {
use Index;
use schema::{SchemaBuilder, TEXT};
use query::QueryParser;
use Term;
use std::collections::BTreeSet;
#[test]
fn test_query_terms() {
let mut schema_builder = SchemaBuilder::default();
let text_field = schema_builder.add_text_field("text", TEXT);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let query_parser = QueryParser::for_index(&index, vec![text_field]);
let term_a = Term::from_field_text(text_field, "a");
let term_b = Term::from_field_text(text_field, "b");
{
let mut terms_set: BTreeSet<Term> = BTreeSet::new();
query_parser.parse_query("a").unwrap().query_terms(&mut terms_set);
let terms: Vec<&Term> = terms_set.iter().collect();
assert_eq!(vec![&term_a], terms);
}
{
let mut terms_set: BTreeSet<Term> = BTreeSet::new();
query_parser.parse_query("a b").unwrap().query_terms(&mut terms_set);
let terms: Vec<&Term> = terms_set.iter().collect();
assert_eq!(vec![&term_a, &term_b], terms);
}
{
let mut terms_set: BTreeSet<Term> = BTreeSet::new();
query_parser.parse_query("\"a b\"").unwrap().query_terms(&mut terms_set);
let terms: Vec<&Term> = terms_set.iter().collect();
assert_eq!(vec![&term_a, &term_b], terms);
}
{
let mut terms_set: BTreeSet<Term> = BTreeSet::new();
query_parser.parse_query("a a a a a").unwrap().query_terms(&mut terms_set);
let terms: Vec<&Term> = terms_set.iter().collect();
assert_eq!(vec![&term_a], terms);
}
{
let mut terms_set: BTreeSet<Term> = BTreeSet::new();
query_parser.parse_query("a -b").unwrap().query_terms(&mut terms_set);
let terms: Vec<&Term> = terms_set.iter().collect();
assert_eq!(vec![&term_a, &term_b], terms);
}
}
}

View File

@@ -12,3 +12,38 @@ pub enum Occur {
/// search. /// search.
MustNot, MustNot,
} }
impl Occur {
/// Returns the one-char prefix symbol for this `Occur`:
/// - `Should` => '?'
/// - `Must` => '+'
/// - `MustNot` => '-'
pub fn to_char(self) -> char {
match self {
Occur::Should => '?',
Occur::Must => '+',
Occur::MustNot => '-',
}
}
}
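A quick illustration of the prefix characters, as a minimal sketch against the public `tantivy::query::Occur` re-export (the re-export path is assumed here, it is not part of this diff):

use tantivy::query::Occur;

fn main() {
    // '+' marks a required clause, '-' an excluded one, '?' an optional one.
    assert_eq!(Occur::Must.to_char(), '+');
    assert_eq!(Occur::MustNot.to_char(), '-');
    assert_eq!(Occur::Should.to_char(), '?');
}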
/// Compose two occur values.
pub fn compose_occur(left: Occur, right: Occur) -> Occur {
match left {
Occur::Should => right,
Occur::Must => {
if right == Occur::MustNot {
Occur::MustNot
} else {
Occur::Must
}
}
Occur::MustNot => {
if right == Occur::MustNot {
Occur::Must
} else {
Occur::MustNot
}
}
}
}
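A minimal sketch of the composition rules, written as a hypothetical test that would sit next to `compose_occur` in this same file (illustration only, not part of the diff):

#[cfg(test)]
mod compose_tests {
    use super::{compose_occur, Occur};

    #[test]
    fn test_compose_occur() {
        // `Should` is neutral: the right-hand occur wins.
        assert!(compose_occur(Occur::Should, Occur::Must) == Occur::Must);
        // `Must` stays required unless the inner clause is negated.
        assert!(compose_occur(Occur::Must, Occur::Should) == Occur::Must);
        assert!(compose_occur(Occur::Must, Occur::MustNot) == Occur::MustNot);
        // Two negations cancel out: `-(-a)` behaves like `+a`.
        assert!(compose_occur(Occur::MustNot, Occur::MustNot) == Occur::Must);
        assert!(compose_occur(Occur::MustNot, Occur::Should) == Occur::MustNot);
    }
}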

View File

@@ -5,6 +5,7 @@ use query::bm25::BM25Weight;
use query::Query; use query::Query;
use query::Weight; use query::Weight;
use schema::{Field, Term}; use schema::{Field, Term};
use std::collections::BTreeSet;
use Result; use Result;
/// `PhraseQuery` matches a specific sequence of words. /// `PhraseQuery` matches a specific sequence of words.
@@ -107,4 +108,10 @@ impl Query for PhraseQuery {
))) )))
} }
} }
fn query_terms(&self, term_set: &mut BTreeSet<Term>) {
for (_, query_term) in &self.phrase_terms {
term_set.insert(query_term.clone());
}
}
} }

View File

@@ -124,7 +124,8 @@ impl<TPostings: Postings> PhraseScorer<TPostings> {
fieldnorm_reader: FieldNormReader, fieldnorm_reader: FieldNormReader,
score_needed: bool, score_needed: bool,
) -> PhraseScorer<TPostings> { ) -> PhraseScorer<TPostings> {
let max_offset = term_postings.iter() let max_offset = term_postings
.iter()
.map(|&(offset, _)| offset) .map(|&(offset, _)| offset)
.max() .max()
.unwrap_or(0); .unwrap_or(0);
@@ -133,8 +134,7 @@ impl<TPostings: Postings> PhraseScorer<TPostings> {
.into_iter() .into_iter()
.map(|(offset, postings)| { .map(|(offset, postings)| {
PostingsWithOffset::new(postings, (max_offset - offset) as u32) PostingsWithOffset::new(postings, (max_offset - offset) as u32)
}) }).collect::<Vec<_>>();
.collect::<Vec<_>>();
PhraseScorer { PhraseScorer {
intersection_docset: Intersection::new(postings_with_offsets), intersection_docset: Intersection::new(postings_with_offsets),
num_docsets, num_docsets,

View File

@@ -2,9 +2,11 @@ use super::Weight;
use collector::Collector; use collector::Collector;
use core::searcher::Searcher; use core::searcher::Searcher;
use downcast; use downcast;
use std::collections::BTreeSet;
use std::fmt; use std::fmt;
use Result; use Result;
use SegmentLocalId; use SegmentLocalId;
use Term;
/// The `Query` trait defines a set of documents and a scoring method /// The `Query` trait defines a set of documents and a scoring method
/// for those documents. /// for those documents.
@@ -58,6 +60,10 @@ pub trait Query: QueryClone + downcast::Any + fmt::Debug {
Ok(result) Ok(result)
} }
/// Extract all of the terms associated with the query and insert them into the
/// term set given as an argument.
fn query_terms(&self, _term_set: &mut BTreeSet<Term>) {}
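As an illustration, a `TermQuery` reports its single term through this hook. A minimal sketch, assuming the usual `tantivy::query` and `tantivy::schema` re-exports; the field and term names are made up:

use std::collections::BTreeSet;
use tantivy::query::{Query, TermQuery};
use tantivy::schema::{IndexRecordOption, SchemaBuilder, Term, TEXT};

fn main() {
    let mut schema_builder = SchemaBuilder::default();
    let text_field = schema_builder.add_text_field("text", TEXT);
    let _schema = schema_builder.build();

    let term = Term::from_field_text(text_field, "diary");
    let query = TermQuery::new(term.clone(), IndexRecordOption::WithFreqs);

    // The default implementation adds nothing; TermQuery overrides it.
    let mut terms: BTreeSet<Term> = BTreeSet::new();
    query.query_terms(&mut terms);
    assert!(terms.contains(&term));
}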
/// Search works as follows : /// Search works as follows :
/// ///
/// First the weight object associated to the query is created. /// First the weight object associated to the query is created.

View File

@@ -1,6 +1,12 @@
#![cfg_attr(feature = "cargo-clippy", allow(clippy::unneeded_field_pattern))]
#![cfg_attr(feature = "cargo-clippy", allow(clippy::toplevel_ref_arg))]
use super::user_input_ast::*; use super::user_input_ast::*;
use combine::char::*; use combine::char::*;
use combine::error::StreamError;
use combine::stream::StreamErrorFor;
use combine::*; use combine::*;
use query::occur::Occur;
use query::query_parser::user_input_ast::UserInputBound; use query::query_parser::user_input_ast::UserInputBound;
parser! { parser! {
@@ -17,18 +23,25 @@ parser! {
fn word[I]()(I) -> String fn word[I]()(I) -> String
where [I: Stream<Item = char>] { where [I: Stream<Item = char>] {
many1(satisfy(|c: char| c.is_alphanumeric())) many1(satisfy(|c: char| c.is_alphanumeric()))
.and_then(|s: String| {
match s.as_str() {
"OR" => Err(StreamErrorFor::<I>::unexpected_static_message("OR")),
"AND" => Err(StreamErrorFor::<I>::unexpected_static_message("AND")),
"NOT" => Err(StreamErrorFor::<I>::unexpected_static_message("NOT")),
_ => Ok(s)
}
})
} }
} }
parser! { parser! {
fn literal[I]()(I) -> UserInputAST fn literal[I]()(I) -> UserInputLeaf
where [I: Stream<Item = char>] where [I: Stream<Item = char>]
{ {
let term_val = || { let term_val = || {
let phrase = (char('"'), many1(satisfy(|c| c != '"')), char('"')).map(|(_, s, _)| s); let phrase = (char('"'), many1(satisfy(|c| c != '"')), char('"')).map(|(_, s, _)| s);
phrase.or(word()) phrase.or(word())
}; };
let term_val_with_field = negative_number().or(term_val()); let term_val_with_field = negative_number().or(term_val());
let term_query = let term_query =
(field(), char(':'), term_val_with_field).map(|(field_name, _, phrase)| UserInputLiteral { (field(), char(':'), term_val_with_field).map(|(field_name, _, phrase)| UserInputLiteral {
@@ -41,7 +54,7 @@ parser! {
}); });
try(term_query) try(term_query)
.or(term_default_field) .or(term_default_field)
.map(UserInputAST::from) .map(UserInputLeaf::from)
} }
} }
@@ -55,7 +68,14 @@ parser! {
} }
parser! { parser! {
fn range[I]()(I) -> UserInputAST fn spaces1[I]()(I) -> ()
where [I: Stream<Item = char>] {
skip_many1(space())
}
}
parser! {
fn range[I]()(I) -> UserInputLeaf
where [I: Stream<Item = char>] { where [I: Stream<Item = char>] {
let term_val = || { let term_val = || {
word().or(negative_number()).or(char('*').map(|_| "*".to_string())) word().or(negative_number()).or(char('*').map(|_| "*".to_string()))
@@ -77,7 +97,7 @@ parser! {
string("TO"), string("TO"),
spaces(), spaces(),
upper_bound, upper_bound,
).map(|(field, lower, _, _, _, upper)| UserInputAST::Range { ).map(|(field, lower, _, _, _, upper)| UserInputLeaf::Range {
field, field,
lower, lower,
upper upper
@@ -88,13 +108,50 @@ parser! {
parser! { parser! {
fn leaf[I]()(I) -> UserInputAST fn leaf[I]()(I) -> UserInputAST
where [I: Stream<Item = char>] { where [I: Stream<Item = char>] {
(char('-'), leaf()) (char('-'), leaf()).map(|(_, expr)| expr.unary(Occur::MustNot) )
.map(|(_, expr)| UserInputAST::Not(Box::new(expr))) .or((char('+'), leaf()).map(|(_, expr)| expr.unary(Occur::Must) ))
.or((char('+'), leaf()).map(|(_, expr)| UserInputAST::Must(Box::new(expr))))
.or((char('('), parse_to_ast(), char(')')).map(|(_, expr, _)| expr)) .or((char('('), parse_to_ast(), char(')')).map(|(_, expr, _)| expr))
.or(char('*').map(|_| UserInputAST::All)) .or(char('*').map(|_| UserInputAST::from(UserInputLeaf::All) ))
.or(try(range())) .or(try(
.or(literal()) (string("NOT"), spaces1(), leaf()).map(|(_, _, expr)| expr.unary(Occur::MustNot))
)
)
.or(try(
range().map(UserInputAST::from)
)
)
.or(literal().map(|leaf| UserInputAST::Leaf(Box::new(leaf))))
}
}
enum BinaryOperand {
Or,
And,
}
parser! {
fn binary_operand[I]()(I) -> BinaryOperand
where [I: Stream<Item = char>] {
(spaces1(),
(
string("AND").map(|_| BinaryOperand::And)
.or(string("OR").map(|_| BinaryOperand::Or))
),
spaces1()).map(|(_, op,_)| op)
}
}
enum Element {
SingleEl(UserInputAST),
NormalDisjunctive(Vec<Vec<UserInputAST>>),
}
impl Element {
pub fn into_dnf(self) -> Vec<Vec<UserInputAST>> {
match self {
Element::NormalDisjunctive(conjunctions) => conjunctions,
Element::SingleEl(el) => vec![vec![el]],
}
} }
} }
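The `Vec<Vec<UserInputAST>>` built here is a disjunctive normal form: the outer `Vec` is OR-ed together and each inner `Vec` is AND-ed. A standalone sketch of that shape, with plain strings standing in for the AST nodes (hypothetical, for illustration only):

fn main() {
    // "a OR b AND c" folds left-to-right into two conjunctions: [a] OR [b, c].
    let dnf: Vec<Vec<&str>> = vec![vec!["a"], vec!["b", "c"]];
    let rendered = dnf
        .iter()
        .map(|conjunction| conjunction.join(" AND "))
        .collect::<Vec<_>>()
        .join(" OR ");
    assert_eq!(rendered, "a OR b AND c");
}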
@@ -102,14 +159,56 @@ parser! {
pub fn parse_to_ast[I]()(I) -> UserInputAST pub fn parse_to_ast[I]()(I) -> UserInputAST
where [I: Stream<Item = char>] where [I: Stream<Item = char>]
{ {
sep_by(leaf(), spaces()) (
.map(|subqueries: Vec<UserInputAST>| { try(
if subqueries.len() == 1 { chainl1(
subqueries.into_iter().next().unwrap() leaf().map(Element::SingleEl),
} else { binary_operand().map(|op: BinaryOperand|
UserInputAST::Clause(subqueries.into_iter().map(Box::new).collect()) move |left: Element, right: Element| {
} let mut dnf = left.into_dnf();
}) if let Element::SingleEl(el) = right {
match op {
BinaryOperand::And => {
if let Some(last) = dnf.last_mut() {
last.push(el);
}
}
BinaryOperand::Or => {
dnf.push(vec!(el));
}
}
} else {
unreachable!("Please report.")
}
Element::NormalDisjunctive(dnf)
}
)
)
.map(|el| el.into_dnf())
.map(|fnd| {
if fnd.len() == 1 {
UserInputAST::and(fnd.into_iter().next().unwrap()) //< safe
} else {
let conjunctions = fnd
.into_iter()
.map(UserInputAST::and)
.collect();
UserInputAST::or(conjunctions)
}
})
)
.or(
sep_by(leaf(), spaces())
.map(|subqueries: Vec<UserInputAST>| {
if subqueries.len() == 1 {
subqueries.into_iter().next().unwrap()
} else {
UserInputAST::Clause(subqueries.into_iter().collect())
}
})
)
)
} }
} }
@@ -128,6 +227,40 @@ mod test {
assert!(parse_to_ast().parse(query).is_err()); assert!(parse_to_ast().parse(query).is_err());
} }
#[test]
fn test_parse_query_to_ast_not_op() {
assert_eq!(
format!("{:?}", parse_to_ast().parse("NOT")),
"Err(UnexpectedParse)"
);
test_parse_query_to_ast_helper("NOTa", "\"NOTa\"");
test_parse_query_to_ast_helper("NOT a", "-(\"a\")");
}
#[test]
fn test_parse_query_to_ast_binary_op() {
test_parse_query_to_ast_helper("a AND b", "(+(\"a\") +(\"b\"))");
test_parse_query_to_ast_helper("a OR b", "(?(\"a\") ?(\"b\"))");
test_parse_query_to_ast_helper("a OR b AND c", "(?(\"a\") ?((+(\"b\") +(\"c\"))))");
test_parse_query_to_ast_helper("a AND b AND c", "(+(\"a\") +(\"b\") +(\"c\"))");
assert_eq!(
format!("{:?}", parse_to_ast().parse("a OR b aaa")),
"Err(UnexpectedParse)"
);
assert_eq!(
format!("{:?}", parse_to_ast().parse("a AND b aaa")),
"Err(UnexpectedParse)"
);
assert_eq!(
format!("{:?}", parse_to_ast().parse("aaa a OR b ")),
"Err(UnexpectedParse)"
);
assert_eq!(
format!("{:?}", parse_to_ast().parse("aaa ccc a OR b ")),
"Err(UnexpectedParse)"
);
}
#[test] #[test]
fn test_parse_query_to_ast() { fn test_parse_query_to_ast() {
test_parse_query_to_ast_helper("+(a b) +d", "(+((\"a\" \"b\")) +(\"d\"))"); test_parse_query_to_ast_helper("+(a b) +d", "(+((\"a\" \"b\")) +(\"d\"))");

View File

@@ -1,9 +1,13 @@
use super::logical_ast::*; use super::logical_ast::*;
use super::query_grammar::parse_to_ast; use super::query_grammar::parse_to_ast;
use super::user_input_ast::*; use super::user_input_ast::*;
use combine::Parser;
use core::Index; use core::Index;
use query::occur::compose_occur;
use query::query_parser::logical_ast::LogicalAST;
use query::AllQuery; use query::AllQuery;
use query::BooleanQuery; use query::BooleanQuery;
use query::EmptyQuery;
use query::Occur; use query::Occur;
use query::PhraseQuery; use query::PhraseQuery;
use query::Query; use query::Query;
@@ -17,9 +21,6 @@ use std::num::ParseIntError;
use std::ops::Bound; use std::ops::Bound;
use std::str::FromStr; use std::str::FromStr;
use tokenizer::TokenizerManager; use tokenizer::TokenizerManager;
use combine::Parser;
use query::EmptyQuery;
/// Possible error that may happen when parsing a query. /// Possible error that may happen when parsing a query.
#[derive(Debug, PartialEq, Eq)] #[derive(Debug, PartialEq, Eq)]
@@ -57,6 +58,27 @@ impl From<ParseIntError> for QueryParserError {
} }
} }
/// Recursively removes empty clauses from the AST.
///
/// Returns `None` iff the `logical_ast` ends up empty.
fn trim_ast(logical_ast: LogicalAST) -> Option<LogicalAST> {
match logical_ast {
LogicalAST::Clause(children) => {
let trimmed_children = children
.into_iter()
.flat_map(|(occur, child)| {
trim_ast(child).map(|trimmed_child| (occur, trimmed_child))
}).collect::<Vec<_>>();
if trimmed_children.is_empty() {
None
} else {
Some(LogicalAST::Clause(trimmed_children))
}
}
_ => Some(logical_ast),
}
}
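A standalone sketch of the same trimming idea on a simplified tree; the `Ast` type below is a hypothetical stand-in for `LogicalAST`, not the real type:

#[derive(Debug, PartialEq)]
enum Ast {
    Clause(Vec<Ast>),
    Leaf(&'static str),
}

fn trim(ast: Ast) -> Option<Ast> {
    match ast {
        Ast::Clause(children) => {
            // Trim every child, dropping the ones that vanish entirely.
            let trimmed: Vec<Ast> = children.into_iter().filter_map(trim).collect();
            if trimmed.is_empty() {
                None
            } else {
                Some(Ast::Clause(trimmed))
            }
        }
        leaf => Some(leaf),
    }
}

fn main() {
    // An empty inner clause disappears; the surviving leaf is kept.
    let ast = Ast::Clause(vec![Ast::Clause(vec![]), Ast::Leaf("happy")]);
    assert_eq!(trim(ast), Some(Ast::Clause(vec![Ast::Leaf("happy")])));
    // A tree made only of empty clauses trims to `None`.
    assert_eq!(trim(Ast::Clause(vec![Ast::Clause(vec![])])), None);
}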
/// Tantivy's Query parser /// Tantivy's Query parser
/// ///
/// The language covered by the current parser is extremely simple. /// The language covered by the current parser is extremely simple.
@@ -79,12 +101,22 @@ impl From<ParseIntError> for QueryParserError {
/// ///
/// Switching to a default of `AND` can be done by calling `.set_conjunction_by_default()`. /// Switching to a default of `AND` can be done by calling `.set_conjunction_by_default()`.
/// ///
///
/// * boolean operators `AND`, `OR`. `AND` takes precedence over `OR`, so that `a AND b OR c` is interpreted
/// as `(a AND b) OR c`.
///
/// * In addition to the boolean operators, the `-` and `+` prefix operators can be used. These operators
/// are sufficient to express all queries built from boolean operators. For instance `x AND y OR z` can
/// be written (`(+x +y) z`). In addition, these operators can help define "required optional"
/// queries: `(+x y)` matches the same document set as simply `x`, but `y` helps refine the score
/// (see the usage sketch below).
///
/// * negative terms: By prepending a term by a `-`, a term can be excluded /// * negative terms: By prepending a term by a `-`, a term can be excluded
/// from the search. This is useful for disambiguating a query. /// from the search. This is useful for disambiguating a query.
/// e.g. `apple -fruit` /// e.g. `apple -fruit`
/// ///
/// * must terms: By prepending a term by a `+`, a term can be made required for the search. /// * must terms: By prepending a term by a `+`, a term can be made required for the search.
/// ///
///
/// * phrase terms: Quoted terms become phrase searches on fields that have positions indexed. /// * phrase terms: Quoted terms become phrase searches on fields that have positions indexed.
/// e.g., `title:"Barack Obama"` will only find documents that have "barack" immediately followed /// e.g., `title:"Barack Obama"` will only find documents that have "barack" immediately followed
/// by "obama". /// by "obama".
@@ -155,8 +187,9 @@ impl QueryParser {
/// Parse the user query into an AST. /// Parse the user query into an AST.
fn parse_query_to_logical_ast(&self, query: &str) -> Result<LogicalAST, QueryParserError> { fn parse_query_to_logical_ast(&self, query: &str) -> Result<LogicalAST, QueryParserError> {
let (user_input_ast, _remaining) = let (user_input_ast, _remaining) = parse_to_ast()
parse_to_ast().parse(query).map_err(|_| QueryParserError::SyntaxError)?; .parse(query)
.map_err(|_| QueryParserError::SyntaxError)?;
self.compute_logical_ast(user_input_ast) self.compute_logical_ast(user_input_ast)
} }
@@ -201,14 +234,15 @@ impl QueryParser {
} }
FieldType::Str(ref str_options) => { FieldType::Str(ref str_options) => {
if let Some(option) = str_options.get_indexing_options() { if let Some(option) = str_options.get_indexing_options() {
let mut tokenizer = self.tokenizer_manager.get(option.tokenizer()).ok_or_else( let mut tokenizer =
|| { self.tokenizer_manager
QueryParserError::UnknownTokenizer( .get(option.tokenizer())
field_entry.name().to_string(), .ok_or_else(|| {
option.tokenizer().to_string(), QueryParserError::UnknownTokenizer(
) field_entry.name().to_string(),
}, option.tokenizer().to_string(),
)?; )
})?;
let mut terms: Vec<(usize, Term)> = Vec::new(); let mut terms: Vec<(usize, Term)> = Vec::new();
let mut token_stream = tokenizer.token_stream(phrase); let mut token_stream = tokenizer.token_stream(phrase);
token_stream.process(&mut |token| { token_stream.process(&mut |token| {
@@ -258,12 +292,9 @@ impl QueryParser {
) -> Result<Option<LogicalLiteral>, QueryParserError> { ) -> Result<Option<LogicalLiteral>, QueryParserError> {
let terms = self.compute_terms_for_string(field, phrase)?; let terms = self.compute_terms_for_string(field, phrase)?;
match &terms[..] { match &terms[..] {
[] => [] => Ok(None),
Ok(None), [(_, term)] => Ok(Some(LogicalLiteral::Term(term.clone()))),
[(_, term)] => _ => Ok(Some(LogicalLiteral::Phrase(terms.clone()))),
Ok(Some(LogicalLiteral::Term(term.clone()))),
_ =>
Ok(Some(LogicalLiteral::Phrase(terms.clone()))),
} }
} }
@@ -275,7 +306,11 @@ impl QueryParser {
} }
} }
fn resolve_bound(&self, field: Field, bound: &UserInputBound) -> Result<Bound<Term>, QueryParserError> { fn resolve_bound(
&self,
field: Field,
bound: &UserInputBound,
) -> Result<Bound<Term>, QueryParserError> {
if bound.term_str() == "*" { if bound.term_str() == "*" {
return Ok(Bound::Unbounded); return Ok(Bound::Unbounded);
} }
@@ -315,56 +350,30 @@ impl QueryParser {
let default_occur = self.default_occur(); let default_occur = self.default_occur();
let mut logical_sub_queries: Vec<(Occur, LogicalAST)> = Vec::new(); let mut logical_sub_queries: Vec<(Occur, LogicalAST)> = Vec::new();
for sub_query in sub_queries { for sub_query in sub_queries {
let (occur, sub_ast) = self.compute_logical_ast_with_occur(*sub_query)?; let (occur, sub_ast) = self.compute_logical_ast_with_occur(sub_query)?;
let new_occur = compose_occur(default_occur, occur); let new_occur = compose_occur(default_occur, occur);
logical_sub_queries.push((new_occur, sub_ast)); logical_sub_queries.push((new_occur, sub_ast));
} }
Ok((Occur::Should, LogicalAST::Clause(logical_sub_queries))) Ok((Occur::Should, LogicalAST::Clause(logical_sub_queries)))
} }
UserInputAST::Not(subquery) => { UserInputAST::Unary(left_occur, subquery) => {
let (occur, logical_sub_queries) = self.compute_logical_ast_with_occur(*subquery)?; let (right_occur, logical_sub_queries) =
Ok((compose_occur(Occur::MustNot, occur), logical_sub_queries)) self.compute_logical_ast_with_occur(*subquery)?;
Ok((compose_occur(left_occur, right_occur), logical_sub_queries))
} }
UserInputAST::Must(subquery) => { UserInputAST::Leaf(leaf) => {
let (occur, logical_sub_queries) = self.compute_logical_ast_with_occur(*subquery)?; let result_ast = self.compute_logical_ast_from_leaf(*leaf)?;
Ok((compose_occur(Occur::Must, occur), logical_sub_queries))
}
UserInputAST::Range {
field,
lower,
upper,
} => {
let fields = self.resolved_fields(&field)?;
let mut clauses = fields
.iter()
.map(|&field| {
let field_entry = self.schema.get_field_entry(field);
let value_type = field_entry.field_type().value_type();
Ok(LogicalAST::Leaf(Box::new(LogicalLiteral::Range {
field,
value_type,
lower: self.resolve_bound(field, &lower)?,
upper: self.resolve_bound(field, &upper)?,
})))
})
.collect::<Result<Vec<_>, QueryParserError>>()?;
let result_ast = if clauses.len() == 1 {
clauses.pop().unwrap()
} else {
LogicalAST::Clause(
clauses
.into_iter()
.map(|clause| (Occur::Should, clause))
.collect(),
)
};
Ok((Occur::Should, result_ast)) Ok((Occur::Should, result_ast))
} }
UserInputAST::All => Ok(( }
Occur::Should, }
LogicalAST::Leaf(Box::new(LogicalLiteral::All)),
)), fn compute_logical_ast_from_leaf(
UserInputAST::Leaf(literal) => { &self,
leaf: UserInputLeaf,
) -> Result<LogicalAST, QueryParserError> {
match leaf {
UserInputLeaf::Literal(literal) => {
let term_phrases: Vec<(Field, String)> = match literal.field_name { let term_phrases: Vec<(Field, String)> = match literal.field_name {
Some(ref field_name) => { Some(ref field_name) => {
let field = self.resolve_field_name(field_name)?; let field = self.resolve_field_name(field_name)?;
@@ -387,36 +396,43 @@ impl QueryParser {
asts.push(LogicalAST::Leaf(Box::new(ast))); asts.push(LogicalAST::Leaf(Box::new(ast)));
} }
} }
let result_ast = if asts.is_empty() { let result_ast: LogicalAST = if asts.len() == 1 {
// this should never happen asts.into_iter().next().unwrap()
return Err(QueryParserError::SyntaxError);
} else if asts.len() == 1 {
asts[0].clone()
} else { } else {
LogicalAST::Clause(asts.into_iter().map(|ast| (Occur::Should, ast)).collect()) LogicalAST::Clause(asts.into_iter().map(|ast| (Occur::Should, ast)).collect())
}; };
Ok((Occur::Should, result_ast)) Ok(result_ast)
} }
} UserInputLeaf::All => Ok(LogicalAST::Leaf(Box::new(LogicalLiteral::All))),
} UserInputLeaf::Range {
} field,
lower,
/// Compose two occur values. upper,
fn compose_occur(left: Occur, right: Occur) -> Occur { } => {
match left { let fields = self.resolved_fields(&field)?;
Occur::Should => right, let mut clauses = fields
Occur::Must => { .iter()
if right == Occur::MustNot { .map(|&field| {
Occur::MustNot let field_entry = self.schema.get_field_entry(field);
} else { let value_type = field_entry.field_type().value_type();
Occur::Must Ok(LogicalAST::Leaf(Box::new(LogicalLiteral::Range {
} field,
} value_type,
Occur::MustNot => { lower: self.resolve_bound(field, &lower)?,
if right == Occur::MustNot { upper: self.resolve_bound(field, &upper)?,
Occur::Must })))
} else { }).collect::<Result<Vec<_>, QueryParserError>>()?;
Occur::MustNot let result_ast = if clauses.len() == 1 {
clauses.pop().unwrap()
} else {
LogicalAST::Clause(
clauses
.into_iter()
.map(|clause| (Occur::Should, clause))
.collect(),
)
};
Ok(result_ast)
} }
} }
} }
@@ -425,31 +441,38 @@ fn compose_occur(left: Occur, right: Occur) -> Occur {
fn convert_literal_to_query(logical_literal: LogicalLiteral) -> Box<Query> { fn convert_literal_to_query(logical_literal: LogicalLiteral) -> Box<Query> {
match logical_literal { match logical_literal {
LogicalLiteral::Term(term) => Box::new(TermQuery::new(term, IndexRecordOption::WithFreqs)), LogicalLiteral::Term(term) => Box::new(TermQuery::new(term, IndexRecordOption::WithFreqs)),
LogicalLiteral::Phrase(term_with_offsets) => Box::new(PhraseQuery::new_with_offset(term_with_offsets)), LogicalLiteral::Phrase(term_with_offsets) => {
Box::new(PhraseQuery::new_with_offset(term_with_offsets))
}
LogicalLiteral::Range { LogicalLiteral::Range {
field, field,
value_type, value_type,
lower, lower,
upper, upper,
} => Box::new(RangeQuery::new_term_bounds(field, value_type, lower, upper)), } => Box::new(RangeQuery::new_term_bounds(
field, value_type, &lower, &upper,
)),
LogicalLiteral::All => Box::new(AllQuery), LogicalLiteral::All => Box::new(AllQuery),
} }
} }
fn convert_to_query(logical_ast: LogicalAST) -> Box<Query> { fn convert_to_query(logical_ast: LogicalAST) -> Box<Query> {
match logical_ast { match trim_ast(logical_ast) {
LogicalAST::Clause(clause) => { Some(LogicalAST::Clause(trimmed_clause)) => {
if clause.is_empty() { let occur_subqueries = trimmed_clause
Box::new(EmptyQuery) .into_iter()
} else { .map(|(occur, subquery)| (occur, convert_to_query(subquery)))
let occur_subqueries = clause .collect::<Vec<_>>();
.into_iter() assert!(
.map(|(occur, subquery)| (occur, convert_to_query(subquery))) !occur_subqueries.is_empty(),
.collect::<Vec<_>>(); "Should not be empty after trimming"
Box::new(BooleanQuery::from(occur_subqueries)) );
} Box::new(BooleanQuery::from(occur_subqueries))
} }
LogicalAST::Leaf(logical_literal) => convert_literal_to_query(*logical_literal), Some(LogicalAST::Leaf(trimmed_logical_literal)) => {
convert_literal_to_query(*trimmed_logical_literal)
}
None => Box::new(EmptyQuery),
} }
} }
@@ -462,12 +485,17 @@ mod test {
use schema::Field; use schema::Field;
use schema::{IndexRecordOption, TextFieldIndexing, TextOptions}; use schema::{IndexRecordOption, TextFieldIndexing, TextOptions};
use schema::{SchemaBuilder, Term, INT_INDEXED, STORED, STRING, TEXT}; use schema::{SchemaBuilder, Term, INT_INDEXED, STORED, STRING, TEXT};
use tokenizer::SimpleTokenizer; use tokenizer::{LowerCaser, SimpleTokenizer, StopWordFilter, Tokenizer, TokenizerManager};
use tokenizer::TokenizerManager;
use Index; use Index;
fn make_query_parser() -> QueryParser { fn make_query_parser() -> QueryParser {
let mut schema_builder = SchemaBuilder::default(); let mut schema_builder = SchemaBuilder::default();
let text_field_indexing = TextFieldIndexing::default()
.set_tokenizer("en_with_stop_words")
.set_index_option(IndexRecordOption::WithFreqsAndPositions);
let text_options = TextOptions::default()
.set_indexing_options(text_field_indexing)
.set_stored();
let title = schema_builder.add_text_field("title", TEXT); let title = schema_builder.add_text_field("title", TEXT);
let text = schema_builder.add_text_field("text", TEXT); let text = schema_builder.add_text_field("text", TEXT);
schema_builder.add_i64_field("signed", INT_INDEXED); schema_builder.add_i64_field("signed", INT_INDEXED);
@@ -476,9 +504,16 @@ mod test {
schema_builder.add_text_field("notindexed_u64", STORED); schema_builder.add_text_field("notindexed_u64", STORED);
schema_builder.add_text_field("notindexed_i64", STORED); schema_builder.add_text_field("notindexed_i64", STORED);
schema_builder.add_text_field("nottokenized", STRING); schema_builder.add_text_field("nottokenized", STRING);
schema_builder.add_text_field("with_stop_words", text_options);
let schema = schema_builder.build(); let schema = schema_builder.build();
let default_fields = vec![title, text]; let default_fields = vec![title, text];
let tokenizer_manager = TokenizerManager::default(); let tokenizer_manager = TokenizerManager::default();
tokenizer_manager.register(
"en_with_stop_words",
SimpleTokenizer
.filter(LowerCaser)
.filter(StopWordFilter::remove(vec!["the".to_string()])),
);
QueryParser::new(schema, default_fields, tokenizer_manager) QueryParser::new(schema, default_fields, tokenizer_manager)
} }
@@ -548,16 +583,8 @@ mod test {
#[test] #[test]
pub fn test_parse_query_empty() { pub fn test_parse_query_empty() {
test_parse_query_to_logical_ast_helper( test_parse_query_to_logical_ast_helper("", "<emptyclause>", false);
"", test_parse_query_to_logical_ast_helper(" ", "<emptyclause>", false);
"<emptyclause>",
false,
);
test_parse_query_to_logical_ast_helper(
" ",
"<emptyclause>",
false,
);
let query_parser = make_query_parser(); let query_parser = make_query_parser();
let query_result = query_parser.parse_query(""); let query_result = query_parser.parse_query("");
let query = query_result.unwrap(); let query = query_result.unwrap();
@@ -670,11 +697,7 @@ mod test {
"(Excluded(Term([0, 0, 0, 0, 116, 105, 116, 105])) TO Unbounded)", "(Excluded(Term([0, 0, 0, 0, 116, 105, 116, 105])) TO Unbounded)",
false, false,
); );
test_parse_query_to_logical_ast_helper( test_parse_query_to_logical_ast_helper("*", "*", false);
"*",
"*",
false,
);
} }
#[test] #[test]
@@ -747,6 +770,13 @@ mod test {
); );
} }
#[test]
pub fn test_query_parser_not_empty_but_no_tokens() {
let query_parser = make_query_parser();
assert!(query_parser.parse_query(" !, ").is_ok());
assert!(query_parser.parse_query("with_stop_words:the").is_ok());
}
#[test] #[test]
pub fn test_parse_query_to_ast_conjunction() { pub fn test_parse_query_to_ast_conjunction() {
test_parse_query_to_logical_ast_helper( test_parse_query_to_logical_ast_helper(

View File

@@ -1,4 +1,39 @@
use std::fmt; use std::fmt;
use std::fmt::{Debug, Formatter};
use query::Occur;
pub enum UserInputLeaf {
Literal(UserInputLiteral),
All,
Range {
field: Option<String>,
lower: UserInputBound,
upper: UserInputBound,
},
}
impl Debug for UserInputLeaf {
fn fmt(&self, formatter: &mut Formatter) -> Result<(), fmt::Error> {
match self {
UserInputLeaf::Literal(literal) => literal.fmt(formatter),
UserInputLeaf::Range {
ref field,
ref lower,
ref upper,
} => {
if let Some(ref field) = field {
write!(formatter, "{}:", field)?;
}
lower.display_lower(formatter)?;
write!(formatter, " TO ")?;
upper.display_upper(formatter)?;
Ok(())
}
UserInputLeaf::All => write!(formatter, "*"),
}
}
}
pub struct UserInputLiteral { pub struct UserInputLiteral {
pub field_name: Option<String>, pub field_name: Option<String>,
@@ -43,28 +78,93 @@ impl UserInputBound {
} }
pub enum UserInputAST { pub enum UserInputAST {
Clause(Vec<Box<UserInputAST>>), Clause(Vec<UserInputAST>),
Not(Box<UserInputAST>), Unary(Occur, Box<UserInputAST>),
Must(Box<UserInputAST>), // Not(Box<UserInputAST>),
Range { // Should(Box<UserInputAST>),
field: Option<String>, // Must(Box<UserInputAST>),
lower: UserInputBound, Leaf(Box<UserInputLeaf>),
upper: UserInputBound,
},
All,
Leaf(Box<UserInputLiteral>),
} }
impl From<UserInputLiteral> for UserInputAST { impl UserInputAST {
fn from(literal: UserInputLiteral) -> UserInputAST { pub fn unary(self, occur: Occur) -> UserInputAST {
UserInputAST::Leaf(Box::new(literal)) UserInputAST::Unary(occur, Box::new(self))
}
fn compose(occur: Occur, asts: Vec<UserInputAST>) -> UserInputAST {
assert!(occur != Occur::MustNot);
assert!(!asts.is_empty());
if asts.len() == 1 {
asts.into_iter().next().unwrap() //< safe
} else {
UserInputAST::Clause(
asts.into_iter()
.map(|ast: UserInputAST| ast.unary(occur))
.collect::<Vec<_>>(),
)
}
}
pub fn and(asts: Vec<UserInputAST>) -> UserInputAST {
UserInputAST::compose(Occur::Must, asts)
}
pub fn or(asts: Vec<UserInputAST>) -> UserInputAST {
UserInputAST::compose(Occur::Should, asts)
}
}
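A standalone sketch of the `compose` rule above, on a toy `Ast` type (a hypothetical stand-in; the real code uses `UserInputAST`, `Occur`, and `unary`): a single element is returned unchanged, while several elements are wrapped in a clause of prefixed children.

#[derive(Debug, PartialEq)]
enum Ast {
    Clause(Vec<Ast>),
    Unary(char, Box<Ast>),
    Leaf(&'static str),
}

fn compose(prefix: char, mut asts: Vec<Ast>) -> Ast {
    assert!(!asts.is_empty());
    if asts.len() == 1 {
        asts.pop().unwrap()
    } else {
        Ast::Clause(
            asts.into_iter()
                .map(|ast| Ast::Unary(prefix, Box::new(ast)))
                .collect(),
        )
    }
}

fn main() {
    // A single conjunct is returned as-is ...
    assert_eq!(compose('+', vec![Ast::Leaf("a")]), Ast::Leaf("a"));
    // ... while several conjuncts become a clause of '+'-prefixed children.
    assert_eq!(
        compose('+', vec![Ast::Leaf("a"), Ast::Leaf("b")]),
        Ast::Clause(vec![
            Ast::Unary('+', Box::new(Ast::Leaf("a"))),
            Ast::Unary('+', Box::new(Ast::Leaf("b"))),
        ])
    );
}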
/*
impl UserInputAST {
fn compose_occur(self, occur: Occur) -> UserInputAST {
match self {
UserInputAST::Not(other) => {
let new_occur = compose_occur(Occur::MustNot, occur);
other.simplify()
}
_ => {
self
}
}
}
pub fn simplify(self) -> UserInputAST {
match self {
UserInputAST::Clause(els) => {
if els.len() == 1 {
return els.into_iter().next().unwrap();
} else {
return self;
}
}
UserInputAST::Not(els) => {
if els.len() == 1 {
return els.into_iter().next().unwrap();
} else {
return self;
}
}
}
}
}
*/
impl From<UserInputLiteral> for UserInputLeaf {
fn from(literal: UserInputLiteral) -> UserInputLeaf {
UserInputLeaf::Literal(literal)
}
}
impl From<UserInputLeaf> for UserInputAST {
fn from(leaf: UserInputLeaf) -> UserInputAST {
UserInputAST::Leaf(Box::new(leaf))
} }
} }
impl fmt::Debug for UserInputAST { impl fmt::Debug for UserInputAST {
fn fmt(&self, formatter: &mut fmt::Formatter) -> Result<(), fmt::Error> { fn fmt(&self, formatter: &mut fmt::Formatter) -> Result<(), fmt::Error> {
match *self { match *self {
UserInputAST::Must(ref subquery) => write!(formatter, "+({:?})", subquery),
UserInputAST::Clause(ref subqueries) => { UserInputAST::Clause(ref subqueries) => {
if subqueries.is_empty() { if subqueries.is_empty() {
write!(formatter, "<emptyclause>")?; write!(formatter, "<emptyclause>")?;
@@ -78,21 +178,9 @@ impl fmt::Debug for UserInputAST {
} }
Ok(()) Ok(())
} }
UserInputAST::Not(ref subquery) => write!(formatter, "-({:?})", subquery), UserInputAST::Unary(ref occur, ref subquery) => {
UserInputAST::Range { write!(formatter, "{}({:?})", occur.to_char(), subquery)
ref field,
ref lower,
ref upper,
} => {
if let &Some(ref field) = field {
write!(formatter, "{}:", field)?;
}
lower.display_lower(formatter)?;
write!(formatter, " TO ")?;
upper.display_upper(formatter)?;
Ok(())
} }
UserInputAST::All => write!(formatter, "*"),
UserInputAST::Leaf(ref subquery) => write!(formatter, "{:?}", subquery), UserInputAST::Leaf(ref subquery) => write!(formatter, "{:?}", subquery),
} }
} }

View File

@@ -68,7 +68,7 @@ fn map_bound<TFrom, TTo, Transform: Fn(&TFrom) -> TTo>(
/// let docs_in_the_sixties = RangeQuery::new_u64(year_field, 1960..1970); /// let docs_in_the_sixties = RangeQuery::new_u64(year_field, 1960..1970);
/// ///
/// let mut count_collector = CountCollector::default(); /// let mut count_collector = CountCollector::default();
/// docs_in_the_sixties.search(&*searcher, &mut count_collector)?; /// docs_in_the_sixties.search(&searcher, &mut count_collector)?;
/// ///
/// let num_60s_books = count_collector.count(); /// let num_60s_books = count_collector.count();
/// ///
@@ -96,8 +96,8 @@ impl RangeQuery {
pub fn new_term_bounds( pub fn new_term_bounds(
field: Field, field: Field,
value_type: Type, value_type: Type,
left_bound: Bound<Term>, left_bound: &Bound<Term>,
right_bound: Bound<Term>, right_bound: &Bound<Term>,
) -> RangeQuery { ) -> RangeQuery {
let verify_and_unwrap_term = |val: &Term| { let verify_and_unwrap_term = |val: &Term| {
assert_eq!(field, val.field()); assert_eq!(field, val.field());
@@ -184,11 +184,7 @@ impl RangeQuery {
/// ///
/// If the field is not of the type `Str`, tantivy /// If the field is not of the type `Str`, tantivy
/// will panic when the `Weight` object is created. /// will panic when the `Weight` object is created.
pub fn new_str_bounds<'b>( pub fn new_str_bounds(field: Field, left: Bound<&str>, right: Bound<&str>) -> RangeQuery {
field: Field,
left: Bound<&'b str>,
right: Bound<&'b str>,
) -> RangeQuery {
let make_term_val = |val: &&str| val.as_bytes().to_vec(); let make_term_val = |val: &&str| val.as_bytes().to_vec();
RangeQuery { RangeQuery {
field, field,
@@ -202,7 +198,7 @@ impl RangeQuery {
/// ///
/// If the field is not of the type `Str`, tantivy /// If the field is not of the type `Str`, tantivy
/// will panic when the `Weight` object is created. /// will panic when the `Weight` object is created.
pub fn new_str<'b>(field: Field, range: Range<&'b str>) -> RangeQuery { pub fn new_str(field: Field, range: Range<&str>) -> RangeQuery {
RangeQuery::new_str_bounds( RangeQuery::new_str_bounds(
field, field,
Bound::Included(range.start), Bound::Included(range.start),
@@ -332,7 +328,7 @@ mod tests {
// ... or `1960..=1969` if inclusive range is enabled. // ... or `1960..=1969` if inclusive range is enabled.
let mut count_collector = CountCollector::default(); let mut count_collector = CountCollector::default();
docs_in_the_sixties.search(&*searcher, &mut count_collector)?; docs_in_the_sixties.search(&searcher, &mut count_collector)?;
assert_eq!(count_collector.count(), 2285); assert_eq!(count_collector.count(), 2285);
Ok(()) Ok(())
} }
@@ -369,9 +365,7 @@ mod tests {
let searcher = index.searcher(); let searcher = index.searcher();
let count_multiples = |range_query: RangeQuery| { let count_multiples = |range_query: RangeQuery| {
let mut count_collector = CountCollector::default(); let mut count_collector = CountCollector::default();
range_query range_query.search(&searcher, &mut count_collector).unwrap();
.search(&*searcher, &mut count_collector)
.unwrap();
count_collector.count() count_collector.count()
}; };
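A small usage sketch of the simplified `new_str` constructor above; the schema setup is minimal and the field name is made up:

use tantivy::query::RangeQuery;
use tantivy::schema::{SchemaBuilder, STRING};

fn main() {
    let mut schema_builder = SchemaBuilder::default();
    let title = schema_builder.add_text_field("title", STRING);
    let _schema = schema_builder.build();
    // Half-open range over the raw (untokenized) terms: "a" <= term < "c".
    let _query = RangeQuery::new_str(title, "a".."c");
}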

View File

@@ -1,5 +1,7 @@
extern crate fst_regex;
use error::TantivyError; use error::TantivyError;
use fst_regex::Regex; use self::fst_regex::Regex;
use query::{AutomatonWeight, Query, Weight}; use query::{AutomatonWeight, Query, Weight};
use schema::Field; use schema::Field;
use std::clone::Clone; use std::clone::Clone;
@@ -82,7 +84,7 @@ impl RegexQuery {
let automaton = Regex::new(&self.regex_pattern) let automaton = Regex::new(&self.regex_pattern)
.map_err(|_| TantivyError::InvalidArgument(self.regex_pattern.clone()))?; .map_err(|_| TantivyError::InvalidArgument(self.regex_pattern.clone()))?;
Ok(AutomatonWeight::new(self.field.clone(), automaton)) Ok(AutomatonWeight::new(self.field, automaton))
} }
} }
@@ -123,7 +125,7 @@ mod test {
let mut collector = TopCollector::with_limit(2); let mut collector = TopCollector::with_limit(2);
let regex_query = RegexQuery::new("jap[ao]n".to_string(), country_field); let regex_query = RegexQuery::new("jap[ao]n".to_string(), country_field);
searcher.search(&regex_query, &mut collector).unwrap(); searcher.search(&regex_query, &mut collector).unwrap();
let scored_docs = collector.score_docs(); let scored_docs = collector.top_docs();
assert_eq!(scored_docs.len(), 1, "Expected only 1 document"); assert_eq!(scored_docs.len(), 1, "Expected only 1 document");
let (score, _) = scored_docs[0]; let (score, _) = scored_docs[0];
assert_nearly_equals(1f32, score); assert_nearly_equals(1f32, score);
@@ -132,7 +134,7 @@ mod test {
let mut collector = TopCollector::with_limit(2); let mut collector = TopCollector::with_limit(2);
let regex_query = RegexQuery::new("jap[A-Z]n".to_string(), country_field); let regex_query = RegexQuery::new("jap[A-Z]n".to_string(), country_field);
searcher.search(&regex_query, &mut collector).unwrap(); searcher.search(&regex_query, &mut collector).unwrap();
let scored_docs = collector.score_docs(); let scored_docs = collector.top_docs();
assert_eq!(scored_docs.len(), 0, "Expected ZERO document"); assert_eq!(scored_docs.len(), 0, "Expected ZERO document");
} }
} }

View File

@@ -50,7 +50,6 @@ impl Scorer for Box<Scorer> {
} }
} }
/// Wraps a `DocSet` and simply returns a constant `Scorer`. /// Wraps a `DocSet` and simply returns a constant `Scorer`.
/// The `ConstScorer` is useful if you have a `DocSet` where /// The `ConstScorer` is useful if you have a `DocSet` where
/// you needed a scorer. /// you needed a scorer.

View File

@@ -72,7 +72,7 @@ mod tests {
let term = Term::from_field_text(left_field, "left2"); let term = Term::from_field_text(left_field, "left2");
let term_query = TermQuery::new(term, IndexRecordOption::WithFreqs); let term_query = TermQuery::new(term, IndexRecordOption::WithFreqs);
searcher.search(&term_query, &mut collector).unwrap(); searcher.search(&term_query, &mut collector).unwrap();
let scored_docs = collector.score_docs(); let scored_docs = collector.top_docs();
assert_eq!(scored_docs.len(), 1); assert_eq!(scored_docs.len(), 1);
let (score, _) = scored_docs[0]; let (score, _) = scored_docs[0];
assert_nearly_equals(0.77802235, score); assert_nearly_equals(0.77802235, score);
@@ -82,7 +82,7 @@ mod tests {
let term = Term::from_field_text(left_field, "left1"); let term = Term::from_field_text(left_field, "left1");
let term_query = TermQuery::new(term, IndexRecordOption::WithFreqs); let term_query = TermQuery::new(term, IndexRecordOption::WithFreqs);
searcher.search(&term_query, &mut collector).unwrap(); searcher.search(&term_query, &mut collector).unwrap();
let scored_docs = collector.score_docs(); let scored_docs = collector.top_docs();
assert_eq!(scored_docs.len(), 2); assert_eq!(scored_docs.len(), 2);
let (score1, _) = scored_docs[0]; let (score1, _) = scored_docs[0];
assert_nearly_equals(0.27101856, score1); assert_nearly_equals(0.27101856, score1);
@@ -94,7 +94,7 @@ mod tests {
let query = query_parser.parse_query("left:left2 left:left1").unwrap(); let query = query_parser.parse_query("left:left2 left:left1").unwrap();
let mut collector = TopCollector::with_limit(2); let mut collector = TopCollector::with_limit(2);
searcher.search(&*query, &mut collector).unwrap(); searcher.search(&*query, &mut collector).unwrap();
let scored_docs = collector.score_docs(); let scored_docs = collector.top_docs();
assert_eq!(scored_docs.len(), 2); assert_eq!(scored_docs.len(), 2);
let (score1, _) = scored_docs[0]; let (score1, _) = scored_docs[0];
assert_nearly_equals(0.9153879, score1); assert_nearly_equals(0.9153879, score1);

View File

@@ -3,6 +3,7 @@ use query::bm25::BM25Weight;
use query::Query; use query::Query;
use query::Weight; use query::Weight;
use schema::IndexRecordOption; use schema::IndexRecordOption;
use std::collections::BTreeSet;
use Result; use Result;
use Searcher; use Searcher;
use Term; use Term;
@@ -110,4 +111,7 @@ impl Query for TermQuery {
fn weight(&self, searcher: &Searcher, scoring_enabled: bool) -> Result<Box<Weight>> { fn weight(&self, searcher: &Searcher, scoring_enabled: bool) -> Result<Box<Weight>> {
Ok(Box::new(self.specialized_weight(searcher, scoring_enabled))) Ok(Box::new(self.specialized_weight(searcher, scoring_enabled)))
} }
fn query_terms(&self, term_set: &mut BTreeSet<Term>) {
term_set.insert(self.term.clone());
}
} }

Some files were not shown because too many files have changed in this diff.