Compare commits


182 Commits
wasm ... doc

Author SHA1 Message Date
Paul Masurel
3d06639531 doc 2018-11-13 05:27:28 +09:00
Paul Masurel
edcafb69bb Fixed benches 2018-11-10 17:04:29 -08:00
Paul Masurel
14908479d5 Release 0.7.1 2018-11-02 17:56:25 +09:00
Dru Sellers
ab4593eeb7 Adds open_or_create method (#428)
* Change the semantic of Index::create_in_dir.

It should return an error if the directory already contains an Index.

* Index::open_or_create is working

* additional test

* Checking that schema matches on open_or_create.

Simplifying unit tests.

* simplifying Eq
2018-10-31 08:36:39 +09:00
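A minimal sketch of how the new `open_or_create` entry point might be used; the directory argument and exact signature are assumptions based on the commit description:

```rust
use tantivy::directory::RAMDirectory;
use tantivy::schema::*;
use tantivy::Index;

fn main() -> tantivy::Result<()> {
    let mut schema_builder = SchemaBuilder::default();
    schema_builder.add_text_field("title", TEXT);
    let schema = schema_builder.build();

    // Opens the index stored in the directory if there is one, otherwise
    // creates a fresh index with `schema`. Per the commit, this fails if the
    // directory already holds an index whose schema does not match.
    let index = Index::open_or_create(RAMDirectory::create(), schema)?;
    let _ = index;
    Ok(())
}
```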
Dru Sellers
e75bb1d6a1 Fix NGram processing of non-ascii characters (#430)
* A working version

* optimize the ngram parsing

* Decoding codepoint only once.

* Closes #429

* using leading_zeros to make code less cryptic

* lookup in a table
2018-10-31 08:35:27 +09:00
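For illustration only, not the code merged here: the `leading_zeros` trick mentioned above can derive the byte length of a UTF-8 codepoint from its leading byte.

```rust
/// Returns the number of bytes of the UTF-8 codepoint starting with `first_byte`.
/// Illustrative sketch of the `leading_zeros` trick; not the code merged in #430.
fn utf8_codepoint_len(first_byte: u8) -> usize {
    if first_byte < 0x80 {
        1 // ASCII fast path
    } else {
        // For a leading byte 110xxxxx / 1110xxxx / 11110xxx, the number of
        // leading ones (i.e. leading zeros of the complement) is the length.
        (!first_byte).leading_zeros() as usize
    }
}

fn main() {
    assert_eq!(utf8_codepoint_len(b'a'), 1);
    assert_eq!(utf8_codepoint_len("é".as_bytes()[0]), 2);
    assert_eq!(utf8_codepoint_len("日".as_bytes()[0]), 3);
    assert_eq!(utf8_codepoint_len("🀄".as_bytes()[0]), 4);
}
```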
dependabot[bot]
63b9d62237 Update base64 requirement from 0.9.1 to 0.10.0 (#433)
Updates the requirements on [base64](https://github.com/alicemaz/rust-base64) to permit the latest version.
- [Release notes](https://github.com/alicemaz/rust-base64/releases)
- [Changelog](https://github.com/alicemaz/rust-base64/blob/master/RELEASE-NOTES.md)
- [Commits](https://github.com/alicemaz/rust-base64/commits/v0.10.0)

Signed-off-by: dependabot[bot] <support@dependabot.com>
2018-10-31 08:34:44 +09:00
Jason Wolfe
0098e3d428 Compute space usage of a Searcher / SegmentReader / CompositeFile (#282)
* Compute space usage of a Searcher / SegmentReader / CompositeFile

* Fix typo

* Add serde Serialize/Deserialize for all the SpaceUsage structs

* Fix indexing

* Public methods for consuming space usage information

* #281: Add a space usage method that takes a SegmentComponent to support code that is unaware of particular segment components, and to make it more likely to update methods when a new component type is added.

* Add support for space usage computation of positions skip index file (#281)

* Add some tests for space usage computation (#281)
2018-10-15 09:04:36 +09:00
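A rough sketch of how the space-usage API described here might be queried; the `space_usage()` accessor name and the JSON dump are inferred from the commit message, not verified against the released API:

```rust
use tantivy::Searcher;

// Sketch only: not verified against the released API.
fn report_space_usage(searcher: &Searcher) -> serde_json::Result<String> {
    // Per the commit, the SpaceUsage structs derive Serialize/Deserialize,
    // so the per-segment / per-component breakdown can be dumped as JSON.
    let usage = searcher.space_usage();
    serde_json::to_string_pretty(&usage)
}
```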
Konstantin Gribov
69d5e4b9b1 Added proper references for Apache Lucene & Solr (#432)
Also, added links to websites for Lucene, Solr & ElasticSearch
2018-10-12 08:46:07 +09:00
Paul Masurel
e0cdd3114d Fixing README (#427)
Closes #424.
2018-09-17 08:52:29 +09:00
Paul Masurel
f32b4a2ebe Removing release build from ci, disabling lto (#425) 2018-09-17 06:41:40 +09:00
Paul Masurel
6ff60b8ed8 Fixing README (#426) 2018-09-17 06:20:44 +09:00
Paul Masurel
8da28fb6cf Added iml file 2018-09-16 13:26:54 +09:00
Paul Masurel
0df2a221da Bump version pre-release 2018-09-16 13:24:14 +09:00
Paul Masurel
5449ec3c11 Snippet term score (#423) 2018-09-16 10:21:02 +09:00
Paul Masurel
10f6c07c53 Clippy (#422)
* Cargo Format
* Clippy
2018-09-15 20:20:22 +09:00
Paul Masurel
06e7bd18e7 Clippy (#421)
* Cargo Format

* Clippy

* bugfix

* still clippy stuff

* clippy step 2
2018-09-15 14:56:14 +09:00
Paul Masurel
37e4280c0a Cargo Format (#420) 2018-09-15 07:44:22 +09:00
Paul Masurel
0ba1cf93f7 Remove Searcher dereference (#419) 2018-09-14 09:54:26 +09:00
Paul Masurel
21a9940726 Update Changelog with #388 (#418) 2018-09-14 09:31:11 +09:00
pentlander
8600b8ea25 Top collector (#413)
* Make TopCollector generic

Make TopCollector take a generic type instead of only being tied to
score. This will allow for sharing code between a TopCollector that
sorts results by Score and a TopCollector that sorts documents by a fast
field. This commit makes no functional changes to TopCollector.

* Add TopFieldCollector and TopScoreCollector

Create two new collectors that use the refactored TopCollector.
TopFieldCollector has the same functionality that TopCollector
originally had. TopFieldCollector allows for sorting results by a given
fast field. Closes tantivy-search/tantivy#388

* Make TopCollector private

Make TopCollector package private and export TopFieldCollector as
TopCollector to maintain backwards compatibility. Mark TopCollector
as deprecated to encourage use of the non-aliased TopFieldCollector.
Remove Collector implementation for TopCollector since it is not longer
used.
2018-09-14 09:22:17 +09:00
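A sketch of collecting the top documents with the 0.7-era collector API, assuming the score-sorted collector keeps the `TopCollector::with_limit` entry point as the commit message suggests:

```rust
use tantivy::collector::TopCollector;
use tantivy::query::QueryParser;
use tantivy::Searcher;

// Sketch: collect the ten best documents by score. Per the commit, the
// score-sorted behaviour keeps the historical `TopCollector` entry point,
// while a fast-field-sorted `TopFieldCollector` is added alongside it.
fn top_ten(searcher: &Searcher, query_parser: &QueryParser) -> tantivy::Result<()> {
    let query = query_parser.parse_query("sea whale").expect("valid query");
    let mut top_collector = TopCollector::with_limit(10);
    searcher.search(&*query, &mut top_collector)?;
    for doc_address in top_collector.docs() {
        println!("{:?}", doc_address);
    }
    Ok(())
}
```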
Paul Masurel
30f4f85d48 Closes #414. (#417)
Updating documentation for load_searchers.
2018-09-14 09:11:07 +09:00
Paul Masurel
82d25b8397 Fixing snippet example 2018-09-13 12:39:42 +09:00
Paul Masurel
2104c0277c Updating uuid 2018-09-13 09:13:37 +09:00
Paul Masurel
dd37e109f2 Merge branch 'issue/368b' 2018-09-11 20:16:14 +09:00
Paul Masurel
cc23194c58 Editing document 2018-09-11 20:15:38 +09:00
Paul Masurel
63868733a3 Added SnippetGenerator 2018-09-11 09:45:27 +09:00
Paul Masurel
644d8a3a10 Added snippet generator 2018-09-10 16:39:45 +09:00
Paul Masurel
e32dba1a97 Phrase weight 2018-09-10 09:26:33 +09:00
Paul Masurel
a78aa4c259 updating doc 2018-09-09 17:23:30 +09:00
Paul Masurel
7e5f697d00 Closes #387 2018-09-09 16:23:56 +09:00
Paul Masurel
a78f4cca37 Merge branch 'issue/368' into issue/368b 2018-09-09 16:04:20 +09:00
Paul Masurel
2e44f0f099 blop 2018-09-09 14:23:24 +09:00
Vignesh Sarma K
9ccba9f864 Merge branch 'master' into issue/368 2018-09-07 20:27:38 +05:30
Paul Masurel
9101bf5753 Fragments 2018-09-07 09:57:12 +09:00
Paul Masurel
23e97da9f6 Merge branch 'master' of github.com:tantivy-search/tantivy 2018-09-07 08:44:14 +09:00
Paul Masurel
1d439e96f5 Using sort unstable by key. 2018-09-07 08:43:44 +09:00
Paul Masurel
934933582e Closes #402 (#403) 2018-09-06 10:12:26 +09:00
Paul Masurel
98c7fbdc6f Issue/378 (#392)
* Added failing unit test

* Closes #378. Handling queries that end up empty after going through the analyzer.

* Fixed stop word example
2018-09-06 10:11:54 +09:00
Paul Masurel
cec9956a01 Issue/389 (#405)
* Setting up the dependency.

* Completed README
2018-09-06 10:10:40 +09:00
Paul Masurel
c64972e039 Apply unicode lowercasing. (#408)
Checks if the str is ASCII, and uses a fast track if that is the case.
If not, it falls back to the std's definition of a lowercase character.

Closes #406
2018-09-05 09:43:56 +09:00
Paul Masurel
b3b2421e8a Issue/367 (#404)
* First stab

* Closes #367
2018-09-04 09:17:00 +09:00
Paul Masurel
f570fe37d4 small changes 2018-08-31 09:03:44 +09:00
Paul Masurel
6704ab6987 Added methods to extract the matching terms. First stab 2018-08-30 09:47:19 +09:00
Paul Masurel
a12d211330 Extracting terms matching query in the document 2018-08-30 09:23:34 +09:00
Paul Masurel
ee681a4dd1 Added say thanks badge 2018-08-29 11:06:04 +09:00
petr-tik
d15efd6635 Closes #235 - adds a new error type (#398)
error message suggests possible causes

Addressed code review 1 thread + smaller heap size
2018-08-29 08:26:59 +09:00
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
18814ba0c1 add a test for second fragment having higher score 2018-08-28 22:27:56 +05:30
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
f247935bb9 Use HighlightSection::new rather than just directly creating the object 2018-08-28 22:16:22 +05:30
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
6a197e023e ran rustfmt 2018-08-28 20:41:58 +05:30
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
96a313c6dd add more tests 2018-08-28 20:41:58 +05:30
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
fb9b1c1f41 add a test and fix the bug of not calculating first token 2018-08-28 20:41:58 +05:30
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
e1bca6db9d update calculate_score to try_add_token
`try_add_token` will now update the stop_offset as well.
`FragmentCandidate::new` now just takes `start_offset`,
it expects `try_add_token` to be called to add a token.
2018-08-28 20:41:58 +05:30
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
8438eda01a use while let instead of loop and if.
as per CR comment
2018-08-28 20:41:57 +05:30
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
b373f00840 add htmlescape and update to_html fn to use it.
tests and imports also updated.
2018-08-28 20:41:57 +05:30
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
46decdb0ea compare against accumulator rather than init value 2018-08-28 20:41:41 +05:30
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
835cdc2fe8 Initial version of snippet
refer #368
2018-08-28 20:41:41 +05:30
Paul Masurel
19756bb7d6 Getting started on #368 2018-08-28 20:41:41 +05:30
CJP10
57e1f8ed28 Missed a closing bracket (#397) 2018-08-28 23:17:59 +09:00
Paul Masurel
2649c8a715 Issue/246 (#393)
* Moving Range and All to Leaves

* Parsing OR/AND

* Simplify user input ast

* AND and OR supported. Returning an error when mixing syntax

Closes #246

* Added support for NOT

* Updated changelog
2018-08-28 11:03:54 +09:00
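A sketch of the boolean syntax accepted after this change; the example query string is taken from the README further down in this diff, and the schema setup is illustrative:

```rust
use tantivy::query::QueryParser;
use tantivy::schema::*;
use tantivy::Index;

fn main() {
    let mut schema_builder = SchemaBuilder::default();
    let title = schema_builder.add_text_field("title", TEXT);
    let index = Index::create_in_ram(schema_builder.build());
    let query_parser = QueryParser::for_index(&index, vec![title]);

    // `AND`, `OR` and `NOT` are now understood in addition to `+` / `-`.
    let _query = query_parser
        .parse_query("(michael AND jackson) OR \"king of pop\"")
        .expect("valid query");
    // Per the commit, mixing the `+`/`-` syntax with AND/OR in a single
    // query is rejected with an error instead of being silently accepted.
}
```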
Paul Masurel
ede97eded6 Removed use 2018-08-28 09:54:04 +09:00
Paul Masurel
4b7ff78c5a Added fundamentals 2018-08-28 08:09:27 +09:00
Paul Masurel
948758ad78 First commit for the documentation 2018-08-27 09:49:49 +09:00
Paul Masurel
d71fa43ca3 Moving emoticon on the right side of the parenthesis 2018-08-23 08:59:11 +09:00
Paul Masurel
1e5266d4c9 Merge branch 'master' of github.com:tantivy-search/tantivy 2018-08-23 08:55:30 +09:00
Paul Masurel
537fc27231 Added bench line in features 2018-08-23 08:55:13 +09:00
Dru Sellers
af593b1116 Add default EN stopwords to the default analyzer (#381)
* Add a default list of en stopwords

* Add the default en stopword filter to the standard tokenizers

* code review feedback
2018-08-22 10:49:39 +09:00
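A hedged sketch of wiring a stop word filter into a custom analyzer, similar to what this commit does for the default English tokenizer; the constructor names are assumptions based on the tokenizer module of this era:

```rust
use tantivy::tokenizer::{LowerCaser, SimpleTokenizer, StopWordFilter, Tokenizer};
use tantivy::Index;

// Sketch: register a custom analyzer with an explicit stop word list.
fn register_english_analyzer(index: &Index) {
    let stop_words = vec!["the".to_string(), "and".to_string(), "of".to_string()];
    let analyzer = SimpleTokenizer
        .filter(LowerCaser)
        .filter(StopWordFilter::remove(stop_words));
    index.tokenizers().register("en_with_stopwords", analyzer);
}
```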
Paul Masurel
3d73c0c240 Update issue templates 2018-08-21 10:59:08 +09:00
Paul Masurel
3a8e524f77 Added example to show how to access the inverted list directly 2018-08-21 09:36:13 +09:00
Paul Masurel
c0641c2b47 Remove generate html script. It moved to tantivy-search.github.io 2018-08-21 08:26:46 +09:00
Dru Sellers
ef3a16a129 Switch from error-chain to failure crate (#376)
* Switch from error-chain to failure crate

* Added deprecated alias for

* Started editing the changelog
2018-08-20 09:40:45 +09:00
Paul Masurel
a0a284fe91 Added a full-fledged empty query and relying on it in QueryParser, instead of using an empty clause. 2018-08-20 09:21:32 +09:00
dependabot[bot]
0feeef2684 Update owning_ref requirement from 0.3 to 0.4 (#379)
Updates the requirements on [owning_ref](https://github.com/Kimundi/owning-ref-rs) to permit the latest version.
- [Release notes](https://github.com/Kimundi/owning-ref-rs/releases)
- [Commits](https://github.com/Kimundi/owning-ref-rs/commits)

Signed-off-by: dependabot[bot] <support@dependabot.com>
2018-08-20 09:08:11 +09:00
Dru Sellers
cc50bdb06a Add a basic faceted search example (#383)
* Add a basic faceted search example

* quieting the compiler
2018-08-19 08:07:54 +09:00
Paul Masurel
23c2c3ae7c Building all examples on appveyor + running them on travis 2018-08-17 13:24:37 +09:00
Dru Sellers
674524ba91 Add an example of using the stopwords filter (#377) 2018-08-17 12:52:21 +09:00
Paul Masurel
60a9a7f837 Added example showing how to delete/update documents 2018-08-17 09:43:55 +09:00
Paul Masurel
5b5c706581 Simplified examples 2018-08-16 22:38:39 +09:00
Paul Masurel
3e14a76623 Update regex_query.rs 2018-08-15 16:38:32 +09:00
Paul Masurel
8cde1c81e5 Update README.md 2018-08-13 18:03:30 +09:00
Paul Masurel
8d0a29b137 Added sourcerer wall of fame 2018-08-13 18:02:49 +09:00
Paul Masurel
cbfb2fe19d Avoid building twice when doing code coverage 2018-08-13 10:38:01 +09:00
Vignesh Sarma K
09e00f1d42 add position_length to Token (#337)
* add position_length to Token

refer #291

* Add term offset to `PhraseQuery`

ref #291

* Add new constructor for `PhraseQuery` that allows custom offset

* fix the method name as per pr comment

* Closes #291

Added unit test.
Using offsets from the analyzer in QueryParser.
2018-08-13 10:14:50 +09:00
Paul Masurel
290620fdee Added slashes 2018-08-13 09:13:01 +09:00
petr-tik
f0d1b85bd8 N370 pr fix num searchers (#371)
* Change ordering to Acquire

* set_num_searchers now uses AtomicUsize.store
2018-08-13 08:56:30 +09:00
petr-tik
aaef546f91 Moved NUM_SEARCHERS into a local variable (#369)
* Moved NUM_SEARCHERS into a local variable

dynamically determined as the number of available cpus.

var name in lowercase (not a constant anymore).

updated it in docstring

* lowercased the varnames

* User can set number of logical cores in create_from_metas

* cargo fmt

* Num_searchers as Arc<AtomicUsize>

Retrieving the value with Relaxed ordering

Reverted create_from_metas signature. However, it calls num_cpus and
sets the Arc val
2018-08-12 20:08:14 +09:00
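An illustration of the pattern described in these two commits, using only the standard library plus `num_cpus` (already a tantivy dependency); this is not tantivy's actual code:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

fn main() {
    // The number of searchers lives behind an Arc<AtomicUsize> and defaults
    // to the number of logical CPUs.
    let default_num_searchers = num_cpus::get();
    let num_searchers = Arc::new(AtomicUsize::new(default_num_searchers));

    // `set_num_searchers` equivalent: overwrite the shared value...
    num_searchers.store(8, Ordering::Release);

    // ...which readers later pick up with a cheap atomic load.
    let n = num_searchers.load(Ordering::Acquire);
    println!("searcher pool size: {}", n);
}
```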
Paul Masurel
811ddf2226 Closes #364 (#365)
* Closes #364

* Trying to raise the recursion limit

* Better unit test and bug fix on token offsets
2018-08-08 11:15:20 +09:00
Paul Masurel
79a339d353 Removing env_logger dependency 2018-08-02 19:29:09 +09:00
Paul Masurel
e45e4c79d9 update crossbeam 2018-08-02 19:24:08 +09:00
Paul Masurel
848bf41bc9 Updating rand to 0.5 (#363) 2018-08-02 19:19:04 +09:00
Paul Masurel
d11cb087a7 Updated to combine-0.3 (#362) 2018-08-02 18:29:58 +09:00
Jacob Brown
2dd7422f42 replace chan with crossbeam-channel (#361)
* replace chan with crossbeam-channel

* Update Cargo.toml
2018-08-02 12:47:22 +09:00
Paul Masurel
e8707c02c0 Issue/333 (#335)
* Add skip information for posting list (skip to doc ids) 
* Separate num bits from data for positions (skip n positions)
* Address in the position using a n-position offset
* Added a long skip structure to allow efficient opening of the position for a given term.
2018-07-31 10:51:53 +09:00
dependabot[bot]
55928d756a Update rust-stemmers requirement to 1.0.2 (#350)
* Update rust-stemmers requirement to 1.0.2

Updates the requirements on [rust-stemmers](https://github.com/CurrySoftware/rust-stemmers) to permit the latest version.
- [Release notes](https://github.com/CurrySoftware/rust-stemmers/releases)
- [Commits](https://github.com/CurrySoftware/rust-stemmers/commits)

Signed-off-by: dependabot[bot] <support@dependabot.com>

* Update Cargo.toml
2018-07-31 09:32:57 +09:00
dependabot[bot]
a4370bca64 Update owned-read requirement to 0.4 (#352)
Updates the requirements on [owned-read](https://github.com/tantivy-search/owned-read) to permit the latest version.
- [Release notes](https://github.com/tantivy-search/owned-read/releases)
- [Commits](https://github.com/tantivy-search/owned-read/commits)

Signed-off-by: dependabot[bot] <support@dependabot.com>
2018-07-31 09:32:01 +09:00
dependabot[bot]
5a5c5a8ca5 Update bit-set requirement to 0.5.0 (#351)
* Update bit-set requirement to 0.5.0

Updates the requirements on [bit-set](https://github.com/contain-rs/bit-set) to permit the latest version.
- [Release notes](https://github.com/contain-rs/bit-set/releases)
- [Commits](https://github.com/contain-rs/bit-set/commits)

Signed-off-by: dependabot[bot] <support@dependabot.com>

* Update Cargo.toml

* Update Cargo.toml
2018-07-31 09:31:41 +09:00
dependabot[bot]
1b470dd474 Update log requirement to 0.4.3 (#353)
* Update log requirement to 0.4.3

Updates the requirements on [log](https://github.com/rust-lang/log) to permit the latest version.
- [Release notes](https://github.com/rust-lang/log/releases)
- [Changelog](https://github.com/rust-lang-nursery/log/blob/master/CHANGELOG.md)
- [Commits](https://github.com/rust-lang/log/commits/env_logger-0.4.3)

Signed-off-by: dependabot[bot] <support@dependabot.com>

* Update Cargo.toml
2018-07-31 09:31:19 +09:00
Paul Masurel
52b4575245 Issue/355 (#358)
* issue with top_k sorting (#356)

* Closes #355
2018-07-31 08:24:55 +09:00
dependabot[bot]
ddd2d5b04c Update lazy_static requirement to 1.0.2 (#349)
* Update lazy_static requirement to 1.0.2

Updates the requirements on [lazy_static](https://github.com/rust-lang-nursery/lazy-static.rs) to permit the latest version.
- [Release notes](https://github.com/rust-lang-nursery/lazy-static.rs/releases)
- [Commits](https://github.com/rust-lang-nursery/lazy-static.rs/commits/v1.0.2)

Signed-off-by: dependabot[bot] <support@dependabot.com>

* Update Cargo.toml
2018-07-30 12:34:06 +09:00
dependabot[bot]
fa22b4041a Update itertools requirement to 0.7.8 (#346)
* Update itertools requirement to 0.7.8

Updates the requirements on [itertools](https://github.com/bluss/rust-itertools) to permit the latest version.
- [Release notes](https://github.com/bluss/rust-itertools/releases)
- [Commits](https://github.com/bluss/rust-itertools/commits/0.7.8)

Signed-off-by: dependabot[bot] <support@dependabot.com>

* Update Cargo.toml
2018-07-30 11:32:12 +09:00
dependabot[bot]
8faee143fa Update regex requirement to 1.0 (#347)
Updates the requirements on [regex](https://github.com/rust-lang/regex) to permit the latest version.
- [Release notes](https://github.com/rust-lang/regex/releases)
- [Changelog](https://github.com/rust-lang/regex/blob/master/CHANGELOG.md)
- [Commits](https://github.com/rust-lang/regex/commits/1.0.2)

Signed-off-by: dependabot[bot] <support@dependabot.com>
2018-07-30 09:59:19 +09:00
dependabot[bot]
366ce98f08 Update tempfile requirement to 3.0 (#348)
Updates the requirements on [tempfile](https://github.com/Stebalien/tempfile) to permit the latest version.
- [Release notes](https://github.com/Stebalien/tempfile/releases)
- [Changelog](https://github.com/Stebalien/tempfile/blob/master/NEWS)
- [Commits](https://github.com/Stebalien/tempfile/commits/v3.0.3)

Signed-off-by: dependabot[bot] <support@dependabot.com>
2018-07-30 09:58:56 +09:00
Paul Masurel
190e60a41c Closes #339. (#340)
As required per the FacetCollector,
facet values needs to be sorted before being encoded in the
multivalued field.
2018-07-25 18:21:48 +09:00
Vignesh Sarma K
b9558801a1 Declare and implement separate Clone Traits (#336)
For traits, `Directory` and `MergePolicy`.

refer #306
2018-07-18 12:36:43 +09:00
Paul Masurel
36728215ac Using the codecov badge 2018-07-10 21:19:59 +09:00
Paul Masurel
39551a0418 fix travis 2018-07-10 13:08:22 +09:00
Paul Masurel
39b98b2e76 fix travis 2018-07-10 13:07:15 +09:00
Paul Masurel
616162400d Add missing space 2018-07-10 12:49:32 +09:00
Paul Masurel
694d164db6 fix travis.yml 2018-07-10 09:39:39 +09:00
Paul Masurel
ef442cefb1 codecov 2018-07-10 09:38:59 +09:00
Paul Masurel
14da241f35 Readed cov 2018-07-10 09:25:24 +09:00
Paul Masurel
346a9e4287 Set dev version 2018-07-10 09:20:21 +09:00
Paul Masurel
31655e92d7 Preparing release 0.6.1 2018-07-10 09:12:26 +09:00
Paul Masurel
6b8d76685a Tiny refactoring 2018-07-05 09:11:55 +09:00
Paul Masurel
ce5683fc6a Removed useless counting_writer 2018-07-04 16:13:19 +09:00
Paul Masurel
5205579db6 Merge branch 'master' of github.com:tantivy-search/tantivy 2018-07-04 16:09:59 +09:00
Paul Masurel
d056ae60dc Removed SourceRead. Relying on the new owned-read crate instead (#332) 2018-07-04 16:08:52 +09:00
Paul Masurel
af9280c95f Removed SourceRead. Relying on the new owned-read crate instead 2018-07-04 12:47:25 +09:00
David Hewson
2e538ce6e6 remove extra space in name (#331)
the extra space that appeared breaks using the package
2018-07-02 05:32:19 +09:00
Jason Wolfe
00466d2b08 #328: Support parsing unbounded range queries (#329)
* #328: Support parsing unbounded range queries. Update CHANGELOG.md for query parser changes.

* Set version to 0.7-dev
2018-06-30 13:24:02 +09:00
Paul Masurel
8ebbf6b336 Issue/325 (#330)
* Introducing a SegmentMeta inventory.
* Depending on census=0.1
* Cargo fmt
2018-06-30 13:11:41 +09:00
Paul Masurel
1ce36bb211 Merge branch 'master' of github.com:tantivy-search/tantivy 2018-06-27 16:58:47 +09:00
Jason Wolfe
2ac43bf21b Support parsing RangeQuery and AllQuery in Queryparser (#323)
* (#321) Add support for range query parsing to grammar / parser. Still needs to be wired through the rest of the way.

* (321) Finish wiring RangeQuery parsing through

* (#321) Add logical AST query parser tests for RangeQuery

* (#321) Support parsing AllQuery

* (#321) Update documentation of QueryParser

* (#321) Support negative numbers in range query parsing
2018-06-25 08:29:47 +09:00
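A sketch of the range and all-query syntax the parser understands after this change; the query strings follow the changelog entries further down in this diff, and the schema setup is illustrative:

```rust
use tantivy::query::QueryParser;
use tantivy::schema::*;
use tantivy::Index;

fn main() {
    let mut schema_builder = SchemaBuilder::default();
    schema_builder.add_u64_field("year", INT_INDEXED);
    let index = Index::create_in_ram(schema_builder.build());
    let query_parser = QueryParser::for_index(&index, vec![]);

    // Inclusive `[..]` and unbounded `*` bounds are accepted.
    let _inclusive = query_parser.parse_query("year:[1990 to 2000]").expect("range");
    let _unbounded = query_parser.parse_query("year:[2000 to *]").expect("range");
    // A lone `*` parses to an AllQuery matching every document.
    let _all = query_parser.parse_query("*").expect("all query");
}
```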
Paul Masurel
3fd8c2aa5a Removed one keyword 2018-06-22 14:47:21 +09:00
Paul Masurel
c1022e23d2 Switching to stable rust in AppVeyor. 2018-06-22 14:33:42 +09:00
Paul Masurel
8ccbfdea5d Preparing for release 2018-06-22 14:27:46 +09:00
Paul Masurel
badfce3a23 Preparing for release. 2018-06-22 14:09:14 +09:00
Dru Sellers
e301e0bc87 Add some simple doc tests (#320)
* Add TopCollector doc test

* Add CountCollector Doc Test

* Add Doc Test for MultiCollector

* Add ChainedCollector Doc Test

* Expose Fuzzy Query where it should be

* Add FuzzyTermQuery Doc Test

* Expose RegexQuery

* Regex Query Doc Test

* Add TermQuery Doc Test

* Add doc comments

* fix test 🤦

* Added explanation about the complexity variables

* Fixing unit tests

* Single threads if you check docids
2018-06-19 10:45:20 +09:00
Dru Sellers
317baf4e75 Add in simple regex query support (#319)
* Add fst_regex crate in

* Reduce API surface area

This doesn't need to be public

* better test name

* Pull Automaton weight out so it can be shared

* Implement Regex Query
2018-06-16 14:08:30 +09:00
Paul Masurel
24398d94e4 Exposing the 2018-06-15 21:40:57 +09:00
Dru Sellers
360f4132eb Standardizes the Index::open_* APIs (#318)
* Relocate `from_directory` closer to its usage

* Specific methods come before the generic method

* Rename open methods to follow the lead of the create methods
2018-06-15 12:16:41 +09:00
Dru Sellers
2b8f02764b Standardizes the Index::create_* APIs (#317)
* Pull all creation methods next to each other

The goal here is to make it clear which methods are performing the
same function, and to assist with standardizing the API calls.

* Make `from_directory` private

This seems to be an internal function, so lets make it internal.

* Rename `create` to `create_in_dir`

This lets the name match the `create_in_ram` pattern and opens up
`create` for the generic implementation.

* Implement the generic create function

All of the create methods now delegate to the common create function
and future `create_in_*` functions now have a clear pattern
to follow as well
2018-06-14 11:08:42 +09:00
Paul Masurel
0465876854 Issue/257 (#310)
* Replaced lz4 by a pure rust implementation of snappy.

Closes #257

* snappy is the default compression. One can use lz4 by enabling the lz4 feature flag.

* Removed Compression trait
2018-06-12 19:02:57 +09:00
Dru Sellers
6f7b099370 Add AutomatonWeight to a fuzzy_search module and FuzzyQuery (#300)
* Add AutomatonWeight to a fuzzy_search module

* Hacking around ownership issues

* Working through lifetime issues

* Working through tests

* fix test by lower casing the words (reducing distance)

* code review changes

* Suggestion on how to solve the borrow problem

* clean up
2018-06-11 22:23:03 +09:00
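A sketch of building the `FuzzyTermQuery` that sits on top of this `AutomatonWeight`; the `(term, max distance, transpositions)` constructor shape is an assumption:

```rust
use tantivy::query::FuzzyTermQuery;
use tantivy::schema::Field;
use tantivy::Term;

// Sketch: a query matching terms within an edit distance of 1 of "jacson",
// counting a transposition as a single edit.
fn build_fuzzy_query(title: Field) -> FuzzyTermQuery {
    let term = Term::from_field_text(title, "jacson");
    FuzzyTermQuery::new(term, 1, true)
}
```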
Paul Masurel
84f5cc4388 Added an AUTHORS file. Closes #315 (#316) 2018-06-11 22:21:58 +09:00
Paul Masurel
75aae0d2c2 Update README 2018-06-08 13:05:57 +09:00
Paul Masurel
009a3559be atomicwrites 2.2.0 for ARM compilation 2018-06-06 07:13:09 +09:00
Paul Masurel
7a31669e9d Disabling ARM targets 2018-06-05 12:22:00 +09:00
Paul Masurel
5185eb790b Reduced heap usage in unit test 2018-06-05 10:02:10 +09:00
Paul Masurel
a3dffbf1c6 Added more ARM target. 2018-06-05 09:06:33 +09:00
Paul Masurel
857a5794d8 Updated nix version 2018-06-05 09:02:40 +09:00
Paul Masurel
b0a6fc1448 Reduce RAM usage 2018-06-04 11:20:24 +09:00
Paul Masurel
989d52bea4 Updated atomicwrites version. 2018-06-04 10:00:21 +09:00
Paul Masurel
09661ea7ec Added cross testing on different platforms 2018-06-04 09:47:53 +09:00
Paul Masurel
b59132966f Better heap (#311)
* Changed the heap to a paged memory arena.
* Trying to simplify the indexing term hashmap
* Exploding datastruct
* Removed some complexity in bitpacker
2018-06-04 09:39:18 +09:00
Paul Masurel
863d3411bc Update Cargo.toml 2018-05-31 15:54:34 +09:00
Paul Masurel
8a55d133ab Showing Appveyor CI badge for the master branch
.. before the last build was shown.
2018-05-28 13:44:53 +09:00
Jason Wolfe
432d49d814 Expose parameters of RangeQuery for external usage (#309) 2018-05-19 14:29:25 +09:00
Jason Wolfe
0cea706f10 Add docs to new Query methods (#307) 2018-05-18 13:53:29 +09:00
Paul Masurel
71d41ca209 Added Google to the license 2018-05-18 10:13:23 +09:00
Paul Masurel
bc69dab822 cargo fmt 2018-05-18 10:08:05 +09:00
Jason Wolfe
72acad0921 Add box_clone() and downcast::Any to Query (#303) 2018-05-18 09:53:11 +09:00
Paul Masurel
c9459f74e8 Update docs about TermDict. 2018-05-18 09:20:39 +09:00
Dru Sellers
08d2cc6c7b Make it possible to stream the terms matching an Automaton (#297)
* rustfmt and some English grammar

* sort cargo.toml crates

* WIP: something to show

* Remove example for now

* Implement desired method

* Resolving Generic Type Arguments

* Resolve Generic Types

* Banging around on the tests

* DANGER! Change unsafe usage based on compiler warnings

* Unscrew up my rebase

* Clean Up Type Spam

Default Types FTW

* typo

* better variable names

* Remove Duplicate Levenshtein crate
2018-05-11 12:41:14 -07:00
Dru Sellers
82d87416c2 Implement StopWords Filter (#292)
* Implement StopWords Filter

- added example doctest for alphanum_only.rs so that I could
drive my own test of the stopword filter

* Style Cop

* Switch HashSet Hasher to FNV for speed

* Update Change Log

* fix missed location renaming
2018-05-09 18:40:41 -07:00
Paul Masurel
96b2c2971e Testing actual doc ids in unit test 2018-05-09 09:14:22 -07:00
Dru Sellers
162afd73f6 Alive docs iterator (#293)
* Add non-deleted DocId iterator to SegmentReader

Closes #287

* Add Todo

* Add Unit Test

* Improving test based on feedback

- found bug and fixed it. :)

* Reestablish changes post rebase for clean merge
2018-05-09 09:03:27 -07:00
Paul Masurel
ddfd87fa59 Merge branch 'master' of github.com:tantivy-search/tantivy 2018-05-08 00:08:17 -07:00
Paul Masurel
24050d0eb5 Remove some unsafe stuff, justified some of it. 2018-05-07 23:57:53 -07:00
Jason Wolfe
89eb209ece #294: Make fieldnorm module public, add documentation (#295) 2018-05-07 20:20:38 -07:00
Paul Masurel
9a0b7f9855 Rustfmt 2018-05-07 19:50:35 -07:00
Jason Wolfe
8e343b1ca3 Add fast field for associating arbitrary bytes to a document (#275)
* Add fast field for associating arbitrary bytes to a document

* Fix unused macro_use warning

* Improvements from code review

* Make BytesFastFieldWriter public

* Fix json parsing validation failure

* Add bytes fast field to CHANGELOG.md

* Fix compile errors from merge

* Support merging

* Address misc code review comments

* Fix comments from CR
2018-05-07 19:30:31 -07:00
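A sketch of declaring the new bytes fast field; the `add_bytes_field` method name is inferred from the feature description and should be treated as an assumption:

```rust
use tantivy::schema::SchemaBuilder;

fn main() {
    let mut schema_builder = SchemaBuilder::default();
    // Associates an arbitrary, uncompressed byte payload with each document.
    let payload = schema_builder.add_bytes_field("payload");
    let schema = schema_builder.build();
    let _ = (payload, schema);
}
```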
Paul Masurel
99c0b84036 Integrating #274, #280, #289 into master (#290)
* Integrating bugfixes into master

Closes #274
Closes #280
Closes #289

* Next version will be 0.6
2018-05-06 09:48:25 -07:00
Dru Sellers
ca74c14647 Simple Implementation of NGram Tokenizer (#278)
* Simple Implementation of NGram Tokenizer

It does not yet support edges
It could probably be better in many "rusty" ways
But the test is passing, so I'll call this a good stopping point for
the day.

* Remove Ngram from manager. Too many variations

* Basic configuration model

Should the extensive tests exist here?

* Add Sample to provide an End to End testing

* Basic Edgegram support

* cleanup

* code feedback

* More code review feedback processed
2018-05-06 09:47:49 -07:00
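A sketch of the n-gram tokenizer added here; the `(min_gram, max_gram, prefix_only)` constructor is an assumption, with `prefix_only = true` giving the edge-gram behaviour mentioned in the commit:

```rust
use tantivy::tokenizer::{NgramTokenizer, TokenStream, Tokenizer};

fn main() {
    // 2-grams and 3-grams over the whole token (not just its prefix).
    let tokenizer = NgramTokenizer::new(2, 3, false);
    let mut stream = tokenizer.token_stream("whale");
    while stream.advance() {
        let token = stream.token();
        println!("{} ({}..{})", token.text, token.offset_from, token.offset_to);
    }
}
```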
Dru Sellers
68ee18e4e8 Add Index::open_directory function (#285)
* Add Index::open_directory function

* dry
2018-05-03 00:07:46 -07:00
Paul Masurel
5637657c2f Removed ptr dereference for explicit ptr::read_unaligned 2018-04-25 19:15:32 +09:00
Paul Masurel
2e3c9a8878 Bugfix in murmurhash. 2018-04-25 19:06:31 +09:00
Paul Masurel
78673172d0 Cargo fmt 2018-04-21 20:05:36 +09:00
Paul Masurel
175b76f119 Removed streamdict
Closes #271
2018-04-21 19:55:41 +09:00
Paul Masurel
9b79e21bd7 Returning error when schema is not valid for a given query. 2018-04-19 13:02:30 +09:00
Paul Masurel
5e38ae336f Bump tantivy version and readded win deps 2018-04-17 18:27:57 +09:00
Paul Masurel
8604351f59 Hide some of the API
Added some doc.
2018-04-17 13:31:22 +09:00
Paul Masurel
6a48953d8a Closes #266 (#268)
PhraseQuery panics with a nice error message when the underlying field does not have any positions.
The `QueryParser` fails as well with a dedicated error.
2018-04-17 10:03:15 +09:00
pmasurel
0804b42afa Checking the type of range queries 2018-04-16 14:01:10 +09:00
Paul Masurel
8083bc6eef bench working 2018-04-15 12:25:38 +09:00
Paul Masurel
0156f88265 Compiles in stable rust 2018-04-15 11:03:44 +09:00
Paul Masurel
a1c07bf457 Added iterator for facet collector 2018-04-14 20:22:02 +09:00
Paul Masurel
9de74b68d1 Remove range argument 2018-04-13 18:34:23 +09:00
Paul Masurel
57c7073867 Removed 2018-04-13 09:43:36 +09:00
Paul Masurel
121374b89b Removed the need for AtomicU64 2018-04-12 22:08:15 +09:00
Paul Masurel
e44782bf14 No more 2018-04-12 13:01:11 +09:00
Paul Masurel
dfafb24fa6 Bumped bitpacker's version 2018-04-10 21:21:47 +09:00
jason-wolfe
4c6f9541e9 #263: Make MultiValueIntFastFieldWriter public, expose via FastFieldsWriter (#264) 2018-04-10 12:27:34 +09:00
231 changed files with 12579 additions and 7143 deletions

.github/ISSUE_TEMPLATE/bug_report.md (new file)

@@ -0,0 +1,19 @@
---
name: Bug report
about: Create a report to help us improve
---
**Describe the bug**
- What did you do?
- What happened?
- What was expected?
**Which version of tantivy are you using?**
If "master", ideally give the specific sha1 revision.
**To Reproduce**
If your bug is deterministic, can you give a minimal reproducing code?
Some bugs are not deterministic. Can you describe with precision in which context it happened?
If this is possible, can you share your code?


@@ -0,0 +1,14 @@
---
name: Feature request
about: Suggest an idea for this project
---
**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
**Describe the solution you'd like**
A clear and concise description of what you want to happen.
**[Optional] describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.

.github/ISSUE_TEMPLATE/question.md (new file)

@@ -0,0 +1,7 @@
---
name: Question
about: Ask any question about tantivy's usage...
---
Try to be specific about your use case...

.gitignore

@@ -1,3 +1,4 @@
tantivy.iml
*.swp
target
target/debug


@@ -1,14 +1,17 @@
# Based on the "trust" template v0.1.2
# https://github.com/japaric/trust/tree/v0.1.2
dist: trusty
language: rust
services: docker
sudo: required
cache: cargo
rust:
- nightly
env:
global:
- CC=gcc-4.8
- CXX=g++-4.8
- CRATE_NAME=tantivy
- TRAVIS_CARGO_NIGHTLY_FEATURE=""
- secure: eC8HjTi1wgRVCsMAeXEXt8Ckr0YBSGOEnQkkW4/Nde/OZ9jJjz2nmP1ELQlDE7+czHub2QvYtDMG0parcHZDx/Kus0yvyn08y3g2rhGIiE7y8OCvQm1Mybu2D/p7enm6shXquQ6Z5KRfRq+18mHy80wy9ABMA/ukEZdvnfQ76/Een8/Lb0eHaDoXDXn3PqLVtByvSfQQ7OhS60dEScu8PWZ6/l1057P5NpdWbMExBE7Ro4zYXNhkJeGZx0nP/Bd4Jjdt1XfPzMEybV6NZ5xsTILUBFTmOOt603IsqKGov089NExqxYu5bD3K+S4MzF1Nd6VhomNPJqLDCfhlymJCUj5n5Ku4yidlhQbM4Ej9nGrBalJnhcjBjPua5tmMF2WCxP9muKn/2tIOu1/+wc0vMf9Yd3wKIkf5+FtUxCgs2O+NslWvmOMAMI/yD25m7hb4t1IwE/4Bk+GVcWJRWXbo0/m6ZUHzRzdjUY2a1qvw7C9udzdhg7gcnXwsKrSWi2NjMiIVw86l+Zim0nLpKIN41sxZHLaFRG63Ki8zQ/481LGn32awJ6i3sizKS0WD+N1DfR2qYMrwYHaMN0uR0OFXYTJkFvTFttAeUY3EKmRKAuMhmO2YRdSr4/j/G5E9HMc1gSGJj6PxgpQU7EpvxRsmoVAEJr0mszmOj9icGHep/FM=
addons:
apt:
sources:
@@ -22,16 +25,56 @@ addons:
- libdw-dev
- binutils-dev
- cmake
matrix:
include:
# Android
- env: TARGET=aarch64-linux-android DISABLE_TESTS=1
#- env: TARGET=arm-linux-androideabi DISABLE_TESTS=1
#- env: TARGET=armv7-linux-androideabi DISABLE_TESTS=1
#- env: TARGET=i686-linux-android DISABLE_TESTS=1
#- env: TARGET=x86_64-linux-android DISABLE_TESTS=1
# Linux
#- env: TARGET=aarch64-unknown-linux-gnu
#- env: TARGET=i686-unknown-linux-gnu
- env: TARGET=x86_64-unknown-linux-gnu CODECOV=1
# - env: TARGET=x86_64-unknown-linux-musl CODECOV=1
# OSX
- env: TARGET=x86_64-apple-darwin
os: osx
before_install:
- set -e
- rustup self update
install:
- sh ci/install.sh
- source ~/.cargo/env || true
before_script:
- export PATH=$HOME/.cargo/bin:$PATH
- cargo install cargo-update || echo "cargo-update already installed"
- cargo install cargo-travis || echo "cargo-travis already installed"
script:
- cargo build
- cargo test
- cargo test -- --ignored
- cargo run --example simple_search
- cargo doc
after_success:
- cargo coveralls --exclude-pattern src/functional_test.rs
- cargo doc-upload
- bash ci/script.sh
before_deploy:
- sh ci/before_deploy.sh
cache: cargo
before_cache:
# Travis can't cache files that are not readable by "others"
- chmod -R a+r $HOME/.cargo
#branches:
# only:
# # release tags
# - /^v\d+\.\d+\.\d+.*$/
# - master
notifications:
email:
on_success: never

AUTHORS (new file)

@@ -0,0 +1,11 @@
# This is the list of authors of tantivy for copyright purposes.
Paul Masurel
Laurentiu Nicola
Dru Sellers
Ashley Mannix
Michael J. Curry
Jason Wolfe
# As an employee of Google I am required to add Google LLC
# in the list of authors, but this project is not affiliated to Google
# in any other way.
Google LLC


@@ -1,9 +1,55 @@
Tantivy 0.5.2
Tantivy 0.7.1
=====================
- Bugfix: NGramTokenizer panics on non ascii chars
- Added a space usage API
Tantivy 0.7
=====================
- Skip data for doc ids and positions (@fulmicoton),
greatly improving performance
- Tantivy error now rely on the failure crate (@drusellers)
- Added support for `AND`, `OR`, `NOT` syntax in addition to the `+`,`-` syntax
- Added a snippet generator with highlight (@vigneshsarma, @fulmicoton)
- Added a `TopFieldCollector` (@pentlander)
Tantivy 0.6.1
=========================
- Bugfix #324. GC was removing files that were still in use
- Added support for parsing AllQuery and RangeQuery via QueryParser
- AllQuery: `*`
- RangeQuery:
- Inclusive `field:[startIncl to endIncl]`
- Exclusive `field:{startExcl to endExcl}`
- Mixed `field:[startIncl to endExcl}` and vice versa
- Unbounded `field:[start to *]`, `field:[* to end]`
Tantivy 0.6
==========================
- Removed C code. Tantivy is now pure Rust.
- BM25
- Approximate field norms encoded over 1 byte.
Special thanks to @drusellers and @jason-wolfe for their contributions
to this release!
- Removed C code. Tantivy is now pure Rust. (@pmasurel)
- BM25 (@pmasurel)
- Approximate field norms encoded over 1 byte. (@pmasurel)
- Compiles on stable rust (@pmasurel)
- Add &[u8] fastfield for associating arbitrary bytes to each document (@jason-wolfe) (#270)
- Completely uncompressed
- Internally: One u64 fast field for indexes, one fast field for the bytes themselves.
- Add NGram token support (@drusellers)
- Add Stopword Filter support (@drusellers)
- Add a FuzzyTermQuery (@drusellers)
- Add a RegexQuery (@drusellers)
- Various performance improvements (@pmasurel)
Tantivy 0.5.2
===========================
- bugfix #274
- bugfix #280
- bugfix #289
Tantivy 0.5.1
==========================
@@ -81,7 +127,7 @@ Tantivy 0.3
Special thanks to @Kodraus @lnicola @Ameobea @manuel-woelker @celaus
for their contribution to this release.
Thanks also to everyone in tantivy gitter chat
Thanks also to everyone in tantivy gitter chat
for their advise and company :)
https://gitter.im/tantivy-search/tantivy
@@ -89,9 +135,9 @@ https://gitter.im/tantivy-search/tantivy
Warning:
Tantivy 0.3 is NOT backward compatible with tantivy 0.2
Tantivy 0.3 is NOT backward compatible with tantivy 0.2
code and index format.
You should not expect backward compatibility before
You should not expect backward compatibility before
tantivy 1.0.
@@ -117,7 +163,7 @@ Thanks to @KodrAus ! (#108)
the natural ordering.
- Building binary targets for tantivy-cli (Thanks to @KodrAus)
- Misc invisible bug fixes, and code cleanup.
- Use
- Use


@@ -1,10 +1,10 @@
[package]
name = "tantivy"
version = "0.5.1"
version = "0.7.1"
authors = ["Paul Masurel <paul.masurel@gmail.com>"]
license = "MIT"
categories = ["database-implementations", "data-structures"]
description = """Tantivy is a search engine library."""
description = """Search engine library"""
documentation = "https://tantivy-search.github.io/tantivy/tantivy/index.html"
homepage = "https://github.com/tantivy-search/tantivy"
repository = "https://github.com/tantivy-search/tantivy"
@@ -12,59 +12,69 @@ readme = "README.md"
keywords = ["search", "information", "retrieval"]
[dependencies]
base64 = "0.10.0"
byteorder = "1.0"
lazy_static = "0.2.1"
tinysegmenter = "0.1.0"
regex = "0.2"
fst = {version="0.2", default-features=false}
atomicwrites = {version="0.1", optional=true}
tempfile = "2.1"
log = "0.3.6"
combine = "2.2"
lazy_static = "1"
regex = "1.0"
fst = {version="0.3", default-features=false}
fst-regex = { version="0.2" }
lz4 = {version="1.20", optional=true}
snap = {version="0.2"}
atomicwrites = {version="0.2.2", optional=true}
tempfile = "3.0"
log = "0.4"
combine = "3"
tempdir = "0.3"
serde = "1.0"
serde_derive = "1.0"
serde_json = "1.0"
num_cpus = "1.2"
itertools = "0.5.9"
lz4 = "1.20"
bit-set = "0.4.0"
uuid = { version = "0.6", features = ["v4", "serde"] }
chan = "0.1"
crossbeam = "0.3"
itertools = "0.7"
levenshtein_automata = {version="0.1", features=["fst_automaton"]}
bit-set = "0.5"
uuid = { version = "0.7", features = ["v4", "serde"] }
crossbeam = "0.4"
crossbeam-channel = "0.2"
futures = "0.1"
futures-cpupool = "0.1"
error-chain = "0.8"
owning_ref = "0.3"
owning_ref = "0.4"
stable_deref_trait = "1.0.0"
rust-stemmers = "0.1.0"
downcast = { version="0.9", features = ["nightly"]}
rust-stemmers = "1"
downcast = { version="0.9" }
matches = "0.1"
bitpacking = "0.3"
bitpacking = "0.5"
census = "0.1"
fnv = "1.0.6"
owned-read = "0.4"
failure = "0.1"
htmlescape = "0.3.1"
fail = "0.2"
[target.'cfg(windows)'.dependencies]
winapi = "0.2"
[dev-dependencies]
rand = "0.3"
env_logger = "0.4"
rand = "0.5"
maplit = "1"
[profile.release]
opt-level = 3
debug = false
lto = true
debug-assertions = false
[profile.test]
debug-assertions = true
overflow-checks = true
[features]
default = ["mmap"]
streamdict = []
# by default no-fail is disabled. We manually enable it when running test.
default = ["mmap", "no_fail"]
mmap = ["fst/mmap", "atomicwrites"]
lz4-compression = ["lz4"]
no_fail = ["fail/no_fail"]
unstable = [] # useful for benches.
[badges]
travis-ci = { repository = "tantivy-search/tantivy" }
[[example]]
name = "simple_search"
required-features = ["mmap"]


@@ -1,4 +1,4 @@
Copyright (c) 2016 Paul Masurel
Copyright (c) 2018 by the project authors, as listed in the AUTHORS file.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:


@@ -1,39 +1,69 @@
![Tantivy](https://tantivy-search.github.io/logo/tantivy-logo.png)
[![Build Status](https://travis-ci.org/tantivy-search/tantivy.svg?branch=master)](https://travis-ci.org/tantivy-search/tantivy)
[![Coverage Status](https://coveralls.io/repos/github/tantivy-search/tantivy/badge.svg?branch=master&refresh1)](https://coveralls.io/github/tantivy-search/tantivy?branch=master)
[![codecov](https://codecov.io/gh/tantivy-search/tantivy/branch/master/graph/badge.svg)](https://codecov.io/gh/tantivy-search/tantivy)
[![Join the chat at https://gitter.im/tantivy-search/tantivy](https://badges.gitter.im/tantivy-search/tantivy.svg)](https://gitter.im/tantivy-search/tantivy?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Build status](https://ci.appveyor.com/api/projects/status/r7nb13kj23u8m9pj?svg=true)](https://ci.appveyor.com/project/fulmicoton/tantivy)
[![Build status](https://ci.appveyor.com/api/projects/status/r7nb13kj23u8m9pj/branch/master?svg=true)](https://ci.appveyor.com/project/fulmicoton/tantivy/branch/master)
[![Say Thanks!](https://img.shields.io/badge/Say%20Thanks-!-1EAEDB.svg)](https://saythanks.io/to/fulmicoton)
![Tantivy](https://tantivy-search.github.io/logo/tantivy-logo.png)
[![](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/images/0)](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/links/0)
[![](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/images/1)](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/links/1)
[![](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/images/2)](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/links/2)
[![](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/images/3)](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/links/3)
[![](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/images/4)](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/links/4)
[![](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/images/5)](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/links/5)
[![](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/images/6)](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/links/6)
[![](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/images/7)](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/links/7)
**Tantivy** is a **full text search engine library** written in rust.
It is strongly inspired by Lucene's design.
It is closer to [Apache Lucene](https://lucene.apache.org/) than to [Elastic Search](https://www.elastic.co/products/elasticsearch) and [Apache Solr](https://lucene.apache.org/solr/) in the sense it is not
an off-the-shelf search engine server, but rather a crate that can be used
to build such a search engine.
Tantivy is, in fact, strongly inspired by Lucene's design.
# Features
- Full-text search
- Fast (check out the :racehorse: :sparkles: [benchmark](https://tantivy-search.github.io/bench/) :sparkles: :racehorse:)
- Tiny startup time (<10ms), perfect for command line tools
- tf-idf scoring
- Basic query language
- Phrase queries
- BM25 scoring (the same as lucene)
- Natural query language `(michael AND jackson) OR "king of pop"`
- Phrase queries search (`"michael jackson"`)
- Incremental indexing
- Multithreaded indexing (indexing English Wikipedia takes < 3 minutes on my desktop)
- Mmap directory
- optional SIMD integer compression
- SIMD integer compression when the platform/CPU includes the SSE2 instruction set.
- Single valued and multivalued u64 and i64 fast fields (equivalent of doc values in Lucene)
- `&[u8]` fast fields
- LZ4 compressed document store
- Range queries
- Faceting
- configurable indexing (optional term frequency and position indexing
- Faceted search
- Configurable indexing (optional term frequency and position indexing)
- Cheesy logo with a horse
Tantivy supports Linux, MacOS and Windows.
# Non-features
- Distributed search is out of the scope of tantivy. That being said, tantivy is meant as a
library upon which one could build a distributed search. Serializable/mergeable collector state for instance,
are within the scope of tantivy.
# Supported OS and compiler
Tantivy works on stable rust (>= 1.27) and supports Linux, MacOS and Windows.
# Getting started
- [tantivy's usage example](http://fulmicoton.com/tantivy-examples/simple_search.html)
- [tantivy's simple search example](http://fulmicoton.com/tantivy-examples/simple_search.html)
- [tantivy-cli and its tutorial](https://github.com/tantivy-search/tantivy-cli).
`tantivy-cli` is an actual command line interface that makes it easy for you to create a search engine,
index documents and search via the CLI or a small server with a REST API.
It will walk you through getting a wikipedia search engine up and running in a few minutes.
- [reference doc]
- [For the last released version](https://docs.rs/tantivy/)
@@ -43,43 +73,18 @@ It will walk you through getting a wikipedia search engine up and running in a f
## Development
Tantivy requires Rust Nightly because it uses requires the features [`box_syntax`](https://doc.rust-lang.org/stable/unstable-book/language-features/box-syntax.html), [`optin_builtin_traits`](https://github.com/rust-lang/rfcs/blob/master/text/0019-opt-in-builtin-traits.md), [`conservative_impl_trait`](https://github.com/rust-lang/rfcs/blob/master/text/1522-conservative-impl-trait.md),
and [simd](https://github.com/rust-lang/rust/issues/27731).
To check out and run test, you can simply run :
Tantivy compiles on stable rust but requires `Rust >= 1.27`.
To check out and run tests, you can simply run :
git clone git@github.com:tantivy-search/tantivy.git
cd tantivy
cargo +nightly build
cargo build
## Running tests
## Note on release build and performance
If your project depends on `tantivy`, for better performance, make sure to enable
`sse3` instructions using a RUSTFLAGS. (This instruction set is likely to
be available on most `x86_64` CPUs you will encounter).
For instance,
RUSTFLAGS='-C target-feature=+sse3'
Or, if you are targeting a specific CPU
RUSTFLAGS='-C target-cpu=native' build --release
Regardless of the flags you pass, by default `tantivy` will contain `SSE3` instructions.
If you want to disable those, you can run the following command :
cargo build --no-default-features
Alternatively, if you are trying to compile `tantivy` without simd compression,
you can disable this functionality. In this case, this submodule is not required
and you can compile tantivy by using the `--no-default-features` flag.
cargo build --no-default-features
Some tests will not run with just `cargo test` because of `fail-rs`.
To run the tests exhaustively, run `./run-tests.sh`.
# Contribute
Send me an email (paul.masurel at gmail.com) if you want to contribute to tantivy.
Send me an email (paul.masurel at gmail.com) if you want to contribute to tantivy.


@@ -4,11 +4,8 @@
os: Visual Studio 2015
environment:
matrix:
- channel: nightly
- channel: stable
target: x86_64-pc-windows-msvc
- channel: nightly
target: x86_64-pc-windows-gnu
msys_bits: 64
install:
- appveyor DownloadFile https://win.rustup.rs/ -FileName rustup-init.exe
@@ -21,5 +18,5 @@ install:
build: false
test_script:
- REM SET RUST_LOG=tantivy,test & cargo test --verbose
- REM SET RUST_BACKTRACE=1 & cargo run --example simple_search
- REM SET RUST_LOG=tantivy,test & cargo test --verbose --no-default-features --features mmap -- --test-threads 1
- REM SET RUST_BACKTRACE=1 & cargo build --examples

ci/before_deploy.ps1 (new file)

@@ -0,0 +1,23 @@
# This script takes care of packaging the build artifacts that will go in the
# release zipfile
$SRC_DIR = $PWD.Path
$STAGE = [System.Guid]::NewGuid().ToString()
Set-Location $ENV:Temp
New-Item -Type Directory -Name $STAGE
Set-Location $STAGE
$ZIP = "$SRC_DIR\$($Env:CRATE_NAME)-$($Env:APPVEYOR_REPO_TAG_NAME)-$($Env:TARGET).zip"
# TODO Update this to package the right artifacts
Copy-Item "$SRC_DIR\target\$($Env:TARGET)\release\hello.exe" '.\'
7z a "$ZIP" *
Push-AppveyorArtifact "$ZIP"
Remove-Item *.* -Force
Set-Location ..
Remove-Item $STAGE
Set-Location $SRC_DIR

ci/before_deploy.sh (new file)

@@ -0,0 +1,33 @@
# This script takes care of building your crate and packaging it for release
set -ex
main() {
local src=$(pwd) \
stage=
case $TRAVIS_OS_NAME in
linux)
stage=$(mktemp -d)
;;
osx)
stage=$(mktemp -d -t tmp)
;;
esac
test -f Cargo.lock || cargo generate-lockfile
# TODO Update this to build the artifacts that matter to you
cross rustc --bin hello --target $TARGET --release -- -C lto
# TODO Update this to package the right artifacts
cp target/$TARGET/release/hello $stage/
cd $stage
tar czf $src/$CRATE_NAME-$TRAVIS_TAG-$TARGET.tar.gz *
cd $src
rm -rf $stage
}
main

ci/install.sh (new file)

@@ -0,0 +1,47 @@
set -ex
main() {
local target=
if [ $TRAVIS_OS_NAME = linux ]; then
target=x86_64-unknown-linux-musl
sort=sort
else
target=x86_64-apple-darwin
sort=gsort # for `sort --sort-version`, from brew's coreutils.
fi
# Builds for iOS are done on OSX, but require the specific target to be
# installed.
case $TARGET in
aarch64-apple-ios)
rustup target install aarch64-apple-ios
;;
armv7-apple-ios)
rustup target install armv7-apple-ios
;;
armv7s-apple-ios)
rustup target install armv7s-apple-ios
;;
i386-apple-ios)
rustup target install i386-apple-ios
;;
x86_64-apple-ios)
rustup target install x86_64-apple-ios
;;
esac
# This fetches latest stable release
local tag=$(git ls-remote --tags --refs --exit-code https://github.com/japaric/cross \
| cut -d/ -f3 \
| grep -E '^v[0.1.0-9.]+$' \
| $sort --version-sort \
| tail -n1)
curl -LSfs https://japaric.github.io/trust/install.sh | \
sh -s -- \
--force \
--git japaric/cross \
--tag $tag \
--target $target
}
main

ci/script.sh (new file)

@@ -0,0 +1,29 @@
#!/usr/bin/env bash
# This script takes care of testing your crate
set -ex
main() {
if [ ! -z $CODECOV ]; then
echo "Codecov"
cargo build --verbose && cargo coverage --verbose && bash <(curl -s https://codecov.io/bash) -s target/kcov
else
echo "Build"
cross build --target $TARGET
if [ ! -z $DISABLE_TESTS ]; then
return
fi
echo "Test"
cross test --target $TARGET --no-default-features --features mmap -- --test-threads 1
fi
for example in $(ls examples/*.rs)
do
cargo run --example $(basename $example .rs)
done
}
# we don't run the "test phase" when doing deploys
if [ -z $TRAVIS_TAG ]; then
main
fi

doc/.gitignore (new file)

@@ -0,0 +1 @@
book

doc/book.toml (new file)

@@ -0,0 +1,5 @@
[book]
authors = ["Paul Masurel"]
multilingual = false
src = "src"
title = "Tantivy, the user guide"

doc/src/SUMMARY.md (new file)

@@ -0,0 +1,16 @@
# Summary
[Avant Propos](./avant-propos.md)
- [Schema](./schema.md)
- [Indexing](./indexing.md)
- [Segments](./basis.md)
- [Facetting](./facetting.md)
- [Innerworkings](./innerworkings.md)
- [Inverted index](./inverted_index.md)
- [Best practice](./inverted_index.md)
[Frequently Asked Questions](./faq.md)
[Examples](./examples.md)

doc/src/avant-propos.md (new file)

@@ -0,0 +1,33 @@
# Foreword, what is the scope of tantivy?
> Tantivy is a **search** engine **library** for Rust.
If you are familiar with Lucene, it is an excellent approximation to consider tantivy as Lucene for Rust. tantivy is heavily inspired by Lucene's design and
they both have the same scope and targeted use cases.
If you are not familiar with Lucene, let's break down our little tagline.
- **Search** here means full-text search: fundamentally, tantivy is here to help you
efficiently identify which documents in your corpus match a given query.
But modern search UIs are so much more: text processing, faceting, autocomplete, fuzzy search, good
relevancy, collapsing, highlighting, spatial search.
While some of these features are not available in tantivy yet, all of these are relevant
feature requests. Tantivy's objective is to offer a solid toolbox to create the best search
experience. But keep in mind this is just a toolbox.
Which brings us to the second keyword...
- **Library** means that you will have to write code. tantivy is not an *all-in-one* server solution like Elasticsearch, for instance.
Sometimes a functionality will not be available in tantivy because it is too
specific to your use case. By design, tantivy should make it possible to extend
the available set of features using the existing rock-solid datastructures.
Most frequently this will mean writing your own `Collector`, your own `Scorer` or your own
`TokenFilter`... Some of your requirements may also be related to
something closer to architecture or operations. For instance, you may
want to build a large corpus on Hadoop, fine-tune the merge policy to keep your
index sharded in a time-wise fashion, or you may want to convert an existing
index from a different format.
Tantivy exposes a lot of low-level APIs to do all of these things.

doc/src/basis.md (new file)

@@ -0,0 +1,77 @@
# Anatomy of an index
## Straight from disk
Tantivy accesses its data using an abstracting trait called `Directory`.
In theory, one can come and override the data access logic. In practice, the
trait somewhat assumes that your data can be mapped to memory, and tantivy
seems deeply married to using `mmap` for its IO [^1]; the only persistent
directory shipped with tantivy is the `MmapDirectory`.
While this design has some downsides, this greatly simplifies the source code of
tantivy. Caching is also entirely delegated to the OS.
`tantivy` works entirely (or almost) by directly reading the datastructures as they are laid out on disk. As a result, opening an index does not involve loading different datastructures from disk into random-access memory: starting a process, opening an index, and performing your first query can typically be done in a matter of milliseconds.
This is an interesting property for a command line search engine, or for a multi-tenant log search engine: spawning a new process for each new query can be a perfectly sensible solution in some use cases.
In later chapters, we will discuss tantivy's inverted index data layout.
One key takeaway is that, to achieve great performance, search indexes are extremely compact.
Of course this is crucial to reduce IO, and to ensure that as much of the index as possible can sit in RAM.
Also, whenever possible, data is accessed sequentially. This is an amazing property when tantivy needs to access the data from a spinning hard disk, but it is also
critical for performance if your data is read from an `SSD` or is even already in your page cache.
## Segments, and the log method
That kind of compact layout comes at one cost: it prevents our datastructures from being dynamic.
In fact, the `Directory` trait does not even allow you to modify part of a file.
To allow the addition / deletion of documents, and create the illusion that
your index is dynamic (i.e.: adding and deleting documents), tantivy uses a common database trick sometimes referred to as the *log method*.
Let's forget about deletes for a moment.
As you add documents, these documents are processed and stored in a dedicated datastructure, in a `RAM` buffer. This datastructure is not ready for search, but it is useful to receive your data and rearrange it very rapidly.
As you add documents, this buffer will reach its capacity and tantivy will transparently stop adding documents to it and start converting this datastructure to its final read-only format on disk. Once written, a brand-new empty buffer is available to resume adding documents.
The resulting chunk of index obtained after this serialization is called a `Segment`.
> A segment is a self-contained atomic piece of index. It is identified with a UUID, and all of its files are identified using the naming scheme : `<UUID>.*`.
Which brings us to the nature of a tantivy `Index`.
> A tantivy `Index` is a collection of `Segments`.
Physically, this really just means an index is a bunch of segment files in a given `Directory`,
linked together by a `meta.json` file. This transparency can become extremely handy
to get tantivy to fit your use case:
*Example 1* You could for instance use Hadoop to build a very large search index in a timely manner, copy all of the resulting segment files into the same directory and edit the `meta.json` to get a functional index.[^2]
*Example 2* You could also disable your merge policy and enforce daily segments. Removing data after one week can then be done very efficiently by just editing the `meta.json` and deleting the files associated with segment `D-7`.
# Merging
As you index more and more data, your index will accumulate more and more segments.
Having a lot of small segments is not really optimal. There is a bit of redundancy in having
all these term dictionaries. Also, when searching, we need to do term lookups as many times as we have segments, which can hurt search performance a bit.
That's where merging or compacting comes into play. Tantivy will continuously consider merge
opportunities and start merging segments in the background.
# Indexing throughput, number of indexing threads
[^1]: This may eventually change.
[^2]: Be careful however. By default these files will not be considered as *managed* by tantivy. This means they will never be garbage collected by tantivy, regardless of whether they become obsolete or not.
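A minimal sketch of the "straight from disk" property described in this chapter, assuming the `Index::open_in_dir` and `load_searchers` entry points that appear elsewhere in this changeset:

```rust
use tantivy::Index;

// Sketch only: opening an existing index is cheap because the on-disk
// datastructures are mmapped rather than decoded into RAM up front.
fn main() -> tantivy::Result<()> {
    let index = Index::open_in_dir("./index")?;
    // Loads (cheap, mmap-backed) searchers for the current set of segments.
    index.load_searchers()?;
    let searcher = index.searcher();
    println!("the index holds {} documents", searcher.num_docs());
    Ok(())
}
```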


doc/src/examples.md (new file)

@@ -0,0 +1,3 @@
# Examples
- [Basic search](/examples/basic_search.html)

doc/src/facetting.md (new file)

@@ -0,0 +1,5 @@
# Facetting
wewew
## weeewe

doc/src/faq.md (new, empty file)

doc/src/indexing.md (new, empty file)

doc/src/innerworkings.md (new file)

@@ -0,0 +1 @@
# Innerworkings


@@ -0,0 +1 @@
# Inverted index

doc/src/schema.md (new file)

@@ -0,0 +1,50 @@
# Schema
When starting a new project using tantivy, your first step will be to define your schema. Be aware that changing it will probably require you to reindex all of your data.
It is strongly recommended that you keep the means to iterate through your original data when this happens.
Unless specified otherwise, tantivy does not keep a raw version of your data,
so good practice is to rely on a separate storage system for your
raw documents.
The schema defines both the type of the fields you are indexing and the kind of indexing you want to apply to them. The set of search operations you will be able to perform depends on how you set up your schema.
Here is what defining your schema could look like.
```Rust
use tantivy::schema::{SchemaBuilder, TEXT, STORED, INT_INDEXED};
let mut schema_builder = SchemaBuilder::default();
let text_field = schema_builder.add_text_field("name", TEXT | STORED);
let tag_field = schema_builder.add_facet_field("tags");
let timestamp_field = schema_builder.add_u64_field("timestamp", INT_INDEXED);
let schema = schema_builder.build();
```
Notice how adding a new field to your schema builder
follows the pattern:
```verbatim
schema_builder.add_<fieldtype>_field("<fieldname>", <field_configuration>);
```
This method returns a `Field` handle that you will use for all kinds of operations involving this field later on.
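For instance, continuing the hypothetical snippet above, the handles are what you pass around when building documents or terms (`text_field` and `timestamp_field` are simply the names from that snippet):
```Rust
use tantivy::Document;

// `text_field` and `timestamp_field` are the handles returned by the
// schema builder in the snippet above.
let mut doc = Document::default();
doc.add_text(text_field, "Ford Mustang GT");
doc.add_u64(timestamp_field, 1_541_116_800);
```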
# Field types
Tantivy currently supports only 4 types:
- `text` (think `&str`)
- `u64` and `i64`
- `HierarchicalFacet`
Let's go over their specifics.
# Text
Full-text search is the bread and butter of search engines.
The key idea is fairly simple. Your text is broken apart into tokens (that's
what we call tokenization). Tantivy then keeps track of the list of documents containing each token.
In order to increase recall, you might want to normalize tokens. For instance,
you most likely want to lowercase your tokens so that documents match the query `cat` regardless of whether they contain the token `cat` or `Cat`.
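As a sketch of what this normalization looks like with tantivy's tokenizer API (the same API used by the ngram example later in this document; type names may vary slightly across versions), you could register a simple tokenizer followed by a lowercasing filter:
```Rust
use tantivy::tokenizer::*;
use tantivy::Index;

fn register_lowercasing_analyzer(index: &Index) {
    // Split the text into tokens, then lowercase every token so that
    // `cat`, `Cat` and `CAT` all end up indexed as `cat`.
    index
        .tokenizers()
        .register("simple_lowercase", SimpleTokenizer.filter(LowerCaser));
}
```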


@@ -1,26 +1,31 @@
extern crate tantivy;
// # Basic Example
//
// This example covers the basic functionalities of
// tantivy.
//
// We will:
// - define our schema
// - create an index in a directory
// - index a few documents in our index
// - search for the best document matching "sea whale"
// - retrieve the best document's original content.
extern crate tempdir;
// ---
// Importing tantivy...
#[macro_use]
extern crate serde_json;
use std::path::Path;
use tempdir::TempDir;
use tantivy::Index;
use tantivy::schema::*;
extern crate tantivy;
use tantivy::collector::TopCollector;
use tantivy::query::QueryParser;
use tantivy::schema::*;
use tantivy::Index;
fn main() {
fn main() -> tantivy::Result<()> {
// Let's create a temporary directory for the
// sake of this example
if let Ok(dir) = TempDir::new("tantivy_example_dir") {
run_example(dir.path()).unwrap();
dir.close().unwrap();
}
}
let index_path = TempDir::new("tantivy_example_dir")?;
fn run_example(index_path: &Path) -> tantivy::Result<()> {
// # Defining the schema
//
// The Tantivy index requires a very strict schema.
@@ -35,7 +40,7 @@ fn run_example(index_path: &Path) -> tantivy::Result<()> {
// We want full-text search for it, and we also want
// to be able to retrieve the document after the search.
//
// TEXT | STORED is some syntactic sugar to describe
// `TEXT | STORED` is some syntactic sugar to describe
// that.
//
// `TEXT` means the field should be tokenized and indexed,
@@ -64,21 +69,22 @@ fn run_example(index_path: &Path) -> tantivy::Result<()> {
//
// This will actually just save a meta.json
// with our schema in the directory.
let index = Index::create(index_path, schema.clone())?;
let index = Index::create_in_dir(&index_path, schema.clone())?;
// To insert documents we need an index writer.
// There must be only one writer at a time.
// This single `IndexWriter` is already
// multithreaded.
//
// Here we use a buffer of 50MB per thread. Using a bigger
// heap for the indexer can increase its throughput.
// Here we give tantivy a budget of `50MB`.
// Using a bigger heap for the indexer may increase
// throughput, but 50 MB is already plenty.
let mut index_writer = index.writer(50_000_000)?;
// Let's index our documents!
// We first need a handle on the title and the body field.
// ### Create a document "manually".
// ### Adding documents
//
// We can create a document manually, by setting the fields
// one by one in a Document object.
@@ -96,15 +102,11 @@ fn run_example(index_path: &Path) -> tantivy::Result<()> {
// ... and add it to the `IndexWriter`.
index_writer.add_document(old_man_doc);
// ### Create a document directly from json.
//
// Alternatively, we can use our schema to parse a
// document object directly from json.
// The document is a string, but we use the `json` macro
// from `serde_json` for the convenience of multi-line support.
let json = json!({
"title": "Of Mice and Men",
"body": "A few miles south of Soledad, the Salinas River drops in close to the hillside \
// For convenience, tantivy also comes with a macro to
// reduce the boilerplate above.
index_writer.add_document(doc!(
title => "Of Mice and Men",
body => "A few miles south of Soledad, the Salinas River drops in close to the hillside \
bank and runs deep and green. The water is warm too, for it has slipped twinkling \
over the yellow sands in the sunlight before reaching the narrow pool. On one \
side of the river the golden foothill slopes curve up to the strong and rocky \
@@ -112,30 +114,35 @@ fn run_example(index_path: &Path) -> tantivy::Result<()> {
fresh and green with every spring, carrying in their lower leaf junctures the \
debris of the winters flooding; and sycamores with mottled, white, recumbent \
limbs and branches that arch over the pool"
});
let mice_and_men_doc = schema.parse_document(&json.to_string())?;
));
index_writer.add_document(mice_and_men_doc);
index_writer.add_document(doc!(
title => "Of Mice and Men",
body => "A few miles south of Soledad, the Salinas River drops in close to the hillside \
bank and runs deep and green. The water is warm too, for it has slipped twinkling \
over the yellow sands in the sunlight before reaching the narrow pool. On one \
side of the river the golden foothill slopes curve up to the strong and rocky \
Gabilan Mountains, but on the valley side the water is lined with trees—willows \
fresh and green with every spring, carrying in their lower leaf junctures the \
debris of the winters flooding; and sycamores with mottled, white, recumbent \
limbs and branches that arch over the pool"
));
// Multi-valued field are allowed, they are
// expressed in JSON by an array.
// The following document has two titles.
let json = json!({
"title": ["Frankenstein", "The Modern Prometheus"],
"body": "You will rejoice to hear that no disaster has accompanied the commencement of an \
// Multivalued fields just need to be repeated.
index_writer.add_document(doc!(
title => "Frankenstein",
title => "The Modern Prometheus",
body => "You will rejoice to hear that no disaster has accompanied the commencement of an \
enterprise which you have regarded with such evil forebodings. I arrived here \
yesterday, and my first task is to assure my dear sister of my welfare and \
increasing confidence in the success of my undertaking."
});
let frankenstein_doc = schema.parse_document(&json.to_string())?;
index_writer.add_document(frankenstein_doc);
));
// This is an example, so we will only index 3 documents
// here. You can check out tantivy's tutorial to index
// the English wikipedia. Tantivy's indexing is rather fast.
// Indexing 5 million articles of the English wikipedia takes
// around 4 minutes on my computer!
// around 3 minutes on my computer!
// ### Committing
//
@@ -160,17 +167,29 @@ fn run_example(index_path: &Path) -> tantivy::Result<()> {
// # Searching
//
// ### Searcher
//
// Let's search our index. Start by reloading
// searchers in the index. This should be done
// after every commit().
// after every `commit()`.
index.load_searchers()?;
// Afterwards create one (or more) searchers.
// We now need to acquire a searcher.
// Some search experiences might require more than
// one query.
//
// You should create a searcher
// every time you start a "search query".
// The searcher ensures that we get to work
// with a consistent version of the index.
//
// Acquiring a `searcher` is very cheap.
//
// You should acquire a searcher every time you
// start processing a request, and release it
// right after your query is finished.
let searcher = index.searcher();
// ### Query
// The query parser can interpret human queries.
// Here, if the user does not specify which
// field they want to search, tantivy will search
@@ -211,15 +230,11 @@ fn run_example(index_path: &Path) -> tantivy::Result<()> {
// a title.
for doc_address in doc_addresses {
let retrieved_doc = searcher.doc(&doc_address)?;
let retrieved_doc = searcher.doc(doc_address)?;
println!("{}", schema.to_json(&retrieved_doc));
}
// Wait for indexing and merging threads to shut down.
// Usually this isn't needed, but in `main` we try to
// delete the temporary directory and that fails on
// Windows if the files are still open.
index_writer.wait_merging_threads()?;
Ok(())
}
use tempdir::TempDir;


@@ -0,0 +1,117 @@
// # Defining a tokenizer pipeline
//
// In this example, we'll see how to define a tokenizer pipeline
// by chaining a bunch of `TokenFilter`s.
#[macro_use]
extern crate tantivy;
use tantivy::collector::TopCollector;
use tantivy::query::QueryParser;
use tantivy::schema::*;
use tantivy::tokenizer::NgramTokenizer;
use tantivy::Index;
fn main() -> tantivy::Result<()> {
// # Defining the schema
//
// The Tantivy index requires a very strict schema.
// The schema declares which fields are in the index,
// and for each field, its type and "the way it should
// be indexed".
// first we need to define a schema ...
let mut schema_builder = SchemaBuilder::default();
// Our first field is title.
// In this example we want to use NGram searching.
// We will set the ngram size to 3 characters, so any
// three consecutive characters in the title should be findable.
let text_field_indexing = TextFieldIndexing::default()
.set_tokenizer("ngram3")
.set_index_option(IndexRecordOption::WithFreqsAndPositions);
let text_options = TextOptions::default()
.set_indexing_options(text_field_indexing)
.set_stored();
let title = schema_builder.add_text_field("title", text_options);
// Our second field is body.
// We want full-text search for it, but we do not
// need to be able to retrieve it
// for our application.
//
// We can make our index lighter
// by omitting the `STORED` flag.
let body = schema_builder.add_text_field("body", TEXT);
let schema = schema_builder.build();
// # Indexing documents
//
// Let's create a brand new index.
// To simplify we will work entirely in RAM.
// This is not what you want in reality, but it is very useful
// for your unit tests... Or this example.
let index = Index::create_in_ram(schema.clone());
// Here we are registering our custom tokenizer;
// it will index tokens of 3 characters each.
index
.tokenizers()
.register("ngram3", NgramTokenizer::new(3, 3, false));
// To insert documents we need an index writer.
// There must be only one writer at a time.
// This single `IndexWriter` is already
// multithreaded.
//
// Here we use a buffer of 50MB per thread. Using a bigger
// heap for the indexer can increase its throughput.
let mut index_writer = index.writer(50_000_000)?;
index_writer.add_document(doc!(
title => "The Old Man and the Sea",
body => "He was an old man who fished alone in a skiff in the Gulf Stream and \
he had gone eighty-four days now without taking a fish."
));
index_writer.add_document(doc!(
title => "Of Mice and Men",
body => r#"A few miles south of Soledad, the Salinas River drops in close to the hillside
bank and runs deep and green. The water is warm too, for it has slipped twinkling
over the yellow sands in the sunlight before reaching the narrow pool. On one
side of the river the golden foothill slopes curve up to the strong and rocky
Gabilan Mountains, but on the valley side the water is lined with trees—willows
fresh and green with every spring, carrying in their lower leaf junctures the
debris of the winters flooding; and sycamores with mottled, white, recumbent
limbs and branches that arch over the pool"#
));
index_writer.add_document(doc!(
title => "Frankenstein",
body => r#"You will rejoice to hear that no disaster has accompanied the commencement of an
enterprise which you have regarded with such evil forebodings. I arrived here
yesterday, and my first task is to assure my dear sister of my welfare and
increasing confidence in the success of my undertaking."#
));
index_writer.commit()?;
index.load_searchers()?;
let searcher = index.searcher();
// The query parser can interpret human queries.
// Here, if the user does not specify which
// field they want to search, tantivy will search
// in both title and body.
let query_parser = QueryParser::for_index(&index, vec![title, body]);
// here we want to get a hit on the 'ken' in Frankenstein
let query = query_parser.parse_query("ken")?;
let mut top_collector = TopCollector::with_limit(10);
searcher.search(&*query, &mut top_collector)?;
let doc_addresses = top_collector.docs();
for doc_address in doc_addresses {
let retrieved_doc = searcher.doc(doc_address)?;
println!("{}", schema.to_json(&retrieved_doc));
}
Ok(())
}


@@ -0,0 +1,143 @@
// # Deleting and Updating (?) documents
//
// This example explains how to delete and update documents.
// In fact there is actually no such thing as an update in tantivy.
//
// To update a document, you need to delete a document and then reinsert
// its new version.
//
// ---
// Importing tantivy...
#[macro_use]
extern crate tantivy;
use tantivy::collector::TopCollector;
use tantivy::query::TermQuery;
use tantivy::schema::*;
use tantivy::Index;
// A simple helper function to fetch a single document
// given its id from our index.
// It will be helpful to check our work.
fn extract_doc_given_isbn(index: &Index, isbn_term: &Term) -> tantivy::Result<Option<Document>> {
let searcher = index.searcher();
// This is the simplest query you can think of.
// It matches all of the documents containing a specific term.
//
// The second argument is here to tell we don't care about decoding positions,
// or term frequencies.
let term_query = TermQuery::new(isbn_term.clone(), IndexRecordOption::Basic);
let mut top_collector = TopCollector::with_limit(1);
searcher.search(&term_query, &mut top_collector)?;
if let Some(doc_address) = top_collector.docs().first() {
let doc = searcher.doc(*doc_address)?;
Ok(Some(doc))
} else {
// no doc matching this ID.
Ok(None)
}
}
fn main() -> tantivy::Result<()> {
// # Defining the schema
//
// Check out the *basic_search* example if this makes
// little sense to you.
let mut schema_builder = SchemaBuilder::default();
// Tantivy does not really have a notion of primary id.
// This may change in the future.
//
// Still, we can create an `isbn` field and use it as an id. This
// field can be a `u64` or a `text` field, depending on your use case.
// It just needs to be indexed.
//
// If it is `text`, let's make sure to keep it `raw` and let's avoid
// running any text processing on it.
// This is done by associating this field to the tokenizer named `raw`.
// Rather than building our [`TextOptions`](//docs.rs/tantivy/~0/tantivy/schema/struct.TextOptions.html) manually,
// We use the `STRING` shortcut. `STRING` stands for indexed (without term frequency or positions)
// and untokenized.
//
// Because we also want to be able to see this `id` in our returned documents,
// we also mark the field as stored.
let isbn = schema_builder.add_text_field("isbn", STRING | STORED);
let title = schema_builder.add_text_field("title", TEXT | STORED);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema.clone());
let mut index_writer = index.writer(50_000_000)?;
// Let's add a couple of documents, for the sake of the example.
let mut old_man_doc = Document::default();
old_man_doc.add_text(title, "The Old Man and the Sea");
index_writer.add_document(doc!(
isbn => "978-0099908401",
title => "The old Man and the see"
));
index_writer.add_document(doc!(
isbn => "978-0140177398",
title => "Of Mice and Men",
));
index_writer.add_document(doc!(
title => "Frankentein", //< Oops there is a typo here.
isbn => "978-9176370711",
));
index_writer.commit()?;
index.load_searchers()?;
let frankenstein_isbn = Term::from_field_text(isbn, "978-9176370711");
// Oops, our Frankenstein doc seems misspelled
let frankenstein_doc_misspelled = extract_doc_given_isbn(&index, &frankenstein_isbn)?.unwrap();
assert_eq!(
schema.to_json(&frankenstein_doc_misspelled),
r#"{"isbn":["978-9176370711"],"title":["Frankentein"]}"#,
);
// # Update = Delete + Insert
//
// Here we will want to update the typo in the `Frankenstein` book.
//
// Tantivy does not handle updates directly, we need to delete
// and reinsert the document.
//
// This can be complicated as it means you need to have access
// to the entire document. It is good practice to integrate tantivy
// with a key-value store for this reason.
//
// To remove one of the documents, we just call `delete_term`
// on its id.
//
// Note that `tantivy` does nothing to enforce the idea that
// there is only one document associated with this id.
//
// Also, you might have noticed that we apply the delete before
// having committed. This does not really matter...
index_writer.delete_term(frankenstein_isbn.clone());
// We now need to reinsert our document without the typo.
index_writer.add_document(doc!(
title => "Frankenstein",
isbn => "978-9176370711",
));
// You are guaranteed that your clients will only observe your index in
// the state it was in after a commit.
// In this example, your search engine will at no point be missing the *Frankenstein* document.
// Everything happened as if the document was updated.
index_writer.commit()?;
// We reload our searcher to make our change available to clients.
index.load_searchers()?;
// No more typo!
let frankenstein_new_doc = extract_doc_given_isbn(&index, &frankenstein_isbn)?.unwrap();
assert_eq!(
schema.to_json(&frankenstein_new_doc),
r#"{"isbn":["978-9176370711"],"title":["Frankenstein"]}"#,
);
Ok(())
}


@@ -0,0 +1,81 @@
// # Basic Example
//
// This example covers the basic functionalities of
// tantivy.
//
// We will:
// - define our schema, with a facet field
// - create an index in a directory
// - index a few documents with facets into our index
// - count the documents under each `/pools` facet
extern crate tempdir;
// ---
// Importing tantivy...
#[macro_use]
extern crate tantivy;
use tantivy::collector::FacetCollector;
use tantivy::query::AllQuery;
use tantivy::schema::*;
use tantivy::Index;
fn main() -> tantivy::Result<()> {
// Let's create a temporary directory for the
// sake of this example
let index_path = TempDir::new("tantivy_facet_example_dir")?;
let mut schema_builder = SchemaBuilder::default();
schema_builder.add_text_field("name", TEXT | STORED);
// this is our faceted field
schema_builder.add_facet_field("tags");
let schema = schema_builder.build();
let index = Index::create_in_dir(&index_path, schema.clone())?;
let mut index_writer = index.writer(50_000_000)?;
let name = schema.get_field("name").unwrap();
let tags = schema.get_field("tags").unwrap();
// For convenience, tantivy comes with a `doc!` macro
// that reduces the boilerplate of creating documents.
index_writer.add_document(doc!(
name => "the ditch",
tags => Facet::from("/pools/north")
));
index_writer.add_document(doc!(
name => "little stacey",
tags => Facet::from("/pools/south")
));
index_writer.commit()?;
index.load_searchers()?;
let searcher = index.searcher();
let mut facet_collector = FacetCollector::for_field(tags);
facet_collector.add_facet("/pools");
searcher.search(&AllQuery, &mut facet_collector).unwrap();
let counts = facet_collector.harvest();
// This lists all of the facet counts
let facets: Vec<(&Facet, u64)> = counts.get("/pools").collect();
assert_eq!(
facets,
vec![
(&Facet::from("/pools/north"), 1),
(&Facet::from("/pools/south"), 1),
]
);
Ok(())
}
use tempdir::TempDir;


@@ -1,2 +0,0 @@
#!/bin/bash
docco simple_search.rs -o html


@@ -1,518 +0,0 @@
/*--------------------- Typography ----------------------------*/
@font-face {
font-family: 'aller-light';
src: url('public/fonts/aller-light.eot');
src: url('public/fonts/aller-light.eot?#iefix') format('embedded-opentype'),
url('public/fonts/aller-light.woff') format('woff'),
url('public/fonts/aller-light.ttf') format('truetype');
font-weight: normal;
font-style: normal;
}
@font-face {
font-family: 'aller-bold';
src: url('public/fonts/aller-bold.eot');
src: url('public/fonts/aller-bold.eot?#iefix') format('embedded-opentype'),
url('public/fonts/aller-bold.woff') format('woff'),
url('public/fonts/aller-bold.ttf') format('truetype');
font-weight: normal;
font-style: normal;
}
@font-face {
font-family: 'roboto-black';
src: url('public/fonts/roboto-black.eot');
src: url('public/fonts/roboto-black.eot?#iefix') format('embedded-opentype'),
url('public/fonts/roboto-black.woff') format('woff'),
url('public/fonts/roboto-black.ttf') format('truetype');
font-weight: normal;
font-style: normal;
}
/*--------------------- Layout ----------------------------*/
html { height: 100%; }
body {
font-family: "aller-light";
font-size: 14px;
line-height: 18px;
color: #30404f;
margin: 0; padding: 0;
height:100%;
}
#container { min-height: 100%; }
a {
color: #000;
}
b, strong {
font-weight: normal;
font-family: "aller-bold";
}
p {
margin: 15px 0 0px;
}
.annotation ul, .annotation ol {
margin: 25px 0;
}
.annotation ul li, .annotation ol li {
font-size: 14px;
line-height: 18px;
margin: 10px 0;
}
h1, h2, h3, h4, h5, h6 {
color: #112233;
line-height: 1em;
font-weight: normal;
font-family: "roboto-black";
text-transform: uppercase;
margin: 30px 0 15px 0;
}
h1 {
margin-top: 40px;
}
h2 {
font-size: 1.26em;
}
hr {
border: 0;
background: 1px #ddd;
height: 1px;
margin: 20px 0;
}
pre, tt, code {
font-size: 12px; line-height: 16px;
font-family: Menlo, Monaco, Consolas, "Lucida Console", monospace;
margin: 0; padding: 0;
}
.annotation pre {
display: block;
margin: 0;
padding: 7px 10px;
background: #fcfcfc;
-moz-box-shadow: inset 0 0 10px rgba(0,0,0,0.1);
-webkit-box-shadow: inset 0 0 10px rgba(0,0,0,0.1);
box-shadow: inset 0 0 10px rgba(0,0,0,0.1);
overflow-x: auto;
}
.annotation pre code {
border: 0;
padding: 0;
background: transparent;
}
blockquote {
border-left: 5px solid #ccc;
margin: 0;
padding: 1px 0 1px 1em;
}
.sections blockquote p {
font-family: Menlo, Consolas, Monaco, monospace;
font-size: 12px; line-height: 16px;
color: #999;
margin: 10px 0 0;
white-space: pre-wrap;
}
ul.sections {
list-style: none;
padding:0 0 5px 0;;
margin:0;
}
/*
Force border-box so that % widths fit the parent
container without overlap because of margin/padding.
More Info : http://www.quirksmode.org/css/box.html
*/
ul.sections > li > div {
-moz-box-sizing: border-box; /* firefox */
-ms-box-sizing: border-box; /* ie */
-webkit-box-sizing: border-box; /* webkit */
-khtml-box-sizing: border-box; /* konqueror */
box-sizing: border-box; /* css3 */
}
/*---------------------- Jump Page -----------------------------*/
#jump_to, #jump_page {
margin: 0;
background: white;
-webkit-box-shadow: 0 0 25px #777; -moz-box-shadow: 0 0 25px #777;
-webkit-border-bottom-left-radius: 5px; -moz-border-radius-bottomleft: 5px;
font: 16px Arial;
cursor: pointer;
text-align: right;
list-style: none;
}
#jump_to a {
text-decoration: none;
}
#jump_to a.large {
display: none;
}
#jump_to a.small {
font-size: 22px;
font-weight: bold;
color: #676767;
}
#jump_to, #jump_wrapper {
position: fixed;
right: 0; top: 0;
padding: 10px 15px;
margin:0;
}
#jump_wrapper {
display: none;
padding:0;
}
#jump_to:hover #jump_wrapper {
display: block;
}
#jump_page_wrapper{
position: fixed;
right: 0;
top: 0;
bottom: 0;
}
#jump_page {
padding: 5px 0 3px;
margin: 0 0 25px 25px;
max-height: 100%;
overflow: auto;
}
#jump_page .source {
display: block;
padding: 15px;
text-decoration: none;
border-top: 1px solid #eee;
}
#jump_page .source:hover {
background: #f5f5ff;
}
#jump_page .source:first-child {
}
/*---------------------- Low resolutions (> 320px) ---------------------*/
@media only screen and (min-width: 320px) {
.pilwrap { display: none; }
ul.sections > li > div {
display: block;
padding:5px 10px 0 10px;
}
ul.sections > li > div.annotation ul, ul.sections > li > div.annotation ol {
padding-left: 30px;
}
ul.sections > li > div.content {
overflow-x:auto;
-webkit-box-shadow: inset 0 0 5px #e5e5ee;
box-shadow: inset 0 0 5px #e5e5ee;
border: 1px solid #dedede;
margin:5px 10px 5px 10px;
padding-bottom: 5px;
}
ul.sections > li > div.annotation pre {
margin: 7px 0 7px;
padding-left: 15px;
}
ul.sections > li > div.annotation p tt, .annotation code {
background: #f8f8ff;
border: 1px solid #dedede;
font-size: 12px;
padding: 0 0.2em;
}
}
/*---------------------- (> 481px) ---------------------*/
@media only screen and (min-width: 481px) {
#container {
position: relative;
}
body {
background-color: #F5F5FF;
font-size: 15px;
line-height: 21px;
}
pre, tt, code {
line-height: 18px;
}
p, ul, ol {
margin: 0 0 15px;
}
#jump_to {
padding: 5px 10px;
}
#jump_wrapper {
padding: 0;
}
#jump_to, #jump_page {
font: 10px Arial;
text-transform: uppercase;
}
#jump_page .source {
padding: 5px 10px;
}
#jump_to a.large {
display: inline-block;
}
#jump_to a.small {
display: none;
}
#background {
position: absolute;
top: 0; bottom: 0;
width: 350px;
background: #fff;
border-right: 1px solid #e5e5ee;
z-index: -1;
}
ul.sections > li > div.annotation ul, ul.sections > li > div.annotation ol {
padding-left: 40px;
}
ul.sections > li {
white-space: nowrap;
}
ul.sections > li > div {
display: inline-block;
}
ul.sections > li > div.annotation {
max-width: 350px;
min-width: 350px;
min-height: 5px;
padding: 13px;
overflow-x: hidden;
white-space: normal;
vertical-align: top;
text-align: left;
}
ul.sections > li > div.annotation pre {
margin: 15px 0 15px;
padding-left: 15px;
}
ul.sections > li > div.content {
padding: 13px;
vertical-align: top;
border: none;
-webkit-box-shadow: none;
box-shadow: none;
}
.pilwrap {
position: relative;
display: inline;
}
.pilcrow {
font: 12px Arial;
text-decoration: none;
color: #454545;
position: absolute;
top: 3px; left: -20px;
padding: 1px 2px;
opacity: 0;
-webkit-transition: opacity 0.2s linear;
}
.for-h1 .pilcrow {
top: 47px;
}
.for-h2 .pilcrow, .for-h3 .pilcrow, .for-h4 .pilcrow {
top: 35px;
}
ul.sections > li > div.annotation:hover .pilcrow {
opacity: 1;
}
}
/*---------------------- (> 1025px) ---------------------*/
@media only screen and (min-width: 1025px) {
body {
font-size: 16px;
line-height: 24px;
}
#background {
width: 525px;
}
ul.sections > li > div.annotation {
max-width: 525px;
min-width: 525px;
padding: 10px 25px 1px 50px;
}
ul.sections > li > div.content {
padding: 9px 15px 16px 25px;
}
}
/*---------------------- Syntax Highlighting -----------------------------*/
td.linenos { background-color: #f0f0f0; padding-right: 10px; }
span.lineno { background-color: #f0f0f0; padding: 0 5px 0 5px; }
/*
github.com style (c) Vasily Polovnyov <vast@whiteants.net>
*/
pre code {
display: block; padding: 0.5em;
color: #000;
background: #f8f8ff
}
pre .hljs-comment,
pre .hljs-template_comment,
pre .hljs-diff .hljs-header,
pre .hljs-javadoc {
color: #408080;
font-style: italic
}
pre .hljs-keyword,
pre .hljs-assignment,
pre .hljs-literal,
pre .hljs-css .hljs-rule .hljs-keyword,
pre .hljs-winutils,
pre .hljs-javascript .hljs-title,
pre .hljs-lisp .hljs-title,
pre .hljs-subst {
color: #954121;
/*font-weight: bold*/
}
pre .hljs-number,
pre .hljs-hexcolor {
color: #40a070
}
pre .hljs-string,
pre .hljs-tag .hljs-value,
pre .hljs-phpdoc,
pre .hljs-tex .hljs-formula {
color: #219161;
}
pre .hljs-title,
pre .hljs-id {
color: #19469D;
}
pre .hljs-params {
color: #00F;
}
pre .hljs-javascript .hljs-title,
pre .hljs-lisp .hljs-title,
pre .hljs-subst {
font-weight: normal
}
pre .hljs-class .hljs-title,
pre .hljs-haskell .hljs-label,
pre .hljs-tex .hljs-command {
color: #458;
font-weight: bold
}
pre .hljs-tag,
pre .hljs-tag .hljs-title,
pre .hljs-rules .hljs-property,
pre .hljs-django .hljs-tag .hljs-keyword {
color: #000080;
font-weight: normal
}
pre .hljs-attribute,
pre .hljs-variable,
pre .hljs-instancevar,
pre .hljs-lisp .hljs-body {
color: #008080
}
pre .hljs-regexp {
color: #B68
}
pre .hljs-class {
color: #458;
font-weight: bold
}
pre .hljs-symbol,
pre .hljs-ruby .hljs-symbol .hljs-string,
pre .hljs-ruby .hljs-symbol .hljs-keyword,
pre .hljs-ruby .hljs-symbol .hljs-keymethods,
pre .hljs-lisp .hljs-keyword,
pre .hljs-tex .hljs-special,
pre .hljs-input_number {
color: #990073
}
pre .hljs-builtin,
pre .hljs-constructor,
pre .hljs-built_in,
pre .hljs-lisp .hljs-title {
color: #0086b3
}
pre .hljs-preprocessor,
pre .hljs-pi,
pre .hljs-doctype,
pre .hljs-shebang,
pre .hljs-cdata {
color: #999;
font-weight: bold
}
pre .hljs-deletion {
background: #fdd
}
pre .hljs-addition {
background: #dfd
}
pre .hljs-diff .hljs-change {
background: #0086b3
}
pre .hljs-chunk {
color: #aaa
}
pre .hljs-tex .hljs-formula {
opacity: 0.5;
}

Binary image file not shown (56 KiB, removed).


@@ -1,375 +0,0 @@
/*! normalize.css v2.0.1 | MIT License | git.io/normalize */
/* ==========================================================================
HTML5 display definitions
========================================================================== */
/*
* Corrects `block` display not defined in IE 8/9.
*/
article,
aside,
details,
figcaption,
figure,
footer,
header,
hgroup,
nav,
section,
summary {
display: block;
}
/*
* Corrects `inline-block` display not defined in IE 8/9.
*/
audio,
canvas,
video {
display: inline-block;
}
/*
* Prevents modern browsers from displaying `audio` without controls.
* Remove excess height in iOS 5 devices.
*/
audio:not([controls]) {
display: none;
height: 0;
}
/*
* Addresses styling for `hidden` attribute not present in IE 8/9.
*/
[hidden] {
display: none;
}
/* ==========================================================================
Base
========================================================================== */
/*
* 1. Sets default font family to sans-serif.
* 2. Prevents iOS text size adjust after orientation change, without disabling
* user zoom.
*/
html {
font-family: sans-serif; /* 1 */
-webkit-text-size-adjust: 100%; /* 2 */
-ms-text-size-adjust: 100%; /* 2 */
}
/*
* Removes default margin.
*/
body {
margin: 0;
}
/* ==========================================================================
Links
========================================================================== */
/*
* Addresses `outline` inconsistency between Chrome and other browsers.
*/
a:focus {
outline: thin dotted;
}
/*
* Improves readability when focused and also mouse hovered in all browsers.
*/
a:active,
a:hover {
outline: 0;
}
/* ==========================================================================
Typography
========================================================================== */
/*
* Addresses `h1` font sizes within `section` and `article` in Firefox 4+,
* Safari 5, and Chrome.
*/
h1 {
font-size: 2em;
}
/*
* Addresses styling not present in IE 8/9, Safari 5, and Chrome.
*/
abbr[title] {
border-bottom: 1px dotted;
}
/*
* Addresses style set to `bolder` in Firefox 4+, Safari 5, and Chrome.
*/
b,
strong {
font-weight: bold;
}
/*
* Addresses styling not present in Safari 5 and Chrome.
*/
dfn {
font-style: italic;
}
/*
* Addresses styling not present in IE 8/9.
*/
mark {
background: #ff0;
color: #000;
}
/*
* Corrects font family set oddly in Safari 5 and Chrome.
*/
code,
kbd,
pre,
samp {
font-family: monospace, serif;
font-size: 1em;
}
/*
* Improves readability of pre-formatted text in all browsers.
*/
pre {
white-space: pre;
white-space: pre-wrap;
word-wrap: break-word;
}
/*
* Sets consistent quote types.
*/
q {
quotes: "\201C" "\201D" "\2018" "\2019";
}
/*
* Addresses inconsistent and variable font size in all browsers.
*/
small {
font-size: 80%;
}
/*
* Prevents `sub` and `sup` affecting `line-height` in all browsers.
*/
sub,
sup {
font-size: 75%;
line-height: 0;
position: relative;
vertical-align: baseline;
}
sup {
top: -0.5em;
}
sub {
bottom: -0.25em;
}
/* ==========================================================================
Embedded content
========================================================================== */
/*
* Removes border when inside `a` element in IE 8/9.
*/
img {
border: 0;
}
/*
* Corrects overflow displayed oddly in IE 9.
*/
svg:not(:root) {
overflow: hidden;
}
/* ==========================================================================
Figures
========================================================================== */
/*
* Addresses margin not present in IE 8/9 and Safari 5.
*/
figure {
margin: 0;
}
/* ==========================================================================
Forms
========================================================================== */
/*
* Define consistent border, margin, and padding.
*/
fieldset {
border: 1px solid #c0c0c0;
margin: 0 2px;
padding: 0.35em 0.625em 0.75em;
}
/*
* 1. Corrects color not being inherited in IE 8/9.
* 2. Remove padding so people aren't caught out if they zero out fieldsets.
*/
legend {
border: 0; /* 1 */
padding: 0; /* 2 */
}
/*
* 1. Corrects font family not being inherited in all browsers.
* 2. Corrects font size not being inherited in all browsers.
* 3. Addresses margins set differently in Firefox 4+, Safari 5, and Chrome
*/
button,
input,
select,
textarea {
font-family: inherit; /* 1 */
font-size: 100%; /* 2 */
margin: 0; /* 3 */
}
/*
* Addresses Firefox 4+ setting `line-height` on `input` using `!important` in
* the UA stylesheet.
*/
button,
input {
line-height: normal;
}
/*
* 1. Avoid the WebKit bug in Android 4.0.* where (2) destroys native `audio`
* and `video` controls.
* 2. Corrects inability to style clickable `input` types in iOS.
* 3. Improves usability and consistency of cursor style between image-type
* `input` and others.
*/
button,
html input[type="button"], /* 1 */
input[type="reset"],
input[type="submit"] {
-webkit-appearance: button; /* 2 */
cursor: pointer; /* 3 */
}
/*
* Re-set default cursor for disabled elements.
*/
button[disabled],
input[disabled] {
cursor: default;
}
/*
* 1. Addresses box sizing set to `content-box` in IE 8/9.
* 2. Removes excess padding in IE 8/9.
*/
input[type="checkbox"],
input[type="radio"] {
box-sizing: border-box; /* 1 */
padding: 0; /* 2 */
}
/*
* 1. Addresses `appearance` set to `searchfield` in Safari 5 and Chrome.
* 2. Addresses `box-sizing` set to `border-box` in Safari 5 and Chrome
* (include `-moz` to future-proof).
*/
input[type="search"] {
-webkit-appearance: textfield; /* 1 */
-moz-box-sizing: content-box;
-webkit-box-sizing: content-box; /* 2 */
box-sizing: content-box;
}
/*
* Removes inner padding and search cancel button in Safari 5 and Chrome
* on OS X.
*/
input[type="search"]::-webkit-search-cancel-button,
input[type="search"]::-webkit-search-decoration {
-webkit-appearance: none;
}
/*
* Removes inner padding and border in Firefox 4+.
*/
button::-moz-focus-inner,
input::-moz-focus-inner {
border: 0;
padding: 0;
}
/*
* 1. Removes default vertical scrollbar in IE 8/9.
* 2. Improves readability and alignment in all browsers.
*/
textarea {
overflow: auto; /* 1 */
vertical-align: top; /* 2 */
}
/* ==========================================================================
Tables
========================================================================== */
/*
* Remove most spacing between table cells.
*/
table {
border-collapse: collapse;
border-spacing: 0;
}


@@ -1,542 +0,0 @@
<!DOCTYPE html>
<html>
<head>
<title>simple_search.rs</title>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<meta name="viewport" content="width=device-width, target-densitydpi=160dpi, initial-scale=1.0; maximum-scale=1.0; user-scalable=0;">
<link rel="stylesheet" media="all" href="docco.css" />
</head>
<body>
<div id="container">
<div id="background"></div>
<ul class="sections">
<li id="title">
<div class="annotation">
<h1>simple_search.rs</h1>
</div>
</li>
<li id="section-1">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-1">&#182;</a>
</div>
</div>
<div class="content"><div class='highlight'><pre><span class="hljs-keyword">extern</span> <span class="hljs-keyword">crate</span> tantivy;
<span class="hljs-keyword">extern</span> <span class="hljs-keyword">crate</span> tempdir;
<span class="hljs-meta">#[macro_use]</span>
<span class="hljs-keyword">extern</span> <span class="hljs-keyword">crate</span> serde_json;
<span class="hljs-keyword">use</span> std::path::Path;
<span class="hljs-keyword">use</span> tempdir::TempDir;
<span class="hljs-keyword">use</span> tantivy::Index;
<span class="hljs-keyword">use</span> tantivy::schema::*;
<span class="hljs-keyword">use</span> tantivy::collector::TopCollector;
<span class="hljs-keyword">use</span> tantivy::query::QueryParser;
<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">main</span></span>() {</pre></div></div>
</li>
<li id="section-2">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-2">&#182;</a>
</div>
<p>Let's create a temporary directory for the
sake of this example</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">if</span> <span class="hljs-keyword">let</span> <span class="hljs-literal">Ok</span>(dir) = TempDir::new(<span class="hljs-string">"tantivy_example_dir"</span>) {
run_example(dir.path()).unwrap();
dir.close().unwrap();
}
}
<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">run_example</span></span>(index_path: &amp;Path) -&gt; tantivy::<span class="hljs-built_in">Result</span>&lt;()&gt; {</pre></div></div>
</li>
<li id="section-3">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-3">&#182;</a>
</div>
<h1 id="defining-the-schema">Defining the schema</h1>
<p>The Tantivy index requires a very strict schema.
The schema declares which fields are in the index,
and for each field, its type and “the way it should
be indexed”.</p>
</div>
</li>
<li id="section-4">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-4">&#182;</a>
</div>
<p>first we need to define a schema …</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> schema_builder = SchemaBuilder::<span class="hljs-keyword">default</span>();</pre></div></div>
</li>
<li id="section-5">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-5">&#182;</a>
</div>
<p>Our first field is title.
We want full-text search for it, and we also want
to be able to retrieve the document after the search.</p>
<p>TEXT | STORED is some syntactic sugar to describe
that.</p>
<p><code>TEXT</code> means the field should be tokenized and indexed,
along with its term frequency and term positions.</p>
<p><code>STORED</code> means that the field will also be saved
in a compressed, row-oriented key-value store.
This store is useful to reconstruct the
documents that were selected during the search phase.</p>
</div>
<div class="content"><div class='highlight'><pre> schema_builder.add_text_field(<span class="hljs-string">"title"</span>, TEXT | STORED);</pre></div></div>
</li>
<li id="section-6">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-6">&#182;</a>
</div>
<p>Our second field is body.
We want full-text search for it, but we do not
need to be able to be able to retrieve it
for our application. </p>
<p>We can make our index lighter and
by omitting <code>STORED</code> flag.</p>
</div>
<div class="content"><div class='highlight'><pre> schema_builder.add_text_field(<span class="hljs-string">"body"</span>, TEXT);
<span class="hljs-keyword">let</span> schema = schema_builder.build();</pre></div></div>
</li>
<li id="section-7">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-7">&#182;</a>
</div>
<h1 id="indexing-documents">Indexing documents</h1>
<p>Let's create a brand new index.</p>
<p>This will actually just save a meta.json
with our schema in the directory.</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">let</span> index = Index::create(index_path, schema.clone())?;</pre></div></div>
</li>
<li id="section-8">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-8">&#182;</a>
</div>
<p>To insert document we need an index writer.
There must be only one writer at a time.
This single <code>IndexWriter</code> is already
multithreaded.</p>
<p>Here we use a buffer of 50MB per thread. Using a bigger
heap for the indexer can increase its throughput.</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> index_writer = index.writer(<span class="hljs-number">50_000_000</span>)?;</pre></div></div>
</li>
<li id="section-9">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-9">&#182;</a>
</div>
<p>Let's index our documents!
We first need a handle on the title and the body field.</p>
</div>
</li>
<li id="section-10">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-10">&#182;</a>
</div>
<h3 id="create-a-document-manually-">Create a document “manually”.</h3>
<p>We can create a document manually, by setting the fields
one by one in a Document object.</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">let</span> title = schema.get_field(<span class="hljs-string">"title"</span>).unwrap();
<span class="hljs-keyword">let</span> body = schema.get_field(<span class="hljs-string">"body"</span>).unwrap();
<span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> old_man_doc = Document::<span class="hljs-keyword">default</span>();
old_man_doc.add_text(title, <span class="hljs-string">"The Old Man and the Sea"</span>);
old_man_doc.add_text(
body,
<span class="hljs-string">"He was an old man who fished alone in a skiff in the Gulf Stream and \
he had gone eighty-four days now without taking a fish."</span>,
);</pre></div></div>
</li>
<li id="section-11">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-11">&#182;</a>
</div>
<p>… and add it to the <code>IndexWriter</code>.</p>
</div>
<div class="content"><div class='highlight'><pre> index_writer.add_document(old_man_doc);</pre></div></div>
</li>
<li id="section-12">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-12">&#182;</a>
</div>
<h3 id="create-a-document-directly-from-json-">Create a document directly from json.</h3>
<p>Alternatively, we can use our schema to parse a
document object directly from json.
The document is a string, but we use the <code>json</code> macro
from <code>serde_json</code> for the convenience of multi-line support.</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">let</span> json = json!({
<span class="hljs-string">"title"</span>: <span class="hljs-string">"Of Mice and Men"</span>,
<span class="hljs-string">"body"</span>: <span class="hljs-string">"A few miles south of Soledad, the Salinas River drops in close to the hillside \
bank and runs deep and green. The water is warm too, for it has slipped twinkling \
over the yellow sands in the sunlight before reaching the narrow pool. On one \
side of the river the golden foothill slopes curve up to the strong and rocky \
Gabilan Mountains, but on the valley side the water is lined with trees—willows \
fresh and green with every spring, carrying in their lower leaf junctures the \
debris of the winters flooding; and sycamores with mottled, white, recumbent \
limbs and branches that arch over the pool"</span>
});
<span class="hljs-keyword">let</span> mice_and_men_doc = schema.parse_document(&amp;json.to_string())?;
index_writer.add_document(mice_and_men_doc);</pre></div></div>
</li>
<li id="section-13">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-13">&#182;</a>
</div>
<p>Multi-valued field are allowed, they are
expressed in JSON by an array.
The following document has two titles.</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">let</span> json = json!({
<span class="hljs-string">"title"</span>: [<span class="hljs-string">"Frankenstein"</span>, <span class="hljs-string">"The Modern Prometheus"</span>],
<span class="hljs-string">"body"</span>: <span class="hljs-string">"You will rejoice to hear that no disaster has accompanied the commencement of an \
enterprise which you have regarded with such evil forebodings. I arrived here \
yesterday, and my first task is to assure my dear sister of my welfare and \
increasing confidence in the success of my undertaking."</span>
});
<span class="hljs-keyword">let</span> frankenstein_doc = schema.parse_document(&amp;json.to_string())?;
index_writer.add_document(frankenstein_doc);</pre></div></div>
</li>
<li id="section-14">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-14">&#182;</a>
</div>
<p>This is an example, so we will only index 3 documents
here. You can check out tantivy's tutorial to index
the English wikipedia. Tantivy's indexing is rather fast.
Indexing 5 million articles of the English wikipedia takes
around 4 minutes on my computer!</p>
</div>
</li>
<li id="section-15">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-15">&#182;</a>
</div>
<h3 id="committing">Committing</h3>
<p>At this point our documents are not searchable.</p>
<p>We need to call .commit() explicitly to force the
index_writer to finish processing the documents in the queue,
flush the current index to the disk, and advertise
the existence of new documents.</p>
<p>This call is blocking.</p>
</div>
<div class="content"><div class='highlight'><pre> index_writer.commit()?;</pre></div></div>
</li>
<li id="section-16">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-16">&#182;</a>
</div>
<p>If <code>.commit()</code> returns correctly, then all of the
documents that have been added are guaranteed to be
persistently indexed.</p>
<p>In the scenario of a crash or a power failure,
tantivy behaves as if has rolled back to its last
commit.</p>
</div>
</li>
<li id="section-17">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-17">&#182;</a>
</div>
<h1 id="searching">Searching</h1>
<p>Let's search our index. Start by reloading
searchers in the index. This should be done
after every commit().</p>
</div>
<div class="content"><div class='highlight'><pre> index.load_searchers()?;</pre></div></div>
</li>
<li id="section-18">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-18">&#182;</a>
</div>
<p>Afterwards create one (or more) searchers.</p>
<p>You should create a searcher
every time you start a “search query”.</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">let</span> searcher = index.searcher();</pre></div></div>
</li>
<li id="section-19">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-19">&#182;</a>
</div>
<p>The query parser can interpret human queries.
Here, if the user does not specify which
field they want to search, tantivy will search
in both title and body.</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> query_parser = QueryParser::for_index(index, <span class="hljs-built_in">vec!</span>[title, body]);</pre></div></div>
</li>
<li id="section-20">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-20">&#182;</a>
</div>
<p>QueryParser may fail if the query is not in the right
format. For user facing applications, this can be a problem.
A ticket has been opened regarding this problem.</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">let</span> query = query_parser.parse_query(<span class="hljs-string">"sea whale"</span>)?;</pre></div></div>
</li>
<li id="section-21">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-21">&#182;</a>
</div>
<p>A query defines a set of documents, as
well as the way they should be scored.</p>
<p>A query created by the query parser is scored according
to a metric called Tf-Idf, and will consider
any document matching at least one of our terms.</p>
</div>
</li>
<li id="section-22">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-22">&#182;</a>
</div>
<h3 id="collectors">Collectors</h3>
<p>We are not interested in all of the documents but
only in the top 10. Keeping track of our top 10 best documents
is the role of the TopCollector.</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> top_collector = TopCollector::with_limit(<span class="hljs-number">10</span>);</pre></div></div>
</li>
<li id="section-23">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-23">&#182;</a>
</div>
<p>We can now perform our query.</p>
</div>
<div class="content"><div class='highlight'><pre> searcher.search(&amp;*query, &amp;<span class="hljs-keyword">mut</span> top_collector)?;</pre></div></div>
</li>
<li id="section-24">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-24">&#182;</a>
</div>
<p>Our top collector now contains the 10
most relevant doc ids…</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">let</span> doc_addresses = top_collector.docs();</pre></div></div>
</li>
<li id="section-25">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-25">&#182;</a>
</div>
<p>The actual documents still need to be
retrieved from Tantivy's store.</p>
<p>Since the body field was not configured as stored,
the document returned will only contain
a title.</p>
</div>
<div class="content"><div class='highlight'><pre>
<span class="hljs-keyword">for</span> doc_address <span class="hljs-keyword">in</span> doc_addresses {
<span class="hljs-keyword">let</span> retrieved_doc = searcher.doc(&amp;doc_address)?;
<span class="hljs-built_in">println!</span>(<span class="hljs-string">"{}"</span>, schema.to_json(&amp;retrieved_doc));
}</pre></div></div>
</li>
<li id="section-26">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-26">&#182;</a>
</div>
<p>Wait for indexing and merging threads to shut down.
Usually this isn't needed, but in <code>main</code> we try to
delete the temporary directory and that fails on
Windows if the files are still open.</p>
</div>
<div class="content"><div class='highlight'><pre> index_writer.wait_merging_threads()?;
<span class="hljs-literal">Ok</span>(())
}</pre></div></div>
</li>
</ul>
</div>
</body>
</html>


@@ -0,0 +1,133 @@
// # Iterating docs and positions.
//
// At its core, tantivy relies on a data structure
// called an inverted index.
//
// This example shows how to manually iterate through
// the list of documents containing a term, getting
// its term frequency, and accessing its positions.
// ---
// Importing tantivy...
#[macro_use]
extern crate tantivy;
use tantivy::schema::*;
use tantivy::Index;
use tantivy::{DocId, DocSet, Postings};
fn main() -> tantivy::Result<()> {
// We first create a schema for the sake of the
// example. Check the `basic_search` example for more information.
let mut schema_builder = SchemaBuilder::default();
// For this example, we need to make sure to index positions for our title
// field. `TEXT` precisely does this.
let title = schema_builder.add_text_field("title", TEXT | STORED);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema.clone());
let mut index_writer = index.writer_with_num_threads(1, 50_000_000)?;
index_writer.add_document(doc!(title => "The Old Man and the Sea"));
index_writer.add_document(doc!(title => "Of Mice and Men"));
index_writer.add_document(doc!(title => "The modern Promotheus"));
index_writer.commit()?;
index.load_searchers()?;
let searcher = index.searcher();
// A tantivy index is actually a collection of segments.
// Similarly, a searcher just wraps a list of `SegmentReader`s.
//
// (Because we indexed a very small number of documents over one thread
// there is actually only one segment here, but let's iterate through the list
// anyway)
for segment_reader in searcher.segment_readers() {
// A segment contains several data structures.
// The inverted index is the combination of
// - the term dictionary
// - the inverted lists associated with each term, and their positions
let inverted_index = segment_reader.inverted_index(title);
// A `Term` is a text token associated with a field.
// Let's go through all docs containing the term `title:the` and access their position
let term_the = Term::from_field_text(title, "the");
// This segment posting object is like a cursor over the documents matching the term.
// The `IndexRecordOption` argument tells tantivy we will be interested in both term frequencies
// and positions.
//
// If you don't need all this information, you may get better performance by decompressing less
// information.
if let Some(mut segment_postings) =
inverted_index.read_postings(&term_the, IndexRecordOption::WithFreqsAndPositions)
{
// this buffer will be used to request positions
let mut positions: Vec<u32> = Vec::with_capacity(100);
while segment_postings.advance() {
// The doc id of the document the cursor currently points to.
let doc_id: DocId = segment_postings.doc(); //< do not try to access this before calling advance once.
// The stream MAY contain deleted documents as well.
if segment_reader.is_deleted(doc_id) {
continue;
}
// the number of times the term appears in the document.
let term_freq: u32 = segment_postings.term_freq();
// Accessing positions is slightly expensive and lazy; do not request
// them if you do not need them for some documents.
segment_postings.positions(&mut positions);
// By definition we should have `term_freq` positions.
assert_eq!(positions.len(), term_freq as usize);
// This prints:
// ```
// Doc 0: TermFreq 2: [0, 4]
// Doc 2: TermFreq 1: [0]
// ```
println!("Doc {}: TermFreq {}: {:?}", doc_id, term_freq, positions);
}
}
}
// A `Term` is a text token associated with a field.
// Let's go through all docs containing the term `title:the` and access their position
let term_the = Term::from_field_text(title, "the");
// Some other powerful operations (especially `.skip_to`) may be useful to consume these
// posting lists rapidly.
// You can check for them in the [`DocSet`](https://docs.rs/tantivy/~0/tantivy/trait.DocSet.html) trait
// and the [`Postings`](https://docs.rs/tantivy/~0/tantivy/trait.Postings.html) trait
// Also, for some VERY specific high-performance use cases, like an OLAP analysis of logs,
// you can get better performance by directly accessing the blocks of doc ids.
for segment_reader in searcher.segment_readers() {
// A segment contains several data structures.
// The inverted index is the combination of
// - the term dictionary
// - the inverted lists associated with each term, and their positions
let inverted_index = segment_reader.inverted_index(title);
// This segment posting object is like a cursor over the documents matching the term.
// The `IndexRecordOption` argument tells tantivy we will be interested in both term frequencies
// and positions.
//
// If you don't need all this information, you may get better performance by decompressing less
// information.
if let Some(mut block_segment_postings) =
inverted_index.read_block_postings(&term_the, IndexRecordOption::Basic)
{
while block_segment_postings.advance() {
// Once again, these docs MAY contain deleted documents as well.
let docs = block_segment_postings.docs();
// Prints `Docs [0, 2].`
println!("Docs {:?}", docs);
}
}
}
Ok(())
}

71
examples/snippet.rs Normal file

@@ -0,0 +1,71 @@
// # Snippet example
//
// This example shows how to return a representative snippet of
// your hit result.
// A snippet is an extract of a target document, returned in HTML format.
// The keywords searched by the user are highlighted with a `<b>` tag.
extern crate tempdir;
// ---
// Importing tantivy...
#[macro_use]
extern crate tantivy;
use tantivy::collector::TopCollector;
use tantivy::query::QueryParser;
use tantivy::schema::*;
use tantivy::Index;
use tantivy::SnippetGenerator;
use tempdir::TempDir;
fn main() -> tantivy::Result<()> {
// Let's create a temporary directory for the
// sake of this example
let index_path = TempDir::new("tantivy_example_dir")?;
// # Defining the schema
let mut schema_builder = SchemaBuilder::default();
let title = schema_builder.add_text_field("title", TEXT | STORED);
let body = schema_builder.add_text_field("body", TEXT | STORED);
let schema = schema_builder.build();
// # Indexing documents
let index = Index::create_in_dir(&index_path, schema.clone())?;
let mut index_writer = index.writer(50_000_000)?;
// we'll only need one doc for this example.
index_writer.add_document(doc!(
title => "Of Mice and Men",
body => "A few miles south of Soledad, the Salinas River drops in close to the hillside \
bank and runs deep and green. The water is warm too, for it has slipped twinkling \
over the yellow sands in the sunlight before reaching the narrow pool. On one \
side of the river the golden foothill slopes curve up to the strong and rocky \
Gabilan Mountains, but on the valley side the water is lined with trees—willows \
fresh and green with every spring, carrying in their lower leaf junctures the \
debris of the winters flooding; and sycamores with mottled, white, recumbent \
limbs and branches that arch over the pool"
));
// ...
index_writer.commit()?;
index.load_searchers()?;
let searcher = index.searcher();
let query_parser = QueryParser::for_index(&index, vec![title, body]);
let query = query_parser.parse_query("sycamore spring")?;
let mut top_collector = TopCollector::with_limit(10);
searcher.search(&*query, &mut top_collector)?;
let snippet_generator = SnippetGenerator::new(&searcher, &*query, body)?;
let doc_addresses = top_collector.docs();
for doc_address in doc_addresses {
let doc = searcher.doc(doc_address)?;
let snippet = snippet_generator.snippet_from_doc(&doc);
println!("title: {}", doc.get_first(title).unwrap().text().unwrap());
println!("snippet: {}", snippet.to_html());
}
Ok(())
}

examples/stop_words.rs Normal file

@@ -0,0 +1,121 @@
// # Stop Words Example
//
// This example covers the basic usage of stop words
// with tantivy
//
// We will:
// - define our schema
// - create an index in a directory
// - add a few stop words
// - index a few documents in our index
extern crate tempdir;
// ---
// Importing tantivy...
#[macro_use]
extern crate tantivy;
use tantivy::collector::TopCollector;
use tantivy::query::QueryParser;
use tantivy::schema::*;
use tantivy::tokenizer::*;
use tantivy::Index;
fn main() -> tantivy::Result<()> {
// this example assumes you understand the content in `basic_search`
let mut schema_builder = SchemaBuilder::default();
// This configures how tantivy will store and process your
// content in the index. The key thing to note is that we are
// setting the tokenizer to `stoppy`, which will be defined
// and registered below.
let text_field_indexing = TextFieldIndexing::default()
.set_tokenizer("stoppy")
.set_index_option(IndexRecordOption::WithFreqsAndPositions);
let text_options = TextOptions::default()
.set_indexing_options(text_field_indexing)
.set_stored();
// Our first field is title.
schema_builder.add_text_field("title", text_options);
// Our second field is body.
let text_field_indexing = TextFieldIndexing::default()
.set_tokenizer("stoppy")
.set_index_option(IndexRecordOption::WithFreqsAndPositions);
let text_options = TextOptions::default()
.set_indexing_options(text_field_indexing)
.set_stored();
schema_builder.add_text_field("body", text_options);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema.clone());
// This tokenizer lowercases all of the text (to help with stop word matching),
// then removes all instances of `the` and `and` from the corpus.
let tokenizer = SimpleTokenizer
.filter(LowerCaser)
.filter(StopWordFilter::remove(vec![
"the".to_string(),
"and".to_string(),
]));
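// With this pipeline, a title like "The Old Man and the Sea" yields the tokens
// `old`, `man` and `sea`.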
index.tokenizers().register("stoppy", tokenizer);
let mut index_writer = index.writer(50_000_000)?;
let title = schema.get_field("title").unwrap();
let body = schema.get_field("body").unwrap();
index_writer.add_document(doc!(
title => "The Old Man and the Sea",
body => "He was an old man who fished alone in a skiff in the Gulf Stream and \
he had gone eighty-four days now without taking a fish."
));
index_writer.add_document(doc!(
title => "Of Mice and Men",
body => "A few miles south of Soledad, the Salinas River drops in close to the hillside \
bank and runs deep and green. The water is warm too, for it has slipped twinkling \
over the yellow sands in the sunlight before reaching the narrow pool. On one \
side of the river the golden foothill slopes curve up to the strong and rocky \
Gabilan Mountains, but on the valley side the water is lined with trees—willows \
fresh and green with every spring, carrying in their lower leaf junctures the \
debris of the winters flooding; and sycamores with mottled, white, recumbent \
limbs and branches that arch over the pool"
));
index_writer.add_document(doc!(
title => "Frankenstein",
body => "You will rejoice to hear that no disaster has accompanied the commencement of an \
enterprise which you have regarded with such evil forebodings. I arrived here \
yesterday, and my first task is to assure my dear sister of my welfare and \
increasing confidence in the success of my undertaking."
));
index_writer.commit()?;
index.load_searchers()?;
let searcher = index.searcher();
let query_parser = QueryParser::for_index(&index, vec![title, body]);
// Stop words are applied to the query as well.
// The following will be equivalent to `title:frankenstein`.
let query = query_parser.parse_query("title:\"the Frankenstein\"")?;
let mut top_collector = TopCollector::with_limit(10);
searcher.search(&*query, &mut top_collector)?;
let doc_addresses = top_collector.docs();
for doc_address in doc_addresses {
let retrieved_doc = searcher.doc(doc_address)?;
println!("{}", schema.to_json(&retrieved_doc));
}
Ok(())
}


@@ -0,0 +1,41 @@
extern crate tantivy;
use tantivy::schema::*;
// # Document from JSON
//
// For convenience, `Document` can be parsed directly from JSON.
fn main() -> tantivy::Result<()> {
// Let's first define a schema and an index.
// Check out the basic example if this is confusing to you.
//
// first we need to define a schema ...
let mut schema_builder = SchemaBuilder::default();
schema_builder.add_text_field("title", TEXT | STORED);
schema_builder.add_text_field("body", TEXT);
schema_builder.add_u64_field("year", INT_INDEXED);
let schema = schema_builder.build();
// Let's assume we have a json-serialized document.
let mice_and_men_doc_json = r#"{
"title": "Of Mice and Men",
"year": 1937
}"#;
// We can parse our document
let _mice_and_men_doc = schema.parse_document(&mice_and_men_doc_json)?;
// Multi-valued fields are allowed; they are
// expressed in JSON as an array.
// The following document has two titles.
let frankenstein_json = r#"{
"title": ["Frankenstein", "The Modern Prometheus"],
"year": 1818
}"#;
let _frankenstein_doc = schema.parse_document(&frankenstein_json)?;
// Note that the schema is saved in your index directory.
//
// As a result, an `Index` is aware of its schema, and you can use this feature
// simply by opening an existing `Index` and calling `index.schema().parse_document(json)`.
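// For instance, given an `Index` handle named `index` (hypothetical here, since this
// example never opens one), the call would simply be:
//
//     let schema = index.schema();
//     let _doc = schema.parse_document(frankenstein_json)?;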
Ok(())
}

run-tests.sh Executable file

@@ -0,0 +1,2 @@
#!/bin/bash
cargo test --no-default-features --features mmap -- --test-threads 1


@@ -1,9 +1,9 @@
use Result;
use collector::Collector;
use DocId;
use Result;
use Score;
use SegmentLocalId;
use SegmentReader;
use DocId;
use Score;
/// Collector that does nothing.
/// This is used in the chain Collector and will hopefully
@@ -25,6 +25,57 @@ impl Collector for DoNothingCollector {
/// Zero-cost abstraction used to collect on multiple collectors.
/// This contraption is only usable if the types of your collectors
/// are known at compile time.
///
/// ```rust
/// #[macro_use]
/// extern crate tantivy;
/// use tantivy::schema::{SchemaBuilder, TEXT};
/// use tantivy::{Index, Result};
/// use tantivy::collector::{CountCollector, TopCollector, chain};
/// use tantivy::query::QueryParser;
///
/// # fn main() { example().unwrap(); }
/// fn example() -> Result<()> {
/// let mut schema_builder = SchemaBuilder::new();
/// let title = schema_builder.add_text_field("title", TEXT);
/// let schema = schema_builder.build();
/// let index = Index::create_in_ram(schema);
/// {
/// let mut index_writer = index.writer(3_000_000)?;
/// index_writer.add_document(doc!(
/// title => "The Name of the Wind",
/// ));
/// index_writer.add_document(doc!(
/// title => "The Diary of Muadib",
/// ));
/// index_writer.add_document(doc!(
/// title => "A Dairy Cow",
/// ));
/// index_writer.add_document(doc!(
/// title => "The Diary of a Young Girl",
/// ));
/// index_writer.commit().unwrap();
/// }
///
/// index.load_searchers()?;
/// let searcher = index.searcher();
///
/// {
/// let mut top_collector = TopCollector::with_limit(2);
/// let mut count_collector = CountCollector::default();
/// {
/// let mut collectors = chain().push(&mut top_collector).push(&mut count_collector);
/// let query_parser = QueryParser::for_index(&index, vec![title]);
/// let query = query_parser.parse_query("diary")?;
/// searcher.search(&*query, &mut collectors).unwrap();
/// }
/// assert_eq!(count_collector.count(), 2);
/// assert!(top_collector.at_capacity());
/// }
///
/// Ok(())
/// }
/// ```
pub struct ChainedCollector<Left: Collector, Right: Collector> {
left: Left,
right: Right,


@@ -1,12 +1,59 @@
use super::Collector;
use DocId;
use Score;
use Result;
use SegmentReader;
use Score;
use SegmentLocalId;
use SegmentReader;
/// `CountCollector` collector only counts how many
/// documents match the query.
///
/// ```rust
/// #[macro_use]
/// extern crate tantivy;
/// use tantivy::schema::{SchemaBuilder, TEXT};
/// use tantivy::{Index, Result};
/// use tantivy::collector::CountCollector;
/// use tantivy::query::QueryParser;
///
/// # fn main() { example().unwrap(); }
/// fn example() -> Result<()> {
/// let mut schema_builder = SchemaBuilder::new();
/// let title = schema_builder.add_text_field("title", TEXT);
/// let schema = schema_builder.build();
/// let index = Index::create_in_ram(schema);
/// {
/// let mut index_writer = index.writer(3_000_000)?;
/// index_writer.add_document(doc!(
/// title => "The Name of the Wind",
/// ));
/// index_writer.add_document(doc!(
/// title => "The Diary of Muadib",
/// ));
/// index_writer.add_document(doc!(
/// title => "A Dairy Cow",
/// ));
/// index_writer.add_document(doc!(
/// title => "The Diary of a Young Girl",
/// ));
/// index_writer.commit().unwrap();
/// }
///
/// index.load_searchers()?;
/// let searcher = index.searcher();
///
/// {
/// let mut count_collector = CountCollector::default();
/// let query_parser = QueryParser::for_index(&index, vec![title]);
/// let query = query_parser.parse_query("diary")?;
/// searcher.search(&*query, &mut count_collector).unwrap();
///
/// assert_eq!(count_collector.count(), 2);
/// }
///
/// Ok(())
/// }
/// ```
#[derive(Default)]
pub struct CountCollector {
count: usize,


@@ -1,27 +1,25 @@
use std::mem;
use collector::Collector;
use docset::SkipResult;
use fastfield::FacetReader;
use schema::Facet;
use schema::Field;
use std::cell::UnsafeCell;
use schema::Facet;
use std::collections::btree_map;
use std::collections::BTreeMap;
use std::collections::BTreeSet;
use std::collections::BinaryHeap;
use std::collections::Bound;
use termdict::TermDictionary;
use termdict::TermStreamer;
use termdict::TermStreamerBuilder;
use std::collections::BTreeSet;
use termdict::TermMerger;
use docset::SkipResult;
use std::{usize, u64};
use std::iter::Peekable;
use std::mem;
use std::{u64, usize};
use termdict::TermMerger;
use std::cmp::Ordering;
use DocId;
use Result;
use Score;
use SegmentReader;
use SegmentLocalId;
use std::cmp::Ordering;
use SegmentReader;
struct Hit<'a> {
count: u64,
@@ -344,16 +342,19 @@ impl FacetCollector {
pub fn harvest(mut self) -> FacetCounts {
self.finalize_segment();
let collapsed_facet_ords: Vec<&[u64]> = self.segment_counters
let collapsed_facet_ords: Vec<&[u64]> = self
.segment_counters
.iter()
.map(|segment_counter| &segment_counter.facet_ords[..])
.collect();
let collapsed_facet_counts: Vec<&[u64]> = self.segment_counters
let collapsed_facet_counts: Vec<&[u64]> = self
.segment_counters
.iter()
.map(|segment_counter| &segment_counter.facet_counts[..])
.collect();
let facet_streams = self.segment_counters
let facet_streams = self
.segment_counters
.iter()
.map(|seg_counts| seg_counts.facet_reader.facet_dict().range().into_stream())
.collect::<Vec<_>>();
@@ -376,13 +377,13 @@ impl FacetCollector {
} else {
collapsed_facet_counts[seg_ord][collapsed_term_id]
}
})
.unwrap_or(0)
})
.sum();
}).unwrap_or(0)
}).sum();
if count > 0u64 {
let bytes = facet_merger.key().to_owned();
facet_counts.insert(Facet::from_encoded(bytes), count);
let bytes: Vec<u8> = facet_merger.key().to_owned();
// may create a corrupted facet if the term dictionary is corrupted
let facet = unsafe { Facet::from_encoded(bytes) };
facet_counts.insert(facet, count);
}
}
FacetCounts { facet_counts }
@@ -402,7 +403,8 @@ impl Collector for FacetCollector {
fn collect(&mut self, doc: DocId, _: Score) {
let facet_reader: &mut FacetReader = unsafe {
&mut *self.ff_reader
&mut *self
.ff_reader
.as_ref()
.expect("collect() was called before set_segment. This should never happen.")
.get()
@@ -432,9 +434,20 @@ pub struct FacetCounts {
facet_counts: BTreeMap<Facet, u64>,
}
pub struct FacetChildIterator<'a> {
underlying: btree_map::Range<'a, Facet, u64>,
}
impl<'a> Iterator for FacetChildIterator<'a> {
type Item = (&'a Facet, u64);
fn next(&mut self) -> Option<Self::Item> {
self.underlying.next().map(|(facet, count)| (facet, *count))
}
}
impl FacetCounts {
#[allow(needless_lifetimes)] //< compiler fails if we remove the lifetime
pub fn get<'a, T>(&'a self, facet_from: T) -> impl Iterator<Item = (&'a Facet, u64)>
pub fn get<T>(&self, facet_from: T) -> FacetChildIterator
where
Facet: From<T>,
{
@@ -443,15 +456,13 @@ impl FacetCounts {
let right_bound = if facet.is_root() {
Bound::Unbounded
} else {
let mut facet_after_bytes = facet.encoded_bytes().to_owned();
let mut facet_after_bytes: Vec<u8> = facet.encoded_bytes().to_owned();
facet_after_bytes.push(1u8);
let facet_after = Facet::from_encoded(facet_after_bytes);
let facet_after = unsafe { Facet::from_encoded(facet_after_bytes) }; // ok logic
Bound::Excluded(facet_after)
};
self.facet_counts
.range((left_bound, right_bound))
.map(|(facet, count)| (facet, *count))
let underlying: btree_map::Range<_, _> = self.facet_counts.range((left_bound, right_bound));
FacetChildIterator { underlying }
}
pub fn top_k<T>(&self, facet: T, k: usize) -> Vec<(&Facet, u64)>
@@ -461,17 +472,24 @@ impl FacetCounts {
let mut heap = BinaryHeap::with_capacity(k);
let mut it = self.get(facet);
// push the first k elements to bring the heap
// to capacity
for (facet, count) in (&mut it).take(k) {
heap.push(Hit { count, facet });
}
let mut lowest_count: u64 = heap.peek().map(|hit| hit.count).unwrap_or(u64::MIN);
let mut lowest_count: u64 = heap.peek().map(|hit| hit.count).unwrap_or(u64::MIN); //< the `unwrap_or` case may be triggered but the value
// is never used in that case.
for (facet, count) in it {
if count > lowest_count {
lowest_count = count;
if let Some(mut head) = heap.peek_mut() {
*head = Hit { count, facet };
}
// the heap gets reconstructed at this point
if let Some(head) = heap.peek() {
lowest_count = head.count;
}
}
}
heap.into_sorted_vec()
@@ -483,14 +501,14 @@ impl FacetCounts {
#[cfg(test)]
mod tests {
use test::Bencher;
use core::Index;
use schema::{Document, Facet, SchemaBuilder};
use query::AllQuery;
use super::{FacetCollector, FacetCounts};
use std::iter;
use schema::Field;
use core::Index;
use query::AllQuery;
use rand::distributions::Uniform;
use rand::{thread_rng, Rng};
use schema::Field;
use schema::{Document, Facet, SchemaBuilder};
use std::iter;
#[test]
fn test_facet_collector_drilldown() {
@@ -499,7 +517,7 @@ mod tests {
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let mut index_writer = index.writer(3_000_000).unwrap();
let mut index_writer = index.writer_with_num_threads(1, 3_000_000).unwrap();
let num_facets: usize = 3 * 4 * 5;
let facets: Vec<Facet> = (0..num_facets)
.map(|mut n| {
@@ -509,8 +527,7 @@ mod tests {
n /= 4;
let leaf = n % 5;
Facet::from(&format!("/top{}/mid{}/leaf{}", top, mid, leaf))
})
.collect();
}).collect();
for i in 0..num_facets * 10 {
let mut doc = Document::new();
doc.add_facet(facet_field, facets[i % num_facets].clone());
@@ -537,7 +554,8 @@ mod tests {
("/top1/mid1", 50),
("/top1/mid2", 50),
("/top1/mid3", 50),
].iter()
]
.iter()
.map(|&(facet_str, count)| (String::from(facet_str), count))
.collect::<Vec<_>>()
);
@@ -545,14 +563,41 @@ mod tests {
}
#[test]
#[should_panic(expected = "Tried to add a facet which is a descendant of \
an already added facet.")]
#[should_panic(
expected = "Tried to add a facet which is a descendant of \
an already added facet."
)]
fn test_misused_facet_collector() {
let mut facet_collector = FacetCollector::for_field(Field(0));
facet_collector.add_facet(Facet::from("/country"));
facet_collector.add_facet(Facet::from("/country/europe"));
}
#[test]
fn test_doc_unsorted_multifacet() {
let mut schema_builder = SchemaBuilder::new();
let facet_field = schema_builder.add_facet_field("facets");
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_with_num_threads(1, 3_000_000).unwrap();
index_writer.add_document(doc!(
facet_field => Facet::from_text(&"/subjects/A/a"),
facet_field => Facet::from_text(&"/subjects/B/a"),
facet_field => Facet::from_text(&"/subjects/A/b"),
facet_field => Facet::from_text(&"/subjects/B/b"),
));
index_writer.commit().unwrap();
index.load_searchers().unwrap();
let searcher = index.searcher();
assert_eq!(searcher.num_docs(), 1);
let mut facet_collector = FacetCollector::for_field(facet_field);
facet_collector.add_facet("/subjects");
searcher.search(&AllQuery, &mut facet_collector).unwrap();
let counts = facet_collector.harvest();
let facets: Vec<(&Facet, u64)> = counts.get("/subjects").collect();
assert_eq!(facets[0].1, 1);
}
#[test]
fn test_non_used_facet_collector() {
let mut facet_collector = FacetCollector::for_field(Field(0));
@@ -567,17 +612,23 @@ mod tests {
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let uniform = Uniform::new_inclusive(1, 100_000);
let mut docs: Vec<Document> = vec![("a", 10), ("b", 100), ("c", 7), ("d", 12), ("e", 21)]
.into_iter()
.flat_map(|(c, count)| {
let facet = Facet::from(&format!("/facet_{}", c));
let facet = Facet::from(&format!("/facet/{}", c));
let doc = doc!(facet_field => facet);
iter::repeat(doc).take(count)
})
.collect();
}).map(|mut doc| {
doc.add_facet(
facet_field,
&format!("/facet/{}", thread_rng().sample(&uniform)),
);
doc
}).collect();
thread_rng().shuffle(&mut docs[..]);
let mut index_writer = index.writer(3_000_000).unwrap();
let mut index_writer = index.writer_with_num_threads(1, 3_000_000).unwrap();
for doc in docs {
index_writer.add_document(doc);
}
@@ -587,23 +638,36 @@ mod tests {
let searcher = index.searcher();
let mut facet_collector = FacetCollector::for_field(facet_field);
facet_collector.add_facet("/");
facet_collector.add_facet("/facet");
searcher.search(&AllQuery, &mut facet_collector).unwrap();
let counts: FacetCounts = facet_collector.harvest();
{
let facets: Vec<(&Facet, u64)> = counts.top_k("/", 3);
let facets: Vec<(&Facet, u64)> = counts.top_k("/facet", 3);
assert_eq!(
facets,
vec![
(&Facet::from("/facet_b"), 100),
(&Facet::from("/facet_e"), 21),
(&Facet::from("/facet_d"), 12),
(&Facet::from("/facet/b"), 100),
(&Facet::from("/facet/e"), 21),
(&Facet::from("/facet/d"), 12),
]
);
}
}
}
#[cfg(all(test, feature = "unstable"))]
mod bench {
use collector::FacetCollector;
use query::AllQuery;
use rand::{thread_rng, Rng};
use schema::Facet;
use schema::SchemaBuilder;
use test::Bencher;
use Index;
#[bench]
fn bench_facet_collector(b: &mut Bencher) {
let mut schema_builder = SchemaBuilder::new();
@@ -621,7 +685,7 @@ mod tests {
// 40425 docs
thread_rng().shuffle(&mut docs[..]);
let mut index_writer = index.writer(3_000_000).unwrap();
let mut index_writer = index.writer_with_num_threads(1, 3_000_000).unwrap();
for doc in docs {
index_writer.add_document(doc);
}


@@ -2,11 +2,11 @@
Defines how the documents matching a search query should be processed.
*/
use SegmentReader;
use SegmentLocalId;
use DocId;
use Score;
use Result;
use Score;
use SegmentLocalId;
use SegmentReader;
mod count_collector;
pub use self::count_collector::CountCollector;
@@ -15,13 +15,20 @@ mod multi_collector;
pub use self::multi_collector::MultiCollector;
mod top_collector;
pub use self::top_collector::TopCollector;
mod top_score_collector;
pub use self::top_score_collector::TopScoreCollector;
#[deprecated]
pub use self::top_score_collector::TopScoreCollector as TopCollector;
mod top_field_collector;
pub use self::top_field_collector::TopFieldCollector;
mod facet_collector;
pub use self::facet_collector::FacetCollector;
mod chained_collector;
pub use self::chained_collector::chain;
pub use self::chained_collector::{chain, ChainedCollector};
/// Collectors are in charge of collecting and retaining relevant
/// information from the document found and scored by the query.
@@ -89,13 +96,13 @@ impl<'a, C: Collector> Collector for &'a mut C {
pub mod tests {
use super::*;
use test::Bencher;
use DocId;
use Score;
use core::SegmentReader;
use SegmentLocalId;
use fastfield::BytesFastFieldReader;
use fastfield::FastFieldReader;
use schema::Field;
use DocId;
use Score;
use SegmentLocalId;
/// Stores all of the doc ids.
/// This collector is only used for tests.
@@ -186,6 +193,52 @@ pub mod tests {
}
}
/// Collects in order all of the fast field bytes for all of the
/// docs in the `DocSet`
///
/// This collector is mainly useful for tests.
pub struct BytesFastFieldTestCollector {
vals: Vec<u8>,
field: Field,
ff_reader: Option<BytesFastFieldReader>,
}
impl BytesFastFieldTestCollector {
pub fn for_field(field: Field) -> BytesFastFieldTestCollector {
BytesFastFieldTestCollector {
vals: Vec::new(),
field,
ff_reader: None,
}
}
pub fn vals(self) -> Vec<u8> {
self.vals
}
}
impl Collector for BytesFastFieldTestCollector {
fn set_segment(&mut self, _segment_local_id: u32, segment: &SegmentReader) -> Result<()> {
self.ff_reader = Some(segment.bytes_fast_field_reader(self.field)?);
Ok(())
}
fn collect(&mut self, doc: u32, _score: f32) {
let val = self.ff_reader.as_ref().unwrap().get_val(doc);
self.vals.extend(val);
}
fn requires_scoring(&self) -> bool {
false
}
}
}
#[cfg(all(test, feature = "unstable"))]
mod bench {
use collector::{Collector, CountCollector};
use test::Bencher;
#[bench]
fn build_collector(b: &mut Bencher) {
b.iter(|| {


@@ -1,14 +1,66 @@
use super::Collector;
use DocId;
use Score;
use Result;
use SegmentReader;
use Score;
use SegmentLocalId;
use SegmentReader;
/// Multicollector makes it possible to collect on more than one collector.
/// It should only be used for use cases where the collector types are unknown
/// at compile time.
/// If the types of the collectors are known, you should prefer to use `ChainedCollector`.
///
/// ```rust
/// #[macro_use]
/// extern crate tantivy;
/// use tantivy::schema::{SchemaBuilder, TEXT};
/// use tantivy::{Index, Result};
/// use tantivy::collector::{CountCollector, TopCollector, MultiCollector};
/// use tantivy::query::QueryParser;
///
/// # fn main() { example().unwrap(); }
/// fn example() -> Result<()> {
/// let mut schema_builder = SchemaBuilder::new();
/// let title = schema_builder.add_text_field("title", TEXT);
/// let schema = schema_builder.build();
/// let index = Index::create_in_ram(schema);
/// {
/// let mut index_writer = index.writer(3_000_000)?;
/// index_writer.add_document(doc!(
/// title => "The Name of the Wind",
/// ));
/// index_writer.add_document(doc!(
/// title => "The Diary of Muadib",
/// ));
/// index_writer.add_document(doc!(
/// title => "A Dairy Cow",
/// ));
/// index_writer.add_document(doc!(
/// title => "The Diary of a Young Girl",
/// ));
/// index_writer.commit().unwrap();
/// }
///
/// index.load_searchers()?;
/// let searcher = index.searcher();
///
/// {
/// let mut top_collector = TopCollector::with_limit(2);
/// let mut count_collector = CountCollector::default();
/// {
/// let mut collectors =
/// MultiCollector::from(vec![&mut top_collector, &mut count_collector]);
/// let query_parser = QueryParser::for_index(&index, vec![title]);
/// let query = query_parser.parse_query("diary")?;
/// searcher.search(&*query, &mut collectors).unwrap();
/// }
/// assert_eq!(count_collector.count(), 2);
/// assert!(top_collector.at_capacity());
/// }
///
/// Ok(())
/// }
/// ```
pub struct MultiCollector<'a> {
collectors: Vec<&'a mut Collector>,
}
@@ -48,11 +100,11 @@ impl<'a> Collector for MultiCollector<'a> {
mod tests {
use super::*;
use collector::{Collector, CountCollector, TopCollector};
use collector::{Collector, CountCollector, TopScoreCollector};
#[test]
fn test_multi_collector() {
let mut top_collector = TopCollector::with_limit(2);
let mut top_collector = TopScoreCollector::with_limit(2);
let mut count_collector = CountCollector::default();
{
let mut collectors =


@@ -1,61 +1,61 @@
use super::Collector;
use SegmentReader;
use SegmentLocalId;
use DocAddress;
use Result;
use std::collections::BinaryHeap;
use std::cmp::Ordering;
use std::collections::BinaryHeap;
use DocAddress;
use DocId;
use Score;
use SegmentLocalId;
// Rust heap is a max-heap and we need a min heap.
/// Contains a feature (field, score, etc.) of a document along with the document address.
///
/// It has a custom implementation of `PartialOrd` that reverses the order. This is because the
/// default Rust heap is a max heap, whereas a min heap is needed.
#[derive(Clone, Copy)]
struct GlobalScoredDoc {
score: Score,
pub struct ComparableDoc<T> {
feature: T,
doc_address: DocAddress,
}
impl PartialOrd for GlobalScoredDoc {
fn partial_cmp(&self, other: &GlobalScoredDoc) -> Option<Ordering> {
impl<T: PartialOrd> PartialOrd for ComparableDoc<T> {
fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
Some(self.cmp(other))
}
}
impl Ord for GlobalScoredDoc {
impl<T: PartialOrd> Ord for ComparableDoc<T> {
#[inline]
fn cmp(&self, other: &GlobalScoredDoc) -> Ordering {
fn cmp(&self, other: &Self) -> Ordering {
other
.score
.partial_cmp(&self.score)
.feature
.partial_cmp(&self.feature)
.unwrap_or_else(|| other.doc_address.cmp(&self.doc_address))
}
}
impl PartialEq for GlobalScoredDoc {
fn eq(&self, other: &GlobalScoredDoc) -> bool {
impl<T: PartialOrd> PartialEq for ComparableDoc<T> {
fn eq(&self, other: &Self) -> bool {
self.cmp(other) == Ordering::Equal
}
}
impl Eq for GlobalScoredDoc {}
impl<T: PartialOrd> Eq for ComparableDoc<T> {}
/// The Top Collector keeps track of the K documents
/// with the best scores.
/// sorted by type `T`.
///
/// The implementation is based on a `BinaryHeap`.
/// The theoretical complexity is `O(n log K)`.
pub struct TopCollector {
/// The theoretical complexity for collecting the top `K` out of `n` documents
/// is `O(n log K)`.
pub struct TopCollector<T> {
limit: usize,
heap: BinaryHeap<GlobalScoredDoc>,
heap: BinaryHeap<ComparableDoc<T>>,
segment_id: u32,
}
impl TopCollector {
impl<T: PartialOrd + Clone> TopCollector<T> {
/// Creates a top collector, with a number of documents equal to "limit".
///
/// # Panics
/// The method panics if limit is 0
pub fn with_limit(limit: usize) -> TopCollector {
pub fn with_limit(limit: usize) -> TopCollector<T> {
if limit < 1 {
panic!("Limit must be strictly greater than 0.");
}
@@ -71,23 +71,27 @@ impl TopCollector {
/// Calling this method triggers the sort.
/// The result of the sort is not cached.
pub fn docs(&self) -> Vec<DocAddress> {
self.score_docs()
self.top_docs()
.into_iter()
.map(|score_doc| score_doc.1)
.map(|(_feature, doc)| doc)
.collect()
}
/// Returns K best ScoredDocument sorted in decreasing order.
/// Returns K best FeatureDocuments sorted in decreasing order.
///
/// Calling this method triggers the sort.
/// The result of the sort is not cached.
pub fn score_docs(&self) -> Vec<(Score, DocAddress)> {
let mut scored_docs: Vec<GlobalScoredDoc> = self.heap.iter().cloned().collect();
scored_docs.sort();
scored_docs
pub fn top_docs(&self) -> Vec<(T, DocAddress)> {
let mut feature_docs: Vec<ComparableDoc<T>> = self.heap.iter().cloned().collect();
feature_docs.sort();
feature_docs
.into_iter()
.map(|GlobalScoredDoc { score, doc_address }| (score, doc_address))
.collect()
.map(
|ComparableDoc {
feature,
doc_address,
}| (feature, doc_address),
).collect()
}
/// Return true iff at least K documents have gone through
@@ -96,48 +100,47 @@ impl TopCollector {
pub fn at_capacity(&self) -> bool {
self.heap.len() >= self.limit
}
}
impl Collector for TopCollector {
fn set_segment(&mut self, segment_id: SegmentLocalId, _: &SegmentReader) -> Result<()> {
/// Sets the segment local ID for the collector
pub fn set_segment_id(&mut self, segment_id: SegmentLocalId) {
self.segment_id = segment_id;
Ok(())
}
fn collect(&mut self, doc: DocId, score: Score) {
/// Collects a document scored by the given feature
///
/// It collects documents until it has reached the max capacity. Once it reaches capacity, it
/// will compare the lowest scoring item with the given one and keep whichever is greater.
pub fn collect(&mut self, doc: DocId, feature: T) {
if self.at_capacity() {
// It's ok to unwrap as long as a limit of 0 is forbidden.
let limit_doc: GlobalScoredDoc = *self.heap
let limit_doc: ComparableDoc<T> = self
.heap
.peek()
.expect("Top collector with size 0 is forbidden");
if limit_doc.score < score {
let mut mut_head = self.heap
.expect("Top collector with size 0 is forbidden")
.clone();
if limit_doc.feature < feature {
let mut mut_head = self
.heap
.peek_mut()
.expect("Top collector with size 0 is forbidden");
mut_head.score = score;
mut_head.feature = feature;
mut_head.doc_address = DocAddress(self.segment_id, doc);
}
} else {
let wrapped_doc = GlobalScoredDoc {
score,
let wrapped_doc = ComparableDoc {
feature,
doc_address: DocAddress(self.segment_id, doc),
};
self.heap.push(wrapped_doc);
}
}
fn requires_scoring(&self) -> bool {
true
}
}
#[cfg(test)]
mod tests {
use super::*;
use DocId;
use Score;
use collector::Collector;
#[test]
fn test_top_collector_not_at_capacity() {
@@ -147,7 +150,7 @@ mod tests {
top_collector.collect(5, 0.3);
assert!(!top_collector.at_capacity());
let score_docs: Vec<(Score, DocId)> = top_collector
.score_docs()
.top_docs()
.into_iter()
.map(|(score, doc_address)| (score, doc_address.doc()))
.collect();
@@ -165,7 +168,7 @@ mod tests {
assert!(top_collector.at_capacity());
{
let score_docs: Vec<(Score, DocId)> = top_collector
.score_docs()
.top_docs()
.into_iter()
.map(|(score, doc_address)| (score, doc_address.doc()))
.collect();
@@ -184,6 +187,7 @@ mod tests {
#[test]
#[should_panic]
fn test_top_0() {
TopCollector::with_limit(0);
let _collector: TopCollector<Score> = TopCollector::with_limit(0);
}
}


@@ -0,0 +1,263 @@
use super::Collector;
use collector::top_collector::TopCollector;
use fastfield::FastFieldReader;
use fastfield::FastValue;
use schema::Field;
use DocAddress;
use DocId;
use Result;
use Score;
use SegmentReader;
/// The Top Field Collector keeps track of the K documents
/// sorted by a fast field in the index
///
/// The implementation is based on a `BinaryHeap`.
/// The theoretical complexity for collecting the top `K` out of `n` documents
/// is `O(n log K)`.
///
/// ```rust
/// #[macro_use]
/// extern crate tantivy;
/// use tantivy::schema::{SchemaBuilder, TEXT, FAST};
/// use tantivy::{Index, Result, DocId};
/// use tantivy::collector::TopFieldCollector;
/// use tantivy::query::QueryParser;
///
/// # fn main() { example().unwrap(); }
/// fn example() -> Result<()> {
/// let mut schema_builder = SchemaBuilder::new();
/// let title = schema_builder.add_text_field("title", TEXT);
/// let rating = schema_builder.add_u64_field("rating", FAST);
/// let schema = schema_builder.build();
/// let index = Index::create_in_ram(schema);
/// {
/// let mut index_writer = index.writer_with_num_threads(1, 3_000_000)?;
/// index_writer.add_document(doc!(
/// title => "The Name of the Wind",
/// rating => 92u64,
/// ));
/// index_writer.add_document(doc!(
/// title => "The Diary of Muadib",
/// rating => 97u64,
/// ));
/// index_writer.add_document(doc!(
/// title => "A Dairy Cow",
/// rating => 63u64,
/// ));
/// index_writer.add_document(doc!(
/// title => "The Diary of a Young Girl",
/// rating => 80u64,
/// ));
/// index_writer.commit().unwrap();
/// }
///
/// index.load_searchers()?;
/// let searcher = index.searcher();
///
/// {
/// let mut top_collector = TopFieldCollector::with_limit(rating, 2);
/// let query_parser = QueryParser::for_index(&index, vec![title]);
/// let query = query_parser.parse_query("diary")?;
/// searcher.search(&*query, &mut top_collector).unwrap();
///
/// let score_docs: Vec<(u64, DocId)> = top_collector
/// .top_docs()
/// .into_iter()
/// .map(|(field, doc_address)| (field, doc_address.doc()))
/// .collect();
///
/// assert_eq!(score_docs, vec![(97u64, 1), (80, 3)]);
/// }
///
/// Ok(())
/// }
/// ```
pub struct TopFieldCollector<T: FastValue> {
field: Field,
collector: TopCollector<T>,
fast_field: Option<FastFieldReader<T>>,
}
impl<T: FastValue + PartialOrd + Clone> TopFieldCollector<T> {
/// Creates a top field collector, with a number of documents equal to "limit".
///
/// The given field must be a fast field, otherwise the collector will return an error
/// while collecting results.
///
/// # Panics
/// The method panics if limit is 0
pub fn with_limit(field: Field, limit: usize) -> Self {
TopFieldCollector {
field,
collector: TopCollector::with_limit(limit),
fast_field: None,
}
}
/// Returns the K best documents, sorted by the given field in decreasing order.
///
/// Calling this method triggers the sort.
/// The result of the sort is not cached.
pub fn docs(&self) -> Vec<DocAddress> {
self.collector.docs()
}
/// Returns K best FieldDocuments sorted in decreasing order.
///
/// Calling this method triggers the sort.
/// The result of the sort is not cached.
pub fn top_docs(&self) -> Vec<(T, DocAddress)> {
self.collector.top_docs()
}
/// Return true iff at least K documents have gone through
/// the collector.
#[inline]
pub fn at_capacity(&self) -> bool {
self.collector.at_capacity()
}
}
impl<T: FastValue + PartialOrd + Clone> Collector for TopFieldCollector<T> {
fn set_segment(&mut self, segment_id: u32, segment: &SegmentReader) -> Result<()> {
self.collector.set_segment_id(segment_id);
self.fast_field = Some(segment.fast_field_reader(self.field)?);
Ok(())
}
fn collect(&mut self, doc: DocId, _score: Score) {
let field_value = self
.fast_field
.as_ref()
.expect("collect() was called before set_segment. This should never happen.")
.get(doc);
self.collector.collect(doc, field_value);
}
fn requires_scoring(&self) -> bool {
false
}
}
#[cfg(test)]
mod tests {
use super::*;
use query::Query;
use query::QueryParser;
use schema::Field;
use schema::IntOptions;
use schema::Schema;
use schema::{SchemaBuilder, FAST, TEXT};
use Index;
use IndexWriter;
use TantivyError;
const TITLE: &str = "title";
const SIZE: &str = "size";
#[test]
fn test_top_collector_not_at_capacity() {
let mut schema_builder = SchemaBuilder::new();
let title = schema_builder.add_text_field(TITLE, TEXT);
let size = schema_builder.add_u64_field(SIZE, FAST);
let schema = schema_builder.build();
let (index, query) = index("beer", title, schema, |index_writer| {
index_writer.add_document(doc!(
title => "bottle of beer",
size => 12u64,
));
index_writer.add_document(doc!(
title => "growler of beer",
size => 64u64,
));
index_writer.add_document(doc!(
title => "pint of beer",
size => 16u64,
));
});
let searcher = index.searcher();
let mut top_collector = TopFieldCollector::with_limit(size, 4);
searcher.search(&*query, &mut top_collector).unwrap();
assert!(!top_collector.at_capacity());
let score_docs: Vec<(u64, DocId)> = top_collector
.top_docs()
.into_iter()
.map(|(field, doc_address)| (field, doc_address.doc()))
.collect();
assert_eq!(score_docs, vec![(64, 1), (16, 2), (12, 0)]);
}
#[test]
#[should_panic]
fn test_field_does_not_exist() {
let mut schema_builder = SchemaBuilder::new();
let title = schema_builder.add_text_field(TITLE, TEXT);
let size = schema_builder.add_u64_field(SIZE, FAST);
let schema = schema_builder.build();
let (index, _) = index("beer", title, schema, |index_writer| {
index_writer.add_document(doc!(
title => "bottle of beer",
size => 12u64,
));
});
let searcher = index.searcher();
let segment = searcher.segment_reader(0);
let mut top_collector: TopFieldCollector<u64> = TopFieldCollector::with_limit(Field(2), 4);
let _ = top_collector.set_segment(0, segment);
}
#[test]
fn test_field_not_fast_field() {
let mut schema_builder = SchemaBuilder::new();
let title = schema_builder.add_text_field(TITLE, TEXT);
let size = schema_builder.add_u64_field(SIZE, IntOptions::default());
let schema = schema_builder.build();
let (index, _) = index("beer", title, schema, |index_writer| {
index_writer.add_document(doc!(
title => "bottle of beer",
size => 12u64,
));
});
let searcher = index.searcher();
let segment = searcher.segment_reader(0);
let mut top_collector: TopFieldCollector<u64> = TopFieldCollector::with_limit(size, 4);
assert_matches!(
top_collector.set_segment(0, segment),
Err(TantivyError::FastFieldError(_))
);
}
#[test]
#[should_panic]
fn test_collect_before_set_segment() {
let mut top_collector: TopFieldCollector<u64> = TopFieldCollector::with_limit(Field(0), 4);
top_collector.collect(0, 0f32);
}
#[test]
#[should_panic]
fn test_top_0() {
let _: TopFieldCollector<u64> = TopFieldCollector::with_limit(Field(0), 0);
}
fn index(
query: &str,
query_field: Field,
schema: Schema,
mut doc_adder: impl FnMut(&mut IndexWriter) -> (),
) -> (Index, Box<Query>) {
let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_with_num_threads(1, 3_000_000).unwrap();
doc_adder(&mut index_writer);
index_writer.commit().unwrap();
index.load_searchers().unwrap();
let query_parser = QueryParser::for_index(&index, vec![query_field]);
let query = query_parser.parse_query(query).unwrap();
(index, query)
}
}


@@ -0,0 +1,187 @@
use super::Collector;
use collector::top_collector::TopCollector;
use DocAddress;
use DocId;
use Result;
use Score;
use SegmentLocalId;
use SegmentReader;
/// The Top Score Collector keeps track of the K documents
/// sorted by their score.
///
/// The implementation is based on a `BinaryHeap`.
/// The theoretical complexity for collecting the top `K` out of `n` documents
/// is `O(n log K)`.
///
/// ```rust
/// #[macro_use]
/// extern crate tantivy;
/// use tantivy::schema::{SchemaBuilder, TEXT};
/// use tantivy::{Index, Result, DocId, Score};
/// use tantivy::collector::TopScoreCollector;
/// use tantivy::query::QueryParser;
///
/// # fn main() { example().unwrap(); }
/// fn example() -> Result<()> {
/// let mut schema_builder = SchemaBuilder::new();
/// let title = schema_builder.add_text_field("title", TEXT);
/// let schema = schema_builder.build();
/// let index = Index::create_in_ram(schema);
/// {
/// let mut index_writer = index.writer_with_num_threads(1, 3_000_000)?;
/// index_writer.add_document(doc!(
/// title => "The Name of the Wind",
/// ));
/// index_writer.add_document(doc!(
/// title => "The Diary of Muadib",
/// ));
/// index_writer.add_document(doc!(
/// title => "A Dairy Cow",
/// ));
/// index_writer.add_document(doc!(
/// title => "The Diary of a Young Girl",
/// ));
/// index_writer.commit().unwrap();
/// }
///
/// index.load_searchers()?;
/// let searcher = index.searcher();
///
/// {
/// let mut top_collector = TopScoreCollector::with_limit(2);
/// let query_parser = QueryParser::for_index(&index, vec![title]);
/// let query = query_parser.parse_query("diary")?;
/// searcher.search(&*query, &mut top_collector).unwrap();
///
/// let score_docs: Vec<(Score, DocId)> = top_collector
/// .top_docs()
/// .into_iter()
/// .map(|(score, doc_address)| (score, doc_address.doc()))
/// .collect();
///
/// assert_eq!(score_docs, vec![(0.7261542, 1), (0.6099695, 3)]);
/// }
///
/// Ok(())
/// }
/// ```
pub struct TopScoreCollector {
collector: TopCollector<Score>,
}
impl TopScoreCollector {
/// Creates a top score collector, with a number of documents equal to "limit".
///
/// # Panics
/// The method panics if limit is 0
pub fn with_limit(limit: usize) -> TopScoreCollector {
TopScoreCollector {
collector: TopCollector::with_limit(limit),
}
}
/// Returns K best scored documents sorted in decreasing order.
///
/// Calling this method triggers the sort.
/// The result of the sort is not cached.
pub fn docs(&self) -> Vec<DocAddress> {
self.collector.docs()
}
/// Returns K best ScoredDocuments sorted in decreasing order.
///
/// Calling this method triggers the sort.
/// The result of the sort is not cached.
pub fn top_docs(&self) -> Vec<(Score, DocAddress)> {
self.collector.top_docs()
}
/// Returns K best ScoredDocuments sorted in decreasing order.
///
/// Calling this method triggers the sort.
/// The result of the sort is not cached.
#[deprecated]
pub fn score_docs(&self) -> Vec<(Score, DocAddress)> {
self.collector.top_docs()
}
/// Return true iff at least K documents have gone through
/// the collector.
#[inline]
pub fn at_capacity(&self) -> bool {
self.collector.at_capacity()
}
}
impl Collector for TopScoreCollector {
fn set_segment(&mut self, segment_id: SegmentLocalId, _: &SegmentReader) -> Result<()> {
self.collector.set_segment_id(segment_id);
Ok(())
}
fn collect(&mut self, doc: DocId, score: Score) {
self.collector.collect(doc, score);
}
fn requires_scoring(&self) -> bool {
true
}
}
#[cfg(test)]
mod tests {
use super::*;
use collector::Collector;
use DocId;
use Score;
#[test]
fn test_top_collector_not_at_capacity() {
let mut top_collector = TopScoreCollector::with_limit(4);
top_collector.collect(1, 0.8);
top_collector.collect(3, 0.2);
top_collector.collect(5, 0.3);
assert!(!top_collector.at_capacity());
let score_docs: Vec<(Score, DocId)> = top_collector
.top_docs()
.into_iter()
.map(|(score, doc_address)| (score, doc_address.doc()))
.collect();
assert_eq!(score_docs, vec![(0.8, 1), (0.3, 5), (0.2, 3)]);
}
#[test]
fn test_top_collector_at_capacity() {
let mut top_collector = TopScoreCollector::with_limit(4);
top_collector.collect(1, 0.8);
top_collector.collect(3, 0.2);
top_collector.collect(5, 0.3);
top_collector.collect(7, 0.9);
top_collector.collect(9, -0.2);
assert!(top_collector.at_capacity());
{
let score_docs: Vec<(Score, DocId)> = top_collector
.top_docs()
.into_iter()
.map(|(score, doc_address)| (score, doc_address.doc()))
.collect();
assert_eq!(score_docs, vec![(0.9, 7), (0.8, 1), (0.3, 5), (0.2, 3)]);
}
{
let docs: Vec<DocId> = top_collector
.docs()
.into_iter()
.map(|doc_address| doc_address.doc())
.collect();
assert_eq!(docs, vec![7, 1, 5, 3]);
}
}
#[test]
#[should_panic]
fn test_top_0() {
TopScoreCollector::with_limit(0);
}
}


@@ -1,6 +1,6 @@
use std::io::Write;
use std::io;
use common::serialize::BinarySerializable;
use std::io;
use std::io::Write;
use std::mem;
use std::ops::Deref;
use std::ptr;
@@ -46,7 +46,7 @@ impl BitPacker {
pub fn flush<TWrite: Write>(&mut self, output: &mut TWrite) -> io::Result<()> {
if self.mini_buffer_written > 0 {
let num_bytes = (self.mini_buffer_written + 7) / 8;
let arr: [u8; 8] = unsafe { mem::transmute::<u64, [u8; 8]>(self.mini_buffer) };
let arr: [u8; 8] = unsafe { mem::transmute::<u64, [u8; 8]>(self.mini_buffer.to_le()) };
output.write_all(&arr[..num_bytes])?;
self.mini_buffer_written = 0;
}
@@ -98,30 +98,15 @@ where
let addr_in_bits = idx * num_bits;
let addr = addr_in_bits >> 3;
let bit_shift = addr_in_bits & 7;
if cfg!(feature = "simdcompression") {
// for simdcompression,
// the bitpacker is only used for fastfields,
// and we expect them to be always padded.
debug_assert!(
addr + 8 <= data.len(),
"The fast field field should have been padded with 7 bytes."
);
let val_unshifted_unmasked: u64 = unsafe { ptr::read_unaligned(data[addr..].as_ptr() as *const u64) };
let val_shifted = (val_unshifted_unmasked >> bit_shift) as u64;
val_shifted & mask
} else {
let val_unshifted_unmasked: u64 = if addr + 8 <= data.len() {
unsafe { ptr::read_unaligned(data[addr..].as_ptr() as *const u64) }
} else {
let mut buffer = [0u8; 8];
for i in addr..data.len() {
buffer[i - addr] += data[i];
}
unsafe { ptr::read_unaligned(buffer[..].as_ptr() as *const u64) }
};
let val_shifted = val_unshifted_unmasked >> (bit_shift as u64);
val_shifted & mask
}
debug_assert!(
addr + 8 <= data.len(),
"The fast field field should have been padded with 7 bytes."
);
#[cfg_attr(feature = "cargo-clippy", allow(clippy::cast_ptr_alignment))]
let val_unshifted_unmasked: u64 =
u64::from_le(unsafe { ptr::read_unaligned(data[addr..].as_ptr() as *const u64) });
let val_shifted = (val_unshifted_unmasked >> bit_shift) as u64;
val_shifted & mask
}
/// Reads a range of values from the fast field.
@@ -141,7 +126,9 @@ where
for output_val in output.iter_mut() {
let addr = addr_in_bits >> 3;
let bit_shift = addr_in_bits & 7;
let val_unshifted_unmasked: u64 = unsafe { ptr::read_unaligned(data[addr..].as_ptr() as *const u64) };
#[cfg_attr(feature = "cargo-clippy", allow(clippy::cast_ptr_alignment))]
let val_unshifted_unmasked: u64 =
unsafe { ptr::read_unaligned(data[addr..].as_ptr() as *const u64) };
let val_shifted = (val_unshifted_unmasked >> bit_shift) as u64;
*output_val = val_shifted & mask;
addr_in_bits += num_bits;


@@ -34,17 +34,17 @@ impl TinySet {
}
/// Returns the complement of the set in `[0, 64[`.
fn complement(&self) -> TinySet {
fn complement(self) -> TinySet {
TinySet(!self.0)
}
/// Returns true iff the `TinySet` contains the element `el`.
pub fn contains(&self, el: u32) -> bool {
pub fn contains(self, el: u32) -> bool {
!self.intersect(TinySet::singleton(el)).is_empty()
}
/// Returns the intersection of `self` and `other`
pub fn intersect(&self, other: TinySet) -> TinySet {
pub fn intersect(self, other: TinySet) -> TinySet {
TinySet(self.0 & other.0)
}
@@ -77,7 +77,7 @@ impl TinySet {
/// Returns true iff the `TinySet` is empty.
#[inline(always)]
pub fn is_empty(&self) -> bool {
pub fn is_empty(self) -> bool {
self.0 == 0u64
}
@@ -114,7 +114,7 @@ impl TinySet {
self.0 = 0u64;
}
pub fn len(&self) -> u32 {
pub fn len(self) -> u32 {
self.0.count_ones()
}
}
@@ -202,15 +202,14 @@ impl BitSet {
#[cfg(test)]
mod tests {
extern crate test;
use tests;
use std::collections::HashSet;
use super::BitSet;
use super::TinySet;
use tests::generate_nonunique_unsorted;
use std::collections::BTreeSet;
use query::BitSetDocSet;
use docset::DocSet;
use query::BitSetDocSet;
use std::collections::BTreeSet;
use std::collections::HashSet;
use tests;
use tests::generate_nonunique_unsorted;
#[test]
fn test_tiny_set() {
@@ -267,14 +266,14 @@ mod tests {
#[test]
fn test_bitset_large() {
let arr = generate_nonunique_unsorted(1_000_000, 50_000);
let arr = generate_nonunique_unsorted(100_000, 5_000);
let mut btreeset: BTreeSet<u32> = BTreeSet::new();
let mut bitset = BitSet::with_max_value(1_000_000);
let mut bitset = BitSet::with_max_value(100_000);
for el in arr {
btreeset.insert(el);
bitset.insert(el);
}
for i in 0..1_000_000 {
for i in 0..100_000 {
assert_eq!(btreeset.contains(&i), bitset.contains(i));
}
assert_eq!(btreeset.len(), bitset.len());
@@ -343,7 +342,7 @@ mod tests {
#[test]
fn test_bitset_clear() {
let mut bitset = BitSet::with_max_value(1_000);
let els = tests::sample(1_000, 0.01f32);
let els = tests::sample(1_000, 0.01f64);
for &el in &els {
bitset.insert(el);
}
@@ -353,6 +352,14 @@ mod tests {
assert!(!bitset.contains(el));
}
}
}
#[cfg(all(test, feature = "unstable"))]
mod bench {
use super::BitSet;
use super::TinySet;
use test;
#[bench]
fn bench_tinyset_pop(b: &mut test::Bencher) {
@@ -385,5 +392,4 @@ mod tests {
fn bench_bitset_initialize(b: &mut test::Bencher) {
b.iter(|| BitSet::with_max_value(1_000_000));
}
}


@@ -1,12 +1,14 @@
use std::io::Write;
use common::CountingWriter;
use std::collections::HashMap;
use schema::Field;
use common::VInt;
use directory::WritePtr;
use std::io::{self, Read};
use directory::ReadOnlySource;
use common::BinarySerializable;
use common::CountingWriter;
use common::VInt;
use directory::ReadOnlySource;
use directory::WritePtr;
use schema::Field;
use space_usage::PerFieldSpaceUsage;
use space_usage::FieldUsage;
use std::collections::HashMap;
use std::io::Write;
use std::io::{self, Read};
#[derive(Eq, PartialEq, Hash, Copy, Ord, PartialOrd, Clone, Debug)]
pub struct FileAddr {
@@ -30,10 +32,7 @@ impl BinarySerializable for FileAddr {
fn deserialize<R: Read>(reader: &mut R) -> io::Result<Self> {
let field = Field::deserialize(reader)?;
let idx = VInt::deserialize(reader)?.0 as usize;
Ok(FileAddr {
field,
idx,
})
Ok(FileAddr { field, idx })
}
}
@@ -67,7 +66,7 @@ impl<W: Write> CompositeWrite<W> {
&mut self.write
}
/// Close the composite file.
/// Close the composite file
///
/// An index of the different field offsets
/// will be written as a footer.
@@ -75,7 +74,8 @@ impl<W: Write> CompositeWrite<W> {
let footer_offset = self.write.written_bytes();
VInt(self.offsets.len() as u64).serialize(&mut self.write)?;
let mut offset_fields: Vec<_> = self.offsets
let mut offset_fields: Vec<_> = self
.offsets
.iter()
.map(|(file_addr, offset)| (*offset, *file_addr))
.collect();
@@ -115,7 +115,6 @@ impl CompositeFile {
let end = data.len();
let footer_len_data = data.slice_from(end - 4);
let footer_len = u32::deserialize(&mut footer_len_data.as_slice())? as usize;
let footer_start = end - 4 - footer_len;
let footer_data = data.slice(footer_start, footer_start + footer_len);
let mut footer_buffer = footer_data.as_slice();
@@ -166,20 +165,30 @@ impl CompositeFile {
/// to a given `Field` and stored in a `CompositeFile`.
pub fn open_read_with_idx(&self, field: Field, idx: usize) -> Option<ReadOnlySource> {
self.offsets_index
.get(&FileAddr { field, idx, })
.get(&FileAddr { field, idx })
.map(|&(from, to)| self.data.slice(from, to))
}
pub fn space_usage(&self) -> PerFieldSpaceUsage {
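// Sum, for each field, the byte length of every chunk stored for it
// in this composite file.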
let mut fields = HashMap::new();
for (&field_addr, &(start, end)) in self.offsets_index.iter() {
fields.entry(field_addr.field)
.or_insert_with(|| FieldUsage::empty(field_addr.field))
.add_field_idx(field_addr.idx, end - start);
}
PerFieldSpaceUsage::new(fields)
}
}
#[cfg(test)]
mod test {
use std::io::Write;
use super::{CompositeFile, CompositeWrite};
use common::BinarySerializable;
use common::VInt;
use directory::{Directory, RAMDirectory};
use schema::Field;
use common::VInt;
use common::BinarySerializable;
use std::io::Write;
use std::path::Path;
#[test]


@@ -1,5 +1,5 @@
use std::io::Write;
use std::io;
use std::io::Write;
pub struct CountingWriter<W> {
underlying: W,


@@ -1,16 +1,16 @@
mod serialize;
mod vint;
mod counting_writer;
mod composite_file;
pub mod bitpacker;
mod bitset;
mod composite_file;
mod counting_writer;
mod serialize;
mod vint;
pub(crate) use self::composite_file::{CompositeFile, CompositeWrite};
pub use self::serialize::{BinarySerializable, FixedSize};
pub use self::vint::VInt;
pub use self::counting_writer::CountingWriter;
pub use self::bitset::BitSet;
pub(crate) use self::bitset::TinySet;
pub(crate) use self::composite_file::{CompositeFile, CompositeWrite};
pub use self::counting_writer::CountingWriter;
pub use self::serialize::{BinarySerializable, FixedSize};
pub use self::vint::VInt;
pub use byteorder::LittleEndian as Endianness;
use std::io;
@@ -104,8 +104,8 @@ pub fn u64_to_i64(val: u64) -> i64 {
#[cfg(test)]
pub(crate) mod test {
use super::{compute_num_bits, i64_to_u64, u64_to_i64};
pub use super::serialize::test::fixed_size_test;
use super::{compute_num_bits, i64_to_u64, u64_to_i64};
fn test_i64_converter_helper(val: i64) {
assert_eq!(u64_to_i64(i64_to_u64(val)), val);


@@ -1,10 +1,10 @@
use byteorder::{ReadBytesExt, WriteBytesExt};
use common::Endianness;
use std::fmt;
use std::io::Write;
use std::io::Read;
use std::io;
use common::VInt;
use std::fmt;
use std::io;
use std::io::Read;
use std::io::Write;
/// Trait for a simple binary serialization.
pub trait BinarySerializable: fmt::Debug + Sized {
@@ -135,8 +135,8 @@ impl BinarySerializable for String {
#[cfg(test)]
pub mod test {
use common::VInt;
use super::*;
use common::VInt;
pub fn fixed_size_test<O: BinarySerializable + FixedSize + Default>() {
let mut buffer = Vec::new();


@@ -1,12 +1,14 @@
use super::BinarySerializable;
use std::io;
use std::io::Write;
use std::io::Read;
use std::io::Write;
/// Wrapper over a `u64` that serializes as a variable int.
#[derive(Debug, Eq, PartialEq)]
pub struct VInt(pub u64);
const STOP_BIT: u8 = 128;
impl VInt {
pub fn val(&self) -> u64 {
self.0
@@ -15,24 +17,34 @@ impl VInt {
pub fn deserialize_u64<R: Read>(reader: &mut R) -> io::Result<u64> {
VInt::deserialize(reader).map(|vint| vint.0)
}
pub fn serialize_into_vec(&self, output: &mut Vec<u8>) {
let mut buffer = [0u8; 10];
let num_bytes = self.serialize_into(&mut buffer);
output.extend(&buffer[0..num_bytes]);
}
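// Encoding used below: each byte stores 7 bits of the value, least-significant
// group first; the high bit (STOP_BIT) marks the final byte, so a `u64` needs
// at most 10 bytes.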
fn serialize_into(&self, buffer: &mut [u8; 10]) -> usize {
let mut remaining = self.0;
for (i, b) in buffer.iter_mut().enumerate() {
let next_byte: u8 = (remaining % 128u64) as u8;
remaining /= 128u64;
if remaining == 0u64 {
*b = next_byte | STOP_BIT;
return i + 1;
} else {
*b = next_byte;
}
}
unreachable!();
}
}
impl BinarySerializable for VInt {
fn serialize<W: Write>(&self, writer: &mut W) -> io::Result<()> {
let mut remaining = self.0;
let mut buffer = [0u8; 10];
let mut i = 0;
loop {
let next_byte: u8 = (remaining % 128u64) as u8;
remaining /= 128u64;
if remaining == 0u64 {
buffer[i] = next_byte | 128u8;
return writer.write_all(&buffer[0..i + 1]);
} else {
buffer[i] = next_byte;
}
i += 1;
}
let num_bytes = self.serialize_into(&mut buffer);
writer.write_all(&buffer[0..num_bytes])
}
fn deserialize<R: Read>(reader: &mut R) -> io::Result<Self> {
@@ -42,20 +54,58 @@ impl BinarySerializable for VInt {
loop {
match bytes.next() {
Some(Ok(b)) => {
result += u64::from(b % 128u8) << shift;
if b & 128u8 != 0u8 {
break;
result |= u64::from(b % 128u8) << shift;
if b >= STOP_BIT {
return Ok(VInt(result));
}
shift += 7;
}
_ => {
return Err(io::Error::new(
io::ErrorKind::InvalidData,
"Reach end of buffer",
"Reach end of buffer while reading VInt",
))
}
}
}
Ok(VInt(result))
}
}
#[cfg(test)]
mod tests {
use super::VInt;
use common::BinarySerializable;
fn aux_test_vint(val: u64) {
let mut v = [14u8; 10];
let num_bytes = VInt(val).serialize_into(&mut v);
for i in num_bytes..10 {
assert_eq!(v[i], 14u8);
}
assert!(num_bytes > 0);
if num_bytes < 10 {
assert!(1u64 << (7 * num_bytes) > val);
}
if num_bytes > 1 {
assert!(1u64 << (7 * (num_bytes - 1)) <= val);
}
let serdeser_val = VInt::deserialize(&mut &v[..]).unwrap();
assert_eq!(val, serdeser_val.0);
}
#[test]
fn test_vint() {
aux_test_vint(0);
aux_test_vint(1);
aux_test_vint(5);
aux_test_vint(u64::max_value());
for i in 1..9 {
let power_of_128 = 1u64 << (7 * i);
aux_test_vint(power_of_128 - 1u64);
aux_test_vint(power_of_128);
aux_test_vint(power_of_128 + 1u64);
}
aux_test_vint(10);
}
}


@@ -1,158 +0,0 @@
use compression::BlockDecoder;
use compression::COMPRESSION_BLOCK_SIZE;
use compression::compressed_block_size;
use directory::{ReadOnlySource, SourceRead};
/// Reads a stream of compressed ints.
///
/// Tantivy uses `CompressedIntStream` to read
/// the position file.
/// The `.skip(...)` makes it possible to avoid
/// decompressing blocks that are not required.
pub struct CompressedIntStream {
buffer: SourceRead,
block_decoder: BlockDecoder,
cached_addr: usize, // address of the currently decoded block
cached_next_addr: usize, // address following the currently decoded block
addr: usize, // address of the block associated to the current position
inner_offset: usize,
}
impl CompressedIntStream {
/// Opens a compressed int stream.
pub(crate) fn wrap(source: ReadOnlySource) -> CompressedIntStream {
CompressedIntStream {
buffer: SourceRead::from(source),
block_decoder: BlockDecoder::new(),
cached_addr: usize::max_value(),
cached_next_addr: usize::max_value(),
addr: 0,
inner_offset: 0,
}
}
/// Loads the block at the given address and return the address of the
/// following block
pub fn read_block(&mut self, addr: usize) -> usize {
if self.cached_addr == addr {
// we are already on this block.
// no need to read.
self.cached_next_addr
} else {
let next_addr = addr + self.block_decoder.uncompress_block_unsorted(self.buffer.slice_from(addr));
self.cached_addr = addr;
self.cached_next_addr = next_addr;
next_addr
}
}
/// Fills a buffer with the next `output.len()` integers.
/// This does not consume / advance the stream.
pub fn read(&mut self, output: &mut [u32]) {
let mut cursor = self.addr;
let mut inner_offset = self.inner_offset;
let mut num_els: usize = output.len();
let mut start = 0;
loop {
cursor = self.read_block(cursor);
let block = &self.block_decoder.output_array()[inner_offset..];
let block_len = block.len();
if num_els >= block_len {
output[start..start + block_len].clone_from_slice(&block);
start += block_len;
num_els -= block_len;
inner_offset = 0;
} else {
output[start..].clone_from_slice(&block[..num_els]);
break;
}
}
}
/// Skips the next `skip_len` integers.
///
/// If a full block is skipped, calling
/// `.skip(...)` will avoid decompressing it.
///
/// May panic if the end of the stream is reached.
pub fn skip(&mut self, mut skip_len: usize) {
loop {
let available = COMPRESSION_BLOCK_SIZE - self.inner_offset;
if available >= skip_len {
self.inner_offset += skip_len;
break;
} else {
skip_len -= available;
// entirely skip decompressing some blocks.
let num_bits: u8 = self.buffer.get(self.addr);
let block_len = compressed_block_size(num_bits);
self.addr += block_len;
self.inner_offset = 0;
}
}
}
}
#[cfg(test)]
pub mod tests {
use super::CompressedIntStream;
use compression::compressed_block_size;
use compression::COMPRESSION_BLOCK_SIZE;
use compression::BlockEncoder;
use directory::ReadOnlySource;
fn create_stream_buffer() -> ReadOnlySource {
let mut buffer: Vec<u8> = vec![];
let mut encoder = BlockEncoder::new();
let vals: Vec<u32> = (0u32..1152u32).collect();
for chunk in vals.chunks(COMPRESSION_BLOCK_SIZE) {
let compressed_block = encoder.compress_block_unsorted(chunk);
let num_bits = compressed_block[0];
assert_eq!(compressed_block_size(num_bits), compressed_block.len());
buffer.extend_from_slice(compressed_block);
}
if cfg!(simd) {
buffer.extend_from_slice(&[0u8; 7]);
}
ReadOnlySource::from(buffer)
}
#[test]
fn test_compressed_int_stream() {
let buffer = create_stream_buffer();
let mut stream = CompressedIntStream::wrap(buffer);
let mut block: [u32; COMPRESSION_BLOCK_SIZE] = [0u32; COMPRESSION_BLOCK_SIZE];
stream.read(&mut block[0..2]);
assert_eq!(block[0], 0);
assert_eq!(block[1], 1);
// reading does not consume the stream
stream.read(&mut block[0..2]);
assert_eq!(block[0], 0);
assert_eq!(block[1], 1);
stream.skip(2);
stream.skip(5);
stream.read(&mut block[0..3]);
stream.skip(3);
assert_eq!(block[0], 7);
assert_eq!(block[1], 8);
assert_eq!(block[2], 9);
stream.skip(500);
stream.read(&mut block[0..3]);
stream.skip(3);
assert_eq!(block[0], 510);
assert_eq!(block[1], 511);
assert_eq!(block[2], 512);
stream.skip(511);
stream.read(&mut block[..1]);
assert_eq!(block[0], 1024);
}
}

View File

@@ -1,80 +1,95 @@
use super::pool::LeasedItem;
use super::pool::Pool;
use super::segment::create_segment;
use super::segment::Segment;
use core::searcher::Searcher;
use core::IndexMeta;
use core::SegmentId;
use core::SegmentMeta;
use core::SegmentReader;
use core::META_FILEPATH;
use directory::ManagedDirectory;
#[cfg(feature = "mmap")]
use directory::MmapDirectory;
use directory::{Directory, RAMDirectory};
use error::TantivyError;
use indexer::index_writer::open_index_writer;
use indexer::index_writer::HEAP_SIZE_MIN;
use indexer::segment_updater::save_new_metas;
use indexer::LockType;
use num_cpus;
use schema::Field;
use schema::FieldType;
use schema::Schema;
use serde_json;
use std::borrow::BorrowMut;
use std::fmt;
use std::path::Path;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use tokenizer::BoxedTokenizer;
use tokenizer::TokenizerManager;
use IndexWriter;
use Result;
fn load_metas(directory: &Directory) -> Result<IndexMeta> {
let meta_data = directory.atomic_read(&META_FILEPATH)?;
let meta_string = String::from_utf8_lossy(&meta_data);
serde_json::from_str(&meta_string)
.map_err(|_| TantivyError::CorruptedFile(META_FILEPATH.clone()))
}
/// Search Index
pub struct Index {
directory: ManagedDirectory,
schema: Schema,
num_searchers: Arc<AtomicUsize>,
searcher_pool: Arc<Pool<Searcher>>,
tokenizers: TokenizerManager,
}
impl Index {
/// Examines the directory to see if it contains an index
pub fn exists<Dir: Directory>(dir: &Dir) -> bool {
dir.exists(&META_FILEPATH)
}
/// Creates a new index using the `RAMDirectory`.
///
/// The index will be allocated in anonymous memory.
/// This should only be used for unit tests.
pub fn create_in_ram(schema: Schema) -> Index {
let ram_directory = RAMDirectory::create();
Index::create(ram_directory, schema).expect("Creating a RAMDirectory should never fail")
}
/// Creates a new index in a given filepath.
/// The index will use the `MmapDirectory`.
///
/// If a previous index was in this directory, then its meta file will be destroyed.
#[cfg(feature="mmap")]
pub fn create<P: AsRef<Path>>(directory_path: P, schema: Schema) -> Result<Index> {
#[cfg(feature = "mmap")]
pub fn create_in_dir<P: AsRef<Path>>(directory_path: P, schema: Schema) -> Result<Index> {
let mmap_directory = MmapDirectory::open(directory_path)?;
if Index::exists(&mmap_directory) {
return Err(TantivyError::IndexAlreadyExists);
}
Index::create(mmap_directory, schema)
}
/// Opens or creates a new index in the provided directory
#[cfg(feature = "mmap")]
pub fn open_or_create<Dir: Directory>(dir: Dir, schema: Schema) -> Result<Index> {
if Index::exists(&dir) {
let index = Index::open(dir)?;
if index.schema() == schema {
Ok(index)
} else {
Err(TantivyError::SchemaError("An index exists but the schema does not match.".to_string()))
}
} else {
Index::create(dir, schema)
}
}
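A short usage sketch of `open_or_create` (the schema below is illustrative; `RAMDirectory` is used the same way in the tests further down):
// Illustrative: the first call creates the index, the second call opens it.
// A call with a mismatched schema would return a SchemaError instead.
let mut schema_builder = SchemaBuilder::default();
schema_builder.add_text_field("body", TEXT);
let schema = schema_builder.build();
let directory = RAMDirectory::create();
let _created = Index::open_or_create(directory.clone(), schema.clone()).unwrap();
let _reopened = Index::open_or_create(directory, schema).unwrap();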
/// Creates a new index in a temp directory.
@@ -85,19 +100,35 @@ impl Index {
///
/// The temp directory is only used for testing the `MmapDirectory`.
/// For other unit tests, prefer the `RAMDirectory`, see: `create_in_ram`.
#[cfg(feature="mmap")]
#[cfg(feature = "mmap")]
pub fn create_from_tempdir(schema: Schema) -> Result<Index> {
let mmap_directory = MmapDirectory::create_from_tempdir()?;
let directory = ManagedDirectory::new(mmap_directory)?;
Index::create(mmap_directory, schema)
}
/// Creates a new index given an implementation of the trait `Directory`
pub fn create<Dir: Directory>(dir: Dir, schema: Schema) -> Result<Index> {
let directory = ManagedDirectory::new(dir)?;
Index::from_directory(directory, schema)
}
/// Create a new index from a directory.
///
/// This will overwrite existing meta.json
fn from_directory(mut directory: ManagedDirectory, schema: Schema) -> Result<Index> {
save_new_metas(schema.clone(), 0, directory.borrow_mut())?;
let metas = IndexMeta::with_schema(schema);
Index::create_from_metas(directory, &metas)
}
/// Creates a new index given a directory and an `IndexMeta`.
fn create_from_metas(directory: ManagedDirectory, metas: &IndexMeta) -> Result<Index> {
let schema = metas.schema.clone();
let n_cpus = num_cpus::get();
let index = Index {
directory,
schema,
num_searchers: Arc::new(AtomicUsize::new(n_cpus)),
searcher_pool: Arc::new(Pool::new()),
tokenizers: TokenizerManager::default(),
};
@@ -105,18 +136,42 @@ impl Index {
Ok(index)
}
/// Accessor for the tokenizer manager.
pub fn tokenizers(&self) -> &TokenizerManager {
&self.tokenizers
}
/// Helper to access the tokenizer associated to a specific field.
pub fn tokenizer_for_field(&self, field: Field) -> Result<Box<BoxedTokenizer>> {
let field_entry = self.schema.get_field_entry(field);
let field_type = field_entry.field_type();
let tokenizer_manager: &TokenizerManager = self.tokenizers();
let tokenizer_name_opt: Option<Box<BoxedTokenizer>> = match field_type {
FieldType::Str(text_options) => text_options
.get_indexing_options()
.map(|text_indexing_options| text_indexing_options.tokenizer().to_string())
.and_then(|tokenizer_name| tokenizer_manager.get(&tokenizer_name)),
_ => None,
};
match tokenizer_name_opt {
Some(tokenizer) => Ok(tokenizer),
None => Err(TantivyError::SchemaError(format!(
"{:?} is not a text field.",
field_entry.name()
))),
}
}
/// Opens a new directory from an index path.
#[cfg(feature="mmap")]
pub fn open<P: AsRef<Path>>(directory_path: P) -> Result<Index> {
#[cfg(feature = "mmap")]
pub fn open_in_dir<P: AsRef<Path>>(directory_path: P) -> Result<Index> {
let mmap_directory = MmapDirectory::open(directory_path)?;
let directory = ManagedDirectory::new(mmap_directory)?;
Index::open(mmap_directory)
}
/// Open the index using the provided directory
pub fn open<D: Directory>(directory: D) -> Result<Index> {
let directory = ManagedDirectory::new(directory)?;
let metas = load_metas(&directory)?;
Index::create_from_metas(directory, &metas)
}
@@ -134,9 +189,13 @@ impl Index {
/// `IndexWriter` on the system is accessing the index directory,
/// it is safe to manually delete the lockfile.
///
/// - `num_threads` defines the number of indexing workers that
/// should work at the same time.
///
/// - `overall_heap_size_in_bytes` sets the amount of memory
/// allocated for all indexing thread.
/// Each thread will receive a budget of `overall_heap_size_in_bytes / num_threads`.
///
/// # Errors
/// If the lockfile already exists, returns `Error::FileAlreadyExists`.
/// # Panics
@@ -144,21 +203,35 @@ impl Index {
pub fn writer_with_num_threads(
&self,
num_threads: usize,
overall_heap_size_in_bytes: usize,
) -> Result<IndexWriter> {
let directory_lock = LockType::IndexWriterLock.acquire_lock(&self.directory)?;
let heap_size_in_bytes_per_thread = overall_heap_size_in_bytes / num_threads;
open_index_writer(
self,
num_threads,
heap_size_in_bytes_per_thread,
directory_lock,
)
}
/// Creates a multithreaded writer
///
/// Tantivy will automatically define the number of threads to use.
/// `overall_heap_size_in_bytes` is the total target memory usage that will be split
/// between a given number of threads.
///
/// # Errors
/// If the lockfile already exists, returns `Error::FileAlreadyExists`.
/// # Panics
/// If the heap size per thread is too small, panics.
pub fn writer(&self, overall_heap_size_in_bytes: usize) -> Result<IndexWriter> {
let mut num_threads = num_cpus::get();
let heap_size_in_bytes_per_thread = overall_heap_size_in_bytes / num_threads;
if heap_size_in_bytes_per_thread < HEAP_SIZE_MIN {
num_threads = (overall_heap_size_in_bytes / HEAP_SIZE_MIN).max(1);
}
self.writer_with_num_threads(num_threads, overall_heap_size_in_bytes)
}
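To illustrate the budget split described in the doc comments above (the core count and budget are made-up numbers; `HEAP_SIZE_MIN`'s real value is defined elsewhere in the crate):
// Hypothetical figures: 8 worker threads sharing an 80 MB budget.
let overall_heap_size_in_bytes: usize = 80_000_000;
let num_threads = 8; // what num_cpus::get() might return on some machine
let heap_size_per_thread = overall_heap_size_in_bytes / num_threads;
assert_eq!(heap_size_per_thread, 10_000_000);
// If this per-thread share fell below HEAP_SIZE_MIN, `writer` would lower
// `num_threads` (down to 1) rather than hand each thread too small a heap.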
/// Accessor to the index schema
@@ -170,7 +243,8 @@ impl Index {
/// Returns the list of segments that are searchable
pub fn searchable_segments(&self) -> Result<Vec<Segment>> {
Ok(self.searchable_segment_metas()?
Ok(self
.searchable_segment_metas()?
.into_iter()
.map(|segment_meta| self.segment(segment_meta))
.collect())
@@ -183,8 +257,8 @@ impl Index {
/// Creates a new segment.
pub fn new_segment(&self) -> Segment {
let segment_meta = SegmentMeta::new(SegmentId::generate_random(), 0);
self.segment(segment_meta)
}
/// Return a reference to the index directory.
@@ -205,26 +279,41 @@ impl Index {
/// Returns the list of segment ids that are searchable.
pub fn searchable_segment_ids(&self) -> Result<Vec<SegmentId>> {
Ok(self.searchable_segment_metas()?
Ok(self
.searchable_segment_metas()?
.iter()
.map(|segment_meta| segment_meta.id())
.collect())
}
/// Sets the number of searchers to use
///
/// Only works after the next call to `load_searchers`
pub fn set_num_searchers(&mut self, num_searchers: usize) {
self.num_searchers.store(num_searchers, Ordering::Release);
}
/// Update searchers so that they reflect the state of the last
/// `.commit()`.
///
/// If indexing happens in the same process as searching,
/// you most likely want to call `.load_searchers()` right after each
/// successful call to `.commit()`.
///
/// If indexing and searching happen in different processes, the way to
/// get the freshest `index` at all times is to watch `meta.json` and
/// call `load_searchers` whenever a change happens.
pub fn load_searchers(&self) -> Result<()> {
let _meta_lock = LockType::MetaLock.acquire_lock(self.directory())?;
let searchable_segments = self.searchable_segments()?;
let segment_readers: Vec<SegmentReader> = searchable_segments
.iter()
.map(SegmentReader::open)
.collect::<Result<_>>()?;
let schema = self.schema();
let num_searchers: usize = self.num_searchers.load(Ordering::Acquire);
let searchers = (0..num_searchers)
.map(|_| Searcher::new(schema.clone(), self.clone(), segment_readers.clone()))
.collect();
self.searcher_pool.publish_new_generation(searchers);
Ok(())
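A hedged usage sketch of the commit-then-reload cycle described above (index and writer setup omitted; names are illustrative):
// After a successful commit, reload the searchers so that queries
// observe the newly published segments, then lease one from the pool.
index_writer.commit().unwrap();
index.load_searchers().unwrap();
let searcher = index.searcher();
let num_docs = searcher.num_docs(); // now reflects the committed documents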
@@ -234,7 +323,7 @@ impl Index {
///
/// This method should be called every single time a search
/// query is performed.
/// The searchers are taken from a pool of `num_searchers` searchers.
/// If no searcher is available
/// this may block.
///
@@ -256,8 +345,79 @@ impl Clone for Index {
Index {
directory: self.directory.clone(),
schema: self.schema.clone(),
num_searchers: Arc::clone(&self.num_searchers),
searcher_pool: Arc::clone(&self.searcher_pool),
tokenizers: self.tokenizers.clone(),
}
}
}
#[cfg(test)]
mod tests {
use schema::{Schema, SchemaBuilder, INT_INDEXED, TEXT};
use Index;
use directory::RAMDirectory;
#[test]
fn test_indexer_for_field() {
let mut schema_builder = SchemaBuilder::default();
let num_likes_field = schema_builder.add_u64_field("num_likes", INT_INDEXED);
let body_field = schema_builder.add_text_field("body", TEXT);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
assert!(index.tokenizer_for_field(body_field).is_ok());
assert_eq!(
format!("{:?}", index.tokenizer_for_field(num_likes_field).err()),
"Some(SchemaError(\"\\\"num_likes\\\" is not a text field.\"))"
);
}
#[test]
fn test_index_exists() {
let directory = RAMDirectory::create();
assert!(!Index::exists(&directory));
assert!(Index::create(directory.clone(), throw_away_schema()).is_ok());
assert!(Index::exists(&directory));
}
#[test]
fn open_or_create_should_create() {
let directory = RAMDirectory::create();
assert!(!Index::exists(&directory));
assert!(Index::open_or_create(directory.clone(), throw_away_schema()).is_ok());
assert!(Index::exists(&directory));
}
#[test]
fn open_or_create_should_open() {
let directory = RAMDirectory::create();
assert!(Index::create(directory.clone(), throw_away_schema()).is_ok());
assert!(Index::exists(&directory));
assert!(Index::open_or_create(directory, throw_away_schema()).is_ok());
}
#[test]
fn create_should_wipeoff_existing() {
let directory = RAMDirectory::create();
assert!(Index::create(directory.clone(), throw_away_schema()).is_ok());
assert!(Index::exists(&directory));
assert!(Index::create(directory.clone(), SchemaBuilder::default().build()).is_ok());
}
#[test]
fn open_or_create_exists_but_schema_does_not_match() {
let directory = RAMDirectory::create();
assert!(Index::create(directory.clone(), throw_away_schema()).is_ok());
assert!(Index::exists(&directory));
assert!(Index::open_or_create(directory.clone(), throw_away_schema()).is_ok());
let err = Index::open_or_create(directory, SchemaBuilder::default().build());
assert_eq!(format!("{:?}", err.unwrap_err()), "SchemaError(\"An index exists but the schema does not match.\")");
}
fn throw_away_schema() -> Schema {
let mut schema_builder = SchemaBuilder::default();
let _ = schema_builder.add_u64_field("num_likes", INT_INDEXED);
schema_builder.build()
}
}

View File

@@ -1,7 +1,7 @@
use core::SegmentMeta;
use schema::Schema;
use serde_json;
use std::fmt;
/// Meta information about the `Index`.
///
@@ -45,9 +45,9 @@ impl fmt::Debug for IndexMeta {
#[cfg(test)]
mod tests {
use super::IndexMeta;
use schema::{SchemaBuilder, TEXT};
use serde_json;
#[test]
fn test_serialize_metas() {
@@ -58,7 +58,7 @@ mod tests {
};
let index_metas = IndexMeta {
segments: Vec::new(),
schema,
opstamp: 0u64,
payload: None,
};

View File

@@ -1,13 +1,13 @@
use common::BinarySerializable;
use directory::ReadOnlySource;
use owned_read::OwnedRead;
use positions::PositionReader;
use postings::TermInfo;
use postings::{BlockSegmentPostings, SegmentPostings};
use schema::FieldType;
use schema::IndexRecordOption;
use schema::Term;
use termdict::TermDictionary;
/// The inverted index reader is in charge of accessing
/// the inverted index associated to a specific field.
@@ -23,18 +23,24 @@ use schema::FieldType;
/// `InvertedIndexReader` are created by calling
/// the `SegmentReader`'s [`.inverted_index(...)`] method
pub struct InvertedIndexReader {
termdict: TermDictionary,
postings_source: ReadOnlySource,
positions_source: ReadOnlySource,
positions_idx_source: ReadOnlySource,
record_option: IndexRecordOption,
total_num_tokens: u64,
}
impl InvertedIndexReader {
#[cfg_attr(
feature = "cargo-clippy",
allow(clippy::needless_pass_by_value)
)] // for symmetry
pub(crate) fn new(
termdict: TermDictionary,
postings_source: ReadOnlySource,
positions_source: ReadOnlySource,
positions_idx_source: ReadOnlySource,
record_option: IndexRecordOption,
) -> InvertedIndexReader {
let total_num_tokens_data = postings_source.slice(0, 8);
@@ -44,23 +50,25 @@ impl InvertedIndexReader {
termdict,
postings_source: postings_source.slice_from(8),
positions_source,
positions_idx_source,
record_option,
total_num_tokens,
}
}
/// Creates an empty `InvertedIndexReader` object, which
/// contains no terms at all.
pub fn empty(field_type: &FieldType) -> InvertedIndexReader {
let record_option = field_type
.get_index_record_option()
.unwrap_or(IndexRecordOption::Basic);
InvertedIndexReader {
termdict: TermDictionary::empty(&field_type),
postings_source: ReadOnlySource::empty(),
positions_source: ReadOnlySource::empty(),
positions_idx_source: ReadOnlySource::empty(),
record_option,
total_num_tokens: 0u64,
}
}
@@ -70,7 +78,7 @@ impl InvertedIndexReader {
}
/// Return the term dictionary datastructure.
pub fn terms(&self) -> &TermDictionary {
&self.termdict
}
@@ -92,8 +100,21 @@ impl InvertedIndexReader {
let offset = term_info.postings_offset as usize;
let end_source = self.postings_source.len();
let postings_slice = self.postings_source.slice(offset, end_source);
let postings_reader = OwnedRead::new(postings_slice);
block_postings.reset(term_info.doc_freq, postings_reader);
}
/// Returns a block postings given a `Term`.
/// This method is for an advanced usage only.
///
/// Most users should prefer using `read_postings` instead.
pub fn read_block_postings(
&self,
term: &Term,
option: IndexRecordOption,
) -> Option<BlockSegmentPostings> {
self.get_term_info(term)
.map(move |term_info| self.read_block_postings_from_terminfo(&term_info, option))
}
/// Returns a block postings given a `term_info`.
@@ -107,15 +128,11 @@ impl InvertedIndexReader {
) -> BlockSegmentPostings {
let offset = term_info.postings_offset as usize;
let postings_data = self.postings_source.slice_from(offset);
BlockSegmentPostings::from_data(
term_info.doc_freq,
OwnedRead::new(postings_data),
self.record_option,
requested_option,
)
}
@@ -131,11 +148,11 @@ impl InvertedIndexReader {
let block_postings = self.read_block_postings_from_terminfo(term_info, option);
let position_stream = {
if option.has_positions() {
let position_reader = self.positions_source.clone();
let skip_reader = self.positions_idx_source.clone();
let position_reader =
PositionReader::new(position_reader, skip_reader, term_info.positions_idx);
Some(position_reader)
} else {
None
}
@@ -149,8 +166,6 @@ impl InvertedIndexReader {
self.total_num_tokens
}
/// Returns the segment postings associated with the term, and with the given option,
/// or `None` if the term has never been encountered and indexed.
///
@@ -162,16 +177,19 @@ impl InvertedIndexReader {
/// `TextIndexingOptions` that does not index position will return a `SegmentPostings`
/// with `DocId`s and frequencies.
pub fn read_postings(&self, term: &Term, option: IndexRecordOption) -> Option<SegmentPostings> {
self.get_term_info(term)
.map(move |term_info| self.read_postings_from_terminfo(&term_info, option))
}
pub(crate) fn read_postings_no_deletes(
&self,
term: &Term,
option: IndexRecordOption,
) -> Option<SegmentPostings> {
self.get_term_info(term)
.map(|term_info| self.read_postings_from_terminfo(&term_info, option))
}
/// Returns the number of documents containing the term.
pub fn doc_freq(&self, term: &Term) -> u32 {
self.get_term_info(term)
@@ -179,6 +197,3 @@ impl InvertedIndexReader {
.unwrap_or(0u32)
}
}

View File

@@ -1,24 +1,24 @@
pub mod index;
mod index_meta;
mod inverted_index_reader;
mod pool;
pub mod searcher;
mod segment;
mod segment_component;
mod segment_id;
mod segment_meta;
mod segment_reader;
pub use self::index::Index;
pub use self::index_meta::IndexMeta;
pub use self::inverted_index_reader::InvertedIndexReader;
pub use self::searcher::Searcher;
pub use self::segment::Segment;
pub use self::segment::SerializableSegment;
pub use self::segment_component::SegmentComponent;
pub use self::segment_id::SegmentId;
pub use self::segment_meta::SegmentMeta;
pub use self::segment_reader::SegmentReader;
use std::path::PathBuf;
@@ -33,10 +33,4 @@ lazy_static! {
/// Removing this file is safe, but will prevent the garbage collection of all of the files that
/// are currently in the directory
pub static ref MANAGED_FILEPATH: PathBuf = PathBuf::from(".managed.json");
/// Only one process should be able to write tantivy's index at a time.
/// This file, when present, is in charge of preventing other processes to open an IndexWriter.
///
/// If the process is killed and this file remains, it is safe to remove it manually.
pub static ref LOCKFILE_FILEPATH: PathBuf = PathBuf::from(".tantivy-indexer.lock");
}

View File

@@ -1,8 +1,8 @@
use crossbeam::queue::MsQueue;
use std::mem;
use std::ops::{Deref, DerefMut};
use std::sync::atomic::AtomicUsize;
use std::sync::atomic::Ordering;
use std::sync::Arc;
pub struct GenerationItem<T> {
@@ -87,7 +87,8 @@ impl<T> Deref for LeasedItem<T> {
type Target = T;
fn deref(&self) -> &T {
&self.gen_item
&self
.gen_item
.as_ref()
.expect("Unwrapping a leased item should never fail")
.item // unwrap is safe here
@@ -96,7 +97,8 @@ impl<T> Deref for LeasedItem<T> {
impl<T> DerefMut for LeasedItem<T> {
fn deref_mut(&mut self) -> &mut T {
&mut self.gen_item
&mut self
.gen_item
.as_mut()
.expect("Unwrapping a mut leased item should never fail")
.item // unwrap is safe here
@@ -114,8 +116,8 @@ impl<T> Drop for LeasedItem<T> {
#[cfg(test)]
mod tests {
use super::Pool;
use std::iter;
#[test]
fn test_pool() {

View File

@@ -1,14 +1,17 @@
use core::InvertedIndexReader;
use core::SegmentReader;
use query::Query;
use schema::Document;
use schema::Schema;
use schema::{Field, Term};
use space_usage::SearcherSpaceUsage;
use std::fmt;
use std::sync::Arc;
use termdict::TermMerger;
use DocAddress;
use Index;
use Result;
/// Holds a list of `SegmentReader`s ready for search.
///
@@ -16,25 +19,50 @@ use core::InvertedIndexReader;
/// the destruction of the `Searcher`.
///
pub struct Searcher {
schema: Schema,
index: Index,
segment_readers: Vec<SegmentReader>,
}
impl Searcher {
/// Creates a new `Searcher`
pub(crate) fn new(
schema: Schema,
index: Index,
segment_readers: Vec<SegmentReader>,
) -> Searcher {
Searcher {
schema,
index,
segment_readers,
}
}
/// Returns the `Index` associated to the `Searcher`
pub fn index(&self) -> &Index {
&self.index
}
/// Fetches a document from tantivy's store given a `DocAddress`.
///
/// The searcher uses the segment ordinal to route the
/// request to the right `Segment`.
pub fn doc(&self, doc_address: DocAddress) -> Result<Document> {
let DocAddress(segment_local_id, doc_id) = doc_address;
let segment_reader = &self.segment_readers[segment_local_id as usize];
segment_reader.doc(doc_id)
}
/// Access the schema associated to the index of this searcher.
pub fn schema(&self) -> &Schema {
&self.schema
}
/// Returns the overall number of documents in the index.
pub fn num_docs(&self) -> u64 {
self.segment_readers
.iter()
.map(|segment_reader| segment_reader.num_docs() as u64)
.map(|segment_reader| u64::from(segment_reader.num_docs()))
.sum::<u64>()
}
@@ -43,8 +71,9 @@ impl Searcher {
pub fn doc_freq(&self, term: &Term) -> u64 {
self.segment_readers
.iter()
.map(|segment_reader| segment_reader.inverted_index(term.field()).doc_freq(term) as u64)
.sum::<u64>()
.map(|segment_reader| {
u64::from(segment_reader.inverted_index(term.field()).doc_freq(term))
}).sum::<u64>()
}
/// Return the list of segment readers
@@ -64,12 +93,22 @@ impl Searcher {
/// Return the field searcher associated to a `Field`.
pub fn field(&self, field: Field) -> FieldSearcher {
let inv_index_readers = self.segment_readers
let inv_index_readers = self
.segment_readers
.iter()
.map(|segment_reader| segment_reader.inverted_index(field))
.collect::<Vec<_>>();
FieldSearcher::new(inv_index_readers)
}
/// Summarize total space usage of this searcher.
pub fn space_usage(&self) -> SearcherSpaceUsage {
let mut space_usage = SearcherSpaceUsage::new();
for segment_reader in self.segment_readers.iter() {
space_usage.add_segment(segment_reader.space_usage());
}
space_usage
}
}
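A small sketch of reading the space-usage summary introduced above (only the aggregate object is shown; its accessors live in the `space_usage` module, which is not part of this diff):
// Illustrative: collect per-segment space usage for the current searchers.
index.load_searchers().unwrap();
let searcher = index.searcher();
let space_usage = searcher.space_usage();
// `space_usage` now holds one SegmentSpaceUsage entry per segment reader.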
pub struct FieldSearcher {
@@ -84,7 +123,8 @@ impl FieldSearcher {
/// Returns a Stream over all of the sorted unique terms
/// for the given field.
pub fn terms(&self) -> TermMerger {
let term_streamers: Vec<_> = self.inv_index_readers
let term_streamers: Vec<_> = self
.inv_index_readers
.iter()
.map(|inverted_index| inverted_index.terms().stream())
.collect();
@@ -92,15 +132,10 @@ impl FieldSearcher {
}
}
impl fmt::Debug for Searcher {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
let segment_ids = self.segment_readers
let segment_ids = self
.segment_readers
.iter()
.map(|segment_reader| segment_reader.segment_id())
.collect::<Vec<_>>();

View File

@@ -1,16 +1,16 @@
use Result;
use std::path::PathBuf;
use schema::Schema;
use std::fmt;
use core::SegmentId;
use directory::{FileProtection, ReadOnlySource, WritePtr};
use indexer::segment_serializer::SegmentSerializer;
use super::SegmentComponent;
use core::Index;
use std::result;
use directory::Directory;
use core::SegmentId;
use core::SegmentMeta;
use directory::error::{OpenReadError, OpenWriteError};
use directory::Directory;
use directory::{ReadOnlySource, WritePtr};
use indexer::segment_serializer::SegmentSerializer;
use schema::Schema;
use std::fmt;
use std::path::PathBuf;
use std::result;
use Result;
/// A segment is a piece of the index.
#[derive(Clone)]
@@ -28,6 +28,7 @@ impl fmt::Debug for Segment {
/// Creates a new segment given an `Index` and a `SegmentId`
///
/// The function is here to make it private outside `tantivy`.
/// #[doc(hidden)]
pub fn create_segment(index: Index, meta: SegmentMeta) -> Segment {
Segment { index, meta }
}
@@ -49,8 +50,11 @@ impl Segment {
}
#[doc(hidden)]
pub fn with_delete_meta(self, num_deleted_docs: u32, opstamp: u64) -> Segment {
Segment {
index: self.index,
meta: self.meta.with_delete_meta(num_deleted_docs, opstamp),
}
}
/// Returns the segment's id.
@@ -66,16 +70,6 @@ impl Segment {
self.meta.relative_path(component)
}
/// Protects a specific component file from being deleted.
///
/// Returns a FileProtection object. The file is guaranteed
/// to not be garbage collected as long as this `FileProtection` object
/// lives.
pub fn protect_from_delete(&self, component: SegmentComponent) -> FileProtection {
let path = self.relative_path(component);
self.index.directory().protect_file_from_delete(&path)
}
/// Open one of the component file for a *regular* read.
pub fn open_read(
&self,
@@ -105,35 +99,3 @@ pub trait SerializableSegment {
/// The number of documents in the segment.
fn write(&self, serializer: SegmentSerializer) -> Result<u32>;
}
#[cfg(test)]
mod tests {
use core::SegmentComponent;
use directory::Directory;
use std::collections::HashSet;
use schema::SchemaBuilder;
use Index;
#[test]
fn test_segment_protect_component() {
let mut index = Index::create_in_ram(SchemaBuilder::new().build());
let segment = index.new_segment();
let path = segment.relative_path(SegmentComponent::POSTINGS);
let directory = index.directory_mut();
directory.atomic_write(&*path, &vec![0u8]).unwrap();
let living_files = HashSet::new();
{
let _file_protection = segment.protect_from_delete(SegmentComponent::POSTINGS);
assert!(directory.exists(&*path));
directory.garbage_collect(|| living_files.clone());
assert!(directory.exists(&*path));
}
directory.garbage_collect(|| living_files);
assert!(!directory.exists(&*path));
}
}

View File

@@ -1,3 +1,5 @@
use std::slice;
/// Enum describing each component of a tantivy segment.
/// Each component is stored in its own file,
/// using the pattern `segment_uuid`.`component_extension`,
@@ -8,6 +10,8 @@ pub enum SegmentComponent {
POSTINGS,
/// Positions of terms in each document.
POSITIONS,
/// Index to seek within the position file
POSITIONSSKIP,
/// Column-oriented random-access storage of fields.
FASTFIELDS,
/// Stores the sum of the length (in terms) of each field for each document.
@@ -26,10 +30,11 @@ pub enum SegmentComponent {
impl SegmentComponent {
/// Iterates through the components.
pub fn iterator() -> slice::Iter<'static, SegmentComponent> {
static SEGMENT_COMPONENTS: [SegmentComponent; 8] = [
SegmentComponent::POSTINGS,
SegmentComponent::POSITIONS,
SegmentComponent::POSITIONSSKIP,
SegmentComponent::FASTFIELDS,
SegmentComponent::FIELDNORMS,
SegmentComponent::TERMS,

View File

@@ -1,6 +1,6 @@
use std::cmp::{Ord, Ordering};
use std::fmt;
use uuid::Uuid;
#[cfg(test)]
use std::sync::atomic;
@@ -52,12 +52,12 @@ impl SegmentId {
/// Picking the first 8 chars is ok to identify
/// segments in a display message.
pub fn short_uuid_string(&self) -> String {
(&self.0.to_simple_ref().to_string()[..8]).to_string()
}
/// Returns a segment uuid string.
pub fn uuid_string(&self) -> String {
self.0.to_simple_ref().to_string()
}
}

View File

@@ -1,7 +1,14 @@
use super::SegmentComponent;
use census::{Inventory, TrackedObject};
use core::SegmentId;
use serde;
use std::collections::HashSet;
use std::fmt;
use std::path::PathBuf;
lazy_static! {
static ref INVENTORY: Inventory<InnerSegmentMeta> = { Inventory::new() };
}
#[derive(Clone, Debug, Serialize, Deserialize)]
struct DeleteMeta {
@@ -13,32 +20,72 @@ struct DeleteMeta {
///
/// For instance the number of docs it contains,
/// how many are deleted, etc.
#[derive(Clone)]
pub struct SegmentMeta {
tracked: TrackedObject<InnerSegmentMeta>,
}
impl fmt::Debug for SegmentMeta {
fn fmt(&self, f: &mut fmt::Formatter) -> Result<(), fmt::Error> {
self.tracked.fmt(f)
}
}
impl serde::Serialize for SegmentMeta {
fn serialize<S>(
&self,
serializer: S,
) -> Result<<S as serde::Serializer>::Ok, <S as serde::Serializer>::Error>
where
S: serde::Serializer,
{
self.tracked.serialize(serializer)
}
}
impl<'a> serde::Deserialize<'a> for SegmentMeta {
fn deserialize<D>(deserializer: D) -> Result<Self, <D as serde::Deserializer<'a>>::Error>
where
D: serde::Deserializer<'a>,
{
let inner = InnerSegmentMeta::deserialize(deserializer)?;
let tracked = INVENTORY.track(inner);
Ok(SegmentMeta { tracked })
}
}
impl SegmentMeta {
/// Lists all living `SegmentMeta` objects at the time of the call.
pub fn all() -> Vec<SegmentMeta> {
INVENTORY
.list()
.into_iter()
.map(|inner| SegmentMeta { tracked: inner })
.collect::<Vec<_>>()
}
/// Creates a new `SegmentMeta` object.
#[doc(hidden)]
pub fn new(segment_id: SegmentId, max_doc: u32) -> SegmentMeta {
let inner = InnerSegmentMeta {
segment_id,
max_doc,
deletes: None,
};
SegmentMeta {
tracked: INVENTORY.track(inner),
}
}
/// Returns the segment id.
pub fn id(&self) -> SegmentId {
self.tracked.segment_id
}
/// Returns the number of deleted documents.
pub fn num_deleted_docs(&self) -> u32 {
self.tracked
.deletes
.as_ref()
.map(|delete_meta| delete_meta.num_deleted_docs)
.unwrap_or(0u32)
@@ -63,8 +110,9 @@ impl SegmentMeta {
pub fn relative_path(&self, component: SegmentComponent) -> PathBuf {
let mut path = self.id().uuid_string();
path.push_str(&*match component {
SegmentComponent::POSITIONS => ".pos".to_string(),
SegmentComponent::POSTINGS => ".idx".to_string(),
SegmentComponent::POSITIONS => ".pos".to_string(),
SegmentComponent::POSITIONSSKIP => ".posidx".to_string(),
SegmentComponent::TERMS => ".term".to_string(),
SegmentComponent::STORE => ".store".to_string(),
SegmentComponent::FASTFIELDS => ".fast".to_string(),
@@ -80,7 +128,7 @@ impl SegmentMeta {
/// and all the doc ids contains in this segment
/// are exactly (0..max_doc).
pub fn max_doc(&self) -> u32 {
self.tracked.max_doc
}
/// Return the number of documents in the segment.
@@ -91,25 +139,36 @@ impl SegmentMeta {
/// Returns the opstamp of the last delete operation
/// taken in account in this segment.
pub fn delete_opstamp(&self) -> Option<u64> {
self.tracked
.deletes
.as_ref()
.map(|delete_meta| delete_meta.opstamp)
}
/// Returns true iff the segment meta contains
/// delete information.
pub fn has_deletes(&self) -> bool {
self.num_deleted_docs() > 0
}
#[doc(hidden)]
pub fn set_max_doc(&mut self, max_doc: u32) {
self.max_doc = max_doc;
}
#[doc(hidden)]
pub fn with_delete_meta(self, num_deleted_docs: u32, opstamp: u64) -> SegmentMeta {
let delete_meta = DeleteMeta {
num_deleted_docs,
opstamp,
};
let tracked = self.tracked.map(move |inner_meta| InnerSegmentMeta {
segment_id: inner_meta.segment_id,
max_doc: inner_meta.max_doc,
deletes: Some(delete_meta),
});
SegmentMeta { tracked }
}
}
#[derive(Clone, Debug, Serialize, Deserialize)]
struct InnerSegmentMeta {
segment_id: SegmentId,
max_doc: u32,
deletes: Option<DeleteMeta>,
}

View File

@@ -1,31 +1,30 @@
use common::CompositeFile;
use common::HasLen;
use core::InvertedIndexReader;
use core::Segment;
use core::SegmentComponent;
use core::SegmentId;
use core::SegmentMeta;
use error::TantivyError;
use fastfield::DeleteBitSet;
use fastfield::FacetReader;
use fastfield::FastFieldReader;
use fastfield::{self, FastFieldNotAvailableError};
use fastfield::{BytesFastFieldReader, FastValue, MultiValueIntFastFieldReader};
use fieldnorm::FieldNormReader;
use schema::Cardinality;
use schema::Document;
use schema::Field;
use schema::FieldType;
use schema::Schema;
use space_usage::SegmentSpaceUsage;
use std::collections::HashMap;
use std::fmt;
use std::sync::Arc;
use std::sync::RwLock;
use store::StoreReader;
use termdict::TermDictionary;
use DocId;
use Result;
/// Entry point to access all of the datastructures of the `Segment`
///
@@ -45,11 +44,13 @@ pub struct SegmentReader {
inv_idx_reader_cache: Arc<RwLock<HashMap<Field, Arc<InvertedIndexReader>>>>,
segment_id: SegmentId,
segment_meta: SegmentMeta,
max_doc: DocId,
num_docs: DocId,
termdict_composite: CompositeFile,
postings_composite: CompositeFile,
positions_composite: CompositeFile,
positions_idx_composite: CompositeFile,
fast_fields_composite: CompositeFile,
fieldnorms_composite: CompositeFile,
@@ -64,7 +65,7 @@ impl SegmentReader {
/// Today, `tantivy` does not handle deletes, so it happens
/// to also be the number of documents in the index.
pub fn max_doc(&self) -> DocId {
self.max_doc
}
/// Returns the number of documents.
@@ -73,7 +74,12 @@ impl SegmentReader {
/// Today, `tantivy` does not handle deletes so max doc and
/// num_docs are the same.
pub fn num_docs(&self) -> DocId {
self.num_docs
}
/// Returns the schema of the index this segment belongs to.
pub fn schema(&self) -> &Schema {
&self.schema
}
/// Return the number of documents that have been
@@ -105,12 +111,25 @@ impl SegmentReader {
) -> fastfield::Result<FastFieldReader<Item>> {
let field_entry = self.schema.get_field_entry(field);
if Item::fast_field_cardinality(field_entry.field_type()) == Some(Cardinality::SingleValue)
{
self.fast_fields_composite
.open_read(field)
.ok_or_else(|| FastFieldNotAvailableError::new(field_entry))
.map(FastFieldReader::open)
} else {
Err(FastFieldNotAvailableError::new(field_entry))
}
}
pub(crate) fn fast_field_reader_with_idx<Item: FastValue>(
&self,
field: Field,
idx: usize,
) -> fastfield::Result<FastFieldReader<Item>> {
if let Some(ff_source) = self.fast_fields_composite.open_read_with_idx(field, idx) {
Ok(FastFieldReader::open(ff_source))
} else {
let field_entry = self.schema.get_field_entry(field);
Err(FastFieldNotAvailableError::new(field_entry))
}
}
@@ -123,41 +142,54 @@ impl SegmentReader {
) -> fastfield::Result<MultiValueIntFastFieldReader<Item>> {
let field_entry = self.schema.get_field_entry(field);
if Item::fast_field_cardinality(field_entry.field_type()) == Some(Cardinality::MultiValues)
{
let idx_reader = self.fast_field_reader_with_idx(field, 0)?;
let vals_reader = self.fast_field_reader_with_idx(field, 1)?;
Ok(MultiValueIntFastFieldReader::open(idx_reader, vals_reader))
} else {
Err(FastFieldNotAvailableError::new(field_entry))
}
}
/// Accessor to the `BytesFastFieldReader` associated to a given `Field`.
pub fn bytes_fast_field_reader(&self, field: Field) -> fastfield::Result<BytesFastFieldReader> {
let field_entry = self.schema.get_field_entry(field);
match *field_entry.field_type() {
FieldType::Bytes => {}
_ => return Err(FastFieldNotAvailableError::new(field_entry)),
}
let idx_reader = self
.fast_fields_composite
.open_read_with_idx(field, 0)
.ok_or_else(|| FastFieldNotAvailableError::new(field_entry))
.map(FastFieldReader::open)?;
let values = self
.fast_fields_composite
.open_read_with_idx(field, 1)
.ok_or_else(|| FastFieldNotAvailableError::new(field_entry))?;
Ok(BytesFastFieldReader::open(idx_reader, values))
}
/// Accessor to the `FacetReader` associated to a given `Field`.
pub fn facet_reader(&self, field: Field) -> Result<FacetReader> {
let field_entry = self.schema.get_field_entry(field);
if field_entry.field_type() != &FieldType::HierarchicalFacet {
return Err(TantivyError::InvalidArgument(format!(
"The field {:?} is not a \
hierarchical facet.",
field_entry
)));
}
let term_ords_reader = self.multi_fast_field_reader(field)?;
let termdict_source = self.termdict_composite.open_read(field).ok_or_else(|| {
TantivyError::InvalidArgument(format!(
"The field \"{}\" is a hierarchical \
but this segment does not seem to have the field term \
dictionary.",
field_entry.name()
))
})?;
let termdict = TermDictionary::from_source(&termdict_source);
let facet_reader = FacetReader::new(term_ords_reader, termdict);
Ok(facet_reader)
}
@@ -171,12 +203,14 @@ impl SegmentReader {
/// They are simply stored as a fast field, serialized in
/// the `.fieldnorm` file of the segment.
pub fn get_fieldnorms_reader(&self, field: Field) -> FieldNormReader {
if let Some(fieldnorm_source) = self.fieldnorms_composite
.open_read(field) {
if let Some(fieldnorm_source) = self.fieldnorms_composite.open_read(field) {
FieldNormReader::open(fieldnorm_source)
} else {
let field_name = self.schema.get_field_name(field);
let err_msg = format!(
"Field norm not found for field {:?}. Was it marked as indexed during indexing?",
field_name
);
panic!(err_msg);
}
}
@@ -194,6 +228,8 @@ impl SegmentReader {
let store_source = segment.open_read(SegmentComponent::STORE)?;
let store_reader = StoreReader::from_source(store_source);
fail_point!("SegmentReader::open#middle");
let postings_source = segment.open_read(SegmentComponent::POSTINGS)?;
let postings_composite = CompositeFile::open(&postings_source)?;
@@ -205,24 +241,32 @@ impl SegmentReader {
}
};
let positions_idx_composite = {
if let Ok(source) = segment.open_read(SegmentComponent::POSITIONSSKIP) {
CompositeFile::open(&source)?
} else {
CompositeFile::empty()
}
};
let fast_fields_data = segment.open_read(SegmentComponent::FASTFIELDS)?;
let fast_fields_composite = CompositeFile::open(&fast_fields_data)?;
let fieldnorms_data = segment.open_read(SegmentComponent::FIELDNORMS)?;
let fieldnorms_composite = CompositeFile::open(&fieldnorms_data)?;
let delete_bitset_opt = if segment.meta().has_deletes() {
let delete_data = segment.open_read(SegmentComponent::DELETE)?;
Some(DeleteBitSet::open(delete_data))
} else {
None
};
let schema = segment.schema();
Ok(SegmentReader {
inv_idx_reader_cache: Arc::new(RwLock::new(HashMap::new())),
segment_meta: segment.meta().clone(),
max_doc: segment.meta().max_doc(),
num_docs: segment.meta().num_docs(),
termdict_composite,
postings_composite,
fast_fields_composite,
@@ -231,6 +275,7 @@ impl SegmentReader {
store_reader,
delete_bitset_opt,
positions_composite,
positions_idx_composite,
schema,
})
}
@@ -243,7 +288,8 @@ impl SegmentReader {
/// term dictionary associated to a specific field,
/// and opening the posting list associated to any term.
pub fn inverted_index(&self, field: Field) -> Arc<InvertedIndexReader> {
if let Some(inv_idx_reader) = self.inv_idx_reader_cache
if let Some(inv_idx_reader) = self
.inv_idx_reader_cache
.read()
.expect("Lock poisoned. This should never happen")
.get(&field)
@@ -267,23 +313,31 @@ impl SegmentReader {
// As a result, no data is associated to the inverted index.
//
// Returns an empty inverted index.
return Arc::new(InvertedIndexReader::empty(field_type));
}
let postings_source = postings_source_opt.unwrap();
let termdict_source = self.termdict_composite
let termdict_source = self
.termdict_composite
.open_read(field)
.expect("Failed to open field term dictionary in composite file. Is the field indexed");
let positions_source = self.positions_composite
let positions_source = self
.positions_composite
.open_read(field)
.expect("Index corrupted. Failed to open field positions in composite file.");
let positions_idx_source = self
.positions_idx_composite
.open_read(field)
.expect("Index corrupted. Failed to open field positions in composite file.");
let inv_idx_reader = Arc::new(InvertedIndexReader::new(
TermDictionary::from_source(&termdict_source),
postings_source,
positions_source,
positions_idx_source,
record_option,
));
@@ -323,6 +377,26 @@ impl SegmentReader {
.map(|delete_set| delete_set.is_deleted(doc))
.unwrap_or(false)
}
/// Returns an iterator that will iterate over the alive document ids
pub fn doc_ids_alive(&self) -> SegmentReaderAliveDocsIterator {
SegmentReaderAliveDocsIterator::new(&self)
}
/// Summarize total space usage of this segment.
pub fn space_usage(&self) -> SegmentSpaceUsage {
SegmentSpaceUsage::new(
self.num_docs(),
self.termdict_composite.space_usage(),
self.postings_composite.space_usage(),
self.positions_composite.space_usage(),
self.positions_idx_composite.space_usage(),
self.fast_fields_composite.space_usage(),
self.fieldnorms_composite.space_usage(),
self.store_reader.space_usage(),
self.delete_bitset_opt.as_ref().map(|x| x.space_usage()).unwrap_or(0),
)
}
}
impl fmt::Debug for SegmentReader {
@@ -330,3 +404,90 @@ impl fmt::Debug for SegmentReader {
write!(f, "SegmentReader({:?})", self.segment_id)
}
}
/// Implements the iterator trait to allow easy iteration
/// over non-deleted ("alive") DocIds in a SegmentReader
pub struct SegmentReaderAliveDocsIterator<'a> {
reader: &'a SegmentReader,
max_doc: DocId,
current: DocId,
}
impl<'a> SegmentReaderAliveDocsIterator<'a> {
pub fn new(reader: &'a SegmentReader) -> SegmentReaderAliveDocsIterator<'a> {
SegmentReaderAliveDocsIterator {
reader,
max_doc: reader.max_doc(),
current: 0,
}
}
}
impl<'a> Iterator for SegmentReaderAliveDocsIterator<'a> {
type Item = DocId;
fn next(&mut self) -> Option<Self::Item> {
// TODO: Use TinySet (like in BitSetDocSet) to speed this process up
if self.current >= self.max_doc {
return None;
}
// find the next alive doc id
while self.reader.is_deleted(self.current) {
self.current += 1;
if self.current >= self.max_doc {
return None;
}
}
// capture the current alive DocId
let result = Some(self.current);
// move down the chain
self.current += 1;
result
}
}
#[cfg(test)]
mod test {
use core::Index;
use schema::{SchemaBuilder, Term, STORED, TEXT};
use DocId;
#[test]
fn test_alive_docs_iterator() {
let mut schema_builder = SchemaBuilder::new();
schema_builder.add_text_field("name", TEXT | STORED);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema.clone());
let name = schema.get_field("name").unwrap();
{
let mut index_writer = index.writer_with_num_threads(1, 3_000_000).unwrap();
index_writer.add_document(doc!(name => "tantivy"));
index_writer.add_document(doc!(name => "horse"));
index_writer.add_document(doc!(name => "jockey"));
index_writer.add_document(doc!(name => "cap"));
// we should now have one segment with four docs
index_writer.commit().unwrap();
}
{
let mut index_writer2 = index.writer(50_000_000).unwrap();
index_writer2.delete_term(Term::from_field_text(name, "horse"));
index_writer2.delete_term(Term::from_field_text(name, "cap"));
// ok, now we should have a deleted doc
index_writer2.commit().unwrap();
}
index.load_searchers().unwrap();
let searcher = index.searcher();
let docs: Vec<DocId> = searcher.segment_reader(0).doc_ids_alive().collect();
assert_eq!(vec![0u32, 2u32], docs);
}
}

View File

@@ -1,4 +0,0 @@
mod skip;
pub mod stacker;
pub use self::skip::{SkipList, SkipListBuilder};

View File

@@ -1,161 +0,0 @@
use std::mem;
use super::heap::{Heap, HeapAllocable};
#[inline]
pub fn is_power_of_2(val: u32) -> bool {
val & (val - 1) == 0
}
#[inline]
pub fn jump_needed(val: u32) -> bool {
val > 3 && is_power_of_2(val)
}
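A quick sketch of what `jump_needed` implies for the exponential unrolled list (pure illustration, using only the helper defined above): a fresh block is allocated exactly when the element count crosses a power of two greater than 3.
// The 4th, 8th, 16th, ... push each trigger one new block allocation,
// so blocks roughly double in size as the list grows.
let jumps: Vec<u32> = (1u32..=64).filter(|&n| jump_needed(n)).collect();
assert_eq!(jumps, vec![4, 8, 16, 32, 64]);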
#[derive(Debug, Clone)]
pub struct ExpUnrolledLinkedList {
len: u32,
end: u32,
val0: u32,
val1: u32,
val2: u32,
next: u32, // inline of the first block
}
impl ExpUnrolledLinkedList {
pub fn iter<'a>(&self, addr: u32, heap: &'a Heap) -> ExpUnrolledLinkedListIterator<'a> {
ExpUnrolledLinkedListIterator {
heap,
addr: addr + 2u32 * (mem::size_of::<u32>() as u32),
len: self.len,
consumed: 0,
}
}
pub fn push(&mut self, val: u32, heap: &Heap) {
self.len += 1;
if jump_needed(self.len) {
// we need to allocate another block.
// ... As we want to grow blocks exponentially,
// the next block has a size of (length so far),
// and we need to add 1u32 to store the pointer
// to the next element.
let new_block_size: usize = (self.len as usize + 1) * mem::size_of::<u32>();
let new_block_addr: u32 = heap.allocate_space(new_block_size);
heap.set(self.end, &new_block_addr);
self.end = new_block_addr;
}
heap.set(self.end, &val);
self.end += mem::size_of::<u32>() as u32;
}
}
impl HeapAllocable for u32 {
fn with_addr(_addr: u32) -> u32 {
0u32
}
}
impl HeapAllocable for ExpUnrolledLinkedList {
fn with_addr(addr: u32) -> ExpUnrolledLinkedList {
let last_addr = addr + mem::size_of::<u32>() as u32 * 2u32;
ExpUnrolledLinkedList {
len: 0u32,
end: last_addr,
val0: 0u32,
val1: 0u32,
val2: 0u32,
next: 0u32,
}
}
}
pub struct ExpUnrolledLinkedListIterator<'a> {
heap: &'a Heap,
addr: u32,
len: u32,
consumed: u32,
}
impl<'a> Iterator for ExpUnrolledLinkedListIterator<'a> {
type Item = u32;
fn next(&mut self) -> Option<u32> {
if self.consumed == self.len {
None
} else {
let addr: u32;
self.consumed += 1;
if jump_needed(self.consumed) {
addr = *self.heap.get_mut_ref(self.addr);
} else {
addr = self.addr;
}
self.addr = addr + mem::size_of::<u32>() as u32;
Some(*self.heap.get_mut_ref(addr))
}
}
}
#[cfg(test)]
mod tests {
use super::*;
use super::super::heap::Heap;
use test::Bencher;
const NUM_STACK: usize = 10_000;
const STACK_SIZE: u32 = 1000;
#[test]
fn test_stack() {
let heap = Heap::with_capacity(1_000_000);
let (addr, stack) = heap.allocate_object::<ExpUnrolledLinkedList>();
stack.push(1u32, &heap);
stack.push(2u32, &heap);
stack.push(4u32, &heap);
stack.push(8u32, &heap);
{
let mut it = stack.iter(addr, &heap);
assert_eq!(it.next().unwrap(), 1u32);
assert_eq!(it.next().unwrap(), 2u32);
assert_eq!(it.next().unwrap(), 4u32);
assert_eq!(it.next().unwrap(), 8u32);
assert!(it.next().is_none());
}
}
#[bench]
fn bench_push_vec(bench: &mut Bencher) {
bench.iter(|| {
let mut vecs = Vec::with_capacity(100);
for _ in 0..NUM_STACK {
vecs.push(Vec::new());
}
for s in 0..NUM_STACK {
for i in 0u32..STACK_SIZE {
let t = s * 392017 % NUM_STACK;
vecs[t].push(i);
}
}
});
}
#[bench]
fn bench_push_stack(bench: &mut Bencher) {
let heap = Heap::with_capacity(64_000_000);
bench.iter(|| {
let mut stacks = Vec::with_capacity(100);
for _ in 0..NUM_STACK {
let (_, stack) = heap.allocate_object::<ExpUnrolledLinkedList>();
stacks.push(stack);
}
for s in 0..NUM_STACK {
for i in 0u32..STACK_SIZE {
let t = s * 392017 % NUM_STACK;
stacks[t].push(i, &heap);
}
}
heap.clear();
});
}
}

View File

@@ -1,309 +0,0 @@
use std::iter;
use std::mem;
use postings::UnorderedTermId;
use super::heap::{BytesRef, Heap, HeapAllocable};
mod murmurhash2 {
const SEED: u32 = 3_242_157_231u32;
#[inline(always)]
pub fn murmurhash2(key: &[u8]) -> u32 {
let mut key_ptr: *const u32 = key.as_ptr() as *const u32;
let m: u32 = 0x5bd1_e995;
let r = 24;
let len = key.len() as u32;
let mut h: u32 = SEED ^ len;
let num_blocks = len >> 2;
for _ in 0..num_blocks {
let mut k: u32 = unsafe { *key_ptr };
k = k.wrapping_mul(m);
k ^= k >> r;
k = k.wrapping_mul(m);
k = k.wrapping_mul(m);
h ^= k;
key_ptr = key_ptr.wrapping_offset(1);
}
// Handle the last few bytes of the input array
let remaining = len & 3;
let key_ptr_u8: *const u8 = key_ptr as *const u8;
match remaining {
3 => {
h ^= unsafe { u32::from(*key_ptr_u8.wrapping_offset(2)) } << 16;
h ^= unsafe { u32::from(*key_ptr_u8.wrapping_offset(1)) } << 8;
h ^= unsafe { u32::from(*key_ptr_u8) };
h = h.wrapping_mul(m);
}
2 => {
h ^= unsafe { u32::from(*key_ptr_u8.wrapping_offset(1)) } << 8;
h ^= unsafe { u32::from(*key_ptr_u8) };
h = h.wrapping_mul(m);
}
1 => {
h ^= unsafe { u32::from(*key_ptr_u8) };
h = h.wrapping_mul(m);
}
_ => {}
}
h ^= h >> 13;
h = h.wrapping_mul(m);
h ^ (h >> 15)
}
}
/// Split the thread memory budget into
/// - the heap size
/// - the hash table "table" itself.
///
/// Returns (the heap size in bytes, the hash table size in number of bits)
pub(crate) fn split_memory(per_thread_memory_budget: usize) -> (usize, usize) {
let table_size_limit: usize = per_thread_memory_budget / 3;
let compute_table_size = |num_bits: usize| (1 << num_bits) * mem::size_of::<KeyValue>();
let table_num_bits: usize = (1..)
.into_iter()
.take_while(|num_bits: &usize| compute_table_size(*num_bits) < table_size_limit)
.last()
.expect(&format!(
"Per thread memory is too small: {}",
per_thread_memory_budget
));
let table_size = compute_table_size(table_num_bits);
let heap_size = per_thread_memory_budget - table_size;
(heap_size, table_num_bits)
}
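// Worked example (editor's note): `KeyValue` below is 8 bytes, so a 1_000_000 byte
// budget gives a table limit of 333_333 bytes; (1 << 15) * 8 = 262_144 fits while
// (1 << 16) * 8 = 524_288 does not, hence 15 bits for the table and the remaining
// 737_856 bytes for the heap, matching the `test_hashmap_size` unit test below.
#[cfg(test)]
mod split_memory_sketch {
    use super::split_memory;

    #[test]
    fn worked_example() {
        assert_eq!(split_memory(1_000_000), (737_856, 15));
    }
}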
/// `KeyValue` is the item stored in the hash table.
/// The key is actually a `BytesRef` object stored in an external heap.
/// The `value_addr` also points to an address in the heap.
///
/// The key and the value are stored contiguously.
/// For this reason, the (start, stop) information is redundant
/// and could be simplified in the future.
#[derive(Copy, Clone, Default)]
struct KeyValue {
key_value_addr: BytesRef,
hash: u32,
}
impl KeyValue {
fn is_empty(&self) -> bool {
self.key_value_addr.is_null()
}
}
/// Customized `HashMap` with string keys
///
/// This `HashMap` takes `String` as keys. Keys are
/// stored in a user-defined heap.
///
/// The quirky API has the benefit of avoiding
/// computing the hash of the key twice, and of
/// copying the key as long as there is no insert.
///
pub struct TermHashMap<'a> {
table: Box<[KeyValue]>,
heap: &'a Heap,
mask: usize,
occupied: Vec<usize>,
}
struct QuadraticProbing {
hash: usize,
i: usize,
mask: usize,
}
impl QuadraticProbing {
fn compute(hash: usize, mask: usize) -> QuadraticProbing {
QuadraticProbing {
hash,
i: 0,
mask,
}
}
#[inline]
fn next_probe(&mut self) -> usize {
self.i += 1;
(self.hash + self.i * self.i) & self.mask
}
}
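// Editor's sketch of the probe sequence defined above: after a collision, successive
// probes visit (hash + 1), (hash + 4), (hash + 9), ... modulo the table size.
#[cfg(test)]
mod probing_sketch {
    use super::QuadraticProbing;

    #[test]
    fn probe_sequence() {
        let mask = (1 << 4) - 1; // a 16-slot table
        let mut probing = QuadraticProbing::compute(3, mask);
        assert_eq!(probing.next_probe(), (3 + 1) & mask);
        assert_eq!(probing.next_probe(), (3 + 4) & mask);
        assert_eq!(probing.next_probe(), (3 + 9) & mask);
    }
}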
impl<'a> TermHashMap<'a> {
pub fn new(num_bucket_power_of_2: usize, heap: &'a Heap) -> TermHashMap<'a> {
let table_size = 1 << num_bucket_power_of_2;
let table: Vec<KeyValue> = iter::repeat(KeyValue::default()).take(table_size).collect();
TermHashMap {
table: table.into_boxed_slice(),
heap,
mask: table_size - 1,
occupied: Vec::with_capacity(table_size / 2),
}
}
fn probe(&self, hash: u32) -> QuadraticProbing {
QuadraticProbing::compute(hash as usize, self.mask)
}
pub fn is_saturated(&self) -> bool {
self.table.len() < self.occupied.len() * 3
}
#[inline(never)]
fn get_key_value(&self, bytes_ref: BytesRef) -> (&[u8], u32) {
let key_bytes: &[u8] = self.heap.get_slice(bytes_ref);
let expull_addr: u32 = bytes_ref.addr() + 2 + key_bytes.len() as u32;
(key_bytes, expull_addr)
}
pub fn set_bucket(&mut self, hash: u32, key_value_addr: BytesRef, bucket: usize) {
self.occupied.push(bucket);
self.table[bucket] = KeyValue {
key_value_addr, hash
};
}
pub fn iter<'b: 'a>(&'b self) -> impl Iterator<Item = (&'a [u8], u32, UnorderedTermId)> + 'b {
self.occupied.iter().cloned().map(move |bucket: usize| {
let kv = self.table[bucket];
let (key, offset) = self.get_key_value(kv.key_value_addr);
(key, offset, bucket as UnorderedTermId)
})
}
pub fn get_or_create<S: AsRef<[u8]>, V: HeapAllocable>(
&mut self,
key: S,
) -> (UnorderedTermId, &mut V) {
let key_bytes: &[u8] = key.as_ref();
let hash = murmurhash2::murmurhash2(key.as_ref());
let mut probe = self.probe(hash);
loop {
let bucket = probe.next_probe();
let kv: KeyValue = self.table[bucket];
if kv.is_empty() {
let key_bytes_ref = self.heap.allocate_and_set(key_bytes);
let (addr, val): (u32, &mut V) = self.heap.allocate_object();
assert_eq!(addr, key_bytes_ref.addr() + 2 + key_bytes.len() as u32);
self.set_bucket(hash, key_bytes_ref, bucket);
return (bucket as UnorderedTermId, val);
} else if kv.hash == hash {
let (stored_key, expull_addr): (&[u8], u32) = self.get_key_value(kv.key_value_addr);
if stored_key == key_bytes {
return (
bucket as UnorderedTermId,
self.heap.get_mut_ref(expull_addr),
);
}
}
}
}
}
#[cfg(test)]
mod tests {
use super::*;
use super::super::heap::{Heap, HeapAllocable};
use super::murmurhash2::murmurhash2;
use test::Bencher;
use std::collections::HashSet;
use super::split_memory;
struct TestValue {
val: u32,
_addr: u32,
}
impl HeapAllocable for TestValue {
fn with_addr(addr: u32) -> TestValue {
TestValue {
val: 0u32,
_addr: addr,
}
}
}
#[test]
fn test_hashmap_size() {
assert_eq!(split_memory(100_000), (67232, 12));
assert_eq!(split_memory(1_000_000), (737856, 15));
assert_eq!(split_memory(10_000_000), (7902848, 18));
}
#[test]
fn test_hash_map() {
let heap = Heap::with_capacity(2_000_000);
let mut hash_map: TermHashMap = TermHashMap::new(18, &heap);
{
let v: &mut TestValue = hash_map.get_or_create("abc").1;
assert_eq!(v.val, 0u32);
v.val = 3u32;
}
{
let v: &mut TestValue = hash_map.get_or_create("abcd").1;
assert_eq!(v.val, 0u32);
v.val = 4u32;
}
{
let v: &mut TestValue = hash_map.get_or_create("abc").1;
assert_eq!(v.val, 3u32);
}
{
let v: &mut TestValue = hash_map.get_or_create("abcd").1;
assert_eq!(v.val, 4u32);
}
let mut iter_values = hash_map.iter();
{
let (_, addr, _) = iter_values.next().unwrap();
let val: &TestValue = heap.get_ref(addr);
assert_eq!(val.val, 3u32);
}
{
let (_, addr, _) = iter_values.next().unwrap();
let val: &TestValue = heap.get_ref(addr);
assert_eq!(val.val, 4u32);
}
assert!(iter_values.next().is_none());
}
#[test]
fn test_murmur() {
let s1 = "abcdef";
let s2 = "abcdeg";
for i in 0..5 {
assert_eq!(
murmurhash2(&s1[i..5].as_bytes()),
murmurhash2(&s2[i..5].as_bytes())
);
}
}
#[test]
fn test_murmur_collisions() {
let mut set: HashSet<u32> = HashSet::default();
for i in 0..10_000 {
let s = format!("hash{}", i);
let hash = murmurhash2(s.as_bytes());
set.insert(hash);
}
assert_eq!(set.len(), 10_000);
}
#[bench]
fn bench_murmurhash_2(b: &mut Bencher) {
let keys: Vec<&'static str> =
vec!["wer qwe qwe qwe ", "werbq weqweqwe2 ", "weraq weqweqwe3 "];
b.iter(|| {
keys.iter()
.map(|&s| s.as_bytes())
.map(murmurhash2::murmurhash2)
.map(|h| h as u64)
.last()
.unwrap()
});
}
}


@@ -1,233 +0,0 @@
use std::cell::UnsafeCell;
use std::mem;
use std::ptr;
use byteorder::{ByteOrder, NativeEndian};
/// `BytesRef` refers to a slice in tantivy's custom `Heap`.
///
/// The referenced slice encodes the length of the `&[u8]`
/// on 16 bits, followed by the data itself.
#[derive(Copy, Clone)]
pub struct BytesRef(u32);
impl BytesRef {
pub fn is_null(&self) -> bool {
self.0 == u32::max_value()
}
pub fn addr(&self) -> u32 {
self.0
}
}
impl Default for BytesRef {
fn default() -> BytesRef {
BytesRef(u32::max_value())
}
}
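// Editor's sketch of the encoding described above (assuming the native-endian u16
// length prefix written by `InnerHeap::allocate_and_set` further down): a `BytesRef`
// address points at a 2-byte length followed by the payload bytes.
#[cfg(test)]
mod bytes_ref_layout_sketch {
    #[test]
    fn length_prefixed_slice() {
        let data = b"abc";
        let mut heap_bytes = vec![0u8; 16];
        let addr = 5usize; // pretend this was returned by allocate_space
        heap_bytes[addr..addr + 2].copy_from_slice(&(data.len() as u16).to_ne_bytes());
        heap_bytes[addr + 2..addr + 2 + data.len()].copy_from_slice(data);
        let len = u16::from_ne_bytes([heap_bytes[addr], heap_bytes[addr + 1]]) as usize;
        assert_eq!(&heap_bytes[addr + 2..addr + 2 + len], b"abc");
    }
}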
/// Object that can be allocated in tantivy's custom `Heap`.
pub trait HeapAllocable {
fn with_addr(addr: u32) -> Self;
}
/// Tantivy's custom `Heap`.
pub struct Heap {
inner: UnsafeCell<InnerHeap>,
}
#[cfg_attr(feature = "cargo-clippy", allow(mut_from_ref))]
impl Heap {
/// Creates a new heap with a given capacity
pub fn with_capacity(num_bytes: usize) -> Heap {
Heap {
inner: UnsafeCell::new(InnerHeap::with_capacity(num_bytes)),
}
}
fn inner(&self) -> &mut InnerHeap {
unsafe { &mut *self.inner.get() }
}
/// Clears the heap. All the underlying data is lost.
///
/// This heap does not support deallocation.
/// This method is the only way to free memory.
pub fn clear(&self) {
self.inner().clear();
}
/// Returns the amount of free space, in bytes.
pub fn num_free_bytes(&self) -> u32 {
self.inner().num_free_bytes()
}
/// Allocates a given amount of space and returns its address
/// in the Heap.
pub fn allocate_space(&self, num_bytes: usize) -> u32 {
self.inner().allocate_space(num_bytes)
}
/// Allocates an object in the heap.
pub fn allocate_object<V: HeapAllocable>(&self) -> (u32, &mut V) {
let addr = self.inner().allocate_space(mem::size_of::<V>());
let v: V = V::with_addr(addr);
self.inner().set(addr, &v);
(addr, self.inner().get_mut_ref(addr))
}
/// Stores a `&[u8]` in the heap and returns the destination BytesRef.
pub fn allocate_and_set(&self, data: &[u8]) -> BytesRef {
self.inner().allocate_and_set(data)
}
/// Fetches the `&[u8]` stored in the slice defined by the `BytesRef`
/// given as argument.
pub fn get_slice(&self, bytes_ref: BytesRef) -> &[u8] {
self.inner().get_slice(bytes_ref)
}
/// Stores an item's data in the heap, at the given `address`.
pub fn set<Item>(&self, addr: u32, val: &Item) {
self.inner().set(addr, val);
}
/// Returns a mutable reference to an `Item` stored at the given `addr`.
pub fn get_mut_ref<Item>(&self, addr: u32) -> &mut Item {
self.inner().get_mut_ref(addr)
}
/// Returns a mutable reference to an `Item` at a given `addr`.
#[cfg(test)]
pub fn get_ref<Item>(&self, addr: u32) -> &mut Item {
self.get_mut_ref(addr)
}
}
struct InnerHeap {
buffer: Vec<u8>,
buffer_len: u32,
used: u32,
next_heap: Option<Box<InnerHeap>>,
}
impl InnerHeap {
pub fn with_capacity(num_bytes: usize) -> InnerHeap {
let buffer: Vec<u8> = vec![0u8; num_bytes];
InnerHeap {
buffer,
buffer_len: num_bytes as u32,
next_heap: None,
used: 0u32,
}
}
pub fn clear(&mut self) {
self.used = 0u32;
self.next_heap = None;
}
// Returns the number of free bytes. If the buffer
// has reached its capacity and overflowed to another buffer, returns 0.
pub fn num_free_bytes(&self) -> u32 {
if self.next_heap.is_some() {
0u32
} else {
self.buffer_len - self.used
}
}
pub fn allocate_space(&mut self, num_bytes: usize) -> u32 {
let addr = self.used;
self.used += num_bytes as u32;
if self.used <= self.buffer_len {
addr
} else {
if self.next_heap.is_none() {
info!(
r#"Exceeded heap size. The segment will be committed right
after indexing this document."#,
);
self.next_heap = Some(Box::new(InnerHeap::with_capacity(self.buffer_len as usize)));
}
self.next_heap.as_mut().unwrap().allocate_space(num_bytes) + self.buffer_len
}
}
fn get_slice(&self, bytes_ref: BytesRef) -> &[u8] {
let start = bytes_ref.0;
if start >= self.buffer_len {
self.next_heap
.as_ref()
.unwrap()
.get_slice(BytesRef(start - self.buffer_len))
} else {
let start = start as usize;
let len = NativeEndian::read_u16(&self.buffer[start..start + 2]) as usize;
&self.buffer[start + 2..start + 2 + len]
}
}
fn get_mut_slice(&mut self, start: u32, stop: u32) -> &mut [u8] {
if start >= self.buffer_len {
self.next_heap
.as_mut()
.unwrap()
.get_mut_slice(start - self.buffer_len, stop - self.buffer_len)
} else {
&mut self.buffer[start as usize..stop as usize]
}
}
fn allocate_and_set(&mut self, data: &[u8]) -> BytesRef {
assert!(data.len() < u16::max_value() as usize);
let total_len = 2 + data.len();
let start = self.allocate_space(total_len);
let total_buff = self.get_mut_slice(start, start + total_len as u32);
NativeEndian::write_u16(&mut total_buff[0..2], data.len() as u16);
total_buff[2..].clone_from_slice(data);
BytesRef(start)
}
fn get_mut(&mut self, addr: u32) -> *mut u8 {
if addr >= self.buffer_len {
self.next_heap
.as_mut()
.unwrap()
.get_mut(addr - self.buffer_len)
} else {
let addr_isize = addr as isize;
unsafe { self.buffer.as_mut_ptr().offset(addr_isize) }
}
}
fn get_mut_ref<Item>(&mut self, addr: u32) -> &mut Item {
if addr >= self.buffer_len {
self.next_heap
.as_mut()
.unwrap()
.get_mut_ref(addr - self.buffer_len)
} else {
let v_ptr_u8 = self.get_mut(addr) as *mut u8;
let v_ptr = v_ptr_u8 as *mut Item;
unsafe { &mut *v_ptr }
}
}
pub fn set<Item>(&mut self, addr: u32, val: &Item) {
if addr >= self.buffer_len {
self.next_heap
.as_mut()
.unwrap()
.set(addr - self.buffer_len, val);
} else {
let v_ptr: *const Item = val as *const Item;
let v_ptr_u8: *const u8 = v_ptr as *const u8;
debug_assert!(addr + mem::size_of::<Item>() as u32 <= self.used);
unsafe {
let dest_ptr: *mut u8 = self.get_mut(addr);
ptr::copy(v_ptr_u8, dest_ptr, mem::size_of::<Item>());
}
}
}
}
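// Editor's sketch (hypothetical helper, not part of this file) of the address scheme
// above: once the first buffer is full, allocations spill into `next_heap`, and any
// address past `buffer_len` is resolved by recursing with `addr - buffer_len`,
// exactly as `get_mut`, `get_slice` and `set` do.
#[cfg(test)]
mod overflow_addressing_sketch {
    /// Returns (how many overflow heaps deep, offset within that heap).
    fn resolve(addr: u32, buffer_len: u32) -> (usize, u32) {
        let mut depth = 0;
        let mut offset = addr;
        while offset >= buffer_len {
            offset -= buffer_len;
            depth += 1;
        }
        (depth, offset)
    }

    #[test]
    fn address_translation() {
        assert_eq!(resolve(250, 1_000), (0, 250));
        assert_eq!(resolve(1_250, 1_000), (1, 250));
    }
}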


@@ -1,43 +0,0 @@
pub(crate) mod hashmap;
mod heap;
mod expull;
pub use self::heap::{Heap, HeapAllocable};
pub use self::expull::ExpUnrolledLinkedList;
pub use self::hashmap::TermHashMap;
#[test]
fn test_unrolled_linked_list() {
use std::collections;
let heap = Heap::with_capacity(30_000_000);
{
heap.clear();
let mut ks: Vec<usize> = (1..5).map(|k| k * 100).collect();
ks.push(2);
ks.push(3);
for k in (1..5).map(|k| k * 100) {
let mut hashmap: TermHashMap = TermHashMap::new(10, &heap);
for j in 0..k {
for i in 0..500 {
let v: &mut ExpUnrolledLinkedList = hashmap.get_or_create(i.to_string()).1;
v.push(i * j, &heap);
}
}
let mut map_addr: collections::HashMap<Vec<u8>, u32> = collections::HashMap::new();
for (key, addr, _) in hashmap.iter() {
map_addr.insert(Vec::from(key), addr);
}
for i in 0..500 {
let key: String = i.to_string();
let addr: u32 = *map_addr.get(key.as_bytes()).unwrap();
let exp_pull: &ExpUnrolledLinkedList = heap.get_ref(addr);
let mut it = exp_pull.iter(addr, &heap);
for j in 0..k {
assert_eq!(it.next().unwrap(), i * j);
}
assert!(!it.next().is_some());
}
}
}
}


@@ -1,11 +1,11 @@
use std::marker::Send;
use std::fmt;
use std::path::Path;
use directory::error::{DeleteError, OpenReadError, OpenWriteError};
use directory::{ReadOnlySource, WritePtr};
use std::result;
use std::fmt;
use std::io;
use std::marker::Send;
use std::marker::Sync;
use std::path::Path;
use std::result;
/// Write-once read many (WORM) abstraction for where
/// tantivy's data should be stored.
@@ -17,7 +17,7 @@ use std::marker::Sync;
/// - The [`RAMDirectory`](struct.RAMDirectory.html), which
/// should be used mostly for tests.
///
pub trait Directory: fmt::Debug + Send + Sync + 'static {
pub trait Directory: DirectoryClone + fmt::Debug + Send + Sync + 'static {
/// Opens a virtual file for read.
///
/// Once a virtual file is open, its data may not
@@ -73,7 +73,19 @@ pub trait Directory: fmt::Debug + Send + Sync + 'static {
///
/// The file may or may not previously exist.
fn atomic_write(&mut self, path: &Path, data: &[u8]) -> io::Result<()>;
}
/// DirectoryClone
pub trait DirectoryClone {
/// Clones the directory and boxes the clone
fn box_clone(&self) -> Box<Directory>;
}
impl<T> DirectoryClone for T
where
T: 'static + Directory + Clone,
{
fn box_clone(&self) -> Box<Directory> {
Box::new(self.clone())
}
}
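// Editor's sketch of why the blanket impl above exists: any concrete `Directory`
// that is `Clone` gets `box_clone` for free, so boxed trait objects can be
// duplicated. (`RAMDirectory::create()` is assumed here as the usual in-memory
// directory constructor.)
#[cfg(test)]
mod box_clone_sketch {
    use super::{Directory, DirectoryClone};
    use directory::RAMDirectory;

    #[test]
    fn clone_a_boxed_directory() {
        let dir: Box<Directory> = Box::new(RAMDirectory::create());
        let _copy: Box<Directory> = dir.box_clone();
    }
}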


@@ -1,7 +1,7 @@
use std::error::Error as StdError;
use std::path::PathBuf;
use std::io;
use std::fmt;
use std::io;
use std::path::PathBuf;
/// General IO error with an optional path to the offending file.
#[derive(Debug)]
@@ -173,9 +173,6 @@ pub enum DeleteError {
/// Any kind of IO error that happens when
/// interacting with the underlying IO device.
IOError(IOError),
/// The file may not be deleted because it is
/// protected.
FileProtected(PathBuf),
}
impl From<IOError> for DeleteError {
@@ -190,9 +187,6 @@ impl fmt::Display for DeleteError {
DeleteError::FileDoesNotExist(ref path) => {
write!(f, "the file '{:?}' does not exist", path)
}
DeleteError::FileProtected(ref path) => {
write!(f, "the file '{:?}' is protected and can't be deleted", path)
}
DeleteError::IOError(ref err) => {
write!(f, "an io error occurred while deleting a file: '{}'", err)
}
@@ -207,7 +201,7 @@ impl StdError for DeleteError {
fn cause(&self) -> Option<&StdError> {
match *self {
DeleteError::FileDoesNotExist(_) | DeleteError::FileProtected(_) => None,
DeleteError::FileDoesNotExist(_) => None,
DeleteError::IOError(ref err) => Some(err),
}
}


@@ -1,18 +1,29 @@
use std::path::{Path, PathBuf};
use serde_json;
use core::MANAGED_FILEPATH;
use directory::error::{DeleteError, IOError, OpenReadError, OpenWriteError};
use directory::{ReadOnlySource, WritePtr};
use std::result;
use std::io;
use Directory;
use std::sync::{Arc, RwLock};
use error::TantivyError;
use indexer::LockType;
use serde_json;
use std::collections::HashSet;
use std::sync::RwLockWriteGuard;
use std::io;
use std::io::Write;
use core::MANAGED_FILEPATH;
use std::collections::HashMap;
use std::fmt;
use error::{ErrorKind, Result, ResultExt};
use std::path::{Path, PathBuf};
use std::result;
use std::sync::RwLockWriteGuard;
use std::sync::{Arc, RwLock};
use Directory;
use Result;
/// Returns true iff the file is "managed".
/// Non-managed files are not subject to garbage collection.
///
/// Filenames that start with a "." (typically locks)
/// are not managed.
fn is_managed(path: &Path) -> bool {
path.to_str()
.map(|p_str| !p_str.starts_with('.'))
.unwrap_or(true)
}
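// A minimal illustration (editor's sketch) of the rule above: dot-prefixed paths
// such as lock files are never garbage collected because they are never managed.
#[cfg(test)]
mod is_managed_sketch {
    use super::is_managed;
    use std::path::Path;

    #[test]
    fn dot_files_are_not_managed() {
        assert!(is_managed(Path::new("meta.json")));
        assert!(!is_managed(Path::new(".tantivy-meta.lock")));
    }
}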
/// Wrapper of directories that keeps track of files created by Tantivy.
///
@@ -32,37 +43,6 @@ pub struct ManagedDirectory {
#[derive(Debug, Default)]
struct MetaInformation {
managed_paths: HashSet<PathBuf>,
protected_files: HashMap<PathBuf, usize>,
}
/// A `FileProtection` prevents the garbage collection of a file.
///
/// See `ManagedDirectory.protect_file_from_delete`.
pub struct FileProtection {
directory: ManagedDirectory,
path: PathBuf,
}
fn unprotect_file_from_delete(directory: &ManagedDirectory, path: &Path) {
let mut meta_informations_wlock = directory
.meta_informations
.write()
.expect("Managed file lock poisoned");
if let Some(counter_ref_mut) = meta_informations_wlock.protected_files.get_mut(path) {
(*counter_ref_mut) -= 1;
}
}
impl fmt::Debug for FileProtection {
fn fmt(&self, formatter: &mut fmt::Formatter) -> result::Result<(), fmt::Error> {
write!(formatter, "FileProtectionFor({:?})", self.path)
}
}
impl Drop for FileProtection {
fn drop(&mut self) {
unprotect_file_from_delete(&self.directory, &*self.path);
}
}
/// Saves the file containing the list of existing files
@@ -72,7 +52,7 @@ fn save_managed_paths(
wlock: &RwLockWriteGuard<MetaInformation>,
) -> io::Result<()> {
let mut w = serde_json::to_vec(&wlock.managed_paths)?;
write!(&mut w, "\n")?;
writeln!(&mut w)?;
directory.atomic_write(&MANAGED_FILEPATH, &w[..])?;
Ok(())
}
@@ -84,17 +64,16 @@ impl ManagedDirectory {
Ok(data) => {
let managed_files_json = String::from_utf8_lossy(&data);
let managed_files: HashSet<PathBuf> = serde_json::from_str(&managed_files_json)
.chain_err(|| ErrorKind::CorruptedFile(MANAGED_FILEPATH.clone()))?;
.map_err(|_| TantivyError::CorruptedFile(MANAGED_FILEPATH.clone()))?;
Ok(ManagedDirectory {
directory: box directory,
directory: Box::new(directory),
meta_informations: Arc::new(RwLock::new(MetaInformation {
managed_paths: managed_files,
protected_files: HashMap::default(),
})),
})
}
Err(OpenReadError::FileDoesNotExist(_)) => Ok(ManagedDirectory {
directory: box directory,
directory: Box::new(directory),
meta_informations: Arc::default(),
}),
Err(OpenReadError::IOError(e)) => Err(From::from(e)),
@@ -115,25 +94,35 @@ impl ManagedDirectory {
pub fn garbage_collect<L: FnOnce() -> HashSet<PathBuf>>(&mut self, get_living_files: L) {
info!("Garbage collect");
let mut files_to_delete = vec![];
// It is crucial to get the living files after acquiring the
// read lock of meta informations. That way, we
// avoid the following scenario.
//
// 1) we get the list of living files.
// 2) someone creates a new file.
// 3) we start garbage collection and remove this file
// even though it is a living file.
//
// releasing the lock as .delete() will use it too.
{
// releasing the lock as .delete() will use it too.
let meta_informations_rlock = self.meta_informations
let meta_informations_rlock = self
.meta_informations
.read()
.expect("Managed directory rlock poisoned in garbage collect.");
// It is crucial to get the living files after acquiring the
// read lock of meta informations. That way, we
// avoid the following scenario.
//
// 1) we get the list of living files.
// 2) someone creates a new file.
// 3) we start garbage collection and remove this file
// even though it is a living file.
let living_files = get_living_files();
for managed_path in &meta_informations_rlock.managed_paths {
if !living_files.contains(managed_path) {
files_to_delete.push(managed_path.clone());
// The point of this second "file" lock is to enforce the following scenario
// 1) process B tries to load a new set of searcher.
// The list of segments is loaded
// 2) writer change meta.json (for instance after a merge or a commit)
// 3) gc kicks in.
// 4) gc removes a file that was useful for process B, before process B opened it.
if let Ok(_meta_lock) = LockType::MetaLock.acquire_lock(self) {
let living_files = get_living_files();
for managed_path in &meta_informations_rlock.managed_paths {
if !living_files.contains(managed_path) {
files_to_delete.push(managed_path.clone());
}
}
}
}
@@ -158,9 +147,6 @@ impl ManagedDirectory {
error!("Failed to delete {:?}", file_to_delete);
}
}
DeleteError::FileProtected(_) => {
// this is expected.
}
}
}
}
@@ -170,7 +156,8 @@ impl ManagedDirectory {
if !deleted_files.is_empty() {
// update the list of managed files by removing
// the file that were removed.
let mut meta_informations_wlock = self.meta_informations
let mut meta_informations_wlock = self
.meta_informations
.write()
.expect("Managed directory wlock poisoned (2).");
{
@@ -185,28 +172,6 @@ impl ManagedDirectory {
}
}
/// Protects a file from being garbage collected.
///
/// The method returns a `FileProtection` object.
/// The file will not be garbage collected as long as the
/// `FileProtection` object is kept alive.
pub fn protect_file_from_delete(&self, path: &Path) -> FileProtection {
let pathbuf = path.to_owned();
{
let mut meta_informations_wlock = self.meta_informations
.write()
.expect("Managed file lock poisoned on protect");
*meta_informations_wlock
.protected_files
.entry(pathbuf.clone())
.or_insert(0) += 1;
}
FileProtection {
directory: self.clone(),
path: pathbuf.clone(),
}
}
/// Registers a file as managed
///
/// This method must be called before the file is
@@ -214,8 +179,17 @@ impl ManagedDirectory {
/// registering the filepath and creating the file
/// will not lead to garbage files that will
/// never get removed.
///
/// Files starting with "." are reserved for locks.
/// They are not managed and are never subject
/// to garbage collection.
fn register_file_as_managed(&mut self, filepath: &Path) -> io::Result<()> {
let mut meta_wlock = self.meta_informations
// Files starting with "." (e.g. lock files) are not managed.
if !is_managed(filepath) {
return Ok(());
}
let mut meta_wlock = self
.meta_informations
.write()
.expect("Managed file lock poisoned");
let has_changed = meta_wlock.managed_paths.insert(filepath.to_owned());
@@ -247,26 +221,12 @@ impl Directory for ManagedDirectory {
}
fn delete(&self, path: &Path) -> result::Result<(), DeleteError> {
{
let metas_rlock = self.meta_informations
.read()
.expect("poisoned lock in managed directory meta");
if let Some(counter) = metas_rlock.protected_files.get(path) {
if *counter > 0 {
return Err(DeleteError::FileProtected(path.to_owned()));
}
}
}
self.directory.delete(path)
}
fn exists(&self, path: &Path) -> bool {
self.directory.exists(path)
}
fn box_clone(&self) -> Box<Directory> {
box self.clone()
}
}
impl Clone for ManagedDirectory {
@@ -282,10 +242,10 @@ impl Clone for ManagedDirectory {
mod tests {
use super::*;
#[cfg(feature="mmap")]
#[cfg(feature = "mmap")]
use directory::MmapDirectory;
use std::path::Path;
use std::io::Write;
use std::path::Path;
use tempdir::TempDir;
lazy_static! {
@@ -294,7 +254,7 @@ mod tests {
}
#[test]
#[cfg(feature="mmap")]
#[cfg(feature = "mmap")]
fn test_managed_directory() {
let tempdir = TempDir::new("index").unwrap();
let tempdir_path = PathBuf::from(tempdir.path());
@@ -343,7 +303,7 @@ mod tests {
}
#[test]
#[cfg(feature="mmap ")]
#[cfg(feature = "mmap ")]
fn test_managed_directory_gc_while_mmapped() {
let tempdir = TempDir::new("index").unwrap();
let tempdir_path = PathBuf::from(tempdir.path());
@@ -372,28 +332,4 @@ mod tests {
}
}
#[test]
#[cfg(feature="mmap")]
fn test_managed_directory_protect() {
let tempdir = TempDir::new("index").unwrap();
let tempdir_path = PathBuf::from(tempdir.path());
let living_files = HashSet::new();
let mmap_directory = MmapDirectory::open(&tempdir_path).unwrap();
let mut managed_directory = ManagedDirectory::new(mmap_directory).unwrap();
managed_directory
.atomic_write(*TEST_PATH1, &vec![0u8, 1u8])
.unwrap();
assert!(managed_directory.exists(*TEST_PATH1));
{
let _file_protection = managed_directory.protect_file_from_delete(*TEST_PATH1);
managed_directory.garbage_collect(|| living_files.clone());
assert!(managed_directory.exists(*TEST_PATH1));
}
managed_directory.garbage_collect(|| living_files.clone());
assert!(!managed_directory.exists(*TEST_PATH1));
}
}


@@ -1,17 +1,17 @@
use atomicwrites;
use common::make_io_err;
use directory::Directory;
use directory::error::{DeleteError, IOError, OpenDirectoryError, OpenReadError, OpenWriteError};
use directory::ReadOnlySource;
use directory::shared_vec_slice::SharedVecSlice;
use directory::Directory;
use directory::ReadOnlySource;
use directory::WritePtr;
use fst::raw::MmapReadOnly;
use std::collections::hash_map::Entry as HashMapEntry;
use std::collections::HashMap;
use std::convert::From;
use std::fmt;
use std::fs::{self, File};
use std::fs::OpenOptions;
use std::fs::{self, File};
use std::io::{self, Seek, SeekFrom};
use std::io::{BufWriter, Read, Write};
use std::path::{Path, PathBuf};
@@ -32,7 +32,8 @@ fn open_mmap(full_path: &Path) -> result::Result<Option<MmapReadOnly>, OpenReadE
}
})?;
let meta_data = file.metadata()
let meta_data = file
.metadata()
.map_err(|e| IOError::with_path(full_path.to_owned(), e))?;
if meta_data.len() == 0 {
// if the file size is 0, it will not be possible
@@ -40,9 +41,11 @@ fn open_mmap(full_path: &Path) -> result::Result<Option<MmapReadOnly>, OpenReadE
// instead.
return Ok(None);
}
MmapReadOnly::open(&file)
.map(Some)
.map_err(|e| From::from(IOError::with_path(full_path.to_owned(), e)))
unsafe {
MmapReadOnly::open(&file)
.map(Some)
.map_err(|e| From::from(IOError::with_path(full_path.to_owned(), e)))
}
}
#[derive(Default, Clone, Debug, Serialize, Deserialize)]
@@ -307,7 +310,8 @@ impl Directory for MmapDirectory {
// when the last reference is gone.
mmap_cache.cache.remove(&full_path);
match fs::remove_file(&full_path) {
Ok(_) => self.sync_directory()
Ok(_) => self
.sync_directory()
.map_err(|e| IOError::with_path(path.to_owned(), e).into()),
Err(e) => {
if e.kind() == io::ErrorKind::NotFound {
@@ -350,10 +354,6 @@ impl Directory for MmapDirectory {
meta_file.write(|f| f.write_all(data))?;
Ok(())
}
fn box_clone(&self) -> Box<Directory> {
Box::new(self.clone())
}
}
#[cfg(test)]
@@ -364,6 +364,11 @@ mod tests {
use super::*;
#[test]
fn test_open_non_existant_path() {
assert!(MmapDirectory::open(PathBuf::from("./nowhere")).is_err());
}
#[test]
fn test_open_empty() {
// empty file is actually an edge case because those


@@ -4,29 +4,28 @@ WORM directory abstraction.
*/
#[cfg(feature="mmap")]
#[cfg(feature = "mmap")]
mod mmap_directory;
mod ram_directory;
mod directory;
mod managed_directory;
mod ram_directory;
mod read_only_source;
mod shared_vec_slice;
mod managed_directory;
/// Errors specific to the directory module.
pub mod error;
use std::io::{BufWriter, Seek, Write};
pub use self::read_only_source::ReadOnlySource;
pub use self::directory::Directory;
pub use self::directory::{Directory, DirectoryClone};
pub use self::ram_directory::RAMDirectory;
pub use self::read_only_source::ReadOnlySource;
#[cfg(feature="mmap")]
#[cfg(feature = "mmap")]
pub use self::mmap_directory::MmapDirectory;
pub(crate) use self::read_only_source::SourceRead;
pub(crate) use self::managed_directory::{FileProtection, ManagedDirectory};
pub(crate) use self::managed_directory::ManagedDirectory;
/// Synonym of Seek + Write
pub trait SeekableWrite: Seek + Write {}
@@ -42,8 +41,8 @@ pub type WritePtr = BufWriter<Box<SeekableWrite>>;
mod tests {
use super::*;
use std::path::Path;
use std::io::{Seek, SeekFrom, Write};
use std::path::Path;
lazy_static! {
static ref TEST_PATH: &'static Path = Path::new("some_path_for_test");
@@ -56,7 +55,7 @@ mod tests {
}
#[test]
#[cfg(feature="mmap")]
#[cfg(feature = "mmap")]
fn test_mmap_directory() {
let mut mmap_directory = MmapDirectory::create_from_tempdir().unwrap();
test_directory(&mut mmap_directory);


@@ -1,14 +1,14 @@
use super::shared_vec_slice::SharedVecSlice;
use common::make_io_err;
use directory::error::{DeleteError, IOError, OpenReadError, OpenWriteError};
use directory::WritePtr;
use directory::{Directory, ReadOnlySource};
use std::collections::HashMap;
use std::fmt;
use std::io::{self, BufWriter, Cursor, Seek, SeekFrom, Write};
use std::path::{Path, PathBuf};
use std::result;
use std::sync::{Arc, RwLock};
use common::make_io_err;
use directory::{Directory, ReadOnlySource};
use directory::error::{DeleteError, IOError, OpenReadError, OpenWriteError};
use directory::WritePtr;
use super::shared_vec_slice::SharedVecSlice;
/// Writer associated with the `RAMDirectory`
///
@@ -100,8 +100,7 @@ impl InnerDirectory {
);
let io_err = make_io_err(msg);
OpenReadError::IOError(IOError::with_path(path.to_owned(), io_err))
})
.and_then(|readable_map| {
}).and_then(|readable_map| {
readable_map
.get(path)
.ok_or_else(|| OpenReadError::FileDoesNotExist(PathBuf::from(path)))
@@ -121,8 +120,7 @@ impl InnerDirectory {
);
let io_err = make_io_err(msg);
DeleteError::IOError(IOError::with_path(path.to_owned(), io_err))
})
.and_then(|mut writable_map| match writable_map.remove(path) {
}).and_then(|mut writable_map| match writable_map.remove(path) {
Some(_) => Ok(()),
None => Err(DeleteError::FileDoesNotExist(PathBuf::from(path))),
})
@@ -170,10 +168,10 @@ impl Directory for RAMDirectory {
let path_buf = PathBuf::from(path);
let vec_writer = VecWriter::new(path_buf.clone(), self.fs.clone());
let exists = self.fs
let exists = self
.fs
.write(path_buf.clone(), &Vec::new())
.map_err(|err| IOError::with_path(path.to_owned(), err))?;
// force the creation of the file to mimic the MMap directory.
if exists {
Err(OpenWriteError::FileAlreadyExists(path_buf))
@@ -196,6 +194,10 @@ impl Directory for RAMDirectory {
}
fn atomic_write(&mut self, path: &Path, data: &[u8]) -> io::Result<()> {
fail_point!("RAMDirectory::atomic_write", |msg| Err(io::Error::new(
io::ErrorKind::Other,
msg.unwrap_or("Undefined".to_string())
)));
let path_buf = PathBuf::from(path);
let mut vec_writer = VecWriter::new(path_buf.clone(), self.fs.clone());
self.fs.write(path_buf, &Vec::new())?;
@@ -203,8 +205,4 @@ impl Directory for RAMDirectory {
vec_writer.flush()?;
Ok(())
}
fn box_clone(&self) -> Box<Directory> {
Box::new(self.clone())
}
}


@@ -1,11 +1,9 @@
#[cfg(feature="mmap")]
use fst::raw::MmapReadOnly;
use std::ops::Deref;
use super::shared_vec_slice::SharedVecSlice;
use common::HasLen;
use std::slice;
use std::io::{self, Read};
#[cfg(feature = "mmap")]
use fst::raw::MmapReadOnly;
use stable_deref_trait::{CloneStableDeref, StableDeref};
use std::ops::Deref;
/// Read object that represents files in tantivy.
///
@@ -15,7 +13,7 @@ use stable_deref_trait::{CloneStableDeref, StableDeref};
/// hold by this object should never be altered or destroyed.
pub enum ReadOnlySource {
/// Mmap source of data
#[cfg(feature="mmap")]
#[cfg(feature = "mmap")]
Mmap(MmapReadOnly),
/// Wrapping a `Vec<u8>`
Anonymous(SharedVecSlice),
@@ -41,8 +39,8 @@ impl ReadOnlySource {
/// Returns the data underlying the ReadOnlySource object.
pub fn as_slice(&self) -> &[u8] {
match *self {
#[cfg(feature="mmap")]
ReadOnlySource::Mmap(ref mmap_read_only) => unsafe { mmap_read_only.as_slice() },
#[cfg(feature = "mmap")]
ReadOnlySource::Mmap(ref mmap_read_only) => mmap_read_only.as_slice(),
ReadOnlySource::Anonymous(ref shared_vec) => shared_vec.as_slice(),
}
}
@@ -66,9 +64,14 @@ impl ReadOnlySource {
/// 1KB slice is remaining, the whole `500MBs`
/// are retained in memory.
pub fn slice(&self, from_offset: usize, to_offset: usize) -> ReadOnlySource {
assert!(from_offset <= to_offset, "Requested negative slice [{}..{}]", from_offset, to_offset);
assert!(
from_offset <= to_offset,
"Requested negative slice [{}..{}]",
from_offset,
to_offset
);
match *self {
#[cfg(feature="mmap")]
#[cfg(feature = "mmap")]
ReadOnlySource::Mmap(ref mmap_read_only) => {
let sliced_mmap = mmap_read_only.range(from_offset, to_offset - from_offset);
ReadOnlySource::Mmap(sliced_mmap)
@@ -115,51 +118,3 @@ impl From<Vec<u8>> for ReadOnlySource {
ReadOnlySource::Anonymous(shared_data)
}
}
/// Acts as a owning cursor over the data backed up by a `ReadOnlySource`
pub(crate) struct SourceRead {
_data_owner: ReadOnlySource,
cursor: &'static [u8],
}
impl SourceRead {
// Advance the cursor by a given number of bytes.
pub fn advance(&mut self, len: usize) {
self.cursor = &self.cursor[len..];
}
pub fn slice_from(&self, start: usize) -> &[u8] {
&self.cursor[start..]
}
pub fn get(&self, idx: usize) -> u8 {
self.cursor[idx]
}
}
impl AsRef<[u8]> for SourceRead {
fn as_ref(&self) -> &[u8] {
self.cursor
}
}
impl From<ReadOnlySource> for SourceRead {
// Creates a new `SourceRead` from a given `ReadOnlySource`
fn from(source: ReadOnlySource) -> SourceRead {
let len = source.len();
let slice_ptr = source.as_slice().as_ptr();
let static_slice = unsafe { slice::from_raw_parts(slice_ptr, len) };
SourceRead {
_data_owner: source,
cursor: static_slice,
}
}
}
impl Read for SourceRead {
fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
self.cursor.read(buf)
}
}


@@ -1,8 +1,8 @@
use DocId;
use common::BitSet;
use std::borrow::Borrow;
use std::borrow::BorrowMut;
use std::cmp::Ordering;
use common::BitSet;
use DocId;
/// Expresses the outcome of a call to `DocSet`'s `.skip_next(...)`.
#[derive(PartialEq, Eq, Debug)]


@@ -2,137 +2,130 @@
use std::io;
use std::path::PathBuf;
use std::sync::PoisonError;
use directory::error::{IOError, OpenDirectoryError, OpenReadError, OpenWriteError};
use fastfield::FastFieldNotAvailableError;
use indexer::LockType;
use query;
use schema;
use fastfield::FastFieldNotAvailableError;
use serde_json;
use std::path::PathBuf;
use std::sync::PoisonError;
error_chain!(
errors {
/// Path does not exist.
PathDoesNotExist(buf: PathBuf) {
description("path does not exist")
display("path does not exist: '{:?}'", buf)
}
/// File already exists, this is a problem when we try to write into a new file.
FileAlreadyExists(buf: PathBuf) {
description("file already exists")
display("file already exists: '{:?}'", buf)
}
/// IO Error.
IOError(err: IOError) {
description("an IO error occurred")
display("an IO error occurred: '{}'", err)
}
/// The data within is corrupted.
///
/// For instance, it contains invalid JSON.
CorruptedFile(buf: PathBuf) {
description("file contains corrupted data")
display("file contains corrupted data: '{:?}'", buf)
}
/// A thread holding the lock panicked and poisoned the lock.
Poisoned {
description("a thread holding the lock panicked and poisoned the lock")
}
/// Invalid argument was passed by the user.
InvalidArgument(arg: String) {
description("an invalid argument was passed")
display("an invalid argument was passed: '{}'", arg)
}
/// An error happened in one of the threads.
ErrorInThread(err: String) {
description("an error occurred in a thread")
display("an error occurred in a thread: '{}'", err)
}
/// An error related to a missing field.
SchemaError(field: String) {
description("a schema field is missing")
display("a schema field is missing: '{}'", field)
}
/// Tried to access a fastfield reader for a field not configured accordingly.
FastFieldError(err: FastFieldNotAvailableError) {
description("fast field not available")
display("fast field not available: '{:?}'", err)
}
}
);
/// The library's `failure`-based error enum.
#[derive(Debug, Fail)]
pub enum TantivyError {
/// Path does not exist.
#[fail(display = "path does not exist: '{:?}'", _0)]
PathDoesNotExist(PathBuf),
/// File already exists, this is a problem when we try to write into a new file.
#[fail(display = "file already exists: '{:?}'", _0)]
FileAlreadyExists(PathBuf),
/// Index already exists in this directory
#[fail(display = "index already exists")]
IndexAlreadyExists,
/// Failed to acquire file lock
#[fail(
display = "Failed to acquire Lockfile: {:?}. Possible causes: another IndexWriter instance or panic during previous lock drop.",
_0
)]
LockFailure(LockType),
/// IO Error.
#[fail(display = "an IO error occurred: '{}'", _0)]
IOError(#[cause] IOError),
/// The data within is corrupted.
///
/// For instance, it contains invalid JSON.
#[fail(display = "file contains corrupted data: '{:?}'", _0)]
CorruptedFile(PathBuf),
/// A thread holding the lock panicked and poisoned the lock.
#[fail(display = "a thread holding the lock panicked and poisoned the lock")]
Poisoned,
/// Invalid argument was passed by the user.
#[fail(display = "an invalid argument was passed: '{}'", _0)]
InvalidArgument(String),
/// An error happened in one of the threads.
#[fail(display = "an error occurred in a thread: '{}'", _0)]
ErrorInThread(String),
/// An Error appeared related to the schema.
#[fail(display = "Schema error: '{}'", _0)]
SchemaError(String),
/// Tried to access a fastfield reader for a field not configured accordingly.
#[fail(display = "fast field not available: '{:?}'", _0)]
FastFieldError(#[cause] FastFieldNotAvailableError),
}
impl From<FastFieldNotAvailableError> for Error {
fn from(fastfield_error: FastFieldNotAvailableError) -> Error {
ErrorKind::FastFieldError(fastfield_error).into()
impl From<FastFieldNotAvailableError> for TantivyError {
fn from(fastfield_error: FastFieldNotAvailableError) -> TantivyError {
TantivyError::FastFieldError(fastfield_error)
}
}
impl From<IOError> for Error {
fn from(io_error: IOError) -> Error {
ErrorKind::IOError(io_error).into()
impl From<IOError> for TantivyError {
fn from(io_error: IOError) -> TantivyError {
TantivyError::IOError(io_error)
}
}
impl From<io::Error> for Error {
fn from(io_error: io::Error) -> Error {
ErrorKind::IOError(io_error.into()).into()
impl From<io::Error> for TantivyError {
fn from(io_error: io::Error) -> TantivyError {
TantivyError::IOError(io_error.into())
}
}
impl From<query::QueryParserError> for Error {
fn from(parsing_error: query::QueryParserError) -> Error {
ErrorKind::InvalidArgument(format!("Query is invalid. {:?}", parsing_error)).into()
impl From<query::QueryParserError> for TantivyError {
fn from(parsing_error: query::QueryParserError) -> TantivyError {
TantivyError::InvalidArgument(format!("Query is invalid. {:?}", parsing_error))
}
}
impl<Guard> From<PoisonError<Guard>> for Error {
fn from(_: PoisonError<Guard>) -> Error {
ErrorKind::Poisoned.into()
impl<Guard> From<PoisonError<Guard>> for TantivyError {
fn from(_: PoisonError<Guard>) -> TantivyError {
TantivyError::Poisoned
}
}
impl From<OpenReadError> for Error {
fn from(error: OpenReadError) -> Error {
impl From<OpenReadError> for TantivyError {
fn from(error: OpenReadError) -> TantivyError {
match error {
OpenReadError::FileDoesNotExist(filepath) => {
ErrorKind::PathDoesNotExist(filepath).into()
OpenReadError::FileDoesNotExist(filepath) => TantivyError::PathDoesNotExist(filepath),
OpenReadError::IOError(io_error) => TantivyError::IOError(io_error),
}
}
}
impl From<schema::DocParsingError> for TantivyError {
fn from(error: schema::DocParsingError) -> TantivyError {
TantivyError::InvalidArgument(format!("Failed to parse document {:?}", error))
}
}
impl From<OpenWriteError> for TantivyError {
fn from(error: OpenWriteError) -> TantivyError {
match error {
OpenWriteError::FileAlreadyExists(filepath) => {
TantivyError::FileAlreadyExists(filepath)
}
OpenReadError::IOError(io_error) => ErrorKind::IOError(io_error).into(),
OpenWriteError::IOError(io_error) => TantivyError::IOError(io_error),
}
}
}
impl From<schema::DocParsingError> for Error {
fn from(error: schema::DocParsingError) -> Error {
ErrorKind::InvalidArgument(format!("Failed to parse document {:?}", error)).into()
}
}
impl From<OpenWriteError> for Error {
fn from(error: OpenWriteError) -> Error {
match error {
OpenWriteError::FileAlreadyExists(filepath) => ErrorKind::FileAlreadyExists(filepath),
OpenWriteError::IOError(io_error) => ErrorKind::IOError(io_error),
}.into()
}
}
impl From<OpenDirectoryError> for Error {
fn from(error: OpenDirectoryError) -> Error {
impl From<OpenDirectoryError> for TantivyError {
fn from(error: OpenDirectoryError) -> TantivyError {
match error {
OpenDirectoryError::DoesNotExist(directory_path) => {
ErrorKind::PathDoesNotExist(directory_path).into()
TantivyError::PathDoesNotExist(directory_path)
}
OpenDirectoryError::NotADirectory(directory_path) => {
TantivyError::InvalidArgument(format!("{:?} is not a directory", directory_path))
}
OpenDirectoryError::NotADirectory(directory_path) => ErrorKind::InvalidArgument(
format!("{:?} is not a directory", directory_path),
).into(),
}
}
}
impl From<serde_json::Error> for Error {
fn from(error: serde_json::Error) -> Error {
impl From<serde_json::Error> for TantivyError {
fn from(error: serde_json::Error) -> TantivyError {
let io_err = io::Error::from(error);
ErrorKind::IOError(io_err.into()).into()
TantivyError::IOError(io_err.into())
}
}
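// Editor's sketch (hypothetical function): thanks to the `From` conversions above,
// lower-level errors such as `io::Error` can be bubbled up into `TantivyError`
// with the `?` operator.
#[cfg(test)]
mod conversion_sketch {
    use super::TantivyError;
    use std::fs;
    use std::path::Path;

    fn read_bytes(path: &Path) -> Result<Vec<u8>, TantivyError> {
        // The io::Error returned by fs::read is converted via `From<io::Error>`.
        let bytes = fs::read(path)?;
        Ok(bytes)
    }

    #[test]
    fn missing_file_becomes_tantivy_error() {
        assert!(read_bytes(Path::new("/no/such/file")).is_err());
    }
}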


@@ -0,0 +1,38 @@
mod reader;
mod writer;
pub use self::reader::BytesFastFieldReader;
pub use self::writer::BytesFastFieldWriter;
#[cfg(test)]
mod tests {
use schema::SchemaBuilder;
use Index;
#[test]
fn test_bytes() {
let mut schema_builder = SchemaBuilder::default();
let field = schema_builder.add_bytes_field("bytesfield");
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_with_num_threads(1, 3_000_000).unwrap();
index_writer.add_document(doc!(field=>vec![0u8, 1, 2, 3]));
index_writer.add_document(doc!(field=>vec![]));
index_writer.add_document(doc!(field=>vec![255u8]));
index_writer.add_document(doc!(field=>vec![1u8, 3, 5, 7, 9]));
index_writer.add_document(doc!(field=>vec![0u8; 1000]));
assert!(index_writer.commit().is_ok());
index.load_searchers().unwrap();
let searcher = index.searcher();
let reader = searcher.segment_reader(0);
let bytes_reader = reader.bytes_fast_field_reader(field).unwrap();
assert_eq!(bytes_reader.get_val(0), &[0u8, 1, 2, 3]);
assert!(bytes_reader.get_val(1).is_empty());
assert_eq!(bytes_reader.get_val(2), &[255u8]);
assert_eq!(bytes_reader.get_val(3), &[1u8, 3, 5, 7, 9]);
let long = vec![0u8; 1000];
assert_eq!(bytes_reader.get_val(4), long.as_slice());
}
}


@@ -0,0 +1,37 @@
use owning_ref::OwningRef;
use directory::ReadOnlySource;
use fastfield::FastFieldReader;
use DocId;
/// Reader for byte array fast fields
///
/// The reader is implemented as a `u64` fast field and a separate collection of bytes.
///
/// The `vals_reader` will access the concatenated list of all values for all documents.
///
/// The `idx_reader` associates, for each document, the index of its first value.
///
/// Reading the value for a document is done by reading the start index for it,
/// and the start index for the next document, and keeping the bytes in between.
pub struct BytesFastFieldReader {
idx_reader: FastFieldReader<u64>,
values: OwningRef<ReadOnlySource, [u8]>,
}
impl BytesFastFieldReader {
pub(crate) fn open(
idx_reader: FastFieldReader<u64>,
values_source: ReadOnlySource,
) -> BytesFastFieldReader {
let values = OwningRef::new(values_source).map(|source| &source[..]);
BytesFastFieldReader { idx_reader, values }
}
/// Returns the bytes associated to the given `doc`
pub fn get_val(&self, doc: DocId) -> &[u8] {
let start = self.idx_reader.get(doc) as usize;
let stop = self.idx_reader.get(doc + 1) as usize;
&self.values[start..stop]
}
}
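// Editor's sketch of the layout described above, using plain Vecs instead of the
// actual fast field readers: one start offset per document plus a final end offset,
// and a document's bytes are the slice between consecutive offsets.
#[cfg(test)]
mod bytes_layout_sketch {
    #[test]
    fn offsets_and_values() {
        let docs: Vec<&[u8]> = vec![&b"abc"[..], &b""[..], &b"de"[..]];
        let mut values: Vec<u8> = Vec::new();
        let mut idx: Vec<usize> = vec![0];
        for d in &docs {
            values.extend_from_slice(d);
            idx.push(values.len());
        }
        // idx == [0, 3, 3, 5]; doc 1 spans values[3..3], i.e. the empty slice.
        assert_eq!(idx, vec![0, 3, 3, 5]);
        assert_eq!(&values[idx[1]..idx[2]], b"");
    }
}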


@@ -0,0 +1,96 @@
use std::io;
use fastfield::serializer::FastFieldSerializer;
use schema::{Document, Field, Value};
use DocId;
/// Writer for byte array (as in, any number of bytes per document) fast fields
///
/// This `BytesFastFieldWriter` is only useful for advanced users.
/// The normal way to get your bytes into the index is to
/// - declare your field with fast set to `Cardinality::SingleValue`
///   in your schema,
/// - add your document simply by calling `.add_document(...)` with the bytes
///   associated to the field.
///
/// The `BytesFastFieldWriter` can be acquired from the
/// fast field writer by calling
/// [`.get_bytes_writer(...)`](./struct.FastFieldsWriter.html#method.get_bytes_writer).
///
/// Once acquired, writing is done by calling `.add_document_val(&[u8])`
/// once per document, even if there are no bytes associated to it.
pub struct BytesFastFieldWriter {
field: Field,
vals: Vec<u8>,
doc_index: Vec<u64>,
}
impl BytesFastFieldWriter {
/// Creates a new `BytesFastFieldWriter`
pub fn new(field: Field) -> Self {
BytesFastFieldWriter {
field,
vals: Vec::new(),
doc_index: Vec::new(),
}
}
/// Access the field associated to the `BytesFastFieldWriter`
pub fn field(&self) -> Field {
self.field
}
/// Finalize the current document.
pub(crate) fn next_doc(&mut self) {
self.doc_index.push(self.vals.len() as u64);
}
/// Shift to the next document and add all of the
/// matching field values present in the document.
pub fn add_document(&mut self, doc: &Document) {
self.next_doc();
for field_value in doc.field_values() {
if field_value.field() == self.field {
if let Value::Bytes(ref bytes) = *field_value.value() {
self.vals.extend_from_slice(bytes);
} else {
panic!(
"Bytes field contained non-Bytes Value!. Field {:?} = {:?}",
self.field, field_value
);
}
}
}
}
/// Register the bytes associated to a document.
///
/// The method returns the `DocId` of the document that was
/// just written.
pub fn add_document_val(&mut self, val: &[u8]) -> DocId {
let doc = self.doc_index.len() as DocId;
self.next_doc();
self.vals.extend_from_slice(val);
doc
}
/// Serializes the fast field values by pushing them to the `FastFieldSerializer`.
pub fn serialize(&self, serializer: &mut FastFieldSerializer) -> io::Result<()> {
{
// writing the offset index
let mut doc_index_serializer =
serializer.new_u64_fast_field_with_idx(self.field, 0, self.vals.len() as u64, 0)?;
for &offset in &self.doc_index {
doc_index_serializer.add_val(offset)?;
}
doc_index_serializer.add_val(self.vals.len() as u64)?;
doc_index_serializer.close_field()?;
}
{
// writing the values themselves
let mut value_serializer = serializer.new_bytes_fast_field_with_idx(self.field, 1)?;
value_serializer.write_all(&self.vals)?;
}
Ok(())
}
}


@@ -1,10 +1,11 @@
use bit_set::BitSet;
use directory::WritePtr;
use std::io::Write;
use std::io;
use directory::ReadOnlySource;
use DocId;
use common::HasLen;
use directory::ReadOnlySource;
use directory::WritePtr;
use space_usage::ByteCount;
use std::io;
use std::io::Write;
use DocId;
/// Write a delete `BitSet`
///
@@ -41,7 +42,8 @@ pub struct DeleteBitSet {
impl DeleteBitSet {
/// Opens a delete bitset given its data source.
pub fn open(data: ReadOnlySource) -> DeleteBitSet {
let num_deleted: usize = data.as_slice()
let num_deleted: usize = data
.as_slice()
.iter()
.map(|b| b.count_ones() as usize)
.sum();
@@ -63,9 +65,12 @@ impl DeleteBitSet {
}
}
/// Summarize total space usage of this bitset.
pub fn space_usage(&self) -> ByteCount {
self.data.len()
}
}
impl HasLen for DeleteBitSet {
fn len(&self) -> usize {
self.len
@@ -74,10 +79,10 @@ impl HasLen for DeleteBitSet {
#[cfg(test)]
mod tests {
use std::path::PathBuf;
use super::*;
use bit_set::BitSet;
use directory::*;
use super::*;
use std::path::PathBuf;
fn test_delete_bitset_helper(bitset: &BitSet) {
let test_path = PathBuf::from("test");


@@ -1,10 +1,11 @@
use std::result;
use schema::FieldEntry;
use std::result;
/// `FastFieldNotAvailableError` is returned when the
/// user requests a fast field reader for a field that
/// was not declared as a fast field in the schema.
#[derive(Debug)]
#[derive(Debug, Fail)]
#[fail(display = "field not available: '{:?}'", field_name)]
pub struct FastFieldNotAvailableError {
field_name: String,
}

Some files were not shown because too many files have changed in this diff.