Compare commits

904 Commits

Author SHA1 Message Date
Paul Masurel
a7c579f5c9 Added method to convert named doc to doc 2019-08-06 08:00:32 +09:00
Paul Masurel
f2e546bdff Changes required for python binding 2019-08-01 17:23:49 +09:00
Paul Masurel
efd1af1325 Closes #544. (#607)
Prepare for release 0.10.1
2019-07-30 13:38:06 +09:00
fdb-hiroshima
c91eb7fba7 add to_path for Facet (#604)
fix #580
2019-07-27 17:58:43 +09:00
fdb-hiroshima
6eb4e08636 add support for float (#603)
* add basic support for float

as for i64, they are mapped to u64 for indexing
query parser doesn't work yet

* Update value.rs

* implement support for float in query parser

* Update README.md
2019-07-27 17:57:33 +09:00
Paul Masurel
c3231ca252 Added phrase query tests (#601) 2019-07-22 13:43:00 +09:00
Paul Masurel
7211df6719 Failrs (#600)
* Single thread tests

* Isolating fail tests into a different binary
2019-07-22 13:17:21 +09:00
Paul Masurel
f27ce6412c Made the SegmentMeta inventory out of static. 2019-07-21 10:38:00 +09:00
Paul Masurel
8197a9921f Small code cleaning 2019-07-20 07:10:12 +09:00
Paul Masurel
b0e23b5715 Merge branch 'master' of github.com:tantivy-search/tantivy 2019-07-18 10:16:49 +09:00
Paul Masurel
0167151f5b Disabling generating docs 2019-07-18 10:16:29 +09:00
Paul Masurel
0668949390 Disabling generating docs 2019-07-18 09:36:57 +09:00
Paul Masurel
94d0e52786 Using instead of u64. 2019-07-17 22:02:47 +09:00
Paul Masurel
818a0abbee Small refactoring 2019-07-17 21:55:59 +09:00
Luca Bruno
4e6dcf3cbe cargo: update to fail 0.3 (#593)
* cargo: update to fail 0.3

* tantivy: align failpoints feature naming

This aligns feature naming to use `failpoints` everywhere, like the
underlying library.
2019-07-17 18:51:38 +09:00
Paul Masurel
af7ea1422a using smallvec for operation batches (#599) 2019-07-17 13:20:02 +09:00
Paul Masurel
498057c5b7 Refactor deletes (#597)
* Refactor deletes

* Removing generation from SegmentUpdater. These have been obsolete for a long time

* Number literal clippy

* Removed clippy useless allow statement
2019-07-17 13:06:44 +09:00
Paul Masurel
5095e6b010 Introduce a small refactoring of the segment writer. (#596) 2019-07-17 08:32:29 +09:00
Paul Masurel
1aebc87ee3 disabling caching (#595) 2019-07-16 19:05:22 +09:00
Paul Masurel
9fb5058b29 Fixed links (#592)
Closes #591
2019-07-15 19:35:44 +09:00
Paul Masurel
158e0a28ba Removed link to master reference doc 2019-07-15 15:18:53 +09:00
Paul Masurel
3576a006f7 Updated example link 2019-07-15 15:17:53 +09:00
Paul Masurel
80c25ae9f3 Release 0.10 2019-07-11 19:10:12 +09:00
Paul Masurel
4867be3d3b Kompass master (#590)
* Use once_cell in place of lazy_static

* Minor changes
2019-07-10 19:24:54 +09:00
Paul Masurel
697c7e721d Only compile bitpacker4x (#589) 2019-07-10 18:53:46 +09:00
Paul Masurel
3e368d92cb Issue/479 (#578)
* Sort by field relying on tweaked score
* Sort by u64/i64 get independent methods.
2019-07-07 17:12:31 +09:00
Paul Masurel
0bc2c64a53 2018 (#585)
* removing macro import for fail-rs

* Downcast-rs

* matches
2019-07-07 17:09:04 +09:00
Paul Masurel
35236c8634 Seek not required in Directory's write anymore (#584) 2019-07-03 10:12:33 +09:00
Paul Masurel
462774b15c Tiqb feature/2018 (#583)
* rust 2018

* Added CHANGELOG comment
2019-07-01 10:01:46 +09:00
Paul Masurel
185a5b8d31 updating rand (#582) 2019-06-29 13:11:42 +09:00
petr-tik
73d7791479 Add instructions for contributors (#574) 2019-06-27 09:59:07 +09:00
Kirill Zaborsky
f52b1e68d1 Fix typo (#573) 2019-06-27 09:57:37 +09:00
Paul Masurel
3e0907fe05 Fixed CHANGELOG and disable one test on windows (#577) 2019-06-27 09:48:53 +09:00
dependabot-preview[bot]
ab4a8916d3 Update bitpacking requirement from 0.6 to 0.7 (#575)
Updates the requirements on bitpacking to permit the latest version.

Signed-off-by: dependabot-preview[bot] <support@dependabot.com>
2019-06-27 09:39:26 +09:00
Antoine Catton
bcd7386fc5 Add crates.io shield to the README (#572) 2019-06-18 11:19:06 +09:00
Paul Masurel
c23a7c992b Closes #552 (#570)
The different handles to `SegmentMeta` are closed before calling gc on
end_merge.
2019-06-16 14:12:13 +09:00
Paul Masurel
2a88094ec4 Disabling travis on OSX (#571) 2019-06-16 14:12:01 +09:00
Paul Masurel
ca3cfddab4 adding cond (#568) 2019-06-16 11:59:26 +09:00
Paul Masurel
7bd9f9773b trying to fix doc upload (#567) 2019-06-16 11:22:51 +09:00
Paul Masurel
e2da92fcb5 Petr tik n510 clear index (#566)
* Enables clearing the index

Closes #510

* Adds an example to clear and rebuild index

* Addressing code review

Moved the example from examples/ to docstring above `clear`

* Corrected minor typos and missed/duplicate words

* Added stamper.revert method to be used for rollback

Added type alias for Opstamp

Moved to AtomicU64 on stable rust (since 1.34)

* Change the method name and doc-string

* Remove rollback from delete_all_documents

test_add_then_delete_all_documents fails with --test-threads 2

* Passes all the tests with any number of test-threads

(ran locally 5 times)

* Addressed code review

Deleted comments with debug info
changed ReloadPolicy to Manual

* Removing useless garbage_collect call and updated CHANGELOG
2019-06-12 09:40:03 +09:00
petr-tik
876e1451c4 Resume uploading docs to gh-pages (#565)
* Fixes #546

Generate docs and upload them. Need GH_TOKEN env var to be set in travis settings

* Investigate what TRAVIS* env vars are set
2019-06-12 09:30:09 +09:00
dependabot-preview[bot]
a37d2f9777 Update winapi requirement from 0.2 to 0.3 (#537)
* Update winapi requirement from 0.2 to 0.3

Updates the requirements on [winapi](https://github.com/retep998/winapi-rs) to permit the latest version.
- [Release notes](https://github.com/retep998/winapi-rs/releases)
- [Commits](https://github.com/retep998/winapi-rs/commits/0.3.7)

Signed-off-by: dependabot[bot] <support@dependabot.com>

* Fixing upgrading winapi (hopefully).
2019-06-06 10:23:13 +09:00
Paul Masurel
4822940b19 Issue/36 (#559)
* Added explanation

* Explain

* Splitting weight and idf

* Added comments

Closes #36
2019-06-06 10:03:54 +09:00
Paul Masurel
d590f4c6b0 Comments for IndexMeta (#560) 2019-06-06 09:24:31 +09:00
Paul Masurel
edfa619519 Update .travis.yml 2019-05-29 16:45:56 +09:00
Paul Masurel
96f194635f Trying to address #546 2019-05-29 09:17:41 +09:00
Paul Masurel
444662485f Remove mut in add_document and delete_term. Made stamper ordering rel… (#551)
* Remove mut in add_document and delete_term. Made stamper ordering relaxed.

* Made batch operations &mut self -> &self

* Added example
2019-05-28 10:26:00 +09:00
Stephen Carman
943c25d0f8 Make IndexMeta public (#553) 2019-05-28 09:27:49 +09:00
Paul Masurel
5c0b2a4579 Merge branch 'stamper_refactor' 2019-05-08 10:02:02 +09:00
Paul Masurel
9870a9258d Removed the mutex implementation of AtomicU64.
Fixed comment
2019-05-08 09:59:28 +09:00
Paul Masurel
7102b363f5 Fix build 2019-05-05 14:19:54 +09:00
Paul Masurel
66b4615e4e Issue/542 (#543)
* Closes 542.

Fast fields are all loaded when the segment reader is created.
2019-05-05 13:52:43 +09:00
petr-tik
da46913839 Merge branch 'master' into stamper_refactor 2019-04-30 22:28:48 +01:00
Paul Masurel
3df037961f Added more info to fast fields. 2019-04-30 13:14:01 +09:00
petr-tik
8ffae47854 Addressed code review
moved Opstamp to top-level namespace, added a docstring

Corrected minor typos/whitespace
2019-04-29 21:23:28 +01:00
petr-tik
1a90a1f3b0 Merge branch 'master' of github.com:tantivy-search/tantivy into stamper_refactor 2019-04-26 08:47:12 +01:00
Paul Masurel
dac50c6aeb Dds merged (#539)
* add ascii folding support

* Minor change and added Changelog.

* add additional tests

* Add tests for ascii folding (#533)

* first tests for ascii folding

* use a `RawTokenizer` for tokens using punctuation

* add test for all (?) folding, inspired by Lucene

* Simplification of the unit test code
2019-04-26 10:25:08 +09:00
Paul Masurel
31b22c5acc Added logging when token is dropped. (#538) 2019-04-26 09:23:28 +09:00
petr-tik
8e50921363 Tidied up the Stamper module and upgraded to a 1.34 dependency
Added stamper.revert method to be used for rollback - rolling back to a previous
commit in case of deleting all documents or rolling operations back should reset
the stamper as well

Added type alias for Opstamp - helps code readability instead of seeing u64
returned by functions.

Moved to AtomicU64 on stable rust (since 1.34) - where possible use standard
library interfaces.
2019-04-24 20:46:28 +01:00
Paul Masurel
96a4f503ec Closes #526 (#535) 2019-04-24 20:59:48 +09:00
Paul Masurel
9df288b0c9 Merge branch 'master' of github.com:tantivy-search/tantivy 2019-04-24 12:31:47 +09:00
Paul Masurel
b7c2d0de97 Clippy2 (#534)
* Clippy comments

Clippy complains about the cast of &[u32] to a *const __m128i,
because of the lack of alignment constraints.

This commit passes the OutputBuffer object (which enforces proper
alignment) instead of `&[u32]`.

* Clippy. Block alignment

* Code simplification

* Added comment. Code simplification

* Removed the extraneous freq block len hack.
2019-04-24 12:31:32 +09:00
Paul Masurel
62445e0ec8 Merge branch 'master' of github.com:tantivy-search/tantivy 2019-04-23 09:55:55 +09:00
Paul Masurel
a228825462 Clippy comments (#532)
Clippy complains about the cast of &[u32] to a *const __m128i,
because of the lack of alignment constraints.

This commit passes the OutputBuffer object (which enforces proper
alignment) instead of `&[u32]`.
2019-04-23 09:54:02 +09:00
Paul Masurel
d3eabd14bc Clippy comments
Clippy complains about the cast of &[u32] to a *const __m128i,
because of the lack of alignment constraints.

This commit passes the OutputBuffer object (which enforces proper
alignment) instead of `&[u32]`.
2019-04-22 11:16:21 +09:00
petr-tik
c967031d21 Delete files from target/ dir to avoid caching them on CI (#531)
* Delete files from target/ dir to avoid caching them on CI

idea from here https://github.com/rust-lang/cargo/issues/5885#issuecomment-432723546

* Delete examples
2019-04-21 08:02:27 +09:00
Paul Masurel
d823163d52 Closes #527. (#529)
Fixing the bug that affects the result of `query.count()` in presence of
deletes.
2019-04-19 09:19:50 +09:00
Paul Masurel
c4f59f202d Bumped combine version 2019-04-11 08:33:56 +09:00
Paul Masurel
acd29b535d Fix comment 2019-04-02 10:05:14 +09:00
Panagiotis Ktistakis
2cd31bcda2 Fix non english stemmers (#521) 2019-03-27 08:54:16 +09:00
Paul Masurel
99870de55c 0.10.0-dev 2019-03-25 08:58:26 +09:00
Paul Masurel
cad2d91845 Disabled tests for android 2019-03-24 22:58:46 +09:00
Paul Masurel
79f3cd6cf4 Added instructions to update 2019-03-24 09:10:31 +09:00
Paul Masurel
e3abb4481b broken link 2019-03-22 09:58:28 +09:00
Paul Masurel
bfa61d2f2f Added patreon button 2019-03-22 09:51:00 +09:00
Paul Masurel
6c0e621fdb Added bench info in README 2019-03-21 09:35:04 +09:00
Paul Masurel
a8cc5208f1 Linear simd (#519)
* linear simd search within block
2019-03-20 22:10:05 +09:00
Paul Masurel
83eb0d0cb7 Disabling tests on Android 2019-03-20 10:24:17 +09:00
Paul Masurel
ee6e273365 cleanup for nodefaultfeatures 2019-03-20 10:04:42 +09:00
Paul Masurel
6ea34b3d53 Fix version 2019-03-20 09:39:24 +09:00
Paul Masurel
22cf1004bd Reenabled test on android 2019-03-20 08:54:52 +09:00
Paul Masurel
5768d93171 Rename try to attempt as try is becoming a keyword in rust 2019-03-20 08:54:19 +09:00
Paul Masurel
663dd89c05 Feature/reader (#517)
Adding IndexReader to the API. Making it possible to watch for changes.

* Closes #500
2019-03-20 08:39:22 +09:00
barrotsteindev
a934577168 WIP: date field (#487)
* initial version, still a work in progress

* remove redundant or

* add chrono::DateTime and index i64

* add more tests

* fix tests

* pass DateTime by ptr

* remove println!

* document query_parser rfc 3339 date support

* added some more docs about implementation to schema.rs

* enforce DateTime is UTC, and re-export chrono

* added DateField to changelog

* fixed conflict

* use INDEXED instead of INT_INDEXED for date fields
2019-03-15 22:10:37 +09:00
Paul Masurel
94f1885334 Issue/513 (#514)
* Closes #513

* Clean up and doc

* Updated changelog
2019-03-07 09:39:30 +09:00
Jonathan Fok kan
2ccfdb97b5 WIP: compiling to wasm (#512)
* First work to enable compile to wasm

* Added back fst-regex/mmap to mmap feature

* Removed fst-regex. Forced uuid version 0.7.2.
2019-03-06 10:40:54 +09:00
Paul Masurel
e67883138d Cargo fmt 2019-03-06 10:31:00 +09:00
Paul Masurel
f5c65f1f60 Added comment on the constructor of TopDocSByField 2019-03-06 10:30:37 +09:00
Mauri de Souza Nunes
ec73a9a284 Remove note about panicking in get_field docs (#503)
Since get_field relies on calling get on the underlying InnerSchema HashMap,
it shouldn't fail if the field was not found; it simply returns None.
2019-02-28 09:23:00 +09:00
Thomas Schaller
a814a31f1e Remove semicolon from doc! expansion (#509) 2019-02-28 09:20:43 +09:00
Paul Masurel
9acadb3756 Code cleaning 2019-02-26 10:50:36 +09:00
Paul Masurel
774fcecf23 cargo fmt 2019-02-26 10:44:59 +09:00
Paul Masurel
27c9fa6028 Jannickj prove bug with facets (#508)
* prove bug with facets

* Closing #505

Introduce a term id in the TermHashMap
2019-02-25 22:33:17 +09:00
Paul Masurel
fdefea9e26 Removed path reference to tantivy-fst 2019-02-23 10:42:44 +09:00
Paul Masurel
b422f9c389 Partially addresses #500 (#502)
Using `tantivy_fst`. Storing `Weak<Mmap>` in the Mmap cache.
2019-02-23 10:33:59 +09:00
petr-tik
9451fd5b09 MsQueue to channel (#495)
* Format

Made the docstring consistent
remove empty line

* Move matches to dev deps

* Replace MsQueue with an unbounded crossbeam-channel

Questions:
queue.push ignores Result return

How to test pop() calls, if they block

* Format

Made the docstring consistent
remove empty line

* Unwrap the Result of queue.pop

* Addressed Paul's review

wrap the Result-returning send call with expect()

implemented the test not to fail after popping from empty queue

removed references to the Michael-Scott Queue

formatted
2019-02-23 09:06:50 +09:00
Jason Goldberger
788b3803d9 updated changelog (#501)
* updated changelog

* Update CHANGELOG.md

* Update CHANGELOG.md
2019-02-19 00:25:18 +09:00
Paul Masurel
5b11228083 Merge branch 'master' of github.com:tantivy-search/tantivy 2019-02-15 08:30:55 +09:00
Paul Masurel
515adff644 Merge branch 'hotfix/0.8.2' 2019-02-15 08:30:27 +09:00
Paul Masurel
e70a45426a 0.8.2 release
Backporting a fix for non x86_64 platforms
2019-02-14 09:16:27 +09:00
Jason Goldberger
e14701e9cd Add grouped operations (#493)
* [WIP] added UserOperation enum, added IndexWriter.run, and added MultiStamp

* removed MultiStamp in favor of std::ops::Range

* changed IndexWriter::run to return u64, Stamper::stamps to return a Range, added tests, and added docs

* changed delete_cursor skipping to use first operation's opstamp vice last. change index_writer test to use 1 thread

* added test for order batch of operations

* added a test comment
2019-02-14 08:56:01 +09:00
Paul Masurel
45e62d4329 Code simplification and adding comments 2019-02-06 10:05:15 +09:00
petr-tik
76d2b4dab6 Add integer range search example (#490)
Copied and simplified the example in the range_query mod
2019-02-05 23:34:06 +01:00
Paul Masurel
04e9606638 simplification of positions 2019-02-05 15:36:13 +01:00
Paul Masurel
a5c57ebbd9 Positions simplification 2019-02-05 14:50:51 +01:00
Paul Masurel
96eaa5bc63 Positions 2019-02-05 14:50:16 +01:00
Paul Masurel
f1d30ab196 fastfield reader fix 2019-02-05 14:10:16 +01:00
Paul Masurel
4507df9255 Closes #461 (#489)
Multivalued fast field uses `u64` indexes.
2019-02-04 13:24:00 +01:00
Paul Masurel
e8625548b7 Closes #461 (#488)
Multivalued fast field uses `u64` indexes.
2019-02-04 13:20:20 +01:00
Paul Masurel
50ed6fb534 Code cleanup
Fixed compilation without the mmap directory
2019-02-05 12:39:30 +01:00
Panagiotis Ktistakis
76609deadf Add Greek stemmer (#486) 2019-02-01 06:30:49 +01:00
Paul Masurel
749e62c40b renamed 2019-01-30 16:29:17 +01:00
Paul Masurel
259ce567d1 Using linear search 2019-01-29 15:59:24 +01:00
Paul Masurel
4c93b096eb Rustfmt 2019-01-29 11:45:30 +01:00
Paul Masurel
6a547b0b5f Issue/483 (#484)
* Downcast_ref

* fixing unit test
2019-01-28 11:43:42 +01:00
Paul Masurel
e99d1a2355 Better exponential search 2019-01-29 11:29:17 +01:00
Paul Masurel
c7bddc5fe3 Inlined exponential search 2019-01-28 17:28:07 +01:00
Paul Masurel
7b97dde335 Clippy + cargo fmt 2019-01-28 12:37:55 +01:00
Paul Masurel
644b4bd0a1 Issue/468b (#482)
* Moving lock to directory/

* added fs2

* doc

* Using fs2 for locking

* Added unit test

* Fixed error message related unit test

* Fixing location of import
2019-01-27 12:32:21 +01:00
Paul Masurel
bf94fd77db Issue/471 (#481)
* Closes 471

Removing writing_segments in the segment manager as it is now useless.
Removing the target merged segment id as it is useless as well.

* RAII for tracking which segment is in merge.

Closes #471

* fmt

* Using Inventory::default().
2019-01-27 12:18:59 +09:00
Paul Masurel
097eaf4aa6 impl Future as a result of merges 2019-01-28 03:56:43 +01:00
Paul Masurel
1fd46c1e9b Clippy 2019-01-28 03:46:23 +01:00
Paul Masurel
2fb219d017 Changelog 2019-01-24 09:12:07 +09:00
Paul Masurel
63b593bd0a Lower RAM usage in tests. 2019-01-24 09:10:38 +09:00
Paul Masurel
286bb75a0c Updated changelog 2019-01-24 09:03:58 +09:00
barrotsteindev
222b7f2580 Tantivy-288 (#472)
* add unit test

* improved test

* added SegmentManager#remove_empty_segments

* update old tests for new behaviour

* cleaner filter for empty segments

* PR adjustments

* rename x in closures

* simplify assert_eq!(vec.len(), 0)

* wait_merging_threads

* acquire searchers

* add comments to test

* rebased on latest master

* harden test

* fix merger#test_merge_multivalued_int_fields_all_deleted test
2019-01-24 08:58:56 +09:00
pentlander
5292e78860 Allow stemmers in languages other than English (#473)
Allow users to create stemmers for languages other than English. Add a
default stemmer for English.
2019-01-23 22:24:32 +09:00
Paul Masurel
c0cc6aac83 Updated changelog 2019-01-23 22:22:34 +09:00
Paul Masurel
0b0bf59a32 Allow stemmers in languages other than English (#478)
Allow users to create stemmers for languages other than English. Add a
default stemmer for English.

Closes #478
2019-01-23 22:21:00 +09:00
Paul Masurel
74f70a5c2c 32bits platforms 2019-01-23 13:21:31 +09:00
Paul Masurel
1acfb2ebb5 cargo fmt 2019-01-23 10:21:39 +09:00
Paul Masurel
4dfd091e67 Bumped version to 0.8.2-dev 2019-01-23 10:20:59 +09:00
Paul Masurel
8eba4ab807 Merge branch 'hotfix-476' 2019-01-23 10:20:33 +09:00
Paul Masurel
5e8e03882b Merge branch 'bug/476' 2019-01-23 10:18:27 +09:00
Paul Masurel
7df3260a15 Version bump 2019-01-23 10:13:18 +09:00
Paul Masurel
176f67a266 Refactoring 2019-01-23 10:06:40 +09:00
Paul Masurel
19babff849 Closes #476 2019-01-23 10:06:39 +09:00
Paul Masurel
bf2576adf9 Added a broken unit test 2019-01-23 10:04:27 +09:00
Paul Masurel
0e8fcd5727 Plastic surgery 2019-01-19 23:13:27 +09:00
Paul Masurel
f745c83bb7 Closes 466. Removing mentions of the chain collector. (#467) 2019-01-16 10:28:19 +09:00
Paul Masurel
ffb16d9103 More efficient indexing (#463)
* Using unrolled u32 VInt and caching Vec s

* cargo fmt

* Exposing a io::Write in the Expull thing

* expull as a writer. clippy + format

* inline the first block

* simplified -if let Some-

* vint reader iterator

* blop
2019-01-13 14:51:18 +09:00
Paul Masurel
98ca703daa More efficient indexing (#462)
* Using unrolled u32 VInt and caching Vec s

* cargo fmt

* Exposing a io::Write in the Expull thing

* expull as a writer. clippy + format

* inline the first block

* simplified -if let Some-

* vint reader iterator
2019-01-13 14:41:56 +09:00
Paul Masurel
b9d25cda5d Using LittleEndian explicitly 2019-01-08 12:41:58 +09:00
Paul Masurel
beb4289ec2 Less unsafe 2019-01-08 00:48:14 +09:00
Andrew Banchich
bdd72e4683 Update README.md (#459)
Fix Elasticsearch spelling
2018-12-27 07:26:49 +09:00
Paul Masurel
45c3cd19be Fixing README: git clone https... 2018-12-26 21:13:33 +09:00
Paul Masurel
b8241c5603 0.8.0 2018-12-26 10:18:34 +09:00
Paul Masurel
a4745151c0 Version to 0.8 2018-12-26 10:11:06 +09:00
Paul Masurel
e2ce326a8c Merge branch 'issue/457' 2018-12-18 10:35:01 +09:00
Paul Masurel
bb21d12a70 Bumping version 2018-12-18 10:14:12 +09:00
Paul Masurel
4565aba62a Added unit test for exponential search 2018-12-18 09:24:31 +09:00
Paul Masurel
545a7ec8dd Closes #457 2018-12-18 09:18:46 +09:00
Paul Masurel
e68775d71c Format and update murmurhash32 version 2018-12-17 19:12:38 +09:00
Paul Masurel
dcc92d287e Facet remove unsafe (#456)
* Removing some unsafe

* Removing some unsafe (2)

* Remove murmurhash
2018-12-17 19:08:48 +09:00
Paul Masurel
b48f81c051 Removing unsafe from bitpacking code (#455) 2018-12-17 19:06:37 +09:00
Paul Masurel
a3042e956b Facet remove unsafe (#454)
* Removing some unsafe

* Removing some unsafe (2)
2018-12-17 09:31:09 +09:00
dependabot[bot]
1fa10f0a0b Update itertools requirement from 0.7 to 0.8 (#453)
Updates the requirements on [itertools](https://github.com/bluss/rust-itertools) to permit the latest version.
- [Release notes](https://github.com/bluss/rust-itertools/releases)
- [Commits](https://github.com/bluss/rust-itertools/commits/0.8.0)

Signed-off-by: dependabot[bot] <support@dependabot.com>
2018-12-17 09:28:36 +09:00
Paul Masurel
279a9eb5e3 Closes #449 (#450)
Clippy working on stable.
Clippy warnings addressed
2018-12-10 12:20:59 +09:00
fdb-hiroshima
21a24672d8 Add accessors for Snippet and HighlightSection (#448)
* Add accessors for Snippet and HighlightSection

And add an example of custom highlighter

* Remove inline(always) and unnecessary empty lines
2018-12-02 18:00:16 +09:00
dependabot[bot]
a3f1fbaae6 Update scoped-pool requirement from 0.1 to 1.0 (#447)
Updates the requirements on [scoped-pool](https://github.com/reem/rust-scoped-pool) to permit the latest version.
- [Release notes](https://github.com/reem/rust-scoped-pool/releases)
- [Commits](https://github.com/reem/rust-scoped-pool/commits/1.0.0)

Signed-off-by: dependabot[bot] <support@dependabot.com>
2018-12-01 13:54:59 +09:00
Paul Masurel
a6e767c877 Cargo fmt 2018-11-30 22:52:45 +09:00
Paul Masurel
6af0488dbe Executor made sorted 2018-11-30 22:52:26 +09:00
Paul Masurel
07d87e154b Collector refactoring and multithreaded search (#437)
* Split Collector into an overall Collector and a per-segment SegmentCollector. Precursor to cross-segment parallelism, and as a side benefit cleans up any per-segment fields from being Option<T> to just T.

* Attempt to add MultiCollector back

* working. Chained collector is broken though

* Fix chained collector

* Fix test

* Make Weight Send+Sync for parallelization purposes

* Expose parameters of RangeQuery for external usage

* Removed &mut self

* fixing tests

* Restored TestCollectors

* blop

* multicollector working

* chained collector working

* test broken

* fixing unit test

* blop

* blop

* Blop

* simplifying API

* blop

* better syntax

* Simplifying top_collector

* refactoring

* blop

* Sync with master

* Added multithread search

* Collector refactoring

* Schema::builder

* CR and rustdoc

* CR comments

* blop

* Added an executor

* Sorted the segment readers in the searcher

* Update searcher.rs

* Fixed unit tests

* changed the place where we have the sort-segment-by-count heuristic

* using crossbeam::channel

* inlining

* Comments about panics propagating

* Added unit test for executor panicking

* Readded default

* Removed Default impl

* Added unit test for executor
2018-11-30 22:46:59 +09:00
Paul Masurel
8b0b0133dd Importing crossbeam_channel from crossbeam reexport. 2018-11-19 09:19:28 +09:00
dependabot[bot]
7b9752f897 Update crossbeam-channel requirement from 0.2 to 0.3 (#436)
* Update crossbeam-channel requirement from 0.2 to 0.3

Updates the requirements on [crossbeam-channel](https://github.com/crossbeam-rs/crossbeam-channel) to permit the latest version.
- [Release notes](https://github.com/crossbeam-rs/crossbeam-channel/releases)
- [Changelog](https://github.com/crossbeam-rs/crossbeam-channel/blob/master/CHANGELOG.md)
- [Commits](https://github.com/crossbeam-rs/crossbeam-channel/commits/v0.3.0)

Signed-off-by: dependabot[bot] <support@dependabot.com>

* fixing build
2018-11-16 14:26:59 +09:00
dependabot[bot]
c92f41aea8 Update rand requirement from 0.5 to 0.6 (#440)
* Update rand requirement from 0.5 to 0.6

Updates the requirements on [rand](https://github.com/rust-random/rand) to permit the latest version.
- [Release notes](https://github.com/rust-random/rand/releases)
- [Changelog](https://github.com/rust-random/rand/blob/master/CHANGELOG.md)
- [Commits](https://github.com/rust-random/rand/commits)

Signed-off-by: dependabot[bot] <support@dependabot.com>

* Updating rand.
2018-11-16 12:38:01 +09:00
Do Duy
dea16f1d9d Derive Clone for QueryParser (#442) 2018-11-15 18:45:40 +09:00
dependabot[bot]
236cfbec08 Update crossbeam requirement from 0.4 to 0.5 (#438)
Updates the requirements on [crossbeam](https://github.com/crossbeam-rs/crossbeam) to permit the latest version.
- [Release notes](https://github.com/crossbeam-rs/crossbeam/releases)
- [Changelog](https://github.com/crossbeam-rs/crossbeam/blob/master/CHANGELOG.md)
- [Commits](https://github.com/crossbeam-rs/crossbeam/commits/crossbeam-0.5.0)

Signed-off-by: dependabot[bot] <support@dependabot.com>
2018-11-15 06:16:22 +09:00
Paul Masurel
edcafb69bb Fixed benches 2018-11-10 17:04:29 -08:00
Paul Masurel
14908479d5 Release 0.7.1 2018-11-02 17:56:25 +09:00
Dru Sellers
ab4593eeb7 Adds open_or_create method (#428)
* Change the semantic of Index::create_in_dir.

It should return an error if the directory already contains an Index.

* Index::open_or_create is working

* additional test

* Checking that schema matches on open_or_create.

Simplifying unit tests.

* simplifying Eq
2018-10-31 08:36:39 +09:00
Dru Sellers
e75bb1d6a1 Fix NGram processing of non-ascii characters (#430)
* A working version

* optimize the ngram parsing

* Decoding codepoint only once.

* Closes #429

* using leading_zeros to make code less cryptic

* lookup in a table
2018-10-31 08:35:27 +09:00
dependabot[bot]
63b9d62237 Update base64 requirement from 0.9.1 to 0.10.0 (#433)
Updates the requirements on [base64](https://github.com/alicemaz/rust-base64) to permit the latest version.
- [Release notes](https://github.com/alicemaz/rust-base64/releases)
- [Changelog](https://github.com/alicemaz/rust-base64/blob/master/RELEASE-NOTES.md)
- [Commits](https://github.com/alicemaz/rust-base64/commits/v0.10.0)

Signed-off-by: dependabot[bot] <support@dependabot.com>
2018-10-31 08:34:44 +09:00
Jason Wolfe
0098e3d428 Compute space usage of a Searcher / SegmentReader / CompositeFile (#282)
* Compute space usage of a Searcher / SegmentReader / CompositeFile

* Fix typo

* Add serde Serialize/Deserialize for all the SpaceUsage structs

* Fix indexing

* Public methods for consuming space usage information

* #281: Add a space usage method that takes a SegmentComponent to support code that is unaware of particular segment components, and to make it more likely to update methods when a new component type is added.

* Add support for space usage computation of positions skip index file (#281)

* Add some tests for space usage computation (#281)
2018-10-15 09:04:36 +09:00
Konstantin Gribov
69d5e4b9b1 Added proper references for Apache Lucene & Solr (#432)
Also, added links to websites for Lucene, Solr & ElasticSearch
2018-10-12 08:46:07 +09:00
Paul Masurel
e0cdd3114d Fixing README (#427)
Closes #424.
2018-09-17 08:52:29 +09:00
Paul Masurel
f32b4a2ebe Removing release build from ci, disabling lto (#425) 2018-09-17 06:41:40 +09:00
Paul Masurel
6ff60b8ed8 Fixing README (#426) 2018-09-17 06:20:44 +09:00
Paul Masurel
8da28fb6cf Added iml file 2018-09-16 13:26:54 +09:00
Paul Masurel
0df2a221da Bump version pre-release 2018-09-16 13:24:14 +09:00
Paul Masurel
5449ec3c11 Snippet term score (#423) 2018-09-16 10:21:02 +09:00
Paul Masurel
10f6c07c53 Clippy (#422)
* Cargo Format
* Clippy
2018-09-15 20:20:22 +09:00
Paul Masurel
06e7bd18e7 Clippy (#421)
* Cargo Format

* Clippy

* bugfix

* still clippy stuff

* clippy step 2
2018-09-15 14:56:14 +09:00
Paul Masurel
37e4280c0a Cargo Format (#420) 2018-09-15 07:44:22 +09:00
Paul Masurel
0ba1cf93f7 Remove Searcher dereference (#419) 2018-09-14 09:54:26 +09:00
Paul Masurel
21a9940726 Update Changelog with #388 (#418) 2018-09-14 09:31:11 +09:00
pentlander
8600b8ea25 Top collector (#413)
* Make TopCollector generic

Make TopCollector take a generic type instead of only being tied to
score. This will allow for sharing code between a TopCollector that
sorts results by Score and a TopCollector that sorts documents by a fast
field. This commit makes no functional changes to TopCollector.

* Add TopFieldCollector and TopScoreCollector

Create two new collectors that use the refactored TopCollector.
TopFieldCollector has the same functionality that TopCollector
originally had. TopFieldCollector allows for sorting results by a given
fast field. Closes tantivy-search/tantivy#388

* Make TopCollector private

Make TopCollector package private and export TopFieldCollector as
TopCollector to maintain backwards compatibility. Mark TopCollector
as deprecated to encourage use of the non-aliased TopFieldCollector.
Remove Collector implementation for TopCollector since it is no longer
used.
2018-09-14 09:22:17 +09:00
Paul Masurel
30f4f85d48 Closes #414. (#417)
Updating documentation for load_searchers.
2018-09-14 09:11:07 +09:00
Paul Masurel
82d25b8397 Fixing snippet example 2018-09-13 12:39:42 +09:00
Paul Masurel
2104c0277c Updating uuid 2018-09-13 09:13:37 +09:00
Paul Masurel
dd37e109f2 Merge branch 'issue/368b' 2018-09-11 20:16:14 +09:00
Paul Masurel
cc23194c58 Editing document 2018-09-11 20:15:38 +09:00
Paul Masurel
63868733a3 Added SnippetGenerator 2018-09-11 09:45:27 +09:00
Paul Masurel
644d8a3a10 Added snippet generator 2018-09-10 16:39:45 +09:00
Paul Masurel
e32dba1a97 Phrase weight 2018-09-10 09:26:33 +09:00
Paul Masurel
a78aa4c259 updating doc 2018-09-09 17:23:30 +09:00
Paul Masurel
7e5f697d00 Closes #387 2018-09-09 16:23:56 +09:00
Paul Masurel
a78f4cca37 Merge branch 'issue/368' into issue/368b 2018-09-09 16:04:20 +09:00
Paul Masurel
2e44f0f099 blop 2018-09-09 14:23:24 +09:00
Vignesh Sarma K
9ccba9f864 Merge branch 'master' into issue/368 2018-09-07 20:27:38 +05:30
Paul Masurel
9101bf5753 Fragments 2018-09-07 09:57:12 +09:00
Paul Masurel
23e97da9f6 Merge branch 'master' of github.com:tantivy-search/tantivy 2018-09-07 08:44:14 +09:00
Paul Masurel
1d439e96f5 Using sort unstable by key. 2018-09-07 08:43:44 +09:00
Paul Masurel
934933582e Closes #402 (#403) 2018-09-06 10:12:26 +09:00
Paul Masurel
98c7fbdc6f Issue/378 (#392)
* Added failing unit test

* Closes #378. Handling queries that end up empty after going through the analyzer.

* Fixed stop word example
2018-09-06 10:11:54 +09:00
Paul Masurel
cec9956a01 Issue/389 (#405)
* Setting up the dependency.

* Completed README
2018-09-06 10:10:40 +09:00
Paul Masurel
c64972e039 Apply unicode lowercasing. (#408)
Checks if the str is ASCII, and uses a fast track if it is the case.
If not, uses the std's definition of a lowercase character.

Closes #406
2018-09-05 09:43:56 +09:00
Paul Masurel
b3b2421e8a Issue/367 (#404)
* First stab

* Closes #367
2018-09-04 09:17:00 +09:00
Paul Masurel
f570fe37d4 small changes 2018-08-31 09:03:44 +09:00
Paul Masurel
6704ab6987 Added methods to extract the matching terms. First stab 2018-08-30 09:47:19 +09:00
Paul Masurel
a12d211330 Extracting terms matching query in the document 2018-08-30 09:23:34 +09:00
Paul Masurel
ee681a4dd1 Added say thanks badge 2018-08-29 11:06:04 +09:00
petr-tik
d15efd6635 Closes #235 - adds a new error type (#398)
error message suggests possible causes

Addressed code review 1 thread + smaller heap size
2018-08-29 08:26:59 +09:00
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
18814ba0c1 add a test for second fragment having higher score 2018-08-28 22:27:56 +05:30
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
f247935bb9 Use HighlightSection::new rather than just directly creating the object 2018-08-28 22:16:22 +05:30
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
6a197e023e ran rustfmt 2018-08-28 20:41:58 +05:30
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
96a313c6dd add more tests 2018-08-28 20:41:58 +05:30
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
fb9b1c1f41 add a test and fix the bug of not calculating first token 2018-08-28 20:41:58 +05:30
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
e1bca6db9d update calculate_score to try_add_token
`try_add_token` will now update the stop_offset as well.
`FragmentCandidate::new` now just takes `start_offset`,
it expects `try_add_token` to be called to add a token.
2018-08-28 20:41:58 +05:30
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
8438eda01a use while let instead of loop and if.
as per CR comment
2018-08-28 20:41:57 +05:30
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
b373f00840 add htmlescape and update to_html fn to use it.
tests and imports also updated.
2018-08-28 20:41:57 +05:30
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
46decdb0ea compare against accumulator rather than init value 2018-08-28 20:41:41 +05:30
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
835cdc2fe8 Initial version of snippet
refer #368
2018-08-28 20:41:41 +05:30
Paul Masurel
19756bb7d6 Getting started on #368 2018-08-28 20:41:41 +05:30
CJP10
57e1f8ed28 Missed a closing bracket (#397) 2018-08-28 23:17:59 +09:00
Paul Masurel
2649c8a715 Issue/246 (#393)
* Moving Range and All to Leaves

* Parsing OR/AND

* Simplify user input ast

* AND and OR supported. Returning an error when mixing syntax

Closes #246

* Added support for NOT

* Updated changelog
2018-08-28 11:03:54 +09:00
Paul Masurel
ede97eded6 Removed use 2018-08-28 09:54:04 +09:00
Paul Masurel
4b7ff78c5a Added fundamentals 2018-08-28 08:09:27 +09:00
Paul Masurel
948758ad78 First commit for the documentation 2018-08-27 09:49:49 +09:00
Paul Masurel
d71fa43ca3 Moving emoticon on the right side of the parenthesis 2018-08-23 08:59:11 +09:00
Paul Masurel
1e5266d4c9 Merge branch 'master' of github.com:tantivy-search/tantivy 2018-08-23 08:55:30 +09:00
Paul Masurel
537fc27231 Added bench line in features 2018-08-23 08:55:13 +09:00
Dru Sellers
af593b1116 Add default EN stopwords to the default analyzer (#381)
* Add a default list of en stopwords

* Add the default en stopword filter to the standard tokenizers

* code review feedback
2018-08-22 10:49:39 +09:00
Paul Masurel
3d73c0c240 Update issue templates 2018-08-21 10:59:08 +09:00
Paul Masurel
3a8e524f77 Added example to show how to access the inverted list directly 2018-08-21 09:36:13 +09:00
Paul Masurel
c0641c2b47 Remove generate html script. It moved to tantivy-search.github.io 2018-08-21 08:26:46 +09:00
Dru Sellers
ef3a16a129 Switch from error-chain to failure crate (#376)
* Switch from error-chain to failure crate

* Added deprecated alias for

* Started editing the changelog
2018-08-20 09:40:45 +09:00
Paul Masurel
a0a284fe91 Added a full-fledged empty query and relying on it in QueryParser, instead of using an empty clause. 2018-08-20 09:21:32 +09:00
dependabot[bot]
0feeef2684 Update owning_ref requirement from 0.3 to 0.4 (#379)
Updates the requirements on [owning_ref](https://github.com/Kimundi/owning-ref-rs) to permit the latest version.
- [Release notes](https://github.com/Kimundi/owning-ref-rs/releases)
- [Commits](https://github.com/Kimundi/owning-ref-rs/commits)

Signed-off-by: dependabot[bot] <support@dependabot.com>
2018-08-20 09:08:11 +09:00
Dru Sellers
cc50bdb06a Add a basic faceted search example (#383)
* Add a basic faceted search example

* quieting the compiler
2018-08-19 08:07:54 +09:00
Paul Masurel
23c2c3ae7c Building all examples on appveyor + running them on travis 2018-08-17 13:24:37 +09:00
Dru Sellers
674524ba91 Add an example of using the stopwords filter (#377) 2018-08-17 12:52:21 +09:00
Paul Masurel
60a9a7f837 Added example showing how to delete/update documents 2018-08-17 09:43:55 +09:00
Paul Masurel
5b5c706581 Simplified examples 2018-08-16 22:38:39 +09:00
Paul Masurel
3e14a76623 Update regex_query.rs 2018-08-15 16:38:32 +09:00
Paul Masurel
8cde1c81e5 Update README.md 2018-08-13 18:03:30 +09:00
Paul Masurel
8d0a29b137 Added sourcerer wall of fame 2018-08-13 18:02:49 +09:00
Paul Masurel
cbfb2fe19d Avoid building twice when doing code coverage 2018-08-13 10:38:01 +09:00
Vignesh Sarma K
09e00f1d42 add position_length to Token (#337)
* add position_length to Token

refer #291

* Add term offset to `PhraseQuery`

ref #291

* Add new constructor for `PhraseQuery` that allows custom offset

* fix the method name as per pr comment

* Closes #291

Added unit test.
Using offsets from the analyzer in QueryParser.
2018-08-13 10:14:50 +09:00
Paul Masurel
290620fdee Added slashes 2018-08-13 09:13:01 +09:00
petr-tik
f0d1b85bd8 N370 pr fix num searchers (#371)
* Change ordering to Acquire

* set_num_searchers now uses AtomicUsize.store
2018-08-13 08:56:30 +09:00
petr-tik
aaef546f91 Moved NUM_SEARCHERS into a local variable (#369)
* Moved NUM_SEARCHERS into a local variable

dynamically determined as the number of available cpus.

var name in lowercase (not a constant anymore).

updated it in docstring

* lowercased the varnames

* User can set number of logical cores in create_from_metas

* cargo fmt

* Num_searchers as Arc<AtomicUsize>

Retrieving the value with Relaxed ordering

Reverted create_from_metas signature. However, it calls num_cpus and
sets the Arc val
2018-08-12 20:08:14 +09:00
Paul Masurel
811ddf2226 Closes #364 (#365)
* Closes #364

* Trying to raise the recursion limit

* Better unit test and bug fix on token offsets
2018-08-08 11:15:20 +09:00
Paul Masurel
79a339d353 Removing env_logger dependency 2018-08-02 19:29:09 +09:00
Paul Masurel
e45e4c79d9 update crossbeam 2018-08-02 19:24:08 +09:00
Paul Masurel
848bf41bc9 Updating rand to 0.5 (#363) 2018-08-02 19:19:04 +09:00
Paul Masurel
d11cb087a7 Updated to combine-0.3 (#362) 2018-08-02 18:29:58 +09:00
Jacob Brown
2dd7422f42 replace chan with crossbeam-channel (#361)
* replace chan with crossbeam-channel

* Update Cargo.toml
2018-08-02 12:47:22 +09:00
Paul Masurel
e8707c02c0 Issue/333 (#335)
* Add skip information for posting list (skip to doc ids) 
* Separate num bits from data for positions (skip n positions)
* Address in the position using a n-position offset
* Added a long skip structure to allow efficient opening of the position for a given term.
2018-07-31 10:51:53 +09:00
dependabot[bot]
55928d756a Update rust-stemmers requirement to 1.0.2 (#350)
* Update rust-stemmers requirement to 1.0.2

Updates the requirements on [rust-stemmers](https://github.com/CurrySoftware/rust-stemmers) to permit the latest version.
- [Release notes](https://github.com/CurrySoftware/rust-stemmers/releases)
- [Commits](https://github.com/CurrySoftware/rust-stemmers/commits)

Signed-off-by: dependabot[bot] <support@dependabot.com>

* Update Cargo.toml
2018-07-31 09:32:57 +09:00
dependabot[bot]
a4370bca64 Update owned-read requirement to 0.4 (#352)
Updates the requirements on [owned-read](https://github.com/tantivy-search/owned-read) to permit the latest version.
- [Release notes](https://github.com/tantivy-search/owned-read/releases)
- [Commits](https://github.com/tantivy-search/owned-read/commits)

Signed-off-by: dependabot[bot] <support@dependabot.com>
2018-07-31 09:32:01 +09:00
dependabot[bot]
5a5c5a8ca5 Update bit-set requirement to 0.5.0 (#351)
* Update bit-set requirement to 0.5.0

Updates the requirements on [bit-set](https://github.com/contain-rs/bit-set) to permit the latest version.
- [Release notes](https://github.com/contain-rs/bit-set/releases)
- [Commits](https://github.com/contain-rs/bit-set/commits)

Signed-off-by: dependabot[bot] <support@dependabot.com>

* Update Cargo.toml

* Update Cargo.toml
2018-07-31 09:31:41 +09:00
dependabot[bot]
1b470dd474 Update log requirement to 0.4.3 (#353)
* Update log requirement to 0.4.3

Updates the requirements on [log](https://github.com/rust-lang/log) to permit the latest version.
- [Release notes](https://github.com/rust-lang/log/releases)
- [Changelog](https://github.com/rust-lang-nursery/log/blob/master/CHANGELOG.md)
- [Commits](https://github.com/rust-lang/log/commits/env_logger-0.4.3)

Signed-off-by: dependabot[bot] <support@dependabot.com>

* Update Cargo.toml
2018-07-31 09:31:19 +09:00
Paul Masurel
52b4575245 Issue/355 (#358)
* issue with top_k sorting (#356)

* Closes #355
2018-07-31 08:24:55 +09:00
dependabot[bot]
ddd2d5b04c Update lazy_static requirement to 1.0.2 (#349)
* Update lazy_static requirement to 1.0.2

Updates the requirements on [lazy_static](https://github.com/rust-lang-nursery/lazy-static.rs) to permit the latest version.
- [Release notes](https://github.com/rust-lang-nursery/lazy-static.rs/releases)
- [Commits](https://github.com/rust-lang-nursery/lazy-static.rs/commits/v1.0.2)

Signed-off-by: dependabot[bot] <support@dependabot.com>

* Update Cargo.toml
2018-07-30 12:34:06 +09:00
dependabot[bot]
fa22b4041a Update itertools requirement to 0.7.8 (#346)
* Update itertools requirement to 0.7.8

Updates the requirements on [itertools](https://github.com/bluss/rust-itertools) to permit the latest version.
- [Release notes](https://github.com/bluss/rust-itertools/releases)
- [Commits](https://github.com/bluss/rust-itertools/commits/0.7.8)

Signed-off-by: dependabot[bot] <support@dependabot.com>

* Update Cargo.toml
2018-07-30 11:32:12 +09:00
dependabot[bot]
8faee143fa Update regex requirement to 1.0 (#347)
Updates the requirements on [regex](https://github.com/rust-lang/regex) to permit the latest version.
- [Release notes](https://github.com/rust-lang/regex/releases)
- [Changelog](https://github.com/rust-lang/regex/blob/master/CHANGELOG.md)
- [Commits](https://github.com/rust-lang/regex/commits/1.0.2)

Signed-off-by: dependabot[bot] <support@dependabot.com>
2018-07-30 09:59:19 +09:00
dependabot[bot]
366ce98f08 Update tempfile requirement to 3.0 (#348)
Updates the requirements on [tempfile](https://github.com/Stebalien/tempfile) to permit the latest version.
- [Release notes](https://github.com/Stebalien/tempfile/releases)
- [Changelog](https://github.com/Stebalien/tempfile/blob/master/NEWS)
- [Commits](https://github.com/Stebalien/tempfile/commits/v3.0.3)

Signed-off-by: dependabot[bot] <support@dependabot.com>
2018-07-30 09:58:56 +09:00
Paul Masurel
190e60a41c Closes #339. (#340)
As required per the FacetCollector,
facet values need to be sorted before being encoded in the
multivalued field.
2018-07-25 18:21:48 +09:00
Vignesh Sarma K
b9558801a1 Declare and implement separate Clone Traits (#336)
For traits, `Directory` and `MergePolicy`.

refer #306
2018-07-18 12:36:43 +09:00
Paul Masurel
36728215ac Using the codecov badge 2018-07-10 21:19:59 +09:00
Paul Masurel
39551a0418 fix travis 2018-07-10 13:08:22 +09:00
Paul Masurel
39b98b2e76 fix travis 2018-07-10 13:07:15 +09:00
Paul Masurel
616162400d Add missing space 2018-07-10 12:49:32 +09:00
Paul Masurel
694d164db6 fix travis.yml 2018-07-10 09:39:39 +09:00
Paul Masurel
ef442cefb1 codecov 2018-07-10 09:38:59 +09:00
Paul Masurel
14da241f35 Readded cov 2018-07-10 09:25:24 +09:00
Paul Masurel
346a9e4287 Set dev version 2018-07-10 09:20:21 +09:00
Paul Masurel
31655e92d7 Preparing release 0.6.1 2018-07-10 09:12:26 +09:00
Paul Masurel
6b8d76685a Tiny refactoring 2018-07-05 09:11:55 +09:00
Paul Masurel
ce5683fc6a Removed useless counting_writer 2018-07-04 16:13:19 +09:00
Paul Masurel
5205579db6 Merge branch 'master' of github.com:tantivy-search/tantivy 2018-07-04 16:09:59 +09:00
Paul Masurel
d056ae60dc Removed SourceRead. Relying on the new owned-read crate instead (#332) 2018-07-04 16:08:52 +09:00
Paul Masurel
af9280c95f Removed SourceRead. Relying on the new owned-read crate instead 2018-07-04 12:47:25 +09:00
David Hewson
2e538ce6e6 remove extra space in name (#331)
the extra space that appeared breaks using the package
2018-07-02 05:32:19 +09:00
Jason Wolfe
00466d2b08 #328: Support parsing unbounded range queries (#329)
* #328: Support parsing unbounded range queries. Update CHANGELOG.md for query parser changes.

* Set version to 0.7-dev
2018-06-30 13:24:02 +09:00
Paul Masurel
8ebbf6b336 Issue/325 (#330)
* Introducing a SegmentMeta inventory.
* Depending on census=0.1
* Cargo fmt
2018-06-30 13:11:41 +09:00
Paul Masurel
1ce36bb211 Merge branch 'master' of github.com:tantivy-search/tantivy 2018-06-27 16:58:47 +09:00
Jason Wolfe
2ac43bf21b Support parsing RangeQuery and AllQuery in Queryparser (#323)
* (#321) Add support for range query parsing to grammar / parser. Still needs to be wired through the rest of the way.

* (321) Finish wiring RangeQuery parsing through

* (#321) Add logical AST query parser tests for RangeQuery

* (#321) Support parsing AllQuery

* (#321) Update documentation of QueryParser

* (#321) Support negative numbers in range query parsing
2018-06-25 08:29:47 +09:00
Paul Masurel
3fd8c2aa5a Removed one keyword 2018-06-22 14:47:21 +09:00
Paul Masurel
c1022e23d2 Switching to stable rust in AppVeyor. 2018-06-22 14:33:42 +09:00
Paul Masurel
8ccbfdea5d Preparing for release 2018-06-22 14:27:46 +09:00
Paul Masurel
badfce3a23 Preparing for release. 2018-06-22 14:09:14 +09:00
Dru Sellers
e301e0bc87 Add some simple doc tests (#320)
* Add TopCollector doc test

* Add CountCollector Doc Test

* Add Doc Test for MultiCollector

* Add ChainedCollector Doc Test

* Expose Fuzzy Query where it should be

* Add FuzzyTermQuery Doc Test

* Expose RegexQuery

* Regex Query Doc Test

* Add TermQuery Doc Test

* Add doc comments

* fix test 🤦

* Added explanation about the complexity variables

* Fixing unit tests

* Single threads if you check docids
2018-06-19 10:45:20 +09:00
Dru Sellers
317baf4e75 Add in simple regex query support (#319)
* Add fst_regex crate in

* Reduce API surface area

This doesn't need to be public

* better test name

* Pull Automaton weight out so it can be shared

* Implement Regex Query
2018-06-16 14:08:30 +09:00
Paul Masurel
24398d94e4 Exposing the 2018-06-15 21:40:57 +09:00
Dru Sellers
360f4132eb Standardizes the Index::open_* APIs (#318)
* Relocate `from_directory` closer to its usage

* Specific methods come before the generic method

* Rename open methods to follow the lead of the create methods
2018-06-15 12:16:41 +09:00
Dru Sellers
2b8f02764b Standardizes the Index::create_* APIs (#317)
* Pull all creation methods next to each other

The goal here is to make it clear which methods are performing the
same function, and to assist with standardizing the API calls.

* Make `from_directory` private

This seems to be an internal function, so let's make it internal.

* Rename `create` to `create_in_dir`

This lets the name match the `create_in_ram` pattern and opens up
`create` for the generic implementation.

* Implement the generic create function

All of the create methods now delegate to the common create function
and future `create_in_*` functions now have a clear pattern
to follow as well
2018-06-14 11:08:42 +09:00
Paul Masurel
0465876854 Issue/257 (#310)
* Replaced lz4 by a pure rust implementation of snappy.

Closes #257

* snappy is the default compression. One can use lz4 by enabling the lz4 feature flag.

* Removed Compression trait
2018-06-12 19:02:57 +09:00
Dru Sellers
6f7b099370 Add AutomatonWeight to a fuzzy_search module and FuzzyQuery (#300)
* Add AutomatonWeight to a fuzzy_search module

* Hacking around ownership issues

* Working through lifetime issues

* Working through tests

* fix test by lower casing the words (reducing distance)

* code review changes

* Suggestion on how to solve the borrow problem

* clean up
2018-06-11 22:23:03 +09:00
Paul Masurel
84f5cc4388 Added an AUTHORS file. Closes #315 (#316) 2018-06-11 22:21:58 +09:00
Paul Masurel
75aae0d2c2 Update README 2018-06-08 13:05:57 +09:00
Paul Masurel
009a3559be atomicwrites 2.2.0 for ARM compilation 2018-06-06 07:13:09 +09:00
Paul Masurel
7a31669e9d Disabling ARM targets 2018-06-05 12:22:00 +09:00
Paul Masurel
5185eb790b Reduced heap usage in unit test 2018-06-05 10:02:10 +09:00
Paul Masurel
a3dffbf1c6 Added more ARM target. 2018-06-05 09:06:33 +09:00
Paul Masurel
857a5794d8 Updated nix version 2018-06-05 09:02:40 +09:00
Paul Masurel
b0a6fc1448 Reduce RAM usage 2018-06-04 11:20:24 +09:00
Paul Masurel
989d52bea4 Updated atomicwrites version. 2018-06-04 10:00:21 +09:00
Paul Masurel
09661ea7ec Added cross testing on different platforms 2018-06-04 09:47:53 +09:00
Paul Masurel
b59132966f Better heap (#311)
* Changed the heap to a paged memory arena.
* Trying to simplify the indexing term hashmap
* Exploding datastruct
* Removed some complexity in bitpacker
2018-06-04 09:39:18 +09:00
Paul Masurel
863d3411bc Update Cargo.toml 2018-05-31 15:54:34 +09:00
Paul Masurel
8a55d133ab Showing Appveyor CI badge for the master branch
.. before the last build was shown.
2018-05-28 13:44:53 +09:00
Jason Wolfe
432d49d814 Expose parameters of RangeQuery for external usage (#309) 2018-05-19 14:29:25 +09:00
Jason Wolfe
0cea706f10 Add docs to new Query methods (#307) 2018-05-18 13:53:29 +09:00
Paul Masurel
71d41ca209 Added Google to the license 2018-05-18 10:13:23 +09:00
Paul Masurel
bc69dab822 cargo fmt 2018-05-18 10:08:05 +09:00
Jason Wolfe
72acad0921 Add box_clone() and downcast::Any to Query (#303) 2018-05-18 09:53:11 +09:00
Paul Masurel
c9459f74e8 Update docs about TermDict. 2018-05-18 09:20:39 +09:00
Dru Sellers
08d2cc6c7b Make it possible to stream the terms matching an Automaton (#297)
* rustfmt and some English grammar

* sort cargo.toml crates

* WIP: something to show

* Remove example for now

* Implement desired method

* Resolving Generic Type Arguments

* Resolve Generic Types

* Banging around on the tests

* DANGER! Change unsafe usage based on compiler warnings

* Unscrew up my rebase

* Clean Up Type Spam

Default Types FTW

* typo

* better variable names

* Remove Duplicate Levenshtein crate
2018-05-11 12:41:14 -07:00
Dru Sellers
82d87416c2 Implement StopWords Filter (#292)
* Implement StopWords Filter

- added example doctest for alphanum_only.rs so that I could
drive my own test of the stopword filter

* Style Cop

* Switch HashSet Hasher to FNV for speed

* Update Change Log

* fix missed location renaming
2018-05-09 18:40:41 -07:00
Paul Masurel
96b2c2971e Testing actual doc ids in unit test 2018-05-09 09:14:22 -07:00
Dru Sellers
162afd73f6 Alive docs iterator (#293)
* Add non-deleted DocId iterator to SegmentReader

Closes #287

* Add Todo

* Add Unit Test

* Improving test based on feedback

- found bug and fixed it. :)

* Reestablish changes post rebase for clean merge
2018-05-09 09:03:27 -07:00
Paul Masurel
ddfd87fa59 Merge branch 'master' of github.com:tantivy-search/tantivy 2018-05-08 00:08:17 -07:00
Paul Masurel
24050d0eb5 Remove some unsafe stuff, justified some of it. 2018-05-07 23:57:53 -07:00
Jason Wolfe
89eb209ece #294: Make fieldnorm module public, add documentation (#295) 2018-05-07 20:20:38 -07:00
Paul Masurel
9a0b7f9855 Rustfmt 2018-05-07 19:50:35 -07:00
Jason Wolfe
8e343b1ca3 Add fast field for associating arbitrary bytes to a document (#275)
* Add fast field for associating arbitrary bytes to a document

* Fix unused macro_use warning

* Improvements from code review

* Make BytesFastFieldWriter public

* Fix json parsing validation failure

* Add bytes fast field to CHANGELOG.md

* Fix compile errors from merge

* Support merging

* Address misc code review comments

* Fix comments from CR
2018-05-07 19:30:31 -07:00
Paul Masurel
99c0b84036 Integrating #274, #280, #289 into master (#290)
* Integrating bugfixes into master

Closes #274
Closes #280
Closes #289

* Next version will be 0.6
2018-05-06 09:48:25 -07:00
Dru Sellers
ca74c14647 Simple Implementation of NGram Tokenizer (#278)
* Simple Implementation of NGram Tokenizer

It does not yet support edges
It could probably be better in many "rusty" ways
But the test is passing, so I'll call this a good stopping point for
the day.

* Remove Ngram from manager. Too many variations

* Basic configuration model

Should the extensive tests exist here?

* Add Sample to provide an End to End testing

* Basic Edgegram support

* cleanup

* code feedback

* More code review feedback processed
2018-05-06 09:47:49 -07:00
Dru Sellers
68ee18e4e8 Add Index::open_directory function (#285)
* Add Index::open_directory function

* dry
2018-05-03 00:07:46 -07:00
Paul Masurel
5637657c2f Removed ptr dereference for explicit ptr::read_unaligned 2018-04-25 19:15:32 +09:00
Paul Masurel
2e3c9a8878 Bugfix in murmurhash. 2018-04-25 19:06:31 +09:00
Paul Masurel
78673172d0 Cargo fmt 2018-04-21 20:05:36 +09:00
Paul Masurel
175b76f119 Removed streamdict
Closes #271
2018-04-21 19:55:41 +09:00
Paul Masurel
9b79e21bd7 Returning error when schema is not valid for a given query. 2018-04-19 13:02:30 +09:00
Paul Masurel
5e38ae336f Bump tantivy version and readded win deps 2018-04-17 18:27:57 +09:00
Paul Masurel
8604351f59 Hide some of the API
Added some doc.
2018-04-17 13:31:22 +09:00
Paul Masurel
6a48953d8a Closes #266 (#268)
PhraseQuery panics with a nice error message when the underlying field does not have any positions.
The `QueryParser` fails as well with a dedicated error.
2018-04-17 10:03:15 +09:00
pmasurel
0804b42afa Checking the type of range queries 2018-04-16 14:01:10 +09:00
Paul Masurel
8083bc6eef bench working 2018-04-15 12:25:38 +09:00
Paul Masurel
0156f88265 Compiles in stable rust 2018-04-15 11:03:44 +09:00
Paul Masurel
a1c07bf457 Added iterator for facet collector 2018-04-14 20:22:02 +09:00
Paul Masurel
9de74b68d1 Remove range argument 2018-04-13 18:34:23 +09:00
Paul Masurel
57c7073867 Removed 2018-04-13 09:43:36 +09:00
Paul Masurel
121374b89b Removed the need for AtomicU64 2018-04-12 22:08:15 +09:00
Paul Masurel
e44782bf14 No more 2018-04-12 13:01:11 +09:00
Paul Masurel
dfafb24fa6 Bumped bitpacker's version 2018-04-10 21:21:47 +09:00
jason-wolfe
4c6f9541e9 #263: Make MultiValueIntFastFieldWriter public, expose via FastFieldsWriter (#264) 2018-04-10 12:27:34 +09:00
Paul Masurel
743ae102f1 Using bitpacker@3 2018-04-10 10:05:42 +09:00
Paul Masurel
0107fe886b Removed timer 2018-03-31 15:40:16 +09:00
Paul Masurel
1d9566e73c Making mmap a feature 2018-03-31 13:23:43 +09:00
Paul Masurel
8006f1df11 Added comments 2018-03-28 08:28:49 +09:00
Paul Masurel
ffa03bad71 TermScorer does not handle deletes 2018-03-27 17:35:20 +09:00
Paul Masurel
98cf4ba63a Small refactor of postings's skip method 2018-03-27 16:14:28 +09:00
Paul Masurel
4d65771e04 field norm reader is not an option anymore. 2018-03-26 13:25:29 +09:00
Paul Masurel
9712a75399 Added unit test for intersection score 2018-03-25 12:58:24 +09:00
Paul Masurel
3ae03b91ae PhraseScorer's score aligned with that of Lucene. 2018-03-25 12:44:16 +09:00
Paul Masurel
238b02ce7d Bugfixed 2018-03-23 18:50:57 +09:00
Paul Masurel
3091459777 Fixed main bug. Unit test still not passing because of altered scoring 2018-03-23 13:52:10 +09:00
Paul Masurel
b7f8884246 Closes #245 = BM25. (#260)
* Closes #245 = BM25.

Scores are the same as Lucene.

* Fixing travis conf
2018-03-22 15:06:56 +09:00
Paul Masurel
e22f767fda Backmerge 2018-03-21 21:18:46 +09:00
Paul Masurel
3ecfc36e53 Total field norm fixed. 2018-03-21 20:43:02 +09:00
Paul Masurel
1c9450174e Fieldnorm reader working except merge 2018-03-21 17:36:16 +09:00
Paul Masurel
cde4c391cd Added fieldnorm module 2018-03-21 15:41:46 +09:00
Paul Masurel
6d47634616 Added unit tests 2018-03-20 12:11:28 +09:00
Paul Masurel
39b182c24b Simplified phrase queries. Reading several times is ok. 2018-03-20 11:47:48 +09:00
Paul Masurel
baaae3f4ec Making it possible to read positions twice 2018-03-20 11:36:22 +09:00
Paul Masurel
63064601a7 Readded test for reading positions twice 2018-03-20 10:04:36 +09:00
Paul Masurel
07a8023a3a Added 2018-03-19 14:36:43 +09:00
Paul Masurel
59639cd311 In sync with master. Fixed merging 2018-03-19 12:58:42 +09:00
Paul Masurel
b0e5e1f61d Back merged master 2018-03-19 12:19:08 +09:00
Paul Masurel
234a902470 Removed cc from Cargo.toml 2018-03-19 12:09:25 +09:00
Paul Masurel
75d130f1ce Edited CHANGELOG 2018-03-19 12:01:48 +09:00
Paul Masurel
410187dd24 Removed .vimrc 2018-03-19 11:54:10 +09:00
Paul Masurel
88303d4833 Removed script directory 2018-03-19 11:53:15 +09:00
Paul Masurel
a26b0ff4a2 Removed exclude cpp from travis configuration 2018-03-19 11:51:41 +09:00
Paul Masurel
d4ed86f13a Issue/255 (#256)
* Remove cpp compression.

* Pointing to publish bitpacking

* Edited README
2018-03-19 11:48:40 +09:00
Paul Masurel
fc8902353c fieldnorm encoding. test broken 2018-03-10 18:35:16 +09:00
Paul Masurel
a2ee988304 Small change in pop_lowest. 2018-03-10 15:32:30 +09:00
Paul Masurel
97b7984200 Updated CHANGELOG 2018-03-10 14:08:11 +09:00
Paul Masurel
8683718159 Version bump 2018-03-10 14:01:30 +09:00
Paul Masurel
0cf274135b Clippy 2018-03-10 13:07:18 +09:00
Paul Masurel
a3b44773bb Bugfix and rustfmt 2018-03-10 12:21:50 +09:00
Paul Masurel
ec7c582109 NOBUG no-simd compression fix 2018-03-09 14:19:58 +09:00
Ewan Higgs
ee7ab72fb1 Support trailing commas using ',+ ,' trick from Blandy 2017. (#250) 2018-02-27 10:33:39 +09:00
Paul Masurel
2c20759829 removed unsafecell for position computer 2018-02-24 12:07:55 +09:00
Paul Masurel
23387b0ed0 Positions writes to an external Vec 2018-02-24 11:14:45 +09:00
Dylan DPC
e82859f2e6 Update Cargo.toml (#249) 2018-02-24 09:17:33 +09:00
Paul Masurel
be830b03c5 Bugfix in intersection.advance and impl skip_next 2018-02-23 11:55:23 +09:00
Paul Masurel
1b94a3e382 Phrase query optimisation 2018-02-23 00:00:22 +09:00
Paul Masurel
c3fbc4c8fa Simplified a notch TinySet::pop_lowest() 2018-02-22 10:43:06 +09:00
Paul Masurel
4ee2db25a0 Generic on Postings rather than deletes in TermScorer 2018-02-22 08:26:45 +09:00
Paul Masurel
e423784fd0 Added specialized SegmentPostings when there are no DeleteSet 2018-02-21 23:49:20 +09:00
Paul Masurel
fdb9c3c516 Tantivy version 0.5.0 2018-02-21 11:38:26 +09:00
Paul Masurel
6fb114224a Added unit test 2018-02-21 00:13:04 +09:00
Paul Masurel
2c3e33895a Added unit tests 2018-02-21 00:03:41 +09:00
Paul Masurel
d512b53688 Added handling of parenthesis in query parser 2018-02-20 23:18:02 +09:00
Paul Masurel
c8afd2b55d Added unit tests 2018-02-20 17:05:33 +09:00
Paul Masurel
3fd6d7125b Added unit test 2018-02-20 13:12:05 +09:00
Paul Masurel
de6a3987a9 Ignoring functional test 2018-02-20 12:58:06 +09:00
Paul Masurel
3dedc465fa Merge branch 'feature/multivalued-i64-u64' 2018-02-20 12:54:18 +09:00
Paul Masurel
f16cc6367e Refactoring of fastfields 2018-02-20 12:52:30 +09:00
Paul Masurel
4026fc5fb1 Removed redundant compressed_block_size function 2018-02-20 08:28:28 +09:00
Paul Masurel
43742a93ef Multivalue u64 field / i64 field. 2018-02-20 00:16:20 +09:00
Paul Masurel
2a843d86cb Code cleaning 2018-02-19 21:51:39 +09:00
Paul Masurel
9a706c296a Larger union horizon 2018-02-19 21:50:33 +09:00
Paul Masurel
5ff8123b7a Code cleaning 2018-02-19 15:41:19 +09:00
Paul Masurel
6061158506 Added long running test to travis conf 2018-02-19 13:23:04 +09:00
Paul Masurel
4e8b0e89d9 Added unit test 2018-02-19 13:19:18 +09:00
Paul Masurel
0540ebb49e Cargo clippy 2018-02-19 12:36:24 +09:00
Paul Masurel
ef94582203 Rustfmt 2018-02-19 12:12:10 +09:00
Paul Masurel
2f242d5f52 Moving docset around 2018-02-19 12:07:05 +09:00
Paul Masurel
da3d372e6e Faster union counts 2018-02-19 10:17:16 +09:00
Paul Masurel
42fd3fe5c7 Bugfix on TermWeight::count() 2018-02-18 10:59:18 +09:00
Paul Masurel
5dae6e6bbc Downcast TermScorer for intersection when all legs are TermScorers 2018-02-18 10:28:43 +09:00
Paul Masurel
e608e0a1df Removed half baked usage of Any 2018-02-18 10:01:14 +09:00
Paul Masurel
6c8c90d348 Removed lifetime from scorer 2018-02-18 09:12:40 +09:00
Paul Masurel
eb50e92ec4 Removed specialized postings on SegmentPostings 2018-02-18 00:09:15 +09:00
Paul Masurel
20bede9462 Bugfix when requesting no termfreq. 2018-02-17 22:41:12 +09:00
Paul Masurel
4640ab4e65 Merge branch 'master' into issue/query-perf 2018-02-17 17:31:51 +09:00
Paul Masurel
cd51ed0f9f Added comments 2018-02-17 16:59:28 +09:00
Paul Masurel
6676fe5717 Added a count method 2018-02-17 15:02:51 +09:00
Paul Masurel
292bb17346 Disable scoring
- Disabling scoring is an argument of the `.weight()` method
- Collectors declare whether they need scoring
2018-02-17 12:43:16 +09:00
Paul Masurel
0300e7272b Scoring for union. 2018-02-17 11:56:21 +09:00
Paul Masurel
8760899fa2 Stupid implementation of Box<Scorer>::collect 2018-02-16 19:30:50 +09:00
Paul Masurel
c89d570a79 rustfmt 2018-02-16 17:50:05 +09:00
Paul Masurel
1da06d867b Using the same logic when score is enabled. 2018-02-16 17:36:33 +09:00
Paul Masurel
76e8db6ed3 blop 2018-02-16 14:57:08 +09:00
Paul Masurel
31e5580bfa Renaming intersection / exclude 2018-02-16 11:55:56 +09:00
Paul Masurel
930d3db2f7 Integrated reqopt_scorer 2018-02-16 11:43:27 +09:00
Paul Masurel
1593e1dc6f Added reqopt 2018-02-16 11:22:39 +09:00
Paul Masurel
e0189fc9e6 Added exclude query 2018-02-14 18:06:51 +09:00
Paul Masurel
ffdb4ef0a7 Added unit test 2018-02-14 11:58:40 +09:00
Paul Masurel
58845344c2 Unit test + bugfix in union 2018-02-13 14:54:20 +09:00
Paul Masurel
548ec9ecca Added ok unit test 2018-02-12 17:48:41 +09:00
Paul Masurel
86b700fa93 Updated travis.yml 2018-02-12 12:13:36 +09:00
Paul Masurel
e95c49e749 Added unit test to show bug in intersection 2018-02-12 12:06:19 +09:00
Paul Masurel
f3033a8469 Added sudo required to travis conf because of https://github.com/travis-ci/travis-ci/issues/9061 2018-02-12 11:19:12 +09:00
Paul Masurel
c4125bda59 Backmerging master 2018-02-12 11:08:57 +09:00
Paul Masurel
a7ffc0e610 Rustfmt 2018-02-12 10:31:29 +09:00
Paul Masurel
9370427ae2 Terminfo blocks (#244)
* Using u64 key in the store
* Using Option<> for the next element, as opposed to u64
* Code simplification.
* Added TermInfoStoreWriter.
* Added a TermInfoStore
* Added FixedSized for BinarySerialized.
2018-02-12 10:24:58 +09:00
Paul Masurel
1fc7afa90a Issue/range query (#242)
BitSet and RangeQuery
2018-02-05 09:33:25 +09:00
Paul Masurel
6a104e4f69 Cargo fmt 2018-02-03 11:59:34 +09:00
Paul Masurel
920f086e1d Clippy 2018-02-03 11:46:01 +09:00
Paul Masurel
13aaca7e11 Merge branch 'master' into merge-facets 2018-02-03 11:13:02 +09:00
Paul Masurel
df53dc4ceb Format 2018-02-03 00:21:05 +09:00
Paul Masurel
dd028841e8 Added documentation / test and change the contract of .add_facet() 2018-02-03 00:17:51 +09:00
Paul Masurel
eb84b8a60d bugfix 2018-02-02 18:52:07 +09:00
Paul Masurel
c05f46ad0e skip for intersection 2018-02-02 17:22:58 +09:00
Paul Masurel
435ff9d524 Make constructor of RangeQuery public 2018-02-02 16:50:22 +09:00
Paul Masurel
fdd5dd8496 Merge branch 'master' into issue/query-perf 2018-02-02 16:39:28 +09:00
Paul Masurel
fb5476d5de Query optimization: phrase query + union 2018-02-02 16:39:17 +09:00
Paul Masurel
dd8332c327 Added disabling scoring 2018-02-02 12:11:56 +09:00
Paul Masurel
63d201150b issue/range-query Added range query 2018-02-02 00:41:12 +09:00
Paul Masurel
b78efdc59f NOBUG Use the skipping logic of segment postings in 2018-02-01 18:36:55 +09:00
Paul Masurel
5cb08f7996 Method to create bitset from DocSet directly. 2018-02-01 18:25:43 +09:00
Paul Masurel
1947a19700 Added bitset 2018-01-31 23:56:54 +09:00
Paul Masurel
271b019420 added cargo doc 2018-01-30 15:18:19 +09:00
Paul Masurel
340693184f Added comment 2018-01-30 15:15:55 +09:00
Paul Masurel
97782a9511 updated travis-cargo 2018-01-30 13:18:51 +09:00
Paul Masurel
930010aa88 Unit test passing 2018-01-28 00:03:51 +09:00
Paul Masurel
7f5b07d4e7 Fixing unit tests 2018-01-25 14:55:29 +09:00
Paul Masurel
3edb3dce6a Test not passing 2018-01-25 12:46:32 +09:00
Paul Masurel
1edaf7a312 Closes #236. Removes dependency to version. 2018-01-20 12:12:43 +09:00
Paul Masurel
137906ff29 Fixing PhraseQuery, broken due to the reordering of the intersection clauses.
Closes #234
2018-01-12 21:01:28 +09:00
Paul Masurel
143a143cde issue/232 added unit test. (#233) 2018-01-11 23:37:45 +09:00
Paul Masurel
4f5ce12a77 NOBUG removed cpp from patterns 2018-01-05 12:09:42 +09:00
Paul Masurel
813efa4ab3 NOBUG coveralls 2018-01-05 11:03:27 +09:00
Paul Masurel
c3b6c1dc0b NOBUG coveralls 2018-01-05 00:31:57 +09:00
Paul Masurel
6f5e0ef6f4 NOBUG Simplify travis 2018-01-04 20:51:00 +09:00
Paul Masurel
7224f58895 Merge branch 'issue/218'
Conflicts:
	src/directory/mmap_directory.rs
	src/lib.rs
2018-01-04 18:47:10 +09:00
Paul Masurel
49519c3f61 added comments 2018-01-04 12:53:20 +09:00
Paul Masurel
cb11b92505 Added comments 2018-01-04 12:27:14 +09:00
Paul Masurel
7b2dcfbd91 Merge branch 'issue/227' 2018-01-04 12:12:00 +09:00
Paul Masurel
d2e30e6681 Merge branch 'master' of github.com:tantivy-search/tantivy 2018-01-04 12:09:44 +09:00
Paul Masurel
ef109927b3 rustfmt 2018-01-04 12:08:34 +09:00
Paul Masurel
44e5c4dfd3 Added alphanum only token filter 2017-12-31 13:43:10 +09:00
Paul Masurel
6f223253ea Made load_metas public 2017-12-31 08:57:19 +09:00
Paul Masurel
f7b0392bd5 issue/230 Add an optional commit message. (#231)
Closes #230
2017-12-27 12:27:02 +09:00
Paul Masurel
442bc9a1b8 Fixes the computation of the memory size of a hashtable with a key of n bits. (#229)
Closes #228
2017-12-25 13:04:10 +09:00
Paul Masurel
db7d784573 Issue 227 Faster merge when there are no deletes 2017-12-21 22:04:05 +09:00
Paul Masurel
79132e803a NOBUG Switched to 64 bits addr 2017-12-21 11:06:46 +09:00
Paul Masurel
9e132b7dde NOBUG QueryParser does not need to be mut. Code cleanup 2017-12-16 15:43:35 +09:00
Paul Masurel
1e55189db1 NOBUG rustfmt 2017-12-14 19:30:31 +09:00
Paul Masurel
8b1b389a76 NOBUG Clippy 2017-12-14 19:25:12 +09:00
Paul Masurel
46f3ec87a5 Removed packed memory layout. 2017-12-14 18:37:04 +09:00
Paul Masurel
f24e5f405e NOBUG intellij misc lint 2017-12-14 18:23:35 +09:00
Paul Masurel
2589be3984 BUGFIX Serialization of schema got broken after serde's update 2017-12-14 17:37:20 +09:00
Paul Masurel
a02a9294e4 removed doc in travis 2017-11-27 13:53:58 +09:00
Paul Masurel
8023445b63 docs 2017-11-26 11:52:03 +09:00
Paul Masurel
05ce093f97 doc 2017-11-26 11:43:11 +09:00
Paul Masurel
6937e23a56 fixing doctest 2017-11-26 11:06:34 +09:00
Paul Masurel
974c321153 cargo fmt 2017-11-26 11:02:02 +09:00
Paul Masurel
f30ec9b36b Merge branch 'master' of github.com:tantivy-search/tantivy
Conflicts:
	src/analyzer/mod.rs
	src/schema/index_record_option.rs
	src/tokenizer/lower_caser.rs
	src/tokenizer/tokenizer.rs
2017-11-26 10:54:05 +09:00
Paul Masurel
acd7c1ea2d Added comments 2017-11-26 10:44:49 +09:00
Paul Masurel
aaeeda2bc5 Editing rustdoc 2017-11-25 13:23:32 +09:00
Paul Masurel
ac4d433fad Renamed analyzer to tokenizer 2017-11-24 16:50:32 +09:00
Paul Masurel
a298c084e6 Analyzer's Analyzer::token_stream does not need to be &mut self 2017-11-22 20:37:34 +09:00
Paul Masurel
185a72b341 Closes #224. Fixes documentation about STORED in the example. (#225) 2017-11-16 08:22:54 +09:00
Paul Masurel
bb41ae76f9 Closes #224. Fixes documentation about STORED in the example. 2017-11-16 08:16:17 +09:00
Paul Masurel
74d32e522a Stopped using mmap in tantivy. Caching MmapReadOnly.
Closes #218
2017-10-08 17:07:19 +09:00
Jain Jacob
927dd1ee6f Updates crate gcc to cc v1 (#217)
* Bump cc to v1

* Changes gcc::Config to cc::Build. Resolves #216
2017-10-06 16:18:44 +09:00
Paul Masurel
2c9302290f #191 Analyzer 2017-09-20 22:56:55 +09:00
Paul Masurel
426cc436da Test passing 2017-09-10 17:48:41 +09:00
Paul Masurel
68d42c9cf2 Added raw tokenizer, using the right analyzer in query parser. 2017-09-10 16:58:50 +09:00
Paul Masurel
ca49d6130f Test not passing 2017-09-09 17:32:47 +09:00
Paul Masurel
3588ca0561 Integrated with the merge branch 2017-09-09 15:27:19 +09:00
Paul Masurel
7c6cdcd876 Merge branch 'master' of github.com:tantivy-search/tantivy 2017-09-02 16:03:06 +09:00
Paul Masurel
71366b9a56 issue/197 Remove logic that prevents leak from crossbeam MsQueue. (#212)
Closes #197
2017-09-02 15:55:23 +09:00
Paul Masurel
a3247ebcfb issue/197 Remove logic that prevents leak from crossbeam MsQueue. 2017-09-02 15:53:07 +09:00
Paul Masurel
3ec13a8719 Readded fix for non-simd 2017-08-28 23:18:56 +09:00
Paul Masurel
f8593c76d5 Merge branch 'imhotep-new-codec'
Conflicts:
	src/common/bitpacker.rs
	src/compression/pack/compression_pack_nosimd.rs
	src/indexer/log_merge_policy.rs
2017-08-28 19:30:01 +09:00
Paul Masurel
f8710bd4b0 Format 2017-08-28 18:22:41 +09:00
Paul Masurel
8d05b8f7b2 Added comments. Renamed field reader 2017-08-28 17:00:12 +09:00
Paul Masurel
fc25516b7a Added unit test. 2017-08-28 11:15:37 +09:00
Paul Masurel
5b1e71947f Stream working, all test passing 2017-08-27 20:20:38 +09:00
Paul Masurel
69351fb4a5 Toward a new codec 2017-08-27 18:44:37 +09:00
Paul Masurel
3d0082d020 Delta encoded. Range and get are broken 2017-08-26 19:59:51 +09:00
Paul Masurel
8e450c770a Better error handling. Some doc. 2017-08-26 18:40:30 +09:00
Paul Masurel
a757902aed Merge branch 'feature/streamdict-simd' into imhotep 2017-08-22 18:58:57 +09:00
Paul Masurel
b3a8074826 removed println 2017-08-22 18:58:17 +09:00
Paul Masurel
4289625348 Merged with the new codec branch 2017-08-22 18:26:09 +09:00
Paul Masurel
850f10c1fe Exposing Field 2017-08-22 18:21:35 +09:00
raphael claude
d7f9bfdfc5 fix segments sorting in log_merge_policy (#211)
bug: segments were sorted on their indices (first field in the tuples)
fix: sort on the segments size
2017-08-20 08:59:54 +09:00
Paul Masurel
d0d5db4515 Streamdict using SIMD instruction. 2017-08-19 12:03:04 +09:00
Paul Masurel
303fc7e820 Better unit test for termdict. Checking the TermInfo 2017-08-17 12:08:39 +09:00
Paul Masurel
744edb2c5c NOBUG Avoid serializing position offset when useless. Test passing 2017-08-16 14:06:00 +09:00
Paul Masurel
2d70efb7b0 Removed trait boundary on termdict 2017-08-15 14:43:05 +09:00
Paul Masurel
eb5b2ffdcc Cleanups 2017-08-15 13:57:22 +09:00
Paul Masurel
38513014d5 Reenable unit test.
Consuming CompositeWrite on Close.
2017-08-14 23:35:09 +09:00
Paul Masurel
9cb7a0f6e6 Unit tests passing 2017-08-13 19:38:25 +09:00
Paul Masurel
8d466b8a76 half way through removing FastFieldsReader 2017-08-13 18:39:45 +09:00
Paul Masurel
413d0e1719 NOBUG test passing 2017-08-13 17:57:11 +09:00
Paul Masurel
0eb3c872fd Using composite file for all of the inverted index component 2017-08-12 19:34:23 +09:00
Paul Masurel
f9203228be Using composite file in fast field. 2017-08-12 18:45:59 +09:00
Paul Masurel
8f377b92d0 introducing a field serializer 2017-08-11 18:11:32 +09:00
Paul Masurel
1e89f86267 blop 2017-08-08 13:55:09 +09:00
Paul Masurel
d1f61a50c1 issue/207 Lazily decompressing positions. 2017-08-06 20:29:21 +09:00
Dru Sellers
2bb85ed575 Minor Doc Changes (#206)
* Various small documentation tweaks

* walking through the docs

* Update lib.rs

* Update lib.rs

* Update mod.rs
2017-08-06 09:22:03 +09:00
Paul Masurel
236fa74767 Positions almost working. 2017-08-05 23:17:35 +09:00
Paul Masurel
63b35dd87b removing freq handler. 2017-08-05 18:09:19 +09:00
Paul Masurel
efb910f4e8 Added CompressedIntStream 2017-08-05 16:44:01 +09:00
Paul Masurel
aff7e64d4e test 2017-08-04 22:07:14 +09:00
Paul Masurel
92a3f3981f issue/204 trying to fix nosimd branch. test not passing 2017-08-04 21:19:18 +09:00
king6cong
447a9361d8 Remove submodule information in README as subtree is now used 2017-08-03 13:52:16 +09:00
Paul Masurel
5f59139484 NOBUG simplified code. 2017-08-02 20:49:47 +09:00
Paul Masurel
27c373d26d NOBUG Updated changelog and bumped version 2017-07-24 18:52:45 +09:00
Paul Masurel
80ae136646 issue/198 Getting living_file after getting the list of managed files. 2017-07-24 18:46:41 +09:00
Paul Masurel
52b1398702 NOBUG version 0.4.0 -> 0.4.1 2017-07-19 19:07:54 +09:00
Paul Masurel
7b9cd09a6e Closes #199. Unindexed fields are indexed as untokenized 2017-07-19 18:41:22 +09:00
Paul Masurel
4c423ad2ca Merge branch 'master' of github.com:tantivy-search/tantivy 2017-07-19 17:01:32 +09:00
Paul Masurel
9f542d5252 NOBUG Fix spelling of "encountered". (as reported by @dazzag24) 2017-07-19 16:59:50 +09:00
Paul Masurel
77d8e81ae4 issue/17 Slightly more explicit error message 2017-07-19 11:08:42 +09:00
Paul Masurel
76e07b9705 NOBUG Small fixes. 2017-07-14 18:09:54 +09:00
Paul Masurel
ea4e9fdaf1 NOBUG updated README 2017-07-14 14:09:13 +09:00
Paul Masurel
e418bee693 NOBUG Garbage collection after end merge. 2017-07-14 12:09:47 +09:00
Paul Masurel
af4f1a86bc Merge remote-tracking branch 'origin/exp/hash_intable' 2017-07-13 20:50:54 +09:00
Paul Masurel
753b639454 NOBUG splitting the per-thread memory between the table and the heap 2017-07-13 17:11:39 +09:00
Paul Masurel
5907a47547 NOBUG Added whitespaces. 2017-07-13 15:14:12 +09:00
Paul Masurel
586a6e62a2 NOBUG Added Changelog for 4.0 2017-07-13 15:06:09 +09:00
Paul Masurel
fdae0eff5a NOBUG Remove range step_by 2017-07-13 14:05:33 +09:00
Paul Masurel
6eea407f20 Removing usage of step_by 2017-06-23 17:46:39 +09:00
Paul Masurel
1ba51d4dc4 NOBUG removed using range.step_by 2017-06-22 22:10:53 +09:00
Paul Masurel
6e742d5145 NOBUG removing batch add docs 2017-06-22 11:35:22 +09:00
Paul Masurel
1843259e91 NOBUG Simplified addr definitions 2017-06-22 11:27:32 +09:00
Paul Masurel
4ebacb7297 BytesRef is now wrapping an addr 2017-06-21 22:32:05 +09:00
Paul Masurel
fb75e60c6e issue/136 Added hashmaps. 2017-06-21 15:47:55 +09:00
Paul Masurel
04b15c6c11 Merge branch 'master' into exp/hash_intable
Conflicts:
	src/datastruct/stacker/hashmap.rs
2017-06-21 11:40:49 +09:00
Paul Masurel
b05b5f5487 issue/191 Added an analyzer manager. 2017-06-20 10:02:26 +09:00
Paul Masurel
4fe96483bc fill_buffer 2017-06-14 23:32:58 +09:00
Paul Masurel
09e27740e2 Added fill_buffer in DocSet 2017-06-14 18:28:30 +09:00
Paul Masurel
e51feea574 Removed cargo fmt from travis. 2017-06-14 13:45:11 +09:00
Paul Masurel
93e7f28cc0 Added unit test 2017-06-14 10:46:06 +09:00
Paul Masurel
8875b9794a Added API to get range from fastfield 2017-06-13 23:16:50 +09:00
Paul Masurel
f26874557e Remove the concept of pipeline. Made a BoableAnalyzer 2017-06-10 20:06:00 +09:00
Paul Masurel
a7d10b65ae Added support for Japanese. 2017-06-09 22:25:03 +09:00
Paul Masurel
e120e3b7aa issue/191 Added proper analyzer 2017-06-07 23:21:36 +09:00
Paul Masurel
90fcfb3f43 issue/188 Using murmurhash 2017-06-07 09:30:34 +09:00
Paul Masurel
e547e8abad Closes #184
Resizing the `Vec` was a bad idea, as for some stacker operation,
we may have a living reference to an object in the current heap.
2017-06-06 23:16:28 +09:00
Paul Masurel
5aa4565424 Tiny cleaning 2017-06-05 23:40:08 +09:00
Paul Masurel
3637620187 Merge branch 'master' of github.com:tantivy-search/tantivy 2017-06-02 21:03:37 +09:00
Laurentiu Nicola
a94679d74d Use four terms in the intersection bench 2017-05-31 08:31:33 +09:00
Laurentiu Nicola
a35a8638cc Comment nit 2017-05-31 08:31:33 +09:00
Paul Masurel
97a051996f issue 171. Hopefully bugfix? 2017-05-31 08:31:33 +09:00
Laurentiu Nicola
69525cb3c7 Add extra intersection test 2017-05-31 08:31:33 +09:00
Laurentiu Nicola
63867a7150 Fix document generation for posting benchmarks 2017-05-31 08:31:33 +09:00
Paul Masurel
19c073385a Better intersection and added size_hint 2017-05-31 08:31:33 +09:00
Paul Masurel
0521844e56 Format, small changes in VInt 2017-05-31 08:31:20 +09:00
Paul Masurel
8d4778f94d issue/181 BinarySerializable does not return the len + Generics over Read+Write 2017-05-31 08:31:20 +09:00
Paul Masurel
1d5464351d generic read 2017-05-31 08:31:20 +09:00
Paul Masurel
522ebdc674 made ResultExt public 2017-05-31 08:31:20 +09:00
Paul Masurel
4a805733db another hash 2017-05-30 15:36:48 +09:00
Paul Masurel
568d149db8 Merge branch 'master' into exp/hash_intable 2017-05-30 08:27:33 +09:00
Paul Masurel
4cfc9806c0 made ResultExt public 2017-05-30 08:22:17 +09:00
Paul Masurel
37042e3ccb Send and Sync impl now useless 2017-05-29 18:53:49 +09:00
Paul Masurel
b316cd337a Optimization in bitpacker 2017-05-29 18:53:49 +09:00
Paul Masurel
c04991e5ad Removed pointer in fastfield 2017-05-29 18:53:49 +09:00
Paul Masurel
c59b712eeb Added hash info in the table 2017-05-29 18:47:20 +09:00
Ashley Mannix
da61baed3b run fmt 2017-05-29 18:29:39 +09:00
Ashley Mannix
b6140d2962 drop some patch bounds 2017-05-29 18:29:39 +09:00
Ashley Mannix
6a9a71bb1b re-export ErrorKind 2017-05-29 18:29:39 +09:00
Ashley Mannix
e8fc4c77e2 fix delete error msg 2017-05-29 18:29:39 +09:00
Ashley Mannix
80837601ea remove error::* imports 2017-05-29 18:29:39 +09:00
Ashley Mannix
2b2703cf51 run cargo fmt 2017-05-29 18:29:39 +09:00
Ashley Mannix
d79018a7f8 fix build warnings 2017-05-29 18:29:39 +09:00
Ashley Mannix
d8a7c428f7 impl std error for directory errors 2017-05-29 18:29:39 +09:00
Ashley Mannix
45595234cc fix error match 2017-05-29 18:29:39 +09:00
Ashley Mannix
1bcebdd29e initial error-chain 2017-05-29 18:29:39 +09:00
Paul Masurel
ed0333a404 Optimized streamer 2017-05-28 19:58:28 +09:00
Paul Masurel
ac0b1a21eb Term as a wrapper
Small changes

Plastic
2017-05-25 23:49:54 +09:00
Paul Masurel
6bbc789d84 Fmt fix 2017-05-25 23:49:54 +09:00
Paul Masurel
87152daef3 issue/174 Added doc, and made field private 2017-05-25 23:49:54 +09:00
Paul Masurel
e0fce4782a Added documentation 2017-05-25 23:49:54 +09:00
Paul Masurel
a633c2a49a Avoid exposing common. Exposes u64 to i64 conversion instead. 2017-05-25 23:49:54 +09:00
Paul Masurel
51623d593e Avoid exposing schema from segment_reader 2017-05-25 23:49:54 +09:00
Paul Masurel
29bf740ddf Exposing the remaining API 2017-05-25 23:49:54 +09:00
Paul Masurel
511bd25a31 trailing whitespace 2017-05-25 18:17:37 +09:00
Paul Masurel
66e14ac1b1 clippy 2017-05-25 18:17:37 +09:00
Paul Masurel
09e94072ba Cargo fmt 2017-05-25 18:17:37 +09:00
Paul Masurel
6c68136d31 Reorganized code 2017-05-25 18:17:37 +09:00
Paul Masurel
aaf1b2c6b6 Reorganized code and added documentation. 2017-05-25 18:17:37 +09:00
Paul Masurel
8a6af2aefa Added unit test and bugfix 2017-05-25 18:17:37 +09:00
Paul Masurel
7a6e62976b Added stream dictionary code, merge unit test 2017-05-25 18:17:37 +09:00
Paul Masurel
2712930bd6 Added the feature 2017-05-25 18:17:37 +09:00
Paul Masurel
cb05f8c098 Prevent execution of the code in the macro doc 2017-05-22 10:55:45 +09:00
Paul Masurel
c0c9d04ca9 Added extra doc 2017-05-22 10:55:45 +09:00
Paul Masurel
7ea5e740e0 Using the $crate thing to make the macro usable in and outside tantivy 2017-05-22 10:55:45 +09:00
Paul Masurel
2afa6c372a issue/168 Make doc! macro usable outside tantivy 2017-05-22 10:55:45 +09:00
Paul Masurel
c7db8866b5 Merge branch 'facets' 2017-05-21 22:57:01 +09:00
Paul Masurel
02d992324a simplified facets. 2017-05-21 22:56:43 +09:00
Paul Masurel
4ab511ffc6 Merging 2017-05-21 22:15:02 +09:00
Paul Masurel
f318172ea4 Merge branch 'issue/162' 2017-05-21 20:04:03 +09:00
Paul Masurel
581449a824 issue/162 Docs and unit tests 2017-05-21 18:58:04 +09:00
Maciej Dziardziel
272589a381 faceting for fast numerical fields 2017-05-21 12:04:29 +03:00
Laurentiu Nicola
73d54c6379 Inline block_len 2017-05-21 10:44:49 +03:00
Paul Masurel
3e4606de5d Simplifying, and reordering the members 2017-05-21 16:31:52 +09:00
Laurentiu Nicola
020779f61b Make things faster 2017-05-20 20:56:37 +03:00
Laurentiu Nicola
835936585f Don't search whole blocks, but only the remaining part 2017-05-20 18:45:41 +03:00
Paul Masurel
bdd05e97d1 Added bench for segment postings 2017-05-20 23:38:53 +09:00
Paul Masurel
2be5f08cd6 issue/162 Added block iteration API 2017-05-20 11:46:40 +09:00
Paul Masurel
3f49d65a87 issue/162 Create block postings 2017-05-20 00:46:23 +09:00
Paul Masurel
f9baf4bcc8 Merge branch 'issue/155'
Conflicts:
	src/indexer/merger.rs
	src/indexer/segment_writer.rs
2017-05-19 20:14:36 +09:00
Paul Masurel
7ee93fbed5 Cleaning 2017-05-19 20:08:04 +09:00
Paul Masurel
57a5547ae8 Comments and cleaning up API 2017-05-19 11:20:27 +09:00
Paul Masurel
c57ab6a335 Renamed fstmap to termdict 2017-05-19 09:26:18 +09:00
Paul Masurel
02bfa9be52 Moving to termdict 2017-05-19 08:43:52 +09:00
Paul Masurel
b3f62b8acc Better API 2017-05-18 23:35:39 +09:00
Paul Masurel
2a08c247af Clippy 2017-05-18 23:20:41 +09:00
Paul Masurel
d2926b6ee0 Format 2017-05-18 23:09:20 +09:00
Paul Masurel
0272167c2e Code cleaning 2017-05-18 23:06:02 +09:00
Laurentiu Nicola
a9cf0bde16 Format code 2017-05-18 22:07:49 +09:00
Laurentiu Nicola
5a457df45d VInt encode values in IntFastFieldWriter
Closes #131
2017-05-18 22:07:49 +09:00
Paul Masurel
ca76fd5ba0 Uncommenting unit test 2017-05-18 20:41:56 +09:00
Paul Masurel
e79a316e41 Issue 155 - Trying to avoid term lookup when merging terms
+ Adds a proper Streamer interface
2017-05-18 20:12:00 +09:00
Paul Masurel
733f54d80e Making clippy happy. 2017-05-17 19:07:39 +09:00
Paul Masurel
7b2b181652 Merge branch 'master' into issue/136
Conflicts:
	src/datastruct/stacker/hashmap.rs
	src/datastruct/stacker/heap.rs
	src/datastruct/stacker/mod.rs
	src/indexer/index_writer.rs
	src/indexer/merger.rs
	src/indexer/segment_updater.rs
	src/indexer/segment_writer.rs
	src/postings/postings_writer.rs
	src/postings/recorder.rs
	src/schema/term.rs
2017-05-17 18:40:09 +09:00
Laurentiu Nicola
b3f39f2343 Remove unneeded suppressions, make clippy lints explicit 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
a13122d392 use explicit drop instead of suppression 2017-05-17 15:50:07 +09:00
Paul Masurel
113917c521 Making clippy happy.
+ Simplifying bitpacking by adding a 7 byte padding.
+ Bugfix in a unit test.
2017-05-17 15:50:07 +09:00
Laurentiu Nicola
1352b95b07 clippy: fix never_loop warnings 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
c0538dbe9a clippy: fix mut_from_ref warnings 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
0d5ea98132 clippy: fix inline_always warnings 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
0404df3fd5 Fix typo in docstring 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
a67caee141 clippy: fix len_zero warnings 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
f5fb29422a clippy: fix while_let_loop warnings 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
4e48bbf0ea clippy: fix needless_lifetimes warnings 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
6fea510869 clippy: fix redundant_closure warnings 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
39958ec476 clippy: fix single_match warnings 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
36f51e289e clippy: fix match_same_arms warnings 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
5c83153035 clippy: fix or_fun_call warnings 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
8e407bb314 clippy: fix needless_borrow warnings 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
103ba6ba35 clippy: fix match_ref_pats warnings 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
3965b26cd2 clippy: fix useless_let_if_seq warnings 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
1cd0b378fb clippy: fix map_clone warnings 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
92f383fa51 clippy: fix let_unit_value warnings 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
6ae34d2a77 clippy: fix toplevel_ref_arg warnings 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
1af1f7e0d1 clippy: fix if_let_redundant_pattern_matching warnings 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
feec2e2620 clippy: fix needless_bool warnings 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
3e2ad7542d clippy: fix needless_return warnings 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
ac02c76b1e clippy: fix doc_markdown warnings 2017-05-17 15:50:07 +09:00
Paul Masurel
e5c7c0b8b9 Update CHANGELOG.md 2017-05-16 21:13:33 +09:00
Laurentiu Nicola
49dbe4722f Add a test for SegmentPostings::skip_len 2017-05-16 21:12:43 +09:00
Laurentiu Nicola
f64ff77424 Use an exponential search 2017-05-16 21:12:43 +09:00
Laurentiu Nicola
2bf93e9e51 Avoid rebuilding simdcomp when running tests 2017-05-16 08:37:43 +09:00
Laurentiu Nicola
3dde748b25 Make rustfmt happy 2017-05-16 00:49:05 +03:00
Laurentiu Nicola
1dabe26395 Add comment about block_len 2017-05-15 21:26:28 +03:00
Laurentiu Nicola
5590537739 Disable early exit 2017-05-15 21:18:06 +03:00
Laurentiu Nicola
ccf0f9cb2f Merge branch 'master' of github.com:tantivy-search/tantivy into issue/130 2017-05-15 18:54:16 +03:00
Laurentiu Nicola
e21913ecdc Use binary search for SegmentPostings::skip_next 2017-05-15 18:33:43 +03:00
Laurentiu Nicola
2cc826adc7 Add a bench for SegmentPostings::SkipNext 2017-05-15 18:33:43 +03:00
Laurentiu Nicola
4d90d8fc1d Move the random sampling helpers to the tests module 2017-05-15 18:33:43 +03:00
Paul Masurel
0606a8ae73 Bugfix in travis yml 2017-05-16 00:22:11 +09:00
Paul Masurel
03564214e7 Added check for rustfmt in travis 2017-05-15 22:46:43 +09:00
Paul Masurel
4c8f9742f8 format 2017-05-15 22:30:18 +09:00
Paul Masurel
a23b7a1815 Test the size of complete 0..128 block 2017-05-15 19:09:52 +09:00
Paul Masurel
6f89a86b14 Added simple search in travis CI 2017-05-15 12:10:23 +09:00
Laurentiu Nicola
b2beac1203 Check the result of wait_merging_threads 2017-05-15 08:00:25 +09:00
Paul Masurel
8cd5a2d81d Fixed logging deleted files twice 2017-05-15 00:25:49 +09:00
Paul Masurel
b26c22ada0 Merge branch 'issue/148' 2017-05-15 00:02:51 +09:00
Laurentiu Nicola
8a35259300 Avoid clone() call 2017-05-14 23:28:17 +09:00
Paul Masurel
db56167a5d Display backtrace 2017-05-14 23:28:17 +09:00
Paul Masurel
ab66ffed4e Closes #147 2017-05-14 23:28:17 +09:00
Laurentiu Nicola
e04f2f0b08 issue/148 Wait for the index writer threads to shut down in simple_search 2017-05-14 16:35:24 +03:00
Paul Masurel
7a5df33c85 issue/148 Wrapping MsQueue to drop all of its content on Drop 2017-05-14 16:25:33 +03:00
Laurentiu Nicola
ee0873dd07 Avoid clone() call 2017-05-13 16:11:58 +03:00
Paul Masurel
695c8828b8 Display backtrace 2017-05-13 18:51:38 +09:00
Paul Masurel
4ff7dc7a4f Closes #147 2017-05-13 18:46:50 +09:00
Paul Masurel
69832bfd03 NOBUG Disabling running examples in CI as it is not working. 2017-05-12 14:35:50 +09:00
Paul Masurel
ecbdd70c37 Removed the clunky linked list logic of the heap. 2017-05-12 14:01:52 +09:00
Paul Masurel
fb1b2be782 issue/136 Fix following CR 2017-05-12 13:51:09 +09:00
Paul Masurel
9cd7458978 NOBUG Hiding methods making it possible to build a incorrect Term. 2017-05-11 21:12:59 +09:00
Paul Masurel
4c4c28e2c4 Fix broke compile 2017-05-11 20:57:32 +09:00
Paul Masurel
9f9e588905 Merge branch 'master' into issue/136
Conflicts:
	src/postings/postings_writer.rs
2017-05-11 20:50:24 +09:00
Paul Masurel
6fd17e0ead Code cleaning 2017-05-11 20:47:30 +09:00
Paul Masurel
65dc5b0d83 Closes #145 2017-05-11 19:48:06 +09:00
Paul Masurel
15d15c01f8 Running examples in CI
Closes #143
2017-05-11 19:43:36 +09:00
Paul Masurel
106832a66a Make Term::with_capacity crate-public 2017-05-11 19:37:15 +09:00
Paul Masurel
477b9136b9 FIXED inconsistent Term's field serialization.
Also.

Cleaned up the code to make sure that the logic
is only in one place.
Removed allocate_vec

Closes #141
Closes #139
Closes #142
Closes #138
2017-05-11 19:37:15 +09:00
Paul Masurel
7852d097b8 CHANGELOG 0.3.1 did not include the fix of the Field(u32) 2017-05-11 09:48:37 +09:00
Ashley Mannix
0bd56241bb pretty print meta.json 2017-05-10 20:13:53 +09:00
Paul Masurel
54ab897755 Added comment 2017-05-10 19:30:24 +09:00
Paul Masurel
1369d2d144 Quadratic probing. 2017-05-10 10:38:47 +09:00
Paul Masurel
d3f829dc8a Bugfix 2017-05-10 00:29:37 +09:00
Paul Masurel
e82ccf9627 Merge branch 'master' into issue/indexing-refactoring 2017-05-09 16:43:33 +09:00
Paul Masurel
d3d29f7f54 NOBUG Updated CHANGELOG with the serde change for 0.4.0 2017-05-09 16:42:25 +09:00
Paul Masurel
3566717979 Merge pull request #134 from tantivy-search/chore/serde-rebase
Replace rustc_serialize with serde (updated)
2017-05-09 16:38:42 +09:00
Paul Masurel
90bc3e3773 Added limitation on term dictionary saturation 2017-05-09 14:10:33 +09:00
Paul Masurel
ffb62b6835 working 2017-05-09 10:17:05 +09:00
Ashley Mannix
4f9ce91d6a update underflow test 2017-05-08 14:40:58 +10:00
Laurentiu Nicola
3c3a2fbfe8 Remove old serialization code 2017-05-08 07:36:15 +03:00
Laurentiu Nicola
0508571d1a Use the proper error type on u64 overflow 2017-05-08 07:35:33 +03:00
Laurentiu Nicola
7b733dd34f Fix i64 overflow check and merge NotJSON with NotJSONObject 2017-05-08 07:09:54 +03:00
Ashley Mannix
2c798e3147 Replace rustc_serialize with serde 2017-05-07 20:21:22 +03:00
Paul Masurel
2c13f210bc Bugfix on merging i64 fast fields 2017-05-07 15:57:29 +09:00
Paul Masurel
0dad02791c issues/65 Added comments
Closes #65
Closes #132
2017-05-06 23:09:45 +09:00
Paul Masurel
2947364ae1 issues/65 Phrase query for untokenized fields are not tokenized. 2017-05-06 22:14:26 +09:00
Paul Masurel
05111599b3 Removed several TODOs 2017-05-05 16:08:09 +08:00
Paul Masurel
83263eabbb issues/65 Updated changelog added some doc. 2017-05-04 17:13:14 +08:00
Paul Masurel
5cb5c9a8f2 issues/65 Added i64 fast fields 2017-05-04 16:46:14 +08:00
Paul Masurel
9ab92b7739 i64 fast field working 2017-05-04 16:46:14 +08:00
Paul Masurel
962bddfbbf Merge with panicks. 2017-05-04 16:46:14 +08:00
Paul Masurel
26cfe2909f FastField with different types 2017-05-04 16:46:13 +08:00
Paul Masurel
afdfb1a69b Compiling... fastfield not implemented yet 2017-05-04 16:46:13 +08:00
Paul Masurel
b26ad1d57a Added int options 2017-05-04 16:46:13 +08:00
Paul Masurel
1dbd54edbb Renamed u64options 2017-05-04 16:46:13 +08:00
Paul Masurel
deb04eb090 issue/65 Switching to u64. 2017-05-04 16:46:13 +08:00
Paul Masurel
bed34bf502 Merge branch 'issues/122' 2017-04-23 16:14:40 +08:00
Paul Masurel
80f1e26c3b Tantivy 0.3.1 2017-04-23 15:52:07 +08:00
Paul Masurel
3e68b61d8f issue/122 Adds a garbage collect method 2017-04-23 15:51:06 +08:00
Paul Masurel
95bfb71901 NOBUG Remove 256 num fields limit 2017-04-19 22:37:34 +09:00
Paul Masurel
74e10843a7 issue/120 Disabled SIMD vbyte compression for msvc 2017-04-17 22:36:32 +09:00
Paul Masurel
1b922e6d23 issue 120. Using streamvbyte codec for the vbyte part of the encoding 2017-04-16 18:49:53 +09:00
Paul Masurel
a7c6c31538 Merge commit '9d071c8d4610aa61f4b1f7dd489210415a05cfc0' as 'cpp/streamvbyte' 2017-04-16 15:22:43 +09:00
Paul Masurel
9d071c8d46 Squashed 'cpp/streamvbyte/' content from commit f38aa6b
git-subtree-dir: cpp/streamvbyte
git-subtree-split: f38aa6b6ec4c5cee9d72c94ef305e6a79a108252
2017-04-16 15:22:43 +09:00
Paul Masurel
04074f7bcb Merge pull request #119 from tantivy-search/issue/118
Using u32 for field ids
2017-04-15 13:11:22 +09:00
Paul Masurel
8a28d1643d Using u32 for field ids 2017-04-15 13:04:33 +09:00
Paul Masurel
44c684af5c NOBUG Fixes winapi version 2017-04-08 19:01:31 +09:00
Paul Masurel
60279a03b6 RELEASE Tantivy 0.3. See Changelog 2017-04-08 18:53:40 +09:00
Paul Masurel
dc43135fe0 NOBUG Remove .info 2017-04-08 18:49:37 +09:00
Paul Masurel
ce022e5f06 issue/54 Clone segment reader rather than reload.
Closes #54.
2017-04-08 17:52:33 +09:00
Paul Masurel
0be977d9eb Merge pull request #114 from tantivy-search/issue/96
Closes Issue/96
2017-04-08 17:49:48 +09:00
Paul Masurel
a4ba20eea3 issue/96 code clean up, adding comments. 2017-04-08 17:30:25 +09:00
Paul Masurel
4bef6c99ee issue/96 Cleaning up some lock management 2017-04-05 10:12:39 +09:00
Paul Masurel
a84871468b issue/96 Rename FileError -> OpenReadError 2017-04-05 10:01:49 +09:00
Paul Masurel
e0a39fb273 issue/96 Added unit test, documentation and various tiny improvements. 2017-04-04 22:43:35 +09:00
Paul Masurel
35203378ef Considering merge options after calling end_merge 2017-04-03 17:26:21 +09:00
Paul Masurel
b5bf9bb13c issue/96 Looping over wait_merging_thread. 2017-04-03 08:39:18 +09:00
Paul Masurel
ea3349644c issue/96 Fixed unit test condition to something reasonable 2017-04-02 21:58:38 +09:00
Paul Masurel
d4f2e475ff issue/96 removed faulty assert 2017-04-02 19:21:20 +09:00
Paul Masurel
17631ed866 issue/96 Added functionality to protect files from deletion
Hopefully fixed the race condition happening when merging files.
2017-04-02 18:48:20 +09:00
Paul Masurel
9eb2d3e8c5 issue/96 avoid removing the bitset from segment_entry. 2017-04-02 16:26:28 +09:00
Paul Masurel
afd08a7bbc issue/96 Changed datastruct for the delete queue. 2017-04-01 21:01:10 +09:00
Paul Masurel
4fc7bc5f09 Added helper to create Vec with a given size 2017-03-31 18:54:23 +09:00
Paul Masurel
602b9d235f Merge pull request #113 from kaedroho/patch-1
Mark "cpp" folder as linguist-vendored in .gitattributes
2017-03-31 09:05:57 +09:00
Karl Hobley
b22c6b86c7 Mark "cpp" folder as linguist-vendored in .gitattributes
This repo is currently being detected as a C project because of some vendored libraries in the "cpp" folder.

According to https://github.com/github/linguist#using-gitattributes you can use ``.gitattributes`` tell GitHub to not count this folder when detecting the language.
2017-03-30 13:43:03 +01:00
Paul Masurel
f0dc0de4b7 Added helper to create Vec with a given size 2017-03-29 11:26:24 +09:00
Paul Masurel
456dd3a60d issue/96 merge 2017-03-28 16:49:48 +09:00
Paul Masurel
d768a10077 master merged in feature branch 2017-03-27 09:27:23 +09:00
Paul Masurel
ddb2b8d807 test passing.
SegmentWriter create SegmentEntry which contain a delete_bitset
2017-03-26 18:32:53 +09:00
Paul Masurel
45806951b1 added quotation mark 2017-03-25 22:48:07 +09:00
Paul Masurel
84a060552d issue/109 trying to get proper logging in appveyor 2017-03-25 22:34:40 +09:00
Paul Masurel
68a956c6e7 issue/109 Showing debug! if test fails 2017-03-25 21:54:17 +09:00
Paul Masurel
f50f557cfc issue/109 Remove futures from most of segment_updater API. 2017-03-25 19:36:03 +09:00
Paul Masurel
daa19b770a (hopefully) bugfix race condition on wait merging thread. 2017-03-24 18:20:58 +09:00
Paul Masurel
e75402be80 Merge pull request #108 from KodrAus/ci/appveyor
Add appveyor config
2017-03-24 15:49:50 +09:00
Ashley Mannix
51cab39186 drop to vs2015 image 2017-03-24 16:37:30 +10:00
Ashley Mannix
c8e12b6847 try set mingw path 2017-03-24 16:22:32 +10:00
Ashley Mannix
b44a9cb89d add appveyor config 2017-03-24 16:11:51 +10:00
Paul Masurel
e650fab927 Merge pull request #106 from tantivy-search/wip/delay-test-deletes
Fix delete tests on Windows
2017-03-22 09:26:36 +09:00
Paul Masurel
b12a97abe4 Add unit test for when deleting fails
Test that when delete fails, we still keep
the file as managed.

Remove the error log for windows, as failing
to delete is expected.
2017-03-22 08:57:09 +09:00
Laurentiu Nicola
2b5a4bbde2 Don't delete twice on not(windows) 2017-03-21 07:48:58 +02:00
Laurentiu Nicola
2d169c4454 Delay deleting the files in the test suite to make it work on Windows 2017-03-21 07:37:28 +02:00
Paul Masurel
66d6e4e246 Merge pull request #103 from tantivy-search/lnicola-fix-sync-directory
Make directory syncing work on Windows (resubmit)
2017-03-21 10:55:03 +09:00
Paul Masurel
a061ba091d Merge pull request #105 from tantivy-search/wip/simdcomp-build
Avoid using make for building simdcomp
2017-03-21 10:00:49 +09:00
Laurentiu Nicola
92ce9b906b Avoid using make for building simdcomp 2017-03-21 00:25:04 +02:00
Laurentiu Nicola
1e0ac31e11 Clarify comment and use qualified import for the flag 2017-03-20 23:12:48 +02:00
Paul Masurel
ebcea0128c Getting the FLAG from the winapi module. 2017-03-19 11:09:15 +09:00
Paul Masurel
30075176cb blop 2017-03-19 10:52:54 +09:00
Laurentiu Nicola
7c114b602d Make directory syncing work on Windows 2017-03-19 02:17:13 +02:00
Paul Masurel
50659147d1 NOBUG updated simple_search.html 2017-03-14 12:04:21 +09:00
Paul Masurel
da10fe3b4d Various fixes. 2017-03-13 22:01:55 +09:00
Paul Masurel
4db56c6bd8 Merge pull request #101 from tantivy-search/issue/99
Improvements to simple_search.rs: fixes #100 and improves #99
2017-03-13 13:26:39 +09:00
Claus Matzinger
292dd6dcb6 fixup 2017-03-13 00:24:54 -04:00
Claus Matzinger
37e71f7c63 fixes #100 and improves #99 2017-03-12 22:59:38 -04:00
Paul Masurel
5932278e00 test passing 2017-03-13 10:00:19 +09:00
Paul Masurel
202dda98ba baby step 3 2017-03-12 19:00:57 +09:00
Paul Masurel
7c971b5d3b baby step 2 2017-03-11 16:14:20 +09:00
Paul Masurel
77c61ddab2 Baby step1 2017-03-11 14:20:46 +09:00
Paul Masurel
b7f026bab9 Merger returns a SegmentMeta 2017-03-10 09:05:51 +09:00
Paul Masurel
cc2f78184b Added unit test for #96 2017-03-10 09:05:51 +09:00
Paul Masurel
673423f762 Merge pull request #98 from KodrAus/feat/no-cpp
Convert simd wrapper to C
2017-03-09 13:11:08 +09:00
Paul Masurel
7532c4a440 Removed double ; 2017-03-09 10:57:30 +09:00
Ashley Mannix
324b56a60c fix warnings 2017-03-09 06:54:48 +10:00
Paul Masurel
ac3890f93c NOBUG Marked the functional test as ignore 2017-03-08 19:08:29 +09:00
Ashley Mannix
69b3de43f6 convert simd wrapper to c 2017-03-08 14:02:48 +10:00
Paul Masurel
3d1196d53e NOBUG added doc link. 2017-03-07 10:14:00 +09:00
Paul Masurel
a397537ed8 NOBUG added rustdoc 2017-03-07 10:10:43 +09:00
Paul Masurel
ebca904767 NOBUG added rustdoc 2017-03-07 09:58:51 +09:00
Paul Masurel
3a472914ce Fix .write -> .write_all 2017-03-06 16:28:30 +09:00
Paul Masurel
c59507444f issue/77 ManagedDirectory working
Closes #77
2017-03-06 12:18:36 +09:00
Paul Masurel
4b7afa2ae7 issue/77 Added managed directory 2017-03-03 22:41:37 +09:00
Paul Masurel
590a8582c9 The reference doc should not point to the schema page. 2017-02-28 21:17:19 +09:00
Paul Masurel
ab3440f925 NOBUG Bypass github cache for coveralls badge 2017-02-27 12:39:59 +09:00
Paul Masurel
ec5fb2eaa9 NOBUG cleanup 2017-02-27 09:52:28 +09:00
Paul Masurel
15b60d72cc NOBUG add_document does not return result 2017-02-27 09:36:41 +09:00
Paul Masurel
7a07144c68 Bugfix related with deletes, rollback and the index opstamp. 2017-02-27 01:42:25 +09:00
Paul Masurel
8bcfdb8e80 NOBUG misc ... 2017-02-26 21:35:18 +09:00
Paul Masurel
a7f10f055d Nobug hidding doc, filling doc 2017-02-26 00:11:32 +09:00
Paul Masurel
597dac9cb6 NOBUG Adding doc. 2017-02-25 23:39:02 +09:00
Paul Masurel
6a002bcc76 NOBUGwq 2017-02-25 21:20:55 +09:00
Paul Masurel
3a86fc00a2 Closes #64 - Improve Index creationg API / documentation 2017-02-25 20:40:39 +09:00
Paul Masurel
ca1617d3cd Fixes #91 2017-02-25 20:32:26 +09:00
Paul Masurel
e4a102d859 Merge branch 'issue/43'
Conflicts:
	src/directory/mmap_directory.rs
2017-02-25 19:36:21 +09:00
Paul Masurel
1d9924ee90 Closes #43. 2017-02-25 19:32:36 +09:00
Paul Masurel
f326a2dafe TODO hunt 2017-02-25 15:28:56 +09:00
Paul Masurel
78228ece73 Closes #92. ByteOrder of u32 terms. 2017-02-24 23:41:46 +09:00
Paul Masurel
503d0295cb issue/43 TODO hunt 2017-02-23 09:54:54 +09:00
Paul Masurel
eb39db44fc issue/43 Avoid keeping segments with 0 documents. 2017-02-23 09:20:30 +09:00
Paul Masurel
7f78d1f4ca Fixes #82 Renamed and commented the function to create Term from &[u8] 2017-02-23 08:33:59 +09:00
Paul Masurel
df9090cb0b NOBUG TODO hunt, and cleanups 2017-02-22 22:18:33 +09:00
Paul Masurel
4a8eb3cb05 issue/43 Added unit test for deletes including merging. 2017-02-22 21:38:37 +09:00
Paul Masurel
a74b41d7ed NOBUG run benchmark over exactly 100 K elements 2017-02-21 11:43:55 +09:00
Paul Masurel
06017bd422 NOBUG made the cleanup limit adaptive in MmapCache 2017-02-21 00:37:45 +09:00
Paul Masurel
17beaab8bf Merge branch 'issue/72' 2017-02-21 00:25:24 +09:00
Paul Masurel
062e38a2ab Fixes #72 - Cache directory uses weak ref. Introduced CacheInfo object. 2017-02-21 00:24:33 +09:00
Paul Masurel
8c2b20c496 NOBUG Trying to fix coverall conf. 2017-02-20 17:47:16 +09:00
Paul Masurel
c677eb9f13 issue/43 Removed notify 2017-02-19 22:41:45 +09:00
Paul Masurel
0f332d1fd3 issue/43 Removed doc freq from recorders. 2017-02-19 22:39:31 +09:00
Paul Masurel
1b45539f32 issue/43 Added support for delete in merged index 2017-02-19 22:39:31 +09:00
Paul Masurel
7315000fd4 issue/43 Merging ok for postings / fastfields. 2017-02-19 22:39:31 +09:00
Paul Masurel
e3d2fca844 issue/43 Isolated segment_entry / doc_opstamp_mapping 2017-02-19 22:39:31 +09:00
Paul Masurel
1c03d98a11 issue/43 added delete_queue right in the segment updater 2017-02-19 22:39:31 +09:00
Paul Masurel
8b68f22be1 issue/43 made the delete queue shareable 2017-02-19 22:39:31 +09:00
Paul Masurel
d007cf3435 issue/43 simplification. removed the notion of delete cursor. 2017-02-19 22:39:04 +09:00
Paul Masurel
72afbb28c7 issue/43 test passing 2017-02-19 22:39:04 +09:00
Paul Masurel
2fc3a505bc issue/43 refactoring segment meta 2017-02-19 22:39:04 +09:00
Paul Masurel
e337c35721 issue/43 SegmentMeta refactoring 2017-02-19 22:39:04 +09:00
Paul Masurel
0c318339b0 issue/43 Path logic in segment. 2017-02-19 22:39:04 +09:00
Paul Masurel
64fee11bc0 issue/43 Clean up 2017-02-19 22:39:04 +09:00
Paul Masurel
e12fc4bb09 issue/43 deletes
merge not working
only updating uncommitted
2017-02-19 22:39:04 +09:00
Paul Masurel
0820992141 issue/43 docstamp -> opstamp 2017-02-19 22:38:39 +09:00
Paul Masurel
09782858da issue/43 Segment have a commit opstamp 2017-02-19 22:38:39 +09:00
Paul Masurel
ca977fb17b issue/43 Refactoring of SegmentUpdater 2017-02-19 22:38:39 +09:00
Paul Masurel
e8ecb68f00 issue/43 switching for futures 2017-02-19 22:38:39 +09:00
Paul Masurel
0ec492dcf2 issue/43 refactoring in order to remove the segment updater non sense for simpler futures 2017-02-19 22:38:39 +09:00
Paul Masurel
20eb586660 issue/43 Rename SegmentUpdater 2017-02-19 22:38:39 +09:00
Paul Masurel
6530d43d6a issue/43 Small fixes. 2017-02-19 22:38:39 +09:00
Paul Masurel
926e71a573 issue/43 unit test running. segment updater uses futures. 2017-02-19 22:38:38 +09:00
Paul Masurel
bacaabf857 issue/43 fixed on unit test. need big refactoring of segment updater 2017-02-19 22:38:38 +09:00
Paul Masurel
d6e7157173 issue/43 Test broken... moved segment manager to the segment updater / segment writer 2017-02-19 22:38:15 +09:00
Paul Masurel
093dcbd253 issue/43 Isolated SegmentMeta 2017-02-19 22:38:15 +09:00
Paul Masurel
fba44b78b6 issue/43 Added delete doc file 2017-02-19 22:38:15 +09:00
Paul Masurel
01cf303dec issue/43 segment writer 2017-02-19 22:38:14 +09:00
Paul Masurel
d5c161e196 issue/43 Computing deleted doc bitset 2017-02-19 22:38:14 +09:00
Paul Masurel
183d5221b5 issue/43 DeleteQueue. 2017-02-19 22:38:14 +09:00
Paul Masurel
5a06f45403 issue/43 small progress 2017-02-19 22:36:57 +09:00
Paul Masurel
395cbf3913 issue/43 Change the delete queue datastruct for something cleaner/functional 2017-02-19 22:36:57 +09:00
Paul Masurel
fe2ddb8844 issue43 Added DeleteQueue. 2017-02-19 22:36:57 +09:00
Paul Masurel
3129701e92 issue/71 Added list of supported OSes 2017-02-19 14:14:15 +09:00
Paul Masurel
56ba698def Merge pull request #76 from Ameobea/master
Updated dependency versions and implementations
2017-02-17 18:20:44 +09:00
Casey Primozic
e0ba699c16 Updated dependency versions and implementations
- Updated `byteorder` error usage (now returns straight `Error`s)
 - Updated `Uuid` implementation (`to_simple_string` now `.simple().to_string()`)
2017-02-17 01:26:13 -06:00
Paul Masurel
b6423f9a76 Merge pull request #73 from manuel-woelker/pr-subtree
Use git subtree mechanism for simdcomp to simplify build (cf. #24)
2017-01-27 14:52:36 +09:00
Manuel Woelker
a667394a49 update README and build after simdcomp subtree refactor 2017-01-26 21:14:05 +01:00
Manuel Woelker
9f02b090dd Merge commit 'f07ccd6e4fbc5bbfeb94d40e0f14bc527a7d5439' as 'cpp/simdcomp' 2017-01-26 20:28:23 +01:00
Manuel Woelker
f07ccd6e4f Squashed 'cpp/simdcomp/' content from commit 0dca286
git-subtree-dir: cpp/simdcomp
git-subtree-split: 0dca28668f1fb6d343dc3c62fa7750a00f1d7201
2017-01-26 20:28:23 +01:00
Manuel Woelker
f19f8757de remove git submodule to replace via git subtree 2017-01-26 20:17:44 +01:00
Paul Masurel
f729edb529 NOBUG added badges / categories for crates.io 2017-01-21 09:35:44 +09:00
Paul Masurel
73ef201c44 Merge branch 'master' of github.com:tantivy-search/tantivy 2017-01-11 21:09:05 +09:00
Paul Masurel
3b69e790e9 NOBUG expose a version public api. Handy to check if the compilation was made with simd or not. 2017-01-11 21:06:41 +09:00
Paul Masurel
1b0b3051c2 NOBUG Pinned some version, removed import warning. 2017-01-09 15:30:50 +09:00
Paul Masurel
43c1da1a92 Merge branch 'issue/67' 2016-12-20 16:52:33 +01:00
Paul Masurel
e1cb5e299d NOBUG split field_type into 2 2016-12-20 16:51:34 +01:00
Paul Masurel
14ebed392b Merge pull request #68 from tantivy-search/issue/67
Issue/67
2016-12-20 11:27:05 +01:00
Paul Masurel
d3d34be167 issue/67 Added a advance interface to the term iterator 2016-12-20 11:25:52 +01:00
Paul Masurel
98cdc83428 Issue #67 Removing afterwards. 2016-12-18 11:57:28 +01:00
Paul Masurel
4d7d201f21 Issue #67 - Removed segment ord array from term iteration.
This was probably an early optimization.
2016-12-17 09:44:51 +01:00
Paul Masurel
ca5f3e1d46 issue/67 First stab. Iterator working. 2016-12-17 00:58:12 +01:00
Paul Masurel
1559733b03 Merge pull request #63 from vandenoever/readme
fix for build instructions
2016-12-12 10:17:27 +09:00
Paul Masurel
44b5f1868c Merge branch 'master' into readme 2016-12-12 10:17:19 +09:00
Paul Masurel
4cedfd903d NOBUG Added ga beacon to README 2016-12-12 10:07:30 +09:00
Paul Masurel
c0049e8487 NOBUG fixed doc urls. 2016-12-11 21:43:14 +09:00
Paul Masurel
e88adbff5c Bumped tantivy's version in Cargo.toml 2016-12-11 17:51:45 +09:00
Jos van den Oever
e497e04f70 fix for build instructions
And clarification that nighty is required.
2016-12-10 18:08:15 +01:00
275 changed files with 40044 additions and 11187 deletions

1
.gitattributes vendored Normal file

@@ -0,0 +1 @@
cpp/* linguist-vendored

19
.github/ISSUE_TEMPLATE/bug_report.md vendored Normal file

@@ -0,0 +1,19 @@
---
name: Bug report
about: Create a report to help us improve
---
**Describe the bug**
- What did you do?
- What happened?
- What was expected?
**Which version of tantivy are you using?**
If "master", ideally give the specific sha1 revision.
**To Reproduce**
If your bug is deterministic, can you give a minimal reproducing code?
Some bugs are not deterministic. Can you describe with precision in which context it happened?
If this is possible, can you share your code?

View File

@@ -0,0 +1,14 @@
---
name: Feature request
about: Suggest an idea for this project
---
**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
**Describe the solution you'd like**
A clear and concise description of what you want to happen.
**[Optional] describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.

7
.github/ISSUE_TEMPLATE/question.md vendored Normal file

@@ -0,0 +1,7 @@
---
name: Question
about: Ask any question about tantivy's usage...
---
Try to be specific about your use case...

6
.gitignore vendored

@@ -1,3 +1,5 @@
tantivy.iml
*.swp
target
target/debug
.vscode
@@ -5,3 +7,7 @@ target/release
Cargo.lock
benchmark
.DS_Store
cpp/simdcomp/bitpackingbenchmark
*.bk
.idea
trace.dat

3
.gitmodules vendored

@@ -1,3 +0,0 @@
[submodule "cpp/simdcomp"]
path = cpp/simdcomp
url = git@github.com:lemire/simdcomp.git

View File

@@ -1,21 +1,22 @@
# Based on the "trust" template v0.1.2
# https://github.com/japaric/trust/tree/v0.1.2
dist: trusty
language: rust
rust:
- nightly
git:
submodules: false
before_install:
- sed -i 's/git@github.com:/https:\/\/github.com\//' .gitmodules
- git submodule update --init --recursive
services: docker
sudo: required
env:
global:
- CC=gcc-4.8
- CXX=g++-4.8
- CRATE_NAME=tantivy
- TRAVIS_CARGO_NIGHTLY_FEATURE=""
- secure: eC8HjTi1wgRVCsMAeXEXt8Ckr0YBSGOEnQkkW4/Nde/OZ9jJjz2nmP1ELQlDE7+czHub2QvYtDMG0parcHZDx/Kus0yvyn08y3g2rhGIiE7y8OCvQm1Mybu2D/p7enm6shXquQ6Z5KRfRq+18mHy80wy9ABMA/ukEZdvnfQ76/Een8/Lb0eHaDoXDXn3PqLVtByvSfQQ7OhS60dEScu8PWZ6/l1057P5NpdWbMExBE7Ro4zYXNhkJeGZx0nP/Bd4Jjdt1XfPzMEybV6NZ5xsTILUBFTmOOt603IsqKGov089NExqxYu5bD3K+S4MzF1Nd6VhomNPJqLDCfhlymJCUj5n5Ku4yidlhQbM4Ej9nGrBalJnhcjBjPua5tmMF2WCxP9muKn/2tIOu1/+wc0vMf9Yd3wKIkf5+FtUxCgs2O+NslWvmOMAMI/yD25m7hb4t1IwE/4Bk+GVcWJRWXbo0/m6ZUHzRzdjUY2a1qvw7C9udzdhg7gcnXwsKrSWi2NjMiIVw86l+Zim0nLpKIN41sxZHLaFRG63Ki8zQ/481LGn32awJ6i3sizKS0WD+N1DfR2qYMrwYHaMN0uR0OFXYTJkFvTFttAeUY3EKmRKAuMhmO2YRdSr4/j/G5E9HMc1gSGJj6PxgpQU7EpvxRsmoVAEJr0mszmOj9icGHep/FM=
# - secure: eC8HjTi1wgRVCsMAeXEXt8Ckr0YBSGOEnQkkW4/Nde/OZ9jJjz2nmP1ELQlDE7+czHub2QvYtDMG0parcHZDx/Kus0yvyn08y3g2rhGIiE7y8OCvQm1Mybu2D/p7enm6shXquQ6Z5KRfRq+18mHy80wy9ABMA/ukEZdvnfQ76/Een8/Lb0eHaDoXDXn3PqLVtByvSfQQ7OhS60dEScu8PWZ6/l1057P5NpdWbMExBE7Ro4zYXNhkJeGZx0nP/Bd4Jjdt1XfPzMEybV6NZ5xsTILUBFTmOOt603IsqKGov089NExqxYu5bD3K+S4MzF1Nd6VhomNPJqLDCfhlymJCUj5n5Ku4yidlhQbM4Ej9nGrBalJnhcjBjPua5tmMF2WCxP9muKn/2tIOu1/+wc0vMf9Yd3wKIkf5+FtUxCgs2O+NslWvmOMAMI/yD25m7hb4t1IwE/4Bk+GVcWJRWXbo0/m6ZUHzRzdjUY2a1qvw7C9udzdhg7gcnXwsKrSWi2NjMiIVw86l+Zim0nLpKIN41sxZHLaFRG63Ki8zQ/481LGn32awJ6i3sizKS0WD+N1DfR2qYMrwYHaMN0uR0OFXYTJkFvTFttAeUY3EKmRKAuMhmO2YRdSr4/j/G5E9HMc1gSGJj6PxgpQU7EpvxRsmoVAEJr0mszmOj9icGHep/FM=
addons:
apt:
sources:
- ubuntu-toolchain-r-test
- kalakris-cmake
packages:
- gcc-4.8
- g++-4.8
@@ -23,18 +24,67 @@ addons:
- libelf-dev
- libdw-dev
- binutils-dev
- cmake
matrix:
include:
# Android
- env: TARGET=aarch64-linux-android DISABLE_TESTS=1
#- env: TARGET=arm-linux-androideabi DISABLE_TESTS=1
#- env: TARGET=armv7-linux-androideabi DISABLE_TESTS=1
#- env: TARGET=i686-linux-android DISABLE_TESTS=1
#- env: TARGET=x86_64-linux-android DISABLE_TESTS=1
# Linux
#- env: TARGET=aarch64-unknown-linux-gnu
#- env: TARGET=i686-unknown-linux-gnu
- env: TARGET=x86_64-unknown-linux-gnu CODECOV=1 #UPLOAD_DOCS=1
# - env: TARGET=x86_64-unknown-linux-musl CODECOV=1
# OSX
#- env: TARGET=x86_64-apple-darwin
# os: osx
before_install:
- set -e
- rustup self update
install:
- sh ci/install.sh
- source ~/.cargo/env || true
- env | grep "TRAVIS"
before_script:
- |
pip install 'travis-cargo<0.2' --user &&
export PATH=$HOME/.local/bin:$PATH
- export PATH=$HOME/.cargo/bin:$PATH
- cargo install cargo-update || echo "cargo-update already installed"
- cargo install cargo-travis || echo "cargo-travis already installed"
script:
- |
travis-cargo build &&
travis-cargo test &&
travis-cargo bench &&
travis-cargo doc
- bash ci/script.sh
before_deploy:
- sh ci/before_deploy.sh
after_success:
- bash ./script/build-doc.sh
- travis-cargo doc-upload
- if [[ "$TRAVIS_OS_NAME" == "linux" ]]; then travis-cargo coveralls --no-sudo --verify; fi
- if [[ "$TRAVIS_OS_NAME" == "linux" ]]; then ./kcov/build/src/kcov --verify --coveralls-id=$TRAVIS_JOB_ID --exclude-pattern=/.cargo target/kcov target/debug/tantivy-*; fi
# Needs GH_TOKEN env var to be set in travis settings
- if [[ -v GH_TOKEN ]]; then echo "GH TOKEN IS SET"; else echo "GH TOKEN NOT SET"; fi
- if [[ -v UPLOAD_DOCS ]]; then cargo doc; cargo doc-upload; else echo "doc upload disabled."; fi
#cache: cargo
#before_cache:
# # Travis can't cache files that are not readable by "others"
# - chmod -R a+r $HOME/.cargo
# - find ./target/debug -type f -maxdepth 1 -delete
# - rm -f ./target/.rustc_info.json
# - rm -fr ./target/debug/{deps,.fingerprint}/tantivy*
# - rm -r target/debug/examples/
# - ls -1 examples/ | sed -e 's/\.rs$//' | xargs -I "{}" find target/* -name "*{}*" -type f -delete
#branches:
# only:
# # release tags
# - /^v\d+\.\d+\.\d+.*$/
# - master
notifications:
email:
on_success: never

11
AUTHORS Normal file

@@ -0,0 +1,11 @@
# This is the list of authors of tantivy for copyright purposes.
Paul Masurel
Laurentiu Nicola
Dru Sellers
Ashley Mannix
Michael J. Curry
Jason Wolfe
# As an employee of Google I am required to add Google LLC
# in the list of authors, but this project is not affiliated to Google
# in any other way.
Google LLC

299
CHANGELOG.md Normal file

@@ -0,0 +1,299 @@
Tantivy 0.11.0
=====================
- Added f64 field. Internally reuse u64 code the same way i64 does (@fdb-hiroshima)
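The "reuse u64 code" idea can be illustrated with an order-preserving mapping: if signed integers and floats are converted to `u64` in a way that keeps their ordering, the existing `u64` machinery can store and compare them. The block below is a minimal sketch of that standard technique using only the standard library; the function names are illustrative assumptions and are not taken from tantivy's source.

```rust
/// Maps an i64 to a u64 so that unsigned ordering matches signed ordering.
/// Flipping the sign bit moves negative values below non-negative ones.
fn i64_to_u64(val: i64) -> u64 {
    (val as u64) ^ (1u64 << 63)
}

/// Inverse of `i64_to_u64`.
fn u64_to_i64(val: u64) -> i64 {
    (val ^ (1u64 << 63)) as i64
}

/// Maps an f64 to a u64 so that unsigned ordering matches numeric ordering
/// (NaN aside). Non-negative floats get their sign bit set; negative floats
/// have all bits flipped, so larger magnitudes map to smaller integers.
fn f64_to_u64(val: f64) -> u64 {
    let bits = val.to_bits();
    if bits >> 63 == 0 {
        bits ^ (1u64 << 63)
    } else {
        !bits
    }
}

/// Inverse of `f64_to_u64`.
fn u64_to_f64(val: u64) -> f64 {
    let bits = if val >> 63 == 1 {
        val ^ (1u64 << 63)
    } else {
        !val
    };
    f64::from_bits(bits)
}

fn main() {
    // Both inputs are sorted; the mapped values must stay sorted.
    let ints = [i64::MIN, -5, 0, 7, i64::MAX];
    let floats = [-1.0e9, -1.5, 0.0, 2.25, 3.0e12];

    let mapped_ints: Vec<u64> = ints.iter().map(|&v| i64_to_u64(v)).collect();
    let mapped_floats: Vec<u64> = floats.iter().map(|&v| f64_to_u64(v)).collect();

    assert!(mapped_ints.windows(2).all(|w| w[0] < w[1]));
    assert!(mapped_floats.windows(2).all(|w| w[0] < w[1]));

    // The mapping is reversible.
    assert_eq!(u64_to_i64(i64_to_u64(-5)), -5);
    assert_eq!(u64_to_f64(f64_to_u64(-1.5)), -1.5);
}
```

With a mapping like this, range comparisons on the encoded `u64` values give the same answer as comparisons on the original numbers.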
Tantivy 0.10.1
=====================
- Closes #544. A few users experienced problems with the directory watching system.
Avoid watching the mmap directory until someone effectively creates a reader that uses
this functionality.
Tantivy 0.10.0
=====================
*Tantivy 0.10.0 index format is compatible with the index format in 0.9.0.*
- Added an API to easily tweak or entirely replace the
default score. See `TopDocs::tweak_score` and `TopDocs::custom_score` (@pmasurel)
- Added an ASCII folding filter (@drusellers)
- Bugfix in `query.count` in presence of deletes (@pmasurel)
- Added `.explain(...)` in `Query` and `Weight` (@pmasurel)
- Added an efficient way to `delete_all_documents` in `IndexWriter` (@petr-tik).
All segments are simply removed.
Minor
---------
- Switched to Rust 2018 (@uvd)
- Small simplification of the code.
Calling .freq() or .doc() when .advance() has never been called
on segment postings should panic from now on.
- Tokens exceeding `u16::max_value() - 4` chars are discarded silently instead of panicking.
- Fast fields are now preloaded when the `SegmentReader` is created.
- `IndexMeta` is now public. (@hntd187)
- `IndexWriter` is now `Sync`, making it possible to use it within an
`Arc<RwLock<IndexWriter>>`. `add_document` and `delete_term`
only require a read lock. (@pmasurel)
- Introducing `Opstamp` as an expressive type alias for `u64`. (@petr-tik)
- Stamper now relies on `AtomicU64` on all platforms (@petr-tik)
- Bugfix - Files get deleted slightly earlier
- Compilation resources improved (@fdb-hiroshima)
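To make the `Opstamp` and `AtomicU64` entries above concrete, here is a minimal sketch of an atomic operation-stamp generator. The names `Opstamp` and `Stamper` come from the changelog, but the struct layout and methods shown are assumptions for illustration, not tantivy's actual implementation.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Expressive alias: an operation stamp is just a u64.
type Opstamp = u64;

/// Hypothetical stamper: hands out strictly increasing opstamps.
/// Cloning shares the same underlying counter.
#[derive(Clone)]
struct Stamper(Arc<AtomicU64>);

impl Stamper {
    fn new(first_opstamp: Opstamp) -> Stamper {
        Stamper(Arc::new(AtomicU64::new(first_opstamp)))
    }

    /// Returns the current stamp and increments the counter atomically.
    fn stamp(&self) -> Opstamp {
        self.0.fetch_add(1, Ordering::SeqCst)
    }
}

fn main() {
    let stamper = Stamper::new(7);

    // Four threads each take 100 stamps from the shared counter.
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let local = stamper.clone();
            thread::spawn(move || (0..100).map(|_| local.stamp()).collect::<Vec<Opstamp>>())
        })
        .collect();

    let mut all: Vec<Opstamp> = handles
        .into_iter()
        .flat_map(|handle| handle.join().unwrap())
        .collect();
    all.sort();

    // Every stamp from 7 to 406 was handed out exactly once.
    assert_eq!(all, (7..407).collect::<Vec<Opstamp>>());
}
```

Because `stamp` takes `&self`, a generator like this can be shared across threads without a lock.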
## How to update?
Your program should be usable as is.
### Fast fields
Fast fields used to be accessed directly from the `SegmentReader`.
The API changed: you are now required to acquire your fast field reader via
`segment_reader.fast_fields()`, and use one of the typed methods:
- `.u64()`, `.i64()` if your field is single-valued;
- `.u64s()`, `.i64s()` if your field is multi-valued;
- `.bytes()` if your field is a bytes fast field.
Tantivy 0.9.0
=====================
*0.9.0 index format is not compatible with the
previous index format.*
- MAJOR BUGFIX :
Some `Mmap` objects were being leaked, and would never get released. (@fulmicoton)
- Removed most unsafe (@fulmicoton)
- Indexer memory footprint improved (VInt compression, inlining the first block). (@fulmicoton)
- Stemming in other languages is now possible (@pentlander)
- Segments with no docs are deleted earlier (@barrotsteindev)
- Added grouped add and delete operations.
They are guaranteed to happen together (i.e. they cannot be split by a commit).
In addition, adds are guaranteed to happen on the same segment. (@elbow-jason)
- Removed `INT_STORED` and `INT_INDEXED`. It is now possible to use `STORED` and `INDEXED`
for int fields. (@fulmicoton)
- Added DateTime field (@barrotsteindev)
- Added IndexReader. By default, index is reloaded automatically upon new commits (@fulmicoton)
- SIMD linear search within blocks (@fulmicoton)
## How to update?
tantivy 0.9 brought some API-breaking changes.
To update from tantivy 0.8, you will need to go through the following steps.
- `schema::INT_INDEXED` and `schema::INT_STORED` should be replaced by `schema::INDEXED` and `schema::STORED`.
- The index no longer holds the pool of searchers. You are required to create an intermediary object called
`IndexReader` for this.
```rust
// Create the reader. You typically need to create one reader for the entire
// lifetime of your program.
let reader = index.reader()?;
// Acquiring a searcher (previously `index.searcher()`) is now written:
let searcher = reader.searcher();
// With the default setting of the reader, you are not required to
// call `index.load_searchers()` anymore.
//
// The IndexReader will pick up that change automatically, regardless
// of whether the update was done in a different process or not.
// If this behavior is not wanted, you can create your reader with
// the `ReloadPolicy::Manual`, and manually decide when to reload the index
// by calling `reader.reload()?`.
```
Tantivy 0.8.2
=====================
Fixing build for x86_64 platforms. (#496)
No need to update from 0.8.1 if tantivy
is building on your platform.
Tantivy 0.8.1
=====================
Hotfix of #476.
Merge was reflecting deletes before commit was passed.
Thanks @barrotsteindev for reporting the bug.
Tantivy 0.8.0
=====================
*No change in the index format*
- API Breaking change in the collector API. (@jwolfe, @fulmicoton)
- Multithreaded search (@jwolfe, @fulmicoton)
Tantivy 0.7.1
=====================
*No change in the index format*
- Bugfix: NGramTokenizer panics on non ascii chars
- Added a space usage API
Tantivy 0.7
=====================
- Skip data for doc ids and positions (@fulmicoton),
greatly improving performance
- Tantivy errors now rely on the failure crate (@drusellers)
- Added support for `AND`, `OR`, `NOT` syntax in addition to the `+`,`-` syntax
- Added a snippet generator with highlight (@vigneshsarma, @fulmicoton)
- Added a `TopFieldCollector` (@pentlander)
Tantivy 0.6.1
=========================
- Bugfix #324. GC was removing files that were still in use
- Added support for parsing AllQuery and RangeQuery via QueryParser
- AllQuery: `*`
- RangeQuery:
- Inclusive `field:[startIncl to endIncl]`
- Exclusive `field:{startExcl to endExcl}`
- Mixed `field:[startIncl to endExcl}` and vice versa
- Unbounded `field:[start to *]`, `field:[* to end]`
Tantivy 0.6
==========================
Special thanks to @drusellers and @jason-wolfe for their contributions
to this release!
- Removed C code. Tantivy is now pure Rust. (@pmasurel)
- BM25 (@pmasurel)
- Approximate field norms encoded over 1 byte. (@pmasurel)
- Compiles on stable rust (@pmasurel)
- Add &[u8] fastfield for associating arbitrary bytes to each document (@jason-wolfe) (#270)
- Completely uncompressed
- Internally: One u64 fast field for indexes, one fast field for the bytes themselves.
- Add NGram token support (@drusellers)
- Add Stopword Filter support (@drusellers)
- Add a FuzzyTermQuery (@drusellers)
- Add a RegexQuery (@drusellers)
- Various performance improvements (@pmasurel)
Tantivy 0.5.2
===========================
- bugfix #274
- bugfix #280
- bugfix #289
Tantivy 0.5.1
==========================
- bugfix #254 : tantivy failed if no documents in a segment contained a specific field.
Tantivy 0.5
==========================
- Faceting
- RangeQuery
- Configurable tokenization pipeline
- Bugfix in PhraseQuery
- Various query optimisation
- Allowing very large indexes
- 64 bits file address
- Smarter encoding of the `TermInfo` objects
Tantivy 0.4.3
==========================
- Bugfix race condition when deleting files. (#198)
Tantivy 0.4.2
==========================
- Prevent usage of AVX2 instructions (#201)
Tantivy 0.4.1
==========================
- Bugfix for non-indexed fields. (#199)
Tantivy 0.4.0
==========================
- Raise the limit of number of fields (previously 256 fields) (@fulmicoton)
- Removed u32 fields. They are replaced by u64 and i64 fields (#65) (@fulmicoton)
- Optimized skip in SegmentPostings (#130) (@lnicola)
- Replacing rustc_serialize by serde. Kudos to @KodrAus and @lnicola
- Using error-chain (@KodrAus)
- QueryParser: (@fulmicoton)
- Explicit error returned when searched for a term that is not indexed
- Searching for an int term via the query parser was broken `(age:1)`
- Searching for a non-indexed field returns an explicit Error
- Phrase queries for non-tokenized fields are not tokenized by the query parser.
- Faster/Better indexing (@fulmicoton)
- using murmurhash2
- faster merging
- more memory efficient fast field writer (@lnicola )
- better handling of collisions
- lower memory usage
- Added API, most notably to iterate over ranges of terms (@fulmicoton)
- Fixed a bug that prevented segment files from being unmapped on index drop (@fulmicoton)
- Made the doc! macro public (@fulmicoton)
- Added an alternative implementation of the streaming dictionary (@fulmicoton)
Tantivy 0.3.1
==========================
- Expose a method to trigger files garbage collection
Tantivy 0.3
==========================
Special thanks to @Kodraus @lnicola @Ameobea @manuel-woelker @celaus
for their contributions to this release.
Thanks also to everyone in the tantivy gitter chat
for their advice and company :)
https://gitter.im/tantivy-search/tantivy
Warning:
Tantivy 0.3 is NOT backward compatible with tantivy 0.2
code and index format.
You should not expect backward compatibility before
tantivy 1.0.
New Features
------------
- Delete. You can now delete documents from an index.
- Support for windows (Thanks to @lnicola)
Various Bugfixes & small improvements
----------------------------------------
- Added CI for Windows (https://ci.appveyor.com/project/fulmicoton/tantivy)
Thanks to @KodrAus ! (#108)
- Various dependency version updates (Thanks to @Ameobea) #76
- Fixed several race conditions in `Index.wait_merge_threads`
- Fixed #72. Mmap were never released.
- Fixed #80. Fast field used to take an amplitude of 32 bits after a merge. (Ouch!)
- Fixed #92. u32 are now encoded using big endian in the fst
in order to make their enumeration consistent with
the natural ordering.
- Building binary targets for tantivy-cli (Thanks to @KodrAus)
- Misc invisible bug fixes, and code cleanup.
- Use
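The fix for #92 listed above relies on a general property: byte strings are compared lexicographically, and only the big-endian encoding of an unsigned integer sorts in the same order as the integer itself. The sketch below demonstrates this with the standard library only; the helper names are illustrative and not part of tantivy.

```rust
/// Big-endian key: most significant byte first.
fn be_key(val: u32) -> [u8; 4] {
    val.to_be_bytes()
}

/// Little-endian key: least significant byte first.
fn le_key(val: u32) -> [u8; 4] {
    val.to_le_bytes()
}

fn main() {
    // Already in natural (numeric) order.
    let values = vec![1u32, 255, 256, 65_536, u32::MAX];

    // Sorting by big-endian bytes preserves the numeric order.
    let mut by_be = values.clone();
    by_be.sort_by_key(|&v| be_key(v));
    assert_eq!(by_be, values);

    // Sorting by little-endian bytes does not: 65_536 encodes as
    // [0, 0, 1, 0], which sorts before 1, encoded as [1, 0, 0, 0].
    let mut by_le = values.clone();
    by_le.sort_by_key(|&v| le_key(v));
    assert_eq!(by_le, vec![65_536, 256, 1, 255, u32::MAX]);
}
```

A term dictionary that enumerates keys in byte order therefore yields numeric order only when the integers are stored big-endian.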

View File

@@ -1,54 +1,101 @@
[package]
name = "tantivy"
version = "0.1.1"
version = "0.10.1"
authors = ["Paul Masurel <paul.masurel@gmail.com>"]
build = "build.rs"
license = "MIT"
description = """Tantivy is a search engine library."""
documentation = "http://fulmicoton.com/tantivy/tantivy/index.html"
categories = ["database-implementations", "data-structures"]
description = """Search engine library"""
documentation = "https://tantivy-search.github.io/tantivy/tantivy/index.html"
homepage = "https://github.com/tantivy-search/tantivy"
repository = "https://github.com/tantivy-search/tantivy"
readme = "README.md"
keywords = ["search", "information", "retrieval"]
edition = "2018"
[dependencies]
byteorder = "0.4"
memmap = "0.2"
lazy_static = "0.1"
regex = "0.1"
fst = "0.1"
atomicwrites = "0.0.14"
tempfile = "2.0"
rustc-serialize = "0.3"
log = "0.3"
combine = "2.0.*"
base64 = "0.10.0"
byteorder = "1.0"
once_cell = "0.2"
regex = "1.0"
tantivy-fst = "0.1"
memmap = {version = "0.7", optional=true}
lz4 = {version="1.20", optional=true}
snap = {version="0.2"}
atomicwrites = {version="0.2.2", optional=true}
tempfile = "3.0"
log = "0.4"
combine = ">=3.6.0,<4.0.0"
tempdir = "0.3"
bincode = "0.4"
libc = {version = "0.2.6", optional=true}
num_cpus = "0.2"
itertools = "0.4"
lz4 = "1.13"
time = "0.1"
uuid = "0.1"
chan = "0.1"
crossbeam = "0.2"
serde = "1.0"
serde_derive = "1.0"
serde_json = "1.0"
num_cpus = "1.2"
fs2={version="0.4", optional=true}
itertools = "0.8"
levenshtein_automata = {version="0.1", features=["fst_automaton"]}
notify = {version="4", optional=true}
bit-set = "0.5"
uuid = { version = "0.7.2", features = ["v4", "serde"] }
crossbeam = "0.5"
futures = "0.1"
futures-cpupool = "0.1"
owning_ref = "0.4"
stable_deref_trait = "1.0.0"
rust-stemmers = "1.1"
downcast-rs = { version="1.0" }
bitpacking = {version="0.8", default-features = false, features=["bitpacker4x"]}
census = "0.2"
fnv = "1.0.6"
owned-read = "0.4"
failure = "0.1"
htmlescape = "0.3.1"
fail = "0.3"
scoped-pool = "1.0"
murmurhash32 = "0.2"
chrono = "0.4"
smallvec = "0.6"
[target.'cfg(windows)'.dependencies]
winapi = "0.3"
[dev-dependencies]
rand = "0.3"
[build-dependencies]
gcc = {version = "0.3", optional=true}
rand = "0.7"
maplit = "1"
matches = "0.1.8"
time = "0.1.42"
[profile.release]
opt-level = 3
debug = false
lto = true
debug-assertions = false
[profile.test]
debug-assertions = true
overflow-checks = true
[features]
default = ["simdcompression"]
simdcompression = ["libc", "gcc"]
default = ["mmap"]
mmap = ["atomicwrites", "fs2", "memmap", "notify"]
lz4-compression = ["lz4"]
failpoints = ["fail/failpoints"]
unstable = [] # useful for benches.
wasm-bindgen = ["uuid/wasm-bindgen"]
[badges]
travis-ci = { repository = "tantivy-search/tantivy" }
[dev-dependencies.fail]
features = ["failpoints"]
# Following the "fail" crate best practises, we isolate
# tests that define specific behavior in fail check points
# in a different binary.
#
# We do that because, fail rely on a global definition of
# failpoints behavior and hence, it is incompatible with
# multithreading.
[[test]]
name = "failpoints"
path = "tests/failpoints/mod.rs"
required-features = ["fail/failpoints"]

View File

@@ -1,4 +1,4 @@
Copyright (c) 2016 Paul Masurel
Copyright (c) 2018 by the project authors, as listed in the AUTHORS file.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

137
README.md

@@ -1,55 +1,138 @@
![Tantivy](https://tantivy-search.github.io/logo/tantivy-logo.png)
[![Build Status](https://travis-ci.org/tantivy-search/tantivy.svg?branch=master)](https://travis-ci.org/tantivy-search/tantivy)
[![Coverage Status](https://coveralls.io/repos/github/tantivy-search/tantivy/badge.svg?branch=master)](https://coveralls.io/github/tantivy-search/tantivy?branch=master)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![codecov](https://codecov.io/gh/tantivy-search/tantivy/branch/master/graph/badge.svg)](https://codecov.io/gh/tantivy-search/tantivy)
[![Join the chat at https://gitter.im/tantivy-search/tantivy](https://badges.gitter.im/tantivy-search/tantivy.svg)](https://gitter.im/tantivy-search/tantivy?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Build status](https://ci.appveyor.com/api/projects/status/r7nb13kj23u8m9pj/branch/master?svg=true)](https://ci.appveyor.com/project/fulmicoton/tantivy/branch/master)
[![Crates.io](https://img.shields.io/crates/v/tantivy.svg)](https://crates.io/crates/tantivy)
[![Say Thanks!](https://img.shields.io/badge/Say%20Thanks-!-1EAEDB.svg)](https://saythanks.io/to/fulmicoton)
![Tantivy](https://tantivy-search.github.io/logo/tantivy-logo.png)
[![](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/images/0)](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/links/0)
[![](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/images/1)](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/links/1)
[![](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/images/2)](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/links/2)
[![](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/images/3)](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/links/3)
[![](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/images/4)](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/links/4)
[![](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/images/5)](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/links/5)
[![](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/images/6)](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/links/6)
[![](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/images/7)](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/links/7)
[![Become a patron](https://c5.patreon.com/external/logo/become_a_patron_button.png)](https://www.patreon.com/fulmicoton)
**Tantivy** is a **full text search engine library** written in rust.
It is strongly inspired by Lucene's design.
It is closer to [Apache Lucene](https://lucene.apache.org/) than to [Elasticsearch](https://www.elastic.co/products/elasticsearch) and [Apache Solr](https://lucene.apache.org/solr/) in the sense it is not
an off-the-shelf search engine server, but rather a crate that can be used
to build such a search engine.
Tantivy is, in fact, strongly inspired by Lucene's design.
# Benchmark
Tantivy is typically faster than Lucene, but the results will depend on
the nature of the queries in your workload.
The following [benchmark](https://tantivy-search.github.io/bench/) breaks down
performance for different types of queries and collections.
# Features
- configurable indexing (optional term frequency and position indexing)
- tf-idf scoring
- Basic query language
- Full-text search
- Configurable tokenizer (stemming available for 17 Latin languages; third-party support for Chinese ([tantivy-jieba](https://crates.io/crates/tantivy-jieba) and [cang-jie](https://crates.io/crates/cang-jie)) and [Japanese](https://crates.io/crates/tantivy-tokenizer-tiny-segmenter))
- Fast (check out the :racehorse: :sparkles: [benchmark](https://tantivy-search.github.io/bench/) :sparkles: :racehorse:)
- Tiny startup time (<10ms), perfect for command line tools
- BM25 scoring (the same as lucene)
- Natural query language `(michael AND jackson) OR "king of pop"`
- Phrase queries search (`"michael jackson"`)
- Incremental indexing
- Multithreaded indexing (indexing English Wikipedia takes 4 minutes on my desktop)
- mmap based
- SIMD integer compression
- u32 fast fields (equivalent of doc values in Lucene)
- Multithreaded indexing (indexing English Wikipedia takes < 3 minutes on my desktop)
- Mmap directory
- SIMD integer compression when the platform/CPU includes the SSE2 instruction set.
- Single valued and multivalued u64, i64 and f64 fast fields (equivalent of doc values in Lucene)
- `&[u8]` fast fields
- Text, i64, u64, f64, dates and hierarchical facet fields
- LZ4 compressed document store
- Range queries
- Faceted search
- Configurable indexing (optional term frequency and position indexing)
- Cheesy logo with a horse
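Fast fields for `i64` and `f64` can reuse the machinery built for `u64` if values are first mapped to `u64` in a way that preserves their order. The sketch below shows one common mapping of that kind using only the standard library. The function names are hypothetical and this is an illustration under that assumption, not a copy of tantivy's internal code.

```rust
/// Maps an i64 to a u64 while preserving order: flipping the sign bit
/// sends i64::MIN to 0 and i64::MAX to u64::MAX.
fn i64_to_u64(val: i64) -> u64 {
    (val as u64) ^ (1u64 << 63)
}

/// Inverse of `i64_to_u64`.
fn u64_to_i64(val: u64) -> i64 {
    (val ^ (1u64 << 63)) as i64
}

/// Maps an f64 to a u64 while preserving order (NaN is not handled).
/// Positive floats get their sign bit set so they land in the upper half;
/// negative floats have all bits flipped so that "more negative" sorts first.
fn f64_to_u64(val: f64) -> u64 {
    let bits = val.to_bits();
    if val.is_sign_positive() {
        bits ^ (1u64 << 63)
    } else {
        !bits
    }
}

/// Inverse of `f64_to_u64`.
fn u64_to_f64(val: u64) -> f64 {
    if val & (1u64 << 63) != 0 {
        f64::from_bits(val ^ (1u64 << 63))
    } else {
        f64::from_bits(!val)
    }
}

fn main() {
    // Order is preserved across the mapping, and values round-trip.
    println!("{}", i64_to_u64(-5) < i64_to_u64(3));
    println!("{}", f64_to_u64(-1.5) < f64_to_u64(0.25));
    println!("{}", u64_to_i64(i64_to_u64(-42)));
    println!("{}", u64_to_f64(f64_to_u64(-1.5)));
}
```

With such a mapping, range comparisons on the stored `u64` values give the same answers as comparisons on the original signed or floating-point values.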
# Non-features
- Distributed search is out of the scope of tantivy. That being said, tantivy is meant as a
library upon which one could build a distributed search. Serializable/mergeable collector state for instance,
are within the scope of tantivy.
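To illustrate what "mergeable collector state" can mean, here is a minimal sketch using only the standard library. Each shard (or segment) accumulates a count, a sum and a sum of squares, and the partial states are combined afterwards. The `Stats` type and its methods are hypothetical names chosen for this example and are not part of tantivy's API.

```rust
/// Partial statistics computed independently on one shard or segment.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct Stats {
    count: u64,
    sum: f64,
    squared_sum: f64,
}

impl Stats {
    /// Records one value.
    fn collect(&mut self, value: f64) {
        self.count += 1;
        self.sum += value;
        self.squared_sum += value * value;
    }

    /// Combines two partial states; the result is the state one would
    /// have obtained by collecting all values in a single place.
    fn merge(self, other: Stats) -> Stats {
        Stats {
            count: self.count + other.count,
            sum: self.sum + other.sum,
            squared_sum: self.squared_sum + other.squared_sum,
        }
    }

    /// Mean of the collected values, or `None` if nothing was collected.
    fn mean(&self) -> Option<f64> {
        if self.count == 0 {
            None
        } else {
            Some(self.sum / self.count as f64)
        }
    }

    /// Population standard deviation, computed as sqrt(E[x^2] - E[x]^2).
    fn standard_deviation(&self) -> Option<f64> {
        let mean = self.mean()?;
        let square_mean = self.squared_sum / self.count as f64;
        Some((square_mean - mean * mean).max(0.0).sqrt())
    }
}

fn main() {
    let mut shard_a = Stats::default();
    for &price in &[30_200.0, 29_240.0] {
        shard_a.collect(price);
    }
    let mut shard_b = Stats::default();
    shard_b.collect(21_240.0);

    let total = shard_a.merge(shard_b);
    println!("count: {}", total.count);
    println!("mean: {:?}", total.mean());
    println!("standard deviation: {:?}", total.standard_deviation());
}
```

Because the state is just three numbers, it is also trivial to serialize and ship between nodes, which is the property a distributed layer built on top of the library would rely on.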
# Supported OS and compiler
Tantivy works on stable rust (>= 1.27) and supports Linux, MacOS and Windows.
# Getting started
- [tantivy's usage example](http://fulmicoton.com/tantivy-examples/simple_search.html)
- [tantivy-cli and its tutorial](https://github.com/fulmicoton/tantivy-cli).
- [tantivy's simple search example](https://tantivy-search.github.io/examples/basic_search.html)
- [tantivy-cli and its tutorial](https://github.com/tantivy-search/tantivy-cli).
`tantivy-cli` is an actual command line interface that makes it easy for you to create a search engine,
index documents and search via the CLI or a small server with a REST API.
It will walk you through getting a wikipedia search engine up and running in a few minutes.
- [reference doc](http://fulmicoton.com/tantivy/tantivy/index.html).
- [reference doc for the last released version](https://docs.rs/tantivy/)
# How can I support this project?
# Compiling
There are many ways to support this project.
By default, `tantivy` uses a git submodule called `simdcomp`.
After cloning the repository, you will need to initialize and update
the submodules. The project can then be built using `cargo`.
- Use tantivy and tell us about your experience on [gitter](https://gitter.im/tantivy-search/tantivy) or by email (paul.masurel@gmail.com)
- Report bugs
- Write a blog post
- Help with documentation by asking questions or submitting PRs
- Contribute code (you can join [our gitter](https://gitter.im/tantivy-search/tantivy) )
- Talk about tantivy around you
- Drop a word on [![Say Thanks!](https://img.shields.io/badge/Say%20Thanks-!-1EAEDB.svg)](https://saythanks.io/to/fulmicoton) or even [![Become a patron](https://c5.patreon.com/external/logo/become_a_patron_button.png)](https://www.patreon.com/fulmicoton)
git clone git@github.com:fulmicoton/tantivy.git
git submodule init
git submodule update
# Contributing code
We use the GitHub Pull Request workflow - reference a GitHub ticket and/or include a comprehensive commit message when opening a PR.
## Clone and build locally
Tantivy compiles on stable rust but requires `Rust >= 1.27`.
To check out and run tests, you can simply run :
```bash
git clone https://github.com/tantivy-search/tantivy.git
cd tantivy
cargo build
```
## Run tests
Alternatively, if you are trying to compile `tantivy` without simd compression,
you can disable this functionality. In this case, this submodule is not required
and you can compile tantivy by using the `--no-default-features` flag.
Some tests will not run with just `cargo test` because of `fail-rs`.
To run the tests exhaustively, run `./run-tests.sh`
cargo build --no-default-features
## Debug
You might find it useful to step through the programme with a debugger.
# Contribute
### A failing test
Send me an email (paul.masurel at gmail.com) if you want to contribute to tantivy.
Make sure you haven't run `cargo clean` after the most recent `cargo test` or `cargo build`, so that the `target/` dir exists. Use this bash script to find the name of the most recent debug build of tantivy and run it under rust-gdb.
```bash
find target/debug/ -maxdepth 1 -executable -type f -name "tantivy*" -printf '%TY-%Tm-%Td %TT %p\n' | sort -r | cut -d " " -f 3 | xargs -I RECENT_DBG_TANTIVY rust-gdb RECENT_DBG_TANTIVY
```
Now that you are in rust-gdb, you can set breakpoints on lines and methods that match your source code and run the debug executable with the flags that you normally pass to `cargo test`, like this:
```bash
$gdb run --test-threads 1 --test $NAME_OF_TEST
```
### An example
By default, rustc compiles everything in the `examples/` dir in debug mode. This makes it easy for you to make examples to reproduce bugs.
```bash
rust-gdb target/debug/examples/$EXAMPLE_NAME
$ gdb run
```

22
appveyor.yml Normal file

@@ -0,0 +1,22 @@
# Appveyor configuration template for Rust using rustup for Rust installation
# https://github.com/starkat99/appveyor-rust
os: Visual Studio 2015
environment:
matrix:
- channel: stable
target: x86_64-pc-windows-msvc
install:
- appveyor DownloadFile https://win.rustup.rs/ -FileName rustup-init.exe
- rustup-init -yv --default-toolchain %channel% --default-host %target%
- set PATH=%PATH%;%USERPROFILE%\.cargo\bin
- if defined msys_bits set PATH=%PATH%;C:\msys64\mingw%msys_bits%\bin
- rustc -vV
- cargo -vV
build: false
test_script:
- REM SET RUST_LOG=tantivy,test & cargo test --verbose --no-default-features --features mmap
- REM SET RUST_BACKTRACE=1 & cargo build --examples


@@ -1,40 +0,0 @@
#[cfg(feature= "simdcompression")]
mod build {
extern crate gcc;
use std::process::Command;
pub fn build() {
Command::new("make")
.current_dir("cpp/simdcomp")
.output()
.unwrap_or_else(|e| { panic!("Failed to make simdcomp: {}", e) });
gcc::Config::new()
.cpp(true)
.flag("-std=c++11")
.flag("-O3")
.flag("-mssse3")
.include("./cpp/simdcomp/include")
.object("cpp/simdcomp/avxbitpacking.o")
.object("cpp/simdcomp/simdintegratedbitpacking.o")
.object("cpp/simdcomp/simdbitpacking.o")
.object("cpp/simdcomp/simdpackedsearch.o")
.object("cpp/simdcomp/simdcomputil.o")
.object("cpp/simdcomp/simdpackedselect.o")
.object("cpp/simdcomp/simdfor.o")
.file("cpp/simdcomp_wrapper.cpp")
.compile("libsimdcomp.a");
println!("cargo:rustc-flags=-l dylib=stdc++");
}
}
#[cfg(not(feature= "simdcompression"))]
mod build {
pub fn build() {
}
}
fn main() {
build::build();
}

23
ci/before_deploy.ps1 Normal file

@@ -0,0 +1,23 @@
# This script takes care of packaging the build artifacts that will go in the
# release zipfile
$SRC_DIR = $PWD.Path
$STAGE = [System.Guid]::NewGuid().ToString()
Set-Location $ENV:Temp
New-Item -Type Directory -Name $STAGE
Set-Location $STAGE
$ZIP = "$SRC_DIR\$($Env:CRATE_NAME)-$($Env:APPVEYOR_REPO_TAG_NAME)-$($Env:TARGET).zip"
# TODO Update this to package the right artifacts
Copy-Item "$SRC_DIR\target\$($Env:TARGET)\release\hello.exe" '.\'
7z a "$ZIP" *
Push-AppveyorArtifact "$ZIP"
Remove-Item *.* -Force
Set-Location ..
Remove-Item $STAGE
Set-Location $SRC_DIR

33
ci/before_deploy.sh Normal file

@@ -0,0 +1,33 @@
# This script takes care of building your crate and packaging it for release
set -ex
main() {
local src=$(pwd) \
stage=
case $TRAVIS_OS_NAME in
linux)
stage=$(mktemp -d)
;;
osx)
stage=$(mktemp -d -t tmp)
;;
esac
test -f Cargo.lock || cargo generate-lockfile
# TODO Update this to build the artifacts that matter to you
cross rustc --bin hello --target $TARGET --release -- -C lto
# TODO Update this to package the right artifacts
cp target/$TARGET/release/hello $stage/
cd $stage
tar czf $src/$CRATE_NAME-$TRAVIS_TAG-$TARGET.tar.gz *
cd $src
rm -rf $stage
}
main

47
ci/install.sh Normal file

@@ -0,0 +1,47 @@
set -ex
main() {
local target=
if [ $TRAVIS_OS_NAME = linux ]; then
target=x86_64-unknown-linux-musl
sort=sort
else
target=x86_64-apple-darwin
sort=gsort # for `sort --sort-version`, from brew's coreutils.
fi
# Builds for iOS are done on OSX, but require the specific target to be
# installed.
case $TARGET in
aarch64-apple-ios)
rustup target install aarch64-apple-ios
;;
armv7-apple-ios)
rustup target install armv7-apple-ios
;;
armv7s-apple-ios)
rustup target install armv7s-apple-ios
;;
i386-apple-ios)
rustup target install i386-apple-ios
;;
x86_64-apple-ios)
rustup target install x86_64-apple-ios
;;
esac
# This fetches latest stable release
local tag=$(git ls-remote --tags --refs --exit-code https://github.com/japaric/cross \
| cut -d/ -f3 \
| grep -E '^v[0.1.0-9.]+$' \
| $sort --version-sort \
| tail -n1)
curl -LSfs https://japaric.github.io/trust/install.sh | \
sh -s -- \
--force \
--git japaric/cross \
--tag $tag \
--target $target
}
main

29
ci/script.sh Normal file

@@ -0,0 +1,29 @@
#!/usr/bin/env bash
# This script takes care of testing your crate
set -ex
main() {
if [ ! -z $CODECOV ]; then
echo "Codecov"
cargo build --verbose && cargo coverage --verbose && bash <(curl -s https://codecov.io/bash) -s target/kcov
else
echo "Build"
cross build --target $TARGET
if [ ! -z $DISABLE_TESTS ]; then
return
fi
echo "Test"
cross test --target $TARGET --no-default-features --features mmap -- --test-threads 1
fi
for example in $(ls examples/*.rs)
do
cargo run --example $(basename $example .rs)
done
}
# we don't run the "test phase" when doing deploys
if [ -z $TRAVIS_TAG ]; then
main
fi

1
cpp/simdcomp vendored

Submodule cpp/simdcomp deleted from 0dca28668f


@@ -1,48 +0,0 @@
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include "simdcomp.h"
#include "simdcomputil.h"
extern "C" {
// assumes datain has a size of 128 uint32
// and that buffer is large enough to host the data.
size_t compress_sorted_cpp(
const uint32_t* datain,
uint8_t* output,
const uint32_t offset) {
const uint32_t b = simdmaxbitsd1(offset, datain);
*output++ = b;
simdpackwithoutmaskd1(offset, datain, (__m128i *) output, b);
return 1 + b * sizeof(__m128i);;
}
// assumes datain has a size of 128 uint32
// and that buffer is large enough to host the data.
size_t uncompress_sorted_cpp(
const uint8_t* compressed_data,
uint32_t* output,
uint32_t offset) {
const uint32_t b = *compressed_data++;
simdunpackd1(offset, (__m128i *)compressed_data, output, b);
return 1 + b * sizeof(__m128i);
}
size_t compress_unsorted_cpp(
const uint32_t* datain,
uint8_t* output) {
const uint32_t b = maxbits(datain);
*output++ = b;
simdpackwithoutmask(datain, (__m128i *) output, b);
return 1 + b * sizeof(__m128i);;
}
size_t uncompress_unsorted_cpp(
const uint8_t* compressed_data,
uint32_t* output) {
const uint32_t b = *compressed_data++;
simdunpack((__m128i *)compressed_data, output, b);
return 1 + b * sizeof(__m128i);
}
}

1
doc/.gitignore vendored Normal file

@@ -0,0 +1 @@
book

5
doc/book.toml Normal file

@@ -0,0 +1,5 @@
[book]
authors = ["Paul Masurel"]
multilingual = false
src = "src"
title = "Tantivy, the user guide"

15
doc/src/SUMMARY.md Normal file

@@ -0,0 +1,15 @@
# Summary
[Avant Propos](./avant-propos.md)
- [Segments](./basis.md)
- [Defining your schema](./schema.md)
- [Facetting](./facetting.md)
- [Innerworkings](./innerworkings.md)
- [Inverted index](./inverted_index.md)
- [Best practise](./inverted_index.md)
[Frequently Asked Questions](./faq.md)
[Examples](./examples.md)

34
doc/src/avant-propos.md Normal file

@@ -0,0 +1,34 @@
# Foreword, what is the scope of tantivy?
> Tantivy is a **search** engine **library** for Rust.
If you are familiar with Lucene, it's an excellent approximation to consider tantivy as Lucene for rust. tantivy is heavily inspired by Lucene's design and
they both have the same scope and targeted use cases.
If you are not familiar with Lucene, let's break down our little tagline.
- **Search** here means full-text search : fundamentally, tantivy is here to help you
identify efficiently what are the documents matching a given query in your corpus.
But modern search UI are so much more : text processing, facetting, autocomplete, fuzzy search, good
relevancy, collapsing, highlighting, spatial search.
While some of these features are not available in tantivy yet, all of these are relevant
feature requests. Tantivy's objective is to offer a solid toolbox to create the best search
experience. But keep in mind this is just a toolbox.
Which brings us to the second keyword...
- **Library** means that you will have to write code. tantivy is not an *all-in-one* server solution like elastic search for instance.
Sometimes a functionality will not be available in tantivy because it is too
specific to your use case. By design, tantivy should make it possible to extend
the available set of features using the existing rock-solid datastructures.
Most frequently this will mean writing your own `Collector`, your own `Scorer` or your own
`TokenFilter`... Some of your requirements may also be related to
something closer to architecture or operations. For instance, you may
want to build a large corpus on Hadoop, fine-tune the merge policy to keep your
index sharded in a time-wise fashion, or you may want to convert an existing
index from a different format.
Tantivy exposes a lot of low level API to do all of these things.

77
doc/src/basis.md Normal file

@@ -0,0 +1,77 @@
# Anatomy of an index
## Straight from disk
Tantivy accesses its data using an abstracting trait called `Directory`.
In theory, one can come and override the data access logic. In practise, the
trait somewhat assumes that your data can be mapped to memory, and tantivy
seems deeply married to using `mmap` for its io [^1], and the only persisting
directory shipped with tantivy is the `MmapDirectory`.
While this design has some downsides, this greatly simplifies the source code of
tantivy. Caching is also entirely delegated to the OS.
`tantivy` works entirely (or almost) by directly reading the datastructures as they are laid out on disk. As a result, opening an index does not involve loading different datastructures from the disk into random access memory: starting a process, opening an index, and performing your first query can typically be done in a matter of milliseconds.
This is an interesting property for a command line search engine, or for some multi-tenant log search engine: spawning a new process for each new query can be a perfectly sensible solution in some use cases.
In later chapters, we will discuss tantivy's inverted index data layout.
One key take away is that to achieve great performance, search indexes are extremely compact.
Of course this is crucial to reduce IO, and to ensure that as much of our index as possible can sit in RAM.
Also, whenever possible its data is accessed sequentially. Of course, this is an amazing property when tantivy needs to access the data from your spinning hard disk, but this is also
critical for performance if your data is read from an `SSD` or is even already in your page cache.
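To give a feel for why a compact, sequentially read layout is achievable, here is a small sketch of delta encoding followed by a bit-width computation on a sorted list of document ids. It is only an illustration: the function names are hypothetical, and the "one header byte plus packed bits" size formula is an assumption made for this example rather than a description of tantivy's actual on-disk format.

```rust
/// Number of bits needed to represent `val` (0 needs 0 bits).
fn bits_needed(val: u32) -> u8 {
    (32 - val.leading_zeros()) as u8
}

/// Replaces each doc id by its distance to the previous one,
/// starting from `offset`. The input must be sorted.
fn delta_encode(sorted_docs: &[u32], offset: u32) -> Vec<u32> {
    let mut prev = offset;
    sorted_docs
        .iter()
        .map(|&doc| {
            let delta = doc - prev;
            prev = doc;
            delta
        })
        .collect()
}

/// Rebuilds the original doc ids from their deltas.
fn delta_decode(deltas: &[u32], offset: u32) -> Vec<u32> {
    let mut prev = offset;
    deltas
        .iter()
        .map(|&delta| {
            prev += delta;
            prev
        })
        .collect()
}

/// Size in bytes of a bit-packed block: one byte storing the bit width,
/// followed by `len * width` bits rounded up to a whole byte.
fn packed_size(deltas: &[u32]) -> usize {
    let width = deltas.iter().map(|&d| bits_needed(d)).max().unwrap_or(0) as usize;
    1 + (deltas.len() * width + 7) / 8
}

fn main() {
    let docs = vec![3u32, 7, 8, 20];
    let deltas = delta_encode(&docs, 0);
    println!("deltas: {:?}", deltas);
    println!("raw size: {} bytes", docs.len() * 4);
    println!("packed size: {} bytes", packed_size(&deltas));
    println!("decoded: {:?}", delta_decode(&deltas, 0));
}
```

Small gaps between consecutive ids need few bits each, and decoding walks the block from start to end, which is exactly the sequential access pattern mentioned above.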
## Segments, and the log method
That kind of compact layout comes at one cost: it prevents our datastructures from being dynamic.
In fact, the `Directory` trait does not even allow you to modify part of a file.
To allow the addition / deletion of documents, and create the illusion that
your index is dynamic (i.e.: adding and deleting documents), tantivy uses a common database trick sometimes referred to as the *log method*.
Let's forget about deletes for a moment.
As you add documents, these documents are processed and stored in a dedicated datastructure, in a `RAM` buffer. This datastructure is not ready for search, but it is useful to receive your data and rearrange it very rapidly.
As you add documents, this buffer will reach its capacity and tantivy will transparently stop adding documents to it and start converting this datastructure to its final read-only format on disk. Once written, a brand new empty buffer is available to resume adding documents.
The resulting chunk of index obtained after this serialization is called a `Segment`.
> A segment is a self-contained atomic piece of index. It is identified with a UUID, and all of its files are identified using the naming scheme : `<UUID>.*`.
Which brings us to the nature of a tantivy `Index`.
> A tantivy `Index` is a collection of `Segments`.
Physically, this really just means an index is a bunch of segment files in a given `Directory`,
linked together by a `meta.json` file. This transparency can become extremely handy
to get tantivy to fit your use case:
*Example 1* You could for instance use hadoop to build a very large search index in a timely manner, copy all of the resulting segment files in the same directory and edit the `meta.json` to get a functional index.[^2]
*Example 2* You could also disable your merge policy and enforce daily segments. Removing data after one week can then be done very efficiently by just editing the `meta.json` and deleting the files associated to segment `D-7`.
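The log method described above can be sketched with a toy index written against the standard library only. Documents accumulate in a RAM buffer; when the buffer is full it is converted into an immutable segment, and only segments are searched. All names here (`ToyIndex`, `Segment`, `buffer_capacity`) are hypothetical, the buffer is counted in documents rather than bytes, and doc ids are global for simplicity. It is a simplified model, not tantivy's implementation.

```rust
use std::collections::BTreeMap;

/// An immutable chunk of index: each term maps to a sorted list of doc ids.
struct Segment {
    postings: BTreeMap<String, Vec<u32>>,
}

/// A toy index following the log method.
struct ToyIndex {
    /// Documents received but not yet searchable.
    buffer: Vec<(u32, String)>,
    /// Number of buffered documents that triggers a flush.
    buffer_capacity: usize,
    /// Read-only segments produced by previous flushes.
    segments: Vec<Segment>,
    next_doc_id: u32,
}

impl ToyIndex {
    fn new(buffer_capacity: usize) -> ToyIndex {
        ToyIndex {
            buffer: Vec::new(),
            buffer_capacity,
            segments: Vec::new(),
            next_doc_id: 0,
        }
    }

    /// Buffers a document, flushing automatically once the buffer is full.
    fn add_document(&mut self, text: &str) {
        self.buffer.push((self.next_doc_id, text.to_lowercase()));
        self.next_doc_id += 1;
        if self.buffer.len() >= self.buffer_capacity {
            self.flush();
        }
    }

    /// Converts the buffer into a new immutable segment and empties it.
    fn flush(&mut self) {
        if self.buffer.is_empty() {
            return;
        }
        let mut postings: BTreeMap<String, Vec<u32>> = BTreeMap::new();
        for (doc_id, text) in self.buffer.drain(..) {
            for token in text.split_whitespace() {
                let docs = postings.entry(token.to_string()).or_insert_with(Vec::new);
                // Docs arrive in increasing id order, so checking the last
                // entry is enough to avoid duplicates.
                if docs.last() != Some(&doc_id) {
                    docs.push(doc_id);
                }
            }
        }
        self.segments.push(Segment { postings });
    }

    /// Looks the term up in every segment; buffered documents are invisible.
    fn search(&self, term: &str) -> Vec<u32> {
        let mut hits = Vec::new();
        for segment in &self.segments {
            if let Some(docs) = segment.postings.get(term) {
                hits.extend_from_slice(docs);
            }
        }
        hits
    }
}

fn main() {
    let mut index = ToyIndex::new(2);
    index.add_document("The Old Man and the Sea");
    index.add_document("Of Mice and Men"); // buffer full: first segment written
    index.add_document("Sea Whale"); // still buffered, not searchable yet
    println!("{:?}", index.search("sea")); // prints [0]
    index.flush();
    println!("{:?}", index.search("sea")); // prints [0, 2]
}
```

Note how a search performs one term lookup per segment, which is the cost that merging aims to reduce.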
# Merging
As you index more and more data, your index will accumulate more and more segments.
Having a lot of small segments is not really optimal. There is a bit of redundancy in having
all these term dictionaries. Also when searching, we will need to do term lookups as many times as we have segments. It can hurt search performance a bit.
That's where merging or compacting comes into place. Tantivy will continuously consider merge
opportunities and start merging segments in the background.
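As a rough illustration of what a merge achieves, the sketch below combines several toy segments into one, so that a term is looked up in a single dictionary afterwards. The `Segment` alias and `merge_segments` function are hypothetical names for this example; a real merge also has to handle deletes, stored documents and doc id remapping, which are left out here.

```rust
use std::collections::BTreeMap;

/// A toy segment: its term dictionary maps each term to sorted doc ids.
type Segment = BTreeMap<String, Vec<u32>>;

/// Merges several segments into one. Terms shared by several segments end
/// up with a single dictionary entry and one combined, sorted posting list.
fn merge_segments(segments: &[Segment]) -> Segment {
    let mut merged = Segment::new();
    for segment in segments {
        for (term, docs) in segment {
            merged
                .entry(term.clone())
                .or_insert_with(Vec::new)
                .extend_from_slice(docs);
        }
    }
    for docs in merged.values_mut() {
        docs.sort_unstable();
        docs.dedup();
    }
    merged
}

fn main() {
    let mut first = Segment::new();
    first.insert("sea".to_string(), vec![0]);
    first.insert("and".to_string(), vec![0, 1]);

    let mut second = Segment::new();
    second.insert("sea".to_string(), vec![2]);
    second.insert("whale".to_string(), vec![2]);

    let merged = merge_segments(&[first, second]);
    println!("{:?}", merged.get("sea")); // prints Some([0, 2])
    println!("{} distinct terms", merged.len()); // prints 3 distinct terms
}
```

Before the merge, the term `sea` is stored twice and requires two lookups; after it, there is one entry and one lookup.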
# Indexing throughput, number of indexing threads
[^1]: This may eventually change.
[^2]: Be careful however. By default these files will not be considered as *managed* by tantivy. This means they will never be garbage collected by tantivy, regardless of whether they become obsolete or not.


3
doc/src/examples.md Normal file

@@ -0,0 +1,3 @@
# Examples
- [Basic search](/examples/basic_search.html)

5
doc/src/facetting.md Normal file

@@ -0,0 +1,5 @@
# Facetting
wewew
## weeewe

0
doc/src/faq.md Normal file

1
doc/src/innerworkings.md Normal file

@@ -0,0 +1 @@
# Innerworkings


@@ -0,0 +1 @@
# Inverted index

1
doc/src/schema.md Normal file

@@ -0,0 +1 @@
# Defining your schema

241
examples/basic_search.rs Normal file

@@ -0,0 +1,241 @@
// # Basic Example
//
// This example covers the basic functionalities of
// tantivy.
//
// We will :
// - define our schema
// - create an index in a directory
// - index a few documents in our index
// - search for the best documents matching "sea whale"
// - retrieve the best document's original content.
// ---
// Importing tantivy...
#[macro_use]
extern crate tantivy;
use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::*;
use tantivy::Index;
use tantivy::ReloadPolicy;
use tempdir::TempDir;
fn main() -> tantivy::Result<()> {
// Let's create a temporary directory for the
// sake of this example
let index_path = TempDir::new("tantivy_example_dir")?;
// # Defining the schema
//
// The Tantivy index requires a very strict schema.
// The schema declares which fields are in the index,
// and for each field, its type and "the way it should
// be indexed".
// first we need to define a schema ...
let mut schema_builder = Schema::builder();
// Our first field is title.
// We want full-text search for it, and we also want
// to be able to retrieve the document after the search.
//
// `TEXT | STORED` is some syntactic sugar to describe
// that.
//
// `TEXT` means the field should be tokenized and indexed,
// along with its term frequency and term positions.
//
// `STORED` means that the field will also be saved
// in a compressed, row-oriented key-value store.
// This store is useful to reconstruct the
// documents that were selected during the search phase.
schema_builder.add_text_field("title", TEXT | STORED);
// Our second field is body.
// We want full-text search for it, but we do not
// need to be able to retrieve it
// for our application.
//
// We can make our index lighter
// by omitting the `STORED` flag.
schema_builder.add_text_field("body", TEXT);
let schema = schema_builder.build();
// # Indexing documents
//
// Let's create a brand new index.
//
// This will actually just save a meta.json
// with our schema in the directory.
let index = Index::create_in_dir(&index_path, schema.clone())?;
// To insert document we need an index writer.
// There must be only one writer at a time.
// This single `IndexWriter` is already
// multithreaded.
//
// Here we give tantivy a budget of `50MB`.
// Using a bigger heap for the indexer may increase
// throughput, but 50 MB is already plenty.
let mut index_writer = index.writer(50_000_000)?;
// Let's index our documents!
// We first need a handle on the title and the body field.
// ### Adding documents
//
// We can create a document manually, by setting the fields
// one by one in a Document object.
let title = schema.get_field("title").unwrap();
let body = schema.get_field("body").unwrap();
let mut old_man_doc = Document::default();
old_man_doc.add_text(title, "The Old Man and the Sea");
old_man_doc.add_text(
body,
"He was an old man who fished alone in a skiff in the Gulf Stream and \
he had gone eighty-four days now without taking a fish.",
);
// ... and add it to the `IndexWriter`.
index_writer.add_document(old_man_doc);
// For convenience, tantivy also comes with a macro to
// reduce the boilerplate above.
index_writer.add_document(doc!(
title => "Of Mice and Men",
body => "A few miles south of Soledad, the Salinas River drops in close to the hillside \
bank and runs deep and green. The water is warm too, for it has slipped twinkling \
over the yellow sands in the sunlight before reaching the narrow pool. On one \
side of the river the golden foothill slopes curve up to the strong and rocky \
Gabilan Mountains, but on the valley side the water is lined with trees—willows \
fresh and green with every spring, carrying in their lower leaf junctures the \
debris of the winters flooding; and sycamores with mottled, white, recumbent \
limbs and branches that arch over the pool"
));
index_writer.add_document(doc!(
title => "Of Mice and Men",
body => "A few miles south of Soledad, the Salinas River drops in close to the hillside \
bank and runs deep and green. The water is warm too, for it has slipped twinkling \
over the yellow sands in the sunlight before reaching the narrow pool. On one \
side of the river the golden foothill slopes curve up to the strong and rocky \
Gabilan Mountains, but on the valley side the water is lined with trees—willows \
fresh and green with every spring, carrying in their lower leaf junctures the \
debris of the winters flooding; and sycamores with mottled, white, recumbent \
limbs and branches that arch over the pool"
));
// Multivalued field just need to be repeated.
index_writer.add_document(doc!(
title => "Frankenstein",
title => "The Modern Prometheus",
body => "You will rejoice to hear that no disaster has accompanied the commencement of an \
enterprise which you have regarded with such evil forebodings. I arrived here \
yesterday, and my first task is to assure my dear sister of my welfare and \
increasing confidence in the success of my undertaking."
));
// This is an example, so we will only index 3 documents
// here. You can check out tantivy's tutorial to index
// the English wikipedia. Tantivy's indexing is rather fast.
// Indexing 5 million articles of the English wikipedia takes
// around 3 minutes on my computer!
// ### Committing
//
// At this point our documents are not searchable.
//
//
// We need to call .commit() explicitly to force the
// index_writer to finish processing the documents in the queue,
// flush the current index to the disk, and advertise
// the existence of new documents.
//
// This call is blocking.
index_writer.commit()?;
// If `.commit()` returns correctly, then all of the
// documents that have been added are guaranteed to be
// persistently indexed.
//
// In the scenario of a crash or a power failure,
// tantivy behaves as if it has rolled back to its last
// commit.
// # Searching
//
// ### Searcher
//
// A reader is required to search the index.
// It acts as a `Searcher` pool that reloads itself,
// depending on a `ReloadPolicy`.
//
// For a search server you will typically create one reader for the entire lifetime of your
// program, and acquire a new searcher for every single request.
//
// In the code below, we rely on the 'ON_COMMIT' policy: the reader
// will reload the index automatically after each commit.
let reader = index
.reader_builder()
.reload_policy(ReloadPolicy::OnCommit)
.try_into()?;
// We now need to acquire a searcher.
//
// A searcher points to a snapshotted, immutable version of the index.
//
// Some search experience might require more than
// one query. Using the same searcher ensures that all of these queries will run on the
// same version of the index.
//
// Acquiring a `searcher` is very cheap.
//
// You should acquire a searcher every time you start processing a request
// and release it right after your query is finished.
let searcher = reader.searcher();
// ### Query
// The query parser can interpret human queries.
// Here, if the user does not specify which
// field they want to search, tantivy will search
// in both title and body.
let query_parser = QueryParser::for_index(&index, vec![title, body]);
// QueryParser may fail if the query is not in the right
// format. For user facing applications, this can be a problem.
// A ticket has been opened regarding this problem.
let query = query_parser.parse_query("sea whale")?;
// A query defines a set of documents, as
// well as the way they should be scored.
//
// A query created by the query parser is scored according
// to a metric called Tf-Idf, and will consider
// any document matching at least one of our terms.
// ### Collectors
//
// We are not interested in all of the documents but
// only in the top 10. Keeping track of our top 10 best documents
// is the role of the TopDocs.
// We can now perform our query.
let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
// The actual documents still need to be
// retrieved from Tantivy's store.
//
// Since the body field was not configured as stored,
// the document returned will only contain
// a title.
for (_score, doc_address) in top_docs {
let retrieved_doc = searcher.doc(doc_address)?;
println!("{}", schema.to_json(&retrieved_doc));
}
Ok(())
}


@@ -0,0 +1,194 @@
// # Custom collector example
//
// This example shows how you can implement your own
// collector. As an example, we will compute a collector
// that computes the standard deviation of a given fast field.
//
// Of course, you can have a look at the tantivy's built-in collectors
// such as the `CountCollector` for more examples.
// ---
// Importing tantivy...
#[macro_use]
extern crate tantivy;
use tantivy::collector::{Collector, SegmentCollector};
use tantivy::fastfield::FastFieldReader;
use tantivy::query::QueryParser;
use tantivy::schema::Field;
use tantivy::schema::{Schema, FAST, INDEXED, TEXT};
use tantivy::SegmentReader;
use tantivy::{Index, TantivyError};
#[derive(Default)]
struct Stats {
count: usize,
sum: f64,
squared_sum: f64,
}
impl Stats {
pub fn count(&self) -> usize {
self.count
}
pub fn mean(&self) -> f64 {
self.sum / (self.count as f64)
}
fn square_mean(&self) -> f64 {
self.squared_sum / (self.count as f64)
}
pub fn standard_deviation(&self) -> f64 {
let mean = self.mean();
(self.square_mean() - mean * mean).sqrt()
}
fn non_zero_count(self) -> Option<Stats> {
if self.count == 0 {
None
} else {
Some(self)
}
}
}
struct StatsCollector {
field: Field,
}
impl StatsCollector {
fn with_field(field: Field) -> StatsCollector {
StatsCollector { field }
}
}
impl Collector for StatsCollector {
// That's the type of our result.
// Our standard deviation will be a float.
type Fruit = Option<Stats>;
type Child = StatsSegmentCollector;
fn for_segment(
&self,
_segment_local_id: u32,
segment_reader: &SegmentReader,
) -> tantivy::Result<StatsSegmentCollector> {
let fast_field_reader = segment_reader
.fast_fields()
.u64(self.field)
.ok_or_else(|| {
let field_name = segment_reader.schema().get_field_name(self.field);
TantivyError::SchemaError(format!(
"Field {:?} is not a u64 fast field.",
field_name
))
})?;
Ok(StatsSegmentCollector {
fast_field_reader,
stats: Stats::default(),
})
}
fn requires_scoring(&self) -> bool {
// this collector does not care about score.
false
}
fn merge_fruits(&self, segment_stats: Vec<Option<Stats>>) -> tantivy::Result<Option<Stats>> {
let mut stats = Stats::default();
for segment_stats_opt in segment_stats {
if let Some(segment_stats) = segment_stats_opt {
stats.count += segment_stats.count;
stats.sum += segment_stats.sum;
stats.squared_sum += segment_stats.squared_sum;
}
}
Ok(stats.non_zero_count())
}
}
struct StatsSegmentCollector {
fast_field_reader: FastFieldReader<u64>,
stats: Stats,
}
impl SegmentCollector for StatsSegmentCollector {
type Fruit = Option<Stats>;
fn collect(&mut self, doc: u32, _score: f32) {
let value = self.fast_field_reader.get(doc) as f64;
self.stats.count += 1;
self.stats.sum += value;
self.stats.squared_sum += value * value;
}
fn harvest(self) -> <Self as SegmentCollector>::Fruit {
self.stats.non_zero_count()
}
}
fn main() -> tantivy::Result<()> {
// # Defining the schema
//
// The Tantivy index requires a very strict schema.
// The schema declares which fields are in the index,
// and for each field, its type and "the way it should
// be indexed".
// first we need to define a schema ...
let mut schema_builder = Schema::builder();
// We'll assume a fictional index containing
// products, and with a name, a description, and a price.
let product_name = schema_builder.add_text_field("name", TEXT);
let product_description = schema_builder.add_text_field("description", TEXT);
let price = schema_builder.add_u64_field("price", INDEXED | FAST);
let schema = schema_builder.build();
// # Indexing documents
//
// Lets index a bunch of fake documents for the sake of
// this example.
let index = Index::create_in_ram(schema.clone());
let mut index_writer = index.writer(50_000_000)?;
index_writer.add_document(doc!(
product_name => "Super Broom 2000",
product_description => "While it is ok for short distance travel, this broom \
was designed quiditch. It will up your game.",
price => 30_200u64
));
index_writer.add_document(doc!(
product_name => "Turbulobroom",
product_description => "You might have heard of this broom before : it is the sponsor of the Wales team.\
You'll enjoy its sharp turns, and rapid acceleration",
price => 29_240u64
));
index_writer.add_document(doc!(
product_name => "Broomio",
product_description => "Great value for the price. This broom is a market favorite",
price => 21_240u64
));
index_writer.add_document(doc!(
product_name => "Whack a Mole",
product_description => "Prime quality bat.",
price => 5_200u64
));
index_writer.commit()?;
let reader = index.reader()?;
let searcher = reader.searcher();
let query_parser = QueryParser::for_index(&index, vec![product_name, product_description]);
// here we want to match every product that mentions a broom
let query = query_parser.parse_query("broom")?;
if let Some(stats) = searcher.search(&query, &StatsCollector::with_field(price))? {
println!("count: {}", stats.count());
println!("mean: {}", stats.mean());
println!("standard deviation: {}", stats.standard_deviation());
}
Ok(())
}


@@ -0,0 +1,115 @@
// # Defining a tokenizer pipeline
//
// In this example, we'll see how to define a tokenizer pipeline
// by aligning a bunch of `TokenFilter`.
#[macro_use]
extern crate tantivy;
use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::*;
use tantivy::tokenizer::NgramTokenizer;
use tantivy::Index;
fn main() -> tantivy::Result<()> {
// # Defining the schema
//
// The Tantivy index requires a very strict schema.
// The schema declares which fields are in the index,
// and for each field, its type and "the way it should
// be indexed".
// first we need to define a schema ...
let mut schema_builder = Schema::builder();
// Our first field is title.
// In this example we want to use n-gram searching.
// We set the n-gram size to 3 characters, so any three
// consecutive characters in the title should be findable.
let text_field_indexing = TextFieldIndexing::default()
.set_tokenizer("ngram3")
.set_index_option(IndexRecordOption::WithFreqsAndPositions);
let text_options = TextOptions::default()
.set_indexing_options(text_field_indexing)
.set_stored();
let title = schema_builder.add_text_field("title", text_options);
// Our second field is body.
// We want full-text search for it, but we do not
// need to be able to retrieve it
// for our application.
//
// We can make our index lighter
// by omitting the `STORED` flag.
let body = schema_builder.add_text_field("body", TEXT);
let schema = schema_builder.build();
// # Indexing documents
//
// Let's create a brand new index.
// To simplify we will work entirely in RAM.
// This is not what you want in reality, but it is very useful
// for your unit tests... Or this example.
let index = Index::create_in_ram(schema.clone());
// Here we register our custom tokenizer.
// It will emit tokens of 3 characters each.
index
.tokenizers()
.register("ngram3", NgramTokenizer::new(3, 3, false));
// To insert a document we need an index writer.
// There must be only one writer at a time.
// This single `IndexWriter` is already
// multithreaded.
//
// Here we use an overall buffer of 50MB, which is split
// between the indexing threads. Using a bigger
// heap for the indexer can increase its throughput.
let mut index_writer = index.writer(50_000_000)?;
index_writer.add_document(doc!(
title => "The Old Man and the Sea",
body => "He was an old man who fished alone in a skiff in the Gulf Stream and \
he had gone eighty-four days now without taking a fish."
));
index_writer.add_document(doc!(
title => "Of Mice and Men",
body => r#"A few miles south of Soledad, the Salinas River drops in close to the hillside
bank and runs deep and green. The water is warm too, for it has slipped twinkling
over the yellow sands in the sunlight before reaching the narrow pool. On one
side of the river the golden foothill slopes curve up to the strong and rocky
Gabilan Mountains, but on the valley side the water is lined with trees—willows
fresh and green with every spring, carrying in their lower leaf junctures the
debris of the winters flooding; and sycamores with mottled, white, recumbent
limbs and branches that arch over the pool"#
));
index_writer.add_document(doc!(
title => "Frankenstein",
body => r#"You will rejoice to hear that no disaster has accompanied the commencement of an
enterprise which you have regarded with such evil forebodings. I arrived here
yesterday, and my first task is to assure my dear sister of my welfare and
increasing confidence in the success of my undertaking."#
));
index_writer.commit()?;
let reader = index.reader()?;
let searcher = reader.searcher();
// The query parser can interpret human queries.
// Here, if the user does not specify which
// field they want to search, tantivy will search
// in both title and body.
let query_parser = QueryParser::for_index(&index, vec![title, body]);
// here we want to get a hit on the 'ken' in Frankenstein
let query = query_parser.parse_query("ken")?;
let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
for (_, doc_address) in top_docs {
let retrieved_doc = searcher.doc(doc_address)?;
println!("{}", schema.to_json(&retrieved_doc));
}
Ok(())
}
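
To see why the query "ken" finds "Frankenstein", it helps to list the tokens a 3-character n-gram tokenizer would produce. The sketch below is a simplified stand-in written with the standard library only; `char_ngrams` is a hypothetical helper, not tantivy's `NgramTokenizer`, and it ignores token offsets and positions.

```rust
/// Returns every window of `n` consecutive characters in `text`.
/// Hypothetical helper for illustration only.
fn char_ngrams(text: &str, n: usize) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    // `windows(0)` would panic, and short inputs have no full window.
    if n == 0 || chars.len() < n {
        return Vec::new();
    }
    chars.windows(n).map(|w| w.iter().collect()).collect()
}

fn main() {
    let grams = char_ngrams("Frankenstein", 3);
    println!("{:?}", grams);
    // The trigram "ken" is one of the tokens, so a query for "ken"
    // can match the title.
    assert!(grams.iter().any(|g| g == "ken"));
}
```

Note that no lowercasing happens here, just as the example registers the tokenizer without any additional filter, so matching on these tokens is case-sensitive.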

View File

@@ -0,0 +1,146 @@
// # Deleting and Updating (?) documents
//
// This example explains how to delete and update documents.
// In fact there is actually no such thing as an update in tantivy.
//
// To update a document, you need to delete a document and then reinsert
// its new version.
//
// ---
// Importing tantivy...
#[macro_use]
extern crate tantivy;
use tantivy::collector::TopDocs;
use tantivy::query::TermQuery;
use tantivy::schema::*;
use tantivy::Index;
use tantivy::IndexReader;
// A simple helper function to fetch a single document
// given its id from our index.
// It will be helpful to check our work.
fn extract_doc_given_isbn(
reader: &IndexReader,
isbn_term: &Term,
) -> tantivy::Result<Option<Document>> {
let searcher = reader.searcher();
// This is the simplest query you can think of.
// It matches all of the documents containing a specific term.
//
// The second argument tells tantivy that we don't care about decoding positions
// or term frequencies.
let term_query = TermQuery::new(isbn_term.clone(), IndexRecordOption::Basic);
let top_docs = searcher.search(&term_query, &TopDocs::with_limit(1))?;
if let Some((_score, doc_address)) = top_docs.first() {
let doc = searcher.doc(*doc_address)?;
Ok(Some(doc))
} else {
// no doc matching this ID.
Ok(None)
}
}
fn main() -> tantivy::Result<()> {
// # Defining the schema
//
// Check out the *basic_search* example if this makes
// little sense to you.
let mut schema_builder = Schema::builder();
// Tantivy does not really have a notion of primary id.
// This may change in the future.
//
// Still, we can create an `isbn` field and use it as an id. This
// field can be `u64` or a `text`, depending on your use case.
// It just needs to be indexed.
//
// If it is `text`, let's make sure to keep it `raw` and let's avoid
// running any text processing on it.
// This is done by associating this field to the tokenizer named `raw`.
// Rather than building our [`TextOptions`](//docs.rs/tantivy/~0/tantivy/schema/struct.TextOptions.html) manually,
// we use the `STRING` shortcut. `STRING` stands for indexed (without term frequency or positions)
// and untokenized.
//
// Because we also want to be able to see this `id` in our returned documents,
// we also mark the field as stored.
let isbn = schema_builder.add_text_field("isbn", STRING | STORED);
let title = schema_builder.add_text_field("title", TEXT | STORED);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema.clone());
let mut index_writer = index.writer(50_000_000)?;
// Let's add a couple of documents, for the sake of the example.
let mut old_man_doc = Document::default();
old_man_doc.add_text(title, "The Old Man and the Sea");
index_writer.add_document(doc!(
isbn => "978-0099908401",
title => "The Old Man and the Sea"
));
index_writer.add_document(doc!(
isbn => "978-0140177398",
title => "Of Mice and Men",
));
index_writer.add_document(doc!(
title => "Frankentein", //< Oops there is a typo here.
isbn => "978-9176370711",
));
index_writer.commit()?;
let reader = index.reader()?;
let frankenstein_isbn = Term::from_field_text(isbn, "978-9176370711");
// Oops, our Frankenstein doc seems misspelled.
let frankenstein_doc_misspelled = extract_doc_given_isbn(&reader, &frankenstein_isbn)?.unwrap();
assert_eq!(
schema.to_json(&frankenstein_doc_misspelled),
r#"{"isbn":["978-9176370711"],"title":["Frankentein"]}"#,
);
// # Update = Delete + Insert
//
// Here we will want to update the typo in the `Frankenstein` book.
//
// Tantivy does not handle updates directly, we need to delete
// and reinsert the document.
//
// This can be complicated as it means you need to have access
// to the entire document. It is good practise to integrate tantivy
// with a key value store for this reason.
//
// To remove a document, we just call `delete_term`
// on its id.
//
// Note that `tantivy` does nothing to enforce the idea that
// there is only one document associated to this id.
//
// Also, you might have noticed that we reinsert the document before
// the delete has been committed. This does not really matter:
// both operations only become visible at the next commit.
index_writer.delete_term(frankenstein_isbn.clone());
// We now need to reinsert our document without the typo.
index_writer.add_document(doc!(
title => "Frankenstein",
isbn => "978-9176370711",
));
// You are guaranteed that your clients will only observe your index in
// the state it was in after a commit.
// In this example, your search engine will at no point be missing the *Frankenstein* document.
// Everything happened as if the document was updated.
index_writer.commit()?;
// We reload our searcher to make our change available to clients.
reader.reload()?;
// No more typo!
let frankenstein_new_doc = extract_doc_given_isbn(&reader, &frankenstein_isbn)?.unwrap();
assert_eq!(
schema.to_json(&frankenstein_new_doc),
r#"{"isbn":["978-9176370711"],"title":["Frankenstein"]}"#,
);
Ok(())
}
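
The "update = delete + insert" pattern can be illustrated without tantivy at all. The following toy model is only a sketch: `ToyIndex`, `Book` and their methods are hypothetical names, and the model deliberately leaves out commits and reader reloads. It shows the two points made in the comments above: deleting by id removes every document carrying that id, and nothing enforces that an id is unique.

```rust
/// A stored document, reduced to the two fields of the example.
#[derive(Debug, Clone, PartialEq)]
struct Book {
    isbn: String,
    title: String,
}

/// Toy stand-in for an index; hypothetical, not a tantivy type.
#[derive(Default)]
struct ToyIndex {
    docs: Vec<Book>,
}

impl ToyIndex {
    fn add_document(&mut self, isbn: &str, title: &str) {
        self.docs.push(Book {
            isbn: isbn.to_string(),
            title: title.to_string(),
        });
    }

    /// Removes every document whose `isbn` equals the given term.
    fn delete_term(&mut self, isbn: &str) {
        self.docs.retain(|book| book.isbn != isbn);
    }

    fn find(&self, isbn: &str) -> Vec<&Book> {
        self.docs.iter().filter(|book| book.isbn == isbn).collect()
    }
}

fn main() {
    let mut index = ToyIndex::default();
    index.add_document("978-9176370711", "Frankentein"); // typo on purpose

    // Update = delete the old version, then insert the corrected one.
    index.delete_term("978-9176370711");
    index.add_document("978-9176370711", "Frankenstein");

    println!("{:?}", index.find("978-9176370711"));
}
```

Skipping the `delete_term` call would leave both versions in the index, which is why the delete must always accompany the reinsert.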

View File

@@ -0,0 +1,78 @@
// # Faceted Search Example
//
// This example covers the basics of faceted search
// in tantivy.
//
// We will:
// - define our schema, including a facet field
// - create an index in a directory
// - index a few documents in our index
// - count the documents under each child of the `/pools` facet
// ---
// Importing tantivy...
#[macro_use]
extern crate tantivy;
use tantivy::collector::FacetCollector;
use tantivy::query::AllQuery;
use tantivy::schema::*;
use tantivy::Index;
fn main() -> tantivy::Result<()> {
// Let's create a temporary directory for the
// sake of this example
let index_path = TempDir::new("tantivy_facet_example_dir")?;
let mut schema_builder = Schema::builder();
schema_builder.add_text_field("name", TEXT | STORED);
// this is our faceted field
schema_builder.add_facet_field("tags");
let schema = schema_builder.build();
let index = Index::create_in_dir(&index_path, schema.clone())?;
let mut index_writer = index.writer(50_000_000)?;
let name = schema.get_field("name").unwrap();
let tags = schema.get_field("tags").unwrap();
// For convenience, tantivy comes with the `doc!` macro, which
// builds a document from `field => value` pairs.
index_writer.add_document(doc!(
name => "the ditch",
tags => Facet::from("/pools/north")
));
index_writer.add_document(doc!(
name => "little stacey",
tags => Facet::from("/pools/south")
));
index_writer.commit()?;
let reader = index.reader()?;
let searcher = reader.searcher();
let mut facet_collector = FacetCollector::for_field(tags);
facet_collector.add_facet("/pools");
let facet_counts = searcher.search(&AllQuery, &facet_collector).unwrap();
// This lists all of the facet counts
let facets: Vec<(&Facet, u64)> = facet_counts.get("/pools").collect();
assert_eq!(
facets,
vec![
(&Facet::from("/pools/north"), 1),
(&Facet::from("/pools/south"), 1),
]
);
Ok(())
}
use tempdir::TempDir;
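
The counts asserted in this example can be reproduced by hand, which makes the semantics of a facet count easier to see. The sketch below uses only the standard library; `child_facet_counts` is a hypothetical helper, and its treatment of deeper paths (counted under their first-level child) is an assumption about how the collector groups results, not a description of its implementation.

```rust
use std::collections::BTreeMap;

/// Counts, for each direct child of `parent`, how many facet paths fall
/// under it. Hypothetical helper for illustration only.
fn child_facet_counts(parent: &str, doc_facets: &[&str]) -> BTreeMap<String, u64> {
    let prefix = format!("{}/", parent.trim_end_matches('/'));
    let mut counts = BTreeMap::new();
    for facet in doc_facets {
        // Only paths strictly below `parent` contribute.
        if let Some(rest) = facet.strip_prefix(prefix.as_str()) {
            // Keep the first segment only, so "/pools/north/deep"
            // is counted under "/pools/north".
            let child = rest.split('/').next().unwrap_or("");
            if !child.is_empty() {
                *counts.entry(format!("{}{}", prefix, child)).or_insert(0) += 1;
            }
        }
    }
    counts
}

fn main() {
    let doc_facets = ["/pools/north", "/pools/south"];
    for (facet, count) in child_facet_counts("/pools", &doc_facets) {
        println!("{}: {}", facet, count);
    }
}
```

A `BTreeMap` keeps the children sorted by path, which matches the ordered `vec!` the example compares against.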

View File

@@ -1,2 +0,0 @@
#!/bin/bash
docco simple_search.rs -o html

View File

@@ -1,518 +0,0 @@
/*--------------------- Typography ----------------------------*/
@font-face {
font-family: 'aller-light';
src: url('public/fonts/aller-light.eot');
src: url('public/fonts/aller-light.eot?#iefix') format('embedded-opentype'),
url('public/fonts/aller-light.woff') format('woff'),
url('public/fonts/aller-light.ttf') format('truetype');
font-weight: normal;
font-style: normal;
}
@font-face {
font-family: 'aller-bold';
src: url('public/fonts/aller-bold.eot');
src: url('public/fonts/aller-bold.eot?#iefix') format('embedded-opentype'),
url('public/fonts/aller-bold.woff') format('woff'),
url('public/fonts/aller-bold.ttf') format('truetype');
font-weight: normal;
font-style: normal;
}
@font-face {
font-family: 'roboto-black';
src: url('public/fonts/roboto-black.eot');
src: url('public/fonts/roboto-black.eot?#iefix') format('embedded-opentype'),
url('public/fonts/roboto-black.woff') format('woff'),
url('public/fonts/roboto-black.ttf') format('truetype');
font-weight: normal;
font-style: normal;
}
/*--------------------- Layout ----------------------------*/
html { height: 100%; }
body {
font-family: "aller-light";
font-size: 14px;
line-height: 18px;
color: #30404f;
margin: 0; padding: 0;
height:100%;
}
#container { min-height: 100%; }
a {
color: #000;
}
b, strong {
font-weight: normal;
font-family: "aller-bold";
}
p {
margin: 15px 0 0px;
}
.annotation ul, .annotation ol {
margin: 25px 0;
}
.annotation ul li, .annotation ol li {
font-size: 14px;
line-height: 18px;
margin: 10px 0;
}
h1, h2, h3, h4, h5, h6 {
color: #112233;
line-height: 1em;
font-weight: normal;
font-family: "roboto-black";
text-transform: uppercase;
margin: 30px 0 15px 0;
}
h1 {
margin-top: 40px;
}
h2 {
font-size: 1.26em;
}
hr {
border: 0;
background: 1px #ddd;
height: 1px;
margin: 20px 0;
}
pre, tt, code {
font-size: 12px; line-height: 16px;
font-family: Menlo, Monaco, Consolas, "Lucida Console", monospace;
margin: 0; padding: 0;
}
.annotation pre {
display: block;
margin: 0;
padding: 7px 10px;
background: #fcfcfc;
-moz-box-shadow: inset 0 0 10px rgba(0,0,0,0.1);
-webkit-box-shadow: inset 0 0 10px rgba(0,0,0,0.1);
box-shadow: inset 0 0 10px rgba(0,0,0,0.1);
overflow-x: auto;
}
.annotation pre code {
border: 0;
padding: 0;
background: transparent;
}
blockquote {
border-left: 5px solid #ccc;
margin: 0;
padding: 1px 0 1px 1em;
}
.sections blockquote p {
font-family: Menlo, Consolas, Monaco, monospace;
font-size: 12px; line-height: 16px;
color: #999;
margin: 10px 0 0;
white-space: pre-wrap;
}
ul.sections {
list-style: none;
padding:0 0 5px 0;;
margin:0;
}
/*
Force border-box so that % widths fit the parent
container without overlap because of margin/padding.
More Info : http://www.quirksmode.org/css/box.html
*/
ul.sections > li > div {
-moz-box-sizing: border-box; /* firefox */
-ms-box-sizing: border-box; /* ie */
-webkit-box-sizing: border-box; /* webkit */
-khtml-box-sizing: border-box; /* konqueror */
box-sizing: border-box; /* css3 */
}
/*---------------------- Jump Page -----------------------------*/
#jump_to, #jump_page {
margin: 0;
background: white;
-webkit-box-shadow: 0 0 25px #777; -moz-box-shadow: 0 0 25px #777;
-webkit-border-bottom-left-radius: 5px; -moz-border-radius-bottomleft: 5px;
font: 16px Arial;
cursor: pointer;
text-align: right;
list-style: none;
}
#jump_to a {
text-decoration: none;
}
#jump_to a.large {
display: none;
}
#jump_to a.small {
font-size: 22px;
font-weight: bold;
color: #676767;
}
#jump_to, #jump_wrapper {
position: fixed;
right: 0; top: 0;
padding: 10px 15px;
margin:0;
}
#jump_wrapper {
display: none;
padding:0;
}
#jump_to:hover #jump_wrapper {
display: block;
}
#jump_page_wrapper{
position: fixed;
right: 0;
top: 0;
bottom: 0;
}
#jump_page {
padding: 5px 0 3px;
margin: 0 0 25px 25px;
max-height: 100%;
overflow: auto;
}
#jump_page .source {
display: block;
padding: 15px;
text-decoration: none;
border-top: 1px solid #eee;
}
#jump_page .source:hover {
background: #f5f5ff;
}
#jump_page .source:first-child {
}
/*---------------------- Low resolutions (> 320px) ---------------------*/
@media only screen and (min-width: 320px) {
.pilwrap { display: none; }
ul.sections > li > div {
display: block;
padding:5px 10px 0 10px;
}
ul.sections > li > div.annotation ul, ul.sections > li > div.annotation ol {
padding-left: 30px;
}
ul.sections > li > div.content {
overflow-x:auto;
-webkit-box-shadow: inset 0 0 5px #e5e5ee;
box-shadow: inset 0 0 5px #e5e5ee;
border: 1px solid #dedede;
margin:5px 10px 5px 10px;
padding-bottom: 5px;
}
ul.sections > li > div.annotation pre {
margin: 7px 0 7px;
padding-left: 15px;
}
ul.sections > li > div.annotation p tt, .annotation code {
background: #f8f8ff;
border: 1px solid #dedede;
font-size: 12px;
padding: 0 0.2em;
}
}
/*---------------------- (> 481px) ---------------------*/
@media only screen and (min-width: 481px) {
#container {
position: relative;
}
body {
background-color: #F5F5FF;
font-size: 15px;
line-height: 21px;
}
pre, tt, code {
line-height: 18px;
}
p, ul, ol {
margin: 0 0 15px;
}
#jump_to {
padding: 5px 10px;
}
#jump_wrapper {
padding: 0;
}
#jump_to, #jump_page {
font: 10px Arial;
text-transform: uppercase;
}
#jump_page .source {
padding: 5px 10px;
}
#jump_to a.large {
display: inline-block;
}
#jump_to a.small {
display: none;
}
#background {
position: absolute;
top: 0; bottom: 0;
width: 350px;
background: #fff;
border-right: 1px solid #e5e5ee;
z-index: -1;
}
ul.sections > li > div.annotation ul, ul.sections > li > div.annotation ol {
padding-left: 40px;
}
ul.sections > li {
white-space: nowrap;
}
ul.sections > li > div {
display: inline-block;
}
ul.sections > li > div.annotation {
max-width: 350px;
min-width: 350px;
min-height: 5px;
padding: 13px;
overflow-x: hidden;
white-space: normal;
vertical-align: top;
text-align: left;
}
ul.sections > li > div.annotation pre {
margin: 15px 0 15px;
padding-left: 15px;
}
ul.sections > li > div.content {
padding: 13px;
vertical-align: top;
border: none;
-webkit-box-shadow: none;
box-shadow: none;
}
.pilwrap {
position: relative;
display: inline;
}
.pilcrow {
font: 12px Arial;
text-decoration: none;
color: #454545;
position: absolute;
top: 3px; left: -20px;
padding: 1px 2px;
opacity: 0;
-webkit-transition: opacity 0.2s linear;
}
.for-h1 .pilcrow {
top: 47px;
}
.for-h2 .pilcrow, .for-h3 .pilcrow, .for-h4 .pilcrow {
top: 35px;
}
ul.sections > li > div.annotation:hover .pilcrow {
opacity: 1;
}
}
/*---------------------- (> 1025px) ---------------------*/
@media only screen and (min-width: 1025px) {
body {
font-size: 16px;
line-height: 24px;
}
#background {
width: 525px;
}
ul.sections > li > div.annotation {
max-width: 525px;
min-width: 525px;
padding: 10px 25px 1px 50px;
}
ul.sections > li > div.content {
padding: 9px 15px 16px 25px;
}
}
/*---------------------- Syntax Highlighting -----------------------------*/
td.linenos { background-color: #f0f0f0; padding-right: 10px; }
span.lineno { background-color: #f0f0f0; padding: 0 5px 0 5px; }
/*
github.com style (c) Vasily Polovnyov <vast@whiteants.net>
*/
pre code {
display: block; padding: 0.5em;
color: #000;
background: #f8f8ff
}
pre .hljs-comment,
pre .hljs-template_comment,
pre .hljs-diff .hljs-header,
pre .hljs-javadoc {
color: #408080;
font-style: italic
}
pre .hljs-keyword,
pre .hljs-assignment,
pre .hljs-literal,
pre .hljs-css .hljs-rule .hljs-keyword,
pre .hljs-winutils,
pre .hljs-javascript .hljs-title,
pre .hljs-lisp .hljs-title,
pre .hljs-subst {
color: #954121;
/*font-weight: bold*/
}
pre .hljs-number,
pre .hljs-hexcolor {
color: #40a070
}
pre .hljs-string,
pre .hljs-tag .hljs-value,
pre .hljs-phpdoc,
pre .hljs-tex .hljs-formula {
color: #219161;
}
pre .hljs-title,
pre .hljs-id {
color: #19469D;
}
pre .hljs-params {
color: #00F;
}
pre .hljs-javascript .hljs-title,
pre .hljs-lisp .hljs-title,
pre .hljs-subst {
font-weight: normal
}
pre .hljs-class .hljs-title,
pre .hljs-haskell .hljs-label,
pre .hljs-tex .hljs-command {
color: #458;
font-weight: bold
}
pre .hljs-tag,
pre .hljs-tag .hljs-title,
pre .hljs-rules .hljs-property,
pre .hljs-django .hljs-tag .hljs-keyword {
color: #000080;
font-weight: normal
}
pre .hljs-attribute,
pre .hljs-variable,
pre .hljs-instancevar,
pre .hljs-lisp .hljs-body {
color: #008080
}
pre .hljs-regexp {
color: #B68
}
pre .hljs-class {
color: #458;
font-weight: bold
}
pre .hljs-symbol,
pre .hljs-ruby .hljs-symbol .hljs-string,
pre .hljs-ruby .hljs-symbol .hljs-keyword,
pre .hljs-ruby .hljs-symbol .hljs-keymethods,
pre .hljs-lisp .hljs-keyword,
pre .hljs-tex .hljs-special,
pre .hljs-input_number {
color: #990073
}
pre .hljs-builtin,
pre .hljs-constructor,
pre .hljs-built_in,
pre .hljs-lisp .hljs-title {
color: #0086b3
}
pre .hljs-preprocessor,
pre .hljs-pi,
pre .hljs-doctype,
pre .hljs-shebang,
pre .hljs-cdata {
color: #999;
font-weight: bold
}
pre .hljs-deletion {
background: #fdd
}
pre .hljs-addition {
background: #dfd
}
pre .hljs-diff .hljs-change {
background: #0086b3
}
pre .hljs-chunk {
color: #aaa
}
pre .hljs-tex .hljs-formula {
opacity: 0.5;
}

Binary file not shown.


View File

@@ -1,375 +0,0 @@
/*! normalize.css v2.0.1 | MIT License | git.io/normalize */
/* ==========================================================================
HTML5 display definitions
========================================================================== */
/*
* Corrects `block` display not defined in IE 8/9.
*/
article,
aside,
details,
figcaption,
figure,
footer,
header,
hgroup,
nav,
section,
summary {
display: block;
}
/*
* Corrects `inline-block` display not defined in IE 8/9.
*/
audio,
canvas,
video {
display: inline-block;
}
/*
* Prevents modern browsers from displaying `audio` without controls.
* Remove excess height in iOS 5 devices.
*/
audio:not([controls]) {
display: none;
height: 0;
}
/*
* Addresses styling for `hidden` attribute not present in IE 8/9.
*/
[hidden] {
display: none;
}
/* ==========================================================================
Base
========================================================================== */
/*
* 1. Sets default font family to sans-serif.
* 2. Prevents iOS text size adjust after orientation change, without disabling
* user zoom.
*/
html {
font-family: sans-serif; /* 1 */
-webkit-text-size-adjust: 100%; /* 2 */
-ms-text-size-adjust: 100%; /* 2 */
}
/*
* Removes default margin.
*/
body {
margin: 0;
}
/* ==========================================================================
Links
========================================================================== */
/*
* Addresses `outline` inconsistency between Chrome and other browsers.
*/
a:focus {
outline: thin dotted;
}
/*
* Improves readability when focused and also mouse hovered in all browsers.
*/
a:active,
a:hover {
outline: 0;
}
/* ==========================================================================
Typography
========================================================================== */
/*
* Addresses `h1` font sizes within `section` and `article` in Firefox 4+,
* Safari 5, and Chrome.
*/
h1 {
font-size: 2em;
}
/*
* Addresses styling not present in IE 8/9, Safari 5, and Chrome.
*/
abbr[title] {
border-bottom: 1px dotted;
}
/*
* Addresses style set to `bolder` in Firefox 4+, Safari 5, and Chrome.
*/
b,
strong {
font-weight: bold;
}
/*
* Addresses styling not present in Safari 5 and Chrome.
*/
dfn {
font-style: italic;
}
/*
* Addresses styling not present in IE 8/9.
*/
mark {
background: #ff0;
color: #000;
}
/*
* Corrects font family set oddly in Safari 5 and Chrome.
*/
code,
kbd,
pre,
samp {
font-family: monospace, serif;
font-size: 1em;
}
/*
* Improves readability of pre-formatted text in all browsers.
*/
pre {
white-space: pre;
white-space: pre-wrap;
word-wrap: break-word;
}
/*
* Sets consistent quote types.
*/
q {
quotes: "\201C" "\201D" "\2018" "\2019";
}
/*
* Addresses inconsistent and variable font size in all browsers.
*/
small {
font-size: 80%;
}
/*
* Prevents `sub` and `sup` affecting `line-height` in all browsers.
*/
sub,
sup {
font-size: 75%;
line-height: 0;
position: relative;
vertical-align: baseline;
}
sup {
top: -0.5em;
}
sub {
bottom: -0.25em;
}
/* ==========================================================================
Embedded content
========================================================================== */
/*
* Removes border when inside `a` element in IE 8/9.
*/
img {
border: 0;
}
/*
* Corrects overflow displayed oddly in IE 9.
*/
svg:not(:root) {
overflow: hidden;
}
/* ==========================================================================
Figures
========================================================================== */
/*
* Addresses margin not present in IE 8/9 and Safari 5.
*/
figure {
margin: 0;
}
/* ==========================================================================
Forms
========================================================================== */
/*
* Define consistent border, margin, and padding.
*/
fieldset {
border: 1px solid #c0c0c0;
margin: 0 2px;
padding: 0.35em 0.625em 0.75em;
}
/*
* 1. Corrects color not being inherited in IE 8/9.
* 2. Remove padding so people aren't caught out if they zero out fieldsets.
*/
legend {
border: 0; /* 1 */
padding: 0; /* 2 */
}
/*
* 1. Corrects font family not being inherited in all browsers.
* 2. Corrects font size not being inherited in all browsers.
* 3. Addresses margins set differently in Firefox 4+, Safari 5, and Chrome
*/
button,
input,
select,
textarea {
font-family: inherit; /* 1 */
font-size: 100%; /* 2 */
margin: 0; /* 3 */
}
/*
* Addresses Firefox 4+ setting `line-height` on `input` using `!important` in
* the UA stylesheet.
*/
button,
input {
line-height: normal;
}
/*
* 1. Avoid the WebKit bug in Android 4.0.* where (2) destroys native `audio`
* and `video` controls.
* 2. Corrects inability to style clickable `input` types in iOS.
* 3. Improves usability and consistency of cursor style between image-type
* `input` and others.
*/
button,
html input[type="button"], /* 1 */
input[type="reset"],
input[type="submit"] {
-webkit-appearance: button; /* 2 */
cursor: pointer; /* 3 */
}
/*
* Re-set default cursor for disabled elements.
*/
button[disabled],
input[disabled] {
cursor: default;
}
/*
* 1. Addresses box sizing set to `content-box` in IE 8/9.
* 2. Removes excess padding in IE 8/9.
*/
input[type="checkbox"],
input[type="radio"] {
box-sizing: border-box; /* 1 */
padding: 0; /* 2 */
}
/*
* 1. Addresses `appearance` set to `searchfield` in Safari 5 and Chrome.
* 2. Addresses `box-sizing` set to `border-box` in Safari 5 and Chrome
* (include `-moz` to future-proof).
*/
input[type="search"] {
-webkit-appearance: textfield; /* 1 */
-moz-box-sizing: content-box;
-webkit-box-sizing: content-box; /* 2 */
box-sizing: content-box;
}
/*
* Removes inner padding and search cancel button in Safari 5 and Chrome
* on OS X.
*/
input[type="search"]::-webkit-search-cancel-button,
input[type="search"]::-webkit-search-decoration {
-webkit-appearance: none;
}
/*
* Removes inner padding and border in Firefox 4+.
*/
button::-moz-focus-inner,
input::-moz-focus-inner {
border: 0;
padding: 0;
}
/*
* 1. Removes default vertical scrollbar in IE 8/9.
* 2. Improves readability and alignment in all browsers.
*/
textarea {
overflow: auto; /* 1 */
vertical-align: top; /* 2 */
}
/* ==========================================================================
Tables
========================================================================== */
/*
* Remove most spacing between table cells.
*/
table {
border-collapse: collapse;
border-spacing: 0;
}

View File

@@ -1,489 +0,0 @@
<!DOCTYPE html>
<html>
<head>
<title>simple_search.rs</title>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<meta name="viewport" content="width=device-width, target-densitydpi=160dpi, initial-scale=1.0; maximum-scale=1.0; user-scalable=0;">
<link rel="stylesheet" media="all" href="docco.css" />
</head>
<body>
<div id="container">
<div id="background"></div>
<ul class="sections">
<li id="title">
<div class="annotation">
<h1>simple_search.rs</h1>
</div>
</li>
<li id="section-1">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-1">&#182;</a>
</div>
</div>
<div class="content"><div class='highlight'><pre><span class="hljs-keyword">extern</span> <span class="hljs-keyword">crate</span> rustc_serialize;
<span class="hljs-keyword">extern</span> <span class="hljs-keyword">crate</span> tantivy;
<span class="hljs-keyword">extern</span> <span class="hljs-keyword">crate</span> tempdir;
<span class="hljs-keyword">use</span> std::path::Path;
<span class="hljs-keyword">use</span> tempdir::TempDir;
<span class="hljs-keyword">use</span> tantivy::Index;
<span class="hljs-keyword">use</span> tantivy::schema::*;
<span class="hljs-keyword">use</span> tantivy::collector::TopCollector;
<span class="hljs-keyword">use</span> tantivy::query::QueryParser;
<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">main</span></span>() {</pre></div></div>
</li>
<li id="section-2">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-2">&#182;</a>
</div>
<p>Let's create a temporary directory for the
sake of this example</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">if</span> <span class="hljs-keyword">let</span> <span class="hljs-literal">Ok</span>(dir) = TempDir::new(<span class="hljs-string">"tantivy_example_dir"</span>) {
run_example(dir.path()).unwrap();
dir.close().unwrap();
}
}
<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">run_example</span></span>(index_path: &amp;Path) -&gt; tantivy::<span class="hljs-built_in">Result</span>&lt;()&gt; {</pre></div></div>
</li>
<li id="section-3">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-3">&#182;</a>
</div>
<h1 id="defining-the-schema">Defining the schema</h1>
<p>The Tantivy index requires a very strict schema.
The schema declares which fields are in the index,
and for each field, its type and “the way it should
be indexed”.</p>
</div>
</li>
<li id="section-4">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-4">&#182;</a>
</div>
<p>first we need to define a schema …</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> schema_builder = SchemaBuilder::<span class="hljs-keyword">default</span>();</pre></div></div>
</li>
<li id="section-5">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-5">&#182;</a>
</div>
<p>Our first field is title.
We want full-text search for it, and we want to be able
to retrieve the document after the search.</p>
<p>TEXT | STORED is some syntactic sugar to describe
that. </p>
<p><code>TEXT</code> means the field should be tokenized and indexed,
along with its term frequency and term positions.</p>
<p><code>STORED</code> means that the field will also be saved
in a compressed, row-oriented key-value store.
This store is useful to reconstruct the
documents that were selected during the search phase.</p>
</div>
<div class="content"><div class='highlight'><pre> schema_builder.add_text_field(<span class="hljs-string">"title"</span>, TEXT | STORED);</pre></div></div>
</li>
<li id="section-6">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-6">&#182;</a>
</div>
<p>Our second field is body.
We want full-text search for it, but we do not need
to retrieve the body after the search.</p>
</div>
<div class="content"><div class='highlight'><pre> schema_builder.add_text_field(<span class="hljs-string">"body"</span>, TEXT);
<span class="hljs-keyword">let</span> schema = schema_builder.build();</pre></div></div>
</li>
<li id="section-7">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-7">&#182;</a>
</div>
<h1 id="indexing-documents">Indexing documents</h1>
<p>Let's create a brand new index.</p>
<p>This will actually just save a meta.json
with our schema in the directory.</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">let</span> index = <span class="hljs-built_in">try!</span>(Index::create(index_path, schema.clone()));</pre></div></div>
</li>
<li id="section-8">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-8">&#182;</a>
</div>
<p>To insert a document we need an index writer.
There must be only one writer at a time.
This single <code>IndexWriter</code> is already
multithreaded.</p>
<p>Here we use a buffer of 1 GB. Using a bigger
heap for the indexer can increase its throughput.
This buffer will be split between the indexing
threads.</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> index_writer = <span class="hljs-built_in">try!</span>(index.writer(<span class="hljs-number">1_000_000_000</span>));</pre></div></div>
</li>
<li id="section-9">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-9">&#182;</a>
</div>
<p>Let's index our documents!
We first need a handle on the title and the body field.</p>
</div>
</li>
<li id="section-10">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-10">&#182;</a>
</div>
<h3 id="create-a-document-manually-">Create a document “manually”.</h3>
<p>We can create a document manually, by setting the fields
one by one in a Document object.</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">let</span> title = schema.get_field(<span class="hljs-string">"title"</span>).unwrap();
<span class="hljs-keyword">let</span> body = schema.get_field(<span class="hljs-string">"body"</span>).unwrap();
<span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> old_man_doc = Document::<span class="hljs-keyword">default</span>();
old_man_doc.add_text(title, <span class="hljs-string">"The Old Man and the Sea"</span>);
old_man_doc.add_text(body, <span class="hljs-string">"He was an old man who fished alone in a skiff in the Gulf Stream and he had gone eighty-four days now without taking a fish."</span>);</pre></div></div>
</li>
<li id="section-11">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-11">&#182;</a>
</div>
<p>… and add it to the <code>IndexWriter</code>.</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-built_in">try!</span>(index_writer.add_document(old_man_doc));</pre></div></div>
</li>
<li id="section-12">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-12">&#182;</a>
</div>
<h3 id="create-a-document-directly-from-json-">Create a document directly from json.</h3>
<p>Alternatively, we can use our schema to parse
a document object directly from json.</p>
</div>
<div class="content"><div class='highlight'><pre>
<span class="hljs-keyword">let</span> mice_and_men_doc = <span class="hljs-built_in">try!</span>(schema.parse_document(r#<span class="hljs-string">"{
"</span>title<span class="hljs-string">": "</span>Of Mice and Men<span class="hljs-string">",
"</span>body<span class="hljs-string">": "</span>A few miles south of Soledad, the Salinas River drops <span class="hljs-keyword">in</span> close to the hillside bank and runs deep and green. The water is warm too, <span class="hljs-keyword">for</span> it has slipped twinkling over the yellow sands <span class="hljs-keyword">in</span> the sunlight before reaching the narrow pool. On one side of the river the golden foothill slopes curve up to the strong and rocky Gabilan Mountains, but on the valley side the water is lined with trees—willows fresh and green with every spring, carrying <span class="hljs-keyword">in</span> their lower leaf junctures the debris of the winter's flooding; and sycamores with mottled, white, recumbent limbs and branches that arch over the pool<span class="hljs-string">"
}"</span>#));
<span class="hljs-built_in">try!</span>(index_writer.add_document(mice_and_men_doc));</pre></div></div>
</li>
<li id="section-13">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-13">&#182;</a>
</div>
<p>Multi-valued fields are allowed; they are
expressed in JSON as an array.
The following document has two titles.</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">let</span> frankenstein_doc = <span class="hljs-built_in">try!</span>(schema.parse_document(r#<span class="hljs-string">"{
"</span>title<span class="hljs-string">": ["</span>Frankenstein<span class="hljs-string">", "</span>The Modern Prometheus<span class="hljs-string">"],
"</span>body<span class="hljs-string">": "</span>You will rejoice to hear that no disaster has accompanied the commencement of an enterprise which you have regarded with such evil forebodings. I arrived here yesterday, and my first task is to assure my dear sister of my welfare and increasing confidence <span class="hljs-keyword">in</span> the success of my undertaking.<span class="hljs-string">"
}"</span>#));
<span class="hljs-built_in">try!</span>(index_writer.add_document(frankenstein_doc));</pre></div></div>
</li>
<li id="section-14">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-14">&#182;</a>
</div>
<p>This is an example, so we will only index 3 documents
here. You can check out tantivy's tutorial to index
the English Wikipedia. Tantivy's indexing is rather fast.
Indexing 5 million articles of the English Wikipedia takes
around 4 minutes on my computer!</p>
</div>
</li>
<li id="section-15">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-15">&#182;</a>
</div>
<h3 id="committing">Committing</h3>
<p>At this point our documents are not searchable.</p>
<p>We need to call .commit() explicitly to force the
index_writer to finish processing the documents in the queue,
flush the current index to the disk, and advertise
the existence of new documents.</p>
<p>This call is blocking.</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-built_in">try!</span>(index_writer.commit());</pre></div></div>
</li>
<li id="section-16">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-16">&#182;</a>
</div>
<p>If <code>.commit()</code> returns correctly, then all of the
documents that have been added are guaranteed to be
persistently indexed.</p>
<p>In the scenario of a crash or a power failure,
tantivy behaves as if it has rolled back to its last
commit.</p>
</div>
</li>
<li id="section-17">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-17">&#182;</a>
</div>
<h1 id="searching">Searching</h1>
<p>Let's search our index. We start
by creating a searcher. There can be more
than one searcher at a time.</p>
<p>You should create a searcher
every time you start a “search query”.</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">let</span> searcher = index.searcher();</pre></div></div>
</li>
<li id="section-18">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-18">&#182;</a>
</div>
<p>The query parser can interpret human queries.
Here, if the user does not specify which
field they want to search, tantivy will search
in both title and body.</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">let</span> query_parser = QueryParser::new(index.schema(), <span class="hljs-built_in">vec!</span>(title, body));</pre></div></div>
</li>
<li id="section-19">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-19">&#182;</a>
</div>
<p>QueryParser may fail if the query is not in the right
format. For user facing applications, this can be a problem.
A ticket has been opened regarding this problem.</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">let</span> query = <span class="hljs-built_in">try!</span>(query_parser.parse_query(<span class="hljs-string">"sea whale"</span>));</pre></div></div>
</li>
<li id="section-20">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-20">&#182;</a>
</div>
<p>A query defines a set of documents, as
well as the way they should be scored.</p>
<p>A query created by the query parser is scored according
to a metric called Tf-Idf, and will consider
any document matching at least one of our terms.</p>
</div>
</li>
<li id="section-21">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-21">&#182;</a>
</div>
<h3 id="collectors">Collectors</h3>
<p>We are not interested in all of the documents but
only in the top 10. Keeping track of our top 10 best documents
is the role of the TopCollector.</p>
</div>
<div class="content"><div class='highlight'><pre>
<span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> top_collector = TopCollector::with_limit(<span class="hljs-number">10</span>);</pre></div></div>
</li>
<li id="section-22">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-22">&#182;</a>
</div>
<p>We can now perform our query.</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-built_in">try!</span>(searcher.search(&amp;*query, &amp;<span class="hljs-keyword">mut</span> top_collector));</pre></div></div>
</li>
<li id="section-23">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-23">&#182;</a>
</div>
<p>Our top collector now contains the 10
most relevant doc ids…</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">let</span> doc_addresses = top_collector.docs();</pre></div></div>
</li>
<li id="section-24">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-24">&#182;</a>
</div>
<p>The actual documents still need to be
retrieved from Tantivy's store.</p>
<p>Since the body field was not configured as stored,
the document returned will only contain
a title.</p>
</div>
<div class="content"><div class='highlight'><pre>
<span class="hljs-keyword">for</span> doc_address <span class="hljs-keyword">in</span> doc_addresses {
<span class="hljs-keyword">let</span> retrieved_doc = <span class="hljs-built_in">try!</span>(searcher.doc(&amp;doc_address));
<span class="hljs-built_in">println!</span>(<span class="hljs-string">"{}"</span>, schema.to_json(&amp;retrieved_doc));
}
<span class="hljs-literal">Ok</span>(())
}</pre></div></div>
</li>
</ul>
</div>
</body>
</html>


@@ -0,0 +1,43 @@
// # Searching a range on an indexed int field.
//
// Below is an example of creating an indexed integer field in your schema.
// You can use `RangeQuery` with the `Count` collector to count all documents in a given range.
#[macro_use]
extern crate tantivy;
use tantivy::collector::Count;
use tantivy::query::RangeQuery;
use tantivy::schema::{Schema, INDEXED};
use tantivy::Index;
use tantivy::Result;
fn run() -> Result<()> {
// For the sake of simplicity, this schema will only have 1 field
let mut schema_builder = Schema::builder();
// `INDEXED` is a short-hand to indicate that our field should be "searchable".
let year_field = schema_builder.add_u64_field("year", INDEXED);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let reader = index.reader()?;
{
let mut index_writer = index.writer_with_num_threads(1, 6_000_000)?;
for year in 1950u64..2019u64 {
index_writer.add_document(doc!(year_field => year));
}
index_writer.commit()?;
// The index will be a range of years
}
reader.reload()?;
let searcher = reader.searcher();
// The end is excluded i.e. here we are searching up to 1969
let docs_in_the_sixties = RangeQuery::new_u64(year_field, 1960..1970);
// Uses a Count collector to sum the total number of docs in the range
let num_60s_books = searcher.search(&docs_in_the_sixties, &Count)?;
assert_eq!(num_60s_books, 10);
Ok(())
}
fn main() {
run().unwrap()
}
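The half-open range semantics used by the query above can be checked without tantivy. The following is a minimal sketch using only the standard library; `count_in_range` is a hypothetical helper written for illustration and says nothing about how `RangeQuery` or `Count` are implemented.

```rust
use std::ops::Range;

/// Counts how many values fall inside the half-open range `[start, end)`.
/// Toy stand-in for "range query + count collector"; not tantivy code.
fn count_in_range(values: impl IntoIterator<Item = u64>, range: Range<u64>) -> usize {
    values
        .into_iter()
        // `Range::contains` includes `start` and excludes `end`.
        .filter(|value| range.contains(value))
        .count()
}
```

With the years `1950..2019` and the range `1960..1970`, the helper counts the years 1960 through 1969 and leaves 1970 out, which is why the example asserts a count of 10.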


@@ -0,0 +1,133 @@
// # Iterating docs and positions.
//
// At its core, tantivy relies on a data structure
// called an inverted index.
//
// This example shows how to manually iterate through
// the list of documents containing a term, getting
// its term frequency, and accessing its positions.
// ---
// Importing tantivy...
#[macro_use]
extern crate tantivy;
use tantivy::schema::*;
use tantivy::Index;
use tantivy::{DocId, DocSet, Postings};
fn main() -> tantivy::Result<()> {
// We first create a schema for the sake of the
// example. Check the `basic_search` example for more information.
let mut schema_builder = Schema::builder();
// For this example, we need to make sure to index positions for our title
// field. `TEXT` precisely does this.
let title = schema_builder.add_text_field("title", TEXT | STORED);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema.clone());
let mut index_writer = index.writer_with_num_threads(1, 50_000_000)?;
index_writer.add_document(doc!(title => "The Old Man and the Sea"));
index_writer.add_document(doc!(title => "Of Mice and Men"));
index_writer.add_document(doc!(title => "The modern Prometheus"));
index_writer.commit()?;
let reader = index.reader()?;
let searcher = reader.searcher();
// A tantivy index is actually a collection of segments.
// Similarly, a searcher just wraps a list of `segment_reader`s.
//
// (Because we indexed a very small number of documents over one thread
// there is actually only one segment here, but let's iterate through the list
// anyway)
for segment_reader in searcher.segment_readers() {
// A segment contains different data structures.
// Inverted index stands for the combination of
// - the term dictionary
// - the inverted lists associated with each term and their positions
let inverted_index = segment_reader.inverted_index(title);
// A `Term` is a text token associated with a field.
// Let's go through all docs containing the term `title:the` and access their positions
let term_the = Term::from_field_text(title, "the");
// This segment posting object is like a cursor over the documents matching the term.
// The `IndexRecordOption` argument tells tantivy we will be interested in both term frequencies
// and positions.
//
// If you don't need all this information, you may get better performance by decompressing less
// information.
if let Some(mut segment_postings) =
inverted_index.read_postings(&term_the, IndexRecordOption::WithFreqsAndPositions)
{
// this buffer will be used to request positions
let mut positions: Vec<u32> = Vec::with_capacity(100);
while segment_postings.advance() {
// the id of the current document.
let doc_id: DocId = segment_postings.doc(); //< do not try to access this before calling advance once.
// This MAY contain deleted documents as well.
if segment_reader.is_deleted(doc_id) {
continue;
}
// the number of times the term appears in the document.
let term_freq: u32 = segment_postings.term_freq();
// accessing positions is slightly expensive and lazy, do not request
// them if you don't need them for some documents.
segment_postings.positions(&mut positions);
// By definition we should have `term_freq` positions.
assert_eq!(positions.len(), term_freq as usize);
// This prints:
// ```
// Doc 0: TermFreq 2: [0, 4]
// Doc 2: TermFreq 1: [0]
// ```
println!("Doc {}: TermFreq {}: {:?}", doc_id, term_freq, positions);
}
}
}
// The term is built again here because the one above was scoped to the first loop.
// This time we go through the docs containing the term `title:the` block by block.
let term_the = Term::from_field_text(title, "the");
// Some other powerful operations (especially `.skip_to`) may be useful to consume these
// posting lists rapidly.
// You can check for them in the [`DocSet`](https://docs.rs/tantivy/~0/tantivy/trait.DocSet.html) trait
// and the [`Postings`](https://docs.rs/tantivy/~0/tantivy/trait.Postings.html) trait
// Also, for some VERY specific high performance use case like an OLAP analysis of logs,
// you can get better performance by accessing directly the blocks of doc ids.
for segment_reader in searcher.segment_readers() {
// A segment contains different data structures.
// Inverted index stands for the combination of
// - the term dictionary
// - the inverted lists associated with each term and their positions
let inverted_index = segment_reader.inverted_index(title);
// This block segment posting object is like a cursor over blocks of documents matching the term.
// Here `IndexRecordOption::Basic` tells tantivy we are only interested in the doc ids,
// not in term frequencies or positions.
//
// Requesting less information lets tantivy decompress less data.
if let Some(mut block_segment_postings) =
inverted_index.read_block_postings(&term_the, IndexRecordOption::Basic)
{
while block_segment_postings.advance() {
// Once again these docs MAY contain deleted documents as well.
let docs = block_segment_postings.docs();
// Prints `Docs [0, 2].`
println!("Docs {:?}", docs);
}
}
}
Ok(())
}
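To make the idea of a posting list with positions concrete, here is a small standard-library sketch of an inverted index. It is an illustration only: `ToyInvertedIndex`, `build_toy_index` and `postings` are hypothetical names, tokens are split on whitespace and lowercased, and tantivy's real on-disk structures (term dictionary, compressed blocks) are far more involved.

```rust
use std::collections::BTreeMap;

/// term -> (doc id -> positions of the term inside that doc).
/// Hypothetical toy structure, not tantivy's representation.
type ToyInvertedIndex = BTreeMap<String, BTreeMap<u32, Vec<u32>>>;

/// Builds the toy index; the doc id is the position of the text in `docs`.
fn build_toy_index(docs: &[&str]) -> ToyInvertedIndex {
    let mut index = ToyInvertedIndex::new();
    for (doc_id, text) in docs.iter().enumerate() {
        for (position, token) in text.split_whitespace().enumerate() {
            index
                .entry(token.to_lowercase())
                .or_default()
                .entry(doc_id as u32)
                .or_default()
                .push(position as u32);
        }
    }
    index
}

/// Returns `(doc_id, term_freq, positions)` for every doc containing `term`,
/// in increasing doc id order.
fn postings(index: &ToyInvertedIndex, term: &str) -> Vec<(u32, u32, Vec<u32>)> {
    index
        .get(term)
        .map(|docs| {
            docs.iter()
                .map(|(doc, positions)| (*doc, positions.len() as u32, positions.clone()))
                .collect()
        })
        .unwrap_or_default()
}
```

Feeding it the three titles from the example and asking for `the` yields doc 0 with positions `[0, 4]` and doc 2 with position `[0]`, matching the output documented in the comments above.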


@@ -0,0 +1,105 @@
// # Indexing from different threads.
//
// It is fairly common to have to index from different threads.
// Tantivy forbids creating more than one `IndexWriter` at a time.
//
// This `IndexWriter` itself has its own multithreaded layer, so managing your own
// indexing threads will not help. However, it can still be useful for some applications.
//
// For instance, if preparing documents to send to tantivy before indexing is the bottleneck of
// your application, it is reasonable to have multiple threads.
//
// Another very common reason to want to index from multiple threads is implementing a webserver
// with CRUD capabilities. The server framework will most likely handle requests from
// different threads.
//
// The recommended way to address both of these use cases is to wrap your `IndexWriter` in an
// `Arc<RwLock<IndexWriter>>`.
//
// While this is counterintuitive, adding and deleting documents do not require mutability
// over the `IndexWriter`, so several threads will be able to do this operation concurrently.
//
// The example below does not represent an actual real-life use case (who would spawn a thread to
// index a single document?), but aims at demonstrating the mechanism that makes indexing
// from several threads possible.
// ---
// Importing tantivy...
#[macro_use]
extern crate tantivy;
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::Duration;
use tantivy::schema::{Schema, STORED, TEXT};
use tantivy::Opstamp;
use tantivy::{Index, IndexWriter};
fn main() -> tantivy::Result<()> {
// # Defining the schema
let mut schema_builder = Schema::builder();
let title = schema_builder.add_text_field("title", TEXT | STORED);
let body = schema_builder.add_text_field("body", TEXT);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let index_writer: Arc<RwLock<IndexWriter>> = Arc::new(RwLock::new(index.writer(50_000_000)?));
// # First indexing thread.
let index_writer_clone_1 = index_writer.clone();
thread::spawn(move || {
// we index the document 100 times... for the sake of the example.
for i in 0..100 {
let opstamp = {
// A read lock is sufficient here.
let index_writer_rlock = index_writer_clone_1.read().unwrap();
index_writer_rlock.add_document(
doc!(
title => "Of Mice and Men",
body => "A few miles south of Soledad, the Salinas River drops in close to the hillside \
bank and runs deep and green. The water is warm too, for it has slipped twinkling \
over the yellow sands in the sunlight before reaching the narrow pool. On one \
side of the river the golden foothill slopes curve up to the strong and rocky \
Gabilan Mountains, but on the valley side the water is lined with trees—willows \
fresh and green with every spring, carrying in their lower leaf junctures the \
debris of the winters flooding; and sycamores with mottled, white, recumbent \
limbs and branches that arch over the pool"
))
};
println!("add doc {} from thread 1 - opstamp {}", i, opstamp);
thread::sleep(Duration::from_millis(20));
}
});
// # Second indexing thread.
let index_writer_clone_2 = index_writer.clone();
// For convenience, tantivy also comes with a macro to
// reduce the boilerplate above.
thread::spawn(move || {
// we index the document 100 times... for the sake of the example.
for i in 0..100 {
// A read lock is sufficient here.
let opstamp = {
let index_writer_rlock = index_writer_clone_2.read().unwrap();
index_writer_rlock.add_document(doc!(
title => "Manufacturing consent",
body => "Some great book description..."
))
};
println!("add doc {} from thread 2 - opstamp {}", i, opstamp);
thread::sleep(Duration::from_millis(10));
}
});
// # In the main thread, we commit 10 times, once every 500ms.
for _ in 0..10 {
let opstamp: Opstamp = {
// Committing or rolling back, on the other hand, requires a write lock. This will block other threads.
let mut index_writer_wlock = index_writer.write().unwrap();
index_writer_wlock.commit().unwrap()
};
println!("committed with opstamp {}", opstamp);
thread::sleep(Duration::from_millis(500));
}
Ok(())
}
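The locking pattern above works because adding a document only needs `&self`, while committing needs `&mut self`. The sketch below reproduces that split with the standard library alone. `ToyWriter` and `index_concurrently` are hypothetical stand-ins invented for this illustration; they assume nothing about how `IndexWriter` is implemented internally.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex, RwLock};
use std::thread;

/// Hypothetical stand-in for an index writer.
struct ToyWriter {
    next_opstamp: AtomicU64,
    pending: Mutex<Vec<String>>,
    committed: Vec<String>,
}

impl ToyWriter {
    fn new() -> Self {
        ToyWriter {
            next_opstamp: AtomicU64::new(0),
            pending: Mutex::new(Vec::new()),
            committed: Vec::new(),
        }
    }

    /// Takes `&self`: many threads can call this while holding a read lock.
    fn add_document(&self, doc: String) -> u64 {
        self.pending.lock().unwrap().push(doc);
        self.next_opstamp.fetch_add(1, Ordering::SeqCst)
    }

    /// Takes `&mut self`: the caller must hold the write lock.
    /// Returns the total number of committed documents.
    fn commit(&mut self) -> usize {
        let pending = self.pending.get_mut().unwrap();
        self.committed.append(pending);
        self.committed.len()
    }
}

/// Spawns `num_threads` threads that each add `docs_per_thread` documents
/// under a read lock, then commits once under the write lock.
fn index_concurrently(num_threads: usize, docs_per_thread: usize) -> usize {
    let writer = Arc::new(RwLock::new(ToyWriter::new()));
    let handles: Vec<_> = (0..num_threads)
        .map(|thread_id| {
            let writer = Arc::clone(&writer);
            thread::spawn(move || {
                for i in 0..docs_per_thread {
                    let doc = format!("doc {} from thread {}", i, thread_id);
                    writer.read().unwrap().add_document(doc);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let committed = writer.write().unwrap().commit();
    committed
}
```

Unlike the example above, this sketch joins the indexing threads before committing, so the final count is deterministic.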


@@ -1,207 +0,0 @@
extern crate rustc_serialize;
extern crate tantivy;
extern crate tempdir;
use std::path::Path;
use tempdir::TempDir;
use tantivy::Index;
use tantivy::schema::*;
use tantivy::collector::TopCollector;
use tantivy::query::QueryParser;
fn main() {
// Let's create a temporary directory for the
// sake of this example
if let Ok(dir) = TempDir::new("tantivy_example_dir") {
run_example(dir.path()).unwrap();
dir.close().unwrap();
}
}
fn run_example(index_path: &Path) -> tantivy::Result<()> {
// # Defining the schema
//
// The Tantivy index requires a very strict schema.
// The schema declares which fields are in the index,
// and for each field, its type and "the way it should
// be indexed".
// first we need to define a schema ...
let mut schema_builder = SchemaBuilder::default();
// Our first field is title.
// We want full-text search for it, and we want to be able
// to retrieve the document after the search.
//
// TEXT | STORED is some syntactic sugar to describe
// that.
//
// `TEXT` means the field should be tokenized and indexed,
// along with its term frequency and term positions.
//
// `STORED` means that the field will also be saved
// in a compressed, row-oriented key-value store.
// This store is useful to reconstruct the
// documents that were selected during the search phase.
schema_builder.add_text_field("title", TEXT | STORED);
// Our second field is body.
// We want full-text search for it, but we do not need
// to retrieve the body after the search.
schema_builder.add_text_field("body", TEXT);
let schema = schema_builder.build();
// # Indexing documents
//
// Let's create a brand new index.
//
// This will actually just save a meta.json
// with our schema in the directory.
let index = try!(Index::create(index_path, schema.clone()));
// To insert documents we need an index writer.
// There must be only one writer at a time.
// This single `IndexWriter` is already
// multithreaded.
//
// Here we use a buffer of 1 GB. Using a bigger
// heap for the indexer can increase its throughput.
// This buffer will be split between the indexing
// threads.
let mut index_writer = try!(index.writer(1_000_000_000));
// Let's index our documents!
// We first need a handle on the title and the body field.
// ### Create a document "manually".
//
// We can create a document manually, by setting the fields
// one by one in a Document object.
let title = schema.get_field("title").unwrap();
let body = schema.get_field("body").unwrap();
let mut old_man_doc = Document::default();
old_man_doc.add_text(title, "The Old Man and the Sea");
old_man_doc.add_text(body, "He was an old man who fished alone in a skiff in the Gulf Stream and he had gone eighty-four days now without taking a fish.");
// ... and add it to the `IndexWriter`.
try!(index_writer.add_document(old_man_doc));
// ### Create a document directly from json.
//
// Alternatively, we can use our schema to parse
// a document object directly from json.
let mice_and_men_doc = try!(schema.parse_document(r#"{
"title": "Of Mice and Men",
"body": "A few miles south of Soledad, the Salinas River drops in close to the hillside bank and runs deep and green. The water is warm too, for it has slipped twinkling over the yellow sands in the sunlight before reaching the narrow pool. On one side of the river the golden foothill slopes curve up to the strong and rocky Gabilan Mountains, but on the valley side the water is lined with trees—willows fresh and green with every spring, carrying in their lower leaf junctures the debris of the winter's flooding; and sycamores with mottled, white, recumbent limbs and branches that arch over the pool"
}"#));
try!(index_writer.add_document(mice_and_men_doc));
// Multi-valued fields are allowed; they are
// expressed in JSON as an array.
// The following document has two titles.
let frankenstein_doc = try!(schema.parse_document(r#"{
"title": ["Frankenstein", "The Modern Prometheus"],
"body": "You will rejoice to hear that no disaster has accompanied the commencement of an enterprise which you have regarded with such evil forebodings. I arrived here yesterday, and my first task is to assure my dear sister of my welfare and increasing confidence in the success of my undertaking."
}"#));
try!(index_writer.add_document(frankenstein_doc));
// This is an example, so we will only index 3 documents
// here. You can check out tantivy's tutorial to index
// the English Wikipedia. Tantivy's indexing is rather fast.
// Indexing 5 million articles of the English Wikipedia takes
// around 4 minutes on my computer!
// ### Committing
//
// At this point our documents are not searchable.
//
//
// We need to call .commit() explicitly to force the
// index_writer to finish processing the documents in the queue,
// flush the current index to the disk, and advertise
// the existence of new documents.
//
// This call is blocking.
try!(index_writer.commit());
// If `.commit()` returns correctly, then all of the
// documents that have been added are guaranteed to be
// persistently indexed.
//
// In the scenario of a crash or a power failure,
// tantivy behaves as if it has rolled back to its last
// commit.
// # Searching
//
// Let's search our index. We start
// by creating a searcher. There can be more
// than one searcher at a time.
//
// You should create a searcher
// every time you start a "search query".
let searcher = index.searcher();
// The query parser can interpret human queries.
// Here, if the user does not specify which
// field they want to search, tantivy will search
// in both title and body.
let query_parser = QueryParser::new(index.schema(), vec!(title, body));
// QueryParser may fail if the query is not in the right
// format. For user facing applications, this can be a problem.
// A ticket has been opened regarding this problem.
let query = try!(query_parser.parse_query("sea whale"));
// A query defines a set of documents, as
// well as the way they should be scored.
//
// A query created by the query parser is scored according
// to a metric called Tf-Idf, and will consider
// any document matching at least one of our terms.
// ### Collectors
//
// We are not interested in all of the documents but
// only in the top 10. Keeping track of our top 10 best documents
// is the role of the TopCollector.
let mut top_collector = TopCollector::with_limit(10);
// We can now perform our query.
try!(searcher.search(&*query, &mut top_collector));
// Our top collector now contains the 10
// most relevant doc ids...
let doc_addresses = top_collector.docs();
// The actual documents still need to be
// retrieved from Tantivy's store.
//
// Since the body field was not configured as stored,
// the document returned will only contain
// a title.
for doc_address in doc_addresses {
let retrieved_doc = try!(searcher.doc(&doc_address));
println!("{}", schema.to_json(&retrieved_doc));
}
Ok(())
}

85
examples/snippet.rs Normal file

@@ -0,0 +1,85 @@
// # Snippet example
//
// This example shows how to return a representative snippet of
// your hit result.
// Snippets are extracts of a target document, returned in HTML format.
// The keywords searched by the user are highlighted with a `<b>` tag.
// ---
// Importing tantivy...
#[macro_use]
extern crate tantivy;
use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::*;
use tantivy::Index;
use tantivy::{Snippet, SnippetGenerator};
use tempdir::TempDir;
fn main() -> tantivy::Result<()> {
// Let's create a temporary directory for the
// sake of this example
let index_path = TempDir::new("tantivy_example_dir")?;
// # Defining the schema
let mut schema_builder = Schema::builder();
let title = schema_builder.add_text_field("title", TEXT | STORED);
let body = schema_builder.add_text_field("body", TEXT | STORED);
let schema = schema_builder.build();
// # Indexing documents
let index = Index::create_in_dir(&index_path, schema.clone())?;
let mut index_writer = index.writer(50_000_000)?;
// we'll only need one doc for this example.
index_writer.add_document(doc!(
title => "Of Mice and Men",
body => "A few miles south of Soledad, the Salinas River drops in close to the hillside \
bank and runs deep and green. The water is warm too, for it has slipped twinkling \
over the yellow sands in the sunlight before reaching the narrow pool. On one \
side of the river the golden foothill slopes curve up to the strong and rocky \
Gabilan Mountains, but on the valley side the water is lined with trees—willows \
fresh and green with every spring, carrying in their lower leaf junctures the \
debris of the winters flooding; and sycamores with mottled, white, recumbent \
limbs and branches that arch over the pool"
));
// ...
index_writer.commit()?;
let reader = index.reader()?;
let searcher = reader.searcher();
let query_parser = QueryParser::for_index(&index, vec![title, body]);
let query = query_parser.parse_query("sycamore spring")?;
let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
let snippet_generator = SnippetGenerator::create(&searcher, &*query, body)?;
for (score, doc_address) in top_docs {
let doc = searcher.doc(doc_address)?;
let snippet = snippet_generator.snippet_from_doc(&doc);
println!("Document score {}:", score);
println!("title: {}", doc.get_first(title).unwrap().text().unwrap());
println!("snippet: {}", snippet.to_html());
println!("custom highlighting: {}", highlight(snippet));
}
Ok(())
}
fn highlight(snippet: Snippet) -> String {
let mut result = String::new();
let mut start_from = 0;
for (start, end) in snippet.highlighted().iter().map(|h| h.bounds()) {
result.push_str(&snippet.fragments()[start_from..start]);
result.push_str(" --> ");
result.push_str(&snippet.fragments()[start..end]);
result.push_str(" <-- ");
start_from = end;
}
result.push_str(&snippet.fragments()[start_from..]);
result
}
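The custom `highlight` function above only needs the fragment text and the highlighted byte ranges, so its logic can be tried out without building an index. The sketch below is a standard-library version of the same loop. `highlight_ranges` is a hypothetical name, and it assumes the ranges are sorted, non-overlapping and fall on character boundaries.

```rust
use std::ops::Range;

/// Wraps each highlighted byte range of `fragment` in ` --> ` and ` <-- `.
/// Assumes `highlighted` is sorted, non-overlapping and within bounds.
fn highlight_ranges(fragment: &str, highlighted: &[Range<usize>]) -> String {
    let mut result = String::new();
    let mut start_from = 0;
    for range in highlighted {
        // Copy the plain text between the previous highlight and this one.
        result.push_str(&fragment[start_from..range.start]);
        result.push_str(" --> ");
        result.push_str(&fragment[range.start..range.end]);
        result.push_str(" <-- ");
        start_from = range.end;
    }
    // Copy whatever follows the last highlight.
    result.push_str(&fragment[start_from..]);
    result
}
```

Because the markers carry their own surrounding spaces, text that already had a space before a highlighted word ends up with two spaces there. Trimming the markers is an easy variation if that matters for display.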

115
examples/stop_words.rs Normal file

@@ -0,0 +1,115 @@
// # Stop Words Example
//
// This example covers the basic usage of stop words
// with tantivy
//
// We will:
// - define our schema
// - create an index in a directory
// - add a few stop words
// - index a few documents in our index
// ---
// Importing tantivy...
#[macro_use]
extern crate tantivy;
use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::*;
use tantivy::tokenizer::*;
use tantivy::Index;
fn main() -> tantivy::Result<()> {
// this example assumes you understand the content in `basic_search`
let mut schema_builder = Schema::builder();
// This configures your custom options for how tantivy will
// store and process your content in the index. The key thing
// to note is that we are setting the tokenizer to `stoppy`,
// which will be defined and registered below.
let text_field_indexing = TextFieldIndexing::default()
.set_tokenizer("stoppy")
.set_index_option(IndexRecordOption::WithFreqsAndPositions);
let text_options = TextOptions::default()
.set_indexing_options(text_field_indexing)
.set_stored();
// Our first field is title.
schema_builder.add_text_field("title", text_options);
// Our second field is body.
let text_field_indexing = TextFieldIndexing::default()
.set_tokenizer("stoppy")
.set_index_option(IndexRecordOption::WithFreqsAndPositions);
let text_options = TextOptions::default()
.set_indexing_options(text_field_indexing)
.set_stored();
schema_builder.add_text_field("body", text_options);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema.clone());
// This tokenizer lowercases all of the text (to help with stop word matching)
// then removes all instances of `the` and `and` from the corpus
let tokenizer = SimpleTokenizer
.filter(LowerCaser)
.filter(StopWordFilter::remove(vec![
"the".to_string(),
"and".to_string(),
]));
index.tokenizers().register("stoppy", tokenizer);
let mut index_writer = index.writer(50_000_000)?;
let title = schema.get_field("title").unwrap();
let body = schema.get_field("body").unwrap();
index_writer.add_document(doc!(
title => "The Old Man and the Sea",
body => "He was an old man who fished alone in a skiff in the Gulf Stream and \
he had gone eighty-four days now without taking a fish."
));
index_writer.add_document(doc!(
title => "Of Mice and Men",
body => "A few miles south of Soledad, the Salinas River drops in close to the hillside \
bank and runs deep and green. The water is warm too, for it has slipped twinkling \
over the yellow sands in the sunlight before reaching the narrow pool. On one \
side of the river the golden foothill slopes curve up to the strong and rocky \
Gabilan Mountains, but on the valley side the water is lined with trees—willows \
fresh and green with every spring, carrying in their lower leaf junctures the \
debris of the winters flooding; and sycamores with mottled, white, recumbent \
limbs and branches that arch over the pool"
));
index_writer.add_document(doc!(
title => "Frankenstein",
body => "You will rejoice to hear that no disaster has accompanied the commencement of an \
enterprise which you have regarded with such evil forebodings. I arrived here \
yesterday, and my first task is to assure my dear sister of my welfare and \
increasing confidence in the success of my undertaking."
));
index_writer.commit()?;
let reader = index.reader()?;
let searcher = reader.searcher();
let query_parser = QueryParser::for_index(&index, vec![title, body]);
// stop words are applied on the query as well.
// The following will be equivalent to `title:frankenstein`
let query = query_parser.parse_query("title:\"the Frankenstein\"")?;
let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
for (score, doc_address) in top_docs {
let retrieved_doc = searcher.doc(doc_address)?;
println!("\n==\nDocument score {}:", score);
println!("{}", schema.to_json(&retrieved_doc));
}
Ok(())
}
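The `stoppy` pipeline above chains a simple tokenizer, a lowercasing filter and a stop-word filter. As a rough, std-only sketch of that idea, the same steps can be written as below. The function `tokenize_without_stop_words` is hypothetical and is not tantivy's implementation; it assumes that a token is a maximal run of alphanumeric characters.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a `SimpleTokenizer -> LowerCaser -> StopWordFilter`
/// chain. This is an illustration only, NOT tantivy's code.
fn tokenize_without_stop_words(text: &str, stop_words: &HashSet<String>) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric()) // split on anything that is not a letter or digit
        .filter(|token| !token.is_empty()) // drop empty fragments between separators
        .map(|token| token.to_lowercase()) // lowercase first so that "The" matches "the"
        .filter(|token| !stop_words.contains(token)) // then drop the stop words
        .collect()
}
```

With the stop words `the` and `and`, the title `The Old Man and the Sea` reduces to `old`, `man`, `sea`, and the query text `the Frankenstein` reduces to `frankenstein`. Applying the same filter at query time is what makes the phrase query above behave like `title:frankenstein`.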

@@ -0,0 +1,41 @@
use tantivy;
use tantivy::schema::*;
// # Document from json
//
// For convenience, `Document` can be parsed directly from json.
fn main() -> tantivy::Result<()> {
// Let's first define a schema and an index.
// Check out the basic example if this is confusing to you.
//
// first we need to define a schema ...
let mut schema_builder = Schema::builder();
schema_builder.add_text_field("title", TEXT | STORED);
schema_builder.add_text_field("body", TEXT);
schema_builder.add_u64_field("year", INDEXED);
let schema = schema_builder.build();
// Let's assume we have a json-serialized document.
let mice_and_men_doc_json = r#"{
"title": "Of Mice and Men",
"year": 1937
}"#;
// We can parse our document
let _mice_and_men_doc = schema.parse_document(&mice_and_men_doc_json)?;
// Multi-valued fields are allowed; they are
// expressed in JSON by an array.
// The following document has two titles.
let frankenstein_json = r#"{
"title": ["Frankenstein", "The Modern Prometheus"],
"year": 1818
}"#;
let _frankenstein_doc = schema.parse_document(&frankenstein_json)?;
// Note that the schema is saved in your index directory.
//
// As a result, indexes are aware of their schema, and you can use this feature
// just by opening an existing `Index` and calling `index.schema().parse_document(json)`.
Ok(())
}

2
run-tests.sh Executable file

@@ -0,0 +1,2 @@
#!/bin/bash
cargo test

1
rustfmt.toml Normal file

@@ -0,0 +1 @@
use_try_shorthand = true

@@ -1,10 +0,0 @@
#!/bin/bash
DEST=target/doc/tantivy/docs/
mkdir -p $DEST
for f in $(ls docs/*.md)
do
rustdoc $f -o $DEST --markdown-css ../../rustdoc.css --markdown-css style.css
done
cp docs/*.css $DEST

@@ -1,5 +0,0 @@
#/bin/bash
valgrind --tool=cachegrind target/release/tantivy-bench -i /data/wiki-index -q ./queries.txt -n 3
valgrind --tool=callgrind target/release/tantivy-bench -i /data/wiki-index -q ./queries.txt -n 3

@@ -1,86 +0,0 @@
extern crate regex;
use std::str::Chars;
use std::ascii::AsciiExt;
pub struct TokenIter<'a> {
chars: Chars<'a>,
term_buffer: String,
}
fn append_char_lowercase(c: char, term_buffer: &mut String) {
term_buffer.push(c.to_ascii_lowercase());
}
pub trait StreamingIterator<'a, T> {
fn next(&'a mut self) -> Option<T>;
}
impl<'a, 'b> TokenIter<'b> {
fn consume_token(&'a mut self) -> Option<&'a str> {
for c in &mut self.chars {
if c.is_alphanumeric() {
append_char_lowercase(c, &mut self.term_buffer);
}
else {
break;
}
}
Some(&self.term_buffer)
}
}
impl<'a, 'b> StreamingIterator<'a, &'a str> for TokenIter<'b> {
#[inline]
fn next(&'a mut self,) -> Option<&'a str> {
self.term_buffer.clear();
// skipping non-letter characters.
loop {
match self.chars.next() {
Some(c) => {
if c.is_alphanumeric() {
append_char_lowercase(c, &mut self.term_buffer);
return self.consume_token();
}
}
None => { return None; }
}
}
}
}
pub struct SimpleTokenizer;
impl SimpleTokenizer {
pub fn tokenize<'a>(&self, text: &'a str) -> TokenIter<'a> {
TokenIter {
term_buffer: String::new(),
chars: text.chars(),
}
}
}
#[test]
fn test_tokenizer() {
let simple_tokenizer = SimpleTokenizer;
let mut term_reader = simple_tokenizer.tokenize("hello, happy tax payer!");
assert_eq!(term_reader.next().unwrap(), "hello");
assert_eq!(term_reader.next().unwrap(), "happy");
assert_eq!(term_reader.next().unwrap(), "tax");
assert_eq!(term_reader.next().unwrap(), "payer");
assert_eq!(term_reader.next(), None);
}
#[test]
fn test_tokenizer_empty() {
let simple_tokenizer = SimpleTokenizer;
let mut term_reader = simple_tokenizer.tokenize("");
assert_eq!(term_reader.next(), None);
}

@@ -1,83 +0,0 @@
use collector::Collector;
use SegmentLocalId;
use SegmentReader;
use std::io;
use DocId;
use Score;
/// Collector that does nothing.
/// This is used in the chain Collector and will hopefully
/// be optimized away by the compiler.
pub struct DoNothingCollector;
impl Collector for DoNothingCollector {
#[inline]
fn set_segment(&mut self, _: SegmentLocalId, _: &SegmentReader) -> io::Result<()> {
Ok(())
}
#[inline]
fn collect(&mut self, _doc: DocId, _score: Score) {}
}
/// Zero-cost abstraction used to collect on multiple collectors.
/// This contraption is only usable if the type of your collectors
/// are known at compile time.
pub struct ChainedCollector<Left: Collector, Right: Collector> {
left: Left,
right: Right
}
impl<Left: Collector, Right: Collector> ChainedCollector<Left, Right> {
/// Adds a collector
pub fn push<C: Collector>(self, new_collector: &mut C) -> ChainedCollector<Self, &mut C> {
ChainedCollector {
left: self,
right: new_collector,
}
}
}
impl<Left: Collector, Right: Collector> Collector for ChainedCollector<Left, Right> {
fn set_segment(&mut self, segment_local_id: SegmentLocalId, segment: &SegmentReader) -> io::Result<()> {
try!(self.left.set_segment(segment_local_id, segment));
try!(self.right.set_segment(segment_local_id, segment));
Ok(())
}
fn collect(&mut self, doc: DocId, score: Score) {
self.left.collect(doc, score);
self.right.collect(doc, score);
}
}
/// Creates a `ChainedCollector`
pub fn chain() -> ChainedCollector<DoNothingCollector, DoNothingCollector> {
ChainedCollector {
left: DoNothingCollector,
right: DoNothingCollector,
}
}
#[cfg(test)]
mod tests {
use super::*;
use collector::{Collector, CountCollector, TopCollector};
#[test]
fn test_chained_collector() {
let mut top_collector = TopCollector::with_limit(2);
let mut count_collector = CountCollector::default();
{
let mut collectors = chain()
.push(&mut top_collector)
.push(&mut count_collector);
collectors.collect(1, 0.2);
collectors.collect(2, 0.1);
collectors.collect(3, 0.5);
}
assert_eq!(count_collector.count(), 3);
assert!(top_collector.at_capacity());
}
}

@@ -1,57 +1,129 @@
use std::io;
use super::Collector;
use DocId;
use Score;
use SegmentReader;
use SegmentLocalId;
use crate::collector::SegmentCollector;
use crate::DocId;
use crate::Result;
use crate::Score;
use crate::SegmentLocalId;
use crate::SegmentReader;
/// `CountCollector` collector only counts how many
/// documents match the query.
pub struct CountCollector {
/// documents match the query.
///
/// ```rust
/// #[macro_use]
/// extern crate tantivy;
/// use tantivy::schema::{Schema, TEXT};
/// use tantivy::{Index, Result};
/// use tantivy::collector::Count;
/// use tantivy::query::QueryParser;
///
/// # fn main() { example().unwrap(); }
/// fn example() -> Result<()> {
/// let mut schema_builder = Schema::builder();
/// let title = schema_builder.add_text_field("title", TEXT);
/// let schema = schema_builder.build();
/// let index = Index::create_in_ram(schema);
/// {
/// let mut index_writer = index.writer(3_000_000)?;
/// index_writer.add_document(doc!(
/// title => "The Name of the Wind",
/// ));
/// index_writer.add_document(doc!(
/// title => "The Diary of Muadib",
/// ));
/// index_writer.add_document(doc!(
/// title => "A Dairy Cow",
/// ));
/// index_writer.add_document(doc!(
/// title => "The Diary of a Young Girl",
/// ));
/// index_writer.commit().unwrap();
/// }
///
/// let reader = index.reader()?;
/// let searcher = reader.searcher();
///
/// {
/// let query_parser = QueryParser::for_index(&index, vec![title]);
/// let query = query_parser.parse_query("diary")?;
/// let count = searcher.search(&query, &Count).unwrap();
///
/// assert_eq!(count, 2);
/// }
///
/// Ok(())
/// }
/// ```
pub struct Count;
impl Collector for Count {
type Fruit = usize;
type Child = SegmentCountCollector;
fn for_segment(&self, _: SegmentLocalId, _: &SegmentReader) -> Result<SegmentCountCollector> {
Ok(SegmentCountCollector::default())
}
fn requires_scoring(&self) -> bool {
false
}
fn merge_fruits(&self, segment_counts: Vec<usize>) -> Result<usize> {
Ok(segment_counts.into_iter().sum())
}
}
#[derive(Default)]
pub struct SegmentCountCollector {
count: usize,
}
impl CountCollector {
/// Returns the count of documents that were
/// collected.
pub fn count(&self,) -> usize {
self.count
}
}
impl Default for CountCollector {
fn default() -> CountCollector {
CountCollector {count: 0,
}
}
}
impl Collector for CountCollector {
fn set_segment(&mut self, _: SegmentLocalId, _: &SegmentReader) -> io::Result<()> {
Ok(())
}
impl SegmentCollector for SegmentCountCollector {
type Fruit = usize;
fn collect(&mut self, _: DocId, _: Score) {
self.count += 1;
}
fn harvest(self) -> usize {
self.count
}
}
#[cfg(test)]
mod tests {
use super::{Count, SegmentCountCollector};
use crate::collector::Collector;
use crate::collector::SegmentCollector;
use super::*;
use test::Bencher;
use collector::Collector;
#[bench]
fn build_collector(b: &mut Bencher) {
b.iter(|| {
let mut count_collector = CountCollector::default();
for doc in 0..1_000_000 {
count_collector.collect(doc, 1f32);
}
count_collector.count()
});
#[test]
fn test_count_collect_does_not_requires_scoring() {
assert!(!Count.requires_scoring());
}
#[test]
fn test_segment_count_collector() {
{
let count_collector = SegmentCountCollector::default();
assert_eq!(count_collector.harvest(), 0);
}
{
let mut count_collector = SegmentCountCollector::default();
count_collector.collect(0u32, 1f32);
assert_eq!(count_collector.harvest(), 1);
}
{
let mut count_collector = SegmentCountCollector::default();
count_collector.collect(0u32, 1f32);
assert_eq!(count_collector.harvest(), 1);
}
{
let mut count_collector = SegmentCountCollector::default();
count_collector.collect(0u32, 1f32);
count_collector.collect(1u32, 1f32);
assert_eq!(count_collector.harvest(), 2);
}
}
}
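The new `Count` collector splits the work in two: one child collector per segment counts matches and is harvested, then the per-segment results are merged. A minimal std-only sketch of that pattern follows. `SegmentCounter` and `count_matches` are hypothetical names, and each inner `Vec<u32>` is assumed to hold the matching doc ids of one segment; no tantivy types are involved.

```rust
/// Hypothetical, simplified per-segment counter. It mirrors the shape of
/// `SegmentCountCollector` above but takes no tantivy types.
#[derive(Default)]
struct SegmentCounter {
    count: usize,
}

impl SegmentCounter {
    /// Called once per matching document; the doc id itself is ignored.
    fn collect(&mut self, _doc: u32) {
        self.count += 1;
    }

    /// Consumes the counter and returns the segment-local result.
    fn harvest(self) -> usize {
        self.count
    }
}

/// Merges the per-segment results by summing them, as `merge_fruits` does above.
fn merge_fruits(segment_counts: Vec<usize>) -> usize {
    segment_counts.into_iter().sum()
}

/// Counts the matches of every segment independently, then merges the results.
fn count_matches(segments: &[Vec<u32>]) -> usize {
    let fruits: Vec<usize> = segments
        .iter()
        .map(|matching_docs| {
            let mut counter = SegmentCounter::default();
            for &doc in matching_docs {
                counter.collect(doc);
            }
            counter.harvest()
        })
        .collect();
    merge_fruits(fruits)
}
```

Because each segment produces its own result before the merge, the per-segment step needs no shared mutable state, which is presumably what lets segments be searched independently.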

@@ -0,0 +1,126 @@
use crate::collector::top_collector::{TopCollector, TopSegmentCollector};
use crate::collector::{Collector, SegmentCollector};
use crate::Result;
use crate::{DocAddress, DocId, Score, SegmentReader};
pub(crate) struct CustomScoreTopCollector<TCustomScorer, TScore = Score> {
custom_scorer: TCustomScorer,
collector: TopCollector<TScore>,
}
impl<TCustomScorer, TScore> CustomScoreTopCollector<TCustomScorer, TScore>
where
TScore: Clone + PartialOrd,
{
pub fn new(
custom_scorer: TCustomScorer,
limit: usize,
) -> CustomScoreTopCollector<TCustomScorer, TScore> {
CustomScoreTopCollector {
custom_scorer,
collector: TopCollector::with_limit(limit),
}
}
}
/// A custom segment scorer makes it possible to define any kind of score
/// for a given document belonging to a specific segment.
///
/// It is the segment local version of the [`CustomScorer`](./trait.CustomScorer.html).
pub trait CustomSegmentScorer<TScore>: 'static {
/// Computes the score of a specific `doc`.
fn score(&self, doc: DocId) -> TScore;
}
/// `CustomScorer` makes it possible to define any kind of score.
///
/// The `CustomScorer` itself does not do much of the computation.
/// Instead, it helps construct `Self::Child` instances that will compute
/// the score at the segment level.
pub trait CustomScorer<TScore>: Sync {
/// Type of the associated [`CustomSegmentScorer`](./trait.CustomSegmentScorer.html).
type Child: CustomSegmentScorer<TScore>;
/// Builds a child scorer for a specific segment. The child scorer is
/// only valid for that segment.
fn segment_scorer(&self, segment_reader: &SegmentReader) -> Result<Self::Child>;
}
impl<TCustomScorer, TScore> Collector for CustomScoreTopCollector<TCustomScorer, TScore>
where
TCustomScorer: CustomScorer<TScore>,
TScore: 'static + PartialOrd + Clone + Send + Sync,
{
type Fruit = Vec<(TScore, DocAddress)>;
type Child = CustomScoreTopSegmentCollector<TCustomScorer::Child, TScore>;
fn for_segment(
&self,
segment_local_id: u32,
segment_reader: &SegmentReader,
) -> Result<Self::Child> {
let segment_scorer = self.custom_scorer.segment_scorer(segment_reader)?;
let segment_collector = self
.collector
.for_segment(segment_local_id, segment_reader)?;
Ok(CustomScoreTopSegmentCollector {
segment_collector,
segment_scorer,
})
}
fn requires_scoring(&self) -> bool {
false
}
fn merge_fruits(&self, segment_fruits: Vec<Self::Fruit>) -> Result<Self::Fruit> {
self.collector.merge_fruits(segment_fruits)
}
}
pub struct CustomScoreTopSegmentCollector<T, TScore>
where
TScore: 'static + PartialOrd + Clone + Send + Sync + Sized,
T: CustomSegmentScorer<TScore>,
{
segment_collector: TopSegmentCollector<TScore>,
segment_scorer: T,
}
impl<T, TScore> SegmentCollector for CustomScoreTopSegmentCollector<T, TScore>
where
TScore: 'static + PartialOrd + Clone + Send + Sync,
T: 'static + CustomSegmentScorer<TScore>,
{
type Fruit = Vec<(TScore, DocAddress)>;
fn collect(&mut self, doc: DocId, _score: Score) {
let score = self.segment_scorer.score(doc);
self.segment_collector.collect(doc, score);
}
fn harvest(self) -> Vec<(TScore, DocAddress)> {
self.segment_collector.harvest()
}
}
impl<F, TScore, T> CustomScorer<TScore> for F
where
F: 'static + Send + Sync + Fn(&SegmentReader) -> T,
T: CustomSegmentScorer<TScore>,
{
type Child = T;
fn segment_scorer(&self, segment_reader: &SegmentReader) -> Result<Self::Child> {
Ok((self)(segment_reader))
}
}
impl<F, TScore> CustomSegmentScorer<TScore> for F
where
F: 'static + Sync + Send + Fn(DocId) -> TScore,
{
fn score(&self, doc: DocId) -> TScore {
(self)(doc)
}
}
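The two blanket implementations at the end of this file let plain closures act as scorers: an outer closure is called once per segment and returns an inner closure that scores each document. The std-only sketch below reproduces that pattern under simplified assumptions. `Segment`, `SegmentScorer`, `Scorer` and `top_doc` are hypothetical stand-ins, and `Result` handling and the `Send`/`Sync` bounds are left out.

```rust
type DocId = u32;

/// Hypothetical stand-in for a segment reader: it only exposes one
/// per-document value (think of a fast field such as a popularity counter).
struct Segment {
    popularity: Vec<u64>,
}

trait SegmentScorer<TScore> {
    fn score(&self, doc: DocId) -> TScore;
}

trait Scorer<TScore> {
    type Child: SegmentScorer<TScore>;
    fn segment_scorer(&self, segment: &Segment) -> Self::Child;
}

// Any closure `Fn(DocId) -> TScore` can be used as a segment-level scorer.
impl<F, TScore> SegmentScorer<TScore> for F
where
    F: Fn(DocId) -> TScore,
{
    fn score(&self, doc: DocId) -> TScore {
        (self)(doc)
    }
}

// Any closure `Fn(&Segment) -> T` can be used as a scorer factory,
// as long as `T` is itself a segment-level scorer.
impl<F, TScore, T> Scorer<TScore> for F
where
    F: Fn(&Segment) -> T,
    T: SegmentScorer<TScore>,
{
    type Child = T;
    fn segment_scorer(&self, segment: &Segment) -> T {
        (self)(segment)
    }
}

/// Returns the best `(score, doc)` pair of one segment, or `None` if it is empty.
fn top_doc<S: Scorer<u64>>(scorer: &S, segment: &Segment) -> Option<(u64, DocId)> {
    let segment_scorer = scorer.segment_scorer(segment);
    (0..segment.popularity.len() as DocId)
        .map(|doc| {
            let score: u64 = segment_scorer.score(doc);
            (score, doc)
        })
        .max()
}
```

A caller could then pass `|segment: &Segment| { let values = segment.popularity.clone(); move |doc: DocId| values[doc as usize] }` as the scorer. The inner closure owns its data, so its type does not depend on the lifetime of the `&Segment` borrow.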

@@ -0,0 +1,647 @@
use crate::collector::Collector;
use crate::collector::SegmentCollector;
use crate::docset::SkipResult;
use crate::fastfield::FacetReader;
use crate::schema::Facet;
use crate::schema::Field;
use crate::DocId;
use crate::Result;
use crate::Score;
use crate::SegmentLocalId;
use crate::SegmentReader;
use crate::TantivyError;
use std::cmp::Ordering;
use std::collections::btree_map;
use std::collections::BTreeMap;
use std::collections::BTreeSet;
use std::collections::BinaryHeap;
use std::collections::Bound;
use std::iter::Peekable;
use std::{u64, usize};
struct Hit<'a> {
count: u64,
facet: &'a Facet,
}
impl<'a> Eq for Hit<'a> {}
impl<'a> PartialEq<Hit<'a>> for Hit<'a> {
fn eq(&self, other: &Hit<'_>) -> bool {
self.count == other.count
}
}
impl<'a> PartialOrd<Hit<'a>> for Hit<'a> {
fn partial_cmp(&self, other: &Hit<'_>) -> Option<Ordering> {
Some(self.cmp(other))
}
}
impl<'a> Ord for Hit<'a> {
fn cmp(&self, other: &Self) -> Ordering {
other.count.cmp(&self.count)
}
}
fn facet_depth(facet_bytes: &[u8]) -> usize {
if facet_bytes.is_empty() {
0
} else {
facet_bytes.iter().cloned().filter(|b| *b == 0u8).count() + 1
}
}
/// Collector for faceting
///
/// The collector collects all facets. You need to configure it
/// beforehand with the facet you want to extract.
///
/// This is done by calling `.add_facet(...)` with the root of the
/// facet you want to extract as argument.
///
/// Facet counts will only be computed for the facets that are direct children
/// of such a root facet.
///
/// For instance, if your index represents books, your hierarchy of facets
/// may contain `category`, `language`.
///
/// The category facet may include `subcategories`. For instance, a book
/// could belong to `/category/fiction/fantasy`.
///
/// If you request the facet counts for `/category`, the result will be
/// the breakdown of counts for the direct children of `/category`
/// (e.g. `/category/fiction`, `/category/biography`, `/category/personal_development`).
///
/// Once collection is finished, you can harvest its results in the form
/// of a `FacetCounts` object, and extract your facet counts from it.
///
/// This implementation assumes you are working with a number of facets that
/// is hundreds of times lower than your number of documents.
///
///
/// ```rust
/// #[macro_use]
/// extern crate tantivy;
/// use tantivy::schema::{Facet, Schema, TEXT};
/// use tantivy::{Index, Result};
/// use tantivy::collector::FacetCollector;
/// use tantivy::query::AllQuery;
///
/// # fn main() { example().unwrap(); }
/// fn example() -> Result<()> {
/// let mut schema_builder = Schema::builder();
///
/// // Facets have their own specific type.
/// // It is not a bad practice to put all of your
/// // facet information in the same field.
/// let facet = schema_builder.add_facet_field("facet");
/// let title = schema_builder.add_text_field("title", TEXT);
/// let schema = schema_builder.build();
/// let index = Index::create_in_ram(schema);
/// {
/// let mut index_writer = index.writer(3_000_000)?;
/// // a document can be associated to any number of facets
/// index_writer.add_document(doc!(
/// title => "The Name of the Wind",
/// facet => Facet::from("/lang/en"),
/// facet => Facet::from("/category/fiction/fantasy")
/// ));
/// index_writer.add_document(doc!(
/// title => "Dune",
/// facet => Facet::from("/lang/en"),
/// facet => Facet::from("/category/fiction/sci-fi")
/// ));
/// index_writer.add_document(doc!(
/// title => "La Vénus d'Ille",
/// facet => Facet::from("/lang/fr"),
/// facet => Facet::from("/category/fiction/fantasy"),
/// facet => Facet::from("/category/fiction/horror")
/// ));
/// index_writer.add_document(doc!(
/// title => "The Diary of a Young Girl",
/// facet => Facet::from("/lang/en"),
/// facet => Facet::from("/category/biography")
/// ));
/// index_writer.commit()?;
/// }
/// let reader = index.reader()?;
/// let searcher = reader.searcher();
///
/// {
/// let mut facet_collector = FacetCollector::for_field(facet);
/// facet_collector.add_facet("/lang");
/// facet_collector.add_facet("/category");
/// let facet_counts = searcher.search(&AllQuery, &facet_collector)?;
///
/// // This lists all of the facet counts
/// let facets: Vec<(&Facet, u64)> = facet_counts
/// .get("/category")
/// .collect();
/// assert_eq!(facets, vec![
/// (&Facet::from("/category/biography"), 1),
/// (&Facet::from("/category/fiction"), 3)
/// ]);
/// }
///
/// {
/// let mut facet_collector = FacetCollector::for_field(facet);
/// facet_collector.add_facet("/category/fiction");
/// let facet_counts = searcher.search(&AllQuery, &facet_collector)?;
///
/// // This lists all of the facet counts
/// let facets: Vec<(&Facet, u64)> = facet_counts
/// .get("/category/fiction")
/// .collect();
/// assert_eq!(facets, vec![
/// (&Facet::from("/category/fiction/fantasy"), 2),
/// (&Facet::from("/category/fiction/horror"), 1),
/// (&Facet::from("/category/fiction/sci-fi"), 1)
/// ]);
/// }
///
/// {
/// let mut facet_collector = FacetCollector::for_field(facet);
/// facet_collector.add_facet("/category/fiction");
/// let facet_counts = searcher.search(&AllQuery, &facet_collector)?;
///
/// // This lists all of the facet counts
/// let facets: Vec<(&Facet, u64)> = facet_counts.top_k("/category/fiction", 1);
/// assert_eq!(facets, vec![
/// (&Facet::from("/category/fiction/fantasy"), 2)
/// ]);
/// }
///
/// Ok(())
/// }
/// ```
pub struct FacetCollector {
field: Field,
facets: BTreeSet<Facet>,
}
pub struct FacetSegmentCollector {
reader: FacetReader,
facet_ords_buf: Vec<u64>,
// facet_ord -> collapse facet_id
collapse_mapping: Vec<usize>,
// collapse facet_id -> count
counts: Vec<u64>,
// collapse facet_id -> facet_ord
collapse_facet_ords: Vec<u64>,
}
fn skip<'a, I: Iterator<Item = &'a Facet>>(
target: &[u8],
collapse_it: &mut Peekable<I>,
) -> SkipResult {
loop {
match collapse_it.peek() {
Some(facet_bytes) => match facet_bytes.encoded_str().as_bytes().cmp(target) {
Ordering::Less => {}
Ordering::Greater => {
return SkipResult::OverStep;
}
Ordering::Equal => {
return SkipResult::Reached;
}
},
None => {
return SkipResult::End;
}
}
collapse_it.next();
}
}
impl FacetCollector {
/// Create a facet collector to collect the facets
/// from a specific facet `Field`.
///
/// This function does not check whether the field
/// is of the proper type.
pub fn for_field(field: Field) -> FacetCollector {
FacetCollector {
field,
facets: BTreeSet::default(),
}
}
/// Adds a facet for which we want to record counts.
///
/// Adding facet `Facet::from("/country")` for instance,
/// will record the counts of all of the direct children of the facet country
/// (e.g. `/country/FR`, `/country/UK`).
///
/// Adding two facets where one is a prefix of the other is forbidden.
/// If you need the correct number of unique documents for two such facets,
/// just add them in separate `FacetCollector`s.
pub fn add_facet<T>(&mut self, facet_from: T)
where
Facet: From<T>,
{
let facet = Facet::from(facet_from);
for old_facet in &self.facets {
assert!(
!old_facet.is_prefix_of(&facet),
"Tried to add a facet which is a descendant of an already added facet."
);
assert!(
!facet.is_prefix_of(old_facet),
"Tried to add a facet which is an ancestor of an already added facet."
);
}
self.facets.insert(facet);
}
}
impl Collector for FacetCollector {
type Fruit = FacetCounts;
type Child = FacetSegmentCollector;
fn for_segment(
&self,
_: SegmentLocalId,
reader: &SegmentReader,
) -> Result<FacetSegmentCollector> {
let field_name = reader.schema().get_field_name(self.field);
let facet_reader = reader.facet_reader(self.field).ok_or_else(|| {
TantivyError::SchemaError(format!("Field {:?} is not a facet field.", field_name))
})?;
let mut collapse_mapping = Vec::new();
let mut counts = Vec::new();
let mut collapse_facet_ords = Vec::new();
let mut collapse_facet_it = self.facets.iter().peekable();
collapse_facet_ords.push(0);
{
let mut facet_streamer = facet_reader.facet_dict().range().into_stream();
if facet_streamer.advance() {
'outer: loop {
// at the beginning of this loop, facet_streamer
// is positioned on a term that has not been processed yet.
let skip_result = skip(facet_streamer.key(), &mut collapse_facet_it);
match skip_result {
SkipResult::Reached => {
// we reach a facet we decided to collapse.
let collapse_depth = facet_depth(facet_streamer.key());
let mut collapsed_id = 0;
collapse_mapping.push(0);
while facet_streamer.advance() {
let depth = facet_depth(facet_streamer.key());
if depth <= collapse_depth {
continue 'outer;
}
if depth == collapse_depth + 1 {
collapsed_id = collapse_facet_ords.len();
collapse_facet_ords.push(facet_streamer.term_ord());
collapse_mapping.push(collapsed_id);
} else {
collapse_mapping.push(collapsed_id);
}
}
break;
}
SkipResult::End | SkipResult::OverStep => {
collapse_mapping.push(0);
if !facet_streamer.advance() {
break;
}
}
}
}
}
}
counts.resize(collapse_facet_ords.len(), 0);
Ok(FacetSegmentCollector {
reader: facet_reader,
facet_ords_buf: Vec::with_capacity(255),
collapse_mapping,
counts,
collapse_facet_ords,
})
}
fn requires_scoring(&self) -> bool {
false
}
fn merge_fruits(&self, segments_facet_counts: Vec<FacetCounts>) -> Result<FacetCounts> {
let mut facet_counts: BTreeMap<Facet, u64> = BTreeMap::new();
for segment_facet_counts in segments_facet_counts {
for (facet, count) in segment_facet_counts.facet_counts {
*(facet_counts.entry(facet).or_insert(0)) += count;
}
}
Ok(FacetCounts { facet_counts })
}
}
impl SegmentCollector for FacetSegmentCollector {
type Fruit = FacetCounts;
fn collect(&mut self, doc: DocId, _: Score) {
self.reader.facet_ords(doc, &mut self.facet_ords_buf);
let mut previous_collapsed_ord: usize = usize::MAX;
for &facet_ord in &self.facet_ords_buf {
let collapsed_ord = self.collapse_mapping[facet_ord as usize];
self.counts[collapsed_ord] += if collapsed_ord == previous_collapsed_ord {
0
} else {
1
};
previous_collapsed_ord = collapsed_ord;
}
}
/// Returns the results of the collection.
///
/// This method does not just return the counters,
/// it also translates the facet ordinals of the last segment.
fn harvest(self) -> FacetCounts {
let mut facet_counts = BTreeMap::new();
let facet_dict = self.reader.facet_dict();
for (collapsed_facet_ord, count) in self.counts.iter().cloned().enumerate() {
if count == 0 {
continue;
}
let mut facet = vec![];
let facet_ord = self.collapse_facet_ords[collapsed_facet_ord];
facet_dict.ord_to_term(facet_ord as u64, &mut facet);
// TODO
facet_counts.insert(Facet::from_encoded(facet).unwrap(), count);
}
FacetCounts { facet_counts }
}
}
/// Intermediary result of the `FacetCollector` that stores
/// the facet counts for all the segments.
pub struct FacetCounts {
facet_counts: BTreeMap<Facet, u64>,
}
pub struct FacetChildIterator<'a> {
underlying: btree_map::Range<'a, Facet, u64>,
}
impl<'a> Iterator for FacetChildIterator<'a> {
type Item = (&'a Facet, u64);
fn next(&mut self) -> Option<Self::Item> {
self.underlying.next().map(|(facet, count)| (facet, *count))
}
}
impl FacetCounts {
pub fn get<T>(&self, facet_from: T) -> FacetChildIterator<'_>
where
Facet: From<T>,
{
let facet = Facet::from(facet_from);
let left_bound = Bound::Excluded(facet.clone());
let right_bound = if facet.is_root() {
Bound::Unbounded
} else {
let mut facet_after_bytes: String = facet.encoded_str().to_owned();
facet_after_bytes.push('\u{1}');
let facet_after = Facet::from_encoded_string(facet_after_bytes);
Bound::Excluded(facet_after)
};
let underlying: btree_map::Range<'_, _, _> =
self.facet_counts.range((left_bound, right_bound));
FacetChildIterator { underlying }
}
pub fn top_k<T>(&self, facet: T, k: usize) -> Vec<(&Facet, u64)>
where
Facet: From<T>,
{
let mut heap = BinaryHeap::with_capacity(k);
let mut it = self.get(facet);
// push the first k elements to first bring the heap
// to capacity
for (facet, count) in (&mut it).take(k) {
heap.push(Hit { count, facet });
}
let mut lowest_count: u64 = heap.peek().map(|hit| hit.count).unwrap_or(u64::MIN); //< the `unwrap_or` case may be triggered but the value
// is never used in that case.
for (facet, count) in it {
if count > lowest_count {
if let Some(mut head) = heap.peek_mut() {
*head = Hit { count, facet };
}
// the heap gets reconstructed at this point
if let Some(head) = heap.peek() {
lowest_count = head.count;
}
}
}
heap.into_sorted_vec()
.into_iter()
.map(|hit| (hit.facet, hit.count))
.collect::<Vec<_>>()
}
}
#[cfg(test)]
mod tests {
use super::{FacetCollector, FacetCounts};
use crate::core::Index;
use crate::query::AllQuery;
use crate::schema::{Document, Facet, Field, Schema};
use rand::distributions::Uniform;
use rand::prelude::SliceRandom;
use rand::{thread_rng, Rng};
use std::iter;
#[test]
fn test_facet_collector_drilldown() {
let mut schema_builder = Schema::builder();
let facet_field = schema_builder.add_facet_field("facet");
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_with_num_threads(1, 3_000_000).unwrap();
let num_facets: usize = 3 * 4 * 5;
let facets: Vec<Facet> = (0..num_facets)
.map(|mut n| {
let top = n % 3;
n /= 3;
let mid = n % 4;
n /= 4;
let leaf = n % 5;
Facet::from(&format!("/top{}/mid{}/leaf{}", top, mid, leaf))
})
.collect();
for i in 0..num_facets * 10 {
let mut doc = Document::new();
doc.add_facet(facet_field, facets[i % num_facets].clone());
index_writer.add_document(doc);
}
index_writer.commit().unwrap();
let reader = index.reader().unwrap();
let searcher = reader.searcher();
let mut facet_collector = FacetCollector::for_field(facet_field);
facet_collector.add_facet(Facet::from("/top1"));
let counts = searcher.search(&AllQuery, &facet_collector).unwrap();
{
let facets: Vec<(String, u64)> = counts
.get("/top1")
.map(|(facet, count)| (facet.to_string(), count))
.collect();
assert_eq!(
facets,
[
("/top1/mid0", 50),
("/top1/mid1", 50),
("/top1/mid2", 50),
("/top1/mid3", 50),
]
.iter()
.map(|&(facet_str, count)| (String::from(facet_str), count))
.collect::<Vec<_>>()
);
}
}
#[test]
#[should_panic(expected = "Tried to add a facet which is a descendant of \
an already added facet.")]
fn test_misused_facet_collector() {
let mut facet_collector = FacetCollector::for_field(Field(0));
facet_collector.add_facet(Facet::from("/country"));
facet_collector.add_facet(Facet::from("/country/europe"));
}
#[test]
fn test_doc_unsorted_multifacet() {
let mut schema_builder = Schema::builder();
let facet_field = schema_builder.add_facet_field("facets");
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_with_num_threads(1, 3_000_000).unwrap();
index_writer.add_document(doc!(
facet_field => Facet::from_text(&"/subjects/A/a"),
facet_field => Facet::from_text(&"/subjects/B/a"),
facet_field => Facet::from_text(&"/subjects/A/b"),
facet_field => Facet::from_text(&"/subjects/B/b"),
));
index_writer.commit().unwrap();
let reader = index.reader().unwrap();
let searcher = reader.searcher();
assert_eq!(searcher.num_docs(), 1);
let mut facet_collector = FacetCollector::for_field(facet_field);
facet_collector.add_facet("/subjects");
let counts = searcher.search(&AllQuery, &facet_collector).unwrap();
let facets: Vec<(&Facet, u64)> = counts.get("/subjects").collect();
assert_eq!(facets[0].1, 1);
}
#[test]
fn test_non_used_facet_collector() {
let mut facet_collector = FacetCollector::for_field(Field(0));
facet_collector.add_facet(Facet::from("/country"));
facet_collector.add_facet(Facet::from("/countryeurope"));
}
#[test]
fn test_facet_collector_topk() {
let mut schema_builder = Schema::builder();
let facet_field = schema_builder.add_facet_field("facet");
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let uniform = Uniform::new_inclusive(1, 100_000);
let mut docs: Vec<Document> = vec![("a", 10), ("b", 100), ("c", 7), ("d", 12), ("e", 21)]
.into_iter()
.flat_map(|(c, count)| {
let facet = Facet::from(&format!("/facet/{}", c));
let doc = doc!(facet_field => facet);
iter::repeat(doc).take(count)
})
.map(|mut doc| {
doc.add_facet(
facet_field,
&format!("/facet/{}", thread_rng().sample(&uniform)),
);
doc
})
.collect();
docs[..].shuffle(&mut thread_rng());
let mut index_writer = index.writer_with_num_threads(1, 3_000_000).unwrap();
for doc in docs {
index_writer.add_document(doc);
}
index_writer.commit().unwrap();
let searcher = index.reader().unwrap().searcher();
let mut facet_collector = FacetCollector::for_field(facet_field);
facet_collector.add_facet("/facet");
let counts: FacetCounts = searcher.search(&AllQuery, &facet_collector).unwrap();
{
let facets: Vec<(&Facet, u64)> = counts.top_k("/facet", 3);
assert_eq!(
facets,
vec![
(&Facet::from("/facet/b"), 100),
(&Facet::from("/facet/e"), 21),
(&Facet::from("/facet/d"), 12),
]
);
}
}
}
#[cfg(all(test, feature = "unstable"))]
mod bench {
use collector::FacetCollector;
use query::AllQuery;
use rand::{thread_rng, Rng};
use schema::Facet;
use schema::Schema;
use test::Bencher;
use Index;
#[bench]
fn bench_facet_collector(b: &mut Bencher) {
let mut schema_builder = Schema::builder();
let facet_field = schema_builder.add_facet_field("facet");
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let mut docs = vec![];
for val in 0..50 {
let facet = Facet::from(&format!("/facet_{}", val));
for _ in 0..val * val {
docs.push(doc!(facet_field=>facet.clone()));
}
}
// 40425 docs
thread_rng().shuffle(&mut docs[..]);
let mut index_writer = index.writer_with_num_threads(1, 3_000_000).unwrap();
for doc in docs {
index_writer.add_document(doc);
}
index_writer.commit().unwrap();
let reader = index.reader().unwrap();
b.iter(|| {
let searcher = reader.searcher();
let facet_collector = FacetCollector::for_field(facet_field);
searcher.search(&AllQuery, &facet_collector).unwrap();
});
}
}


@@ -0,0 +1,127 @@
use std::cmp::Eq;
use std::collections::HashMap;
use std::hash::Hash;
use collector::Collector;
use fastfield::FastFieldReader;
use schema::Field;
use DocId;
use Result;
use Score;
use SegmentReader;
use SegmentLocalId;
/// Facet collector for i64/u64 fast field
pub struct IntFacetCollector<T>
where
T: FastFieldReader,
T::ValueType: Eq + Hash,
{
counters: HashMap<T::ValueType, u64>,
field: Field,
ff_reader: Option<T>,
}
impl<T> IntFacetCollector<T>
where
T: FastFieldReader,
T::ValueType: Eq + Hash,
{
/// Creates a new facet collector for aggregating a given field.
pub fn new(field: Field) -> IntFacetCollector<T> {
IntFacetCollector {
counters: HashMap::new(),
field: field,
ff_reader: None,
}
}
}
impl<T> Collector for IntFacetCollector<T>
where
T: FastFieldReader,
T::ValueType: Eq + Hash,
{
fn set_segment(&mut self, _: SegmentLocalId, reader: &SegmentReader) -> Result<()> {
self.ff_reader = Some(reader.get_fast_field_reader(self.field)?);
Ok(())
}
fn collect(&mut self, doc: DocId, _: Score) {
let val = self.ff_reader
.as_ref()
.expect(
"collect() was called before set_segment. \
This should never happen.",
)
.get(doc);
*(self.counters.entry(val).or_insert(0)) += 1;
}
}
#[cfg(test)]
mod tests {
use collector::{chain, IntFacetCollector};
use query::QueryParser;
use fastfield::{I64FastFieldReader, U64FastFieldReader};
use schema::{self, FAST, STRING};
use Index;
#[test]
// create 10 documents, set num field value to 0 or 1 for even/odd ones
// make sure we have facet counters correctly filled
fn test_facet_collector_results() {
let mut schema_builder = schema::Schema::builder();
let num_field_i64 = schema_builder.add_i64_field("num_i64", FAST);
let num_field_u64 = schema_builder.add_u64_field("num_u64", FAST);
let num_field_f64 = schema_builder.add_f64_field("num_f64", FAST);
let text_field = schema_builder.add_text_field("text", STRING);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema.clone());
{
let mut index_writer = index.writer_with_num_threads(1, 3_000_000).unwrap();
{
for i in 0u64..10u64 {
index_writer.add_document(doc!(
num_field_i64 => ((i as i64) % 3i64) as i64,
num_field_u64 => (i % 2u64) as u64,
num_field_f64 => (i % 4u64) as f64,
text_field => "text"
));
}
}
assert_eq!(index_writer.commit().unwrap(), 10u64);
}
let searcher = index.reader().searcher();
let mut ffvf_i64: IntFacetCollector<I64FastFieldReader> = IntFacetCollector::new(num_field_i64);
let mut ffvf_u64: IntFacetCollector<U64FastFieldReader> = IntFacetCollector::new(num_field_u64);
let mut ffvf_f64: IntFacetCollector<F64FastFieldReader> = IntFacetCollector::new(num_field_f64);
{
// perform the query
let mut facet_collectors = chain().push(&mut ffvf_i64).push(&mut ffvf_u64).push(&mut ffvf_f64);
let mut query_parser = QueryParser::for_index(index, vec![text_field]);
let query = query_parser.parse_query("text:text").unwrap();
query.search(&searcher, &mut facet_collectors).unwrap();
}
assert_eq!(ffvf_u64.counters[&0], 5);
assert_eq!(ffvf_u64.counters[&1], 5);
assert_eq!(ffvf_i64.counters[&0], 4);
assert_eq!(ffvf_i64.counters[&1], 3);
assert_eq!(ffvf_f64.counters[&0.0], 3);
assert_eq!(ffvf_f64.counters[&2.0], 2);
}
}


@@ -1,172 +1,367 @@
use SegmentReader;
use SegmentLocalId;
use DocId;
use Score;
use std::io;
/*!
# Collectors
Collectors define the information you want to extract from the documents matching the queries.
In tantivy jargon, we call this information your search "fruit".
Your fruit could, for instance, be:
- [the count of matching documents](./struct.Count.html)
- [the top 10 documents, by relevancy or by a fast field](./struct.TopDocs.html)
- [facet counts](./struct.FacetCollector.html)
At some point in your code, you will trigger the actual search operation by calling
[the `search(...)` method of your `Searcher` object](../struct.Searcher.html#method.search).
This call will look like this:
```verbatim
let fruit = searcher.search(&query, &collector)?;
```
Here the type of fruit is actually determined as an associated type of the collector (`Collector::Fruit`).
# Combining several collectors
A rich search experience often requires running several collectors on your search query.
For instance:
- selecting the top-K products matching your query
- counting the matching documents
- computing several facets
- computing statistics about the matching product prices
A simple and efficient way to do that is to pass your collectors as one tuple.
The resulting `Fruit` will then be a typed tuple with each collector's original fruits
in their respective position.
```rust
# extern crate tantivy;
# use tantivy::schema::*;
# use tantivy::*;
# use tantivy::query::*;
use tantivy::collector::{Count, TopDocs};
#
# fn main() -> tantivy::Result<()> {
# let mut schema_builder = Schema::builder();
# let title = schema_builder.add_text_field("title", TEXT);
# let schema = schema_builder.build();
# let index = Index::create_in_ram(schema);
# let mut index_writer = index.writer(3_000_000)?;
# index_writer.add_document(doc!(
# title => "The Name of the Wind",
# ));
# index_writer.add_document(doc!(
# title => "The Diary of Muadib",
# ));
# index_writer.commit()?;
# let reader = index.reader()?;
# let searcher = reader.searcher();
# let query_parser = QueryParser::for_index(&index, vec![title]);
# let query = query_parser.parse_query("diary")?;
let (doc_count, top_docs): (usize, Vec<(Score, DocAddress)>) =
searcher.search(&query, &(Count, TopDocs::with_limit(2)))?;
# Ok(())
# }
```
The `Collector` trait is implemented for tuples of up to 4 collectors.
If you have more than 4 collectors, you can either group them into
tuples of tuples `(a,(b,(c,d)))`, or rely on [`MultiCollector`](./struct.MultiCollector.html).
# Combining several collectors dynamically
Combining collectors into a tuple is a zero-cost abstraction: everything
happens as if you had manually implemented a single collector
combining all of your features.
Unfortunately, it requires you to know your collector types at compile time.
If, on the other hand, the collectors depend on some query parameter,
you can rely on `MultiCollector`.
# Implementing your own collectors.
See the `custom_collector` example.
*/
use crate::DocId;
use crate::Result;
use crate::Score;
use crate::SegmentLocalId;
use crate::SegmentReader;
use downcast_rs::impl_downcast;
mod count_collector;
pub use self::count_collector::CountCollector;
pub use self::count_collector::Count;
mod multi_collector;
pub use self::multi_collector::MultiCollector;
mod top_collector;
pub use self::top_collector::TopCollector;
mod chained_collector;
pub use self::chained_collector::chain;
mod top_score_collector;
pub use self::top_score_collector::TopDocs;
/// Collectors are in charge of collecting and retaining relevant
mod custom_score_top_collector;
pub use self::custom_score_top_collector::{CustomScorer, CustomSegmentScorer};
mod tweak_score_top_collector;
pub use self::tweak_score_top_collector::{ScoreSegmentTweaker, ScoreTweaker};
mod facet_collector;
pub use self::facet_collector::FacetCollector;
/// `Fruit` is the type for the result of our collection.
/// e.g. `usize` for the `Count` collector.
pub trait Fruit: Send + downcast_rs::Downcast {}
impl<T> Fruit for T where T: Send + downcast_rs::Downcast {}
/// Collectors are in charge of collecting and retaining relevant
/// information from the documents found and scored by the query.
///
///
/// For instance,
/// For instance,
///
/// - keeping track of the top 10 best documents
/// - computing a breakdown over a fast field
/// - computing the number of documents matching the query
///
/// Queries are in charge of pushing the `DocSet` to the collector.
/// Our search index is in fact a collection of segments, so
/// the `Collector` trait is actually more of a factory that instantiates
/// a `SegmentCollector` for each segment.
///
/// As they work on multiple segments, they first inform
/// the collector of a change in a segment and then
/// call the `collect` method to push the document to the collector.
///
/// Temporally, our collector will receive calls
/// - `.set_segment(0, segment_reader_0)`
/// - `.collect(doc0_of_segment_0)`
/// - `.collect(...)`
/// - `.collect(last_doc_of_segment_0)`
/// - `.set_segment(1, segment_reader_1)`
/// - `.collect(doc0_of_segment_1)`
/// - `.collect(...)`
/// - `.collect(last_doc_of_segment_1)`
/// - `...`
/// - `.collect(last_doc_of_last_segment)`
/// The collection logic itself is in the `SegmentCollector`.
///
/// Segments are not guaranteed to be visited in any specific order.
pub trait Collector {
/// `set_segment` is called before beginning to enumerate
pub trait Collector: Sync {
/// `Fruit` is the type for the result of our collection.
/// e.g. `usize` for the `Count` collector.
type Fruit: Fruit;
/// Type of the `SegmentCollector` associated to this collector.
type Child: SegmentCollector<Fruit = Self::Fruit>;
/// `set_segment` is called before beginning to enumerate
/// on this segment.
fn set_segment(&mut self, segment_local_id: SegmentLocalId, segment: &SegmentReader) -> io::Result<()>;
fn for_segment(
&self,
segment_local_id: SegmentLocalId,
segment: &SegmentReader,
) -> Result<Self::Child>;
/// Returns true iff the collector needs scores to be computed for documents.
fn requires_scoring(&self) -> bool;
/// Combines the fruits associated with the collection of each segment
/// into one fruit.
fn merge_fruits(&self, segment_fruits: Vec<Self::Fruit>) -> Result<Self::Fruit>;
}
/// The `SegmentCollector` is the trait in charge of defining the
/// collect operation at the scale of the segment.
///
/// `.collect(doc, score)` will be called for every document
/// matching the query.
pub trait SegmentCollector: 'static {
/// `Fruit` is the type for the result of our collection.
/// e.g. `usize` for the `Count` collector.
type Fruit: Fruit;
/// The query pushes the scored document to the collector via this method.
fn collect(&mut self, doc: DocId, score: Score);
/// Extract the fruit of the collection from the `SegmentCollector`.
fn harvest(self) -> Self::Fruit;
}
// -----------------------------------------------
// Tuple implementations.
impl<'a, C: Collector> Collector for &'a mut C {
fn set_segment(&mut self, segment_local_id: SegmentLocalId, segment: &SegmentReader) -> io::Result<()> {
(*self).set_segment(segment_local_id, segment)
impl<Left, Right> Collector for (Left, Right)
where
Left: Collector,
Right: Collector,
{
type Fruit = (Left::Fruit, Right::Fruit);
type Child = (Left::Child, Right::Child);
fn for_segment(&self, segment_local_id: u32, segment: &SegmentReader) -> Result<Self::Child> {
let left = self.0.for_segment(segment_local_id, segment)?;
let right = self.1.for_segment(segment_local_id, segment)?;
Ok((left, right))
}
/// The query pushes the scored document to the collector via this method.
fn requires_scoring(&self) -> bool {
self.0.requires_scoring() || self.1.requires_scoring()
}
fn merge_fruits(
&self,
children: Vec<(Left::Fruit, Right::Fruit)>,
) -> Result<(Left::Fruit, Right::Fruit)> {
let mut left_fruits = vec![];
let mut right_fruits = vec![];
for (left_fruit, right_fruit) in children {
left_fruits.push(left_fruit);
right_fruits.push(right_fruit);
}
Ok((
self.0.merge_fruits(left_fruits)?,
self.1.merge_fruits(right_fruits)?,
))
}
}
impl<Left, Right> SegmentCollector for (Left, Right)
where
Left: SegmentCollector,
Right: SegmentCollector,
{
type Fruit = (Left::Fruit, Right::Fruit);
fn collect(&mut self, doc: DocId, score: Score) {
(*self).collect(doc, score);
self.0.collect(doc, score);
self.1.collect(doc, score);
}
fn harvest(self) -> <Self as SegmentCollector>::Fruit {
(self.0.harvest(), self.1.harvest())
}
}
// 3-Tuple
impl<One, Two, Three> Collector for (One, Two, Three)
where
One: Collector,
Two: Collector,
Three: Collector,
{
type Fruit = (One::Fruit, Two::Fruit, Three::Fruit);
type Child = (One::Child, Two::Child, Three::Child);
fn for_segment(&self, segment_local_id: u32, segment: &SegmentReader) -> Result<Self::Child> {
let one = self.0.for_segment(segment_local_id, segment)?;
let two = self.1.for_segment(segment_local_id, segment)?;
let three = self.2.for_segment(segment_local_id, segment)?;
Ok((one, two, three))
}
fn requires_scoring(&self) -> bool {
self.0.requires_scoring() || self.1.requires_scoring() || self.2.requires_scoring()
}
fn merge_fruits(&self, children: Vec<Self::Fruit>) -> Result<Self::Fruit> {
let mut one_fruits = vec![];
let mut two_fruits = vec![];
let mut three_fruits = vec![];
for (one_fruit, two_fruit, three_fruit) in children {
one_fruits.push(one_fruit);
two_fruits.push(two_fruit);
three_fruits.push(three_fruit);
}
Ok((
self.0.merge_fruits(one_fruits)?,
self.1.merge_fruits(two_fruits)?,
self.2.merge_fruits(three_fruits)?,
))
}
}
impl<One, Two, Three> SegmentCollector for (One, Two, Three)
where
One: SegmentCollector,
Two: SegmentCollector,
Three: SegmentCollector,
{
type Fruit = (One::Fruit, Two::Fruit, Three::Fruit);
fn collect(&mut self, doc: DocId, score: Score) {
self.0.collect(doc, score);
self.1.collect(doc, score);
self.2.collect(doc, score);
}
fn harvest(self) -> <Self as SegmentCollector>::Fruit {
(self.0.harvest(), self.1.harvest(), self.2.harvest())
}
}
// 4-Tuple
impl<One, Two, Three, Four> Collector for (One, Two, Three, Four)
where
One: Collector,
Two: Collector,
Three: Collector,
Four: Collector,
{
type Fruit = (One::Fruit, Two::Fruit, Three::Fruit, Four::Fruit);
type Child = (One::Child, Two::Child, Three::Child, Four::Child);
fn for_segment(&self, segment_local_id: u32, segment: &SegmentReader) -> Result<Self::Child> {
let one = self.0.for_segment(segment_local_id, segment)?;
let two = self.1.for_segment(segment_local_id, segment)?;
let three = self.2.for_segment(segment_local_id, segment)?;
let four = self.3.for_segment(segment_local_id, segment)?;
Ok((one, two, three, four))
}
fn requires_scoring(&self) -> bool {
self.0.requires_scoring()
|| self.1.requires_scoring()
|| self.2.requires_scoring()
|| self.3.requires_scoring()
}
fn merge_fruits(&self, children: Vec<Self::Fruit>) -> Result<Self::Fruit> {
let mut one_fruits = vec![];
let mut two_fruits = vec![];
let mut three_fruits = vec![];
let mut four_fruits = vec![];
for (one_fruit, two_fruit, three_fruit, four_fruit) in children {
one_fruits.push(one_fruit);
two_fruits.push(two_fruit);
three_fruits.push(three_fruit);
four_fruits.push(four_fruit);
}
Ok((
self.0.merge_fruits(one_fruits)?,
self.1.merge_fruits(two_fruits)?,
self.2.merge_fruits(three_fruits)?,
self.3.merge_fruits(four_fruits)?,
))
}
}
impl<One, Two, Three, Four> SegmentCollector for (One, Two, Three, Four)
where
One: SegmentCollector,
Two: SegmentCollector,
Three: SegmentCollector,
Four: SegmentCollector,
{
type Fruit = (One::Fruit, Two::Fruit, Three::Fruit, Four::Fruit);
fn collect(&mut self, doc: DocId, score: Score) {
self.0.collect(doc, score);
self.1.collect(doc, score);
self.2.collect(doc, score);
self.3.collect(doc, score);
}
fn harvest(self) -> <Self as SegmentCollector>::Fruit {
(
self.0.harvest(),
self.1.harvest(),
self.2.harvest(),
self.3.harvest(),
)
}
}
impl_downcast!(Fruit);
#[cfg(test)]
pub mod tests {
use super::*;
use test::Bencher;
use DocId;
use Score;
use core::SegmentReader;
use std::io;
use SegmentLocalId;
use fastfield::U32FastFieldReader;
use schema::Field;
/// Stores all of the doc ids.
/// This collector is only used for tests.
/// It is unusable in practise, as it does not store
/// the segment ordinals
pub struct TestCollector {
offset: DocId,
segment_max_doc: DocId,
docs: Vec<DocId>,
}
impl TestCollector {
/// Return the exhaustive list of documents.
pub fn docs(self,) -> Vec<DocId> {
self.docs
}
}
impl Default for TestCollector {
fn default() -> TestCollector {
TestCollector {
docs: Vec::new(),
offset: 0,
segment_max_doc: 0,
}
}
}
impl Collector for TestCollector {
fn set_segment(&mut self, _: SegmentLocalId, reader: &SegmentReader) -> io::Result<()> {
self.offset += self.segment_max_doc;
self.segment_max_doc = reader.max_doc();
Ok(())
}
fn collect(&mut self, doc: DocId, _score: Score) {
self.docs.push(doc + self.offset);
}
}
/// Collects in order all of the fast fields for all of the
/// doc in the `DocSet`
///
/// This collector is mainly useful for tests.
pub struct FastFieldTestCollector {
vals: Vec<u32>,
field: Field,
ff_reader: Option<U32FastFieldReader>,
}
impl FastFieldTestCollector {
pub fn for_field(field: Field) -> FastFieldTestCollector {
FastFieldTestCollector {
vals: Vec::new(),
field: field,
ff_reader: None,
}
}
pub fn vals(&self,) -> &Vec<u32> {
&self.vals
}
}
impl Collector for FastFieldTestCollector {
fn set_segment(&mut self, _: SegmentLocalId, reader: &SegmentReader) -> io::Result<()> {
self.ff_reader = Some(try!(reader.get_fast_field_reader(self.field)));
Ok(())
}
fn collect(&mut self, doc: DocId, _score: Score) {
let val = self.ff_reader.as_ref().unwrap().get(doc);
self.vals.push(val);
}
}
#[bench]
fn build_collector(b: &mut Bencher) {
b.iter(|| {
let mut count_collector = CountCollector::default();
let docs: Vec<u32> = (0..1_000_000).collect();
for doc in docs {
count_collector.collect(doc, 1f32);
}
count_collector.count()
});
}
}
pub mod tests;
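
The split between `Collector` (a per-search factory) and `SegmentCollector` (per-segment state that is harvested, then merged) can be illustrated without tantivy at all. The following is a minimal sketch under simplifying assumptions: the traits are cut-down stand-ins written for this example (no `SegmentReader`, no `Result`, no `requires_scoring`), `MaxScore` and the `search` helper are hypothetical, and segments are modelled as plain vectors of `(doc, score)` pairs.

```rust
// Simplified stand-ins for the traits in this diff; this is not tantivy's API.
type DocId = u32;
type Score = f32;

/// Collects the matching documents of a single segment.
trait SegmentCollector {
    type Fruit;
    fn collect(&mut self, doc: DocId, score: Score);
    fn harvest(self) -> Self::Fruit;
}

/// Factory for per-segment collectors, plus the final merge step.
trait Collector {
    type Fruit;
    type Child: SegmentCollector<Fruit = Self::Fruit>;
    fn for_segment(&self, segment_local_id: u32) -> Self::Child;
    fn merge_fruits(&self, segment_fruits: Vec<Self::Fruit>) -> Self::Fruit;
}

/// Counts matching documents.
struct Count;
struct SegmentCount(usize);

impl Collector for Count {
    type Fruit = usize;
    type Child = SegmentCount;
    fn for_segment(&self, _segment_local_id: u32) -> SegmentCount {
        SegmentCount(0)
    }
    fn merge_fruits(&self, segment_fruits: Vec<usize>) -> usize {
        segment_fruits.into_iter().sum()
    }
}

impl SegmentCollector for SegmentCount {
    type Fruit = usize;
    fn collect(&mut self, _doc: DocId, _score: Score) {
        self.0 += 1;
    }
    fn harvest(self) -> usize {
        self.0
    }
}

/// Hypothetical collector keeping the best score seen, if any.
struct MaxScore;
struct SegmentMaxScore(Option<Score>);

impl Collector for MaxScore {
    type Fruit = Option<Score>;
    type Child = SegmentMaxScore;
    fn for_segment(&self, _segment_local_id: u32) -> SegmentMaxScore {
        SegmentMaxScore(None)
    }
    fn merge_fruits(&self, segment_fruits: Vec<Option<Score>>) -> Option<Score> {
        // Segments without any match contribute `None` and are skipped.
        segment_fruits
            .into_iter()
            .flatten()
            .fold(None, |best: Option<Score>, score| {
                Some(best.map_or(score, |b| b.max(score)))
            })
    }
}

impl SegmentCollector for SegmentMaxScore {
    type Fruit = Option<Score>;
    fn collect(&mut self, _doc: DocId, score: Score) {
        self.0 = Some(self.0.map_or(score, |b| b.max(score)));
    }
    fn harvest(self) -> Option<Score> {
        self.0
    }
}

/// Drives one collector over every segment, then merges the segment fruits.
fn search<C: Collector>(collector: &C, segments: &[Vec<(DocId, Score)>]) -> C::Fruit {
    let mut fruits = Vec::with_capacity(segments.len());
    for (segment_local_id, docs) in segments.iter().enumerate() {
        let mut child = collector.for_segment(segment_local_id as u32);
        for &(doc, score) in docs {
            child.collect(doc, score);
        }
        fruits.push(child.harvest());
    }
    collector.merge_fruits(fruits)
}

fn main() {
    let segments = vec![vec![(0, 1.0), (1, 0.5)], vec![], vec![(0, 2.0)]];
    println!("count = {}", search(&Count, &segments));
    println!("max score = {:?}", search(&MaxScore, &segments));
}
```

Because the parent collector is only borrowed immutably and every segment gets its own child, the per-segment loop could in principle run on several threads, which is presumably why the new trait requires `Collector: Sync`.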


@@ -1,63 +1,298 @@
use std::io;
use super::Collector;
use DocId;
use Score;
use SegmentReader;
use SegmentLocalId;
use super::SegmentCollector;
use crate::collector::Fruit;
use crate::DocId;
use crate::Result;
use crate::Score;
use crate::SegmentLocalId;
use crate::SegmentReader;
use crate::TantivyError;
use std::marker::PhantomData;
use std::ops::Deref;
pub struct MultiFruit {
sub_fruits: Vec<Option<Box<dyn Fruit>>>,
}
pub struct CollectorWrapper<TCollector: Collector>(TCollector);
impl<TCollector: Collector> Collector for CollectorWrapper<TCollector> {
type Fruit = Box<dyn Fruit>;
type Child = Box<dyn BoxableSegmentCollector>;
fn for_segment(
&self,
segment_local_id: u32,
reader: &SegmentReader,
) -> Result<Box<dyn BoxableSegmentCollector>> {
let child = self.0.for_segment(segment_local_id, reader)?;
Ok(Box::new(SegmentCollectorWrapper(child)))
}
fn requires_scoring(&self) -> bool {
self.0.requires_scoring()
}
fn merge_fruits(&self, children: Vec<<Self as Collector>::Fruit>) -> Result<Box<dyn Fruit>> {
let typed_fruit: Vec<TCollector::Fruit> = children
.into_iter()
.map(|untyped_fruit| {
untyped_fruit
.downcast::<TCollector::Fruit>()
.map(|boxed_but_typed| *boxed_but_typed)
.map_err(|_| {
TantivyError::InvalidArgument("Failed to cast child fruit.".to_string())
})
})
.collect::<Result<_>>()?;
let merged_fruit = self.0.merge_fruits(typed_fruit)?;
Ok(Box::new(merged_fruit))
}
}
impl SegmentCollector for Box<dyn BoxableSegmentCollector> {
type Fruit = Box<dyn Fruit>;
fn collect(&mut self, doc: u32, score: f32) {
self.as_mut().collect(doc, score);
}
fn harvest(self) -> Box<dyn Fruit> {
BoxableSegmentCollector::harvest_from_box(self)
}
}
pub trait BoxableSegmentCollector {
fn collect(&mut self, doc: u32, score: f32);
fn harvest_from_box(self: Box<Self>) -> Box<dyn Fruit>;
}
pub struct SegmentCollectorWrapper<TSegmentCollector: SegmentCollector>(TSegmentCollector);
impl<TSegmentCollector: SegmentCollector> BoxableSegmentCollector
for SegmentCollectorWrapper<TSegmentCollector>
{
fn collect(&mut self, doc: u32, score: f32) {
self.0.collect(doc, score);
}
fn harvest_from_box(self: Box<Self>) -> Box<dyn Fruit> {
Box::new(self.0.harvest())
}
}
pub struct FruitHandle<TFruit: Fruit> {
pos: usize,
_phantom: PhantomData<TFruit>,
}
impl<TFruit: Fruit> FruitHandle<TFruit> {
pub fn extract(self, fruits: &mut MultiFruit) -> TFruit {
let boxed_fruit = fruits.sub_fruits[self.pos]
.take()
.expect("Fruit is missing: it has already been extracted.");
*boxed_fruit
.downcast::<TFruit>()
.map_err(|_| ())
.expect("Failed to downcast collector fruit.")
}
}
/// Multicollector makes it possible to collect on more than one collector.
/// It should only be used for use cases where the collector types are unknown
/// It should only be used for use cases where the collector types are unknown
/// at compile time.
/// If the type of the collectors is known, you should prefer to use `ChainedCollector`.
///
/// If the type of the collectors is known, you can just group your collectors
/// in a tuple. See the
/// [Combining several collectors section of the collector documentation](./index.html#combining-several-collectors).
///
/// ```rust
/// #[macro_use]
/// extern crate tantivy;
/// use tantivy::schema::{Schema, TEXT};
/// use tantivy::{Index, Result};
/// use tantivy::collector::{Count, TopDocs, MultiCollector};
/// use tantivy::query::QueryParser;
///
/// # fn main() { example().unwrap(); }
/// fn example() -> Result<()> {
/// let mut schema_builder = Schema::builder();
/// let title = schema_builder.add_text_field("title", TEXT);
/// let schema = schema_builder.build();
/// let index = Index::create_in_ram(schema);
/// {
/// let mut index_writer = index.writer(3_000_000)?;
/// index_writer.add_document(doc!(
/// title => "The Name of the Wind",
/// ));
/// index_writer.add_document(doc!(
/// title => "The Diary of Muadib",
/// ));
/// index_writer.add_document(doc!(
/// title => "A Dairy Cow",
/// ));
/// index_writer.add_document(doc!(
/// title => "The Diary of a Young Girl",
/// ));
/// index_writer.commit().unwrap();
/// }
///
/// let reader = index.reader()?;
/// let searcher = reader.searcher();
///
/// let mut collectors = MultiCollector::new();
/// let top_docs_handle = collectors.add_collector(TopDocs::with_limit(2));
/// let count_handle = collectors.add_collector(Count);
/// let query_parser = QueryParser::for_index(&index, vec![title]);
/// let query = query_parser.parse_query("diary")?;
/// let mut multi_fruit = searcher.search(&query, &collectors)?;
///
/// let count = count_handle.extract(&mut multi_fruit);
/// let top_docs = top_docs_handle.extract(&mut multi_fruit);
///
/// # assert_eq!(count, 2);
/// # assert_eq!(top_docs.len(), 2);
///
/// Ok(())
/// }
/// ```
#[allow(clippy::type_complexity)]
#[derive(Default)]
pub struct MultiCollector<'a> {
collectors: Vec<&'a mut Collector>,
collector_wrappers: Vec<
Box<dyn Collector<Child = Box<dyn BoxableSegmentCollector>, Fruit = Box<dyn Fruit>> + 'a>,
>,
}
impl<'a> MultiCollector<'a> {
/// Constructor
pub fn from(collectors: Vec<&'a mut Collector>) -> MultiCollector {
MultiCollector {
collectors: collectors,
/// Create a new `MultiCollector`
pub fn new() -> Self {
Default::default()
}
/// Add a new collector to our `MultiCollector`.
pub fn add_collector<'b: 'a, TCollector: Collector + 'b>(
&mut self,
collector: TCollector,
) -> FruitHandle<TCollector::Fruit> {
let pos = self.collector_wrappers.len();
self.collector_wrappers
.push(Box::new(CollectorWrapper(collector)));
FruitHandle {
pos,
_phantom: PhantomData,
}
}
}
impl<'a> Collector for MultiCollector<'a> {
fn set_segment(&mut self, segment_local_id: SegmentLocalId, segment: &SegmentReader) -> io::Result<()> {
for collector in &mut self.collectors {
try!(collector.set_segment(segment_local_id, segment));
}
Ok(())
type Fruit = MultiFruit;
type Child = MultiCollectorChild;
fn for_segment(
&self,
segment_local_id: SegmentLocalId,
segment: &SegmentReader,
) -> Result<MultiCollectorChild> {
let children = self
.collector_wrappers
.iter()
.map(|collector_wrapper| collector_wrapper.for_segment(segment_local_id, segment))
.collect::<Result<Vec<_>>>()?;
Ok(MultiCollectorChild { children })
}
fn collect(&mut self, doc: DocId, score: Score) {
for collector in &mut self.collectors {
collector.collect(doc, score);
fn requires_scoring(&self) -> bool {
self.collector_wrappers
.iter()
.map(Deref::deref)
.any(Collector::requires_scoring)
}
fn merge_fruits(&self, segments_multifruits: Vec<MultiFruit>) -> Result<MultiFruit> {
let mut segment_fruits_list: Vec<Vec<Box<dyn Fruit>>> = (0..self.collector_wrappers.len())
.map(|_| Vec::with_capacity(segments_multifruits.len()))
.collect::<Vec<_>>();
for segment_multifruit in segments_multifruits {
for (idx, segment_fruit_opt) in segment_multifruit.sub_fruits.into_iter().enumerate() {
if let Some(segment_fruit) = segment_fruit_opt {
segment_fruits_list[idx].push(segment_fruit);
}
}
}
let sub_fruits = self
.collector_wrappers
.iter()
.zip(segment_fruits_list)
.map(|(child_collector, segment_fruits)| {
Ok(Some(child_collector.merge_fruits(segment_fruits)?))
})
.collect::<Result<_>>()?;
Ok(MultiFruit { sub_fruits })
}
}
pub struct MultiCollectorChild {
children: Vec<Box<dyn BoxableSegmentCollector>>,
}
impl SegmentCollector for MultiCollectorChild {
type Fruit = MultiFruit;
fn collect(&mut self, doc: DocId, score: Score) {
for child in &mut self.children {
child.collect(doc, score);
}
}
fn harvest(self) -> MultiFruit {
MultiFruit {
sub_fruits: self
.children
.into_iter()
.map(|child| Some(child.harvest()))
.collect(),
}
}
}
#[cfg(test)]
mod tests {
use super::*;
use collector::{Collector, CountCollector, TopCollector};
use crate::collector::{Count, TopDocs};
use crate::query::TermQuery;
use crate::schema::IndexRecordOption;
use crate::schema::{Schema, TEXT};
use crate::Index;
use crate::Term;
#[test]
fn test_multi_collector() {
let mut top_collector = TopCollector::with_limit(2);
let mut count_collector = CountCollector::default();
let mut schema_builder = Schema::builder();
let text = schema_builder.add_text_field("text", TEXT);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
{
let mut collectors = MultiCollector::from(vec!(&mut top_collector, &mut count_collector));
collectors.collect(1, 0.2);
collectors.collect(2, 0.1);
collectors.collect(3, 0.5);
let mut index_writer = index.writer_with_num_threads(1, 3_000_000).unwrap();
index_writer.add_document(doc!(text=>"abc"));
index_writer.add_document(doc!(text=>"abc abc abc"));
index_writer.add_document(doc!(text=>"abc abc"));
index_writer.commit().unwrap();
index_writer.add_document(doc!(text=>""));
index_writer.add_document(doc!(text=>"abc abc abc abc"));
index_writer.add_document(doc!(text=>"abc"));
index_writer.commit().unwrap();
}
assert_eq!(count_collector.count(), 3);
assert!(top_collector.at_capacity());
let searcher = index.reader().unwrap().searcher();
let term = Term::from_field_text(text, "abc");
let query = TermQuery::new(term, IndexRecordOption::Basic);
let mut collectors = MultiCollector::new();
let topdocs_handler = collectors.add_collector(TopDocs::with_limit(2));
let count_handler = collectors.add_collector(Count);
let mut multifruits = searcher.search(&query, &mut collectors).unwrap();
assert_eq!(count_handler.extract(&mut multifruits), 5);
assert_eq!(topdocs_handler.extract(&mut multifruits).len(), 2);
}
}
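
The `FruitHandle` / `MultiFruit` pair above is an instance of a general pattern: results are stored type-erased in positional slots, and a typed handle remembers both the slot and the concrete type so the value can be recovered later. Below is a minimal, standalone sketch of that pattern. It assumes `std::any::Any` in place of the `downcast_rs`-based `Fruit` trait, and its `push` method is a hypothetical shortcut that stores a value directly, whereas `MultiCollector::add_collector` registers a collector and the fruit only appears after the search.

```rust
use std::any::Any;
use std::marker::PhantomData;

/// Type-erased results, one slot per registered producer.
#[derive(Default)]
struct MultiFruit {
    sub_fruits: Vec<Option<Box<dyn Any>>>,
}

/// Remembers where a fruit lives and what its concrete type is.
struct FruitHandle<T> {
    pos: usize,
    _phantom: PhantomData<T>,
}

impl MultiFruit {
    /// Hypothetical helper: stores a value and returns its typed handle.
    fn push<T: 'static>(&mut self, fruit: T) -> FruitHandle<T> {
        let pos = self.sub_fruits.len();
        self.sub_fruits.push(Some(Box::new(fruit)));
        FruitHandle {
            pos,
            _phantom: PhantomData,
        }
    }
}

impl<T: 'static> FruitHandle<T> {
    /// Moves the fruit out of its slot and restores its static type.
    fn extract(self, fruits: &mut MultiFruit) -> T {
        let boxed = fruits.sub_fruits[self.pos]
            .take()
            .expect("fruit was already extracted");
        *boxed
            .downcast::<T>()
            .unwrap_or_else(|_| panic!("fruit has an unexpected type"))
    }
}

fn main() {
    let mut fruits = MultiFruit::default();
    let count_handle = fruits.push(2usize);
    let titles_handle = fruits.push(vec!["The Diary of Muadib".to_string()]);

    // Handles can be redeemed in any order; each one is consumed by `extract`.
    let titles: Vec<String> = titles_handle.extract(&mut fruits);
    let count: usize = count_handle.extract(&mut fruits);
    println!("{} hit(s), first title: {:?}", count, titles.first());
}
```

Taking `self` by value in `extract` means a handle can be redeemed only once, so the `take()` on the slot is only expected to fail when a handle is used against a `MultiFruit` it was not issued for.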

src/collector/tests.rs Normal file

@@ -0,0 +1,217 @@
use super::*;
use crate::core::SegmentReader;
use crate::fastfield::BytesFastFieldReader;
use crate::fastfield::FastFieldReader;
use crate::schema::Field;
use crate::DocAddress;
use crate::DocId;
use crate::Score;
use crate::SegmentLocalId;
pub const TEST_COLLECTOR_WITH_SCORE: TestCollector = TestCollector {
compute_score: true,
};
pub const TEST_COLLECTOR_WITHOUT_SCORE: TestCollector = TestCollector {
compute_score: false,
};
/// Stores all of the doc ids.
/// This collector is only used for tests.
/// It is unusable in practice, as it does not store
/// the segment ordinals.
pub struct TestCollector {
pub compute_score: bool,
}
pub struct TestSegmentCollector {
segment_id: SegmentLocalId,
fruit: TestFruit,
}
#[derive(Default)]
pub struct TestFruit {
docs: Vec<DocAddress>,
scores: Vec<Score>,
}
impl TestFruit {
/// Return the list of matching documents exhaustively.
pub fn docs(&self) -> &[DocAddress] {
&self.docs[..]
}
pub fn scores(&self) -> &[Score] {
&self.scores[..]
}
}
impl Collector for TestCollector {
type Fruit = TestFruit;
type Child = TestSegmentCollector;
fn for_segment(
&self,
segment_id: SegmentLocalId,
_reader: &SegmentReader,
) -> Result<TestSegmentCollector> {
Ok(TestSegmentCollector {
segment_id,
fruit: TestFruit::default(),
})
}
fn requires_scoring(&self) -> bool {
self.compute_score
}
fn merge_fruits(&self, mut children: Vec<TestFruit>) -> Result<TestFruit> {
children.sort_by_key(|fruit| {
if fruit.docs().is_empty() {
0
} else {
fruit.docs()[0].segment_ord()
}
});
let mut docs = vec![];
let mut scores = vec![];
for child in children {
docs.extend(child.docs());
scores.extend(child.scores);
}
Ok(TestFruit { docs, scores })
}
}
impl SegmentCollector for TestSegmentCollector {
type Fruit = TestFruit;
fn collect(&mut self, doc: DocId, score: Score) {
self.fruit.docs.push(DocAddress(self.segment_id, doc));
self.fruit.scores.push(score);
}
fn harvest(self) -> <Self as SegmentCollector>::Fruit {
self.fruit
}
}
/// Collects in order all of the fast fields for all of the
/// doc in the `DocSet`
///
/// This collector is mainly useful for tests.
pub struct FastFieldTestCollector {
field: Field,
}
pub struct FastFieldSegmentCollector {
vals: Vec<u64>,
reader: FastFieldReader<u64>,
}
impl FastFieldTestCollector {
pub fn for_field(field: Field) -> FastFieldTestCollector {
FastFieldTestCollector { field }
}
}
impl Collector for FastFieldTestCollector {
type Fruit = Vec<u64>;
type Child = FastFieldSegmentCollector;
fn for_segment(
&self,
_: SegmentLocalId,
segment_reader: &SegmentReader,
) -> Result<FastFieldSegmentCollector> {
let reader = segment_reader
.fast_fields()
.u64(self.field)
.expect("Requested field is not a fast field.");
Ok(FastFieldSegmentCollector {
vals: Vec::new(),
reader,
})
}
fn requires_scoring(&self) -> bool {
false
}
fn merge_fruits(&self, children: Vec<Vec<u64>>) -> Result<Vec<u64>> {
Ok(children.into_iter().flat_map(|v| v.into_iter()).collect())
}
}
impl SegmentCollector for FastFieldSegmentCollector {
type Fruit = Vec<u64>;
fn collect(&mut self, doc: DocId, _score: Score) {
let val = self.reader.get(doc);
self.vals.push(val);
}
fn harvest(self) -> Vec<u64> {
self.vals
}
}
/// Collects in order all of the fast field bytes for all of the
/// docs in the `DocSet`
///
/// This collector is mainly useful for tests.
pub struct BytesFastFieldTestCollector {
field: Field,
}
pub struct BytesFastFieldSegmentCollector {
vals: Vec<u8>,
reader: BytesFastFieldReader,
}
impl BytesFastFieldTestCollector {
pub fn for_field(field: Field) -> BytesFastFieldTestCollector {
BytesFastFieldTestCollector { field }
}
}
impl Collector for BytesFastFieldTestCollector {
type Fruit = Vec<u8>;
type Child = BytesFastFieldSegmentCollector;
fn for_segment(
&self,
_segment_local_id: u32,
segment_reader: &SegmentReader,
) -> Result<BytesFastFieldSegmentCollector> {
Ok(BytesFastFieldSegmentCollector {
vals: Vec::new(),
reader: segment_reader
.fast_fields()
.bytes(self.field)
.expect("Field is not a bytes fast field."),
})
}
fn requires_scoring(&self) -> bool {
false
}
fn merge_fruits(&self, children: Vec<Vec<u8>>) -> Result<Vec<u8>> {
Ok(children.into_iter().flat_map(|c| c.into_iter()).collect())
}
}
impl SegmentCollector for BytesFastFieldSegmentCollector {
type Fruit = Vec<u8>;
fn collect(&mut self, doc: u32, _score: f32) {
let data = self.reader.get_bytes(doc);
self.vals.extend(data);
}
fn harvest(self) -> <Self as SegmentCollector>::Fruit {
self.vals
}
}
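
The top collector diff that follows wraps each candidate in a `ComparableDoc` whose ordering is reversed, so that Rust's max-oriented `BinaryHeap` keeps the weakest retained candidate at its head. A minimal standalone sketch of that top-K technique is given here. The `top_k` function and its tuple-based input are assumptions made for illustration and are not part of the diff; as in the diff, incomparable features (for example NaN scores) are treated as equal.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// A feature (score, field value, ...) paired with a document identifier.
struct ComparableDoc<T, D> {
    feature: T,
    doc: D,
}

impl<T: PartialOrd, D> Ord for ComparableDoc<T, D> {
    fn cmp(&self, other: &Self) -> Ordering {
        // Reversed on purpose: the heap's "greatest" element is the one
        // with the smallest feature.
        other
            .feature
            .partial_cmp(&self.feature)
            .unwrap_or(Ordering::Equal)
    }
}

impl<T: PartialOrd, D> PartialOrd for ComparableDoc<T, D> {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl<T: PartialOrd, D> PartialEq for ComparableDoc<T, D> {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}

impl<T: PartialOrd, D> Eq for ComparableDoc<T, D> {}

/// Hypothetical helper: returns the `k` entries with the largest features,
/// best first, in `O(n log k)`.
fn top_k<T: PartialOrd, D>(items: impl IntoIterator<Item = (T, D)>, k: usize) -> Vec<(T, D)> {
    let mut heap = BinaryHeap::with_capacity(k);
    for (feature, doc) in items {
        if heap.len() < k {
            heap.push(ComparableDoc { feature, doc });
        } else if let Some(mut head) = heap.peek_mut() {
            // `head` is the weakest retained entry; replace it if beaten.
            if head.feature < feature {
                *head = ComparableDoc { feature, doc };
            }
        }
    }
    // Ascending in the reversed order means descending by feature.
    heap.into_sorted_vec()
        .into_iter()
        .map(|entry| (entry.feature, entry.doc))
        .collect()
}

fn main() {
    let scored = vec![(0.5, "a"), (2.0, "b"), (1.0, "c"), (3.0, "d")];
    for (score, doc) in top_k(scored, 2) {
        println!("{} -> {}", doc, score);
    }
}
```

Only `k` entries are ever held in the heap, which is where the `O(n log K)` bound quoted in the top collector's documentation comes from.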


@@ -1,195 +1,217 @@
use std::io;
use super::Collector;
use SegmentReader;
use SegmentLocalId;
use DocAddress;
use std::collections::BinaryHeap;
use crate::DocAddress;
use crate::DocId;
use crate::Result;
use crate::SegmentLocalId;
use crate::SegmentReader;
use std::marker::PhantomData;
use std::cmp::Ordering;
use DocId;
use Score;
use std::collections::BinaryHeap;
// Rust heap is a max-heap and we need a min heap.
#[derive(Clone, Copy)]
struct GlobalScoredDoc {
score: Score,
doc_address: DocAddress
/// Contains a feature (field, score, etc.) of a document along with the document address.
///
/// It has a custom implementation of `PartialOrd` that reverses the order. This is because the
/// default Rust heap is a max heap, whereas a min heap is needed.
///
/// WARNING: equality is not what you would expect here.
/// Two elements are equal if their features are equal, regardless of whether
/// their `doc` values are equal. This is fine for this usage, but this
/// struct must never be made public.
struct ComparableDoc<T, D> {
feature: T,
doc: D,
}
impl PartialOrd for GlobalScoredDoc {
fn partial_cmp(&self, other: &GlobalScoredDoc) -> Option<Ordering> {
impl<T: PartialOrd, D> PartialOrd for ComparableDoc<T, D> {
fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
Some(self.cmp(other))
}
}
impl Ord for GlobalScoredDoc {
impl<T: PartialOrd, D> Ord for ComparableDoc<T, D> {
#[inline]
fn cmp(&self, other: &GlobalScoredDoc) -> Ordering {
other.score.partial_cmp(&self.score)
.unwrap_or(
other.doc_address.cmp(&self.doc_address)
)
fn cmp(&self, other: &Self) -> Ordering {
other
.feature
.partial_cmp(&self.feature)
.unwrap_or_else(|| Ordering::Equal)
}
}
impl PartialEq for GlobalScoredDoc {
fn eq(&self, other: &GlobalScoredDoc) -> bool {
impl<T: PartialOrd, D> PartialEq for ComparableDoc<T, D> {
fn eq(&self, other: &Self) -> bool {
self.cmp(other) == Ordering::Equal
}
}
impl Eq for GlobalScoredDoc {}
impl<T: PartialOrd, D> Eq for ComparableDoc<T, D> {}
/// The Top Collector keeps track of the K documents
/// with the best scores.
///
/// The implementation is based on a `BinaryHeap`.
/// The theoretical complexity is `O(n log K)`.
pub struct TopCollector {
pub(crate) struct TopCollector<T> {
limit: usize,
heap: BinaryHeap<GlobalScoredDoc>,
segment_id: u32,
_marker: PhantomData<T>,
}
impl TopCollector {
impl<T> TopCollector<T>
where
T: PartialOrd + Clone,
{
/// Creates a top collector, with a number of documents equal to "limit".
///
/// # Panics
/// The method panics if limit is 0
pub fn with_limit(limit: usize) -> TopCollector {
pub fn with_limit(limit: usize) -> TopCollector<T> {
if limit < 1 {
panic!("Limit must be strictly greater than 0.");
}
TopCollector {
limit: limit,
heap: BinaryHeap::with_capacity(limit),
segment_id: 0,
limit,
_marker: PhantomData,
}
}
/// Returns K best documents sorted in decreasing order.
///
/// Calling this method triggers the sort.
/// The result of the sort is not cached.
pub fn docs(&self) -> Vec<DocAddress> {
self.score_docs()
.into_iter()
.map(|score_doc| score_doc.1)
.collect()
pub fn limit(&self) -> usize {
self.limit
}
/// Returns K best ScoredDocument sorted in decreasing order.
///
/// Calling this method triggers the sort.
/// The result of the sort is not cached.
pub fn score_docs(&self) -> Vec<(Score, DocAddress)> {
let mut scored_docs: Vec<GlobalScoredDoc> = self.heap
.iter()
.cloned()
.collect();
scored_docs.sort();
scored_docs.into_iter()
.map(|GlobalScoredDoc {score, doc_address}| (score, doc_address))
pub fn merge_fruits(
&self,
children: Vec<Vec<(T, DocAddress)>>,
) -> Result<Vec<(T, DocAddress)>> {
if self.limit == 0 {
return Ok(Vec::new());
}
let mut top_collector = BinaryHeap::new();
for child_fruit in children {
for (feature, doc) in child_fruit {
if top_collector.len() < self.limit {
top_collector.push(ComparableDoc { feature, doc });
} else if let Some(mut head) = top_collector.peek_mut() {
if head.feature < feature {
*head = ComparableDoc { feature, doc };
}
}
}
}
Ok(top_collector
.into_sorted_vec()
.into_iter()
.map(|cdoc| (cdoc.feature, cdoc.doc))
.collect())
}
pub(crate) fn for_segment<F: PartialOrd>(
&self,
segment_id: SegmentLocalId,
_: &SegmentReader,
) -> Result<TopSegmentCollector<F>> {
Ok(TopSegmentCollector::new(segment_id, self.limit))
}
}
/// The Top Collector keeps track of the K documents
/// sorted by type `T`.
///
/// The implementation is based on a `BinaryHeap`.
/// The theoretical complexity for collecting the top `K` out of `n` documents
/// is `O(n log K)`.
pub(crate) struct TopSegmentCollector<T> {
limit: usize,
heap: BinaryHeap<ComparableDoc<T, DocId>>,
segment_id: u32,
}
impl<T: PartialOrd> TopSegmentCollector<T> {
fn new(segment_id: SegmentLocalId, limit: usize) -> TopSegmentCollector<T> {
TopSegmentCollector {
limit,
heap: BinaryHeap::with_capacity(limit),
segment_id,
}
}
}
impl<T: PartialOrd + Clone> TopSegmentCollector<T> {
pub fn harvest(self) -> Vec<(T, DocAddress)> {
let segment_id = self.segment_id;
self.heap
.into_sorted_vec()
.into_iter()
.map(|comparable_doc| {
(
comparable_doc.feature,
DocAddress(segment_id, comparable_doc.doc),
)
})
.collect()
}
/// Return true iff at least K documents have gone through
/// the collector.
#[inline]
pub fn at_capacity(&self, ) -> bool {
#[inline(always)]
pub(crate) fn at_capacity(&self) -> bool {
self.heap.len() >= self.limit
}
}
impl Collector for TopCollector {
fn set_segment(&mut self, segment_id: SegmentLocalId, _: &SegmentReader) -> io::Result<()> {
self.segment_id = segment_id;
Ok(())
}
fn collect(&mut self, doc: DocId, score: Score) {
/// Collects a document scored by the given feature
///
/// It collects documents until it has reached the max capacity. Once it reaches capacity, it
/// will compare the lowest scoring item with the given one and keep whichever is greater.
#[inline(always)]
pub fn collect(&mut self, doc: DocId, feature: T) {
if self.at_capacity() {
// It's ok to unwrap as long as a limit of 0 is forbidden.
let limit_doc: GlobalScoredDoc = *self.heap.peek().expect("Top collector with size 0 is forbidden");
if limit_doc.score < score {
let mut mut_head = self.heap.peek_mut().expect("Top collector with size 0 is forbidden");
mut_head.score = score;
mut_head.doc_address = DocAddress(self.segment_id, doc);
if let Some(limit_feature) = self.heap.peek().map(|head| head.feature.clone()) {
if limit_feature < feature {
if let Some(mut head) = self.heap.peek_mut() {
head.feature = feature;
head.doc = doc;
}
}
}
} else {
// we have not reached capacity yet, so we can just push the
// element.
self.heap.push(ComparableDoc { feature, doc });
}
else {
let wrapped_doc = GlobalScoredDoc {
score: score,
doc_address: DocAddress(self.segment_id, doc)
};
self.heap.push(wrapped_doc);
}
}
}
#[cfg(test)]
mod tests {
use super::*;
use DocId;
use Score;
use collector::Collector;
use super::TopSegmentCollector;
use crate::DocAddress;
#[test]
fn test_top_collector_not_at_capacity() {
let mut top_collector = TopCollector::with_limit(4);
let mut top_collector = TopSegmentCollector::new(0, 4);
top_collector.collect(1, 0.8);
top_collector.collect(3, 0.2);
top_collector.collect(5, 0.3);
assert!(!top_collector.at_capacity());
let score_docs: Vec<(Score, DocId)> = top_collector.score_docs()
.into_iter()
.map(|(score, doc_address)| (score, doc_address.doc()))
.collect();
assert_eq!(score_docs, vec!(
(0.8, 1), (0.3, 5), (0.2, 3),
));
assert_eq!(
top_collector.harvest(),
vec![
(0.8, DocAddress(0, 1)),
(0.3, DocAddress(0, 5)),
(0.2, DocAddress(0, 3))
]
);
}
#[test]
fn test_top_collector_at_capacity() {
let mut top_collector = TopCollector::with_limit(4);
let mut top_collector = TopSegmentCollector::new(0, 4);
top_collector.collect(1, 0.8);
top_collector.collect(3, 0.2);
top_collector.collect(5, 0.3);
top_collector.collect(7, 0.9);
top_collector.collect(9, -0.2);
assert!(top_collector.at_capacity());
{
let score_docs: Vec<(Score, DocId)> = top_collector
.score_docs()
.into_iter()
.map(|(score, doc_address)| (score, doc_address.doc()))
.collect();
assert_eq!(score_docs, vec!(
(0.9, 7), (0.8, 1), (0.3, 5), (0.2, 3)
));
}
{
let docs: Vec<DocId> = top_collector
.docs()
.into_iter()
.map(|doc_address| doc_address.doc())
.collect();
assert_eq!(docs, vec!(7, 1, 5, 3));
}
}
#[test]
#[should_panic]
fn test_top_0() {
TopCollector::with_limit(0);
assert_eq!(
top_collector.harvest(),
vec![
(0.9, DocAddress(0, 7)),
(0.8, DocAddress(0, 1)),
(0.3, DocAddress(0, 5)),
(0.2, DocAddress(0, 3))
]
);
}
}
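
The reversed-ordering trick used by `ComparableDoc` and the `peek_mut` replacement in `merge_fruits` can be tried in isolation. The following is a minimal sketch using only the standard library; `Entry` and `top_k` are hypothetical names introduced for illustration and are not part of tantivy.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// Hypothetical stand-in for `ComparableDoc`: compared by `feature` only, in reverse.
struct Entry<T> {
    feature: T,
    doc: u32,
}

impl<T: PartialOrd> Ord for Entry<T> {
    fn cmp(&self, other: &Self) -> Ordering {
        // Reversed on purpose: `BinaryHeap` is a max-heap, so reversing the
        // comparison keeps the smallest feature at the top of the heap.
        other
            .feature
            .partial_cmp(&self.feature)
            .unwrap_or(Ordering::Equal)
    }
}

impl<T: PartialOrd> PartialOrd for Entry<T> {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl<T: PartialOrd> PartialEq for Entry<T> {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}

impl<T: PartialOrd> Eq for Entry<T> {}

/// Keeps the `k` largest features, returned in decreasing order.
fn top_k<T: PartialOrd>(items: impl IntoIterator<Item = (T, u32)>, k: usize) -> Vec<(T, u32)> {
    if k == 0 {
        return Vec::new();
    }
    let mut heap: BinaryHeap<Entry<T>> = BinaryHeap::with_capacity(k);
    for (feature, doc) in items {
        if heap.len() < k {
            heap.push(Entry { feature, doc });
        } else if let Some(mut head) = heap.peek_mut() {
            // `head` is the current minimum; replace it only if the new item is better.
            if head.feature < feature {
                *head = Entry { feature, doc };
            }
        }
    }
    // Ascending order under the reversed `Ord` means decreasing features.
    heap.into_sorted_vec()
        .into_iter()
        .map(|entry| (entry.feature, entry.doc))
        .collect()
}
```

Fed the five `(score, doc)` pairs from `test_top_collector_at_capacity` with `k = 4`, `top_k` returns `(0.9, 7), (0.8, 1), (0.3, 5), (0.2, 3)`, matching the order asserted in that test.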

@@ -0,0 +1,605 @@
use super::Collector;
use crate::collector::custom_score_top_collector::CustomScoreTopCollector;
use crate::collector::top_collector::TopCollector;
use crate::collector::top_collector::TopSegmentCollector;
use crate::collector::tweak_score_top_collector::TweakedScoreTopCollector;
use crate::collector::{
CustomScorer, CustomSegmentScorer, ScoreSegmentTweaker, ScoreTweaker, SegmentCollector,
};
use crate::schema::Field;
use crate::DocAddress;
use crate::DocId;
use crate::Result;
use crate::Score;
use crate::SegmentLocalId;
use crate::SegmentReader;
use std::fmt;
/// The Top Score Collector keeps track of the K documents
/// sorted by their score.
///
/// The implementation is based on a `BinaryHeap`.
/// The theoretical complexity for collecting the top `K` out of `n` documents
/// is `O(n log K)`.
///
/// ```rust
/// #[macro_use]
/// extern crate tantivy;
/// use tantivy::DocAddress;
/// use tantivy::schema::{Schema, TEXT};
/// use tantivy::{Index, Result};
/// use tantivy::collector::TopDocs;
/// use tantivy::query::QueryParser;
///
/// # fn main() { example().unwrap(); }
/// fn example() -> Result<()> {
/// let mut schema_builder = Schema::builder();
/// let title = schema_builder.add_text_field("title", TEXT);
/// let schema = schema_builder.build();
/// let index = Index::create_in_ram(schema);
/// {
/// let mut index_writer = index.writer_with_num_threads(1, 3_000_000)?;
/// index_writer.add_document(doc!(
/// title => "The Name of the Wind",
/// ));
/// index_writer.add_document(doc!(
/// title => "The Diary of Muadib",
/// ));
/// index_writer.add_document(doc!(
/// title => "A Dairy Cow",
/// ));
/// index_writer.add_document(doc!(
/// title => "The Diary of a Young Girl",
/// ));
/// index_writer.commit().unwrap();
/// }
///
/// let reader = index.reader()?;
/// let searcher = reader.searcher();
///
/// let query_parser = QueryParser::for_index(&index, vec![title]);
/// let query = query_parser.parse_query("diary")?;
/// let top_docs = searcher.search(&query, &TopDocs::with_limit(2))?;
///
/// assert_eq!(&top_docs[0], &(0.7261542, DocAddress(0, 1)));
/// assert_eq!(&top_docs[1], &(0.6099695, DocAddress(0, 3)));
///
/// Ok(())
/// }
/// ```
pub struct TopDocs(TopCollector<Score>);
impl fmt::Debug for TopDocs {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
write!(f, "TopDocs({})", self.0.limit())
}
}
impl TopDocs {
/// Creates a top score collector, with a number of documents equal to "limit".
///
/// # Panics
/// The method panics if limit is 0
pub fn with_limit(limit: usize) -> TopDocs {
TopDocs(TopCollector::with_limit(limit))
}
/// Set top-K to rank documents by a given fast field.
///
/// ```rust
/// #[macro_use]
/// extern crate tantivy;
/// # use tantivy::schema::{Schema, FAST, TEXT};
/// # use tantivy::{Index, Result, DocAddress};
/// # use tantivy::query::{Query, QueryParser};
/// use tantivy::Searcher;
/// use tantivy::collector::TopDocs;
/// use tantivy::schema::Field;
///
/// # fn main() -> tantivy::Result<()> {
/// # let mut schema_builder = Schema::builder();
/// # let title = schema_builder.add_text_field("title", TEXT);
/// # let rating = schema_builder.add_u64_field("rating", FAST);
/// # let schema = schema_builder.build();
/// #
/// # let index = Index::create_in_ram(schema);
/// # let mut index_writer = index.writer_with_num_threads(1, 3_000_000)?;
/// # index_writer.add_document(doc!(
/// # title => "The Name of the Wind",
/// # rating => 92u64,
/// # ));
/// # index_writer.add_document(doc!(title => "The Diary of Muadib", rating => 97u64));
/// # index_writer.add_document(doc!(title => "A Dairy Cow", rating => 63u64));
/// # index_writer.add_document(doc!(title => "The Diary of a Young Girl", rating => 80u64));
/// # index_writer.commit()?;
/// # let reader = index.reader()?;
/// # let query = QueryParser::for_index(&index, vec![title]).parse_query("diary")?;
/// # let top_docs = docs_sorted_by_rating(&reader.searcher(), &query, rating)?;
/// # assert_eq!(top_docs,
/// # vec![(97u64, DocAddress(0u32, 1)),
/// # (80u64, DocAddress(0u32, 3))]);
/// # Ok(())
/// # }
///
///
/// /// Searches for the documents matching the given query, and
/// /// collects the top 10 documents, ordered by the u64 field
/// /// `sort_by_field` given as argument.
/// ///
/// /// `sort_by_field` is required to be a FAST field.
/// fn docs_sorted_by_rating(searcher: &Searcher,
/// query: &Query,
/// sort_by_field: Field)
/// -> Result<Vec<(u64, DocAddress)>> {
///
/// // This is where we build our topdocs collector
/// //
/// // Note the generics parameter that needs to match the
/// // type `sort_by_field`.
/// let top_docs_by_rating = TopDocs
/// ::with_limit(10)
/// .order_by_u64_field(sort_by_field);
///
/// // ... and here are our documents. Note this is a simple vec.
/// // The `u64` in the pair is the value of our fast field for
/// // each document.
/// //
/// // The vec is sorted decreasingly by `sort_by_field`, and has a
/// // length of 10, or less if not enough documents matched the
/// // query.
/// let resulting_docs: Vec<(u64, DocAddress)> =
/// searcher.search(query, &top_docs_by_rating)?;
///
/// Ok(resulting_docs)
/// }
/// ```
///
/// # Panics
///
/// May panic if the field requested is not a fast field.
///
pub fn order_by_u64_field(
self,
field: Field,
) -> impl Collector<Fruit = Vec<(u64, DocAddress)>> {
self.custom_score(move |segment_reader: &SegmentReader| {
let ff_reader = segment_reader
.fast_fields()
.u64(field)
.expect("Field requested is not a i64/u64 fast field.");
// TODO: the error message does not match the actual behavior for i64.
move |doc: DocId| ff_reader.get(doc)
})
}
/// Ranks the documents using a custom score.
///
/// This method offers a convenient way to tweak or replace
/// the documents score. As suggested by the prototype you can
/// manually define your own [`ScoreTweaker`](./trait.ScoreTweaker.html)
/// and pass it as an argument, but there is a much simpler way to
/// tweak your score: you can use a closure as in the following
/// example.
///
/// # Example
///
/// Typically, you will want to rely on one or more fast fields,
/// to alter the original relevance `Score`.
///
/// For instance, in the following, we assume that we are implementing
/// an e-commerce website that has a fast field called `popularity`
/// that rates how often a product is bought by users.
///
/// In the following example, we will tweak our ranking a bit by
/// boosting popular products a notch.
///
/// In a more serious application, this tweaking could involve running a
/// learning-to-rank model over various features.
///
/// ```rust
/// #[macro_use]
/// extern crate tantivy;
/// # use tantivy::schema::{Schema, FAST, TEXT};
/// # use tantivy::{Index, DocAddress, DocId, Score};
/// # use tantivy::query::QueryParser;
/// use tantivy::SegmentReader;
/// use tantivy::collector::TopDocs;
/// use tantivy::schema::Field;
///
/// # fn create_schema() -> Schema {
/// # let mut schema_builder = Schema::builder();
/// # schema_builder.add_text_field("product_name", TEXT);
/// # schema_builder.add_u64_field("popularity", FAST);
/// # schema_builder.build()
/// # }
/// #
/// # fn main() -> tantivy::Result<()> {
/// # let schema = create_schema();
/// # let index = Index::create_in_ram(schema);
/// # let mut index_writer = index.writer_with_num_threads(1, 3_000_000)?;
/// # let product_name = index.schema().get_field("product_name").unwrap();
/// #
/// let popularity: Field = index.schema().get_field("popularity").unwrap();
/// # index_writer.add_document(doc!(product_name => "The Diary of Muadib", popularity => 1u64));
/// # index_writer.add_document(doc!(product_name => "A Dairy Cow", popularity => 10u64));
/// # index_writer.add_document(doc!(product_name => "The Diary of a Young Girl", popularity => 15u64));
/// # index_writer.commit()?;
/// // ...
/// # let user_query = "diary";
/// # let query = QueryParser::for_index(&index, vec![product_name]).parse_query(user_query)?;
///
/// // This is where we build our collector with our custom score.
/// let top_docs_by_custom_score = TopDocs
/// ::with_limit(10)
/// .tweak_score(move |segment_reader: &SegmentReader| {
/// // The argument is a function that returns our scoring
/// // function.
/// //
/// // The point of this "mother" function is to gather all
/// // of the segment level information we need for scoring.
/// // Typically, fast_fields.
/// //
/// // In our case, we will get a reader for the popularity
/// // fast field.
/// let popularity_reader =
/// segment_reader.fast_fields().u64(popularity).unwrap();
///
/// // We can now define our actual scoring function
/// move |doc: DocId, original_score: Score| {
/// let popularity: u64 = popularity_reader.get(doc);
/// // Well.. For the sake of the example we use a simple logarithm
/// // function.
/// let popularity_boost_score = ((2u64 + popularity) as f32).log2();
/// popularity_boost_score * original_score
/// }
/// });
/// # let reader = index.reader()?;
/// # let searcher = reader.searcher();
/// // ... and here are our documents. Note this is a simple vec.
/// // The `Score` in the pair is our tweaked score.
/// let resulting_docs: Vec<(Score, DocAddress)> =
/// searcher.search(&*query, &top_docs_by_custom_score)?;
///
/// # Ok(())
/// # }
/// ```
///
/// # See also
/// [custom_score(...)](#method.custom_score).
pub fn tweak_score<TScore, TScoreSegmentTweaker, TScoreTweaker>(
self,
score_tweaker: TScoreTweaker,
) -> impl Collector<Fruit = Vec<(TScore, DocAddress)>>
where
TScore: 'static + Send + Sync + Clone + PartialOrd,
TScoreSegmentTweaker: ScoreSegmentTweaker<TScore> + 'static,
TScoreTweaker: ScoreTweaker<TScore, Child = TScoreSegmentTweaker>,
{
TweakedScoreTopCollector::new(score_tweaker, self.0.limit())
}
/// Ranks the documents using a custom score.
///
/// This method offers a convenient way to use a different score.
///
/// As suggested by the prototype you can manually define your
/// own [`CustomScorer`](./trait.CustomScorer.html)
/// and pass it as an argument, but there is a much simpler way to
/// tweak your score: you can use a closure as in the following
/// example.
///
/// # Limitation
///
/// This method only makes it possible to compute the score from a given
/// `DocId`, the fast field values for the doc, and any information you could
/// have precomputed beforehand. It does not make it possible, for instance,
/// to compute something like TfIdf, as it does not have access to the list of query
/// terms present in the document, nor the term frequencies for the different terms.
///
/// It can be used if your search engine relies on a learning-to-rank model for instance,
/// which does not rely on the term frequencies or positions as features.
///
/// # Example
///
/// ```rust
/// # #[macro_use]
/// # extern crate tantivy;
/// # use tantivy::schema::{Schema, FAST, TEXT};
/// # use tantivy::{Index, DocAddress, DocId};
/// # use tantivy::query::QueryParser;
/// use tantivy::SegmentReader;
/// use tantivy::collector::TopDocs;
/// use tantivy::schema::Field;
///
/// # fn create_schema() -> Schema {
/// # let mut schema_builder = Schema::builder();
/// # schema_builder.add_text_field("product_name", TEXT);
/// # schema_builder.add_u64_field("popularity", FAST);
/// # schema_builder.add_u64_field("boosted", FAST);
/// # schema_builder.build()
/// # }
/// #
/// # fn main() -> tantivy::Result<()> {
/// # let schema = create_schema();
/// # let index = Index::create_in_ram(schema);
/// # let mut index_writer = index.writer_with_num_threads(1, 3_000_000)?;
/// # let product_name = index.schema().get_field("product_name").unwrap();
/// #
/// let popularity: Field = index.schema().get_field("popularity").unwrap();
/// let boosted: Field = index.schema().get_field("boosted").unwrap();
/// # index_writer.add_document(doc!(boosted=>1u64, product_name => "The Diary of Muadib", popularity => 1u64));
/// # index_writer.add_document(doc!(boosted=>0u64, product_name => "A Dairy Cow", popularity => 10u64));
/// # index_writer.add_document(doc!(boosted=>0u64, product_name => "The Diary of a Young Girl", popularity => 15u64));
/// # index_writer.commit()?;
/// // ...
/// # let user_query = "diary";
/// # let query = QueryParser::for_index(&index, vec![product_name]).parse_query(user_query)?;
///
/// // This is where we build our collector with our custom score.
/// let top_docs_by_custom_score = TopDocs
/// ::with_limit(10)
/// .custom_score(move |segment_reader: &SegmentReader| {
/// // The argument is a function that returns our scoring
/// // function.
/// //
/// // The point of this "mother" function is to gather all
/// // of the segment level information we need for scoring.
/// // Typically, fast_fields.
/// //
/// // In our case, we will get a reader for the popularity
/// // fast field and a boosted field.
/// //
/// // We want boosted items to rank first and, when we get
/// // a tie, the item with the highest popularity to win.
/// //
/// // Note that this is implemented by using a `(u64, u64)`
/// // as a score.
/// let popularity_reader =
/// segment_reader.fast_fields().u64(popularity).unwrap();
/// let boosted_reader =
/// segment_reader.fast_fields().u64(boosted).unwrap();
///
/// // We can now define our actual scoring function
/// move |doc: DocId| {
/// let popularity: u64 = popularity_reader.get(doc);
/// let boosted: u64 = boosted_reader.get(doc);
/// // Scores do not have to be `f32` in tantivy.
/// // Here we return a pair to get lexicographical order
/// // for free.
/// (boosted, popularity)
/// }
/// });
/// # let reader = index.reader()?;
/// # let searcher = reader.searcher();
/// // ... and here are our documents. Note this is a simple vec.
/// // The `(u64, u64)` in the pair is our custom score.
/// let resulting_docs: Vec<((u64, u64), DocAddress)> =
/// searcher.search(&*query, &top_docs_by_custom_score)?;
///
/// # Ok(())
/// # }
/// ```
///
/// # See also
/// [tweak_score(...)](#method.tweak_score).
pub fn custom_score<TScore, TCustomSegmentScorer, TCustomScorer>(
self,
custom_score: TCustomScorer,
) -> impl Collector<Fruit = Vec<(TScore, DocAddress)>>
where
TScore: 'static + Send + Sync + Clone + PartialOrd,
TCustomSegmentScorer: CustomSegmentScorer<TScore> + 'static,
TCustomScorer: CustomScorer<TScore, Child = TCustomSegmentScorer>,
{
CustomScoreTopCollector::new(custom_score, self.0.limit())
}
}
impl Collector for TopDocs {
type Fruit = Vec<(Score, DocAddress)>;
type Child = TopScoreSegmentCollector;
fn for_segment(
&self,
segment_local_id: SegmentLocalId,
reader: &SegmentReader,
) -> Result<Self::Child> {
let collector = self.0.for_segment(segment_local_id, reader)?;
Ok(TopScoreSegmentCollector(collector))
}
fn requires_scoring(&self) -> bool {
true
}
fn merge_fruits(&self, child_fruits: Vec<Vec<(Score, DocAddress)>>) -> Result<Self::Fruit> {
self.0.merge_fruits(child_fruits)
}
}
/// Segment Collector associated to `TopDocs`.
pub struct TopScoreSegmentCollector(TopSegmentCollector<Score>);
impl SegmentCollector for TopScoreSegmentCollector {
type Fruit = Vec<(Score, DocAddress)>;
fn collect(&mut self, doc: DocId, score: Score) {
self.0.collect(doc, score)
}
fn harvest(self) -> Vec<(Score, DocAddress)> {
self.0.harvest()
}
}
#[cfg(test)]
mod tests {
use super::TopDocs;
use crate::collector::Collector;
use crate::query::{Query, QueryParser};
use crate::schema::{Field, Schema, FAST, STORED, TEXT};
use crate::DocAddress;
use crate::Index;
use crate::IndexWriter;
use crate::Score;
fn make_index() -> Index {
let mut schema_builder = Schema::builder();
let text_field = schema_builder.add_text_field("text", TEXT);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
{
// writing the segment
let mut index_writer = index.writer_with_num_threads(1, 3_000_000).unwrap();
index_writer.add_document(doc!(text_field=>"Hello happy tax payer."));
index_writer.add_document(doc!(text_field=>"Droopy says hello happy tax payer"));
index_writer.add_document(doc!(text_field=>"I like Droopy"));
assert!(index_writer.commit().is_ok());
}
index
}
#[test]
fn test_top_collector_not_at_capacity() {
let index = make_index();
let field = index.schema().get_field("text").unwrap();
let query_parser = QueryParser::for_index(&index, vec![field]);
let text_query = query_parser.parse_query("droopy tax").unwrap();
let score_docs: Vec<(Score, DocAddress)> = index
.reader()
.unwrap()
.searcher()
.search(&text_query, &TopDocs::with_limit(4))
.unwrap();
assert_eq!(
score_docs,
vec![
(0.81221175, DocAddress(0u32, 1)),
(0.5376842, DocAddress(0u32, 2)),
(0.48527452, DocAddress(0, 0))
]
);
}
#[test]
fn test_top_collector_at_capacity() {
let index = make_index();
let field = index.schema().get_field("text").unwrap();
let query_parser = QueryParser::for_index(&index, vec![field]);
let text_query = query_parser.parse_query("droopy tax").unwrap();
let score_docs: Vec<(Score, DocAddress)> = index
.reader()
.unwrap()
.searcher()
.search(&text_query, &TopDocs::with_limit(2))
.unwrap();
assert_eq!(
score_docs,
vec![
(0.81221175, DocAddress(0u32, 1)),
(0.5376842, DocAddress(0u32, 2)),
]
);
}
#[test]
#[should_panic]
fn test_top_0() {
TopDocs::with_limit(0);
}
const TITLE: &str = "title";
const SIZE: &str = "size";
#[test]
fn test_top_field_collector_not_at_capacity() {
let mut schema_builder = Schema::builder();
let title = schema_builder.add_text_field(TITLE, TEXT);
let size = schema_builder.add_u64_field(SIZE, FAST);
let schema = schema_builder.build();
let (index, query) = index("beer", title, schema, |index_writer| {
index_writer.add_document(doc!(
title => "bottle of beer",
size => 12u64,
));
index_writer.add_document(doc!(
title => "growler of beer",
size => 64u64,
));
index_writer.add_document(doc!(
title => "pint of beer",
size => 16u64,
));
});
let searcher = index.reader().unwrap().searcher();
let top_collector = TopDocs::with_limit(4).order_by_u64_field(size);
let top_docs: Vec<(u64, DocAddress)> = searcher.search(&query, &top_collector).unwrap();
assert_eq!(
top_docs,
vec![
(64, DocAddress(0, 1)),
(16, DocAddress(0, 2)),
(12, DocAddress(0, 0))
]
);
}
#[test]
#[should_panic]
fn test_field_does_not_exist() {
let mut schema_builder = Schema::builder();
let title = schema_builder.add_text_field(TITLE, TEXT);
let size = schema_builder.add_u64_field(SIZE, FAST);
let schema = schema_builder.build();
let (index, _) = index("beer", title, schema, |index_writer| {
index_writer.add_document(doc!(
title => "bottle of beer",
size => 12u64,
));
});
let searcher = index.reader().unwrap().searcher();
let top_collector = TopDocs::with_limit(4).order_by_u64_field(Field(2));
let segment_reader = searcher.segment_reader(0u32);
top_collector
.for_segment(0, segment_reader)
.expect("should panic");
}
#[test]
#[should_panic(expected = "Field requested is not a i64/u64 fast field")]
fn test_field_not_fast_field() {
let mut schema_builder = Schema::builder();
let title = schema_builder.add_text_field(TITLE, TEXT);
let size = schema_builder.add_u64_field(SIZE, STORED);
let schema = schema_builder.build();
let (index, _) = index("beer", title, schema, |index_writer| {
index_writer.add_document(doc!(
title => "bottle of beer",
size => 12u64,
));
});
let searcher = index.reader().unwrap().searcher();
let segment = searcher.segment_reader(0);
let top_collector = TopDocs::with_limit(4).order_by_u64_field(size);
assert!(top_collector.for_segment(0, segment).is_ok());
}
fn index(
query: &str,
query_field: Field,
schema: Schema,
mut doc_adder: impl FnMut(&mut IndexWriter) -> (),
) -> (Index, Box<Query>) {
let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_with_num_threads(1, 3_000_000).unwrap();
doc_adder(&mut index_writer);
index_writer.commit().unwrap();
let query_parser = QueryParser::for_index(&index, vec![query_field]);
let query = query_parser.parse_query(query).unwrap();
(index, query)
}
}
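
The `custom_score` doc example returns a `(boosted, popularity)` tuple and relies on tuples comparing lexicographically. The sketch below shows that ordering on its own, without tantivy. `rank_by_boost_then_popularity` is a hypothetical helper, and it uses a plain sort where the collector uses a heap.

```rust
/// Hypothetical helper: each input item is `(doc, boosted, popularity)`.
/// Returns at most `limit` pairs of `((boosted, popularity), doc)`, best first.
fn rank_by_boost_then_popularity(
    docs: &[(u32, u64, u64)],
    limit: usize,
) -> Vec<((u64, u64), u32)> {
    let mut scored: Vec<((u64, u64), u32)> = docs
        .iter()
        .map(|&(doc, boosted, popularity)| ((boosted, popularity), doc))
        .collect();
    // Tuples compare lexicographically: `boosted` decides first,
    // `popularity` only breaks ties between equal `boosted` values.
    scored.sort_by(|a, b| b.0.cmp(&a.0));
    scored.truncate(limit);
    scored
}
```

With the three products of the doc example (doc 0 boosted with popularity 1, docs 1 and 2 not boosted with popularity 10 and 15), the boosted document comes first despite its low popularity, followed by doc 2 and then doc 1.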

@@ -0,0 +1,129 @@
use crate::collector::top_collector::{TopCollector, TopSegmentCollector};
use crate::collector::{Collector, SegmentCollector};
use crate::DocAddress;
use crate::{DocId, Result, Score, SegmentReader};
pub(crate) struct TweakedScoreTopCollector<TScoreTweaker, TScore = Score> {
score_tweaker: TScoreTweaker,
collector: TopCollector<TScore>,
}
impl<TScoreTweaker, TScore> TweakedScoreTopCollector<TScoreTweaker, TScore>
where
TScore: Clone + PartialOrd,
{
pub fn new(
score_tweaker: TScoreTweaker,
limit: usize,
) -> TweakedScoreTopCollector<TScoreTweaker, TScore> {
TweakedScoreTopCollector {
score_tweaker,
collector: TopCollector::with_limit(limit),
}
}
}
/// A `ScoreSegmentTweaker` makes it possible to modify the default score
/// for a given document belonging to a specific segment.
///
/// It is the segment local version of the [`ScoreTweaker`](./trait.ScoreTweaker.html).
pub trait ScoreSegmentTweaker<TScore>: 'static {
/// Tweak the given `score` for the document `doc`.
fn score(&self, doc: DocId, score: Score) -> TScore;
}
/// `ScoreTweaker` makes it possible to tweak the score
/// emitted by the scorer into another one.
///
/// The `ScoreTweaker` does not do much of the computation itself.
/// Instead, it helps construct `Self::Child` instances that will compute
/// the score at a segment scale.
pub trait ScoreTweaker<TScore>: Sync {
/// Type of the associated [`ScoreSegmentTweaker`](./trait.ScoreSegmentTweaker.html).
type Child: ScoreSegmentTweaker<TScore>;
/// Builds a child tweaker associated with a specific segment.
fn segment_tweaker(&self, segment_reader: &SegmentReader) -> Result<Self::Child>;
}
impl<TScoreTweaker, TScore> Collector for TweakedScoreTopCollector<TScoreTweaker, TScore>
where
TScoreTweaker: ScoreTweaker<TScore>,
TScore: 'static + PartialOrd + Clone + Send + Sync,
{
type Fruit = Vec<(TScore, DocAddress)>;
type Child = TopTweakedScoreSegmentCollector<TScoreTweaker::Child, TScore>;
fn for_segment(
&self,
segment_local_id: u32,
segment_reader: &SegmentReader,
) -> Result<Self::Child> {
let segment_scorer = self.score_tweaker.segment_tweaker(segment_reader)?;
let segment_collector = self
.collector
.for_segment(segment_local_id, segment_reader)?;
Ok(TopTweakedScoreSegmentCollector {
segment_collector,
segment_scorer,
})
}
fn requires_scoring(&self) -> bool {
true
}
fn merge_fruits(&self, segment_fruits: Vec<Self::Fruit>) -> Result<Self::Fruit> {
self.collector.merge_fruits(segment_fruits)
}
}
pub struct TopTweakedScoreSegmentCollector<TSegmentScoreTweaker, TScore>
where
TScore: 'static + PartialOrd + Clone + Send + Sync + Sized,
TSegmentScoreTweaker: ScoreSegmentTweaker<TScore>,
{
segment_collector: TopSegmentCollector<TScore>,
segment_scorer: TSegmentScoreTweaker,
}
impl<TSegmentScoreTweaker, TScore> SegmentCollector
for TopTweakedScoreSegmentCollector<TSegmentScoreTweaker, TScore>
where
TScore: 'static + PartialOrd + Clone + Send + Sync,
TSegmentScoreTweaker: 'static + ScoreSegmentTweaker<TScore>,
{
type Fruit = Vec<(TScore, DocAddress)>;
fn collect(&mut self, doc: DocId, score: Score) {
let score = self.segment_scorer.score(doc, score);
self.segment_collector.collect(doc, score);
}
fn harvest(self) -> Vec<(TScore, DocAddress)> {
self.segment_collector.harvest()
}
}
impl<F, TScore, TSegmentScoreTweaker> ScoreTweaker<TScore> for F
where
F: 'static + Send + Sync + Fn(&SegmentReader) -> TSegmentScoreTweaker,
TSegmentScoreTweaker: ScoreSegmentTweaker<TScore>,
{
type Child = TSegmentScoreTweaker;
fn segment_tweaker(&self, segment_reader: &SegmentReader) -> Result<Self::Child> {
Ok((self)(segment_reader))
}
}
impl<F, TScore> ScoreSegmentTweaker<TScore> for F
where
F: 'static + Sync + Send + Fn(DocId, Score) -> TScore,
{
fn score(&self, doc: DocId, score: Score) -> TScore {
(self)(doc, score)
}
}
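
The two blanket impls above are what let a plain closure returning a closure act as a `ScoreTweaker`. The following standalone sketch reproduces the pattern with simplified, hypothetical types: `Segment` stands in for `SegmentReader`, `Tweaker` and `SegmentTweaker` stand in for the two traits, and the boost formula is made up for illustration.

```rust
type DocId = u32;
type Score = f32;

/// Hypothetical stand-in for a segment reader: one popularity value per document.
struct Segment {
    popularity: Vec<u64>,
}

/// Segment-local scoring, analogous to `ScoreSegmentTweaker`.
trait SegmentTweaker<T> {
    fn score(&self, doc: DocId, score: Score) -> T;
}

/// Factory of segment-local tweakers, analogous to `ScoreTweaker`.
trait Tweaker<T> {
    type Child: SegmentTweaker<T>;
    fn segment_tweaker(&self, segment: &Segment) -> Self::Child;
}

// Any `Fn(DocId, Score) -> T` closure is a segment tweaker.
impl<F, T> SegmentTweaker<T> for F
where
    F: Fn(DocId, Score) -> T,
{
    fn score(&self, doc: DocId, score: Score) -> T {
        (self)(doc, score)
    }
}

// Any `Fn(&Segment) -> C` closure is a tweaker, provided `C` is a segment tweaker.
impl<F, T, C> Tweaker<T> for F
where
    F: Fn(&Segment) -> C,
    C: SegmentTweaker<T>,
{
    type Child = C;
    fn segment_tweaker(&self, segment: &Segment) -> C {
        (self)(segment)
    }
}

/// Applies a tweaker to a list of `(doc, score)` hits from one segment.
fn apply_tweaker<T, W: Tweaker<T>>(tweaker: W, segment: &Segment, hits: &[(DocId, Score)]) -> Vec<T> {
    let child = tweaker.segment_tweaker(segment);
    hits.iter().map(|&(doc, score)| child.score(doc, score)).collect()
}

/// Example use: multiply each score by `1 + popularity` (illustrative formula only).
fn boost_by_popularity(popularity: Vec<u64>, hits: &[(DocId, Score)]) -> Vec<Score> {
    let segment = Segment { popularity };
    apply_tweaker(
        |segment: &Segment| {
            // Segment-level setup happens once; the inner closure owns its data.
            let popularity = segment.popularity.clone();
            move |doc: DocId, score: Score| (1 + popularity[doc as usize]) as Score * score
        },
        &segment,
        hits,
    )
}
```

The outer closure runs once per segment and the inner closure once per document, which is the same split as `segment_tweaker` and `score` in the traits above.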

@@ -1,154 +1,138 @@
use std::io::Write;
use byteorder::{ByteOrder, LittleEndian, WriteBytesExt};
use std::io;
use common::serialize::BinarySerializable;
use std::mem;
use std::ops::Deref;
pub fn compute_num_bits(amplitude: u32) -> u8 {
(32u32 - amplitude.leading_zeros()) as u8
}
pub struct BitPacker {
pub(crate) struct BitPacker {
mini_buffer: u64,
mini_buffer_written: usize,
num_bits: usize,
written_size: usize,
}
impl BitPacker {
pub fn new(num_bits: usize) -> BitPacker {
impl BitPacker {
pub fn new() -> BitPacker {
BitPacker {
mini_buffer: 0u64,
mini_buffer_written: 0,
num_bits: num_bits,
written_size: 0,
}
}
pub fn write<TWrite: Write>(&mut self, val: u32, output: &mut TWrite) -> io::Result<()> {
pub fn write<TWrite: io::Write>(
&mut self,
val: u64,
num_bits: u8,
output: &mut TWrite,
) -> io::Result<()> {
let val_u64 = val as u64;
if self.mini_buffer_written + self.num_bits > 64 {
let num_bits = num_bits as usize;
if self.mini_buffer_written + num_bits > 64 {
self.mini_buffer |= val_u64.wrapping_shl(self.mini_buffer_written as u32);
self.written_size += self.mini_buffer.serialize(output)?;
output.write_u64::<LittleEndian>(self.mini_buffer)?;
self.mini_buffer = val_u64.wrapping_shr((64 - self.mini_buffer_written) as u32);
self.mini_buffer_written = self.mini_buffer_written + (self.num_bits as usize) - 64;
}
else {
self.mini_buffer_written = self.mini_buffer_written + num_bits - 64;
} else {
self.mini_buffer |= val_u64 << self.mini_buffer_written;
self.mini_buffer_written += self.num_bits;
self.mini_buffer_written += num_bits;
if self.mini_buffer_written == 64 {
self.written_size += self.mini_buffer.serialize(output)?;
output.write_u64::<LittleEndian>(self.mini_buffer)?;
self.mini_buffer_written = 0;
self.mini_buffer = 0u64;
}
}
}
Ok(())
}
fn flush<TWrite: Write>(&mut self, output: &mut TWrite) -> io::Result<()>{
pub fn flush<TWrite: io::Write>(&mut self, output: &mut TWrite) -> io::Result<()> {
if self.mini_buffer_written > 0 {
let num_bytes = (self.mini_buffer_written + 7) / 8;
let arr: [u8; 8] = unsafe { mem::transmute::<u64, [u8; 8]>(self.mini_buffer) };
let mut arr: [u8; 8] = [0u8; 8];
LittleEndian::write_u64(&mut arr, self.mini_buffer);
output.write_all(&arr[..num_bytes])?;
self.written_size += num_bytes;
self.mini_buffer_written = 0;
}
Ok(())
}
pub fn close<TWrite: Write>(&mut self, output: &mut TWrite) -> io::Result<usize> {
pub fn close<TWrite: io::Write>(&mut self, output: &mut TWrite) -> io::Result<()> {
self.flush(output)?;
Ok(self.written_size)
// Padding the write file to simplify reads.
output.write_all(&[0u8; 7])?;
Ok(())
}
}
pub struct BitUnpacker {
num_bits: usize,
mask: u32,
data_ptr: *const u8,
data_len: usize,
#[derive(Clone)]
pub struct BitUnpacker<Data>
where
Data: Deref<Target = [u8]>,
{
num_bits: u64,
mask: u64,
data: Data,
}
impl BitUnpacker {
pub fn new(data: &[u8], num_bits: usize) -> BitUnpacker {
impl<Data> BitUnpacker<Data>
where
Data: Deref<Target = [u8]>,
{
pub fn new(data: Data, num_bits: u8) -> BitUnpacker<Data> {
let mask: u64 = if num_bits == 64 {
!0u64
} else {
(1u64 << num_bits) - 1u64
};
BitUnpacker {
num_bits: num_bits,
mask: (1u32 << num_bits) - 1u32,
data_ptr: data.as_ptr(),
data_len: data.len()
num_bits: u64::from(num_bits),
mask,
data,
}
}
pub fn get(&self, idx: usize) -> u32 {
pub fn get(&self, idx: u64) -> u64 {
if self.num_bits == 0 {
return 0;
return 0u64;
}
let addr = (idx * self.num_bits) / 8;
let bit_shift = idx * self.num_bits - addr * 8;
let val_unshifted_unmasked: u64;
if addr + 8 <= self.data_len {
val_unshifted_unmasked = unsafe { * (self.data_ptr.offset(addr as isize) as *const u64) };
}
else {
let mut arr = [0u8; 8];
if addr < self.data_len {
for i in 0..self.data_len - addr {
arr[i] = unsafe { *self.data_ptr.offset( (addr + i) as isize) };
}
}
val_unshifted_unmasked = unsafe { mem::transmute::<[u8; 8], u64>(arr) };
}
let val_shifted = (val_unshifted_unmasked >> bit_shift) as u32;
(val_shifted & self.mask)
let data: &[u8] = &*self.data;
let num_bits = self.num_bits;
let mask = self.mask;
let addr_in_bits = idx * num_bits;
let addr = addr_in_bits >> 3;
let bit_shift = addr_in_bits & 7;
debug_assert!(
addr + 8 <= data.len() as u64,
"The fast field should have been padded with 7 bytes."
);
let val_unshifted_unmasked: u64 = LittleEndian::read_u64(&data[(addr as usize)..]);
let val_shifted = (val_unshifted_unmasked >> bit_shift) as u64;
val_shifted & mask
}
}
#[cfg(test)]
mod test {
use super::{BitPacker, BitUnpacker, compute_num_bits};
#[test]
fn test_compute_num_bits() {
assert_eq!(compute_num_bits(1), 1u8);
assert_eq!(compute_num_bits(0), 0u8);
assert_eq!(compute_num_bits(2), 2u8);
assert_eq!(compute_num_bits(3), 2u8);
assert_eq!(compute_num_bits(4), 3u8);
assert_eq!(compute_num_bits(255), 8u8);
assert_eq!(compute_num_bits(256), 9u8);
}
fn test_bitpacker_util(len: usize, num_bits: usize) {
use super::{BitPacker, BitUnpacker};
fn create_fastfield_bitpacker(len: usize, num_bits: u8) -> (BitUnpacker<Vec<u8>>, Vec<u64>) {
let mut data = Vec::new();
let mut bitpacker = BitPacker::new(num_bits);
let max_val: u32 = (1 << num_bits) - 1;
let vals: Vec<u32> = (0u32..len as u32).map(|i| {
if max_val == 0 {
0
}
else {
i % max_val
}
}).collect();
let mut bitpacker = BitPacker::new();
let max_val: u64 = (1u64 << num_bits as u64) - 1u64;
let vals: Vec<u64> = (0u64..len as u64)
.map(|i| if max_val == 0 { 0 } else { i % max_val })
.collect();
for &val in &vals {
bitpacker.write(val, &mut data).unwrap();
bitpacker.write(val, num_bits, &mut data).unwrap();
}
let num_bytes = bitpacker.close(&mut data).unwrap();
assert_eq!(num_bytes, (num_bits * len + 7) / 8);
assert_eq!(data.len(), num_bytes);
let bitunpacker = BitUnpacker::new(&data, num_bits);
bitpacker.close(&mut data).unwrap();
assert_eq!(data.len(), ((num_bits as usize) * len + 7) / 8 + 7);
let bitunpacker = BitUnpacker::new(data, num_bits);
(bitunpacker, vals)
}
fn test_bitpacker_util(len: usize, num_bits: u8) {
let (bitunpacker, vals) = create_fastfield_bitpacker(len, num_bits);
for (i, val) in vals.iter().enumerate() {
assert_eq!(bitunpacker.get(i), *val);
assert_eq!(bitunpacker.get(i as u64), *val);
}
}
#[test]
fn test_bitpacker() {
test_bitpacker_util(10, 3);
@@ -157,4 +141,4 @@ mod test {
test_bitpacker_util(6, 14);
test_bitpacker_util(1000, 14);
}
}
}
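The padding written by `close` is what lets `get` read every value with one 8-byte little-endian load. A minimal sketch of that idea follows, using only the standard library. The names `pack` and `unpack` are hypothetical and not part of the crate, and the sketch is restricted to at most 56 bits per value so that a value never spans more than 8 bytes.

```rust
/// Packs `vals` with `num_bits` bits each, lowest bits first, and appends
/// 7 zero bytes so that a reader can always load 8 bytes at once.
/// Hypothetical helper for illustration, limited to `num_bits <= 56`.
fn pack(vals: &[u64], num_bits: u8) -> Vec<u8> {
    assert!(num_bits <= 56, "this sketch only supports up to 56 bits");
    let num_bits = u32::from(num_bits);
    let mut out: Vec<u8> = Vec::new();
    let mut buffer: u64 = 0; // bits accumulated but not yet written
    let mut filled: u32 = 0; // number of valid bits in `buffer` (always < 8 here)
    for &val in vals {
        assert!(val < (1u64 << num_bits), "value does not fit in num_bits");
        buffer |= val << filled;
        filled += num_bits;
        // Emit every complete byte.
        while filled >= 8 {
            out.push(buffer as u8);
            buffer >>= 8;
            filled -= 8;
        }
    }
    // Emit the last, partially filled byte if there is one.
    if filled > 0 {
        out.push(buffer as u8);
    }
    // Padding: the last value can now be read with a full 8-byte load.
    out.extend_from_slice(&[0u8; 7]);
    out
}

/// Reads the value at position `idx` from data produced by `pack`.
fn unpack(data: &[u8], num_bits: u8, idx: usize) -> u64 {
    assert!(num_bits <= 56, "this sketch only supports up to 56 bits");
    if num_bits == 0 {
        return 0;
    }
    let num_bits = usize::from(num_bits);
    let mask = (1u64 << num_bits) - 1;
    let addr_in_bits = idx * num_bits;
    let addr = addr_in_bits / 8; // first byte holding the value
    let bit_shift = addr_in_bits % 8; // offset of the value inside that byte
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(&data[addr..addr + 8]);
    (u64::from_le_bytes(bytes) >> bit_shift) & mask
}
```

For example, `unpack(&pack(&[5, 0, 7], 3), 3, 2)` reads back the third value. Without the 7 padding bytes, the slice `data[addr..addr + 8]` would run past the end of the buffer for the last values.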

395
src/common/bitset.rs Normal file

@@ -0,0 +1,395 @@
use std::fmt;
use std::u64;
#[derive(Clone, Copy, Eq, PartialEq)]
pub(crate) struct TinySet(u64);
impl fmt::Debug for TinySet {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
self.into_iter().collect::<Vec<u32>>().fmt(f)
}
}
pub struct TinySetIterator(TinySet);
impl Iterator for TinySetIterator {
type Item = u32;
fn next(&mut self) -> Option<Self::Item> {
self.0.pop_lowest()
}
}
impl IntoIterator for TinySet {
type Item = u32;
type IntoIter = TinySetIterator;
fn into_iter(self) -> Self::IntoIter {
TinySetIterator(self)
}
}
impl TinySet {
/// Returns an empty `TinySet`.
pub fn empty() -> TinySet {
TinySet(0u64)
}
/// Returns the complement of the set in `[0, 64[`.
fn complement(self) -> TinySet {
TinySet(!self.0)
}
/// Returns true iff the `TinySet` contains the element `el`.
pub fn contains(self, el: u32) -> bool {
!self.intersect(TinySet::singleton(el)).is_empty()
}
/// Returns the intersection of `self` and `other`
pub fn intersect(self, other: TinySet) -> TinySet {
TinySet(self.0 & other.0)
}
/// Creates a new `TinySet` containing only one element
/// within `[0; 64[`
#[inline(always)]
pub fn singleton(el: u32) -> TinySet {
TinySet(1u64 << u64::from(el))
}
/// Insert a new element within [0..64[
#[inline(always)]
pub fn insert(self, el: u32) -> TinySet {
self.union(TinySet::singleton(el))
}
/// Insert a new element within [0..64[
#[inline(always)]
pub fn insert_mut(&mut self, el: u32) -> bool {
let old = *self;
*self = old.insert(el);
old != *self
}
/// Returns the union of two tinysets
#[inline(always)]
pub fn union(self, other: TinySet) -> TinySet {
TinySet(self.0 | other.0)
}
/// Returns true iff the `TinySet` is empty.
#[inline(always)]
pub fn is_empty(self) -> bool {
self.0 == 0u64
}
/// Returns the lowest element in the `TinySet`
/// and removes it.
#[inline(always)]
pub fn pop_lowest(&mut self) -> Option<u32> {
if self.is_empty() {
None
} else {
let lowest = self.0.trailing_zeros() as u32;
self.0 ^= TinySet::singleton(lowest).0;
Some(lowest)
}
}
/// Returns a `TinySet` that contains all values up
/// to limit excluded.
///
/// The limit is assumed to be strictly lower than 64.
pub fn range_lower(upper_bound: u32) -> TinySet {
TinySet((1u64 << u64::from(upper_bound % 64u32)) - 1u64)
}
/// Returns a `TinySet` that contains all values greater
/// or equal to the given limit, included. (and up to 63)
///
/// The limit is assumed to be strictly lower than 64.
pub fn range_greater_or_equal(from_included: u32) -> TinySet {
TinySet::range_lower(from_included).complement()
}
pub fn clear(&mut self) {
self.0 = 0u64;
}
pub fn len(self) -> u32 {
self.0.count_ones()
}
}
#[derive(Clone)]
pub struct BitSet {
tinysets: Box<[TinySet]>,
len: usize, //< Technically it should be u32, but we
// count multiple inserts.
// `usize` guards us from overflow.
max_value: u32,
}
fn num_buckets(max_val: u32) -> u32 {
(max_val + 63u32) / 64u32
}
impl BitSet {
/// Create a new `BitSet` that may contain elements
/// within `[0, max_val[`.
pub fn with_max_value(max_value: u32) -> BitSet {
let num_buckets = num_buckets(max_value);
let tinybisets = vec![TinySet::empty(); num_buckets as usize].into_boxed_slice();
BitSet {
tinysets: tinybisets,
len: 0,
max_value,
}
}
/// Removes all elements from the `BitSet`.
pub fn clear(&mut self) {
for tinyset in self.tinysets.iter_mut() {
*tinyset = TinySet::empty();
}
}
/// Returns the number of elements in the `BitSet`.
pub fn len(&self) -> usize {
self.len
}
/// Inserts an element in the `BitSet`
pub fn insert(&mut self, el: u32) {
// we do not check saturated els.
let higher = el / 64u32;
let lower = el % 64u32;
self.len += if self.tinysets[higher as usize].insert_mut(lower) {
1
} else {
0
};
}
/// Returns true iff the element is in the `BitSet`.
pub fn contains(&self, el: u32) -> bool {
self.tinyset(el / 64u32).contains(el % 64)
}
/// Returns the index of the first non-empty bucket that is
/// greater than or equal to `bucket`.
///
/// Reminder: the tiny set with the bucket `bucket`, represents the
/// elements from `bucket * 64` to `(bucket+1) * 64`.
pub(crate) fn first_non_empty_bucket(&self, bucket: u32) -> Option<u32> {
self.tinysets[bucket as usize..]
.iter()
.cloned()
.position(|tinyset| !tinyset.is_empty())
.map(|delta_bucket| bucket + delta_bucket as u32)
}
pub fn max_value(&self) -> u32 {
self.max_value
}
/// Returns the tiny bitset representing
/// the set restricted to the number range from
/// `bucket * 64` to `(bucket + 1) * 64`.
pub(crate) fn tinyset(&self, bucket: u32) -> TinySet {
self.tinysets[bucket as usize]
}
}
#[cfg(test)]
mod tests {
use super::BitSet;
use super::TinySet;
use crate::docset::DocSet;
use crate::query::BitSetDocSet;
use crate::tests;
use crate::tests::generate_nonunique_unsorted;
use std::collections::BTreeSet;
use std::collections::HashSet;
#[test]
fn test_tiny_set() {
assert!(TinySet::empty().is_empty());
{
let mut u = TinySet::empty().insert(1u32);
assert_eq!(u.pop_lowest(), Some(1u32));
assert!(u.pop_lowest().is_none())
}
{
let mut u = TinySet::empty().insert(1u32).insert(1u32);
assert_eq!(u.pop_lowest(), Some(1u32));
assert!(u.pop_lowest().is_none())
}
{
let mut u = TinySet::empty().insert(2u32);
assert_eq!(u.pop_lowest(), Some(2u32));
u.insert_mut(1u32);
assert_eq!(u.pop_lowest(), Some(1u32));
assert!(u.pop_lowest().is_none());
}
{
let mut u = TinySet::empty().insert(63u32);
assert_eq!(u.pop_lowest(), Some(63u32));
assert!(u.pop_lowest().is_none());
}
}
#[test]
fn test_bitset() {
let test_against_hashset = |els: &[u32], max_value: u32| {
let mut hashset: HashSet<u32> = HashSet::new();
let mut bitset = BitSet::with_max_value(max_value);
for &el in els {
assert!(el < max_value);
hashset.insert(el);
bitset.insert(el);
}
for el in 0..max_value {
assert_eq!(hashset.contains(&el), bitset.contains(el));
}
assert_eq!(bitset.max_value(), max_value);
};
test_against_hashset(&[], 0);
test_against_hashset(&[], 1);
test_against_hashset(&[0u32], 1);
test_against_hashset(&[0u32], 100);
test_against_hashset(&[1u32, 2u32], 4);
test_against_hashset(&[99u32], 100);
test_against_hashset(&[63u32], 64);
test_against_hashset(&[62u32, 63u32], 64);
}
#[test]
fn test_bitset_large() {
let arr = generate_nonunique_unsorted(100_000, 5_000);
let mut btreeset: BTreeSet<u32> = BTreeSet::new();
let mut bitset = BitSet::with_max_value(100_000);
for el in arr {
btreeset.insert(el);
bitset.insert(el);
}
for i in 0..100_000 {
assert_eq!(btreeset.contains(&i), bitset.contains(i));
}
assert_eq!(btreeset.len(), bitset.len());
let mut bitset_docset = BitSetDocSet::from(bitset);
for el in btreeset.into_iter() {
bitset_docset.advance();
assert_eq!(bitset_docset.doc(), el);
}
assert!(!bitset_docset.advance());
}
#[test]
fn test_bitset_num_buckets() {
use super::num_buckets;
assert_eq!(num_buckets(0u32), 0);
assert_eq!(num_buckets(1u32), 1);
assert_eq!(num_buckets(64u32), 1);
assert_eq!(num_buckets(65u32), 2);
assert_eq!(num_buckets(128u32), 2);
assert_eq!(num_buckets(129u32), 3);
}
#[test]
fn test_tinyset_range() {
assert_eq!(
TinySet::range_lower(3).into_iter().collect::<Vec<u32>>(),
[0, 1, 2]
);
assert!(TinySet::range_lower(0).is_empty());
assert_eq!(
TinySet::range_lower(63).into_iter().collect::<Vec<u32>>(),
(0u32..63u32).collect::<Vec<_>>()
);
assert_eq!(
TinySet::range_lower(1).into_iter().collect::<Vec<u32>>(),
[0]
);
assert_eq!(
TinySet::range_lower(2).into_iter().collect::<Vec<u32>>(),
[0, 1]
);
assert_eq!(
TinySet::range_greater_or_equal(3)
.into_iter()
.collect::<Vec<u32>>(),
(3u32..64u32).collect::<Vec<_>>()
);
}
#[test]
fn test_bitset_len() {
let mut bitset = BitSet::with_max_value(1_000);
assert_eq!(bitset.len(), 0);
bitset.insert(3u32);
assert_eq!(bitset.len(), 1);
bitset.insert(103u32);
assert_eq!(bitset.len(), 2);
bitset.insert(3u32);
assert_eq!(bitset.len(), 2);
bitset.insert(103u32);
assert_eq!(bitset.len(), 2);
bitset.insert(104u32);
assert_eq!(bitset.len(), 3);
}
#[test]
fn test_bitset_clear() {
let mut bitset = BitSet::with_max_value(1_000);
let els = tests::sample(1_000, 0.01f64);
for &el in &els {
bitset.insert(el);
}
assert!(els.iter().all(|el| bitset.contains(*el)));
bitset.clear();
for el in 0u32..1000u32 {
assert!(!bitset.contains(el));
}
}
}
#[cfg(all(test, feature = "unstable"))]
mod bench {
use super::BitSet;
use super::TinySet;
use test;
#[bench]
fn bench_tinyset_pop(b: &mut test::Bencher) {
b.iter(|| {
let mut tinyset = TinySet::singleton(test::black_box(31u32));
tinyset.pop_lowest();
tinyset.pop_lowest();
tinyset.pop_lowest();
tinyset.pop_lowest();
tinyset.pop_lowest();
tinyset.pop_lowest();
});
}
#[bench]
fn bench_tinyset_sum(b: &mut test::Bencher) {
let tiny_set = TinySet::empty().insert(10u32).insert(14u32).insert(21u32);
b.iter(|| {
assert_eq!(test::black_box(tiny_set).into_iter().sum::<u32>(), 45u32);
});
}
#[bench]
fn bench_tinyarr_sum(b: &mut test::Bencher) {
let v = [10u32, 14u32, 21u32];
b.iter(|| test::black_box(v).iter().cloned().sum::<u32>());
}
#[bench]
fn bench_bitset_initialize(b: &mut test::Bencher) {
b.iter(|| BitSet::with_max_value(1_000_000));
}
}
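The core trick in `TinySet` is that a `u64` can hold any subset of `[0, 64[`, and `trailing_zeros` finds its smallest element directly. The sketch below restates that idea in a standalone form. `Tiny` and `elements` are hypothetical names used only for illustration and are not the crate's types.

```rust
/// Hypothetical stand-in for a set of values in `[0, 64[` stored in one `u64`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Tiny(u64);

impl Tiny {
    /// Returns a copy of the set with `el` added. `el` must be lower than 64.
    fn insert(self, el: u32) -> Tiny {
        debug_assert!(el < 64);
        Tiny(self.0 | (1u64 << el))
    }

    /// Returns true iff `el` is in the set.
    fn contains(self, el: u32) -> bool {
        self.0 & (1u64 << el) != 0
    }

    /// Removes and returns the smallest element, if any.
    fn pop_lowest(&mut self) -> Option<u32> {
        if self.0 == 0 {
            return None;
        }
        // The index of the lowest set bit is the smallest element.
        let lowest = self.0.trailing_zeros();
        // Flip that bit off.
        self.0 ^= 1u64 << lowest;
        Some(lowest)
    }
}

/// Collects the elements of the set in increasing order.
fn elements(mut set: Tiny) -> Vec<u32> {
    let mut out = Vec::new();
    while let Some(el) = set.pop_lowest() {
        out.push(el);
    }
    out
}
```

A larger bitset can then be built as an array of such words, with element `el` stored in word `el / 64` at bit `el % 64`, which is the split that `BitSet::insert` performs.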


@@ -0,0 +1,235 @@
use crate::common::BinarySerializable;
use crate::common::CountingWriter;
use crate::common::VInt;
use crate::directory::ReadOnlySource;
use crate::directory::WritePtr;
use crate::schema::Field;
use crate::space_usage::FieldUsage;
use crate::space_usage::PerFieldSpaceUsage;
use std::collections::HashMap;
use std::io::Write;
use std::io::{self, Read};
#[derive(Eq, PartialEq, Hash, Copy, Ord, PartialOrd, Clone, Debug)]
pub struct FileAddr {
field: Field,
idx: usize,
}
impl FileAddr {
fn new(field: Field, idx: usize) -> FileAddr {
FileAddr { field, idx }
}
}
impl BinarySerializable for FileAddr {
fn serialize<W: Write>(&self, writer: &mut W) -> io::Result<()> {
self.field.serialize(writer)?;
VInt(self.idx as u64).serialize(writer)?;
Ok(())
}
fn deserialize<R: Read>(reader: &mut R) -> io::Result<Self> {
let field = Field::deserialize(reader)?;
let idx = VInt::deserialize(reader)?.0 as usize;
Ok(FileAddr { field, idx })
}
}
/// A `CompositeWrite` is used to write a `CompositeFile`.
pub struct CompositeWrite<W = WritePtr> {
write: CountingWriter<W>,
offsets: HashMap<FileAddr, u64>,
}
impl<W: Write> CompositeWrite<W> {
/// Create a new API writer that writes a composite file
/// in a given write.
pub fn wrap(w: W) -> CompositeWrite<W> {
CompositeWrite {
write: CountingWriter::wrap(w),
offsets: HashMap::new(),
}
}
/// Start writing a new field.
pub fn for_field(&mut self, field: Field) -> &mut CountingWriter<W> {
self.for_field_with_idx(field, 0)
}
/// Start writing a new field.
pub fn for_field_with_idx(&mut self, field: Field, idx: usize) -> &mut CountingWriter<W> {
let offset = self.write.written_bytes();
let file_addr = FileAddr::new(field, idx);
assert!(!self.offsets.contains_key(&file_addr));
self.offsets.insert(file_addr, offset);
&mut self.write
}
/// Close the composite file
///
/// An index of the different field offsets
/// will be written as a footer.
pub fn close(mut self) -> io::Result<()> {
let footer_offset = self.write.written_bytes();
VInt(self.offsets.len() as u64).serialize(&mut self.write)?;
let mut offset_fields: Vec<_> = self
.offsets
.iter()
.map(|(file_addr, offset)| (*offset, *file_addr))
.collect();
offset_fields.sort();
let mut prev_offset = 0;
for (offset, file_addr) in offset_fields {
VInt((offset - prev_offset) as u64).serialize(&mut self.write)?;
file_addr.serialize(&mut self.write)?;
prev_offset = offset;
}
let footer_len = (self.write.written_bytes() - footer_offset) as u32;
footer_len.serialize(&mut self.write)?;
self.write.flush()?;
Ok(())
}
}
/// A composite file is an abstraction to store a
/// file partitioned by field.
///
/// The file needs to be written field by field.
/// A footer describes the start and stop offsets
/// for each field.
#[derive(Clone)]
pub struct CompositeFile {
data: ReadOnlySource,
offsets_index: HashMap<FileAddr, (usize, usize)>,
}
impl CompositeFile {
/// Opens a composite file stored in a given
/// `ReadOnlySource`.
pub fn open(data: &ReadOnlySource) -> io::Result<CompositeFile> {
let end = data.len();
let footer_len_data = data.slice_from(end - 4);
let footer_len = u32::deserialize(&mut footer_len_data.as_slice())? as usize;
let footer_start = end - 4 - footer_len;
let footer_data = data.slice(footer_start, footer_start + footer_len);
let mut footer_buffer = footer_data.as_slice();
let num_fields = VInt::deserialize(&mut footer_buffer)?.0 as usize;
let mut file_addrs = vec![];
let mut offsets = vec![];
let mut field_index = HashMap::new();
let mut offset = 0;
for _ in 0..num_fields {
offset += VInt::deserialize(&mut footer_buffer)?.0 as usize;
let file_addr = FileAddr::deserialize(&mut footer_buffer)?;
offsets.push(offset);
file_addrs.push(file_addr);
}
offsets.push(footer_start);
for i in 0..num_fields {
let file_addr = file_addrs[i];
let start_offset = offsets[i];
let end_offset = offsets[i + 1];
field_index.insert(file_addr, (start_offset, end_offset));
}
Ok(CompositeFile {
data: data.slice_to(footer_start),
offsets_index: field_index,
})
}
/// Returns a composite file that stores
/// no fields.
pub fn empty() -> CompositeFile {
CompositeFile {
offsets_index: HashMap::new(),
data: ReadOnlySource::empty(),
}
}
/// Returns the `ReadOnlySource` associated
/// to a given `Field` and stored in a `CompositeFile`.
pub fn open_read(&self, field: Field) -> Option<ReadOnlySource> {
self.open_read_with_idx(field, 0)
}
/// Returns the `ReadOnlySource` associated
/// to a given `Field` and stored in a `CompositeFile`.
pub fn open_read_with_idx(&self, field: Field, idx: usize) -> Option<ReadOnlySource> {
self.offsets_index
.get(&FileAddr { field, idx })
.map(|&(from, to)| self.data.slice(from, to))
}
pub fn space_usage(&self) -> PerFieldSpaceUsage {
let mut fields = HashMap::new();
for (&field_addr, &(start, end)) in self.offsets_index.iter() {
fields
.entry(field_addr.field)
.or_insert_with(|| FieldUsage::empty(field_addr.field))
.add_field_idx(field_addr.idx, end - start);
}
PerFieldSpaceUsage::new(fields)
}
}
#[cfg(test)]
mod test {
use super::{CompositeFile, CompositeWrite};
use crate::common::BinarySerializable;
use crate::common::VInt;
use crate::directory::{Directory, RAMDirectory};
use crate::schema::Field;
use std::io::Write;
use std::path::Path;
#[test]
fn test_composite_file() {
let path = Path::new("test_path");
let mut directory = RAMDirectory::create();
{
let w = directory.open_write(path).unwrap();
let mut composite_write = CompositeWrite::wrap(w);
{
let mut write_0 = composite_write.for_field(Field(0u32));
VInt(32431123u64).serialize(&mut write_0).unwrap();
write_0.flush().unwrap();
}
{
let mut write_4 = composite_write.for_field(Field(4u32));
VInt(2).serialize(&mut write_4).unwrap();
write_4.flush().unwrap();
}
composite_write.close().unwrap();
}
{
let r = directory.open_read(path).unwrap();
let composite_file = CompositeFile::open(&r).unwrap();
{
let file0 = composite_file.open_read(Field(0u32)).unwrap();
let mut file0_buf = file0.as_slice();
let payload_0 = VInt::deserialize(&mut file0_buf).unwrap().0;
assert_eq!(file0_buf.len(), 0);
assert_eq!(payload_0, 32431123u64);
}
{
let file4 = composite_file.open_read(Field(4u32)).unwrap();
let mut file4_buf = file4.as_slice();
let payload_4 = VInt::deserialize(&mut file4_buf).unwrap().0;
assert_eq!(file4_buf.len(), 0);
assert_eq!(payload_4, 2u64);
}
}
}
}
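The footer stores each field's start offset as a delta from the previous one, and the reader recovers `(start, end)` ranges by using the next start (or the footer start) as the end. The following sketch isolates that bookkeeping on plain integers. It leaves out the `VInt` encoding and the `FileAddr` keys, and `encode_deltas` and `decode_ranges` are hypothetical names introduced here.

```rust
/// Turns sorted start offsets into deltas from the previous offset,
/// which is how a footer could store them compactly.
fn encode_deltas(sorted_offsets: &[u64]) -> Vec<u64> {
    let mut prev = 0u64;
    sorted_offsets
        .iter()
        .map(|&offset| {
            let delta = offset - prev;
            prev = offset;
            delta
        })
        .collect()
}

/// Rebuilds `(start, end)` byte ranges from the deltas.
/// The last range ends where the footer starts.
fn decode_ranges(deltas: &[u64], footer_start: u64) -> Vec<(u64, u64)> {
    let mut starts = Vec::with_capacity(deltas.len() + 1);
    let mut offset = 0u64;
    for &delta in deltas {
        offset += delta;
        starts.push(offset);
    }
    starts.push(footer_start);
    // Each range runs from one start to the next one.
    starts.windows(2).map(|w| (w[0], w[1])).collect()
}
```

With fields starting at bytes 0, 4 and 5 and a footer starting at byte 9, the deltas are `[0, 4, 1]` and the decoded ranges are `(0, 4)`, `(4, 5)` and `(5, 9)`.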


@@ -0,0 +1,61 @@
use std::io;
use std::io::Write;
pub struct CountingWriter<W> {
underlying: W,
written_bytes: u64,
}
impl<W: Write> CountingWriter<W> {
pub fn wrap(underlying: W) -> CountingWriter<W> {
CountingWriter {
underlying,
written_bytes: 0,
}
}
pub fn written_bytes(&self) -> u64 {
self.written_bytes
}
pub fn finish(mut self) -> io::Result<(W, u64)> {
self.flush()?;
Ok((self.underlying, self.written_bytes))
}
}
impl<W: Write> Write for CountingWriter<W> {
fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
let written_size = self.underlying.write(buf)?;
self.written_bytes += written_size as u64;
Ok(written_size)
}
fn write_all(&mut self, buf: &[u8]) -> io::Result<()> {
self.underlying.write_all(buf)?;
self.written_bytes += buf.len() as u64;
Ok(())
}
fn flush(&mut self) -> io::Result<()> {
self.underlying.flush()
}
}
#[cfg(test)]
mod test {
use super::CountingWriter;
use std::io::Write;
#[test]
fn test_counting_writer() {
let buffer: Vec<u8> = vec![];
let mut counting_writer = CountingWriter::wrap(buffer);
let bytes = (0u8..10u8).collect::<Vec<u8>>();
counting_writer.write_all(&bytes).unwrap();
let (w, len): (Vec<u8>, u64) = counting_writer.finish().unwrap();
assert_eq!(len, 10u64);
assert_eq!(w.len(), 10);
}
}


@@ -1,32 +1,203 @@
mod serialize;
mod timer;
mod vint;
pub mod bitpacker;
mod bitset;
mod composite_file;
mod counting_writer;
mod serialize;
mod vint;
pub use self::bitset::BitSet;
pub(crate) use self::bitset::TinySet;
pub(crate) use self::composite_file::{CompositeFile, CompositeWrite};
pub use self::counting_writer::CountingWriter;
pub use self::serialize::{BinarySerializable, FixedSize};
pub use self::vint::{read_u32_vint, serialize_vint_u32, write_u32_vint, VInt};
pub use byteorder::LittleEndian as Endianness;
pub use self::serialize::BinarySerializable;
pub use self::timer::Timing;
pub use self::timer::TimerTree;
pub use self::timer::OpenTimer;
pub use self::vint::VInt;
/// Segment's max doc must be `< MAX_DOC_LIMIT`.
///
/// We do not allow segments with more than
pub const MAX_DOC_LIMIT: u32 = 1 << 31;
use std::io;
pub fn make_io_err(msg: String) -> io::Error {
io::Error::new(io::ErrorKind::Other, msg)
/// Computes the number of bits that will be used for bitpacking.
///
/// In general the target is the minimum number of bits
/// required to express the amplitude given in argument.
///
/// e.g. If the amplitude is 10, we can store all ints on just 4 bits.
///
/// The logic is slightly more convoluted here because, for optimization
/// reasons, we want to ensure that a value spans at most 8
/// aligned bytes.
///
/// Spanning 9 bytes is possible, for instance, if we do
/// bitpacking with an amplitude of 63 bits.
/// In this case, the second int starts on bit
/// 63 (which belongs to byte 7) and ends at byte 15;
/// hence 9 bytes (from byte 7 to byte 15 included).
///
/// To avoid this, we force the number of bits to 64 bits
/// when the result is greater than `64-8 = 56 bits`.
///
/// Note that this only affects rare use cases spanning
/// a very large range of values. Even in this case, it results
/// in an extra cost of at most 12% compared to the optimal
/// number of bits.
pub(crate) fn compute_num_bits(n: u64) -> u8 {
let amplitude = (64u32 - n.leading_zeros()) as u8;
if amplitude <= 64 - 8 {
amplitude
} else {
64
}
}
pub(crate) fn is_power_of_2(n: usize) -> bool {
(n > 0) && (n & (n - 1) == 0)
}
/// Has length trait
pub trait HasLen {
/// Return length
fn len(&self,) -> usize;
fn len(&self) -> usize;
/// Returns true iff empty.
fn is_empty(&self,) -> bool {
fn is_empty(&self) -> bool {
self.len() == 0
}
}
const HIGHEST_BIT: u64 = 1 << 63;
/// Maps a `i64` to `u64`
///
/// For simplicity, tantivy internally handles `i64` as `u64`.
/// The mapping is defined by this function.
///
/// Maps `i64` to `u64` so that
/// `-2^63 .. 2^63-1` is mapped
/// to
/// `0 .. 2^64-1`
/// in that order.
///
/// This is more suited than simply casting (`val as u64`)
/// because of bitpacking.
///
/// Imagine a list of `i64` ranging from -10 to 10.
/// When casting negative values, the negative values are projected
/// to values over 2^63, and all values end up requiring 64 bits.
///
/// # See also
/// The [reverse mapping is `u64_to_i64`](./fn.u64_to_i64.html).
#[inline(always)]
pub fn i64_to_u64(val: i64) -> u64 {
(val as u64) ^ HIGHEST_BIT
}
/// Reverse the mapping given by [`i64_to_u64`](./fn.i64_to_u64.html).
#[inline(always)]
pub fn u64_to_i64(val: u64) -> i64 {
(val ^ HIGHEST_BIT) as i64
}
/// Maps a `f64` to `u64`
///
/// For simplicity, tantivy internally handles `f64` as `u64`.
/// The mapping is defined by this function.
///
/// Maps `f64` to `u64` so that lexical order is preserved.
///
/// This is better suited than simply casting (`val as u64`),
/// which would truncate the result.
///
/// # See also
/// The [reverse mapping is `u64_to_f64`](./fn.u64_to_f64.html).
#[inline(always)]
pub fn f64_to_u64(val: f64) -> u64 {
let bits = val.to_bits();
if val.is_sign_positive() {
bits ^ HIGHEST_BIT
} else {
!bits
}
}
/// Reverse the mapping given by [`f64_to_u64`](./fn.f64_to_u64.html).
#[inline(always)]
pub fn u64_to_f64(val: u64) -> f64 {
f64::from_bits(
if val & HIGHEST_BIT != 0 {
val ^ HIGHEST_BIT
} else {
!val
}
)
}
#[cfg(test)]
pub(crate) mod test {
pub use super::serialize::test::fixed_size_test;
use super::{compute_num_bits, i64_to_u64, u64_to_i64, f64_to_u64, u64_to_f64};
use std::f64;
fn test_i64_converter_helper(val: i64) {
assert_eq!(u64_to_i64(i64_to_u64(val)), val);
}
fn test_f64_converter_helper(val: f64) {
assert_eq!(u64_to_f64(f64_to_u64(val)), val);
}
#[test]
fn test_i64_converter() {
assert_eq!(i64_to_u64(i64::min_value()), u64::min_value());
assert_eq!(i64_to_u64(i64::max_value()), u64::max_value());
test_i64_converter_helper(0i64);
test_i64_converter_helper(i64::min_value());
test_i64_converter_helper(i64::max_value());
for i in -1000i64..1000i64 {
test_i64_converter_helper(i);
}
}
#[test]
fn test_f64_converter() {
test_f64_converter_helper(f64::INFINITY);
test_f64_converter_helper(f64::NEG_INFINITY);
test_f64_converter_helper(0.0);
test_f64_converter_helper(-0.0);
test_f64_converter_helper(1.0);
test_f64_converter_helper(-1.0);
}
#[test]
fn test_f64_order() {
assert!(!(f64_to_u64(f64::NEG_INFINITY)..f64_to_u64(f64::INFINITY)).contains(&f64_to_u64(f64::NAN))); //nan is not a number
assert!(f64_to_u64(1.5) > f64_to_u64(1.0)); //same exponent, different mantissa
assert!(f64_to_u64(2.0) > f64_to_u64(1.0)); //same mantissa, different exponent
assert!(f64_to_u64(2.0) > f64_to_u64(1.5)); //different exponent and mantissa
assert!(f64_to_u64(1.0) > f64_to_u64(-1.0)); // pos > neg
assert!(f64_to_u64(-1.5) < f64_to_u64(-1.0));
assert!(f64_to_u64(-2.0) < f64_to_u64(1.0));
assert!(f64_to_u64(-2.0) < f64_to_u64(-1.5));
}
#[test]
fn test_compute_num_bits() {
assert_eq!(compute_num_bits(1), 1u8);
assert_eq!(compute_num_bits(0), 0u8);
assert_eq!(compute_num_bits(2), 2u8);
assert_eq!(compute_num_bits(3), 2u8);
assert_eq!(compute_num_bits(4), 3u8);
assert_eq!(compute_num_bits(255), 8u8);
assert_eq!(compute_num_bits(256), 9u8);
assert_eq!(compute_num_bits(5_000_000_000), 33u8);
}
#[test]
fn test_max_doc() {
// this is the first time I write a unit test for a constant.
assert!(((super::MAX_DOC_LIMIT - 1) as i32) >= 0);
assert!((super::MAX_DOC_LIMIT as i32) < 0);
}
}
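The doc comments above argue that flipping the sign bit keeps bitpacking cheap for ranges that straddle zero. The sketch below checks that claim on the range -10 to 10. The two mapping functions are copied from this diff. `bits_needed` is a hypothetical helper that computes the minimal bit width without the 56-bit clamp applied by `compute_num_bits`.

```rust
const HIGHEST_BIT: u64 = 1 << 63;

/// Order-preserving mapping from `i64` to `u64` (as in the diff above).
fn i64_to_u64(val: i64) -> u64 {
    (val as u64) ^ HIGHEST_BIT
}

/// Order-preserving mapping from `f64` to `u64` (as in the diff above).
fn f64_to_u64(val: f64) -> u64 {
    let bits = val.to_bits();
    if val.is_sign_positive() {
        // Positive values: set the top bit so they sort after negatives.
        bits ^ HIGHEST_BIT
    } else {
        // Negative values: invert all bits so larger magnitudes sort first.
        !bits
    }
}

/// Hypothetical helper: minimal number of bits needed to store `amplitude`.
fn bits_needed(amplitude: u64) -> u32 {
    64 - amplitude.leading_zeros()
}

/// Bits needed to bitpack `vals` after subtracting their minimum.
fn bits_for_range(vals: &[u64]) -> u32 {
    let min = vals.iter().copied().min().unwrap_or(0);
    let max = vals.iter().copied().max().unwrap_or(0);
    bits_needed(max - min)
}
```

Mapping -10 to 10 with `i64_to_u64` gives values 20 apart, so `bits_for_range` returns 5. A plain `as u64` cast sends -1 to `u64::MAX` while 0 stays at 0, so the same range would need 64 bits.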


@@ -1,180 +1,228 @@
use crate::common::Endianness;
use crate::common::VInt;
use byteorder::{ReadBytesExt, WriteBytesExt};
use byteorder::LittleEndian as Endianness;
use std::fmt;
use std::io::Write;
use std::io::Read;
use std::io;
use common::VInt;
use byteorder;
use std::io::Read;
use std::io::Write;
pub trait BinarySerializable : fmt::Debug + Sized {
fn serialize(&self, writer: &mut Write) -> io::Result<usize>;
fn deserialize(reader: &mut Read) -> io::Result<Self>;
/// Trait for a simple binary serialization.
pub trait BinarySerializable: fmt::Debug + Sized {
/// Serialize
fn serialize<W: Write>(&self, writer: &mut W) -> io::Result<()>;
/// Deserialize
fn deserialize<R: Read>(reader: &mut R) -> io::Result<Self>;
}
fn convert_byte_order_error(byteorder_error: byteorder::Error) -> io::Error {
match byteorder_error {
byteorder::Error::UnexpectedEOF => io::Error::new(io::ErrorKind::InvalidData, "Reached EOF unexpectedly"),
byteorder::Error::Io(e) => e,
}
/// `FixedSize` marks a `BinarySerializable` as
/// always serializing to the same size.
pub trait FixedSize: BinarySerializable {
const SIZE_IN_BYTES: usize;
}
impl BinarySerializable for () {
fn serialize(&self, _: &mut Write) -> io::Result<usize> {
Ok(0)
fn serialize<W: Write>(&self, _: &mut W) -> io::Result<()> {
Ok(())
}
fn deserialize(_: &mut Read) -> io::Result<Self> {
fn deserialize<R: Read>(_: &mut R) -> io::Result<Self> {
Ok(())
}
}
impl FixedSize for () {
const SIZE_IN_BYTES: usize = 0;
}
impl<T: BinarySerializable> BinarySerializable for Vec<T> {
fn serialize(&self, writer: &mut Write) -> io::Result<usize> {
let mut total_size = try!(VInt(self.len() as u64).serialize(writer));
fn serialize<W: Write>(&self, writer: &mut W) -> io::Result<()> {
VInt(self.len() as u64).serialize(writer)?;
for it in self {
total_size += try!(it.serialize(writer));
it.serialize(writer)?;
}
Ok(total_size)
Ok(())
}
fn deserialize(reader: &mut Read) -> io::Result<Vec<T>> {
let num_items = try!(VInt::deserialize(reader)).val();
fn deserialize<R: Read>(reader: &mut R) -> io::Result<Vec<T>> {
let num_items = VInt::deserialize(reader)?.val();
let mut items: Vec<T> = Vec::with_capacity(num_items as usize);
for _ in 0..num_items {
let item = try!(T::deserialize(reader));
let item = T::deserialize(reader)?;
items.push(item);
}
Ok(items)
}
}
impl<Left: BinarySerializable, Right: BinarySerializable> BinarySerializable for (Left, Right) {
fn serialize(&self, write: &mut Write) -> io::Result<usize> {
Ok(try!(self.0.serialize(write)) + try!(self.1.serialize(write)))
fn serialize<W: Write>(&self, write: &mut W) -> io::Result<()> {
self.0.serialize(write)?;
self.1.serialize(write)
}
fn deserialize(reader: &mut Read) -> io::Result<Self> {
Ok( (try!(Left::deserialize(reader)), try!(Right::deserialize(reader))) )
fn deserialize<R: Read>(reader: &mut R) -> io::Result<Self> {
Ok((Left::deserialize(reader)?, Right::deserialize(reader)?))
}
}
impl BinarySerializable for u32 {
fn serialize(&self, writer: &mut Write) -> io::Result<usize> {
fn serialize<W: Write>(&self, writer: &mut W) -> io::Result<()> {
writer.write_u32::<Endianness>(*self)
.map(|_| 4)
.map_err(convert_byte_order_error)
}
fn deserialize(reader: &mut Read) -> io::Result<u32> {
fn deserialize<R: Read>(reader: &mut R) -> io::Result<u32> {
reader.read_u32::<Endianness>()
.map_err(convert_byte_order_error)
}
}
impl FixedSize for u32 {
const SIZE_IN_BYTES: usize = 4;
}
impl BinarySerializable for u64 {
fn serialize(&self, writer: &mut Write) -> io::Result<usize> {
fn serialize<W: Write>(&self, writer: &mut W) -> io::Result<()> {
writer.write_u64::<Endianness>(*self)
.map(|_| 8)
.map_err(convert_byte_order_error)
}
fn deserialize(reader: &mut Read) -> io::Result<u64> {
fn deserialize<R: Read>(reader: &mut R) -> io::Result<Self> {
reader.read_u64::<Endianness>()
.map_err(convert_byte_order_error)
}
}
impl FixedSize for u64 {
const SIZE_IN_BYTES: usize = 8;
}
impl BinarySerializable for i64 {
fn serialize<W: Write>(&self, writer: &mut W) -> io::Result<()> {
writer.write_i64::<Endianness>(*self)
}
fn deserialize<R: Read>(reader: &mut R) -> io::Result<Self> {
reader.read_i64::<Endianness>()
}
}
impl FixedSize for i64 {
const SIZE_IN_BYTES: usize = 8;
}
impl BinarySerializable for f64 {
fn serialize<W: Write>(&self, writer: &mut W) -> io::Result<()> {
writer.write_f64::<Endianness>(*self)
}
fn deserialize<R: Read>(reader: &mut R) -> io::Result<Self> {
reader.read_f64::<Endianness>()
}
}
impl FixedSize for f64 {
const SIZE_IN_BYTES: usize = 8;
}
impl BinarySerializable for u8 {
fn serialize(&self, writer: &mut Write) -> io::Result<usize> {
// TODO error
try!(writer.write_u8(*self).map_err(convert_byte_order_error));
Ok(1)
fn serialize<W: Write>(&self, writer: &mut W) -> io::Result<()> {
writer.write_u8(*self)
}
fn deserialize(reader: &mut Read) -> io::Result<u8> {
fn deserialize<R: Read>(reader: &mut R) -> io::Result<u8> {
reader.read_u8()
.map_err(convert_byte_order_error)
}
}
impl FixedSize for u8 {
const SIZE_IN_BYTES: usize = 1;
}
impl BinarySerializable for String {
fn serialize(&self, writer: &mut Write) -> io::Result<usize> {
fn serialize<W: Write>(&self, writer: &mut W) -> io::Result<()> {
let data: &[u8] = self.as_bytes();
let mut size = try!(VInt(data.len() as u64).serialize(writer));
size += data.len();
try!(writer.write_all(data));
Ok(size)
VInt(data.len() as u64).serialize(writer)?;
writer.write_all(data)
}
fn deserialize(reader: &mut Read) -> io::Result<String> {
let string_length = try!(VInt::deserialize(reader)).val() as usize;
fn deserialize<R: Read>(reader: &mut R) -> io::Result<String> {
let string_length = VInt::deserialize(reader)?.val() as usize;
let mut result = String::with_capacity(string_length);
try!(reader.take(string_length as u64).read_to_string(&mut result));
reader
.take(string_length as u64)
.read_to_string(&mut result)?;
Ok(result)
}
}
#[cfg(test)]
mod test {
pub mod test {
use common::VInt;
use super::*;
use crate::common::VInt;
fn serialize_test<T: BinarySerializable + Eq>(v: T, num_bytes: usize) {
pub fn fixed_size_test<O: BinarySerializable + FixedSize + Default>() {
let mut buffer = Vec::new();
O::default().serialize(&mut buffer).unwrap();
assert_eq!(buffer.len(), O::SIZE_IN_BYTES);
}
fn serialize_test<T: BinarySerializable + Eq>(v: T) -> usize {
let mut buffer: Vec<u8> = Vec::new();
if num_bytes != 0 {
assert_eq!(v.serialize(&mut buffer).unwrap(), num_bytes);
assert_eq!(buffer.len(), num_bytes);
}
else {
v.serialize(&mut buffer).unwrap();
}
v.serialize(&mut buffer).unwrap();
let num_bytes = buffer.len();
let mut cursor = &buffer[..];
let deser = T::deserialize(&mut cursor).unwrap();
assert_eq!(deser, v);
num_bytes
}
#[test]
fn test_serialize_u8() {
serialize_test(3u8, 1);
serialize_test(5u8, 1);
fixed_size_test::<u8>();
}
#[test]
fn test_serialize_u32() {
serialize_test(3u32, 4);
serialize_test(5u32, 4);
serialize_test(u32::max_value(), 4);
fixed_size_test::<u32>();
assert_eq!(4, serialize_test(3u32));
assert_eq!(4, serialize_test(5u32));
assert_eq!(4, serialize_test(u32::max_value()));
}
#[test]
fn test_serialize_i64() {
fixed_size_test::<i64>();
}
#[test]
fn test_serialize_f64() {
fixed_size_test::<f64>();
}
#[test]
fn test_serialize_u64() {
fixed_size_test::<u64>();
}
#[test]
fn test_serialize_string() {
serialize_test(String::from(""), 1);
serialize_test(String::from("ぽよぽよ"), 1 + 3*4);
serialize_test(String::from("富士さん見える。"), 1 + 3*8);
assert_eq!(serialize_test(String::from("")), 1);
assert_eq!(serialize_test(String::from("ぽよぽよ")), 1 + 3 * 4);
assert_eq!(
serialize_test(String::from("富士さん見える。")),
1 + 3 * 8
);
}
#[test]
fn test_serialize_vec() {
let v: Vec<u8> = Vec::new();
serialize_test(v, 1);
serialize_test(vec!(1u32, 3u32), 1 + 4*2);
assert_eq!(serialize_test(Vec::<u8>::new()), 1);
assert_eq!(serialize_test(vec![1u32, 3u32]), 1 + 4 * 2);
}
#[test]
fn test_serialize_vint() {
for i in 0..10_000 {
serialize_test(VInt(i as u64), 0);
serialize_test(VInt(i as u64));
}
serialize_test(VInt(7u64), 1);
serialize_test(VInt(127u64), 1);
serialize_test(VInt(128u64), 2);
serialize_test(VInt(129u64), 2);
serialize_test(VInt(1234u64), 2);
serialize_test(VInt(16_383), 2);
serialize_test(VInt(16_384), 3);
serialize_test(VInt(u64::max_value()), 10);
assert_eq!(serialize_test(VInt(7u64)), 1);
assert_eq!(serialize_test(VInt(127u64)), 1);
assert_eq!(serialize_test(VInt(128u64)), 2);
assert_eq!(serialize_test(VInt(129u64)), 2);
assert_eq!(serialize_test(VInt(1234u64)), 2);
assert_eq!(serialize_test(VInt(16_383u64)), 2);
assert_eq!(serialize_test(VInt(16_384u64)), 3);
assert_eq!(serialize_test(VInt(u64::max_value())), 10);
}
}
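
The signature change in this file (from `&mut Write` trait objects returning a byte count to generic writers returning `io::Result<()>`) can be sketched in isolation. The block below is a minimal, standalone approximation, not the crate's code: the trait is re-declared locally, little-endian byte order is an assumption, and `roundtrip_len` is a hypothetical helper that measures the encoded size from the buffer length, as the updated tests do.

```rust
use std::io::{self, Read, Write};

/// Local re-declaration of the trait shape used in the diff (illustrative only).
trait BinarySerializable: Sized {
    fn serialize<W: Write>(&self, writer: &mut W) -> io::Result<()>;
    fn deserialize<R: Read>(reader: &mut R) -> io::Result<Self>;
}

impl BinarySerializable for u32 {
    fn serialize<W: Write>(&self, writer: &mut W) -> io::Result<()> {
        // Assumption: little-endian byte order.
        writer.write_all(&self.to_le_bytes())
    }
    fn deserialize<R: Read>(reader: &mut R) -> io::Result<Self> {
        let mut bytes = [0u8; 4];
        reader.read_exact(&mut bytes)?;
        Ok(u32::from_le_bytes(bytes))
    }
}

impl<A: BinarySerializable, B: BinarySerializable> BinarySerializable for (A, B) {
    fn serialize<W: Write>(&self, writer: &mut W) -> io::Result<()> {
        // Fields are written back to back, with no separator.
        self.0.serialize(writer)?;
        self.1.serialize(writer)
    }
    fn deserialize<R: Read>(reader: &mut R) -> io::Result<Self> {
        Ok((A::deserialize(reader)?, B::deserialize(reader)?))
    }
}

/// Hypothetical helper: serializes `v`, checks that it round-trips,
/// and returns the number of bytes written.
fn roundtrip_len<T>(v: T) -> io::Result<usize>
where
    T: BinarySerializable + PartialEq + std::fmt::Debug,
{
    let mut buffer: Vec<u8> = Vec::new();
    v.serialize(&mut buffer)?;
    // `&[u8]` implements `Read`, so a slice can act as the cursor.
    let mut cursor = &buffer[..];
    let back = T::deserialize(&mut cursor)?;
    assert_eq!(back, v);
    Ok(buffer.len())
}
```

Since `serialize` no longer reports a size, callers that need one can read it from the writer, as `roundtrip_len` does with `buffer.len()`.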


@@ -1,98 +0,0 @@
use time::PreciseTime;
pub struct OpenTimer<'a> {
name: &'static str,
timer_tree: &'a mut TimerTree,
start: PreciseTime,
depth: u32,
}
impl<'a> OpenTimer<'a> {
/// Starts timing a new named subtask
///
/// The timer is stopped automatically
/// when the `OpenTimer` is dropped.
pub fn open(&mut self, name: &'static str) -> OpenTimer {
OpenTimer {
name: name,
timer_tree: self.timer_tree,
start: PreciseTime::now(),
depth: self.depth + 1,
}
}
}
impl<'a> Drop for OpenTimer<'a> {
fn drop(&mut self,) {
self.timer_tree.timings.push(Timing {
name: self.name,
duration: self.start.to(PreciseTime::now()).num_microseconds().unwrap(),
depth: self.depth,
});
}
}
/// Timing recording
#[derive(Debug, RustcEncodable)]
pub struct Timing {
name: &'static str,
duration: i64,
depth: u32,
}
/// Timer tree
#[derive(Debug, RustcEncodable)]
pub struct TimerTree {
timings: Vec<Timing>,
}
impl TimerTree {
/// Returns the total time elapsed in microseconds
pub fn total_time(&self,) -> i64 {
self.timings.last().unwrap().duration
}
/// Open a new named subtask
pub fn open(&mut self, name: &'static str) -> OpenTimer {
OpenTimer {
name: name,
timer_tree: self,
start: PreciseTime::now(),
depth: 0,
}
}
}
impl Default for TimerTree {
fn default() -> TimerTree {
TimerTree {
timings: Vec::new(),
}
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_timer() {
let mut timer_tree = TimerTree::default();
{
let mut a = timer_tree.open("a");
{
let mut ab = a.open("b");
{
let _abc = ab.open("c");
}
{
let _abd = ab.open("d");
}
}
}
assert_eq!(timer_tree.timings.len(), 4);
}
}


@@ -1,61 +1,232 @@
use super::BinarySerializable;
use byteorder::{ByteOrder, LittleEndian};
use std::io;
use std::io::Write;
use std::io::Read;
use std::io::Write;
/// Wrapper over a `u64` that serializes as a variable int.
/// Wrapper over a `u64` that serializes as a variable int.
#[derive(Debug, Eq, PartialEq)]
pub struct VInt(pub u64);
const STOP_BIT: u8 = 128;
pub fn serialize_vint_u32(val: u32) -> (u64, usize) {
const START_2: u64 = 1 << 7;
const START_3: u64 = 1 << 14;
const START_4: u64 = 1 << 21;
const START_5: u64 = 1 << 28;
const STOP_1: u64 = START_2 - 1;
const STOP_2: u64 = START_3 - 1;
const STOP_3: u64 = START_4 - 1;
const STOP_4: u64 = START_5 - 1;
const MASK_1: u64 = 127;
const MASK_2: u64 = MASK_1 << 7;
const MASK_3: u64 = MASK_2 << 7;
const MASK_4: u64 = MASK_3 << 7;
const MASK_5: u64 = MASK_4 << 7;
let val = u64::from(val);
const STOP_BIT: u64 = 128u64;
match val {
0..=STOP_1 => (val | STOP_BIT, 1),
START_2..=STOP_2 => (
(val & MASK_1) | ((val & MASK_2) << 1) | (STOP_BIT << (8)),
2,
),
START_3..=STOP_3 => (
(val & MASK_1) | ((val & MASK_2) << 1) | ((val & MASK_3) << 2) | (STOP_BIT << (8 * 2)),
3,
),
START_4..=STOP_4 => (
(val & MASK_1)
| ((val & MASK_2) << 1)
| ((val & MASK_3) << 2)
| ((val & MASK_4) << 3)
| (STOP_BIT << (8 * 3)),
4,
),
_ => (
(val & MASK_1)
| ((val & MASK_2) << 1)
| ((val & MASK_3) << 2)
| ((val & MASK_4) << 3)
| ((val & MASK_5) << 4)
| (STOP_BIT << (8 * 4)),
5,
),
}
}
/// Returns the number of bytes covered by a
/// serialized vint `u32`.
///
/// Expects a buffer that starts with
/// the serialized `vint`, and scans at most 5 bytes ahead until
/// it finds the final byte of the vint.
///
/// # Panics
/// If the payload does not start with a valid `vint`.
fn vint_len(data: &[u8]) -> usize {
for (i, &val) in data.iter().enumerate().take(5) {
if val >= STOP_BIT {
return i + 1;
}
}
panic!("Corrupted data. Invalid VInt 32");
}
/// Reads a vint `u32` from a buffer, and
/// consumes its payload data.
///
/// # Panics
///
/// If the buffer does not start with a valid
/// vint payload.
pub fn read_u32_vint(data: &mut &[u8]) -> u32 {
let vlen = vint_len(*data);
let mut result = 0u32;
let mut shift = 0u64;
for &b in &data[..vlen] {
result |= u32::from(b & 127u8) << shift;
shift += 7;
}
*data = &data[vlen..];
result
}
/// Write a `u32` as a vint payload.
pub fn write_u32_vint<W: io::Write>(val: u32, writer: &mut W) -> io::Result<()> {
let (val, num_bytes) = serialize_vint_u32(val);
let mut buffer = [0u8; 8];
LittleEndian::write_u64(&mut buffer, val);
writer.write_all(&buffer[..num_bytes])
}
impl VInt {
pub fn val(&self,) -> u64 {
pub fn val(&self) -> u64 {
self.0
}
pub fn deserialize_u64<R: Read>(reader: &mut R) -> io::Result<u64> {
VInt::deserialize(reader).map(|vint| vint.0)
}
pub fn serialize_into_vec(&self, output: &mut Vec<u8>) {
let mut buffer = [0u8; 10];
let num_bytes = self.serialize_into(&mut buffer);
output.extend(&buffer[0..num_bytes]);
}
pub fn serialize_into(&self, buffer: &mut [u8; 10]) -> usize {
let mut remaining = self.0;
for (i, b) in buffer.iter_mut().enumerate() {
let next_byte: u8 = (remaining % 128u64) as u8;
remaining /= 128u64;
if remaining == 0u64 {
*b = next_byte | STOP_BIT;
return i + 1;
} else {
*b = next_byte;
}
}
unreachable!();
}
}
impl BinarySerializable for VInt {
fn serialize(&self, writer: &mut Write) -> io::Result<usize> {
let mut remaining = self.0;
let mut written: usize = 0;
fn serialize<W: Write>(&self, writer: &mut W) -> io::Result<()> {
let mut buffer = [0u8; 10];
loop {
let next_byte: u8 = (remaining % 128u64) as u8;
remaining /= 128u64;
if remaining == 0u64 {
buffer[written] = next_byte | 128u8;
written += 1;
break;
}
else {
buffer[written] = next_byte;
written += 1;
}
}
try!(writer.write_all(&buffer[0..written]));
Ok(written)
let num_bytes = self.serialize_into(&mut buffer);
writer.write_all(&buffer[0..num_bytes])
}
fn deserialize(reader: &mut Read) -> io::Result<Self> {
fn deserialize<R: Read>(reader: &mut R) -> io::Result<Self> {
let mut bytes = reader.bytes();
let mut result = 0u64;
let mut shift = 0u64;
loop {
match bytes.next() {
Some(Ok(b)) => {
result += ((b % 128u8) as u64) << shift;
if b & 128u8 != 0u8 {
break;
result |= u64::from(b % 128u8) << shift;
if b >= STOP_BIT {
return Ok(VInt(result));
}
shift += 7;
}
_ => {
return Err(io::Error::new(io::ErrorKind::InvalidData, "Reach end of buffer"))
return Err(io::Error::new(
io::ErrorKind::InvalidData,
"Reach end of buffer while reading VInt",
));
}
}
}
Ok(VInt(result))
}
}
#[cfg(test)]
mod tests {
use super::serialize_vint_u32;
use super::VInt;
use crate::common::BinarySerializable;
use byteorder::{ByteOrder, LittleEndian};
fn aux_test_vint(val: u64) {
let mut v = [14u8; 10];
let num_bytes = VInt(val).serialize_into(&mut v);
for i in num_bytes..10 {
assert_eq!(v[i], 14u8);
}
assert!(num_bytes > 0);
if num_bytes < 10 {
assert!(1u64 << (7 * num_bytes) > val);
}
if num_bytes > 1 {
assert!(1u64 << (7 * (num_bytes - 1)) <= val);
}
let serdeser_val = VInt::deserialize(&mut &v[..]).unwrap();
assert_eq!(val, serdeser_val.0);
}
#[test]
fn test_vint() {
aux_test_vint(0);
aux_test_vint(1);
aux_test_vint(5);
aux_test_vint(u64::max_value());
for i in 1..9 {
let power_of_128 = 1u64 << (7 * i);
aux_test_vint(power_of_128 - 1u64);
aux_test_vint(power_of_128);
aux_test_vint(power_of_128 + 1u64);
}
aux_test_vint(10);
}
fn aux_test_serialize_vint_u32(val: u32) {
let mut buffer = [0u8; 10];
let mut buffer2 = [0u8; 10];
let len_vint = VInt(val as u64).serialize_into(&mut buffer);
let (vint, len) = serialize_vint_u32(val);
assert_eq!(len, len_vint, "len wrong for val {}", val);
LittleEndian::write_u64(&mut buffer2, vint);
assert_eq!(&buffer[..len], &buffer2[..len], "array wrong for {}", val);
}
#[test]
fn test_vint_u32() {
aux_test_serialize_vint_u32(0);
aux_test_serialize_vint_u32(1);
aux_test_serialize_vint_u32(5);
for i in 1..3 {
let power_of_128 = 1u32 << (7 * i);
aux_test_serialize_vint_u32(power_of_128 - 1u32);
aux_test_serialize_vint_u32(power_of_128);
aux_test_serialize_vint_u32(power_of_128 + 1u32);
}
aux_test_serialize_vint_u32(u32::max_value());
}
}
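
The byte layout implemented by `serialize_into` and `deserialize` above can be sketched on its own. The value is split into 7-bit groups, least significant first, and the high bit marks the last byte (the reverse of LEB128, where the high bit means "more bytes follow"). The names `encode_vint` and `decode_vint` are hypothetical and not part of the crate; the decoder returns `None` instead of an `io::Error` to keep the sketch short.

```rust
/// High bit set on the final byte of a vint.
const STOP_BIT: u8 = 128;

/// Appends `val` to `out` as 7-bit groups, least significant group first.
/// Only the last byte carries the stop bit.
fn encode_vint(mut val: u64, out: &mut Vec<u8>) {
    loop {
        let low = (val & 127) as u8;
        val >>= 7;
        if val == 0 {
            out.push(low | STOP_BIT);
            return;
        }
        out.push(low);
    }
}

/// Decodes one vint from the start of `data`.
/// Returns the value and the number of bytes consumed, or `None`
/// if no stop bit is found within the first 10 bytes.
fn decode_vint(data: &[u8]) -> Option<(u64, usize)> {
    let mut result = 0u64;
    for (i, &b) in data.iter().enumerate().take(10) {
        // Byte `i` contributes bits 7*i .. 7*i + 6.
        result |= u64::from(b & 127) << (7 * i);
        if b >= STOP_BIT {
            return Some((result, i + 1));
        }
    }
    None
}
```

For example, 170 is encoded as the two bytes `[42, 129]`: 170 = 42 + 1 * 128, and the second byte is `1 | 128`. A `u64` needs at most 10 bytes, which is why the buffers above are sized `[0u8; 10]`.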


@@ -1,159 +0,0 @@
use super::{BlockEncoder, BlockDecoder};
use super::NUM_DOCS_PER_BLOCK;
use compression::{VIntEncoder, VIntDecoder};
pub struct CompositeEncoder {
block_encoder: BlockEncoder,
output: Vec<u8>,
}
impl CompositeEncoder {
pub fn new() -> CompositeEncoder {
CompositeEncoder {
block_encoder: BlockEncoder::new(),
output: Vec::with_capacity(500_000),
}
}
pub fn compress_sorted(&mut self, vals: &[u32]) -> &[u8] {
self.output.clear();
let num_blocks = vals.len() / NUM_DOCS_PER_BLOCK;
let mut offset = 0u32;
for i in 0..num_blocks {
let vals_slice = &vals[i * NUM_DOCS_PER_BLOCK .. (i + 1) * NUM_DOCS_PER_BLOCK];
let block_compressed = self.block_encoder.compress_block_sorted(vals_slice, offset);
offset = vals_slice[NUM_DOCS_PER_BLOCK - 1];
self.output.extend_from_slice(block_compressed);
}
let vint_compressed = self.block_encoder.compress_vint_sorted(&vals[num_blocks * NUM_DOCS_PER_BLOCK..], offset);
self.output.extend_from_slice(vint_compressed);
&self.output
}
pub fn compress_unsorted(&mut self, vals: &[u32]) -> &[u8] {
self.output.clear();
let num_blocks = vals.len() / NUM_DOCS_PER_BLOCK;
for i in 0..num_blocks {
let vals_slice = &vals[i * NUM_DOCS_PER_BLOCK .. (i + 1) * NUM_DOCS_PER_BLOCK];
let block_compressed = self.block_encoder.compress_block_unsorted(vals_slice);
self.output.extend_from_slice(block_compressed);
}
let vint_compressed = self.block_encoder.compress_vint_unsorted(&vals[num_blocks * NUM_DOCS_PER_BLOCK..]);
self.output.extend_from_slice(vint_compressed);
&self.output
}
}
pub struct CompositeDecoder {
block_decoder: BlockDecoder,
vals: Vec<u32>,
}
impl CompositeDecoder {
pub fn new() -> CompositeDecoder {
CompositeDecoder {
block_decoder: BlockDecoder::new(),
vals: Vec::with_capacity(500_000),
}
}
pub fn uncompress_sorted(&mut self, mut compressed_data: &[u8], uncompressed_len: usize) -> &[u32] {
if uncompressed_len > self.vals.capacity() {
let extra_capacity = uncompressed_len - self.vals.capacity();
self.vals.reserve(extra_capacity);
}
let mut offset = 0u32;
self.vals.clear();
let num_blocks = uncompressed_len / NUM_DOCS_PER_BLOCK;
for _ in 0..num_blocks {
compressed_data = self.block_decoder.uncompress_block_sorted(compressed_data, offset);
offset = self.block_decoder.output(NUM_DOCS_PER_BLOCK - 1);
self.vals.extend_from_slice(self.block_decoder.output_array());
}
self.block_decoder.uncompress_vint_sorted(compressed_data, offset, uncompressed_len % NUM_DOCS_PER_BLOCK);
self.vals.extend_from_slice(self.block_decoder.output_array());
&self.vals
}
pub fn uncompress_unsorted(&mut self, mut compressed_data: &[u8], uncompressed_len: usize) -> &[u32] {
self.vals.clear();
let num_blocks = uncompressed_len / NUM_DOCS_PER_BLOCK;
for _ in 0..num_blocks {
compressed_data = self.block_decoder.uncompress_block_unsorted(compressed_data);
self.vals.extend_from_slice(self.block_decoder.output_array());
}
self.block_decoder.uncompress_vint_unsorted(compressed_data, uncompressed_len % NUM_DOCS_PER_BLOCK);
self.vals.extend_from_slice(self.block_decoder.output_array());
&self.vals
}
}
impl Into<Vec<u32>> for CompositeDecoder {
fn into(self) -> Vec<u32> {
self.vals
}
}
#[cfg(test)]
pub mod tests {
use test::Bencher;
use super::*;
use compression::tests::generate_array;
#[test]
fn test_composite_unsorted() {
let data = generate_array(10_000, 0.1);
let mut encoder = CompositeEncoder::new();
let compressed = encoder.compress_unsorted(&data);
assert_eq!(compressed.len(), 19_790);
let mut decoder = CompositeDecoder::new();
let result = decoder.uncompress_unsorted(&compressed, data.len());
for i in 0..data.len() {
assert_eq!(data[i], result[i]);
}
}
#[test]
fn test_composite_sorted() {
let data = generate_array(10_000, 0.1);
let mut encoder = CompositeEncoder::new();
let compressed = encoder.compress_sorted(&data);
assert_eq!(compressed.len(), 7_822);
let mut decoder = CompositeDecoder::new();
let result = decoder.uncompress_sorted(&compressed, data.len());
for i in 0..data.len() {
assert_eq!(data[i], result[i]);
}
}
const BENCH_NUM_INTS: usize = 99_968;
#[bench]
fn bench_compress(b: &mut Bencher) {
let mut encoder = CompositeEncoder::new();
let data = generate_array(BENCH_NUM_INTS, 0.1);
b.iter(|| {
encoder.compress_sorted(&data);
});
}
#[bench]
fn bench_uncompress(b: &mut Bencher) {
let mut encoder = CompositeEncoder::new();
let data = generate_array(BENCH_NUM_INTS, 0.1);
let compressed = encoder.compress_sorted(&data);
let mut decoder = CompositeDecoder::new();
b.iter(|| {
decoder.uncompress_sorted(compressed, BENCH_NUM_INTS);
});
}
}


@@ -1,127 +0,0 @@
use common::bitpacker::compute_num_bits;
use common::bitpacker::{BitPacker, BitUnpacker};
use std::cmp;
use std::io::Write;
use super::NUM_DOCS_PER_BLOCK;
const COMPRESSED_BLOCK_MAX_SIZE: usize = NUM_DOCS_PER_BLOCK * 4 + 1;
pub fn compress_sorted(vals: &mut [u32], mut output: &mut [u8], offset: u32) -> usize {
let mut max_delta = 0;
{
let mut local_offset = offset;
for i in 0..NUM_DOCS_PER_BLOCK {
let val = vals[i];
let delta = val - local_offset;
max_delta = cmp::max(max_delta, delta);
vals[i] = delta;
local_offset = val;
}
}
let num_bits = compute_num_bits(max_delta);
output.write_all(&[num_bits]).unwrap();
let mut bit_packer = BitPacker::new(num_bits as usize);
for val in vals {
bit_packer.write(*val, &mut output).unwrap();
}
1 + bit_packer.close(&mut output).expect("packing in memory should never fail")
}
pub struct BlockEncoder {
pub output: [u8; COMPRESSED_BLOCK_MAX_SIZE],
pub output_len: usize,
input_buffer: [u32; NUM_DOCS_PER_BLOCK],
}
impl BlockEncoder {
pub fn new() -> BlockEncoder {
BlockEncoder {
output: [0u8; COMPRESSED_BLOCK_MAX_SIZE],
output_len: 0,
input_buffer: [0u32; NUM_DOCS_PER_BLOCK],
}
}
pub fn compress_block_sorted(&mut self, vals: &[u32], offset: u32) -> &[u8] {
self.input_buffer.clone_from_slice(vals);
let compressed_size = compress_sorted(&mut self.input_buffer, &mut self.output, offset);
&self.output[..compressed_size]
}
pub fn compress_block_unsorted(&mut self, vals: &[u32]) -> &[u8] {
let compressed_size: usize = {
let mut output: &mut [u8] = &mut self.output;
let max = vals.iter().cloned().max().expect("compress unsorted called with an empty array");
let num_bits = compute_num_bits(max);
output.write_all(&[num_bits]).unwrap();
let mut bit_packer = BitPacker::new(num_bits as usize);
for val in vals {
bit_packer.write(*val, &mut output).unwrap();
}
1 + bit_packer.close(&mut output).expect("packing in memory should never fail")
};
&self.output[..compressed_size]
}
}
pub struct BlockDecoder {
pub output: [u32; COMPRESSED_BLOCK_MAX_SIZE],
pub output_len: usize,
}
impl BlockDecoder {
pub fn new() -> BlockDecoder {
BlockDecoder::with_val(0u32)
}
pub fn with_val(val: u32) -> BlockDecoder {
BlockDecoder {
output: [val; COMPRESSED_BLOCK_MAX_SIZE],
output_len: 0,
}
}
pub fn uncompress_block_sorted<'a>(&mut self, compressed_data: &'a [u8], mut offset: u32) -> &'a[u8] {
let consumed_size = {
let num_bits = compressed_data[0];
let bit_unpacker = BitUnpacker::new(&compressed_data[1..], num_bits as usize);
for i in 0..NUM_DOCS_PER_BLOCK {
let delta = bit_unpacker.get(i);
let val = offset + delta;
self.output[i] = val;
offset = val;
}
1 + (num_bits as usize * NUM_DOCS_PER_BLOCK + 7) / 8
};
self.output_len = NUM_DOCS_PER_BLOCK;
&compressed_data[consumed_size..]
}
pub fn uncompress_block_unsorted<'a>(&mut self, compressed_data: &'a [u8]) -> &'a[u8] {
let num_bits = compressed_data[0];
let bit_unpacker = BitUnpacker::new(&compressed_data[1..], num_bits as usize);
for i in 0..NUM_DOCS_PER_BLOCK {
self.output[i] = bit_unpacker.get(i);
}
let consumed_size = 1 + (num_bits as usize * NUM_DOCS_PER_BLOCK + 7) / 8;
self.output_len = NUM_DOCS_PER_BLOCK;
&compressed_data[consumed_size..]
}
#[inline]
pub fn output_array(&self,) -> &[u32] {
&self.output[..self.output_len]
}
#[inline]
pub fn output(&self, idx: usize) -> u32 {
self.output[idx]
}
}


@@ -1,112 +0,0 @@
use super::NUM_DOCS_PER_BLOCK;
use libc::size_t;
const COMPRESSED_BLOCK_MAX_SIZE: usize = NUM_DOCS_PER_BLOCK * 4 + 1;
extern {
fn compress_sorted_cpp(
data: *const u32,
output: *mut u8,
offset: u32) -> size_t;
fn uncompress_sorted_cpp(
compressed_data: *const u8,
output: *mut u32,
offset: u32) -> size_t;
fn compress_unsorted_cpp(
data: *const u32,
output: *mut u8) -> size_t;
fn uncompress_unsorted_cpp(
compressed_data: *const u8,
output: *mut u32) -> size_t;
}
fn compress_sorted(vals: &[u32], output: &mut [u8], offset: u32) -> usize {
unsafe { compress_sorted_cpp(vals.as_ptr(), output.as_mut_ptr(), offset) }
}
fn uncompress_sorted(compressed_data: &[u8], output: &mut [u32], offset: u32) -> usize {
unsafe { uncompress_sorted_cpp(compressed_data.as_ptr(), output.as_mut_ptr(), offset) }
}
fn compress_unsorted(vals: &[u32], output: &mut [u8]) -> usize {
unsafe { compress_unsorted_cpp(vals.as_ptr(), output.as_mut_ptr()) }
}
fn uncompress_unsorted(compressed_data: &[u8], output: &mut [u32]) -> usize {
unsafe { uncompress_unsorted_cpp(compressed_data.as_ptr(), output.as_mut_ptr()) }
}
pub struct BlockEncoder {
pub output: [u8; COMPRESSED_BLOCK_MAX_SIZE],
pub output_len: usize,
}
impl BlockEncoder {
pub fn new() -> BlockEncoder {
BlockEncoder {
output: [0u8; COMPRESSED_BLOCK_MAX_SIZE],
output_len: 0,
}
}
pub fn compress_block_sorted(&mut self, vals: &[u32], offset: u32) -> &[u8] {
let compressed_size = compress_sorted(vals, &mut self.output, offset);
&self.output[..compressed_size]
}
pub fn compress_block_unsorted(&mut self, vals: &[u32]) -> &[u8] {
let compressed_size = compress_unsorted(vals, &mut self.output);
&self.output[..compressed_size]
}
}
pub struct BlockDecoder {
pub output: [u32; COMPRESSED_BLOCK_MAX_SIZE],
pub output_len: usize,
}
impl BlockDecoder {
pub fn new() -> BlockDecoder {
BlockDecoder::with_val(0u32)
}
pub fn with_val(val: u32) -> BlockDecoder {
BlockDecoder {
output: [val; COMPRESSED_BLOCK_MAX_SIZE],
output_len: 0,
}
}
pub fn uncompress_block_sorted<'a>(&mut self, compressed_data: &'a [u8], offset: u32) -> &'a[u8] {
let consumed_size = uncompress_sorted(compressed_data, &mut self.output, offset);
self.output_len = NUM_DOCS_PER_BLOCK;
&compressed_data[consumed_size..]
}
pub fn uncompress_block_unsorted<'a>(&mut self, compressed_data: &'a [u8]) -> &'a[u8] {
let consumed_size = uncompress_unsorted(compressed_data, &mut self.output);
self.output_len = NUM_DOCS_PER_BLOCK;
&compressed_data[consumed_size..]
}
#[inline]
pub fn output_array(&self,) -> &[u32] {
&self.output[..self.output_len]
}
#[inline]
pub fn output(&self, idx: usize) -> u32 {
self.output[idx]
}
}


@@ -1,275 +0,0 @@
#![allow(dead_code)]
mod composite;
pub use self::composite::{CompositeEncoder, CompositeDecoder};
#[cfg(feature="simdcompression")]
mod compression_simd;
#[cfg(feature="simdcompression")]
pub use self::compression_simd::{BlockEncoder, BlockDecoder};
#[cfg(not(feature="simdcompression"))]
mod compression_nosimd;
#[cfg(not(feature="simdcompression"))]
pub use self::compression_nosimd::{BlockEncoder, BlockDecoder};
pub trait VIntEncoder {
fn compress_vint_sorted(&mut self, input: &[u32], offset: u32) -> &[u8];
fn compress_vint_unsorted(&mut self, input: &[u32]) -> &[u8];
}
pub trait VIntDecoder {
fn uncompress_vint_sorted<'a>(&mut self, compressed_data: &'a [u8], offset: u32, num_els: usize) -> &'a [u8];
fn uncompress_vint_unsorted<'a>(&mut self, compressed_data: &'a [u8], num_els: usize) -> &'a [u8];
}
impl VIntEncoder for BlockEncoder{
fn compress_vint_sorted(&mut self, input: &[u32], mut offset: u32) -> &[u8] {
let mut byte_written = 0;
for &v in input {
let mut to_encode: u32 = v - offset;
offset = v;
loop {
let next_byte: u8 = (to_encode % 128u32) as u8;
to_encode /= 128u32;
if to_encode == 0u32 {
self.output[byte_written] = next_byte | 128u8;
byte_written += 1;
break;
}
else {
self.output[byte_written] = next_byte;
byte_written += 1;
}
}
}
&self.output[..byte_written]
}
fn compress_vint_unsorted(&mut self, input: &[u32]) -> &[u8] {
let mut byte_written = 0;
for &v in input {
let mut to_encode: u32 = v;
loop {
let next_byte: u8 = (to_encode % 128u32) as u8;
to_encode /= 128u32;
if to_encode == 0u32 {
self.output[byte_written] = next_byte | 128u8;
byte_written += 1;
break;
}
else {
self.output[byte_written] = next_byte;
byte_written += 1;
}
}
}
&self.output[..byte_written]
}
}
impl VIntDecoder for BlockDecoder {
fn uncompress_vint_sorted<'a>(
&mut self,
compressed_data: &'a [u8],
offset: u32,
num_els: usize) -> &'a [u8] {
let mut read_byte = 0;
let mut result = offset;
for i in 0..num_els {
let mut shift = 0u32;
loop {
let cur_byte = compressed_data[read_byte];
read_byte += 1;
result += ((cur_byte % 128u8) as u32) << shift;
if cur_byte & 128u8 != 0u8 {
break;
}
shift += 7;
}
self.output[i] = result;
}
self.output_len = num_els;
&compressed_data[read_byte..]
}
fn uncompress_vint_unsorted<'a>(
&mut self,
compressed_data: &'a [u8],
num_els: usize) -> &'a [u8] {
let mut read_byte = 0;
for i in 0..num_els {
let mut result = 0u32;
let mut shift = 0u32;
loop {
let cur_byte = compressed_data[read_byte];
read_byte += 1;
result += ((cur_byte % 128u8) as u32) << shift;
if cur_byte & 128u8 != 0u8 {
break;
}
shift += 7;
}
self.output[i] = result;
}
self.output_len = num_els;
&compressed_data[read_byte..]
}
}
pub const NUM_DOCS_PER_BLOCK: usize = 128; //< should be a power of 2 to let the compiler optimize.
#[cfg(test)]
pub mod tests {
use rand::Rng;
use rand::SeedableRng;
use rand::XorShiftRng;
use super::*;
use test::Bencher;
fn generate_array_with_seed(n: usize, ratio: f32, seed_val: u32) -> Vec<u32> {
let seed: &[u32; 4] = &[1, 2, 3, seed_val];
let mut rng: XorShiftRng = XorShiftRng::from_seed(*seed);
(0..u32::max_value())
.filter(|_| rng.next_f32()< ratio)
.take(n)
.collect()
}
pub fn generate_array(n: usize, ratio: f32) -> Vec<u32> {
generate_array_with_seed(n, ratio, 4)
}
#[test]
fn test_encode_sorted_block() {
let vals: Vec<u32> = (0u32..128u32).map(|i| i*7).collect();
let mut encoder = BlockEncoder::new();
let compressed_data = encoder.compress_block_sorted(&vals, 0);
let mut decoder = BlockDecoder::new();
{
let remaining_data = decoder.uncompress_block_sorted(compressed_data, 0);
assert_eq!(remaining_data.len(), 0);
}
for i in 0..128 {
assert_eq!(vals[i], decoder.output(i));
}
}
#[test]
fn test_encode_sorted_block_with_offset() {
let vals: Vec<u32> = (0u32..128u32).map(|i| 11 + i*7).collect();
let mut encoder = BlockEncoder::new();
let compressed_data = encoder.compress_block_sorted(&vals, 10);
let mut decoder = BlockDecoder::new();
{
let remaining_data = decoder.uncompress_block_sorted(compressed_data, 10);
assert_eq!(remaining_data.len(), 0);
}
for i in 0..128 {
assert_eq!(vals[i], decoder.output(i));
}
}
#[test]
fn test_encode_sorted_block_with_junk() {
let mut compressed: Vec<u8> = Vec::new();
let n = 128;
let vals: Vec<u32> = (0..n).map(|i| 11u32 + (i as u32)*7u32).collect();
let mut encoder = BlockEncoder::new();
let compressed_data = encoder.compress_block_sorted(&vals, 10);
compressed.extend_from_slice(compressed_data);
compressed.push(173u8);
let mut decoder = BlockDecoder::new();
{
let remaining_data = decoder.uncompress_block_sorted(&compressed, 10);
assert_eq!(remaining_data.len(), 1);
assert_eq!(remaining_data[0], 173u8);
}
for i in 0..n {
assert_eq!(vals[i], decoder.output(i));
}
}
#[test]
fn test_encode_unsorted_block_with_junk() {
let mut compressed: Vec<u8> = Vec::new();
let n = 128;
let vals: Vec<u32> = (0..n).map(|i| 11u32 + (i as u32)*7u32 % 12).collect();
let mut encoder = BlockEncoder::new();
let compressed_data = encoder.compress_block_unsorted(&vals);
compressed.extend_from_slice(compressed_data);
compressed.push(173u8);
let mut decoder = BlockDecoder::new();
{
let remaining_data = decoder.uncompress_block_unsorted(&compressed);
assert_eq!(remaining_data.len(), 1);
assert_eq!(remaining_data[0], 173u8);
}
for i in 0..n {
assert_eq!(vals[i], decoder.output(i));
}
}
#[test]
fn test_encode_vint() {
{
let expected_length = 123;
let mut encoder = BlockEncoder::new();
let input: Vec<u32> = (0u32..123u32)
.map(|i| 4 + i * 7 / 2)
.into_iter()
.collect();
for offset in &[0u32, 1u32, 2u32] {
let encoded_data = encoder.compress_vint_sorted(&input, *offset);
assert_eq!(encoded_data.len(), expected_length);
let mut decoder = BlockDecoder::new();
let remaining_data = decoder.uncompress_vint_sorted(&encoded_data, *offset, input.len());
assert_eq!(0, remaining_data.len());
assert_eq!(input, decoder.output_array());
}
}
{
let mut encoder = BlockEncoder::new();
let input = vec!(3u32, 17u32, 187u32);
let encoded_data = encoder.compress_vint_sorted(&input, 0);
assert_eq!(encoded_data.len(), 4);
assert_eq!(encoded_data[0], 3u8 + 128u8);
assert_eq!(encoded_data[1], (17u8 - 3u8) + 128u8);
assert_eq!(encoded_data[2], (187u8 - 17u8 - 128u8));
assert_eq!(encoded_data[3], (1u8 + 128u8));
}
}
#[bench]
fn bench_compress(b: &mut Bencher) {
let mut encoder = BlockEncoder::new();
let data = generate_array(NUM_DOCS_PER_BLOCK, 0.1);
b.iter(|| {
encoder.compress_block_sorted(&data, 0u32);
});
}
#[bench]
fn bench_uncompress(b: &mut Bencher) {
let mut encoder = BlockEncoder::new();
let data = generate_array(NUM_DOCS_PER_BLOCK, 0.1);
let compressed = encoder.compress_block_sorted(&data, 0u32);
let mut decoder = BlockDecoder::new();
b.iter(|| {
decoder.uncompress_block_sorted(compressed, 0u32);
});
}
}

src/core/executor.rs Normal file

@@ -0,0 +1,136 @@
use crate::Result;
use crossbeam::channel;
use scoped_pool::{Pool, ThreadConfig};
/// Search executor, abstracting over whether search requests run on a single thread or on a thread pool.
///
/// We don't expose the Rayon thread pool directly here for several reasons.
///
/// First, dependency hell. It is not a good idea to expose the
/// API of a dependency, knowing it might conflict with a different version
/// used by the client. Second, we may stop using rayon in the future.
pub enum Executor {
SingleThread,
ThreadPool(Pool),
}
impl Executor {
/// Creates an Executor that performs all tasks in the caller's thread.
pub fn single_thread() -> Executor {
Executor::SingleThread
}
/// Creates an Executor that dispatches tasks to a thread pool.
pub fn multi_thread(num_threads: usize, prefix: &'static str) -> Executor {
let thread_config = ThreadConfig::new().prefix(prefix);
let pool = Pool::with_thread_config(num_threads, thread_config);
Executor::ThreadPool(pool)
}
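/// Maps `f` over `args`, either in the caller's thread or in the thread pool,
/// depending on the executor. Results are returned in the order of `args`.
///
/// Regardless of the executor (`SingleThread` or `ThreadPool`), panics in the task
/// will propagate to the caller.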
pub fn map<
A: Send,
R: Send,
AIterator: Iterator<Item = A>,
F: Sized + Sync + Fn(A) -> Result<R>,
>(
&self,
f: F,
args: AIterator,
) -> Result<Vec<R>> {
match self {
Executor::SingleThread => args.map(f).collect::<Result<_>>(),
Executor::ThreadPool(pool) => {
let args_with_indices: Vec<(usize, A)> = args.enumerate().collect();
let num_fruits = args_with_indices.len();
let fruit_receiver = {
let (fruit_sender, fruit_receiver) = channel::unbounded();
pool.scoped(|scope| {
for arg_with_idx in args_with_indices {
scope.execute(|| {
let (idx, arg) = arg_with_idx;
let fruit = f(arg);
if let Err(err) = fruit_sender.send((idx, fruit)) {
error!("Failed to send search task. It probably means all search threads have panicked. {:?}", err);
}
});
}
});
fruit_receiver
// This ends the scope of fruit_sender.
// This is important as it makes it possible for the fruit_receiver iteration to
// terminate.
};
// This is lame, but safe.
let mut results_with_position = Vec::with_capacity(num_fruits);
for (pos, fruit_res) in fruit_receiver {
let fruit = fruit_res?;
results_with_position.push((pos, fruit));
}
results_with_position.sort_by_key(|(pos, _)| *pos);
assert_eq!(results_with_position.len(), num_fruits);
Ok(results_with_position
.into_iter()
.map(|(_, fruit)| fruit)
.collect::<Vec<_>>())
}
}
}
}
#[cfg(test)]
mod tests {
use super::Executor;
#[test]
#[should_panic(expected = "panic should propagate")]
fn test_panic_propagates_single_thread() {
let _result: Vec<usize> = Executor::single_thread()
.map(
|_| {
panic!("panic should propagate");
},
vec![0].into_iter(),
)
.unwrap();
}
#[test]
#[should_panic] //< unfortunately the panic message is not propagated
fn test_panic_propagates_multi_thread() {
let _result: Vec<usize> = Executor::multi_thread(1, "search-test")
.map(
|_| {
panic!("panic should propagate");
},
vec![0].into_iter(),
)
.unwrap();
}
#[test]
fn test_map_singlethread() {
let result: Vec<usize> = Executor::single_thread()
.map(|i| Ok(i * 2), 0..1_000)
.unwrap();
assert_eq!(result.len(), 1_000);
for i in 0..1_000 {
assert_eq!(result[i], i * 2);
}
}
#[test]
fn test_map_multithread() {
let result: Vec<usize> = Executor::multi_thread(3, "search-test")
.map(|i| Ok(i * 2), 0..10)
.unwrap();
assert_eq!(result.len(), 10);
for i in 0..10 {
assert_eq!(result[i], i * 2);
}
}
}
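
Review note on `Executor::map`: the thread-pool branch tags every argument with its index, sends `(index, result)` pairs over a channel, and sorts by index at the end so the output order matches the input order. The sketch below reproduces that pattern with the standard library only, assuming `std::thread::scope` and `std::sync::mpsc` in place of the pool and channel crates used in the diff. The name `ordered_parallel_map` and the `String` error type are hypothetical and are not part of tantivy.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical std-only analogue of the thread-pool branch of `Executor::map`.
/// It spawns one scoped thread per argument instead of reusing a pool.
fn ordered_parallel_map<A, R, F>(f: F, args: Vec<A>) -> Result<Vec<R>, String>
where
    A: Send,
    R: Send,
    F: Fn(A) -> Result<R, String> + Sync,
{
    let num_items = args.len();
    let (sender, receiver) = mpsc::channel();
    // All spawned threads are joined when the scope ends.
    // If one of them panicked, `thread::scope` panics in the caller.
    thread::scope(|scope| {
        for (idx, arg) in args.into_iter().enumerate() {
            let sender = sender.clone();
            let f = &f;
            scope.spawn(move || {
                // A send error can only happen if the receiver is gone.
                let _ = sender.send((idx, f(arg)));
            });
        }
    });
    // Drop the last sender so that iterating over the receiver terminates.
    drop(sender);
    let mut results = Vec::with_capacity(num_items);
    for (idx, res) in receiver {
        // The first error seen aborts the whole map.
        results.push((idx, res?));
    }
    // Results arrive in completion order; restore the input order.
    results.sort_by_key(|(idx, _)| *idx);
    Ok(results.into_iter().map(|(_, item)| item).collect())
}
```

Calling `ordered_parallel_map(|i: usize| Ok(i * 2), (0..10).collect())` yields the doubled values in input order, which is the same property `test_map_multithread` checks for the real executor.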

View File

@@ -1,71 +1,125 @@
use Result;
use Error;
use schema::Schema;
use std::sync::Arc;
use std::fmt;
use rustc_serialize::json;
use core::SegmentId;
use directory::{Directory, MmapDirectory, RAMDirectory};
use indexer::IndexWriter;
use core::searcher::Searcher;
use std::convert::From;
use num_cpus;
use super::segment::Segment;
use core::SegmentReader;
use super::pool::Pool;
use super::pool::LeasedItem;
use std::path::Path;
use indexer::SegmentManager;
use core::IndexMeta;
use core::META_FILEPATH;
use super::segment::create_segment;
use indexer::segment_updater::save_new_metas;
use super::segment::Segment;
use crate::core::Executor;
use crate::core::IndexMeta;
use crate::core::SegmentId;
use crate::core::SegmentMeta;
use crate::core::SegmentMetaInventory;
use crate::core::META_FILEPATH;
use crate::directory::ManagedDirectory;
#[cfg(feature = "mmap")]
use crate::directory::MmapDirectory;
use crate::directory::INDEX_WRITER_LOCK;
use crate::directory::{Directory, RAMDirectory};
use crate::error::DataCorruption;
use crate::error::TantivyError;
use crate::indexer::index_writer::HEAP_SIZE_MIN;
use crate::indexer::segment_updater::save_new_metas;
use crate::reader::IndexReader;
use crate::reader::IndexReaderBuilder;
use crate::schema::Field;
use crate::schema::FieldType;
use crate::schema::Schema;
use crate::tokenizer::BoxedTokenizer;
use crate::tokenizer::TokenizerManager;
use crate::IndexWriter;
use crate::Result;
use num_cpus;
use std::borrow::BorrowMut;
use std::fmt;
#[cfg(feature = "mmap")]
use std::path::Path;
use std::sync::Arc;
const NUM_SEARCHERS: usize = 12;
/// Accessor to the index segment manager
///
/// This method is not part of tantivy's public API
pub fn get_segment_manager(index: &Index) -> Arc<SegmentManager> {
index.segment_manager.clone()
fn load_metas(directory: &dyn Directory, inventory: &SegmentMetaInventory) -> Result<IndexMeta> {
let meta_data = directory.atomic_read(&META_FILEPATH)?;
let meta_string = String::from_utf8_lossy(&meta_data);
IndexMeta::deserialize(&meta_string, &inventory)
.map_err(|e| {
DataCorruption::new(
META_FILEPATH.to_path_buf(),
format!("Meta file cannot be deserialized. {:?}.", e),
)
})
.map_err(From::from)
}
fn load_metas(directory: &Directory) -> Result<IndexMeta> {
let meta_file = try!(directory.open_read(&META_FILEPATH));
let meta_content = String::from_utf8_lossy(meta_file.as_slice());
json::decode(&meta_content)
.map_err(|e| Error::CorruptedFile(META_FILEPATH.clone(), Box::new(e)))
}
/// Tantivy's Search Index
/// Search Index
#[derive(Clone)]
pub struct Index {
segment_manager: Arc<SegmentManager>,
directory: Box<Directory>,
directory: ManagedDirectory,
schema: Schema,
searcher_pool: Arc<Pool<Searcher>>,
docstamp: u64,
executor: Arc<Executor>,
tokenizers: TokenizerManager,
inventory: SegmentMetaInventory,
}
impl Index {
/// Examines the directory to see if it contains an index
pub fn exists<Dir: Directory>(dir: &Dir) -> bool {
dir.exists(&META_FILEPATH)
}
/// Accessor to the search executor.
///
/// This pool is used by default when calling `searcher.search(...)`
/// to perform search on the individual segments.
///
/// By default the executor is single-threaded and simply runs in the calling thread.
pub fn search_executor(&self) -> &Executor {
self.executor.as_ref()
}
/// Replace the default single thread search executor pool
/// by a thread pool with a given number of threads.
pub fn set_multithread_executor(&mut self, num_threads: usize) {
self.executor = Arc::new(Executor::multi_thread(num_threads, "thrd-tantivy-search-"));
}
/// Replace the default single thread search executor pool
/// by a thread pool with as many threads as reported by `num_cpus::get()`.
pub fn set_default_multithread_executor(&mut self) {
let default_num_threads = num_cpus::get();
self.set_multithread_executor(default_num_threads);
}
/// Creates a new index using the `RAMDirectory`.
///
/// The index will be allocated in anonymous memory.
/// This should only be used for unit tests.
pub fn create_in_ram(schema: Schema) -> Index {
let directory = Box::new(RAMDirectory::create());
Index::from_directory(directory, schema).expect("Creating a RAMDirectory should never fail") // unwrap is ok here
let ram_directory = RAMDirectory::create();
Index::create(ram_directory, schema).expect("Creating a RAMDirectory should never fail")
}
/// Creates a new index in a given filepath.
/// The index will use the `MmapDirectory`.
///
/// If a previous index was in this directory, `create_in_dir` returns `TantivyError::IndexAlreadyExists`.
pub fn create(directory_path: &Path, schema: Schema) -> Result<Index> {
let mut directory = MmapDirectory::open(directory_path)?;
save_new_metas(schema.clone(), 0, &mut directory)?;
Index::from_directory(box directory, schema)
#[cfg(feature = "mmap")]
pub fn create_in_dir<P: AsRef<Path>>(directory_path: P, schema: Schema) -> Result<Index> {
let mmap_directory = MmapDirectory::open(directory_path)?;
if Index::exists(&mmap_directory) {
return Err(TantivyError::IndexAlreadyExists);
}
Index::create(mmap_directory, schema)
}
/// Opens or creates a new index in the provided directory
pub fn open_or_create<Dir: Directory>(dir: Dir, schema: Schema) -> Result<Index> {
if Index::exists(&dir) {
let index = Index::open(dir)?;
if index.schema() == schema {
Ok(index)
} else {
Err(TantivyError::SchemaError(
"An index exists but the schema does not match.".to_string(),
))
}
} else {
Index::create(dir, schema)
}
}
/// Creates a new index in a temp directory.
@@ -76,71 +130,175 @@ impl Index {
///
/// The temp directory is only used for testing the `MmapDirectory`.
/// For other unit tests, prefer the `RAMDirectory`, see: `create_in_ram`.
#[cfg(feature = "mmap")]
pub fn create_from_tempdir(schema: Schema) -> Result<Index> {
let directory = Box::new(try!(MmapDirectory::create_from_tempdir()));
let mmap_directory = MmapDirectory::create_from_tempdir()?;
Index::create(mmap_directory, schema)
}
/// Creates a new index given an implementation of the trait `Directory`
pub fn create<Dir: Directory>(dir: Dir, schema: Schema) -> Result<Index> {
let directory = ManagedDirectory::wrap(dir)?;
Index::from_directory(directory, schema)
}
/// Creates a new index given a directory and an `IndexMeta`.
fn create_from_metas(directory: Box<Directory>, metas: IndexMeta) -> Result<Index> {
let schema = metas.schema.clone();
let docstamp = metas.docstamp;
let committed_segments = metas.committed_segments;
// TODO log something if uncommitted is not empty.
let index = Index {
segment_manager: Arc::new(SegmentManager::from_segments(committed_segments)),
directory: directory,
schema: schema,
searcher_pool: Arc::new(Pool::new()),
docstamp: docstamp,
};
try!(index.load_searchers());
Ok(index)
/// Create a new index from a directory.
///
/// This will overwrite existing meta.json
fn from_directory(mut directory: ManagedDirectory, schema: Schema) -> Result<Index> {
save_new_metas(schema.clone(), directory.borrow_mut())?;
let metas = IndexMeta::with_schema(schema);
Index::create_from_metas(directory, &metas, SegmentMetaInventory::default())
}
/// Opens a new directory from a directory.
pub fn from_directory(directory: Box<Directory>, schema: Schema) -> Result<Index> {
Index::create_from_metas(directory, IndexMeta::with_schema(schema))
/// Creates a new index given a directory and an `IndexMeta`.
fn create_from_metas(
directory: ManagedDirectory,
metas: &IndexMeta,
inventory: SegmentMetaInventory,
) -> Result<Index> {
let schema = metas.schema.clone();
Ok(Index {
directory,
schema,
tokenizers: TokenizerManager::default(),
executor: Arc::new(Executor::single_thread()),
inventory,
})
}
/// Accessor for the tokenizer manager.
pub fn tokenizers(&self) -> &TokenizerManager {
&self.tokenizers
}
/// Helper to access the tokenizer associated to a specific field.
pub fn tokenizer_for_field(&self, field: Field) -> Result<Box<dyn BoxedTokenizer>> {
let field_entry = self.schema.get_field_entry(field);
let field_type = field_entry.field_type();
let tokenizer_manager: &TokenizerManager = self.tokenizers();
let tokenizer_name_opt: Option<Box<dyn BoxedTokenizer>> = match field_type {
FieldType::Str(text_options) => text_options
.get_indexing_options()
.map(|text_indexing_options| text_indexing_options.tokenizer().to_string())
.and_then(|tokenizer_name| tokenizer_manager.get(&tokenizer_name)),
_ => None,
};
match tokenizer_name_opt {
Some(tokenizer) => Ok(tokenizer),
None => Err(TantivyError::SchemaError(format!(
"{:?} is not a text field.",
field_entry.name()
))),
}
}
/// Create a default `IndexReader` for the given index.
///
/// See [`Index.reader_builder()`](#method.reader_builder).
pub fn reader(&self) -> Result<IndexReader> {
self.reader_builder().try_into()
}
/// Create an `IndexReader` for the given index.
///
/// Most projects should create at most one reader for a given index.
/// This method is typically called only once per `Index` instance,
/// over the lifetime of most programs.
pub fn reader_builder(&self) -> IndexReaderBuilder {
IndexReaderBuilder::new(self.clone())
}
/// Opens an existing index from an index path.
pub fn open(directory_path: &Path) -> Result<Index> {
let directory = try!(MmapDirectory::open(directory_path));
let metas = try!(load_metas(&directory)); //< TODO does the directory already exists?
Index::create_from_metas(directory.box_clone(), metas)
#[cfg(feature = "mmap")]
pub fn open_in_dir<P: AsRef<Path>>(directory_path: P) -> Result<Index> {
let mmap_directory = MmapDirectory::open(directory_path)?;
Index::open(mmap_directory)
}
/// Returns the index docstamp.
pub(crate) fn inventory(&self) -> &SegmentMetaInventory {
&self.inventory
}
/// Open the index using the provided directory
pub fn open<D: Directory>(directory: D) -> Result<Index> {
let directory = ManagedDirectory::wrap(directory)?;
let inventory = SegmentMetaInventory::default();
let metas = load_metas(&directory, &inventory)?;
Index::create_from_metas(directory, &metas, inventory)
}
/// Reads the index meta file from the directory.
pub fn load_metas(&self) -> Result<IndexMeta> {
load_metas(self.directory(), &self.inventory)
}
/// Open a new index writer. Attempts to acquire a lockfile.
///
/// The docstamp is the number of documents that have been added
/// from the beginning of time, and until the moment of the last commit.
pub fn docstamp(&self) -> u64 {
self.docstamp
}
/// Creates a multithreaded writer.
/// Each writer produces an independent segment.
/// The lockfile should be deleted on drop, but it is possible
/// that due to a panic or other error, a stale lockfile will be
/// left in the index directory. If you are sure that no other
/// `IndexWriter` on the system is accessing the index directory,
/// it is safe to manually delete the lockfile.
///
/// - `num_threads` defines the number of indexing workers that
/// should work at the same time.
///
/// - `overall_heap_size_in_bytes` sets the amount of memory
/// allocated for all indexing threads.
/// Each thread will receive a budget of `overall_heap_size_in_bytes / num_threads`.
///
/// # Errors
/// If the lockfile already exists, returns `Error::FileAlreadyExists`.
/// If the lockfile already exists, returns `Error::DirectoryLockBusy` or an `Error::IOError`.
///
/// # Panics
/// If the heap size per thread is too small, panics.
pub fn writer_with_num_threads(&self,
num_threads: usize,
heap_size_in_bytes: usize)
-> Result<IndexWriter> {
IndexWriter::open(self, num_threads, heap_size_in_bytes)
pub fn writer_with_num_threads(
&self,
num_threads: usize,
overall_heap_size_in_bytes: usize,
) -> Result<IndexWriter> {
let directory_lock = self
.directory
.acquire_lock(&INDEX_WRITER_LOCK)
.map_err(|err| {
TantivyError::LockFailure(
err,
Some(
"Failed to acquire index lock. If you are using \
a regular directory, this means there is already an \
`IndexWriter` working on this `Directory`, in this process \
or in a different process."
.to_string(),
),
)
})?;
let heap_size_in_bytes_per_thread = overall_heap_size_in_bytes / num_threads;
IndexWriter::new(
self,
num_threads,
heap_size_in_bytes_per_thread,
directory_lock,
)
}
/// Creates a multithreaded writer
/// It just calls `writer_with_num_threads` with the number of cores as `num_threads`
///
/// Tantivy will automatically define the number of threads to use.
/// `overall_heap_size_in_bytes` is the total target memory usage that will be split
/// between a given number of threads.
///
/// # Errors
/// If the lockfile already exists, returns `Error::FileAlreadyExists`.
/// # Panics
/// If the heap size per thread is too small, panics.
pub fn writer(&self, heap_size_in_bytes: usize) -> Result<IndexWriter> {
self.writer_with_num_threads(num_cpus::get(), heap_size_in_bytes)
pub fn writer(&self, overall_heap_size_in_bytes: usize) -> Result<IndexWriter> {
let mut num_threads = num_cpus::get();
let heap_size_in_bytes_per_thread = overall_heap_size_in_bytes / num_threads;
if heap_size_in_bytes_per_thread < HEAP_SIZE_MIN {
num_threads = (overall_heap_size_in_bytes / HEAP_SIZE_MIN).max(1);
}
self.writer_with_num_threads(num_threads, overall_heap_size_in_bytes)
}
/// Accessor to the index schema
@@ -151,100 +309,277 @@ impl Index {
}
/// Returns the list of segments that are searchable
pub fn searchable_segments(&self) -> Vec<Segment> {
self.searchable_segment_ids()
pub fn searchable_segments(&self) -> Result<Vec<Segment>> {
Ok(self
.searchable_segment_metas()?
.into_iter()
.map(|segment_id| self.segment(segment_id))
.collect()
.map(|segment_meta| self.segment(segment_meta))
.collect())
}
/// Remove all of the file associated with the segment.
///
/// This method cannot fail. If a problem occurs,
/// some files may end up never being removed.
/// The error will only be logged.
pub fn delete_segment(&self, segment_id: SegmentId) {
self.segment(segment_id).delete();
}
/// Return a segment object given a `segment_id`
///
/// The segment may or may not exist.
pub fn segment(&self, segment_id: SegmentId) -> Segment {
create_segment(self.clone(), segment_id)
}
/// Return a reference to the index directory.
pub fn directory(&self) -> &Directory {
&*self.directory
}
/// Return a mutable reference to the index directory.
pub fn directory_mut(&mut self) -> &mut Directory {
&mut *self.directory
}
/// Returns the list of segment ids that are searchable.
fn searchable_segment_ids(&self) -> Vec<SegmentId> {
self.segment_manager.committed_segments()
#[doc(hidden)]
pub fn segment(&self, segment_meta: SegmentMeta) -> Segment {
create_segment(self.clone(), segment_meta)
}
/// Creates a new segment.
pub fn new_segment(&self) -> Segment {
self.segment(SegmentId::generate_random())
let segment_meta = self
.inventory
.new_segment_meta(SegmentId::generate_random(), 0);
self.segment(segment_meta)
}
/// Creates a new generation of searchers after
/// a change of the set of searchable indexes.
///
/// This needs to be called when a new segment has been
/// published or after a merge.
pub fn load_searchers(&self) -> Result<()> {
let searchable_segments = self.searchable_segments();
let mut searchers = Vec::new();
for _ in 0..NUM_SEARCHERS {
let searchable_segments_clone = searchable_segments.clone();
let segment_readers: Vec<SegmentReader> = try!(searchable_segments_clone.into_iter()
.map(SegmentReader::open)
.collect());
let searcher = Searcher::from(segment_readers);
searchers.push(searcher);
}
self.searcher_pool.publish_new_generation(searchers);
Ok(())
/// Return a reference to the index directory.
pub fn directory(&self) -> &ManagedDirectory {
&self.directory
}
/// Returns a searcher
///
/// This method should be called every single time a search
/// query is performed.
/// The searchers are taken from a pool of `NUM_SEARCHERS` searchers.
/// If no searcher is available
/// this may block.
///
/// The same searcher must be used for a given query, as it ensures
/// the use of a consistent segment set.
pub fn searcher(&self) -> LeasedItem<Searcher> {
self.searcher_pool.acquire()
/// Return a mutable reference to the index directory.
pub fn directory_mut(&mut self) -> &mut ManagedDirectory {
&mut self.directory
}
/// Reads the meta.json and returns the list of
/// `SegmentMeta` from the last commit.
pub fn searchable_segment_metas(&self) -> Result<Vec<SegmentMeta>> {
Ok(self.load_metas()?.segments)
}
/// Returns the list of segment ids that are searchable.
pub fn searchable_segment_ids(&self) -> Result<Vec<SegmentId>> {
Ok(self
.searchable_segment_metas()?
.iter()
.map(SegmentMeta::id)
.collect())
}
}
impl fmt::Debug for Index {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
write!(f, "Index({:?})", self.directory)
}
}
impl Clone for Index {
fn clone(&self) -> Index {
Index {
segment_manager: self.segment_manager.clone(),
#[cfg(test)]
mod tests {
use crate::directory::RAMDirectory;
use crate::schema::Field;
use crate::schema::{Schema, INDEXED, TEXT};
use crate::Index;
use crate::IndexReader;
use crate::IndexWriter;
use crate::ReloadPolicy;
use std::thread;
use std::time::Duration;
directory: self.directory.box_clone(),
schema: self.schema.clone(),
searcher_pool: self.searcher_pool.clone(),
docstamp: self.docstamp,
#[test]
fn test_indexer_for_field() {
let mut schema_builder = Schema::builder();
let num_likes_field = schema_builder.add_u64_field("num_likes", INDEXED);
let body_field = schema_builder.add_text_field("body", TEXT);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
assert!(index.tokenizer_for_field(body_field).is_ok());
assert_eq!(
format!("{:?}", index.tokenizer_for_field(num_likes_field).err()),
"Some(SchemaError(\"\\\"num_likes\\\" is not a text field.\"))"
);
}
#[test]
fn test_index_exists() {
let directory = RAMDirectory::create();
assert!(!Index::exists(&directory));
assert!(Index::create(directory.clone(), throw_away_schema()).is_ok());
assert!(Index::exists(&directory));
}
#[test]
fn open_or_create_should_create() {
let directory = RAMDirectory::create();
assert!(!Index::exists(&directory));
assert!(Index::open_or_create(directory.clone(), throw_away_schema()).is_ok());
assert!(Index::exists(&directory));
}
#[test]
fn open_or_create_should_open() {
let directory = RAMDirectory::create();
assert!(Index::create(directory.clone(), throw_away_schema()).is_ok());
assert!(Index::exists(&directory));
assert!(Index::open_or_create(directory, throw_away_schema()).is_ok());
}
#[test]
fn create_should_wipeoff_existing() {
let directory = RAMDirectory::create();
assert!(Index::create(directory.clone(), throw_away_schema()).is_ok());
assert!(Index::exists(&directory));
assert!(Index::create(directory.clone(), Schema::builder().build()).is_ok());
}
#[test]
fn open_or_create_exists_but_schema_does_not_match() {
let directory = RAMDirectory::create();
assert!(Index::create(directory.clone(), throw_away_schema()).is_ok());
assert!(Index::exists(&directory));
assert!(Index::open_or_create(directory.clone(), throw_away_schema()).is_ok());
let err = Index::open_or_create(directory, Schema::builder().build());
assert_eq!(
format!("{:?}", err.unwrap_err()),
"SchemaError(\"An index exists but the schema does not match.\")"
);
}
fn throw_away_schema() -> Schema {
let mut schema_builder = Schema::builder();
let _ = schema_builder.add_u64_field("num_likes", INDEXED);
schema_builder.build()
}
#[test]
fn test_index_on_commit_reload_policy() {
let schema = throw_away_schema();
let field = schema.get_field("num_likes").unwrap();
let index = Index::create_in_ram(schema);
let reader = index
.reader_builder()
.reload_policy(ReloadPolicy::OnCommit)
.try_into()
.unwrap();
assert_eq!(reader.searcher().num_docs(), 0);
let mut writer = index.writer_with_num_threads(1, 3_000_000).unwrap();
test_index_on_commit_reload_policy_aux(field, &mut writer, &reader);
}
#[cfg(feature = "mmap")]
mod mmap_specific {
use super::*;
use std::path::PathBuf;
use tempdir::TempDir;
#[test]
fn test_index_on_commit_reload_policy_mmap() {
let schema = throw_away_schema();
let field = schema.get_field("num_likes").unwrap();
let tempdir = TempDir::new("index").unwrap();
let tempdir_path = PathBuf::from(tempdir.path());
let index = Index::create_in_dir(&tempdir_path, schema).unwrap();
let mut writer = index.writer_with_num_threads(1, 3_000_000).unwrap();
writer.commit().unwrap();
let reader = index
.reader_builder()
.reload_policy(ReloadPolicy::OnCommit)
.try_into()
.unwrap();
assert_eq!(reader.searcher().num_docs(), 0);
test_index_on_commit_reload_policy_aux(field, &mut writer, &reader);
}
#[test]
fn test_index_manual_policy_mmap() {
let schema = throw_away_schema();
let field = schema.get_field("num_likes").unwrap();
let index = Index::create_from_tempdir(schema).unwrap();
let mut writer = index.writer_with_num_threads(1, 3_000_000).unwrap();
writer.commit().unwrap();
let reader = index
.reader_builder()
.reload_policy(ReloadPolicy::Manual)
.try_into()
.unwrap();
assert_eq!(reader.searcher().num_docs(), 0);
writer.add_document(doc!(field=>1u64));
writer.commit().unwrap();
thread::sleep(Duration::from_millis(500));
assert_eq!(reader.searcher().num_docs(), 0);
reader.reload().unwrap();
assert_eq!(reader.searcher().num_docs(), 1);
}
#[test]
fn test_index_on_commit_reload_policy_different_directories() {
let schema = throw_away_schema();
let field = schema.get_field("num_likes").unwrap();
let tempdir = TempDir::new("index").unwrap();
let tempdir_path = PathBuf::from(tempdir.path());
let write_index = Index::create_in_dir(&tempdir_path, schema).unwrap();
let read_index = Index::open_in_dir(&tempdir_path).unwrap();
let reader = read_index
.reader_builder()
.reload_policy(ReloadPolicy::OnCommit)
.try_into()
.unwrap();
assert_eq!(reader.searcher().num_docs(), 0);
let mut writer = write_index.writer_with_num_threads(1, 3_000_000).unwrap();
test_index_on_commit_reload_policy_aux(field, &mut writer, &reader);
}
}
fn test_index_on_commit_reload_policy_aux(
field: Field,
writer: &mut IndexWriter,
reader: &IndexReader,
) {
assert_eq!(reader.searcher().num_docs(), 0);
writer.add_document(doc!(field=>1u64));
writer.commit().unwrap();
let mut count = 0;
for _ in 0..100 {
count = reader.searcher().num_docs();
if count > 0 {
break;
}
thread::sleep(Duration::from_millis(100));
}
assert_eq!(count, 1);
writer.add_document(doc!(field=>2u64));
writer.commit().unwrap();
let mut count = 0;
for _ in 0..10 {
count = reader.searcher().num_docs();
if count > 1 {
break;
}
thread::sleep(Duration::from_millis(100));
}
assert_eq!(count, 2);
}
// This test will not pass on Windows, because Windows
// prevents deleting files that are mmapped.
#[cfg(not(target_os = "windows"))]
#[test]
fn garbage_collect_works_as_intended() {
let directory = RAMDirectory::create();
let schema = throw_away_schema();
let field = schema.get_field("num_likes").unwrap();
let index = Index::create(directory.clone(), schema).unwrap();
let mut writer = index.writer_with_num_threads(8, 24_000_000).unwrap();
for i in 0u64..8_000u64 {
writer.add_document(doc!(field => i));
}
writer.commit().unwrap();
let mem_right_after_commit = directory.total_mem_usage();
thread::sleep(Duration::from_millis(1_000));
let reader = index
.reader_builder()
.reload_policy(ReloadPolicy::Manual)
.try_into()
.unwrap();
assert_eq!(reader.searcher().num_docs(), 8_000);
writer.wait_merging_threads().unwrap();
let mem_right_after_merge_finished = directory.total_mem_usage();
reader.reload().unwrap();
let searcher = reader.searcher();
assert_eq!(searcher.num_docs(), 8_000);
assert!(mem_right_after_merge_finished < mem_right_after_commit);
}
}
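
Review note on `Index::writer`: the new body splits `overall_heap_size_in_bytes` across `num_cpus::get()` threads and, if the per-thread budget would fall under `HEAP_SIZE_MIN`, lowers the thread count until each thread gets at least the minimum (never going below one thread). The sketch below isolates that arithmetic. The function name `plan_writer_threads` is hypothetical, and the value given to `HEAP_SIZE_MIN` here is an assumption chosen for illustration; the real constant lives in `crate::indexer::index_writer`.

```rust
/// Assumed minimum per-thread heap budget, for illustration only.
const HEAP_SIZE_MIN: usize = 3_000_000;

/// Hypothetical helper mirroring the thread-count fallback in `Index::writer`.
/// Returns `(num_threads, heap_size_in_bytes_per_thread)`.
fn plan_writer_threads(available_cpus: usize, overall_heap_size_in_bytes: usize) -> (usize, usize) {
    // Guard against a zero CPU count so the divisions below cannot panic.
    let mut num_threads = available_cpus.max(1);
    let per_thread = overall_heap_size_in_bytes / num_threads;
    if per_thread < HEAP_SIZE_MIN {
        // Use as many threads as the budget can afford, but at least one.
        num_threads = (overall_heap_size_in_bytes / HEAP_SIZE_MIN).max(1);
    }
    (num_threads, overall_heap_size_in_bytes / num_threads)
}
```

Under the assumed constant, 8 CPUs with a 10 MB budget fall back to 3 threads. A budget below the minimum still yields one thread, whose per-thread share is then below the minimum, so the small-heap panic documented on `writer_with_num_threads` can still apply.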

View File

@@ -1,47 +1,291 @@
use super::SegmentComponent;
use crate::core::SegmentId;
use crate::schema::Schema;
use crate::Opstamp;
use census::{Inventory, TrackedObject};
use serde;
use serde_json;
use std::collections::HashSet;
use std::fmt;
use std::path::PathBuf;
use schema::Schema;
use core::SegmentId;
#[derive(Clone, Debug, Serialize, Deserialize)]
struct DeleteMeta {
num_deleted_docs: u32,
opstamp: Opstamp,
}
#[derive(Clone, Default)]
pub struct SegmentMetaInventory {
inventory: Inventory<InnerSegmentMeta>,
}
impl SegmentMetaInventory {
/// Lists all living `SegmentMeta` objects at the time of the call.
pub fn all(&self) -> Vec<SegmentMeta> {
self.inventory
.list()
.into_iter()
.map(SegmentMeta::from)
.collect::<Vec<_>>()
}
#[doc(hidden)]
pub fn new_segment_meta(&self, segment_id: SegmentId, max_doc: u32) -> SegmentMeta {
let inner = InnerSegmentMeta {
segment_id,
max_doc,
deletes: None,
};
SegmentMeta::from(self.inventory.track(inner))
}
}
/// `SegmentMeta` contains simple meta information about a segment.
///
/// For instance the number of docs it contains,
/// how many are deleted, etc.
#[derive(Clone)]
pub struct SegmentMeta {
tracked: TrackedObject<InnerSegmentMeta>,
}
impl fmt::Debug for SegmentMeta {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> Result<(), fmt::Error> {
self.tracked.fmt(f)
}
}
impl serde::Serialize for SegmentMeta {
fn serialize<S>(
&self,
serializer: S,
) -> Result<<S as serde::Serializer>::Ok, <S as serde::Serializer>::Error>
where
S: serde::Serializer,
{
self.tracked.serialize(serializer)
}
}
impl From<TrackedObject<InnerSegmentMeta>> for SegmentMeta {
fn from(tracked: TrackedObject<InnerSegmentMeta>) -> SegmentMeta {
SegmentMeta { tracked }
}
}
impl SegmentMeta {
// Creates a new `SegmentMeta` object.
/// Returns the segment id.
pub fn id(&self) -> SegmentId {
self.tracked.segment_id
}
/// Returns the number of deleted documents.
pub fn num_deleted_docs(&self) -> u32 {
self.tracked
.deletes
.as_ref()
.map(|delete_meta| delete_meta.num_deleted_docs)
.unwrap_or(0u32)
}
/// Returns the list of files that
/// are required for the segment meta.
///
/// This is useful as the way tantivy removes files
/// is by removing all files that have been created by tantivy
/// and are not used by any segment anymore.
pub fn list_files(&self) -> HashSet<PathBuf> {
SegmentComponent::iterator()
.map(|component| self.relative_path(*component))
.collect::<HashSet<PathBuf>>()
}
/// Returns the relative path of a component of our segment.
///
/// It just joins the segment id with the extension
/// associated to a segment component.
pub fn relative_path(&self, component: SegmentComponent) -> PathBuf {
let mut path = self.id().uuid_string();
path.push_str(&*match component {
SegmentComponent::POSTINGS => ".idx".to_string(),
SegmentComponent::POSITIONS => ".pos".to_string(),
SegmentComponent::POSITIONSSKIP => ".posidx".to_string(),
SegmentComponent::TERMS => ".term".to_string(),
SegmentComponent::STORE => ".store".to_string(),
SegmentComponent::FASTFIELDS => ".fast".to_string(),
SegmentComponent::FIELDNORMS => ".fieldnorm".to_string(),
SegmentComponent::DELETE => format!(".{}.del", self.delete_opstamp().unwrap_or(0)),
});
PathBuf::from(path)
}
/// Return the highest doc id + 1
///
/// If there are no deletes, then num_docs = max_doc
/// and all the doc ids contained in this segment
/// are exactly (0..max_doc).
pub fn max_doc(&self) -> u32 {
self.tracked.max_doc
}
/// Return the number of documents in the segment.
pub fn num_docs(&self) -> u32 {
self.max_doc() - self.num_deleted_docs()
}
/// Returns the `Opstamp` of the last delete operation
/// taken into account in this segment.
pub fn delete_opstamp(&self) -> Option<Opstamp> {
self.tracked
.deletes
.as_ref()
.map(|delete_meta| delete_meta.opstamp)
}
/// Returns true iff the segment meta contains
/// delete information.
pub fn has_deletes(&self) -> bool {
self.num_deleted_docs() > 0
}
#[doc(hidden)]
pub fn with_delete_meta(self, num_deleted_docs: u32, opstamp: Opstamp) -> SegmentMeta {
let delete_meta = DeleteMeta {
num_deleted_docs,
opstamp,
};
let tracked = self.tracked.map(move |inner_meta| InnerSegmentMeta {
segment_id: inner_meta.segment_id,
max_doc: inner_meta.max_doc,
deletes: Some(delete_meta),
});
SegmentMeta { tracked }
}
}
#[derive(Debug, Clone, Serialize, Deserialize)]
struct InnerSegmentMeta {
segment_id: SegmentId,
max_doc: u32,
deletes: Option<DeleteMeta>,
}
impl InnerSegmentMeta {
pub fn track(self, inventory: &SegmentMetaInventory) -> SegmentMeta {
SegmentMeta {
tracked: inventory.inventory.track(self),
}
}
}
/// Meta information about the `Index`.
///
///
/// This object is serialized on disk in the `meta.json` file.
/// It keeps information about
/// It keeps information about
/// * the searchable segments,
/// * the index docstamp
/// * the index `docstamp`
/// * the schema
///
#[derive(Clone,Debug,RustcDecodable,RustcEncodable)]
#[derive(Clone, Serialize)]
pub struct IndexMeta {
pub committed_segments: Vec<SegmentMeta>,
pub uncommitted_segments: Vec<SegmentMeta>,
/// List of `SegmentMeta` information associated to each finalized segment of the index.
pub segments: Vec<SegmentMeta>,
/// Index `Schema`
pub schema: Schema,
pub docstamp: u64,
/// Opstamp associated to the last `commit` operation.
pub opstamp: Opstamp,
#[serde(skip_serializing_if = "Option::is_none")]
/// Payload associated to the last commit.
///
/// Upon commit, clients can optionally add a small `String` payload to their commit
/// to help identify this commit.
/// This payload is entirely unused by tantivy.
pub payload: Option<String>,
}
#[derive(Deserialize)]
struct UntrackedIndexMeta {
pub segments: Vec<InnerSegmentMeta>,
pub schema: Schema,
pub opstamp: Opstamp,
#[serde(skip_serializing_if = "Option::is_none")]
pub payload: Option<String>,
}
impl UntrackedIndexMeta {
pub fn track(self, inventory: &SegmentMetaInventory) -> IndexMeta {
IndexMeta {
segments: self
.segments
.into_iter()
.map(|inner_seg_meta| inner_seg_meta.track(inventory))
.collect::<Vec<SegmentMeta>>(),
schema: self.schema,
opstamp: self.opstamp,
payload: self.payload,
}
}
}
impl IndexMeta {
/// Create an `IndexMeta` object representing a brand new `Index`
/// with the given schema.
///
/// This new index does not contain any segments.
/// Opstamp will have the value `0u64`.
pub fn with_schema(schema: Schema) -> IndexMeta {
IndexMeta {
committed_segments: Vec::new(),
uncommitted_segments: Vec::new(),
schema: schema,
docstamp: 0u64,
segments: vec![],
schema,
opstamp: 0u64,
payload: None,
}
}
pub(crate) fn deserialize(
meta_json: &str,
inventory: &SegmentMetaInventory,
) -> serde_json::Result<IndexMeta> {
let untracked_meta_json: UntrackedIndexMeta = serde_json::from_str(meta_json)?;
Ok(untracked_meta_json.track(inventory))
}
}
#[derive(Clone, Debug, RustcDecodable,RustcEncodable)]
pub struct SegmentMeta {
pub segment_id: SegmentId,
pub num_docs: u32,
impl fmt::Debug for IndexMeta {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
write!(
f,
"{}",
serde_json::ser::to_string(self)
.expect("JSON serialization for IndexMeta should never fail.")
)
}
}
#[cfg(test)]
impl SegmentMeta {
pub fn new(segment_id: SegmentId, num_docs: u32) -> SegmentMeta {
SegmentMeta {
segment_id: segment_id,
num_docs: num_docs,
}
mod tests {
use super::IndexMeta;
use crate::schema::{Schema, TEXT};
use serde_json;
#[test]
fn test_serialize_metas() {
let schema = {
let mut schema_builder = Schema::builder();
schema_builder.add_text_field("text", TEXT);
schema_builder.build()
};
let index_metas = IndexMeta {
segments: Vec::new(),
schema,
opstamp: 0u64,
payload: None,
};
let json = serde_json::ser::to_string(&index_metas).expect("serialization failed");
assert_eq!(json, r#"{"segments":[],"schema":[{"name":"text","type":"text","options":{"indexing":{"record":"position","tokenizer":"default"},"stored":false}}],"opstamp":0}"#);
}
}
}

View File

@@ -0,0 +1,196 @@
use crate::common::BinarySerializable;
use crate::directory::ReadOnlySource;
use crate::positions::PositionReader;
use crate::postings::TermInfo;
use crate::postings::{BlockSegmentPostings, SegmentPostings};
use crate::schema::FieldType;
use crate::schema::IndexRecordOption;
use crate::schema::Term;
use crate::termdict::TermDictionary;
use owned_read::OwnedRead;
/// The inverted index reader is in charge of accessing
/// the inverted index associated to a specific field.
///
/// # Note
///
/// It is safe to delete the segment associated to
/// an `InvertedIndexReader`. As long as it is open,
/// the `ReadOnlySource` it is relying on should
/// stay available.
///
///
/// `InvertedIndexReader`s are created by calling
/// the `SegmentReader`'s [`.inverted_index(...)`] method.
pub struct InvertedIndexReader {
termdict: TermDictionary,
postings_source: ReadOnlySource,
positions_source: ReadOnlySource,
positions_idx_source: ReadOnlySource,
record_option: IndexRecordOption,
total_num_tokens: u64,
}
impl InvertedIndexReader {
#[cfg_attr(feature = "cargo-clippy", allow(clippy::needless_pass_by_value))] // for symmetry
pub(crate) fn new(
termdict: TermDictionary,
postings_source: ReadOnlySource,
positions_source: ReadOnlySource,
positions_idx_source: ReadOnlySource,
record_option: IndexRecordOption,
) -> InvertedIndexReader {
let total_num_tokens_data = postings_source.slice(0, 8);
let mut total_num_tokens_cursor = total_num_tokens_data.as_slice();
let total_num_tokens = u64::deserialize(&mut total_num_tokens_cursor).unwrap_or(0u64);
InvertedIndexReader {
termdict,
postings_source: postings_source.slice_from(8),
positions_source,
positions_idx_source,
record_option,
total_num_tokens,
}
}
/// Creates an empty `InvertedIndexReader` object, which
/// contains no terms at all.
pub fn empty(field_type: &FieldType) -> InvertedIndexReader {
let record_option = field_type
.get_index_record_option()
.unwrap_or(IndexRecordOption::Basic);
InvertedIndexReader {
termdict: TermDictionary::empty(&field_type),
postings_source: ReadOnlySource::empty(),
positions_source: ReadOnlySource::empty(),
positions_idx_source: ReadOnlySource::empty(),
record_option,
total_num_tokens: 0u64,
}
}
/// Returns the term info associated with the term.
pub fn get_term_info(&self, term: &Term) -> Option<TermInfo> {
self.termdict.get(term.value_bytes())
}
/// Return the term dictionary datastructure.
pub fn terms(&self) -> &TermDictionary {
&self.termdict
}
/// Resets the block segment to another position of the postings
/// file.
///
/// This is useful for enumerating through a list of terms,
/// and consuming the associated posting lists while avoiding
/// reallocating a `BlockSegmentPostings`.
///
/// # Warning
///
/// This does not reset the positions list.
pub fn reset_block_postings_from_terminfo(
&self,
term_info: &TermInfo,
block_postings: &mut BlockSegmentPostings,
) {
let offset = term_info.postings_offset as usize;
let end_source = self.postings_source.len();
let postings_slice = self.postings_source.slice(offset, end_source);
let postings_reader = OwnedRead::new(postings_slice);
block_postings.reset(term_info.doc_freq, postings_reader);
}
/// Returns a block postings given a `Term`.
/// This method is for advanced usage only.
///
/// Most users should prefer using `read_postings` instead.
pub fn read_block_postings(
&self,
term: &Term,
option: IndexRecordOption,
) -> Option<BlockSegmentPostings> {
self.get_term_info(term)
.map(move |term_info| self.read_block_postings_from_terminfo(&term_info, option))
}
/// Returns a block postings given a `term_info`.
/// This method is for advanced usage only.
///
/// Most users should prefer using `read_postings` instead.
pub fn read_block_postings_from_terminfo(
&self,
term_info: &TermInfo,
requested_option: IndexRecordOption,
) -> BlockSegmentPostings {
let offset = term_info.postings_offset as usize;
let postings_data = self.postings_source.slice_from(offset);
BlockSegmentPostings::from_data(
term_info.doc_freq,
OwnedRead::new(postings_data),
self.record_option,
requested_option,
)
}
/// Returns a posting object given a `term_info`.
/// This method is for advanced usage only.
///
/// Most users should prefer using `read_postings` instead.
pub fn read_postings_from_terminfo(
&self,
term_info: &TermInfo,
option: IndexRecordOption,
) -> SegmentPostings {
let block_postings = self.read_block_postings_from_terminfo(term_info, option);
let position_stream = {
if option.has_positions() {
let position_reader = self.positions_source.clone();
let skip_reader = self.positions_idx_source.clone();
let position_reader =
PositionReader::new(position_reader, skip_reader, term_info.positions_idx);
Some(position_reader)
} else {
None
}
};
SegmentPostings::from_block_postings(block_postings, position_stream)
}
/// Returns the total number of tokens recorded for all documents
/// (including deleted documents).
pub fn total_num_tokens(&self) -> u64 {
self.total_num_tokens
}
/// Returns the segment postings associated with the term, and with the given option,
/// or `None` if the term has never been encountered and indexed.
///
/// If the field was not indexed with indexing options that cover
/// the requested options, the method does not fail
/// and returns a `SegmentPostings` with as much information as possible.
///
/// For instance, requesting `IndexRecordOption::Freq` for a
/// `TextIndexingOptions` that does not index position will return a `SegmentPostings`
/// with `DocId`s and frequencies.
pub fn read_postings(&self, term: &Term, option: IndexRecordOption) -> Option<SegmentPostings> {
self.get_term_info(term)
.map(move |term_info| self.read_postings_from_terminfo(&term_info, option))
}
pub(crate) fn read_postings_no_deletes(
&self,
term: &Term,
option: IndexRecordOption,
) -> Option<SegmentPostings> {
self.get_term_info(term)
.map(|term_info| self.read_postings_from_terminfo(&term_info, option))
}
/// Returns the number of documents containing the term.
pub fn doc_freq(&self, term: &Term) -> u32 {
self.get_term_info(term)
.map(|term_info| term_info.doc_freq)
.unwrap_or(0u32)
}
}
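The constructor above reads the first 8 bytes of the postings source as `total_num_tokens` and keeps the remainder as the postings data. A minimal standalone sketch of that header split, using only the standard library: the function name `split_postings_header` is hypothetical, the little-endian encoding is an assumption about `u64::deserialize`, and returning `(0, &[])` for a short buffer is a simplification of this sketch rather than the behavior of `ReadOnlySource::slice`.

```rust
/// Splits a raw postings buffer into `(total_num_tokens, postings_bytes)`.
///
/// Hypothetical helper: assumes the 8-byte header is a little-endian u64.
fn split_postings_header(data: &[u8]) -> (u64, &[u8]) {
    if data.len() < 8 {
        // Too short to hold a header: fall back to zero tokens and no postings.
        return (0, &[]);
    }
    let (header, body) = data.split_at(8);
    let mut header_bytes = [0u8; 8];
    header_bytes.copy_from_slice(header);
    (u64::from_le_bytes(header_bytes), body)
}
```

For a buffer made of `42u64.to_le_bytes()` followed by three payload bytes, the helper yields the count 42 and a three-byte body.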

View File

@@ -1,26 +1,34 @@
pub mod searcher;
mod executor;
pub mod index;
mod segment_reader;
mod segment_id;
mod segment_component;
mod segment;
mod index_meta;
mod pool;
use std::path::PathBuf;
mod inverted_index_reader;
pub mod searcher;
mod segment;
mod segment_component;
mod segment_id;
mod segment_reader;
pub use self::executor::Executor;
pub use self::index::Index;
pub use self::index_meta::{IndexMeta, SegmentMeta, SegmentMetaInventory};
pub use self::inverted_index_reader::InvertedIndexReader;
pub use self::searcher::Searcher;
pub use self::segment::Segment;
pub use self::segment::SerializableSegment;
pub use self::segment_component::SegmentComponent;
pub use self::segment_id::SegmentId;
pub use self::segment_reader::SegmentReader;
pub use self::segment::Segment;
pub use self::segment::SegmentInfo;
pub use self::segment::SerializableSegment;
pub use self::index::Index;
pub use self::index_meta::{IndexMeta, SegmentMeta};
use once_cell::sync::Lazy;
use std::path::Path;
lazy_static! {
pub static ref META_FILEPATH: PathBuf = PathBuf::from("meta.json");
}
/// The meta file contains all the information about the list of segments and the schema
/// of the index.
pub static META_FILEPATH: Lazy<&'static Path> = Lazy::new(|| Path::new("meta.json"));
/// The managed file contains a list of files that were created by tantivy
/// and will therefore be garbage collected when they are deemed useless by tantivy.
///
/// Removing this file is safe, but will prevent the garbage collection of all of the files that
/// are currently in the directory.
pub static MANAGED_FILEPATH: Lazy<&'static Path> = Lazy::new(|| Path::new(".managed.json"));

View File

@@ -1,130 +0,0 @@
use std::sync::atomic::AtomicUsize;
use std::sync::atomic::Ordering;
use std::mem;
use std::ops::{Deref, DerefMut};
use crossbeam::sync::MsQueue;
use std::sync::Arc;
pub struct GenerationItem<T> {
generation: usize,
item: T,
}
pub struct Pool<T> {
queue: Arc<MsQueue<GenerationItem<T>>>,
freshest_generation: AtomicUsize,
next_generation: AtomicUsize,
}
impl<T> Pool<T> {
pub fn new() -> Pool<T> {
Pool {
queue: Arc::new(MsQueue::new()),
freshest_generation: AtomicUsize::default(),
next_generation: AtomicUsize::default(),
}
}
pub fn publish_new_generation(&self, items: Vec<T>) {
let next_generation = self.next_generation.fetch_add(1, Ordering::SeqCst) + 1;
for item in items {
let gen_item = GenerationItem {
item: item,
generation: next_generation,
};
self.queue.push(gen_item);
}
self.advertise_generation(next_generation);
}
/// At the exit of this method,
/// - freshest_generation has a value greater or equal than generation
/// - freshest_generation has a value that has been advertised
/// - freshest_generation has
fn advertise_generation(&self, generation: usize) {
// not optimal at all but the easiest to read proof.
loop {
let former_generation = self.freshest_generation.load(Ordering::Acquire);
if former_generation >= generation {
break;
}
self.freshest_generation.compare_and_swap(former_generation, generation, Ordering::SeqCst);
}
}
fn generation(&self,) -> usize {
self.freshest_generation.load(Ordering::Acquire)
}
pub fn acquire(&self,) -> LeasedItem<T> {
let generation = self.generation();
loop {
let gen_item = self.queue.pop();
if gen_item.generation >= generation {
return LeasedItem {
gen_item: Some(gen_item),
recycle_queue: self.queue.clone(),
}
}
else {
// this searcher is obsolete,
// removing it from the pool.
}
}
}
}
pub struct LeasedItem<T> {
gen_item: Option<GenerationItem<T>>,
recycle_queue: Arc<MsQueue<GenerationItem<T>>>,
}
impl<T> Deref for LeasedItem<T> {
type Target = T;
fn deref(&self) -> &T {
&self.gen_item.as_ref().expect("Unwrapping a leased item should never fail").item // unwrap is safe here
}
}
impl<T> DerefMut for LeasedItem<T> {
fn deref_mut(&mut self) -> &mut T {
&mut self.gen_item.as_mut().expect("Unwrapping a mut leased item should never fail").item // unwrap is safe here
}
}
impl<T> Drop for LeasedItem<T> {
fn drop(&mut self) {
let gen_item: GenerationItem<T> = mem::replace(&mut self.gen_item, None).expect("Unwrapping a leased item should never fail");
self.recycle_queue.push(gen_item);
}
}
#[cfg(test)]
mod tests {
use std::iter;
use super::Pool;
#[test]
fn test_pool() {
let items10: Vec<usize> = iter::repeat(10).take(10).collect();
let pool = Pool::new();
pool.publish_new_generation(items10);
for _ in 0..20 {
assert_eq!(*pool.acquire(), 10);
}
let items11: Vec<usize> = iter::repeat(11).take(10).collect();
pool.publish_new_generation(items11);
for _ in 0..20 {
assert_eq!(*pool.acquire(), 11);
}
}
}
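The removed `advertise_generation` documents one invariant: on exit, `freshest_generation` is greater than or equal to the advertised generation. A minimal sketch of the same monotonic update written with `compare_exchange` (the `compare_and_swap` method used above was deprecated in later Rust releases). The free function taking an `&AtomicUsize` is a hypothetical simplification of the `Pool` method, not the original code.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Raises `freshest` to at least `generation` and never lowers it.
///
/// Hypothetical free-function variant of the removed `Pool::advertise_generation`.
fn advertise_generation(freshest: &AtomicUsize, generation: usize) {
    let mut current = freshest.load(Ordering::Acquire);
    while current < generation {
        match freshest.compare_exchange(
            current,
            generation,
            Ordering::SeqCst,
            Ordering::Acquire,
        ) {
            // We published our generation.
            Ok(_) => break,
            // Another thread changed the value first: retry with what it stored.
            Err(observed) => current = observed,
        }
    }
}
```

Advertising an older generation is a no-op, so concurrent publishers cannot move the counter backwards.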

View File

@@ -1,73 +1,225 @@
use Result;
use core::SegmentReader;
use schema::Document;
use collector::Collector;
use common::TimerTree;
use query::Query;
use DocId;
use DocAddress;
use schema::Term;
use crate::collector::Collector;
use crate::collector::SegmentCollector;
use crate::core::Executor;
use crate::core::InvertedIndexReader;
use crate::core::SegmentReader;
use crate::query::Query;
use crate::query::Scorer;
use crate::query::Weight;
use crate::schema::Document;
use crate::schema::Schema;
use crate::schema::{Field, Term};
use crate::space_usage::SearcherSpaceUsage;
use crate::store::StoreReader;
use crate::termdict::TermMerger;
use crate::DocAddress;
use crate::Index;
use crate::Result;
use std::fmt;
use std::sync::Arc;
fn collect_segment<C: Collector>(
collector: &C,
weight: &dyn Weight,
segment_ord: u32,
segment_reader: &SegmentReader,
) -> Result<C::Fruit> {
let mut scorer = weight.scorer(segment_reader)?;
let mut segment_collector = collector.for_segment(segment_ord as u32, segment_reader)?;
if let Some(delete_bitset) = segment_reader.delete_bitset() {
scorer.for_each(&mut |doc, score| {
if delete_bitset.is_alive(doc) {
segment_collector.collect(doc, score);
}
});
} else {
scorer.for_each(&mut |doc, score| segment_collector.collect(doc, score));
}
Ok(segment_collector.harvest())
}
/// Holds a list of `SegmentReader`s ready for search.
///
/// It guarantees that the `Segment` will not be removed before
/// It guarantees that the `Segment` will not be removed before
/// the destruction of the `Searcher`.
///
#[derive(Debug)]
///
pub struct Searcher {
schema: Schema,
index: Index,
segment_readers: Vec<SegmentReader>,
store_readers: Vec<StoreReader>,
}
impl Searcher {
/// Creates a new `Searcher`
pub(crate) fn new(
schema: Schema,
index: Index,
segment_readers: Vec<SegmentReader>,
) -> Searcher {
let store_readers = segment_readers
.iter()
.map(SegmentReader::get_store_reader)
.collect();
Searcher {
schema,
index,
segment_readers,
store_readers,
}
}
/// Returns the `Index` associated to the `Searcher`
pub fn index(&self) -> &Index {
&self.index
}
/// Fetches a document from tantivy's store given a `DocAddress`.
///
/// The searcher uses the segment ordinal to route
/// the request to the right `Segment`.
pub fn doc(&self, doc_address: &DocAddress) -> Result<Document> {
let DocAddress(segment_local_id, doc_id) = *doc_address;
let segment_reader = &self.segment_readers[segment_local_id as usize];
segment_reader.doc(doc_id)
/// the request to the right `Segment`.
pub fn doc(&self, doc_address: DocAddress) -> Result<Document> {
let DocAddress(segment_local_id, doc_id) = doc_address;
let store_reader = &self.store_readers[segment_local_id as usize];
store_reader.get(doc_id)
}
/// Access the schema associated to the index of this searcher.
pub fn schema(&self) -> &Schema {
&self.schema
}
/// Returns the overall number of documents in the index.
pub fn num_docs(&self,) -> DocId {
pub fn num_docs(&self) -> u64 {
self.segment_readers
.iter()
.map(|segment_reader| segment_reader.num_docs())
.fold(0u32, |acc, val| acc + val)
.map(|segment_reader| u64::from(segment_reader.num_docs()))
.sum::<u64>()
}
/// Return the overall number of documents containing
/// the given term.
pub fn doc_freq(&self, term: &Term) -> u32 {
/// the given term.
pub fn doc_freq(&self, term: &Term) -> u64 {
self.segment_readers
.iter()
.map(|segment_reader| segment_reader.doc_freq(term))
.fold(0u32, |acc, val| acc + val)
.map(|segment_reader| {
u64::from(segment_reader.inverted_index(term.field()).doc_freq(term))
})
.sum::<u64>()
}
/// Return the list of segment readers
pub fn segment_readers(&self,) -> &Vec<SegmentReader> {
pub fn segment_readers(&self) -> &[SegmentReader] {
&self.segment_readers
}
/// Returns the segment_reader associated with the given segment_ordinal
pub fn segment_reader(&self, segment_ord: usize) -> &SegmentReader {
&self.segment_readers[segment_ord]
pub fn segment_reader(&self, segment_ord: u32) -> &SegmentReader {
&self.segment_readers[segment_ord as usize]
}
/// Runs a query on the segment readers wrapped by the searcher
pub fn search<C: Collector>(&self, query: &Query, collector: &mut C) -> Result<TimerTree> {
query.search(self, collector)
/// Runs a query on the segment readers wrapped by the searcher.
///
/// Search works as follows:
///
/// First, the weight object associated with the query is created.
///
/// Then, the query loops over the segments and for each segment:
/// - sets up the collector and informs it that the segment being processed has changed,
/// - creates a `SegmentCollector` for collecting documents associated with the segment,
/// - creates a `Scorer` object for this segment,
/// - iterates through the matched documents and pushes them to the segment collector.
///
/// Finally, the `Collector` merges the results of the child collectors into a single
/// result usable by the caller.
pub fn search<C: Collector>(&self, query: &dyn Query, collector: &C) -> Result<C::Fruit> {
let executor = self.index.search_executor();
self.search_with_executor(query, collector, executor)
}
/// Same as [`search(...)`](#method.search) but multithreaded.
///
/// The current implementation is rather naive:
/// multithreading is done by splitting search into as many tasks
/// as there are segments.
///
/// It is powerless at making search faster if your index consists of
/// one large segment.
///
/// Also, keep in mind that multithreading a single query on several
/// threads will not improve your throughput. It can actually
/// hurt it. It will, however, decrease the average response time.
pub fn search_with_executor<C: Collector>(
&self,
query: &dyn Query,
collector: &C,
executor: &Executor,
) -> Result<C::Fruit> {
let scoring_enabled = collector.requires_scoring();
let weight = query.weight(self, scoring_enabled)?;
let segment_readers = self.segment_readers();
let fruits = executor.map(
|(segment_ord, segment_reader)| {
collect_segment(
collector,
weight.as_ref(),
segment_ord as u32,
segment_reader,
)
},
segment_readers.iter().enumerate(),
)?;
collector.merge_fruits(fruits)
}
/// Return the field searcher associated to a `Field`.
pub fn field(&self, field: Field) -> FieldSearcher {
let inv_index_readers = self
.segment_readers
.iter()
.map(|segment_reader| segment_reader.inverted_index(field))
.collect::<Vec<_>>();
FieldSearcher::new(inv_index_readers)
}
/// Summarize total space usage of this searcher.
pub fn space_usage(&self) -> SearcherSpaceUsage {
let mut space_usage = SearcherSpaceUsage::new();
for segment_reader in self.segment_readers.iter() {
space_usage.add_segment(segment_reader.space_usage());
}
space_usage
}
}
impl From<Vec<SegmentReader>> for Searcher {
fn from(segment_readers: Vec<SegmentReader>) -> Searcher {
Searcher {
segment_readers: segment_readers,
}
pub struct FieldSearcher {
inv_index_readers: Vec<Arc<InvertedIndexReader>>,
}
impl FieldSearcher {
fn new(inv_index_readers: Vec<Arc<InvertedIndexReader>>) -> FieldSearcher {
FieldSearcher { inv_index_readers }
}
}
/// Returns a stream over all of the sorted unique terms
/// of the given field.
pub fn terms(&self) -> TermMerger<'_> {
let term_streamers: Vec<_> = self
.inv_index_readers
.iter()
.map(|inverted_index| inverted_index.terms().stream())
.collect();
TermMerger::new(term_streamers)
}
}
impl fmt::Debug for Searcher {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
let segment_ids = self
.segment_readers
.iter()
.map(SegmentReader::segment_id)
.collect::<Vec<_>>();
write!(f, "Searcher({:?})", segment_ids)
}
}
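The search flow in this file (collect each segment while skipping deleted documents, then merge the per-segment fruits) can be illustrated with a toy model. Everything in this sketch is hypothetical and uses only the standard library: `ToySegment`, `ToyCollector` and `CountCollector` are stand-ins and not tantivy's `SegmentReader`, `Collector` or `Count` types, and the sketch runs sequentially instead of going through an `Executor`.

```rust
/// Hypothetical stand-in for a segment: the doc ids matched by a query,
/// plus the doc ids that have been deleted.
struct ToySegment {
    matching_docs: Vec<u32>,
    deleted_docs: Vec<u32>,
}

impl ToySegment {
    fn is_alive(&self, doc: u32) -> bool {
        !self.deleted_docs.contains(&doc)
    }
}

/// Hypothetical collector mirroring the per-segment / merge split.
trait ToyCollector {
    type Fruit;
    fn collect_segment(&self, segment_ord: u32, segment: &ToySegment) -> Self::Fruit;
    fn merge_fruits(&self, fruits: Vec<Self::Fruit>) -> Self::Fruit;
}

/// Counts the matching documents that are still alive.
struct CountCollector;

impl ToyCollector for CountCollector {
    type Fruit = usize;

    fn collect_segment(&self, _segment_ord: u32, segment: &ToySegment) -> usize {
        segment
            .matching_docs
            .iter()
            .filter(|&&doc| segment.is_alive(doc))
            .count()
    }

    fn merge_fruits(&self, fruits: Vec<usize>) -> usize {
        fruits.into_iter().sum()
    }
}

/// Collects one fruit per segment, then merges them into a single result.
fn search<C: ToyCollector>(segments: &[ToySegment], collector: &C) -> C::Fruit {
    let fruits: Vec<C::Fruit> = segments
        .iter()
        .enumerate()
        .map(|(ord, segment)| collector.collect_segment(ord as u32, segment))
        .collect();
    collector.merge_fruits(fruits)
}
```

Because each segment produces an independent fruit, the per-segment step is the natural unit of parallelism, which is why one large segment gains nothing from a multithreaded executor.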

View File

@@ -1,100 +1,93 @@
use Result;
use std::path::PathBuf;
use schema::Schema;
use DocId;
use std::fmt;
use core::SegmentId;
use directory::{ReadOnlySource, WritePtr};
use indexer::segment_serializer::SegmentSerializer;
use super::SegmentComponent;
use core::Index;
use crate::core::Index;
use crate::core::SegmentId;
use crate::core::SegmentMeta;
use crate::directory::error::{OpenReadError, OpenWriteError};
use crate::directory::Directory;
use crate::directory::{ReadOnlySource, WritePtr};
use crate::indexer::segment_serializer::SegmentSerializer;
use crate::schema::Schema;
use crate::Opstamp;
use crate::Result;
use std::fmt;
use std::path::PathBuf;
use std::result;
use directory::error::{FileError, OpenWriteError};
/// A segment is a piece of the index.
#[derive(Clone)]
pub struct Segment {
index: Index,
segment_id: SegmentId,
meta: SegmentMeta,
}
impl fmt::Debug for Segment {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
write!(f, "Segment({:?})", self.segment_id.uuid_string())
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
write!(f, "Segment({:?})", self.id().uuid_string())
}
}
/// Creates a new segment given an `Index` and a `SegmentId`
///
/// The function is here to make it private outside `tantivy`.
pub fn create_segment(index: Index, segment_id: SegmentId) -> Segment {
Segment {
index: index,
segment_id: segment_id,
}
///
/// The function is here to make it private outside `tantivy`.
/// #[doc(hidden)]
pub fn create_segment(index: Index, meta: SegmentMeta) -> Segment {
Segment { index, meta }
}
impl Segment {
/// Returns the index the segment belongs to.
pub fn index(&self) -> &Index {
&self.index
}
/// Returns our index's schema.
pub fn schema(&self,) -> Schema {
pub fn schema(&self) -> Schema {
self.index.schema()
}
/// Returns the segment's id.
pub fn id(&self,) -> SegmentId {
self.segment_id
}
/// Returns the relative path of a component of our segment.
///
/// It just joins the segment id with the extension
/// associated to a segment component.
pub fn relative_path(&self, component: SegmentComponent) -> PathBuf {
self.segment_id.relative_path(component)
/// Returns the segment meta-information
pub fn meta(&self) -> &SegmentMeta {
&self.meta
}
/// Deletes all of the document of the segment.
/// This is called when there is a merge or a rollback.
///
/// # Disclaimer
/// If deletion of a file fails (e.g. a file
/// was read-only.), the method does not
/// fail and just logs an error
pub fn delete(&self,) {
for component in SegmentComponent::values() {
let rel_path = self.relative_path(component);
if let Err(err) = self.index.directory().delete(&rel_path) {
match err {
FileError::FileDoesNotExist(_) => {
// this is normal behavior.
// the position file for instance may not exists.
}
FileError::IOError(err) => {
error!("Failed to remove {:?} : {:?}", self.segment_id, err);
}
}
}
#[doc(hidden)]
pub fn with_delete_meta(self, num_deleted_docs: u32, opstamp: Opstamp) -> Segment {
Segment {
index: self.index,
meta: self.meta.with_delete_meta(num_deleted_docs, opstamp),
}
}
/// Returns the segment's id.
pub fn id(&self) -> SegmentId {
self.meta.id()
}
/// Open one of the component file for read.
pub fn open_read(&self, component: SegmentComponent) -> result::Result<ReadOnlySource, FileError> {
/// Returns the relative path of a component of our segment.
///
/// It just joins the segment id with the extension
/// associated to a segment component.
pub fn relative_path(&self, component: SegmentComponent) -> PathBuf {
self.meta.relative_path(component)
}
/// Open one of the component files for a *regular* read.
pub fn open_read(
&self,
component: SegmentComponent,
) -> result::Result<ReadOnlySource, OpenReadError> {
let path = self.relative_path(component);
let source = try!(self.index.directory().open_read(&path));
let source = self.index.directory().open_read(&path)?;
Ok(source)
}
/// Open one of the component file for write.
pub fn open_write(&mut self, component: SegmentComponent) -> result::Result<WritePtr, OpenWriteError> {
/// Open one of the component files for a *regular* write.
pub fn open_write(
&mut self,
component: SegmentComponent,
) -> result::Result<WritePtr, OpenWriteError> {
let path = self.relative_path(component);
let write = try!(self.index.directory_mut().open_write(&path));
let write = self.index.directory_mut().open_write(&path)?;
Ok(write)
}
}
@@ -107,8 +100,3 @@ pub trait SerializableSegment {
/// The number of documents in the segment.
fn write(&self, serializer: SegmentSerializer) -> Result<u32>;
}
#[derive(Clone,Debug,RustcDecodable,RustcEncodable)]
pub struct SegmentInfo {
pub max_doc: DocId,
}

View File

@@ -1,41 +1,46 @@
use std::vec::IntoIter;
use std::slice;
/// Enum describing each component of a tantivy segment.
/// Each component is stored in its own file,
/// using the pattern `segment_uuid`.`component_extension`,
/// except the delete component that takes a `segment_uuid`.`delete_opstamp`.`component_extension`
#[derive(Copy, Clone)]
pub enum SegmentComponent {
INFO,
/// Postings (or inverted list). Sorted lists of document ids, associated to terms
POSTINGS,
/// Positions of terms in each document.
POSITIONS,
/// Index to seek within the position file
POSITIONSSKIP,
/// Column-oriented random-access storage of fields.
FASTFIELDS,
/// Stores the sum of the length (in terms) of each field for each document.
/// Field norms are stored as a special u64 fast field.
FIELDNORMS,
/// Dictionary associating `Term`s to `TermInfo`s which is
/// simply an address into the `postings` file and the `positions` file.
TERMS,
/// Row-oriented, LZ4-compressed storage of the documents.
/// Accessing a document from the store is relatively slow, as it
/// requires to decompress the entire block it belongs to.
STORE,
/// Bitset describing which document of the segment is deleted.
DELETE,
}
impl SegmentComponent {
pub fn values() -> IntoIter<SegmentComponent> {
vec!(
SegmentComponent::INFO,
/// Iterates through the components.
pub fn iterator() -> slice::Iter<'static, SegmentComponent> {
static SEGMENT_COMPONENTS: [SegmentComponent; 8] = [
SegmentComponent::POSTINGS,
SegmentComponent::POSITIONS,
SegmentComponent::POSITIONSSKIP,
SegmentComponent::FASTFIELDS,
SegmentComponent::FIELDNORMS,
SegmentComponent::TERMS,
SegmentComponent::STORE,
).into_iter()
}
pub fn path_suffix(&self)-> &'static str {
match *self {
SegmentComponent::POSITIONS => ".pos",
SegmentComponent::INFO => ".info",
SegmentComponent::POSTINGS => ".idx",
SegmentComponent::TERMS => ".term",
SegmentComponent::STORE => ".store",
SegmentComponent::FASTFIELDS => ".fast",
SegmentComponent::FIELDNORMS => ".fieldnorm",
}
SegmentComponent::DELETE,
];
SEGMENT_COMPONENTS.iter()
}
}
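The new `iterator()` replaces a heap-allocated `vec!(...).into_iter()` with a borrow of a function-local `static` array. A minimal sketch of that pattern on a hypothetical three-variant enum: `ToyComponent` is illustrative only, and its suffixes are copied from the removed `path_suffix` shown above.

```rust
use std::slice;

/// Hypothetical, reduced version of `SegmentComponent`.
#[derive(Copy, Clone, Debug, PartialEq)]
enum ToyComponent {
    Postings,
    Terms,
    Store,
}

impl ToyComponent {
    /// Iterates through the components without allocating:
    /// the array lives in static memory for the whole program.
    fn iterator() -> slice::Iter<'static, ToyComponent> {
        static COMPONENTS: [ToyComponent; 3] = [
            ToyComponent::Postings,
            ToyComponent::Terms,
            ToyComponent::Store,
        ];
        COMPONENTS.iter()
    }

    /// File extension associated with the component.
    fn path_suffix(self) -> &'static str {
        match self {
            ToyComponent::Postings => ".idx",
            ToyComponent::Terms => ".term",
            ToyComponent::Store => ".store",
        }
    }
}
```

One consequence of this design is that the array length is part of the type, so adding a variant without updating the array size fails to compile instead of silently skipping the component.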

View File

@@ -1,34 +1,38 @@
use uuid::Uuid;
use std::cmp::{Ord, Ordering};
use std::fmt;
use rustc_serialize::{Encoder, Decoder, Encodable, Decodable};
use core::SegmentComponent;
use std::path::PathBuf;
use std::cmp::{Ordering, Ord};
use uuid::Uuid;
#[cfg(test)]
use once_cell::sync::Lazy;
#[cfg(test)]
use std::sync::atomic;
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
/// Uuid identifying a segment.
///
/// Tantivy's segments are identified
/// by a UUID which is used to prefix the filenames
/// of all of the files associated with the segment.
///
/// In unit tests, for reproducibility, the `SegmentId`s are
/// simply generated in an autoincrement fashion.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Serialize, Deserialize)]
pub struct SegmentId(Uuid);
#[cfg(test)]
static AUTO_INC_COUNTER: Lazy<atomic::AtomicUsize> = Lazy::new(|| atomic::AtomicUsize::default());
#[cfg(test)]
lazy_static! {
static ref AUTO_INC_COUNTER: atomic::AtomicUsize = atomic::AtomicUsize::default();
static ref EMPTY_ARR: [u8; 8] = [0u8; 8];
}
const ZERO_ARRAY: [u8; 8] = [0u8; 8];
// During tests, we generate the segment ids in an autoincrement manner
// for consistency of segment ids between runs.
//
// The order of the test execution is not guaranteed, but the order
// The order of the test execution is not guaranteed, but the order
// of segments within a single test is guaranteed.
#[cfg(test)]
fn create_uuid() -> Uuid {
let new_auto_inc_id = (*AUTO_INC_COUNTER).fetch_add(1, atomic::Ordering::SeqCst);
Uuid::from_fields(new_auto_inc_id as u32, 0, 0, &*EMPTY_ARR)
Uuid::from_fields(new_auto_inc_id as u32, 0, 0, &ZERO_ARRAY).unwrap()
}
#[cfg(not(test))]
@@ -37,43 +41,34 @@ fn create_uuid() -> Uuid {
}
impl SegmentId {
#[doc(hidden)]
pub fn generate_random() -> SegmentId {
SegmentId(create_uuid())
}
pub fn short_uuid_string(&self,) -> String {
(&self.0.to_simple_string()[..8]).to_string()
}
pub fn uuid_string(&self,) -> String {
self.0.to_simple_string()
}
pub fn relative_path(&self, component: SegmentComponent) -> PathBuf {
let filename = self.uuid_string() + component.path_suffix();
PathBuf::from(filename)
}
}
impl Encodable for SegmentId {
fn encode<S: Encoder>(&self, s: &mut S) -> Result<(), S::Error> {
self.0.encode(s)
/// Returns a shorter identifier of the segment.
///
/// We are using UUID4, so only 6 bits are fixed,
/// and the rest is random.
///
/// Picking the first 8 chars is ok to identify
/// segments in a display message.
pub fn short_uuid_string(&self) -> String {
(&self.0.to_simple_ref().to_string()[..8]).to_string()
}
}
impl Decodable for SegmentId {
fn decode<D: Decoder>(d: &mut D) -> Result<Self, D::Error> {
Uuid::decode(d).map(SegmentId)
/// Returns a segment uuid string.
pub fn uuid_string(&self) -> String {
self.0.to_simple_ref().to_string()
}
}
impl fmt::Debug for SegmentId {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
write!(f, "SegmentId({:?})", self.uuid_string())
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
write!(f, "Seg({:?})", self.short_uuid_string())
}
}
impl PartialOrd for SegmentId {
fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
Some(self.cmp(other))
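`short_uuid_string` keeps the first 8 characters of the hyphen-less hexadecimal form of the UUID. A minimal sketch of that truncation with the standard library only: it models the 128-bit UUID as a plain `u128`, which is an assumption of this sketch and not how the `uuid` crate is used above, and the helper names `simple_hex` and `short_id` are hypothetical.

```rust
/// Formats a 128-bit id as 32 lowercase hex digits, without hyphens.
///
/// Hypothetical stand-in for the UUID "simple" representation.
fn simple_hex(id: u128) -> String {
    format!("{:032x}", id)
}

/// Keeps the first 8 hex digits, enough to tell segments apart in a log line.
fn short_id(id: u128) -> String {
    simple_hex(id)[..8].to_string()
}
```

Slicing by byte index is safe here because the string contains only ASCII hex digits.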

View File

@@ -1,29 +1,27 @@
use Result;
use core::Segment;
use core::SegmentId;
use core::SegmentComponent;
use schema::Term;
use store::StoreReader;
use schema::Document;
use directory::ReadOnlySource;
use DocId;
use std::io;
use std::str;
use postings::TermInfo;
use datastruct::FstMap;
use crate::common::CompositeFile;
use crate::common::HasLen;
use crate::core::InvertedIndexReader;
use crate::core::Segment;
use crate::core::SegmentComponent;
use crate::core::SegmentId;
use crate::directory::ReadOnlySource;
use crate::fastfield::DeleteBitSet;
use crate::fastfield::FacetReader;
use crate::fastfield::FastFieldReaders;
use crate::fieldnorm::FieldNormReader;
use crate::schema::Field;
use crate::schema::FieldType;
use crate::schema::Schema;
use crate::space_usage::SegmentSpaceUsage;
use crate::store::StoreReader;
use crate::termdict::TermDictionary;
use crate::DocId;
use crate::Result;
use fail::fail_point;
use std::collections::HashMap;
use std::fmt;
use rustc_serialize::json;
use core::SegmentInfo;
use schema::Field;
use postings::SegmentPostingsOption;
use postings::SegmentPostings;
use fastfield::{U32FastFieldsReader, U32FastFieldReader};
use schema::Schema;
use schema::FieldType;
use postings::FreqHandler;
use schema::TextIndexingOptions;
use error::Error;
use std::sync::Arc;
use std::sync::RwLock;
/// Entry point to access all of the datastructures of the `Segment`
///
@@ -36,15 +34,25 @@ use error::Error;
/// The segment reader has a very low memory footprint,
/// as close to all of the memory data is mmapped.
///
///
/// TODO fix not decoding docfreq
#[derive(Clone)]
pub struct SegmentReader {
segment_info: SegmentInfo,
inv_idx_reader_cache: Arc<RwLock<HashMap<Field, Arc<InvertedIndexReader>>>>,
segment_id: SegmentId,
term_infos: FstMap<TermInfo>,
postings_data: ReadOnlySource,
store_reader: StoreReader,
fast_fields_reader: U32FastFieldsReader,
fieldnorms_reader: U32FastFieldsReader,
positions_data: ReadOnlySource,
max_doc: DocId,
num_docs: DocId,
termdict_composite: CompositeFile,
postings_composite: CompositeFile,
positions_composite: CompositeFile,
positions_idx_composite: CompositeFile,
fast_fields_readers: Arc<FastFieldReaders>,
fieldnorms_composite: CompositeFile,
store_source: ReadOnlySource,
delete_bitset_opt: Option<DeleteBitSet>,
schema: Schema,
}
@@ -54,194 +62,350 @@ impl SegmentReader {
/// Today, `tantivy` does not handle deletes, so it happens
/// to also be the number of documents in the index.
pub fn max_doc(&self) -> DocId {
self.segment_info.max_doc
self.max_doc
}
/// Returns the number of documents.
/// Deleted documents are not counted.
///
/// Today, `tantivy` does not handle deletes so max doc and
/// num_docs are the same.
pub fn num_docs(&self) -> DocId {
self.segment_info.max_doc
self.num_docs
}
/// Returns the schema of the index this segment belongs to.
pub fn schema(&self) -> &Schema {
&self.schema
}
/// Return the number of documents that have been
/// deleted in the segment.
pub fn num_deleted_docs(&self) -> DocId {
self.delete_bitset()
.map(|delete_set| delete_set.len() as DocId)
.unwrap_or(0u32)
}
/// Returns true iff some of the documents of the segment have been deleted.
pub fn has_deletes(&self) -> bool {
self.delete_bitset().is_some()
}
/// Accessor to a segment's fast field reader given a field.
pub fn get_fast_field_reader(&self, field: Field) -> io::Result<U32FastFieldReader> {
let field_entry = self.schema.get_field_entry(field);
match *field_entry.field_type() {
FieldType::Str(_) => {
Err(io::Error::new(io::ErrorKind::Other, "fast field are not yet supported for text fields."))
},
FieldType::U32(_) => {
// TODO check that the schema allows that
//Err(io::Error::new(io::ErrorKind::Other, "fast field are not yet supported for text fields."))
self.fast_fields_reader.get_field(field)
},
}
///
/// Returns the u64 fast value reader if the field
/// is a u64 field indexed as "fast".
///
/// Return a FastFieldNotAvailableError if the field is not
/// declared as a fast field in the schema.
///
/// # Panics
/// May panic if the index is corrupted.
pub fn fast_fields(&self) -> &FastFieldReaders {
&self.fast_fields_readers
}
/// Accessor to the `FacetReader` associated to a given `Field`.
pub fn facet_reader(&self, field: Field) -> Option<FacetReader> {
let field_entry = self.schema.get_field_entry(field);
if field_entry.field_type() != &FieldType::HierarchicalFacet {
return None;
}
let term_ords_reader = self.fast_fields().u64s(field)?;
let termdict_source = self.termdict_composite.open_read(field)?;
let termdict = TermDictionary::from_source(&termdict_source);
let facet_reader = FacetReader::new(term_ords_reader, termdict);
Some(facet_reader)
}
/// Accessor to the segment's `Field norms`'s reader.
///
/// Field norms are the length (in tokens) of the fields.
/// It is used in the computation of the [TfIdf](https://fulmicoton.gitbooks.io/tantivy-doc/content/tfidf.html).
///
/// They are simply stored as a fast field, serialized in
/// the `.fieldnorm` file of the segment.
pub fn get_fieldnorms_reader(&self, field: Field) -> io::Result<U32FastFieldReader> {
self.fieldnorms_reader.get_field(field)
}
/// Returns the number of documents containing the term.
pub fn doc_freq(&self, term: &Term) -> u32 {
match self.get_term_info(term) {
Some(term_info) => term_info.doc_freq,
None => 0,
/// They are simply stored as a fast field, serialized in
/// the `.fieldnorm` file of the segment.
pub fn get_fieldnorms_reader(&self, field: Field) -> FieldNormReader {
if let Some(fieldnorm_source) = self.fieldnorms_composite.open_read(field) {
FieldNormReader::open(fieldnorm_source)
} else {
let field_name = self.schema.get_field_name(field);
let err_msg = format!(
"Field norm not found for field {:?}. Was it marked as indexed during indexing?",
field_name
);
panic!(err_msg);
}
}
}
/// Accessor to the segment's `StoreReader`.
pub fn get_store_reader(&self) -> &StoreReader {
&self.store_reader
pub fn get_store_reader(&self) -> StoreReader {
StoreReader::from_source(self.store_source.clone())
}
/// Open a new segment for reading.
pub fn open(segment: Segment) -> Result<SegmentReader> {
let segment_info_reader = try!(segment.open_read(SegmentComponent::INFO));
let segment_info_data = try!(
str::from_utf8(&*segment_info_reader)
.map_err(|err| {
let segment_info_filepath = segment.relative_path(SegmentComponent::INFO);
Error::CorruptedFile(segment_info_filepath, Box::new(err))
})
);
let segment_info: SegmentInfo = try!(
json::decode(&segment_info_data)
.map_err(|err| {
let file_path = segment.relative_path(SegmentComponent::INFO);
Error::CorruptedFile(file_path, Box::new(err))
})
);
let source = try!(segment.open_read(SegmentComponent::TERMS));
let term_infos = try!(FstMap::from_source(source));
let store_reader = StoreReader::from(try!(segment.open_read(SegmentComponent::STORE)));
let postings_shared_mmap = try!(segment.open_read(SegmentComponent::POSTINGS));
let fast_field_data = try!(segment.open_read(SegmentComponent::FASTFIELDS));
let fast_fields_reader = try!(U32FastFieldsReader::open(fast_field_data));
let fieldnorms_data = try!(segment.open_read(SegmentComponent::FIELDNORMS));
let fieldnorms_reader = try!(U32FastFieldsReader::open(fieldnorms_data));
let positions_data = segment
.open_read(SegmentComponent::POSITIONS)
.unwrap_or_else(|_| ReadOnlySource::empty());
pub fn open(segment: &Segment) -> Result<SegmentReader> {
let termdict_source = segment.open_read(SegmentComponent::TERMS)?;
let termdict_composite = CompositeFile::open(&termdict_source)?;
let store_source = segment.open_read(SegmentComponent::STORE)?;
fail_point!("SegmentReader::open#middle");
let postings_source = segment.open_read(SegmentComponent::POSTINGS)?;
let postings_composite = CompositeFile::open(&postings_source)?;
let positions_composite = {
if let Ok(source) = segment.open_read(SegmentComponent::POSITIONS) {
CompositeFile::open(&source)?
} else {
CompositeFile::empty()
}
};
let positions_idx_composite = {
if let Ok(source) = segment.open_read(SegmentComponent::POSITIONSSKIP) {
CompositeFile::open(&source)?
} else {
CompositeFile::empty()
}
};
let schema = segment.schema();
let fast_fields_data = segment.open_read(SegmentComponent::FASTFIELDS)?;
let fast_fields_composite = CompositeFile::open(&fast_fields_data)?;
let fast_field_readers =
Arc::new(FastFieldReaders::load_all(&schema, &fast_fields_composite)?);
let fieldnorms_data = segment.open_read(SegmentComponent::FIELDNORMS)?;
let fieldnorms_composite = CompositeFile::open(&fieldnorms_data)?;
let delete_bitset_opt = if segment.meta().has_deletes() {
let delete_data = segment.open_read(SegmentComponent::DELETE)?;
Some(DeleteBitSet::open(delete_data))
} else {
None
};
Ok(SegmentReader {
segment_info: segment_info,
postings_data: postings_shared_mmap,
term_infos: term_infos,
inv_idx_reader_cache: Arc::new(RwLock::new(HashMap::new())),
max_doc: segment.meta().max_doc(),
num_docs: segment.meta().num_docs(),
termdict_composite,
postings_composite,
fast_fields_readers: fast_field_readers,
fieldnorms_composite,
segment_id: segment.id(),
store_reader: store_reader,
fast_fields_reader: fast_fields_reader,
fieldnorms_reader: fieldnorms_reader,
positions_data: positions_data,
schema: schema,
store_source,
delete_bitset_opt,
positions_composite,
positions_idx_composite,
schema,
})
}
/// Return the term dictionary datastructure.
pub fn term_infos(&self) -> &FstMap<TermInfo> {
&self.term_infos
}
/// Returns the document (or to be accurate, its stored field)
/// bearing the given doc id.
/// This method is slow and should seldom be called from
/// within a collector.
pub fn doc(&self, doc_id: DocId) -> Result<Document> {
self.store_reader.get(doc_id)
}
/// Returns the segment postings associated with the term, and with the given option,
/// or `None` if the term has never been encountered and indexed.
///
/// If the field was not indexed with the indexing options that cover
/// the requested options, the method does not fail
/// and returns a `SegmentPostings` with as much information as possible.
/// Returns a field reader associated to the field given in argument.
/// If the field was not present in the index during indexing time,
/// the InvertedIndexReader is empty.
///
/// For instance, requesting `SegmentPostingsOption::FreqAndPositions` for a `TextIndexingOptions`
/// that does not index position will return a `SegmentPostings` with `DocId`s and frequencies.
pub fn read_postings(&self, term: &Term, option: SegmentPostingsOption) -> Option<SegmentPostings> {
let field = term.field();
/// The field reader is in charge of iterating through the
/// term dictionary associated to a specific field,
/// and opening the posting list associated to any term.
pub fn inverted_index(&self, field: Field) -> Arc<InvertedIndexReader> {
if let Some(inv_idx_reader) = self
.inv_idx_reader_cache
.read()
.expect("Lock poisoned. This should never happen")
.get(&field)
{
return Arc::clone(inv_idx_reader);
}
let field_entry = self.schema.get_field_entry(field);
let term_info = get!(self.get_term_info(&term));
let offset = term_info.postings_offset as usize;
let postings_data = &self.postings_data[offset..];
let freq_handler = match *field_entry.field_type() {
FieldType::Str(ref options) => {
let indexing_options = options.get_indexing_options();
match option {
SegmentPostingsOption::NoFreq => {
FreqHandler::new_without_freq()
}
SegmentPostingsOption::Freq => {
if indexing_options.is_termfreq_enabled() {
FreqHandler::new_with_freq()
}
else {
FreqHandler::new_without_freq()
}
}
SegmentPostingsOption::FreqAndPositions => {
if indexing_options == TextIndexingOptions::TokenizedWithFreqAndPosition {
let offseted_position_data = &self.positions_data[term_info.positions_offset as usize ..];
FreqHandler::new_with_freq_and_position(offseted_position_data)
}
else if indexing_options.is_termfreq_enabled()
{
FreqHandler::new_with_freq()
}
else {
FreqHandler::new_without_freq()
}
}
}
}
_ => {
FreqHandler::new_without_freq()
}
};
Some(SegmentPostings::from_data(term_info.doc_freq, postings_data, freq_handler))
let field_type = field_entry.field_type();
let record_option_opt = field_type.get_index_record_option();
if record_option_opt.is_none() {
panic!("Field {:?} does not seem indexed.", field_entry.name());
}
let record_option = record_option_opt.unwrap();
let postings_source_opt = self.postings_composite.open_read(field);
if postings_source_opt.is_none() {
// no documents in the segment contained this field.
// As a result, no data is associated to the inverted index.
//
// Returns an empty inverted index.
return Arc::new(InvertedIndexReader::empty(field_type));
}
let postings_source = postings_source_opt.unwrap();
let termdict_source = self.termdict_composite.open_read(field).expect(
"Failed to open field term dictionary in composite file. Is the field indexed?",
);
let positions_source = self
.positions_composite
.open_read(field)
.expect("Index corrupted. Failed to open field positions in composite file.");
let positions_idx_source = self
.positions_idx_composite
.open_read(field)
.expect("Index corrupted. Failed to open field positions index in composite file.");
let inv_idx_reader = Arc::new(InvertedIndexReader::new(
TermDictionary::from_source(&termdict_source),
postings_source,
positions_source,
positions_idx_source,
record_option,
));
// By releasing the lock in between, we may end up opening the inverted index
// twice, but this is fine.
self.inv_idx_reader_cache
.write()
.expect("Field reader cache lock poisoned. This should never happen.")
.insert(field, Arc::clone(&inv_idx_reader));
inv_idx_reader
}
/// Returns the posting list associated with a term.
pub fn read_postings_all_info(&self, term: &Term) -> Option<SegmentPostings> {
let field_entry = self.schema.get_field_entry(term.field());
let segment_posting_option = match *field_entry.field_type() {
FieldType::Str(ref text_options) => {
match text_options.get_indexing_options() {
TextIndexingOptions::TokenizedWithFreq => SegmentPostingsOption::Freq,
TextIndexingOptions::TokenizedWithFreqAndPosition => SegmentPostingsOption::FreqAndPositions,
_ => SegmentPostingsOption::NoFreq,
}
}
FieldType::U32(_) => SegmentPostingsOption::NoFreq
};
self.read_postings(term, segment_posting_option)
/// Returns the segment id
pub fn segment_id(&self) -> SegmentId {
self.segment_id
}
/// Returns the term info associated with the term.
pub fn get_term_info(&self, term: &Term) -> Option<TermInfo> {
self.term_infos.get(term.as_slice())
/// Returns the bitset representing
/// the documents that have been deleted.
pub fn delete_bitset(&self) -> Option<&DeleteBitSet> {
self.delete_bitset_opt.as_ref()
}
/// Returns true iff the `doc` is marked
/// as deleted.
pub fn is_deleted(&self, doc: DocId) -> bool {
self.delete_bitset()
.map(|delete_set| delete_set.is_deleted(doc))
.unwrap_or(false)
}
/// Returns an iterator that will iterate over the alive document ids
pub fn doc_ids_alive(&self) -> SegmentReaderAliveDocsIterator<'_> {
SegmentReaderAliveDocsIterator::new(&self)
}
/// Summarize total space usage of this segment.
pub fn space_usage(&self) -> SegmentSpaceUsage {
SegmentSpaceUsage::new(
self.num_docs(),
self.termdict_composite.space_usage(),
self.postings_composite.space_usage(),
self.positions_composite.space_usage(),
self.positions_idx_composite.space_usage(),
self.fast_fields_readers.space_usage(),
self.fieldnorms_composite.space_usage(),
self.get_store_reader().space_usage(),
self.delete_bitset_opt
.as_ref()
.map(DeleteBitSet::space_usage)
.unwrap_or(0),
)
}
}
impl fmt::Debug for SegmentReader {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
write!(f, "SegmentReader({:?})", self.segment_id)
}
}
/// Implements the iterator trait to allow easy iteration
/// over non-deleted ("alive") DocIds in a SegmentReader
pub struct SegmentReaderAliveDocsIterator<'a> {
reader: &'a SegmentReader,
max_doc: DocId,
current: DocId,
}
impl<'a> SegmentReaderAliveDocsIterator<'a> {
pub fn new(reader: &'a SegmentReader) -> SegmentReaderAliveDocsIterator<'a> {
SegmentReaderAliveDocsIterator {
reader,
max_doc: reader.max_doc(),
current: 0,
}
}
}
impl<'a> Iterator for SegmentReaderAliveDocsIterator<'a> {
type Item = DocId;
fn next(&mut self) -> Option<Self::Item> {
// TODO: Use TinySet (like in BitSetDocSet) to speed this process up
if self.current >= self.max_doc {
return None;
}
// find the next alive doc id
while self.reader.is_deleted(self.current) {
self.current += 1;
if self.current >= self.max_doc {
return None;
}
}
// capture the current alive DocId
let result = Some(self.current);
// move down the chain
self.current += 1;
result
}
}
#[cfg(test)]
mod test {
use crate::core::Index;
use crate::schema::{Schema, Term, STORED, TEXT};
use crate::DocId;
#[test]
fn test_alive_docs_iterator() {
let mut schema_builder = Schema::builder();
schema_builder.add_text_field("name", TEXT | STORED);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema.clone());
let name = schema.get_field("name").unwrap();
{
let mut index_writer = index.writer_with_num_threads(1, 3_000_000).unwrap();
index_writer.add_document(doc!(name => "tantivy"));
index_writer.add_document(doc!(name => "horse"));
index_writer.add_document(doc!(name => "jockey"));
index_writer.add_document(doc!(name => "cap"));
// we should now have one segment with four docs
index_writer.commit().unwrap();
}
{
let mut index_writer2 = index.writer(50_000_000).unwrap();
index_writer2.delete_term(Term::from_field_text(name, "horse"));
index_writer2.delete_term(Term::from_field_text(name, "cap"));
// ok, now we should have two deleted docs
index_writer2.commit().unwrap();
}
let searcher = index.reader().unwrap().searcher();
let docs: Vec<DocId> = searcher.segment_reader(0).doc_ids_alive().collect();
assert_eq!(vec![0u32, 2u32], docs);
}
}
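
The alive-docs iteration added in this diff can be sketched in isolation. The following is a minimal, standalone illustration and not tantivy's code: `AliveDocs` and `alive_docs` are hypothetical names, and a `HashSet<DocId>` stands in for the `DeleteBitSet`. It shows the same idea as `SegmentReaderAliveDocsIterator`: walk doc ids from `0` up to `max_doc` and skip any id marked as deleted.

```rust
use std::collections::HashSet;

// Assumption: doc ids are plain u32 values, as in the diff above.
type DocId = u32;

/// Hypothetical iterator over the non-deleted doc ids in `0..max_doc`.
/// A `HashSet` stands in for the segment's delete bitset.
struct AliveDocs<'a> {
    deleted: &'a HashSet<DocId>,
    max_doc: DocId,
    current: DocId,
}

impl<'a> Iterator for AliveDocs<'a> {
    type Item = DocId;

    fn next(&mut self) -> Option<DocId> {
        // Advance until an alive doc id is found or the segment is exhausted.
        while self.current < self.max_doc {
            let doc = self.current;
            self.current += 1;
            if !self.deleted.contains(&doc) {
                return Some(doc);
            }
        }
        None
    }
}

/// Hypothetical helper that builds the iterator.
fn alive_docs(deleted: &HashSet<DocId>, max_doc: DocId) -> AliveDocs<'_> {
    AliveDocs {
        deleted,
        max_doc,
        current: 0,
    }
}
```

With four documents and doc ids 1 and 3 deleted, collecting `alive_docs(&deleted, 4)` yields doc ids 0 and 2, which mirrors the expectation in `test_alive_docs_iterator`. Folding the bounds check into a single `while` loop avoids repeating the `current >= max_doc` test, at the cost of a slightly different shape from the code in the diff.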
