Compare commits

...

133 Commits

Author SHA1 Message Date
Paul Masurel
d4e2d2e40e Searcher Warming API (#1258)
Adds an API to register Warmers in the IndexReader.


Co-authored-by: shikhar <shikhar@schmizz.net>
2022-01-20 14:32:42 +09:00
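The Warmer pattern this commit describes can be sketched in a few lines: warmers are registered with the reader and each one runs once per new searcher generation, before the searcher is published. This is an illustrative stand-in only; the names (`Warmer`, `Reader`, `warm`) are not tantivy's actual signatures.

```rust
use std::sync::{Arc, RwLock};

// Illustrative sketch of a searcher-warming API; not tantivy's real trait.
trait Warmer: Send + Sync {
    fn warm(&self, num_docs: usize);
}

// A warmer that pre-computes something for each new searcher generation.
struct CacheWarmer {
    cached_doc_count: RwLock<usize>,
}

impl Warmer for CacheWarmer {
    fn warm(&self, num_docs: usize) {
        // Stand-in for an expensive per-generation computation.
        *self.cached_doc_count.write().unwrap() = num_docs;
    }
}

// The reader runs every registered warmer before publishing a new searcher.
struct Reader {
    warmers: Vec<Arc<dyn Warmer>>,
}

impl Reader {
    fn reload(&self, num_docs: usize) {
        for warmer in &self.warmers {
            warmer.warm(num_docs);
        }
    }
}
```

The point of the design is that queries never see a cold searcher: warming happens on reload, off the query path.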
Paul Masurel
732f6847c0 Field type with codes (#1255)
* Terms are now typed.

This change is backward compatible:
While the Term's byte representation is modified, a Term itself
is a transient object that is not serialized as-is in the index.

Its .field() and .value_bytes() accessors, on the other hand, are unchanged.
This change offers better Debug information for terms.

While not strictly necessary, it will also help in the support for JSON types.

* Renamed Hierarchical Facet -> Facet
2022-01-07 20:49:00 +09:00
Paul Masurel
1c6d9bdc6a Comparison of Value based on serialization. (#1250) 2022-01-07 20:31:26 +09:00
Paul Masurel
3ea6800ac5 Pleasing clippy (#1253) 2022-01-06 16:41:24 +09:00
Antoine G
395303b644 Collector + directory doc fixes (#1247)
* doc(collector)

* doc(directory)

* doc(misc)

* wording
2022-01-04 09:22:58 +09:00
Daniel Müller
2c200b46cb Use test-log instead of test-env-log (#1248)
The test-env-log crate has been renamed to test-log to better reflect
its intent of not only catering to env_logger specific initialization
but also tracing (and potentially others in the future).
This change updates the crate to use test-log instead of the now
deprecated test-env-log.
2022-01-04 09:20:30 +09:00
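For context, the switch amounts to a dev-dependency rename; a sketch of the relevant Cargo.toml fragment (versions are illustrative):

```toml
[dev-dependencies]
# test-log is the renamed successor of test-env-log; it initializes
# env_logger (or tracing) before each annotated test runs.
test-log = "0.2"
env_logger = "0.9"
```

Tests then use the `#[test_log::test]` attribute in place of the old `#[test_env_log::test]`.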
Liam Warfield
17e00df112 Change Snippet.fragments -> Snippet.fragment (#1243)
* Change Snippet.fragments -> Snippet.fragment
* Apply suggestions from code review

Co-authored-by: Liam Warfield <lwarfield@arista.com>
2022-01-03 22:23:51 +09:00
Antoine G
3129d86743 doc(termdict) expose structs (#1242)
* doc(termdict) expose structs
also add merger doc + lint
refs #1232
2022-01-03 22:20:31 +09:00
Shikhar Bhushan
e5e252cbc0 LogMergePolicy knob del_docs_percentage_before_merge (#1238)
Add a knob to LogMergePolicy to always merge segments that exceed a threshold of deleted docs

Closes #115
2021-12-20 13:14:56 +09:00
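The decision this knob controls can be sketched as a pure function (names here are stand-ins, not the exact tantivy API): a segment is force-merged once its share of deleted documents crosses a configurable threshold.

```rust
// Illustrative sketch of the deleted-docs merge trigger; not the exact
// tantivy API.
struct SegmentStats {
    num_docs: u32,    // live documents
    num_deleted: u32, // deleted documents
}

// Force a merge when the share of deleted docs in a segment exceeds `threshold`.
fn exceeds_del_docs_ratio(seg: &SegmentStats, threshold: f64) -> bool {
    let total = f64::from(seg.num_docs + seg.num_deleted);
    if total == 0.0 {
        return false;
    }
    f64::from(seg.num_deleted) / total > threshold
}
```

Merging such segments reclaims the space and scan time spent on documents that are only masked, not physically removed.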
Paul Masurel
b2da82f151 Making MergeCandidate public in order to allow the usage of custom merge (#1237)
policies.

Closes #1235
2021-12-13 09:54:21 +09:00
Paul Masurel
c81b3030fa Issue/922b (#1233)
* Add a NORMED option on fields

Make fieldnorm indexation optional:

* for all types except text => added a NORMED option
* for text fields:
** if STRING, the field has no fieldnorm retained
** if TEXT, the field has its fieldnorm computed

* Finalize making fieldnorms optional for all field types.

- Using Option for fieldnorm readers.
2021-12-10 21:12:29 +09:00
Paul Masurel
9e66c75fc6 Using stable in CI as rustc nightly seems broken 2021-12-10 18:45:23 +09:00
Paul Masurel
ebdbb6bd2e Fixing compilation warnings & clippy comments. 2021-12-10 16:47:59 +09:00
Antoine G
c980b19dd9 canonicalize path when opening MmapDirectory (#1231)
* canonicalize path when opening `MmapDirectory`
fixes #1229
2021-12-09 10:19:52 +09:00
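The motivation: two syntactically different paths can name the same directory, and a lock or registry keyed on the raw path would treat them as distinct. Canonicalizing first, as this commit does for `MmapDirectory`, makes equivalent spellings compare equal. A std-only sketch:

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

// Resolve symlinks and `.`/`..` components so that equivalent spellings
// of the same directory map to a single canonical key.
fn canonical_key(path: &Path) -> io::Result<PathBuf> {
    fs::canonicalize(path)
}
```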
Paul Masurel
098eea843a Reducing the number of call to fsync on the directory. (#1228)
This works by introducing a new API method in the Directory
trait. The user needs to explicitly call this method
(in particular, once before a commit).

Closes #1225
2021-12-03 03:10:52 +00:00
Paul Masurel
466dc8233c Cargo fmt 2021-12-02 18:46:28 +09:00
Paul Masurel
03c2f6ece2 We are missing 4 bytes in the LZ4 compression buffer. (#1226)
Closes #831
2021-12-02 16:00:29 +09:00
Paul Masurel
1d4e9a29db Cargo fmt 2021-12-02 15:51:44 +09:00
Paul Masurel
f378d9a57b Pleasing clippy 2021-12-02 14:48:33 +09:00
Paul Masurel
dde49ac8e2 Closes #1195 (#1222)
Removes the indexed option for facets.
Facets are now always indexed.

Closes #1195
2021-12-02 14:37:19 +09:00
Paul Masurel
c3cc93406d Bugfix: adds missing fdatasync on atomic_write.
In addition, this PR:
- removes unnecessary flushes and fsyncs on files.
- replaces all fsync calls with fdatasync. The latter triggers
a metadata sync if metadata required to read the file
has changed; it is therefore sufficient for us.

Closes #1224
2021-12-02 13:42:44 +09:00
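Rust's standard library exposes the same distinction directly: `File::sync_all` maps to fsync (data plus metadata) while `File::sync_data` maps to fdatasync, which skips metadata not needed to read the contents back. A minimal sketch:

```rust
use std::fs::File;
use std::io::{self, Write};
use std::path::Path;

// Write a payload and flush it durably with the fdatasync-equivalent call.
// sync_data is typically cheaper than sync_all but still guarantees the
// file contents survive a crash once it returns.
fn write_durably(path: &Path, payload: &[u8]) -> io::Result<()> {
    let mut file = File::create(path)?;
    file.write_all(payload)?;
    file.sync_data()?; // use sync_all() when metadata must persist too
    Ok(())
}
```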
Kanji Yomoda
bd0f9211da Remove unused sort for segment meta list (#1218)
* Remove unused sort for segment meta list
* Fix segment meta order dependent test
2021-12-01 11:18:17 +09:00
PSeitz
c503c6e4fa Switch to non-strict schema (#1216)
Fixes #1211
2021-11-29 10:38:59 +09:00
PSeitz
02174d26af Merge pull request #1209 from quickwit-inc/lz4_flex_version
fix lz4_flex version
2021-11-16 14:12:45 +08:00
PSeitz
cf92be3bd6 fix lz4_flex version 2021-11-16 06:03:04 +00:00
Shikhar Bhushan
72cef12db1 Add none compression (#1208) 2021-11-16 10:50:42 +09:00
Paul Masurel
bbc0a2e233 Fixing the build 2021-11-16 09:37:25 +09:00
François Massot
4fd1a6c84b Merge pull request #1207 from quickwit-inc/fix-chat-links
Remove Patreon link and change Gitter links to Discord links.
2021-11-15 19:23:21 +01:00
François Massot
c83d99c414 Remove Patreon link and change Gitter links to Discord links. 2021-11-15 19:17:35 +01:00
Paul Masurel
eacf510175 Exchange gitter link for discord 2021-11-15 16:44:13 +09:00
Paul Masurel
8802d125f8 Prepare commit is public again (#1202)
- Simplified some of the prepare commit & segment updater code using
async.
- Made PrepareCommit public again.
2021-11-12 23:25:39 +09:00
dependabot[bot]
33301a3eb4 Update fail requirement from 0.4 to 0.5 (#1197)
Updates the requirements on [fail](https://github.com/tikv/fail-rs) to permit the latest version.
- [Release notes](https://github.com/tikv/fail-rs/releases)
- [Changelog](https://github.com/tikv/fail-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/tikv/fail-rs/compare/v0.4.0...v0.5.0)

---
updated-dependencies:
- dependency-name: fail
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-11-12 23:21:16 +09:00
Paul Masurel
7234bef0eb Issue/1198 (#1201)
* Unit test reproducing #1198
* Fixing unit test to handle the error from add_document.
* Bump project version
2021-11-11 16:42:19 +09:00
azerowall
fcff91559b Fix the deserialization error of FieldEntry when the 'options' field appears before the 'type' field (#1199)
Co-authored-by: quel <azerowall>
2021-11-10 18:39:58 +09:00
Paul Masurel
b75d4e59d1 Remove the broken panic on drop unit test. (#1200) 2021-11-10 18:39:37 +09:00
Paul Masurel
c6b5ab1dbe Replacing the panic check in the RAM Directory on lack of flush. 2021-11-09 11:04:31 +09:00
PSeitz
c12e07f0ce Merge pull request #1196 from quickwit-inc/dependabot/cargo/measure_time-0.8.0
Update measure_time requirement from 0.7.0 to 0.8.0
2021-11-05 08:47:51 +08:00
dependabot[bot]
8b877a4c26 Update measure_time requirement from 0.7.0 to 0.8.0
Updates the requirements on [measure_time](https://github.com/PSeitz/rust_measure_time) to permit the latest version.
- [Release notes](https://github.com/PSeitz/rust_measure_time/releases)
- [Commits](https://github.com/PSeitz/rust_measure_time/commits)

---
updated-dependencies:
- dependency-name: measure_time
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2021-11-04 20:27:16 +00:00
PSeitz
7dc0dc1c9b extend proptests with adding case (#1191)
This extends the proptest to cover a case where up to 100 documents are added to an index.
2021-11-01 09:27:10 +09:00
François Massot
0462754673 Optimize block wand for one and several TermScorer. (#1190)
* Added optimisation using block wand for single TermScorer.

A proptest was also added.

* Fix block wand algorithm by taking the last doc id of scores until the pivot scorer (included).
* In block wand, when block max score is lower than the threshold, advance the scorer with best score.
* Fix wrong condition in block_wand_single_scorer and add debug_assert to have an equality check on doc to break the loop.
2021-11-01 09:18:05 +09:00
PSeitz
5916ceda73 Merge pull request #1188 from PSeitz/sort_issue
fix incorrect padding in bitset for multiple of 64
2021-10-29 17:06:38 +08:00
Pascal Seitz
70283dc6c8 fix incorrect padding in bitset for multiple of 64 2021-10-29 16:49:22 +08:00
PSeitz
dbaf4f3623 Merge pull request #1187 from PSeitz/sort_issue
check searcher num docs in proptest
2021-10-29 16:19:24 +08:00
Pascal Seitz
4808648322 check searcher num docs in proptest 2021-10-29 14:38:30 +08:00
Paul Masurel
54afb9b34a Made PrepareCommit private 2021-10-29 14:13:14 +09:00
Paul Masurel
d336c8b938 Fixed logo 2021-10-27 08:54:16 +09:00
Paul Masurel
980d1b2796 Removing Patreon link 2021-10-27 08:53:45 +09:00
Dan Cecile
6317982876 Make indexer::prepared_commit public (#1184)
* Make indexer::prepared_commit public

* Add PreparedCommit to lib
2021-10-26 12:21:24 +09:00
PSeitz
e2fbbc08ca Merge pull request #1182 from PSeitz/remove_directory_generic
use Box<dyn Directory> as parameter to open/create an Index
2021-10-25 12:49:55 +08:00
Pascal Seitz
99cd25beae use <T: Into<Box<dyn Directory>>> as parameter to open/create an Index
This is done in order to support Box<dyn Directory> in addition to generic implementations of the trait Directory.
Remove boxing in ManagedDirectory.
2021-10-25 12:34:40 +08:00
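The signature change can be illustrated with a toy trait (the `Directory` here is a stand-in, not tantivy's real trait): accepting `T: Into<Box<dyn Directory>>` lets callers pass either a concrete implementation or an already-boxed trait object, since `Box<dyn Directory>` converts into itself via the identity `From` impl.

```rust
// Toy stand-in for tantivy's Directory trait.
trait Directory {
    fn name(&self) -> &'static str;
}

struct RamDirectory;

impl Directory for RamDirectory {
    fn name(&self) -> &'static str {
        "ram"
    }
}

// Concrete implementations opt in by converting into a boxed trait object.
impl From<RamDirectory> for Box<dyn Directory> {
    fn from(d: RamDirectory) -> Self {
        Box::new(d)
    }
}

// Accepts both `RamDirectory` and an existing `Box<dyn Directory>`.
fn open_index<T: Into<Box<dyn Directory>>>(dir: T) -> String {
    let dir: Box<dyn Directory> = dir.into();
    format!("opened index in {}", dir.name())
}
```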
Kanji Yomoda
737ecc7015 Fix outdated comment for IndexWriter::new (#1183) 2021-10-25 10:59:18 +09:00
Kanji Yomoda
09668459c8 Update codecov-action to v2 and make it possible to keep it up-to-date with dependabot (#1181)
* Update codecov-action to v2

* Add github-actions to dependabot
2021-10-25 10:58:16 +09:00
Evance Soumaoro
e5fd30f438 Fixed links (#1177) 2021-10-25 10:56:04 +09:00
Tom Parker-Shemilt
c412a46105 Remove travis config (#1180) 2021-10-24 15:40:43 +09:00
PSeitz
3a78402496 update links (#1176) 2021-10-18 20:45:40 +09:00
Paul Masurel
d18ac136c0 Search simplified (#1175) 2021-10-18 12:52:43 +09:00
Paul Masurel
b5b1244857 More functionality in the ownedbytes crate (#1172) 2021-10-07 18:14:49 +09:00
Paul Masurel
27acfa4dea Removing dead file (#1170) 2021-10-07 14:15:21 +09:00
Paul Masurel
02cffa4dea Code simplification. (#1169)
Code simplification and Clippy
2021-10-07 14:11:44 +09:00
Paul Masurel
b52abbc771 Bugfix transposition_cost_one in FuzzyQuery (#1167) 2021-10-07 09:38:39 +09:00
Paul Masurel
894c61867f Fix test compilation (#1168) 2021-10-06 17:50:10 +09:00
PSeitz
352e0cc58d Add demux operation (#1150)
* add merge for DeleteBitSet, allow custom DeleteBitSet on merge
* forward delete bitsets on merge, add tests
* add demux operation and tests
2021-10-06 16:05:16 +09:00
Paul Masurel
ffe4446d90 Minor lint comments (#1166) 2021-10-06 11:27:48 +09:00
dependabot[bot]
4d05b26e7a Update lru requirement from 0.6.5 to 0.7.0 (#1165)
Updates the requirements on [lru](https://github.com/jeromefroe/lru-rs) to permit the latest version.
- [Release notes](https://github.com/jeromefroe/lru-rs/releases)
- [Changelog](https://github.com/jeromefroe/lru-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/jeromefroe/lru-rs/compare/0.6.5...0.7.0)

---
updated-dependencies:
- dependency-name: lru
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-10-06 05:50:24 +09:00
Paul Masurel
0855649986 Leaning more on the alive (vs delete) semantics. (#1164) 2021-10-05 18:53:29 +09:00
PSeitz
d828e58903 Merge pull request #1163 from PSeitz/reduce_mem_usage
reduce mem usage
2021-10-01 08:03:41 +02:00
Pascal Seitz
aa0396fe27 fix variable names 2021-10-01 13:48:51 +08:00
Pascal Seitz
8d8315f8d0 prealloc vec in postinglist 2021-09-29 09:02:38 +08:00
Pascal Seitz
078c0a2e2e reserve vec 2021-09-29 08:45:04 +08:00
Pascal Seitz
f21e8dd875 use only segment ordinal in docidmapping 2021-09-29 08:44:56 +08:00
Tomoko Uchida
74e36c7e97 Add unit tests for tokenizers and filters (#1156)
* add unit test for SimpleTokenizer
* add unit tests for tokenizers and filters.
2021-09-27 10:22:01 +09:00
PSeitz
f27ae04282 fix slope calculation in multilinear interpol (#1161)
add test to check for compression
2021-09-27 10:14:03 +09:00
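For context, the linear-interpolation fast field codecs store a fitted line per block and bitpack only each value's offset from that line, so a wrong slope inflates every stored offset and hurts compression. A simplified sketch of the fit (illustrative, not tantivy's exact codec code):

```rust
// Slope of the line interpolating between a block's first and last value.
fn slope(first: u64, last: u64, num_vals: u64) -> f64 {
    if num_vals <= 1 {
        return 0.0;
    }
    (last as f64 - first as f64) / (num_vals as f64 - 1.0)
}

// Predicted value at position `idx`; the codec stores only the delta
// between this prediction and the actual value.
fn interpolate(first: u64, slope: f64, idx: u64) -> f64 {
    first as f64 + slope * idx as f64
}
```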
PSeitz
0ce49c9dd4 use lz4_flex 0.9.0 (#1160) 2021-09-27 10:12:20 +09:00
PSeitz
fe8e58e078 Merge pull request #1154 from PSeitz/delete_bitset
add DeleteBitSet iterator
2021-09-24 09:37:39 +02:00
Pascal Seitz
efc0d8341b fix comment 2021-09-24 15:09:21 +08:00
Pascal Seitz
22bcc83d10 fix padding in initialization 2021-09-24 14:43:04 +08:00
Pascal Seitz
5ee5037934 create and use ReadSerializedBitSet 2021-09-24 12:53:33 +08:00
Pascal Seitz
c217bfed1e cargo fmt 2021-09-23 21:02:19 +08:00
Pascal Seitz
c27ccd3e24 improve naming 2021-09-23 21:02:09 +08:00
Paul Masurel
367f5da782 Fixed comment to the index accessor 2021-09-23 21:53:48 +09:00
Mestery
b256df6599 add index accessor for index writer (#1159)
* add index accessor for index writer

* Update src/indexer/index_writer.rs

Co-authored-by: Paul Masurel <paul@quickwit.io>
2021-09-23 21:49:20 +09:00
Pascal Seitz
d7a6a409a1 renames 2021-09-23 20:33:11 +08:00
Pascal Seitz
a1f5cead96 AliveBitSet instead of DeleteBitSet 2021-09-23 20:03:57 +08:00
dependabot[bot]
37c5fe3c86 Update memmap2 requirement from 0.4 to 0.5 (#1157)
Updates the requirements on [memmap2](https://github.com/RazrFalcon/memmap2-rs) to permit the latest version.
- [Release notes](https://github.com/RazrFalcon/memmap2-rs/releases)
- [Changelog](https://github.com/RazrFalcon/memmap2-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/RazrFalcon/memmap2-rs/compare/v0.4.0...v0.5.0)

---
updated-dependencies:
- dependency-name: memmap2
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-09-23 20:18:27 +09:00
Pascal Seitz
4583fa270b fixes 2021-09-23 10:39:53 +08:00
Pascal Seitz
beb3a5bd73 fix len 2021-09-18 17:58:15 +08:00
Pascal Seitz
93cbd52bf0 move code to biset, add inline, add benchmark 2021-09-18 17:35:22 +08:00
Pascal Seitz
c22177a005 add iterator 2021-09-17 15:29:27 +08:00
Pascal Seitz
4da71273e1 add de/serialization for bitset
remove len footgun
2021-09-17 10:28:12 +08:00
dependabot[bot]
2c78b31aab Update memmap2 requirement from 0.3 to 0.4 (#1155)
Updates the requirements on [memmap2](https://github.com/RazrFalcon/memmap2-rs) to permit the latest version.
- [Release notes](https://github.com/RazrFalcon/memmap2-rs/releases)
- [Changelog](https://github.com/RazrFalcon/memmap2-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/RazrFalcon/memmap2-rs/compare/v.0.3.0...v0.4.0)
2021-09-17 08:52:52 +09:00
Pascal Seitz
4ae1d87632 add DeleteBitSet iterator 2021-09-15 23:10:04 +08:00
Paul Masurel
46b86a7976 Bumped version and edited changelog 2021-09-10 23:05:09 +09:00
PSeitz
3bc177e69d fix #1151 (#1152)
* fix #1151

Fixes an off-by-one error in the stats for the index fast field in the multi-value fast field.
When retrieving the data range for a docid, `get(docid)..get(docid + 1)` is requested. On creation,
the num_vals statistic was set to docid instead of docid + 1. In the multivaluelinearinterpol fast
field, the last value was therefore not serialized (and would return 0 instead in most cases).
So the last document's `get(lastdoc)..get(lastdoc + 1)` would return the invalid range `value..0`.

This PR adds a proptest to cover this scenario. It requires a combination of a large number of
values (since multilinear interpolation is only active for more than 5,000 values) and a merge.
2021-09-10 23:00:37 +09:00
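The offset layout behind this bug can be shown in a few lines: a multi-value fast field keeps an offsets array with n + 1 entries for n documents, and document `d`'s values live at `offsets[d]..offsets[d + 1]`. Dropping the final sentinel (the off-by-one above) corrupts the last document's range. A self-contained sketch:

```rust
// Look up the values of document `doc` in a flattened multi-value layout:
// `offsets` has one entry per document plus a final sentinel, so it must
// hold `num_docs + 1` entries.
fn values_for_doc<'a>(offsets: &[u64], values: &'a [u64], doc: usize) -> &'a [u64] {
    let start = offsets[doc] as usize;
    let end = offsets[doc + 1] as usize;
    &values[start..end]
}
```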
PSeitz
319609e9c1 test cargo-llvm-cov (#1149) 2021-09-03 22:00:43 +09:00
Kanji Yomoda
9d87b89718 Fix incorrect comment for Index::create_in_dir (#1148)
* Fix incorrect comment for Index::create_in_dir
2021-09-03 10:37:16 +09:00
Tomoko Uchida
dd81e38e53 Add WhitespaceTokenizer (#1147)
* Add WhitespaceTokenizer.
2021-08-29 18:20:49 +09:00
Paul Masurel
9f32b22602 Preparing for release. 2021-08-26 09:07:08 +09:00
sigaloid
096ce7488e Resolve some clippys, format (#1144)
* cargo +nightly clippy --fix -Z unstable-options
2021-08-26 08:46:00 +09:00
PSeitz
a1782dd172 Update index_sorting.md 2021-08-25 07:55:50 +01:00
PSeitz
000d76b11a Update index_sorting.md 2021-08-24 19:28:06 +01:00
PSeitz
abd29f6646 Update index_sorting.md 2021-08-24 19:26:19 +01:00
PSeitz
b4ecf0ab2f Merge pull request #1146 from tantivy-search/sorting_doc
add sorting to book
2021-08-23 17:37:54 +01:00
Pascal Seitz
798f7dbf67 add sorting to book 2021-08-23 17:36:41 +01:00
PSeitz
06a2e47c8d Merge pull request #1145 from tantivy-search/blub2
cargo fmt
2021-08-21 18:52:50 +01:00
Pascal Seitz
e0b83eb291 cargo fmt 2021-08-21 18:52:10 +01:00
PSeitz
13401f46ea add wildcard mention 2021-08-21 18:10:33 +01:00
PSeitz
1a45b030dc Merge pull request #1141 from tantivy-search/tantivy_common
dissolve common module
2021-08-20 08:03:37 +01:00
Pascal Seitz
62052bcc2d add missing test function
closes #1139
2021-08-20 07:26:22 +01:00
Pascal Seitz
3265f7bec3 dissolve common module 2021-08-19 23:26:34 +01:00
Pascal Seitz
ee0881712a move bitset to common crate, move composite file to directory 2021-08-19 17:45:09 +01:00
PSeitz
483e0336b6 Merge pull request #1140 from tantivy-search/tantivy_common
rename common to tantivy-common
2021-08-19 13:02:54 +01:00
Pascal Seitz
3e8f267e33 rename common to tantivy-common 2021-08-19 10:27:20 +01:00
Paul Masurel
3b247fd968 Version bump 2021-08-19 10:12:30 +09:00
Paul Masurel
750f6e6479 Removed obsolete unit test (#1138) 2021-08-19 10:07:49 +09:00
Evance Soumaoro
5b475e6603 Checksum validation using active files (#1130)
* checksum validation now uses segment files, not managed files
2021-08-19 10:03:20 +09:00
PSeitz
0ca7f73dc5 add docs badge, fix build badge 2021-08-13 19:40:33 +01:00
PSeitz
47ed18845e Merge pull request #1136 from tantivy-search/minor_fixes
more docs detail
2021-08-13 18:11:47 +01:00
Pascal Seitz
dc141cdb29 more docs detail
remove code duplicate
2021-08-13 17:40:13 +01:00
PSeitz
f6cf6e889b Merge pull request #1133 from tantivy-search/merge_overflow
test doc_freq and term_freq in sorted index
2021-08-05 07:53:46 +01:00
Pascal Seitz
f379a80233 test doc_freq and term_freq in sorted index 2021-08-03 11:38:05 +01:00
PSeitz
4a320fd1ff fix delta position in merge and index sorting (#1132)
fixes #1125
2021-08-03 18:06:36 +09:00
PSeitz
85d23e8e3b Merge pull request #1129 from tantivy-search/merge_overflow
add long running test in ci
2021-08-02 15:54:31 +01:00
Pascal Seitz
022ab9d298 don't run as pr 2021-08-02 15:44:00 +01:00
Pascal Seitz
605e8603dc add positions to long running test 2021-08-02 15:29:49 +01:00
Pascal Seitz
70f160b329 add long running test in ci 2021-08-02 11:35:39 +01:00
PSeitz
6d265e6bed fix gh action name 2021-08-02 10:38:01 +01:00
PSeitz
fdc512391b Merge pull request #1128 from tantivy-search/merge_overflow
add sort to functional test, add env for iterations
2021-08-02 10:29:16 +01:00
Pascal Seitz
108714c934 add sort to functional test, add env for iterations 2021-08-02 10:11:17 +01:00
Paul Masurel
44e8cf98a5 Cargo fmt 2021-07-30 15:30:01 +09:00
Paul Masurel
f0ee69d9e9 Remove the complicated block search logic for a simpler branchless (#1124)
binary search

The code is simpler and faster.

Before
test postings::bench::bench_segment_intersection                                                                         ... bench:   2,093,697 ns/iter (+/- 115,509)
test postings::bench::bench_skip_next_p01                                                                                ... bench:      58,585 ns/iter (+/- 796)
test postings::bench::bench_skip_next_p1                                                                                 ... bench:     160,872 ns/iter (+/- 5,164)
test postings::bench::bench_skip_next_p10                                                                                ... bench:     615,229 ns/iter (+/- 25,108)
test postings::bench::bench_skip_next_p90                                                                                ... bench:   1,120,509 ns/iter (+/- 22,271)

After
test postings::bench::bench_segment_intersection                                                                         ... bench:   1,747,726 ns/iter (+/- 52,867)
test postings::bench::bench_skip_next_p01                                                                                ... bench:      55,205 ns/iter (+/- 714)
test postings::bench::bench_skip_next_p1                                                                                 ... bench:     131,433 ns/iter (+/- 2,814)
test postings::bench::bench_skip_next_p10                                                                                ... bench:     478,830 ns/iter (+/- 12,794)
test postings::bench::bench_skip_next_p90                                                                                ... bench:     931,082 ns/iter (+/- 31,468)
2021-07-30 14:38:42 +09:00
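The branchless idea can be sketched as follows (a generic illustration of the technique, not the tantivy code): each iteration halves the search window and advances the base with a single comparison that compilers typically lower to a conditional move rather than an unpredictable branch, so the loop structure is data-independent.

```rust
// Branchless lower bound: index of the first element >= target.
fn branchless_lower_bound(data: &[u32], target: u32) -> usize {
    let mut base = 0usize;
    let mut len = data.len();
    while len > 1 {
        let half = len / 2;
        // The only conditional: usually compiled to a cmov, not a branch.
        if data[base + half - 1] < target {
            base += half;
        }
        len -= half;
    }
    // Final step: step past `base` if its element is still below target.
    base + data.get(base).map_or(false, |&v| v < target) as usize
}
```

The fixed iteration count (log2 of the slice length) is what makes this predictable for the branch predictor, which is where the benchmark wins above come from.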
Evance Soumaoro
b8a10c8406 switched to memmap2-rs (#1120) 2021-07-27 18:40:41 +09:00
PSeitz
ff4813529e add comments on compression (#1119) 2021-07-26 22:54:22 +09:00
PSeitz
470bc18e9b Merge pull request #1118 from tantivy-search/remove_rand
move rand to optional dependencies
2021-07-21 18:01:22 +01:00
181 changed files with 7302 additions and 4110 deletions


@@ -6,3 +6,10 @@ updates:
       interval: daily
       time: "20:00"
     open-pull-requests-limit: 10
+  - package-ecosystem: "github-actions"
+    directory: "/"
+    schedule:
+      interval: daily
+      time: "20:00"
+    open-pull-requests-limit: 10


@@ -1,27 +1,25 @@
-name: coverage
+name: Coverage
 on:
   push:
     branches: [ main ]
   pull_request:
     branches: [ main ]
 jobs:
-  test:
-    name: coverage
+  coverage:
     runs-on: ubuntu-latest
-    container:
-      image: xd009642/tarpaulin:develop-nightly
-      options: --security-opt seccomp=unconfined
     steps:
-      - name: Checkout repository
-        uses: actions/checkout@v2
-      - name: Generate code coverage
-        run: |
-          cargo +nightly tarpaulin --verbose --all-features --workspace --timeout 120 --out Xml
-      - name: Upload to codecov.io
-        uses: codecov/codecov-action@v1
+      - uses: actions/checkout@v2
+      - name: Install Rust
+        run: rustup toolchain install nightly --component llvm-tools-preview
+      - name: Install cargo-llvm-cov
+        run: curl -LsSf https://github.com/taiki-e/cargo-llvm-cov/releases/latest/download/cargo-llvm-cov-x86_64-unknown-linux-gnu.tar.gz | tar xzf - -C ~/.cargo/bin
+      - name: Generate code coverage
+        run: cargo llvm-cov --all-features --workspace --lcov --output-path lcov.info
+      - name: Upload coverage to Codecov
+        uses: codecov/codecov-action@v2
         with:
-          # token: ${{secrets.CODECOV_TOKEN}} # not required for public repos
-          fail_ci_if_error: true
+          token: ${{ secrets.CODECOV_TOKEN }} # not required for public repos
+          files: lcov.info
+          fail_ci_if_error: true

.github/workflows/long_running.yml

@@ -0,0 +1,24 @@
name: Rust
on:
push:
branches: [ main ]
env:
CARGO_TERM_COLOR: always
NUM_FUNCTIONAL_TEST_ITERATIONS: 20000
jobs:
functional_test_unsorted:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Run indexing_unsorted
run: cargo test indexing_unsorted -- --ignored
functional_test_sorted:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Run indexing_sorted
run: cargo test indexing_sorted -- --ignored


@@ -10,7 +10,7 @@ env:
   CARGO_TERM_COLOR: always
 jobs:
-  build:
+  test:
     runs-on: ubuntu-latest
@@ -21,10 +21,10 @@ jobs:
       - name: Install latest nightly to test also against unstable feature flag
         uses: actions-rs/toolchain@v1
         with:
-          toolchain: nightly
+          toolchain: stable
           override: true
           components: rustfmt
       - name: Run tests
-        run: cargo test --all-features --verbose --workspace
+        run: cargo test --features mmap,brotli-compression,lz4-compression,snappy-compression,failpoints --verbose --workspace
       - name: Check Formatting
         run: cargo fmt --all -- --check

.gitignore

@@ -1,4 +1,5 @@
 tantivy.iml
+.cargo
 proptest-regressions
 *.swp
 target


@@ -1,92 +0,0 @@
# Based on the "trust" template v0.1.2
# https://github.com/japaric/trust/tree/v0.1.2
dist: trusty
language: rust
services: docker
sudo: required
env:
global:
- CRATE_NAME=tantivy
- TRAVIS_CARGO_NIGHTLY_FEATURE=""
# - secure: eC8HjTi1wgRVCsMAeXEXt8Ckr0YBSGOEnQkkW4/Nde/OZ9jJjz2nmP1ELQlDE7+czHub2QvYtDMG0parcHZDx/Kus0yvyn08y3g2rhGIiE7y8OCvQm1Mybu2D/p7enm6shXquQ6Z5KRfRq+18mHy80wy9ABMA/ukEZdvnfQ76/Een8/Lb0eHaDoXDXn3PqLVtByvSfQQ7OhS60dEScu8PWZ6/l1057P5NpdWbMExBE7Ro4zYXNhkJeGZx0nP/Bd4Jjdt1XfPzMEybV6NZ5xsTILUBFTmOOt603IsqKGov089NExqxYu5bD3K+S4MzF1Nd6VhomNPJqLDCfhlymJCUj5n5Ku4yidlhQbM4Ej9nGrBalJnhcjBjPua5tmMF2WCxP9muKn/2tIOu1/+wc0vMf9Yd3wKIkf5+FtUxCgs2O+NslWvmOMAMI/yD25m7hb4t1IwE/4Bk+GVcWJRWXbo0/m6ZUHzRzdjUY2a1qvw7C9udzdhg7gcnXwsKrSWi2NjMiIVw86l+Zim0nLpKIN41sxZHLaFRG63Ki8zQ/481LGn32awJ6i3sizKS0WD+N1DfR2qYMrwYHaMN0uR0OFXYTJkFvTFttAeUY3EKmRKAuMhmO2YRdSr4/j/G5E9HMc1gSGJj6PxgpQU7EpvxRsmoVAEJr0mszmOj9icGHep/FM=
addons:
apt:
sources:
- ubuntu-toolchain-r-test
- kalakris-cmake
packages:
- gcc-4.8
- g++-4.8
- libcurl4-openssl-dev
- libelf-dev
- libdw-dev
- binutils-dev
- cmake
matrix:
include:
# Android
- env: TARGET=aarch64-linux-android DISABLE_TESTS=1
#- env: TARGET=arm-linux-androideabi DISABLE_TESTS=1
#- env: TARGET=armv7-linux-androideabi DISABLE_TESTS=1
#- env: TARGET=i686-linux-android DISABLE_TESTS=1
#- env: TARGET=x86_64-linux-android DISABLE_TESTS=1
# Linux
#- env: TARGET=aarch64-unknown-linux-gnu
#- env: TARGET=i686-unknown-linux-gnu
- env: TARGET=x86_64-unknown-linux-gnu CODECOV=1 #UPLOAD_DOCS=1
# - env: TARGET=x86_64-unknown-linux-musl CODECOV=1
# OSX
#- env: TARGET=x86_64-apple-darwin
# os: osx
before_install:
- set -e
- rustup self update
- rustup component add rustfmt
install:
- sh ci/install.sh
- source ~/.cargo/env || true
- env | grep "TRAVIS"
before_script:
- export PATH=$HOME/.cargo/bin:$PATH
- cargo install cargo-update || echo "cargo-update already installed"
- cargo install cargo-travis || echo "cargo-travis already installed"
script:
- bash ci/script.sh
- cargo fmt --all -- --check
before_deploy:
- sh ci/before_deploy.sh
after_success:
# Needs GH_TOKEN env var to be set in travis settings
- if [[ -v GH_TOKEN ]]; then echo "GH TOKEN IS SET"; else echo "GH TOKEN NOT SET"; fi
- if [[ -v UPLOAD_DOCS ]]; then cargo doc; cargo doc-upload; else echo "doc upload disabled."; fi
#cache: cargo
#before_cache:
# # Travis can't cache files that are not readable by "others"
# - chmod -R a+r $HOME/.cargo
# - find ./target/debug -type f -maxdepth 1 -delete
# - rm -f ./target/.rustc_info.json
# - rm -fr ./target/debug/{deps,.fingerprint}/tantivy*
# - rm -r target/debug/examples/
# - ls -1 examples/ | sed -e 's/\.rs$//' | xargs -I "{}" find target/* -name "*{}*" -type f -delete
#branches:
# only:
# # release tags
# - /^v\d+\.\d+\.\d+.*$/
# - master
notifications:
email:
on_success: never


@@ -1,3 +1,27 @@
+Tantivy 0.17
+================================
+- LogMergePolicy now triggers merges if the ratio of deleted documents reaches a threshold (@shikhar) [#115](https://github.com/quickwit-inc/tantivy/issues/115)
+- Adds a searcher Warmer API (@shikhar)
+- Change to non-strict schema. Ignore fields in data which are not defined in the schema. Previously this returned an error. #1211
+- Facets are now always indexed. Existing indexes with indexed facets work out of the box. Indexes with facets marked with index: false will be broken (but they were already broken in a sense). (@fulmicoton) #1195
+- Bugfix for an issue that could in theory impact durability on some filesystems [#1224](https://github.com/quickwit-inc/tantivy/issues/1224)
+- Schema now offers the option to not index fieldnorms (@lpouget) [#922](https://github.com/quickwit-inc/tantivy/issues/922)
+- Reduce the number of fsync calls [#1225](https://github.com/quickwit-inc/tantivy/issues/1225)
+Tantivy 0.16.2
+================================
+- Bugfix in FuzzyTermQuery. (transposition_cost_one was not doing anything)
+Tantivy 0.16.1
+========================
+- Major Bugfix on multivalued fastfield. #1151
+- Demux operation (@PSeitz)
+Tantivy 0.16.0
+=========================
+- Bugfix in the filesum check. (@evanxg852000) #1127
+- Bugfix in positions when the index is sorted by a field. (@appaquet) #1125
 Tantivy 0.15.3
 =========================
 - Major bugfix. Deleting documents was broken when the index was sorted by a field. (@appaquet, @fulmicoton) #1101
@@ -104,7 +128,7 @@ Tantivy 0.12.0
 ## How to update?
 Crates relying on custom tokenizer, or registering tokenizer in the manager will require some
-minor changes. Check https://github.com/tantivy-search/tantivy/blob/main/examples/custom_tokenizer.rs
+minor changes. Check https://github.com/quickwit-inc/tantivy/blob/main/examples/custom_tokenizer.rs
 to check for some code sample.
 Tantivy 0.11.3


@@ -1,13 +1,13 @@
 [package]
 name = "tantivy"
-version = "0.16.0-dev"
+version = "0.17.0-dev"
 authors = ["Paul Masurel <paul.masurel@gmail.com>"]
 license = "MIT"
 categories = ["database-implementations", "data-structures"]
 description = """Search engine library"""
 documentation = "https://docs.rs/tantivy/"
-homepage = "https://github.com/tantivy-search/tantivy"
-repository = "https://github.com/tantivy-search/tantivy"
+homepage = "https://github.com/quickwit-inc/tantivy"
+repository = "https://github.com/quickwit-inc/tantivy"
 readme = "README.md"
 keywords = ["search", "information", "retrieval"]
 edition = "2018"
@@ -19,8 +19,8 @@ crc32fast = "1.2.1"
 once_cell = "1.7.2"
 regex ={ version = "1.5.4", default-features = false, features = ["std"] }
 tantivy-fst = "0.3"
-memmap = {version = "0.7", optional=true}
+memmap2 = {version = "0.5", optional=true}
-lz4_flex = { version = "0.8.0", default-features = false, features = ["checked-decode"], optional = true }
+lz4_flex = { version = "0.9", default-features = false, features = ["checked-decode"], optional = true }
 brotli = { version = "3.3", optional = true }
 snap = { version = "1.0.5", optional = true }
 tempfile = { version = "3.2", optional = true }
@@ -31,13 +31,13 @@ num_cpus = "1.13"
 fs2={ version = "0.4.3", optional = true }
 levenshtein_automata = "0.2"
 uuid = { version = "0.8.2", features = ["v4", "serde"] }
-crossbeam = "0.8"
+crossbeam = "0.8.1"
 futures = { version = "0.3.15", features = ["thread-pool"] }
 tantivy-query-grammar = { version="0.15.0", path="./query-grammar" }
 tantivy-bitpacker = { version="0.1", path="./bitpacker" }
-common = { version="0.1", path="./common" }
+common = { version = "0.1", path = "./common/", package = "tantivy-common" }
 fastfield_codecs = { version="0.1", path="./fastfield_codecs", default-features = false }
-ownedbytes = { version="0.1", path="./ownedbytes" }
+ownedbytes = { version="0.2", path="./ownedbytes" }
 stable_deref_trait = "1.2"
 rust-stemmers = "1.2"
 downcast-rs = "1.2"
@@ -46,15 +46,15 @@ census = "0.4"
 fnv = "1.0.7"
 thiserror = "1.0.24"
 htmlescape = "0.3.1"
-fail = "0.4"
+fail = "0.5"
 murmurhash32 = "0.2"
 chrono = "0.4.19"
 smallvec = "1.6.1"
 rayon = "1.5"
-lru = "0.6.5"
+lru = "0.7.0"
 fastdivide = "0.3"
 itertools = "0.10.0"
-measure_time = "0.7.0"
+measure_time = "0.8.0"
 [target.'cfg(windows)'.dependencies]
 winapi = "0.3.9"
@@ -64,10 +64,12 @@ rand = "0.8.3"
 maplit = "1.0.2"
 matches = "0.1.8"
 proptest = "1.0"
-criterion = "0.3.4"
+criterion = "0.3.5"
+test-log = "0.2.8"
+env_logger = "0.9.0"
 [dev-dependencies.fail]
-version = "0.4"
+version = "0.5"
 features = ["failpoints"]
 [profile.release]
@@ -81,7 +83,7 @@ overflow-checks = true
 [features]
 default = ["mmap", "lz4-compression" ]
-mmap = ["fs2", "tempfile", "memmap"]
+mmap = ["fs2", "tempfile", "memmap2"]
 brotli-compression = ["brotli"]
 lz4-compression = ["lz4_flex"]
@@ -89,7 +91,6 @@ snappy-compression = ["snap"]
 failpoints = ["fail/failpoints"]
 unstable = [] # useful for benches.
-wasm-bindgen = ["uuid/wasm-bindgen"]
 [workspace]
 members = ["query-grammar", "bitpacker", "common", "fastfield_codecs", "ownedbytes"]


@@ -1,9 +1,9 @@
-[![Build Status](https://travis-ci.org/tantivy-search/tantivy.svg?branch=main)](https://travis-ci.org/tantivy-search/tantivy)
+[![Docs](https://docs.rs/tantivy/badge.svg)](https://docs.rs/crate/tantivy/)
-[![codecov](https://codecov.io/gh/tantivy-search/tantivy/branch/main/graph/badge.svg)](https://codecov.io/gh/tantivy-search/tantivy)
+[![Build Status](https://github.com/quickwit-inc/tantivy/actions/workflows/test.yml/badge.svg)](https://github.com/quickwit-inc/tantivy/actions/workflows/test.yml)
-[![Join the chat at https://gitter.im/tantivy-search/tantivy](https://badges.gitter.im/tantivy-search/tantivy.svg)](https://gitter.im/tantivy-search/tantivy?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
+[![codecov](https://codecov.io/gh/quickwit-inc/tantivy/branch/main/graph/badge.svg)](https://codecov.io/gh/quickwit-inc/tantivy)
+[![Join the chat at https://discord.gg/MT27AG5EVE](https://shields.io/discord/908281611840282624?label=chat%20on%20discord)](https://discord.gg/MT27AG5EVE)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
-[![Build status](https://ci.appveyor.com/api/projects/status/r7nb13kj23u8m9pj/branch/main?svg=true)](https://ci.appveyor.com/project/fulmicoton/tantivy/branch/main)
[![Crates.io](https://img.shields.io/crates/v/tantivy.svg)](https://crates.io/crates/tantivy)

![Tantivy](https://tantivy-search.github.io/logo/tantivy-logo.png)
@@ -17,9 +17,6 @@
[![](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/images/6)](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/links/6)
[![](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/images/7)](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/links/7)
-[![Become a patron](https://c5.patreon.com/external/logo/become_a_patron_button.png)](https://www.patreon.com/fulmicoton)

**Tantivy** is a **full text search engine library** written in Rust.
It is closer to [Apache Lucene](https://lucene.apache.org/) than to [Elasticsearch](https://www.elastic.co/products/elasticsearch) or [Apache Solr](https://lucene.apache.org/solr/) in the sense it is not
@@ -78,13 +75,12 @@ It walks you through getting a wikipedia search engine up and running in a few m
There are many ways to support this project.

-- Use Tantivy and tell us about your experience on [Gitter](https://gitter.im/tantivy-search/tantivy) or by email (paul.masurel@gmail.com)
+- Use Tantivy and tell us about your experience on [Discord](https://discord.gg/MT27AG5EVE) or by email (paul.masurel@gmail.com)
- Report bugs
- Write a blog post
- Help with documentation by asking questions or submitting PRs
-- Contribute code (you can join [our Gitter](https://gitter.im/tantivy-search/tantivy))
+- Contribute code (you can join [our Discord server](https://discord.gg/MT27AG5EVE))
- Talk about Tantivy around you
-- [![Become a patron](https://c5.patreon.com/external/logo/become_a_patron_button.png)](https://www.patreon.com/fulmicoton)

# Contributing code
@@ -96,7 +92,7 @@ Tantivy compiles on stable Rust but requires `Rust >= 1.27`.
To check out and run tests, you can simply run:

```bash
-git clone https://github.com/tantivy-search/tantivy.git
+git clone https://github.com/quickwit-inc/tantivy.git
cd tantivy
cargo build
```


@@ -1,12 +1,12 @@
[package]
name = "tantivy-bitpacker"
-version = "0.1.0"
+version = "0.1.1"
edition = "2018"
authors = ["Paul Masurel <paul.masurel@gmail.com>"]
license = "MIT"
categories = []
description = """Tantivy-sub crate: bitpacking"""
-repository = "https://github.com/tantivy-search/tantivy"
+repository = "https://github.com/quickwit-inc/tantivy"
keywords = []

@@ -50,3 +50,32 @@ where
}
None
}
#[test]
fn test_compute_num_bits() {
assert_eq!(compute_num_bits(1), 1u8);
assert_eq!(compute_num_bits(0), 0u8);
assert_eq!(compute_num_bits(2), 2u8);
assert_eq!(compute_num_bits(3), 2u8);
assert_eq!(compute_num_bits(4), 3u8);
assert_eq!(compute_num_bits(255), 8u8);
assert_eq!(compute_num_bits(256), 9u8);
assert_eq!(compute_num_bits(5_000_000_000), 33u8);
}
#[test]
fn test_minmax_empty() {
let vals: Vec<u32> = vec![];
assert_eq!(minmax(vals.into_iter()), None);
}
#[test]
fn test_minmax_one() {
assert_eq!(minmax(vec![1].into_iter()), Some((1, 1)));
}
#[test]
fn test_minmax_two() {
assert_eq!(minmax(vec![1, 2].into_iter()), Some((1, 2)));
assert_eq!(minmax(vec![2, 1].into_iter()), Some((1, 2)));
}


@@ -1,5 +1,5 @@
[package]
-name = "common"
+name = "tantivy-common"
version = "0.1.0"
authors = ["Paul Masurel <paul@quickwit.io>", "Pascal Seitz <pascal@quickwit.io>"]
license = "MIT"
@@ -10,3 +10,8 @@ description = "common traits and utility functions used by multiple tantivy subc
[dependencies]
byteorder = "1.4.3"
+ownedbytes = { version="0.2", path="../ownedbytes" }
+
+[dev-dependencies]
+proptest = "1.0.0"
+rand = "0.8.4"

common/src/bitset.rs Normal file

@@ -0,0 +1,748 @@
use ownedbytes::OwnedBytes;
use std::convert::TryInto;
use std::io::Write;
use std::u64;
use std::{fmt, io};
#[derive(Clone, Copy, Eq, PartialEq)]
pub struct TinySet(u64);
impl fmt::Debug for TinySet {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
self.into_iter().collect::<Vec<u32>>().fmt(f)
}
}
pub struct TinySetIterator(TinySet);
impl Iterator for TinySetIterator {
type Item = u32;
#[inline]
fn next(&mut self) -> Option<Self::Item> {
self.0.pop_lowest()
}
}
impl IntoIterator for TinySet {
type Item = u32;
type IntoIter = TinySetIterator;
fn into_iter(self) -> Self::IntoIter {
TinySetIterator(self)
}
}
impl TinySet {
pub fn serialize<T: Write>(&self, writer: &mut T) -> io::Result<()> {
writer.write_all(self.0.to_le_bytes().as_ref())
}
pub fn into_bytes(self) -> [u8; 8] {
self.0.to_le_bytes()
}
#[inline]
pub fn deserialize(data: [u8; 8]) -> Self {
let val: u64 = u64::from_le_bytes(data);
TinySet(val)
}
/// Returns an empty `TinySet`.
#[inline]
pub fn empty() -> TinySet {
TinySet(0u64)
}
/// Returns a full `TinySet`.
#[inline]
pub fn full() -> TinySet {
TinySet::empty().complement()
}
pub fn clear(&mut self) {
self.0 = 0u64;
}
/// Returns the complement of the set in `[0, 64[`.
///
/// Careful on making this function public, as it will break the padding handling in the last
/// bucket.
#[inline]
fn complement(self) -> TinySet {
TinySet(!self.0)
}
/// Returns true iff the `TinySet` contains the element `el`.
#[inline]
pub fn contains(self, el: u32) -> bool {
!self.intersect(TinySet::singleton(el)).is_empty()
}
/// Returns the number of elements in the TinySet.
#[inline]
pub fn len(self) -> u32 {
self.0.count_ones()
}
/// Returns the intersection of `self` and `other`
#[inline]
#[must_use]
pub fn intersect(self, other: TinySet) -> TinySet {
TinySet(self.0 & other.0)
}
/// Creates a new `TinySet` containing only one element
/// within `[0; 64[`
#[inline]
pub fn singleton(el: u32) -> TinySet {
TinySet(1u64 << u64::from(el))
}
/// Insert a new element within [0..64)
#[inline]
#[must_use]
pub fn insert(self, el: u32) -> TinySet {
self.union(TinySet::singleton(el))
}
/// Removes an element within [0..64)
#[inline]
#[must_use]
pub fn remove(self, el: u32) -> TinySet {
self.intersect(TinySet::singleton(el).complement())
}
/// Insert a new element within [0..64)
///
/// returns true if the set changed
#[inline]
pub fn insert_mut(&mut self, el: u32) -> bool {
let old = *self;
*self = old.insert(el);
old != *self
}
/// Removes an element within [0..64)
///
/// returns true if the set changed
#[inline]
pub fn remove_mut(&mut self, el: u32) -> bool {
let old = *self;
*self = old.remove(el);
old != *self
}
/// Returns the union of two tinysets
#[inline]
#[must_use]
pub fn union(self, other: TinySet) -> TinySet {
TinySet(self.0 | other.0)
}
/// Returns true iff the `TinySet` is empty.
#[inline]
pub fn is_empty(self) -> bool {
self.0 == 0u64
}
/// Returns the lowest element in the `TinySet`
/// and removes it.
#[inline]
pub fn pop_lowest(&mut self) -> Option<u32> {
if self.is_empty() {
None
} else {
let lowest = self.0.trailing_zeros() as u32;
self.0 ^= TinySet::singleton(lowest).0;
Some(lowest)
}
}
/// Returns a `TinySet` that contains all values up
/// to `upper_bound`, excluded.
///
/// The limit is assumed to be strictly lower than 64.
pub fn range_lower(upper_bound: u32) -> TinySet {
TinySet((1u64 << u64::from(upper_bound % 64u32)) - 1u64)
}
/// Returns a `TinySet` that contains all values greater
/// or equal to the given limit, included. (and up to 63)
///
/// The limit is assumed to be strictly lower than 64.
pub fn range_greater_or_equal(from_included: u32) -> TinySet {
TinySet::range_lower(from_included).complement()
}
}
#[derive(Clone)]
pub struct BitSet {
tinysets: Box<[TinySet]>,
len: u64,
max_value: u32,
}
fn num_buckets(max_val: u32) -> u32 {
(max_val + 63u32) / 64u32
}
impl BitSet {
/// serialize a `BitSet`.
///
pub fn serialize<T: Write>(&self, writer: &mut T) -> io::Result<()> {
writer.write_all(self.max_value.to_le_bytes().as_ref())?;
for tinyset in self.tinysets.iter().cloned() {
writer.write_all(&tinyset.into_bytes())?;
}
writer.flush()?;
Ok(())
}
/// Create a new `BitSet` that may contain elements
/// within `[0, max_val)`.
pub fn with_max_value(max_value: u32) -> BitSet {
let num_buckets = num_buckets(max_value);
let tinybitsets = vec![TinySet::empty(); num_buckets as usize].into_boxed_slice();
BitSet {
tinysets: tinybitsets,
len: 0,
max_value,
}
}
/// Create a new `BitSet` that may contain elements
/// within `[0, max_val)`. Initially, all values are set.
pub fn with_max_value_and_full(max_value: u32) -> BitSet {
let num_buckets = num_buckets(max_value);
let mut tinybitsets = vec![TinySet::full(); num_buckets as usize].into_boxed_slice();
// Fix padding
let lower = max_value % 64u32;
if lower != 0 {
tinybitsets[tinybitsets.len() - 1] = TinySet::range_lower(lower);
}
BitSet {
tinysets: tinybitsets,
len: max_value as u64,
max_value,
}
}
/// Removes all elements from the `BitSet`.
pub fn clear(&mut self) {
for tinyset in self.tinysets.iter_mut() {
*tinyset = TinySet::empty();
}
}
/// Intersect with serialized bitset
pub fn intersect_update(&mut self, other: &ReadOnlyBitSet) {
self.intersect_update_with_iter(other.iter_tinysets());
}
/// Intersect with tinysets
fn intersect_update_with_iter(&mut self, other: impl Iterator<Item = TinySet>) {
self.len = 0;
for (left, right) in self.tinysets.iter_mut().zip(other) {
*left = left.intersect(right);
self.len += left.len() as u64;
}
}
/// Returns the number of elements in the `BitSet`.
#[inline]
pub fn len(&self) -> usize {
self.len as usize
}
/// Inserts an element in the `BitSet`
#[inline]
pub fn insert(&mut self, el: u32) {
// we do not check saturated els.
let higher = el / 64u32;
let lower = el % 64u32;
self.len += if self.tinysets[higher as usize].insert_mut(lower) {
1
} else {
0
};
}
/// Removes an element from the `BitSet`
#[inline]
pub fn remove(&mut self, el: u32) {
// we do not check saturated els.
let higher = el / 64u32;
let lower = el % 64u32;
self.len -= if self.tinysets[higher as usize].remove_mut(lower) {
1
} else {
0
};
}
/// Returns true iff the elements is in the `BitSet`.
#[inline]
pub fn contains(&self, el: u32) -> bool {
self.tinyset(el / 64u32).contains(el % 64)
}
/// Returns the index of the first non-empty `TinySet` in a bucket
/// equal to or greater than `bucket`, if any.
///
/// Reminder: the tiny set with the bucket `bucket`, represents the
/// elements from `bucket * 64` to `(bucket+1) * 64`.
pub fn first_non_empty_bucket(&self, bucket: u32) -> Option<u32> {
self.tinysets[bucket as usize..]
.iter()
.cloned()
.position(|tinyset| !tinyset.is_empty())
.map(|delta_bucket| bucket + delta_bucket as u32)
}
#[inline]
pub fn max_value(&self) -> u32 {
self.max_value
}
/// Returns the tiny bitset representing the
/// set restricted to the number range from
/// `bucket * 64` to `(bucket + 1) * 64`.
pub fn tinyset(&self, bucket: u32) -> TinySet {
self.tinysets[bucket as usize]
}
}
/// Serialized BitSet.
#[derive(Clone)]
pub struct ReadOnlyBitSet {
data: OwnedBytes,
max_value: u32,
}
pub fn intersect_bitsets(left: &ReadOnlyBitSet, other: &ReadOnlyBitSet) -> ReadOnlyBitSet {
assert_eq!(left.max_value(), other.max_value());
assert_eq!(left.data.len(), other.data.len());
let union_tinyset_it = left
.iter_tinysets()
.zip(other.iter_tinysets())
.map(|(left_tinyset, right_tinyset)| left_tinyset.intersect(right_tinyset));
let mut output_dataset: Vec<u8> = Vec::with_capacity(left.data.len());
for tinyset in union_tinyset_it {
output_dataset.extend_from_slice(&tinyset.into_bytes());
}
ReadOnlyBitSet {
data: OwnedBytes::new(output_dataset),
max_value: left.max_value(),
}
}
impl ReadOnlyBitSet {
pub fn open(data: OwnedBytes) -> Self {
let (max_value_data, data) = data.split(4);
assert_eq!(data.len() % 8, 0);
let max_value: u32 = u32::from_le_bytes(max_value_data.as_ref().try_into().unwrap());
ReadOnlyBitSet { data, max_value }
}
/// Number of elements in the bitset.
#[inline]
pub fn len(&self) -> usize {
self.iter_tinysets()
.map(|tinyset| tinyset.len() as usize)
.sum()
}
/// Iterate the tinyset on the fly from serialized data.
///
#[inline]
fn iter_tinysets(&self) -> impl Iterator<Item = TinySet> + '_ {
self.data.chunks_exact(8).map(move |chunk| {
let tinyset: TinySet = TinySet::deserialize(chunk.try_into().unwrap());
tinyset
})
}
/// Iterate over the positions of the elements.
///
#[inline]
pub fn iter(&self) -> impl Iterator<Item = u32> + '_ {
self.iter_tinysets()
.enumerate()
.flat_map(move |(chunk_num, tinyset)| {
let chunk_base_val = chunk_num as u32 * 64;
tinyset
.into_iter()
.map(move |val| val + chunk_base_val)
.take_while(move |doc| *doc < self.max_value)
})
}
/// Returns true iff the elements is in the `BitSet`.
#[inline]
pub fn contains(&self, el: u32) -> bool {
let byte_offset = el / 8u32;
let b: u8 = self.data[byte_offset as usize];
let shift = (el % 8) as u8;
b & (1u8 << shift) != 0
}
/// Maximum value the bitset may contain.
/// (Note this is not the maximum value contained in the set.)
///
/// A bitset has an intrinsic capacity.
/// It only stores elements within [0..max_value).
#[inline]
pub fn max_value(&self) -> u32 {
self.max_value
}
/// Number of bytes used in the bitset representation.
pub fn num_bytes(&self) -> usize {
self.data.len()
}
}
impl<'a> From<&'a BitSet> for ReadOnlyBitSet {
fn from(bitset: &'a BitSet) -> ReadOnlyBitSet {
let mut buffer = Vec::with_capacity(bitset.tinysets.len() * 8 + 4);
bitset
.serialize(&mut buffer)
.expect("serializing into a buffer should never fail");
ReadOnlyBitSet::open(OwnedBytes::new(buffer))
}
}
#[cfg(test)]
mod tests {
use super::BitSet;
use super::ReadOnlyBitSet;
use super::TinySet;
use ownedbytes::OwnedBytes;
use rand::distributions::Bernoulli;
use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};
use std::collections::HashSet;
#[test]
fn test_read_serialized_bitset_full_multi() {
for i in 0..1000 {
let bitset = BitSet::with_max_value_and_full(i);
let mut out = vec![];
bitset.serialize(&mut out).unwrap();
let bitset = ReadOnlyBitSet::open(OwnedBytes::new(out));
assert_eq!(bitset.len() as usize, i as usize);
}
}
#[test]
fn test_read_serialized_bitset_full_block() {
let bitset = BitSet::with_max_value_and_full(64);
let mut out = vec![];
bitset.serialize(&mut out).unwrap();
let bitset = ReadOnlyBitSet::open(OwnedBytes::new(out));
assert_eq!(bitset.len() as usize, 64 as usize);
}
#[test]
fn test_read_serialized_bitset_full() {
let mut bitset = BitSet::with_max_value_and_full(5);
bitset.remove(3);
let mut out = vec![];
bitset.serialize(&mut out).unwrap();
let bitset = ReadOnlyBitSet::open(OwnedBytes::new(out));
assert_eq!(bitset.len(), 4);
}
#[test]
fn test_bitset_intersect() {
let bitset_serialized = {
let mut bitset = BitSet::with_max_value_and_full(5);
bitset.remove(1);
bitset.remove(3);
let mut out = vec![];
bitset.serialize(&mut out).unwrap();
ReadOnlyBitSet::open(OwnedBytes::new(out))
};
let mut bitset = BitSet::with_max_value_and_full(5);
bitset.remove(1);
bitset.intersect_update(&bitset_serialized);
assert!(bitset.contains(0));
assert!(!bitset.contains(1));
assert!(bitset.contains(2));
assert!(!bitset.contains(3));
assert!(bitset.contains(4));
bitset.intersect_update_with_iter(vec![TinySet::singleton(0)].into_iter());
assert!(bitset.contains(0));
assert!(!bitset.contains(1));
assert!(!bitset.contains(2));
assert!(!bitset.contains(3));
assert!(!bitset.contains(4));
assert_eq!(bitset.len(), 1);
bitset.intersect_update_with_iter(vec![TinySet::singleton(1)].into_iter());
assert!(!bitset.contains(0));
assert!(!bitset.contains(1));
assert!(!bitset.contains(2));
assert!(!bitset.contains(3));
assert!(!bitset.contains(4));
assert_eq!(bitset.len(), 0);
}
#[test]
fn test_read_serialized_bitset_empty() {
let mut bitset = BitSet::with_max_value(5);
bitset.insert(3);
let mut out = vec![];
bitset.serialize(&mut out).unwrap();
let bitset = ReadOnlyBitSet::open(OwnedBytes::new(out));
assert_eq!(bitset.len(), 1);
{
let bitset = BitSet::with_max_value(5);
let mut out = vec![];
bitset.serialize(&mut out).unwrap();
let bitset = ReadOnlyBitSet::open(OwnedBytes::new(out));
assert_eq!(bitset.len(), 0);
}
}
#[test]
fn test_tiny_set_remove() {
{
let mut u = TinySet::empty().insert(63u32).insert(5).remove(63u32);
assert_eq!(u.pop_lowest(), Some(5u32));
assert!(u.pop_lowest().is_none());
}
{
let mut u = TinySet::empty()
.insert(63u32)
.insert(1)
.insert(5)
.remove(63u32);
assert_eq!(u.pop_lowest(), Some(1u32));
assert_eq!(u.pop_lowest(), Some(5u32));
assert!(u.pop_lowest().is_none());
}
{
let mut u = TinySet::empty().insert(1).remove(63u32);
assert_eq!(u.pop_lowest(), Some(1u32));
assert!(u.pop_lowest().is_none());
}
{
let mut u = TinySet::empty().insert(1).remove(1u32);
assert!(u.pop_lowest().is_none());
}
}
#[test]
fn test_tiny_set() {
assert!(TinySet::empty().is_empty());
{
let mut u = TinySet::empty().insert(1u32);
assert_eq!(u.pop_lowest(), Some(1u32));
assert!(u.pop_lowest().is_none())
}
{
let mut u = TinySet::empty().insert(1u32).insert(1u32);
assert_eq!(u.pop_lowest(), Some(1u32));
assert!(u.pop_lowest().is_none())
}
{
let mut u = TinySet::empty().insert(2u32);
assert_eq!(u.pop_lowest(), Some(2u32));
u.insert_mut(1u32);
assert_eq!(u.pop_lowest(), Some(1u32));
assert!(u.pop_lowest().is_none());
}
{
let mut u = TinySet::empty().insert(63u32);
assert_eq!(u.pop_lowest(), Some(63u32));
assert!(u.pop_lowest().is_none());
}
{
let mut u = TinySet::empty().insert(63u32).insert(5);
assert_eq!(u.pop_lowest(), Some(5u32));
assert_eq!(u.pop_lowest(), Some(63u32));
assert!(u.pop_lowest().is_none());
}
{
let original = TinySet::empty().insert(63u32).insert(5);
let after_serialize_deserialize = TinySet::deserialize(original.into_bytes());
assert_eq!(original, after_serialize_deserialize);
}
}
#[test]
fn test_bitset() {
let test_against_hashset = |els: &[u32], max_value: u32| {
let mut hashset: HashSet<u32> = HashSet::new();
let mut bitset = BitSet::with_max_value(max_value);
for &el in els {
assert!(el < max_value);
hashset.insert(el);
bitset.insert(el);
}
for el in 0..max_value {
assert_eq!(hashset.contains(&el), bitset.contains(el));
}
assert_eq!(bitset.max_value(), max_value);
// test deser
let mut data = vec![];
bitset.serialize(&mut data).unwrap();
let ro_bitset = ReadOnlyBitSet::open(OwnedBytes::new(data));
for el in 0..max_value {
assert_eq!(hashset.contains(&el), ro_bitset.contains(el));
}
assert_eq!(ro_bitset.max_value(), max_value);
assert_eq!(ro_bitset.len(), els.len());
};
test_against_hashset(&[], 0);
test_against_hashset(&[], 1);
test_against_hashset(&[0u32], 1);
test_against_hashset(&[0u32], 100);
test_against_hashset(&[1u32, 2u32], 4);
test_against_hashset(&[99u32], 100);
test_against_hashset(&[63u32], 64);
test_against_hashset(&[62u32, 63u32], 64);
}
#[test]
fn test_bitset_num_buckets() {
use super::num_buckets;
assert_eq!(num_buckets(0u32), 0);
assert_eq!(num_buckets(1u32), 1);
assert_eq!(num_buckets(64u32), 1);
assert_eq!(num_buckets(65u32), 2);
assert_eq!(num_buckets(128u32), 2);
assert_eq!(num_buckets(129u32), 3);
}
#[test]
fn test_tinyset_range() {
assert_eq!(
TinySet::range_lower(3).into_iter().collect::<Vec<u32>>(),
[0, 1, 2]
);
assert!(TinySet::range_lower(0).is_empty());
assert_eq!(
TinySet::range_lower(63).into_iter().collect::<Vec<u32>>(),
(0u32..63u32).collect::<Vec<_>>()
);
assert_eq!(
TinySet::range_lower(1).into_iter().collect::<Vec<u32>>(),
[0]
);
assert_eq!(
TinySet::range_lower(2).into_iter().collect::<Vec<u32>>(),
[0, 1]
);
assert_eq!(
TinySet::range_greater_or_equal(3)
.into_iter()
.collect::<Vec<u32>>(),
(3u32..64u32).collect::<Vec<_>>()
);
}
#[test]
fn test_bitset_len() {
let mut bitset = BitSet::with_max_value(1_000);
assert_eq!(bitset.len(), 0);
bitset.insert(3u32);
assert_eq!(bitset.len(), 1);
bitset.insert(103u32);
assert_eq!(bitset.len(), 2);
bitset.insert(3u32);
assert_eq!(bitset.len(), 2);
bitset.insert(103u32);
assert_eq!(bitset.len(), 2);
bitset.insert(104u32);
assert_eq!(bitset.len(), 3);
bitset.remove(105u32);
assert_eq!(bitset.len(), 3);
bitset.remove(104u32);
assert_eq!(bitset.len(), 2);
bitset.remove(3u32);
assert_eq!(bitset.len(), 1);
bitset.remove(103u32);
assert_eq!(bitset.len(), 0);
}
pub fn sample_with_seed(n: u32, ratio: f64, seed_val: u8) -> Vec<u32> {
StdRng::from_seed([seed_val; 32])
.sample_iter(&Bernoulli::new(ratio).unwrap())
.take(n as usize)
.enumerate()
.filter_map(|(val, keep)| if keep { Some(val as u32) } else { None })
.collect()
}
pub fn sample(n: u32, ratio: f64) -> Vec<u32> {
sample_with_seed(n, ratio, 4)
}
#[test]
fn test_bitset_clear() {
let mut bitset = BitSet::with_max_value(1_000);
let els = sample(1_000, 0.01f64);
for &el in &els {
bitset.insert(el);
}
assert!(els.iter().all(|el| bitset.contains(*el)));
bitset.clear();
for el in 0u32..1000u32 {
assert!(!bitset.contains(el));
}
}
}
#[cfg(all(test, feature = "unstable"))]
mod bench {
use super::BitSet;
use super::TinySet;
use test;
#[bench]
fn bench_tinyset_pop(b: &mut test::Bencher) {
b.iter(|| {
let mut tinyset = TinySet::singleton(test::black_box(31u32));
tinyset.pop_lowest();
tinyset.pop_lowest();
tinyset.pop_lowest();
tinyset.pop_lowest();
tinyset.pop_lowest();
tinyset.pop_lowest();
});
}
#[bench]
fn bench_tinyset_sum(b: &mut test::Bencher) {
let tiny_set = TinySet::empty().insert(10u32).insert(14u32).insert(21u32);
b.iter(|| {
assert_eq!(test::black_box(tiny_set).into_iter().sum::<u32>(), 45u32);
});
}
#[bench]
fn bench_tinyarr_sum(b: &mut test::Bencher) {
let v = [10u32, 14u32, 21u32];
b.iter(|| test::black_box(v).iter().cloned().sum::<u32>());
}
#[bench]
fn bench_bitset_initialize(b: &mut test::Bencher) {
b.iter(|| BitSet::with_max_value(1_000_000));
}
}


@@ -1,9 +1,169 @@
#![allow(clippy::len_without_is_empty)]
use std::ops::Deref;
pub use byteorder::LittleEndian as Endianness;

mod bitset;
mod serialize;
mod vint;
mod writer;

pub use bitset::*;
pub use serialize::{BinarySerializable, DeserializeFrom, FixedSize};
pub use vint::{read_u32_vint, read_u32_vint_no_advance, serialize_vint_u32, write_u32_vint, VInt};
pub use writer::{AntiCallToken, CountingWriter, TerminatingWrite};
/// Has length trait
pub trait HasLen {
/// Return length
fn len(&self) -> usize;
/// Returns true iff empty.
fn is_empty(&self) -> bool {
self.len() == 0
}
}
impl<T: Deref<Target = [u8]>> HasLen for T {
fn len(&self) -> usize {
self.deref().len()
}
}
const HIGHEST_BIT: u64 = 1 << 63;
/// Maps a `i64` to `u64`
///
/// For simplicity, tantivy internally handles `i64` as `u64`.
/// The mapping is defined by this function.
///
/// Maps `i64` to `u64` so that
/// `-2^63 .. 2^63-1` is mapped
/// to
/// `0 .. 2^64-1`
/// in that order.
///
/// This is more suited than simply casting (`val as u64`)
/// because of bitpacking.
///
/// Imagine a list of `i64` ranging from -10 to 10.
/// When casting negative values, the negative values are projected
/// to values over 2^63, and all values end up requiring 64 bits.
///
/// # See also
/// The [reverse mapping is `u64_to_i64`](./fn.u64_to_i64.html).
#[inline]
pub fn i64_to_u64(val: i64) -> u64 {
(val as u64) ^ HIGHEST_BIT
}
/// Reverse the mapping given by [`i64_to_u64`](./fn.i64_to_u64.html).
#[inline]
pub fn u64_to_i64(val: u64) -> i64 {
(val ^ HIGHEST_BIT) as i64
}
/// Maps a `f64` to `u64`
///
/// For simplicity, tantivy internally handles `f64` as `u64`.
/// The mapping is defined by this function.
///
/// Maps `f64` to `u64` in a monotonic manner, so that bytes lexical order is preserved.
///
/// This is more suited than simply casting (`val as u64`)
/// which would truncate the result
///
/// # Reference
///
/// Daniel Lemire's [blog post](https://lemire.me/blog/2020/12/14/converting-floating-point-numbers-to-integers-while-preserving-order/)
/// explains the mapping in a clear manner.
///
/// # See also
/// The [reverse mapping is `u64_to_f64`](./fn.u64_to_f64.html).
#[inline]
pub fn f64_to_u64(val: f64) -> u64 {
let bits = val.to_bits();
if val.is_sign_positive() {
bits ^ HIGHEST_BIT
} else {
!bits
}
}
/// Reverse the mapping given by [`f64_to_u64`](./fn.f64_to_u64.html).
#[inline]
pub fn u64_to_f64(val: u64) -> f64 {
f64::from_bits(if val & HIGHEST_BIT != 0 {
val ^ HIGHEST_BIT
} else {
!val
})
}
#[cfg(test)]
pub mod test {
use super::{f64_to_u64, i64_to_u64, u64_to_f64, u64_to_i64};
use super::{BinarySerializable, FixedSize};
use proptest::prelude::*;
use std::f64;
fn test_i64_converter_helper(val: i64) {
assert_eq!(u64_to_i64(i64_to_u64(val)), val);
}
fn test_f64_converter_helper(val: f64) {
assert_eq!(u64_to_f64(f64_to_u64(val)), val);
}
pub fn fixed_size_test<O: BinarySerializable + FixedSize + Default>() {
let mut buffer = Vec::new();
O::default().serialize(&mut buffer).unwrap();
assert_eq!(buffer.len(), O::SIZE_IN_BYTES);
}
proptest! {
#[test]
fn test_f64_converter_monotonicity_proptest((left, right) in (proptest::num::f64::NORMAL, proptest::num::f64::NORMAL)) {
let left_u64 = f64_to_u64(left);
let right_u64 = f64_to_u64(right);
assert_eq!(left_u64 < right_u64, left < right);
}
}
#[test]
fn test_i64_converter() {
assert_eq!(i64_to_u64(i64::min_value()), u64::min_value());
assert_eq!(i64_to_u64(i64::max_value()), u64::max_value());
test_i64_converter_helper(0i64);
test_i64_converter_helper(i64::min_value());
test_i64_converter_helper(i64::max_value());
for i in -1000i64..1000i64 {
test_i64_converter_helper(i);
}
}
#[test]
fn test_f64_converter() {
test_f64_converter_helper(f64::INFINITY);
test_f64_converter_helper(f64::NEG_INFINITY);
test_f64_converter_helper(0.0);
test_f64_converter_helper(-0.0);
test_f64_converter_helper(1.0);
test_f64_converter_helper(-1.0);
}
#[test]
fn test_f64_order() {
assert!(!(f64_to_u64(f64::NEG_INFINITY)..f64_to_u64(f64::INFINITY))
.contains(&f64_to_u64(f64::NAN))); //nan is not a number
assert!(f64_to_u64(1.5) > f64_to_u64(1.0)); //same exponent, different mantissa
assert!(f64_to_u64(2.0) > f64_to_u64(1.0)); //same mantissa, different exponent
assert!(f64_to_u64(2.0) > f64_to_u64(1.5)); //different exponent and mantissa
assert!(f64_to_u64(1.0) > f64_to_u64(-1.0)); // pos > neg
assert!(f64_to_u64(-1.5) < f64_to_u64(-1.0));
assert!(f64_to_u64(-2.0) < f64_to_u64(1.0));
assert!(f64_to_u64(-2.0) < f64_to_u64(-1.5));
}
}


@@ -54,7 +54,7 @@ impl<W: TerminatingWrite> TerminatingWrite for CountingWriter<W> {
}
}

-/// Struct used to prevent from calling [`terminate_ref`](trait.TerminatingWrite#method.terminate_ref) directly
+/// Struct used to prevent from calling [`terminate_ref`](trait.TerminatingWrite.html#tymethod.terminate_ref) directly
///
/// The point is that while the type is public, it cannot be built by anyone
/// outside of this module.


@@ -7,6 +7,7 @@
- [Segments](./basis.md)
- [Defining your schema](./schema.md)
- [Facetting](./facetting.md)
+- [Index Sorting](./index_sorting.md)
- [Innerworkings](./innerworkings.md)
- [Inverted index](./inverted_index.md)
- [Best practise](./inverted_index.md)

doc/src/index_sorting.md Normal file

@@ -0,0 +1,61 @@
- [Index Sorting](#index-sorting)
+ [Why Sorting](#why-sorting)
* [Compression](#compression)
* [Top-N Optimization](#top-n-optimization)
* [Pruning](#pruning)
* [Other](#other)
+ [Usage](#usage)
# Index Sorting
Tantivy allows you to sort the index according to a property.
## Why Sorting
Presorting an index has several advantages:
###### Compression
When data is sorted, it is easier to compress. E.g. the number sequence [5, 2, 3, 1, 4] would be sorted to [1, 2, 3, 4, 5].
With delta encoding, the unsorted list becomes [5, -3, 1, -2, 3], while the sorted list becomes [1, 1, 1, 1, 1], which compresses far better.
The compression ratio mainly improves for the fast field of the sorted property; everything else is likely unaffected.
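The delta-encoding effect can be sketched in a few lines. This is an illustrative snippet, not tantivy's actual codec; `delta_encode` is a hypothetical helper:

```rust
// Hypothetical sketch: delta-encode a list by storing each value's
// difference to its predecessor. Sorted input yields small, constant deltas.
fn delta_encode(vals: &[i64]) -> Vec<i64> {
    let mut prev = 0i64;
    vals.iter()
        .map(|&v| {
            let delta = v - prev; // store only the difference
            prev = v;
            delta
        })
        .collect()
}

fn main() {
    assert_eq!(delta_encode(&[5, 2, 3, 1, 4]), vec![5, -3, 1, -2, 3]);
    assert_eq!(delta_encode(&[1, 2, 3, 4, 5]), vec![1, 1, 1, 1, 1]);
    // The sorted list produces constant deltas, which bit-pack very tightly.
}
```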
###### Top-N Optimization
When data is presorted by a field and search queries request sorting by the same field, we can leverage the natural order of the documents.
E.g. if the data is sorted by timestamp and we want the top-n newest docs containing a term, we can simply leverage the order of the doc ids.
Note: Tantivy 0.16 does not do this optimization yet.
###### Pruning
Let's say we want all documents matching the filter `>= 2010-08-11`. When the data is sorted by that field, we can do a lookup in the fast field to find the matching docid range and use it as the filter.
Note: Tantivy 0.16 does not do this optimization yet.
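The pruning idea above reduces to a binary search when the index is sorted ascending by the filter field. A minimal sketch with a hypothetical helper (not tantivy's API):

```rust
// Hypothetical sketch: when doc ids are ordered by a fast field value,
// a filter like `>= threshold` is just "find the first matching doc id";
// every doc id after it matches too.
fn first_matching_docid(sorted_fast_field: &[u64], threshold: u64) -> usize {
    // partition_point returns the index of the first element not satisfying
    // the predicate, i.e. the first value >= threshold.
    sorted_fast_field.partition_point(|&v| v < threshold)
}

fn main() {
    let timestamps = [2001, 2004, 2008, 2010, 2013, 2020];
    let start = first_matching_docid(&timestamps, 2010);
    assert_eq!(start, 3); // doc ids 3..6 satisfy `>= 2010`
}
```

For a descending sort the same idea applies with the comparison flipped.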
###### Other
In principle, many algorithms could exploit the monotonically increasing nature of the sorted field (aggregations, maybe?).
## Usage
Index sorting can be configured by setting [`sort_by_field`](https://github.com/quickwit-inc/tantivy/blob/000d76b11a139a84b16b9b95060a1c93e8b9851c/src/core/index_meta.rs#L238) on `IndexSettings` and passing it to an `IndexBuilder`. As of tantivy 0.16, only fast fields are allowed as the sort field.
```rust
let settings = IndexSettings {
sort_by_field: Some(IndexSortByField {
field: "intval".to_string(),
order: Order::Desc,
}),
..Default::default()
};
let mut index_builder = Index::builder().schema(schema);
index_builder = index_builder.settings(settings);
let index = index_builder.create_in_ram().unwrap();
```
## Implementation details
Sorting an index is applied in the serialization step. In general there are two serialization steps: [Finishing a single segment](https://github.com/quickwit-inc/tantivy/blob/000d76b11a139a84b16b9b95060a1c93e8b9851c/src/indexer/segment_writer.rs#L338) and [merging multiple segments](https://github.com/quickwit-inc/tantivy/blob/000d76b11a139a84b16b9b95060a1c93e8b9851c/src/indexer/merger.rs#L1073).
In both cases we generate a docid mapping reflecting the sort. This mapping is used when serializing the different components (doc store, fast fields, posting lists, field norms, facets).
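The docid-mapping step can be sketched as follows. This is an illustrative reconstruction, not tantivy's internal code; `docid_mapping_by_key` is a hypothetical helper:

```rust
// Hypothetical sketch: derive a (new docid -> old docid) mapping by sorting
// old doc ids by their fast-field key. Serialization then iterates old docs
// in this order for each component (doc store, fast fields, postings, ...).
fn docid_mapping_by_key(keys: &[u64]) -> Vec<usize> {
    let mut mapping: Vec<usize> = (0..keys.len()).collect();
    // Stable sort: ties keep their original docid order.
    mapping.sort_by_key(|&old_docid| keys[old_docid]);
    mapping
}

fn main() {
    let keys = [5u64, 2, 3, 1, 4];
    let mapping = docid_mapping_by_key(&keys);
    // New doc 0 is old doc 3 (key 1), new doc 1 is old doc 1 (key 2), ...
    assert_eq!(mapping, vec![3, 1, 2, 4, 0]);
}
```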


@@ -96,7 +96,7 @@ fn main() -> tantivy::Result<()> {
    );
    // ... and add it to the `IndexWriter`.
-    index_writer.add_document(old_man_doc);
+    index_writer.add_document(old_man_doc)?;
    // For convenience, tantivy also comes with a macro to
    // reduce the boilerplate above.
@@ -110,7 +110,7 @@ fn main() -> tantivy::Result<()> {
        fresh and green with every spring, carrying in their lower leaf junctures the \
        debris of the winters flooding; and sycamores with mottled, white, recumbent \
        limbs and branches that arch over the pool"
-    ));
+    ))?;
    // Multivalued field just need to be repeated.
    index_writer.add_document(doc!(
@@ -120,7 +120,7 @@ fn main() -> tantivy::Result<()> {
        enterprise which you have regarded with such evil forebodings. I arrived here \
        yesterday, and my first task is to assure my dear sister of my welfare and \
        increasing confidence in the success of my undertaking."
-    ));
+    ))?;
    // This is an example, so we will only index 3 documents
    // here. You can check out tantivy's tutorial to index


@@ -86,12 +86,10 @@ impl Collector for StatsCollector {
    fn merge_fruits(&self, segment_stats: Vec<Option<Stats>>) -> tantivy::Result<Option<Stats>> {
        let mut stats = Stats::default();
-        for segment_stats_opt in segment_stats {
-            if let Some(segment_stats) = segment_stats_opt {
-                stats.count += segment_stats.count;
-                stats.sum += segment_stats.sum;
-                stats.squared_sum += segment_stats.squared_sum;
-            }
-        }
+        for segment_stats in segment_stats.into_iter().flatten() {
+            stats.count += segment_stats.count;
+            stats.sum += segment_stats.sum;
+            stats.squared_sum += segment_stats.squared_sum;
+        }
        Ok(stats.non_zero_count())
    }
@@ -147,23 +145,23 @@ fn main() -> tantivy::Result<()> {
        product_description => "While it is ok for short distance travel, this broom \
        was designed quiditch. It will up your game.",
        price => 30_200u64
-    ));
+    ))?;
    index_writer.add_document(doc!(
        product_name => "Turbulobroom",
        product_description => "You might have heard of this broom before : it is the sponsor of the Wales team.\
        You'll enjoy its sharp turns, and rapid acceleration",
        price => 29_240u64
-    ));
+    ))?;
    index_writer.add_document(doc!(
        product_name => "Broomio",
        product_description => "Great value for the price. This broom is a market favorite",
        price => 21_240u64
-    ));
+    ))?;
    index_writer.add_document(doc!(
        product_name => "Whack a Mole",
        product_description => "Prime quality bat.",
        price => 5_200u64
-    ));
+    ))?;
    index_writer.commit()?;
    let reader = index.reader()?;


@@ -68,7 +68,7 @@ fn main() -> tantivy::Result<()> {
        title => "The Old Man and the Sea",
        body => "He was an old man who fished alone in a skiff in the Gulf Stream and \
        he had gone eighty-four days now without taking a fish."
-    ));
+    ))?;
    index_writer.add_document(doc!(
        title => "Of Mice and Men",
        body => r#"A few miles south of Soledad, the Salinas River drops in close to the hillside
@@ -79,14 +79,14 @@ fn main() -> tantivy::Result<()> {
        fresh and green with every spring, carrying in their lower leaf junctures the
        debris of the winters flooding; and sycamores with mottled, white, recumbent
        limbs and branches that arch over the pool"#
-    ));
+    ))?;
    index_writer.add_document(doc!(
        title => "Frankenstein",
        body => r#"You will rejoice to hear that no disaster has accompanied the commencement of an
        enterprise which you have regarded with such evil forebodings. I arrived here
        yesterday, and my first task is to assure my dear sister of my welfare and
        increasing confidence in the success of my undertaking."#
-    ));
+    ))?;
    index_writer.commit()?;
    let reader = index.reader()?;


@@ -76,15 +76,15 @@ fn main() -> tantivy::Result<()> {
    index_writer.add_document(doc!(
        isbn => "978-0099908401",
        title => "The old Man and the see"
-    ));
+    ))?;
    index_writer.add_document(doc!(
        isbn => "978-0140177398",
        title => "Of Mice and Men",
-    ));
+    ))?;
    index_writer.add_document(doc!(
        title => "Frankentein", //< Oops there is a typo here.
        isbn => "978-9176370711",
-    ));
+    ))?;
    index_writer.commit()?;
    let reader = index.reader()?;
@@ -122,7 +122,7 @@ fn main() -> tantivy::Result<()> {
    index_writer.add_document(doc!(
        title => "Frankenstein",
        isbn => "978-9176370711",
-    ));
+    ))?;
    // You are guaranteed that your clients will only observe your index in
    // the state it was in after a commit.


@@ -23,7 +23,7 @@ fn main() -> tantivy::Result<()> {
    let name = schema_builder.add_text_field("felin_name", TEXT | STORED);
    // this is our faceted field: its scientific classification
-    let classification = schema_builder.add_facet_field("classification", INDEXED);
+    let classification = schema_builder.add_facet_field("classification", FacetOptions::default());
    let schema = schema_builder.build();
    let index = Index::create_in_ram(schema);
@@ -35,35 +35,35 @@ fn main() -> tantivy::Result<()> {
    index_writer.add_document(doc!(
        name => "Cat",
        classification => Facet::from("/Felidae/Felinae/Felis")
-    ));
+    ))?;
    index_writer.add_document(doc!(
        name => "Canada lynx",
        classification => Facet::from("/Felidae/Felinae/Lynx")
-    ));
+    ))?;
    index_writer.add_document(doc!(
        name => "Cheetah",
        classification => Facet::from("/Felidae/Felinae/Acinonyx")
-    ));
+    ))?;
    index_writer.add_document(doc!(
        name => "Tiger",
        classification => Facet::from("/Felidae/Pantherinae/Panthera")
-    ));
+    ))?;
    index_writer.add_document(doc!(
        name => "Lion",
        classification => Facet::from("/Felidae/Pantherinae/Panthera")
-    ));
+    ))?;
    index_writer.add_document(doc!(
        name => "Jaguar",
        classification => Facet::from("/Felidae/Pantherinae/Panthera")
-    ));
+    ))?;
    index_writer.add_document(doc!(
        name => "Sunda clouded leopard",
        classification => Facet::from("/Felidae/Pantherinae/Neofelis")
-    ));
+    ))?;
    index_writer.add_document(doc!(
        name => "Fossa",
        classification => Facet::from("/Eupleridae/Cryptoprocta")
-    ));
+    ))?;
    index_writer.commit()?;
    let reader = index.reader()?;


@@ -9,7 +9,7 @@ fn main() -> tantivy::Result<()> {
    let mut schema_builder = Schema::builder();
    let title = schema_builder.add_text_field("title", STORED);
-    let ingredient = schema_builder.add_facet_field("ingredient", INDEXED);
+    let ingredient = schema_builder.add_facet_field("ingredient", FacetOptions::default());
    let schema = schema_builder.build();
    let index = Index::create_in_ram(schema);
@@ -20,14 +20,14 @@ fn main() -> tantivy::Result<()> {
        title => "Fried egg",
        ingredient => Facet::from("/ingredient/egg"),
        ingredient => Facet::from("/ingredient/oil"),
-    ));
+    ))?;
    index_writer.add_document(doc!(
        title => "Scrambled egg",
        ingredient => Facet::from("/ingredient/egg"),
        ingredient => Facet::from("/ingredient/butter"),
        ingredient => Facet::from("/ingredient/milk"),
        ingredient => Facet::from("/ingredient/salt"),
-    ));
+    ))?;
    index_writer.add_document(doc!(
        title => "Egg rolls",
        ingredient => Facet::from("/ingredient/egg"),
@@ -36,7 +36,7 @@ fn main() -> tantivy::Result<()> {
        ingredient => Facet::from("/ingredient/oil"),
        ingredient => Facet::from("/ingredient/tortilla-wrap"),
        ingredient => Facet::from("/ingredient/mushroom"),
-    ));
+    ))?;
    index_writer.commit()?;
    let reader = index.reader()?;


@@ -7,7 +7,7 @@ use tantivy::query::RangeQuery;
use tantivy::schema::{Schema, INDEXED};
use tantivy::{doc, Index, Result};
-fn run() -> Result<()> {
+fn main() -> Result<()> {
    // For the sake of simplicity, this schema will only have 1 field
    let mut schema_builder = Schema::builder();
@@ -19,7 +19,7 @@ fn run() -> Result<()> {
    {
        let mut index_writer = index.writer_with_num_threads(1, 6_000_000)?;
        for year in 1950u64..2019u64 {
-            index_writer.add_document(doc!(year_field => year));
+            index_writer.add_document(doc!(year_field => year))?;
        }
        index_writer.commit()?;
    // The index will be a range of years
@@ -33,7 +33,3 @@ fn run() -> Result<()> {
    assert_eq!(num_60s_books, 10);
    Ok(())
}
-fn main() {
-    run().unwrap()
-}


@@ -25,9 +25,9 @@ fn main() -> tantivy::Result<()> {
    let index = Index::create_in_ram(schema);
    let mut index_writer = index.writer_with_num_threads(1, 50_000_000)?;
-    index_writer.add_document(doc!(title => "The Old Man and the Sea"));
+    index_writer.add_document(doc!(title => "The Old Man and the Sea"))?;
-    index_writer.add_document(doc!(title => "Of Mice and Men"));
+    index_writer.add_document(doc!(title => "Of Mice and Men"))?;
-    index_writer.add_document(doc!(title => "The modern Promotheus"));
+    index_writer.add_document(doc!(title => "The modern Promotheus"))?;
    index_writer.commit()?;
    let reader = index.reader()?;


@@ -29,7 +29,7 @@ use std::sync::{Arc, RwLock};
use std::thread;
use std::time::Duration;
use tantivy::schema::{Schema, STORED, TEXT};
-use tantivy::{doc, Index, IndexWriter, Opstamp};
+use tantivy::{doc, Index, IndexWriter, Opstamp, TantivyError};
fn main() -> tantivy::Result<()> {
    // # Defining the schema
@@ -59,10 +59,11 @@ fn main() -> tantivy::Result<()> {
            fresh and green with every spring, carrying in their lower leaf junctures the \
            debris of the winters flooding; and sycamores with mottled, white, recumbent \
            limbs and branches that arch over the pool"
-            ));
+            ))?;
            println!("add doc {} from thread 1 - opstamp {}", i, opstamp);
            thread::sleep(Duration::from_millis(20));
        }
+        Result::<(), TantivyError>::Ok(())
    });
    // # Second indexing thread.
@@ -78,11 +79,12 @@ fn main() -> tantivy::Result<()> {
            index_writer_rlock.add_document(doc!(
                title => "Manufacturing consent",
                body => "Some great book description..."
-            ))
+            ))?
        };
        println!("add doc {} from thread 2 - opstamp {}", i, opstamp);
        thread::sleep(Duration::from_millis(10));
        }
+        Result::<(), TantivyError>::Ok(())
    });
    // # In the main thread, we commit 10 times, once every 500ms.
@@ -90,7 +92,7 @@ fn main() -> tantivy::Result<()> {
    let opstamp: Opstamp = {
        // Committing or rollbacking on the other hand requires write lock. This will block other threads.
        let mut index_writer_wlock = index_writer.write().unwrap();
-        index_writer_wlock.commit().unwrap()
+        index_writer_wlock.commit()?
    };
    println!("committed with opstamp {}", opstamp);
    thread::sleep(Duration::from_millis(500));


@@ -68,7 +68,7 @@ fn main() -> tantivy::Result<()> {
    let old_man_doc = doc!(title => title_tok, body => body_tok);
    // ... now let's just add it to the IndexWriter
-    index_writer.add_document(old_man_doc);
+    index_writer.add_document(old_man_doc)?;
    // Pretokenized text can also be fed as JSON
    let short_man_json = r#"{
@@ -84,7 +84,7 @@ fn main() -> tantivy::Result<()> {
    let short_man_doc = schema.parse_document(short_man_json)?;
-    index_writer.add_document(short_man_doc);
+    index_writer.add_document(short_man_doc)?;
    // Let's commit changes
    index_writer.commit()?;
@@ -106,9 +106,7 @@ fn main() -> tantivy::Result<()> {
        IndexRecordOption::Basic,
    );
-    let (top_docs, count) = searcher
-        .search(&query, &(TopDocs::with_limit(2), Count))
-        .unwrap();
+    let (top_docs, count) = searcher.search(&query, &(TopDocs::with_limit(2), Count))?;
    assert_eq!(count, 2);
@@ -129,9 +127,7 @@ fn main() -> tantivy::Result<()> {
        IndexRecordOption::Basic,
    );
-    let (_top_docs, count) = searcher
-        .search(&query, &(TopDocs::with_limit(2), Count))
-        .unwrap();
+    let (_top_docs, count) = searcher.search(&query, &(TopDocs::with_limit(2), Count))?;
    assert_eq!(count, 0);


@@ -40,7 +40,7 @@ fn main() -> tantivy::Result<()> {
        fresh and green with every spring, carrying in their lower leaf junctures the \
        debris of the winters flooding; and sycamores with mottled, white, recumbent \
        limbs and branches that arch over the pool"
-    ));
+    ))?;
    // ...
    index_writer.commit()?;
@@ -70,13 +70,13 @@ fn highlight(snippet: Snippet) -> String {
    let mut start_from = 0;
    for fragment_range in snippet.highlighted() {
-        result.push_str(&snippet.fragments()[start_from..fragment_range.start]);
+        result.push_str(&snippet.fragment()[start_from..fragment_range.start]);
        result.push_str(" --> ");
-        result.push_str(&snippet.fragments()[fragment_range.clone()]);
+        result.push_str(&snippet.fragment()[fragment_range.clone()]);
        result.push_str(" <-- ");
        start_from = fragment_range.end;
    }
-    result.push_str(&snippet.fragments()[start_from..]);
+    result.push_str(&snippet.fragment()[start_from..]);
    result
}


@@ -68,7 +68,7 @@ fn main() -> tantivy::Result<()> {
        title => "The Old Man and the Sea",
        body => "He was an old man who fished alone in a skiff in the Gulf Stream and \
        he had gone eighty-four days now without taking a fish."
-    ));
+    ))?;
    index_writer.add_document(doc!(
        title => "Of Mice and Men",
@@ -80,7 +80,7 @@ fn main() -> tantivy::Result<()> {
        fresh and green with every spring, carrying in their lower leaf junctures the \
        debris of the winters flooding; and sycamores with mottled, white, recumbent \
        limbs and branches that arch over the pool"
-    ));
+    ))?;
    index_writer.add_document(doc!(
        title => "Frankenstein",
@@ -88,7 +88,7 @@ fn main() -> tantivy::Result<()> {
        enterprise which you have regarded with such evil forebodings. I arrived here \
        yesterday, and my first task is to assure my dear sister of my welfare and \
        increasing confidence in the success of my undertaking."
-    ));
+    ))?;
    index_writer.commit()?;

examples/warmer.rs (new file, +223 lines)

@@ -0,0 +1,223 @@
use std::cmp::Reverse;
use std::collections::{HashMap, HashSet};
use std::sync::{Arc, RwLock, Weak};
use tantivy::collector::TopDocs;
use tantivy::fastfield::FastFieldReader;
use tantivy::query::QueryParser;
use tantivy::schema::{Field, Schema, FAST, TEXT};
use tantivy::{doc, DocAddress, DocId, Index, IndexReader, SegmentReader, TrackedObject};
use tantivy::{Opstamp, Searcher, SearcherGeneration, SegmentId, Warmer};
// This example shows how warmers can be used to
// load values from an external source using the Warmer API.
//
// In this example, we assume an e-commerce search engine.
type ProductId = u64;
/// Price
type Price = u32;
pub trait PriceFetcher: Send + Sync + 'static {
fn fetch_prices(&self, product_ids: &[ProductId]) -> Vec<Price>;
}
struct DynamicPriceColumn {
field: Field,
price_cache: RwLock<HashMap<(SegmentId, Option<Opstamp>), Arc<Vec<Price>>>>,
price_fetcher: Box<dyn PriceFetcher>,
}
impl DynamicPriceColumn {
pub fn with_product_id_field<T: PriceFetcher>(field: Field, price_fetcher: T) -> Self {
DynamicPriceColumn {
field,
price_cache: Default::default(),
price_fetcher: Box::new(price_fetcher),
}
}
pub fn price_for_segment(&self, segment_reader: &SegmentReader) -> Option<Arc<Vec<Price>>> {
let segment_key = (segment_reader.segment_id(), segment_reader.delete_opstamp());
self.price_cache.read().unwrap().get(&segment_key).cloned()
}
}
impl Warmer for DynamicPriceColumn {
fn warm(&self, searcher: &Searcher) -> tantivy::Result<()> {
for segment in searcher.segment_readers() {
let key = (segment.segment_id(), segment.delete_opstamp());
let product_id_reader = segment.fast_fields().u64(self.field)?;
let product_ids: Vec<ProductId> = segment
.doc_ids_alive()
.map(|doc| product_id_reader.get(doc))
.collect();
let mut prices_it = self.price_fetcher.fetch_prices(&product_ids).into_iter();
let mut price_vals: Vec<Price> = Vec::new();
for doc in 0..segment.max_doc() {
if segment.is_deleted(doc) {
price_vals.push(0);
} else {
price_vals.push(prices_it.next().unwrap())
}
}
self.price_cache
.write()
.unwrap()
.insert(key, Arc::new(price_vals));
}
Ok(())
}
fn garbage_collect(&self, live_generations: &[TrackedObject<SearcherGeneration>]) {
let live_segment_id_and_delete_ops: HashSet<(SegmentId, Option<Opstamp>)> =
live_generations
.iter()
.flat_map(|gen| gen.segments())
.map(|(&segment_id, &opstamp)| (segment_id, opstamp))
.collect();
let mut price_cache_wrt = self.price_cache.write().unwrap();
// let price_cache = std::mem::take(&mut *price_cache_wrt);
// Drain would be nicer here.
*price_cache_wrt = std::mem::take(&mut *price_cache_wrt)
.into_iter()
.filter(|(seg_id_and_op, _)| !live_segment_id_and_delete_ops.contains(seg_id_and_op))
.collect();
}
}
/// For the sake of this example, the table is just an editable HashMap behind a RwLock.
/// This map represents a map (ProductId -> Price)
///
/// In practice, it could be fetching things from an external service, like a SQL table.
///
#[derive(Default, Clone)]
pub struct ExternalPriceTable {
prices: Arc<RwLock<HashMap<ProductId, Price>>>,
}
impl ExternalPriceTable {
pub fn update_price(&self, product_id: ProductId, price: Price) {
let mut prices_wrt = self.prices.write().unwrap();
prices_wrt.insert(product_id, price);
}
}
impl PriceFetcher for ExternalPriceTable {
fn fetch_prices(&self, product_ids: &[ProductId]) -> Vec<Price> {
let prices_read = self.prices.read().unwrap();
product_ids
.iter()
.map(|product_id| prices_read.get(product_id).cloned().unwrap_or(0))
.collect()
}
}
fn main() -> tantivy::Result<()> {
// Declaring our schema.
let mut schema_builder = Schema::builder();
// The product id is assumed to be a primary id for our external price source.
let product_id = schema_builder.add_u64_field("product_id", FAST);
let text = schema_builder.add_text_field("text", TEXT);
let schema: Schema = schema_builder.build();
let price_table = ExternalPriceTable::default();
let price_dynamic_column = Arc::new(DynamicPriceColumn::with_product_id_field(
product_id,
price_table.clone(),
));
price_table.update_price(OLIVE_OIL, 12);
price_table.update_price(GLOVES, 13);
price_table.update_price(SNEAKERS, 80);
const OLIVE_OIL: ProductId = 323423;
const GLOVES: ProductId = 3966623;
const SNEAKERS: ProductId = 23222;
let index = Index::create_in_ram(schema);
let mut writer = index.writer_with_num_threads(1, 10_000_000)?;
writer.add_document(doc!(product_id=>OLIVE_OIL, text=>"cooking olive oil from greece"))?;
writer.add_document(doc!(product_id=>GLOVES, text=>"kitchen gloves, perfect for cooking"))?;
writer.add_document(doc!(product_id=>SNEAKERS, text=>"uber sweet sneakers"))?;
writer.commit()?;
let warmers: Vec<Weak<dyn Warmer>> = vec![Arc::downgrade(
&(price_dynamic_column.clone() as Arc<dyn Warmer>),
)];
let reader: IndexReader = index
.reader_builder()
.warmers(warmers)
.num_searchers(1)
.try_into()?;
reader.reload()?;
let query_parser = QueryParser::for_index(&index, vec![text]);
let query = query_parser.parse_query("cooking")?;
let searcher = reader.searcher();
let score_by_price = move |segment_reader: &SegmentReader| {
let price = price_dynamic_column
.price_for_segment(segment_reader)
.unwrap();
move |doc_id: DocId| Reverse(price[doc_id as usize])
};
let most_expensive_first = TopDocs::with_limit(10).custom_score(score_by_price);
let hits = searcher.search(&query, &most_expensive_first)?;
assert_eq!(
&hits,
&[
(
Reverse(12u32),
DocAddress {
segment_ord: 0,
doc_id: 0u32
}
),
(
Reverse(13u32),
DocAddress {
segment_ord: 0,
doc_id: 1u32
}
),
]
);
// Olive oil just got more expensive!
price_table.update_price(OLIVE_OIL, 15);
// The price update is directly reflected after `reload`.
//
// Be careful here though!...
// You may have spotted that we are still using the same `Searcher`.
//
// It is up to the `Warmer` implementer to decide how
// to control this behavior.
reader.reload()?;
let hits_with_new_prices = searcher.search(&query, &most_expensive_first)?;
assert_eq!(
&hits_with_new_prices,
&[
(
Reverse(13u32),
DocAddress {
segment_ord: 0,
doc_id: 1u32
}
),
(
Reverse(15u32),
DocAddress {
segment_ord: 0,
doc_id: 0u32
}
),
]
);
Ok(())
}


@@ -9,8 +9,8 @@ description = "Fast field codecs used by tantivy"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
-common = { path = "../common/" }
+common = { version = "0.1", path = "../common/", package = "tantivy-common" }
-tantivy-bitpacker = { path = "../bitpacker/" }
+tantivy-bitpacker = { version="0.1.1", path = "../bitpacker/" }
prettytable-rs = {version="0.8.0", optional= true}
rand = {version="0.8.3", optional= true}


@@ -118,7 +118,7 @@ mod tests {
        );
        }
    }
-    let actual_compression = data.len() as f32 / out.len() as f32;
+    let actual_compression = out.len() as f32 / (data.len() as f32 * 8.0);
    (estimation, actual_compression)
}
pub fn get_codec_test_data_sets() -> Vec<(Vec<u64>, &'static str)> {


@@ -239,11 +239,21 @@ mod tests {
use super::*;
use crate::tests::get_codec_test_data_sets;
-fn create_and_validate(data: &[u64], name: &str) {
+fn create_and_validate(data: &[u64], name: &str) -> (f32, f32) {
    crate::tests::create_and_validate::<
        LinearInterpolFastFieldSerializer,
        LinearInterpolFastFieldReader,
-    >(data, name);
+    >(data, name)
+}
+
+#[test]
+fn test_compression() {
+    let data = (10..=6_000_u64).collect::<Vec<_>>();
+    let (estimate, actual_compression) =
+        create_and_validate(&data, "simple monotonically large");
+    assert!(actual_compression < 0.01);
+    assert!(estimate < 0.01);
}
#[test]


@@ -1,3 +1,17 @@
+/*!
+The MultiLinearInterpol compressor uses linear interpolation to guess values and stores only the offsets, in blocks of 512.
+With a CHUNK_SIZE of 512 and 29 bytes of metadata per block, the metadata overhead is 232 / 512 = 0.45 bits per element.
+The additional space required per element in a block is the maximum deviation of the linear interpolation estimate.
+E.g. if the maximum deviation of an element is 12, all elements cost 4 bits.
+
+Size per block:
+Num Elements * Maximum Deviation from Interpolation + 29 bytes of metadata
+*/
 use crate::FastFieldCodecReader;
 use crate::FastFieldCodecSerializer;
 use crate::FastFieldDataAccess;
@@ -43,7 +57,7 @@ struct Function {
 impl Function {
     fn calc_slope(&mut self) {
         let num_vals = self.end_pos - self.start_pos;
-        get_slope(self.value_start_pos, self.value_end_pos, num_vals);
+        self.slope = get_slope(self.value_start_pos, self.value_end_pos, num_vals);
     }
     // split the interpolation into two functions, change self and return the second split
     fn split(&mut self, split_pos: u64, split_pos_value: u64) -> Function {
@@ -364,11 +378,22 @@ mod tests {
     use super::*;
     use crate::tests::get_codec_test_data_sets;
-    fn create_and_validate(data: &[u64], name: &str) {
+    fn create_and_validate(data: &[u64], name: &str) -> (f32, f32) {
         crate::tests::create_and_validate::<
             MultiLinearInterpolFastFieldSerializer,
             MultiLinearInterpolFastFieldReader,
-        >(data, name);
+        >(data, name)
+    }
+
+    #[test]
+    fn test_compression() {
+        let data = (10..=6_000_u64).collect::<Vec<_>>();
+        let (estimate, actual_compression) =
+            create_and_validate(&data, "simple monotonically large");
+        assert!(actual_compression < 0.2);
+        assert!(estimate < 0.20);
+        assert!(estimate > 0.15);
+        assert!(actual_compression > 0.01);
     }

     #[test]
@@ -400,9 +425,11 @@ mod tests {
     fn rand() {
         for _ in 0..10 {
             let mut data = (5_000..20_000)
-                .map(|_| rand::random::<u64>() as u64)
+                .map(|_| rand::random::<u32>() as u64)
                 .collect::<Vec<_>>();
-            create_and_validate(&data, "random");
+            let (estimate, actual_compression) = create_and_validate(&data, "random");
+            dbg!(estimate);
+            dbg!(actual_compression);
             data.reverse();
             create_and_validate(&data, "random");
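The codec's core idea, predicting each value along a line and storing only the deviation, explains why the monotonic test data above compresses so well. A self-contained sketch under that assumption (names are illustrative, not the crate's API):

```rust
// Predict each value by interpolating linearly between the first and
// last element, and report the worst-case deviation. That deviation
// bounds the number of bits needed to store each offset.
fn max_deviation(data: &[u64]) -> u64 {
    let n = (data.len() - 1) as f64;
    let first = data[0] as f64;
    let last = data[data.len() - 1] as f64;
    let slope = (last - first) / n;
    data.iter()
        .enumerate()
        .map(|(i, &v)| {
            let predicted = first + slope * i as f64;
            (v as f64 - predicted).abs() as u64
        })
        .max()
        .unwrap()
}

fn main() {
    // A perfectly linear sequence (like the test data above) deviates
    // by zero, so the per-element cost is essentially just metadata.
    let data: Vec<u64> = (10..=6_000).collect();
    println!("{}", max_deviation(&data));
}
```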


@@ -1,9 +1,10 @@
 [package]
 authors = ["Paul Masurel <paul@quickwit.io>", "Pascal Seitz <pascal@quickwit.io>"]
 name = "ownedbytes"
-version = "0.1.0"
+version = "0.2.0"
 edition = "2018"
 description = "Expose data as static slice"
+license = "MIT"
 # See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
 [dependencies]


@@ -1,3 +1,5 @@
+#![allow(clippy::return_self_not_must_use)]
+
 use stable_deref_trait::StableDeref;
 use std::convert::TryInto;
 use std::mem;
@@ -35,6 +37,8 @@ impl OwnedBytes {
     }

     /// creates a fileslice that is just a view over a slice of the data.
+    #[must_use]
+    #[inline]
     pub fn slice(&self, range: Range<usize>) -> Self {
         OwnedBytes {
             data: &self.data[range],
@@ -63,6 +67,8 @@ impl OwnedBytes {
     /// On the other hand, both `left` and `right` retain a handle over
     /// the entire slice of memory. In other words, the memory will only
     /// be released when both left and right are dropped.
+    #[inline]
+    #[must_use]
     pub fn split(self, split_len: usize) -> (OwnedBytes, OwnedBytes) {
         let right_box_stable_deref = self.box_stable_deref.clone();
         let left = OwnedBytes {
@@ -76,6 +82,19 @@ impl OwnedBytes {
         (left, right)
     }

+    /// Splits off the right part of the `OwnedBytes` at the given offset.
+    ///
+    /// `self` is truncated to its first `split_len` bytes; the remaining
+    /// bytes are returned.
+    pub fn split_off(&mut self, split_len: usize) -> OwnedBytes {
+        let right_box_stable_deref = self.box_stable_deref.clone();
+        let right_piece = OwnedBytes {
+            data: &self.data[split_len..],
+            box_stable_deref: right_box_stable_deref,
+        };
+        self.data = &self.data[..split_len];
+        right_piece
+    }
+
     /// Returns true iff this `OwnedBytes` is empty.
     #[inline]
     pub fn is_empty(&self) -> bool {
@@ -84,7 +103,6 @@ impl OwnedBytes {
     /// Drops the left most `advance_len` bytes.
     ///
-    /// See also [.clip(clip_len: usize))](#method.clip).
     #[inline]
     pub fn advance(&mut self, advance_len: usize) {
         self.data = &self.data[advance_len..]
@@ -124,6 +142,35 @@ impl fmt::Debug for OwnedBytes {
     }
 }

+impl PartialEq for OwnedBytes {
+    fn eq(&self, other: &OwnedBytes) -> bool {
+        self.as_slice() == other.as_slice()
+    }
+}
+
+impl Eq for OwnedBytes {}
+
+impl PartialEq<[u8]> for OwnedBytes {
+    fn eq(&self, other: &[u8]) -> bool {
+        self.as_slice() == other
+    }
+}
+
+impl PartialEq<str> for OwnedBytes {
+    fn eq(&self, other: &str) -> bool {
+        self.as_slice() == other.as_bytes()
+    }
+}
+
+impl<'a, T: ?Sized> PartialEq<&'a T> for OwnedBytes
+where
+    OwnedBytes: PartialEq<T>,
+{
+    fn eq(&self, other: &&'a T) -> bool {
+        *self == **other
+    }
+}
+
 impl Deref for OwnedBytes {
     type Target = [u8];
@@ -287,4 +334,14 @@ mod tests {
         assert_eq!(right.as_slice(), b"");
     }

+    #[test]
+    fn test_split_off() {
+        let mut data = OwnedBytes::new(b"abcdef".as_ref());
+        assert_eq!(data, "abcdef");
+        assert_eq!(data.split_off(2), "cdef");
+        assert_eq!(data, "ab");
+        assert_eq!(data.split_off(1), "b");
+        assert_eq!(data, "a");
+    }
 }
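The new `split_off` keeps both halves backed by the same allocation, which is what makes it cheap. The shared-handle mechanic can be sketched with a plain `Arc` (a simplified stand-in, not the crate's actual representation, which holds a raw slice plus a stable-deref box):

```rust
use std::sync::Arc;

// Minimal sketch of the OwnedBytes idea: both halves keep a handle
// (here an Arc) to the same allocation, so the memory is released
// only when every piece has been dropped.
struct Bytes {
    data: Arc<Vec<u8>>,
    start: usize,
    end: usize,
}

impl Bytes {
    fn new(v: Vec<u8>) -> Self {
        let end = v.len();
        Bytes { data: Arc::new(v), start: 0, end }
    }

    // Mirrors the new `split_off`: `self` keeps the first `split_len`
    // bytes, the remainder is returned as a new handle.
    fn split_off(&mut self, split_len: usize) -> Bytes {
        let right = Bytes {
            data: Arc::clone(&self.data),
            start: self.start + split_len,
            end: self.end,
        };
        self.end = self.start + split_len;
        right
    }

    fn as_slice(&self) -> &[u8] {
        &self.data[self.start..self.end]
    }
}

fn main() {
    let mut left = Bytes::new(b"abcdef".to_vec());
    let right = left.split_off(2);
    println!(
        "{} {}",
        String::from_utf8_lossy(left.as_slice()),
        String::from_utf8_lossy(right.as_slice())
    );
}
```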


@@ -5,9 +5,9 @@ authors = ["Paul Masurel <paul.masurel@gmail.com>"]
 license = "MIT"
 categories = ["database-implementations", "data-structures"]
 description = """Search engine library"""
-documentation = "https://tantivy-search.github.io/tantivy/tantivy/index.html"
-homepage = "https://github.com/tantivy-search/tantivy"
-repository = "https://github.com/tantivy-search/tantivy"
+documentation = "https://quickwit-inc.github.io/tantivy/tantivy/index.html"
+homepage = "https://github.com/quickwit-inc/tantivy"
+repository = "https://github.com/quickwit-inc/tantivy"
 readme = "README.md"
 keywords = ["search", "information", "retrieval"]
 edition = "2018"


@@ -91,6 +91,7 @@ pub enum UserInputAst {
 }

 impl UserInputAst {
+    #[must_use]
     pub fn unary(self, occur: Occur) -> UserInputAst {
         UserInputAst::Clause(vec![(Some(occur), self)])
     }


@@ -20,10 +20,10 @@ use crate::SegmentReader;
 /// let index = Index::create_in_ram(schema);
 ///
 /// let mut index_writer = index.writer(3_000_000).unwrap();
-/// index_writer.add_document(doc!(title => "The Name of the Wind"));
-/// index_writer.add_document(doc!(title => "The Diary of Muadib"));
-/// index_writer.add_document(doc!(title => "A Dairy Cow"));
-/// index_writer.add_document(doc!(title => "The Diary of a Young Girl"));
+/// index_writer.add_document(doc!(title => "The Name of the Wind")).unwrap();
+/// index_writer.add_document(doc!(title => "The Diary of Muadib")).unwrap();
+/// index_writer.add_document(doc!(title => "A Dairy Cow")).unwrap();
+/// index_writer.add_document(doc!(title => "The Diary of a Young Girl")).unwrap();
 /// assert!(index_writer.commit().is_ok());
 ///
 /// let reader = index.reader().unwrap();


@@ -83,7 +83,7 @@ fn facet_depth(facet_bytes: &[u8]) -> usize {
 /// ```rust
 /// use tantivy::collector::FacetCollector;
 /// use tantivy::query::AllQuery;
-/// use tantivy::schema::{Facet, Schema, INDEXED, TEXT};
+/// use tantivy::schema::{Facet, Schema, FacetOptions, TEXT};
 /// use tantivy::{doc, Index};
 ///
 /// fn example() -> tantivy::Result<()> {
@@ -92,7 +92,7 @@ fn facet_depth(facet_bytes: &[u8]) -> usize {
 ///     // Facets have their own specific type.
 ///     // It is not bad practice to put all of your
 ///     // facet information in the same field.
-///     let facet = schema_builder.add_facet_field("facet", INDEXED);
+///     let facet = schema_builder.add_facet_field("facet", FacetOptions::default());
 ///     let title = schema_builder.add_text_field("title", TEXT);
 ///     let schema = schema_builder.build();
 ///     let index = Index::create_in_ram(schema);
@@ -103,23 +103,23 @@ fn facet_depth(facet_bytes: &[u8]) -> usize {
 ///         title => "The Name of the Wind",
 ///         facet => Facet::from("/lang/en"),
 ///         facet => Facet::from("/category/fiction/fantasy")
-///     ));
+///     ))?;
 ///     index_writer.add_document(doc!(
 ///         title => "Dune",
 ///         facet => Facet::from("/lang/en"),
 ///         facet => Facet::from("/category/fiction/sci-fi")
-///     ));
+///     ))?;
 ///     index_writer.add_document(doc!(
 ///         title => "La Vénus d'Ille",
 ///         facet => Facet::from("/lang/fr"),
 ///         facet => Facet::from("/category/fiction/fantasy"),
 ///         facet => Facet::from("/category/fiction/horror")
-///     ));
+///     ))?;
 ///     index_writer.add_document(doc!(
 ///         title => "The Diary of a Young Girl",
 ///         facet => Facet::from("/lang/en"),
 ///         facet => Facet::from("/category/biography")
-///     ));
+///     ))?;
 ///     index_writer.commit()?;
 /// }
 /// let reader = index.reader()?;
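What `FacetCollector` computes over paths like the ones in the doc example above is essentially a count per direct child of a root facet. A standalone sketch of that aggregation (the helper name is illustrative, not tantivy's API):

```rust
use std::collections::HashMap;

// Count how many facet paths fall under each direct child of `root`,
// e.g. root "/category" groups "/category/fiction/*" under
// "/category/fiction".
fn count_children(root: &str, facets: &[&str]) -> HashMap<String, u64> {
    let mut counts = HashMap::new();
    for facet in facets {
        if let Some(rest) = facet.strip_prefix(root) {
            if let Some(child) = rest.trim_start_matches('/').split('/').next() {
                if !child.is_empty() {
                    *counts.entry(format!("{}/{}", root, child)).or_insert(0u64) += 1;
                }
            }
        }
    }
    counts
}

fn main() {
    let facets = [
        "/category/fiction/fantasy",
        "/category/fiction/sci-fi",
        "/category/biography",
    ];
    let counts = count_children("/category", &facets);
    println!("{}", counts["/category/fiction"]);
}
```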
@@ -400,7 +400,7 @@ impl<'a> Iterator for FacetChildIterator<'a> {
 impl FacetCounts {
     /// Returns an iterator over all of the facet count pairs inside this result.
-    /// See the documentation for `FacetCollector` for a usage example.
+    /// See the documentation for [FacetCollector] for a usage example.
     pub fn get<T>(&self, facet_from: T) -> FacetChildIterator<'_>
     where
         Facet: From<T>,
@@ -421,7 +421,7 @@ impl FacetCounts {
     }

     /// Returns a vector of top `k` facets with their counts, sorted highest-to-lowest by counts.
-    /// See the documentation for `FacetCollector` for a usage example.
+    /// See the documentation for [FacetCollector] for a usage example.
     pub fn top_k<T>(&self, facet: T, k: usize) -> Vec<(&Facet, u64)>
     where
         Facet: From<T>,
@@ -462,7 +462,7 @@ mod tests {
     use crate::collector::Count;
     use crate::core::Index;
     use crate::query::{AllQuery, QueryParser, TermQuery};
-    use crate::schema::{Document, Facet, Field, IndexRecordOption, Schema, INDEXED};
+    use crate::schema::{Document, Facet, FacetOptions, Field, IndexRecordOption, Schema};
     use crate::Term;
     use rand::distributions::Uniform;
     use rand::prelude::SliceRandom;
@@ -470,13 +470,13 @@ mod tests {
     use std::iter;

     #[test]
-    fn test_facet_collector_drilldown() {
+    fn test_facet_collector_drilldown() -> crate::Result<()> {
         let mut schema_builder = Schema::builder();
-        let facet_field = schema_builder.add_facet_field("facet", INDEXED);
+        let facet_field = schema_builder.add_facet_field("facet", FacetOptions::default());
         let schema = schema_builder.build();
         let index = Index::create_in_ram(schema);
-        let mut index_writer = index.writer_for_tests().unwrap();
+        let mut index_writer = index.writer_for_tests()?;
         let num_facets: usize = 3 * 4 * 5;
         let facets: Vec<Facet> = (0..num_facets)
             .map(|mut n| {
@@ -491,14 +491,14 @@ mod tests {
         for i in 0..num_facets * 10 {
             let mut doc = Document::new();
             doc.add_facet(facet_field, facets[i % num_facets].clone());
-            index_writer.add_document(doc);
+            index_writer.add_document(doc)?;
         }
-        index_writer.commit().unwrap();
-        let reader = index.reader().unwrap();
+        index_writer.commit()?;
+        let reader = index.reader()?;
         let searcher = reader.searcher();
         let mut facet_collector = FacetCollector::for_field(facet_field);
         facet_collector.add_facet(Facet::from("/top1"));
-        let counts = searcher.search(&AllQuery, &facet_collector).unwrap();
+        let counts = searcher.search(&AllQuery, &facet_collector)?;
         {
             let facets: Vec<(String, u64)> = counts
@@ -518,6 +518,7 @@ mod tests {
                 .collect::<Vec<_>>()
             );
         }
+        Ok(())
     }

     #[test]
@@ -530,48 +531,49 @@ mod tests {
     }

     #[test]
-    fn test_doc_unsorted_multifacet() {
+    fn test_doc_unsorted_multifacet() -> crate::Result<()> {
         let mut schema_builder = Schema::builder();
-        let facet_field = schema_builder.add_facet_field("facets", INDEXED);
+        let facet_field = schema_builder.add_facet_field("facets", FacetOptions::default());
         let schema = schema_builder.build();
         let index = Index::create_in_ram(schema);
-        let mut index_writer = index.writer_for_tests().unwrap();
+        let mut index_writer = index.writer_for_tests()?;
         index_writer.add_document(doc!(
             facet_field => Facet::from_text(&"/subjects/A/a").unwrap(),
             facet_field => Facet::from_text(&"/subjects/B/a").unwrap(),
             facet_field => Facet::from_text(&"/subjects/A/b").unwrap(),
             facet_field => Facet::from_text(&"/subjects/B/b").unwrap(),
-        ));
-        index_writer.commit().unwrap();
-        let reader = index.reader().unwrap();
+        ))?;
+        index_writer.commit()?;
+        let reader = index.reader()?;
         let searcher = reader.searcher();
         assert_eq!(searcher.num_docs(), 1);
         let mut facet_collector = FacetCollector::for_field(facet_field);
         facet_collector.add_facet("/subjects");
-        let counts = searcher.search(&AllQuery, &facet_collector).unwrap();
+        let counts = searcher.search(&AllQuery, &facet_collector)?;
         let facets: Vec<(&Facet, u64)> = counts.get("/subjects").collect();
         assert_eq!(facets[0].1, 1);
+        Ok(())
     }

     #[test]
     fn test_doc_search_by_facet() -> crate::Result<()> {
         let mut schema_builder = Schema::builder();
-        let facet_field = schema_builder.add_facet_field("facet", INDEXED);
+        let facet_field = schema_builder.add_facet_field("facet", FacetOptions::default());
         let schema = schema_builder.build();
         let index = Index::create_in_ram(schema);
         let mut index_writer = index.writer_for_tests()?;
         index_writer.add_document(doc!(
             facet_field => Facet::from_text(&"/A/A").unwrap(),
-        ));
+        ))?;
         index_writer.add_document(doc!(
             facet_field => Facet::from_text(&"/A/B").unwrap(),
-        ));
+        ))?;
         index_writer.add_document(doc!(
             facet_field => Facet::from_text(&"/A/C/A").unwrap(),
-        ));
+        ))?;
         index_writer.add_document(doc!(
             facet_field => Facet::from_text(&"/D/C/A").unwrap(),
-        ));
+        ))?;
         index_writer.commit()?;
         let reader = index.reader()?;
         let searcher = reader.searcher();
@@ -613,7 +615,7 @@ mod tests {
     #[test]
     fn test_facet_collector_topk() {
         let mut schema_builder = Schema::builder();
-        let facet_field = schema_builder.add_facet_field("facet", INDEXED);
+        let facet_field = schema_builder.add_facet_field("facet", FacetOptions::default());
         let schema = schema_builder.build();
         let index = Index::create_in_ram(schema);
@@ -637,7 +639,7 @@ mod tests {
         let mut index_writer = index.writer_for_tests().unwrap();
         for doc in docs {
-            index_writer.add_document(doc);
+            index_writer.add_document(doc).unwrap();
         }
         index_writer.commit().unwrap();
         let searcher = index.reader().unwrap().searcher();
@@ -662,7 +664,7 @@ mod tests {
     #[test]
     fn test_facet_collector_topk_tie_break() -> crate::Result<()> {
         let mut schema_builder = Schema::builder();
-        let facet_field = schema_builder.add_facet_field("facet", INDEXED);
+        let facet_field = schema_builder.add_facet_field("facet", FacetOptions::default());
         let schema = schema_builder.build();
         let index = Index::create_in_ram(schema);
@@ -677,7 +679,7 @@ mod tests {
         let mut index_writer = index.writer_for_tests()?;
         for doc in docs {
-            index_writer.add_document(doc);
+            index_writer.add_document(doc)?;
         }
         index_writer.commit()?;
@@ -725,7 +727,7 @@ mod bench {
         let mut index_writer = index.writer_for_tests().unwrap();
         for doc in docs {
-            index_writer.add_document(doc);
+            index_writer.add_document(doc).unwrap();
         }
         index_writer.commit().unwrap();
         let reader = index.reader().unwrap();


@@ -16,7 +16,7 @@ use crate::fastfield::{DynamicFastFieldReader, FastFieldReader, FastValue};
 use crate::schema::Field;
 use crate::{Score, SegmentReader, TantivyError};

-/// The `FilterCollector` collector filters docs using a fast field value and a predicate.
+/// The `FilterCollector` filters docs using a fast field value and a predicate.
 /// Only the documents for which the predicate returned "true" will be passed on to the next collector.
 ///
 /// ```rust
@@ -25,34 +25,37 @@ use crate::{Score, SegmentReader, TantivyError};
 /// use tantivy::schema::{Schema, TEXT, INDEXED, FAST};
 /// use tantivy::{doc, DocAddress, Index};
 ///
+/// # fn main() -> tantivy::Result<()> {
 /// let mut schema_builder = Schema::builder();
 /// let title = schema_builder.add_text_field("title", TEXT);
 /// let price = schema_builder.add_u64_field("price", INDEXED | FAST);
 /// let schema = schema_builder.build();
 /// let index = Index::create_in_ram(schema);
 ///
-/// let mut index_writer = index.writer_with_num_threads(1, 10_000_000).unwrap();
-/// index_writer.add_document(doc!(title => "The Name of the Wind", price => 30_200u64));
-/// index_writer.add_document(doc!(title => "The Diary of Muadib", price => 29_240u64));
-/// index_writer.add_document(doc!(title => "A Dairy Cow", price => 21_240u64));
-/// index_writer.add_document(doc!(title => "The Diary of a Young Girl", price => 20_120u64));
-/// assert!(index_writer.commit().is_ok());
+/// let mut index_writer = index.writer_with_num_threads(1, 10_000_000)?;
+/// index_writer.add_document(doc!(title => "The Name of the Wind", price => 30_200u64))?;
+/// index_writer.add_document(doc!(title => "The Diary of Muadib", price => 29_240u64))?;
+/// index_writer.add_document(doc!(title => "A Dairy Cow", price => 21_240u64))?;
+/// index_writer.add_document(doc!(title => "The Diary of a Young Girl", price => 20_120u64))?;
+/// index_writer.commit()?;
 ///
-/// let reader = index.reader().unwrap();
+/// let reader = index.reader()?;
 /// let searcher = reader.searcher();
 ///
 /// let query_parser = QueryParser::for_index(&index, vec![title]);
-/// let query = query_parser.parse_query("diary").unwrap();
+/// let query = query_parser.parse_query("diary")?;
 /// let no_filter_collector = FilterCollector::new(price, &|value: u64| value > 20_120u64, TopDocs::with_limit(2));
-/// let top_docs = searcher.search(&query, &no_filter_collector).unwrap();
+/// let top_docs = searcher.search(&query, &no_filter_collector)?;
 ///
 /// assert_eq!(top_docs.len(), 1);
 /// assert_eq!(top_docs[0].1, DocAddress::new(0, 1));
 ///
 /// let filter_all_collector: FilterCollector<_, _, u64> = FilterCollector::new(price, &|value| value < 5u64, TopDocs::with_limit(2));
-/// let filtered_top_docs = searcher.search(&query, &filter_all_collector).unwrap();
+/// let filtered_top_docs = searcher.search(&query, &filter_all_collector)?;
 ///
 /// assert_eq!(filtered_top_docs.len(), 0);
+/// # Ok(())
+/// # }
 /// ```
 pub struct FilterCollector<TCollector, TPredicate, TPredicateValue: FastValue>
 where
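The behavior documented above, keeping only docs whose fast-field value satisfies a predicate before handing them to a downstream collector, reduces to a simple filter over per-document values. A standalone sketch (names are illustrative, not the crate's API):

```rust
// Keep only the doc ids whose fast-field value satisfies `predicate`,
// mirroring what FilterCollector does before delegating to the
// wrapped collector.
fn filter_docs(prices: &[u64], predicate: impl Fn(u64) -> bool) -> Vec<usize> {
    prices
        .iter()
        .enumerate()
        .filter(|&(_, &p)| predicate(p))
        .map(|(doc, _)| doc)
        .collect()
}

fn main() {
    // Same values and predicate as the doc example: price > 20_120.
    let prices = [30_200u64, 29_240, 21_240, 20_120];
    let kept = filter_docs(&prices, |p| p > 20_120);
    println!("{:?}", kept);
}
```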


@@ -226,10 +226,10 @@ mod tests {
     let schema = schema_builder.build();
     let index = Index::create_in_ram(schema);
     let mut writer = index.writer_with_num_threads(1, 4_000_000)?;
-    writer.add_document(doc!(val_field=>12i64));
-    writer.add_document(doc!(val_field=>-30i64));
-    writer.add_document(doc!(val_field=>-12i64));
-    writer.add_document(doc!(val_field=>-10i64));
+    writer.add_document(doc!(val_field=>12i64))?;
+    writer.add_document(doc!(val_field=>-30i64))?;
+    writer.add_document(doc!(val_field=>-12i64))?;
+    writer.add_document(doc!(val_field=>-10i64))?;
     writer.commit()?;
     let reader = index.reader()?;
     let searcher = reader.searcher();
@@ -247,13 +247,13 @@ mod tests {
     let schema = schema_builder.build();
     let index = Index::create_in_ram(schema);
     let mut writer = index.writer_with_num_threads(1, 4_000_000)?;
-    writer.add_document(doc!(val_field=>12i64));
+    writer.add_document(doc!(val_field=>12i64))?;
     writer.commit()?;
-    writer.add_document(doc!(val_field=>-30i64));
+    writer.add_document(doc!(val_field=>-30i64))?;
     writer.commit()?;
-    writer.add_document(doc!(val_field=>-12i64));
+    writer.add_document(doc!(val_field=>-12i64))?;
     writer.commit()?;
-    writer.add_document(doc!(val_field=>-10i64));
+    writer.add_document(doc!(val_field=>-10i64))?;
     writer.commit()?;
     let reader = index.reader()?;
     let searcher = reader.searcher();
@@ -271,9 +271,9 @@ mod tests {
     let schema = schema_builder.build();
     let index = Index::create_in_ram(schema);
     let mut writer = index.writer_with_num_threads(1, 4_000_000)?;
-    writer.add_document(doc!(date_field=>Utc.ymd(1982, 9, 17).and_hms(0, 0, 0)));
-    writer.add_document(doc!(date_field=>Utc.ymd(1986, 3, 9).and_hms(0, 0, 0)));
-    writer.add_document(doc!(date_field=>Utc.ymd(1983, 9, 27).and_hms(0, 0, 0)));
+    writer.add_document(doc!(date_field=>Utc.ymd(1982, 9, 17).and_hms(0, 0, 0)))?;
+    writer.add_document(doc!(date_field=>Utc.ymd(1986, 3, 9).and_hms(0, 0, 0)))?;
+    writer.add_document(doc!(date_field=>Utc.ymd(1983, 9, 27).and_hms(0, 0, 0)))?;
     writer.commit()?;
     let reader = index.reader()?;
     let searcher = reader.searcher();


@@ -48,10 +48,10 @@ use tantivy::collector::{Count, TopDocs};
 # let mut index_writer = index.writer(3_000_000)?;
 # index_writer.add_document(doc!(
 #     title => "The Name of the Wind",
-# ));
+# ))?;
 # index_writer.add_document(doc!(
 #     title => "The Diary of Muadib",
-# ));
+# ))?;
 # index_writer.commit()?;
 # let reader = index.reader()?;
 # let searcher = reader.searcher();
@@ -178,9 +178,9 @@ pub trait Collector: Sync + Send {
     ) -> crate::Result<<Self::Child as SegmentCollector>::Fruit> {
         let mut segment_collector = self.for_segment(segment_ord as u32, reader)?;
-        if let Some(delete_bitset) = reader.delete_bitset() {
+        if let Some(alive_bitset) = reader.alive_bitset() {
             weight.for_each(reader, &mut |doc, score| {
-                if delete_bitset.is_alive(doc) {
+                if alive_bitset.is_alive(doc) {
                     segment_collector.collect(doc, score);
                 }
             })?;
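The rename from `delete_bitset` to `alive_bitset` matches what the structure answers: whether a document is still alive. The collect loop above can be sketched in isolation (a simplified `Vec<bool>` stand-in, not tantivy's packed bitset):

```rust
// Simplified stand-in for the segment's alive bitset: one flag per
// doc id, true when the document has NOT been deleted.
struct AliveBitset {
    bits: Vec<bool>,
}

impl AliveBitset {
    fn is_alive(&self, doc: u32) -> bool {
        self.bits[doc as usize]
    }
}

fn main() {
    let alive = AliveBitset { bits: vec![true, false, true] };
    // Mirror the collect loop: only alive docs reach the collector.
    let collected: Vec<u32> = (0u32..3).filter(|&doc| alive.is_alive(doc)).collect();
    println!("{:?}", collected);
}
```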


@@ -112,19 +112,19 @@ impl<TFruit: Fruit> FruitHandle<TFruit> {
 /// use tantivy::schema::{Schema, TEXT};
 /// use tantivy::{doc, Index};
 ///
+/// # fn main() -> tantivy::Result<()> {
 /// let mut schema_builder = Schema::builder();
 /// let title = schema_builder.add_text_field("title", TEXT);
 /// let schema = schema_builder.build();
 /// let index = Index::create_in_ram(schema);
-///
-/// let mut index_writer = index.writer(3_000_000).unwrap();
-/// index_writer.add_document(doc!(title => "The Name of the Wind"));
-/// index_writer.add_document(doc!(title => "The Diary of Muadib"));
-/// index_writer.add_document(doc!(title => "A Dairy Cow"));
-/// index_writer.add_document(doc!(title => "The Diary of a Young Girl"));
-/// assert!(index_writer.commit().is_ok());
-///
-/// let reader = index.reader().unwrap();
+/// let mut index_writer = index.writer(3_000_000)?;
+/// index_writer.add_document(doc!(title => "The Name of the Wind"))?;
+/// index_writer.add_document(doc!(title => "The Diary of Muadib"))?;
+/// index_writer.add_document(doc!(title => "A Dairy Cow"))?;
+/// index_writer.add_document(doc!(title => "The Diary of a Young Girl"))?;
+/// index_writer.commit()?;
+///
+/// let reader = index.reader()?;
 /// let searcher = reader.searcher();
 ///
 /// let mut collectors = MultiCollector::new();
@@ -139,6 +139,8 @@ impl<TFruit: Fruit> FruitHandle<TFruit> {
 ///
 /// assert_eq!(count, 2);
 /// assert_eq!(top_docs.len(), 2);
+/// # Ok(())
+/// # }
 /// ```
 #[allow(clippy::type_complexity)]
 #[derive(Default)]
@@ -252,24 +254,24 @@ mod tests {
     use crate::Term;

     #[test]
-    fn test_multi_collector() {
+    fn test_multi_collector() -> crate::Result<()> {
         let mut schema_builder = Schema::builder();
         let text = schema_builder.add_text_field("text", TEXT);
         let schema = schema_builder.build();
         let index = Index::create_in_ram(schema);
         {
-            let mut index_writer = index.writer_for_tests().unwrap();
-            index_writer.add_document(doc!(text=>"abc"));
-            index_writer.add_document(doc!(text=>"abc abc abc"));
-            index_writer.add_document(doc!(text=>"abc abc"));
-            index_writer.commit().unwrap();
-            index_writer.add_document(doc!(text=>""));
-            index_writer.add_document(doc!(text=>"abc abc abc abc"));
-            index_writer.add_document(doc!(text=>"abc"));
-            index_writer.commit().unwrap();
+            let mut index_writer = index.writer_for_tests()?;
+            index_writer.add_document(doc!(text=>"abc"))?;
+            index_writer.add_document(doc!(text=>"abc abc abc"))?;
+            index_writer.add_document(doc!(text=>"abc abc"))?;
+            index_writer.commit()?;
+            index_writer.add_document(doc!(text=>""))?;
+            index_writer.add_document(doc!(text=>"abc abc abc abc"))?;
+            index_writer.add_document(doc!(text=>"abc"))?;
+            index_writer.commit()?;
         }
-        let searcher = index.reader().unwrap().searcher();
+        let searcher = index.reader()?.searcher();
         let term = Term::from_field_text(text, "abc");
         let query = TermQuery::new(term, IndexRecordOption::Basic);
@@ -280,5 +282,6 @@ mod tests {
         assert_eq!(count_handler.extract(&mut multifruits), 5);
         assert_eq!(topdocs_handler.extract(&mut multifruits).len(), 2);
+        Ok(())
     }
 }


@@ -25,7 +25,7 @@ pub const TEST_COLLECTOR_WITHOUT_SCORE: TestCollector = TestCollector {
 };

 #[test]
-pub fn test_filter_collector() {
+pub fn test_filter_collector() -> crate::Result<()> {
     let mut schema_builder = Schema::builder();
     let title = schema_builder.add_text_field("title", TEXT);
     let price = schema_builder.add_u64_field("price", FAST);
@@ -33,25 +33,25 @@ pub fn test_filter_collector() {
     let schema = schema_builder.build();
     let index = Index::create_in_ram(schema);

-    let mut index_writer = index.writer_with_num_threads(1, 10_000_000).unwrap();
-    index_writer.add_document(doc!(title => "The Name of the Wind", price => 30_200u64, date => DateTime::from_str("1898-04-09T00:00:00+00:00").unwrap()));
-    index_writer.add_document(doc!(title => "The Diary of Muadib", price => 29_240u64, date => DateTime::from_str("2020-04-09T00:00:00+00:00").unwrap()));
-    index_writer.add_document(doc!(title => "The Diary of Anne Frank", price => 18_240u64, date => DateTime::from_str("2019-04-20T00:00:00+00:00").unwrap()));
-    index_writer.add_document(doc!(title => "A Dairy Cow", price => 21_240u64, date => DateTime::from_str("2019-04-09T00:00:00+00:00").unwrap()));
-    index_writer.add_document(doc!(title => "The Diary of a Young Girl", price => 20_120u64, date => DateTime::from_str("2018-04-09T00:00:00+00:00").unwrap()));
-    assert!(index_writer.commit().is_ok());
+    let mut index_writer = index.writer_with_num_threads(1, 10_000_000)?;
+    index_writer.add_document(doc!(title => "The Name of the Wind", price => 30_200u64, date => DateTime::from_str("1898-04-09T00:00:00+00:00").unwrap()))?;
+    index_writer.add_document(doc!(title => "The Diary of Muadib", price => 29_240u64, date => DateTime::from_str("2020-04-09T00:00:00+00:00").unwrap()))?;
+    index_writer.add_document(doc!(title => "The Diary of Anne Frank", price => 18_240u64, date => DateTime::from_str("2019-04-20T00:00:00+00:00").unwrap()))?;
+    index_writer.add_document(doc!(title => "A Dairy Cow", price => 21_240u64, date => DateTime::from_str("2019-04-09T00:00:00+00:00").unwrap()))?;
+    index_writer.add_document(doc!(title => "The Diary of a Young Girl", price => 20_120u64, date => DateTime::from_str("2018-04-09T00:00:00+00:00").unwrap()))?;
+    index_writer.commit()?;

-    let reader = index.reader().unwrap();
+    let reader = index.reader()?;
     let searcher = reader.searcher();

     let query_parser = QueryParser::for_index(&index, vec![title]);
-    let query = query_parser.parse_query("diary").unwrap();
+    let query = query_parser.parse_query("diary")?;
     let filter_some_collector = FilterCollector::new(
         price,
         &|value: u64| value > 20_120u64,
         TopDocs::with_limit(2),
     );
-    let top_docs = searcher.search(&query, &filter_some_collector).unwrap();
+    let top_docs = searcher.search(&query, &filter_some_collector)?;

     assert_eq!(top_docs.len(), 1);
     assert_eq!(top_docs[0].1, DocAddress::new(0, 1));
@@ -67,9 +67,10 @@ pub fn test_filter_collector() {
     }

     let filter_dates_collector = FilterCollector::new(date, &date_filter, TopDocs::with_limit(5));
-    let filtered_date_docs = searcher.search(&query, &filter_dates_collector).unwrap();
+    let filtered_date_docs = searcher.search(&query, &filter_dates_collector)?;

     assert_eq!(filtered_date_docs.len(), 2);
+    Ok(())
 }

 /// Stores all of the doc ids.
@@ -274,8 +275,8 @@ fn make_test_searcher() -> crate::Result<crate::LeasedItem<Searcher>> {
     let schema = Schema::builder().build();
     let index = Index::create_in_ram(schema);
     let mut index_writer = index.writer_for_tests()?;
-    index_writer.add_document(Document::default());
-    index_writer.add_document(Document::default());
+    index_writer.add_document(Document::default())?;
+    index_writer.add_document(Document::default())?;
     index_writer.commit()?;
     Ok(index.reader()?.searcher())
 }


@@ -70,9 +70,7 @@ where
     /// # Panics
     /// The method panics if limit is 0
     pub fn with_limit(limit: usize) -> TopCollector<T> {
-        if limit < 1 {
-            panic!("Limit must be strictly greater than 0.");
-        }
+        assert!(limit >= 1, "Limit must be strictly greater than 0.");
         Self {
             limit,
             offset: 0,
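The hunk above collapses a manual `if`/`panic!` guard into a single `assert!` with a message, which is the idiomatic form clippy suggests. A minimal standalone sketch of the before/after behavior (the `with_limit_*` names here are illustrative stand-ins, not tantivy APIs):

```rust
// Before: an explicit guard that panics manually.
fn with_limit_old(limit: usize) -> usize {
    if limit < 1 {
        panic!("Limit must be strictly greater than 0.");
    }
    limit
}

// After: assert! carries the condition and the message together.
fn with_limit_new(limit: usize) -> usize {
    assert!(limit >= 1, "Limit must be strictly greater than 0.");
    limit
}

fn main() {
    assert_eq!(with_limit_old(3), 3);
    assert_eq!(with_limit_new(3), 3);
    // Both forms panic on 0; catch_unwind shows the behavior is unchanged.
    assert!(std::panic::catch_unwind(|| with_limit_new(0)).is_err());
    assert!(std::panic::catch_unwind(|| with_limit_old(0)).is_err());
}
```

The refactor changes only presentation: the panic condition, message, and call-site semantics are identical.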


@@ -94,27 +94,30 @@ where
 /// use tantivy::schema::{Schema, TEXT};
 /// use tantivy::{doc, DocAddress, Index};
 ///
+/// # fn main() -> tantivy::Result<()> {
 /// let mut schema_builder = Schema::builder();
 /// let title = schema_builder.add_text_field("title", TEXT);
 /// let schema = schema_builder.build();
 /// let index = Index::create_in_ram(schema);
 ///
-/// let mut index_writer = index.writer_with_num_threads(1, 10_000_000).unwrap();
-/// index_writer.add_document(doc!(title => "The Name of the Wind"));
-/// index_writer.add_document(doc!(title => "The Diary of Muadib"));
-/// index_writer.add_document(doc!(title => "A Dairy Cow"));
-/// index_writer.add_document(doc!(title => "The Diary of a Young Girl"));
-/// assert!(index_writer.commit().is_ok());
+/// let mut index_writer = index.writer_with_num_threads(1, 10_000_000)?;
+/// index_writer.add_document(doc!(title => "The Name of the Wind"))?;
+/// index_writer.add_document(doc!(title => "The Diary of Muadib"))?;
+/// index_writer.add_document(doc!(title => "A Dairy Cow"))?;
+/// index_writer.add_document(doc!(title => "The Diary of a Young Girl"))?;
+/// index_writer.commit()?;
 ///
-/// let reader = index.reader().unwrap();
+/// let reader = index.reader()?;
 /// let searcher = reader.searcher();
 ///
 /// let query_parser = QueryParser::for_index(&index, vec![title]);
-/// let query = query_parser.parse_query("diary").unwrap();
-/// let top_docs = searcher.search(&query, &TopDocs::with_limit(2)).unwrap();
+/// let query = query_parser.parse_query("diary")?;
+/// let top_docs = searcher.search(&query, &TopDocs::with_limit(2))?;
 ///
 /// assert_eq!(top_docs[0].1, DocAddress::new(0, 1));
 /// assert_eq!(top_docs[1].1, DocAddress::new(0, 3));
+/// # Ok(())
+/// # }
 /// ```
 pub struct TopDocs(TopCollector<Score>);
@@ -180,30 +183,34 @@ impl TopDocs {
 /// use tantivy::schema::{Schema, TEXT};
 /// use tantivy::{doc, DocAddress, Index};
 ///
+/// # fn main() -> tantivy::Result<()> {
 /// let mut schema_builder = Schema::builder();
 /// let title = schema_builder.add_text_field("title", TEXT);
 /// let schema = schema_builder.build();
 /// let index = Index::create_in_ram(schema);
 ///
-/// let mut index_writer = index.writer_with_num_threads(1, 10_000_000).unwrap();
-/// index_writer.add_document(doc!(title => "The Name of the Wind"));
-/// index_writer.add_document(doc!(title => "The Diary of Muadib"));
-/// index_writer.add_document(doc!(title => "A Dairy Cow"));
-/// index_writer.add_document(doc!(title => "The Diary of a Young Girl"));
-/// index_writer.add_document(doc!(title => "The Diary of Lena Mukhina"));
-/// assert!(index_writer.commit().is_ok());
+/// let mut index_writer = index.writer_with_num_threads(1, 10_000_000)?;
+/// index_writer.add_document(doc!(title => "The Name of the Wind"))?;
+/// index_writer.add_document(doc!(title => "The Diary of Muadib"))?;
+/// index_writer.add_document(doc!(title => "A Dairy Cow"))?;
+/// index_writer.add_document(doc!(title => "The Diary of a Young Girl"))?;
+/// index_writer.add_document(doc!(title => "The Diary of Lena Mukhina"))?;
+/// index_writer.commit()?;
 ///
-/// let reader = index.reader().unwrap();
+/// let reader = index.reader()?;
 /// let searcher = reader.searcher();
 ///
 /// let query_parser = QueryParser::for_index(&index, vec![title]);
-/// let query = query_parser.parse_query("diary").unwrap();
-/// let top_docs = searcher.search(&query, &TopDocs::with_limit(2).and_offset(1)).unwrap();
+/// let query = query_parser.parse_query("diary")?;
+/// let top_docs = searcher.search(&query, &TopDocs::with_limit(2).and_offset(1))?;
 ///
 /// assert_eq!(top_docs.len(), 2);
 /// assert_eq!(top_docs[0].1, DocAddress::new(0, 4));
 /// assert_eq!(top_docs[1].1, DocAddress::new(0, 3));
+/// # Ok(())
+/// # }
 /// ```
+#[must_use]
 pub fn and_offset(self, offset: usize) -> TopDocs {
     TopDocs(self.0.and_offset(offset))
 }
@@ -234,11 +241,11 @@ impl TopDocs {
 /// #
 /// # let index = Index::create_in_ram(schema);
 /// # let mut index_writer = index.writer_with_num_threads(1, 10_000_000)?;
-/// # index_writer.add_document(doc!(title => "The Name of the Wind", rating => 92u64));
-/// # index_writer.add_document(doc!(title => "The Diary of Muadib", rating => 97u64));
-/// # index_writer.add_document(doc!(title => "A Dairy Cow", rating => 63u64));
-/// # index_writer.add_document(doc!(title => "The Diary of a Young Girl", rating => 80u64));
-/// # assert!(index_writer.commit().is_ok());
+/// # index_writer.add_document(doc!(title => "The Name of the Wind", rating => 92u64))?;
+/// # index_writer.add_document(doc!(title => "The Diary of Muadib", rating => 97u64))?;
+/// # index_writer.add_document(doc!(title => "A Dairy Cow", rating => 63u64))?;
+/// # index_writer.add_document(doc!(title => "The Diary of a Young Girl", rating => 80u64))?;
+/// # index_writer.commit()?;
 /// # let reader = index.reader()?;
 /// # let query = QueryParser::for_index(&index, vec![title]).parse_query("diary")?;
 /// # let top_docs = docs_sorted_by_rating(&reader.searcher(), &query, rating)?;
@@ -316,9 +323,9 @@ impl TopDocs {
 /// #
 /// # let index = Index::create_in_ram(schema);
 /// # let mut index_writer = index.writer_with_num_threads(1, 10_000_000)?;
-/// # index_writer.add_document(doc!(title => "MadCow Inc.", rating => 92_000_000i64));
-/// # index_writer.add_document(doc!(title => "Zozo Cow KKK", rating => 119_000_000i64));
-/// # index_writer.add_document(doc!(title => "Declining Cow", rating => -63_000_000i64));
+/// # index_writer.add_document(doc!(title => "MadCow Inc.", rating => 92_000_000i64))?;
+/// # index_writer.add_document(doc!(title => "Zozo Cow KKK", rating => 119_000_000i64))?;
+/// # index_writer.add_document(doc!(title => "Declining Cow", rating => -63_000_000i64))?;
 /// # assert!(index_writer.commit().is_ok());
 /// # let reader = index.reader()?;
 /// # let top_docs = docs_sorted_by_revenue(&reader.searcher(), &AllQuery, rating)?;
@@ -417,9 +424,9 @@ impl TopDocs {
 /// let mut index_writer = index.writer_with_num_threads(1, 10_000_000)?;
 /// let product_name = index.schema().get_field("product_name").unwrap();
 /// let popularity: Field = index.schema().get_field("popularity").unwrap();
-/// index_writer.add_document(doc!(product_name => "The Diary of Muadib", popularity => 1u64));
-/// index_writer.add_document(doc!(product_name => "A Dairy Cow", popularity => 10u64));
-/// index_writer.add_document(doc!(product_name => "The Diary of a Young Girl", popularity => 15u64));
+/// index_writer.add_document(doc!(product_name => "The Diary of Muadib", popularity => 1u64))?;
+/// index_writer.add_document(doc!(product_name => "A Dairy Cow", popularity => 10u64))?;
+/// index_writer.add_document(doc!(product_name => "The Diary of a Young Girl", popularity => 15u64))?;
 /// index_writer.commit()?;
 /// Ok(index)
 /// }
@@ -527,9 +534,9 @@ impl TopDocs {
 /// #
 /// let popularity: Field = index.schema().get_field("popularity").unwrap();
 /// let boosted: Field = index.schema().get_field("boosted").unwrap();
-/// # index_writer.add_document(doc!(boosted=>1u64, product_name => "The Diary of Muadib", popularity => 1u64));
-/// # index_writer.add_document(doc!(boosted=>0u64, product_name => "A Dairy Cow", popularity => 10u64));
-/// # index_writer.add_document(doc!(boosted=>0u64, product_name => "The Diary of a Young Girl", popularity => 15u64));
+/// # index_writer.add_document(doc!(boosted=>1u64, product_name => "The Diary of Muadib", popularity => 1u64))?;
+/// # index_writer.add_document(doc!(boosted=>0u64, product_name => "A Dairy Cow", popularity => 10u64))?;
+/// # index_writer.add_document(doc!(boosted=>0u64, product_name => "The Diary of a Young Girl", popularity => 15u64))?;
 /// # index_writer.commit()?;
 /// // ...
 /// # let user_query = "diary";
@@ -629,10 +636,10 @@ impl Collector for TopDocs {
         let heap_len = self.0.limit + self.0.offset;
         let mut heap: BinaryHeap<ComparableDoc<Score, DocId>> = BinaryHeap::with_capacity(heap_len);

-        if let Some(delete_bitset) = reader.delete_bitset() {
+        if let Some(alive_bitset) = reader.alive_bitset() {
             let mut threshold = Score::MIN;
             weight.for_each_pruning(threshold, reader, &mut |doc, score| {
-                if delete_bitset.is_deleted(doc) {
+                if alive_bitset.is_deleted(doc) {
                     return threshold;
                 }
                 let heap_item = ComparableDoc {
@@ -713,20 +720,18 @@ mod tests {
     use crate::Score;
     use crate::{DocAddress, DocId, SegmentReader};

-    fn make_index() -> Index {
+    fn make_index() -> crate::Result<Index> {
         let mut schema_builder = Schema::builder();
         let text_field = schema_builder.add_text_field("text", TEXT);
         let schema = schema_builder.build();
         let index = Index::create_in_ram(schema);
-        {
-            // writing the segment
-            let mut index_writer = index.writer_with_num_threads(1, 10_000_000).unwrap();
-            index_writer.add_document(doc!(text_field=>"Hello happy tax payer."));
-            index_writer.add_document(doc!(text_field=>"Droopy says hello happy tax payer"));
-            index_writer.add_document(doc!(text_field=>"I like Droopy"));
-            assert!(index_writer.commit().is_ok());
-        }
-        index
+        // writing the segment
+        let mut index_writer = index.writer_with_num_threads(1, 10_000_000)?;
+        index_writer.add_document(doc!(text_field=>"Hello happy tax payer."))?;
+        index_writer.add_document(doc!(text_field=>"Droopy says hello happy tax payer"))?;
+        index_writer.add_document(doc!(text_field=>"I like Droopy"))?;
+        index_writer.commit()?;
+        Ok(index)
     }

     fn assert_results_equals(results: &[(Score, DocAddress)], expected: &[(Score, DocAddress)]) {
@@ -737,17 +742,15 @@ mod tests {
     }

     #[test]
-    fn test_top_collector_not_at_capacity_without_offset() {
-        let index = make_index();
+    fn test_top_collector_not_at_capacity_without_offset() -> crate::Result<()> {
+        let index = make_index()?;
         let field = index.schema().get_field("text").unwrap();
         let query_parser = QueryParser::for_index(&index, vec![field]);
-        let text_query = query_parser.parse_query("droopy tax").unwrap();
+        let text_query = query_parser.parse_query("droopy tax")?;
         let score_docs: Vec<(Score, DocAddress)> = index
-            .reader()
-            .unwrap()
+            .reader()?
             .searcher()
-            .search(&text_query, &TopDocs::with_limit(4))
-            .unwrap();
+            .search(&text_query, &TopDocs::with_limit(4))?;
         assert_results_equals(
             &score_docs,
             &[
@@ -756,11 +759,12 @@ mod tests {
                 (0.48527452, DocAddress::new(0, 0)),
             ],
         );
+        Ok(())
     }
     #[test]
     fn test_top_collector_not_at_capacity_with_offset() {
-        let index = make_index();
+        let index = make_index().unwrap();
         let field = index.schema().get_field("text").unwrap();
         let query_parser = QueryParser::for_index(&index, vec![field]);
         let text_query = query_parser.parse_query("droopy tax").unwrap();
@@ -775,7 +779,7 @@ mod tests {
     #[test]
     fn test_top_collector_at_capacity() {
-        let index = make_index();
+        let index = make_index().unwrap();
         let field = index.schema().get_field("text").unwrap();
         let query_parser = QueryParser::for_index(&index, vec![field]);
         let text_query = query_parser.parse_query("droopy tax").unwrap();
@@ -796,7 +800,7 @@ mod tests {
     #[test]
     fn test_top_collector_at_capacity_with_offset() {
-        let index = make_index();
+        let index = make_index().unwrap();
         let field = index.schema().get_field("text").unwrap();
         let query_parser = QueryParser::for_index(&index, vec![field]);
         let text_query = query_parser.parse_query("droopy tax").unwrap();
@@ -817,7 +821,7 @@ mod tests {
     #[test]
     fn test_top_collector_stable_sorting() {
-        let index = make_index();
+        let index = make_index().unwrap();
         // using AllQuery to get a constant score
         let searcher = index.reader().unwrap().searcher();
@@ -848,29 +852,35 @@ mod tests {
     const SIZE: &str = "size";

     #[test]
-    fn test_top_field_collector_not_at_capacity() {
+    fn test_top_field_collector_not_at_capacity() -> crate::Result<()> {
         let mut schema_builder = Schema::builder();
         let title = schema_builder.add_text_field(TITLE, TEXT);
         let size = schema_builder.add_u64_field(SIZE, FAST);
         let schema = schema_builder.build();
         let (index, query) = index("beer", title, schema, |index_writer| {
-            index_writer.add_document(doc!(
-                title => "bottle of beer",
-                size => 12u64,
-            ));
-            index_writer.add_document(doc!(
-                title => "growler of beer",
-                size => 64u64,
-            ));
-            index_writer.add_document(doc!(
-                title => "pint of beer",
-                size => 16u64,
-            ));
+            index_writer
+                .add_document(doc!(
+                    title => "bottle of beer",
+                    size => 12u64,
+                ))
+                .unwrap();
+            index_writer
+                .add_document(doc!(
+                    title => "growler of beer",
+                    size => 64u64,
+                ))
+                .unwrap();
+            index_writer
+                .add_document(doc!(
+                    title => "pint of beer",
+                    size => 16u64,
+                ))
+                .unwrap();
         });
-        let searcher = index.reader().unwrap().searcher();
+        let searcher = index.reader()?.searcher();
         let top_collector = TopDocs::with_limit(4).order_by_u64_field(size);
-        let top_docs: Vec<(u64, DocAddress)> = searcher.search(&query, &top_collector).unwrap();
+        let top_docs: Vec<(u64, DocAddress)> = searcher.search(&query, &top_collector)?;
         assert_eq!(
             &top_docs[..],
             &[
@@ -879,6 +889,7 @@ mod tests {
                 (12, DocAddress::new(0, 0))
             ]
         );
+        Ok(())
     }

     #[test]
@@ -894,12 +905,12 @@ mod tests {
         index_writer.add_document(doc!(
             name => "Paul Robeson",
             birthday => pr_birthday
-        ));
+        ))?;
         let mr_birthday = crate::DateTime::from_str("1947-11-08T00:00:00+00:00")?;
         index_writer.add_document(doc!(
             name => "Minnie Riperton",
             birthday => mr_birthday
-        ));
+        ))?;
         index_writer.commit()?;
         let searcher = index.reader()?.searcher();
         let top_collector = TopDocs::with_limit(3).order_by_fast_field(birthday);
@@ -926,11 +937,11 @@ mod tests {
         index_writer.add_document(doc!(
             city => "georgetown",
             altitude => -1i64,
-        ));
+        ))?;
         index_writer.add_document(doc!(
             city => "tokyo",
             altitude => 40i64,
-        ));
+        ))?;
         index_writer.commit()?;
         let searcher = index.reader()?.searcher();
         let top_collector = TopDocs::with_limit(3).order_by_fast_field(altitude);
@@ -956,11 +967,11 @@ mod tests {
         index_writer.add_document(doc!(
             city => "georgetown",
             altitude => -1.0f64,
-        ));
+        ))?;
         index_writer.add_document(doc!(
             city => "tokyo",
             altitude => 40f64,
-        ));
+        ))?;
         index_writer.commit()?;
         let searcher = index.reader()?.searcher();
         let top_collector = TopDocs::with_limit(3).order_by_fast_field(altitude);
@@ -983,10 +994,12 @@ mod tests {
         let size = schema_builder.add_u64_field(SIZE, FAST);
         let schema = schema_builder.build();
         let (index, _) = index("beer", title, schema, |index_writer| {
-            index_writer.add_document(doc!(
-                title => "bottle of beer",
-                size => 12u64,
-            ));
+            index_writer
+                .add_document(doc!(
+                    title => "bottle of beer",
+                    size => 12u64,
+                ))
+                .unwrap();
         });
         let searcher = index.reader().unwrap().searcher();
         let top_collector = TopDocs::with_limit(4).order_by_u64_field(Field::from_field_id(2));
@@ -1003,7 +1016,7 @@ mod tests {
         let schema = schema_builder.build();
         let index = Index::create_in_ram(schema);
         let mut index_writer = index.writer_for_tests()?;
-        index_writer.add_document(doc!(size=>1u64));
+        index_writer.add_document(doc!(size=>1u64))?;
         index_writer.commit()?;
         let searcher = index.reader()?.searcher();
         let segment = searcher.segment_reader(0);
@@ -1020,7 +1033,7 @@ mod tests {
         let schema = schema_builder.build();
         let index = Index::create_in_ram(schema);
         let mut index_writer = index.writer_for_tests()?;
-        index_writer.add_document(doc!(size=>1u64));
+        index_writer.add_document(doc!(size=>1u64))?;
         index_writer.commit()?;
         let searcher = index.reader()?.searcher();
         let segment = searcher.segment_reader(0);
@@ -1033,30 +1046,26 @@ mod tests {
     }

     #[test]
-    fn test_tweak_score_top_collector_with_offset() {
-        let index = make_index();
+    fn test_tweak_score_top_collector_with_offset() -> crate::Result<()> {
+        let index = make_index()?;
         let field = index.schema().get_field("text").unwrap();
         let query_parser = QueryParser::for_index(&index, vec![field]);
-        let text_query = query_parser.parse_query("droopy tax").unwrap();
+        let text_query = query_parser.parse_query("droopy tax")?;
         let collector = TopDocs::with_limit(2).and_offset(1).tweak_score(
             move |_segment_reader: &SegmentReader| move |doc: DocId, _original_score: Score| doc,
         );
-        let score_docs: Vec<(u32, DocAddress)> = index
-            .reader()
-            .unwrap()
-            .searcher()
-            .search(&text_query, &collector)
-            .unwrap();
+        let score_docs: Vec<(u32, DocAddress)> =
+            index.reader()?.searcher().search(&text_query, &collector)?;
         assert_eq!(
             score_docs,
             vec![(1, DocAddress::new(0, 1)), (0, DocAddress::new(0, 0)),]
         );
+        Ok(())
     }

     #[test]
     fn test_custom_score_top_collector_with_offset() {
-        let index = make_index();
+        let index = make_index().unwrap();
         let field = index.schema().get_field("text").unwrap();
         let query_parser = QueryParser::for_index(&index, vec![field]);
         let text_query = query_parser.parse_query("droopy tax").unwrap();
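The recurring change across these hunks is mechanical: `add_document` and `commit` now return a `Result`, so tests and doc examples switch from `.unwrap()` to `?` and return `crate::Result<()>`. A minimal standalone sketch of that error-propagation pattern, using plain `std` Rust with a hypothetical `add_document` stand-in (no tantivy types assumed):

```rust
use std::num::ParseIntError;

// Hypothetical stand-in for a fallible writer operation.
fn add_document(raw: &str) -> Result<u64, ParseIntError> {
    raw.parse::<u64>() // fails on bad input, much like a schema mismatch would
}

// Returning Result lets the body use `?` instead of `.unwrap()`, so a
// failure surfaces as an error value with context rather than a panic.
fn test_like_body() -> Result<(), ParseIntError> {
    let a = add_document("12")?;
    let b = add_document("64")?;
    assert_eq!(a + b, 76);
    Ok(())
}

fn main() {
    assert!(test_like_body().is_ok());
    // A bad document propagates an Err instead of panicking:
    assert!(add_document("not-a-doc").is_err());
}
```

Rust's test harness treats a test returning `Err` as a failure, which is why the diff can drop the `assert!(...is_ok())` wrappers without losing coverage.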


@@ -1,396 +0,0 @@
use std::fmt;
use std::u64;
#[derive(Clone, Copy, Eq, PartialEq)]
pub(crate) struct TinySet(u64);
impl fmt::Debug for TinySet {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
self.into_iter().collect::<Vec<u32>>().fmt(f)
}
}
pub struct TinySetIterator(TinySet);
impl Iterator for TinySetIterator {
type Item = u32;
fn next(&mut self) -> Option<Self::Item> {
self.0.pop_lowest()
}
}
impl IntoIterator for TinySet {
type Item = u32;
type IntoIter = TinySetIterator;
fn into_iter(self) -> Self::IntoIter {
TinySetIterator(self)
}
}
impl TinySet {
/// Returns an empty `TinySet`.
pub fn empty() -> TinySet {
TinySet(0u64)
}
pub fn clear(&mut self) {
self.0 = 0u64;
}
/// Returns the complement of the set in `[0, 64[`.
fn complement(self) -> TinySet {
TinySet(!self.0)
}
/// Returns true iff the `TinySet` contains the element `el`.
pub fn contains(self, el: u32) -> bool {
!self.intersect(TinySet::singleton(el)).is_empty()
}
/// Returns the number of elements in the TinySet.
pub fn len(self) -> u32 {
self.0.count_ones()
}
/// Returns the intersection of `self` and `other`
pub fn intersect(self, other: TinySet) -> TinySet {
TinySet(self.0 & other.0)
}
/// Creates a new `TinySet` containing only one element
/// within `[0; 64[`
#[inline]
pub fn singleton(el: u32) -> TinySet {
TinySet(1u64 << u64::from(el))
}
/// Insert a new element within [0..64[
#[inline]
pub fn insert(self, el: u32) -> TinySet {
self.union(TinySet::singleton(el))
}
/// Insert a new element within [0..64[
#[inline]
pub fn insert_mut(&mut self, el: u32) -> bool {
let old = *self;
*self = old.insert(el);
old != *self
}
/// Returns the union of two tinysets
#[inline]
pub fn union(self, other: TinySet) -> TinySet {
TinySet(self.0 | other.0)
}
/// Returns true iff the `TinySet` is empty.
#[inline]
pub fn is_empty(self) -> bool {
self.0 == 0u64
}
/// Returns the lowest element in the `TinySet`
/// and removes it.
#[inline]
pub fn pop_lowest(&mut self) -> Option<u32> {
if self.is_empty() {
None
} else {
let lowest = self.0.trailing_zeros() as u32;
self.0 ^= TinySet::singleton(lowest).0;
Some(lowest)
}
}
/// Returns a `TinySet` than contains all values up
/// to limit excluded.
///
/// The limit is assumed to be strictly lower than 64.
pub fn range_lower(upper_bound: u32) -> TinySet {
TinySet((1u64 << u64::from(upper_bound % 64u32)) - 1u64)
}
/// Returns a `TinySet` that contains all values greater
/// than or equal to the given limit (and up to 63).
///
/// The limit is assumed to be strictly lower than 64.
pub fn range_greater_or_equal(from_included: u32) -> TinySet {
TinySet::range_lower(from_included).complement()
}
}
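`pop_lowest` is the workhorse of `TinySet` iteration: `trailing_zeros` locates the lowest set bit, and an XOR with the corresponding singleton mask clears it. A minimal standalone sketch of the same bit-twiddling over a plain `u64` (independent of tantivy's types):

```rust
// A u64 word viewed as a set of integers in [0, 64).
fn singleton(el: u32) -> u64 {
    1u64 << el
}

// Pops the lowest element: trailing_zeros finds its index,
// XOR with the singleton mask clears that bit.
fn pop_lowest(word: &mut u64) -> Option<u32> {
    if *word == 0 {
        None
    } else {
        let lowest = word.trailing_zeros();
        *word ^= singleton(lowest);
        Some(lowest)
    }
}

fn main() {
    let mut word = singleton(3) | singleton(17) | singleton(63);
    assert_eq!(pop_lowest(&mut word), Some(3));
    assert_eq!(pop_lowest(&mut word), Some(17));
    assert_eq!(pop_lowest(&mut word), Some(63));
    assert_eq!(pop_lowest(&mut word), None);
}
```

Repeatedly popping in a loop yields the elements in ascending order, which is exactly how `TinySetIterator` drives iteration.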
#[derive(Clone)]
pub struct BitSet {
tinysets: Box<[TinySet]>,
len: usize,
max_value: u32,
}
fn num_buckets(max_val: u32) -> u32 {
(max_val + 63u32) / 64u32
}
impl BitSet {
/// Create a new `BitSet` that may contain elements
/// within `[0, max_val[`.
pub fn with_max_value(max_value: u32) -> BitSet {
let num_buckets = num_buckets(max_value);
let tinysets = vec![TinySet::empty(); num_buckets as usize].into_boxed_slice();
BitSet {
tinysets,
len: 0,
max_value,
}
}
/// Removes all elements from the `BitSet`.
pub fn clear(&mut self) {
pub fn clear(&mut self) {
for tinyset in self.tinysets.iter_mut() {
*tinyset = TinySet::empty();
}
self.len = 0;
}
/// Returns the number of elements in the `BitSet`.
pub fn len(&self) -> usize {
self.len
}
/// Inserts an element in the `BitSet`
pub fn insert(&mut self, el: u32) {
// we do not check that `el` is lower than `max_value`.
let higher = el / 64u32;
let lower = el % 64u32;
self.len += if self.tinysets[higher as usize].insert_mut(lower) {
1
} else {
0
};
}
/// Returns true iff the element is in the `BitSet`.
pub fn contains(&self, el: u32) -> bool {
self.tinyset(el / 64u32).contains(el % 64)
}
/// Returns the lowest bucket index, greater than or equal to `bucket`,
/// whose associated `TinySet` is non-empty.
///
/// Reminder: the tiny set with the bucket `bucket`, represents the
/// elements from `bucket * 64` to `(bucket+1) * 64`.
pub(crate) fn first_non_empty_bucket(&self, bucket: u32) -> Option<u32> {
self.tinysets[bucket as usize..]
.iter()
.cloned()
.position(|tinyset| !tinyset.is_empty())
.map(|delta_bucket| bucket + delta_bucket as u32)
}
pub fn max_value(&self) -> u32 {
self.max_value
}
/// Returns the tiny bitset representing the
/// set restricted to the number range from
/// `bucket * 64` to `(bucket + 1) * 64`.
pub(crate) fn tinyset(&self, bucket: u32) -> TinySet {
self.tinysets[bucket as usize]
}
}
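`BitSet` shards its elements over 64-bit buckets: `el / 64` selects the word and `el % 64` the bit inside it, with `num_buckets` rounding the capacity up to a whole number of words. A standalone sketch of that bucket arithmetic over a plain `Vec<u64>` (illustrative names, not tantivy's API):

```rust
// Minimal bitset over Vec<u64>, mirroring BitSet's bucket arithmetic.
struct Bits {
    words: Vec<u64>,
}

impl Bits {
    fn with_max_value(max_value: u32) -> Bits {
        // Same rounding as num_buckets: ceil(max_value / 64).
        let num_words = ((max_value + 63) / 64) as usize;
        Bits { words: vec![0u64; num_words] }
    }

    // el / 64 selects the word, el % 64 the bit within it.
    fn insert(&mut self, el: u32) {
        self.words[(el / 64) as usize] |= 1u64 << (el % 64);
    }

    fn contains(&self, el: u32) -> bool {
        self.words[(el / 64) as usize] & (1u64 << (el % 64)) != 0
    }
}

fn main() {
    let mut bits = Bits::with_max_value(1_000);
    bits.insert(3);
    bits.insert(999);
    assert!(bits.contains(3) && bits.contains(999));
    assert!(!bits.contains(4));
}
```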
#[cfg(test)]
mod tests {
use super::BitSet;
use super::TinySet;
use crate::docset::{DocSet, TERMINATED};
use crate::query::BitSetDocSet;
use crate::tests;
use crate::tests::generate_nonunique_unsorted;
use std::collections::BTreeSet;
use std::collections::HashSet;
#[test]
fn test_tiny_set() {
assert!(TinySet::empty().is_empty());
{
let mut u = TinySet::empty().insert(1u32);
assert_eq!(u.pop_lowest(), Some(1u32));
assert!(u.pop_lowest().is_none())
}
{
let mut u = TinySet::empty().insert(1u32).insert(1u32);
assert_eq!(u.pop_lowest(), Some(1u32));
assert!(u.pop_lowest().is_none())
}
{
let mut u = TinySet::empty().insert(2u32);
assert_eq!(u.pop_lowest(), Some(2u32));
u.insert_mut(1u32);
assert_eq!(u.pop_lowest(), Some(1u32));
assert!(u.pop_lowest().is_none());
}
{
let mut u = TinySet::empty().insert(63u32);
assert_eq!(u.pop_lowest(), Some(63u32));
assert!(u.pop_lowest().is_none());
}
}
#[test]
fn test_bitset() {
let test_against_hashset = |els: &[u32], max_value: u32| {
let mut hashset: HashSet<u32> = HashSet::new();
let mut bitset = BitSet::with_max_value(max_value);
for &el in els {
assert!(el < max_value);
hashset.insert(el);
bitset.insert(el);
}
for el in 0..max_value {
assert_eq!(hashset.contains(&el), bitset.contains(el));
}
assert_eq!(bitset.max_value(), max_value);
};
test_against_hashset(&[], 0);
test_against_hashset(&[], 1);
test_against_hashset(&[0u32], 1);
test_against_hashset(&[0u32], 100);
test_against_hashset(&[1u32, 2u32], 4);
test_against_hashset(&[99u32], 100);
test_against_hashset(&[63u32], 64);
test_against_hashset(&[62u32, 63u32], 64);
}
#[test]
fn test_bitset_large() {
let arr = generate_nonunique_unsorted(100_000, 5_000);
let mut btreeset: BTreeSet<u32> = BTreeSet::new();
let mut bitset = BitSet::with_max_value(100_000);
for el in arr {
btreeset.insert(el);
bitset.insert(el);
}
for i in 0..100_000 {
assert_eq!(btreeset.contains(&i), bitset.contains(i));
}
assert_eq!(btreeset.len(), bitset.len());
let mut bitset_docset = BitSetDocSet::from(bitset);
let mut remaining = true;
for el in btreeset.into_iter() {
assert!(remaining);
assert_eq!(bitset_docset.doc(), el);
remaining = bitset_docset.advance() != TERMINATED;
}
assert!(!remaining);
}
#[test]
fn test_bitset_num_buckets() {
use super::num_buckets;
assert_eq!(num_buckets(0u32), 0);
assert_eq!(num_buckets(1u32), 1);
assert_eq!(num_buckets(64u32), 1);
assert_eq!(num_buckets(65u32), 2);
assert_eq!(num_buckets(128u32), 2);
assert_eq!(num_buckets(129u32), 3);
}
#[test]
fn test_tinyset_range() {
assert_eq!(
TinySet::range_lower(3).into_iter().collect::<Vec<u32>>(),
[0, 1, 2]
);
assert!(TinySet::range_lower(0).is_empty());
assert_eq!(
TinySet::range_lower(63).into_iter().collect::<Vec<u32>>(),
(0u32..63u32).collect::<Vec<_>>()
);
assert_eq!(
TinySet::range_lower(1).into_iter().collect::<Vec<u32>>(),
[0]
);
assert_eq!(
TinySet::range_lower(2).into_iter().collect::<Vec<u32>>(),
[0, 1]
);
assert_eq!(
TinySet::range_greater_or_equal(3)
.into_iter()
.collect::<Vec<u32>>(),
(3u32..64u32).collect::<Vec<_>>()
);
}
#[test]
fn test_bitset_len() {
let mut bitset = BitSet::with_max_value(1_000);
assert_eq!(bitset.len(), 0);
bitset.insert(3u32);
assert_eq!(bitset.len(), 1);
bitset.insert(103u32);
assert_eq!(bitset.len(), 2);
bitset.insert(3u32);
assert_eq!(bitset.len(), 2);
bitset.insert(103u32);
assert_eq!(bitset.len(), 2);
bitset.insert(104u32);
assert_eq!(bitset.len(), 3);
}
#[test]
fn test_bitset_clear() {
let mut bitset = BitSet::with_max_value(1_000);
let els = tests::sample(1_000, 0.01f64);
for &el in &els {
bitset.insert(el);
}
assert!(els.iter().all(|el| bitset.contains(*el)));
bitset.clear();
for el in 0u32..1000u32 {
assert!(!bitset.contains(el));
}
}
}
#[cfg(all(test, feature = "unstable"))]
mod bench {
use super::BitSet;
use super::TinySet;
use test;
#[bench]
fn bench_tinyset_pop(b: &mut test::Bencher) {
b.iter(|| {
let mut tinyset = TinySet::singleton(test::black_box(31u32));
tinyset.pop_lowest();
tinyset.pop_lowest();
tinyset.pop_lowest();
tinyset.pop_lowest();
tinyset.pop_lowest();
tinyset.pop_lowest();
});
}
#[bench]
fn bench_tinyset_sum(b: &mut test::Bencher) {
let tiny_set = TinySet::empty().insert(10u32).insert(14u32).insert(21u32);
b.iter(|| {
assert_eq!(test::black_box(tiny_set).into_iter().sum::<u32>(), 45u32);
});
}
#[bench]
fn bench_tinyarr_sum(b: &mut test::Bencher) {
let v = [10u32, 14u32, 21u32];
b.iter(|| test::black_box(v).iter().cloned().sum::<u32>());
}
#[bench]
fn bench_bitset_initialize(b: &mut test::Bencher) {
b.iter(|| BitSet::with_max_value(1_000_000));
}
}


@@ -1,203 +0,0 @@
mod bitset;
mod composite_file;
pub use self::bitset::BitSet;
pub(crate) use self::bitset::TinySet;
pub(crate) use self::composite_file::{CompositeFile, CompositeWrite};
pub use byteorder::LittleEndian as Endianness;
pub use common::CountingWriter;
pub use common::{
read_u32_vint, read_u32_vint_no_advance, serialize_vint_u32, write_u32_vint, VInt,
};
pub use common::{BinarySerializable, DeserializeFrom, FixedSize};
/// Segment's max doc must be `< MAX_DOC_LIMIT`.
///
/// We do not allow segments with more than `2^31` documents.
pub const MAX_DOC_LIMIT: u32 = 1 << 31;
/// Has length trait
pub trait HasLen {
/// Return length
fn len(&self) -> usize;
/// Returns true iff empty.
fn is_empty(&self) -> bool {
self.len() == 0
}
}
const HIGHEST_BIT: u64 = 1 << 63;
/// Maps a `i64` to `u64`
///
/// For simplicity, tantivy internally handles `i64` as `u64`.
/// The mapping is defined by this function.
///
/// Maps `i64` to `u64` so that
/// `-2^63 .. 2^63-1` is mapped
/// to
/// `0 .. 2^64-1`
/// in that order.
///
/// This is more suited than simply casting (`val as u64`)
/// because of bitpacking.
///
/// Imagine a list of `i64` ranging from -10 to 10.
/// When casting negative values, the negative values are projected
/// to values over 2^63, and all values end up requiring 64 bits.
///
/// # See also
/// The [reverse mapping is `u64_to_i64`](./fn.u64_to_i64.html).
#[inline]
pub fn i64_to_u64(val: i64) -> u64 {
(val as u64) ^ HIGHEST_BIT
}
/// Reverse the mapping given by [`i64_to_u64`](./fn.i64_to_u64.html).
#[inline]
pub fn u64_to_i64(val: u64) -> i64 {
(val ^ HIGHEST_BIT) as i64
}
/// Maps a `f64` to `u64`
///
/// For simplicity, tantivy internally handles `f64` as `u64`.
/// The mapping is defined by this function.
///
/// Maps `f64` to `u64` in a monotonic manner, so that bytes lexical order is preserved.
///
/// This is more suited than simply casting (`val as u64`)
/// which would truncate the result
///
/// # Reference
///
/// Daniel Lemire's [blog post](https://lemire.me/blog/2020/12/14/converting-floating-point-numbers-to-integers-while-preserving-order/)
/// explains the mapping in a clear manner.
///
/// # See also
/// The [reverse mapping is `u64_to_f64`](./fn.u64_to_f64.html).
#[inline]
pub fn f64_to_u64(val: f64) -> u64 {
let bits = val.to_bits();
if val.is_sign_positive() {
bits ^ HIGHEST_BIT
} else {
!bits
}
}
/// Reverse the mapping given by [`f64_to_u64`](./fn.f64_to_u64.html).
#[inline]
pub fn u64_to_f64(val: u64) -> f64 {
f64::from_bits(if val & HIGHEST_BIT != 0 {
val ^ HIGHEST_BIT
} else {
!val
})
}
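Both mappings can be exercised in isolation. The sketch below restates `i64_to_u64` and `f64_to_u64` from the definitions above and asserts the ordering properties the docs claim:

```rust
const HIGHEST_BIT: u64 = 1 << 63;

// Flipping the sign bit shifts -2^63..=2^63-1 monotonically
// onto 0..=2^64-1.
fn i64_to_u64(val: i64) -> u64 {
    (val as u64) ^ HIGHEST_BIT
}

// Lemire's trick: positive floats get the sign bit set;
// negative floats are fully inverted so that more negative
// values map to smaller integers.
fn f64_to_u64(val: f64) -> u64 {
    let bits = val.to_bits();
    if val.is_sign_positive() {
        bits ^ HIGHEST_BIT
    } else {
        !bits
    }
}

fn main() {
    // The i64 mapping preserves order and hits both endpoints.
    assert_eq!(i64_to_u64(i64::MIN), u64::MIN);
    assert_eq!(i64_to_u64(i64::MAX), u64::MAX);
    assert!(i64_to_u64(-1) < i64_to_u64(0));
    // The f64 mapping is monotonic across the sign boundary.
    assert!(f64_to_u64(-2.0) < f64_to_u64(-1.5));
    assert!(f64_to_u64(-1.5) < f64_to_u64(1.0));
    assert!(f64_to_u64(1.0) < f64_to_u64(1.5));
}
```

Because the mapped integers compare like the originals, small ranges of mapped values bit-pack tightly, which is the motivation given in the doc comments.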
#[cfg(test)]
pub(crate) mod test {
use super::{f64_to_u64, i64_to_u64, u64_to_f64, u64_to_i64};
use common::{BinarySerializable, FixedSize};
use proptest::prelude::*;
use std::f64;
use tantivy_bitpacker::compute_num_bits;
pub use tantivy_bitpacker::minmax;
fn test_i64_converter_helper(val: i64) {
assert_eq!(u64_to_i64(i64_to_u64(val)), val);
}
fn test_f64_converter_helper(val: f64) {
assert_eq!(u64_to_f64(f64_to_u64(val)), val);
}
pub fn fixed_size_test<O: BinarySerializable + FixedSize + Default>() {
let mut buffer = Vec::new();
O::default().serialize(&mut buffer).unwrap();
assert_eq!(buffer.len(), O::SIZE_IN_BYTES);
}
proptest! {
#[test]
fn test_f64_converter_monotonicity_proptest((left, right) in (proptest::num::f64::NORMAL, proptest::num::f64::NORMAL)) {
let left_u64 = f64_to_u64(left);
let right_u64 = f64_to_u64(right);
assert_eq!(left_u64 < right_u64, left < right);
}
}
#[test]
fn test_i64_converter() {
assert_eq!(i64_to_u64(i64::min_value()), u64::min_value());
assert_eq!(i64_to_u64(i64::max_value()), u64::max_value());
test_i64_converter_helper(0i64);
test_i64_converter_helper(i64::min_value());
test_i64_converter_helper(i64::max_value());
for i in -1000i64..1000i64 {
test_i64_converter_helper(i);
}
}
#[test]
fn test_f64_converter() {
test_f64_converter_helper(f64::INFINITY);
test_f64_converter_helper(f64::NEG_INFINITY);
test_f64_converter_helper(0.0);
test_f64_converter_helper(-0.0);
test_f64_converter_helper(1.0);
test_f64_converter_helper(-1.0);
}
#[test]
fn test_f64_order() {
assert!(!(f64_to_u64(f64::NEG_INFINITY)..f64_to_u64(f64::INFINITY))
.contains(&f64_to_u64(f64::NAN))); //nan is not a number
assert!(f64_to_u64(1.5) > f64_to_u64(1.0)); //same exponent, different mantissa
assert!(f64_to_u64(2.0) > f64_to_u64(1.0)); //same mantissa, different exponent
assert!(f64_to_u64(2.0) > f64_to_u64(1.5)); //different exponent and mantissa
assert!(f64_to_u64(1.0) > f64_to_u64(-1.0)); // pos > neg
assert!(f64_to_u64(-1.5) < f64_to_u64(-1.0));
assert!(f64_to_u64(-2.0) < f64_to_u64(1.0));
assert!(f64_to_u64(-2.0) < f64_to_u64(-1.5));
}
#[test]
fn test_compute_num_bits() {
assert_eq!(compute_num_bits(1), 1u8);
assert_eq!(compute_num_bits(0), 0u8);
assert_eq!(compute_num_bits(2), 2u8);
assert_eq!(compute_num_bits(3), 2u8);
assert_eq!(compute_num_bits(4), 3u8);
assert_eq!(compute_num_bits(255), 8u8);
assert_eq!(compute_num_bits(256), 9u8);
assert_eq!(compute_num_bits(5_000_000_000), 33u8);
}
#[test]
fn test_max_doc() {
// this is the first time I write a unit test for a constant.
assert!(((super::MAX_DOC_LIMIT - 1) as i32) >= 0);
assert!((super::MAX_DOC_LIMIT as i32) < 0);
}
#[test]
fn test_minmax_empty() {
let vals: Vec<u32> = vec![];
assert_eq!(minmax(vals.into_iter()), None);
}
#[test]
fn test_minmax_one() {
assert_eq!(minmax(vec![1].into_iter()), Some((1, 1)));
}
#[test]
fn test_minmax_two() {
assert_eq!(minmax(vec![1, 2].into_iter()), Some((1, 2)));
assert_eq!(minmax(vec![2, 1].into_iter()), Some((1, 2)));
}
}


@@ -120,11 +120,11 @@ impl IndexBuilder {
/// Creates a new index in a given filepath.
/// The index will use the `MMapDirectory`.
///
- /// If a previous index was in this directory, then its meta file will be destroyed.
+ /// If a previous index was in this directory, it returns an `IndexAlreadyExists` error.
#[cfg(feature = "mmap")]
pub fn create_in_dir<P: AsRef<Path>>(self, directory_path: P) -> crate::Result<Index> {
- let mmap_directory = MmapDirectory::open(directory_path)?;
- if Index::exists(&mmap_directory)? {
+ let mmap_directory: Box<dyn Directory> = Box::new(MmapDirectory::open(directory_path)?);
+ if Index::exists(&*mmap_directory)? {
return Err(TantivyError::IndexAlreadyExists);
}
self.create(mmap_directory)
@@ -139,7 +139,7 @@ impl IndexBuilder {
/// For other unit tests, prefer the `RAMDirectory`, see: `create_in_ram`.
#[cfg(feature = "mmap")]
pub fn create_from_tempdir(self) -> crate::Result<Index> {
- let mmap_directory = MmapDirectory::create_from_tempdir()?;
+ let mmap_directory: Box<dyn Directory> = Box::new(MmapDirectory::create_from_tempdir()?);
self.create(mmap_directory)
}
fn get_expect_schema(&self) -> crate::Result<Schema> {
@@ -149,8 +149,9 @@ impl IndexBuilder {
.ok_or(TantivyError::IndexBuilderMissingArgument("schema"))
}
/// Opens or creates a new index in the provided directory
- pub fn open_or_create<Dir: Directory>(self, dir: Dir) -> crate::Result<Index> {
- if !Index::exists(&dir)? {
+ pub fn open_or_create<T: Into<Box<dyn Directory>>>(self, dir: T) -> crate::Result<Index> {
+ let dir = dir.into();
+ if !Index::exists(&*dir)? {
return self.create(dir);
}
let index = Index::open(dir)?;
@@ -165,7 +166,8 @@ impl IndexBuilder {
/// Creates a new index given an implementation of the trait `Directory`.
///
/// If a directory previously existed, it will be erased.
- fn create<Dir: Directory>(self, dir: Dir) -> crate::Result<Index> {
+ fn create<T: Into<Box<dyn Directory>>>(self, dir: T) -> crate::Result<Index> {
+ let dir = dir.into();
let directory = ManagedDirectory::wrap(dir)?;
save_new_metas(
self.get_expect_schema()?,
@@ -198,7 +200,7 @@ impl Index {
/// Examines the directory to see if it contains an index.
///
/// Effectively, it only checks for the presence of the `meta.json` file.
- pub fn exists<Dir: Directory>(dir: &Dir) -> Result<bool, OpenReadError> {
+ pub fn exists(dir: &dyn Directory) -> Result<bool, OpenReadError> {
dir.exists(&META_FILEPATH)
}
@@ -215,7 +217,7 @@ impl Index {
/// Replace the default single thread search executor pool
/// by a thread pool with a given number of threads.
pub fn set_multithread_executor(&mut self, num_threads: usize) -> crate::Result<()> {
- self.executor = Arc::new(Executor::multi_thread(num_threads, "thrd-tantivy-search-")?);
+ self.executor = Arc::new(Executor::multi_thread(num_threads, "tantivy-search-")?);
Ok(())
}
@@ -229,7 +231,8 @@ impl Index {
/// Creates a new index using the `RamDirectory`.
///
/// The index will be allocated in anonymous memory.
- /// This should only be used for unit tests.
+ /// This is useful for indexing small set of documents
+ /// for instances like unit test or temporary in memory index.
pub fn create_in_ram(schema: Schema) -> Index {
IndexBuilder::new().schema(schema).create_in_ram().unwrap()
}
@@ -237,7 +240,7 @@ impl Index {
/// Creates a new index in a given filepath.
/// The index will use the `MMapDirectory`.
///
- /// If a previous index was in this directory, then its meta file will be destroyed.
+ /// If a previous index was in this directory, then it returns an `IndexAlreadyExists` error.
#[cfg(feature = "mmap")]
pub fn create_in_dir<P: AsRef<Path>>(
directory_path: P,
@@ -249,7 +252,11 @@ impl Index {
}
/// Opens or creates a new index in the provided directory
- pub fn open_or_create<Dir: Directory>(dir: Dir, schema: Schema) -> crate::Result<Index> {
+ pub fn open_or_create<T: Into<Box<dyn Directory>>>(
+ dir: T,
+ schema: Schema,
+ ) -> crate::Result<Index> {
+ let dir = dir.into();
IndexBuilder::new().schema(schema).open_or_create(dir)
}
@@ -269,11 +276,12 @@ impl Index {
/// Creates a new index given an implementation of the trait `Directory`.
///
/// If a directory previously existed, it will be erased.
- pub fn create<Dir: Directory>(
- dir: Dir,
+ pub fn create<T: Into<Box<dyn Directory>>>(
+ dir: T,
schema: Schema,
settings: IndexSettings,
) -> crate::Result<Index> {
+ let dir: Box<dyn Directory> = dir.into();
let mut builder = IndexBuilder::new().schema(schema);
builder = builder.settings(settings);
builder.create(dir)
@@ -364,7 +372,8 @@ impl Index {
}
/// Open the index using the provided directory
- pub fn open<D: Directory>(directory: D) -> crate::Result<Index> {
+ pub fn open<T: Into<Box<dyn Directory>>>(directory: T) -> crate::Result<Index> {
+ let directory = directory.into();
let directory = ManagedDirectory::wrap(directory)?;
let inventory = SegmentMetaInventory::default();
let metas = load_metas(&directory, &inventory)?;
@@ -394,9 +403,7 @@ impl Index {
///
/// # Errors
/// If the lockfile already exists, returns `Error::DirectoryLockBusy` or an `Error::IoError`.
- ///
- /// # Panics
- /// If the heap size per thread is too small, panics.
+ /// If the heap size per thread is too small or too big, returns `TantivyError::InvalidArgument`
pub fn writer_with_num_threads(
&self,
num_threads: usize,
@@ -438,14 +445,13 @@ impl Index {
/// Creates a multithreaded writer
///
/// Tantivy will automatically define the number of threads to use, but
- /// no more than [`MAX_NUM_THREAD`] threads.
+ /// no more than 8 threads.
/// `overall_heap_size_in_bytes` is the total target memory usage that will be split
/// between a given number of threads.
///
/// # Errors
/// If the lockfile already exists, returns `Error::FileAlreadyExists`.
- /// # Panics
- /// If the heap size per thread is too small, panics.
+ /// If the heap size per thread is too small or too big, returns `TantivyError::InvalidArgument`
pub fn writer(&self, overall_heap_size_in_bytes: usize) -> crate::Result<IndexWriter> {
let mut num_threads = std::cmp::min(num_cpus::get(), MAX_NUM_THREAD);
let heap_size_in_bytes_per_thread = overall_heap_size_in_bytes / num_threads;
@@ -523,7 +529,22 @@ impl Index {
/// Returns the set of corrupted files
pub fn validate_checksum(&self) -> crate::Result<HashSet<PathBuf>> {
- self.directory.list_damaged().map_err(Into::into)
+ let managed_files = self.directory.list_managed_files();
+ let active_segments_files: HashSet<PathBuf> = self
+ .searchable_segment_metas()?
+ .iter()
+ .flat_map(|segment_meta| segment_meta.list_files())
+ .collect();
+ let active_existing_files: HashSet<&PathBuf> =
+ active_segments_files.intersection(&managed_files).collect();
+ let mut damaged_files = HashSet::new();
+ for path in active_existing_files {
+ if !self.directory.validate_checksum(path)? {
+ damaged_files.insert((*path).clone());
+ }
+ }
+ Ok(damaged_files)
}
}
@@ -561,15 +582,15 @@ mod tests {
#[test]
fn test_index_exists() {
- let directory = RamDirectory::create();
- assert!(!Index::exists(&directory).unwrap());
+ let directory: Box<dyn Directory> = Box::new(RamDirectory::create());
+ assert!(!Index::exists(directory.as_ref()).unwrap());
assert!(Index::create(
directory.clone(),
throw_away_schema(),
IndexSettings::default()
)
.is_ok());
- assert!(Index::exists(&directory).unwrap());
+ assert!(Index::exists(directory.as_ref()).unwrap());
}
#[test]
@@ -582,27 +603,27 @@ mod tests {
#[test]
fn open_or_create_should_open() {
- let directory = RamDirectory::create();
+ let directory: Box<dyn Directory> = Box::new(RamDirectory::create());
assert!(Index::create(
directory.clone(),
throw_away_schema(),
IndexSettings::default()
)
.is_ok());
- assert!(Index::exists(&directory).unwrap());
+ assert!(Index::exists(directory.as_ref()).unwrap());
assert!(Index::open_or_create(directory, throw_away_schema()).is_ok());
}
#[test]
fn create_should_wipeoff_existing() {
- let directory = RamDirectory::create();
+ let directory: Box<dyn Directory> = Box::new(RamDirectory::create());
assert!(Index::create(
directory.clone(),
throw_away_schema(),
IndexSettings::default()
)
.is_ok());
- assert!(Index::exists(&directory).unwrap());
+ assert!(Index::exists(directory.as_ref()).unwrap());
assert!(Index::create(
directory,
Schema::builder().build(),
@@ -636,7 +657,7 @@ mod tests {
}
#[test]
- fn test_index_on_commit_reload_policy() {
+ fn test_index_on_commit_reload_policy() -> crate::Result<()> {
let schema = throw_away_schema();
let field = schema.get_field("num_likes").unwrap();
let index = Index::create_in_ram(schema);
@@ -646,7 +667,7 @@ mod tests {
.try_into()
.unwrap();
assert_eq!(reader.searcher().num_docs(), 0);
- test_index_on_commit_reload_policy_aux(field, &index, &reader);
+ test_index_on_commit_reload_policy_aux(field, &index, &reader)
}
#[cfg(feature = "mmap")]
@@ -658,7 +679,7 @@ mod tests {
use tempfile::TempDir;
#[test]
- fn test_index_on_commit_reload_policy_mmap() {
+ fn test_index_on_commit_reload_policy_mmap() -> crate::Result<()> {
let schema = throw_away_schema();
let field = schema.get_field("num_likes").unwrap();
let tempdir = TempDir::new().unwrap();
@@ -670,7 +691,7 @@ mod tests {
.try_into()
.unwrap();
assert_eq!(reader.searcher().num_docs(), 0);
- test_index_on_commit_reload_policy_aux(field, &index, &reader);
+ test_index_on_commit_reload_policy_aux(field, &index, &reader)
}
#[test]
@@ -685,7 +706,7 @@ mod tests {
.reload_policy(ReloadPolicy::Manual)
.try_into()?;
assert_eq!(reader.searcher().num_docs(), 0);
- writer.add_document(doc!(field=>1u64));
+ writer.add_document(doc!(field=>1u64))?;
let (sender, receiver) = crossbeam::channel::unbounded();
let _handle = index.directory_mut().watch(WatchCallback::new(move || {
let _ = sender.send(());
@@ -699,7 +720,7 @@
}
}
#[test]
- fn test_index_on_commit_reload_policy_different_directories() {
+ fn test_index_on_commit_reload_policy_different_directories() -> crate::Result<()> {
let schema = throw_away_schema();
let field = schema.get_field("num_likes").unwrap();
let tempdir = TempDir::new().unwrap();
@@ -712,10 +733,14 @@ mod tests {
.try_into()
.unwrap();
assert_eq!(reader.searcher().num_docs(), 0);
- test_index_on_commit_reload_policy_aux(field, &write_index, &reader);
+ test_index_on_commit_reload_policy_aux(field, &write_index, &reader)
}
}
- fn test_index_on_commit_reload_policy_aux(field: Field, index: &Index, reader: &IndexReader) {
+ fn test_index_on_commit_reload_policy_aux(
+ field: Field,
+ index: &Index,
+ reader: &IndexReader,
+ ) -> crate::Result<()> {
let mut reader_index = reader.index();
let (sender, receiver) = crossbeam::channel::unbounded();
let _watch_handle = reader_index
@@ -723,9 +748,9 @@ mod tests {
.watch(WatchCallback::new(move || {
let _ = sender.send(());
}));
- let mut writer = index.writer_for_tests().unwrap();
+ let mut writer = index.writer_for_tests()?;
assert_eq!(reader.searcher().num_docs(), 0);
- writer.add_document(doc!(field=>1u64));
+ writer.add_document(doc!(field=>1u64))?;
writer.commit().unwrap();
// We need a loop here because it is possible for notify to send more than
// one modify event. It was observed on CI on MacOS.
@@ -735,7 +760,7 @@ mod tests {
break;
}
}
- writer.add_document(doc!(field=>2u64));
+ writer.add_document(doc!(field=>2u64))?;
writer.commit().unwrap();
// ... Same as above
loop {
@@ -744,37 +769,37 @@ mod tests {
break;
}
}
+ Ok(())
}
// This test will not pass on windows, because windows
// prevent deleting files that are MMapped.
#[cfg(not(target_os = "windows"))]
#[test]
- fn garbage_collect_works_as_intended() {
+ fn garbage_collect_works_as_intended() -> crate::Result<()> {
let directory = RamDirectory::create();
let schema = throw_away_schema();
let field = schema.get_field("num_likes").unwrap();
- let index = Index::create(directory.clone(), schema, IndexSettings::default()).unwrap();
+ let index = Index::create(directory.clone(), schema, IndexSettings::default())?;
let mut writer = index.writer_with_num_threads(8, 24_000_000).unwrap();
for i in 0u64..8_000u64 {
- writer.add_document(doc!(field => i));
+ writer.add_document(doc!(field => i))?;
}
let (sender, receiver) = crossbeam::channel::unbounded();
let _handle = directory.watch(WatchCallback::new(move || {
let _ = sender.send(());
}));
- writer.commit().unwrap();
+ writer.commit()?;
let mem_right_after_commit = directory.total_mem_usage();
assert!(receiver.recv().is_ok());
let reader = index
.reader_builder()
.reload_policy(ReloadPolicy::Manual)
- .try_into()
- .unwrap();
+ .try_into()?;
assert_eq!(reader.searcher().num_docs(), 8_000);
- writer.wait_merging_threads().unwrap();
+ writer.wait_merging_threads()?;
let mem_right_after_merge_finished = directory.total_mem_usage();
reader.reload().unwrap();
@@ -786,5 +811,6 @@ mod tests {
mem_right_after_merge_finished,
mem_right_after_commit
);
+ Ok(())
}
}
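The recurring change in this diff replaces `Dir: Directory` generics with `T: Into<Box<dyn Directory>>`, so one entry point accepts either a concrete directory or an already-boxed trait object. A minimal sketch of that signature pattern using a hypothetical `Store` trait (not tantivy's API):

```rust
// Hypothetical trait standing in for tantivy's Directory.
trait Store {
    fn name(&self) -> &'static str;
}

struct RamStore;
impl Store for RamStore {
    fn name(&self) -> &'static str {
        "ram"
    }
}

// Concrete types convert into a boxed trait object...
impl From<RamStore> for Box<dyn Store> {
    fn from(store: RamStore) -> Box<dyn Store> {
        Box::new(store)
    }
}

// ...so `open` accepts both RamStore and Box<dyn Store>
// (the latter via the blanket identity From impl).
fn open<T: Into<Box<dyn Store>>>(store: T) -> String {
    let store: Box<dyn Store> = store.into();
    format!("opened {}", store.name())
}

fn main() {
    assert_eq!(open(RamStore), "opened ram");
    let boxed: Box<dyn Store> = Box::new(RamStore);
    assert_eq!(open(boxed), "opened ram");
}
```

The design choice trades monomorphized static dispatch for a single boxed code path, which keeps the public API flexible without duplicating generic instantiations.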


@@ -2,7 +2,7 @@ use super::SegmentComponent;
use crate::schema::Schema;
use crate::Opstamp;
use crate::{core::SegmentId, store::Compressor};
- use census::{Inventory, TrackedObject};
+ use crate::{Inventory, TrackedObject};
use serde::{Deserialize, Serialize};
use std::path::PathBuf;
use std::{collections::HashSet, sync::atomic::AtomicBool};
@@ -101,6 +101,7 @@ impl SegmentMeta {
/// Returns the list of files that
/// are required for the segment meta.
+ /// Note: Some of the returned files may not exist depending on the state of the segment.
///
/// This is useful as the way tantivy removes files
/// is by removing all files that have been created by tantivy
@@ -188,6 +189,10 @@ impl SegmentMeta {
#[doc(hidden)]
pub fn with_delete_meta(self, num_deleted_docs: u32, opstamp: Opstamp) -> SegmentMeta {
+ assert!(
+ num_deleted_docs <= self.max_doc(),
+ "There cannot be more deleted docs than there are docs."
+ );
let delete_meta = DeleteMeta {
num_deleted_docs,
opstamp,
@@ -393,7 +398,7 @@ mod tests {
let json = serde_json::ser::to_string(&index_metas).expect("serialization failed"); let json = serde_json::ser::to_string(&index_metas).expect("serialization failed");
assert_eq!( assert_eq!(
json, json,
r#"{"index_settings":{"sort_by_field":{"field":"text","order":"Asc"},"docstore_compression":"lz4"},"segments":[],"schema":[{"name":"text","type":"text","options":{"indexing":{"record":"position","tokenizer":"default"},"stored":false}}],"opstamp":0}"# r#"{"index_settings":{"sort_by_field":{"field":"text","order":"Asc"},"docstore_compression":"lz4"},"segments":[],"schema":[{"name":"text","type":"text","options":{"indexing":{"record":"position","fieldnorms":true,"tokenizer":"default"},"stored":false}}],"opstamp":0}"#
); );
} }
} }


@@ -1,6 +1,5 @@
 use std::io;
-use crate::common::BinarySerializable;
 use crate::directory::FileSlice;
 use crate::positions::PositionReader;
 use crate::postings::TermInfo;
@@ -8,6 +7,7 @@ use crate::postings::{BlockSegmentPostings, SegmentPostings};
 use crate::schema::IndexRecordOption;
 use crate::schema::Term;
 use crate::termdict::TermDictionary;
+use common::BinarySerializable;
 /// The inverted index reader is in charge of accessing
 /// the inverted index associated to a specific field.


@@ -14,7 +14,7 @@ pub use self::index_meta::{
     IndexMeta, IndexSettings, IndexSortByField, Order, SegmentMeta, SegmentMetaInventory,
 };
 pub use self::inverted_index_reader::InvertedIndexReader;
-pub use self::searcher::Searcher;
+pub use self::searcher::{Searcher, SearcherGeneration};
 pub use self::segment::Segment;
 pub use self::segment_component::SegmentComponent;
 pub use self::segment_id::SegmentId;


@@ -1,6 +1,5 @@
 use crate::collector::Collector;
-use crate::core::Executor;
 use crate::core::SegmentReader;
 use crate::query::Query;
 use crate::schema::Document;
@@ -10,9 +9,62 @@ use crate::space_usage::SearcherSpaceUsage;
 use crate::store::StoreReader;
 use crate::DocAddress;
 use crate::Index;
+use crate::Opstamp;
+use crate::SegmentId;
+use crate::TrackedObject;
+use std::collections::BTreeMap;
 use std::{fmt, io};
+/// Identifies the searcher generation accessed by a [Searcher].
+///
+/// While this might seem redundant, a [SearcherGeneration] contains
+/// both a `generation_id` AND a list of `(SegmentId, DeleteOpstamp)`.
+///
+/// This is on purpose. This object is used by the `Warmer` API.
+/// Having both pieces of information makes it possible to identify which
+/// artifact should be refreshed or garbage collected.
+///
+/// Depending on the use case, `Warmer`'s implementers can decide to
+/// produce artifacts per:
+/// - `generation_id` (e.g. some searcher-level aggregates)
+/// - `(segment_id, delete_opstamp)` (e.g. segment-level aggregates)
+/// - `segment_id` (e.g. for immutable document-level information)
+/// - `(generation_id, segment_id)` (e.g. for a consistent dynamic column)
+/// - ...
+#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord, Hash)]
+pub struct SearcherGeneration {
+    segments: BTreeMap<SegmentId, Option<Opstamp>>,
+    generation_id: u64,
+}
+impl SearcherGeneration {
+    pub(crate) fn from_segment_readers(
+        segment_readers: &[SegmentReader],
+        generation_id: u64,
+    ) -> Self {
+        let mut segment_id_to_del_opstamp = BTreeMap::new();
+        for segment_reader in segment_readers {
+            segment_id_to_del_opstamp
+                .insert(segment_reader.segment_id(), segment_reader.delete_opstamp());
+        }
+        Self {
+            segments: segment_id_to_del_opstamp,
+            generation_id,
+        }
+    }
+    /// Returns the searcher generation id.
+    pub fn generation_id(&self) -> u64 {
+        self.generation_id
+    }
+    /// Returns a `(SegmentId -> DeleteOpstamp)` mapping.
+    pub fn segments(&self) -> &BTreeMap<SegmentId, Option<Opstamp>> {
+        &self.segments
+    }
+}
 /// Holds a list of `SegmentReader`s ready for search.
 ///
 /// It guarantees that the `Segment` will not be removed before
@@ -23,6 +75,7 @@ pub struct Searcher {
     index: Index,
     segment_readers: Vec<SegmentReader>,
     store_readers: Vec<StoreReader>,
+    generation: TrackedObject<SearcherGeneration>,
 }
 impl Searcher {
@@ -31,6 +84,7 @@ impl Searcher {
         schema: Schema,
         index: Index,
         segment_readers: Vec<SegmentReader>,
+        generation: TrackedObject<SearcherGeneration>,
     ) -> io::Result<Searcher> {
         let store_readers: Vec<StoreReader> = segment_readers
             .iter()
@@ -41,6 +95,7 @@ impl Searcher {
             index,
             segment_readers,
             store_readers,
+            generation,
         })
     }
@@ -49,6 +104,11 @@ impl Searcher {
         &self.index
     }
+    /// [SearcherGeneration] which identifies the version of the snapshot held by this `Searcher`.
+    pub fn generation(&self) -> &SearcherGeneration {
+        self.generation.as_ref()
+    }
     /// Fetches a document from tantivy's store given a `DocAddress`.
     ///
     /// The searcher uses the segment ordinal to route the
@@ -88,7 +148,7 @@ impl Searcher {
         &self.segment_readers
     }
-    /// Returns the segment_reader associated with the given segment_ordinal
+    /// Returns the segment_reader associated with the given segment_ord
     pub fn segment_reader(&self, segment_ord: u32) -> &SegmentReader {
         &self.segment_readers[segment_ord as usize]
     }
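The `SearcherGeneration` doc comment above lists several keying strategies for `Warmer` artifacts. A minimal sketch of the `(segment_id, delete_opstamp)` strategy, with plain integers standing in for tantivy's `SegmentId` and `Opstamp` types (all names here are hypothetical, not part of the tantivy API):

```rust
use std::collections::{BTreeMap, HashMap};

// Hypothetical stand-ins for tantivy's SegmentId and Opstamp.
type SegmentId = u64;
type Opstamp = u64;

/// A per-(segment_id, delete_opstamp) artifact cache, one of the keying
/// strategies the doc comment above lists for `Warmer` implementers.
struct SegmentArtifactCache {
    artifacts: HashMap<(SegmentId, Option<Opstamp>), usize>,
}

impl SegmentArtifactCache {
    fn new() -> Self {
        Self {
            artifacts: HashMap::new(),
        }
    }

    /// Warm an artifact for every segment of the new generation and
    /// garbage-collect entries that no longer appear in it.
    fn refresh(&mut self, generation: &BTreeMap<SegmentId, Option<Opstamp>>) {
        // Drop artifacts whose (segment_id, delete_opstamp) key is gone.
        self.artifacts
            .retain(|&(seg_id, opstamp), _| generation.get(&seg_id) == Some(&opstamp));
        // Compute artifacts for newly appearing keys only.
        for (&seg_id, &opstamp) in generation {
            self.artifacts
                .entry((seg_id, opstamp))
                .or_insert_with(|| 42); // placeholder for an expensive per-segment computation
        }
    }
}
```

Keying on the delete opstamp as well as the segment id means a segment that merely acquired new deletes is rewarmed, while untouched segments keep their cached artifact.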


@@ -2,8 +2,11 @@ use crate::core::InvertedIndexReader;
 use crate::core::Segment;
 use crate::core::SegmentComponent;
 use crate::core::SegmentId;
+use crate::directory::CompositeFile;
 use crate::directory::FileSlice;
-use crate::fastfield::DeleteBitSet;
+use crate::error::DataCorruption;
+use crate::fastfield::intersect_alive_bitsets;
+use crate::fastfield::AliveBitSet;
 use crate::fastfield::FacetReader;
 use crate::fastfield::FastFieldReaders;
 use crate::fieldnorm::{FieldNormReader, FieldNormReaders};
@@ -14,7 +17,7 @@ use crate::space_usage::SegmentSpaceUsage;
 use crate::store::StoreReader;
 use crate::termdict::TermDictionary;
 use crate::DocId;
-use crate::{common::CompositeFile, error::DataCorruption};
+use crate::Opstamp;
 use fail::fail_point;
 use std::fmt;
 use std::sync::Arc;
@@ -36,6 +39,8 @@ pub struct SegmentReader {
     inv_idx_reader_cache: Arc<RwLock<HashMap<Field, Arc<InvertedIndexReader>>>>,
     segment_id: SegmentId,
+    delete_opstamp: Option<Opstamp>,
     max_doc: DocId,
     num_docs: DocId,
@@ -46,7 +51,7 @@ pub struct SegmentReader {
     fieldnorm_readers: FieldNormReaders,
     store_file: FileSlice,
-    delete_bitset_opt: Option<DeleteBitSet>,
+    alive_bitset_opt: Option<AliveBitSet>,
     schema: Schema,
 }
@@ -71,14 +76,12 @@ impl SegmentReader {
     /// Return the number of documents that have been
     /// deleted in the segment.
     pub fn num_deleted_docs(&self) -> DocId {
-        self.delete_bitset()
-            .map(|delete_set| delete_set.num_deleted() as DocId)
-            .unwrap_or(0u32)
+        self.max_doc - self.num_docs
     }
     /// Returns true iff some of the documents of the segment have been deleted.
     pub fn has_deletes(&self) -> bool {
-        self.delete_bitset().is_some()
+        self.num_deleted_docs() > 0
     }
     /// Accessor to a segment's fast field reader given a field.
@@ -100,7 +103,7 @@ impl SegmentReader {
     let field_entry = self.schema.get_field_entry(field);
     match field_entry.field_type() {
-        FieldType::HierarchicalFacet(_) => {
+        FieldType::Facet(_) => {
             let term_ords_reader = self.fast_fields().u64s(field)?;
             let termdict = self
                 .termdict_composite
@@ -127,13 +130,17 @@ impl SegmentReader {
     self.fieldnorm_readers.get_field(field)?.ok_or_else(|| {
         let field_name = self.schema.get_field_name(field);
         let err_msg = format!(
-            "Field norm not found for field {:?}. Was it marked as indexed during indexing?",
+            "Field norm not found for field {:?}. Was the field set to record norm during indexing?",
             field_name
         );
         crate::TantivyError::SchemaError(err_msg)
     })
 }
+pub(crate) fn fieldnorms_readers(&self) -> &FieldNormReaders {
+    &self.fieldnorm_readers
+}
 /// Accessor to the segment's `StoreReader`.
 pub fn get_store_reader(&self) -> io::Result<StoreReader> {
     StoreReader::open(self.store_file.clone())
@@ -141,6 +148,14 @@ impl SegmentReader {
 /// Open a new segment for reading.
 pub fn open(segment: &Segment) -> crate::Result<SegmentReader> {
+    Self::open_with_custom_alive_set(segment, None)
+}
+/// Open a new segment for reading.
+pub fn open_with_custom_alive_set(
+    segment: &Segment,
+    custom_bitset: Option<AliveBitSet>,
+) -> crate::Result<SegmentReader> {
     let termdict_file = segment.open_read(SegmentComponent::Terms)?;
     let termdict_composite = CompositeFile::open(&termdict_file)?;
@@ -165,29 +180,37 @@ impl SegmentReader {
     let fast_fields_composite = CompositeFile::open(&fast_fields_data)?;
     let fast_field_readers =
         Arc::new(FastFieldReaders::new(schema.clone(), fast_fields_composite));
     let fieldnorm_data = segment.open_read(SegmentComponent::FieldNorms)?;
     let fieldnorm_readers = FieldNormReaders::open(fieldnorm_data)?;
-    let delete_bitset_opt = if segment.meta().has_deletes() {
-        let delete_data = segment.open_read(SegmentComponent::Delete)?;
-        let delete_bitset = DeleteBitSet::open(delete_data)?;
-        Some(delete_bitset)
+    let original_bitset = if segment.meta().has_deletes() {
+        let delete_file_slice = segment.open_read(SegmentComponent::Delete)?;
+        let delete_data = delete_file_slice.read_bytes()?;
+        Some(AliveBitSet::open(delete_data))
     } else {
         None
     };
+    let alive_bitset_opt = intersect_alive_bitset(original_bitset, custom_bitset);
+    let max_doc = segment.meta().max_doc();
+    let num_docs = alive_bitset_opt
+        .as_ref()
+        .map(|alive_bitset| alive_bitset.num_alive_docs() as u32)
+        .unwrap_or(max_doc);
     Ok(SegmentReader {
         inv_idx_reader_cache: Default::default(),
-        max_doc: segment.meta().max_doc(),
-        num_docs: segment.meta().num_docs(),
+        num_docs,
+        max_doc,
         termdict_composite,
         postings_composite,
         fast_fields_readers: fast_field_readers,
         fieldnorm_readers,
         segment_id: segment.id(),
+        delete_opstamp: segment.meta().delete_opstamp(),
         store_file,
-        delete_bitset_opt,
+        alive_bitset_opt,
         positions_composite,
         schema,
     })
@@ -271,23 +294,32 @@ impl SegmentReader {
     self.segment_id
 }
+/// Returns the delete opstamp
+pub fn delete_opstamp(&self) -> Option<Opstamp> {
+    self.delete_opstamp
+}
 /// Returns the bitset representing
 /// the documents that have been deleted.
-pub fn delete_bitset(&self) -> Option<&DeleteBitSet> {
-    self.delete_bitset_opt.as_ref()
+pub fn alive_bitset(&self) -> Option<&AliveBitSet> {
+    self.alive_bitset_opt.as_ref()
 }
 /// Returns true iff the `doc` is marked
 /// as deleted.
 pub fn is_deleted(&self, doc: DocId) -> bool {
-    self.delete_bitset()
+    self.alive_bitset()
         .map(|delete_set| delete_set.is_deleted(doc))
         .unwrap_or(false)
 }
 /// Returns an iterator that will iterate over the alive document ids
-pub fn doc_ids_alive(&self) -> impl Iterator<Item = DocId> + '_ {
-    (0u32..self.max_doc).filter(move |doc| !self.is_deleted(*doc))
+pub fn doc_ids_alive(&self) -> Box<dyn Iterator<Item = DocId> + '_> {
+    if let Some(alive_bitset) = &self.alive_bitset_opt {
+        Box::new(alive_bitset.iter_alive())
+    } else {
+        Box::new(0u32..self.max_doc)
+    }
 }
 /// Summarize total space usage of this segment.
@@ -300,14 +332,29 @@ impl SegmentReader {
     self.fast_fields_readers.space_usage(),
     self.fieldnorm_readers.space_usage(),
     self.get_store_reader()?.space_usage(),
-    self.delete_bitset_opt
+    self.alive_bitset_opt
         .as_ref()
-        .map(DeleteBitSet::space_usage)
+        .map(AliveBitSet::space_usage)
         .unwrap_or(0),
 ))
 }
 }
+fn intersect_alive_bitset(
+    left_opt: Option<AliveBitSet>,
+    right_opt: Option<AliveBitSet>,
+) -> Option<AliveBitSet> {
+    match (left_opt, right_opt) {
+        (Some(left), Some(right)) => {
+            assert_eq!(left.bitset().max_value(), right.bitset().max_value());
+            Some(intersect_alive_bitsets(left, right))
+        }
+        (Some(left), None) => Some(left),
+        (None, Some(right)) => Some(right),
+        (None, None) => None,
+    }
+}
 impl fmt::Debug for SegmentReader {
     fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
         write!(f, "SegmentReader({:?})", self.segment_id)
@@ -330,10 +377,10 @@ mod test {
 {
     let mut index_writer = index.writer_for_tests()?;
-    index_writer.add_document(doc!(name => "tantivy"));
-    index_writer.add_document(doc!(name => "horse"));
-    index_writer.add_document(doc!(name => "jockey"));
-    index_writer.add_document(doc!(name => "cap"));
+    index_writer.add_document(doc!(name => "tantivy"))?;
+    index_writer.add_document(doc!(name => "horse"))?;
+    index_writer.add_document(doc!(name => "jockey"))?;
+    index_writer.add_document(doc!(name => "cap"))?;
     // we should now have one segment with two docs
     index_writer.delete_term(Term::from_field_text(name, "horse"));
     index_writer.delete_term(Term::from_field_text(name, "cap"));
@@ -356,10 +403,10 @@ mod test {
 {
     let mut index_writer = index.writer_for_tests()?;
-    index_writer.add_document(doc!(name => "tantivy"));
-    index_writer.add_document(doc!(name => "horse"));
-    index_writer.add_document(doc!(name => "jockey"));
-    index_writer.add_document(doc!(name => "cap"));
+    index_writer.add_document(doc!(name => "tantivy"))?;
+    index_writer.add_document(doc!(name => "horse"))?;
+    index_writer.add_document(doc!(name => "jockey"))?;
+    index_writer.add_document(doc!(name => "cap"))?;
     // we should now have one segment with two docs
     index_writer.commit()?;
 }
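The `intersect_alive_bitset` helper above merges the on-disk delete bitset with an optional caller-supplied one: a doc survives only if it is alive in both. The same merge logic, sketched against a toy `Vec<bool>` bitset model (this is an illustration, not tantivy's `AliveBitSet`):

```rust
/// Simplified model of an alive bitset: `true` means the doc is alive.
#[derive(Clone, PartialEq, Debug)]
struct AliveBits(Vec<bool>);

impl AliveBits {
    fn num_alive_docs(&self) -> usize {
        self.0.iter().filter(|&&alive| alive).count()
    }
}

/// Mirrors the shape of `intersect_alive_bitset`: when both bitsets are
/// present their intersection is taken, a single bitset passes through
/// unchanged, and two `None`s stay `None`.
fn intersect(left: Option<AliveBits>, right: Option<AliveBits>) -> Option<AliveBits> {
    match (left, right) {
        (Some(l), Some(r)) => {
            // Both bitsets must cover the same [0, max_doc) range.
            assert_eq!(l.0.len(), r.0.len(), "bitsets must cover the same doc range");
            Some(AliveBits(
                l.0.iter().zip(&r.0).map(|(&a, &b)| a && b).collect(),
            ))
        }
        (l, r) => l.or(r),
    }
}
```

This also illustrates why `num_docs` is recomputed from the intersected bitset in `open_with_custom_alive_set`: the custom bitset can shrink the alive count below what the segment meta reports.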


@@ -1,18 +1,17 @@
-use crate::common::BinarySerializable;
-use crate::common::CountingWriter;
-use crate::common::VInt;
 use crate::directory::FileSlice;
 use crate::directory::{TerminatingWrite, WritePtr};
 use crate::schema::Field;
 use crate::space_usage::FieldUsage;
 use crate::space_usage::PerFieldSpaceUsage;
+use common::BinarySerializable;
+use common::CountingWriter;
+use common::HasLen;
+use common::VInt;
 use std::collections::HashMap;
 use std::io::{self, Read, Write};
 use std::iter::ExactSizeIterator;
 use std::ops::Range;
-use super::HasLen;
 #[derive(Eq, PartialEq, Hash, Copy, Ord, PartialOrd, Clone, Debug)]
 pub struct FileAddr {
     field: Field,
@@ -188,10 +187,10 @@ impl CompositeFile {
 mod test {
     use super::{CompositeFile, CompositeWrite};
-    use crate::common::BinarySerializable;
-    use crate::common::VInt;
     use crate::directory::{Directory, RamDirectory};
     use crate::schema::Field;
+    use common::BinarySerializable;
+    use common::VInt;
     use std::io::Write;
     use std::path::Path;


@@ -43,10 +43,8 @@ impl RetryPolicy {
 }
 /// The `DirectoryLock` is an object that represents a file lock.
-/// See [`LockType`](struct.LockType.html)
 ///
-/// It is transparently associated to a lock file, that gets deleted
-/// on `Drop.` The lock is released automatically on `Drop`.
+/// It is associated to a lock file, that gets deleted on `Drop`.
 pub struct DirectoryLock(Box<dyn Send + Sync + 'static>);
 struct DirectoryLockGuard {
@@ -142,10 +140,16 @@ pub trait Directory: DirectoryClone + fmt::Debug + Send + Sync + 'static {
 /// Opens a writer for the *virtual file* associated with
 /// a Path.
 ///
-/// Right after this call, the file should be created
-/// and any subsequent call to `open_read` for the
+/// Right after this call, for the span of the execution of the program,
+/// the file should be created and any subsequent call to `open_read` for the
 /// same path should return a `FileSlice`.
 ///
+/// However, depending on the directory implementation,
+/// it might be required to call `sync_directory` to ensure
+/// that the file is durably created.
+/// (The semantics here are the same as when dealing with
+/// a posix filesystem.)
+///
 /// Write operations may be aggressively buffered.
 /// The client of this trait is responsible for calling flush
 /// to ensure that subsequent `read` operations
@@ -176,6 +180,12 @@ pub trait Directory: DirectoryClone + fmt::Debug + Send + Sync + 'static {
 /// The file may or may not previously exist.
 fn atomic_write(&self, path: &Path, data: &[u8]) -> io::Result<()>;
+/// Sync the directory.
+///
+/// This call is required to ensure that newly created files are
+/// effectively stored durably.
+fn sync_directory(&self) -> io::Result<()>;
 /// Acquire a lock in the given directory.
 ///
 /// The method is blocking or not depending on the `Lock` object.
@@ -230,3 +240,15 @@
 where
     Box::new(self.clone())
 }
+impl Clone for Box<dyn Directory> {
+    fn clone(&self) -> Self {
+        self.box_clone()
+    }
+}
+impl<T: Directory + 'static> From<T> for Box<dyn Directory> {
+    fn from(t: T) -> Self {
+        Box::new(t)
+    }
+}
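On a POSIX filesystem, the `sync_directory` call documented above corresponds to an fsync of the directory itself: durably creating a file takes both a file-level sync and a directory-level sync, because the new directory entry lives in the parent directory's data. A sketch of that pattern with plain `std::fs` (Unix-specific: opening a directory with `File::open` in order to sync it does not work on Windows):

```rust
use std::fs::File;
use std::io::{self, Write};
use std::path::Path;

/// Durably create `name` inside `dir`:
/// 1. write and fsync the file contents,
/// 2. fsync the parent directory so the directory entry itself is persisted
///    (the step that `Directory::sync_directory` abstracts).
fn create_durably(dir: &Path, name: &str, data: &[u8]) -> io::Result<()> {
    let path = dir.join(name);
    let mut file = File::create(&path)?;
    file.write_all(data)?;
    file.sync_all()?; // fsync the file contents and metadata
    File::open(dir)?.sync_all()?; // fsync the parent directory entry (Unix)
    Ok(())
}
```

Without the second sync, a crash at the wrong moment can leave a fully written file that is nevertheless absent from the directory after recovery.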

View File

@@ -7,8 +7,8 @@ use std::path::PathBuf;
 /// [`LockParams`](./enum.LockParams.html).
 /// Tantivy itself uses only two locks but client applications
 /// can use the directory facility to define their own locks.
-/// - [INDEX_WRITER_LOCK](./struct.INDEX_WRITER_LOCK.html)
-/// - [META_LOCK](./struct.META_LOCK.html)
+/// - [INDEX_WRITER_LOCK]
+/// - [META_LOCK]
 ///
 /// Check out these locks' documentation for more information.
 ///


@@ -39,6 +39,16 @@ pub enum OpenDirectoryError {
     },
 }
+impl OpenDirectoryError {
+    /// Wraps an io error.
+    pub fn wrap_io_error(io_error: io::Error, directory_path: PathBuf) -> Self {
+        Self::IoError {
+            io_error,
+            directory_path,
+        }
+    }
+}
 /// Error that may occur when starting to write in a file
 #[derive(Debug, Error)]
 pub enum OpenWriteError {


@@ -1,7 +1,7 @@
 use stable_deref_trait::StableDeref;
-use crate::common::HasLen;
 use crate::directory::OwnedBytes;
+use common::HasLen;
 use std::fmt;
 use std::ops::Range;
 use std::sync::{Arc, Weak};
@@ -32,12 +32,6 @@ impl FileHandle for &'static [u8] {
     }
 }
-impl<T: Deref<Target = [u8]>> HasLen for T {
-    fn len(&self) -> usize {
-        self.deref().len()
-    }
-}
 impl<B> From<B> for FileSlice
 where
     B: StableDeref + Deref<Target = [u8]> + 'static + Send + Sync,
@@ -72,6 +66,7 @@ impl FileSlice {
 /// Wraps a FileHandle.
 #[doc(hidden)]
+#[must_use]
 pub fn new_with_num_bytes(file_handle: Box<dyn FileHandle>, num_bytes: usize) -> Self {
     FileSlice {
         data: Arc::from(file_handle),
@@ -178,7 +173,7 @@ impl HasLen for FileSlice {
 #[cfg(test)]
 mod tests {
     use super::{FileHandle, FileSlice};
-    use crate::common::HasLen;
+    use common::HasLen;
     use std::io;
     #[test]


@@ -43,14 +43,16 @@ impl FileWatcher {
 thread::Builder::new()
     .name("thread-tantivy-meta-file-watcher".to_string())
     .spawn(move || {
-        let mut current_checksum = None;
+        let mut current_checksum_opt = None;
         while state.load(Ordering::SeqCst) == 1 {
             if let Ok(checksum) = FileWatcher::compute_checksum(&path) {
-                // `None.unwrap_or_else(|| !checksum) != checksum` evaluates to `true`
-                if current_checksum.unwrap_or_else(|| !checksum) != checksum {
+                let metafile_has_changed = current_checksum_opt
+                    .map(|current_checksum| current_checksum != checksum)
+                    .unwrap_or(true);
+                if metafile_has_changed {
                     info!("Meta file {:?} was modified", path);
-                    current_checksum = Some(checksum);
+                    current_checksum_opt = Some(checksum);
                     futures::executor::block_on(callbacks.broadcast());
                 }
             }
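The rewritten watcher condition above replaces the `unwrap_or_else(|| !checksum)` trick with an explicit `metafile_has_changed` flag: a change is reported when there is no previous checksum yet, or when the checksum differs. That logic, isolated into a small testable helper (the function name is illustrative, not tantivy API):

```rust
/// Returns true when `new_checksum` differs from the last seen checksum
/// (or when no checksum has been seen yet), updating the stored value.
fn metafile_has_changed(current_checksum_opt: &mut Option<u32>, new_checksum: u32) -> bool {
    let changed = current_checksum_opt
        .map(|current_checksum| current_checksum != new_checksum)
        .unwrap_or(true); // no previous checksum: treat as changed
    if changed {
        *current_checksum_opt = Some(new_checksum);
    }
    changed
}
```

Making the `None` case explicit avoids the subtle bit-trick of the original code, which relied on `!checksum` never equaling `checksum`.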


@@ -1,10 +1,10 @@
 use crate::directory::error::Incompatibility;
 use crate::directory::FileSlice;
 use crate::{
-    common::{BinarySerializable, CountingWriter, DeserializeFrom, FixedSize, HasLen},
     directory::{AntiCallToken, TerminatingWrite},
     Version, INDEX_FORMAT_VERSION,
 };
+use common::{BinarySerializable, CountingWriter, DeserializeFrom, FixedSize, HasLen};
 use crc32fast::Hasher;
 use serde::{Deserialize, Serialize};
 use std::io;
@@ -156,10 +156,8 @@ mod tests {
 use crate::directory::footer::Footer;
 use crate::directory::OwnedBytes;
-use crate::{
-    common::BinarySerializable,
-    directory::{footer::FOOTER_MAGIC_NUMBER, FileSlice},
-};
+use crate::directory::{footer::FOOTER_MAGIC_NUMBER, FileSlice};
+use common::BinarySerializable;
 use std::io;
 #[test]


@@ -1,4 +1,4 @@
-use crate::core::{MANAGED_FILEPATH, META_FILEPATH};
+use crate::core::MANAGED_FILEPATH;
 use crate::directory::error::{DeleteError, LockError, OpenReadError, OpenWriteError};
 use crate::directory::footer::{Footer, FooterProxy};
 use crate::directory::GarbageCollectionResult;
@@ -64,7 +64,7 @@ fn save_managed_paths(
 impl ManagedDirectory {
     /// Wraps a directory as managed directory.
-    pub fn wrap<Dir: Directory>(directory: Dir) -> crate::Result<ManagedDirectory> {
+    pub fn wrap(directory: Box<dyn Directory>) -> crate::Result<ManagedDirectory> {
         match directory.atomic_read(&MANAGED_FILEPATH) {
             Ok(data) => {
                 let managed_files_json = String::from_utf8_lossy(&data);
@@ -76,14 +76,14 @@ impl ManagedDirectory {
                     )
                 })?;
                 Ok(ManagedDirectory {
-                    directory: Box::new(directory),
+                    directory,
                     meta_informations: Arc::new(RwLock::new(MetaInformation {
                         managed_paths: managed_files,
                     })),
                 })
             }
             Err(OpenReadError::FileDoesNotExist(_)) => Ok(ManagedDirectory {
-                directory: Box::new(directory),
+                directory,
                 meta_informations: Arc::default(),
             }),
             io_err @ Err(OpenReadError::IoError { .. }) => Err(io_err.err().unwrap().into()),
@@ -192,6 +192,7 @@ impl ManagedDirectory {
     for delete_file in &deleted_files {
         managed_paths_write.remove(delete_file);
     }
+    self.directory.sync_directory()?;
     save_managed_paths(self.directory.as_mut(), &meta_informations_wlock)?;
 }
@@ -222,9 +223,22 @@ impl ManagedDirectory {
         .write()
         .expect("Managed file lock poisoned");
     let has_changed = meta_wlock.managed_paths.insert(filepath.to_owned());
-    if has_changed {
-        save_managed_paths(self.directory.as_ref(), &meta_wlock)?;
+    if !has_changed {
+        return Ok(());
     }
+    save_managed_paths(self.directory.as_ref(), &meta_wlock)?;
+    // This is not the first file we add.
+    // Therefore, we are sure that `.managed.json` has already been
+    // properly created and we do not need to sync its parent directory.
+    //
+    // (It might seem like a nicer solution to create the managed_json on the
+    // creation of the ManagedDirectory instance, but it would actually
+    // prevent the use of read-only directories.)
+    let managed_file_definitely_already_exists = meta_wlock.managed_paths.len() > 1;
+    if managed_file_definitely_already_exists {
+        return Ok(());
+    }
+    self.directory.sync_directory()?;
     Ok(())
 }
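The two early returns above encode a small decision table: persist the managed list only when it actually changed, and sync the parent directory only the first time, when `.managed.json` itself may have just been created. That control flow, sketched over a plain `BTreeSet` (names and return shape are illustrative, not the tantivy API):

```rust
use std::collections::BTreeSet;

/// Mirrors the registration logic above.
/// Returns `(saved, synced)`: whether the managed list was persisted,
/// and whether the parent directory also needed a sync.
fn register(managed: &mut BTreeSet<String>, path: &str) -> (bool, bool) {
    let has_changed = managed.insert(path.to_string());
    if !has_changed {
        // Already registered: nothing to persist, nothing to sync.
        return (false, false);
    }
    // With more than one entry, `.managed.json` must already exist on disk,
    // so its directory entry has been synced before.
    let managed_file_definitely_already_exists = managed.len() > 1;
    (true, !managed_file_definitely_already_exists)
}
```

Syncing the directory only on the very first registration keeps the common path cheap while still guaranteeing the managed file's directory entry is durable.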
@@ -248,24 +262,15 @@ impl ManagedDirectory {
Ok(footer.crc() == crc) Ok(footer.crc() == crc)
} }
/// List files for which checksum does not match content /// List all managed files
pub fn list_damaged(&self) -> result::Result<HashSet<PathBuf>, OpenReadError> { pub fn list_managed_files(&self) -> HashSet<PathBuf> {
let mut managed_paths = self let managed_paths = self
.meta_informations .meta_informations
.read() .read()
.expect("Managed directory rlock poisoned in list damaged.") .expect("Managed directory rlock poisoned in list damaged.")
.managed_paths .managed_paths
.clone(); .clone();
managed_paths
managed_paths.remove(*META_FILEPATH);
let mut damaged_files = HashSet::new();
for path in managed_paths {
if !self.validate_checksum(&path)? {
damaged_files.insert(path);
}
}
Ok(damaged_files)
} }
} }
@@ -319,6 +324,11 @@ impl Directory for ManagedDirectory {
fn watch(&self, watch_callback: WatchCallback) -> crate::Result<WatchHandle> { fn watch(&self, watch_callback: WatchCallback) -> crate::Result<WatchHandle> {
self.directory.watch(watch_callback) self.directory.watch(watch_callback)
} }
fn sync_directory(&self) -> io::Result<()> {
self.directory.sync_directory()?;
Ok(())
}
} }
impl Clone for ManagedDirectory { impl Clone for ManagedDirectory {
@@ -336,7 +346,6 @@ mod tests_mmap_specific {
use crate::directory::{Directory, ManagedDirectory, MmapDirectory, TerminatingWrite}; use crate::directory::{Directory, ManagedDirectory, MmapDirectory, TerminatingWrite};
use std::collections::HashSet; use std::collections::HashSet;
use std::fs::OpenOptions;
use std::io::Write; use std::io::Write;
use std::path::{Path, PathBuf}; use std::path::{Path, PathBuf};
use tempfile::TempDir; use tempfile::TempDir;
@@ -350,7 +359,7 @@ mod tests_mmap_specific {
         let test_path2: &'static Path = Path::new("some_path_for_test_2");
         {
             let mmap_directory = MmapDirectory::open(&tempdir_path).unwrap();
-            let mut managed_directory = ManagedDirectory::wrap(mmap_directory).unwrap();
+            let mut managed_directory = ManagedDirectory::wrap(Box::new(mmap_directory)).unwrap();
             let write_file = managed_directory.open_write(test_path1).unwrap();
             write_file.terminate().unwrap();
             managed_directory
@@ -365,7 +374,7 @@ mod tests_mmap_specific {
         }
         {
             let mmap_directory = MmapDirectory::open(&tempdir_path).unwrap();
-            let mut managed_directory = ManagedDirectory::wrap(mmap_directory).unwrap();
+            let mut managed_directory = ManagedDirectory::wrap(Box::new(mmap_directory)).unwrap();
             assert!(managed_directory.exists(test_path1).unwrap());
             assert!(!managed_directory.exists(test_path2).unwrap());
             let living_files: HashSet<PathBuf> = HashSet::new();
@@ -384,7 +393,7 @@ mod tests_mmap_specific {
         let living_files = HashSet::new();
         let mmap_directory = MmapDirectory::open(&tempdir_path).unwrap();
-        let mut managed_directory = ManagedDirectory::wrap(mmap_directory).unwrap();
+        let mut managed_directory = ManagedDirectory::wrap(Box::new(mmap_directory)).unwrap();
         let mut write = managed_directory.open_write(test_path1).unwrap();
         write.write_all(&[0u8, 1u8]).unwrap();
         write.terminate().unwrap();
@@ -405,39 +414,4 @@ mod tests_mmap_specific {
         }
         assert!(!managed_directory.exists(test_path1).unwrap());
     }
-
-    #[test]
-    fn test_checksum() -> crate::Result<()> {
-        let test_path1: &'static Path = Path::new("some_path_for_test");
-        let test_path2: &'static Path = Path::new("other_test_path");
-        let tempdir = TempDir::new().unwrap();
-        let tempdir_path = PathBuf::from(tempdir.path());
-        let mmap_directory = MmapDirectory::open(&tempdir_path)?;
-        let managed_directory = ManagedDirectory::wrap(mmap_directory)?;
-        let mut write = managed_directory.open_write(test_path1)?;
-        write.write_all(&[0u8, 1u8])?;
-        write.terminate()?;
-        let mut write = managed_directory.open_write(test_path2)?;
-        write.write_all(&[3u8, 4u8, 5u8])?;
-        write.terminate()?;
-        let read_file = managed_directory.open_read(test_path2)?.read_bytes()?;
-        assert_eq!(read_file.as_slice(), &[3u8, 4u8, 5u8]);
-        assert!(managed_directory.list_damaged().unwrap().is_empty());
-        let mut corrupted_path = tempdir_path;
-        corrupted_path.push(test_path2);
-        let mut file = OpenOptions::new().write(true).open(&corrupted_path)?;
-        file.write_all(&[255u8])?;
-        file.flush()?;
-        drop(file);
-        let damaged = managed_directory.list_damaged()?;
-        assert_eq!(damaged.len(), 1);
-        assert!(damaged.contains(test_path2));
-        Ok(())
-    }
 }


@@ -11,7 +11,7 @@ use crate::directory::{AntiCallToken, FileHandle, OwnedBytes};
 use crate::directory::{ArcBytes, WeakArcBytes};
 use crate::directory::{TerminatingWrite, WritePtr};
 use fs2::FileExt;
-use memmap::Mmap;
+use memmap2::Mmap;
 use serde::{Deserialize, Serialize};
 use stable_deref_trait::StableDeref;
 use std::convert::From;
@@ -53,7 +53,7 @@ fn open_mmap(full_path: &Path) -> result::Result<Option<Mmap>, OpenReadError> {
         return Ok(None);
     }
     unsafe {
-        memmap::Mmap::map(&file)
+        memmap2::Mmap::map(&file)
             .map(Some)
             .map_err(|io_err| OpenReadError::wrap_io_error(io_err, full_path.to_path_buf()))
     }
@@ -74,20 +74,12 @@ pub struct CacheInfo {
     pub mmapped: Vec<PathBuf>,
 }
+#[derive(Default)]
 struct MmapCache {
     counters: CacheCounters,
     cache: HashMap<PathBuf, WeakArcBytes>,
 }
-impl Default for MmapCache {
-    fn default() -> MmapCache {
-        MmapCache {
-            counters: CacheCounters::default(),
-            cache: HashMap::new(),
-        }
-    }
-}
 impl MmapCache {
     fn get_info(&self) -> CacheInfo {
         let paths: Vec<PathBuf> = self.cache.keys().cloned().collect();
@@ -201,16 +193,19 @@ impl MmapDirectory {
     pub fn open<P: AsRef<Path>>(directory_path: P) -> Result<MmapDirectory, OpenDirectoryError> {
         let directory_path: &Path = directory_path.as_ref();
         if !directory_path.exists() {
-            Err(OpenDirectoryError::DoesNotExist(PathBuf::from(
+            return Err(OpenDirectoryError::DoesNotExist(PathBuf::from(
                 directory_path,
-            )))
-        } else if !directory_path.is_dir() {
-            Err(OpenDirectoryError::NotADirectory(PathBuf::from(
-                directory_path,
-            )))
-        } else {
-            Ok(MmapDirectory::new(PathBuf::from(directory_path), None))
+            )));
         }
+        let canonical_path: PathBuf = directory_path.canonicalize().map_err(|io_err| {
+            OpenDirectoryError::wrap_io_error(io_err, PathBuf::from(directory_path))
+        })?;
+        if !canonical_path.is_dir() {
+            return Err(OpenDirectoryError::NotADirectory(PathBuf::from(
+                directory_path,
+            )));
+        }
+        Ok(MmapDirectory::new(canonical_path, None))
     }
     /// Joins a relative_path to the directory `root_path`
@@ -219,33 +214,6 @@ impl MmapDirectory {
         self.inner.root_path.join(relative_path)
     }
-
-    /// Sync the root directory.
-    /// In certain FS, this is required to persistently create
-    /// a file.
-    fn sync_directory(&self) -> Result<(), io::Error> {
-        let mut open_opts = OpenOptions::new();
-        // Linux needs read to be set, otherwise returns EINVAL
-        // write must not be set, or it fails with EISDIR
-        open_opts.read(true);
-        // On Windows, opening a directory requires FILE_FLAG_BACKUP_SEMANTICS
-        // and calling sync_all() only works if write access is requested.
-        #[cfg(windows)]
-        {
-            use std::os::windows::fs::OpenOptionsExt;
-            use winapi::um::winbase;
-            open_opts
-                .write(true)
-                .custom_flags(winbase::FILE_FLAG_BACKUP_SEMANTICS);
-        }
-        let fd = open_opts.open(&self.inner.root_path)?;
-        fd.sync_all()?;
-        Ok(())
-    }
     /// Returns some statistical information
     /// about the Mmap cache.
     ///
@@ -296,8 +264,7 @@ impl Write for SafeFileWriter {
     }
     fn flush(&mut self) -> io::Result<()> {
-        self.0.flush()?;
-        self.0.sync_all()
+        Ok(())
     }
 }
@@ -309,7 +276,9 @@ impl Seek for SafeFileWriter {
 impl TerminatingWrite for SafeFileWriter {
     fn terminate_ref(&mut self, _: AntiCallToken) -> io::Result<()> {
-        self.flush()
+        self.0.flush()?;
+        self.0.sync_data()?;
+        Ok(())
     }
 }
@@ -339,6 +308,7 @@ pub(crate) fn atomic_write(path: &Path, content: &[u8]) -> io::Result<()> {
     let mut tempfile = tempfile::Builder::new().tempfile_in(&parent_path)?;
     tempfile.write_all(content)?;
     tempfile.flush()?;
+    tempfile.as_file_mut().sync_data()?;
     tempfile.into_temp_path().persist(path)?;
     Ok(())
 }
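The hunk above is the classic atomic-write recipe: write to a temp file, sync its data, then atomically rename it into place. A std-only sketch of the same pattern follows; it uses a sibling `.tmp` path instead of the `tempfile` crate, and the final directory fsync works on POSIX systems (Windows needs the `FILE_FLAG_BACKUP_SEMANTICS` dance shown elsewhere in this diff).

```rust
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

// Write `content` to `path` atomically: readers observe either the old file
// or the complete new one, never a partial write.
fn atomic_write(path: &Path, content: &[u8]) -> std::io::Result<()> {
    let tmp_path = path.with_extension("tmp");
    let mut tmp = File::create(&tmp_path)?;
    tmp.write_all(content)?;
    // Push file data to disk *before* the rename makes it visible.
    tmp.sync_data()?;
    drop(tmp);
    // On POSIX, rename over an existing path is atomic.
    fs::rename(&tmp_path, path)?;
    // Syncing the parent directory persists the rename itself.
    if let Some(parent) = path.parent() {
        File::open(parent)?.sync_all()?;
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir().join("atomic_write_demo");
    fs::create_dir_all(&dir)?;
    atomic_write(&dir.join("meta.json"), b"{\"segments\": []}")?;
    println!("{}", fs::read_to_string(dir.join("meta.json"))?);
    Ok(())
}
```

Without the `sync_data` before the rename, a crash could leave a fully renamed but empty file, which is exactly the failure mode the added line guards against.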
@@ -373,22 +343,17 @@ impl Directory for MmapDirectory {
     /// removed before the file is deleted.
     fn delete(&self, path: &Path) -> result::Result<(), DeleteError> {
         let full_path = self.resolve_path(path);
-        match fs::remove_file(&full_path) {
-            Ok(_) => self.sync_directory().map_err(|e| DeleteError::IoError {
-                io_error: e,
-                filepath: path.to_path_buf(),
-            }),
-            Err(e) => {
-                if e.kind() == io::ErrorKind::NotFound {
-                    Err(DeleteError::FileDoesNotExist(path.to_owned()))
-                } else {
-                    Err(DeleteError::IoError {
-                        io_error: e,
-                        filepath: path.to_path_buf(),
-                    })
-                }
-            }
-        }
+        fs::remove_file(&full_path).map_err(|e| {
+            if e.kind() == io::ErrorKind::NotFound {
+                DeleteError::FileDoesNotExist(path.to_owned())
+            } else {
+                DeleteError::IoError {
+                    io_error: e,
+                    filepath: path.to_path_buf(),
+                }
+            }
+        })?;
+        Ok(())
     }
     fn exists(&self, path: &Path) -> Result<bool, OpenReadError> {
@@ -417,10 +382,13 @@ impl Directory for MmapDirectory {
         file.flush()
             .map_err(|io_error| OpenWriteError::wrap_io_error(io_error, path.to_path_buf()))?;
-        // Apparetntly, on some filesystem syncing the parent
-        // directory is required.
-        self.sync_directory()
-            .map_err(|io_err| OpenWriteError::wrap_io_error(io_err, path.to_path_buf()))?;
+        // Note that we do not sync the parent directory here.
+        //
+        // A newly created file may, in some cases, be created and even
+        // flushed to disk, and then lost...
+        //
+        // The file will only be durably written after we terminate AND
+        // sync_directory() is called.
         let writer = SafeFileWriter::new(file);
         Ok(BufWriter::new(Box::new(writer)))
@@ -450,7 +418,7 @@ impl Directory for MmapDirectory {
         debug!("Atomic Write {:?}", path);
         let full_path = self.resolve_path(path);
         atomic_write(&full_path, content)?;
-        self.sync_directory()
+        Ok(())
     }
     fn acquire_lock(&self, lock: &Lock) -> Result<DirectoryLock, LockError> {
@@ -476,6 +444,30 @@ impl Directory for MmapDirectory {
     fn watch(&self, watch_callback: WatchCallback) -> crate::Result<WatchHandle> {
         Ok(self.inner.watch(watch_callback))
     }
+
+    fn sync_directory(&self) -> Result<(), io::Error> {
+        let mut open_opts = OpenOptions::new();
+        // Linux needs read to be set, otherwise returns EINVAL
+        // write must not be set, or it fails with EISDIR
+        open_opts.read(true);
+        // On Windows, opening a directory requires FILE_FLAG_BACKUP_SEMANTICS
+        // and calling sync_all() only works if write access is requested.
+        #[cfg(windows)]
+        {
+            use std::os::windows::fs::OpenOptionsExt;
+            use winapi::um::winbase;
+            open_opts
+                .write(true)
+                .custom_flags(winbase::FILE_FLAG_BACKUP_SEMANTICS);
+        }
+        let fd = open_opts.open(&self.inner.root_path)?;
+        fd.sync_data()?;
+        Ok(())
+    }
 }
 #[cfg(test)]
@@ -485,13 +477,14 @@ mod tests {
     // The following tests are specific to the MmapDirectory
     use super::*;
+    use crate::indexer::LogMergePolicy;
     use crate::Index;
     use crate::ReloadPolicy;
-    use crate::{common::HasLen, indexer::LogMergePolicy};
     use crate::{
         schema::{Schema, SchemaBuilder, TEXT},
         IndexSettings,
     };
+    use common::HasLen;
     #[test]
     fn test_open_non_existent_path() {
@@ -581,8 +574,8 @@ mod tests {
     }
     #[test]
-    fn test_mmap_released() {
-        let mmap_directory = MmapDirectory::create_from_tempdir().unwrap();
+    fn test_mmap_released() -> crate::Result<()> {
+        let mmap_directory = MmapDirectory::create_from_tempdir()?;
         let mut schema_builder: SchemaBuilder = Schema::builder();
         let text_field = schema_builder.add_text_field("text", TEXT);
         let schema = schema_builder.build();
@@ -591,31 +584,30 @@ mod tests {
         let index =
             Index::create(mmap_directory.clone(), schema, IndexSettings::default()).unwrap();
-        let mut index_writer = index.writer_for_tests().unwrap();
+        let mut index_writer = index.writer_for_tests()?;
         let mut log_merge_policy = LogMergePolicy::default();
         log_merge_policy.set_min_num_segments(3);
         index_writer.set_merge_policy(Box::new(log_merge_policy));
         for _num_commits in 0..10 {
             for _ in 0..10 {
-                index_writer.add_document(doc!(text_field=>"abc"));
+                index_writer.add_document(doc!(text_field=>"abc"))?;
             }
-            index_writer.commit().unwrap();
+            index_writer.commit()?;
         }
         let reader = index
             .reader_builder()
             .reload_policy(ReloadPolicy::Manual)
-            .try_into()
-            .unwrap();
+            .try_into()?;
         for _ in 0..4 {
-            index_writer.add_document(doc!(text_field=>"abc"));
-            index_writer.commit().unwrap();
-            reader.reload().unwrap();
+            index_writer.add_document(doc!(text_field=>"abc"))?;
+            index_writer.commit()?;
+            reader.reload()?;
         }
-        index_writer.wait_merging_threads().unwrap();
-        reader.reload().unwrap();
+        index_writer.wait_merging_threads()?;
+        reader.reload()?;
         let num_segments = reader.searcher().segment_readers().len();
         assert!(num_segments <= 4);
         let num_components_except_deletes_and_tempstore =
@@ -626,5 +618,6 @@ mod tests {
             );
         }
         assert!(mmap_directory.get_cache_info().mmapped.is_empty());
+        Ok(())
     }
 }


@@ -1,6 +1,6 @@
 /*!
-WORM directory abstraction.
+WORM (Write Once Read Many) directory abstraction.
 */
@@ -20,6 +20,9 @@ mod watch_event_router;
 /// Errors specific to the directory module.
 pub mod error;
+
+mod composite_file;
+pub(crate) use self::composite_file::{CompositeFile, CompositeWrite};
 pub use self::directory::DirectoryLock;
 pub use self::directory::{Directory, DirectoryClone};
 pub use self::directory_lock::{Lock, INDEX_WRITER_LOCK, META_LOCK};


@@ -1,9 +1,10 @@
+use crate::core::META_FILEPATH;
 use crate::directory::error::{DeleteError, OpenReadError, OpenWriteError};
 use crate::directory::AntiCallToken;
 use crate::directory::WatchCallbackList;
 use crate::directory::{Directory, FileSlice, WatchCallback, WatchHandle};
 use crate::directory::{TerminatingWrite, WritePtr};
-use crate::{common::HasLen, core::META_FILEPATH};
+use common::HasLen;
 use fail::fail_point;
 use std::collections::HashMap;
 use std::fmt;
@@ -17,13 +18,6 @@ use super::FileHandle;
 /// Writer associated with the `RamDirectory`
 ///
 /// The Writer just writes a buffer.
-///
-/// # Panics
-///
-/// On drop, if the writer was left in a *dirty* state.
-/// That is, if flush was not called after the last call
-/// to write.
-///
 struct VecWriter {
     path: PathBuf,
     shared_directory: RamDirectory,
@@ -45,7 +39,7 @@ impl VecWriter {
 impl Drop for VecWriter {
     fn drop(&mut self) {
         if !self.is_flushed {
-            panic!(
+            warn!(
                 "You forgot to flush {:?} before its writter got Drop. Do not rely on drop. This also occurs when the indexer crashed, so you may want to check the logs for the root cause.",
                 self.path
             )
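The `VecWriter` change above downgrades a panic to a log warning when a writer is dropped with unflushed data. The drop-check idiom itself is easy to reproduce; the `CheckedWriter` type below is a hypothetical sketch of it (using `eprintln!` where tantivy now uses `warn!`), not tantivy's actual `VecWriter`.

```rust
use std::io::{self, Write};

// Track whether the last write was flushed, and complain on drop
// instead of silently losing buffered data.
struct CheckedWriter {
    buf: Vec<u8>,
    is_flushed: bool,
}

impl CheckedWriter {
    fn new() -> CheckedWriter {
        CheckedWriter { buf: Vec::new(), is_flushed: true }
    }
}

impl Write for CheckedWriter {
    fn write(&mut self, data: &[u8]) -> io::Result<usize> {
        self.is_flushed = false;
        self.buf.extend_from_slice(data);
        Ok(data.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        self.is_flushed = true;
        Ok(())
    }
}

impl Drop for CheckedWriter {
    fn drop(&mut self) {
        if !self.is_flushed {
            // The new code logs a warning here; the old code panicked.
            eprintln!("dropped an unflushed writer");
        }
    }
}

fn main() -> io::Result<()> {
    let mut writer = CheckedWriter::new();
    writer.write_all(b"segment data")?;
    writer.flush()?; // without this, drop would warn
    Ok(())
}
```

Warning instead of panicking matters because drops also run during unwinding: panicking in `Drop` while the indexer is already crashing would abort the process and mask the root cause.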
@@ -220,14 +214,8 @@ impl Directory for RamDirectory {
     }
     fn atomic_write(&self, path: &Path, data: &[u8]) -> io::Result<()> {
-        fail_point!("RamDirectory::atomic_write", |msg| Err(io::Error::new(
-            io::ErrorKind::Other,
-            msg.unwrap_or_else(|| "Undefined".to_string())
-        )));
         let path_buf = PathBuf::from(path);
         self.fs.write().unwrap().write(path_buf, data);
         if path == *META_FILEPATH {
             let _ = self.fs.write().unwrap().watch_router.broadcast();
         }
@@ -237,6 +225,10 @@ impl Directory for RamDirectory {
     fn watch(&self, watch_callback: WatchCallback) -> crate::Result<WatchHandle> {
         Ok(self.fs.write().unwrap().watch(watch_callback))
     }
+
+    fn sync_directory(&self) -> io::Result<()> {
+        Ok(())
+    }
 }
 #[cfg(test)]


@@ -118,15 +118,6 @@ mod ram_directory_tests {
         }
     }
-
-    #[test]
-    #[should_panic]
-    fn ram_directory_panics_if_flush_forgotten() {
-        let test_path: &'static Path = Path::new("some_path_for_test");
-        let ram_directory = RamDirectory::create();
-        let mut write_file = ram_directory.open_write(test_path).unwrap();
-        assert!(write_file.write_all(&[4]).is_ok());
-    }
     fn test_simple(directory: &dyn Directory) -> crate::Result<()> {
         let test_path: &'static Path = Path::new("some_path_for_test");
         let mut write_file = directory.open_write(test_path)?;


@@ -1,4 +1,4 @@
-use crate::fastfield::DeleteBitSet;
+use crate::fastfield::AliveBitSet;
 use crate::DocId;
 use std::borrow::Borrow;
 use std::borrow::BorrowMut;
@@ -85,11 +85,11 @@ pub trait DocSet: Send {
     /// Returns the number documents matching.
     /// Calling this method consumes the `DocSet`.
-    fn count(&mut self, delete_bitset: &DeleteBitSet) -> u32 {
+    fn count(&mut self, alive_bitset: &AliveBitSet) -> u32 {
         let mut count = 0u32;
         let mut doc = self.doc();
         while doc != TERMINATED {
-            if !delete_bitset.is_deleted(doc) {
+            if alive_bitset.is_alive(doc) {
                 count += 1u32;
             }
             doc = self.advance();
@@ -130,8 +130,8 @@ impl<'a> DocSet for &'a mut dyn DocSet {
         (**self).size_hint()
     }
-    fn count(&mut self, delete_bitset: &DeleteBitSet) -> u32 {
-        (**self).count(delete_bitset)
+    fn count(&mut self, alive_bitset: &AliveBitSet) -> u32 {
+        (**self).count(alive_bitset)
     }
     fn count_including_deleted(&mut self) -> u32 {
@@ -160,9 +160,9 @@ impl<TDocSet: DocSet + ?Sized> DocSet for Box<TDocSet> {
         unboxed.size_hint()
     }
-    fn count(&mut self, delete_bitset: &DeleteBitSet) -> u32 {
+    fn count(&mut self, alive_bitset: &AliveBitSet) -> u32 {
         let unboxed: &mut TDocSet = self.borrow_mut();
-        unboxed.count(delete_bitset)
+        unboxed.count(alive_bitset)
     }
     fn count_including_deleted(&mut self) -> u32 {
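The `DocSet::count` rename above flips the polarity of the filter: instead of skipping deleted docs, it keeps alive ones. The loop itself reduces to a simple filter-and-count, sketched here with a `Vec<bool>` standing in for tantivy's bitset (a simplification, not the real `AliveBitSet` type):

```rust
// Minimal model of the DocSet::count logic above: walk a posting list and
// count only the documents the alive bitset still contains.
fn count_alive(postings: &[u32], alive: &[bool]) -> u32 {
    postings
        .iter()
        .filter(|&&doc| alive[doc as usize])
        .count() as u32
}

fn main() {
    // Docs 1 and 3 are deleted; the posting list hits docs 0, 1, 2, 4.
    let alive = vec![true, false, true, false, true];
    let postings = vec![0, 1, 2, 4];
    println!("{} matching alive docs", count_alive(&postings, &alive));
}
```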


@@ -0,0 +1,224 @@
use crate::space_usage::ByteCount;
use crate::DocId;
use common::intersect_bitsets;
use common::BitSet;
use common::ReadOnlyBitSet;
use ownedbytes::OwnedBytes;
use std::io;
use std::io::Write;
/// Write an alive `BitSet`
///
/// where `alive_bitset` is the set of alive `DocId`.
/// Warning: this function does not call terminate. The caller is in charge of
/// closing the writer properly.
pub fn write_alive_bitset<T: Write>(alive_bitset: &BitSet, writer: &mut T) -> io::Result<()> {
alive_bitset.serialize(writer)?;
Ok(())
}
/// Set of alive `DocId`s.
#[derive(Clone)]
pub struct AliveBitSet {
num_alive_docs: usize,
bitset: ReadOnlyBitSet,
}
/// Intersects two AliveBitSets in a new one.
/// The two bitsets need to have the same max_value.
pub fn intersect_alive_bitsets(left: AliveBitSet, right: AliveBitSet) -> AliveBitSet {
assert_eq!(left.bitset().max_value(), right.bitset().max_value());
let bitset = intersect_bitsets(left.bitset(), right.bitset());
let num_alive_docs = bitset.len();
AliveBitSet {
num_alive_docs,
bitset,
}
}
impl AliveBitSet {
#[cfg(test)]
pub(crate) fn for_test_from_deleted_docs(deleted_docs: &[DocId], max_doc: u32) -> AliveBitSet {
assert!(deleted_docs.iter().all(|&doc| doc < max_doc));
let mut bitset = BitSet::with_max_value_and_full(max_doc);
for &doc in deleted_docs {
bitset.remove(doc);
}
let mut alive_bitset_buffer = Vec::new();
write_alive_bitset(&bitset, &mut alive_bitset_buffer).unwrap();
let alive_bitset_bytes = OwnedBytes::new(alive_bitset_buffer);
Self::open(alive_bitset_bytes)
}
pub(crate) fn from_bitset(bitset: &BitSet) -> AliveBitSet {
let readonly_bitset = ReadOnlyBitSet::from(bitset);
AliveBitSet::from(readonly_bitset)
}
/// Opens an alive bitset given its serialized bytes.
pub fn open(bytes: OwnedBytes) -> AliveBitSet {
let bitset = ReadOnlyBitSet::open(bytes);
AliveBitSet::from(bitset)
}
/// Returns true iff the document is still "alive". In other words, if it has not been deleted.
#[inline]
pub fn is_alive(&self, doc: DocId) -> bool {
self.bitset.contains(doc)
}
/// Returns true iff the document has been marked as deleted.
#[inline]
pub fn is_deleted(&self, doc: DocId) -> bool {
!self.is_alive(doc)
}
/// Iterate over the alive doc_ids.
#[inline]
pub fn iter_alive(&self) -> impl Iterator<Item = DocId> + '_ {
self.bitset.iter()
}
/// Get underlying bitset
#[inline]
pub fn bitset(&self) -> &ReadOnlyBitSet {
&self.bitset
}
/// The number of alive docs.
pub fn num_alive_docs(&self) -> usize {
self.num_alive_docs
}
/// Summarize total space usage of this bitset.
pub fn space_usage(&self) -> ByteCount {
self.bitset().num_bytes()
}
}
impl From<ReadOnlyBitSet> for AliveBitSet {
fn from(bitset: ReadOnlyBitSet) -> AliveBitSet {
let num_alive_docs = bitset.len();
AliveBitSet {
num_alive_docs,
bitset,
}
}
}
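The `AliveBitSet` contract above (`is_alive`, `is_deleted`, `num_alive_docs`) can be sketched with a plain array of `u64` words standing in for tantivy's `ReadOnlyBitSet`. `TinyAliveBitSet` below is a toy illustration of the bit layout, not the crate's actual implementation:

```rust
// One bit per document: bit set = alive, bit cleared = deleted.
struct TinyAliveBitSet {
    words: Vec<u64>,
}

impl TinyAliveBitSet {
    // Start with every doc in [0, max_doc) alive.
    fn full(max_doc: u32) -> TinyAliveBitSet {
        let num_words = (max_doc as usize + 63) / 64;
        let mut words = vec![u64::MAX; num_words];
        // Clear the unused high bits of the last word.
        let rem = max_doc as usize % 64;
        if rem != 0 {
            words[num_words - 1] = (1u64 << rem) - 1;
        }
        TinyAliveBitSet { words }
    }

    fn delete(&mut self, doc: u32) {
        self.words[(doc / 64) as usize] &= !(1u64 << (doc % 64));
    }

    fn is_alive(&self, doc: u32) -> bool {
        self.words[(doc / 64) as usize] & (1u64 << (doc % 64)) != 0
    }

    // Popcount over the words, as in AliveBitSet::num_alive_docs.
    fn num_alive_docs(&self) -> usize {
        self.words.iter().map(|w| w.count_ones() as usize).sum()
    }
}

fn main() {
    let mut bitset = TinyAliveBitSet::full(10);
    bitset.delete(1);
    bitset.delete(9);
    println!("{} alive", bitset.num_alive_docs()); // 8 alive
}
```

Intersecting two such bitsets (as `intersect_alive_bitsets` does) is then just a word-wise `&` over the two arrays, which is why both sides must share the same `max_value`.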
#[cfg(test)]
mod tests {
use super::AliveBitSet;
#[test]
fn test_alive_bitset_empty() {
let alive_bitset = AliveBitSet::for_test_from_deleted_docs(&[], 10);
for doc in 0..10 {
assert_eq!(alive_bitset.is_deleted(doc), !alive_bitset.is_alive(doc));
assert!(!alive_bitset.is_deleted(doc));
}
assert_eq!(alive_bitset.num_alive_docs(), 10);
}
#[test]
fn test_alive_bitset() {
let alive_bitset = AliveBitSet::for_test_from_deleted_docs(&[1, 9], 10);
assert!(alive_bitset.is_alive(0));
assert!(alive_bitset.is_deleted(1));
assert!(alive_bitset.is_alive(2));
assert!(alive_bitset.is_alive(3));
assert!(alive_bitset.is_alive(4));
assert!(alive_bitset.is_alive(5));
assert!(alive_bitset.is_alive(6));
assert!(alive_bitset.is_alive(6));
assert!(alive_bitset.is_alive(7));
assert!(alive_bitset.is_alive(8));
assert!(alive_bitset.is_deleted(9));
for doc in 0..10 {
assert_eq!(alive_bitset.is_deleted(doc), !alive_bitset.is_alive(doc));
}
assert_eq!(alive_bitset.num_alive_docs(), 8);
}
#[test]
fn test_alive_bitset_iter_minimal() {
let alive_bitset = AliveBitSet::for_test_from_deleted_docs(&[7], 8);
let data: Vec<_> = alive_bitset.iter_alive().collect();
assert_eq!(data, vec![0, 1, 2, 3, 4, 5, 6]);
}
#[test]
fn test_alive_bitset_iter_small() {
let alive_bitset = AliveBitSet::for_test_from_deleted_docs(&[0, 2, 3, 6], 7);
let data: Vec<_> = alive_bitset.iter_alive().collect();
assert_eq!(data, vec![1, 4, 5]);
}
#[test]
fn test_alive_bitset_iter() {
let alive_bitset = AliveBitSet::for_test_from_deleted_docs(&[0, 1, 1000], 1001);
let data: Vec<_> = alive_bitset.iter_alive().collect();
assert_eq!(data, (2..=999).collect::<Vec<_>>());
}
}
#[cfg(all(test, feature = "unstable"))]
mod bench {
use super::AliveBitSet;
use rand::prelude::IteratorRandom;
use rand::thread_rng;
use test::Bencher;
fn get_alive() -> Vec<u32> {
let mut data = (0..1_000_000_u32).collect::<Vec<u32>>();
for _ in 0..(1_000_000) * 1 / 8 {
remove_rand(&mut data);
}
data
}
fn remove_rand(raw: &mut Vec<u32>) {
let i = (0..raw.len()).choose(&mut thread_rng()).unwrap();
raw.remove(i);
}
#[bench]
fn bench_deletebitset_iter_deser_on_fly(bench: &mut Bencher) {
let alive_bitset = AliveBitSet::for_test_from_deleted_docs(&[0, 1, 1000, 10000], 1_000_000);
bench.iter(|| alive_bitset.iter_alive().collect::<Vec<_>>());
}
#[bench]
fn bench_deletebitset_access(bench: &mut Bencher) {
let alive_bitset = AliveBitSet::for_test_from_deleted_docs(&[0, 1, 1000, 10000], 1_000_000);
bench.iter(|| {
(0..1_000_000_u32)
.filter(|doc| alive_bitset.is_alive(*doc))
.collect::<Vec<_>>()
});
}
#[bench]
fn bench_deletebitset_iter_deser_on_fly_1_8_alive(bench: &mut Bencher) {
let alive_bitset = AliveBitSet::for_test_from_deleted_docs(&get_alive(), 1_000_000);
bench.iter(|| alive_bitset.iter_alive().collect::<Vec<_>>());
}
#[bench]
fn bench_deletebitset_access_1_8_alive(bench: &mut Bencher) {
let alive_bitset = AliveBitSet::for_test_from_deleted_docs(&get_alive(), 1_000_000);
bench.iter(|| {
(0..1_000_000_u32)
.filter(|doc| alive_bitset.is_alive(*doc))
.collect::<Vec<_>>()
});
}
}


@@ -18,11 +18,11 @@ mod tests {
         let schema = schema_builder.build();
         let index = Index::create_in_ram(schema);
         let mut index_writer = index.writer_for_tests()?;
-        index_writer.add_document(doc!(bytes_field=>vec![0u8, 1, 2, 3]));
-        index_writer.add_document(doc!(bytes_field=>vec![]));
-        index_writer.add_document(doc!(bytes_field=>vec![255u8]));
-        index_writer.add_document(doc!(bytes_field=>vec![1u8, 3, 5, 7, 9]));
-        index_writer.add_document(doc!(bytes_field=>vec![0u8; 1000]));
+        index_writer.add_document(doc!(bytes_field=>vec![0u8, 1, 2, 3]))?;
+        index_writer.add_document(doc!(bytes_field=>vec![]))?;
+        index_writer.add_document(doc!(bytes_field=>vec![255u8]))?;
+        index_writer.add_document(doc!(bytes_field=>vec![1u8, 3, 5, 7, 9]))?;
+        index_writer.add_document(doc!(bytes_field=>vec![0u8; 1000]))?;
         index_writer.commit()?;
         let searcher = index.reader()?.searcher();
         let segment_reader = searcher.segment_reader(0);
@@ -47,7 +47,7 @@ mod tests {
         index_writer.add_document(doc!(
             field => b"tantivy".as_ref(),
             field => b"lucene".as_ref()
-        ));
+        ))?;
         index_writer.commit()?;
         Ok(index.reader()?.searcher())
     }


@@ -1,143 +0,0 @@
use crate::common::{BitSet, HasLen};
use crate::directory::FileSlice;
use crate::directory::OwnedBytes;
use crate::directory::WritePtr;
use crate::space_usage::ByteCount;
use crate::DocId;
use std::io;
use std::io::Write;
/// Write a delete `BitSet`
///
/// where `delete_bitset` is the set of deleted `DocId`.
/// Warning: this function does not call terminate. The caller is in charge of
/// closing the writer properly.
pub fn write_delete_bitset(
delete_bitset: &BitSet,
max_doc: u32,
writer: &mut WritePtr,
) -> io::Result<()> {
let mut byte = 0u8;
let mut shift = 0u8;
for doc in 0..max_doc {
if delete_bitset.contains(doc) {
byte |= 1 << shift;
}
if shift == 7 {
writer.write_all(&[byte])?;
shift = 0;
byte = 0;
} else {
shift += 1;
}
}
if max_doc % 8 > 0 {
writer.write_all(&[byte])?;
}
Ok(())
}
/// Set of deleted `DocId`s.
#[derive(Clone)]
pub struct DeleteBitSet {
data: OwnedBytes,
num_deleted: usize,
}
impl DeleteBitSet {
#[cfg(test)]
pub(crate) fn for_test(docs: &[DocId], max_doc: u32) -> DeleteBitSet {
use crate::directory::{Directory, RamDirectory, TerminatingWrite};
use std::path::Path;
assert!(docs.iter().all(|&doc| doc < max_doc));
let mut bitset = BitSet::with_max_value(max_doc);
for &doc in docs {
bitset.insert(doc);
}
let directory = RamDirectory::create();
let path = Path::new("dummydeletebitset");
let mut wrt = directory.open_write(path).unwrap();
write_delete_bitset(&bitset, max_doc, &mut wrt).unwrap();
wrt.terminate().unwrap();
let file = directory.open_read(path).unwrap();
Self::open(file).unwrap()
}
/// Opens a delete bitset given its file.
pub fn open(file: FileSlice) -> crate::Result<DeleteBitSet> {
let bytes = file.read_bytes()?;
let num_deleted: usize = bytes
.as_slice()
.iter()
.map(|b| b.count_ones() as usize)
.sum();
Ok(DeleteBitSet {
data: bytes,
num_deleted,
})
}
/// Returns true iff the document is still "alive". In other words, if it has not been deleted.
pub fn is_alive(&self, doc: DocId) -> bool {
!self.is_deleted(doc)
}
/// Returns true iff the document has been marked as deleted.
#[inline]
pub fn is_deleted(&self, doc: DocId) -> bool {
let byte_offset = doc / 8u32;
let b: u8 = self.data.as_slice()[byte_offset as usize];
let shift = (doc & 7u32) as u8;
b & (1u8 << shift) != 0
}
/// The number of deleted docs
pub fn num_deleted(&self) -> usize {
self.num_deleted
}
/// Summarize total space usage of this bitset.
pub fn space_usage(&self) -> ByteCount {
self.data.len()
}
}
impl HasLen for DeleteBitSet {
fn len(&self) -> usize {
self.num_deleted
}
}
#[cfg(test)]
mod tests {
use super::DeleteBitSet;
use crate::common::HasLen;
#[test]
fn test_delete_bitset_empty() {
let delete_bitset = DeleteBitSet::for_test(&[], 10);
for doc in 0..10 {
assert_eq!(delete_bitset.is_deleted(doc), !delete_bitset.is_alive(doc));
}
assert_eq!(delete_bitset.len(), 0);
}
#[test]
fn test_delete_bitset() {
let delete_bitset = DeleteBitSet::for_test(&[1, 9], 10);
assert!(delete_bitset.is_alive(0));
assert!(delete_bitset.is_deleted(1));
assert!(delete_bitset.is_alive(2));
assert!(delete_bitset.is_alive(3));
assert!(delete_bitset.is_alive(4));
assert!(delete_bitset.is_alive(5));
assert!(delete_bitset.is_alive(6));
assert!(delete_bitset.is_alive(6));
assert!(delete_bitset.is_alive(7));
assert!(delete_bitset.is_alive(8));
assert!(delete_bitset.is_deleted(9));
for doc in 0..10 {
assert_eq!(delete_bitset.is_deleted(doc), !delete_bitset.is_alive(doc));
}
assert_eq!(delete_bitset.len(), 2);
}
}
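The removed `delete_bitset.rs` above serialized one bit per document, least-significant bit first, and `is_deleted` read it back with `data[doc / 8] & (1 << (doc % 8))`. That packing scheme round-trips in a few lines; the sketch below reproduces the layout outside tantivy (setting bits directly rather than streaming bytes as `write_delete_bitset` did):

```rust
// Pack a list of deleted doc ids into a bitmap, one bit per doc in
// [0, max_doc), least-significant bit first within each byte.
fn pack_bits(deleted: &[u32], max_doc: u32) -> Vec<u8> {
    let mut bytes = vec![0u8; (max_doc as usize + 7) / 8];
    for &doc in deleted {
        bytes[(doc / 8) as usize] |= 1u8 << (doc % 8);
    }
    bytes
}

// Same lookup as the removed DeleteBitSet::is_deleted.
fn is_deleted(bytes: &[u8], doc: u32) -> bool {
    bytes[(doc / 8) as usize] & (1u8 << (doc % 8)) != 0
}

fn main() {
    let bytes = pack_bits(&[1, 9], 10);
    // 10 docs round up to 2 bytes, matching the `max_doc % 8 > 0` tail write.
    println!(
        "len={} deleted(1)={} deleted(2)={}",
        bytes.len(),
        is_deleted(&bytes, 1),
        is_deleted(&bytes, 2)
    );
}
```

The replacement `AliveBitSet` keeps the same one-bit-per-doc idea but inverts the polarity (bit set = alive) and delegates the word layout to `common::BitSet`.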


@@ -84,18 +84,18 @@ impl FacetReader {
 mod tests {
     use crate::Index;
     use crate::{
-        schema::{Facet, FacetOptions, SchemaBuilder, Value, INDEXED, STORED},
+        schema::{Facet, FacetOptions, SchemaBuilder, Value, STORED},
         DocAddress, Document,
     };
     #[test]
     fn test_facet_only_indexed() -> crate::Result<()> {
         let mut schema_builder = SchemaBuilder::default();
-        let facet_field = schema_builder.add_facet_field("facet", INDEXED);
+        let facet_field = schema_builder.add_facet_field("facet", FacetOptions::default());
         let schema = schema_builder.build();
         let index = Index::create_in_ram(schema);
         let mut index_writer = index.writer_for_tests()?;
-        index_writer.add_document(doc!(facet_field=>Facet::from_text("/a/b").unwrap()));
+        index_writer.add_document(doc!(facet_field=>Facet::from_text("/a/b").unwrap()))?;
         index_writer.commit()?;
         let searcher = index.reader()?.searcher();
         let facet_reader = searcher
@@ -106,42 +106,19 @@ mod tests {
         facet_reader.facet_ords(0u32, &mut facet_ords);
         assert_eq!(&facet_ords, &[2u64]);
         let doc = searcher.doc(DocAddress::new(0u32, 0u32))?;
-        let value = doc.get_first(facet_field).and_then(Value::path);
+        let value = doc.get_first(facet_field).and_then(Value::facet);
         assert_eq!(value, None);
         Ok(())
     }
#[test]
fn test_facet_only_stored() -> crate::Result<()> {
let mut schema_builder = SchemaBuilder::default();
let facet_field = schema_builder.add_facet_field("facet", STORED);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests()?;
index_writer.add_document(doc!(facet_field=>Facet::from_text("/a/b").unwrap()));
index_writer.commit()?;
let searcher = index.reader()?.searcher();
let facet_reader = searcher
.segment_reader(0u32)
.facet_reader(facet_field)
.unwrap();
let mut facet_ords = Vec::new();
facet_reader.facet_ords(0u32, &mut facet_ords);
assert!(facet_ords.is_empty());
let doc = searcher.doc(DocAddress::new(0u32, 0u32))?;
let value = doc.get_first(facet_field).and_then(Value::path);
assert_eq!(value, Some("/a/b".to_string()));
Ok(())
}
#[test] #[test]
fn test_facet_stored_and_indexed() -> crate::Result<()> { fn test_facet_stored_and_indexed() -> crate::Result<()> {
let mut schema_builder = SchemaBuilder::default(); let mut schema_builder = SchemaBuilder::default();
let facet_field = schema_builder.add_facet_field("facet", STORED | INDEXED); let facet_field = schema_builder.add_facet_field("facet", STORED);
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests()?; let mut index_writer = index.writer_for_tests()?;
index_writer.add_document(doc!(facet_field=>Facet::from_text("/a/b").unwrap())); index_writer.add_document(doc!(facet_field=>Facet::from_text("/a/b").unwrap()))?;
index_writer.commit()?; index_writer.commit()?;
let searcher = index.reader()?.searcher(); let searcher = index.reader()?.searcher();
let facet_reader = searcher let facet_reader = searcher
@@ -152,43 +129,20 @@ mod tests {
facet_reader.facet_ords(0u32, &mut facet_ords); facet_reader.facet_ords(0u32, &mut facet_ords);
assert_eq!(&facet_ords, &[2u64]); assert_eq!(&facet_ords, &[2u64]);
let doc = searcher.doc(DocAddress::new(0u32, 0u32))?; let doc = searcher.doc(DocAddress::new(0u32, 0u32))?;
let value = doc.get_first(facet_field).and_then(Value::path); let value: Option<&Facet> = doc.get_first(facet_field).and_then(Value::facet);
assert_eq!(value, Some("/a/b".to_string())); assert_eq!(value, Facet::from_text("/a/b").ok().as_ref());
Ok(())
}
#[test]
fn test_facet_neither_stored_and_indexed() -> crate::Result<()> {
let mut schema_builder = SchemaBuilder::default();
let facet_field = schema_builder.add_facet_field("facet", FacetOptions::default());
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests()?;
index_writer.add_document(doc!(facet_field=>Facet::from_text("/a/b").unwrap()));
index_writer.commit()?;
let searcher = index.reader()?.searcher();
let facet_reader = searcher
.segment_reader(0u32)
.facet_reader(facet_field)
.unwrap();
let mut facet_ords = Vec::new();
facet_reader.facet_ords(0u32, &mut facet_ords);
assert!(facet_ords.is_empty());
let doc = searcher.doc(DocAddress::new(0u32, 0u32))?;
let value = doc.get_first(facet_field).and_then(Value::path);
assert_eq!(value, None);
Ok(()) Ok(())
} }
#[test] #[test]
fn test_facet_not_populated_for_all_docs() -> crate::Result<()> { fn test_facet_not_populated_for_all_docs() -> crate::Result<()> {
let mut schema_builder = SchemaBuilder::default(); let mut schema_builder = SchemaBuilder::default();
let facet_field = schema_builder.add_facet_field("facet", INDEXED); let facet_field = schema_builder.add_facet_field("facet", FacetOptions::default());
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests()?; let mut index_writer = index.writer_for_tests()?;
index_writer.add_document(doc!(facet_field=>Facet::from_text("/a/b").unwrap())); index_writer.add_document(doc!(facet_field=>Facet::from_text("/a/b").unwrap()))?;
index_writer.add_document(Document::default()); index_writer.add_document(Document::default())?;
index_writer.commit()?; index_writer.commit()?;
let searcher = index.reader()?.searcher(); let searcher = index.reader()?.searcher();
let facet_reader = searcher let facet_reader = searcher
@@ -206,12 +160,12 @@ mod tests {
#[test] #[test]
fn test_facet_not_populated_for_any_docs() -> crate::Result<()> { fn test_facet_not_populated_for_any_docs() -> crate::Result<()> {
let mut schema_builder = SchemaBuilder::default(); let mut schema_builder = SchemaBuilder::default();
let facet_field = schema_builder.add_facet_field("facet", INDEXED); let facet_field = schema_builder.add_facet_field("facet", FacetOptions::default());
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests()?; let mut index_writer = index.writer_for_tests()?;
index_writer.add_document(Document::default()); index_writer.add_document(Document::default())?;
index_writer.add_document(Document::default()); index_writer.add_document(Document::default())?;
index_writer.commit()?; index_writer.commit()?;
let searcher = index.reader()?.searcher(); let searcher = index.reader()?.searcher();
let facet_reader = searcher let facet_reader = searcher


@@ -23,9 +23,10 @@ values stored.
 Read access performance is comparable to that of an array lookup.
 */
+pub use self::alive_bitset::intersect_alive_bitsets;
+pub use self::alive_bitset::write_alive_bitset;
+pub use self::alive_bitset::AliveBitSet;
 pub use self::bytes::{BytesFastFieldReader, BytesFastFieldWriter};
-pub use self::delete::write_delete_bitset;
-pub use self::delete::DeleteBitSet;
 pub use self::error::{FastFieldNotAvailableError, Result};
 pub use self::facet_reader::FacetReader;
 pub use self::multivalued::{MultiValuedFastFieldReader, MultiValuedFastFieldWriter};
@@ -40,14 +41,14 @@ pub use self::writer::{FastFieldsWriter, IntFastFieldWriter};
 use crate::schema::Cardinality;
 use crate::schema::FieldType;
 use crate::schema::Value;
-use crate::DocId;
 use crate::{
     chrono::{NaiveDateTime, Utc},
     schema::Type,
 };
+use crate::{common, DocId};
+mod alive_bitset;
 mod bytes;
-mod delete;
 mod error;
 mod facet_reader;
 mod multivalued;
@@ -109,7 +110,7 @@ impl FastValue for u64 {
 fn fast_field_cardinality(field_type: &FieldType) -> Option<Cardinality> {
     match *field_type {
         FieldType::U64(ref integer_options) => integer_options.get_fastfield_cardinality(),
-        FieldType::HierarchicalFacet(_) => Some(Cardinality::MultiValues),
+        FieldType::Facet(_) => Some(Cardinality::MultiValues),
         _ => None,
     }
 }
@@ -213,8 +214,7 @@ fn value_to_u64(value: &Value) -> u64 {
 mod tests {
     use super::*;
-    use crate::common::CompositeFile;
-    use crate::common::HasLen;
+    use crate::directory::CompositeFile;
     use crate::directory::{Directory, RamDirectory, WritePtr};
     use crate::merge_policy::NoMergePolicy;
     use crate::schema::Field;
@@ -222,6 +222,7 @@ mod tests {
     use crate::schema::FAST;
     use crate::schema::{Document, IntOptions};
     use crate::{Index, SegmentId, SegmentReader};
+    use common::HasLen;
     use once_cell::sync::Lazy;
     use rand::prelude::SliceRandom;
     use rand::rngs::StdRng;
@@ -496,18 +497,18 @@ mod tests {
     }
     #[test]
-    fn test_merge_missing_date_fast_field() {
+    fn test_merge_missing_date_fast_field() -> crate::Result<()> {
         let mut schema_builder = Schema::builder();
         let date_field = schema_builder.add_date_field("date", FAST);
         let schema = schema_builder.build();
         let index = Index::create_in_ram(schema);
         let mut index_writer = index.writer_for_tests().unwrap();
         index_writer.set_merge_policy(Box::new(NoMergePolicy));
-        index_writer.add_document(doc!(date_field =>crate::chrono::prelude::Utc::now()));
-        index_writer.commit().unwrap();
-        index_writer.add_document(doc!());
-        index_writer.commit().unwrap();
-        let reader = index.reader().unwrap();
+        index_writer.add_document(doc!(date_field =>crate::chrono::prelude::Utc::now()))?;
+        index_writer.commit()?;
+        index_writer.add_document(doc!())?;
+        index_writer.commit()?;
+        let reader = index.reader()?;
         let segment_ids: Vec<SegmentId> = reader
             .searcher()
             .segment_readers()
@@ -516,10 +517,10 @@ mod tests {
             .collect();
         assert_eq!(segment_ids.len(), 2);
         let merge_future = index_writer.merge(&segment_ids[..]);
-        let merge_res = futures::executor::block_on(merge_future);
-        assert!(merge_res.is_ok());
-        assert!(reader.reload().is_ok());
+        futures::executor::block_on(merge_future)?;
+        reader.reload()?;
         assert_eq!(reader.searcher().segment_readers().len(), 1);
+        Ok(())
     }
     #[test]
@@ -528,7 +529,7 @@ mod tests {
     }
     #[test]
-    fn test_datefastfield() {
+    fn test_datefastfield() -> crate::Result<()> {
         use crate::fastfield::FastValue;
         let mut schema_builder = Schema::builder();
         let date_field = schema_builder.add_date_field("date", FAST);
@@ -538,22 +539,22 @@ mod tests {
         );
         let schema = schema_builder.build();
         let index = Index::create_in_ram(schema);
-        let mut index_writer = index.writer_for_tests().unwrap();
+        let mut index_writer = index.writer_for_tests()?;
         index_writer.set_merge_policy(Box::new(NoMergePolicy));
         index_writer.add_document(doc!(
             date_field => crate::DateTime::from_u64(1i64.to_u64()),
             multi_date_field => crate::DateTime::from_u64(2i64.to_u64()),
            multi_date_field => crate::DateTime::from_u64(3i64.to_u64())
-        ));
+        ))?;
         index_writer.add_document(doc!(
             date_field => crate::DateTime::from_u64(4i64.to_u64())
-        ));
+        ))?;
         index_writer.add_document(doc!(
             multi_date_field => crate::DateTime::from_u64(5i64.to_u64()),
             multi_date_field => crate::DateTime::from_u64(6i64.to_u64())
-        ));
-        index_writer.commit().unwrap();
-        let reader = index.reader().unwrap();
+        ))?;
+        index_writer.commit()?;
+        let reader = index.reader()?;
         let searcher = reader.searcher();
         assert_eq!(searcher.segment_readers().len(), 1);
         let segment_reader = searcher.segment_reader(0);
@@ -580,6 +581,7 @@ mod tests {
         assert_eq!(dates[0].timestamp(), 5i64);
         assert_eq!(dates[1].timestamp(), 6i64);
     }
+        Ok(())
     }
 }
@@ -588,7 +590,7 @@ mod bench {
     use super::tests::FIELD;
     use super::tests::{generate_permutation, SCHEMA};
     use super::*;
-    use crate::common::CompositeFile;
+    use crate::directory::CompositeFile;
     use crate::directory::{Directory, RamDirectory, WritePtr};
     use crate::fastfield::FastFieldReader;
     use std::collections::HashMap;


@@ -8,17 +8,25 @@ pub use self::writer::MultiValuedFastFieldWriter;
 mod tests {
     use crate::collector::TopDocs;
+    use crate::indexer::NoMergePolicy;
     use crate::query::QueryParser;
     use crate::schema::Cardinality;
     use crate::schema::Facet;
+    use crate::schema::FacetOptions;
     use crate::schema::IntOptions;
     use crate::schema::Schema;
-    use crate::schema::INDEXED;
+    use crate::Document;
     use crate::Index;
+    use crate::Term;
     use chrono::Duration;
+    use futures::executor::block_on;
+    use proptest::prop_oneof;
+    use proptest::proptest;
+    use proptest::strategy::Strategy;
+    use test_log::test;
     #[test]
-    fn test_multivalued_u64() {
+    fn test_multivalued_u64() -> crate::Result<()> {
         let mut schema_builder = Schema::builder();
         let field = schema_builder.add_u64_field(
             "multifield",
@@ -26,17 +34,17 @@ mod tests {
         );
         let schema = schema_builder.build();
         let index = Index::create_in_ram(schema);
-        let mut index_writer = index.writer_for_tests().unwrap();
-        index_writer.add_document(doc!(field=>1u64, field=>3u64));
-        index_writer.add_document(doc!());
-        index_writer.add_document(doc!(field=>4u64));
-        index_writer.add_document(doc!(field=>5u64, field=>20u64,field=>1u64));
-        assert!(index_writer.commit().is_ok());
-        let searcher = index.reader().unwrap().searcher();
+        let mut index_writer = index.writer_for_tests()?;
+        index_writer.add_document(doc!(field=>1u64, field=>3u64))?;
+        index_writer.add_document(doc!())?;
+        index_writer.add_document(doc!(field=>4u64))?;
+        index_writer.add_document(doc!(field=>5u64, field=>20u64,field=>1u64))?;
+        index_writer.commit()?;
+        let searcher = index.reader()?.searcher();
         let segment_reader = searcher.segment_reader(0);
         let mut vals = Vec::new();
-        let multi_value_reader = segment_reader.fast_fields().u64s(field).unwrap();
+        let multi_value_reader = segment_reader.fast_fields().u64s(field)?;
         {
             multi_value_reader.get_vals(2, &mut vals);
             assert_eq!(&vals, &[4u64]);
@@ -49,56 +57,55 @@ mod tests {
             multi_value_reader.get_vals(1, &mut vals);
             assert!(vals.is_empty());
         }
+        Ok(())
     }
     #[test]
-    fn test_multivalued_date() {
+    fn test_multivalued_date() -> crate::Result<()> {
         let mut schema_builder = Schema::builder();
         let date_field = schema_builder.add_date_field(
             "multi_date_field",
             IntOptions::default()
                 .set_fast(Cardinality::MultiValues)
                 .set_indexed()
-                .set_fieldnorm()
                 .set_stored(),
         );
         let time_i =
             schema_builder.add_i64_field("time_stamp_i", IntOptions::default().set_stored());
         let schema = schema_builder.build();
         let index = Index::create_in_ram(schema);
-        let mut index_writer = index.writer_for_tests().unwrap();
+        let mut index_writer = index.writer_for_tests()?;
         let first_time_stamp = chrono::Utc::now();
         index_writer.add_document(
             doc!(date_field=>first_time_stamp, date_field=>first_time_stamp, time_i=>1i64),
-        );
-        index_writer.add_document(doc!(time_i=>0i64));
+        )?;
+        index_writer.add_document(doc!(time_i=>0i64))?;
         // add one second
-        index_writer
-            .add_document(doc!(date_field=>first_time_stamp + Duration::seconds(1), time_i=>2i64));
+        index_writer.add_document(
+            doc!(date_field=>first_time_stamp + Duration::seconds(1), time_i=>2i64),
+        )?;
         // add another second
         let two_secs_ahead = first_time_stamp + Duration::seconds(2);
-        index_writer.add_document(doc!(date_field=>two_secs_ahead, date_field=>two_secs_ahead,date_field=>two_secs_ahead, time_i=>3i64));
+        index_writer.add_document(doc!(date_field=>two_secs_ahead, date_field=>two_secs_ahead,date_field=>two_secs_ahead, time_i=>3i64))?;
         // add three seconds
-        index_writer
-            .add_document(doc!(date_field=>first_time_stamp + Duration::seconds(3), time_i=>4i64));
-        assert!(index_writer.commit().is_ok());
-        let reader = index.reader().unwrap();
+        index_writer.add_document(
+            doc!(date_field=>first_time_stamp + Duration::seconds(3), time_i=>4i64),
+        )?;
+        index_writer.commit()?;
+        let reader = index.reader()?;
         let searcher = reader.searcher();
         let reader = searcher.segment_reader(0);
         assert_eq!(reader.num_docs(), 5);
         {
             let parser = QueryParser::for_index(&index, vec![date_field]);
-            let query = parser
-                .parse_query(&format!("\"{}\"", first_time_stamp.to_rfc3339()))
-                .expect("could not parse query");
-            let results = searcher
-                .search(&query, &TopDocs::with_limit(5))
-                .expect("could not query index");
+            let query = parser.parse_query(&format!("\"{}\"", first_time_stamp.to_rfc3339()))?;
+            let results = searcher.search(&query, &TopDocs::with_limit(5))?;
             assert_eq!(results.len(), 1);
             for (_score, doc_address) in results {
-                let retrieved_doc = searcher.doc(doc_address).expect("cannot fetch doc");
+                let retrieved_doc = searcher.doc(doc_address)?;
                 assert_eq!(
                     retrieved_doc
                         .get_first(date_field)
@@ -120,12 +127,8 @@ mod tests {
         {
             let parser = QueryParser::for_index(&index, vec![date_field]);
-            let query = parser
-                .parse_query(&format!("\"{}\"", two_secs_ahead.to_rfc3339()))
-                .expect("could not parse query");
-            let results = searcher
-                .search(&query, &TopDocs::with_limit(5))
-                .expect("could not query index");
+            let query = parser.parse_query(&format!("\"{}\"", two_secs_ahead.to_rfc3339()))?;
+            let results = searcher.search(&query, &TopDocs::with_limit(5))?;
             assert_eq!(results.len(), 1);
@@ -157,10 +160,8 @@ mod tests {
                 (first_time_stamp + Duration::seconds(1)).to_rfc3339(),
                 (first_time_stamp + Duration::seconds(3)).to_rfc3339()
             );
-            let query = parser.parse_query(&range_q).expect("could not parse query");
-            let results = searcher
-                .search(&query, &TopDocs::with_limit(5))
-                .expect("could not query index");
+            let query = parser.parse_query(&range_q)?;
+            let results = searcher.search(&query, &TopDocs::with_limit(5))?;
             assert_eq!(results.len(), 2);
             for (i, doc_pair) in results.iter().enumerate() {
@@ -188,16 +189,16 @@ mod tests {
                     retrieved_doc
                         .get_first(time_i)
                         .expect("cannot find value")
-                        .i64_value()
-                        .expect("value not of i64 type"),
-                    time_i_val
+                        .i64_value(),
+                    Some(time_i_val)
                 );
             }
         }
+        Ok(())
     }
     #[test]
-    fn test_multivalued_i64() {
+    fn test_multivalued_i64() -> crate::Result<()> {
         let mut schema_builder = Schema::builder();
         let field = schema_builder.add_i64_field(
             "multifield",
@@ -205,14 +206,14 @@ mod tests {
         );
         let schema = schema_builder.build();
         let index = Index::create_in_ram(schema);
-        let mut index_writer = index.writer_for_tests().unwrap();
-        index_writer.add_document(doc!(field=> 1i64, field => 3i64));
-        index_writer.add_document(doc!());
-        index_writer.add_document(doc!(field=> -4i64));
-        index_writer.add_document(doc!(field=> -5i64, field => -20i64, field=>1i64));
-        assert!(index_writer.commit().is_ok());
-        let searcher = index.reader().unwrap().searcher();
+        let mut index_writer = index.writer_for_tests()?;
+        index_writer.add_document(doc!(field=> 1i64, field => 3i64))?;
+        index_writer.add_document(doc!())?;
+        index_writer.add_document(doc!(field=> -4i64))?;
+        index_writer.add_document(doc!(field=> -5i64, field => -20i64, field=>1i64))?;
+        index_writer.commit()?;
+        let searcher = index.reader()?.searcher();
         let segment_reader = searcher.segment_reader(0);
         let mut vals = Vec::new();
         let multi_value_reader = segment_reader.fast_fields().i64s(field).unwrap();
@@ -224,18 +225,125 @@ mod tests {
         assert!(vals.is_empty());
         multi_value_reader.get_vals(3, &mut vals);
         assert_eq!(&vals, &[-5i64, -20i64, 1i64]);
+        Ok(())
     }
-    #[test]
-    #[ignore]
-    fn test_many_facets() {
-        let mut schema_builder = Schema::builder();
-        let field = schema_builder.add_facet_field("facetfield", INDEXED);
-        let schema = schema_builder.build();
-        let index = Index::create_in_ram(schema);
-        let mut index_writer = index.writer_for_tests().unwrap();
-        for i in 0..100_000 {
-            index_writer.add_document(doc!(field=> Facet::from(format!("/lang/{}", i).as_str())));
-        }
-        assert!(index_writer.commit().is_ok());
-    }
+    fn test_multivalued_no_panic(ops: &[IndexingOp]) -> crate::Result<()> {
+        let mut schema_builder = Schema::builder();
+        let field = schema_builder.add_u64_field(
+            "multifield",
+            IntOptions::default()
+                .set_fast(Cardinality::MultiValues)
+                .set_indexed(),
+        );
+        let schema = schema_builder.build();
+        let index = Index::create_in_ram(schema);
+        let mut index_writer = index.writer_for_tests()?;
+        index_writer.set_merge_policy(Box::new(NoMergePolicy));
+        for &op in ops {
+            match op {
+                IndexingOp::AddDoc { id } => {
+                    match id % 3 {
+                        0 => {
+                            index_writer.add_document(doc!())?;
+                        }
+                        1 => {
+                            let mut doc = Document::new();
+                            for _ in 0..5001 {
+                                doc.add_u64(field, id as u64);
+                            }
+                            index_writer.add_document(doc)?;
+                        }
+                        _ => {
+                            let mut doc = Document::new();
+                            doc.add_u64(field, id as u64);
+                            index_writer.add_document(doc)?;
+                        }
+                    };
+                }
+                IndexingOp::DeleteDoc { id } => {
+                    index_writer.delete_term(Term::from_field_u64(field, id as u64));
+                }
+                IndexingOp::Commit => {
+                    index_writer.commit().unwrap();
+                }
+                IndexingOp::Merge => {
+                    let segment_ids = index.searchable_segment_ids()?;
+                    if segment_ids.len() >= 2 {
+                        block_on(index_writer.merge(&segment_ids))?;
+                        index_writer.segment_updater().wait_merging_thread()?;
+                    }
+                }
+            }
+        }
+        index_writer.commit()?;
+        // Merging the segments
+        {
+            let segment_ids = index
+                .searchable_segment_ids()
+                .expect("Searchable segments failed.");
+            if !segment_ids.is_empty() {
+                block_on(index_writer.merge(&segment_ids)).unwrap();
+                assert!(index_writer.wait_merging_threads().is_ok());
+            }
+        }
+        Ok(())
+    }
+    #[derive(Debug, Clone, Copy)]
+    enum IndexingOp {
+        AddDoc { id: u32 },
+        DeleteDoc { id: u32 },
+        Commit,
+        Merge,
+    }
+    fn operation_strategy() -> impl Strategy<Value = IndexingOp> {
+        prop_oneof![
+            (0u32..10u32).prop_map(|id| IndexingOp::DeleteDoc { id }),
+            (0u32..10u32).prop_map(|id| IndexingOp::AddDoc { id }),
+            (0u32..2u32).prop_map(|_| IndexingOp::Commit),
+            (0u32..1u32).prop_map(|_| IndexingOp::Merge),
+        ]
+    }
+    proptest! {
+        #[test]
+        fn test_multivalued_proptest(ops in proptest::collection::vec(operation_strategy(), 1..10)) {
+            assert!(test_multivalued_no_panic(&ops[..]).is_ok());
+        }
+    }
+    #[test]
+    fn test_multivalued_proptest_off_by_one_bug_1151() {
+        use IndexingOp::*;
+        let ops = [
+            AddDoc { id: 3 },
+            AddDoc { id: 1 },
+            AddDoc { id: 3 },
+            Commit,
+            Merge,
+        ];
+        assert!(test_multivalued_no_panic(&ops[..]).is_ok());
+    }
+    #[test]
+    #[ignore]
+    fn test_many_facets() -> crate::Result<()> {
+        let mut schema_builder = Schema::builder();
+        let field = schema_builder.add_facet_field("facetfield", FacetOptions::default());
+        let schema = schema_builder.build();
+        let index = Index::create_in_ram(schema);
+        let mut index_writer = index.writer_for_tests()?;
+        for i in 0..100_000 {
+            index_writer
+                .add_document(doc!(field=> Facet::from(format!("/lang/{}", i).as_str())))?;
+        }
+        index_writer.commit()?;
+        Ok(())
+    }
 }
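Most of the churn in this compare view comes from one pattern: converting tests that return `()` and call `.unwrap()`/`.expect(...)` into tests that return a `Result` and propagate errors with `?`. A minimal standalone sketch of that pattern, using std's `ParseIntError` rather than tantivy's `crate::Result`:

```rust
use std::num::ParseIntError;

// Because this function returns a Result, every fallible call inside it
// can use `?` instead of `.unwrap()` / `.expect(...)`; the first error
// short-circuits and is returned to the caller (or the test harness).
fn parse_and_add(a: &str, b: &str) -> Result<i64, ParseIntError> {
    let x: i64 = a.parse()?;
    let y: i64 = b.parse()?;
    Ok(x + y)
}

fn main() {
    // Happy path: both parses succeed, so we get Ok(sum).
    assert_eq!(parse_and_add("2", "40"), Ok(42));
    // Failure path: the first `?` propagates the parse error.
    assert!(parse_and_add("not a number", "1").is_err());
}
```

The same shape is what `fn test_multivalued_u64() -> crate::Result<()>` uses: the test body ends in `Ok(())`, and any failing step surfaces as a test failure with the original error instead of a panic message.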


@@ -91,27 +91,25 @@ impl<Item: FastValue> MultiValueLength for MultiValuedFastFieldReader<Item> {
 mod tests {
     use crate::core::Index;
-    use crate::schema::{Cardinality, Facet, IntOptions, Schema, INDEXED};
+    use crate::schema::{Cardinality, Facet, FacetOptions, IntOptions, Schema};
     #[test]
-    fn test_multifastfield_reader() {
+    fn test_multifastfield_reader() -> crate::Result<()> {
         let mut schema_builder = Schema::builder();
-        let facet_field = schema_builder.add_facet_field("facets", INDEXED);
+        let facet_field = schema_builder.add_facet_field("facets", FacetOptions::default());
         let schema = schema_builder.build();
         let index = Index::create_in_ram(schema);
-        let mut index_writer = index
-            .writer_for_tests()
-            .expect("Failed to create index writer.");
+        let mut index_writer = index.writer_for_tests()?;
         index_writer.add_document(doc!(
             facet_field => Facet::from("/category/cat2"),
             facet_field => Facet::from("/category/cat1"),
-        ));
-        index_writer.add_document(doc!(facet_field => Facet::from("/category/cat2")));
-        index_writer.add_document(doc!(facet_field => Facet::from("/category/cat3")));
-        index_writer.commit().expect("Commit failed");
-        let searcher = index.reader().unwrap().searcher();
+        ))?;
+        index_writer.add_document(doc!(facet_field => Facet::from("/category/cat2")))?;
+        index_writer.add_document(doc!(facet_field => Facet::from("/category/cat3")))?;
+        index_writer.commit()?;
+        let searcher = index.reader()?.searcher();
         let segment_reader = searcher.segment_reader(0);
-        let mut facet_reader = segment_reader.facet_reader(facet_field).unwrap();
+        let mut facet_reader = segment_reader.facet_reader(facet_field)?;
         let mut facet = Facet::root();
         {
@@ -145,10 +143,11 @@ mod tests {
             facet_reader.facet_ords(2, &mut vals);
             assert_eq!(&vals[..], &[4]);
         }
+        Ok(())
     }
     #[test]
-    fn test_multifastfield_reader_min_max() {
+    fn test_multifastfield_reader_min_max() -> crate::Result<()> {
         let mut schema_builder = Schema::builder();
         let field_options = IntOptions::default()
             .set_indexed()
@@ -163,15 +162,16 @@ mod tests {
             item_field => 2i64,
             item_field => 3i64,
             item_field => -2i64,
-        ));
-        index_writer.add_document(doc!(item_field => 6i64, item_field => 3i64));
-        index_writer.add_document(doc!(item_field => 4i64));
-        index_writer.commit().expect("Commit failed");
-        let searcher = index.reader().unwrap().searcher();
+        ))?;
+        index_writer.add_document(doc!(item_field => 6i64, item_field => 3i64))?;
+        index_writer.add_document(doc!(item_field => 4i64))?;
+        index_writer.commit()?;
+        let searcher = index.reader()?.searcher();
         let segment_reader = searcher.segment_reader(0);
-        let field_reader = segment_reader.fast_fields().i64s(item_field).unwrap();
+        let field_reader = segment_reader.fast_fields().i64s(item_field)?;
         assert_eq!(field_reader.min_value(), -2);
         assert_eq!(field_reader.max_value(), 6);
+        Ok(())
     }
 }


@@ -1,6 +1,5 @@
 use super::FastValue;
-use crate::common::BinarySerializable;
-use crate::common::CompositeFile;
+use crate::directory::CompositeFile;
 use crate::directory::FileSlice;
 use crate::directory::OwnedBytes;
 use crate::directory::{Directory, RamDirectory, WritePtr};
@@ -8,6 +7,7 @@ use crate::fastfield::{CompositeFastFieldSerializer, FastFieldsWriter};
 use crate::schema::Schema;
 use crate::schema::FAST;
 use crate::DocId;
+use common::BinarySerializable;
 use fastfield_codecs::bitpacked::BitpackedFastFieldReader as BitpackedReader;
 use fastfield_codecs::bitpacked::BitpackedFastFieldSerializer;
 use fastfield_codecs::linearinterpol::LinearInterpolFastFieldReader;


@@ -1,4 +1,4 @@
-use crate::common::CompositeFile;
+use crate::directory::CompositeFile;
 use crate::directory::FileSlice;
 use crate::fastfield::MultiValuedFastFieldReader;
 use crate::fastfield::{BitpackedFastFieldReader, FastFieldNotAvailableError};
@@ -40,7 +40,7 @@ fn type_and_cardinality(field_type: &FieldType) -> Option<(FastType, Cardinality
         FieldType::Date(options) => options
             .get_fastfield_cardinality()
             .map(|cardinality| (FastType::Date, cardinality)),
-        FieldType::HierarchicalFacet(_) => Some((FastType::U64, Cardinality::MultiValues)),
+        FieldType::Facet(_) => Some((FastType::U64, Cardinality::MultiValues)),
         _ => None,
     }
 }


@@ -1,8 +1,8 @@
-use crate::common::BinarySerializable;
-use crate::common::CompositeWrite;
-use crate::common::CountingWriter;
+use crate::directory::CompositeWrite;
 use crate::directory::WritePtr;
 use crate::schema::Field;
+use common::BinarySerializable;
+use common::CountingWriter;
 pub use fastfield_codecs::bitpacked::BitpackedFastFieldSerializer;
 pub use fastfield_codecs::bitpacked::BitpackedFastFieldSerializerLegacy;
 use fastfield_codecs::linearinterpol::LinearInterpolFastFieldSerializer;
@@ -105,9 +105,7 @@ impl CompositeFastFieldSerializer {
             &fastfield_accessor,
             &mut estimations,
         );
-        if let Some(broken_estimation) = estimations
-            .iter()
-            .find(|estimation| estimation.0 == f32::NAN)
+        if let Some(broken_estimation) = estimations.iter().find(|estimation| estimation.0.is_nan())
         {
             warn!(
                 "broken estimation for fast field codec {}",


@@ -1,12 +1,12 @@
 use super::multivalued::MultiValuedFastFieldWriter;
 use super::serializer::FastFieldStats;
 use super::FastFieldDataAccess;
-use crate::common;
 use crate::fastfield::{BytesFastFieldWriter, CompositeFastFieldSerializer};
 use crate::indexer::doc_id_mapping::DocIdMapping;
 use crate::postings::UnorderedTermId;
 use crate::schema::{Cardinality, Document, Field, FieldEntry, FieldType, Schema};
 use crate::termdict::TermOrdinal;
+use common;
 use fnv::FnvHashMap;
 use std::collections::HashMap;
 use std::io;
@@ -54,7 +54,7 @@ impl FastFieldsWriter {
                     None => {}
                 }
             }
-            FieldType::HierarchicalFacet(_) => {
+            FieldType::Facet(_) => {
                 let fast_field_writer = MultiValuedFastFieldWriter::new(field, true);
                 multi_values_writers.push(fast_field_writer);
             }


@@ -26,3 +26,137 @@ pub use self::serializer::FieldNormsSerializer;
 pub use self::writer::FieldNormsWriter;
 use self::code::{fieldnorm_to_id, id_to_fieldnorm};
#[cfg(test)]
mod tests {
use crate::directory::CompositeFile;
use crate::directory::{Directory, RamDirectory, WritePtr};
use crate::fieldnorm::FieldNormReader;
use crate::fieldnorm::FieldNormsSerializer;
use crate::fieldnorm::FieldNormsWriter;
use crate::query::Query;
use crate::query::TermQuery;
use crate::schema::IndexRecordOption;
use crate::schema::TextFieldIndexing;
use crate::schema::TextOptions;
use crate::schema::TEXT;
use crate::Index;
use crate::Term;
use crate::TERMINATED;
use once_cell::sync::Lazy;
use std::path::Path;
use crate::schema::{Field, Schema, STORED};
pub static SCHEMA: Lazy<Schema> = Lazy::new(|| {
let mut schema_builder = Schema::builder();
schema_builder.add_text_field("field", STORED);
schema_builder.add_text_field("txt_field", TEXT);
schema_builder.add_text_field(
"str_field",
TextOptions::default().set_indexing_options(
TextFieldIndexing::default()
.set_index_option(IndexRecordOption::Basic)
.set_fieldnorms(false),
),
);
schema_builder.build()
});
pub static FIELD: Lazy<Field> = Lazy::new(|| SCHEMA.get_field("field").unwrap());
pub static TXT_FIELD: Lazy<Field> = Lazy::new(|| SCHEMA.get_field("txt_field").unwrap());
pub static STR_FIELD: Lazy<Field> = Lazy::new(|| SCHEMA.get_field("str_field").unwrap());
#[test]
#[should_panic(expected = "Cannot register a given fieldnorm twice")]
pub fn test_should_panic_when_recording_fieldnorm_twice_for_same_doc() {
let mut fieldnorm_writers = FieldNormsWriter::for_schema(&SCHEMA);
fieldnorm_writers.record(0u32, *TXT_FIELD, 5);
fieldnorm_writers.record(0u32, *TXT_FIELD, 3);
}
#[test]
pub fn test_fieldnorm() -> crate::Result<()> {
let path = Path::new("test");
let directory: RamDirectory = RamDirectory::create();
{
let write: WritePtr = directory.open_write(Path::new("test"))?;
let serializer = FieldNormsSerializer::from_write(write)?;
let mut fieldnorm_writers = FieldNormsWriter::for_schema(&SCHEMA);
fieldnorm_writers.record(2u32, *TXT_FIELD, 5);
fieldnorm_writers.record(3u32, *TXT_FIELD, 3);
fieldnorm_writers.serialize(serializer, None)?;
}
let file = directory.open_read(&path)?;
{
let fields_composite = CompositeFile::open(&file)?;
assert!(fields_composite.open_read(*FIELD).is_none());
assert!(fields_composite.open_read(*STR_FIELD).is_none());
let data = fields_composite.open_read(*TXT_FIELD).unwrap();
let fieldnorm_reader = FieldNormReader::open(data)?;
assert_eq!(fieldnorm_reader.fieldnorm(0u32), 0u32);
assert_eq!(fieldnorm_reader.fieldnorm(1u32), 0u32);
assert_eq!(fieldnorm_reader.fieldnorm(2u32), 5u32);
assert_eq!(fieldnorm_reader.fieldnorm(3u32), 3u32);
}
Ok(())
}
#[test]
fn test_fieldnorm_disabled() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
let text_options = TextOptions::default()
.set_indexing_options(TextFieldIndexing::default().set_fieldnorms(false));
let text = schema_builder.add_text_field("text", text_options);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let mut writer = index.writer_for_tests()?;
writer.add_document(doc!(text=>"hello"))?;
writer.add_document(doc!(text=>"hello hello hello"))?;
writer.commit()?;
let reader = index.reader()?;
let searcher = reader.searcher();
let query = TermQuery::new(
Term::from_field_text(text, "hello"),
IndexRecordOption::WithFreqs,
);
let weight = query.weight(&*searcher, true)?;
let mut scorer = weight.scorer(searcher.segment_reader(0), 1.0f32)?;
assert_eq!(scorer.doc(), 0);
assert!((scorer.score() - 0.22920431).abs() < 0.001f32);
assert_eq!(scorer.advance(), 1);
assert_eq!(scorer.doc(), 1);
assert!((scorer.score() - 0.22920431).abs() < 0.001f32);
assert_eq!(scorer.advance(), TERMINATED);
Ok(())
}
#[test]
fn test_fieldnorm_enabled() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
let text_options = TextOptions::default()
.set_indexing_options(TextFieldIndexing::default().set_fieldnorms(true));
let text = schema_builder.add_text_field("text", text_options);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let mut writer = index.writer_for_tests()?;
writer.add_document(doc!(text=>"hello"))?;
writer.add_document(doc!(text=>"hello hello hello"))?;
writer.commit()?;
let reader = index.reader()?;
let searcher = reader.searcher();
let query = TermQuery::new(
Term::from_field_text(text, "hello"),
IndexRecordOption::WithFreqs,
);
let weight = query.weight(&*searcher, true)?;
let mut scorer = weight.scorer(searcher.segment_reader(0), 1.0f32)?;
assert_eq!(scorer.doc(), 0);
assert!((scorer.score() - 0.22920431).abs() < 0.001f32);
assert_eq!(scorer.advance(), 1);
assert_eq!(scorer.doc(), 1);
assert!((scorer.score() - 0.15136132).abs() < 0.001f32);
assert_eq!(scorer.advance(), TERMINATED);
Ok(())
}
}


@@ -1,5 +1,5 @@
 use super::{fieldnorm_to_id, id_to_fieldnorm};
-use crate::common::CompositeFile;
+use crate::directory::CompositeFile;
 use crate::directory::FileSlice;
 use crate::directory::OwnedBytes;
 use crate::schema::Field;


@@ -1,4 +1,4 @@
-use crate::common::CompositeWrite;
+use crate::directory::CompositeWrite;
 use crate::directory::WritePtr;
 use crate::schema::Field;
 use std::io;


@@ -4,6 +4,7 @@ use super::fieldnorm_to_id;
 use super::FieldNormsSerializer;
 use crate::schema::Field;
 use crate::schema::Schema;
+use std::cmp::Ordering;
 use std::{io, iter};

 /// The `FieldNormsWriter` is in charge of tracking the fieldnorm byte
@@ -12,8 +13,7 @@ use std::{io, iter};
 /// `FieldNormsWriter` stores a Vec<u8> for each tracked field, using a
 /// byte per document per field.
 pub struct FieldNormsWriter {
-    fields: Vec<Field>,
-    fieldnorms_buffer: Vec<Vec<u8>>,
+    fieldnorms_buffers: Vec<Option<Vec<u8>>>,
 }

 impl FieldNormsWriter {
@@ -23,7 +23,7 @@ impl FieldNormsWriter {
         schema
             .fields()
             .filter_map(|(field, field_entry)| {
-                if field_entry.is_indexed() {
+                if field_entry.is_indexed() && field_entry.has_fieldnorms() {
                     Some(field)
                 } else {
                     None
@@ -35,25 +35,20 @@ impl FieldNormsWriter {
     /// Initialize with state for tracking the field norm fields
     /// specified in the schema.
     pub fn for_schema(schema: &Schema) -> FieldNormsWriter {
-        let fields = FieldNormsWriter::fields_with_fieldnorm(schema);
-        let max_field = fields
-            .iter()
-            .map(Field::field_id)
-            .max()
-            .map(|max_field_id| max_field_id as usize + 1)
-            .unwrap_or(0);
-        FieldNormsWriter {
-            fields,
-            fieldnorms_buffer: iter::repeat_with(Vec::new)
-                .take(max_field)
-                .collect::<Vec<_>>(),
-        }
+        let mut fieldnorms_buffers: Vec<Option<Vec<u8>>> = iter::repeat_with(|| None)
+            .take(schema.num_fields())
+            .collect();
+        for field in FieldNormsWriter::fields_with_fieldnorm(schema) {
+            fieldnorms_buffers[field.field_id() as usize] = Some(Vec::with_capacity(1_000));
+        }
+        FieldNormsWriter { fieldnorms_buffers }
     }

     /// The memory used inclusive childs
     pub fn mem_usage(&self) -> usize {
-        self.fieldnorms_buffer
+        self.fieldnorms_buffers
             .iter()
+            .flatten()
             .map(|buf| buf.capacity())
             .sum()
     }
@@ -62,8 +57,10 @@ impl FieldNormsWriter {
     ///
     /// Will extend with 0-bytes for documents that have not been seen.
     pub fn fill_up_to_max_doc(&mut self, max_doc: DocId) {
-        for field in self.fields.iter() {
-            self.fieldnorms_buffer[field.field_id() as usize].resize(max_doc as usize, 0u8);
+        for fieldnorms_buffer_opt in self.fieldnorms_buffers.iter_mut() {
+            if let Some(fieldnorms_buffer) = fieldnorms_buffer_opt.as_mut() {
+                fieldnorms_buffer.resize(max_doc as usize, 0u8);
+            }
         }
     }
@@ -76,14 +73,23 @@ impl FieldNormsWriter {
     /// * field - the field being set
     /// * fieldnorm - the number of terms present in document `doc` in field `field`
     pub fn record(&mut self, doc: DocId, field: Field, fieldnorm: u32) {
-        let fieldnorm_buffer: &mut Vec<u8> = &mut self.fieldnorms_buffer[field.field_id() as usize];
-        assert!(
-            fieldnorm_buffer.len() <= doc as usize,
-            "Cannot register a given fieldnorm twice"
-        );
-        // we fill intermediary `DocId` as having a fieldnorm of 0.
-        fieldnorm_buffer.resize(doc as usize + 1, 0u8);
-        fieldnorm_buffer[doc as usize] = fieldnorm_to_id(fieldnorm);
+        if let Some(fieldnorm_buffer) = self
+            .fieldnorms_buffers
+            .get_mut(field.field_id() as usize)
+            .and_then(Option::as_mut)
+        {
+            match fieldnorm_buffer.len().cmp(&(doc as usize)) {
+                Ordering::Less => {
+                    // we fill intermediary `DocId` as having a fieldnorm of 0.
+                    fieldnorm_buffer.resize(doc as usize, 0u8);
+                }
+                Ordering::Equal => {}
+                Ordering::Greater => {
+                    panic!("Cannot register a given fieldnorm twice")
+                }
+            }
+            fieldnorm_buffer.push(fieldnorm_to_id(fieldnorm));
+        }
     }

     /// Serialize the seen fieldnorm values to the serializer for all fields.
@@ -92,17 +98,18 @@ impl FieldNormsWriter {
         mut fieldnorms_serializer: FieldNormsSerializer,
         doc_id_map: Option<&DocIdMapping>,
     ) -> io::Result<()> {
-        for &field in self.fields.iter() {
-            let fieldnorm_values: &[u8] = &self.fieldnorms_buffer[field.field_id() as usize][..];
+        for (field, fieldnorms_buffer) in self.fieldnorms_buffers.iter().enumerate().filter_map(
+            |(field_id, fieldnorms_buffer_opt)| {
+                fieldnorms_buffer_opt.as_ref().map(|fieldnorms_buffer| {
+                    (Field::from_field_id(field_id as u32), fieldnorms_buffer)
+                })
+            },
+        ) {
             if let Some(doc_id_map) = doc_id_map {
-                let mut mapped_fieldnorm_values = vec![];
-                mapped_fieldnorm_values.resize(fieldnorm_values.len(), 0u8);
-                for (new_doc_id, old_doc_id) in doc_id_map.iter_old_doc_ids().enumerate() {
-                    mapped_fieldnorm_values[new_doc_id] = fieldnorm_values[old_doc_id as usize];
-                }
-                fieldnorms_serializer.serialize_field(field, &mapped_fieldnorm_values)?;
+                let remapped_fieldnorm_buffer = doc_id_map.remap(fieldnorms_buffer);
+                fieldnorms_serializer.serialize_field(field, &remapped_fieldnorm_buffer)?;
             } else {
-                fieldnorms_serializer.serialize_field(field, fieldnorm_values)?;
+                fieldnorms_serializer.serialize_field(field, fieldnorms_buffer)?;
             }
         }
         fieldnorms_serializer.close()?;
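The new `record` path above replaces the old `assert!` + `resize` with a three-way comparison: skipped docs are zero-filled, an in-order doc is appended, and a doc recorded twice panics. A minimal plain-Rust sketch of that bookkeeping (using a bare `Vec<u8>` instead of the writer's `Option<Vec<u8>>` per-field buffers, and a plain `u8` in place of `fieldnorm_to_id`):

```rust
use std::cmp::Ordering;

/// Records `fieldnorm_id` for document `doc` in a per-field byte buffer,
/// zero-filling any skipped doc_ids and panicking on a duplicate registration.
fn record_fieldnorm(buffer: &mut Vec<u8>, doc: usize, fieldnorm_id: u8) {
    match buffer.len().cmp(&doc) {
        // docs that were never recorded get an implicit fieldnorm of 0
        Ordering::Less => buffer.resize(doc, 0u8),
        // `doc` is exactly the next slot: nothing to fill
        Ordering::Equal => {}
        Ordering::Greater => panic!("Cannot register a given fieldnorm twice"),
    }
    buffer.push(fieldnorm_id);
}

fn main() {
    let mut buffer: Vec<u8> = Vec::new();
    record_fieldnorm(&mut buffer, 2, 5);
    record_fieldnorm(&mut buffer, 3, 3);
    // docs 0 and 1 were zero-filled, matching the test_fieldnorm expectations
    assert_eq!(buffer, vec![0, 0, 5, 3]);
}
```

This mirrors why `test_fieldnorm` above expects `fieldnorm(0) == 0` and `fieldnorm(1) == 0` after only docs 2 and 3 were recorded.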


@@ -1,4 +1,8 @@
+use crate::schema;
 use crate::Index;
+use crate::IndexSettings;
+use crate::IndexSortByField;
+use crate::Order;
 use crate::Searcher;
 use crate::{doc, schema::*};
 use rand::thread_rng;
@@ -35,7 +39,7 @@ fn test_functional_store() -> crate::Result<()> {
     let mut doc_set: Vec<u64> = Vec::new();
     let mut doc_id = 0u64;
-    for iteration in 0..500 {
+    for iteration in 0..get_num_iterations() {
         dbg!(iteration);
         let num_docs: usize = rng.gen_range(0..4);
         if !doc_set.is_empty() {
@@ -45,7 +49,7 @@ fn test_functional_store() -> crate::Result<()> {
         }
         for _ in 0..num_docs {
             doc_set.push(doc_id);
-            index_writer.add_document(doc!(id_field=>doc_id));
+            index_writer.add_document(doc!(id_field=>doc_id))?;
             doc_id += 1;
         }
         index_writer.commit()?;
@@ -56,16 +60,37 @@ fn test_functional_store() -> crate::Result<()> {
     Ok(())
 }

+fn get_num_iterations() -> usize {
+    std::env::var("NUM_FUNCTIONAL_TEST_ITERATIONS")
+        .map(|str| str.parse().unwrap())
+        .unwrap_or(2000)
+}
+
 #[test]
 #[ignore]
-fn test_functional_indexing() -> crate::Result<()> {
+fn test_functional_indexing_sorted() -> crate::Result<()> {
     let mut schema_builder = Schema::builder();
-    let id_field = schema_builder.add_u64_field("id", INDEXED);
+    let id_field = schema_builder.add_u64_field("id", INDEXED | FAST);
     let multiples_field = schema_builder.add_u64_field("multiples", INDEXED);
+    let text_field_options = TextOptions::default()
+        .set_indexing_options(
+            TextFieldIndexing::default()
+                .set_index_option(schema::IndexRecordOption::WithFreqsAndPositions),
+        )
+        .set_stored();
+    let text_field = schema_builder.add_text_field("text_field", text_field_options);
     let schema = schema_builder.build();
-    let index = Index::create_from_tempdir(schema)?;
+    let mut index_builder = Index::builder().schema(schema);
+    index_builder = index_builder.settings(IndexSettings {
+        sort_by_field: Some(IndexSortByField {
+            field: "id".to_string(),
+            order: Order::Desc,
+        }),
+        ..Default::default()
+    });
+    let index = index_builder.create_from_tempdir().unwrap();
     let reader = index.reader()?;
     let mut rng = thread_rng();
@@ -75,7 +100,7 @@ fn test_functional_indexing() -> crate::Result<()> {
     let mut committed_docs: HashSet<u64> = HashSet::new();
     let mut uncommitted_docs: HashSet<u64> = HashSet::new();
-    for _ in 0..200 {
+    for _ in 0..get_num_iterations() {
         let random_val = rng.gen_range(0..20);
         if random_val == 0 {
             index_writer.commit()?;
@@ -98,7 +123,85 @@ fn test_functional_indexing() -> crate::Result<()> {
             for i in 1u64..10u64 {
                 doc.add_u64(multiples_field, random_val * i);
             }
-            index_writer.add_document(doc);
+            doc.add_text(text_field, get_text());
+            index_writer.add_document(doc)?;
+        }
+    }
+    Ok(())
+}
+
+const LOREM: &str = "Doc Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed \
+    do eiusmod tempor incididunt ut labore et dolore magna aliqua. \
+    Ut enim ad minim veniam, quis nostrud exercitation ullamco \
+    laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure \
+    dolor in reprehenderit in voluptate velit esse cillum dolore eu \
+    fugiat nulla pariatur. Excepteur sint occaecat cupidatat non \
+    proident, sunt in culpa qui officia deserunt mollit anim id est \
+    laborum.";
+
+fn get_text() -> String {
+    use rand::seq::SliceRandom;
+    let mut rng = thread_rng();
+    let tokens: Vec<_> = LOREM.split(' ').collect();
+    let random_val = rng.gen_range(0..20);
+    (0..random_val)
+        .map(|_| tokens.choose(&mut rng).unwrap())
+        .cloned()
+        .collect::<Vec<_>>()
+        .join(" ")
+}
+
+#[test]
+#[ignore]
+fn test_functional_indexing_unsorted() -> crate::Result<()> {
+    let mut schema_builder = Schema::builder();
+    let id_field = schema_builder.add_u64_field("id", INDEXED);
+    let multiples_field = schema_builder.add_u64_field("multiples", INDEXED);
+    let text_field_options = TextOptions::default()
+        .set_indexing_options(
+            TextFieldIndexing::default()
+                .set_index_option(schema::IndexRecordOption::WithFreqsAndPositions),
+        )
+        .set_stored();
+    let text_field = schema_builder.add_text_field("text_field", text_field_options);
+    let schema = schema_builder.build();
+    let index = Index::create_from_tempdir(schema)?;
+    let reader = index.reader()?;
+    let mut rng = thread_rng();
+    let mut index_writer = index.writer_with_num_threads(3, 120_000_000)?;
+    let mut committed_docs: HashSet<u64> = HashSet::new();
+    let mut uncommitted_docs: HashSet<u64> = HashSet::new();
+    for _ in 0..get_num_iterations() {
+        let random_val = rng.gen_range(0..20);
+        if random_val == 0 {
+            index_writer.commit()?;
+            committed_docs.extend(&uncommitted_docs);
+            uncommitted_docs.clear();
+            reader.reload()?;
+            let searcher = reader.searcher();
+            // check that everything is correct.
+            check_index_content(
+                &searcher,
+                &committed_docs.iter().cloned().collect::<Vec<u64>>(),
+            )?;
+        } else if committed_docs.remove(&random_val) || uncommitted_docs.remove(&random_val) {
+            let doc_id_term = Term::from_field_u64(id_field, random_val);
+            index_writer.delete_term(doc_id_term);
+        } else {
+            uncommitted_docs.insert(random_val);
+            let mut doc = Document::new();
+            doc.add_u64(id_field, random_val);
+            for i in 1u64..10u64 {
+                doc.add_u64(multiples_field, random_val * i);
+            }
+            doc.add_text(text_field, get_text());
+            index_writer.add_document(doc)?;
         }
     }
     Ok(())

src/indexer/demuxer.rs (new file, 324 lines)

@@ -0,0 +1,324 @@
use common::BitSet;
use itertools::Itertools;
use crate::fastfield::AliveBitSet;
use crate::{merge_filtered_segments, Directory, Index, IndexSettings, Segment, SegmentOrdinal};
/// DemuxMapping can be used to reorganize data from multiple segments.
///
/// DemuxMapping is useful in a multitenant setting, in which each document may belong to a different tenant.
/// It allows documents to be reorganized as follows:
///
/// e.g. if you have two tenant ids TENANT_A and TENANT_B and two segments with
/// the documents (simplified)
/// Seg 1 [TENANT_A, TENANT_B]
/// Seg 2 [TENANT_A, TENANT_B]
///
/// You may want to group your documents to
/// Seg 1 [TENANT_A, TENANT_A]
/// Seg 2 [TENANT_B, TENANT_B]
///
/// Demuxing is the tool for that.
/// Semantically you can define a mapping from [old segment ordinal, old doc_id] -> [new segment ordinal].
#[derive(Debug, Default)]
pub struct DemuxMapping {
/// [index old segment ordinal] -> [index doc_id] = new segment ordinal
mapping: Vec<DocIdToSegmentOrdinal>,
}
/// DocIdToSegmentOrdinal maps from doc_id within a segment to the new segment ordinal for demuxing.
///
/// For every source segment there is a `DocIdToSegmentOrdinal` to distribute its doc_ids.
#[derive(Debug, Default)]
pub struct DocIdToSegmentOrdinal {
doc_id_index_to_segment_ord: Vec<SegmentOrdinal>,
}
impl DocIdToSegmentOrdinal {
/// Creates a new DocIdToSegmentOrdinal with room for `max_doc` doc_ids.
/// Initially all doc_ids point to segment ordinal 0 and need to be set
/// via the `set` method.
pub fn with_max_doc(max_doc: usize) -> Self {
DocIdToSegmentOrdinal {
doc_id_index_to_segment_ord: vec![0; max_doc],
}
}
/// Returns the number of documents in this mapping.
/// It should be equal to the `max_doc` of the segment it targets.
pub fn max_doc(&self) -> u32 {
self.doc_id_index_to_segment_ord.len() as u32
}
/// Associates a doc_id with an output `SegmentOrdinal`.
pub fn set(&mut self, doc_id: u32, segment_ord: SegmentOrdinal) {
self.doc_id_index_to_segment_ord[doc_id as usize] = segment_ord;
}
/// Iterates over the new SegmentOrdinal in the order of the doc_id.
pub fn iter(&self) -> impl Iterator<Item = SegmentOrdinal> + '_ {
self.doc_id_index_to_segment_ord.iter().cloned()
}
}
impl DemuxMapping {
/// Adds a DocIdToSegmentOrdinal. The order of the push calls
/// defines the old segment ordinal, e.g. the first push corresponds to ordinal 0.
pub fn add(&mut self, segment_mapping: DocIdToSegmentOrdinal) {
self.mapping.push(segment_mapping);
}
/// Returns the old number of segments.
pub fn get_old_num_segments(&self) -> usize {
self.mapping.len()
}
}
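The demux bookkeeping above boils down to: for each old segment, a per-doc array of target segment ordinals, from which one alive set per target segment is derived. A self-contained sketch of that logic, with `Vec<bool>` standing in for tantivy's `AliveBitSet` and a plain `u32` for `SegmentOrdinal` (the types here are illustrative, not the crate's API):

```rust
/// For one old segment: `mapping[doc_id]` is the target segment ordinal.
type DocIdToSegmentOrd = Vec<u32>;

/// Computes which docs of one old segment survive when demuxing into
/// `target_segment_ord` (the bool stands in for an alive bit).
fn docs_for_segment_ord(mapping: &DocIdToSegmentOrd, target_segment_ord: u32) -> Vec<bool> {
    mapping
        .iter()
        .map(|&ord| ord == target_segment_ord)
        .collect()
}

fn main() {
    // Two old segments of two docs each, as in the doc comment:
    // Seg 1 [TENANT_A, TENANT_B], Seg 2 [TENANT_A, TENANT_B],
    // demuxed so target segment 0 collects TENANT_A and target segment 1 TENANT_B.
    let demux_mapping: Vec<DocIdToSegmentOrd> = vec![vec![0, 1], vec![0, 1]];
    let alive_in_seg0: Vec<Vec<bool>> = demux_mapping
        .iter()
        .map(|m| docs_for_segment_ord(m, 0))
        .collect();
    // Only doc 0 of each old segment is alive in target segment 0.
    assert_eq!(alive_in_seg0, vec![vec![true, false], vec![true, false]]);
}
```

`demux` below applies exactly this per target segment, then hands the alive sets to `merge_filtered_segments`.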
fn docs_for_segment_ord(
doc_id_to_segment_ord: &DocIdToSegmentOrdinal,
target_segment_ord: SegmentOrdinal,
) -> AliveBitSet {
let mut bitset = BitSet::with_max_value(doc_id_to_segment_ord.max_doc());
for doc_id in doc_id_to_segment_ord
.iter()
.enumerate()
.filter(|(_doc_id, new_segment_ord)| *new_segment_ord == target_segment_ord)
.map(|(doc_id, _)| doc_id)
{
// add document if segment ordinal = target segment ordinal
bitset.insert(doc_id as u32);
}
AliveBitSet::from_bitset(&bitset)
}
fn get_alive_bitsets(
demux_mapping: &DemuxMapping,
target_segment_ord: SegmentOrdinal,
) -> Vec<AliveBitSet> {
demux_mapping
.mapping
.iter()
.map(|doc_id_to_segment_ord| {
docs_for_segment_ord(doc_id_to_segment_ord, target_segment_ord)
})
.collect_vec()
}
/// Demuxes the segments according to `demux_mapping`. See `DemuxMapping`.
/// The number of `output_directories` needs to match the max new segment ordinal from `demux_mapping`.
///
/// The ordinals of `segments` need to match the ordinals provided in `demux_mapping`.
pub fn demux(
segments: &[Segment],
demux_mapping: &DemuxMapping,
target_settings: IndexSettings,
output_directories: Vec<Box<dyn Directory>>,
) -> crate::Result<Vec<Index>> {
let mut indices = vec![];
for (target_segment_ord, output_directory) in output_directories.into_iter().enumerate() {
let delete_bitsets = get_alive_bitsets(demux_mapping, target_segment_ord as u32)
.into_iter()
.map(Some)
.collect_vec();
let index = merge_filtered_segments(
segments,
target_settings.clone(),
delete_bitsets,
output_directory,
)?;
indices.push(index);
}
Ok(indices)
}
#[cfg(test)]
mod tests {
use crate::{
collector::TopDocs,
directory::RamDirectory,
query::QueryParser,
schema::{Schema, TEXT},
DocAddress, Term,
};
use super::*;
#[test]
fn test_demux_map_to_deletebitset() {
let max_value = 2;
let mut demux_mapping = DemuxMapping::default();
//segment ordinal 0 mapping
let mut doc_id_to_segment = DocIdToSegmentOrdinal::with_max_doc(max_value);
doc_id_to_segment.set(0, 1);
doc_id_to_segment.set(1, 0);
demux_mapping.add(doc_id_to_segment);
//segment ordinal 1 mapping
let mut doc_id_to_segment = DocIdToSegmentOrdinal::with_max_doc(max_value);
doc_id_to_segment.set(0, 1);
doc_id_to_segment.set(1, 1);
demux_mapping.add(doc_id_to_segment);
{
let bit_sets_for_demuxing_to_segment_ord_0 = get_alive_bitsets(&demux_mapping, 0);
assert_eq!(
bit_sets_for_demuxing_to_segment_ord_0[0].is_deleted(0),
true
);
assert_eq!(
bit_sets_for_demuxing_to_segment_ord_0[0].is_deleted(1),
false
);
assert_eq!(
bit_sets_for_demuxing_to_segment_ord_0[1].is_deleted(0),
true
);
assert_eq!(
bit_sets_for_demuxing_to_segment_ord_0[1].is_deleted(1),
true
);
}
{
let bit_sets_for_demuxing_to_segment_ord_1 = get_alive_bitsets(&demux_mapping, 1);
assert_eq!(
bit_sets_for_demuxing_to_segment_ord_1[0].is_deleted(0),
false
);
assert_eq!(
bit_sets_for_demuxing_to_segment_ord_1[0].is_deleted(1),
true
);
assert_eq!(
bit_sets_for_demuxing_to_segment_ord_1[1].is_deleted(0),
false
);
assert_eq!(
bit_sets_for_demuxing_to_segment_ord_1[1].is_deleted(1),
false
);
}
}
#[test]
fn test_demux_segments() -> crate::Result<()> {
let first_index = {
let mut schema_builder = Schema::builder();
let text_field = schema_builder.add_text_field("text", TEXT);
let index = Index::create_in_ram(schema_builder.build());
let mut index_writer = index.writer_for_tests()?;
index_writer.add_document(doc!(text_field=>"texto1"))?;
index_writer.add_document(doc!(text_field=>"texto2"))?;
index_writer.commit()?;
index
};
let second_index = {
let mut schema_builder = Schema::builder();
let text_field = schema_builder.add_text_field("text", TEXT);
let index = Index::create_in_ram(schema_builder.build());
let mut index_writer = index.writer_for_tests()?;
index_writer.add_document(doc!(text_field=>"texto3"))?;
index_writer.add_document(doc!(text_field=>"texto4"))?;
index_writer.delete_term(Term::from_field_text(text_field, "4"));
index_writer.commit()?;
index
};
let mut segments: Vec<Segment> = Vec::new();
segments.extend(first_index.searchable_segments()?);
segments.extend(second_index.searchable_segments()?);
let target_settings = first_index.settings().clone();
let mut demux_mapping = DemuxMapping::default();
{
let max_value = 2;
//segment ordinal 0 mapping
let mut doc_id_to_segment = DocIdToSegmentOrdinal::with_max_doc(max_value);
doc_id_to_segment.set(0, 1);
doc_id_to_segment.set(1, 0);
demux_mapping.add(doc_id_to_segment);
//segment ordinal 1 mapping
let mut doc_id_to_segment = DocIdToSegmentOrdinal::with_max_doc(max_value);
doc_id_to_segment.set(0, 1);
doc_id_to_segment.set(1, 1);
demux_mapping.add(doc_id_to_segment);
}
assert_eq!(demux_mapping.get_old_num_segments(), 2);
let demuxed_indices = demux(
&segments,
&demux_mapping,
target_settings,
vec![
Box::new(RamDirectory::default()),
Box::new(RamDirectory::default()),
],
)?;
{
let index = &demuxed_indices[0];
let segments = index.searchable_segments()?;
assert_eq!(segments.len(), 1);
let segment_metas = segments[0].meta();
assert_eq!(segment_metas.num_deleted_docs(), 0);
assert_eq!(segment_metas.num_docs(), 1);
let searcher = index.reader().unwrap().searcher();
{
let text_field = index.schema().get_field("text").unwrap();
let do_search = |term: &str| {
let query = QueryParser::for_index(&index, vec![text_field])
.parse_query(term)
.unwrap();
let top_docs: Vec<(f32, DocAddress)> =
searcher.search(&query, &TopDocs::with_limit(3)).unwrap();
top_docs.iter().map(|el| el.1.doc_id).collect::<Vec<_>>()
};
assert_eq!(do_search("texto1"), vec![] as Vec<u32>);
assert_eq!(do_search("texto2"), vec![0]);
}
}
{
let index = &demuxed_indices[1];
let segments = index.searchable_segments()?;
assert_eq!(segments.len(), 1);
let segment_metas = segments[0].meta();
assert_eq!(segment_metas.num_deleted_docs(), 0);
assert_eq!(segment_metas.num_docs(), 3);
let searcher = index.reader().unwrap().searcher();
{
let text_field = index.schema().get_field("text").unwrap();
let do_search = |term: &str| {
let query = QueryParser::for_index(&index, vec![text_field])
.parse_query(term)
.unwrap();
let top_docs: Vec<(f32, DocAddress)> =
searcher.search(&query, &TopDocs::with_limit(3)).unwrap();
top_docs.iter().map(|el| el.1.doc_id).collect::<Vec<_>>()
};
assert_eq!(do_search("texto1"), vec![0]);
assert_eq!(do_search("texto2"), vec![] as Vec<u32>);
assert_eq!(do_search("texto3"), vec![1]);
assert_eq!(do_search("texto4"), vec![2]);
}
}
Ok(())
}
}


@@ -2,23 +2,23 @@
 //! to get mappings from old doc_id to new doc_id and vice versa, after sorting
 //!
-use super::{merger::SegmentReaderWithOrdinal, SegmentWriter};
+use super::SegmentWriter;
 use crate::{
     schema::{Field, Schema},
-    DocId, IndexSortByField, Order, TantivyError,
+    DocId, IndexSortByField, Order, SegmentOrdinal, TantivyError,
 };
 use std::{cmp::Reverse, ops::Index};

 /// Struct to provide mapping from new doc_id to old doc_id and segment.
 #[derive(Clone)]
-pub(crate) struct SegmentDocidMapping<'a> {
-    new_doc_id_to_old_and_segment: Vec<(DocId, SegmentReaderWithOrdinal<'a>)>,
+pub(crate) struct SegmentDocIdMapping {
+    new_doc_id_to_old_and_segment: Vec<(DocId, SegmentOrdinal)>,
     is_trivial: bool,
 }

-impl<'a> SegmentDocidMapping<'a> {
+impl SegmentDocIdMapping {
     pub(crate) fn new(
-        new_doc_id_to_old_and_segment: Vec<(DocId, SegmentReaderWithOrdinal<'a>)>,
+        new_doc_id_to_old_and_segment: Vec<(DocId, SegmentOrdinal)>,
         is_trivial: bool,
     ) -> Self {
         Self {
@@ -26,7 +26,7 @@ impl<'a> SegmentDocidMapping<'a> {
             is_trivial,
         }
     }

-    pub(crate) fn iter(&self) -> impl Iterator<Item = &(DocId, SegmentReaderWithOrdinal)> {
+    pub(crate) fn iter(&self) -> impl Iterator<Item = &(DocId, SegmentOrdinal)> {
         self.new_doc_id_to_old_and_segment.iter()
     }

     pub(crate) fn len(&self) -> usize {
@@ -40,15 +40,15 @@ impl<'a> SegmentDocidMapping<'a> {
         self.is_trivial
     }
 }

-impl<'a> Index<usize> for SegmentDocidMapping<'a> {
-    type Output = (DocId, SegmentReaderWithOrdinal<'a>);
+impl Index<usize> for SegmentDocIdMapping {
+    type Output = (DocId, SegmentOrdinal);

     fn index(&self, idx: usize) -> &Self::Output {
         &self.new_doc_id_to_old_and_segment[idx]
     }
 }

-impl<'a> IntoIterator for SegmentDocidMapping<'a> {
-    type Item = (DocId, SegmentReaderWithOrdinal<'a>);
+impl IntoIterator for SegmentDocIdMapping {
+    type Item = (DocId, SegmentOrdinal);
     type IntoIter = std::vec::IntoIter<Self::Item>;

     fn into_iter(self) -> Self::IntoIter {
@@ -63,6 +63,24 @@ pub struct DocIdMapping {
} }
impl DocIdMapping { impl DocIdMapping {
pub fn from_new_id_to_old_id(new_doc_id_to_old: Vec<DocId>) -> Self {
let max_doc = new_doc_id_to_old.len();
let old_max_doc = new_doc_id_to_old
.iter()
.cloned()
.max()
.map(|n| n + 1)
.unwrap_or(0);
let mut old_doc_id_to_new = vec![0; old_max_doc as usize];
for i in 0..max_doc {
old_doc_id_to_new[new_doc_id_to_old[i] as usize] = i as DocId;
}
DocIdMapping {
new_doc_id_to_old,
old_doc_id_to_new,
}
}
/// returns the new doc_id for the old doc_id /// returns the new doc_id for the old doc_id
pub fn get_new_doc_id(&self, doc_id: DocId) -> DocId { pub fn get_new_doc_id(&self, doc_id: DocId) -> DocId {
self.old_doc_id_to_new[doc_id as usize] self.old_doc_id_to_new[doc_id as usize]
@@ -75,6 +93,13 @@ impl DocIdMapping {
pub fn iter_old_doc_ids(&self) -> impl Iterator<Item = DocId> + Clone + '_ { pub fn iter_old_doc_ids(&self) -> impl Iterator<Item = DocId> + Clone + '_ {
self.new_doc_id_to_old.iter().cloned() self.new_doc_id_to_old.iter().cloned()
} }
/// Remaps a given array to the new doc ids.
pub fn remap<T: Copy>(&self, els: &[T]) -> Vec<T> {
self.new_doc_id_to_old
.iter()
.map(|old_doc| els[*old_doc as usize])
.collect()
}
} }
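The `from_new_id_to_old_id` constructor and the new `remap` helper above can be sketched as a minimal standalone type (names mirror the diff, but this is a reduced illustration, not the actual tantivy struct):

```rust
// Minimal standalone sketch of the doc-id mapping built above.
// `DocId` is u32 in tantivy.
type DocId = u32;

struct DocIdMapping {
    new_doc_id_to_old: Vec<DocId>,
    old_doc_id_to_new: Vec<DocId>,
}

impl DocIdMapping {
    fn from_new_id_to_old_id(new_doc_id_to_old: Vec<DocId>) -> Self {
        // The inverse table is sized by the largest old doc id seen.
        let old_max_doc = new_doc_id_to_old
            .iter()
            .copied()
            .max()
            .map(|n| n + 1)
            .unwrap_or(0);
        let mut old_doc_id_to_new = vec![0; old_max_doc as usize];
        for (new_id, &old_id) in new_doc_id_to_old.iter().enumerate() {
            old_doc_id_to_new[old_id as usize] = new_id as DocId;
        }
        DocIdMapping {
            new_doc_id_to_old,
            old_doc_id_to_new,
        }
    }

    // Reorder a per-doc array into the new doc-id order.
    fn remap<T: Copy>(&self, els: &[T]) -> Vec<T> {
        self.new_doc_id_to_old
            .iter()
            .map(|&old| els[old as usize])
            .collect()
    }
}

fn main() {
    let mapping = DocIdMapping::from_new_id_to_old_id(vec![2, 0, 1]);
    // Old doc 2 becomes new doc 0, old doc 0 becomes new doc 1, etc.
    assert_eq!(mapping.old_doc_id_to_new, vec![1, 2, 0]);
    assert_eq!(mapping.remap(&[10, 20, 30]), vec![30, 10, 20]);
}
```

This is the same refactor the hunk below applies to `get_doc_id_mapping_from_field`: the inverse-table construction moves out of that function and into the constructor.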
pub(crate) fn expect_field_id_for_sort_field( pub(crate) fn expect_field_id_for_sort_field(
@@ -122,23 +147,13 @@ pub(crate) fn get_doc_id_mapping_from_field(
.into_iter() .into_iter()
.map(|el| el.0) .map(|el| el.0)
.collect::<Vec<_>>(); .collect::<Vec<_>>();
Ok(DocIdMapping::from_new_id_to_old_id(new_doc_id_to_old))
// create old doc_id to new doc_id index (used in posting recorder)
let max_doc = new_doc_id_to_old.len();
let mut old_doc_id_to_new = vec![0; max_doc];
for i in 0..max_doc {
old_doc_id_to_new[new_doc_id_to_old[i] as usize] = i as DocId;
}
let doc_id_map = DocIdMapping {
new_doc_id_to_old,
old_doc_id_to_new,
};
Ok(doc_id_map)
} }
#[cfg(test)] #[cfg(test)]
mod tests_indexsorting { mod tests_indexsorting {
use crate::fastfield::FastFieldReader; use crate::fastfield::FastFieldReader;
use crate::indexer::doc_id_mapping::DocIdMapping;
use crate::{collector::TopDocs, query::QueryParser, schema::*}; use crate::{collector::TopDocs, query::QueryParser, schema::*};
use crate::{schema::Schema, DocAddress}; use crate::{schema::Schema, DocAddress};
use crate::{Index, IndexSettings, IndexSortByField, Order}; use crate::{Index, IndexSettings, IndexSortByField, Order};
@@ -146,7 +161,7 @@ mod tests_indexsorting {
fn create_test_index( fn create_test_index(
index_settings: Option<IndexSettings>, index_settings: Option<IndexSettings>,
text_field_options: TextOptions, text_field_options: TextOptions,
) -> Index { ) -> crate::Result<Index> {
let mut schema_builder = Schema::builder(); let mut schema_builder = Schema::builder();
let my_text_field = schema_builder.add_text_field("text_field", text_field_options); let my_text_field = schema_builder.add_text_field("text_field", text_field_options);
@@ -166,19 +181,20 @@ mod tests_indexsorting {
if let Some(settings) = index_settings { if let Some(settings) = index_settings {
index_builder = index_builder.settings(settings); index_builder = index_builder.settings(settings);
} }
let index = index_builder.create_in_ram().unwrap(); let index = index_builder.create_in_ram()?;
let mut index_writer = index.writer_for_tests().unwrap(); let mut index_writer = index.writer_for_tests()?;
index_writer.add_document(doc!(my_number=>40_u64)); index_writer.add_document(doc!(my_number=>40_u64))?;
index_writer index_writer.add_document(
.add_document(doc!(my_number=>20_u64, multi_numbers => 5_u64, multi_numbers => 6_u64)); doc!(my_number=>20_u64, multi_numbers => 5_u64, multi_numbers => 6_u64),
index_writer.add_document(doc!(my_number=>100_u64)); )?;
index_writer.add_document(doc!(my_number=>100_u64))?;
index_writer.add_document( index_writer.add_document(
doc!(my_number=>10_u64, my_string_field=> "blublub", my_text_field => "some text"), doc!(my_number=>10_u64, my_string_field=> "blublub", my_text_field => "some text"),
); )?;
index_writer.add_document(doc!(my_number=>30_u64, multi_numbers => 3_u64 )); index_writer.add_document(doc!(my_number=>30_u64, multi_numbers => 3_u64 ))?;
index_writer.commit().unwrap(); index_writer.commit()?;
index Ok(index)
} }
fn get_text_options() -> TextOptions { fn get_text_options() -> TextOptions {
TextOptions::default().set_indexing_options( TextOptions::default().set_indexing_options(
@@ -203,7 +219,7 @@ mod tests_indexsorting {
for option in options { for option in options {
//let options = get_text_options(); //let options = get_text_options();
// no index_sort // no index_sort
let index = create_test_index(None, option.clone()); let index = create_test_index(None, option.clone())?;
let my_text_field = index.schema().get_field("text_field").unwrap(); let my_text_field = index.schema().get_field("text_field").unwrap();
let searcher = index.reader()?.searcher(); let searcher = index.reader()?.searcher();
@@ -225,7 +241,7 @@ mod tests_indexsorting {
..Default::default() ..Default::default()
}), }),
option.clone(), option.clone(),
); )?;
let my_text_field = index.schema().get_field("text_field").unwrap(); let my_text_field = index.schema().get_field("text_field").unwrap();
let reader = index.reader()?; let reader = index.reader()?;
let searcher = reader.searcher(); let searcher = reader.searcher();
@@ -257,7 +273,7 @@ mod tests_indexsorting {
..Default::default() ..Default::default()
}), }),
option.clone(), option.clone(),
); )?;
let my_string_field = index.schema().get_field("text_field").unwrap(); let my_string_field = index.schema().get_field("text_field").unwrap();
let searcher = index.reader()?.searcher(); let searcher = index.reader()?.searcher();
@@ -287,7 +303,7 @@ mod tests_indexsorting {
#[test] #[test]
fn test_sort_index_get_documents() -> crate::Result<()> { fn test_sort_index_get_documents() -> crate::Result<()> {
// default baseline // default baseline
let index = create_test_index(None, get_text_options()); let index = create_test_index(None, get_text_options())?;
let my_string_field = index.schema().get_field("string_field").unwrap(); let my_string_field = index.schema().get_field("string_field").unwrap();
let searcher = index.reader()?.searcher(); let searcher = index.reader()?.searcher();
{ {
@@ -316,7 +332,7 @@ mod tests_indexsorting {
..Default::default() ..Default::default()
}), }),
get_text_options(), get_text_options(),
); )?;
let my_string_field = index.schema().get_field("string_field").unwrap(); let my_string_field = index.schema().get_field("string_field").unwrap();
let searcher = index.reader()?.searcher(); let searcher = index.reader()?.searcher();
{ {
@@ -341,7 +357,7 @@ mod tests_indexsorting {
..Default::default() ..Default::default()
}), }),
get_text_options(), get_text_options(),
); )?;
let my_string_field = index.schema().get_field("string_field").unwrap(); let my_string_field = index.schema().get_field("string_field").unwrap();
let searcher = index.reader()?.searcher(); let searcher = index.reader()?.searcher();
{ {
@@ -356,7 +372,7 @@ mod tests_indexsorting {
#[test] #[test]
fn test_sort_index_test_string_field() -> crate::Result<()> { fn test_sort_index_test_string_field() -> crate::Result<()> {
let index = create_test_index(None, get_text_options()); let index = create_test_index(None, get_text_options())?;
let my_string_field = index.schema().get_field("string_field").unwrap(); let my_string_field = index.schema().get_field("string_field").unwrap();
let searcher = index.reader()?.searcher(); let searcher = index.reader()?.searcher();
@@ -376,7 +392,7 @@ mod tests_indexsorting {
..Default::default() ..Default::default()
}), }),
get_text_options(), get_text_options(),
); )?;
let my_string_field = index.schema().get_field("string_field").unwrap(); let my_string_field = index.schema().get_field("string_field").unwrap();
let reader = index.reader()?; let reader = index.reader()?;
let searcher = reader.searcher(); let searcher = reader.searcher();
@@ -407,7 +423,7 @@ mod tests_indexsorting {
..Default::default() ..Default::default()
}), }),
get_text_options(), get_text_options(),
); )?;
let my_string_field = index.schema().get_field("string_field").unwrap(); let my_string_field = index.schema().get_field("string_field").unwrap();
let searcher = index.reader()?.searcher(); let searcher = index.reader()?.searcher();
@@ -443,7 +459,7 @@ mod tests_indexsorting {
..Default::default() ..Default::default()
}), }),
get_text_options(), get_text_options(),
); )?;
assert_eq!( assert_eq!(
index.settings().sort_by_field.as_ref().unwrap().field, index.settings().sort_by_field.as_ref().unwrap().field,
"my_number".to_string() "my_number".to_string()
@@ -474,4 +490,27 @@ mod tests_indexsorting {
assert_eq!(vals, &[3]); assert_eq!(vals, &[3]);
Ok(()) Ok(())
} }
#[test]
fn test_doc_mapping() {
let doc_mapping = DocIdMapping::from_new_id_to_old_id(vec![3, 2, 5]);
assert_eq!(doc_mapping.get_old_doc_id(0), 3);
assert_eq!(doc_mapping.get_old_doc_id(1), 2);
assert_eq!(doc_mapping.get_old_doc_id(2), 5);
assert_eq!(doc_mapping.get_new_doc_id(0), 0);
assert_eq!(doc_mapping.get_new_doc_id(1), 0);
assert_eq!(doc_mapping.get_new_doc_id(2), 1);
assert_eq!(doc_mapping.get_new_doc_id(3), 0);
assert_eq!(doc_mapping.get_new_doc_id(4), 0);
assert_eq!(doc_mapping.get_new_doc_id(5), 2);
}
#[test]
fn test_doc_mapping_remap() {
let doc_mapping = DocIdMapping::from_new_id_to_old_id(vec![2, 8, 3]);
assert_eq!(
&doc_mapping.remap(&[0, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000]),
&[2000, 8000, 3000]
);
}
} }


@@ -0,0 +1,118 @@
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, RwLock};
use super::AddBatchReceiver;
#[derive(Clone)]
pub(crate) struct IndexWriterStatus {
inner: Arc<Inner>,
}
impl IndexWriterStatus {
/// Returns true iff the index writer is alive.
pub fn is_alive(&self) -> bool {
self.inner.as_ref().is_alive()
}
/// Returns a copy of the operation receiver.
/// If the index writer was killed, returns None.
pub fn operation_receiver(&self) -> Option<AddBatchReceiver> {
let rlock = self
.inner
.receive_channel
.read()
.expect("This lock should never be poisoned");
rlock.as_ref().cloned()
}
/// Create an index writer bomb.
/// If dropped, the index writer status will be killed.
pub(crate) fn create_bomb(&self) -> IndexWriterBomb {
IndexWriterBomb {
inner: Some(self.inner.clone()),
}
}
}
struct Inner {
is_alive: AtomicBool,
receive_channel: RwLock<Option<AddBatchReceiver>>,
}
impl Inner {
fn is_alive(&self) -> bool {
self.is_alive.load(Ordering::Relaxed)
}
fn kill(&self) {
self.is_alive.store(false, Ordering::Relaxed);
self.receive_channel
.write()
.expect("This lock should never be poisoned")
.take();
}
}
impl From<AddBatchReceiver> for IndexWriterStatus {
fn from(receiver: AddBatchReceiver) -> Self {
IndexWriterStatus {
inner: Arc::new(Inner {
is_alive: AtomicBool::new(true),
receive_channel: RwLock::new(Some(receiver)),
}),
}
}
}
/// If dropped, the index writer will be killed.
/// To prevent this, clients can call `.defuse()`.
pub(crate) struct IndexWriterBomb {
inner: Option<Arc<Inner>>,
}
impl IndexWriterBomb {
/// Defuses the bomb.
///
/// This is the only way to drop the bomb without killing
/// the index writer.
pub fn defuse(mut self) {
self.inner = None;
}
}
impl Drop for IndexWriterBomb {
fn drop(&mut self) {
if let Some(inner) = self.inner.take() {
inner.kill();
}
}
}
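The kill-on-drop-unless-defused pattern above can be reduced to an `AtomicBool`; this is an illustrative miniature, not the tantivy type:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// Shared liveness flag standing in for IndexWriterStatus' inner state.
struct Bomb {
    inner: Option<Arc<AtomicBool>>,
}

impl Bomb {
    // Taking the Arc out is the only way to drop without side effects.
    fn defuse(mut self) {
        self.inner = None;
    }
}

impl Drop for Bomb {
    fn drop(&mut self) {
        // If a worker thread panics and unwinds past the bomb,
        // this runs and marks the writer as dead.
        if let Some(alive) = self.inner.take() {
            alive.store(false, Ordering::Relaxed);
        }
    }
}

fn main() {
    let alive = Arc::new(AtomicBool::new(true));

    // Defused bomb: liveness survives the drop.
    Bomb { inner: Some(alive.clone()) }.defuse();
    assert!(alive.load(Ordering::Relaxed));

    // Dropped without defusing: boom, the flag flips.
    drop(Bomb { inner: Some(alive.clone()) });
    assert!(!alive.load(Ordering::Relaxed));
}
```

Because `Drop` runs during unwinding, the status is reliably killed even on panic, which is exactly what the tests below exercise.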
#[cfg(test)]
mod tests {
use super::IndexWriterStatus;
use crossbeam::channel;
use std::mem;
#[test]
fn test_bomb_goes_boom() {
let (_tx, rx) = channel::bounded(10);
let index_writer_status: IndexWriterStatus = IndexWriterStatus::from(rx);
assert!(index_writer_status.operation_receiver().is_some());
let bomb = index_writer_status.create_bomb();
assert!(index_writer_status.operation_receiver().is_some());
mem::drop(bomb);
// boom!
assert!(index_writer_status.operation_receiver().is_none());
}
#[test]
fn test_bomb_defused() {
let (_tx, rx) = channel::bounded(10);
let index_writer_status: IndexWriterStatus = IndexWriterStatus::from(rx);
assert!(index_writer_status.operation_receiver().is_some());
let bomb = index_writer_status.create_bomb();
bomb.defuse();
assert!(index_writer_status.operation_receiver().is_some());
}
}


@@ -2,12 +2,15 @@ use super::merge_policy::{MergeCandidate, MergePolicy};
use crate::core::SegmentMeta; use crate::core::SegmentMeta;
use itertools::Itertools; use itertools::Itertools;
use std::cmp; use std::cmp;
use std::f64;
const DEFAULT_LEVEL_LOG_SIZE: f64 = 0.75; const DEFAULT_LEVEL_LOG_SIZE: f64 = 0.75;
const DEFAULT_MIN_LAYER_SIZE: u32 = 10_000; const DEFAULT_MIN_LAYER_SIZE: u32 = 10_000;
const DEFAULT_MIN_NUM_SEGMENTS_IN_MERGE: usize = 8; const DEFAULT_MIN_NUM_SEGMENTS_IN_MERGE: usize = 8;
const DEFAULT_MAX_DOCS_BEFORE_MERGE: usize = 10_000_000; const DEFAULT_MAX_DOCS_BEFORE_MERGE: usize = 10_000_000;
// The default value of 1 means that deletes are not taken in account when
// identifying merge candidates. This is not a very sensible default: it was
// set like that for backward compatibility and might change in the near future.
const DEFAULT_DEL_DOCS_RATIO_BEFORE_MERGE: f32 = 1.0f32;
/// `LogMergePolicy` tries to merge segments that have a similar number of /// `LogMergePolicy` tries to merge segments that have a similar number of
/// documents. /// documents.
@@ -17,6 +20,7 @@ pub struct LogMergePolicy {
max_docs_before_merge: usize, max_docs_before_merge: usize,
min_layer_size: u32, min_layer_size: u32,
level_log_size: f64, level_log_size: f64,
del_docs_ratio_before_merge: f32,
} }
impl LogMergePolicy { impl LogMergePolicy {
@@ -52,19 +56,49 @@ impl LogMergePolicy {
pub fn set_level_log_size(&mut self, level_log_size: f64) { pub fn set_level_log_size(&mut self, level_log_size: f64) {
self.level_log_size = level_log_size; self.level_log_size = level_log_size;
} }
/// Set the ratio of deleted documents in a segment to tolerate.
///
/// If it is exceeded by any segment at a log level, a merge
/// will be triggered for that level.
///
/// If there is a single segment at a level, we effectively end up expunging
/// deleted documents from it.
///
/// # Panics
///
/// Panics if del_docs_ratio_before_merge is not within (0..1].
pub fn set_del_docs_ratio_before_merge(&mut self, del_docs_ratio_before_merge: f32) {
assert!(del_docs_ratio_before_merge <= 1.0f32);
assert!(del_docs_ratio_before_merge > 0f32);
self.del_docs_ratio_before_merge = del_docs_ratio_before_merge;
}
fn has_segment_above_deletes_threshold(&self, level: &[&SegmentMeta]) -> bool {
level
.iter()
.any(|segment| deletes_ratio(segment) > self.del_docs_ratio_before_merge)
}
}
fn deletes_ratio(segment: &SegmentMeta) -> f32 {
if segment.max_doc() == 0 {
return 0f32;
}
segment.num_deleted_docs() as f32 / segment.max_doc() as f32
} }
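The deleted-docs guard added above can be sketched standalone; here a segment is reduced to just the two counters the guard reads, `(num_deleted_docs, max_doc)`, rather than tantivy's `SegmentMeta`:

```rust
// A segment reduced to the two counters the guard needs:
// (num_deleted_docs, max_doc).
fn deletes_ratio(segment: (u32, u32)) -> f32 {
    let (num_deleted, max_doc) = segment;
    if max_doc == 0 {
        return 0.0; // empty segment: nothing to expunge
    }
    num_deleted as f32 / max_doc as f32
}

// A level is merged when it has enough segments OR any segment in it
// exceeds the tolerated ratio of deleted documents.
fn level_should_merge(level: &[(u32, u32)], min_num_segments: usize, del_ratio: f32) -> bool {
    level.len() >= min_num_segments
        || level.iter().any(|&seg| deletes_ratio(seg) > del_ratio)
}

fn main() {
    // 10_000 deletes out of 40_000 docs is exactly 0.25: not above threshold.
    assert!(!level_should_merge(&[(10_000, 40_000)], 8, 0.25));
    // One more delete tips the ratio over 0.25, triggering a merge.
    assert!(level_should_merge(&[(10_001, 40_000)], 8, 0.25));
}
```

The strict `>` comparison is why the tests further down use 10_000 deletes for the below-threshold case and 10_001 for the above-threshold case on a 40_000-doc segment.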
impl MergePolicy for LogMergePolicy { impl MergePolicy for LogMergePolicy {
fn compute_merge_candidates(&self, segments: &[SegmentMeta]) -> Vec<MergeCandidate> { fn compute_merge_candidates(&self, segments: &[SegmentMeta]) -> Vec<MergeCandidate> {
let mut size_sorted_segments = segments let size_sorted_segments = segments
.iter() .iter()
.filter(|segment_meta| segment_meta.num_docs() <= (self.max_docs_before_merge as u32)) .filter(|seg| seg.num_docs() <= (self.max_docs_before_merge as u32))
.sorted_by_key(|seg| std::cmp::Reverse(seg.max_doc()))
.collect::<Vec<&SegmentMeta>>(); .collect::<Vec<&SegmentMeta>>();
if size_sorted_segments.len() <= 1 { if size_sorted_segments.is_empty() {
return vec![]; return vec![];
} }
size_sorted_segments.sort_by_key(|seg| std::cmp::Reverse(seg.num_docs()));
let mut current_max_log_size = f64::MAX; let mut current_max_log_size = f64::MAX;
let mut levels = vec![]; let mut levels = vec![];
@@ -82,7 +116,10 @@ impl MergePolicy for LogMergePolicy {
levels levels
.iter() .iter()
.filter(|level| level.len() >= self.min_num_segments) .filter(|level| {
level.len() >= self.min_num_segments
|| self.has_segment_above_deletes_threshold(level)
})
.map(|segments| MergeCandidate(segments.iter().map(|&seg| seg.id()).collect())) .map(|segments| MergeCandidate(segments.iter().map(|&seg| seg.id()).collect()))
.collect() .collect()
} }
@@ -95,6 +132,7 @@ impl Default for LogMergePolicy {
max_docs_before_merge: DEFAULT_MAX_DOCS_BEFORE_MERGE, max_docs_before_merge: DEFAULT_MAX_DOCS_BEFORE_MERGE,
min_layer_size: DEFAULT_MIN_LAYER_SIZE, min_layer_size: DEFAULT_MIN_LAYER_SIZE,
level_log_size: DEFAULT_LEVEL_LOG_SIZE, level_log_size: DEFAULT_LEVEL_LOG_SIZE,
del_docs_ratio_before_merge: DEFAULT_DEL_DOCS_RATIO_BEFORE_MERGE,
} }
} }
} }
@@ -114,7 +152,7 @@ mod tests {
use crate::Index; use crate::Index;
#[test] #[test]
fn create_index_test_max_merge_issue_1035() { fn create_index_test_max_merge_issue_1035() -> crate::Result<()> {
let mut schema_builder = schema::Schema::builder(); let mut schema_builder = schema::Schema::builder();
let int_field = schema_builder.add_u64_field("intval", INDEXED); let int_field = schema_builder.add_u64_field("intval", INDEXED);
let schema = schema_builder.build(); let schema = schema_builder.build();
@@ -127,34 +165,34 @@ mod tests {
log_merge_policy.set_max_docs_before_merge(1); log_merge_policy.set_max_docs_before_merge(1);
log_merge_policy.set_min_layer_size(0); log_merge_policy.set_min_layer_size(0);
let mut index_writer = index.writer_for_tests().unwrap(); let mut index_writer = index.writer_for_tests()?;
index_writer.set_merge_policy(Box::new(log_merge_policy)); index_writer.set_merge_policy(Box::new(log_merge_policy));
// after every commit the merge checker is started, it will merge only segments with 1 // after every commit the merge checker is started, it will merge only segments with 1
// element in it because of the max_merge_size. // element in it because of the max_merge_size.
index_writer.add_document(doc!(int_field=>1_u64)); index_writer.add_document(doc!(int_field=>1_u64))?;
assert!(index_writer.commit().is_ok()); index_writer.commit()?;
index_writer.add_document(doc!(int_field=>2_u64)); index_writer.add_document(doc!(int_field=>2_u64))?;
assert!(index_writer.commit().is_ok()); index_writer.commit()?;
index_writer.add_document(doc!(int_field=>3_u64)); index_writer.add_document(doc!(int_field=>3_u64))?;
assert!(index_writer.commit().is_ok()); index_writer.commit()?;
index_writer.add_document(doc!(int_field=>4_u64)); index_writer.add_document(doc!(int_field=>4_u64))?;
assert!(index_writer.commit().is_ok()); index_writer.commit()?;
index_writer.add_document(doc!(int_field=>5_u64)); index_writer.add_document(doc!(int_field=>5_u64))?;
assert!(index_writer.commit().is_ok()); index_writer.commit()?;
index_writer.add_document(doc!(int_field=>6_u64)); index_writer.add_document(doc!(int_field=>6_u64))?;
assert!(index_writer.commit().is_ok()); index_writer.commit()?;
index_writer.add_document(doc!(int_field=>7_u64)); index_writer.add_document(doc!(int_field=>7_u64))?;
assert!(index_writer.commit().is_ok()); index_writer.commit()?;
index_writer.add_document(doc!(int_field=>8_u64)); index_writer.add_document(doc!(int_field=>8_u64))?;
assert!(index_writer.commit().is_ok()); index_writer.commit()?;
} }
let _segment_ids = index let _segment_ids = index
@@ -169,6 +207,7 @@ mod tests {
panic!("segment can't have more than two segments"); panic!("segment can't have more than two segments");
} // don't know how to wait for the merge, then it could be a simple eq } // don't know how to wait for the merge, then it could be a simple eq
} }
Ok(())
} }
fn test_merge_policy() -> LogMergePolicy { fn test_merge_policy() -> LogMergePolicy {
@@ -287,4 +326,49 @@ mod tests {
assert_eq!(result_list[0].0[1], test_input[4].id()); assert_eq!(result_list[0].0[1], test_input[4].id());
assert_eq!(result_list[0].0[2], test_input[5].id()); assert_eq!(result_list[0].0[2], test_input[5].id());
} }
#[test]
fn test_merge_single_segment_with_deletes_below_threshold() {
let mut test_merge_policy = test_merge_policy();
test_merge_policy.set_del_docs_ratio_before_merge(0.25f32);
let test_input = vec![create_random_segment_meta(40_000).with_delete_meta(10_000, 1)];
let merge_candidates = test_merge_policy.compute_merge_candidates(&test_input);
assert!(merge_candidates.is_empty());
}
#[test]
fn test_merge_single_segment_with_deletes_above_threshold() {
let mut test_merge_policy = test_merge_policy();
test_merge_policy.set_del_docs_ratio_before_merge(0.25f32);
let test_input = vec![create_random_segment_meta(40_000).with_delete_meta(10_001, 1)];
let merge_candidates = test_merge_policy.compute_merge_candidates(&test_input);
assert_eq!(merge_candidates.len(), 1);
}
#[test]
fn test_merge_segments_with_deletes_above_threshold_all_in_level() {
let mut test_merge_policy = test_merge_policy();
test_merge_policy.set_del_docs_ratio_before_merge(0.25f32);
let test_input = vec![
create_random_segment_meta(40_000).with_delete_meta(10_001, 1),
create_random_segment_meta(40_000),
];
let merge_candidates = test_merge_policy.compute_merge_candidates(&test_input);
assert_eq!(merge_candidates.len(), 1);
assert_eq!(merge_candidates[0].0.len(), 2);
}
#[test]
fn test_merge_segments_with_deletes_above_threshold_different_level_not_involved() {
let mut test_merge_policy = test_merge_policy();
test_merge_policy.set_del_docs_ratio_before_merge(0.25f32);
let test_input = vec![
create_random_segment_meta(100),
create_random_segment_meta(40_000).with_delete_meta(10_001, 1),
];
let merge_candidates = test_merge_policy.compute_merge_candidates(&test_input);
assert_eq!(merge_candidates.len(), 1);
assert_eq!(merge_candidates[0].0.len(), 1);
assert_eq!(merge_candidates[0].0[0], test_input[1].id());
}
} }


@@ -1,6 +1,6 @@
use crate::Opstamp; use crate::Opstamp;
use crate::SegmentId; use crate::SegmentId;
use census::{Inventory, TrackedObject}; use crate::{Inventory, TrackedObject};
use std::collections::HashSet; use std::collections::HashSet;
use std::ops::Deref; use std::ops::Deref;


@@ -1,22 +1,18 @@
#[cfg(test)] #[cfg(test)]
mod tests { mod tests {
use crate::fastfield::FastFieldReader; use crate::collector::TopDocs;
use crate::{ use crate::core::Index;
collector::TopDocs, use crate::fastfield::MultiValuedFastFieldReader;
schema::{Cardinality, TextFieldIndexing}, use crate::fastfield::{AliveBitSet, FastFieldReader};
use crate::query::QueryParser;
use crate::schema::{
self, BytesOptions, Cardinality, Facet, FacetOptions, IndexRecordOption, TextFieldIndexing,
}; };
use crate::{core::Index, fastfield::MultiValuedFastFieldReader}; use crate::schema::{IntOptions, TextOptions};
use crate::{ use crate::DocAddress;
query::QueryParser, use crate::IndexSortByField;
schema::{IntOptions, TextOptions}, use crate::Order;
}; use crate::{DocSet, IndexSettings, Postings, Term};
use crate::{schema::Facet, IndexSortByField};
use crate::{schema::INDEXED, Order};
use crate::{
schema::{self, BytesOptions},
DocAddress,
};
use crate::{IndexSettings, Term};
use futures::executor::block_on; use futures::executor::block_on;
fn create_test_index_posting_list_issue(index_settings: Option<IndexSettings>) -> Index { fn create_test_index_posting_list_issue(index_settings: Option<IndexSettings>) -> Index {
@@ -26,7 +22,7 @@ mod tests {
.set_indexed(); .set_indexed();
let int_field = schema_builder.add_u64_field("intval", int_options); let int_field = schema_builder.add_u64_field("intval", int_options);
let facet_field = schema_builder.add_facet_field("facet", INDEXED); let facet_field = schema_builder.add_facet_field("facet", FacetOptions::default());
let schema = schema_builder.build(); let schema = schema_builder.build();
@@ -38,14 +34,17 @@ mod tests {
{ {
let mut index_writer = index.writer_for_tests().unwrap(); let mut index_writer = index.writer_for_tests().unwrap();
index_writer
index_writer.add_document(doc!(int_field=>3_u64, facet_field=> Facet::from("/crime"))); .add_document(doc!(int_field=>3_u64, facet_field=> Facet::from("/crime")))
index_writer.add_document(doc!(int_field=>6_u64, facet_field=> Facet::from("/crime"))); .unwrap();
index_writer
assert!(index_writer.commit().is_ok()); .add_document(doc!(int_field=>6_u64, facet_field=> Facet::from("/crime")))
index_writer.add_document(doc!(int_field=>5_u64, facet_field=> Facet::from("/fanta"))); .unwrap();
index_writer.commit().unwrap();
assert!(index_writer.commit().is_ok()); index_writer
.add_document(doc!(int_field=>5_u64, facet_field=> Facet::from("/fanta")))
.unwrap();
index_writer.commit().unwrap();
} }
// Merging the segments // Merging the segments
@@ -65,7 +64,7 @@ mod tests {
fn create_test_index( fn create_test_index(
index_settings: Option<IndexSettings>, index_settings: Option<IndexSettings>,
force_disjunct_segment_sort_values: bool, force_disjunct_segment_sort_values: bool,
) -> Index { ) -> crate::Result<Index> {
let mut schema_builder = schema::Schema::builder(); let mut schema_builder = schema::Schema::builder();
let int_options = IntOptions::default() let int_options = IntOptions::default()
.set_fast(Cardinality::SingleValue) .set_fast(Cardinality::SingleValue)
@@ -75,7 +74,7 @@ mod tests {
let bytes_options = BytesOptions::default().set_fast().set_indexed(); let bytes_options = BytesOptions::default().set_fast().set_indexed();
let bytes_field = schema_builder.add_bytes_field("bytes", bytes_options); let bytes_field = schema_builder.add_bytes_field("bytes", bytes_options);
let facet_field = schema_builder.add_facet_field("facet", INDEXED); let facet_field = schema_builder.add_facet_field("facet", FacetOptions::default());
let multi_numbers = schema_builder.add_u64_field( let multi_numbers = schema_builder.add_u64_field(
"multi_numbers", "multi_numbers",
@@ -94,32 +93,34 @@ mod tests {
if let Some(settings) = index_settings { if let Some(settings) = index_settings {
index_builder = index_builder.settings(settings); index_builder = index_builder.settings(settings);
} }
let index = index_builder.create_in_ram().unwrap(); let index = index_builder.create_in_ram()?;
{ {
let mut index_writer = index.writer_for_tests().unwrap(); let mut index_writer = index.writer_for_tests()?;
// segment 1 - range 1-3 // segment 1 - range 1-3
index_writer.add_document(doc!(int_field=>1_u64)); index_writer.add_document(doc!(int_field=>1_u64))?;
index_writer.add_document( index_writer.add_document(
doc!(int_field=>3_u64, multi_numbers => 3_u64, multi_numbers => 4_u64, bytes_field => vec![1, 2, 3], text_field => "some text", facet_field=> Facet::from("/book/crime")), doc!(int_field=>3_u64, multi_numbers => 3_u64, multi_numbers => 4_u64, bytes_field => vec![1, 2, 3], text_field => "some text", facet_field=> Facet::from("/book/crime")),
); )?;
index_writer.add_document(doc!(int_field=>1_u64, text_field=> "deleteme"));
index_writer.add_document( index_writer.add_document(
doc!(int_field=>2_u64, multi_numbers => 2_u64, multi_numbers => 3_u64), doc!(int_field=>1_u64, text_field=> "deleteme", text_field => "ok text more text"),
); )?;
index_writer.add_document(
doc!(int_field=>2_u64, multi_numbers => 2_u64, multi_numbers => 3_u64, text_field => "ok text more text"),
)?;
assert!(index_writer.commit().is_ok()); index_writer.commit()?;
// segment 2 - range 1-20 , with force_disjunct_segment_sort_values 10-20 // segment 2 - range 1-20 , with force_disjunct_segment_sort_values 10-20
index_writer.add_document(doc!(int_field=>20_u64, multi_numbers => 20_u64)); index_writer.add_document(doc!(int_field=>20_u64, multi_numbers => 20_u64))?;
let in_val = if force_disjunct_segment_sort_values { let in_val = if force_disjunct_segment_sort_values {
10_u64 10_u64
} else { } else {
1 1
}; };
index_writer.add_document(doc!(int_field=>in_val, text_field=> "deleteme", facet_field=> Facet::from("/book/crime"))); index_writer.add_document(doc!(int_field=>in_val, text_field=> "deleteme" , text_field => "ok text more text", facet_field=> Facet::from("/book/crime")))?;
assert!(index_writer.commit().is_ok()); index_writer.commit()?;
// segment 3 - range 5-1000, with force_disjunct_segment_sort_values 50-1000 // segment 3 - range 5-1000, with force_disjunct_segment_sort_values 50-1000
let int_vals = if force_disjunct_segment_sort_values { let int_vals = if force_disjunct_segment_sort_values {
[100_u64, 50] [100_u64, 50]
@@ -128,26 +129,24 @@ mod tests {
}; };
index_writer.add_document( // position of this doc after delete in desc sorting = [2], in disjunct case [1] index_writer.add_document( // position of this doc after delete in desc sorting = [2], in disjunct case [1]
doc!(int_field=>int_vals[0], multi_numbers => 10_u64, multi_numbers => 11_u64, text_field=> "blubber", facet_field=> Facet::from("/book/fantasy")), doc!(int_field=>int_vals[0], multi_numbers => 10_u64, multi_numbers => 11_u64, text_field=> "blubber", facet_field=> Facet::from("/book/fantasy")),
); )?;
index_writer.add_document(doc!(int_field=>int_vals[1], text_field=> "deleteme")); index_writer.add_document(doc!(int_field=>int_vals[1], text_field=> "deleteme"))?;
index_writer.add_document( index_writer.add_document(
doc!(int_field=>1_000u64, multi_numbers => 1001_u64, multi_numbers => 1002_u64, bytes_field => vec![5, 5],text_field => "the biggest num") doc!(int_field=>1_000u64, multi_numbers => 1001_u64, multi_numbers => 1002_u64, bytes_field => vec![5, 5],text_field => "the biggest num")
); )?;
index_writer.delete_term(Term::from_field_text(text_field, "deleteme")); index_writer.delete_term(Term::from_field_text(text_field, "deleteme"));
assert!(index_writer.commit().is_ok()); index_writer.commit()?;
} }
// Merging the segments // Merging the segments
{ {
let segment_ids = index let segment_ids = index.searchable_segment_ids()?;
.searchable_segment_ids() let mut index_writer = index.writer_for_tests()?;
.expect("Searchable segments failed."); block_on(index_writer.merge(&segment_ids))?;
let mut index_writer = index.writer_for_tests().unwrap(); index_writer.wait_merging_threads()?;
assert!(block_on(index_writer.merge(&segment_ids)).is_ok());
assert!(index_writer.wait_merging_threads().is_ok());
} }
index Ok(index)
} }
#[test] #[test]
@@ -180,7 +179,8 @@ mod tests {
..Default::default() ..Default::default()
}), }),
force_disjunct_segment_sort_values, force_disjunct_segment_sort_values,
); )
.unwrap();
let int_field = index.schema().get_field("intval").unwrap(); let int_field = index.schema().get_field("intval").unwrap();
let reader = index.reader().unwrap(); let reader = index.reader().unwrap();
@@ -243,6 +243,36 @@ mod tests {
assert_eq!(do_search("biggest"), vec![0]); assert_eq!(do_search("biggest"), vec![0]);
} }
// postings file
{
let my_text_field = index.schema().get_field("text_field").unwrap();
let term_a = Term::from_field_text(my_text_field, "text");
let inverted_index = segment_reader.inverted_index(my_text_field).unwrap();
let mut postings = inverted_index
.read_postings(&term_a, IndexRecordOption::WithFreqsAndPositions)
.unwrap()
.unwrap();
assert_eq!(postings.doc_freq(), 2);
let fallback_bitset = AliveBitSet::for_test_from_deleted_docs(&[0], 100);
assert_eq!(
postings.doc_freq_given_deletes(
segment_reader.alive_bitset().unwrap_or(&fallback_bitset)
),
2
);
assert_eq!(postings.term_freq(), 1);
let mut output = vec![];
postings.positions(&mut output);
assert_eq!(output, vec![1]);
postings.advance();
assert_eq!(postings.term_freq(), 2);
postings.positions(&mut output);
assert_eq!(output, vec![1, 3]);
}
// access doc store // access doc store
{ {
let blubber_pos = if force_disjunct_segment_sort_values { let blubber_pos = if force_disjunct_segment_sort_values {
@@ -260,6 +290,70 @@ mod tests {
} }
} }
#[test]
fn test_merge_unsorted_index() {
let index = create_test_index(
Some(IndexSettings {
..Default::default()
}),
false,
)
.unwrap();
let reader = index.reader().unwrap();
let searcher = reader.searcher();
assert_eq!(searcher.segment_readers().len(), 1);
let segment_reader = searcher.segment_readers().last().unwrap();
let searcher = index.reader().unwrap().searcher();
{
let my_text_field = index.schema().get_field("text_field").unwrap();
let do_search = |term: &str| {
let query = QueryParser::for_index(&index, vec![my_text_field])
.parse_query(term)
.unwrap();
let top_docs: Vec<(f32, DocAddress)> =
searcher.search(&query, &TopDocs::with_limit(3)).unwrap();
top_docs.iter().map(|el| el.1.doc_id).collect::<Vec<_>>()
};
assert_eq!(do_search("some"), vec![1]);
assert_eq!(do_search("blubber"), vec![3]);
assert_eq!(do_search("biggest"), vec![4]);
}
// postings file
{
let my_text_field = index.schema().get_field("text_field").unwrap();
let term_a = Term::from_field_text(my_text_field, "text");
let inverted_index = segment_reader.inverted_index(my_text_field).unwrap();
let mut postings = inverted_index
.read_postings(&term_a, IndexRecordOption::WithFreqsAndPositions)
.unwrap()
.unwrap();
assert_eq!(postings.doc_freq(), 2);
let fallback_bitset = AliveBitSet::for_test_from_deleted_docs(&[0], 100);
assert_eq!(
postings.doc_freq_given_deletes(
segment_reader.alive_bitset().unwrap_or(&fallback_bitset)
),
2
);
assert_eq!(postings.term_freq(), 1);
let mut output = vec![];
postings.positions(&mut output);
assert_eq!(output, vec![1]);
postings.advance();
assert_eq!(postings.term_freq(), 2);
postings.positions(&mut output);
assert_eq!(output, vec![1, 3]);
}
}
#[test]
fn test_merge_sorted_index_asc() {
let index = create_test_index(
@@ -271,7 +365,8 @@ mod tests {
..Default::default()
}),
false,
-);
+)
+.unwrap();
let int_field = index.schema().get_field("intval").unwrap();
let multi_numbers = index.schema().get_field("multi_numbers").unwrap();
@@ -314,7 +409,7 @@ mod tests {
let my_text_field = index.schema().get_field("text_field").unwrap();
let fieldnorm_reader = segment_reader.get_fieldnorms_reader(my_text_field).unwrap();
assert_eq!(fieldnorm_reader.fieldnorm(0), 0);
-assert_eq!(fieldnorm_reader.fieldnorm(1), 0);
+assert_eq!(fieldnorm_reader.fieldnorm(1), 4);
assert_eq!(fieldnorm_reader.fieldnorm(2), 2); // some text
assert_eq!(fieldnorm_reader.fieldnorm(3), 1);
assert_eq!(fieldnorm_reader.fieldnorm(5), 3); // the biggest num
@@ -339,6 +434,34 @@ mod tests {
assert_eq!(do_search("biggest"), vec![5]);
}
// postings file
{
let my_text_field = index.schema().get_field("text_field").unwrap();
let term_a = Term::from_field_text(my_text_field, "text");
let inverted_index = segment_reader.inverted_index(my_text_field).unwrap();
let mut postings = inverted_index
.read_postings(&term_a, IndexRecordOption::WithFreqsAndPositions)
.unwrap()
.unwrap();
assert_eq!(postings.doc_freq(), 2);
let fallback_bitset = AliveBitSet::for_test_from_deleted_docs(&[0], 100);
assert_eq!(
postings.doc_freq_given_deletes(
segment_reader.alive_bitset().unwrap_or(&fallback_bitset)
),
2
);
let mut output = vec![];
postings.positions(&mut output);
assert_eq!(output, vec![1, 3]);
postings.advance();
postings.positions(&mut output);
assert_eq!(output, vec![1]);
}
// access doc store
{
let doc = searcher.doc(DocAddress::new(0, 0)).unwrap();
@@ -393,7 +516,7 @@ mod bench_sorted_index_merge {
let index_doc = |index_writer: &mut IndexWriter, val: u64| {
let mut doc = Document::default();
doc.add_u64(int_field, val);
-index_writer.add_document(doc);
+index_writer.add_document(doc).unwrap();
};
// 3 segments with 10_000 values in the fast fields
for _ in 0..3 {
@@ -422,14 +545,15 @@ mod bench_sorted_index_merge {
let doc_id_mapping = merger.generate_doc_id_mapping(&sort_by_field).unwrap();
b.iter(|| {
-let sorted_doc_ids = doc_id_mapping.iter().map(|(doc_id, reader)|{
+let sorted_doc_ids = doc_id_mapping.iter().map(|(doc_id, ordinal)|{
-let u64_reader: DynamicFastFieldReader<u64> = reader.reader
+let reader = &merger.readers[*ordinal as usize];
+let u64_reader: DynamicFastFieldReader<u64> = reader
.fast_fields()
.typed_fast_field_reader(field)
.expect("Failed to find a reader for single fast field. This is a tantivy bug and it should never happen.");
(doc_id, reader, u64_reader)
});
-// add values in order of the new docids
+// add values in order of the new doc_ids
let mut val = 0;
for (doc_id, _reader, field_reader) in sorted_doc_ids {
val = field_reader.get(*doc_id);
@@ -442,7 +566,7 @@ mod bench_sorted_index_merge {
Ok(())
}
#[bench]
-fn create_sorted_index_create_docid_mapping(b: &mut Bencher) -> crate::Result<()> {
+fn create_sorted_index_create_doc_id_mapping(b: &mut Bencher) -> crate::Result<()> {
let sort_by_field = IndexSortByField {
field: "intval".to_string(),
order: Order::Desc,

View File

@@ -1,15 +1,17 @@
pub mod delete_queue;
pub mod demuxer;
pub mod doc_id_mapping;
mod doc_opstamp_mapping;
pub mod index_writer;
mod index_writer_status;
mod log_merge_policy;
mod merge_operation;
pub mod merge_policy;
pub mod merger;
mod merger_sorted_index_test;
pub mod operation;
-mod prepared_commit;
+pub mod prepared_commit;
mod segment_entry;
mod segment_manager;
mod segment_register;
@@ -18,6 +20,11 @@ pub mod segment_updater;
mod segment_writer;
mod stamper;
use crossbeam::channel;
use smallvec::SmallVec;
use crate::indexer::operation::AddOperation;
pub use self::index_writer::IndexWriter;
pub use self::log_merge_policy::LogMergePolicy;
pub use self::merge_operation::MergeOperation;
@@ -26,12 +33,23 @@ pub use self::prepared_commit::PreparedCommit;
pub use self::segment_entry::SegmentEntry;
pub use self::segment_manager::SegmentManager;
pub use self::segment_serializer::SegmentSerializer;
-pub use self::segment_updater::merge_segments;
+pub use self::segment_updater::merge_filtered_segments;
pub use self::segment_updater::merge_indices;
pub use self::segment_writer::SegmentWriter;
/// Alias for the default merge policy, which is the `LogMergePolicy`.
pub type DefaultMergePolicy = LogMergePolicy;
// Batch of documents.
// Most of the time, users will send operations one by one, but it can be useful to
// send them as a small block to ensure that
// - all docs in the batch land in the same segment, with contiguous doc_ids.
// - all operations in the group are committed at the same time, making the group
// atomic.
type AddBatch = SmallVec<[AddOperation; 4]>;
type AddBatchSender = channel::Sender<AddBatch>;
type AddBatchReceiver = channel::Receiver<AddBatch>;
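The comment above explains why adds travel through the channel as a batch rather than one at a time. A std-only sketch of that idea (plain `Vec` and `std::sync::mpsc` standing in for `SmallVec` and crossbeam; `AddOperation`'s fields here are illustrative, not tantivy's):

```rust
use std::sync::mpsc;

// Hypothetical stand-ins for tantivy's `AddOperation` and the SmallVec-backed
// `AddBatch`. This sketches the batching idea only, not tantivy's actual API.
#[derive(Debug)]
pub struct AddOperation {
    pub opstamp: u64,
    pub doc: String,
}

pub type AddBatch = Vec<AddOperation>;

// Send one batch through a channel and collect what the consumer side sees.
pub fn roundtrip(batch: AddBatch) -> Vec<AddBatch> {
    let (sender, receiver) = mpsc::channel::<AddBatch>();
    sender.send(batch).unwrap();
    drop(sender); // close the channel so the iterator below terminates
    receiver.into_iter().collect()
}

fn main() {
    let batch = vec![
        AddOperation { opstamp: 0, doc: "first".to_string() },
        AddOperation { opstamp: 1, doc: "second".to_string() },
    ];
    let received = roundtrip(batch);
    // The consumer receives both docs as one unit: that is what keeps their
    // doc_ids contiguous within a single segment and their commit atomic.
    assert_eq!(received.len(), 1);
    assert_eq!(received[0].len(), 2);
}
```

Because the whole batch is a single channel message, the indexing worker can never interleave another writer's documents between the two operations.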
#[cfg(feature = "mmap")]
#[cfg(test)]
mod tests_mmap {
@@ -39,19 +57,20 @@ mod tests_mmap {
use crate::{Index, Term};
#[test]
-fn test_advance_delete_bug() {
+fn test_advance_delete_bug() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
let text_field = schema_builder.add_text_field("text", schema::TEXT);
-let index = Index::create_from_tempdir(schema_builder.build()).unwrap();
+let index = Index::create_from_tempdir(schema_builder.build())?;
-let mut index_writer = index.writer_for_tests().unwrap();
+let mut index_writer = index.writer_for_tests()?;
// there must be one deleted document in the segment
-index_writer.add_document(doc!(text_field=>"b"));
+index_writer.add_document(doc!(text_field=>"b"))?;
index_writer.delete_term(Term::from_field_text(text_field, "b"));
// we need enough data to trigger the bug (at least 32 documents)
for _ in 0..32 {
-index_writer.add_document(doc!(text_field=>"c"));
+index_writer.add_document(doc!(text_field=>"c"))?;
}
-index_writer.commit().unwrap();
+index_writer.commit()?;
-index_writer.commit().unwrap();
+index_writer.commit()?;
Ok(())
}
}

View File

@@ -18,25 +18,38 @@ impl<'a> PreparedCommit<'a> {
}
}
/// Returns the opstamp associated with the prepared commit.
pub fn opstamp(&self) -> Opstamp {
self.opstamp
}
/// Adds an arbitrary payload to the commit.
pub fn set_payload(&mut self, payload: &str) {
self.payload = Some(payload.to_string())
}
/// Rolls back any change.
pub fn abort(self) -> crate::Result<Opstamp> {
self.index_writer.rollback()
}
/// Proceeds to commit.
/// See `.commit_async()`.
pub fn commit(self) -> crate::Result<Opstamp> {
block_on(self.commit_async())
}
/// Proceeds to commit.
///
/// Unfortunately, contrary to what `PrepareCommit` may suggest,
/// this operation is not lightweight at all.
/// At this point deletes have not been flushed yet.
pub async fn commit_async(self) -> crate::Result<Opstamp> {
info!("committing {}", self.opstamp);
-let _ = block_on(
-self.index_writer
-.segment_updater()
-.schedule_commit(self.opstamp, self.payload),
-);
+self.index_writer
+.segment_updater()
+.schedule_commit(self.opstamp, self.payload)
+.await?;
Ok(self.opstamp)
}
}
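The `PreparedCommit` API above is a two-phase commit handle: `prepare` captures an opstamp, after which the caller either commits (optionally attaching a payload) or aborts. A minimal std-only sketch of that shape, with illustrative names and a dummy writer instead of tantivy's `IndexWriter`:

```rust
// Toy two-phase commit handle mirroring the PreparedCommit flow.
pub struct Writer {
    pub committed_opstamp: u64,
    pub next_opstamp: u64,
}

pub struct Prepared<'a> {
    writer: &'a mut Writer,
    opstamp: u64,
    payload: Option<String>,
}

impl Writer {
    // Reserve an opstamp; nothing is visible until `commit` is called.
    pub fn prepare_commit(&mut self) -> Prepared<'_> {
        let opstamp = self.next_opstamp;
        self.next_opstamp += 1;
        Prepared { writer: self, opstamp, payload: None }
    }
}

impl<'a> Prepared<'a> {
    pub fn set_payload(&mut self, payload: &str) {
        self.payload = Some(payload.to_string());
    }
    // Consuming `self` makes "commit or abort, exactly once" a type-level rule.
    pub fn commit(self) -> u64 {
        self.writer.committed_opstamp = self.opstamp;
        self.opstamp
    }
    pub fn abort(self) -> u64 {
        // Nothing becomes visible; the writer keeps its old opstamp.
        self.writer.committed_opstamp
    }
}

fn main() {
    let mut writer = Writer { committed_opstamp: 0, next_opstamp: 1 };
    let mut prepared = writer.prepare_commit();
    prepared.set_payload("checkpoint-1");
    assert_eq!(prepared.commit(), 1);
    assert_eq!(writer.committed_opstamp, 1);
}
```

Taking `self` by value in `commit`/`abort` is the same design choice the real API makes: the borrow checker guarantees a prepared commit cannot be finished twice.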

View File

@@ -1,7 +1,7 @@
use crate::common::BitSet;
use crate::core::SegmentId;
use crate::core::SegmentMeta;
use crate::indexer::delete_queue::DeleteCursor;
use common::BitSet;
use std::fmt;
/// A segment entry describes the state of
@@ -9,18 +9,16 @@ use std::fmt;
///
/// In addition to segment `meta`,
/// it contains a few transient states
-/// - `state` expresses whether the segment is already in the
-/// middle of a merge
-/// - `delete_bitset` is a bitset describing
-/// documents that were deleted during the commit
+/// - `alive_bitset` is a bitset describing
+/// documents that were alive during the commit
/// itself.
/// - `delete_cursor` is the position in the delete queue.
/// Deletes happening before the cursor are reflected either
-/// in the .del file or in the `delete_bitset`.
+/// in the .del file or in the `alive_bitset`.
#[derive(Clone)]
pub struct SegmentEntry {
meta: SegmentMeta,
-delete_bitset: Option<BitSet>,
+alive_bitset: Option<BitSet>,
delete_cursor: DeleteCursor,
}
@@ -29,11 +27,11 @@ impl SegmentEntry {
pub fn new(
segment_meta: SegmentMeta,
delete_cursor: DeleteCursor,
-delete_bitset: Option<BitSet>,
+alive_bitset: Option<BitSet>,
) -> SegmentEntry {
SegmentEntry {
meta: segment_meta,
-delete_bitset,
+alive_bitset,
delete_cursor,
}
}
@@ -41,8 +39,8 @@ impl SegmentEntry {
/// Return a reference to the segment entry deleted bitset.
///
/// `DocId` in this bitset are flagged as deleted.
-pub fn delete_bitset(&self) -> Option<&BitSet> {
+pub fn alive_bitset(&self) -> Option<&BitSet> {
-self.delete_bitset.as_ref()
+self.alive_bitset.as_ref()
}
/// Set the `SegmentMeta` for this segment.
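The rename above flips the bitset's polarity: a set bit now marks a doc that is alive, so counting survivors is a popcount. A toy std-only alive bitset illustrating that representation (all names are illustrative, not tantivy's `AliveBitSet` API):

```rust
// Toy alive-bitset: 64 docs per word, bit i set means doc i is alive.
pub struct AliveBitSet {
    words: Vec<u64>,
}

impl AliveBitSet {
    pub fn with_all_alive(num_docs: u32) -> AliveBitSet {
        let num_words = (num_docs as usize + 63) / 64;
        let mut words = vec![u64::MAX; num_words];
        // Clear padding bits past `num_docs` in the last word.
        let tail = num_docs as usize % 64;
        if tail != 0 {
            if let Some(last) = words.last_mut() {
                *last = (1u64 << tail) - 1;
            }
        }
        AliveBitSet { words }
    }

    pub fn delete(&mut self, doc: u32) {
        self.words[(doc / 64) as usize] &= !(1u64 << (doc % 64));
    }

    pub fn is_alive(&self, doc: u32) -> bool {
        self.words[(doc / 64) as usize] & (1u64 << (doc % 64)) != 0
    }

    // With alive polarity, the live-doc count is a straight popcount.
    pub fn num_alive(&self) -> u32 {
        self.words.iter().map(|w| w.count_ones()).sum()
    }
}

fn main() {
    let mut alive = AliveBitSet::with_all_alive(100);
    alive.delete(0);
    alive.delete(42);
    assert!(!alive.is_alive(0));
    assert!(alive.is_alive(1));
    assert_eq!(alive.num_alive(), 98);
}
```

With delete polarity the same count would need `num_docs - popcount`, which is one reason the alive representation is convenient for `doc_freq_given_deletes`-style accounting.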

View File

@@ -66,13 +66,10 @@ impl SegmentRegister {
}
}
pub fn segment_metas(&self) -> Vec<SegmentMeta> {
-let mut segment_ids: Vec<SegmentMeta> = self
-.segment_states
+self.segment_states
.values()
.map(|segment_entry| segment_entry.meta().clone())
-.collect();
+.collect()
-segment_ids.sort_by_key(SegmentMeta::id);
-segment_ids
}
pub fn contains_all(&self, segment_ids: &[SegmentId]) -> bool {

View File

@@ -7,6 +7,7 @@ use crate::core::SegmentId;
use crate::core::SegmentMeta;
use crate::core::META_FILEPATH;
use crate::directory::{Directory, DirectoryClone, GarbageCollectionResult};
use crate::fastfield::AliveBitSet;
use crate::indexer::delete_queue::DeleteCursor;
use crate::indexer::index_writer::advance_deletes;
use crate::indexer::merge_operation::MergeOperationInventory;
@@ -19,12 +20,15 @@ use crate::indexer::{DefaultMergePolicy, MergePolicy};
use crate::indexer::{MergeCandidate, MergeOperation};
use crate::schema::Schema;
use crate::Opstamp;
use crate::TantivyError;
use fail::fail_point;
use futures::channel::oneshot;
use futures::executor::{ThreadPool, ThreadPoolBuilder};
use futures::future::Future;
use futures::future::TryFutureExt;
use std::borrow::BorrowMut;
use std::collections::HashSet;
use std::io;
use std::io::Write;
use std::ops::Deref;
use std::path::PathBuf;
@@ -57,7 +61,9 @@ pub fn save_new_metas(
payload: None,
},
directory,
-)
+)?;
directory.sync_directory()?;
Ok(())
}
/// Save the index meta file.
@@ -74,6 +80,11 @@ fn save_metas(metas: &IndexMeta, directory: &dyn Directory) -> crate::Result<()>
let mut buffer = serde_json::to_vec_pretty(metas)?;
// Just adding a new line at the end of the buffer.
writeln!(&mut buffer)?;
fail_point!("save_metas", |msg| Err(TantivyError::from(io::Error::new(
io::ErrorKind::Other,
msg.unwrap_or_else(|| "Undefined".to_string())
))));
directory.sync_directory()?;
directory.atomic_write(&META_FILEPATH, &buffer[..])?;
debug!("Saved metas {:?}", serde_json::to_string_pretty(&metas));
Ok(())
@@ -159,9 +170,9 @@ fn merge(
/// meant to work if you have an IndexWriter running for the origin indices, or
/// the destination Index.
#[doc(hidden)]
-pub fn merge_segments<Dir: Directory>(
+pub fn merge_indices<T: Into<Box<dyn Directory>>>(
indices: &[Index],
-output_directory: Dir,
+output_directory: T,
) -> crate::Result<Index> {
if indices.is_empty() {
// If there are no indices to merge, there is no need to do anything.
@@ -170,19 +181,8 @@ pub fn merge_segments<Dir: Directory>(
));
}
let target_schema = indices[0].schema();
let target_settings = indices[0].settings().clone();
// let's check that all of the indices have the same schema
if indices
.iter()
.skip(1)
.any(|index| index.schema() != target_schema)
{
return Err(crate::TantivyError::InvalidArgument(
"Attempt to merge different schema indices".to_string(),
));
}
// let's check that all of the indices have the same index settings
if indices
.iter()
@@ -199,13 +199,61 @@ pub fn merge_segments<Dir: Directory>(
segments.extend(index.searchable_segments()?);
}
-let mut merged_index = Index::create(output_directory, target_schema.clone(), target_settings)?;
+let non_filter = segments.iter().map(|_| None).collect::<Vec<_>>();
merge_filtered_segments(&segments, target_settings, non_filter, output_directory)
}
/// Advanced: Merges a list of segments from different indices in a new index.
/// Additionally, you can provide a delete bitset for each segment to ignore doc_ids.
///
/// Returns `TantivyError` if the indices list is empty or their
/// schemas don't match.
///
/// `output_directory`: is assumed to be empty.
///
/// # Warning
/// This function does NOT check whether an `IndexWriter` is running. It is not
/// meant to work if you have an IndexWriter running for the origin indices, or
/// the destination Index.
#[doc(hidden)]
pub fn merge_filtered_segments<T: Into<Box<dyn Directory>>>(
segments: &[Segment],
target_settings: IndexSettings,
filter_doc_ids: Vec<Option<AliveBitSet>>,
output_directory: T,
) -> crate::Result<Index> {
if segments.is_empty() {
// If there are no indices to merge, there is no need to do anything.
return Err(crate::TantivyError::InvalidArgument(
"No segments given to merge".to_string(),
));
}
let target_schema = segments[0].schema();
// let's check that all of the indices have the same schema
if segments
.iter()
.skip(1)
.any(|index| index.schema() != target_schema)
{
return Err(crate::TantivyError::InvalidArgument(
"Attempt to merge different schema indices".to_string(),
));
}
let mut merged_index = Index::create(
output_directory,
target_schema.clone(),
target_settings.clone(),
)?;
let merged_segment = merged_index.new_segment();
let merged_segment_id = merged_segment.id();
-let merger: IndexMerger = IndexMerger::open(
+let merger: IndexMerger = IndexMerger::open_with_custom_alive_set(
merged_index.schema(),
merged_index.settings().clone(),
-&segments[..],
+segments,
filter_doc_ids,
)?;
let segment_serializer = SegmentSerializer::for_segment(merged_segment, true)?;
let num_docs = merger.write(segment_serializer)?;
@@ -225,7 +273,7 @@ pub fn merge_segments<Dir: Directory>(
);
let index_meta = IndexMeta {
-index_settings: indices[0].load_metas()?.index_settings, // index_settings of all segments should be the same
+index_settings: target_settings, // index_settings of all segments should be the same
segments: vec![segment_meta],
schema: target_schema,
opstamp: 0u64,
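The `merge_filtered_segments` function above concatenates segments while an optional per-segment alive set drops doc_ids, so survivors get fresh contiguous ids in the merged segment. A std-only sketch of that doc-id remapping, with `Option<Vec<bool>>` standing in for `Vec<Option<AliveBitSet>>` (`None` means "keep everything"); all names here are illustrative:

```rust
// Sketch of the doc-id side of a filtered merge: walk the segments in order,
// skip docs whose filter bit is false, and hand out new contiguous doc ids.
pub fn merge_doc_ids(
    segments: &[Vec<&str>],
    alive: &[Option<Vec<bool>>],
) -> Vec<(usize, String)> {
    let mut merged = Vec::new();
    let mut new_doc_id = 0usize;
    for (segment, filter) in segments.iter().zip(alive) {
        for (old_doc_id, doc) in segment.iter().enumerate() {
            // `None` filter = no deletions in this segment.
            let keep = filter.as_ref().map_or(true, |bits| bits[old_doc_id]);
            if keep {
                merged.push((new_doc_id, doc.to_string()));
                new_doc_id += 1;
            }
        }
    }
    merged
}

fn main() {
    let segments = vec![vec!["a", "b"], vec!["c", "d"]];
    // Drop "b" from the first segment; keep the whole second segment.
    let alive = vec![Some(vec![true, false]), None];
    let merged = merge_doc_ids(&segments, &alive);
    assert_eq!(merged.len(), 3);
    assert_eq!(merged[0], (0, "a".to_string()));
    assert_eq!(merged[1], (1, "c".to_string()));
    assert_eq!(merged[2], (2, "d".to_string()));
}
```

The real merger applies the same remapping to postings, fast fields, and the doc store so all per-doc data stays aligned under the new ids.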
@@ -306,37 +354,39 @@ impl SegmentUpdater {
*self.merge_policy.write().unwrap() = arc_merge_policy;
}
-fn schedule_future<T: 'static + Send, F: Future<Output = crate::Result<T>> + 'static + Send>(
+async fn schedule_task<
T: 'static + Send,
F: Future<Output = crate::Result<T>> + 'static + Send,
>(
&self,
-f: F,
+task: F,
-) -> impl Future<Output = crate::Result<T>> {
+) -> crate::Result<T> {
-let (sender, receiver) = oneshot::channel();
-if self.is_alive() {
-self.pool.spawn_ok(async move {
-let _ = sender.send(f.await);
-});
-} else {
-let _ = sender.send(Err(crate::TantivyError::SystemError(
-"Segment updater killed".to_string(),
-)));
-}
-receiver.unwrap_or_else(|_| {
+if !self.is_alive() {
+return Err(crate::TantivyError::SystemError(
+"Segment updater killed".to_string(),
+));
+}
+let (sender, receiver) = oneshot::channel();
+self.pool.spawn_ok(async move {
+let task_result = task.await;
+let _ = sender.send(task_result);
+});
+let task_result = receiver.await;
+task_result.unwrap_or_else(|_| {
let err_msg =
"A segment_updater future did not success. This should never happen.".to_string();
Err(crate::TantivyError::SystemError(err_msg))
})
}
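`schedule_task` above is a submit-and-wait pattern: hand a task to a worker pool, then await its result over a one-shot channel. A synchronous std-only sketch of the same shape, using a thread and `std::sync::mpsc` in place of `futures::channel::oneshot` and the `ThreadPool` (names and the `String` error type are illustrative):

```rust
use std::sync::mpsc;
use std::thread;

// Submit a closure to a worker thread and block until its result comes back.
pub fn schedule_task<T, F>(task: F) -> Result<T, String>
where
    T: Send + 'static,
    F: FnOnce() -> Result<T, String> + Send + 'static,
{
    let (sender, receiver) = mpsc::channel();
    thread::spawn(move || {
        // The worker runs the task and reports its result exactly once.
        let _ = sender.send(task());
    });
    // If the worker vanished without sending, surface a system-style error,
    // mirroring the "future did not report back" branch in the code above.
    receiver
        .recv()
        .unwrap_or_else(|_| Err("a scheduled task did not report back".to_string()))
}

fn main() {
    let result = schedule_task(|| Ok::<u32, String>(41 + 1));
    assert_eq!(result, Ok(42));
}
```

Funneling every state mutation through one worker like this serializes commits, merges, and garbage collection without explicit locks, which is the design the `SegmentUpdater` relies on.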
-pub fn schedule_add_segment(
-&self,
-segment_entry: SegmentEntry,
-) -> impl Future<Output = crate::Result<()>> {
+pub async fn schedule_add_segment(&self, segment_entry: SegmentEntry) -> crate::Result<()> {
let segment_updater = self.clone();
-self.schedule_future(async move {
+self.schedule_task(async move {
segment_updater.segment_manager.add_segment(segment_entry);
segment_updater.consider_merge_options().await;
Ok(())
})
.await
}
/// Orders `SegmentManager` to remove all segments
@@ -403,11 +453,9 @@ impl SegmentUpdater {
Ok(())
}
-pub fn schedule_garbage_collect(
-&self,
-) -> impl Future<Output = crate::Result<GarbageCollectionResult>> {
+pub async fn schedule_garbage_collect(&self) -> crate::Result<GarbageCollectionResult> {
let garbage_collect_future = garbage_collect_files(self.clone());
-self.schedule_future(garbage_collect_future)
+self.schedule_task(garbage_collect_future).await
}
/// List the files that are useful to the index.
@@ -425,13 +473,13 @@ impl SegmentUpdater {
files
}
-pub fn schedule_commit(
+pub(crate) async fn schedule_commit(
&self,
opstamp: Opstamp,
payload: Option<String>,
-) -> impl Future<Output = crate::Result<()>> {
+) -> crate::Result<()> {
let segment_updater: SegmentUpdater = self.clone();
-self.schedule_future(async move {
+self.schedule_task(async move {
let segment_entries = segment_updater.purge_deletes(opstamp)?;
segment_updater.segment_manager.commit(segment_entries);
segment_updater.save_metas(opstamp, payload)?;
@@ -439,6 +487,7 @@ impl SegmentUpdater {
segment_updater.consider_merge_options().await;
Ok(())
})
.await
}
fn store_meta(&self, index_meta: &IndexMeta) {
@@ -513,9 +562,7 @@ impl SegmentUpdater {
e
);
// ... cancel merge
-if cfg!(test) {
-panic!("Merge failed.");
-}
+assert!(!cfg!(test), "Merge failed.");
}
}
});
@@ -568,14 +615,14 @@ impl SegmentUpdater {
}
}
-fn end_merge(
+async fn end_merge(
&self,
merge_operation: MergeOperation,
mut after_merge_segment_entry: SegmentEntry,
-) -> impl Future<Output = crate::Result<SegmentMeta>> {
+) -> crate::Result<SegmentMeta> {
let segment_updater = self.clone();
let after_merge_segment_meta = after_merge_segment_entry.meta().clone();
-let end_merge_future = self.schedule_future(async move {
+self.schedule_task(async move {
info!("End merge {:?}", after_merge_segment_entry.meta());
{
let mut delete_cursor = after_merge_segment_entry.delete_cursor().clone();
@@ -594,9 +641,8 @@ impl SegmentUpdater {
merge_operation.segment_ids(),
advance_deletes_err
);
-if cfg!(test) {
-panic!("Merge failed.");
-}
+assert!(!cfg!(test), "Merge failed.");
// ... cancel merge
// `merge_operations` are tracked. As it is dropped, the
// the segment_ids will be available again for merge.
@@ -619,8 +665,9 @@ impl SegmentUpdater {
let _ = garbage_collect_files(segment_updater).await;
Ok(())
-});
+})
+.await?;
-end_merge_future.map_ok(|_| after_merge_segment_meta)
+Ok(after_merge_segment_meta)
}
/// Wait for current merging threads.
@@ -646,11 +693,19 @@ impl SegmentUpdater {
#[cfg(test)]
mod tests {
-use super::merge_segments;
+use super::merge_indices;
use crate::collector::TopDocs;
use crate::directory::RamDirectory;
use crate::fastfield::AliveBitSet;
use crate::indexer::merge_policy::tests::MergeWheneverPossible;
use crate::indexer::merger::IndexMerger;
use crate::indexer::segment_updater::merge_filtered_segments;
use crate::query::QueryParser;
use crate::schema::*;
use crate::Directory;
use crate::DocAddress;
use crate::Index;
use crate::Segment;
#[test]
fn test_delete_during_merge() -> crate::Result<()> {
@@ -663,19 +718,19 @@ mod tests {
index_writer.set_merge_policy(Box::new(MergeWheneverPossible));
for _ in 0..100 {
-index_writer.add_document(doc!(text_field=>"a"));
+index_writer.add_document(doc!(text_field=>"a"))?;
-index_writer.add_document(doc!(text_field=>"b"));
+index_writer.add_document(doc!(text_field=>"b"))?;
}
index_writer.commit()?;
for _ in 0..100 {
-index_writer.add_document(doc!(text_field=>"c"));
+index_writer.add_document(doc!(text_field=>"c"))?;
-index_writer.add_document(doc!(text_field=>"d"));
+index_writer.add_document(doc!(text_field=>"d"))?;
}
index_writer.commit()?;
-index_writer.add_document(doc!(text_field=>"e"));
+index_writer.add_document(doc!(text_field=>"e"))?;
-index_writer.add_document(doc!(text_field=>"f"));
+index_writer.add_document(doc!(text_field=>"f"))?;
index_writer.commit()?;
let term = Term::from_field_text(text_field, "a");
@@ -693,6 +748,50 @@ mod tests {
Ok(())
}
#[test]
fn delete_all_docs_min() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
let text_field = schema_builder.add_text_field("text", TEXT);
let index = Index::create_in_ram(schema_builder.build());
// writing the segment
let mut index_writer = index.writer_for_tests()?;
for _ in 0..10 {
index_writer.add_document(doc!(text_field=>"a"))?;
index_writer.add_document(doc!(text_field=>"b"))?;
}
index_writer.commit()?;
let seg_ids = index.searchable_segment_ids()?;
// docs exist, should have at least 1 segment
assert!(!seg_ids.is_empty());
let term = Term::from_field_text(text_field, "a");
index_writer.delete_term(term);
index_writer.commit()?;
let term = Term::from_field_text(text_field, "b");
index_writer.delete_term(term);
index_writer.commit()?;
index_writer.wait_merging_threads()?;
let reader = index.reader()?;
assert_eq!(reader.searcher().num_docs(), 0);
let seg_ids = index.searchable_segment_ids()?;
assert!(seg_ids.is_empty());
reader.reload()?;
assert_eq!(reader.searcher().num_docs(), 0);
// empty segments should be erased
assert!(index.searchable_segment_metas()?.is_empty());
assert!(reader.searcher().segment_readers().is_empty());
Ok(())
}
#[test]
fn delete_all_docs() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
@@ -703,19 +802,19 @@ mod tests {
let mut index_writer = index.writer_for_tests()?;
for _ in 0..100 {
-index_writer.add_document(doc!(text_field=>"a"));
+index_writer.add_document(doc!(text_field=>"a"))?;
-index_writer.add_document(doc!(text_field=>"b"));
+index_writer.add_document(doc!(text_field=>"b"))?;
}
index_writer.commit()?;
for _ in 0..100 {
-index_writer.add_document(doc!(text_field=>"c"));
+index_writer.add_document(doc!(text_field=>"c"))?;
-index_writer.add_document(doc!(text_field=>"d"));
+index_writer.add_document(doc!(text_field=>"d"))?;
}
index_writer.commit()?;
-index_writer.add_document(doc!(text_field=>"e"));
+index_writer.add_document(doc!(text_field=>"e"))?;
-index_writer.add_document(doc!(text_field=>"f"));
+index_writer.add_document(doc!(text_field=>"f"))?;
index_writer.commit()?;
let seg_ids = index.searchable_segment_ids()?;
@@ -755,8 +854,8 @@ mod tests {
// writing the segment
let mut index_writer = index.writer_for_tests()?;
for _ in 0..100 {
-index_writer.add_document(doc!(text_field=>"a"));
+index_writer.add_document(doc!(text_field=>"a"))?;
-index_writer.add_document(doc!(text_field=>"b"));
+index_writer.add_document(doc!(text_field=>"b"))?;
}
index_writer.commit()?;
@@ -782,22 +881,22 @@ mod tests {
// writing two segments
let mut index_writer = index.writer_for_tests()?;
for _ in 0..100 {
-index_writer.add_document(doc!(text_field=>"fizz"));
+index_writer.add_document(doc!(text_field=>"fizz"))?;
-index_writer.add_document(doc!(text_field=>"buzz"));
+index_writer.add_document(doc!(text_field=>"buzz"))?;
}
index_writer.commit()?;
for _ in 0..1000 {
-index_writer.add_document(doc!(text_field=>"foo"));
+index_writer.add_document(doc!(text_field=>"foo"))?;
-index_writer.add_document(doc!(text_field=>"bar"));
+index_writer.add_document(doc!(text_field=>"bar"))?;
}
index_writer.commit()?;
indices.push(index);
}
assert_eq!(indices.len(), 3);
-let output_directory = RamDirectory::default();
+let output_directory: Box<dyn Directory> = Box::new(RamDirectory::default());
-let index = merge_segments(&indices, output_directory)?;
+let index = merge_indices(&indices, output_directory)?;
assert_eq!(index.schema(), schema);
let segments = index.searchable_segments()?;
@@ -811,7 +910,7 @@ mod tests {
#[test] #[test]
fn test_merge_empty_indices_array() { fn test_merge_empty_indices_array() {
let merge_result = merge_segments(&[], RamDirectory::default()); let merge_result = merge_indices(&[], RamDirectory::default());
assert!(merge_result.is_err()); assert!(merge_result.is_err());
} }
@@ -822,7 +921,7 @@ mod tests {
let text_field = schema_builder.add_text_field("text", TEXT); let text_field = schema_builder.add_text_field("text", TEXT);
let index = Index::create_in_ram(schema_builder.build()); let index = Index::create_in_ram(schema_builder.build());
let mut index_writer = index.writer_for_tests()?; let mut index_writer = index.writer_for_tests()?;
index_writer.add_document(doc!(text_field=>"some text")); index_writer.add_document(doc!(text_field=>"some text"))?;
index_writer.commit()?; index_writer.commit()?;
index index
}; };
@@ -832,15 +931,197 @@ mod tests {
let body_field = schema_builder.add_text_field("body", TEXT); let body_field = schema_builder.add_text_field("body", TEXT);
let index = Index::create_in_ram(schema_builder.build()); let index = Index::create_in_ram(schema_builder.build());
let mut index_writer = index.writer_for_tests()?; let mut index_writer = index.writer_for_tests()?;
index_writer.add_document(doc!(body_field=>"some body")); index_writer.add_document(doc!(body_field=>"some body"))?;
index_writer.commit()?; index_writer.commit()?;
index index
}; };
// mismatched schema index list // mismatched schema index list
let result = merge_segments(&[first_index, second_index], RamDirectory::default()); let result = merge_indices(&[first_index, second_index], RamDirectory::default());
assert!(result.is_err()); assert!(result.is_err());
Ok(()) Ok(())
} }
#[test]
fn test_merge_filtered_segments() -> crate::Result<()> {
let first_index = {
let mut schema_builder = Schema::builder();
let text_field = schema_builder.add_text_field("text", TEXT);
let index = Index::create_in_ram(schema_builder.build());
let mut index_writer = index.writer_for_tests()?;
index_writer.add_document(doc!(text_field=>"some text 1"))?;
index_writer.add_document(doc!(text_field=>"some text 2"))?;
index_writer.commit()?;
index
};
let second_index = {
let mut schema_builder = Schema::builder();
let text_field = schema_builder.add_text_field("text", TEXT);
let index = Index::create_in_ram(schema_builder.build());
let mut index_writer = index.writer_for_tests()?;
index_writer.add_document(doc!(text_field=>"some text 3"))?;
index_writer.add_document(doc!(text_field=>"some text 4"))?;
index_writer.delete_term(Term::from_field_text(text_field, "4"));
index_writer.commit()?;
index
};
let mut segments: Vec<Segment> = Vec::new();
segments.extend(first_index.searchable_segments()?);
segments.extend(second_index.searchable_segments()?);
let target_settings = first_index.settings().clone();
let filter_segment_1 = AliveBitSet::for_test_from_deleted_docs(&[1], 2);
let filter_segment_2 = AliveBitSet::for_test_from_deleted_docs(&[0], 2);
let filter_segments = vec![Some(filter_segment_1), Some(filter_segment_2)];
let merged_index = merge_filtered_segments(
&segments,
target_settings,
filter_segments,
RamDirectory::default(),
)?;
let segments = merged_index.searchable_segments()?;
assert_eq!(segments.len(), 1);
let segment_metas = segments[0].meta();
assert_eq!(segment_metas.num_deleted_docs(), 0);
assert_eq!(segment_metas.num_docs(), 1);
Ok(())
}
#[test]
fn test_merge_single_filtered_segments() -> crate::Result<()> {
let first_index = {
let mut schema_builder = Schema::builder();
let text_field = schema_builder.add_text_field("text", TEXT);
let index = Index::create_in_ram(schema_builder.build());
let mut index_writer = index.writer_for_tests()?;
index_writer.add_document(doc!(text_field=>"test text"))?;
index_writer.add_document(doc!(text_field=>"some text 2"))?;
index_writer.add_document(doc!(text_field=>"some text 3"))?;
index_writer.add_document(doc!(text_field=>"some text 4"))?;
index_writer.delete_term(Term::from_field_text(text_field, "4"));
index_writer.commit()?;
index
};
let mut segments: Vec<Segment> = Vec::new();
segments.extend(first_index.searchable_segments()?);
let target_settings = first_index.settings().clone();
let filter_segment = AliveBitSet::for_test_from_deleted_docs(&[0], 4);
let filter_segments = vec![Some(filter_segment)];
let index = merge_filtered_segments(
&segments,
target_settings,
filter_segments,
RamDirectory::default(),
)?;
let segments = index.searchable_segments()?;
assert_eq!(segments.len(), 1);
let segment_metas = segments[0].meta();
assert_eq!(segment_metas.num_deleted_docs(), 0);
assert_eq!(segment_metas.num_docs(), 2);
let searcher = index.reader()?.searcher();
{
let text_field = index.schema().get_field("text").unwrap();
let do_search = |term: &str| {
let query = QueryParser::for_index(&index, vec![text_field])
.parse_query(term)
.unwrap();
let top_docs: Vec<(f32, DocAddress)> =
searcher.search(&query, &TopDocs::with_limit(3)).unwrap();
top_docs.iter().map(|el| el.1.doc_id).collect::<Vec<_>>()
};
assert_eq!(do_search("test"), vec![] as Vec<u32>);
assert_eq!(do_search("text"), vec![0, 1]);
}
Ok(())
}
#[test]
fn test_apply_doc_id_filter_in_merger() -> crate::Result<()> {
let first_index = {
let mut schema_builder = Schema::builder();
let text_field = schema_builder.add_text_field("text", TEXT);
let index = Index::create_in_ram(schema_builder.build());
let mut index_writer = index.writer_for_tests()?;
index_writer.add_document(doc!(text_field=>"some text 1"))?;
index_writer.add_document(doc!(text_field=>"some text 2"))?;
index_writer.add_document(doc!(text_field=>"some text 3"))?;
index_writer.add_document(doc!(text_field=>"some text 4"))?;
index_writer.delete_term(Term::from_field_text(text_field, "4"));
index_writer.commit()?;
index
};
let mut segments: Vec<Segment> = Vec::new();
segments.extend(first_index.searchable_segments()?);
let target_settings = first_index.settings().clone();
{
let filter_segment = AliveBitSet::for_test_from_deleted_docs(&[1], 4);
let filter_segments = vec![Some(filter_segment)];
let target_schema = segments[0].schema();
let merged_index = Index::create(
RamDirectory::default(),
target_schema.clone(),
target_settings.clone(),
)?;
let merger: IndexMerger = IndexMerger::open_with_custom_alive_set(
merged_index.schema(),
merged_index.settings().clone(),
&segments[..],
filter_segments,
)?;
let doc_ids_alive: Vec<_> = merger.readers[0].doc_ids_alive().collect();
assert_eq!(doc_ids_alive, vec![0, 2]);
}
{
let filter_segments = vec![None];
let target_schema = segments[0].schema();
let merged_index = Index::create(
RamDirectory::default(),
target_schema.clone(),
target_settings.clone(),
)?;
let merger: IndexMerger = IndexMerger::open_with_custom_alive_set(
merged_index.schema(),
merged_index.settings().clone(),
&segments[..],
filter_segments,
)?;
let doc_ids_alive: Vec<_> = merger.readers[0].doc_ids_alive().collect();
assert_eq!(doc_ids_alive, vec![0, 1, 2]);
}
Ok(())
}
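The filtered-merge tests above hinge on one idea: each source segment supplies an optional alive bitset, and only docs marked alive receive new contiguous doc ids in the merged segment, in segment order. A minimal standalone sketch of that remapping, using a toy `Vec<bool>` model rather than tantivy's `AliveBitSet` API:

```rust
// Toy model of alive-bitset filtering during a merge (not the tantivy API):
// each segment carries a bitset marking which doc ids are still alive, and
// the merger hands out new doc ids to alive docs only, in segment order.
fn merged_doc_ids(alive_bitsets: &[Vec<bool>]) -> Vec<(usize, usize)> {
    // Returns (segment_ordinal, old_doc_id) for each doc kept, in the
    // order they receive new doc ids (0, 1, 2, ...) in the merged segment.
    let mut kept = Vec::new();
    for (segment_ord, alive) in alive_bitsets.iter().enumerate() {
        for (doc_id, &is_alive) in alive.iter().enumerate() {
            if is_alive {
                kept.push((segment_ord, doc_id));
            }
        }
    }
    kept
}

fn main() {
    // Segment 0 has 3 docs with doc 1 filtered out; segment 1 has 2 docs
    // with doc 0 filtered out. Three docs survive, in segment order.
    let kept = merged_doc_ids(&[vec![true, false, true], vec![false, true]]);
    assert_eq!(kept, vec![(0, 0), (0, 2), (1, 1)]);
}
```

This is the same old-to-new ordering the tests assert through `doc_ids_alive` on the merger's readers.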
}


@@ -2,7 +2,6 @@ use super::{
doc_id_mapping::{get_doc_id_mapping_from_field, DocIdMapping},
operation::AddOperation,
};
-use crate::fastfield::FastFieldsWriter;
use crate::fieldnorm::{FieldNormReaders, FieldNormsWriter};
use crate::indexer::segment_serializer::SegmentSerializer;
use crate::postings::compute_table_size;
@@ -18,6 +17,7 @@ use crate::tokenizer::{FacetTokenizer, TextAnalyzer};
use crate::tokenizer::{TokenStreamChain, Tokenizer};
use crate::Opstamp;
use crate::{core::Segment, store::StoreWriter};
+use crate::{fastfield::FastFieldsWriter, schema::Type};
use crate::{DocId, SegmentComponent};
/// Computes the initial size of the hash table.
@@ -173,18 +173,11 @@ impl SegmentWriter {
let (term_buffer, multifield_postings) =
(&mut self.term_buffer, &mut self.multifield_postings);
match *field_entry.field_type() {
-FieldType::HierarchicalFacet(_) => {
+FieldType::Facet(_) => {
-term_buffer.set_field(field);
+term_buffer.set_field(Type::Facet, field);
-let facets = field_values
-    .iter()
-    .flat_map(|field_value| match *field_value.value() {
-        Value::Facet(ref facet) => Some(facet.encoded_str()),
-        _ => {
-            panic!("Expected hierarchical facet");
-        }
-    });
-for facet_str in facets {
+for field_value in field_values {
+    let facet = field_value.value().facet().ok_or_else(make_schema_error)?;
+    let facet_str = facet.encoded_str();
let mut unordered_term_id_opt = None;
FacetTokenizer
.token_stream(facet_str)
@@ -241,12 +234,11 @@ impl SegmentWriter {
term_buffer,
)
};
self.fieldnorms_writer.record(doc_id, field, num_tokens);
}
FieldType::U64(_) => {
for field_value in field_values {
-term_buffer.set_field(field_value.field());
+term_buffer.set_field(Type::U64, field_value.field());
let u64_val = field_value
.value()
.u64_value()
@@ -257,7 +249,7 @@ impl SegmentWriter {
}
FieldType::Date(_) => {
for field_value in field_values {
-term_buffer.set_field(field_value.field());
+term_buffer.set_field(Type::Date, field_value.field());
let date_val = field_value
.value()
.date_value()
@@ -268,7 +260,7 @@ impl SegmentWriter {
}
FieldType::I64(_) => {
for field_value in field_values {
-term_buffer.set_field(field_value.field());
+term_buffer.set_field(Type::I64, field_value.field());
let i64_val = field_value
.value()
.i64_value()
@@ -279,7 +271,7 @@ impl SegmentWriter {
}
FieldType::F64(_) => {
for field_value in field_values {
-term_buffer.set_field(field_value.field());
+term_buffer.set_field(Type::F64, field_value.field());
let f64_val = field_value
.value()
.f64_value()
@@ -290,7 +282,7 @@ impl SegmentWriter {
}
FieldType::Bytes(_) => {
for field_value in field_values {
-term_buffer.set_field(field_value.field());
+term_buffer.set_field(Type::Bytes, field_value.field());
let bytes = field_value
.value()
.bytes_value()
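The recurring `set_field(Type::…, field)` change above comes from the "Term are now typed" commit: each term now records a one-byte type code alongside its field, so a raw term is self-describing, which improves `Debug` output and lays groundwork for JSON field support. A sketch of the idea with a made-up byte layout (field id, then type code, then value bytes; this is an illustration, not tantivy's exact encoding):

```rust
// Hypothetical term layout: 4-byte big-endian field id, a 1-byte type
// code, then the value bytes. The codes below are invented for the sketch.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Type {
    U64 = b'u' as isize,
    I64 = b'i' as isize,
    Facet = b'h' as isize,
}

fn term_bytes(typ: Type, field_id: u32, value: &[u8]) -> Vec<u8> {
    let mut bytes = Vec::with_capacity(5 + value.len());
    bytes.extend_from_slice(&field_id.to_be_bytes()); // field comes first
    bytes.push(typ as u8); // one-byte type code makes the term self-describing
    bytes.extend_from_slice(value);
    bytes
}

fn main() {
    let term = term_bytes(Type::U64, 2, &7u64.to_be_bytes());
    assert_eq!(term.len(), 4 + 1 + 8);
    assert_eq!(term[4], b'u'); // the type code is recoverable for Debug
}
```

Because the `Term` struct is transient and never serialized into the index as-is, this representation change stays backward compatible, as the commit message notes.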


@@ -10,6 +10,8 @@
)]
#![doc(test(attr(allow(unused_variables), deny(warnings))))]
#![warn(missing_docs)]
+#![allow(clippy::len_without_is_empty)]
+#![allow(clippy::return_self_not_must_use)]
//! # `tantivy`
//!
@@ -62,7 +64,7 @@
//! body => "He was an old man who fished alone in a skiff in \
//! the Gulf Stream and he had gone eighty-four days \
//! now without taking a fish."
-//! ));
+//! ))?;
//!
//! // We need to call .commit() explicitly to force the
//! // index_writer to finish processing the documents in the queue,
@@ -103,7 +105,7 @@
//! A good place for you to get started is to check out
//! the example code (
//! [literate programming](https://tantivy-search.github.io/examples/basic_search.html) /
-//! [source code](https://github.com/tantivy-search/tantivy/blob/main/examples/basic_search.rs))
+//! [source code](https://github.com/quickwit-inc/tantivy/blob/main/examples/basic_search.rs))
#[cfg_attr(test, macro_use)]
extern crate serde_json;
@@ -135,7 +137,6 @@ pub type Result<T> = std::result::Result<T, TantivyError>;
/// Tantivy DateTime
pub type DateTime = chrono::DateTime<chrono::Utc>;
-mod common;
mod core;
mod indexer;
@@ -157,27 +158,30 @@ pub mod termdict;
mod reader;
-pub use self::reader::{IndexReader, IndexReaderBuilder, ReloadPolicy};
+pub use self::reader::{IndexReader, IndexReaderBuilder, ReloadPolicy, Warmer};
mod snippet;
pub use self::snippet::{Snippet, SnippetGenerator};
mod docset;
pub use self::docset::{DocSet, TERMINATED};
-pub use crate::common::HasLen;
-pub use crate::common::{f64_to_u64, i64_to_u64, u64_to_f64, u64_to_i64};
pub use crate::core::{Executor, SegmentComponent};
pub use crate::core::{
-    Index, IndexBuilder, IndexMeta, IndexSettings, IndexSortByField, Order, Searcher, Segment,
-    SegmentId, SegmentMeta,
+    Index, IndexBuilder, IndexMeta, IndexSettings, IndexSortByField, Order, Searcher,
+    SearcherGeneration, Segment, SegmentId, SegmentMeta,
};
pub use crate::core::{InvertedIndexReader, SegmentReader};
pub use crate::directory::Directory;
-pub use crate::indexer::merge_segments;
+pub use crate::indexer::demuxer::*;
+pub use crate::indexer::merge_filtered_segments;
+pub use crate::indexer::merge_indices;
pub use crate::indexer::operation::UserOperation;
-pub use crate::indexer::IndexWriter;
+pub use crate::indexer::{IndexWriter, PreparedCommit};
pub use crate::postings::Postings;
pub use crate::reader::LeasedItem;
pub use crate::schema::{Document, Term};
+pub use census::{Inventory, TrackedObject};
+pub use common::HasLen;
+pub use common::{f64_to_u64, i64_to_u64, u64_to_f64, u64_to_i64};
use std::fmt;
use once_cell::sync::Lazy;
@@ -235,6 +239,7 @@ pub fn version_string() -> &'static str {
pub mod merge_policy {
pub use crate::indexer::DefaultMergePolicy;
pub use crate::indexer::LogMergePolicy;
+pub use crate::indexer::MergeCandidate;
pub use crate::indexer::MergePolicy;
pub use crate::indexer::NoMergePolicy;
}
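`MergeCandidate` joining the `merge_policy` re-exports here connects to the LogMergePolicy knob from commit e5e252cbc0, which lets segments exceeding a deleted-docs threshold always qualify for merging. A hypothetical helper sketching that threshold check (the function name and percentage semantics are assumptions, not the tantivy API):

```rust
// Hypothetical check: a segment whose deleted-doc ratio is at or above
// `threshold_pct` percent should be offered as a merge candidate even if
// the size-based log policy would otherwise skip it.
fn exceeds_deleted_threshold(num_docs: u32, num_deleted_docs: u32, threshold_pct: f32) -> bool {
    if num_docs == 0 {
        // An empty segment has nothing to reclaim.
        return false;
    }
    (num_deleted_docs as f32 / num_docs as f32) * 100.0 >= threshold_pct
}

fn main() {
    assert!(exceeds_deleted_threshold(100, 40, 30.0)); // 40% deleted >= 30%
    assert!(!exceeds_deleted_threshold(100, 10, 30.0)); // 10% deleted < 30%
    assert!(!exceeds_deleted_threshold(0, 0, 30.0));
}
```

The point of such a knob is that merging is the only way tantivy reclaims space held by deleted docs, so heavily-deleted segments are worth merging regardless of their size bucket.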
@@ -293,7 +298,7 @@ pub struct DocAddress {
}
#[cfg(test)]
-mod tests {
+pub mod tests {
use crate::collector::tests::TEST_COLLECTOR_WITH_SCORE;
use crate::core::SegmentReader;
use crate::docset::{DocSet, TERMINATED};
@@ -304,11 +309,18 @@ mod tests {
use crate::Index;
use crate::Postings;
use crate::ReloadPolicy;
+use common::{BinarySerializable, FixedSize};
use rand::distributions::Bernoulli;
use rand::distributions::Uniform;
use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};
+pub fn fixed_size_test<O: BinarySerializable + FixedSize + Default>() {
+    let mut buffer = Vec::new();
+    O::default().serialize(&mut buffer).unwrap();
+    assert_eq!(buffer.len(), O::SIZE_IN_BYTES);
+}
/// Checks if left and right are close one to each other.
/// Panics if the two values are more than 0.5% apart.
#[macro_export]
@@ -370,24 +382,22 @@ mod tests {
let mut schema_builder = Schema::builder();
let text_field = schema_builder.add_text_field("text", TEXT);
let schema = schema_builder.build();
-let index = Index::create_from_tempdir(schema).unwrap();
+let index = Index::create_from_tempdir(schema)?;
+// writing the segment
+let mut index_writer = index.writer_for_tests()?;
{
-    // writing the segment
-    let mut index_writer = index.writer_for_tests()?;
-    {
-        let doc = doc!(text_field=>"af b");
-        index_writer.add_document(doc);
-    }
-    {
-        let doc = doc!(text_field=>"a b c");
-        index_writer.add_document(doc);
-    }
-    {
-        let doc = doc!(text_field=>"a b c d");
-        index_writer.add_document(doc);
-    }
-    assert!(index_writer.commit().is_ok());
+    let doc = doc!(text_field=>"af b");
+    index_writer.add_document(doc)?;
}
+{
+    let doc = doc!(text_field=>"a b c");
+    index_writer.add_document(doc)?;
+}
+{
+    let doc = doc!(text_field=>"a b c d");
+    index_writer.add_document(doc)?;
+}
+index_writer.commit()?;
Ok(())
}
@@ -397,12 +407,12 @@ mod tests {
let text_field = schema_builder.add_text_field("text", TEXT);
let index = Index::create_in_ram(schema_builder.build());
let mut index_writer = index.writer_for_tests()?;
-index_writer.add_document(doc!(text_field=>"a b c"));
+index_writer.add_document(doc!(text_field=>"a b c"))?;
index_writer.commit()?;
-index_writer.add_document(doc!(text_field=>"a"));
+index_writer.add_document(doc!(text_field=>"a"))?;
-index_writer.add_document(doc!(text_field=>"a a"));
+index_writer.add_document(doc!(text_field=>"a a"))?;
index_writer.commit()?;
-index_writer.add_document(doc!(text_field=>"c"));
+index_writer.add_document(doc!(text_field=>"c"))?;
index_writer.commit()?;
let reader = index.reader()?;
let searcher = reader.searcher();
@@ -424,7 +434,7 @@ mod tests {
let text_field = schema_builder.add_text_field("text", TEXT);
let index = Index::create_in_ram(schema_builder.build());
let mut index_writer = index.writer_for_tests()?;
-index_writer.add_document(doc!(text_field=>"a b c"));
+index_writer.add_document(doc!(text_field=>"a b c"))?;
index_writer.commit()?;
let index_reader = index.reader()?;
let searcher = index_reader.searcher();
@@ -446,9 +456,9 @@ mod tests {
let text_field = schema_builder.add_text_field("text", TEXT);
let index = Index::create_in_ram(schema_builder.build());
let mut index_writer = index.writer_for_tests()?;
-index_writer.add_document(doc!(text_field=>"a b c"));
+index_writer.add_document(doc!(text_field=>"a b c"))?;
-index_writer.add_document(doc!());
+index_writer.add_document(doc!())?;
-index_writer.add_document(doc!(text_field=>"a b"));
+index_writer.add_document(doc!(text_field=>"a b"))?;
index_writer.commit()?;
let reader = index.reader()?;
let searcher = reader.searcher();
@@ -490,20 +500,20 @@ mod tests {
// writing the segment
let mut index_writer = index.writer_for_tests()?;
// 0
-index_writer.add_document(doc!(text_field=>"a b"));
+index_writer.add_document(doc!(text_field=>"a b"))?;
// 1
-index_writer.add_document(doc!(text_field=>" a c"));
+index_writer.add_document(doc!(text_field=>" a c"))?;
// 2
-index_writer.add_document(doc!(text_field=>" b c"));
+index_writer.add_document(doc!(text_field=>" b c"))?;
// 3
-index_writer.add_document(doc!(text_field=>" b d"));
+index_writer.add_document(doc!(text_field=>" b d"))?;
index_writer.delete_term(Term::from_field_text(text_field, "c"));
index_writer.delete_term(Term::from_field_text(text_field, "a"));
// 4
-index_writer.add_document(doc!(text_field=>" b c"));
+index_writer.add_document(doc!(text_field=>" b c"))?;
// 5
-index_writer.add_document(doc!(text_field=>" a"));
+index_writer.add_document(doc!(text_field=>" a"))?;
index_writer.commit()?;
}
{
@@ -537,7 +547,7 @@ mod tests {
// writing the segment
let mut index_writer = index.writer_for_tests()?;
// 0
-index_writer.add_document(doc!(text_field=>"a b"));
+index_writer.add_document(doc!(text_field=>"a b"))?;
// 1
index_writer.delete_term(Term::from_field_text(text_field, "c"));
index_writer.rollback()?;
@@ -573,7 +583,7 @@ mod tests {
{
// writing the segment
let mut index_writer = index.writer_for_tests()?;
-index_writer.add_document(doc!(text_field=>"a b"));
+index_writer.add_document(doc!(text_field=>"a b"))?;
index_writer.delete_term(Term::from_field_text(text_field, "c"));
index_writer.rollback()?;
index_writer.delete_term(Term::from_field_text(text_field, "a"));
@@ -623,7 +633,7 @@ mod tests {
let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests()?;
-index_writer.add_document(doc!(field=>1u64));
+index_writer.add_document(doc!(field=>1u64))?;
index_writer.commit()?;
let reader = index.reader()?;
let searcher = reader.searcher();
@@ -647,7 +657,7 @@ mod tests {
let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests()?;
let negative_val = -1i64;
-index_writer.add_document(doc!(value_field => negative_val));
+index_writer.add_document(doc!(value_field => negative_val))?;
index_writer.commit()?;
let reader = index.reader()?;
let searcher = reader.searcher();
@@ -671,7 +681,7 @@ mod tests {
let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests()?;
let val = std::f64::consts::PI;
-index_writer.add_document(doc!(value_field => val));
+index_writer.add_document(doc!(value_field => val))?;
index_writer.commit()?;
let reader = index.reader()?;
let searcher = reader.searcher();
@@ -694,7 +704,7 @@ mod tests {
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests()?;
-index_writer.add_document(doc!(text_field=>"a"));
+index_writer.add_document(doc!(text_field=>"a"))?;
assert!(index_writer.commit().is_ok());
let reader = index.reader()?;
let searcher = reader.searcher();
@@ -717,14 +727,14 @@ mod tests {
// writing the segment
let mut index_writer = index.writer_for_tests()?;
-index_writer.add_document(doc!(text_field=>"63"));
+index_writer.add_document(doc!(text_field=>"63"))?;
-index_writer.add_document(doc!(text_field=>"70"));
+index_writer.add_document(doc!(text_field=>"70"))?;
-index_writer.add_document(doc!(text_field=>"34"));
+index_writer.add_document(doc!(text_field=>"34"))?;
-index_writer.add_document(doc!(text_field=>"1"));
+index_writer.add_document(doc!(text_field=>"1"))?;
-index_writer.add_document(doc!(text_field=>"38"));
+index_writer.add_document(doc!(text_field=>"38"))?;
-index_writer.add_document(doc!(text_field=>"33"));
+index_writer.add_document(doc!(text_field=>"33"))?;
-index_writer.add_document(doc!(text_field=>"40"));
+index_writer.add_document(doc!(text_field=>"40"))?;
-index_writer.add_document(doc!(text_field=>"17"));
+index_writer.add_document(doc!(text_field=>"17"))?;
index_writer.delete_term(Term::from_field_text(text_field, "38"));
index_writer.delete_term(Term::from_field_text(text_field, "34"));
index_writer.commit()?;
@@ -742,7 +752,7 @@ mod tests {
{
// writing the segment
let mut index_writer = index.writer_for_tests()?;
-index_writer.add_document(doc!(text_field=>"af af af bc bc"));
+index_writer.add_document(doc!(text_field=>"af af af bc bc"))?;
index_writer.commit()?;
}
{
@@ -774,9 +784,9 @@ mod tests {
let reader = index.reader()?;
// writing the segment
let mut index_writer = index.writer_for_tests()?;
-index_writer.add_document(doc!(text_field=>"af af af b"));
+index_writer.add_document(doc!(text_field=>"af af af b"))?;
-index_writer.add_document(doc!(text_field=>"a b c"));
+index_writer.add_document(doc!(text_field=>"a b c"))?;
-index_writer.add_document(doc!(text_field=>"a b c d"));
+index_writer.add_document(doc!(text_field=>"a b c d"))?;
index_writer.commit()?;
reader.reload()?;
@@ -838,9 +848,9 @@ mod tests {
assert_eq!(reader.searcher().num_docs(), 0u64);
// writing the segment
let mut index_writer = index.writer_for_tests()?;
-index_writer.add_document(doc!(text_field=>"af b"));
+index_writer.add_document(doc!(text_field=>"af b"))?;
-index_writer.add_document(doc!(text_field=>"a b c"));
+index_writer.add_document(doc!(text_field=>"a b c"))?;
-index_writer.add_document(doc!(text_field=>"a b c d"));
+index_writer.add_document(doc!(text_field=>"a b c d"))?;
index_writer.commit()?;
reader.reload()?;
assert_eq!(reader.searcher().num_docs(), 3u64);
@@ -880,7 +890,7 @@ mod tests {
{
let document =
doc!(fast_field_unsigned => 4u64, fast_field_signed=>4i64, fast_field_float=>4f64);
-index_writer.add_document(document);
+index_writer.add_document(document)?;
index_writer.commit()?;
}
let reader = index.reader()?;
@@ -947,7 +957,7 @@ mod tests {
index_writer.set_merge_policy(Box::new(NoMergePolicy));
for doc_id in 0u64..DOC_COUNT {
-index_writer.add_document(doc!(id => doc_id));
+index_writer.add_document(doc!(id => doc_id))?;
}
index_writer.commit()?;
@@ -964,7 +974,7 @@ mod tests {
index_writer.delete_term(Term::from_field_u64(id, doc_id));
index_writer.commit()?;
index_reader.reload()?;
-index_writer.add_document(doc!(id => doc_id));
+index_writer.add_document(doc!(id => doc_id))?;
index_writer.commit()?;
index_reader.reload()?;
let searcher = index_reader.searcher();
@@ -993,8 +1003,24 @@ mod tests {
#[test]
fn test_validate_checksum() -> crate::Result<()> {
let index_path = tempfile::tempdir().expect("dir");
-let schema = Schema::builder().build();
+let mut builder = Schema::builder();
+let body = builder.add_text_field("body", TEXT | STORED);
+let schema = builder.build();
let index = Index::create_in_dir(&index_path, schema)?;
+let mut writer = index.writer(50_000_000)?;
+for _ in 0..5000 {
+    writer.add_document(doc!(body => "foo"))?;
+    writer.add_document(doc!(body => "boo"))?;
+}
+writer.commit()?;
+assert!(index.validate_checksum()?.is_empty());
+// delete few docs
+writer.delete_term(Term::from_field_text(body, "foo"));
+writer.commit()?;
+let segment_ids = index.searchable_segment_ids()?;
+let _ = futures::executor::block_on(writer.merge(&segment_ids));
assert!(index.validate_checksum()?.is_empty());
Ok(())
}
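The new `pub fn fixed_size_test` helper in the test module above encodes a simple invariant: a `FixedSize` type must serialize to exactly its declared `SIZE_IN_BYTES`. A self-contained sketch of the same pattern with toy trait definitions (tantivy's real traits live in its external `common` crate; these stand-ins only mirror the shape):

```rust
// Toy stand-ins for common::FixedSize / BinarySerializable: a type declares
// its serialized width as a constant, and the helper checks that the default
// value actually serializes to that many bytes.
trait FixedSize {
    const SIZE_IN_BYTES: usize;
    fn serialize(&self, buf: &mut Vec<u8>);
}

#[derive(Default)]
struct DocId(u32);

impl FixedSize for DocId {
    const SIZE_IN_BYTES: usize = 4;
    fn serialize(&self, buf: &mut Vec<u8>) {
        buf.extend_from_slice(&self.0.to_le_bytes());
    }
}

fn fixed_size_test<O: FixedSize + Default>() {
    let mut buffer = Vec::new();
    O::default().serialize(&mut buffer);
    assert_eq!(buffer.len(), O::SIZE_IN_BYTES);
}

fn main() {
    fixed_size_test::<DocId>();
    // A non-default value serializes to the same fixed width.
    let mut buf = Vec::new();
    DocId(42).serialize(&mut buf);
    assert_eq!(buf.len(), DocId::SIZE_IN_BYTES);
}
```

Making the helper `pub` (together with `pub mod tests`) lets other modules' tests reuse this check for every fixed-size type they serialize.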


@@ -1,9 +1,9 @@
use std::io;
-use crate::common::{BinarySerializable, VInt};
use crate::directory::OwnedBytes;
use crate::positions::COMPRESSION_BLOCK_SIZE;
use crate::postings::compression::{BlockDecoder, VIntDecoder};
+use common::{BinarySerializable, VInt};
/// When accessing the position of a term, we get a positions_idx from the `Terminfo`.
/// This means we need to skip to the `nth` positions efficiently.

Some files were not shown because too many files have changed in this diff.