Compare commits

931 Commits

Author SHA1 Message Date
Paul Masurel
7720d21265 Closes #896 - Facet reader related
Bugfix. Acquiring a facet reader on a segment that does not contain any
doc with this facet returns `None`.
2020-10-01 20:25:28 +09:00
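A hedged sketch of the fixed behavior, assuming the era's `SegmentReader::facet_reader(field)` returning an `Option<FacetReader>` (the method name and exact signature are assumptions, not quoted from the commit):

```rust
use tantivy::schema::Field;
use tantivy::SegmentReader;

// Sketch only: handle the `None` case for segments without this facet.
fn has_facet_data(segment_reader: &SegmentReader, field: Field) -> bool {
    match segment_reader.facet_reader(field) {
        Some(_facet_reader) => true, // at least one doc carries this facet
        None => false,               // no doc with this facet: nothing to read
    }
}
```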
Paul Masurel
c339b05789 Bumped version and edited changelog 2020-09-19 21:13:19 +09:00
Paul Masurel
2d3c657f9d Added Send Sync to collectors. 2020-09-19 21:04:44 +09:00
Paul Masurel
07f9b828ae Added Send and Sync to the Query trait. 2020-09-19 21:04:29 +09:00
Paul Masurel
3dd0322f4c Bumped version 2020-08-19 22:41:48 +09:00
Paul Masurel
2481c87be8 Block wand (#856) 2020-08-19 22:36:36 +09:00
Paul Masurel
b6a664b5f8 cargo fmt 2020-08-16 12:40:50 +09:00
lyj
25b666a7c9 Update occur.rs (#862) 2020-08-16 10:49:55 +09:00
Paul Masurel
9b41912e66 Bugfix (#861) 2020-08-12 16:06:24 +09:00
Paul Masurel
8e74bb98b5 Added field norm readers (#854) 2020-07-20 13:05:05 +09:00
Paul Masurel
6db8bb49d6 Assert nearly equals macro (#853)
* Assert nearly equals macro

* Renamed specialized_scorer in TermScorer
2020-07-17 16:40:41 +09:00
lyj
410aed0176 Update segment_updater.rs (#848) 2020-07-16 12:33:11 +09:00
aptend
00a239a712 fix typo in index_meta.rs (#851) 2020-07-16 12:32:45 +09:00
Paul Masurel
68fe406924 Removed asserts (#850) 2020-07-16 12:24:55 +09:00
Paul Masurel
f71b04acb0 Bugfix. (#849)
go_to_first_doc was typically calling seek with a target smaller than
the current doc.

Since SegmentPostings typically does a linear search over the full block,
regardless of the current position, this could make our segment postings
go backward.
2020-07-16 10:57:51 +09:00
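A sketch of the invariant the fix restores: never hand `seek` a target behind the current doc (illustrative only, written against the post-#824 `DocSet` API where `seek` returns the new DocId):

```rust
use tantivy::{DocId, DocSet};

// Sketch: guard the seek target so the docset can never move backward.
fn advance_to<D: DocSet>(docset: &mut D, target: DocId) -> DocId {
    if target > docset.doc() {
        docset.seek(target)
    } else {
        docset.doc() // already at or beyond target; seeking could go backward
    }
}
```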
lyj
1ab7f660a4 Update index.rs (#846) 2020-07-02 15:11:38 +09:00
Sean Stangl
0ebbc4cb5a Fix incorrect SimpleTokenizer link in documentation (#844) 2020-07-01 10:26:36 +09:00
lyj
5300cb5da0 Update mod.rs (#845) 2020-07-01 10:25:26 +09:00
Ype Kingma
7d773abc92 Boolean query: do not combine excluded scores. (#840)
* Do nothing when combining score values of excluded scores.

* Add test case for two excluded.

* Test score for two excluded terms.

* Use TopDocs in test_boolean_query_two_excluded
2020-06-08 20:01:19 +09:00
Paul Masurel
c34541ccce Alive doc iterator. (#837) 2020-06-05 19:42:51 +09:00
Paul Masurel
1cc5bd706c Fixes build for no-default-features (#839) 2020-06-05 19:41:55 +09:00
Paul Masurel
4026d183bc Small readability change 2020-06-03 09:04:57 +09:00
Paul Masurel
c0f5645cd9 Move for_each functions from Scorer to Weight. (#836)
* Move for_each functions from Scorer to Weight.

* Specialized foreach / foreach_pruning for union of termscorer.
2020-06-01 11:31:18 +09:00
Paul Masurel
cbff874e43 Change the loading of blocks. 2020-05-27 16:36:50 +09:00
Paul Masurel
baf015fc57 Simplification of the segment postings seek implementation. (#834) 2020-05-27 08:49:47 +09:00
Paul Masurel
7275ebdf3c Skiprefactoring skipabsolute (#831)
Simplification of the way we handle positions.
2020-05-25 09:51:23 +09:00
Paul Masurel
b974e7ce34 Closes #828. (#829)
There was a bug in the LogMergePolicy that surfaced when there were
segments, but all of the segments were larger than the max limit.

After filtering, the list of candidate segments for merge was empty, and
the code was indexing the first element of an empty Vec.
2020-05-22 16:24:07 +09:00
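A minimal sketch of the guard this fix implies (generic, not the actual LogMergePolicy code):

```rust
// Sketch: after filtering out segments above the max size, the candidate
// list may be empty. `first()` returns None where indexing `candidates[0]`
// panicked.
fn first_candidate<T>(candidates: &[T]) -> Option<&T> {
    candidates.first()
}
```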
Paul Masurel
8f8f34499f Updated CHANGELOG with the TopCollector offset information and cargo fmt. 2020-05-20 22:26:54 +09:00
Rob Young
6ea6f4bfcd Add offset to TopDocsCollector (#826)
* Add offset to TopDocsCollector

Add an offset to TopDocsCollector and TopDocs to make it clearer how to
handle pagination.

Closes #822

* Address review comments

- Make Debug formatting of TopDocs clearer.
- Add unit tests for limit and offset on TopCollector.
- Change API for using offset to a fluent interface.
- Add some context to the docstring to clarify what limit and offset are
  equivalent to in other projects.

* Changes required by rebase on e25284

- Pass Collector into TweakedScoreTopCollector and
  CustomScoreTopCollector.
- Add std:: qualifier to f32, i32 etc. Not sure why this was not failing
  already.
- Add unit tests for TopDocs with offset including for tweaked and
  custom score collectors.

In order to convert a TopCollector<Score> to a TopCollector<TScore> I
had to add an `into_tscore` method to `TopCollector`. This is a hack but
I don't know how to avoid it.
2020-05-20 22:25:24 +09:00
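A hedged usage sketch of the fluent offset API described above (the method name `and_offset` follows the PR's fluent-interface note; treat the exact name as an assumption):

```rust
use tantivy::collector::TopDocs;

// Sketch: page 3 with 10 hits per page -- skip the first 20 matches and
// keep the next 10, roughly what `LIMIT 10 OFFSET 20` means in SQL.
fn page_collector() -> TopDocs {
    TopDocs::with_limit(10).and_offset(20)
}
```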
Paul Masurel
e25284bafe Major change in the DocSet/Scorer API (#824)
- Change in the DocSet and Scorer API. (@fulmicoton). 
A freshly created DocSet points directly to its first doc. A sentinel value called TERMINATED marks the end of a DocSet.
`.advance()` returns the new DocId. `Scorer::skip(target)` has been replaced by `Scorer::seek(target)` and returns the resulting DocId.
As a result, iterating through DocSet now looks as follows
```rust
let mut doc = docset.doc();
while doc != TERMINATED {
   // ...
   doc = docset.advance();
}
```
The change made it possible to greatly simplify a lot of the docset's code.
- Misc internal optimization and introduction of the `Scorer::for_each_pruning` function. (@fulmicoton)
2020-05-16 16:33:36 +09:00
Fisher Darling
8b67877cd5 Made field methods const fns (#823) 2020-05-16 10:59:50 +09:00
Rob Young
9de1360538 Minor doc and test improvements around fuzzy querying (#825) 2020-05-16 10:59:24 +09:00
Paul Masurel
c55db83609 Closes #805 (#820)
Added TryInto implementation for IndexReaderBuilder
2020-04-27 12:01:17 +09:00
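A hedged sketch of what the `TryInto` implementation enables (assuming `IndexReaderBuilder: TryInto<IndexReader>` with tantivy's error type):

```rust
use std::convert::TryInto;
use tantivy::{Index, IndexReader};

// Sketch: convert the builder straight into a reader via TryInto.
fn open_reader(index: &Index) -> tantivy::Result<IndexReader> {
    let reader: IndexReader = index.reader_builder().try_into()?;
    Ok(reader)
}
```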
Paul Masurel
1e5ebdbf3c Format and remove useless import (#819) 2020-04-27 11:56:49 +09:00
Paul Masurel
9a2090ab21 Creating the MMapDirectory does not return a Directory. (#818) 2020-04-27 11:42:20 +09:00
Paul Masurel
e4aaacdb86 Minor change in README.md 2020-04-21 21:30:34 +09:00
Paul Masurel
29acf1104d Update README's claim on performance. 2020-04-21 14:44:26 +09:00
Paul Masurel
3d34fa0b69 Fixed changelog 2020-04-19 15:55:54 +09:00
Rob Young
77f363987a Make TweakScore and CustomScore mutable at the segment level (#807)
* Make TweakScore and CustomScore mutable

Make TweakScore and CustomScore mutable at the segment level.

Addresses issue #806

* Add example to show tweak_score working for facets
2020-04-19 15:54:00 +09:00
Paul Masurel
c0be461191 Removing tantivy-fst conf and removing warning. (#813) 2020-04-18 20:19:23 +09:00
dependabot-preview[bot]
1fb562f44a Update fail requirement from 0.3 to 0.4 (#810)
Updates the requirements on [fail](https://github.com/tikv/fail-rs) to permit the latest version.
- [Release notes](https://github.com/tikv/fail-rs/releases)
- [Changelog](https://github.com/tikv/fail-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/tikv/fail-rs/compare/v0.3.0...v0.4.0)

Signed-off-by: dependabot-preview[bot] <support@dependabot.com>

Co-authored-by: dependabot-preview[bot] <27856297+dependabot-preview[bot]@users.noreply.github.com>
2020-04-17 07:14:19 +09:00
Rob Young
c591d0e591 Switch fst dependency to git (#808)
Closes #803

This allows the package to be built without first cloning the
tantivy-search/fst repo into the expected place. This should fix CI.
2020-04-16 23:05:12 +09:00
Paul Masurel
186d7fc20e Fix build 2020-04-01 09:32:45 +09:00
Paul Masurel
cfbdef5186 Using tantivy-fst version 0.3. 2020-03-31 23:24:54 +09:00
Paul Masurel
d04368b1d4 Closes #788. OR not working when using conjunction by default. (#802) 2020-03-31 21:13:50 +09:00
Chen Xu
b167058028 Fix prefix option for FuzzyTermQuery (#797)
* Fix prefix option for FuzzyTermQuery

* Update changelog
2020-03-19 20:19:32 +09:00
Paul Masurel
262957717b unit test fix and use of matches 2020-03-15 00:20:17 +09:00
Paul Masurel
873a808321 Removed itertools (#792) 2020-03-11 18:41:04 +09:00
dependabot-preview[bot]
6fa8f9330e Update base64 requirement from 0.11.0 to 0.12.0 (#791)
Updates the requirements on [base64](https://github.com/marshallpierce/rust-base64) to permit the latest version.
- [Release notes](https://github.com/marshallpierce/rust-base64/releases)
- [Changelog](https://github.com/marshallpierce/rust-base64/blob/master/RELEASE-NOTES.md)
- [Commits](https://github.com/marshallpierce/rust-base64/compare/v0.11.0...v0.12.0)

Signed-off-by: dependabot-preview[bot] <support@dependabot.com>

Co-authored-by: dependabot-preview[bot] <27856297+dependabot-preview[bot]@users.noreply.github.com>
2020-03-11 17:51:22 +09:00
Paul Masurel
b3f0ef0878 Avoid writing a new delete file if there were no actual deletes. (#787)
When applying the delete operations in the delete queue, it is possible
that no document was newly deleted.

In this case, avoid creating a new delete file and updating the delete
opstamp.
2020-03-08 13:04:21 +09:00
Paul Masurel
04304262ba cargo fmt 2020-03-08 09:58:42 +09:00
Paul Masurel
920ced364a Added a method to persist the RAMDirectory into a different directory. 2020-03-07 17:00:50 +09:00
Paul Masurel
e0499118e2 Minor refactoring 2020-03-07 15:56:03 +09:00
Paul Masurel
50b5efae46 Added derive feature to serde crate 2020-03-06 23:46:29 +09:00
Paul Masurel
486b8fa9c5 Removing serde-derive dependency (#786) 2020-03-06 23:33:58 +09:00
Minoru Osuka
b2baed9bdd Add Lindera to README.md (#785)
* Add Lindera to README.md

* Put lindera in first place
2020-03-03 20:23:59 +09:00
Paul Masurel
b591542c0b Removing err.description() before deprecation. 2020-03-03 09:58:49 +09:00
Paul Masurel
a83fa00ac4 Faster compilation of query-grammar. (#784) 2020-03-02 22:12:42 +09:00
Paul Masurel
7ff5c7c797 Removing the fst feature in the levenshtein_automata crate. 2020-03-02 21:47:05 +09:00
Paul Masurel
1748602691 ignore -> compile_fail 2020-03-02 09:59:48 +09:00
Paul Masurel
6542dd5337 Removing parenthesis. 2020-03-01 09:41:53 +09:00
Nicholas Connor
c64a44b9e1 Slight re-organization to increase contrast of "Getting Started" (#783) 2020-02-28 08:42:38 +09:00
Paul Masurel
fccc5b3bed Closes #758 2020-02-27 17:58:43 +09:00
Paul Masurel
98b9d5c6c4 Closes #780. Will be fixed on the next published release. 2020-02-21 09:41:52 +09:00
Paul Masurel
afd2c1a8ad Merge branch 'master' of github.com:tantivy-search/tantivy 2020-02-19 22:08:44 +09:00
Paul Masurel
81f35a3ceb Bumped tantivy-grammar version 2020-02-19 22:08:31 +09:00
Paul Masurel
7e2e765f4a Bumped tantivy-grammar version 2020-02-19 22:07:54 +09:00
Paul Masurel
7d6cfa58e1 [WIP] Alternative take on boosted queries (#772)
* Alternative take on boosted queries

* Fixing unit test

* Added boosting to the query grammar.

* Made BoostQuery public.

* Added support for boosting field in QueryParser

Closes #547
2020-02-19 11:04:38 +09:00
Paul Masurel
14735ce3aa Update snap version to 1. (#781) 2020-02-17 10:41:44 +09:00
Paul Masurel
72f7cc1569 Closes #777 (#779) 2020-02-17 09:53:38 +09:00
Paul Masurel
abef5c4e74 Updating combine to version 4 (#775) 2020-02-06 23:02:48 +09:00
Paul Masurel
ae14022bf0 Removed use::Result. (#771) 2020-01-31 18:47:02 +09:00
Alexander
55f5658d40 Make Executor public so the Searcher::search_in_executor method can now be used (#769)
* Make Executor public so the Searcher::search_in_executor method can now be used

* Fixed cargo fmt
2020-01-31 15:50:26 +09:00
Paul Masurel
3ae6363462 Updated CHANGELOG 2020-01-30 10:16:56 +09:00
Halvor Fladsrud Bø
9e20d7f8a5 Maximum size of segment to be considered for merge (#765)
* Replicated changes from dead PR

* Ran formatter.
2020-01-30 10:14:34 +09:00
Halvor Fladsrud Bø
ab13ffe377 Facet path string (#759)
* Added to_path_string

* Fixed logic. Found strange behavior with string comparisons.

* ran formatter

* Fixed test

* Fixed format

* Fixed comment
2020-01-30 10:11:29 +09:00
Paul Masurel
039138ed50 Added the empty dictionary item in the CHANGELOG 2020-01-30 10:10:34 +09:00
Paul Masurel
6227a0555a Added unit test for empty dictionaries. 2020-01-30 10:08:27 +09:00
Audun Halland
f85d0a522a Optimize TermDictionary::empty by precomputed data source (#767) 2020-01-30 10:04:58 +09:00
Halvor Fladsrud Bø
5795488ba7 Backward iteration for termdict range (#757)
* Added backwards iteration to termdict

* Ran formatter

* Updated fst dependency

* Updated dependency

* Changelog and version

* Fixed version

* Made it part of 12.0
2020-01-30 09:59:21 +09:00
Paul Masurel
c3045dfb5c Remove time dev-deps by relying on chrono::Duration reexport. 2020-01-29 23:25:03 +09:00
Paul Masurel
811fd0cb9e Dynamic analyzer (#755)
* Removed generics in tokenizers

* lowercaser

* Added TokenizerExt

* Introducing BoxedTokenizer

* Introducing BoxXXXXX helper struct

* Closes #762.

* Introducing a TextAnalyzer
2020-01-29 18:23:37 +09:00
dependabot-preview[bot]
f6847c46d7 Update tantivy-fst requirement from 0.1 to 0.2 (#750)
Updates the requirements on [tantivy-fst](https://github.com/tantivy-search/fst) to permit the latest version.
- [Release notes](https://github.com/tantivy-search/fst/releases)
- [Commits](https://github.com/tantivy-search/fst/compare/0.1.1...0.2.0)

Signed-off-by: dependabot-preview[bot] <support@dependabot.com>
2020-01-21 07:57:39 +09:00
Paul Masurel
92dac7af5c Return an error instead of panicking when sorting by a non-fast field. (#748)
Closes #747
2020-01-08 13:41:02 +09:00
Paul Masurel
801905d77f Davide romanini arm atomic mutex (#746)
* Add atomic mutex implementation for ARM.

* Applied rustfmt.

* rustfmt

Co-authored-by: davide-romanini <davide.romanini@gmail.com>
2019-12-30 23:42:11 +09:00
Paul Horn
8f5ac86f30 Expose UserOperation as a public type. (#744)
In order to make `IndexWriter::run` callable from outside of the crate,
the `UserOperation` type needs to be publicly available.
Since the `indexer` module is private, we just export the `UserOperation`
type directly.
2019-12-29 22:37:13 +09:00
Paul Masurel
d12a06b65b Tiny code simplification. 2019-12-26 09:33:17 +09:00
Minoru Osuka
749432f949 Make SchemaBuilder::add_field() public (#742)
* Make add_field() public

* cargo format
2019-12-25 20:37:34 +09:00
Paul Masurel
c1400f25a7 Handle facet search in the QueryParser. (#741)
Closes #738
2019-12-25 17:43:33 +09:00
Paul Masurel
87120acf7c Bump version 2019-12-20 21:22:43 +09:00
Paul Masurel
401f74f7ae Implement fast field for DateTime. (#736) 2019-12-20 21:20:15 +09:00
Paul Masurel
03d31f6713 Update CHANGELOG 2019-12-19 10:07:43 +09:00
Paul Masurel
a57faf07f6 Added a constructor for WatchHandle (#734)
Closes #731
2019-12-19 10:06:02 +09:00
Paul Masurel
562ea9a839 Merge branch 'master' of github.com:tantivy-search/tantivy 2019-12-19 09:32:50 +09:00
Paul Masurel
cf92cc1ada Closes #732 (#733)
The future returned by `IndexWriter::merge` does not borrow `&mut self`
2019-12-18 23:25:22 +09:00
Paul Masurel
f6000aece7 Closes #732
The future returned by `IndexWriter::merge` does not borrow `&mut self`
2019-12-18 21:48:51 +09:00
Paul Masurel
2b3fe3a2b5 Bumped version for hotfix 2019-12-17 21:10:50 +09:00
Paul Masurel
0fde90faac Closes #729 (#730)
Bug related to merges and deletes...
2019-12-17 21:09:08 +09:00
Paul Masurel
5838644b03 Added README in tantivy-query-grammar 2019-12-16 08:41:21 +09:00
Paul Masurel
c0011edd05 Added version for tantivy-grammar before publish 2019-12-16 08:35:17 +09:00
petr-tik
431c187a60 Make error handling richer in Footer::is_compatible (#724)
* WIP implemented is_compatible

hide Footer::from_bytes from public consumption - only Footer::extract was found
to be used outside the module

Add a new error type for IncompatibleIndex
add a prototypical call to footer.is_compatible() in ManagedDirectory::open_read
to make sure we error before reading it further

* Make error handling more ergonomic

Add an error subtype for OpenReadError and converters to TantivyError

* Remove an unnecessary assert

it's followed by the same check, which errors instead of panicking

* Correct the compatibility check logic

Leave a defensive versioned footer check to make sure we add new handling logic
when we add new possible footer versions

Restricted VersionedFooter::from_bytes to be used inside the crate only

remove a half-baked test

* WIP.

* Return an error if index incompatible - closes #662

Enrich the error type with incompatibility

Change return type to Result<bool, TantivyError>, instead of bool

Add an Incompatibility enum that enriches the IncompatibleIndex error variant
with information, which then allows us to generate a developer-friendly hint on how
to upgrade the library version or switch feature flags for a different compression
algorithm

Updated changelog

Change the signature of is_compatible

Added documentation to the Incompatibility
Added a conditional test on a Footer with lz4 erroring
2019-12-14 09:14:33 +09:00
Caio Romão
392abec420 Make u64_lenient() handle f64 fast fields too (#726)
* Make u64_lenient() handle f64 fast fields too

Without this, we get a panic during merge since the merger will
get a `None` where it expects something.

Prior to this patch, you can reproduce the panic with:

    use tantivy::{
        self,
        schema::{SchemaBuilder, FAST},
        Document, Index, Result,
    };

    #[test]
    fn pass() -> Result<()> {
        let mut builder = SchemaBuilder::new();
        let field = builder.add_f64_field("f64", FAST);
        let index = Index::create_in_ram(builder.build());

        let mut writer = index.writer_with_num_threads(1, 50_000_000)?;

        for i in 0..1000 {
            let mut doc = Document::new();
            doc.add_f64(field, 0.42);
            writer.add_document(doc);

            if i % 5 == 0 {
                writer.commit()?;
            }
        }

        writer.commit()?;

        Ok(())
    }

* Add test to verify that f64 fields are merged

* Ensure multi-valued fast fields can be merged too
2019-12-13 23:41:22 +09:00
Paul Masurel
dfbe337fe2 Optimize deletes (#723)
Closes #710
2019-12-13 09:50:00 +09:00
Paul Masurel
b9896c4962 Cleanup 2019-12-10 23:01:07 +09:00
Paul Masurel
afa5715e56 Added unit test. 2019-12-10 22:49:32 +09:00
Paul Masurel
79474288d0 Some clippy minor fixes (#722) 2019-12-09 13:40:04 +09:00
Paul Masurel
daf64487b4 Fixing JSON se/deserialization of dates. (#721)
Closes #719
2019-12-09 13:31:35 +09:00
Ximo Guanter
00816f5529 Fix outdated reference in documentation (#720) 2019-12-08 18:10:50 +09:00
Paul Masurel
f73787e6e5 Merge branch 'master' of github.com:tantivy-search/tantivy 2019-12-06 10:06:09 +09:00
Paul Masurel
5cffa71467 Using census 0.4 2019-12-06 10:04:01 +09:00
Christian Hunstad
02af28b3b7 add norwegian stemmer (#717) 2019-11-27 21:08:59 +09:00
Paul Masurel
afe0134d0f Kkoziara remove tokens from doc store (#715)
* Prevent tokens from being stored in the document store.

This commit adds a prepare_for_store method to Document, which changes all
PreTokenizedString values into String values. The method is called
before adding a document to the document store, to prevent tokens from
being saved there. The commit also makes small changes to comments in the
pre_tokenized_text example.

* Avoid storing the pretokenized text.
2019-11-25 22:39:12 +09:00
Christian Hunstad
db9e81d0f9 Updated rust-stemmers version to 1.2 (#716)
* Updated rust-stemmers version to 1.2

* 1.2.0 -> 1.2
2019-11-25 22:38:48 +09:00
Paul Masurel
3821f57ecc Closes #712 (#714)
Fixing the memory leak in the DeleteQueue.
2019-11-25 15:57:29 +09:00
Paul Masurel
d379f98b22 Waiting for indexing threads when dropping IndexWriter 2019-11-23 15:00:27 +09:00
Paul Masurel
ef3eddf3da clippy first stab (#711) 2019-11-22 13:09:35 +09:00
Paul Masurel
08a2368845 Closes #708 (#709)
Fixes a race condition in the test.
2019-11-21 11:41:59 +09:00
Paul Masurel
1868fc1e2c Text fix 2019-11-20 23:00:39 +09:00
Paul Masurel
451a0252ab thread pool merge (#704) 2019-11-20 21:18:05 +09:00
Paul Masurel
42756c7474 Removing futures-cpupool and upgrading to futures-0.3 2019-11-15 18:35:31 +09:00
Paul Masurel
598b076240 Making some of the IndexWriter's method public. 2019-11-11 12:41:45 +09:00
Paul Masurel
f1f96fc417 Updating some doc. 2019-11-11 10:04:12 +09:00
Paul Masurel
9c941603f5 Petr tik n662 errror incompatible footer version (#696)
* code tidy-up

Replace `20` magic constant with COMMON_FOOTER_SIZE

Add a docstring showing how footer is serialised
Add a test for footer length checking

* Add more tests for VersionedFooter

successful and panicking .to_bytes() calls

* Minor changes in footer.rs
2019-11-10 14:40:06 +09:00
Paul Masurel
fb3d6fa332 Adding Value::From<PretokenizedText> (#697) 2019-11-10 14:39:44 +09:00
Paul Masurel
88fd7f091a SegmentUpdater.add_segment does not need to return true (#693) 2019-11-09 21:18:51 +09:00
Jacob Brown
6e4fdfd4bf replace scoped_pool (#685) 2019-11-07 10:26:08 +09:00
kkoziara
0519056bd8 Added handling of pre-tokenized text fields (#642). (#669)
* Added handling of pre-tokenized text fields (#642).

* * Updated changelog and examples concerning #642.
* Added tokenized_text method to Value implementation.
* Implemented From<TokenizedString> for TokenizedStream.

* * Removed tokenized flag from TextOptions and code reliance on the flag.
* Changed naming to use word "pre-tokenized" instead of "tokenized".
* Updated example code.
* Fixed comments.

* Minor code refactoring. Test improvements.
2019-11-07 10:10:56 +09:00
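A hedged sketch of supplying pre-tokenized text (struct and field names assumed from tantivy's tokenizer module; the exact shapes may differ in this era):

```rust
use tantivy::tokenizer::{PreTokenizedString, Token};

// Sketch: tokens produced outside tantivy, to be indexed as-is.
fn pretokenized() -> PreTokenizedString {
    PreTokenizedString {
        text: "hello world".to_string(),
        tokens: vec![
            Token { offset_from: 0, offset_to: 5, position: 0, text: "hello".to_string(), position_length: 1 },
            Token { offset_from: 6, offset_to: 11, position: 1, text: "world".to_string(), position_length: 1 },
        ],
    }
}
```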
dependabot-preview[bot]
7305ad575e Update smallvec requirement from 0.6 to 1.0 (#686)
Updates the requirements on [smallvec](https://github.com/servo/rust-smallvec) to permit the latest version.
- [Release notes](https://github.com/servo/rust-smallvec/releases)
- [Commits](https://github.com/servo/rust-smallvec/compare/v0.6.0...v1.0.0)

Signed-off-by: dependabot-preview[bot] <support@dependabot.com>
2019-11-07 09:55:33 +09:00
Paul Masurel
79f64ac2f4 Create FUNDING.yml 2019-11-05 16:26:12 +09:00
Paul Masurel
67bce6cbf2 Fixing the construction of the DeleteBitset. (#683)
Closes #681
2019-11-04 15:39:11 +09:00
xiaoniu-578fa6bff964d005
e5316a4388 Reduce unnecessary clone. (#684) 2019-11-04 13:57:59 +09:00
Mathias Svensson
6a8a8557d2 Use slice::iter instead of into_iter to avoid future breakage (#679)
* Use `slice::iter` instead of `into_iter` to avoid future breakage

`an_array.into_iter()` currently just works because of the autoref
feature, which then calls `<[T] as IntoIterator>::into_iter`. But
in the future, arrays will implement `IntoIterator`, too. In order
to avoid problems in the future, the call is replaced by `iter()`
which is shorter and more explicit.

* cargo fmt
2019-10-31 20:59:50 +09:00
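For illustration, the shape of the change (a sketch):

```rust
// `into_iter()` on an array currently resolves, via autoref, to slice
// iteration; once arrays implement IntoIterator it iterates by value.
// `iter()` is shorter, explicit, and immune to that change.
fn sum(an_array: [u32; 3]) -> u32 {
    an_array.iter().sum()
}
```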
Alberto Piai
3a65dc84c8 TopDocs: ensure stable sorting on equal score (#675)
* TopDocs: ensure stable sorting on equal score

When selecting the top K documents by score, we need to ensure stable
sorting. Until now, for documents with the same score, we were relying
on the (arbitrary) order returned by the BinaryHeap used to implement
the collectors.

This patch fixes the problem by explicitly using the doc address when
harvesting the `TopSegmentCollector` and when merging the results in
`TopCollector::merge_fruits()`.

This is important (for example) to implement pagination correctly using
the TopDocs collector. If sorting isn't stable, documents that have the
same score might be ranked in different positions depending on the
specific K that was used, thus appearing in two different pages, or in
none at all.

Fixes gh-671

* TMP: alternative solution (see previous commit)

If we add the constrait that D is also PartialOrd in ComparableDoc<T,
D>, then we can move the comparison by doc address directly in the cmp
implementation of ComparableDoc.

* TMP rebase as first commit: add benchmarks for TopSegmentCollector

* fixup! TMP: alternative solution (see previous commit)

* TMP add changelog entry

* TMP run cargo fmt
2019-10-26 15:27:25 +09:00
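A sketch of the tie-breaking rule described above: compare by score first, and fall back to the doc id so the ordering is total (illustrative only, simplified to a `(score, doc_id)` pair):

```rust
use std::cmp::Ordering;

// Sketch: equal scores are ordered by doc id, so the same top-K set comes
// out no matter how the heap arranges equal-score entries internally.
fn compare_hits(left: &(f32, u32), right: &(f32, u32)) -> Ordering {
    left.0
        .partial_cmp(&right.0)
        .unwrap_or(Ordering::Equal)
        .then_with(|| left.1.cmp(&right.1))
}
```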
dependabot-preview[bot]
ce42bbf5c9 Update base64 requirement from 0.10.0 to 0.11.0 (#676)
Updates the requirements on [base64](https://github.com/marshallpierce/rust-base64) to permit the latest version.
- [Release notes](https://github.com/marshallpierce/rust-base64/releases)
- [Changelog](https://github.com/marshallpierce/rust-base64/blob/master/RELEASE-NOTES.md)
- [Commits](https://github.com/marshallpierce/rust-base64/compare/v0.10.0...v0.11.0)

Signed-off-by: dependabot-preview[bot] <support@dependabot.com>
2019-10-26 15:24:47 +09:00
Paul Masurel
7b21b3f25a Refactoring around Field (#673)
* Refactoring around Field

Removing the contract about the order of the field, and the
field id allocation.

* Update delete_queue.rs

* Update field.rs
2019-10-25 09:06:44 +09:00
Paul Masurel
46caec1040 Updating uuid to 0.8 (#674) 2019-10-25 09:02:00 +09:00
petr-tik
1187a02a3e Fixed #664 (#667)
Removed references to u8 and old documentation
2019-10-22 09:34:10 +09:00
Andrew Banchich
f6c525b19e Fix grammar / punctuation (#668) 2019-10-21 10:50:53 +09:00
petr-tik
4a8f7712f3 Add a doctest to BooleanQuery (#630)
* Add a doctest to BooleanQuery

Closes #446

Mark a function that is only used in tests to be compiled for tests only

Fix doc-comments in a couple of related files

* Minor corrections

remove whitespace, fix typos, add explicit dyn marker

* WIP: BooleanQuery doc test

Trying to nest several BooleanQueries together

* Addressed old review

rust 2018 edition + make function available to everyone

* Box the previous query to resolve the type error

* Rework wording in DocAdress document strings

* Reworded and restructured the docstring
2019-10-07 10:05:12 +09:00
Paul Masurel
2f867aad17 Fix bench (#663)
* fmt

* Fixing bench compilation
2019-10-04 17:07:49 +09:00
Paul Masurel
5c6580eb15 fmt (#661) 2019-10-04 12:10:01 +09:00
Paul Masurel
4c3941750b Waiting potentially longer on watch 2019-10-01 09:50:46 +09:00
Paul Masurel
2ea8e618f2 Merge branch 'hotfix-656' 2019-10-01 09:44:56 +09:00
Paul Masurel
94f27f990b Address #656
Broke the reference loop to make sure that the watch_router can
be dropped, and the thread exits.
2019-10-01 09:34:22 +09:00
Paul Masurel
349e8aa348 Removed enum variants on type alias 2019-09-26 18:43:29 +09:00
Paul Masurel
cde9b78b8d Fixing the issue associated with the Regex performance change 2019-09-18 18:29:27 +09:00
fdb-hiroshima
d8894f0bd2 add checksum check in ManagedDirectory (#605)
* add checksum check in ManagedDirectory

fix #400

* flush after writing checksum

* don't checksum atomic file access and clone managed_paths

* implement a footer storing metadata about a file

this is more of a poc, it require some refactoring into multiple files
`terminate(self)` is implemented, but not used anywhere yet

* address comments and simplify things with new contract

use BitOrder for integer to raw byte conversion
consider that an atomic write implies an atomic read, which might not actually be true
use some indirection to have a boxable terminating writer

* implement TerminatingWrite and make terminate() be called where it should

add a dependency on drop_bomb to help find where terminate() should be called
implement TerminatingWrite for wrapper writers
make tests pass
/!\ some tests seem to pass where they shouldn't

* remove usage of drop_bomb

* fmt

* add test for checksum

* address some review comments

* update changelog

* fmt
2019-09-18 18:26:25 +09:00
fdb-hiroshima
7e08e0047b fix Term documentation (#655)
u64-based fields are actually 4+8=12 bytes long
2019-09-11 18:49:35 +09:00
fdb-hiroshima
1a817f117f fix documentation error (#654)
Union was misdocumented as doing an intersection.
Union and Intersection can hold more than 2 DocSets.
2019-09-11 17:12:08 +09:00
petr-tik
2ec19b21ae Remove unnecessary duplicate methods (#650)
Closes #649

Spotted by @imor
2019-09-09 06:36:04 +09:00
Raminder Singh
141f5a93f7 Using FnvHashMap for mapping UnorderedTermId to TermOrdinal. Fixes #507 (#647)
* Using FnvHashMap for mapping UnorderedTermId to TermOrdinal. Fixes #507

* Fixed cargo fmt errors
2019-09-07 19:40:21 +09:00
Paul Masurel
df47d55cd2 Occur debug interface (#648) 2019-09-07 15:08:45 +09:00
Raminder Singh
5e579fd6b7 Fixed clippy warning: unneeded return statement (#646) 2019-09-07 10:14:37 +09:00
Paul Masurel
4b9c1dce69 Moving queyr grammar to a different crate. (#645) 2019-09-05 09:37:28 +09:00
Paul Masurel
d74f71bbef Lighter regex dependency. (#644)
Detail on https://github.com/rust-lang/regex/pull/613
2019-09-04 13:10:12 +09:00
Paul Masurel
5196ca41d8 Small code clean up 2019-09-03 09:22:32 +09:00
dependabot-preview[bot]
4959e06151 Update once_cell requirement from 0.2 to 1.0 (#643)
Updates the requirements on [once_cell](https://github.com/matklad/once_cell) to permit the latest version.
- [Release notes](https://github.com/matklad/once_cell/releases)
- [Changelog](https://github.com/matklad/once_cell/blob/master/CHANGELOG.md)
- [Commits](https://github.com/matklad/once_cell/compare/v0.2.0...v1.0.2)

Signed-off-by: dependabot-preview[bot] <support@dependabot.com>
2019-09-03 07:00:45 +09:00
Paul Masurel
c1635c13f6 RegexQuery performance: make it possible to cache Regexes - remastered by fulmicoton (Closes #639) (#641)
* small docs cleanup

* only compile a regex once per RegexQuery

Building a `Regex` is an expensive operation. Users of `RegexQuery`
need to cache and reuse regexes when searching across multiple fields.

This is the first step towards allowing that: we can store the `Regex`
directly in the `RegexQuery`, instead of the string pattern.

* RegexQuery: account for possible failure in the constructor

When building a regex from a str pattern, we have to account for the
possibility that the pattern is invalid. Before the previous commit, the
failure would happen in the `specialized_weight` method. Now that we
store a compiled `Regex` in `RegexQuery`, `specialized_weight` doesn't
fail anymore, and we can fail early while constructing `RegexQuery` if
the pattern is invalid.

This is a breaking change for users of `RegexQuery::new`.

* add RegexQuery::from_regex method

This builds a `RegexQuery` from an already compiled `Regex`. The use of
`Into<Arc<Regex>>` is to allow the caller to either simply pass a
`Regex`, or an `Arc<Regex>`, in case it needs to be cached and shared on
the caller's side.

* Using an Arc in AutomatonWeight

Closes #639
2019-08-22 16:14:01 +09:00
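A hedged sketch of the caching pattern this enables (per the commit notes, `from_regex` accepts `Into<Arc<Regex>>`; the `tantivy_fst::Regex` type is an assumption):

```rust
use std::sync::Arc;
use tantivy::query::RegexQuery;
use tantivy::schema::Field;
use tantivy_fst::Regex;

// Sketch: compile the expensive Regex once, share it across per-field queries.
fn queries_for(fields: &[Field], regex: &Arc<Regex>) -> Vec<RegexQuery> {
    fields
        .iter()
        .map(|&field| RegexQuery::from_regex(Arc::clone(regex), field))
        .collect()
}
```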
Paul Masurel
135e0ea2e9 Expose new segment meta from Index (#637) 2019-08-19 10:39:15 +09:00
Paul Masurel
f283bfd7ab Added segmentid_from_string (#636) 2019-08-19 10:37:30 +09:00
Joshua Dutton
9f74786db2 Update import statements in examples, doctests (#633)
Update import statements to edition 2018, including removing
`extern crate` and  `#[macro_use]`. Alphabetize the statements.
2019-08-19 07:26:35 +09:00
Joshua Dutton
32e5d7a0c7 Fix trait object in doctest (#635) 2019-08-19 07:25:00 +09:00
Joshua Dutton
84c615cff1 Fixing typos (#634) 2019-08-19 07:24:05 +09:00
Paul Masurel
039c0a0863 Introducing a wrapper struct instead of Boxed<BoxableTokenizer> (#631)
Closes #629
2019-08-15 16:37:04 +09:00
Paul Masurel
b3b0138b82 Change for tantivy-py
Schema.convert_named_doc
Better Debug string for Terms and TermQueries
2019-08-14 17:44:25 +09:00
petr-tik
ea56160cdc Added cargo-fmt to CI runs (#627)
* Added cargo-fmt to CI runs

Closes #625

* Remove fmt from appveyor builds

Windows seems to have issues with installing components through rustup.

Formatting should be equally informative regardless of the OS,
so it is best to keep it on Linux on Travis
2019-08-12 08:25:47 +09:00
petr-tik
028b0a749c Elastic unbounded range query (#624)
* Tidy up

fmt

remove an unnecessary -> Result<()> followed by run.unwrap() in a test

* Adding support for elasticsearch-style unbounded queries

Extend the UserInputBound to include Unbounded, so we can reuse formatting and
internal query format

* Still working on elastic-style range queries

Fixes #498

Merge the elastic_range into range

Reformat to make code easier to follow, use optional() macro to return Some

* Fixed bugs

Made the range parser insensitive to whitespace between the ":" and the range.

Removed optional parsing of field.

Added a unit test for the range parser.

Derived PartialEq to compare the results of parsing as structs, instead of
strings. Found a bug with that unit test - "*}" was parsed as a
UserInputBound::Exclusive instead of UserInputBound::Unbounded. Added an early
detection-and-return for * in the original range parser

* Correct failing test

Assume that we will use "{*" for Unbounded ranges

* Add a note in the changelog

cargo-fmt

* Moved parenthesis to a newline to make nested if-else more visible
2019-08-12 08:24:47 +09:00
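A hedged sketch of the resulting query syntax (the `*` forms are inferred from the PR notes, e.g. "{*" for an unbounded end; treat them as assumptions):

```rust
use tantivy::query::{QueryParser, QueryParserError};

// Sketch: elasticsearch-style unbounded ends in range queries.
fn parse_examples(parser: &QueryParser) -> Result<(), QueryParserError> {
    parser.parse_query("weight:[10 TO *]")?; // weight >= 10, no upper bound
    parser.parse_query("weight:{* TO 10}")?; // no lower bound, weight < 10
    Ok(())
}
```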
Paul Masurel
941f06eb9f Added Schema.from_named_doc 2019-08-11 16:50:32 +09:00
Paul Masurel
04832a86eb WTF is this file doing here (#622) 2019-08-08 21:54:10 +09:00
fdb-hiroshima
beb8e990cd fix parsing neg float in range query (#621)
fix #620
2019-08-08 20:41:04 +09:00
Paul Masurel
001af3876f cargo fmt 2019-08-08 18:07:19 +09:00
Paul Masurel
f428f344da Various bugfix in the query parser (#619) 2019-08-08 17:48:21 +09:00
Paul Masurel
143f78eced Trying to fix #609 (#616) 2019-08-06 20:33:30 +09:00
Kornel
754b55eee5 Bump deps (#613)
* Bump crossbeam

* Warnings--

* Remove outdated tempdir
2019-08-05 22:21:22 +09:00
Paul Masurel
280ea1209c Changes required for python binding (#610) 2019-08-01 17:26:21 +09:00
petr-tik
0154dbe477 Replace unwrap with match and proper Error handling (#606)
* Replace unwrap with match and proper Error handling

* Replaced 'magic' values with a documented variable

Didn't like the unexplained 0..3 range, thought it was best as a variable

Calculating Levenshtein distance is expensive, so best explain why we should
keep it low
2019-07-31 08:16:02 +09:00
Paul Masurel
efd1af1325 Closes #544. (#607)
Prepare for release 0.10.1
2019-07-30 13:38:06 +09:00
fdb-hiroshima
c91eb7fba7 add to_path for Facet (#604)
fix #580
2019-07-27 17:58:43 +09:00
fdb-hiroshima
6eb4e08636 add support for float (#603)
* add basic support for float

as with i64, they are mapped to u64 for indexing;
the query parser doesn't work yet

* Update value.rs

* implement support for float in query parser

* Update README.md
2019-07-27 17:57:33 +09:00
Paul Masurel
c3231ca252 Added phrase query tests (#601) 2019-07-22 13:43:00 +09:00
Paul Masurel
7211df6719 Failrs (#600)
* Single thread tests

* Isolating fail tests into a different binary
2019-07-22 13:17:21 +09:00
Paul Masurel
f27ce6412c Moved the SegmentMeta inventory out of a static. 2019-07-21 10:38:00 +09:00
Paul Masurel
8197a9921f Small code cleaning 2019-07-20 07:10:12 +09:00
Paul Masurel
b0e23b5715 Merge branch 'master' of github.com:tantivy-search/tantivy 2019-07-18 10:16:49 +09:00
Paul Masurel
0167151f5b Disabling generating docs 2019-07-18 10:16:29 +09:00
Paul Masurel
0668949390 Disabling generating docs 2019-07-18 09:36:57 +09:00
Paul Masurel
94d0e52786 Using instead of u64. 2019-07-17 22:02:47 +09:00
Paul Masurel
818a0abbee Small refactoring 2019-07-17 21:55:59 +09:00
Luca Bruno
4e6dcf3cbe cargo: update to fail 0.3 (#593)
* cargo: update to fail 0.3

* tantivy: align failpoints feature naming

This aligns feature naming to use `failpoints` everywhere, like the
underlying library.
2019-07-17 18:51:38 +09:00
Paul Masurel
af7ea1422a using smallvec for operation batches (#599) 2019-07-17 13:20:02 +09:00
Paul Masurel
498057c5b7 Refactor deletes (#597)
* Refactor deletes

* Removing generation from SegmentUpdater. These have been obsolete for a long time

* Number literal clippy

* Removed clippy useless allow statement
2019-07-17 13:06:44 +09:00
Paul Masurel
5095e6b010 Introduce a small refactoring of the segment writer. (#596) 2019-07-17 08:32:29 +09:00
Paul Masurel
1aebc87ee3 disabling caching (#595) 2019-07-16 19:05:22 +09:00
Paul Masurel
9fb5058b29 Fixed links (#592)
Closes #591
2019-07-15 19:35:44 +09:00
Paul Masurel
158e0a28ba Removed link to master reference doc 2019-07-15 15:18:53 +09:00
Paul Masurel
3576a006f7 Updated example link 2019-07-15 15:17:53 +09:00
Paul Masurel
80c25ae9f3 Release 0.10 2019-07-11 19:10:12 +09:00
Paul Masurel
4867be3d3b Kompass master (#590)
* Use once_cell in place of lazy_static

* Minor changes
2019-07-10 19:24:54 +09:00
Paul Masurel
697c7e721d Only compile bitpacker4x (#589) 2019-07-10 18:53:46 +09:00
Paul Masurel
3e368d92cb Issue/479 (#578)
* Sort by field relying on tweaked score
* Sort by u64/i64 get independent methods.
2019-07-07 17:12:31 +09:00
Paul Masurel
0bc2c64a53 2018 (#585)
* removing macro import for fail-rs

* Downcast-rs

* matches
2019-07-07 17:09:04 +09:00
Paul Masurel
35236c8634 Seek not required in Directory's write anymore (#584) 2019-07-03 10:12:33 +09:00
Paul Masurel
462774b15c Tiqb feature/2018 (#583)
* rust 2018

* Added CHANGELOG comment
2019-07-01 10:01:46 +09:00
Paul Masurel
185a5b8d31 updating rand (#582) 2019-06-29 13:11:42 +09:00
petr-tik
73d7791479 Add instructions for contributors (#574) 2019-06-27 09:59:07 +09:00
Kirill Zaborsky
f52b1e68d1 Fix typo (#573) 2019-06-27 09:57:37 +09:00
Paul Masurel
3e0907fe05 Fixed CHANGELOG and disable one test on windows (#577) 2019-06-27 09:48:53 +09:00
dependabot-preview[bot]
ab4a8916d3 Update bitpacking requirement from 0.6 to 0.7 (#575)
Updates the requirements on bitpacking to permit the latest version.

Signed-off-by: dependabot-preview[bot] <support@dependabot.com>
2019-06-27 09:39:26 +09:00
Antoine Catton
bcd7386fc5 Add crates.io shield to the README (#572) 2019-06-18 11:19:06 +09:00
Paul Masurel
c23a7c992b Closes #552 (#570)
The different handles to `SegmentMeta` are closed before calling gc on
end_merge.
2019-06-16 14:12:13 +09:00
Paul Masurel
2a88094ec4 Disabling travis on OSX (#571) 2019-06-16 14:12:01 +09:00
Paul Masurel
ca3cfddab4 adding cond (#568) 2019-06-16 11:59:26 +09:00
Paul Masurel
7bd9f9773b trying to fix doc upload (#567) 2019-06-16 11:22:51 +09:00
Paul Masurel
e2da92fcb5 Petr tik n510 clear index (#566)
* Enables clearing the index

Closes #510

* Adds an example to clear and rebuild the index

* Addressing code review

Moved the example from examples/ to docstring above `clear`

* Corrected minor typos and missed/duplicate words

* Added stamper.revert method to be used for rollback

Added type alias for Opstamp

Moved to AtomicU64 on stable rust (since 1.34)

* Change the method name and doc-string

* Remove rollback from delete_all_documents

test_add_then_delete_all_documents fails with --test-threads 2

* Passes all the tests with any number of test-threads

(ran locally 5 times)

* Addressed code review

Deleted comments with debug info
changed ReloadPolicy to Manual

* Removing useless garbage_collect call and updated CHANGELOG
2019-06-12 09:40:03 +09:00
petr-tik
876e1451c4 Resume uploading docs to gh-pages (#565)
* Fixes #546

Generate docs and upload them. Need GH_TOKEN env var to be set in travis settings

* Investigate what TRAVIS* env vars are set
2019-06-12 09:30:09 +09:00
dependabot-preview[bot]
a37d2f9777 Update winapi requirement from 0.2 to 0.3 (#537)
* Update winapi requirement from 0.2 to 0.3

Updates the requirements on [winapi](https://github.com/retep998/winapi-rs) to permit the latest version.
- [Release notes](https://github.com/retep998/winapi-rs/releases)
- [Commits](https://github.com/retep998/winapi-rs/commits/0.3.7)

Signed-off-by: dependabot[bot] <support@dependabot.com>

* Fixing upgrading winapi (hopefully).
2019-06-06 10:23:13 +09:00
Paul Masurel
4822940b19 Issue/36 (#559)
* Added explanation

* Explain

* Splitting weight and idf

* Added comments

Closes #36
2019-06-06 10:03:54 +09:00
Paul Masurel
d590f4c6b0 Comments for IndexMeta (#560) 2019-06-06 09:24:31 +09:00
Paul Masurel
edfa619519 Update .travis.yml 2019-05-29 16:45:56 +09:00
Paul Masurel
96f194635f Trying to address #546 2019-05-29 09:17:41 +09:00
Paul Masurel
444662485f Remove mut in add_document and delete_term. Made stamper ordering rel… (#551)
* Remove mut in add_document and delete_term. Made stamper ordering relaxed.

* Made batch operations &mut self -> &self

* Added example
2019-05-28 10:26:00 +09:00
Stephen Carman
943c25d0f8 Make IndexMeta public (#553) 2019-05-28 09:27:49 +09:00
Paul Masurel
5c0b2a4579 Merge branch 'stamper_refactor' 2019-05-08 10:02:02 +09:00
Paul Masurel
9870a9258d Removed the mutex implementation of AtomicU64.
Fixed comment
2019-05-08 09:59:28 +09:00
Paul Masurel
7102b363f5 Fix build 2019-05-05 14:19:54 +09:00
Paul Masurel
66b4615e4e Issue/542 (#543)
* Closes 542.

Fast fields are all loaded when the segment reader is created.
2019-05-05 13:52:43 +09:00
petr-tik
da46913839 Merge branch 'master' into stamper_refactor 2019-04-30 22:28:48 +01:00
Paul Masurel
3df037961f Added more info to fast fields. 2019-04-30 13:14:01 +09:00
petr-tik
8ffae47854 Addressed code review
moved Opstamp to top-level namespace, added a docstring

Corrected minor typos/whitespace
2019-04-29 21:23:28 +01:00
petr-tik
1a90a1f3b0 Merge branch 'master' of github.com:tantivy-search/tantivy into stamper_refactor 2019-04-26 08:47:12 +01:00
Paul Masurel
dac50c6aeb Dds merged (#539)
* add ascii folding support

* Minor change and added Changelog.

* add additional tests

* Add tests for ascii folding (#533)

* first tests for ascii folding

* use a `RawTokenizer` for tokens using punctuation

* add test for all (?) folding, inspired by Lucene

* Simplification of the unit test code
2019-04-26 10:25:08 +09:00
Paul Masurel
31b22c5acc Added logging when token is dropped. (#538) 2019-04-26 09:23:28 +09:00
petr-tik
8e50921363 Tidied up the Stamper module and upgraded to a 1.34 dependency
Added stamper.revert method to be used for rollback - rolling back to a previous
commit in case of deleting all documents or rolling operations back should reset
the stamper as well

Added type alias for Opstamp - helps code readability instead of seeing u64
returned by functions.

Moved to AtomicU64 on stable rust (since 1.34) - where possible use standard
library interfaces.
2019-04-24 20:46:28 +01:00
Paul Masurel
96a4f503ec Closes #526 (#535) 2019-04-24 20:59:48 +09:00
Paul Masurel
9df288b0c9 Merge branch 'master' of github.com:tantivy-search/tantivy 2019-04-24 12:31:47 +09:00
Paul Masurel
b7c2d0de97 Clippy2 (#534)
* Clippy comments

Clippy complains about the cast of &[u32] to a *const __m128i,
because of the lack of alignment constraints.

This commit passes the OutputBuffer object (which enforces proper
alignment) instead of `&[u32]`.

* Clippy. Block alignment

* Code simplification

* Added comment. Code simplification

* Removed the extraneous freq block len hack.
2019-04-24 12:31:32 +09:00
Paul Masurel
62445e0ec8 Merge branch 'master' of github.com:tantivy-search/tantivy 2019-04-23 09:55:55 +09:00
Paul Masurel
a228825462 Clippy comments (#532)
Clippy complains about the cast of &[u32] to a *const __m128i,
because of the lack of alignment constraints.

This commit passes the OutputBuffer object (which enforces proper
alignment) instead of `&[u32]`.
2019-04-23 09:54:02 +09:00
Paul Masurel
d3eabd14bc Clippy comments
Clippy complains about the cast of &[u32] to a *const __m128i,
because of the lack of alignment constraints.

This commit passes the OutputBuffer object (which enforces proper
alignment) instead of `&[u32]`.
2019-04-22 11:16:21 +09:00
petr-tik
c967031d21 Delete files from target/ dir to avoid caching them on CI (#531)
* Delete files from target/ dir to avoid caching them on CI

idea from here https://github.com/rust-lang/cargo/issues/5885#issuecomment-432723546

* Delete examples
2019-04-21 08:02:27 +09:00
Paul Masurel
d823163d52 Closes #527. (#529)
Fixing the bug that affects the result of `query.count()` in the presence of
deletes.
2019-04-19 09:19:50 +09:00
Paul Masurel
c4f59f202d Bumped combine version 2019-04-11 08:33:56 +09:00
Paul Masurel
acd29b535d Fix comment 2019-04-02 10:05:14 +09:00
Panagiotis Ktistakis
2cd31bcda2 Fix non english stemmers (#521) 2019-03-27 08:54:16 +09:00
Paul Masurel
99870de55c 0.10.0-dev 2019-03-25 08:58:26 +09:00
Paul Masurel
cad2d91845 Disabled tests for android 2019-03-24 22:58:46 +09:00
Paul Masurel
79f3cd6cf4 Added instructions to update 2019-03-24 09:10:31 +09:00
Paul Masurel
e3abb4481b broken link 2019-03-22 09:58:28 +09:00
Paul Masurel
bfa61d2f2f Added patreon button 2019-03-22 09:51:00 +09:00
Paul Masurel
6c0e621fdb Added bench info in README 2019-03-21 09:35:04 +09:00
Paul Masurel
a8cc5208f1 Linear simd (#519)
* linear simd search within block
2019-03-20 22:10:05 +09:00
Paul Masurel
83eb0d0cb7 Disabling tests on Android 2019-03-20 10:24:17 +09:00
Paul Masurel
ee6e273365 cleanup for nodefaultfeatures 2019-03-20 10:04:42 +09:00
Paul Masurel
6ea34b3d53 Fix version 2019-03-20 09:39:24 +09:00
Paul Masurel
22cf1004bd Reenabled test on android 2019-03-20 08:54:52 +09:00
Paul Masurel
5768d93171 Rename try to attempt as try is becoming a keyword in rust 2019-03-20 08:54:19 +09:00
Paul Masurel
663dd89c05 Feature/reader (#517)
Adding IndexReader to the API. Making it possible to watch for changes.

* Closes #500
2019-03-20 08:39:22 +09:00
barrotsteindev
a934577168 WIP: date field (#487)
* initial version, still a work in progress

* remove redundant or

* add chrono::DateTime and index i64

* add more tests

* fix tests

* pass DateTime by ptr

* remove println!

* document query_parser rfc 3339 date support

* added some more docs about implementation to schema.rs

* enforce DateTime is UTC, and re-export chrono

* added DateField to changelog

* fixed conflict

* use INDEXED instead of INT_INDEXED for date fields
2019-03-15 22:10:37 +09:00
Paul Masurel
94f1885334 Issue/513 (#514)
* Closes #513

* Clean up and doc

* Updated changelog
2019-03-07 09:39:30 +09:00
Jonathan Fok kan
2ccfdb97b5 WIP: compiling to wasm (#512)
* First work to enable compile to wasm

* Added back fst-regex/mmap to mmap feature

* Removed fst-regex. Forced uuid version 0.7.2.
2019-03-06 10:40:54 +09:00
Paul Masurel
e67883138d Cargo fmt 2019-03-06 10:31:00 +09:00
Paul Masurel
f5c65f1f60 Added comment on the constructor fo TopDocSByField 2019-03-06 10:30:37 +09:00
Mauri de Souza Nunes
ec73a9a284 Remove note about panicking in get_field docs (#503)
Since get_field relies on calling get on the underlying InnerSchema HashMap,
it shouldn't fail if the field is not found; it simply returns None.
2019-02-28 09:23:00 +09:00
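A sketch of the documented behavior (assuming the era's `Option`-returning `get_field`):

```rust
use tantivy::schema::Schema;

// Sketch: get_field forwards to a HashMap lookup, so a missing name
// yields None rather than a panic.
fn field_exists(schema: &Schema, name: &str) -> bool {
    schema.get_field(name).is_some()
}
```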
Thomas Schaller
a814a31f1e Remove semicolon from doc! expansion (#509) 2019-02-28 09:20:43 +09:00
Paul Masurel
9acadb3756 Code cleaning 2019-02-26 10:50:36 +09:00
Paul Masurel
774fcecf23 cargo fmt 2019-02-26 10:44:59 +09:00
Paul Masurel
27c9fa6028 Jannickj prove bug with facets (#508)
* prove bug with facets

* Closing #505

Introduce a term id in the TermHashMap
2019-02-25 22:33:17 +09:00
Paul Masurel
fdefea9e26 Removed path reference to tantivy-fst 2019-02-23 10:42:44 +09:00
Paul Masurel
b422f9c389 Partially addresses #500 (#502)
Using `tantivy_fst`. Storing `Weak<Mmap>` in the Mmap cache.
2019-02-23 10:33:59 +09:00
petr-tik
9451fd5b09 MsQueue to channel (#495)
* Format

Made the docstring consistent
remove empty line

* Move matches to dev deps

* Replace MsQueue with an unbounded crossbeam-channel

Questions:
queue.push ignores Result return

How to test pop() calls, if they block

* Format

Made the docstring consistent
remove empty line

* Unwrap the Result of queue.pop

* Addressed Paul's review

wrap the Result-returning send call with expect()

reworked the test so it does not fail after popping from an empty queue

removed references to the Michael-Scott Queue

formatted
2019-02-23 09:06:50 +09:00
Jason Goldberger
788b3803d9 updated changelog (#501)
* updated changelog

* Update CHANGELOG.md

* Update CHANGELOG.md
2019-02-19 00:25:18 +09:00
Paul Masurel
5b11228083 Merge branch 'master' of github.com:tantivy-search/tantivy 2019-02-15 08:30:55 +09:00
Paul Masurel
515adff644 Merge branch 'hotfix/0.8.2' 2019-02-15 08:30:27 +09:00
Paul Masurel
e70a45426a 0.8.2 release
Backporting a fix for non x86_64 platforms
2019-02-14 09:16:27 +09:00
Jason Goldberger
e14701e9cd Add grouped operations (#493)
* [WIP] added UserOperation enum, added IndexWriter.run, and added MultiStamp

* removed MultiStamp in favor of std::ops::Range

* changed IndexWriter::run to return u64, Stamper::stamps to return a Range, added tests, and added docs

* changed delete_cursor skipping to use first operation's opstamp vice last. change index_writer test to use 1 thread

* added test for order batch of operations

* added a test comment
2019-02-14 08:56:01 +09:00
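A hedged usage sketch of the grouped-operation API added here (the `UserOperation` variants and the `u64` return come from the PR notes; the receiver type is an assumption):

```rust
use tantivy::{Document, IndexWriter, Term, UserOperation};

// Sketch: delete-then-add applied as one batch with consecutive opstamps.
fn upsert(writer: &mut IndexWriter, term: Term, doc: Document) -> u64 {
    writer.run(vec![UserOperation::Delete(term), UserOperation::Add(doc)])
}
```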
Paul Masurel
45e62d4329 Code simplification and adding comments 2019-02-06 10:05:15 +09:00
petr-tik
76d2b4dab6 Add integer range search example (#490)
Copied and simplified the example in the range_query mod
2019-02-05 23:34:06 +01:00
Paul Masurel
04e9606638 simplification of positions 2019-02-05 15:36:13 +01:00
Paul Masurel
a5c57ebbd9 Positions simplification 2019-02-05 14:50:51 +01:00
Paul Masurel
96eaa5bc63 Positions 2019-02-05 14:50:16 +01:00
Paul Masurel
f1d30ab196 fastfield reader fix 2019-02-05 14:10:16 +01:00
Paul Masurel
4507df9255 Closes #461 (#489)
Multivalued fast field uses `u64` indexes.
2019-02-04 13:24:00 +01:00
Paul Masurel
e8625548b7 Closes #461 (#488)
Multivalued fast field uses `u64` indexes.
2019-02-04 13:20:20 +01:00
Paul Masurel
50ed6fb534 Code cleanup
Fixed compilation without the mmap directory
2019-02-05 12:39:30 +01:00
Panagiotis Ktistakis
76609deadf Add Greek stemmer (#486) 2019-02-01 06:30:49 +01:00
Paul Masurel
749e62c40b renamed 2019-01-30 16:29:17 +01:00
Paul Masurel
259ce567d1 Using linear search 2019-01-29 15:59:24 +01:00
Paul Masurel
4c93b096eb Rustfmt 2019-01-29 11:45:30 +01:00
Paul Masurel
6a547b0b5f Issue/483 (#484)
* Downcast_ref

* fixing unit test
2019-01-28 11:43:42 +01:00
Paul Masurel
e99d1a2355 Better exponential search 2019-01-29 11:29:17 +01:00
Paul Masurel
c7bddc5fe3 Inlined exponential search 2019-01-28 17:28:07 +01:00
Paul Masurel
7b97dde335 Clippy + cargo fmt 2019-01-28 12:37:55 +01:00
Paul Masurel
644b4bd0a1 Issue/468b (#482)
* Moving lock to directory/

* added fs2

* doc

* Using fs2 for locking

* Added unit test

* Fixed error message related unit test

* Fixing location of import
2019-01-27 12:32:21 +01:00
Paul Masurel
bf94fd77db Issue/471 (#481)
* Closes 471

Removing writing_segments in the segment manager as it is now useless.
Removing the target merged segment id as it is useless as well.

* RAII for tracking which segment is in merge.

Closes #471

* fmt

* Using Inventory::default().
2019-01-27 12:18:59 +09:00
Paul Masurel
097eaf4aa6 impl Future as a result of merges 2019-01-28 03:56:43 +01:00
Paul Masurel
1fd46c1e9b Clippy 2019-01-28 03:46:23 +01:00
Paul Masurel
2fb219d017 Changelog 2019-01-24 09:12:07 +09:00
Paul Masurel
63b593bd0a Lower RAM usage in tests. 2019-01-24 09:10:38 +09:00
Paul Masurel
286bb75a0c Updated changelog 2019-01-24 09:03:58 +09:00
barrotsteindev
222b7f2580 Tantivy-288 (#472)
* add unit test

* improved test

* added SegmentManager#remove_empty_segments

* update old tests for new behaviour

* cleaner filter for empty segments

* PR adjustments

* rename x in closures

* simplify assert_eq!(vec.len(), 0)

* wait_merging_threads

* acquire searchers

* add comments to test

* rebased on latest master

* harden test

* fix merger#test_merge_multivalued_int_fields_all_deleted test
2019-01-24 08:58:56 +09:00
pentlander
5292e78860 Allow stemmers in languages other than English (#473)
Allow users to create stemmers for languages other than English. Add a
default stemmer for English.
2019-01-23 22:24:32 +09:00
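A hedged sketch of picking a non-English stemmer (written against the later `TextAnalyzer` API that also appears in this log; `Language::French` is an assumption):

```rust
use tantivy::tokenizer::{Language, LowerCaser, SimpleTokenizer, Stemmer, TextAnalyzer};

// Sketch: the default English chain, with a French stemmer swapped in.
fn french_analyzer() -> TextAnalyzer {
    TextAnalyzer::from(SimpleTokenizer)
        .filter(LowerCaser)
        .filter(Stemmer::new(Language::French))
}
```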
Paul Masurel
c0cc6aac83 Updated changelog 2019-01-23 22:22:34 +09:00
Paul Masurel
0b0bf59a32 Allow stemmers in languages other than English (#478)
Allow users to create stemmers for languages other than English. Add a
default stemmer for English.

Closes #478
2019-01-23 22:21:00 +09:00
Paul Masurel
74f70a5c2c 32bits platforms 2019-01-23 13:21:31 +09:00
Paul Masurel
1acfb2ebb5 cargo fmt 2019-01-23 10:21:39 +09:00
Paul Masurel
4dfd091e67 Bumped version to 0.8.2-dev 2019-01-23 10:20:59 +09:00
Paul Masurel
8eba4ab807 Merge branch 'hotfix-476' 2019-01-23 10:20:33 +09:00
Paul Masurel
5e8e03882b Merge branch 'bug/476' 2019-01-23 10:18:27 +09:00
Paul Masurel
7df3260a15 Version bump 2019-01-23 10:13:18 +09:00
Paul Masurel
176f67a266 Refactoring 2019-01-23 10:06:40 +09:00
Paul Masurel
19babff849 Closes #476 2019-01-23 10:06:39 +09:00
Paul Masurel
bf2576adf9 Added a broken unit test 2019-01-23 10:04:27 +09:00
Paul Masurel
0e8fcd5727 Plastic surgery 2019-01-19 23:13:27 +09:00
Paul Masurel
f745c83bb7 Closes 466. Removing mentions of the chain collector. (#467) 2019-01-16 10:28:19 +09:00
Paul Masurel
ffb16d9103 More efficient indexing (#463)
* Using unrolled u32 VInt and caching Vec s

* cargo fmt

* Exposing a io::Write in the Expull thing

* expull as a writer. clippy + format

* inline the first block

* simplified -if let Some-

* vint reader iterator

* blop
2019-01-13 14:51:18 +09:00
Paul Masurel
98ca703daa More efficient indexing (#462)
* Using unrolled u32 VInt and caching Vec s

* cargo fmt

* Exposing a io::Write in the Expull thing

* expull as a writer. clippy + format

* inline the first block

* simplified -if let Some-

* vint reader iterator
2019-01-13 14:41:56 +09:00
Paul Masurel
b9d25cda5d Using LittleEndian explicitly 2019-01-08 12:41:58 +09:00
Paul Masurel
beb4289ec2 Less unsafe 2019-01-08 00:48:14 +09:00
Andrew Banchich
bdd72e4683 Update README.md (#459)
Fix Elasticsearch spelling
2018-12-27 07:26:49 +09:00
Paul Masurel
45c3cd19be Fixing README: git clone https... 2018-12-26 21:13:33 +09:00
Paul Masurel
b8241c5603 0.8.0 2018-12-26 10:18:34 +09:00
Paul Masurel
a4745151c0 Version to 0.8 2018-12-26 10:11:06 +09:00
Paul Masurel
e2ce326a8c Merge branch 'issue/457' 2018-12-18 10:35:01 +09:00
Paul Masurel
bb21d12a70 Bumping version 2018-12-18 10:14:12 +09:00
Paul Masurel
4565aba62a Added unit test for exponential search 2018-12-18 09:24:31 +09:00
Paul Masurel
545a7ec8dd Closes #457 2018-12-18 09:18:46 +09:00
Paul Masurel
e68775d71c Format and update murmurhash32 version 2018-12-17 19:12:38 +09:00
Paul Masurel
dcc92d287e Facet remove unsafe (#456)
* Removing some unsafe

* Removing some unsafe (2)

* Remove murmurhash
2018-12-17 19:08:48 +09:00
Paul Masurel
b48f81c051 Removing unsafe from bitpacking code (#455) 2018-12-17 19:06:37 +09:00
Paul Masurel
a3042e956b Facet remove unsafe (#454)
* Removing some unsafe

* Removing some unsafe (2)
2018-12-17 09:31:09 +09:00
dependabot[bot]
1fa10f0a0b Update itertools requirement from 0.7 to 0.8 (#453)
Updates the requirements on [itertools](https://github.com/bluss/rust-itertools) to permit the latest version.
- [Release notes](https://github.com/bluss/rust-itertools/releases)
- [Commits](https://github.com/bluss/rust-itertools/commits/0.8.0)

Signed-off-by: dependabot[bot] <support@dependabot.com>
2018-12-17 09:28:36 +09:00
Paul Masurel
279a9eb5e3 Closes #449 (#450)
Clippy working on stable.
Clippy warnings addressed
2018-12-10 12:20:59 +09:00
fdb-hiroshima
21a24672d8 Add accessors for Snippet and HighlightSection (#448)
* Add accessors for Snippet and HighlightSection

And add an example of custom highlighter

* Remove inline(always) and unnecessary empty lines
2018-12-02 18:00:16 +09:00
dependabot[bot]
a3f1fbaae6 Update scoped-pool requirement from 0.1 to 1.0 (#447)
Updates the requirements on [scoped-pool](https://github.com/reem/rust-scoped-pool) to permit the latest version.
- [Release notes](https://github.com/reem/rust-scoped-pool/releases)
- [Commits](https://github.com/reem/rust-scoped-pool/commits/1.0.0)

Signed-off-by: dependabot[bot] <support@dependabot.com>
2018-12-01 13:54:59 +09:00
Paul Masurel
a6e767c877 Cargo fmt 2018-11-30 22:52:45 +09:00
Paul Masurel
6af0488dbe Executor made sorted 2018-11-30 22:52:26 +09:00
Paul Masurel
07d87e154b Collector refactoring and multithreaded search (#437)
* Split Collector into an overall Collector and a per-segment SegmentCollector. Precursor to cross-segment parallelism, and as a side benefit cleans up any per-segment fields from being Option<T> to just T.

* Attempt to add MultiCollector back

* working. Chained collector is broken though

* Fix chained collector

* Fix test

* Make Weight Send+Sync for parallelization purposes

* Expose parameters of RangeQuery for external usage

* Removed &mut self

* fixing tests

* Restored TestCollectors

* blop

* multicollector working

* chained collector working

* test broken

* fixing unit test

* blop

* blop

* Blop

* simplifying APi

* blop

* better syntax

* Simplifying top_collector

* refactoring

* blop

* Sync with master

* Added multithread search

* Collector refactoring

* Schema::builder

* CR and rustdoc

* CR comments

* blop

* Added an executor

* Sorted the segment readers in the searcher

* Update searcher.rs

* Fixed unit testst

* changed the place where we have the sort-segment-by-count heuristic

* using crossbeam::channel

* inlining

* Comments about panics propagating

* Added unit test for executor panicking

* Readded default

* Removed Default impl

* Added unit test for executor
2018-11-30 22:46:59 +09:00
Paul Masurel
8b0b0133dd Importing crossbeam_channel from crossbeam reexport. 2018-11-19 09:19:28 +09:00
dependabot[bot]
7b9752f897 Update crossbeam-channel requirement from 0.2 to 0.3 (#436)
* Update crossbeam-channel requirement from 0.2 to 0.3

Updates the requirements on [crossbeam-channel](https://github.com/crossbeam-rs/crossbeam-channel) to permit the latest version.
- [Release notes](https://github.com/crossbeam-rs/crossbeam-channel/releases)
- [Changelog](https://github.com/crossbeam-rs/crossbeam-channel/blob/master/CHANGELOG.md)
- [Commits](https://github.com/crossbeam-rs/crossbeam-channel/commits/v0.3.0)

Signed-off-by: dependabot[bot] <support@dependabot.com>

* fixing build
2018-11-16 14:26:59 +09:00
dependabot[bot]
c92f41aea8 Update rand requirement from 0.5 to 0.6 (#440)
* Update rand requirement from 0.5 to 0.6

Updates the requirements on [rand](https://github.com/rust-random/rand) to permit the latest version.
- [Release notes](https://github.com/rust-random/rand/releases)
- [Changelog](https://github.com/rust-random/rand/blob/master/CHANGELOG.md)
- [Commits](https://github.com/rust-random/rand/commits)

Signed-off-by: dependabot[bot] <support@dependabot.com>

* Updating rand.
2018-11-16 12:38:01 +09:00
Do Duy
dea16f1d9d Derive Clone for QueryParser (#442) 2018-11-15 18:45:40 +09:00
dependabot[bot]
236cfbec08 Update crossbeam requirement from 0.4 to 0.5 (#438)
Updates the requirements on [crossbeam](https://github.com/crossbeam-rs/crossbeam) to permit the latest version.
- [Release notes](https://github.com/crossbeam-rs/crossbeam/releases)
- [Changelog](https://github.com/crossbeam-rs/crossbeam/blob/master/CHANGELOG.md)
- [Commits](https://github.com/crossbeam-rs/crossbeam/commits/crossbeam-0.5.0)

Signed-off-by: dependabot[bot] <support@dependabot.com>
2018-11-15 06:16:22 +09:00
Paul Masurel
edcafb69bb Fixed benches 2018-11-10 17:04:29 -08:00
Paul Masurel
14908479d5 Release 0.7.1 2018-11-02 17:56:25 +09:00
Dru Sellers
ab4593eeb7 Adds open_or_create method (#428)
* Change the semantic of Index::create_in_dir.

It should return an error if the directory already contains an Index.

* Index::open_or_create is working

* additional test

* Checking that schema matches on open_or_create.

Simplifying unit tests.

* simplifying Eq
2018-10-31 08:36:39 +09:00
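A rough usage sketch of the resulting semantics, assuming the current `Index::open_or_create` signature and `MmapDirectory`:

```rust
use tantivy::directory::MmapDirectory;
use tantivy::schema::{Schema, TEXT};
use tantivy::Index;

fn main() {
    let mut schema_builder = Schema::builder();
    schema_builder.add_text_field("body", TEXT);
    let schema = schema_builder.build();

    // MmapDirectory::open expects an existing directory.
    std::fs::create_dir_all("/tmp/example_index").expect("create dir");
    let dir = MmapDirectory::open("/tmp/example_index").expect("open directory");

    // First call creates the index; later calls open it, and fail if the
    // directory already holds an index with a different schema.
    let index = Index::open_or_create(dir, schema).expect("open or create");
    let _ = index;
}
```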
Dru Sellers
e75bb1d6a1 Fix NGram processing of non-ASCII characters (#430)
* A working version

* optimize the ngram parsing

* Decoding codepoint only once.

* Closes #429

* using leading_zeros to make code less cryptic

* lookup in a table
2018-10-31 08:35:27 +09:00
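The leading_zeros idea from #430: the number of leading one bits in a UTF-8 first byte gives the sequence length, and a tiny table makes that explicit. A self-contained sketch, not the committed code:

```rust
// Valid UTF-8 first bytes only; a continuation byte degenerates to 1 here.
fn utf8_num_bytes(first_byte: u8) -> usize {
    // leading ones in the first byte -> total bytes in the sequence
    const LEN_TABLE: [usize; 5] = [1, 1, 2, 3, 4];
    let leading_ones = (!first_byte).leading_zeros() as usize;
    LEN_TABLE[leading_ones]
}

fn main() {
    assert_eq!(utf8_num_bytes(b'a'), 1);                        // 0xxxxxxx
    assert_eq!(utf8_num_bytes("é".as_bytes()[0]), 2);           // 110xxxxx
    assert_eq!(utf8_num_bytes("あ".as_bytes()[0]), 3);          // 1110xxxx
    assert_eq!(utf8_num_bytes("🦀".as_bytes()[0]), 4);          // 11110xxx
}
```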
dependabot[bot]
63b9d62237 Update base64 requirement from 0.9.1 to 0.10.0 (#433)
Updates the requirements on [base64](https://github.com/alicemaz/rust-base64) to permit the latest version.
- [Release notes](https://github.com/alicemaz/rust-base64/releases)
- [Changelog](https://github.com/alicemaz/rust-base64/blob/master/RELEASE-NOTES.md)
- [Commits](https://github.com/alicemaz/rust-base64/commits/v0.10.0)

Signed-off-by: dependabot[bot] <support@dependabot.com>
2018-10-31 08:34:44 +09:00
Jason Wolfe
0098e3d428 Compute space usage of a Searcher / SegmentReader / CompositeFile (#282)
* Compute space usage of a Searcher / SegmentReader / CompositeFile

* Fix typo

* Add serde Serialize/Deserialize for all the SpaceUsage structs

* Fix indexing

* Public methods for consuming space usage information

* #281: Add a space usage method that takes a SegmentComponent to support code that is unaware of particular segment components, and to make it more likely to update methods when a new component type is added.

* Add support for space usage computation of positions skip index file (#281)

* Add some tests for space usage computation (#281)
2018-10-15 09:04:36 +09:00
Konstantin Gribov
69d5e4b9b1 Added proper references for Apache Lucene & Solr (#432)
Also, added links to websites for Lucene, Solr & ElasticSearch
2018-10-12 08:46:07 +09:00
Paul Masurel
e0cdd3114d Fixing README (#427)
Closes #424.
2018-09-17 08:52:29 +09:00
Paul Masurel
f32b4a2ebe Removing release build from ci, disabling lto (#425) 2018-09-17 06:41:40 +09:00
Paul Masurel
6ff60b8ed8 Fixing README (#426) 2018-09-17 06:20:44 +09:00
Paul Masurel
8da28fb6cf Added iml file 2018-09-16 13:26:54 +09:00
Paul Masurel
0df2a221da Bump version pre-release 2018-09-16 13:24:14 +09:00
Paul Masurel
5449ec3c11 Snippet term score (#423) 2018-09-16 10:21:02 +09:00
Paul Masurel
10f6c07c53 Clippy (#422)
* Cargo Format
* Clippy
2018-09-15 20:20:22 +09:00
Paul Masurel
06e7bd18e7 Clippy (#421)
* Cargo Format

* Clippy

* bugfix

* still clippy stuff

* clippy step 2
2018-09-15 14:56:14 +09:00
Paul Masurel
37e4280c0a Cargo Format (#420) 2018-09-15 07:44:22 +09:00
Paul Masurel
0ba1cf93f7 Remove Searcher dereference (#419) 2018-09-14 09:54:26 +09:00
Paul Masurel
21a9940726 Update Changelog with #388 (#418) 2018-09-14 09:31:11 +09:00
pentlander
8600b8ea25 Top collector (#413)
* Make TopCollector generic

Make TopCollector take a generic type instead of only being tied to
score. This will allow for sharing code between a TopCollector that
sorts results by Score and a TopCollector that sorts documents by a fast
field. This commit makes no functional changes to TopCollector.

* Add TopFieldCollector and TopScoreCollector

Create two new collectors that use the refactored TopCollector.
TopFieldCollector has the same functionality that TopCollector
originally had. TopFieldCollector allows for sorting results by a given
fast field. Closes tantivy-search/tantivy#388

* Make TopCollector private

Make TopCollector package private and export TopFieldCollector as
TopCollector to maintain backwards compatibility. Mark TopCollector
as deprecated to encourage use of the non-aliased TopFieldCollector.
Remove Collector implementation for TopCollector since it is no longer
used.
2018-09-14 09:22:17 +09:00
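The core idea of the refactor in #413, a top-K collector generic over any ordered key (a score or a fast-field value), can be sketched standalone. Illustrative only; tantivy works with `PartialOrd` scores, which this sketch sidesteps by requiring `Ord`:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

struct TopK<T: Ord> {
    limit: usize,
    heap: BinaryHeap<Reverse<T>>, // min-heap holding the K largest keys
}

impl<T: Ord> TopK<T> {
    fn new(limit: usize) -> TopK<T> {
        TopK { limit, heap: BinaryHeap::with_capacity(limit) }
    }

    fn insert(&mut self, key: T) {
        if self.heap.len() < self.limit {
            self.heap.push(Reverse(key));
        } else if self.heap.peek().map_or(false, |min| key > min.0) {
            // The new key beats the current K-th best: swap it in.
            self.heap.pop();
            self.heap.push(Reverse(key));
        }
    }

    fn into_sorted_desc(self) -> Vec<T> {
        let mut keys: Vec<T> = self.heap.into_iter().map(|r| r.0).collect();
        keys.sort_by(|a, b| b.cmp(a));
        keys
    }
}

fn main() {
    // Works for any Ord key: a (score, doc) pair or a fast-field value.
    let mut top = TopK::new(3);
    for v in [5u64, 1, 9, 3, 7] {
        top.insert(v);
    }
    assert_eq!(top.into_sorted_desc(), vec![9, 7, 5]);
}
```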
Paul Masurel
30f4f85d48 Closes #414. (#417)
Updating documentation for load_searchers.
2018-09-14 09:11:07 +09:00
Paul Masurel
82d25b8397 Fixing snippet example 2018-09-13 12:39:42 +09:00
Paul Masurel
2104c0277c Updating uuid 2018-09-13 09:13:37 +09:00
Paul Masurel
dd37e109f2 Merge branch 'issue/368b' 2018-09-11 20:16:14 +09:00
Paul Masurel
cc23194c58 Editing document 2018-09-11 20:15:38 +09:00
Paul Masurel
63868733a3 Added SnippetGenerator 2018-09-11 09:45:27 +09:00
Paul Masurel
644d8a3a10 Added snippet generator 2018-09-10 16:39:45 +09:00
Paul Masurel
e32dba1a97 Phrase weight 2018-09-10 09:26:33 +09:00
Paul Masurel
a78aa4c259 updating doc 2018-09-09 17:23:30 +09:00
Paul Masurel
7e5f697d00 Closes #387 2018-09-09 16:23:56 +09:00
Paul Masurel
a78f4cca37 Merge branch 'issue/368' into issue/368b 2018-09-09 16:04:20 +09:00
Paul Masurel
2e44f0f099 blop 2018-09-09 14:23:24 +09:00
Vignesh Sarma K
9ccba9f864 Merge branch 'master' into issue/368 2018-09-07 20:27:38 +05:30
Paul Masurel
9101bf5753 Fragments 2018-09-07 09:57:12 +09:00
Paul Masurel
23e97da9f6 Merge branch 'master' of github.com:tantivy-search/tantivy 2018-09-07 08:44:14 +09:00
Paul Masurel
1d439e96f5 Using sort unstable by key. 2018-09-07 08:43:44 +09:00
Paul Masurel
934933582e Closes #402 (#403) 2018-09-06 10:12:26 +09:00
Paul Masurel
98c7fbdc6f Issue/378 (#392)
* Added failing unit test

* Closes #378. Handling queries that end up empty after going through the analyzer.

* Fixed stop word example
2018-09-06 10:11:54 +09:00
Paul Masurel
cec9956a01 Issue/389 (#405)
* Setting up the dependency.

* Completed README
2018-09-06 10:10:40 +09:00
Paul Masurel
c64972e039 Apply Unicode lowercasing. (#408)
Checks whether the str is ASCII, and uses a fast track if so.
If not, falls back to the std's definition of a lowercase character.

Closes #406
2018-09-05 09:43:56 +09:00
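A hedged sketch of that two-track lowercasing (illustrative, not the committed code):

```rust
fn lowercase(text: &mut String) {
    if text.is_ascii() {
        // Fast track: ASCII lowercasing is in place and never changes length.
        text.make_ascii_lowercase();
    } else {
        // Slow track: char::to_lowercase may expand to several chars.
        *text = text.chars().flat_map(char::to_lowercase).collect();
    }
}

fn main() {
    let mut a = String::from("Hello WORLD");
    lowercase(&mut a);
    assert_eq!(a, "hello world");

    let mut b = String::from("GRÜSSE");
    lowercase(&mut b);
    assert_eq!(b, "grüsse");
}
```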
Paul Masurel
b3b2421e8a Issue/367 (#404)
* First stab

* Closes #367
2018-09-04 09:17:00 +09:00
Paul Masurel
f570fe37d4 small changes 2018-08-31 09:03:44 +09:00
Paul Masurel
6704ab6987 Added methods to extract the matching terms. First stab 2018-08-30 09:47:19 +09:00
Paul Masurel
a12d211330 Extracting terms matching query in the document 2018-08-30 09:23:34 +09:00
Paul Masurel
ee681a4dd1 Added say thanks badge 2018-08-29 11:06:04 +09:00
petr-tik
d15efd6635 Closes #235 - adds a new error type (#398)
error message suggests possible causes

Addressed code review 1 thread + smaller heap size
2018-08-29 08:26:59 +09:00
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
18814ba0c1 add a test for second fragment having higher score 2018-08-28 22:27:56 +05:30
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
f247935bb9 Use HighlightSection::new rather than just directly creating the object 2018-08-28 22:16:22 +05:30
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
6a197e023e ran rustfmt 2018-08-28 20:41:58 +05:30
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
96a313c6dd add more tests 2018-08-28 20:41:58 +05:30
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
fb9b1c1f41 add a test and fix the bug of not calculating first token 2018-08-28 20:41:58 +05:30
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
e1bca6db9d update calculate_score to try_add_token
`try_add_token` will now update the stop_offset as well.
`FragmentCandidate::new` now just takes `start_offset`,
it expects `try_add_token` to be called to add a token.
2018-08-28 20:41:58 +05:30
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
8438eda01a use while let instead of loop and if.
as per CR comment
2018-08-28 20:41:57 +05:30
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
b373f00840 add htmlescape and update to_html fn to use it.
tests and imports also updated.
2018-08-28 20:41:57 +05:30
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
46decdb0ea compare against accumulator rather than init value 2018-08-28 20:41:41 +05:30
Vignesh Sarma K (വിഘ്നേഷ് ശ൪മ കെ)
835cdc2fe8 Initial version of snippet
refer #368
2018-08-28 20:41:41 +05:30
Paul Masurel
19756bb7d6 Getting started on #368 2018-08-28 20:41:41 +05:30
CJP10
57e1f8ed28 Missed a closing bracket (#397) 2018-08-28 23:17:59 +09:00
Paul Masurel
2649c8a715 Issue/246 (#393)
* Moving Range and All to Leaves

* Parsing OR/AND

* Simplify user input ast

* AND and OR supported. Returning an error when mixing syntax

Closes #246

* Added support for NOT

* Updated changelog
2018-08-28 11:03:54 +09:00
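A usage sketch of the grammar after #246, assuming today's QueryParser API; the mixed-syntax error is asserted per the commit message above:

```rust
use tantivy::query::QueryParser;
use tantivy::schema::{Schema, TEXT};
use tantivy::Index;

fn main() {
    let mut schema_builder = Schema::builder();
    let body = schema_builder.add_text_field("body", TEXT);
    let index = Index::create_in_ram(schema_builder.build());
    let parser = QueryParser::for_index(&index, vec![body]);

    // AND / OR / NOT parse into a boolean query tree.
    assert!(parser.parse_query("apples AND (pears OR quince)").is_ok());
    assert!(parser.parse_query("apples AND NOT pears").is_ok());

    // Mixing +term/-term syntax with AND/OR is rejected with an error.
    assert!(parser.parse_query("+apples AND pears").is_err());
}
```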
Paul Masurel
ede97eded6 Removed use 2018-08-28 09:54:04 +09:00
Paul Masurel
4b7ff78c5a Added fundamentalss 2018-08-28 08:09:27 +09:00
Paul Masurel
948758ad78 First commit for the documentation 2018-08-27 09:49:49 +09:00
Paul Masurel
d71fa43ca3 Moving emoticon on the right side of the parenthesis 2018-08-23 08:59:11 +09:00
Paul Masurel
1e5266d4c9 Merge branch 'master' of github.com:tantivy-search/tantivy 2018-08-23 08:55:30 +09:00
Paul Masurel
537fc27231 Added bench line in features 2018-08-23 08:55:13 +09:00
Dru Sellers
af593b1116 Add default EN stopwords to the default analyzer (#381)
* Add a default list of en stopwords

* Add the default en stopword filter to the standard tokenizers

* code review feedback
2018-08-22 10:49:39 +09:00
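A wiring sketch for an explicit stop-word chain, mirroring what the default English chain now includes. The `TextAnalyzer`/`StopWordFilter` names follow a later tantivy release than this commit and should be read as illustrative:

```rust
use tantivy::tokenizer::{LowerCaser, SimpleTokenizer, StopWordFilter, TextAnalyzer};

fn main() {
    // Assumed API shape: a tokenizer wrapped by a chain of token filters.
    let stop_words = vec!["the".to_string(), "of".to_string(), "a".to_string()];
    let analyzer = TextAnalyzer::from(SimpleTokenizer)
        .filter(LowerCaser)
        .filter(StopWordFilter::remove(stop_words));

    let mut stream = analyzer.token_stream("The Tale of a Tokenizer");
    while stream.advance() {
        // "the", "of" and "a" are filtered out before this point.
        println!("token: {}", stream.token().text);
    }
}
```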
Paul Masurel
3d73c0c240 Update issue templates 2018-08-21 10:59:08 +09:00
Paul Masurel
3a8e524f77 Added example to show how to access the inverted list directly 2018-08-21 09:36:13 +09:00
Paul Masurel
c0641c2b47 Remove generate html script. It moved to tantivy-search.github.io 2018-08-21 08:26:46 +09:00
Dru Sellers
ef3a16a129 Switch from error-chain to failure crate (#376)
* Switch from error-chain to failure crate

* Added deprecated alias for

* Started editing the changelog
2018-08-20 09:40:45 +09:00
Paul Masurel
a0a284fe91 Added a full-fledged empty query and relying on it in QueryParser, instead of using an empty clause. 2018-08-20 09:21:32 +09:00
dependabot[bot]
0feeef2684 Update owning_ref requirement from 0.3 to 0.4 (#379)
Updates the requirements on [owning_ref](https://github.com/Kimundi/owning-ref-rs) to permit the latest version.
- [Release notes](https://github.com/Kimundi/owning-ref-rs/releases)
- [Commits](https://github.com/Kimundi/owning-ref-rs/commits)

Signed-off-by: dependabot[bot] <support@dependabot.com>
2018-08-20 09:08:11 +09:00
Dru Sellers
cc50bdb06a Add a basic faceted search example (#383)
* Add a basic faceted search example

* quieting the compiler
2018-08-19 08:07:54 +09:00
Paul Masurel
23c2c3ae7c Building all examples on appveyor + running them on travis 2018-08-17 13:24:37 +09:00
Dru Sellers
674524ba91 Add an example of using the stopwords filter (#377) 2018-08-17 12:52:21 +09:00
Paul Masurel
60a9a7f837 Added example showing how to delete/update documents 2018-08-17 09:43:55 +09:00
Paul Masurel
5b5c706581 Simplified examples 2018-08-16 22:38:39 +09:00
Paul Masurel
3e14a76623 Update regex_query.rs 2018-08-15 16:38:32 +09:00
Paul Masurel
8cde1c81e5 Update README.md 2018-08-13 18:03:30 +09:00
Paul Masurel
8d0a29b137 Added sourcerer wall of fame 2018-08-13 18:02:49 +09:00
Paul Masurel
cbfb2fe19d Avoid building twice when doing code coverage 2018-08-13 10:38:01 +09:00
Vignesh Sarma K
09e00f1d42 add position_length to Token (#337)
* add position_length to Token

refer #291

* Add term offset to `PhraseQuery`

ref #291

* Add new constructor for `PhraseQuery` that allows custom offset

* fix the method name as per pr comment

* Closes #291

Added unit test.
Using offsets from the analyzer in QueryParser.
2018-08-13 10:14:50 +09:00
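A hedged sketch of the offset-aware constructor from #337; the name `new_with_offset` matches later tantivy releases and is an assumption for this era:

```rust
use tantivy::query::PhraseQuery;
use tantivy::schema::{Schema, TEXT};
use tantivy::Term;

fn main() {
    let mut schema_builder = Schema::builder();
    let body = schema_builder.add_text_field("body", TEXT);
    let _schema = schema_builder.build();

    // "new" at position 0, "york" at position 2: the hole at position 1
    // models a token the analyzer removed (or expanded) in between.
    let query = PhraseQuery::new_with_offset(vec![
        (0, Term::from_field_text(body, "new")),
        (2, Term::from_field_text(body, "york")),
    ]);
    let _ = query;
}
```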
Paul Masurel
290620fdee Added slashes 2018-08-13 09:13:01 +09:00
petr-tik
f0d1b85bd8 N370 pr fix num searchers (#371)
* Change ordering to Acquire

* set_num_searchers now uses AtomicUsize.store
2018-08-13 08:56:30 +09:00
petr-tik
aaef546f91 Moved NUM_SEARCHERS into a local variable (#369)
* Moved NUM_SEARCHERS into a local variable

dynamically determined as the number of available CPUs.

var name in lowercase (not a constant anymore).

updated it in the docstring

* lowercased the varnames

* User can set number of logical cores in create_from_metas

* cargo fmt

* Num_searchers as Arc<AtomicUsize>

Retrieving the value with Relaxed ordering

Reverted create_from_metas signature. However, it calls num_cpus and
sets the Arc val
2018-08-12 20:08:14 +09:00
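The resulting pattern, sketched standalone. The log mentions Acquire on reads and a plain store on writes; pairing the store with Release is an assumption here:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

fn main() {
    // Default to the number of logical cores (num_cpus::get() in the real
    // code; hardcoded here to keep the sketch dependency-free).
    let num_searchers = Arc::new(AtomicUsize::new(4));

    // set_num_searchers stores the new value...
    num_searchers.store(8, Ordering::Release);

    // ...and readers pick it up with a matching Acquire load.
    assert_eq!(num_searchers.load(Ordering::Acquire), 8);
}
```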
Paul Masurel
811ddf2226 Closes #364 (#365)
* Closes #364

* Trying to raise the recursion limit

* Better unit test and bug fix on token offsets
2018-08-08 11:15:20 +09:00
Paul Masurel
79a339d353 Removing env_logger dependency 2018-08-02 19:29:09 +09:00
Paul Masurel
e45e4c79d9 update crossbeam 2018-08-02 19:24:08 +09:00
Paul Masurel
848bf41bc9 Updating rand to 0.5 (#363) 2018-08-02 19:19:04 +09:00
Paul Masurel
d11cb087a7 Updated to combine-0.3 (#362) 2018-08-02 18:29:58 +09:00
Jacob Brown
2dd7422f42 replace chan with crossbeam-channel (#361)
* replace chan with crossbeam-channel

* Update Cargo.toml
2018-08-02 12:47:22 +09:00
Paul Masurel
e8707c02c0 Issue/333 (#335)
* Add skip information for posting lists (skip to doc ids)
* Separate num bits from data for positions (skip n positions)
* Address into the positions using an n-position offset
* Added a long skip structure to allow efficient opening of the positions for a given term.
2018-07-31 10:51:53 +09:00
dependabot[bot]
55928d756a Update rust-stemmers requirement to 1.0.2 (#350)
* Update rust-stemmers requirement to 1.0.2

Updates the requirements on [rust-stemmers](https://github.com/CurrySoftware/rust-stemmers) to permit the latest version.
- [Release notes](https://github.com/CurrySoftware/rust-stemmers/releases)
- [Commits](https://github.com/CurrySoftware/rust-stemmers/commits)

Signed-off-by: dependabot[bot] <support@dependabot.com>

* Update Cargo.toml
2018-07-31 09:32:57 +09:00
dependabot[bot]
a4370bca64 Update owned-read requirement to 0.4 (#352)
Updates the requirements on [owned-read](https://github.com/tantivy-search/owned-read) to permit the latest version.
- [Release notes](https://github.com/tantivy-search/owned-read/releases)
- [Commits](https://github.com/tantivy-search/owned-read/commits)

Signed-off-by: dependabot[bot] <support@dependabot.com>
2018-07-31 09:32:01 +09:00
dependabot[bot]
5a5c5a8ca5 Update bit-set requirement to 0.5.0 (#351)
* Update bit-set requirement to 0.5.0

Updates the requirements on [bit-set](https://github.com/contain-rs/bit-set) to permit the latest version.
- [Release notes](https://github.com/contain-rs/bit-set/releases)
- [Commits](https://github.com/contain-rs/bit-set/commits)

Signed-off-by: dependabot[bot] <support@dependabot.com>

* Update Cargo.toml

* Update Cargo.toml
2018-07-31 09:31:41 +09:00
dependabot[bot]
1b470dd474 Update log requirement to 0.4.3 (#353)
* Update log requirement to 0.4.3

Updates the requirements on [log](https://github.com/rust-lang/log) to permit the latest version.
- [Release notes](https://github.com/rust-lang/log/releases)
- [Changelog](https://github.com/rust-lang-nursery/log/blob/master/CHANGELOG.md)
- [Commits](https://github.com/rust-lang/log/commits/env_logger-0.4.3)

Signed-off-by: dependabot[bot] <support@dependabot.com>

* Update Cargo.toml
2018-07-31 09:31:19 +09:00
Paul Masurel
52b4575245 Issue/355 (#358)
* issue with top_k sorting (#356)

* Closes #355
2018-07-31 08:24:55 +09:00
dependabot[bot]
ddd2d5b04c Update lazy_static requirement to 1.0.2 (#349)
* Update lazy_static requirement to 1.0.2

Updates the requirements on [lazy_static](https://github.com/rust-lang-nursery/lazy-static.rs) to permit the latest version.
- [Release notes](https://github.com/rust-lang-nursery/lazy-static.rs/releases)
- [Commits](https://github.com/rust-lang-nursery/lazy-static.rs/commits/v1.0.2)

Signed-off-by: dependabot[bot] <support@dependabot.com>

* Update Cargo.toml
2018-07-30 12:34:06 +09:00
dependabot[bot]
fa22b4041a Update itertools requirement to 0.7.8 (#346)
* Update itertools requirement to 0.7.8

Updates the requirements on [itertools](https://github.com/bluss/rust-itertools) to permit the latest version.
- [Release notes](https://github.com/bluss/rust-itertools/releases)
- [Commits](https://github.com/bluss/rust-itertools/commits/0.7.8)

Signed-off-by: dependabot[bot] <support@dependabot.com>

* Update Cargo.toml
2018-07-30 11:32:12 +09:00
dependabot[bot]
8faee143fa Update regex requirement to 1.0 (#347)
Updates the requirements on [regex](https://github.com/rust-lang/regex) to permit the latest version.
- [Release notes](https://github.com/rust-lang/regex/releases)
- [Changelog](https://github.com/rust-lang/regex/blob/master/CHANGELOG.md)
- [Commits](https://github.com/rust-lang/regex/commits/1.0.2)

Signed-off-by: dependabot[bot] <support@dependabot.com>
2018-07-30 09:59:19 +09:00
dependabot[bot]
366ce98f08 Update tempfile requirement to 3.0 (#348)
Updates the requirements on [tempfile](https://github.com/Stebalien/tempfile) to permit the latest version.
- [Release notes](https://github.com/Stebalien/tempfile/releases)
- [Changelog](https://github.com/Stebalien/tempfile/blob/master/NEWS)
- [Commits](https://github.com/Stebalien/tempfile/commits/v3.0.3)

Signed-off-by: dependabot[bot] <support@dependabot.com>
2018-07-30 09:58:56 +09:00
Paul Masurel
190e60a41c Closes #339. (#340)
As required by the FacetCollector,
facet values need to be sorted before being encoded in the
multivalued field.
2018-07-25 18:21:48 +09:00
Vignesh Sarma K
b9558801a1 Declare and implement separate Clone Traits (#336)
For traits, `Directory` and `MergePolicy`.

refer #306
2018-07-18 12:36:43 +09:00
Paul Masurel
36728215ac Using the codecov badge 2018-07-10 21:19:59 +09:00
Paul Masurel
39551a0418 fix travis 2018-07-10 13:08:22 +09:00
Paul Masurel
39b98b2e76 fix travis 2018-07-10 13:07:15 +09:00
Paul Masurel
616162400d Add missing space 2018-07-10 12:49:32 +09:00
Paul Masurel
694d164db6 fix travis.yml 2018-07-10 09:39:39 +09:00
Paul Masurel
ef442cefb1 codecov 2018-07-10 09:38:59 +09:00
Paul Masurel
14da241f35 Readded cov 2018-07-10 09:25:24 +09:00
Paul Masurel
346a9e4287 Set dev version 2018-07-10 09:20:21 +09:00
Paul Masurel
31655e92d7 Preparing release 0.6.1 2018-07-10 09:12:26 +09:00
Paul Masurel
6b8d76685a Tiny refactoring 2018-07-05 09:11:55 +09:00
Paul Masurel
ce5683fc6a Removed useless counting_writer 2018-07-04 16:13:19 +09:00
Paul Masurel
5205579db6 Merge branch 'master' of github.com:tantivy-search/tantivy 2018-07-04 16:09:59 +09:00
Paul Masurel
d056ae60dc Removed SourceRead. Relying on the new owned-read crate instead (#332) 2018-07-04 16:08:52 +09:00
Paul Masurel
af9280c95f Removed SourceRead. Relying on the new owned-read crate instead 2018-07-04 12:47:25 +09:00
David Hewson
2e538ce6e6 remove extra space in name (#331)
The extra space that appeared in the name breaks using the package.
2018-07-02 05:32:19 +09:00
Jason Wolfe
00466d2b08 #328: Support parsing unbounded range queries (#329)
* #328: Support parsing unbounded range queries. Update CHANGELOG.md for query parser changes.

* Set version to 0.7-dev
2018-06-30 13:24:02 +09:00
Paul Masurel
8ebbf6b336 Issue/325 (#330)
* Introducing a SegmentMeta inventory.
* Depending on census=0.1
* Cargo fmt
2018-06-30 13:11:41 +09:00
Paul Masurel
1ce36bb211 Merge branch 'master' of github.com:tantivy-search/tantivy 2018-06-27 16:58:47 +09:00
Jason Wolfe
2ac43bf21b Support parsing RangeQuery and AllQuery in Queryparser (#323)
* (#321) Add support for range query parsing to grammar / parser. Still needs to be wired through the rest of the way.

* (321) Finish wiring RangeQuery parsing through

* (#321) Add logical AST query parser tests for RangeQuery

* (#321) Support parsing AllQuery

* (#321) Update documentation of QueryParser

* (#321) Support negative numbers in range query parsing
2018-06-25 08:29:47 +09:00
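A usage sketch of the grammar after #321/#328, assuming today's parser accepts Lucene-style bounds with `*` for an open end:

```rust
use tantivy::query::QueryParser;
use tantivy::schema::{Schema, INDEXED};
use tantivy::Index;

fn main() {
    let mut schema_builder = Schema::builder();
    schema_builder.add_u64_field("weight", INDEXED);
    let index = Index::create_in_ram(schema_builder.build());
    let parser = QueryParser::for_index(&index, vec![]);

    assert!(parser.parse_query("weight:[10 TO 20]").is_ok()); // inclusive bounds
    assert!(parser.parse_query("weight:{10 TO 20}").is_ok()); // exclusive bounds
    assert!(parser.parse_query("weight:[10 TO *]").is_ok());  // unbounded above
    assert!(parser.parse_query("*").is_ok());                 // AllQuery
}
```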
Paul Masurel
3fd8c2aa5a Removed one keyword 2018-06-22 14:47:21 +09:00
Paul Masurel
c1022e23d2 Switching to stable rust in AppVeyor. 2018-06-22 14:33:42 +09:00
Paul Masurel
8ccbfdea5d Preparing for release 2018-06-22 14:27:46 +09:00
Paul Masurel
badfce3a23 Preparing for release. 2018-06-22 14:09:14 +09:00
Dru Sellers
e301e0bc87 Add some simple doc tests (#320)
* Add TopCollector doc test

* Add CountCollector Doc Test

* Add Doc Test for MultiCollector

* Add ChainedCollector Doc Test

* Expose Fuzzy Query where it should be

* Add FuzzyTermQuery Doc Test

* Expose RegexQuery

* Regex Query Doc Test

* Add TermQuery Doc Test

* Add doc comments

* fix test 🤦

* Added explanation about the complexity variables

* Fixing unit tests

* Single threads if you check docids
2018-06-19 10:45:20 +09:00
Dru Sellers
317baf4e75 Add in simple regex query support (#319)
* Add fst_regex crate in

* Reduce API surface area

This doesn't need to be public

* better test name

* Pull Automaton weight out so it can be shared

* Implement Regex Query
2018-06-16 14:08:30 +09:00
Paul Masurel
24398d94e4 Exposing the 2018-06-15 21:40:57 +09:00
Dru Sellers
360f4132eb Standardizes the Index::open_* APIs (#318)
* Relocate `from_directory` closer to its usage

* Specific methods come before the generic method

* Rename open methods to follow the lead of the create methods
2018-06-15 12:16:41 +09:00
Dru Sellers
2b8f02764b Standardizes the Index::create_* APIs (#317)
* Pull all creation methods next to each other

The goal here is to make it clear which methods are performing the
same function, and to assist with standardizing the API calls.

* Make `from_directory` private

This seems to be an internal function, so let's make it internal.

* Rename `create` to `create_in_dir`

This lets the name match the `create_in_ram` pattern and opens up
`create` for the generic implementation.

* Implement the generic create function

All of the create methods now delegate to the common create function
and future `create_in_*` functions now have a clear pattern
to follow as well
2018-06-14 11:08:42 +09:00
Paul Masurel
0465876854 Issue/257 (#310)
* Replaced lz4 by a pure rust implementation of snappy.

Closes #257

* snappy is the default compression. One can use lz4 by enabling the lz4 feature flag.

* Removed Compression trait
2018-06-12 19:02:57 +09:00
Dru Sellers
6f7b099370 Add AutomatonWeight to a fuzzy_search module and FuzzyQuery (#300)
* Add AutomatonWeight to a fuzzy_search module

* Hacking around ownership issues

* Working through lifetime issues

* Working through tests

* fix test by lower casing the words (reducing distance)

* code review changes

* Suggestion on how to solve the borrow problem

* clean up
2018-06-11 22:23:03 +09:00
Paul Masurel
84f5cc4388 Added an AUTHORS file. Closes #315 (#316) 2018-06-11 22:21:58 +09:00
Paul Masurel
75aae0d2c2 Update README 2018-06-08 13:05:57 +09:00
Paul Masurel
009a3559be atomicwrites 2.2.0 for ARM compilation 2018-06-06 07:13:09 +09:00
Paul Masurel
7a31669e9d Disabling ARM targets 2018-06-05 12:22:00 +09:00
Paul Masurel
5185eb790b Reduced heap usage in unit test 2018-06-05 10:02:10 +09:00
Paul Masurel
a3dffbf1c6 Added more ARM targets. 2018-06-05 09:06:33 +09:00
Paul Masurel
857a5794d8 Updated nix version 2018-06-05 09:02:40 +09:00
Paul Masurel
b0a6fc1448 Reduce RAM usage 2018-06-04 11:20:24 +09:00
Paul Masurel
989d52bea4 Updated atomicwrites version. 2018-06-04 10:00:21 +09:00
Paul Masurel
09661ea7ec Added cross testing on different platforms 2018-06-04 09:47:53 +09:00
Paul Masurel
b59132966f Better heap (#311)
* Changed the heap to a paged memory arena.
* Trying to simplify the indexing term hashmap
* Exploding datastruct
* Removed some complexity in bitpacker
2018-06-04 09:39:18 +09:00
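A hedged sketch of what "paged memory arena" means here: 32-bit addresses split into (page, offset), so the arena grows one page at a time and previously returned addresses stay valid across growth. Page size and layout are assumptions:

```rust
const PAGE_BITS: usize = 16;          // 64 KiB pages (an assumption)
const PAGE_SIZE: usize = 1 << PAGE_BITS;

struct Arena {
    pages: Vec<Box<[u8; PAGE_SIZE]>>,
    len_in_last_page: usize,
}

impl Arena {
    fn new() -> Arena {
        Arena { pages: vec![Box::new([0u8; PAGE_SIZE])], len_in_last_page: 0 }
    }

    /// Returns a stable u32 address: high bits = page id, low bits = offset.
    fn allocate(&mut self, len: usize) -> u32 {
        assert!(len <= PAGE_SIZE);
        if self.len_in_last_page + len > PAGE_SIZE {
            // Adding a page never moves the existing pages' bytes.
            self.pages.push(Box::new([0u8; PAGE_SIZE]));
            self.len_in_last_page = 0;
        }
        let addr = (((self.pages.len() - 1) << PAGE_BITS) | self.len_in_last_page) as u32;
        self.len_in_last_page += len;
        addr
    }

    fn slice_mut(&mut self, addr: u32, len: usize) -> &mut [u8] {
        let page = (addr as usize) >> PAGE_BITS;
        let offset = (addr as usize) & (PAGE_SIZE - 1);
        &mut self.pages[page][offset..offset + len]
    }
}

fn main() {
    let mut arena = Arena::new();
    let addr = arena.allocate(5);
    arena.slice_mut(addr, 5).copy_from_slice(b"hello");
    assert_eq!(&arena.pages[0][..5], b"hello".as_slice());
}
```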
Paul Masurel
863d3411bc Update Cargo.toml 2018-05-31 15:54:34 +09:00
Paul Masurel
8a55d133ab Showing Appveyor CI badge for the master branch
.. before the last build was shown.
2018-05-28 13:44:53 +09:00
Jason Wolfe
432d49d814 Expose parameters of RangeQuery for external usage (#309) 2018-05-19 14:29:25 +09:00
Jason Wolfe
0cea706f10 Add docs to new Query methods (#307) 2018-05-18 13:53:29 +09:00
Paul Masurel
71d41ca209 Added Google to the license 2018-05-18 10:13:23 +09:00
Paul Masurel
bc69dab822 cargo fmt 2018-05-18 10:08:05 +09:00
Jason Wolfe
72acad0921 Add box_clone() and downcast::Any to Query (#303) 2018-05-18 09:53:11 +09:00
Paul Masurel
c9459f74e8 Update docs about TermDict. 2018-05-18 09:20:39 +09:00
Dru Sellers
08d2cc6c7b Make it possible to stream the terms matching an Automaton (#297)
* rustfmt and some English grammar

* sort cargo.toml crates

* WIP: something to show

* Remove example for now

* Implement desired method

* Resolving Generic Type Arguments

* Resolve Generic Types

* Banging around on the tests

* DANGER! Change unsafe usage based on compiler warnings

* Unscrew up my rebase

* Clean Up Type Spam

Default Types FTW

* typo

* better variable names

* Remove Duplicate Levenshtein crate
2018-05-11 12:41:14 -07:00
Dru Sellers
82d87416c2 Implement StopWords Filter (#292)
* Implement StopWords Filter

- added example doctest for alphanum_only.rs so that I could
drive my own test of the stopword filter

* Style Cop

* Switch HashSet Hasher to FNV for speed

* Update Change Log

* fix missed location renaming
2018-05-09 18:40:41 -07:00
Paul Masurel
96b2c2971e Testing actual doc ids in unit test 2018-05-09 09:14:22 -07:00
Dru Sellers
162afd73f6 Alive docs iterator (#293)
* Add non-deleted DocId iterator to SegmentReader

Closes #287

* Add Todo

* Add Unit Test

* Improving test based on feedback

- found bug and fixed it. :)

* Reestablish changes post rebase for clean merge
2018-05-09 09:03:27 -07:00
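A hedged end-to-end sketch of the iterator added in #293; `doc_ids_alive` is the accessor name in later tantivy releases and is assumed here:

```rust
use tantivy::schema::{Schema, TEXT};
use tantivy::{doc, Index};

fn main() -> tantivy::Result<()> {
    let mut schema_builder = Schema::builder();
    let body = schema_builder.add_text_field("body", TEXT);
    let index = Index::create_in_ram(schema_builder.build());

    let mut writer = index.writer(50_000_000)?;
    writer.add_document(doc!(body => "hello"));
    writer.add_document(doc!(body => "world"));
    writer.commit()?;

    let searcher = index.reader()?.searcher();
    for segment_reader in searcher.segment_readers() {
        // Yields only DocIds not flagged in the segment's delete bitset.
        for doc_id in segment_reader.doc_ids_alive() {
            println!("alive: {}", doc_id);
        }
    }
    Ok(())
}
```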
Paul Masurel
ddfd87fa59 Merge branch 'master' of github.com:tantivy-search/tantivy 2018-05-08 00:08:17 -07:00
Paul Masurel
24050d0eb5 Remove some unsafe stuff, justified some of it. 2018-05-07 23:57:53 -07:00
Jason Wolfe
89eb209ece #294: Make fieldnorm module public, add documentation (#295) 2018-05-07 20:20:38 -07:00
Paul Masurel
9a0b7f9855 Rustfmt 2018-05-07 19:50:35 -07:00
Jason Wolfe
8e343b1ca3 Add fast field for associating arbitrary bytes to a document (#275)
* Add fast field for associating arbitrary bytes to a document

* Fix unused macro_use warning

* Improvements from code review

* Make BytesFastFieldWriter public

* Fix json parsing validation failure

* Add bytes fast field to CHANGELOG.md

* Fix compile errors from merge

* Support merging

* Address misc code review comments

* Fix comments from CR
2018-05-07 19:30:31 -07:00
Paul Masurel
99c0b84036 Integrating #274, #280, #289 into master (#290)
* Integrating bugfixes into master

Closes #274
Closes #280
Closes #289

* Next version will be 0.6
2018-05-06 09:48:25 -07:00
Dru Sellers
ca74c14647 Simple Implementation of NGram Tokenizer (#278)
* Simple Implementation of NGram Tokenizer

It does not yet support edges
It could probably be better in many "rusty" ways
But the test is passing, so I'll call this a good stopping point for
the day.

* Remove Ngram from manager. Too many variations

* Basic configuration model

Should the extensive tests exist here?

* Add Sample to provide an End to End testing

* Basic Edgegram support

* cleanup

* code feedback

* More code review feedback processed
2018-05-06 09:47:49 -07:00
Dru Sellers
68ee18e4e8 Add Index::open_directory function (#285)
* Add Index::open_directory function

* dry
2018-05-03 00:07:46 -07:00
Paul Masurel
5637657c2f Removed ptr dereference for explicit ptr::read_unaligned 2018-04-25 19:15:32 +09:00
Paul Masurel
2e3c9a8878 Bugfix in murmurhash. 2018-04-25 19:06:31 +09:00
Paul Masurel
78673172d0 Cargo fmt 2018-04-21 20:05:36 +09:00
Paul Masurel
175b76f119 Removed streamdict
Closes #271
2018-04-21 19:55:41 +09:00
Paul Masurel
9b79e21bd7 Returning error when schema is not valid for a given query. 2018-04-19 13:02:30 +09:00
Paul Masurel
5e38ae336f Bump tantivy version and readded win deps 2018-04-17 18:27:57 +09:00
Paul Masurel
8604351f59 Hide some of the API
Added some doc.
2018-04-17 13:31:22 +09:00
Paul Masurel
6a48953d8a Closes #266 (#268)
PhraseQuery panics with a nice error message when the underlying field does not have any positions.
The `QueryParser` fails as well with a dedicated error.
2018-04-17 10:03:15 +09:00
pmasurel
0804b42afa Checking the type of range queries 2018-04-16 14:01:10 +09:00
Paul Masurel
8083bc6eef bench working 2018-04-15 12:25:38 +09:00
Paul Masurel
0156f88265 Compiles in stable rust 2018-04-15 11:03:44 +09:00
Paul Masurel
a1c07bf457 Added iterator for facet collector 2018-04-14 20:22:02 +09:00
Paul Masurel
9de74b68d1 Remove range argument 2018-04-13 18:34:23 +09:00
Paul Masurel
57c7073867 Removed 2018-04-13 09:43:36 +09:00
Paul Masurel
121374b89b Removed the need for AtomicU64 2018-04-12 22:08:15 +09:00
Paul Masurel
e44782bf14 No more 2018-04-12 13:01:11 +09:00
Paul Masurel
dfafb24fa6 Bumped bitpacker's version 2018-04-10 21:21:47 +09:00
jason-wolfe
4c6f9541e9 #263: Make MultiValueIntFastFieldWriter public, expose via FastFieldsWriter (#264) 2018-04-10 12:27:34 +09:00
Paul Masurel
743ae102f1 Using bitpacker@3 2018-04-10 10:05:42 +09:00
Paul Masurel
0107fe886b Removed timer 2018-03-31 15:40:16 +09:00
Paul Masurel
1d9566e73c Making mmap a feature 2018-03-31 13:23:43 +09:00
Paul Masurel
8006f1df11 Added comments 2018-03-28 08:28:49 +09:00
Paul Masurel
ffa03bad71 TermScorer does not handle deletes 2018-03-27 17:35:20 +09:00
Paul Masurel
98cf4ba63a Small refactor of postings's skip method 2018-03-27 16:14:28 +09:00
Paul Masurel
4d65771e04 field norm reader is not an option anymore. 2018-03-26 13:25:29 +09:00
Paul Masurel
9712a75399 Added unit test for intersection score 2018-03-25 12:58:24 +09:00
Paul Masurel
3ae03b91ae PhraseScorer's score aligned with that of Lucene. 2018-03-25 12:44:16 +09:00
Paul Masurel
238b02ce7d Bugfixed 2018-03-23 18:50:57 +09:00
Paul Masurel
3091459777 Fixed main bug. Unit test still not passing because of altered scoring 2018-03-23 13:52:10 +09:00
Paul Masurel
b7f8884246 Closes #245 = BM25. (#260)
* Closes #245 = BM25.

Scores are the same as Lucene.

* Fixing travis conf
2018-03-22 15:06:56 +09:00
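For reference, a standalone rendition of the Lucene-flavored BM25 this commit aligns with; `k1 = 1.2` and `b = 0.75` are Lucene's defaults and an assumption here:

```rust
// tf: term frequency in the doc; doc_len / avg_doc_len: field lengths;
// doc_count: docs in the index; doc_freq: docs containing the term.
fn bm25(tf: f32, doc_len: f32, avg_doc_len: f32, doc_count: f32, doc_freq: f32) -> f32 {
    let (k1, b) = (1.2f32, 0.75f32);
    let idf = (1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5)).ln();
    idf * tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * doc_len / avg_doc_len))
}

fn main() {
    // One term occurring twice in an average-length doc, in a 1000-doc index.
    let score = bm25(2.0, 100.0, 100.0, 1000.0, 10.0);
    assert!(score > 0.0);
    println!("bm25 = {:.4}", score);
}
```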
Paul Masurel
e22f767fda Backmerge 2018-03-21 21:18:46 +09:00
Paul Masurel
3ecfc36e53 Total field norm fixed. 2018-03-21 20:43:02 +09:00
Paul Masurel
1c9450174e Fieldnorm reader working except merge 2018-03-21 17:36:16 +09:00
Paul Masurel
cde4c391cd Added fieldnorm module 2018-03-21 15:41:46 +09:00
Paul Masurel
6d47634616 Added unit tests 2018-03-20 12:11:28 +09:00
Paul Masurel
39b182c24b Simplified phrase queries. Reading several time is ok. 2018-03-20 11:47:48 +09:00
Paul Masurel
baaae3f4ec Making it possible to read positions twice 2018-03-20 11:36:22 +09:00
Paul Masurel
63064601a7 Readded test for reading positions twice 2018-03-20 10:04:36 +09:00
Paul Masurel
07a8023a3a Added 2018-03-19 14:36:43 +09:00
Paul Masurel
59639cd311 In sync with master. Fixed merging 2018-03-19 12:58:42 +09:00
Paul Masurel
b0e5e1f61d Back merged master 2018-03-19 12:19:08 +09:00
Paul Masurel
234a902470 Removed cc from Cargo.toml 2018-03-19 12:09:25 +09:00
Paul Masurel
75d130f1ce Edited CHANGELOG 2018-03-19 12:01:48 +09:00
Paul Masurel
410187dd24 Removed .vimrc 2018-03-19 11:54:10 +09:00
Paul Masurel
88303d4833 Removed script directory 2018-03-19 11:53:15 +09:00
Paul Masurel
a26b0ff4a2 Removed exclude cpp from travis configuration 2018-03-19 11:51:41 +09:00
Paul Masurel
d4ed86f13a Issue/255 (#256)
* Remove cpp compression.

* Pointing to publish bitpacking

* Edited README
2018-03-19 11:48:40 +09:00
Paul Masurel
fc8902353c fieldnorm encoding. test broken 2018-03-10 18:35:16 +09:00
Paul Masurel
a2ee988304 Small change in pop_lowest. 2018-03-10 15:32:30 +09:00
Paul Masurel
97b7984200 Updated CHANGELOG 2018-03-10 14:08:11 +09:00
Paul Masurel
8683718159 Version bump 2018-03-10 14:01:30 +09:00
Paul Masurel
0cf274135b Clippy 2018-03-10 13:07:18 +09:00
Paul Masurel
a3b44773bb Bugfix and rustfmt 2018-03-10 12:21:50 +09:00
Paul Masurel
ec7c582109 NOBUG no-simd compression fix 2018-03-09 14:19:58 +09:00
Ewan Higgs
ee7ab72fb1 Support trailing commas using ',+ ,' trick from Blandy 2017. (#250) 2018-02-27 10:33:39 +09:00
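The trick (from Blandy's Programming Rust, 2017): add a rule matching the same list plus one trailing comma that simply recurses into the comma-free rule. Shown below on a toy macro rather than tantivy's own:

```rust
macro_rules! multiset {
    // Eat a trailing comma by recursing into the rule below.
    ($($item:expr),+ ,) => { multiset!($($item),+) };
    ($($item:expr),+) => {{
        let mut v = Vec::new();
        $( v.push($item); )+
        v
    }};
}

fn main() {
    let a = multiset![1, 2, 3];
    let b = multiset![1, 2, 3,]; // trailing comma now accepted
    assert_eq!(a, b);
}
```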
Paul Masurel
2c20759829 removed unsafecell for position computer 2018-02-24 12:07:55 +09:00
Paul Masurel
23387b0ed0 Positions writes to an external Vec 2018-02-24 11:14:45 +09:00
Dylan DPC
e82859f2e6 Update Cargo.toml (#249) 2018-02-24 09:17:33 +09:00
Paul Masurel
be830b03c5 Bugfix in intersection.advance and impl skip_next 2018-02-23 11:55:23 +09:00
Paul Masurel
1b94a3e382 Phrase query optimisation 2018-02-23 00:00:22 +09:00
Paul Masurel
c3fbc4c8fa Simplified a notch TinySet::pop_lowest() 2018-02-22 10:43:06 +09:00
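TinySet is a u64-backed bitset, so pop_lowest reduces to trailing_zeros plus clearing the lowest set bit. An illustrative re-derivation, not the committed code:

```rust
struct TinySet(u64);

impl TinySet {
    fn pop_lowest(&mut self) -> Option<u32> {
        if self.0 == 0 {
            None
        } else {
            let lowest = self.0.trailing_zeros();
            self.0 &= self.0 - 1; // clear the lowest set bit
            Some(lowest)
        }
    }
}

fn main() {
    let mut set = TinySet(0b1010_0100); // bits 2, 5 and 7 are set
    assert_eq!(set.pop_lowest(), Some(2));
    assert_eq!(set.pop_lowest(), Some(5));
    assert_eq!(set.pop_lowest(), Some(7));
    assert_eq!(set.pop_lowest(), None);
}
```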
Paul Masurel
4ee2db25a0 Generic on Postings rather than deletes in TermScorer 2018-02-22 08:26:45 +09:00
Paul Masurel
e423784fd0 Added specialized SegmentPostings when there are no DeleteSet 2018-02-21 23:49:20 +09:00
Paul Masurel
fdb9c3c516 Tantivy version 0.5.0 2018-02-21 11:38:26 +09:00
Paul Masurel
6fb114224a Added unit test 2018-02-21 00:13:04 +09:00
Paul Masurel
2c3e33895a Added unit tests 2018-02-21 00:03:41 +09:00
Paul Masurel
d512b53688 Added handling of parenthesis in query parser 2018-02-20 23:18:02 +09:00
Paul Masurel
c8afd2b55d Added unit tests 2018-02-20 17:05:33 +09:00
Paul Masurel
3fd6d7125b Added unit test 2018-02-20 13:12:05 +09:00
Paul Masurel
de6a3987a9 Ignoring functional test 2018-02-20 12:58:06 +09:00
Paul Masurel
3dedc465fa Merge branch 'feature/multivalued-i64-u64' 2018-02-20 12:54:18 +09:00
Paul Masurel
f16cc6367e Refactoring of fastfields 2018-02-20 12:52:30 +09:00
Paul Masurel
4026fc5fb1 Removed redundant compressed_block_size function 2018-02-20 08:28:28 +09:00
Paul Masurel
43742a93ef Multivalue u64 field / i64 field. 2018-02-20 00:16:20 +09:00
Paul Masurel
2a843d86cb Code cleaning 2018-02-19 21:51:39 +09:00
Paul Masurel
9a706c296a Larger union horizon 2018-02-19 21:50:33 +09:00
Paul Masurel
5ff8123b7a Code cleaning 2018-02-19 15:41:19 +09:00
Paul Masurel
6061158506 Added long running test to travis conf 2018-02-19 13:23:04 +09:00
Paul Masurel
4e8b0e89d9 Added unit test 2018-02-19 13:19:18 +09:00
Paul Masurel
0540ebb49e Cargo clippy 2018-02-19 12:36:24 +09:00
Paul Masurel
ef94582203 Rustfmt 2018-02-19 12:12:10 +09:00
Paul Masurel
2f242d5f52 Moving docset around 2018-02-19 12:07:05 +09:00
Paul Masurel
da3d372e6e Faster union counts 2018-02-19 10:17:16 +09:00
Paul Masurel
42fd3fe5c7 Bugfix on TermWeight::count() 2018-02-18 10:59:18 +09:00
Paul Masurel
5dae6e6bbc Downcast TermScorer for intersection when all legs are TermScorers 2018-02-18 10:28:43 +09:00
Paul Masurel
e608e0a1df Removed half baked usage of Any 2018-02-18 10:01:14 +09:00
Paul Masurel
6c8c90d348 Removed lifetime from scorer 2018-02-18 09:12:40 +09:00
Paul Masurel
eb50e92ec4 Removed specialized postings on SegmentPostings 2018-02-18 00:09:15 +09:00
Paul Masurel
20bede9462 Bugfix when requesting no termfreq. 2018-02-17 22:41:12 +09:00
Paul Masurel
4640ab4e65 Merge branch 'master' into issue/query-perf 2018-02-17 17:31:51 +09:00
Paul Masurel
cd51ed0f9f Added comments 2018-02-17 16:59:28 +09:00
Paul Masurel
6676fe5717 Added a count method 2018-02-17 15:02:51 +09:00
Paul Masurel
292bb17346 Disable scoring
- Disabling scoring is an argument of the `.weight()` method
- Collectors declare whether they need scoring
2018-02-17 12:43:16 +09:00
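The control flow described by those two bullet points, sketched with illustrative traits (tantivy's real signatures differ):

```rust
trait Weight {}

trait Query {
    // Scoring can be switched off at Weight construction time.
    fn weight(&self, scoring_enabled: bool) -> Box<dyn Weight>;
}

trait Collector {
    // Each collector declares whether it will ever look at scores.
    fn requires_scoring(&self) -> bool;
}

struct TermWeight {
    scoring_enabled: bool, // when false, per-doc score computation is skipped
}
impl Weight for TermWeight {}

struct TermQuery;
impl Query for TermQuery {
    fn weight(&self, scoring_enabled: bool) -> Box<dyn Weight> {
        Box::new(TermWeight { scoring_enabled })
    }
}

struct CountCollector;
impl Collector for CountCollector {
    fn requires_scoring(&self) -> bool {
        false // counting matches never needs scores
    }
}

fn main() {
    let query = TermQuery;
    let collector = CountCollector;
    // The collector's declaration drives how the weight is built.
    let _weight = query.weight(collector.requires_scoring());
}
```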
Paul Masurel
0300e7272b Scoring for union. 2018-02-17 11:56:21 +09:00
Paul Masurel
8760899fa2 Stupid implementation of Box<Scorer>::collect 2018-02-16 19:30:50 +09:00
Paul Masurel
c89d570a79 rustfmt 2018-02-16 17:50:05 +09:00
Paul Masurel
1da06d867b Using the same logic when score is enabled. 2018-02-16 17:36:33 +09:00
Paul Masurel
76e8db6ed3 blop 2018-02-16 14:57:08 +09:00
Paul Masurel
31e5580bfa Renaming intersection / exclude 2018-02-16 11:55:56 +09:00
Paul Masurel
930d3db2f7 Integrated reqopt_scorer 2018-02-16 11:43:27 +09:00
Paul Masurel
1593e1dc6f Added reqopt 2018-02-16 11:22:39 +09:00
Paul Masurel
e0189fc9e6 Added exclude query 2018-02-14 18:06:51 +09:00
Paul Masurel
ffdb4ef0a7 Added unit test 2018-02-14 11:58:40 +09:00
Paul Masurel
58845344c2 Unit test + bugfix in union 2018-02-13 14:54:20 +09:00
Paul Masurel
548ec9ecca Added ok unit test 2018-02-12 17:48:41 +09:00
Paul Masurel
86b700fa93 Updated travis.yml 2018-02-12 12:13:36 +09:00
Paul Masurel
e95c49e749 Added unit test to show bug in intersection 2018-02-12 12:06:19 +09:00
Paul Masurel
f3033a8469 Added sudo required to travis conf because of https://github.com/travis-ci/travis-ci/issues/9061 2018-02-12 11:19:12 +09:00
Paul Masurel
c4125bda59 Backmerging master 2018-02-12 11:08:57 +09:00
Paul Masurel
a7ffc0e610 Rustfmt 2018-02-12 10:31:29 +09:00
Paul Masurel
9370427ae2 Terminfo blocks (#244)
* Using u64 key in the store
* Using Option<> for the next element, as opposed to u64
* Code simplification.
* Added TermInfoStoreWriter.
* Added a TermInfoStore
* Added FixedSized for BinarySerialized.
2018-02-12 10:24:58 +09:00
Paul Masurel
1fc7afa90a Issue/range query (#242)
BitSet and RangeQuery
2018-02-05 09:33:25 +09:00
Paul Masurel
6a104e4f69 Cargo fmt 2018-02-03 11:59:34 +09:00
Paul Masurel
920f086e1d Clippy 2018-02-03 11:46:01 +09:00
Paul Masurel
13aaca7e11 Merge branch 'master' into merge-facets 2018-02-03 11:13:02 +09:00
Paul Masurel
df53dc4ceb Format 2018-02-03 00:21:05 +09:00
Paul Masurel
dd028841e8 Added documentation / test and change the contract of .add_facet() 2018-02-03 00:17:51 +09:00
Paul Masurel
eb84b8a60d bugfix 2018-02-02 18:52:07 +09:00
Paul Masurel
c05f46ad0e skip for intersection 2018-02-02 17:22:58 +09:00
Paul Masurel
435ff9d524 Make constructor of RangeQuery public 2018-02-02 16:50:22 +09:00
Paul Masurel
fdd5dd8496 Merge branch 'master' into issue/query-perf 2018-02-02 16:39:28 +09:00
Paul Masurel
fb5476d5de Query optimization: phrase query + union 2018-02-02 16:39:17 +09:00
Paul Masurel
dd8332c327 Added disabling scoring 2018-02-02 12:11:56 +09:00
Paul Masurel
63d201150b issue/range-query Added range query 2018-02-02 00:41:12 +09:00
Paul Masurel
b78efdc59f NOBUG Use the skipping logic of segment postings in 2018-02-01 18:36:55 +09:00
Paul Masurel
5cb08f7996 Method to create bitset from DocSet directly. 2018-02-01 18:25:43 +09:00
Paul Masurel
1947a19700 Added bitset 2018-01-31 23:56:54 +09:00
Paul Masurel
271b019420 added cargo doc 2018-01-30 15:18:19 +09:00
Paul Masurel
340693184f Added comment 2018-01-30 15:15:55 +09:00
Paul Masurel
97782a9511 updated travis-cargo 2018-01-30 13:18:51 +09:00
Paul Masurel
930010aa88 Unit test passing 2018-01-28 00:03:51 +09:00
Paul Masurel
7f5b07d4e7 Fixing unit tests 2018-01-25 14:55:29 +09:00
Paul Masurel
3edb3dce6a Test not passing 2018-01-25 12:46:32 +09:00
Paul Masurel
1edaf7a312 Closes #236. Removes dependency to version. 2018-01-20 12:12:43 +09:00
Paul Masurel
137906ff29 Fixing PhraseQuery, broken due to the reordering of the intersection clauses.
Closes #234
2018-01-12 21:01:28 +09:00
Paul Masurel
143a143cde issue/232 added unit test. (#233) 2018-01-11 23:37:45 +09:00
Paul Masurel
4f5ce12a77 NOBUG removed cpp from patterns 2018-01-05 12:09:42 +09:00
Paul Masurel
813efa4ab3 NOBUG coveralls 2018-01-05 11:03:27 +09:00
Paul Masurel
c3b6c1dc0b NOBUG coveralls 2018-01-05 00:31:57 +09:00
Paul Masurel
6f5e0ef6f4 NOBUG Simplify travis 2018-01-04 20:51:00 +09:00
Paul Masurel
7224f58895 Merge branch 'issue/218'
Conflicts:
	src/directory/mmap_directory.rs
	src/lib.rs
2018-01-04 18:47:10 +09:00
Paul Masurel
49519c3f61 added comments 2018-01-04 12:53:20 +09:00
Paul Masurel
cb11b92505 Added comments 2018-01-04 12:27:14 +09:00
Paul Masurel
7b2dcfbd91 Merge branch 'issue/227' 2018-01-04 12:12:00 +09:00
Paul Masurel
d2e30e6681 Merge branch 'master' of github.com:tantivy-search/tantivy 2018-01-04 12:09:44 +09:00
Paul Masurel
ef109927b3 rustfmt 2018-01-04 12:08:34 +09:00
Paul Masurel
44e5c4dfd3 Added alphanum only token filter 2017-12-31 13:43:10 +09:00
Paul Masurel
6f223253ea Made load_metas public 2017-12-31 08:57:19 +09:00
Paul Masurel
f7b0392bd5 issue/230 Add an optional commit message. (#231)
Closes #230
2017-12-27 12:27:02 +09:00
Paul Masurel
442bc9a1b8 Fixes the computation of the memory size of a hashtable with a key of n bits. (#229)
Closes #228
2017-12-25 13:04:10 +09:00
Paul Masurel
db7d784573 Issue 227 Faster merge when there are no deletes 2017-12-21 22:04:05 +09:00
Paul Masurel
79132e803a NOBUG Switched to 64 bits addr 2017-12-21 11:06:46 +09:00
Paul Masurel
9e132b7dde NOBUG QueryParser does not need to be mut. Code cleanup 2017-12-16 15:43:35 +09:00
Paul Masurel
1e55189db1 NOBUG rustfmt 2017-12-14 19:30:31 +09:00
Paul Masurel
8b1b389a76 NOBUG Clippy 2017-12-14 19:25:12 +09:00
Paul Masurel
46f3ec87a5 Removed packed memory layout. 2017-12-14 18:37:04 +09:00
Paul Masurel
f24e5f405e NOBUG intellij misc lint 2017-12-14 18:23:35 +09:00
Paul Masurel
2589be3984 BUGFIX Serialization of schema got broken after serde's update 2017-12-14 17:37:20 +09:00
Paul Masurel
a02a9294e4 removed doc in travis 2017-11-27 13:53:58 +09:00
Paul Masurel
8023445b63 docs 2017-11-26 11:52:03 +09:00
Paul Masurel
05ce093f97 doc 2017-11-26 11:43:11 +09:00
Paul Masurel
6937e23a56 fixing doctest 2017-11-26 11:06:34 +09:00
Paul Masurel
974c321153 cargo fmt 2017-11-26 11:02:02 +09:00
Paul Masurel
f30ec9b36b Merge branch 'master' of github.com:tantivy-search/tantivy
Conflicts:
	src/analyzer/mod.rs
	src/schema/index_record_option.rs
	src/tokenizer/lower_caser.rs
	src/tokenizer/tokenizer.rs
2017-11-26 10:54:05 +09:00
Paul Masurel
acd7c1ea2d Added comments 2017-11-26 10:44:49 +09:00
Paul Masurel
aaeeda2bc5 Editing rustdoc 2017-11-25 13:23:32 +09:00
Paul Masurel
ac4d433fad Renamed analyzer to tokenizer 2017-11-24 16:50:32 +09:00
Paul Masurel
a298c084e6 Analyzer's Analyzer::token_stream does not need to be &mut self 2017-11-22 20:37:34 +09:00
Paul Masurel
185a72b341 Closes #224. Fixes documentation about STORED in the example. (#225) 2017-11-16 08:22:54 +09:00
Paul Masurel
bb41ae76f9 Closes #224. Fixes documentation about STORED in the example. 2017-11-16 08:16:17 +09:00
Paul Masurel
74d32e522a Stopped using mmap in tantivy. Caching MmapReadOnly.
Closes #218
2017-10-08 17:07:19 +09:00
Jain Jacob
927dd1ee6f Updates crate gcc to cc v1 (#217)
* Bump cc to v1

* Changes gcc::Config to cc::Build. Resolves #216
2017-10-06 16:18:44 +09:00
Paul Masurel
2c9302290f #191 Analyzer 2017-09-20 22:56:55 +09:00
Paul Masurel
426cc436da Test passing 2017-09-10 17:48:41 +09:00
Paul Masurel
68d42c9cf2 Added raw tokenizer, using the right analyzer in query parser. 2017-09-10 16:58:50 +09:00
Paul Masurel
ca49d6130f Test not passing 2017-09-09 17:32:47 +09:00
Paul Masurel
3588ca0561 Integrated with the merge branch 2017-09-09 15:27:19 +09:00
Paul Masurel
7c6cdcd876 Merge branch 'master' of github.com:tantivy-search/tantivy 2017-09-02 16:03:06 +09:00
Paul Masurel
71366b9a56 issue/197 Remove logic that prevents leak from crossbeam MsQueue. (#212)
Closes #197
2017-09-02 15:55:23 +09:00
Paul Masurel
a3247ebcfb issue/197 Remove logic that prevents leak from crossbeam MsQueue. 2017-09-02 15:53:07 +09:00
Paul Masurel
3ec13a8719 Readded fix for non-simd 2017-08-28 23:18:56 +09:00
Paul Masurel
f8593c76d5 Merge branch 'imhotep-new-codec'
Conflicts:
	src/common/bitpacker.rs
	src/compression/pack/compression_pack_nosimd.rs
	src/indexer/log_merge_policy.rs
2017-08-28 19:30:01 +09:00
Paul Masurel
f8710bd4b0 Format 2017-08-28 18:22:41 +09:00
Paul Masurel
8d05b8f7b2 Added comments. Renamed field reader 2017-08-28 17:00:12 +09:00
Paul Masurel
fc25516b7a Added unit test. 2017-08-28 11:15:37 +09:00
Paul Masurel
5b1e71947f Stream working, all test passing 2017-08-27 20:20:38 +09:00
Paul Masurel
69351fb4a5 Toward a new codec 2017-08-27 18:44:37 +09:00
Paul Masurel
3d0082d020 Delta encoded. Range and get are broken 2017-08-26 19:59:51 +09:00
Paul Masurel
8e450c770a Better error handling. Some doc. 2017-08-26 18:40:30 +09:00
Paul Masurel
a757902aed Merge branch 'feature/streamdict-simd' into imhotep 2017-08-22 18:58:57 +09:00
Paul Masurel
b3a8074826 removed println 2017-08-22 18:58:17 +09:00
Paul Masurel
4289625348 Merged with the new codec branch 2017-08-22 18:26:09 +09:00
Paul Masurel
850f10c1fe Exposing Field 2017-08-22 18:21:35 +09:00
raphael claude
d7f9bfdfc5 fix segments sorting in log_merge_policy (#211)
bug: segments were sorted on their indices (first field in the tuples)
fix: sort on the segment sizes
2017-08-20 08:59:54 +09:00
Paul Masurel
d0d5db4515 Streamdict using SIMD instruction. 2017-08-19 12:03:04 +09:00
Paul Masurel
303fc7e820 Better unit test for termdict. Checking the TermInfo 2017-08-17 12:08:39 +09:00
Paul Masurel
744edb2c5c NOBUG Avoid serializing position offset when useless. Test passing 2017-08-16 14:06:00 +09:00
Paul Masurel
2d70efb7b0 Removed trait boundary on termdict 2017-08-15 14:43:05 +09:00
Paul Masurel
eb5b2ffdcc Cleanups 2017-08-15 13:57:22 +09:00
Paul Masurel
38513014d5 Reenable unit test.
Consuming CompositeWrite on Close.
2017-08-14 23:35:09 +09:00
Paul Masurel
9cb7a0f6e6 Unit tests passing 2017-08-13 19:38:25 +09:00
Paul Masurel
8d466b8a76 half way through removing FastFieldsReader 2017-08-13 18:39:45 +09:00
Paul Masurel
413d0e1719 NOBUG test passing 2017-08-13 17:57:11 +09:00
Paul Masurel
0eb3c872fd Using composite file for all of the inverted index component 2017-08-12 19:34:23 +09:00
Paul Masurel
f9203228be Using composite file in fast field. 2017-08-12 18:45:59 +09:00
Paul Masurel
8f377b92d0 introducing a field serializer 2017-08-11 18:11:32 +09:00
Paul Masurel
1e89f86267 blop 2017-08-08 13:55:09 +09:00
Paul Masurel
d1f61a50c1 issue/207 Lazily decompressing positions. 2017-08-06 20:29:21 +09:00
Dru Sellers
2bb85ed575 Minor Doc Changes (#206)
* Various small documentation tweaks

* walking through the docs

* Update lib.rs

* Update lib.rs

* Update mod.rs
2017-08-06 09:22:03 +09:00
Paul Masurel
236fa74767 Positions almost working. 2017-08-05 23:17:35 +09:00
Paul Masurel
63b35dd87b removing freq handler. 2017-08-05 18:09:19 +09:00
Paul Masurel
efb910f4e8 Added CompressedIntStream 2017-08-05 16:44:01 +09:00
Paul Masurel
aff7e64d4e test 2017-08-04 22:07:14 +09:00
Paul Masurel
92a3f3981f issue/204 trying to fix nosimd branch. test not passing 2017-08-04 21:19:18 +09:00
king6cong
447a9361d8 Remove submodule information in README as subtree is now used 2017-08-03 13:52:16 +09:00
Paul Masurel
5f59139484 NOBUG simplified code. 2017-08-02 20:49:47 +09:00
Paul Masurel
27c373d26d NOBUG Updated changelog and bumped version 2017-07-24 18:52:45 +09:00
Paul Masurel
80ae136646 issue/198 Getting living_file after getting the list of managed files. 2017-07-24 18:46:41 +09:00
Paul Masurel
52b1398702 NOBUG version 0.4.0 -> 0.4.1 2017-07-19 19:07:54 +09:00
Paul Masurel
7b9cd09a6e Closes #199. Unindexed fields are indexed as untokenized 2017-07-19 18:41:22 +09:00
Paul Masurel
4c423ad2ca Merge branch 'master' of github.com:tantivy-search/tantivy 2017-07-19 17:01:32 +09:00
Paul Masurel
9f542d5252 NOBUG Fix spelling of "encountered". (as reported by @dazzag24) 2017-07-19 16:59:50 +09:00
Paul Masurel
77d8e81ae4 issue/17 Slightly more explicit error message 2017-07-19 11:08:42 +09:00
Paul Masurel
76e07b9705 NOBUG Small fixes. 2017-07-14 18:09:54 +09:00
Paul Masurel
ea4e9fdaf1 NOBUG updated README 2017-07-14 14:09:13 +09:00
Paul Masurel
e418bee693 NOBUG Garbage collection after end merge. 2017-07-14 12:09:47 +09:00
Paul Masurel
af4f1a86bc Merge remote-tracking branch 'origin/exp/hash_intable' 2017-07-13 20:50:54 +09:00
Paul Masurel
753b639454 NOBUG splitting the per-thread memory between the table and the heap 2017-07-13 17:11:39 +09:00
Paul Masurel
5907a47547 NOBUG Added whitespaces. 2017-07-13 15:14:12 +09:00
Paul Masurel
586a6e62a2 NOBUG Added Changelog for 4.0 2017-07-13 15:06:09 +09:00
Paul Masurel
fdae0eff5a NOBUG Remove range step_by 2017-07-13 14:05:33 +09:00
Paul Masurel
6eea407f20 Removing usage of step_by 2017-06-23 17:46:39 +09:00
Paul Masurel
1ba51d4dc4 NOBUG removed using range.step_by 2017-06-22 22:10:53 +09:00
Paul Masurel
6e742d5145 NOBUG removing batch add docs 2017-06-22 11:35:22 +09:00
Paul Masurel
1843259e91 NOBUG Simplified addr definitions 2017-06-22 11:27:32 +09:00
Paul Masurel
4ebacb7297 BytesRef is now wrapping an addr 2017-06-21 22:32:05 +09:00
Paul Masurel
fb75e60c6e issue/136 Added hashmaps. 2017-06-21 15:47:55 +09:00
Paul Masurel
04b15c6c11 Merge branch 'master' into exp/hash_intable
Conflicts:
	src/datastruct/stacker/hashmap.rs
2017-06-21 11:40:49 +09:00
Paul Masurel
b05b5f5487 issue/191 Added an analyzer manager. 2017-06-20 10:02:26 +09:00
Paul Masurel
4fe96483bc fill_buffer 2017-06-14 23:32:58 +09:00
Paul Masurel
09e27740e2 Added fill_buffer in DocSet 2017-06-14 18:28:30 +09:00
Paul Masurel
e51feea574 Removed cargo fmt from travis. 2017-06-14 13:45:11 +09:00
Paul Masurel
93e7f28cc0 Added unit test 2017-06-14 10:46:06 +09:00
Paul Masurel
8875b9794a Added API to get range from fastfield 2017-06-13 23:16:50 +09:00
Paul Masurel
f26874557e Remove the concept of pipeline. Made a BoxableAnalyzer 2017-06-10 20:06:00 +09:00
Paul Masurel
a7d10b65ae Added support for Japanese. 2017-06-09 22:25:03 +09:00
Paul Masurel
e120e3b7aa issue/191 Added proper analyzer 2017-06-07 23:21:36 +09:00
Paul Masurel
90fcfb3f43 issue/188 Using murmurhash 2017-06-07 09:30:34 +09:00
Paul Masurel
e547e8abad Closes #184
Resizing the `Vec` was a bad idea, as for some stacker operations,
we may have a living reference to an object in the current heap.
2017-06-06 23:16:28 +09:00
Paul Masurel
5aa4565424 Tiny cleaning 2017-06-05 23:40:08 +09:00
Paul Masurel
3637620187 Merge branch 'master' of github.com:tantivy-search/tantivy 2017-06-02 21:03:37 +09:00
Laurentiu Nicola
a94679d74d Use four terms in the intersection bench 2017-05-31 08:31:33 +09:00
Laurentiu Nicola
a35a8638cc Comment nit 2017-05-31 08:31:33 +09:00
Paul Masurel
97a051996f issue 171. Hopefully bugfix? 2017-05-31 08:31:33 +09:00
Laurentiu Nicola
69525cb3c7 Add extra intersection test 2017-05-31 08:31:33 +09:00
Laurentiu Nicola
63867a7150 Fix document generation for posting benchmarks 2017-05-31 08:31:33 +09:00
Paul Masurel
19c073385a Better intersection and added size_hint 2017-05-31 08:31:33 +09:00
Paul Masurel
0521844e56 Format, small changes in VInt 2017-05-31 08:31:20 +09:00
Paul Masurel
8d4778f94d issue/181 BinarySerializable does not return the len + Generics over Read+Write 2017-05-31 08:31:20 +09:00
Paul Masurel
1d5464351d generic read 2017-05-31 08:31:20 +09:00
Paul Masurel
522ebdc674 made ResultExt public 2017-05-31 08:31:20 +09:00
Paul Masurel
4a805733db another hash 2017-05-30 15:36:48 +09:00
Paul Masurel
568d149db8 Merge branch 'master' into exp/hash_intable 2017-05-30 08:27:33 +09:00
Paul Masurel
4cfc9806c0 made ResultExt public 2017-05-30 08:22:17 +09:00
Paul Masurel
37042e3ccb Send and Sync impl now useless 2017-05-29 18:53:49 +09:00
Paul Masurel
b316cd337a Optimization in bitpacker 2017-05-29 18:53:49 +09:00
Paul Masurel
c04991e5ad Removed pointer in fastfield 2017-05-29 18:53:49 +09:00
Paul Masurel
c59b712eeb Added hash info in the table 2017-05-29 18:47:20 +09:00
Ashley Mannix
da61baed3b run fmt 2017-05-29 18:29:39 +09:00
Ashley Mannix
b6140d2962 drop some patch bounds 2017-05-29 18:29:39 +09:00
Ashley Mannix
6a9a71bb1b re-export ErrorKind 2017-05-29 18:29:39 +09:00
Ashley Mannix
e8fc4c77e2 fix delete error msg 2017-05-29 18:29:39 +09:00
Ashley Mannix
80837601ea remove error::* imports 2017-05-29 18:29:39 +09:00
Ashley Mannix
2b2703cf51 run cargo fmt 2017-05-29 18:29:39 +09:00
Ashley Mannix
d79018a7f8 fix build warnings 2017-05-29 18:29:39 +09:00
Ashley Mannix
d8a7c428f7 impl std error for directory errors 2017-05-29 18:29:39 +09:00
Ashley Mannix
45595234cc fix error match 2017-05-29 18:29:39 +09:00
Ashley Mannix
1bcebdd29e initial error-chain 2017-05-29 18:29:39 +09:00
Paul Masurel
ed0333a404 Optimized streamer 2017-05-28 19:58:28 +09:00
Paul Masurel
ac0b1a21eb Term as a wrapper
Small changes

Plastic
2017-05-25 23:49:54 +09:00
Paul Masurel
6bbc789d84 Fmt fix 2017-05-25 23:49:54 +09:00
Paul Masurel
87152daef3 issue/174 Added doc, and made field private 2017-05-25 23:49:54 +09:00
Paul Masurel
e0fce4782a Added documentation 2017-05-25 23:49:54 +09:00
Paul Masurel
a633c2a49a Avoid exposing common. Exposes u64 to i64 conversion instead. 2017-05-25 23:49:54 +09:00
Paul Masurel
51623d593e Avoid exposing schema from segment_reader 2017-05-25 23:49:54 +09:00
Paul Masurel
29bf740ddf Exposing the remaining API 2017-05-25 23:49:54 +09:00
Paul Masurel
511bd25a31 trailing whitespace 2017-05-25 18:17:37 +09:00
Paul Masurel
66e14ac1b1 clippy 2017-05-25 18:17:37 +09:00
Paul Masurel
09e94072ba Cargo fmt 2017-05-25 18:17:37 +09:00
Paul Masurel
6c68136d31 Reorganized code 2017-05-25 18:17:37 +09:00
Paul Masurel
aaf1b2c6b6 Reorganized code and added documentation. 2017-05-25 18:17:37 +09:00
Paul Masurel
8a6af2aefa Added unit test and bugfix 2017-05-25 18:17:37 +09:00
Paul Masurel
7a6e62976b Added stream dictionary code, merge unit test 2017-05-25 18:17:37 +09:00
Paul Masurel
2712930bd6 Added the feature 2017-05-25 18:17:37 +09:00
Paul Masurel
cb05f8c098 Prevent execution of the code in the macro doc 2017-05-22 10:55:45 +09:00
Paul Masurel
c0c9d04ca9 Added extra doc 2017-05-22 10:55:45 +09:00
Paul Masurel
7ea5e740e0 Using the $crate thing to make the macro usable in and outside tantivy 2017-05-22 10:55:45 +09:00
Paul Masurel
2afa6c372a issue/168 Make doc! macro usable outside tantivy 2017-05-22 10:55:45 +09:00
Paul Masurel
c7db8866b5 Merge branch 'facets' 2017-05-21 22:57:01 +09:00
Paul Masurel
02d992324a simplified facets. 2017-05-21 22:56:43 +09:00
Paul Masurel
4ab511ffc6 Merging 2017-05-21 22:15:02 +09:00
Paul Masurel
f318172ea4 Merge branch 'issue/162' 2017-05-21 20:04:03 +09:00
Paul Masurel
581449a824 issue/162 Docs and unit tests 2017-05-21 18:58:04 +09:00
Maciej Dziardziel
272589a381 faceting for fast numerical fields 2017-05-21 12:04:29 +03:00
Laurentiu Nicola
73d54c6379 Inline block_len 2017-05-21 10:44:49 +03:00
Paul Masurel
3e4606de5d Simplifying, and reordering the members 2017-05-21 16:31:52 +09:00
Laurentiu Nicola
020779f61b Make things faster 2017-05-20 20:56:37 +03:00
Laurentiu Nicola
835936585f Don't search whole blocks, but only the remaining part 2017-05-20 18:45:41 +03:00
Paul Masurel
bdd05e97d1 Added bench for segment postings 2017-05-20 23:38:53 +09:00
Paul Masurel
2be5f08cd6 issue/162 Added block iteration API 2017-05-20 11:46:40 +09:00
Paul Masurel
3f49d65a87 issue/162 Create block postings 2017-05-20 00:46:23 +09:00
Paul Masurel
f9baf4bcc8 Merge branch 'issue/155'
Conflicts:
	src/indexer/merger.rs
	src/indexer/segment_writer.rs
2017-05-19 20:14:36 +09:00
Paul Masurel
7ee93fbed5 Cleaning 2017-05-19 20:08:04 +09:00
Paul Masurel
57a5547ae8 Comments and cleaning up API 2017-05-19 11:20:27 +09:00
Paul Masurel
c57ab6a335 Renamed fstmap to termdict 2017-05-19 09:26:18 +09:00
Paul Masurel
02bfa9be52 Moving to termdict 2017-05-19 08:43:52 +09:00
Paul Masurel
b3f62b8acc Better API 2017-05-18 23:35:39 +09:00
Paul Masurel
2a08c247af Clippy 2017-05-18 23:20:41 +09:00
Paul Masurel
d2926b6ee0 Format 2017-05-18 23:09:20 +09:00
Paul Masurel
0272167c2e Code cleaning 2017-05-18 23:06:02 +09:00
Laurentiu Nicola
a9cf0bde16 Format code 2017-05-18 22:07:49 +09:00
Laurentiu Nicola
5a457df45d VInt encode values in IntFastFieldWriter
Closes #131
2017-05-18 22:07:49 +09:00
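A self-contained sketch of the classic VInt scheme referenced here: 7 payload bits per byte, least-significant group first, with the high bit flagging continuation. The exact layout in tantivy's writer is not shown in the log, so treat this as the generic scheme:

```rust
fn vint_encode(mut value: u64, out: &mut Vec<u8>) {
    while value >= 0x80 {
        out.push((value as u8 & 0x7F) | 0x80); // more bytes follow
        value >>= 7;
    }
    out.push(value as u8); // last byte: high bit clear
}

/// Returns the decoded value and the number of bytes consumed.
fn vint_decode(bytes: &[u8]) -> (u64, usize) {
    let (mut value, mut shift) = (0u64, 0u32);
    for (i, &b) in bytes.iter().enumerate() {
        value |= u64::from(b & 0x7F) << shift;
        if b & 0x80 == 0 {
            return (value, i + 1);
        }
        shift += 7;
    }
    panic!("truncated vint");
}

fn main() {
    let mut buf = Vec::new();
    vint_encode(300, &mut buf);
    assert_eq!(buf, vec![0xAC, 0x02]);       // 300 = 0b10_0101100
    assert_eq!(vint_decode(&buf), (300, 2)); // round-trips in 2 bytes
}
```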
Paul Masurel
ca76fd5ba0 Uncommenting unit test 2017-05-18 20:41:56 +09:00
Paul Masurel
e79a316e41 Issue 155 - Trying to avoid term lookup when merging terms
+ Adds a proper Streamer interface
2017-05-18 20:12:00 +09:00
Paul Masurel
733f54d80e Making clippy happy. 2017-05-17 19:07:39 +09:00
Paul Masurel
7b2b181652 Merge branch 'master' into issue/136
Conflicts:
	src/datastruct/stacker/hashmap.rs
	src/datastruct/stacker/heap.rs
	src/datastruct/stacker/mod.rs
	src/indexer/index_writer.rs
	src/indexer/merger.rs
	src/indexer/segment_updater.rs
	src/indexer/segment_writer.rs
	src/postings/postings_writer.rs
	src/postings/recorder.rs
	src/schema/term.rs
2017-05-17 18:40:09 +09:00
Laurentiu Nicola
b3f39f2343 Remove unneeded suppressions, make clippy lints explicit 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
a13122d392 use explicit drop instead of suppression 2017-05-17 15:50:07 +09:00
Paul Masurel
113917c521 Making clippy happy.
+ Simplifying bitpacking by adding a 7 byte padding.
+ Bugfix in a unit test.
2017-05-17 15:50:07 +09:00
Laurentiu Nicola
1352b95b07 clippy: fix never_loop warnings 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
c0538dbe9a clippy: fix mut_from_ref warnings 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
0d5ea98132 clippy: fix inline_always warnings 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
0404df3fd5 Fix typo in docstring 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
a67caee141 clippy: fix len_zero warnings 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
f5fb29422a clippy: fix while_let_loop warnings 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
4e48bbf0ea clippy: fix needless_lifetimes warnings 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
6fea510869 clippy: fix redundant_closure warnings 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
39958ec476 clippy: fix single_match warnings 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
36f51e289e clippy: fix match_same_arms warnings 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
5c83153035 clippy: fix or_fun_call warnings 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
8e407bb314 clippy: fix needless_borrow warnings 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
103ba6ba35 clippy: fix match_ref_pats warnings 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
3965b26cd2 clippy: fix useless_let_if_seq warnings 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
1cd0b378fb clippy: fix map_clone warnings 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
92f383fa51 clippy: fix let_unit_value warnings 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
6ae34d2a77 clippy: fix toplevel_ref_arg warnings 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
1af1f7e0d1 clippy: fix if_let_redundant_pattern_matching warnings 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
feec2e2620 clippy: fix needless_bool warnings 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
3e2ad7542d clippy: fix needless_return warnings 2017-05-17 15:50:07 +09:00
Laurentiu Nicola
ac02c76b1e clippy: fix doc_markdown warnings 2017-05-17 15:50:07 +09:00
Paul Masurel
e5c7c0b8b9 Update CHANGELOG.md 2017-05-16 21:13:33 +09:00
Laurentiu Nicola
49dbe4722f Add a test for SegmentPostings::skip_len 2017-05-16 21:12:43 +09:00
Laurentiu Nicola
f64ff77424 Use an exponential search 2017-05-16 21:12:43 +09:00
Laurentiu Nicola
2bf93e9e51 Avoid rebuilding simdcomp when running tests 2017-05-16 08:37:43 +09:00
Laurentiu Nicola
3dde748b25 Make rustfmt happy 2017-05-16 00:49:05 +03:00
Laurentiu Nicola
1dabe26395 Add comment about block_len 2017-05-15 21:26:28 +03:00
Laurentiu Nicola
5590537739 Disable early exit 2017-05-15 21:18:06 +03:00
Laurentiu Nicola
ccf0f9cb2f Merge branch 'master' of github.com:tantivy-search/tantivy into issue/130 2017-05-15 18:54:16 +03:00
Laurentiu Nicola
e21913ecdc Use binary search for SegmentPostings::skip_next 2017-05-15 18:33:43 +03:00
Laurentiu Nicola
2cc826adc7 Add a bench for SegmentPostings::SkipNext 2017-05-15 18:33:43 +03:00
Laurentiu Nicola
4d90d8fc1d Move the random sampling helpers to the tests module 2017-05-15 18:33:43 +03:00
Paul Masurel
0606a8ae73 Bugfix in travis yml 2017-05-16 00:22:11 +09:00
Paul Masurel
03564214e7 Added check for rustfmt in travis 2017-05-15 22:46:43 +09:00
Paul Masurel
4c8f9742f8 format 2017-05-15 22:30:18 +09:00
Paul Masurel
a23b7a1815 Test the size of complete 0..128 block 2017-05-15 19:09:52 +09:00
Paul Masurel
6f89a86b14 Added simple search in travis CI 2017-05-15 12:10:23 +09:00
Laurentiu Nicola
b2beac1203 Check the result of wait_merging_threads 2017-05-15 08:00:25 +09:00
Paul Masurel
8cd5a2d81d Fixed logging deleted files twice 2017-05-15 00:25:49 +09:00
Paul Masurel
b26c22ada0 Merge branch 'issue/148' 2017-05-15 00:02:51 +09:00
Laurentiu Nicola
8a35259300 Avoid clone() call 2017-05-14 23:28:17 +09:00
Paul Masurel
db56167a5d Display backtrace 2017-05-14 23:28:17 +09:00
Paul Masurel
ab66ffed4e Closes #147 2017-05-14 23:28:17 +09:00
Laurentiu Nicola
e04f2f0b08 issue/148 Wait for the index writer threads to shut down in simple_search 2017-05-14 16:35:24 +03:00
Paul Masurel
7a5df33c85 issue/148 Wrapping MsQueue to drop all of its content on Drop 2017-05-14 16:25:33 +03:00
Laurentiu Nicola
ee0873dd07 Avoid clone() call 2017-05-13 16:11:58 +03:00
Paul Masurel
695c8828b8 Display backtrace 2017-05-13 18:51:38 +09:00
Paul Masurel
4ff7dc7a4f Closes #147 2017-05-13 18:46:50 +09:00
Paul Masurel
69832bfd03 NOBUG Disabling running examples in CI as it is not working. 2017-05-12 14:35:50 +09:00
Paul Masurel
ecbdd70c37 Removed the clunky linked list logic of the heap. 2017-05-12 14:01:52 +09:00
Paul Masurel
fb1b2be782 issue/136 Fix following CR 2017-05-12 13:51:09 +09:00
Paul Masurel
9cd7458978 NOBUG Hiding methods making it possible to build an incorrect Term. 2017-05-11 21:12:59 +09:00
Paul Masurel
4c4c28e2c4 Fix broken compile 2017-05-11 20:57:32 +09:00
Paul Masurel
9f9e588905 Merge branch 'master' into issue/136
Conflicts:
	src/postings/postings_writer.rs
2017-05-11 20:50:24 +09:00
Paul Masurel
6fd17e0ead Code cleaning 2017-05-11 20:47:30 +09:00
Paul Masurel
65dc5b0d83 Closes #145 2017-05-11 19:48:06 +09:00
Paul Masurel
15d15c01f8 Running examples in CI
Closes #143
2017-05-11 19:43:36 +09:00
Paul Masurel
106832a66a Make Term::with_capacity crate-public 2017-05-11 19:37:15 +09:00
Paul Masurel
477b9136b9 FIXED inconsistent Term's field serialization.
Also.

Cleaned up the code to make sure that the logic
is only in one place.
Removed allocate_vec

Closes #141
Closes #139
Closes #142
Closes #138
2017-05-11 19:37:15 +09:00
Paul Masurel
7852d097b8 CHANGELOG 0.3.1 did not included the fix of the Field(u32) 2017-05-11 09:48:37 +09:00
Ashley Mannix
0bd56241bb pretty print meta.json 2017-05-10 20:13:53 +09:00
Paul Masurel
54ab897755 Added comment 2017-05-10 19:30:24 +09:00
Paul Masurel
1369d2d144 Quadratic probing. 2017-05-10 10:38:47 +09:00
Paul Masurel
d3f829dc8a Bugfix 2017-05-10 00:29:37 +09:00
Paul Masurel
e82ccf9627 Merge branch 'master' into issue/indexing-refactoring 2017-05-09 16:43:33 +09:00
Paul Masurel
d3d29f7f54 NOBUG Updated CHANGELOG with the serde change for 0.4.0 2017-05-09 16:42:25 +09:00
Paul Masurel
3566717979 Merge pull request #134 from tantivy-search/chore/serde-rebase
Replace rustc_serialize with serde (updated)
2017-05-09 16:38:42 +09:00
Paul Masurel
90bc3e3773 Added limitation on term dictionary saturation 2017-05-09 14:10:33 +09:00
Paul Masurel
ffb62b6835 working 2017-05-09 10:17:05 +09:00
Ashley Mannix
4f9ce91d6a update underflow test 2017-05-08 14:40:58 +10:00
Laurentiu Nicola
3c3a2fbfe8 Remove old serialization code 2017-05-08 07:36:15 +03:00
Laurentiu Nicola
0508571d1a Use the proper error type on u64 overflow 2017-05-08 07:35:33 +03:00
Laurentiu Nicola
7b733dd34f Fix i64 overflow check and merge NotJSON with NotJSONObject 2017-05-08 07:09:54 +03:00
Ashley Mannix
2c798e3147 Replace rustc_serialize with serde 2017-05-07 20:21:22 +03:00
Paul Masurel
2c13f210bc Bugfix on merging i64 fast fields 2017-05-07 15:57:29 +09:00
Paul Masurel
0dad02791c issues/65 Added comments
Closes #65
Closes #132
2017-05-06 23:09:45 +09:00
Paul Masurel
2947364ae1 issues/65 Phrase queries for untokenized fields are not tokenized. 2017-05-06 22:14:26 +09:00
Paul Masurel
05111599b3 Removed several TODOs 2017-05-05 16:08:09 +08:00
Paul Masurel
83263eabbb issues/65 Updated changelog added some doc. 2017-05-04 17:13:14 +08:00
Paul Masurel
5cb5c9a8f2 issues/65 Added i64 fast fields 2017-05-04 16:46:14 +08:00
Paul Masurel
9ab92b7739 i64 fast field working 2017-05-04 16:46:14 +08:00
Paul Masurel
962bddfbbf Merge with panics. 2017-05-04 16:46:14 +08:00
Paul Masurel
26cfe2909f FastField with different types 2017-05-04 16:46:13 +08:00
Paul Masurel
afdfb1a69b Compiling... fastfield not implemented yet 2017-05-04 16:46:13 +08:00
Paul Masurel
b26ad1d57a Added int options 2017-05-04 16:46:13 +08:00
Paul Masurel
1dbd54edbb Renamed u64options 2017-05-04 16:46:13 +08:00
Paul Masurel
deb04eb090 issue/65 Switching to u64. 2017-05-04 16:46:13 +08:00
Paul Masurel
bed34bf502 Merge branch 'issues/122' 2017-04-23 16:14:40 +08:00
Paul Masurel
80f1e26c3b Tantivy 0.3.1 2017-04-23 15:52:07 +08:00
Paul Masurel
3e68b61d8f issue/122 Adds a garbage collect method 2017-04-23 15:51:06 +08:00
Paul Masurel
95bfb71901 NOBUG Remove 256 num fields limit 2017-04-19 22:37:34 +09:00
Paul Masurel
74e10843a7 issue/120 Disabled SIMD vbyte compression for msvc 2017-04-17 22:36:32 +09:00
Paul Masurel
1b922e6d23 issue 120. Using streamvbyte codec for the vbyte part of the encoding 2017-04-16 18:49:53 +09:00
Paul Masurel
a7c6c31538 Merge commit '9d071c8d4610aa61f4b1f7dd489210415a05cfc0' as 'cpp/streamvbyte' 2017-04-16 15:22:43 +09:00
Paul Masurel
9d071c8d46 Squashed 'cpp/streamvbyte/' content from commit f38aa6b
git-subtree-dir: cpp/streamvbyte
git-subtree-split: f38aa6b6ec4c5cee9d72c94ef305e6a79a108252
2017-04-16 15:22:43 +09:00
Paul Masurel
04074f7bcb Merge pull request #119 from tantivy-search/issue/118
Using u32 for field ids
2017-04-15 13:11:22 +09:00
Paul Masurel
8a28d1643d Using u32 for field ids 2017-04-15 13:04:33 +09:00
319 changed files with 43996 additions and 108591 deletions

.github/FUNDING.yml (new file)
@@ -0,0 +1,12 @@
# These are supported funding model platforms
github: fulmicoton
patreon: # Replace with a single Patreon username
open_collective: # Replace with a single Open Collective username
ko_fi: # Replace with a single Ko-fi username
tidelift: # Replace with a single Tidelift platform-name/package-name e.g., npm/babel
community_bridge: # Replace with a single Community Bridge project-name e.g., cloud-foundry
liberapay: # Replace with a single Liberapay username
issuehunt: # Replace with a single IssueHunt username
otechie: # Replace with a single Otechie username
custom: # Replace with up to 4 custom sponsorship URLs e.g., ['link1', 'link2']

.github/ISSUE_TEMPLATE/bug_report.md (new file)
@@ -0,0 +1,19 @@
---
name: Bug report
about: Create a report to help us improve
---
**Describe the bug**
- What did you do?
- What happened?
- What was expected?
**Which version of tantivy are you using?**
If "master", ideally give the specific sha1 revision.
**To Reproduce**
If your bug is deterministic, can you give a minimal reproducing code?
Some bugs are not deterministic. Can you describe with precision in which context it happened?
If this is possible, can you share your code?

.github/ISSUE_TEMPLATE/feature_request.md (new file)
@@ -0,0 +1,14 @@
---
name: Feature request
about: Suggest an idea for this project
---
**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
**Describe the solution you'd like**
A clear and concise description of what you want to happen.
**[Optional] describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.

.github/ISSUE_TEMPLATE/question.md (new file)
@@ -0,0 +1,7 @@
---
name: Question
about: Ask any question about tantivy's usage...
---
Try to be specific about your use case...

.gitignore
@@ -1,3 +1,6 @@
tantivy.iml
proptest-regressions
*.swp
target
target/debug
.vscode
@@ -5,4 +8,7 @@ target/release
Cargo.lock
benchmark
.DS_Store
cpp/simdcomp/bitpackingbenchmark
*.bk
.idea
trace.dat

.travis.yml
@@ -1,16 +1,22 @@
# Based on the "trust" template v0.1.2
# https://github.com/japaric/trust/tree/v0.1.2
dist: trusty
language: rust
rust:
- nightly
services: docker
sudo: required
env:
global:
- CC=gcc-4.8
- CXX=g++-4.8
- CRATE_NAME=tantivy
- TRAVIS_CARGO_NIGHTLY_FEATURE=""
- secure: eC8HjTi1wgRVCsMAeXEXt8Ckr0YBSGOEnQkkW4/Nde/OZ9jJjz2nmP1ELQlDE7+czHub2QvYtDMG0parcHZDx/Kus0yvyn08y3g2rhGIiE7y8OCvQm1Mybu2D/p7enm6shXquQ6Z5KRfRq+18mHy80wy9ABMA/ukEZdvnfQ76/Een8/Lb0eHaDoXDXn3PqLVtByvSfQQ7OhS60dEScu8PWZ6/l1057P5NpdWbMExBE7Ro4zYXNhkJeGZx0nP/Bd4Jjdt1XfPzMEybV6NZ5xsTILUBFTmOOt603IsqKGov089NExqxYu5bD3K+S4MzF1Nd6VhomNPJqLDCfhlymJCUj5n5Ku4yidlhQbM4Ej9nGrBalJnhcjBjPua5tmMF2WCxP9muKn/2tIOu1/+wc0vMf9Yd3wKIkf5+FtUxCgs2O+NslWvmOMAMI/yD25m7hb4t1IwE/4Bk+GVcWJRWXbo0/m6ZUHzRzdjUY2a1qvw7C9udzdhg7gcnXwsKrSWi2NjMiIVw86l+Zim0nLpKIN41sxZHLaFRG63Ki8zQ/481LGn32awJ6i3sizKS0WD+N1DfR2qYMrwYHaMN0uR0OFXYTJkFvTFttAeUY3EKmRKAuMhmO2YRdSr4/j/G5E9HMc1gSGJj6PxgpQU7EpvxRsmoVAEJr0mszmOj9icGHep/FM=
# - secure: eC8HjTi1wgRVCsMAeXEXt8Ckr0YBSGOEnQkkW4/Nde/OZ9jJjz2nmP1ELQlDE7+czHub2QvYtDMG0parcHZDx/Kus0yvyn08y3g2rhGIiE7y8OCvQm1Mybu2D/p7enm6shXquQ6Z5KRfRq+18mHy80wy9ABMA/ukEZdvnfQ76/Een8/Lb0eHaDoXDXn3PqLVtByvSfQQ7OhS60dEScu8PWZ6/l1057P5NpdWbMExBE7Ro4zYXNhkJeGZx0nP/Bd4Jjdt1XfPzMEybV6NZ5xsTILUBFTmOOt603IsqKGov089NExqxYu5bD3K+S4MzF1Nd6VhomNPJqLDCfhlymJCUj5n5Ku4yidlhQbM4Ej9nGrBalJnhcjBjPua5tmMF2WCxP9muKn/2tIOu1/+wc0vMf9Yd3wKIkf5+FtUxCgs2O+NslWvmOMAMI/yD25m7hb4t1IwE/4Bk+GVcWJRWXbo0/m6ZUHzRzdjUY2a1qvw7C9udzdhg7gcnXwsKrSWi2NjMiIVw86l+Zim0nLpKIN41sxZHLaFRG63Ki8zQ/481LGn32awJ6i3sizKS0WD+N1DfR2qYMrwYHaMN0uR0OFXYTJkFvTFttAeUY3EKmRKAuMhmO2YRdSr4/j/G5E9HMc1gSGJj6PxgpQU7EpvxRsmoVAEJr0mszmOj9icGHep/FM=
addons:
apt:
sources:
- ubuntu-toolchain-r-test
- kalakris-cmake
packages:
- gcc-4.8
- g++-4.8
@@ -18,18 +24,69 @@ addons:
- libelf-dev
- libdw-dev
- binutils-dev
- cmake
matrix:
include:
# Android
- env: TARGET=aarch64-linux-android DISABLE_TESTS=1
#- env: TARGET=arm-linux-androideabi DISABLE_TESTS=1
#- env: TARGET=armv7-linux-androideabi DISABLE_TESTS=1
#- env: TARGET=i686-linux-android DISABLE_TESTS=1
#- env: TARGET=x86_64-linux-android DISABLE_TESTS=1
# Linux
#- env: TARGET=aarch64-unknown-linux-gnu
#- env: TARGET=i686-unknown-linux-gnu
- env: TARGET=x86_64-unknown-linux-gnu CODECOV=1 #UPLOAD_DOCS=1
# - env: TARGET=x86_64-unknown-linux-musl CODECOV=1
# OSX
#- env: TARGET=x86_64-apple-darwin
# os: osx
before_install:
- set -e
- rustup self update
- rustup component add rustfmt
install:
- sh ci/install.sh
- source ~/.cargo/env || true
- env | grep "TRAVIS"
before_script:
- |
pip install 'travis-cargo<0.2' --user &&
export PATH=$HOME/.local/bin:$PATH
- export PATH=$HOME/.cargo/bin:$PATH
- cargo install cargo-update || echo "cargo-update already installed"
- cargo install cargo-travis || echo "cargo-travis already installed"
script:
- |
travis-cargo build &&
travis-cargo test &&
travis-cargo bench &&
travis-cargo doc
- bash ci/script.sh
- cargo fmt --all -- --check
before_deploy:
- sh ci/before_deploy.sh
after_success:
- bash ./script/build-doc.sh
- travis-cargo doc-upload
- if [[ "$TRAVIS_OS_NAME" == "linux" ]]; then travis-cargo coveralls --no-sudo --verify; fi
- if [[ "$TRAVIS_OS_NAME" == "linux" ]]; then ./kcov/build/src/kcov --verify --coveralls-id=$TRAVIS_JOB_ID --include-path=`pwd`/src --exclude-path=`pwd`/cpp --exclude-pattern=/.cargo target/kcov target/debug/tantivy-*; fi
# Needs GH_TOKEN env var to be set in travis settings
- if [[ -v GH_TOKEN ]]; then echo "GH TOKEN IS SET"; else echo "GH TOKEN NOT SET"; fi
- if [[ -v UPLOAD_DOCS ]]; then cargo doc; cargo doc-upload; else echo "doc upload disabled."; fi
#cache: cargo
#before_cache:
# # Travis can't cache files that are not readable by "others"
# - chmod -R a+r $HOME/.cargo
# - find ./target/debug -type f -maxdepth 1 -delete
# - rm -f ./target/.rustc_info.json
# - rm -fr ./target/debug/{deps,.fingerprint}/tantivy*
# - rm -r target/debug/examples/
# - ls -1 examples/ | sed -e 's/\.rs$//' | xargs -I "{}" find target/* -name "*{}*" -type f -delete
#branches:
# only:
# # release tags
# - /^v\d+\.\d+\.\d+.*$/
# - master
notifications:
email:
on_success: never

AUTHORS (new file)
@@ -0,0 +1,11 @@
# This is the list of authors of tantivy for copyright purposes.
Paul Masurel
Laurentiu Nicola
Dru Sellers
Ashley Mannix
Michael J. Curry
Jason Wolfe
# As an employee of Google I am required to add Google LLC
# in the list of authors, but this project is not affiliated to Google
# in any other way.
Google LLC

CHANGELOG.md
@@ -1,3 +1,348 @@
Tantivy 0.13.2
===================
Bugfix. Acquiring a facet reader on a segment that does not contain any
doc with this facet returns `None`. (#896)
Tantivy 0.13.1
======================
Made `Query` and `Collector` `Send + Sync`.
Updated misc dependency versions.
Tantivy 0.13.0
======================
Tantivy 0.13 introduces a change in the index format that will require
you to reindex your index (BlockWAND information is added to the skip lists).
The index size increase is minor as this information is only added for
full blocks.
If you have a massive index for which reindexing is not an option, please contact me
so that we can discuss possible solutions.
- Bugfix in `FuzzyTermQuery` not matching terms by prefix when it should (@Peachball)
- Relaxed constraints on the custom/tweak score functions. At the segment level, they can be mut, and they are not required to be Sync + Send.
- `MmapDirectory::open` does not return a `Result` anymore.
- Change in the DocSet and Scorer API. (@fulmicoton).
A freshly created DocSet points directly to its first doc. A sentinel value called TERMINATED marks the end of a DocSet.
`.advance()` returns the new DocId. `Scorer::skip(target)` has been replaced by `Scorer::seek(target)`, which returns the resulting DocId.
As a result, iterating through a DocSet now looks as follows:
```rust
let mut doc = docset.doc();
while doc != TERMINATED {
    // ...
    doc = docset.advance();
}
```
The change made it possible to greatly simplify a lot of the docset's code.
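For illustration, a minimal sketch of the new seek semantics (not from the original changelog; it assumes the 0.13 re-exports of `DocSet`, `DocId`, and `TERMINATED`):
```rust
use tantivy::{DocId, DocSet, TERMINATED};

// Position `docset` on the first doc >= `target`, if any.
fn seek_to(docset: &mut dyn DocSet, target: DocId) -> Option<DocId> {
    let doc = docset.seek(target); // returns the new DocId, or TERMINATED
    if doc == TERMINATED {
        None
    } else {
        Some(doc)
    }
}
```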
- Misc internal optimization and introduction of the `Scorer::for_each_pruning` function. (@fulmicoton)
- Added an offset option to the Top(.*)Collectors. (@robyoung)
- Added Block WAND. Performance on TOP-K on term-unions should be greatly increased. (@fulmicoton, and special thanks
to the PISA team for answering all my questions!)
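A hedged sketch of the new offset option on the top-docs collector (a `searcher` and `query` are assumed to be in scope):
```rust
use tantivy::collector::TopDocs;

// Skip the first 20 hits and collect hits 21 through 30.
let collector = TopDocs::with_limit(10).and_offset(20);
let top_docs = searcher.search(&query, &collector)?;
```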
Tantivy 0.12.0
======================
- Removing static dispatch in tokenizers for simplicity. (#762)
- Added backward iteration for `TermDictionary` stream. (@halvorboe)
- Fixed a performance issue when searching for the posting lists of a missing term (@audunhalland)
- Added a configurable maximum number of docs (10M by default) for a segment to be considered for merge (@hntd187, landed by @halvorboe #713)
- Important bugfix for #777, which caused tantivy to retain memory mappings. (diagnosed by @poljar)
- Added support for field boosting. (#547, @fulmicoton)
## How to update?
Crates relying on custom tokenizers, or registering tokenizers in the manager, will require some
minor changes. See https://github.com/tantivy-search/tantivy/blob/master/examples/custom_tokenizer.rs
for a code sample.
Tantivy 0.11.3
=======================
- Fixed DateTime as a fast field (#735)
Tantivy 0.11.2
=======================
- The future returned by `IndexWriter::merge` does not borrow `self` mutably anymore (#732)
- Exposing a constructor for `WatchHandle` (#731)
Tantivy 0.11.1
=====================
- Bug fix #729
Tantivy 0.11.0
=====================
- Added f64 field. Internally reuses u64 code the same way i64 does (@fdb-hiroshima)
- Various bugfixes in the query parser.
- Better handling of hyphens in query parser. (#609)
- Better handling of whitespaces.
- Closes #498 - add support for Elastic-style unbounded range queries for alphanumeric types eg. "title:>hello", "weight:>=70.5", "height:<200" (@petr-tik)
- API change around `Box<BoxableTokenizer>`. See detail in #629
- Avoid rebuilding Regex automaton whenever a regex query is reused. #639 (@brainlock)
- Add footer with some metadata to index files. #605 (@fdb-hiroshima)
- Add a method to check the compatibility of the footer in the index with the running version of tantivy (@petr-tik)
- TopDocs collector: ensure stable sorting on equal score. #671 (@brainlock)
- Added handling of pre-tokenized text fields (#642), which will enable users to
load tokens created outside tantivy. See usage in examples/pre_tokenized_text. (@kkoziara)
- Fix crash when committing multiple times with deleted documents. #681 (@brainlock)
## How to update?
- The index format is changed. You are required to reindex your data to use tantivy 0.11.
- `Box<dyn BoxableTokenizer>` has been replaced by a `BoxedTokenizer` struct.
- Regexes are now compiled when the `RegexQuery` instance is built. As a result, building one can now return
an error, and handling the `Result` is required.
- `tantivy::version()` now returns a `Version` object. This object implements `ToString`.
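For instance, a minimal sketch of building a `RegexQuery` under the new API (the field is an assumption, not from the changelog):
```rust
use tantivy::query::RegexQuery;
use tantivy::schema::Field;

fn build_regex_query(field: Field) -> tantivy::Result<RegexQuery> {
    // The pattern is compiled here, so an invalid pattern fails fast.
    RegexQuery::from_pattern("tan.*vy", field)
}
```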
Tantivy 0.10.2
=====================
- Closes #656, solving a memory leak.
Tantivy 0.10.1
=====================
- Closes #544. A few users experienced problems with the directory watching system.
The mmap directory is no longer watched until a reader that relies on
this functionality is actually created.
Tantivy 0.10.0
=====================
*Tantivy 0.10.0 index format is compatible with the index format in 0.9.0.*
- Added an API to easily tweak or entirely replace the
default score. See `TopDocs::tweak_score` and `TopDocs::custom_score` (@pmasurel)
- Added an ASCII folding filter (@drusellers)
- Bugfix in `query.count` in presence of deletes (@pmasurel)
- Added `.explain(...)` to `Query` and `Weight` (@pmasurel)
- Added an efficient way to `delete_all_documents` in `IndexWriter` (@petr-tik).
All segments are simply removed.
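A hedged sketch of the workflow (the heap size and error handling are illustrative, not prescribed by the changelog):
```rust
// Remove every document from the index; all segments are dropped on commit.
let mut index_writer = index.writer(50_000_000)?;
index_writer.delete_all_documents()?;
index_writer.commit()?;
```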
Minor
---------
- Switched to Rust 2018 (@uvd)
- Small simplification of the code.
Calling .freq() or .doc() when .advance() has never been called
on segment postings should panic from now on.
- Tokens exceeding `u16::max_value() - 4` chars are discarded silently instead of panicking.
- Fast fields are now preloaded when the `SegmentReader` is created.
- `IndexMeta` is now public. (@hntd187)
- `IndexWriter` is now `Sync`, making it possible to use it with an `Arc<RwLock<IndexWriter>>`.
`add_document` and `delete_term` only require a read lock. (@pmasurel)
- Introducing `Opstamp` as an expressive type alias for `u64`. (@petr-tik)
- Stamper now relies on `AtomicU64` on all platforms (@petr-tik)
- Bugfix - Files get deleted slightly earlier
- Compilation resources improved (@fdb-hiroshima)
## How to update?
Your program should be usable as is.
### Fast fields
Fast fields used to be accessed directly from the `SegmentReader`.
The API changed: you are now required to acquire your fast field reader via
`segment_reader.fast_fields()` and use one of the typed methods:
- `.u64()`, `.i64()` if your field is single-valued;
- `.u64s()`, `.i64s()` if your field is multi-valued;
- `.bytes()` if your field is a bytes fast field.
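A minimal sketch of the new access path (the `popularity` field handle and `doc_id` are assumptions):
```rust
// `segment_reader: &SegmentReader`, `popularity: Field` declared as a u64 fast field.
let fast_fields = segment_reader.fast_fields();
let popularity_reader = fast_fields.u64(popularity).expect("not a u64 fast field");
let value: u64 = popularity_reader.get(doc_id);
```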
Tantivy 0.9.0
=====================
*0.9.0 index format is not compatible with the
previous index format.*
- MAJOR BUGFIX: Some `Mmap` objects were being leaked and would never get released. (@fulmicoton)
- Removed most unsafe code (@fulmicoton)
- Indexer memory footprint improved (VInt compression, inlining of the first block). (@fulmicoton)
- Stemming in other languages possible (@pentlander)
- Segments with no docs are deleted earlier (@barrotsteindev)
- Added grouped add and delete operations.
They are guaranteed to happen together (i.e. they cannot be split by a commit).
In addition, adds are guaranteed to happen on the same segment (see the sketch after this list). (@elbow-jason)
- Removed `INT_STORED` and `INT_INDEXED`. It is now possible to use `STORED` and `INDEXED`
for int fields. (@fulmicoton)
- Added DateTime field (@barrotsteindev)
- Added IndexReader. By default, index is reloaded automatically upon new commits (@fulmicoton)
- SIMD linear search within blocks (@fulmicoton)
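A hedged sketch of grouped operations (the term, document, and import path are assumptions):
```rust
use tantivy::indexer::UserOperation;

// The delete and the add are applied together: no commit can separate them.
let operations = vec![
    UserOperation::Delete(old_id_term),
    UserOperation::Add(updated_doc),
];
index_writer.run(operations);
```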
## How to update?
tantivy 0.9 brought some API breaking changes.
To update from tantivy 0.8, you will need to go through the following steps.
- `schema::INT_INDEXED` and `schema::INT_STORED` should be replaced by `schema::INDEXED` and `schema::STORED`.
- The index does not hold the pool of searchers anymore. You are required to create an intermediary object called
an `IndexReader` for this.
```rust
// Create the reader. You typically need to create one reader for the entire
// lifetime of your program.
let reader = index.reader()?;
// Acquiring a searcher (previously `index.searcher()`) is now written:
let searcher = reader.searcher();
// With the default setting of the reader, you are not required to
// call `index.load_searchers()` anymore.
//
// The IndexReader will pick up that change automatically, regardless
// of whether the update was done in a different process or not.
// If this behavior is not wanted, you can create your reader with
// the `ReloadPolicy::Manual`, and manually decide when to reload the index
// by calling `reader.reload()?`.
```
Tantivy 0.8.2
=====================
Fixing build for x86_64 platforms. (#496)
No need to update from 0.8.1 if tantivy
is building on your platform.
Tantivy 0.8.1
=====================
Hotfix of #476.
Merge was reflecting deletes before the commit was applied.
Thanks @barrotsteindev for reporting the bug.
Tantivy 0.8.0
=====================
*No change in the index format*
- API Breaking change in the collector API. (@jwolfe, @fulmicoton)
- Multithreaded search (@jwolfe, @fulmicoton)
Tantivy 0.7.1
=====================
*No change in the index format*
- Bugfix: NGramTokenizer panics on non-ASCII chars
- Added a space usage API
Tantivy 0.7
=====================
- Skip data for doc ids and positions (@fulmicoton),
greatly improving performance
- Tantivy errors now rely on the failure crate (@drusellers)
- Added support for `AND`, `OR`, `NOT` syntax in addition to the `+`,`-` syntax
- Added a snippet generator with highlight (@vigneshsarma, @fulmicoton)
- Added a `TopFieldCollector` (@pentlander)
Tantivy 0.6.1
=========================
- Bugfix #324. GC was removing files that were still in use
- Added support for parsing AllQuery and RangeQuery via QueryParser
- AllQuery: `*`
- RangeQuery:
- Inclusive `field:[startIncl to endIncl]`
- Exclusive `field:{startExcl to endExcl}`
- Mixed `field:[startIncl to endExcl}` and vice versa
- Unbounded `field:[start to *]`, `field:[* to end]`
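A hedged sketch of parsing the range syntax above (the index and `weight_field` are assumptions):
```rust
use tantivy::query::QueryParser;

let query_parser = QueryParser::for_index(&index, vec![weight_field]);
// Inclusive range query on the `weight` field.
let query = query_parser.parse_query("weight:[10 to 20]")?;
```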
Tantivy 0.6
==========================
Special thanks to @drusellers and @jason-wolfe for their contributions
to this release!
- Removed C code. Tantivy is now pure Rust. (@pmasurel)
- BM25 (@pmasurel)
- Approximate field norms encoded over 1 byte. (@pmasurel)
- Compiles on stable rust (@pmasurel)
- Add &[u8] fastfield for associating arbitrary bytes to each document (@jason-wolfe) (#270)
- Completely uncompressed
- Internally: One u64 fast field for indexes, one fast field for the bytes themselves.
- Add NGram token support (@drusellers)
- Add Stopword Filter support (@drusellers)
- Add a FuzzyTermQuery (@drusellers)
- Add a RegexQuery (@drusellers)
- Various performance improvements (@pmasurel)
Tantivy 0.5.2
===========================
- bugfix #274
- bugfix #280
- bugfix #289
Tantivy 0.5.1
==========================
- bugfix #254 : tantivy failed if no documents in a segment contained a specific field.
Tantivy 0.5
==========================
- Faceting
- RangeQuery
- Configurable tokenization pipeline
- Bugfix in PhraseQuery
- Various query optimisation
- Allowing very large indexes
- 64 bits file address
- Smarter encoding of the `TermInfo` objects
Tantivy 0.4.3
==========================
- Bugfix race condition when deleting files. (#198)
Tantivy 0.4.2
==========================
- Prevent usage of AVX2 instructions (#201)
Tantivy 0.4.1
==========================
- Bugfix for non-indexed fields. (#199)
Tantivy 0.4.0
==========================
- Raised the limit on the number of fields (previously 256) (@fulmicoton)
- Removed u32 fields. They are replaced by u64 and i64 fields (#65) (@fulmicoton)
- Optimized skip in SegmentPostings (#130) (@lnicola)
- Replacing rustc_serialize by serde. Kudos to @KodrAus and @lnicola
- Using error-chain (@KodrAus)
- QueryParser: (@fulmicoton)
- Explicit error returned when searching for a term that is not indexed
- Searching for an int term via the query parser was broken `(age:1)`
- Searching for a non-indexed field returns an explicit Error
- Phrase queries for non-tokenized fields are not tokenized by the query parser.
- Faster/Better indexing (@fulmicoton)
- using murmurhash2
- faster merging
- more memory efficient fast field writer (@lnicola)
- better handling of collisions
- lesser memory usage
- Added API, most notably to iterate over ranges of terms (@fulmicoton)
- Bugfix for segment files not being unmapped on index drop (@fulmicoton)
- Made the doc! macro public (see the sketch after this list) (@fulmicoton)
- Added an alternative implementation of the streaming dictionary (@fulmicoton)
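For illustration, a minimal sketch of the doc! macro (the `title` and `year` field handles are assumptions; 0.4-era code also needs `#[macro_use] extern crate tantivy;`):
```rust
// `title` and `year` are Field handles obtained from the schema.
let document = doc!(
    title => "Frankenstein",
    year => 1818u64
);
```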
Tantivy 0.3.1
==========================
- Expose a method to trigger files garbage collection
Tantivy 0.3
==========================
@@ -5,7 +350,7 @@ Tantivy 0.3
Special thanks to @Kodraus @lnicola @Ameobea @manuel-woelker @celaus
for their contribution to this release.
Thanks also to everyone in tantivy gitter chat
for their advice and company :)
https://gitter.im/tantivy-search/tantivy
@@ -13,12 +358,13 @@ https://gitter.im/tantivy-search/tantivy
Warning:
Tantivy 0.3 is NOT backward compatible with tantivy 0.2
code and index format.
You should not expect backward compatibility before
tantivy 1.0.
New Features
------------
@@ -40,7 +386,7 @@ Thanks to @KodrAus ! (#108)
the natural ordering.
- Building binary targets for tantivy-cli (Thanks to @KodrAus)
- Misc invisible bug fixes, and code cleanup.
- Use

Cargo.toml
@@ -1,64 +1,101 @@
[package]
name = "tantivy"
version = "0.3.0"
version = "0.13.2"
authors = ["Paul Masurel <paul.masurel@gmail.com>"]
build = "build.rs"
license = "MIT"
categories = ["database-implementations", "data-structures"]
description = """Tantivy is a search engine library."""
documentation = "https://tantivy-search.github.io/tantivy/tantivy/index.html"
description = """Search engine library"""
documentation = "https://docs.rs/tantivy/"
homepage = "https://github.com/tantivy-search/tantivy"
repository = "https://github.com/tantivy-search/tantivy"
readme = "README.md"
keywords = ["search", "information", "retrieval"]
edition = "2018"
[dependencies]
byteorder = "1.0"
memmap = "0.4"
lazy_static = "0.2.1"
regex = "0.2"
fst = "0.1.37"
atomicwrites = "0.1.3"
tempfile = "2.1"
rustc-serialize = "0.3"
log = "0.3.6"
combine = "2.2"
tempdir = "0.3"
bincode = "0.5"
libc = {version = "0.2.20", optional=true}
num_cpus = "1.2"
itertools = "0.5.9"
lz4 = "1.20"
bit-set = "0.4.0"
time = "0.1"
uuid = { version = "0.4", features = ["v4", "rustc-serialize"] }
chan = "0.1"
version = "2"
crossbeam = "0.2"
futures = "0.1.9"
futures-cpupool = "0.1.2"
base64 = "0.12"
byteorder = "1"
crc32fast = "1"
once_cell = "1"
regex ={version = "1", default-features = false, features = ["std"]}
tantivy-fst = "0.3"
memmap = {version = "0.7", optional=true}
lz4 = {version="1", optional=true}
snap = "1"
atomicwrites = {version="0.2", optional=true}
tempfile = "3"
log = "0.4"
serde = {version="1", features=["derive"]}
serde_json = "1"
num_cpus = "1"
fs2={version="0.4", optional=true}
levenshtein_automata = "0.2"
notify = {version="4", optional=true}
uuid = { version = "0.8", features = ["v4", "serde"] }
crossbeam = "0.7"
futures = {version = "0.3", features=["thread-pool"] }
owning_ref = "0.4"
stable_deref_trait = "1"
rust-stemmers = "1"
downcast-rs = "1"
tantivy-query-grammar = { version="0.13", path="./query-grammar" }
bitpacking = {version="0.8", default-features = false, features=["bitpacker4x"]}
census = "0.4"
fnv = "1"
owned-read = "0.4"
failure = "0.1"
htmlescape = "0.3"
fail = "0.4"
murmurhash32 = "0.2"
chrono = "0.4"
smallvec = "1"
rayon = "1"
[target.'cfg(windows)'.dependencies]
winapi = "0.2"
winapi = "0.3"
[dev-dependencies]
rand = "0.3"
env_logger = "0.4"
rand = "0.7"
maplit = "1"
matches = "0.1.8"
proptest = "0.10"
[build-dependencies]
gcc = {version = "0.3", optional=true}
[dev-dependencies.fail]
version = "0.4"
features = ["failpoints"]
[profile.release]
opt-level = 3
debug = false
lto = true
debug-assertions = false
[profile.test]
debug-assertions = true
overflow-checks = true
[features]
default = ["simdcompression"]
simdcompression = ["libc", "gcc"]
default = ["mmap"]
mmap = ["atomicwrites", "fs2", "memmap", "notify"]
lz4-compression = ["lz4"]
failpoints = ["fail/failpoints"]
unstable = [] # useful for benches.
wasm-bindgen = ["uuid/wasm-bindgen"]
scoref64 = [] # scores are f64 instead of f32. was introduced to debug blockwand.
[workspace]
members = ["query-grammar"]
[badges]
travis-ci = { repository = "tantivy-search/tantivy" }
# Following the "fail" crate best practises, we isolate
# tests that define specific behavior in fail check points
# in a different binary.
#
# We do that because, fail rely on a global definition of
# failpoints behavior and hence, it is incompatible with
# multithreading.
[[test]]
name = "failpoints"
path = "tests/failpoints/mod.rs"
required-features = ["fail/failpoints"]

LICENSE
@@ -1,4 +1,4 @@
Copyright (c) 2016 Paul Masurel
Copyright (c) 2018 by the project authors, as listed in the AUTHORS file.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

Makefile (new file)
@@ -0,0 +1,3 @@
test:
	echo "Run test only... No examples."
	cargo test --tests --lib

README.md
@@ -1,62 +1,140 @@
![Tantivy](https://tantivy-search.github.io/logo/tantivy-logo.png)
[![Build Status](https://travis-ci.org/tantivy-search/tantivy.svg?branch=master)](https://travis-ci.org/tantivy-search/tantivy)
[![Coverage Status](https://coveralls.io/repos/github/tantivy-search/tantivy/badge.svg?branch=master&refresh1)](https://coveralls.io/github/tantivy-search/tantivy?branch=master)
[![codecov](https://codecov.io/gh/tantivy-search/tantivy/branch/master/graph/badge.svg)](https://codecov.io/gh/tantivy-search/tantivy)
[![Join the chat at https://gitter.im/tantivy-search/tantivy](https://badges.gitter.im/tantivy-search/tantivy.svg)](https://gitter.im/tantivy-search/tantivy?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Build status](https://ci.appveyor.com/api/projects/status/r7nb13kj23u8m9pj?svg=true)](https://ci.appveyor.com/project/fulmicoton/tantivy)
![beacon for google analytics](https://ga-beacon.appspot.com/UA-88834340-1/tantivy/README)
[![Build status](https://ci.appveyor.com/api/projects/status/r7nb13kj23u8m9pj/branch/master?svg=true)](https://ci.appveyor.com/project/fulmicoton/tantivy/branch/master)
[![Crates.io](https://img.shields.io/crates/v/tantivy.svg)](https://crates.io/crates/tantivy)
[![Say Thanks!](https://img.shields.io/badge/Say%20Thanks-!-1EAEDB.svg)](https://saythanks.io/to/fulmicoton)
**Tantivy** is a **full text search engine library** written in rust.
![Tantivy](https://tantivy-search.github.io/logo/tantivy-logo.png)
It is strongly inspired by Lucene's design.
[![](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/images/0)](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/links/0)
[![](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/images/1)](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/links/1)
[![](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/images/2)](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/links/2)
[![](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/images/3)](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/links/3)
[![](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/images/4)](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/links/4)
[![](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/images/5)](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/links/5)
[![](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/images/6)](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/links/6)
[![](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/images/7)](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/links/7)
[![Become a patron](https://c5.patreon.com/external/logo/become_a_patron_button.png)](https://www.patreon.com/fulmicoton)
**Tantivy** is a **full text search engine library** written in Rust.
It is closer to [Apache Lucene](https://lucene.apache.org/) than to [Elasticsearch](https://www.elastic.co/products/elasticsearch) or [Apache Solr](https://lucene.apache.org/solr/) in the sense that it is not
an off-the-shelf search engine server, but rather a crate that can be used
to build such a search engine.
Tantivy is, in fact, strongly inspired by Lucene's design.
# Benchmark
The following [benchmark](https://tantivy-search.github.io/bench/) breaks down
performance for different types of queries / collections.
In general, Tantivy tends to be
- slower than Lucene on unions with a top-K due to the Block-WAND optimization.
- faster than Lucene on intersections and phrase queries.
Your mileage WILL vary depending on the nature of queries and their load.
# Features
- configurable indexing (optional term frequency and position indexing)
- tf-idf scoring
- Basic query language
- Phrase queries
- Full-text search
- Configurable tokenizer (stemming available for 17 Latin languages with third party support for Chinese ([tantivy-jieba](https://crates.io/crates/tantivy-jieba) and [cang-jie](https://crates.io/crates/cang-jie)), Japanese ([lindera](https://github.com/lindera-morphology/lindera-tantivy) and [tantivy-tokenizer-tiny-segmenter](https://crates.io/crates/tantivy-tokenizer-tiny-segmenter)) and Korean ([lindera](https://github.com/lindera-morphology/lindera-tantivy) + [lindera-ko-dic-builder](https://github.com/lindera-morphology/lindera-ko-dic-builder)))
- Fast (check out the :racehorse: :sparkles: [benchmark](https://tantivy-search.github.io/bench/) :sparkles: :racehorse:)
- Tiny startup time (<10ms), perfect for command line tools
- BM25 scoring (the same as Lucene)
- Natural query language (e.g. `(michael AND jackson) OR "king of pop"`)
- Phrase queries search (e.g. `"michael jackson"`)
- Incremental indexing
- Multithreaded indexing (indexing English Wikipedia takes 4 minutes on my desktop)
- mmap based
- optional SIMD integer compression
- u32 fast fields (equivalent of doc values in Lucene)
- Multithreaded indexing (indexing English Wikipedia takes < 3 minutes on my desktop)
- Mmap directory
- SIMD integer compression when the platform/CPU includes the SSE2 instruction set
- Single valued and multivalued u64, i64, and f64 fast fields (equivalent of doc values in Lucene)
- `&[u8]` fast fields
- Text, i64, u64, f64, dates, and hierarchical facet fields
- LZ4 compressed document store
- Range queries
- Faceted search
- Configurable indexing (optional term frequency and position indexing)
- Cheesy logo with a horse
Tantivy supports Linux, MacOS and Windows.
## Non-features
- Distributed search is out of the scope of Tantivy. That being said, Tantivy is a
library upon which one could build a distributed search engine. Serializable/mergeable collector state, for instance,
is within the scope of Tantivy.
# Getting started
- [tantivy's usage example](http://fulmicoton.com/tantivy-examples/simple_search.html)
- [tantivy-cli and its tutorial](https://github.com/tantivy-search/tantivy-cli).
It will walk you through getting a wikipedia search engine up and running in a few minutes.
- [reference doc]
- [For the last released version](https://docs.rs/tantivy/)
- [For the last master branch](https://tantivy-search.github.io/tantivy/tantivy/index.html)
Tantivy works on stable Rust (>= 1.27) and supports Linux, MacOS, and Windows.
# Compiling
- [Tantivy's simple search example](https://tantivy-search.github.io/examples/basic_search.html)
- [tantivy-cli and its tutorial](https://github.com/tantivy-search/tantivy-cli) - `tantivy-cli` is an actual command line interface that makes it easy for you to create a search engine,
index documents, and search via the CLI or a small server with a REST API.
It walks you through getting a wikipedia search engine up and running in a few minutes.
- [Reference doc for the last released version](https://docs.rs/tantivy/)
Tantivy requires Rust Nightly because it requires the features [`box_syntax`](https://doc.rust-lang.org/stable/book/box-syntax-and-patterns.html), [`optin_builtin_traits`](https://github.com/rust-lang/rfcs/blob/master/text/0019-opt-in-builtin-traits.md), and [`conservative_impl_trait`](https://github.com/rust-lang/rfcs/blob/master/text/1522-conservative-impl-trait.md).
By default, `tantivy` uses a git submodule called `simdcomp`.
After cloning the repository, you will need to initialize and update
the submodules. The project can then be built using `cargo`.
# How can I support this project?
git clone git@github.com:tantivy-search/tantivy.git
There are many ways to support this project.
- Use Tantivy and tell us about your experience on [Gitter](https://gitter.im/tantivy-search/tantivy) or by email (paul.masurel@gmail.com)
- Report bugs
- Write a blog post
- Help with documentation by asking questions or submitting PRs
- Contribute code (you can join [our Gitter](https://gitter.im/tantivy-search/tantivy))
- Talk about Tantivy around you
- Drop a word on [![Say Thanks!](https://img.shields.io/badge/Say%20Thanks-!-1EAEDB.svg)](https://saythanks.io/to/fulmicoton) or even [![Become a patron](https://c5.patreon.com/external/logo/become_a_patron_button.png)](https://www.patreon.com/fulmicoton)
# Contributing code
We use the GitHub Pull Request workflow: reference a GitHub ticket and/or include a comprehensive commit message when opening a PR.
## Clone and build locally
Tantivy compiles on stable Rust but requires `Rust >= 1.27`.
To check out and run tests, you can simply run:
```bash
git clone https://github.com/tantivy-search/tantivy.git
cd tantivy
cargo build
```
## Run tests
Alternatively, if you are trying to compile `tantivy` without simd compression,
you can disable this functionality. In this case, this submodule is not required
and you can compile tantivy by using the `--no-default-features` flag.
Some tests will not run with just `cargo test` because of `fail-rs`.
To run the tests exhaustively, run `./run-tests.sh`.
cargo build --no-default-features
## Debug
You might find it useful to step through the program with a debugger.
# Contribute
### A failing test
Send me an email (paul.masurel at gmail.com) if you want to contribute to tantivy.
Make sure you haven't run `cargo clean` after the most recent `cargo test` or `cargo build` to guarantee that the `target/` directory exists. Use this bash script to find the name of the most recent debug build of Tantivy and run it under `rust-gdb`:
```bash
find target/debug/ -maxdepth 1 -executable -type f -name "tantivy*" -printf '%TY-%Tm-%Td %TT %p\n' | sort -r | cut -d " " -f 3 | xargs -I RECENT_DBG_TANTIVY rust-gdb RECENT_DBG_TANTIVY
```
Now that you are in `rust-gdb`, you can set breakpoints on lines and methods that match your source code and run the debug executable with flags that you normally pass to `cargo test` like this:
```bash
$gdb run --test-threads 1 --test $NAME_OF_TEST
```
### An example
By default, `rustc` compiles everything in the `examples/` directory in debug mode. This makes it easy for you to make examples to reproduce bugs:
```bash
rust-gdb target/debug/examples/$EXAMPLE_NAME
$ gdb run
```

appveyor.yml
@@ -4,11 +4,8 @@
os: Visual Studio 2015
environment:
matrix:
- channel: nightly
- channel: stable
target: x86_64-pc-windows-msvc
- channel: nightly
target: x86_64-pc-windows-gnu
msys_bits: 64
install:
- appveyor DownloadFile https://win.rustup.rs/ -FileName rustup-init.exe
@@ -21,4 +18,5 @@ install:
build: false
test_script:
- REM SET RUST_LOG=tantivy,test & cargo test --verbose
- REM SET RUST_LOG=tantivy,test & cargo test --all --verbose --no-default-features --features mmap
- REM SET RUST_BACKTRACE=1 & cargo build --examples

build.rs (deleted)
@@ -1,50 +0,0 @@
#[cfg(feature = "simdcompression")]
mod build {
    extern crate gcc;

    pub fn build() {
        let mut config = gcc::Config::new();
        config.include("./cpp/simdcomp/include")
            .file("cpp/simdcomp/src/avxbitpacking.c")
            .file("cpp/simdcomp/src/simdintegratedbitpacking.c")
            .file("cpp/simdcomp/src/simdbitpacking.c")
            .file("cpp/simdcomp/src/simdpackedsearch.c")
            .file("cpp/simdcomp/src/simdcomputil.c")
            .file("cpp/simdcomp/src/simdpackedselect.c")
            .file("cpp/simdcomp/src/simdfor.c")
            .file("cpp/simdcomp_wrapper.c");
        if !cfg!(debug_assertions) {
            config.opt_level(3);
            if cfg!(target_env = "msvc") {
                config.define("NDEBUG", None)
                    .flag("/Gm-")
                    .flag("/GS-")
                    .flag("/Gy")
                    .flag("/Oi")
                    .flag("/GL");
            } else {
                config.flag("-msse4.1")
                    .flag("-march=native");
            }
        }
        config.compile("libsimdcomp.a");
        // Workaround for linking static libraries built with /GL
        // https://github.com/rust-lang/rust/issues/26003
        if !cfg!(debug_assertions) && cfg!(target_env = "msvc") {
            println!("cargo:rustc-link-lib=dylib=simdcomp");
        }
    }
}

#[cfg(not(feature = "simdcompression"))]
mod build {
    pub fn build() {}
}

fn main() {
    build::build();
}

ci/before_deploy.ps1 (new file)
@@ -0,0 +1,23 @@
# This script takes care of packaging the build artifacts that will go in the
# release zipfile
$SRC_DIR = $PWD.Path
$STAGE = [System.Guid]::NewGuid().ToString()
Set-Location $ENV:Temp
New-Item -Type Directory -Name $STAGE
Set-Location $STAGE
$ZIP = "$SRC_DIR\$($Env:CRATE_NAME)-$($Env:APPVEYOR_REPO_TAG_NAME)-$($Env:TARGET).zip"
# TODO Update this to package the right artifacts
Copy-Item "$SRC_DIR\target\$($Env:TARGET)\release\hello.exe" '.\'
7z a "$ZIP" *
Push-AppveyorArtifact "$ZIP"
Remove-Item *.* -Force
Set-Location ..
Remove-Item $STAGE
Set-Location $SRC_DIR

ci/before_deploy.sh (new file)
@@ -0,0 +1,33 @@
# This script takes care of building your crate and packaging it for release
set -ex
main() {
local src=$(pwd) \
stage=
case $TRAVIS_OS_NAME in
linux)
stage=$(mktemp -d)
;;
osx)
stage=$(mktemp -d -t tmp)
;;
esac
test -f Cargo.lock || cargo generate-lockfile
# TODO Update this to build the artifacts that matter to you
cross rustc --bin hello --target $TARGET --release -- -C lto
# TODO Update this to package the right artifacts
cp target/$TARGET/release/hello $stage/
cd $stage
tar czf $src/$CRATE_NAME-$TRAVIS_TAG-$TARGET.tar.gz *
cd $src
rm -rf $stage
}
main

ci/install.sh (new file)
@@ -0,0 +1,47 @@
set -ex
main() {
local target=
if [ $TRAVIS_OS_NAME = linux ]; then
target=x86_64-unknown-linux-musl
sort=sort
else
target=x86_64-apple-darwin
sort=gsort # for `sort --sort-version`, from brew's coreutils.
fi
# Builds for iOS are done on OSX, but require the specific target to be
# installed.
case $TARGET in
aarch64-apple-ios)
rustup target install aarch64-apple-ios
;;
armv7-apple-ios)
rustup target install armv7-apple-ios
;;
armv7s-apple-ios)
rustup target install armv7s-apple-ios
;;
i386-apple-ios)
rustup target install i386-apple-ios
;;
x86_64-apple-ios)
rustup target install x86_64-apple-ios
;;
esac
# This fetches latest stable release
local tag=$(git ls-remote --tags --refs --exit-code https://github.com/japaric/cross \
| cut -d/ -f3 \
| grep -E '^v[0.1.0-9.]+$' \
| $sort --version-sort \
| tail -n1)
curl -LSfs https://japaric.github.io/trust/install.sh | \
sh -s -- \
--force \
--git japaric/cross \
--tag $tag \
--target $target
}
main

ci/script.sh (new file)
@@ -0,0 +1,30 @@
#!/usr/bin/env bash
# This script takes care of testing your crate
set -ex
main() {
if [ ! -z $CODECOV ]; then
echo "Codecov"
cargo build --verbose && cargo coverage --verbose --all && bash <(curl -s https://codecov.io/bash) -s target/kcov
else
echo "Build"
cross build --target $TARGET
if [ ! -z $DISABLE_TESTS ]; then
return
fi
echo "Test"
cross test --target $TARGET --no-default-features --features mmap
cross test --target $TARGET --no-default-features --features mmap query-grammar
fi
for example in $(ls examples/*.rs)
do
cargo run --example $(basename $example .rs)
done
}
# we don't run the "test phase" when doing deploys
if [ -z $TRAVIS_TAG ]; then
main
fi

cpp/simdcomp/.gitignore (deleted)
@@ -1,9 +0,0 @@
Makefile.in
lib*
unit*
*.o
src/*.lo
src/*.o
src/.deps
src/.dirstamp
src/.libs

cpp/simdcomp/.travis.yml (deleted)
@@ -1,11 +0,0 @@
language: c
sudo: false
compiler:
- gcc
- clang
branches:
only:
- master
script: make && ./unit

cpp/simdcomp/CHANGELOG (deleted)
@@ -1,9 +0,0 @@
Upcoming
- added missing include
- improved portability (MSVC)
- implemented C89 compatibility
Version 0.0.3 (19 May 2014)
- improved documentation
Version 0.0.2 (6 February 2014)
- added go demo
Version 0.0.1 (5 February 2014)

cpp/simdcomp/LICENSE (deleted)
@@ -1,27 +0,0 @@
Copyright (c) 2014--, The authors
All rights reserved.
Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice, this
list of conditions and the following disclaimer in the documentation and/or
other materials provided with the distribution.
* Neither the name of the {organization} nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

cpp/simdcomp/README.md (deleted)
@@ -1,137 +0,0 @@
The SIMDComp library
====================
[![Build Status](https://travis-ci.org/lemire/simdcomp.png)](https://travis-ci.org/lemire/simdcomp)
A simple C library for compressing lists of integers using binary packing and SIMD instructions.
The assumption is either that you have a list of 32-bit integers where most of them are small, or a list of 32-bit integers where differences between successive integers are small. No software is able to reliably compress an array of 32-bit random numbers.
This library can decode at least 4 billion compressed integers per second on most
desktop or laptop processors. That is, it can decompress data at a rate of 15 GB/s.
This is significantly faster than generic codecs like gzip, LZO, Snappy or LZ4.
On a Skylake Intel processor, it can decode integers at a rate of 0.3 cycles per integer,
which easily translates into more than 8 billion decoded integers per second.
Contributors: Daniel Lemire, Nathan Kurz, Christoph Rupp, Anatol Belski, Nick White and others
What is it for?
-------------
This is a low-level library for fast integer compression. By design it does not define a compressed
format. It is up to the (sophisticated) user to create a compressed format.
Requirements
-------------
- Your processor should support SSE4.1 (It is supported by most Intel and AMD processors released since 2008.)
- It is possible to build the core part of the code if your processor supports SSE2 (Pentium4 or better)
- C99 compliant compiler (GCC is assumed)
- A Linux-like distribution is assumed by the makefile
For a plain C version that does not use SIMD instructions, see https://github.com/lemire/LittleIntPacker
Usage
-------
Compression works over blocks of 128 integers.
For a complete working example, see example.c (you can build it and
run it with "make example; ./example").
1) Lists of integers in random order.
```C
const uint32_t b = maxbits(datain);// computes bit width
simdpackwithoutmask(datain, buffer, b);//compressed to buffer, compressing 128 32-bit integers down to b*32 bytes
simdunpack(buffer, backbuffer, b);//uncompressed to backbuffer
```
While 128 32-bit integers are read, only b 128-bit words are written. Thus, the compression ratio is 32/b.
2) Sorted lists of integers.
We use differential coding: we store the difference between successive integers. For this purpose, we need an initial value (called offset).
```C
uint32_t offset = 0;
uint32_t b1 = simdmaxbitsd1(offset,datain); // bit width
simdpackwithoutmaskd1(offset, datain, buffer, b1);//compressing 128 32-bit integers down to b1*32 bytes
simdunpackd1(offset, buffer, backbuffer, b1);//uncompressed
```
General example for arrays of arbitrary length:
```C
int compress_decompress_demo() {
size_t k, N = 9999;
__m128i * endofbuf;
uint32_t * datain = malloc(N * sizeof(uint32_t));
uint8_t * buffer;
uint32_t * backbuffer = malloc(N * sizeof(uint32_t));
uint32_t b;
for (k = 0; k < N; ++k){ /* start with k=0, not k=1! */
datain[k] = k;
}
b = maxbits_length(datain, N);
buffer = malloc(simdpack_compressedbytes(N,b)); // allocate just enough memory
endofbuf = simdpack_length(datain, N, (__m128i *)buffer, b);
/* compressed data is stored between buffer and endofbuf using (endofbuf-buffer)*sizeof(__m128i) bytes */
/* would be safe to do : buffer = realloc(buffer,(endofbuf-(__m128i *)buffer)*sizeof(__m128i)); */
simdunpack_length((const __m128i *)buffer, N, backbuffer, b);
for (k = 0; k < N; ++k){
if(datain[k] != backbuffer[k]) {
printf("bug\n");
return -1;
}
}
return 0;
}
```
3) Frame-of-Reference
We also have frame-of-reference (FOR) functions (see simdfor.h header). They work like the bit packing
routines, but do not use differential coding so they allow faster search in some cases, at the expense
of compression.
Setup
---------
make
make test
and if you are daring:
make install
Go
--------
If you are a go user, there is a "go" folder where you will find a simple demo.
Other libraries
----------------
* Fast decoder for VByte-compressed integers https://github.com/lemire/MaskedVByte
* Fast integer compression in C using StreamVByte https://github.com/lemire/streamvbyte
* FastPFOR is a C++ research library well suited to compress unsorted arrays: https://github.com/lemire/FastPFor
* SIMDCompressionAndIntersection is a C++ research library well suited for sorted arrays (differential coding)
and computing intersections: https://github.com/lemire/SIMDCompressionAndIntersection
* TurboPFor is a C library that offers lots of interesting optimizations. Well worth checking! (GPL license) https://github.com/powturbo/TurboPFor
* Oroch is a C++ library that offers a usable API (MIT license) https://github.com/ademakov/Oroch
References
------------
* Daniel Lemire, Leonid Boytsov, Nathan Kurz, SIMD Compression and the Intersection of Sorted Integers, Software Practice & Experience 46 (6) 2016. http://arxiv.org/abs/1401.6399
* Daniel Lemire and Leonid Boytsov, Decoding billions of integers per second through vectorization, Software Practice & Experience 45 (1), 2015. http://arxiv.org/abs/1209.2137 http://onlinelibrary.wiley.com/doi/10.1002/spe.2203/abstract
* Jeff Plaisance, Nathan Kurz, Daniel Lemire, Vectorized VByte Decoding, International Symposium on Web Algorithms 2015, 2015. http://arxiv.org/abs/1503.07387
* Wayne Xin Zhao, Xudong Zhang, Daniel Lemire, Dongdong Shan, Jian-Yun Nie, Hongfei Yan, Ji-Rong Wen, A General SIMD-based Approach to Accelerating Compression Algorithms, ACM Transactions on Information Systems 33 (3), 2015. http://arxiv.org/abs/1502.01916
* T. D. Wu, Bitpacking techniques for indexing genomes: I. Hash tables, Algorithms for Molecular Biology 11 (5), 2016. http://almob.biomedcentral.com/articles/10.1186/s13015-016-0069-5

cpp/simdcomp benchmark source (deleted)
@@ -1,235 +0,0 @@
/**
* This code is released under a BSD License.
*/
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "simdcomp.h"
#ifdef _MSC_VER
# include <windows.h>
__int64 freq;
typedef __int64 time_snap_t;
static time_snap_t time_snap(void)
{
__int64 now;
QueryPerformanceCounter((LARGE_INTEGER *)&now);
return (__int64)((now*1000000)/freq);
}
# define TIME_SNAP_FMT "%I64d"
#else
# define time_snap clock
# define TIME_SNAP_FMT "%lu"
typedef clock_t time_snap_t;
#endif
void benchmarkSelect() {
uint32_t buffer[128];
uint32_t backbuffer[128];
uint32_t initial = 33;
uint32_t b;
time_snap_t S1, S2, S3;
int i;
printf("benchmarking select \n");
/* this test creates delta-encoded buffers with different bit widths, then
 * retrieves each value with the fast select function and compares against a full unpack */
for (b = 0; b <= 32; b++) {
uint32_t prev = initial;
uint32_t out[128];
/* initialize the buffer */
for (i = 0; i < 128; i++) {
buffer[i] = ((uint32_t)(1655765 * i )) ;
if(b < 32) buffer[i] %= (1<<b);
}
for (i = 0; i < 128; i++) {
buffer[i] = buffer[i] + prev;
prev = buffer[i];
}
for (i = 1; i < 128; i++) {
if(buffer[i] < buffer[i-1] )
buffer[i] = buffer[i-1];
}
assert(simdmaxbitsd1(initial, buffer)<=b);
for (i = 0; i < 128; i++) {
out[i] = 0; /* memset would do too */
}
/* delta-encode to 'b' bits */
simdpackwithoutmaskd1(initial, buffer, (__m128i *)out, b);
S1 = time_snap();
for (i = 0; i < 128 * 10; i++) {
uint32_t valretrieved = simdselectd1(initial, (__m128i *)out, b, (uint32_t)i % 128);
assert(valretrieved == buffer[i%128]);
}
S2 = time_snap();
for (i = 0; i < 128 * 10; i++) {
simdunpackd1(initial, (__m128i *)out, backbuffer, b);
assert(backbuffer[i % 128] == buffer[i % 128]);
}
S3 = time_snap();
printf("bit width = %d, fast select function time = " TIME_SNAP_FMT ", naive time = " TIME_SNAP_FMT " \n", b, (S2-S1), (S3-S2));
}
}
int uint32_cmp(const void *a, const void *b)
{
const uint32_t *ia = (const uint32_t *)a;
const uint32_t *ib = (const uint32_t *)b;
if(*ia < *ib)
return -1;
else if (*ia > *ib)
return 1;
return 0;
}
/* adapted from wikipedia */
int binary_search(uint32_t * A, uint32_t key, int imin, int imax)
{
int imid;
imax --;
while(imin + 1 < imax) {
imid = imin + ((imax - imin) / 2);
if (A[imid] > key) {
imax = imid;
} else if (A[imid] < key) {
imin = imid;
} else {
return imid;
}
}
return imax;
}
/* adapted from wikipedia */
int lower_bound(uint32_t * A, uint32_t key, int imin, int imax)
{
int imid;
imax --;
while(imin + 1 < imax) {
imid = imin + ((imax - imin) / 2);
if (A[imid] >= key) {
imax = imid;
} else if (A[imid] < key) {
imin = imid;
}
}
if(A[imin] >= key) return imin;
return imax;
}
void benchmarkSearch() {
uint32_t buffer[128];
uint32_t backbuffer[128];
uint32_t out[128];
uint32_t result, initial = 0;
uint32_t b, i;
time_snap_t S1, S2, S3, S4;
printf("benchmarking search \n");
/* this test creates delta-encoded buffers with different bit widths, then
 * performs lower bound searches for each key */
for (b = 0; b <= 32; b++) {
uint32_t prev = initial;
/* initialize the buffer */
for (i = 0; i < 128; i++) {
buffer[i] = ((uint32_t)rand()) ;
if(b < 32) buffer[i] %= (1<<b);
}
qsort(buffer,128, sizeof(uint32_t), uint32_cmp);
for (i = 0; i < 128; i++) {
buffer[i] = buffer[i] + prev;
prev = buffer[i];
}
for (i = 1; i < 128; i++) {
if(buffer[i] < buffer[i-1] )
buffer[i] = buffer[i-1];
}
assert(simdmaxbitsd1(initial, buffer)<=b);
for (i = 0; i < 128; i++) {
out[i] = 0; /* memset would do too */
}
/* delta-encode to 'b' bits */
simdpackwithoutmaskd1(initial, buffer, (__m128i *)out, b);
simdunpackd1(initial, (__m128i *)out, backbuffer, b);
for (i = 0; i < 128; i++) {
assert(buffer[i] == backbuffer[i]);
}
S1 = time_snap();
for (i = 0; i < 128 * 10; i++) {
int pos;
uint32_t pseudorandomkey = buffer[i%128];
__m128i vecinitial = _mm_set1_epi32(initial);
pos = simdsearchd1(&vecinitial, (__m128i *)out, b,
pseudorandomkey, &result);
if((result < pseudorandomkey) || (buffer[pos] != result)) {
printf("bug A.\n");
} else if (pos > 0) {
if(buffer[pos-1] >= pseudorandomkey)
printf("bug B.\n");
}
}
S2 = time_snap();
for (i = 0; i < 128 * 10; i++) {
int pos;
uint32_t pseudorandomkey = buffer[i%128];
simdunpackd1(initial, (__m128i *)out, backbuffer, b);
pos = lower_bound(backbuffer, pseudorandomkey, 0, 128);
result = backbuffer[pos];
if((result < pseudorandomkey) || (buffer[pos] != result)) {
printf("bug C.\n");
} else if (pos > 0) {
if(buffer[pos-1] >= pseudorandomkey)
printf("bug D.\n");
}
}
S3 = time_snap();
for (i = 0; i < 128 * 10; i++) {
int pos;
uint32_t pseudorandomkey = buffer[i%128];
pos = simdsearchwithlengthd1(initial, (__m128i *)out, b, 128,
pseudorandomkey, &result);
if((result < pseudorandomkey) || (buffer[pos] != result)) {
printf("bug A.\n");
} else if (pos > 0) {
if(buffer[pos-1] >= pseudorandomkey)
printf("bug B.\n");
}
}
S4 = time_snap();
printf("bit width = %d, fast search function time = " TIME_SNAP_FMT ", naive time = " TIME_SNAP_FMT " , fast with length time = " TIME_SNAP_FMT " \n", b, (S2-S1), (S3-S2), (S4-S3) );
}
}
int main() {
#ifdef _MSC_VER
QueryPerformanceFrequency((LARGE_INTEGER *)&freq);
#endif
benchmarkSearch();
benchmarkSelect();
return 0;
}


@@ -1,205 +0,0 @@
#include <stdio.h>
#include <stdlib.h> /* malloc, free, rand */
#include "simdcomp.h"
#define RDTSC_START(cycles) \
do { \
register unsigned cyc_high, cyc_low; \
__asm volatile( \
"cpuid\n\t" \
"rdtsc\n\t" \
"mov %%edx, %0\n\t" \
"mov %%eax, %1\n\t" \
: "=r"(cyc_high), "=r"(cyc_low)::"%rax", "%rbx", "%rcx", "%rdx"); \
(cycles) = ((uint64_t)cyc_high << 32) | cyc_low; \
} while (0)
#define RDTSC_FINAL(cycles) \
do { \
register unsigned cyc_high, cyc_low; \
__asm volatile( \
"rdtscp\n\t" \
"mov %%edx, %0\n\t" \
"mov %%eax, %1\n\t" \
"cpuid\n\t" \
: "=r"(cyc_high), "=r"(cyc_low)::"%rax", "%rbx", "%rcx", "%rdx"); \
(cycles) = ((uint64_t)cyc_high << 32) | cyc_low; \
} while (0)
uint32_t * get_random_array_from_bit_width(uint32_t length, uint32_t bit) {
uint32_t * answer = malloc(sizeof(uint32_t) * length);
uint32_t mask = (uint32_t) ((UINT64_C(1) << bit) - 1);
uint32_t i;
for(i = 0; i < length; ++i) {
answer[i] = rand() & mask;
}
return answer;
}
uint32_t * get_random_array_from_bit_width_d1(uint32_t length, uint32_t bit) {
uint32_t * answer = malloc(sizeof(uint32_t) * length);
uint32_t mask = (uint32_t) ((UINT64_C(1) << bit) - 1);
uint32_t i;
answer[0] = rand() & mask;
for(i = 1; i < length; ++i) {
answer[i] = answer[i-1] + (rand() & mask);
}
return answer;
}
void demo128() {
const uint32_t length = 128;
uint32_t bit;
printf("# --- %s\n", __func__);
printf("# compressing %d integers\n",length);
printf("# format: bit width, pack in cycles per int, unpack in cycles per int\n");
for(bit = 1; bit <= 32; ++bit) {
uint32_t i;
uint32_t * data = get_random_array_from_bit_width(length, bit);
__m128i * buffer = malloc(length * sizeof(uint32_t));
uint32_t * backdata = malloc(length * sizeof(uint32_t));
uint32_t repeat = 500;
uint64_t min_diff;
printf("%d\t",bit);
min_diff = (uint64_t)-1;
for (i = 0; i < repeat; i++) {
uint64_t cycles_start, cycles_final, cycles_diff;
__asm volatile("" ::: /* pretend to clobber */ "memory");
RDTSC_START(cycles_start);
simdpackwithoutmask(data,buffer, bit);
RDTSC_FINAL(cycles_final);
cycles_diff = (cycles_final - cycles_start);
if (cycles_diff < min_diff) min_diff = cycles_diff;
}
printf("%.2f\t",min_diff*1.0/length);
min_diff = (uint64_t)-1;
for (i = 0; i < repeat; i++) {
uint64_t cycles_start, cycles_final, cycles_diff;
__asm volatile("" ::: /* pretend to clobber */ "memory");
RDTSC_START(cycles_start);
simdunpack(buffer, backdata,bit);
RDTSC_FINAL(cycles_final);
cycles_diff = (cycles_final - cycles_start);
if (cycles_diff < min_diff) min_diff = cycles_diff;
}
printf("%.2f\t",min_diff*1.0/length);
free(data);
free(buffer);
free(backdata);
printf("\n");
}
printf("\n\n"); /* two blank lines are required by gnuplot */
}
void demo128_d1() {
const uint32_t length = 128;
uint32_t bit;
printf("# --- %s\n", __func__);
printf("# compressing %d integers\n",length);
printf("# format: bit width, pack in cycles per int, unpack in cycles per int\n");
for(bit = 1; bit <= 32; ++bit) {
uint32_t i;
uint32_t * data = get_random_array_from_bit_width_d1(length, bit);
__m128i * buffer = malloc(length * sizeof(uint32_t));
uint32_t * backdata = malloc(length * sizeof(uint32_t));
uint32_t repeat = 500;
uint64_t min_diff;
printf("%d\t",bit);
min_diff = (uint64_t)-1;
for (i = 0; i < repeat; i++) {
uint64_t cycles_start, cycles_final, cycles_diff;
__asm volatile("" ::: /* pretend to clobber */ "memory");
RDTSC_START(cycles_start);
simdpackwithoutmaskd1(0,data,buffer, bit);
RDTSC_FINAL(cycles_final);
cycles_diff = (cycles_final - cycles_start);
if (cycles_diff < min_diff) min_diff = cycles_diff;
}
printf("%.2f\t",min_diff*1.0/length);
min_diff = (uint64_t)-1;
for (i = 0; i < repeat; i++) {
uint64_t cycles_start, cycles_final, cycles_diff;
__asm volatile("" ::: /* pretend to clobber */ "memory");
RDTSC_START(cycles_start);
simdunpackd1(0,buffer, backdata,bit);
RDTSC_FINAL(cycles_final);
cycles_diff = (cycles_final - cycles_start);
if (cycles_diff < min_diff) min_diff = cycles_diff;
}
printf("%.2f\t",min_diff*1.0/length);
free(data);
free(buffer);
free(backdata);
printf("\n");
}
printf("\n\n"); /* two blank lines are required by gnuplot */
}
#ifdef __AVX2__
void demo256() {
const uint32_t length = 256;
uint32_t bit;
printf("# --- %s\n", __func__);
printf("# compressing %d integers\n",length);
printf("# format: bit width, pack in cycles per int, unpack in cycles per int\n");
for(bit = 1; bit <= 32; ++bit) {
uint32_t i;
uint32_t * data = get_random_array_from_bit_width(length, bit);
__m256i * buffer = malloc(length * sizeof(uint32_t));
uint32_t * backdata = malloc(length * sizeof(uint32_t));
uint32_t repeat = 500;
uint64_t min_diff;
printf("%d\t",bit);
min_diff = (uint64_t)-1;
for (i = 0; i < repeat; i++) {
uint64_t cycles_start, cycles_final, cycles_diff;
__asm volatile("" ::: /* pretend to clobber */ "memory");
RDTSC_START(cycles_start);
avxpackwithoutmask(data,buffer, bit);
RDTSC_FINAL(cycles_final);
cycles_diff = (cycles_final - cycles_start);
if (cycles_diff < min_diff) min_diff = cycles_diff;
}
printf("%.2f\t",min_diff*1.0/length);
min_diff = (uint64_t)-1;
for (i = 0; i < repeat; i++) {
uint64_t cycles_start, cycles_final, cycles_diff;
__asm volatile("" ::: /* pretend to clobber */ "memory");
RDTSC_START(cycles_start);
avxunpack(buffer, backdata,bit);
RDTSC_FINAL(cycles_final);
cycles_diff = (cycles_final - cycles_start);
if (cycles_diff < min_diff) min_diff = cycles_diff;
}
printf("%.2f\t",min_diff*1.0/length);
free(data);
free(buffer);
free(backdata);
printf("\n");
}
printf("\n\n"); /* two blank lines are required by gnuplot */
}
#endif /* avx 2 */
int main() {
demo128();
demo128_d1();
#ifdef __AVX2__
demo256();
#endif
return 0;
}


@@ -1,195 +0,0 @@
/* Type "make example" to build this example program. */
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include "simdcomp.h"
/**
We provide several different code examples.
**/
/* very simple test to illustrate a simple application */
int compress_decompress_demo() {
size_t k, N = 9999;
__m128i * endofbuf;
int howmanybytes;
float compratio;
uint32_t * datain = malloc(N * sizeof(uint32_t));
uint8_t * buffer;
uint32_t * backbuffer = malloc(N * sizeof(uint32_t));
uint32_t b;
printf("== simple test\n");
for (k = 0; k < N; ++k) { /* start with k=0, not k=1! */
datain[k] = k;
}
b = maxbits_length(datain, N);
buffer = malloc(simdpack_compressedbytes(N,b));
endofbuf = simdpack_length(datain, N, (__m128i *)buffer, b);
howmanybytes = (endofbuf-(__m128i *)buffer)*sizeof(__m128i); /* number of compressed bytes */
compratio = N*sizeof(uint32_t) * 1.0 / howmanybytes;
/* endofbuf points to the end of the compressed data */
buffer = realloc(buffer,(endofbuf-(__m128i *)buffer)*sizeof(__m128i)); /* optional but safe. */
printf("Compressed %d integers down to %d bytes (comp. ratio = %f).\n",(int)N,howmanybytes,compratio);
/* in actual applications b must be stored and retrieved: caller is responsible for that. */
simdunpack_length((const __m128i *)buffer, N, backbuffer, b); /* will return a pointer to endofbuf */
for (k = 0; k < N; ++k) {
if(datain[k] != backbuffer[k]) {
printf("bug at %lu \n",(unsigned long)k);
return -1;
}
}
printf("Code works!\n");
free(datain);
free(buffer);
free(backbuffer);
return 0;
}
/* compresses data from datain to buffer, returns how many bytes written
used below in simple_demo */
size_t compress(uint32_t * datain, size_t length, uint8_t * buffer) {
uint32_t offset;
uint8_t * initout;
size_t k;
if(length/SIMDBlockSize*SIMDBlockSize != length) {
printf("Data length should be a multiple of %i \n",SIMDBlockSize);
}
offset = 0;
initout = buffer;
for(k = 0; k < length / SIMDBlockSize; ++k) {
uint32_t b = simdmaxbitsd1(offset,
datain + k * SIMDBlockSize);
*buffer++ = b;
simdpackwithoutmaskd1(offset, datain + k * SIMDBlockSize, (__m128i *) buffer,
b);
offset = datain[k * SIMDBlockSize + SIMDBlockSize - 1];
buffer += b * sizeof(__m128i);
}
return buffer - initout;
}
/* Another illustration ... */
void simple_demo() {
size_t REPEAT = 10, gap;
size_t N = 1000 * SIMDBlockSize;/* SIMDBlockSize is 128 */
uint32_t * datain = malloc(N * sizeof(uint32_t));
size_t compsize;
clock_t start, end;
uint8_t * buffer = malloc(N * sizeof(uint32_t) + N / SIMDBlockSize); /* output buffer */
uint32_t * backbuffer = malloc(SIMDBlockSize * sizeof(uint32_t));
printf("== simple demo\n");
for (gap = 1; gap <= 243; gap *= 3) {
size_t k, repeat;
uint32_t offset = 0;
uint32_t bogus = 0;
double numberofseconds;
printf("\n");
printf(" gap = %lu \n", (unsigned long) gap);
datain[0] = 0;
for (k = 1; k < N; ++k)
datain[k] = datain[k-1] + ( rand() % (gap + 1) );
compsize = compress(datain,N,buffer);
printf("compression ratio = %f \n", (N * sizeof(uint32_t))/ (compsize * 1.0 ));
start = clock();
for(repeat = 0; repeat < REPEAT; ++repeat) {
uint8_t * decbuffer = buffer;
offset = 0; /* each pass decodes from the start, so the offset must be reset */
for (k = 0; k * SIMDBlockSize < N; ++k) {
uint8_t b = *decbuffer++;
simdunpackd1(offset, (__m128i *) decbuffer, backbuffer, b);
/* do something here with backbuffer */
bogus += backbuffer[3];
decbuffer += b * sizeof(__m128i);
offset = backbuffer[SIMDBlockSize - 1];
}
}
end = clock();
numberofseconds = (end-start)/(double)CLOCKS_PER_SEC;
printf("decoding speed in million of integers per second %f \n",N*REPEAT/(numberofseconds*1000.0*1000.0));
start = clock();
for(repeat = 0; repeat < REPEAT; ++repeat) {
uint8_t * decbuffer = buffer;
for (k = 0; k * SIMDBlockSize < N; ++k) {
memcpy(backbuffer,decbuffer+k*SIMDBlockSize,SIMDBlockSize*sizeof(uint32_t));
bogus += backbuffer[3] - backbuffer[100];
}
}
end = clock();
numberofseconds = (end-start)/(double)CLOCKS_PER_SEC;
printf("memcpy speed in million of integers per second %f \n",N*REPEAT/(numberofseconds*1000.0*1000.0));
printf("ignore me %i \n",bogus);
printf("All tests are in CPU cache. Avoid out-of-cache decoding in applications.\n");
}
free(buffer);
free(datain);
free(backbuffer);
}
/* Used below in more_sophisticated_demo ... */
size_t varying_bit_width_compress(uint32_t * datain, size_t length, uint8_t * buffer) {
uint8_t * initout;
size_t k;
if(length/SIMDBlockSize*SIMDBlockSize != length) {
printf("Data length should be a multiple of %i \n",SIMDBlockSize);
}
initout = buffer;
for(k = 0; k < length / SIMDBlockSize; ++k) {
uint32_t b = maxbits(datain);
*buffer++ = b;
simdpackwithoutmask(datain, (__m128i *)buffer, b);
datain += SIMDBlockSize;
buffer += b * sizeof(__m128i);
}
return buffer - initout;
}
/* Here we compress the data in blocks of 128 integers with varying bit width */
int varying_bit_width_demo() {
size_t nn = 128 * 2;
uint32_t * datainn = malloc(nn * sizeof(uint32_t));
uint8_t * buffern = malloc(nn * sizeof(uint32_t) + nn / SIMDBlockSize);
uint8_t * initbuffern = buffern;
uint32_t * backbuffern = malloc(nn * sizeof(uint32_t));
size_t k, compsize;
printf("== varying bit-width demo\n");
for(k=0; k<nn; ++k) {
datainn[k] = rand() % (k + 1);
}
compsize = varying_bit_width_compress(datainn,nn,buffern);
printf("encoded size: %u (original size: %u)\n", (unsigned)compsize,
(unsigned)(nn * sizeof(uint32_t)));
for (k = 0; k * SIMDBlockSize < nn; ++k) {
uint32_t b = *buffern;
buffern++;
simdunpack((const __m128i *)buffern, backbuffern + k * SIMDBlockSize, b);
buffern += b * sizeof(__m128i);
}
for (k = 0; k < nn; ++k) {
if(backbuffern[k] != datainn[k]) {
printf("bug\n");
return -1;
}
}
printf("Code works!\n");
free(datainn);
free(initbuffern);
free(backbuffern);
return 0;
}
int main() {
if(compress_decompress_demo() != 0) return -1;
if(varying_bit_width_demo() != 0) return -1;
simple_demo();
return 0;
}


@@ -1,13 +0,0 @@
Simple Go demo
==============
Setup
======
Start by installing the simdcomp library (make && make install).
Then type:
go run test.go


@@ -1,71 +0,0 @@
/////////
// This particular file is in the public domain.
// Author: Daniel Lemire
////////
package main
/*
#cgo LDFLAGS: -lsimdcomp
#include <simdcomp.h>
*/
import "C"
import "fmt"
//////////
// For this demo, we pack and unpack blocks of 128 integers
/////////
func main() {
// I am going to use C types. Alternative might be to use unsafe.Pointer calls, see http://bit.ly/1ndw3W3
// this is our original data
var data [128]C.uint32_t
for i := C.uint32_t(0); i < C.uint32_t(128); i++ {
data[i] = i
}
////////////
// We first pack without differential coding
///////////
// computing how many bits per int. is needed
b := C.maxbits(&data[0])
ratio := 32.0/float64(b)
fmt.Println("Bit width ", b)
fmt.Println(fmt.Sprintf("Compression ratio %f ", ratio))
// we are now going to create a buffer to receive the packed data (each __m128i uses 128 bits)
out := make([] C.__m128i,b)
C.simdpackwithoutmask(&data[0], &out[0], b)
var recovereddata [128]C.uint32_t
C.simdunpack(&out[0],&recovereddata[0],b)
for i := 0; i < 128; i++ {
if data[i] != recovereddata[i] {
fmt.Println("Bug ")
return
}
}
///////////
// Next, we use differential coding
//////////
offset := C.uint32_t(0) // if you pack data from K to K + 128, offset should be the value at K-1. When K = 0, choose a default
b1 := C.simdmaxbitsd1(offset,&data[0])
ratio1 := 32.0/float64(b1)
fmt.Println("Bit width ", b1)
fmt.Println(fmt.Sprintf("Compression ratio %f ", ratio1))
// we are now going to create a buffer to receive the packed data (each __m128i uses 128 bits)
out = make([] C.__m128i,b1)
C.simdpackwithoutmaskd1(offset, &data[0], &out[0], b1)
C.simdunpackd1(offset,&out[0],&recovereddata[0],b1)
for i := 0; i < 128; i++ {
if data[i] != recovereddata[i] {
fmt.Println("Bug ")
return
}
}
fmt.Println("test succesful.")
}


@@ -1,40 +0,0 @@
/**
* This code is released under a BSD License.
*/
#ifndef INCLUDE_AVXBITPACKING_H_
#define INCLUDE_AVXBITPACKING_H_
#ifdef __AVX2__
#include "portability.h"
/* AVX2 is required */
#include <immintrin.h>
/* for memset */
#include <string.h>
#include "simdcomputil.h"
enum{ AVXBlockSize = 256};
/* max integer logarithm over a range of AVXBlockSize integers (256 integers) */
uint32_t avxmaxbits(const uint32_t * begin);
/* reads 256 values from "in", writes "bit" 256-bit vectors to "out" */
void avxpack(const uint32_t * in,__m256i * out, const uint32_t bit);
/* reads 256 values from "in", writes "bit" 256-bit vectors to "out" */
void avxpackwithoutmask(const uint32_t * in,__m256i * out, const uint32_t bit);
/* reads "bit" 256-bit vectors from "in", writes 256 values to "out" */
void avxunpack(const __m256i * in,uint32_t * out, const uint32_t bit);
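/* Hedged usage sketch (names as declared above; data, buf and back are assumed
 * to be suitably sized, illustrative buffers):
 *   uint32_t b = avxmaxbits(data);      // bit width for one block of 256 values
 *   avxpackwithoutmask(data, buf, b);   // buf must provide b 256-bit words
 *   avxunpack(buf, back, b);            // recovers the 256 original values
 */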
#endif /* __AVX2__ */
#endif /* INCLUDE_AVXBITPACKING_H_ */


@@ -1,81 +0,0 @@
/**
* This code is released under a BSD License.
*/
#ifndef SIMDBITCOMPAT_H_
#define SIMDBITCOMPAT_H_
#include <iso646.h> /* mostly for Microsoft compilers */
#include <string.h>
#if SIMDCOMP_DEBUG
# define SIMDCOMP_ALWAYS_INLINE inline
# define SIMDCOMP_NEVER_INLINE
# define SIMDCOMP_PURE
#else
# if defined(__GNUC__)
# if __GNUC__ >= 3
# define SIMDCOMP_ALWAYS_INLINE inline __attribute__((always_inline))
# define SIMDCOMP_NEVER_INLINE __attribute__((noinline))
# define SIMDCOMP_PURE __attribute__((pure))
# else
# define SIMDCOMP_ALWAYS_INLINE inline
# define SIMDCOMP_NEVER_INLINE
# define SIMDCOMP_PURE
# endif
# elif defined(_MSC_VER)
# define SIMDCOMP_ALWAYS_INLINE __forceinline
# define SIMDCOMP_NEVER_INLINE
# define SIMDCOMP_PURE
# else
# ifndef __has_attribute
# define __has_attribute(x) 0 /* fallback for compilers without __has_attribute */
# endif
# if __has_attribute(always_inline)
# define SIMDCOMP_ALWAYS_INLINE inline __attribute__((always_inline))
# else
# define SIMDCOMP_ALWAYS_INLINE inline
# endif
# if __has_attribute(noinline)
# define SIMDCOMP_NEVER_INLINE __attribute__((noinline))
# else
# define SIMDCOMP_NEVER_INLINE
# endif
# if __has_attribute(pure)
# define SIMDCOMP_PURE __attribute__((pure))
# else
# define SIMDCOMP_PURE
# endif
# endif
#endif
#if defined(_MSC_VER) && _MSC_VER < 1600
typedef unsigned int uint32_t;
typedef unsigned char uint8_t;
typedef signed char int8_t;
#else
#include <stdint.h> /* part of Visual Studio 2010 and better, others likely anyway */
#endif
#if defined(_MSC_VER)
#define SIMDCOMP_ALIGNED(x) __declspec(align(x))
#else
#if defined(__GNUC__)
#define SIMDCOMP_ALIGNED(x) __attribute__ ((aligned(x)))
#endif
#endif
#if defined(_MSC_VER)
# include <intrin.h>
/* 64-bit needs extending */
# define SIMDCOMP_CTZ(result, mask) do { \
unsigned long index; \
if (!_BitScanForward(&(index), (mask))) { \
(result) = 32U; \
} else { \
(result) = (uint32_t)(index); \
} \
} while (0)
#else
# define SIMDCOMP_CTZ(result, mask) \
result = __builtin_ctz(mask)
#endif
#endif /* SIMDBITCOMPAT_H_ */


@@ -1,72 +0,0 @@
/**
* This code is released under a BSD License.
*/
#ifndef SIMDBITPACKING_H_
#define SIMDBITPACKING_H_
#include "portability.h"
/* SSE2 is required */
#include <emmintrin.h>
/* for memset */
#include <string.h>
#include "simdcomputil.h"
/***
* Please see example.c for various examples on how to make good use
* of these functions.
*/
/* reads 128 values from "in", writes "bit" 128-bit vectors to "out".
* The input values are masked so that only the least significant "bit" bits are used. */
void simdpack(const uint32_t * in,__m128i * out, const uint32_t bit);
/* reads 128 values from "in", writes "bit" 128-bit vectors to "out".
* The input values are assumed to be less than 1<<bit. */
void simdpackwithoutmask(const uint32_t * in,__m128i * out, const uint32_t bit);
/* reads "bit" 128-bit vectors from "in", writes 128 values to "out" */
void simdunpack(const __m128i * in,uint32_t * out, const uint32_t bit);
/* how many compressed bytes are needed to compress "length" integers using a bit width of "bit" with
the simdpack_length function. */
int simdpack_compressedbytes(int length, const uint32_t bit);
/* like simdpack, but supports an undetermined number of inputs.
 * This is useful if you need to pack an array of integers whose length is not divisible by 128.
 * Returns a pointer to the (advanced) compressed array. Compressed data is stored in the memory location between
 the provided (out) pointer and the returned pointer. */
__m128i * simdpack_length(const uint32_t * in, size_t length, __m128i * out, const uint32_t bit);
/* like simdunpack, but supports an undetermined number of inputs.
 * This is useful if you need to unpack an array of integers whose length is not divisible by 128.
 * Returns a pointer to the (advanced) compressed array. The read compressed data is between the provided
 (in) pointer and the returned pointer. */
const __m128i * simdunpack_length(const __m128i * in, size_t length, uint32_t * out, const uint32_t bit);
/* like simdpack, but supports an undetermined small number of inputs. This is useful if you need to pack less
than 128 integers.
* Note that this function is much slower.
* Returns a pointer to the (advanced) compressed array. Compressed data is stored in the memory location
between the provided (out) pointer and the returned pointer. */
__m128i * simdpack_shortlength(const uint32_t * in, int length, __m128i * out, const uint32_t bit);
/* like simdunpack, but supports an undetermined small number of inputs. This is useful if you need to unpack less
than 128 integers.
* Note that this function is much slower.
* Returns a pointer to the (advanced) compressed array. The read compressed data is between the provided (in)
pointer and the returned pointer. */
const __m128i * simdunpack_shortlength(const __m128i * in, int length, uint32_t * out, const uint32_t bit);
/* given a block of 128 packed values, this function sets the value at index "index" to "value" */
void simdfastset(__m128i * in128, uint32_t b, uint32_t value, size_t index);
#endif /* SIMDBITPACKING_H_ */


@@ -1,22 +0,0 @@
/**
* This code is released under a BSD License.
*/
#ifndef SIMDCOMP_H_
#define SIMDCOMP_H_
#ifdef __cplusplus
extern "C" {
#endif
#include "simdbitpacking.h"
#include "simdcomputil.h"
#include "simdfor.h"
#include "simdintegratedbitpacking.h"
#include "avxbitpacking.h"
#ifdef __cplusplus
} // extern "C"
#endif
#endif


@@ -1,54 +0,0 @@
/**
* This code is released under a BSD License.
*/
#ifndef SIMDCOMPUTIL_H_
#define SIMDCOMPUTIL_H_
#include "portability.h"
/* SSE2 is required */
#include <emmintrin.h>
/* returns the integer logarithm of v (bit width) */
uint32_t bits(const uint32_t v);
/* max integer logarithm over a range of SIMDBlockSize integers (128 integers) */
uint32_t maxbits(const uint32_t * begin);
/* same as maxbits, but we specify the number of integers */
uint32_t maxbits_length(const uint32_t * in,uint32_t length);
enum{ SIMDBlockSize = 128};
/* computes (quickly) the minimal value of 128 values */
uint32_t simdmin(const uint32_t * in);
/* computes (quickly) the minimal value of the specified number of values */
uint32_t simdmin_length(const uint32_t * in, uint32_t length);
#ifdef __SSE4_1__
/* computes (quickly) the minimal and maximal value of the specified number of values */
void simdmaxmin_length(const uint32_t * in, uint32_t length, uint32_t * getmin, uint32_t * getmax);
/* computes (quickly) the minimal and maximal value of the 128 values */
void simdmaxmin(const uint32_t * in, uint32_t * getmin, uint32_t * getmax);
#endif
/* like maxbits over 128 integers (SIMDBlockSize) with provided initial value
and using differential coding */
uint32_t simdmaxbitsd1(uint32_t initvalue, const uint32_t * in);
/* like simdmaxbitsd1, but calculates maxbits over |length| integers
with provided initial value. |length| can be any arbitrary value. */
uint32_t simdmaxbitsd1_length(uint32_t initvalue, const uint32_t * in,
uint32_t length);
#endif /* SIMDCOMPUTIL_H_ */


@@ -1,72 +0,0 @@
/**
* This code is released under a BSD License.
*/
#ifndef INCLUDE_SIMDFOR_H_
#define INCLUDE_SIMDFOR_H_
#include "portability.h"
/* SSE2 is required */
#include <emmintrin.h>
#include "simdcomputil.h"
#include "simdbitpacking.h"
#ifdef __cplusplus
extern "C" {
#endif
/* reads 128 values from "in", writes "bit" 128-bit vectors to "out" */
void simdpackFOR(uint32_t initvalue, const uint32_t * in,__m128i * out, const uint32_t bit);
/* reads "bit" 128-bit vectors from "in", writes 128 values to "out" */
void simdunpackFOR(uint32_t initvalue, const __m128i * in,uint32_t * out, const uint32_t bit);
/* how many compressed bytes are needed to compress "length" integers using a bit width of "bit" with
the simdpackFOR_length function. */
int simdpackFOR_compressedbytes(int length, const uint32_t bit);
/* like simdpackFOR, but supports an undetermined number of inputs.
This is useful if you need to pack less than 128 integers. Note that this function is much slower.
Compressed data is stored in the memory location between
the provided (out) pointer and the returned pointer. */
__m128i * simdpackFOR_length(uint32_t initvalue, const uint32_t * in, int length, __m128i * out, const uint32_t bit);
/* like simdunpackFOR, but supports an undetermined number of inputs.
This is useful if you need to unpack less than 128 integers. Note that this function is much slower.
The read compressed data is between the provided
(in) pointer and the returned pointer. */
const __m128i * simdunpackFOR_length(uint32_t initvalue, const __m128i * in, int length, uint32_t * out, const uint32_t bit);
/* returns the value stored at the specified "slot".
* */
uint32_t simdselectFOR(uint32_t initvalue, const __m128i *in, uint32_t bit,
int slot);
/* given a block of 128 packed values, this function sets the value at index "index" to "value" */
void simdfastsetFOR(uint32_t initvalue, __m128i * in, uint32_t bit, uint32_t value, size_t index);
/* searches "bit" 128-bit vectors from "in" (= length<=128 encoded integers) for the first encoded uint32 value
* which is >= |key|, and returns its position. It is assumed that the values
* stored are in sorted order.
* The encoded key is stored in "*presult".
* The first length decoded integers, ignoring others. If no value is larger or equal to the key,
* length is returned. Length should be no larger than 128.
*
* If no value is larger or equal to the key,
* length is returned */
int simdsearchwithlengthFOR(uint32_t initvalue, const __m128i *in, uint32_t bit,
int length, uint32_t key, uint32_t *presult);
#ifdef __cplusplus
} // extern "C"
#endif
#endif /* INCLUDE_SIMDFOR_H_ */


@@ -1,98 +0,0 @@
/**
* This code is released under a BSD License.
*/
#ifndef SIMD_INTEGRATED_BITPACKING_H
#define SIMD_INTEGRATED_BITPACKING_H
#include "portability.h"
/* SSE2 is required */
#include <emmintrin.h>
#include "simdcomputil.h"
#include "simdbitpacking.h"
#ifdef __cplusplus
extern "C" {
#endif
/* reads 128 values from "in", writes "bit" 128-bit vectors to "out"
integer values should be in sorted order (for best results).
The differences are masked so that only the least significant "bit" bits are used. */
void simdpackd1(uint32_t initvalue, const uint32_t * in,__m128i * out, const uint32_t bit);
/* reads 128 values from "in", writes "bit" 128-bit vectors to "out"
integer values should be in sorted order (for best results).
The difference values are assumed to be less than 1<<bit. */
void simdpackwithoutmaskd1(uint32_t initvalue, const uint32_t * in,__m128i * out, const uint32_t bit);
/* reads "bit" 128-bit vectors from "in", writes 128 values to "out" */
void simdunpackd1(uint32_t initvalue, const __m128i * in,uint32_t * out, const uint32_t bit);
/* searches "bit" 128-bit vectors from "in" (= 128 encoded integers) for the first encoded uint32 value
* which is >= |key|, and returns its position. It is assumed that the values
* stored are in sorted order.
* The encoded key is stored in "*presult". If no value is larger or equal to the key,
* 128 is returned. The pointer initOffset is a pointer to the last four value decoded
* (when starting out, this can be a zero vector or initialized with _mm_set1_epi32(init)),
* and the vector gets updated.
**/
int
simdsearchd1(__m128i * initOffset, const __m128i *in, uint32_t bit,
uint32_t key, uint32_t *presult);
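/* Hedged usage sketch (values illustrative; "compressed" holds one packed block):
 *   __m128i off = _mm_set1_epi32(initvalue);
 *   uint32_t found;
 *   int pos = simdsearchd1(&off, compressed, b, key, &found);
 *   // pos indexes the first value >= key and found holds that value (or pos == 128)
 */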
/* searches "bit" 128-bit vectors from "in" (= length<=128 encoded integers) for the first encoded uint32 value
* which is >= |key|, and returns its position. It is assumed that the values
* stored are in sorted order.
* The encoded key is stored in "*presult".
* The first length decoded integers, ignoring others. If no value is larger or equal to the key,
* length is returned. Length should be no larger than 128.
*
* If no value is larger or equal to the key,
* length is returned */
int simdsearchwithlengthd1(uint32_t initvalue, const __m128i *in, uint32_t bit,
int length, uint32_t key, uint32_t *presult);
/* returns the value stored at the specified "slot".
* */
uint32_t simdselectd1(uint32_t initvalue, const __m128i *in, uint32_t bit,
int slot);
/* given a block of 128 packed values, this function sets the value at index "index" to "value";
 * you must somehow know the previous value.
 * Because of differential coding, all following values are incremented by the offset between this new
 * value and the old value...
 * This function is useful if you want to modify the last value.
 */
void simdfastsetd1fromprevious( __m128i * in, uint32_t bit, uint32_t previousvalue, uint32_t value, size_t index);
/* given a block of 128 packed values, this function sets the value at index "index" to "value".
 * This function computes the previous value if needed.
 * Because of differential coding, all following values are incremented by the offset between this new
 * value and the old value...
 * This function is useful if you want to modify the last value.
 */
void simdfastsetd1(uint32_t initvalue, __m128i * in, uint32_t bit, uint32_t value, size_t index);
/* Simply scans the data.
 * The pointer initOffset points to the last four values decoded
 * (when starting out, this can be a zero vector or initialized with _mm_set1_epi32(init);),
 * and the vector gets updated.
 * */
void
simdscand1(__m128i * initOffset, const __m128i *in, uint32_t bit);
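/* Hedged usage sketch: scan one packed block of 128 values, advancing the offset:
 *   __m128i off = _mm_set1_epi32(initvalue);
 *   simdscand1(&off, compressed, b); // off now holds the last four decoded values
 */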
#ifdef __cplusplus
} // extern "C"
#endif
#endif


@@ -1,79 +0,0 @@
# minimalist makefile
.SUFFIXES:
#
.SUFFIXES: .cpp .o .c .h
ifeq ($(DEBUG),1)
CFLAGS = -fPIC -std=c89 -ggdb -msse4.1 -march=native -Wall -Wextra -Wshadow -fsanitize=undefined -fno-omit-frame-pointer -fsanitize=address
else
CFLAGS = -fPIC -std=c89 -O3 -msse4.1 -march=native -Wall -Wextra -Wshadow
endif # debug
LDFLAGS = -shared
LIBNAME=libsimdcomp.so.0.0.3
all: unit unit_chars bitpackingbenchmark $(LIBNAME)
test:
./unit
./unit_chars
install: $(LIBNAME)
cp $(LIBNAME) /usr/local/lib
ln -s /usr/local/lib/$(LIBNAME) /usr/local/lib/libsimdcomp.so
ldconfig
cp $(HEADERS) /usr/local/include
HEADERS=./include/simdbitpacking.h ./include/simdcomputil.h ./include/simdintegratedbitpacking.h ./include/simdcomp.h ./include/simdfor.h ./include/avxbitpacking.h
uninstall:
for h in $(HEADERS) ; do rm /usr/local/$$h; done
rm /usr/local/lib/$(LIBNAME)
rm /usr/local/lib/libsimdcomp.so
ldconfig
OBJECTS= simdbitpacking.o simdintegratedbitpacking.o simdcomputil.o \
simdpackedsearch.o simdpackedselect.o simdfor.o avxbitpacking.o
$(LIBNAME): $(OBJECTS)
$(CC) $(CFLAGS) -o $(LIBNAME) $(OBJECTS) $(LDFLAGS)
avxbitpacking.o: ./src/avxbitpacking.c $(HEADERS)
$(CC) $(CFLAGS) -c ./src/avxbitpacking.c -Iinclude
simdfor.o: ./src/simdfor.c $(HEADERS)
$(CC) $(CFLAGS) -c ./src/simdfor.c -Iinclude
simdcomputil.o: ./src/simdcomputil.c $(HEADERS)
$(CC) $(CFLAGS) -c ./src/simdcomputil.c -Iinclude
simdbitpacking.o: ./src/simdbitpacking.c $(HEADERS)
$(CC) $(CFLAGS) -c ./src/simdbitpacking.c -Iinclude
simdintegratedbitpacking.o: ./src/simdintegratedbitpacking.c $(HEADERS)
$(CC) $(CFLAGS) -c ./src/simdintegratedbitpacking.c -Iinclude
simdpackedsearch.o: ./src/simdpackedsearch.c $(HEADERS)
$(CC) $(CFLAGS) -c ./src/simdpackedsearch.c -Iinclude
simdpackedselect.o: ./src/simdpackedselect.c $(HEADERS)
$(CC) $(CFLAGS) -c ./src/simdpackedselect.c -Iinclude
example: ./example.c $(HEADERS) $(OBJECTS)
$(CC) $(CFLAGS) -o example ./example.c -Iinclude $(OBJECTS)
unit: ./tests/unit.c $(HEADERS) $(OBJECTS)
$(CC) $(CFLAGS) -o unit ./tests/unit.c -Iinclude $(OBJECTS)
bitpackingbenchmark: ./benchmarks/bitpackingbenchmark.c $(HEADERS) $(OBJECTS)
$(CC) $(CFLAGS) -o bitpackingbenchmark ./benchmarks/bitpackingbenchmark.c -Iinclude $(OBJECTS)
benchmark: ./benchmarks/benchmark.c $(HEADERS) $(OBJECTS)
$(CC) $(CFLAGS) -o benchmark ./benchmarks/benchmark.c -Iinclude $(OBJECTS)
dynunit: ./tests/unit.c $(HEADERS) $(LIBNAME)
$(CC) $(CFLAGS) -o dynunit ./tests/unit.c -Iinclude -lsimdcomp
unit_chars: ./tests/unit_chars.c $(HEADERS) $(OBJECTS)
$(CC) $(CFLAGS) -o unit_chars ./tests/unit_chars.c -Iinclude $(OBJECTS)
clean:
rm -f unit *.o $(LIBNAME) example benchmark bitpackingbenchmark dynunit unit_chars


@@ -1,104 +0,0 @@
!IFNDEF MACHINE
!IF "$(PROCESSOR_ARCHITECTURE)"=="AMD64"
MACHINE=x64
!ELSE
MACHINE=x86
!ENDIF
!ENDIF
!IFNDEF DEBUG
DEBUG=no
!ENDIF
!IFNDEF CC
CC=cl.exe
!ENDIF
!IFNDEF AR
AR=lib.exe
!ENDIF
!IFNDEF LINK
LINK=link.exe
!ENDIF
!IFNDEF PGO
PGO=no
!ENDIF
!IFNDEF PGI
PGI=no
!ENDIF
INC = /Iinclude
!IF "$(DEBUG)"=="yes"
CFLAGS = /nologo /MDd /LDd /Od /Zi /D_DEBUG /RTC1 /W3 /GS /Gm
ARFLAGS = /nologo
LDFLAGS = /nologo /debug /nodefaultlib:msvcrt
!ELSE
CFLAGS = /nologo /MD /O2 /Zi /DNDEBUG /W3 /Gm- /GS /Gy /Oi /GL /MP
ARFLAGS = /nologo /LTCG
LDFLAGS = /nologo /LTCG /DYNAMICBASE /incremental:no /debug /opt:ref,icf
!ENDIF
!IF "$(PGI)"=="yes"
LDFLAGS = $(LDFLAGS) /ltcg:pgi
!ENDIF
!IF "$(PGO)"=="yes"
LDFLAGS = $(LDFLAGS) /ltcg:pgo
!ENDIF
LIB_OBJS = simdbitpacking.obj simdintegratedbitpacking.obj simdcomputil.obj \
simdpackedsearch.obj simdpackedselect.obj simdfor.obj
all: lib dll dynunit unit_chars example benchmark
# need some good use case scenario to train the instrumented build
@if "$(PGI)"=="yes" echo Running PGO training
@if "$(PGI)"=="yes" benchmark.exe >nul 2>&1
@if "$(PGI)"=="yes" example.exe >nul 2>&1
$(LIB_OBJS):
$(CC) $(INC) $(CFLAGS) /c src/simdbitpacking.c src/simdintegratedbitpacking.c src/simdcomputil.c \
src/simdpackedsearch.c src/simdpackedselect.c src/simdfor.c
lib: $(LIB_OBJS)
$(AR) $(ARFLAGS) /OUT:simdcomp_a.lib $(LIB_OBJS)
dll: $(LIB_OBJS)
$(LINK) /DLL $(LDFLAGS) /OUT:simdcomp.dll /IMPLIB:simdcomp.lib /DEF:simdcomp.def $(LIB_OBJS)
unit: lib
$(CC) $(INC) $(CFLAGS) /c src/unit.c
$(LINK) $(LDFLAGS) /OUT:unit.exe unit.obj simdcomp_a.lib
dynunit: dll
$(CC) $(INC) $(CFLAGS) /c src/unit.c
$(LINK) $(LDFLAGS) /OUT:unit.exe unit.obj simdcomp.lib
unit_chars: lib
$(CC) $(INC) $(CFLAGS) /c src/unit_chars.c
$(LINK) $(LDFLAGS) /OUT:unit_chars.exe unit_chars.obj simdcomp.lib
example: lib
$(CC) $(INC) $(CFLAGS) /c example.c
$(LINK) $(LDFLAGS) /OUT:example.exe example.obj simdcomp.lib
benchmark: lib
$(CC) $(INC) $(CFLAGS) /c src/benchmark.c
$(LINK) $(LDFLAGS) /OUT:benchmark.exe benchmark.obj simdcomp.lib
clean:
del /Q *.obj
del /Q *.lib
del /Q *.exe
del /Q *.dll
del /Q *.pgc
del /Q *.pgd
del /Q *.pdb


@@ -1,16 +0,0 @@
{
"name": "simdcomp",
"version": "0.0.3",
"repo": "lemire/simdcomp",
"description": "A simple C library for compressing lists of integers",
"license": "BSD-3-Clause",
"src": [
"src/simdbitpacking.c",
"src/simdcomputil.c",
"src/simdintegratedbitpacking.c",
"include/simdbitpacking.h",
"include/simdcomp.h",
"include/simdcomputil.h",
"include/simdintegratedbitpacking.h"
]
}


@@ -1,182 +0,0 @@
#!/usr/bin/env python
import sys
def howmany(bit):
""" how many values are we going to pack? """
return 256
def howmanywords(bit):
return (howmany(bit) * bit + 255)/256
def howmanybytes(bit):
return howmanywords(bit) * 16
print("""
/** code generated by avxpacking.py starts here **/
""")
print("""typedef void (*avxpackblockfnc)(const uint32_t * pin, __m256i * compressed);""")
print("""typedef void (*avxunpackblockfnc)(const __m256i * compressed, uint32_t * pout);""")
def plurial(number):
if(number != 1):
return "s"
else :
return ""
print("")
print("static void avxpackblock0(const uint32_t * pin, __m256i * compressed) {");
print(" (void)compressed;");
print(" (void) pin; /* we consumed {0} 32-bit integer{1} */ ".format(howmany(0),plurial(howmany(0))));
print("}");
print("")
for bit in range(1,33):
print("")
print("/* we are going to pack {0} {1}-bit values, touching {2} 256-bit words, using {3} bytes */ ".format(howmany(bit),bit,howmanywords(bit),howmanybytes(bit)))
print("static void avxpackblock{0}(const uint32_t * pin, __m256i * compressed) {{".format(bit));
print(" const __m256i * in = (const __m256i *) pin;");
print(" /* we are going to touch {0} 256-bit word{1} */ ".format(howmanywords(bit),plurial(howmanywords(bit))));
if(howmanywords(bit) == 1):
print(" __m256i w0;")
else:
print(" __m256i w0, w1;")
if( (bit & (bit-1)) != 0) : print(" __m256i tmp; /* used to store inputs at word boundary */")
oldword = 0
for j in range(howmany(bit)/8):
firstword = j * bit / 32
if(firstword > oldword):
print(" _mm256_storeu_si256(compressed + {0}, w{1});".format(oldword,oldword%2))
oldword = firstword
secondword = (j * bit + bit - 1)/32
firstshift = (j*bit) % 32
if( firstword == secondword):
if(firstshift == 0):
print(" w{0} = _mm256_lddqu_si256 (in + {1});".format(firstword%2,j))
else:
print(" w{0} = _mm256_or_si256(w{0},_mm256_slli_epi32(_mm256_lddqu_si256 (in + {1}) , {2}));".format(firstword%2,j,firstshift))
else:
print(" tmp = _mm256_lddqu_si256 (in + {0});".format(j))
print(" w{0} = _mm256_or_si256(w{0},_mm256_slli_epi32(tmp , {2}));".format(firstword%2,j,firstshift))
secondshift = 32-firstshift
print(" w{0} = _mm256_srli_epi32(tmp,{2});".format(secondword%2,j,secondshift))
print(" _mm256_storeu_si256(compressed + {0}, w{1});".format(secondword,secondword%2))
print("}");
print("")
print("")
print("static void avxpackblockmask0(const uint32_t * pin, __m256i * compressed) {");
print(" (void)compressed;");
print(" (void) pin; /* we consumed {0} 32-bit integer{1} */ ".format(howmany(0),plurial(howmany(0))));
print("}");
print("")
for bit in range(1,33):
print("")
print("/* we are going to pack {0} {1}-bit values, touching {2} 256-bit words, using {3} bytes */ ".format(howmany(bit),bit,howmanywords(bit),howmanybytes(bit)))
print("static void avxpackblockmask{0}(const uint32_t * pin, __m256i * compressed) {{".format(bit));
print(" /* we are going to touch {0} 256-bit word{1} */ ".format(howmanywords(bit),plurial(howmanywords(bit))));
if(howmanywords(bit) == 1):
print(" __m256i w0;")
else:
print(" __m256i w0, w1;")
print(" const __m256i * in = (const __m256i *) pin;");
if(bit < 32): print(" const __m256i mask = _mm256_set1_epi32({0});".format((1<<bit)-1));
def maskfnc(x):
if(bit == 32): return x
return " _mm256_and_si256 ( mask, {0}) ".format(x)
if( (bit & (bit-1)) != 0) : print(" __m256i tmp; /* used to store inputs at word boundary */")
oldword = 0
for j in range(howmany(bit)/8):
firstword = j * bit / 32
if(firstword > oldword):
print(" _mm256_storeu_si256(compressed + {0}, w{1});".format(oldword,oldword%2))
oldword = firstword
secondword = (j * bit + bit - 1)/32
firstshift = (j*bit) % 32
loadstr = maskfnc(" _mm256_lddqu_si256 (in + {0}) ".format(j))
if( firstword == secondword):
if(firstshift == 0):
print(" w{0} = {1};".format(firstword%2,loadstr))
else:
print(" w{0} = _mm256_or_si256(w{0},_mm256_slli_epi32({1} , {2}));".format(firstword%2,loadstr,firstshift))
else:
print(" tmp = {0};".format(loadstr))
print(" w{0} = _mm256_or_si256(w{0},_mm256_slli_epi32(tmp , {2}));".format(firstword%2,j,firstshift))
secondshift = 32-firstshift
print(" w{0} = _mm256_srli_epi32(tmp,{2});".format(secondword%2,j,secondshift))
print(" _mm256_storeu_si256(compressed + {0}, w{1});".format(secondword,secondword%2))
print("}");
print("")
print("static void avxunpackblock0(const __m256i * compressed, uint32_t * pout) {");
print(" (void) compressed;");
print(" memset(pout,0,{0});".format(howmany(0)));
print("}");
print("")
for bit in range(1,33):
print("")
print("/* we packed {0} {1}-bit values, touching {2} 256-bit words, using {3} bytes */ ".format(howmany(bit),bit,howmanywords(bit),howmanybytes(bit)))
print("static void avxunpackblock{0}(const __m256i * compressed, uint32_t * pout) {{".format(bit));
print(" /* we are going to access {0} 256-bit word{1} */ ".format(howmanywords(bit),plurial(howmanywords(bit))));
if(howmanywords(bit) == 1):
print(" __m256i w0;")
else:
print(" __m256i w0, w1;")
print(" __m256i * out = (__m256i *) pout;");
if(bit < 32): print(" const __m256i mask = _mm256_set1_epi32({0});".format((1<<bit)-1));
maskstr = " _mm256_and_si256 ( mask, {0}) "
if (bit == 32) : maskstr = " {0} " # no need
oldword = 0
print(" w0 = _mm256_lddqu_si256 (compressed);")
for j in range(howmany(bit)/8):
firstword = j * bit / 32
secondword = (j * bit + bit - 1)/32
if(secondword > oldword):
print(" w{0} = _mm256_lddqu_si256 (compressed + {1});".format(secondword%2,secondword))
oldword = secondword
firstshift = (j*bit) % 32
firstshiftstr = "_mm256_srli_epi32( w{0} , "+str(firstshift)+") "
if(firstshift == 0):
firstshiftstr =" w{0} " # no need
wfirst = firstshiftstr.format(firstword%2)
if( firstword == secondword):
if(firstshift + bit != 32):
wfirst = maskstr.format(wfirst)
print(" _mm256_storeu_si256(out + {0}, {1});".format(j,wfirst))
else:
secondshift = (32-firstshift)
wsecond = "_mm256_slli_epi32( w{0} , {1} ) ".format((firstword+1)%2,secondshift)
wfirstorsecond = " _mm256_or_si256 ({0},{1}) ".format(wfirst,wsecond)
wfirstorsecond = maskstr.format(wfirstorsecond)
print(" _mm256_storeu_si256(out + {0},\n {1});".format(j,wfirstorsecond))
print("}");
print("")
print("static avxpackblockfnc avxfuncPackArr[] = {")
for bit in range(0,32):
print("&avxpackblock{0},".format(bit))
print("&avxpackblock32")
print("};")
print("static avxpackblockfnc avxfuncPackMaskArr[] = {")
for bit in range(0,32):
print("&avxpackblockmask{0},".format(bit))
print("&avxpackblockmask32")
print("};")
print("static avxunpackblockfnc avxfuncUnpackArr[] = {")
for bit in range(0,32):
print("&avxunpackblock{0},".format(bit))
print("&avxunpackblock32")
print("};")
print("/** code generated by avxpacking.py ends here **/")


@@ -1,152 +0,0 @@
#!/usr/bin/env python3
from math import ceil
print("""
/**
* Blablabla
*
*/
""");
def mask(bit):
return str((1 << bit) - 1)
for length in [32]:
print("""
static __m128i iunpackFOR0(__m128i initOffset, const __m128i * _in , uint32_t * _out) {
__m128i *out = (__m128i*)(_out);
int i;
(void) _in;
for (i = 0; i < 8; ++i) {
_mm_store_si128(out++, initOffset);
_mm_store_si128(out++, initOffset);
_mm_store_si128(out++, initOffset);
_mm_store_si128(out++, initOffset);
}
return initOffset;
}
""")
print("""
static void ipackFOR0(__m128i initOffset , const uint32_t * _in , __m128i * out ) {
(void) initOffset;
(void) _in;
(void) out;
}
""")
for bit in range(1,33):
offsetVar = " initOffset";
print("""
static void ipackFOR"""+str(bit)+"""(__m128i """+offsetVar+""", const uint32_t * _in, __m128i * out) {
const __m128i *in = (const __m128i*)(_in);
__m128i OutReg;
""");
if (bit != 32):
print(" __m128i CurrIn = _mm_load_si128(in);");
print(" __m128i InReg = _mm_sub_epi32(CurrIn, initOffset);");
else:
print(" __m128i InReg = _mm_load_si128(in);");
print(" (void) initOffset;");
inwordpointer = 0
valuecounter = 0
for k in range(ceil((length * bit) / 32)):
if(valuecounter == length): break
for x in range(inwordpointer,32,bit):
if(x!=0) :
print(" OutReg = _mm_or_si128(OutReg, _mm_slli_epi32(InReg, " + str(x) + "));");
else:
print(" OutReg = InReg; ");
if((x+bit>=32) ):
while(inwordpointer<32):
inwordpointer += bit
print(" _mm_store_si128(out, OutReg);");
print("");
if(valuecounter + 1 < length):
print(" ++out;")
inwordpointer -= 32;
if(inwordpointer>0):
print(" OutReg = _mm_srli_epi32(InReg, " + str(bit) + " - " + str(inwordpointer) + ");");
if(valuecounter + 1 < length):
print(" ++in;")
if (bit != 32):
print(" CurrIn = _mm_load_si128(in);");
print(" InReg = _mm_sub_epi32(CurrIn, initOffset);");
else:
print(" InReg = _mm_load_si128(in);");
print("");
valuecounter = valuecounter + 1
if(valuecounter == length): break
assert(valuecounter == length)
print("\n}\n\n""")
for bit in range(1,32):
offsetVar = " initOffset";
print("""\n
static __m128i iunpackFOR"""+str(bit)+"""(__m128i """+offsetVar+""", const __m128i* in, uint32_t * _out) {
""");
print(""" __m128i* out = (__m128i*)(_out);
__m128i InReg = _mm_load_si128(in);
__m128i OutReg;
__m128i tmp;
const __m128i mask = _mm_set1_epi32((1U<<"""+str(bit)+""")-1);
""");
MainText = "";
MainText += "\n";
inwordpointer = 0
valuecounter = 0
for k in range(ceil((length * bit) / 32)):
for x in range(inwordpointer,32,bit):
if(valuecounter == length): break
if (x > 0):
MainText += " tmp = _mm_srli_epi32(InReg," + str(x) +");\n";
else:
MainText += " tmp = InReg;\n";
if(x+bit<32):
MainText += " OutReg = _mm_and_si128(tmp, mask);\n";
else:
MainText += " OutReg = tmp;\n";
if((x+bit>=32) ):
while(inwordpointer<32):
inwordpointer += bit
if(valuecounter + 1 < length):
MainText += " ++in;"
MainText += " InReg = _mm_load_si128(in);\n";
inwordpointer -= 32;
if(inwordpointer>0):
MainText += " OutReg = _mm_or_si128(OutReg, _mm_and_si128(_mm_slli_epi32(InReg, " + str(bit) + "-" + str(inwordpointer) + "), mask));\n\n";
if (bit != 32):
MainText += " OutReg = _mm_add_epi32(OutReg, initOffset);\n";
MainText += " _mm_store_si128(out++, OutReg);\n\n";
MainText += "";
valuecounter = valuecounter + 1
if(valuecounter == length): break
assert(valuecounter == length)
print(MainText)
print(" return initOffset;");
print("\n}\n\n")
print("""
static __m128i iunpackFOR32(__m128i initvalue , const __m128i* in, uint32_t * _out) {
__m128i * mout = (__m128i *)_out;
__m128i invec;
size_t k;
for(k = 0; k < 128/4; ++k) {
invec = _mm_load_si128(in++);
_mm_store_si128(mout++, invec);
}
return invec;
}
""")


@@ -1,40 +0,0 @@
EXPORTS
simdpack
simdpackwithoutmask
simdunpack
bits
maxbits
maxbits_length
simdmin
simdmin_length
simdmaxmin
simdmaxmin_length
simdmaxbitsd1
simdmaxbitsd1_length
simdpackd1
simdpackwithoutmaskd1
simdunpackd1
simdsearchd1
simdsearchwithlengthd1
simdselectd1
simdpackFOR
simdselectFOR
simdsearchwithlengthFOR
simdunpackFOR
simdpack_length
simdpackFOR_length
simdunpackFOR_length
simdpack_shortlength
simdfastsetFOR
simdfastset
simdfastsetd1
simdunpack_length
simdunpack_shortlength
simdscand1
simdfastsetd1fromprevious

File diff suppressed because it is too large

File diff suppressed because it is too large


@@ -1,234 +0,0 @@
/**
* This code is released under a BSD License.
*/
#include "simdcomputil.h"
#ifdef __SSE4_1__
#include <smmintrin.h>
#endif
#include <assert.h>
/* Delta(curr, prev) computes the four lane-wise differences curr[i] - curr[i-1],
 * where the value "before" curr[0] comes from the last lane of prev. */
#define Delta(curr, prev) \
_mm_sub_epi32(curr, \
_mm_or_si128(_mm_slli_si128(curr, 4), _mm_srli_si128(prev, 12)))
/* returns the integer logarithm of v (bit width) */
uint32_t bits(const uint32_t v) {
#ifdef _MSC_VER
unsigned long answer;
if (v == 0) {
return 0;
}
_BitScanReverse(&answer, v);
return answer + 1;
#else
return v == 0 ? 0 : 32 - __builtin_clz(v); /* assume GCC-like compiler if not microsoft */
#endif
}
static uint32_t maxbitas32int(const __m128i accumulator) {
const __m128i _tmp1 = _mm_or_si128(_mm_srli_si128(accumulator, 8), accumulator); /* OR the upper two 32-bit lanes into the lower two */
const __m128i _tmp2 = _mm_or_si128(_mm_srli_si128(_tmp1, 4), _tmp1); /* OR all four lanes together into lane 0 */
uint32_t ans = _mm_cvtsi128_si32(_tmp2);
return bits(ans);
}
SIMDCOMP_PURE uint32_t maxbits(const uint32_t * begin) {
const __m128i* pin = (const __m128i*)(begin);
__m128i accumulator = _mm_loadu_si128(pin);
uint32_t k = 1;
for(; 4*k < SIMDBlockSize; ++k) {
__m128i newvec = _mm_loadu_si128(pin+k);
accumulator = _mm_or_si128(accumulator,newvec);
}
return maxbitas32int(accumulator);
}
static uint32_t orasint(const __m128i accumulator) {
const __m128i _tmp1 = _mm_or_si128(_mm_srli_si128(accumulator, 8), accumulator); /* OR the upper two 32-bit lanes into the lower two */
const __m128i _tmp2 = _mm_or_si128(_mm_srli_si128(_tmp1, 4), _tmp1); /* OR all four lanes together into lane 0 */
return _mm_cvtsi128_si32(_tmp2);
}
#ifdef __SSE4_1__
static uint32_t minasint(const __m128i accumulator) {
const __m128i _tmp1 = _mm_min_epu32(_mm_srli_si128(accumulator, 8), accumulator); /* lane-wise min of upper and lower halves */
const __m128i _tmp2 = _mm_min_epu32(_mm_srli_si128(_tmp1, 4), _tmp1); /* min of all four lanes ends up in lane 0 */
return _mm_cvtsi128_si32(_tmp2);
}
static uint32_t maxasint(const __m128i accumulator) {
const __m128i _tmp1 = _mm_max_epu32(_mm_srli_si128(accumulator, 8), accumulator); /* lane-wise max of upper and lower halves */
const __m128i _tmp2 = _mm_max_epu32(_mm_srli_si128(_tmp1, 4), _tmp1); /* max of all four lanes ends up in lane 0 */
return _mm_cvtsi128_si32(_tmp2);
}
uint32_t simdmin(const uint32_t * in) {
const __m128i* pin = (const __m128i*)(in);
__m128i accumulator = _mm_loadu_si128(pin);
uint32_t k = 1;
for(; 4*k < SIMDBlockSize; ++k) {
__m128i newvec = _mm_loadu_si128(pin+k);
accumulator = _mm_min_epu32(accumulator,newvec);
}
return minasint(accumulator);
}
void simdmaxmin(const uint32_t * in, uint32_t * getmin, uint32_t * getmax) {
const __m128i* pin = (const __m128i*)(in);
__m128i minaccumulator = _mm_loadu_si128(pin);
__m128i maxaccumulator = minaccumulator;
uint32_t k = 1;
for(; 4*k < SIMDBlockSize; ++k) {
__m128i newvec = _mm_loadu_si128(pin+k);
minaccumulator = _mm_min_epu32(minaccumulator,newvec);
maxaccumulator = _mm_max_epu32(maxaccumulator,newvec);
}
*getmin = minasint(minaccumulator);
*getmax = maxasint(maxaccumulator);
}
uint32_t simdmin_length(const uint32_t * in, uint32_t length) {
uint32_t currentmin = 0xFFFFFFFF;
uint32_t lengthdividedby4 = length / 4;
uint32_t offset = lengthdividedby4 * 4;
uint32_t k;
if (lengthdividedby4 > 0) {
const __m128i* pin = (const __m128i*)(in);
__m128i accumulator = _mm_loadu_si128(pin);
k = 1;
for(; 4*k < lengthdividedby4 * 4; ++k) {
__m128i newvec = _mm_loadu_si128(pin+k);
accumulator = _mm_min_epu32(accumulator,newvec);
}
currentmin = minasint(accumulator);
}
for (k = offset; k < length; ++k)
if (in[k] < currentmin)
currentmin = in[k];
return currentmin;
}
void simdmaxmin_length(const uint32_t * in, uint32_t length, uint32_t * getmin, uint32_t * getmax) {
uint32_t lengthdividedby4 = length / 4;
uint32_t offset = lengthdividedby4 * 4;
uint32_t k;
*getmin = 0xFFFFFFFF;
*getmax = 0;
if (lengthdividedby4 > 0) {
const __m128i* pin = (const __m128i*)(in);
__m128i minaccumulator = _mm_loadu_si128(pin);
__m128i maxaccumulator = minaccumulator;
k = 1;
for(; 4*k < lengthdividedby4 * 4; ++k) {
__m128i newvec = _mm_loadu_si128(pin+k);
minaccumulator = _mm_min_epu32(minaccumulator,newvec);
maxaccumulator = _mm_max_epu32(maxaccumulator,newvec);
}
*getmin = minasint(minaccumulator);
*getmax = maxasint(maxaccumulator);
}
for (k = offset; k < length; ++k) {
if (in[k] < *getmin)
*getmin = in[k];
if (in[k] > *getmax)
*getmax = in[k];
}
}
#endif
SIMDCOMP_PURE uint32_t maxbits_length(const uint32_t * in,uint32_t length) {
uint32_t k;
uint32_t lengthdividedby4 = length / 4;
uint32_t offset = lengthdividedby4 * 4;
uint32_t bigxor = 0;
if(lengthdividedby4 > 0) {
const __m128i* pin = (const __m128i*)(in);
__m128i accumulator = _mm_loadu_si128(pin);
k = 1;
for(; 4*k < 4*lengthdividedby4; ++k) {
__m128i newvec = _mm_loadu_si128(pin+k);
accumulator = _mm_or_si128(accumulator,newvec);
}
bigxor = orasint(accumulator);
}
for(k = offset; k < length; ++k)
bigxor |= in[k];
return bits(bigxor);
}
/* maxbit over 128 integers (SIMDBlockSize) with provided initial value */
uint32_t simdmaxbitsd1(uint32_t initvalue, const uint32_t * in) {
__m128i initoffset = _mm_set1_epi32 (initvalue);
const __m128i* pin = (const __m128i*)(in);
__m128i newvec = _mm_loadu_si128(pin);
__m128i accumulator = Delta(newvec , initoffset);
__m128i oldvec = newvec;
uint32_t k = 1;
for(; 4*k < SIMDBlockSize; ++k) {
newvec = _mm_loadu_si128(pin+k);
accumulator = _mm_or_si128(accumulator,Delta(newvec , oldvec));
oldvec = newvec;
}
initoffset = oldvec;
return maxbitas32int(accumulator);
}
/* maxbit over |length| integers with provided initial value */
uint32_t simdmaxbitsd1_length(uint32_t initvalue, const uint32_t * in,
uint32_t length) {
__m128i newvec;
__m128i oldvec;
__m128i initoffset;
__m128i accumulator;
const __m128i *pin;
uint32_t tmparray[4];
uint32_t k = 1;
uint32_t acc;
assert(length > 0);
pin = (const __m128i *)(in);
initoffset = _mm_set1_epi32(initvalue);
switch (length) {
case 1:
newvec = _mm_set1_epi32(in[0]);
break;
case 2:
newvec = _mm_setr_epi32(in[0], in[1], in[1], in[1]);
break;
case 3:
newvec = _mm_setr_epi32(in[0], in[1], in[2], in[2]);
break;
default:
newvec = _mm_loadu_si128(pin);
break;
}
accumulator = Delta(newvec, initoffset);
oldvec = newvec;
/* process 4 integers and build an accumulator */
while (k * 4 + 4 <= length) {
newvec = _mm_loadu_si128(pin + k);
accumulator = _mm_or_si128(accumulator, Delta(newvec, oldvec));
oldvec = newvec;
k++;
}
/* extract the accumulator as an integer */
_mm_storeu_si128((__m128i *)(tmparray), accumulator);
acc = tmparray[0] | tmparray[1] | tmparray[2] | tmparray[3];
/* now process the remaining integers */
for (k *= 4; k < length; k++)
acc |= in[k] - (k == 0 ? initvalue : in[k - 1]);
/* return the number of bits */
return bits(acc);
}

File diff suppressed because it is too large

File diff suppressed because it is too large

File diff suppressed because it is too large

File diff suppressed because it is too large


@@ -1,900 +0,0 @@
/**
* This code is released under a BSD License.
*/
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include "simdcomp.h"
int testshortpack() {
int bit;
size_t i;
size_t length;
__m128i * bb;
srand(0);
printf("testshortpack\n");
for (bit = 0; bit < 32; ++bit) {
const size_t N = 128;
uint32_t * data = malloc(N * sizeof(uint32_t));
uint32_t * backdata = malloc(N * sizeof(uint32_t));
uint32_t * buffer = malloc((2 * N + 1024) * sizeof(uint32_t));
for (i = 0; i < N; ++i) {
data[i] = rand() & ((1 << bit) - 1);
}
for (length = 0; length <= N; ++length) {
for (i = 0; i < N; ++i) {
backdata[i] = 0;
}
bb = simdpack_shortlength(data, length, (__m128i *) buffer,
bit);
if((bb - (__m128i *) buffer) * sizeof(__m128i) != (unsigned) simdpack_compressedbytes(length,bit)) {
printf("bug\n");
return -1;
}
simdunpack_shortlength((__m128i *) buffer, length,
backdata, bit);
for (i = 0; i < length; ++i) {
if (data[i] != backdata[i]) {
printf("bug\n");
return -1;
}
}
}
free(data);
free(backdata);
free(buffer);
}
return 0;
}
int testlongpack() {
int bit;
size_t i;
size_t length;
__m128i * bb;
srand(0);
printf("testlongpack\n");
for (bit = 0; bit < 32; ++bit) {
const size_t N = 2048;
uint32_t * data = malloc(N * sizeof(uint32_t));
uint32_t * backdata = malloc(N * sizeof(uint32_t));
uint32_t * buffer = malloc((2 * N + 1024) * sizeof(uint32_t));
for (i = 0; i < N; ++i) {
data[i] = rand() & ((1 << bit) - 1);
}
for (length = 0; length <= N; ++length) {
for (i = 0; i < N; ++i) {
backdata[i] = 0;
}
bb = simdpack_length(data, length, (__m128i *) buffer,
bit);
if((bb - (__m128i *) buffer) * sizeof(__m128i) != (unsigned) simdpack_compressedbytes(length,bit)) {
printf("bug\n");
return -1;
}
simdunpack_length((__m128i *) buffer, length,
backdata, bit);
for (i = 0; i < length; ++i) {
if (data[i] != backdata[i]) {
printf("bug\n");
return -1;
}
}
}
free(data);
free(backdata);
free(buffer);
}
return 0;
}
int testset() {
int bit;
size_t i;
const size_t N = 128;
uint32_t * data = malloc(N * sizeof(uint32_t));
uint32_t * backdata = malloc(N * sizeof(uint32_t));
uint32_t * buffer = malloc((2 * N + 1024) * sizeof(uint32_t));
srand(0);
for (bit = 0; bit < 32; ++bit) {
printf("simple set %d \n",bit);
for (i = 0; i < N; ++i) {
data[i] = rand() & ((1 << bit) - 1);
}
for (i = 0; i < N; ++i) {
backdata[i] = 0;
}
simdpack(data, (__m128i *) buffer, bit);
simdunpack((__m128i *) buffer, backdata, bit);
for (i = 0; i < N; ++i) {
if (data[i] != backdata[i]) {
printf("bug\n");
return -1;
}
}
for(i = N ; i > 0; i--) {
simdfastset((__m128i *) buffer, bit, data[N - i], i - 1);
}
simdunpack((__m128i *) buffer, backdata, bit);
for (i = 0; i < N; ++i) {
if (data[i] != backdata[N - i - 1]) {
printf("bug\n");
return -1;
}
}
simdpack(data, (__m128i *) buffer, bit);
for(i = 1 ; i <= N; i++) {
simdfastset((__m128i *) buffer, bit, data[i - 1], i - 1);
}
simdunpack((__m128i *) buffer, backdata, bit);
for (i = 0; i < N; ++i) {
if (data[i] != backdata[i]) {
printf("bug\n");
return -1;
}
}
}
free(data);
free(backdata);
free(buffer);
return 0;
}
#ifdef __SSE4_1__
int testsetd1() {
int bit;
size_t i;
uint32_t newvalue;
const size_t N = 128;
uint32_t * data = malloc(N * sizeof(uint32_t));
uint32_t * datazeroes = malloc(N * sizeof(uint32_t));
uint32_t * backdata = malloc(N * sizeof(uint32_t));
uint32_t * buffer = malloc((2 * N + 1024) * sizeof(uint32_t));
srand(0);
for (bit = 0; bit < 32; ++bit) {
printf("simple set d1 %d \n",bit);
data[0] = rand() & ((1 << bit) - 1);
datazeroes[0] = 0;
for (i = 1; i < N; ++i) {
data[i] = data[i - 1] + (rand() & ((1 << bit) - 1));
datazeroes[i] = 0;
}
for (i = 0; i < N; ++i) {
backdata[i] = 0;
}
simdpackd1(0,datazeroes, (__m128i *) buffer, bit);
for(i = 1 ; i <= N; i++) {
simdfastsetd1(0,(__m128i *) buffer, bit, data[i - 1], i - 1);
newvalue = simdselectd1(0, (const __m128i *) buffer, bit,i - 1);
if( newvalue != data[i-1] ) {
printf("bad set-select\n");
return -1;
}
}
simdunpackd1(0,(__m128i *) buffer, backdata, bit);
for (i = 0; i < N; ++i) {
if (data[i] != backdata[i])
return -1;
}
}
free(data);
free(backdata);
free(buffer);
free(datazeroes);
return 0;
}
#endif
int testsetFOR() {
int bit;
size_t i;
uint32_t newvalue;
const size_t N = 128;
uint32_t * data = malloc(N * sizeof(uint32_t));
uint32_t * datazeroes = malloc(N * sizeof(uint32_t));
uint32_t * backdata = malloc(N * sizeof(uint32_t));
uint32_t * buffer = malloc((2 * N + 1024) * sizeof(uint32_t));
srand(0);
for (bit = 0; bit < 32; ++bit) {
printf("simple set FOR %d \n",bit);
for (i = 0; i < N; ++i) {
data[i] = (rand() & ((1 << bit) - 1));
datazeroes[i] = 0;
}
for (i = 0; i < N; ++i) {
backdata[i] = 0;
}
simdpackFOR(0,datazeroes, (__m128i *) buffer, bit);
for(i = 1 ; i <= N; i++) {
simdfastsetFOR(0,(__m128i *) buffer, bit, data[i - 1], i - 1);
newvalue = simdselectFOR(0, (const __m128i *) buffer, bit,i - 1);
if( newvalue != data[i-1] ) {
printf("bad set-select\n");
return -1;
}
}
simdunpackFOR(0,(__m128i *) buffer, backdata, bit);
for (i = 0; i < N; ++i) {
if (data[i] != backdata[i])
return -1;
}
}
free(data);
free(backdata);
free(buffer);
free(datazeroes);
return 0;
}
int testshortFORpack() {
int bit;
size_t i;
__m128i * rb;
size_t length;
uint32_t offset = 7;
srand(0);
for (bit = 0; bit < 32; ++bit) {
const size_t N = 128;
uint32_t * data = malloc(N * sizeof(uint32_t));
uint32_t * backdata = malloc(N * sizeof(uint32_t));
uint32_t * buffer = malloc((2 * N + 1024) * sizeof(uint32_t));
for (i = 0; i < N; ++i) {
data[i] = (rand() & ((1 << bit) - 1)) + offset;
}
for (length = 0; length <= N; ++length) {
for (i = 0; i < N; ++i) {
backdata[i] = 0;
}
rb = simdpackFOR_length(offset,data, length, (__m128i *) buffer,
bit);
if(((rb - (__m128i *) buffer)*sizeof(__m128i)) != (unsigned) simdpackFOR_compressedbytes(length,bit)) {
return -1;
}
simdunpackFOR_length(offset,(__m128i *) buffer, length,
backdata, bit);
for (i = 0; i < length; ++i) {
if (data[i] != backdata[i])
return -1;
}
}
free(data);
free(backdata);
free(buffer);
}
return 0;
}
#ifdef __AVX2__
int testbabyavx() {
int bit;
int trial;
unsigned int i,j;
const size_t N = AVXBlockSize;
srand(0);
printf("testbabyavx\n");
printf("bit = ");
for (bit = 0; bit < 32; ++bit) {
printf(" %d ",bit);
fflush(stdout);
for(trial = 0; trial < 100; ++trial) {
uint32_t * data = malloc(N * sizeof(uint32_t)+ 64 * sizeof(uint32_t));
uint32_t * backdata = malloc(N * sizeof(uint32_t) + 64 * sizeof(uint32_t) );
__m256i * buffer = malloc((2 * N + 1024) * sizeof(uint32_t) + 32);
for (i = 0; i < N; ++i) {
data[i] = rand() & ((uint32_t)(1 << bit) - 1);
}
for (i = 0; i < N; ++i) {
backdata[i] = 0;
}
if(avxmaxbits(data) != maxbits_length(data,N)) {
printf("avxmaxbits is buggy\n");
return -1;
}
avxpackwithoutmask(data, buffer, bit);
avxunpack(buffer, backdata, bit);
for (i = 0; i < AVXBlockSize; ++i) {
if (data[i] != backdata[i]) {
printf("bug\n");
for (j = 0; j < N; ++j) {
if (data[j] != backdata[j]) {
printf("data[%d]=%d v.s. backdata[%d]=%d\n",j,data[j],j,backdata[j]);
} else {
printf("data[%d]=%d\n",j,data[j]);
}
}
return -1;
}
}
free(data);
free(backdata);
free(buffer);
}
}
printf("\n");
return 0;
}
int testavx2() {
int N = 5000 * AVXBlockSize, gap;
__m256i * buffer = malloc(AVXBlockSize * sizeof(uint32_t));
uint32_t * datain = malloc(N * sizeof(uint32_t));
uint32_t * backbuffer = malloc(AVXBlockSize * sizeof(uint32_t));
for (gap = 1; gap <= 387420489; gap *= 3) {
int k;
printf(" gap = %u \n", gap);
for (k = 0; k < N; ++k)
datain[k] = k * gap;
for (k = 0; k * AVXBlockSize < N; ++k) {
/*
First part works for general arrays (sorted or unsorted)
*/
int j;
/* we compute the bit width */
const uint32_t b = avxmaxbits(datain + k * AVXBlockSize);
if(avxmaxbits(datain + k * AVXBlockSize) != maxbits_length(datain + k * AVXBlockSize,AVXBlockSize)) {
printf("avxmaxbits is buggy %d %d \n",
avxmaxbits(datain + k * AVXBlockSize),
maxbits_length(datain + k * AVXBlockSize,AVXBlockSize));
return -1;
}
printf("bit width = %d\n",b);
/* we read 256 integers at "datain + k * AVXBlockSize" and
write b 256-bit vectors at "buffer" */
avxpackwithoutmask(datain + k * AVXBlockSize, buffer, b);
/* we read back b1 128-bit vectors at "buffer" and write 128 integers at backbuffer */
avxunpack(buffer, backbuffer, b);/* uncompressed */
for (j = 0; j < AVXBlockSize; ++j) {
if (backbuffer[j] != datain[k * AVXBlockSize + j]) {
int i;
printf("bug in avxpack\n");
for(i = 0; i < AVXBlockSize; ++i) {
printf("data[%d]=%d got back %d %s\n",i,
datain[k * AVXBlockSize + i],backbuffer[i],
datain[k * AVXBlockSize + i]!=backbuffer[i]?"bug":"");
}
return -2;
}
}
}
}
free(buffer);
free(datain);
free(backbuffer);
printf("Code looks good.\n");
return 0;
}
#endif /* avx2 */
int test() {
int N = 5000 * SIMDBlockSize, gap;
__m128i * buffer = malloc(SIMDBlockSize * sizeof(uint32_t));
uint32_t * datain = malloc(N * sizeof(uint32_t));
uint32_t * backbuffer = malloc(SIMDBlockSize * sizeof(uint32_t));
for (gap = 1; gap <= 387420489; gap *= 3) {
int k;
printf(" gap = %u \n", gap);
for (k = 0; k < N; ++k)
datain[k] = k * gap;
for (k = 0; k * SIMDBlockSize < N; ++k) {
/*
First part works for general arrays (sorted or unsorted)
*/
int j;
/* we compute the bit width */
const uint32_t b = maxbits(datain + k * SIMDBlockSize);
/* we read 128 integers at "datain + k * SIMDBlockSize" and
write b 128-bit vectors at "buffer" */
simdpackwithoutmask(datain + k * SIMDBlockSize, buffer, b);
/* we read back b1 128-bit vectors at "buffer" and write 128 integers at backbuffer */
simdunpack(buffer, backbuffer, b);/* uncompressed */
for (j = 0; j < SIMDBlockSize; ++j) {
if (backbuffer[j] != datain[k * SIMDBlockSize + j]) {
printf("bug in simdpack\n");
return -2;
}
}
{
/*
next part assumes that the data is sorted (uses differential coding)
*/
uint32_t offset = 0;
/* we compute the bit width */
const uint32_t b1 = simdmaxbitsd1(offset,
datain + k * SIMDBlockSize);
/* we read 128 integers at "datain + k * SIMDBlockSize" and
write b1 128-bit vectors at "buffer" */
simdpackwithoutmaskd1(offset, datain + k * SIMDBlockSize, buffer,
b1);
/* we read back b1 128-bit vectors at "buffer" and write 128 integers at backbuffer */
simdunpackd1(offset, buffer, backbuffer, b1);
for (j = 0; j < SIMDBlockSize; ++j) {
if (backbuffer[j] != datain[k * SIMDBlockSize + j]) {
printf("bug in simdpack d1\n");
return -3;
}
}
offset = datain[k * SIMDBlockSize + SIMDBlockSize - 1];
}
}
}
free(buffer);
free(datain);
free(backbuffer);
printf("Code looks good.\n");
return 0;
}
#ifdef __SSE4_1__
int testFOR() {
int N = 5000 * SIMDBlockSize, gap;
__m128i * buffer = malloc(SIMDBlockSize * sizeof(uint32_t));
uint32_t * datain = malloc(N * sizeof(uint32_t));
uint32_t * backbuffer = malloc(SIMDBlockSize * sizeof(uint32_t));
uint32_t tmax, tmin, tb;
for (gap = 1; gap <= 387420489; gap *= 2) {
int k;
printf(" gap = %u \n", gap);
for (k = 0; k < N; ++k)
datain[k] = k * gap;
for (k = 0; k * SIMDBlockSize < N; ++k) {
int j;
simdmaxmin_length(datain + k * SIMDBlockSize,SIMDBlockSize,&tmin,&tmax);
/* we compute the bit width */
tb = bits(tmax - tmin);
/* we read 128 integers at "datain + k * SIMDBlockSize" and
write b 128-bit vectors at "buffer" */
simdpackFOR(tmin,datain + k * SIMDBlockSize, buffer, tb);
for (j = 0; j < SIMDBlockSize; ++j) {
uint32_t selectedvalue = simdselectFOR(tmin,buffer,tb,j);
if (selectedvalue != datain[k * SIMDBlockSize + j]) {
printf("bug in simdselectFOR\n");
return -3;
}
}
/* we read back b1 128-bit vectors at "buffer" and write 128 integers at backbuffer */
simdunpackFOR(tmin,buffer, backbuffer, tb);/* uncompressed */
for (j = 0; j < SIMDBlockSize; ++j) {
if (backbuffer[j] != datain[k * SIMDBlockSize + j]) {
printf("bug in simdpackFOR\n");
return -2;
}
}
}
}
free(buffer);
free(datain);
free(backbuffer);
printf("Code looks good.\n");
return 0;
}
#endif
#define MAX 300
int test_simdmaxbitsd1_length() {
uint32_t result, buffer[MAX + 1];
int i, j;
memset(&buffer[0], 0xff, sizeof(buffer));
/* this test creates buffers of different length; each buffer is
* initialized to result in the following deltas:
* length 1: 2
* length 2: 1 2
* length 3: 1 1 2
* length 4: 1 1 1 2
* length 5: 1 1 1 1 2
* etc. Each sequence's "maxbits" is 2. */
for (i = 0; i < MAX; i++) {
for (j = 0; j < i; j++)
buffer[j] = j + 1;
buffer[i] = i + 2;
result = simdmaxbitsd1_length(0, &buffer[0], i + 1);
if (result != 2) {
printf("simdmaxbitsd1_length: unexpected result %u in loop %d\n",
result, i);
return -1;
}
}
printf("simdmaxbitsd1_length: ok\n");
return 0;
}
int uint32_cmp(const void *a, const void *b)
{
const uint32_t *ia = (const uint32_t *)a;
const uint32_t *ib = (const uint32_t *)b;
if(*ia < *ib)
return -1;
else if (*ia > *ib)
return 1;
return 0;
}
#ifdef __SSE4_1__
int test_simdpackedsearch() {
uint32_t buffer[128];
uint32_t result = 0;
int b, i;
uint32_t init = 0;
__m128i initial = _mm_set1_epi32(init);
/* initialize the buffer */
for (i = 0; i < 128; i++)
buffer[i] = (uint32_t)(i + 1);
/* this test creates delta encoded buffers with different bits, then
* performs lower bound searches for each key */
for (b = 1; b <= 32; b++) {
uint32_t out[128];
/* delta-encode to 'b' bits */
simdpackwithoutmaskd1(init, buffer, (__m128i *)out, b);
initial = _mm_setzero_si128();
printf("simdsearchd1: %d bits\n", b);
/* now perform the searches */
initial = _mm_set1_epi32(init);
assert(simdsearchd1(&initial, (__m128i *)out, b, 0, &result) == 0);
assert(result > 0);
for (i = 1; i <= 128; i++) {
initial = _mm_set1_epi32(init);
assert(simdsearchd1(&initial, (__m128i *)out, b,
(uint32_t)i, &result) == i - 1);
assert(result == (unsigned)i);
}
initial = _mm_set1_epi32(init);
assert(simdsearchd1(&initial, (__m128i *)out, b, 200, &result)
== 128);
assert(result > 200);
}
printf("simdsearchd1: ok\n");
return 0;
}
int test_simdpackedsearchFOR() {
uint32_t buffer[128];
uint32_t result = 0;
int b;
uint32_t i;
uint32_t maxv, tmin, tmax, tb;
uint32_t out[128];
/* this test creates FOR-encoded buffers with different bit widths, then
* performs lower bound searches for each key */
for (b = 1; b <= 32; b++) {
/* initialize the buffer */
maxv = (b == 32)
? 0xFFFFFFFF
: ((1U<<b) - 1);
for (i = 0; i < 128; i++)
buffer[i] = maxv * (i + 1) / 128;
simdmaxmin_length(buffer,SIMDBlockSize,&tmin,&tmax);
/* we compute the bit width */
tb = bits(tmax - tmin);
/* FOR-encode to 'tb' bits */
simdpackFOR(tmin, buffer, (__m128i *)out, tb);
printf("simdsearchd1: %d bits\n", b);
/* now perform the searches */
for (i = 0; i < 128; i++) {
assert(buffer[i] == simdselectFOR(tmin, (__m128i *)out, tb,i));
}
for (i = 0; i < 128; i++) {
int x = simdsearchwithlengthFOR(tmin, (__m128i *)out, tb,
128,buffer[i], &result) ;
assert(simdselectFOR(tmin, (__m128i *)out, tb,x) == buffer[x]);
assert(simdselectFOR(tmin, (__m128i *)out, tb,x) == result);
assert(buffer[x] == result);
assert(result == buffer[i]);
assert(buffer[x] == buffer[i]);
}
}
printf("simdsearchFOR: ok\n");
return 0;
}
int test_simdpackedsearch_advanced() {
uint32_t buffer[128];
uint32_t backbuffer[128];
uint32_t out[128];
uint32_t result = 0;
uint32_t b, i;
uint32_t init = 0;
__m128i initial = _mm_set1_epi32(init);
/* this test creates delta encoded buffers with different bits, then
* performs lower bound searches for each key */
for (b = 0; b <= 32; b++) {
uint32_t prev = init;
/* initialize the buffer */
for (i = 0; i < 128; i++) {
buffer[i] = ((uint32_t)(1431655765 * i + 0xFFFFFFFF)) ;
if(b < 32) buffer[i] %= (1<<b);
}
qsort(buffer,128, sizeof(uint32_t), uint32_cmp);
for (i = 0; i < 128; i++) {
buffer[i] = buffer[i] + prev;
prev = buffer[i];
}
for (i = 1; i < 128; i++) {
if(buffer[i] < buffer[i-1] )
buffer[i] = buffer[i-1];
}
assert(simdmaxbitsd1(init, buffer)<=b);
for (i = 0; i < 128; i++) {
out[i] = 0; /* memset would do too */
}
/* delta-encode to 'b' bits */
simdpackwithoutmaskd1(init, buffer, (__m128i *)out, b);
simdunpackd1(init, (__m128i *)out, backbuffer, b);
for (i = 0; i < 128; i++) {
assert(buffer[i] == backbuffer[i]);
}
printf("advanced simdsearchd1: %d bits\n", b);
for (i = 0; i < 128; i++) {
int pos;
initial = _mm_set1_epi32(init);
pos = simdsearchd1(&initial, (__m128i *)out, b,
buffer[i], &result);
assert(pos == simdsearchwithlengthd1(init, (__m128i *)out, b, 128,
buffer[i], &result));
assert(buffer[pos] == buffer[i]);
if(pos > 0)
assert(buffer[pos - 1] < buffer[i]);
assert(result == buffer[i]);
}
for (i = 0; i < 128; i++) {
int pos;
if(buffer[i] == 0) continue;
initial = _mm_set1_epi32(init);
pos = simdsearchd1(&initial, (__m128i *)out, b,
buffer[i] - 1, &result);
assert(pos == simdsearchwithlengthd1(init, (__m128i *)out, b, 128,
buffer[i] - 1, &result));
assert(buffer[pos] >= buffer[i] - 1);
if(pos > 0)
assert(buffer[pos - 1] < buffer[i] - 1);
assert(result == buffer[pos]);
}
for (i = 0; i < 128; i++) {
int pos;
if (buffer[i] + 1 == 0)
continue;
initial = _mm_set1_epi32(init);
pos = simdsearchd1(&initial, (__m128i *) out, b,
buffer[i] + 1, &result);
assert(pos == simdsearchwithlengthd1(init, (__m128i *)out, b, 128,
buffer[i] + 1, &result));
if(pos == 128) {
assert(buffer[i] == buffer[127]);
} else {
assert(buffer[pos] >= buffer[i] + 1);
if (pos > 0)
assert(buffer[pos - 1] < buffer[i] + 1);
assert(result == buffer[pos]);
}
}
}
printf("advanced simdsearchd1: ok\n");
return 0;
}
int test_simdpackedselect() {
uint32_t buffer[128];
uint32_t initial = 33;
int b, i;
/* initialize the buffer */
for (i = 0; i < 128; i++)
buffer[i] = (uint32_t)(initial + i);
/* this test creates delta-encoded buffers with different bit widths, then
* selects each value back */
for (b = 1; b <= 32; b++) {
uint32_t out[128];
/* delta-encode to 'b' bits */
simdpackwithoutmaskd1(initial, buffer, (__m128i *)out, b);
printf("simdselectd1: %d bits\n", b);
/* now perform the searches */
for (i = 0; i < 128; i++) {
assert(simdselectd1(initial, (__m128i *)out, b, (uint32_t)i)
== initial + i);
}
}
printf("simdselectd1: ok\n");
return 0;
}
int test_simdpackedselect_advanced() {
uint32_t buffer[128];
uint32_t initial = 33;
uint32_t b;
int i;
/* this test creates delta-encoded buffers with different bit widths, then
* selects each value back */
for (b = 0; b <= 32; b++) {
uint32_t prev = initial;
uint32_t out[128];
/* initialize the buffer */
for (i = 0; i < 128; i++) {
buffer[i] = ((uint32_t)(165576 * i)) ;
if(b < 32) buffer[i] %= (1<<b);
}
for (i = 0; i < 128; i++) {
buffer[i] = buffer[i] + prev;
prev = buffer[i];
}
for (i = 1; i < 128; i++) {
if(buffer[i] < buffer[i-1] )
buffer[i] = buffer[i-1];
}
assert(simdmaxbitsd1(initial, buffer)<=b);
for (i = 0; i < 128; i++) {
out[i] = 0; /* memset would do too */
}
/* delta-encode to 'b' bits */
simdpackwithoutmaskd1(initial, buffer, (__m128i *)out, b);
printf("simdselectd1: %d bits\n", b);
/* now perform the searches */
for (i = 0; i < 128; i++) {
uint32_t valretrieved = simdselectd1(initial, (__m128i *)out, b, (uint32_t)i);
assert(valretrieved == buffer[i]);
}
}
printf("advanced simdselectd1: ok\n");
return 0;
}
#endif
int main() {
int r;
r = testsetFOR();
if (r) {
printf("test failure 1\n");
return r;
}
#ifdef __SSE4_1__
r = testsetd1();
if (r) {
printf("test failure 2\n");
return r;
}
#endif
r = testset();
if (r) {
printf("test failure 3\n");
return r;
}
r = testshortFORpack();
if (r) {
printf("test failure 4\n");
return r;
}
r = testshortpack();
if (r) {
printf("test failure 5\n");
return r;
}
r = testlongpack();
if (r) {
printf("test failure 6\n");
return r;
}
#ifdef __SSE4_1__
r = test_simdpackedsearchFOR();
if (r) {
printf("test failure 7\n");
return r;
}
r = testFOR();
if (r) {
printf("test failure 8\n");
return r;
}
#endif
#ifdef __AVX2__
r= testbabyavx();
if (r) {
printf("test failure baby avx\n");
return r;
}
r = testavx2();
if (r) {
printf("test failure 9 avx\n");
return r;
}
#endif
r = test();
if (r) {
printf("test failure 9\n");
return r;
}
r = test_simdmaxbitsd1_length();
if (r) {
printf("test failure 10\n");
return r;
}
#ifdef __SSE4_1__
r = test_simdpackedsearch();
if (r) {
printf("test failure 11\n");
return r;
}
r = test_simdpackedsearch_advanced();
if (r) {
printf("test failure 12\n");
return r;
}
r = test_simdpackedselect();
if (r) {
printf("test failure 13\n");
return r;
}
r = test_simdpackedselect_advanced();
if (r) {
printf("test failure 14\n");
return r;
}
#endif
printf("All tests OK!\n");
return 0;
}


@@ -1,102 +0,0 @@
/**
* This code is released under a BSD License.
*/
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h> /* for memmove */
#include "simdcomp.h"
#define get_random_char() ((uint8_t)(rand() % 256))
int main() {
int N = 5000 * SIMDBlockSize, gap;
__m128i * buffer = malloc(SIMDBlockSize * sizeof(uint32_t));
uint32_t * datain = malloc(N * sizeof(uint32_t));
uint32_t * backbuffer = malloc(SIMDBlockSize * sizeof(uint32_t));
srand(time(NULL));
for (gap = 1; gap <= 387420489; gap *= 3) {
int k;
printf(" gap = %u \n", gap);
/* simulate some random character string, don't care about endiannes */
for (k = 0; k < N; ++k) {
uint8_t _tmp[4];
_tmp[0] = get_random_char();
_tmp[1] = get_random_char();
_tmp[2] = get_random_char();
_tmp[3] = get_random_char();
memmove(&datain[k], _tmp, 4);
}
for (k = 0; k * SIMDBlockSize < N; ++k) {
/*
First part works for general arrays (sorted or unsorted)
*/
int j;
/* we compute the bit width */
const uint32_t b = maxbits(datain + k * SIMDBlockSize);
/* we read 128 integers at "datain + k * SIMDBlockSize" and
write b 128-bit vectors at "buffer" */
simdpackwithoutmask(datain + k * SIMDBlockSize, buffer, b);
/* we read back b1 128-bit vectors at "buffer" and write 128 integers at backbuffer */
simdunpack(buffer, backbuffer, b);/* uncompressed */
for (j = 0; j < SIMDBlockSize; ++j) {
uint8_t chars_back[4];
uint8_t chars_in[4];
memmove(chars_back, &backbuffer[j], 4);
memmove(chars_in, &datain[k * SIMDBlockSize + j], 4);
if (chars_in[0] != chars_back[0]
|| chars_in[1] != chars_back[1]
|| chars_in[2] != chars_back[2]
|| chars_in[3] != chars_back[3]) {
printf("bug in simdpack\n");
return -2;
}
}
{
/*
next part assumes that the data is sorted (uses differential coding)
*/
uint32_t offset = 0;
/* we compute the bit width */
const uint32_t b1 = simdmaxbitsd1(offset,
datain + k * SIMDBlockSize);
/* we read 128 integers at "datain + k * SIMDBlockSize" and
write b1 128-bit vectors at "buffer" */
simdpackwithoutmaskd1(offset, datain + k * SIMDBlockSize, buffer,
b1);
/* we read back b1 128-bit vectors at "buffer" and write 128 integers at backbuffer */
simdunpackd1(offset, buffer, backbuffer, b1);
for (j = 0; j < SIMDBlockSize; ++j) {
uint8_t chars_back[4];
uint8_t chars_in[4];
memmove(chars_back, &backbuffer[j], 4);
memmove(chars_in, &datain[k * SIMDBlockSize + j], 4);
if (chars_in[0] != chars_back[0]
|| chars_in[1] != chars_back[1]
|| chars_in[2] != chars_back[2]
|| chars_in[3] != chars_back[3]) {
printf("bug in simdpack\n");
return -3;
}
}
offset = datain[k * SIMDBlockSize + SIMDBlockSize - 1];
}
}
}
free(buffer);
free(datain);
free(backbuffer);
printf("Code looks good.\n");
return 0;
}


@@ -1,42 +0,0 @@
#include "simdcomp.h"
#include "simdcomputil.h"
// assumes datain has a size of 128 uint32
// and that buffer is large enough to host the data.
size_t compress_sorted(
const uint32_t* datain,
uint8_t* output,
const uint32_t offset) {
const uint32_t b = simdmaxbitsd1(offset, datain);
*output++ = b;
simdpackwithoutmaskd1(offset, datain, (__m128i *) output, b);
return 1 + b * sizeof(__m128i);
}
// assumes datain has a size of 128 uint32
// and that buffer is large enough to host the data.
size_t uncompress_sorted(
const uint8_t* compressed_data,
uint32_t* output,
uint32_t offset) {
const uint32_t b = *compressed_data++;
simdunpackd1(offset, (__m128i *)compressed_data, output, b);
return 1 + b * sizeof(__m128i);
}
size_t compress_unsorted(
const uint32_t* datain,
uint8_t* output) {
const uint32_t b = maxbits(datain);
*output++ = b;
simdpackwithoutmask(datain, (__m128i *) output, b);
return 1 + b * sizeof(__m128i);
}
size_t uncompress_unsorted(
const uint8_t* compressed_data,
uint32_t* output) {
const uint32_t b = *compressed_data++;
simdunpack((__m128i *)compressed_data, output, b);
return 1 + b * sizeof(__m128i);
}

1
doc/.gitignore vendored Normal file

@@ -0,0 +1 @@
book

5
doc/book.toml Normal file

@@ -0,0 +1,5 @@
[book]
authors = ["Paul Masurel"]
multilingual = false
src = "src"
title = "Tantivy, the user guide"

15
doc/src/SUMMARY.md Normal file

@@ -0,0 +1,15 @@
# Summary
[Avant Propos](./avant-propos.md)
- [Segments](./basis.md)
- [Defining your schema](./schema.md)
- [Facetting](./facetting.md)
- [Innerworkings](./innerworkings.md)
- [Inverted index](./inverted_index.md)
- [Best practice](./inverted_index.md)
[Frequently Asked Questions](./faq.md)
[Examples](./examples.md)

34
doc/src/avant-propos.md Normal file

@@ -0,0 +1,34 @@
# Foreword, what is the scope of tantivy?
> Tantivy is a **search** engine **library** for Rust.
If you are familiar with Lucene, a good approximation is to consider tantivy as Lucene for Rust. tantivy is heavily inspired by Lucene's design, and
they both have the same scope and targeted use cases.
If you are not familiar with Lucene, let's break down our little tagline.
- **Search** here means full-text search: fundamentally, tantivy is here to help you
efficiently identify the documents matching a given query in your corpus.
But modern search UIs are so much more: text processing, facetting, autocomplete, fuzzy search, good
relevancy, collapsing, highlighting, spatial search.
While some of these features are not available in tantivy yet, all of them are relevant
feature requests. Tantivy's objective is to offer a solid toolbox to create the best search
experience. But keep in mind this is just a toolbox.
Which brings us to the second keyword...
- **Library** means that you will have to write code. tantivy is not an *all-in-one* server solution like Elasticsearch, for instance.
Sometimes a feature will not be available in tantivy because it is too
specific to your use case. By design, tantivy should make it possible to extend
the available set of features using the existing rock-solid datastructures.
Most frequently this will mean writing your own `Collector`, your own `Scorer`, or your own
`TokenFilter`... Some of your requirements may also be related to
something closer to architecture or operations. For instance, you may
want to build a large corpus on Hadoop, fine-tune the merge policy to keep your
index sharded in a time-wise fashion, or you may want to convert an existing
index from a different format.
Tantivy exposes a lot of low-level APIs to do all of these things.

77
doc/src/basis.md Normal file

@@ -0,0 +1,77 @@
# Anatomy of an index
## Straight from disk
Tantivy accesses its data through an abstraction: a trait called `Directory`.
In theory, one can override the data access logic. In practice, the
trait somewhat assumes that your data can be mapped to memory; tantivy
is deeply married to using `mmap` for its IO [^1], and the only persistent
directory shipped with tantivy is the `MmapDirectory`.
While this design has some downsides, it greatly simplifies the source code of
tantivy. Caching is also entirely delegated to the OS.
`tantivy` works (almost) entirely by directly reading the datastructures as they are laid out on disk. As a result, opening an index does not involve loading datastructures from disk into random access memory: starting a process, opening an index, and performing your first query can typically be done in a matter of milliseconds.
This is an interesting property for a command line search engine, or for a multi-tenant log search engine: spawning a new process for each query can be a perfectly sensible solution in some use cases.
In later chapters, we will discuss tantivy's inverted index data layout.
One key takeaway is that, to achieve great performance, search indexes are extremely compact.
Of course, this is crucial to reduce IO and to ensure that as much of our index as possible can sit in RAM.
Also, whenever possible, the data is accessed sequentially. This is an amazing property when tantivy needs to read the data from your spinning hard disk, but it is also
critical for performance if your data is read from an `SSD`, or is even already in your pagecache.
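As a minimal sketch, here is what opening an existing index straight from disk looks like (the index path below is hypothetical):

```rust
use tantivy::directory::MmapDirectory;
use tantivy::Index;

fn open_index() -> tantivy::Result<Index> {
    // All IO goes through the `Directory` trait. `MmapDirectory`
    // maps the index files into memory instead of loading them
    // eagerly, which is why opening an index is nearly instantaneous.
    let directory = MmapDirectory::open("/path/to/index")?;
    Index::open(directory)
}
```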
## Segments, and the log method
That kind of compact layout comes at a cost: it prevents our datastructures from being dynamic.
In fact, the `Directory` trait does not even allow you to modify part of a file.
To allow the addition and deletion of documents, and create the illusion that
your index is dynamic, tantivy uses a common database trick sometimes referred to as the *log method*.
Let's forget about deletes for a moment.
As you add documents, they are processed and stored in a dedicated datastructure, in a `RAM` buffer. This datastructure is not ready for search, but it can receive your data and rearrange it very rapidly.
As you keep adding documents, this buffer will reach its capacity, and tantivy will transparently stop adding documents to it and start converting the datastructure to its final read-only format on disk. Once it has been written, a brand new empty buffer is available to resume adding documents.
The resulting chunk of index obtained after this serialization is called a `Segment`.
> A segment is a self-contained atomic piece of index. It is identified by a UUID, and all of its files are identified using the naming scheme: `<UUID>.*`.
Which brings us to the nature of a tantivy `Index`.
> A tantivy `Index` is a collection of `Segments`.
Physically, this really just means an index is a bunch of segment files in a given `Directory`,
linked together by a `meta.json` file. This transparency can become extremely handy
to get tantivy to fit your use case:
*Example 1* You could for instance use Hadoop to build a very large search index in a timely manner, copy all of the resulting segment files into the same directory, and edit the `meta.json` to get a functional index.[^2]
*Example 2* You could also disable your merge policy and enforce daily segments. Removing data after one week can then be done very efficiently by just editing the `meta.json` and deleting the files associated with segment `D-7`.
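To make the log method concrete, here is a minimal sketch: each commit serializes the `RAM` buffer into a brand new segment, which you can observe through a searcher.

```rust
use tantivy::schema::{Schema, TEXT};
use tantivy::{doc, Index};

fn main() -> tantivy::Result<()> {
    let mut schema_builder = Schema::builder();
    let title = schema_builder.add_text_field("title", TEXT);
    let index = Index::create_in_ram(schema_builder.build());
    let mut index_writer = index.writer(50_000_000)?;
    index_writer.add_document(doc!(title => "The Old Man and the Sea"));
    // `commit` converts the write buffer into a new read-only segment.
    index_writer.commit()?;
    let searcher = index.reader()?.searcher();
    println!("number of segments: {}", searcher.segment_readers().len());
    Ok(())
}
```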
## Merging
As you index more and more data, your index will accumulate more and more segments.
Having a lot of small segments is not really optimal. There is a bit of redundancy in having
all these term dictionaries. Also, when searching, we will need to do a term lookup once per segment, which can hurt search performance a bit.
That's where merging, or compacting, comes into play. Tantivy will continuously consider merge
opportunities and start merging segments in the background.
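The merge policy is configurable on the `IndexWriter`. As a minimal sketch, here is how one could disable background merges entirely, for instance to manage daily segments by hand as in *Example 2* above (`NoMergePolicy` ships with tantivy):

```rust
use tantivy::merge_policy::NoMergePolicy;
use tantivy::schema::Schema;
use tantivy::Index;

fn main() -> tantivy::Result<()> {
    let index = Index::create_in_ram(Schema::builder().build());
    let index_writer = index.writer(50_000_000)?;
    // With `NoMergePolicy`, no segment is ever merged in the
    // background: segment management is entirely up to us.
    index_writer.set_merge_policy(Box::new(NoMergePolicy));
    Ok(())
}
```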
## Indexing throughput, number of indexing threads
[^1]: This may eventually change.
[^2]: Be careful however. By default these files will not be considered as *managed* by tantivy. This means they will never be garbage collected by tantivy, regardless of whether they become obsolete or not.


3
doc/src/examples.md Normal file

@@ -0,0 +1,3 @@
# Examples
- [Basic search](/examples/basic_search.html)

5
doc/src/facetting.md Normal file

@@ -0,0 +1,5 @@
# Facetting
wewew
## weeewe

0
doc/src/faq.md Normal file

1
doc/src/innerworkings.md Normal file

@@ -0,0 +1 @@
# Innerworkings


@@ -0,0 +1 @@
# Inverted index

1
doc/src/schema.md Normal file

@@ -0,0 +1 @@
# Defining your schema

237
examples/basic_search.rs Normal file

@@ -0,0 +1,237 @@
// # Basic Example
//
// This example covers the basic functionalities of
// tantivy.
//
// We will :
// - define our schema
// - create an index in a directory
// - index a few documents into our index
// - search for the best document matching a basic query
// - retrieve the best document's original content.
// ---
// Importing tantivy...
use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::*;
use tantivy::{doc, Index, ReloadPolicy};
use tempfile::TempDir;
fn main() -> tantivy::Result<()> {
// Let's create a temporary directory for the
// sake of this example
let index_path = TempDir::new()?;
// # Defining the schema
//
// The Tantivy index requires a very strict schema.
// The schema declares which fields are in the index,
// and for each field, its type and "the way it should
// be indexed".
// First we need to define a schema ...
let mut schema_builder = Schema::builder();
// Our first field is title.
// We want full-text search for it, and we also want
// to be able to retrieve the document after the search.
//
// `TEXT | STORED` is some syntactic sugar to describe
// that.
//
// `TEXT` means the field should be tokenized and indexed,
// along with its term frequency and term positions.
//
// `STORED` means that the field will also be saved
// in a compressed, row-oriented key-value store.
// This store is useful for reconstructing the
// documents that were selected during the search phase.
schema_builder.add_text_field("title", TEXT | STORED);
// Our second field is body.
// We want full-text search for it, but we do not
// need to be able to retrieve it
// for our application.
//
// We can make our index lighter by omitting the `STORED` flag.
schema_builder.add_text_field("body", TEXT);
let schema = schema_builder.build();
// # Indexing documents
//
// Let's create a brand new index.
//
// This will actually just save a meta.json
// with our schema in the directory.
let index = Index::create_in_dir(&index_path, schema.clone())?;
// To insert a document we will need an index writer.
// There must be only one writer at a time.
// This single `IndexWriter` is already
// multithreaded.
//
// Here we give tantivy a budget of `50MB`.
// Using a bigger heap for the indexer may increase
// throughput, but 50 MB is already plenty.
let mut index_writer = index.writer(50_000_000)?;
// Let's index our documents!
// We first need a handle on the title and body fields.
// ### Adding documents
//
// We can create a document manually, by setting the fields
// one by one in a Document object.
let title = schema.get_field("title").unwrap();
let body = schema.get_field("body").unwrap();
let mut old_man_doc = Document::default();
old_man_doc.add_text(title, "The Old Man and the Sea");
old_man_doc.add_text(
body,
"He was an old man who fished alone in a skiff in the Gulf Stream and \
he had gone eighty-four days now without taking a fish.",
);
// ... and add it to the `IndexWriter`.
index_writer.add_document(old_man_doc);
// For convenience, tantivy also comes with a macro to
// reduce the boilerplate above.
index_writer.add_document(doc!(
title => "Of Mice and Men",
body => "A few miles south of Soledad, the Salinas River drops in close to the hillside \
bank and runs deep and green. The water is warm too, for it has slipped twinkling \
over the yellow sands in the sunlight before reaching the narrow pool. On one \
side of the river the golden foothill slopes curve up to the strong and rocky \
Gabilan Mountains, but on the valley side the water is lined with trees—willows \
fresh and green with every spring, carrying in their lower leaf junctures the \
debris of the winters flooding; and sycamores with mottled, white, recumbent \
limbs and branches that arch over the pool"
));
index_writer.add_document(doc!(
title => "Of Mice and Men",
body => "A few miles south of Soledad, the Salinas River drops in close to the hillside \
bank and runs deep and green. The water is warm too, for it has slipped twinkling \
over the yellow sands in the sunlight before reaching the narrow pool. On one \
side of the river the golden foothill slopes curve up to the strong and rocky \
Gabilan Mountains, but on the valley side the water is lined with trees—willows \
fresh and green with every spring, carrying in their lower leaf junctures the \
debris of the winters flooding; and sycamores with mottled, white, recumbent \
limbs and branches that arch over the pool"
));
// Multivalued field just need to be repeated.
index_writer.add_document(doc!(
title => "Frankenstein",
title => "The Modern Prometheus",
body => "You will rejoice to hear that no disaster has accompanied the commencement of an \
enterprise which you have regarded with such evil forebodings. I arrived here \
yesterday, and my first task is to assure my dear sister of my welfare and \
increasing confidence in the success of my undertaking."
));
// This is an example, so we will only index 3 documents
// here. You can check out tantivy's tutorial to index
// the English wikipedia. Tantivy's indexing is rather fast.
// Indexing 5 million articles of the English wikipedia takes
// around 3 minutes on my computer!
// ### Committing
//
// At this point our documents are not searchable.
//
//
// We need to call `.commit()` explicitly to force the
// `index_writer` to finish processing the documents in the queue,
// flush the current index to the disk, and advertise
// the existence of new documents.
//
// This call is blocking.
index_writer.commit()?;
// If `.commit()` returns correctly, then all of the
// documents that have been added are guaranteed to be
// persistently indexed.
//
// In the scenario of a crash or a power failure,
// tantivy behaves as if it has rolled back to its last
// commit.
// # Searching
//
// ### Searcher
//
// A reader is required first in order to search an index.
// It acts as a `Searcher` pool that reloads itself,
// depending on a `ReloadPolicy`.
//
// For a search server you will typically create one reader for the entire lifetime of your
// program, and acquire a new searcher for every single request.
//
// In the code below, we rely on the 'ON_COMMIT' policy: the reader
// will reload the index automatically after each commit.
let reader = index
.reader_builder()
.reload_policy(ReloadPolicy::OnCommit)
.try_into()?;
// We now need to acquire a searcher.
//
// A searcher points to a snapshotted, immutable version of the index.
//
// Some search experience might require more than
// one query. Using the same searcher ensures that all of these queries will run on the
// same version of the index.
//
// Acquiring a `searcher` is very cheap.
//
// You should acquire a searcher every time you start processing a request,
// and release it right after your query is finished.
let searcher = reader.searcher();
// ### Query
// The query parser can interpret human queries.
// Here, if the user does not specify which
// field they want to search, tantivy will search
// in both title and body.
let query_parser = QueryParser::for_index(&index, vec![title, body]);
// `QueryParser` may fail if the query is not in the right
// format. For user facing applications, this can be a problem.
// A ticket has been opened regarding this problem.
let query = query_parser.parse_query("sea whale")?;
// A query defines a set of documents, as
// well as the way they should be scored.
//
// A query created by the query parser is scored according
// to a metric called Tf-Idf, and will consider
// any document matching at least one of our terms.
// ### Collectors
//
// We are not interested in all of the documents but
// only in the top 10. Keeping track of our top 10 best documents
// is the role of the `TopDocs` collector.
// We can now perform our query.
let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
// The actual documents still need to be
// retrieved from Tantivy's store.
//
// Since the body field was not configured as stored,
// the document returned will only contain
// a title.
for (_score, doc_address) in top_docs {
let retrieved_doc = searcher.doc(doc_address)?;
println!("{}", schema.to_json(&retrieved_doc));
}
Ok(())
}


@@ -0,0 +1,191 @@
// # Custom collector example
//
// This example shows how you can implement your own
// collector. As an example, we will implement a collector
// that computes the standard deviation of a given fast field.
//
// Of course, you can have a look at the tantivy's built-in collectors
// such as the `CountCollector` for more examples.
// ---
// Importing tantivy...
use tantivy::collector::{Collector, SegmentCollector};
use tantivy::fastfield::FastFieldReader;
use tantivy::query::QueryParser;
use tantivy::schema::Field;
use tantivy::schema::{Schema, FAST, INDEXED, TEXT};
use tantivy::{doc, Index, Score, SegmentReader, TantivyError};
#[derive(Default)]
struct Stats {
count: usize,
sum: f64,
squared_sum: f64,
}
impl Stats {
pub fn count(&self) -> usize {
self.count
}
pub fn mean(&self) -> f64 {
self.sum / (self.count as f64)
}
fn square_mean(&self) -> f64 {
self.squared_sum / (self.count as f64)
}
pub fn standard_deviation(&self) -> f64 {
let mean = self.mean();
(self.square_mean() - mean * mean).sqrt()
}
fn non_zero_count(self) -> Option<Stats> {
if self.count == 0 {
None
} else {
Some(self)
}
}
}
struct StatsCollector {
field: Field,
}
impl StatsCollector {
fn with_field(field: Field) -> StatsCollector {
StatsCollector { field }
}
}
impl Collector for StatsCollector {
// That's the type of our result.
// We return the aggregated stats, if any document matched.
type Fruit = Option<Stats>;
type Child = StatsSegmentCollector;
fn for_segment(
&self,
_segment_local_id: u32,
segment_reader: &SegmentReader,
) -> tantivy::Result<StatsSegmentCollector> {
let fast_field_reader = segment_reader
.fast_fields()
.u64(self.field)
.ok_or_else(|| {
let field_name = segment_reader.schema().get_field_name(self.field);
TantivyError::SchemaError(format!(
"Field {:?} is not a u64 fast field.",
field_name
))
})?;
Ok(StatsSegmentCollector {
fast_field_reader,
stats: Stats::default(),
})
}
fn requires_scoring(&self) -> bool {
// this collector does not care about score.
false
}
fn merge_fruits(&self, segment_stats: Vec<Option<Stats>>) -> tantivy::Result<Option<Stats>> {
let mut stats = Stats::default();
for segment_stats_opt in segment_stats {
if let Some(segment_stats) = segment_stats_opt {
stats.count += segment_stats.count;
stats.sum += segment_stats.sum;
stats.squared_sum += segment_stats.squared_sum;
}
}
Ok(stats.non_zero_count())
}
}
struct StatsSegmentCollector {
fast_field_reader: FastFieldReader<u64>,
stats: Stats,
}
impl SegmentCollector for StatsSegmentCollector {
type Fruit = Option<Stats>;
fn collect(&mut self, doc: u32, _score: Score) {
let value = self.fast_field_reader.get(doc) as f64;
self.stats.count += 1;
self.stats.sum += value;
self.stats.squared_sum += value * value;
}
fn harvest(self) -> <Self as SegmentCollector>::Fruit {
self.stats.non_zero_count()
}
}
fn main() -> tantivy::Result<()> {
// # Defining the schema
//
// The Tantivy index requires a very strict schema.
// The schema declares which fields are in the index,
// and for each field, its type and "the way it should
// be indexed".
// first we need to define a schema ...
let mut schema_builder = Schema::builder();
// We'll assume a fictional index containing
// products, and with a name, a description, and a price.
let product_name = schema_builder.add_text_field("name", TEXT);
let product_description = schema_builder.add_text_field("description", TEXT);
let price = schema_builder.add_u64_field("price", INDEXED | FAST);
let schema = schema_builder.build();
// # Indexing documents
//
// Let's index a bunch of fake documents for the sake of
// this example.
let index = Index::create_in_ram(schema.clone());
let mut index_writer = index.writer(50_000_000)?;
index_writer.add_document(doc!(
product_name => "Super Broom 2000",
product_description => "While it is ok for short distance travel, this broom \
was designed for quidditch. It will up your game.",
price => 30_200u64
));
index_writer.add_document(doc!(
product_name => "Turbulobroom",
product_description => "You might have heard of this broom before : it is the sponsor of the Wales team.\
You'll enjoy its sharp turns, and rapid acceleration",
price => 29_240u64
));
index_writer.add_document(doc!(
product_name => "Broomio",
product_description => "Great value for the price. This broom is a market favorite",
price => 21_240u64
));
index_writer.add_document(doc!(
product_name => "Whack a Mole",
product_description => "Prime quality bat.",
price => 5_200u64
));
index_writer.commit()?;
let reader = index.reader()?;
let searcher = reader.searcher();
let query_parser = QueryParser::for_index(&index, vec![product_name, product_description]);
// here we want to compute stats over the products matching 'broom'
let query = query_parser.parse_query("broom")?;
if let Some(stats) = searcher.search(&query, &StatsCollector::with_field(price))? {
println!("count: {}", stats.count());
println!("mean: {}", stats.mean());
println!("standard deviation: {}", stats.standard_deviation());
}
Ok(())
}


@@ -0,0 +1,112 @@
// # Defining a tokenizer pipeline
//
// In this example, we'll see how to define a tokenizer pipeline
// by aligning a bunch of `TokenFilter`.
use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::*;
use tantivy::tokenizer::NgramTokenizer;
use tantivy::{doc, Index};
fn main() -> tantivy::Result<()> {
// # Defining the schema
//
// The Tantivy index requires a very strict schema.
// The schema declares which fields are in the index,
// and for each field, its type and "the way it should
// be indexed".
// first we need to define a schema ...
let mut schema_builder = Schema::builder();
// Our first field is title.
// In this example we want to use NGram searching:
// we will set the ngram size to 3 characters, so any three
// consecutive characters in the title should be findable.
let text_field_indexing = TextFieldIndexing::default()
.set_tokenizer("ngram3")
.set_index_option(IndexRecordOption::WithFreqsAndPositions);
let text_options = TextOptions::default()
.set_indexing_options(text_field_indexing)
.set_stored();
let title = schema_builder.add_text_field("title", text_options);
// Our second field is body.
// We want full-text search for it, but we do not
// need to be able to retrieve it
// for our application.
//
// We can make our index lighter
// by omitting the `STORED` flag.
let body = schema_builder.add_text_field("body", TEXT);
let schema = schema_builder.build();
// # Indexing documents
//
// Let's create a brand new index.
// To simplify we will work entirely in RAM.
// This is not what you want in reality, but it is very useful
// for your unit tests... Or this example.
let index = Index::create_in_ram(schema.clone());
// here we are registering our custom tokenizer;
// it will emit tokens of 3 characters each
index
.tokenizers()
.register("ngram3", NgramTokenizer::new(3, 3, false));
// To insert documents we need an index writer.
// There must be only one writer at a time.
// This single `IndexWriter` is already
// multithreaded.
//
// Here we use a buffer of 50MB per thread. Using a bigger
// heap for the indexer can increase its throughput.
let mut index_writer = index.writer(50_000_000)?;
index_writer.add_document(doc!(
title => "The Old Man and the Sea",
body => "He was an old man who fished alone in a skiff in the Gulf Stream and \
he had gone eighty-four days now without taking a fish."
));
index_writer.add_document(doc!(
title => "Of Mice and Men",
body => r#"A few miles south of Soledad, the Salinas River drops in close to the hillside
bank and runs deep and green. The water is warm too, for it has slipped twinkling
over the yellow sands in the sunlight before reaching the narrow pool. On one
side of the river the golden foothill slopes curve up to the strong and rocky
Gabilan Mountains, but on the valley side the water is lined with trees—willows
fresh and green with every spring, carrying in their lower leaf junctures the
debris of the winters flooding; and sycamores with mottled, white, recumbent
limbs and branches that arch over the pool"#
));
index_writer.add_document(doc!(
title => "Frankenstein",
body => r#"You will rejoice to hear that no disaster has accompanied the commencement of an
enterprise which you have regarded with such evil forebodings. I arrived here
yesterday, and my first task is to assure my dear sister of my welfare and
increasing confidence in the success of my undertaking."#
));
index_writer.commit()?;
let reader = index.reader()?;
let searcher = reader.searcher();
// The query parser can interpret human queries.
// Here, if the user does not specify which
// field they want to search, tantivy will search
// in both title and body.
let query_parser = QueryParser::for_index(&index, vec![title, body]);
// here we want to get a hit on the 'ken' in Frankenstein
let query = query_parser.parse_query("ken")?;
let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
for (_, doc_address) in top_docs {
let retrieved_doc = searcher.doc(doc_address)?;
println!("{}", schema.to_json(&retrieved_doc));
}
Ok(())
}


@@ -0,0 +1,143 @@
// # Deleting and Updating (?) documents
//
// This example explains how to delete and update documents.
// In fact there is actually no such thing as an update in tantivy.
//
// To update a document, you need to delete a document and then reinsert
// its new version.
//
// ---
// Importing tantivy...
use tantivy::collector::TopDocs;
use tantivy::query::TermQuery;
use tantivy::schema::*;
use tantivy::{doc, Index, IndexReader};
// A simple helper function to fetch a single document
// given its id from our index.
// It will be helpful to check our work.
fn extract_doc_given_isbn(
reader: &IndexReader,
isbn_term: &Term,
) -> tantivy::Result<Option<Document>> {
let searcher = reader.searcher();
// This is the simplest query you can think of.
// It matches all of the documents containing a specific term.
//
// The second argument is here to tell that we don't care about decoding positions,
// or term frequencies.
let term_query = TermQuery::new(isbn_term.clone(), IndexRecordOption::Basic);
let top_docs = searcher.search(&term_query, &TopDocs::with_limit(1))?;
if let Some((_score, doc_address)) = top_docs.first() {
let doc = searcher.doc(*doc_address)?;
Ok(Some(doc))
} else {
// no doc matching this ID.
Ok(None)
}
}
fn main() -> tantivy::Result<()> {
// # Defining the schema
//
// Check out the *basic_search* example if this makes
// little sense to you.
let mut schema_builder = Schema::builder();
// Tantivy does not really have a notion of primary id.
// This may change in the future.
//
// Still, we can create a `isbn` field and use it as an id. This
// field can be `u64` or a `text`, depending on your use case.
// It just needs to be indexed.
//
// If it is `text`, let's make sure to keep it `raw` and let's avoid
// running any text processing on it.
// This is done by associating this field to the tokenizer named `raw`.
// Rather than building our [`TextOptions`](//docs.rs/tantivy/~0/tantivy/schema/struct.TextOptions.html) manually,
// we use the `STRING` shortcut. `STRING` stands for indexed (without term frequency or positions)
// and untokenized.
//
// Because we also want to be able to see this `id` in our returned documents,
// we also mark the field as stored.
let isbn = schema_builder.add_text_field("isbn", STRING | STORED);
let title = schema_builder.add_text_field("title", TEXT | STORED);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema.clone());
let mut index_writer = index.writer(50_000_000)?;
// Let's add a couple of documents, for the sake of the example.
let mut old_man_doc = Document::default();
old_man_doc.add_text(title, "The Old Man and the Sea");
index_writer.add_document(doc!(
isbn => "978-0099908401",
title => "The old Man and the see"
));
index_writer.add_document(doc!(
isbn => "978-0140177398",
title => "Of Mice and Men",
));
index_writer.add_document(doc!(
title => "Frankentein", //< Oops there is a typo here.
isbn => "978-9176370711",
));
index_writer.commit()?;
let reader = index.reader()?;
let frankenstein_isbn = Term::from_field_text(isbn, "978-9176370711");
// Oops, our Frankenstein doc seems misspelled
let frankenstein_doc_misspelled = extract_doc_given_isbn(&reader, &frankenstein_isbn)?.unwrap();
assert_eq!(
schema.to_json(&frankenstein_doc_misspelled),
r#"{"isbn":["978-9176370711"],"title":["Frankentein"]}"#,
);
// # Update = Delete + Insert
//
// Here we will want to update the typo in the `Frankenstein` book.
//
// Tantivy does not handle updates directly, we need to delete
// and reinsert the document.
//
// This can be complicated as it means you need to have access
// to the entire document. It is good practice to integrate tantivy
// with a key value store for this reason.
//
// To remove one of the documents, we just call `delete_term`
// on its id.
//
// Note that `tantivy` does nothing to enforce the idea that
// there is only one document associated to this id.
//
// Also you might have noticed that we apply the delete before
// having committed. This does not really matter...
index_writer.delete_term(frankenstein_isbn.clone());
// We now need to reinsert our document without the typo.
index_writer.add_document(doc!(
title => "Frankenstein",
isbn => "978-9176370711",
));
// You are guaranteed that your clients will only observe your index in
// the state it was in after a commit.
// In this example, your search engine will at no point be missing the *Frankenstein* document.
// Everything happened as if the document was updated.
index_writer.commit()?;
// We reload our searcher to make our change available to clients.
reader.reload()?;
// No more typo!
let frankenstein_new_doc = extract_doc_given_isbn(&reader, &frankenstein_isbn)?.unwrap();
assert_eq!(
schema.to_json(&frankenstein_new_doc),
r#"{"isbn":["978-9176370711"],"title":["Frankenstein"]}"#,
);
Ok(())
}

112
examples/faceted_search.rs Normal file

@@ -0,0 +1,112 @@
// # Faceted Search Example
//
// This example covers the faceted search functionalities of
// tantivy.
//
// We will :
// - define our schema, including a facet field
// - create an index in RAM
// - index a few documents into our index
// - count the facets of the documents matching a query
// - drill down by searching on a facet directly
// ---
// Importing tantivy...
use tantivy::collector::FacetCollector;
use tantivy::query::{AllQuery, TermQuery};
use tantivy::schema::*;
use tantivy::{doc, Index};
fn main() -> tantivy::Result<()> {
// Let's start by defining a schema for this example
let mut schema_builder = Schema::builder();
let name = schema_builder.add_text_field("felin_name", TEXT | STORED);
// this is our faceted field: its scientific classification
let classification = schema_builder.add_facet_field("classification");
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let mut index_writer = index.writer(30_000_000)?;
// For convenience, we use tantivy's `doc!` macro
// to reduce the boilerplate of creating documents.
index_writer.add_document(doc!(
name => "Cat",
classification => Facet::from("/Felidae/Felinae/Felis")
));
index_writer.add_document(doc!(
name => "Canada lynx",
classification => Facet::from("/Felidae/Felinae/Lynx")
));
index_writer.add_document(doc!(
name => "Cheetah",
classification => Facet::from("/Felidae/Felinae/Acinonyx")
));
index_writer.add_document(doc!(
name => "Tiger",
classification => Facet::from("/Felidae/Pantherinae/Panthera")
));
index_writer.add_document(doc!(
name => "Lion",
classification => Facet::from("/Felidae/Pantherinae/Panthera")
));
index_writer.add_document(doc!(
name => "Jaguar",
classification => Facet::from("/Felidae/Pantherinae/Panthera")
));
index_writer.add_document(doc!(
name => "Sunda clouded leopard",
classification => Facet::from("/Felidae/Pantherinae/Neofelis")
));
index_writer.add_document(doc!(
name => "Fossa",
classification => Facet::from("/Eupleridae/Cryptoprocta")
));
index_writer.commit()?;
let reader = index.reader()?;
let searcher = reader.searcher();
{
let mut facet_collector = FacetCollector::for_field(classification);
facet_collector.add_facet("/Felidae");
let facet_counts = searcher.search(&AllQuery, &facet_collector)?;
// This lists all of the facet counts, right below "/Felidae".
let facets: Vec<(&Facet, u64)> = facet_counts.get("/Felidae").collect();
assert_eq!(
facets,
vec![
(&Facet::from("/Felidae/Felinae"), 3),
(&Facet::from("/Felidae/Pantherinae"), 4),
]
);
}
// Facets are also searchable.
//
// For instance, a common UI pattern is to allow the user to click on a facet link
// (e.g: `Pantherinae`) to drill down and filter the current result set with this subfacet.
//
// The search would then look as follows.
// Check the reference doc for different ways to create a `Facet` object.
{
let facet = Facet::from_text("/Felidae/Pantherinae");
let facet_term = Term::from_facet(classification, &facet);
let facet_term_query = TermQuery::new(facet_term, IndexRecordOption::Basic);
let mut facet_collector = FacetCollector::for_field(classification);
facet_collector.add_facet("/Felidae/Pantherinae");
let facet_counts = searcher.search(&facet_term_query, &facet_collector)?;
let facets: Vec<(&Facet, u64)> = facet_counts.get("/Felidae/Pantherinae").collect();
assert_eq!(
facets,
vec![
(&Facet::from("/Felidae/Pantherinae/Neofelis"), 1),
(&Facet::from("/Felidae/Pantherinae/Panthera"), 3),
]
);
}
Ok(())
}


@@ -0,0 +1,98 @@
use std::collections::HashSet;
use tantivy::collector::TopDocs;
use tantivy::doc;
use tantivy::query::BooleanQuery;
use tantivy::schema::*;
use tantivy::{DocId, Index, Score, SegmentReader};
fn main() -> tantivy::Result<()> {
let mut schema_builder = Schema::builder();
let title = schema_builder.add_text_field("title", STORED);
let ingredient = schema_builder.add_facet_field("ingredient");
let schema = schema_builder.build();
let index = Index::create_in_ram(schema.clone());
let mut index_writer = index.writer(30_000_000)?;
index_writer.add_document(doc!(
title => "Fried egg",
ingredient => Facet::from("/ingredient/egg"),
ingredient => Facet::from("/ingredient/oil"),
));
index_writer.add_document(doc!(
title => "Scrambled egg",
ingredient => Facet::from("/ingredient/egg"),
ingredient => Facet::from("/ingredient/butter"),
ingredient => Facet::from("/ingredient/milk"),
ingredient => Facet::from("/ingredient/salt"),
));
index_writer.add_document(doc!(
title => "Egg rolls",
ingredient => Facet::from("/ingredient/egg"),
ingredient => Facet::from("/ingredient/garlic"),
ingredient => Facet::from("/ingredient/salt"),
ingredient => Facet::from("/ingredient/oil"),
ingredient => Facet::from("/ingredient/tortilla-wrap"),
ingredient => Facet::from("/ingredient/mushroom"),
));
index_writer.commit()?;
let reader = index.reader()?;
let searcher = reader.searcher();
{
let facets = vec![
Facet::from("/ingredient/egg"),
Facet::from("/ingredient/oil"),
Facet::from("/ingredient/garlic"),
Facet::from("/ingredient/mushroom"),
];
let query = BooleanQuery::new_multiterms_query(
facets
.iter()
.map(|key| Term::from_facet(ingredient, &key))
.collect(),
);
let top_docs_by_custom_score =
TopDocs::with_limit(2).tweak_score(move |segment_reader: &SegmentReader| {
let ingredient_reader = segment_reader.facet_reader(ingredient).unwrap();
let facet_dict = ingredient_reader.facet_dict();
let query_ords: HashSet<u64> = facets
.iter()
.filter_map(|key| facet_dict.term_ord(key.encoded_str()))
.collect();
let mut facet_ords_buffer: Vec<u64> = Vec::with_capacity(20);
move |doc: DocId, original_score: Score| {
ingredient_reader.facet_ords(doc, &mut facet_ords_buffer);
let missing_ingredients = facet_ords_buffer
.iter()
.filter(|ord| !query_ords.contains(ord))
.count();
let tweak = 1.0 / 4_f32.powi(missing_ingredients as i32);
original_score * tweak
}
});
let top_docs = searcher.search(&query, &top_docs_by_custom_score)?;
let titles: Vec<String> = top_docs
.iter()
.map(|(_, doc_id)| {
searcher
.doc(*doc_id)
.unwrap()
.get_first(title)
.unwrap()
.text()
.unwrap()
.to_owned()
})
.collect();
assert_eq!(titles, vec!["Fried egg", "Egg rolls"]);
}
Ok(())
}
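The damping factor in `tweak_score` above is worth a worked check: every ingredient a recipe contains that is not in the query divides its score by 4. "Fried egg" has zero extra ingredients, "Egg rolls" has two (salt and tortilla-wrap), and "Scrambled egg" has three. A plain-Rust sanity check of that formula, independent of tantivy:

fn tweak(missing_ingredients: u32) -> f32 {
    // Same formula as in the example: 1 / 4^missing.
    1.0 / 4_f32.powi(missing_ingredients as i32)
}

fn main() {
    assert_eq!(tweak(0), 1.0); // "Fried egg": score unchanged.
    assert_eq!(tweak(2), 1.0 / 16.0); // "Egg rolls": score divided by 16.
    assert_eq!(tweak(3), 1.0 / 64.0); // "Scrambled egg": score divided by 64.
}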


@@ -1,2 +0,0 @@
#!/bin/bash
docco simple_search.rs -o html


@@ -1,518 +0,0 @@
/*--------------------- Typography ----------------------------*/
@font-face {
font-family: 'aller-light';
src: url('public/fonts/aller-light.eot');
src: url('public/fonts/aller-light.eot?#iefix') format('embedded-opentype'),
url('public/fonts/aller-light.woff') format('woff'),
url('public/fonts/aller-light.ttf') format('truetype');
font-weight: normal;
font-style: normal;
}
@font-face {
font-family: 'aller-bold';
src: url('public/fonts/aller-bold.eot');
src: url('public/fonts/aller-bold.eot?#iefix') format('embedded-opentype'),
url('public/fonts/aller-bold.woff') format('woff'),
url('public/fonts/aller-bold.ttf') format('truetype');
font-weight: normal;
font-style: normal;
}
@font-face {
font-family: 'roboto-black';
src: url('public/fonts/roboto-black.eot');
src: url('public/fonts/roboto-black.eot?#iefix') format('embedded-opentype'),
url('public/fonts/roboto-black.woff') format('woff'),
url('public/fonts/roboto-black.ttf') format('truetype');
font-weight: normal;
font-style: normal;
}
/*--------------------- Layout ----------------------------*/
html { height: 100%; }
body {
font-family: "aller-light";
font-size: 14px;
line-height: 18px;
color: #30404f;
margin: 0; padding: 0;
height:100%;
}
#container { min-height: 100%; }
a {
color: #000;
}
b, strong {
font-weight: normal;
font-family: "aller-bold";
}
p {
margin: 15px 0 0px;
}
.annotation ul, .annotation ol {
margin: 25px 0;
}
.annotation ul li, .annotation ol li {
font-size: 14px;
line-height: 18px;
margin: 10px 0;
}
h1, h2, h3, h4, h5, h6 {
color: #112233;
line-height: 1em;
font-weight: normal;
font-family: "roboto-black";
text-transform: uppercase;
margin: 30px 0 15px 0;
}
h1 {
margin-top: 40px;
}
h2 {
font-size: 1.26em;
}
hr {
border: 0;
background: 1px #ddd;
height: 1px;
margin: 20px 0;
}
pre, tt, code {
font-size: 12px; line-height: 16px;
font-family: Menlo, Monaco, Consolas, "Lucida Console", monospace;
margin: 0; padding: 0;
}
.annotation pre {
display: block;
margin: 0;
padding: 7px 10px;
background: #fcfcfc;
-moz-box-shadow: inset 0 0 10px rgba(0,0,0,0.1);
-webkit-box-shadow: inset 0 0 10px rgba(0,0,0,0.1);
box-shadow: inset 0 0 10px rgba(0,0,0,0.1);
overflow-x: auto;
}
.annotation pre code {
border: 0;
padding: 0;
background: transparent;
}
blockquote {
border-left: 5px solid #ccc;
margin: 0;
padding: 1px 0 1px 1em;
}
.sections blockquote p {
font-family: Menlo, Consolas, Monaco, monospace;
font-size: 12px; line-height: 16px;
color: #999;
margin: 10px 0 0;
white-space: pre-wrap;
}
ul.sections {
list-style: none;
padding:0 0 5px 0;;
margin:0;
}
/*
Force border-box so that % widths fit the parent
container without overlap because of margin/padding.
More Info : http://www.quirksmode.org/css/box.html
*/
ul.sections > li > div {
-moz-box-sizing: border-box; /* firefox */
-ms-box-sizing: border-box; /* ie */
-webkit-box-sizing: border-box; /* webkit */
-khtml-box-sizing: border-box; /* konqueror */
box-sizing: border-box; /* css3 */
}
/*---------------------- Jump Page -----------------------------*/
#jump_to, #jump_page {
margin: 0;
background: white;
-webkit-box-shadow: 0 0 25px #777; -moz-box-shadow: 0 0 25px #777;
-webkit-border-bottom-left-radius: 5px; -moz-border-radius-bottomleft: 5px;
font: 16px Arial;
cursor: pointer;
text-align: right;
list-style: none;
}
#jump_to a {
text-decoration: none;
}
#jump_to a.large {
display: none;
}
#jump_to a.small {
font-size: 22px;
font-weight: bold;
color: #676767;
}
#jump_to, #jump_wrapper {
position: fixed;
right: 0; top: 0;
padding: 10px 15px;
margin:0;
}
#jump_wrapper {
display: none;
padding:0;
}
#jump_to:hover #jump_wrapper {
display: block;
}
#jump_page_wrapper{
position: fixed;
right: 0;
top: 0;
bottom: 0;
}
#jump_page {
padding: 5px 0 3px;
margin: 0 0 25px 25px;
max-height: 100%;
overflow: auto;
}
#jump_page .source {
display: block;
padding: 15px;
text-decoration: none;
border-top: 1px solid #eee;
}
#jump_page .source:hover {
background: #f5f5ff;
}
#jump_page .source:first-child {
}
/*---------------------- Low resolutions (> 320px) ---------------------*/
@media only screen and (min-width: 320px) {
.pilwrap { display: none; }
ul.sections > li > div {
display: block;
padding:5px 10px 0 10px;
}
ul.sections > li > div.annotation ul, ul.sections > li > div.annotation ol {
padding-left: 30px;
}
ul.sections > li > div.content {
overflow-x:auto;
-webkit-box-shadow: inset 0 0 5px #e5e5ee;
box-shadow: inset 0 0 5px #e5e5ee;
border: 1px solid #dedede;
margin:5px 10px 5px 10px;
padding-bottom: 5px;
}
ul.sections > li > div.annotation pre {
margin: 7px 0 7px;
padding-left: 15px;
}
ul.sections > li > div.annotation p tt, .annotation code {
background: #f8f8ff;
border: 1px solid #dedede;
font-size: 12px;
padding: 0 0.2em;
}
}
/*---------------------- (> 481px) ---------------------*/
@media only screen and (min-width: 481px) {
#container {
position: relative;
}
body {
background-color: #F5F5FF;
font-size: 15px;
line-height: 21px;
}
pre, tt, code {
line-height: 18px;
}
p, ul, ol {
margin: 0 0 15px;
}
#jump_to {
padding: 5px 10px;
}
#jump_wrapper {
padding: 0;
}
#jump_to, #jump_page {
font: 10px Arial;
text-transform: uppercase;
}
#jump_page .source {
padding: 5px 10px;
}
#jump_to a.large {
display: inline-block;
}
#jump_to a.small {
display: none;
}
#background {
position: absolute;
top: 0; bottom: 0;
width: 350px;
background: #fff;
border-right: 1px solid #e5e5ee;
z-index: -1;
}
ul.sections > li > div.annotation ul, ul.sections > li > div.annotation ol {
padding-left: 40px;
}
ul.sections > li {
white-space: nowrap;
}
ul.sections > li > div {
display: inline-block;
}
ul.sections > li > div.annotation {
max-width: 350px;
min-width: 350px;
min-height: 5px;
padding: 13px;
overflow-x: hidden;
white-space: normal;
vertical-align: top;
text-align: left;
}
ul.sections > li > div.annotation pre {
margin: 15px 0 15px;
padding-left: 15px;
}
ul.sections > li > div.content {
padding: 13px;
vertical-align: top;
border: none;
-webkit-box-shadow: none;
box-shadow: none;
}
.pilwrap {
position: relative;
display: inline;
}
.pilcrow {
font: 12px Arial;
text-decoration: none;
color: #454545;
position: absolute;
top: 3px; left: -20px;
padding: 1px 2px;
opacity: 0;
-webkit-transition: opacity 0.2s linear;
}
.for-h1 .pilcrow {
top: 47px;
}
.for-h2 .pilcrow, .for-h3 .pilcrow, .for-h4 .pilcrow {
top: 35px;
}
ul.sections > li > div.annotation:hover .pilcrow {
opacity: 1;
}
}
/*---------------------- (> 1025px) ---------------------*/
@media only screen and (min-width: 1025px) {
body {
font-size: 16px;
line-height: 24px;
}
#background {
width: 525px;
}
ul.sections > li > div.annotation {
max-width: 525px;
min-width: 525px;
padding: 10px 25px 1px 50px;
}
ul.sections > li > div.content {
padding: 9px 15px 16px 25px;
}
}
/*---------------------- Syntax Highlighting -----------------------------*/
td.linenos { background-color: #f0f0f0; padding-right: 10px; }
span.lineno { background-color: #f0f0f0; padding: 0 5px 0 5px; }
/*
github.com style (c) Vasily Polovnyov <vast@whiteants.net>
*/
pre code {
display: block; padding: 0.5em;
color: #000;
background: #f8f8ff
}
pre .hljs-comment,
pre .hljs-template_comment,
pre .hljs-diff .hljs-header,
pre .hljs-javadoc {
color: #408080;
font-style: italic
}
pre .hljs-keyword,
pre .hljs-assignment,
pre .hljs-literal,
pre .hljs-css .hljs-rule .hljs-keyword,
pre .hljs-winutils,
pre .hljs-javascript .hljs-title,
pre .hljs-lisp .hljs-title,
pre .hljs-subst {
color: #954121;
/*font-weight: bold*/
}
pre .hljs-number,
pre .hljs-hexcolor {
color: #40a070
}
pre .hljs-string,
pre .hljs-tag .hljs-value,
pre .hljs-phpdoc,
pre .hljs-tex .hljs-formula {
color: #219161;
}
pre .hljs-title,
pre .hljs-id {
color: #19469D;
}
pre .hljs-params {
color: #00F;
}
pre .hljs-javascript .hljs-title,
pre .hljs-lisp .hljs-title,
pre .hljs-subst {
font-weight: normal
}
pre .hljs-class .hljs-title,
pre .hljs-haskell .hljs-label,
pre .hljs-tex .hljs-command {
color: #458;
font-weight: bold
}
pre .hljs-tag,
pre .hljs-tag .hljs-title,
pre .hljs-rules .hljs-property,
pre .hljs-django .hljs-tag .hljs-keyword {
color: #000080;
font-weight: normal
}
pre .hljs-attribute,
pre .hljs-variable,
pre .hljs-instancevar,
pre .hljs-lisp .hljs-body {
color: #008080
}
pre .hljs-regexp {
color: #B68
}
pre .hljs-class {
color: #458;
font-weight: bold
}
pre .hljs-symbol,
pre .hljs-ruby .hljs-symbol .hljs-string,
pre .hljs-ruby .hljs-symbol .hljs-keyword,
pre .hljs-ruby .hljs-symbol .hljs-keymethods,
pre .hljs-lisp .hljs-keyword,
pre .hljs-tex .hljs-special,
pre .hljs-input_number {
color: #990073
}
pre .hljs-builtin,
pre .hljs-constructor,
pre .hljs-built_in,
pre .hljs-lisp .hljs-title {
color: #0086b3
}
pre .hljs-preprocessor,
pre .hljs-pi,
pre .hljs-doctype,
pre .hljs-shebang,
pre .hljs-cdata {
color: #999;
font-weight: bold
}
pre .hljs-deletion {
background: #fdd
}
pre .hljs-addition {
background: #dfd
}
pre .hljs-diff .hljs-change {
background: #0086b3
}
pre .hljs-chunk {
color: #aaa
}
pre .hljs-tex .hljs-formula {
opacity: 0.5;
}

Binary file not shown. (Before: 56 KiB)

@@ -1,375 +0,0 @@
/*! normalize.css v2.0.1 | MIT License | git.io/normalize */
/* ==========================================================================
HTML5 display definitions
========================================================================== */
/*
* Corrects `block` display not defined in IE 8/9.
*/
article,
aside,
details,
figcaption,
figure,
footer,
header,
hgroup,
nav,
section,
summary {
display: block;
}
/*
* Corrects `inline-block` display not defined in IE 8/9.
*/
audio,
canvas,
video {
display: inline-block;
}
/*
* Prevents modern browsers from displaying `audio` without controls.
* Remove excess height in iOS 5 devices.
*/
audio:not([controls]) {
display: none;
height: 0;
}
/*
* Addresses styling for `hidden` attribute not present in IE 8/9.
*/
[hidden] {
display: none;
}
/* ==========================================================================
Base
========================================================================== */
/*
* 1. Sets default font family to sans-serif.
* 2. Prevents iOS text size adjust after orientation change, without disabling
* user zoom.
*/
html {
font-family: sans-serif; /* 1 */
-webkit-text-size-adjust: 100%; /* 2 */
-ms-text-size-adjust: 100%; /* 2 */
}
/*
* Removes default margin.
*/
body {
margin: 0;
}
/* ==========================================================================
Links
========================================================================== */
/*
* Addresses `outline` inconsistency between Chrome and other browsers.
*/
a:focus {
outline: thin dotted;
}
/*
* Improves readability when focused and also mouse hovered in all browsers.
*/
a:active,
a:hover {
outline: 0;
}
/* ==========================================================================
Typography
========================================================================== */
/*
* Addresses `h1` font sizes within `section` and `article` in Firefox 4+,
* Safari 5, and Chrome.
*/
h1 {
font-size: 2em;
}
/*
* Addresses styling not present in IE 8/9, Safari 5, and Chrome.
*/
abbr[title] {
border-bottom: 1px dotted;
}
/*
* Addresses style set to `bolder` in Firefox 4+, Safari 5, and Chrome.
*/
b,
strong {
font-weight: bold;
}
/*
* Addresses styling not present in Safari 5 and Chrome.
*/
dfn {
font-style: italic;
}
/*
* Addresses styling not present in IE 8/9.
*/
mark {
background: #ff0;
color: #000;
}
/*
* Corrects font family set oddly in Safari 5 and Chrome.
*/
code,
kbd,
pre,
samp {
font-family: monospace, serif;
font-size: 1em;
}
/*
* Improves readability of pre-formatted text in all browsers.
*/
pre {
white-space: pre;
white-space: pre-wrap;
word-wrap: break-word;
}
/*
* Sets consistent quote types.
*/
q {
quotes: "\201C" "\201D" "\2018" "\2019";
}
/*
* Addresses inconsistent and variable font size in all browsers.
*/
small {
font-size: 80%;
}
/*
* Prevents `sub` and `sup` affecting `line-height` in all browsers.
*/
sub,
sup {
font-size: 75%;
line-height: 0;
position: relative;
vertical-align: baseline;
}
sup {
top: -0.5em;
}
sub {
bottom: -0.25em;
}
/* ==========================================================================
Embedded content
========================================================================== */
/*
* Removes border when inside `a` element in IE 8/9.
*/
img {
border: 0;
}
/*
* Corrects overflow displayed oddly in IE 9.
*/
svg:not(:root) {
overflow: hidden;
}
/* ==========================================================================
Figures
========================================================================== */
/*
* Addresses margin not present in IE 8/9 and Safari 5.
*/
figure {
margin: 0;
}
/* ==========================================================================
Forms
========================================================================== */
/*
* Define consistent border, margin, and padding.
*/
fieldset {
border: 1px solid #c0c0c0;
margin: 0 2px;
padding: 0.35em 0.625em 0.75em;
}
/*
* 1. Corrects color not being inherited in IE 8/9.
* 2. Remove padding so people aren't caught out if they zero out fieldsets.
*/
legend {
border: 0; /* 1 */
padding: 0; /* 2 */
}
/*
* 1. Corrects font family not being inherited in all browsers.
* 2. Corrects font size not being inherited in all browsers.
* 3. Addresses margins set differently in Firefox 4+, Safari 5, and Chrome
*/
button,
input,
select,
textarea {
font-family: inherit; /* 1 */
font-size: 100%; /* 2 */
margin: 0; /* 3 */
}
/*
* Addresses Firefox 4+ setting `line-height` on `input` using `!important` in
* the UA stylesheet.
*/
button,
input {
line-height: normal;
}
/*
* 1. Avoid the WebKit bug in Android 4.0.* where (2) destroys native `audio`
* and `video` controls.
* 2. Corrects inability to style clickable `input` types in iOS.
* 3. Improves usability and consistency of cursor style between image-type
* `input` and others.
*/
button,
html input[type="button"], /* 1 */
input[type="reset"],
input[type="submit"] {
-webkit-appearance: button; /* 2 */
cursor: pointer; /* 3 */
}
/*
* Re-set default cursor for disabled elements.
*/
button[disabled],
input[disabled] {
cursor: default;
}
/*
* 1. Addresses box sizing set to `content-box` in IE 8/9.
* 2. Removes excess padding in IE 8/9.
*/
input[type="checkbox"],
input[type="radio"] {
box-sizing: border-box; /* 1 */
padding: 0; /* 2 */
}
/*
* 1. Addresses `appearance` set to `searchfield` in Safari 5 and Chrome.
* 2. Addresses `box-sizing` set to `border-box` in Safari 5 and Chrome
* (include `-moz` to future-proof).
*/
input[type="search"] {
-webkit-appearance: textfield; /* 1 */
-moz-box-sizing: content-box;
-webkit-box-sizing: content-box; /* 2 */
box-sizing: content-box;
}
/*
* Removes inner padding and search cancel button in Safari 5 and Chrome
* on OS X.
*/
input[type="search"]::-webkit-search-cancel-button,
input[type="search"]::-webkit-search-decoration {
-webkit-appearance: none;
}
/*
* Removes inner padding and border in Firefox 4+.
*/
button::-moz-focus-inner,
input::-moz-focus-inner {
border: 0;
padding: 0;
}
/*
* 1. Removes default vertical scrollbar in IE 8/9.
* 2. Improves readability and alignment in all browsers.
*/
textarea {
overflow: auto; /* 1 */
vertical-align: top; /* 2 */
}
/* ==========================================================================
Tables
========================================================================== */
/*
* Remove most spacing between table cells.
*/
table {
border-collapse: collapse;
border-spacing: 0;
}


@@ -1,503 +0,0 @@
<!DOCTYPE html>
<html>
<head>
<title>simple_search.rs</title>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<meta name="viewport" content="width=device-width, target-densitydpi=160dpi, initial-scale=1.0; maximum-scale=1.0; user-scalable=0;">
<link rel="stylesheet" media="all" href="docco.css" />
</head>
<body>
<div id="container">
<div id="background"></div>
<ul class="sections">
<li id="title">
<div class="annotation">
<h1>simple_search.rs</h1>
</div>
</li>
<li id="section-1">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-1">&#182;</a>
</div>
</div>
<div class="content"><div class='highlight'><pre><span class="hljs-keyword">extern</span> <span class="hljs-keyword">crate</span> rustc_serialize;
<span class="hljs-keyword">extern</span> <span class="hljs-keyword">crate</span> tantivy;
<span class="hljs-keyword">extern</span> <span class="hljs-keyword">crate</span> tempdir;
<span class="hljs-keyword">use</span> std::path::Path;
<span class="hljs-keyword">use</span> tempdir::TempDir;
<span class="hljs-keyword">use</span> tantivy::Index;
<span class="hljs-keyword">use</span> tantivy::schema::*;
<span class="hljs-keyword">use</span> tantivy::collector::TopCollector;
<span class="hljs-keyword">use</span> tantivy::query::QueryParser;
<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">main</span></span>() {</pre></div></div>
</li>
<li id="section-2">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-2">&#182;</a>
</div>
<p>Lets create a temporary directory for the
sake of this example</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">if</span> <span class="hljs-keyword">let</span> <span class="hljs-literal">Ok</span>(dir) = TempDir::new(<span class="hljs-string">"tantivy_example_dir"</span>) {
run_example(dir.path()).unwrap();
dir.close().unwrap();
}
}
<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">run_example</span></span>(index_path: &amp;Path) -&gt; tantivy::<span class="hljs-built_in">Result</span>&lt;()&gt; {</pre></div></div>
</li>
<li id="section-3">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-3">&#182;</a>
</div>
<h1 id="defining-the-schema">Defining the schema</h1>
<p>The Tantivy index requires a very strict schema.
The schema declares which fields are in the index,
and for each field, its type and “the way it should
be indexed”.</p>
</div>
</li>
<li id="section-4">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-4">&#182;</a>
</div>
<p>first we need to define a schema …</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> schema_builder = SchemaBuilder::<span class="hljs-keyword">default</span>();</pre></div></div>
</li>
<li id="section-5">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-5">&#182;</a>
</div>
<p>Our first field is title.
We want full-text search for it, and we want to be able
to retrieve the document after the search.</p>
<p>TEXT | STORED is some syntactic sugar to describe
that.</p>
<p><code>TEXT</code> means the field should be tokenized and indexed,
along with its term frequency and term positions.</p>
<p><code>STORED</code> means that the field will also be saved
in a compressed, row-oriented key-value store.
This store is useful to reconstruct the
documents that were selected during the search phase.</p>
</div>
<div class="content"><div class='highlight'><pre> schema_builder.add_text_field(<span class="hljs-string">"title"</span>, TEXT | STORED);</pre></div></div>
</li>
<li id="section-6">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-6">&#182;</a>
</div>
<p>Our first field is body.
We want full-text search for it, and we want to be able
to retrieve the body after the search.</p>
</div>
<div class="content"><div class='highlight'><pre> schema_builder.add_text_field(<span class="hljs-string">"body"</span>, TEXT);
<span class="hljs-keyword">let</span> schema = schema_builder.build();</pre></div></div>
</li>
<li id="section-7">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-7">&#182;</a>
</div>
<h1 id="indexing-documents">Indexing documents</h1>
<p>Lets create a brand new index.</p>
<p>This will actually just save a meta.json
with our schema in the directory.</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">let</span> index = <span class="hljs-built_in">try!</span>(Index::create(index_path, schema.clone()));</pre></div></div>
</li>
<li id="section-8">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-8">&#182;</a>
</div>
<p>To insert document we need an index writer.
There must be only one writer at a time.
This single <code>IndexWriter</code> is already
multithreaded.</p>
<p>Here we use a buffer of 50MB per thread. Using a bigger
heap for the indexer can increase its throughput.</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> index_writer = <span class="hljs-built_in">try!</span>(index.writer(<span class="hljs-number">50_000_000</span>));</pre></div></div>
</li>
<li id="section-9">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-9">&#182;</a>
</div>
<p>Lets index our documents!
We first need a handle on the title and the body field.</p>
</div>
</li>
<li id="section-10">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-10">&#182;</a>
</div>
<h3 id="create-a-document-manually-">Create a document “manually”.</h3>
<p>We can create a document manually, by setting the fields
one by one in a Document object.</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">let</span> title = schema.get_field(<span class="hljs-string">"title"</span>).unwrap();
<span class="hljs-keyword">let</span> body = schema.get_field(<span class="hljs-string">"body"</span>).unwrap();
<span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> old_man_doc = Document::<span class="hljs-keyword">default</span>();
old_man_doc.add_text(title, <span class="hljs-string">"The Old Man and the Sea"</span>);
old_man_doc.add_text(body,
<span class="hljs-string">"He was an old man who fished alone in a skiff in the Gulf Stream and \
he had gone eighty-four days now without taking a fish."</span>);</pre></div></div>
</li>
<li id="section-11">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-11">&#182;</a>
</div>
<p>… and add it to the <code>IndexWriter</code>.</p>
</div>
<div class="content"><div class='highlight'><pre> index_writer.add_document(old_man_doc);</pre></div></div>
</li>
<li id="section-12">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-12">&#182;</a>
</div>
<h3 id="create-a-document-directly-from-json-">Create a document directly from json.</h3>
<p>Alternatively, we can use our schema to parse
a document object directly from json.</p>
</div>
<div class="content"><div class='highlight'><pre>
<span class="hljs-keyword">let</span> mice_and_men_doc = <span class="hljs-built_in">try!</span>(schema.parse_document(r#<span class="hljs-string">"{
"</span>title<span class="hljs-string">": "</span>Of Mice and Men<span class="hljs-string">",
"</span>body<span class="hljs-string">": "</span>few miles south of Soledad, the Salinas River drops <span class="hljs-keyword">in</span> close to the hillside bank and runs deep and green. The water is warm too, <span class="hljs-keyword">for</span> it has slipped twinkling over the yellow sands <span class="hljs-keyword">in</span> the sunlight before reaching the narrow pool. On one side of the river the golden foothill slopes curve up to the strong and rocky Gabilan Mountains, but on the valley side the water is lined with trees—willows fresh and green with every spring, carrying <span class="hljs-keyword">in</span> their lower leaf junctures the debris of the winters flooding; and sycamores with mottled, white,recumbent limbs and branches that arch over the pool<span class="hljs-string">"
}"</span>#));
index_writer.add_document(mice_and_men_doc);</pre></div></div>
</li>
<li id="section-13">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-13">&#182;</a>
</div>
<p>Multi-valued field are allowed, they are
expressed in JSON by an array.
The following document has two titles.</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">let</span> frankenstein_doc = <span class="hljs-built_in">try!</span>(schema.parse_document(r#<span class="hljs-string">"{
"</span>title<span class="hljs-string">": ["</span>Frankenstein<span class="hljs-string">", "</span>The Modern Promotheus<span class="hljs-string">"],
"</span>body<span class="hljs-string">": "</span>You will rejoice to hear that no disaster has accompanied the commencement of an enterprise which you have regarded with such evil forebodings. I arrived here yesterday, and my first task is to assure my dear sister of my welfare and increasing confidence <span class="hljs-keyword">in</span> the success of my undertaking.<span class="hljs-string">"
}"</span>#));
index_writer.add_document(frankenstein_doc);</pre></div></div>
</li>
<li id="section-14">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-14">&#182;</a>
</div>
<p>This is an example, so we will only index 3 documents
here. You can check out tantivys tutorial to index
the English wikipedia. Tantivys indexing is rather fast.
Indexing 5 million articles of the English wikipedia takes
around 4 minutes on my computer!</p>
</div>
</li>
<li id="section-15">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-15">&#182;</a>
</div>
<h3 id="committing">Committing</h3>
<p>At this point our documents are not searchable.</p>
<p>We need to call .commit() explicitly to force the
index_writer to finish processing the documents in the queue,
flush the current index to the disk, and advertise
the existence of new documents.</p>
<p>This call is blocking.</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-built_in">try!</span>(index_writer.commit());</pre></div></div>
</li>
<li id="section-16">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-16">&#182;</a>
</div>
<p>If <code>.commit()</code> returns correctly, then all of the
documents that have been added are guaranteed to be
persistently indexed.</p>
<p>In the scenario of a crash or a power failure,
tantivy behaves as if has rolled back to its last
commit.</p>
</div>
</li>
<li id="section-17">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-17">&#182;</a>
</div>
<h1 id="searching">Searching</h1>
<p>Lets search our index. Start by reloading
searchers in the index. This should be done
after every commit().</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-built_in">try!</span>(index.load_searchers());</pre></div></div>
</li>
<li id="section-18">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-18">&#182;</a>
</div>
<p>Afterwards create one (or more) searchers.</p>
<p>You should create a searcher
every time you start a “search query”.</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">let</span> searcher = index.searcher();</pre></div></div>
</li>
<li id="section-19">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-19">&#182;</a>
</div>
<p>The query parser can interpret human queries.
Here, if the user does not specify which
field they want to search, tantivy will search
in both title and body.</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">let</span> query_parser = QueryParser::new(index.schema(), <span class="hljs-built_in">vec!</span>[title, body]);</pre></div></div>
</li>
<li id="section-20">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-20">&#182;</a>
</div>
<p>QueryParser may fail if the query is not in the right
format. For user facing applications, this can be a problem.
A ticket has been opened regarding this problem.</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">let</span> query = <span class="hljs-built_in">try!</span>(query_parser.parse_query(<span class="hljs-string">"sea whale"</span>));</pre></div></div>
</li>
<li id="section-21">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-21">&#182;</a>
</div>
<p>A query defines a set of documents, as
well as the way they should be scored.</p>
<p>A query created by the query parser is scored according
to a metric called Tf-Idf, and will consider
any document matching at least one of our terms.</p>
</div>
</li>
<li id="section-22">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-22">&#182;</a>
</div>
<h3 id="collectors">Collectors</h3>
<p>We are not interested in all of the documents but
only in the top 10. Keeping track of our top 10 best documents
is the role of the TopCollector.</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> top_collector = TopCollector::with_limit(<span class="hljs-number">10</span>);</pre></div></div>
</li>
<li id="section-23">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-23">&#182;</a>
</div>
<p>We can now perform our query.</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-built_in">try!</span>(searcher.search(&amp;*query, &amp;<span class="hljs-keyword">mut</span> top_collector));</pre></div></div>
</li>
<li id="section-24">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-24">&#182;</a>
</div>
<p>Our top collector now contains the 10
most relevant doc ids…</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">let</span> doc_addresses = top_collector.docs();</pre></div></div>
</li>
<li id="section-25">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-25">&#182;</a>
</div>
<p>The actual documents still need to be
retrieved from Tantivys store.</p>
<p>Since the body field was not configured as stored,
the document returned will only contain
a title.</p>
</div>
<div class="content"><div class='highlight'><pre>
<span class="hljs-keyword">for</span> doc_address <span class="hljs-keyword">in</span> doc_addresses {
<span class="hljs-keyword">let</span> retrieved_doc = <span class="hljs-built_in">try!</span>(searcher.doc(&amp;doc_address));
<span class="hljs-built_in">println!</span>(<span class="hljs-string">"{}"</span>, schema.to_json(&amp;retrieved_doc));
}
<span class="hljs-literal">Ok</span>(())
}</pre></div></div>
</li>
</ul>
</div>
</body>
</html>


@@ -0,0 +1,39 @@
// # Searching a range on an indexed int field.
//
// Below is an example of creating an indexed integer field in your schema.
// You can use RangeQuery to get a Count of all occurrences in a given range.
use tantivy::collector::Count;
use tantivy::query::RangeQuery;
use tantivy::schema::{Schema, INDEXED};
use tantivy::{doc, Index, Result};
fn run() -> Result<()> {
// For the sake of simplicity, this schema will only have 1 field
let mut schema_builder = Schema::builder();
// `INDEXED` is a short-hand to indicate that our field should be "searchable".
let year_field = schema_builder.add_u64_field("year", INDEXED);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let reader = index.reader()?;
{
let mut index_writer = index.writer_with_num_threads(1, 6_000_000)?;
for year in 1950u64..2019u64 {
index_writer.add_document(doc!(year_field => year));
}
index_writer.commit()?;
// The index now contains one document per year, from 1950 through 2018
}
reader.reload()?;
let searcher = reader.searcher();
// The upper bound is excluded, i.e. this searches the years 1960 through 1969
let docs_in_the_sixties = RangeQuery::new_u64(year_field, 1960..1970);
// Uses a Count collector to sum the total number of docs in the range
let num_60s_books = searcher.search(&docs_in_the_sixties, &Count)?;
assert_eq!(num_60s_books, 10);
Ok(())
}
fn main() {
run().unwrap()
}
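The half-open semantics of the range passed to `RangeQuery::new_u64` mirror Rust's own `Range` type, so the ten-year count can be sanity-checked without tantivy at all:

fn main() {
    // `1960..1970` is end-exclusive: it covers 1960 through 1969, i.e. 10 years.
    assert_eq!((1960u64..1970u64).count(), 10);
}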


@@ -0,0 +1,135 @@
// # Iterating docs and positions.
//
// At its core, tantivy relies on a data structure
// called an inverted index.
//
// This example shows how to manually iterate through
// the list of documents containing a term, getting
// its term frequency, and accessing its positions.
// ---
// Importing tantivy...
use tantivy::schema::*;
use tantivy::{doc, DocSet, Index, Postings, TERMINATED};
fn main() -> tantivy::Result<()> {
// We first create a schema for the sake of the
// example. Check the `basic_search` example for more information.
let mut schema_builder = Schema::builder();
// For this example, we need to make sure to index positions for our title
// field. `TEXT` precisely does this.
let title = schema_builder.add_text_field("title", TEXT | STORED);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema.clone());
let mut index_writer = index.writer_with_num_threads(1, 50_000_000)?;
index_writer.add_document(doc!(title => "The Old Man and the Sea"));
index_writer.add_document(doc!(title => "Of Mice and Men"));
index_writer.add_document(doc!(title => "The modern Promotheus"));
index_writer.commit()?;
let reader = index.reader()?;
let searcher = reader.searcher();
// A tantivy index is actually a collection of segments.
// Similarly, a searcher just wraps a list of `segment_reader`s.
//
// (Because we indexed a very small number of documents over one thread
// there is actually only one segment here, but let's iterate through the list
// anyway)
for segment_reader in searcher.segment_readers() {
// A segment contains several data structures.
// The inverted index is the combination of
// - the term dictionary
// - the inverted lists associated with each term, and their positions
let inverted_index = segment_reader.inverted_index(title);
// A `Term` is a text token associated with a field.
// Let's go through all docs containing the term `title:the` and access their position
let term_the = Term::from_field_text(title, "the");
// This segment posting object is like a cursor over the documents matching the term.
// The `IndexRecordOption` argument tells tantivy we will be interested in both term frequencies
// and positions.
//
// If you don't need all this information, you may get better performance by decompressing less
// information.
if let Some(mut segment_postings) =
inverted_index.read_postings(&term_the, IndexRecordOption::WithFreqsAndPositions)
{
// this buffer will be used to request positions
let mut positions: Vec<u32> = Vec::with_capacity(100);
let mut doc_id = segment_postings.doc();
while doc_id != TERMINATED {
// This MAY contain deleted documents as well.
if segment_reader.is_deleted(doc_id) {
doc_id = segment_postings.advance();
continue;
}
// the number of times the term appears in the document.
let term_freq: u32 = segment_postings.term_freq();
// accessing positions is lazy and slightly expensive, so do not request
// them for documents where you don't need them.
segment_postings.positions(&mut positions);
// By definition we should have `term_freq` positions.
assert_eq!(positions.len(), term_freq as usize);
// This prints:
// ```
// Doc 0: TermFreq 2: [0, 4]
// Doc 2: TermFreq 1: [0]
// ```
println!("Doc {}: TermFreq {}: {:?}", doc_id, term_freq, positions);
doc_id = segment_postings.advance();
}
}
}
// Let's now go through the docs containing the term `title:the` again,
// this time block by block.
let term_the = Term::from_field_text(title, "the");
// Some other powerful operations (especially `.skip_to`) may be useful to consume these
// posting lists rapidly.
// You can check for them in the [`DocSet`](https://docs.rs/tantivy/~0/tantivy/trait.DocSet.html) trait
// and the [`Postings`](https://docs.rs/tantivy/~0/tantivy/trait.Postings.html) trait
// Also, for some VERY specific high-performance use cases like an OLAP analysis of logs,
// you can get better performance by directly accessing the blocks of doc ids.
for segment_reader in searcher.segment_readers() {
// A segment contains several data structures.
// The inverted index is the combination of
// - the term dictionary
// - the inverted lists associated with each term, and their positions
let inverted_index = segment_reader.inverted_index(title);
// This block segment postings object is like a cursor over blocks of documents matching the term.
// Here the `IndexRecordOption::Basic` argument tells tantivy we are only interested in doc ids,
// so term frequencies and positions will not be decoded.
if let Some(mut block_segment_postings) =
inverted_index.read_block_postings(&term_the, IndexRecordOption::Basic)
{
loop {
let docs = block_segment_postings.docs();
if docs.is_empty() {
break;
}
// Once again, these docs MAY contain deleted documents as well.
// Prints `Docs [0, 2].`
println!("Docs {:?}", docs);
block_segment_postings.advance();
}
}
}
Ok(())
}
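The first loop above is easy to factor into a reusable helper. A sketch using only the cursor API the example already relies on (`doc`, `advance`, `term_freq`, and the deleted-doc check); the function name is ours, not tantivy's:

use tantivy::{DocSet, Postings, SegmentReader, TERMINATED};

// Sums the term frequency of one posting list over the live
// (non-deleted) documents of a segment.
fn total_term_freq(segment_reader: &SegmentReader, mut postings: impl Postings) -> u64 {
    let mut total = 0u64;
    let mut doc_id = postings.doc();
    while doc_id != TERMINATED {
        if !segment_reader.is_deleted(doc_id) {
            total += u64::from(postings.term_freq());
        }
        doc_id = postings.advance();
    }
    total
}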


@@ -0,0 +1,100 @@
// # Indexing from different threads.
//
// It is fairly common to have to index from different threads.
// Tantivy forbids creating more than one `IndexWriter` at a time.
//
// This `IndexWriter` itself has its own multithreaded layer, so managing your own
// indexing threads will not help. However, it can still be useful for some applications.
//
// For instance, if preparing documents to send to tantivy before indexing is the bottleneck of
// your application, it is reasonable to have multiple threads.
//
// Another very common reason to index from multiple threads is implementing a webserver
// with CRUD capabilities. The server framework will most likely handle requests from
// different threads.
//
// The recommended way to address both of these use cases is to wrap your `IndexWriter` in an
// `Arc<RwLock<IndexWriter>>`.
//
// While this is counterintuitive, adding and deleting documents do not require mutability
// over the `IndexWriter`, so several threads can perform these operations concurrently.
//
// The example below does not represent an actual real-life use case (who would spawn a thread
// to index a single document?), but aims to demonstrate the mechanism that makes indexing
// from several threads possible.
// ---
// Importing tantivy...
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::Duration;
use tantivy::schema::{Schema, STORED, TEXT};
use tantivy::{doc, Index, IndexWriter, Opstamp};
fn main() -> tantivy::Result<()> {
// # Defining the schema
let mut schema_builder = Schema::builder();
let title = schema_builder.add_text_field("title", TEXT | STORED);
let body = schema_builder.add_text_field("body", TEXT);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let index_writer: Arc<RwLock<IndexWriter>> = Arc::new(RwLock::new(index.writer(50_000_000)?));
// # First indexing thread.
let index_writer_clone_1 = index_writer.clone();
thread::spawn(move || {
// we index the document 100 times... for the sake of the example.
for i in 0..100 {
let opstamp = index_writer_clone_1
.read().unwrap() //< A read lock is sufficient here.
.add_document(
doc!(
title => "Of Mice and Men",
body => "A few miles south of Soledad, the Salinas River drops in close to the hillside \
bank and runs deep and green. The water is warm too, for it has slipped twinkling \
over the yellow sands in the sunlight before reaching the narrow pool. On one \
side of the river the golden foothill slopes curve up to the strong and rocky \
Gabilan Mountains, but on the valley side the water is lined with trees—willows \
fresh and green with every spring, carrying in their lower leaf junctures the \
debris of the winters flooding; and sycamores with mottled, white, recumbent \
limbs and branches that arch over the pool"
));
println!("add doc {} from thread 1 - opstamp {}", i, opstamp);
thread::sleep(Duration::from_millis(20));
}
});
// # Second indexing thread.
let index_writer_clone_2 = index_writer.clone();
thread::spawn(move || {
// we index the document 100 times... for the sake of the example.
for i in 0..100 {
// A read lock is sufficient here.
let opstamp = {
let index_writer_rlock = index_writer_clone_2.read().unwrap();
index_writer_rlock.add_document(doc!(
title => "Manufacturing consent",
body => "Some great book description..."
))
};
println!("add doc {} from thread 2 - opstamp {}", i, opstamp);
thread::sleep(Duration::from_millis(10));
}
});
// # In the main thread, we commit 10 times, once every 500ms.
for _ in 0..10 {
let opstamp: Opstamp = {
// Committing or rolling back, on the other hand, requires the write lock. This will block the other threads.
let mut index_writer_wlock = index_writer.write().unwrap();
index_writer_wlock.commit().unwrap()
};
println!("committed with opstamp {}", opstamp);
thread::sleep(Duration::from_millis(500));
}
Ok(())
}
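The same read-lock pattern covers deletes, since `delete_term` also goes through `&self`, as the comments above point out. A minimal sketch (the helper and its arguments are ours, for illustration):

use std::sync::{Arc, RwLock};
use tantivy::schema::{Field, Term};
use tantivy::{IndexWriter, Opstamp};

fn delete_by_title(writer: &Arc<RwLock<IndexWriter>>, title: Field, value: &str) -> Opstamp {
    // A read lock is sufficient: deletes, like adds, do not require `&mut`.
    // The deletion only becomes effective at the next `commit()`.
    writer
        .read()
        .unwrap()
        .delete_term(Term::from_field_text(title, value))
}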


@@ -0,0 +1,139 @@
// # Pre-tokenized text example
//
// This example shows how to use pre-tokenized text. Sometimes you might
// want to index and search through text which is already split into
// tokens by some external tool.
//
// In this example we will:
// - use a tantivy tokenizer to create tokens and load them directly into tantivy,
// - import tokenized text straight from json,
// - perform a search on documents with pre-tokenized text
use tantivy::collector::{Count, TopDocs};
use tantivy::query::TermQuery;
use tantivy::schema::*;
use tantivy::tokenizer::{PreTokenizedString, SimpleTokenizer, Token, Tokenizer};
use tantivy::{doc, Index, ReloadPolicy};
use tempfile::TempDir;
fn pre_tokenize_text(text: &str) -> Vec<Token> {
let mut token_stream = SimpleTokenizer.token_stream(text);
let mut tokens = vec![];
while token_stream.advance() {
tokens.push(token_stream.token().clone());
}
tokens
}
fn main() -> tantivy::Result<()> {
let index_path = TempDir::new()?;
let mut schema_builder = Schema::builder();
schema_builder.add_text_field("title", TEXT | STORED);
schema_builder.add_text_field("body", TEXT);
let schema = schema_builder.build();
let index = Index::create_in_dir(&index_path, schema.clone())?;
let mut index_writer = index.writer(50_000_000)?;
// We can create a document manually, by setting the fields
// one by one in a Document object.
let title = schema.get_field("title").unwrap();
let body = schema.get_field("body").unwrap();
let title_text = "The Old Man and the Sea";
let body_text = "He was an old man who fished alone in a skiff in the Gulf Stream";
// Content of our first document
// We create `PreTokenizedString` which contains original text and vector of tokens
let title_tok = PreTokenizedString {
text: String::from(title_text),
tokens: pre_tokenize_text(title_text),
};
println!(
"Original text: \"{}\" and tokens: {:?}",
title_tok.text, title_tok.tokens
);
let body_tok = PreTokenizedString {
text: String::from(body_text),
tokens: pre_tokenize_text(body_text),
};
// Now let's create a document and add our `PreTokenizedString`
let old_man_doc = doc!(title => title_tok, body => body_tok);
// ... now let's just add it to the IndexWriter
index_writer.add_document(old_man_doc);
// Pretokenized text can also be fed as JSON
let short_man_json = r#"{
"title":[{
"text":"The Old Man",
"tokens":[
{"offset_from":0,"offset_to":3,"position":0,"text":"The","position_length":1},
{"offset_from":4,"offset_to":7,"position":1,"text":"Old","position_length":1},
{"offset_from":8,"offset_to":11,"position":2,"text":"Man","position_length":1}
]
}]
}"#;
let short_man_doc = schema.parse_document(&short_man_json)?;
index_writer.add_document(short_man_doc);
// Let's commit changes
index_writer.commit()?;
// ... and now is the time to query our index
let reader = index
.reader_builder()
.reload_policy(ReloadPolicy::OnCommit)
.try_into()?;
let searcher = reader.searcher();
// We want to get documents with the token "Man"; we will use a TermQuery to do it.
// Using PreTokenizedString means the tokens are indexed as-is, avoiding stemming
// and lowercasing, which preserves full words in their original form.
let query = TermQuery::new(
Term::from_field_text(title, "Man"),
IndexRecordOption::Basic,
);
let (top_docs, count) = searcher
.search(&query, &(TopDocs::with_limit(2), Count))
.unwrap();
assert_eq!(count, 2);
// Now let's print out the results.
// Note that the tokens are not stored along with the original text
// in the document store
for (_score, doc_address) in top_docs {
let retrieved_doc = searcher.doc(doc_address)?;
println!("Document: {}", schema.to_json(&retrieved_doc));
}
// Contrary to the previous query, when we search for the "man" term we
// should get no results, as it's not one of the indexed tokens. SimpleTokenizer
// only splits text on whitespace / punctuation.
let query = TermQuery::new(
Term::from_field_text(title, "man"),
IndexRecordOption::Basic,
);
let (_top_docs, count) = searcher
.search(&query, &(TopDocs::with_limit(2), Count))
.unwrap();
assert_eq!(count, 0);
Ok(())
}
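The JSON above maps field for field onto the `Token` struct, so the same title could be built in code rather than parsed. A sketch of the equivalent manual construction:

use tantivy::tokenizer::{PreTokenizedString, Token};

fn short_man_title() -> PreTokenizedString {
    PreTokenizedString {
        text: "The Old Man".to_string(),
        tokens: vec![
            // Mirrors {"offset_from":0,"offset_to":3,"position":0,"text":"The","position_length":1}
            Token {
                offset_from: 0,
                offset_to: 3,
                position: 0,
                text: "The".to_string(),
                position_length: 1,
            },
            Token {
                offset_from: 4,
                offset_to: 7,
                position: 1,
                text: "Old".to_string(),
                position_length: 1,
            },
            Token {
                offset_from: 8,
                offset_to: 11,
                position: 2,
                text: "Man".to_string(),
                position_length: 1,
            },
        ],
    }
}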


@@ -1,209 +0,0 @@
extern crate rustc_serialize;
extern crate tantivy;
extern crate tempdir;
use std::path::Path;
use tempdir::TempDir;
use tantivy::Index;
use tantivy::schema::*;
use tantivy::collector::TopCollector;
use tantivy::query::QueryParser;
fn main() {
// Let's create a temporary directory for the
// sake of this example
if let Ok(dir) = TempDir::new("tantivy_example_dir") {
run_example(dir.path()).unwrap();
dir.close().unwrap();
}
}
fn run_example(index_path: &Path) -> tantivy::Result<()> {
// # Defining the schema
//
// The Tantivy index requires a very strict schema.
// The schema declares which fields are in the index,
// and for each field, its type and "the way it should
// be indexed".
// first we need to define a schema ...
let mut schema_builder = SchemaBuilder::default();
// Our first field is title.
// We want full-text search for it, and we want to be able
// to retrieve the document after the search.
//
// TEXT | STORED is some syntactic sugar to describe
// that.
//
// `TEXT` means the field should be tokenized and indexed,
// along with its term frequency and term positions.
//
// `STORED` means that the field will also be saved
// in a compressed, row-oriented key-value store.
// This store is useful to reconstruct the
// documents that were selected during the search phase.
schema_builder.add_text_field("title", TEXT | STORED);
// Our first field is body.
// We want full-text search for it, and we want to be able
// to retrieve the body after the search.
schema_builder.add_text_field("body", TEXT);
let schema = schema_builder.build();
// # Indexing documents
//
// Let's create a brand new index.
//
// This will actually just save a meta.json
// with our schema in the directory.
let index = try!(Index::create(index_path, schema.clone()));
// To insert document we need an index writer.
// There must be only one writer at a time.
// This single `IndexWriter` is already
// multithreaded.
//
// Here we use a buffer of 50MB per thread. Using a bigger
// heap for the indexer can increase its throughput.
let mut index_writer = try!(index.writer(50_000_000));
// Let's index our documents!
// We first need a handle on the title and the body field.
// ### Create a document "manually".
//
// We can create a document manually, by setting the fields
// one by one in a Document object.
let title = schema.get_field("title").unwrap();
let body = schema.get_field("body").unwrap();
let mut old_man_doc = Document::default();
old_man_doc.add_text(title, "The Old Man and the Sea");
old_man_doc.add_text(body,
"He was an old man who fished alone in a skiff in the Gulf Stream and \
he had gone eighty-four days now without taking a fish.");
// ... and add it to the `IndexWriter`.
index_writer.add_document(old_man_doc);
// ### Create a document directly from json.
//
// Alternatively, we can use our schema to parse
// a document object directly from json.
let mice_and_men_doc = try!(schema.parse_document(r#"{
"title": "Of Mice and Men",
"body": "few miles south of Soledad, the Salinas River drops in close to the hillside bank and runs deep and green. The water is warm too, for it has slipped twinkling over the yellow sands in the sunlight before reaching the narrow pool. On one side of the river the golden foothill slopes curve up to the strong and rocky Gabilan Mountains, but on the valley side the water is lined with trees—willows fresh and green with every spring, carrying in their lower leaf junctures the debris of the winters flooding; and sycamores with mottled, white,recumbent limbs and branches that arch over the pool"
}"#));
index_writer.add_document(mice_and_men_doc);
// Multi-valued field are allowed, they are
// expressed in JSON by an array.
// The following document has two titles.
let frankenstein_doc = try!(schema.parse_document(r#"{
"title": ["Frankenstein", "The Modern Promotheus"],
"body": "You will rejoice to hear that no disaster has accompanied the commencement of an enterprise which you have regarded with such evil forebodings. I arrived here yesterday, and my first task is to assure my dear sister of my welfare and increasing confidence in the success of my undertaking."
}"#));
index_writer.add_document(frankenstein_doc);
// This is an example, so we will only index 3 documents
// here. You can check out tantivy's tutorial to index
// the English wikipedia. Tantivy's indexing is rather fast.
// Indexing 5 million articles of the English wikipedia takes
// around 4 minutes on my computer!
// ### Committing
//
// At this point our documents are not searchable.
//
//
// We need to call .commit() explicitly to force the
// index_writer to finish processing the documents in the queue,
// flush the current index to the disk, and advertise
// the existence of new documents.
//
// This call is blocking.
try!(index_writer.commit());
// If `.commit()` returns correctly, then all of the
// documents that have been added are guaranteed to be
// persistently indexed.
//
// In the scenario of a crash or a power failure,
// tantivy behaves as if has rolled back to its last
// commit.
// # Searching
//
// Let's search our index. Start by reloading
// searchers in the index. This should be done
// after every commit().
try!(index.load_searchers());
// Afterwards create one (or more) searchers.
//
// You should create a searcher
// every time you start a "search query".
let searcher = index.searcher();
// The query parser can interpret human queries.
// Here, if the user does not specify which
// field they want to search, tantivy will search
// in both title and body.
let query_parser = QueryParser::new(index.schema(), vec![title, body]);
// QueryParser may fail if the query is not in the right
// format. For user facing applications, this can be a problem.
// A ticket has been opened regarding this problem.
let query = try!(query_parser.parse_query("sea whale"));
// A query defines a set of documents, as
// well as the way they should be scored.
//
// A query created by the query parser is scored according
// to a metric called Tf-Idf, and will consider
// any document matching at least one of our terms.
// ### Collectors
//
// We are not interested in all of the documents but
// only in the top 10. Keeping track of our top 10 best documents
// is the role of the TopCollector.
let mut top_collector = TopCollector::with_limit(10);
// We can now perform our query.
try!(searcher.search(&*query, &mut top_collector));
// Our top collector now contains the 10
// most relevant doc ids...
let doc_addresses = top_collector.docs();
// The actual documents still need to be
// retrieved from Tantivy's store.
//
// Since the body field was not configured as stored,
// the document returned will only contain
// a title.
for doc_address in doc_addresses {
let retrieved_doc = try!(searcher.doc(&doc_address));
println!("{}", schema.to_json(&retrieved_doc));
}
Ok(())
}

examples/snippet.rs Normal file

@@ -0,0 +1,82 @@
// # Snippet example
//
// This example shows how to return a representative snippet of
// your hit result.
// A snippet is an extract of a matching document, returned in HTML format.
// The keywords searched by the user are highlighted with a `<b>` tag.
// ---
// Importing tantivy...
use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::*;
use tantivy::{doc, Index, Snippet, SnippetGenerator};
use tempfile::TempDir;
fn main() -> tantivy::Result<()> {
// Let's create a temporary directory for the
// sake of this example
let index_path = TempDir::new()?;
// # Defining the schema
let mut schema_builder = Schema::builder();
let title = schema_builder.add_text_field("title", TEXT | STORED);
let body = schema_builder.add_text_field("body", TEXT | STORED);
let schema = schema_builder.build();
// # Indexing documents
let index = Index::create_in_dir(&index_path, schema.clone())?;
let mut index_writer = index.writer(50_000_000)?;
// we'll only need one doc for this example.
index_writer.add_document(doc!(
title => "Of Mice and Men",
body => "A few miles south of Soledad, the Salinas River drops in close to the hillside \
bank and runs deep and green. The water is warm too, for it has slipped twinkling \
over the yellow sands in the sunlight before reaching the narrow pool. On one \
side of the river the golden foothill slopes curve up to the strong and rocky \
Gabilan Mountains, but on the valley side the water is lined with trees—willows \
fresh and green with every spring, carrying in their lower leaf junctures the \
debris of the winters flooding; and sycamores with mottled, white, recumbent \
limbs and branches that arch over the pool"
));
// ...
index_writer.commit()?;
let reader = index.reader()?;
let searcher = reader.searcher();
let query_parser = QueryParser::for_index(&index, vec![title, body]);
let query = query_parser.parse_query("sycamore spring")?;
let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
let snippet_generator = SnippetGenerator::create(&searcher, &*query, body)?;
for (score, doc_address) in top_docs {
let doc = searcher.doc(doc_address)?;
let snippet = snippet_generator.snippet_from_doc(&doc);
println!("Document score {}:", score);
println!("title: {}", doc.get_first(title).unwrap().text().unwrap());
println!("snippet: {}", snippet.to_html());
println!("custom highlighting: {}", highlight(snippet));
}
Ok(())
}
fn highlight(snippet: Snippet) -> String {
let mut result = String::new();
let mut start_from = 0;
for (start, end) in snippet.highlighted().iter().map(|h| h.bounds()) {
result.push_str(&snippet.fragments()[start_from..start]);
result.push_str(" --> ");
result.push_str(&snippet.fragments()[start..end]);
result.push_str(" <-- ");
start_from = end;
}
result.push_str(&snippet.fragments()[start_from..]);
result
}
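The `highlight` function above only manipulates strings, so swapping the arrow markers for ANSI escape codes gives terminal highlighting with the exact same traversal. A sketch reusing the same `Snippet` accessors:

use tantivy::Snippet;

fn highlight_ansi(snippet: &Snippet) -> String {
    let mut result = String::new();
    let mut start_from = 0;
    for (start, end) in snippet.highlighted().iter().map(|h| h.bounds()) {
        result.push_str(&snippet.fragments()[start_from..start]);
        // Render the matched keyword in bold instead of wrapping it in arrows.
        result.push_str("\x1b[1m");
        result.push_str(&snippet.fragments()[start..end]);
        result.push_str("\x1b[0m");
        start_from = end;
    }
    result.push_str(&snippet.fragments()[start_from..]);
    result
}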

113
examples/stop_words.rs Normal file

@@ -0,0 +1,113 @@
// # Stop Words Example
//
// This example covers the basic usage of stop words
// with tantivy.
//
// We will:
// - define our schema
// - create an index in RAM
// - add a few stop words
// - index a few documents in our index
// ---
// Importing tantivy...
use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::*;
use tantivy::tokenizer::*;
use tantivy::{doc, Index};
fn main() -> tantivy::Result<()> {
// this example assumes you understand the content in `basic_search`
let mut schema_builder = Schema::builder();
// This configures how tantivy will store and process your
// content in the index. The key thing to note is that we
// set the tokenizer to `stoppy`, which is defined and
// registered below.
let text_field_indexing = TextFieldIndexing::default()
.set_tokenizer("stoppy")
.set_index_option(IndexRecordOption::WithFreqsAndPositions);
let text_options = TextOptions::default()
.set_indexing_options(text_field_indexing)
.set_stored();
// Our first field is title.
schema_builder.add_text_field("title", text_options);
// Our second field is body.
let text_field_indexing = TextFieldIndexing::default()
.set_tokenizer("stoppy")
.set_index_option(IndexRecordOption::WithFreqsAndPositions);
let text_options = TextOptions::default()
.set_indexing_options(text_field_indexing)
.set_stored();
schema_builder.add_text_field("body", text_options);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema.clone());
// This tokenizer lowercases all of the text (to help with stop word matching)
// and then removes all instances of `the` and `and` from the token stream.
let tokenizer = TextAnalyzer::from(SimpleTokenizer)
.filter(LowerCaser)
.filter(StopWordFilter::remove(vec![
"the".to_string(),
"and".to_string(),
]));
index.tokenizers().register("stoppy", tokenizer);
let mut index_writer = index.writer(50_000_000)?;
let title = schema.get_field("title").unwrap();
let body = schema.get_field("body").unwrap();
index_writer.add_document(doc!(
title => "The Old Man and the Sea",
body => "He was an old man who fished alone in a skiff in the Gulf Stream and \
he had gone eighty-four days now without taking a fish."
));
index_writer.add_document(doc!(
title => "Of Mice and Men",
body => "A few miles south of Soledad, the Salinas River drops in close to the hillside \
bank and runs deep and green. The water is warm too, for it has slipped twinkling \
over the yellow sands in the sunlight before reaching the narrow pool. On one \
side of the river the golden foothill slopes curve up to the strong and rocky \
Gabilan Mountains, but on the valley side the water is lined with trees—willows \
fresh and green with every spring, carrying in their lower leaf junctures the \
debris of the winters flooding; and sycamores with mottled, white, recumbent \
limbs and branches that arch over the pool"
));
index_writer.add_document(doc!(
title => "Frankenstein",
body => "You will rejoice to hear that no disaster has accompanied the commencement of an \
enterprise which you have regarded with such evil forebodings. I arrived here \
yesterday, and my first task is to assure my dear sister of my welfare and \
increasing confidence in the success of my undertaking."
));
index_writer.commit()?;
let reader = index.reader()?;
let searcher = reader.searcher();
let query_parser = QueryParser::for_index(&index, vec![title, body]);
// Stop words are applied to the query as well.
// The following is therefore equivalent to `title:frankenstein`.
let query = query_parser.parse_query("title:\"the Frankenstein\"")?;
let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
for (score, doc_address) in top_docs {
let retrieved_doc = searcher.doc(doc_address)?;
println!("\n==\nDocument score {}:", score);
println!("{}", schema.to_json(&retrieved_doc));
}
Ok(())
}


@@ -0,0 +1,41 @@
use tantivy::schema::*;
// # Document from JSON
//
// For convenience, `Document` can be parsed directly from JSON.
fn main() -> tantivy::Result<()> {
// Let's first define a schema and an index.
// Check out the basic example if this is confusing to you.
//
// first we need to define a schema ...
let mut schema_builder = Schema::builder();
schema_builder.add_text_field("title", TEXT | STORED);
schema_builder.add_text_field("body", TEXT);
schema_builder.add_u64_field("year", INDEXED);
let schema = schema_builder.build();
// Let's assume we have a json-serialized document.
let mice_and_men_doc_json = r#"{
"title": "Of Mice and Men",
"year": 1937
}"#;
// We can parse our document
let _mice_and_men_doc = schema.parse_document(&mice_and_men_doc_json)?;
// Multi-valued fields are allowed; they are
// expressed in JSON as an array.
// The following document has two titles.
let frankenstein_json = r#"{
"title": ["Frankenstein", "The Modern Prometheus"],
"year": 1818
}"#;
let _frankenstein_doc = schema.parse_document(&frankenstein_json)?;
// Note that the schema is saved in your index directory.
//
// As a result, indexes are aware of their schema, and you can use this feature
// just by opening an existing `Index` and calling `index.schema().parse_document(json)`.
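// A hypothetical sketch (assuming `index_dir` names an existing index directory):
//
//     let index = tantivy::Index::open_in_dir(index_dir)?;
//     let _doc = index.schema().parse_document(r#"{"title": "Dracula"}"#)?;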
Ok(())
}

16
query-grammar/Cargo.toml Normal file

@@ -0,0 +1,16 @@
[package]
name = "tantivy-query-grammar"
version = "0.13.0"
authors = ["Paul Masurel <paul.masurel@gmail.com>"]
license = "MIT"
categories = ["database-implementations", "data-structures"]
description = """Search engine library"""
documentation = "https://tantivy-search.github.io/tantivy/tantivy/index.html"
homepage = "https://github.com/tantivy-search/tantivy"
repository = "https://github.com/tantivy-search/tantivy"
readme = "README.md"
keywords = ["search", "information", "retrieval"]
edition = "2018"
[dependencies]
combine = {version="4", default-features=false, features=[] }

3
query-grammar/README.md Normal file

@@ -0,0 +1,3 @@
# Tantivy Query Grammar
This crate is used by tantivy to parse queries.
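
A minimal usage sketch, built on the `parse_query` entry point exported from `src/lib.rs`:

```rust
use tantivy_query_grammar::parse_query;

fn main() {
    // `Error` is an opaque unit struct, so match rather than unwrap.
    match parse_query("title:diary +year:[1800 TO 1900]") {
        Ok(ast) => println!("{:?}", ast),
        Err(_) => eprintln!("invalid query"),
    }
}
```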

15
query-grammar/src/lib.rs Normal file

@@ -0,0 +1,15 @@
mod occur;
mod query_grammar;
mod user_input_ast;
use combine::parser::Parser;
pub use crate::occur::Occur;
use crate::query_grammar::parse_to_ast;
pub use crate::user_input_ast::{UserInputAST, UserInputBound, UserInputLeaf, UserInputLiteral};
pub struct Error;
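/// Parses a user query string into a `UserInputAST`.
/// Returns `Err(Error)` if the input cannot be parsed as a query.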
pub fn parse_query(query: &str) -> Result<UserInputAST, Error> {
let (user_input_ast, _remaining) = parse_to_ast().parse(query).map_err(|_| Error)?;
Ok(user_input_ast)
}


@@ -0,0 +1,72 @@
use std::fmt;
use std::fmt::Write;
/// Defines whether a term in a query must be present,
/// should be present or must not be present.
#[derive(Debug, Clone, Hash, Copy, Eq, PartialEq)]
pub enum Occur {
/// For a given document to be considered for scoring,
/// at least one of the terms with a `Should` or a `Must`
/// constraint must appear in the document.
Should,
/// Documents without the term are excluded from the search.
Must,
/// Documents that contain the term are excluded from the
/// search.
MustNot,
}
impl Occur {
/// Returns the one-char prefix symbol for this `Occur`:
/// - `Should` => '?'
/// - `Must` => '+'
/// - `MustNot` => '-'
fn to_char(self) -> char {
match self {
Occur::Should => '?',
Occur::Must => '+',
Occur::MustNot => '-',
}
}
/// Composes two occur values, as when a clause is nested inside another.
/// `Should` is neutral, `Must` propagates, and composing `MustNot` with
/// `MustNot` yields `Must` (double negation).
pub fn compose(left: Occur, right: Occur) -> Occur {
match (left, right) {
(Occur::Should, _) => right,
(Occur::Must, Occur::MustNot) => Occur::MustNot,
(Occur::Must, _) => Occur::Must,
(Occur::MustNot, Occur::MustNot) => Occur::Must,
(Occur::MustNot, _) => Occur::MustNot,
}
}
}
impl fmt::Display for Occur {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
f.write_char(self.to_char())
}
}
#[cfg(test)]
mod test {
use crate::Occur;
#[test]
fn test_occur_compose() {
assert_eq!(Occur::compose(Occur::Should, Occur::Should), Occur::Should);
assert_eq!(Occur::compose(Occur::Should, Occur::Must), Occur::Must);
assert_eq!(
Occur::compose(Occur::Should, Occur::MustNot),
Occur::MustNot
);
assert_eq!(Occur::compose(Occur::Must, Occur::Should), Occur::Must);
assert_eq!(Occur::compose(Occur::Must, Occur::Must), Occur::Must);
assert_eq!(Occur::compose(Occur::Must, Occur::MustNot), Occur::MustNot);
assert_eq!(
Occur::compose(Occur::MustNot, Occur::Should),
Occur::MustNot
);
assert_eq!(Occur::compose(Occur::MustNot, Occur::Must), Occur::MustNot);
assert_eq!(Occur::compose(Occur::MustNot, Occur::MustNot), Occur::Must);
}
}


@@ -0,0 +1,510 @@
use super::user_input_ast::{UserInputAST, UserInputBound, UserInputLeaf, UserInputLiteral};
use crate::Occur;
use combine::error::StringStreamError;
use combine::parser::char::{char, digit, letter, space, spaces, string};
use combine::parser::Parser;
use combine::{
attempt, choice, eof, many, many1, one_of, optional, parser, satisfy, skip_many1, value,
};
fn field<'a>() -> impl Parser<&'a str, Output = String> {
(
letter(),
many(satisfy(|c: char| c.is_alphanumeric() || c == '_')),
)
.skip(char(':'))
.map(|(s1, s2): (char, String)| format!("{}{}", s1, s2))
}
fn word<'a>() -> impl Parser<&'a str, Output = String> {
(
satisfy(|c: char| {
!c.is_whitespace()
&& !['-', '^', '`', ':', '{', '}', '"', '[', ']', '(', ')'].contains(&c)
}),
many(satisfy(|c: char| {
!c.is_whitespace() && ![':', '^', '{', '}', '"', '[', ']', '(', ')'].contains(&c)
})),
)
.map(|(s1, s2): (char, String)| format!("{}{}", s1, s2))
.and_then(|s: String| match s.as_str() {
"OR" | "AND " | "NOT" => Err(StringStreamError::UnexpectedParse),
_ => Ok(s),
})
}
fn term_val<'a>() -> impl Parser<&'a str, Output = String> {
let phrase = char('"').with(many1(satisfy(|c| c != '"'))).skip(char('"'));
phrase.or(word())
}
fn term_query<'a>() -> impl Parser<&'a str, Output = UserInputLiteral> {
let term_val_with_field = negative_number().or(term_val());
(field(), term_val_with_field).map(|(field_name, phrase)| UserInputLiteral {
field_name: Some(field_name),
phrase,
})
}
fn literal<'a>() -> impl Parser<&'a str, Output = UserInputLeaf> {
let term_default_field = term_val().map(|phrase| UserInputLiteral {
field_name: None,
phrase,
});
attempt(term_query())
.or(term_default_field)
.map(UserInputLeaf::from)
}
fn negative_number<'a>() -> impl Parser<&'a str, Output = String> {
(
char('-'),
many1(digit()),
optional((char('.'), many1(digit()))),
)
.map(|(s1, s2, s3): (char, String, Option<(char, String)>)| {
if let Some(('.', s3)) = s3 {
format!("{}{}.{}", s1, s2, s3)
} else {
format!("{}{}", s1, s2)
}
})
}
fn spaces1<'a>() -> impl Parser<&'a str, Output = ()> {
skip_many1(space())
}
/// Parses a range out of the input stream.
/// Supported ranges include:
/// [5 TO 10], {5 TO 10}, [* TO 10], [10 TO *], {10 TO *], >5, <=10
/// [a TO *], [a TO c], [abc TO bcd}
fn range<'a>() -> impl Parser<&'a str, Output = UserInputLeaf> {
let range_term_val = || {
word()
.or(negative_number())
.or(char('*').with(value("*".to_string())))
};
// check for unbounded range in the form of <5, <=10, >5, >=5
let elastic_unbounded_range = (
choice([
attempt(string(">=")),
attempt(string("<=")),
attempt(string("<")),
attempt(string(">")),
])
.skip(spaces()),
range_term_val(),
)
.map(
|(comparison_sign, bound): (&str, String)| match comparison_sign {
">=" => (UserInputBound::Inclusive(bound), UserInputBound::Unbounded),
"<=" => (UserInputBound::Unbounded, UserInputBound::Inclusive(bound)),
"<" => (UserInputBound::Unbounded, UserInputBound::Exclusive(bound)),
">" => (UserInputBound::Exclusive(bound), UserInputBound::Unbounded),
// default case
_ => (UserInputBound::Unbounded, UserInputBound::Unbounded),
},
);
let lower_bound = (one_of("{[".chars()), range_term_val()).map(
|(boundary_char, lower_bound): (char, String)| {
if lower_bound == "*" {
UserInputBound::Unbounded
} else if boundary_char == '{' {
UserInputBound::Exclusive(lower_bound)
} else {
UserInputBound::Inclusive(lower_bound)
}
},
);
let upper_bound = (range_term_val(), one_of("}]".chars())).map(
|(higher_bound, boundary_char): (String, char)| {
if higher_bound == "*" {
UserInputBound::Unbounded
} else if boundary_char == '}' {
UserInputBound::Exclusive(higher_bound)
} else {
UserInputBound::Inclusive(higher_bound)
}
},
);
// the classic bounded `[lower TO upper]` form
let lower_to_upper = (
lower_bound.skip((spaces(), string("TO"), spaces())),
upper_bound,
);
(
optional(field()).skip(spaces()),
// Try the elastic-style comparison syntax first; it always yields a half-unbounded range.
attempt(elastic_unbounded_range).or(lower_to_upper),
)
.map(|(field, (lower, upper))|
// Construct the leaf from extracted field (optional)
// and bounds
UserInputLeaf::Range {
field,
lower,
upper
})
}
fn negate(expr: UserInputAST) -> UserInputAST {
expr.unary(Occur::MustNot)
}
fn leaf<'a>() -> impl Parser<&'a str, Output = UserInputAST> {
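// `parser(...)` wraps a closure so the grammar can recurse: a leaf may be a
// parenthesized sub-AST, the match-all `*`, a negated leaf, a range, or a
// plain literal.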
parser(|input| {
char('(')
.with(ast())
.skip(char(')'))
.or(char('*').map(|_| UserInputAST::from(UserInputLeaf::All)))
.or(attempt(
string("NOT").skip(spaces1()).with(leaf()).map(negate),
))
.or(attempt(range().map(UserInputAST::from)))
.or(literal().map(UserInputAST::from))
.parse_stream(input)
.into_result()
})
}
fn occur_symbol<'a>() -> impl Parser<&'a str, Output = Occur> {
char('-')
.map(|_| Occur::MustNot)
.or(char('+').map(|_| Occur::Must))
}
fn occur_leaf<'a>() -> impl Parser<&'a str, Output = (Option<Occur>, UserInputAST)> {
(optional(occur_symbol()), boosted_leaf())
}
fn positive_float_number<'a>() -> impl Parser<&'a str, Output = f64> {
(many1(digit()), optional((char('.'), many1(digit())))).map(
|(int_part, decimal_part_opt): (String, Option<(char, String)>)| {
let mut float_str = int_part;
if let Some((chr, decimal_str)) = decimal_part_opt {
float_str.push(chr);
float_str.push_str(&decimal_str);
}
float_str.parse::<f64>().unwrap()
},
)
}
fn boost<'a>() -> impl Parser<&'a str, Output = f64> {
(char('^'), positive_float_number()).map(|(_, boost)| boost)
}
fn boosted_leaf<'a>() -> impl Parser<&'a str, Output = UserInputAST> {
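// A boost of (approximately) 1.0 is a no-op, so the leaf is returned as-is
// rather than wrapped in a `Boost` node.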
(leaf(), optional(boost())).map(|(leaf, boost_opt)| match boost_opt {
Some(boost) if (boost - 1.0).abs() > std::f64::EPSILON => {
UserInputAST::Boost(Box::new(leaf), boost)
}
_ => leaf,
})
}
#[derive(Clone, Copy)]
enum BinaryOperand {
Or,
And,
}
fn binary_operand<'a>() -> impl Parser<&'a str, Output = BinaryOperand> {
string("AND")
.with(value(BinaryOperand::And))
.or(string("OR").with(value(BinaryOperand::Or)))
}
fn aggregate_binary_expressions(
left: UserInputAST,
others: Vec<(BinaryOperand, UserInputAST)>,
) -> UserInputAST {
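// Accumulate a disjunctive normal form: each inner Vec is one conjunction and
// the outer Vec is their disjunction. `AND` extends the current conjunction,
// `OR` starts a new one, so `a OR b AND c` becomes `(a) OR (b AND c)`.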
let mut dnf: Vec<Vec<UserInputAST>> = vec![vec![left]];
for (operator, operand_ast) in others {
match operator {
BinaryOperand::And => {
if let Some(last) = dnf.last_mut() {
last.push(operand_ast);
}
}
BinaryOperand::Or => {
dnf.push(vec![operand_ast]);
}
}
}
if dnf.len() == 1 {
UserInputAST::and(dnf.into_iter().next().unwrap()) //< safe
} else {
let conjunctions = dnf.into_iter().map(UserInputAST::and).collect();
UserInputAST::or(conjunctions)
}
}
fn operand_leaf<'a>() -> impl Parser<&'a str, Output = (BinaryOperand, UserInputAST)> {
(
binary_operand().skip(spaces()),
boosted_leaf().skip(spaces()),
)
}
pub fn ast<'a>() -> impl Parser<&'a str, Output = UserInputAST> {
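// An explicit boolean expression (`a AND b OR c`) is attempted first; failing
// that, the query falls back to whitespace-separated leaves (`+a -b c`),
// which form a single clause with optional occur prefixes.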
let boolean_expr = (boosted_leaf().skip(spaces()), many1(operand_leaf()))
.map(|(left, right)| aggregate_binary_expressions(left, right));
let whitespace_separated_leaves = many1(occur_leaf().skip(spaces().silent())).map(
|subqueries: Vec<(Option<Occur>, UserInputAST)>| {
if subqueries.len() == 1 {
let (occur_opt, ast) = subqueries.into_iter().next().unwrap();
match occur_opt.unwrap_or(Occur::Should) {
Occur::Must | Occur::Should => ast,
Occur::MustNot => UserInputAST::Clause(vec![(Some(Occur::MustNot), ast)]),
}
} else {
UserInputAST::Clause(subqueries.into_iter().collect())
}
},
);
let expr = attempt(boolean_expr).or(whitespace_separated_leaves);
spaces().with(expr).skip(spaces())
}
pub fn parse_to_ast<'a>() -> impl Parser<&'a str, Output = UserInputAST> {
spaces()
.with(optional(ast()).skip(eof()))
.map(|opt_ast| opt_ast.unwrap_or_else(UserInputAST::empty_query))
}
#[cfg(test)]
mod test {
use super::*;
use combine::parser::Parser;
pub fn nearly_equals(a: f64, b: f64) -> bool {
(a - b).abs() < 0.0005 * (a + b).abs()
}
fn assert_nearly_equals(expected: f64, val: f64) {
assert!(
nearly_equals(val, expected),
"Got {}, expected {}.",
val,
expected
);
}
#[test]
fn test_occur_symbol() {
assert_eq!(super::occur_symbol().parse("-"), Ok((Occur::MustNot, "")));
assert_eq!(super::occur_symbol().parse("+"), Ok((Occur::Must, "")));
}
#[test]
fn test_positive_float_number() {
fn valid_parse(float_str: &str, expected_val: f64, expected_remaining: &str) {
let (val, remaining) = positive_float_number().parse(float_str).unwrap();
assert_eq!(remaining, expected_remaining);
assert_nearly_equals(val, expected_val);
}
fn error_parse(float_str: &str) {
assert!(positive_float_number().parse(float_str).is_err());
}
valid_parse("1.0", 1.0, "");
valid_parse("1", 1.0, "");
valid_parse("0.234234 aaa", 0.234234f64, " aaa");
error_parse(".3332");
error_parse("1.");
error_parse("-1.");
}
fn test_parse_query_to_ast_helper(query: &str, expected: &str) {
let query = parse_to_ast().parse(query).unwrap().0;
let query_str = format!("{:?}", query);
assert_eq!(query_str, expected);
}
fn test_is_parse_err(query: &str) {
assert!(parse_to_ast().parse(query).is_err());
}
#[test]
fn test_parse_empty_to_ast() {
test_parse_query_to_ast_helper("", "<emptyclause>");
}
#[test]
fn test_parse_query_to_ast_hyphen() {
test_parse_query_to_ast_helper("\"www-form-encoded\"", "\"www-form-encoded\"");
test_parse_query_to_ast_helper("www-form-encoded", "\"www-form-encoded\"");
test_parse_query_to_ast_helper("www-form-encoded", "\"www-form-encoded\"");
}
#[test]
fn test_parse_query_to_ast_not_op() {
assert_eq!(
format!("{:?}", parse_to_ast().parse("NOT")),
"Err(UnexpectedParse)"
);
test_parse_query_to_ast_helper("NOTa", "\"NOTa\"");
test_parse_query_to_ast_helper("NOT a", "(-\"a\")");
}
#[test]
fn test_boosting() {
assert!(parse_to_ast().parse("a^2^3").is_err());
assert!(parse_to_ast().parse("a^2^").is_err());
test_parse_query_to_ast_helper("a^3", "(\"a\")^3");
test_parse_query_to_ast_helper("a^3 b^2", "(*(\"a\")^3 *(\"b\")^2)");
test_parse_query_to_ast_helper("a^1", "\"a\"");
}
#[test]
fn test_parse_query_to_ast_binary_op() {
test_parse_query_to_ast_helper("a AND b", "(+\"a\" +\"b\")");
test_parse_query_to_ast_helper("a OR b", "(?\"a\" ?\"b\")");
test_parse_query_to_ast_helper("a OR b AND c", "(?\"a\" ?(+\"b\" +\"c\"))");
test_parse_query_to_ast_helper("a AND b AND c", "(+\"a\" +\"b\" +\"c\")");
assert_eq!(
format!("{:?}", parse_to_ast().parse("a OR b aaa")),
"Err(UnexpectedParse)"
);
assert_eq!(
format!("{:?}", parse_to_ast().parse("a AND b aaa")),
"Err(UnexpectedParse)"
);
assert_eq!(
format!("{:?}", parse_to_ast().parse("aaa a OR b ")),
"Err(UnexpectedParse)"
);
assert_eq!(
format!("{:?}", parse_to_ast().parse("aaa ccc a OR b ")),
"Err(UnexpectedParse)"
);
}
#[test]
fn test_parse_elastic_query_ranges() {
test_parse_query_to_ast_helper("title: >a", "title:{\"a\" TO \"*\"}");
test_parse_query_to_ast_helper("title:>=a", "title:[\"a\" TO \"*\"}");
test_parse_query_to_ast_helper("title: <a", "title:{\"*\" TO \"a\"}");
test_parse_query_to_ast_helper("title:<=a", "title:{\"*\" TO \"a\"]");
test_parse_query_to_ast_helper("title:<=bsd", "title:{\"*\" TO \"bsd\"]");
test_parse_query_to_ast_helper("weight: >70", "weight:{\"70\" TO \"*\"}");
test_parse_query_to_ast_helper("weight:>=70", "weight:[\"70\" TO \"*\"}");
test_parse_query_to_ast_helper("weight: <70", "weight:{\"*\" TO \"70\"}");
test_parse_query_to_ast_helper("weight:<=70", "weight:{\"*\" TO \"70\"]");
test_parse_query_to_ast_helper("weight: >60.7", "weight:{\"60.7\" TO \"*\"}");
test_parse_query_to_ast_helper("weight: <= 70", "weight:{\"*\" TO \"70\"]");
test_parse_query_to_ast_helper("weight: <= 70.5", "weight:{\"*\" TO \"70.5\"]");
}
#[test]
fn test_occur_leaf() {
let ((occur, ast), _) = super::occur_leaf().parse("+abc").unwrap();
assert_eq!(occur, Some(Occur::Must));
assert_eq!(format!("{:?}", ast), "\"abc\"");
}
#[test]
fn test_range_parser() {
// testing the range() parser separately
let res = range().parse("title: <hello").unwrap().0;
let expected = UserInputLeaf::Range {
field: Some("title".to_string()),
lower: UserInputBound::Unbounded,
upper: UserInputBound::Exclusive("hello".to_string()),
};
let res2 = range().parse("title:{* TO hello}").unwrap().0;
assert_eq!(res, expected);
assert_eq!(res2, expected);
let expected_weight = UserInputLeaf::Range {
field: Some("weight".to_string()),
lower: UserInputBound::Inclusive("71.2".to_string()),
upper: UserInputBound::Unbounded,
};
let res3 = range().parse("weight: >=71.2").unwrap().0;
let res4 = range().parse("weight:[71.2 TO *}").unwrap().0;
assert_eq!(res3, expected_weight);
assert_eq!(res4, expected_weight);
}
#[test]
fn test_parse_query_to_trimming_spaces() {
test_parse_query_to_ast_helper(" abc", "\"abc\"");
test_parse_query_to_ast_helper("abc ", "\"abc\"");
test_parse_query_to_ast_helper("( a OR abc)", "(?\"a\" ?\"abc\")");
test_parse_query_to_ast_helper("(a OR abc)", "(?\"a\" ?\"abc\")");
test_parse_query_to_ast_helper("(a OR abc)", "(?\"a\" ?\"abc\")");
test_parse_query_to_ast_helper("a OR abc ", "(?\"a\" ?\"abc\")");
test_parse_query_to_ast_helper("(a OR abc )", "(?\"a\" ?\"abc\")");
test_parse_query_to_ast_helper("(a OR abc) ", "(?\"a\" ?\"abc\")");
}
#[test]
fn test_parse_query_single_term() {
test_parse_query_to_ast_helper("abc", "\"abc\"");
}
#[test]
fn test_parse_query_default_clause() {
test_parse_query_to_ast_helper("a b", "(*\"a\" *\"b\")");
}
#[test]
fn test_parse_query_must_default_clause() {
test_parse_query_to_ast_helper("+(a b)", "(*\"a\" *\"b\")");
}
#[test]
fn test_parse_query_must_single_term() {
test_parse_query_to_ast_helper("+d", "\"d\"");
}
#[test]
fn test_single_term_with_field() {
test_parse_query_to_ast_helper("abc:toto", "abc:\"toto\"");
}
#[test]
fn test_single_term_with_float() {
test_parse_query_to_ast_helper("abc:1.1", "abc:\"1.1\"");
}
#[test]
fn test_must_clause() {
test_parse_query_to_ast_helper("(+a +b)", "(+\"a\" +\"b\")");
}
#[test]
fn test_parse_test_query_plus_a_b_plus_d() {
test_parse_query_to_ast_helper("+(a b) +d", "(+(*\"a\" *\"b\") +\"d\")");
}
#[test]
fn test_parse_test_query_other() {
test_parse_query_to_ast_helper("(+a +b) d", "(*(+\"a\" +\"b\") *\"d\")");
test_parse_query_to_ast_helper("+abc:toto", "abc:\"toto\"");
test_parse_query_to_ast_helper("(+abc:toto -titi)", "(+abc:\"toto\" -\"titi\")");
test_parse_query_to_ast_helper("-abc:toto", "(-abc:\"toto\")");
test_parse_query_to_ast_helper("abc:a b", "(*abc:\"a\" *\"b\")");
test_parse_query_to_ast_helper("abc:\"a b\"", "abc:\"a b\"");
test_parse_query_to_ast_helper("foo:[1 TO 5]", "foo:[\"1\" TO \"5\"]");
}
#[test]
fn test_parse_query_with_range() {
test_parse_query_to_ast_helper("[1 TO 5]", "[\"1\" TO \"5\"]");
test_parse_query_to_ast_helper("foo:{a TO z}", "foo:{\"a\" TO \"z\"}");
test_parse_query_to_ast_helper("foo:[1 TO toto}", "foo:[\"1\" TO \"toto\"}");
test_parse_query_to_ast_helper("foo:[* TO toto}", "foo:{\"*\" TO \"toto\"}");
test_parse_query_to_ast_helper("foo:[1 TO *}", "foo:[\"1\" TO \"*\"}");
test_parse_query_to_ast_helper("foo:[1.1 TO *}", "foo:[\"1.1\" TO \"*\"}");
test_is_parse_err("abc + ");
}
}


@@ -0,0 +1,171 @@
use std::fmt;
use std::fmt::{Debug, Formatter};
use crate::Occur;
#[derive(PartialEq)]
pub enum UserInputLeaf {
Literal(UserInputLiteral),
All,
Range {
field: Option<String>,
lower: UserInputBound,
upper: UserInputBound,
},
}
impl Debug for UserInputLeaf {
fn fmt(&self, formatter: &mut Formatter<'_>) -> Result<(), fmt::Error> {
match self {
UserInputLeaf::Literal(literal) => literal.fmt(formatter),
UserInputLeaf::Range {
ref field,
ref lower,
ref upper,
} => {
if let Some(ref field) = field {
write!(formatter, "{}:", field)?;
}
lower.display_lower(formatter)?;
write!(formatter, " TO ")?;
upper.display_upper(formatter)?;
Ok(())
}
UserInputLeaf::All => write!(formatter, "*"),
}
}
}
#[derive(PartialEq)]
pub struct UserInputLiteral {
pub field_name: Option<String>,
pub phrase: String,
}
impl fmt::Debug for UserInputLiteral {
fn fmt(&self, formatter: &mut fmt::Formatter<'_>) -> Result<(), fmt::Error> {
match self.field_name {
Some(ref field_name) => write!(formatter, "{}:\"{}\"", field_name, self.phrase),
None => write!(formatter, "\"{}\"", self.phrase),
}
}
}
#[derive(PartialEq)]
pub enum UserInputBound {
Inclusive(String),
Exclusive(String),
Unbounded,
}
impl UserInputBound {
fn display_lower(&self, formatter: &mut fmt::Formatter<'_>) -> Result<(), fmt::Error> {
match *self {
UserInputBound::Inclusive(ref word) => write!(formatter, "[\"{}\"", word),
UserInputBound::Exclusive(ref word) => write!(formatter, "{{\"{}\"", word),
UserInputBound::Unbounded => write!(formatter, "{{\"*\""),
}
}
fn display_upper(&self, formatter: &mut fmt::Formatter<'_>) -> Result<(), fmt::Error> {
match *self {
UserInputBound::Inclusive(ref word) => write!(formatter, "\"{}\"]", word),
UserInputBound::Exclusive(ref word) => write!(formatter, "\"{}\"}}", word),
UserInputBound::Unbounded => write!(formatter, "\"*\"}}"),
}
}
pub fn term_str(&self) -> &str {
match *self {
UserInputBound::Inclusive(ref contents) => contents,
UserInputBound::Exclusive(ref contents) => contents,
UserInputBound::Unbounded => "*",
}
}
}
pub enum UserInputAST {
Clause(Vec<(Option<Occur>, UserInputAST)>),
Leaf(Box<UserInputLeaf>),
Boost(Box<UserInputAST>, f64),
}
impl UserInputAST {
pub fn unary(self, occur: Occur) -> UserInputAST {
UserInputAST::Clause(vec![(Some(occur), self)])
}
fn compose(occur: Occur, asts: Vec<UserInputAST>) -> UserInputAST {
assert_ne!(occur, Occur::MustNot);
assert!(!asts.is_empty());
if asts.len() == 1 {
asts.into_iter().next().unwrap() //< safe
} else {
UserInputAST::Clause(
asts.into_iter()
.map(|ast: UserInputAST| (Some(occur), ast))
.collect::<Vec<_>>(),
)
}
}
pub fn empty_query() -> UserInputAST {
UserInputAST::Clause(Vec::default())
}
pub fn and(asts: Vec<UserInputAST>) -> UserInputAST {
UserInputAST::compose(Occur::Must, asts)
}
pub fn or(asts: Vec<UserInputAST>) -> UserInputAST {
UserInputAST::compose(Occur::Should, asts)
}
}
impl From<UserInputLiteral> for UserInputLeaf {
fn from(literal: UserInputLiteral) -> UserInputLeaf {
UserInputLeaf::Literal(literal)
}
}
impl From<UserInputLeaf> for UserInputAST {
fn from(leaf: UserInputLeaf) -> UserInputAST {
UserInputAST::Leaf(Box::new(leaf))
}
}
fn print_occur_ast(
occur_opt: Option<Occur>,
ast: &UserInputAST,
formatter: &mut fmt::Formatter,
) -> fmt::Result {
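// `*` marks a sub-query whose occur was left unspecified by the user.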
if let Some(occur) = occur_opt {
write!(formatter, "{}{:?}", occur, ast)?;
} else {
write!(formatter, "*{:?}", ast)?;
}
Ok(())
}
impl fmt::Debug for UserInputAST {
fn fmt(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
match *self {
UserInputAST::Clause(ref subqueries) => {
if subqueries.is_empty() {
write!(formatter, "<emptyclause>")?;
} else {
write!(formatter, "(")?;
print_occur_ast(subqueries[0].0, &subqueries[0].1, formatter)?;
for subquery in &subqueries[1..] {
write!(formatter, " ")?;
print_occur_ast(subquery.0, &subquery.1, formatter)?;
}
write!(formatter, ")")?;
}
Ok(())
}
UserInputAST::Leaf(ref subquery) => write!(formatter, "{:?}", subquery),
UserInputAST::Boost(ref leaf, boost) => write!(formatter, "({:?})^{}", leaf, boost),
}
}
}

Some files were not shown because too many files have changed in this diff.