Commit Graph

1918 Commits

Author SHA1 Message Date
PSeitz
5209238c1b use github actions for tests 0.15.1 2021-06-14 12:51:46 +02:00
Paul Masurel
7ef25ec400 Bump to 0.15.1 to publish bugfix 2021-06-14 18:45:38 +09:00
PSeitz
221e7cbb55 Merge pull request #1076 from appaquet/fix/store-reader-iterator
Fix panic in store reader raw document iterator during segment merge
2021-06-14 11:22:58 +02:00
Pascal Seitz
873ac1a3ac cleanup import 2021-06-14 10:31:45 +02:00
Pascal Seitz
ebe55a7ae1 refactor test, fixes #1077
replace test with smaller test in doc_store
2021-06-14 10:10:05 +02:00
Bernard Swart
9f32d40b27 Misspelling of misspelled was fixed (#1078) 2021-06-14 16:29:12 +09:00
Andre-Philippe Paquet
8ae10a930a fix formatting 2021-06-13 17:23:40 -04:00
Andre-Philippe Paquet
473a346814 remove debugging 2021-06-13 16:49:44 -04:00
Andre-Philippe Paquet
3a8a0fe79a add fuzzy merge test 2021-06-13 16:42:24 -04:00
Andre-Philippe Paquet
511dc8f87f fix store reader iterator 2021-06-13 16:00:13 -04:00
Paul Masurel
3901295329 Bumped query-grammar version 2021-06-07 10:00:14 +09:00
Paul Masurel
f5918c6c74 Completed bitpacker README 2021-06-07 09:57:17 +09:00
Paul Masurel
abe6b4baec Bumped tantivy version to 0.15 2021-06-07 09:52:48 +09:00
Paul Masurel
6e4b61154f Issue/1070 (#1071)
Add a boolean flag in the Query::query_terms informing on whether
position information is required.

Closes #1070
2021-06-03 22:33:20 +09:00
PSeitz
2aad0ced77 add inline to bitpacker (#1064) 2021-05-31 23:15:41 +09:00
Stéphane Campinas
41ea14840d add benchmark of term streams merge (#1024)
* add benchmark of term streams merge
* use union based on FST for merging the term dictionaries
* Rename TermMerger benchmark
2021-05-31 23:15:01 +09:00
PSeitz
dff0ffd38a prepare for multiple fastfield codecs (#1063)
* prepare for multiple fastfield codecs

 prepare for multiple fastfield codecs by wrapping the codecs in an enum #1042

* add FastFieldSerializer trait, add DynamicFastFieldSerializer

add FastFieldSerializer trait
add DynamicFastFieldSerializer enum to wrap all implementors of the FastFieldSerializer trait

* add estimation for fastfield bitpacker
2021-05-31 23:14:14 +09:00
PSeitz
8d32c3ba3a Change Footer version handling, Make compression dynamic (#1060)
Change Footer version handling, Make compression dynamic

Change Footer version handling
Simplify version handling by switching to JSON instead of binary serialization.
fixes #1058

Make compression dynamic
Instead of choosing the compression during compile time via a feature flag, you can now have multiple compression algorithms enabled and decide during runtime which one to choose via IndexSettings. Changing the compression algorithm on an index is also supported. The information which algorithm was used in the doc store is stored in the DocStoreFooter. The default is the lz4 block format.
fixes #904

Handle merging of different compressors
Fix feature flag names
Add doc store test for all compressors
2021-05-28 14:57:20 +09:00
Moriyoshi Koizumi
4afba005f9 Provide a means to deal with malformed facet text representation for the query parser (#1056)
* Provide a means to deal with malformed facet text representation for the query parser.
* Specific error enum for the facet parse error.
2021-05-27 12:16:49 +09:00
PSeitz
85fb0cc20a cache field norm reader in merge (#1061) 2021-05-25 21:48:02 +09:00
PSeitz
5ef2d56ec2 Avoid docstore stacking for small segments, fixes #1053 (#1055) 2021-05-24 15:38:49 +09:00
Paul Masurel
fd8e5bdf57 Rename more like this 2021-05-21 16:32:39 +09:00
PSeitz
4f8481a1e4 Detect if segments are stackackable with sorting, fixes #1038 (#1054)
* Detect if segments are stackackable with sorting, fixes #1038

Detect if segments are stackable when their data ranges on the sort property are disjunct.
Presort segments by thei min value on merge, to enable easier stacking.

* move code to function
2021-05-21 15:23:17 +09:00
PSeitz
bcd72e5c14 fix and refactor log merge policy, fixes #1035 (#1043)
* fix and refactor log merge policy, fixes #1035

fixes a bug in log merge policy where an index was wrongly referenced by its index

* cleanup

* fix sort order, improve method names

* use itertools groupby, fix serialization test

* minor improvments

* update names
2021-05-19 10:48:46 +09:00
PSeitz
249bc6cf72 upgrade lz4_flex to 0.8 (#1049)
* upgrade lz4_flex to 0.8

* fix set_len
2021-05-19 10:46:01 +09:00
PSeitz
1c0af5765d fix doc store iter error handling, fixes #1047 (#1051) 2021-05-18 21:43:57 +09:00
Paul Masurel
7ba771ed1b Replaced RawDocument by OwnedBytes (#1046) 2021-05-18 14:33:36 +09:00
PSeitz
a4002622f8 add iterator over documents in docstore (#1044)
* add iterator over documents in docstore

When profiling, I saw that around 8% of the time in a merge was spent in look-ups into the skip index. Since the documents in the merge case are read continuously, we can replace the random access with an iterator over the documents.

Merge Time on Sorted Index Before/After:
24s / 19s

Merge Time on Unsorted Index Before/After:
15s / 13,5s

So we can expect 10-20% faster merges.
This iterator is also important if we add sorting based on a field in the documents.

* Update reader.rs

Co-authored-by: Paul Masurel <paul@quickwit.io>
2021-05-18 10:29:02 +09:00
Kornel
8e21087ad7 Don't use overly-minimal dependencies (#1037) 2021-05-17 22:30:04 +09:00
PSeitz
d523543dc7 Sort Index/Docids By Field (#1026)
* sort index by field

add sort info to IndexSettings
generate docid mapping for sorted field (only fastfield)
remap singlevalue fastfield

* support docid mapping in multivalue fastfield

move docid mapping to serialization step (less intermediate data for mapping)
add support for docid mapping in multivalue fastfield

* handle docid map in bytes fastfield

* forward docid mapping, remap postings

* fix merge conflicts

* move test to index_sorter

* add docid index mapping old->new

add docid mapping for both directions old->new (used in postings) and new->old (used in fast field)
handle mapping in postings recorder
warn instead of info for MAX_TOKEN_LEN

* remap docid in fielnorm

* resort docids in recorder, more extensive tests

* handle index sorting in docstore

handle index sort in docstore, by saving all the docs in a temp docstore file (SegmentComponent::TempStore). On serialization the docid mapping is used to create a docstore in the correct order by reader the old docstore.

add docstore sort tests
refactor tests

* refactor

rename docid doc_id
rename docid_map doc_id_map
rename DocidMapping DocIdMapping
fix  typo

* u32 to DocId

* better doc_id_map creation

remove unstable sort

* add non mut method to FastFieldWriters

add _mut prefix to &mut methods

* remove sort_index

* fix clippy issues

* fix SegmentComponent iterator

use std::mem::replace

* fix test

* fmt

* handle indexsettings deserialize

* add reading, writing bytes to doc store

get bytes of document in doc store
add store_bytes method doc writer to accept serialized document
add serialization index settings test

* rename index_sorter to doc_id_mapping

use bufferlender in recorder

* fix compile issue, make sort_by_field optional

* fix test compile

* validate index settings on merge

validate index settings on merge
forward merge info to SegmentSerializer (for TempStore)

* fix doctest

* add itertools, use kmerge

add itertools, use kmerge
push because rustfmt fails

* implement/test merge for fastfield

implement/test merge for fastfield
rename len to num_deleted in DeleteBitSet

* Use precalculated docid mapping in merger

Use precalculated docid mapping in merger for sorted indices instead of on the fly calculation 
Add index creation macro benchmark, but commented out for now, since it is not really usable due to long runtimes, and extreme fluctuations. May be better suited in criterion or an external bench bin

* fix fast field reader docs

fix fast field reader docs, Error instead of None returned
add u64s_lenient to fastreader
add create docid mapping benchmark

* add test for multifast field merge

refactor test 
add test for multifast field merge

* add num_bytes to BytesFastFieldReader

equivalent to num_vals in MultiValuedFastFieldReader

* add MultiValueLength trait

add MultiValueLength trait in order to unify index creation for BytesFastFieldReader and MultiValuedFastFieldReader in merger

* Add ReaderWithOrdinal, fix 

Add ReaderWithOrdinal to associate data to a reader in merger
Fix bytes offset index creation in merger

* add test for merging bytes with sorted docids

* Merge fieldnorm for sorted index

* handle posting list in merge in sorted index

handle posting list in merge in sorted index by using doc id mapping for sorting
reuse SegmentOrdinal type

* handle doc store order in merge in sorted index

* fix typo, cleanup

* make IndexSetting non-optional

* fix type, rename test file

fix type
rename test file
add  type

* remove SegmentReaderWithOrdinal accessors

* cargo fmt

* add index sort & merge test to include deletes

* Fix posting list merge issue

Fix posting list merge issue - ensure serializer always gets monotonically increasing doc ids
handle sorting and merging for facets field

* performance: cache field readers, use bytes for doc store merge

* change facet merge test to cover index sorting

* add RawDocument abstraction to access bytes in doc store

* fix deserialization, update changelog

fix deserialization
update changelog
forward error on merge failed

* cache store readers to utilize lru cache (4x performance)

cache store readers, to utilize lru cache (4x faster performance, due to less decompress calls on the block)

* add include_temp_doc_store flag in InnerSegmentMeta

unset flag on deserialization and after finalize of a segment
set flag when creating new instances
2021-05-17 22:20:57 +09:00
Abderrahmen Hanafi
6ca27b6dd4 link collector header in introduction section (#1036) 2021-05-17 22:15:48 +09:00
Evance Soumaoro
8d51e9cc91 Capping IndexWriter Num thread (#1033)
* capping num threads of index writter to MAX_NUM_THREAD = 8

* fixed formating

* run ci

* fix bug from max to min
2021-05-06 20:44:39 +09:00
Paul Masurel
2aced2d958 Merge pull request #1028 from tantivy-search/issue-more-like-this-query
Support MoreLikeThisQuery
2021-05-04 22:15:43 +09:00
Paul Masurel
3fcba00a1f Merge pull request #1029 from tantivy-search/dependabot/add-v2-config-file
Upgrade to GitHub-native Dependabot
2021-05-03 21:11:06 +09:00
Evance Souamoro
372d12766a fix cargo fmt 2021-05-03 10:26:56 +00:00
Evance Soumaoro
dfed8896b9 Merge branch 'main' into issue-more-like-this-query 2021-05-03 10:08:38 +00:00
Evance Souamoro
d71aa57077 reusing idf from bm25 module as it was the same logic 2021-05-03 10:05:40 +00:00
Paul Masurel
3e85fe57ac Merge pull request #1031 from PSeitz/bitpack_writer
upate CHANGELOG
2021-05-03 16:29:19 +09:00
Pascal Seitz
537021e12d upate CHANGELOG 2021-05-03 09:09:42 +02:00
Paul Masurel
ec4834cd73 Merge pull request #1030 from PSeitz/bitpack_writer
add BlockedBitpacker
2021-05-03 14:19:17 +09:00
Evance Souamoro
712c01aa93 fixed term sorting & moved it to a better place 2021-05-01 05:40:59 +00:00
Evance Souamoro
cde324d4b4 fixed issues based on comment, still need to check BM25 suggestion 2021-04-30 21:14:19 +00:00
Pascal Seitz
478571ebb4 move minmax to bitpacker
move minmax to bitpacker
use minmax in blocked bitpacker
2021-04-30 17:07:30 +02:00
Pascal Seitz
fde9d27482 refactor 2021-04-30 16:29:02 +02:00
Pascal Seitz
f38daab7f7 add base value to blocked bitpacker 2021-04-30 14:47:58 +02:00
Pascal Seitz
25b9429929 calc mem_usage of more structs
calc mem_usage of more structs in index creation
add some comments
2021-04-30 14:16:39 +02:00
Pascal Seitz
83cf638a2e use 64bit encoded metadata
fix memory_usage calculation
2021-04-30 07:23:44 +02:00
Pascal Seitz
a04e0bdaf1 use flushfree blocked bitpacker (10% slower) 2021-04-29 19:57:17 +02:00
Pascal Seitz
c200d59d1e add blocked bitpacker, add benches 2021-04-29 19:53:54 +02:00
dependabot-preview[bot]
bbeac5888c Upgrade to GitHub-native Dependabot 2021-04-29 15:02:36 +00:00