Commit Graph

128 Commits

Author SHA1 Message Date
Paul Masurel
46d5de920d Removes all usage of block_on, and use a oneshot channel instead. (#1315)
* Removes all usage of block_on, and use a oneshot channel instead.

Calling `block_on` panics in certain context.
For instance, it panics when it is called in a the context of another
call to block.

Using it in tantivy is unnecessary. We replace it by a thin wrapper
around a oneshot channel that supports both async/sync.

* Removing needless uses of async in the API.

Co-authored-by: PSeitz <PSeitz@users.noreply.github.com>
2022-03-18 16:54:58 +09:00
Paul Masurel
2ead010c83 Tantivy quickwit (#1293)
* Added sstable and enabling it by default, and parallel boolean query.
* Added async API for FileSlice.
* Added async get_doc
* Reduce blocksize to 32_000
* Added debug logs

Quickwit specific feature a hidden behind the quickwit feature flag.
2022-02-25 17:32:49 +09:00
Paul Masurel
bdedefe07d Adding an IndexingContext object (#1268) 2022-02-04 15:08:01 +09:00
Paul Masurel
65b365b81c Fixing all-features build. 2022-01-31 14:41:14 +09:00
Paul Masurel
eca6628b3c Minor refactoring (#1266) 2022-01-28 15:55:55 +09:00
Paul Masurel
c81b3030fa Issue/922b (#1233)
* Add a NORMED options on field

Make fieldnorm indexation optional:

* for all types except text => added a NORMED options
* for text field
** if STRING, field has not fieldnorm retained
** if TEXT, field has fieldnorm computed

* Finalize making fieldnorm optional for all field types.

- Using Option for fieldnorm readers.
2021-12-10 21:12:29 +09:00
Paul Masurel
ebdbb6bd2e Fixing compilation warnings & clippy comments. 2021-12-10 16:47:59 +09:00
Paul Masurel
466dc8233c Cargo fmt 2021-12-02 18:46:28 +09:00
Paul Masurel
03c2f6ece2 We are missing 4 bytes in the LZ4 compression buffer. (#1226)
Closes #831
2021-12-02 16:00:29 +09:00
Paul Masurel
f378d9a57b Pleasing clippy 2021-12-02 14:48:33 +09:00
Shikhar Bhushan
72cef12db1 Add none compression (#1208) 2021-11-16 10:50:42 +09:00
Paul Masurel
7234bef0eb Issue/1198 (#1201)
* Unit test reproducing #1198
* Fixing unit test to handle the error from add_document.
* Bump project version
2021-11-11 16:42:19 +09:00
PSeitz
352e0cc58d Adde demux operation (#1150)
* add merge for DeleteBitSet, allow custom DeleteBitSet on merge
* forward delete bitsets on merge, add tests
* add demux operation and tests
2021-10-06 16:05:16 +09:00
Paul Masurel
0855649986 Leaning more on the alive (vs delete) semantics. (#1164) 2021-10-05 18:53:29 +09:00
PSeitz
0ce49c9dd4 use lz4_flex 0.9.0 (#1160) 2021-09-27 10:12:20 +09:00
Pascal Seitz
d7a6a409a1 renames 2021-09-23 20:33:11 +08:00
Pascal Seitz
a1f5cead96 AliveBitSet instead of DeleteBitSet 2021-09-23 20:03:57 +08:00
Pascal Seitz
3265f7bec3 dissolve common module 2021-08-19 23:26:34 +01:00
Pascal Seitz
0062fe705d cargo fmt 2021-07-01 18:17:08 +02:00
Pascal Seitz
e496ae0470 clippy fixes 2021-07-01 17:43:50 +02:00
Pascal Seitz
1e4df54ab3 fix clippy 2021-07-01 17:41:53 +02:00
Pascal Seitz
10f056fbb4 apply clippy fixes 2021-07-01 17:08:44 +02:00
Andre-Philippe Paquet
57ae5b27dc fix store reader iterator, take 2 2021-06-16 07:51:39 -04:00
Pascal Seitz
ebe55a7ae1 refactor test, fixes #1077
replace test with smaller test in doc_store
2021-06-14 10:10:05 +02:00
Andre-Philippe Paquet
473a346814 remove debugging 2021-06-13 16:49:44 -04:00
Andre-Philippe Paquet
511dc8f87f fix store reader iterator 2021-06-13 16:00:13 -04:00
PSeitz
8d32c3ba3a Change Footer version handling, Make compression dynamic (#1060)
Change Footer version handling, Make compression dynamic

Change Footer version handling
Simplify version handling by switching to JSON instead of binary serialization.
fixes #1058

Make compression dynamic
Instead of choosing the compression during compile time via a feature flag, you can now have multiple compression algorithms enabled and decide during runtime which one to choose via IndexSettings. Changing the compression algorithm on an index is also supported. The information which algorithm was used in the doc store is stored in the DocStoreFooter. The default is the lz4 block format.
fixes #904

Handle merging of different compressors
Fix feature flag names
Add doc store test for all compressors
2021-05-28 14:57:20 +09:00
PSeitz
5ef2d56ec2 Avoid docstore stacking for small segments, fixes #1053 (#1055) 2021-05-24 15:38:49 +09:00
PSeitz
249bc6cf72 upgrade lz4_flex to 0.8 (#1049)
* upgrade lz4_flex to 0.8

* fix set_len
2021-05-19 10:46:01 +09:00
PSeitz
1c0af5765d fix doc store iter error handling, fixes #1047 (#1051) 2021-05-18 21:43:57 +09:00
Paul Masurel
7ba771ed1b Replaced RawDocument by OwnedBytes (#1046) 2021-05-18 14:33:36 +09:00
PSeitz
a4002622f8 add iterator over documents in docstore (#1044)
* add iterator over documents in docstore

When profiling, I saw that around 8% of the time in a merge was spent in look-ups into the skip index. Since the documents in the merge case are read continuously, we can replace the random access with an iterator over the documents.

Merge Time on Sorted Index Before/After:
24s / 19s

Merge Time on Unsorted Index Before/After:
15s / 13,5s

So we can expect 10-20% faster merges.
This iterator is also important if we add sorting based on a field in the documents.

* Update reader.rs

Co-authored-by: Paul Masurel <paul@quickwit.io>
2021-05-18 10:29:02 +09:00
PSeitz
d523543dc7 Sort Index/Docids By Field (#1026)
* sort index by field

add sort info to IndexSettings
generate docid mapping for sorted field (only fastfield)
remap singlevalue fastfield

* support docid mapping in multivalue fastfield

move docid mapping to serialization step (less intermediate data for mapping)
add support for docid mapping in multivalue fastfield

* handle docid map in bytes fastfield

* forward docid mapping, remap postings

* fix merge conflicts

* move test to index_sorter

* add docid index mapping old->new

add docid mapping for both directions old->new (used in postings) and new->old (used in fast field)
handle mapping in postings recorder
warn instead of info for MAX_TOKEN_LEN

* remap docid in fielnorm

* resort docids in recorder, more extensive tests

* handle index sorting in docstore

handle index sort in docstore, by saving all the docs in a temp docstore file (SegmentComponent::TempStore). On serialization the docid mapping is used to create a docstore in the correct order by reader the old docstore.

add docstore sort tests
refactor tests

* refactor

rename docid doc_id
rename docid_map doc_id_map
rename DocidMapping DocIdMapping
fix  typo

* u32 to DocId

* better doc_id_map creation

remove unstable sort

* add non mut method to FastFieldWriters

add _mut prefix to &mut methods

* remove sort_index

* fix clippy issues

* fix SegmentComponent iterator

use std::mem::replace

* fix test

* fmt

* handle indexsettings deserialize

* add reading, writing bytes to doc store

get bytes of document in doc store
add store_bytes method doc writer to accept serialized document
add serialization index settings test

* rename index_sorter to doc_id_mapping

use bufferlender in recorder

* fix compile issue, make sort_by_field optional

* fix test compile

* validate index settings on merge

validate index settings on merge
forward merge info to SegmentSerializer (for TempStore)

* fix doctest

* add itertools, use kmerge

add itertools, use kmerge
push because rustfmt fails

* implement/test merge for fastfield

implement/test merge for fastfield
rename len to num_deleted in DeleteBitSet

* Use precalculated docid mapping in merger

Use precalculated docid mapping in merger for sorted indices instead of on the fly calculation 
Add index creation macro benchmark, but commented out for now, since it is not really usable due to long runtimes, and extreme fluctuations. May be better suited in criterion or an external bench bin

* fix fast field reader docs

fix fast field reader docs, Error instead of None returned
add u64s_lenient to fastreader
add create docid mapping benchmark

* add test for multifast field merge

refactor test 
add test for multifast field merge

* add num_bytes to BytesFastFieldReader

equivalent to num_vals in MultiValuedFastFieldReader

* add MultiValueLength trait

add MultiValueLength trait in order to unify index creation for BytesFastFieldReader and MultiValuedFastFieldReader in merger

* Add ReaderWithOrdinal, fix 

Add ReaderWithOrdinal to associate data to a reader in merger
Fix bytes offset index creation in merger

* add test for merging bytes with sorted docids

* Merge fieldnorm for sorted index

* handle posting list in merge in sorted index

handle posting list in merge in sorted index by using doc id mapping for sorting
reuse SegmentOrdinal type

* handle doc store order in merge in sorted index

* fix typo, cleanup

* make IndexSetting non-optional

* fix type, rename test file

fix type
rename test file
add  type

* remove SegmentReaderWithOrdinal accessors

* cargo fmt

* add index sort & merge test to include deletes

* Fix posting list merge issue

Fix posting list merge issue - ensure serializer always gets monotonically increasing doc ids
handle sorting and merging for facets field

* performance: cache field readers, use bytes for doc store merge

* change facet merge test to cover index sorting

* add RawDocument abstraction to access bytes in doc store

* fix deserialization, update changelog

fix deserialization
update changelog
forward error on merge failed

* cache store readers to utilize lru cache (4x performance)

cache store readers, to utilize lru cache (4x faster performance, due to less decompress calls on the block)

* add include_temp_doc_store flag in InnerSegmentMeta

unset flag on deserialization and after finalize of a segment
set flag when creating new instances
2021-05-17 22:20:57 +09:00
Pascal Seitz
25b9429929 calc mem_usage of more structs
calc mem_usage of more structs in index creation
add some comments
2021-04-30 14:16:39 +02:00
Paul Masurel
39dd8cfe24 Cargo clippy. Acronym should not be full uppercase apparently. 2021-04-26 11:49:18 +09:00
Pascal Seitz
c2fdc60569 fix snap version, fix naming 2021-04-19 10:55:44 +02:00
Pascal Seitz
b7159dd48e forward original error 2021-04-16 17:20:25 +02:00
Pascal Seitz
a00049b879 add lz4 block format compressor as default docstore compressor
add lz4 block compressor using lz4_flex, add lz4-block-compression feature flag
add snappy-compression feature flag for snap compressor, make snap crate optional
set lz4-block-compression as default feature flag
2021-04-16 15:24:35 +02:00
Stéphane Campinas
a0ec6e1e9d Expand the DocAddress struct with named fields 2021-03-28 19:00:23 +02:00
Paul Masurel
52b1eb2c37 Clippy fix 2021-03-10 14:35:51 +09:00
Paul Masurel
31137beea6 Replacing (start, end) by Range 2021-03-10 14:06:21 +09:00
Paul Masurel
aa9e79f957 Clippy warnings. 2021-01-21 18:23:20 +09:00
Paul Masurel
b17a10546a Minor change in unit test. 2021-01-11 11:33:59 +09:00
Paul Masurel
203b0256a3 Minor renaming 2021-01-07 22:47:57 +09:00
Paul Masurel
caf2a38b7e Closes #969.
The segment stacking optimization is not updating "first_doc_in_block".
2021-01-07 22:43:56 +09:00
Paul Masurel
96f24b078e Added failing unit test. 2021-01-07 22:43:28 +09:00
Paul Masurel
8ca0954b3b Added a functional long running test to test store merging. 2021-01-07 14:07:15 +09:00
Paul Masurel
c3e311e6b8 Removed 'static in compression_lz4. 2020-12-09 15:30:52 +09:00
Paul Masurel
5a25c8dfd3 No filelen problem. 2020-11-25 11:51:58 +09:00
Paul Masurel
7f0e61b173 Refactoring of the skip index.
The skip index now identifies both the start and the end offset
of blocks. Checkpoints are compressed in blocks, reaching better
compression.
2020-11-17 16:05:11 +09:00