600 Commits

Author SHA1 Message Date
PSeitz
8d32c3ba3a Change Footer version handling, Make compression dynamic (#1060)
Change Footer version handling, Make compression dynamic

Change Footer version handling
Simplify version handling by switching to JSON instead of binary serialization.
fixes #1058

Make compression dynamic
Instead of choosing the compression algorithm at compile time via a feature flag, you can now enable multiple compression algorithms and choose one at runtime via IndexSettings. Changing the compression algorithm on an existing index is also supported. The algorithm used in the doc store is recorded in the DocStoreFooter. The default is the lz4 block format.
fixes #904

Handle merging of different compressors
Fix feature flag names
Add doc store test for all compressors
2021-05-28 14:57:20 +09:00
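The runtime selection described in this commit can be sketched roughly as follows. This is a minimal illustration, not tantivy's code: the `Compressor` enum, the shape of `DocStoreFooter`, and the `write_store`/`read_store` helpers are hypothetical, and a toy run-length encoder stands in for a real codec such as lz4.

```rust
/// Hypothetical set of compressors that are all compiled in.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Compressor {
    None,
    /// Toy run-length encoder standing in for a real codec (assumption).
    RunLength,
}

/// Hypothetical footer: records which compressor wrote the doc store.
#[derive(Debug, PartialEq, Eq)]
struct DocStoreFooter {
    compressor: Compressor,
}

impl Compressor {
    fn compress(self, data: &[u8]) -> Vec<u8> {
        match self {
            Compressor::None => data.to_vec(),
            Compressor::RunLength => {
                // Emit (run length, byte) pairs; runs are capped at 255.
                let mut out = Vec::new();
                let mut iter = data.iter().copied().peekable();
                while let Some(byte) = iter.next() {
                    let mut run: u8 = 1;
                    while run < u8::MAX && iter.peek() == Some(&byte) {
                        iter.next();
                        run += 1;
                    }
                    out.push(run);
                    out.push(byte);
                }
                out
            }
        }
    }

    fn decompress(self, data: &[u8]) -> Vec<u8> {
        match self {
            Compressor::None => data.to_vec(),
            Compressor::RunLength => data
                .chunks_exact(2)
                .flat_map(|pair| std::iter::repeat(pair[1]).take(pair[0] as usize))
                .collect(),
        }
    }
}

/// The compressor is a runtime value (e.g. taken from index settings),
/// and the footer remembers the choice.
fn write_store(data: &[u8], compressor: Compressor) -> (Vec<u8>, DocStoreFooter) {
    (compressor.compress(data), DocStoreFooter { compressor })
}

/// Reading consults the footer, not the current settings, so stores
/// written with different compressors can coexist in one index.
fn read_store(block: &[u8], footer: &DocStoreFooter) -> Vec<u8> {
    footer.compressor.decompress(block)
}

fn main() {
    let (block, footer) = write_store(b"aaaabbbcc", Compressor::RunLength);
    println!("{:?} -> {} bytes", footer, block.len());
    println!("{:?}", read_store(&block, &footer));
}
```

Because the reader dispatches on the footer, changing the configured compressor only affects newly written stores.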
PSeitz
4f8481a1e4 Detect if segments are stackable with sorting, fixes #1038 (#1054)
* Detect if segments are stackable with sorting, fixes #1038

Detect if segments are stackable when their data ranges on the sort property are disjoint.
Presort segments by their min value on merge, to enable easier stacking.

* move code to function
2021-05-21 15:23:17 +09:00
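The stackability check in this commit can be illustrated with a small sketch. The `SegmentRange` type and `stack_order` function are hypothetical names, and the strict inequality is a conservative assumption about what counts as disjoint; the actual merger may treat equal boundary values differently.

```rust
/// Hypothetical summary of a segment: min and max value of the sort field.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct SegmentRange {
    min: u64,
    max: u64,
}

/// Presorts segments by their min value, then checks that each range ends
/// before the next one starts. Returns the stacking order when the ranges
/// are disjoint, and `None` when documents would have to be interleaved.
fn stack_order(mut segments: Vec<SegmentRange>) -> Option<Vec<SegmentRange>> {
    segments.sort_by_key(|segment| segment.min);
    let disjoint = segments
        .windows(2)
        .all(|pair| pair[0].max < pair[1].min);
    if disjoint {
        Some(segments)
    } else {
        None
    }
}

fn main() {
    let segments = vec![
        SegmentRange { min: 10, max: 20 },
        SegmentRange { min: 0, max: 5 },
        SegmentRange { min: 21, max: 30 },
    ];
    println!("{:?}", stack_order(segments));
}
```

When the ranges are disjoint, a merge can concatenate whole segments in the returned order instead of merging document by document.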
PSeitz
d523543dc7 Sort Index/Docids By Field (#1026)
* sort index by field

add sort info to IndexSettings
generate docid mapping for sorted field (only fastfield)
remap singlevalue fastfield

* support docid mapping in multivalue fastfield

move docid mapping to serialization step (less intermediate data for mapping)
add support for docid mapping in multivalue fastfield

* handle docid map in bytes fastfield

* forward docid mapping, remap postings

* fix merge conflicts

* move test to index_sorter

* add docid index mapping old->new

add docid mapping for both directions old->new (used in postings) and new->old (used in fast field)
handle mapping in postings recorder
warn instead of info for MAX_TOKEN_LEN

* remap docid in fieldnorm

* resort docids in recorder, more extensive tests

* handle index sorting in docstore

handle index sort in docstore, by saving all the docs in a temp docstore file (SegmentComponent::TempStore). On serialization the docid mapping is used to create a docstore in the correct order by reading the old docstore.

add docstore sort tests
refactor tests

* refactor

rename docid doc_id
rename docid_map doc_id_map
rename DocidMapping DocIdMapping
fix typo

* u32 to DocId

* better doc_id_map creation

remove unstable sort

* add non mut method to FastFieldWriters

add _mut prefix to &mut methods

* remove sort_index

* fix clippy issues

* fix SegmentComponent iterator

use std::mem::replace

* fix test

* fmt

* handle indexsettings deserialize

* add reading, writing bytes to doc store

get bytes of document in doc store
add store_bytes method doc writer to accept serialized document
add serialization index settings test

* rename index_sorter to doc_id_mapping

use bufferlender in recorder

* fix compile issue, make sort_by_field optional

* fix test compile

* validate index settings on merge

validate index settings on merge
forward merge info to SegmentSerializer (for TempStore)

* fix doctest

* add itertools, use kmerge

add itertools, use kmerge
push because rustfmt fails

* implement/test merge for fastfield

implement/test merge for fastfield
rename len to num_deleted in DeleteBitSet

* Use precalculated docid mapping in merger

Use precalculated docid mapping in merger for sorted indices instead of on the fly calculation 
Add index creation macro benchmark, but commented out for now, since it is not really usable due to long runtimes, and extreme fluctuations. May be better suited in criterion or an external bench bin

* fix fast field reader docs

fix fast field reader docs, Error instead of None returned
add u64s_lenient to fastreader
add create docid mapping benchmark

* add test for multifast field merge

refactor test 
add test for multifast field merge

* add num_bytes to BytesFastFieldReader

equivalent to num_vals in MultiValuedFastFieldReader

* add MultiValueLength trait

add MultiValueLength trait in order to unify index creation for BytesFastFieldReader and MultiValuedFastFieldReader in merger

* Add ReaderWithOrdinal, fix 

Add ReaderWithOrdinal to associate data to a reader in merger
Fix bytes offset index creation in merger

* add test for merging bytes with sorted docids

* Merge fieldnorm for sorted index

* handle posting list in merge in sorted index

handle posting list in merge in sorted index by using doc id mapping for sorting
reuse SegmentOrdinal type

* handle doc store order in merge in sorted index

* fix typo, cleanup

* make IndexSetting non-optional

* fix type, rename test file

fix type
rename test file
add type

* remove SegmentReaderWithOrdinal accessors

* cargo fmt

* add index sort & merge test to include deletes

* Fix posting list merge issue

Fix posting list merge issue - ensure serializer always gets monotonically increasing doc ids
handle sorting and merging for facets field

* performance: cache field readers, use bytes for doc store merge

* change facet merge test to cover index sorting

* add RawDocument abstraction to access bytes in doc store

* fix deserialization, update changelog

fix deserialization
update changelog
forward error on merge failed

* cache store readers to utilize lru cache (4x performance)

cache store readers, to utilize lru cache (4x faster performance, due to fewer decompress calls on the block)

* add include_temp_doc_store flag in InnerSegmentMeta

unset flag on deserialization and after finalize of a segment
set flag when creating new instances
2021-05-17 22:20:57 +09:00
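The two-directional doc id mapping mentioned in this commit (old->new for postings, new->old for fast fields) can be sketched as below. This is an illustration under assumptions: the struct layout and the `build_mapping` and `remap_column` functions are hypothetical and only mirror the idea, not the actual `DocIdMapping` implementation.

```rust
type DocId = u32;

/// Hypothetical mapping holding both directions.
#[derive(Debug, PartialEq, Eq)]
struct DocIdMapping {
    /// Indexed by new doc id, yields the old doc id.
    new_to_old: Vec<DocId>,
    /// Indexed by old doc id, yields the new doc id.
    old_to_new: Vec<DocId>,
}

/// Builds the mapping from the sort field value of each document,
/// where `sort_values[old_doc_id]` is that document's value.
fn build_mapping(sort_values: &[u64]) -> DocIdMapping {
    let mut new_to_old: Vec<DocId> = (0..sort_values.len() as DocId).collect();
    // A stable sort keeps the original order of documents with equal values.
    new_to_old.sort_by_key(|&old| sort_values[old as usize]);

    // Invert the permutation to get the other direction.
    let mut old_to_new = vec![0; sort_values.len()];
    for (new, &old) in new_to_old.iter().enumerate() {
        old_to_new[old as usize] = new as DocId;
    }
    DocIdMapping { new_to_old, old_to_new }
}

/// Rewrites a single-valued column into the new doc id order.
fn remap_column(values: &[u64], mapping: &DocIdMapping) -> Vec<u64> {
    mapping
        .new_to_old
        .iter()
        .map(|&old| values[old as usize])
        .collect()
}

fn main() {
    let sort_values = [30, 10, 20];
    let mapping = build_mapping(&sort_values);
    println!("{:?}", mapping);
    println!("{:?}", remap_column(&sort_values, &mapping));
}
```

Writing a column in the new order walks `new_to_old`, while rewriting a doc id stored inside a posting list needs a lookup in `old_to_new`.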
Evance Soumaoro
8d51e9cc91 Capping IndexWriter Num thread (#1033)
* capping num threads of index writer to MAX_NUM_THREAD = 8

* fixed formatting

* run ci

* fix bug from max to min
2021-05-06 20:44:39 +09:00
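The "max to min" fix in this commit is easy to get wrong, so here is a minimal sketch. `MAX_NUM_THREAD = 8` comes from the commit message; the `cap_num_threads` function name is hypothetical.

```rust
const MAX_NUM_THREAD: usize = 8;

/// Caps the requested thread count at `MAX_NUM_THREAD`.
/// Using `max` here instead of `min` would do the opposite and raise
/// every request to at least 8 threads.
fn cap_num_threads(requested: usize) -> usize {
    requested.min(MAX_NUM_THREAD)
}

fn main() {
    for requested in [1, 4, 8, 16] {
        println!("{} -> {}", requested, cap_num_threads(requested));
    }
}
```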
Pascal Seitz
cbf805c3e6 fix build, skip serialize None 2021-04-26 13:30:34 +02:00
Pascal Seitz
46beb2a989 index_settings should be optional 2021-04-26 11:34:19 +02:00
Pascal Seitz
c01c175744 rename fix 2021-04-26 09:45:12 +02:00
Paul Masurel
eca496ee24 Merge branch 'main' into indexmeta 2021-04-26 14:34:58 +09:00
Paul Masurel
2dc5403e7b Closes #1022 2021-04-26 14:01:14 +09:00
Paul Masurel
aead5d4068 First stab 2021-04-26 12:46:06 +09:00
Paul Masurel
39dd8cfe24 Cargo clippy. Acronym should not be full uppercase apparently. 2021-04-26 11:49:18 +09:00
Pascal Seitz
b9b9e9e518 move Index::create to IndexBuilder 2021-04-23 15:14:15 +02:00
Pascal Seitz
e2c91aff33 add open/create methods to index builder
add indexbuilder error
rename create_from_metas to open_from_metas
remove from_directory
2021-04-23 14:02:21 +02:00
Pascal Seitz
8dc3e7704c add IndexSettings to Index, use Indexbuilder in Index 2021-04-22 21:07:39 +02:00
Pascal Seitz
4243780e0a add Index::builder, add index_settings to IndexMeta 2021-04-21 19:32:19 +02:00
Paul Masurel
be1d9e0db7 Marks list_all_segment_metas() as crate private
Closes #1004
2021-04-07 23:39:28 +09:00
Stéphane Campinas
a0ec6e1e9d Expand the DocAddress struct with named fields 2021-03-28 19:00:23 +02:00
Laurent Pouget
4b34231f28 Make facet indexation and storage optional
Added a FacetOptions for HierarchicalFacet which add indexed and stored flags to it.
Propagate change and update tests accordingly
Added a test to ensure that a not indexed flag was taken care of.
Added on Value implem the `path()` function to return the stored facet.
2021-03-24 14:56:27 +01:00
Paul Masurel
52b1eb2c37 Clippy fix 2021-03-10 14:35:51 +09:00
Paul Masurel
31137beea6 Replacing (start, end) by Range 2021-03-10 14:06:21 +09:00
Paul Masurel
94d3d7a89a Rename FastFieldReaders::load_all 2021-01-21 18:38:48 +09:00
Paul Masurel
aa9e79f957 Clippy warnings. 2021-01-21 18:23:20 +09:00
Paul Masurel
1b4be24dca Fast fields are not loaded on the opening of a segment.
They are instead loaded lazily when they are requested.
2021-01-21 18:13:08 +09:00
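Lazy loading of this kind can be sketched as a cache that decodes a field only on first access. Everything here is hypothetical (the `LazyFastFields` type, its fields, and the little-endian `u64` layout) and is meant only to illustrate the idea, not tantivy's reader.

```rust
use std::collections::HashMap;

/// Hypothetical container: raw bytes per field are available when the
/// segment is opened, decoded values are produced on demand.
struct LazyFastFields {
    raw: HashMap<String, Vec<u8>>,
    loaded: HashMap<String, Vec<u64>>,
    /// Counts how many fields were actually decoded.
    load_count: usize,
}

impl LazyFastFields {
    fn open(raw: HashMap<String, Vec<u8>>) -> Self {
        // Opening does no decoding work at all.
        LazyFastFields { raw, loaded: HashMap::new(), load_count: 0 }
    }

    /// Decodes the field on first request and reuses the result afterwards.
    fn get(&mut self, field: &str) -> Option<&Vec<u64>> {
        if !self.loaded.contains_key(field) {
            let bytes = self.raw.get(field)?;
            let values: Vec<u64> = bytes
                .chunks_exact(8)
                .map(|chunk| {
                    let mut buf = [0u8; 8];
                    buf.copy_from_slice(chunk);
                    u64::from_le_bytes(buf)
                })
                .collect();
            self.load_count += 1;
            self.loaded.insert(field.to_string(), values);
        }
        self.loaded.get(field)
    }
}

fn main() {
    let mut raw = HashMap::new();
    let mut bytes = Vec::new();
    bytes.extend_from_slice(&1u64.to_le_bytes());
    bytes.extend_from_slice(&2u64.to_le_bytes());
    raw.insert("price".to_string(), bytes);

    let mut fields = LazyFastFields::open(raw);
    println!("loaded after open: {}", fields.load_count);
    println!("{:?}", fields.get("price"));
    println!("loaded after get: {}", fields.load_count);
}
```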
Paul Masurel
43c7b3bfec Bugfix in the RAMDirectory.
There was a state where the meta.json was empty.
2021-01-11 14:11:42 +09:00
Paul Masurel
af6dfa1856 Small refactoring 2020-12-03 14:27:05 +09:00
Paul Masurel
80a99539ce Several TermDict operations now return an io::Result 2020-12-03 13:13:11 +09:00
Paul Masurel
8d0e049261 Revert "Move SegmentUpdater::list_files() to Index" 2020-11-20 13:53:50 +09:00
Adrien Guillo
267e920a80 Move SegmentUpdater::list_files() to Index
... and make the method public
2020-11-17 17:54:18 -08:00
Paul Masurel
40d41c7dcb Merge pull request #929 from tantivy-search/api-public-term-merger
Make field TermMerger API public
2020-11-12 14:11:53 +09:00
Paul Masurel
eef348004e Closes #930 Minor bug.
Watch callback could be callback if the last watch handle was dropped
shortly before meta.json is called.
2020-11-11 15:51:23 +09:00
Paul Masurel
e784bbc40f Update src/core/searcher.rs
Co-authored-by: Adrien Guillo <adrien.guillo@gmail.com>
2020-11-11 12:37:52 +09:00
Paul Masurel
b8118d439f Make field TermMerger API public 2020-11-11 11:59:09 +09:00
Paul Masurel
41bb2bd58b Merge pull request #926 from tantivy-search/guilload--directory-exists
Modified `Directory::exists` API to return `Result<bool, OpenReadError>`
2020-11-10 17:59:45 +09:00
Adrien Guillo
7fd6054145 Modified Directory::exists API to return Result<bool, OpenReadError> 2020-11-09 18:00:14 -08:00
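The point of returning `Result<bool, OpenReadError>` is that "the file is absent" and "the check itself failed" become distinguishable. The sketch below illustrates that distinction with a reduced, hypothetical `Directory` trait, a hypothetical `OpenReadError` with a single variant, and a hypothetical `FsDirectory`; the real types in tantivy may differ.

```rust
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical error type; the real `OpenReadError` may have other variants.
#[derive(Debug)]
enum OpenReadError {
    IoError { io_error: io::Error, filepath: PathBuf },
}

/// Hypothetical, reduced directory abstraction.
trait Directory {
    fn exists(&self, path: &Path) -> Result<bool, OpenReadError>;
}

/// Hypothetical filesystem-backed implementation.
struct FsDirectory {
    root: PathBuf,
}

impl Directory for FsDirectory {
    fn exists(&self, path: &Path) -> Result<bool, OpenReadError> {
        let full_path = self.root.join(path);
        match std::fs::metadata(&full_path) {
            Ok(_) => Ok(true),
            // A missing file is a normal answer, not an error.
            Err(err) if err.kind() == io::ErrorKind::NotFound => Ok(false),
            // Anything else (permissions, broken mount, ...) is surfaced.
            Err(err) => Err(OpenReadError::IoError {
                io_error: err,
                filepath: full_path,
            }),
        }
    }
}

fn main() {
    let dir = FsDirectory { root: std::env::temp_dir() };
    match dir.exists(Path::new("meta.json")) {
        Ok(found) => println!("meta.json exists: {}", found),
        Err(err) => println!("could not check: {:?}", err),
    }
}
```

With a plain `bool` return, the last match arm would have to be folded into `false`, hiding I/O failures from the caller.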
Paul Masurel
d23aee76c9 Avoid loading fieldnorms when not necessary 2020-11-09 15:50:16 +09:00
Paul Masurel
b5f3dcdc8b TermInfo contains the end_offset of the postings.
We slice the ReadOnlySource tightly.
2020-11-06 15:18:51 +09:00
Paul Masurel
01b4aa9adc Refactoring dir (#905) 2020-10-11 22:22:56 +09:00
Pasha Podolsky
80cbe889ba [tantivy] Add brotli codec for row storage (#885)
* [tantivy] Add brotli codec for row storage

* [tantivy] Fix not actual comments for code

* [CR] Fixes for comment and cursor
2020-10-09 14:51:42 +09:00
Paul Masurel
c23a03ad81 Large API Change in the Directory API. (#901)
Tantivy used to assume that all files could be somehow memory mapped. After this change, Directory returns a `FileSlice` that can be reduced and eventually read into an `OwnedBytes` object. Long and blocking io operations are still required, but they do not span over the entire file.
2020-10-08 16:36:51 +09:00
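The idea of a slice that can be narrowed cheaply and read only at the end can be sketched as follows. This is not tantivy's implementation: the `FileHandle` trait, `InMemoryFile`, and the methods shown are hypothetical, and a plain `Vec<u8>` stands in for `OwnedBytes`.

```rust
use std::io;
use std::ops::Range;
use std::sync::Arc;

/// Hypothetical handle: the only requirement is reading a byte range.
trait FileHandle {
    fn read_range(&self, range: Range<usize>) -> io::Result<Vec<u8>>;
}

/// In-memory stand-in for a file that need not be memory mapped.
struct InMemoryFile(Vec<u8>);

impl FileHandle for InMemoryFile {
    fn read_range(&self, range: Range<usize>) -> io::Result<Vec<u8>> {
        self.0
            .get(range)
            .map(|bytes| bytes.to_vec())
            .ok_or_else(|| io::Error::new(io::ErrorKind::UnexpectedEof, "range out of bounds"))
    }
}

/// A view over part of a file. Narrowing it performs no I/O.
#[derive(Clone)]
struct FileSlice {
    handle: Arc<dyn FileHandle>,
    range: Range<usize>,
}

impl FileSlice {
    fn new(handle: Arc<dyn FileHandle>, len: usize) -> Self {
        FileSlice { handle, range: 0..len }
    }

    fn len(&self) -> usize {
        self.range.end - self.range.start
    }

    /// Reduces the slice to a sub-range given relative to this slice.
    /// Panics if the sub-range does not fit.
    fn slice(&self, range: Range<usize>) -> FileSlice {
        assert!(range.start <= range.end && range.end <= self.len());
        FileSlice {
            handle: self.handle.clone(),
            range: (self.range.start + range.start)..(self.range.start + range.end),
        }
    }

    /// The only blocking step: reads just the bytes this slice covers.
    fn read_bytes(&self) -> io::Result<Vec<u8>> {
        self.handle.read_range(self.range.clone())
    }
}

fn main() -> io::Result<()> {
    let data = b"header|body|footer".to_vec();
    let len = data.len();
    let file = FileSlice::new(Arc::new(InMemoryFile(data)), len);
    let body = file.slice(7..11);
    println!("{}", String::from_utf8_lossy(&body.read_bytes()?));
    Ok(())
}
```

Only `read_bytes` touches the underlying storage, so the cost of a read is bounded by the size of the reduced slice rather than the whole file.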
Paul Masurel
ad82b455a3 Minor change 2020-10-01 20:45:07 +09:00
Paul Masurel
848afa43ee Merge branch 'issue/896' into main 2020-10-01 20:43:42 +09:00
Paul Masurel
7720d21265 Closes #896 - Facet reader related
Bugfix. Acquiring a facet reader on a segment that does not contain any
doc with this facet returns `None`.
2020-10-01 20:25:28 +09:00
Paul Masurel
96f946d4c3 Raultang master (#879)
* add support for indexed bytes fast field

* remove backup code file

* refine test cases

* Simplified unit test. Renamed it as it is testing the storable part. Not the indexed part.

* Small refactoring and added unit test. If multivalued we only retain the first FAST value.

Co-authored-by: Raul <raul.tang.lc@gmail.com>
2020-10-01 18:03:18 +09:00
Paul Masurel
838c476733 Hirevo move to thiserror (#889)
* Migrated from `failure` to `thiserror`

* Refactoring

Co-authored-by: Nicolas Polomack <nicolas@polomack.eu>
2020-09-30 16:34:10 +09:00
Paul Masurel
439d6956a9 Returning Result in some of the API (#880)
* Returning Result in some of the API

* Introducing `.writer_for_test(..)`
2020-09-07 15:52:34 +09:00
Paul Masurel
2737822620 Fixing unit tests. (#868)
There was a unit test failing when notify was sending more
than one event on atomicwrites.

It was observed on MacOS CI.
2020-08-27 16:43:39 +09:00
Paul Masurel
2481c87be8 Block wand (#856) 2020-08-19 22:36:36 +09:00
Paul Masurel
8e74bb98b5 Added field norm readers (#854) 2020-07-20 13:05:05 +09:00
aptend
00a239a712 fix typo in index_meta.rs (#851) 2020-07-16 12:32:45 +09:00
lyj
1ab7f660a4 Update index.rs (#846) 2020-07-02 15:11:38 +09:00