* store DateTime as nanoseconds in doc store
The doc store DateTime was truncated to microseconds previously. This
removes this truncation, while still keeping backwards compatibility.
This is done by adding the trait `ConfigurableBinarySerializable`, which
works like `BinarySerializable`, but takes a config that currently allows
de/serializing date times with different precisions.
bump version format to 7.
add compat test to check the date time truncation.
* remove configurable binary serialize, add enum for doc store version
* test doc store version ord
* compact doc
* add any value type
* pass references when building CompactDoc
* remove OwnedValue from API
* clippy
* clippy
* fail on large documents
* fmt
* cleanup
* cleanup
* implement Value for different types
fix serde_json date Value implementation
* fmt
* cleanup
* fmt
* cleanup
* store positions instead of pos+len
* remove nodes array
* remove mediumvec
* cleanup
* infallible serialize into vec
* remove positions indirection
* remove 24MB limitation in document
use u32 for Addr
Remove the 3 byte addressing limitation and use VInt instead
* cleanup
* extend test
* cleanup, add comments
* rename, remove pub
* Fix trait bound of StoreReader::iter
Similar to `StoreReader::get`, `StoreReader::iter` should only require
`DocumentDeserialize` and not `Document`.
* Mark the iterator returned by SegmentReader::doc_ids_alive as Send so it can be used in impls of Stream/AsyncIterator.
* remove Document: DocumentDeserialize dependency
The dependency requires users to implement an API they may not use.
* remove unnecessary Document bounds
* fix windows build (#1)
* Fix windows build
* Add doc traits
* Add field value iter
* Add value and serialization
* Adjust order
* Fix bug
* Correct type
* Fix generic bugs
* Reformat code
* Add generic to index writer which I forgot about
* Fix missing generics on single segment writer
* Add missing type export
* Add default methods for convenience
* Cleanup
* Fix more-like-this query to use standard types
* Update API and fix tests
* Add doc traits
* Add field value iter
* Add value and serialization
* Adjust order
* Fix bug
* Correct type
* Rebase main and fix conflicts
* Reformat code
* Merge upstream
* Fix missing generics on single segment writer
* Add missing type export
* Add default methods for convenience
* Cleanup
* Fix more-like-this query to use standard types
* Update API and fix tests
* Add tokenizer improvements from previous commits
* Add tokenizer improvements from previous commits
* Reformat
* Fix unit tests
* Fix unit tests
* Use enum in changes
* Stage changes
* Add new deserializer logic
* Add serializer integration
* Add document deserializer
* Implement new (de)serialization api for existing types
* Fix bugs and type errors
* Add helper implementations
* Fix errors
* Reformat code
* Add unit tests and some code organisation for serialization
* Add unit tests to deserializer
* Add some small docs
* Add support for deserializing serde values
* Reformat
* Fix typo
* Fix typo
* Change repr of facet
* Remove unused trait methods
* Add child value type
* Resolve comments
* Fix build
* Fix more build errors
* Fix more build errors
* Fix the tests I missed
* Fix examples
* fix numerical order, serialize PreTok Str
* fix coverage
* rename Document to TantivyDocument, rename DocumentAccess to Document
add Binary prefix to binary de/serialization
* fix coverage
---------
Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>
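
The nanosecond DateTime change at the top of this commit, together with the doc store version enum that replaced the configurable serializer, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the variant names `V1`/`V2`, the two helper functions, and the choice of an `i64` timestamp are hypothetical and are not the crate's actual API. The idea is only that an ordered version enum lets the reader decide how to interpret a stored timestamp.

```rust
/// Hypothetical doc store format versions (names are illustrative).
/// Deriving `PartialOrd`/`Ord` allows comparisons such as `version < V2`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum DocStoreVersion {
    /// Older format: timestamps were truncated to microseconds.
    V1 = 1,
    /// Newer format: timestamps are stored as nanoseconds.
    V2 = 2,
}

/// Encodes a nanosecond timestamp the way the given version stores it.
pub fn encode_timestamp(nanos: i64, version: DocStoreVersion) -> i64 {
    match version {
        DocStoreVersion::V1 => nanos / 1_000, // lossy: drops sub-microsecond digits
        DocStoreVersion::V2 => nanos,
    }
}

/// Decodes a stored timestamp back into nanoseconds, whatever the version.
pub fn decode_timestamp_nanos(raw: i64, version: DocStoreVersion) -> i64 {
    if version < DocStoreVersion::V2 {
        // Old stores hold microseconds; scale up without overflowing.
        raw.saturating_mul(1_000)
    } else {
        raw
    }
}
```

With this shape, old stores stay readable (at microsecond precision) while new stores round-trip the full nanosecond value.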
Applied this command to the code, making it a bit shorter and slightly
more readable.
```
cargo +nightly clippy --all-features --benches --tests --workspace --fix -- -A clippy::all -W clippy::uninlined_format_args
cargo +nightly fmt --all
```
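
For reference, the `clippy::uninlined_format_args` lint rewrites positional format arguments into inlined ones. The two functions below are hypothetical and only show the before/after shape of such a rewrite; both produce the same string.

```rust
/// Before the fix: arguments are passed positionally.
pub fn describe_before(name: &str, count: usize) -> String {
    format!("{} has {} docs", name, count)
}

/// After the fix: identifiers are inlined into the format string.
pub fn describe_after(name: &str, count: usize) -> String {
    format!("{name} has {count} docs")
}
```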
* add memory limit for aggregations
introduce AggregationLimits to set memory consumption limit and bucket limits
memory limit is checked during aggregation, bucket limit is checked before returning the aggregation request.
* Apply suggestions from code review
Co-authored-by: Paul Masurel <paul@quickwit.io>
* add ByteCount with human readable format
---------
Co-authored-by: Paul Masurel <paul@quickwit.io>
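
The memory limit check described in this commit could look roughly like the sketch below. It is not the actual `AggregationLimits` implementation: the `MemoryLimit` type, its method name, the `String` error, and the decimal units used by `ByteCount` are assumptions chosen to keep the example self-contained.

```rust
use std::fmt;

/// A byte count that prints in a human-readable form (decimal units assumed).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ByteCount(pub u64);

impl fmt::Display for ByteCount {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        const UNITS: [(&str, u64); 3] =
            [("GB", 1_000_000_000), ("MB", 1_000_000), ("KB", 1_000)];
        for &(suffix, size) in UNITS.iter() {
            if self.0 >= size {
                return write!(f, "{:.2} {}", self.0 as f64 / size as f64, suffix);
            }
        }
        write!(f, "{} B", self.0)
    }
}

/// Hypothetical tracker: checked while aggregating, fails once over the limit.
pub struct MemoryLimit {
    limit: u64,
    used: u64,
}

impl MemoryLimit {
    pub fn new(limit_bytes: u64) -> Self {
        MemoryLimit { limit: limit_bytes, used: 0 }
    }

    /// Records newly consumed memory and returns an error if the limit is exceeded.
    pub fn add_memory_consumed(&mut self, num_bytes: u64) -> Result<(), String> {
        self.used = self.used.saturating_add(num_bytes);
        if self.used > self.limit {
            Err(format!(
                "memory limit exceeded: used {}, limit {}",
                ByteCount(self.used),
                ByteCount(self.limit)
            ))
        } else {
            Ok(())
        }
    }
}
```

Checking during aggregation rather than at the end means a runaway request can be aborted before it allocates far past the limit.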
In addition, it isolates the doc compressor logic
and reports io::Result more accurately.
In the case of the same-thread doc compressor,
the blocks are also not copied.
* Added sstable and enabling it by default, and parallel boolean query.
* Added async API for FileSlice.
* Added async get_doc
* Reduce blocksize to 32_000
* Added debug logs
Quickwit-specific features are hidden behind the quickwit feature flag.
Change Footer version handling, Make compression dynamic
Change Footer version handling
Simplify version handling by switching to JSON instead of binary serialization.
fixes #1058
Make compression dynamic
Instead of choosing the compression during compile time via a feature flag, you can now have multiple compression algorithms enabled and decide during runtime which one to choose via IndexSettings. Changing the compression algorithm on an index is also supported. The information which algorithm was used in the doc store is stored in the DocStoreFooter. The default is the lz4 block format.
fixes #904
Handle merging of different compressors
Fix feature flag names
Add doc store test for all compressors
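
The runtime compressor selection can be pictured as follows: the writer records which algorithm it used in the footer, and the reader dispatches on that value instead of on a compile-time feature flag. This is a sketch under assumptions, not the real footer layout: the one-byte encoding, the id values, the method names, and the reduced two-variant enum are all hypothetical.

```rust
/// Illustrative subset of compressors; the default mirrors the lz4 default above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Compressor {
    None,
    Lz4,
}

impl Default for Compressor {
    fn default() -> Self {
        Compressor::Lz4
    }
}

impl Compressor {
    /// Hypothetical on-disk id for each compressor.
    pub fn to_id(self) -> u8 {
        match self {
            Compressor::None => 0,
            Compressor::Lz4 => 1,
        }
    }

    /// Maps an id read from disk back to a compressor, rejecting unknown ids.
    pub fn from_id(id: u8) -> Result<Self, String> {
        match id {
            0 => Ok(Compressor::None),
            1 => Ok(Compressor::Lz4),
            other => Err(format!("unknown compressor id {other}")),
        }
    }
}

/// Hypothetical footer that only remembers the compressor used by the store.
#[derive(Debug, PartialEq, Eq)]
pub struct DocStoreFooter {
    pub compressor: Compressor,
}

impl DocStoreFooter {
    pub fn serialize(&self) -> [u8; 1] {
        [self.compressor.to_id()]
    }

    pub fn deserialize(bytes: &[u8]) -> Result<Self, String> {
        let id = *bytes.last().ok_or_else(|| "empty footer".to_string())?;
        Ok(DocStoreFooter { compressor: Compressor::from_id(id)? })
    }
}
```

Because each store carries its own compressor id, segments written with different algorithms can coexist and be merged by decoding each with its own footer value.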
* add iterator over documents in docstore
When profiling, I saw that around 8% of the time in a merge was spent in look-ups into the skip index. Since the documents in the merge case are read continuously, we can replace the random access with an iterator over the documents.
Merge Time on Sorted Index Before/After:
24s / 19s
Merge Time on Unsorted Index Before/After:
15s / 13.5s
So we can expect 10-20% faster merges.
This iterator is also important if we add sorting based on a field in the documents.
* Update reader.rs
Co-authored-by: Paul Masurel <paul@quickwit.io>
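
The difference between random access and the new iterator can be sketched with a toy block store. `ToyStore` and its layout are hypothetical stand-ins, not the real doc store: `get` has to search the block index on every call (the analogue of the skip index look-up), while `iter` simply walks the blocks in order.

```rust
/// Toy doc store: each block holds the id of its first doc and its docs.
pub struct ToyStore {
    blocks: Vec<(u32, Vec<String>)>,
}

impl ToyStore {
    pub fn new(blocks: Vec<(u32, Vec<String>)>) -> Self {
        ToyStore { blocks }
    }

    /// Random access: binary-searches the block index for every look-up.
    pub fn get(&self, doc_id: u32) -> Option<&str> {
        // Index of the first block that starts after `doc_id`.
        let idx = self.blocks.partition_point(|(first, _)| *first <= doc_id);
        let (first, docs) = self.blocks.get(idx.checked_sub(1)?)?;
        docs.get((doc_id - first) as usize).map(String::as_str)
    }

    /// Sequential access: no per-document search, blocks are read in order.
    pub fn iter(&self) -> impl Iterator<Item = &str> + '_ {
        self.blocks
            .iter()
            .flat_map(|(_, docs)| docs.iter().map(String::as_str))
    }
}
```

In a merge, where every document is read in order, the sequential path avoids one index search per document.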
* sort index by field
add sort info to IndexSettings
generate docid mapping for sorted field (only fastfield)
remap singlevalue fastfield
* support docid mapping in multivalue fastfield
move docid mapping to serialization step (less intermediate data for mapping)
add support for docid mapping in multivalue fastfield
* handle docid map in bytes fastfield
* forward docid mapping, remap postings
* fix merge conflicts
* move test to index_sorter
* add docid index mapping old->new
add docid mapping for both directions old->new (used in postings) and new->old (used in fast field)
handle mapping in postings recorder
warn instead of info for MAX_TOKEN_LEN
* remap docid in fieldnorm
* resort docids in recorder, more extensive tests
* handle index sorting in docstore
handle index sort in docstore, by saving all the docs in a temp docstore file (SegmentComponent::TempStore). On serialization the docid mapping is used to create a docstore in the correct order by reading the old docstore.
add docstore sort tests
refactor tests
* refactor
rename docid doc_id
rename docid_map doc_id_map
rename DocidMapping DocIdMapping
fix typo
* u32 to DocId
* better doc_id_map creation
remove unstable sort
* add non mut method to FastFieldWriters
add _mut prefix to &mut methods
* remove sort_index
* fix clippy issues
* fix SegmentComponent iterator
use std::mem::replace
* fix test
* fmt
* handle indexsettings deserialize
* add reading, writing bytes to doc store
get bytes of document in doc store
add store_bytes method doc writer to accept serialized document
add serialization index settings test
* rename index_sorter to doc_id_mapping
use bufferlender in recorder
* fix compile issue, make sort_by_field optional
* fix test compile
* validate index settings on merge
validate index settings on merge
forward merge info to SegmentSerializer (for TempStore)
* fix doctest
* add itertools, use kmerge
add itertools, use kmerge
push because rustfmt fails
* implement/test merge for fastfield
implement/test merge for fastfield
rename len to num_deleted in DeleteBitSet
* Use precalculated docid mapping in merger
Use precalculated docid mapping in merger for sorted indices instead of on the fly calculation
Add index creation macro benchmark, but commented out for now, since it is not really usable due to long runtimes, and extreme fluctuations. May be better suited in criterion or an external bench bin
* fix fast field reader docs
fix fast field reader docs, Error instead of None returned
add u64s_lenient to fastreader
add create docid mapping benchmark
* add test for multifast field merge
refactor test
add test for multifast field merge
* add num_bytes to BytesFastFieldReader
equivalent to num_vals in MultiValuedFastFieldReader
* add MultiValueLength trait
add MultiValueLength trait in order to unify index creation for BytesFastFieldReader and MultiValuedFastFieldReader in merger
* Add ReaderWithOrdinal, fix
Add ReaderWithOrdinal to associate data to a reader in merger
Fix bytes offset index creation in merger
* add test for merging bytes with sorted docids
* Merge fieldnorm for sorted index
* handle posting list in merge in sorted index
handle posting list in merge in sorted index by using doc id mapping for sorting
reuse SegmentOrdinal type
* handle doc store order in merge in sorted index
* fix typo, cleanup
* make IndexSetting non-optional
* fix type, rename test file
fix type
rename test file
add type
* remove SegmentReaderWithOrdinal accessors
* cargo fmt
* add index sort & merge test to include deletes
* Fix posting list merge issue
Fix posting list merge issue - ensure serializer always gets monotonically increasing doc ids
handle sorting and merging for facets field
* performance: cache field readers, use bytes for doc store merge
* change facet merge test to cover index sorting
* add RawDocument abstraction to access bytes in doc store
* fix deserialization, update changelog
fix deserialization
update changelog
forward error on merge failed
* cache store readers to utilize lru cache (4x performance)
cache store readers, to utilize lru cache (4x faster performance, due to fewer decompress calls on the block)
* add include_temp_doc_store flag in InnerSegmentMeta
unset flag on deserialization and after finalize of a segment
set flag when creating new instances
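
The doc id mapping used throughout the index sorting commits above (old->new for postings, new->old for fast fields) can be sketched as below. This is an illustration under assumptions, not the crate's `DocIdMapping`: the constructor, the public fields, the ascending `u64` sort key, and the `remap` helper are hypothetical.

```rust
/// Hypothetical two-way mapping between pre-sort and post-sort doc ids.
pub struct DocIdMapping {
    /// Indexed by new doc id, yields the old doc id (fast field direction).
    pub new_to_old: Vec<u32>,
    /// Indexed by old doc id, yields the new doc id (postings direction).
    pub old_to_new: Vec<u32>,
}

impl DocIdMapping {
    /// Builds the mapping from one sort value per old doc id (ascending order).
    pub fn from_sort_values(values: &[u64]) -> Self {
        let mut new_to_old: Vec<u32> = (0..values.len() as u32).collect();
        // Stable sort: docs with equal values keep their original relative order.
        new_to_old.sort_by_key(|&old| values[old as usize]);

        // Invert the permutation to get the opposite direction.
        let mut old_to_new = vec![0u32; values.len()];
        for (new, &old) in new_to_old.iter().enumerate() {
            old_to_new[old as usize] = new as u32;
        }
        DocIdMapping { new_to_old, old_to_new }
    }

    /// Reorders a per-document column into the new doc id order.
    pub fn remap<T: Clone>(&self, column: &[T]) -> Vec<T> {
        self.new_to_old
            .iter()
            .map(|&old| column[old as usize].clone())
            .collect()
    }
}
```

For sort values `[30, 10, 20]`, the doc that was id 1 becomes id 0, so `new_to_old` is `[1, 2, 0]` and `old_to_new` is `[2, 0, 1]`.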