tantivy

mirror of https://github.com/quickwit-oss/tantivy.git synced 2026-05-31 07:30:39 +00:00

Author	SHA1	Message	Date
PSeitz	945af922d1	clippy (#2661 ) * clippy * use readable version --------- Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>	2025-07-02 11:25:03 +02:00
PSeitz	2f2db16ec1	store DateTime as nanoseconds in doc store (#2486 ) * store DateTime as nanoseconds in doc store The doc store DateTime was truncated to microseconds previously. This removes this truncation, while still keeping backwards compatibility. This is done by adding the trait `ConfigurableBinarySerializable`, which works like `BinarySerializable`, but with a config that allows de/serialize as different date time precision currently. bump version format to 7. add compat test to check the date time truncation. * remove configurable binary serialize, add enum for doc store version * test doc store version ord	2024-10-18 10:50:20 +08:00
PSeitz	2e3641c2ae	return CompactDocValue instead of trait (#2410 ) The CompactDocValue is easier to handle than the trait in some cases like comparison and conversion	2024-05-27 07:33:50 +02:00
PSeitz	e1679f3fb9	compact doc (#2402 ) * compact doc * add any value type * pass references when building CompactDoc * remove OwnedValue from API * clippy * clippy * fail on large documents * fmt * cleanup * cleanup * implement Value for different types fix serde_json date Value implementation * fmt * cleanup * fmt * cleanup * store positions instead of pos+len * remove nodes array * remove mediumvec * cleanup * infallible serialize into vec * remove positions indirection * remove 24MB limitation in document use u32 for Addr Remove the 3 byte addressing limitation and use VInt instead * cleanup * extend test * cleanup, add comments * rename, remove pub	2024-05-21 10:16:08 +02:00
trinity-1686a	8cd7ddc535	run block decompression from executor (#2386 ) * run block decompression from executor * add a wrapper with is_closed to oneshot channel * add cancelation test to Executor::spawn_blocking	2024-05-08 12:22:44 +02:00
Adam Reichold	b493743f8d	Fix trait bound of StoreReader::iter (#2360 ) * Fix trait bound of StoreReader::iter Similar to `StoreReader::get`, `StoreReader::iter` should only require `DocumentDeserialize` and not `Document`. * Mark the iterator returned by SegmentReader::doc_ids_alive as Send so it can be used in impls of Stream/AsyncIterator.	2024-04-15 15:50:02 +02:00
PSeitz	182f58cea6	remove Document: DocumentDeserialize dependency (#2211 ) * remove Document: DocumentDeserialize dependency The dependency requires users to implement an API they may not use. * remove unnecessary Document bounds	2023-10-13 07:59:54 +02:00
PSeitz	03a1f40767	rename DocValue to Value (#2197 ) rename DocValue to Value to avoid confusion with lucene DocValues rename Value to OwnedValue	2023-10-02 17:03:00 +02:00
Harrison Burt	1c7c6fd591	POC: Tantivy documents as a trait (#2071 ) * fix windows build (#1) * Fix windows build * Add doc traits * Add field value iter * Add value and serialization * Adjust order * Fix bug * Correct type * Fix generic bugs * Reformat code * Add generic to index writer which I forgot about * Fix missing generics on single segment writer * Add missing type export * Add default methods for convenience * Cleanup * Fix more-like-this query to use standard types * Update API and fix tests * Add doc traits * Add field value iter * Add value and serialization * Adjust order * Fix bug * Correct type * Rebase main and fix conflicts * Reformat code * Merge upstream * Fix missing generics on single segment writer * Add missing type export * Add default methods for convenience * Cleanup * Fix more-like-this query to use standard types * Update API and fix tests * Add tokenizer improvements from previous commits * Add tokenizer improvements from previous commits * Reformat * Fix unit tests * Fix unit tests * Use enum in changes * Stage changes * Add new deserializer logic * Add serializer integration * Add document deserializer * Implement new (de)serialization api for existing types * Fix bugs and type errors * Add helper implementations * Fix errors * Reformat code * Add unit tests and some code organisation for serialization * Add unit tests to deserializer * Add some small docs * Add support for deserializing serde values * Reformat * Fix typo * Fix typo * Change repr of facet * Remove unused trait methods * Add child value type * Resolve comments * Fix build * Fix more build errors * Fix more build errors * Fix the tests I missed * Fix examples * fix numerical order, serialize PreTok Str * fix coverage * rename Document to TantivyDocument, rename DocumentAccess to Document add Binary prefix to binary de/serialization * fix coverage --------- Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>	2023-10-02 10:01:16 +02:00
PSeitz	040554f2f9	Update to lz4_flex 0.11 (#2106 )	2023-06-29 14:16:00 +08:00
Yuri Astrakhan	74275b76a6	Inline format arguments where makes sense (#2038 ) Applied this command to the code, making it a bit shorter and slightly more readable. ``` cargo +nightly clippy --all-features --benches --tests --workspace --fix -- -A clippy::all -W clippy::uninlined_format_args cargo +nightly fmt --all ```	2023-05-10 18:03:59 +09:00
PSeitz	9e2faecf5b	add memory limit for aggregations (#1942 ) * add memory limit for aggregations introduce AggregationLimits to set memory consumption limit and bucket limits memory limit is checked during aggregation, bucket limit is checked before returning the aggregation request. * Apply suggestions from code review Co-authored-by: Paul Masurel <paul@quickwit.io> * add ByteCount with human readable format --------- Co-authored-by: Paul Masurel <paul@quickwit.io>	2023-03-16 06:21:07 +01:00
PSeitz	0f20787917	fix doc store cache docs (#1821 ) * fix doc store cache docs addresses an issue reported in #1820 * rename doc_store_cache_size	2023-01-23 07:06:49 +01:00
Adam Reichold	82a183bc2d	Bump dependency on lru from version 0.7.5 to version 0.9.0. (#1755 )	2023-01-10 13:35:37 +09:00
Paul Masurel	f39165e1e7	Moving FileSlice to tantivy-common (#1729 )	2022-12-21 16:35:11 +09:00
Paul Masurel	32cb1d22da	Removed AsyncIoResult. (#1728 )	2022-12-21 16:01:17 +09:00
PSeitz	f9171a3981	fix clippy (#1725 ) * fix clippy * fix clippy fastfield codecs * fix clippy bitpacker * fix clippy common * fix clippy stacker * fix clippy sstable * fmt	2022-12-20 07:30:06 +01:00
Bruce Mitchener	b3bf9a5716	Documentation improvements.	2022-10-05 14:18:10 +07:00
trinity-1686a	5945dbf0bd	change format for store to make it faster with small documents (#1569 ) * use new format for docstore blocks * move index to end of block it makes writing the block faster due to one less memcopy	2022-10-04 09:58:55 +02:00
Paul Masurel	817225edfb	Allow for a same-thread doc compressor. (#1510 ) In addition, it isolates the doc compressor logic, better reports io::Result. In the case of the same-thread doc compressor, the blocks are also not copied.	2022-09-13 15:32:48 +09:00
Paul Masurel	8e775b6c3d	Refactoring dyn Column (#1502 )	2022-09-02 17:26:30 +09:00
Pascal Seitz	5750224d4c	set docstore cache size at construction	2022-07-04 14:27:55 +08:00
Pascal Seitz	9db2f0e82b	expose doc store cache size expose lru doc store cache size optimize doc store cache size	2022-07-04 13:54:41 +08:00
Antoine G	11e4225f23	doc fix (#1391 ) Documentation fix.	2022-06-21 15:53:33 +09:00
Pascal Seitz	4d9d2b6db0	split into compressor/decompressor use custom de/serializer for compressor accept parameters like zstd(compression_level=5) as compressor	2022-06-02 23:29:24 +08:00
Pascal Seitz	314ae43a45	fix fmt	2022-06-02 14:54:23 +08:00
Pascal Seitz	9bcd2b8104	fix read_block_async	2022-06-02 13:37:52 +08:00
Pascal Seitz	0c9c257150	move cache handling into single function	2022-06-02 13:25:29 +08:00
Pascal Seitz	1af85a2956	accept usize instead &usize	2022-06-02 11:23:36 +08:00
Pascal Seitz	bc4c3d0c6b	add peek_lru test	2022-06-02 11:13:17 +08:00
Pascal Seitz	6937c75f05	hide advanced doc store api	2022-06-02 11:13:17 +08:00
Pascal Seitz	e54429e827	expose doc store functions expose doc store functions for advanced usage refactor cache expose cache statistics remove unnecessary arc unduplicate code	2022-06-02 11:13:17 +08:00
Kryesh	aaa22ad225	Make block size configurable to allow for better compression ratios on large documents	2022-05-18 11:13:15 +10:00
Paul Masurel	2ead010c83	Tantivy quickwit (#1293 ) * Added sstable and enabling it by default, and parallel boolean query. * Added async API for FileSlice. * Added async get_doc * Reduce blocksize to 32_000 * Added debug logs Quickwit specific features are hidden behind the quickwit feature flag.	2022-02-25 17:32:49 +09:00
Paul Masurel	eca6628b3c	Minor refactoring (#1266 )	2022-01-28 15:55:55 +09:00
PSeitz	352e0cc58d	Add demux operation (#1150 ) * add merge for DeleteBitSet, allow custom DeleteBitSet on merge * forward delete bitsets on merge, add tests * add demux operation and tests	2021-10-06 16:05:16 +09:00
Pascal Seitz	d7a6a409a1	renames	2021-09-23 20:33:11 +08:00
Pascal Seitz	a1f5cead96	AliveBitSet instead of DeleteBitSet	2021-09-23 20:03:57 +08:00
Pascal Seitz	3265f7bec3	dissolve common module	2021-08-19 23:26:34 +01:00
Pascal Seitz	10f056fbb4	apply clippy fixes	2021-07-01 17:08:44 +02:00
Andre-Philippe Paquet	57ae5b27dc	fix store reader iterator, take 2	2021-06-16 07:51:39 -04:00
Andre-Philippe Paquet	473a346814	remove debugging	2021-06-13 16:49:44 -04:00
Andre-Philippe Paquet	511dc8f87f	fix store reader iterator	2021-06-13 16:00:13 -04:00
PSeitz	8d32c3ba3a	Change Footer version handling, Make compression dynamic (#1060 ) Change Footer version handling, Make compression dynamic Change Footer version handling Simplify version handling by switching to JSON instead of binary serialization. fixes #1058 Make compression dynamic Instead of choosing the compression during compile time via a feature flag, you can now have multiple compression algorithms enabled and decide during runtime which one to choose via IndexSettings. Changing the compression algorithm on an index is also supported. The information which algorithm was used in the doc store is stored in the DocStoreFooter. The default is the lz4 block format. fixes #904 Handle merging of different compressors Fix feature flag names Add doc store test for all compressors	2021-05-28 14:57:20 +09:00
PSeitz	249bc6cf72	upgrade lz4_flex to 0.8 (#1049 ) * upgrade lz4_flex to 0.8 * fix set_len	2021-05-19 10:46:01 +09:00
PSeitz	1c0af5765d	fix doc store iter error handling, fixes #1047 (#1051 )	2021-05-18 21:43:57 +09:00
Paul Masurel	7ba771ed1b	Replaced RawDocument by OwnedBytes (#1046 )	2021-05-18 14:33:36 +09:00
PSeitz	a4002622f8	add iterator over documents in docstore (#1044 ) * add iterator over documents in docstore When profiling, I saw that around 8% of the time in a merge was spent in look-ups into the skip index. Since the documents in the merge case are read continuously, we can replace the random access with an iterator over the documents. Merge Time on Sorted Index Before/After: 24s / 19s Merge Time on Unsorted Index Before/After: 15s / 13,5s So we can expect 10-20% faster merges. This iterator is also important if we add sorting based on a field in the documents. * Update reader.rs Co-authored-by: Paul Masurel <paul@quickwit.io>	2021-05-18 10:29:02 +09:00
PSeitz	d523543dc7	Sort Index/Docids By Field (#1026 ) * sort index by field add sort info to IndexSettings generate docid mapping for sorted field (only fastfield) remap singlevalue fastfield * support docid mapping in multivalue fastfield move docid mapping to serialization step (less intermediate data for mapping) add support for docid mapping in multivalue fastfield * handle docid map in bytes fastfield * forward docid mapping, remap postings * fix merge conflicts * move test to index_sorter * add docid index mapping old->new add docid mapping for both directions old->new (used in postings) and new->old (used in fast field) handle mapping in postings recorder warn instead of info for MAX_TOKEN_LEN * remap docid in fieldnorm * resort docids in recorder, more extensive tests * handle index sorting in docstore handle index sort in docstore, by saving all the docs in a temp docstore file (SegmentComponent::TempStore). On serialization the docid mapping is used to create a docstore in the correct order by reading the old docstore. 
add docstore sort tests refactor tests * refactor rename docid doc_id rename docid_map doc_id_map rename DocidMapping DocIdMapping fix typo * u32 to DocId * better doc_id_map creation remove unstable sort * add non mut method to FastFieldWriters add _mut prefix to &mut methods * remove sort_index * fix clippy issues * fix SegmentComponent iterator use std::mem::replace * fix test * fmt * handle indexsettings deserialize * add reading, writing bytes to doc store get bytes of document in doc store add store_bytes method doc writer to accept serialized document add serialization index settings test * rename index_sorter to doc_id_mapping use bufferlender in recorder * fix compile issue, make sort_by_field optional * fix test compile * validate index settings on merge validate index settings on merge forward merge info to SegmentSerializer (for TempStore) * fix doctest * add itertools, use kmerge add itertools, use kmerge push because rustfmt fails * implement/test merge for fastfield implement/test merge for fastfield rename len to num_deleted in DeleteBitSet * Use precalculated docid mapping in merger Use precalculated docid mapping in merger for sorted indices instead of on the fly calculation Add index creation macro benchmark, but commented out for now, since it is not really usable due to long runtimes, and extreme fluctuations. 
May be better suited in criterion or an external bench bin * fix fast field reader docs fix fast field reader docs, Error instead of None returned add u64s_lenient to fastreader add create docid mapping benchmark * add test for multifast field merge refactor test add test for multifast field merge * add num_bytes to BytesFastFieldReader equivalent to num_vals in MultiValuedFastFieldReader * add MultiValueLength trait add MultiValueLength trait in order to unify index creation for BytesFastFieldReader and MultiValuedFastFieldReader in merger * Add ReaderWithOrdinal, fix Add ReaderWithOrdinal to associate data to a reader in merger Fix bytes offset index creation in merger * add test for merging bytes with sorted docids * Merge fieldnorm for sorted index * handle posting list in merge in sorted index handle posting list in merge in sorted index by using doc id mapping for sorting reuse SegmentOrdinal type * handle doc store order in merge in sorted index * fix typo, cleanup * make IndexSetting non-optional * fix type, rename test file fix type rename test file add type * remove SegmentReaderWithOrdinal accessors * cargo fmt * add index sort & merge test to include deletes * Fix posting list merge issue Fix posting list merge issue - ensure serializer always gets monotonically increasing doc ids handle sorting and merging for facets field * performance: cache field readers, use bytes for doc store merge * change facet merge test to cover index sorting * add RawDocument abstraction to access bytes in doc store * fix deserialization, update changelog fix deserialization update changelog forward error on merge failed * cache store readers to utilize lru cache (4x performance) cache store readers, to utilize lru cache (4x faster performance, due to less decompress calls on the block) * add include_temp_doc_store flag in InnerSegmentMeta unset flag on deserialization and after finalize of a segment set flag when creating new instances	2021-05-17 22:20:57 +09:00
Paul Masurel	39dd8cfe24	Cargo clippy. Acronym should not be full uppercase apparently.	2021-04-26 11:49:18 +09:00

100 Commits