tantivy

mirror of https://github.com/quickwit-oss/tantivy.git synced 2026-01-04 08:12:54 +00:00

Author	SHA1	Message	Date
Pascal Seitz	2ce485b8cc	skip estimate phase for merge multivalue index precompute stats for merge multivalue index + disable Line encoding for multivalue index. That combination allows to skip the first estimation pass. This gives up to 2x on merge performance on multivalue indices. This change may decrease compression as Line is very good compressible for documents, which have a fixed amount of values in each doc. The line codec should be replaced. ``` merge_multi_and_multi Avg: 22.7880ms (-47.15%) Median: 22.5469ms (-47.38%) [22.3691ms .. 25.8392ms] merge_dense_and_dense Avg: 14.4398ms (+2.18%) Median: 14.2465ms (+0.74%) [14.1620ms .. 16.1270ms] merge_sparse_and_sparse Avg: 10.6559ms (+1.10%) Median: 10.6318ms (+0.91%) [10.5527ms .. 11.2848ms] merge_sparse_and_dense Avg: 12.4886ms (+1.52%) Median: 12.4044ms (+0.84%) [12.3261ms .. 13.9439ms] merge_multi_and_dense Avg: 25.6686ms (-45.56%) Median: 25.4851ms (-45.84%) [25.1618ms .. 27.6226ms] merge_multi_and_sparse Avg: 24.3278ms (-47.00%) Median: 24.1917ms (-47.34%) [23.7159ms .. 27.0513ms] ```	2024-06-11 20:22:00 +08:00
PSeitz	714f363d43	add bench & test for columnar merging (#2428 ) * add merge columnar proptest * add columnar merge benchmark	2024-06-10 16:26:16 +08:00
dependabot[bot]	e197b59258	Update itertools requirement from 0.12.0 to 0.13.0 (#2400 ) Updates the requirements on [itertools](https://github.com/rust-itertools/itertools) to permit the latest version. - [Changelog](https://github.com/rust-itertools/itertools/blob/master/CHANGELOG.md) - [Commits](https://github.com/rust-itertools/itertools/compare/v0.12.0...v0.13.0) --- updated-dependencies: - dependency-name: itertools dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2024-05-17 02:53:02 +02:00
PSeitz	99a59ad37e	remove zero byte check (#2379 ) remove zero byte checks in columnar. zero bytes are converted during serialization now. unify code paths extend test for expected column names	2024-04-26 06:03:28 +02:00
PSeitz	17d5869ad6	update CHANGELOG, use github API in cliff (#2354 ) * update CHANGELOG, use github API in cliff * reset version to 0.21.1, before release * chore: Release * remove unreleased from CHANGELOG	2024-04-15 10:07:20 +02:00
PSeitz	74940e9345	clippy (#2349 ) * fix clippy * fix clippy * fix duplicate imports	2024-04-09 07:54:44 +02:00
PSeitz	b644d78a32	fix null byte handling in JSON paths (#2345 ) * fix null byte handling in JSON paths closes https://github.com/quickwit-oss/tantivy/issues/2193 closes https://github.com/quickwit-oss/tantivy/issues/2340 * avoid repeated term truncation * fix test * Apply suggestions from code review Co-authored-by: Paul Masurel <paul@quickwit.io> * add comment --------- Co-authored-by: Paul Masurel <paul@quickwit.io>	2024-04-05 09:53:35 +02:00
PSeitz	7ce950f141	add method to fetch block of first vals in columnar (#2330 ) * add method to fetch block of first vals in columnar add method to fetch block of first vals in columnar (this is way faster than single calls for full columns) add benchmark fix import warnings ``` test bench_get_block_first_on_full_column ... bench: 56 ns/iter (+/- 26) test bench_get_block_first_on_full_column_single_calls ... bench: 311 ns/iter (+/- 6) test bench_get_block_first_on_multi_column ... bench: 378 ns/iter (+/- 15) test bench_get_block_first_on_multi_column_single_calls ... bench: 546 ns/iter (+/- 13) test bench_get_block_first_on_optional_column ... bench: 291 ns/iter (+/- 6) test bench_get_block_first_on_optional_column_single_calls ... bench: 362 ns/iter (+/- 8) ``` * use remainder	2024-03-15 08:01:47 +01:00
PSeitz	b0e65560a1	handle ip addresses in term aggregation (#2319 ) * handle ip addresses in term aggregation Stores IpAddresses during the segment term aggregation via u64 representation and convert to u128(IpV6Address) via downcast when converting to intermediate results. Enable Downcasting on `ColumnValues` Expose u64 variant for u128 encoded data via `open_u64_lenient` method. Remove lifetime in VecColumn, to avoid 'static lifetime requirement coming from downcast trait. * rename method	2024-03-14 09:41:18 +01:00
PSeitz	ec37295b2f	add fast path for full columns in fetch_block (#2328 ) Spotted in `range_date_histogram` query in quickwit benchmark: 5% of time copying docs around, which is not needed in the full index case remove Column to ColumnIndex deref	2024-03-14 04:07:11 +01:00
Adam Reichold	72002e8a89	Make test builds Clippy clean. (#2277 )	2024-01-31 02:47:06 +01:00
Paul Masurel	014328e378	Fix bug that can cause `get_docids_for_value_range` to panic. (#2295 ) * Fix bug that can cause `get_docids_for_value_range` to panic. When `selected_docid_range.end == num_rows`, we would get a panic as we try to access a non-existing blockmeta. This PR accepts calls to rank with any value. For any value above num_rows we simply return non_null_rows. Fixes #2293 * add tests, merge variables --------- Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>	2024-01-09 14:52:20 +01:00
trinity-1686a	9ebc5ed053	use fst for sstable index (#2268 ) * read path for new fst based index * implement BlockAddrStoreWriter * extract slop/derivation computation * use better linear approximator and allow negative correction to approximator * document format and reorder some fields * optimize single block sstable size * plug backward compat	2023-12-04 15:13:15 +01:00
PSeitz	1a9fc10be9	add fields_metadata to SegmentReader, add columnar docs (#2222 ) * add fields_metadata to SegmentReader, add columnar docs * use schema to resolve field, add test * normalize paths * merge for FieldsMetadata, add fields_metadata on Index * Update src/core/segment_reader.rs Co-authored-by: Paul Masurel <paul@quickwit.io> * merge code paths * add Hash * move function outside --------- Co-authored-by: Paul Masurel <paul@quickwit.io>	2023-11-22 12:29:53 +01:00
PSeitz	47009ed2d3	remove unused deps (#2264 ) found with cargo machete remove pprof (doesn't work)	2023-11-20 02:59:59 +01:00
dependabot[bot]	7a2c5804b1	Update itertools requirement from 0.11.0 to 0.12.0 (#2255 ) Updates the requirements on [itertools](https://github.com/rust-itertools/itertools) to permit the latest version. - [Changelog](https://github.com/rust-itertools/itertools/blob/master/CHANGELOG.md) - [Commits](https://github.com/rust-itertools/itertools/compare/v0.11.0...v0.12.0) --- updated-dependencies: - dependency-name: itertools dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2023-11-15 01:03:08 +01:00
PSeitz	927b4432c9	Perf: use term hashmap in fastfield (#2243 ) * add shared arena hashmap * bench fastfield indexing * use shared arena hashmap in columnar lower minimum resize in hashtable * clippy * add comments	2023-11-09 13:44:02 +01:00
PSeitz	19a859d6fd	term hashmap remove copy in is_empty, unused unordered_id (#2229 )	2023-10-27 05:01:32 +02:00
PSeitz	4feeb2323d	fix clippy (#2223 )	2023-10-24 10:05:22 +02:00
PSeitz	38db53c465	make column_index pub (#2181 )	2023-09-22 08:06:45 +02:00
PSeitz	49448b31c6	chore: Release (#2168 ) * chore: Release * update CHANGELOG	2023-09-01 13:58:58 +02:00
PSeitz	b1d8b072db	add missing aggregation part 2 (#2149 ) * add missing aggregation part 2 Add missing support for: - Mixed types columns - Key of type string on numerical fields The special aggregation is slower than the integrated one in TermsAggregation and therefore not chosen by default, although it can cover all use cases. * simplify, add num_docs to empty	2023-08-31 07:55:33 +02:00
PSeitz	59460c767f	delayed column opening during merge (#2132 ) * lazy columnar merge This is the first part of addressing #3633 Instead of loading all Column into memory for the merge, only the current column_name group is loaded. This can be done since the sstable streams the columns lexicographically. * refactor * add rustdoc * replace iterator with BTreeMap	2023-08-21 08:55:35 +02:00
PSeitz	62ece86f24	track ff dictionary indexing memory consumption (#2147 )	2023-08-16 14:00:08 +02:00
PSeitz	ed1deee902	fix sort index by date (#2124 ) closes #2112	2023-08-14 17:36:52 +02:00
PSeitz	2e109018b7	add missing parameter to term agg (#2103 ) * add missing parameter to term agg * move missing handling to block accessor * add multivalue test, fix multivalue case, add comments * add documentation, deactivate special case * cargo fmt * resolve merge conflict	2023-08-14 14:22:18 +02:00
Adam Reichold	42acd334f4	Fixes the new deny-by-default incorrect_partial_ord_impl_on_ord_type Clippy lint (#2131 )	2023-07-21 11:36:17 +09:00
dependabot[bot]	7575f9bf1c	Update itertools requirement from 0.10.3 to 0.11.0 (#2098 ) Updates the requirements on [itertools](https://github.com/rust-itertools/itertools) to permit the latest version. - [Changelog](https://github.com/rust-itertools/itertools/blob/master/CHANGELOG.md) - [Commits](https://github.com/rust-itertools/itertools/compare/v0.10.5...v0.11.0) --- updated-dependencies: - dependency-name: itertools dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2023-07-07 11:14:46 +02:00
Paul Masurel	910b0b0c61	Cargo fmt	2023-07-03 22:03:31 +09:00
PSeitz	17186ca9c9	improve docs (#2105 )	2023-06-27 13:37:14 +08:00
Adam Reichold	ebc78127f3	Add BytesFilterCollector to support filtering based on a bytes fast field (#2075 ) * Do some Clippy- and Cargo-related boy-scouting. * Add BytesFilterCollector to support filtering based on a bytes fast field This is basically a copy of the existing FilterCollector but modified and specialised to work on a bytes fast field. * Changed semantics of filter collectors to consider multi-valued fields	2023-06-13 14:19:58 +09:00
PSeitz	e3eacb4388	release tantivy (#2083 ) * prerelease * chore: Release	2023-06-09 10:47:46 +02:00
PSeitz	6239697a02	switch to ms in histogram for date type (#2045 ) * switch to ms in histogram for date type switch to ms in histogram, by adding a normalization step that converts to nanoseconds precision when creating the collector. closes #2028 related to #2026 * add missing unit long variants * use single thread to avoid handling test case * fix docs * revert CI * cleanup * improve docs * Update src/aggregation/bucket/histogram/histogram.rs Co-authored-by: Paul Masurel <paul@quickwit.io> --------- Co-authored-by: Paul Masurel <paul@quickwit.io>	2023-05-19 08:15:44 +02:00
Yuri Astrakhan	74275b76a6	Inline format arguments where makes sense (#2038 ) Applied this command to the code, making it a bit shorter and slightly more readable. ``` cargo +nightly clippy --all-features --benches --tests --workspace --fix -- -A clippy::all -W clippy::uninlined_format_args cargo +nightly fmt --all ```	2023-05-10 18:03:59 +09:00
tottoto	73452284ae	Remove unused crates from dependencies (#2018 ) * Remove unused crates from dependencies * Revert rand to columnar * Revert criterion to stacker	2023-05-02 12:34:20 +02:00
PSeitz	ba309e18a1	switch to nanosecond precision (#2016 )	2023-05-01 03:32:20 +02:00
trinity-1686a	780e26331d	sstable compression (#1946 ) * compress sstable with zstd * add some details to sstable readme * compress only block which benefit from it * multiple changes to sstable make compression optional use OwnedBytes instead of impl Read in sstable, required for next point use zstd bulk api, which is much faster on small records * cleanup and use bulk api for compression * use dedicated byte for compression * switch block len and compression flag * change default zstd level in sstable	2023-04-14 16:25:50 +02:00
trinity-1686a	205e8a0a92	encode dictionary type in fst footer (#1968 ) * encode additional footer for dictionary kind in fst	2023-04-12 09:43:01 +02:00
Paul Masurel	5eb12173d6	Proptest merge columnar (#1976 ) * Added proptest on columnar merge with a shuffle Made column serialization more explicit. Bugfix when a bytes column is missing, and with a shuffle. Improved the cardinality detection logic / column detection. * Code review * CR comments * Following CR	2023-04-04 11:28:42 +09:00
PSeitz	571735c5f7	Fix index sort by on optional/multicolumn (#1972 ) Fix index sort by on optional/multicolumn add optional columns to proptest extend proptests for sort add columnar sort tests	2023-03-31 04:24:11 +02:00
Paul Masurel	694a056255	Faster range (#1954 ) * Faster range queries This PR does several changes - ip compact space now uses u32 - the bitunpacker now gets a get_batch function - we push down range filtering, removing GCD / shift in the bitpacking codec. - we rely on AVX2 routine to do the filtering. * Apply suggestions from code review * Apply suggestions from code review * CR comments	2023-03-27 14:56:32 +09:00
Paul Masurel	2955e34452	Added proptests for building/merging columnar. (#1963 )	2023-03-27 14:56:02 +09:00
Paul Masurel	821208480b	Adding Debug/Display impl. Refining the ColumnIndex::get_cardinality	2023-03-26 14:40:37 +09:00
Paul Masurel	a2e3c2ed5b	Renaming Column::idx -> Column::index (#1961 ) There was some variable name ghosting happening.	2023-03-26 13:58:50 +09:00
PSeitz	835f228bfa	fix cardinality when merging empty columns (#1960 ) fixes #1958	2023-03-25 15:58:15 +09:00
Paul Masurel	2b6a4da640	Exposing empty column builder. (#1959 )	2023-03-24 16:34:41 +09:00
PSeitz	da2804644f	fetch blocks of vals in aggregation for all cardinality (#1950 ) * fetch blocks of vals in aggregation for all cardinality * move caching in common accessor	2023-03-23 08:41:11 +01:00
PSeitz	5504cfd012	remove IterColumn (#1955 ) fixes #1658	2023-03-23 06:43:17 +01:00
trinity-1686a	482b4155e8	fix bug with new sstable index format (#1953 )	2023-03-22 10:22:36 +01:00
trinity-1686a	e5e50603a8	new sstable format (#1943 ) * document a new sstable format * add support for changing target block size * use new format for sstable index * handle sstable version error * use very small blocks for proptests * add a footer structure	2023-03-21 15:03:52 +01:00


95 Commits