tantivy

mirror of https://github.com/quickwit-oss/tantivy.git synced 2026-01-09 10:32:55 +00:00

Author	SHA1	Message	Date
PSeitz	74940e9345	clippy (#2349 ) * fix clippy * fix clippy * fix duplicate imports	2024-04-09 07:54:44 +02:00
PSeitz	92c32979d2	fix postcard compatibility for top_hits, add postcard test (#2346 ) * fix postcard compatibility for top_hits, add postcard test * fix top_hits naming, delay data fetch closes #2347 * fix import	2024-04-09 06:17:25 +02:00
PSeitz	67ebba3c3c	expose collect_block buffer size (#2326 ) * expose buffer of collect_block * flip shard_size segment_size	2024-03-15 08:02:08 +01:00
PSeitz	7ce950f141	add method to fetch block of first vals in columnar (#2330 ) * add method to fetch block of first vals in columnar add method to fetch block of first vals in columnar (this is way faster than single calls for full columns) add benchmark fix import warnings ``` test bench_get_block_first_on_full_column ... bench: 56 ns/iter (+/- 26) test bench_get_block_first_on_full_column_single_calls ... bench: 311 ns/iter (+/- 6) test bench_get_block_first_on_multi_column ... bench: 378 ns/iter (+/- 15) test bench_get_block_first_on_multi_column_single_calls ... bench: 546 ns/iter (+/- 13) test bench_get_block_first_on_optional_column ... bench: 291 ns/iter (+/- 6) test bench_get_block_first_on_optional_column_single_calls ... bench: 362 ns/iter (+/- 8) ``` * use remainder	2024-03-15 08:01:47 +01:00
PSeitz	b0e65560a1	handle ip addresses in term aggregation (#2319 ) * handle ip addresses in term aggregation Stores IP addresses during the segment term aggregation via their u64 representation and converts them to u128 (IPv6 address) via downcast when converting to intermediate results. Enable downcasting on `ColumnValues`. Expose u64 variant for u128 encoded data via `open_u64_lenient` method. Remove lifetime in VecColumn, to avoid the 'static lifetime requirement coming from the downcast trait. * rename method	2024-03-14 09:41:18 +01:00
PSeitz	ec37295b2f	add fast path for full columns in fetch_block (#2328 ) Spotted in `range_date_histogram` query in quickwit benchmark: 5% of time copying docs around, which is not needed in the full index case remove Column to ColumnIndex deref	2024-03-14 04:07:11 +01:00
PSeitz	7e41d31c6e	agg: support to deserialize f64 from string (#2311 ) * agg: support to deserialize f64 from string * remove visit_string * disallow NaN	2024-03-05 05:49:41 +01:00
Paul Masurel	6739357314	Removing split_size and adding split_size and shard_size as segment_size aliases (#2320 )	2024-02-26 11:35:22 +01:00
PSeitz	d57622d54b	support bool type in term aggregation (#2318 ) * support bool type in term aggregation * add Bool to Intermediate Key	2024-02-20 03:22:22 +01:00
PSeitz	f745dbc054	fix Clone for TopNComputer, add top_hits bench (#2315 ) * fix Clone for TopNComputer, add top_hits bench add top_hits agg bench test aggregation::agg_bench::bench::bench_aggregation_terms_many_with_sub_agg ... bench: 123,475,175 ns/iter (+/- 30,608,889) test aggregation::agg_bench::bench::bench_aggregation_terms_many_with_sub_agg_multi ... bench: 194,170,414 ns/iter (+/- 36,495,516) test aggregation::agg_bench::bench::bench_aggregation_terms_many_with_sub_agg_opt ... bench: 179,742,809 ns/iter (+/- 29,976,507) test aggregation::agg_bench::bench::bench_aggregation_terms_many_with_sub_agg_sparse ... bench: 27,592,534 ns/iter (+/- 2,672,370) test aggregation::agg_bench::bench::bench_aggregation_terms_many_with_top_hits_agg ... bench: 552,851,227 ns/iter (+/- 71,975,886) test aggregation::agg_bench::bench::bench_aggregation_terms_many_with_top_hits_agg_multi ... bench: 558,616,384 ns/iter (+/- 100,890,124) test aggregation::agg_bench::bench::bench_aggregation_terms_many_with_top_hits_agg_opt ... bench: 554,031,368 ns/iter (+/- 165,452,650) test aggregation::agg_bench::bench::bench_aggregation_terms_many_with_top_hits_agg_sparse ... bench: 46,435,919 ns/iter (+/- 13,681,935) * add comment	2024-02-20 03:22:00 +01:00
PSeitz	0e16ed9ef7	Fix serde for TopNComputer (#2313 ) * Fix serde for TopNComputer The top hits aggregation changed the TopNComputer to be serializable, but capacity needs to be carried over, as it contains logic which is checked against when pushing elements (capacity == 0 is not allowed). * use serde from deser * remove pub, clippy	2024-02-07 12:52:06 +01:00
Tushar	0e04ec3136	feat(aggregators/metric): Add a top_hits aggregator (#2198 ) * feat(aggregators/metric): Implement a top_hits aggregator * fix: Expose get_fields * fix: Serializer for top_hits request Also removes extraneous the extraneous third-party serialization helper. * chore: Avert panick on parsing invalid top_hits query * refactor: Allow multiple field names from aggregations * perf: Replace binary heap with TopNComputer * fix: Avoid comparator inversion by ComparableDoc * fix: Rank missing field values lower than present values * refactor: Make KeyOrder a struct * feat: Rough attempt at docvalue_fields * feat: Complete stab at docvalue_fields - Rename "SearchResult" => "Retrieval" - Revert Vec => HashMap for aggregation accessors. - Split accessors for core aggregation and field retrieval. - Resolve globbed field names in docvalue_fields retrieval. - Handle strings/bytes and other column types with DynamicColumn * test(unit): Add tests for top_hits aggregator * fix: docfield_value field globbing * test(unit): Include dynamic fields * fix: Value -> OwnedValue * fix: Use OwnedValue's native Null variant * chore: Improve readability of test asserts * chore: Remove DocAddress from top_hits result * docs: Update aggregator doc * revert: accidental doc test * chore: enable time macros only for tests * chore: Apply suggestions from review * chore: Apply suggestions from review * fix: Retrieve all values for fields * test(unit): Update for multi-value retrieval * chore: Assert term existence * feat: Include all columns for a column name Since a (name, type) constitutes a unique column. * fix: Resolve json fields Introduces a translation step to bridge the difference between ColumnarReaders null `\0` separated json field keys to the common `.` separated used by SegmentReader. Although, this should probably be the default behavior for ColumnarReader's public API perhaps. * chore: Address review on mutability * chore: s/segment_id/segment_ordinal instances of SegmentOrdinal * chore: Revert erroneous grammar change	2024-01-26 16:46:41 +01:00
PSeitz	1dacdb6c85	add histogram agg test on empty index (#2306 )	2024-01-23 16:27:34 +01:00
PSeitz	054f49dc31	support escaped dot, add agg test (#2250 ) add agg test for nested JSON allow escaping of dot	2023-11-20 03:00:57 +01:00
Paul Masurel	7bc5bf78e2	Fixing functional tests. (#2239 )	2023-11-05 18:18:39 +09:00
PSeitz	4feeb2323d	fix clippy (#2223 )	2023-10-24 10:05:22 +02:00
Harrison Burt	1c7c6fd591	POC: Tantivy documents as a trait (#2071 ) * fix windows build (#1) * Fix windows build * Add doc traits * Add field value iter * Add value and serialization * Adjust order * Fix bug * Correct type * Fix generic bugs * Reformat code * Add generic to index writer which I forgot about * Fix missing generics on single segment writer * Add missing type export * Add default methods for convenience * Cleanup * Fix more-like-this query to use standard types * Update API and fix tests * Add doc traits * Add field value iter * Add value and serialization * Adjust order * Fix bug * Correct type * Rebase main and fix conflicts * Reformat code * Merge upstream * Fix missing generics on single segment writer * Add missing type export * Add default methods for convenience * Cleanup * Fix more-like-this query to use standard types * Update API and fix tests * Add tokenizer improvements from previous commits * Add tokenizer improvements from previous commits * Reformat * Fix unit tests * Fix unit tests * Use enum in changes * Stage changes * Add new deserializer logic * Add serializer integration * Add document deserializer * Implement new (de)serialization api for existing types * Fix bugs and type errors * Add helper implementations * Fix errors * Reformat code * Add unit tests and some code organisation for serialization * Add unit tests to deserializer * Add some small docs * Add support for deserializing serde values * Reformat * Fix typo * Fix typo * Change repr of facet * Remove unused trait methods * Add child value type * Resolve comments * Fix build * Fix more build errors * Fix more build errors * Fix the tests I missed * Fix examples * fix numerical order, serialize PreTok Str * fix coverage * rename Document to TantivyDocument, rename DocumentAccess to Document add Binary prefix to binary de/serialization * fix coverage --------- Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>	2023-10-02 10:01:16 +02:00
PSeitz	34920d31f5	Fix DateHistogram bucket gap (#2183 ) * Fix DateHistogram bucket gap Fixes a computation issue of the number of buckets needed in the DateHistogram. This is due to a missing normalization from request values (ms) to fast field values (ns), when converting an intermediate result to the final result. This results in a wrong computation by a factor of 1_000_000. The Histogram normalizes values to nanoseconds, to make the user input like extended_bounds (ms precision) and the values from the fast field (ns precision for date type) compatible. This normalization happens only for date type fields, as other field types don't have precision settings. The normalization does not happen due to a missing `column_type`, which is not correctly passed after merging an empty aggregation (which does not have a `column_type` set) with a regular aggregation. Another related issue is that an empty aggregation, which will not have `column_type` set, will not convert the result to human readable format. This PR fixes the issue by: - Limit the allowed field types of DateHistogram to DateType - Instead of passing the column_type, which is only available on the segment level, we flag the aggregation as `is_date_agg`. - Fix the merge logic Add a flag to do normalization only once. This is not an issue currently, but it could easily become one. closes https://github.com/quickwit-oss/quickwit/issues/3837 * use older nightly for time crate (breaks build)	2023-09-21 10:41:35 +02:00
PSeitz	e125f3b041	fix test (#2178 )	2023-09-19 08:21:50 +02:00
PSeitz	c520ac46fc	add support for date in term agg (#2172 ) support DateTime in TermsAggregation Format dates with Rfc3339	2023-09-14 09:22:18 +02:00
PSeitz	b1d8b072db	add missing aggregation part 2 (#2149 ) * add missing aggregation part 2 Add missing support for: - Mixed types columns - Key of type string on numerical fields The special aggregation is slower than the integrated one in TermsAggregation and therefore not chosen by default, although it can cover all use cases. * simplify, add num_docs to empty	2023-08-31 07:55:33 +02:00
PSeitz	c4e2708901	fix clippy, fmt (#2162 )	2023-08-30 08:04:26 +02:00
PSeitz	5c8cfa50eb	add missing parameter for percentiles (#2157 )	2023-08-29 13:04:24 +02:00
PSeitz	73cb71762f	add missing parameter for stats,min,max,count,sum,avg (#2151 ) * add missing parameter for stats,min,max,count,sum,avg add missing parameter for stats,min,max,count,sum,avg closes #1913 partially #1789 * Apply suggestions from code review Co-authored-by: Paul Masurel <paul@quickwit.io> --------- Co-authored-by: Paul Masurel <paul@quickwit.io>	2023-08-28 08:59:51 +02:00
PSeitz	48d4847b38	Improve aggregation error message (#2150 ) * Improve aggregation error message Improve aggregation error message by wrapping the deserialization with a custom struct. This deserialization variant is slower, since we need to keep the deserialized data around twice with this approach. For now the valid variants list is manually updated. This could be replaced with a proc macro. closes #2143 * Simpler implementation --------- Co-authored-by: Paul Masurel <paul@quickwit.io>	2023-08-23 20:52:15 +02:00
PSeitz	480763db0d	track memory arena memory usage (#2148 )	2023-08-16 18:19:42 +02:00
Caleb Hattingh	52d9e6f298	Fix doc typos in count aggregation metric (#2127 )	2023-08-15 08:50:23 +02:00
PSeitz	2e109018b7	add missing parameter to term agg (#2103 ) * add missing parameter to term agg * move missing handling to block accessor * add multivalue test, fix multivalue case, add comments * add documentation, deactivate special case * cargo fmt * resolve merge conflict	2023-08-14 14:22:18 +02:00
PSeitz	c2be6603a2	alternative mixed field aggregation collection (#2135 ) * alternative mixed field aggregation collection instead of having multiple accessor in one AggregationWithAccessor split it into multiple independent AggregationWithAccessor * Update src/aggregation/agg_req_with_accessor.rs Co-authored-by: Paul Masurel <paul@quickwit.io> --------- Co-authored-by: Paul Masurel <paul@quickwit.io>	2023-07-27 12:25:31 +02:00
Adam Reichold	c805f08ca7	Fix a few more upcoming Clippy lints (#2133 )	2023-07-24 17:07:57 +09:00
PSeitz	17186ca9c9	improve docs (#2105 )	2023-06-27 13:37:14 +08:00
PSeitz	657f0cd3bd	add missing Bytes validation to term_agg (#2077 ) returns empty for now instead of failing like before	2023-06-12 16:38:07 +08:00
PSeitz	3546e7fc63	small agg limit docs improvement (#2073 ) small docs improvement as follow up on bug https://github.com/quickwit-oss/quickwit/issues/3503	2023-06-12 10:55:24 +09:00
PSeitz	ccb09aaa83	allow histogram bounds to be passed as Rfc3339 (#2076 )	2023-06-08 09:07:08 +02:00
PSeitz	3af456972e	Fix min doc_count empty merge bug (#2057 ) This fixes an issue when min_doc==0 loads terms from the dictionary from one segment and merges the same term with a subaggregation from another segment. Previously the empty structure was not correctly initialized to contain the subaggregation so the merge was incorrect.	2023-05-29 14:20:50 +08:00
PSeitz	6239697a02	switch to ms in histogram for date type (#2045 ) * switch to ms in histogram for date type switch to ms in histogram, by adding a normalization step that converts to nanoseconds precision when creating the collector. closes #2028 related to #2026 * add missing unit long variants * use single thread to avoid handling test case * fix docs * revert CI * cleanup * improve docs * Update src/aggregation/bucket/histogram/histogram.rs Co-authored-by: Paul Masurel <paul@quickwit.io> --------- Co-authored-by: Paul Masurel <paul@quickwit.io>	2023-05-19 08:15:44 +02:00
PSeitz	2dfe37940d	handle multiple types in term aggregation (#2041 )	2023-05-15 11:57:38 +02:00
PSeitz	ba3a885a3b	handle multiple agg results (#2035 ) handle multiple intermediate aggregation results with the same name.	2023-05-10 15:00:38 +02:00
Yuri Astrakhan	74275b76a6	Inline format arguments where makes sense (#2038 ) Applied this command to the code, making it a bit shorter and slightly more readable. ``` cargo +nightly clippy --all-features --benches --tests --workspace --fix -- -A clippy::all -W clippy::uninlined_format_args cargo +nightly fmt --all ```	2023-05-10 18:03:59 +09:00
PSeitz	45ff0e3c5c	clear memory consumption in AggregationLimits (#2022 ) * clear memory consumption in AggregationLimits clear memory consumption in AggregationLimits at the end of segment collection * switch to ResourceLimitGuard * unduplicate code * merge methods * Apply suggestions from code review Co-authored-by: Paul Masurel <paul@quickwit.io> --------- Co-authored-by: Paul Masurel <paul@quickwit.io>	2023-05-08 10:15:09 +02:00
François Massot	992f755298	Fix clippy.	2023-05-05 10:51:29 +02:00
François Massot	c8df843f96	Fix date histogram bounds and field name.	2023-05-05 00:52:55 +02:00
PSeitz	ba309e18a1	switch to nanosecond precision (#2016 )	2023-05-01 03:32:20 +02:00
PSeitz	cbf2bdc75b	change bucket count type (#2013 ) * change bucket count type closes #2012 * Update src/aggregation/agg_limits.rs Co-authored-by: Paul Masurel <paul@quickwit.io> * Update src/directory/managed_directory.rs Co-authored-by: Paul Masurel <paul@quickwit.io> * fix test --------- Co-authored-by: Paul Masurel <paul@quickwit.io>	2023-04-27 15:47:31 +08:00
PSeitz	1f06997d04	fix single collector special case (#2014 )	2023-04-27 09:30:19 +02:00
PSeitz	c599bf3b6c	chore!:drop JSON support on intermediate agg result (#1992 ) * chore!:drop JSON support on intermediate agg result add support for other formats by removing skip_serialize and untagged JSON support is broken anyway due it's lack on f64::INF etc. handling * Update src/aggregation/intermediate_agg_result.rs Co-authored-by: Paul Masurel <paul@quickwit.io> * move from impl --------- Co-authored-by: Paul Masurel <paul@quickwit.io>	2023-04-26 13:05:16 +02:00
PSeitz	2e369db936	switch to Aggregation without serde_untagged (#2003 ) * refactor result handling * remove Internal stuff * merge different accessors * switch to Aggregation without serde_untagged * fix doctests	2023-04-25 08:54:51 +02:00
PSeitz	e522163a1c	use json in agg tests (#1998 ) * switch to JSON in tests, add flat aggregation types * use method * clippy * remove commented file	2023-04-17 14:08:48 +02:00
PSeitz	0ed13eeea8	add sparse to agg benchmark (#1986 ) * add sparse to agg benchmark * Update src/aggregation/agg_bench.rs Co-authored-by: Paul Masurel <paul@quickwit.io> --------- Co-authored-by: Paul Masurel <paul@quickwit.io>	2023-04-11 08:13:32 +02:00
PSeitz	41af70799d	add percentiles aggregations (#1984 ) * add percentiles aggregations add percentiles aggregation fix disabled agg benchmark * Update src/aggregation/metric/percentiles.rs Co-authored-by: Paul Masurel <paul@quickwit.io> * Apply suggestions from code review Co-authored-by: Paul Masurel <paul@quickwit.io> * fix import * fix import --------- Co-authored-by: Paul Masurel <paul@quickwit.io>	2023-04-07 07:18:28 +02:00


188 Commits