tantivy

mirror of https://github.com/quickwit-oss/tantivy.git synced 2026-05-26 05:00:41 +00:00

Author	SHA1	Message	Date
Tushar	0e04ec3136	feat(aggregators/metric): Add a top_hits aggregator (#2198 ) * feat(aggregators/metric): Implement a top_hits aggregator * fix: Expose get_fields * fix: Serializer for top_hits request Also removes extraneous the extraneous third-party serialization helper. * chore: Avert panick on parsing invalid top_hits query * refactor: Allow multiple field names from aggregations * perf: Replace binary heap with TopNComputer * fix: Avoid comparator inversion by ComparableDoc * fix: Rank missing field values lower than present values * refactor: Make KeyOrder a struct * feat: Rough attempt at docvalue_fields * feat: Complete stab at docvalue_fields - Rename "SearchResult" => "Retrieval" - Revert Vec => HashMap for aggregation accessors. - Split accessors for core aggregation and field retrieval. - Resolve globbed field names in docvalue_fields retrieval. - Handle strings/bytes and other column types with DynamicColumn * test(unit): Add tests for top_hits aggregator * fix: docfield_value field globbing * test(unit): Include dynamic fields * fix: Value -> OwnedValue * fix: Use OwnedValue's native Null variant * chore: Improve readability of test asserts * chore: Remove DocAddress from top_hits result * docs: Update aggregator doc * revert: accidental doc test * chore: enable time macros only for tests * chore: Apply suggestions from review * chore: Apply suggestions from review * fix: Retrieve all values for fields * test(unit): Update for multi-value retrieval * chore: Assert term existence * feat: Include all columns for a column name Since a (name, type) constitutes a unique column. * fix: Resolve json fields Introduces a translation step to bridge the difference between ColumnarReaders null `\0` separated json field keys to the common `.` separated used by SegmentReader. Although, this should probably be the default behavior for ColumnarReader's public API perhaps. * chore: Address review on mutability * chore: s/segment_id/segment_ordinal instances of SegmentOrdinal * chore: Revert erroneous grammar change	2024-01-26 16:46:41 +01:00
PSeitz	1dacdb6c85	add histogram agg test on empty index (#2306 )	2024-01-23 16:27:34 +01:00
PSeitz	054f49dc31	support escaped dot, add agg test (#2250 ) add agg test for nested JSON allow escaping of dot	2023-11-20 03:00:57 +01:00
Paul Masurel	7bc5bf78e2	Fixing functional tests. (#2239 )	2023-11-05 18:18:39 +09:00
PSeitz	4feeb2323d	fix clippy (#2223 )	2023-10-24 10:05:22 +02:00
Harrison Burt	1c7c6fd591	POC: Tantivy documents as a trait (#2071 ) * fix windows build (#1) * Fix windows build * Add doc traits * Add field value iter * Add value and serialization * Adjust order * Fix bug * Correct type * Fix generic bugs * Reformat code * Add generic to index writer which I forgot about * Fix missing generics on single segment writer * Add missing type export * Add default methods for convenience * Cleanup * Fix more-like-this query to use standard types * Update API and fix tests * Add doc traits * Add field value iter * Add value and serialization * Adjust order * Fix bug * Correct type * Rebase main and fix conflicts * Reformat code * Merge upstream * Fix missing generics on single segment writer * Add missing type export * Add default methods for convenience * Cleanup * Fix more-like-this query to use standard types * Update API and fix tests * Add tokenizer improvements from previous commits * Add tokenizer improvements from previous commits * Reformat * Fix unit tests * Fix unit tests * Use enum in changes * Stage changes * Add new deserializer logic * Add serializer integration * Add document deserializer * Implement new (de)serialization api for existing types * Fix bugs and type errors * Add helper implementations * Fix errors * Reformat code * Add unit tests and some code organisation for serialization * Add unit tests to deserializer * Add some small docs * Add support for deserializing serde values * Reformat * Fix typo * Fix typo * Change repr of facet * Remove unused trait methods * Add child value type * Resolve comments * Fix build * Fix more build errors * Fix more build errors * Fix the tests I missed * Fix examples * fix numerical order, serialize PreTok Str * fix coverage * rename Document to TantivyDocument, rename DocumentAccess to Document add Binary prefix to binary de/serialization * fix coverage --------- Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>	2023-10-02 10:01:16 +02:00
PSeitz	34920d31f5	Fix DateHistogram bucket gap (#2183 ) * Fix DateHistogram bucket gap Fixes a computation issue of the number of buckets needed in the DateHistogram. This is due to a missing normalization from request values (ms) to fast field values (ns), when converting an intermediate result to the final result. This results in a wrong computation by a factor 1_000_000. The Histogram normalizes values to nanoseconds, to make the user input like extended_bounds (ms precision) and the values from the fast field (ns precision for date type) compatible. This normalization happens only for date type fields, as other field types don't have precision settings. The normalization does not happen due a missing `column_type`, which is not correctly passed after merging an empty aggregation (which does not have a `column_type` set), with a regular aggregation. Another related issue is an empty aggregation, which will not have `column_type` set, will not convert the result to human readable format. This PR fixes the issue by: - Limit the allowed field types of DateHistogram to DateType - Instead of passing the column_type, which is only available on the segment level, we flag the aggregation as `is_date_agg`. - Fix the merge logic Add a flag to to normalization only once. This is not an issue currently, but it could become easily one. closes https://github.com/quickwit-oss/quickwit/issues/3837 * use older nightly for time crate (breaks build)	2023-09-21 10:41:35 +02:00
PSeitz	e125f3b041	fix test (#2178 )	2023-09-19 08:21:50 +02:00
PSeitz	c520ac46fc	add support for date in term agg (#2172 ) support DateTime in TermsAggregation Format dates with Rfc3339	2023-09-14 09:22:18 +02:00
PSeitz	b1d8b072db	add missing aggregation part 2 (#2149 ) * add missing aggregation part 2 Add missing support for: - Mixed types columns - Key of type string on numerical fields The special aggregation is slower than the integrated one in TermsAggregation and therefore not chosen by default, although it can cover all use cases. * simplify, add num_docs to empty	2023-08-31 07:55:33 +02:00
PSeitz	c4e2708901	fix clippy, fmt (#2162 )	2023-08-30 08:04:26 +02:00
PSeitz	5c8cfa50eb	add missing parameter for percentiles (#2157 )	2023-08-29 13:04:24 +02:00
PSeitz	73cb71762f	add missing parameter for stats,min,max,count,sum,avg (#2151 ) * add missing parameter for stats,min,max,count,sum,avg add missing parameter for stats,min,max,count,sum,avg closes #1913 partially #1789 * Apply suggestions from code review Co-authored-by: Paul Masurel <paul@quickwit.io> --------- Co-authored-by: Paul Masurel <paul@quickwit.io>	2023-08-28 08:59:51 +02:00
PSeitz	48d4847b38	Improve aggregation error message (#2150 ) * Improve aggregation error message Improve aggregation error message by wrapping the deserialization with a custom struct. This deserialization variant is slower, since we need to keep the deserialized data around twice with this approach. For now the valid variants list is manually updated. This could be replaced with a proc macro. closes #2143 * Simpler implementation --------- Co-authored-by: Paul Masurel <paul@quickwit.io>	2023-08-23 20:52:15 +02:00
PSeitz	480763db0d	track memory arena memory usage (#2148 )	2023-08-16 18:19:42 +02:00
Caleb Hattingh	52d9e6f298	Fix doc typos in count aggregation metric (#2127 )	2023-08-15 08:50:23 +02:00
PSeitz	2e109018b7	add missing parameter to term agg (#2103 ) * add missing parameter to term agg * move missing handling to block accessor * add multivalue test, fix multivalue case, add comments * add documentation, deactivate special case * cargo fmt * resolve merge conflict	2023-08-14 14:22:18 +02:00
PSeitz	c2be6603a2	alternative mixed field aggregation collection (#2135 ) * alternative mixed field aggregation collection instead of having multiple accessor in one AggregationWithAccessor split it into multiple independent AggregationWithAccessor * Update src/aggregation/agg_req_with_accessor.rs Co-authored-by: Paul Masurel <paul@quickwit.io> --------- Co-authored-by: Paul Masurel <paul@quickwit.io>	2023-07-27 12:25:31 +02:00
Adam Reichold	c805f08ca7	Fix a few more upcoming Clippy lints (#2133 )	2023-07-24 17:07:57 +09:00
PSeitz	17186ca9c9	improve docs (#2105 )	2023-06-27 13:37:14 +08:00
PSeitz	657f0cd3bd	add missing Bytes validation to term_agg (#2077 ) returns empty for now instead of failing like before	2023-06-12 16:38:07 +08:00
PSeitz	3546e7fc63	small agg limit docs improvement (#2073 ) small docs improvement as follow up on bug https://github.com/quickwit-oss/quickwit/issues/3503	2023-06-12 10:55:24 +09:00
PSeitz	ccb09aaa83	allow histogram bounds to be passed as Rfc3339 (#2076 )	2023-06-08 09:07:08 +02:00
PSeitz	3af456972e	Fix min doc_count empty merge bug (#2057 ) This fixes an issue when min_doc==0 loads terms from the dictionary from one segment and merges the same term with a subaggregation from another segment. Previously the empty structure was not correctly initialized to contain the subaggregation so the merge was incorrect.	2023-05-29 14:20:50 +08:00
PSeitz	6239697a02	switch to ms in histogram for date type (#2045 ) * switch to ms in histogram for date type switch to ms in histogram, by adding a normalization step that converts to nanoseconds precision when creating the collector. closes #2028 related to #2026 * add missing unit long variants * use single thread to avoid handling test case * fix docs * revert CI * cleanup * improve docs * Update src/aggregation/bucket/histogram/histogram.rs Co-authored-by: Paul Masurel <paul@quickwit.io> --------- Co-authored-by: Paul Masurel <paul@quickwit.io>	2023-05-19 08:15:44 +02:00
PSeitz	2dfe37940d	handle multiple types in term aggregation (#2041 )	2023-05-15 11:57:38 +02:00
PSeitz	ba3a885a3b	handle multiple agg results (#2035 ) handle multiple intermediate aggregation results with the same name.	2023-05-10 15:00:38 +02:00
Yuri Astrakhan	74275b76a6	Inline format arguments where makes sense (#2038 ) Applied this command to the code, making it a bit shorter and slightly more readable. ``` cargo +nightly clippy --all-features --benches --tests --workspace --fix -- -A clippy::all -W clippy::uninlined_format_args cargo +nightly fmt --all ```	2023-05-10 18:03:59 +09:00
PSeitz	45ff0e3c5c	clear memory consumption in AggregationLimits (#2022 ) * clear memory consumption in AggregationLimits clear memory consumption in AggregationLimits at the end of segment collection * switch to ResourceLimitGuard * unduplicate code * merge methods * Apply suggestions from code review Co-authored-by: Paul Masurel <paul@quickwit.io> --------- Co-authored-by: Paul Masurel <paul@quickwit.io>	2023-05-08 10:15:09 +02:00
François Massot	992f755298	Fix clippy.	2023-05-05 10:51:29 +02:00
François Massot	c8df843f96	Fix date histogram bounds and field name.	2023-05-05 00:52:55 +02:00
PSeitz	ba309e18a1	switch to nanosecond precision (#2016 )	2023-05-01 03:32:20 +02:00
PSeitz	cbf2bdc75b	change bucket count type (#2013 ) * change bucket count type closes #2012 * Update src/aggregation/agg_limits.rs Co-authored-by: Paul Masurel <paul@quickwit.io> * Update src/directory/managed_directory.rs Co-authored-by: Paul Masurel <paul@quickwit.io> * fix test --------- Co-authored-by: Paul Masurel <paul@quickwit.io>	2023-04-27 15:47:31 +08:00
PSeitz	1f06997d04	fix single collector special case (#2014 )	2023-04-27 09:30:19 +02:00
PSeitz	c599bf3b6c	chore!:drop JSON support on intermediate agg result (#1992 ) * chore!:drop JSON support on intermediate agg result add support for other formats by removing skip_serialize and untagged JSON support is broken anyway due it's lack on f64::INF etc. handling * Update src/aggregation/intermediate_agg_result.rs Co-authored-by: Paul Masurel <paul@quickwit.io> * move from impl --------- Co-authored-by: Paul Masurel <paul@quickwit.io>	2023-04-26 13:05:16 +02:00
PSeitz	2e369db936	switch to Aggregation without serde_untagged (#2003 ) * refactor result handling * remove Internal stuff * merge different accessors * switch to Aggregation without serde_untagged * fix doctests	2023-04-25 08:54:51 +02:00
PSeitz	e522163a1c	use json in agg tests (#1998 ) * switch to JSON in tests, add flat aggregation types * use method * clippy * remove commented file	2023-04-17 14:08:48 +02:00
PSeitz	0ed13eeea8	add sparse to agg benchmark (#1986 ) * add sparse to agg benchmark * Update src/aggregation/agg_bench.rs Co-authored-by: Paul Masurel <paul@quickwit.io> --------- Co-authored-by: Paul Masurel <paul@quickwit.io>	2023-04-11 08:13:32 +02:00
PSeitz	41af70799d	add percentiles aggregations (#1984 ) * add percentiles aggregations add percentiles aggregation fix disabled agg benchmark * Update src/aggregation/metric/percentiles.rs Co-authored-by: Paul Masurel <paul@quickwit.io> * Apply suggestions from code review Co-authored-by: Paul Masurel <paul@quickwit.io> * fix import * fix import --------- Co-authored-by: Paul Masurel <paul@quickwit.io>	2023-04-07 07:18:28 +02:00
PSeitz	5c4ea6a708	tokenizer option on text fastfield (#1945 ) * tokenizer option on text fastfield allow to set tokenizer option on text fastfield (fixes #1901) handle PreTokenized strings in fast field * change visibility * remove custom de/serialization	2023-03-31 10:03:38 +02:00
PSeitz	5c380b76e7	Better mixed types support in aggs and fix serialization issue (#1971 ) * Better mixed types support in aggs and fix serialization issue - Improve support for mixed types in JSON field aggregations (pick the right field, #1913) - Resolve the issue with JSON serialization for numeric keys (fixes #1967) - Add JSON round-trip test for term buckets - Remove `u64_lenient`, as this is a footgun without the type - move aggregation benchmarks * remove shadowing	2023-03-31 05:52:11 +02:00
Paul Masurel	2b6a4da640	Exposing empty column builder. (#1959 )	2023-03-24 16:34:41 +09:00
PSeitz	d6a95381ee	add memory check for term agg (#1957 )	2023-03-24 06:47:45 +01:00
PSeitz	da2804644f	fetch blocks of vals in aggregation for all cardinality (#1950 ) * fetch blocks of vals in aggregation for all cardinality * move caching in common accessor	2023-03-23 08:41:11 +01:00
PSeitz	8f7f1d6be4	add Display for ByteCount (#1949 ) * add Display for ByteCount * export missing AggregationLimits	2023-03-21 08:02:35 +01:00
PSeitz	6a7a1106d6	work in batches of docs (#1937 ) * work in batches of docs * add fill_buffer test	2023-03-21 06:57:44 +01:00
PSeitz	9e2faecf5b	add memory limit for aggregations (#1942 ) * add memory limit for aggregations introduce AggregationLimits to set memory consumption limit and bucket limits memory limit is checked during aggregation, bucket limit is checked before returning the aggregation request. * Apply suggestions from code review Co-authored-by: Paul Masurel <paul@quickwit.io> * add ByteCount with human readable format --------- Co-authored-by: Paul Masurel <paul@quickwit.io>	2023-03-16 06:21:07 +01:00
PSeitz	b6703f1b3c	fix validation in date histogram (#1936 ) fix validation in date histogram for parameters interval and date_interval	2023-03-15 06:10:43 +01:00
PSeitz	2fb3740cb0	handle missing column for aggs (#1920 ) * handle missing column for aggs add empty column fallback for missing column in aggs. Fix sort for term agg on sub-agg with missing value (null is smallest) * add error when field is not fast	2023-03-15 06:09:59 +01:00
PSeitz	8459efa32c	split term collection count and sub_agg (#1921 ) use unrolled ColumnValues::get_vals	2023-03-13 04:37:41 +01:00


177 Commits