* Fix serde for TopNComputer
The top hits aggregation changed the TopNComputer to be serializable,
but the capacity needs to be carried over as well: it is checked when
pushing elements (capacity == 0 is not allowed).
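A minimal sketch of that invariant surviving a serde round-trip (field and type shapes here are assumptions, not the actual TopNComputer definition):

```rust
use serde::{Deserialize, Serialize};

/// Sketch only: the point is that `capacity` participates in the push
/// logic, so it must be (de)serialized along with the buffer.
#[derive(Serialize, Deserialize)]
struct TopNComputer<T: Ord> {
    buffer: Vec<T>,
    capacity: usize, // invariant: capacity > 0, checked on push
}

impl<T: Ord> TopNComputer<T> {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity == 0 is not allowed");
        TopNComputer { buffer: Vec::new(), capacity }
    }

    fn push(&mut self, el: T) {
        self.buffer.push(el);
        // Truncate lazily once the buffer outgrows twice the capacity.
        if self.buffer.len() >= self.capacity * 2 {
            self.buffer.sort_unstable_by(|a, b| b.cmp(a));
            self.buffer.truncate(self.capacity);
        }
    }
}
```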
* use serde from deser
* remove pub, clippy
* feat(aggregators/metric): Implement a top_hits aggregator
* fix: Expose get_fields
* fix: Serializer for top_hits request
Also removes the extraneous third-party serialization helper.
* chore: Avert panic on parsing invalid top_hits query
* refactor: Allow multiple field names from aggregations
* perf: Replace binary heap with TopNComputer
* fix: Avoid comparator inversion by ComparableDoc
* fix: Rank missing field values lower than present values
* refactor: Make KeyOrder a struct
* feat: Rough attempt at docvalue_fields
* feat: Complete stab at docvalue_fields
- Rename "SearchResult*" => "Retrieval*"
- Revert Vec => HashMap for aggregation accessors.
- Split accessors for core aggregation and field retrieval.
- Resolve globbed field names in docvalue_fields retrieval.
- Handle strings/bytes and other column types with DynamicColumn
* test(unit): Add tests for top_hits aggregator
* fix: docfield_value field globbing
* test(unit): Include dynamic fields
* fix: Value -> OwnedValue
* fix: Use OwnedValue's native Null variant
* chore: Improve readability of test asserts
* chore: Remove DocAddress from top_hits result
* docs: Update aggregator doc
* revert: accidental doc test
* chore: enable time macros only for tests
* chore: Apply suggestions from review
* chore: Apply suggestions from review
* fix: Retrieve all values for fields
* test(unit): Update for multi-value retrieval
* chore: Assert term existence
* feat: Include all columns for a column name
Since a (name, type) constitutes a unique column.
* fix: Resolve json fields
Introduces a translation step to bridge the difference between
ColumnarReader's `\0`-separated JSON field keys and the common
`.`-separated keys used by SegmentReader. Arguably, this should be the
default behavior of ColumnarReader's public API.
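A sketch of the translation step (the helper name is hypothetical):

```rust
/// Hypothetical helper: normalize a columnar JSON field key
/// ("attributes\0color") to the dot-separated form ("attributes.color")
/// that SegmentReader uses.
fn normalize_json_field_key(columnar_key: &str) -> String {
    columnar_key.replace('\0', ".")
}

fn main() {
    assert_eq!(normalize_json_field_key("attributes\0color"), "attributes.color");
}
```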
* chore: Address review on mutability
* chore: s/segment_id/segment_ordinal instances of SegmentOrdinal
* chore: Revert erroneous grammar change
Root cause: the positions buffer retained residual positions from the
previous term when terms alternated between having and not having
positions in JSON (text terms have positions, numeric terms don't).
Fixes #2283
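Schematically, the fix amounts to resetting the scratch buffer for every term instead of relying on the decoder to overwrite it (names are illustrative, not the actual code):

```rust
/// `positions` is a reusable scratch buffer shared across terms.
/// If a term without positions (e.g. a JSON numeric term) follows a
/// term with positions, stale entries must not leak into it.
fn load_positions_for_term(positions: &mut Vec<u32>, term_has_positions: bool) {
    positions.clear(); // drop residue from the previous term
    if term_has_positions {
        // decode this term's positions into `positions`...
    }
}
```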
* read path for new fst based index
* implement BlockAddrStoreWriter
* extract slop/derivation computation
* use better linear approximator and allow negative correction to approximator
* document format and reorder some fields
* optimize single block sstable size
* plug backward compat
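The approximator commits above boil down to predicting each block's byte offset as a linear function of its ordinal and storing only small (possibly negative) corrections; a toy sketch of the idea, not the actual on-disk format:

```rust
/// Toy model: offset(i) ≈ first + slope * i, plus a per-block
/// correction that may be negative.
struct LinearApprox {
    first: u64,
    slope: u64,
}

fn build(offsets: &[u64]) -> (LinearApprox, Vec<i64>) {
    let first = offsets[0];
    let n = offsets.len() as u64;
    let slope = if n > 1 { (offsets[offsets.len() - 1] - first) / (n - 1) } else { 0 };
    let corrections = offsets
        .iter()
        .enumerate()
        .map(|(i, &off)| off as i64 - (first + slope * i as u64) as i64)
        .collect();
    (LinearApprox { first, slope }, corrections)
}
```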
* add fields_metadata to SegmentReader, add columnar docs
* use schema to resolve field, add test
* normalize paths
* merge for FieldsMetadata, add fields_metadata on Index
* Update src/core/segment_reader.rs
Co-authored-by: Paul Masurel <paul@quickwit.io>
* merge code paths
* add Hash
* move function outside
---------
Co-authored-by: Paul Masurel <paul@quickwit.io>
* reduce number of allocations
Explanation makes up around 50% of all allocations (allocation counts,
not a perf measurement). It's created during serialization but never used.
- Make Explanation optional in BM25
- Avoid allocations when using Explanation
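The gist of making it optional, as a sketch (the real BM25 types differ):

```rust
/// Sketch: keep the hot scoring path allocation-free and build an
/// Explanation tree only when explicitly requested.
struct Explanation {
    value: f32,
    description: String,
    details: Vec<Explanation>,
}

struct Bm25Scorer;

impl Bm25Scorer {
    fn score(&self, tf: u32) -> f32 {
        // Hot path: no Explanation allocated here.
        tf as f32 // placeholder for the actual BM25 formula
    }

    fn explain(&self, tf: u32) -> Explanation {
        // Cold path: allocate only when the caller asks for it.
        Explanation {
            value: self.score(tf),
            description: "bm25(tf), placeholder".to_string(),
            details: Vec::new(),
        }
    }
}
```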
* use Cow
In JSON object fields, the presence of term frequencies depends on the
field: typically, a string indexed with positions will have them, while
numbers won't.
The presence or absence of term freqs for a given term is unfortunately
encoded in a very passive way: it is given by the presence of extra
information in the skip info, or by the lack of term freqs after
decoding vint blocks.
Before, after writing a segment, we would encode the segment correctly
(without any term freq for numbers in a JSON object field).
However, during merge, we would get the default term freq=1 value
(the default in the absence of encoded term freqs).
The merger would then proceed and attempt to decode 1 position when
there are in fact none.
This PR requires explicitly telling the posting serializer whether term
frequencies should be serialized for each new term.
Closes #2251
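In sketch form, the serializer change looks like this (hypothetical signature; the actual postings serializer API differs):

```rust
struct PostingsSerializer;

impl PostingsSerializer {
    /// The caller must now state explicitly, per term, whether term
    /// frequencies are recorded. `record_term_freq` is false for e.g.
    /// numeric terms inside a JSON object field, so a merge never
    /// tries to decode positions that were never written.
    fn new_term(&mut self, term: &[u8], record_term_freq: bool) {
        let _ = (term, record_term_freq);
        // write skip info / postings accordingly...
    }
}
```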
* docid deltas while indexing
Storing deltas is especially helpful for repetitive data like logs.
In those cases, recording a doc for a term used to cost 4 bytes; now it
costs 1 byte.
HDFS Indexing 1.1GB Total memory consumption:
Before: 760 MB
Now: 590 MB
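Conceptually, the indexer records the gap to the previous docid for each term instead of the absolute docid, so dense terms compress to single-byte vints; a minimal sketch:

```rust
/// Sketch: delta-encode a term's docids before vint compression.
/// Consecutive docids in repetitive data (e.g. logs) are close
/// together, so most deltas fit in one vint byte instead of four.
fn delta_encode(doc_ids: &[u32]) -> Vec<u32> {
    let mut prev = 0u32;
    doc_ids
        .iter()
        .map(|&doc| {
            let delta = doc - prev;
            prev = doc;
            delta
        })
        .collect()
}

fn main() {
    assert_eq!(delta_encode(&[5, 6, 9]), vec![5, 1, 3]);
}
```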
* use scan for delta decoding
* add support for delta-1 encoding posting list
* encode term frequency minus one
* don't emit tf for json integer terms
* make skipreader not pub(crate) mutable
* remove Document: DocumentDeserialize dependency
The dependency requires users to implement an API they may not use.
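Schematically (trait bodies elided), the supertrait bound is dropped so deserialization becomes opt-in:

```rust
// Before: implementing Document forced a DocumentDeserialize impl,
// even for users who never deserialize stored documents:
// trait Document: DocumentDeserialize { ... }

// After (sketch): independent traits; implement only what you use.
trait Document {
    // document access API...
}

trait DocumentDeserialize: Sized {
    // opt-in deserialization API...
}
```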
* remove unnecessary Document bounds
* fix windows build (#1)
* Fix windows build
* Add doc traits
* Add field value iter
* Add value and serialization
* Adjust order
* Fix bug
* Correct type
* Fix generic bugs
* Reformat code
* Add generic to index writer which I forgot about
* Fix missing generics on single segment writer
* Add missing type export
* Add default methods for convenience
* Cleanup
* Fix more-like-this query to use standard types
* Update API and fix tests
* Add doc traits
* Add field value iter
* Add value and serialization
* Adjust order
* Fix bug
* Correct type
* Rebase main and fix conflicts
* Reformat code
* Merge upstream
* Fix missing generics on single segment writer
* Add missing type export
* Add default methods for convenience
* Cleanup
* Fix more-like-this query to use standard types
* Update API and fix tests
* Add tokenizer improvements from previous commits
* Add tokenizer improvements from previous commits
* Reformat
* Fix unit tests
* Fix unit tests
* Use enum in changes
* Stage changes
* Add new deserializer logic
* Add serializer integration
* Add document deserializer
* Implement new (de)serialization api for existing types
* Fix bugs and type errors
* Add helper implementations
* Fix errors
* Reformat code
* Add unit tests and some code organisation for serialization
* Add unit tests to deserializer
* Add some small docs
* Add support for deserializing serde values
* Reformat
* Fix typo
* Fix typo
* Change repr of facet
* Remove unused trait methods
* Add child value type
* Resolve comments
* Fix build
* Fix more build errors
* Fix more build errors
* Fix the tests I missed
* Fix examples
* fix numerical order, serialize PreTok Str
* fix coverage
* rename Document to TantivyDocument, rename DocumentAccess to Document
add Binary prefix to binary de/serialization
* fix coverage
---------
Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>
* replace BinaryHeap for TopN
replace BinaryHeap for TopN with a variant that selects the median via
QuickSelect, which runs in O(n) average time.
add merge_fruits fast path
* call truncate unconditionally, extend test
* remove special early exit
* add TODO, fmt
* truncate top n instead median, return vec
* simplify code
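A condensed sketch of the heap-free truncation, using the standard library's average-O(len) selection (the actual TopNComputer is more involved):

```rust
/// Keep the `n` largest elements of `buffer` (unordered) by
/// partitioning around the n-th element instead of maintaining a heap.
fn truncate_top_n<T: Ord>(buffer: &mut Vec<T>, n: usize) {
    if buffer.len() > n {
        // After this call, indices 0..n hold the n largest elements.
        buffer.select_nth_unstable_by(n, |a, b| b.cmp(a));
        buffer.truncate(n);
    }
}
```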