tantivy

mirror of https://github.com/quickwit-oss/tantivy.git synced 2026-07-13 12:40:42 +00:00

Author	SHA1	Message	Date
Adam Reichold	72002e8a89	Make test builds Clippy clean. (#2277 )	2024-01-31 02:47:06 +01:00
trinity-1686a	3c9297dd64	report if posting list was actually loaded when warming it up (#2309 )	2024-01-29 15:23:16 +01:00
PSeitz	1a9fc10be9	add fields_metadata to SegmentReader, add columnar docs (#2222 ) * add fields_metadata to SegmentReader, add columnar docs * use schema to resolve field, add test * normalize paths * merge for FieldsMetadata, add fields_metadata on Index * Update src/core/segment_reader.rs Co-authored-by: Paul Masurel <paul@quickwit.io> * merge code paths * add Hash * move function oustide --------- Co-authored-by: Paul Masurel <paul@quickwit.io>	2023-11-22 12:29:53 +01:00
PSeitz	054f49dc31	support escaped dot, add agg test (#2250 ) add agg test for nested JSON allow escaping of dot	2023-11-20 03:00:57 +01:00
Paul Masurel	6b59ec6fd5	Fix bug occuring when merging JSON object indexed with positions. In JSON Object field the presence of term frequencies depend on the field. Typically, a string with postiions indexed will have positions while numbers won't. The presence or absence of term freqs for a given term is unfortunately encoded in a very passive way. It is given by the presence of extra information in the skip info, or the lack of term freqs after decoding vint blocks. Before, after writing a segment, we would encode the segment correctly (without any term freq for number in json object field). However during merge, we would get the default term freq=1 value. (this is default in the absence of encoded term freqs) The merger would then proceed and attempt to decode 1 position when there are in fact none. This PR requires to explictly tell the posting serialize whether term frequencies should be serialized for each new term. Closes #2251	2023-11-14 22:41:48 +09:00
Paul Masurel	7bc5bf78e2	Fixing functional tests. (#2239 )	2023-11-05 18:18:39 +09:00
giovannicuccu	ef603c8c7e	rename ReloadPolicy onCommit to onCommitWithDelay (#2235 ) * rename ReloadPolicy onCommit to onCommitWithDelay * fix format issues --------- Co-authored-by: Giovanni Cuccu <gcuccu@imolainformatica.it>	2023-11-03 12:22:10 +01:00
PSeitz	28dd6b6546	collect json paths in indexing (#2231 ) * collect json paths in indexing * remove unsafe iter_mut_keys	2023-11-01 11:25:17 +01:00
PSeitz	07bf66a197	json path writer (#2224 ) * refactor logic to JsonPathWriter * use in encode_column_name * add inlines * move unsafe block	2023-10-24 09:45:50 +02:00
PSeitz	c2b0469180	improve docs, rework exports (#2220 ) * rework exports move snippet and advice make indexer pub, remove indexer reexports * add deprecation warning * add architecture overview	2023-10-18 09:22:24 +02:00
PSeitz	5e06e504e6	split into ReferenceValueLeaf (#2217 )	2023-10-16 16:31:30 +02:00
PSeitz	182f58cea6	remove Document: DocumentDeserialize dependency (#2211 ) * remove Document: DocumentDeserialize dependency The dependency requires users to implement an API they may not use. * remove unnecessary Document bounds	2023-10-13 07:59:54 +02:00
PSeitz	493f9b2f2a	Read list of JSON fields encoded in dictionary (#2184 ) * Read list of JSON fields encoded in dictionary add method to get list of fields on InvertedIndexReader * add field type	2023-10-09 12:06:22 +02:00
PSeitz	e246e5765d	replace ReferenceValue with Self in Value (#2210 )	2023-10-06 08:22:15 +02:00
PSeitz	03a1f40767	rename DocValue to Value (#2197 ) rename DocValue to Value to avoid confusion with lucene DocValues rename Value to OwnedValue	2023-10-02 17:03:00 +02:00
Harrison Burt	1c7c6fd591	POC: Tantivy documents as a trait (#2071 ) * fix windows build (#1) * Fix windows build * Add doc traits * Add field value iter * Add value and serialization * Adjust order * Fix bug * Correct type * Fix generic bugs * Reformat code * Add generic to index writer which I forgot about * Fix missing generics on single segment writer * Add missing type export * Add default methods for convenience * Cleanup * Fix more-like-this query to use standard types * Update API and fix tests * Add doc traits * Add field value iter * Add value and serialization * Adjust order * Fix bug * Correct type * Rebase main and fix conflicts * Reformat code * Merge upstream * Fix missing generics on single segment writer * Add missing type export * Add default methods for convenience * Cleanup * Fix more-like-this query to use standard types * Update API and fix tests * Add tokenizer improvements from previous commits * Add tokenizer improvements from previous commits * Reformat * Fix unit tests * Fix unit tests * Use enum in changes * Stage changes * Add new deserializer logic * Add serializer integration * Add document deserializer * Implement new (de)serialization api for existing types * Fix bugs and type errors * Add helper implementations * Fix errors * Reformat code * Add unit tests and some code organisation for serialization * Add unit tests to deserializer * Add some small docs * Add support for deserializing serde values * Reformat * Fix typo * Fix typo * Change repr of facet * Remove unused trait methods * Add child value type * Resolve comments * Fix build * Fix more build errors * Fix more build errors * Fix the tests I missed * Fix examples * fix numerical order, serialize PreTok Str * fix coverage * rename Document to TantivyDocument, rename DocumentAccess to Document add Binary prefix to binary de/serialization * fix coverage --------- Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>	2023-10-02 10:01:16 +02:00
PSeitz	2d7390341c	increase min memory to 15MB for indexing (#2176 ) With tantivy 0.20 the minimum memory consumption per SegmentWriter increased to 12MB. 7MB are for the different fast field collectors types (they could be lazily created). Increase the minimum memory from 3MB to 15MB. Change memory variable naming from arena to budget. closes #2156	2023-09-13 07:38:34 +02:00
PSeitz	c4e2708901	fix clippy, fmt (#2162 )	2023-08-30 08:04:26 +02:00
PSeitz	480763db0d	track memory arena memory usage (#2148 )	2023-08-16 18:19:42 +02:00
Adam Reichold	22c35b1e00	Fix explanation of boost queries seeking beyond query result. (#2142 ) * Make current nightly Clippy happy. * Fix explanation of boost queries seeking beyond query result.	2023-08-14 11:59:11 +09:00
Adam Reichold	820f126075	Remove support for Brotli and Snappy compression (#2123 ) LZ4 provides fast and simple compression whereas Zstd is exceptionally flexible so that the additional support for Brotli and Snappy does not really add any distinct functionality on top of those two algorithms. Removing them reduces our maintenance burden and reduces the number of choices users have to make when setting up their project based on Tantivy.	2023-07-14 16:54:59 +09:00
Adam Reichold	7e6c4a1856	Include only built-in compression algorithms as enum variants (#2121 ) * Include only built-in compression algorithms as enum variants This enables compile-time errors when a compression algorithm is requested which is not actually enabled for the current Cargo project. The cost is that indexes using other compression algorithms cannot even be loaded (even though they are not fully accessible in any case). As a drive-by, this also fixes `--no-default-features` on `cfg(unix)`. * Provide more instructive error messages for unsupported, but not unknown compression variants.	2023-07-14 11:02:49 +09:00
PSeitz	17186ca9c9	improve docs (#2105 )	2023-06-27 13:37:14 +08:00
PSeitz	44850e1036	move fail dep to dev only (#2094 ) wasm compilation fails with dep only	2023-06-22 06:59:11 +02:00
François Massot	17c783b4db	Align numerical type priority order on the search side.	2023-06-11 11:49:27 +02:00
PSeitz	fdecb79273	tokenizer-api: reduce Tokenizer overhead (#2062 ) * tokenizer-api: reduce Tokenizer overhead Previously a new `Token` for each text encountered was created, which contains `String::with_capacity(200)` In the new API the token_stream gets mutable access to the tokenizer, this allows state to be shared (in this PR Token is shared). Ideally the allocation for the BoxTokenStream would also be removed, but this may require some lifetime tricks. * simplify api * move lowercase and ascii folding buffer to global * empty Token text as default	2023-06-08 18:37:58 +08:00
trinity-1686a	a3f001360f	add support for warming up range of terms (#2042 ) * add support for warming up range of terms * simplify handling of limit	2023-05-22 14:29:35 +02:00
PSeitz	04562c0318	add fastfield tokenizer to IndexBuilder (#2046 )	2023-05-18 04:33:42 +02:00
Yuri Astrakhan	74275b76a6	Inline format arguments where makes sense (#2038 ) Applied this command to the code, making it a bit shorter and slightly more readable. ``` cargo +nightly clippy --all-features --benches --tests --workspace --fix -- -A clippy::all -W clippy::uninlined_format_args cargo +nightly fmt --all ```	2023-05-10 18:03:59 +09:00
PSeitz	4ee1b5cda0	add seperate tokenizer manager for fast fields (#2019 ) * add seperate tokenizer manager for fast fields * rename	2023-05-08 11:22:31 +02:00
tottoto	73452284ae	Remove unused crates from dependencies (#2018 ) * Remove unused crates from dependencies * Revert rand to columnar * Revert criterion to stacker	2023-05-02 12:34:20 +02:00
trinity-1686a	9c93bfeb51	optimise warmup code path (#2007 ) * optimise warmup code path * better function naming	2023-04-21 11:23:09 +02:00
PSeitz	74f9eafefc	refactor Term (#2006 ) * refactor Term add ValueBytes for serialized term values add missing debug for ip skip unnecessary json path validation remove code duplication add DATE_TIME_PRECISION_INDEXED constant add missing Term clarification remove weird value_bytes_mut() API * fix naming	2023-04-20 15:31:43 +02:00
Paul Masurel	fbda511a1a	Making more things public for quickwit. (#2005 )	2023-04-20 11:37:45 +09:00
Paul Masurel	f853bf204b	Align the numerical type priority order with columnar. (#1978 ) Closes #1956	2023-04-07 10:07:54 +09:00
PSeitz	9e2faecf5b	add memory limit for aggregations (#1942 ) * add memory limit for aggregations introduce AggregationLimits to set memory consumption limit and bucket limits memory limit is checked during aggregation, bucket limit is checked before returning the aggregation request. * Apply suggestions from code review Co-authored-by: Paul Masurel <paul@quickwit.io> * add ByteCount with human readable format --------- Co-authored-by: Paul Masurel <paul@quickwit.io>	2023-03-16 06:21:07 +01:00
Paul Masurel	7fae4d98d7	Adapting for quickwit2 (#1912 ) * Adapting tantivy to make it possible to be plugged to quickwit. * Apply suggestions from code review Co-authored-by: PSeitz <PSeitz@users.noreply.github.com> * Added unit test --------- Co-authored-by: PSeitz <PSeitz@users.noreply.github.com>	2023-03-01 16:27:46 +09:00
Paul Masurel	66ff53b0f4	Various minor code cleanup (#1909 )	2023-02-27 13:48:34 +09:00
Paul Masurel	789cc8703e	Adding unit test testing docfreq after merge (#1895 )	2023-02-22 11:05:34 +09:00
Paul Masurel	e5098d9fe8	Moving test around reenabling tests that were disabled. (#1894 )	2023-02-22 10:31:52 +09:00
Alex Cole	f2f38c43ce	Make BM25 scoring more flexible (#1855 ) * Introduce Bm25StatisticsProvider to inject statistics * fix formatting I accidentally changed	2023-02-16 19:14:12 +09:00
PSeitz	36c6138e7f	fix: auto downgrade index record option, instead of vint error (#1857 ) Prev: thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: IoError(Custom { kind: InvalidData, error: "Reach end of buffer while reading VInt" })', src/main.rs:46:14 Now: Automatic downgrade to next available level	2023-02-10 13:45:23 +01:00
Paul Masurel	405e2cf4d9	Merge with main	2023-02-09 14:28:57 +01:00
Paul Masurel	bd5eea9852	Integrated columnar work.	2023-02-09 13:14:31 +01:00
PSeitz	0f20787917	fix doc store cache docs (#1821 ) * fix doc store cache docs addresses an issue reported in #1820 * rename doc_store_cache_size	2023-01-23 07:06:49 +01:00
PSeitz	f687b3a5aa	start migrate Field to &str (#1772 ) start migrate Field to &str in preparation of columnar return Result for get_field	2023-01-18 16:12:07 +09:00
Shikhar Bhushan	2650111b76	EnableScoring::Disabled - optional Searcher (#1780 )	2023-01-12 09:26:50 -05:00
Paul Masurel	bb48c3e488	Refactoring to prepare for the addition of dynamic fast field (#1730 ) * Refactoring to prepare for the addition of dynamic fast field - Exposing insert_key / insert_value - Renamed SSTable::{Reader/Writer}-> SSTable::{ValueReader/ValueWriter} - Added a generic Dictionary object in the sstable crate - Removing the TermDictionary wrapper from tantivy, relying directly on an alias of the generic Dictionary object. - dropped the use of byteorder in sstable. - Stopped scanning / reading the entire dictionary when streaming a range. * Added a benchmark for streaming sstable ranges. * CR comments. Rename deserialize_u64 -> deserialize_vint_u64 * Removed needless allocation, split serialize into serialize and clear.	2022-12-22 12:25:46 +09:00
Paul Masurel	32cb1d22da	Removed AsyncIoResult. (#1728 )	2022-12-21 16:01:17 +09:00
PSeitz	f9171a3981	fix clippy (#1725 ) * fix clippy * fix clippy fastfield codecs * fix clippy bitpacker * fix clippy common * fix clippy stacker * fix clippy sstable * fmt	2022-12-20 07:30:06 +01:00

1 2 3 4 5 ...

727 Commits