tantivy

mirror of https://github.com/quickwit-oss/tantivy.git synced 2026-05-28 06:00:40 +00:00

Author	SHA1	Message	Date
Harrison Burt	1c7c6fd591	POC: Tantivy documents as a trait (#2071 ) * fix windows build (#1) * Fix windows build * Add doc traits * Add field value iter * Add value and serialization * Adjust order * Fix bug * Correct type * Fix generic bugs * Reformat code * Add generic to index writer which I forgot about * Fix missing generics on single segment writer * Add missing type export * Add default methods for convenience * Cleanup * Fix more-like-this query to use standard types * Update API and fix tests * Add doc traits * Add field value iter * Add value and serialization * Adjust order * Fix bug * Correct type * Rebase main and fix conflicts * Reformat code * Merge upstream * Fix missing generics on single segment writer * Add missing type export * Add default methods for convenience * Cleanup * Fix more-like-this query to use standard types * Update API and fix tests * Add tokenizer improvements from previous commits * Add tokenizer improvements from previous commits * Reformat * Fix unit tests * Fix unit tests * Use enum in changes * Stage changes * Add new deserializer logic * Add serializer integration * Add document deserializer * Implement new (de)serialization api for existing types * Fix bugs and type errors * Add helper implementations * Fix errors * Reformat code * Add unit tests and some code organisation for serialization * Add unit tests to deserializer * Add some small docs * Add support for deserializing serde values * Reformat * Fix typo * Fix typo * Change repr of facet * Remove unused trait methods * Add child value type * Resolve comments * Fix build * Fix more build errors * Fix more build errors * Fix the tests I missed * Fix examples * fix numerical order, serialize PreTok Str * fix coverage * rename Document to TantivyDocument, rename DocumentAccess to Document add Binary prefix to binary de/serialization * fix coverage --------- Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>	2023-10-02 10:01:16 +02:00
PSeitz	c4e2708901	fix clippy, fmt (#2162 )	2023-08-30 08:04:26 +02:00
PSeitz	fdecb79273	tokenizer-api: reduce Tokenizer overhead (#2062 ) * tokenizer-api: reduce Tokenizer overhead Previously a new `Token` for each text encountered was created, which contains `String::with_capacity(200)` In the new API the token_stream gets mutable access to the tokenizer, this allows state to be shared (in this PR Token is shared). Ideally the allocation for the BoxTokenStream would also be removed, but this may require some lifetime tricks. * simplify api * move lowercase and ascii folding buffer to global * empty Token text as default	2023-06-08 18:37:58 +08:00
Paul Masurel	7ee78bda52	Readding s in datetime precision variant names (#2065 ) There is no clear win and it change some serialization in quickwit.	2023-06-01 06:39:46 +02:00
PSeitz	e56addc63e	enable tokenizer on json fields (#2053 ) * enable tokenizer on json fields enable tokenizer on json fields for type text * Avoid making the tokenizer within the TextAnalyzer pub(crate) * Moving BoxableTokenizer to tantivy. --------- Co-authored-by: Paul Masurel <paul@quickwit.io>	2023-05-24 10:47:39 +02:00
Adrien Guillo	a789ad9aee	Rename `DatePrecision` to `DateTimePrecision` (#2051 )	2023-05-23 17:09:11 +02:00
PSeitz	4ee1b5cda0	add separate tokenizer manager for fast fields (#2019 ) * add separate tokenizer manager for fast fields * rename	2023-05-08 11:22:31 +02:00
PSeitz	ba309e18a1	switch to nanosecond precision (#2016 )	2023-05-01 03:32:20 +02:00
PSeitz	74f9eafefc	refactor Term (#2006 ) * refactor Term add ValueBytes for serialized term values add missing debug for ip skip unnecessary json path validation remove code duplication add DATE_TIME_PRECISION_INDEXED constant add missing Term clarification remove weird value_bytes_mut() API * fix naming	2023-04-20 15:31:43 +02:00
trinity-1686a	780e26331d	sstable compression (#1946 ) * compress sstable with zstd * add some details to sstable readme * compress only block which benefit from it * multiple changes to sstable make compression optional use OwnedBytes instead of impl Read in sstable, required for next point use zstd bulk api, which is much faster on small records * cleanup and use bulk api for compression * use dedicated byte for compression * switch block len and compression flag * change default zstd level in sstable	2023-04-14 16:25:50 +02:00
trinity-1686a	205e8a0a92	encode dictionary type in fst footer (#1968 ) * encode additional footer for dictionary kind in fst	2023-04-12 09:43:01 +02:00
PSeitz	5c4ea6a708	tokenizer option on text fastfield (#1945 ) * tokenizer option on text fastfield allow to set tokenizer option on text fastfield (fixes #1901) handle PreTokenized strings in fast field * change visibility * remove custom de/serialization	2023-03-31 10:03:38 +02:00
trinity-1686a	482b4155e8	fix bug with new sstable index format (#1953 )	2023-03-22 10:22:36 +01:00
trinity-1686a	e5e50603a8	new sstable format (#1943 ) * document a new sstable format * add support for changing target block size * use new format for sstable index * handle sstable version error * use very small blocks for proptests * add a footer structure	2023-03-21 15:03:52 +01:00
PSeitz	9e2faecf5b	add memory limit for aggregations (#1942 ) * add memory limit for aggregations introduce AggregationLimits to set memory consumption limit and bucket limits memory limit is checked during aggregation, bucket limit is checked before returning the aggregation request. * Apply suggestions from code review Co-authored-by: Paul Masurel <paul@quickwit.io> * add ByteCount with human readable format --------- Co-authored-by: Paul Masurel <paul@quickwit.io>	2023-03-16 06:21:07 +01:00
PSeitz	2fb3740cb0	handle missing column for aggs (#1920 ) * handle missing column for aggs add empty column fallback for missing column in aggs. Fix sort for term agg on sub-agg with missing value (null is smallest) * add error when field is not fast	2023-03-15 06:09:59 +01:00
Paul Masurel	364e321415	Clippy fix (#1926 )	2023-03-06 10:37:17 +09:00
Paul Masurel	7fae4d98d7	Adapting for quickwit2 (#1912 ) * Adapting tantivy to make it possible to be plugged to quickwit. * Apply suggestions from code review Co-authored-by: PSeitz <PSeitz@users.noreply.github.com> * Added unit test --------- Co-authored-by: PSeitz <PSeitz@users.noreply.github.com>	2023-03-01 16:27:46 +09:00
Paul Masurel	8ea97e7d6b	Minor refactoring preparing for getting columnar integrated in quickwit. (#1911 )	2023-02-27 14:23:30 +09:00
Paul Masurel	06850719dc	Renaming .values(DocId) to .values_for_doc(DocId) (#1906 )	2023-02-27 12:15:13 +09:00
PSeitz	111f25a8f7	clippy (#1879 ) * fix clippy * fix clippy * fmt	2023-02-17 11:34:21 +01:00
Paul Masurel	7423f99719	Issue/columnar for json (#1876 ) Adding support for JSON fast field.	2023-02-16 20:38:32 +09:00
PSeitz	01e5a22759	switch to new ff api (#1868 )	2023-02-14 15:57:32 +08:00
Paul Masurel	60cc2644d6	Fixing test_fail_on_flush_segment_but_one_worker_remains (#1869 ) The new fast field code, based on columnar, had a larger minimum memory footprint, causing the first document to trigger a flush of the segment in this unit test. This PR prevents the allocation of a large capacity for the different hashmap tables used in the columnar writer. Closes #1859	2023-02-14 16:09:42 +09:00
PSeitz	1cfb9ce59a	improve range query performance (#1864 ) fix RowId vs DocId naming fixes #1863	2023-02-14 13:25:39 +09:00
trinity-1686a	539ff08a79	move DateTime to tantivy_common (#1861 ) * move DateTime to tantivy_common * resolve imports of columnar::DateTime as import of common::DateTime	2023-02-11 17:03:06 +01:00
Paul Masurel	bd5eea9852	Integrated columnar work.	2023-02-09 13:14:31 +01:00
Paul Masurel	08919a2900	Improvement on the scalar / random bitpacker code. (#1781 ) * Improvement on the scalar / random bitpacker code. Added proptesting Added simple benchmark Added assert and comments on the very non trivial hidden contract Remove the need for an extra padding. The last point introduces a small performance regression (~10%). * Fixing unit tests	2023-01-19 18:09:13 +09:00
PSeitz	f687b3a5aa	start migrate Field to &str (#1772 ) start migrate Field to &str in preparation of columnar return Result for get_field	2023-01-18 16:12:07 +09:00
PSeitz	1176555eff	handle user input on get_docid_for_value_range (#1760 ) * handle user input on get_docid_for_value_range fixes #1757 * pass range as parameter	2023-01-12 14:20:16 +01:00
PSeitz	07a51eb7c8	refactor multivalue fastfield, refactor range query (#1749 ) Introduce MakeZero trait, remove make_zero from FastValue Merge two multivalue fastfield implementations into one prepare range query on fastfield for different types	2023-01-05 12:09:50 +01:00
PSeitz	f9171a3981	fix clippy (#1725 ) * fix clippy * fix clippy fastfield codecs * fix clippy bitpacker * fix clippy common * fix clippy stacker * fix clippy sstable * fmt	2022-12-20 07:30:06 +01:00
PSeitz	1119e59eae	prepare fastfield format for null index (#1691 ) * prepare fastfield format for null index * add format version for fastfield * Update fastfield_codecs/src/compact_space/mod.rs * switch to variable size footer * serialize delta of end	2022-11-28 17:15:24 +09:00
PSeitz	ee1f2c1f28	add aggregation support for date type (#1693 ) * add aggregation support for date type fixes #1332 * serialize key_as_string as rfc3339 in date histogram * update docs * enable date for range aggregation	2022-11-28 09:12:08 +09:00
Pascal Seitz	83325d8f3f	move multivalue index to own file start_doc parameter in positions to docids	2022-11-01 10:36:13 +08:00
Pascal Seitz	e772d3170d	switch get_val() to u32 Fixes #1638	2022-10-24 19:05:57 +08:00
Pascal Seitz	952b048341	add term aggregation clarification	2022-10-14 16:12:19 +08:00
Pascal Seitz	9cb8cfbea8	return Error instead panic in fastfields fixes #1572	2022-10-11 14:15:22 +08:00
Pascal Seitz	400a20b7af	add ip field add u128 multivalue reader and writer add ip to schema add ip writers, handle merge	2022-10-07 16:25:01 +08:00
Pascal Seitz	d742275048	renames	2022-10-05 19:16:49 +08:00
Pascal Seitz	8b42c4c126	disable linear codec for multivalue value index don't materialize index column on merge use simpler chain() variant	2022-10-05 19:09:17 +08:00
Bruce Mitchener	cb252a42af	docs: "associated to" -> "associated with" (#1557 ) This reads better this way.	2022-09-26 20:23:37 +09:00
Pascal Seitz	f757471077	prepare for ip field	2022-09-26 16:27:35 +08:00
Pascal Seitz	1ff5da5eb4	remove fast_field_cardinality from FastValue unused and at the wrong place	2022-09-21 09:38:46 +08:00
Paul Masurel	64f08a1a5c	Hiding useless symbols and removing code. (#1522 )	2022-09-16 14:42:27 +09:00
Paul Masurel	c632fc014e	Refactoring fast fields codecs. This removes the GCD part as a codec, and makes it so that fastfield codecs all share the same normalization part (shift + gcd).	2022-09-05 23:07:12 +09:00
Paul Masurel	ea72cf34d6	Int based linear interpol (#1482 ) * Rename BlockwiseLinear to BlockwiseLinearLegacy Reimplements the blockwise multilinear codec using integer arithmetics. Added comments * add estimate for blockwise * Added one unit test * use int based for linear interpol * fix merge conflicts * reuse code * cargo fmt * fix clippy * fix test * fix off by one fix off by one to accurately interpolate autoincrement fields * extend test, fix estimate * remove legacy codec Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>	2022-09-05 15:53:00 +09:00
Paul Masurel	26876d41d7	Moving the serialization logic to the fastfield_codecs crate.	2022-09-03 00:29:52 +09:00
Paul Masurel	8e775b6c3d	Refactoring dyn Column (#1502 )	2022-09-02 17:26:30 +09:00
Paul Masurel	84e0c75598	Bench fixing	2022-09-02 11:15:44 +09:00


185 Commits