Commit Graph

539 Commits

Author SHA1 Message Date
PSeitz
67ebba3c3c expose collect_block buffer size (#2326)
* expose buffer of collect_block

* flip shard_size segment_size
2024-03-15 08:02:08 +01:00
PSeitz
48630ceec9 move into new index module (#2259)
move core modules to index module
2024-01-31 10:30:04 +01:00
Adam Reichold
53f2fe1fbe Forward regex parser errors to enable understandin their reason. (#2288) 2023-12-22 11:01:10 +01:00
PSeitz
0aae31d7d7 reduce number of allocations (#2257)
* reduce number of allocations

Explanation makes up around 50% of all allocations (numbers not perf).
It's created during serialization but not called.

- Make Explanation optional in BM25
- Avoid allocations when using Explanation

* use Cow
2023-11-16 13:47:36 +01:00
Paul Masurel
7bc5bf78e2 Fixing functional tests. (#2239) 2023-11-05 18:18:39 +09:00
PSeitz
83af14caa4 Fix range query (#2226)
Fix range query end check in advance
Rename vars to reduce ambiguity
add tests

Fixes #2225
2023-10-25 09:17:31 +02:00
trinity-1686a
0d4589219b encode some part of posting list as -1 instead of direct values (#2185)
* add support for delta-1 encoding posting list

* encode term frequency minus one

* don't emit tf for json integer terms

* make skipreader not pub(crate) mutable
2023-10-20 16:58:26 +02:00
PSeitz
03a1f40767 rename DocValue to Value (#2197)
rename DocValue to Value to avoid confusion with lucene DocValues
rename Value to OwnedValue
2023-10-02 17:03:00 +02:00
Harrison Burt
1c7c6fd591 POC: Tantivy documents as a trait (#2071)
* fix windows build (#1)

* Fix windows build

* Add doc traits

* Add field value iter

* Add value and serialization

* Adjust order

* Fix bug

* Correct type

* Fix generic bugs

* Reformat code

* Add generic to index writer which I forgot about

* Fix missing generics on single segment writer

* Add missing type export

* Add default methods for convenience

* Cleanup

* Fix more-like-this query to use standard types

* Update API and fix tests

* Add doc traits

* Add field value iter

* Add value and serialization

* Adjust order

* Fix bug

* Correct type

* Rebase main and fix conflicts

* Reformat code

* Merge upstream

* Fix missing generics on single segment writer

* Add missing type export

* Add default methods for convenience

* Cleanup

* Fix more-like-this query to use standard types

* Update API and fix tests

* Add tokenizer improvements from previous commits

* Add tokenizer improvements from previous commits

* Reformat

* Fix unit tests

* Fix unit tests

* Use enum in changes

* Stage changes

* Add new deserializer logic

* Add serializer integration

* Add document deserializer

* Implement new (de)serialization api for existing types

* Fix bugs and type errors

* Add helper implementations

* Fix errors

* Reformat code

* Add unit tests and some code organisation for serialization

* Add unit tests to deserializer

* Add some small docs

* Add support for deserializing serde values

* Reformat

* Fix typo

* Fix typo

* Change repr of facet

* Remove unused trait methods

* Add child value type

* Resolve comments

* Fix build

* Fix more build errors

* Fix more build errors

* Fix the tests I missed

* Fix examples

* fix numerical order, serialize PreTok Str

* fix coverage

* rename Document to TantivyDocument, rename DocumentAccess to Document

add Binary prefix to binary de/serialization

* fix coverage

---------

Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>
2023-10-02 10:01:16 +02:00
PSeitz
832f1633de handle exclusive out of bounds ranges on fastfield range queries (#2174)
closes https://github.com/quickwit-oss/quickwit/issues/3790
2023-09-26 08:00:40 +02:00
trinity-1686a
0241a05b90 add support for exists query syntax in query parser (#2170)
* add support for exists query syntax in query parser

* rustfmt

* make Exists require a field
2023-09-19 11:10:39 +02:00
PSeitz
2d7390341c increase min memory to 15MB for indexing (#2176)
With tantivy 0.20 the minimum memory consumption per SegmentWriter increased to
12MB. 7MB are for the different fast field collectors types (they could be
lazily created). Increase the minimum memory from 3MB to 15MB.

Change memory variable naming from arena to budget.

closes #2156
2023-09-13 07:38:34 +02:00
Ping Xia
e4e416ac42 extend FuzzyTermQuery to support json field (#2173)
* extend fuzzy search for json field

* comments

* comments

* fmt fix

* comments
2023-09-11 05:59:40 +02:00
Igor Motov
19325132b7 Fast-field based implementation of ExistsQuery (#2160)
Adds an implementation of ExistsQuery that takes advantage of fast fields.

Fixes #2159
2023-09-07 11:51:49 +09:00
PSeitz
c4e2708901 fix clippy, fmt (#2162) 2023-08-30 08:04:26 +02:00
PSeitz
48d4847b38 Improve aggregation error message (#2150)
* Improve aggregation error message

Improve aggregation error message by wrapping the deserialization with a
custom struct. This deserialization variant is slower, since we need to
keep the deserialized data around twice with this approach.
For now the valid variants list is manually updated. This could be
replaced with a proc macro.
closes #2143

* Simpler implementation

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2023-08-23 20:52:15 +02:00
PSeitz
480763db0d track memory arena memory usage (#2148) 2023-08-16 18:19:42 +02:00
Caleb Hattingh
47b315ff18 doc: escape the backslash (#2144) 2023-08-14 19:10:07 +02:00
Adam Reichold
22c35b1e00 Fix explanation of boost queries seeking beyond query result. (#2142)
* Make current nightly Clippy happy.

* Fix explanation of boost queries seeking beyond query result.
2023-08-14 11:59:11 +09:00
trinity-1686a
b92082b748 implement lenient parser (#2129)
* move query parser to nom

* add suupport for term grouping

* initial work on infallible parser

* fmt

* add tests and fix minor parsing bugs

* address review comments

* add support for lenient queries in tantivy

* make lenient parser report errors

* allow mixing occur and bool in query
2023-08-08 15:41:29 +02:00
Adam Reichold
42acd334f4 Fixes the new deny-by-default incorrect_partial_ord_impl_on_ord_type Clippy lint (#2131) 2023-07-21 11:36:17 +09:00
Adam Reichold
5fafe4b1ab Add missing query_terms impl for TermSetQuery. (#2120) 2023-07-13 14:54:29 +02:00
François Massot
0a23201338 Fix stackoverflow and add docs. 2023-07-03 22:05:11 +09:00
Paul Masurel
ad4c940fa3 proof of concept for dynamic tokenizer. 2023-07-03 22:05:10 +09:00
Paul Masurel
910b0b0c61 Cargo fmt 2023-07-03 22:03:31 +09:00
PSeitz
3fef052bf1 fix flaky test (#2107)
closes #2099
2023-06-29 14:30:56 +08:00
François Massot
0cb53207ec Fix tests. 2023-06-11 12:13:35 +02:00
PSeitz
fdecb79273 tokenizer-api: reduce Tokenizer overhead (#2062)
* tokenizer-api: reduce Tokenizer overhead

Previously a new `Token` for each text encountered was created, which
contains `String::with_capacity(200)`
In the new API the token_stream gets mutable access to the tokenizer,
this allows state to be shared (in this PR Token is shared).
Ideally the allocation for the BoxTokenStream would also be removed, but
this may require some lifetime tricks.

* simplify api

* move lowercase and ascii folding buffer to global

* empty Token text as default
2023-06-08 18:37:58 +08:00
Adam Reichold
b325d569ad Expose phrase-prefix queries via the built-in query parser (#2044)
* Expose phrase-prefix queries via the built-in query parser

This proposes the less-than-imaginative syntax `field:"phrase ter"*` to
perform a phrase prefix query against `field` using `phrase` and `ter` as the
terms. The aim of this is to make this type of query more discoverable and
simplify manual testing.

I did consider exposing the `max_expansions` parameter similar to how slop is
handled, but I think that this is rather something that should be configured via
the querser parser (similar to `set_field_boost` and `set_field_fuzzy`) as
choosing it requires rather intimiate knowledge of the backing index.

* Prevent construction of zero or one term phrase-prefix queries via the query parser.

* Add example using phrase-prefix search via surface API to improve feature discoverability.
2023-06-01 13:03:16 +02:00
trinity-1686a
6564e0c467 fix phrase prefix query (#2043)
* fix phrase prefix query

it would fail spectacularly when no doc in the segment would match the phrase part of the query

* clippy
2023-05-22 12:36:20 +02:00
Paul Masurel
62709b8094 Change in the query grammar. (#2050)
* Change in the query grammar.

Quotation mark can now be used for phrase queries.
The delimiter is part of the `UserInputLeaf`.
That information is meant to be used in Quickwit to solve #3364.

This PR also adds support for quotation marks escaping in phrase
queries.

* Apply suggestions from code review
2023-05-19 12:07:10 +09:00
Adam Reichold
fedd9559e7 Expose create a query from a user input AST. (#2039) 2023-05-11 21:53:18 +09:00
PSeitz
0eafbaab8e fix slop (#2031)
Fix slop by carrying slop so far for multiterms.
Define slop contract in the API
2023-05-10 11:45:14 +02:00
Yuri Astrakhan
74275b76a6 Inline format arguments where makes sense (#2038)
Applied this command to the code, making it a bit shorter and slightly
more readable.

```
cargo +nightly clippy --all-features --benches --tests --workspace --fix -- -A clippy::all -W clippy::uninlined_format_args
cargo +nightly fmt --all
```
2023-05-10 18:03:59 +09:00
PSeitz
4c58b0086d allow slop in both directions (#2020)
* allow slop in both directions

allow slop in both directions
so "big wolf"~3 can also match "wolf big"

This also fixes #1934, when the docsets were reordered by size and didn't
match the terms.

* remove count

* add test for repeating tokens, unduplicate tests
2023-05-07 12:05:21 +09:00
Paul Masurel
f28ddb711e Exposing u64-based FastFieldRangeWeight (#2024) 2023-05-03 18:32:00 +09:00
PSeitz
ba309e18a1 switch to nanosecond precision (#2016) 2023-05-01 03:32:20 +02:00
PSeitz
74f9eafefc refactor Term (#2006)
* refactor Term

add ValueBytes for serialized term values
add missing debug for ip
skip unnecessary json path validation
remove code duplication
add DATE_TIME_PRECISION_INDEXED constant
add missing Term clarification
remove weird value_bytes_mut() API

* fix naming
2023-04-20 15:31:43 +02:00
RT_Enzyme
ff3d3313c4 fix BooleanQuery document (#1999)
* fix BooleanQuery document

* Update src/query/boolean_query/boolean_query.rs

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2023-04-20 11:37:20 +02:00
Paul Masurel
fbda511a1a Making more things public for quickwit. (#2005) 2023-04-20 11:37:45 +09:00
Paul Masurel
4b01cc4c49 Made BooleanWeight and BoostWeight public (#1991) 2023-04-12 10:26:30 +09:00
PSeitz
5c380b76e7 Better mixed types support in aggs and fix serialization issue (#1971)
* Better mixed types support in aggs and fix serialization issue

- Improve support for mixed types in JSON field aggregations (pick the right field, #1913)
- Resolve the issue with JSON serialization for numeric keys (fixes #1967)
- Add JSON round-trip test for term buckets
- Remove `u64_lenient`, as this is a footgun without the type
- move aggregation benchmarks

* remove shadowing
2023-03-31 05:52:11 +02:00
PSeitz
6a7a1106d6 work in batches of docs (#1937)
* work in batches of docs

* add fill_buffer test
2023-03-21 06:57:44 +01:00
trinity-1686a
064518156f refactor tokenization pipeline to use GATs (#1924)
* refactor tokenization pipeline to use GATs

* fix doctests

* fix clippy lints

* remove commented code
2023-03-09 09:39:37 +01:00
Paul Masurel
7fae4d98d7 Adapting for quickwit2 (#1912)
* Adapting tantivy to make it possible to be plugged to quickwit.

* Apply suggestions from code review

Co-authored-by: PSeitz <PSeitz@users.noreply.github.com>

* Added unit test

---------

Co-authored-by: PSeitz <PSeitz@users.noreply.github.com>
2023-03-01 16:27:46 +09:00
trinity-1686a
8a71e00da3 allow limiting the number of matched term in range query (#1899) 2023-02-27 10:44:08 +01:00
Paul Masurel
d25fc155b2 Making some of the column/termdict operations async-friendly (#1902) 2023-02-27 15:34:47 +09:00
Paul Masurel
66ff53b0f4 Various minor code cleanup (#1909) 2023-02-27 13:48:34 +09:00
Paul Masurel
d002698008 Re-export of query grammar. (#1908) 2023-02-27 12:26:34 +09:00
trinity-1686a
533ad99cd5 add PhrasePrefixQuery (#1842)
* add PhrasePrefixQuery
2023-02-22 11:18:33 +01:00