Commit Graph

38 Commits

Author SHA1 Message Date
Neil Hansen
dee2dd3f21 Use Levenshtein distance to score documents in fuzzy term queries 2025-12-10 10:17:19 -08:00
Paul Masurel
63c66005db Lazy scorers (#2726)
* Refactoring of the score tweaker into `SortKeyComputer`s to unlock two features.

- Allow lazy evaluation of score. As soon as we identified that a doc won't
reach the topK threshold, we can stop the evaluation.
- Allow for a different segment level score, segment level score and their conversion.

This PR breaks public API, but fixing code is straightforward.

* Bumping tantivy version

---------

Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
2025-12-01 15:38:57 +01:00
Hamir Mahal
0c634adbe1 style: simplify strings with string interpolation (#2412)
* style: simplify strings with string interpolation

* fix: formatting
2024-05-27 09:16:47 +02:00
PSeitz
74940e9345 clippy (#2349)
* fix clippy

* fix clippy

* fix duplicate imports
2024-04-09 07:54:44 +02:00
Harrison Burt
1c7c6fd591 POC: Tantivy documents as a trait (#2071)
* fix windows build (#1)

* Fix windows build

* Add doc traits

* Add field value iter

* Add value and serialization

* Adjust order

* Fix bug

* Correct type

* Fix generic bugs

* Reformat code

* Add generic to index writer which I forgot about

* Fix missing generics on single segment writer

* Add missing type export

* Add default methods for convenience

* Cleanup

* Fix more-like-this query to use standard types

* Update API and fix tests

* Add doc traits

* Add field value iter

* Add value and serialization

* Adjust order

* Fix bug

* Correct type

* Rebase main and fix conflicts

* Reformat code

* Merge upstream

* Fix missing generics on single segment writer

* Add missing type export

* Add default methods for convenience

* Cleanup

* Fix more-like-this query to use standard types

* Update API and fix tests

* Add tokenizer improvements from previous commits

* Add tokenizer improvements from previous commits

* Reformat

* Fix unit tests

* Fix unit tests

* Use enum in changes

* Stage changes

* Add new deserializer logic

* Add serializer integration

* Add document deserializer

* Implement new (de)serialization api for existing types

* Fix bugs and type errors

* Add helper implementations

* Fix errors

* Reformat code

* Add unit tests and some code organisation for serialization

* Add unit tests to deserializer

* Add some small docs

* Add support for deserializing serde values

* Reformat

* Fix typo

* Fix typo

* Change repr of facet

* Remove unused trait methods

* Add child value type

* Resolve comments

* Fix build

* Fix more build errors

* Fix more build errors

* Fix the tests I missed

* Fix examples

* fix numerical order, serialize PreTok Str

* fix coverage

* rename Document to TantivyDocument, rename DocumentAccess to Document

add Binary prefix to binary de/serialization

* fix coverage

---------

Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>
2023-10-02 10:01:16 +02:00
PSeitz
2d7390341c increase min memory to 15MB for indexing (#2176)
With tantivy 0.20 the minimum memory consumption per SegmentWriter increased to
12MB. 7MB are for the different fast field collectors types (they could be
lazily created). Increase the minimum memory from 3MB to 15MB.

Change memory variable naming from arena to budget.

closes #2156
2023-09-13 07:38:34 +02:00
Ping Xia
e4e416ac42 extend FuzzyTermQuery to support json field (#2173)
* extend fuzzy search for json field

* comments

* comments

* fmt fix

* comments
2023-09-11 05:59:40 +02:00
PSeitz
74f9eafefc refactor Term (#2006)
* refactor Term

add ValueBytes for serialized term values
add missing debug for ip
skip unnecessary json path validation
remove code duplication
add DATE_TIME_PRECISION_INDEXED constant
add missing Term clarification
remove weird value_bytes_mut() API

* fix naming
2023-04-20 15:31:43 +02:00
Adam Reichold
1afa5bf3db Make construction of LevenshteinAutomatonBuilder for FuzzyTermQuery instances lazy. (#1756) 2023-01-06 12:44:49 +09:00
Paul Masurel
3edf0a2724 Using the manual reload policy in IndexWriter. (#1667) 2022-11-09 11:20:41 +01:00
Paul Masurel
eca6628b3c Minor refactoring (#1266) 2022-01-28 15:55:55 +09:00
Paul Masurel
732f6847c0 Field type with codes (#1255)
* Term are now typed.

This change is backward compatible:
While the Term has a byte representation that is modified, a Term itself
is a transient object that is not serialized as is in the index.

Its .field() and .value_bytes() on the other hand are unchanged.
This change offers better Debug information for terms.

While not necessary it also will help in the support for JSON types.

* Renamed Hierarchical Facet -> Facet
2022-01-07 20:49:00 +09:00
Paul Masurel
7234bef0eb Issue/1198 (#1201)
* Unit test reproducing #1198
* Fixing unit test to handle the error from add_document.
* Bump project version
2021-11-11 16:42:19 +09:00
Paul Masurel
b52abbc771 Bugfix transposition_cost_one in FuzzyQuery (#1167) 2021-10-07 09:38:39 +09:00
Paul Masurel
39dd8cfe24 Cargo clippy. Acronym should not be full uppercase apparently. 2021-04-26 11:49:18 +09:00
Paul Masurel
439d6956a9 Returning Result in some of the API (#880)
* Returning Result in some of the API

* Introducing `.writer_for_test(..)`
2020-09-07 15:52:34 +09:00
Paul Masurel
2481c87be8 Block wand (#856) 2020-08-19 22:36:36 +09:00
Paul Masurel
6db8bb49d6 Assert nearly equals macro (#853)
* Assert nearly equals macro

* Renamed specialized_scorer in TermScorer
2020-07-17 16:40:41 +09:00
Rob Young
9de1360538 Minor doc and test improvements around fuzzy querying (#825) 2020-05-16 10:59:24 +09:00
Paul Masurel
186d7fc20e Fix build 2020-04-01 09:32:45 +09:00
Paul Masurel
cfbdef5186 Using tantivy-fst version 0.3. 2020-03-31 23:24:54 +09:00
Chen Xu
b167058028 Fix prefix option for FuzzyTermQuery (#797)
* Fix prefix option for FuzzyTermQuery

* Update changelog
2020-03-19 20:19:32 +09:00
Paul Masurel
6542dd5337 Removing parenthesis. 2020-03-01 09:41:53 +09:00
Paul Masurel
ae14022bf0 Removed use::Result. (#771) 2020-01-31 18:47:02 +09:00
Paul Masurel
ef3eddf3da clippy first stab (#711) 2019-11-22 13:09:35 +09:00
Joshua Dutton
9f74786db2 Update import statements in examples, doctests (#633)
Update import statements to edition 2018, including removing
`extern crate` and  `#[macro_use]`. Alphabetize the statements.
2019-08-19 07:26:35 +09:00
Paul Masurel
001af3876f cargo fmt 2019-08-08 18:07:19 +09:00
petr-tik
0154dbe477 Replace unwrap with match and proper Error handling (#606)
* Replace unwrap with match and proper Error handling

* Replaced 'magic' values with a documented variable

Didn't like the unexplained 0..3 range, thought it was best as a variable

Calculating Levenshtein distance is expensive, so best explain why we should
keep it low
2019-07-31 08:16:02 +09:00
Paul Masurel
4867be3d3b Kompass master (#590)
* Use once_cell in place of lazy_static

* Minor changes
2019-07-10 19:24:54 +09:00
Paul Masurel
462774b15c Tiqb feature/2018 (#583)
* rust 2018

* Added CHANGELOG comment
2019-07-01 10:01:46 +09:00
Paul Masurel
663dd89c05 Feature/reader (#517)
Adding IndexReader to the API. Making it possible to watch for changes.

* Closes #500
2019-03-20 08:39:22 +09:00
Paul Masurel
a6e767c877 Cargo fmt 2018-11-30 22:52:45 +09:00
Paul Masurel
07d87e154b Collector refactoring and multithreaded search (#437)
* Split Collector into an overall Collector and a per-segment SegmentCollector. Precursor to cross-segment parallelism, and as a side benefit cleans up any per-segment fields from being Option<T> to just T.

* Attempt to add MultiCollector back

* working. Chained collector is broken though

* Fix chained collector

* Fix test

* Make Weight Send+Sync for parallelization purposes

* Expose parameters of RangeQuery for external usage

* Removed &mut self

* fixing tests

* Restored TestCollectors

* blop

* multicollector working

* chained collector working

* test broken

* fixing unit test

* blop

* blop

* Blop

* simplifying APi

* blop

* better syntax

* Simplifying top_collector

* refactoring

* blop

* Sync with master

* Added multithread search

* Collector refactoring

* Schema::builder

* CR and rustdoc

* CR comments

* blop

* Added an executor

* Sorted the segment readers in the searcher

* Update searcher.rs

* Fixed unit testst

* changed the place where we have the sort-segment-by-count heuristic

* using crossbeam::channel

* inlining

* Comments about panics propagating

* Added unit test for executor panicking

* Readded default

* Removed Default impl

* Added unit test for executor
2018-11-30 22:46:59 +09:00
Paul Masurel
06e7bd18e7 Clippy (#421)
* Cargo Format

* Clippy

* bugfix

* still clippy stuff

* clippy step 2
2018-09-15 14:56:14 +09:00
pentlander
8600b8ea25 Top collector (#413)
* Make TopCollector generic

Make TopCollector take a generic type instead of only being tied to
score. This will allow for sharing code between a TopCollector that
sorts results by Score and a TopCollector that sorts documents by a fast
field. This commit makes no functional changes to TopCollector.

* Add TopFieldCollector and TopScoreCollector

Create two new collectors that use the refactored TopCollector.
TopFieldCollector has the same functionality that TopCollector
originally had. TopFieldCollector allows for sorting results by a given
fast field. Closes tantivy-search/tantivy#388

* Make TopCollector private

Make TopCollector package private and export TopFieldCollector as
TopCollector to maintain backwards compatibility. Mark TopCollector
as deprecated to encourage use of the non-aliased TopFieldCollector.
Remove Collector implementation for TopCollector since it is not longer
used.
2018-09-14 09:22:17 +09:00
Dru Sellers
e301e0bc87 Add some simple doc tests (#320)
* Add TopCollector doc test

* Add CountCollector Doc Test

* Add Doc Test for MultiCollector

* Add ChainedCollector Doc Test

* Expose Fuzzy Query where it should be

* Add FuzzyTermQuery Doc Test

* Expose RegexQuery

* Regex Query Doc Test

* Add TermQuery Doc Test

* Add doc comments

* fix test 🤦

* Added explanation about the complexity variables

* Fixing unit tests

* Single threads if you check docids
2018-06-19 10:45:20 +09:00
Dru Sellers
317baf4e75 Add in simple regex query support (#319)
* Add fst_regex crate in

* Reduce API surface area

This doesn't need to be public

* better test name

* Pull Automaton weight out so it can be shared

* Implement Regex Query
2018-06-16 14:08:30 +09:00
Dru Sellers
6f7b099370 Add AutomatonWeight to a fuzzy_search module and FuzzyQuery (#300)
* Add AutomatonWeight to a fuzzy_search module

* Hacking around ownership issues

* Working through lifetime issues

* Working through tests

* fix test by lower casing the words (reducing distance)

* code review changes

* Suggestion on how to solve the borrow problem

* clean up
2018-06-11 22:23:03 +09:00