Commit Graph

3006 Commits

Author SHA1 Message Date
Pascal Seitz
4f2e810b83 add From impl to BoxTokenStream, Bump tokenizer-api version 2023-06-23 18:54:03 +08:00
PSeitz
8199aa7de7 bump version to 0.20.2 (#2089) 0.20.2 2023-06-12 18:56:54 +08:00
PSeitz
657f0cd3bd add missing Bytes validation to term_agg (#2077)
returns empty for now instead of failing like before
2023-06-12 16:38:07 +08:00
Adam Reichold
3a82ef2560 Fix is_child_of function not considering the root facet. (#2086) 2023-06-12 08:35:18 +02:00
PSeitz
3546e7fc63 small agg limit docs improvement (#2073)
small docs improvement as follow up on bug https://github.com/quickwit-oss/quickwit/issues/3503
2023-06-12 10:55:24 +09:00
PSeitz
862f367f9e release without Alice in Wonderland, bump version to 0.20.1 (#2087)
* Release without Alice in Wonderland

* bump version to 0.20.1
2023-06-12 10:54:03 +09:00
PSeitz
14137d91c4 Update CHANGELOG.md (#2081) 2023-06-12 10:53:40 +09:00
François Massot
924fc70cb5 Merge pull request #2088 from quickwit-oss/fmassot/align-type-priorities-for-json-numbers
Align numerical type priority order on the search side.
2023-06-11 22:04:54 +02:00
François Massot
07023948aa Add test that indexes and searches a JSON field. 2023-06-11 21:47:52 +02:00
François Massot
0cb53207ec Fix tests. 2023-06-11 12:13:35 +02:00
François Massot
17c783b4db Align numerical type priority order on the search side. 2023-06-11 11:49:27 +02:00
Harrison Burt
7220df8a09 Fix building on windows with mmap (#2070)
* Fix windows build

* Make pub

* Update docs

* Re arrange

* Fix compilation error on unix

* Fix unix borrows

* Revert "Fix unix borrows"

This reverts commit c1d94fd12b.

* Fix unix borrows and revert original change

* Fix warning

* Cleaner code.

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
0.20.1
2023-06-10 18:32:39 +02:00
PSeitz
e3eacb4388 release tantivy (#2083)
* prerelease

* chore: Release
0.20
2023-06-09 10:47:46 +02:00
PSeitz
fdecb79273 tokenizer-api: reduce Tokenizer overhead (#2062)
* tokenizer-api: reduce Tokenizer overhead

Previously a new `Token` for each text encountered was created, which
contains `String::with_capacity(200)`
In the new API the token_stream gets mutable access to the tokenizer,
this allows state to be shared (in this PR Token is shared).
Ideally the allocation for the BoxTokenStream would also be removed, but
this may require some lifetime tricks.

* simplify api

* move lowercase and ascii folding buffer to global

* empty Token text as default
2023-06-08 18:37:58 +08:00
PSeitz
27f202083c Improve Termmap Indexing Performance +~30% (#2058)
* update benchmark

* Improve Termmap Indexing Performance +~30%

This contains many small changes to improve Termmap performance.
Most notably:
* Specialized byte compare and equality versions, instead of glibc calls.
* ExpUnrolledLinkedList to not contain inline items.

Allow compare hash only via a feature flag compare_hash_only:
64bits should be enough with a good hash function to compare strings by
their hashes instead of comparing the strings. Disabled by default

CreateHashMap/alice/174693
                        time:   [642.23 µs 643.80 µs 645.24 µs]
                        thrpt:  [258.20 MiB/s 258.78 MiB/s 259.41 MiB/s]
                 change:
                        time:   [-14.429% -13.303% -12.348%] (p = 0.00 < 0.05)
                        thrpt:  [+14.088% +15.344% +16.862%]
                        Performance has improved.
CreateHashMap/alice_expull/174693
                        time:   [877.03 µs 880.44 µs 884.67 µs]
                        thrpt:  [188.32 MiB/s 189.22 MiB/s 189.96 MiB/s]
                 change:
                        time:   [-26.460% -26.274% -26.091%] (p = 0.00 < 0.05)
                        thrpt:  [+35.301% +35.637% +35.981%]
                        Performance has improved.
CreateHashMap/numbers_zipf/8000000
                        time:   [9.1198 ms 9.1573 ms 9.1961 ms]
                        thrpt:  [829.64 MiB/s 833.15 MiB/s 836.57 MiB/s]
                 change:
                        time:   [-35.229% -34.828% -34.384%] (p = 0.00 < 0.05)
                        thrpt:  [+52.403% +53.440% +54.390%]
                        Performance has improved.

* clippy

* add bench for ids

* inline(always) to inline whole block with bounds checks

* cleanup
2023-06-08 11:13:52 +02:00
PSeitz
ccb09aaa83 allow histogram bounds to be passed as Rfc3339 (#2076) 2023-06-08 09:07:08 +02:00
Valerii
4b7c485a08 feat: add stop words for Hungarian language (#2069) 2023-06-02 07:26:03 +02:00
PSeitz
3942fc6d2b update CHANGELOG (#2068) 2023-06-02 05:00:12 +02:00
Adam Reichold
b325d569ad Expose phrase-prefix queries via the built-in query parser (#2044)
* Expose phrase-prefix queries via the built-in query parser

This proposes the less-than-imaginative syntax `field:"phrase ter"*` to
perform a phrase prefix query against `field` using `phrase` and `ter` as the
terms. The aim of this is to make this type of query more discoverable and
simplify manual testing.

I did consider exposing the `max_expansions` parameter similar to how slop is
handled, but I think that this is rather something that should be configured via
the querser parser (similar to `set_field_boost` and `set_field_fuzzy`) as
choosing it requires rather intimiate knowledge of the backing index.

* Prevent construction of zero or one term phrase-prefix queries via the query parser.

* Add example using phrase-prefix search via surface API to improve feature discoverability.
2023-06-01 13:03:16 +02:00
Paul Masurel
7ee78bda52 Readding s in datetime precision variant names (#2065)
There is no clear win and it change some serialization in quickwit.
2023-06-01 06:39:46 +02:00
Paul Masurel
184a9daa8a Cancels concurrently running actions for the same PR. (#2067) 2023-06-01 12:57:38 +09:00
Paul Masurel
47e01b345b Simplified linear probing code (#2066) 2023-06-01 04:58:42 +02:00
PSeitz
3af456972e Fix min doc_count empty merge bug (#2057)
This fixes an issue when min_doc==0 loads terms from the dictionary from
one segment and merges the same term with a subaggregation from another
segment.
Previously the empty structure was not correctly initialized to contain
the subaggregation so the merge was incorrect.
2023-05-29 14:20:50 +08:00
PSeitz
e56addc63e enable tokenizer on json fields (#2053)
* enable tokenizer on json fields

enable tokenizer on json fields for type text

* Avoid making the tokenizer within the TextAnalyzer pub(crate)

* Moving BoxableTokenizer to tantivy.

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2023-05-24 10:47:39 +02:00
dependabot[bot]
4be6f83b0a Update criterion requirement from 0.4 to 0.5 (#2056)
Updates the requirements on [criterion](https://github.com/bheisler/criterion.rs) to permit the latest version.
- [Changelog](https://github.com/bheisler/criterion.rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/bheisler/criterion.rs/compare/0.4.0...0.5.0)

---
updated-dependencies:
- dependency-name: criterion
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-05-24 15:59:51 +09:00
Adrien Guillo
a789ad9aee Rename DatePrecision to DateTimePrecision (#2051) 2023-05-23 17:09:11 +02:00
Sergei Lavrentev
8cf26da4b2 Add possibility to set up highlighten prefix and postfix for snippet (#1422)
* add possibility to change highlight prefix and postfix

* add comment to Snippet::new

* add test for highlighten elements

* add default highlight prefix and postfix constants

* fix spelling

* fix tests

* fix spelling

* do fixes after code review

* reduce test_snippet_generator_custom_highlighted_elements code

* fix fmt

* change names to more convenient

---------

Co-authored-by: Sergei Lavrentev <23312691+lavrxxx@users.noreply.github.com>
2023-05-23 15:09:24 +02:00
trinity-1686a
a3f001360f add support for warming up range of terms (#2042)
* add support for warming up range of terms

* simplify handling of limit
2023-05-22 14:29:35 +02:00
trinity-1686a
6564e0c467 fix phrase prefix query (#2043)
* fix phrase prefix query

it would fail spectacularly when no doc in the segment would match the phrase part of the query

* clippy
2023-05-22 12:36:20 +02:00
Paul Masurel
d7e97331e5 Minor refactoring find field (#2055)
* Minor refactoring

Moving find_field_with_default to Schema.

* Clippy comments
2023-05-22 15:00:48 +09:00
Paul Masurel
4417be165d Minor refactoring (#2054)
Moving find_field_with_default to Schema.
2023-05-22 14:56:38 +09:00
PSeitz
6239697a02 switch to ms in histogram for date type (#2045)
* switch to ms in histogram for date type

switch to ms in histogram, by adding a normalization step that converts
to nanoseconds precision when creating the collector.

closes #2028
related to #2026

* add missing unit long variants

* use single thread to avoid handling test case

* fix docs

* revert CI

* cleanup

* improve docs

* Update src/aggregation/bucket/histogram/histogram.rs

Co-authored-by: Paul Masurel <paul@quickwit.io>

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2023-05-19 08:15:44 +02:00
Paul Masurel
62709b8094 Change in the query grammar. (#2050)
* Change in the query grammar.

Quotation mark can now be used for phrase queries.
The delimiter is part of the `UserInputLeaf`.
That information is meant to be used in Quickwit to solve #3364.

This PR also adds support for quotation marks escaping in phrase
queries.

* Apply suggestions from code review
2023-05-19 12:07:10 +09:00
PSeitz
04562c0318 add fastfield tokenizer to IndexBuilder (#2046) 2023-05-18 04:33:42 +02:00
PSeitz
2dfe37940d handle multiple types in term aggregation (#2041) 2023-05-15 11:57:38 +02:00
Denis Bazhenov
e248a4959f Enforcing "NOT" and "-" queries consistency in UserInputAst (#1609)
* Enforcing "NOT" and "-" queries consistency in UserInputAst

* Mutable implementation if rewrite_ast_clause()
2023-05-13 00:27:48 +09:00
PSeitz
00c5df610c update termmap benchmark (#2040) 2023-05-12 07:35:06 +02:00
Adam Reichold
fedd9559e7 Expose create a query from a user input AST. (#2039) 2023-05-11 21:53:18 +09:00
Paul Masurel
fe3ecf9567 Added support for madvise (#2036)
Added support for madvise
2023-05-11 05:39:17 +02:00
PSeitz
ba3a885a3b handle multiple agg results (#2035)
handle multiple intermediate aggregation results with the same name.
2023-05-10 15:00:38 +02:00
PSeitz
d1988be8e9 fix and extend benchmark (#2030)
* add benchmark, add missing inlines

* fix stacker bench

* add wiki benchmark

* move line split out of bench
2023-05-10 13:01:56 +02:00
PSeitz
0eafbaab8e fix slop (#2031)
Fix slop by carrying slop so far for multiterms.
Define slop contract in the API
2023-05-10 11:45:14 +02:00
PSeitz
d3357a8426 fix ArenaHashMap default (#2034)
an empty ArenaHashMap is invalid and causes a panic when combined with `get`
2023-05-10 11:39:47 +02:00
Yuri Astrakhan
74275b76a6 Inline format arguments where makes sense (#2038)
Applied this command to the code, making it a bit shorter and slightly
more readable.

```
cargo +nightly clippy --all-features --benches --tests --workspace --fix -- -A clippy::all -W clippy::uninlined_format_args
cargo +nightly fmt --all
```
2023-05-10 18:03:59 +09:00
dependabot[bot]
f479840a1b Update memmap2 requirement from 0.5.3 to 0.6.0 (#2033)
Updates the requirements on [memmap2](https://github.com/RazrFalcon/memmap2-rs) to permit the latest version.
- [Changelog](https://github.com/RazrFalcon/memmap2-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/RazrFalcon/memmap2-rs/compare/v0.5.3...v0.6.0)

---
updated-dependencies:
- dependency-name: memmap2
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-05-10 03:50:14 +02:00
PSeitz
4ee1b5cda0 add seperate tokenizer manager for fast fields (#2019)
* add seperate tokenizer manager for fast fields

* rename
2023-05-08 11:22:31 +02:00
PSeitz
45ff0e3c5c clear memory consumption in AggregationLimits (#2022)
* clear memory consumption in AggregationLimits

clear memory consumption in AggregationLimits at the end of segment collection

* switch to ResourceLimitGuard

* unduplicate code

* merge methods

* Apply suggestions from code review

Co-authored-by: Paul Masurel <paul@quickwit.io>

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2023-05-08 10:15:09 +02:00
PSeitz
4c58b0086d allow slop in both directions (#2020)
* allow slop in both directions

allow slop in both directions
so "big wolf"~3 can also match "wolf big"

This also fixes #1934, when the docsets were reordered by size and didn't
match the terms.

* remove count

* add test for repeating tokens, unduplicate tests
2023-05-07 12:05:21 +09:00
Tomoko Uchida
85df322ceb fix typo in the architecture doc (#2009) 2023-05-07 12:04:07 +09:00
François Massot
38c863830f Merge pull request #2027 from quickwit-oss/fmassot/fix-date-histogram
Fix date histogram bounds and field name.
2023-05-05 13:03:25 +02:00