Commit Graph

2981 Commits

Author SHA1 Message Date
Adrien Guillo
a789ad9aee Rename DatePrecision to DateTimePrecision (#2051) 2023-05-23 17:09:11 +02:00
Sergei Lavrentev
8cf26da4b2 Add possibility to set up highlighten prefix and postfix for snippet (#1422)
* add possibility to change highlight prefix and postfix

* add comment to Snippet::new

* add test for highlighten elements

* add default highlight prefix and postfix constants

* fix spelling

* fix tests

* fix spelling

* do fixes after code review

* reduce test_snippet_generator_custom_highlighted_elements code

* fix fmt

* change names to more convenient

---------

Co-authored-by: Sergei Lavrentev <23312691+lavrxxx@users.noreply.github.com>
2023-05-23 15:09:24 +02:00
trinity-1686a
a3f001360f add support for warming up range of terms (#2042)
* add support for warming up range of terms

* simplify handling of limit
2023-05-22 14:29:35 +02:00
trinity-1686a
6564e0c467 fix phrase prefix query (#2043)
* fix phrase prefix query

it would fail spectacularly when no doc in the segment would match the phrase part of the query

* clippy
2023-05-22 12:36:20 +02:00
Paul Masurel
d7e97331e5 Minor refactoring find field (#2055)
* Minor refactoring

Moving find_field_with_default to Schema.

* Clippy comments
2023-05-22 15:00:48 +09:00
Paul Masurel
4417be165d Minor refactoring (#2054)
Moving find_field_with_default to Schema.
2023-05-22 14:56:38 +09:00
PSeitz
6239697a02 switch to ms in histogram for date type (#2045)
* switch to ms in histogram for date type

switch to ms in histogram, by adding a normalization step that converts
to nanoseconds precision when creating the collector.

closes #2028
related to #2026

* add missing unit long variants

* use single thread to avoid handling test case

* fix docs

* revert CI

* cleanup

* improve docs

* Update src/aggregation/bucket/histogram/histogram.rs

Co-authored-by: Paul Masurel <paul@quickwit.io>

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2023-05-19 08:15:44 +02:00
Paul Masurel
62709b8094 Change in the query grammar. (#2050)
* Change in the query grammar.

Quotation mark can now be used for phrase queries.
The delimiter is part of the `UserInputLeaf`.
That information is meant to be used in Quickwit to solve #3364.

This PR also adds support for quotation marks escaping in phrase
queries.

* Apply suggestions from code review
2023-05-19 12:07:10 +09:00
PSeitz
04562c0318 add fastfield tokenizer to IndexBuilder (#2046) 2023-05-18 04:33:42 +02:00
PSeitz
2dfe37940d handle multiple types in term aggregation (#2041) 2023-05-15 11:57:38 +02:00
Denis Bazhenov
e248a4959f Enforcing "NOT" and "-" queries consistency in UserInputAst (#1609)
* Enforcing "NOT" and "-" queries consistency in UserInputAst

* Mutable implementation if rewrite_ast_clause()
2023-05-13 00:27:48 +09:00
PSeitz
00c5df610c update termmap benchmark (#2040) 2023-05-12 07:35:06 +02:00
Adam Reichold
fedd9559e7 Expose create a query from a user input AST. (#2039) 2023-05-11 21:53:18 +09:00
Paul Masurel
fe3ecf9567 Added support for madvise (#2036)
Added support for madvise
2023-05-11 05:39:17 +02:00
PSeitz
ba3a885a3b handle multiple agg results (#2035)
handle multiple intermediate aggregation results with the same name.
2023-05-10 15:00:38 +02:00
PSeitz
d1988be8e9 fix and extend benchmark (#2030)
* add benchmark, add missing inlines

* fix stacker bench

* add wiki benchmark

* move line split out of bench
2023-05-10 13:01:56 +02:00
PSeitz
0eafbaab8e fix slop (#2031)
Fix slop by carrying slop so far for multiterms.
Define slop contract in the API
2023-05-10 11:45:14 +02:00
PSeitz
d3357a8426 fix ArenaHashMap default (#2034)
an empty ArenaHashMap is invalid and causes a panic when combined with `get`
2023-05-10 11:39:47 +02:00
Yuri Astrakhan
74275b76a6 Inline format arguments where makes sense (#2038)
Applied this command to the code, making it a bit shorter and slightly
more readable.

```
cargo +nightly clippy --all-features --benches --tests --workspace --fix -- -A clippy::all -W clippy::uninlined_format_args
cargo +nightly fmt --all
```
2023-05-10 18:03:59 +09:00
dependabot[bot]
f479840a1b Update memmap2 requirement from 0.5.3 to 0.6.0 (#2033)
Updates the requirements on [memmap2](https://github.com/RazrFalcon/memmap2-rs) to permit the latest version.
- [Changelog](https://github.com/RazrFalcon/memmap2-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/RazrFalcon/memmap2-rs/compare/v0.5.3...v0.6.0)

---
updated-dependencies:
- dependency-name: memmap2
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-05-10 03:50:14 +02:00
PSeitz
4ee1b5cda0 add seperate tokenizer manager for fast fields (#2019)
* add seperate tokenizer manager for fast fields

* rename
2023-05-08 11:22:31 +02:00
PSeitz
45ff0e3c5c clear memory consumption in AggregationLimits (#2022)
* clear memory consumption in AggregationLimits

clear memory consumption in AggregationLimits at the end of segment collection

* switch to ResourceLimitGuard

* unduplicate code

* merge methods

* Apply suggestions from code review

Co-authored-by: Paul Masurel <paul@quickwit.io>

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2023-05-08 10:15:09 +02:00
PSeitz
4c58b0086d allow slop in both directions (#2020)
* allow slop in both directions

allow slop in both directions
so "big wolf"~3 can also match "wolf big"

This also fixes #1934, when the docsets were reordered by size and didn't
match the terms.

* remove count

* add test for repeating tokens, unduplicate tests
2023-05-07 12:05:21 +09:00
Tomoko Uchida
85df322ceb fix typo in the architecture doc (#2009) 2023-05-07 12:04:07 +09:00
François Massot
38c863830f Merge pull request #2027 from quickwit-oss/fmassot/fix-date-histogram
Fix date histogram bounds and field name.
2023-05-05 13:03:25 +02:00
François Massot
992f755298 Fix clippy. 2023-05-05 10:51:29 +02:00
François Massot
c8df843f96 Fix date histogram bounds and field name. 2023-05-05 00:52:55 +02:00
Paul Masurel
f28ddb711e Exposing u64-based FastFieldRangeWeight (#2024) 2023-05-03 18:32:00 +09:00
tottoto
73452284ae Remove unused crates from dependencies (#2018)
* Remove unused crates from dependencies

* Revert rand to columnar

* Revert criterion to stacker
2023-05-02 12:34:20 +02:00
PSeitz
ba309e18a1 switch to nanosecond precision (#2016) 2023-05-01 03:32:20 +02:00
PSeitz
cbf2bdc75b change bucket count type (#2013)
* change bucket count type

closes #2012

* Update src/aggregation/agg_limits.rs

Co-authored-by: Paul Masurel <paul@quickwit.io>

* Update src/directory/managed_directory.rs

Co-authored-by: Paul Masurel <paul@quickwit.io>

* fix test

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2023-04-27 15:47:31 +08:00
PSeitz
1f06997d04 fix single collector special case (#2014) 2023-04-27 09:30:19 +02:00
PSeitz
c599bf3b6c chore!:drop JSON support on intermediate agg result (#1992)
* chore!:drop JSON support on intermediate agg result

add support for other formats by removing skip_serialize and untagged
JSON support is broken anyway due it's lack on f64::INF etc. handling

* Update src/aggregation/intermediate_agg_result.rs

Co-authored-by: Paul Masurel <paul@quickwit.io>

* move from impl

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2023-04-26 13:05:16 +02:00
PSeitz
80df1d9835 Handle error for exists on MMapDirectory (#1988)
`exists` will return false in case of other io errors, like permission denied
2023-04-25 09:20:33 +02:00
PSeitz
2e369db936 switch to Aggregation without serde_untagged (#2003)
* refactor result handling

* remove Internal stuff

* merge different accessors

* switch to Aggregation without serde_untagged

* fix doctests
2023-04-25 08:54:51 +02:00
PSeitz
7b31100208 refactor vint (#2010)
- improve performance of vint
vint serialization shows up in performance profiles during indexing.
It would also make sense to limit the value space to u29 and operate on 4 bytes only.
- remove unused code
- add missing inlines
- fix regex test
2023-04-25 08:49:36 +02:00
trinity-1686a
9c93bfeb51 optimise warmup code path (#2007)
* optimise warmup code path

* better function naming
2023-04-21 11:23:09 +02:00
PSeitz
74f9eafefc refactor Term (#2006)
* refactor Term

add ValueBytes for serialized term values
add missing debug for ip
skip unnecessary json path validation
remove code duplication
add DATE_TIME_PRECISION_INDEXED constant
add missing Term clarification
remove weird value_bytes_mut() API

* fix naming
2023-04-20 15:31:43 +02:00
RT_Enzyme
ff3d3313c4 fix BooleanQuery document (#1999)
* fix BooleanQuery document

* Update src/query/boolean_query/boolean_query.rs

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2023-04-20 11:37:20 +02:00
Paul Masurel
fbda511a1a Making more things public for quickwit. (#2005) 2023-04-20 11:37:45 +09:00
Adam Reichold
c1defdda05 Bump aho-corasick dependency to version 1.0 and adjust to API changes (#2002)
* Drop additional Arc-layer as the automaton itself is now cheap-to-clone.
* Drop state ID type parameter as it is not exposed by the library any more.
2023-04-18 07:34:30 +02:00
PSeitz
e522163a1c use json in agg tests (#1998)
* switch to JSON in tests, add flat aggregation types

* use method

* clippy

* remove commented file
2023-04-17 14:08:48 +02:00
PSeitz
e83abbfe4a perf: faster term hash map (#1940)
* add term hashmap benchmark

* refactor arena hashmap

add inlines
remove occupied array and use table_entry.is_empty instead (saves 4 bytes per entry)
reduce saturation threshold from 1/3 to 1/2 to reduce memory
use u32 for UnorderedId (we have the 4billion limit anyways on the Columnar stuff)
fix naming LinearProbing
remove byteorder dependency

memory consumption went down from 2Gb to 1.8GB on indexing wikipedia dataset in tantivy

* Update stacker/src/arena_hashmap.rs

Co-authored-by: Paul Masurel <paul@quickwit.io>

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2023-04-17 09:07:33 +02:00
trinity-1686a
780e26331d sstable compression (#1946)
* compress sstable with zstd

* add some details to sstable readme

* compress only block which benefit from it

* multiple changes to sstable

make compression optional
use OwnedBytes instead of impl Read in sstable, required for next point
use zstd bulk api, which is much faster on small records

* cleanup and use bulk api for compression

* use dedicated byte for compression

* switch block len and compression flag

* change default zstd level in sstable
2023-04-14 16:25:50 +02:00
trinity-1686a
0286ecea09 re-export a few sstable functions on dicitonary (#1996)
* re-export a few sstable functions on dicitonary

* Update documentation

Co-authored-by: François Massot <francois.massot@gmail.com>

---------

Co-authored-by: François Massot <francois.massot@gmail.com>
2023-04-14 11:13:48 +02:00
PSeitz
b0ef9a6252 use crates.io dependency (#1990) 2023-04-14 09:35:20 +08:00
François Massot
36138c493b Merge pull request #1994 from quickwit-oss/fmassot/expose-simple-token-stream
Expose `SimpleTokenStream` to use it in quickwit for the multilanguage tokenizer
2023-04-13 18:55:02 +02:00
François Massot
64bce340b2 Expose to use it in quickwit. 2023-04-13 18:28:53 +02:00
trinity-1686a
205e8a0a92 encode dictionary type in fst footer (#1968)
* encode additional footer for dictionary kind in fst
2023-04-12 09:43:01 +02:00
Paul Masurel
4b01cc4c49 Made BooleanWeight and BoostWeight public (#1991) 2023-04-12 10:26:30 +09:00