Pascal Seitz
18de6f477b
add failing test check_num_columnar_fields
2023-07-13 18:19:02 +09:00
PSeitz
1e7cd48cfa
remove allocations in split compound words ( #2080 )
...
* remove allocations in split compound words
* clear reused data
2023-07-13 09:43:02 +09:00
dependabot[bot]
7f51d85bbd
Update lru requirement from 0.10.0 to 0.11.0 ( #2117 )
...
Updates the requirements on [lru](https://github.com/jeromefroe/lru-rs) to permit the latest version.
- [Changelog](https://github.com/jeromefroe/lru-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/jeromefroe/lru-rs/compare/0.10.0...0.11.0)
---
updated-dependencies:
- dependency-name: lru
  dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-07-13 09:42:21 +09:00
PSeitz
ad76e32398
Update CHANGELOG.md ( #2091 )
...
* Update CHANGELOG.md
* Update CHANGELOG.md
2023-07-11 13:58:49 +08:00
dependabot[bot]
7575f9bf1c
Update itertools requirement from 0.10.3 to 0.11.0 ( #2098 )
...
Updates the requirements on [itertools](https://github.com/rust-itertools/itertools) to permit the latest version.
- [Changelog](https://github.com/rust-itertools/itertools/blob/master/CHANGELOG.md)
- [Commits](https://github.com/rust-itertools/itertools/compare/v0.10.5...v0.11.0)
---
updated-dependencies:
- dependency-name: itertools
  dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-07-07 11:14:46 +02:00
Naveen Aiathurai
67bdf3f5f6
fixes order_by_u64_field and order_by_fast_field should allow sorting in ascending order #1676 ( #2111 )
...
* feat: order_by_fast_field allows sorting using parameter order
* chore: change the corresponding values to original one
* chore: fix formatting issues
* fix: first_or_default_col should also sort by order
* chore: add empty doc to test case and fix doctests
* chore: fix failure tests
* core: add empty document without fastfield
* chore: fix fmt
* chore: change variable name
2023-07-06 05:10:10 +02:00
François Massot
3c300666ad
Merge pull request #2110 from quickwit-oss/fulmicoton/dynamic-follow-up
...
Add dynamic filters to text analyzer builder.
2023-07-03 21:49:24 +02:00
François Massot
b91d3f6be4
Clean comment on 'TextAnalyzerBuilder::filter_dynamic' method.
2023-07-03 18:45:59 +02:00
François Massot
a8e76513bb
Remove useless clone.
2023-07-03 22:05:11 +09:00
François Massot
0a23201338
Fix stack overflow and add docs.
2023-07-03 22:05:11 +09:00
François Massot
81330aaf89
WIP
2023-07-03 22:05:10 +09:00
Paul Masurel
98a3b01992
Removing the BoxedTokenizer
2023-07-03 22:05:10 +09:00
Paul Masurel
d341520938
Dynamic follow up
2023-07-03 22:05:10 +09:00
François Massot
5c9af73e41
Followup fulmicoton poc.
2023-07-03 22:05:10 +09:00
Paul Masurel
ad4c940fa3
proof of concept for dynamic tokenizer.
2023-07-03 22:05:10 +09:00
Paul Masurel
910b0b0c61
Cargo fmt
2023-07-03 22:03:31 +09:00
PSeitz
3fef052bf1
fix flaky test ( #2107 )
...
closes #2099
2023-06-29 14:30:56 +08:00
PSeitz
040554f2f9
Update to lz4_flex 0.11 ( #2106 )
2023-06-29 14:16:00 +08:00
PSeitz
17186ca9c9
improve docs ( #2105 )
2023-06-27 13:37:14 +08:00
François Massot
212d59c9ab
Merge pull request #2102 from quickwit-oss/fmassot/ngram-new-should-return-error
...
Ngram tokenizer now returns an error with invalid arguments.
2023-06-27 05:36:09 +02:00
dependabot[bot]
1a1f252a3f
Update memmap2 requirement from 0.6.0 to 0.7.1 ( #2104 )
...
Updates the requirements on [memmap2](https://github.com/RazrFalcon/memmap2-rs) to permit the latest version.
- [Changelog](https://github.com/RazrFalcon/memmap2-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/RazrFalcon/memmap2-rs/compare/v0.6.0...v0.7.1)
---
updated-dependencies:
- dependency-name: memmap2
  dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-06-27 05:15:43 +02:00
François Massot
d73706dede
Ngram tokenizer now returns an error with invalid arguments.
2023-06-25 20:13:24 +02:00
PSeitz
44850e1036
move fail dep to dev only ( #2094 )
...
wasm compilation fails with dep only
2023-06-22 06:59:11 +02:00
Adam Reichold
3b0cbf8102
Cosmetic updates to the warmer example. ( #2095 )
...
Just some cosmetic tweaks to make the example easier on the eyes as a colleague
was staring at this for quite some time this week.
2023-06-22 11:25:01 +09:00
Adam Reichold
4aa131c3db
Make TextAnalyzerBuilder publicly accessible ( #2097 )
...
This way, client code can name the type to e.g. store it inside structs without
resorting to generics and it means that its documentation is part of the crate
documentation generated by `cargo doc`.
2023-06-22 11:24:21 +09:00
Naveen Aiathurai
59962097d0
fix : #2078 return error when tokenizer not found while indexing ( #2093 )
...
* fix : #2078 return error when tokenizer not found while indexing
* chore: formatting issues
* chore: fix review comments
2023-06-16 04:33:55 +02:00
Adam Reichold
ebc78127f3
Add BytesFilterCollector to support filtering based on a bytes fast field ( #2075 )
...
* Do some Clippy- and Cargo-related boy-scouting.
* Add BytesFilterCollector to support filtering based on a bytes fast field
This is basically a copy of the existing FilterCollector but modified and
specialised to work on a bytes fast field.
* Changed semantics of filter collectors to consider multi-valued fields
2023-06-13 14:19:58 +09:00
PSeitz
8199aa7de7
bump version to 0.20.2 ( #2089 )
0.20.2
2023-06-12 18:56:54 +08:00
PSeitz
657f0cd3bd
add missing Bytes validation to term_agg ( #2077 )
...
returns empty for now instead of failing like before
2023-06-12 16:38:07 +08:00
Adam Reichold
3a82ef2560
Fix is_child_of function not considering the root facet. ( #2086 )
2023-06-12 08:35:18 +02:00
PSeitz
3546e7fc63
small agg limit docs improvement ( #2073 )
...
small docs improvement as follow up on bug https://github.com/quickwit-oss/quickwit/issues/3503
2023-06-12 10:55:24 +09:00
PSeitz
862f367f9e
release without Alice in Wonderland, bump version to 0.20.1 ( #2087 )
...
* Release without Alice in Wonderland
* bump version to 0.20.1
2023-06-12 10:54:03 +09:00
PSeitz
14137d91c4
Update CHANGELOG.md ( #2081 )
2023-06-12 10:53:40 +09:00
François Massot
924fc70cb5
Merge pull request #2088 from quickwit-oss/fmassot/align-type-priorities-for-json-numbers
...
Align numerical type priority order on the search side.
2023-06-11 22:04:54 +02:00
François Massot
07023948aa
Add test that indexes and searches a JSON field.
2023-06-11 21:47:52 +02:00
François Massot
0cb53207ec
Fix tests.
2023-06-11 12:13:35 +02:00
François Massot
17c783b4db
Align numerical type priority order on the search side.
2023-06-11 11:49:27 +02:00
Harrison Burt
7220df8a09
Fix building on windows with mmap ( #2070 )
...
* Fix windows build
* Make pub
* Update docs
* Rearrange
* Fix compilation error on unix
* Fix unix borrows
* Revert "Fix unix borrows"
This reverts commit c1d94fd12b.
* Fix unix borrows and revert original change
* Fix warning
* Cleaner code.
---------
Co-authored-by: Paul Masurel <paul@quickwit.io>
0.20.1
2023-06-10 18:32:39 +02:00
PSeitz
e3eacb4388
release tantivy ( #2083 )
...
* prerelease
* chore: Release
0.20
2023-06-09 10:47:46 +02:00
PSeitz
fdecb79273
tokenizer-api: reduce Tokenizer overhead ( #2062 )
...
* tokenizer-api: reduce Tokenizer overhead
Previously, a new `Token` was created for each text encountered, and each
`Token` contains a `String::with_capacity(200)`.
In the new API, the token_stream gets mutable access to the tokenizer, which
allows state to be shared (in this PR, the Token is shared).
Ideally the allocation for the BoxTokenStream would also be removed, but
this may require some lifetime tricks.
* simplify api
* move lowercase and ascii folding buffer to global
* empty Token text as default
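A minimal sketch of the idea described above, assuming a simplified whitespace tokenizer: the tokenizer owns a single `Token`, and `token_stream` borrows it mutably so the same `String` buffer is reused for every text. The names `WhitespaceTokenizer`, `WhitespaceTokenStream` and `collect_tokens` are hypothetical and are not tantivy's actual types or signatures.

```rust
/// Hypothetical, simplified token type (not tantivy's actual `Token`).
#[derive(Debug, Default)]
pub struct Token {
    pub offset_from: usize,
    pub offset_to: usize,
    pub position: usize,
    pub text: String,
}

/// Illustrative tokenizer that owns one reusable `Token`.
#[derive(Default)]
pub struct WhitespaceTokenizer {
    token: Token,
}

/// Stream that borrows the tokenizer's `Token` instead of allocating its own.
pub struct WhitespaceTokenStream<'a> {
    text: &'a str,
    cursor: usize,
    token: &'a mut Token,
}

impl WhitespaceTokenizer {
    /// Takes `&mut self` so the shared token (and its buffer) can be reused.
    pub fn token_stream<'a>(&'a mut self, text: &'a str) -> WhitespaceTokenStream<'a> {
        // Reset the shared state; `clear` keeps the allocated capacity.
        self.token.text.clear();
        self.token.offset_from = 0;
        self.token.offset_to = 0;
        // Wraps around to 0 on the first `advance`.
        self.token.position = usize::MAX;
        WhitespaceTokenStream { text, cursor: 0, token: &mut self.token }
    }
}

impl<'a> WhitespaceTokenStream<'a> {
    /// Moves to the next token; returns false when the text is exhausted.
    pub fn advance(&mut self) -> bool {
        let trimmed = self.text[self.cursor..].trim_start();
        if trimmed.is_empty() {
            return false;
        }
        // `trimmed` is a suffix of `text`, so its start offset follows from the lengths.
        let start = self.text.len() - trimmed.len();
        let len = trimmed.find(char::is_whitespace).unwrap_or(trimmed.len());
        let end = start + len;
        self.token.text.clear();
        self.token.text.push_str(&self.text[start..end]);
        self.token.offset_from = start;
        self.token.offset_to = end;
        self.token.position = self.token.position.wrapping_add(1);
        self.cursor = end;
        true
    }

    pub fn token(&self) -> &Token {
        &*self.token
    }
}

/// Helper for the sketch: collects token texts for one input.
pub fn collect_tokens(tokenizer: &mut WhitespaceTokenizer, text: &str) -> Vec<String> {
    let mut stream = tokenizer.token_stream(text);
    let mut out = Vec::new();
    while stream.advance() {
        out.push(stream.token().text.clone());
    }
    out
}
```

Calling `collect_tokens` repeatedly on the same tokenizer reuses one `Token` across all inputs; only the cloned output strings in this helper allocate.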
2023-06-08 18:37:58 +08:00
PSeitz
27f202083c
Improve Termmap Indexing Performance +~30% ( #2058 )
...
* update benchmark
* Improve Termmap Indexing Performance +~30%
This contains many small changes to improve Termmap performance.
Most notably:
* Specialized byte compare and equality versions, instead of glibc calls.
* ExpUnrolledLinkedList to not contain inline items.
Allow comparing by hash only via a feature flag, compare_hash_only:
with a good hash function, 64 bits should be enough to compare strings by
their hashes instead of comparing the strings. Disabled by default.
CreateHashMap/alice/174693
time: [642.23 µs 643.80 µs 645.24 µs]
thrpt: [258.20 MiB/s 258.78 MiB/s 259.41 MiB/s]
change:
time: [-14.429% -13.303% -12.348%] (p = 0.00 < 0.05)
thrpt: [+14.088% +15.344% +16.862%]
Performance has improved.
CreateHashMap/alice_expull/174693
time: [877.03 µs 880.44 µs 884.67 µs]
thrpt: [188.32 MiB/s 189.22 MiB/s 189.96 MiB/s]
change:
time: [-26.460% -26.274% -26.091%] (p = 0.00 < 0.05)
thrpt: [+35.301% +35.637% +35.981%]
Performance has improved.
CreateHashMap/numbers_zipf/8000000
time: [9.1198 ms 9.1573 ms 9.1961 ms]
thrpt: [829.64 MiB/s 833.15 MiB/s 836.57 MiB/s]
change:
time: [-35.229% -34.828% -34.384%] (p = 0.00 < 0.05)
thrpt: [+52.403% +53.440% +54.390%]
Performance has improved.
* clippy
* add bench for ids
* inline(always) to inline whole block with bounds checks
* cleanup
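The hash-only comparison described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: `TermTable` is a hypothetical linear-probing table, a runtime boolean stands in for the `compare_hash_only` feature flag, and `DefaultHasher` is used only because it ships with the standard library.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical term table: each slot holds (hash, key bytes, term id).
pub struct TermTable {
    slots: Vec<Option<(u64, Vec<u8>, u32)>>,
    len: u32,
    /// Stand-in for the `compare_hash_only` feature flag (assumption: runtime bool).
    compare_hash_only: bool,
}

fn hash_bytes(key: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    hasher.finish()
}

impl TermTable {
    /// `capacity` must be a power of two so that `hash & mask` selects a slot.
    pub fn new(capacity: usize, compare_hash_only: bool) -> Self {
        assert!(capacity.is_power_of_two());
        TermTable { slots: vec![None; capacity], len: 0, compare_hash_only }
    }

    /// Returns the id of `key`, inserting it with a fresh id if it is new.
    pub fn get_or_insert(&mut self, key: &[u8]) -> u32 {
        // The sketch does not resize; keeping one slot free guarantees termination.
        assert!((self.len as usize) < self.slots.len(), "table full");
        let hash = hash_bytes(key);
        let mask = self.slots.len() - 1;
        let mut idx = (hash as usize) & mask;
        loop {
            if let Some((slot_hash, slot_key, id)) = &self.slots[idx] {
                // With hash-only comparison, equal 64-bit hashes are treated as
                // equal keys; otherwise the bytes are compared as well.
                let same = *slot_hash == hash
                    && (self.compare_hash_only || slot_key.as_slice() == key);
                if same {
                    return *id;
                }
                idx = (idx + 1) & mask; // linear probing
                continue;
            }
            let id = self.len;
            self.slots[idx] = Some((hash, key.to_vec(), id));
            self.len += 1;
            return id;
        }
    }
}
```

The trade-off is that hash-only mode skips the byte comparison on every hit, at the cost of treating two distinct keys as equal in the very unlikely event of a 64-bit collision.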
2023-06-08 11:13:52 +02:00
PSeitz
ccb09aaa83
allow histogram bounds to be passed as Rfc3339 ( #2076 )
2023-06-08 09:07:08 +02:00
Valerii
4b7c485a08
feat: add stop words for Hungarian language ( #2069 )
2023-06-02 07:26:03 +02:00
PSeitz
3942fc6d2b
update CHANGELOG ( #2068 )
2023-06-02 05:00:12 +02:00
Adam Reichold
b325d569ad
Expose phrase-prefix queries via the built-in query parser ( #2044 )
...
* Expose phrase-prefix queries via the built-in query parser
This proposes the less-than-imaginative syntax `field:"phrase ter"*` to
perform a phrase prefix query against `field` using `phrase` and `ter` as the
terms. The aim of this is to make this type of query more discoverable and
simplify manual testing.
I did consider exposing the `max_expansions` parameter similar to how slop is
handled, but I think that this is rather something that should be configured via
the query parser (similar to `set_field_boost` and `set_field_fuzzy`) as
choosing it requires rather intimate knowledge of the backing index.
* Prevent construction of zero or one term phrase-prefix queries via the query parser.
* Add example using phrase-prefix search via surface API to improve feature discoverability.
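To make the proposed syntax concrete, here is a small standalone sketch. It is not tantivy's query grammar or its `PhrasePrefixQuery`; `PhrasePrefix`, `parse_phrase_prefix` and `matches` are hypothetical names. It parses `field:"phrase ter"*`, rejects phrases with fewer than two terms, and matches a token sequence when all terms but the last match exactly and the last one matches as a prefix.

```rust
/// Hypothetical parsed form of `field:"phrase ter"*`.
#[derive(Debug, PartialEq)]
pub struct PhrasePrefix {
    pub field: String,
    pub terms: Vec<String>,
}

/// Parses `field:"t1 t2 ... tn"*`. Returns None for other syntax
/// or when the phrase has fewer than two terms.
pub fn parse_phrase_prefix(query: &str) -> Option<PhrasePrefix> {
    let (field, rest) = query.split_once(':')?;
    let phrase = rest.strip_prefix('"')?.strip_suffix("\"*")?;
    let terms: Vec<String> = phrase.split_whitespace().map(String::from).collect();
    if field.is_empty() || terms.len() < 2 {
        return None;
    }
    Some(PhrasePrefix { field: field.to_string(), terms })
}

/// True if some window of consecutive tokens matches the phrase, with the
/// last query term treated as a prefix of the last token in the window.
pub fn matches(query: &PhrasePrefix, doc_tokens: &[&str]) -> bool {
    let n = query.terms.len();
    if n == 0 {
        return false;
    }
    doc_tokens.windows(n).any(|window| {
        let (exact, last) = window.split_at(n - 1);
        exact
            .iter()
            .zip(&query.terms)
            .all(|(token, term)| *token == term.as_str())
            && last[0].starts_with(query.terms[n - 1].as_str())
    })
}
```

For example, `body:"phrase ter"*` would match a document tokenized as `a phrase term query`, but not `phrase query term`, since the terms must be adjacent.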
2023-06-01 13:03:16 +02:00
Paul Masurel
7ee78bda52
Readding s in datetime precision variant names ( #2065 )
...
There is no clear win and it changes some serialization in quickwit.
2023-06-01 06:39:46 +02:00
Paul Masurel
184a9daa8a
Cancels concurrently running actions for the same PR. ( #2067 )
2023-06-01 12:57:38 +09:00
Paul Masurel
47e01b345b
Simplified linear probing code ( #2066 )
2023-06-01 04:58:42 +02:00
PSeitz
3af456972e
Fix min doc_count empty merge bug ( #2057 )
...
This fixes an issue when min_doc==0 loads terms from the dictionary from
one segment and merges the same term with a subaggregation from another
segment.
Previously the empty structure was not correctly initialized to contain
the subaggregation so the merge was incorrect.
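A rough illustration of this class of bug, under simplifying assumptions: `Bucket` and `SumState` are hypothetical types, not tantivy's intermediate aggregation results, and the merge is assumed to walk the sub-aggregations already present on the left-hand side. An empty bucket that does not carry the sub-aggregation entries then silently drops the other segment's values.

```rust
use std::collections::BTreeMap;

/// Hypothetical sub-aggregation state (a running sum).
#[derive(Debug, Default, Clone, PartialEq)]
pub struct SumState {
    pub sum: u64,
}

/// Hypothetical term bucket with named sub-aggregations.
#[derive(Debug, Clone, PartialEq)]
pub struct Bucket {
    pub doc_count: u64,
    pub sub_aggs: BTreeMap<String, SumState>,
}

impl Bucket {
    /// Builds an empty bucket that already contains a zero state for every
    /// requested sub-aggregation, so a later merge has something to merge into.
    pub fn empty(sub_agg_names: &[&str]) -> Self {
        let sub_aggs = sub_agg_names
            .iter()
            .map(|name| (name.to_string(), SumState::default()))
            .collect();
        Bucket { doc_count: 0, sub_aggs }
    }

    /// Merges `other` into `self`, pairing sub-aggregations by name.
    /// Only names present in `self` are visited (assumption of this sketch).
    pub fn merge(&mut self, other: &Bucket) {
        self.doc_count += other.doc_count;
        for (name, state) in self.sub_aggs.iter_mut() {
            if let Some(other_state) = other.sub_aggs.get(name) {
                state.sum += other_state.sum;
            }
        }
    }
}
```

With `Bucket::empty(&["total"])` the merge keeps the other segment's `total`; with `Bucket::empty(&[])` the same merge loses it, which mirrors the incorrect result described above.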
2023-05-29 14:20:50 +08:00
PSeitz
e56addc63e
enable tokenizer on json fields ( #2053 )
...
* enable tokenizer on json fields
enable tokenizer on json fields for type text
* Avoid making the tokenizer within the TextAnalyzer pub(crate)
* Moving BoxableTokenizer to tantivy.
---------
Co-authored-by: Paul Masurel <paul@quickwit.io>
2023-05-24 10:47:39 +02:00