Commit Graph

2490 Commits

Author SHA1 Message Date
PSeitz-dd
40659d4d07 improve naming in buffered_union (#2705) 2025-09-24 10:58:46 +02:00
PSeitz-dd
70da310b2d perf: deduplicate queries (#2698)
* deduplicate queries

Deduplicate queries in the UserInputAst after parsing queries

* add return type
2025-09-22 12:16:58 +02:00
PSeitz-dd
2340dca628 fix compiler warnings (#2699)
* fix compiler warnings

* fix import
2025-09-19 15:55:04 +02:00
Remi
71a26d5b24 Fix CI with rust 1.90 (#2696)
* Empty commit

* Fix dead code lint error
2025-09-18 23:06:33 +02:00
PSeitz-dd
203751f2fe Optimize ExistsQuery for a high number of dynamic columns (#2694)
* Optimize ExistsQuery for a high number of dynamic columns

The previous algorithm checked _each_ doc in _each_ column for
existence. This causes huge cost on JSON fields with e.g. 100k columns.
Compute a bitset instead if we have more than one column.

add `iter_docs` to the multivalued_index

* add benchmark

subfields=1
exists_json_union    Memory: 89.3 KB (+2.01%)    Avg: 0.4865ms (-26.03%)    Median: 0.4865ms (-26.03%)    [0.4865ms .. 0.4865ms]
subfields=2
exists_json_union    Memory: 68.1 KB     Avg: 1.7048ms (-0.46%)    Median: 1.7048ms (-0.46%)    [1.7048ms .. 1.7048ms]
subfields=3
exists_json_union    Memory: 61.8 KB     Avg: 2.0742ms (-2.22%)    Median: 2.0742ms (-2.22%)    [2.0742ms .. 2.0742ms]
subfields=4
exists_json_union    Memory: 119.8 KB (+103.44%)    Avg: 3.9500ms (+42.62%)    Median: 3.9500ms (+42.62%)    [3.9500ms .. 3.9500ms]
subfields=5
exists_json_union    Memory: 120.4 KB (+107.65%)    Avg: 3.9610ms (+20.65%)    Median: 3.9610ms (+20.65%)    [3.9610ms .. 3.9610ms]
subfields=6
exists_json_union    Memory: 120.6 KB (+107.49%)    Avg: 3.8903ms (+3.11%)    Median: 3.8903ms (+3.11%)    [3.8903ms .. 3.8903ms]
subfields=7
exists_json_union    Memory: 120.9 KB (+106.93%)    Avg: 3.6220ms (-16.22%)    Median: 3.6220ms (-16.22%)    [3.6220ms .. 3.6220ms]
subfields=8
exists_json_union    Memory: 121.3 KB (+106.23%)    Avg: 4.0981ms (-15.97%)    Median: 4.0981ms (-15.97%)    [4.0981ms .. 4.0981ms]
subfields=16
exists_json_union    Memory: 123.1 KB (+103.09%)    Avg: 4.3483ms (-92.26%)    Median: 4.3483ms (-92.26%)    [4.3483ms .. 4.3483ms]
subfields=256
exists_json_union    Memory: 204.6 KB (+19.85%)    Avg: 3.8874ms (-99.01%)    Median: 3.8874ms (-99.01%)    [3.8874ms .. 3.8874ms]
subfields=4096
exists_json_union    Memory: 2.0 MB     Avg: 3.5571ms (-99.90%)    Median: 3.5571ms (-99.90%)    [3.5571ms .. 3.5571ms]
subfields=65536
exists_json_union    Memory: 28.3 MB     Avg: 14.4417ms (-99.97%)    Median: 14.4417ms (-99.97%)    [14.4417ms .. 14.4417ms]
subfields=262144
exists_json_union    Memory: 113.3 MB     Avg: 66.2860ms (-99.95%)    Median: 66.2860ms (-99.95%)    [66.2860ms .. 66.2860ms]

* rename methods
2025-09-16 18:21:03 +02:00
PSeitz-dd
7963b0b4aa Add fast field fallback for term query if not indexed (#2693)
* Add fast field fallback for term query if not indexed

* only fallback without scores
2025-09-12 14:58:21 +02:00
Paul Masurel
5d6c8de23e Align search float search logic to the columnar coercion rules
It applies the same logic on floats as for u64 or i64.
In all case, the idea is (for the inverted index) to coerce number
to their canonical representation, before indexing and before searching.

That way a document with the float 1.0 will be searchable when the user
searches for 1.

Note that contrary to the columnar, we do not attempt to coerce all of the
terms associated to a given json path to a single numerical type.
We simply rely on this "point-wise" canonicalization.
2025-09-09 19:28:17 +02:00
Raphaël Cohen
f4b374110f feat: Regex query grammar (#2677)
* feat: Regex query grammar

* feat: Disable regexes by default

* chore: Apply formatting
2025-09-03 10:07:04 +02:00
Paul Masurel
39e027667b per field size details (#2679)
* Added per-field size details.

This also does a bunch of refactoring.

merging field metadata does not silently asserts that arguments should be sorted.
merging does not set `stored`.

We do not rely on a hashmap to group fields, but instead rely on the fact that
the term dictionary is sorted.

The inverted level method that exposes field metadata is not exposed
as public anymore.

* CR comment

---------

Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
2025-08-13 13:12:22 +02:00
PSeitz-dd
a1d65c3df3 test stable ordering with pagination (#2683) 2025-08-13 15:36:28 +08:00
trinity-1686a
2e4615c2d3 Merge pull request #2678 from Darkheir/feat/query_grammar_space_between_field_and_value
feat: Support spaces between field name and value
2025-08-11 09:57:23 +02:00
trinity-1686a
c301e7b1c4 Merge pull request #2673 from paradedb/stuhood.fix-order-by-dup-string
Fix `TopDocs::order_by_string_fast_field` for duplicates
2025-07-30 18:25:03 +02:00
Darkheir
d4b090124c feat: Support spaces between field name and value 2025-07-23 11:12:13 +02:00
PSeitz-dd
811c68cdb2 fix field_names in top_hits aggregation (#2675) 2025-07-21 12:19:30 +08:00
trinity-1686a
bc1c789897 Merge pull request #2676 from quickwit-oss/trinity.pointard/allow-partial-default-field-success
ignore failure to parse query when other default field suceeded
2025-07-18 14:20:41 +02:00
trinity Pointard
e7c8c331bd ignore failure to parse query when other default field suceeded 2025-07-17 14:47:28 +02:00
Eric Ridge
2f01152a3c adjust Dictionary::sorted_ords_to_term_cb() to allow duplicates 2025-07-16 13:38:43 -07:00
PSeitz
4e84c70387 Fix TopNComputer for reverse order (#2672)
Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
2025-07-16 21:44:04 +08:00
Paul M.
f2c77f06c5 Update fs4 to latest (0.13.1) (#2654)
- One change was needed to handle the `Result<bool>` that now returns from `try_lock_exclusive`

Co-authored-by: Paul M. <prov223@tutanota.com>
2025-07-14 11:26:19 +08:00
PSeitz
945af922d1 clippy (#2661)
* clippy

* use readable version

---------

Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
2025-07-02 11:25:03 +02:00
PSeitz-dd
295d07e55c fix union performance regression (#2663)
closes https://github.com/quickwit-oss/tantivy/issues/2656
2025-07-01 20:32:25 +02:00
Stu Hood
a2400f4e73 Add string fast field support to TopDocs. (#2642)
* Add string fast field support to `TopDocs`.

* Remove unnecessary generics, and review feedback.

* Use actual/less-ambiguous cities.

* Review feedback
2025-06-20 10:27:14 +02:00
Zhang.Jinrui
436ec6caea fix typo for the comments of search_with_executor() (#2653)
Co-authored-by: Zhang Jinrui <zhangjinrui@microsoft.com>
2025-06-19 09:53:21 +02:00
PSeitz
2b668bd2bf readability improvement on executor (#2615) 2025-04-08 18:28:49 +02:00
Remi Dettai
b681ec9335 Fix compilation stability 2025-04-01 09:33:33 +02:00
trinity Pointard
9426d5be7b fix agg Key PartialEq impl 2025-03-14 14:57:45 +01:00
Paul Masurel
519e5d2ed1 clippy warnings 2025-03-05 11:15:06 +01:00
Paul Masurel
0afabad494 Cargo fmt 2025-03-05 11:07:46 +01:00
Remi Dettai
89b052cd42 Catch panics during merges (#2582)
* Adding panic handler for the rayon merge thread pool

* Return panic message in error

---------

Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
2025-03-05 10:36:48 +01:00
SteveLauC
c48c649436 refactor: use std AtomicU64 and remove wrapper (#2585) 2025-02-24 03:56:15 +01:00
Paul Masurel
58c0739953 Merge pull request #2581 from quickwit-oss/merge_dict_column_repro
use usize in bitpacker
2025-02-21 10:53:07 +09:00
Pascal Seitz
e7daf69de9 use usize in bitpacker
use usize in bitpacker to enable larger columns in the columnar store

Godbolt comparison with u32 vs u64 for get access: https://godbolt.org/z/cjf7nenYP

Add a mini-tool to inspect columnar files created by tantivy. (very basic functionality which can be extended later)
2025-02-20 15:39:10 +01:00
trinity Pointard
0368162ef0 make DateHistogramAggregationReq buildable 2025-02-18 11:45:24 +01:00
trinity-1686a
d281ca3e65 Merge pull request #2559 from quickwit-oss/trinity/sstable-partial-automaton
allow warming partially an sstable for an automaton
2025-01-08 16:35:35 +01:00
trinity Pointard
be17daf658 split iterator 2025-01-08 16:24:34 +01:00
trinity Pointard
6ca84a61fa make termdict always clone 2025-01-08 16:19:54 +01:00
trinity Pointard
037d12c9c9 fix deadlocking on automaton warmup 2025-01-06 11:58:58 +01:00
Remi Dettai
71cf19870b Exist queries match subpath fields (#2558)
* Exist queries match subpath fields

* Make subpath check optional

* Add async subpath listing
2025-01-06 10:17:39 +01:00
trinity Pointard
175a529c41 use executor for cpu-heavy sstable decompression for automaton 2025-01-03 19:14:07 +01:00
trinity Pointard
fe0c7c5408 change rangebound style 2025-01-02 11:56:05 +01:00
Harrison Burt
148594f0f9 Improve IndexWriter customisation via builder (#2562)
* Improve `IndexWriter` customisation via builder

* Remove change noise from PR

* Correct documentation

* Resolve comments and add test
2025-01-02 09:43:22 +01:00
trinity Pointard
dfff5f3bcb rename merge_holes_under => merge_holes_under_bytes 2024-12-23 16:17:44 +01:00
trinity-1686a
ebf4d84553 add comment about cpu-intensive operation in async context 2024-12-20 12:23:49 +01:00
trinity-1686a
a1447cc9c2 remove breaking change in sstable public api 2024-12-19 17:30:05 +01:00
trinity-1686a
c39d91f827 Merge pull request #2547 from quickwit-oss/trinity/count-str
add support for counting non integer in aggregation
2024-12-17 15:27:30 +01:00
trinity Pointard
32b6e9711b add tests 2024-12-13 16:06:24 +01:00
trinity-1686a
24c5dc2398 allow warming up automaton 2024-12-10 13:32:12 +01:00
Pierre Barre
6e02c5cb25 Make NUM_MERGE_THREADS configurable (#2535)
* Make `NUM_MERGE_THREADS` configurable

* Remove unused import

* Reword comment src/index/index.rs

Co-authored-by: PSeitz <PSeitz@users.noreply.github.com>

---------

Co-authored-by: PSeitz <PSeitz@users.noreply.github.com>
2024-12-09 16:53:11 +08:00
trinity-1686a
0bac391291 add support for counting non integer in aggregation 2024-11-28 19:52:47 +01:00
Paul Masurel
c35a782747 Updating rustc-hash and clippy fixes (#2532)
* Updating rustc-hash and clippy fixes

* fix terms_aggregation_min_doc_count_special_case

---------

Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>
2024-11-01 13:46:26 +08:00