3372 Commits

Author SHA1 Message Date
PSeitz-dd
eb5f51f3c4 Optimize ExistsQuery for a high number of dynamic columns (#2694)
* Optimize ExistsQuery for a high number of dynamic columns

The previous algorithm checked _each_ doc in _each_ column for
existence. This is very costly on JSON fields with, e.g., 100k columns.
Instead, compute a bitset when there is more than one column.

add `iter_docs` to the multivalued_index

* add benchmark

subfields=1
exists_json_union    Memory: 89.3 KB (+2.01%)    Avg: 0.4865ms (-26.03%)    Median: 0.4865ms (-26.03%)    [0.4865ms .. 0.4865ms]
subfields=2
exists_json_union    Memory: 68.1 KB     Avg: 1.7048ms (-0.46%)    Median: 1.7048ms (-0.46%)    [1.7048ms .. 1.7048ms]
subfields=3
exists_json_union    Memory: 61.8 KB     Avg: 2.0742ms (-2.22%)    Median: 2.0742ms (-2.22%)    [2.0742ms .. 2.0742ms]
subfields=4
exists_json_union    Memory: 119.8 KB (+103.44%)    Avg: 3.9500ms (+42.62%)    Median: 3.9500ms (+42.62%)    [3.9500ms .. 3.9500ms]
subfields=5
exists_json_union    Memory: 120.4 KB (+107.65%)    Avg: 3.9610ms (+20.65%)    Median: 3.9610ms (+20.65%)    [3.9610ms .. 3.9610ms]
subfields=6
exists_json_union    Memory: 120.6 KB (+107.49%)    Avg: 3.8903ms (+3.11%)    Median: 3.8903ms (+3.11%)    [3.8903ms .. 3.8903ms]
subfields=7
exists_json_union    Memory: 120.9 KB (+106.93%)    Avg: 3.6220ms (-16.22%)    Median: 3.6220ms (-16.22%)    [3.6220ms .. 3.6220ms]
subfields=8
exists_json_union    Memory: 121.3 KB (+106.23%)    Avg: 4.0981ms (-15.97%)    Median: 4.0981ms (-15.97%)    [4.0981ms .. 4.0981ms]
subfields=16
exists_json_union    Memory: 123.1 KB (+103.09%)    Avg: 4.3483ms (-92.26%)    Median: 4.3483ms (-92.26%)    [4.3483ms .. 4.3483ms]
subfields=256
exists_json_union    Memory: 204.6 KB (+19.85%)    Avg: 3.8874ms (-99.01%)    Median: 3.8874ms (-99.01%)    [3.8874ms .. 3.8874ms]
subfields=4096
exists_json_union    Memory: 2.0 MB     Avg: 3.5571ms (-99.90%)    Median: 3.5571ms (-99.90%)    [3.5571ms .. 3.5571ms]
subfields=65536
exists_json_union    Memory: 28.3 MB     Avg: 14.4417ms (-99.97%)    Median: 14.4417ms (-99.97%)    [14.4417ms .. 14.4417ms]
subfields=262144
exists_json_union    Memory: 113.3 MB     Avg: 66.2860ms (-99.95%)    Median: 66.2860ms (-99.95%)    [66.2860ms .. 66.2860ms]

* rename methods
2025-09-16 12:57:57 -04:00
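The bitset idea in eb5f51f3c4 above can be illustrated with a small sketch. This is not the code from the PR: `DocBitSet`, `exists_naive`, and `exists_bitset` are hypothetical names, and each column is modeled as a sorted `Vec<u32>` of the doc ids that have a value. The naive version probes every column for every doc, while the bitset version walks each column's docs once and sets a bit per doc.

```rust
/// A minimal fixed-size bitset over doc ids `0..max_doc`.
/// Hypothetical helper for illustration; not tantivy's own bitset type.
pub struct DocBitSet {
    words: Vec<u64>,
    max_doc: u32,
}

impl DocBitSet {
    pub fn with_max_doc(max_doc: u32) -> Self {
        // One u64 word covers 64 doc ids; round up.
        let num_words = (max_doc as usize + 63) / 64;
        DocBitSet { words: vec![0u64; num_words], max_doc }
    }

    pub fn insert(&mut self, doc: u32) {
        debug_assert!(doc < self.max_doc);
        self.words[(doc / 64) as usize] |= 1u64 << (doc % 64);
    }

    pub fn contains(&self, doc: u32) -> bool {
        doc < self.max_doc && (self.words[(doc / 64) as usize] >> (doc % 64)) & 1 == 1
    }

    /// Number of docs present in the set.
    pub fn len(&self) -> usize {
        self.words.iter().map(|w| w.count_ones() as usize).sum()
    }
}

/// Naive approach: probe each column for each doc.
/// The number of probes grows with `max_doc * columns.len()`.
/// Assumes every column is sorted (binary search stands in for an existence check).
pub fn exists_naive(columns: &[Vec<u32>], max_doc: u32) -> Vec<u32> {
    (0..max_doc)
        .filter(|doc| columns.iter().any(|col| col.binary_search(doc).is_ok()))
        .collect()
}

/// Bitset approach: iterate the docs of each column once and mark them.
/// The work grows with the total number of (column, doc) entries instead.
pub fn exists_bitset(columns: &[Vec<u32>], max_doc: u32) -> DocBitSet {
    let mut bitset = DocBitSet::with_max_doc(max_doc);
    for col in columns {
        for &doc in col {
            bitset.insert(doc);
        }
    }
    bitset
}
```

Both functions return the same set of docs. The bitset costs `max_doc / 8` bytes up front, which may be one reason the commit only switches to it when there is more than one column.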
PSeitz-dd
7963b0b4aa Add fast field fallback for term query if not indexed (#2693)
* Add fast field fallback for term query if not indexed

* only fallback without scores
2025-09-12 14:58:21 +02:00
Paul Masurel
d5eefca11d Merge pull request #2692 from quickwit-oss/paul.masurel/coerce-floats-too-in-search-too
This PR changes the logic used on the ingestion of floats.
2025-09-10 09:46:54 +02:00
Paul Masurel
5d6c8de23e Align float search logic with the columnar coercion rules
This applies the same logic to floats as to u64 or i64.
In all cases, the idea is (for the inverted index) to coerce numbers
to their canonical representation, before indexing and before searching.

That way a document with the float 1.0 will be searchable when the user
searches for 1.

Note that, unlike the columnar, we do not attempt to coerce all of the
terms associated with a given JSON path to a single numerical type.
We simply rely on this "point-wise" canonicalization.
2025-09-09 19:28:17 +02:00
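A minimal sketch of the "point-wise" canonicalization described in 5d6c8de23e above. The names `CanonicalNumber` and `canonicalize_f64` are hypothetical, and the priority order (try i64 first, then u64, then keep f64) is an assumption made for illustration; the commit message only states that numbers are coerced to a canonical representation before indexing and before searching.

```rust
/// Hypothetical canonical form for a numeric term.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum CanonicalNumber {
    I64(i64),
    U64(u64),
    F64(f64),
}

const TWO_POW_63: f64 = 9_223_372_036_854_775_808.0;
const TWO_POW_64: f64 = 18_446_744_073_709_551_616.0;

/// Coerce a float to an integer representation when it holds an exact
/// integer value in range; otherwise keep it as a float.
/// Assumed priority: i64, then u64, then f64.
pub fn canonicalize_f64(val: f64) -> CanonicalNumber {
    // NaN, infinities, and values with a fractional part stay floats.
    if val.is_finite() && val.fract() == 0.0 {
        // i64 covers -2^63 <= val < 2^63.
        if val >= -TWO_POW_63 && val < TWO_POW_63 {
            return CanonicalNumber::I64(val as i64);
        }
        // u64 covers the remaining non-negative values below 2^64.
        if val >= 0.0 && val < TWO_POW_64 {
            return CanonicalNumber::U64(val as u64);
        }
    }
    CanonicalNumber::F64(val)
}
```

Applying the same function at indexing time and at query time is what makes a document containing `1.0` match a search for `1`: both sides end up as the same canonical term.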
PSeitz
a06365f39f Update CHANGELOG.md for bugfixes (#2674)
* Update CHANGELOG.md

* Update CHANGELOG.md
2025-09-04 11:51:00 +02:00
Raphaël Cohen
f4b374110f feat: Regex query grammar (#2677)
* feat: Regex query grammar

* feat: Disable regexes by default

* chore: Apply formatting
2025-09-03 10:07:04 +02:00
PSeitz-dd
c37af9c1ff update release instructions (#2687) 2025-08-22 07:57:48 +08:00
PSeitz
33794a114c chore: Release (#2686)
Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
2025-08-20 18:29:37 +08:00
PSeitz-dd
8676a1f57b prepare release: update Changelog (#2685) 2025-08-20 16:07:53 +08:00
PSeitz-dd
021ff2ad63 move bench to binggan (#2684) 2025-08-14 17:02:44 +08:00
Paul Masurel
39e027667b per field size details (#2679)
* Added per-field size details.

This also does a bunch of refactoring.

Merging field metadata does not silently assert that arguments should be sorted.
merging does not set `stored`.

We do not rely on a hashmap to group fields, but instead rely on the fact that
the term dictionary is sorted.

The inverted level method that exposes field metadata is not exposed
as public anymore.

* CR comment

---------

Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
2025-08-13 13:12:22 +02:00
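The remark in 39e027667b above about grouping fields through sort order rather than a hashmap can be sketched as follows. This is an illustration only: `sum_sizes_by_field` and the `(field, size)` entry shape are hypothetical and do not reflect the actual types in the PR. Because equal keys are adjacent in sorted input, a single pass that compares each entry with the last emitted group is enough.

```rust
/// Sum sizes per field, assuming `sorted_entries` is sorted by field name
/// (as entries read from a sorted term dictionary would be).
/// Hypothetical helper for illustration.
pub fn sum_sizes_by_field<'a>(sorted_entries: &[(&'a str, u64)]) -> Vec<(&'a str, u64)> {
    let mut groups: Vec<(&'a str, u64)> = Vec::new();
    for &(field, size) in sorted_entries {
        // Equal keys are adjacent, so only the last group can match.
        if let Some((last_field, total)) = groups.last_mut() {
            if *last_field == field {
                *total += size;
                continue;
            }
        }
        groups.push((field, size));
    }
    groups
}
```

The output stays sorted by field name and no hashing is needed, but the result is only correct if the input really is sorted.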
PSeitz-dd
a1d65c3df3 test stable ordering with pagination (#2683) 2025-08-13 15:36:28 +08:00
trinity-1686a
2e4615c2d3 Merge pull request #2678 from Darkheir/feat/query_grammar_space_between_field_and_value
feat: Support spaces between field name and value
2025-08-11 09:57:23 +02:00
Darkheir
610091e2c4 feat: Applies PR review suggestion 2025-08-04 10:12:51 +02:00
trinity-1686a
c301e7b1c4 Merge pull request #2673 from paradedb/stuhood.fix-order-by-dup-string
Fix `TopDocs::order_by_string_fast_field` for duplicates
2025-07-30 18:25:03 +02:00
Stu Hood
d9eb093368 Attempt to clarify sorted_ords_to_term_cb. 2025-07-29 21:56:31 -07:00
Darkheir
d4b090124c feat: Support spaces between field name and value 2025-07-23 11:12:13 +02:00
PSeitz-dd
811c68cdb2 fix field_names in top_hits aggregation (#2675) 2025-07-21 12:19:30 +08:00
trinity-1686a
bc1c789897 Merge pull request #2676 from quickwit-oss/trinity.pointard/allow-partial-default-field-success
ignore failure to parse query when other default field succeeded
2025-07-18 14:20:41 +02:00
trinity Pointard
e7c8c331bd ignore failure to parse query when other default field succeeded 2025-07-17 14:47:28 +02:00
Eric Ridge
2f01152a3c adjust Dictionary::sorted_ords_to_term_cb() to allow duplicates 2025-07-16 13:38:43 -07:00
PSeitz
4e84c70387 Fix TopNComputer for reverse order (#2672)
Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
2025-07-16 21:44:04 +08:00
Paul M.
f2c77f06c5 Update fs4 to latest (0.13.1) (#2654)
- One change was needed to handle the `Result<bool>` that now returns from `try_lock_exclusive`

Co-authored-by: Paul M. <prov223@tutanota.com>
2025-07-14 11:26:19 +08:00
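A sketch of what handling the `Result<bool>` mentioned in f2c77f06c5 above might look like. It does not call fs4 itself: `interpret_try_lock` and `LockOutcome` are hypothetical, and the reading of the boolean (`true` means the lock was acquired, `false` means another holder has it) is an assumption to check against the fs4 documentation.

```rust
use std::io;

/// Hypothetical outcome of a non-blocking exclusive lock attempt.
#[derive(Debug, PartialEq)]
pub enum LockOutcome {
    Acquired,
    HeldByOther,
}

/// Map a `Result<bool>` from a try-lock call to an explicit outcome.
/// Assumption: `Ok(true)` = lock acquired, `Ok(false)` = lock held elsewhere.
pub fn interpret_try_lock(result: io::Result<bool>) -> io::Result<LockOutcome> {
    match result {
        Ok(true) => Ok(LockOutcome::Acquired),
        Ok(false) => Ok(LockOutcome::HeldByOther),
        // Genuine I/O failures are passed through unchanged.
        Err(err) => Err(err),
    }
}
```

Separating "lock is busy" from "I/O error" lets the caller report contention without treating it as a failure of the lock call itself.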
MassimilianoBaglioni
74334f9c9a Fixed typo in documentation (#2629)
Co-authored-by: Massimiliano Baglioni <massimilianobaglioni@MacBook-Air-di-Massimiliano.local>
2025-07-11 14:45:59 +08:00
Parth
cc4beb61ba update CHANGELOG (#2670)
* update CHANGELOG

* Update CHANGELOG.md

Co-authored-by: PSeitz <PSeitz@users.noreply.github.com>

* Update CHANGELOG.md

---------

Co-authored-by: PSeitz <PSeitz@users.noreply.github.com>
2025-07-11 11:33:11 +08:00
Dale Seo
6742e5981b fix a typo in the comment (#2668) 2025-07-10 07:14:57 +02:00
Philippe Noël
b128299976 Update ParadeDB logo (#2669) 2025-07-10 07:14:35 +02:00
PSeitz
945af922d1 clippy (#2661)
* clippy

* use readable version

---------

Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
2025-07-02 11:25:03 +02:00
PSeitz-dd
295d07e55c fix union performance regression (#2663)
closes https://github.com/quickwit-oss/tantivy/issues/2656
2025-07-01 20:32:25 +02:00
PSeitz
080fa4d1f4 add docs/example and Vec<u32> values to sstable (#2660) 2025-07-01 15:40:02 +02:00
PSeitz-dd
988c2b35e7 fix import in test (#2657) 2025-06-24 12:55:34 +02:00
PSeitz
bf3cc12610 update CHANGELOG (#2621)
Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
2025-06-24 11:58:44 +02:00
Stu Hood
a2400f4e73 Add string fast field support to TopDocs. (#2642)
* Add string fast field support to `TopDocs`.

* Remove unnecessary generics, and review feedback.

* Use actual/less-ambiguous cities.

* Review feedback
2025-06-20 10:27:14 +02:00
Zhang.Jinrui
436ec6caea fix typo for the comments of search_with_executor() (#2653)
Co-authored-by: Zhang Jinrui <zhangjinrui@microsoft.com>
2025-06-19 09:53:21 +02:00
PSeitz
4a6123d3ff release tantivy: bump versions (#2625)
* chore: Release

* chore: Release

---------

Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
2025-06-10 15:34:39 +02:00
Parth
5a2fe42c24 make zstd optional in sstable (#2633)
* make zstd truly optional

* changelog notes

* make sure we write

* resolve comments

* make this a default feature

* remove changelog notes
2025-05-14 17:16:41 +02:00
PSeitz
5379c99ea2 update edition to 2024 (#2620)
* update common to edition 2024

* update bitpacker to edition 2024

* update stacker to edition 2024

* update query-grammar to edition 2024

* update sstable to edition 2024 + fmt

* fmt

* update columnar to edition 2024

* cargo fmt

* use None instead of _
2025-04-18 04:56:31 +02:00
Paul Masurel
3fa90e70e2 Merge pull request #2618 from quickwit-oss/release_tantivy
fix tantivy-query-grammar version
2025-04-09 09:54:09 +02:00
Pascal Seitz
6ab4102253 fix tantivy-query-grammar version 2025-04-09 14:35:23 +08:00
PSeitz
11c6329ca5 temp unbump version (#2501)
temp unbump to 0.22 for easier release with `cargo release`
2025-04-09 08:09:41 +02:00
PSeitz
ab8bb93928 update changelog (#2617) 2025-04-09 03:31:30 +02:00
PSeitz
2b668bd2bf readability improvement on executor (#2615) 2025-04-08 18:28:49 +02:00
Paul Masurel
97a7137ef8 Merge pull request #2606 from katlim-br/add_serde_serialize
Add serde json serialize to UserInputAst
2025-04-03 15:57:03 +02:00
Kat Lim Ruiz
ffa7cdf397 agreed with Remi on the final JSON structure: having a "type" tag and using "clauses" is more accurate 2025-04-03 08:35:16 -05:00
Kat Lim Ruiz
caf1275e60 Merge pull request #1 from quickwit-oss/tagged-user-input-ast
Tag UserInputAst
2025-04-03 08:30:07 -05:00
Remi Dettai
fb12b7be28 Tag UserInputAst 2025-04-03 10:07:34 +02:00
Kat Lim Ruiz
6f77083493 create more complex unit test 2025-04-02 18:06:20 -05:00
Kat Lim Ruiz
cd7745da7a set Leaf untagged, leave clause and boost the same (with own property) 2025-04-02 17:52:18 -05:00
Kat Lim Ruiz
eb8304dee9 remove untitled file 2025-04-02 08:47:58 -05:00
Kat Lim Ruiz
e5638112a9 all json should be snake_case 2025-04-02 08:45:33 -05:00