Paul Masurel
13d74c3c20
Update binggan requirement from 0.16.0 to 0.16.1 ( #2899 )
2026-04-20 11:59:47 +02:00
dependabot[bot]
058afff8b7
Update binggan requirement from 0.15.3 to 0.16.0
...
Updates the requirements on [binggan](https://github.com/pseitz/binggan) to permit the latest version.
- [Changelog](https://github.com/PSeitz/binggan/blob/main/CHANGELOG.md)
- [Commits](https://github.com/pseitz/binggan/commits)
---
updated-dependencies:
- dependency-name: binggan
dependency-version: 0.16.0
dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com >
2026-04-15 08:58:03 +02:00
Paul Masurel
04beab3b29
Performance improvement for nested cardinality aggregation
...
When a string cardinality aggregation is nested, it ends up being applied to different buckets.
Dictionary encoding relies on a different dictionary for each segment.
As a result, during segment collection, we only collect term ordinals in a HashSet, and decode them in the
term dictionary at the end of collection.
Before this PR, this decoding phase was done once for each bucket, causing the same work to be done over and over. This PR introduces a coupon cache. The HLL sketch relies on a hash of the string values.
We populate the cache before bucket collection, and get our values from it.
This PR also renames "caching" to "buffering" in aggregation (it was never caching), and does several cleanups.
2026-04-10 14:51:00 +02:00
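Note on the commit above: the idea of decoding each term ordinal once and sharing the result across buckets can be sketched as follows. This is an illustrative sketch only, not code from the PR. `TermDict`, `build_hash_cache` and `bucket_hashes` are hypothetical names, the standard library's `DefaultHasher` stands in for whatever hash the HLL sketch actually uses, and the cached `u64` stands in for the "coupon".

```rust
use std::collections::hash_map::{DefaultHasher, Entry};
use std::collections::{HashMap, HashSet};
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for one segment's term dictionary (ordinal -> term).
pub struct TermDict {
    terms: Vec<String>,
}

impl TermDict {
    pub fn new(terms: Vec<String>) -> Self {
        Self { terms }
    }

    /// Decoding an ordinal is the expensive step we want to run only once.
    pub fn decode(&self, ord: u64) -> Option<&str> {
        self.terms.get(ord as usize).map(String::as_str)
    }
}

/// Assumption: any stable 64-bit hash of the string value will do here.
fn hash_term(term: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    term.hash(&mut hasher);
    hasher.finish()
}

/// Populate the ordinal -> hash cache once, before looking at buckets.
pub fn build_hash_cache(dict: &TermDict, buckets: &[HashSet<u64>]) -> HashMap<u64, u64> {
    let mut cache = HashMap::new();
    for bucket in buckets {
        for &ord in bucket {
            // Decode only ordinals that no earlier bucket has already resolved.
            if let Entry::Vacant(slot) = cache.entry(ord) {
                if let Some(term) = dict.decode(ord) {
                    slot.insert(hash_term(term));
                }
            }
        }
    }
    cache
}

/// Each bucket then reads its hashes from the cache instead of decoding again.
pub fn bucket_hashes(cache: &HashMap<u64, u64>, bucket: &HashSet<u64>) -> HashSet<u64> {
    bucket.iter().filter_map(|ord| cache.get(ord).copied()).collect()
}
```

With two buckets holding ordinals {0, 1} and {1, 2}, ordinal 1 is decoded a single time even though both buckets use it.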
Pascal Seitz
5c344db1bf
chore: Release
2026-03-31 17:15:34 +08:00
dependabot[bot]
3abc137bfe
Update binggan requirement from 0.14.2 to 0.15.3 ( #2870 )
...
Updates the requirements on [binggan](https://github.com/pseitz/binggan) to permit the latest version.
- [Commits](https://github.com/pseitz/binggan/commits)
---
updated-dependencies:
- dependency-name: binggan
dependency-version: 0.15.3
dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-31 07:59:02 +08:00
nuri
3859cc8699
fix: deduplicate doc counts in term aggregation for multi-valued fields ( #2854 )
...
* fix: deduplicate doc counts in term aggregation for multi-valued fields
Term aggregation was counting term occurrences instead of documents
for multi-valued fields. A document with the same value appearing
multiple times would inflate doc_count.
Add `fetch_block_with_missing_unique_per_doc` to ColumnBlockAccessor
that deduplicates (doc_id, value) pairs, and use it in term aggregation.
Fixes #2721
* refactor: only deduplicate for multivalue cardinality
Duplicates can only occur with multivalue columns, so narrow the
check from !is_full() to is_multivalue().
* fix: handle non-consecutive duplicate values in dedup
Sort values within each doc_id group before deduplicating, so that
non-adjacent duplicates are correctly handled.
Add unit tests for dedup_docid_val_pairs: consecutive duplicates,
non-consecutive duplicates, multi-doc groups, no duplicates, and
single element.
* perf: skip dedup when block has no multivalue entries
Add early return when no consecutive doc_ids are equal, avoiding
unnecessary sort and dedup passes. Remove the 2-element swap
optimization as it is not needed by the dedup algorithm.
---------
Co-authored-by: nryoo <nryoo@nryooui-MacBookPro.local >
2026-03-24 02:02:30 +01:00
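Note on the commit above: the deduplication steps it describes (early return, sort within each doc_id group, drop repeated values) can be sketched as follows. This is an illustrative sketch, not the code from the PR. The function name `dedup_pairs_sketch` and its signature are hypothetical, and it assumes doc_ids arrive in non-decreasing order.

```rust
/// Hypothetical sketch: remove duplicate (doc_id, value) pairs in place.
/// Assumes `doc_ids` is non-decreasing and parallel to `vals`.
pub fn dedup_pairs_sketch(doc_ids: &mut Vec<u32>, vals: &mut Vec<u64>) {
    assert_eq!(doc_ids.len(), vals.len());

    // Early return: without two equal consecutive doc_ids, every doc has a
    // single value in this block, so there is nothing to deduplicate.
    if !doc_ids.windows(2).any(|pair| pair[0] == pair[1]) {
        return;
    }

    let mut out_docs = Vec::with_capacity(doc_ids.len());
    let mut out_vals = Vec::with_capacity(vals.len());
    let mut start = 0;
    while start < doc_ids.len() {
        // Find the end of the run of entries that share this doc_id.
        let doc = doc_ids[start];
        let mut end = start + 1;
        while end < doc_ids.len() && doc_ids[end] == doc {
            end += 1;
        }

        // Sort the group so that non-adjacent duplicates become adjacent.
        let group = &mut vals[start..end];
        group.sort_unstable();

        // Keep each distinct value once per document.
        let mut prev: Option<u64> = None;
        for &val in group.iter() {
            if prev != Some(val) {
                out_docs.push(doc);
                out_vals.push(val);
                prev = Some(val);
            }
        }
        start = end;
    }
    *doc_ids = out_docs;
    *vals = out_vals;
}
```

Counting entries per value after this pass gives a document count rather than an occurrence count, which is the behavior the fix is after.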
Paul Masurel
545169c0d8
Composite agg merge ( #2856 )
...
Add composite aggregation
Co-authored-by: Remi Dettai <remi.dettai@sekoia.io >
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com >
2026-03-18 17:28:59 +01:00
trinity-1686a
12977bc7c4
upgrade some dependencies ( #2802 )
...
including rand, which had a few breaking changes
2026-01-14 10:19:09 +01:00
PSeitz-dd
65b5a1a306
one collector per agg request instead per bucket ( #2759 )
...
* improve bench
* add more tests for new collection type
* one collector per agg request instead per bucket
In this refactoring, a collector knows which bucket of the parent
its data is in. This makes it possible to convert the previous approach of one
collector per bucket to one collector per request.
low card bucket optimization
* reduce dynamic dispatch, faster term agg
* use radix map, fix prepare_max_bucket
use paged term map in term agg
use special no sub agg term map impl
* specialize columntype in stats
* remove stacktrace bloat, use &mut helper
increase cache to 2048
* cleanup
remove clone
move data in term req, single doc opt for stats
* add comment
* share column block accessor
* simplify fetch block in column_block_accessor
* split subaggcache into two trait impls
* move partitions to heap
* fix name, add comment
---------
Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com >
2026-01-06 11:50:55 +01:00
ChangRui-Ryan
db2ecc6057
fix Column.first method parameter type ( #2792 )
2026-01-05 10:03:01 +01:00
ChangRui-Ryan
e0b62e00ac
optimize RangeDocSet for non-overlapping query ranges ( #2783 )
2025-12-29 16:55:28 +01:00
PSeitz-dd
c6912ce89a
Handle JSON fields and columnar in space_usage ( #2761 )
...
return field names in space_usage instead of `Field`
more detailed info for columns
2025-12-10 20:33:33 +08:00
Ang
08a92675dc
Fix typos again ( #2753 )
...
Found via `codespell -S benches,stopwords.rs -L
womens,parth,abd,childs,ond,ser,ue,mot,hel,atleast,pris,claus,allo`
2025-12-01 12:15:41 +01:00
Paul Masurel
25d44fcec8
Revert "remove unused columnar api ( #2742 )" ( #2748 )
...
* Revert "remove unused columnar api (#2742)"
This reverts commit 8725594d47.
* Clippy comment + removing fill_vals
---------
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com >
2025-11-26 17:44:02 +01:00
PSeitz-dd
842fe9295f
split Term in Term and IndexingTerm ( #2744 )
...
* split Term in Term and IndexingTerm
* add append_json_path to JsonTermSerializer
2025-11-26 16:48:59 +01:00
PSeitz-dd
8725594d47
remove unused columnar api ( #2742 )
2025-11-21 18:07:25 +01:00
PSeitz
85010b589a
clippy ( #2700 )
...
* clippy
* clippy
* clippy
* clippy + fmt
---------
Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com >
2025-09-19 18:04:25 +02:00
PSeitz-dd
2340dca628
fix compiler warnings ( #2699 )
...
* fix compiler warnings
* fix import
2025-09-19 15:55:04 +02:00
PSeitz-dd
203751f2fe
Optimize ExistsQuery for a high number of dynamic columns ( #2694 )
...
* Optimize ExistsQuery for a high number of dynamic columns
The previous algorithm checked _each_ doc in _each_ column for
existence. This is very costly on JSON fields with, e.g., 100k columns.
Compute a bitset instead if we have more than one column.
add `iter_docs` to the multivalued_index
* add benchmark
subfields=1
exists_json_union Memory: 89.3 KB (+2.01%) Avg: 0.4865ms (-26.03%) Median: 0.4865ms (-26.03%) [0.4865ms .. 0.4865ms]
subfields=2
exists_json_union Memory: 68.1 KB Avg: 1.7048ms (-0.46%) Median: 1.7048ms (-0.46%) [1.7048ms .. 1.7048ms]
subfields=3
exists_json_union Memory: 61.8 KB Avg: 2.0742ms (-2.22%) Median: 2.0742ms (-2.22%) [2.0742ms .. 2.0742ms]
subfields=4
exists_json_union Memory: 119.8 KB (+103.44%) Avg: 3.9500ms (+42.62%) Median: 3.9500ms (+42.62%) [3.9500ms .. 3.9500ms]
subfields=5
exists_json_union Memory: 120.4 KB (+107.65%) Avg: 3.9610ms (+20.65%) Median: 3.9610ms (+20.65%) [3.9610ms .. 3.9610ms]
subfields=6
exists_json_union Memory: 120.6 KB (+107.49%) Avg: 3.8903ms (+3.11%) Median: 3.8903ms (+3.11%) [3.8903ms .. 3.8903ms]
subfields=7
exists_json_union Memory: 120.9 KB (+106.93%) Avg: 3.6220ms (-16.22%) Median: 3.6220ms (-16.22%) [3.6220ms .. 3.6220ms]
subfields=8
exists_json_union Memory: 121.3 KB (+106.23%) Avg: 4.0981ms (-15.97%) Median: 4.0981ms (-15.97%) [4.0981ms .. 4.0981ms]
subfields=16
exists_json_union Memory: 123.1 KB (+103.09%) Avg: 4.3483ms (-92.26%) Median: 4.3483ms (-92.26%) [4.3483ms .. 4.3483ms]
subfields=256
exists_json_union Memory: 204.6 KB (+19.85%) Avg: 3.8874ms (-99.01%) Median: 3.8874ms (-99.01%) [3.8874ms .. 3.8874ms]
subfields=4096
exists_json_union Memory: 2.0 MB Avg: 3.5571ms (-99.90%) Median: 3.5571ms (-99.90%) [3.5571ms .. 3.5571ms]
subfields=65536
exists_json_union Memory: 28.3 MB Avg: 14.4417ms (-99.97%) Median: 14.4417ms (-99.97%) [14.4417ms .. 14.4417ms]
subfields=262144
exists_json_union Memory: 113.3 MB Avg: 66.2860ms (-99.95%) Median: 66.2860ms (-99.95%) [66.2860ms .. 66.2860ms]
* rename methods
2025-09-16 18:21:03 +02:00
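Note on the commit above: the "compute a bitset instead" approach can be sketched as follows. This is an illustrative sketch, not the code from the PR. `BitSet` and `docs_with_any_value` are hypothetical names, and each column is modeled simply as the list of doc ids that have a value (roughly what an `iter_docs`-style iterator would yield).

```rust
/// Minimal fixed-size bitset over doc ids in `0..max_doc`.
pub struct BitSet {
    words: Vec<u64>,
}

impl BitSet {
    pub fn with_max_doc(max_doc: u32) -> Self {
        Self {
            words: vec![0u64; (max_doc as usize + 63) / 64],
        }
    }

    pub fn insert(&mut self, doc: u32) {
        self.words[(doc / 64) as usize] |= 1u64 << (doc % 64);
    }

    pub fn contains(&self, doc: u32) -> bool {
        self.words
            .get((doc / 64) as usize)
            .map_or(false, |word| word & (1u64 << (doc % 64)) != 0)
    }

    /// Number of docs set in the bitset.
    pub fn count(&self) -> usize {
        self.words.iter().map(|word| word.count_ones() as usize).sum()
    }
}

/// Union of all docs that have a value in at least one column.
/// Cost is proportional to the number of (doc, column) entries that exist,
/// not to `max_doc * number_of_columns`.
pub fn docs_with_any_value(columns: &[Vec<u32>], max_doc: u32) -> BitSet {
    let mut bitset = BitSet::with_max_doc(max_doc);
    for column_docs in columns {
        for &doc in column_docs {
            if doc < max_doc {
                bitset.insert(doc);
            }
        }
    }
    bitset
}
```

The memory increase visible in the benchmark for small subfield counts is consistent with allocating such a bitset up front, while the time savings grow with the number of columns.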
Paul Masurel
5d6c8de23e
Align float search logic with the columnar coercion rules
...
It applies the same logic to floats as to u64 or i64.
In all cases, the idea is (for the inverted index) to coerce numbers
to their canonical representation, before indexing and before searching.
That way, a document with the float 1.0 will be searchable when the user
searches for 1.
Note that, contrary to the columnar, we do not attempt to coerce all of the
terms associated with a given JSON path to a single numerical type.
We simply rely on this "point-wise" canonicalization.
2025-09-09 19:28:17 +02:00
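Note on the commit above: "point-wise" canonicalization can be sketched as follows. This is an illustrative sketch, not the code from the commit. The `Canonical` enum and both functions are hypothetical, and the preference order (i64, then u64, then f64) is an assumption about the coercion rules rather than something stated in the commit message.

```rust
/// Hypothetical canonical numeric representation used for both indexing and searching.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Canonical {
    I64(i64),
    U64(u64),
    F64(f64),
}

// Exact powers of two, both representable as f64.
const TWO_POW_63: f64 = 9_223_372_036_854_775_808.0;
const TWO_POW_64: f64 = 18_446_744_073_709_551_616.0;

/// Coerce a float to an integer representation when that loses nothing.
pub fn canonicalize_f64(val: f64) -> Canonical {
    if val.is_finite() && val.fract() == 0.0 {
        if val >= -TWO_POW_63 && val < TWO_POW_63 {
            return Canonical::I64(val as i64);
        }
        if val >= 0.0 && val < TWO_POW_64 {
            return Canonical::U64(val as u64);
        }
    }
    Canonical::F64(val)
}

/// Coerce an unsigned integer to i64 when it fits.
pub fn canonicalize_u64(val: u64) -> Canonical {
    match i64::try_from(val) {
        Ok(signed) => Canonical::I64(signed),
        Err(_) => Canonical::U64(val),
    }
}
```

Because the indexed float 1.0 and the searched integer 1 both canonicalize to the same representation, the term lookup matches without any per-path type unification.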
PSeitz
33794a114c
chore: Release ( #2686 )
...
Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com >
2025-08-20 18:29:37 +08:00
PSeitz-dd
021ff2ad63
move bench to binggan ( #2684 )
2025-08-14 17:02:44 +08:00
MassimilianoBaglioni
74334f9c9a
Fixed typo in documentation ( #2629 )
...
Co-authored-by: Massimiliano Baglioni <massimilianobaglioni@MacBook-Air-di-Massimiliano.local >
2025-07-11 14:45:59 +08:00
PSeitz-dd
988c2b35e7
fix import in test ( #2657 )
2025-06-24 12:55:34 +02:00
PSeitz
4a6123d3ff
release tantivy: bump versions ( #2625 )
...
* chore: Release
* chore: Release
---------
Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com >
2025-06-10 15:34:39 +02:00
Parth
5a2fe42c24
make zstd optional in sstable ( #2633 )
...
* make zstd truly optional
* changelog notes
* make sure we write
* resolve comments
* make this a default feature
* remove changelog notes
2025-05-14 17:16:41 +02:00
PSeitz
5379c99ea2
update edition to 2024 ( #2620 )
...
* update common to edition 2024
* update bitpacker to edition 2024
* update stacker to edition 2024
* update query-grammar to edition 2024
* update sstable to edition 2024 + fmt
* fmt
* update columnar to edition 2024
* cargo fmt
* use None instead of _
2025-04-18 04:56:31 +02:00
Remi Dettai
06d2dcf469
Further fix type inference tests
2025-04-01 09:52:22 +02:00
PSeitz
d5d2d41264
merge column: small refactors ( #2579 )
...
* merge column: small refactors
* make ord dependency more explicit
* add columnar merge crashtest proptest
* fix naming
2025-03-07 18:52:34 +08:00
Paul Masurel
519e5d2ed1
clippy warnings
2025-03-05 11:15:06 +01:00
Paul Masurel
58c0739953
Merge pull request #2581 from quickwit-oss/merge_dict_column_repro
...
use usize in bitpacker
2025-02-21 10:53:07 +09:00
Pascal Seitz
e7daf69de9
use usize in bitpacker
...
use usize in bitpacker to enable larger columns in the columnar store
Godbolt comparison with u32 vs u64 for get access: https://godbolt.org/z/cjf7nenYP
Add a mini-tool to inspect columnar files created by tantivy. (very basic functionality which can be extended later)
2025-02-20 15:39:10 +01:00
dependabot[bot]
4aa8cd2470
Update downcast-rs requirement from 1.2.1 to 2.0.1 ( #2566 )
...
Updates the requirements on [downcast-rs](https://github.com/marcianx/downcast-rs) to permit the latest version.
- [Changelog](https://github.com/marcianx/downcast-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/marcianx/downcast-rs/compare/v1.2.1...v2.0.1)
---
updated-dependencies:
- dependency-name: downcast-rs
dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-01-22 10:32:24 +01:00
dependabot[bot]
43c89b4360
Update itertools requirement from 0.13.0 to 0.14.0 ( #2563 )
...
Updates the requirements on [itertools](https://github.com/rust-itertools/itertools) to permit the latest version.
- [Changelog](https://github.com/rust-itertools/itertools/blob/master/CHANGELOG.md)
- [Commits](https://github.com/rust-itertools/itertools/compare/v0.13.0...v0.14.0)
---
updated-dependencies:
- dependency-name: itertools
dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-01-08 17:11:46 +01:00
Remi Dettai
71cf19870b
Exist queries match subpath fields ( #2558 )
...
* Exist queries match subpath fields
* Make subpath check optional
* Add async subpath listing
2025-01-06 10:17:39 +01:00
PSeitz
4c52499622
clippy ( #2549 )
2024-11-29 16:08:21 +08:00
Paul Masurel
c35a782747
Updating rustc-hash and clippy fixes ( #2532 )
...
* Updating rustc-hash and clippy fixes
* fix terms_aggregation_min_doc_count_special_case
---------
Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com >
2024-11-01 13:46:26 +08:00
dependabot[bot]
c66af2c0a9
Update binggan requirement from 0.12.0 to 0.14.0 ( #2530 )
...
* Update binggan requirement from 0.12.0 to 0.14.0
---
updated-dependencies:
- dependency-name: binggan
dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com >
* fix build
---------
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com >
2024-10-24 09:41:35 +08:00
PSeitz
21d057059e
clippy ( #2527 )
...
* clippy
* clippy
* clippy
* clippy
* convert allow to expect and remove unused
* cargo fmt
* cleanup
* export sample
* clippy
2024-10-22 09:26:54 +08:00
dependabot[bot]
99be20cedd
Update binggan requirement from 0.10.0 to 0.12.0 ( #2519 )
...
* Update binggan requirement from 0.10.0 to 0.12.0
---
updated-dependencies:
- dependency-name: binggan
dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com >
* fix build
---------
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com >
2024-10-16 11:36:04 +08:00
Bruce Mitchener
c17e513377
Reduce typo count. ( #2510 )
2024-10-10 09:55:37 +08:00
Tri
8bd6eb06e6
feat: make SegmentMeta.with_max_doc public ( #2499 )
...
* chore: add container
* feat: make max doc editable externally
* chore: expose another method
* chore: remove comments
* remove unused devcontainer
* chore: manually match nightly format
* chore: change weird formatting
* revert format change
* fix: format with nightly
2024-09-23 12:39:36 +08:00
dependabot[bot]
56fc56c5b9
Update binggan requirement from 0.8.0 to 0.10.0 ( #2493 )
...
* Update binggan requirement from 0.8.0 to 0.10.0
---
updated-dependencies:
- dependency-name: binggan
dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com >
* update PR
---------
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com >
2024-09-10 14:26:06 +08:00
trinity-1686a
85395d942a
fix clippy lints from 1.80-1.81 ( #2488 )
...
* fix some clippy lints
* fix clippy::doc_lazy_continuation
* fix some lints for 1.82
2024-09-05 14:33:05 +02:00
PSeitz
0d4e319965
add Key::I64 and Key::U64 variants in aggregation ( #2468 )
...
* add Key::I64 and Key::U64 variants in aggregation
Currently all `Key` numerical values are returned as f64. This causes problems in some
cases with the precision and the way f64 is serialized.
This PR adds `Key::I64` and `Key::U64` variants and uses them in the term
aggregation.
* add clarification comment
2024-07-31 20:29:32 +08:00
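Note on the commit above: the precision problem with returning every numerical key as f64 can be illustrated as follows. This is a minimal sketch. The local `Key` enum only mirrors the variant names mentioned in the commit message, and `key_as_f64` / `key_exact` are hypothetical helpers.

```rust
/// Local illustration of a bucket key with dedicated integer variants.
#[derive(Debug, Clone, PartialEq)]
pub enum Key {
    Str(String),
    I64(i64),
    U64(u64),
    F64(f64),
}

/// Old behavior (as described in the commit): every number becomes an f64.
pub fn key_as_f64(val: u64) -> Key {
    Key::F64(val as f64)
}

/// New behavior: unsigned integers keep their exact value.
pub fn key_exact(val: u64) -> Key {
    Key::U64(val)
}

/// Round-trip a u64 through f64 to show where precision is lost.
pub fn roundtrip_through_f64(val: u64) -> u64 {
    (val as f64) as u64
}
```

An f64 has a 53-bit significand, so 2^53 + 1 = 9007199254740993 cannot be represented and comes back as 9007199254740992 after the round trip, while the `U64` variant preserves it.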
PSeitz
232f37126e
fix coverage ( #2448 )
2024-07-05 12:04:18 +08:00
Paul Masurel
0f4c2e27cf
Fixes bug that causes out-of-order sstable key. ( #2445 )
...
The previous way to address the problem was to replace \u{0000}
with 0 in different places.
This logic had several flaws:
Done on the serializer side (as it was for the columnar), there was
a collision problem.
If a document in the segment contained a JSON field with a \0 and
another doc contained the same JSON field but with `0`, then we were sending
the same field path twice to the serializer.
Another option would have been to normalize all values on the writer
side.
This PR simplifies the logic and simply ignores JSON paths containing a
\0, both in the columnar and the inverted index.
Closes #2442
2024-07-01 15:40:07 +08:00
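Note on the commit above: the collision caused by replacing \0 with 0, and the simpler "ignore such paths" rule, can be sketched as follows. This is an illustrative sketch, not the code from the PR. Both function names are hypothetical.

```rust
/// Old approach (as described in the commit): replace the NUL character with '0'.
/// Two distinct paths can end up identical after this step.
pub fn old_normalize(path: &str) -> String {
    path.replace('\0', "0")
}

/// New approach: drop any JSON path that contains a NUL character,
/// so the serializer never sees the same path twice.
pub fn keep_indexable_paths<'a>(paths: &[&'a str]) -> Vec<&'a str> {
    paths
        .iter()
        .copied()
        .filter(|path| !path.contains('\0'))
        .collect()
}
```

Here `old_normalize("a\0b")` and `old_normalize("a0b")` both yield `"a0b"`, which is the duplicate key the fix avoids. With the filter, only `"a0b"` reaches the serializer.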
PSeitz
59084143ef
use optional index in multivalued index ( #2439 )
...
* use optional index in multivalued index
For mostly empty multivalued indices, there was a large overhead during
creation when iterating over all docids. This is alleviated by placing an
optional index in the multivalued index to mark documents that have values.
There is some performance overhead when accessing values in a multivalued
index. The access cost is now optional index + multivalue index. The
sparse codec performs relatively poorly with the binary_search when accessing
data. This is reflected in the benchmarks below.
This changes the format of columnar to v2, but code is added to handle the v1
format.
```
Running benches/bench_access.rs (/home/pascal/Development/tantivy/optional_multivalues/target/release/deps/bench_access-ea323c028db88db4)
multi sparse 1/13
access_values_for_doc Avg: 42.8946ms (+241.80%) Median: 42.8869ms (+244.10%) [42.7484ms .. 43.1074ms]
access_first_vals Avg: 42.8022ms (+421.93%) Median: 42.7553ms (+439.84%) [42.6794ms .. 43.7404ms]
multi 2x
access_values_for_doc Avg: 31.1244ms (+24.17%) Median: 30.8339ms (+23.46%) [30.7192ms .. 33.6059ms]
access_first_vals Avg: 24.3070ms (+70.92%) Median: 24.0966ms (+70.18%) [23.9328ms .. 26.4851ms]
sparse 1/13
access_values_for_doc Avg: 42.2490ms (+0.61%) Median: 42.2346ms (+2.28%) [41.8988ms .. 43.7821ms]
access_first_vals Avg: 43.6272ms (+0.23%) Median: 43.6197ms (+1.78%) [43.4920ms .. 43.9009ms]
dense 1/12
access_values_for_doc Avg: 8.6184ms (+23.18%) Median: 8.6126ms (+23.78%) [8.5843ms .. 8.7527ms]
access_first_vals Avg: 6.8112ms (+4.47%) Median: 6.8002ms (+4.55%) [6.7887ms .. 6.8991ms]
full
access_values_for_doc Avg: 9.4073ms (-5.09%) Median: 9.4023ms (-2.23%) [9.3694ms .. 9.4568ms]
access_first_vals Avg: 4.9531ms (+6.24%) Median: 4.9502ms (+7.85%) [4.9423ms .. 4.9718ms]
```
```
Running benches/bench_merge.rs (/home/pascal/Development/tantivy/optional_multivalues/target/release/deps/bench_merge-475697dfceb3639f)
merge_multi 2x_and_multi 2x Avg: 20.2280ms (+34.33%) Median: 20.1829ms (+35.33%) [19.9933ms .. 20.8806ms]
merge_multi sparse 1/13_and_multi sparse 1/13 Avg: 0.8961ms (-78.04%) Median: 0.8943ms (-77.61%) [0.8899ms .. 0.9272ms]
merge_dense 1/12_and_dense 1/12 Avg: 0.6619ms (-1.26%) Median: 0.6616ms (+2.20%) [0.6473ms .. 0.6837ms]
merge_sparse 1/13_and_sparse 1/13 Avg: 0.5508ms (-0.85%) Median: 0.5508ms (+2.80%) [0.5420ms .. 0.5634ms]
merge_sparse 1/13_and_dense 1/12 Avg: 0.6046ms (-4.64%) Median: 0.6038ms (+2.80%) [0.5939ms .. 0.6296ms]
merge_multi sparse 1/13_and_dense 1/12 Avg: 0.9111ms (-83.48%) Median: 0.9063ms (-83.50%) [0.9047ms .. 0.9663ms]
merge_multi sparse 1/13_and_sparse 1/13 Avg: 0.8451ms (-89.49%) Median: 0.8428ms (-89.43%) [0.8411ms .. 0.8563ms]
merge_multi 2x_and_dense 1/12 Avg: 10.6624ms (-4.82%) Median: 10.6568ms (-4.49%) [10.5738ms .. 10.8353ms]
merge_multi 2x_and_sparse 1/13 Avg: 10.6336ms (-22.95%) Median: 10.5925ms (-22.33%) [10.5149ms .. 11.5657ms]
```
* Update columnar/src/columnar/format_version.rs
Co-authored-by: Paul Masurel <paul@quickwit.io >
* Update columnar/src/column_index/mod.rs
Co-authored-by: Paul Masurel <paul@quickwit.io >
---------
Co-authored-by: Paul Masurel <paul@quickwit.io >
2024-06-19 14:54:12 +08:00
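Note on the commit above: the layout it describes (an optional index marking documents that have values, followed by start offsets only for those documents) can be sketched as follows. This is an illustrative sketch, not the columnar v2 format. `MultiValueIndexSketch` is a hypothetical type, and the optional index is modeled as a sorted `Vec<u32>` searched with binary search, loosely mirroring the sparse case discussed above.

```rust
use std::ops::Range;

/// Hypothetical two-level index: optional index + start offsets.
pub struct MultiValueIndexSketch {
    /// Sorted doc ids that have at least one value (the "optional index").
    docs_with_values: Vec<u32>,
    /// One start offset per doc with values, plus a final end offset.
    start_offsets: Vec<u32>,
}

impl MultiValueIndexSketch {
    /// Build from the number of values per doc. Docs without values add
    /// nothing to either vector, which keeps mostly-empty columns cheap.
    pub fn from_counts(counts: &[u32]) -> Self {
        let mut docs_with_values = Vec::new();
        let mut start_offsets = vec![0u32];
        let mut total = 0u32;
        for (doc, &count) in counts.iter().enumerate() {
            if count > 0 {
                docs_with_values.push(doc as u32);
                total += count;
                start_offsets.push(total);
            }
        }
        Self {
            docs_with_values,
            start_offsets,
        }
    }

    /// Value range for `doc`: first a lookup in the optional index (rank),
    /// then a lookup in the start offsets. Docs without values get an empty range.
    pub fn range(&self, doc: u32) -> Range<u32> {
        match self.docs_with_values.binary_search(&doc) {
            Ok(rank) => self.start_offsets[rank]..self.start_offsets[rank + 1],
            Err(_) => 0..0,
        }
    }

    /// Number of stored offsets, independent of the total number of docs.
    pub fn num_offsets(&self) -> usize {
        self.start_offsets.len()
    }
}
```

The extra rank lookup on every access is the overhead visible in the `multi sparse` access benchmark, while building the index no longer touches documents without values, which matches the faster merges.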
PSeitz
511b027350
update columnar bench ( #2438 )
...
* update columnar bench
* fix compile
2024-06-14 10:42:35 +08:00
PSeitz
72f61ff89c
remove index sorting ( #2434 )
...
closes https://github.com/quickwit-oss/tantivy/issues/2352
2024-06-13 15:51:53 +08:00