30 Commits

Author SHA1 Message Date
Paul Masurel
25d44fcec8 Revert "remove unused columnar api (#2742)" (#2748)
* Revert "remove unused columnar api (#2742)"

This reverts commit 8725594d47.

* Clippy comment + removing fill_vals

---------

Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
2025-11-26 17:44:02 +01:00
PSeitz-dd
8725594d47 remove unused columnar api (#2742) 2025-11-21 18:07:25 +01:00
MassimilianoBaglioni
74334f9c9a Fixed typo in documentation (#2629)
Co-authored-by: Massimiliano Baglioni <massimilianobaglioni@MacBook-Air-di-Massimiliano.local>
2025-07-11 14:45:59 +08:00
PSeitz
5379c99ea2 update edition to 2024 (#2620)
* update common to edition 2024

* update bitpacker to edition 2024

* update stacker to edition 2024

* update query-grammar to edition 2024

* update sstable to edition 2024 + fmt

* fmt

* update columnar to edition 2024

* cargo fmt

* use None instead of _
2025-04-18 04:56:31 +02:00
PSeitz
59084143ef use optional index in multivalued index (#2439)
* use optional index in multivalued index

For mostly empty multivalued indices there was a large overhead during
creation when iterating all docids. This is alleviated by placing an
optional index in the multivalued index to mark documents that have values.

There's some performance overhead when accessing values in a multivalued
index. The accessing cost is now optional index + multivalue index. The
sparse codec performs relatively badly with the binary_search when accessing
data. This is reflected in the benchmarks below.

This changes the format of columnar to v2, but code is added to handle the v1
formats.
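
The layout described above can be pictured roughly as follows. This is a minimal sketch, not the code from this commit: `SparseMultiValueIndex`, `from_counts`, and `value_range` are hypothetical names, and the plain binary search over a sorted `Vec` only stands in for the sparse codec's lookup.

```rust
use std::ops::Range;

/// Hypothetical, simplified layout (not tantivy's actual types).
struct SparseMultiValueIndex {
    /// Sorted doc ids that have at least one value (the "optional index").
    docs_with_values: Vec<u32>,
    /// `start_offsets[rank]..start_offsets[rank + 1]` is the value range of
    /// the `rank`-th doc in `docs_with_values`.
    start_offsets: Vec<u32>,
}

impl SparseMultiValueIndex {
    /// Builds the index from `(doc_id, number_of_values)` pairs sorted by doc id.
    /// Docs without values are skipped, so the cost depends on docs with
    /// values rather than on all doc ids.
    fn from_counts(counts: &[(u32, u32)]) -> Self {
        let mut docs_with_values = Vec::new();
        let mut start_offsets = vec![0u32];
        let mut next_offset = 0u32;
        for &(doc, num_vals) in counts {
            if num_vals == 0 {
                continue;
            }
            docs_with_values.push(doc);
            next_offset += num_vals;
            start_offsets.push(next_offset);
        }
        Self { docs_with_values, start_offsets }
    }

    /// Access cost is now two steps: optional index lookup, then offsets.
    fn value_range(&self, doc: u32) -> Range<u32> {
        match self.docs_with_values.binary_search(&doc) {
            Ok(rank) => self.start_offsets[rank]..self.start_offsets[rank + 1],
            Err(_) => 0..0, // the doc has no values
        }
    }
}
```

Under these assumptions, `from_counts(&[(3, 2), (7, 0), (40, 1)])` stores only docs 3 and 40, and `value_range(3)` yields `0..2` while `value_range(7)` yields the empty range `0..0`.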

```
     Running benches/bench_access.rs (/home/pascal/Development/tantivy/optional_multivalues/target/release/deps/bench_access-ea323c028db88db4)
multi sparse 1/13
access_values_for_doc        Avg: 42.8946ms (+241.80%)    Median: 42.8869ms (+244.10%)    [42.7484ms .. 43.1074ms]
access_first_vals            Avg: 42.8022ms (+421.93%)    Median: 42.7553ms (+439.84%)    [42.6794ms .. 43.7404ms]
multi 2x
access_values_for_doc        Avg: 31.1244ms (+24.17%)    Median: 30.8339ms (+23.46%)    [30.7192ms .. 33.6059ms]
access_first_vals            Avg: 24.3070ms (+70.92%)    Median: 24.0966ms (+70.18%)    [23.9328ms .. 26.4851ms]
sparse 1/13
access_values_for_doc        Avg: 42.2490ms (+0.61%)    Median: 42.2346ms (+2.28%)    [41.8988ms .. 43.7821ms]
access_first_vals            Avg: 43.6272ms (+0.23%)    Median: 43.6197ms (+1.78%)    [43.4920ms .. 43.9009ms]
dense 1/12
access_values_for_doc        Avg: 8.6184ms (+23.18%)    Median: 8.6126ms (+23.78%)    [8.5843ms .. 8.7527ms]
access_first_vals            Avg: 6.8112ms (+4.47%)     Median: 6.8002ms (+4.55%)     [6.7887ms .. 6.8991ms]
full
access_values_for_doc        Avg: 9.4073ms (-5.09%)    Median: 9.4023ms (-2.23%)    [9.3694ms .. 9.4568ms]
access_first_vals            Avg: 4.9531ms (+6.24%)    Median: 4.9502ms (+7.85%)    [4.9423ms .. 4.9718ms]
```

```
     Running benches/bench_merge.rs (/home/pascal/Development/tantivy/optional_multivalues/target/release/deps/bench_merge-475697dfceb3639f)
merge_multi 2x_and_multi 2x                          Avg: 20.2280ms (+34.33%)    Median: 20.1829ms (+35.33%)    [19.9933ms .. 20.8806ms]
merge_multi sparse 1/13_and_multi sparse 1/13        Avg: 0.8961ms (-78.04%)     Median: 0.8943ms (-77.61%)     [0.8899ms .. 0.9272ms]
merge_dense 1/12_and_dense 1/12                      Avg: 0.6619ms (-1.26%)      Median: 0.6616ms (+2.20%)      [0.6473ms .. 0.6837ms]
merge_sparse 1/13_and_sparse 1/13                    Avg: 0.5508ms (-0.85%)      Median: 0.5508ms (+2.80%)      [0.5420ms .. 0.5634ms]
merge_sparse 1/13_and_dense 1/12                     Avg: 0.6046ms (-4.64%)      Median: 0.6038ms (+2.80%)      [0.5939ms .. 0.6296ms]
merge_multi sparse 1/13_and_dense 1/12               Avg: 0.9111ms (-83.48%)     Median: 0.9063ms (-83.50%)     [0.9047ms .. 0.9663ms]
merge_multi sparse 1/13_and_sparse 1/13              Avg: 0.8451ms (-89.49%)     Median: 0.8428ms (-89.43%)     [0.8411ms .. 0.8563ms]
merge_multi 2x_and_dense 1/12                        Avg: 10.6624ms (-4.82%)     Median: 10.6568ms (-4.49%)     [10.5738ms .. 10.8353ms]
merge_multi 2x_and_sparse 1/13                       Avg: 10.6336ms (-22.95%)    Median: 10.5925ms (-22.33%)    [10.5149ms .. 11.5657ms]
```

* Update columnar/src/columnar/format_version.rs

Co-authored-by: Paul Masurel <paul@quickwit.io>

* Update columnar/src/column_index/mod.rs

Co-authored-by: Paul Masurel <paul@quickwit.io>

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2024-06-19 14:54:12 +08:00
PSeitz
7ce950f141 add method to fetch block of first vals in columnar (#2330)
* add method to fetch block of first vals in columnar

add method to fetch block of first vals in columnar (this is way faster
than single calls for full columns)
add benchmark
fix import warnings

```
test bench_get_block_first_on_full_column                  ... bench:          56 ns/iter (+/- 26)
test bench_get_block_first_on_full_column_single_calls     ... bench:         311 ns/iter (+/- 6)
test bench_get_block_first_on_multi_column                 ... bench:         378 ns/iter (+/- 15)
test bench_get_block_first_on_multi_column_single_calls    ... bench:         546 ns/iter (+/- 13)
test bench_get_block_first_on_optional_column              ... bench:         291 ns/iter (+/- 6)
test bench_get_block_first_on_optional_column_single_calls ... bench:         362 ns/iter (+/- 8)
```
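
For illustration, a block fetch on a full column might look like the sketch below. `FullColumn`, `first`, and `first_vals` here are hypothetical stand-ins, and processing chunks of four with a remainder loop is an assumption about how such a method could be written, not a copy of the code in this commit.

```rust
/// Hypothetical stand-in for a full column: every doc has exactly one value.
struct FullColumn {
    values: Vec<u64>,
}

impl FullColumn {
    /// Single-call access: one lookup per doc id.
    fn first(&self, doc: u32) -> Option<u64> {
        self.values.get(doc as usize).copied()
    }

    /// Block access: fills `output[i]` with the first value of `docs[i]`.
    fn first_vals(&self, docs: &[u32], output: &mut [Option<u64>]) {
        assert_eq!(docs.len(), output.len());
        let mut doc_chunks = docs.chunks_exact(4);
        let mut out_chunks = output.chunks_exact_mut(4);
        // Fixed-size chunks give the compiler a chance to unroll the loop.
        for (out, idx) in (&mut out_chunks).zip(&mut doc_chunks) {
            out[0] = self.first(idx[0]);
            out[1] = self.first(idx[1]);
            out[2] = self.first(idx[2]);
            out[3] = self.first(idx[3]);
        }
        // Handle the trailing docs that did not fill a whole chunk.
        let out_rem = out_chunks.into_remainder();
        for (out, &doc) in out_rem.iter_mut().zip(doc_chunks.remainder()) {
            *out = self.first(doc);
        }
    }
}
```

In this sketch, the block call returns the same results as repeated `first` calls; an out-of-range doc id yields `None`.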

* use remainder
2024-03-15 08:01:47 +01:00
PSeitz
b0e65560a1 handle ip addresses in term aggregation (#2319)
* handle ip addresses in term aggregation

Store IP addresses during segment term aggregation in their u64 representation
and convert them to u128 (IPv6 address) via downcast when converting to intermediate results.

Enable downcasting on `ColumnValues`.
Expose a u64 variant for u128-encoded data via the `open_u64_lenient` method.
Remove the lifetime in VecColumn to avoid the 'static lifetime requirement coming
from the downcast trait.

* rename method
2024-03-14 09:41:18 +01:00
PSeitz
ec37295b2f add fast path for full columns in fetch_block (#2328)
Spotted in `range_date_histogram` query in quickwit benchmark:
5% of time copying docs around, which is not needed in the full index case
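
The idea can be sketched as follows. All types and field names here (`ColumnIndex`, `Column`, `BlockAccessor`, the caches) are simplified, hypothetical stand-ins rather than the actual implementation: on a full column the doc id is already the row id, so no doc ids need to be filtered or copied before reading values.

```rust
/// Hypothetical, simplified column index.
enum ColumnIndex {
    /// Every doc has exactly one value: doc id == row id.
    Full,
    /// Sorted doc ids that have a value; position in the list == row id.
    Optional(Vec<u32>),
}

struct Column {
    index: ColumnIndex,
    values: Vec<u64>,
}

#[derive(Default)]
struct BlockAccessor {
    docid_cache: Vec<u32>,
    row_id_cache: Vec<u32>,
    val_cache: Vec<u64>,
}

impl BlockAccessor {
    /// Assumes every doc id in `docs` is within the column's range.
    fn fetch_block(&mut self, docs: &[u32], column: &Column) {
        self.val_cache.clear();
        match &column.index {
            ColumnIndex::Full => {
                // Fast path: read values directly, no doc id copying.
                self.val_cache
                    .extend(docs.iter().map(|&doc| column.values[doc as usize]));
            }
            ColumnIndex::Optional(present) => {
                // General path: keep only docs with a value, map them to rows.
                self.docid_cache.clear();
                self.row_id_cache.clear();
                for &doc in docs {
                    if let Ok(row) = present.binary_search(&doc) {
                        self.docid_cache.push(doc);
                        self.row_id_cache.push(row as u32);
                    }
                }
                self.val_cache
                    .extend(self.row_id_cache.iter().map(|&row| column.values[row as usize]));
            }
        }
    }
}
```

In this sketch, the full-column branch never touches `docid_cache` or `row_id_cache`, which is the copying the commit message describes as unnecessary.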

remove Column to ColumnIndex deref
2024-03-14 04:07:11 +01:00
PSeitz
b1d8b072db add missing aggregation part 2 (#2149)
* add missing aggregation part 2

Add missing support for:
- Mixed types columns
- Key of type string on numerical fields

The special aggregation is slower than the integrated one in TermsAggregation and therefore not
chosen by default, although it can cover all use cases.

* simplify, add num_docs to empty
2023-08-31 07:55:33 +02:00
PSeitz
2e109018b7 add missing parameter to term agg (#2103)
* add missing parameter to term agg

* move missing handling to block accessor

* add multivalue test, fix multivalue case, add comments

* add documentation, deactivate special case

* cargo fmt

* resolve merge conflict
2023-08-14 14:22:18 +02:00
Paul Masurel
821208480b Adding Debug/Display impl. Refining the ColumnIndex::get_cardinality 2023-03-26 14:40:37 +09:00
Paul Masurel
a2e3c2ed5b Renaming Column::idx -> Column::index (#1961)
There was some variable name shadowing happening.
2023-03-26 13:58:50 +09:00
Paul Masurel
2b6a4da640 Exposing empty column builder. (#1959) 2023-03-24 16:34:41 +09:00
PSeitz
da2804644f fetch blocks of vals in aggregation for all cardinality (#1950)
* fetch blocks of vals in aggregation for all cardinality

* move caching in common accessor
2023-03-23 08:41:11 +01:00
Paul Masurel
0a726a0897 Added Empty ColumnIndex (#1910) 2023-02-27 13:59:22 +09:00
Paul Masurel
06850719dc Renaming .values(DocId) to .values_for_doc(DocId) (#1906) 2023-02-27 12:15:13 +09:00
Paul Masurel
0274c982d5 Refactoring. (#1881)
`ColumnValues`, wrongly located in column_values/column.rs for
historical reasons, moves to column_values/mod.rs.

u128 stuff gets its own directory like u64 stuff.
2023-02-17 21:57:14 +09:00
PSeitz
74bf60b4f7 implement SegmentAggregationCollector on bucket aggs (#1878) 2023-02-17 12:53:29 +01:00
PSeitz
111f25a8f7 clippy (#1879)
* fix clippy

* fix clippy

* fmt
2023-02-17 11:34:21 +01:00
Paul Masurel
097fd6138d Fix clippy comments (#1872) 2023-02-14 23:12:45 +09:00
PSeitz
1cfb9ce59a improve range query performance (#1864)
fix RowId vs DocId naming
fixes #1863
2023-02-14 13:25:39 +09:00
Paul Masurel
bd5eea9852 Integrated columnar work. 2023-02-09 13:14:31 +01:00
PSeitz
b31fd389d8 collect columns for merge (#1812)
* collect columns for merge

* return column_type from, fix visibility

* fix

Co-authored-by: Paul Masurel <paul@quickwit.io>
2023-01-20 07:58:29 +01:00
Paul Masurel
89cec79813 Make it possible to force a column type and intricate bugfix. (#1815) 2023-01-20 14:30:56 +09:00
Paul Masurel
e3d504d833 Minor code cleanup (#1810) 2023-01-19 17:47:26 +09:00
Paul Masurel
5a42c5aae9 Add support for multivalues (#1809) 2023-01-19 16:55:01 +09:00
Paul Masurel
a86b104a40 Differentiating between str and bytes, + unit test 2023-01-19 14:38:12 +09:00
PSeitz
f9abd256b7 add ip addr to columnar (#1805) 2023-01-19 05:36:06 +01:00
Paul Masurel
9f42b6440a Completed unit test for dictionary encoded column 2023-01-19 12:15:27 +09:00
Paul Masurel
25bad784ad Integrated fastfield codecs into columnar. (#1782)
Introduced asymmetric OptionalCodec / SerializableOptionalCodec
Removed cardinality from the columnar sstable.
Added DynamicColumn
Reorganized all files
Change DenseCodec serialization logic.
Renamed methods to rank/select
Moved versioning footer to the columnar level
2023-01-16 17:24:49 +09:00