Commit Graph

96 Commits

Author SHA1 Message Date
PSeitz
4c52499622 clippy (#2549) 2024-11-29 16:08:21 +08:00
Paul Masurel
c35a782747 Updating rustc-hash and clippy fixes (#2532)
* Updating rustc-hash and clippy fixes

* fix terms_aggregation_min_doc_count_special_case

---------

Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>
2024-11-01 13:46:26 +08:00
PSeitz
21d057059e clippy (#2527)
* clippy

* clippy

* clippy

* clippy

* convert allow to expect and remove unused

* cargo fmt

* cleanup

* export sample

* clippy
2024-10-22 09:26:54 +08:00
Bruce Mitchener
c17e513377 Reduce typo count. (#2510) 2024-10-10 09:55:37 +08:00
Tri
8bd6eb06e6 feat: make SegmentMeta.with_max_doc public (#2499)
* chore: add container

* feat: make max doc editable externally

* chore: expose another method

* chore: remove comments

* remove unused devcontainer

* chore: manually match nightly format

* chore: change weird formating

* revert format change

* fix: format with nightly
2024-09-23 12:39:36 +08:00
trinity-1686a
85395d942a fix clippy lints from 1.80-1.81 (#2488)
* fix some clippy lints

* fix clippy::doc_lazy_continuation

* fix some lints for 1.82
2024-09-05 14:33:05 +02:00
PSeitz
0d4e319965 add Key::I64 and Key::U64 variants in aggregation (#2468)
* add Key::I64 and Key::U64 variants in aggregation

Currently all `Key` numerical values are returned as f64. This causes problems in some
cases with the precision and the way f64 is serialized.

This PR adds `Key::I64` and `Key::U64` variants and uses them in the term
aggregation.

* add clarification comment
2024-07-31 20:29:32 +08:00
PSeitz
232f37126e fix coverage (#2448) 2024-07-05 12:04:18 +08:00
Paul Masurel
0f4c2e27cf Fixes bug that causes out-of-order sstable key. (#2445)
The previous way to address the problem was to replace \u{0000}
with 0 in different places.

This logic had several flaws:
Done on the serializer side (like it was for the columnar), there was
a collision problem.

If a document in the segment contained a json field with a \0 and
antoher doc contained the same json field but `0` then we were sending
the same field path twice to the serializer.

Another option would have been to normalizes all values on the writer
side.

This PR simplifies the logic and simply ignore json path containing a
\0, both in the columnar and the inverted index.

Closes #2442
2024-07-01 15:40:07 +08:00
PSeitz
59084143ef use optional index in multivalued index (#2439)
* use optional index in multivalued index

For mostly empty multivalued indices there was a large overhead during
creation when iterating all docids. This is alleviated by placing an
optional index in the multivalued index to mark documents that have values.

There's some performance overhead when accessing values in a multivalued
index. The accessing cost is now optional index + multivalue index. The
sparse codec performs relatively bad with the binary_search when accessing
data. This is reflected in the benchmarks below.

This changes the format of columnar to v2, but code is added to handle the v1
formats.

```
     Running benches/bench_access.rs (/home/pascal/Development/tantivy/optional_multivalues/target/release/deps/bench_access-ea323c028db88db4)
multi sparse 1/13
access_values_for_doc        Avg: 42.8946ms (+241.80%)    Median: 42.8869ms (+244.10%)    [42.7484ms .. 43.1074ms]
access_first_vals            Avg: 42.8022ms (+421.93%)    Median: 42.7553ms (+439.84%)    [42.6794ms .. 43.7404ms]
multi 2x
access_values_for_doc        Avg: 31.1244ms (+24.17%)    Median: 30.8339ms (+23.46%)    [30.7192ms .. 33.6059ms]
access_first_vals            Avg: 24.3070ms (+70.92%)    Median: 24.0966ms (+70.18%)    [23.9328ms .. 26.4851ms]
sparse 1/13
access_values_for_doc        Avg: 42.2490ms (+0.61%)    Median: 42.2346ms (+2.28%)    [41.8988ms .. 43.7821ms]
access_first_vals            Avg: 43.6272ms (+0.23%)    Median: 43.6197ms (+1.78%)    [43.4920ms .. 43.9009ms]
dense 1/12
access_values_for_doc        Avg: 8.6184ms (+23.18%)    Median: 8.6126ms (+23.78%)    [8.5843ms .. 8.7527ms]
access_first_vals            Avg: 6.8112ms (+4.47%)     Median: 6.8002ms (+4.55%)     [6.7887ms .. 6.8991ms]
full
access_values_for_doc        Avg: 9.4073ms (-5.09%)    Median: 9.4023ms (-2.23%)    [9.3694ms .. 9.4568ms]
access_first_vals            Avg: 4.9531ms (+6.24%)    Median: 4.9502ms (+7.85%)    [4.9423ms .. 4.9718ms]
```

```
     Running benches/bench_merge.rs (/home/pascal/Development/tantivy/optional_multivalues/target/release/deps/bench_merge-475697dfceb3639f)
merge_multi 2x_and_multi 2x                          Avg: 20.2280ms (+34.33%)    Median: 20.1829ms (+35.33%)    [19.9933ms .. 20.8806ms]
merge_multi sparse 1/13_and_multi sparse 1/13        Avg: 0.8961ms (-78.04%)     Median: 0.8943ms (-77.61%)     [0.8899ms .. 0.9272ms]
merge_dense 1/12_and_dense 1/12                      Avg: 0.6619ms (-1.26%)      Median: 0.6616ms (+2.20%)      [0.6473ms .. 0.6837ms]
merge_sparse 1/13_and_sparse 1/13                    Avg: 0.5508ms (-0.85%)      Median: 0.5508ms (+2.80%)      [0.5420ms .. 0.5634ms]
merge_sparse 1/13_and_dense 1/12                     Avg: 0.6046ms (-4.64%)      Median: 0.6038ms (+2.80%)      [0.5939ms .. 0.6296ms]
merge_multi sparse 1/13_and_dense 1/12               Avg: 0.9111ms (-83.48%)     Median: 0.9063ms (-83.50%)     [0.9047ms .. 0.9663ms]
merge_multi sparse 1/13_and_sparse 1/13              Avg: 0.8451ms (-89.49%)     Median: 0.8428ms (-89.43%)     [0.8411ms .. 0.8563ms]
merge_multi 2x_and_dense 1/12                        Avg: 10.6624ms (-4.82%)     Median: 10.6568ms (-4.49%)     [10.5738ms .. 10.8353ms]
merge_multi 2x_and_sparse 1/13                       Avg: 10.6336ms (-22.95%)    Median: 10.5925ms (-22.33%)    [10.5149ms .. 11.5657ms]
```

* Update columnar/src/columnar/format_version.rs

Co-authored-by: Paul Masurel <paul@quickwit.io>

* Update columnar/src/column_index/mod.rs

Co-authored-by: Paul Masurel <paul@quickwit.io>

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2024-06-19 14:54:12 +08:00
PSeitz
511b027350 update columnar bench (#2438)
* update columnar bench

* fix compile
2024-06-14 10:42:35 +08:00
PSeitz
72f61ff89c remove index sorting (#2434)
closes https://github.com/quickwit-oss/tantivy/issues/2352
2024-06-13 15:51:53 +08:00
PSeitz
a141c3ec59 add columnar format compatibiliy tests (#2433)
* add columnar format compatibiliy tests

* always try to write current format
2024-06-13 15:04:52 +08:00
PSeitz
714f363d43 add bench & test for columnar merging (#2428)
* add merge columnar proptest

* add columnar merge benchmark
2024-06-10 16:26:16 +08:00
PSeitz
99a59ad37e remove zero byte check (#2379)
remove zero byte checks in columnar. zero bytes are converted during serialization now.
unify code paths
extend test for expected column names
2024-04-26 06:03:28 +02:00
PSeitz
74940e9345 clippy (#2349)
* fix clippy

* fix clippy

* fix duplicate imports
2024-04-09 07:54:44 +02:00
PSeitz
b644d78a32 fix null byte handling in JSON paths (#2345)
* fix null byte handling in JSON paths

closes https://github.com/quickwit-oss/tantivy/issues/2193
closes https://github.com/quickwit-oss/tantivy/issues/2340

* avoid repeated term truncation

* fix test

* Apply suggestions from code review

Co-authored-by: Paul Masurel <paul@quickwit.io>

* add comment

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2024-04-05 09:53:35 +02:00
PSeitz
7ce950f141 add method to fetch block of first vals in columnar (#2330)
* add method to fetch block of first vals in columnar

add method to fetch block of first vals in columnar (this is way faster
than single calls for full columns)
add benchmark
fix import warnings

```
test bench_get_block_first_on_full_column                  ... bench:          56 ns/iter (+/- 26)
test bench_get_block_first_on_full_column_single_calls     ... bench:         311 ns/iter (+/- 6)
test bench_get_block_first_on_multi_column                 ... bench:         378 ns/iter (+/- 15)
test bench_get_block_first_on_multi_column_single_calls    ... bench:         546 ns/iter (+/- 13)
test bench_get_block_first_on_optional_column              ... bench:         291 ns/iter (+/- 6)
test bench_get_block_first_on_optional_column_single_calls ... bench:         362 ns/iter (+/- 8)
```

* use remainder
2024-03-15 08:01:47 +01:00
PSeitz
b0e65560a1 handle ip adresses in term aggregation (#2319)
* handle ip adresses in term aggregation

Stores IpAdresses during the segment term aggregation via u64 representation
and convert to u128(IpV6Adress) via downcast when converting to intermediate results.

Enable Downcasting on `ColumnValues`
Expose u64 variant for u128 encoded data via `open_u64_lenient` method.
Remove lifetime in VecColumn, to avoid 'static lifetime requirement coming
from downcast trait.

* rename method
2024-03-14 09:41:18 +01:00
PSeitz
ec37295b2f add fast path for full columns in fetch_block (#2328)
Spotted in `range_date_histogram` query in quickwit benchmark:
5% of time copying docs around, which is not needed in the full index case

remove Column to ColumnIndex deref
2024-03-14 04:07:11 +01:00
Adam Reichold
72002e8a89 Make test builds Clippy clean. (#2277) 2024-01-31 02:47:06 +01:00
Paul Masurel
014328e378 Fix bug that can cause get_docids_for_value_range to panic. (#2295)
* Fix bug that can cause `get_docids_for_value_range` to panic.

When `selected_docid_range.end == num_rows`, we would get a panic
as we try to access a non-existing blockmeta.

This PR accepts calls to rank with any value.
For any value above num_rows we simply return non_null_rows.

Fixes #2293

* add tests, merge variables

---------

Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>
2024-01-09 14:52:20 +01:00
trinity-1686a
9ebc5ed053 use fst for sstable index (#2268)
* read path for new fst based index

* implement BlockAddrStoreWriter

* extract slop/derivation computation

* use better linear approximator and allow negative correction to approximator

* document format and reorder some fields

* optimize single block sstable size

* plug backward compat
2023-12-04 15:13:15 +01:00
PSeitz
1a9fc10be9 add fields_metadata to SegmentReader, add columnar docs (#2222)
* add fields_metadata to SegmentReader, add columnar docs

* use schema to resolve field, add test

* normalize paths

* merge for FieldsMetadata, add fields_metadata on Index

* Update src/core/segment_reader.rs

Co-authored-by: Paul Masurel <paul@quickwit.io>

* merge code paths

* add Hash

* move function oustide

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2023-11-22 12:29:53 +01:00
PSeitz
927b4432c9 Perf: use term hashmap in fastfield (#2243)
* add shared arena hashmap

* bench fastfield indexing

* use shared arena hashmap in columnar

lower minimum resize in hashtable

* clippy

* add comments
2023-11-09 13:44:02 +01:00
PSeitz
19a859d6fd term hashmap remove copy in is_empty, unused unordered_id (#2229) 2023-10-27 05:01:32 +02:00
PSeitz
4feeb2323d fix clippy (#2223) 2023-10-24 10:05:22 +02:00
PSeitz
38db53c465 make column_index pub (#2181) 2023-09-22 08:06:45 +02:00
PSeitz
b1d8b072db add missing aggregation part 2 (#2149)
* add missing aggregation part 2

Add missing support for:
- Mixed types columns
- Key of type string on numerical fields

The special aggregation is slower than the integrated one in TermsAggregation and therefore not
chosen by default, although it can cover all use cases.

* simplify, add num_docs to empty
2023-08-31 07:55:33 +02:00
PSeitz
59460c767f delayed column opening during merge (#2132)
* lazy columnar merge

This is the first part of addressing #3633
Instead of loading all Column into memory for the merge, only the current column_name
group is loaded. This can be done since the sstable streams the columns lexicographically.

* refactor

* add rustdoc

* replace iterator with BTreeMap
2023-08-21 08:55:35 +02:00
PSeitz
62ece86f24 track ff dictionary indexing memory consumption (#2147) 2023-08-16 14:00:08 +02:00
PSeitz
ed1deee902 fix sort index by date (#2124)
closes #2112
2023-08-14 17:36:52 +02:00
PSeitz
2e109018b7 add missing parameter to term agg (#2103)
* add missing parameter to term agg

* move missing handling to block accessor

* add multivalue test, fix multivalue case, add comments

* add documentation, deactivate special case

* cargo fmt

* resolve merge conflict
2023-08-14 14:22:18 +02:00
Adam Reichold
42acd334f4 Fixes the new deny-by-default incorrect_partial_ord_impl_on_ord_type Clippy lint (#2131) 2023-07-21 11:36:17 +09:00
Paul Masurel
910b0b0c61 Cargo fmt 2023-07-03 22:03:31 +09:00
PSeitz
17186ca9c9 improve docs (#2105) 2023-06-27 13:37:14 +08:00
PSeitz
6239697a02 switch to ms in histogram for date type (#2045)
* switch to ms in histogram for date type

switch to ms in histogram, by adding a normalization step that converts
to nanoseconds precision when creating the collector.

closes #2028
related to #2026

* add missing unit long variants

* use single thread to avoid handling test case

* fix docs

* revert CI

* cleanup

* improve docs

* Update src/aggregation/bucket/histogram/histogram.rs

Co-authored-by: Paul Masurel <paul@quickwit.io>

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2023-05-19 08:15:44 +02:00
Yuri Astrakhan
74275b76a6 Inline format arguments where makes sense (#2038)
Applied this command to the code, making it a bit shorter and slightly
more readable.

```
cargo +nightly clippy --all-features --benches --tests --workspace --fix -- -A clippy::all -W clippy::uninlined_format_args
cargo +nightly fmt --all
```
2023-05-10 18:03:59 +09:00
PSeitz
ba309e18a1 switch to nanosecond precision (#2016) 2023-05-01 03:32:20 +02:00
trinity-1686a
780e26331d sstable compression (#1946)
* compress sstable with zstd

* add some details to sstable readme

* compress only block which benefit from it

* multiple changes to sstable

make compression optional
use OwnedBytes instead of impl Read in sstable, required for next point
use zstd bulk api, which is much faster on small records

* cleanup and use bulk api for compression

* use dedicated byte for compression

* switch block len and compression flag

* change default zstd level in sstable
2023-04-14 16:25:50 +02:00
trinity-1686a
205e8a0a92 encode dictionary type in fst footer (#1968)
* encode additional footer for dictionary kind in fst
2023-04-12 09:43:01 +02:00
Paul Masurel
5eb12173d6 Proptest merge columnar (#1976)
* Added proptest on columnar merge with a shuffle

Made column serialization more explicit.
Bugfix when a bytes column is missing, and with a shuffle.
Improved the cardinality detection logic / column detection.

* Code review

* CR comments

* Following CR
2023-04-04 11:28:42 +09:00
PSeitz
571735c5f7 Fix index sort by on optional/multicolumn (#1972)
Fix index sort by on optional/multicolumn
add optional columns to proptest
extend proptests for sort
add columnar sort tests
2023-03-31 04:24:11 +02:00
Paul Masurel
694a056255 Faster range (#1954)
* Faster range queries

This PR does several changes
- ip compact space now uses u32
- the bitunpacker now gets a get_batch function
- we push down range filtering, removing GCD / shift in the bitpacking
  codec.
- we rely on AVX2 routine to do the filtering.

* Apply suggestions from code review

* Apply suggestions from code review

* CR comments
2023-03-27 14:56:32 +09:00
Paul Masurel
2955e34452 Added proptests for building/merging columnar. (#1963) 2023-03-27 14:56:02 +09:00
Paul Masurel
821208480b Adding Debug/Display impl. Refining the ColumnIndex::get_cardinality 2023-03-26 14:40:37 +09:00
Paul Masurel
a2e3c2ed5b Renaming Column::idx -> Column::index (#1961)
There was some variable name ghosting happening.
2023-03-26 13:58:50 +09:00
PSeitz
835f228bfa fix cardinality when merging empty columns (#1960)
fixes #1958
2023-03-25 15:58:15 +09:00
Paul Masurel
2b6a4da640 Exposing empty column builder. (#1959) 2023-03-24 16:34:41 +09:00
PSeitz
da2804644f fetch blocks of vals in aggregation for all cardinality (#1950)
* fetch blocks of vals in aggregation for all cardinality

* move caching in common accessor
2023-03-23 08:41:11 +01:00