tantivy/columnar
PSeitz 59084143ef use optional index in multivalued index (#2439)
* use optional index in multivalued index

For mostly empty multivalued indices there was a large overhead during
creation when iterating all docids. This is alleviated by placing an
optional index in the multivalued index to mark documents that have values.

There's some performance overhead when accessing values in a multivalued
index. The accessing cost is now optional index + multivalue index. The
sparse codec performs relatively poorly with the binary_search when accessing
data. This is reflected in the benchmarks below.

This changes the format of columnar to v2, but code is added to handle the v1
formats.

```
     Running benches/bench_access.rs (/home/pascal/Development/tantivy/optional_multivalues/target/release/deps/bench_access-ea323c028db88db4)
multi sparse 1/13
access_values_for_doc        Avg: 42.8946ms (+241.80%)    Median: 42.8869ms (+244.10%)    [42.7484ms .. 43.1074ms]
access_first_vals            Avg: 42.8022ms (+421.93%)    Median: 42.7553ms (+439.84%)    [42.6794ms .. 43.7404ms]
multi 2x
access_values_for_doc        Avg: 31.1244ms (+24.17%)    Median: 30.8339ms (+23.46%)    [30.7192ms .. 33.6059ms]
access_first_vals            Avg: 24.3070ms (+70.92%)    Median: 24.0966ms (+70.18%)    [23.9328ms .. 26.4851ms]
sparse 1/13
access_values_for_doc        Avg: 42.2490ms (+0.61%)    Median: 42.2346ms (+2.28%)    [41.8988ms .. 43.7821ms]
access_first_vals            Avg: 43.6272ms (+0.23%)    Median: 43.6197ms (+1.78%)    [43.4920ms .. 43.9009ms]
dense 1/12
access_values_for_doc        Avg: 8.6184ms (+23.18%)    Median: 8.6126ms (+23.78%)    [8.5843ms .. 8.7527ms]
access_first_vals            Avg: 6.8112ms (+4.47%)     Median: 6.8002ms (+4.55%)     [6.7887ms .. 6.8991ms]
full
access_values_for_doc        Avg: 9.4073ms (-5.09%)    Median: 9.4023ms (-2.23%)    [9.3694ms .. 9.4568ms]
access_first_vals            Avg: 4.9531ms (+6.24%)    Median: 4.9502ms (+7.85%)    [4.9423ms .. 4.9718ms]
```

```
     Running benches/bench_merge.rs (/home/pascal/Development/tantivy/optional_multivalues/target/release/deps/bench_merge-475697dfceb3639f)
merge_multi 2x_and_multi 2x                          Avg: 20.2280ms (+34.33%)    Median: 20.1829ms (+35.33%)    [19.9933ms .. 20.8806ms]
merge_multi sparse 1/13_and_multi sparse 1/13        Avg: 0.8961ms (-78.04%)     Median: 0.8943ms (-77.61%)     [0.8899ms .. 0.9272ms]
merge_dense 1/12_and_dense 1/12                      Avg: 0.6619ms (-1.26%)      Median: 0.6616ms (+2.20%)      [0.6473ms .. 0.6837ms]
merge_sparse 1/13_and_sparse 1/13                    Avg: 0.5508ms (-0.85%)      Median: 0.5508ms (+2.80%)      [0.5420ms .. 0.5634ms]
merge_sparse 1/13_and_dense 1/12                     Avg: 0.6046ms (-4.64%)      Median: 0.6038ms (+2.80%)      [0.5939ms .. 0.6296ms]
merge_multi sparse 1/13_and_dense 1/12               Avg: 0.9111ms (-83.48%)     Median: 0.9063ms (-83.50%)     [0.9047ms .. 0.9663ms]
merge_multi sparse 1/13_and_sparse 1/13              Avg: 0.8451ms (-89.49%)     Median: 0.8428ms (-89.43%)     [0.8411ms .. 0.8563ms]
merge_multi 2x_and_dense 1/12                        Avg: 10.6624ms (-4.82%)     Median: 10.6568ms (-4.49%)     [10.5738ms .. 10.8353ms]
merge_multi 2x_and_sparse 1/13                       Avg: 10.6336ms (-22.95%)    Median: 10.5925ms (-22.33%)    [10.5149ms .. 11.5657ms]
```

* Update columnar/src/columnar/format_version.rs

Co-authored-by: Paul Masurel <paul@quickwit.io>

* Update columnar/src/column_index/mod.rs

Co-authored-by: Paul Masurel <paul@quickwit.io>

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2024-06-19 14:54:12 +08:00

Columnar format

This crate describes the columnar format used in tantivy.

Goals

This format is special in the following ways.

  • it needs to be compact
  • accessing a specific column does not require loading the entire columnar. It can be done in 2 to 3 random accesses.
  • columns of several types can be associated with the same column name.
  • it needs to support columns with different types (str, u64, i64, f64) and different cardinality (required, optional, multivalued).
  • columns, once loaded, offer cheap random access.
  • it is designed to allow range queries.

Coercion rules

Users can create a columnar by inserting rows into a ColumnarWriter, and serializing it into a Write object. Nothing prevents a user from recording values of different types under the same column_name.

In that case, tantivy-columnar's behavior is as follows:

  • JsonValues are grouped into 3 types (String, Number, bool). Values that correspond to different groups are mapped to different columns. For instance, String values are treated independently from Number or boolean values. tantivy-columnar will simply emit several columns associated with a given column_name.
  • Only one column for a given json value type is emitted. If number values with different number types are recorded (e.g. u64, i64, f64), tantivy-columnar will pick the first type that can represent the set of appended values, with the following priority order (i64, u64, f64). i64 is picked over u64 as it is likely to yield fewer type changes. Most use cases that strictly require u64 show the restriction on about 50% of the values (e.g. a 64-bit hash), meaning that roughly half of the values exceed i64::MAX. On the other hand, many use cases occasionally produce a negative value.
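The numeric rule above can be sketched as follows. This is a minimal illustration, not the crate's implementation: `NumericalValue`, `NumericalType` and `pick_numerical_type` are hypothetical names, and the sketch assumes that a recorded f64 value is never folded back into an integer type, even when it happens to be integral.

```rust
/// Hypothetical stand-in for a recorded numerical value (not the crate's type).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum NumericalValue {
    I64(i64),
    U64(u64),
    F64(f64),
}

/// Hypothetical tag for the column type that ends up being emitted.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum NumericalType {
    I64,
    U64,
    F64,
}

/// Returns the first type, in the priority order (i64, u64, f64),
/// that can represent every value in `values`.
pub fn pick_numerical_type(values: &[NumericalValue]) -> NumericalType {
    // i64 works if there is no float and no u64 above i64::MAX.
    let fits_i64 = values.iter().all(|value| match *value {
        NumericalValue::I64(_) => true,
        NumericalValue::U64(x) => x <= i64::MAX as u64,
        NumericalValue::F64(_) => false, // assumption: floats force f64
    });
    if fits_i64 {
        return NumericalType::I64;
    }
    // u64 works if there is no float and no negative i64.
    let fits_u64 = values.iter().all(|value| match *value {
        NumericalValue::I64(x) => x >= 0,
        NumericalValue::U64(_) => true,
        NumericalValue::F64(_) => false,
    });
    if fits_u64 {
        return NumericalType::U64;
    }
    // Fallback: f64 is the last type in the priority order.
    NumericalType::F64
}
```

With this sketch, a column that mixes small u64 values with an occasional negative i64 stays i64, while a column holding both `u64::MAX` and a negative number falls through to f64.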

Columnar format

This columnar format may have more than one column (with different types) associated with the same column_name (see Coercion rules above). The (column_name, column_type) couple however uniquely identifies a column. That couple is serialized as a column_key. The format of that key is: [column_name][ZERO_BYTE][column_type_header: u8]
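As an illustration of the key layout, the sketch below assembles a key from a name and a type header. The function name `column_key` is hypothetical, and the header value passed in is a placeholder: the actual header codes are defined by the crate and are not reproduced here.

```rust
/// Builds `[column_name][ZERO_BYTE][column_type_header: u8]`.
///
/// Returns `None` if the name contains the zero byte, since the zero byte
/// is what separates the name from the type header.
pub fn column_key(column_name: &str, column_type_header: u8) -> Option<Vec<u8>> {
    if column_name.as_bytes().contains(&0u8) {
        return None;
    }
    let mut key = Vec::with_capacity(column_name.len() + 2);
    key.extend_from_slice(column_name.as_bytes());
    key.push(0u8); // ZERO_BYTE separator
    key.push(column_type_header); // placeholder header code
    Some(key)
}
```

Because the type header is the last byte, all keys for one column_name are adjacent when keys are sorted.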

COLUMNAR:=
    [COLUMNAR_DATA]
    [COLUMNAR_KEY_TO_DATA_INDEX]
    [COLUMNAR_FOOTER];


# Columns are sorted by their column key.
COLUMNAR_DATA:=
    [COLUMN_DATA]+;

COLUMNAR_FOOTER := [RANGE_SSTABLE_BYTES_LEN: 8 bytes little endian]

The columnar file starts with the actual column data, concatenated one after the other, sorted by column key.
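Reading the layout back starts from the end of the file. The sketch below assumes that RANGE_SSTABLE_BYTES_LEN is the byte length of the COLUMNAR_KEY_TO_DATA_INDEX section that sits just before the footer; `split_columnar` is a hypothetical helper, not a function of the crate.

```rust
/// Splits a serialized columnar into `(column_data, key_to_data_index)`.
///
/// Assumption: the 8-byte little-endian footer stores the byte length of
/// the key-to-data index that immediately precedes it.
pub fn split_columnar(bytes: &[u8]) -> Option<(&[u8], &[u8])> {
    const FOOTER_LEN: usize = 8;
    // Everything except the footer.
    let body_len = bytes.len().checked_sub(FOOTER_LEN)?;
    let (body, footer) = bytes.split_at(body_len);

    // Decode the footer as a little-endian u64.
    let mut len_bytes = [0u8; FOOTER_LEN];
    len_bytes.copy_from_slice(footer);
    let index_len = u64::from_le_bytes(len_bytes);

    // Reject lengths that do not fit in the body.
    if index_len > body_len as u64 {
        return None;
    }
    let data_len = body_len - index_len as usize;
    Some(body.split_at(data_len))
}
```

Only the footer and the index need to be read before locating a column, which is consistent with the goal of reaching a column in a few random accesses.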

An sstable associates (column name, column_cardinality, column_type) with a range of bytes.

Column names may not contain the zero byte \0.

Listing all columns associated with column_name can therefore be done by listing all keys prefixed by [column_name][ZERO_BYTE].
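The prefix listing can be sketched with a sorted map standing in for the sstable. A `BTreeMap` is used here only because it offers ordered range scans; the function name `columns_for_name` and the `Range<u64>` value type are assumptions made for the example.

```rust
use std::collections::BTreeMap;
use std::ops::Range;

/// Lists every (key, byte range) whose key starts with
/// `[column_name][ZERO_BYTE]`.
///
/// `index` is a hypothetical in-memory stand-in for the sstable.
pub fn columns_for_name<'a>(
    index: &'a BTreeMap<Vec<u8>, Range<u64>>,
    column_name: &str,
) -> Vec<(&'a [u8], Range<u64>)> {
    let mut prefix = column_name.as_bytes().to_vec();
    prefix.push(0u8);
    index
        // Keys are sorted, so matching keys form one contiguous run
        // starting at the first key >= prefix.
        .range(prefix.clone()..)
        .take_while(|(key, _)| key.starts_with(&prefix))
        .map(|(key, range)| (key.as_slice(), range.clone()))
        .collect()
}
```

The zero byte matters here: without it, a scan for the name "a" would also return the columns of a field named "ab".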

The associated range of bytes refer to a range of bytes

This crate exposes a columnar format for tantivy. This format is described in README.md.

The crate introduces the following concepts.

Columnar is an equivalent of a dataframe. It maps column_key to Column.

A Column<T> associates a RowId (u32) to any number of values.

This is made possible by wrapping a ColumnIndex and a ColumnValue object. The ColumnValue<T> represents a mapping that associates each RowId to exactly one single value.

The ColumnIndex then maps each RowId to a set of RowIds in the ColumnValue.

For optimization and compression purposes, the ColumnIndex has three possible representations, one for each cardinality.

  • Full

All RowIds have exactly one value. The ColumnIndex is the trivial mapping.

  • Optional

All RowIds can have at most one value. The ColumnIndex is the trivial mapping ColumnRowId -> Option<ColumnValueRowId>.

  • Multivalued

All RowIds can have any number of values. The ColumnIndex maps each RowId to a range of RowIds in the ColumnValue.
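The three representations can be pictured with the simplified model below. It is a sketch under stated assumptions, not the crate's types: the real ColumnIndex uses compact codecs, whereas this version stores plain vectors, and `value_row_ids` and `values_for_row` are hypothetical helper names.

```rust
use std::ops::Range;

pub type RowId = u32;

/// Simplified, uncompressed model of the three cardinalities.
pub enum ColumnIndex {
    /// Every row has exactly one value: row i maps to value i.
    Full,
    /// Each row maps to at most one value row id.
    Optional(Vec<Option<RowId>>),
    /// Start offsets into the values, with `num_rows + 1` entries:
    /// row i owns the values in `offsets[i]..offsets[i + 1]`.
    Multivalued(Vec<RowId>),
}

impl ColumnIndex {
    /// Returns the range of value row ids associated with `row_id`.
    /// Panics if `row_id` is out of bounds for the stored mapping.
    pub fn value_row_ids(&self, row_id: RowId) -> Range<RowId> {
        match self {
            ColumnIndex::Full => row_id..row_id + 1,
            ColumnIndex::Optional(mapping) => match mapping[row_id as usize] {
                Some(value_row_id) => value_row_id..value_row_id + 1,
                None => 0..0, // empty range: the row has no value
            },
            ColumnIndex::Multivalued(start_offsets) => {
                let i = row_id as usize;
                start_offsets[i]..start_offsets[i + 1]
            }
        }
    }
}

/// Combines an index with a flat slice of values, as a Column would.
pub fn values_for_row<T: Copy>(index: &ColumnIndex, values: &[T], row_id: RowId) -> Vec<T> {
    let range = index.value_row_ids(row_id);
    values[range.start as usize..range.end as usize].to_vec()
}
```

In all three cases the values themselves live in one flat sequence; only the mapping from a RowId to positions in that sequence changes.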

All these objects are implemented and unit tested independently in their own module:

  • columnar
  • column_index
  • column_values
  • column