mirror of
https://github.com/quickwit-oss/tantivy.git
synced 2026-05-14 15:20:43 +00:00
* read path for new fst based index * implement BlockAddrStoreWriter * extract slop/derivation computation * use better linear approximator and allow negative correction to approximator * document format and reorder some fields * optimize single block sstable size * plug backward compat
160 lines
6.1 KiB
Markdown
160 lines
6.1 KiB
Markdown
# SSTable
|
|
|
|
The `tantivy-sstable` crate is yet another sstable crate.
|
|
|
|
It has been designed to be used in `quickwit`:
|
|
- as an alternative to the default tantivy fst dictionary.
|
|
- as a way to store the column index for dynamic fast fields.
|
|
|
|
The benefit compared to the fst crate is locality.
|
|
Searching a key in the fst crate requires downloading the entire dictionary.
|
|
|
|
Once the sstable index is downloaded, running a `get` in the sstable
|
|
crate only requires a single fetch.
|
|
|
|
Right now, the block index and the default block size have been thought
|
|
for quickwit, and the performance of a get is very bad.
|
|
|
|
# Sorted strings?
|
|
|
|
SSTable stands for Sorted String Table.
|
|
Strings have to be insert in sorted order.
|
|
|
|
That sorted order is used in different ways:
|
|
- it makes gets and streaming ranges of keys
|
|
possible.
|
|
- it allows incremental encoding of the keys
|
|
- the front compression is leveraged to optimize
|
|
the intersection with an automaton
|
|
|
|
# On disk format
|
|
|
|
Overview of the SSTable format. Unless noted otherwise, numbers are little-endian.
|
|
|
|
### SSTable
|
|
```
|
|
+-------+-------+-----+--------+
|
|
| Block | Block | ... | Footer |
|
|
+-------+-------+-----+--------+
|
|
|----( # of blocks)---|
|
|
```
|
|
- Block(`SSTBlock`): list of independent block, terminated by a single empty block.
|
|
- Footer(`SSTFooter`)
|
|
|
|
### SSTBlock
|
|
```
|
|
+----------+----------+--------+-------+-------+-----+
|
|
| BlockLen | Compress | Values | Delta | Delta | ... |
|
|
+----------+----------+--------+-------+-------+-----+
|
|
| |----( # of deltas)---|
|
|
|------(maybe compressed)------|
|
|
```
|
|
- BlockLen(u32): length of the block, including the compress byte.
|
|
- Compress(u8): indicate whether block is compressed. 0 if not compressed, 1 if compressed.
|
|
- Values: an application defined format storing a sequence of value, capable of determining it own length
|
|
- Delta
|
|
|
|
### Delta
|
|
```
|
|
+---------+--------+
|
|
| KeepAdd | Suffix |
|
|
+---------+--------+
|
|
```
|
|
- KeepAdd
|
|
- Suffix: KeepAdd.add bytes of key suffix
|
|
|
|
### KeepAdd
|
|
KeepAdd can be represented in two different representation, a very compact 1byte one which is enough for most usage, and a longer variable-len one when required
|
|
|
|
When keep < 16 and add < 16
|
|
```
|
|
+-----+------+
|
|
| Add | Keep |
|
|
+-----+------+
|
|
```
|
|
- Add(u4): number of bytes to push
|
|
- Keep(u4): number of bytes to pop
|
|
|
|
Otherwise:
|
|
```
|
|
+------+------+-----+
|
|
| 0x01 | Keep | Add |
|
|
+------+------+-----+
|
|
```
|
|
- Add(VInt): number of bytes to push
|
|
- Keep(VInt): number of bytes to pop
|
|
|
|
|
|
Note: as the SSTable does not support redundant keys, there is no ambiguity between both representation. Add is always guaranteed to be non-zero, except for the very first key of an SSTable, where Keep is guaranteed to be zero.
|
|
|
|
### SSTFooter
|
|
```
|
|
+-----+----------------+-------------+-------------+---------+---------+
|
|
| Fst | BlockAddrStore | StoreOffset | IndexOffset | NumTerm | Version |
|
|
+-----+----------------+-------------+-------------+---------+---------+
|
|
```
|
|
- Fst(Fst): finite state transducer mapping keys to a block number
|
|
- BlockAddrStore(BlockAddrStore): store mapping a block number to its BlockAddr
|
|
- StoreOffset(u64): Offset to start of the BlockAddrStore. If zero, see the SingleBlockSStable section
|
|
- IndexOffset(u64): Offset to the start of the SSTFooter
|
|
- NumTerm(u64): number of terms in the sstable
|
|
- Version(u32): Currently equal to 3
|
|
|
|
### Fst
|
|
|
|
Fst is in the format of tantivy\_fst
|
|
|
|
### BlockAddrStore
|
|
|
|
+---------+-----------+-----------+-----+-----------+-----------+-----+
|
|
| MetaLen | BlockMeta | BlockMeta | ... | BlockData | BlockData | ... |
|
|
+---------+-----------+-----------+-----+-----------+-----------+-----+
|
|
|---------(N blocks)----------|---------(N blocks)----------|
|
|
|
|
- MetaLen(u64): length of the BlockMeta section
|
|
- BlockMeta(BlockAddrBlockMetadata): metadata to seek through BlockData
|
|
- BlockData(CompactedBlockAddr): bitpacked per block metadata
|
|
|
|
### BlockAddrBlockMetadata
|
|
|
|
+--------+------------+--------------+------------+--------------+-------------------+-----------------+----------+
|
|
| Offset | RangeStart | FirstOrdinal | RangeSlope | OrdinalSlope | FirstOrdinalNBits | RangeStartNBits | BlockLen |
|
|
+--------+------------+--------------+------------+--------------+-------------------+-----------------+----------+
|
|
|
|
- Offset(u64): offset of the corresponding BlockData in the datastream
|
|
- RangeStart(u64): the start position of the first block
|
|
- FirstOrdinal(u64): the first ordinal of the first block
|
|
- RangeSlope(u32): slope predicted for start range evolution (see computation in BlockData)
|
|
- OrdinalSlope(u64): slope predicted for first ordinal evolution (see computation in BlockData)
|
|
- FirstOrdinalNBits(u8): number of bits per ordinal in datastream (see computation in BlockData)
|
|
- RangeStartNBits(u8): number of bits per range start in datastream (see computation in BlockData)
|
|
|
|
### BlockData
|
|
|
|
+-----------------+-------------------+---------------+
|
|
| RangeStartDelta | FirstOrdinalDelta | FinalRangeEnd |
|
|
+-----------------+-------------------+---------------+
|
|
|------(BlockLen repetitions)---------|
|
|
|
|
- RangeStartDelta(var): RangeStartNBits *bits* of little endian number. See below for decoding
|
|
- FirstOrdinalDelta(var): FirstOrdinalNBits *bits* of little endian number. See below for decoding
|
|
- FinalRangeEnd(var): RangeStartNBits *bits* of integer. See below for decoding
|
|
|
|
converting a BlockData of index Index and a BlockAddrBlockMetadata to an actual block address is done as follow:
|
|
range\_prediction := RangeStart + Index * RangeSlop;
|
|
range\_derivation := RangeStartDelta - (1 << (RangeStartNBits-1));
|
|
range\_start := range\_prediction + range\_derivation
|
|
|
|
The same computation can be done for ordinal.
|
|
|
|
Note that `range_derivation` can take negative value. `RangeStartDelta` is just its translation to a positive range.
|
|
|
|
|
|
## SingleBlockSStable
|
|
|
|
The format used for the index is meant to be compact, however it has a constant cost of around 70
|
|
bytes, which isn't negligible for a table containing very few keys.
|
|
To limit the impact of that constant cost, single block sstable omit the Fst and BlockAddrStore from
|
|
their index. Instead a block with first ordinal of 0, range start of 0 and range end of IndexOffset
|
|
is implicitly used for every operations.
|