mirror of
https://github.com/quickwit-oss/tantivy.git
synced 2026-05-28 06:00:40 +00:00
* compress sstable with zstd * add some details to sstable readme * compress only block which benefit from it * multiple changes to sstable make compression optional use OwnedBytes instead of impl Read in sstable, required for next point use zstd bulk api, which is much faster on small records * cleanup and use bulk api for compression * use dedicated byte for compression * switch block len and compression flag * change default zstd level in sstable
122 lines
3.6 KiB
Markdown
122 lines
3.6 KiB
Markdown
# SSTable
|
|
|
|
The `tantivy-sstable` crate is yet another sstable crate.
|
|
|
|
It has been designed to be used in `quickwit`:
|
|
- as an alternative to the default tantivy fst dictionary.
|
|
- as a way to store the column index for dynamic fast fields.
|
|
|
|
The benefit compared to the fst crate is locality.
|
|
Searching a key in the fst crate requires downloading the entire dictionary.
|
|
|
|
Once the sstable index is downloaded, running a `get` in the sstable
|
|
crate only requires a single fetch.
|
|
|
|
Right now, the block index and the default block size have been thought
|
|
for quickwit, and the performance of a get is very bad.
|
|
|
|
# Sorted strings?
|
|
|
|
SSTable stands for Sorted String Table.
|
|
Strings have to be insert in sorted order.
|
|
|
|
That sorted order is used in different ways:
|
|
- it makes gets and streaming ranges of keys
|
|
possible.
|
|
- it allows incremental encoding of the keys
|
|
- the front compression is leveraged to optimize
|
|
the intersection with an automaton
|
|
|
|
# On disk format
|
|
|
|
Overview of the SSTable format. Unless noted otherwise, numbers are little-endian.
|
|
|
|
### SSTable
|
|
```
|
|
+-------+-------+-----+--------+
|
|
| Block | Block | ... | Footer |
|
|
+-------+-------+-----+--------+
|
|
|----( # of blocks)---|
|
|
```
|
|
- Block(`SSTBlock`): list of independent block, terminated by a single empty block.
|
|
- Footer(`SSTFooter`)
|
|
|
|
### SSTBlock
|
|
```
|
|
+----------+----------+--------+-------+-------+-----+
|
|
| BlockLen | Compress | Values | Delta | Delta | ... |
|
|
+----------+----------+--------+-------+-------+-----+
|
|
| |----( # of deltas)---|
|
|
|------(maybe compressed)------|
|
|
```
|
|
- BlockLen(u32): length of the block, including the compress byte.
|
|
- Compress(u8): indicate whether block is compressed. 0 if not compressed, 1 if compressed.
|
|
- Values: an application defined format storing a sequence of value, capable of determining it own length
|
|
- Delta
|
|
|
|
### Delta
|
|
```
|
|
+---------+--------+
|
|
| KeepAdd | Suffix |
|
|
+---------+--------+
|
|
```
|
|
- KeepAdd
|
|
- Suffix: KeepAdd.add bytes of key suffix
|
|
|
|
### KeepAdd
|
|
KeepAdd can be represented in two different representation, a very compact 1byte one which is enough for most usage, and a longer variable-len one when required
|
|
|
|
When keep < 16 and add < 16
|
|
```
|
|
+-----+------+
|
|
| Add | Keep |
|
|
+-----+------+
|
|
```
|
|
- Add(u4): number of bytes to push
|
|
- Keep(u4): number of bytes to pop
|
|
|
|
Otherwise:
|
|
```
|
|
+------+------+-----+
|
|
| 0x01 | Keep | Add |
|
|
+------+------+-----+
|
|
```
|
|
- Add(VInt): number of bytes to push
|
|
- Keep(VInt): number of bytes to pop
|
|
|
|
|
|
Note: as the SSTable does not support redundant keys, there is no ambiguity between both representation. Add is always guaranteed to be non-zero, except for the very first key of an SSTable, where Keep is guaranteed to be zero.
|
|
|
|
### SSTFooter
|
|
```
|
|
+-------+-------+-----+-------------+---------+---------+
|
|
| Block | Block | ... | IndexOffset | NumTerm | Version |
|
|
+-------+-------+-----+-------------+---------+---------+
|
|
|----( # of blocks)---|
|
|
```
|
|
- Block(SSTBlock): uses IndexValue for its Values format
|
|
- IndexOffset(u64): Offset to the start of the SSTFooter
|
|
- NumTerm(u64): number of terms in the sstable
|
|
- Version(u32): Currently equal to 2
|
|
|
|
### IndexValue
|
|
```
|
|
+------------+----------+-------+-------+-----+
|
|
| EntryCount | StartPos | Entry | Entry | ... |
|
|
+------------+----------+-------+-------+-----+
|
|
|---( # of entries)---|
|
|
```
|
|
|
|
- EntryCount(VInt): number of entries
|
|
- StartPos(VInt): the start pos of the first (data) block referenced by this (index) block
|
|
- Entry (IndexEntry)
|
|
|
|
### Entry
|
|
```
|
|
+----------+--------------+
|
|
| BlockLen | FirstOrdinal |
|
|
+----------+--------------+
|
|
```
|
|
- BlockLen(VInt): length of the block
|
|
- FirstOrdinal(VInt): ordinal of the first element in the given block
|