% Tantivy's data structures and index format

This document explains how tantivy works, and specifically
what kind of data structures are used to index and store the data.
# An inverted index

As you may know, an idea central to search engines is to assign a document id
to each document, and to build an inverted index, which is simply
a data structure associating each term (word) with a sorted list of doc ids.

Such an index then makes it possible to compute the union or
the intersection of the documents containing two given terms
in `O(1)` memory and `O(n)` time.
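
To make the complexity claim concrete, here is a minimal sketch of the classic merge-style intersection of two sorted posting lists. It is not tantivy's actual implementation, and a real engine would stream the matches instead of collecting them into a `Vec` (which is what keeps memory usage constant):

```rust
use std::cmp::Ordering;

/// Merge-style intersection of two sorted posting lists.
/// Runs in O(n) time and, apart from the output buffer, O(1) memory.
fn intersect(left: &[u32], right: &[u32]) -> Vec<u32> {
    let (mut i, mut j) = (0, 0);
    let mut hits = Vec::new();
    while i < left.len() && j < right.len() {
        match left[i].cmp(&right[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                hits.push(left[i]);
                i += 1;
                j += 1;
            }
        }
    }
    hits
}

fn main() {
    // Made-up posting lists for two terms.
    let term_a = [1, 4, 7, 9];
    let term_b = [2, 4, 9, 12];
    assert_eq!(intersect(&term_a, &term_b), vec![4, 9]);
}
```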
## Term dictionary

Tantivy's term dictionaries (`.term` files) are stored in
a finite state transducer (courtesy of the excellent
[`fst`](https://github.com/BurntSushi/fst) crate).

For each term, the dictionary associates
a [TermInfo](http://fulmicoton.com/tantivy/tantivy/postings/struct.TermInfo.html),
which contains all of the information required to access the sorted list of the ids of the documents containing
the term.

In fact, `fst` can only associate each term with a `u64`. [`FstMap`](https://github.com/fulmicoton/tantivy/blob/master/src/datastruct/fstmap.rs) is
in charge of building a key-value map on top of it.
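
To illustrate the layering, here is a small sketch built directly on a recent version of the `fst` crate. The terms and offsets are made up, and tantivy's `FstMap` adds the serialization of `TermInfo` values on top of this term-to-`u64` mapping:

```rust
use fst::{Map, MapBuilder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Keys must be inserted in lexicographic order.
    let mut builder = MapBuilder::memory();
    builder.insert("search", 0)?;   // hypothetical offset of "search"'s TermInfo
    builder.insert("tantivy", 42)?; // hypothetical offset of "tantivy"'s TermInfo
    let map: Map<Vec<u8>> = builder.into_map();

    // A lookup returns the u64 payload; the dictionary can then use it
    // to locate the serialized TermInfo for the term.
    assert_eq!(map.get("tantivy"), Some(42));
    Ok(())
}
```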
## Postings

The posting lists (sorted lists of doc ids) are encoded in the `.idx` file.
Optionally, you can specify in your schema that you want term frequencies to be encoded
in the index file (if you do not, the index will behave as if all documents
had a term frequency of 1).
Tf-idf scoring requires the term frequency (the number of times the term appears in the given field of the document)
for each document.
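
For reference, a common textbook formulation of tf-idf (this document does not spell out tantivy's exact scoring function) is:

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{N}{\mathrm{df}(t)}
```

where `tf(t, d)` is the term frequency stored in the postings, `N` is the total number of documents, and `df(t)` is the number of documents containing the term `t`.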
# Segments

A tantivy index is divided into segments.
Each segment is a fully independent index structure.

This has many benefits. For instance, assuming you are
trying to index one billion documents, you could split
your corpus into N pieces, index them on Hadoop, copy all
of the resulting segments into the same directory,
and edit the index `meta.json` file to list all of the segments.

This strong division also greatly simplifies multithreaded indexing:
each thread actually builds its own segment, as sketched below.
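
Here is a minimal sketch of that idea, with a hypothetical `Segment` type and `build_segment` function standing in for tantivy's actual indexing machinery:

```rust
use std::thread;

// A hypothetical, drastically simplified segment: tantivy's real segments
// hold a term dictionary, postings, a store, etc.
#[derive(Debug)]
struct Segment {
    doc_ids: Vec<u32>,
}

// Stand-in for the per-thread indexing work.
fn build_segment(docs: Vec<u32>) -> Segment {
    Segment { doc_ids: docs }
}

fn main() {
    // Split the corpus into chunks; each thread builds its own segment.
    let corpus: Vec<u32> = (0..8).collect();
    let handles: Vec<_> = corpus
        .chunks(4)
        .map(|chunk| {
            let chunk = chunk.to_vec();
            thread::spawn(move || build_segment(chunk))
        })
        .collect();

    // The finished index is simply the collection of independent segments.
    let segments: Vec<Segment> =
        handles.into_iter().map(|h| h.join().unwrap()).collect();
    println!("built {} segments: {:?}", segments.len(), segments);
}
```

Because segments never share mutable state, the threads need no synchronization beyond being joined at the end.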
# Store

The store
When a document