Files
tantivy/docs/datastruct.md
2016-08-11 21:18:59 +09:00

61 lines
2.0 KiB
Markdown

% Tantivy's datastructure and index format
This document explains how tantivy works, and specifically
what kind of datastructures are used to index and store the data.
# An inverted index
As you may know, an idea central to search engines is to assign a document id
to each document, and build an inverted index, which is simply
a datastructure associating each term (word) to a sorted list of doc ids.
Such an index then makes it possible to compute the union or
the intersection of the documents containing two terms
in `O(1)` memory and `O(n)` time.
## Term dictionary
Tantivy term dicionary (`.term` files) are stored in
a finite state transducer (courtesy of the excellent
[`fst`](https://github.com/BurntSushi/fst) crate).
For each term, the dictionary associates
a [TermInfo](http://fulmicoton.com/tantivy/tantivy/postings/struct.TermInfo.html).
which contains all of the information required to access the list of doc ids of the doc containing
the term.
In fact `fst` can only associated terms to a long. [`FstMap`](https://github.com/fulmicoton/tantivy/blob/master/src/datastruct/fstmap.rs) are
in charge to build a KV map on top of it.
## Postings
The posting lists (sorted list of doc ids) are encoded in the `.idx` file.
Optionally, you specify in your schema that you want tf-idf to be encoded
in the index file (if you do not, the index will behave as if all documents
have a term frequency of 1).
Tf-idf scoring requires the term frequency (number of time the term appeared in the field of the document)
for each document.
# Segments
Tantivy's index are divided into segments.
All segments are as many independant structure.
This has many benefits. For instance, assuming you are
trying to one billion documents, you could split
your corpus into N pieces, index them on Hadoop, copy all
of the resulting segments in the same directory
and edit the index meta.json file to list all of the segments.
This strong division also simplify a lot multithreaded indexing.
Each thread is actually build its own segment.
##
# Store
The store
When a document