Files
tantivy/docs/datastruct.md
2016-08-11 21:18:59 +09:00

2.0 KiB

% Tantivy's datastructure and index format

This document explains how tantivy works, and specifically what kind of datastructures are used to index and store the data.

An inverted index

As you may know, an idea central to search engines is to assign a document id to each document, and build an inverted index, which is simply a datastructure associating each term (word) to a sorted list of doc ids.

Such an index then makes it possible to compute the union or the intersection of the documents containing two terms in O(1) memory and O(n) time.

Term dictionary

Tantivy term dicionary (.term files) are stored in a finite state transducer (courtesy of the excellent fst crate).

For each term, the dictionary associates a TermInfo. which contains all of the information required to access the list of doc ids of the doc containing the term.

In fact fst can only associated terms to a long. FstMap are in charge to build a KV map on top of it.

Postings

The posting lists (sorted list of doc ids) are encoded in the .idx file. Optionally, you specify in your schema that you want tf-idf to be encoded in the index file (if you do not, the index will behave as if all documents have a term frequency of 1). Tf-idf scoring requires the term frequency (number of time the term appeared in the field of the document) for each document.

Segments

Tantivy's index are divided into segments. All segments are as many independant structure.

This has many benefits. For instance, assuming you are trying to one billion documents, you could split your corpus into N pieces, index them on Hadoop, copy all of the resulting segments in the same directory and edit the index meta.json file to list all of the segments.

This strong division also simplify a lot multithreaded indexing. Each thread is actually build its own segment.

Store

The store When a document