added extra doc file

Paul Masurel
2016-08-11 21:18:59 +09:00
parent 853e020fda
commit 841a54546e
7 changed files with 135 additions and 16 deletions


@@ -34,6 +34,7 @@ script:
travis-cargo bench &&
travis-cargo doc
after_success:
-  - travis-cargo doc-upload
-  - if [[ "$TRAVIS_OS_NAME" == "linux" ]]; then travis-cargo coveralls --no-sudo --verify; fi
-  - if [[ "$TRAVIS_OS_NAME" == "linux" ]]; then ./kcov/build/src/kcov --verify --coveralls-id=$TRAVIS_JOB_ID --exclude-pattern=/.cargo target/kcov target/debug/tantivy-*; fi
+  - bash ./script/build-doc.sh
+  - travis-cargo doc-upload
+  - if [[ "$TRAVIS_OS_NAME" == "linux" ]]; then travis-cargo coveralls --no-sudo --verify; fi
+  - if [[ "$TRAVIS_OS_NAME" == "linux" ]]; then ./kcov/build/src/kcov --verify --coveralls-id=$TRAVIS_JOB_ID --exclude-pattern=/.cargo target/kcov target/debug/tantivy-*; fi

docs/datastruct.md Normal file

@@ -0,0 +1,61 @@
% Tantivy's data structures and index format
This document explains how tantivy works, and more specifically
what kind of data structures are used to index and store the data.
# An inverted index
As you may know, an idea central to search engines is to assign a document id
to each document, and to build an inverted index, which is simply
a data structure associating each term (word) with the sorted list of the ids of the documents containing it.
Such an index then makes it possible to compute the union or
the intersection of the documents containing two terms
in `O(1)` memory and `O(n)` time.
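For instance, since posting lists are sorted by doc id, the intersection of two of them can be computed with a single linear merge. The sketch below illustrates the principle; it is not tantivy's actual implementation:

```rust
use std::cmp::Ordering;

/// Intersects two posting lists (sorted lists of doc ids) with a
/// linear merge: O(n) time, O(1) memory besides the output.
fn intersect(left: &[u32], right: &[u32]) -> Vec<u32> {
    let (mut i, mut j) = (0, 0);
    let mut result = Vec::new();
    while i < left.len() && j < right.len() {
        match left[i].cmp(&right[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                // The doc id appears in both lists: it belongs to the intersection.
                result.push(left[i]);
                i += 1;
                j += 1;
            }
        }
    }
    result
}
```

The union can be computed by the same kind of merge, emitting a doc id whenever it appears in either list.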
## Term dictionary
Tantivy's term dictionaries (`.term` files) are stored in
a finite state transducer (courtesy of the excellent
[`fst`](https://github.com/BurntSushi/fst) crate).
For each term, the dictionary associates
a [TermInfo](http://fulmicoton.com/tantivy/tantivy/postings/struct.TermInfo.html),
which contains all of the information required to access the list of the ids of the documents containing
the term.
In fact, `fst` can only associate terms with a `u64`. [`FstMap`](https://github.com/fulmicoton/tantivy/blob/master/src/datastruct/fstmap.rs) is
in charge of building a key-value map on top of it.
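As a rough sketch of the underlying building block, here is how the `fst` crate alone associates terms with a `u64` (this uses plain `fst`, not tantivy's `FstMap`, and assumes a recent version of the crate):

```rust
use fst::MapBuilder;

fn main() {
    // fst requires keys to be inserted in lexicographic order.
    let mut builder = MapBuilder::memory();
    builder.insert("search", 1u64).unwrap();
    builder.insert("tantivy", 2u64).unwrap();
    let map = builder.into_map();

    // In tantivy, the u64 value is what addresses the TermInfo
    // associated with the term.
    assert_eq!(map.get("tantivy"), Some(2));
}
```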
## Postings
The posting lists (sorted lists of doc ids) are encoded in the `.idx` file.
Optionally, you can specify in your schema that you want term frequencies to be encoded
in the index file (if you do not, the index will behave as if all documents
have a term frequency of 1).
Tf-idf scoring requires the term frequency (the number of times the term appears in the field of the document)
for each document.
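Conceptually, a posting list with term frequencies can be pictured as follows. This is only a mental model, not the actual on-disk encoding:

```rust
type DocId = u32;

/// Conceptual view of the posting list of a single term.
struct PostingList {
    /// Sorted ids of the documents containing the term, e.g. [1, 5, 42].
    doc_ids: Vec<DocId>,
    /// Term frequencies aligned with `doc_ids`, e.g. Some(vec![2, 1, 7]).
    /// `None` means frequencies were not encoded: they are all read as 1.
    term_freqs: Option<Vec<u32>>,
}
```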
# Segments
A tantivy index is divided into segments.
Each segment is an independent index structure.
This has many benefits. For instance, assuming you are
trying to index one billion documents, you could split
your corpus into N pieces, index them on Hadoop, copy all
of the resulting segments into the same directory,
and edit the index `meta.json` file to list all of the segments.
This strong division also greatly simplifies multithreaded indexing:
each thread actually builds its own segment.
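For illustration, after such a manual merge the `meta.json` file could list the segments roughly as follows. This is a hypothetical sketch: the field names and exact layout of tantivy's actual `meta.json` may differ, and the segment names below are made-up uuids.

```json
{
  "segments": [
    "a1b2c3d4-0000-4000-8000-000000000001",
    "a1b2c3d4-0000-4000-8000-000000000002"
  ]
}
```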
# Store
The store (`.store` files) keeps the values of the fields that are marked as stored in the schema.
When a document is indexed, its stored field values are written to the store; they can later be fetched back using the document's doc id.

docs/style.css Normal file

@@ -0,0 +1,37 @@
body {
max-width: 1000px;
padding-left: 300px;
}
nav {
width: 300px;
position: fixed;
left: 0;
top: 0;
padding: 30px;
border-bottom: none !important;
}
nav > ul {
padding-left: 0px;
}
nav ul, nav li {
list-style: none;
}
h1.title {
font-size: 2em;
}
nav a, h1, h2.section-header {
color: #6d6c6c;
}
nav a {
color: #187ec1;
}
h1.title {
color: #df3600;
}


@@ -1,20 +1,19 @@
-# Indexing Wikipedia with Tantivy CLI interface
+% Tutorial: Indexing Wikipedia with Tantivy CLI
-## Introduction
+# Introduction
In this tutorial, we will create a brand new index
with the articles of English wikipedia in it.
-## Step 1 - Get tantivy CLI interface
+# Install
There are two ways to get `tantivy`.
If you are a rust programmer, you can run `cargo install tantivy`.
Alternatively, if you are on `Linux 64bits`, you can download a
static binary: [binaries/linux_x86_64/](http://fulmicoton.com/tantivy/binaries/linux_x86_64/tantivy)
-## Step 2 - creating the index
+# Creating the index
Create a directory in which your index will be stored.
@@ -40,7 +39,7 @@ the definition of the schema of our new index.
When asked, answer the questions as follows:
-```
+```none
Creating new index
Let's define it's schema!
@@ -114,7 +113,7 @@ If you want to know more about the meaning of these options, you can check out t
The json displayed at the end has been written in `wikipedia-index/meta.json`.
-# Step 3 - Get the documents to index
+# Get the documents to index
Tantivy's index command offers a way to index a json file.
More accurately, the file must contain one document per line, in a json format.
@@ -134,7 +133,7 @@ Make sure to uncompress the file
bunzip2 wiki-articles.json.bz2
```
-# Step 4 - Index the documents.
+# Index the documents.
The `index` command will index your documents.
By default, it will use as many threads as there are cores on your machine.
@@ -145,7 +144,8 @@ On my computer (8 core Xeon(R) CPU X3450 @ 2.67GHz), it only takes 7 minutes.
cat /data/wiki-articles | tantivy index -i wikipedia-index
```
-# Step 5 - Have a look at the index directory
+While it is indexing, you can peek at the index directory
+to check what is happening.
```bash
ls wikipedia-index
@@ -159,7 +159,7 @@ It is named by a uuid.
Each of these files stores a different data structure of the index.
-# Step 6 - Serve a search index
+# Serve the search index
```
tantivy serve -i wikipedia-index

script/build-doc.sh Executable file

@@ -0,0 +1,10 @@
#!/bin/bash

# Render each markdown file in docs/ to HTML with rustdoc,
# then copy the stylesheets next to the generated pages.
DEST=target/doc/tantivy/docs/
mkdir -p "$DEST"
for f in docs/*.md
do
    rustdoc "$f" -o "$DEST" --markdown-css ../../rustdoc.css --markdown-css style.css
done
cp docs/*.css "$DEST"


@@ -54,9 +54,8 @@ impl FreqHandler {
                block_decoder.output(idx)
            }
            FreqHandler::NoFreq => {
-                0
+                1u32
            }
        }
    }
}


@@ -1,6 +1,17 @@
use common::BinarySerializable;
use std::io;
+// `TermInfo` contains all of the information
+// associated with a term in the `.term` file.
+//
+// It consists of:
+// * doc_freq : the number of documents in the segment
+// containing this term. It is also the length of the
+// posting list associated with this term.
+// * postings_offset: an offset in the `.idx` file
+// addressing the start of the posting list associated
+// with this term.
#[derive(Debug,Ord,PartialOrd,Eq,PartialEq,Clone)]
pub struct TermInfo {
pub doc_freq: u32,