mirror of
https://github.com/quickwit-oss/tantivy.git
synced 2026-01-06 01:02:55 +00:00
added extra doc file
@@ -34,6 +34,7 @@ script:
       travis-cargo bench &&
       travis-cargo doc
 after_success:
+  - bash ./script/build-doc.sh
   - travis-cargo doc-upload
   - if [[ "$TRAVIS_OS_NAME" == "linux" ]]; then travis-cargo coveralls --no-sudo --verify; fi
   - if [[ "$TRAVIS_OS_NAME" == "linux" ]]; then ./kcov/build/src/kcov --verify --coveralls-id=$TRAVIS_JOB_ID --exclude-pattern=/.cargo target/kcov target/debug/tantivy-*; fi
docs/datastruct.md (new file, 61 lines)
@@ -0,0 +1,61 @@
% Tantivy's data structures and index format

This document explains how tantivy works, and specifically
what kind of data structures are used to index and store the data.

# An inverted index

As you may know, an idea central to search engines is to assign a document id
to each document, and to build an inverted index, which is simply
a data structure associating each term (word) with a sorted list of doc ids.

Such an index then makes it possible to compute the union or
the intersection of the documents containing two terms
in `O(1)` memory and `O(n)` time.
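The two-pointer walk behind this complexity claim can be sketched as follows. This is a minimal illustration, not tantivy's actual code; the names and data are made up. The `O(1)` bound refers to the auxiliary cursor state; the output itself of course takes space proportional to its size.

```rust
// Intersect two sorted posting lists with two cursors.
// Each list is scanned at most once, so the walk is O(n) time,
// and the only auxiliary state is the two cursor positions.
fn intersect(left: &[u32], right: &[u32]) -> Vec<u32> {
    let (mut i, mut j) = (0, 0);
    let mut result = Vec::new();
    while i < left.len() && j < right.len() {
        if left[i] < right[j] {
            i += 1;
        } else if left[i] > right[j] {
            j += 1;
        } else {
            // doc id present in both posting lists
            result.push(left[i]);
            i += 1;
            j += 1;
        }
    }
    result
}

fn main() {
    // doc ids of the documents containing "apple" and "pie" (made-up data)
    let apple = vec![1, 4, 7, 12, 31];
    let pie = vec![4, 12, 13, 31, 45];
    assert_eq!(intersect(&apple, &pie), vec![4, 12, 31]);
}
```

Computing the union works the same way, except that both branches also emit the smaller doc id instead of skipping it.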

## Term dictionary

Tantivy's term dictionaries (`.term` files) are stored in
a finite state transducer (courtesy of the excellent
[`fst`](https://github.com/BurntSushi/fst) crate).

For each term, the dictionary associates
a [TermInfo](http://fulmicoton.com/tantivy/tantivy/postings/struct.TermInfo.html),
which contains all of the information required to access the list of doc ids
of the documents containing the term.

In fact, `fst` can only associate each term with a single `u64`. [`FstMap`](https://github.com/fulmicoton/tantivy/blob/master/src/datastruct/fstmap.rs) is
in charge of building a key-value map on top of it.
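The trick of layering a key-value map over a term-to-`u64` structure can be sketched in plain Rust. This is a toy stand-in, not `FstMap` itself: a `BTreeMap` plays the role of the fst, the `u64` it stores is an offset into a side buffer holding the serialized values, and all names here are mine.

```rust
use std::collections::BTreeMap;

// Toy sketch: the underlying automaton can only map a term to a u64,
// so each serialized value lives in a side buffer, and the map keeps
// only the value's starting offset.
struct ToyTermMap {
    term_to_offset: BTreeMap<String, u64>, // stand-in for the fst
    values: Vec<u8>,                       // serialized values, back to back
}

impl ToyTermMap {
    fn new() -> ToyTermMap {
        ToyTermMap { term_to_offset: BTreeMap::new(), values: Vec::new() }
    }

    fn insert(&mut self, term: &str, value: &[u8]) {
        let offset = self.values.len() as u64;
        self.term_to_offset.insert(term.to_string(), offset);
        // length-prefix the value so the reader knows where it ends
        self.values.push(value.len() as u8);
        self.values.extend_from_slice(value);
    }

    fn get(&self, term: &str) -> Option<&[u8]> {
        let offset = *self.term_to_offset.get(term)? as usize;
        let len = self.values[offset] as usize;
        Some(&self.values[offset + 1..offset + 1 + len])
    }
}

fn main() {
    let mut map = ToyTermMap::new();
    map.insert("apple", b"info-a");
    map.insert("pie", b"info-b");
    assert_eq!(map.get("pie"), Some(&b"info-b"[..]));
    assert_eq!(map.get("absent"), None);
}
```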

## Postings

The posting lists (sorted lists of doc ids) are encoded in the `.idx` file.
Optionally, you can specify in your schema that you want term frequencies to be encoded
in the index file (if you do not, the index will behave as if all documents
have a term frequency of 1).
Tf-idf scoring requires the term frequency (the number of times the term appeared in the field of the document)
for each document.
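The "behave as if all documents have a term frequency of 1" behavior can be sketched like this. This is a simplified model, not tantivy's on-disk encoding; the type and method names are made up.

```rust
// Simplified model of a posting list: it either stores a term frequency
// alongside each doc id, or stores doc ids only, in which case readers
// treat every matching document as having a term frequency of 1.
enum Postings {
    WithFreq(Vec<(u32, u32)>), // (doc_id, term_freq)
    DocsOnly(Vec<u32>),        // doc ids only
}

impl Postings {
    fn term_freq(&self, doc_id: u32) -> u32 {
        match self {
            Postings::WithFreq(entries) => entries
                .iter()
                .find(|&&(d, _)| d == doc_id)
                .map(|&(_, tf)| tf)
                .unwrap_or(0),
            // no frequencies stored: every matching doc counts as 1
            Postings::DocsOnly(docs) => {
                if docs.contains(&doc_id) { 1 } else { 0 }
            }
        }
    }
}

fn main() {
    let with = Postings::WithFreq(vec![(2, 3), (5, 1)]);
    let without = Postings::DocsOnly(vec![2, 5]);
    assert_eq!(with.term_freq(2), 3);
    assert_eq!(without.term_freq(2), 1);
    assert_eq!(without.term_freq(9), 0);
}
```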

# Segments

Tantivy's indexes are divided into segments.
Each segment is an independent index structure.

This has many benefits. For instance, assuming you are
trying to index one billion documents, you could split
your corpus into N pieces, index them on Hadoop, copy all
of the resulting segments into the same directory,
and edit the index's meta.json file to list all of the segments.

This strong division also greatly simplifies multithreaded indexing.
Each thread actually builds its own segment.
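The one-segment-per-thread idea can be sketched with plain threads. This shows only the shape of the approach, not tantivy's indexing pipeline: each "segment" here is just a sorted list of doc ids, and the names are mine.

```rust
use std::thread;

// Each indexing thread builds its own independent "segment";
// a real segment would build postings, a term dictionary, etc.
fn build_segment(docs: Vec<u32>) -> Vec<u32> {
    let mut segment = docs;
    segment.sort();
    segment
}

fn main() {
    // two batches of documents, indexed fully independently
    let batches = vec![vec![3, 1, 2], vec![6, 5, 4]];
    let handles: Vec<_> = batches
        .into_iter()
        .map(|batch| thread::spawn(move || build_segment(batch)))
        .collect();
    // no coordination is needed until the very end, when the finished
    // segments are simply listed side by side (like entries in meta.json)
    let segments: Vec<Vec<u32>> =
        handles.into_iter().map(|h| h.join().unwrap()).collect();
    assert_eq!(segments, vec![vec![1, 2, 3], vec![4, 5, 6]]);
}
```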

##

# Store

The store
When a document
docs/style.css (new file, 37 lines)
@@ -0,0 +1,37 @@
body {
    max-width: 1000px;
    padding-left: 300px;
}

nav {
    width: 300px;
    position: fixed;
    left: 0;
    top: 0;
    padding: 30px;
    border-bottom: none !important;
}

nav > ul {
    padding-left: 0px;
}

nav ul, nav li {
    list-style: none;
}

h1.title {
    font-size: 2em;
}

nav a, h1, h2.section-header {
    color: #6d6c6c;
}

nav a {
    color: #187ec1;
}

h1.title {
    color: #df3600;
}
@@ -1,20 +1,19 @@
-# Indexing Wikipedia with Tantivy CLI interface
+% Tutorial: Indexing Wikipedia with Tantivy CLI
 
-## Introduction
+# Introduction
 
 In this tutorial, we will create a brand new index
 with the articles of English wikipedia in it.
 
-## Step 1 - Get tantivy CLI interface
+# Install
 
 There are two ways to get `tantivy`.
 If you are a rust programmer, you can run `cargo install tantivy`.
 Alternatively, if you are on `Linux 64bits`, you can download a
 static binary: [binaries/linux_x86_64/](http://fulmicoton.com/tantivy/binaries/linux_x86_64/tantivy)
 
-## Step 2 - creating the index
+# Creating the index
 
 Create a directory in which your index will be stored.
@@ -40,7 +39,7 @@ the definition of the schema of our new index.
 
 When asked, answer the questions as follows:
 
-```
+```none
 Creating new index
 Let's define it's schema!
 
@@ -114,7 +113,7 @@ If you want to know more about the meaning of these options, you can check out t
 The json displayed at the end has been written in `wikipedia-index/meta.json`.
 
 
-# Step 3 - Get the documents to index
+# Get the documents to index
 
 Tantivy's index command offers a way to index a json file.
 More accurately, the file must contain one document per line, in json format.
@@ -134,7 +133,7 @@ Make sure to uncompress the file
 bunzip2 wiki-articles.json.bz2
 ```
 
-# Step 4 - Index the documents.
+# Index the documents.
 
 The `index` command will index your document.
 By default it will use as many threads as there are cores on your machine.
@@ -145,7 +144,8 @@ On my computer (8 core Xeon(R) CPU X3450 @ 2.67GHz), it only takes 7 minutes.
 cat /data/wiki-articles | tantivy index -i wikipedia-index
 ```
 
 # Step 5 - Have a look at the index directory
 While it is indexing, you can peek at the index directory
 to check what is happening.
 
 ```bash
 ls wikipedia-index
@@ -159,7 +159,7 @@ It is named by a uuid.
 Each of the different files stores a different data structure for the index.
 
 
-# Step 6 - Serve a search index
+# Serve the search index
 
 ```
 tantivy serve -i wikipedia-index
script/build-doc.sh (new executable file, 10 lines)
@@ -0,0 +1,10 @@
#!/bin/bash
DEST=target/doc/tantivy/docs/
mkdir -p $DEST

for f in docs/*.md
do
    rustdoc $f -o $DEST --markdown-css ../../rustdoc.css --markdown-css style.css
done

cp docs/*.css $DEST
@@ -54,9 +54,8 @@ impl FreqHandler {
                 block_decoder.output(idx)
             }
             FreqHandler::NoFreq => {
-                0
+                1u32
             }
         }
     }
-
 }
@@ -1,6 +1,17 @@
 use common::BinarySerializable;
 use std::io;
 
 
+// `TermInfo` contains all of the information
+// associated with a term in the `.term` file.
+//
+// It consists of:
+// * doc_freq: the number of documents in the segment
+//   containing this term. It is also the length of the
+//   posting list associated with this term.
+// * postings_offset: an offset in the `.idx` file
+//   addressing the start of the posting list associated
+//   with this term.
 #[derive(Debug,Ord,PartialOrd,Eq,PartialEq,Clone)]
 pub struct TermInfo {
     pub doc_freq: u32,