Added literate programming for the example

This commit is contained in:
Paul Masurel
2016-09-23 00:23:21 +09:00
parent 1a08ca4f95
commit 5a761dbcf0
21 changed files with 1509 additions and 341 deletions

View File

@@ -1,61 +0,0 @@
% Tantivy's datastructure and index format
This document explains how tantivy works, and specifically
what kind of datastructures are used to index and store the data.
# An inverted index
As you may know, an idea central to search engines is to assign a document id
to each document, and build an inverted index, which is simply
a datastructure associating each term (word) to a sorted list of doc ids.
Such an index then makes it possible to compute the union or
the intersection of the documents containing two terms
in `O(1)` memory and `O(n)` time.
## Term dictionary
Tantivy term dicionary (`.term` files) are stored in
a finite state transducer (courtesy of the excellent
[`fst`](https://github.com/BurntSushi/fst) crate).
For each term, the dictionary associates
a [TermInfo](http://fulmicoton.com/tantivy/tantivy/postings/struct.TermInfo.html).
which contains all of the information required to access the list of doc ids of the doc containing
the term.
In fact `fst` can only associated terms to a long. [`FstMap`](https://github.com/fulmicoton/tantivy/blob/master/src/datastruct/fstmap.rs) are
in charge to build a KV map on top of it.
## Postings
The posting lists (sorted list of doc ids) are encoded in the `.idx` file.
Optionally, you specify in your schema that you want tf-idf to be encoded
in the index file (if you do not, the index will behave as if all documents
have a term frequency of 1).
Tf-idf scoring requires the term frequency (number of time the term appeared in the field of the document)
for each document.
# Segments
Tantivy's index are divided into segments.
All segments are as many independant structure.
This has many benefits. For instance, assuming you are
trying to one billion documents, you could split
your corpus into N pieces, index them on Hadoop, copy all
of the resulting segments in the same directory
and edit the index meta.json file to list all of the segments.
This strong division also simplify a lot multithreaded indexing.
Each thread is actually build its own segment.
##
# Store
The store
When a document

View File

@@ -1,37 +0,0 @@
body {
max-width: 1000px;
padding-left: 300px;
}
nav {
width: 300px;
position: fixed;
left: 0;
top: 0;
padding: 30px;
border-bottom: none !important;
}
nav > ul {
padding-left: 0px;
}
nav ul, nav li {
list-style: none;
}
h1.title {
font-size: 2em;
}
nav a, h1, h2.section-header {
color: #6d6c6c;
}
nav a {
color: #187ec1;
}
h1.title {
color: #df3600;
}

View File

@@ -1,167 +0,0 @@
% Tutorial: Indexing Wikipedia with Tantivy CLI
# Introduction
In this tutorial, we will create a brand new index
with the articles of English wikipedia in it.
# Install
There are two ways to get `tantivy`.
If you are a rust programmer, you can run `cargo install tantivy`.
Alternatively, if you are on `Linux 64bits`, you can download a
static binary: [binaries/linux_x86_64/](http://fulmicoton.com/tantivy/binaries/linux_x86_64/tantivy)
# Creating the index
Create a directory in which your index will be stored.
```bash
# create the directory
mkdir wikipedia-index
```
We will now initialize the index and create it's schema.
Our documents will contain
* a title
* a body
* a url
Running `tantivy new` will start a wizard that will help you go through
the definition of the schema of our new index.
```bash
tantivy new -i wikipedia-index
```
When asked answer to the question as follows:
```none
Creating new index
Let's define it's schema!
New field name ? title
Text or unsigned 32-bit Integer (T/I) ? T
Should the field be stored (Y/N) ? Y
Should the field be indexed (Y/N) ? Y
Should the field be tokenized (Y/N) ? Y
Should the term frequencies (per doc) be in the index (Y/N) ? Y
Should the term positions (per doc) be in the index (Y/N) ? Y
Add another field (Y/N) ? Y
New field name ? body
Text or unsigned 32-bit Integer (T/I) ? T
Should the field be stored (Y/N) ? Y
Should the field be indexed (Y/N) ? Y
Should the field be tokenized (Y/N) ? Y
Should the term frequencies (per doc) be in the index (Y/N) ? Y
Should the term positions (per doc) be in the index (Y/N) ? Y
Add another field (Y/N) ? Y
New field name ? url
Text or unsigned 32-bit Integer (T/I) ? T
Should the field be stored (Y/N) ? Y
Should the field be indexed (Y/N) ? N
Add another field (Y/N) ? N
[
{
"name": "title",
"type": "text",
"options": {
"indexing": "position",
"stored": true
}
},
{
"name": "body",
"type": "text",
"options": {
"indexing": "position",
"stored": true
}
},
{
"name": "url",
"type": "text",
"options": {
"indexing": "unindexed",
"stored": true
}
}
]
```
If you want to know more about the meaning of these options, you can check out the [schema doc page](http://fulmicoton.com/tantivy/tantivy/schema/index.html).
The json displayed at the end has been written in `wikipedia-index/meta.json`.
# Get the documents to index
Tantivy's index command offers a way to index a json file.
More accurately, the file must contain one document per line, in a json format.
The structure of this JSON object must match that of our schema definition.
```json
{"body": "some text", "title": "some title", "url": "http://somedomain.com"}
```
You can download a corpus of more than 5 millions articles from wikipedia
formatted in the right format here : [wiki-articles.json (2.34 GB)](https://www.dropbox.com/s/wwnfnu441w1ec9p/wiki-articles.json.bz2?dl=0).
If you are in a rush you can [download 100 articles in the right format here](http://fulmicoton.com/tantivy/tutorial/wiki-articles-first100.json).
Make sure to uncompress the file
```bash
bunzip2 wiki-articles.json.bz2
```
# Index the documents.
The `index` command will index your document.
By default it will use as many threads as there are core on your machine.
On my computer (8 core Xeon(R) CPU X3450 @ 2.67GHz), it only takes 7 minutes.
```
cat /data/wiki-articles | tantivy index -i wikipedia-index
```
While it is indexing, you can peek at the index directory
to check what is happening.
```bash
ls wikipedia-index
```
If you indexed the 5 millions articles, you should see a lot of files, all with the following format
The main file is `meta.json`.
Our index is in fact divided in segments. Each segment acts as an individual smaller index.
It is named by a uuid.
Each different files is storing a different datastructure for the index.
# Serve the search index
```
tantivy serve -i wikipedia-index
```
You can start a small server with a JSON API to search into wikipedia.
By default, the server is serving on the port `3000`.

2
examples/generate_html.sh Executable file
View File

@@ -0,0 +1,2 @@
#!/bin/bash
docco simple_search.rs -o html

518
examples/html/docco.css Normal file
View File

@@ -0,0 +1,518 @@
/*--------------------- Typography ----------------------------*/
@font-face {
font-family: 'aller-light';
src: url('public/fonts/aller-light.eot');
src: url('public/fonts/aller-light.eot?#iefix') format('embedded-opentype'),
url('public/fonts/aller-light.woff') format('woff'),
url('public/fonts/aller-light.ttf') format('truetype');
font-weight: normal;
font-style: normal;
}
@font-face {
font-family: 'aller-bold';
src: url('public/fonts/aller-bold.eot');
src: url('public/fonts/aller-bold.eot?#iefix') format('embedded-opentype'),
url('public/fonts/aller-bold.woff') format('woff'),
url('public/fonts/aller-bold.ttf') format('truetype');
font-weight: normal;
font-style: normal;
}
@font-face {
font-family: 'roboto-black';
src: url('public/fonts/roboto-black.eot');
src: url('public/fonts/roboto-black.eot?#iefix') format('embedded-opentype'),
url('public/fonts/roboto-black.woff') format('woff'),
url('public/fonts/roboto-black.ttf') format('truetype');
font-weight: normal;
font-style: normal;
}
/*--------------------- Layout ----------------------------*/
html { height: 100%; }
body {
font-family: "aller-light";
font-size: 14px;
line-height: 18px;
color: #30404f;
margin: 0; padding: 0;
height:100%;
}
#container { min-height: 100%; }
a {
color: #000;
}
b, strong {
font-weight: normal;
font-family: "aller-bold";
}
p {
margin: 15px 0 0px;
}
.annotation ul, .annotation ol {
margin: 25px 0;
}
.annotation ul li, .annotation ol li {
font-size: 14px;
line-height: 18px;
margin: 10px 0;
}
h1, h2, h3, h4, h5, h6 {
color: #112233;
line-height: 1em;
font-weight: normal;
font-family: "roboto-black";
text-transform: uppercase;
margin: 30px 0 15px 0;
}
h1 {
margin-top: 40px;
}
h2 {
font-size: 1.26em;
}
hr {
border: 0;
background: 1px #ddd;
height: 1px;
margin: 20px 0;
}
pre, tt, code {
font-size: 12px; line-height: 16px;
font-family: Menlo, Monaco, Consolas, "Lucida Console", monospace;
margin: 0; padding: 0;
}
.annotation pre {
display: block;
margin: 0;
padding: 7px 10px;
background: #fcfcfc;
-moz-box-shadow: inset 0 0 10px rgba(0,0,0,0.1);
-webkit-box-shadow: inset 0 0 10px rgba(0,0,0,0.1);
box-shadow: inset 0 0 10px rgba(0,0,0,0.1);
overflow-x: auto;
}
.annotation pre code {
border: 0;
padding: 0;
background: transparent;
}
blockquote {
border-left: 5px solid #ccc;
margin: 0;
padding: 1px 0 1px 1em;
}
.sections blockquote p {
font-family: Menlo, Consolas, Monaco, monospace;
font-size: 12px; line-height: 16px;
color: #999;
margin: 10px 0 0;
white-space: pre-wrap;
}
ul.sections {
list-style: none;
padding:0 0 5px 0;;
margin:0;
}
/*
Force border-box so that % widths fit the parent
container without overlap because of margin/padding.
More Info : http://www.quirksmode.org/css/box.html
*/
ul.sections > li > div {
-moz-box-sizing: border-box; /* firefox */
-ms-box-sizing: border-box; /* ie */
-webkit-box-sizing: border-box; /* webkit */
-khtml-box-sizing: border-box; /* konqueror */
box-sizing: border-box; /* css3 */
}
/*---------------------- Jump Page -----------------------------*/
#jump_to, #jump_page {
margin: 0;
background: white;
-webkit-box-shadow: 0 0 25px #777; -moz-box-shadow: 0 0 25px #777;
-webkit-border-bottom-left-radius: 5px; -moz-border-radius-bottomleft: 5px;
font: 16px Arial;
cursor: pointer;
text-align: right;
list-style: none;
}
#jump_to a {
text-decoration: none;
}
#jump_to a.large {
display: none;
}
#jump_to a.small {
font-size: 22px;
font-weight: bold;
color: #676767;
}
#jump_to, #jump_wrapper {
position: fixed;
right: 0; top: 0;
padding: 10px 15px;
margin:0;
}
#jump_wrapper {
display: none;
padding:0;
}
#jump_to:hover #jump_wrapper {
display: block;
}
#jump_page_wrapper{
position: fixed;
right: 0;
top: 0;
bottom: 0;
}
#jump_page {
padding: 5px 0 3px;
margin: 0 0 25px 25px;
max-height: 100%;
overflow: auto;
}
#jump_page .source {
display: block;
padding: 15px;
text-decoration: none;
border-top: 1px solid #eee;
}
#jump_page .source:hover {
background: #f5f5ff;
}
#jump_page .source:first-child {
}
/*---------------------- Low resolutions (> 320px) ---------------------*/
@media only screen and (min-width: 320px) {
.pilwrap { display: none; }
ul.sections > li > div {
display: block;
padding:5px 10px 0 10px;
}
ul.sections > li > div.annotation ul, ul.sections > li > div.annotation ol {
padding-left: 30px;
}
ul.sections > li > div.content {
overflow-x:auto;
-webkit-box-shadow: inset 0 0 5px #e5e5ee;
box-shadow: inset 0 0 5px #e5e5ee;
border: 1px solid #dedede;
margin:5px 10px 5px 10px;
padding-bottom: 5px;
}
ul.sections > li > div.annotation pre {
margin: 7px 0 7px;
padding-left: 15px;
}
ul.sections > li > div.annotation p tt, .annotation code {
background: #f8f8ff;
border: 1px solid #dedede;
font-size: 12px;
padding: 0 0.2em;
}
}
/*---------------------- (> 481px) ---------------------*/
@media only screen and (min-width: 481px) {
#container {
position: relative;
}
body {
background-color: #F5F5FF;
font-size: 15px;
line-height: 21px;
}
pre, tt, code {
line-height: 18px;
}
p, ul, ol {
margin: 0 0 15px;
}
#jump_to {
padding: 5px 10px;
}
#jump_wrapper {
padding: 0;
}
#jump_to, #jump_page {
font: 10px Arial;
text-transform: uppercase;
}
#jump_page .source {
padding: 5px 10px;
}
#jump_to a.large {
display: inline-block;
}
#jump_to a.small {
display: none;
}
#background {
position: absolute;
top: 0; bottom: 0;
width: 350px;
background: #fff;
border-right: 1px solid #e5e5ee;
z-index: -1;
}
ul.sections > li > div.annotation ul, ul.sections > li > div.annotation ol {
padding-left: 40px;
}
ul.sections > li {
white-space: nowrap;
}
ul.sections > li > div {
display: inline-block;
}
ul.sections > li > div.annotation {
max-width: 350px;
min-width: 350px;
min-height: 5px;
padding: 13px;
overflow-x: hidden;
white-space: normal;
vertical-align: top;
text-align: left;
}
ul.sections > li > div.annotation pre {
margin: 15px 0 15px;
padding-left: 15px;
}
ul.sections > li > div.content {
padding: 13px;
vertical-align: top;
border: none;
-webkit-box-shadow: none;
box-shadow: none;
}
.pilwrap {
position: relative;
display: inline;
}
.pilcrow {
font: 12px Arial;
text-decoration: none;
color: #454545;
position: absolute;
top: 3px; left: -20px;
padding: 1px 2px;
opacity: 0;
-webkit-transition: opacity 0.2s linear;
}
.for-h1 .pilcrow {
top: 47px;
}
.for-h2 .pilcrow, .for-h3 .pilcrow, .for-h4 .pilcrow {
top: 35px;
}
ul.sections > li > div.annotation:hover .pilcrow {
opacity: 1;
}
}
/*---------------------- (> 1025px) ---------------------*/
@media only screen and (min-width: 1025px) {
body {
font-size: 16px;
line-height: 24px;
}
#background {
width: 525px;
}
ul.sections > li > div.annotation {
max-width: 525px;
min-width: 525px;
padding: 10px 25px 1px 50px;
}
ul.sections > li > div.content {
padding: 9px 15px 16px 25px;
}
}
/*---------------------- Syntax Highlighting -----------------------------*/
td.linenos { background-color: #f0f0f0; padding-right: 10px; }
span.lineno { background-color: #f0f0f0; padding: 0 5px 0 5px; }
/*
github.com style (c) Vasily Polovnyov <vast@whiteants.net>
*/
pre code {
display: block; padding: 0.5em;
color: #000;
background: #f8f8ff
}
pre .hljs-comment,
pre .hljs-template_comment,
pre .hljs-diff .hljs-header,
pre .hljs-javadoc {
color: #408080;
font-style: italic
}
pre .hljs-keyword,
pre .hljs-assignment,
pre .hljs-literal,
pre .hljs-css .hljs-rule .hljs-keyword,
pre .hljs-winutils,
pre .hljs-javascript .hljs-title,
pre .hljs-lisp .hljs-title,
pre .hljs-subst {
color: #954121;
/*font-weight: bold*/
}
pre .hljs-number,
pre .hljs-hexcolor {
color: #40a070
}
pre .hljs-string,
pre .hljs-tag .hljs-value,
pre .hljs-phpdoc,
pre .hljs-tex .hljs-formula {
color: #219161;
}
pre .hljs-title,
pre .hljs-id {
color: #19469D;
}
pre .hljs-params {
color: #00F;
}
pre .hljs-javascript .hljs-title,
pre .hljs-lisp .hljs-title,
pre .hljs-subst {
font-weight: normal
}
pre .hljs-class .hljs-title,
pre .hljs-haskell .hljs-label,
pre .hljs-tex .hljs-command {
color: #458;
font-weight: bold
}
pre .hljs-tag,
pre .hljs-tag .hljs-title,
pre .hljs-rules .hljs-property,
pre .hljs-django .hljs-tag .hljs-keyword {
color: #000080;
font-weight: normal
}
pre .hljs-attribute,
pre .hljs-variable,
pre .hljs-instancevar,
pre .hljs-lisp .hljs-body {
color: #008080
}
pre .hljs-regexp {
color: #B68
}
pre .hljs-class {
color: #458;
font-weight: bold
}
pre .hljs-symbol,
pre .hljs-ruby .hljs-symbol .hljs-string,
pre .hljs-ruby .hljs-symbol .hljs-keyword,
pre .hljs-ruby .hljs-symbol .hljs-keymethods,
pre .hljs-lisp .hljs-keyword,
pre .hljs-tex .hljs-special,
pre .hljs-input_number {
color: #990073
}
pre .hljs-builtin,
pre .hljs-constructor,
pre .hljs-built_in,
pre .hljs-lisp .hljs-title {
color: #0086b3
}
pre .hljs-preprocessor,
pre .hljs-pi,
pre .hljs-doctype,
pre .hljs-shebang,
pre .hljs-cdata {
color: #999;
font-weight: bold
}
pre .hljs-deletion {
background: #fdd
}
pre .hljs-addition {
background: #dfd
}
pre .hljs-diff .hljs-change {
background: #0086b3
}
pre .hljs-chunk {
color: #aaa
}
pre .hljs-tex .hljs-formula {
opacity: 0.5;
}

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

After

Width:  |  Height:  |  Size: 56 KiB

View File

@@ -0,0 +1,375 @@
/*! normalize.css v2.0.1 | MIT License | git.io/normalize */
/* ==========================================================================
HTML5 display definitions
========================================================================== */
/*
* Corrects `block` display not defined in IE 8/9.
*/
article,
aside,
details,
figcaption,
figure,
footer,
header,
hgroup,
nav,
section,
summary {
display: block;
}
/*
* Corrects `inline-block` display not defined in IE 8/9.
*/
audio,
canvas,
video {
display: inline-block;
}
/*
* Prevents modern browsers from displaying `audio` without controls.
* Remove excess height in iOS 5 devices.
*/
audio:not([controls]) {
display: none;
height: 0;
}
/*
* Addresses styling for `hidden` attribute not present in IE 8/9.
*/
[hidden] {
display: none;
}
/* ==========================================================================
Base
========================================================================== */
/*
* 1. Sets default font family to sans-serif.
* 2. Prevents iOS text size adjust after orientation change, without disabling
* user zoom.
*/
html {
font-family: sans-serif; /* 1 */
-webkit-text-size-adjust: 100%; /* 2 */
-ms-text-size-adjust: 100%; /* 2 */
}
/*
* Removes default margin.
*/
body {
margin: 0;
}
/* ==========================================================================
Links
========================================================================== */
/*
* Addresses `outline` inconsistency between Chrome and other browsers.
*/
a:focus {
outline: thin dotted;
}
/*
* Improves readability when focused and also mouse hovered in all browsers.
*/
a:active,
a:hover {
outline: 0;
}
/* ==========================================================================
Typography
========================================================================== */
/*
* Addresses `h1` font sizes within `section` and `article` in Firefox 4+,
* Safari 5, and Chrome.
*/
h1 {
font-size: 2em;
}
/*
* Addresses styling not present in IE 8/9, Safari 5, and Chrome.
*/
abbr[title] {
border-bottom: 1px dotted;
}
/*
* Addresses style set to `bolder` in Firefox 4+, Safari 5, and Chrome.
*/
b,
strong {
font-weight: bold;
}
/*
* Addresses styling not present in Safari 5 and Chrome.
*/
dfn {
font-style: italic;
}
/*
* Addresses styling not present in IE 8/9.
*/
mark {
background: #ff0;
color: #000;
}
/*
* Corrects font family set oddly in Safari 5 and Chrome.
*/
code,
kbd,
pre,
samp {
font-family: monospace, serif;
font-size: 1em;
}
/*
* Improves readability of pre-formatted text in all browsers.
*/
pre {
white-space: pre;
white-space: pre-wrap;
word-wrap: break-word;
}
/*
* Sets consistent quote types.
*/
q {
quotes: "\201C" "\201D" "\2018" "\2019";
}
/*
* Addresses inconsistent and variable font size in all browsers.
*/
small {
font-size: 80%;
}
/*
* Prevents `sub` and `sup` affecting `line-height` in all browsers.
*/
sub,
sup {
font-size: 75%;
line-height: 0;
position: relative;
vertical-align: baseline;
}
sup {
top: -0.5em;
}
sub {
bottom: -0.25em;
}
/* ==========================================================================
Embedded content
========================================================================== */
/*
* Removes border when inside `a` element in IE 8/9.
*/
img {
border: 0;
}
/*
* Corrects overflow displayed oddly in IE 9.
*/
svg:not(:root) {
overflow: hidden;
}
/* ==========================================================================
Figures
========================================================================== */
/*
* Addresses margin not present in IE 8/9 and Safari 5.
*/
figure {
margin: 0;
}
/* ==========================================================================
Forms
========================================================================== */
/*
* Define consistent border, margin, and padding.
*/
fieldset {
border: 1px solid #c0c0c0;
margin: 0 2px;
padding: 0.35em 0.625em 0.75em;
}
/*
* 1. Corrects color not being inherited in IE 8/9.
* 2. Remove padding so people aren't caught out if they zero out fieldsets.
*/
legend {
border: 0; /* 1 */
padding: 0; /* 2 */
}
/*
* 1. Corrects font family not being inherited in all browsers.
* 2. Corrects font size not being inherited in all browsers.
* 3. Addresses margins set differently in Firefox 4+, Safari 5, and Chrome
*/
button,
input,
select,
textarea {
font-family: inherit; /* 1 */
font-size: 100%; /* 2 */
margin: 0; /* 3 */
}
/*
* Addresses Firefox 4+ setting `line-height` on `input` using `!important` in
* the UA stylesheet.
*/
button,
input {
line-height: normal;
}
/*
* 1. Avoid the WebKit bug in Android 4.0.* where (2) destroys native `audio`
* and `video` controls.
* 2. Corrects inability to style clickable `input` types in iOS.
* 3. Improves usability and consistency of cursor style between image-type
* `input` and others.
*/
button,
html input[type="button"], /* 1 */
input[type="reset"],
input[type="submit"] {
-webkit-appearance: button; /* 2 */
cursor: pointer; /* 3 */
}
/*
* Re-set default cursor for disabled elements.
*/
button[disabled],
input[disabled] {
cursor: default;
}
/*
* 1. Addresses box sizing set to `content-box` in IE 8/9.
* 2. Removes excess padding in IE 8/9.
*/
input[type="checkbox"],
input[type="radio"] {
box-sizing: border-box; /* 1 */
padding: 0; /* 2 */
}
/*
* 1. Addresses `appearance` set to `searchfield` in Safari 5 and Chrome.
* 2. Addresses `box-sizing` set to `border-box` in Safari 5 and Chrome
* (include `-moz` to future-proof).
*/
input[type="search"] {
-webkit-appearance: textfield; /* 1 */
-moz-box-sizing: content-box;
-webkit-box-sizing: content-box; /* 2 */
box-sizing: content-box;
}
/*
* Removes inner padding and search cancel button in Safari 5 and Chrome
* on OS X.
*/
input[type="search"]::-webkit-search-cancel-button,
input[type="search"]::-webkit-search-decoration {
-webkit-appearance: none;
}
/*
* Removes inner padding and border in Firefox 4+.
*/
button::-moz-focus-inner,
input::-moz-focus-inner {
border: 0;
padding: 0;
}
/*
* 1. Removes default vertical scrollbar in IE 8/9.
* 2. Improves readability and alignment in all browsers.
*/
textarea {
overflow: auto; /* 1 */
vertical-align: top; /* 2 */
}
/* ==========================================================================
Tables
========================================================================== */
/*
* Remove most spacing between table cells.
*/
table {
border-collapse: collapse;
border-spacing: 0;
}

View File

@@ -0,0 +1,496 @@
<!DOCTYPE html>
<html>
<head>
<title>simple_search.rs</title>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<meta name="viewport" content="width=device-width, target-densitydpi=160dpi, initial-scale=1.0; maximum-scale=1.0; user-scalable=0;">
<link rel="stylesheet" media="all" href="docco.css" />
</head>
<body>
<div id="container">
<div id="background"></div>
<ul class="sections">
<li id="title">
<div class="annotation">
<h1>simple_search.rs</h1>
</div>
</li>
<li id="section-1">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-1">&#182;</a>
</div>
</div>
<div class="content"><div class='highlight'><pre><span class="hljs-keyword">extern</span> <span class="hljs-keyword">crate</span> rustc_serialize;
<span class="hljs-keyword">extern</span> <span class="hljs-keyword">crate</span> tantivy;
<span class="hljs-keyword">extern</span> <span class="hljs-keyword">crate</span> tempdir;
<span class="hljs-keyword">use</span> std::path::Path;
<span class="hljs-keyword">use</span> tempdir::TempDir;
<span class="hljs-keyword">use</span> tantivy::Index;
<span class="hljs-keyword">use</span> tantivy::schema::*;
<span class="hljs-keyword">use</span> tantivy::collector::TopCollector;
<span class="hljs-keyword">use</span> tantivy::query::QueryParser;
<span class="hljs-keyword">use</span> tantivy::query::Query;
<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">main</span></span>() {</pre></div></div>
</li>
<li id="section-2">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-2">&#182;</a>
</div>
<p>Lets create a temporary directory for the
sake of this example</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">if</span> <span class="hljs-keyword">let</span> <span class="hljs-literal">Ok</span>(dir) = TempDir::new(<span class="hljs-string">"tantivy_example_dir"</span>) {
run_example(&amp;dir.path()).unwrap();
dir.close().unwrap();
}
}
<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">create_schema</span></span>() -&gt; Schema {
}
<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">run_example</span></span>(index_path: &amp;Path) -&gt; tantivy::<span class="hljs-built_in">Result</span>&lt;()&gt; {</pre></div></div>
</li>
<li id="section-3">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-3">&#182;</a>
</div>
<h1 id="defining-the-schema">Defining the schema</h1>
<p>Tantivy index require to have a very strict schema.
The schema declares which fields are in the index,
and for each field, its type and “the way it should
be indexed”.</p>
</div>
</li>
<li id="section-4">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-4">&#182;</a>
</div>
<p>first we need to define a schema …</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> schema_builder = SchemaBuilder::<span class="hljs-keyword">default</span>();</pre></div></div>
</li>
<li id="section-5">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-5">&#182;</a>
</div>
<p>Our first field is title.
We want full-text search for it, and we want to be able
to retrieve the document after the search.</p>
<p>TEXT | STORED is some syntactic sugar to describe
that. </p>
<p><code>TEXT</code> means the field should be tokenized and indexed,
along with its term frequency and term positions.</p>
<p><code>STORED</code> means that the field will also be saved
in a compressed, row-oriented key-value store.
This store is useful to reconstruct the
document that were selected during the search phase.</p>
</div>
<div class="content"><div class='highlight'><pre> schema_builder.add_text_field(<span class="hljs-string">"title"</span>, TEXT | STORED);</pre></div></div>
</li>
<li id="section-6">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-6">&#182;</a>
</div>
<p>Our first field is body.
We want full-text search for it, and we want to be able
to retrieve the body after the search.</p>
</div>
<div class="content"><div class='highlight'><pre> schema_builder.add_text_field(<span class="hljs-string">"body"</span>, TEXT);
<span class="hljs-keyword">let</span> schema = schema_builder.build();</pre></div></div>
</li>
<li id="section-7">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-7">&#182;</a>
</div>
<h1 id="indexing-documents">Indexing documents</h1>
<p>Lets create a brand new index.</p>
<p>This will actually just save a meta.json
with our schema the directory.</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">let</span> index = <span class="hljs-built_in">try!</span>(Index::create(index_path, schema.clone()));</pre></div></div>
</li>
<li id="section-8">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-8">&#182;</a>
</div>
<p>To insert document we need an index writer.
There shall be only one writer at a time.
Besides, this single <code>IndexWriter</code> is already
multithreaded.</p>
<p>Here we used a buffer of 1 GB. Using a bigger
heap for the indexer can increase its throughput.
This buffer will be split between the indexing
threads.</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> index_writer = <span class="hljs-built_in">try!</span>(index.writer(<span class="hljs-number">1_000_000_000</span>));</pre></div></div>
</li>
<li id="section-9">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-9">&#182;</a>
</div>
<p>Lets now index our documents!
We first need a handle on the title and the body field.</p>
</div>
</li>
<li id="section-10">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-10">&#182;</a>
</div>
<h3 id="create-a-document-manually-">Create a document “manually”.</h3>
<p>We can create a document manually, by setting adding the fields
one by one in a Document object.</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">let</span> title = schema.get_field(<span class="hljs-string">"title"</span>).unwrap();
<span class="hljs-keyword">let</span> body = schema.get_field(<span class="hljs-string">"body"</span>).unwrap();
<span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> old_man_doc = Document::<span class="hljs-keyword">default</span>();
old_man_doc.add_text(title, <span class="hljs-string">"The Old Man and the Sea"</span>);
old_man_doc.add_text(body, <span class="hljs-string">"He was an old man who fished alone in a skiff in the Gulf Stream and he had gone eighty-four days now without taking a fish."</span>);</pre></div></div>
</li>
<li id="section-11">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-11">&#182;</a>
</div>
<p>… and add it to the <code>IndexWriter</code>.</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-built_in">try!</span>(index_writer.add_document(old_man_doc));</pre></div></div>
</li>
<li id="section-12">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-12">&#182;</a>
</div>
<h3 id="create-a-document-directly-from-json-">Create a document directly from json.</h3>
<p>Alternatively, we can use our schema to parse
a document object directly from json.</p>
</div>
<div class="content"><div class='highlight'><pre>
<span class="hljs-keyword">let</span> mice_and_men_doc = <span class="hljs-built_in">try!</span>(schema.parse_document(r#<span class="hljs-string">"{
"</span>title<span class="hljs-string">": "</span>Of Mice and Men<span class="hljs-string">",
"</span>body<span class="hljs-string">": "</span>few miles south of Soledad, the Salinas River drops <span class="hljs-keyword">in</span> close to the hillside bank and runs deep and green. The water is warm too, <span class="hljs-keyword">for</span> it has slipped twinkling over the yellow sands <span class="hljs-keyword">in</span> the sunlight before reaching the narrow pool. On one side of the river the golden foothill slopes curve up to the strong and rocky Gabilan Mountains, but on the valley side the water is lined with trees—willows fresh and green with every spring, carrying <span class="hljs-keyword">in</span> their lower leaf junctures the debris of the winters flooding; and sycamores with mottled, white,recumbent limbs and branches that arch over the pool<span class="hljs-string">"
}"</span>#));
<span class="hljs-built_in">try!</span>(index_writer.add_document(mice_and_men_doc));</pre></div></div>
</li>
<li id="section-13">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-13">&#182;</a>
</div>
<p>Multi-valued field are allowed, they are
expressed in JSON by an array.
The following document has two titles.</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">let</span> frankenstein_doc = <span class="hljs-built_in">try!</span>(schema.parse_document(r#<span class="hljs-string">"{
"</span>title<span class="hljs-string">": ["</span>Frankenstein<span class="hljs-string">", "</span>The Modern Promotheus<span class="hljs-string">"],
"</span>body<span class="hljs-string">": "</span>You will rejoice to hear that no disaster has accompanied the commencement of an enterprise which you have regarded with such evil forebodings. I arrived here yesterday, and my first task is to assure my dear sister of my welfare and increasing confidence <span class="hljs-keyword">in</span> the success of my undertaking.<span class="hljs-string">"
}"</span>#));
<span class="hljs-built_in">try!</span>(index_writer.add_document(frankenstein_doc));</pre></div></div>
</li>
<li id="section-14">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-14">&#182;</a>
</div>
<p>This is an example, so we will only index 3 documents
here. You can check out tantivys tutorial to index
the English wikipedia. Tantivys indexing is rather fast.
Indexing 5 millions articles of the English wikipedia takes
around 4 minutes on my computer!</p>
</div>
</li>
<li id="section-15">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-15">&#182;</a>
</div>
<h3 id="committing">Committing</h3>
<p>At this point our documents are not searchable.</p>
<p>We need to call .commit() explicitely to force the
index_writer to finish processing the documents in the queue,
flush the current index on the disk, and advertise
the existence of new documents.</p>
<p>This call is blocking.</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-built_in">try!</span>(index_writer.commit());</pre></div></div>
</li>
<li id="section-16">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-16">&#182;</a>
</div>
<p>If <code>.commit()</code> returns correctly, then all of the
documents have been added before are guaranteed to be
persistently indexed.</p>
<p>In the scenario of a crash or a power failure,
tantivy behaves as if it rollbacked to its last
commit.</p>
</div>
</li>
<li id="section-17">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-17">&#182;</a>
</div>
<h1 id="searching">Searching</h1>
<p>Lets search our index. This starts
by creating a searcher. There can be more
than one searcher at a time.</p>
<p>You are supposed to acquire a search
every time you start a “search query”.</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">let</span> searcher = index.searcher();</pre></div></div>
</li>
<li id="section-18">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-18">&#182;</a>
</div>
<p>The query parser can interpret human queries.
Here, if the user does not specify which
field he wants to search, tantivy will search
in both title and body.</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">let</span> query_parser = QueryParser::new(index.schema(), <span class="hljs-built_in">vec!</span>(title, body));</pre></div></div>
</li>
<li id="section-19">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-19">&#182;</a>
</div>
<p>QueryParser may fail if the query is not in the right
format. For user facing applications, this can be a problem.
A ticket has been filled regarding this problem.</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">let</span> query = <span class="hljs-built_in">try!</span>(query_parser.parse_query(<span class="hljs-string">"sea whale"</span>));</pre></div></div>
</li>
<li id="section-20">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-20">&#182;</a>
</div>
<p>A query defines a set of documents, as
well as the way they should be scored.</p>
<p>Query created by the query parser are scoring according
to a metric called Tf-Idf, and will consider
any document matching at least one of our terms.</p>
</div>
</li>
<li id="section-21">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-21">&#182;</a>
</div>
<h3 id="collectors">Collectors</h3>
<p>We are not interested in all of the document but
only in the top 10. Keep track of our top 10 best documents
is the role of the TopCollector.</p>
</div>
<div class="content"><div class='highlight'><pre>
<span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> top_collector = TopCollector::with_limit(<span class="hljs-number">10</span>);</pre></div></div>
</li>
<li id="section-22">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-22">&#182;</a>
</div>
<p>We can now perform our query.</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-built_in">try!</span>(query.search(&amp;searcher, &amp;<span class="hljs-keyword">mut</span> top_collector));</pre></div></div>
</li>
<li id="section-23">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-23">&#182;</a>
</div>
<p>Our top collector now contains are 10
most relevant doc ids…</p>
</div>
<div class="content"><div class='highlight'><pre> <span class="hljs-keyword">let</span> doc_addresses = top_collector.docs();</pre></div></div>
</li>
<li id="section-24">
<div class="annotation">
<div class="pilwrap ">
<a class="pilcrow" href="#section-24">&#182;</a>
</div>
<p>The actual documents still need to be
retrieved from Tantivys store.</p>
<p>Since body was not configured as stored,
the document returned will only contain
a title.</p>
</div>
<div class="content"><div class='highlight'><pre>
<span class="hljs-keyword">for</span> doc_address <span class="hljs-keyword">in</span> doc_addresses {
<span class="hljs-keyword">let</span> retrieved_doc = <span class="hljs-built_in">try!</span>(searcher.doc(&amp;doc_address));
<span class="hljs-built_in">println!</span>(<span class="hljs-string">"{}"</span>, schema.to_json(&amp;retrieved_doc));
}
<span class="hljs-literal">Ok</span>(())
}</pre></div></div>
</li>
</ul>
</div>
</body>
</html>

View File

@@ -11,81 +11,111 @@ use tantivy::query::QueryParser;
use tantivy::query::Query;
fn main() {
// this example creates its index in a temporary
// directory.
// Let's create a temporary directory for the
// sake of this example
if let Ok(dir) = TempDir::new("tantivy_example_dir") {
run(&dir.path()).unwrap();
run_example(&dir.path()).unwrap();
dir.close().unwrap();
}
}
fn create_schema() -> Schema {
// We need to declare a schema
// to create a new index.
let mut schema_builder = SchemaBuilder::default();
// TEXT | STORED is some syntactic sugar to describe
// how tantivy should index this field.
// It means the field should be tokenized and indexed,
// along with its term frequency and term positions.
schema_builder.add_text_field("title", TEXT | STORED);
schema_builder.add_text_field("body", TEXT);
schema_builder.build()
}
fn run(index_path: &Path) -> tantivy::Result<()> {
fn run_example(index_path: &Path) -> tantivy::Result<()> {
// # Defining the schema
//
// Tantivy index require to have a very strict schema.
// The schema declares which fields are in the index,
// and for each field, its type and "the way it should
// be indexed".
// first we need to define a schema ...
let schema = create_schema();
// Creates an empty index.
let mut schema_builder = SchemaBuilder::default();
// Our first field is title.
// We want full-text search for it, and we want to be able
// to retrieve the document after the search.
//
// TEXT | STORED is some syntactic sugar to describe
// that.
//
// This will actually just save a meta.json
// file in the directory.
// `TEXT` means the field should be tokenized and indexed,
// along with its term frequency and term positions.
//
// `STORED` means that the field will also be saved
// in a compressed, row-oriented key-value store.
// This store is useful to reconstruct the
// document that were selected during the search phase.
schema_builder.add_text_field("title", TEXT | STORED);
// Our first field is body.
// We want full-text search for it, and we want to be able
// to retrieve the body after the search.
schema_builder.add_text_field("body", TEXT);
let schema = schema_builder.build();
// # Indexing documents
//
// Let's create a brand new index.
//
// This will actually just save a meta.json
// with our schema the directory.
let index = try!(Index::create(index_path, schema.clone()));
// There can be only one writer at one time.
// The writer will use more than one thread
// to use your multicore CPU.
// To insert document we need an index writer.
// There shall be only one writer at a time.
// Besides, this single `IndexWriter` is already
// multithreaded.
//
// Here we used a buffer of 1MB. In the
// real world, you want to use much more RAM on your indexer,
// to maximum your throughput. (300MB for instance)
let mut index_writer = try!(index.writer(1_000_000));
// Here we used a buffer of 1 GB. Using a bigger
// heap for the indexer can increase its throughput.
// This buffer will be split between the indexing
// threads.
let mut index_writer = try!(index.writer(1_000_000_000));
// Let's now index our documents!
// We need a handle on the title and the body field.
// We first need a handle on the title and the body field.
// ### Create a document "manually".
//
// We can create a document manually, by setting adding the fields
// one by one in a Document object.
let title = schema.get_field("title").unwrap();
let body = schema.get_field("body").unwrap();
let mut old_man_doc = Document::default();
old_man_doc.add_text(title, "The Old Man and the Sea");
old_man_doc.add_text(body, "He was an old man who fished alone in a skiff in the Gulf Stream and he had gone eighty-four days now without taking a fish.");
// We can now add our document
// ... and add it to the `IndexWriter`.
try!(index_writer.add_document(old_man_doc));
// ### Create a document directly from json.
//
// Alternatively, we can use our schema to parse
// a document object from json.
// a document object directly from json.
let mice_and_men_doc = try!(schema.parse_document(r#"{
"title": "Of Mice and Men",
"body": "few miles south of Soledad, the Salinas River drops in close to the hillside bank and runs deep and green. The water is warm too, for it has slipped twinkling over the yellow sands in the sunlight before reaching the narrow pool. On one side of the river the golden foothill slopes curve up to the strong and rocky Gabilan Mountains, but on the valley side the water is lined with trees—willows fresh and green with every spring, carrying in their lower leaf junctures the debris of the winters flooding; and sycamores with mottled, white,recumbent limbs and branches that arch over the pool"
}"#));
try!(index_writer.add_document(mice_and_men_doc));
// Multi-valued field are allowed, they are
// expressed in JSON by an array.
// The following document has two titles.
@@ -94,62 +124,74 @@ fn run(index_path: &Path) -> tantivy::Result<()> {
"body": "You will rejoice to hear that no disaster has accompanied the commencement of an enterprise which you have regarded with such evil forebodings. I arrived here yesterday, and my first task is to assure my dear sister of my welfare and increasing confidence in the success of my undertaking."
}"#));
try!(index_writer.add_document(frankenstein_doc));
//
// ... in the real world, we would add way more documents
// here.
// Tantivy is rather fast. Indexing 5 millions articles of
// the English wikipedia takes around 6 minutes on
// my desktop.
//
// At this point our documents are not necessarily
// indexed.
// This is an example, so we will only index 3 documents
// here. You can check out tantivy's tutorial to index
// the English wikipedia. Tantivy's indexing is rather fast.
// Indexing 5 millions articles of the English wikipedia takes
// around 4 minutes on my computer!
// ### Committing
//
// It has been pushed to a queue where
// it will be eventually processed.
// At this point our documents are not searchable.
//
// We have no guarantee that
// the document will be indexed if there
// is a power outage for instance.
//
// We can call .commit() to force the index_writer to
// commit to disk. This call is blocking.
// We need to call .commit() explicitely to force the
// index_writer to finish processing the documents in the queue,
// flush the current index on the disk, and advertise
// the existence of new documents.
//
// This call is blocking.
try!(index_writer.commit());
// If a commit returns correctly, then all of the
// documents added before have been indexed.
// If `.commit()` returns correctly, then all of the
// documents have been added before are guaranteed to be
// persistently indexed.
//
// In the scenario of a crash or a power failure
// - as long as the hard disk is spared - tantivy
// will be able to rollback to its last commit.
// In the scenario of a crash or a power failure,
// tantivy behaves as if it rollbacked to its last
// commit.
// # Searching
//
// Let's search our index. This starts
// by creating a searcher. There can be more
// than one searcher at a time.
//
// You are supposed to acquire a search
// every time you start a "search query".
let searcher = index.searcher();
// The query parser can interpret human queries.
// Here, if the user does not specify which
// field he wants to search, tantivy will search in both title and body.
// field he wants to search, tantivy will search
// in both title and body.
let query_parser = QueryParser::new(index.schema(), vec!(title, body));
// QueryParser may fail if the query is not in the right
// format. For user facing applications, this can be a problem.
// A ticket has been filled regarding this problem.
let query = try!(query_parser.parse_query("sea whale"));
// A query defines a set of documents, as
// well as the way they should be scored.
//
// Query created by the query parser are scoring according
// to a metric called Tf-Idf, and will consider
// any document matching at least one of our terms.
//
// ### Collectors
//
// We are not interested in all of the document but
// only in the top 10.
// only in the top 10. Keep track of our top 10 best documents
// is the role of the TopCollector.
let mut top_collector = TopCollector::with_limit(10);
// We can now perform our query.
try!(query.search(&searcher, &mut top_collector));
// Our top collector now contains are 10
@@ -158,7 +200,7 @@ fn run(index_path: &Path) -> tantivy::Result<()> {
// The actual documents still need to be
// retrieved from Tantivy's store.
//
//
// Since body was not configured as stored,
// the document returned will only contain
// a title.