updating doc

This commit is contained in:
Paul Masurel
2018-09-09 17:23:30 +09:00
parent 7e5f697d00
commit a78aa4c259
5 changed files with 74 additions and 39 deletions


@@ -9,6 +9,7 @@
- [Facetting](./facetting.md)
- [Innerworkings](./innerworkings.md)
- [Inverted index](./inverted_index.md)
- [Best practise](./inverted_index.md)
[Frequently Asked Questions](./faq.md)
[Examples](./examples.md)


@@ -2,8 +2,8 @@
> Tantivy is a **search** engine **library** for Rust.
If you are familiar with Lucene, tantivy is heavily inspired by Lucene's design and
they both have the same scope and targeted users.
If you are familiar with Lucene, it's an excellent approximation to consider tantivy as Lucene for Rust. tantivy is heavily inspired by Lucene's design and
they both have the same scope and targeted use cases.
If you are not familiar with Lucene, let's break down our little tagline.
@@ -17,15 +17,18 @@ relevancy, collapsing, highlighting, spatial search.
experience. But keep in mind this is just a toolbox.
Which brings us to the second keyword...
- **Library** means that you will have to write code. tantivy is not an *all-in-one* server solution.
Sometimes a functionality will not be available in tantivy because it is too specific to your use case. By design, tantivy should make it possible to extend
the available set of features using the existing rock-solid datastructures.
- **Library** means that you will have to write code. tantivy is not an *all-in-one* server solution like Elasticsearch, for instance.
Most frequently this will mean writing your own `Collector`, your own `Scorer` or your own
`Tokenizer/TokenFilter`... But some of your requirements may also be related to
architecture or operations. For instance, you may want to build a large corpus on Hadoop,
fine-tune the merge policy to keep your index sharded in a time-wise fashion, or you may want
to convert an existing index from a different format.
Tantivy exposes its API to do all of these things.
Sometimes a functionality will not be available in tantivy because it is too
specific to your use case. By design, tantivy should make it possible to extend
the available set of features using the existing rock-solid datastructures.
Most frequently this will mean writing your own `Collector`, your own `Scorer` or your own
`TokenFilter`... Some of your requirements may also be related to
something closer to architecture or operations. For instance, you may
want to build a large corpus on Hadoop, fine-tune the merge policy to keep your
index sharded in a time-wise fashion, or you may want to convert an existing
index from a different format.
Tantivy exposes a lot of low-level APIs to do all of these things.


@@ -2,47 +2,76 @@
## Straight from disk
By default, tantivy accesses its data using its `MmapDirectory`.
While this design has some downsides, this greatly simplifies the source code of tantivy,
and entirely delegates the caching to the OS.
Tantivy accesses its data using an abstracting trait called `Directory`.
In theory, one can override the data access logic. In practice, the
trait somewhat assumes that your data can be mapped to memory, and tantivy
seems deeply married to using `mmap` for its IO [^1]. The only persistent
directory shipped with tantivy is the `MmapDirectory`.
`tantivy` works entirely (or almost) by directly reading the datastructures as they are laid out on disk.
As a result, the act of opening an index does not involve loading different datastructures
from the disk into random access memory: starting a process, opening an index, and performing a query
can typically be done in a matter of milliseconds.
While this design has some downsides, this greatly simplifies the source code of
tantivy. Caching is also entirely delegated to the OS.
This is an interesting property for a command line search engine, or for some multi-tenant log search engine.
Spawning a new process for each new query can be a perfectly sensible solution in some use cases.
`tantivy` works entirely (or almost) by directly reading the datastructures as they are laid out on disk. As a result, the act of opening an index does not involve loading different datastructures from the disk into random access memory: starting a process, opening an index, and performing your first query can typically be done in a matter of milliseconds.
This is an interesting property for a command line search engine, or for some multi-tenant log search engine: spawning a new process for each new query can be a perfectly sensible solution in some use cases.
In later chapters, we will discuss tantivy's inverted index data layout.
One key takeaway is that to achieve great performance, search indexes are extremely compact.
Of course this is crucial to reduce IO, and to ensure that as much of our index as possible can sit in RAM.
Also, whenever possible the data is accessed sequentially. Of course, this is an amazing property when tantivy needs to access
the data from your spinning hard disk, but this is also a great property when working with `SSD` or `RAM`,
as it makes our read patterns very predictable for the CPU.
Also, whenever possible its data is accessed sequentially. Of course, this is an amazing property when tantivy needs to access the data from your spinning hard disk, but this is also
critical for performance if your data is read from an `SSD` or is even already in your page cache.
## Segments, and the log method
That kind of compact layout comes at one cost: it prevents our datastructures from being dynamic.
In fact, a trait called `Directory` is in charge of abstracting all of tantivy's data access
and its API does not even allow editing these files once they are written.
That kind of compact layout comes at one cost: it prevents our datastructures from being dynamic.
In fact, the `Directory` trait does not even allow you to modify part of a file.
To allow the addition / deletion of documents, and create the illusion that
your index is dynamic (i.e.: adding and deleting documents), tantivy uses a common database trick sometimes
referred to as the *log method*.
your index is dynamic (i.e.: adding and deleting documents), tantivy uses a common database trick sometimes referred to as the *log method*.
Let's forget about deletes for a moment. As you add documents, these documents are processed and stored in
a dedicated datastructure, in a `RAM` buffer. This datastructure is designed to be dynamic but
cannot be accessed for search. As you add documents, this buffer will reach its capacity and tantivy will
transparently stop adding documents to it and start converting this datastructure to its final
read-only format on disk. Once written, a brand new empty buffer is available to resume adding documents.
Let's forget about deletes for a moment.
As you add documents, these documents are processed and stored in a dedicated datastructure, in a `RAM` buffer. This datastructure is not ready for search, but it is useful to receive your data and rearrange it very rapidly.
As you add documents, this buffer will reach its capacity and tantivy will transparently stop adding documents to it and start converting this datastructure to its final read-only format on disk. Once written, a brand new empty buffer is available to resume adding documents.
The resulting chunk of index obtained after this serialization is called a `Segment`.
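The buffer-then-flush cycle described above can be sketched with a toy model. This is only an illustration using the standard library: `ToyWriter`, `ToySegment`, `add_document`, `flush` and `search` are hypothetical names, not tantivy's API, and the "segments" here live in memory rather than on disk.

```rust
/// Toy stand-in for a segment: a batch of documents that is never
/// modified once it has been created. (Hypothetical type, not tantivy's.)
#[derive(Debug, Clone, PartialEq)]
struct ToySegment {
    id: usize,
    docs: Vec<String>,
}

/// Toy stand-in for an index writer implementing the log method.
struct ToyWriter {
    /// Maximum number of documents held in the RAM buffer.
    capacity: usize,
    /// Dynamic buffer: cheap to append to, but not searchable.
    buffer: Vec<String>,
    /// Read-only segments produced so far.
    segments: Vec<ToySegment>,
}

impl ToyWriter {
    fn new(capacity: usize) -> Self {
        ToyWriter { capacity, buffer: Vec::new(), segments: Vec::new() }
    }

    /// Adds a document to the buffer and flushes once the buffer is full.
    fn add_document(&mut self, doc: &str) {
        self.buffer.push(doc.to_string());
        if self.buffer.len() >= self.capacity {
            self.flush();
        }
    }

    /// Converts the buffer into a new read-only segment and leaves a
    /// fresh empty buffer behind.
    fn flush(&mut self) {
        if self.buffer.is_empty() {
            return;
        }
        let docs = std::mem::take(&mut self.buffer);
        let id = self.segments.len();
        self.segments.push(ToySegment { id, docs });
    }

    /// Searches flushed segments only; buffered documents are invisible.
    /// Returns (segment id, document position within the segment) pairs.
    fn search(&self, needle: &str) -> Vec<(usize, usize)> {
        let mut hits = Vec::new();
        for segment in &self.segments {
            for (doc_ord, doc) in segment.docs.iter().enumerate() {
                if doc.contains(needle) {
                    hits.push((segment.id, doc_ord));
                }
            }
        }
        hits
    }
}

fn main() {
    let mut writer = ToyWriter::new(2);
    for doc in ["a rust", "b lucene", "c rust"].iter() {
        writer.add_document(doc);
    }
    // The third document is still sitting in the buffer here.
    println!("before flush: {:?}", writer.search("rust"));
    writer.flush();
    println!("after flush: {:?}", writer.search("rust"));
}
```

Note how a hit is addressed by a segment identifier plus a position inside that segment: existing segments are never touched when new documents arrive.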
> A segment is a self-contained atomic piece of index. It is identified with a UUID, and all of its files
are identified using the naming scheme : `<UUID>.*`.
> A segment is a self-contained atomic piece of index. It is identified with a UUID, and all of its files are identified using the naming scheme : `<UUID>.*`.
Which brings us to the nature of a tantivy `Index`.
> A tantivy `Index` is a collection of `Segments`.
Physically, this really just means an index is a bunch of segment files in a given `Directory`,
linked together by a `meta.json` file. This transparency can become extremely handy
to get tantivy to fit your use case:
*Example 1* You could for instance use Hadoop to build a very large search index in a timely manner, copy all of the resulting segment files in the same directory and edit the `meta.json` to get a functional index.[^2]
*Example 2* You could also disable your merge policy and enforce daily segments. Removing data after one week can then be done very efficiently by just editing the `meta.json` and deleting the files associated with segment `D-7`.
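The retention idea in Example 2 can be sketched as follows. This is a minimal illustration under stated assumptions: `DailySegment` and `expire_segments` are hypothetical, and the record below is not the schema of tantivy's `meta.json`. It only shows the selection logic, leaving the actual file editing and deletion out.

```rust
/// Hypothetical per-segment record for an index with one segment per day.
/// This is NOT the real layout of tantivy's `meta.json`.
#[derive(Debug, Clone, PartialEq)]
struct DailySegment {
    /// Stand-in for the segment UUID.
    segment_id: String,
    /// Day the segment covers, counted from an arbitrary origin.
    day: u32,
}

/// Splits the segments into (kept, expired).
/// A segment is kept while it is strictly younger than `retention_days`.
fn expire_segments(
    segments: Vec<DailySegment>,
    today: u32,
    retention_days: u32,
) -> (Vec<DailySegment>, Vec<DailySegment>) {
    segments
        .into_iter()
        .partition(|segment| today.saturating_sub(segment.day) < retention_days)
}

fn main() {
    let segments: Vec<DailySegment> = (1..=10)
        .map(|day| DailySegment { segment_id: format!("segment-day-{}", day), day })
        .collect();
    let (kept, expired) = expire_segments(segments, 10, 7);
    // `kept` is what the edited metadata file would still list.
    println!("segments still listed: {}", kept.len());
    // Files of expired segments follow the `<UUID>.*` naming scheme.
    for segment in &expired {
        println!("delete files matching {}.*", segment.segment_id);
    }
}
```

Because every segment is self-contained, dropping a day of data never requires rewriting the segments that are kept.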
## Merging
As you index more and more data, your index will accumulate more and more segments.
Having a lot of small segments is not really optimal. There is a bit of redundancy in having
all these term dictionaries. Also, when searching, we need to do one term lookup per segment, which can hurt search performance a bit.
That's where merging or compacting comes into play. Tantivy will continuously consider merge
opportunities and start merging segments in the background.
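To make the cost of many small segments concrete, here is a toy sketch of merging two segments. It assumes a deliberately simplified segment (a sorted map from term to document ids); `MiniSegment`, `build`, `merge` and `count_matches` are hypothetical names and do not reflect tantivy's on-disk format or merge implementation.

```rust
use std::collections::BTreeMap;

/// Toy segment: a term dictionary mapping each term to the sorted ids
/// of the documents containing it. (Hypothetical, not tantivy's format.)
#[derive(Debug, Clone, PartialEq)]
struct MiniSegment {
    num_docs: u32,
    postings: BTreeMap<String, Vec<u32>>,
}

/// Builds a toy segment by splitting each document on whitespace.
fn build(docs: &[&str]) -> MiniSegment {
    let mut postings: BTreeMap<String, Vec<u32>> = BTreeMap::new();
    for (doc_id, doc) in docs.iter().enumerate() {
        let doc_id = doc_id as u32;
        for token in doc.split_whitespace() {
            let ids = postings.entry(token.to_string()).or_default();
            // Avoid recording the same document twice for a repeated term.
            if ids.last() != Some(&doc_id) {
                ids.push(doc_id);
            }
        }
    }
    MiniSegment { num_docs: docs.len() as u32, postings }
}

/// Merges two segments into one. Document ids of `right` are shifted by
/// the number of documents in `left`, so each term ends up in a single
/// dictionary entry with a single sorted list of ids.
fn merge(left: &MiniSegment, right: &MiniSegment) -> MiniSegment {
    let mut postings = left.postings.clone();
    for (term, doc_ids) in &right.postings {
        let merged = postings.entry(term.clone()).or_default();
        merged.extend(doc_ids.iter().map(|doc_id| doc_id + left.num_docs));
    }
    MiniSegment { num_docs: left.num_docs + right.num_docs, postings }
}

/// Returns (number of term dictionary lookups, number of matching docs).
/// One lookup is needed per segment, whatever the segment's size.
fn count_matches(segments: &[MiniSegment], term: &str) -> (usize, usize) {
    let mut hits = 0;
    for segment in segments {
        if let Some(doc_ids) = segment.postings.get(term) {
            hits += doc_ids.len();
        }
    }
    (segments.len(), hits)
}

fn main() {
    let a = build(&["rust search", "lucene"]);
    let b = build(&["rust", "rust lucene"]);
    println!("unmerged: {:?}", count_matches(&[a.clone(), b.clone()], "rust"));
    let merged = merge(&a, &b);
    println!("merged: {:?}", count_matches(&[merged], "rust"));
}
```

The merged segment answers the same query with one dictionary lookup instead of two, and the term `rust` is stored once rather than once per segment.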
## Indexing throughput, number of indexing threads
[^1]: This may eventually change.
[^2]: Be careful however. By default these files will not be considered as *managed* by tantivy. This means they will never be garbage collected by tantivy, regardless of whether they become obsolete or not.

@@ -1 +1,3 @@
# Examples
- [Basic search](/examples/basic_search.html)