mirror of
https://github.com/quickwit-oss/tantivy.git
synced 2026-06-05 01:50:42 +00:00
updating doc
This commit is contained in:
@@ -9,6 +9,7 @@
|
||||
- [Facetting](./facetting.md)
|
||||
- [Innerworkings](./innerworkings.md)
|
||||
- [Inverted index](./inverted_index.md)
|
||||
- [Best practise](./inverted_index.md)
|
||||
|
||||
[Frequently Asked Questions](./faq.md)
|
||||
[Examples](./examples.md)
|
||||
|
||||
@@ -2,8 +2,8 @@
|
||||
|
||||
> Tantivy is a **search** engine **library** for Rust.
|
||||
|
||||
If you are familiar with Lucene, tantivy is heavily inspired by Lucene's design and
|
||||
they both have the same scope and targetted users.
|
||||
If you are familiar with Lucene, it's an excellent approximation to consider tantivy as Lucene for rust. tantivy is heavily inspired by Lucene's design and
|
||||
they both have the same scope and targetted use cases.
|
||||
|
||||
If you are not familiar with Lucene, let's break down our little tagline.
|
||||
|
||||
@@ -17,15 +17,18 @@ relevancy, collapsing, highlighting, spatial search.
|
||||
experience. But keep in mind this is just a toolbox.
|
||||
Which bring us to the second keyword...
|
||||
|
||||
- **Library** means that you will have to write code. tantivy is not an *all-in-one* server solution.
|
||||
|
||||
Sometimes a functionality will not be available in tantivy because it is too specific to your use case. By design, tantivy should make it possible to extend
|
||||
the available set of features using the existing rock-solid datastructures.
|
||||
- **Library** means that you will have to write code. tantivy is not an *all-in-one* server solution like elastic search for instance.
|
||||
|
||||
Most frequently this will mean writing your own `Collector`, your own `Scorer` or your own
|
||||
`Tokenizer/TokenFilter`... But some of your requirement may also be related to
|
||||
architecture or operations. For instance, you may want to build a large corpus on Hadoop,
|
||||
fine-tune the merge policy to keep your index sharded in a time-wise fashion, or you may want
|
||||
to convert and existing index from a different format.
|
||||
|
||||
Tantivy exposes its API to do all of these things.
|
||||
Sometimes a functionality will not be available in tantivy because it is too
|
||||
specific to your use case. By design, tantivy should make it possible to extend
|
||||
the available set of features using the existing rock-solid datastructures.
|
||||
|
||||
Most frequently this will mean writing your own `Collector`, your own `Scorer` or your own
|
||||
`TokenFilter`... Some of your requirements may also be related to
|
||||
something closer to architecture or operations. For instance, you may
|
||||
want to build a large corpus on Hadoop, fine-tune the merge policy to keep your
|
||||
index sharded in a time-wise fashion, or you may want to convert and existing
|
||||
index from a different format.
|
||||
|
||||
Tantivy exposes a lot of low level API to do all of these things.
|
||||
|
||||
|
||||
@@ -2,47 +2,76 @@
|
||||
|
||||
## Straight from disk
|
||||
|
||||
By default, tantivy accesses its data using its `MMapDirectory`.
|
||||
While this design has some downsides, this greatly simplifies the source code of tantivy,
|
||||
and entirely delegates the caching to the OS.
|
||||
Tantivy accesses its data using an abstracting trait called `Directory`.
|
||||
In theory, one can come and override the data access logic. In practise, the
|
||||
trait somewhat assumes that your data can be mapped to memory, and tantivy
|
||||
seems deeply married to using `mmap` for its io [^1], and the only persisting
|
||||
directory shipped with tantivy is the `MmapDirectory`.
|
||||
|
||||
`tantivy` works entirely (or almost) by directly reading the datastructures as they are layed on disk.
|
||||
As a result, the act of opening an indexing does not involve loading different datastructures
|
||||
from the disk into random access memory : starting a process, opening an index, and performing a query
|
||||
can typically be done in a matter of milliseconds.
|
||||
While this design has some downsides, this greatly simplifies the source code of
|
||||
tantivy. Caching is also entirely delegated to the OS.
|
||||
|
||||
This is an interesting property for a command line search engine, or for some multi-tenant log search engine.
|
||||
Spawning a new process for each new query can be a perfectly sensible solution in some use case.
|
||||
`tantivy` works entirely (or almost) by directly reading the datastructures as they are layed on disk. As a result, the act of opening an indexing does not involve loading different datastructures from the disk into random access memory : starting a process, opening an index, and performing your first query can typically be done in a matter of milliseconds.
|
||||
|
||||
This is an interesting property for a command line search engine, or for some multi-tenant log search engine : spawning a new process for each new query can be a perfectly sensible solution in some use case.
|
||||
|
||||
In later chapters, we will discuss tantivy's inverted index data layout.
|
||||
One key take away is that to achieve great performance, search indexes are extremely compact.
|
||||
One key take away is that to achieve great performance, search indexes are extremely compact.
|
||||
Of course this is crucial to reduce IO, and ensure that as much of our index can sit in RAM.
|
||||
|
||||
Also, whenever possible the data is accessed sequentially. Of course, this is an amazing property when tantivy needs to access
|
||||
the data from your spinning hard disk, but this is also a great property when working with `SSD` or `RAM`,
|
||||
as it makes our read patterns very predictable for the CPU.
|
||||
Also, whenever possible its data is accessed sequentially. Of course, this is an amazing property when tantivy needs to access the data from your spinning hard disk, but this is also
|
||||
critical for performance, if your data is read from and an `SSD` or even already in your pagecache.
|
||||
|
||||
|
||||
## Segments, and the log method
|
||||
|
||||
That kind compact layout comes at one cost: it prevents our datastructures from being dynamic.
|
||||
In fact, a trait called `Directory` is in charge of abstracting all of tantivy's data access
|
||||
and its API does not even allow editing these file once they are written.
|
||||
That kind of compact layout comes at one cost: it prevents our datastructures from being dynamic.
|
||||
In fact, the `Directory` trait does not even allow you to modify part of a file.
|
||||
|
||||
To allow the addition / deletion of documents, and create the illusion that
|
||||
your index is dynamic (i.e.: adding and deleting documents), tantivy uses a common database trick sometimes
|
||||
referred to as the *log method*.
|
||||
your index is dynamic (i.e.: adding and deleting documents), tantivy uses a common database trick sometimes referred to as the *log method*.
|
||||
|
||||
Let's forget about deletes for a moment. As you add documents, these documents are processed and stored in
|
||||
a dedicated datastructure, in a `RAM` buffer. This datastructure is designed to be dynamic but
|
||||
cannot be accessed for search. As you add documents, this buffer will reach its capacity and tantivy will
|
||||
transparently stop adding document to it and start converting this datastructure to its final
|
||||
read-only format on disk. Once written, an brand empty buffer is available to resume adding documents.
|
||||
Let's forget about deletes for a moment.
|
||||
|
||||
As you add documents, these documents are processed and stored in a dedicated datastructure, in a `RAM` buffer. This datastructure is not ready for search, but it is useful to receive your data and rearrange it very rapidly.
|
||||
|
||||
As you add documents, this buffer will reach its capacity and tantivy will transparently stop adding document to it and start converting this datastructure to its final read-only format on disk. Once written, an brand empty buffer is available to resume adding documents.
|
||||
|
||||
The resulting chunk of index obtained after this serialization is called a `Segment`.
|
||||
|
||||
> A segment is a self-contained atomic piece of index. It is identified with a UUID, and all of its files
|
||||
are identified using the naming scheme : `<UUID>.*`.
|
||||
> A segment is a self-contained atomic piece of index. It is identified with a UUID, and all of its files are identified using the naming scheme : `<UUID>.*`.
|
||||
|
||||
Which brings us to the nature of a tantivy `Index`.
|
||||
|
||||
> A tantivy `Index` is a collection of `Segments`.
|
||||
|
||||
Physically, this really just means and index is a bunch of segment files in a given `Directory`,
|
||||
linked together by a `meta.json` file. This transparency can become extremely handy
|
||||
to get tantivy to fit your use case:
|
||||
|
||||
*Example 1* You could for instance use hadoop to build a very large search index in a timely manner, copy all of the resulting segment files in the same directory and edit the `meta.json` to get a functional index.[^2]
|
||||
|
||||
*Example 2* You could also disable your merge policy and enforce daily segments. Removing data after one week can then be done very efficiently by just editing the `meta.json` and deleting the files associated to segment `D-7`.
|
||||
|
||||
|
||||
> A tantivy `Index` is a collection of `Segments`.
|
||||
|
||||
|
||||
|
||||
# Merging
|
||||
|
||||
As you index more and more data, your index will accumulate more and more segments.
|
||||
Having a lot of small segments is not really optimal. There is a bit of redundancy in having
|
||||
all these term dictionary. Also when searching, we will need to do term lookups as many times as we have segments. It can hurt search performance a bit.
|
||||
|
||||
That's where merging or compacting comes into place. Tantivy will continuously consider merge
|
||||
opportunities and start merging segments in the background.
|
||||
|
||||
|
||||
# Indexing throughput, number of indexing threads
|
||||
|
||||
|
||||
|
||||
|
||||
[^1]: This may eventually change.
|
||||
|
||||
[^2]: Be careful however. By default these files will not be considered as *managed* by tantivy. This means they will never be garbage collected by tantivy, regardless of whether they become obsolete or not.
|
||||
|
||||
0
doc/src/best_practise.md.rs
Normal file
0
doc/src/best_practise.md.rs
Normal file
@@ -1 +1,3 @@
|
||||
# Examples
|
||||
|
||||
- [Basic search](/examples/basic_search.html)
|
||||
Reference in New Issue
Block a user