diff --git a/doc/src/SUMMARY.md b/doc/src/SUMMARY.md index 76dd29748..a280d19b7 100644 --- a/doc/src/SUMMARY.md +++ b/doc/src/SUMMARY.md @@ -9,6 +9,7 @@ - [Facetting](./facetting.md) - [Innerworkings](./innerworkings.md) - [Inverted index](./inverted_index.md) +- [Best practice](./inverted_index.md) [Frequently Asked Questions](./faq.md) [Examples](./examples.md) diff --git a/doc/src/avant-propos.md b/doc/src/avant-propos.md index aa50cd02b..485afd178 100644 --- a/doc/src/avant-propos.md +++ b/doc/src/avant-propos.md @@ -2,8 +2,8 @@ > Tantivy is a **search** engine **library** for Rust. -If you are familiar with Lucene, tantivy is heavily inspired by Lucene's design and -they both have the same scope and targetted users. +If you are familiar with Lucene, a good first approximation is to think of tantivy as Lucene for Rust. tantivy is heavily inspired by Lucene's design and +they both have the same scope and targeted use cases. If you are not familiar with Lucene, let's break down our little tagline. @@ -17,15 +17,18 @@ relevancy, collapsing, highlighting, spatial search. experience. But keep in mind this is just a toolbox. Which bring us to the second keyword... -- **Library** means that you will have to write code. tantivy is not an *all-in-one* server solution. - - Sometimes a functionality will not be available in tantivy because it is too specific to your use case. By design, tantivy should make it possible to extend - the available set of features using the existing rock-solid datastructures. +- **Library** means that you will have to write code. tantivy is not an *all-in-one* server solution like Elasticsearch, for instance. - Most frequently this will mean writing your own `Collector`, your own `Scorer` or your own - `Tokenizer/TokenFilter`... But some of your requirement may also be related to - architecture or operations. 
For instance, you may want to build a large corpus on Hadoop, - fine-tune the merge policy to keep your index sharded in a time-wise fashion, or you may want - to convert and existing index from a different format. - - Tantivy exposes its API to do all of these things. \ No newline at end of file + Sometimes a feature will not be available in tantivy because it is too + specific to your use case. By design, tantivy should make it possible to extend + the available set of features using the existing rock-solid datastructures. + + Most frequently this will mean writing your own `Collector`, your own `Scorer` or your own + `TokenFilter`... Some of your requirements may also be + closer to architecture or operations. For instance, you may + want to build a large corpus on Hadoop, fine-tune the merge policy to keep your + index sharded in a time-wise fashion, or you may want to convert an existing + index from a different format. + + Tantivy exposes many low-level APIs to do all of these things. + diff --git a/doc/src/basis.md b/doc/src/basis.md index e52615f6d..21dadb7fb 100644 --- a/doc/src/basis.md +++ b/doc/src/basis.md @@ -2,47 +2,76 @@ ## Straight from disk -By default, tantivy accesses its data using its `MMapDirectory`. -While this design has some downsides, this greatly simplifies the source code of tantivy, -and entirely delegates the caching to the OS. +Tantivy accesses its data through an abstraction trait called `Directory`. +In theory, one can override the data access logic. In practice, the +trait somewhat assumes that your data can be mapped to memory, tantivy +is deeply tied to using `mmap` for its IO [^1], and the only persistent +directory shipped with tantivy is the `MmapDirectory`. -`tantivy` works entirely (or almost) by directly reading the datastructures as they are layed on disk. 
-As a result, the act of opening an indexing does not involve loading different datastructures -from the disk into random access memory : starting a process, opening an index, and performing a query -can typically be done in a matter of milliseconds. +While this design has some downsides, it greatly simplifies the source code of +tantivy. Caching is also entirely delegated to the OS. -This is an interesting property for a command line search engine, or for some multi-tenant log search engine. -Spawning a new process for each new query can be a perfectly sensible solution in some use case. +`tantivy` works entirely (or almost) by directly reading the datastructures as they are laid out on disk. As a result, the act of opening an index does not involve loading different datastructures from the disk into random access memory: starting a process, opening an index, and performing your first query can typically be done in a matter of milliseconds. + +This is an interesting property for a command line search engine, or for some multi-tenant log search engine: spawning a new process for each new query can be a perfectly sensible solution in some use cases. In later chapters, we will discuss tantivy's inverted index data layout. -One key take away is that to achieve great performance, search indexes are extremely compact. +One key takeaway is that, to achieve great performance, search indexes are extremely compact. This is crucial to reduce IO and to ensure that as much of our index as possible can sit in RAM. -Also, whenever possible the data is accessed sequentially. Of course, this is an amazing property when tantivy needs to access -the data from your spinning hard disk, but this is also a great property when working with `SSD` or `RAM`, -as it makes our read patterns very predictable for the CPU. +Also, whenever possible its data is accessed sequentially. 
Of course, this is an amazing property when tantivy needs to access the data from your spinning hard disk, but this is also +critical for performance if your data is read from an `SSD` or is even already in your page cache. ## Segments, and the log method -That kind compact layout comes at one cost: it prevents our datastructures from being dynamic. -In fact, a trait called `Directory` is in charge of abstracting all of tantivy's data access -and its API does not even allow editing these file once they are written. +That kind of compact layout comes at one cost: it prevents our datastructures from being dynamic. +In fact, the `Directory` trait does not even allow you to modify part of a file. To allow the addition / deletion of documents, and create the illusion that -your index is dynamic (i.e.: adding and deleting documents), tantivy uses a common database trick sometimes -referred to as the *log method*. +your index is dynamic (i.e.: adding and deleting documents), tantivy uses a common database trick sometimes referred to as the *log method*. -Let's forget about deletes for a moment. As you add documents, these documents are processed and stored in -a dedicated datastructure, in a `RAM` buffer. This datastructure is designed to be dynamic but -cannot be accessed for search. As you add documents, this buffer will reach its capacity and tantivy will -transparently stop adding document to it and start converting this datastructure to its final -read-only format on disk. Once written, an brand empty buffer is available to resume adding documents. +Let's forget about deletes for a moment. + +As you add documents, these documents are processed and stored in a dedicated datastructure, in a `RAM` buffer. This datastructure cannot be searched, but it can receive your data and rearrange it very rapidly. 
+ +As you add documents, this buffer will reach its capacity and tantivy will transparently stop adding documents to it and start converting this datastructure to its final read-only format on disk. Once written, a brand new empty buffer is available to resume adding documents. The resulting chunk of index obtained after this serialization is called a `Segment`. -> A segment is a self-contained atomic piece of index. It is identified with a UUID, and all of its files -are identified using the naming scheme : `<UUID>.*`. +> A segment is a self-contained atomic piece of index. It is identified with a UUID, and all of its files are identified using the naming scheme: `<UUID>.*`. + +Which brings us to the nature of a tantivy `Index`. + +> A tantivy `Index` is a collection of `Segments`. + +Physically, this really just means an index is a bunch of segment files in a given `Directory`, +linked together by a `meta.json` file. This transparency can become extremely handy +to get tantivy to fit your use case: + +*Example 1* You could for instance use Hadoop to build a very large search index in a timely manner, copy all of the resulting segment files into the same directory and edit the `meta.json` to get a functional index.[^2] + +*Example 2* You could also disable your merge policy and enforce daily segments. Removing data after one week can then be done very efficiently by just editing the `meta.json` and deleting the files associated with segment `D-7`. -> A tantivy `Index` is a collection of `Segments`. \ No newline at end of file + + + +# Merging + +As you index more and more data, your index will accumulate more and more segments. +Having a lot of small segments is not really optimal. There is a bit of redundancy in having +all these term dictionaries. Also, when searching, we will need to do term lookups as many times as we have segments. This can hurt search performance a bit. + +That's where merging or compacting comes into play. 
Tantivy will continuously consider merge +opportunities and start merging segments in the background. + + +# Indexing throughput, number of indexing threads + + + + +[^1]: This may eventually change. + +[^2]: Be careful however. By default these files will not be considered as *managed* by tantivy. This means they will never be garbage collected by tantivy, regardless of whether they become obsolete or not. diff --git a/doc/src/best_practise.md.rs b/doc/src/best_practise.md.rs new file mode 100644 index 000000000..e69de29bb diff --git a/doc/src/examples.md b/doc/src/examples.md index df635b4e6..6ba4a8a4d 100644 --- a/doc/src/examples.md +++ b/doc/src/examples.md @@ -1 +1,3 @@ # Examples + +- [Basic search](/examples/basic_search.html) \ No newline at end of file