mirror of
https://github.com/quickwit-oss/tantivy.git
synced 2026-06-03 17:10:48 +00:00
Added fundamentalss
This commit is contained in:
@@ -4,7 +4,7 @@
|
||||
|
||||
[Avant Propos](./avant-propos.md)
|
||||
|
||||
- [Fundamental concepts](./basis.md)
|
||||
- [Segments](./basis.md)
|
||||
- [Defining your schema](./schema.md)
|
||||
- [Facetting](./facetting.md)
|
||||
- [Innerworkings](./innerworkings.md)
|
||||
|
||||
@@ -1 +1,48 @@
|
||||
# Basic concepts
|
||||
# Anatomy of an index
|
||||
|
||||
## Straight from disk
|
||||
|
||||
By default, tantivy accesses its data using its `MMapDirectory`.
|
||||
While this design has some downsides, this greatly simplifies the source code of tantivy,
|
||||
and entirely delegates the caching to the OS.
|
||||
|
||||
`tantivy` works entirely (or almost) by directly reading the datastructures as they are layed on disk.
|
||||
As a result, the act of opening an indexing does not involve loading different datastructures
|
||||
from the disk into random access memory : starting a process, opening an index, and performing a query
|
||||
can typically be done in a matter of milliseconds.
|
||||
|
||||
This is an interesting property for a command line search engine, or for some multi-tenant log search engine.
|
||||
Spawning a new process for each new query can be a perfectly sensible solution in some use case.
|
||||
|
||||
In later chapters, we will discuss tantivy's inverted index data layout.
|
||||
One key take away is that to achieve great performance, search indexes are extremely compact.
|
||||
Of course this is crucial to reduce IO, and ensure that as much of our index can sit in RAM.
|
||||
|
||||
Also, whenever possible the data is accessed sequentially. Of course, this is an amazing property when tantivy needs to access
|
||||
the data from your spinning hard disk, but this is also a great property when working with `SSD` or `RAM`,
|
||||
as it makes our read patterns very predictable for the CPU.
|
||||
|
||||
|
||||
## Segments, and the log method
|
||||
|
||||
That kind compact layout comes at one cost: it prevents our datastructures from being dynamic.
|
||||
In fact, a trait called `Directory` is in charge of abstracting all of tantivy's data access
|
||||
and its API does not even allow editing these file once they are written.
|
||||
|
||||
To allow the addition / deletion of documents, and create the illusion that
|
||||
your index is dynamic (i.e.: adding and deleting documents), tantivy uses a common database trick sometimes
|
||||
referred to as the *log method*.
|
||||
|
||||
Let's forget about deletes for a moment. As you add documents, these documents are processed and stored in
|
||||
a dedicated datastructure, in a `RAM` buffer. This datastructure is designed to be dynamic but
|
||||
cannot be accessed for search. As you add documents, this buffer will reach its capacity and tantivy will
|
||||
transparently stop adding document to it and start converting this datastructure to its final
|
||||
read-only format on disk. Once written, an brand empty buffer is available to resume adding documents.
|
||||
|
||||
The resulting chunk of index obtained after this serialization is called a `Segment`.
|
||||
|
||||
> A segment is a self-contained atomic piece of index. It is identified with a UUID, and all of its files
|
||||
are identified using the naming scheme : `<UUID>.*`.
|
||||
|
||||
|
||||
> A tantivy `Index` is a collection of `Segments`.
|
||||
Reference in New Issue
Block a user