Fix typos and markdowns

Found via these commands:

    codespell -L crate,ser,panting,beauti,hart,ue,atleast,childs,ond,pris,hel,mot
    markdownlint *.md doc/src/*.md --disable MD013 MD025 MD033 MD001 MD024 MD036 MD041 MD003
This commit is contained in:
Kian-Meng Ang
2022-08-13 18:13:27 +08:00
parent 8e773ade77
commit 625bcb4877
62 changed files with 201 additions and 202 deletions

View File

@@ -1,7 +1,5 @@
# Summary
[Avant Propos](./avant-propos.md)
- [Segments](./basis.md)

View File

@@ -3,7 +3,7 @@
> Tantivy is a **search** engine **library** for Rust.
If you are familiar with Lucene, it's an excellent approximation to consider tantivy as Lucene for rust. tantivy is heavily inspired by Lucene's design and
they both have the same scope and targetted use cases.
they both have the same scope and targeted use cases.
If you are not familiar with Lucene, let's break down our little tagline.
@@ -31,4 +31,4 @@ relevancy, collapsing, highlighting, spatial search.
index from a different format.
Tantivy exposes a lot of low level API to do all of these things.

View File

@@ -11,7 +11,7 @@ directory shipped with tantivy is the `MmapDirectory`.
While this design has some downsides, this greatly simplifies the source code of
tantivy. Caching is also entirely delegated to the OS.
`tantivy` works entirely (or almost) by directly reading the datastructures as they are layed on disk. As a result, the act of opening an indexing does not involve loading different datastructures from the disk into random access memory : starting a process, opening an index, and performing your first query can typically be done in a matter of milliseconds.
`tantivy` works entirely (or almost) by directly reading the datastructures as they are laid on disk. As a result, the act of opening an indexing does not involve loading different datastructures from the disk into random access memory : starting a process, opening an index, and performing your first query can typically be done in a matter of milliseconds.
This is an interesting property for a command line search engine, or for some multi-tenant log search engine : spawning a new process for each new query can be a perfectly sensible solution in some use case.
@@ -22,7 +22,6 @@ Of course this is crucial to reduce IO, and ensure that as much of our index can
Also, whenever possible its data is accessed sequentially. Of course, this is an amazing property when tantivy needs to access the data from your spinning hard disk, but this is also
critical for performance, if your data is read from and an `SSD` or even already in your pagecache.
## Segments, and the log method
That kind of compact layout comes at one cost: it prevents our datastructures from being dynamic.
@@ -53,11 +52,7 @@ to get tantivy to fit your use case:
*Example 2* You could also disable your merge policy and enforce daily segments. Removing data after one week can then be done very efficiently by just editing the `meta.json` and deleting the files associated to segment `D-7`.
# Merging
## Merging
As you index more and more data, your index will accumulate more and more segments.
Having a lot of small segments is not really optimal. There is a bit of redundancy in having
@@ -66,11 +61,7 @@ all these term dictionary. Also when searching, we will need to do term lookups
That's where merging or compacting comes into place. Tantivy will continuously consider merge
opportunities and start merging segments in the background.
# Indexing throughput, number of indexing threads
## Indexing throughput, number of indexing threads
[^1]: This may eventually change.

View File

@@ -1,3 +1,3 @@
# Examples
- [Basic search](/examples/basic_search.html)
- [Basic search](/examples/basic_search.html)

View File

@@ -1,11 +1,11 @@
- [Index Sorting](#index-sorting)
+ [Why Sorting](#why-sorting)
* [Compression](#compression)
* [Top-N Optimization](#top-n-optimization)
* [Pruning](#pruning)
* [Other](#other)
+ [Usage](#usage)
- [Why Sorting](#why-sorting)
- [Compression](#compression)
- [Top-N Optimization](#top-n-optimization)
- [Pruning](#pruning)
- [Other](#other)
- [Usage](#usage)
# Index Sorting
@@ -15,32 +15,34 @@ Tantivy allows you to sort the index according to a property.
Presorting an index has several advantages:
###### Compression
### Compression
When data is sorted it is easier to compress the data. E.g. the numbers sequence [5, 2, 3, 1, 4] would be sorted to [1, 2, 3, 4, 5].
When data is sorted it is easier to compress the data. E.g. the numbers sequence [5, 2, 3, 1, 4] would be sorted to [1, 2, 3, 4, 5].
If we apply delta encoding this list would be unsorted [5, -3, 1, -2, 3] vs. [1, 1, 1, 1, 1].
Compression ratio is mainly affected on the fast field of the sorted property, every thing else is likely unaffected.
###### Top-N Optimization
Compression ratio is mainly affected on the fast field of the sorted property, every thing else is likely unaffected.
When data is presorted by a field and search queries request sorting by the same field, we can leverage the natural order of the documents.
### Top-N Optimization
When data is presorted by a field and search queries request sorting by the same field, we can leverage the natural order of the documents.
E.g. if the data is sorted by timestamp and want the top n newest docs containing a term, we can simply leveraging the order of the docids.
Note: Tantivy 0.16 does not do this optimization yet.
###### Pruning
### Pruning
Let's say we want all documents and want to apply the filter `>= 2010-08-11`. When the data is sorted, we could make a lookup in the fast field to find the docid range and use this as the filter.
Note: Tantivy 0.16 does not do this optimization yet.
###### Other?
### Other?
In principle there are many algorithms possible that exploit the monotonically increasing nature. (aggregations maybe?)
## Usage
The index sorting can be configured setting [`sort_by_field`](https://github.com/quickwit-oss/tantivy/blob/000d76b11a139a84b16b9b95060a1c93e8b9851c/src/core/index_meta.rs#L238) on `IndexSettings` and passing it to a `IndexBuilder`. As of Tantivy 0.16 only fast fields are allowed to be used.
```
```rust
let settings = IndexSettings {
sort_by_field: Some(IndexSortByField {
field: "intval".to_string(),
@@ -58,4 +60,3 @@ let index = index_builder.create_in_ram().unwrap();
Sorting an index is applied in the serialization step. In general there are two serialization steps: [Finishing a single segment](https://github.com/quickwit-oss/tantivy/blob/000d76b11a139a84b16b9b95060a1c93e8b9851c/src/indexer/segment_writer.rs#L338) and [merging multiple segments](https://github.com/quickwit-oss/tantivy/blob/000d76b11a139a84b16b9b95060a1c93e8b9851c/src/indexer/merger.rs#L1073).
In both cases we generate a docid mapping reflecting the sort. This mapping is used when serializing the different components (doc store, fastfields, posting list, normfield, facets).

View File

@@ -21,16 +21,17 @@ For instance, if user is a json field, the following document:
```
emits the following tokens:
- ("name", Text, "Paul")
- ("name", Text, "Masurel")
- ("address.city", Text, "Tokyo")
- ("address.country", Text, "Japan")
- ("created_at", Date, 15420648505)
- ("name", Text, "Paul")
- ("name", Text, "Masurel")
- ("address.city", Text, "Tokyo")
- ("address.country", Text, "Japan")
- ("created_at", Date, 15420648505)
# Bytes-encoding and lexicographical sort.
## Bytes-encoding and lexicographical sort
Like any other terms, these triplets are encoded into a binary format as follows.
- `json_path`: the json path is a sequence of "segments". In the example above, `address.city`
is just a debug representation of the json path `["address", "city"]`.
Its representation is done by separating segments by a unicode char `\x01`, and ending the path by `\x00`.
@@ -41,16 +42,16 @@ This representation is designed to align the natural sort of Terms with the lexi
of their binary representation (Tantivy's dictionary (whether fst or sstable) is sorted and does prefix encoding).
In the example above, the terms will be sorted as
- ("address.city", Text, "Tokyo")
- ("address.country", Text, "Japan")
- ("name", Text, "Masurel")
- ("name", Text, "Paul")
- ("created_at", Date, 15420648505)
- ("address.city", Text, "Tokyo")
- ("address.country", Text, "Japan")
- ("name", Text, "Masurel")
- ("name", Text, "Paul")
- ("created_at", Date, 15420648505)
As seen in "pitfalls", we may end up having to search for a value for a same path in several different fields. Putting the field code after the path makes it maximizes compression opportunities but also increases the chances for the two terms to end up in the actual same term dictionary block.
# Pitfalls, limitation and corner cases.
## Pitfalls, limitation and corner cases
Json gives very little information about the type of the literals it stores.
All numeric types end up mapped as a "Number" and there are no types for dates.
@@ -70,19 +71,21 @@ For instance, we do not even know if the type is a number or string based.
So the query
```
```rust
my_path.my_segment:233
```
Will be interpreted as
`(my_path.my_segment, String, 233) or (my_path.my_segment, u64, 233)`
```rust
(my_path.my_segment, String, 233) or (my_path.my_segment, u64, 233)
```
Likewise, we need to emit two tokens if the query contains an rfc3999 date.
Indeed the date could have been actually a single token inside the text of a document at ingestion time. Generally speaking, we will always at least emit a string token in query parsing, and sometimes more.
If one more json field is defined, things get even more complicated.
## Default json field
If the schema contains a text field called "text" and a json field that is set as a default field:
@@ -96,11 +99,11 @@ This is a product decision.
The user can still target the JSON field by specifying its name explicitly:
`json_dynamic.text:hello`.
## Range queries are not supported.
## Range queries are not supported
Json field do not support range queries.
## Arrays do not work like nested object.
## Arrays do not work like nested object
If json object contains an array, a search query might return more documents
than what might be expected.
@@ -120,9 +123,8 @@ Let's take an example.
Despite the array structure, a document in tantivy is a bag of terms.
The query:
```
```rust
cart.product_type:sneakers AND cart.attributes.color:red
```
Actually match the document above.