Mirror of https://github.com/quickwit-oss/tantivy.git, synced 2026-01-06 01:02:55 +00:00

Compare commits: rihardsk-d...owned-byte (76 commits)
8644d68023, 913850b590, a4002622f8, 8e21087ad7, d523543dc7, 6ca27b6dd4, 8d51e9cc91, 2aced2d958,
3fcba00a1f, 372d12766a, dfed8896b9, d71aa57077, 3e85fe57ac, 537021e12d, ec4834cd73, 712c01aa93,
cde324d4b4, 478571ebb4, fde9d27482, f38daab7f7, 25b9429929, 83cf638a2e, a04e0bdaf1, c200d59d1e,
bbeac5888c, daa53522b5, 2c0f6e3319, 27f587aa13, cfc27c9665, 88a1a90c3c, 6d8581baae, 2b4b16ae90,
075c23eb8c, cbf805c3e6, 46beb2a989, c01c175744, eca496ee24, 083bb3ec3f, 2dc5403e7b, aead5d4068,
6fb3622abb, 39dd8cfe24, b9b9e9e518, e2c91aff33, 96098fce20, 18bfe131fe, 8dc3e7704c, 1ebfc71721,
afd3dc7e81, cbaecb1ea4, 8883e32dd8, 4243780e0a, 5f3fd08509, 98a225acd1, d8555cc8a1, 3ccd93ac67,
336428df8b, d69aace9ec, 777debf5d7, 7c20771d20, fef428a9c6, cc9972ad6c, c2fdc60569, 39320f953c,
be7c9cc9b8, b7159dd48e, 38992251c5, a00049b879, ba4bc6d7c3, 868f4fd174, 5c1ce5b0e1, 9af3aa0de0,
71309c5528, 54decc60bb, 50eea4376b, 443aa17329
.github/dependabot.yml (new file, vendored, 8 lines)
@@ -0,0 +1,8 @@
version: 2
updates:
- package-ecosystem: cargo
  directory: "/"
  schedule:
    interval: daily
    time: "20:00"
  open-pull-requests-limit: 10
@@ -2,51 +2,51 @@
|
||||
|
||||
## What is tantivy?
|
||||
|
||||
Tantivy is a library that is meant to build search engines. Although it is by no mean a port of Lucene, its architecture is strongly inspired by it. If you are familiar with Lucene, you may be struck by the overlapping vocabulary.
|
||||
Tantivy is a library that is meant to build search engines. Although it is by no means a port of Lucene, its architecture is strongly inspired by it. If you are familiar with Lucene, you may be struck by the overlapping vocabulary.
|
||||
This is not fortuitous.
|
||||
|
||||
Tantivy's bread and butter is to address the problem of full-text search:
|
||||
|
||||
Given a large set of textual documents, and a text query, return the K-most relevant documents in a very efficient way. In order to execute these queries rapidly, the tantivy need to build an index beforehand. The relevance score implemented in the tantivy is not configurable. Tantivy uses the same score as the default similarity used in Lucene / Elasticsearch, called [BM25](https://en.wikipedia.org/wiki/Okapi_BM25).
|
||||
Given a large set of textual documents, and a text query, return the K-most relevant documents in a very efficient way. To execute these queries rapidly, tantivy needs to build an index beforehand. The relevance score implemented in tantivy is not configurable. Tantivy uses the same score as the default similarity used in Lucene / Elasticsearch, called [BM25](https://en.wikipedia.org/wiki/Okapi_BM25).
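For reference, the term-level score behind that ranking is the textbook Okapi BM25 formula sketched below; the constants `k1 = 1.2` and `b = 0.75` are the usual defaults and are shown for illustration, not as a statement of tantivy's exact internals.

```rust
/// Textbook Okapi BM25 contribution of one term to one document's score.
/// `tf`: term frequency in the document's field,
/// `doc_len` / `avg_doc_len`: field length of this doc vs. the segment average,
/// `doc_count` / `doc_freq`: total docs vs. docs containing the term.
fn bm25_term_score(tf: f32, doc_len: f32, avg_doc_len: f32, doc_count: f32, doc_freq: f32) -> f32 {
    let (k1, b) = (1.2_f32, 0.75_f32); // common defaults, assumed for illustration
    let idf = ((doc_count - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0).ln();
    let length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len);
    idf * (tf * (k1 + 1.0)) / (tf + length_norm)
}
```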
|
||||
|
||||
But tantivy's scope does not stop there. Numerous features are required to power rich search applications. For instance, one may want to:
|
||||
But tantivy's scope does not stop there. Numerous features are required to power rich-search applications. For instance, one may want to:
|
||||
- compute the count of documents matching a query in the different sections of an e-commerce website,
|
||||
- display an average price per square meter for a real estate search engine,
|
||||
- take in account historical user data to rank documents in a specific way,
|
||||
- take into account historical user data to rank documents in a specific way,
|
||||
- or even use tantivy to power an OLAP database.
|
||||
|
||||
A more abstract description of the problem space tantivy is trying to address is the following.
|
||||
|
||||
Ingest a large set of documents, create an index that makes it possible to
|
||||
rapidly select all documents matching a given predicate (also known as a query) and
|
||||
collect some information about them (See collector).
|
||||
collect some information about them ([See collector](#collector-define-what-to-do-with-matched-documents)).
|
||||
|
||||
Roughly speaking, the design follows these guiding principles:
|
||||
- Search should be O(1) in memory.
|
||||
- Indexing should be O(1) in memory. (In practise it is just sublinear)
|
||||
- Indexing should be O(1) in memory. (In practice it is just sublinear)
|
||||
- Search should be as fast as possible
|
||||
|
||||
This comes at the cost of the dynamicity of the index : while it is possible to add, and delete documents from our corpus, the tantivy is designed to handle these updates in large batches.
|
||||
This comes at the cost of the dynamicity of the index: while it is possible to add, and delete documents from our corpus, the tantivy is designed to handle these updates in large batches.
|
||||
|
||||
## [core/](src/core): Index, segments, searchers.
|
||||
|
||||
Core contains all of the high level code to make it possible for to create an index, add documents, delete documents and commit.
|
||||
Core contains all of the high-level code to make it possible to create an index, add documents, delete documents and commit.
|
||||
|
||||
This is both the most high-level part of tantivy, the least performance sensitive one, the seemingly most mundane code... And paradoxically the most complicated part.
|
||||
This is both the most high-level part of tantivy, the least performance-sensitive one, the seemingly most mundane code... And paradoxically the most complicated part.
|
||||
|
||||
### Index and Segments...
|
||||
|
||||
A tantivy index is in fact a collection of smaller independent immutable segments.
|
||||
Each segment contains its own independent set of datastructures.
|
||||
A tantivy index is a collection of smaller independent immutable segments.
|
||||
Each segment contains its own independent set of data structures.
|
||||
|
||||
A segment is identified by a segment id that is in fact a UUID.
|
||||
Each file of a segment is named using the format
|
||||
|
||||
```segment-id . ext ```
|
||||
|
||||
The extension signals which datastructure (or [`SegmentComponent`](src/core/segment_component.rs)) is stored in the file.
|
||||
The extension signals which data structure (or [`SegmentComponent`](src/core/segment_component.rs)) is stored in the file.
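As an illustration, a freshly committed segment might leave files like the following on disk (the segment id and the exact set of extensions shown here are hypothetical; the authoritative list is the `SegmentComponent` enum linked above):

```
4a6c2f0e93b44d0fa1b2c3d4e5f60718.term       term dictionary
4a6c2f0e93b44d0fa1b2c3d4e5f60718.idx        posting lists
4a6c2f0e93b44d0fa1b2c3d4e5f60718.pos        positions
4a6c2f0e93b44d0fa1b2c3d4e5f60718.fieldnorm  fieldnorms
4a6c2f0e93b44d0fa1b2c3d4e5f60718.fast       fast fields
4a6c2f0e93b44d0fa1b2c3d4e5f60718.store      doc store
```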
|
||||
|
||||
A small `meta.json` file is in charge keeping track of the list of segments, as well as the schema.
|
||||
A small `meta.json` file is in charge of keeping track of the list of segments, as well as the schema.
|
||||
|
||||
On commit, one segment per indexing thread is written to disk, and the `meta.json` is then updated atomically.
|
||||
|
||||
@@ -66,16 +66,16 @@ An opstamp is simply an incremental id that identifies any operation applied to
|
||||
### DocId
|
||||
|
||||
Within a segment, all documents are identified by a DocId that ranges within `[0, max_doc)`.
|
||||
where max doc is the number of documents in the segment, (deleted or not). Having such a compact `DocId` space is key to the compression of our datastructures.
|
||||
where `max_doc` is the number of documents in the segment (deleted or not). Having such a compact `DocId` space is key to the compression of our data structures.
|
||||
|
||||
The DocIds are simply allocated in the order documents are added to the index.
|
||||
|
||||
### Merges
|
||||
|
||||
In separate threads, tantivy's index writer searches for opportunities to merge segments.
|
||||
The point of segment merges is to:
|
||||
The point of segment merge is to:
|
||||
- eventually get rid of tombstoned documents
|
||||
- reduce the otherwise evergrowing number of segments.
|
||||
- reduce the otherwise ever-growing number of segments.
|
||||
|
||||
Indeed, while having several segments instead of one does not hurt search too much, having hundreds can have a measurable impact on the search performance.
|
||||
|
||||
@@ -99,7 +99,7 @@ but users can extend tantivy with their own implementation.
|
||||
|
||||
## [schema/](src/schema): What are documents?
|
||||
|
||||
Tantivy's document follow a very strict schema , decided before building any index.
|
||||
Tantivy's document follows a very strict schema, decided before building any index.
|
||||
|
||||
The schema defines all of the fields that the index's [`Document`](src/schema/document.rs) may and should contain, their types (`text`, `i64`, `u64`, `Date`, ...) as well as how they should be indexed / represented in tantivy.
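A minimal sketch of declaring such a schema, assuming the usual schema-builder API and flags (`TEXT`, `STORED`, `FAST`); the field names are made up for illustration:

```rust
use tantivy::schema::{Schema, FAST, STORED, TEXT};

fn build_schema() -> Schema {
    let mut schema_builder = Schema::builder();
    // Tokenized, indexed and stored: searchable and retrievable from the doc store.
    schema_builder.add_text_field("title", TEXT | STORED);
    // Fast field: column-oriented u64 values, cheap to read during scoring/collection.
    schema_builder.add_u64_field("price", FAST);
    schema_builder.build()
}
```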
|
||||
|
||||
@@ -108,25 +108,25 @@ Depending on the type of the field, you can decide to
|
||||
- store it as a fast field
|
||||
- index it
|
||||
|
||||
Practically, tantivy will push values associated to this type to up to 3 respective
|
||||
datastructures.
|
||||
Practically, tantivy will push values associated with this type to up to 3 respective
|
||||
data structures.
|
||||
|
||||
*Limitations*
|
||||
|
||||
As of today, tantivy's schema impose a 1:1 relationship between a field that is being ingested and a field represented in the search index. In sophisticated search application, it is fairly common to want to index a field twice using different tokenizers, or to index the concatenation of several fields together into one field.
|
||||
As of today, tantivy's schema imposes a 1:1 relationship between a field that is being ingested and a field represented in the search index. In sophisticated search applications, it is fairly common to want to index a field twice using different tokenizers, or to index the concatenation of several fields together into one field.
|
||||
|
||||
This is not something tantivy supports, and it is up to the user to duplicate fields / concatenate fields before feeding them to tantivy.
|
||||
|
||||
## General information about these datastructures.
|
||||
## General information about these data structures.
|
||||
|
||||
All datastructures in tantivy, have:
|
||||
All data structures in tantivy have:
|
||||
- a writer
|
||||
- a serializer
|
||||
- a reader
|
||||
|
||||
The writer builds a in-memory representation of a batch of documents. This representation is not searchable. It is just meant as intermediary mutable representation, to which we can sequentially add
|
||||
The writer builds an in-memory representation of a batch of documents. This representation is not searchable. It is just meant as an intermediary mutable representation, to which we can sequentially add
|
||||
the documents of a batch. At the end of the batch (or if a memory limit is reached), this representation
|
||||
is then converted in an on-disk immutable representation, that is extremely compact.
|
||||
is then converted into an on-disk immutable representation that is extremely compact.
|
||||
This conversion is done by the serializer.
|
||||
|
||||
Finally, the reader is in charge of offering an API to read on this on-disk read-only representation.
|
||||
@@ -134,8 +134,8 @@ In tantivy, readers are designed to require very little anonymous memory. The da
|
||||
|
||||
## [store/](src/store): Here is my DocId, Gimme my document!
|
||||
|
||||
The docstore is a row-oriented storage that, for each documents, stores a subset of the fields
|
||||
that are marked as stored in the schema. The docstore is compressed using a general purpose algorithm
|
||||
The docstore is a row-oriented storage that, for each document, stores a subset of the fields
|
||||
that are marked as stored in the schema. The docstore is compressed using a general-purpose algorithm
|
||||
like LZ4.
|
||||
|
||||
**Useful for**
|
||||
@@ -146,7 +146,7 @@ Once the top 10 documents have been identified, we fetch them from the store, an
|
||||
**Not useful for**
|
||||
|
||||
Fetching a document from the store is typically a "slow" operation (see the sketch after this list). It usually consists of:
|
||||
- searching into a compact tree-like datastructure to find the position of the right block.
|
||||
- searching into a compact tree-like data structure to find the position of the right block.
|
||||
- decompressing a small block
|
||||
- returning the document from this block.
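The calling side referred to in the list above typically looks like this sketch, assuming the standard `Searcher::doc` API (note that `DocAddress` itself changed shape around 0.15, see the changelog further down):

```rust
use tantivy::{DocAddress, Searcher};

/// Fetch the stored fields of a few matched documents.
/// This is the "slow" path described above: a lookup in the skip structure
/// plus one block decompression per (cold) block.
fn print_stored_docs(searcher: &Searcher, addresses: &[DocAddress]) -> tantivy::Result<()> {
    for address in addresses.iter().cloned() {
        let retrieved = searcher.doc(address)?;
        println!("{}", searcher.schema().to_json(&retrieved));
    }
    Ok(())
}
```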
|
||||
|
||||
@@ -180,10 +180,10 @@ all of the documents matching a query (aka [DocSet](src/docset.rs)).
|
||||
|
||||
For instance, one could define a function to combine upvotes with tantivy's internal relevancy score.
|
||||
This can be done by fetching a fast field during scoring.
|
||||
Once could also compute the mean price of the items matching a query in an e-commerce website.
|
||||
One could also compute the mean price of the items matching a query in an e-commerce website.
|
||||
This can be done by fetching a fast field in a collector.
|
||||
Finally one could decide to post filter a docset to remove docset with a price within a specific range.
|
||||
If the ratio of filtered out documents is not too low, an efficient way to do this is to fetch the price, and apply the filter on the collector side.
|
||||
Finally, one could decide to post-filter a docset to remove documents with a price within a specific range.
|
||||
If the ratio of filtered out documents is not too low, an efficient way to do this is to fetch the price and apply the filter on the collector side.
|
||||
|
||||
Aside from integer values, it is also possible to store an actual byte payload.
|
||||
For an advanced search engine, it is possible to store all of the features required for learning-to-rank in a byte payload, access it during search, and apply the learning-to-rank model.
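A deliberately library-agnostic sketch of why these patterns are cheap: a fast field behaves like a dense per-segment column indexed by `DocId`, so aggregating over matched documents is a tight loop with no doc-store decompression (the `prices` slice stands in for a real u64 fast field reader):

```rust
type DocId = u32;

/// Mean price over the documents matching a query, fast-field style.
fn mean_price(matching_docs: &[DocId], prices: &[u64]) -> Option<f64> {
    if matching_docs.is_empty() {
        return None;
    }
    let total: u64 = matching_docs.iter().map(|&doc| prices[doc as usize]).sum();
    Some(total as f64 / matching_docs.len() as f64)
}
```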
|
||||
@@ -193,24 +193,22 @@ Finally facets are a specific kind of fast field, and the associated source code
|
||||
# The inverted search index.
|
||||
|
||||
The inverted index is the core part of full-text search.
|
||||
When presented a new document with the text field "Hello, happy tax payer!", tantivy breaks it into a list of so-called token. In addition to just splitting this strings into tokens, it might also do different kind of operations like dropping the punctuation, converting the character to lowercase, apply stemming etc. Tantivy makes it possible to configure the operations to be applied in the schema
|
||||
(tokenizer/ is the place where these operations are implemented).
|
||||
When presented a new document with the text field "Hello, happy tax payer!", tantivy breaks it into a list of so-called tokens. In addition to just splitting these strings into tokens, it might also do different kinds of operations like dropping the punctuation, converting the character to lowercase, apply stemming, etc. Tantivy makes it possible to configure the operations to be applied in the schema (tokenizer/ is the place where these operations are implemented).
|
||||
|
||||
For instance, the default tokenizer of tantivy would break our text into: `[hello, happy, tax, payer]`.
|
||||
The document will therefore be registered in the inverted index as containing the terms
|
||||
`[text:hello, text:happy, text:tax, text:payer]`.
|
||||
|
||||
The role of the inverted index is, when given a term, supply us with a very fast iterator over
|
||||
the sorted doc ids that match the term.
|
||||
The role of the inverted index is, when given a term, to give us a very fast iterator over the sorted doc ids that match the term.
|
||||
|
||||
Such an iterator is called a posting list. In addition to giving us `DocId`, they can also give us optionally to the number of occurrence of the term for each document, also called term frequency or TF.
|
||||
Such an iterator is called a posting list. In addition to giving us `DocId`, they can also optionally give us the number of occurrences of the term for each document, also called term frequency or TF.
|
||||
|
||||
These iterators being sorted by DocId, one can create an iterator over the documents containing `text:tax AND text:payer`, `(text:tax AND text:payer) OR (text:contribuable)` or any boolean expression.
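Since both iterators yield doc ids in increasing order, an `AND` is essentially a linear merge, as in the sketch below (real scorers additionally use skip information and block-level optimizations):

```rust
type DocId = u32;

/// Intersection of two sorted posting lists, i.e. `text:tax AND text:payer`.
fn intersect(left: &[DocId], right: &[DocId]) -> Vec<DocId> {
    let (mut i, mut j) = (0usize, 0usize);
    let mut out = Vec::new();
    while i < left.len() && j < right.len() {
        match left[i].cmp(&right[j]) {
            std::cmp::Ordering::Less => i += 1,
            std::cmp::Ordering::Greater => j += 1,
            std::cmp::Ordering::Equal => {
                out.push(left[i]);
                i += 1;
                j += 1;
            }
        }
    }
    out
}
```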
|
||||
|
||||
In order to represent the function
|
||||
```Term ⟶ Posting```
|
||||
|
||||
The inverted index actually consists of two datastructures chained together.
|
||||
The inverted index actually consists of two data structures chained together.
|
||||
|
||||
- [Term](src/schema/term.rs) ⟶ [TermInfo](src/postings/term_info.rs) is addressed by the term dictionary.
|
||||
- [TermInfo](src/postings/term_info.rs) ⟶ [Posting](src/postings/postings.rs) is addressed by the posting lists.
|
||||
@@ -224,7 +222,7 @@ Tantivy's term dictionary is mainly in charge of supplying the function
|
||||
|
||||
[Term](src/schema/term.rs) ⟶ [TermInfo](src/postings/term_info.rs)
|
||||
|
||||
It is itself is broken into two parts.
|
||||
It is itself broken into two parts.
|
||||
- [Term](src/schema/term.rs) ⟶ [TermOrdinal](src/termdict/mod.rs) is addressed by a finite state transducer, implemented by the fst crate.
|
||||
- [TermOrdinal](src/termdict/mod.rs) ⟶ [TermInfo](src/postings/term_info.rs) is addressed by the term info store.
|
||||
|
||||
@@ -234,7 +232,7 @@ It is itself is broken into two parts.
|
||||
A posting list makes it possible to store a sorted list of doc ids and for each doc store
|
||||
a term frequency as well.
|
||||
|
||||
The posting list are stored in a separate file. The [TermInfo](src/postings/term_info.rs) contains an offset into that file and a number of documents for the given posting list. Both are required and sufficient to read the posting list.
|
||||
The posting lists are stored in a separate file. The [TermInfo](src/postings/term_info.rs) contains an offset into that file and a number of documents for the given posting list. Both are required and sufficient to read the posting list.
|
||||
|
||||
The posting list is organized in blocks of 128 documents.
|
||||
One block of doc ids is followed by one block of term frequencies.
|
||||
@@ -242,11 +240,11 @@ One block of doc ids is followed by one block of term frequencies.
|
||||
The doc ids are delta encoded and bitpacked.
|
||||
The term frequencies are bitpacked.
|
||||
|
||||
Because the number of docs is rarely a multiple of 128, the last block my contain an arbitrary number of docs between 1 and 127 documents. We then use variable int encoding instead of bitpacking.
|
||||
Because the number of docs is rarely a multiple of 128, the last block may contain an arbitrary number of docs between 1 and 127 documents. We then use variable int encoding instead of bitpacking.
|
||||
|
||||
## [positions/](src/positions): Where are my terms within the documents?
|
||||
|
||||
Phrase queries make it possible to search for documents containing a specific sequence of document.
|
||||
Phrase queries make it possible to search for documents containing a specific sequence of terms.
|
||||
For instance, the phrase query "the art of war" does not match "the war of art".
|
||||
To make this possible, one can specify in the schema that a field should store positions in addition to being indexed.
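A sketch of the extra check positions make possible: once `art` and `war` are known to co-occur in a document, the phrase matches only if some position of `war` is exactly one past a position of `art`:

```rust
/// True if `second` occurs at some position immediately after `first`.
/// Both slices are the sorted within-document positions of each term.
fn is_adjacent(first: &[u32], second: &[u32]) -> bool {
    first
        .iter()
        .any(|&position| second.binary_search(&(position + 1)).is_ok())
}
```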
|
||||
|
||||
@@ -257,7 +255,7 @@ we advance the position reader by the number of term frequencies of the current
|
||||
## [fieldnorms/](src/fieldnorms): Here is my doc, how many tokens in this field?
|
||||
|
||||
The [BM25](https://en.wikipedia.org/wiki/Okapi_BM25) formula also requires knowing the number of tokens stored in a specific field for a given document. We store this information in one byte per document in the fieldnorm.
|
||||
The fieldnorm is therefore compressed. Values up to 40 are encoded unchanged. There is then a logarithmic mapping that
|
||||
The fieldnorm is therefore compressed. Values up to 40 are encoded unchanged.
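An illustrative one-byte encoding in that spirit (this is not tantivy's exact mapping table): lengths up to 40 are kept exact, larger lengths fall into coarser, roughly logarithmic buckets.

```rust
/// Compress a field length into one byte: exact up to 40 tokens, then
/// coarse logarithmic buckets. Purely illustrative, not tantivy's table.
fn encode_fieldnorm(num_tokens: u32) -> u8 {
    if num_tokens <= 40 {
        num_tokens as u8
    } else {
        // One bucket per ~12% of growth above 40 tokens, clamped to the u8 range.
        let bucket = 40.0 + (num_tokens as f64 / 40.0).log(1.12);
        bucket.min(255.0) as u8
    }
}
```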
|
||||
|
||||
|
||||
## [tokenizer/](src/tokenizer): How should we process text?
|
||||
@@ -274,24 +272,24 @@ Tantivy's comes with few tokenizers, but external crates are offering advanced t
|
||||
## [query/](src/query): Define and compose queries
|
||||
|
||||
The [Query](src/query/query.rs) trait defines what a query is.
|
||||
Due to the necessity for some query to compute some statistics over the entire index, and because the
|
||||
index is composed of several `SegmentReader`, the path from transforming a `Query` to a iterator over document is slightly convoluted, but fundamentally, this is what a Query is.
|
||||
Due to the necessity for some queries to compute some statistics over the entire index, and because the
|
||||
index is composed of several `SegmentReader`, the path from transforming a `Query` to an iterator over documents is slightly convoluted, but fundamentally, this is what a Query is.
|
||||
|
||||
The iterator over documents comes with some scoring function. The resulting trait is called a
|
||||
[Scorer](src/query/scorer.rs) and is specific to a segment.
|
||||
|
||||
Different queries can be combined using the [BooleanQuery](src/query/boolean_query/).
|
||||
Tantivy comes with different types of queries, and can be extended by implementing
|
||||
the Query`, `Weight` and `Scorer` traits.
|
||||
Tantivy comes with different types of queries and can be extended by implementing
|
||||
the `Query`, `Weight`, and `Scorer` traits.
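End to end, these pieces are usually driven as in the sketch below, assuming the long-standing `QueryParser` / `TopDocs` API (index creation, schema setup and most error handling are elided):

```rust
use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::Field;
use tantivy::Index;

fn search_top_10(index: &Index, default_field: Field, user_query: &str) -> tantivy::Result<()> {
    let reader = index.reader()?;
    let searcher = reader.searcher();
    // The parser turns the user string into a `Query` (a tree of clauses).
    let query_parser = QueryParser::for_index(index, vec![default_field]);
    let query = query_parser.parse_query(user_query).expect("query syntax error");
    // The collector consumes the per-segment `Scorer`s produced by the query.
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
    for (score, doc_address) in top_docs {
        println!("{} {:?}", score, doc_address);
    }
    Ok(())
}
```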
|
||||
|
||||
## [collector](src/collector): Define what to do with matched documents
|
||||
|
||||
Collectors define how to aggregate the documents matching a query, in the broadest sense possible.
|
||||
The search will push matched document one by one, calling their
|
||||
The search will push matched documents one by one, calling their
|
||||
`fn collect(doc: DocId, score: Score);` method.
|
||||
|
||||
Users may implement their own collectors by implementing the [Collector](src/collector/mod.rs) trait.
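Conceptually a collector is just some state plus the `collect` callback shown above; a library-agnostic sketch of the simplest possible one (the real trait additionally splits the work into per-segment collectors and a merge step):

```rust
type DocId = u32;
type Score = f32;

/// Count the matching documents, O(1) memory, no scores kept.
#[derive(Default)]
struct CountCollector {
    count: usize,
}

impl CountCollector {
    fn collect(&mut self, _doc: DocId, _score: Score) {
        self.count += 1;
    }
}
```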
|
||||
|
||||
## [query-grammar](query-grammar): Defines the grammar of the query parser
|
||||
|
||||
While the [QueryParser](src/query/query_parser/query_parser.rs) struct is located in the `query/` directory, the actual parser combinator used to convert user queries into an AST is in an external crate called `query-grammar`. This part was externalize to lighten the work of the compiler.
|
||||
While the [QueryParser](src/query/query_parser/query_parser.rs) struct is located in the `query/` directory, the actual parser combinator used to convert user queries into an AST is in an external crate called `query-grammar`. This part was externalized to lighten the work of the compiler.
|
||||
|
||||
@@ -7,6 +7,12 @@ Tantivy 0.15.0
|
||||
- DocAddress is now a struct (@scampi) #987
|
||||
- Bugfix consistent tie break handling in facet's topk (@hardikpnsp) #357
|
||||
- Date field support for range queries (@rihardsk) #516
|
||||
- Added lz4-flex as the default compression scheme in tantivy (@PSeitz) #1009
|
||||
- Renamed a lot of symbols to avoid all uppercasing on acronyms, as per new clippy recommendation. For instance, RAMDirectory -> RamDirectory. (@pmasurel)
|
||||
- Simplified positions index format (@fulmicoton) #1022
|
||||
- Moved bitpacking to bitpacker subcrate and add BlockedBitpacker, which bitpacks blocks of 128 elements (@PSeitz) #1030
|
||||
- Added support for more-like-this query in tantivy (@evanxg852000) #1011
|
||||
- Added support for sorting an index, e.g. presorting documents in an index by a timestamp field. This can heavily improve performance for certain scenarios by utilizing the sorted data (top-n optimizations). #1026
|
||||
|
||||
Tantivy 0.14.0
|
||||
=========================
|
||||
@@ -22,6 +28,7 @@ Tantivy 0.14.0
|
||||
- Simplified the encoding of the skip reader struct. BlockWAND max tf is now encoded over a single byte. (@fulmicoton)
|
||||
- `FilterCollector` now supports all Fast Field value types (@barrotsteindev)
|
||||
- FastField are not all loaded when opening the segment reader. (@fulmicoton)
|
||||
- Added an API to merge segments, see `tantivy::merge_segments` #1005. (@evanxg852000)
|
||||
|
||||
This version breaks compatibility and requires users to reindex everything.
|
||||
|
||||
|
||||
Cargo.toml (69 changed lines)
@@ -14,51 +14,54 @@ edition = "2018"
|
||||
|
||||
[dependencies]
|
||||
base64 = "0.13"
|
||||
byteorder = "1"
|
||||
crc32fast = "1"
|
||||
once_cell = "1"
|
||||
regex ={version = "1", default-features = false, features = ["std"]}
|
||||
byteorder = "1.4.3"
|
||||
crc32fast = "1.2.1"
|
||||
once_cell = "1.7.2"
|
||||
regex ={ version = "1.5.4", default-features = false, features = ["std"] }
|
||||
tantivy-fst = "0.3"
|
||||
memmap = {version = "0.7", optional=true}
|
||||
lz4 = {version="1", optional=true}
|
||||
brotli = {version="3.3.0", optional=true}
|
||||
snap = "1"
|
||||
tempfile = {version="3", optional=true}
|
||||
log = "0.4"
|
||||
serde = {version="1", features=["derive"]}
|
||||
serde_json = "1"
|
||||
num_cpus = "1"
|
||||
fs2={version="0.4", optional=true}
|
||||
lz4_flex = { version = "0.7.5", default-features = false, features = ["checked-decode"], optional = true }
|
||||
lz4 = { version = "1.23.2", optional = true }
|
||||
brotli = { version = "3.3", optional = true }
|
||||
snap = { version = "1.0.5", optional = true }
|
||||
tempfile = { version = "3.2", optional = true }
|
||||
log = "0.4.14"
|
||||
serde = { version = "1.0.126", features = ["derive"] }
|
||||
serde_json = "1.0.64"
|
||||
num_cpus = "1.13"
|
||||
fs2={ version = "0.4.3", optional = true }
|
||||
levenshtein_automata = "0.2"
|
||||
uuid = { version = "0.8", features = ["v4", "serde"] }
|
||||
uuid = { version = "0.8.2", features = ["v4", "serde"] }
|
||||
crossbeam = "0.8"
|
||||
futures = {version = "0.3", features=["thread-pool"] }
|
||||
futures = { version = "0.3.15", features = ["thread-pool"] }
|
||||
tantivy-query-grammar = { version="0.14.0", path="./query-grammar" }
|
||||
stable_deref_trait = "1"
|
||||
rust-stemmers = "1"
|
||||
downcast-rs = "1"
|
||||
bitpacking = {version="0.8", default-features = false, features=["bitpacker4x"]}
|
||||
tantivy-bitpacker = { version="0.1", path="./bitpacker" }
|
||||
stable_deref_trait = "1.2"
|
||||
rust-stemmers = "1.2"
|
||||
downcast-rs = "1.2"
|
||||
bitpacking = { version = "0.8.4", default-features = false, features = ["bitpacker4x"] }
|
||||
census = "0.4"
|
||||
fnv = "1"
|
||||
thiserror = "1.0"
|
||||
htmlescape = "0.3"
|
||||
fnv = "1.0.7"
|
||||
thiserror = "1.0.24"
|
||||
htmlescape = "0.3.1"
|
||||
fail = "0.4"
|
||||
murmurhash32 = "0.2"
|
||||
chrono = "0.4"
|
||||
smallvec = "1"
|
||||
rayon = "1"
|
||||
lru = "0.6"
|
||||
chrono = "0.4.19"
|
||||
smallvec = "1.6.1"
|
||||
rayon = "1.5"
|
||||
lru = "0.6.5"
|
||||
fastdivide = "0.3"
|
||||
itertools = "0.10.0"
|
||||
|
||||
[target.'cfg(windows)'.dependencies]
|
||||
winapi = "0.3"
|
||||
winapi = "0.3.9"
|
||||
|
||||
[dev-dependencies]
|
||||
rand = "0.8"
|
||||
maplit = "1"
|
||||
rand = "0.8.3"
|
||||
maplit = "1.0.2"
|
||||
matches = "0.1.8"
|
||||
proptest = "1.0"
|
||||
criterion = "0.3"
|
||||
criterion = "0.3.4"
|
||||
|
||||
[dev-dependencies.fail]
|
||||
version = "0.4"
|
||||
@@ -74,16 +77,18 @@ debug-assertions = true
|
||||
overflow-checks = true
|
||||
|
||||
[features]
|
||||
default = ["mmap"]
|
||||
default = ["mmap", "lz4-block-compression" ]
|
||||
mmap = ["fs2", "tempfile", "memmap"]
|
||||
brotli-compression = ["brotli"]
|
||||
lz4-compression = ["lz4"]
|
||||
lz4-block-compression = ["lz4_flex"]
|
||||
snappy-compression = ["snap"]
|
||||
failpoints = ["fail/failpoints"]
|
||||
unstable = [] # useful for benches.
|
||||
wasm-bindgen = ["uuid/wasm-bindgen"]
|
||||
|
||||
[workspace]
|
||||
members = ["query-grammar"]
|
||||
members = ["query-grammar", "bitpacker"]
|
||||
|
||||
[badges]
|
||||
travis-ci = { repository = "tantivy-search/tantivy" }
|
||||
|
||||
@@ -38,7 +38,7 @@ Your mileage WILL vary depending on the nature of queries and their load.
|
||||
# Features
|
||||
|
||||
- Full-text search
|
||||
- Configurable tokenizer (stemming available for 17 Latin languages with third party support for Chinese ([tantivy-jieba](https://crates.io/crates/tantivy-jieba) and [cang-jie](https://crates.io/crates/cang-jie)), Japanese ([lindera](https://github.com/lindera-morphology/lindera-tantivy) and [tantivy-tokenizer-tiny-segmente](https://crates.io/crates/tantivy-tokenizer-tiny-segmenter)) and Korean ([lindera](https://github.com/lindera-morphology/lindera-tantivy) + [lindera-ko-dic-builder](https://github.com/lindera-morphology/lindera-ko-dic-builder))
|
||||
- Configurable tokenizer (stemming available for 17 Latin languages with third party support for Chinese ([tantivy-jieba](https://crates.io/crates/tantivy-jieba) and [cang-jie](https://crates.io/crates/cang-jie)), Japanese ([lindera](https://github.com/lindera-morphology/lindera-tantivy) and [tantivy-tokenizer-tiny-segmenter](https://crates.io/crates/tantivy-tokenizer-tiny-segmenter)) and Korean ([lindera](https://github.com/lindera-morphology/lindera-tantivy) + [lindera-ko-dic-builder](https://github.com/lindera-morphology/lindera-ko-dic-builder))
|
||||
- Fast (check out the :racehorse: :sparkles: [benchmark](https://tantivy-search.github.io/bench/) :sparkles: :racehorse:)
|
||||
- Tiny startup time (<10ms), perfect for command line tools
|
||||
- BM25 scoring (the same as Lucene)
|
||||
|
||||
@@ -18,5 +18,5 @@ install:
|
||||
build: false
|
||||
|
||||
test_script:
|
||||
- REM SET RUST_LOG=tantivy,test & cargo test --all --verbose --no-default-features --features mmap
|
||||
- REM SET RUST_LOG=tantivy,test & cargo test --all --verbose --no-default-features --features lz4-block-compression --features mmap
|
||||
- REM SET RUST_BACKTRACE=1 & cargo build --examples
|
||||
|
||||
bitpacker/Cargo.toml (new file, 8 lines)
@@ -0,0 +1,8 @@
[package]
name = "tantivy-bitpacker"
version = "0.1.0"
edition = "2018"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
bitpacker/benches/bench.rs (new file, 33 lines)
@@ -0,0 +1,33 @@
|
||||
#![feature(test)]
|
||||
|
||||
extern crate test;
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use tantivy_bitpacker::BlockedBitpacker;
|
||||
use test::Bencher;
|
||||
#[bench]
|
||||
fn bench_blockedbitp_read(b: &mut Bencher) {
|
||||
let mut blocked_bitpacker = BlockedBitpacker::new();
|
||||
for val in 0..=21500 {
|
||||
blocked_bitpacker.add(val * val);
|
||||
}
|
||||
b.iter(|| {
|
||||
let mut out = 0;
|
||||
for val in 0..=21500 {
|
||||
out = blocked_bitpacker.get(val);
|
||||
}
|
||||
out
|
||||
});
|
||||
}
|
||||
#[bench]
|
||||
fn bench_blockedbitp_create(b: &mut Bencher) {
|
||||
b.iter(|| {
|
||||
let mut blocked_bitpacker = BlockedBitpacker::new();
|
||||
for val in 0..=21500 {
|
||||
blocked_bitpacker.add(val * val);
|
||||
}
|
||||
blocked_bitpacker
|
||||
});
|
||||
}
|
||||
}
|
||||
@@ -1,13 +1,14 @@
|
||||
use byteorder::{ByteOrder, LittleEndian, WriteBytesExt};
|
||||
use std::io;
|
||||
use std::{convert::TryInto, io};
|
||||
|
||||
use crate::directory::OwnedBytes;
|
||||
|
||||
pub(crate) struct BitPacker {
|
||||
pub struct BitPacker {
|
||||
mini_buffer: u64,
|
||||
mini_buffer_written: usize,
|
||||
}
|
||||
|
||||
impl Default for BitPacker {
|
||||
fn default() -> Self {
|
||||
BitPacker::new()
|
||||
}
|
||||
}
|
||||
impl BitPacker {
|
||||
pub fn new() -> BitPacker {
|
||||
BitPacker {
|
||||
@@ -26,14 +27,14 @@ impl BitPacker {
|
||||
let num_bits = num_bits as usize;
|
||||
if self.mini_buffer_written + num_bits > 64 {
|
||||
self.mini_buffer |= val_u64.wrapping_shl(self.mini_buffer_written as u32);
|
||||
output.write_u64::<LittleEndian>(self.mini_buffer)?;
|
||||
output.write_all(self.mini_buffer.to_le_bytes().as_ref())?;
|
||||
self.mini_buffer = val_u64.wrapping_shr((64 - self.mini_buffer_written) as u32);
|
||||
self.mini_buffer_written = self.mini_buffer_written + num_bits - 64;
|
||||
} else {
|
||||
self.mini_buffer |= val_u64 << self.mini_buffer_written;
|
||||
self.mini_buffer_written += num_bits;
|
||||
if self.mini_buffer_written == 64 {
|
||||
output.write_u64::<LittleEndian>(self.mini_buffer)?;
|
||||
output.write_all(self.mini_buffer.to_le_bytes().as_ref())?;
|
||||
self.mini_buffer_written = 0;
|
||||
self.mini_buffer = 0u64;
|
||||
}
|
||||
@@ -44,9 +45,8 @@ impl BitPacker {
|
||||
pub fn flush<TWrite: io::Write>(&mut self, output: &mut TWrite) -> io::Result<()> {
|
||||
if self.mini_buffer_written > 0 {
|
||||
let num_bytes = (self.mini_buffer_written + 7) / 8;
|
||||
let mut arr: [u8; 8] = [0u8; 8];
|
||||
LittleEndian::write_u64(&mut arr, self.mini_buffer);
|
||||
output.write_all(&arr[..num_bytes])?;
|
||||
let bytes = self.mini_buffer.to_le_bytes();
|
||||
output.write_all(&bytes[..num_bytes])?;
|
||||
self.mini_buffer_written = 0;
|
||||
}
|
||||
Ok(())
|
||||
@@ -64,11 +64,10 @@ impl BitPacker {
|
||||
pub struct BitUnpacker {
|
||||
num_bits: u64,
|
||||
mask: u64,
|
||||
data: OwnedBytes,
|
||||
}
|
||||
|
||||
impl BitUnpacker {
|
||||
pub fn new(data: OwnedBytes, num_bits: u8) -> BitUnpacker {
|
||||
pub fn new(num_bits: u8) -> BitUnpacker {
|
||||
let mask: u64 = if num_bits == 64 {
|
||||
!0u64
|
||||
} else {
|
||||
@@ -77,15 +76,13 @@ impl BitUnpacker {
|
||||
BitUnpacker {
|
||||
num_bits: u64::from(num_bits),
|
||||
mask,
|
||||
data,
|
||||
}
|
||||
}
|
||||
|
||||
pub fn get(&self, idx: u64) -> u64 {
|
||||
pub fn get(&self, idx: u64, data: &[u8]) -> u64 {
|
||||
if self.num_bits == 0 {
|
||||
return 0u64;
|
||||
}
|
||||
let data: &[u8] = self.data.as_slice();
|
||||
let num_bits = self.num_bits;
|
||||
let mask = self.mask;
|
||||
let addr_in_bits = idx * num_bits;
|
||||
@@ -95,7 +92,10 @@ impl BitUnpacker {
|
||||
addr + 8 <= data.len() as u64,
|
||||
"The fast field field should have been padded with 7 bytes."
|
||||
);
|
||||
let val_unshifted_unmasked: u64 = LittleEndian::read_u64(&data[(addr as usize)..]);
|
||||
let bytes: [u8; 8] = (&data[(addr as usize)..(addr as usize) + 8])
|
||||
.try_into()
|
||||
.unwrap();
|
||||
let val_unshifted_unmasked: u64 = u64::from_le_bytes(bytes);
|
||||
let val_shifted = (val_unshifted_unmasked >> bit_shift) as u64;
|
||||
val_shifted & mask
|
||||
}
|
||||
@@ -104,9 +104,8 @@ impl BitUnpacker {
|
||||
#[cfg(test)]
|
||||
mod test {
|
||||
use super::{BitPacker, BitUnpacker};
|
||||
use crate::directory::OwnedBytes;
|
||||
|
||||
fn create_fastfield_bitpacker(len: usize, num_bits: u8) -> (BitUnpacker, Vec<u64>) {
|
||||
fn create_fastfield_bitpacker(len: usize, num_bits: u8) -> (BitUnpacker, Vec<u64>, Vec<u8>) {
|
||||
let mut data = Vec::new();
|
||||
let mut bitpacker = BitPacker::new();
|
||||
let max_val: u64 = (1u64 << num_bits as u64) - 1u64;
|
||||
@@ -118,14 +117,14 @@ mod test {
|
||||
}
|
||||
bitpacker.close(&mut data).unwrap();
|
||||
assert_eq!(data.len(), ((num_bits as usize) * len + 7) / 8 + 7);
|
||||
let bitunpacker = BitUnpacker::new(OwnedBytes::new(data), num_bits);
|
||||
(bitunpacker, vals)
|
||||
let bitunpacker = BitUnpacker::new(num_bits);
|
||||
(bitunpacker, vals, data)
|
||||
}
|
||||
|
||||
fn test_bitpacker_util(len: usize, num_bits: u8) {
|
||||
let (bitunpacker, vals) = create_fastfield_bitpacker(len, num_bits);
|
||||
let (bitunpacker, vals, data) = create_fastfield_bitpacker(len, num_bits);
|
||||
for (i, val) in vals.iter().enumerate() {
|
||||
assert_eq!(bitunpacker.get(i as u64), *val);
|
||||
assert_eq!(bitunpacker.get(i as u64, &data), *val);
|
||||
}
|
||||
}
|
||||
|
||||
bitpacker/src/blocked_bitpacker.rs (new file, 176 lines)
@@ -0,0 +1,176 @@
|
||||
use crate::{minmax, BitUnpacker};
|
||||
|
||||
use super::{bitpacker::BitPacker, compute_num_bits};
|
||||
|
||||
const BLOCK_SIZE: usize = 128;
|
||||
|
||||
/// `BlockedBitpacker` compresses data in blocks of
|
||||
/// 128 elements, while keeping an index on it
|
||||
///
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct BlockedBitpacker {
|
||||
// bitpacked blocks
|
||||
compressed_blocks: Vec<u8>,
|
||||
// uncompressed data, collected until BLOCK_SIZE
|
||||
buffer: Vec<u64>,
|
||||
offset_and_bits: Vec<BlockedBitpackerEntryMetaData>,
|
||||
}
|
||||
impl Default for BlockedBitpacker {
|
||||
fn default() -> Self {
|
||||
BlockedBitpacker::new()
|
||||
}
|
||||
}
|
||||
|
||||
/// `BlockedBitpackerEntryMetaData` encodes the
|
||||
/// offset and bit_width into a u64 bit field
|
||||
///
|
||||
/// This saves some space, since 7 bytes is more
|
||||
/// than enough and also keeps the access fast
|
||||
/// because of alignment
|
||||
#[derive(Debug, Clone, Default)]
|
||||
struct BlockedBitpackerEntryMetaData {
|
||||
encoded: u64,
|
||||
base_value: u64,
|
||||
}
|
||||
|
||||
impl BlockedBitpackerEntryMetaData {
|
||||
fn new(offset: u64, num_bits: u8, base_value: u64) -> Self {
|
||||
let encoded = offset | (num_bits as u64) << (64 - 8);
|
||||
Self {
|
||||
encoded,
|
||||
base_value,
|
||||
}
|
||||
}
|
||||
fn offset(&self) -> u64 {
|
||||
(self.encoded << 8) >> 8
|
||||
}
|
||||
fn num_bits(&self) -> u8 {
|
||||
(self.encoded >> 56) as u8
|
||||
}
|
||||
fn base_value(&self) -> u64 {
|
||||
self.base_value
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn metadata_test() {
|
||||
let meta = BlockedBitpackerEntryMetaData::new(50000, 6, 40000);
|
||||
assert_eq!(meta.offset(), 50000);
|
||||
assert_eq!(meta.num_bits(), 6);
|
||||
}
|
||||
|
||||
impl BlockedBitpacker {
|
||||
pub fn new() -> Self {
|
||||
let mut compressed_blocks = vec![];
|
||||
compressed_blocks.resize(8, 0);
|
||||
Self {
|
||||
compressed_blocks,
|
||||
buffer: vec![],
|
||||
offset_and_bits: vec![],
|
||||
}
|
||||
}
|
||||
|
||||
/// The memory used (including children)
|
||||
pub fn mem_usage(&self) -> usize {
|
||||
std::mem::size_of::<BlockedBitpacker>()
|
||||
+ self.compressed_blocks.capacity()
|
||||
+ self.offset_and_bits.capacity()
|
||||
* std::mem::size_of_val(&self.offset_and_bits.get(0).cloned().unwrap_or_default())
|
||||
+ self.buffer.capacity()
|
||||
* std::mem::size_of_val(&self.buffer.get(0).cloned().unwrap_or_default())
|
||||
}
|
||||
|
||||
pub fn add(&mut self, val: u64) {
|
||||
self.buffer.push(val);
|
||||
if self.buffer.len() == BLOCK_SIZE as usize {
|
||||
self.flush();
|
||||
}
|
||||
}
|
||||
|
||||
pub fn flush(&mut self) {
|
||||
if let Some((min_value, max_value)) = minmax(self.buffer.iter()) {
|
||||
let mut bit_packer = BitPacker::new();
|
||||
let num_bits_block = compute_num_bits(*max_value - min_value);
|
||||
// todo performance: the padding handling could be done better, e.g. use a slice and
|
||||
// return num_bytes written from bitpacker
|
||||
self.compressed_blocks
|
||||
.resize(self.compressed_blocks.len() - 8, 0); // remove padding for bitpacker
|
||||
let offset = self.compressed_blocks.len() as u64;
|
||||
// todo performance: for some bit_width we
|
||||
// can encode multiple vals into the
|
||||
// mini_buffer before checking to flush
|
||||
// (to be done in BitPacker)
|
||||
for val in self.buffer.iter() {
|
||||
bit_packer
|
||||
.write(
|
||||
*val - min_value,
|
||||
num_bits_block,
|
||||
&mut self.compressed_blocks,
|
||||
)
|
||||
.expect("cannot write bitpacking to output"); // write to in memory can't fail
|
||||
}
|
||||
bit_packer.flush(&mut self.compressed_blocks).unwrap();
|
||||
self.offset_and_bits
|
||||
.push(BlockedBitpackerEntryMetaData::new(
|
||||
offset,
|
||||
num_bits_block,
|
||||
*min_value,
|
||||
));
|
||||
|
||||
self.buffer.clear();
|
||||
self.compressed_blocks
|
||||
.resize(self.compressed_blocks.len() + 8, 0); // add padding for bitpacker
|
||||
}
|
||||
}
|
||||
pub fn get(&self, idx: usize) -> u64 {
|
||||
let metadata_pos = idx / BLOCK_SIZE as usize;
|
||||
let pos_in_block = idx % BLOCK_SIZE as usize;
|
||||
if let Some(metadata) = self.offset_and_bits.get(metadata_pos) {
|
||||
let unpacked = BitUnpacker::new(metadata.num_bits()).get(
|
||||
pos_in_block as u64,
|
||||
&self.compressed_blocks[metadata.offset() as usize..],
|
||||
);
|
||||
unpacked + metadata.base_value()
|
||||
} else {
|
||||
self.buffer[pos_in_block]
|
||||
}
|
||||
}
|
||||
|
||||
pub fn iter(&self) -> impl Iterator<Item = u64> + '_ {
|
||||
// todo performance: we could decompress a whole block and cache it instead
|
||||
let bitpacked_elems = self.offset_and_bits.len() * BLOCK_SIZE;
|
||||
let iter = (0..bitpacked_elems)
|
||||
.map(move |idx| self.get(idx))
|
||||
.chain(self.buffer.iter().cloned());
|
||||
iter
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
#[test]
|
||||
fn blocked_bitpacker_empty() {
|
||||
let blocked_bitpacker = BlockedBitpacker::new();
|
||||
assert_eq!(blocked_bitpacker.iter().collect::<Vec<u64>>(), vec![]);
|
||||
}
|
||||
#[test]
|
||||
fn blocked_bitpacker_one() {
|
||||
let mut blocked_bitpacker = BlockedBitpacker::new();
|
||||
blocked_bitpacker.add(50000);
|
||||
assert_eq!(blocked_bitpacker.get(0), 50000);
|
||||
assert_eq!(blocked_bitpacker.iter().collect::<Vec<u64>>(), vec![50000]);
|
||||
}
|
||||
#[test]
|
||||
fn blocked_bitpacker_test() {
|
||||
let mut blocked_bitpacker = BlockedBitpacker::new();
|
||||
for val in 0..21500 {
|
||||
blocked_bitpacker.add(val);
|
||||
}
|
||||
for val in 0..21500 {
|
||||
assert_eq!(blocked_bitpacker.get(val as usize), val);
|
||||
}
|
||||
assert_eq!(blocked_bitpacker.iter().count(), 21500);
|
||||
assert_eq!(blocked_bitpacker.iter().last().unwrap(), 21499);
|
||||
}
|
||||
}
|
||||
bitpacker/src/lib.rs (new file, 52 lines)
@@ -0,0 +1,52 @@
|
||||
mod bitpacker;
|
||||
mod blocked_bitpacker;
|
||||
|
||||
pub use crate::bitpacker::BitPacker;
|
||||
pub use crate::bitpacker::BitUnpacker;
|
||||
pub use crate::blocked_bitpacker::BlockedBitpacker;
|
||||
|
||||
/// Computes the number of bits that will be used for bitpacking.
|
||||
///
|
||||
/// In general the target is the minimum number of bits
|
||||
/// required to express the amplitude given in argument.
|
||||
///
|
||||
/// e.g. If the amplitude is 10, we can store all ints on simply 4bits.
|
||||
///
|
||||
/// The logic is slightly more convoluted here as for optimization
|
||||
/// reasons, we want to ensure that a value spans over at most 8 bytes
|
||||
/// of aligned bytes.
|
||||
///
|
||||
/// Spanning over 9 bytes is possible for instance, if we do
|
||||
/// bitpacking with an amplitude of 63 bits.
|
||||
/// In this case, the second int will start on bit
|
||||
/// 63 (which belongs to byte 7) and ends at byte 15;
|
||||
/// Hence 9 bytes (from byte 7 to byte 15 included).
|
||||
///
|
||||
/// To avoid this, we force the number of bits to 64bits
|
||||
/// when the result is greater than `64-8 = 56 bits`.
|
||||
///
|
||||
/// Note that this only affects rare use cases spanning over
|
||||
/// a very large range of values. Even in this case, it results
|
||||
/// in an extra cost of at most 12% compared to the optimal
|
||||
/// number of bits.
|
||||
pub fn compute_num_bits(n: u64) -> u8 {
|
||||
let amplitude = (64u32 - n.leading_zeros()) as u8;
|
||||
if amplitude <= 64 - 8 {
|
||||
amplitude
|
||||
} else {
|
||||
64
|
||||
}
|
||||
}
|
||||
|
||||
pub fn minmax<I, T>(mut vals: I) -> Option<(T, T)>
|
||||
where
|
||||
I: Iterator<Item = T>,
|
||||
T: Copy + Ord,
|
||||
{
|
||||
if let Some(first_el) = vals.next() {
|
||||
return Some(vals.fold((first_el, first_el), |(min_val, max_val), el| {
|
||||
(min_val.min(el), max_val.max(el))
|
||||
}));
|
||||
}
|
||||
None
|
||||
}
|
||||
@@ -5,11 +5,11 @@ use combine::parser::Parser;
|
||||
|
||||
pub use crate::occur::Occur;
|
||||
use crate::query_grammar::parse_to_ast;
|
||||
pub use crate::user_input_ast::{UserInputAST, UserInputBound, UserInputLeaf, UserInputLiteral};
|
||||
pub use crate::user_input_ast::{UserInputAst, UserInputBound, UserInputLeaf, UserInputLiteral};
|
||||
|
||||
pub struct Error;
|
||||
|
||||
pub fn parse_query(query: &str) -> Result<UserInputAST, Error> {
|
||||
pub fn parse_query(query: &str) -> Result<UserInputAst, Error> {
|
||||
let (user_input_ast, _remaining) = parse_to_ast().parse(query).map_err(|_| Error)?;
|
||||
Ok(user_input_ast)
|
||||
}
|
||||
|
||||
@@ -1,4 +1,4 @@
|
||||
use super::user_input_ast::{UserInputAST, UserInputBound, UserInputLeaf, UserInputLiteral};
|
||||
use super::user_input_ast::{UserInputAst, UserInputBound, UserInputLeaf, UserInputLiteral};
|
||||
use crate::Occur;
|
||||
use combine::parser::char::{char, digit, letter, space, spaces, string};
|
||||
use combine::parser::Parser;
|
||||
@@ -209,21 +209,21 @@ fn range<'a>() -> impl Parser<&'a str, Output = UserInputLeaf> {
|
||||
})
|
||||
}
|
||||
|
||||
fn negate(expr: UserInputAST) -> UserInputAST {
|
||||
fn negate(expr: UserInputAst) -> UserInputAst {
|
||||
expr.unary(Occur::MustNot)
|
||||
}
|
||||
|
||||
fn leaf<'a>() -> impl Parser<&'a str, Output = UserInputAST> {
|
||||
fn leaf<'a>() -> impl Parser<&'a str, Output = UserInputAst> {
|
||||
parser(|input| {
|
||||
char('(')
|
||||
.with(ast())
|
||||
.skip(char(')'))
|
||||
.or(char('*').map(|_| UserInputAST::from(UserInputLeaf::All)))
|
||||
.or(char('*').map(|_| UserInputAst::from(UserInputLeaf::All)))
|
||||
.or(attempt(
|
||||
string("NOT").skip(spaces1()).with(leaf()).map(negate),
|
||||
))
|
||||
.or(attempt(range().map(UserInputAST::from)))
|
||||
.or(literal().map(UserInputAST::from))
|
||||
.or(attempt(range().map(UserInputAst::from)))
|
||||
.or(literal().map(UserInputAst::from))
|
||||
.parse_stream(input)
|
||||
.into_result()
|
||||
})
|
||||
@@ -235,7 +235,7 @@ fn occur_symbol<'a>() -> impl Parser<&'a str, Output = Occur> {
|
||||
.or(char('+').map(|_| Occur::Must))
|
||||
}
|
||||
|
||||
fn occur_leaf<'a>() -> impl Parser<&'a str, Output = (Option<Occur>, UserInputAST)> {
|
||||
fn occur_leaf<'a>() -> impl Parser<&'a str, Output = (Option<Occur>, UserInputAst)> {
|
||||
(optional(occur_symbol()), boosted_leaf())
|
||||
}
|
||||
|
||||
@@ -256,10 +256,10 @@ fn boost<'a>() -> impl Parser<&'a str, Output = f64> {
|
||||
(char('^'), positive_float_number()).map(|(_, boost)| boost)
|
||||
}
|
||||
|
||||
fn boosted_leaf<'a>() -> impl Parser<&'a str, Output = UserInputAST> {
|
||||
fn boosted_leaf<'a>() -> impl Parser<&'a str, Output = UserInputAst> {
|
||||
(leaf(), optional(boost())).map(|(leaf, boost_opt)| match boost_opt {
|
||||
Some(boost) if (boost - 1.0).abs() > std::f64::EPSILON => {
|
||||
UserInputAST::Boost(Box::new(leaf), boost)
|
||||
UserInputAst::Boost(Box::new(leaf), boost)
|
||||
}
|
||||
_ => leaf,
|
||||
})
|
||||
@@ -278,10 +278,10 @@ fn binary_operand<'a>() -> impl Parser<&'a str, Output = BinaryOperand> {
|
||||
}
|
||||
|
||||
fn aggregate_binary_expressions(
|
||||
left: UserInputAST,
|
||||
others: Vec<(BinaryOperand, UserInputAST)>,
|
||||
) -> UserInputAST {
|
||||
let mut dnf: Vec<Vec<UserInputAST>> = vec![vec![left]];
|
||||
left: UserInputAst,
|
||||
others: Vec<(BinaryOperand, UserInputAst)>,
|
||||
) -> UserInputAst {
|
||||
let mut dnf: Vec<Vec<UserInputAst>> = vec![vec![left]];
|
||||
for (operator, operand_ast) in others {
|
||||
match operator {
|
||||
BinaryOperand::And => {
|
||||
@@ -295,33 +295,33 @@ fn aggregate_binary_expressions(
|
||||
}
|
||||
}
|
||||
if dnf.len() == 1 {
|
||||
UserInputAST::and(dnf.into_iter().next().unwrap()) //< safe
|
||||
UserInputAst::and(dnf.into_iter().next().unwrap()) //< safe
|
||||
} else {
|
||||
let conjunctions = dnf.into_iter().map(UserInputAST::and).collect();
|
||||
UserInputAST::or(conjunctions)
|
||||
let conjunctions = dnf.into_iter().map(UserInputAst::and).collect();
|
||||
UserInputAst::or(conjunctions)
|
||||
}
|
||||
}
|
||||
|
||||
fn operand_leaf<'a>() -> impl Parser<&'a str, Output = (BinaryOperand, UserInputAST)> {
|
||||
fn operand_leaf<'a>() -> impl Parser<&'a str, Output = (BinaryOperand, UserInputAst)> {
|
||||
(
|
||||
binary_operand().skip(spaces()),
|
||||
boosted_leaf().skip(spaces()),
|
||||
)
|
||||
}
|
||||
|
||||
pub fn ast<'a>() -> impl Parser<&'a str, Output = UserInputAST> {
|
||||
pub fn ast<'a>() -> impl Parser<&'a str, Output = UserInputAst> {
|
||||
let boolean_expr = (boosted_leaf().skip(spaces()), many1(operand_leaf()))
|
||||
.map(|(left, right)| aggregate_binary_expressions(left, right));
|
||||
let whitespace_separated_leaves = many1(occur_leaf().skip(spaces().silent())).map(
|
||||
|subqueries: Vec<(Option<Occur>, UserInputAST)>| {
|
||||
|subqueries: Vec<(Option<Occur>, UserInputAst)>| {
|
||||
if subqueries.len() == 1 {
|
||||
let (occur_opt, ast) = subqueries.into_iter().next().unwrap();
|
||||
match occur_opt.unwrap_or(Occur::Should) {
|
||||
Occur::Must | Occur::Should => ast,
|
||||
Occur::MustNot => UserInputAST::Clause(vec![(Some(Occur::MustNot), ast)]),
|
||||
Occur::MustNot => UserInputAst::Clause(vec![(Some(Occur::MustNot), ast)]),
|
||||
}
|
||||
} else {
|
||||
UserInputAST::Clause(subqueries.into_iter().collect())
|
||||
UserInputAst::Clause(subqueries.into_iter().collect())
|
||||
}
|
||||
},
|
||||
);
|
||||
@@ -329,10 +329,10 @@ pub fn ast<'a>() -> impl Parser<&'a str, Output = UserInputAST> {
|
||||
spaces().with(expr).skip(spaces())
|
||||
}
|
||||
|
||||
pub fn parse_to_ast<'a>() -> impl Parser<&'a str, Output = UserInputAST> {
|
||||
pub fn parse_to_ast<'a>() -> impl Parser<&'a str, Output = UserInputAst> {
|
||||
spaces()
|
||||
.with(optional(ast()).skip(eof()))
|
||||
.map(|opt_ast| opt_ast.unwrap_or_else(UserInputAST::empty_query))
|
||||
.map(|opt_ast| opt_ast.unwrap_or_else(UserInputAst::empty_query))
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
|
||||
@@ -84,41 +84,41 @@ impl UserInputBound {
|
||||
}
|
||||
}
|
||||
|
||||
pub enum UserInputAST {
|
||||
Clause(Vec<(Option<Occur>, UserInputAST)>),
|
||||
pub enum UserInputAst {
|
||||
Clause(Vec<(Option<Occur>, UserInputAst)>),
|
||||
Leaf(Box<UserInputLeaf>),
|
||||
Boost(Box<UserInputAST>, f64),
|
||||
Boost(Box<UserInputAst>, f64),
|
||||
}
|
||||
|
||||
impl UserInputAST {
|
||||
pub fn unary(self, occur: Occur) -> UserInputAST {
|
||||
UserInputAST::Clause(vec![(Some(occur), self)])
|
||||
impl UserInputAst {
|
||||
pub fn unary(self, occur: Occur) -> UserInputAst {
|
||||
UserInputAst::Clause(vec![(Some(occur), self)])
|
||||
}
|
||||
|
||||
fn compose(occur: Occur, asts: Vec<UserInputAST>) -> UserInputAST {
|
||||
fn compose(occur: Occur, asts: Vec<UserInputAst>) -> UserInputAst {
|
||||
assert_ne!(occur, Occur::MustNot);
|
||||
assert!(!asts.is_empty());
|
||||
if asts.len() == 1 {
|
||||
asts.into_iter().next().unwrap() //< safe
|
||||
} else {
|
||||
UserInputAST::Clause(
|
||||
UserInputAst::Clause(
|
||||
asts.into_iter()
|
||||
.map(|ast: UserInputAST| (Some(occur), ast))
|
||||
.map(|ast: UserInputAst| (Some(occur), ast))
|
||||
.collect::<Vec<_>>(),
|
||||
)
|
||||
}
|
||||
}
|
||||
|
||||
pub fn empty_query() -> UserInputAST {
|
||||
UserInputAST::Clause(Vec::default())
|
||||
pub fn empty_query() -> UserInputAst {
|
||||
UserInputAst::Clause(Vec::default())
|
||||
}
|
||||
|
||||
pub fn and(asts: Vec<UserInputAST>) -> UserInputAST {
|
||||
UserInputAST::compose(Occur::Must, asts)
|
||||
pub fn and(asts: Vec<UserInputAst>) -> UserInputAst {
|
||||
UserInputAst::compose(Occur::Must, asts)
|
||||
}
|
||||
|
||||
pub fn or(asts: Vec<UserInputAST>) -> UserInputAST {
|
||||
UserInputAST::compose(Occur::Should, asts)
|
||||
pub fn or(asts: Vec<UserInputAst>) -> UserInputAst {
|
||||
UserInputAst::compose(Occur::Should, asts)
|
||||
}
|
||||
}
|
||||
|
||||
@@ -128,15 +128,15 @@ impl From<UserInputLiteral> for UserInputLeaf {
|
||||
}
|
||||
}
|
||||
|
||||
impl From<UserInputLeaf> for UserInputAST {
|
||||
fn from(leaf: UserInputLeaf) -> UserInputAST {
|
||||
UserInputAST::Leaf(Box::new(leaf))
|
||||
impl From<UserInputLeaf> for UserInputAst {
|
||||
fn from(leaf: UserInputLeaf) -> UserInputAst {
|
||||
UserInputAst::Leaf(Box::new(leaf))
|
||||
}
|
||||
}
|
||||
|
||||
fn print_occur_ast(
|
||||
occur_opt: Option<Occur>,
|
||||
ast: &UserInputAST,
|
||||
ast: &UserInputAst,
|
||||
formatter: &mut fmt::Formatter,
|
||||
) -> fmt::Result {
|
||||
if let Some(occur) = occur_opt {
|
||||
@@ -147,10 +147,10 @@ fn print_occur_ast(
|
||||
Ok(())
|
||||
}
|
||||
|
||||
impl fmt::Debug for UserInputAST {
|
||||
impl fmt::Debug for UserInputAst {
|
||||
fn fmt(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
|
||||
match *self {
|
||||
UserInputAST::Clause(ref subqueries) => {
|
||||
UserInputAst::Clause(ref subqueries) => {
|
||||
if subqueries.is_empty() {
|
||||
write!(formatter, "<emptyclause>")?;
|
||||
} else {
|
||||
@@ -164,8 +164,8 @@ impl fmt::Debug for UserInputAST {
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
UserInputAST::Leaf(ref subquery) => write!(formatter, "{:?}", subquery),
|
||||
UserInputAST::Boost(ref leaf, boost) => write!(formatter, "({:?})^{}", leaf, boost),
|
||||
UserInputAst::Leaf(ref subquery) => write!(formatter, "{:?}", subquery),
|
||||
UserInputAst::Boost(ref leaf, boost) => write!(formatter, "({:?})^{}", leaf, boost),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@@ -180,7 +180,7 @@ impl<T: PartialOrd + Clone> TopSegmentCollector<T> {
|
||||
|
||||
/// Return true iff at least K documents have gone through
|
||||
/// the collector.
|
||||
#[inline(always)]
|
||||
#[inline]
|
||||
pub(crate) fn at_capacity(&self) -> bool {
|
||||
self.heap.len() >= self.limit
|
||||
}
|
||||
@@ -189,7 +189,7 @@ impl<T: PartialOrd + Clone> TopSegmentCollector<T> {
|
||||
///
|
||||
/// It collects documents until it has reached the max capacity. Once it reaches capacity, it
|
||||
/// will compare the lowest scoring item with the given one and keep whichever is greater.
|
||||
#[inline(always)]
|
||||
#[inline]
|
||||
pub fn collect(&mut self, doc: DocId, feature: T) {
|
||||
if self.at_capacity() {
|
||||
// It's ok to unwrap as long as a limit of 0 is forbidden.
|
||||
|
||||
@@ -59,19 +59,19 @@ impl TinySet {
|
||||
|
||||
/// Creates a new `TinySet` containing only one element
|
||||
/// within `[0; 64[`
|
||||
#[inline(always)]
|
||||
#[inline]
|
||||
pub fn singleton(el: u32) -> TinySet {
|
||||
TinySet(1u64 << u64::from(el))
|
||||
}
|
||||
|
||||
/// Insert a new element within [0..64[
|
||||
#[inline(always)]
|
||||
#[inline]
|
||||
pub fn insert(self, el: u32) -> TinySet {
|
||||
self.union(TinySet::singleton(el))
|
||||
}
|
||||
|
||||
/// Insert a new element within [0..64[
|
||||
#[inline(always)]
|
||||
#[inline]
|
||||
pub fn insert_mut(&mut self, el: u32) -> bool {
|
||||
let old = *self;
|
||||
*self = old.insert(el);
|
||||
@@ -79,20 +79,20 @@ impl TinySet {
|
||||
}
|
||||
|
||||
/// Returns the union of two tinysets
|
||||
#[inline(always)]
|
||||
#[inline]
|
||||
pub fn union(self, other: TinySet) -> TinySet {
|
||||
TinySet(self.0 | other.0)
|
||||
}
|
||||
|
||||
/// Returns true iff the `TinySet` is empty.
|
||||
#[inline(always)]
|
||||
#[inline]
|
||||
pub fn is_empty(self) -> bool {
|
||||
self.0 == 0u64
|
||||
}
|
||||
|
||||
/// Returns the lowest element in the `TinySet`
|
||||
/// and removes it.
|
||||
#[inline(always)]
|
||||
#[inline]
|
||||
pub fn pop_lowest(&mut self) -> Option<u32> {
|
||||
if self.is_empty() {
|
||||
None
|
||||
|
||||
@@ -190,7 +190,7 @@ mod test {
|
||||
use super::{CompositeFile, CompositeWrite};
|
||||
use crate::common::BinarySerializable;
|
||||
use crate::common::VInt;
|
||||
use crate::directory::{Directory, RAMDirectory};
|
||||
use crate::directory::{Directory, RamDirectory};
|
||||
use crate::schema::Field;
|
||||
use std::io::Write;
|
||||
use std::path::Path;
|
||||
@@ -198,7 +198,7 @@ mod test {
|
||||
#[test]
|
||||
fn test_composite_file() -> crate::Result<()> {
|
||||
let path = Path::new("test_path");
|
||||
let directory = RAMDirectory::create();
|
||||
let directory = RamDirectory::create();
|
||||
{
|
||||
let w = directory.open_write(path).unwrap();
|
||||
let mut composite_write = CompositeWrite::wrap(w);
|
||||
|
||||
@@ -1,4 +1,3 @@
|
||||
pub mod bitpacker;
|
||||
mod bitset;
|
||||
mod composite_file;
|
||||
mod counting_writer;
|
||||
@@ -20,52 +19,6 @@ pub use byteorder::LittleEndian as Endianness;
|
||||
/// We do not allow segments with more than
|
||||
pub const MAX_DOC_LIMIT: u32 = 1 << 31;
|
||||
|
||||
pub fn minmax<I, T>(mut vals: I) -> Option<(T, T)>
|
||||
where
|
||||
I: Iterator<Item = T>,
|
||||
T: Copy + Ord,
|
||||
{
|
||||
if let Some(first_el) = vals.next() {
|
||||
return Some(vals.fold((first_el, first_el), |(min_val, max_val), el| {
|
||||
(min_val.min(el), max_val.max(el))
|
||||
}));
|
||||
}
|
||||
None
|
||||
}
/// Computes the number of bits that will be used for bitpacking.
///
/// In general the target is the minimum number of bits
/// required to express the amplitude given in argument.
///
/// e.g. If the amplitude is 10, we can store all ints using only 4 bits.
///
/// The logic is slightly more convoluted here as for optimization
/// reasons, we want to ensure that a value spans over at most 8 bytes
/// of aligned bytes.
///
/// Spanning over 9 bytes is possible for instance, if we do
/// bitpacking with an amplitude of 63 bits.
/// In this case, the second int will start on bit
/// 63 (which belongs to byte 7) and ends at byte 15;
/// Hence 9 bytes (from byte 7 to byte 15 included).
///
/// To avoid this, we force the number of bits to 64bits
/// when the result is greater than `64-8 = 56 bits`.
///
/// Note that this only affects rare use cases spanning over
/// a very large range of values. Even in this case, it results
/// in an extra cost of at most 12% compared to the optimal
/// number of bits.
pub(crate) fn compute_num_bits(n: u64) -> u8 {
let amplitude = (64u32 - n.leading_zeros()) as u8;
if amplitude <= 64 - 8 {
amplitude
} else {
64
}
}
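
As a worked illustration of the rule documented above (minimal bits for the amplitude, forced to 64 once more than 56 bits would be needed), the following standalone sketch mirrors the function and checks the "amplitude of 10 fits in 4 bits" example from the doc comment:

```rust
// Mirrors the compute_num_bits logic shown in the diff, for illustration.
fn compute_num_bits(amplitude: u64) -> u8 {
    let bits = (64u32 - amplitude.leading_zeros()) as u8;
    if bits <= 64 - 8 {
        bits
    } else {
        // More than 56 bits would let a value straddle more than 8 aligned bytes,
        // so fall back to a full 64-bit encoding.
        64
    }
}

fn main() {
    assert_eq!(compute_num_bits(10), 4);       // 10 fits in 4 bits
    assert_eq!(compute_num_bits(0), 0);        // nothing to encode
    assert_eq!(compute_num_bits(1 << 56), 64); // 57 bits needed -> forced to 64
}
```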
/// Has length trait
pub trait HasLen {
/// Return length

@@ -99,13 +52,13 @@ const HIGHEST_BIT: u64 = 1 << 63;
///
/// # See also
/// The [reverse mapping is `u64_to_i64`](./fn.u64_to_i64.html).
#[inline(always)]
#[inline]
pub fn i64_to_u64(val: i64) -> u64 {
(val as u64) ^ HIGHEST_BIT
}

/// Reverse the mapping given by [`i64_to_u64`](./fn.i64_to_u64.html).
#[inline(always)]
#[inline]
pub fn u64_to_i64(val: u64) -> i64 {
(val ^ HIGHEST_BIT) as i64
}
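
For illustration, a standalone check of the order-preserving `i64 <-> u64` mapping above: flipping the sign bit maps `i64::MIN..=i64::MAX` monotonically onto `0..=u64::MAX`, which is what lets signed values be stored in unsigned fast fields.

```rust
// Same sign-bit-flip trick as in the diff above, checked in isolation.
const HIGHEST_BIT: u64 = 1 << 63;

fn i64_to_u64(val: i64) -> u64 {
    (val as u64) ^ HIGHEST_BIT
}

fn u64_to_i64(val: u64) -> i64 {
    (val ^ HIGHEST_BIT) as i64
}

fn main() {
    for &v in &[i64::MIN, -1, 0, 1, i64::MAX] {
        assert_eq!(u64_to_i64(i64_to_u64(v)), v); // round-trip
    }
    // Ordering is preserved by the mapping.
    assert!(i64_to_u64(-5) < i64_to_u64(3));
    assert_eq!(i64_to_u64(i64::MIN), 0);
    assert_eq!(i64_to_u64(i64::MAX), u64::MAX);
}
```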
@@ -127,7 +80,7 @@ pub fn u64_to_i64(val: u64) -> i64 {
///
/// # See also
/// The [reverse mapping is `u64_to_f64`](./fn.u64_to_f64.html).
#[inline(always)]
#[inline]
pub fn f64_to_u64(val: f64) -> u64 {
let bits = val.to_bits();
if val.is_sign_positive() {

@@ -138,7 +91,7 @@ pub fn f64_to_u64(val: f64) -> u64 {
}

/// Reverse the mapping given by [`i64_to_u64`](./fn.i64_to_u64.html).
#[inline(always)]
#[inline]
pub fn u64_to_f64(val: u64) -> f64 {
f64::from_bits(if val & HIGHEST_BIT != 0 {
val ^ HIGHEST_BIT

@@ -150,11 +103,12 @@ pub fn u64_to_f64(val: u64) -> f64 {
#[cfg(test)]
pub(crate) mod test {

pub use super::minmax;
pub use super::serialize::test::fixed_size_test;
use super::{compute_num_bits, f64_to_u64, i64_to_u64, u64_to_f64, u64_to_i64};
use super::{f64_to_u64, i64_to_u64, u64_to_f64, u64_to_i64};
use proptest::prelude::*;
use std::f64;
use tantivy_bitpacker::compute_num_bits;
pub use tantivy_bitpacker::minmax;

fn test_i64_converter_helper(val: i64) {
assert_eq!(u64_to_i64(i64_to_u64(val)), val);
@@ -1,4 +1,4 @@
use super::segment::Segment;
use super::{segment::Segment, IndexSettings};
use crate::core::Executor;
use crate::core::IndexMeta;
use crate::core::SegmentId;

@@ -10,10 +10,10 @@ use crate::directory::ManagedDirectory;
#[cfg(feature = "mmap")]
use crate::directory::MmapDirectory;
use crate::directory::INDEX_WRITER_LOCK;
use crate::directory::{Directory, RAMDirectory};
use crate::directory::{Directory, RamDirectory};
use crate::error::DataCorruption;
use crate::error::TantivyError;
use crate::indexer::index_writer::HEAP_SIZE_MIN;
use crate::indexer::index_writer::{HEAP_SIZE_MIN, MAX_NUM_THREAD};
use crate::indexer::segment_updater::save_new_metas;
use crate::reader::IndexReader;
use crate::reader::IndexReaderBuilder;

@@ -55,17 +55,146 @@ fn load_metas(
.map_err(From::from)
}
/// IndexBuilder can be used to create an index.
///
/// Use in conjunction with `SchemaBuilder`. Global index settings
/// can be configured with `IndexSettings`
///
/// # Examples
///
/// ```
/// use tantivy::schema::*;
/// use tantivy::{Index, IndexSettings, IndexSortByField, Order};
///
/// let mut schema_builder = Schema::builder();
/// let id_field = schema_builder.add_text_field("id", STRING);
/// let title_field = schema_builder.add_text_field("title", TEXT);
/// let body_field = schema_builder.add_text_field("body", TEXT);
/// let number_field = schema_builder.add_u64_field(
/// "number",
/// IntOptions::default().set_fast(Cardinality::SingleValue),
/// );
///
/// let schema = schema_builder.build();
/// let settings = IndexSettings{sort_by_field: Some(IndexSortByField{field:"number".to_string(), order:Order::Asc})};
/// let index = Index::builder().schema(schema).settings(settings).create_in_ram();
///
/// ```
pub struct IndexBuilder {
schema: Option<Schema>,
index_settings: IndexSettings,
}
impl Default for IndexBuilder {
fn default() -> Self {
IndexBuilder::new()
}
}
impl IndexBuilder {
/// Creates a new `IndexBuilder`
pub fn new() -> Self {
Self {
schema: None,
index_settings: IndexSettings::default(),
}
}
/// Set the settings
pub fn settings(mut self, settings: IndexSettings) -> Self {
self.index_settings = settings;
self
}
/// Set the schema
pub fn schema(mut self, schema: Schema) -> Self {
self.schema = Some(schema);
self
}
/// Creates a new index using the `RAMDirectory`.
///
/// The index will be allocated in anonymous memory.
/// This should only be used for unit tests.
pub fn create_in_ram(self) -> Result<Index, TantivyError> {
let ram_directory = RamDirectory::create();
Ok(self
.create(ram_directory)
.expect("Creating a RAMDirectory should never fail"))
}
/// Creates a new index in a given filepath.
/// The index will use the `MMapDirectory`.
///
/// If a previous index was in this directory, then its meta file will be destroyed.
#[cfg(feature = "mmap")]
pub fn create_in_dir<P: AsRef<Path>>(self, directory_path: P) -> crate::Result<Index> {
let mmap_directory = MmapDirectory::open(directory_path)?;
if Index::exists(&mmap_directory)? {
return Err(TantivyError::IndexAlreadyExists);
}
self.create(mmap_directory)
}
/// Creates a new index in a temp directory.
///
/// The index will use the `MMapDirectory` in a newly created directory.
/// The temp directory will be destroyed automatically when the `Index` object
/// is destroyed.
///
/// The temp directory is only used for testing the `MmapDirectory`.
/// For other unit tests, prefer the `RAMDirectory`, see: `create_in_ram`.
#[cfg(feature = "mmap")]
pub fn create_from_tempdir(self) -> crate::Result<Index> {
let mmap_directory = MmapDirectory::create_from_tempdir()?;
self.create(mmap_directory)
}
fn get_expect_schema(&self) -> crate::Result<Schema> {
self.schema
.as_ref()
.cloned()
.ok_or(TantivyError::IndexBuilderMissingArgument("schema"))
}
/// Opens or creates a new index in the provided directory
pub fn open_or_create<Dir: Directory>(self, dir: Dir) -> crate::Result<Index> {
if !Index::exists(&dir)? {
return self.create(dir);
}
let index = Index::open(dir)?;
if index.schema() == self.get_expect_schema()? {
Ok(index)
} else {
Err(TantivyError::SchemaError(
"An index exists but the schema does not match.".to_string(),
))
}
}
/// Creates a new index given an implementation of the trait `Directory`.
///
/// If a directory previously existed, it will be erased.
fn create<Dir: Directory>(self, dir: Dir) -> crate::Result<Index> {
let directory = ManagedDirectory::wrap(dir)?;
save_new_metas(
self.get_expect_schema()?,
self.index_settings.clone(),
&directory,
)?;
let mut metas = IndexMeta::with_schema(self.get_expect_schema()?);
metas.index_settings = self.index_settings.clone();
let index = Index::open_from_metas(directory, &metas, SegmentMetaInventory::default());
Ok(index)
}
}
/// Search Index
#[derive(Clone)]
pub struct Index {
directory: ManagedDirectory,
schema: Schema,
settings: IndexSettings,
executor: Arc<Executor>,
tokenizers: TokenizerManager,
inventory: SegmentMetaInventory,
}

impl Index {
/// Creates a new builder.
pub fn builder() -> IndexBuilder {
IndexBuilder::new()
}
/// Examines the directory to see if it contains an index.
///
/// Effectively, it only checks for the presence of the `meta.json` file.

@@ -97,13 +226,12 @@ impl Index {
self.set_multithread_executor(default_num_threads)
}

/// Creates a new index using the `RAMDirectory`.
/// Creates a new index using the `RamDirectory`.
///
/// The index will be allocated in anonymous memory.
/// This should only be used for unit tests.
pub fn create_in_ram(schema: Schema) -> Index {
let ram_directory = RAMDirectory::create();
Index::create(ram_directory, schema).expect("Creating a RAMDirectory should never fail")
IndexBuilder::new().schema(schema).create_in_ram().unwrap()
}
/// Creates a new index in a given filepath.

@@ -115,26 +243,14 @@ impl Index {
directory_path: P,
schema: Schema,
) -> crate::Result<Index> {
let mmap_directory = MmapDirectory::open(directory_path)?;
if Index::exists(&mmap_directory)? {
return Err(TantivyError::IndexAlreadyExists);
}
Index::create(mmap_directory, schema)
IndexBuilder::new()
.schema(schema)
.create_in_dir(directory_path)
}

/// Opens or creates a new index in the provided directory
pub fn open_or_create<Dir: Directory>(dir: Dir, schema: Schema) -> crate::Result<Index> {
if !Index::exists(&dir)? {
return Index::create(dir, schema);
}
let index = Index::open(dir)?;
if index.schema() == schema {
Ok(index)
} else {
Err(TantivyError::SchemaError(
"An index exists but the schema does not match.".to_string(),
))
}
IndexBuilder::new().schema(schema).open_or_create(dir)
}

/// Creates a new index in a temp directory.

@@ -144,39 +260,34 @@ impl Index {
/// is destroyed.
///
/// The temp directory is only used for testing the `MmapDirectory`.
/// For other unit tests, prefer the `RAMDirectory`, see: `create_in_ram`.
/// For other unit tests, prefer the `RamDirectory`, see: `create_in_ram`.
#[cfg(feature = "mmap")]
pub fn create_from_tempdir(schema: Schema) -> crate::Result<Index> {
let mmap_directory = MmapDirectory::create_from_tempdir()?;
Index::create(mmap_directory, schema)
IndexBuilder::new().schema(schema).create_from_tempdir()
}
/// Creates a new index given an implementation of the trait `Directory`.
///
/// If a directory previously existed, it will be erased.
pub fn create<Dir: Directory>(dir: Dir, schema: Schema) -> crate::Result<Index> {
let directory = ManagedDirectory::wrap(dir)?;
Index::from_directory(directory, schema)
}

/// Create a new index from a directory.
///
/// This will overwrite existing meta.json
fn from_directory(directory: ManagedDirectory, schema: Schema) -> crate::Result<Index> {
save_new_metas(schema.clone(), &directory)?;
let metas = IndexMeta::with_schema(schema);
let index = Index::create_from_metas(directory, &metas, SegmentMetaInventory::default());
Ok(index)
pub fn create<Dir: Directory>(
dir: Dir,
schema: Schema,
settings: IndexSettings,
) -> crate::Result<Index> {
let mut builder = IndexBuilder::new().schema(schema);
builder = builder.settings(settings);
builder.create(dir)
}

/// Creates a new index given a directory and an `IndexMeta`.
fn create_from_metas(
fn open_from_metas(
directory: ManagedDirectory,
metas: &IndexMeta,
inventory: SegmentMetaInventory,
) -> Index {
let schema = metas.schema.clone();
Index {
settings: metas.index_settings.clone(),
directory,
schema,
tokenizers: TokenizerManager::default(),

@@ -257,7 +368,7 @@ impl Index {
let directory = ManagedDirectory::wrap(directory)?;
let inventory = SegmentMetaInventory::default();
let metas = load_metas(&directory, &inventory)?;
let index = Index::create_from_metas(directory, &metas, inventory);
let index = Index::open_from_metas(directory, &metas, inventory);
Ok(index)
}
@@ -282,7 +393,7 @@ impl Index {
/// Each thread will receive a budget of `overall_heap_size_in_bytes / num_threads`.
///
/// # Errors
/// If the lockfile already exists, returns `Error::DirectoryLockBusy` or an `Error::IOError`.
/// If the lockfile already exists, returns `Error::DirectoryLockBusy` or an `Error::IoError`.
///
/// # Panics
/// If the heap size per thread is too small, panics.

@@ -317,7 +428,7 @@ impl Index {

/// Helper to create an index writer for tests.
///
/// That index writer only has a single thread and a heap of 5 MB.
/// That index writer only has a single thread and a heap of 10 MB.
/// Using a single thread gives us a deterministic allocation of DocId.
#[cfg(test)]
pub fn writer_for_tests(&self) -> crate::Result<IndexWriter> {

@@ -326,7 +437,8 @@ impl Index {

/// Creates a multithreaded writer
///
/// Tantivy will automatically define the number of threads to use.
/// Tantivy will automatically define the number of threads to use, but
/// no more than [`MAX_NUM_THREAD`] threads.
/// `overall_heap_size_in_bytes` is the total target memory usage that will be split
/// between a given number of threads.
///

@@ -335,7 +447,7 @@ impl Index {
/// # Panics
/// If the heap size per thread is too small, panics.
pub fn writer(&self, overall_heap_size_in_bytes: usize) -> crate::Result<IndexWriter> {
let mut num_threads = num_cpus::get();
let mut num_threads = std::cmp::min(num_cpus::get(), MAX_NUM_THREAD);
let heap_size_in_bytes_per_thread = overall_heap_size_in_bytes / num_threads;
if heap_size_in_bytes_per_thread < HEAP_SIZE_MIN {
num_threads = (overall_heap_size_in_bytes / HEAP_SIZE_MIN).max(1);

@@ -343,6 +455,11 @@ impl Index {
self.writer_with_num_threads(num_threads, overall_heap_size_in_bytes)
}
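
As a side note, the thread-count clamping in `writer` above can be sketched in isolation. The constants below are placeholders chosen for the example, not tantivy's actual `HEAP_SIZE_MIN` and `MAX_NUM_THREAD` values.

```rust
// Illustrative only: placeholder constants, not tantivy's real limits.
const HEAP_SIZE_MIN: usize = 3_000_000;
const MAX_NUM_THREAD: usize = 8;

fn pick_num_threads(available_cpus: usize, overall_heap_size_in_bytes: usize) -> usize {
    // Start from the CPU count, capped by the maximum number of writer threads.
    let mut num_threads = std::cmp::min(available_cpus, MAX_NUM_THREAD);
    let heap_size_per_thread = overall_heap_size_in_bytes / num_threads;
    if heap_size_per_thread < HEAP_SIZE_MIN {
        // Not enough memory for that many writer threads: shrink the pool,
        // but never go below one thread.
        num_threads = (overall_heap_size_in_bytes / HEAP_SIZE_MIN).max(1);
    }
    num_threads
}

fn main() {
    // Plenty of memory: capped only by MAX_NUM_THREAD.
    assert_eq!(pick_num_threads(16, 100_000_000), 8);
    // 10 MB across 8 threads would be under the minimum, so only 3 threads fit.
    assert_eq!(pick_num_threads(16, 10_000_000), 3);
    // Never drops below one thread.
    assert_eq!(pick_num_threads(16, 1_000_000), 1);
}
```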
/// Accessor to the index settings
///
pub fn settings(&self) -> &IndexSettings {
&self.settings
}
/// Accessor to the index schema
///
/// The schema is actually cloned.

@@ -411,11 +528,14 @@ impl fmt::Debug for Index {

#[cfg(test)]
mod tests {
use crate::directory::{RAMDirectory, WatchCallback};
use crate::schema::Field;
use crate::schema::{Schema, INDEXED, TEXT};
use crate::IndexReader;
use crate::ReloadPolicy;
use crate::{
directory::{RamDirectory, WatchCallback},
IndexSettings,
};
use crate::{Directory, Index};
#[test]

@@ -434,15 +554,20 @@ mod tests {

#[test]
fn test_index_exists() {
let directory = RAMDirectory::create();
let directory = RamDirectory::create();
assert!(!Index::exists(&directory).unwrap());
assert!(Index::create(directory.clone(), throw_away_schema()).is_ok());
assert!(Index::create(
directory.clone(),
throw_away_schema(),
IndexSettings::default()
)
.is_ok());
assert!(Index::exists(&directory).unwrap());
}

#[test]
fn open_or_create_should_create() {
let directory = RAMDirectory::create();
let directory = RamDirectory::create();
assert!(!Index::exists(&directory).unwrap());
assert!(Index::open_or_create(directory.clone(), throw_away_schema()).is_ok());
assert!(Index::exists(&directory).unwrap());

@@ -450,24 +575,44 @@ mod tests {

#[test]
fn open_or_create_should_open() {
let directory = RAMDirectory::create();
assert!(Index::create(directory.clone(), throw_away_schema()).is_ok());
let directory = RamDirectory::create();
assert!(Index::create(
directory.clone(),
throw_away_schema(),
IndexSettings::default()
)
.is_ok());
assert!(Index::exists(&directory).unwrap());
assert!(Index::open_or_create(directory, throw_away_schema()).is_ok());
}

#[test]
fn create_should_wipeoff_existing() {
let directory = RAMDirectory::create();
assert!(Index::create(directory.clone(), throw_away_schema()).is_ok());
let directory = RamDirectory::create();
assert!(Index::create(
directory.clone(),
throw_away_schema(),
IndexSettings::default()
)
.is_ok());
assert!(Index::exists(&directory).unwrap());
assert!(Index::create(directory.clone(), Schema::builder().build()).is_ok());
assert!(Index::create(
directory.clone(),
Schema::builder().build(),
IndexSettings::default()
)
.is_ok());
}

#[test]
fn open_or_create_exists_but_schema_does_not_match() {
let directory = RAMDirectory::create();
assert!(Index::create(directory.clone(), throw_away_schema()).is_ok());
let directory = RamDirectory::create();
assert!(Index::create(
directory.clone(),
throw_away_schema(),
IndexSettings::default()
)
.is_ok());
assert!(Index::exists(&directory).unwrap());
assert!(Index::open_or_create(directory.clone(), throw_away_schema()).is_ok());
let err = Index::open_or_create(directory, Schema::builder().build());

@@ -599,10 +744,10 @@ mod tests {
#[cfg(not(target_os = "windows"))]
#[test]
fn garbage_collect_works_as_intended() {
let directory = RAMDirectory::create();
let directory = RamDirectory::create();
let schema = throw_away_schema();
let field = schema.get_field("num_likes").unwrap();
let index = Index::create(directory.clone(), schema).unwrap();
let index = Index::create(directory.clone(), schema, IndexSettings::default()).unwrap();

let mut writer = index.writer_with_num_threads(8, 24_000_000).unwrap();
for i in 0u64..8_000u64 {
@@ -4,9 +4,9 @@ use crate::schema::Schema;
use crate::Opstamp;
use census::{Inventory, TrackedObject};
use serde::{Deserialize, Serialize};
use std::collections::HashSet;
use std::fmt;
use std::path::PathBuf;
use std::{collections::HashSet, sync::atomic::AtomicBool};
use std::{fmt, sync::Arc};

#[derive(Clone, Debug, Serialize, Deserialize)]
struct DeleteMeta {

@@ -33,6 +33,7 @@ impl SegmentMetaInventory {
let inner = InnerSegmentMeta {
segment_id,
max_doc,
include_temp_doc_store: Arc::new(AtomicBool::new(true)),
deletes: None,
};
SegmentMeta::from(self.inventory.track(inner))

@@ -80,6 +81,15 @@ impl SegmentMeta {
self.tracked.segment_id
}

/// Removes the Component::TempStore from the alive list and
/// therefore marks the temp docstore file to be deleted by
/// the garbage collection.
pub fn untrack_temp_docstore(&self) {
self.tracked
.include_temp_doc_store
.store(false, std::sync::atomic::Ordering::Relaxed);
}

/// Returns the number of deleted documents.
pub fn num_deleted_docs(&self) -> u32 {
self.tracked

@@ -96,9 +106,20 @@ impl SegmentMeta {
/// is by removing all files that have been created by tantivy
/// and are not used by any segment anymore.
pub fn list_files(&self) -> HashSet<PathBuf> {
SegmentComponent::iterator()
.map(|component| self.relative_path(*component))
.collect::<HashSet<PathBuf>>()
if self
.tracked
.include_temp_doc_store
.load(std::sync::atomic::Ordering::Relaxed)
{
SegmentComponent::iterator()
.map(|component| self.relative_path(*component))
.collect::<HashSet<PathBuf>>()
} else {
SegmentComponent::iterator()
.filter(|comp| *comp != &SegmentComponent::TempStore)
.map(|component| self.relative_path(*component))
.collect::<HashSet<PathBuf>>()
}
}
/// Returns the relative path of a component of our segment.

@@ -108,14 +129,14 @@ impl SegmentMeta {
pub fn relative_path(&self, component: SegmentComponent) -> PathBuf {
let mut path = self.id().uuid_string();
path.push_str(&*match component {
SegmentComponent::POSTINGS => ".idx".to_string(),
SegmentComponent::POSITIONS => ".pos".to_string(),
SegmentComponent::POSITIONSSKIP => ".posidx".to_string(),
SegmentComponent::TERMS => ".term".to_string(),
SegmentComponent::STORE => ".store".to_string(),
SegmentComponent::FASTFIELDS => ".fast".to_string(),
SegmentComponent::FIELDNORMS => ".fieldnorm".to_string(),
SegmentComponent::DELETE => format!(".{}.del", self.delete_opstamp().unwrap_or(0)),
SegmentComponent::Postings => ".idx".to_string(),
SegmentComponent::Positions => ".pos".to_string(),
SegmentComponent::Terms => ".term".to_string(),
SegmentComponent::Store => ".store".to_string(),
SegmentComponent::TempStore => ".store.temp".to_string(),
SegmentComponent::FastFields => ".fast".to_string(),
SegmentComponent::FieldNorms => ".fieldnorm".to_string(),
SegmentComponent::Delete => format!(".{}.del", self.delete_opstamp().unwrap_or(0)),
});
PathBuf::from(path)
}
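
For illustration, the file-naming scheme implemented by `relative_path` above: each component is stored as `<segment_uuid><extension>`, with the delete component additionally embedding the delete opstamp. The uuid and the string-keyed lookup below are simplifications for the sketch, not tantivy's API.

```rust
// Illustrative sketch of the extension mapping shown in the diff.
fn component_extension(component: &str, delete_opstamp: u64) -> String {
    match component {
        "Postings" => ".idx".to_string(),
        "Positions" => ".pos".to_string(),
        "Terms" => ".term".to_string(),
        "Store" => ".store".to_string(),
        "TempStore" => ".store.temp".to_string(),
        "FastFields" => ".fast".to_string(),
        "FieldNorms" => ".fieldnorm".to_string(),
        "Delete" => format!(".{}.del", delete_opstamp),
        _ => String::new(),
    }
}

fn main() {
    let segment_uuid = "0f5e2a3c"; // placeholder; real segment ids are full uuids
    assert_eq!(
        format!("{}{}", segment_uuid, component_extension("Postings", 0)),
        "0f5e2a3c.idx"
    );
    assert_eq!(
        format!("{}{}", segment_uuid, component_extension("Delete", 42)),
        "0f5e2a3c.42.del"
    );
}
```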
@@ -160,6 +181,7 @@ impl SegmentMeta {
segment_id: inner_meta.segment_id,
max_doc,
deletes: None,
include_temp_doc_store: Arc::new(AtomicBool::new(true)),
});
SegmentMeta { tracked }
}

@@ -173,6 +195,7 @@ impl SegmentMeta {
let tracked = self.tracked.map(move |inner_meta| InnerSegmentMeta {
segment_id: inner_meta.segment_id,
max_doc: inner_meta.max_doc,
include_temp_doc_store: Arc::new(AtomicBool::new(true)),
deletes: Some(delete_meta),
});
SegmentMeta { tracked }

@@ -184,6 +207,14 @@ struct InnerSegmentMeta {
segment_id: SegmentId,
max_doc: u32,
deletes: Option<DeleteMeta>,
/// If you want to avoid the SegmentComponent::TempStore file to be covered by
/// garbage collection and deleted, set this to true. This is used during merge.
#[serde(skip)]
#[serde(default = "default_temp_store")]
pub(crate) include_temp_doc_store: Arc<AtomicBool>,
}
fn default_temp_store() -> Arc<AtomicBool> {
Arc::new(AtomicBool::new(false))
}

impl InnerSegmentMeta {

@@ -194,6 +225,36 @@ impl InnerSegmentMeta {
}
}
/// Search Index Settings.
///
/// Contains settings which are applied on the whole
/// index, like presort documents.
#[derive(Clone, Default, Serialize, Deserialize, Eq, PartialEq)]
pub struct IndexSettings {
/// Sorts the documents by information
/// provided in `IndexSortByField`
pub sort_by_field: Option<IndexSortByField>,
}
/// Settings to presort the documents in an index
///
/// Presorting documents can greatly improve performance
/// in some scenarios, by applying top n
/// optimizations.
#[derive(Clone, Serialize, Deserialize, Eq, PartialEq)]
pub struct IndexSortByField {
/// The field to sort the documents by
pub field: String,
/// The order to sort the documents by
pub order: Order,
}
/// The order to sort by
#[derive(Clone, Serialize, Deserialize, Eq, PartialEq)]
pub enum Order {
/// Ascending Order
Asc,
/// Descending Order
Desc,
}
/// Meta information about the `Index`.
///
/// This object is serialized on disk in the `meta.json` file.

@@ -204,6 +265,9 @@ impl InnerSegmentMeta {
///
#[derive(Clone, Serialize)]
pub struct IndexMeta {
/// `IndexSettings` to configure index options.
#[serde(default)]
pub index_settings: IndexSettings,
/// List of `SegmentMeta` information associated with each finalized segment of the index.
pub segments: Vec<SegmentMeta>,
/// Index `Schema`

@@ -222,6 +286,8 @@ pub struct IndexMeta {
#[derive(Deserialize)]
struct UntrackedIndexMeta {
pub segments: Vec<InnerSegmentMeta>,
#[serde(default)]
pub index_settings: IndexSettings,
pub schema: Schema,
pub opstamp: Opstamp,
#[serde(skip_serializing_if = "Option::is_none")]

@@ -231,6 +297,7 @@ struct UntrackedIndexMeta {
impl UntrackedIndexMeta {
pub fn track(self, inventory: &SegmentMetaInventory) -> IndexMeta {
IndexMeta {
index_settings: self.index_settings,
segments: self
.segments
.into_iter()

@@ -251,6 +318,7 @@ impl IndexMeta {
/// Opstamp will be the value `0u64`.
pub fn with_schema(schema: Schema) -> IndexMeta {
IndexMeta {
index_settings: IndexSettings::default(),
segments: vec![],
schema,
opstamp: 0u64,

@@ -282,7 +350,10 @@ impl fmt::Debug for IndexMeta {
mod tests {

use super::IndexMeta;
use crate::schema::{Schema, TEXT};
use crate::{
schema::{Schema, TEXT},
IndexSettings, IndexSortByField, Order,
};
use serde_json;

#[test]

@@ -293,6 +364,12 @@ mod tests {
schema_builder.build()
};
let index_metas = IndexMeta {
index_settings: IndexSettings {
sort_by_field: Some(IndexSortByField {
field: "text".to_string(),
order: Order::Asc,
}),
},
segments: Vec::new(),
schema,
opstamp: 0u64,

@@ -301,7 +378,7 @@ mod tests {
let json = serde_json::ser::to_string(&index_metas).expect("serialization failed");
assert_eq!(
json,
r#"{"segments":[],"schema":[{"name":"text","type":"text","options":{"indexing":{"record":"position","tokenizer":"default"},"stored":false}}],"opstamp":0}"#
r#"{"index_settings":{"sort_by_field":{"field":"text","order":"Asc"}},"segments":[],"schema":[{"name":"text","type":"text","options":{"indexing":{"record":"position","tokenizer":"default"},"stored":false}}],"opstamp":0}"#
);
}
}
@@ -26,7 +26,6 @@ pub struct InvertedIndexReader {
termdict: TermDictionary,
postings_file_slice: FileSlice,
positions_file_slice: FileSlice,
positions_idx_file_slice: FileSlice,
record_option: IndexRecordOption,
total_num_tokens: u64,
}

@@ -37,7 +36,6 @@ impl InvertedIndexReader {
termdict: TermDictionary,
postings_file_slice: FileSlice,
positions_file_slice: FileSlice,
positions_idx_file_slice: FileSlice,
record_option: IndexRecordOption,
) -> io::Result<InvertedIndexReader> {
let (total_num_tokens_slice, postings_body) = postings_file_slice.split(8);

@@ -46,7 +44,6 @@ impl InvertedIndexReader {
termdict,
postings_file_slice: postings_body,
positions_file_slice,
positions_idx_file_slice,
record_option,
total_num_tokens,
})

@@ -59,7 +56,6 @@ impl InvertedIndexReader {
termdict: TermDictionary::empty(),
postings_file_slice: FileSlice::empty(),
positions_file_slice: FileSlice::empty(),
positions_idx_file_slice: FileSlice::empty(),
record_option,
total_num_tokens: 0u64,
}

@@ -141,12 +137,12 @@ impl InvertedIndexReader {
option: IndexRecordOption,
) -> io::Result<SegmentPostings> {
let block_postings = self.read_block_postings_from_terminfo(term_info, option)?;
let position_stream = {
let position_reader = {
if option.has_positions() {
let position_reader = self.positions_file_slice.clone();
let skip_reader = self.positions_idx_file_slice.clone();
let position_reader =
PositionReader::new(position_reader, skip_reader, term_info.positions_idx)?;
let positions_data = self
.positions_file_slice
.read_bytes_slice(term_info.positions_range.clone())?;
let position_reader = PositionReader::open(positions_data)?;
Some(position_reader)
} else {
None

@@ -154,7 +150,7 @@ impl InvertedIndexReader {
};
Ok(SegmentPostings::from_block_postings(
block_postings,
position_stream,
position_reader,
))
}
@@ -9,8 +9,10 @@ mod segment_id;
mod segment_reader;

pub use self::executor::Executor;
pub use self::index::Index;
pub use self::index_meta::{IndexMeta, SegmentMeta, SegmentMetaInventory};
pub use self::index::{Index, IndexBuilder};
pub use self::index_meta::{
IndexMeta, IndexSettings, IndexSortByField, Order, SegmentMeta, SegmentMetaInventory,
};
pub use self::inverted_index_reader::InvertedIndexReader;
pub use self::searcher::Searcher;
pub use self::segment::Segment;

@@ -1,5 +1,4 @@
use super::SegmentComponent;
use crate::core::Index;
use crate::core::SegmentId;
use crate::core::SegmentMeta;
use crate::directory::error::{OpenReadError, OpenWriteError};

@@ -8,6 +7,7 @@ use crate::directory::{FileSlice, WritePtr};
use crate::indexer::segment_serializer::SegmentSerializer;
use crate::schema::Schema;
use crate::Opstamp;
use crate::{core::Index, indexer::doc_id_mapping::DocIdMapping};
use std::fmt;
use std::path::PathBuf;

@@ -97,5 +97,13 @@ pub trait SerializableSegment {
///
/// # Returns
/// The number of documents in the segment.
fn write(&self, serializer: SegmentSerializer) -> crate::Result<u32>;
///
/// doc_id_map is used when index is created and sorted, to map to the new doc_id order.
/// It is not used by the `IndexMerger`, since the doc_id_mapping on cross-segments works
/// differently
fn write(
&self,
serializer: SegmentSerializer,
doc_id_map: Option<&DocIdMapping>,
) -> crate::Result<u32>;
}
@@ -4,42 +4,42 @@ use std::slice;
/// Each component is stored in its own file,
/// using the pattern `segment_uuid`.`component_extension`,
/// except the delete component that takes a `segment_uuid`.`delete_opstamp`.`component_extension`
#[derive(Copy, Clone)]
#[derive(Copy, Clone, Eq, PartialEq)]
pub enum SegmentComponent {
/// Postings (or inverted list). Sorted lists of document ids, associated to terms
POSTINGS,
Postings,
/// Positions of terms in each document.
POSITIONS,
/// Index to seek within the position file
POSITIONSSKIP,
Positions,
/// Column-oriented random-access storage of fields.
FASTFIELDS,
FastFields,
/// Stores the sum of the length (in terms) of each field for each document.
/// Field norms are stored as a special u64 fast field.
FIELDNORMS,
FieldNorms,
/// Dictionary associating `Term`s to `TermInfo`s which is
/// simply an address into the `postings` file and the `positions` file.
TERMS,
Terms,
/// Row-oriented, compressed storage of the documents.
/// Accessing a document from the store is relatively slow, as it
/// requires to decompress the entire block it belongs to.
STORE,
Store,
/// Temporary storage of the documents, before being streamed to `Store`.
TempStore,
/// Bitset describing which document of the segment is deleted.
DELETE,
Delete,
}

impl SegmentComponent {
/// Iterates through the components.
pub fn iterator() -> slice::Iter<'static, SegmentComponent> {
static SEGMENT_COMPONENTS: [SegmentComponent; 8] = [
SegmentComponent::POSTINGS,
SegmentComponent::POSITIONS,
SegmentComponent::POSITIONSSKIP,
SegmentComponent::FASTFIELDS,
SegmentComponent::FIELDNORMS,
SegmentComponent::TERMS,
SegmentComponent::STORE,
SegmentComponent::DELETE,
SegmentComponent::Postings,
SegmentComponent::Positions,
SegmentComponent::FastFields,
SegmentComponent::FieldNorms,
SegmentComponent::Terms,
SegmentComponent::Store,
SegmentComponent::TempStore,
SegmentComponent::Delete,
];
SEGMENT_COMPONENTS.iter()
}
@@ -46,7 +46,6 @@ pub struct SegmentReader {
termdict_composite: CompositeFile,
postings_composite: CompositeFile,
positions_composite: CompositeFile,
positions_idx_composite: CompositeFile,
fast_fields_readers: Arc<FastFieldReaders>,
fieldnorm_readers: FieldNormReaders,

@@ -151,44 +150,36 @@ impl SegmentReader {

/// Open a new segment for reading.
pub fn open(segment: &Segment) -> crate::Result<SegmentReader> {
let termdict_file = segment.open_read(SegmentComponent::TERMS)?;
let termdict_file = segment.open_read(SegmentComponent::Terms)?;
let termdict_composite = CompositeFile::open(&termdict_file)?;

let store_file = segment.open_read(SegmentComponent::STORE)?;
let store_file = segment.open_read(SegmentComponent::Store)?;

fail_point!("SegmentReader::open#middle");

let postings_file = segment.open_read(SegmentComponent::POSTINGS)?;
let postings_file = segment.open_read(SegmentComponent::Postings)?;
let postings_composite = CompositeFile::open(&postings_file)?;

let positions_composite = {
if let Ok(positions_file) = segment.open_read(SegmentComponent::POSITIONS) {
if let Ok(positions_file) = segment.open_read(SegmentComponent::Positions) {
CompositeFile::open(&positions_file)?
} else {
CompositeFile::empty()
}
};

let positions_idx_composite = {
if let Ok(positions_skip_file) = segment.open_read(SegmentComponent::POSITIONSSKIP) {
CompositeFile::open(&positions_skip_file)?
} else {
CompositeFile::empty()
}
};

let schema = segment.schema();

let fast_fields_data = segment.open_read(SegmentComponent::FASTFIELDS)?;
let fast_fields_data = segment.open_read(SegmentComponent::FastFields)?;
let fast_fields_composite = CompositeFile::open(&fast_fields_data)?;
let fast_field_readers =
Arc::new(FastFieldReaders::new(schema.clone(), fast_fields_composite));

let fieldnorm_data = segment.open_read(SegmentComponent::FIELDNORMS)?;
let fieldnorm_data = segment.open_read(SegmentComponent::FieldNorms)?;
let fieldnorm_readers = FieldNormReaders::open(fieldnorm_data)?;

let delete_bitset_opt = if segment.meta().has_deletes() {
let delete_data = segment.open_read(SegmentComponent::DELETE)?;
let delete_data = segment.open_read(SegmentComponent::Delete)?;
let delete_bitset = DeleteBitSet::open(delete_data)?;
Some(delete_bitset)
} else {

@@ -207,7 +198,6 @@ impl SegmentReader {
store_file,
delete_bitset_opt,
positions_composite,
positions_idx_composite,
schema,
})
}
@@ -263,18 +253,15 @@ impl SegmentReader {
let positions_file = self
.positions_composite
.open_read(field)
.expect("Index corrupted. Failed to open field positions in composite file.");

let positions_idx_file = self
.positions_idx_composite
.open_read(field)
.expect("Index corrupted. Failed to open field positions in composite file.");
.ok_or_else(|| {
let error_msg = format!("Failed to open field {:?}'s positions in the composite file. Has the schema been modified?", field_entry.name());
DataCorruption::comment_only(error_msg)
})?;

let inv_idx_reader = Arc::new(InvertedIndexReader::new(
TermDictionary::open(termdict_file)?,
postings_file,
positions_file,
positions_idx_file,
record_option,
)?);

@@ -319,7 +306,6 @@ impl SegmentReader {
self.termdict_composite.space_usage(),
self.postings_composite.space_usage(),
self.positions_composite.space_usage(),
self.positions_idx_composite.space_usage(),
self.fast_fields_readers.space_usage(),
self.fieldnorm_readers.space_usage(),
self.get_store_reader()?.space_usage(),
@@ -70,7 +70,7 @@ impl Drop for DirectoryLockGuard {

enum TryAcquireLockError {
FileExists,
IOError(io::Error),
IoError(io::Error),
}

fn try_acquire_lock(

@@ -79,9 +79,9 @@ fn try_acquire_lock(
) -> Result<DirectoryLock, TryAcquireLockError> {
let mut write = directory.open_write(filepath).map_err(|e| match e {
OpenWriteError::FileAlreadyExists(_) => TryAcquireLockError::FileExists,
OpenWriteError::IOError { io_error, .. } => TryAcquireLockError::IOError(io_error),
OpenWriteError::IoError { io_error, .. } => TryAcquireLockError::IoError(io_error),
})?;
write.flush().map_err(TryAcquireLockError::IOError)?;
write.flush().map_err(TryAcquireLockError::IoError)?;
Ok(DirectoryLock::from(Box::new(DirectoryLockGuard {
directory: directory.box_clone(),
path: filepath.to_owned(),

@@ -106,7 +106,7 @@ fn retry_policy(is_blocking: bool) -> RetryPolicy {
///
/// - The [`MMapDirectory`](struct.MmapDirectory.html), this
/// should be your default choice.
/// - The [`RAMDirectory`](struct.RAMDirectory.html), which
/// - The [`RamDirectory`](struct.RamDirectory.html), which
/// should be used mostly for tests.
pub trait Directory: DirectoryClone + fmt::Debug + Send + Sync + 'static {
/// Opens a file and returns a boxed `FileHandle`.

@@ -154,7 +154,7 @@ pub trait Directory: DirectoryClone + fmt::Debug + Send + Sync + 'static {
/// Flush operation should also be persistent.
///
/// The user shall not rely on `Drop` triggering `flush`.
/// Note that `RAMDirectory` will panic! if `flush`
/// Note that `RamDirectory` will panic! if `flush`
/// was not called.
///
/// The file may not previously exist.

@@ -192,8 +192,8 @@ pub trait Directory: DirectoryClone + fmt::Debug + Send + Sync + 'static {
return Err(LockError::LockBusy);
}
}
Err(TryAcquireLockError::IOError(io_error)) => {
return Err(LockError::IOError(io_error));
Err(TryAcquireLockError::IoError(io_error)) => {
return Err(LockError::IoError(io_error));
}
}
}
@@ -12,9 +12,9 @@ pub enum LockError {
/// - In the context of a non-blocking lock, this means the lock was busy at the moment of the call.
#[error("Could not acquire lock as it is already held, possibly by a different process.")]
LockBusy,
/// Trying to acquire a lock failed with an `IOError`
/// Trying to acquire a lock failed with an `IoError`
#[error("Failed to acquire the lock due to an io:Error.")]
IOError(io::Error),
IoError(io::Error),
}

/// Error that may occur when opening a directory

@@ -30,7 +30,7 @@ pub enum OpenDirectoryError {
#[error("Failed to create a temporary directory: '{0}'.")]
FailedToCreateTempDir(io::Error),
/// IoError
#[error("IOError '{io_error:?}' while create directory in: '{directory_path:?}'.")]
#[error("IoError '{io_error:?}' while create directory in: '{directory_path:?}'.")]
IoError {
/// underlying io Error.
io_error: io::Error,

@@ -48,8 +48,8 @@ pub enum OpenWriteError {
FileAlreadyExists(PathBuf),
/// Any kind of IO error that happens when
/// writing in the underlying IO device.
#[error("IOError '{io_error:?}' while opening file for write: '{filepath}'.")]
IOError {
#[error("IoError '{io_error:?}' while opening file for write: '{filepath}'.")]
IoError {
/// The underlying `io::Error`.
io_error: io::Error,
/// File path of the file that tantivy failed to open for write.

@@ -60,7 +60,7 @@ pub enum OpenWriteError {
impl OpenWriteError {
/// Wraps an io error.
pub fn wrap_io_error(io_error: io::Error, filepath: PathBuf) -> Self {
Self::IOError { io_error, filepath }
Self::IoError { io_error, filepath }
}
}
/// Type of index incompatibility between the library and the index found on disk

@@ -130,9 +130,9 @@ pub enum OpenReadError {
FileDoesNotExist(PathBuf),
/// Any kind of io::Error.
#[error(
"IOError: '{io_error:?}' happened while opening the following file for Read: {filepath}."
"IoError: '{io_error:?}' happened while opening the following file for Read: {filepath}."
)]
IOError {
IoError {
/// The underlying `io::Error`.
io_error: io::Error,
/// File path of the file that tantivy failed to open for read.

@@ -146,7 +146,7 @@ impl OpenReadError {
impl OpenReadError {
/// Wraps an io error.
pub fn wrap_io_error(io_error: io::Error, filepath: PathBuf) -> Self {
Self::IOError { io_error, filepath }
Self::IoError { io_error, filepath }
}
}
/// Error that may occur when trying to delete a file

@@ -158,7 +158,7 @@ pub enum DeleteError {
/// Any kind of IO error that happens when
/// interacting with the underlying IO device.
#[error("The following IO error happened while deleting file '{filepath}': '{io_error:?}'.")]
IOError {
IoError {
/// The underlying `io::Error`.
io_error: io::Error,
/// File path of the file that tantivy failed to delete.
@@ -297,8 +297,10 @@ mod tests {
assert!(footer_proxy.terminate().is_ok());
if crate::store::COMPRESSION == "lz4" {
assert_eq!(vec.len(), 158);
} else {
} else if crate::store::COMPRESSION == "snappy" {
assert_eq!(vec.len(), 167);
} else if crate::store::COMPRESSION == "lz4_block" {
assert_eq!(vec.len(), 176);
}
let footer = Footer::deserialize(&mut &vec[..]).unwrap();
assert!(matches!(

@@ -86,7 +86,7 @@ impl ManagedDirectory {
directory: Box::new(directory),
meta_informations: Arc::default(),
}),
io_err @ Err(OpenReadError::IOError { .. }) => Err(io_err.err().unwrap().into()),
io_err @ Err(OpenReadError::IoError { .. }) => Err(io_err.err().unwrap().into()),
Err(OpenReadError::IncompatibleIndex(incompatibility)) => {
// For the moment, this should never happen `meta.json`
// do not have any footer and cannot detect incompatibility.

@@ -168,7 +168,7 @@ impl ManagedDirectory {
DeleteError::FileDoesNotExist(_) => {
deleted_files.push(file_to_delete.clone());
}
DeleteError::IOError { .. } => {
DeleteError::IoError { .. } => {
failed_to_delete_files.push(file_to_delete.clone());
if !cfg!(target_os = "windows") {
// On windows, delete is expected to fail if the file

@@ -232,13 +232,13 @@ impl ManagedDirectory {
pub fn validate_checksum(&self, path: &Path) -> result::Result<bool, OpenReadError> {
let reader = self.directory.open_read(path)?;
let (footer, data) =
Footer::extract_footer(reader).map_err(|io_error| OpenReadError::IOError {
Footer::extract_footer(reader).map_err(|io_error| OpenReadError::IoError {
io_error,
filepath: path.to_path_buf(),
})?;
let bytes = data
.read_bytes()
.map_err(|io_error| OpenReadError::IOError {
.map_err(|io_error| OpenReadError::IoError {
filepath: path.to_path_buf(),
io_error,
})?;
@@ -185,7 +185,7 @@ impl MmapDirectory {
/// Creates a new MmapDirectory in a temporary directory.
///
/// This is mostly useful to test the MmapDirectory itself.
/// For your unit tests, prefer the RAMDirectory.
/// For your unit tests, prefer the RamDirectory.
pub fn create_from_tempdir() -> Result<MmapDirectory, OpenDirectoryError> {
let tempdir = TempDir::new().map_err(OpenDirectoryError::FailedToCreateTempDir)?;
Ok(MmapDirectory::new(

@@ -374,7 +374,7 @@ impl Directory for MmapDirectory {
fn delete(&self, path: &Path) -> result::Result<(), DeleteError> {
let full_path = self.resolve_path(path);
match fs::remove_file(&full_path) {
Ok(_) => self.sync_directory().map_err(|e| DeleteError::IOError {
Ok(_) => self.sync_directory().map_err(|e| DeleteError::IoError {
io_error: e,
filepath: path.to_path_buf(),
}),

@@ -382,7 +382,7 @@ impl Directory for MmapDirectory {
if e.kind() == io::ErrorKind::NotFound {
Err(DeleteError::FileDoesNotExist(path.to_owned()))
} else {
Err(DeleteError::IOError {
Err(DeleteError::IoError {
io_error: e,
filepath: path.to_path_buf(),
})

@@ -460,9 +460,9 @@ impl Directory for MmapDirectory {
.write(true)
.create(true) //< if the file does not exist yet, create it.
.open(&full_path)
.map_err(LockError::IOError)?;
.map_err(LockError::IoError)?;
if lock.is_blocking {
file.lock_exclusive().map_err(LockError::IOError)?;
file.lock_exclusive().map_err(LockError::IoError)?;
} else {
file.try_lock_exclusive().map_err(|_| LockError::LockBusy)?
}

@@ -485,10 +485,13 @@ mod tests {
// The following tests are specific to the MmapDirectory

use super::*;
use crate::schema::{Schema, SchemaBuilder, TEXT};
use crate::Index;
use crate::ReloadPolicy;
use crate::{common::HasLen, indexer::LogMergePolicy};
use crate::{
schema::{Schema, SchemaBuilder, TEXT},
IndexSettings,
};

#[test]
fn test_open_non_existent_path() {

@@ -585,7 +588,8 @@ mod tests {
let schema = schema_builder.build();

{
let index = Index::create(mmap_directory.clone(), schema).unwrap();
let index =
Index::create(mmap_directory.clone(), schema, IndexSettings::default()).unwrap();

let mut index_writer = index.writer_for_tests().unwrap();
let mut log_merge_policy = LogMergePolicy::default();

@@ -614,8 +618,10 @@ mod tests {
reader.reload().unwrap();
let num_segments = reader.searcher().segment_readers().len();
assert!(num_segments <= 4);
let num_components_except_deletes_and_tempstore =
crate::core::SegmentComponent::iterator().len() - 2;
assert_eq!(
num_segments * 7,
num_segments * num_components_except_deletes_and_tempstore,
mmap_directory.get_cache_info().mmapped.len()
);
}
@@ -26,7 +26,7 @@ pub use self::directory_lock::{Lock, INDEX_WRITER_LOCK, META_LOCK};
pub(crate) use self::file_slice::{ArcBytes, WeakArcBytes};
pub use self::file_slice::{FileHandle, FileSlice};
pub use self::owned_bytes::OwnedBytes;
pub use self::ram_directory::RAMDirectory;
pub use self::ram_directory::RamDirectory;
pub use self::watch_event_router::{WatchCallback, WatchCallbackList, WatchHandle};
use std::io::{self, BufWriter, Write};
use std::path::PathBuf;

@@ -36,8 +36,8 @@ impl OwnedBytes {
let bytes: &[u8] = box_stable_deref.as_ref();
let data = unsafe { mem::transmute::<_, &'static [u8]>(bytes.deref()) };
OwnedBytes {
box_stable_deref,
data,
box_stable_deref,
}
}

@@ -51,13 +51,13 @@ impl OwnedBytes {

/// Returns the underlying slice of data.
/// `Deref` and `AsRef` are also available.
#[inline(always)]
#[inline]
pub fn as_slice(&self) -> &[u8] {
self.data
}

/// Returns the len of the slice.
#[inline(always)]
#[inline]
pub fn len(&self) -> usize {
self.data.len()
}

@@ -84,7 +84,7 @@ impl OwnedBytes {
}

/// Returns true iff this `OwnedBytes` is empty.
#[inline(always)]
#[inline]
pub fn is_empty(&self) -> bool {
self.as_slice().is_empty()
}

@@ -92,7 +92,7 @@ impl OwnedBytes {
/// Drops the left most `advance_len` bytes.
///
/// See also [.clip(clip_len: usize))](#method.clip).
#[inline(always)]
#[inline]
pub fn advance(&mut self, advance_len: usize) {
self.data = &self.data[advance_len..]
}
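
As an aside, a simplified sketch of the `advance` semantics documented above: the real `OwnedBytes` keeps the backing buffer alive and moves a static slice forward, while the version below models the same behaviour with a plain offset instead of the unsafe lifetime extension.

```rust
// Illustrative stand-in for OwnedBytes, not the real type.
struct Bytes {
    data: Vec<u8>, // owned backing buffer
    offset: usize, // how many leading bytes have been dropped
}

impl Bytes {
    fn new(data: Vec<u8>) -> Bytes {
        Bytes { data, offset: 0 }
    }
    fn as_slice(&self) -> &[u8] {
        &self.data[self.offset..]
    }
    fn len(&self) -> usize {
        self.as_slice().len()
    }
    fn is_empty(&self) -> bool {
        self.as_slice().is_empty()
    }
    /// Drops the left-most `advance_len` bytes, like `OwnedBytes::advance`.
    fn advance(&mut self, advance_len: usize) {
        self.offset += advance_len;
    }
}

fn main() {
    let mut bytes = Bytes::new(vec![1, 2, 3, 4, 5]);
    bytes.advance(2);
    assert_eq!(bytes.as_slice(), &[3, 4, 5]);
    assert_eq!(bytes.len(), 3);
    assert!(!bytes.is_empty());
}
```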
@@ -14,7 +14,7 @@ use std::sync::{Arc, RwLock};

use super::FileHandle;

/// Writer associated with the `RAMDirectory`
/// Writer associated with the `RamDirectory`
///
/// The Writer just writes a buffer.
///

@@ -26,13 +26,13 @@ use super::FileHandle;
///
struct VecWriter {
path: PathBuf,
shared_directory: RAMDirectory,
shared_directory: RamDirectory,
data: Cursor<Vec<u8>>,
is_flushed: bool,
}

impl VecWriter {
fn new(path_buf: PathBuf, shared_directory: RAMDirectory) -> VecWriter {
fn new(path_buf: PathBuf, shared_directory: RamDirectory) -> VecWriter {
VecWriter {
path: path_buf,
data: Cursor::new(Vec::new()),

@@ -119,9 +119,9 @@ impl InnerDirectory {
}
}

impl fmt::Debug for RAMDirectory {
impl fmt::Debug for RamDirectory {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
write!(f, "RAMDirectory")
write!(f, "RamDirectory")
}
}

@@ -131,23 +131,23 @@ impl fmt::Debug for RAMDirectory {
/// Writes are only made visible upon flushing.
///
#[derive(Clone, Default)]
pub struct RAMDirectory {
pub struct RamDirectory {
fs: Arc<RwLock<InnerDirectory>>,
}

impl RAMDirectory {
impl RamDirectory {
/// Constructor
pub fn create() -> RAMDirectory {
pub fn create() -> RamDirectory {
Self::default()
}
/// Returns the sum of the size of the different files
/// in the RAMDirectory.
/// in the RamDirectory.
pub fn total_mem_usage(&self) -> usize {
self.fs.read().unwrap().total_mem_usage()
}

/// Write a copy of all of the files saved in the RAMDirectory in the target `Directory`.
/// Write a copy of all of the files saved in the RamDirectory in the target `Directory`.
///
/// Files are all written using the `Directory::write` meaning, even if they were
/// written using the `atomic_write` api.

@@ -164,7 +164,7 @@ impl RAMDirectory {
}
}

impl Directory for RAMDirectory {
impl Directory for RamDirectory {
fn get_file_handle(&self, path: &Path) -> Result<Box<dyn FileHandle>, OpenReadError> {
let file_slice = self.open_read(path)?;
Ok(Box::new(file_slice))

@@ -175,8 +175,8 @@ impl Directory for RAMDirectory {
}

fn delete(&self, path: &Path) -> result::Result<(), DeleteError> {
fail_point!("RAMDirectory::delete", |_| {
Err(DeleteError::IOError {
fail_point!("RamDirectory::delete", |_| {
Err(DeleteError::IoError {
io_error: io::Error::from(io::ErrorKind::Other),
filepath: path.to_path_buf(),
})

@@ -188,7 +188,7 @@ impl Directory for RAMDirectory {
Ok(self
.fs
.read()
.map_err(|e| OpenReadError::IOError {
.map_err(|e| OpenReadError::IoError {
io_error: io::Error::new(io::ErrorKind::Other, e.to_string()),
filepath: path.to_path_buf(),
})?

@@ -212,7 +212,7 @@ impl Directory for RAMDirectory {
let bytes =
self.open_read(path)?
.read_bytes()
.map_err(|io_error| OpenReadError::IOError {
.map_err(|io_error| OpenReadError::IoError {
io_error,
filepath: path.to_path_buf(),
})?;

@@ -220,7 +220,7 @@ impl Directory for RAMDirectory {
}

fn atomic_write(&self, path: &Path, data: &[u8]) -> io::Result<()> {
fail_point!("RAMDirectory::atomic_write", |msg| Err(io::Error::new(
fail_point!("RamDirectory::atomic_write", |msg| Err(io::Error::new(
io::ErrorKind::Other,
msg.unwrap_or_else(|| "Undefined".to_string())
)));

@@ -241,7 +241,7 @@ impl Directory for RAMDirectory {

#[cfg(test)]
mod tests {
use super::RAMDirectory;
use super::RamDirectory;
use crate::Directory;
use std::io::Write;
use std::path::Path;

@@ -252,12 +252,12 @@ mod tests {
let msg_seq: &'static [u8] = b"sequential is the way";
let path_atomic: &'static Path = Path::new("atomic");
let path_seq: &'static Path = Path::new("seq");
let directory = RAMDirectory::create();
let directory = RamDirectory::create();
assert!(directory.atomic_write(path_atomic, msg_atomic).is_ok());
let mut wrt = directory.open_write(path_seq).unwrap();
assert!(wrt.write_all(msg_seq).is_ok());
assert!(wrt.flush().is_ok());
let directory_copy = RAMDirectory::create();
let directory_copy = RamDirectory::create();
assert!(directory.persist(&directory_copy).is_ok());
assert_eq!(directory_copy.atomic_read(path_atomic).unwrap(), msg_atomic);
assert_eq!(directory_copy.atomic_read(path_seq).unwrap(), msg_seq);
@@ -65,12 +65,12 @@ mod mmap_directory_tests {
}

mod ram_directory_tests {
use crate::directory::RAMDirectory;
use crate::directory::RamDirectory;

type DirectoryImpl = RAMDirectory;
type DirectoryImpl = RamDirectory;

fn make_directory() -> DirectoryImpl {
RAMDirectory::default()
RamDirectory::default()
}

#[test]

@@ -122,7 +122,7 @@ mod ram_directory_tests {
#[should_panic]
fn ram_directory_panics_if_flush_forgotten() {
let test_path: &'static Path = Path::new("some_path_for_test");
let ram_directory = RAMDirectory::create();
let ram_directory = RamDirectory::create();
let mut write_file = ram_directory.open_write(test_path).unwrap();
assert!(write_file.write_all(&[4]).is_ok());
}
|
||||
|
||||
@@ -70,7 +70,7 @@ pub enum TantivyError {
LockFailure(LockError, Option<String>),
/// IO Error.
#[error("An IO error occurred: '{0}'")]
IOError(#[from] io::Error),
IoError(#[from] io::Error),
/// Data corruption.
#[error("Data corrupted: '{0:?}'")]
DataCorruption(DataCorruption),
@@ -83,6 +83,9 @@ pub enum TantivyError {
/// An error happened in one of the threads.
#[error("An error occurred in a thread: '{0}'")]
ErrorInThread(String),
/// An error related to opening or creating an index.
#[error("Missing required index builder argument when open/create index: '{0}'")]
IndexBuilderMissingArgument(&'static str),
/// An error related to the schema.
#[error("Schema error: '{0}'")]
SchemaError(String),
@@ -136,7 +139,7 @@ impl From<schema::DocParsingError> for TantivyError {
|
||||
|
||||
impl From<serde_json::Error> for TantivyError {
|
||||
fn from(error: serde_json::Error) -> TantivyError {
|
||||
TantivyError::IOError(error.into())
|
||||
TantivyError::IoError(error.into())
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
use crate::directory::FileSlice;
|
||||
use crate::directory::OwnedBytes;
|
||||
use crate::fastfield::FastFieldReader;
|
||||
use crate::DocId;
|
||||
use crate::{directory::FileSlice, fastfield::MultiValueLength};
|
||||
|
||||
/// Reader for byte array fast fields
|
||||
///
|
||||
@@ -40,8 +40,23 @@ impl BytesFastFieldReader {
|
||||
&self.values.as_slice()[start..stop]
|
||||
}
|
||||
|
||||
/// Returns the length of the bytes associated to the given `doc`
|
||||
pub fn num_bytes(&self, doc: DocId) -> usize {
|
||||
let (start, stop) = self.range(doc);
|
||||
stop - start
|
||||
}
|
||||
|
||||
/// Returns the overall number of bytes in this bytes fast field.
|
||||
pub fn total_num_bytes(&self) -> usize {
|
||||
self.values.len()
|
||||
}
|
||||
}
|
||||
|
||||
impl MultiValueLength for BytesFastFieldReader {
|
||||
fn get_len(&self, doc_id: DocId) -> u64 {
|
||||
self.num_bytes(doc_id) as u64
|
||||
}
|
||||
fn get_total_len(&self) -> u64 {
|
||||
self.total_num_bytes() as u64
|
||||
}
|
||||
}
|
||||
|
||||
@@ -1,8 +1,8 @@
|
||||
use std::io;
|
||||
|
||||
use crate::fastfield::serializer::FastFieldSerializer;
|
||||
use crate::schema::{Document, Field, Value};
|
||||
use crate::DocId;
|
||||
use crate::{fastfield::serializer::FastFieldSerializer, indexer::doc_id_mapping::DocIdMapping};
|
||||
|
||||
/// Writer for byte array (as in, any number of bytes per document) fast fields
|
||||
///
|
||||
@@ -35,6 +35,10 @@ impl BytesFastFieldWriter {
|
||||
}
|
||||
}
|
||||
|
||||
/// The memory used (including child structures)
|
||||
pub fn mem_usage(&self) -> usize {
|
||||
self.vals.capacity() + self.doc_index.capacity() * std::mem::size_of::<u64>()
|
||||
}
|
||||
/// Access the field associated to the `BytesFastFieldWriter`
|
||||
pub fn field(&self) -> Field {
|
||||
self.field
|
||||
@@ -68,20 +72,62 @@ impl BytesFastFieldWriter {
|
||||
doc
|
||||
}
|
||||
|
||||
/// Returns an iterator over values per doc_id in ascending doc_id order.
|
||||
///
|
||||
/// Normally the order is simply the iteration order of self.doc_index.
|
||||
/// With doc_id_map it accounts for the new mapping, returning values in the order of the
|
||||
/// new doc_ids.
|
||||
fn get_ordered_values<'a: 'b, 'b>(
|
||||
&'a self,
|
||||
doc_id_map: Option<&'b DocIdMapping>,
|
||||
) -> impl Iterator<Item = &'b [u8]> {
|
||||
let doc_id_iter = if let Some(doc_id_map) = doc_id_map {
|
||||
Box::new(doc_id_map.iter_old_doc_ids().cloned()) as Box<dyn Iterator<Item = u32>>
|
||||
} else {
|
||||
Box::new(self.doc_index.iter().enumerate().map(|el| el.0 as u32))
|
||||
as Box<dyn Iterator<Item = u32>>
|
||||
};
|
||||
doc_id_iter.map(move |doc_id| self.get_values_for_doc_id(doc_id))
|
||||
}
|
||||
|
||||
/// Returns all values for a given doc_id
|
||||
fn get_values_for_doc_id(&self, doc_id: u32) -> &[u8] {
|
||||
let start_pos = self.doc_index[doc_id as usize] as usize;
|
||||
let end_pos = self
|
||||
.doc_index
|
||||
.get(doc_id as usize + 1)
|
||||
.cloned()
|
||||
.unwrap_or(self.vals.len() as u64) as usize; // special case, last doc_id has no offset information
|
||||
&self.vals[start_pos..end_pos]
|
||||
}
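
A minimal standalone sketch (not part of the diff) of the offset-index layout used by `get_values_for_doc_id` above: `doc_index` holds one start offset per document into the flat `vals` buffer, and the last document's slice runs to the end of the buffer. The names below are illustrative only.

fn values_for_doc(doc_index: &[u64], vals: &[u8], doc_id: u32) -> &[u8] {
    let start = doc_index[doc_id as usize] as usize;
    // The last doc_id has no "next" offset, so its slice ends at vals.len().
    let end = doc_index
        .get(doc_id as usize + 1)
        .copied()
        .unwrap_or(vals.len() as u64) as usize;
    &vals[start..end]
}

fn main() {
    // Three docs storing b"ab", b"" and b"xyz" back to back.
    let vals = b"abxyz".to_vec();
    let doc_index = vec![0u64, 2, 2];
    assert_eq!(values_for_doc(&doc_index, &vals, 0), b"ab");
    assert_eq!(values_for_doc(&doc_index, &vals, 1), b"");
    assert_eq!(values_for_doc(&doc_index, &vals, 2), b"xyz");
}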
|
||||
|
||||
/// Serializes the fast field values by pushing them to the `FastFieldSerializer`.
|
||||
pub fn serialize(&self, serializer: &mut FastFieldSerializer) -> io::Result<()> {
|
||||
pub fn serialize(
|
||||
&self,
|
||||
serializer: &mut FastFieldSerializer,
|
||||
doc_id_map: Option<&DocIdMapping>,
|
||||
) -> io::Result<()> {
|
||||
// writing the offset index
|
||||
let mut doc_index_serializer =
|
||||
serializer.new_u64_fast_field_with_idx(self.field, 0, self.vals.len() as u64, 0)?;
|
||||
for &offset in &self.doc_index {
|
||||
let mut offset = 0;
|
||||
for vals in self.get_ordered_values(doc_id_map) {
|
||||
doc_index_serializer.add_val(offset)?;
|
||||
offset += vals.len() as u64;
|
||||
}
|
||||
doc_index_serializer.add_val(self.vals.len() as u64)?;
|
||||
doc_index_serializer.close_field()?;
|
||||
// writing the values themselves
|
||||
serializer
|
||||
.new_bytes_fast_field_with_idx(self.field, 1)
|
||||
.write_all(&self.vals)?;
|
||||
let mut value_serializer = serializer.new_bytes_fast_field_with_idx(self.field, 1);
|
||||
// the else could be removed, but this is faster (difference not benchmarked)
|
||||
if let Some(doc_id_map) = doc_id_map {
|
||||
for vals in self.get_ordered_values(Some(doc_id_map)) {
|
||||
// sort values in case of remapped doc_ids?
|
||||
value_serializer.write_all(vals)?;
|
||||
}
|
||||
} else {
|
||||
value_serializer.write_all(&self.vals)?;
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
}
|
||||
|
||||
@@ -41,20 +41,20 @@ pub fn write_delete_bitset(
|
||||
#[derive(Clone)]
|
||||
pub struct DeleteBitSet {
|
||||
data: OwnedBytes,
|
||||
len: usize,
|
||||
num_deleted: usize,
|
||||
}
|
||||
|
||||
impl DeleteBitSet {
|
||||
#[cfg(test)]
|
||||
pub(crate) fn for_test(docs: &[DocId], max_doc: u32) -> DeleteBitSet {
|
||||
use crate::directory::{Directory, RAMDirectory, TerminatingWrite};
|
||||
use crate::directory::{Directory, RamDirectory, TerminatingWrite};
|
||||
use std::path::Path;
|
||||
assert!(docs.iter().all(|&doc| doc < max_doc));
|
||||
let mut bitset = BitSet::with_max_value(max_doc);
|
||||
for &doc in docs {
|
||||
bitset.insert(doc);
|
||||
}
|
||||
let directory = RAMDirectory::create();
|
||||
let directory = RamDirectory::create();
|
||||
let path = Path::new("dummydeletebitset");
|
||||
let mut wrt = directory.open_write(path).unwrap();
|
||||
write_delete_bitset(&bitset, max_doc, &mut wrt).unwrap();
|
||||
@@ -73,7 +73,7 @@ impl DeleteBitSet {
|
||||
.sum();
|
||||
Ok(DeleteBitSet {
|
||||
data: bytes,
|
||||
len: num_deleted,
|
||||
num_deleted,
|
||||
})
|
||||
}
|
||||
|
||||
@@ -83,7 +83,7 @@ impl DeleteBitSet {
|
||||
}
|
||||
|
||||
/// Returns true iff the document has been marked as deleted.
|
||||
#[inline(always)]
|
||||
#[inline]
|
||||
pub fn is_deleted(&self, doc: DocId) -> bool {
|
||||
let byte_offset = doc / 8u32;
|
||||
let b: u8 = self.data.as_slice()[byte_offset as usize];
|
||||
@@ -99,7 +99,7 @@ impl DeleteBitSet {
|
||||
|
||||
impl HasLen for DeleteBitSet {
|
||||
fn len(&self) -> usize {
|
||||
self.len
|
||||
self.num_deleted
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
@@ -33,7 +33,6 @@ pub use self::reader::FastFieldReader;
|
||||
pub use self::readers::FastFieldReaders;
|
||||
pub use self::serializer::FastFieldSerializer;
|
||||
pub use self::writer::{FastFieldsWriter, IntFastFieldWriter};
|
||||
use crate::common;
|
||||
use crate::schema::Cardinality;
|
||||
use crate::schema::FieldType;
|
||||
use crate::schema::Value;
|
||||
@@ -41,6 +40,7 @@ use crate::{
|
||||
chrono::{NaiveDateTime, Utc},
|
||||
schema::Type,
|
||||
};
|
||||
use crate::{common, DocId};
|
||||
|
||||
mod bytes;
|
||||
mod delete;
|
||||
@@ -52,6 +52,15 @@ mod readers;
|
||||
mod serializer;
|
||||
mod writer;
|
||||
|
||||
/// Trait for `BytesFastFieldReader` and `MultiValuedFastFieldReader` to return the length of data
/// for a doc_id
pub trait MultiValueLength {
/// Returns the number of values associated with a doc_id
fn get_len(&self, doc_id: DocId) -> u64;
/// Returns the total number of values across all doc_ids
fn get_total_len(&self) -> u64;
}
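
A minimal sketch (not part of the diff) of what an implementor of the new `MultiValueLength` trait looks like, assuming the trait definition shown above is in scope; the `ToyLengths` type and the `DocId = u32` alias are illustrative assumptions.

type DocId = u32; // tantivy's DocId is a u32

struct ToyLengths {
    // Number of values stored for each doc_id.
    lens: Vec<u64>,
}

impl MultiValueLength for ToyLengths {
    fn get_len(&self, doc_id: DocId) -> u64 {
        self.lens[doc_id as usize]
    }
    fn get_total_len(&self) -> u64 {
        self.lens.iter().sum()
    }
}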
|
||||
|
||||
/// Trait for types that are allowed for fast fields: (u64, i64 and f64).
|
||||
pub trait FastValue: Clone + Copy + Send + Sync + PartialOrd + 'static {
|
||||
/// Converts a value from u64
|
||||
@@ -201,7 +210,7 @@ mod tests {
|
||||
|
||||
use super::*;
|
||||
use crate::common::CompositeFile;
|
||||
use crate::directory::{Directory, RAMDirectory, WritePtr};
|
||||
use crate::directory::{Directory, RamDirectory, WritePtr};
|
||||
use crate::fastfield::FastFieldReader;
|
||||
use crate::merge_policy::NoMergePolicy;
|
||||
use crate::schema::Field;
|
||||
@@ -242,7 +251,7 @@ mod tests {
|
||||
#[test]
|
||||
fn test_intfastfield_small() -> crate::Result<()> {
|
||||
let path = Path::new("test");
|
||||
let directory: RAMDirectory = RAMDirectory::create();
|
||||
let directory: RamDirectory = RamDirectory::create();
|
||||
{
|
||||
let write: WritePtr = directory.open_write(Path::new("test")).unwrap();
|
||||
let mut serializer = FastFieldSerializer::from_write(write).unwrap();
|
||||
@@ -251,7 +260,7 @@ mod tests {
|
||||
fast_field_writers.add_document(&doc!(*FIELD=>14u64));
|
||||
fast_field_writers.add_document(&doc!(*FIELD=>2u64));
|
||||
fast_field_writers
|
||||
.serialize(&mut serializer, &HashMap::new())
|
||||
.serialize(&mut serializer, &HashMap::new(), None)
|
||||
.unwrap();
|
||||
serializer.close().unwrap();
|
||||
}
|
||||
@@ -269,7 +278,7 @@ mod tests {
|
||||
#[test]
|
||||
fn test_intfastfield_large() -> crate::Result<()> {
|
||||
let path = Path::new("test");
|
||||
let directory: RAMDirectory = RAMDirectory::create();
|
||||
let directory: RamDirectory = RamDirectory::create();
|
||||
{
|
||||
let write: WritePtr = directory.open_write(Path::new("test"))?;
|
||||
let mut serializer = FastFieldSerializer::from_write(write)?;
|
||||
@@ -283,7 +292,7 @@ mod tests {
|
||||
fast_field_writers.add_document(&doc!(*FIELD=>1_002u64));
|
||||
fast_field_writers.add_document(&doc!(*FIELD=>1_501u64));
|
||||
fast_field_writers.add_document(&doc!(*FIELD=>215u64));
|
||||
fast_field_writers.serialize(&mut serializer, &HashMap::new())?;
|
||||
fast_field_writers.serialize(&mut serializer, &HashMap::new(), None)?;
|
||||
serializer.close()?;
|
||||
}
|
||||
let file = directory.open_read(&path)?;
|
||||
@@ -308,7 +317,7 @@ mod tests {
|
||||
#[test]
|
||||
fn test_intfastfield_null_amplitude() -> crate::Result<()> {
|
||||
let path = Path::new("test");
|
||||
let directory: RAMDirectory = RAMDirectory::create();
|
||||
let directory: RamDirectory = RamDirectory::create();
|
||||
|
||||
{
|
||||
let write: WritePtr = directory.open_write(Path::new("test")).unwrap();
|
||||
@@ -318,7 +327,7 @@ mod tests {
|
||||
fast_field_writers.add_document(&doc!(*FIELD=>100_000u64));
|
||||
}
|
||||
fast_field_writers
|
||||
.serialize(&mut serializer, &HashMap::new())
|
||||
.serialize(&mut serializer, &HashMap::new(), None)
|
||||
.unwrap();
|
||||
serializer.close().unwrap();
|
||||
}
|
||||
@@ -338,7 +347,7 @@ mod tests {
|
||||
#[test]
|
||||
fn test_intfastfield_large_numbers() -> crate::Result<()> {
|
||||
let path = Path::new("test");
|
||||
let directory: RAMDirectory = RAMDirectory::create();
|
||||
let directory: RamDirectory = RamDirectory::create();
|
||||
|
||||
{
|
||||
let write: WritePtr = directory.open_write(Path::new("test")).unwrap();
|
||||
@@ -350,7 +359,7 @@ mod tests {
|
||||
fast_field_writers.add_document(&doc!(*FIELD=>5_000_000_000_000_000_000u64 + i));
|
||||
}
|
||||
fast_field_writers
|
||||
.serialize(&mut serializer, &HashMap::new())
|
||||
.serialize(&mut serializer, &HashMap::new(), None)
|
||||
.unwrap();
|
||||
serializer.close().unwrap();
|
||||
}
|
||||
@@ -374,7 +383,7 @@ mod tests {
|
||||
#[test]
|
||||
fn test_signed_intfastfield() -> crate::Result<()> {
|
||||
let path = Path::new("test");
|
||||
let directory: RAMDirectory = RAMDirectory::create();
|
||||
let directory: RamDirectory = RamDirectory::create();
|
||||
let mut schema_builder = Schema::builder();
|
||||
|
||||
let i64_field = schema_builder.add_i64_field("field", FAST);
|
||||
@@ -389,7 +398,7 @@ mod tests {
|
||||
fast_field_writers.add_document(&doc);
|
||||
}
|
||||
fast_field_writers
|
||||
.serialize(&mut serializer, &HashMap::new())
|
||||
.serialize(&mut serializer, &HashMap::new(), None)
|
||||
.unwrap();
|
||||
serializer.close().unwrap();
|
||||
}
|
||||
@@ -417,7 +426,7 @@ mod tests {
|
||||
#[test]
|
||||
fn test_signed_intfastfield_default_val() -> crate::Result<()> {
|
||||
let path = Path::new("test");
|
||||
let directory: RAMDirectory = RAMDirectory::create();
|
||||
let directory: RamDirectory = RamDirectory::create();
|
||||
let mut schema_builder = Schema::builder();
|
||||
let i64_field = schema_builder.add_i64_field("field", FAST);
|
||||
let schema = schema_builder.build();
|
||||
@@ -429,7 +438,7 @@ mod tests {
|
||||
let doc = Document::default();
|
||||
fast_field_writers.add_document(&doc);
|
||||
fast_field_writers
|
||||
.serialize(&mut serializer, &HashMap::new())
|
||||
.serialize(&mut serializer, &HashMap::new(), None)
|
||||
.unwrap();
|
||||
serializer.close().unwrap();
|
||||
}
|
||||
@@ -456,7 +465,7 @@ mod tests {
|
||||
let path = Path::new("test");
|
||||
let permutation = generate_permutation();
|
||||
let n = permutation.len();
|
||||
let directory = RAMDirectory::create();
|
||||
let directory = RamDirectory::create();
|
||||
{
|
||||
let write: WritePtr = directory.open_write(Path::new("test"))?;
|
||||
let mut serializer = FastFieldSerializer::from_write(write)?;
|
||||
@@ -464,7 +473,7 @@ mod tests {
|
||||
for &x in &permutation {
|
||||
fast_field_writers.add_document(&doc!(*FIELD=>x));
|
||||
}
|
||||
fast_field_writers.serialize(&mut serializer, &HashMap::new())?;
|
||||
fast_field_writers.serialize(&mut serializer, &HashMap::new(), None)?;
|
||||
serializer.close()?;
|
||||
}
|
||||
let file = directory.open_read(&path)?;
|
||||
@@ -576,7 +585,7 @@ mod bench {
|
||||
use super::tests::{generate_permutation, SCHEMA};
|
||||
use super::*;
|
||||
use crate::common::CompositeFile;
|
||||
use crate::directory::{Directory, RAMDirectory, WritePtr};
|
||||
use crate::directory::{Directory, RamDirectory, WritePtr};
|
||||
use crate::fastfield::FastFieldReader;
|
||||
use std::collections::HashMap;
|
||||
use std::path::Path;
|
||||
@@ -612,7 +621,7 @@ mod bench {
|
||||
fn bench_intfastfield_linear_fflookup(b: &mut Bencher) {
|
||||
let path = Path::new("test");
|
||||
let permutation = generate_permutation();
|
||||
let directory: RAMDirectory = RAMDirectory::create();
|
||||
let directory: RamDirectory = RamDirectory::create();
|
||||
{
|
||||
let write: WritePtr = directory.open_write(Path::new("test")).unwrap();
|
||||
let mut serializer = FastFieldSerializer::from_write(write).unwrap();
|
||||
@@ -621,7 +630,7 @@ mod bench {
|
||||
fast_field_writers.add_document(&doc!(*FIELD=>x));
|
||||
}
|
||||
fast_field_writers
|
||||
.serialize(&mut serializer, &HashMap::new())
|
||||
.serialize(&mut serializer, &HashMap::new(), None)
|
||||
.unwrap();
|
||||
serializer.close().unwrap();
|
||||
}
|
||||
@@ -646,7 +655,7 @@ mod bench {
|
||||
fn bench_intfastfield_fflookup(b: &mut Bencher) {
|
||||
let path = Path::new("test");
|
||||
let permutation = generate_permutation();
|
||||
let directory: RAMDirectory = RAMDirectory::create();
|
||||
let directory: RamDirectory = RamDirectory::create();
|
||||
{
|
||||
let write: WritePtr = directory.open_write(Path::new("test")).unwrap();
|
||||
let mut serializer = FastFieldSerializer::from_write(write).unwrap();
|
||||
@@ -655,7 +664,7 @@ mod bench {
|
||||
fast_field_writers.add_document(&doc!(*FIELD=>x));
|
||||
}
|
||||
fast_field_writers
|
||||
.serialize(&mut serializer, &HashMap::new())
|
||||
.serialize(&mut serializer, &HashMap::new(), None)
|
||||
.unwrap();
|
||||
serializer.close().unwrap();
|
||||
}
|
||||
|
||||
@@ -77,12 +77,15 @@ mod tests {
|
||||
// add another second
|
||||
let two_secs_ahead = first_time_stamp + Duration::seconds(2);
|
||||
index_writer.add_document(doc!(date_field=>two_secs_ahead, date_field=>two_secs_ahead,date_field=>two_secs_ahead, time_i=>3i64));
|
||||
// add three seconds
|
||||
index_writer
|
||||
.add_document(doc!(date_field=>first_time_stamp + Duration::seconds(3), time_i=>4i64));
|
||||
assert!(index_writer.commit().is_ok());
|
||||
|
||||
let reader = index.reader().unwrap();
|
||||
let searcher = reader.searcher();
|
||||
let reader = searcher.segment_reader(0);
|
||||
assert_eq!(reader.num_docs(), 4);
|
||||
assert_eq!(reader.num_docs(), 5);
|
||||
|
||||
{
|
||||
let parser = QueryParser::for_index(&index, vec![date_field]);
|
||||
@@ -150,7 +153,7 @@ mod tests {
|
||||
{
|
||||
let parser = QueryParser::for_index(&index, vec![date_field]);
|
||||
let range_q = format!(
|
||||
"[{} TO {}]",
|
||||
"[{} TO {}}}",
|
||||
(first_time_stamp + Duration::seconds(1)).to_rfc3339(),
|
||||
(first_time_stamp + Duration::seconds(3)).to_rfc3339()
|
||||
);
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
use std::ops::Range;
|
||||
|
||||
use crate::fastfield::{FastFieldReader, FastValue};
|
||||
use crate::fastfield::{FastFieldReader, FastValue, MultiValueLength};
|
||||
use crate::DocId;
|
||||
|
||||
/// Reader for a multivalued `u64` fast field.
|
||||
@@ -56,6 +56,15 @@ impl<Item: FastValue> MultiValuedFastFieldReader<Item> {
|
||||
}
|
||||
}
|
||||
|
||||
impl<Item: FastValue> MultiValueLength for MultiValuedFastFieldReader<Item> {
|
||||
fn get_len(&self, doc_id: DocId) -> u64 {
|
||||
self.num_vals(doc_id) as u64
|
||||
}
|
||||
|
||||
fn get_total_len(&self) -> u64 {
|
||||
self.total_num_vals() as u64
|
||||
}
|
||||
}
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
|
||||
|
||||
@@ -1,13 +1,13 @@
|
||||
use crate::fastfield::serializer::FastSingleFieldSerializer;
|
||||
use crate::fastfield::value_to_u64;
|
||||
use crate::fastfield::FastFieldSerializer;
|
||||
use crate::postings::UnorderedTermId;
|
||||
use crate::schema::{Document, Field};
|
||||
use crate::termdict::TermOrdinal;
|
||||
use crate::DocId;
|
||||
use crate::{fastfield::value_to_u64, indexer::doc_id_mapping::DocIdMapping};
|
||||
use fnv::FnvHashMap;
|
||||
use std::io;
|
||||
use std::iter::once;
|
||||
use tantivy_bitpacker::minmax;
|
||||
|
||||
/// Writer for multi-valued (as in, more than one value per document)
|
||||
/// int fast field.
|
||||
@@ -48,6 +48,12 @@ impl MultiValuedFastFieldWriter {
|
||||
}
|
||||
}
|
||||
|
||||
/// The memory used (including child structures)
|
||||
pub fn mem_usage(&self) -> usize {
|
||||
self.vals.capacity() * std::mem::size_of::<UnorderedTermId>()
|
||||
+ self.doc_index.capacity() * std::mem::size_of::<u64>()
|
||||
}
|
||||
|
||||
/// Access the field associated to the `MultiValuedFastFieldWriter`
|
||||
pub fn field(&self) -> Field {
|
||||
self.field
|
||||
@@ -87,7 +93,34 @@ impl MultiValuedFastFieldWriter {
|
||||
self.vals.extend_from_slice(vals);
|
||||
doc
|
||||
}
|
||||
/// Returns an iterator over values per doc_id in ascending doc_id order.
|
||||
///
|
||||
/// Normally the order is simply the iteration order of self.doc_index.
|
||||
/// With doc_id_map it accounts for the new mapping, returning values in the order of the
|
||||
/// new doc_ids.
|
||||
fn get_ordered_values<'a: 'b, 'b>(
|
||||
&'a self,
|
||||
doc_id_map: Option<&'b DocIdMapping>,
|
||||
) -> impl Iterator<Item = &'b [u64]> {
|
||||
let doc_id_iter = if let Some(doc_id_map) = doc_id_map {
|
||||
Box::new(doc_id_map.iter_old_doc_ids().cloned()) as Box<dyn Iterator<Item = u32>>
|
||||
} else {
|
||||
Box::new(self.doc_index.iter().enumerate().map(|el| el.0 as u32))
|
||||
as Box<dyn Iterator<Item = u32>>
|
||||
};
|
||||
doc_id_iter.map(move |doc_id| self.get_values_for_doc_id(doc_id))
|
||||
}
|
||||
|
||||
/// Returns all values for a given doc_id
|
||||
fn get_values_for_doc_id(&self, doc_id: u32) -> &[u64] {
|
||||
let start_pos = self.doc_index[doc_id as usize] as usize;
|
||||
let end_pos = self
|
||||
.doc_index
|
||||
.get(doc_id as usize + 1)
|
||||
.cloned()
|
||||
.unwrap_or(self.vals.len() as u64) as usize; // special case, last doc_id has no offset information
|
||||
&self.vals[start_pos..end_pos]
|
||||
}
|
||||
/// Serializes fast field values by pushing them to the `FastFieldSerializer`.
|
||||
///
|
||||
/// If a mapping is given, the values are remapped *and sorted* before serialization.
|
||||
@@ -103,15 +136,20 @@ impl MultiValuedFastFieldWriter {
|
||||
&self,
|
||||
serializer: &mut FastFieldSerializer,
|
||||
mapping_opt: Option<&FnvHashMap<UnorderedTermId, TermOrdinal>>,
|
||||
doc_id_map: Option<&DocIdMapping>,
|
||||
) -> io::Result<()> {
|
||||
{
|
||||
// writing the offset index
|
||||
let mut doc_index_serializer =
|
||||
serializer.new_u64_fast_field_with_idx(self.field, 0, self.vals.len() as u64, 0)?;
|
||||
for &offset in &self.doc_index {
|
||||
|
||||
let mut offset = 0;
|
||||
for vals in self.get_ordered_values(doc_id_map) {
|
||||
doc_index_serializer.add_val(offset)?;
|
||||
offset += vals.len() as u64;
|
||||
}
|
||||
doc_index_serializer.add_val(self.vals.len() as u64)?;
|
||||
|
||||
doc_index_serializer.close_field()?;
|
||||
}
|
||||
{
|
||||
@@ -126,18 +164,10 @@ impl MultiValuedFastFieldWriter {
|
||||
1,
|
||||
)?;
|
||||
|
||||
let last_interval =
|
||||
self.doc_index.last().cloned().unwrap() as usize..self.vals.len();
|
||||
|
||||
let mut doc_vals: Vec<u64> = Vec::with_capacity(100);
|
||||
for range in self
|
||||
.doc_index
|
||||
.windows(2)
|
||||
.map(|interval| interval[0] as usize..interval[1] as usize)
|
||||
.chain(once(last_interval))
|
||||
{
|
||||
for vals in self.get_ordered_values(doc_id_map) {
|
||||
doc_vals.clear();
|
||||
let remapped_vals = self.vals[range]
|
||||
let remapped_vals = vals
|
||||
.iter()
|
||||
.map(|val| *mapping.get(val).expect("Missing term ordinal"));
|
||||
doc_vals.extend(remapped_vals);
|
||||
@@ -148,12 +178,15 @@ impl MultiValuedFastFieldWriter {
|
||||
}
|
||||
}
|
||||
None => {
|
||||
let val_min_max = crate::common::minmax(self.vals.iter().cloned());
|
||||
let val_min_max = minmax(self.vals.iter().cloned());
|
||||
let (val_min, val_max) = val_min_max.unwrap_or((0u64, 0u64));
|
||||
value_serializer =
|
||||
serializer.new_u64_fast_field_with_idx(self.field, val_min, val_max, 1)?;
|
||||
for &val in &self.vals {
|
||||
value_serializer.add_val(val)?;
|
||||
for vals in self.get_ordered_values(doc_id_map) {
|
||||
// sort values in case of remapped doc_ids?
|
||||
for &val in vals {
|
||||
value_serializer.add_val(val)?;
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@@ -1,10 +1,9 @@
|
||||
use super::FastValue;
|
||||
use crate::common::bitpacker::BitUnpacker;
|
||||
use crate::common::compute_num_bits;
|
||||
use crate::common::BinarySerializable;
|
||||
use crate::common::CompositeFile;
|
||||
use crate::directory::FileSlice;
|
||||
use crate::directory::{Directory, RAMDirectory, WritePtr};
|
||||
use crate::directory::OwnedBytes;
|
||||
use crate::directory::{Directory, RamDirectory, WritePtr};
|
||||
use crate::fastfield::{FastFieldSerializer, FastFieldsWriter};
|
||||
use crate::schema::Schema;
|
||||
use crate::schema::FAST;
|
||||
@@ -12,6 +11,8 @@ use crate::DocId;
|
||||
use std::collections::HashMap;
|
||||
use std::marker::PhantomData;
|
||||
use std::path::Path;
|
||||
use tantivy_bitpacker::compute_num_bits;
|
||||
use tantivy_bitpacker::BitUnpacker;
|
||||
|
||||
/// Trait for accessing a fastfield.
|
||||
///
|
||||
@@ -19,6 +20,7 @@ use std::path::Path;
|
||||
/// fast field is required.
|
||||
#[derive(Clone)]
|
||||
pub struct FastFieldReader<Item: FastValue> {
|
||||
bytes: OwnedBytes,
|
||||
bit_unpacker: BitUnpacker,
|
||||
min_value_u64: u64,
|
||||
max_value_u64: u64,
|
||||
@@ -33,8 +35,9 @@ impl<Item: FastValue> FastFieldReader<Item> {
|
||||
let amplitude = u64::deserialize(&mut bytes)?;
|
||||
let max_value = min_value + amplitude;
|
||||
let num_bits = compute_num_bits(amplitude);
|
||||
let bit_unpacker = BitUnpacker::new(bytes, num_bits);
|
||||
let bit_unpacker = BitUnpacker::new(num_bits);
|
||||
Ok(FastFieldReader {
|
||||
bytes,
|
||||
min_value_u64: min_value,
|
||||
max_value_u64: max_value,
|
||||
bit_unpacker,
|
||||
@@ -55,7 +58,7 @@ impl<Item: FastValue> FastFieldReader<Item> {
|
||||
}
|
||||
|
||||
pub(crate) fn get_u64(&self, doc: u64) -> Item {
|
||||
Item::from_u64(self.min_value_u64 + self.bit_unpacker.get(doc))
|
||||
Item::from_u64(self.min_value_u64 + self.bit_unpacker.get(doc, &self.bytes))
|
||||
}
|
||||
|
||||
/// Internally `multivalued` also use SingleValue Fast fields.
|
||||
@@ -118,24 +121,24 @@ impl<Item: FastValue> From<Vec<Item>> for FastFieldReader<Item> {
|
||||
let field = schema_builder.add_u64_field("field", FAST);
|
||||
let schema = schema_builder.build();
|
||||
let path = Path::new("__dummy__");
|
||||
let directory: RAMDirectory = RAMDirectory::create();
|
||||
let directory: RamDirectory = RamDirectory::create();
|
||||
{
|
||||
let write: WritePtr = directory
|
||||
.open_write(path)
|
||||
.expect("With a RAMDirectory, this should never fail.");
|
||||
.expect("With a RamDirectory, this should never fail.");
|
||||
let mut serializer = FastFieldSerializer::from_write(write)
|
||||
.expect("With a RAMDirectory, this should never fail.");
|
||||
.expect("With a RamDirectory, this should never fail.");
|
||||
let mut fast_field_writers = FastFieldsWriter::from_schema(&schema);
|
||||
{
|
||||
let fast_field_writer = fast_field_writers
|
||||
.get_field_writer(field)
|
||||
.expect("With a RAMDirectory, this should never fail.");
|
||||
.get_field_writer_mut(field)
|
||||
.expect("With a RamDirectory, this should never fail.");
|
||||
for val in vals {
|
||||
fast_field_writer.add_val(val.to_u64());
|
||||
}
|
||||
}
|
||||
fast_field_writers
|
||||
.serialize(&mut serializer, &HashMap::new())
|
||||
.serialize(&mut serializer, &HashMap::new(), None)
|
||||
.unwrap();
|
||||
serializer.close().unwrap();
|
||||
}
|
||||
|
||||
@@ -46,8 +46,8 @@ fn type_and_cardinality(field_type: &FieldType) -> Option<(FastType, Cardinality
|
||||
impl FastFieldReaders {
|
||||
pub(crate) fn new(schema: Schema, fast_fields_composite: CompositeFile) -> FastFieldReaders {
|
||||
FastFieldReaders {
|
||||
fast_fields_composite,
|
||||
schema,
|
||||
fast_fields_composite,
|
||||
}
|
||||
}
|
||||
|
||||
@@ -119,7 +119,7 @@ impl FastFieldReaders {
|
||||
|
||||
/// Returns the `u64` fast field reader associated to `field`.
|
||||
///
|
||||
/// If `field` is not a u64 fast field, this method returns `None`.
|
||||
/// If `field` is not a u64 fast field, this method returns an Error.
|
||||
pub fn u64(&self, field: Field) -> crate::Result<FastFieldReader<u64>> {
|
||||
self.check_type(field, FastType::U64, Cardinality::SingleValue)?;
|
||||
self.typed_fast_field_reader(field)
|
||||
@@ -135,7 +135,7 @@ impl FastFieldReaders {
|
||||
|
||||
/// Returns the `i64` fast field reader associated to `field`.
|
||||
///
|
||||
/// If `field` is not a i64 fast field, this method returns `None`.
|
||||
/// If `field` is not a i64 fast field, this method returns an Error.
|
||||
pub fn i64(&self, field: Field) -> crate::Result<FastFieldReader<i64>> {
|
||||
self.check_type(field, FastType::I64, Cardinality::SingleValue)?;
|
||||
self.typed_fast_field_reader(field)
|
||||
@@ -143,7 +143,7 @@ impl FastFieldReaders {
|
||||
|
||||
/// Returns the `i64` fast field reader associated to `field`.
|
||||
///
|
||||
/// If `field` is not a i64 fast field, this method returns `None`.
|
||||
/// If `field` is not a i64 fast field, this method returns an Error.
|
||||
pub fn date(&self, field: Field) -> crate::Result<FastFieldReader<crate::DateTime>> {
|
||||
self.check_type(field, FastType::Date, Cardinality::SingleValue)?;
|
||||
self.typed_fast_field_reader(field)
|
||||
@@ -151,7 +151,7 @@ impl FastFieldReaders {
|
||||
|
||||
/// Returns the `f64` fast field reader associated to `field`.
|
||||
///
|
||||
/// If `field` is not a f64 fast field, this method returns `None`.
|
||||
/// If `field` is not a f64 fast field, this method returns an Error.
|
||||
pub fn f64(&self, field: Field) -> crate::Result<FastFieldReader<f64>> {
|
||||
self.check_type(field, FastType::F64, Cardinality::SingleValue)?;
|
||||
self.typed_fast_field_reader(field)
|
||||
@@ -159,15 +159,23 @@ impl FastFieldReaders {
|
||||
|
||||
/// Returns a `u64s` multi-valued fast field reader associated to `field`.
|
||||
///
|
||||
/// If `field` is not a u64 multi-valued fast field, this method returns `None`.
|
||||
/// If `field` is not a u64 multi-valued fast field, this method returns an Error.
|
||||
pub fn u64s(&self, field: Field) -> crate::Result<MultiValuedFastFieldReader<u64>> {
|
||||
self.check_type(field, FastType::U64, Cardinality::MultiValues)?;
|
||||
self.typed_fast_field_multi_reader(field)
|
||||
}
|
||||
|
||||
/// Returns a `u64s` multi-valued fast field reader associated to `field`, regardless of whether the given
|
||||
/// field is effectively of type `u64` or not.
|
||||
///
|
||||
/// If `field` is not a u64 multi-valued fast field, this method returns an Error.
|
||||
pub fn u64s_lenient(&self, field: Field) -> crate::Result<MultiValuedFastFieldReader<u64>> {
|
||||
self.typed_fast_field_multi_reader(field)
|
||||
}
|
||||
|
||||
/// Returns a `i64s` multi-valued fast field reader associated to `field`.
|
||||
///
|
||||
/// If `field` is not a i64 multi-valued fast field, this method returns `None`.
|
||||
/// If `field` is not a i64 multi-valued fast field, this method returns an Error.
|
||||
pub fn i64s(&self, field: Field) -> crate::Result<MultiValuedFastFieldReader<i64>> {
|
||||
self.check_type(field, FastType::I64, Cardinality::MultiValues)?;
|
||||
self.typed_fast_field_multi_reader(field)
|
||||
@@ -175,7 +183,7 @@ impl FastFieldReaders {
|
||||
|
||||
/// Returns a `f64s` multi-valued fast field reader associated to `field`.
|
||||
///
|
||||
/// If `field` is not a f64 multi-valued fast field, this method returns `None`.
|
||||
/// If `field` is not a f64 multi-valued fast field, this method returns an Error.
|
||||
pub fn f64s(&self, field: Field) -> crate::Result<MultiValuedFastFieldReader<f64>> {
|
||||
self.check_type(field, FastType::F64, Cardinality::MultiValues)?;
|
||||
self.typed_fast_field_multi_reader(field)
|
||||
@@ -183,7 +191,7 @@ impl FastFieldReaders {
|
||||
|
||||
/// Returns a `crate::DateTime` multi-valued fast field reader associated to `field`.
|
||||
///
|
||||
/// If `field` is not a `crate::DateTime` multi-valued fast field, this method returns `None`.
|
||||
/// If `field` is not a `crate::DateTime` multi-valued fast field, this method returns an Error.
|
||||
pub fn dates(
|
||||
&self,
|
||||
field: Field,
|
||||
@@ -194,7 +202,7 @@ impl FastFieldReaders {
|
||||
|
||||
/// Returns the `bytes` fast field reader associated to `field`.
|
||||
///
|
||||
/// If `field` is not a bytes fast field, returns `None`.
|
||||
/// If `field` is not a bytes fast field, returns an Error.
|
||||
pub fn bytes(&self, field: Field) -> crate::Result<BytesFastFieldReader> {
|
||||
let field_entry = self.schema.get_field_entry(field);
|
||||
if let FieldType::Bytes(bytes_option) = field_entry.field_type() {
|
||||
|
||||
@@ -1,11 +1,11 @@
|
||||
use crate::common::bitpacker::BitPacker;
|
||||
use crate::common::compute_num_bits;
|
||||
use crate::common::BinarySerializable;
|
||||
use crate::common::CompositeWrite;
|
||||
use crate::common::CountingWriter;
|
||||
use crate::directory::WritePtr;
|
||||
use crate::schema::Field;
|
||||
use std::io::{self, Write};
|
||||
use tantivy_bitpacker::compute_num_bits;
|
||||
use tantivy_bitpacker::BitPacker;
|
||||
|
||||
/// `FastFieldSerializer` is in charge of serializing
|
||||
/// fastfields on disk.
|
||||
@@ -107,8 +107,8 @@ impl<'a, W: Write> FastSingleFieldSerializer<'a, W> {
|
||||
let num_bits = compute_num_bits(amplitude);
|
||||
let bit_packer = BitPacker::new();
|
||||
Ok(FastSingleFieldSerializer {
|
||||
write,
|
||||
bit_packer,
|
||||
write,
|
||||
min_value,
|
||||
num_bits,
|
||||
})
|
||||
|
||||
@@ -1,14 +1,14 @@
|
||||
use super::multivalued::MultiValuedFastFieldWriter;
|
||||
use crate::common;
|
||||
use crate::common::BinarySerializable;
|
||||
use crate::common::VInt;
|
||||
use crate::fastfield::{BytesFastFieldWriter, FastFieldSerializer};
|
||||
use crate::indexer::doc_id_mapping::DocIdMapping;
|
||||
use crate::postings::UnorderedTermId;
|
||||
use crate::schema::{Cardinality, Document, Field, FieldEntry, FieldType, Schema};
|
||||
use crate::termdict::TermOrdinal;
|
||||
use fnv::FnvHashMap;
|
||||
use std::collections::HashMap;
|
||||
use std::io;
|
||||
use tantivy_bitpacker::BlockedBitpacker;
|
||||
|
||||
/// The fastfieldswriter regroup all of the fast field writers.
|
||||
pub struct FastFieldsWriter {
|
||||
@@ -72,8 +72,34 @@ impl FastFieldsWriter {
|
||||
}
|
||||
}
|
||||
|
||||
/// The memory used (including child structures)
|
||||
pub fn mem_usage(&self) -> usize {
|
||||
self.single_value_writers
|
||||
.iter()
|
||||
.map(|w| w.mem_usage())
|
||||
.sum::<usize>()
|
||||
+ self
|
||||
.multi_values_writers
|
||||
.iter()
|
||||
.map(|w| w.mem_usage())
|
||||
.sum::<usize>()
|
||||
+ self
|
||||
.bytes_value_writers
|
||||
.iter()
|
||||
.map(|w| w.mem_usage())
|
||||
.sum::<usize>()
|
||||
}
|
||||
|
||||
/// Get the `FastFieldWriter` associated to a field.
|
||||
pub fn get_field_writer(&mut self, field: Field) -> Option<&mut IntFastFieldWriter> {
|
||||
pub fn get_field_writer(&self, field: Field) -> Option<&IntFastFieldWriter> {
|
||||
// TODO optimize
|
||||
self.single_value_writers
|
||||
.iter()
|
||||
.find(|field_writer| field_writer.field() == field)
|
||||
}
|
||||
|
||||
/// Get the `FastFieldWriter` associated to a field.
|
||||
pub fn get_field_writer_mut(&mut self, field: Field) -> Option<&mut IntFastFieldWriter> {
|
||||
// TODO optimize
|
||||
self.single_value_writers
|
||||
.iter_mut()
|
||||
@@ -84,7 +110,7 @@ impl FastFieldsWriter {
|
||||
///
|
||||
/// Returns None if the field does not exist, or is not
|
||||
/// configured as a multivalued fastfield in the schema.
|
||||
pub fn get_multivalue_writer(
|
||||
pub fn get_multivalue_writer_mut(
|
||||
&mut self,
|
||||
field: Field,
|
||||
) -> Option<&mut MultiValuedFastFieldWriter> {
|
||||
@@ -98,7 +124,7 @@ impl FastFieldsWriter {
|
||||
///
|
||||
/// Returns None if the field does not exist, or is not
|
||||
/// configured as a bytes fastfield in the schema.
|
||||
pub fn get_bytes_writer(&mut self, field: Field) -> Option<&mut BytesFastFieldWriter> {
|
||||
pub fn get_bytes_writer_mut(&mut self, field: Field) -> Option<&mut BytesFastFieldWriter> {
|
||||
// TODO optimize
|
||||
self.bytes_value_writers
|
||||
.iter_mut()
|
||||
@@ -124,17 +150,18 @@ impl FastFieldsWriter {
|
||||
&self,
|
||||
serializer: &mut FastFieldSerializer,
|
||||
mapping: &HashMap<Field, FnvHashMap<UnorderedTermId, TermOrdinal>>,
|
||||
doc_id_map: Option<&DocIdMapping>,
|
||||
) -> io::Result<()> {
|
||||
for field_writer in &self.single_value_writers {
|
||||
field_writer.serialize(serializer)?;
|
||||
field_writer.serialize(serializer, doc_id_map)?;
|
||||
}
|
||||
|
||||
for field_writer in &self.multi_values_writers {
|
||||
let field = field_writer.field();
|
||||
field_writer.serialize(serializer, mapping.get(&field))?;
|
||||
field_writer.serialize(serializer, mapping.get(&field), doc_id_map)?;
|
||||
}
|
||||
for field_writer in &self.bytes_value_writers {
|
||||
field_writer.serialize(serializer)?;
|
||||
field_writer.serialize(serializer, doc_id_map)?;
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
@@ -157,7 +184,7 @@ impl FastFieldsWriter {
|
||||
/// using `common::i64_to_u64` and `common::f64_to_u64`.
|
||||
pub struct IntFastFieldWriter {
|
||||
field: Field,
|
||||
vals: Vec<u8>,
|
||||
vals: BlockedBitpacker,
|
||||
val_count: usize,
|
||||
val_if_missing: u64,
|
||||
val_min: u64,
|
||||
@@ -169,7 +196,7 @@ impl IntFastFieldWriter {
|
||||
pub fn new(field: Field) -> IntFastFieldWriter {
|
||||
IntFastFieldWriter {
|
||||
field,
|
||||
vals: Vec::new(),
|
||||
vals: BlockedBitpacker::new(),
|
||||
val_count: 0,
|
||||
val_if_missing: 0u64,
|
||||
val_min: u64::max_value(),
|
||||
@@ -177,6 +204,11 @@ impl IntFastFieldWriter {
|
||||
}
|
||||
}
|
||||
|
||||
/// The memory used (including child structures)
|
||||
pub fn mem_usage(&self) -> usize {
|
||||
self.vals.mem_usage()
|
||||
}
|
||||
|
||||
/// Returns the field that this writer is targeting.
|
||||
pub fn field(&self) -> Field {
|
||||
self.field
|
||||
@@ -196,9 +228,7 @@ impl IntFastFieldWriter {
|
||||
/// associated to the document with the `DocId` n.
|
||||
/// (Well, `n-1` actually because of 0-indexing)
|
||||
pub fn add_val(&mut self, val: u64) {
|
||||
VInt(val)
|
||||
.serialize(&mut self.vals)
|
||||
.expect("unable to serialize VInt to Vec");
|
||||
self.vals.add(val);
|
||||
|
||||
if val > self.val_max {
|
||||
self.val_max = val;
|
||||
@@ -234,20 +264,32 @@ impl IntFastFieldWriter {
|
||||
self.add_val(val);
|
||||
}
|
||||
|
||||
/// Extract the stored data
|
||||
pub(crate) fn get_data(&self) -> Vec<u64> {
|
||||
self.vals.iter().collect::<Vec<u64>>()
|
||||
}
|
||||
|
||||
/// Push the fast fields value to the `FastFieldWriter`.
|
||||
pub fn serialize(&self, serializer: &mut FastFieldSerializer) -> io::Result<()> {
|
||||
pub fn serialize(
|
||||
&self,
|
||||
serializer: &mut FastFieldSerializer,
|
||||
doc_id_map: Option<&DocIdMapping>,
|
||||
) -> io::Result<()> {
|
||||
let (min, max) = if self.val_min > self.val_max {
|
||||
(0, 0)
|
||||
} else {
|
||||
(self.val_min, self.val_max)
|
||||
};
|
||||
|
||||
let mut single_field_serializer = serializer.new_u64_fast_field(self.field, min, max)?;
|
||||
|
||||
let mut cursor = self.vals.as_slice();
|
||||
while let Ok(VInt(val)) = VInt::deserialize(&mut cursor) {
|
||||
single_field_serializer.add_val(val)?;
|
||||
}
|
||||
if let Some(doc_id_map) = doc_id_map {
|
||||
for doc_id in doc_id_map.iter_old_doc_ids() {
|
||||
single_field_serializer.add_val(self.vals.get(*doc_id as usize))?;
|
||||
}
|
||||
} else {
|
||||
for val in self.vals.iter() {
|
||||
single_field_serializer.add_val(val)?;
|
||||
}
|
||||
};
|
||||
|
||||
single_field_serializer.close_field()
|
||||
}
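
A small sketch (not part of the diff) of what the `doc_id_map` branch above does: when a mapping is present, values are written in new-doc-id order by indexing the stored values with each old doc_id. `serialize_in_mapped_order` and its inputs are hypothetical names used only for illustration.

fn serialize_in_mapped_order(vals: &[u64], new_to_old: &[u32]) -> Vec<u64> {
    // new_to_old[new_doc_id] = old_doc_id, mirroring DocIdMapping::iter_old_doc_ids().
    new_to_old
        .iter()
        .map(|&old_doc_id| vals[old_doc_id as usize])
        .collect()
}

fn main() {
    let vals = vec![40u64, 20, 100, 10, 30]; // one value per old doc_id
    let new_to_old = vec![3u32, 1, 4, 0, 2]; // new order: ascending by value
    assert_eq!(serialize_in_mapped_order(&vals, &new_to_old), vec![10, 20, 30, 40, 100]);
}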
|
||||
|
||||
@@ -1,9 +1,9 @@
|
||||
#[inline(always)]
|
||||
#[inline]
|
||||
pub fn id_to_fieldnorm(id: u8) -> u32 {
|
||||
FIELD_NORMS_TABLE[id as usize]
|
||||
}
|
||||
|
||||
#[inline(always)]
|
||||
#[inline]
|
||||
pub fn fieldnorm_to_id(fieldnorm: u32) -> u8 {
|
||||
FIELD_NORMS_TABLE
|
||||
.binary_search(&fieldnorm)
|
||||
|
||||
@@ -15,7 +15,7 @@
|
||||
//! precompute computationally expensive functions of the fieldnorm
|
||||
//! in a very short array.
|
||||
//!
|
||||
//! This trick is used by the BM25 similarity.
|
||||
//! This trick is used by the Bm25 similarity.
|
||||
mod code;
|
||||
mod reader;
|
||||
mod serializer;
|
||||
|
||||
@@ -117,8 +117,8 @@ impl FieldNormReader {
|
||||
/// The fieldnorm is a value approximating the number
|
||||
/// of tokens in a given field of the `doc_id`.
|
||||
///
|
||||
/// It is imprecise, and always lower than the actual
|
||||
/// number of tokens.
|
||||
/// It is imprecise, and equal or lower than
|
||||
/// the actual number of tokens.
|
||||
///
|
||||
/// The fieldnorm is effectively decoded from the
|
||||
/// `fieldnorm_id` by doing a simple table lookup.
|
||||
@@ -133,7 +133,7 @@ impl FieldNormReader {
|
||||
}
|
||||
|
||||
/// Returns the `fieldnorm_id` associated to a document.
|
||||
#[inline(always)]
|
||||
#[inline]
|
||||
pub fn fieldnorm_id(&self, doc_id: DocId) -> u8 {
|
||||
match &self.0 {
|
||||
ReaderImplEnum::FromData(data) => {
|
||||
@@ -145,14 +145,14 @@ impl FieldNormReader {
|
||||
}
|
||||
|
||||
/// Converts a `fieldnorm_id` into a fieldnorm.
|
||||
#[inline(always)]
|
||||
#[inline]
|
||||
pub fn id_to_fieldnorm(id: u8) -> u32 {
|
||||
id_to_fieldnorm(id)
|
||||
}
|
||||
|
||||
/// Converts a `fieldnorm` into a `fieldnorm_id`.
|
||||
/// (This function is not injective).
|
||||
#[inline(always)]
|
||||
#[inline]
|
||||
pub fn fieldnorm_to_id(fieldnorm: u32) -> u8 {
|
||||
fieldnorm_to_id(fieldnorm)
|
||||
}
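
A self-contained sketch (not part of the diff) of the precomputed-table trick these wrappers rely on, using a hypothetical 4-entry table rather than tantivy's real `FIELD_NORMS_TABLE`: decoding is an array lookup, and encoding rounds down to the closest table entry, which is why `fieldnorm_to_id` is not injective.

const TOY_FIELD_NORMS_TABLE: [u32; 4] = [0, 1, 2, 4];

fn toy_id_to_fieldnorm(id: u8) -> u32 {
    TOY_FIELD_NORMS_TABLE[id as usize]
}

fn toy_fieldnorm_to_id(fieldnorm: u32) -> u8 {
    match TOY_FIELD_NORMS_TABLE.binary_search(&fieldnorm) {
        Ok(idx) => idx as u8,
        // Not an exact table entry: round down to the previous one.
        Err(idx) => (idx - 1) as u8,
    }
}

fn main() {
    assert_eq!(toy_fieldnorm_to_id(3), 2); // 3 is approximated by table entry 2
    assert_eq!(toy_id_to_fieldnorm(toy_fieldnorm_to_id(3)), 2);
}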
|
||||
|
||||
@@ -1,4 +1,4 @@
|
||||
use crate::DocId;
|
||||
use crate::{indexer::doc_id_mapping::DocIdMapping, DocId};
|
||||
|
||||
use super::fieldnorm_to_id;
|
||||
use super::FieldNormsSerializer;
|
||||
@@ -50,6 +50,13 @@ impl FieldNormsWriter {
|
||||
}
|
||||
}
|
||||
|
||||
/// The memory used, including child structures
|
||||
pub fn mem_usage(&self) -> usize {
|
||||
self.fieldnorms_buffer
|
||||
.iter()
|
||||
.map(|buf| buf.capacity())
|
||||
.sum()
|
||||
}
|
||||
/// Ensure that all documents in 0..max_doc have a byte associated with them
|
||||
/// in each of the fieldnorm vectors.
|
||||
///
|
||||
@@ -80,10 +87,23 @@ impl FieldNormsWriter {
|
||||
}
|
||||
|
||||
/// Serialize the seen fieldnorm values to the serializer for all fields.
|
||||
pub fn serialize(&self, mut fieldnorms_serializer: FieldNormsSerializer) -> io::Result<()> {
|
||||
pub fn serialize(
|
||||
&self,
|
||||
mut fieldnorms_serializer: FieldNormsSerializer,
|
||||
doc_id_map: Option<&DocIdMapping>,
|
||||
) -> io::Result<()> {
|
||||
for &field in self.fields.iter() {
|
||||
let fieldnorm_values: &[u8] = &self.fieldnorms_buffer[field.field_id() as usize][..];
|
||||
fieldnorms_serializer.serialize_field(field, fieldnorm_values)?;
|
||||
if let Some(doc_id_map) = doc_id_map {
|
||||
let mut mapped_fieldnorm_values = vec![];
|
||||
mapped_fieldnorm_values.resize(fieldnorm_values.len(), 0u8);
|
||||
for (new_doc_id, old_doc_id) in doc_id_map.iter_old_doc_ids().enumerate() {
|
||||
mapped_fieldnorm_values[new_doc_id] = fieldnorm_values[*old_doc_id as usize];
|
||||
}
|
||||
fieldnorms_serializer.serialize_field(field, &mapped_fieldnorm_values)?;
|
||||
} else {
|
||||
fieldnorms_serializer.serialize_field(field, fieldnorm_values)?;
|
||||
}
|
||||
}
|
||||
fieldnorms_serializer.close()?;
|
||||
Ok(())
|
||||
|
||||
423 src/indexer/doc_id_mapping.rs Normal file
@@ -0,0 +1,423 @@
|
||||
//! This module is used when sorting the index by a property, e.g.
|
||||
//! to get mappings from old doc_id to new doc_id and vice versa, after sorting
|
||||
//!
|
||||
|
||||
use super::SegmentWriter;
|
||||
use crate::{
|
||||
schema::{Field, Schema},
|
||||
DocId, IndexSortByField, Order, TantivyError,
|
||||
};
|
||||
use std::cmp::Reverse;
|
||||
|
||||
/// Struct to provide mapping from old doc_id to new doc_id and vice versa
|
||||
pub struct DocIdMapping {
|
||||
new_doc_id_to_old: Vec<DocId>,
|
||||
old_doc_id_to_new: Vec<DocId>,
|
||||
}
|
||||
|
||||
impl DocIdMapping {
|
||||
/// returns the new doc_id for the old doc_id
|
||||
pub fn get_new_doc_id(&self, doc_id: DocId) -> DocId {
|
||||
self.old_doc_id_to_new[doc_id as usize]
|
||||
}
|
||||
/// returns the old doc_id for the new doc_id
|
||||
pub fn get_old_doc_id(&self, doc_id: DocId) -> DocId {
|
||||
self.new_doc_id_to_old[doc_id as usize]
|
||||
}
|
||||
/// iterate over old doc_ids in order of the new doc_ids
|
||||
pub fn iter_old_doc_ids(&self) -> std::slice::Iter<'_, DocId> {
|
||||
self.new_doc_id_to_old.iter()
|
||||
}
|
||||
}
|
||||
|
||||
pub(crate) fn expect_field_id_for_sort_field(
|
||||
schema: &Schema,
|
||||
sort_by_field: &IndexSortByField,
|
||||
) -> crate::Result<Field> {
|
||||
schema.get_field(&sort_by_field.field).ok_or_else(|| {
|
||||
TantivyError::InvalidArgument(format!(
|
||||
"field to sort index by not found: {:?}",
|
||||
sort_by_field.field
|
||||
))
|
||||
})
|
||||
}
|
||||
|
||||
// Generates a document mapping in the form of [index new doc_id] -> old doc_id
|
||||
// TODO detect if field is already sorted and discard mapping
|
||||
pub(crate) fn get_doc_id_mapping_from_field(
|
||||
sort_by_field: IndexSortByField,
|
||||
segment_writer: &SegmentWriter,
|
||||
) -> crate::Result<DocIdMapping> {
|
||||
let schema = segment_writer.segment_serializer.segment().schema();
|
||||
let field_id = expect_field_id_for_sort_field(&schema, &sort_by_field)?; // for now expect fastfield, but not strictly required
|
||||
let fast_field = segment_writer
|
||||
.fast_field_writers
|
||||
.get_field_writer(field_id)
|
||||
.ok_or_else(|| {
|
||||
TantivyError::InvalidArgument(format!(
|
||||
"sort index by field is required to be a fast field {:?}",
|
||||
sort_by_field.field
|
||||
))
|
||||
})?;
|
||||
|
||||
// create new doc_id to old doc_id index (used in fast_field_writers)
|
||||
let data = fast_field.get_data();
|
||||
let mut doc_id_and_data = data
|
||||
.into_iter()
|
||||
.enumerate()
|
||||
.map(|el| (el.0 as DocId, el.1))
|
||||
.collect::<Vec<_>>();
|
||||
if sort_by_field.order == Order::Desc {
|
||||
doc_id_and_data.sort_by_key(|k| Reverse(k.1));
|
||||
} else {
|
||||
doc_id_and_data.sort_by_key(|k| k.1);
|
||||
}
|
||||
let new_doc_id_to_old = doc_id_and_data
|
||||
.into_iter()
|
||||
.map(|el| el.0)
|
||||
.collect::<Vec<_>>();
|
||||
|
||||
// create old doc_id to new doc_id index (used in posting recorder)
|
||||
let max_doc = new_doc_id_to_old.len();
|
||||
let mut old_doc_id_to_new = vec![0; max_doc];
|
||||
for i in 0..max_doc {
|
||||
old_doc_id_to_new[new_doc_id_to_old[i] as usize] = i as DocId;
|
||||
}
|
||||
let doc_id_map = DocIdMapping {
|
||||
new_doc_id_to_old,
|
||||
old_doc_id_to_new,
|
||||
};
|
||||
Ok(doc_id_map)
|
||||
}
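
A standalone sketch (not part of the diff) of the permutation logic in `get_doc_id_mapping_from_field` above: sort old doc_ids by their fast field value, then invert that permutation to get the old-to-new table. `build_mapping` and its test values are illustrative only.

use std::cmp::Reverse;

fn build_mapping(values: &[u64], descending: bool) -> (Vec<u32>, Vec<u32>) {
    // new_to_old[new_doc_id] = old_doc_id, ordered by the sort key.
    let mut new_to_old: Vec<u32> = (0..values.len() as u32).collect();
    if descending {
        new_to_old.sort_by_key(|&old| Reverse(values[old as usize]));
    } else {
        new_to_old.sort_by_key(|&old| values[old as usize]);
    }
    // Invert the permutation: old_to_new[old_doc_id] = new_doc_id.
    let mut old_to_new = vec![0u32; values.len()];
    for (new_id, &old_id) in new_to_old.iter().enumerate() {
        old_to_new[old_id as usize] = new_id as u32;
    }
    (new_to_old, old_to_new)
}

fn main() {
    let values = vec![40u64, 20, 10, 30]; // fast field value per old doc_id
    let (new_to_old, old_to_new) = build_mapping(&values, false);
    assert_eq!(new_to_old, vec![2, 1, 3, 0]);
    assert_eq!(old_to_new, vec![3, 1, 0, 2]);
}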
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests_indexsorting {
|
||||
use crate::{collector::TopDocs, query::QueryParser, schema::*};
|
||||
use crate::{schema::Schema, DocAddress};
|
||||
use crate::{Index, IndexSettings, IndexSortByField, Order};
|
||||
|
||||
fn create_test_index(
|
||||
index_settings: Option<IndexSettings>,
|
||||
text_field_options: TextOptions,
|
||||
) -> Index {
|
||||
let mut schema_builder = Schema::builder();
|
||||
|
||||
let my_text_field = schema_builder.add_text_field("text_field", text_field_options);
|
||||
let my_string_field = schema_builder.add_text_field("string_field", STRING | STORED);
|
||||
let my_number = schema_builder.add_u64_field(
|
||||
"my_number",
|
||||
IntOptions::default().set_fast(Cardinality::SingleValue),
|
||||
);
|
||||
|
||||
let multi_numbers = schema_builder.add_u64_field(
|
||||
"multi_numbers",
|
||||
IntOptions::default().set_fast(Cardinality::MultiValues),
|
||||
);
|
||||
|
||||
let schema = schema_builder.build();
|
||||
let mut index_builder = Index::builder().schema(schema);
|
||||
if let Some(settings) = index_settings {
|
||||
index_builder = index_builder.settings(settings);
|
||||
}
|
||||
let index = index_builder.create_in_ram().unwrap();
|
||||
|
||||
let mut index_writer = index.writer_for_tests().unwrap();
|
||||
index_writer.add_document(doc!(my_number=>40_u64));
|
||||
index_writer
|
||||
.add_document(doc!(my_number=>20_u64, multi_numbers => 5_u64, multi_numbers => 6_u64));
|
||||
index_writer.add_document(doc!(my_number=>100_u64));
|
||||
index_writer.add_document(
|
||||
doc!(my_number=>10_u64, my_string_field=> "blublub", my_text_field => "some text"),
|
||||
);
|
||||
index_writer.add_document(doc!(my_number=>30_u64, multi_numbers => 3_u64 ));
|
||||
index_writer.commit().unwrap();
|
||||
index
|
||||
}
|
||||
fn get_text_options() -> TextOptions {
|
||||
TextOptions::default().set_indexing_options(
|
||||
TextFieldIndexing::default().set_index_option(IndexRecordOption::Basic),
|
||||
)
|
||||
}
|
||||
#[test]
|
||||
fn test_sort_index_test_text_field() -> crate::Result<()> {
|
||||
// there are different serializers for different settings in postings/recorder.rs
|
||||
// test remapping for all of them
|
||||
let options = vec![
|
||||
get_text_options(),
|
||||
get_text_options().set_indexing_options(
|
||||
TextFieldIndexing::default().set_index_option(IndexRecordOption::WithFreqs),
|
||||
),
|
||||
get_text_options().set_indexing_options(
|
||||
TextFieldIndexing::default()
|
||||
.set_index_option(IndexRecordOption::WithFreqsAndPositions),
|
||||
),
|
||||
];
|
||||
|
||||
for option in options {
|
||||
//let options = get_text_options();
|
||||
// no index_sort
|
||||
let index = create_test_index(None, option.clone());
|
||||
let my_text_field = index.schema().get_field("text_field").unwrap();
|
||||
let searcher = index.reader()?.searcher();
|
||||
|
||||
let query = QueryParser::for_index(&index, vec![my_text_field]).parse_query("text")?;
|
||||
let top_docs: Vec<(f32, DocAddress)> =
|
||||
searcher.search(&query, &TopDocs::with_limit(3))?;
|
||||
assert_eq!(
|
||||
top_docs.iter().map(|el| el.1.doc_id).collect::<Vec<_>>(),
|
||||
vec![3]
|
||||
);
|
||||
|
||||
// sort by field asc
|
||||
let index = create_test_index(
|
||||
Some(IndexSettings {
|
||||
sort_by_field: Some(IndexSortByField {
|
||||
field: "my_number".to_string(),
|
||||
order: Order::Asc,
|
||||
}),
|
||||
}),
|
||||
option.clone(),
|
||||
);
|
||||
let my_text_field = index.schema().get_field("text_field").unwrap();
|
||||
let reader = index.reader()?;
|
||||
let searcher = reader.searcher();
|
||||
|
||||
let query = QueryParser::for_index(&index, vec![my_text_field]).parse_query("text")?;
|
||||
let top_docs: Vec<(f32, DocAddress)> =
|
||||
searcher.search(&query, &TopDocs::with_limit(3))?;
|
||||
assert_eq!(
|
||||
top_docs.iter().map(|el| el.1.doc_id).collect::<Vec<_>>(),
|
||||
vec![0]
|
||||
);
|
||||
|
||||
// test new field norm mapping
|
||||
{
|
||||
let my_text_field = index.schema().get_field("text_field").unwrap();
|
||||
let fieldnorm_reader = searcher
|
||||
.segment_reader(0)
|
||||
.get_fieldnorms_reader(my_text_field)?;
|
||||
assert_eq!(fieldnorm_reader.fieldnorm(0), 2); // some text
|
||||
assert_eq!(fieldnorm_reader.fieldnorm(1), 0);
|
||||
}
|
||||
// sort by field desc
|
||||
let index = create_test_index(
|
||||
Some(IndexSettings {
|
||||
sort_by_field: Some(IndexSortByField {
|
||||
field: "my_number".to_string(),
|
||||
order: Order::Desc,
|
||||
}),
|
||||
}),
|
||||
option.clone(),
|
||||
);
|
||||
let my_string_field = index.schema().get_field("text_field").unwrap();
|
||||
let searcher = index.reader()?.searcher();
|
||||
|
||||
let query =
|
||||
QueryParser::for_index(&index, vec![my_string_field]).parse_query("text")?;
|
||||
let top_docs: Vec<(f32, DocAddress)> =
|
||||
searcher.search(&query, &TopDocs::with_limit(3))?;
|
||||
assert_eq!(
|
||||
top_docs.iter().map(|el| el.1.doc_id).collect::<Vec<_>>(),
|
||||
vec![4]
|
||||
);
|
||||
// test new field norm mapping
|
||||
{
|
||||
let my_text_field = index.schema().get_field("text_field").unwrap();
|
||||
let fieldnorm_reader = searcher
|
||||
.segment_reader(0)
|
||||
.get_fieldnorms_reader(my_text_field)?;
|
||||
assert_eq!(fieldnorm_reader.fieldnorm(0), 0);
|
||||
assert_eq!(fieldnorm_reader.fieldnorm(1), 0);
|
||||
assert_eq!(fieldnorm_reader.fieldnorm(2), 0);
|
||||
assert_eq!(fieldnorm_reader.fieldnorm(3), 0);
|
||||
assert_eq!(fieldnorm_reader.fieldnorm(4), 2); // some text
|
||||
}
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
#[test]
|
||||
fn test_sort_index_get_documents() -> crate::Result<()> {
|
||||
// default baseline
|
||||
let index = create_test_index(None, get_text_options());
|
||||
let my_string_field = index.schema().get_field("string_field").unwrap();
|
||||
let searcher = index.reader()?.searcher();
|
||||
{
|
||||
assert_eq!(
|
||||
searcher
|
||||
.doc(DocAddress::new(0, 0))?
|
||||
.get_first(my_string_field),
|
||||
None
|
||||
);
|
||||
assert_eq!(
|
||||
searcher
|
||||
.doc(DocAddress::new(0, 3))?
|
||||
.get_first(my_string_field)
|
||||
.unwrap()
|
||||
.text(),
|
||||
Some("blublub")
|
||||
);
|
||||
}
|
||||
// sort by field asc
|
||||
let index = create_test_index(
|
||||
Some(IndexSettings {
|
||||
sort_by_field: Some(IndexSortByField {
|
||||
field: "my_number".to_string(),
|
||||
order: Order::Asc,
|
||||
}),
|
||||
}),
|
||||
get_text_options(),
|
||||
);
|
||||
let my_string_field = index.schema().get_field("string_field").unwrap();
|
||||
let searcher = index.reader()?.searcher();
|
||||
{
|
||||
assert_eq!(
|
||||
searcher
|
||||
.doc(DocAddress::new(0, 0))?
|
||||
.get_first(my_string_field)
|
||||
.unwrap()
|
||||
.text(),
|
||||
Some("blublub")
|
||||
);
|
||||
let doc = searcher.doc(DocAddress::new(0, 4))?;
|
||||
assert_eq!(doc.get_first(my_string_field), None);
|
||||
}
|
||||
// sort by field desc
|
||||
let index = create_test_index(
|
||||
Some(IndexSettings {
|
||||
sort_by_field: Some(IndexSortByField {
|
||||
field: "my_number".to_string(),
|
||||
order: Order::Desc,
|
||||
}),
|
||||
}),
|
||||
get_text_options(),
|
||||
);
|
||||
let my_string_field = index.schema().get_field("string_field").unwrap();
|
||||
let searcher = index.reader()?.searcher();
|
||||
{
|
||||
let doc = searcher.doc(DocAddress::new(0, 4))?;
|
||||
assert_eq!(
|
||||
doc.get_first(my_string_field).unwrap().text(),
|
||||
Some("blublub")
|
||||
);
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_sort_index_test_string_field() -> crate::Result<()> {
|
||||
let index = create_test_index(None, get_text_options());
|
||||
let my_string_field = index.schema().get_field("string_field").unwrap();
|
||||
let searcher = index.reader()?.searcher();
|
||||
|
||||
let query = QueryParser::for_index(&index, vec![my_string_field]).parse_query("blublub")?;
|
||||
let top_docs: Vec<(f32, DocAddress)> = searcher.search(&query, &TopDocs::with_limit(3))?;
|
||||
assert_eq!(
|
||||
top_docs.iter().map(|el| el.1.doc_id).collect::<Vec<_>>(),
|
||||
vec![3]
|
||||
);
|
||||
|
||||
let index = create_test_index(
|
||||
Some(IndexSettings {
|
||||
sort_by_field: Some(IndexSortByField {
|
||||
field: "my_number".to_string(),
|
||||
order: Order::Asc,
|
||||
}),
|
||||
}),
|
||||
get_text_options(),
|
||||
);
|
||||
let my_string_field = index.schema().get_field("string_field").unwrap();
|
||||
let reader = index.reader()?;
|
||||
let searcher = reader.searcher();
|
||||
|
||||
let query = QueryParser::for_index(&index, vec![my_string_field]).parse_query("blublub")?;
|
||||
let top_docs: Vec<(f32, DocAddress)> = searcher.search(&query, &TopDocs::with_limit(3))?;
|
||||
assert_eq!(
|
||||
top_docs.iter().map(|el| el.1.doc_id).collect::<Vec<_>>(),
|
||||
vec![0]
|
||||
);
|
||||
|
||||
// test new field norm mapping
|
||||
{
|
||||
let my_text_field = index.schema().get_field("text_field").unwrap();
|
||||
let fieldnorm_reader = searcher
|
||||
.segment_reader(0)
|
||||
.get_fieldnorms_reader(my_text_field)?;
|
||||
assert_eq!(fieldnorm_reader.fieldnorm(0), 2); // some text
|
||||
assert_eq!(fieldnorm_reader.fieldnorm(1), 0);
|
||||
}
|
||||
// sort by field desc
|
||||
let index = create_test_index(
|
||||
Some(IndexSettings {
|
||||
sort_by_field: Some(IndexSortByField {
|
||||
field: "my_number".to_string(),
|
||||
order: Order::Desc,
|
||||
}),
|
||||
}),
|
||||
get_text_options(),
|
||||
);
|
||||
let my_string_field = index.schema().get_field("string_field").unwrap();
|
||||
let searcher = index.reader()?.searcher();
|
||||
|
||||
let query = QueryParser::for_index(&index, vec![my_string_field]).parse_query("blublub")?;
|
||||
let top_docs: Vec<(f32, DocAddress)> = searcher.search(&query, &TopDocs::with_limit(3))?;
|
||||
assert_eq!(
|
||||
top_docs.iter().map(|el| el.1.doc_id).collect::<Vec<_>>(),
|
||||
vec![4]
|
||||
);
|
||||
// test new field norm mapping
|
||||
{
|
||||
let my_text_field = index.schema().get_field("text_field").unwrap();
|
||||
let fieldnorm_reader = searcher
|
||||
.segment_reader(0)
|
||||
.get_fieldnorms_reader(my_text_field)?;
|
||||
assert_eq!(fieldnorm_reader.fieldnorm(0), 0);
|
||||
assert_eq!(fieldnorm_reader.fieldnorm(1), 0);
|
||||
assert_eq!(fieldnorm_reader.fieldnorm(2), 0);
|
||||
assert_eq!(fieldnorm_reader.fieldnorm(3), 0);
|
||||
assert_eq!(fieldnorm_reader.fieldnorm(4), 2); // some text
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_sort_index_fast_field() -> crate::Result<()> {
|
||||
let index = create_test_index(
|
||||
Some(IndexSettings {
|
||||
sort_by_field: Some(IndexSortByField {
|
||||
field: "my_number".to_string(),
|
||||
order: Order::Asc,
|
||||
}),
|
||||
}),
|
||||
get_text_options(),
|
||||
);
|
||||
assert_eq!(
|
||||
index.settings().sort_by_field.as_ref().unwrap().field,
|
||||
"my_number".to_string()
|
||||
);
|
||||
|
||||
let searcher = index.reader()?.searcher();
|
||||
assert_eq!(searcher.segment_readers().len(), 1);
|
||||
let segment_reader = searcher.segment_reader(0);
|
||||
let fast_fields = segment_reader.fast_fields();
|
||||
let my_number = index.schema().get_field("my_number").unwrap();
|
||||
|
||||
let fast_field = fast_fields.u64(my_number).unwrap();
|
||||
assert_eq!(fast_field.get(0u32), 10u64);
|
||||
assert_eq!(fast_field.get(1u32), 20u64);
|
||||
assert_eq!(fast_field.get(2u32), 30u64);
|
||||
|
||||
let multi_numbers = index.schema().get_field("multi_numbers").unwrap();
|
||||
let multifield = fast_fields.u64s(multi_numbers).unwrap();
|
||||
let mut vals = vec![];
|
||||
multifield.get_vals(0u32, &mut vals);
|
||||
assert_eq!(vals, &[] as &[u64]);
|
||||
let mut vals = vec![];
|
||||
multifield.get_vals(1u32, &mut vals);
|
||||
assert_eq!(vals, &[5, 6]);
|
||||
|
||||
let mut vals = vec![];
|
||||
multifield.get_vals(2u32, &mut vals);
|
||||
assert_eq!(vals, &[3]);
|
||||
Ok(())
|
||||
}
|
||||
}
|
||||
@@ -43,6 +43,9 @@ pub const MARGIN_IN_BYTES: usize = 1_000_000;
|
||||
pub const HEAP_SIZE_MIN: usize = ((MARGIN_IN_BYTES as u32) * 3u32) as usize;
|
||||
pub const HEAP_SIZE_MAX: usize = u32::max_value() as usize - MARGIN_IN_BYTES;
|
||||
|
||||
// We impose the number of index writer threads to be at most this.
|
||||
pub const MAX_NUM_THREAD: usize = 8;
|
||||
|
||||
// `add_document` will block if the number of docs waiting in the queue to be indexed
|
||||
// reaches `PIPELINE_MAX_SIZE_IN_DOCS`
|
||||
const PIPELINE_MAX_SIZE_IN_DOCS: usize = 10_000;
|
||||
@@ -50,7 +53,7 @@ const PIPELINE_MAX_SIZE_IN_DOCS: usize = 10_000;
|
||||
// Group of operations.
|
||||
// Most of the time, users will send operations one by one, but it can be useful to
|
||||
// send them as a small block to ensure that
|
||||
// - all docs in the operation will happen on the same segment and continuous docids.
|
||||
// - all docs in the operation will happen on the same segment and continuous doc_ids.
|
||||
// - all operations in the group are committed at the same time, making the group
|
||||
// atomic.
|
||||
type OperationGroup = SmallVec<[AddOperation; 4]>;
|
||||
@@ -180,7 +183,7 @@ pub(crate) fn advance_deletes(
|
||||
if num_deleted_docs > num_deleted_docs_before {
|
||||
// There are new deletes. We need to write a new delete file.
|
||||
segment = segment.with_delete_meta(num_deleted_docs as u32, target_opstamp);
|
||||
let mut delete_file = segment.open_write(SegmentComponent::DELETE)?;
|
||||
let mut delete_file = segment.open_write(SegmentComponent::Delete)?;
|
||||
write_delete_bitset(&delete_bitset, max_doc, &mut delete_file)?;
|
||||
delete_file.terminate()?;
|
||||
}
|
||||
@@ -236,11 +239,10 @@ fn index_documents(
|
||||
last_docstamp,
|
||||
)?;
|
||||
|
||||
let segment_entry = SegmentEntry::new(
|
||||
segment_with_max_doc.meta().clone(),
|
||||
delete_cursor,
|
||||
delete_bitset_opt,
|
||||
);
|
||||
let meta = segment_with_max_doc.meta().clone();
|
||||
meta.untrack_temp_docstore();
|
||||
// update segment_updater inventory to remove tempstore
|
||||
let segment_entry = SegmentEntry::new(meta, delete_cursor, delete_bitset_opt);
|
||||
block_on(segment_updater.schedule_add_segment(segment_entry))?;
|
||||
Ok(true)
|
||||
}
|
||||
|
||||
@@ -57,8 +57,8 @@ impl MergePolicy for LogMergePolicy {
|
||||
let mut size_sorted_tuples = segments
|
||||
.iter()
|
||||
.map(SegmentMeta::num_docs)
|
||||
.filter(|s| s <= &(self.max_merge_size as u32))
|
||||
.enumerate()
|
||||
.filter(|(_, s)| s <= &(self.max_merge_size as u32))
|
||||
.collect::<Vec<(usize, u32)>>();
|
||||
|
||||
size_sorted_tuples.sort_by(|x, y| y.1.cmp(&(x.1)));
|
||||
@@ -216,12 +216,19 @@ mod tests {
|
||||
create_random_segment_meta(1_000_000),
|
||||
create_random_segment_meta(100_001),
|
||||
create_random_segment_meta(100_000),
|
||||
create_random_segment_meta(1_000_001),
|
||||
create_random_segment_meta(100_000),
|
||||
create_random_segment_meta(100_000),
|
||||
create_random_segment_meta(1_500_000),
|
||||
];
|
||||
let result_list = test_merge_policy().compute_merge_candidates(&test_input);
|
||||
// Do not include large segments
|
||||
assert_eq!(result_list.len(), 1);
|
||||
assert_eq!(result_list[0].0.len(), 3)
|
||||
assert_eq!(result_list[0].0.len(), 3);
|
||||
|
||||
// Making sure merge policy points to the correct index of the original input
|
||||
assert_eq!(result_list[0].0[0], test_input[2].id());
|
||||
assert_eq!(result_list[0].0[1], test_input[4].id());
|
||||
assert_eq!(result_list[0].0[2], test_input[5].id());
|
||||
}
|
||||
}
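The hunk above moves `.enumerate()` in front of `.filter()` so the indices kept in `size_sorted_tuples` keep referring to the original, unfiltered segment list, which is what the test asserts via `test_input[2]`, `test_input[4]` and `test_input[5]`. A standalone sketch of that ordering difference, with plain integers standing in for segment sizes and a hard-coded `max_merge_size`:

```rust
fn main() {
    let sizes = [100_000u32, 2_000_000, 150_000];
    let max_merge_size = 1_000_000u32;

    // enumerate-then-filter: indices still point into `sizes`.
    let kept: Vec<(usize, u32)> = sizes
        .iter()
        .copied()
        .enumerate()
        .filter(|(_, s)| *s <= max_merge_size)
        .collect();
    assert_eq!(kept, vec![(0, 100_000), (2, 150_000)]);

    // filter-then-enumerate would renumber the survivors as 0, 1 instead.
    let renumbered: Vec<(usize, u32)> = sizes
        .iter()
        .copied()
        .filter(|s| *s <= max_merge_size)
        .enumerate()
        .collect();
    assert_eq!(renumbered, vec![(0, 100_000), (1, 150_000)]);
}
```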
|
||||
|
||||
@@ -1,9 +1,5 @@
|
||||
use crate::common::MAX_DOC_LIMIT;
|
||||
use crate::core::Segment;
|
||||
use crate::core::SegmentReader;
|
||||
use crate::core::SerializableSegment;
|
||||
use crate::docset::{DocSet, TERMINATED};
|
||||
use crate::fastfield::BytesFastFieldReader;
|
||||
use super::doc_id_mapping::DocIdMapping;
|
||||
use crate::error::DataCorruption;
|
||||
use crate::fastfield::DeleteBitSet;
|
||||
use crate::fastfield::FastFieldReader;
|
||||
use crate::fastfield::FastFieldSerializer;
|
||||
@@ -20,10 +16,21 @@ use crate::schema::{Field, Schema};
|
||||
use crate::store::StoreWriter;
|
||||
use crate::termdict::TermMerger;
|
||||
use crate::termdict::TermOrdinal;
|
||||
use crate::{common::HasLen, fastfield::MultiValueLength};
|
||||
use crate::{common::MAX_DOC_LIMIT, IndexSettings};
|
||||
use crate::{core::Segment, indexer::doc_id_mapping::expect_field_id_for_sort_field};
|
||||
use crate::{core::SegmentReader, Order};
|
||||
use crate::{core::SerializableSegment, IndexSortByField};
|
||||
use crate::{
|
||||
docset::{DocSet, TERMINATED},
|
||||
SegmentOrdinal,
|
||||
};
|
||||
use crate::{DocId, InvertedIndexReader, SegmentComponent};
|
||||
use itertools::Itertools;
|
||||
use std::cmp;
|
||||
use std::collections::HashMap;
|
||||
use std::sync::Arc;
|
||||
use tantivy_bitpacker::minmax;
|
||||
|
||||
fn compute_total_num_tokens(readers: &[SegmentReader], field: Field) -> crate::Result<u64> {
|
||||
let mut total_tokens = 0u64;
|
||||
@@ -52,7 +59,28 @@ fn compute_total_num_tokens(readers: &[SegmentReader], field: Field) -> crate::R
|
||||
.sum::<u64>())
|
||||
}
|
||||
|
||||
/// `SegmentReaderWithOrdinal` makes it easier to associate
|
||||
/// data with a `SegmentReader`. The ordinal is supposed to be
|
||||
/// used as an index.
|
||||
///
|
||||
/// The ordinal identifies the position within `Merger` readers.
|
||||
#[derive(Clone, Copy)]
|
||||
pub(crate) struct SegmentReaderWithOrdinal<'a> {
|
||||
pub reader: &'a SegmentReader,
|
||||
pub ordinal: SegmentOrdinal,
|
||||
}
|
||||
|
||||
impl<'a> From<(usize, &'a SegmentReader)> for SegmentReaderWithOrdinal<'a> {
|
||||
fn from(data: (usize, &'a SegmentReader)) -> Self {
|
||||
SegmentReaderWithOrdinal {
|
||||
reader: data.1,
|
||||
ordinal: data.0 as u32,
|
||||
}
|
||||
}
|
||||
}
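The doc comment above describes using the enumeration position as an ordinal so that per-reader data kept in parallel vectors can be looked up by index. Below is a minimal, self-contained sketch of that pattern; `DummyReader` is a stand-in for illustration only, not tantivy's `SegmentReader`:

```rust
// Stand-in reader type; only the ordinal pattern is demonstrated here.
struct DummyReader {
    name: &'static str,
}

#[derive(Clone, Copy)]
struct ReaderWithOrdinal<'a> {
    reader: &'a DummyReader,
    ordinal: u32,
}

impl<'a> From<(usize, &'a DummyReader)> for ReaderWithOrdinal<'a> {
    fn from(data: (usize, &'a DummyReader)) -> Self {
        ReaderWithOrdinal {
            reader: data.1,
            ordinal: data.0 as u32,
        }
    }
}

fn main() {
    let readers = vec![DummyReader { name: "seg0" }, DummyReader { name: "seg1" }];
    // enumerate() supplies the position, which becomes the ordinal used to index
    // back into parallel per-reader vectors (e.g. per-segment fast field accessors).
    let with_ordinals: Vec<ReaderWithOrdinal<'_>> = readers
        .iter()
        .enumerate()
        .map(ReaderWithOrdinal::from)
        .collect();
    assert_eq!(with_ordinals[1].ordinal, 1);
    assert_eq!(with_ordinals[1].reader.name, "seg1");
}
```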
|
||||
|
||||
pub struct IndexMerger {
|
||||
index_settings: IndexSettings,
|
||||
schema: Schema,
|
||||
readers: Vec<SegmentReader>,
|
||||
max_doc: u32,
|
||||
@@ -70,7 +98,7 @@ fn compute_min_max_val(
|
||||
Some(delete_bitset) => {
|
||||
// some deleted documents,
|
||||
// we need to recompute the max / min
|
||||
crate::common::minmax(
|
||||
minmax(
|
||||
(0..max_doc)
|
||||
.filter(|doc_id| delete_bitset.is_alive(*doc_id))
|
||||
.map(|doc_id| u64_reader.get(doc_id)),
|
||||
@@ -141,7 +169,11 @@ impl DeltaComputer {
|
||||
}
|
||||
|
||||
impl IndexMerger {
|
||||
pub fn open(schema: Schema, segments: &[Segment]) -> crate::Result<IndexMerger> {
|
||||
pub fn open(
|
||||
schema: Schema,
|
||||
index_settings: IndexSettings,
|
||||
segments: &[Segment],
|
||||
) -> crate::Result<IndexMerger> {
|
||||
let mut readers = vec![];
|
||||
let mut max_doc: u32 = 0u32;
|
||||
for segment in segments {
|
||||
@@ -161,6 +193,7 @@ impl IndexMerger {
|
||||
}
|
||||
Ok(IndexMerger {
|
||||
schema,
|
||||
index_settings,
|
||||
readers,
|
||||
max_doc,
|
||||
})
|
||||
@@ -169,17 +202,27 @@ impl IndexMerger {
|
||||
fn write_fieldnorms(
|
||||
&self,
|
||||
mut fieldnorms_serializer: FieldNormsSerializer,
|
||||
doc_id_mapping: &Option<Vec<(DocId, SegmentReaderWithOrdinal)>>,
|
||||
) -> crate::Result<()> {
|
||||
let fields = FieldNormsWriter::fields_with_fieldnorm(&self.schema);
|
||||
let mut fieldnorms_data = Vec::with_capacity(self.max_doc as usize);
|
||||
for field in fields {
|
||||
fieldnorms_data.clear();
|
||||
for reader in &self.readers {
|
||||
let fieldnorms_reader = reader.get_fieldnorms_reader(field)?;
|
||||
for doc_id in reader.doc_ids_alive() {
|
||||
let fieldnorm_id = fieldnorms_reader.fieldnorm_id(doc_id);
|
||||
if let Some(doc_id_mapping) = doc_id_mapping {
|
||||
for (doc_id, reader_with_ordinal) in doc_id_mapping {
|
||||
let fieldnorms_reader =
|
||||
reader_with_ordinal.reader.get_fieldnorms_reader(field)?;
|
||||
let fieldnorm_id = fieldnorms_reader.fieldnorm_id(*doc_id);
|
||||
fieldnorms_data.push(fieldnorm_id);
|
||||
}
|
||||
} else {
|
||||
for reader in &self.readers {
|
||||
let fieldnorms_reader = reader.get_fieldnorms_reader(field)?;
|
||||
for doc_id in reader.doc_ids_alive() {
|
||||
let fieldnorm_id = fieldnorms_reader.fieldnorm_id(doc_id);
|
||||
fieldnorms_data.push(fieldnorm_id);
|
||||
}
|
||||
}
|
||||
}
|
||||
fieldnorms_serializer.serialize_field(field, &fieldnorms_data[..])?;
|
||||
}
|
||||
@@ -191,6 +234,7 @@ impl IndexMerger {
|
||||
&self,
|
||||
fast_field_serializer: &mut FastFieldSerializer,
|
||||
mut term_ord_mappings: HashMap<Field, TermOrdinalMapping>,
|
||||
doc_id_mapping: &Option<Vec<(DocId, SegmentReaderWithOrdinal)>>,
|
||||
) -> crate::Result<()> {
|
||||
for (field, field_entry) in self.schema.fields() {
|
||||
let field_type = field_entry.field_type();
|
||||
@@ -204,6 +248,7 @@ impl IndexMerger {
|
||||
field,
|
||||
&term_ordinal_mapping,
|
||||
fast_field_serializer,
|
||||
doc_id_mapping,
|
||||
)?;
|
||||
}
|
||||
FieldType::U64(ref options)
|
||||
@@ -211,10 +256,10 @@ impl IndexMerger {
|
||||
| FieldType::F64(ref options)
|
||||
| FieldType::Date(ref options) => match options.get_fastfield_cardinality() {
|
||||
Some(Cardinality::SingleValue) => {
|
||||
self.write_single_fast_field(field, fast_field_serializer)?;
|
||||
self.write_single_fast_field(field, fast_field_serializer, doc_id_mapping)?;
|
||||
}
|
||||
Some(Cardinality::MultiValues) => {
|
||||
self.write_multi_fast_field(field, fast_field_serializer)?;
|
||||
self.write_multi_fast_field(field, fast_field_serializer, doc_id_mapping)?;
|
||||
}
|
||||
None => {}
|
||||
},
|
||||
@@ -225,7 +270,7 @@ impl IndexMerger {
|
||||
}
|
||||
FieldType::Bytes(byte_options) => {
|
||||
if byte_options.is_fast() {
|
||||
self.write_bytes_fast_field(field, fast_field_serializer)?;
|
||||
self.write_bytes_fast_field(field, fast_field_serializer, doc_id_mapping)?;
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -238,110 +283,246 @@ impl IndexMerger {
|
||||
&self,
|
||||
field: Field,
|
||||
fast_field_serializer: &mut FastFieldSerializer,
|
||||
doc_id_mapping: &Option<Vec<(DocId, SegmentReaderWithOrdinal)>>,
|
||||
) -> crate::Result<()> {
|
||||
let mut u64_readers = vec![];
|
||||
let mut min_value = u64::max_value();
|
||||
let mut max_value = u64::min_value();
|
||||
|
||||
for reader in &self.readers {
|
||||
let u64_reader: FastFieldReader<u64> = reader
|
||||
let (min_value, max_value) = self.readers.iter().map(|reader|{
|
||||
let u64_reader: FastFieldReader<u64> = reader
|
||||
.fast_fields()
|
||||
.typed_fast_field_reader(field)
|
||||
.expect("Failed to find a reader for single fast field. This is a tantivy bug and it should never happen.");
|
||||
if let Some((seg_min_val, seg_max_val)) =
|
||||
compute_min_max_val(&u64_reader, reader.max_doc(), reader.delete_bitset())
|
||||
{
|
||||
// the segment has some non-deleted documents
|
||||
min_value = cmp::min(min_value, seg_min_val);
|
||||
max_value = cmp::max(max_value, seg_max_val);
|
||||
u64_readers.push((reader.max_doc(), u64_reader, reader.delete_bitset()));
|
||||
} else {
|
||||
// all documents have been deleted.
|
||||
})
|
||||
.filter_map(|x| x)
|
||||
.reduce(|a, b| {
|
||||
(a.0.min(b.0), a.1.max(b.1))
|
||||
}).expect("Unexpected error, empty readers in IndexMerger");
|
||||
|
||||
let fast_field_readers = self
|
||||
.readers
|
||||
.iter()
|
||||
.map(|reader| {
|
||||
let u64_reader: FastFieldReader<u64> = reader
|
||||
.fast_fields()
|
||||
.typed_fast_field_reader(field)
|
||||
.expect("Failed to find a reader for single fast field. This is a tantivy bug and it should never happen.");
|
||||
u64_reader
|
||||
})
|
||||
.collect::<Vec<_>>();
|
||||
if let Some(doc_id_mapping) = doc_id_mapping {
|
||||
let sorted_doc_ids = doc_id_mapping.iter().map(|(doc_id, reader_with_ordinal)| {
|
||||
(
|
||||
doc_id,
|
||||
&fast_field_readers[reader_with_ordinal.ordinal as usize],
|
||||
)
|
||||
});
|
||||
// add values in order of the new doc_ids
|
||||
let mut fast_single_field_serializer =
|
||||
fast_field_serializer.new_u64_fast_field(field, min_value, max_value)?;
|
||||
for (doc_id, field_reader) in sorted_doc_ids {
|
||||
let val = field_reader.get(*doc_id);
|
||||
fast_single_field_serializer.add_val(val)?;
|
||||
}
|
||||
}
|
||||
|
||||
if min_value > max_value {
|
||||
// There is not a single document remaining in the index.
|
||||
min_value = 0;
|
||||
max_value = 0;
|
||||
}
|
||||
fast_single_field_serializer.close_field()?;
|
||||
Ok(())
|
||||
} else {
|
||||
let u64_readers = self.readers.iter()
|
||||
.filter(|reader|reader.max_doc() != reader.delete_bitset().map(|bit_set|bit_set.len() as u32).unwrap_or(0))
|
||||
.map(|reader|{
|
||||
let u64_reader: FastFieldReader<u64> = reader
|
||||
.fast_fields()
|
||||
.typed_fast_field_reader(field)
|
||||
.expect("Failed to find a reader for single fast field. This is a tantivy bug and it should never happen.");
|
||||
(reader.max_doc(), u64_reader, reader.delete_bitset())
|
||||
}).collect::<Vec<_>>();
|
||||
|
||||
let mut fast_single_field_serializer =
|
||||
fast_field_serializer.new_u64_fast_field(field, min_value, max_value)?;
|
||||
for (max_doc, u64_reader, delete_bitset_opt) in u64_readers {
|
||||
for doc_id in 0u32..max_doc {
|
||||
let is_deleted = delete_bitset_opt
|
||||
.map(|delete_bitset| delete_bitset.is_deleted(doc_id))
|
||||
.unwrap_or(false);
|
||||
if !is_deleted {
|
||||
let val = u64_reader.get(doc_id);
|
||||
fast_single_field_serializer.add_val(val)?;
|
||||
let mut fast_single_field_serializer =
|
||||
fast_field_serializer.new_u64_fast_field(field, min_value, max_value)?;
|
||||
for (max_doc, u64_reader, delete_bitset_opt) in u64_readers {
|
||||
for doc_id in 0u32..max_doc {
|
||||
let is_deleted = delete_bitset_opt
|
||||
.map(|delete_bitset| delete_bitset.is_deleted(doc_id))
|
||||
.unwrap_or(false);
|
||||
if !is_deleted {
|
||||
let val = u64_reader.get(doc_id);
|
||||
fast_single_field_serializer.add_val(val)?;
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
fast_single_field_serializer.close_field()?;
|
||||
Ok(())
|
||||
fast_single_field_serializer.close_field()?;
|
||||
Ok(())
|
||||
}
|
||||
}
|
||||
|
||||
fn write_fast_field_idx(
|
||||
/// Generates the doc_id mapping, where the position in the vec is the new
|
||||
/// doc_id.
|
||||
/// `SegmentReaderWithOrdinal` includes the ordinal position of the
|
||||
/// reader in self.readers.
|
||||
pub(crate) fn generate_doc_id_mapping(
|
||||
&self,
|
||||
sort_by_field: &IndexSortByField,
|
||||
) -> crate::Result<Vec<(DocId, SegmentReaderWithOrdinal)>> {
|
||||
let reader_and_field_accessors = self
|
||||
.readers
|
||||
.iter()
|
||||
.enumerate()
|
||||
.map(|reader| {
|
||||
let reader_with_ordinal: SegmentReaderWithOrdinal = reader.into();
|
||||
let field_id = expect_field_id_for_sort_field(
|
||||
&reader_with_ordinal.reader.schema(),
|
||||
&sort_by_field,
|
||||
)?; // for now expect fastfield, but not strictly required
|
||||
let value_accessor = reader_with_ordinal
|
||||
.reader
|
||||
.fast_fields()
|
||||
.u64_lenient(field_id)?;
|
||||
Ok((reader_with_ordinal, value_accessor))
|
||||
})
|
||||
.collect::<crate::Result<Vec<_>>>()?; // Collect to bind each value_accessor's lifetime to the vec; otherwise it can't be used as a reference.
|
||||
// Loading the field accessor on demand causes a 15x regression
|
||||
|
||||
// create iterators over segment/sort_accessor/doc_id tuple
|
||||
let doc_id_reader_pair = reader_and_field_accessors
|
||||
.iter()
|
||||
.map(|reader_and_field_accessor| {
|
||||
reader_and_field_accessor
|
||||
.0
|
||||
.reader
|
||||
.doc_ids_alive()
|
||||
.map(move |doc_id| {
|
||||
(
|
||||
doc_id,
|
||||
reader_and_field_accessor.0,
|
||||
&reader_and_field_accessor.1,
|
||||
)
|
||||
})
|
||||
})
|
||||
.collect::<Vec<_>>();
|
||||
|
||||
// create iterator tuple of (old doc_id, reader) in order of the new doc_ids
|
||||
let sorted_doc_ids: Vec<(DocId, SegmentReaderWithOrdinal)> = doc_id_reader_pair
|
||||
.into_iter()
|
||||
.kmerge_by(|a, b| {
|
||||
let val1 = a.2.get(a.0);
|
||||
let val2 = b.2.get(b.0);
|
||||
if sort_by_field.order == Order::Asc {
|
||||
val1 < val2
|
||||
} else {
|
||||
val1 > val2
|
||||
}
|
||||
})
|
||||
.map(|(doc_id, reader_with_id, _)| (doc_id, reader_with_id))
|
||||
.collect::<Vec<_>>();
|
||||
Ok(sorted_doc_ids)
|
||||
}
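`generate_doc_id_mapping` above builds one value-sorted `(doc_id, value)` stream per segment and lets itertools' `kmerge_by` interleave them; the position in the resulting vec becomes the new doc_id. A reduced sketch of that kmerge step, using the itertools crate as the merger does, but with hard-coded values in place of fast field readers:

```rust
use itertools::Itertools;

fn main() {
    // Two "segments", each already sorted ascending by its sort value.
    let seg0 = vec![(0u32, 10u64), (1, 30)];
    let seg1 = vec![(0u32, 20u64), (1, 40)];

    let merged: Vec<(usize, u32)> = vec![seg0, seg1]
        .into_iter()
        .enumerate()
        .map(|(ordinal, docs)| docs.into_iter().map(move |(doc, val)| (ordinal, doc, val)))
        .kmerge_by(|a, b| a.2 < b.2) // ascending; flip the comparison for Order::Desc
        .map(|(ordinal, doc, _)| (ordinal, doc))
        .collect();

    // New doc id = position in `merged`; each entry remembers (segment ordinal, old doc id).
    assert_eq!(merged, vec![(0, 0), (1, 0), (0, 1), (1, 1)]);
}
```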
|
||||
|
||||
// Creates the index fast field that points into the data, generic over `BytesFastFieldReader` and
|
||||
// `MultiValuedFastFieldReader`
|
||||
//
|
||||
// Important: reader_and_field_accessor needs
|
||||
// to have the same order as self.readers, since SegmentReaderWithOrdinal
|
||||
// is used to index the reader_and_field_accessors vec.
|
||||
fn write_1_n_fast_field_idx_generic(
|
||||
field: Field,
|
||||
fast_field_serializer: &mut FastFieldSerializer,
|
||||
doc_id_mapping: &Option<Vec<(DocId, SegmentReaderWithOrdinal)>>,
|
||||
reader_and_field_accessors: &[(&SegmentReader, impl MultiValueLength)],
|
||||
) -> crate::Result<()> {
|
||||
let mut total_num_vals = 0u64;
|
||||
let mut u64s_readers: Vec<MultiValuedFastFieldReader<u64>> = Vec::new();
|
||||
|
||||
// In the first pass, we compute the total number of vals.
|
||||
//
|
||||
// This is required by the bitpacker, as it needs to know
|
||||
// what bit length to use for bitpacking.
|
||||
for reader in &self.readers {
|
||||
let u64s_reader = reader.fast_fields()
|
||||
.typed_fast_field_multi_reader(field)
|
||||
.expect("Failed to find index for multivalued field. This is a bug in tantivy, please report.");
|
||||
for (reader, u64s_reader) in reader_and_field_accessors.iter() {
|
||||
if let Some(delete_bitset) = reader.delete_bitset() {
|
||||
for doc in 0u32..reader.max_doc() {
|
||||
if delete_bitset.is_alive(doc) {
|
||||
let num_vals = u64s_reader.num_vals(doc) as u64;
|
||||
let num_vals = u64s_reader.get_len(doc) as u64;
|
||||
total_num_vals += num_vals;
|
||||
}
|
||||
}
|
||||
} else {
|
||||
total_num_vals += u64s_reader.total_num_vals();
|
||||
total_num_vals += u64s_reader.get_total_len();
|
||||
}
|
||||
u64s_readers.push(u64s_reader);
|
||||
}
|
||||
|
||||
// We can now create our `idx` serializer, and in a second pass,
|
||||
// can effectively push the different indexes.
|
||||
let mut serialize_idx =
|
||||
fast_field_serializer.new_u64_fast_field_with_idx(field, 0, total_num_vals, 0)?;
|
||||
let mut idx = 0;
|
||||
for (segment_reader, u64s_reader) in self.readers.iter().zip(&u64s_readers) {
|
||||
for doc in segment_reader.doc_ids_alive() {
|
||||
serialize_idx.add_val(idx)?;
|
||||
idx += u64s_reader.num_vals(doc) as u64;
|
||||
if let Some(doc_id_mapping) = doc_id_mapping {
|
||||
let mut serialize_idx =
|
||||
fast_field_serializer.new_u64_fast_field_with_idx(field, 0, total_num_vals, 0)?;
|
||||
|
||||
let mut offset = 0;
|
||||
for (doc_id, reader) in doc_id_mapping {
|
||||
let reader = &reader_and_field_accessors[reader.ordinal as usize].1;
|
||||
serialize_idx.add_val(offset)?;
|
||||
offset += reader.get_len(*doc_id) as u64;
|
||||
}
|
||||
serialize_idx.add_val(offset as u64)?;
|
||||
|
||||
serialize_idx.close_field()?;
|
||||
} else {
|
||||
let mut serialize_idx =
|
||||
fast_field_serializer.new_u64_fast_field_with_idx(field, 0, total_num_vals, 0)?;
|
||||
let mut idx = 0;
|
||||
for (segment_reader, u64s_reader) in reader_and_field_accessors.iter() {
|
||||
for doc in segment_reader.doc_ids_alive() {
|
||||
serialize_idx.add_val(idx)?;
|
||||
idx += u64s_reader.get_len(doc) as u64;
|
||||
}
|
||||
}
|
||||
serialize_idx.add_val(idx)?;
|
||||
serialize_idx.close_field()?;
|
||||
}
|
||||
serialize_idx.add_val(idx)?;
|
||||
serialize_idx.close_field()?;
|
||||
Ok(())
|
||||
}
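The first-pass comment in `write_1_n_fast_field_idx_generic` above notes that the bitpacker needs the total number of values before anything is written. The reason is that the largest offset stored in the idx fast field equals that total, and the fixed bit width is derived from it. A small illustrative sketch in plain Rust (not tantivy's bitpacker API, and with made-up per-document counts):

```rust
// Number of bits needed to represent `max_value`.
fn bits_required(max_value: u64) -> u8 {
    if max_value == 0 {
        0
    } else {
        (64 - max_value.leading_zeros()) as u8
    }
}

fn main() {
    // Hypothetical number of values per document across all segments.
    let num_vals_per_doc = [0usize, 2, 1, 3];
    let total_num_vals: usize = num_vals_per_doc.iter().sum();
    // The idx fast field stores offsets in 0..=total_num_vals, so the bit width
    // follows directly from the total computed in the first pass.
    assert_eq!(total_num_vals, 6);
    assert_eq!(bits_required(total_num_vals as u64), 3);
}
```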
|
||||
fn write_multi_value_fast_field_idx(
|
||||
&self,
|
||||
field: Field,
|
||||
fast_field_serializer: &mut FastFieldSerializer,
|
||||
doc_id_mapping: &Option<Vec<(DocId, SegmentReaderWithOrdinal)>>,
|
||||
) -> crate::Result<()> {
|
||||
let reader_and_field_accessors = self.readers.iter().map(|reader|{
|
||||
let u64s_reader: MultiValuedFastFieldReader<u64> = reader.fast_fields()
|
||||
.typed_fast_field_multi_reader(field)
|
||||
.expect("Failed to find index for multivalued field. This is a bug in tantivy, please report.");
|
||||
(reader, u64s_reader)
|
||||
}).collect::<Vec<_>>();
|
||||
|
||||
Self::write_1_n_fast_field_idx_generic(
|
||||
field,
|
||||
fast_field_serializer,
|
||||
doc_id_mapping,
|
||||
&reader_and_field_accessors,
|
||||
)
|
||||
}
|
||||
|
||||
fn write_hierarchical_facet_field(
|
||||
&self,
|
||||
field: Field,
|
||||
term_ordinal_mappings: &TermOrdinalMapping,
|
||||
fast_field_serializer: &mut FastFieldSerializer,
|
||||
doc_id_mapping: &Option<Vec<(DocId, SegmentReaderWithOrdinal)>>,
|
||||
) -> crate::Result<()> {
|
||||
// A multivalued fast field consists of two fast fields.
|
||||
// The first serves as an index into the second one and is strictly increasing.
|
||||
// The second contains the actual values.
|
||||
|
||||
// First we merge the idx fast field.
|
||||
self.write_fast_field_idx(field, fast_field_serializer)?;
|
||||
self.write_multi_value_fast_field_idx(field, fast_field_serializer, doc_id_mapping)?;
|
||||
|
||||
let fast_field_reader = self
|
||||
.readers
|
||||
.iter()
|
||||
.map(|reader| {
|
||||
let ff_reader: MultiValuedFastFieldReader<u64> = reader
|
||||
.fast_fields()
|
||||
.u64s(field)
|
||||
.expect("Could not find multivalued u64 fast value reader.");
|
||||
ff_reader
|
||||
})
|
||||
.collect::<Vec<_>>();
|
||||
// We can now write the actual fast field values.
|
||||
// In the case of hierarchical facets, they are actually term ordinals.
|
||||
let max_term_ord = term_ordinal_mappings.max_term_ord();
|
||||
@@ -349,21 +530,32 @@ impl IndexMerger {
|
||||
let mut serialize_vals =
|
||||
fast_field_serializer.new_u64_fast_field_with_idx(field, 0u64, max_term_ord, 1)?;
|
||||
let mut vals = Vec::with_capacity(100);
|
||||
for (segment_ord, segment_reader) in self.readers.iter().enumerate() {
|
||||
let term_ordinal_mapping: &[TermOrdinal] =
|
||||
term_ordinal_mappings.get_segment(segment_ord);
|
||||
let ff_reader: MultiValuedFastFieldReader<u64> = segment_reader
|
||||
.fast_fields()
|
||||
.u64s(field)
|
||||
.expect("Could not find multivalued u64 fast value reader.");
|
||||
// TODO optimize if no deletes
|
||||
for doc in segment_reader.doc_ids_alive() {
|
||||
ff_reader.get_vals(doc, &mut vals);
|
||||
if let Some(doc_id_mapping) = doc_id_mapping {
|
||||
for (old_doc_id, reader_with_ordinal) in doc_id_mapping {
|
||||
let term_ordinal_mapping: &[TermOrdinal] =
|
||||
term_ordinal_mappings.get_segment(reader_with_ordinal.ordinal as usize);
|
||||
|
||||
let ff_reader = &fast_field_reader[reader_with_ordinal.ordinal as usize];
|
||||
ff_reader.get_vals(*old_doc_id, &mut vals);
|
||||
for &prev_term_ord in &vals {
|
||||
let new_term_ord = term_ordinal_mapping[prev_term_ord as usize];
|
||||
serialize_vals.add_val(new_term_ord)?;
|
||||
}
|
||||
}
|
||||
} else {
|
||||
for (segment_ord, segment_reader) in self.readers.iter().enumerate() {
|
||||
let term_ordinal_mapping: &[TermOrdinal] =
|
||||
term_ordinal_mappings.get_segment(segment_ord);
|
||||
let ff_reader = &fast_field_reader[segment_ord as usize];
|
||||
// TODO optimize if no deletes
|
||||
for doc in segment_reader.doc_ids_alive() {
|
||||
ff_reader.get_vals(doc, &mut vals);
|
||||
for &prev_term_ord in &vals {
|
||||
let new_term_ord = term_ordinal_mapping[prev_term_ord as usize];
|
||||
serialize_vals.add_val(new_term_ord)?;
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
serialize_vals.close_field()?;
|
||||
}
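In the facet branch above, each segment stores its facet values as segment-local term ordinals, and the merge rewrites them through the per-segment `term_ordinal_mapping` table. A simplified sketch of that remapping, with plain vectors standing in for the fast field readers and mapping tables (all values are made up):

```rust
fn main() {
    // term_ordinal_mapping[segment][old_ord] = ordinal in the merged dictionary.
    let term_ordinal_mapping = vec![vec![0u64, 2], vec![1u64, 2]];
    // Per-document facet ordinals, grouped by segment.
    let per_segment_vals = vec![vec![vec![0u64, 1]], vec![vec![1u64]]];

    let mut serialized = Vec::new();
    for (segment_ord, docs) in per_segment_vals.iter().enumerate() {
        let mapping = &term_ordinal_mapping[segment_ord];
        for vals in docs {
            for &prev_term_ord in vals {
                // Rewrite the segment-local ordinal into the merged ordinal space.
                serialized.push(mapping[prev_term_ord as usize]);
            }
        }
    }
    assert_eq!(serialized, vec![0, 2, 2]);
}
```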
|
||||
@@ -374,13 +566,14 @@ impl IndexMerger {
|
||||
&self,
|
||||
field: Field,
|
||||
fast_field_serializer: &mut FastFieldSerializer,
|
||||
doc_id_mapping: &Option<Vec<(DocId, SegmentReaderWithOrdinal)>>,
|
||||
) -> crate::Result<()> {
|
||||
// A multivalued fast field consists of two fast fields.
|
||||
// The first serves as an index into the second one and is strictly increasing.
|
||||
// The second contains the actual values.
|
||||
|
||||
// First we merge the idx fast field.
|
||||
self.write_fast_field_idx(field, fast_field_serializer)?;
|
||||
self.write_multi_value_fast_field_idx(field, fast_field_serializer, doc_id_mapping)?;
|
||||
|
||||
let mut min_value = u64::max_value();
|
||||
let mut max_value = u64::min_value();
|
||||
@@ -419,10 +612,29 @@ impl IndexMerger {
|
||||
max_value = 0;
|
||||
}
|
||||
|
||||
let fast_field_reader = self
|
||||
.readers
|
||||
.iter()
|
||||
.map(|reader| {
|
||||
let ff_reader : MultiValuedFastFieldReader<u64> = reader.fast_fields()
|
||||
.typed_fast_field_multi_reader(field)
|
||||
.expect("Failed to find index for multivalued field. This is a bug in tantivy, please report.");
|
||||
ff_reader
|
||||
})
|
||||
.collect::<Vec<_>>();
|
||||
|
||||
// We can now initialize our serializer and push the different values to it
|
||||
{
|
||||
let mut serialize_vals = fast_field_serializer
|
||||
.new_u64_fast_field_with_idx(field, min_value, max_value, 1)?;
|
||||
let mut serialize_vals =
|
||||
fast_field_serializer.new_u64_fast_field_with_idx(field, min_value, max_value, 1)?;
|
||||
if let Some(doc_id_mapping) = doc_id_mapping {
|
||||
for (doc_id, reader_with_ordinal) in doc_id_mapping {
|
||||
let ff_reader = &fast_field_reader[reader_with_ordinal.ordinal as usize];
|
||||
ff_reader.get_vals(*doc_id, &mut vals);
|
||||
for &val in &vals {
|
||||
serialize_vals.add_val(val)?;
|
||||
}
|
||||
}
|
||||
} else {
|
||||
for (reader, ff_reader) in self.readers.iter().zip(ff_readers) {
|
||||
// TODO optimize if no deletes
|
||||
for doc in reader.doc_ids_alive() {
|
||||
@@ -432,8 +644,8 @@ impl IndexMerger {
|
||||
}
|
||||
}
|
||||
}
|
||||
serialize_vals.close_field()?;
|
||||
}
|
||||
serialize_vals.close_field()?;
|
||||
Ok(())
|
||||
}
|
||||
|
||||
@@ -441,50 +653,42 @@ impl IndexMerger {
|
||||
&self,
|
||||
field: Field,
|
||||
fast_field_serializer: &mut FastFieldSerializer,
|
||||
doc_id_mapping: &Option<Vec<(DocId, SegmentReaderWithOrdinal)>>,
|
||||
) -> crate::Result<()> {
|
||||
let mut total_num_vals = 0u64;
|
||||
let mut bytes_readers: Vec<BytesFastFieldReader> = Vec::new();
|
||||
|
||||
for reader in &self.readers {
|
||||
let bytes_reader = reader.fast_fields().bytes(field)?;
|
||||
if let Some(delete_bitset) = reader.delete_bitset() {
|
||||
for doc in 0u32..reader.max_doc() {
|
||||
if delete_bitset.is_alive(doc) {
|
||||
let num_vals = bytes_reader.get_bytes(doc).len() as u64;
|
||||
total_num_vals += num_vals;
|
||||
}
|
||||
}
|
||||
} else {
|
||||
total_num_vals += bytes_reader.total_num_bytes() as u64;
|
||||
}
|
||||
bytes_readers.push(bytes_reader);
|
||||
}
|
||||
|
||||
{
|
||||
// We can now create our `idx` serializer, and in a second pass,
|
||||
// can effectively push the different indexes.
|
||||
let mut serialize_idx =
|
||||
fast_field_serializer.new_u64_fast_field_with_idx(field, 0, total_num_vals, 0)?;
|
||||
let mut idx = 0;
|
||||
for (segment_reader, bytes_reader) in self.readers.iter().zip(&bytes_readers) {
|
||||
for doc in segment_reader.doc_ids_alive() {
|
||||
serialize_idx.add_val(idx)?;
|
||||
idx += bytes_reader.get_bytes(doc).len() as u64;
|
||||
}
|
||||
}
|
||||
serialize_idx.add_val(idx)?;
|
||||
serialize_idx.close_field()?;
|
||||
}
|
||||
let reader_and_field_accessors = self
|
||||
.readers
|
||||
.iter()
|
||||
.map(|reader| {
|
||||
let bytes_reader = reader.fast_fields().bytes(field)
|
||||
.expect("Failed to find index for bytes field. This is a bug in tantivy, please report.");
|
||||
(reader, bytes_reader)
|
||||
})
|
||||
.collect::<Vec<_>>();
|
||||
|
||||
Self::write_1_n_fast_field_idx_generic(
|
||||
field,
|
||||
fast_field_serializer,
|
||||
doc_id_mapping,
|
||||
&reader_and_field_accessors,
|
||||
)?;
|
||||
let mut serialize_vals = fast_field_serializer.new_bytes_fast_field_with_idx(field, 1);
|
||||
for segment_reader in &self.readers {
|
||||
let bytes_reader = segment_reader.fast_fields().bytes(field)
|
||||
.expect("Failed to find bytes field in fast field reader. This is a bug in tantivy. Please report.");
|
||||
// TODO: optimize if no deletes
|
||||
for doc in segment_reader.doc_ids_alive() {
|
||||
let val = bytes_reader.get_bytes(doc);
|
||||
if let Some(doc_id_mapping) = doc_id_mapping {
|
||||
for (doc_id, reader_with_ordinal) in doc_id_mapping {
|
||||
let bytes_reader =
|
||||
&reader_and_field_accessors[reader_with_ordinal.ordinal as usize].1;
|
||||
let val = bytes_reader.get_bytes(*doc_id);
|
||||
serialize_vals.write_all(val)?;
|
||||
}
|
||||
} else {
|
||||
for segment_reader in &self.readers {
|
||||
let bytes_reader = segment_reader.fast_fields().bytes(field)
|
||||
.expect("Failed to find bytes field in fast field reader. This is a bug in tantivy. Please report.");
|
||||
// TODO: optimize if no deletes
|
||||
for doc in segment_reader.doc_ids_alive() {
|
||||
let val = bytes_reader.get_bytes(doc);
|
||||
serialize_vals.write_all(val)?;
|
||||
}
|
||||
}
|
||||
}
|
||||
serialize_vals.flush()?;
|
||||
Ok(())
|
||||
@@ -496,6 +700,7 @@ impl IndexMerger {
|
||||
field_type: &FieldType,
|
||||
serializer: &mut InvertedIndexSerializer,
|
||||
fieldnorm_reader: Option<FieldNormReader>,
|
||||
doc_id_mapping: &Option<Vec<(DocId, SegmentReaderWithOrdinal)>>,
|
||||
) -> crate::Result<Option<TermOrdinalMapping>> {
|
||||
let mut positions_buffer: Vec<u32> = Vec::with_capacity(1_000);
|
||||
let mut delta_computer = DeltaComputer::new();
|
||||
@@ -526,19 +731,35 @@ impl IndexMerger {
|
||||
// map from segment doc ids to the resulting merged segment doc id.
|
||||
let mut merged_doc_id_map: Vec<Vec<Option<DocId>>> = Vec::with_capacity(self.readers.len());
|
||||
|
||||
for reader in &self.readers {
|
||||
let mut segment_local_map = Vec::with_capacity(reader.max_doc() as usize);
|
||||
for doc_id in 0..reader.max_doc() {
|
||||
if reader.is_deleted(doc_id) {
|
||||
segment_local_map.push(None);
|
||||
} else {
|
||||
segment_local_map.push(Some(max_doc));
|
||||
max_doc += 1u32;
|
||||
}
|
||||
if let Some(doc_id_mapping) = doc_id_mapping {
|
||||
merged_doc_id_map = self
|
||||
.readers
|
||||
.iter()
|
||||
.map(|reader| {
|
||||
let mut segment_local_map = vec![];
|
||||
segment_local_map.resize(reader.max_doc() as usize, None);
|
||||
segment_local_map
|
||||
})
|
||||
.collect();
|
||||
for (new_doc_id, (old_doc_id, segment_and_ordinal)) in doc_id_mapping.iter().enumerate()
|
||||
{
|
||||
let segment_map = &mut merged_doc_id_map[segment_and_ordinal.ordinal as usize];
|
||||
segment_map[*old_doc_id as usize] = Some(new_doc_id as DocId);
|
||||
}
|
||||
} else {
|
||||
for reader in &self.readers {
|
||||
let mut segment_local_map = Vec::with_capacity(reader.max_doc() as usize);
|
||||
for doc_id in 0..reader.max_doc() {
|
||||
if reader.is_deleted(doc_id) {
|
||||
segment_local_map.push(None);
|
||||
} else {
|
||||
segment_local_map.push(Some(max_doc));
|
||||
max_doc += 1u32;
|
||||
}
|
||||
}
|
||||
merged_doc_id_map.push(segment_local_map);
|
||||
}
|
||||
merged_doc_id_map.push(segment_local_map);
|
||||
}
|
||||
|
||||
// The total number of tokens will only be exact when there have been no deletes.
|
||||
//
|
||||
// Otherwise, we approximate by removing deleted documents proportionally.
|
||||
@@ -553,7 +774,9 @@ impl IndexMerger {
|
||||
// - Segment 1's doc ids become [seg0.max_doc, seg0.max_doc + seg.max_doc]
|
||||
// - Segment 2's doc ids become [seg0.max_doc + seg1.max_doc,
|
||||
// seg0.max_doc + seg1.max_doc + seg2.max_doc]
|
||||
// ...
|
||||
//
|
||||
// This stacking applies only when the index is not sorted, in that case the
|
||||
// doc_ids are kmerged by their sort property
|
||||
let mut field_serializer =
|
||||
serializer.new_field(indexed_field, total_num_tokens, fieldnorm_reader)?;
|
||||
|
||||
@@ -566,6 +789,7 @@ impl IndexMerger {
|
||||
);
|
||||
|
||||
let mut segment_postings_containing_the_term: Vec<(usize, SegmentPostings)> = vec![];
|
||||
let mut doc_id_and_positions = vec![];
|
||||
|
||||
while merged_terms.advance() {
|
||||
segment_postings_containing_the_term.clear();
|
||||
@@ -622,18 +846,39 @@ impl IndexMerger {
|
||||
while doc != TERMINATED {
|
||||
// deleted doc are skipped as they do not have a `remapped_doc_id`.
|
||||
if let Some(remapped_doc_id) = old_to_new_doc_id[doc as usize] {
|
||||
// we make sure to only write the term iff
|
||||
// we make sure to only write the term if
|
||||
// there is at least one document.
|
||||
let term_freq = segment_postings.term_freq();
|
||||
segment_postings.positions(&mut positions_buffer);
|
||||
|
||||
let delta_positions = delta_computer.compute_delta(&positions_buffer);
|
||||
field_serializer.write_doc(remapped_doc_id, term_freq, delta_positions)?;
|
||||
// if doc_id_mapping exists, the docids are reordered, they are
|
||||
// not just stacked. The field serializer expects monotonically increasing
|
||||
// docids, so we collect and sort them first, before writing.
|
||||
//
|
||||
// This is probably not strictly necessary; it would be possible to
|
||||
// avoid the loading into a vec via some form of kmerge, but then the merge
|
||||
// logic would deviate much more from the stacking case (unsorted index)
|
||||
if doc_id_mapping.is_some() {
|
||||
doc_id_and_positions.push((
|
||||
remapped_doc_id,
|
||||
term_freq,
|
||||
positions_buffer.to_vec(),
|
||||
));
|
||||
} else {
|
||||
let delta_positions = delta_computer.compute_delta(&positions_buffer);
|
||||
field_serializer.write_doc(remapped_doc_id, term_freq, delta_positions);
|
||||
}
|
||||
}
|
||||
|
||||
doc = segment_postings.advance();
|
||||
}
|
||||
}
|
||||
if doc_id_mapping.is_some() {
|
||||
doc_id_and_positions.sort_unstable_by_key(|&(doc_id, _, _)| doc_id);
|
||||
for (doc_id, term_freq, positions) in &doc_id_and_positions {
|
||||
field_serializer.write_doc(*doc_id, *term_freq, positions);
|
||||
}
|
||||
doc_id_and_positions.clear();
|
||||
}
|
||||
|
||||
// closing the term.
|
||||
field_serializer.close_term()?;
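The comment above explains why postings are buffered when a doc_id mapping exists: the serializer expects monotonically increasing doc ids, so the remapped entries are collected and sorted before being written. A tiny standalone sketch of that buffer-and-sort step (term frequencies are made up and positions are omitted):

```rust
fn main() {
    // (remapped doc id, term frequency) pairs, in the order they were visited.
    let mut buffered = vec![(5u32, 2u32), (1, 1), (3, 4)];
    buffered.sort_unstable_by_key(|&(doc_id, _)| doc_id);

    let mut written = Vec::new();
    for (doc_id, term_freq) in &buffered {
        // Stand-in for field_serializer.write_doc(doc_id, term_freq, positions).
        written.push((*doc_id, *term_freq));
    }
    assert_eq!(written, vec![(1, 1), (3, 4), (5, 2)]);
}
```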
|
||||
@@ -646,6 +891,7 @@ impl IndexMerger {
|
||||
&self,
|
||||
serializer: &mut InvertedIndexSerializer,
|
||||
fieldnorm_readers: FieldNormReaders,
|
||||
doc_id_mapping: &Option<Vec<(DocId, SegmentReaderWithOrdinal)>>,
|
||||
) -> crate::Result<HashMap<Field, TermOrdinalMapping>> {
|
||||
let mut term_ordinal_mappings = HashMap::new();
|
||||
for (field, field_entry) in self.schema.fields() {
|
||||
@@ -656,6 +902,7 @@ impl IndexMerger {
|
||||
field_entry.field_type(),
|
||||
serializer,
|
||||
fieldnorm_reader,
|
||||
doc_id_mapping,
|
||||
)? {
|
||||
term_ordinal_mappings.insert(field, term_ordinal_mapping);
|
||||
}
|
||||
@@ -664,16 +911,46 @@ impl IndexMerger {
|
||||
Ok(term_ordinal_mappings)
|
||||
}
|
||||
|
||||
fn write_storable_fields(&self, store_writer: &mut StoreWriter) -> crate::Result<()> {
|
||||
for reader in &self.readers {
|
||||
let store_reader = reader.get_store_reader()?;
|
||||
if reader.num_deleted_docs() > 0 {
|
||||
for doc_id in reader.doc_ids_alive() {
|
||||
let doc = store_reader.get(doc_id)?;
|
||||
store_writer.store(&doc)?;
|
||||
fn write_storable_fields(
|
||||
&self,
|
||||
store_writer: &mut StoreWriter,
|
||||
doc_id_mapping: &Option<Vec<(DocId, SegmentReaderWithOrdinal)>>,
|
||||
) -> crate::Result<()> {
|
||||
let store_readers: Vec<_> = self
|
||||
.readers
|
||||
.iter()
|
||||
.map(|reader| reader.get_store_reader())
|
||||
.collect::<Result<_, _>>()?;
|
||||
let mut document_iterators: Vec<_> = store_readers
|
||||
.iter()
|
||||
.enumerate()
|
||||
.map(|(i, store)| store.iter_raw(self.readers[i].delete_bitset()))
|
||||
.collect();
|
||||
if let Some(doc_id_mapping) = doc_id_mapping {
|
||||
for (old_doc_id, reader_with_ordinal) in doc_id_mapping {
|
||||
let doc_bytes_it = &mut document_iterators[reader_with_ordinal.ordinal as usize];
|
||||
if let Some(doc_bytes_res) = doc_bytes_it.next() {
|
||||
let doc_bytes = doc_bytes_res?;
|
||||
store_writer.store_bytes(&doc_bytes)?;
|
||||
} else {
|
||||
return Err(DataCorruption::comment_only(&format!(
|
||||
"unexpected missing document in docstore on merge, doc id {:?}",
|
||||
old_doc_id
|
||||
))
|
||||
.into());
|
||||
}
|
||||
}
|
||||
} else {
|
||||
for reader in &self.readers {
|
||||
let store_reader = reader.get_store_reader()?;
|
||||
if reader.num_deleted_docs() > 0 {
|
||||
for doc_bytes_res in store_reader.iter_raw(reader.delete_bitset()) {
|
||||
let doc_bytes = doc_bytes_res?;
|
||||
store_writer.store_bytes(&doc_bytes)?;
|
||||
}
|
||||
} else {
|
||||
store_writer.stack(&store_reader)?;
|
||||
}
|
||||
} else {
|
||||
store_writer.stack(&store_reader)?;
|
||||
}
|
||||
}
|
||||
Ok(())
|
||||
@@ -681,18 +958,36 @@ impl IndexMerger {
|
||||
}
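The new `write_storable_fields` above keeps one raw-document iterator per segment and advances them in the order given by the doc_id mapping. A reduced sketch of that merge, using strings as stand-ins for the serialized document bytes and assuming that old doc ids within a segment arrive in increasing order (which holds when the segments are themselves sorted by the same key):

```rust
fn main() {
    // One raw-document iterator per segment.
    let seg_docs = vec![vec!["seg0-doc0", "seg0-doc1"], vec!["seg1-doc0"]];
    let mut doc_iters: Vec<_> = seg_docs.iter().map(|docs| docs.iter()).collect();

    // (old doc id, segment ordinal) in new doc id order.
    let doc_id_mapping = [(0u32, 1usize), (0, 0), (1, 0)];

    let mut merged = Vec::new();
    for (_old_doc_id, ordinal) in doc_id_mapping {
        // Pulling the next item from that segment's iterator yields the right
        // document because intra-segment order is preserved by the mapping.
        let doc = doc_iters[ordinal]
            .next()
            .expect("unexpected missing document in the doc store");
        merged.push(*doc);
    }
    assert_eq!(merged, vec!["seg1-doc0", "seg0-doc0", "seg0-doc1"]);
}
```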
|
||||
|
||||
impl SerializableSegment for IndexMerger {
|
||||
fn write(&self, mut serializer: SegmentSerializer) -> crate::Result<u32> {
|
||||
fn write(
|
||||
&self,
|
||||
mut serializer: SegmentSerializer,
|
||||
_: Option<&DocIdMapping>,
|
||||
) -> crate::Result<u32> {
|
||||
let doc_id_mapping = if let Some(sort_by_field) = self.index_settings.sort_by_field.as_ref()
|
||||
{
|
||||
Some(self.generate_doc_id_mapping(sort_by_field)?)
|
||||
} else {
|
||||
None
|
||||
};
|
||||
|
||||
if let Some(fieldnorms_serializer) = serializer.extract_fieldnorms_serializer() {
|
||||
self.write_fieldnorms(fieldnorms_serializer)?;
|
||||
self.write_fieldnorms(fieldnorms_serializer, &doc_id_mapping)?;
|
||||
}
|
||||
let fieldnorm_data = serializer
|
||||
.segment()
|
||||
.open_read(SegmentComponent::FIELDNORMS)?;
|
||||
.open_read(SegmentComponent::FieldNorms)?;
|
||||
let fieldnorm_readers = FieldNormReaders::open(fieldnorm_data)?;
|
||||
let term_ord_mappings =
|
||||
self.write_postings(serializer.get_postings_serializer(), fieldnorm_readers)?;
|
||||
self.write_fast_fields(serializer.get_fast_field_serializer(), term_ord_mappings)?;
|
||||
self.write_storable_fields(serializer.get_store_writer())?;
|
||||
let term_ord_mappings = self.write_postings(
|
||||
serializer.get_postings_serializer(),
|
||||
fieldnorm_readers,
|
||||
&doc_id_mapping,
|
||||
)?;
|
||||
self.write_fast_fields(
|
||||
serializer.get_fast_field_serializer(),
|
||||
term_ord_mappings,
|
||||
&doc_id_mapping,
|
||||
)?;
|
||||
self.write_storable_fields(serializer.get_store_writer(), &doc_id_mapping)?;
|
||||
serializer.close()?;
|
||||
Ok(self.max_doc)
|
||||
}
|
||||
@@ -715,12 +1010,14 @@ mod tests {
|
||||
use crate::schema::IntOptions;
|
||||
use crate::schema::Term;
|
||||
use crate::schema::TextFieldIndexing;
|
||||
use crate::schema::INDEXED;
|
||||
use crate::schema::{Cardinality, TEXT};
|
||||
use crate::DocAddress;
|
||||
use crate::IndexSettings;
|
||||
use crate::IndexSortByField;
|
||||
use crate::IndexWriter;
|
||||
use crate::Searcher;
|
||||
use crate::{schema, DocSet, SegmentId};
|
||||
use crate::{schema::INDEXED, Order};
|
||||
use byteorder::{BigEndian, ReadBytesExt};
|
||||
use futures::executor::block_on;
|
||||
use schema::FAST;
|
||||
@@ -1178,20 +1475,56 @@ mod tests {
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
#[test]
|
||||
fn test_merge_facets_sort_none() {
|
||||
test_merge_facets(None)
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_merge_facets() {
|
||||
fn test_merge_facets_sort_asc() {
|
||||
// the data is already sorted asc, so this should have no effect, but it still exercises the doc_id
|
||||
// mapping code
|
||||
test_merge_facets(Some(IndexSettings {
|
||||
sort_by_field: Some(IndexSortByField {
|
||||
field: "intval".to_string(),
|
||||
order: Order::Asc,
|
||||
}),
|
||||
}));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_merge_facets_sort_desc() {
|
||||
test_merge_facets(Some(IndexSettings {
|
||||
sort_by_field: Some(IndexSortByField {
|
||||
field: "intval".to_string(),
|
||||
order: Order::Desc,
|
||||
}),
|
||||
}));
|
||||
}
|
||||
fn test_merge_facets(index_settings: Option<IndexSettings>) {
|
||||
let mut schema_builder = schema::Schema::builder();
|
||||
let facet_field = schema_builder.add_facet_field("facet", INDEXED);
|
||||
let index = Index::create_in_ram(schema_builder.build());
|
||||
let int_options = IntOptions::default()
|
||||
.set_fast(Cardinality::SingleValue)
|
||||
.set_indexed();
|
||||
let int_field = schema_builder.add_u64_field("intval", int_options);
|
||||
let mut index_builder = Index::builder().schema(schema_builder.build());
|
||||
if let Some(settings) = index_settings {
|
||||
index_builder = index_builder.settings(settings);
|
||||
}
|
||||
let index = index_builder.create_in_ram().unwrap();
|
||||
//let index = Index::create_in_ram(schema_builder.build());
|
||||
let reader = index.reader().unwrap();
|
||||
let mut int_val = 0;
|
||||
{
|
||||
let mut index_writer = index.writer_for_tests().unwrap();
|
||||
let index_doc = |index_writer: &mut IndexWriter, doc_facets: &[&str]| {
|
||||
let mut index_doc = |index_writer: &mut IndexWriter, doc_facets: &[&str]| {
|
||||
let mut doc = Document::default();
|
||||
for facet in doc_facets {
|
||||
doc.add_facet(facet_field, Facet::from(facet));
|
||||
}
|
||||
doc.add_u64(int_field, int_val);
|
||||
int_val += 1;
|
||||
index_writer.add_document(doc);
|
||||
};
|
||||
|
||||
|
||||
404
src/indexer/merger_sorted_index_test.rs
Normal file
@@ -0,0 +1,404 @@
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use crate::{
|
||||
collector::TopDocs,
|
||||
schema::{Cardinality, TextFieldIndexing},
|
||||
};
|
||||
use crate::{core::Index, fastfield::MultiValuedFastFieldReader};
|
||||
use crate::{
|
||||
query::QueryParser,
|
||||
schema::{IntOptions, TextOptions},
|
||||
};
|
||||
use crate::{schema::Facet, IndexSortByField};
|
||||
use crate::{schema::INDEXED, Order};
|
||||
use crate::{
|
||||
schema::{self, BytesOptions},
|
||||
DocAddress,
|
||||
};
|
||||
use crate::{IndexSettings, Term};
|
||||
use futures::executor::block_on;
|
||||
|
||||
fn create_test_index_posting_list_issue(index_settings: Option<IndexSettings>) -> Index {
|
||||
let mut schema_builder = schema::Schema::builder();
|
||||
let int_options = IntOptions::default()
|
||||
.set_fast(Cardinality::SingleValue)
|
||||
.set_indexed();
|
||||
let int_field = schema_builder.add_u64_field("intval", int_options);
|
||||
|
||||
let facet_field = schema_builder.add_facet_field("facet", INDEXED);
|
||||
|
||||
let schema = schema_builder.build();
|
||||
|
||||
let mut index_builder = Index::builder().schema(schema);
|
||||
if let Some(settings) = index_settings {
|
||||
index_builder = index_builder.settings(settings);
|
||||
}
|
||||
let index = index_builder.create_in_ram().unwrap();
|
||||
|
||||
{
|
||||
let mut index_writer = index.writer_for_tests().unwrap();
|
||||
|
||||
index_writer.add_document(doc!(int_field=>3_u64, facet_field=> Facet::from("/crime")));
|
||||
|
||||
assert!(index_writer.commit().is_ok());
|
||||
index_writer.add_document(doc!(int_field=>5_u64, facet_field=> Facet::from("/fanta")));
|
||||
|
||||
assert!(index_writer.commit().is_ok());
|
||||
}
|
||||
|
||||
// Merging the segments
|
||||
{
|
||||
let segment_ids = index
|
||||
.searchable_segment_ids()
|
||||
.expect("Searchable segments failed.");
|
||||
let mut index_writer = index.writer_for_tests().unwrap();
|
||||
assert!(block_on(index_writer.merge(&segment_ids)).is_ok());
|
||||
assert!(index_writer.wait_merging_threads().is_ok());
|
||||
}
|
||||
index
|
||||
}
|
||||
|
||||
fn create_test_index(index_settings: Option<IndexSettings>) -> Index {
|
||||
let mut schema_builder = schema::Schema::builder();
|
||||
let int_options = IntOptions::default()
|
||||
.set_fast(Cardinality::SingleValue)
|
||||
.set_stored()
|
||||
.set_indexed();
|
||||
let int_field = schema_builder.add_u64_field("intval", int_options);
|
||||
|
||||
let bytes_options = BytesOptions::default().set_fast().set_indexed();
|
||||
let bytes_field = schema_builder.add_bytes_field("bytes", bytes_options);
|
||||
let facet_field = schema_builder.add_facet_field("facet", INDEXED);
|
||||
|
||||
let multi_numbers = schema_builder.add_u64_field(
|
||||
"multi_numbers",
|
||||
IntOptions::default().set_fast(Cardinality::MultiValues),
|
||||
);
|
||||
let text_field_options = TextOptions::default()
|
||||
.set_indexing_options(
|
||||
TextFieldIndexing::default()
|
||||
.set_index_option(schema::IndexRecordOption::WithFreqsAndPositions),
|
||||
)
|
||||
.set_stored();
|
||||
let text_field = schema_builder.add_text_field("text_field", text_field_options);
|
||||
let schema = schema_builder.build();
|
||||
|
||||
let mut index_builder = Index::builder().schema(schema);
|
||||
if let Some(settings) = index_settings {
|
||||
index_builder = index_builder.settings(settings);
|
||||
}
|
||||
let index = index_builder.create_in_ram().unwrap();
|
||||
|
||||
{
|
||||
let mut index_writer = index.writer_for_tests().unwrap();
|
||||
|
||||
index_writer.add_document(doc!(int_field=>1_u64));
|
||||
index_writer.add_document(
|
||||
doc!(int_field=>3_u64, multi_numbers => 3_u64, multi_numbers => 4_u64, bytes_field => vec![1, 2, 3], text_field => "some text", facet_field=> Facet::from("/book/crime")),
|
||||
);
|
||||
index_writer.add_document(doc!(int_field=>1_u64, text_field=> "deleteme"));
|
||||
index_writer.add_document(
|
||||
doc!(int_field=>2_u64, multi_numbers => 2_u64, multi_numbers => 3_u64),
|
||||
);
|
||||
|
||||
assert!(index_writer.commit().is_ok());
|
||||
index_writer.add_document(doc!(int_field=>20_u64, multi_numbers => 20_u64));
|
||||
index_writer.add_document(doc!(int_field=>1_u64, text_field=> "deleteme", facet_field=> Facet::from("/book/crime")));
|
||||
assert!(index_writer.commit().is_ok());
|
||||
index_writer.add_document(
|
||||
doc!(int_field=>10_u64, multi_numbers => 10_u64, multi_numbers => 11_u64, text_field=> "blubber", facet_field=> Facet::from("/book/fantasy")),
|
||||
);
|
||||
index_writer.add_document(doc!(int_field=>5_u64, text_field=> "deleteme"));
|
||||
index_writer.add_document(
|
||||
doc!(int_field=>1_000u64, multi_numbers => 1001_u64, multi_numbers => 1002_u64, bytes_field => vec![5, 5],text_field => "the biggest num")
|
||||
);
|
||||
|
||||
index_writer.delete_term(Term::from_field_text(text_field, "deleteme"));
|
||||
assert!(index_writer.commit().is_ok());
|
||||
}
|
||||
|
||||
// Merging the segments
|
||||
{
|
||||
let segment_ids = index
|
||||
.searchable_segment_ids()
|
||||
.expect("Searchable segments failed.");
|
||||
let mut index_writer = index.writer_for_tests().unwrap();
|
||||
assert!(block_on(index_writer.merge(&segment_ids)).is_ok());
|
||||
assert!(index_writer.wait_merging_threads().is_ok());
|
||||
}
|
||||
index
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_merge_sorted_postinglist_sort_issue() {
|
||||
create_test_index_posting_list_issue(Some(IndexSettings {
|
||||
sort_by_field: Some(IndexSortByField {
|
||||
field: "intval".to_string(),
|
||||
order: Order::Desc,
|
||||
}),
|
||||
}));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_merge_sorted_index_desc() {
|
||||
let index = create_test_index(Some(IndexSettings {
|
||||
sort_by_field: Some(IndexSortByField {
|
||||
field: "intval".to_string(),
|
||||
order: Order::Desc,
|
||||
}),
|
||||
}));
|
||||
|
||||
let int_field = index.schema().get_field("intval").unwrap();
|
||||
let reader = index.reader().unwrap();
|
||||
|
||||
let searcher = reader.searcher();
|
||||
assert_eq!(searcher.segment_readers().len(), 1);
|
||||
let segment_reader = searcher.segment_readers().last().unwrap();
|
||||
|
||||
let fast_fields = segment_reader.fast_fields();
|
||||
let fast_field = fast_fields.u64(int_field).unwrap();
|
||||
assert_eq!(fast_field.get(5u32), 1u64);
|
||||
assert_eq!(fast_field.get(4u32), 2u64);
|
||||
assert_eq!(fast_field.get(3u32), 3u64);
|
||||
assert_eq!(fast_field.get(2u32), 10u64);
|
||||
assert_eq!(fast_field.get(1u32), 20u64);
|
||||
assert_eq!(fast_field.get(0u32), 1_000u64);
|
||||
|
||||
// test new field norm mapping
|
||||
{
|
||||
let my_text_field = index.schema().get_field("text_field").unwrap();
|
||||
let fieldnorm_reader = segment_reader.get_fieldnorms_reader(my_text_field).unwrap();
|
||||
assert_eq!(fieldnorm_reader.fieldnorm(0), 3); // the biggest num
|
||||
assert_eq!(fieldnorm_reader.fieldnorm(1), 0);
|
||||
assert_eq!(fieldnorm_reader.fieldnorm(2), 1); // blubber
|
||||
assert_eq!(fieldnorm_reader.fieldnorm(3), 2); // some text
|
||||
assert_eq!(fieldnorm_reader.fieldnorm(5), 0);
|
||||
}
|
||||
|
||||
let my_text_field = index.schema().get_field("text_field").unwrap();
|
||||
let searcher = index.reader().unwrap().searcher();
|
||||
{
|
||||
let my_text_field = index.schema().get_field("text_field").unwrap();
|
||||
|
||||
let do_search = |term: &str| {
|
||||
let query = QueryParser::for_index(&index, vec![my_text_field])
|
||||
.parse_query(term)
|
||||
.unwrap();
|
||||
let top_docs: Vec<(f32, DocAddress)> =
|
||||
searcher.search(&query, &TopDocs::with_limit(3)).unwrap();
|
||||
|
||||
top_docs.iter().map(|el| el.1.doc_id).collect::<Vec<_>>()
|
||||
};
|
||||
|
||||
assert_eq!(do_search("some"), vec![3]);
|
||||
assert_eq!(do_search("blubber"), vec![2]);
|
||||
assert_eq!(do_search("biggest"), vec![0]);
|
||||
}
|
||||
|
||||
// access doc store
|
||||
{
|
||||
let doc = searcher.doc(DocAddress::new(0, 2)).unwrap();
|
||||
assert_eq!(
|
||||
doc.get_first(my_text_field).unwrap().text(),
|
||||
Some("blubber")
|
||||
);
|
||||
let doc = searcher.doc(DocAddress::new(0, 0)).unwrap();
|
||||
assert_eq!(doc.get_first(int_field).unwrap().u64_value(), Some(1000));
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_merge_sorted_index_asc() {
|
||||
let index = create_test_index(Some(IndexSettings {
|
||||
sort_by_field: Some(IndexSortByField {
|
||||
field: "intval".to_string(),
|
||||
order: Order::Asc,
|
||||
}),
|
||||
}));
|
||||
|
||||
let int_field = index.schema().get_field("intval").unwrap();
|
||||
let multi_numbers = index.schema().get_field("multi_numbers").unwrap();
|
||||
let bytes_field = index.schema().get_field("bytes").unwrap();
|
||||
let reader = index.reader().unwrap();
|
||||
let searcher = reader.searcher();
|
||||
assert_eq!(searcher.segment_readers().len(), 1);
|
||||
let segment_reader = searcher.segment_readers().last().unwrap();
|
||||
|
||||
let fast_fields = segment_reader.fast_fields();
|
||||
let fast_field = fast_fields.u64(int_field).unwrap();
|
||||
assert_eq!(fast_field.get(0u32), 1u64);
|
||||
assert_eq!(fast_field.get(1u32), 2u64);
|
||||
assert_eq!(fast_field.get(2u32), 3u64);
|
||||
assert_eq!(fast_field.get(3u32), 10u64);
|
||||
assert_eq!(fast_field.get(4u32), 20u64);
|
||||
assert_eq!(fast_field.get(5u32), 1_000u64);
|
||||
|
||||
let get_vals = |fast_field: &MultiValuedFastFieldReader<u64>, doc_id: u32| -> Vec<u64> {
|
||||
let mut vals = vec![];
|
||||
fast_field.get_vals(doc_id, &mut vals);
|
||||
vals
|
||||
};
|
||||
let fast_fields = segment_reader.fast_fields();
|
||||
let fast_field = fast_fields.u64s(multi_numbers).unwrap();
|
||||
assert_eq!(&get_vals(&fast_field, 0), &[] as &[u64]);
|
||||
assert_eq!(&get_vals(&fast_field, 1), &[2, 3]);
|
||||
assert_eq!(&get_vals(&fast_field, 2), &[3, 4]);
|
||||
assert_eq!(&get_vals(&fast_field, 3), &[10, 11]);
|
||||
assert_eq!(&get_vals(&fast_field, 4), &[20]);
|
||||
assert_eq!(&get_vals(&fast_field, 5), &[1001, 1002]);
|
||||
|
||||
let fast_field = fast_fields.bytes(bytes_field).unwrap();
|
||||
assert_eq!(fast_field.get_bytes(0), &[] as &[u8]);
|
||||
assert_eq!(fast_field.get_bytes(2), &[1, 2, 3]);
|
||||
assert_eq!(fast_field.get_bytes(5), &[5, 5]);
|
||||
|
||||
// test new field norm mapping
|
||||
{
|
||||
let my_text_field = index.schema().get_field("text_field").unwrap();
|
||||
let fieldnorm_reader = segment_reader.get_fieldnorms_reader(my_text_field).unwrap();
|
||||
assert_eq!(fieldnorm_reader.fieldnorm(0), 0);
|
||||
assert_eq!(fieldnorm_reader.fieldnorm(1), 0);
|
||||
assert_eq!(fieldnorm_reader.fieldnorm(2), 2); // some text
|
||||
assert_eq!(fieldnorm_reader.fieldnorm(3), 1);
|
||||
assert_eq!(fieldnorm_reader.fieldnorm(5), 3); // the biggest num
|
||||
}
|
||||
|
||||
let searcher = index.reader().unwrap().searcher();
|
||||
{
|
||||
let my_text_field = index.schema().get_field("text_field").unwrap();
|
||||
|
||||
let do_search = |term: &str| {
|
||||
let query = QueryParser::for_index(&index, vec![my_text_field])
|
||||
.parse_query(term)
|
||||
.unwrap();
|
||||
let top_docs: Vec<(f32, DocAddress)> =
|
||||
searcher.search(&query, &TopDocs::with_limit(3)).unwrap();
|
||||
|
||||
top_docs.iter().map(|el| el.1.doc_id).collect::<Vec<_>>()
|
||||
};
|
||||
|
||||
assert_eq!(do_search("some"), vec![2]);
|
||||
assert_eq!(do_search("blubber"), vec![3]);
|
||||
assert_eq!(do_search("biggest"), vec![5]);
|
||||
}
|
||||
|
||||
// access doc store
|
||||
{
|
||||
let doc = searcher.doc(DocAddress::new(0, 0)).unwrap();
|
||||
assert_eq!(doc.get_first(int_field).unwrap().u64_value(), Some(1));
|
||||
let doc = searcher.doc(DocAddress::new(0, 1)).unwrap();
|
||||
assert_eq!(doc.get_first(int_field).unwrap().u64_value(), Some(2));
|
||||
let doc = searcher.doc(DocAddress::new(0, 2)).unwrap();
|
||||
assert_eq!(doc.get_first(int_field).unwrap().u64_value(), Some(3));
|
||||
let doc = searcher.doc(DocAddress::new(0, 3)).unwrap();
|
||||
assert_eq!(doc.get_first(int_field).unwrap().u64_value(), Some(10));
|
||||
let doc = searcher.doc(DocAddress::new(0, 4)).unwrap();
|
||||
assert_eq!(doc.get_first(int_field).unwrap().u64_value(), Some(20));
|
||||
let doc = searcher.doc(DocAddress::new(0, 5)).unwrap();
|
||||
assert_eq!(doc.get_first(int_field).unwrap().u64_value(), Some(1_000));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(all(test, feature = "unstable"))]
|
||||
mod bench_sorted_index_merge {
|
||||
|
||||
use crate::core::Index;
|
||||
//use crate::schema;
|
||||
use crate::fastfield::FastFieldReader;
|
||||
use crate::indexer::merger::IndexMerger;
|
||||
use crate::schema::Cardinality;
|
||||
use crate::schema::Document;
|
||||
use crate::schema::IntOptions;
|
||||
use crate::schema::Schema;
|
||||
use crate::IndexSettings;
|
||||
use crate::IndexSortByField;
|
||||
use crate::IndexWriter;
|
||||
use crate::Order;
|
||||
use futures::executor::block_on;
|
||||
use test::{self, Bencher};
|
||||
fn create_index(sort_by_field: Option<IndexSortByField>) -> Index {
|
||||
let mut schema_builder = Schema::builder();
|
||||
let int_options = IntOptions::default()
|
||||
.set_fast(Cardinality::SingleValue)
|
||||
.set_indexed();
|
||||
let int_field = schema_builder.add_u64_field("intval", int_options);
|
||||
let schema = schema_builder.build();
|
||||
|
||||
let index_builder = Index::builder()
|
||||
.schema(schema)
|
||||
.settings(IndexSettings { sort_by_field });
|
||||
let index = index_builder.create_in_ram().unwrap();
|
||||
|
||||
{
|
||||
let mut index_writer = index.writer_for_tests().unwrap();
|
||||
let index_doc = |index_writer: &mut IndexWriter, val: u64| {
|
||||
let mut doc = Document::default();
|
||||
doc.add_u64(int_field, val);
|
||||
index_writer.add_document(doc);
|
||||
};
|
||||
// 3 segments with 10_000 values in the fast fields
|
||||
for _ in 0..3 {
|
||||
index_doc(&mut index_writer, 5000); // fix to make it unordered
|
||||
for i in 0..10_000 {
|
||||
index_doc(&mut index_writer, i);
|
||||
}
|
||||
index_writer.commit().unwrap();
|
||||
}
|
||||
}
|
||||
index
|
||||
}
|
||||
#[bench]
|
||||
fn create_sorted_index_walk_overkmerge_on_merge_fastfield(
|
||||
b: &mut Bencher,
|
||||
) -> crate::Result<()> {
|
||||
let sort_by_field = IndexSortByField {
|
||||
field: "intval".to_string(),
|
||||
order: Order::Desc,
|
||||
};
|
||||
let index = create_index(Some(sort_by_field.clone()));
|
||||
let field = index.schema().get_field("intval").unwrap();
|
||||
let segments = index.searchable_segments().unwrap();
|
||||
let merger: IndexMerger =
|
||||
IndexMerger::open(index.schema(), index.settings().clone(), &segments[..])?;
|
||||
let doc_id_mapping = merger.generate_doc_id_mapping(&sort_by_field).unwrap();
|
||||
b.iter(|| {
|
||||
|
||||
let sorted_doc_ids = doc_id_mapping.iter().map(|(doc_id, reader)|{
|
||||
let u64_reader: FastFieldReader<u64> = reader
|
||||
.fast_fields()
|
||||
.typed_fast_field_reader(field)
|
||||
.expect("Failed to find a reader for single fast field. This is a tantivy bug and it should never happen.");
|
||||
(doc_id, reader, u64_reader)
|
||||
});
|
||||
// add values in order of the new docids
|
||||
let mut val = 0;
|
||||
for (doc_id, _reader, field_reader) in sorted_doc_ids {
|
||||
val = field_reader.get(*doc_id);
|
||||
}
|
||||
|
||||
val
|
||||
|
||||
});
|
||||
|
||||
Ok(())
|
||||
}
|
||||
#[bench]
|
||||
fn create_sorted_index_create_docid_mapping(b: &mut Bencher) -> crate::Result<()> {
|
||||
let sort_by_field = IndexSortByField {
|
||||
field: "intval".to_string(),
|
||||
order: Order::Desc,
|
||||
};
|
||||
let index = create_index(Some(sort_by_field.clone()));
|
||||
let field = index.schema().get_field("intval").unwrap();
|
||||
let segments = index.searchable_segments().unwrap();
|
||||
let merger: IndexMerger =
|
||||
IndexMerger::open(index.schema(), index.settings().clone(), &segments[..])?;
|
||||
b.iter(|| {
|
||||
merger.generate_doc_id_mapping(&sort_by_field).unwrap();
|
||||
});
|
||||
|
||||
Ok(())
|
||||
}
|
||||
}
|
||||
@@ -1,11 +1,13 @@
|
||||
pub mod delete_queue;
|
||||
|
||||
pub mod doc_id_mapping;
|
||||
mod doc_opstamp_mapping;
|
||||
pub mod index_writer;
|
||||
mod log_merge_policy;
|
||||
mod merge_operation;
|
||||
pub mod merge_policy;
|
||||
pub mod merger;
|
||||
mod merger_sorted_index_test;
|
||||
pub mod operation;
|
||||
mod prepared_commit;
|
||||
mod segment_entry;
|
||||
|
||||
@@ -9,7 +9,7 @@ use crate::store::StoreWriter;
|
||||
/// the data accumulated and sorted by the `SegmentWriter`.
|
||||
pub struct SegmentSerializer {
|
||||
segment: Segment,
|
||||
store_writer: StoreWriter,
|
||||
pub(crate) store_writer: StoreWriter,
|
||||
fast_field_serializer: FastFieldSerializer,
|
||||
fieldnorms_serializer: Option<FieldNormsSerializer>,
|
||||
postings_serializer: InvertedIndexSerializer,
|
||||
@@ -17,13 +17,25 @@ pub struct SegmentSerializer {
|
||||
|
||||
impl SegmentSerializer {
|
||||
/// Creates a new `SegmentSerializer`.
|
||||
pub fn for_segment(mut segment: Segment) -> crate::Result<SegmentSerializer> {
|
||||
let store_write = segment.open_write(SegmentComponent::STORE)?;
|
||||
pub fn for_segment(
|
||||
mut segment: Segment,
|
||||
is_in_merge: bool,
|
||||
) -> crate::Result<SegmentSerializer> {
|
||||
// If the segment is going to be sorted, we stream the docs first to a temporary file.
|
||||
// In the merge case this is not necessary because we can kmerge the already sorted
|
||||
// segments
|
||||
let remapping_required = segment.index().settings().sort_by_field.is_some() && !is_in_merge;
|
||||
let store_component = if remapping_required {
|
||||
SegmentComponent::TempStore
|
||||
} else {
|
||||
SegmentComponent::Store
|
||||
};
|
||||
let store_write = segment.open_write(store_component)?;
|
||||
|
||||
let fast_field_write = segment.open_write(SegmentComponent::FASTFIELDS)?;
|
||||
let fast_field_write = segment.open_write(SegmentComponent::FastFields)?;
|
||||
let fast_field_serializer = FastFieldSerializer::from_write(fast_field_write)?;
|
||||
|
||||
let fieldnorms_write = segment.open_write(SegmentComponent::FIELDNORMS)?;
|
||||
let fieldnorms_write = segment.open_write(SegmentComponent::FieldNorms)?;
|
||||
let fieldnorms_serializer = FieldNormsSerializer::from_write(fieldnorms_write)?;
|
||||
|
||||
let postings_serializer = InvertedIndexSerializer::open(&mut segment)?;
|
||||
@@ -36,10 +48,19 @@ impl SegmentSerializer {
|
||||
})
|
||||
}
|
||||
|
||||
/// The memory used (including child writers)
|
||||
pub fn mem_usage(&self) -> usize {
|
||||
self.store_writer.mem_usage()
|
||||
}
|
||||
|
||||
pub fn segment(&self) -> &Segment {
|
||||
&self.segment
|
||||
}
|
||||
|
||||
pub fn segment_mut(&mut self) -> &mut Segment {
|
||||
&mut self.segment
|
||||
}
|
||||
|
||||
/// Accessor to the `PostingsSerializer`.
|
||||
pub fn get_postings_serializer(&mut self) -> &mut InvertedIndexSerializer {
|
||||
&mut self.postings_serializer
|
||||
|
||||
@@ -1,6 +1,7 @@
|
||||
use super::segment_manager::{get_mergeable_segments, SegmentManager};
|
||||
use crate::core::Index;
|
||||
use crate::core::IndexMeta;
|
||||
use crate::core::IndexSettings;
|
||||
use crate::core::Segment;
|
||||
use crate::core::SegmentId;
|
||||
use crate::core::SegmentMeta;
|
||||
@@ -43,9 +44,14 @@ const NUM_MERGE_THREADS: usize = 4;
|
||||
/// and flushed.
|
||||
///
|
||||
/// This method is not part of tantivy's public API
|
||||
pub fn save_new_metas(schema: Schema, directory: &dyn Directory) -> crate::Result<()> {
|
||||
pub fn save_new_metas(
|
||||
schema: Schema,
|
||||
index_settings: IndexSettings,
|
||||
directory: &dyn Directory,
|
||||
) -> crate::Result<()> {
|
||||
save_metas(
|
||||
&IndexMeta {
|
||||
index_settings,
|
||||
segments: Vec::new(),
|
||||
schema,
|
||||
opstamp: 0u64,
|
||||
@@ -128,12 +134,13 @@ fn merge(
|
||||
.collect();
|
||||
|
||||
// An IndexMerger is like a "view" of our merged segments.
|
||||
let merger: IndexMerger = IndexMerger::open(index.schema(), &segments[..])?;
|
||||
let merger: IndexMerger =
|
||||
IndexMerger::open(index.schema(), index.settings().clone(), &segments[..])?;
|
||||
|
||||
// ... we just serialize this index merger in our new segment to merge the two segments.
|
||||
let segment_serializer = SegmentSerializer::for_segment(merged_segment.clone())?;
|
||||
// ... we just serialize this index merger in our new segment to merge the segments.
|
||||
let segment_serializer = SegmentSerializer::for_segment(merged_segment.clone(), true)?;
|
||||
|
||||
let num_docs = merger.write(segment_serializer)?;
|
||||
let num_docs = merger.write(segment_serializer, None)?;
|
||||
|
||||
let merged_segment_id = merged_segment.id();
|
||||
|
||||
@@ -165,6 +172,7 @@ pub fn merge_segments<Dir: Directory>(
|
||||
}
|
||||
|
||||
let target_schema = indices[0].schema();
|
||||
let target_settings = indices[0].settings().clone();
|
||||
|
||||
// let's check that all of the indices have the same schema
|
||||
if indices
|
||||
@@ -176,18 +184,32 @@ pub fn merge_segments<Dir: Directory>(
|
||||
"Attempt to merge different schema indices".to_string(),
|
||||
));
|
||||
}
|
||||
// let's check that all of the indices have the same index settings
|
||||
if indices
|
||||
.iter()
|
||||
.skip(1)
|
||||
.any(|index| index.settings() != &target_settings)
|
||||
{
|
||||
return Err(crate::TantivyError::InvalidArgument(
|
||||
"Attempt to merge indices with different index_settings".to_string(),
|
||||
));
|
||||
}
|
||||
|
||||
let mut segments: Vec<Segment> = Vec::new();
|
||||
for index in indices {
|
||||
segments.extend(index.searchable_segments()?);
|
||||
}
|
||||
|
||||
let mut merged_index = Index::create(output_directory, target_schema.clone())?;
|
||||
let mut merged_index = Index::create(output_directory, target_schema.clone(), target_settings)?;
|
||||
let merged_segment = merged_index.new_segment();
|
||||
let merged_segment_id = merged_segment.id();
|
||||
let merger: IndexMerger = IndexMerger::open(merged_index.schema(), &segments[..])?;
|
||||
let segment_serializer = SegmentSerializer::for_segment(merged_segment)?;
|
||||
let num_docs = merger.write(segment_serializer)?;
|
||||
let merger: IndexMerger = IndexMerger::open(
|
||||
merged_index.schema(),
|
||||
merged_index.settings().clone(),
|
||||
&segments[..],
|
||||
)?;
|
||||
let segment_serializer = SegmentSerializer::for_segment(merged_segment, true)?;
|
||||
let num_docs = merger.write(segment_serializer, None)?;
|
||||
|
||||
let segment_meta = merged_index.new_segment_meta(merged_segment_id, num_docs);
|
||||
|
||||
@@ -204,6 +226,7 @@ pub fn merge_segments<Dir: Directory>(
|
||||
);
|
||||
|
||||
let index_meta = IndexMeta {
|
||||
index_settings: indices[0].load_metas()?.index_settings, // index_settings of all segments should be the same
|
||||
segments: vec![segment_meta],
|
||||
schema: target_schema,
|
||||
opstamp: 0u64,
|
||||
@@ -223,7 +246,7 @@ pub(crate) struct InnerSegmentUpdater {
|
||||
//
|
||||
// This should be up to date as all update happen through
|
||||
// the unique active `SegmentUpdater`.
|
||||
active_metas: RwLock<Arc<IndexMeta>>,
|
||||
active_index_meta: RwLock<Arc<IndexMeta>>,
|
||||
pool: ThreadPool,
|
||||
merge_thread_pool: ThreadPool,
|
||||
|
||||
@@ -263,7 +286,7 @@ impl SegmentUpdater {
|
||||
})?;
|
||||
let index_meta = index.load_metas()?;
|
||||
Ok(SegmentUpdater(Arc::new(InnerSegmentUpdater {
|
||||
active_metas: RwLock::new(Arc::new(index_meta)),
|
||||
active_index_meta: RwLock::new(Arc::new(index_meta)),
|
||||
pool,
|
||||
merge_thread_pool,
|
||||
index,
|
||||
@@ -368,6 +391,7 @@ impl SegmentUpdater {
|
||||
// Segment 1 from disk 1, Segment 1 from disk 2, etc.
|
||||
commited_segment_metas.sort_by_key(|segment_meta| -(segment_meta.max_doc() as i32));
|
||||
let index_meta = IndexMeta {
|
||||
index_settings: index.settings().clone(),
|
||||
segments: commited_segment_metas,
|
||||
schema: index.schema(),
|
||||
opstamp,
|
||||
@@ -419,15 +443,15 @@ impl SegmentUpdater {
|
||||
}
|
||||
|
||||
fn store_meta(&self, index_meta: &IndexMeta) {
|
||||
*self.active_metas.write().unwrap() = Arc::new(index_meta.clone());
|
||||
*self.active_index_meta.write().unwrap() = Arc::new(index_meta.clone());
|
||||
}
|
||||
|
||||
fn load_metas(&self) -> Arc<IndexMeta> {
|
||||
self.active_metas.read().unwrap().clone()
|
||||
fn load_meta(&self) -> Arc<IndexMeta> {
|
||||
self.active_index_meta.read().unwrap().clone()
|
||||
}
|
||||
|
||||
pub(crate) fn make_merge_operation(&self, segment_ids: &[SegmentId]) -> MergeOperation {
|
||||
let commit_opstamp = self.load_metas().opstamp;
|
||||
let commit_opstamp = self.load_meta().opstamp;
|
||||
MergeOperation::new(&self.merge_operations, commit_opstamp, segment_ids.to_vec())
|
||||
}
|
||||
|
||||
@@ -497,8 +521,11 @@ impl SegmentUpdater {
|
||||
}
|
||||
});
|
||||
|
||||
Ok(merging_future_recv
|
||||
.unwrap_or_else(|_| Err(crate::TantivyError::SystemError("Merge failed".to_string()))))
|
||||
Ok(merging_future_recv.unwrap_or_else(|e| {
|
||||
Err(crate::TantivyError::SystemError(
|
||||
"Merge failed:".to_string() + &e.to_string(),
|
||||
))
|
||||
}))
|
||||
}
|
||||
|
||||
async fn consider_merge_options(&self) {
|
||||
@@ -519,7 +546,7 @@ impl SegmentUpdater {
|
||||
})
|
||||
.collect();
|
||||
|
||||
let commit_opstamp = self.load_metas().opstamp;
|
||||
let commit_opstamp = self.load_meta().opstamp;
|
||||
let committed_merge_candidates = merge_policy
|
||||
.compute_merge_candidates(&committed_segments)
|
||||
.into_iter()
|
||||
@@ -550,7 +577,7 @@ impl SegmentUpdater {
|
||||
{
|
||||
let mut delete_cursor = after_merge_segment_entry.delete_cursor().clone();
|
||||
if let Some(delete_operation) = delete_cursor.get() {
|
||||
let committed_opstamp = segment_updater.load_metas().opstamp;
|
||||
let committed_opstamp = segment_updater.load_meta().opstamp;
|
||||
if delete_operation.opstamp < committed_opstamp {
|
||||
let index = &segment_updater.index;
|
||||
let segment = index.segment(after_merge_segment_entry.meta().clone());
|
||||
@@ -574,7 +601,7 @@ impl SegmentUpdater {
|
||||
}
|
||||
}
|
||||
}
|
||||
let previous_metas = segment_updater.load_metas();
|
||||
let previous_metas = segment_updater.load_meta();
|
||||
let segments_status = segment_updater
|
||||
.segment_manager
|
||||
.end_merge(merge_operation.segment_ids(), after_merge_segment_entry)?;
|
||||
@@ -586,6 +613,7 @@ impl SegmentUpdater {
|
||||
|
||||
segment_updater.consider_merge_options().await;
|
||||
} // we drop all possible handles to a now useless `SegmentMeta`.
|
||||
|
||||
let _ = garbage_collect_files(segment_updater).await;
|
||||
Ok(())
|
||||
});
|
||||
@@ -616,7 +644,7 @@ impl SegmentUpdater {
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::merge_segments;
|
||||
use crate::directory::RAMDirectory;
|
||||
use crate::directory::RamDirectory;
|
||||
use crate::indexer::merge_policy::tests::MergeWheneverPossible;
|
||||
use crate::schema::*;
|
||||
use crate::Index;
|
||||
@@ -765,7 +793,7 @@ mod tests {
|
||||
}
|
||||
|
||||
assert_eq!(indices.len(), 3);
|
||||
let output_directory = RAMDirectory::default();
|
||||
let output_directory = RamDirectory::default();
|
||||
let index = merge_segments(&indices, output_directory)?;
|
||||
assert_eq!(index.schema(), schema);
|
||||
|
||||
@@ -780,7 +808,7 @@ mod tests {
|
||||
|
||||
#[test]
|
||||
fn test_merge_empty_indices_array() {
|
||||
let merge_result = merge_segments(&[], RAMDirectory::default());
|
||||
let merge_result = merge_segments(&[], RamDirectory::default());
|
||||
assert!(merge_result.is_err());
|
||||
}
|
||||
|
||||
@@ -807,7 +835,7 @@ mod tests {
|
||||
};
|
||||
|
||||
// mismatched schema index list
|
||||
let result = merge_segments(&[first_index, second_index], RAMDirectory::default());
|
||||
let result = merge_segments(&[first_index, second_index], RamDirectory::default());
|
||||
assert!(result.is_err());
|
||||
|
||||
Ok(())
|
||||
|
||||
@@ -1,6 +1,7 @@
|
||||
use super::operation::AddOperation;
|
||||
use crate::core::Segment;
|
||||
use crate::core::SerializableSegment;
|
||||
use super::{
|
||||
doc_id_mapping::{get_doc_id_mapping_from_field, DocIdMapping},
|
||||
operation::AddOperation,
|
||||
};
|
||||
use crate::fastfield::FastFieldsWriter;
|
||||
use crate::fieldnorm::{FieldNormReaders, FieldNormsWriter};
|
||||
use crate::indexer::segment_serializer::SegmentSerializer;
|
||||
@@ -15,6 +16,8 @@ use crate::tokenizer::{BoxTokenStream, PreTokenizedStream};
|
||||
use crate::tokenizer::{FacetTokenizer, TextAnalyzer};
|
||||
use crate::tokenizer::{TokenStreamChain, Tokenizer};
|
||||
use crate::Opstamp;
|
||||
use crate::{core::Segment, store::StoreWriter};
|
||||
use crate::{core::SerializableSegment, store::StoreReader};
|
||||
use crate::{DocId, SegmentComponent};
|
||||
|
||||
/// Computes the initial size of the hash table.
|
||||
@@ -39,12 +42,12 @@ fn initial_table_size(per_thread_memory_budget: usize) -> crate::Result<usize> {
|
||||
/// It creates the posting lists in anonymous memory.
/// The segment is laid out on disk when the segment gets `finalized`.
pub struct SegmentWriter {
|
||||
max_doc: DocId,
|
||||
multifield_postings: MultiFieldPostingsWriter,
|
||||
segment_serializer: SegmentSerializer,
|
||||
fast_field_writers: FastFieldsWriter,
|
||||
fieldnorms_writer: FieldNormsWriter,
|
||||
doc_opstamps: Vec<Opstamp>,
|
||||
pub(crate) max_doc: DocId,
|
||||
pub(crate) multifield_postings: MultiFieldPostingsWriter,
|
||||
pub(crate) segment_serializer: SegmentSerializer,
|
||||
pub(crate) fast_field_writers: FastFieldsWriter,
|
||||
pub(crate) fieldnorms_writer: FieldNormsWriter,
|
||||
pub(crate) doc_opstamps: Vec<Opstamp>,
|
||||
tokenizers: Vec<Option<TextAnalyzer>>,
|
||||
term_buffer: Term,
|
||||
}
|
||||
@@ -66,7 +69,7 @@ impl SegmentWriter {
|
||||
) -> crate::Result<SegmentWriter> {
|
||||
let tokenizer_manager = segment.index().tokenizers().clone();
|
||||
let table_num_bits = initial_table_size(memory_budget)?;
|
||||
let segment_serializer = SegmentSerializer::for_segment(segment)?;
|
||||
let segment_serializer = SegmentSerializer::for_segment(segment, false)?;
|
||||
let multifield_postings = MultiFieldPostingsWriter::new(schema, table_num_bits);
|
||||
let tokenizers = schema
|
||||
.fields()
|
||||
@@ -100,17 +103,30 @@ impl SegmentWriter {
|
||||
/// be used afterwards.
|
||||
pub fn finalize(mut self) -> crate::Result<Vec<u64>> {
|
||||
self.fieldnorms_writer.fill_up_to_max_doc(self.max_doc);
|
||||
let mapping: Option<DocIdMapping> = self
|
||||
.segment_serializer
|
||||
.segment()
|
||||
.index()
|
||||
.settings()
|
||||
.sort_by_field
|
||||
.clone()
|
||||
.map(|sort_by_field| get_doc_id_mapping_from_field(sort_by_field, &self))
|
||||
.transpose()?;
|
||||
write(
|
||||
&self.multifield_postings,
|
||||
&self.fast_field_writers,
|
||||
&self.fieldnorms_writer,
|
||||
self.segment_serializer,
|
||||
mapping.as_ref(),
|
||||
)?;
|
||||
Ok(self.doc_opstamps)
|
||||
}
|
||||
|
||||
pub fn mem_usage(&self) -> usize {
|
||||
self.multifield_postings.mem_usage()
|
||||
+ self.fieldnorms_writer.mem_usage()
|
||||
+ self.fast_field_writers.mem_usage()
|
||||
+ self.segment_serializer.mem_usage()
|
||||
}
|
||||
|
||||
/// Indexes a new document
|
||||
@@ -165,7 +181,7 @@ impl SegmentWriter {
|
||||
});
|
||||
if let Some(unordered_term_id) = unordered_term_id_opt {
|
||||
self.fast_field_writers
|
||||
.get_multivalue_writer(field)
|
||||
.get_multivalue_writer_mut(field)
|
||||
.expect("writer for facet missing")
|
||||
.add_val(unordered_term_id);
|
||||
}
|
||||
@@ -305,29 +321,61 @@ fn write(
|
||||
fast_field_writers: &FastFieldsWriter,
|
||||
fieldnorms_writer: &FieldNormsWriter,
|
||||
mut serializer: SegmentSerializer,
|
||||
doc_id_map: Option<&DocIdMapping>,
|
||||
) -> crate::Result<()> {
|
||||
if let Some(fieldnorms_serializer) = serializer.extract_fieldnorms_serializer() {
|
||||
fieldnorms_writer.serialize(fieldnorms_serializer)?;
|
||||
fieldnorms_writer.serialize(fieldnorms_serializer, doc_id_map)?;
|
||||
}
|
||||
let fieldnorm_data = serializer
|
||||
.segment()
|
||||
.open_read(SegmentComponent::FIELDNORMS)?;
|
||||
.open_read(SegmentComponent::FieldNorms)?;
|
||||
let fieldnorm_readers = FieldNormReaders::open(fieldnorm_data)?;
|
||||
let term_ord_map =
|
||||
multifield_postings.serialize(serializer.get_postings_serializer(), fieldnorm_readers)?;
|
||||
fast_field_writers.serialize(serializer.get_fast_field_serializer(), &term_ord_map)?;
|
||||
let term_ord_map = multifield_postings.serialize(
|
||||
serializer.get_postings_serializer(),
|
||||
fieldnorm_readers,
|
||||
doc_id_map,
|
||||
)?;
|
||||
fast_field_writers.serialize(
|
||||
serializer.get_fast_field_serializer(),
|
||||
&term_ord_map,
|
||||
doc_id_map,
|
||||
)?;
|
||||
// finalize temp docstore and create version, which reflects the doc_id_map
|
||||
if let Some(doc_id_map) = doc_id_map {
|
||||
let store_write = serializer
|
||||
.segment_mut()
|
||||
.open_write(SegmentComponent::Store)?;
|
||||
let old_store_writer =
|
||||
std::mem::replace(&mut serializer.store_writer, StoreWriter::new(store_write));
|
||||
old_store_writer.close()?;
|
||||
let store_read = StoreReader::open(
|
||||
serializer
|
||||
.segment()
|
||||
.open_read(SegmentComponent::TempStore)?,
|
||||
)?;
|
||||
for old_doc_id in doc_id_map.iter_old_doc_ids() {
|
||||
let doc_bytes = store_read.get_document_bytes(*old_doc_id)?;
|
||||
serializer.get_store_writer().store_bytes(&doc_bytes)?;
|
||||
}
|
||||
// TODO delete temp store
|
||||
}
|
||||
serializer.close()?;
|
||||
Ok(())
|
||||
}
|
||||
|
||||
impl SerializableSegment for SegmentWriter {
|
||||
fn write(&self, serializer: SegmentSerializer) -> crate::Result<u32> {
|
||||
fn write(
|
||||
&self,
|
||||
serializer: SegmentSerializer,
|
||||
doc_id_map: Option<&DocIdMapping>,
|
||||
) -> crate::Result<u32> {
|
||||
let max_doc = self.max_doc;
|
||||
write(
|
||||
&self.multifield_postings,
|
||||
&self.fast_field_writers,
|
||||
&self.fieldnorms_writer,
|
||||
serializer,
|
||||
doc_id_map,
|
||||
)?;
|
||||
Ok(max_doc)
|
||||
}
|
||||
|
||||
@@ -141,7 +141,7 @@ pub mod collector;
|
||||
pub mod directory;
|
||||
pub mod fastfield;
|
||||
pub mod fieldnorm;
|
||||
pub(crate) mod positions;
|
||||
pub mod positions;
|
||||
pub mod postings;
|
||||
pub mod query;
|
||||
pub mod schema;
|
||||
@@ -160,7 +160,10 @@ pub use self::docset::{DocSet, TERMINATED};
|
||||
pub use crate::common::HasLen;
|
||||
pub use crate::common::{f64_to_u64, i64_to_u64, u64_to_f64, u64_to_i64};
|
||||
pub use crate::core::{Executor, SegmentComponent};
|
||||
pub use crate::core::{Index, IndexMeta, Searcher, Segment, SegmentId, SegmentMeta};
|
||||
pub use crate::core::{
|
||||
Index, IndexBuilder, IndexMeta, IndexSettings, IndexSortByField, Order, Searcher, Segment,
|
||||
SegmentId, SegmentMeta,
|
||||
};
|
||||
pub use crate::core::{InvertedIndexReader, SegmentReader};
|
||||
pub use crate::directory::Directory;
|
||||
pub use crate::indexer::merge_segments;
|
||||
@@ -255,7 +258,7 @@ pub type Opstamp = u64;
|
||||
/// the document to the search query.
|
||||
pub type Score = f32;
|
||||
|
||||
/// A `SegmentOrdinal` identifies a segment, within a `Searcher`.
|
||||
/// A `SegmentOrdinal` identifies a segment, within a `Searcher` or `Merger`.
|
||||
pub type SegmentOrdinal = u32;
|
||||
|
||||
impl DocAddress {
|
||||
|
||||
@@ -1,28 +1,29 @@
|
||||
/// Positions are stored in three parts and over two files.
///
/// The `SegmentComponent::POSITIONS` file contains all of the bitpacked positions delta,
/// for all terms of a given field, one term after the other.
///
/// If the last block is incomplete, it is simply padded with zeros.
/// It cannot be read alone, as it actually does not contain the number of bits used for
/// each block.
/// Each block is serialized one after the other.
/// If the last block is incomplete, it is simply padded with zeros.
///
/// The `SegmentComponent::POSITIONSSKIP` file contains the number of bits used in each block, in a `u8`
/// stream.
///
/// This makes it possible to rapidly skip over `n` positions.
///
/// For every block #n where n = k * `LONG_SKIP_INTERVAL` blocks (k>=1), we also store
/// in this file the sum of the number of bits used for all of the previous blocks (blocks `[0, n[`).
/// That is useful to start reading the positions for a given term: the TermInfo contains
/// an address in the positions stream, expressed in "number of positions".
/// The long skip structure makes it possible to skip rapidly to a checkpoint close to this
/// value, and then skip normally.
///
//! Tantivy can (if instructed to do so in the schema) store the term positions in a given field.
//! These positions are expressed as token ordinals. For instance,
//! in "The beauty and the beast", the term "the" appears at position 0 and position 4.
//! This information is useful to run phrase queries.
//!
//! The `SegmentComponent::POSITIONS` file contains all of the bitpacked positions delta,
//! for all terms of a given field, one term after the other.
//!
//! Each term is encoded independently.
//! Like for posting lists, tantivy relies on SIMD bitpacking to encode the positions delta in blocks of 128 deltas.
//! Because we rarely have a multiple of 128, a final block may encode the remaining values using variable byte encoding.
//!
//! In order to make reading possible, the term delta positions first encode the number of bitpacked blocks,
//! then the bit width of each block, then the actual bitpacked blocks, and finally the final variable int encoded block.
//!
//! Contrary to postings lists, the reader does not have access to the number of positions that are encoded, and instead
//! stops decoding the last block when its byte slice has been entirely read.
//!
//! More formally:
//! * *Positions* := *NumBitPackedBlocks* *BitPackedPositionBlock*^(P/128) *BitPackedPositionsDeltaBitWidth* *VIntPosDeltas*?
//! * *NumBitPackedBlocks* := *P* / 128 encoded as a variable byte integer.
//! * *BitPackedPositionBlock* := bit width encoded block of 128 positions delta
//! * *BitPackedPositionsDeltaBitWidth* := (*BitWidth*: u8)^*NumBitPackedBlocks*
//! * *VIntPosDeltas* := *VIntPosDelta*^(*P* % 128).
//!
//! The skip widths encoded separately make it easy and fast to rapidly skip over n positions.
mod reader;
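For illustration only (editor's addition, not part of the diff): a minimal sketch of how the layout described above is written and read back, using the `PositionSerializer` / `PositionReader` API introduced in this change. It assumes the `positions` module is public, as the updated `lib.rs` makes it; inside the crate the same calls are used with `crate::` paths, as in the tests later in this hunk.

use tantivy::directory::OwnedBytes;
use tantivy::positions::{PositionReader, PositionSerializer};

fn positions_roundtrip_sketch() -> std::io::Result<()> {
    // 300 deltas = 2 bitpacked blocks of 128 values + 44 trailing values encoded as VInts.
    let deltas: Vec<u32> = (0..300).collect();

    let mut positions_buffer: Vec<u8> = Vec::new();
    let mut serializer = PositionSerializer::new(&mut positions_buffer);
    serializer.write_positions_delta(&deltas);
    serializer.close_term()?; // writes NumBitPackedBlocks, the bit widths, the blocks, then the VInt tail
    serializer.close()?;

    // The reader stops decoding the last (VInt) block when its byte slice is exhausted.
    let mut position_reader = PositionReader::open(OwnedBytes::new(positions_buffer))?;
    let mut decoded = vec![0u32; deltas.len()];
    position_reader.read(0, &mut decoded);
    assert_eq!(decoded, deltas);
    Ok(())
}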
|
||||
mod serializer;
|
||||
|
||||
@@ -31,54 +32,103 @@ pub use self::serializer::PositionSerializer;
|
||||
use bitpacking::{BitPacker, BitPacker4x};
|
||||
|
||||
const COMPRESSION_BLOCK_SIZE: usize = BitPacker4x::BLOCK_LEN;
|
||||
const LONG_SKIP_IN_BLOCKS: usize = 1_024;
|
||||
const LONG_SKIP_INTERVAL: u64 = (LONG_SKIP_IN_BLOCKS * COMPRESSION_BLOCK_SIZE) as u64;
|
||||
|
||||
#[cfg(test)]
|
||||
pub mod tests {
|
||||
|
||||
use super::PositionSerializer;
|
||||
use crate::directory::OwnedBytes;
|
||||
use crate::positions::reader::PositionReader;
|
||||
use crate::{common::HasLen, directory::FileSlice};
|
||||
use proptest::prelude::*;
|
||||
use proptest::sample::select;
|
||||
use std::iter;
|
||||
|
||||
fn create_stream_buffer(vals: &[u32]) -> (FileSlice, FileSlice) {
|
||||
let mut skip_buffer = vec![];
|
||||
let mut stream_buffer = vec![];
|
||||
{
|
||||
let mut serializer = PositionSerializer::new(&mut stream_buffer, &mut skip_buffer);
|
||||
for (i, &val) in vals.iter().enumerate() {
|
||||
assert_eq!(serializer.positions_idx(), i as u64);
|
||||
serializer.write_all(&[val]).unwrap();
|
||||
fn create_positions_data(vals: &[u32]) -> crate::Result<OwnedBytes> {
|
||||
let mut positions_buffer = vec![];
|
||||
let mut serializer = PositionSerializer::new(&mut positions_buffer);
|
||||
serializer.write_positions_delta(&vals);
|
||||
serializer.close_term()?;
|
||||
serializer.close()?;
|
||||
Ok(OwnedBytes::new(positions_buffer))
|
||||
}
|
||||
|
||||
fn gen_delta_positions() -> BoxedStrategy<Vec<u32>> {
|
||||
select(&[0, 1, 70, 127, 128, 129, 200, 255, 256, 257, 270][..])
|
||||
.prop_flat_map(|num_delta_positions| {
|
||||
proptest::collection::vec(
|
||||
select(&[1u32, 2u32, 4u32, 8u32, 16u32][..]),
|
||||
num_delta_positions,
|
||||
)
|
||||
})
|
||||
.boxed()
|
||||
}
|
||||
|
||||
proptest! {
|
||||
#[test]
|
||||
fn test_position_delta(delta_positions in gen_delta_positions()) {
|
||||
let delta_positions_data = create_positions_data(&delta_positions).unwrap();
|
||||
let mut position_reader = PositionReader::open(delta_positions_data).unwrap();
|
||||
let mut minibuf = [0u32; 1];
|
||||
for (offset, &delta_position) in delta_positions.iter().enumerate() {
|
||||
position_reader.read(offset as u64, &mut minibuf[..]);
|
||||
assert_eq!(delta_position, minibuf[0]);
|
||||
}
|
||||
serializer.close().unwrap();
|
||||
}
|
||||
(FileSlice::from(stream_buffer), FileSlice::from(skip_buffer))
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_position_read() {
|
||||
let v: Vec<u32> = (0..1000).collect();
|
||||
let (stream, skip) = create_stream_buffer(&v[..]);
|
||||
assert_eq!(skip.len(), 12);
|
||||
assert_eq!(stream.len(), 1168);
|
||||
let mut position_reader = PositionReader::new(stream, skip, 0u64).unwrap();
|
||||
fn test_position_read() -> crate::Result<()> {
|
||||
let position_deltas: Vec<u32> = (0..1000).collect();
|
||||
let positions_data = create_positions_data(&position_deltas[..])?;
|
||||
assert_eq!(positions_data.len(), 1224);
|
||||
let mut position_reader = PositionReader::open(positions_data)?;
|
||||
for &n in &[1, 10, 127, 128, 130, 312] {
|
||||
let mut v = vec![0u32; n];
|
||||
position_reader.read(0, &mut v[..]);
|
||||
for i in 0..n {
|
||||
assert_eq!(v[i], i as u32);
|
||||
assert_eq!(position_deltas[i], i as u32);
|
||||
}
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_position_read_with_offset() {
|
||||
let v: Vec<u32> = (0..1000).collect();
|
||||
let (stream, skip) = create_stream_buffer(&v[..]);
|
||||
assert_eq!(skip.len(), 12);
|
||||
assert_eq!(stream.len(), 1168);
|
||||
let mut position_reader = PositionReader::new(stream, skip, 0u64).unwrap();
|
||||
fn test_empty_position() -> crate::Result<()> {
|
||||
let mut positions_buffer = vec![];
|
||||
let mut serializer = PositionSerializer::new(&mut positions_buffer);
|
||||
serializer.close_term()?;
|
||||
serializer.close()?;
|
||||
let position_delta = OwnedBytes::new(positions_buffer);
|
||||
assert!(PositionReader::open(position_delta).is_ok());
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_multiple_write_positions() -> crate::Result<()> {
|
||||
let mut positions_buffer = vec![];
|
||||
let mut serializer = PositionSerializer::new(&mut positions_buffer);
|
||||
serializer.write_positions_delta(&[1u32, 12u32]);
|
||||
serializer.write_positions_delta(&[4u32, 17u32]);
|
||||
serializer.write_positions_delta(&[443u32]);
|
||||
serializer.close_term()?;
|
||||
serializer.close()?;
|
||||
let position_delta = OwnedBytes::new(positions_buffer);
|
||||
let mut output_delta_pos_buffer = vec![0u32; 5];
|
||||
let mut position_reader = PositionReader::open(position_delta)?;
|
||||
position_reader.read(0, &mut output_delta_pos_buffer[..]);
|
||||
assert_eq!(
|
||||
&output_delta_pos_buffer[..],
|
||||
&[1u32, 12u32, 4u32, 17u32, 443u32]
|
||||
);
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_position_read_with_offset() -> crate::Result<()> {
|
||||
let position_deltas: Vec<u32> = (0..1000).collect();
|
||||
let positions_data = create_positions_data(&position_deltas[..])?;
|
||||
assert_eq!(positions_data.len(), 1224);
|
||||
let mut position_reader = PositionReader::open(positions_data)?;
|
||||
for &offset in &[1u64, 10u64, 127u64, 128u64, 130u64, 312u64] {
|
||||
for &len in &[1, 10, 130, 500] {
|
||||
let mut v = vec![0u32; len];
|
||||
@@ -88,16 +138,16 @@ pub mod tests {
|
||||
}
|
||||
}
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_position_read_after_skip() {
|
||||
let v: Vec<u32> = (0..1_000).collect();
|
||||
let (stream, skip) = create_stream_buffer(&v[..]);
|
||||
assert_eq!(skip.len(), 12);
|
||||
assert_eq!(stream.len(), 1168);
|
||||
fn test_position_read_after_skip() -> crate::Result<()> {
|
||||
let position_deltas: Vec<u32> = (0..1_000).collect();
|
||||
let positions_data = create_positions_data(&position_deltas[..])?;
|
||||
assert_eq!(positions_data.len(), 1224);
|
||||
|
||||
let mut position_reader = PositionReader::new(stream, skip, 0u64).unwrap();
|
||||
let mut position_reader = PositionReader::open(positions_data)?;
|
||||
let mut buf = [0u32; 7];
|
||||
let mut c = 0;
|
||||
|
||||
@@ -111,15 +161,15 @@ pub mod tests {
|
||||
c += 1;
|
||||
}
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_position_reread_anchor_different_than_block() {
|
||||
let v: Vec<u32> = (0..2_000_000).collect();
|
||||
let (stream, skip) = create_stream_buffer(&v[..]);
|
||||
assert_eq!(skip.len(), 15_749);
|
||||
assert_eq!(stream.len(), 4_987_872);
|
||||
let mut position_reader = PositionReader::new(stream.clone(), skip.clone(), 0).unwrap();
|
||||
fn test_position_reread_anchor_different_than_block() -> crate::Result<()> {
|
||||
let positions_delta: Vec<u32> = (0..2_000_000).collect();
|
||||
let positions_data = create_positions_data(&positions_delta[..])?;
|
||||
assert_eq!(positions_data.len(), 5003499);
|
||||
let mut position_reader = PositionReader::open(positions_data.clone())?;
|
||||
let mut buf = [0u32; 256];
|
||||
position_reader.read(128, &mut buf);
|
||||
for i in 0..256 {
|
||||
@@ -129,62 +179,41 @@ pub mod tests {
|
||||
for i in 0..256 {
|
||||
assert_eq!(buf[i], (128 + i) as u32);
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
#[should_panic(expected = "offset arguments should be increasing.")]
|
||||
fn test_position_panic_if_called_previous_anchor() {
|
||||
let v: Vec<u32> = (0..2_000_000).collect();
|
||||
let (stream, skip) = create_stream_buffer(&v[..]);
|
||||
assert_eq!(skip.len(), 15_749);
|
||||
assert_eq!(stream.len(), 4_987_872);
|
||||
fn test_position_requesting_passed_block() -> crate::Result<()> {
|
||||
let positions_delta: Vec<u32> = (0..512).collect();
|
||||
let positions_data = create_positions_data(&positions_delta[..])?;
|
||||
assert_eq!(positions_data.len(), 533);
|
||||
let mut buf = [0u32; 1];
|
||||
let mut position_reader =
|
||||
PositionReader::new(stream.clone(), skip.clone(), 200_000).unwrap();
|
||||
let mut position_reader = PositionReader::open(positions_data)?;
|
||||
position_reader.read(230, &mut buf);
|
||||
assert_eq!(buf[0], 230);
|
||||
position_reader.read(9, &mut buf);
|
||||
assert_eq!(buf[0], 9);
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_positions_bug() {
|
||||
let mut v: Vec<u32> = vec![];
|
||||
for i in 1..200 {
|
||||
for j in 0..i {
|
||||
v.push(j);
|
||||
}
|
||||
}
|
||||
let (stream, skip) = create_stream_buffer(&v[..]);
|
||||
let mut buf = Vec::new();
|
||||
let mut position_reader = PositionReader::new(stream.clone(), skip.clone(), 0).unwrap();
|
||||
let mut offset = 0;
|
||||
for i in 1..24 {
|
||||
buf.resize(i, 0);
|
||||
position_reader.read(offset, &mut buf[..]);
|
||||
offset += i as u64;
|
||||
let r: Vec<u32> = (0..i).map(|el| el as u32).collect();
|
||||
assert_eq!(buf, &r[..]);
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_position_long_skip_const() {
|
||||
fn test_position() -> crate::Result<()> {
|
||||
const CONST_VAL: u32 = 9u32;
|
||||
let v: Vec<u32> = iter::repeat(CONST_VAL).take(2_000_000).collect();
|
||||
let (stream, skip) = create_stream_buffer(&v[..]);
|
||||
assert_eq!(skip.len(), 15_749);
|
||||
assert_eq!(stream.len(), 1_000_000);
|
||||
let mut position_reader = PositionReader::new(stream, skip, 128 * 1024).unwrap();
|
||||
let positions_delta: Vec<u32> = iter::repeat(CONST_VAL).take(2_000_000).collect();
|
||||
let positions_data = create_positions_data(&positions_delta[..])?;
|
||||
assert_eq!(positions_data.len(), 1_015_627);
|
||||
let mut position_reader = PositionReader::open(positions_data)?;
|
||||
let mut buf = [0u32; 1];
|
||||
position_reader.read(0, &mut buf);
|
||||
assert_eq!(buf[0], CONST_VAL);
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_position_long_skip_2() {
|
||||
let v: Vec<u32> = (0..2_000_000).collect();
|
||||
let (stream, skip) = create_stream_buffer(&v[..]);
|
||||
assert_eq!(skip.len(), 15_749);
|
||||
assert_eq!(stream.len(), 4_987_872);
|
||||
fn test_position_advance() -> crate::Result<()> {
|
||||
let positions_delta: Vec<u32> = (0..2_000_000).collect();
|
||||
let positions_data = create_positions_data(&positions_delta[..])?;
|
||||
assert_eq!(positions_data.len(), 5_003_499);
|
||||
for &offset in &[
|
||||
10,
|
||||
128 * 1024,
|
||||
@@ -192,11 +221,11 @@ pub mod tests {
|
||||
128 * 1024 + 7,
|
||||
128 * 10 * 1024 + 10,
|
||||
] {
|
||||
let mut position_reader =
|
||||
PositionReader::new(stream.clone(), skip.clone(), offset).unwrap();
|
||||
let mut position_reader = PositionReader::open(positions_data.clone())?;
|
||||
let mut buf = [0u32; 1];
|
||||
position_reader.read(0, &mut buf);
|
||||
position_reader.read(offset, &mut buf);
|
||||
assert_eq!(buf[0], offset as u32);
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
}
|
||||
|
||||
@@ -1,174 +1,148 @@
|
||||
use std::io;
|
||||
|
||||
use crate::common::{BinarySerializable, FixedSize};
|
||||
use crate::directory::FileSlice;
|
||||
use crate::common::{BinarySerializable, VInt};
|
||||
use crate::directory::OwnedBytes;
|
||||
use crate::positions::COMPRESSION_BLOCK_SIZE;
|
||||
use crate::positions::LONG_SKIP_INTERVAL;
|
||||
use crate::positions::LONG_SKIP_IN_BLOCKS;
|
||||
use bitpacking::{BitPacker, BitPacker4x};
|
||||
use crate::postings::compression::{BlockDecoder, VIntDecoder};
|
||||
|
||||
/// Positions works as a long sequence of compressed blocks.
/// All terms are chained one after the other.
///
/// When accessing the positions of a term, we get a positions_idx from the `TermInfo`.
/// This means we need to skip to the `nth` position efficiently.
///
/// This is done thanks to two levels of skipping that we refer to in the code
/// as `long_skip` and `short_skip`.
///
/// The `long_skip` makes it possible to skip every 1_024 compression blocks (= 131_072 positions).
/// Skipping offsets are simply stored one after the other, each over 8 bytes.
///
/// We find the number of long skips as `n / long_skip`.
///
/// Blocks are compressed using bitpacking, so `skip_read` contains the number of bytes
/// (values can go from 0bit to 32 bits) required to decompressed every block.
/// Blocks are compressed using bitpacking, so `skip_read` contains the number of bits
/// (values can go from 0bit to 32 bits) required to decompress every block.
///
/// A given block obviously takes `(128 x num_bit_for_the_block / num_bits_in_a_byte)`,
/// so skipping a block without decompressing it is just a matter of advancing that many
/// bytes.
|
||||
struct Positions {
|
||||
bit_packer: BitPacker4x,
|
||||
skip_file: FileSlice,
|
||||
position_file: FileSlice,
|
||||
long_skip_data: OwnedBytes,
|
||||
}
|
||||
|
||||
impl Positions {
|
||||
pub fn new(position_file: FileSlice, skip_file: FileSlice) -> io::Result<Positions> {
|
||||
let (body, footer) = skip_file.split_from_end(u32::SIZE_IN_BYTES);
|
||||
let footer_data = footer.read_bytes()?;
|
||||
let num_long_skips = u32::deserialize(&mut footer_data.as_slice())?;
|
||||
let (skip_file, long_skip_file) =
|
||||
body.split_from_end(u64::SIZE_IN_BYTES * (num_long_skips as usize));
|
||||
let long_skip_data = long_skip_file.read_bytes()?;
|
||||
Ok(Positions {
|
||||
bit_packer: BitPacker4x::new(),
|
||||
skip_file,
|
||||
long_skip_data,
|
||||
position_file,
|
||||
})
|
||||
}
|
||||
|
||||
/// Returns the offset of the block associated to the given `long_skip_id`.
|
||||
///
|
||||
/// One `long_skip_id` means `LONG_SKIP_IN_BLOCKS` blocks.
|
||||
fn long_skip(&self, long_skip_id: usize) -> u64 {
|
||||
if long_skip_id == 0 {
|
||||
return 0;
|
||||
}
|
||||
let long_skip_slice = self.long_skip_data.as_slice();
|
||||
let mut long_skip_blocks: &[u8] = &long_skip_slice[(long_skip_id - 1) * 8..][..8];
|
||||
u64::deserialize(&mut long_skip_blocks).expect("Index corrupted")
|
||||
}
|
||||
|
||||
fn reader(&self, offset: u64) -> io::Result<PositionReader> {
|
||||
let long_skip_id = (offset / LONG_SKIP_INTERVAL) as usize;
|
||||
let offset_num_bytes: u64 = self.long_skip(long_skip_id);
|
||||
let position_read = self
|
||||
.position_file
|
||||
.slice_from(offset_num_bytes as usize)
|
||||
.read_bytes()?;
|
||||
let skip_read = self
|
||||
.skip_file
|
||||
.slice_from(long_skip_id * LONG_SKIP_IN_BLOCKS)
|
||||
.read_bytes()?;
|
||||
Ok(PositionReader {
|
||||
bit_packer: self.bit_packer,
|
||||
skip_read,
|
||||
position_read,
|
||||
buffer: Box::new([0u32; 128]),
|
||||
block_offset: std::i64::MAX as u64,
|
||||
anchor_offset: (long_skip_id as u64) * LONG_SKIP_INTERVAL,
|
||||
abs_offset: offset,
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
#[derive(Clone)]
|
||||
pub struct PositionReader {
|
||||
skip_read: OwnedBytes,
|
||||
position_read: OwnedBytes,
|
||||
bit_packer: BitPacker4x,
|
||||
buffer: Box<[u32; COMPRESSION_BLOCK_SIZE]>,
|
||||
bit_widths: OwnedBytes,
|
||||
positions: OwnedBytes,
|
||||
|
||||
block_decoder: BlockDecoder,
|
||||
|
||||
// offset, expressed in positions, for the first position of the block currently loaded
|
||||
// block_offset is a multiple of COMPRESSION_BLOCK_SIZE.
|
||||
block_offset: u64,
|
||||
// offset, expressed in positions, for the position of the first block encoded
|
||||
// in the `self.positions` bytes, and if bitpacked, compressed using the bitwidth in
|
||||
// `self.bit_widths`.
|
||||
//
|
||||
// As we advance, anchor increases simultaneously with bit_widths and positions get consumed.
|
||||
anchor_offset: u64,
|
||||
|
||||
abs_offset: u64,
|
||||
// These are just copies used for .reset().
|
||||
original_bit_widths: OwnedBytes,
|
||||
original_positions: OwnedBytes,
|
||||
}
|
||||
|
||||
impl PositionReader {
|
||||
pub fn new(
|
||||
position_file: FileSlice,
|
||||
skip_file: FileSlice,
|
||||
offset: u64,
|
||||
) -> io::Result<PositionReader> {
|
||||
let positions = Positions::new(position_file, skip_file)?;
|
||||
positions.reader(offset)
|
||||
/// Opens and reads the term positions encoded in the `positions_data` owned bytes.
|
||||
pub fn open(mut positions_data: OwnedBytes) -> io::Result<PositionReader> {
|
||||
let num_positions_bitpacked_blocks = VInt::deserialize(&mut positions_data)?.0 as usize;
|
||||
let (bit_widths, positions) = positions_data.split(num_positions_bitpacked_blocks);
|
||||
Ok(PositionReader {
|
||||
bit_widths: bit_widths.clone(),
|
||||
positions: positions.clone(),
|
||||
block_decoder: BlockDecoder::default(),
|
||||
block_offset: std::i64::MAX as u64,
|
||||
anchor_offset: 0u64,
|
||||
original_bit_widths: bit_widths,
|
||||
original_positions: positions,
|
||||
})
|
||||
}
|
||||
|
||||
fn reset(&mut self) {
|
||||
self.positions = self.original_positions.clone();
|
||||
self.bit_widths = self.original_bit_widths.clone();
|
||||
self.block_offset = std::i64::MAX as u64;
|
||||
self.anchor_offset = 0u64;
|
||||
}
|
||||
|
||||
/// Advances by `num_blocks` bitpacked blocks.
|
||||
///
|
||||
/// Panics if there are not that many remaining blocks.
|
||||
fn advance_num_blocks(&mut self, num_blocks: usize) {
|
||||
let num_bits: usize = self.skip_read.as_ref()[..num_blocks]
|
||||
let num_bits: usize = self.bit_widths.as_ref()[..num_blocks]
|
||||
.iter()
|
||||
.cloned()
|
||||
.map(|num_bits| num_bits as usize)
|
||||
.sum();
|
||||
let num_bytes_to_skip = num_bits * COMPRESSION_BLOCK_SIZE / 8;
|
||||
self.skip_read.advance(num_blocks as usize);
|
||||
self.position_read.advance(num_bytes_to_skip);
|
||||
self.bit_widths.advance(num_blocks as usize);
|
||||
self.positions.advance(num_bytes_to_skip);
|
||||
self.anchor_offset += (num_blocks * COMPRESSION_BLOCK_SIZE) as u64;
|
||||
}
|
||||
|
||||
/// block_rel_id is counted relatively to the anchor.
|
||||
/// block_rel_id = 0 means the anchor block.
|
||||
/// block_rel_id = i means the ith block after the anchor block.
|
||||
fn load_block(&mut self, block_rel_id: usize) {
|
||||
let bit_widths = self.bit_widths.as_slice();
|
||||
let byte_offset: usize = bit_widths[0..block_rel_id]
|
||||
.iter()
|
||||
.map(|&b| b as usize)
|
||||
.sum::<usize>()
|
||||
* COMPRESSION_BLOCK_SIZE
|
||||
/ 8;
|
||||
let compressed_data = &self.positions.as_slice()[byte_offset..];
|
||||
if bit_widths.len() > block_rel_id {
|
||||
// that block is bitpacked.
|
||||
let bit_width = bit_widths[block_rel_id];
|
||||
self.block_decoder
|
||||
.uncompress_block_unsorted(compressed_data, bit_width);
|
||||
} else {
|
||||
// that block is vint encoded.
|
||||
self.block_decoder
|
||||
.uncompress_vint_unsorted_until_end(compressed_data);
|
||||
}
|
||||
self.block_offset = self.anchor_offset + (block_rel_id * COMPRESSION_BLOCK_SIZE) as u64;
|
||||
}
|
||||
|
||||
/// Fills a buffer with the positions `[offset..offset+output.len())`.
///
/// `offset` is required to be greater than or equal to the offsets given in previous calls
/// to this `PositionReader` instance.
/// This function is optimized to be called with increasing values of `offset`.
pub fn read(&mut self, mut offset: u64, mut output: &mut [u32]) {
|
||||
offset += self.abs_offset;
|
||||
assert!(
|
||||
offset >= self.anchor_offset,
|
||||
"offset arguments should be increasing."
|
||||
);
|
||||
if offset < self.anchor_offset {
|
||||
self.reset();
|
||||
}
|
||||
let delta_to_block_offset = offset as i64 - self.block_offset as i64;
|
||||
if !(0..128).contains(&delta_to_block_offset) {
|
||||
// The first position is not within the first block.
|
||||
// We need to decompress the first block.
|
||||
// (Note that it could be before or after)
|
||||
// We need to possibly skip a few blocks, and decompress the first relevant block.
|
||||
let delta_to_anchor_offset = offset - self.anchor_offset;
|
||||
let num_blocks_to_skip =
|
||||
(delta_to_anchor_offset / (COMPRESSION_BLOCK_SIZE as u64)) as usize;
|
||||
self.advance_num_blocks(num_blocks_to_skip);
|
||||
self.anchor_offset = offset - (offset % COMPRESSION_BLOCK_SIZE as u64);
|
||||
self.block_offset = self.anchor_offset;
|
||||
let num_bits = self.skip_read.as_slice()[0];
|
||||
self.bit_packer
|
||||
.decompress(self.position_read.as_ref(), self.buffer.as_mut(), num_bits);
|
||||
self.load_block(0);
|
||||
} else {
|
||||
// The request offset is within the loaded block.
|
||||
// We still need to advance anchor_offset to our current block.
|
||||
let num_blocks_to_skip =
|
||||
((self.block_offset - self.anchor_offset) / COMPRESSION_BLOCK_SIZE as u64) as usize;
|
||||
self.advance_num_blocks(num_blocks_to_skip);
|
||||
self.anchor_offset = self.block_offset;
|
||||
}
|
||||
|
||||
let mut num_bits = self.skip_read.as_slice()[0];
|
||||
let mut position_data = self.position_read.as_ref();
|
||||
|
||||
// At this point, the block containing offset is loaded, and anchor has
|
||||
// been updated to point to it as well.
|
||||
for i in 1.. {
|
||||
// we copy the part from block i - 1 that is relevant.
|
||||
let offset_in_block = (offset as usize) % COMPRESSION_BLOCK_SIZE;
|
||||
let remaining_in_block = COMPRESSION_BLOCK_SIZE - offset_in_block;
|
||||
if remaining_in_block >= output.len() {
|
||||
output.copy_from_slice(&self.buffer[offset_in_block..][..output.len()]);
|
||||
output.copy_from_slice(
|
||||
&self.block_decoder.output_array()[offset_in_block..][..output.len()],
|
||||
);
|
||||
break;
|
||||
}
|
||||
output[..remaining_in_block].copy_from_slice(&self.buffer[offset_in_block..]);
|
||||
output[..remaining_in_block]
|
||||
.copy_from_slice(&self.block_decoder.output_array()[offset_in_block..]);
|
||||
output = &mut output[remaining_in_block..];
|
||||
// we load block #i if necessary.
|
||||
offset += remaining_in_block as u64;
|
||||
position_data = &position_data[(num_bits as usize * COMPRESSION_BLOCK_SIZE / 8)..];
|
||||
num_bits = self.skip_read.as_slice()[i];
|
||||
self.bit_packer
|
||||
.decompress(position_data, self.buffer.as_mut(), num_bits);
|
||||
self.block_offset += COMPRESSION_BLOCK_SIZE as u64;
|
||||
self.load_block(i);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@@ -1,80 +1,91 @@
|
||||
use crate::common::BinarySerializable;
|
||||
use crate::common::CountingWriter;
|
||||
use crate::positions::{COMPRESSION_BLOCK_SIZE, LONG_SKIP_INTERVAL};
|
||||
use bitpacking::BitPacker;
|
||||
use bitpacking::BitPacker4x;
|
||||
use crate::common::{BinarySerializable, CountingWriter, VInt};
|
||||
use crate::positions::COMPRESSION_BLOCK_SIZE;
|
||||
use crate::postings::compression::BlockEncoder;
|
||||
use crate::postings::compression::VIntEncoder;
|
||||
use std::io::{self, Write};
|
||||
|
||||
/// The PositionSerializer is in charge of serializing all of the positions
|
||||
/// of all of the terms of a given field.
|
||||
///
|
||||
/// It is valid to call `write_positions_delta` more than once per term.
|
||||
pub struct PositionSerializer<W: io::Write> {
|
||||
bit_packer: BitPacker4x,
|
||||
write_stream: CountingWriter<W>,
|
||||
write_skip_index: W,
|
||||
block_encoder: BlockEncoder,
|
||||
positions_wrt: CountingWriter<W>,
|
||||
positions_buffer: Vec<u8>,
|
||||
block: Vec<u32>,
|
||||
buffer: Vec<u8>,
|
||||
num_ints: u64,
|
||||
long_skips: Vec<u64>,
|
||||
bit_widths: Vec<u8>,
|
||||
}
|
||||
|
||||
impl<W: io::Write> PositionSerializer<W> {
|
||||
pub fn new(write_stream: W, write_skip_index: W) -> PositionSerializer<W> {
|
||||
/// Creates a new PositionSerializer writing into the given positions_wrt.
|
||||
pub fn new(positions_wrt: W) -> PositionSerializer<W> {
|
||||
PositionSerializer {
|
||||
bit_packer: BitPacker4x::new(),
|
||||
write_stream: CountingWriter::wrap(write_stream),
|
||||
write_skip_index,
|
||||
block_encoder: BlockEncoder::new(),
|
||||
positions_wrt: CountingWriter::wrap(positions_wrt),
|
||||
positions_buffer: Vec::with_capacity(128_000),
|
||||
block: Vec::with_capacity(128),
|
||||
buffer: vec![0u8; 128 * 4],
|
||||
num_ints: 0u64,
|
||||
long_skips: Vec::new(),
|
||||
bit_widths: Vec::new(),
|
||||
}
|
||||
}
|
||||
|
||||
pub fn positions_idx(&self) -> u64 {
|
||||
self.num_ints
|
||||
/// Returns the number of bytes written in the positions write object
/// at this point.
/// When called before writing the positions of a term, this value is used as
/// the start offset.
/// When called after writing the positions of a term, this value is used as
/// the end offset.
pub fn written_bytes(&self) -> u64 {
|
||||
self.positions_wrt.written_bytes()
|
||||
}
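// Editor's illustration (not part of the diff): per the doc comment above, callers snapshot
// `written_bytes()` around a term to derive the byte range recorded for that term, roughly
// as follows (the `positions_serializer` variable below is hypothetical):
//
//     let start_offset = positions_serializer.written_bytes();
//     positions_serializer.write_positions_delta(&position_deltas);
//     positions_serializer.close_term()?;
//     let end_offset = positions_serializer.written_bytes();
//     // `start_offset..end_offset` is the slice of the positions file for this term.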
|
||||
|
||||
fn remaining_block_len(&self) -> usize {
|
||||
COMPRESSION_BLOCK_SIZE - self.block.len()
|
||||
}
|
||||
|
||||
pub fn write_all(&mut self, mut vals: &[u32]) -> io::Result<()> {
|
||||
while !vals.is_empty() {
|
||||
/// Writes all of the given positions delta.
|
||||
pub fn write_positions_delta(&mut self, mut positions_delta: &[u32]) {
|
||||
while !positions_delta.is_empty() {
|
||||
let remaining_block_len = self.remaining_block_len();
|
||||
let num_to_write = remaining_block_len.min(vals.len());
|
||||
self.block.extend(&vals[..num_to_write]);
|
||||
self.num_ints += num_to_write as u64;
|
||||
vals = &vals[num_to_write..];
|
||||
let num_to_write = remaining_block_len.min(positions_delta.len());
|
||||
self.block.extend(&positions_delta[..num_to_write]);
|
||||
positions_delta = &positions_delta[num_to_write..];
|
||||
if self.remaining_block_len() == 0 {
|
||||
self.flush_block()?;
|
||||
self.flush_block();
|
||||
}
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
|
||||
fn flush_block(&mut self) -> io::Result<()> {
|
||||
let num_bits = self.bit_packer.num_bits(&self.block[..]);
|
||||
self.write_skip_index.write_all(&[num_bits])?;
|
||||
let written_len = self
|
||||
.bit_packer
|
||||
.compress(&self.block[..], &mut self.buffer, num_bits);
|
||||
self.write_stream.write_all(&self.buffer[..written_len])?;
|
||||
fn flush_block(&mut self) {
|
||||
// encode the positions in the block
|
||||
if self.block.is_empty() {
|
||||
return;
|
||||
}
|
||||
if self.block.len() == COMPRESSION_BLOCK_SIZE {
|
||||
let (bit_width, block_encoded): (u8, &[u8]) =
|
||||
self.block_encoder.compress_block_unsorted(&self.block[..]);
|
||||
self.bit_widths.push(bit_width);
|
||||
self.positions_buffer.extend(block_encoded);
|
||||
} else {
|
||||
debug_assert!(self.block.len() < COMPRESSION_BLOCK_SIZE);
|
||||
let block_vint_encoded = self.block_encoder.compress_vint_unsorted(&self.block[..]);
|
||||
self.positions_buffer.extend_from_slice(block_vint_encoded);
|
||||
}
|
||||
self.block.clear();
|
||||
if (self.num_ints % LONG_SKIP_INTERVAL) == 0u64 {
|
||||
self.long_skips.push(self.write_stream.written_bytes());
|
||||
}
|
||||
}
|
||||
|
||||
/// Close the positions for the given term.
|
||||
pub fn close_term(&mut self) -> io::Result<()> {
|
||||
self.flush_block();
|
||||
VInt(self.bit_widths.len() as u64).serialize(&mut self.positions_wrt)?;
|
||||
self.positions_wrt.write_all(&self.bit_widths[..])?;
|
||||
self.positions_wrt.write_all(&self.positions_buffer)?;
|
||||
self.bit_widths.clear();
|
||||
self.positions_buffer.clear();
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Closes the serializer and flushes the remaining data.
|
||||
pub fn close(mut self) -> io::Result<()> {
|
||||
if !self.block.is_empty() {
|
||||
self.block.resize(COMPRESSION_BLOCK_SIZE, 0u32);
|
||||
self.flush_block()?;
|
||||
}
|
||||
for &long_skip in &self.long_skips {
|
||||
long_skip.serialize(&mut self.write_skip_index)?;
|
||||
}
|
||||
(self.long_skips.len() as u32).serialize(&mut self.write_skip_index)?;
|
||||
self.write_skip_index.flush()?;
|
||||
self.write_stream.flush()?;
|
||||
Ok(())
|
||||
self.positions_wrt.flush()
|
||||
}
|
||||
}
|
||||
|
||||
@@ -100,7 +100,7 @@ fn galloping(block_docs: &[u32], target: u32) -> usize {
|
||||
#[derive(Clone, Copy, PartialEq)]
|
||||
pub enum BlockSearcher {
|
||||
#[cfg(target_arch = "x86_64")]
|
||||
SSE2,
|
||||
Sse2,
|
||||
Scalar,
|
||||
}
|
||||
|
||||
@@ -139,7 +139,7 @@ impl BlockSearcher {
|
||||
pub(crate) fn search_in_block(self, block_docs: &AlignedBuffer, target: u32) -> usize {
|
||||
#[cfg(target_arch = "x86_64")]
|
||||
{
|
||||
if self == BlockSearcher::SSE2 {
|
||||
if self == BlockSearcher::Sse2 {
|
||||
return sse2::linear_search_sse2_128(block_docs, target);
|
||||
}
|
||||
}
|
||||
@@ -152,7 +152,7 @@ impl Default for BlockSearcher {
|
||||
#[cfg(target_arch = "x86_64")]
|
||||
{
|
||||
if is_x86_feature_detected!("sse2") {
|
||||
return BlockSearcher::SSE2;
|
||||
return BlockSearcher::Sse2;
|
||||
}
|
||||
}
|
||||
BlockSearcher::Scalar
|
||||
@@ -236,6 +236,6 @@ mod tests {
|
||||
#[cfg(target_arch = "x86_64")]
|
||||
#[test]
|
||||
fn test_search_in_block_sse2() {
|
||||
test_search_in_block_util(BlockSearcher::SSE2);
|
||||
test_search_in_block_util(BlockSearcher::Sse2);
|
||||
}
|
||||
}
|
||||
|
||||
@@ -8,7 +8,7 @@ use crate::postings::compression::{
|
||||
AlignedBuffer, BlockDecoder, VIntDecoder, COMPRESSION_BLOCK_SIZE,
|
||||
};
|
||||
use crate::postings::{BlockInfo, FreqReadingOption, SkipReader};
|
||||
use crate::query::BM25Weight;
|
||||
use crate::query::Bm25Weight;
|
||||
use crate::schema::IndexRecordOption;
|
||||
use crate::{DocId, Score, TERMINATED};
|
||||
|
||||
@@ -127,7 +127,7 @@ impl BlockSegmentPostings {
|
||||
pub fn block_max_score(
|
||||
&mut self,
|
||||
fieldnorm_reader: &FieldNormReader,
|
||||
bm25_weight: &BM25Weight,
|
||||
bm25_weight: &Bm25Weight,
|
||||
) -> Score {
|
||||
if let Some(score) = self.block_max_score_cache {
|
||||
return score;
|
||||
@@ -212,14 +212,14 @@ impl BlockSegmentPostings {
|
||||
/// `TERMINATED`. The array is also guaranteed to be aligned on 16 bytes = 128 bits.
|
||||
///
|
||||
/// This method is useful to run SSE2 linear search.
|
||||
#[inline(always)]
|
||||
#[inline]
|
||||
pub(crate) fn docs_aligned(&self) -> &AlignedBuffer {
|
||||
debug_assert!(self.block_is_loaded());
|
||||
self.doc_decoder.output_aligned()
|
||||
}
|
||||
|
||||
/// Return the document at index `idx` of the block.
|
||||
#[inline(always)]
|
||||
#[inline]
|
||||
pub fn doc(&self, idx: usize) -> u32 {
|
||||
self.doc_decoder.output(idx)
|
||||
}
|
||||
|
||||
@@ -167,6 +167,8 @@ pub trait VIntDecoder {
|
||||
num_els: usize,
|
||||
padding: u32,
|
||||
) -> usize;
|
||||
|
||||
fn uncompress_vint_unsorted_until_end(&mut self, compressed_data: &[u8]);
|
||||
}
|
||||
|
||||
impl VIntEncoder for BlockEncoder {
|
||||
@@ -202,6 +204,11 @@ impl VIntDecoder for BlockDecoder {
|
||||
self.output.0.iter_mut().for_each(|el| *el = padding);
|
||||
vint::uncompress_unsorted(compressed_data, &mut self.output.0[..num_els])
|
||||
}
|
||||
|
||||
fn uncompress_vint_unsorted_until_end(&mut self, compressed_data: &[u8]) {
|
||||
let num_els = vint::uncompress_unsorted_until_end(compressed_data, &mut self.output.0);
|
||||
self.output_len = num_els;
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
|
||||
@@ -1,4 +1,4 @@
|
||||
#[inline(always)]
|
||||
#[inline]
|
||||
pub fn compress_sorted<'a>(input: &[u32], output: &'a mut [u8], mut offset: u32) -> &'a [u8] {
|
||||
let mut byte_written = 0;
|
||||
for &v in input {
|
||||
@@ -20,7 +20,7 @@ pub fn compress_sorted<'a>(input: &[u32], output: &'a mut [u8], mut offset: u32)
|
||||
&output[..byte_written]
|
||||
}
|
||||
|
||||
#[inline(always)]
|
||||
#[inline]
|
||||
pub(crate) fn compress_unsorted<'a>(input: &[u32], output: &'a mut [u8]) -> &'a [u8] {
|
||||
let mut byte_written = 0;
|
||||
for &v in input {
|
||||
@@ -41,7 +41,7 @@ pub(crate) fn compress_unsorted<'a>(input: &[u32], output: &'a mut [u8]) -> &'a
|
||||
&output[..byte_written]
|
||||
}
|
||||
|
||||
#[inline(always)]
|
||||
#[inline]
|
||||
pub fn uncompress_sorted(compressed_data: &[u8], output: &mut [u32], offset: u32) -> usize {
|
||||
let mut read_byte = 0;
|
||||
let mut result = offset;
|
||||
@@ -61,15 +61,15 @@ pub fn uncompress_sorted(compressed_data: &[u8], output: &mut [u32], offset: u32
|
||||
read_byte
|
||||
}
|
||||
|
||||
#[inline(always)]
|
||||
#[inline]
|
||||
pub(crate) fn uncompress_unsorted(compressed_data: &[u8], output_arr: &mut [u32]) -> usize {
|
||||
let mut read_byte = 0;
|
||||
let mut num_read_bytes = 0;
|
||||
for output_mut in output_arr.iter_mut() {
|
||||
let mut result = 0u32;
|
||||
let mut shift = 0u32;
|
||||
loop {
|
||||
let cur_byte = compressed_data[read_byte];
|
||||
read_byte += 1;
|
||||
let cur_byte = compressed_data[num_read_bytes];
|
||||
num_read_bytes += 1;
|
||||
result += u32::from(cur_byte % 128u8) << shift;
|
||||
if cur_byte & 128u8 != 0u8 {
|
||||
break;
|
||||
@@ -78,5 +78,31 @@ pub(crate) fn uncompress_unsorted(compressed_data: &[u8], output_arr: &mut [u32]
|
||||
}
|
||||
*output_mut = result;
|
||||
}
|
||||
read_byte
|
||||
num_read_bytes
|
||||
}
|
||||
|
||||
#[inline]
|
||||
pub(crate) fn uncompress_unsorted_until_end(
|
||||
compressed_data: &[u8],
|
||||
output_arr: &mut [u32],
|
||||
) -> usize {
|
||||
let mut num_read_bytes = 0;
|
||||
for (num_ints_written, output_mut) in output_arr.iter_mut().enumerate() {
|
||||
if compressed_data.len() == num_read_bytes {
|
||||
return num_ints_written;
|
||||
}
|
||||
let mut result = 0u32;
|
||||
let mut shift = 0u32;
|
||||
loop {
|
||||
let cur_byte = compressed_data[num_read_bytes];
|
||||
num_read_bytes += 1;
|
||||
result += u32::from(cur_byte % 128u8) << shift;
|
||||
if cur_byte & 128u8 != 0u8 {
|
||||
break;
|
||||
}
|
||||
shift += 7;
|
||||
}
|
||||
*output_mut = result;
|
||||
}
|
||||
output_arr.len()
|
||||
}
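
The vint helpers above all share one byte layout: each u32 is split into 7-bit groups, least-significant group first, and the byte carrying the last group has its high bit set (which is why the decoders break as soon as `cur_byte & 128 != 0`). Below is a minimal, self-contained sketch of that layout; the function names are illustrative and not part of tantivy's API.

// Standalone sketch of the vint byte layout read by uncompress_unsorted above:
// 7 bits per byte, least-significant group first, high bit set on the last byte.
fn vint_encode(mut value: u32, out: &mut Vec<u8>) {
    loop {
        let bits = (value & 127) as u8;
        value >>= 7;
        if value == 0 {
            out.push(bits | 128); // the terminating byte carries the stop bit
            return;
        }
        out.push(bits);
    }
}

fn vint_decode(bytes: &[u8]) -> (u32, usize) {
    let (mut result, mut shift, mut read) = (0u32, 0u32, 0usize);
    loop {
        let cur_byte = bytes[read];
        read += 1;
        result += u32::from(cur_byte % 128) << shift;
        if cur_byte & 128 != 0 {
            return (result, read);
        }
        shift += 7;
    }
}

fn main() {
    let mut buf = Vec::new();
    for &v in &[3u32, 300, 70_000] {
        vint_encode(v, &mut buf);
    }
    let (mut offset, mut decoded) = (0usize, Vec::new());
    while offset < buf.len() {
        let (v, read) = vint_decode(&buf[offset..]);
        decoded.push(v);
        offset += read;
    }
    assert_eq!(decoded, vec![3, 300, 70_000]);
}

`uncompress_unsorted_until_end` simply repeats that decode step until the input slice is exhausted, which is what lets the new `VIntDecoder` method work without an explicit element count.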
|
||||
|
||||
@@ -68,13 +68,13 @@ pub mod tests {
|
||||
field_serializer.new_term("abc".as_bytes(), 12u32)?;
|
||||
for doc_id in 0u32..120u32 {
|
||||
let delta_positions = vec![1, 2, 3, 2];
|
||||
field_serializer.write_doc(doc_id, 4, &delta_positions)?;
|
||||
field_serializer.write_doc(doc_id, 4, &delta_positions);
|
||||
}
|
||||
field_serializer.close_term()?;
|
||||
mem::drop(field_serializer);
|
||||
posting_serializer.close()?;
|
||||
let read = segment.open_read(SegmentComponent::POSITIONS)?;
|
||||
assert!(read.len() <= 140);
|
||||
let read = segment.open_read(SegmentComponent::Positions)?;
|
||||
assert_eq!(read.len(), 207);
|
||||
Ok(())
|
||||
}
|
||||
|
||||
|
||||
@@ -1,8 +1,7 @@
|
||||
use super::stacker::{Addr, MemoryArena, TermHashMap};
|
||||
|
||||
use crate::fieldnorm::FieldNormReaders;
|
||||
use crate::postings::recorder::{
|
||||
BufferLender, NothingRecorder, Recorder, TFAndPositionRecorder, TermFrequencyRecorder,
|
||||
BufferLender, NothingRecorder, Recorder, TermFrequencyRecorder, TfAndPositionRecorder,
|
||||
};
|
||||
use crate::postings::UnorderedTermId;
|
||||
use crate::postings::{FieldSerializer, InvertedIndexSerializer};
|
||||
@@ -12,6 +11,7 @@ use crate::termdict::TermOrdinal;
|
||||
use crate::tokenizer::TokenStream;
|
||||
use crate::tokenizer::{Token, MAX_TOKEN_LEN};
|
||||
use crate::DocId;
|
||||
use crate::{fieldnorm::FieldNormReaders, indexer::doc_id_mapping::DocIdMapping};
|
||||
use fnv::FnvHashMap;
|
||||
use std::collections::HashMap;
|
||||
use std::io;
|
||||
@@ -30,7 +30,7 @@ fn posting_from_field_entry(field_entry: &FieldEntry) -> Box<dyn PostingsWriter>
|
||||
SpecializedPostingsWriter::<TermFrequencyRecorder>::new_boxed()
|
||||
}
|
||||
IndexRecordOption::WithFreqsAndPositions => {
|
||||
SpecializedPostingsWriter::<TFAndPositionRecorder>::new_boxed()
|
||||
SpecializedPostingsWriter::<TfAndPositionRecorder>::new_boxed()
|
||||
}
|
||||
})
|
||||
.unwrap_or_else(|| SpecializedPostingsWriter::<NothingRecorder>::new_boxed()),
|
||||
@@ -130,6 +130,7 @@ impl MultiFieldPostingsWriter {
|
||||
&self,
|
||||
serializer: &mut InvertedIndexSerializer,
|
||||
fieldnorm_readers: FieldNormReaders,
|
||||
doc_id_map: Option<&DocIdMapping>,
|
||||
) -> crate::Result<HashMap<Field, FnvHashMap<UnorderedTermId, TermOrdinal>>> {
|
||||
let mut term_offsets: Vec<(&[u8], Addr, UnorderedTermId)> =
|
||||
self.term_index.iter().collect();
|
||||
@@ -175,6 +176,7 @@ impl MultiFieldPostingsWriter {
|
||||
&mut field_serializer,
|
||||
&self.term_index.heap,
|
||||
&self.heap,
|
||||
doc_id_map,
|
||||
)?;
|
||||
field_serializer.close()?;
|
||||
}
|
||||
@@ -211,6 +213,7 @@ pub trait PostingsWriter {
|
||||
serializer: &mut FieldSerializer<'_>,
|
||||
term_heap: &MemoryArena,
|
||||
heap: &MemoryArena,
|
||||
doc_id_map: Option<&DocIdMapping>,
|
||||
) -> io::Result<()>;
|
||||
|
||||
/// Tokenize a text and subscribe all of its tokens.
|
||||
@@ -236,7 +239,7 @@ pub trait PostingsWriter {
|
||||
heap,
|
||||
);
|
||||
} else {
|
||||
info!(
|
||||
warn!(
|
||||
"A token exceeding MAX_TOKEN_LEN ({}>{}) was dropped. Search for \
|
||||
MAX_TOKEN_LEN in the documentation for more information.",
|
||||
token.text.len(),
|
||||
@@ -307,13 +310,14 @@ impl<Rec: Recorder + 'static> PostingsWriter for SpecializedPostingsWriter<Rec>
|
||||
serializer: &mut FieldSerializer<'_>,
|
||||
termdict_heap: &MemoryArena,
|
||||
heap: &MemoryArena,
|
||||
doc_id_map: Option<&DocIdMapping>,
|
||||
) -> io::Result<()> {
|
||||
let mut buffer_lender = BufferLender::default();
|
||||
for &(term_bytes, addr, _) in term_addrs {
|
||||
let recorder: Rec = termdict_heap.read(addr);
|
||||
let term_doc_freq = recorder.term_doc_freq().unwrap_or(0u32);
|
||||
serializer.new_term(&term_bytes[4..], term_doc_freq)?;
|
||||
recorder.serialize(&mut buffer_lender, serializer, heap)?;
|
||||
recorder.serialize(&mut buffer_lender, serializer, heap, doc_id_map);
|
||||
serializer.close_term()?;
|
||||
}
|
||||
Ok(())
|
||||
|
||||
@@ -1,8 +1,10 @@
|
||||
use super::stacker::{ExpUnrolledLinkedList, MemoryArena};
|
||||
use crate::common::{read_u32_vint, write_u32_vint};
|
||||
use crate::postings::FieldSerializer;
|
||||
use crate::DocId;
|
||||
use std::io;
|
||||
use crate::{
|
||||
common::{read_u32_vint, write_u32_vint},
|
||||
indexer::doc_id_mapping::DocIdMapping,
|
||||
};
|
||||
|
||||
const POSITION_END: u32 = 0;
|
||||
|
||||
@@ -74,7 +76,8 @@ pub(crate) trait Recorder: Copy + 'static {
|
||||
buffer_lender: &mut BufferLender,
|
||||
serializer: &mut FieldSerializer<'_>,
|
||||
heap: &MemoryArena,
|
||||
) -> io::Result<()>;
|
||||
doc_id_map: Option<&DocIdMapping>,
|
||||
);
|
||||
/// Returns the number of documents containing this term.
|
||||
///
|
||||
/// Returns `None` if not available.
|
||||
@@ -114,14 +117,26 @@ impl Recorder for NothingRecorder {
|
||||
buffer_lender: &mut BufferLender,
|
||||
serializer: &mut FieldSerializer<'_>,
|
||||
heap: &MemoryArena,
|
||||
) -> io::Result<()> {
|
||||
let buffer = buffer_lender.lend_u8();
|
||||
doc_id_map: Option<&DocIdMapping>,
|
||||
) {
|
||||
let (buffer, doc_ids) = buffer_lender.lend_all();
|
||||
self.stack.read_to_end(heap, buffer);
|
||||
// TODO avoid reading twice.
|
||||
for doc in VInt32Reader::new(&buffer[..]) {
|
||||
serializer.write_doc(doc as u32, 0u32, &[][..])?;
|
||||
//TODO avoid reading twice.
|
||||
if let Some(doc_id_map) = doc_id_map {
|
||||
doc_ids.extend(
|
||||
VInt32Reader::new(&buffer[..])
|
||||
.map(|old_doc_id| doc_id_map.get_new_doc_id(old_doc_id)),
|
||||
);
|
||||
doc_ids.sort_unstable();
|
||||
|
||||
for doc in doc_ids {
|
||||
serializer.write_doc(*doc, 0u32, &[][..]);
|
||||
}
|
||||
} else {
|
||||
for doc in VInt32Reader::new(&buffer[..]) {
|
||||
serializer.write_doc(doc, 0u32, &[][..]);
|
||||
}
|
||||
}
|
||||
Ok(())
|
||||
}
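
The branch above is the pattern every recorder in this hunk follows once a `DocIdMapping` is present: buffered (old) doc ids are translated to their new ids and re-sorted, because postings must be serialized in increasing doc id order. Here is a small self-contained sketch of that remap-and-sort step, with a plain `old_to_new` slice standing in for `DocIdMapping::get_new_doc_id` (which is defined outside this diff):

// Remap buffered (old) doc ids through an old -> new mapping and return them
// in the order they must be serialized: increasing new doc id.
fn remap_and_sort(buffered_old_doc_ids: &[u32], old_to_new: &[u32]) -> Vec<u32> {
    let mut new_ids: Vec<u32> = buffered_old_doc_ids
        .iter()
        .map(|&old| old_to_new[old as usize])
        .collect();
    new_ids.sort_unstable();
    new_ids
}

fn main() {
    // The index was sorted so that old doc 2 becomes new doc 0, old doc 1 stays 1, etc.
    let old_to_new = [3u32, 1, 0, 2];
    assert_eq!(remap_and_sort(&[0, 1, 2, 3], &old_to_new), vec![0, 1, 2, 3]);
    assert_eq!(remap_and_sort(&[1, 3], &old_to_new), vec![1, 2]);
}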
|
||||
|
||||
fn term_doc_freq(&self) -> Option<u32> {
|
||||
@@ -142,7 +157,7 @@ impl Recorder for TermFrequencyRecorder {
|
||||
fn new() -> Self {
|
||||
TermFrequencyRecorder {
|
||||
stack: ExpUnrolledLinkedList::new(),
|
||||
current_doc: u32::max_value(),
|
||||
current_doc: 0,
|
||||
current_tf: 0u32,
|
||||
term_doc_freq: 0u32,
|
||||
}
|
||||
@@ -173,16 +188,28 @@ impl Recorder for TermFrequencyRecorder {
|
||||
buffer_lender: &mut BufferLender,
|
||||
serializer: &mut FieldSerializer<'_>,
|
||||
heap: &MemoryArena,
|
||||
) -> io::Result<()> {
|
||||
doc_id_map: Option<&DocIdMapping>,
|
||||
) {
|
||||
let buffer = buffer_lender.lend_u8();
|
||||
self.stack.read_to_end(heap, buffer);
|
||||
let mut u32_it = VInt32Reader::new(&buffer[..]);
|
||||
while let Some(doc) = u32_it.next() {
|
||||
let term_freq = u32_it.next().unwrap_or(self.current_tf);
|
||||
serializer.write_doc(doc as u32, term_freq, &[][..])?;
|
||||
}
|
||||
if let Some(doc_id_map) = doc_id_map {
|
||||
let mut doc_id_and_tf = vec![];
|
||||
while let Some(old_doc_id) = u32_it.next() {
|
||||
let term_freq = u32_it.next().unwrap_or(self.current_tf);
|
||||
doc_id_and_tf.push((doc_id_map.get_new_doc_id(old_doc_id), term_freq));
|
||||
}
|
||||
doc_id_and_tf.sort_unstable_by_key(|&(doc_id, _)| doc_id);
|
||||
|
||||
Ok(())
|
||||
for (doc_id, tf) in doc_id_and_tf {
|
||||
serializer.write_doc(doc_id, tf, &[][..]);
|
||||
}
|
||||
} else {
|
||||
while let Some(doc) = u32_it.next() {
|
||||
let term_freq = u32_it.next().unwrap_or(self.current_tf);
|
||||
serializer.write_doc(doc, term_freq, &[][..]);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
fn term_doc_freq(&self) -> Option<u32> {
|
||||
@@ -192,14 +219,14 @@ impl Recorder for TermFrequencyRecorder {
|
||||
|
||||
/// Recorder encoding term frequencies as well as positions.
|
||||
#[derive(Clone, Copy)]
|
||||
pub struct TFAndPositionRecorder {
|
||||
pub struct TfAndPositionRecorder {
|
||||
stack: ExpUnrolledLinkedList,
|
||||
current_doc: DocId,
|
||||
term_doc_freq: u32,
|
||||
}
|
||||
impl Recorder for TFAndPositionRecorder {
|
||||
impl Recorder for TfAndPositionRecorder {
|
||||
fn new() -> Self {
|
||||
TFAndPositionRecorder {
|
||||
TfAndPositionRecorder {
|
||||
stack: ExpUnrolledLinkedList::new(),
|
||||
current_doc: u32::max_value(),
|
||||
term_doc_freq: 0u32,
|
||||
@@ -229,10 +256,12 @@ impl Recorder for TFAndPositionRecorder {
|
||||
buffer_lender: &mut BufferLender,
|
||||
serializer: &mut FieldSerializer<'_>,
|
||||
heap: &MemoryArena,
|
||||
) -> io::Result<()> {
|
||||
doc_id_map: Option<&DocIdMapping>,
|
||||
) {
|
||||
let (buffer_u8, buffer_positions) = buffer_lender.lend_all();
|
||||
self.stack.read_to_end(heap, buffer_u8);
|
||||
let mut u32_it = VInt32Reader::new(&buffer_u8[..]);
|
||||
let mut doc_id_and_positions = vec![];
|
||||
while let Some(doc) = u32_it.next() {
|
||||
let mut prev_position_plus_one = 1u32;
|
||||
buffer_positions.clear();
|
||||
@@ -248,9 +277,20 @@ impl Recorder for TFAndPositionRecorder {
|
||||
}
|
||||
}
|
||||
}
|
||||
serializer.write_doc(doc, buffer_positions.len() as u32, &buffer_positions)?;
|
||||
if let Some(doc_id_map) = doc_id_map {
|
||||
// this simple variant to remap may consume too much memory
|
||||
doc_id_and_positions
|
||||
.push((doc_id_map.get_new_doc_id(doc), buffer_positions.to_vec()));
|
||||
} else {
|
||||
serializer.write_doc(doc, buffer_positions.len() as u32, &buffer_positions);
|
||||
}
|
||||
}
|
||||
if doc_id_map.is_some() {
|
||||
doc_id_and_positions.sort_unstable_by_key(|&(doc_id, _)| doc_id);
|
||||
for (doc_id, positions) in doc_id_and_positions {
|
||||
serializer.write_doc(doc_id, positions.len() as u32, &positions);
|
||||
}
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
|
||||
fn term_doc_freq(&self) -> Option<u32> {
|
||||
|
||||
@@ -204,7 +204,7 @@ impl DocSet for SegmentPostings {
|
||||
}
|
||||
|
||||
/// Return the current document's `DocId`.
|
||||
#[inline(always)]
|
||||
#[inline]
|
||||
fn doc(&self) -> DocId {
|
||||
self.block_cursor.doc(self.cur)
|
||||
}
|
||||
|
||||
@@ -7,7 +7,7 @@ use crate::fieldnorm::FieldNormReader;
|
||||
use crate::positions::PositionSerializer;
|
||||
use crate::postings::compression::{BlockEncoder, VIntEncoder, COMPRESSION_BLOCK_SIZE};
|
||||
use crate::postings::skip::SkipSerializer;
|
||||
use crate::query::BM25Weight;
|
||||
use crate::query::Bm25Weight;
|
||||
use crate::schema::{Field, FieldEntry, FieldType};
|
||||
use crate::schema::{IndexRecordOption, Schema};
|
||||
use crate::termdict::{TermDictionaryBuilder, TermOrdinal};
|
||||
@@ -50,19 +50,17 @@ pub struct InvertedIndexSerializer {
|
||||
terms_write: CompositeWrite<WritePtr>,
|
||||
postings_write: CompositeWrite<WritePtr>,
|
||||
positions_write: CompositeWrite<WritePtr>,
|
||||
positionsidx_write: CompositeWrite<WritePtr>,
|
||||
schema: Schema,
|
||||
}
|
||||
|
||||
impl InvertedIndexSerializer {
|
||||
/// Open a new `PostingsSerializer` for the given segment
|
||||
pub fn open(segment: &mut Segment) -> crate::Result<InvertedIndexSerializer> {
|
||||
use crate::SegmentComponent::{POSITIONS, POSITIONSSKIP, POSTINGS, TERMS};
|
||||
use crate::SegmentComponent::{Positions, Postings, Terms};
|
||||
let inv_index_serializer = InvertedIndexSerializer {
|
||||
terms_write: CompositeWrite::wrap(segment.open_write(TERMS)?),
|
||||
postings_write: CompositeWrite::wrap(segment.open_write(POSTINGS)?),
|
||||
positions_write: CompositeWrite::wrap(segment.open_write(POSITIONS)?),
|
||||
positionsidx_write: CompositeWrite::wrap(segment.open_write(POSITIONSSKIP)?),
|
||||
terms_write: CompositeWrite::wrap(segment.open_write(Terms)?),
|
||||
postings_write: CompositeWrite::wrap(segment.open_write(Postings)?),
|
||||
positions_write: CompositeWrite::wrap(segment.open_write(Positions)?),
|
||||
schema: segment.schema(),
|
||||
};
|
||||
Ok(inv_index_serializer)
|
||||
@@ -82,7 +80,6 @@ impl InvertedIndexSerializer {
|
||||
let term_dictionary_write = self.terms_write.for_field(field);
|
||||
let postings_write = self.postings_write.for_field(field);
|
||||
let positions_write = self.positions_write.for_field(field);
|
||||
let positionsidx_write = self.positionsidx_write.for_field(field);
|
||||
let field_type: FieldType = (*field_entry.field_type()).clone();
|
||||
FieldSerializer::create(
|
||||
&field_type,
|
||||
@@ -90,7 +87,6 @@ impl InvertedIndexSerializer {
|
||||
term_dictionary_write,
|
||||
postings_write,
|
||||
positions_write,
|
||||
positionsidx_write,
|
||||
fieldnorm_reader,
|
||||
)
|
||||
}
|
||||
@@ -100,7 +96,6 @@ impl InvertedIndexSerializer {
|
||||
self.terms_write.close()?;
|
||||
self.postings_write.close()?;
|
||||
self.positions_write.close()?;
|
||||
self.positionsidx_write.close()?;
|
||||
Ok(())
|
||||
}
|
||||
}
|
||||
@@ -123,7 +118,6 @@ impl<'a> FieldSerializer<'a> {
|
||||
term_dictionary_write: &'a mut CountingWriter<WritePtr>,
|
||||
postings_write: &'a mut CountingWriter<WritePtr>,
|
||||
positions_write: &'a mut CountingWriter<WritePtr>,
|
||||
positionsidx_write: &'a mut CountingWriter<WritePtr>,
|
||||
fieldnorm_reader: Option<FieldNormReader>,
|
||||
) -> io::Result<FieldSerializer<'a>> {
|
||||
total_num_tokens.serialize(postings_write)?;
|
||||
@@ -145,7 +139,7 @@ impl<'a> FieldSerializer<'a> {
|
||||
let postings_serializer =
|
||||
PostingsSerializer::new(postings_write, average_fieldnorm, mode, fieldnorm_reader);
|
||||
let positions_serializer_opt = if mode.has_positions() {
|
||||
Some(PositionSerializer::new(positions_write, positionsidx_write))
|
||||
Some(PositionSerializer::new(positions_write))
|
||||
} else {
|
||||
None
|
||||
};
|
||||
@@ -161,17 +155,17 @@ impl<'a> FieldSerializer<'a> {
|
||||
}
|
||||
|
||||
fn current_term_info(&self) -> TermInfo {
|
||||
let positions_idx =
|
||||
let positions_start =
|
||||
if let Some(positions_serializer) = self.positions_serializer_opt.as_ref() {
|
||||
positions_serializer.positions_idx()
|
||||
positions_serializer.written_bytes()
|
||||
} else {
|
||||
0u64
|
||||
};
|
||||
let addr = self.postings_serializer.addr() as usize;
|
||||
} as usize;
|
||||
let addr = self.postings_serializer.written_bytes() as usize;
|
||||
TermInfo {
|
||||
doc_freq: 0,
|
||||
postings_range: addr..addr,
|
||||
positions_idx,
|
||||
positions_range: positions_start..positions_start,
|
||||
}
|
||||
}
|
||||
|
||||
@@ -204,18 +198,12 @@ impl<'a> FieldSerializer<'a> {
|
||||
///
|
||||
/// Term frequencies and positions may be ignored by the serializer depending
|
||||
/// on the configuration of the field in the `Schema`.
|
||||
pub fn write_doc(
|
||||
&mut self,
|
||||
doc_id: DocId,
|
||||
term_freq: u32,
|
||||
position_deltas: &[u32],
|
||||
) -> io::Result<()> {
|
||||
pub fn write_doc(&mut self, doc_id: DocId, term_freq: u32, position_deltas: &[u32]) {
|
||||
self.current_term_info.doc_freq += 1;
|
||||
self.postings_serializer.write_doc(doc_id, term_freq);
|
||||
if let Some(ref mut positions_serializer) = self.positions_serializer_opt.as_mut() {
|
||||
positions_serializer.write_all(position_deltas)?;
|
||||
positions_serializer.write_positions_delta(position_deltas);
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Finish the serialization for this term postings.
|
||||
@@ -226,7 +214,14 @@ impl<'a> FieldSerializer<'a> {
|
||||
if self.term_open {
|
||||
self.postings_serializer
|
||||
.close_term(self.current_term_info.doc_freq)?;
|
||||
self.current_term_info.postings_range.end = self.postings_serializer.addr() as usize;
|
||||
self.current_term_info.postings_range.end =
|
||||
self.postings_serializer.written_bytes() as usize;
|
||||
|
||||
if let Some(positions_serializer) = self.positions_serializer_opt.as_mut() {
|
||||
positions_serializer.close_term()?;
|
||||
self.current_term_info.positions_range.end =
|
||||
positions_serializer.written_bytes() as usize;
|
||||
}
|
||||
self.term_dictionary_builder
|
||||
.insert_value(&self.current_term_info)?;
|
||||
self.term_open = false;
|
||||
@@ -307,7 +302,7 @@ pub struct PostingsSerializer<W: Write> {
|
||||
mode: IndexRecordOption,
|
||||
fieldnorm_reader: Option<FieldNormReader>,
|
||||
|
||||
bm25_weight: Option<BM25Weight>,
|
||||
bm25_weight: Option<Bm25Weight>,
|
||||
|
||||
num_docs: u32, // Number of docs in the segment
|
||||
avg_fieldnorm: Score, // Average number of term in the field for that segment.
|
||||
@@ -347,7 +342,7 @@ impl<W: Write> PostingsSerializer<W> {
|
||||
|
||||
pub fn new_term(&mut self, term_doc_freq: u32) {
|
||||
if self.mode.has_freq() && self.num_docs > 0 {
|
||||
let bm25_weight = BM25Weight::for_one_term(
|
||||
let bm25_weight = Bm25Weight::for_one_term(
|
||||
term_doc_freq as u64,
|
||||
self.num_docs as u64,
|
||||
self.avg_fieldnorm,
|
||||
@@ -457,7 +452,13 @@ impl<W: Write> PostingsSerializer<W> {
|
||||
Ok(())
|
||||
}
|
||||
|
||||
fn addr(&self) -> u64 {
|
||||
/// Returns the number of bytes written in the postings write object
|
||||
/// at this point.
|
||||
/// When called before writing the postings of a term, this value is used as the
|
||||
/// start offset.
|
||||
/// When called after writing the postings of a term, this value is used as the
|
||||
/// end offset.
|
||||
fn written_bytes(&self) -> u64 {
|
||||
self.output_write.written_bytes() as u64
|
||||
}
|
||||
|
||||
|
||||
@@ -2,16 +2,16 @@ use std::convert::TryInto;
|
||||
|
||||
use crate::directory::OwnedBytes;
|
||||
use crate::postings::compression::{compressed_block_size, COMPRESSION_BLOCK_SIZE};
|
||||
use crate::query::BM25Weight;
|
||||
use crate::query::Bm25Weight;
|
||||
use crate::schema::IndexRecordOption;
|
||||
use crate::{DocId, Score, TERMINATED};
|
||||
|
||||
#[inline(always)]
|
||||
#[inline]
|
||||
fn encode_block_wand_max_tf(max_tf: u32) -> u8 {
|
||||
max_tf.min(u8::MAX as u32) as u8
|
||||
}
|
||||
|
||||
#[inline(always)]
|
||||
#[inline]
|
||||
fn decode_block_wand_max_tf(max_tf_code: u8) -> u32 {
|
||||
if max_tf_code == u8::MAX {
|
||||
u32::MAX
|
||||
@@ -20,12 +20,12 @@ fn decode_block_wand_max_tf(max_tf_code: u8) -> u32 {
|
||||
}
|
||||
}
|
||||
|
||||
#[inline(always)]
|
||||
#[inline]
|
||||
fn read_u32(data: &[u8]) -> u32 {
|
||||
u32::from_le_bytes(data[..4].try_into().unwrap())
|
||||
}
|
||||
|
||||
#[inline(always)]
|
||||
#[inline]
|
||||
fn write_u32(val: u32, buf: &mut Vec<u8>) {
|
||||
buf.extend_from_slice(&val.to_le_bytes());
|
||||
}
|
||||
@@ -144,7 +144,7 @@ impl SkipReader {
|
||||
//
|
||||
// The block max score is available for all full bitpacked blocks,
|
||||
// but not available for the last VInt encoded incomplete block.
|
||||
pub fn block_max_score(&self, bm25_weight: &BM25Weight) -> Option<Score> {
|
||||
pub fn block_max_score(&self, bm25_weight: &Bm25Weight) -> Option<Score> {
|
||||
match self.block_info {
|
||||
BlockInfo::BitPacked {
|
||||
block_wand_fieldnorm_id,
|
||||
@@ -163,7 +163,7 @@ impl SkipReader {
|
||||
self.position_offset
|
||||
}
|
||||
|
||||
#[inline(always)]
|
||||
#[inline]
|
||||
pub fn byte_offset(&self) -> usize {
|
||||
self.byte_offset
|
||||
}
|
||||
|
||||
@@ -147,7 +147,7 @@ impl ExpUnrolledLinkedList {
|
||||
}
|
||||
}
|
||||
|
||||
#[inline(always)]
|
||||
#[inline]
|
||||
pub fn writer<'a>(&'a mut self, heap: &'a mut MemoryArena) -> ExpUnrolledLinkedListWriter<'a> {
|
||||
ExpUnrolledLinkedListWriter { eull: self, heap }
|
||||
}
|
||||
|
||||
@@ -132,7 +132,7 @@ impl MemoryArena {
|
||||
self.pages[addr.page_id()].slice_from(addr.page_local_addr())
|
||||
}
|
||||
|
||||
#[inline(always)]
|
||||
#[inline]
|
||||
pub fn slice_mut(&mut self, addr: Addr, len: usize) -> &mut [u8] {
|
||||
self.pages[addr.page_id()].slice_mut(addr.page_local_addr(), len)
|
||||
}
|
||||
@@ -162,7 +162,7 @@ impl Page {
|
||||
}
|
||||
}
|
||||
|
||||
#[inline(always)]
|
||||
#[inline]
|
||||
fn is_available(&self, len: usize) -> bool {
|
||||
len + self.len <= PAGE_SIZE
|
||||
}
|
||||
|
||||
@@ -118,7 +118,7 @@ impl TermHashMap {
|
||||
self.table.len() < self.occupied.len() * 3
|
||||
}
|
||||
|
||||
#[inline(always)]
|
||||
#[inline]
|
||||
fn get_key_value(&self, addr: Addr) -> (&[u8], Addr) {
|
||||
let data = self.heap.slice_from(addr);
|
||||
let key_bytes_len = NativeEndian::read_u16(data) as usize;
|
||||
@@ -126,7 +126,7 @@ impl TermHashMap {
|
||||
(key_bytes, addr.offset(2u32 + key_bytes_len as u32))
|
||||
}
|
||||
|
||||
#[inline(always)]
|
||||
#[inline]
|
||||
fn get_value_addr_if_key_match(&self, target_key: &[u8], addr: Addr) -> Option<Addr> {
|
||||
let (stored_key, value_addr) = self.get_key_value(addr);
|
||||
if stored_key == target_key {
|
||||
|
||||
@@ -11,8 +11,8 @@ pub struct TermInfo {
|
||||
pub doc_freq: u32,
|
||||
/// Byte range of the posting list within the postings (`.idx`) file.
|
||||
pub postings_range: Range<usize>,
|
||||
/// Start offset of the first block within the position (`.pos`) file.
|
||||
pub positions_idx: u64,
|
||||
/// Byte range of the positions of this term in the positions (`.pos`) file.
|
||||
pub positions_range: Range<usize>,
|
||||
}
|
||||
|
||||
impl TermInfo {
|
||||
@@ -21,6 +21,12 @@ impl TermInfo {
|
||||
assert!(num_bytes <= std::u32::MAX as usize);
|
||||
num_bytes as u32
|
||||
}
|
||||
|
||||
pub(crate) fn positions_num_bytes(&self) -> u32 {
|
||||
let num_bytes = self.positions_range.len();
|
||||
assert!(num_bytes <= std::u32::MAX as usize);
|
||||
num_bytes as u32
|
||||
}
|
||||
}
|
||||
|
||||
impl FixedSize for TermInfo {
|
||||
@@ -28,7 +34,7 @@ impl FixedSize for TermInfo {
|
||||
/// This is large, but in practice, `TermInfo` are encoded in blocks and
|
||||
/// only the first `TermInfo` of a block is serialized uncompressed.
|
||||
/// The subsequent `TermInfo` are delta encoded and bitpacked.
|
||||
const SIZE_IN_BYTES: usize = 2 * u32::SIZE_IN_BYTES + 2 * u64::SIZE_IN_BYTES;
|
||||
const SIZE_IN_BYTES: usize = 3 * u32::SIZE_IN_BYTES + 2 * u64::SIZE_IN_BYTES;
|
||||
}
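
The constant change simply tracks the new field layout serialized below: doc_freq (u32) + postings start (u64) + postings length (u32) + positions start (u64) + positions length (u32), i.e. 3 * 4 + 2 * 8 = 28 bytes per uncompressed TermInfo, whereas the old layout with a single positions_idx came to 2 * 4 + 2 * 8 = 24 bytes.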
|
||||
|
||||
impl BinarySerializable for TermInfo {
|
||||
@@ -36,20 +42,23 @@ impl BinarySerializable for TermInfo {
|
||||
self.doc_freq.serialize(writer)?;
|
||||
(self.postings_range.start as u64).serialize(writer)?;
|
||||
self.posting_num_bytes().serialize(writer)?;
|
||||
self.positions_idx.serialize(writer)?;
|
||||
(self.positions_range.start as u64).serialize(writer)?;
|
||||
self.positions_num_bytes().serialize(writer)?;
|
||||
Ok(())
|
||||
}
|
||||
|
||||
fn deserialize<R: io::Read>(reader: &mut R) -> io::Result<Self> {
|
||||
let doc_freq = u32::deserialize(reader)?;
|
||||
let postings_start_offset = u64::deserialize(reader)? as usize;
|
||||
let postings_num_bytes = u32::deserialize(reader)?;
|
||||
let postings_end_offset = postings_start_offset + u64::from(postings_num_bytes) as usize;
|
||||
let positions_idx = u64::deserialize(reader)?;
|
||||
let postings_num_bytes = u32::deserialize(reader)? as usize;
|
||||
let postings_end_offset = postings_start_offset + postings_num_bytes;
|
||||
let positions_start_offset = u64::deserialize(reader)? as usize;
|
||||
let positions_num_bytes = u32::deserialize(reader)? as usize;
|
||||
let positions_end_offset = positions_start_offset + positions_num_bytes;
|
||||
Ok(TermInfo {
|
||||
doc_freq,
|
||||
postings_range: postings_start_offset..postings_end_offset,
|
||||
positions_idx,
|
||||
positions_range: positions_start_offset..positions_end_offset,
|
||||
})
|
||||
}
|
||||
}
|
||||
@@ -60,6 +69,8 @@ mod tests {
|
||||
use super::TermInfo;
|
||||
use crate::common::test::fixed_size_test;
|
||||
|
||||
// TODO add serialize/deserialize test for terminfo
|
||||
|
||||
#[test]
|
||||
fn test_fixed_size() {
|
||||
fixed_size_test::<TermInfo>();
|
||||
|
||||
@@ -9,7 +9,7 @@ use serde::Serialize;
|
||||
const K1: Score = 1.2;
|
||||
const B: Score = 0.75;
|
||||
|
||||
fn idf(doc_freq: u64, doc_count: u64) -> Score {
|
||||
pub(crate) fn idf(doc_freq: u64, doc_count: u64) -> Score {
|
||||
assert!(doc_count >= doc_freq, "{} >= {}", doc_count, doc_freq);
|
||||
let x = ((doc_count - doc_freq) as Score + 0.5) / (doc_freq as Score + 0.5);
|
||||
(1.0 + x).ln()
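
Making `idf` pub(crate) is what lets the new more-like-this module score candidate terms with the same inverse document frequency as the BM25 scorer. Written out, with N = doc_count and n = doc_freq, the function above computes

    idf(n, N) = \ln\left(1 + \frac{N - n + 0.5}{n + 0.5}\right)

which matches the string used in the `Explanation` further down ("idf, computed as log(1 + (N - n + 0.5) / (n + 0.5))"). `Bm25Weight::new` then multiplies it by (1 + K1) with K1 = 1.2 from the top of this file, while B = 0.75 enters through the per-fieldnorm cache built by compute_tf_cache (not shown in full in this hunk).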
|
||||
@@ -29,22 +29,22 @@ fn compute_tf_cache(average_fieldnorm: Score) -> [Score; 256] {
|
||||
}
|
||||
|
||||
#[derive(Clone, PartialEq, Debug, Serialize, Deserialize)]
|
||||
pub struct BM25Params {
|
||||
pub struct Bm25Params {
|
||||
pub idf: Score,
|
||||
pub avg_fieldnorm: Score,
|
||||
}
|
||||
|
||||
#[derive(Clone)]
|
||||
pub struct BM25Weight {
|
||||
pub struct Bm25Weight {
|
||||
idf_explain: Explanation,
|
||||
weight: Score,
|
||||
cache: [Score; 256],
|
||||
average_fieldnorm: Score,
|
||||
}
|
||||
|
||||
impl BM25Weight {
|
||||
pub fn boost_by(&self, boost: Score) -> BM25Weight {
|
||||
BM25Weight {
|
||||
impl Bm25Weight {
|
||||
pub fn boost_by(&self, boost: Score) -> Bm25Weight {
|
||||
Bm25Weight {
|
||||
idf_explain: self.idf_explain.clone(),
|
||||
weight: self.weight * boost,
|
||||
cache: self.cache,
|
||||
@@ -52,8 +52,8 @@ impl BM25Weight {
|
||||
}
|
||||
}
|
||||
|
||||
pub fn for_terms(searcher: &Searcher, terms: &[Term]) -> crate::Result<BM25Weight> {
|
||||
assert!(!terms.is_empty(), "BM25 requires at least one term");
|
||||
pub fn for_terms(searcher: &Searcher, terms: &[Term]) -> crate::Result<Bm25Weight> {
|
||||
assert!(!terms.is_empty(), "Bm25 requires at least one term");
|
||||
let field = terms[0].field();
|
||||
for term in &terms[1..] {
|
||||
assert_eq!(
|
||||
@@ -74,7 +74,7 @@ impl BM25Weight {
|
||||
|
||||
if terms.len() == 1 {
|
||||
let term_doc_freq = searcher.doc_freq(&terms[0])?;
|
||||
Ok(BM25Weight::for_one_term(
|
||||
Ok(Bm25Weight::for_one_term(
|
||||
term_doc_freq,
|
||||
total_num_docs,
|
||||
average_fieldnorm,
|
||||
@@ -86,7 +86,7 @@ impl BM25Weight {
|
||||
idf_sum += idf(term_doc_freq, total_num_docs);
|
||||
}
|
||||
let idf_explain = Explanation::new("idf", idf_sum);
|
||||
Ok(BM25Weight::new(idf_explain, average_fieldnorm))
|
||||
Ok(Bm25Weight::new(idf_explain, average_fieldnorm))
|
||||
}
|
||||
}
|
||||
|
||||
@@ -94,7 +94,7 @@ impl BM25Weight {
|
||||
term_doc_freq: u64,
|
||||
total_num_docs: u64,
|
||||
avg_fieldnorm: Score,
|
||||
) -> BM25Weight {
|
||||
) -> Bm25Weight {
|
||||
let idf = idf(term_doc_freq, total_num_docs);
|
||||
let mut idf_explain =
|
||||
Explanation::new("idf, computed as log(1 + (N - n + 0.5) / (n + 0.5))", idf);
|
||||
@@ -103,12 +103,12 @@ impl BM25Weight {
|
||||
term_doc_freq as Score,
|
||||
);
|
||||
idf_explain.add_const("N, total number of docs", total_num_docs as Score);
|
||||
BM25Weight::new(idf_explain, avg_fieldnorm)
|
||||
Bm25Weight::new(idf_explain, avg_fieldnorm)
|
||||
}
|
||||
|
||||
pub(crate) fn new(idf_explain: Explanation, average_fieldnorm: Score) -> BM25Weight {
|
||||
pub(crate) fn new(idf_explain: Explanation, average_fieldnorm: Score) -> Bm25Weight {
|
||||
let weight = idf_explain.value() * (1.0 + K1);
|
||||
BM25Weight {
|
||||
Bm25Weight {
|
||||
idf_explain,
|
||||
weight,
|
||||
cache: compute_tf_cache(average_fieldnorm),
|
||||
@@ -116,7 +116,7 @@ impl BM25Weight {
|
||||
}
|
||||
}
|
||||
|
||||
#[inline(always)]
|
||||
#[inline]
|
||||
pub fn score(&self, fieldnorm_id: u8, term_freq: u32) -> Score {
|
||||
self.weight * self.tf_factor(fieldnorm_id, term_freq)
|
||||
}
|
||||
@@ -125,7 +125,7 @@ impl BM25Weight {
|
||||
self.score(255u8, 2_013_265_944)
|
||||
}
|
||||
|
||||
#[inline(always)]
|
||||
#[inline]
|
||||
pub(crate) fn tf_factor(&self, fieldnorm_id: u8, term_freq: u32) -> Score {
|
||||
let term_freq = term_freq as Score;
|
||||
let norm = self.cache[fieldnorm_id as usize];
|
||||
|
||||
@@ -238,7 +238,7 @@ mod tests {
|
||||
use crate::query::score_combiner::SumCombiner;
|
||||
use crate::query::term_query::TermScorer;
|
||||
use crate::query::Union;
|
||||
use crate::query::{BM25Weight, Scorer};
|
||||
use crate::query::{Bm25Weight, Scorer};
|
||||
use crate::{DocId, DocSet, Score, TERMINATED};
|
||||
use proptest::prelude::*;
|
||||
use std::cmp::Ordering;
|
||||
@@ -393,7 +393,7 @@ mod tests {
|
||||
let term_scorers: Vec<TermScorer> = postings_lists_expanded
|
||||
.iter()
|
||||
.map(|postings| {
|
||||
let bm25_weight = BM25Weight::for_one_term(
|
||||
let bm25_weight = Bm25Weight::for_one_term(
|
||||
postings.len() as u64,
|
||||
max_doc as u64,
|
||||
average_fieldnorm,
|
||||
|
||||
@@ -3,7 +3,7 @@ use crate::query::Scorer;
|
||||
use crate::DocId;
|
||||
use crate::Score;
|
||||
|
||||
#[inline(always)]
|
||||
#[inline]
|
||||
fn is_within<TDocSetExclude: DocSet>(docset: &mut TDocSetExclude, doc: DocId) -> bool {
|
||||
docset.doc() <= doc && docset.seek(doc) == doc
|
||||
}
|
||||
|
||||
@@ -8,9 +8,9 @@ use std::collections::HashMap;
|
||||
use std::ops::Range;
|
||||
use tantivy_fst::Automaton;
|
||||
|
||||
pub(crate) struct DFAWrapper(pub DFA);
|
||||
pub(crate) struct DfaWrapper(pub DFA);
|
||||
|
||||
impl Automaton for DFAWrapper {
|
||||
impl Automaton for DfaWrapper {
|
||||
type State = u32;
|
||||
|
||||
fn start(&self) -> Self::State {
|
||||
@@ -127,7 +127,7 @@ impl FuzzyTermQuery {
|
||||
}
|
||||
}
|
||||
|
||||
fn specialized_weight(&self) -> crate::Result<AutomatonWeight<DFAWrapper>> {
|
||||
fn specialized_weight(&self) -> crate::Result<AutomatonWeight<DfaWrapper>> {
|
||||
// LEV_BUILDER is a HashMap, whose `get` method returns an Option
|
||||
match LEV_BUILDER.get(&(self.distance, false)) {
|
||||
// Unwrap the option and build the Ok(AutomatonWeight)
|
||||
@@ -139,7 +139,7 @@ impl FuzzyTermQuery {
|
||||
};
|
||||
Ok(AutomatonWeight::new(
|
||||
self.term.field(),
|
||||
DFAWrapper(automaton),
|
||||
DfaWrapper(automaton),
|
||||
))
|
||||
}
|
||||
None => Err(InvalidArgument(format!(
|
||||
|
||||
@@ -11,6 +11,7 @@ mod exclude;
|
||||
mod explanation;
|
||||
mod fuzzy_query;
|
||||
mod intersection;
|
||||
mod more_like_this;
|
||||
mod phrase_query;
|
||||
mod query;
|
||||
mod query_parser;
|
||||
@@ -26,7 +27,7 @@ mod weight;
|
||||
mod vec_docset;
|
||||
|
||||
pub(crate) mod score_combiner;
|
||||
pub(crate) use self::bm25::BM25Weight;
|
||||
pub(crate) use self::bm25::Bm25Weight;
|
||||
pub use self::intersection::Intersection;
|
||||
pub use self::union::Union;
|
||||
|
||||
@@ -42,9 +43,10 @@ pub use self::empty_query::{EmptyQuery, EmptyScorer, EmptyWeight};
|
||||
pub use self::exclude::Exclude;
|
||||
pub use self::explanation::Explanation;
|
||||
#[cfg(test)]
|
||||
pub(crate) use self::fuzzy_query::DFAWrapper;
|
||||
pub(crate) use self::fuzzy_query::DfaWrapper;
|
||||
pub use self::fuzzy_query::FuzzyTermQuery;
|
||||
pub use self::intersection::intersect_scorers;
|
||||
pub use self::more_like_this::{MoreLikeThisQuery, MoreLikeThisQueryBuilder};
|
||||
pub use self::phrase_query::PhraseQuery;
|
||||
pub use self::query::{Query, QueryClone};
|
||||
pub use self::query_parser::QueryParser;
|
||||
|
||||
5
src/query/more_like_this/mod.rs
Normal file
@@ -0,0 +1,5 @@
|
||||
mod more_like_this;
|
||||
mod query;
|
||||
|
||||
pub use self::more_like_this::MoreLikeThis;
|
||||
pub use self::query::{MoreLikeThisQuery, MoreLikeThisQueryBuilder};
|
||||
385
src/query/more_like_this/more_like_this.rs
Normal file
@@ -0,0 +1,385 @@
|
||||
use std::cmp::Reverse;
|
||||
use std::collections::{BinaryHeap, HashMap};
|
||||
|
||||
use crate::{
|
||||
query::{bm25::idf, BooleanQuery, BoostQuery, Occur, Query, TermQuery},
|
||||
schema::{Field, FieldType, FieldValue, IndexRecordOption, Term, Value},
|
||||
tokenizer::{BoxTokenStream, FacetTokenizer, PreTokenizedStream, Tokenizer},
|
||||
DocAddress, Result, Searcher, TantivyError,
|
||||
};
|
||||
|
||||
#[derive(Debug, PartialEq)]
|
||||
struct ScoreTerm {
|
||||
pub term: Term,
|
||||
pub score: f32,
|
||||
}
|
||||
|
||||
impl ScoreTerm {
|
||||
fn new(term: Term, score: f32) -> Self {
|
||||
Self { term, score }
|
||||
}
|
||||
}
|
||||
|
||||
impl Eq for ScoreTerm {}
|
||||
|
||||
impl PartialOrd for ScoreTerm {
|
||||
fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
|
||||
self.score.partial_cmp(&other.score)
|
||||
}
|
||||
}
|
||||
|
||||
impl Ord for ScoreTerm {
|
||||
fn cmp(&self, other: &Self) -> std::cmp::Ordering {
|
||||
self.partial_cmp(other).unwrap_or(std::cmp::Ordering::Equal)
|
||||
}
|
||||
}
|
||||
|
||||
/// A helper struct used to build [`MoreLikeThisQuery`].
|
||||
/// This more-like-this implementation is inspired by the Apache Lucene one
|
||||
/// and closely follows the same implementation, with adaptation to Tantivy vocabulary and API.
|
||||
///
|
||||
/// [MoreLikeThis](https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/mlt/MoreLikeThis.java#L147)
|
||||
/// [MoreLikeThisQuery](https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/mlt/MoreLikeThisQuery.java#L36)
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct MoreLikeThis {
|
||||
/// Ignore words which do not occur in at least this many docs.
|
||||
pub min_doc_frequency: Option<u64>,
|
||||
/// Ignore words which occur in more than this many docs.
|
||||
pub max_doc_frequency: Option<u64>,
|
||||
/// Ignore words less frequent than this.
|
||||
pub min_term_frequency: Option<usize>,
|
||||
/// Don't return a query longer than this.
|
||||
pub max_query_terms: Option<usize>,
|
||||
/// Ignore words if less than this length.
|
||||
pub min_word_length: Option<usize>,
|
||||
/// Ignore words if greater than this length.
|
||||
pub max_word_length: Option<usize>,
|
||||
/// Boost factor to use when boosting the terms
|
||||
pub boost_factor: Option<f32>,
|
||||
/// Current set of stop words.
|
||||
pub stop_words: Vec<String>,
|
||||
}
|
||||
|
||||
impl Default for MoreLikeThis {
|
||||
fn default() -> Self {
|
||||
Self {
|
||||
min_doc_frequency: Some(5),
|
||||
max_doc_frequency: None,
|
||||
min_term_frequency: Some(2),
|
||||
max_query_terms: Some(25),
|
||||
min_word_length: None,
|
||||
max_word_length: None,
|
||||
boost_factor: Some(1.0),
|
||||
stop_words: vec![],
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl MoreLikeThis {
|
||||
/// Creates a [`BooleanQuery`] using a document address to collect
|
||||
/// the top stored field values.
|
||||
pub fn query_with_document(
|
||||
&self,
|
||||
searcher: &Searcher,
|
||||
doc_address: DocAddress,
|
||||
) -> Result<BooleanQuery> {
|
||||
let score_terms = self.retrieve_terms_from_doc_address(searcher, doc_address)?;
|
||||
let query = self.create_query(score_terms);
|
||||
Ok(query)
|
||||
}
|
||||
|
||||
/// Creates a [`BooleanQuery`] using a set of field values.
|
||||
pub fn query_with_document_fields(
|
||||
&self,
|
||||
searcher: &Searcher,
|
||||
doc_fields: &[(Field, Vec<FieldValue>)],
|
||||
) -> Result<BooleanQuery> {
|
||||
let score_terms = self.retrieve_terms_from_doc_fields(searcher, doc_fields)?;
|
||||
let query = self.create_query(score_terms);
|
||||
Ok(query)
|
||||
}
|
||||
|
||||
/// Creates a [`BooleanQuery`] from a list of ScoreTerm.
|
||||
/// This will map the list of ScoreTerm to a list of [`TermQuery`] and compose a
|
||||
/// BooleanQuery using that list as sub queries.
|
||||
fn create_query(&self, mut score_terms: Vec<ScoreTerm>) -> BooleanQuery {
|
||||
score_terms.sort_by(|left_ts, right_ts| right_ts.cmp(left_ts));
|
||||
let best_score = score_terms.first().map_or(1f32, |x| x.score);
|
||||
let mut queries = Vec::new();
|
||||
|
||||
for ScoreTerm { term, score } in score_terms {
|
||||
let mut query: Box<dyn Query> =
|
||||
Box::new(TermQuery::new(term, IndexRecordOption::Basic));
|
||||
if let Some(factor) = self.boost_factor {
|
||||
query = Box::new(BoostQuery::new(query, score * factor / best_score));
|
||||
}
|
||||
queries.push((Occur::Should, query));
|
||||
}
|
||||
BooleanQuery::from(queries)
|
||||
}
|
||||
|
||||
/// Finds terms for a more-like-this query.
|
||||
/// doc_address is the address of the document from which to find terms.
|
||||
fn retrieve_terms_from_doc_address(
|
||||
&self,
|
||||
searcher: &Searcher,
|
||||
doc_address: DocAddress,
|
||||
) -> Result<Vec<ScoreTerm>> {
|
||||
let doc = searcher.doc(doc_address)?;
|
||||
let field_to_field_values = doc
|
||||
.get_sorted_field_values()
|
||||
.iter()
|
||||
.map(|(field, values)| {
|
||||
(
|
||||
*field,
|
||||
values
|
||||
.iter()
|
||||
.map(|v| (**v).clone())
|
||||
.collect::<Vec<FieldValue>>(),
|
||||
)
|
||||
})
|
||||
.collect::<Vec<_>>();
|
||||
self.retrieve_terms_from_doc_fields(searcher, &field_to_field_values)
|
||||
}
|
||||
|
||||
/// Finds terms for a more-like-this query.
|
||||
/// field_to_field_values is a mapping from field to possible values of that field.
|
||||
fn retrieve_terms_from_doc_fields(
|
||||
&self,
|
||||
searcher: &Searcher,
|
||||
field_to_field_values: &[(Field, Vec<FieldValue>)],
|
||||
) -> Result<Vec<ScoreTerm>> {
|
||||
if field_to_field_values.is_empty() {
|
||||
return Err(TantivyError::InvalidArgument("Cannot create more like this query on empty field values. The document may not have stored fields".to_string()));
|
||||
}
|
||||
|
||||
let mut field_to_term_freq_map = HashMap::new();
|
||||
for (field, field_values) in field_to_field_values {
|
||||
self.add_term_frequencies(searcher, *field, field_values, &mut field_to_term_freq_map)?;
|
||||
}
|
||||
self.create_score_term(searcher, field_to_term_freq_map)
|
||||
}
|
||||
|
||||
/// Computes the frequency of values for a field while updating the term frequencies
|
||||
/// Note: A FieldValue can be made up of multiple terms.
|
||||
/// We are interested in extracting terms within FieldValue
|
||||
fn add_term_frequencies(
|
||||
&self,
|
||||
searcher: &Searcher,
|
||||
field: Field,
|
||||
field_values: &[FieldValue],
|
||||
term_frequencies: &mut HashMap<Term, usize>,
|
||||
) -> Result<()> {
|
||||
let schema = searcher.schema();
|
||||
let tokenizer_manager = searcher.index().tokenizers();
|
||||
|
||||
let field_entry = schema.get_field_entry(field);
|
||||
if !field_entry.is_indexed() {
|
||||
return Ok(());
|
||||
}
|
||||
|
||||
// extract the raw value, possibly tokenizing & filtering to update the term frequency map
|
||||
match field_entry.field_type() {
|
||||
FieldType::HierarchicalFacet(_) => {
|
||||
let facets: Vec<&str> = field_values
|
||||
.iter()
|
||||
.map(|field_value| match *field_value.value() {
|
||||
Value::Facet(ref facet) => Ok(facet.encoded_str()),
|
||||
_ => Err(TantivyError::InvalidArgument(
|
||||
"invalid field value".to_string(),
|
||||
)),
|
||||
})
|
||||
.collect::<Result<Vec<_>>>()?;
|
||||
for fake_str in facets {
|
||||
FacetTokenizer.token_stream(fake_str).process(&mut |token| {
|
||||
if self.is_noise_word(token.text.clone()) {
|
||||
let term = Term::from_field_text(field, &token.text);
|
||||
*term_frequencies.entry(term).or_insert(0) += 1;
|
||||
}
|
||||
});
|
||||
}
|
||||
}
|
||||
FieldType::Str(text_options) => {
|
||||
let mut token_streams: Vec<BoxTokenStream> = vec![];
|
||||
|
||||
for field_value in field_values {
|
||||
match field_value.value() {
|
||||
Value::PreTokStr(tok_str) => {
|
||||
token_streams.push(PreTokenizedStream::from(tok_str.clone()).into());
|
||||
}
|
||||
Value::Str(ref text) => {
|
||||
if let Some(tokenizer) = text_options
|
||||
.get_indexing_options()
|
||||
.map(|text_indexing_options| {
|
||||
text_indexing_options.tokenizer().to_string()
|
||||
})
|
||||
.and_then(|tokenizer_name| tokenizer_manager.get(&tokenizer_name))
|
||||
{
|
||||
token_streams.push(tokenizer.token_stream(text));
|
||||
}
|
||||
}
|
||||
_ => (),
|
||||
}
|
||||
}
|
||||
|
||||
for mut token_stream in token_streams {
|
||||
token_stream.process(&mut |token| {
|
||||
if !self.is_noise_word(token.text.clone()) {
|
||||
let term = Term::from_field_text(field, &token.text);
|
||||
*term_frequencies.entry(term).or_insert(0) += 1;
|
||||
}
|
||||
});
|
||||
}
|
||||
}
|
||||
FieldType::U64(_) => {
|
||||
for field_value in field_values {
|
||||
let val = field_value
|
||||
.value()
|
||||
.u64_value()
|
||||
.ok_or(TantivyError::InvalidArgument("invalid value".to_string()))?;
|
||||
if !self.is_noise_word(val.to_string()) {
|
||||
let term = Term::from_field_u64(field, val);
|
||||
*term_frequencies.entry(term).or_insert(0) += 1;
|
||||
}
|
||||
}
|
||||
}
|
||||
FieldType::Date(_) => {
|
||||
for field_value in field_values {
|
||||
// TODO: Ask if this is the semantic (timestamp) we want
|
||||
let val = field_value
|
||||
.value()
|
||||
.date_value()
|
||||
.ok_or(TantivyError::InvalidArgument("invalid value".to_string()))?
|
||||
.timestamp();
|
||||
if !self.is_noise_word(val.to_string()) {
|
||||
let term = Term::from_field_i64(field, val);
|
||||
*term_frequencies.entry(term).or_insert(0) += 1;
|
||||
}
|
||||
}
|
||||
}
|
||||
FieldType::I64(_) => {
|
||||
for field_value in field_values {
|
||||
let val = field_value
|
||||
.value()
|
||||
.i64_value()
|
||||
.ok_or(TantivyError::InvalidArgument("invalid value".to_string()))?;
|
||||
if !self.is_noise_word(val.to_string()) {
|
||||
let term = Term::from_field_i64(field, val);
|
||||
*term_frequencies.entry(term).or_insert(0) += 1;
|
||||
}
|
||||
}
|
||||
}
|
||||
FieldType::F64(_) => {
|
||||
for field_value in field_values {
|
||||
let val = field_value
|
||||
.value()
|
||||
.f64_value()
|
||||
.ok_or(TantivyError::InvalidArgument("invalid value".to_string()))?;
|
||||
if !self.is_noise_word(val.to_string()) {
|
||||
let term = Term::from_field_f64(field, val);
|
||||
*term_frequencies.entry(term).or_insert(0) += 1;
|
||||
}
|
||||
}
|
||||
}
|
||||
_ => {}
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Determines if the term is likely to be of interest based on "more-like-this" settings
|
||||
fn is_noise_word(&self, word: String) -> bool {
|
||||
let word_length = word.len();
|
||||
if word_length == 0 {
|
||||
return true;
|
||||
}
|
||||
if self
|
||||
.min_word_length
|
||||
.map(|min| word_length < min)
|
||||
.unwrap_or(false)
|
||||
{
|
||||
return true;
|
||||
}
|
||||
if self
|
||||
.max_word_length
|
||||
.map(|max| word_length > max)
|
||||
.unwrap_or(false)
|
||||
{
|
||||
return true;
|
||||
}
|
||||
self.stop_words.contains(&word)
|
||||
}
|
||||
|
||||
/// Computes the score for each term while ignoring terms that are not useful
|
||||
fn create_score_term(
|
||||
&self,
|
||||
searcher: &Searcher,
|
||||
per_field_term_frequencies: HashMap<Term, usize>,
|
||||
) -> Result<Vec<ScoreTerm>> {
|
||||
let mut score_terms: BinaryHeap<Reverse<ScoreTerm>> = BinaryHeap::new();
|
||||
let num_docs = searcher
|
||||
.segment_readers()
|
||||
.iter()
|
||||
.map(|segment_reader| segment_reader.num_docs() as u64)
|
||||
.sum::<u64>();
|
||||
|
||||
for (term, term_frequency) in per_field_term_frequencies.iter() {
|
||||
// ignore terms with less than min_term_frequency
|
||||
if self
|
||||
.min_term_frequency
|
||||
.map(|min_term_frequency| *term_frequency < min_term_frequency)
|
||||
.unwrap_or(false)
|
||||
{
|
||||
continue;
|
||||
}
|
||||
|
||||
let doc_freq = searcher.doc_freq(&term)?;
|
||||
|
||||
// ignore terms with less than min_doc_frequency
|
||||
if self
|
||||
.min_doc_frequency
|
||||
.map(|min_doc_frequency| doc_freq < min_doc_frequency)
|
||||
.unwrap_or(false)
|
||||
{
|
||||
continue;
|
||||
}
|
||||
|
||||
// ignore terms with more than max_doc_frequency
|
||||
if self
|
||||
.max_doc_frequency
|
||||
.map(|max_doc_frequency| doc_freq > max_doc_frequency)
|
||||
.unwrap_or(false)
|
||||
{
|
||||
continue;
|
||||
}
|
||||
|
||||
// ignore terms with zero frequency
|
||||
if doc_freq == 0 {
|
||||
continue;
|
||||
}
|
||||
|
||||
// compute similarity & score
|
||||
let idf = idf(doc_freq, num_docs);
|
||||
let score = (*term_frequency as f32) * idf;
|
||||
if let Some(limit) = self.max_query_terms {
|
||||
if score_terms.len() > limit {
|
||||
// update the least significant term
|
||||
let least_significant_term_score = score_terms.peek().unwrap().0.score;
|
||||
if least_significant_term_score < score {
|
||||
score_terms.peek_mut().unwrap().0 = ScoreTerm::new(term.clone(), score);
|
||||
}
|
||||
} else {
|
||||
score_terms.push(Reverse(ScoreTerm::new(term.clone(), score)));
|
||||
}
|
||||
} else {
|
||||
score_terms.push(Reverse(ScoreTerm::new(term.clone(), score)));
|
||||
}
|
||||
}
|
||||
|
||||
let score_terms_vec: Vec<ScoreTerm> = score_terms
|
||||
.into_iter()
|
||||
.map(|reverse_score_term| reverse_score_term.0)
|
||||
.collect();
|
||||
|
||||
Ok(score_terms_vec)
|
||||
}
|
||||
}
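
create_score_term above keeps only the best max_query_terms candidates by holding a min-heap of Reverse<ScoreTerm> and swapping out the current weakest entry when a better term shows up. The same selection pattern in a self-contained form (illustrative names, integer scores so the tuples are Ord without a float wrapper):

use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Keep the `limit` highest-scoring (score, term) pairs, mirroring the
// min-heap-of-Reverse pattern used by create_score_term above.
fn top_k(items: Vec<(u64, &'static str)>, limit: usize) -> Vec<(u64, &'static str)> {
    let mut heap: BinaryHeap<Reverse<(u64, &'static str)>> = BinaryHeap::new();
    for item in items {
        if heap.len() < limit {
            heap.push(Reverse(item));
        } else if let Some(mut weakest) = heap.peek_mut() {
            // The root of the reversed heap is the weakest of the kept entries.
            if weakest.0 < item {
                *weakest = Reverse(item);
            }
        }
    }
    heap.into_iter().map(|Reverse(item)| item).collect()
}

fn main() {
    let scored = vec![(3, "sea"), (7, "lady"), (1, "old"), (9, "sailing"), (5, "bike")];
    let mut best = top_k(scored, 3);
    best.sort_by(|a, b| b.cmp(a));
    assert_eq!(best, vec![(9, "sailing"), (7, "lady"), (5, "bike")]);
}

Dropping the PeekMut guard after the assignment is what restores the heap invariant, the same trick the diff relies on with score_terms.peek_mut().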
|
||||
284
src/query/more_like_this/query.rs
Normal file
@@ -0,0 +1,284 @@
|
||||
use super::MoreLikeThis;
|
||||
|
||||
use crate::{
|
||||
query::{Query, Weight},
|
||||
schema::{Field, FieldValue},
|
||||
DocAddress, Result, Searcher,
|
||||
};
|
||||
|
||||
/// A query that matches all of the documents similar to a document
|
||||
/// or a set of field values provided.
|
||||
///
|
||||
/// # Examples
|
||||
///
|
||||
/// ```
|
||||
/// use tantivy::DocAddress;
|
||||
/// use tantivy::query::MoreLikeThisQuery;
|
||||
///
|
||||
/// let query = MoreLikeThisQuery::builder()
|
||||
/// .with_min_doc_frequency(1)
|
||||
/// .with_max_doc_frequency(10)
|
||||
/// .with_min_term_frequency(1)
|
||||
/// .with_min_word_length(2)
|
||||
/// .with_max_word_length(5)
|
||||
/// .with_boost_factor(1.0)
|
||||
/// .with_stop_words(vec!["for".to_string()])
|
||||
/// .with_document(DocAddress::new(2, 1));
|
||||
///
|
||||
/// ```
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct MoreLikeThisQuery {
|
||||
mlt: MoreLikeThis,
|
||||
target: TargetDocument,
|
||||
}
|
||||
|
||||
#[derive(Debug, PartialEq, Clone)]
|
||||
enum TargetDocument {
|
||||
DocumentAdress(DocAddress),
|
||||
DocumentFields(Vec<(Field, Vec<FieldValue>)>),
|
||||
}
|
||||
|
||||
impl MoreLikeThisQuery {
|
||||
/// Creates a new builder.
|
||||
pub fn builder() -> MoreLikeThisQueryBuilder {
|
||||
MoreLikeThisQueryBuilder::default()
|
||||
}
|
||||
}
|
||||
|
||||
impl Query for MoreLikeThisQuery {
|
||||
fn weight(&self, searcher: &Searcher, scoring_enabled: bool) -> Result<Box<dyn Weight>> {
|
||||
match &self.target {
|
||||
TargetDocument::DocumentAdress(doc_address) => self
|
||||
.mlt
|
||||
.query_with_document(searcher, *doc_address)?
|
||||
.weight(searcher, scoring_enabled),
|
||||
TargetDocument::DocumentFields(doc_fields) => self
|
||||
.mlt
|
||||
.query_with_document_fields(searcher, doc_fields)?
|
||||
.weight(searcher, scoring_enabled),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// The builder for more-like-this query
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct MoreLikeThisQueryBuilder {
|
||||
mlt: MoreLikeThis,
|
||||
}
|
||||
|
||||
impl Default for MoreLikeThisQueryBuilder {
|
||||
fn default() -> Self {
|
||||
Self {
|
||||
mlt: MoreLikeThis::default(),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl MoreLikeThisQueryBuilder {
|
||||
/// Sets the minimum document frequency.
|
||||
///
|
||||
/// The resulting query will ignore words which do not occur
|
||||
/// in at least this many docs.
|
||||
pub fn with_min_doc_frequency(mut self, value: u64) -> Self {
|
||||
self.mlt.min_doc_frequency = Some(value);
|
||||
self
|
||||
}
|
||||
|
||||
/// Sets the maximum document frequency.
|
||||
///
|
||||
/// The resulting query will ignore words which occur
|
||||
/// in more than this many docs.
|
||||
pub fn with_max_doc_frequency(mut self, value: u64) -> Self {
|
||||
self.mlt.max_doc_frequency = Some(value);
|
||||
self
|
||||
}
|
||||
|
||||
/// Sets the minimum term frequency.
|
||||
///
|
||||
/// The resulting query will ignore words less
|
||||
/// frequent than this number.
|
||||
pub fn with_min_term_frequency(mut self, value: usize) -> Self {
|
||||
self.mlt.min_term_frequency = Some(value);
|
||||
self
|
||||
}
|
||||
|
||||
/// Sets the maximum query terms.
|
||||
///
|
||||
/// The resulting query will not have more clauses than this.
|
||||
pub fn with_max_query_terms(mut self, value: usize) -> Self {
|
||||
self.mlt.max_query_terms = Some(value);
|
||||
self
|
||||
}
|
||||
|
||||
/// Sets the minimum word length.
|
||||
///
|
||||
/// The resulting query will ignore words shorter than this length.
|
||||
pub fn with_min_word_length(mut self, value: usize) -> Self {
|
||||
self.mlt.min_word_length = Some(value);
|
||||
self
|
||||
}
|
||||
|
||||
/// Sets the maximum word length.
|
||||
///
|
||||
/// The resulting query will ignore words longer than this length.
|
||||
pub fn with_max_word_length(mut self, value: usize) -> Self {
|
||||
self.mlt.max_word_length = Some(value);
|
||||
self
|
||||
}
|
||||
|
||||
/// Sets the boost factor
|
||||
///
|
||||
/// The boost factor used by the resulting query for boosting terms.
|
||||
pub fn with_boost_factor(mut self, value: f32) -> Self {
|
||||
self.mlt.boost_factor = Some(value);
|
||||
self
|
||||
}
|
||||
|
||||
/// Sets the set of stop words
|
||||
///
|
||||
/// The resulting query will ignore this set of words.
|
||||
pub fn with_stop_words(mut self, value: Vec<String>) -> Self {
|
||||
self.mlt.stop_words = value;
|
||||
self
|
||||
}
|
||||
|
||||
/// Sets the document address
|
||||
/// Returns the constructed [`MoreLikeThisQuery`]
|
||||
///
|
||||
/// This document will be used to collect field values and extract the frequent terms
|
||||
/// needed for composing the query.
|
||||
///
|
||||
/// Note that field values will only be collected from stored fields in the index.
|
||||
/// You can construct your own field values from any source.
|
||||
pub fn with_document(self, doc_address: DocAddress) -> MoreLikeThisQuery {
|
||||
MoreLikeThisQuery {
|
||||
mlt: self.mlt,
|
||||
target: TargetDocument::DocumentAdress(doc_address),
|
||||
}
|
||||
}
|
||||
|
||||
/// Sets the document fields
|
||||
/// Returns the constructed [`MoreLikeThisQuery`]
|
||||
///
|
||||
/// This represents the list of field values, possibly collected from multiple documents,
|
||||
/// that will be used to compose the resulting query.
|
||||
/// This interface is meant to be used when you want to provide your own set of fields
|
||||
/// not necessarily from a specific document.
|
||||
pub fn with_document_fields(
|
||||
self,
|
||||
doc_fields: Vec<(Field, Vec<FieldValue>)>,
|
||||
) -> MoreLikeThisQuery {
|
||||
MoreLikeThisQuery {
|
||||
mlt: self.mlt,
|
||||
target: TargetDocument::DocumentFields(doc_fields),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::MoreLikeThisQuery;
|
||||
use super::TargetDocument;
|
||||
use crate::collector::TopDocs;
|
||||
use crate::schema::{Schema, STORED, TEXT};
|
||||
use crate::DocAddress;
|
||||
use crate::Index;
|
||||
|
||||
fn create_test_index() -> Index {
|
||||
let mut schema_builder = Schema::builder();
|
||||
let title = schema_builder.add_text_field("title", TEXT);
|
||||
let body = schema_builder.add_text_field("body", TEXT | STORED);
|
||||
let schema = schema_builder.build();
|
||||
let index = Index::create_in_ram(schema);
|
||||
let mut index_writer = index.writer_for_tests().unwrap();
|
||||
index_writer.add_document(doc!(title => "aaa", body => "the old man and the sea"));
|
||||
index_writer.add_document(doc!(title => "bbb", body => "an old man sailing on the sea"));
|
||||
index_writer.add_document(doc!(title => "ccc", body=> "send this message to alice"));
|
||||
index_writer.add_document(doc!(title => "ddd", body=> "a lady was riding and old bike"));
|
||||
index_writer.add_document(doc!(title => "eee", body=> "Yes, my lady."));
|
||||
index_writer.commit().unwrap();
|
||||
index
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_more_like_this_query_builder() {
|
||||
// default settings
|
||||
let query = MoreLikeThisQuery::builder().with_document_fields(vec![]);
|
||||
|
||||
assert_eq!(query.mlt.min_doc_frequency, Some(5));
|
||||
assert_eq!(query.mlt.max_doc_frequency, None);
|
||||
assert_eq!(query.mlt.min_term_frequency, Some(2));
|
||||
assert_eq!(query.mlt.max_query_terms, Some(25));
|
||||
assert_eq!(query.mlt.min_word_length, None);
|
||||
assert_eq!(query.mlt.max_word_length, None);
|
||||
assert_eq!(query.mlt.boost_factor, Some(1.0));
|
||||
assert_eq!(query.mlt.stop_words, Vec::<String>::new());
|
||||
assert_eq!(query.target, TargetDocument::DocumentFields(vec![]));
|
||||
|
||||
// custom settings
|
||||
let query = MoreLikeThisQuery::builder()
|
||||
.with_min_doc_frequency(2)
|
||||
.with_max_doc_frequency(5)
|
||||
.with_min_term_frequency(2)
|
||||
.with_min_word_length(2)
|
||||
.with_max_word_length(4)
|
||||
.with_boost_factor(0.5)
|
||||
.with_stop_words(vec!["all".to_string(), "for".to_string()])
|
||||
.with_document(DocAddress::new(1, 2));
|
||||
|
||||
assert_eq!(query.mlt.min_doc_frequency, Some(2));
|
||||
assert_eq!(query.mlt.max_doc_frequency, Some(5));
|
||||
assert_eq!(query.mlt.min_term_frequency, Some(2));
|
||||
assert_eq!(query.mlt.min_word_length, Some(2));
|
||||
assert_eq!(query.mlt.max_word_length, Some(4));
|
||||
assert_eq!(query.mlt.boost_factor, Some(0.5));
|
||||
assert_eq!(
|
||||
query.mlt.stop_words,
|
||||
vec!["all".to_string(), "for".to_string()]
|
||||
);
|
||||
assert_eq!(
|
||||
query.target,
|
||||
TargetDocument::DocumentAdress(DocAddress::new(1, 2))
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_more_like_this_query() {
|
||||
let index = create_test_index();
|
||||
let reader = index.reader().unwrap();
|
||||
let searcher = reader.searcher();
|
||||
|
||||
// search based on the 1st doc with words [sea, and], skipping [old]
|
||||
let query = MoreLikeThisQuery::builder()
|
||||
.with_min_doc_frequency(1)
|
||||
.with_max_doc_frequency(10)
|
||||
.with_min_term_frequency(1)
|
||||
.with_min_word_length(2)
|
||||
.with_max_word_length(5)
|
||||
.with_boost_factor(1.0)
|
||||
.with_stop_words(vec!["old".to_string()])
|
||||
.with_document(DocAddress::new(0, 0));
|
||||
let top_docs = searcher.search(&query, &TopDocs::with_limit(5)).unwrap();
|
||||
let mut doc_ids: Vec<_> = top_docs.iter().map(|item| item.1.doc_id).collect();
|
||||
doc_ids.sort();
|
||||
|
||||
assert_eq!(doc_ids.len(), 3);
|
||||
assert_eq!(doc_ids, vec![0, 1, 3]);
|
||||
|
||||
// search based on the 5th doc with words [lady]
|
||||
let query = MoreLikeThisQuery::builder()
|
||||
.with_min_doc_frequency(1)
|
||||
.with_max_doc_frequency(10)
|
||||
.with_min_term_frequency(1)
|
||||
.with_min_word_length(2)
|
||||
.with_max_word_length(5)
|
||||
.with_boost_factor(1.0)
|
||||
.with_document(DocAddress::new(0, 4));
|
||||
let top_docs = searcher.search(&query, &TopDocs::with_limit(5)).unwrap();
|
||||
let mut doc_ids: Vec<_> = top_docs.iter().map(|item| item.1.doc_id).collect();
|
||||
doc_ids.sort();
|
||||
|
||||
assert_eq!(doc_ids.len(), 2);
|
||||
assert_eq!(doc_ids, vec![3, 4]);
|
||||
}
|
||||
}
|
||||
@@ -1,6 +1,6 @@
|
||||
use super::PhraseWeight;
|
||||
use crate::core::searcher::Searcher;
|
||||
use crate::query::bm25::BM25Weight;
|
||||
use crate::query::bm25::Bm25Weight;
|
||||
use crate::query::Query;
|
||||
use crate::query::Weight;
|
||||
use crate::schema::IndexRecordOption;
|
||||
@@ -95,7 +95,7 @@ impl PhraseQuery {
|
||||
)));
|
||||
}
|
||||
let terms = self.phrase_terms();
|
||||
let bm25_weight = BM25Weight::for_terms(searcher, &terms)?;
|
||||
let bm25_weight = Bm25Weight::for_terms(searcher, &terms)?;
|
||||
Ok(PhraseWeight::new(
|
||||
self.phrase_terms.clone(),
|
||||
bm25_weight,
|
||||
|
||||
@@ -1,7 +1,7 @@
use crate::docset::{DocSet, TERMINATED};
use crate::fieldnorm::FieldNormReader;
use crate::postings::Postings;
use crate::query::bm25::BM25Weight;
use crate::query::bm25::Bm25Weight;
use crate::query::{Intersection, Scorer};
use crate::{DocId, Score};
use std::cmp::Ordering;
@@ -49,8 +49,8 @@ pub struct PhraseScorer<TPostings: Postings> {
right: Vec<u32>,
phrase_count: u32,
fieldnorm_reader: FieldNormReader,
similarity_weight: BM25Weight,
score_needed: bool,
similarity_weight: Bm25Weight,
scoring_enabled: bool,
}

/// Returns true iff the two sorted array contain a common element
@@ -133,9 +133,9 @@ fn intersection(left: &mut [u32], right: &[u32]) -> usize {
impl<TPostings: Postings> PhraseScorer<TPostings> {
pub fn new(
term_postings: Vec<(usize, TPostings)>,
similarity_weight: BM25Weight,
similarity_weight: Bm25Weight,
fieldnorm_reader: FieldNormReader,
score_needed: bool,
scoring_enabled: bool,
) -> PhraseScorer<TPostings> {
let max_offset = term_postings
.iter()
@@ -157,7 +157,7 @@ impl<TPostings: Postings> PhraseScorer<TPostings> {
phrase_count: 0u32,
similarity_weight,
fieldnorm_reader,
score_needed,
scoring_enabled,
};
if scorer.doc() != TERMINATED && !scorer.phrase_match() {
scorer.advance();
@@ -170,7 +170,7 @@ impl<TPostings: Postings> PhraseScorer<TPostings> {
}

fn phrase_match(&mut self) -> bool {
if self.score_needed {
if self.scoring_enabled {
let count = self.compute_phrase_count();
self.phrase_count = count;
count > 0u32

@@ -2,7 +2,7 @@ use super::PhraseScorer;
use crate::core::SegmentReader;
use crate::fieldnorm::FieldNormReader;
use crate::postings::SegmentPostings;
use crate::query::bm25::BM25Weight;
use crate::query::bm25::Bm25Weight;
use crate::query::explanation::does_not_match;
use crate::query::Scorer;
use crate::query::Weight;
@@ -14,27 +14,31 @@ use crate::{DocId, DocSet};

pub struct PhraseWeight {
phrase_terms: Vec<(usize, Term)>,
similarity_weight: BM25Weight,
score_needed: bool,
similarity_weight: Bm25Weight,
scoring_enabled: bool,
}

impl PhraseWeight {
/// Creates a new phrase weight.
pub fn new(
phrase_terms: Vec<(usize, Term)>,
similarity_weight: BM25Weight,
score_needed: bool,
similarity_weight: Bm25Weight,
scoring_enabled: bool,
) -> PhraseWeight {
PhraseWeight {
phrase_terms,
similarity_weight,
score_needed,
scoring_enabled,
}
}

fn fieldnorm_reader(&self, reader: &SegmentReader) -> crate::Result<FieldNormReader> {
let field = self.phrase_terms[0].1.field();
reader.get_fieldnorms_reader(field)
if self.scoring_enabled {
reader.get_fieldnorms_reader(field)
} else {
Ok(FieldNormReader::constant(reader.max_doc(), 1))
}
}

fn phrase_scorer(
@@ -60,7 +64,7 @@ impl PhraseWeight {
term_postings_list,
similarity_weight,
fieldnorm_reader,
self.score_needed,
self.scoring_enabled,
)))
} else {
let mut term_postings_list = Vec::new();
@@ -78,7 +82,7 @@ impl PhraseWeight {
term_postings_list,
similarity_weight,
fieldnorm_reader,
self.score_needed,
self.scoring_enabled,
)))
}
}

@@ -19,18 +19,18 @@ pub enum LogicalLiteral {
All,
}

pub enum LogicalAST {
Clause(Vec<(Occur, LogicalAST)>),
pub enum LogicalAst {
Clause(Vec<(Occur, LogicalAst)>),
Leaf(Box<LogicalLiteral>),
Boost(Box<LogicalAST>, Score),
Boost(Box<LogicalAst>, Score),
}

impl LogicalAST {
pub fn boost(self, boost: Score) -> LogicalAST {
impl LogicalAst {
pub fn boost(self, boost: Score) -> LogicalAst {
if (boost - 1.0).abs() < Score::EPSILON {
self
} else {
LogicalAST::Boost(Box::new(self), boost)
LogicalAst::Boost(Box::new(self), boost)
}
}
}
@@ -43,10 +43,10 @@ fn occur_letter(occur: Occur) -> &'static str {
}
}

impl fmt::Debug for LogicalAST {
impl fmt::Debug for LogicalAst {
fn fmt(&self, formatter: &mut fmt::Formatter<'_>) -> Result<(), fmt::Error> {
match *self {
LogicalAST::Clause(ref clause) => {
LogicalAst::Clause(ref clause) => {
if clause.is_empty() {
write!(formatter, "<emptyclause>")?;
} else {
@@ -59,15 +59,15 @@ impl fmt::Debug for LogicalAST {
}
Ok(())
}
LogicalAST::Boost(ref ast, boost) => write!(formatter, "{:?}^{}", ast, boost),
LogicalAST::Leaf(ref literal) => write!(formatter, "{:?}", literal),
LogicalAst::Boost(ref ast, boost) => write!(formatter, "{:?}^{}", ast, boost),
LogicalAst::Leaf(ref literal) => write!(formatter, "{:?}", literal),
}
}
}

impl From<LogicalLiteral> for LogicalAST {
fn from(literal: LogicalLiteral) -> LogicalAST {
LogicalAST::Leaf(Box::new(literal))
impl From<LogicalLiteral> for LogicalAst {
fn from(literal: LogicalLiteral) -> LogicalAst {
LogicalAst::Leaf(Box::new(literal))
}
}

@@ -18,7 +18,7 @@ use std::collections::HashMap;
use std::num::{ParseFloatError, ParseIntError};
use std::ops::Bound;
use std::str::FromStr;
use tantivy_query_grammar::{UserInputAST, UserInputBound, UserInputLeaf};
use tantivy_query_grammar::{UserInputAst, UserInputBound, UserInputLeaf};

/// Possible error that may happen when parsing a query.
#[derive(Debug, PartialEq, Eq, Error)]
@@ -28,7 +28,7 @@ pub enum QueryParserError {
SyntaxError,
/// `FieldDoesNotExist(field_name: String)`
/// The query references a field that is not in the schema
#[error("File does not exists: '{0:?}'")]
#[error("Field does not exists: '{0:?}'")]
FieldDoesNotExist(String),
/// The query contains a term for a `u64` or `i64`-field, but the value
/// is neither.
@@ -91,9 +91,9 @@ impl From<chrono::ParseError> for QueryParserError {
/// Recursively remove empty clause from the AST
///
/// Returns `None` iff the `logical_ast` ended up being empty.
fn trim_ast(logical_ast: LogicalAST) -> Option<LogicalAST> {
fn trim_ast(logical_ast: LogicalAst) -> Option<LogicalAst> {
match logical_ast {
LogicalAST::Clause(children) => {
LogicalAst::Clause(children) => {
let trimmed_children = children
.into_iter()
.flat_map(|(occur, child)| {
@@ -103,7 +103,7 @@ fn trim_ast(logical_ast: LogicalAST) -> Option<LogicalAST> {
if trimmed_children.is_empty() {
None
} else {
Some(LogicalAST::Clause(trimmed_children))
Some(LogicalAst::Clause(trimmed_children))
}
}
_ => Some(logical_ast),
@@ -178,11 +178,11 @@ pub struct QueryParser {
boost: HashMap<Field, Score>,
}

fn all_negative(ast: &LogicalAST) -> bool {
fn all_negative(ast: &LogicalAst) -> bool {
match ast {
LogicalAST::Leaf(_) => false,
LogicalAST::Boost(ref child_ast, _) => all_negative(&*child_ast),
LogicalAST::Clause(children) => children
LogicalAst::Leaf(_) => false,
LogicalAst::Boost(ref child_ast, _) => all_negative(&*child_ast),
LogicalAst::Clause(children) => children
.iter()
.all(|(ref occur, child)| (*occur == Occur::MustNot) || all_negative(child)),
}
@@ -251,7 +251,7 @@ impl QueryParser {
}

/// Parse the user query into an AST.
fn parse_query_to_logical_ast(&self, query: &str) -> Result<LogicalAST, QueryParserError> {
fn parse_query_to_logical_ast(&self, query: &str) -> Result<LogicalAst, QueryParserError> {
let user_input_ast =
tantivy_query_grammar::parse_query(query).map_err(|_| QueryParserError::SyntaxError)?;
self.compute_logical_ast(user_input_ast)
@@ -265,10 +265,10 @@ impl QueryParser {

fn compute_logical_ast(
&self,
user_input_ast: UserInputAST,
) -> Result<LogicalAST, QueryParserError> {
user_input_ast: UserInputAst,
) -> Result<LogicalAst, QueryParserError> {
let ast = self.compute_logical_ast_with_occur(user_input_ast)?;
if let LogicalAST::Clause(children) = &ast {
if let LogicalAst::Clause(children) = &ast {
if children.is_empty() {
return Ok(ast);
}
@@ -429,24 +429,24 @@ impl QueryParser {

fn compute_logical_ast_with_occur(
&self,
user_input_ast: UserInputAST,
) -> Result<LogicalAST, QueryParserError> {
user_input_ast: UserInputAst,
) -> Result<LogicalAst, QueryParserError> {
match user_input_ast {
UserInputAST::Clause(sub_queries) => {
UserInputAst::Clause(sub_queries) => {
let default_occur = self.default_occur();
let mut logical_sub_queries: Vec<(Occur, LogicalAST)> = Vec::new();
let mut logical_sub_queries: Vec<(Occur, LogicalAst)> = Vec::new();
for (occur_opt, sub_ast) in sub_queries {
let sub_ast = self.compute_logical_ast_with_occur(sub_ast)?;
let occur = occur_opt.unwrap_or(default_occur);
logical_sub_queries.push((occur, sub_ast));
}
Ok(LogicalAST::Clause(logical_sub_queries))
Ok(LogicalAst::Clause(logical_sub_queries))
}
UserInputAST::Boost(ast, boost) => {
UserInputAst::Boost(ast, boost) => {
let ast = self.compute_logical_ast_with_occur(*ast)?;
Ok(ast.boost(boost as Score))
}
UserInputAST::Leaf(leaf) => self.compute_logical_ast_from_leaf(*leaf),
UserInputAst::Leaf(leaf) => self.compute_logical_ast_from_leaf(*leaf),
}
}

@@ -457,7 +457,7 @@ impl QueryParser {
fn compute_logical_ast_from_leaf(
&self,
leaf: UserInputLeaf,
) -> Result<LogicalAST, QueryParserError> {
) -> Result<LogicalAst, QueryParserError> {
match leaf {
UserInputLeaf::Literal(literal) => {
let term_phrases: Vec<(Field, String)> = match literal.field_name {
@@ -476,22 +476,22 @@ impl QueryParser {
}
}
};
let mut asts: Vec<LogicalAST> = Vec::new();
let mut asts: Vec<LogicalAst> = Vec::new();
for (field, phrase) in term_phrases {
if let Some(ast) = self.compute_logical_ast_for_leaf(field, &phrase)? {
// Apply some field specific boost defined at the query parser level.
let boost = self.field_boost(field);
asts.push(LogicalAST::Leaf(Box::new(ast)).boost(boost));
asts.push(LogicalAst::Leaf(Box::new(ast)).boost(boost));
}
}
let result_ast: LogicalAST = if asts.len() == 1 {
let result_ast: LogicalAst = if asts.len() == 1 {
asts.into_iter().next().unwrap()
} else {
LogicalAST::Clause(asts.into_iter().map(|ast| (Occur::Should, ast)).collect())
LogicalAst::Clause(asts.into_iter().map(|ast| (Occur::Should, ast)).collect())
};
Ok(result_ast)
}
UserInputLeaf::All => Ok(LogicalAST::Leaf(Box::new(LogicalLiteral::All))),
UserInputLeaf::All => Ok(LogicalAst::Leaf(Box::new(LogicalLiteral::All))),
UserInputLeaf::Range {
field,
lower,
@@ -504,7 +504,7 @@ impl QueryParser {
let boost = self.field_boost(field);
let field_entry = self.schema.get_field_entry(field);
let value_type = field_entry.field_type().value_type();
let logical_ast = LogicalAST::Leaf(Box::new(LogicalLiteral::Range {
let logical_ast = LogicalAst::Leaf(Box::new(LogicalLiteral::Range {
field,
value_type,
lower: self.resolve_bound(field, &lower)?,
@@ -516,7 +516,7 @@ impl QueryParser {
let result_ast = if clauses.len() == 1 {
clauses.pop().unwrap()
} else {
LogicalAST::Clause(
LogicalAst::Clause(
clauses
.into_iter()
.map(|clause| (Occur::Should, clause))
@@ -547,9 +547,9 @@ fn convert_literal_to_query(logical_literal: LogicalLiteral) -> Box<dyn Query> {
}
}

fn convert_to_query(logical_ast: LogicalAST) -> Box<dyn Query> {
fn convert_to_query(logical_ast: LogicalAst) -> Box<dyn Query> {
match trim_ast(logical_ast) {
Some(LogicalAST::Clause(trimmed_clause)) => {
Some(LogicalAst::Clause(trimmed_clause)) => {
let occur_subqueries = trimmed_clause
.into_iter()
.map(|(occur, subquery)| (occur, convert_to_query(subquery)))
@@ -560,10 +560,10 @@ fn convert_to_query(logical_ast: LogicalAST) -> Box<dyn Query> {
);
Box::new(BooleanQuery::new(occur_subqueries))
}
Some(LogicalAST::Leaf(trimmed_logical_literal)) => {
Some(LogicalAst::Leaf(trimmed_logical_literal)) => {
convert_literal_to_query(*trimmed_logical_literal)
}
Some(LogicalAST::Boost(ast, boost)) => {
Some(LogicalAst::Boost(ast, boost)) => {
let query = convert_to_query(*ast);
let boosted_query = BoostQuery::new(query, boost);
Box::new(boosted_query)
@@ -632,7 +632,7 @@ mod test {
fn parse_query_to_logical_ast(
query: &str,
default_conjunction: bool,
) -> Result<LogicalAST, QueryParserError> {
) -> Result<LogicalAst, QueryParserError> {
let mut query_parser = make_query_parser();
if default_conjunction {
query_parser.set_conjunction_by_default();

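The LogicalAst types renamed above are internal to the parser; from the outside the same pipeline is reached through QueryParser, which parses the query string into the AST and then lowers it to a Box<dyn Query> (convert_to_query above). A hedged sketch, not part of the diff, with made-up fields and query string:

```rust
use tantivy::query::QueryParser;
use tantivy::schema::{Schema, TEXT};
use tantivy::Index;

fn main() {
    let mut schema_builder = Schema::builder();
    let title = schema_builder.add_text_field("title", TEXT);
    let body = schema_builder.add_text_field("body", TEXT);
    let index = Index::create_in_ram(schema_builder.build());

    // Clauses, boosts and ranges in the query string become LogicalAst nodes
    // before being converted into BooleanQuery / BoostQuery / etc.
    let query_parser = QueryParser::for_index(&index, vec![title, body]);
    let query = query_parser
        .parse_query("title:sea^2.0 +body:whale -body:shark")
        .expect("valid query syntax");
    println!("{:?}", query);
}
```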
@@ -1,5 +1,5 @@
use super::term_weight::TermWeight;
use crate::query::bm25::BM25Weight;
use crate::query::bm25::Bm25Weight;
use crate::query::Weight;
use crate::query::{Explanation, Query};
use crate::schema::IndexRecordOption;
@@ -102,10 +102,10 @@ impl TermQuery {
}
let bm25_weight;
if scoring_enabled {
bm25_weight = BM25Weight::for_terms(searcher, &[term])?;
bm25_weight = Bm25Weight::for_terms(searcher, &[term])?;
} else {
bm25_weight =
BM25Weight::new(Explanation::new("<no score>".to_string(), 1.0f32), 1.0f32);
Bm25Weight::new(Explanation::new("<no score>".to_string(), 1.0f32), 1.0f32);
}
let index_record_option = if scoring_enabled {
self.index_record_option

@@ -6,20 +6,20 @@ use crate::Score;
use crate::fieldnorm::FieldNormReader;
use crate::postings::SegmentPostings;
use crate::postings::{FreqReadingOption, Postings};
use crate::query::bm25::BM25Weight;
use crate::query::bm25::Bm25Weight;

#[derive(Clone)]
pub struct TermScorer {
postings: SegmentPostings,
fieldnorm_reader: FieldNormReader,
similarity_weight: BM25Weight,
similarity_weight: Bm25Weight,
}

impl TermScorer {
pub fn new(
postings: SegmentPostings,
fieldnorm_reader: FieldNormReader,
similarity_weight: BM25Weight,
similarity_weight: Bm25Weight,
) -> TermScorer {
TermScorer {
postings,
@@ -36,7 +36,7 @@ impl TermScorer {
pub fn create_for_test(
doc_and_tfs: &[(DocId, u32)],
fieldnorms: &[u32],
similarity_weight: BM25Weight,
similarity_weight: Bm25Weight,
) -> TermScorer {
assert!(!doc_and_tfs.is_empty());
assert!(
@@ -131,7 +131,7 @@ mod tests {
use crate::merge_policy::NoMergePolicy;
use crate::postings::compression::COMPRESSION_BLOCK_SIZE;
use crate::query::term_query::TermScorer;
use crate::query::{BM25Weight, Scorer, TermQuery};
use crate::query::{Bm25Weight, Scorer, TermQuery};
use crate::schema::{IndexRecordOption, Schema, TEXT};
use crate::Score;
use crate::{assert_nearly_equals, Index, Searcher, SegmentId, Term};
@@ -141,7 +141,7 @@ mod tests {

#[test]
fn test_term_scorer_max_score() -> crate::Result<()> {
let bm25_weight = BM25Weight::for_one_term(3, 6, 10.0);
let bm25_weight = Bm25Weight::for_one_term(3, 6, 10.0);
let mut term_scorer = TermScorer::create_for_test(
&[(2, 3), (3, 12), (7, 8)],
&[0, 0, 10, 12, 0, 0, 0, 100],
@@ -167,7 +167,7 @@ mod tests {

#[test]
fn test_term_scorer_shallow_advance() -> crate::Result<()> {
let bm25_weight = BM25Weight::for_one_term(300, 1024, 10.0);
let bm25_weight = Bm25Weight::for_one_term(300, 1024, 10.0);
let mut doc_and_tfs = vec![];
for i in 0u32..300u32 {
let doc = i * 10;
@@ -205,7 +205,7 @@ mod tests {
// Average fieldnorm is over the entire index,
// not necessarily the docs that are in the posting list.
// For this reason we multiply by 1.1 to make a realistic value.
let bm25_weight = BM25Weight::for_one_term(term_doc_freq as u64,
let bm25_weight = Bm25Weight::for_one_term(term_doc_freq as u64,
term_doc_freq as u64 * 10u64,
average_fieldnorm);

@@ -240,7 +240,7 @@ mod tests {
doc_tfs.push((258, 1u32));

let fieldnorms: Vec<u32> = std::iter::repeat(20u32).take(300).collect();
let bm25_weight = BM25Weight::for_one_term(10, 129, 20.0);
let bm25_weight = Bm25Weight::for_one_term(10, 129, 20.0);
let mut docs = TermScorer::create_for_test(&doc_tfs[..], &fieldnorms[..], bm25_weight);
assert_nearly_equals!(docs.block_max_score(), 2.5161593);
docs.shallow_seek(135);

@@ -3,7 +3,7 @@ use crate::core::SegmentReader;
use crate::docset::DocSet;
use crate::fieldnorm::FieldNormReader;
use crate::postings::SegmentPostings;
use crate::query::bm25::BM25Weight;
use crate::query::bm25::Bm25Weight;
use crate::query::explanation::does_not_match;
use crate::query::weight::for_each_scorer;
use crate::query::Weight;
@@ -15,7 +15,7 @@ use crate::{DocId, Score};
pub struct TermWeight {
term: Term,
index_record_option: IndexRecordOption,
similarity_weight: BM25Weight,
similarity_weight: Bm25Weight,
scoring_enabled: bool,
}

@@ -88,7 +88,7 @@ impl TermWeight {
pub fn new(
term: Term,
index_record_option: IndexRecordOption,
similarity_weight: BM25Weight,
similarity_weight: Bm25Weight,
scoring_enabled: bool,
) -> TermWeight {
TermWeight {

@@ -109,7 +109,7 @@ impl Facet {
self.0.push_str(facet_str);
}

/// Returns `true` iff other is a subfacet of `self`.
/// Returns `true` if other is a subfacet of `self`.
pub fn is_prefix_of(&self, other: &Facet) -> bool {
let self_str = self.encoded_str();
let other_str = other.encoded_str();

@@ -309,7 +309,7 @@ impl Schema {
} else {
format!("{:?}...", &doc_json[0..20])
};
DocParsingError::NotJSON(doc_json_sample)
DocParsingError::NotJson(doc_json_sample)
})?;

let mut doc = Document::default();
@@ -394,7 +394,7 @@ impl<'de> Deserialize<'de> for Schema {
pub enum DocParsingError {
/// The payload given is not valid JSON.
#[error("The provided string is not valid JSON")]
NotJSON(String),
NotJson(String),
/// One of the value node could not be parsed.
#[error("The field '{0:?}' could not be parsed: {1:?}")]
ValueError(String, ValueParsingError),
@@ -408,7 +408,7 @@ mod tests {

use crate::schema::field_type::ValueParsingError;
use crate::schema::int_options::Cardinality::SingleValue;
use crate::schema::schema::DocParsingError::NotJSON;
use crate::schema::schema::DocParsingError::NotJson;
use crate::schema::*;
use matches::{assert_matches, matches};
use serde_json;
@@ -737,7 +737,7 @@ mod tests {
"count": 50,
}"#,
);
assert_matches!(json_err, Err(NotJSON(_)));
assert_matches!(json_err, Err(NotJson(_)));
}
}

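The NotJSON → NotJson rename reaches the public API through Schema's JSON document parsing. A hedged sketch of matching the renamed variant from calling code, assuming the usual `Schema::parse_document` entry point; the field and payloads are made up and this is not part of the diff:

```rust
use tantivy::schema::{DocParsingError, Schema, STORED, TEXT};

fn main() {
    let mut schema_builder = Schema::builder();
    schema_builder.add_text_field("title", TEXT | STORED);
    let schema = schema_builder.build();

    // A well-formed payload parses into a Document.
    let doc = schema.parse_document(r#"{"title": "Moby Dick"}"#).unwrap();
    assert!(!doc.field_values().is_empty());

    // A payload that is not valid JSON surfaces the renamed variant.
    match schema.parse_document(r#"{"title": "Moby"#) {
        Err(DocParsingError::NotJson(_)) => {}
        other => panic!("expected NotJson, got {:?}", other),
    }
}
```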
@@ -69,7 +69,6 @@ pub struct SegmentSpaceUsage {
termdict: PerFieldSpaceUsage,
postings: PerFieldSpaceUsage,
positions: PerFieldSpaceUsage,
positions_idx: PerFieldSpaceUsage,
fast_fields: PerFieldSpaceUsage,
fieldnorms: PerFieldSpaceUsage,

@@ -87,7 +86,6 @@ impl SegmentSpaceUsage {
termdict: PerFieldSpaceUsage,
postings: PerFieldSpaceUsage,
positions: PerFieldSpaceUsage,
positions_idx: PerFieldSpaceUsage,
fast_fields: PerFieldSpaceUsage,
fieldnorms: PerFieldSpaceUsage,
store: StoreSpaceUsage,
@@ -105,7 +103,6 @@ impl SegmentSpaceUsage {
termdict,
postings,
positions,
positions_idx,
fast_fields,
fieldnorms,
store,
@@ -122,14 +119,14 @@ impl SegmentSpaceUsage {
use self::ComponentSpaceUsage::*;
use crate::SegmentComponent::*;
match component {
POSTINGS => PerField(self.postings().clone()),
POSITIONS => PerField(self.positions().clone()),
POSITIONSSKIP => PerField(self.positions_skip_idx().clone()),
FASTFIELDS => PerField(self.fast_fields().clone()),
FIELDNORMS => PerField(self.fieldnorms().clone()),
TERMS => PerField(self.termdict().clone()),
STORE => Store(self.store().clone()),
DELETE => Basic(self.deletes()),
Postings => PerField(self.postings().clone()),
Positions => PerField(self.positions().clone()),
FastFields => PerField(self.fast_fields().clone()),
FieldNorms => PerField(self.fieldnorms().clone()),
Terms => PerField(self.termdict().clone()),
SegmentComponent::Store => ComponentSpaceUsage::Store(self.store().clone()),
SegmentComponent::TempStore => ComponentSpaceUsage::Store(self.store().clone()),
Delete => Basic(self.deletes()),
}
}

@@ -153,11 +150,6 @@ impl SegmentSpaceUsage {
&self.positions
}

/// Space usage for positions skip idx
pub fn positions_skip_idx(&self) -> &PerFieldSpaceUsage {
&self.positions_idx
}

/// Space usage for fast fields
pub fn fast_fields(&self) -> &PerFieldSpaceUsage {
&self.fast_fields
@@ -358,7 +350,6 @@ mod test {
expect_single_field(segment.termdict(), &name, 1, 512);
expect_single_field(segment.postings(), &name, 1, 512);
assert_eq!(0, segment.positions().total());
assert_eq!(0, segment.positions_skip_idx().total());
expect_single_field(segment.fast_fields(), &name, 1, 512);
expect_single_field(segment.fieldnorms(), &name, 1, 512);
// TODO: understand why the following fails
@@ -398,7 +389,6 @@ mod test {
expect_single_field(segment.termdict(), &name, 1, 512);
expect_single_field(segment.postings(), &name, 1, 512);
expect_single_field(segment.positions(), &name, 1, 512);
expect_single_field(segment.positions_skip_idx(), &name, 1, 512);
assert_eq!(0, segment.fast_fields().total());
expect_single_field(segment.fieldnorms(), &name, 1, 512);
// TODO: understand why the following fails
@@ -437,7 +427,6 @@ mod test {
assert_eq!(0, segment.termdict().total());
assert_eq!(0, segment.postings().total());
assert_eq!(0, segment.positions().total());
assert_eq!(0, segment.positions_skip_idx().total());
assert_eq!(0, segment.fast_fields().total());
assert_eq!(0, segment.fieldnorms().total());
assert!(segment.store().total() > 0);
@@ -483,7 +472,6 @@ mod test {
expect_single_field(segment_space_usage.termdict(), &name, 1, 512);
expect_single_field(segment_space_usage.postings(), &name, 1, 512);
assert_eq!(0, segment_space_usage.positions().total());
assert_eq!(0, segment_space_usage.positions_skip_idx().total());
assert_eq!(0, segment_space_usage.fast_fields().total());
expect_single_field(segment_space_usage.fieldnorms(), &name, 1, 512);
assert!(segment_space_usage.deletes() > 0);

src/store/compression_lz4_block.rs (new file, 39 lines)
@@ -0,0 +1,39 @@
use std::io::{self};

use core::convert::TryInto;
use lz4_flex::{compress_into, decompress_into};
/// Name of the compression scheme used in the doc store.
///
/// This name is appended to the version string of tantivy.
pub const COMPRESSION: &str = "lz4_block";

pub fn compress(uncompressed: &[u8], compressed: &mut Vec<u8>) -> io::Result<()> {
compressed.clear();

compressed.extend_from_slice(&[0, 0, 0, 0]);
compress_into(uncompressed, compressed);
let num_bytes = uncompressed.len() as u32;
compressed[0..4].copy_from_slice(&num_bytes.to_le_bytes());
Ok(())
}

pub fn decompress(compressed: &[u8], decompressed: &mut Vec<u8>) -> io::Result<()> {
decompressed.clear();
//next lz4_flex version will support slice as input parameter.
//this will make the usage much less ugly
let uncompressed_size_bytes: &[u8; 4] = compressed
.get(..4)
.ok_or(io::ErrorKind::InvalidData)?
.try_into()
.unwrap();
let uncompressed_size = u32::from_le_bytes(*uncompressed_size_bytes) as usize;
// reserve more than required, because blocked writes may write out of bounds, will be improved
// with lz4_flex 1.0
decompressed.reserve(uncompressed_size + 4 + 24);
unsafe {
decompressed.set_len(uncompressed_size);
}
decompress_into(&compressed[4..], decompressed)
.map_err(|err| io::Error::new(io::ErrorKind::InvalidData, err.to_string()))?;
Ok(())
}

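A small sketch (not part of the diff) of how the two helpers above fit together inside the crate: `compress` prefixes the lz4_flex block with the uncompressed length as four little-endian bytes, and `decompress` reads that prefix back to size its output buffer. The `roundtrip` wrapper is hypothetical.

```rust
// Hypothetical in-crate helper illustrating the framing used above:
// [u32 little-endian uncompressed length][lz4_flex block data].
fn roundtrip(block: &[u8]) -> std::io::Result<Vec<u8>> {
    let mut compressed = Vec::new();
    compress(block, &mut compressed)?;

    let mut decompressed = Vec::new();
    decompress(&compressed, &mut decompressed)?;
    debug_assert_eq!(decompressed, block);
    Ok(decompressed)
}
```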
@@ -18,7 +18,7 @@ pub use self::skip_index_builder::SkipIndexBuilder;
/// All of the intervals here defined are semi-open.
/// The checkpoint describes that the block within the `byte_range`
/// and spans over the `doc_range`.
#[derive(Clone, Eq, PartialEq)]
#[derive(Clone, Eq, PartialEq, Default)]
pub struct Checkpoint {
pub doc_range: Range<DocId>,
pub byte_range: Range<usize>,

src/store/mod.rs (104 lines)
@@ -39,9 +39,40 @@ mod writer;
pub use self::reader::StoreReader;
pub use self::writer::StoreWriter;

// compile_error doesn't scale very well, enum like feature flags would be great to have in Rust
#[cfg(all(feature = "lz4", feature = "brotli"))]
compile_error!("feature `lz4` or `brotli` must not be enabled together.");

#[cfg(all(feature = "lz4_block", feature = "brotli"))]
compile_error!("feature `lz4_block` or `brotli` must not be enabled together.");

#[cfg(all(feature = "lz4_block", feature = "lz4"))]
compile_error!("feature `lz4_block` or `lz4` must not be enabled together.");

#[cfg(all(feature = "lz4_block", feature = "snap"))]
compile_error!("feature `lz4_block` or `snap` must not be enabled together.");

#[cfg(all(feature = "lz4", feature = "snap"))]
compile_error!("feature `lz4` or `snap` must not be enabled together.");

#[cfg(all(feature = "brotli", feature = "snap"))]
compile_error!("feature `brotli` or `snap` must not be enabled together.");

#[cfg(not(any(
feature = "lz4",
feature = "brotli",
feature = "lz4_flex",
feature = "snap"
)))]
compile_error!("all compressors are deactivated via feature-flags, check Cargo.toml for available decompressors.");

#[cfg(feature = "lz4_flex")]
mod compression_lz4_block;
#[cfg(feature = "lz4_flex")]
pub use self::compression_lz4_block::COMPRESSION;
#[cfg(feature = "lz4_flex")]
use self::compression_lz4_block::{compress, decompress};

#[cfg(feature = "lz4")]
mod compression_lz4;
#[cfg(feature = "lz4")]
@@ -56,22 +87,24 @@ pub use self::compression_brotli::COMPRESSION;
#[cfg(feature = "brotli")]
use self::compression_brotli::{compress, decompress};

#[cfg(not(any(feature = "lz4", feature = "brotli")))]
#[cfg(feature = "snap")]
mod compression_snap;
#[cfg(not(any(feature = "lz4", feature = "brotli")))]
#[cfg(feature = "snap")]
pub use self::compression_snap::COMPRESSION;
#[cfg(not(any(feature = "lz4", feature = "brotli")))]
#[cfg(feature = "snap")]
use self::compression_snap::{compress, decompress};

#[cfg(test)]
pub mod tests {

use super::*;
use crate::directory::{Directory, RAMDirectory, WritePtr};
use crate::schema::Document;
use crate::schema::FieldValue;
use crate::schema::Schema;
use crate::schema::TextOptions;
use crate::schema::{self, FieldValue, TextFieldIndexing};
use crate::schema::{Document, TextOptions};
use crate::{
directory::{Directory, RamDirectory, WritePtr},
Term,
};
use crate::{schema::Schema, Index};
use std::path::Path;

pub fn write_lorem_ipsum_store(writer: WritePtr, num_docs: usize) -> Schema {
@@ -115,7 +148,7 @@ pub mod tests {
#[test]
fn test_store() -> crate::Result<()> {
let path = Path::new("store");
let directory = RAMDirectory::create();
let directory = RamDirectory::create();
let store_wrt = directory.open_write(path)?;
let schema = write_lorem_ipsum_store(store_wrt, 1_000);
let field_title = schema.get_field("title").unwrap();
@@ -132,6 +165,53 @@ pub mod tests {
format!("Doc {}", i)
);
}
for (i, doc) in store.iter(None).enumerate() {
assert_eq!(
*doc?.get_first(field_title).unwrap().text().unwrap(),
format!("Doc {}", i)
);
}
Ok(())
}

#[test]
fn test_store_with_delete() -> crate::Result<()> {
let mut schema_builder = schema::Schema::builder();

let text_field_options = TextOptions::default()
.set_indexing_options(
TextFieldIndexing::default()
.set_index_option(schema::IndexRecordOption::WithFreqsAndPositions),
)
.set_stored();
let text_field = schema_builder.add_text_field("text_field", text_field_options);
let schema = schema_builder.build();
let index_builder = Index::builder().schema(schema);

let index = index_builder.create_in_ram().unwrap();

{
let mut index_writer = index.writer_for_tests().unwrap();

index_writer.add_document(doc!(text_field=> "deleteme"));
index_writer.add_document(doc!(text_field=> "deletemenot"));
index_writer.add_document(doc!(text_field=> "deleteme"));
index_writer.add_document(doc!(text_field=> "deletemenot"));
index_writer.add_document(doc!(text_field=> "deleteme"));

index_writer.delete_term(Term::from_field_text(text_field, "deleteme"));
assert!(index_writer.commit().is_ok());
}

let searcher = index.reader().unwrap().searcher();
let reader = searcher.segment_reader(0);
let store = reader.get_store_reader().unwrap();
for doc in store.iter(reader.delete_bitset()) {
assert_eq!(
*doc?.get_first(text_field).unwrap().text().unwrap(),
"deletemenot".to_string()
);
}
Ok(())
}
}
@@ -141,7 +221,7 @@ mod bench {

use super::tests::write_lorem_ipsum_store;
use crate::directory::Directory;
use crate::directory::RAMDirectory;
use crate::directory::RamDirectory;
use crate::store::StoreReader;
use std::path::Path;
use test::Bencher;
@@ -149,7 +229,7 @@ mod bench {
#[bench]
#[cfg(feature = "mmap")]
fn bench_store_encode(b: &mut Bencher) {
let directory = RAMDirectory::create();
let directory = RamDirectory::create();
let path = Path::new("store");
b.iter(|| {
write_lorem_ipsum_store(directory.open_write(path).unwrap(), 1_000);
@@ -159,7 +239,7 @@ mod bench {

#[bench]
fn bench_store_decode(b: &mut Bencher) {
let directory = RAMDirectory::create();
let directory = RamDirectory::create();
let path = Path::new("store");
write_lorem_ipsum_store(directory.open_write(path).unwrap(), 1_000);
let store_file = directory.open_read(path).unwrap();

@@ -1,12 +1,12 @@
use super::decompress;
use super::index::SkipIndex;
use crate::common::VInt;
use crate::common::{BinarySerializable, HasLen};
use crate::directory::{FileSlice, OwnedBytes};
use crate::schema::Document;
use crate::space_usage::StoreSpaceUsage;
use crate::store::index::Checkpoint;
use crate::DocId;
use crate::{common::VInt, fastfield::DeleteBitSet};
use lru::LruCache;
use std::io;
use std::mem::size_of;
@@ -15,7 +15,7 @@ use std::sync::{Arc, Mutex};

const LRU_CACHE_CAPACITY: usize = 100;

type Block = Arc<Vec<u8>>;
type Block = OwnedBytes;

type BlockCache = Arc<Mutex<LruCache<usize, Block>>>;

@@ -74,7 +74,7 @@ impl StoreReader {
let mut decompressed_block = vec![];
decompress(compressed_block.as_slice(), &mut decompressed_block)?;

let block = Arc::new(decompressed_block);
let block = OwnedBytes::new(decompressed_block);
self.cache
.lock()
.unwrap()
@@ -86,23 +86,132 @@ impl StoreReader {
/// Reads a given document.
///
/// Calling `.get(doc)` is relatively costly as it requires
/// decompressing a compressed block.
/// decompressing a compressed block. The store utilizes a LRU cache,
/// so accessing docs from the same compressed block should be faster.
/// For that reason a store reader should be kept and reused.
///
/// It should not be called to score documents
/// for instance.
pub fn get(&self, doc_id: DocId) -> crate::Result<Document> {
let mut doc_bytes = self.get_document_bytes(doc_id)?;
Ok(Document::deserialize(&mut doc_bytes)?)
}

/// Reads raw bytes of a given document. Returns `RawDocument`, which contains the block of a document and its start and end
/// position within the block.
///
/// Calling `.get(doc)` is relatively costly as it requires
/// decompressing a compressed block. The store utilizes a LRU cache,
/// so accessing docs from the same compressed block should be faster.
/// For that reason a store reader should be kept and reused.
///
pub fn get_document_bytes(&self, doc_id: DocId) -> crate::Result<OwnedBytes> {
let checkpoint = self.block_checkpoint(doc_id).ok_or_else(|| {
crate::TantivyError::InvalidArgument(format!("Failed to lookup Doc #{}.", doc_id))
})?;
let mut cursor = &self.read_block(&checkpoint)?[..];
let block = self.read_block(&checkpoint)?;
let mut cursor = &block[..];
let cursor_len_before = cursor.len();
for _ in checkpoint.doc_range.start..doc_id {
let doc_length = VInt::deserialize(&mut cursor)?.val() as usize;
cursor = &cursor[doc_length..];
}

let doc_length = VInt::deserialize(&mut cursor)?.val() as usize;
cursor = &cursor[..doc_length];
Ok(Document::deserialize(&mut cursor)?)
let start_pos = cursor_len_before - cursor.len();
let end_pos = cursor_len_before - cursor.len() + doc_length;
Ok(block.slice(start_pos..end_pos))
}

/// Iterator over all Documents in their order as they are stored in the doc store.
/// Use this, if you want to extract all Documents from the doc store.
/// The delete_bitset has to be forwarded from the `SegmentReader` or the results maybe wrong.
pub fn iter<'a: 'b, 'b>(
&'b self,
delete_bitset: Option<&'a DeleteBitSet>,
) -> impl Iterator<Item = crate::Result<Document>> + 'b {
self.iter_raw(delete_bitset).map(|doc_bytes_res| {
let mut doc_bytes = doc_bytes_res?;
Ok(Document::deserialize(&mut doc_bytes)?)
})
}

/// Iterator over all RawDocuments in their order as they are stored in the doc store.
/// Use this, if you want to extract all Documents from the doc store.
/// The delete_bitset has to be forwarded from the `SegmentReader` or the results maybe wrong.
pub(crate) fn iter_raw<'a: 'b, 'b>(
&'b self,
delete_bitset: Option<&'a DeleteBitSet>,
) -> impl Iterator<Item = crate::Result<OwnedBytes>> + 'b {
let last_docid = self
.block_checkpoints()
.last()
.map(|checkpoint| checkpoint.doc_range.end)
.unwrap_or(0);
let mut checkpoint_block_iter = self.block_checkpoints();
let mut curr_checkpoint = checkpoint_block_iter.next();
let mut curr_block = curr_checkpoint
.as_ref()
.map(|checkpoint| self.read_block(&checkpoint));
let mut block_start_pos = 0;
let mut num_skipped = 0;
(0..last_docid)
.filter_map(move |doc_id| {
// filter_map is only used to resolve lifetime issues between the two closures on
// the outer variables
let alive = delete_bitset.map_or(true, |bitset| bitset.is_alive(doc_id));
if !alive {
// we keep the number of skipped documents to move forward in the map block
num_skipped += 1;
}
// check move to next checkpoint
let mut reset_block_pos = false;
if doc_id >= curr_checkpoint.as_ref().unwrap().doc_range.end {
curr_checkpoint = checkpoint_block_iter.next();
curr_block = curr_checkpoint
.as_ref()
.map(|checkpoint| self.read_block(&checkpoint));
reset_block_pos = true;
num_skipped = 0;
}

if alive {
let ret = Some((
curr_block.as_ref().unwrap().as_ref().unwrap().clone(), // todo forward errors
num_skipped,
reset_block_pos,
));
// the map block will move over the num_skipped, so we reset to 0
num_skipped = 0;
ret
} else {
None
}
})
.map(move |(block, num_skipped, reset_block_pos)| {
if reset_block_pos {
block_start_pos = 0;
}
let mut cursor = &block[block_start_pos..];
let mut pos = 0;
let doc_length = loop {
let doc_length = VInt::deserialize(&mut cursor)?.val() as usize;
let num_bytes_read = block[block_start_pos..].len() - cursor.len();
block_start_pos += num_bytes_read;

pos += 1;
if pos == num_skipped + 1 {
break doc_length;
} else {
block_start_pos += doc_length;
cursor = &block[block_start_pos..];
}
};
let end_pos = block_start_pos + doc_length;
let doc_bytes = block.slice(block_start_pos..end_pos);
block_start_pos = end_pos;
Ok(doc_bytes)
})
}

/// Summarize total space usage of this store reader.
@@ -124,7 +233,7 @@ mod tests {
use super::*;
use crate::schema::Document;
use crate::schema::Field;
use crate::{directory::RAMDirectory, store::tests::write_lorem_ipsum_store, Directory};
use crate::{directory::RamDirectory, store::tests::write_lorem_ipsum_store, Directory};
use std::path::Path;

fn get_text_field<'a>(doc: &'a Document, field: &'a Field) -> Option<&'a str> {
@@ -133,7 +242,7 @@ mod tests {

#[test]
fn test_store_lru_cache() -> crate::Result<()> {
let directory = RAMDirectory::create();
let directory = RamDirectory::create();
let path = Path::new("store");
let writer = directory.open_write(path)?;
let schema = write_lorem_ipsum_store(writer, 500);
@@ -191,7 +300,7 @@ mod tests {
.unwrap()
.peek_lru()
.map(|(&k, _)| k as usize),
Some(18806)
Some(9249)
);

Ok(())

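To close, a sketch (not part of the diff) of the access pattern the new StoreReader doc comments recommend: obtain the store reader once per segment and reuse it, so consecutive reads from the same compressed block hit the LRU cache, and forward the segment's delete bitset when iterating. The `dump_titles` helper and the `title` field are hypothetical.

```rust
fn dump_titles(
    searcher: &tantivy::Searcher,
    title: tantivy::schema::Field,
) -> tantivy::Result<()> {
    for segment_reader in searcher.segment_readers() {
        // Keep and reuse the StoreReader: its block cache only helps
        // if the same reader serves consecutive lookups.
        let store_reader = segment_reader.get_store_reader()?;
        for doc in store_reader.iter(segment_reader.delete_bitset()) {
            let doc = doc?;
            if let Some(value) = doc.get_first(title) {
                println!("{:?}", value);
            }
        }
    }
    Ok(())
}
```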
Some files were not shown because too many files have changed in this diff.