Mirror of https://github.com/quickwit-oss/tantivy.git
Synced 2026-03-25 22:50:42 +00:00

Compare commits: `storage_ab` ... `optim` (2 commits: d53bad6cac, d45602fb9a)
@@ -1,87 +0,0 @@
---
name: update-changelog
description: Update CHANGELOG.md with merged PRs since the last changelog update, categorized by type
---

# Update Changelog

This skill updates CHANGELOG.md with merged PRs that aren't already listed.

## Step 1: Determine the changelog scope

Read `CHANGELOG.md` to identify the current unreleased version section at the top (e.g., `Tantivy 0.26 (Unreleased)`).

Collect all PR numbers already mentioned in the unreleased section by extracting `#NNNN` references.

## Step 2: Find merged PRs not yet in the changelog

Use `gh` to list recently merged PRs from the upstream repo:

```bash
gh pr list --repo quickwit-oss/tantivy --state merged --limit 100 --json number,title,author,labels,mergedAt
```

Filter out any PRs whose number already appears in the unreleased section of the changelog.
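
Steps 1 and 2 boil down to a set difference: the `#NNNN` references already present in the unreleased section versus the PR numbers returned by `gh`. A minimal sketch of that filter (the function names are illustrative, not part of the skill):

```rust
/// Extract every `#NNNN` PR reference from a changelog excerpt.
fn extract_pr_refs(changelog: &str) -> std::collections::HashSet<u32> {
    let mut refs = std::collections::HashSet::new();
    let bytes = changelog.as_bytes();
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'#' {
            let start = i + 1;
            let mut end = start;
            while end < bytes.len() && bytes[end].is_ascii_digit() {
                end += 1;
            }
            if end > start {
                if let Ok(n) = changelog[start..end].parse::<u32>() {
                    refs.insert(n);
                }
            }
            i = end;
        } else {
            i += 1;
        }
    }
    refs
}

/// Keep only the merged PR numbers that the changelog does not mention yet.
fn missing_prs(changelog: &str, merged: &[u32]) -> Vec<u32> {
    let known = extract_pr_refs(changelog);
    merged.iter().copied().filter(|n| !known.contains(n)).collect()
}

fn main() {
    let unreleased =
        "- Fix phrase query [#2751](https://github.com/quickwit-oss/tantivy/pull/2751)(@Darkheir)";
    let merged = [2751, 2846, 2854];
    // #2751 is already listed, so only the other two remain.
    assert_eq!(missing_prs(unreleased, &merged), vec![2846, 2854]);
}
```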

## Step 3: Consolidate related PRs

Before categorizing, group PRs that belong to the same logical change. This is critical for producing a clean changelog. Use PR descriptions, titles, cross-references, and the files touched to identify relationships.

**Merge follow-up PRs into the original:**
- If a PR is a bugfix, refinement, or follow-up to another PR in the same unreleased cycle, combine them into a single changelog entry with multiple `[#N](url)` links.
- Also consolidate PRs that touch the same feature area even if not explicitly linked — e.g., a PR fixing an edge case in a new API should be folded into the entry for the PR that introduced that API.

**Filter out bugfixes on unreleased features:**
- If a bugfix PR fixes something introduced by another PR in the **same unreleased version**, it must NOT appear as a separate Bugfixes entry. Instead, silently fold it into the original feature/improvement entry. The changelog should describe the final shipped state, not the development history.
- To detect this: check if the bugfix PR references or reverts changes from another PR in the same release cycle, or if it touches code that was newly added (not present in the previous release).

## Step 4: Review the actual code diff

**Do not rely on PR titles or descriptions alone.** For every candidate PR, run `gh pr diff <number> --repo quickwit-oss/tantivy` and read the actual changes. PR titles are often misleading — the diff is the source of truth.

**What to look for in the diff:**
- Does it change observable behavior, public API surface, or performance characteristics?
- Is the change something a user of the library would notice or need to know about?
- Could the change break existing code (API changes, removed features)?

**Skip PRs where the diff reveals the change is not meaningful enough for the changelog** — e.g., cosmetic renames, trivial visibility tweaks, test-only changes, etc.

## Step 5: Categorize each PR group

For each PR (or consolidated group) that survived the diff review, determine its category:

- **Bugfixes** — fixes to behavior that existed in the **previous release**. NOT fixes to features introduced in this release cycle.
- **Features/Improvements** — new features, API additions, new options, improvements that change user-facing behavior or add new capabilities.
- **Performance** — optimizations, speed improvements, memory reductions. **If a PR adds new API whose primary purpose is enabling a performance optimization, categorize it as Performance, not Features.** The deciding question is: does a user benefit from this because of new functionality, or because things got faster/leaner? For example, a new trait method that exists solely to enable cheaper intersection ordering is Performance, not a Feature.

If a PR doesn't clearly fit any category (e.g., CI-only changes, internal refactors with no user-facing impact, dependency bumps with no behavior change), skip it — not everything belongs in the changelog.

When unclear, use your best judgment or ask the user.

## Step 6: Format entries

Each entry must follow this exact format:

```
- Description [#NUMBER](https://github.com/quickwit-oss/tantivy/pull/NUMBER)(@author)
```

Rules:
- The description should be concise and describe the user-facing change (not the implementation). Describe the final shipped state, not the incremental development steps.
- Use sub-categories with bold headers when multiple entries relate to the same area (e.g., `- **Aggregation**` with indented entries beneath). Follow the existing grouping style in the changelog.
- Author is the GitHub username from the PR, prefixed with `@`. For consolidated entries, include all contributing authors.
- For consolidated PRs, list all PR links in a single entry: `[#100](url) [#110](url)` (see existing entries for examples).
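
The formatting rules above can be captured in a tiny helper. A sketch, where the `Entry` struct and function name are illustrative rather than part of the skill:

```rust
struct Entry {
    description: String,
    prs: Vec<u32>,        // all consolidated PR numbers
    authors: Vec<String>, // GitHub usernames without the `@`
}

/// Render one changelog line in the required format:
/// `- Description [#N](https://github.com/quickwit-oss/tantivy/pull/N)(@author)`
fn format_entry(e: &Entry) -> String {
    let links: Vec<String> = e
        .prs
        .iter()
        .map(|n| format!("[#{n}](https://github.com/quickwit-oss/tantivy/pull/{n})"))
        .collect();
    let authors: Vec<String> = e.authors.iter().map(|a| format!("@{a}")).collect();
    // Multiple links are space-separated; authors share one trailing parenthesis.
    format!("- {} {}({})", e.description, links.join(" "), authors.join(" "))
}

fn main() {
    let e = Entry {
        description: "Add composite aggregation".to_string(),
        prs: vec![2856],
        authors: vec!["fulmicoton".to_string()],
    };
    assert_eq!(
        format_entry(&e),
        "- Add composite aggregation [#2856](https://github.com/quickwit-oss/tantivy/pull/2856)(@fulmicoton)"
    );
}
```

For a consolidated group, passing several PR numbers and authors yields the multi-link form described in the last rule.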

## Step 7: Present changes to the user

Show the user the proposed changelog entries grouped by category **before** editing the file. Ask for confirmation or adjustments.

## Step 8: Update CHANGELOG.md

Insert the new entries into the appropriate sections of the unreleased version block. If a section doesn't exist yet, create it following the order: Bugfixes, Features/Improvements, Performance.

Append new entries at the end of each section (before the next section header or version header).
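
Appending at the end of a section means scanning from the section header to the next `##` header (or the end of the block) and inserting just before it. A line-based sketch, assuming `##`-style section headers as in tantivy's CHANGELOG and ignoring blank-line handling:

```rust
/// Insert `entry` at the end of the `section` block (e.g. "## Bugfixes"),
/// i.e. just before the next `##` header or the end of the text.
fn append_to_section<'a>(changelog: &'a str, section: &str, entry: &'a str) -> String {
    let mut lines: Vec<&str> = changelog.lines().collect();
    let start = lines
        .iter()
        .position(|l| l.trim() == section)
        .expect("section header not found");
    // The section ends at the next `##` header after it, or at EOF.
    let end = lines[start + 1..]
        .iter()
        .position(|l| l.starts_with("## "))
        .map(|off| start + 1 + off)
        .unwrap_or(lines.len());
    lines.insert(end, entry);
    lines.join("\n")
}

fn main() {
    let changelog = "## Bugfixes\n- Fix A [#1](u)(@x)\n## Performance\n- Speed up B [#2](u)(@y)";
    let updated = append_to_section(changelog, "## Bugfixes", "- Fix C [#3](u)(@z)");
    assert_eq!(
        updated,
        "## Bugfixes\n- Fix A [#1](u)(@x)\n- Fix C [#3](u)(@z)\n## Performance\n- Speed up B [#2](u)(@y)"
    );
}
```

A real implementation would also skip trailing blank lines before the next header so the new entry lands directly under the last existing one; this sketch leaves that out.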

## Step 9: Verify

Read back the updated unreleased section and display it to the user for final review.

CHANGELOG.md (48 lines changed)
@@ -1,51 +1,3 @@
Tantivy 0.26 (Unreleased)
================================

## Bugfixes
- Align float query coercion during search with the columnar coercion rules [#2692](https://github.com/quickwit-oss/tantivy/pull/2692)(@fulmicoton)
- Fix lenient elastic range queries with trailing closing parentheses [#2816](https://github.com/quickwit-oss/tantivy/pull/2816)(@evance-br)
- Fix intersection `seek()` advancing below current doc id [#2812](https://github.com/quickwit-oss/tantivy/pull/2812)(@fulmicoton)
- Fix phrase query prefixed with `*` [#2751](https://github.com/quickwit-oss/tantivy/pull/2751)(@Darkheir)
- Fix `vint` buffer overflow during index creation [#2778](https://github.com/quickwit-oss/tantivy/pull/2778)(@rebasedming)
- Fix integer overflow in `ExpUnrolledLinkedList` for large datasets [#2735](https://github.com/quickwit-oss/tantivy/pull/2735)(@mdashti)
- Fix integer overflow in segment sorting and merge policy truncation [#2846](https://github.com/quickwit-oss/tantivy/pull/2846)(@anaslimem)
- Fix merging of intermediate aggregation results [#2719](https://github.com/quickwit-oss/tantivy/pull/2719)(@PSeitz)
- Fix duplicated doc counts in term aggregation for multi-valued fields [#2854](https://github.com/quickwit-oss/tantivy/pull/2854)(@nuri-yoo)

## Features/Improvements
- **Aggregation**
  - Add filter aggregation [#2711](https://github.com/quickwit-oss/tantivy/pull/2711)(@mdashti)
  - Add include/exclude filtering for term aggregations [#2717](https://github.com/quickwit-oss/tantivy/pull/2717)(@PSeitz)
  - Add public accessors for intermediate aggregation results [#2829](https://github.com/quickwit-oss/tantivy/pull/2829)(@congx4)
  - Replace HyperLogLog++ with Apache DataSketches HLL for cardinality aggregation [#2837](https://github.com/quickwit-oss/tantivy/pull/2837) [#2842](https://github.com/quickwit-oss/tantivy/pull/2842)(@congx4)
  - Add composite aggregation [#2856](https://github.com/quickwit-oss/tantivy/pull/2856)(@fulmicoton)
- **Fast Fields**
  - Add fast field fallback for `TermQuery` when the field is not indexed [#2693](https://github.com/quickwit-oss/tantivy/pull/2693)(@PSeitz-dd)
  - Add fast field support for `Bytes` values [#2830](https://github.com/quickwit-oss/tantivy/pull/2830)(@mdashti)
- **Query Parser**
  - Add support for regexes in the query grammar [#2677](https://github.com/quickwit-oss/tantivy/pull/2677) [#2818](https://github.com/quickwit-oss/tantivy/pull/2818)(@Darkheir)
  - Deduplicate queries in query parser [#2698](https://github.com/quickwit-oss/tantivy/pull/2698)(@PSeitz-dd)
- Add erased `SortKeyComputer` for sorting on column types unknown until runtime [#2770](https://github.com/quickwit-oss/tantivy/pull/2770) [#2790](https://github.com/quickwit-oss/tantivy/pull/2790)(@stuhood @PSeitz)
- Add natural-order-with-none-highest support in `TopDocs::order_by` [#2780](https://github.com/quickwit-oss/tantivy/pull/2780)(@stuhood)
- Move stemming behind the `stemmer` feature flag [#2791](https://github.com/quickwit-oss/tantivy/pull/2791)(@fulmicoton)
- Make `DeleteMeta`, `AddOperation`, `advance_deletes`, `with_max_doc`, the `serializer` module, and `delete_queue` public [#2762](https://github.com/quickwit-oss/tantivy/pull/2762) [#2765](https://github.com/quickwit-oss/tantivy/pull/2765) [#2766](https://github.com/quickwit-oss/tantivy/pull/2766) [#2835](https://github.com/quickwit-oss/tantivy/pull/2835)(@philippemnoel @PSeitz)
- Make `Language` hashable [#2763](https://github.com/quickwit-oss/tantivy/pull/2763)(@philippemnoel)
- Improve `space_usage` reporting for JSON fields and columnar data [#2761](https://github.com/quickwit-oss/tantivy/pull/2761)(@PSeitz-dd)
- Split `Term` into `Term` and `IndexingTerm` [#2744](https://github.com/quickwit-oss/tantivy/pull/2744) [#2750](https://github.com/quickwit-oss/tantivy/pull/2750)(@PSeitz-dd @PSeitz)

## Performance
- **Aggregation**
  - Large speed-up and memory reduction for nested high-cardinality aggregations by using one collector per request instead of one per bucket, and adding `PagedTermMap` for faster medium-cardinality term aggregations [#2715](https://github.com/quickwit-oss/tantivy/pull/2715) [#2759](https://github.com/quickwit-oss/tantivy/pull/2759)(@PSeitz @PSeitz-dd)
  - Optimize low-cardinality term aggregations by using a `Vec` instead of a `HashMap` [#2740](https://github.com/quickwit-oss/tantivy/pull/2740)(@fulmicoton-dd)
- Optimize `ExistsQuery` for a high number of dynamic columns [#2694](https://github.com/quickwit-oss/tantivy/pull/2694)(@PSeitz-dd)
- Add lazy scorers to stop score evaluation early when a doc won't reach the top-K threshold [#2726](https://github.com/quickwit-oss/tantivy/pull/2726) [#2777](https://github.com/quickwit-oss/tantivy/pull/2777)(@fulmicoton @stuhood)
- Add `DocSet::cost()` and use it to order scorers in intersections [#2707](https://github.com/quickwit-oss/tantivy/pull/2707)(@PSeitz)
- Add `collect_block` support for collector wrappers [#2727](https://github.com/quickwit-oss/tantivy/pull/2727)(@stuhood)
- Optimize saturated posting lists by replacing them with `AllScorer` in boolean queries [#2745](https://github.com/quickwit-oss/tantivy/pull/2745) [#2760](https://github.com/quickwit-oss/tantivy/pull/2760) [#2774](https://github.com/quickwit-oss/tantivy/pull/2774)(@fulmicoton @mdashti @trinity-1686a)
- Add `seek_danger` on `DocSet` for more efficient intersections [#2538](https://github.com/quickwit-oss/tantivy/pull/2538) [#2810](https://github.com/quickwit-oss/tantivy/pull/2810)(@PSeitz @stuhood @fulmicoton)
- Skip column traversal in `RangeDocSet` when the query range does not overlap the column bounds [#2783](https://github.com/quickwit-oss/tantivy/pull/2783)(@ChangRui-Ryan)
- Speed up exclude queries by supporting multiple excluded `DocSet`s without an intermediate union [#2825](https://github.com/quickwit-oss/tantivy/pull/2825)(@PSeitz)

Tantivy 0.25
================================
@@ -27,7 +27,7 @@ regex = { version = "1.5.5", default-features = false, features = [
 aho-corasick = "1.0"
 tantivy-fst = "0.5"
 memmap2 = { version = "0.9.0", optional = true }
-lz4_flex = { version = "0.13", default-features = false, optional = true }
+lz4_flex = { version = "0.12", default-features = false, optional = true }
 zstd = { version = "0.13", optional = true, default-features = false }
 tempfile = { version = "3.12.0", optional = true }
 log = "0.4.16"

@@ -64,7 +64,7 @@ query-grammar = { version = "0.25.0", path = "./query-grammar", package = "tanti
 tantivy-bitpacker = { version = "0.9", path = "./bitpacker" }
 common = { version = "0.10", path = "./common/", package = "tantivy-common" }
 tokenizer-api = { version = "0.6", path = "./tokenizer-api", package = "tantivy-tokenizer-api" }
-sketches-ddsketch = { version = "0.4", features = ["use_serde"] }
+sketches-ddsketch = { git = "https://github.com/quickwit-oss/rust-sketches-ddsketch.git", rev = "555caf1", features = ["use_serde"] }
 datasketches = "0.2.0"
 futures-util = { version = "0.3.28", optional = true }
 futures-channel = { version = "0.3.28", optional = true }

@@ -201,8 +201,3 @@ harness = false
 [[bench]]
 name = "regex_all_terms"
 harness = false
-
-[[bench]]
-name = "fill_bitset"
-harness = false
@@ -141,6 +141,13 @@ fn main() {
         0.01,
         0.15,
     ),
+    (
+        "N=1M, p(a)=1%, p(b)=30%, p(c)=30%".to_string(),
+        1_000_000,
+        0.01,
+        0.30,
+        0.30,
+    ),
 ];

 let queries = &["a", "+a +b", "+a +b +c", "a OR b", "a OR b OR c"];
@@ -1,106 +0,0 @@
use binggan::{black_box, BenchRunner, PeakMemAlloc, INSTRUMENTED_SYSTEM};
use common::BitSet;
use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};
use tantivy::postings::BlockSegmentPostings;
use tantivy::schema::*;
use tantivy::{doc, DocSet as _, Index, InvertedIndexReader as _, TantivyDocument};

#[global_allocator]
pub static GLOBAL: &PeakMemAlloc<std::alloc::System> = &INSTRUMENTED_SYSTEM;

fn main() {
    let index = build_test_index();
    let reader = index.reader().unwrap();
    let searcher = reader.searcher();
    let segment_reader = &searcher.segment_readers()[0];
    let text_field = index.schema().get_field("text").unwrap();
    let inverted_index = segment_reader.inverted_index(text_field).unwrap();
    let max_doc = segment_reader.max_doc();

    let term = Term::from_field_text(text_field, "hello");
    let term_info = inverted_index.get_term_info(&term).unwrap().unwrap();

    let mut runner = BenchRunner::new();
    runner.set_name("fill_bitset");

    let mut group = runner.new_group();
    {
        let inverted_index = &inverted_index;
        let term_info = &term_info;
        // This is the path used by queries (AutomatonWeight, RangeQuery, etc.)
        // It dispatches via DynInvertedIndexReader::fill_bitset_from_terminfo.
        group.register("fill_bitset_from_terminfo (via trait)", move |_| {
            let mut bitset = BitSet::with_max_value(max_doc);
            inverted_index
                .fill_bitset_from_terminfo(term_info, &mut bitset)
                .unwrap();
            black_box(bitset);
        });
    }
    {
        let inverted_index = &inverted_index;
        let term_info = &term_info;
        // This constructs a SegmentPostings via read_docset_from_terminfo and calls fill_bitset.
        group.register("read_docset + fill_bitset", move |_| {
            let mut postings = inverted_index.read_docset_from_terminfo(term_info).unwrap();
            let mut bitset = BitSet::with_max_value(max_doc);
            postings.fill_bitset(&mut bitset);
            black_box(bitset);
        });
    }
    {
        let inverted_index = &inverted_index;
        let term_info = &term_info;
        // This uses BlockSegmentPostings directly, bypassing SegmentPostings entirely.
        group.register("BlockSegmentPostings direct", move |_| {
            let raw = inverted_index
                .read_raw_postings_data(term_info, IndexRecordOption::Basic)
                .unwrap();
            let mut block_postings = BlockSegmentPostings::open(
                term_info.doc_freq,
                raw.postings_data,
                raw.record_option,
                raw.effective_option,
            )
            .unwrap();
            let mut bitset = BitSet::with_max_value(max_doc);
            loop {
                let docs = block_postings.docs();
                if docs.is_empty() {
                    break;
                }
                for &doc in docs {
                    bitset.insert(doc);
                }
                block_postings.advance();
            }
            black_box(bitset);
        });
    }
    group.run();
}

fn build_test_index() -> Index {
    let mut schema_builder = Schema::builder();
    schema_builder.add_text_field("text", TEXT);
    let schema = schema_builder.build();
    let index = Index::create_in_ram(schema.clone());
    let text_field = schema.get_field("text").unwrap();

    let mut writer = index.writer::<TantivyDocument>(250_000_000).unwrap();
    let mut rng = StdRng::from_seed([42u8; 32]);
    for _ in 0..100_000 {
        if rng.random_bool(0.5) {
            writer
                .add_document(doc!(text_field => "hello world"))
                .unwrap();
        } else {
            writer
                .add_document(doc!(text_field => "goodbye world"))
                .unwrap();
        }
    }
    writer.commit().unwrap();
    index
}
@@ -17,6 +17,7 @@ use rand::rngs::StdRng;
 use rand::SeedableRng;
 use tantivy::collector::{Count, DocSetCollector};
 use tantivy::query::RangeQuery;
+use tantivy::schema::document::TantivyDocument;
 use tantivy::schema::{Schema, Value, FAST, STORED, STRING};
 use tantivy::{doc, Index, ReloadPolicy, Searcher, Term};

@@ -44,7 +45,7 @@ fn build_shared_indices(num_docs: usize, distribution: &str) -> BenchIndex {
     match distribution {
         "dense_random" => {
             for _doc_id in 0..num_docs {
-                let suffix = rng.random_range(0u64..1000u64);
+                let suffix = rng.gen_range(0u64..1000u64);
                 let str_val = format!("str_{:03}", suffix);

                 writer
@@ -70,7 +71,7 @@ fn build_shared_indices(num_docs: usize, distribution: &str) -> BenchIndex {
         }
         "sparse_random" => {
             for _doc_id in 0..num_docs {
-                let suffix = rng.random_range(0u64..1000000u64);
+                let suffix = rng.gen_range(0u64..1000000u64);
                 let str_val = format!("str_{:07}", suffix);

                 writer
@@ -405,7 +406,7 @@ impl FetchAllStringsFromDocTask {

     for doc_address in docs {
         // Get the document from the doc store (row store access)
-        if let Ok(doc) = self.searcher.doc(doc_address) {
+        if let Ok(doc) = self.searcher.doc::<TantivyDocument>(doc_address) {
             // Extract string values from the stored field
             if let Some(field_value) = doc.get_first(str_stored_field) {
                 if let Some(text) = field_value.as_value().as_str() {
@@ -58,78 +58,6 @@ impl<T: PartialOrd + Copy + std::fmt::Debug + Send + Sync + 'static + Default>
        }
    }

    /// Like `fetch_block_with_missing`, but deduplicates (doc_id, value) pairs
    /// so that each unique value per document is returned only once.
    ///
    /// This is necessary for correct document counting in aggregations,
    /// where multi-valued fields can produce duplicate entries that inflate counts.
    #[inline]
    pub fn fetch_block_with_missing_unique_per_doc(
        &mut self,
        docs: &[u32],
        accessor: &Column<T>,
        missing: Option<T>,
    ) where
        T: Ord,
    {
        self.fetch_block_with_missing(docs, accessor, missing);
        if accessor.index.get_cardinality().is_multivalue() {
            self.dedup_docid_val_pairs();
        }
    }

    /// Removes duplicate (doc_id, value) pairs from the caches.
    ///
    /// After `fetch_block`, entries are sorted by doc_id, but values within
    /// the same doc may not be sorted (e.g. `(0,1), (0,2), (0,1)`).
    /// We group consecutive entries by doc_id, sort values within each group
    /// if it has more than 2 elements, then deduplicate adjacent pairs.
    ///
    /// Skips entirely if no doc_id appears more than once in the block.
    fn dedup_docid_val_pairs(&mut self)
    where T: Ord {
        if self.docid_cache.len() <= 1 {
            return;
        }

        // Quick check: if no consecutive doc_ids are equal, no dedup needed.
        let has_multivalue = self.docid_cache.windows(2).any(|w| w[0] == w[1]);
        if !has_multivalue {
            return;
        }

        // Sort values within each doc_id group so duplicates become adjacent.
        let mut start = 0;
        while start < self.docid_cache.len() {
            let doc = self.docid_cache[start];
            let mut end = start + 1;
            while end < self.docid_cache.len() && self.docid_cache[end] == doc {
                end += 1;
            }
            if end - start > 2 {
                self.val_cache[start..end].sort();
            }
            start = end;
        }

        // Now duplicates are adjacent — deduplicate in place.
        let mut write = 0;
        for read in 1..self.docid_cache.len() {
            if self.docid_cache[read] != self.docid_cache[write]
                || self.val_cache[read] != self.val_cache[write]
            {
                write += 1;
                if write != read {
                    self.docid_cache[write] = self.docid_cache[read];
                    self.val_cache[write] = self.val_cache[read];
                }
            }
        }
        let new_len = write + 1;
        self.docid_cache.truncate(new_len);
        self.val_cache.truncate(new_len);
    }

    #[inline]
    pub fn iter_vals(&self) -> impl Iterator<Item = T> + '_ {
        self.val_cache.iter().cloned()
@@ -235,56 +163,4 @@ mod tests {

        assert_eq!(missing_docs, vec![1, 2, 3, 4, 5]);
    }

    #[test]
    fn test_dedup_docid_val_pairs_consecutive() {
        let mut accessor = ColumnBlockAccessor::<u64>::default();
        accessor.docid_cache = vec![0, 0, 2, 3];
        accessor.val_cache = vec![10, 10, 10, 10];
        accessor.dedup_docid_val_pairs();
        assert_eq!(accessor.docid_cache, vec![0, 2, 3]);
        assert_eq!(accessor.val_cache, vec![10, 10, 10]);
    }

    #[test]
    fn test_dedup_docid_val_pairs_non_consecutive() {
        // (0,1), (0,2), (0,1) — duplicate value not adjacent
        let mut accessor = ColumnBlockAccessor::<u64>::default();
        accessor.docid_cache = vec![0, 0, 0];
        accessor.val_cache = vec![1, 2, 1];
        accessor.dedup_docid_val_pairs();
        assert_eq!(accessor.docid_cache, vec![0, 0]);
        assert_eq!(accessor.val_cache, vec![1, 2]);
    }

    #[test]
    fn test_dedup_docid_val_pairs_multi_doc() {
        // doc 0: values [3, 1, 3], doc 1: values [5, 5]
        let mut accessor = ColumnBlockAccessor::<u64>::default();
        accessor.docid_cache = vec![0, 0, 0, 1, 1];
        accessor.val_cache = vec![3, 1, 3, 5, 5];
        accessor.dedup_docid_val_pairs();
        assert_eq!(accessor.docid_cache, vec![0, 0, 1]);
        assert_eq!(accessor.val_cache, vec![1, 3, 5]);
    }

    #[test]
    fn test_dedup_docid_val_pairs_no_duplicates() {
        let mut accessor = ColumnBlockAccessor::<u64>::default();
        accessor.docid_cache = vec![0, 0, 1];
        accessor.val_cache = vec![1, 2, 3];
        accessor.dedup_docid_val_pairs();
        assert_eq!(accessor.docid_cache, vec![0, 0, 1]);
        assert_eq!(accessor.val_cache, vec![1, 2, 3]);
    }

    #[test]
    fn test_dedup_docid_val_pairs_single_element() {
        let mut accessor = ColumnBlockAccessor::<u64>::default();
        accessor.docid_cache = vec![0];
        accessor.val_cache = vec![1];
        accessor.dedup_docid_val_pairs();
        assert_eq!(accessor.docid_cache, vec![0]);
        assert_eq!(accessor.val_cache, vec![1]);
    }
}
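
The deduplication routine above can be exercised in isolation. A minimal standalone sketch of the same group-sort-dedup idea over parallel `(doc_id, value)` vectors (the function name is illustrative):

```rust
/// Deduplicate (doc_id, value) pairs held in two parallel vectors,
/// assuming the pairs are already sorted by doc_id (values within a
/// doc may be in any order), mirroring the cache layout above.
fn dedup_pairs(doc_ids: &mut Vec<u32>, vals: &mut Vec<u64>) {
    if doc_ids.len() <= 1 {
        return;
    }
    // Sort values within each doc_id group so duplicates become adjacent.
    // Groups of size <= 2 never need sorting for adjacent-dedup to work.
    let mut start = 0;
    while start < doc_ids.len() {
        let doc = doc_ids[start];
        let mut end = start + 1;
        while end < doc_ids.len() && doc_ids[end] == doc {
            end += 1;
        }
        if end - start > 2 {
            vals[start..end].sort();
        }
        start = end;
    }
    // Deduplicate adjacent (doc_id, value) pairs in place.
    let mut write = 0;
    for read in 1..doc_ids.len() {
        if doc_ids[read] != doc_ids[write] || vals[read] != vals[write] {
            write += 1;
            doc_ids[write] = doc_ids[read];
            vals[write] = vals[read];
        }
    }
    doc_ids.truncate(write + 1);
    vals.truncate(write + 1);
}

fn main() {
    // doc 0 has values [3, 1, 3]; doc 1 has [5, 5].
    let mut docs = vec![0, 0, 0, 1, 1];
    let mut vals = vec![3u64, 1, 3, 5, 5];
    dedup_pairs(&mut docs, &mut vals);
    assert_eq!(docs, vec![0, 0, 1]);
    assert_eq!(vals, vec![1, 3, 5]);
}
```

This matches the `test_dedup_docid_val_pairs_multi_doc` expectation: each document keeps one copy of each value, which is what makes the aggregation doc counts correct.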

@@ -47,6 +47,9 @@ impl TinySet {
     TinySet(val)
 }

+/// An empty `TinySet` constant.
+pub const EMPTY: TinySet = TinySet(0u64);
+
 /// Returns an empty `TinySet`.
 #[inline]
 pub fn empty() -> TinySet {

@@ -178,11 +181,13 @@
 #[derive(Clone)]
 pub struct BitSet {
     tinysets: Box<[TinySet]>,
+    len: u64,
     max_value: u32,
 }
 impl std::fmt::Debug for BitSet {
     fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
         f.debug_struct("BitSet")
+            .field("len", &self.len)
             .field("max_value", &self.max_value)
             .finish()
     }

@@ -210,6 +215,7 @@ impl BitSet {
     let tinybitsets = vec![TinySet::empty(); num_buckets as usize].into_boxed_slice();
     BitSet {
         tinysets: tinybitsets,
+        len: 0,
         max_value,
     }
 }

@@ -227,6 +233,7 @@ impl BitSet {
     }
     BitSet {
         tinysets: tinybitsets,
+        len: max_value as u64,
         max_value,
     }
 }

@@ -245,19 +252,17 @@ impl BitSet {

 /// Intersect with tinysets
 fn intersect_update_with_iter(&mut self, other: impl Iterator<Item = TinySet>) {
+    self.len = 0;
     for (left, right) in self.tinysets.iter_mut().zip(other) {
         *left = left.intersect(right);
+        self.len += left.len() as u64;
     }
 }

 /// Returns the number of elements in the `BitSet`.
 #[inline]
 pub fn len(&self) -> usize {
-    self.tinysets
-        .iter()
-        .copied()
-        .map(|tinyset| tinyset.len())
-        .sum::<u32>() as usize
+    self.len as usize
 }

 /// Inserts an element in the `BitSet`
@@ -266,7 +271,7 @@ impl BitSet {
     // we do not check saturated els.
     let higher = el / 64u32;
     let lower = el % 64u32;
-    self.tinysets[higher as usize].insert_mut(lower);
+    self.len += u64::from(self.tinysets[higher as usize].insert_mut(lower));
 }

 /// Removes an element from the `BitSet`
@@ -275,7 +280,7 @@ impl BitSet {
     // we do not check saturated els.
     let higher = el / 64u32;
     let lower = el % 64u32;
-    self.tinysets[higher as usize].remove_mut(lower);
+    self.len -= u64::from(self.tinysets[higher as usize].remove_mut(lower));
 }

 /// Returns true iff the element is in the `BitSet`.
@@ -297,9 +302,6 @@ impl BitSet {
         .map(|delta_bucket| bucket + delta_bucket as u32)
 }

-/// Returns the maximum number of elements in the bitset.
-///
-/// Warning: The largest element the bitset can contain is `max_value - 1`.
 #[inline]
 pub fn max_value(&self) -> u32 {
     self.max_value
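
The `len`-caching pattern in the hunks above can be illustrated standalone: the mutators report whether they actually changed the set, and the container maintains a running count instead of recomputing it by summing per-word popcounts on every `len()` call. A simplified sketch (this `CountedSet` is illustrative, not tantivy's `BitSet`):

```rust
/// A 64-bit word whose mutators report whether they changed anything,
/// letting a wrapper maintain `len` incrementally.
#[derive(Clone, Copy)]
struct Tiny(u64);

impl Tiny {
    /// Returns true if the bit was newly inserted.
    fn insert_mut(&mut self, b: u32) -> bool {
        let before = self.0;
        self.0 |= 1u64 << b;
        self.0 != before
    }
    /// Returns true if the bit was present and removed.
    fn remove_mut(&mut self, b: u32) -> bool {
        let before = self.0;
        self.0 &= !(1u64 << b);
        self.0 != before
    }
}

struct CountedSet {
    words: Vec<Tiny>,
    len: u64, // maintained incrementally; O(1) to read
}

impl CountedSet {
    fn with_max_value(max_value: u32) -> Self {
        let buckets = (max_value as usize + 63) / 64;
        CountedSet { words: vec![Tiny(0); buckets], len: 0 }
    }
    fn insert(&mut self, el: u32) {
        let (hi, lo) = (el / 64, el % 64);
        // Only count the insert if the bit actually flipped.
        self.len += u64::from(self.words[hi as usize].insert_mut(lo));
    }
    fn remove(&mut self, el: u32) {
        let (hi, lo) = (el / 64, el % 64);
        self.len -= u64::from(self.words[hi as usize].remove_mut(lo));
    }
    fn len(&self) -> usize {
        self.len as usize // no per-word popcount sum needed
    }
}

fn main() {
    let mut set = CountedSet::with_max_value(200);
    set.insert(3);
    set.insert(3); // duplicate insert does not inflate len
    set.insert(130);
    assert_eq!(set.len(), 2);
    set.remove(3);
    set.remove(3); // removing a missing element does not underflow
    assert_eq!(set.len(), 1);
}
```

The change-reporting mutators are what make the scheme safe: a blind `len += 1` on every insert would drift as soon as an element is inserted twice.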
@@ -70,7 +70,7 @@ impl Collector for StatsCollector {
|
||||
fn for_segment(
|
||||
&self,
|
||||
_segment_local_id: u32,
|
||||
segment_reader: &dyn SegmentReader,
|
||||
segment_reader: &SegmentReader,
|
||||
) -> tantivy::Result<StatsSegmentCollector> {
|
||||
let fast_field_reader = segment_reader.fast_fields().u64(&self.field)?;
|
||||
Ok(StatsSegmentCollector {
|
||||
|
||||
@@ -60,7 +60,7 @@ fn main() -> tantivy::Result<()> {
|
||||
let count_docs = searcher.search(&*query, &TopDocs::with_limit(4).order_by_score())?;
|
||||
assert_eq!(count_docs.len(), 1);
|
||||
for (_score, doc_address) in count_docs {
|
||||
let retrieved_doc = searcher.doc(doc_address)?;
|
||||
let retrieved_doc = searcher.doc::<TantivyDocument>(doc_address)?;
|
||||
assert!(retrieved_doc
|
||||
.get_first(occurred_at)
|
||||
.unwrap()
|
||||
|
||||
@@ -65,7 +65,7 @@ fn main() -> tantivy::Result<()> {
|
||||
);
|
||||
let top_docs_by_custom_score =
|
||||
// Call TopDocs with a custom tweak score
|
||||
TopDocs::with_limit(2).tweak_score(move |segment_reader: &dyn SegmentReader| {
|
||||
TopDocs::with_limit(2).tweak_score(move |segment_reader: &SegmentReader| {
|
||||
let ingredient_reader = segment_reader.facet_reader("ingredient").unwrap();
|
||||
let facet_dict = ingredient_reader.facet_dict();
|
||||
|
||||
@@ -91,7 +91,7 @@ fn main() -> tantivy::Result<()> {
|
||||
.iter()
|
||||
.map(|(_, doc_id)| {
|
||||
searcher
|
||||
.doc(*doc_id)
|
||||
.doc::<TantivyDocument>(*doc_id)
|
||||
.unwrap()
|
||||
.get_first(title)
|
||||
.and_then(|v| v.as_str().map(|el| el.to_string()))
|
||||
|
||||
@@ -91,10 +91,46 @@ fn main() -> tantivy::Result<()> {
|
||||
}
|
||||
}
|
||||
|
||||
// Some other powerful operations (especially `.seek`) may be useful to consume these
|
||||
// A `Term` is a text token associated with a field.
|
||||
// Let's go through all docs containing the term `title:the` and access their position
|
||||
let term_the = Term::from_field_text(title, "the");
|
||||
|
||||
// Some other powerful operations (especially `.skip_to`) may be useful to consume these
|
||||
// posting lists rapidly.
|
||||
// You can check for them in the [`DocSet`](https://docs.rs/tantivy/~0/tantivy/trait.DocSet.html) trait
|
||||
// and the [`Postings`](https://docs.rs/tantivy/~0/tantivy/trait.Postings.html) trait
|
||||
|
||||
// Also, for some VERY specific high performance use case like an OLAP analysis of logs,
|
||||
// you can get better performance by accessing directly the blocks of doc ids.
|
||||
for segment_reader in searcher.segment_readers() {
|
||||
// A segment contains different data structure.
|
||||
// Inverted index stands for the combination of
|
||||
// - the term dictionary
|
||||
// - the inverted lists associated with each terms and their positions
|
||||
let inverted_index = segment_reader.inverted_index(title)?;
|
||||
|
||||
// This segment posting object is like a cursor over the documents matching the term.
|
||||
// The `IndexRecordOption` arguments tells tantivy we will be interested in both term
|
||||
// frequencies and positions.
|
||||
//
|
||||
// If you don't need all this information, you may get better performance by decompressing
|
||||
// less information.
|
||||
if let Some(mut block_segment_postings) =
|
||||
inverted_index.read_block_postings(&term_the, IndexRecordOption::Basic)?
|
||||
{
|
||||
loop {
|
||||
let docs = block_segment_postings.docs();
|
||||
if docs.is_empty() {
|
||||
break;
|
||||
}
|
||||
// Once again these docs MAY contains deleted documents as well.
|
||||
let docs = block_segment_postings.docs();
|
||||
// Prints `Docs [0, 2].`
|
||||
println!("Docs {docs:?}");
|
||||
block_segment_postings.advance();
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
Ok(())
|
||||
}
|
||||
|
||||
@@ -67,7 +67,7 @@ fn main() -> Result<()> {
|
||||
let mut titles = top_docs
|
||||
.into_iter()
|
||||
.map(|(_score, doc_address)| {
|
||||
let doc = searcher.doc(doc_address)?;
|
||||
let doc = searcher.doc::<TantivyDocument>(doc_address)?;
|
||||
let title = doc
|
||||
.get_first(title)
|
||||
.and_then(|v| v.as_str())
|
||||
|
||||
@@ -55,7 +55,7 @@ fn main() -> tantivy::Result<()> {
     let snippet_generator = SnippetGenerator::create(&searcher, &*query, body)?;
 
     for (score, doc_address) in top_docs {
-        let doc = searcher.doc(doc_address)?;
+        let doc = searcher.doc::<TantivyDocument>(doc_address)?;
         let snippet = snippet_generator.snippet_from_doc(&doc);
         println!("Document score {score}:");
         println!("title: {}", doc.get_first(title).unwrap().as_str().unwrap());
@@ -43,7 +43,7 @@ impl DynamicPriceColumn {
         }
     }
 
-    pub fn price_for_segment(&self, segment_reader: &dyn SegmentReader) -> Option<Arc<Vec<Price>>> {
+    pub fn price_for_segment(&self, segment_reader: &SegmentReader) -> Option<Arc<Vec<Price>>> {
         let segment_key = (segment_reader.segment_id(), segment_reader.delete_opstamp());
         self.price_cache.read().unwrap().get(&segment_key).cloned()
     }
@@ -157,7 +157,7 @@ fn main() -> tantivy::Result<()> {
     let query = query_parser.parse_query("cooking")?;
 
     let searcher = reader.searcher();
-    let score_by_price = move |segment_reader: &dyn SegmentReader| {
+    let score_by_price = move |segment_reader: &SegmentReader| {
         let price = price_dynamic_column
             .price_for_segment(segment_reader)
             .unwrap();
@@ -57,7 +57,7 @@ pub(crate) fn get_numeric_or_date_column_types() -> &'static [ColumnType] {
 
 /// Get fast field reader or empty as default.
 pub(crate) fn get_ff_reader(
-    reader: &dyn SegmentReader,
+    reader: &SegmentReader,
     field_name: &str,
     allowed_column_types: Option<&[ColumnType]>,
 ) -> crate::Result<(columnar::Column<u64>, ColumnType)> {
@@ -74,7 +74,7 @@ pub(crate) fn get_ff_reader(
 }
 
 pub(crate) fn get_dynamic_columns(
-    reader: &dyn SegmentReader,
+    reader: &SegmentReader,
     field_name: &str,
 ) -> crate::Result<Vec<columnar::DynamicColumn>> {
     let ff_fields = reader.fast_fields().dynamic_column_handles(field_name)?;
@@ -90,7 +90,7 @@ pub(crate) fn get_dynamic_columns(
 ///
 /// Is guaranteed to return at least one column.
 pub(crate) fn get_all_ff_reader_or_empty(
-    reader: &dyn SegmentReader,
+    reader: &SegmentReader,
     field_name: &str,
     allowed_column_types: Option<&[ColumnType]>,
     fallback_type: ColumnType,
@@ -520,7 +520,7 @@ impl AggKind {
 /// Build AggregationsData by walking the request tree.
 pub(crate) fn build_aggregations_data_from_req(
     aggs: &Aggregations,
-    reader: &dyn SegmentReader,
+    reader: &SegmentReader,
     segment_ordinal: SegmentOrdinal,
     context: AggContextParams,
 ) -> crate::Result<AggregationsSegmentCtx> {
@@ -540,7 +540,7 @@ pub(crate) fn build_aggregations_data_from_req(
 fn build_nodes(
     agg_name: &str,
     req: &Aggregation,
-    reader: &dyn SegmentReader,
+    reader: &SegmentReader,
     segment_ordinal: SegmentOrdinal,
     data: &mut AggregationsSegmentCtx,
     is_top_level: bool,
@@ -787,7 +787,7 @@ fn build_nodes(
     let idx_in_req_data = data.push_filter_req_data(FilterAggReqData {
         name: agg_name.to_string(),
         req: filter_req.clone(),
-        segment_reader: reader.clone_arc(),
+        segment_reader: reader.clone(),
         evaluator,
         matching_docs_buffer,
         is_top_level,
@@ -804,7 +804,7 @@ fn build_nodes(
 
 fn build_composite_node(
     agg_name: &str,
-    reader: &dyn SegmentReader,
+    reader: &SegmentReader,
     _segment_ordinal: SegmentOrdinal,
     data: &mut AggregationsSegmentCtx,
     sub_aggs: &Aggregations,
@@ -833,7 +833,7 @@ fn build_composite_node(
 
 fn build_children(
     aggs: &Aggregations,
-    reader: &dyn SegmentReader,
+    reader: &SegmentReader,
     segment_ordinal: SegmentOrdinal,
     data: &mut AggregationsSegmentCtx,
 ) -> crate::Result<Vec<AggRefNode>> {
@@ -852,7 +852,7 @@ fn build_children(
 }
 
 fn get_term_agg_accessors(
-    reader: &dyn SegmentReader,
+    reader: &SegmentReader,
     field_name: &str,
     missing: &Option<Key>,
 ) -> crate::Result<Vec<(Column<u64>, ColumnType)>> {
@@ -905,7 +905,7 @@ fn build_terms_or_cardinality_nodes(
     agg_name: &str,
     field_name: &str,
     missing: &Option<Key>,
-    reader: &dyn SegmentReader,
+    reader: &SegmentReader,
     segment_ordinal: SegmentOrdinal,
     data: &mut AggregationsSegmentCtx,
     sub_aggs: &Aggregations,
@@ -75,7 +75,7 @@ impl CompositeSourceAccessors {
     ///
     /// Precomputes some values to make collection faster.
     pub fn build_for_source(
-        reader: &dyn SegmentReader,
+        reader: &SegmentReader,
         source: &CompositeAggregationSource,
         // First option is None when no after key was set in the query, the
         // second option is None when the after key was set but its value for
@@ -1,5 +1,4 @@
 use std::fmt::Debug;
-use std::sync::Arc;
 
 use common::BitSet;
 use serde::{Deserialize, Deserializer, Serialize, Serializer};
@@ -403,7 +402,7 @@ pub struct FilterAggReqData {
     /// The filter aggregation
     pub req: FilterAggregation,
     /// The segment reader
-    pub segment_reader: Arc<dyn SegmentReader>,
+    pub segment_reader: SegmentReader,
     /// Document evaluator for the filter query (precomputed BitSet)
     /// This is built once when the request data is created
     pub evaluator: DocumentQueryEvaluator,
@@ -417,7 +416,7 @@ impl FilterAggReqData {
     pub(crate) fn get_memory_consumption(&self) -> usize {
         // Estimate: name + segment reader reference + bitset + buffer capacity
         self.name.len()
-            + std::mem::size_of::<Arc<dyn SegmentReader>>()
+            + std::mem::size_of::<SegmentReader>()
             + self.evaluator.bitset.len() / 8 // BitSet memory (bits to bytes)
             + self.matching_docs_buffer.capacity() * std::mem::size_of::<DocId>()
             + std::mem::size_of::<bool>()
@@ -439,7 +438,7 @@ impl DocumentQueryEvaluator {
     pub(crate) fn new(
         query: Box<dyn Query>,
         schema: Schema,
-        segment_reader: &dyn SegmentReader,
+        segment_reader: &SegmentReader,
     ) -> crate::Result<Self> {
         let max_doc = segment_reader.max_doc();
@@ -807,13 +807,11 @@ impl<TermMap: TermAggregationMap, C: SubAggCache> SegmentAggregationCollector
 
         let req_data = &mut self.terms_req_data;
 
-        agg_data
-            .column_block_accessor
-            .fetch_block_with_missing_unique_per_doc(
-                docs,
-                &req_data.accessor,
-                req_data.missing_value_for_accessor,
-            );
+        agg_data.column_block_accessor.fetch_block_with_missing(
+            docs,
+            &req_data.accessor,
+            req_data.missing_value_for_accessor,
+        );
 
         if let Some(sub_agg) = &mut self.sub_agg {
             let term_buckets = &mut self.parent_buckets[parent_bucket_id as usize];
@@ -2349,7 +2347,7 @@ mod tests {
 
         // text field
         assert_eq!(res["my_texts"]["buckets"][0]["key"], "Hello Hello");
-        assert_eq!(res["my_texts"]["buckets"][0]["doc_count"], 4);
+        assert_eq!(res["my_texts"]["buckets"][0]["doc_count"], 5);
         assert_eq!(res["my_texts"]["buckets"][1]["key"], "Empty");
         assert_eq!(res["my_texts"]["buckets"][1]["doc_count"], 2);
         assert_eq!(
@@ -2358,7 +2356,7 @@ mod tests {
         );
         // text field with number as missing fallback
         assert_eq!(res["my_texts2"]["buckets"][0]["key"], "Hello Hello");
-        assert_eq!(res["my_texts2"]["buckets"][0]["doc_count"], 4);
+        assert_eq!(res["my_texts2"]["buckets"][0]["doc_count"], 5);
         assert_eq!(res["my_texts2"]["buckets"][1]["key"], 1337.0);
         assert_eq!(res["my_texts2"]["buckets"][1]["doc_count"], 2);
         assert_eq!(
@@ -2372,7 +2370,7 @@ mod tests {
         assert_eq!(res["my_ids"]["buckets"][0]["key"], 1337.0);
         assert_eq!(res["my_ids"]["buckets"][0]["doc_count"], 4);
         assert_eq!(res["my_ids"]["buckets"][1]["key"], 1.0);
-        assert_eq!(res["my_ids"]["buckets"][1]["doc_count"], 2);
+        assert_eq!(res["my_ids"]["buckets"][1]["doc_count"], 3);
         assert_eq!(res["my_ids"]["buckets"][2]["key"], serde_json::Value::Null);
 
         Ok(())
@@ -66,7 +66,7 @@ impl Collector for DistributedAggregationCollector {
     fn for_segment(
         &self,
         segment_local_id: crate::SegmentOrdinal,
-        reader: &dyn SegmentReader,
+        reader: &crate::SegmentReader,
     ) -> crate::Result<Self::Child> {
         AggregationSegmentCollector::from_agg_req_and_reader(
             &self.agg,
@@ -96,7 +96,7 @@ impl Collector for AggregationCollector {
     fn for_segment(
         &self,
         segment_local_id: crate::SegmentOrdinal,
-        reader: &dyn SegmentReader,
+        reader: &crate::SegmentReader,
     ) -> crate::Result<Self::Child> {
         AggregationSegmentCollector::from_agg_req_and_reader(
             &self.agg,
@@ -145,7 +145,7 @@ impl AggregationSegmentCollector {
     /// reader. Also includes validation, e.g. checking field types and existence.
     pub fn from_agg_req_and_reader(
         agg: &Aggregations,
-        reader: &dyn SegmentReader,
+        reader: &SegmentReader,
         segment_ordinal: SegmentOrdinal,
         context: &AggContextParams,
     ) -> crate::Result<Self> {
@@ -1,5 +1,6 @@
 use super::Collector;
 use crate::collector::SegmentCollector;
+use crate::query::Weight;
 use crate::{DocId, Score, SegmentOrdinal, SegmentReader};
 
 /// `CountCollector` collector only counts how many
@@ -43,7 +44,7 @@ impl Collector for Count {
     fn for_segment(
         &self,
         _: SegmentOrdinal,
-        _: &dyn SegmentReader,
+        _: &SegmentReader,
     ) -> crate::Result<SegmentCountCollector> {
         Ok(SegmentCountCollector::default())
     }
@@ -55,6 +56,15 @@ impl Collector for Count {
     fn merge_fruits(&self, segment_counts: Vec<usize>) -> crate::Result<usize> {
         Ok(segment_counts.into_iter().sum())
     }
+
+    fn collect_segment(
+        &self,
+        weight: &dyn Weight,
+        _segment_ord: u32,
+        reader: &SegmentReader,
+    ) -> crate::Result<usize> {
+        Ok(weight.count(reader)? as usize)
+    }
 }
 
 #[derive(Default)]
@@ -1,7 +1,7 @@
 use std::collections::HashSet;
 
 use super::{Collector, SegmentCollector};
-use crate::{DocAddress, DocId, Score, SegmentReader};
+use crate::{DocAddress, DocId, Score};
 
 /// Collectors that returns the set of DocAddress that matches the query.
 ///
@@ -15,7 +15,7 @@ impl Collector for DocSetCollector {
     fn for_segment(
         &self,
         segment_local_id: crate::SegmentOrdinal,
-        _segment: &dyn SegmentReader,
+        _segment: &crate::SegmentReader,
     ) -> crate::Result<Self::Child> {
         Ok(DocSetChildCollector {
             segment_local_id,
@@ -265,7 +265,7 @@ impl Collector for FacetCollector {
     fn for_segment(
         &self,
         _: SegmentOrdinal,
-        reader: &dyn SegmentReader,
+        reader: &SegmentReader,
     ) -> crate::Result<FacetSegmentCollector> {
         let facet_reader = reader.facet_reader(&self.field_name)?;
         let facet_dict = facet_reader.facet_dict();
@@ -113,7 +113,7 @@ where
     fn for_segment(
         &self,
         segment_local_id: u32,
-        segment_reader: &dyn SegmentReader,
+        segment_reader: &SegmentReader,
     ) -> crate::Result<Self::Child> {
         let column_opt = segment_reader.fast_fields().column_opt(&self.field)?;
 
@@ -287,7 +287,7 @@ where
     fn for_segment(
         &self,
         segment_local_id: u32,
-        segment_reader: &dyn SegmentReader,
+        segment_reader: &SegmentReader,
     ) -> crate::Result<Self::Child> {
         let column_opt = segment_reader.fast_fields().bytes(&self.field)?;
@@ -6,7 +6,7 @@ use fastdivide::DividerU64;
 use crate::collector::{Collector, SegmentCollector};
 use crate::fastfield::{FastFieldNotAvailableError, FastValue};
 use crate::schema::Type;
-use crate::{DocId, Score, SegmentReader};
+use crate::{DocId, Score};
 
 /// Histogram builds an histogram of the values of a fastfield for the
 /// collected DocSet.
@@ -110,7 +110,7 @@ impl Collector for HistogramCollector {
     fn for_segment(
         &self,
         _segment_local_id: crate::SegmentOrdinal,
-        segment: &dyn SegmentReader,
+        segment: &crate::SegmentReader,
     ) -> crate::Result<Self::Child> {
         let column_opt = segment.fast_fields().u64_lenient(&self.field)?;
         let (column, _column_type) = column_opt.ok_or_else(|| FastFieldNotAvailableError {
@@ -156,7 +156,7 @@ pub trait Collector: Sync + Send {
     fn for_segment(
         &self,
         segment_local_id: SegmentOrdinal,
-        segment: &dyn SegmentReader,
+        segment: &SegmentReader,
     ) -> crate::Result<Self::Child>;
 
     /// Returns true iff the collector requires to compute scores for documents.
@@ -174,7 +174,7 @@ pub trait Collector: Sync + Send {
         &self,
         weight: &dyn Weight,
         segment_ord: u32,
-        reader: &dyn SegmentReader,
+        reader: &SegmentReader,
     ) -> crate::Result<<Self::Child as SegmentCollector>::Fruit> {
         let with_scoring = self.requires_scoring();
         let mut segment_collector = self.for_segment(segment_ord, reader)?;
@@ -186,7 +186,7 @@ pub trait Collector: Sync + Send {
 pub(crate) fn default_collect_segment_impl<TSegmentCollector: SegmentCollector>(
     segment_collector: &mut TSegmentCollector,
     weight: &dyn Weight,
-    reader: &dyn SegmentReader,
+    reader: &SegmentReader,
     with_scoring: bool,
 ) -> crate::Result<()> {
     match (reader.alive_bitset(), with_scoring) {
@@ -255,7 +255,7 @@ impl<TCollector: Collector> Collector for Option<TCollector> {
     fn for_segment(
         &self,
         segment_local_id: SegmentOrdinal,
-        segment: &dyn SegmentReader,
+        segment: &SegmentReader,
     ) -> crate::Result<Self::Child> {
         Ok(if let Some(inner) = self {
             let inner_segment_collector = inner.for_segment(segment_local_id, segment)?;
@@ -336,7 +336,7 @@ where
     fn for_segment(
         &self,
         segment_local_id: u32,
-        segment: &dyn SegmentReader,
+        segment: &SegmentReader,
     ) -> crate::Result<Self::Child> {
         let left = self.0.for_segment(segment_local_id, segment)?;
         let right = self.1.for_segment(segment_local_id, segment)?;
@@ -407,7 +407,7 @@ where
     fn for_segment(
         &self,
         segment_local_id: u32,
-        segment: &dyn SegmentReader,
+        segment: &SegmentReader,
     ) -> crate::Result<Self::Child> {
         let one = self.0.for_segment(segment_local_id, segment)?;
         let two = self.1.for_segment(segment_local_id, segment)?;
@@ -487,7 +487,7 @@ where
     fn for_segment(
         &self,
         segment_local_id: u32,
-        segment: &dyn SegmentReader,
+        segment: &SegmentReader,
     ) -> crate::Result<Self::Child> {
         let one = self.0.for_segment(segment_local_id, segment)?;
         let two = self.1.for_segment(segment_local_id, segment)?;
@@ -24,7 +24,7 @@ impl<TCollector: Collector> Collector for CollectorWrapper<TCollector> {
     fn for_segment(
         &self,
         segment_local_id: u32,
-        reader: &dyn SegmentReader,
+        reader: &SegmentReader,
     ) -> crate::Result<Box<dyn BoxableSegmentCollector>> {
         let child = self.0.for_segment(segment_local_id, reader)?;
         Ok(Box::new(SegmentCollectorWrapper(child)))
@@ -209,7 +209,7 @@ impl Collector for MultiCollector<'_> {
     fn for_segment(
         &self,
         segment_local_id: SegmentOrdinal,
-        segment: &dyn SegmentReader,
+        segment: &SegmentReader,
     ) -> crate::Result<MultiCollectorChild> {
         let children = self
             .collector_wrappers
@@ -5,7 +5,7 @@ use serde::{Deserialize, Serialize};
 
 use crate::collector::{SegmentSortKeyComputer, SortKeyComputer};
 use crate::schema::{OwnedValue, Schema};
-use crate::{DocId, Order, Score, SegmentReader};
+use crate::{DocId, Order, Score};
 
 fn compare_owned_value<const NULLS_FIRST: bool>(lhs: &OwnedValue, rhs: &OwnedValue) -> Ordering {
     match (lhs, rhs) {
@@ -430,7 +430,7 @@ where
 
     fn segment_sort_key_computer(
         &self,
-        segment_reader: &dyn SegmentReader,
+        segment_reader: &crate::SegmentReader,
     ) -> crate::Result<Self::Child> {
         let child = self.0.segment_sort_key_computer(segment_reader)?;
         Ok(SegmentSortKeyComputerWithComparator {
@@ -468,7 +468,7 @@ where
 
     fn segment_sort_key_computer(
         &self,
-        segment_reader: &dyn SegmentReader,
+        segment_reader: &crate::SegmentReader,
     ) -> crate::Result<Self::Child> {
         let child = self.0.segment_sort_key_computer(segment_reader)?;
         Ok(SegmentSortKeyComputerWithComparator {
@@ -32,7 +32,7 @@ impl SortKeyComputer for SortByBytes {
 
     fn segment_sort_key_computer(
         &self,
-        segment_reader: &dyn crate::SegmentReader,
+        segment_reader: &crate::SegmentReader,
     ) -> crate::Result<Self::Child> {
         let bytes_column_opt = segment_reader.fast_fields().bytes(&self.column_name)?;
         Ok(ByBytesColumnSegmentSortKeyComputer { bytes_column_opt })
@@ -6,7 +6,7 @@ use crate::collector::sort_key::{
 use crate::collector::{SegmentSortKeyComputer, SortKeyComputer};
 use crate::fastfield::FastFieldNotAvailableError;
 use crate::schema::OwnedValue;
-use crate::{DateTime, DocId, Score, SegmentReader};
+use crate::{DateTime, DocId, Score};
 
 /// Sort by the boxed / OwnedValue representation of either a fast field, or of the score.
 ///
@@ -86,7 +86,7 @@ impl SortKeyComputer for SortByErasedType {
 
     fn segment_sort_key_computer(
         &self,
-        segment_reader: &dyn SegmentReader,
+        segment_reader: &crate::SegmentReader,
     ) -> crate::Result<Self::Child> {
         let inner: Box<dyn ErasedSegmentSortKeyComputer> = match self {
             Self::Field(column_name) => {
@@ -1,6 +1,6 @@
 use crate::collector::sort_key::NaturalComparator;
 use crate::collector::{SegmentSortKeyComputer, SortKeyComputer, TopNComputer};
-use crate::{DocAddress, DocId, Score, SegmentReader};
+use crate::{DocAddress, DocId, Score};
 
 /// Sort by similarity score.
 #[derive(Clone, Debug, Copy)]
@@ -19,7 +19,7 @@ impl SortKeyComputer for SortBySimilarityScore {
 
     fn segment_sort_key_computer(
         &self,
-        _segment_reader: &dyn SegmentReader,
+        _segment_reader: &crate::SegmentReader,
     ) -> crate::Result<Self::Child> {
         Ok(SortBySimilarityScore)
     }
@@ -29,7 +29,7 @@ impl SortKeyComputer for SortBySimilarityScore {
         &self,
         k: usize,
         weight: &dyn crate::query::Weight,
-        reader: &dyn SegmentReader,
+        reader: &crate::SegmentReader,
         segment_ord: u32,
     ) -> crate::Result<Vec<(Self::SortKey, DocAddress)>> {
         let mut top_n: TopNComputer<Score, DocId, Self::Comparator> =
@@ -61,7 +61,7 @@ impl<T: FastValue> SortKeyComputer for SortByStaticFastValue<T> {
 
     fn segment_sort_key_computer(
         &self,
-        segment_reader: &dyn SegmentReader,
+        segment_reader: &SegmentReader,
     ) -> crate::Result<Self::Child> {
         let sort_column_opt = segment_reader.fast_fields().u64_lenient(&self.field)?;
         let (sort_column, _sort_column_type) =
@@ -3,7 +3,7 @@ use columnar::StrColumn;
 use crate::collector::sort_key::NaturalComparator;
 use crate::collector::{SegmentSortKeyComputer, SortKeyComputer};
 use crate::termdict::TermOrdinal;
-use crate::{DocId, Score, SegmentReader};
+use crate::{DocId, Score};
 
 /// Sort by the first value of a string column.
 ///
@@ -35,7 +35,7 @@ impl SortKeyComputer for SortByString {
 
     fn segment_sort_key_computer(
         &self,
-        segment_reader: &dyn SegmentReader,
+        segment_reader: &crate::SegmentReader,
     ) -> crate::Result<Self::Child> {
         let str_column_opt = segment_reader.fast_fields().str(&self.column_name)?;
         Ok(ByStringColumnSegmentSortKeyComputer { str_column_opt })
@@ -119,7 +119,7 @@ pub trait SortKeyComputer: Sync {
         &self,
         k: usize,
         weight: &dyn crate::query::Weight,
-        reader: &dyn SegmentReader,
+        reader: &crate::SegmentReader,
         segment_ord: u32,
     ) -> crate::Result<Vec<(Self::SortKey, DocAddress)>> {
         let with_scoring = self.requires_scoring();
@@ -135,7 +135,7 @@ pub trait SortKeyComputer: Sync {
     }
 
     /// Builds a child sort key computer for a specific segment.
-    fn segment_sort_key_computer(&self, segment_reader: &dyn SegmentReader) -> Result<Self::Child>;
+    fn segment_sort_key_computer(&self, segment_reader: &SegmentReader) -> Result<Self::Child>;
 }
 
 impl<HeadSortKeyComputer, TailSortKeyComputer> SortKeyComputer
@@ -156,7 +156,7 @@ where
         (self.0.comparator(), self.1.comparator())
     }
 
-    fn segment_sort_key_computer(&self, segment_reader: &dyn SegmentReader) -> Result<Self::Child> {
+    fn segment_sort_key_computer(&self, segment_reader: &SegmentReader) -> Result<Self::Child> {
         Ok((
             self.0.segment_sort_key_computer(segment_reader)?,
             self.1.segment_sort_key_computer(segment_reader)?,
@@ -357,7 +357,7 @@ where
         )
     }
 
-    fn segment_sort_key_computer(&self, segment_reader: &dyn SegmentReader) -> Result<Self::Child> {
+    fn segment_sort_key_computer(&self, segment_reader: &SegmentReader) -> Result<Self::Child> {
         let sort_key_computer1 = self.0.segment_sort_key_computer(segment_reader)?;
         let sort_key_computer2 = self.1.segment_sort_key_computer(segment_reader)?;
         let sort_key_computer3 = self.2.segment_sort_key_computer(segment_reader)?;
@@ -420,7 +420,7 @@ where
         SortKeyComputer4::Comparator,
     );
 
-    fn segment_sort_key_computer(&self, segment_reader: &dyn SegmentReader) -> Result<Self::Child> {
+    fn segment_sort_key_computer(&self, segment_reader: &SegmentReader) -> Result<Self::Child> {
         let sort_key_computer1 = self.0.segment_sort_key_computer(segment_reader)?;
         let sort_key_computer2 = self.1.segment_sort_key_computer(segment_reader)?;
         let sort_key_computer3 = self.2.segment_sort_key_computer(segment_reader)?;
@@ -454,7 +454,7 @@ where
 
 impl<F, SegmentF, TSortKey> SortKeyComputer for F
 where
-    F: 'static + Send + Sync + Fn(&dyn SegmentReader) -> SegmentF,
+    F: 'static + Send + Sync + Fn(&SegmentReader) -> SegmentF,
     SegmentF: 'static + FnMut(DocId) -> TSortKey,
     TSortKey: 'static + PartialOrd + Clone + Send + Sync + std::fmt::Debug,
 {
@@ -462,7 +462,7 @@ where
     type Child = SegmentF;
     type Comparator = NaturalComparator;
 
-    fn segment_sort_key_computer(&self, segment_reader: &dyn SegmentReader) -> Result<Self::Child> {
+    fn segment_sort_key_computer(&self, segment_reader: &SegmentReader) -> Result<Self::Child> {
         Ok((self)(segment_reader))
     }
 }
@@ -509,10 +509,10 @@ mod tests {
 
     #[test]
    fn test_lazy_score_computer() {
-        let score_computer_primary = |_segment_reader: &dyn SegmentReader| |_doc: DocId| 200u32;
+        let score_computer_primary = |_segment_reader: &SegmentReader| |_doc: DocId| 200u32;
         let call_count = Arc::new(AtomicUsize::new(0));
         let call_count_clone = call_count.clone();
-        let score_computer_secondary = move |_segment_reader: &dyn SegmentReader| {
+        let score_computer_secondary = move |_segment_reader: &SegmentReader| {
             let call_count_new_clone = call_count_clone.clone();
             move |_doc: DocId| {
                 call_count_new_clone.fetch_add(1, AtomicOrdering::SeqCst);
@@ -572,10 +572,10 @@ mod tests {
 
     #[test]
     fn test_lazy_score_computer_dynamic_ordering() {
-        let score_computer_primary = |_segment_reader: &dyn SegmentReader| |_doc: DocId| 200u32;
+        let score_computer_primary = |_segment_reader: &SegmentReader| |_doc: DocId| 200u32;
         let call_count = Arc::new(AtomicUsize::new(0));
         let call_count_clone = call_count.clone();
-        let score_computer_secondary = move |_segment_reader: &dyn SegmentReader| {
+        let score_computer_secondary = move |_segment_reader: &SegmentReader| {
             let call_count_new_clone = call_count_clone.clone();
             move |_doc: DocId| {
                 call_count_new_clone.fetch_add(1, AtomicOrdering::SeqCst);
@@ -32,11 +32,7 @@ where TSortKeyComputer: SortKeyComputer + Send + Sync + 'static
         self.sort_key_computer.check_schema(schema)
     }
 
-    fn for_segment(
-        &self,
-        segment_ord: u32,
-        segment_reader: &dyn SegmentReader,
-    ) -> Result<Self::Child> {
+    fn for_segment(&self, segment_ord: u32, segment_reader: &SegmentReader) -> Result<Self::Child> {
         let segment_sort_key_computer = self
             .sort_key_computer
             .segment_sort_key_computer(segment_reader)?;
@@ -67,7 +63,7 @@ where TSortKeyComputer: SortKeyComputer + Send + Sync + 'static
         &self,
         weight: &dyn Weight,
         segment_ord: u32,
-        reader: &dyn SegmentReader,
+        reader: &SegmentReader,
     ) -> crate::Result<Vec<(TSortKeyComputer::SortKey, DocAddress)>> {
         let k = self.doc_range.end;
         let docs = self
@@ -5,7 +5,7 @@ use crate::query::{AllQuery, QueryParser};
 use crate::schema::{Schema, FAST, TEXT};
 use crate::time::format_description::well_known::Rfc3339;
 use crate::time::OffsetDateTime;
-use crate::{DateTime, DocAddress, Index, Searcher, SegmentReader, TantivyDocument};
+use crate::{DateTime, DocAddress, Index, Searcher, TantivyDocument};
 
 pub const TEST_COLLECTOR_WITH_SCORE: TestCollector = TestCollector {
     compute_score: true,
@@ -109,7 +109,7 @@ impl Collector for TestCollector {
     fn for_segment(
         &self,
         segment_id: SegmentOrdinal,
-        _reader: &dyn SegmentReader,
+        _reader: &SegmentReader,
     ) -> crate::Result<TestSegmentCollector> {
         Ok(TestSegmentCollector {
             segment_id,
@@ -180,7 +180,7 @@ impl Collector for FastFieldTestCollector {
     fn for_segment(
         &self,
         _: SegmentOrdinal,
-        segment_reader: &dyn SegmentReader,
+        segment_reader: &SegmentReader,
     ) -> crate::Result<FastFieldSegmentCollector> {
         let reader = segment_reader
             .fast_fields()
@@ -243,7 +243,7 @@ impl Collector for BytesFastFieldTestCollector {
     fn for_segment(
         &self,
         _segment_local_id: u32,
-        segment_reader: &dyn SegmentReader,
+        segment_reader: &SegmentReader,
     ) -> crate::Result<BytesFastFieldSegmentCollector> {
         let column_opt = segment_reader.fast_fields().bytes(&self.field)?;
         Ok(BytesFastFieldSegmentCollector {
@@ -393,7 +393,7 @@ impl TopDocs {
     /// // This is where we build our collector with our custom score.
     /// let top_docs_by_custom_score = TopDocs
     /// ::with_limit(10)
-    ///     .tweak_score(move |segment_reader: &dyn SegmentReader| {
+    ///     .tweak_score(move |segment_reader: &SegmentReader| {
     ///         // The argument is a function that returns our scoring
     ///         // function.
     ///         //
@@ -442,7 +442,7 @@ pub struct TweakScoreFn<F>(F);
 
 impl<F, TTweakScoreSortKeyFn, TSortKey> SortKeyComputer for TweakScoreFn<F>
 where
-    F: 'static + Send + Sync + Fn(&dyn SegmentReader) -> TTweakScoreSortKeyFn,
+    F: 'static + Send + Sync + Fn(&SegmentReader) -> TTweakScoreSortKeyFn,
     TTweakScoreSortKeyFn: 'static + Fn(DocId, Score) -> TSortKey,
     TweakScoreSegmentSortKeyComputer<TTweakScoreSortKeyFn>:
         SegmentSortKeyComputer<SortKey = TSortKey, SegmentSortKey = TSortKey>,
@@ -458,7 +458,7 @@ where
 
     fn segment_sort_key_computer(
         &self,
-        segment_reader: &dyn SegmentReader,
+        segment_reader: &SegmentReader,
     ) -> crate::Result<Self::Child> {
         Ok({
             TweakScoreSegmentSortKeyComputer {
@@ -1525,7 +1525,7 @@ mod tests {
         let text_query = query_parser.parse_query("droopy tax")?;
         let collector = TopDocs::with_limit(2)
             .and_offset(1)
-            .order_by(move |_segment_reader: &dyn SegmentReader| move |doc: DocId| doc);
+            .order_by(move |_segment_reader: &SegmentReader| move |doc: DocId| doc);
         let score_docs: Vec<(u32, DocAddress)> =
             index.reader()?.searcher().search(&text_query, &collector)?;
         assert_eq!(
@@ -1543,7 +1543,7 @@ mod tests {
         let text_query = query_parser.parse_query("droopy tax").unwrap();
         let collector = TopDocs::with_limit(2)
             .and_offset(1)
-            .order_by(move |_segment_reader: &dyn SegmentReader| move |doc: DocId| doc);
+            .order_by(move |_segment_reader: &SegmentReader| move |doc: DocId| doc);
         let score_docs: Vec<(u32, DocAddress)> = index
             .reader()
             .unwrap()
@@ -4,7 +4,7 @@ use common::{replace_in_place, JsonPathWriter};
 use rustc_hash::FxHashMap;
 
 use crate::indexer::indexing_term::IndexingTerm;
-use crate::postings::{IndexingContext, IndexingPosition, PostingsWriter as _, PostingsWriterEnum};
+use crate::postings::{IndexingContext, IndexingPosition, PostingsWriter};
 use crate::schema::document::{ReferenceValue, ReferenceValueLeaf, Value};
 use crate::schema::{Type, DATE_TIME_PRECISION_INDEXED};
 use crate::time::format_description::well_known::Rfc3339;
@@ -80,7 +80,7 @@ fn index_json_object<'a, V: Value<'a>>(
     text_analyzer: &mut TextAnalyzer,
     term_buffer: &mut IndexingTerm,
     json_path_writer: &mut JsonPathWriter,
-    postings_writer: &mut PostingsWriterEnum,
+    postings_writer: &mut dyn PostingsWriter,
     ctx: &mut IndexingContext,
     positions_per_path: &mut IndexingPositionsPerPath,
 ) {
@@ -110,7 +110,7 @@ pub(crate) fn index_json_value<'a, V: Value<'a>>(
     text_analyzer: &mut TextAnalyzer,
     term_buffer: &mut IndexingTerm,
     json_path_writer: &mut JsonPathWriter,
-    postings_writer: &mut PostingsWriterEnum,
+    postings_writer: &mut dyn PostingsWriter,
     ctx: &mut IndexingContext,
     positions_per_path: &mut IndexingPositionsPerPath,
 ) {
@@ -8,7 +8,7 @@ use std::path::Path;
 use once_cell::sync::Lazy;
 
 pub use self::executor::Executor;
-pub use self::searcher::{Searcher, SearcherContext, SearcherGeneration};
+pub use self::searcher::{Searcher, SearcherGeneration};
 
 /// The meta file contains all the information about the list of segments and the schema
 /// of the index.
@@ -4,13 +4,13 @@ use std::{fmt, io};

use crate::collector::Collector;
use crate::core::Executor;
use crate::index::{Index, SegmentId, SegmentReader};
use crate::index::{SegmentId, SegmentReader};
use crate::query::{Bm25StatisticsProvider, EnableScoring, Query};
use crate::schema::{Field, FieldType, Schema, TantivyDocument, Term};
use crate::schema::document::DocumentDeserialize;
use crate::schema::{Schema, Term};
use crate::space_usage::SearcherSpaceUsage;
use crate::store::{CacheStats, StoreReader, DOCSTORE_CACHE_CAPACITY};
use crate::tokenizer::{TextAnalyzer, TokenizerManager};
use crate::{DocAddress, Inventory, Opstamp, TantivyError, TrackedObject};
use crate::store::{CacheStats, StoreReader};
use crate::{DocAddress, Index, Opstamp, TrackedObject};

/// Identifies the searcher generation accessed by a [`Searcher`].
///
@@ -36,7 +36,7 @@ pub struct SearcherGeneration {

impl SearcherGeneration {
pub(crate) fn from_segment_readers(
segment_readers: &[Arc<dyn SegmentReader>],
segment_readers: &[SegmentReader],
generation_id: u64,
) -> Self {
let mut segment_id_to_del_opstamp = BTreeMap::new();
@@ -61,103 +61,6 @@ impl SearcherGeneration {
}
}

/// Search-time context required by a [`Searcher`].
#[derive(Clone)]
pub struct SearcherContext {
schema: Schema,
executor: Executor,
tokenizers: TokenizerManager,
fast_field_tokenizers: TokenizerManager,
}

impl SearcherContext {
/// Creates a context from explicit search-time components.
pub fn new(
schema: Schema,
executor: Executor,
tokenizers: TokenizerManager,
fast_field_tokenizers: TokenizerManager,
) -> SearcherContext {
SearcherContext {
schema,
executor,
tokenizers,
fast_field_tokenizers,
}
}

/// Creates a context from an index.
pub fn from_index(index: &Index) -> SearcherContext {
SearcherContext::new(
index.schema(),
index.search_executor().clone(),
index.tokenizers().clone(),
index.fast_field_tokenizer().clone(),
)
}

/// Access the schema associated with this context.
pub fn schema(&self) -> &Schema {
&self.schema
}

/// Access the executor associated with this context.
pub fn search_executor(&self) -> &Executor {
&self.executor
}

/// Access the tokenizer manager associated with this context.
pub fn tokenizers(&self) -> &TokenizerManager {
&self.tokenizers
}

/// Access the fast field tokenizer manager associated with this context.
pub fn fast_field_tokenizer(&self) -> &TokenizerManager {
&self.fast_field_tokenizers
}

/// Get the tokenizer associated with a specific field.
pub fn tokenizer_for_field(&self, field: Field) -> crate::Result<TextAnalyzer> {
let field_entry = self.schema.get_field_entry(field);
let field_type = field_entry.field_type();
let indexing_options_opt = match field_type {
FieldType::JsonObject(options) => options.get_text_indexing_options(),
FieldType::Str(options) => options.get_indexing_options(),
_ => {
return Err(TantivyError::SchemaError(format!(
"{:?} is not a text field.",
field_entry.name()
)))
}
};
let indexing_options = indexing_options_opt.ok_or_else(|| {
TantivyError::InvalidArgument(format!(
"No indexing options set for field {field_entry:?}"
))
})?;

self.tokenizers
.get(indexing_options.tokenizer())
.ok_or_else(|| {
TantivyError::InvalidArgument(format!(
"No Tokenizer found for field {field_entry:?}"
))
})
}
}

impl From<&Index> for SearcherContext {
fn from(index: &Index) -> Self {
SearcherContext::from_index(index)
}
}

impl From<Index> for SearcherContext {
fn from(index: Index) -> Self {
SearcherContext::from(&index)
}
}

/// Holds a list of `SegmentReader`s ready for search.
///
/// It guarantees that the `Segment` will not be removed before
@@ -168,66 +71,9 @@ pub struct Searcher {
}

impl Searcher {
/// Creates a `Searcher` from an arbitrary list of segment readers.
///
/// This is useful when segment readers are not opened from
/// `IndexReader` / `meta.json` (e.g. external segment sources).
/// The generated [`SearcherGeneration`] uses `generation_id = 0`.
pub fn from_segment_readers<Ctx: Into<SearcherContext>>(
context: Ctx,
segment_readers: Vec<Arc<dyn SegmentReader>>,
) -> crate::Result<Searcher> {
Self::from_segment_readers_with_generation_id(context, segment_readers, 0)
}

/// Same as [`Searcher::from_segment_readers`] but allows setting
/// a custom generation id.
pub fn from_segment_readers_with_generation_id<Ctx: Into<SearcherContext>>(
context: Ctx,
segment_readers: Vec<Arc<dyn SegmentReader>>,
generation_id: u64,
) -> crate::Result<Searcher> {
let context = context.into();
let generation = SearcherGeneration::from_segment_readers(&segment_readers, generation_id);
let tracked_generation = Inventory::default().track(generation);
let inner = SearcherInner::new(
context,
segment_readers,
tracked_generation,
DOCSTORE_CACHE_CAPACITY,
)?;
Ok(Arc::new(inner).into())
}

/// Returns the search context associated with the `Searcher`.
pub fn context(&self) -> &SearcherContext {
&self.inner.context
}

/// Deprecated alias for [`Searcher::context`].
#[deprecated(note = "use Searcher::context()")]
pub fn index(&self) -> &SearcherContext {
self.context()
}

/// Access the search executor associated with this searcher.
pub fn search_executor(&self) -> &Executor {
self.context().search_executor()
}

/// Access the tokenizer manager associated with this searcher.
pub fn tokenizers(&self) -> &TokenizerManager {
self.context().tokenizers()
}

/// Access the fast field tokenizer manager associated with this searcher.
pub fn fast_field_tokenizer(&self) -> &TokenizerManager {
self.context().fast_field_tokenizer()
}

/// Get the tokenizer associated with a specific field.
pub fn tokenizer_for_field(&self, field: Field) -> crate::Result<TextAnalyzer> {
self.context().tokenizer_for_field(field)
/// Returns the `Index` associated with the `Searcher`
pub fn index(&self) -> &Index {
&self.inner.index
}

/// [`SearcherGeneration`] which identifies the version of the snapshot held by this `Searcher`.
@@ -239,7 +85,7 @@ impl Searcher {
///
/// The searcher uses the segment ordinal to route the
/// request to the right `Segment`.
pub fn doc(&self, doc_address: DocAddress) -> crate::Result<TantivyDocument> {
pub fn doc<D: DocumentDeserialize>(&self, doc_address: DocAddress) -> crate::Result<D> {
let store_reader = &self.inner.store_readers[doc_address.segment_ord as usize];
store_reader.get(doc_address.doc_id)
}
@@ -259,15 +105,18 @@ impl Searcher {

/// Fetches a document in an asynchronous manner.
#[cfg(feature = "quickwit")]
pub async fn doc_async(&self, doc_address: DocAddress) -> crate::Result<TantivyDocument> {
let executor = self.search_executor();
pub async fn doc_async<D: DocumentDeserialize>(
&self,
doc_address: DocAddress,
) -> crate::Result<D> {
let executor = self.inner.index.search_executor();
let store_reader = &self.inner.store_readers[doc_address.segment_ord as usize];
store_reader.get_async(doc_address.doc_id, executor).await
}

/// Access the schema associated with the index of this searcher.
pub fn schema(&self) -> &Schema {
self.context().schema()
&self.inner.schema
}

/// Returns the overall number of documents in the index.
@@ -305,13 +154,13 @@ impl Searcher {
}

/// Return the list of segment readers
pub fn segment_readers(&self) -> &[Arc<dyn SegmentReader>] {
pub fn segment_readers(&self) -> &[SegmentReader] {
&self.inner.segment_readers
}

/// Returns the segment_reader associated with the given segment_ord
pub fn segment_reader(&self, segment_ord: u32) -> &dyn SegmentReader {
self.inner.segment_readers[segment_ord as usize].as_ref()
pub fn segment_reader(&self, segment_ord: u32) -> &SegmentReader {
&self.inner.segment_readers[segment_ord as usize]
}

/// Runs a query on the segment readers wrapped by the searcher.
@@ -352,7 +201,7 @@ impl Searcher {
} else {
EnableScoring::disabled_from_searcher(self)
};
let executor = self.search_executor();
let executor = self.inner.index.search_executor();
self.search_with_executor(query, collector, executor, enabled_scoring)
}

@@ -380,11 +229,7 @@ impl Searcher {
let segment_readers = self.segment_readers();
let fruits = executor.map(
|(segment_ord, segment_reader)| {
collector.collect_segment(
weight.as_ref(),
segment_ord as u32,
segment_reader.as_ref(),
)
collector.collect_segment(weight.as_ref(), segment_ord as u32, segment_reader)
},
segment_readers.iter().enumerate(),
)?;
@@ -412,17 +257,19 @@ impl From<Arc<SearcherInner>> for Searcher {
/// It guarantees that the `Segment` will not be removed before
/// the destruction of the `Searcher`.
pub(crate) struct SearcherInner {
context: SearcherContext,
segment_readers: Vec<Arc<dyn SegmentReader>>,
store_readers: Vec<Box<dyn StoreReader>>,
schema: Schema,
index: Index,
segment_readers: Vec<SegmentReader>,
store_readers: Vec<StoreReader>,
generation: TrackedObject<SearcherGeneration>,
}

impl SearcherInner {
/// Creates a new `Searcher`
pub(crate) fn new(
context: SearcherContext,
segment_readers: Vec<Arc<dyn SegmentReader>>,
schema: Schema,
index: Index,
segment_readers: Vec<SegmentReader>,
generation: TrackedObject<SearcherGeneration>,
doc_store_cache_num_blocks: usize,
) -> io::Result<SearcherInner> {
@@ -434,13 +281,14 @@ impl SearcherInner {
generation.segments(),
"Set of segments referenced by this Searcher and its SearcherGeneration must match"
);
let store_readers: Vec<Box<dyn StoreReader>> = segment_readers
let store_readers: Vec<StoreReader> = segment_readers
.iter()
.map(|segment_reader| segment_reader.get_store_reader(doc_store_cache_num_blocks))
.collect::<io::Result<Vec<_>>>()?;

Ok(SearcherInner {
context,
schema,
index,
segment_readers,
store_readers,
generation,
@@ -453,7 +301,7 @@ impl fmt::Debug for Searcher {
let segment_ids = self
.segment_readers()
.iter()
.map(|segment_reader| segment_reader.segment_id())
.map(SegmentReader::segment_id)
.collect::<Vec<_>>();
write!(f, "Searcher({segment_ids:?})")
}

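The hunks above make document retrieval generic over `DocumentDeserialize` (`searcher.doc::<D>(addr)`) instead of hard-wiring `TantivyDocument`. A minimal std-only sketch of that API shape; every type and method here is a hypothetical stand-in, not tantivy's real trait (which deserializes from the doc store, not raw bytes):

```rust
// Hypothetical stand-in for tantivy's `DocumentDeserialize` trait.
trait DocumentDeserialize: Sized {
    fn deserialize(bytes: &[u8]) -> Result<Self, String>;
}

// Hypothetical stand-in for a segment's doc store.
struct StoreReader {
    docs: Vec<Vec<u8>>,
}

impl StoreReader {
    // Generic over the caller-chosen document type, mirroring the shape of
    // the new `Searcher::doc::<D>` in the diff.
    fn get<D: DocumentDeserialize>(&self, doc_id: usize) -> Result<D, String> {
        D::deserialize(&self.docs[doc_id])
    }
}

// One concrete document type the caller can pick.
struct Utf8Doc(String);

impl DocumentDeserialize for Utf8Doc {
    fn deserialize(bytes: &[u8]) -> Result<Self, String> {
        String::from_utf8(bytes.to_vec())
            .map(Utf8Doc)
            .map_err(|e| e.to_string())
    }
}
```

The caller picks the target type at the call site (`store.get::<Utf8Doc>(0)`), which is what lets the diff drop the hard dependency on `TantivyDocument` in `doc` and `doc_async`.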
@@ -7,8 +7,8 @@ use crate::query::TermQuery;
use crate::schema::{Field, IndexRecordOption, Schema, INDEXED, STRING, TEXT};
use crate::tokenizer::TokenizerManager;
use crate::{
Directory, DocSet, Executor, Index, IndexBuilder, IndexReader, IndexSettings, IndexWriter,
ReloadPolicy, Searcher, SearcherContext, TantivyDocument, Term,
Directory, DocSet, Index, IndexBuilder, IndexReader, IndexSettings, IndexWriter, ReloadPolicy,
TantivyDocument, Term,
};

#[test]
@@ -300,40 +300,6 @@ fn test_single_segment_index_writer() -> crate::Result<()> {
Ok(())
}

#[test]
fn test_searcher_from_external_segment_readers() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
let text_field = schema_builder.add_text_field("text", TEXT);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema.clone());
let mut writer: IndexWriter = index.writer_for_tests()?;
writer.add_document(doc!(text_field => "hello"))?;
writer.add_document(doc!(text_field => "hello"))?;
writer.commit()?;

let reader = index.reader()?;
let searcher = reader.searcher();
let segment_readers = searcher.segment_readers().to_vec();
let context = SearcherContext::new(
schema,
Executor::single_thread(),
TokenizerManager::default(),
TokenizerManager::default(),
);
let custom_searcher =
Searcher::from_segment_readers_with_generation_id(context, segment_readers, 42)?;

let term_query = TermQuery::new(
Term::from_field_text(text_field, "hello"),
IndexRecordOption::Basic,
);
let count = custom_searcher.search(&term_query, &Count)?;
assert_eq!(count, 2);
assert_eq!(custom_searcher.generation().generation_id(), 42);
assert_eq!(custom_searcher.segment_readers().len(), 1);
Ok(())
}

#[test]
fn test_merging_segment_update_docfreq() {
let mut schema_builder = Schema::builder();

@@ -167,9 +167,7 @@ impl CompositeFile {
.map(|byte_range| self.data.slice(byte_range.clone()))
}

/// Returns per-field byte usage for all slices stored in this composite file.
///
/// The provided `schema` is used to resolve field ids into field names.
/// Returns the space usage per field in this composite file.
pub fn space_usage(&self, schema: &Schema) -> PerFieldSpaceUsage {
let mut fields = Vec::new();
for (&field_addr, byte_range) in &self.offsets_index {

110
src/docset.rs
@@ -1,7 +1,6 @@
use std::borrow::BorrowMut;
use std::ops::{Deref as _, DerefMut as _};
use std::borrow::{Borrow, BorrowMut};

use common::BitSet;
use common::TinySet;

use crate::fastfield::AliveBitSet;
use crate::DocId;
@@ -17,6 +16,12 @@ pub const TERMINATED: DocId = i32::MAX as u32;
/// exactly this size as long as we can fill the buffer.
pub const COLLECT_BLOCK_BUFFER_LEN: usize = 64;

/// Number of `TinySet` (64-bit) buckets in a block used by [`DocSet::fill_bitset_block`].
pub const BLOCK_NUM_TINYBITSETS: usize = 16;

/// Number of doc IDs covered by one block: `BLOCK_NUM_TINYBITSETS * 64 = 1024`.
pub const BLOCK_WINDOW: u32 = BLOCK_NUM_TINYBITSETS as u32 * 64;

/// Represents an iterable set of sorted doc ids.
pub trait DocSet: Send {
/// Goes to the next element.
@@ -133,19 +138,6 @@ pub trait DocSet: Send {
buffer.len()
}

/// Fills the given bitset with the documents in the docset.
///
/// If the docset max_doc is smaller than the largest doc, this function might not consume the
/// docset entirely.
fn fill_bitset(&mut self, bitset: &mut BitSet) {
let bitset_max_value: u32 = bitset.max_value();
let mut doc = self.doc();
while doc < bitset_max_value {
bitset.insert(doc);
doc = self.advance();
}
}

/// Returns the current document
/// Right after creating a new `DocSet`, the docset points to the first document.
///
@@ -176,6 +168,31 @@ pub trait DocSet: Send {
self.size_hint() as u64
}

/// Fills a bitmask representing which documents in `[min_doc, min_doc + BLOCK_WINDOW)` are
/// present in this docset.
///
/// The window is divided into `BLOCK_NUM_TINYBITSETS` buckets of 64 docs each.
/// Returns the next doc `>= min_doc + BLOCK_WINDOW`, or `TERMINATED` if exhausted.
fn fill_bitset_block(
&mut self,
min_doc: DocId,
mask: &mut [TinySet; BLOCK_NUM_TINYBITSETS],
) -> DocId {
self.seek(min_doc);
let horizon = min_doc + BLOCK_WINDOW;
loop {
let doc = self.doc();
if doc >= horizon {
return doc;
}
let delta = doc - min_doc;
mask[(delta / 64) as usize].insert_mut(delta % 64);
if self.advance() == TERMINATED {
return TERMINATED;
}
}
}

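The `fill_bitset_block` default implementation above can be sketched standalone. Here a `u64` word stands in for tantivy's `TinySet` and a sorted slice stands in for the `DocSet`; both stand-ins are assumptions for illustration, not the real types:

```rust
/// Stand-ins for the constants introduced by the diff.
const BLOCK_NUM_TINYBITSETS: usize = 16;
const BLOCK_WINDOW: u32 = BLOCK_NUM_TINYBITSETS as u32 * 64; // 1024 doc ids

/// Records every doc of `sorted_docs` falling in `[min_doc, min_doc +
/// BLOCK_WINDOW)` into `mask`, one `u64` word per 64 docs (a `u64` stands in
/// for tantivy's `TinySet`). Returns the first doc at or past the window, or
/// `None` when the set is exhausted (tantivy returns `TERMINATED` instead).
fn fill_bitset_block(
    sorted_docs: &[u32],
    min_doc: u32,
    mask: &mut [u64; BLOCK_NUM_TINYBITSETS],
) -> Option<u32> {
    let horizon = min_doc + BLOCK_WINDOW;
    for &doc in sorted_docs.iter().filter(|&&doc| doc >= min_doc) {
        if doc >= horizon {
            // Hand the first out-of-window doc back to the caller, which can
            // use it as `min_doc` for the next block.
            return Some(doc);
        }
        let delta = doc - min_doc;
        // Same bucket/bit split as the diff: `delta / 64` selects the word,
        // `delta % 64` selects the bit inside it.
        mask[(delta / 64) as usize] |= 1u64 << (delta % 64);
    }
    None
}
```

The point of the fixed 1024-doc window is that a whole block of matches becomes sixteen word-sized ORs, which is cheap to intersect or union against another docset's block.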
/// Returns the number documents matching.
/// Calling this method consumes the `DocSet`.
fn count(&mut self, alive_bitset: &AliveBitSet) -> u32 {
@@ -230,6 +247,18 @@ impl DocSet for &mut dyn DocSet {
(**self).seek_danger(target)
}

fn fill_buffer(&mut self, buffer: &mut [DocId; COLLECT_BLOCK_BUFFER_LEN]) -> usize {
(**self).fill_buffer(buffer)
}

fn fill_bitset_block(
&mut self,
min_doc: DocId,
mask: &mut [TinySet; BLOCK_NUM_TINYBITSETS],
) -> DocId {
(**self).fill_bitset_block(min_doc, mask)
}

fn doc(&self) -> u32 {
(**self).doc()
}
@@ -249,59 +278,60 @@ impl DocSet for &mut dyn DocSet {
fn count_including_deleted(&mut self) -> u32 {
(**self).count_including_deleted()
}

fn fill_bitset(&mut self, bitset: &mut BitSet) {
(**self).fill_bitset(bitset);
}
}

impl<TDocSet: DocSet + ?Sized> DocSet for Box<TDocSet> {
#[inline]
fn advance(&mut self) -> DocId {
self.deref_mut().advance()
let unboxed: &mut TDocSet = self.borrow_mut();
unboxed.advance()
}

#[inline]
fn seek(&mut self, target: DocId) -> DocId {
self.deref_mut().seek(target)
let unboxed: &mut TDocSet = self.borrow_mut();
unboxed.seek(target)
}

#[inline]
fn seek_danger(&mut self, target: DocId) -> SeekDangerResult {
let unboxed: &mut TDocSet = self.borrow_mut();
unboxed.seek_danger(target)
}

#[inline]
fn fill_buffer(&mut self, buffer: &mut [DocId; COLLECT_BLOCK_BUFFER_LEN]) -> usize {
self.deref_mut().fill_buffer(buffer)
let unboxed: &mut TDocSet = self.borrow_mut();
unboxed.fill_buffer(buffer)
}

fn fill_bitset_block(
&mut self,
min_doc: DocId,
mask: &mut [TinySet; BLOCK_NUM_TINYBITSETS],
) -> DocId {
let unboxed: &mut TDocSet = self.borrow_mut();
unboxed.fill_bitset_block(min_doc, mask)
}

#[inline]
fn doc(&self) -> DocId {
self.deref().doc()
let unboxed: &TDocSet = self.borrow();
unboxed.doc()
}

#[inline]
fn size_hint(&self) -> u32 {
self.deref().size_hint()
let unboxed: &TDocSet = self.borrow();
unboxed.size_hint()
}

#[inline]
fn cost(&self) -> u64 {
self.deref().cost()
let unboxed: &TDocSet = self.borrow();
unboxed.cost()
}

#[inline]
fn count(&mut self, alive_bitset: &AliveBitSet) -> u32 {
self.deref_mut().count(alive_bitset)
let unboxed: &mut TDocSet = self.borrow_mut();
unboxed.count(alive_bitset)
}

fn count_including_deleted(&mut self) -> u32 {
self.deref_mut().count_including_deleted()
}

fn fill_bitset(&mut self, bitset: &mut BitSet) {
self.deref_mut().fill_bitset(bitset);
let unboxed: &mut TDocSet = self.borrow_mut();
unboxed.count_including_deleted()
}
}

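The `Box<TDocSet>` impl above swaps `deref_mut()` calls for explicit `Borrow`/`BorrowMut` unboxing. A minimal sketch of that delegation pattern, using a hypothetical `Counter` trait in place of `DocSet`:

```rust
use std::borrow::BorrowMut;

trait Counter {
    fn next_value(&mut self) -> u32;
}

struct Simple(u32);

impl Counter for Simple {
    fn next_value(&mut self) -> u32 {
        self.0 += 1;
        self.0
    }
}

// Forward the trait through Box the way the diff does for `DocSet`:
// `?Sized` keeps `Box<dyn Counter>` covered, and the explicit type
// annotation on `borrow_mut()` selects std's `BorrowMut<T> for Box<T>`
// impl to reach the inner value.
impl<T: Counter + ?Sized> Counter for Box<T> {
    fn next_value(&mut self) -> u32 {
        let unboxed: &mut T = self.borrow_mut();
        unboxed.next_value()
    }
}
```

With this impl, a `Box<dyn Counter>` can be used wherever a `Counter` is expected, and each call is forwarded to the boxed value exactly once (no recursion back into the `Box` impl).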
@@ -84,7 +84,9 @@ mod tests {
let mut facet = Facet::default();
facet_reader.facet_from_ord(0, &mut facet).unwrap();
assert_eq!(facet.to_path_string(), "/a/b");
let doc = searcher.doc(DocAddress::new(0u32, 0u32)).unwrap();
let doc = searcher
.doc::<TantivyDocument>(DocAddress::new(0u32, 0u32))
.unwrap();
let value = doc
.get_first(facet_field)
.and_then(|v| v.as_value().as_facet());
@@ -143,7 +145,7 @@ mod tests {
let mut facet_ords = Vec::new();
facet_ords.extend(facet_reader.facet_ords(0u32));
assert_eq!(&facet_ords, &[0u64]);
let doc = searcher.doc(DocAddress::new(0u32, 0u32))?;
let doc = searcher.doc::<TantivyDocument>(DocAddress::new(0u32, 0u32))?;
let value: Option<Facet> = doc
.get_first(facet_field)
.and_then(|v| v.as_facet())

@@ -96,7 +96,7 @@ mod tests {
};
use crate::time::OffsetDateTime;
use crate::tokenizer::{LowerCaser, RawTokenizer, TextAnalyzer, TokenizerManager};
use crate::{Index, IndexWriter};
use crate::{Index, IndexWriter, SegmentReader};

pub static SCHEMA: Lazy<Schema> = Lazy::new(|| {
let mut schema_builder = Schema::builder();
@@ -430,7 +430,7 @@ mod tests {
.searcher()
.segment_readers()
.iter()
.map(|segment_reader| segment_reader.segment_id())
.map(SegmentReader::segment_id)
.collect();
assert_eq!(segment_ids.len(), 2);
index_writer.merge(&segment_ids[..]).wait().unwrap();

@@ -25,8 +25,7 @@ pub struct FastFieldReaders {
}

impl FastFieldReaders {
/// Opens the segment fast-field container and binds it to a schema.
pub fn open(fast_field_file: FileSlice, schema: Schema) -> io::Result<FastFieldReaders> {
pub(crate) fn open(fast_field_file: FileSlice, schema: Schema) -> io::Result<FastFieldReaders> {
let columnar = Arc::new(ColumnarReader::open(fast_field_file)?);
Ok(FastFieldReaders { columnar, schema })
}
@@ -40,8 +39,7 @@ impl FastFieldReaders {
self.resolve_column_name_given_default_field(column_name, default_field_opt)
}

/// Returns per-field space usage for all loaded fast-field columns.
pub fn space_usage(&self) -> io::Result<PerFieldSpaceUsage> {
pub(crate) fn space_usage(&self) -> io::Result<PerFieldSpaceUsage> {
let mut per_field_usages: Vec<FieldUsage> = Default::default();
for (mut field_name, column_handle) in self.columnar.iter_columns()? {
json_path_sep_to_dot(&mut field_name);
@@ -53,8 +51,7 @@ impl FastFieldReaders {
Ok(PerFieldSpaceUsage::new(per_field_usages))
}

/// Returns the underlying `ColumnarReader`.
pub fn columnar(&self) -> &ColumnarReader {
pub(crate) fn columnar(&self) -> &ColumnarReader {
self.columnar.as_ref()
}

@@ -1,29 +0,0 @@
use std::borrow::Cow;

use serde::{Deserialize, Serialize};

const STANDARD_CODEC_ID: &str = "tantivy-default";

/// A Codec configuration is just a serializable object.
#[derive(Serialize, Deserialize, Clone, Debug)]
pub struct CodecConfiguration {
codec_id: Cow<'static, str>,
#[serde(default, skip_serializing_if = "serde_json::Value::is_null")]
props: serde_json::Value,
}

impl CodecConfiguration {
/// Returns true if the codec is the standard codec.
pub fn is_standard(&self) -> bool {
self.codec_id == STANDARD_CODEC_ID && self.props.is_null()
}
}

impl Default for CodecConfiguration {
fn default() -> Self {
CodecConfiguration {
codec_id: Cow::Borrowed(STANDARD_CODEC_ID),
props: serde_json::Value::Null,
}
}
}
@@ -3,19 +3,17 @@ use std::fmt;
#[cfg(feature = "mmap")]
use std::path::Path;
use std::path::PathBuf;
use std::sync::Arc;
use std::thread::available_parallelism;

use super::segment::Segment;
use super::segment_reader::merge_field_meta_data;
use super::{FieldMetadata, IndexSettings, TantivySegmentReader};
use super::{FieldMetadata, IndexSettings};
use crate::core::{Executor, META_FILEPATH};
use crate::directory::error::OpenReadError;
#[cfg(feature = "mmap")]
use crate::directory::MmapDirectory;
use crate::directory::{Directory, ManagedDirectory, RamDirectory, INDEX_WRITER_LOCK};
use crate::error::{DataCorruption, TantivyError};
use crate::index::codec_configuration::CodecConfiguration;
use crate::index::{IndexMeta, SegmentId, SegmentMeta, SegmentMetaInventory};
use crate::indexer::index_writer::{
IndexWriterOptions, MAX_NUM_THREAD, MEMORY_BUDGET_NUM_BYTES_MIN,
@@ -26,6 +24,7 @@ use crate::reader::{IndexReader, IndexReaderBuilder};
use crate::schema::document::Document;
use crate::schema::{Field, FieldType, Schema};
use crate::tokenizer::{TextAnalyzer, TokenizerManager};
use crate::SegmentReader;

fn load_metas(
directory: &dyn Directory,
@@ -60,7 +59,6 @@ fn save_new_metas(
schema: Schema,
index_settings: IndexSettings,
directory: &dyn Directory,
codec: CodecConfiguration,
) -> crate::Result<()> {
save_metas(
&IndexMeta {
@@ -69,7 +67,6 @@ fn save_new_metas(
schema,
opstamp: 0u64,
payload: None,
codec,
},
directory,
)?;
@@ -110,13 +107,11 @@ pub struct IndexBuilder {
tokenizer_manager: TokenizerManager,
fast_field_tokenizer_manager: TokenizerManager,
}

impl Default for IndexBuilder {
fn default() -> Self {
IndexBuilder::new()
}
}

impl IndexBuilder {
/// Creates a new `IndexBuilder`
pub fn new() -> Self {
@@ -249,31 +244,18 @@ impl IndexBuilder {
/// Creates a new index given an implementation of the trait `Directory`.
///
/// If a directory previously existed, it will be erased.
pub fn create<T: Into<Box<dyn Directory>>>(self, dir: T) -> crate::Result<Index> {
self.create_avoid_monomorphization(dir.into())
}

fn create_avoid_monomorphization(self, dir: Box<dyn Directory>) -> crate::Result<Index> {
fn create<T: Into<Box<dyn Directory>>>(self, dir: T) -> crate::Result<Index> {
self.validate()?;
let dir = dir.into();
let directory = ManagedDirectory::wrap(dir)?;
let codec = CodecConfiguration::default();
save_new_metas(
self.get_expect_schema()?,
self.index_settings.clone(),
&directory,
codec,
)?;
let schema = self.get_expect_schema()?;
let mut metas = IndexMeta {
index_settings: IndexSettings::default(),
segments: vec![],
schema,
opstamp: 0u64,
payload: None,
codec: CodecConfiguration::default(),
};
let mut metas = IndexMeta::with_schema(self.get_expect_schema()?);
metas.index_settings = self.index_settings;
let mut index = Index::open_from_metas(directory, &metas, SegmentMetaInventory::default())?;
let mut index = Index::open_from_metas(directory, &metas, SegmentMetaInventory::default());
index.set_tokenizers(self.tokenizer_manager);
index.set_fast_field_tokenizers(self.fast_field_tokenizer_manager);
Ok(index)
@@ -297,6 +279,41 @@ impl Index {
pub fn builder() -> IndexBuilder {
IndexBuilder::new()
}
/// Examines the directory to see if it contains an index.
///
/// Effectively, it only checks for the presence of the `meta.json` file.
pub fn exists(dir: &dyn Directory) -> Result<bool, OpenReadError> {
dir.exists(&META_FILEPATH)
}

/// Accessor to the search executor.
///
/// This pool is used by default when calling `searcher.search(...)`
/// to perform search on the individual segments.
///
/// By default the executor is single thread, and simply runs in the calling thread.
pub fn search_executor(&self) -> &Executor {
&self.executor
}

/// Replace the default single thread search executor pool
/// by a thread pool with a given number of threads.
pub fn set_multithread_executor(&mut self, num_threads: usize) -> crate::Result<()> {
self.executor = Executor::multi_thread(num_threads, "tantivy-search-")?;
Ok(())
}

/// Custom thread pool by a outer thread pool.
pub fn set_executor(&mut self, executor: Executor) {
self.executor = executor;
}

/// Replace the default single thread search executor pool
/// by a thread pool with as many threads as there are CPUs on the system.
pub fn set_default_multithread_executor(&mut self) -> crate::Result<()> {
let default_num_threads = available_parallelism()?.get();
self.set_multithread_executor(default_num_threads)
}

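`set_default_multithread_executor` above sizes the pool via `std::thread::available_parallelism`; that part can be exercised on its own (the helper name here is ours, not tantivy's):

```rust
use std::thread::available_parallelism;

/// Mirrors how the diff picks the default search-executor size:
/// one thread per logical CPU reported by the OS.
fn default_num_search_threads() -> std::io::Result<usize> {
    // `available_parallelism` returns a NonZeroUsize, so the result is >= 1.
    Ok(available_parallelism()?.get())
}
```

The call can fail (e.g. on platforms where the CPU count cannot be queried), which is why the method above propagates the error with `?` instead of assuming a count.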
/// Creates a new index using the [`RamDirectory`].
|
||||
///
|
||||
@@ -307,13 +324,6 @@ impl Index {
|
||||
IndexBuilder::new().schema(schema).create_in_ram().unwrap()
|
||||
}
|
||||
|
||||
/// Examines the directory to see if it contains an index.
|
||||
///
|
||||
/// Effectively, it only checks for the presence of the `meta.json` file.
|
||||
pub fn exists(directory: &dyn Directory) -> Result<bool, OpenReadError> {
|
||||
directory.exists(&META_FILEPATH)
|
||||
}
|
||||
|
||||
/// Creates a new index in a given filepath.
|
||||
/// The index will use the [`MmapDirectory`].
|
||||
///
|
||||
@@ -360,82 +370,20 @@ impl Index {
|
||||
schema: Schema,
|
||||
settings: IndexSettings,
|
||||
) -> crate::Result<Index> {
|
||||
Self::create_to_avoid_monomorphization(dir.into(), schema, settings)
|
 }
 
-    fn create_to_avoid_monomorphization(
-        dir: Box<dyn Directory>,
-        schema: Schema,
-        settings: IndexSettings,
-    ) -> crate::Result<Index> {
-        let dir: Box<dyn Directory> = dir.into();
-        let mut builder = IndexBuilder::new().schema(schema);
-        builder = builder.settings(settings);
-        builder.create(dir)
-    }
-
-    /// Opens a new directory from an index path.
-    #[cfg(feature = "mmap")]
-    pub fn open_in_dir<P: AsRef<Path>>(directory_path: P) -> crate::Result<Index> {
-        Self::open_in_dir_to_avoid_monomorphization(directory_path.as_ref())
-    }
-
-    #[cfg(feature = "mmap")]
-    #[inline(never)]
-    fn open_in_dir_to_avoid_monomorphization(directory_path: &Path) -> crate::Result<Index> {
-        let mmap_directory = MmapDirectory::open(directory_path)?;
-        Index::open(mmap_directory)
-    }
-
-    /// Open the index using the provided directory
-    pub fn open<T: Into<Box<dyn Directory>>>(directory: T) -> crate::Result<Index> {
-        Self::open_avoid_monomorphization(directory.into())
-    }
-
-    #[inline(never)]
-    fn open_avoid_monomorphization(directory: Box<dyn Directory>) -> crate::Result<Index> {
-        let directory = ManagedDirectory::wrap(directory)?;
-        let inventory = SegmentMetaInventory::default();
-        let metas = load_metas(&directory, &inventory)?;
-        Index::open_from_metas(directory, &metas, inventory)
-    }
-
     /// Accessor to the search executor.
     ///
     /// This pool is used by default when calling `searcher.search(...)`
     /// to perform search on the individual segments.
     ///
     /// By default the executor is single thread, and simply runs in the calling thread.
     pub fn search_executor(&self) -> &Executor {
         &self.executor
     }
 
     /// Replace the default single thread search executor pool
     /// by a thread pool with a given number of threads.
     pub fn set_multithread_executor(&mut self, num_threads: usize) -> crate::Result<()> {
         self.executor = Executor::multi_thread(num_threads, "tantivy-search-")?;
         Ok(())
     }
 
     /// Custom thread pool provided by an outer thread pool.
     pub fn set_executor(&mut self, executor: Executor) {
         self.executor = executor;
     }
 
     /// Replace the default single thread search executor pool
     /// by a thread pool with as many threads as there are CPUs on the system.
     pub fn set_default_multithread_executor(&mut self) -> crate::Result<()> {
         let default_num_threads = available_parallelism()?.get();
         self.set_multithread_executor(default_num_threads)
     }
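The `set_default_multithread_executor` method above sizes the search pool from `std::thread::available_parallelism`. A minimal stdlib-only sketch of that sizing rule; the fallback to `1` on error is an assumption of this sketch (tantivy propagates the error instead), and `default_num_search_threads` is an invented name:

```rust
use std::thread::available_parallelism;

/// Picks a pool size the way `set_default_multithread_executor` does:
/// one thread per CPU reported by the OS. Falling back to 1 when the
/// parallelism cannot be queried is this sketch's choice.
fn default_num_search_threads() -> usize {
    // `available_parallelism` returns io::Result<NonZeroUsize>,
    // so a successful answer is never zero.
    available_parallelism().map(|n| n.get()).unwrap_or(1)
}

fn main() {
    let n = default_num_search_threads();
    assert!(n >= 1);
    println!("would size the search pool with {n} threads");
}
```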
 
     /// Creates a new index given a directory and an [`IndexMeta`].
     fn open_from_metas(
         directory: ManagedDirectory,
         metas: &IndexMeta,
         inventory: SegmentMetaInventory,
-    ) -> crate::Result<Index> {
+    ) -> Index {
         let schema = metas.schema.clone();
-        Ok(Index {
+        Index {
             settings: metas.index_settings.clone(),
             directory,
             schema,
@@ -443,7 +391,7 @@ impl Index {
             fast_field_tokenizers: TokenizerManager::default(),
             executor: Executor::single_thread(),
             inventory,
-        })
+        }
     }
 
     /// Setter for the tokenizer manager.
@@ -511,6 +459,13 @@ impl Index {
         IndexReaderBuilder::new(self.clone())
     }
 
+    /// Opens a new directory from an index path.
+    #[cfg(feature = "mmap")]
+    pub fn open_in_dir<P: AsRef<Path>>(directory_path: P) -> crate::Result<Index> {
+        let mmap_directory = MmapDirectory::open(directory_path)?;
+        Index::open(mmap_directory)
+    }
+
     /// Returns the list of the segment metas tracked by the index.
     ///
     /// Such segments can of course be part of the index,
@@ -537,16 +492,7 @@ impl Index {
         let segments = self.searchable_segments()?;
         let fields_metadata: Vec<Vec<FieldMetadata>> = segments
             .into_iter()
-            .map(|segment| {
-                let reader = TantivySegmentReader::open_with_custom_alive_set_from_directory(
-                    segment.index().directory(),
-                    segment.meta(),
-                    segment.schema(),
-                    None,
-                )?;
-                let reader: Arc<dyn crate::index::SegmentReader> = Arc::new(reader);
-                reader.fields_metadata()
-            })
+            .map(|segment| SegmentReader::open(&segment)?.fields_metadata())
             .collect::<Result<_, _>>()?;
         Ok(merge_field_meta_data(fields_metadata))
     }
@@ -560,6 +506,16 @@ impl Index {
         self.inventory.new_segment_meta(segment_id, max_doc)
     }
 
+    /// Open the index using the provided directory
+    pub fn open<T: Into<Box<dyn Directory>>>(directory: T) -> crate::Result<Index> {
+        let directory = directory.into();
+        let directory = ManagedDirectory::wrap(directory)?;
+        let inventory = SegmentMetaInventory::default();
+        let metas = load_metas(&directory, &inventory)?;
+        let index = Index::open_from_metas(directory, &metas, inventory);
+        Ok(index)
+    }
+
     /// Reads the index meta file from the directory.
     pub fn load_metas(&self) -> crate::Result<IndexMeta> {
         load_metas(self.directory(), &self.inventory)
@@ -752,7 +708,7 @@ impl Index {
     }
 }
 
 impl fmt::Debug for Index {
-    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
+    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
         write!(f, "Index({:?})", self.directory)
     }
 }
 
@@ -5,7 +5,7 @@ use std::path::PathBuf;
 use serde::{Deserialize, Serialize};
 
 use super::SegmentComponent;
-use crate::index::{CodecConfiguration, SegmentId};
+use crate::index::SegmentId;
 use crate::schema::Schema;
 use crate::store::Compressor;
 use crate::{Inventory, Opstamp, TrackedObject};
@@ -286,10 +286,8 @@ pub struct IndexMeta {
     /// This payload is entirely unused by tantivy.
     #[serde(skip_serializing_if = "Option::is_none")]
     pub payload: Option<String>,
-    /// Codec configuration for the index.
-    #[serde(skip_serializing_if = "CodecConfiguration::is_standard")]
-    pub codec: CodecConfiguration,
 }
 
 #[derive(Deserialize, Debug)]
 struct UntrackedIndexMeta {
     pub segments: Vec<InnerSegmentMeta>,
@@ -299,8 +297,6 @@ struct UntrackedIndexMeta {
     pub opstamp: Opstamp,
     #[serde(skip_serializing_if = "Option::is_none")]
     pub payload: Option<String>,
-    #[serde(default)]
-    pub codec: CodecConfiguration,
 }
 
 impl UntrackedIndexMeta {
@@ -315,7 +311,6 @@ impl UntrackedIndexMeta {
             schema: self.schema,
             opstamp: self.opstamp,
             payload: self.payload,
-            codec: self.codec,
         }
     }
 }
@@ -333,7 +328,6 @@ impl IndexMeta {
             schema,
             opstamp: 0u64,
             payload: None,
-            codec: CodecConfiguration::default(),
         }
     }
 
@@ -384,38 +378,14 @@ mod tests {
             schema,
             opstamp: 0u64,
             payload: None,
-            codec: Default::default(),
         };
-        let json_value: serde_json::Value =
-            serde_json::to_value(&index_metas).expect("serialization failed");
+        let json = serde_json::ser::to_string(&index_metas).expect("serialization failed");
         assert_eq!(
-            &json_value,
-            &serde_json::json!(
-            {
-                "index_settings": {
-                    "docstore_compression": "none",
-                    "docstore_blocksize": 16384
-                },
-                "segments": [],
-                "schema": [
-                    {
-                        "name": "text",
-                        "type": "text",
-                        "options": {
-                            "indexing": {
-                                "record": "position",
-                                "fieldnorms": true,
-                                "tokenizer": "default"
-                            },
-                            "stored": false,
-                            "fast": false
-                        }
-                    }
-                ],
-                "opstamp": 0
-            })
+            json,
+            r#"{"index_settings":{"docstore_compression":"none","docstore_blocksize":16384},"segments":[],"schema":[{"name":"text","type":"text","options":{"indexing":{"record":"position","fieldnorms":true,"tokenizer":"default"},"stored":false,"fast":false}}],"opstamp":0}"#
         );
-        let deser_meta: UntrackedIndexMeta = serde_json::from_value(json_value).unwrap();
-
+        let deser_meta: UntrackedIndexMeta = serde_json::from_str(&json).unwrap();
         assert_eq!(index_metas.index_settings, deser_meta.index_settings);
         assert_eq!(index_metas.schema, deser_meta.schema);
         assert_eq!(index_metas.opstamp, deser_meta.opstamp);
@@ -441,39 +411,14 @@ mod tests {
             schema,
             opstamp: 0u64,
             payload: None,
-            codec: Default::default(),
         };
-        let json_value = serde_json::to_value(&index_metas).expect("serialization failed");
+        let json = serde_json::ser::to_string(&index_metas).expect("serialization failed");
         assert_eq!(
-            &json_value,
-            &serde_json::json!(
-            {
-                "index_settings": {
-                    "docstore_compression": "zstd(compression_level=4)",
-                    "docstore_blocksize": 1000000
-                },
-                "segments": [],
-                "schema": [
-                    {
-                        "name": "text",
-                        "type": "text",
-                        "options": {
-                            "indexing": {
-                                "record": "position",
-                                "fieldnorms": true,
-                                "tokenizer": "default"
-                            },
-                            "stored": false,
-                            "fast": false
-                        }
-                    }
-                ],
-                "opstamp": 0
-            }
-            )
+            json,
+            r#"{"index_settings":{"docstore_compression":"zstd(compression_level=4)","docstore_blocksize":1000000},"segments":[],"schema":[{"name":"text","type":"text","options":{"indexing":{"record":"position","fieldnorms":true,"tokenizer":"default"},"stored":false,"fast":false}}],"opstamp":0}"#
         );
 
-        let deser_meta: UntrackedIndexMeta = serde_json::from_value(json_value).unwrap();
+        let deser_meta: UntrackedIndexMeta = serde_json::from_str(&json).unwrap();
         assert_eq!(index_metas.index_settings, deser_meta.index_settings);
         assert_eq!(index_metas.schema, deser_meta.schema);
         assert_eq!(index_metas.opstamp, deser_meta.opstamp);
 
@@ -1,12 +1,7 @@
-use std::any::Any;
-#[cfg(feature = "quickwit")]
-use std::future::Future;
 use std::io;
-#[cfg(feature = "quickwit")]
-use std::pin::Pin;
 
 use common::json_path_writer::JSON_END_OF_PATH;
-use common::{BinarySerializable, BitSet, ByteCount, OwnedBytes};
+use common::{BinarySerializable, ByteCount};
 #[cfg(feature = "quickwit")]
 use futures_util::{FutureExt, StreamExt, TryStreamExt};
@@ -15,262 +10,37 @@ use itertools::Itertools;
 use tantivy_fst::automaton::{AlwaysMatch, Automaton};
 
 use crate::directory::FileSlice;
-use crate::docset::DocSet;
-use crate::postings::{
-    load_postings_from_raw_data, Postings, RawPostingsData, SegmentPostings, TermInfo,
-};
+use crate::positions::PositionReader;
+use crate::postings::{BlockSegmentPostings, SegmentPostings, TermInfo};
 use crate::schema::{IndexRecordOption, Term, Type};
 use crate::termdict::TermDictionary;
 
 #[cfg(feature = "quickwit")]
 pub type TermRangeBounds = (std::ops::Bound<Term>, std::ops::Bound<Term>);
 
-/// Trait defining the contract for a dynamically dispatched inverted index reader.
-pub trait DynInvertedIndexReader: Send + Sync {
-    /// Downcasts to the concrete reader type when possible.
-    fn as_any(&self) -> &dyn Any;
-
-    /// Returns the term info associated with the term.
-    fn get_term_info(&self, term: &Term) -> io::Result<Option<TermInfo>> {
-        self.terms().get(term.serialized_value_bytes())
-    }
-
-    /// Return the term dictionary datastructure.
-    fn terms(&self) -> &TermDictionary;
-
-    /// Return the fields and types encoded in the dictionary in lexicographic order.
-    /// Only valid on JSON fields.
-    ///
-    /// Notice: This requires a full scan and is therefore **very expensive**.
-    fn list_encoded_json_fields(&self) -> io::Result<Vec<InvertedIndexFieldSpace>>;
-
-    /// Returns the raw postings bytes and metadata for a term.
-    fn read_raw_postings_data(
-        &self,
-        term_info: &TermInfo,
-        option: IndexRecordOption,
-    ) -> io::Result<RawPostingsData>;
-
-    /// Returns the total number of tokens recorded for all documents
-    /// (including deleted documents).
-    fn total_num_tokens(&self) -> u64;
-
-    /// Returns the segment postings associated with the term, and with the given option,
-    /// or `None` if the term has never been encountered and indexed.
-    fn read_postings(
-        &self,
-        term: &Term,
-        option: IndexRecordOption,
-    ) -> io::Result<Option<Box<dyn Postings>>> {
-        self.get_term_info(term)?
-            .map(move |term_info| self.read_postings_from_terminfo(&term_info, option))
-            .transpose()
-    }
-
-    /// Returns the postings for a given `TermInfo`.
-    ///
-    /// The default implementation decodes via [`read_raw_postings_data`]. Custom readers
-    /// that cannot produce valid raw postings bytes (e.g. merged/union posting sources)
-    /// should override this method.
-    fn read_postings_from_terminfo(
-        &self,
-        term_info: &TermInfo,
-        option: IndexRecordOption,
-    ) -> io::Result<Box<dyn Postings>> {
-        let postings_data = self.read_raw_postings_data(term_info, option)?;
-        let postings = load_postings_from_raw_data(term_info.doc_freq, postings_data)?;
-        Ok(Box::new(postings))
-    }
-
-    /// Returns the number of documents containing the term.
-    fn doc_freq(&self, term: &Term) -> io::Result<u32>;
-
-    /// Returns the number of documents containing the term asynchronously.
-    #[cfg(feature = "quickwit")]
-    fn doc_freq_async<'a>(
-        &'a self,
-        term: &'a Term,
-    ) -> Pin<Box<dyn Future<Output = io::Result<u32>> + Send + 'a>>;
-
-    /// Warmup fieldnorm readers for this inverted index field.
-    #[cfg(feature = "quickwit")]
-    fn warm_fieldnorms_readers<'a>(
-        &'a self,
-    ) -> Pin<Box<dyn Future<Output = io::Result<()>> + Send + 'a>>;
-
-    /// Warmup the block postings for all terms.
-    ///
-    /// Default implementation is a no-op.
-    #[cfg(feature = "quickwit")]
-    fn warm_postings_full<'a>(
-        &'a self,
-        _with_positions: bool,
-    ) -> Pin<Box<dyn Future<Output = io::Result<()>> + Send + 'a>> {
-        Box::pin(async { Ok(()) })
-    }
-
-    /// Warmup a block postings given a `Term`.
-    ///
-    /// Returns whether the term was found in the dictionary.
-    #[cfg(feature = "quickwit")]
-    fn warm_postings<'a>(
-        &'a self,
-        term: &'a Term,
-        with_positions: bool,
-    ) -> Pin<Box<dyn Future<Output = io::Result<bool>> + Send + 'a>>;
-
-    /// Warmup block postings for terms in a range.
-    ///
-    /// Returns whether at least one matching term was found.
-    #[cfg(feature = "quickwit")]
-    fn warm_postings_range<'a>(
-        &'a self,
-        terms: TermRangeBounds,
-        limit: Option<u64>,
-        with_positions: bool,
-    ) -> Pin<Box<dyn Future<Output = io::Result<bool>> + Send + 'a>>;
-
-    /// Warmup block postings for terms matching an automaton.
-    ///
-    /// Returns whether at least one matching term was found.
-    #[cfg(feature = "quickwit")]
-    fn warm_postings_automaton<'a, A: Automaton + Clone + Send + Sync + 'static>(
-        &'a self,
-        automaton: A,
-    ) -> Pin<Box<dyn Future<Output = io::Result<bool>> + Send + 'a>>
-    where
-        A::State: Clone + Send,
-        Self: Sized;
-}
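The `DynInvertedIndexReader` trait above ships `read_postings_from_terminfo` as a provided method layered on the required `read_raw_postings_data`, so only readers that cannot produce raw postings bytes need to override it. A self-contained sketch of that provided-method pattern; `PostingsSource`, `RawData`, and `DecodedPostings` are toy stand-ins invented for this sketch, not tantivy types:

```rust
use std::io;

// Stand-ins for tantivy's RawPostingsData / Postings (hypothetical, simplified).
struct RawData(Vec<u32>);
struct DecodedPostings {
    doc_ids: Vec<u32>,
}

trait PostingsSource {
    // Required: produce the raw encoded payload for a term.
    fn read_raw(&self) -> io::Result<RawData>;

    // Provided: decode via the raw accessor. Readers that cannot expose
    // raw bytes (e.g. merged/union posting sources) override this instead.
    fn read_postings(&self) -> io::Result<DecodedPostings> {
        let raw = self.read_raw()?;
        Ok(DecodedPostings { doc_ids: raw.0 })
    }
}

struct SimpleSource;

impl PostingsSource for SimpleSource {
    fn read_raw(&self) -> io::Result<RawData> {
        Ok(RawData(vec![1, 5, 9]))
    }
}

fn main() {
    // The default decode path kicks in without SimpleSource defining it.
    let postings = SimpleSource.read_postings().unwrap();
    assert_eq!(postings.doc_ids, vec![1, 5, 9]);
}
```

The design choice mirrored here: the trait stays object-safe for the common decode path while leaving one narrow hook for exotic readers.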
 
-/// Trait defining the contract for a typed inverted index reader.
-pub trait InvertedIndexReader: Send + Sync {
-    /// The concrete postings type returned by this reader.
-    type Postings: Postings;
-
-    /// A lighter doc-id-only iterator returned when frequencies and positions are not needed.
-    type DocSet: DocSet;
-
-    /// Returns a posting object given a `term_info`.
-    fn read_postings_from_terminfo(
-        &self,
-        term_info: &TermInfo,
-        option: IndexRecordOption,
-    ) -> io::Result<Self::Postings>;
-
-    /// Returns a doc-id-only iterator for the given term.
-    ///
-    /// Always reads with `IndexRecordOption::Basic`: no frequency decoding,
-    /// no position reader.
-    fn read_docset_from_terminfo(&self, term_info: &TermInfo) -> io::Result<Self::DocSet>;
-
-    /// Fills a bitset with the doc ids for the given term.
-    fn fill_bitset_from_terminfo(
-        &self,
-        term_info: &TermInfo,
-        doc_bitset: &mut BitSet,
-    ) -> io::Result<()> {
-        let mut docset = self.read_docset_from_terminfo(term_info)?;
-        docset.fill_bitset(doc_bitset);
-        Ok(())
-    }
-}
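The typed `InvertedIndexReader` trait above uses associated `Postings` and `DocSet` types so a concrete reader returns concrete, unboxed iterators, with the doc-id-only variant skipping frequency and position work. A toy illustration of the same associated-type shape; `TypedReader`, `VecReader`, and the `(doc_id, term_freq)` encoding are invented for this sketch:

```rust
// Sketch: the trait exposes associated types so each implementation picks
// its own concrete iterator types, and callers get static dispatch.
trait TypedReader {
    type Postings: Iterator<Item = (u32, u32)>; // (doc_id, term_freq)
    type DocSet: Iterator<Item = u32>; // doc ids only, no freq decoding

    fn postings(&self) -> Self::Postings;
    fn docset(&self) -> Self::DocSet;
}

struct VecReader {
    docs: Vec<(u32, u32)>,
}

impl TypedReader for VecReader {
    type Postings = std::vec::IntoIter<(u32, u32)>;
    type DocSet = std::vec::IntoIter<u32>;

    fn postings(&self) -> Self::Postings {
        self.docs.clone().into_iter()
    }

    fn docset(&self) -> Self::DocSet {
        // The "lighter" path: strip frequencies up front.
        let ids: Vec<u32> = self.docs.iter().map(|&(doc, _freq)| doc).collect();
        ids.into_iter()
    }
}

fn main() {
    let reader = VecReader {
        docs: vec![(2, 1), (7, 3)],
    };
    let ids: Vec<u32> = reader.docset().collect();
    assert_eq!(ids, vec![2, 7]);
}
```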
 
-impl InvertedIndexReader for dyn DynInvertedIndexReader + '_ {
-    type Postings = Box<dyn Postings>;
-    type DocSet = Box<dyn Postings>;
-
-    fn read_postings_from_terminfo(
-        &self,
-        term_info: &TermInfo,
-        option: IndexRecordOption,
-    ) -> io::Result<Self::Postings> {
-        DynInvertedIndexReader::read_postings_from_terminfo(self, term_info, option)
-    }
-
-    fn read_docset_from_terminfo(&self, term_info: &TermInfo) -> io::Result<Self::DocSet> {
-        DynInvertedIndexReader::read_postings_from_terminfo(
-            self,
-            term_info,
-            IndexRecordOption::Basic,
-        )
-    }
-}
-
-/// Handler interface used by [`try_downcast_and_call`] to build query objects.
-pub trait TypedInvertedIndexReaderCb<R> {
-    /// Invokes the handler with either Tantivy's built-in typed reader or the dynamic fallback.
-    fn call<I: InvertedIndexReader + ?Sized>(&mut self, reader: &I) -> R;
-}
-
-/// Tries Tantivy's built-in reader downcast before falling back to the dynamic reader path.
-pub fn try_downcast_and_call<R, C>(reader: &dyn DynInvertedIndexReader, handler: &mut C) -> R
-where C: TypedInvertedIndexReaderCb<R> {
-    if let Some(reader) = reader.as_any().downcast_ref::<TantivyInvertedIndexReader>() {
-        return handler.call(reader);
-    }
-    handler.call(reader)
-}
-
-struct LoadPostingsFromTermInfo<'a> {
-    term_info: &'a TermInfo,
-    option: IndexRecordOption,
-}
-
-impl TypedInvertedIndexReaderCb<io::Result<Box<dyn Postings>>> for LoadPostingsFromTermInfo<'_> {
-    fn call<I: InvertedIndexReader + ?Sized>(
-        &mut self,
-        reader: &I,
-    ) -> io::Result<Box<dyn Postings>> {
-        let postings = reader.read_postings_from_terminfo(self.term_info, self.option)?;
-        Ok(Box::new(postings))
-    }
-}
-
-pub(crate) fn load_postings_from_terminfo(
-    reader: &dyn DynInvertedIndexReader,
-    term_info: &TermInfo,
-    option: IndexRecordOption,
-) -> io::Result<Box<dyn Postings>> {
-    let mut postings_loader = LoadPostingsFromTermInfo { term_info, option };
-    try_downcast_and_call(reader, &mut postings_loader)
-}
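`try_downcast_and_call` above attempts to recover the concrete built-in reader through `as_any` before falling back to the trait-object path, so the common case gets monomorphized code. The same downcast-first dispatch pattern in a self-contained toy; `DynReader`, `BuiltinReader`, and `describe` are stand-ins invented for this sketch:

```rust
use std::any::Any;

// A minimal dynamically dispatched reader, with an `as_any` escape hatch
// mirroring the shape used by `try_downcast_and_call`.
trait DynReader {
    fn as_any(&self) -> &dyn Any;
    fn doc_count(&self) -> u32;
}

struct BuiltinReader {
    docs: u32,
}

impl DynReader for BuiltinReader {
    fn as_any(&self) -> &dyn Any {
        self
    }
    fn doc_count(&self) -> u32 {
        self.docs
    }
}

fn describe(reader: &dyn DynReader) -> String {
    // Fast path: recover the concrete built-in type by downcasting.
    if let Some(builtin) = reader.as_any().downcast_ref::<BuiltinReader>() {
        return format!("builtin reader with {} docs", builtin.docs);
    }
    // Slow path: stay behind the trait object.
    format!("dynamic reader with {} docs", reader.doc_count())
}

fn main() {
    let reader = BuiltinReader { docs: 42 };
    assert_eq!(describe(&reader), "builtin reader with 42 docs");
}
```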
 
-/// Tantivy's default inverted index reader implementation.
-///
 /// The inverted index reader is in charge of accessing
 /// the inverted index associated with a specific field.
 ///
 /// # Note
 ///
 /// It is safe to delete the segment associated with
-/// an `InvertedIndexReader` implementation. As long as it is open,
+/// an `InvertedIndexReader`. As long as it is open,
 /// the [`FileSlice`] it is relying on should
 /// stay available.
 ///
-/// `TantivyInvertedIndexReader` instances are created by calling
+/// `InvertedIndexReader`s are created by calling
 /// [`SegmentReader::inverted_index()`](crate::SegmentReader::inverted_index).
-pub struct TantivyInvertedIndexReader {
+pub struct InvertedIndexReader {
     termdict: TermDictionary,
     postings_file_slice: FileSlice,
     positions_file_slice: FileSlice,
     #[cfg_attr(not(feature = "quickwit"), allow(dead_code))]
     fieldnorms_file_slice: FileSlice,
     record_option: IndexRecordOption,
     total_num_tokens: u64,
 }
 
 /// Object that records the amount of space used by a field in an inverted index.
-pub struct InvertedIndexFieldSpace {
-    /// Field name as encoded in the term dictionary.
+pub(crate) struct InvertedIndexFieldSpace {
     pub field_name: String,
     /// Value type for the encoded field.
     pub field_type: Type,
     /// Total bytes used by postings for this field.
     pub postings_size: ByteCount,
     /// Total bytes used by positions for this field.
     pub positions_size: ByteCount,
     /// Number of terms in the field.
     pub num_terms: u64,
 }
 
@@ -292,81 +62,52 @@ impl InvertedIndexFieldSpace {
     }
 }
 
-impl TantivyInvertedIndexReader {
-    pub(crate) fn read_raw_postings_data_inner(
-        &self,
-        term_info: &TermInfo,
-        option: IndexRecordOption,
-    ) -> io::Result<RawPostingsData> {
-        let effective_option = option.downgrade(self.record_option);
-        let postings_data = self
-            .postings_file_slice
-            .slice(term_info.postings_range.clone())
-            .read_bytes()?;
-        let positions_data: Option<OwnedBytes> = if effective_option.has_positions() {
-            let positions_data = self
-                .positions_file_slice
-                .slice(term_info.positions_range.clone())
-                .read_bytes()?;
-            Some(positions_data)
-        } else {
-            None
-        };
-        Ok(RawPostingsData {
-            postings_data,
-            positions_data,
-            record_option: self.record_option,
-            effective_option,
-        })
-    }
-
-    /// Opens an inverted index reader from already-loaded term/postings/positions slices.
-    ///
-    /// The first 8 bytes of `postings_file_slice` are expected to contain
-    /// the serialized total token count.
-    pub fn new(
+impl InvertedIndexReader {
+    pub(crate) fn new(
         termdict: TermDictionary,
         postings_file_slice: FileSlice,
        positions_file_slice: FileSlice,
         fieldnorms_file_slice: FileSlice,
         record_option: IndexRecordOption,
-    ) -> io::Result<TantivyInvertedIndexReader> {
+    ) -> io::Result<InvertedIndexReader> {
         let (total_num_tokens_slice, postings_body) = postings_file_slice.split(8);
         let total_num_tokens = u64::deserialize(&mut total_num_tokens_slice.read_bytes()?)?;
-        Ok(TantivyInvertedIndexReader {
+        Ok(InvertedIndexReader {
             termdict,
             postings_file_slice: postings_body,
             positions_file_slice,
             fieldnorms_file_slice,
             record_option,
             total_num_tokens,
         })
     }
 
-    /// Creates an empty `TantivyInvertedIndexReader` object, which
+    /// Creates an empty `InvertedIndexReader` object, which
     /// contains no terms at all.
-    pub fn empty(record_option: IndexRecordOption) -> TantivyInvertedIndexReader {
-        TantivyInvertedIndexReader {
+    pub fn empty(record_option: IndexRecordOption) -> InvertedIndexReader {
+        InvertedIndexReader {
             termdict: TermDictionary::empty(),
             postings_file_slice: FileSlice::empty(),
             positions_file_slice: FileSlice::empty(),
             fieldnorms_file_slice: FileSlice::empty(),
             record_option,
             total_num_tokens: 0u64,
         }
     }
-}
 
-impl DynInvertedIndexReader for TantivyInvertedIndexReader {
-    fn as_any(&self) -> &dyn Any {
-        self
+    /// Returns the term info associated with the term.
+    pub fn get_term_info(&self, term: &Term) -> io::Result<Option<TermInfo>> {
+        self.termdict.get(term.serialized_value_bytes())
     }
 
-    fn terms(&self) -> &TermDictionary {
+    /// Return the term dictionary datastructure.
+    pub fn terms(&self) -> &TermDictionary {
        &self.termdict
     }
 
-    fn list_encoded_json_fields(&self) -> io::Result<Vec<InvertedIndexFieldSpace>> {
+    /// Return the fields and types encoded in the dictionary in lexicographic order.
+    /// Only valid on JSON fields.
+    ///
+    /// Notice: This requires a full scan and is therefore **very expensive**.
+    /// TODO: Move to sstable to use the index.
+    pub(crate) fn list_encoded_json_fields(&self) -> io::Result<Vec<InvertedIndexFieldSpace>> {
         let mut stream = self.termdict.stream()?;
         let mut fields: Vec<InvertedIndexFieldSpace> = Vec::new();
 
@@ -419,308 +160,136 @@ impl DynInvertedIndexReader for TantivyInvertedIndexReader {
         Ok(fields)
     }
 
-    fn read_raw_postings_data(
+    /// Resets the block segment to another position of the postings
+    /// file.
+    ///
+    /// This is useful for enumerating through a list of terms,
+    /// and consuming the associated posting lists while avoiding
+    /// reallocating a [`BlockSegmentPostings`].
+    ///
+    /// # Warning
+    ///
+    /// This does not reset the positions list.
+    pub fn reset_block_postings_from_terminfo(
+        &self,
+        term_info: &TermInfo,
+        block_postings: &mut BlockSegmentPostings,
+    ) -> io::Result<()> {
+        let postings_slice = self
+            .postings_file_slice
+            .slice(term_info.postings_range.clone());
+        let postings_bytes = postings_slice.read_bytes()?;
+        block_postings.reset(term_info.doc_freq, postings_bytes)?;
+        Ok(())
+    }
+
+    /// Returns a block postings given a `Term`.
+    /// This method is for an advanced usage only.
+    ///
+    /// Most users should prefer using [`Self::read_postings()`] instead.
+    pub fn read_block_postings(
+        &self,
+        term: &Term,
+        option: IndexRecordOption,
+    ) -> io::Result<Option<BlockSegmentPostings>> {
+        self.get_term_info(term)?
+            .map(move |term_info| self.read_block_postings_from_terminfo(&term_info, option))
+            .transpose()
+    }
+
+    /// Returns a block postings given a `term_info`.
+    /// This method is for an advanced usage only.
+    ///
+    /// Most users should prefer using [`Self::read_postings()`] instead.
+    pub fn read_block_postings_from_terminfo(
+        &self,
+        term_info: &TermInfo,
+        requested_option: IndexRecordOption,
+    ) -> io::Result<BlockSegmentPostings> {
+        let postings_data = self
+            .postings_file_slice
+            .slice(term_info.postings_range.clone());
+        BlockSegmentPostings::open(
+            term_info.doc_freq,
+            postings_data,
+            self.record_option,
+            requested_option,
+        )
+    }
+
+    /// Returns a posting object given a `term_info`.
+    /// This method is for an advanced usage only.
+    ///
+    /// Most users should prefer using [`Self::read_postings()`] instead.
+    pub fn read_postings_from_terminfo(
         &self,
         term_info: &TermInfo,
         option: IndexRecordOption,
-    ) -> io::Result<RawPostingsData> {
-        self.read_raw_postings_data_inner(term_info, option)
+    ) -> io::Result<SegmentPostings> {
+        let option = option.downgrade(self.record_option);
+
+        let block_postings = self.read_block_postings_from_terminfo(term_info, option)?;
+        let position_reader = {
+            if option.has_positions() {
+                let positions_data = self
+                    .positions_file_slice
+                    .read_bytes_slice(term_info.positions_range.clone())?;
+                let position_reader = PositionReader::open(positions_data)?;
+                Some(position_reader)
+            } else {
+                None
+            }
+        };
+        Ok(SegmentPostings::from_block_postings(
+            block_postings,
+            position_reader,
+        ))
     }
 
-    fn total_num_tokens(&self) -> u64 {
+    /// Returns the total number of tokens recorded for all documents
+    /// (including deleted documents).
+    pub fn total_num_tokens(&self) -> u64 {
         self.total_num_tokens
     }
 
-    fn doc_freq(&self, term: &Term) -> io::Result<u32> {
+    /// Returns the segment postings associated with the term, and with the given option,
+    /// or `None` if the term has never been encountered and indexed.
+    ///
+    /// If the field was not indexed with indexing options that cover
+    /// the requested options, the method does not fail:
+    /// it returns a `SegmentPostings` with as much information as possible.
+    ///
+    /// For instance, requesting [`IndexRecordOption::WithFreqs`] for a
+    /// [`TextOptions`](crate::schema::TextOptions) that does not index position
+    /// will return a [`SegmentPostings`] with `DocId`s and frequencies.
+    pub fn read_postings(
+        &self,
+        term: &Term,
+        option: IndexRecordOption,
+    ) -> io::Result<Option<SegmentPostings>> {
+        self.get_term_info(term)?
+            .map(move |term_info| self.read_postings_from_terminfo(&term_info, option))
+            .transpose()
+    }
+
+    /// Returns the number of documents containing the term.
+    pub fn doc_freq(&self, term: &Term) -> io::Result<u32> {
         Ok(self
             .get_term_info(term)?
             .map(|term_info| term_info.doc_freq)
             .unwrap_or(0u32))
     }
 
     #[cfg(feature = "quickwit")]
     fn doc_freq_async<'a>(
         &'a self,
         term: &'a Term,
     ) -> Pin<Box<dyn Future<Output = io::Result<u32>> + Send + 'a>> {
         Box::pin(async move {
             Ok(self
                 .get_term_info_async(term)
                 .await?
                 .map(|term_info| term_info.doc_freq)
                 .unwrap_or(0u32))
         })
     }
 
     #[cfg(feature = "quickwit")]
     fn warm_fieldnorms_readers<'a>(
         &'a self,
     ) -> Pin<Box<dyn Future<Output = io::Result<()>> + Send + 'a>> {
         Box::pin(async move {
             self.fieldnorms_file_slice.read_bytes_async().await?;
             Ok(())
         })
     }
 
     #[cfg(feature = "quickwit")]
     fn warm_postings_full<'a>(
         &'a self,
         with_positions: bool,
     ) -> Pin<Box<dyn Future<Output = io::Result<()>> + Send + 'a>> {
         Box::pin(async move {
             self.postings_file_slice.read_bytes_async().await?;
             if with_positions {
                 self.positions_file_slice.read_bytes_async().await?;
             }
             Ok(())
         })
     }
 
     #[cfg(feature = "quickwit")]
     fn warm_postings<'a>(
         &'a self,
         term: &'a Term,
         with_positions: bool,
     ) -> Pin<Box<dyn Future<Output = io::Result<bool>> + Send + 'a>> {
         Box::pin(async move {
             let term_info_opt: Option<TermInfo> = self.get_term_info_async(term).await?;
             if let Some(term_info) = term_info_opt {
                 let postings = self
                     .postings_file_slice
                     .read_bytes_slice_async(term_info.postings_range.clone());
                 if with_positions {
                     let positions = self
                         .positions_file_slice
                         .read_bytes_slice_async(term_info.positions_range.clone());
                     futures_util::future::try_join(postings, positions).await?;
                 } else {
                     postings.await?;
                 }
                 Ok(true)
             } else {
                 Ok(false)
             }
         })
     }
 
     #[cfg(feature = "quickwit")]
     fn warm_postings_range<'a>(
         &'a self,
         terms: TermRangeBounds,
         limit: Option<u64>,
         with_positions: bool,
     ) -> Pin<Box<dyn Future<Output = io::Result<bool>> + Send + 'a>> {
         Box::pin(async move {
             let mut term_info = self
                 .get_term_range_async(terms, AlwaysMatch, limit, 0)
                 .await?;
 
             let Some(first_terminfo) = term_info.next() else {
                 // no key matches, nothing more to load
                 return Ok(false);
             };
 
             let last_terminfo = term_info.last().unwrap_or_else(|| first_terminfo.clone());
 
             let postings_range =
                 first_terminfo.postings_range.start..last_terminfo.postings_range.end;
             let positions_range =
                 first_terminfo.positions_range.start..last_terminfo.positions_range.end;
 
             let postings = self
                 .postings_file_slice
                 .read_bytes_slice_async(postings_range);
             if with_positions {
                 let positions = self
                     .positions_file_slice
                     .read_bytes_slice_async(positions_range);
                 futures_util::future::try_join(postings, positions).await?;
             } else {
                 postings.await?;
             }
             Ok(true)
         })
     }
 
     #[cfg(feature = "quickwit")]
     fn warm_postings_automaton<'a, A: Automaton + Clone + Send + Sync + 'static>(
         &'a self,
         automaton: A,
     ) -> Pin<Box<dyn Future<Output = io::Result<bool>> + Send + 'a>>
     where
         A::State: Clone + Send,
         Self: Sized,
     {
         Box::pin(async move {
             // merge holes under 4MiB, that's how many bytes we can hope to receive during a TTFB
             // from S3 (~80MiB/s, and 50ms latency)
             const MERGE_HOLES_UNDER_BYTES: usize = (80 * 1024 * 1024 * 50) / 1000;
             // Trigger async prefetch of relevant termdict blocks.
             let _term_info_iter = self
                 .get_term_range_async(
                     (std::ops::Bound::Unbounded, std::ops::Bound::Unbounded),
                     automaton.clone(),
                     None,
                     MERGE_HOLES_UNDER_BYTES,
                 )
                 .await?;
             drop(_term_info_iter);
 
             // Build a 2nd stream without merged holes so we only scan matching blocks.
             // This assumes the storage layer caches data fetched by the first pass.
             let mut stream = self.termdict.search(automaton).into_stream()?;
             let posting_ranges_iter =
                 std::iter::from_fn(move || stream.next().map(|(_k, v)| v.postings_range.clone()));
             let merged_posting_ranges: Vec<std::ops::Range<usize>> = posting_ranges_iter
                 .coalesce(|range1, range2| {
                     if range1.end + MERGE_HOLES_UNDER_BYTES >= range2.start {
                         Ok(range1.start..range2.end)
                     } else {
                         Err((range1, range2))
                     }
                 })
                 .collect();
 
             if merged_posting_ranges.is_empty() {
                 return Ok(false);
             }
 
             let slices_downloaded = futures_util::stream::iter(merged_posting_ranges.into_iter())
                 .map(|posting_slice| {
                     self.postings_file_slice
                         .read_bytes_slice_async(posting_slice)
                         .map(|result| result.map(|_slice| ()))
                 })
                 .buffer_unordered(5)
                 .try_collect::<Vec<()>>()
                 .await?;
 
             Ok(!slices_downloaded.is_empty())
         })
     }
 }
|
||||
|
||||
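The coalesce step above merges postings byte ranges whose gap is smaller than `MERGE_HOLES_UNDER_BYTES`, so one larger fetch replaces many small ones. A minimal standalone sketch of that merging logic, using a hand-rolled loop instead of itertools (`merge_ranges` is an illustrative name, not part of the tantivy API):

```rust
use std::ops::Range;

// ~4 MiB: bytes obtainable during one S3 TTFB (~80 MiB/s, 50 ms latency),
// mirroring the constant in the code above.
const MERGE_HOLES_UNDER_BYTES: usize = (80 * 1024 * 1024 * 50) / 1000;

/// Merge sorted byte ranges whose inter-range hole is below `max_hole`.
fn merge_ranges(ranges: &[Range<usize>], max_hole: usize) -> Vec<Range<usize>> {
    let mut merged: Vec<Range<usize>> = Vec::new();
    for r in ranges {
        match merged.last_mut() {
            // Extend the previous range when the hole between them is small enough.
            Some(prev) if prev.end + max_hole >= r.start => prev.end = prev.end.max(r.end),
            _ => merged.push(r.clone()),
        }
    }
    merged
}

fn main() {
    let ranges = vec![0..10, 12..20, 5_000_000..5_000_010];
    let merged = merge_ranges(&ranges, MERGE_HOLES_UNDER_BYTES);
    // The first two ranges are close enough to merge; the third stays separate.
    assert_eq!(merged, vec![0..20, 5_000_000..5_000_010]);
    println!("{merged:?}");
}
```

The same trade-off drives the constant's value: a hole cheaper to download than the latency of a second request is worth merging over.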
impl InvertedIndexReader for TantivyInvertedIndexReader {
    type Postings = SegmentPostings;
    type DocSet = SegmentPostings;

    #[inline]
    fn read_postings_from_terminfo(
        &self,
        term_info: &TermInfo,
        option: IndexRecordOption,
    ) -> io::Result<Self::Postings> {
        let postings_data = self.read_raw_postings_data_inner(term_info, option)?;
        load_postings_from_raw_data(term_info.doc_freq, postings_data)
    }

    #[inline]
    fn read_docset_from_terminfo(&self, term_info: &TermInfo) -> io::Result<Self::DocSet> {
        let postings_data =
            self.read_raw_postings_data_inner(term_info, IndexRecordOption::Basic)?;
        load_postings_from_raw_data(term_info.doc_freq, postings_data)
    }
}

#[cfg(test)]
mod tests {
    use std::any::TypeId;

    use super::*;

    #[derive(Default)]
    struct RecordDispatch {
        used_concrete_reader: bool,
        used_dynamic_fallback: bool,
    }

    impl TypedInvertedIndexReaderCb<()> for RecordDispatch {
        fn call<I: InvertedIndexReader + ?Sized>(&mut self, _reader: &I) {
            let postings_type = TypeId::of::<I::Postings>();
            if postings_type == TypeId::of::<SegmentPostings>() {
                self.used_concrete_reader = true;
            } else if postings_type == TypeId::of::<Box<dyn Postings>>() {
                self.used_dynamic_fallback = true;
            } else {
                panic!("unexpected postings type in downcast helper test");
            }
        }
    }

    struct OnlyDynReader {
        termdict: TermDictionary,
    }

    impl Default for OnlyDynReader {
        fn default() -> Self {
            Self {
                termdict: TermDictionary::empty(),
            }
        }
    }

    impl DynInvertedIndexReader for OnlyDynReader {
        fn as_any(&self) -> &dyn Any {
            self
        }

        fn terms(&self) -> &TermDictionary {
            &self.termdict
        }

        fn list_encoded_json_fields(&self) -> io::Result<Vec<InvertedIndexFieldSpace>> {
            Ok(Vec::new())
        }

        fn read_raw_postings_data(
            &self,
            _term_info: &TermInfo,
            _option: IndexRecordOption,
        ) -> io::Result<RawPostingsData> {
            unreachable!("not used in downcast helper tests")
        }

        fn total_num_tokens(&self) -> u64 {
            0
        }

        fn doc_freq(&self, _term: &Term) -> io::Result<u32> {
            Ok(0)
        }
    }

    #[test]
    fn try_downcast_and_call_uses_tantivy_reader() {
        let reader = TantivyInvertedIndexReader::empty(IndexRecordOption::Basic);
        let mut dispatch_recorder = RecordDispatch::default();

        try_downcast_and_call(&reader, &mut dispatch_recorder);

        assert!(dispatch_recorder.used_concrete_reader);
        assert!(!dispatch_recorder.used_dynamic_fallback);
    }

    #[test]
    fn try_downcast_and_call_uses_dynamic_fallback_for_other_readers() {
        let reader = OnlyDynReader::default();
        let mut dispatch_recorder = RecordDispatch::default();

        try_downcast_and_call(&reader, &mut dispatch_recorder);

        assert!(!dispatch_recorder.used_concrete_reader);
        assert!(dispatch_recorder.used_dynamic_fallback);
    }
}

#[cfg(feature = "quickwit")]
impl TantivyInvertedIndexReader {
impl InvertedIndexReader {
    pub(crate) async fn get_term_info_async(&self, term: &Term) -> io::Result<Option<TermInfo>> {
        self.termdict.get_async(term.serialized_value_bytes()).await
    }

    async fn get_term_range_async<'a, A: Automaton + 'a>(
        &'a self,
        terms: TermRangeBounds,
        terms: impl std::ops::RangeBounds<Term>,
        automaton: A,
        limit: Option<u64>,
        merge_holes_under_bytes: usize,
@@ -728,17 +297,17 @@ impl TantivyInvertedIndexReader {
    where
        A::State: Clone,
    {
        use std::ops::Bound;
        let range_builder = self.termdict.search(automaton);
        let (start_bound, end_bound) = terms;
        let range_builder = match start_bound {
            std::ops::Bound::Included(bound) => range_builder.ge(bound.serialized_value_bytes()),
            std::ops::Bound::Excluded(bound) => range_builder.gt(bound.serialized_value_bytes()),
            std::ops::Bound::Unbounded => range_builder,
        let range_builder = match terms.start_bound() {
            Bound::Included(bound) => range_builder.ge(bound.serialized_value_bytes()),
            Bound::Excluded(bound) => range_builder.gt(bound.serialized_value_bytes()),
            Bound::Unbounded => range_builder,
        };
        let range_builder = match end_bound {
            std::ops::Bound::Included(bound) => range_builder.le(bound.serialized_value_bytes()),
            std::ops::Bound::Excluded(bound) => range_builder.lt(bound.serialized_value_bytes()),
            std::ops::Bound::Unbounded => range_builder,
        let range_builder = match terms.end_bound() {
            Bound::Included(bound) => range_builder.le(bound.serialized_value_bytes()),
            Bound::Excluded(bound) => range_builder.lt(bound.serialized_value_bytes()),
            Bound::Unbounded => range_builder,
        };
        let range_builder = if let Some(limit) = limit {
            range_builder.limit(limit)
@@ -759,4 +328,167 @@ impl TantivyInvertedIndexReader {

        Ok(iter)
    }

    /// Warmup a block postings given a `Term`.
    /// This method is for an advanced usage only.
    ///
    /// returns a boolean, whether the term was found in the dictionary
    pub async fn warm_postings(&self, term: &Term, with_positions: bool) -> io::Result<bool> {
        let term_info_opt: Option<TermInfo> = self.get_term_info_async(term).await?;
        if let Some(term_info) = term_info_opt {
            let postings = self
                .postings_file_slice
                .read_bytes_slice_async(term_info.postings_range.clone());
            if with_positions {
                let positions = self
                    .positions_file_slice
                    .read_bytes_slice_async(term_info.positions_range.clone());
                futures_util::future::try_join(postings, positions).await?;
            } else {
                postings.await?;
            }
            Ok(true)
        } else {
            Ok(false)
        }
    }

    /// Warmup a block postings given a range of `Term`s.
    /// This method is for an advanced usage only.
    ///
    /// returns a boolean, whether a term matching the range was found in the dictionary
    pub async fn warm_postings_range(
        &self,
        terms: impl std::ops::RangeBounds<Term>,
        limit: Option<u64>,
        with_positions: bool,
    ) -> io::Result<bool> {
        let mut term_info = self
            .get_term_range_async(terms, AlwaysMatch, limit, 0)
            .await?;

        let Some(first_terminfo) = term_info.next() else {
            // no key matches, nothing more to load
            return Ok(false);
        };

        let last_terminfo = term_info.last().unwrap_or_else(|| first_terminfo.clone());

        let postings_range = first_terminfo.postings_range.start..last_terminfo.postings_range.end;
        let positions_range =
            first_terminfo.positions_range.start..last_terminfo.positions_range.end;

        let postings = self
            .postings_file_slice
            .read_bytes_slice_async(postings_range);
        if with_positions {
            let positions = self
                .positions_file_slice
                .read_bytes_slice_async(positions_range);
            futures_util::future::try_join(postings, positions).await?;
        } else {
            postings.await?;
        }
        Ok(true)
    }

    /// Warmup the block postings for all terms matching a given automaton.
    /// This method is for an advanced usage only.
    ///
    /// returns a boolean, whether a term matching the automaton was found in the dictionary
    pub async fn warm_postings_automaton<
        A: Automaton + Clone + Send + 'static,
        E: FnOnce(Box<dyn FnOnce() -> io::Result<()> + Send>) -> F,
        F: std::future::Future<Output = io::Result<()>>,
    >(
        &self,
        automaton: A,
        // with_positions: bool, at the moment we have no use for it, and supporting it would add
        // complexity to the coalesce
        executor: E,
    ) -> io::Result<bool>
    where
        A::State: Clone,
    {
        // merge holes under 4MiB, that's how many bytes we can hope to receive during a TTFB from
        // S3 (~80MiB/s, and 50ms latency)
        const MERGE_HOLES_UNDER_BYTES: usize = (80 * 1024 * 1024 * 50) / 1000;
        // we build a first iterator to download everything. Simply calling the function already
        // downloads everything we need from the sstable, but doesn't start iterating over it.
        let _term_info_iter = self
            .get_term_range_async(.., automaton.clone(), None, MERGE_HOLES_UNDER_BYTES)
            .await?;

        let (sender, posting_ranges_to_load_stream) = futures_channel::mpsc::unbounded();
        let termdict = self.termdict.clone();
        let cpu_bound_task = move || {
            // then we build a 2nd iterator, this one with no holes, so we don't go through blocks
            // we can't match.
            // This makes the assumption there is a caching layer below us, which gives sync read
            // for free after the initial async access. This might not always be true, but is in
            // Quickwit.
            // We build things from this closure otherwise we get into lifetime issues that can only
            // be solved with self referential structs. Returning an io::Result from here is a bit
            // more leaky abstraction-wise, but a lot better than the alternative
            let mut stream = termdict.search(automaton).into_stream()?;

            // we could do without an iterator, but this allows us access to coalesce which
            // simplifies things
            let posting_ranges_iter =
                std::iter::from_fn(move || stream.next().map(|(_k, v)| v.postings_range.clone()));

            let merged_posting_ranges_iter = posting_ranges_iter.coalesce(|range1, range2| {
                if range1.end + MERGE_HOLES_UNDER_BYTES >= range2.start {
                    Ok(range1.start..range2.end)
                } else {
                    Err((range1, range2))
                }
            });

            for posting_range in merged_posting_ranges_iter {
                if let Err(_) = sender.unbounded_send(posting_range) {
                    // this should happen only when search is cancelled
                    return Err(io::Error::other("failed to send posting range back"));
                }
            }
            Ok(())
        };
        let task_handle = executor(Box::new(cpu_bound_task));

        let posting_downloader = posting_ranges_to_load_stream
            .map(|posting_slice| {
                self.postings_file_slice
                    .read_bytes_slice_async(posting_slice)
                    .map(|result| result.map(|_slice| ()))
            })
            .buffer_unordered(5)
            .try_collect::<Vec<()>>();

        let (_, slices_downloaded) =
            futures_util::future::try_join(task_handle, posting_downloader).await?;

        Ok(!slices_downloaded.is_empty())
    }

    /// Warmup the block postings for all terms.
    /// This method is for an advanced usage only.
    ///
    /// If you know which terms to pre-load, prefer using [`Self::warm_postings`] or
    /// [`Self::warm_postings_range`] instead.
    pub async fn warm_postings_full(&self, with_positions: bool) -> io::Result<()> {
        self.postings_file_slice.read_bytes_async().await?;
        if with_positions {
            self.positions_file_slice.read_bytes_async().await?;
        }
        Ok(())
    }

    /// Returns the number of documents containing the term asynchronously.
    pub async fn doc_freq_async(&self, term: &Term) -> io::Result<u32> {
        Ok(self
            .get_term_info_async(term)
            .await?
            .map(|term_info| term_info.doc_freq)
            .unwrap_or(0u32))
    }
}

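The `get_term_range_async` change above swaps an explicit `(Bound, Bound)` tuple for an `impl RangeBounds<Term>` parameter, dispatching on `start_bound()` / `end_bound()`. A standalone sketch of the same dispatch pattern, with a hypothetical `QueryBuilder` standing in for the real sstable range builder:

```rust
use std::ops::{Bound, RangeBounds};

// Illustrative stand-in for a lower/upper-bounded range builder;
// not the tantivy sstable API.
#[derive(Debug, Default, PartialEq)]
struct QueryBuilder {
    ge: Option<u32>,
    gt: Option<u32>,
    le: Option<u32>,
    lt: Option<u32>,
}

/// Translate any RangeBounds argument into builder limits.
fn apply_bounds(mut b: QueryBuilder, range: impl RangeBounds<u32>) -> QueryBuilder {
    match range.start_bound() {
        Bound::Included(v) => b.ge = Some(*v),
        Bound::Excluded(v) => b.gt = Some(*v),
        Bound::Unbounded => {}
    }
    match range.end_bound() {
        Bound::Included(v) => b.le = Some(*v),
        Bound::Excluded(v) => b.lt = Some(*v),
        Bound::Unbounded => {}
    }
    b
}

fn main() {
    // `3..10` has an included start and an excluded end.
    let b = apply_bounds(QueryBuilder::default(), 3..10);
    assert_eq!(b.ge, Some(3));
    assert_eq!(b.lt, Some(10));
    println!("{b:?}");
}
```

The benefit of the refactor is that callers can pass any range expression (`..`, `a..b`, `a..=b`) instead of building bound tuples by hand.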
@@ -2,7 +2,6 @@
//!
//! It contains `Index` and `Segment`, where an `Index` consists of one or more `Segment`s.

mod codec_configuration;
mod index;
mod index_meta;
mod inverted_index_reader;
@@ -11,16 +10,11 @@ mod segment_component;
mod segment_id;
mod segment_reader;

pub use self::codec_configuration::CodecConfiguration;
pub use self::index::{Index, IndexBuilder};
pub(crate) use self::index_meta::SegmentMetaInventory;
pub use self::index_meta::{IndexMeta, IndexSettings, Order, SegmentMeta};
pub(crate) use self::inverted_index_reader::load_postings_from_terminfo;
pub use self::inverted_index_reader::{
    try_downcast_and_call, DynInvertedIndexReader, InvertedIndexFieldSpace, InvertedIndexReader,
    TantivyInvertedIndexReader, TypedInvertedIndexReaderCb,
};
pub use self::inverted_index_reader::InvertedIndexReader;
pub use self::segment::Segment;
pub use self::segment_component::SegmentComponent;
pub use self::segment_id::SegmentId;
pub use self::segment_reader::{FieldMetadata, SegmentReader, TantivySegmentReader};
pub use self::segment_reader::{FieldMetadata, SegmentReader};

@@ -16,7 +16,7 @@ pub struct Segment {
}

impl fmt::Debug for Segment {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "Segment({:?})", self.id().uuid_string())
    }
}

@@ -44,7 +44,7 @@ fn create_uuid() -> Uuid {
}

impl SegmentId {
    /// Generates a new random `SegmentId`.
    #[doc(hidden)]
    pub fn generate_random() -> SegmentId {
        SegmentId(create_uuid())
    }

@@ -6,101 +6,17 @@ use common::{ByteCount, HasLen};
use fnv::FnvHashMap;
use itertools::Itertools;

use crate::directory::{CompositeFile, Directory, FileSlice};
use crate::directory::{CompositeFile, FileSlice};
use crate::error::DataCorruption;
use crate::fastfield::{intersect_alive_bitsets, AliveBitSet, FacetReader, FastFieldReaders};
use crate::fieldnorm::{FieldNormReader, FieldNormReaders};
use crate::index::{
    DynInvertedIndexReader, Segment, SegmentComponent, SegmentId, SegmentMeta,
    TantivyInvertedIndexReader,
};
use crate::index::{InvertedIndexReader, Segment, SegmentComponent, SegmentId};
use crate::json_utils::json_path_sep_to_dot;
use crate::postings::SegmentPostings;
use crate::query::boolean_query::block_wand::{block_wand, block_wand_single_scorer};
use crate::query::term_query::TermScorer;
use crate::query::{BufferedUnionScorer, Scorer, SumCombiner};
use crate::schema::{Field, IndexRecordOption, Schema, Type};
use crate::space_usage::SegmentSpaceUsage;
use crate::store::{StoreReader, TantivyStoreReader};
use crate::store::StoreReader;
use crate::termdict::TermDictionary;
use crate::{DocId, DocSet as _, Opstamp, Score, TERMINATED};

/// Trait defining the contract for a segment reader.
pub trait SegmentReader: Send + Sync {
    /// Returns the highest document id ever attributed in this segment + 1.
    fn max_doc(&self) -> DocId;

    /// Returns the number of alive documents. Deleted documents are not counted.
    fn num_docs(&self) -> DocId;

    /// Returns the schema of the index this segment belongs to.
    fn schema(&self) -> &Schema;

    /// Performs a for_each_pruning operation on the given scorer.
    fn for_each_pruning(
        &self,
        threshold: Score,
        scorer: Box<dyn Scorer>,
        callback: &mut dyn FnMut(DocId, Score) -> Score,
    );

    /// Return the number of documents that have been deleted in the segment.
    fn num_deleted_docs(&self) -> DocId;

    /// Returns true if some of the documents of the segment have been deleted.
    fn has_deletes(&self) -> bool;

    /// Accessor to a segment's fast field reader given a field.
    fn fast_fields(&self) -> &FastFieldReaders;

    /// Accessor to the `FacetReader` associated with a given `Field`.
    fn facet_reader(&self, field_name: &str) -> crate::Result<FacetReader> {
        let field = self.schema().get_field(field_name)?;
        let field_entry = self.schema().get_field_entry(field);
        if field_entry.field_type().value_type() != Type::Facet {
            return Err(crate::TantivyError::SchemaError(format!(
                "`{field_name}` is not a facet field.`"
            )));
        }
        let Some(facet_column) = self.fast_fields().str(field_name)? else {
            panic!("Facet Field `{field_name}` is missing. This should not happen");
        };
        Ok(FacetReader::new(facet_column))
    }

    /// Accessor to the segment's `Field norms`'s reader.
    fn get_fieldnorms_reader(&self, field: Field) -> crate::Result<FieldNormReader>;

    /// Accessor to the segment's [`StoreReader`](crate::store::StoreReader).
    fn get_store_reader(&self, cache_num_blocks: usize) -> io::Result<Box<dyn StoreReader>>;

    /// Returns a field reader associated with the field given in argument.
    fn inverted_index(&self, field: Field) -> crate::Result<Arc<dyn DynInvertedIndexReader>>;

    /// Returns the list of fields that have been indexed in the segment.
    fn fields_metadata(&self) -> crate::Result<Vec<FieldMetadata>>;

    /// Returns the segment id.
    fn segment_id(&self) -> SegmentId;

    /// Returns the delete opstamp.
    fn delete_opstamp(&self) -> Option<Opstamp>;

    /// Returns the bitset representing the alive `DocId`s.
    fn alive_bitset(&self) -> Option<&AliveBitSet>;

    /// Returns true if the `doc` is marked as deleted.
    fn is_deleted(&self, doc: DocId) -> bool;

    /// Returns an iterator that will iterate over the alive document ids.
    fn doc_ids_alive(&self) -> Box<dyn Iterator<Item = DocId> + Send + '_>;

    /// Summarize total space usage of this segment.
    fn space_usage(&self) -> io::Result<SegmentSpaceUsage>;

    /// Clones this reader into a shared trait object.
    fn clone_arc(&self) -> Arc<dyn SegmentReader>;
}
use crate::{DocId, Opstamp};

/// Entry point to access all of the datastructures of the `Segment`
///
@@ -113,8 +29,8 @@ pub trait SegmentReader: Send + Sync {
/// The segment reader has a very low memory footprint,
/// as close to all of the memory data is mmapped.
#[derive(Clone)]
pub struct TantivySegmentReader {
    inv_idx_reader_cache: Arc<RwLock<HashMap<Field, Arc<dyn DynInvertedIndexReader>>>>,
pub struct SegmentReader {
    inv_idx_reader_cache: Arc<RwLock<HashMap<Field, Arc<InvertedIndexReader>>>>,

    segment_id: SegmentId,
    delete_opstamp: Option<Opstamp>,
@@ -133,157 +49,73 @@ pub struct TantivySegmentReader {
    schema: Schema,
}

impl TantivySegmentReader {
    /// Open a new segment for reading.
    pub fn open(segment: &Segment) -> crate::Result<Arc<dyn SegmentReader>> {
        Self::open_with_custom_alive_set(segment, None)
    }

    /// Open a new segment for reading.
    pub fn open_with_custom_alive_set(
        segment: &Segment,
        custom_bitset: Option<AliveBitSet>,
    ) -> crate::Result<Arc<dyn SegmentReader>> {
        let reader = Self::open_with_custom_alive_set_from_directory(
            segment.index().directory(),
            segment.meta(),
            segment.schema(),
            custom_bitset,
        )?;
        Ok(Arc::new(reader))
    }

    pub(crate) fn open_with_custom_alive_set_from_directory(
        directory: &dyn Directory,
        segment_meta: &SegmentMeta,
        schema: Schema,
        custom_bitset: Option<AliveBitSet>,
    ) -> crate::Result<TantivySegmentReader> {
        let termdict_file =
            directory.open_read(&segment_meta.relative_path(SegmentComponent::Terms))?;
        let termdict_composite = CompositeFile::open(&termdict_file)?;

        let store_file =
            directory.open_read(&segment_meta.relative_path(SegmentComponent::Store))?;

        crate::fail_point!("SegmentReader::open#middle");

        let postings_file =
            directory.open_read(&segment_meta.relative_path(SegmentComponent::Postings))?;
        let postings_composite = CompositeFile::open(&postings_file)?;

        let positions_composite = {
            if let Ok(positions_file) =
                directory.open_read(&segment_meta.relative_path(SegmentComponent::Positions))
            {
                CompositeFile::open(&positions_file)?
            } else {
                CompositeFile::empty()
            }
        };

        let fast_fields_data =
            directory.open_read(&segment_meta.relative_path(SegmentComponent::FastFields))?;
        let fast_fields_readers = FastFieldReaders::open(fast_fields_data, schema.clone())?;
        let fieldnorm_data =
            directory.open_read(&segment_meta.relative_path(SegmentComponent::FieldNorms))?;
        let fieldnorm_readers = FieldNormReaders::open(fieldnorm_data)?;

        let original_bitset = if segment_meta.has_deletes() {
            let alive_doc_file_slice =
                directory.open_read(&segment_meta.relative_path(SegmentComponent::Delete))?;
            let alive_doc_data = alive_doc_file_slice.read_bytes()?;
            Some(AliveBitSet::open(alive_doc_data))
        } else {
            None
        };

        let alive_bitset_opt = intersect_alive_bitset(original_bitset, custom_bitset);

        let max_doc = segment_meta.max_doc();
        let num_docs = alive_bitset_opt
            .as_ref()
            .map(|alive_bitset| alive_bitset.num_alive_docs() as u32)
            .unwrap_or(max_doc);

        Ok(TantivySegmentReader {
            inv_idx_reader_cache: Default::default(),
            num_docs,
            max_doc,
            termdict_composite,
            postings_composite,
            fast_fields_readers,
            fieldnorm_readers,
            segment_id: segment_meta.id(),
            delete_opstamp: segment_meta.delete_opstamp(),
            store_file,
            alive_bitset_opt,
            positions_composite,
            schema,
        })
    }
}

impl SegmentReader for TantivySegmentReader {
    fn max_doc(&self) -> DocId {
impl SegmentReader {
    /// Returns the highest document id ever attributed in
    /// this segment + 1.
    pub fn max_doc(&self) -> DocId {
        self.max_doc
    }

    fn num_docs(&self) -> DocId {
    /// Returns the number of alive documents.
    /// Deleted documents are not counted.
    pub fn num_docs(&self) -> DocId {
        self.num_docs
    }

    fn schema(&self) -> &Schema {
    /// Returns the schema of the index this segment belongs to.
    pub fn schema(&self) -> &Schema {
        &self.schema
    }

    fn for_each_pruning(
        &self,
        mut threshold: Score,
        mut scorer: Box<dyn Scorer>,
        callback: &mut dyn FnMut(DocId, Score) -> Score,
    ) {
        // Try WAND acceleration with concrete postings types
        scorer = match scorer.downcast::<TermScorer<SegmentPostings>>() {
            Ok(term_scorer) => {
                block_wand_single_scorer(*term_scorer, threshold, callback);
                return;
            }
            Err(scorer) => scorer,
        };
        match scorer.downcast::<BufferedUnionScorer<TermScorer<SegmentPostings>, SumCombiner>>() {
            Ok(mut union_scorer) => {
                let doc = union_scorer.doc();
                if doc == TERMINATED {
                    return;
                }
                let score = union_scorer.score();
                if score > threshold {
                    threshold = callback(doc, score);
                }
                let scorers: Vec<TermScorer<SegmentPostings>> = union_scorer.into_scorers();
                block_wand(scorers, threshold, callback);
            }
            Err(mut scorer) => {
                // No acceleration available. Fall back to default.
                scorer.for_each_pruning(threshold, callback);
            }
        }
    }

    fn num_deleted_docs(&self) -> DocId {
    /// Return the number of documents that have been
    /// deleted in the segment.
    pub fn num_deleted_docs(&self) -> DocId {
        self.max_doc - self.num_docs
    }

    fn has_deletes(&self) -> bool {
        self.num_docs != self.max_doc
    /// Returns true if some of the documents of the segment have been deleted.
    pub fn has_deletes(&self) -> bool {
        self.num_deleted_docs() > 0
    }

    fn fast_fields(&self) -> &FastFieldReaders {
    /// Accessor to a segment's fast field reader given a field.
    ///
    /// Returns the u64 fast value reader if the field
    /// is a u64 field indexed as "fast".
    ///
    /// Return a FastFieldNotAvailableError if the field is not
    /// declared as a fast field in the schema.
    ///
    /// # Panics
    /// May panic if the index is corrupted.
    pub fn fast_fields(&self) -> &FastFieldReaders {
        &self.fast_fields_readers
    }

    fn get_fieldnorms_reader(&self, field: Field) -> crate::Result<FieldNormReader> {
    /// Accessor to the `FacetReader` associated with a given `Field`.
    pub fn facet_reader(&self, field_name: &str) -> crate::Result<FacetReader> {
        let schema = self.schema();
        let field = schema.get_field(field_name)?;
        let field_entry = schema.get_field_entry(field);
        if field_entry.field_type().value_type() != Type::Facet {
            return Err(crate::TantivyError::SchemaError(format!(
                "`{field_name}` is not a facet field.`"
            )));
        }
        let Some(facet_column) = self.fast_fields().str(field_name)? else {
            panic!("Facet Field `{field_name}` is missing. This should not happen");
        };
        Ok(FacetReader::new(facet_column))
    }

    /// Accessor to the segment's `Field norms`'s reader.
    ///
    /// Field norms are the length (in tokens) of the fields.
    /// It is used in the computation of the [TfIdf](https://fulmicoton.gitbooks.io/tantivy-doc/content/tfidf.html).
    ///
    /// They are simply stored as a fast field, serialized in
    /// the `.fieldnorm` file of the segment.
    pub fn get_fieldnorms_reader(&self, field: Field) -> crate::Result<FieldNormReader> {
        self.fieldnorm_readers.get_field(field)?.ok_or_else(|| {
            let field_name = self.schema.get_field_name(field);
            let err_msg = format!(
@@ -294,14 +126,100 @@ impl SegmentReader for TantivySegmentReader {
        })
    }

    fn get_store_reader(&self, cache_num_blocks: usize) -> io::Result<Box<dyn StoreReader>> {
        Ok(Box::new(TantivyStoreReader::open(
            self.store_file.clone(),
            cache_num_blocks,
        )?))
    #[doc(hidden)]
    pub fn fieldnorms_readers(&self) -> &FieldNormReaders {
        &self.fieldnorm_readers
    }

    fn inverted_index(&self, field: Field) -> crate::Result<Arc<dyn DynInvertedIndexReader>> {
    /// Accessor to the segment's [`StoreReader`](crate::store::StoreReader).
    ///
    /// `cache_num_blocks` sets the number of decompressed blocks to be cached in an LRU.
    /// The size of blocks is configurable, this should be reflected in the
    pub fn get_store_reader(&self, cache_num_blocks: usize) -> io::Result<StoreReader> {
        StoreReader::open(self.store_file.clone(), cache_num_blocks)
    }

    /// Open a new segment for reading.
    pub fn open(segment: &Segment) -> crate::Result<SegmentReader> {
        Self::open_with_custom_alive_set(segment, None)
    }

    /// Open a new segment for reading.
    pub fn open_with_custom_alive_set(
        segment: &Segment,
        custom_bitset: Option<AliveBitSet>,
    ) -> crate::Result<SegmentReader> {
        let termdict_file = segment.open_read(SegmentComponent::Terms)?;
        let termdict_composite = CompositeFile::open(&termdict_file)?;

        let store_file = segment.open_read(SegmentComponent::Store)?;

        crate::fail_point!("SegmentReader::open#middle");

        let postings_file = segment.open_read(SegmentComponent::Postings)?;
        let postings_composite = CompositeFile::open(&postings_file)?;

        let positions_composite = {
            if let Ok(positions_file) = segment.open_read(SegmentComponent::Positions) {
                CompositeFile::open(&positions_file)?
            } else {
                CompositeFile::empty()
            }
        };

        let schema = segment.schema();

        let fast_fields_data = segment.open_read(SegmentComponent::FastFields)?;
        let fast_fields_readers = FastFieldReaders::open(fast_fields_data, schema.clone())?;
        let fieldnorm_data = segment.open_read(SegmentComponent::FieldNorms)?;
        let fieldnorm_readers = FieldNormReaders::open(fieldnorm_data)?;

        let original_bitset = if segment.meta().has_deletes() {
            let alive_doc_file_slice = segment.open_read(SegmentComponent::Delete)?;
            let alive_doc_data = alive_doc_file_slice.read_bytes()?;
            Some(AliveBitSet::open(alive_doc_data))
        } else {
            None
        };

        let alive_bitset_opt = intersect_alive_bitset(original_bitset, custom_bitset);

        let max_doc = segment.meta().max_doc();
        let num_docs = alive_bitset_opt
            .as_ref()
            .map(|alive_bitset| alive_bitset.num_alive_docs() as u32)
            .unwrap_or(max_doc);

        Ok(SegmentReader {
            inv_idx_reader_cache: Default::default(),
            num_docs,
            max_doc,
            termdict_composite,
            postings_composite,
            fast_fields_readers,
            fieldnorm_readers,
            segment_id: segment.id(),
            delete_opstamp: segment.meta().delete_opstamp(),
            store_file,
            alive_bitset_opt,
            positions_composite,
            schema,
        })
    }

    /// Returns a field reader associated with the field given in argument.
    /// If the field was not present in the index during indexing time,
    /// the InvertedIndexReader is empty.
    ///
    /// The field reader is in charge of iterating through the
    /// term dictionary associated with a specific field,
    /// and opening the posting list associated with any term.
    ///
    /// If the field is not marked as indexed, a warning is logged and an empty
    /// `InvertedIndexReader` is returned.
    /// Similarly, if the field is marked as indexed but no term has been indexed for the given
    /// index, an empty `InvertedIndexReader` is returned (but no warning is logged).
    pub fn inverted_index(&self, field: Field) -> crate::Result<Arc<InvertedIndexReader>> {
        if let Some(inv_idx_reader) = self
            .inv_idx_reader_cache
            .read()
@@ -326,9 +244,7 @@ impl SegmentReader for TantivySegmentReader {
            //
            // Returns an empty inverted index.
            let record_option = record_option_opt.unwrap_or(IndexRecordOption::Basic);
            let inv_idx_reader: Arc<dyn DynInvertedIndexReader> =
                Arc::new(TantivyInvertedIndexReader::empty(record_option));
            return Ok(inv_idx_reader);
            return Ok(Arc::new(InvertedIndexReader::empty(record_option)));
        }

        let record_option = record_option_opt.unwrap();
@@ -351,20 +267,13 @@ impl SegmentReader for TantivySegmentReader {
            );
            DataCorruption::comment_only(error_msg)
        })?;
        let fieldnorms_file = self
            .fieldnorm_readers
            .get_inner_file()
            .open_read(field)
            .unwrap_or_else(FileSlice::empty);

        let inv_idx_reader: Arc<dyn DynInvertedIndexReader> =
            Arc::new(TantivyInvertedIndexReader::new(
                TermDictionary::open(termdict_file)?,
                postings_file,
                positions_file,
                fieldnorms_file,
                record_option,
            )?);
        let inv_idx_reader = Arc::new(InvertedIndexReader::new(
            TermDictionary::open(termdict_file)?,
            postings_file,
            positions_file,
            record_option,
        )?);

        // by releasing the lock in between, we may end up opening the inverted index
        // twice, but this is fine.
@@ -376,10 +285,23 @@ impl SegmentReader for TantivySegmentReader {
        Ok(inv_idx_reader)
    }

    fn fields_metadata(&self) -> crate::Result<Vec<FieldMetadata>> {
    /// Returns the list of fields that have been indexed in the segment.
    /// The field list includes the field defined in the schema as well as the fields
    /// that have been indexed as a part of a JSON field.
    /// The returned field name is the full field name, including the name of the JSON field.
    ///
    /// The returned field names can be used in queries.
    ///
    /// Notice: If your data contains JSON fields this is **very expensive**, as it requires
|
||||
/// browsing through the inverted index term dictionary and the columnar field dictionary.
|
||||
///
|
||||
/// Disclaimer: Some fields may not be listed here. For instance, if the schema contains a json
|
||||
/// field that is not indexed nor a fast field but is stored, it is possible for the field
|
||||
/// to not be listed.
|
||||
pub fn fields_metadata(&self) -> crate::Result<Vec<FieldMetadata>> {
|
||||
let mut indexed_fields: Vec<FieldMetadata> = Vec::new();
|
||||
let mut map_to_canonical = FnvHashMap::default();
|
||||
for (field, field_entry) in self.schema.fields() {
|
||||
for (field, field_entry) in self.schema().fields() {
|
||||
let field_name = field_entry.name().to_string();
|
||||
let is_indexed = field_entry.is_indexed();
|
||||
if is_indexed {
|
||||
@@ -469,7 +391,7 @@ impl SegmentReader for TantivySegmentReader {
|
||||
}
|
||||
}
|
||||
let fast_fields: Vec<FieldMetadata> = self
|
||||
.fast_fields_readers
|
||||
.fast_fields()
|
||||
.columnar()
|
||||
.iter_columns()?
|
||||
.map(|(mut field_name, handle)| {
|
||||
@@ -497,26 +419,31 @@ impl SegmentReader for TantivySegmentReader {
|
||||
Ok(merged_field_metadatas)
|
||||
}
|
||||
|
||||
fn segment_id(&self) -> SegmentId {
|
||||
/// Returns the segment id
|
||||
pub fn segment_id(&self) -> SegmentId {
|
||||
self.segment_id
|
||||
}
|
||||
|
||||
fn delete_opstamp(&self) -> Option<Opstamp> {
|
||||
/// Returns the delete opstamp
|
||||
pub fn delete_opstamp(&self) -> Option<Opstamp> {
|
||||
self.delete_opstamp
|
||||
}
|
||||
|
||||
fn alive_bitset(&self) -> Option<&AliveBitSet> {
|
||||
/// Returns the bitset representing the alive `DocId`s.
|
||||
pub fn alive_bitset(&self) -> Option<&AliveBitSet> {
|
||||
self.alive_bitset_opt.as_ref()
|
||||
}
|
||||
|
||||
fn is_deleted(&self, doc: DocId) -> bool {
|
||||
self.alive_bitset_opt
|
||||
.as_ref()
|
||||
/// Returns true if the `doc` is marked
|
||||
/// as deleted.
|
||||
pub fn is_deleted(&self, doc: DocId) -> bool {
|
||||
self.alive_bitset()
|
||||
.map(|alive_bitset| alive_bitset.is_deleted(doc))
|
||||
.unwrap_or(false)
|
||||
}
|
||||
|
||||
fn doc_ids_alive(&self) -> Box<dyn Iterator<Item = DocId> + Send + '_> {
|
||||
/// Returns an iterator that will iterate over the alive document ids
|
||||
pub fn doc_ids_alive(&self) -> Box<dyn Iterator<Item = DocId> + Send + '_> {
|
||||
if let Some(alive_bitset) = &self.alive_bitset_opt {
|
||||
Box::new(alive_bitset.iter_alive())
|
||||
} else {
|
||||
@@ -524,25 +451,22 @@ impl SegmentReader for TantivySegmentReader {
|
||||
}
|
||||
}
|
||||
|
||||
fn space_usage(&self) -> io::Result<SegmentSpaceUsage> {
|
||||
/// Summarize total space usage of this segment.
|
||||
pub fn space_usage(&self) -> io::Result<SegmentSpaceUsage> {
|
||||
Ok(SegmentSpaceUsage::new(
|
||||
self.num_docs,
|
||||
self.termdict_composite.space_usage(&self.schema),
|
||||
self.postings_composite.space_usage(&self.schema),
|
||||
self.positions_composite.space_usage(&self.schema),
|
||||
self.num_docs(),
|
||||
self.termdict_composite.space_usage(self.schema()),
|
||||
self.postings_composite.space_usage(self.schema()),
|
||||
self.positions_composite.space_usage(self.schema()),
|
||||
self.fast_fields_readers.space_usage()?,
|
||||
self.fieldnorm_readers.space_usage(&self.schema),
|
||||
TantivyStoreReader::open(self.store_file.clone(), 0)?.space_usage(),
|
||||
self.fieldnorm_readers.space_usage(self.schema()),
|
||||
self.get_store_reader(0)?.space_usage(),
|
||||
self.alive_bitset_opt
|
||||
.as_ref()
|
||||
.map(AliveBitSet::space_usage)
|
||||
.unwrap_or_default(),
|
||||
))
|
||||
}
|
||||
|
||||
fn clone_arc(&self) -> Arc<dyn SegmentReader> {
|
||||
Arc::new(self.clone())
|
||||
}
|
||||
}
|
||||
|
||||
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
|
||||
@@ -652,7 +576,7 @@ fn intersect_alive_bitset(
|
||||
}
|
||||
}
|
||||
|
||||
impl fmt::Debug for TantivySegmentReader {
|
||||
impl fmt::Debug for SegmentReader {
|
||||
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
|
||||
write!(f, "SegmentReader({:?})", self.segment_id)
|
||||
}
|
||||
|
||||
@@ -250,15 +250,11 @@ mod tests {
|
||||
|
||||
struct DummyWeight;
|
||||
impl Weight for DummyWeight {
|
||||
fn scorer(
|
||||
&self,
|
||||
_reader: &dyn SegmentReader,
|
||||
_boost: Score,
|
||||
) -> crate::Result<Box<dyn Scorer>> {
|
||||
fn scorer(&self, _reader: &SegmentReader, _boost: Score) -> crate::Result<Box<dyn Scorer>> {
|
||||
Err(crate::TantivyError::InternalError("dummy impl".to_owned()))
|
||||
}
|
||||
|
||||
fn explain(&self, _reader: &dyn SegmentReader, _doc: DocId) -> crate::Result<Explanation> {
|
||||
fn explain(&self, _reader: &SegmentReader, _doc: DocId) -> crate::Result<Explanation> {
|
||||
Err(crate::TantivyError::InternalError("dummy impl".to_owned()))
|
||||
}
|
||||
}
|
||||
|
||||
@@ -12,9 +12,7 @@ use super::{AddBatch, AddBatchReceiver, AddBatchSender, PreparedCommit};
|
||||
use crate::directory::{DirectoryLock, GarbageCollectionResult, TerminatingWrite};
|
||||
use crate::error::TantivyError;
|
||||
use crate::fastfield::write_alive_bitset;
|
||||
use crate::index::{
|
||||
Index, Segment, SegmentComponent, SegmentId, SegmentMeta, SegmentReader, TantivySegmentReader,
|
||||
};
|
||||
use crate::index::{Index, Segment, SegmentComponent, SegmentId, SegmentMeta, SegmentReader};
|
||||
use crate::indexer::delete_queue::{DeleteCursor, DeleteQueue};
|
||||
use crate::indexer::doc_opstamp_mapping::DocToOpstampMapping;
|
||||
use crate::indexer::index_writer_status::IndexWriterStatus;
|
||||
@@ -96,7 +94,7 @@ pub struct IndexWriter<D: Document = TantivyDocument> {
|
||||
|
||||
fn compute_deleted_bitset(
|
||||
alive_bitset: &mut BitSet,
|
||||
segment_reader: &dyn SegmentReader,
|
||||
segment_reader: &SegmentReader,
|
||||
delete_cursor: &mut DeleteCursor,
|
||||
doc_opstamps: &DocToOpstampMapping,
|
||||
target_opstamp: Opstamp,
|
||||
@@ -145,13 +143,7 @@ pub fn advance_deletes(
|
||||
return Ok(());
|
||||
}
|
||||
|
||||
let segment_reader = TantivySegmentReader::open_with_custom_alive_set_from_directory(
|
||||
segment.index().directory(),
|
||||
segment.meta(),
|
||||
segment.schema(),
|
||||
None,
|
||||
)?;
|
||||
let segment_reader: Arc<dyn SegmentReader> = Arc::new(segment_reader);
|
||||
let segment_reader = SegmentReader::open(&segment)?;
|
||||
|
||||
let max_doc = segment_reader.max_doc();
|
||||
let mut alive_bitset: BitSet = match segment_entry.alive_bitset() {
|
||||
@@ -163,7 +155,7 @@ pub fn advance_deletes(
|
||||
|
||||
compute_deleted_bitset(
|
||||
&mut alive_bitset,
|
||||
segment_reader.as_ref(),
|
||||
&segment_reader,
|
||||
segment_entry.delete_cursor(),
|
||||
&DocToOpstampMapping::None,
|
||||
target_opstamp,
|
||||
@@ -251,20 +243,14 @@ fn apply_deletes(
|
||||
.max()
|
||||
.expect("Empty DocOpstamp is forbidden");
|
||||
|
||||
let segment_reader = TantivySegmentReader::open_with_custom_alive_set_from_directory(
|
||||
segment.index().directory(),
|
||||
segment.meta(),
|
||||
segment.schema(),
|
||||
None,
|
||||
)?;
|
||||
let segment_reader: Arc<dyn SegmentReader> = Arc::new(segment_reader);
|
||||
let segment_reader = SegmentReader::open(segment)?;
|
||||
let doc_to_opstamps = DocToOpstampMapping::WithMap(doc_opstamps);
|
||||
|
||||
let max_doc = segment.meta().max_doc();
|
||||
let mut deleted_bitset = BitSet::with_max_value_and_full(max_doc);
|
||||
let may_have_deletes = compute_deleted_bitset(
|
||||
&mut deleted_bitset,
|
||||
segment_reader.as_ref(),
|
||||
&segment_reader,
|
||||
delete_cursor,
|
||||
&doc_to_opstamps,
|
||||
max_doc_opstamp,
|
||||
@@ -1979,9 +1965,9 @@ mod tests {
|
||||
.get_store_reader(DOCSTORE_CACHE_CAPACITY)
|
||||
.unwrap();
|
||||
// test store iterator
|
||||
for doc_id in segment_reader.doc_ids_alive() {
|
||||
let doc = store_reader.get(doc_id).unwrap();
|
||||
for doc in store_reader.iter::<TantivyDocument>(segment_reader.alive_bitset()) {
|
||||
let id = doc
|
||||
.unwrap()
|
||||
.get_first(id_field)
|
||||
.unwrap()
|
||||
.as_value()
|
||||
@@ -1992,7 +1978,7 @@ mod tests {
|
||||
// test store random access
|
||||
for doc_id in segment_reader.doc_ids_alive() {
|
||||
let id = store_reader
|
||||
.get(doc_id)
|
||||
.get::<TantivyDocument>(doc_id)
|
||||
.unwrap()
|
||||
.get_first(id_field)
|
||||
.unwrap()
|
||||
@@ -2001,7 +1987,7 @@ mod tests {
|
||||
assert!(expected_ids_and_num_occurrences.contains_key(&id));
|
||||
if id_is_full_doc(id) {
|
||||
let id2 = store_reader
|
||||
.get(doc_id)
|
||||
.get::<TantivyDocument>(doc_id)
|
||||
.unwrap()
|
||||
.get_first(multi_numbers)
|
||||
.unwrap()
|
||||
@@ -2009,13 +1995,13 @@ mod tests {
|
||||
.unwrap();
|
||||
assert_eq!(id, id2);
|
||||
let bool = store_reader
|
||||
.get(doc_id)
|
||||
.get::<TantivyDocument>(doc_id)
|
||||
.unwrap()
|
||||
.get_first(bool_field)
|
||||
.unwrap()
|
||||
.as_bool()
|
||||
.unwrap();
|
||||
let doc = store_reader.get(doc_id).unwrap();
|
||||
let doc = store_reader.get::<TantivyDocument>(doc_id).unwrap();
|
||||
let mut bool2 = doc.get_all(multi_bools);
|
||||
assert_eq!(bool, bool2.next().unwrap().as_bool().unwrap());
|
||||
assert_ne!(bool, bool2.next().unwrap().as_bool().unwrap());
|
||||
|
||||
@@ -3,7 +3,7 @@ mod tests {
|
||||
use crate::collector::TopDocs;
|
||||
use crate::fastfield::AliveBitSet;
|
||||
use crate::index::Index;
|
||||
use crate::postings::{DocFreq, Postings};
|
||||
use crate::postings::Postings;
|
||||
use crate::query::QueryParser;
|
||||
use crate::schema::{
|
||||
self, BytesOptions, Facet, FacetOptions, IndexRecordOption, NumericOptions,
|
||||
@@ -121,32 +121,21 @@ mod tests {
|
||||
let my_text_field = index.schema().get_field("text_field").unwrap();
|
||||
let term_a = Term::from_field_text(my_text_field, "text");
|
||||
let inverted_index = segment_reader.inverted_index(my_text_field).unwrap();
|
||||
let term_info = inverted_index.get_term_info(&term_a).unwrap().unwrap();
|
||||
let postings_for_test = crate::index::load_postings_from_terminfo(
|
||||
inverted_index.as_ref(),
|
||||
&term_info,
|
||||
IndexRecordOption::WithFreqsAndPositions,
|
||||
)
|
||||
.unwrap();
|
||||
let mut postings = inverted_index
|
||||
.read_postings(&term_a, IndexRecordOption::WithFreqsAndPositions)
|
||||
.unwrap()
|
||||
.unwrap();
|
||||
assert_eq!(postings.doc_freq(), 2);
|
||||
let fallback_bitset = AliveBitSet::for_test_from_deleted_docs(&[0], 100);
|
||||
assert_eq!(
|
||||
crate::indexer::merger::doc_freq_given_deletes(
|
||||
postings_for_test,
|
||||
postings.doc_freq_given_deletes(
|
||||
segment_reader.alive_bitset().unwrap_or(&fallback_bitset)
|
||||
),
|
||||
2
|
||||
);
|
||||
let postings = inverted_index
|
||||
.read_postings(&term_a, IndexRecordOption::WithFreqsAndPositions)
|
||||
.unwrap();
|
||||
assert_eq!(postings.unwrap().doc_freq(), DocFreq::Exact(2));
|
||||
let postings = inverted_index
|
||||
.read_postings(&term_a, IndexRecordOption::WithFreqsAndPositions)
|
||||
.unwrap();
|
||||
let mut postings = postings.unwrap();
|
||||
|
||||
assert_eq!(postings.term_freq(), 1);
|
||||
let mut output = Vec::new();
|
||||
let mut output = vec![];
|
||||
postings.positions(&mut output);
|
||||
assert_eq!(output, vec![1]);
|
||||
postings.advance();
|

@@ -1,4 +1,3 @@
use std::io;
use std::sync::Arc;

use columnar::{
@@ -16,11 +15,11 @@ use crate::fieldnorm::{FieldNormReader, FieldNormReaders, FieldNormsSerializer,
use crate::index::{Segment, SegmentComponent, SegmentReader};
use crate::indexer::doc_id_mapping::{MappingType, SegmentDocIdMapping};
use crate::indexer::SegmentSerializer;
use crate::postings::{InvertedIndexSerializer, Postings, TermInfo};
use crate::schema::{value_type_to_column_type, Field, FieldType, IndexRecordOption, Schema};
use crate::postings::{InvertedIndexSerializer, Postings, SegmentPostings};
use crate::schema::{value_type_to_column_type, Field, FieldType, Schema};
use crate::store::StoreWriter;
use crate::termdict::{TermMerger, TermOrdinal};
use crate::{DocAddress, DocId, DynInvertedIndexReader};
use crate::{DocAddress, DocId, InvertedIndexReader};

/// Segment's max doc must be `< MAX_DOC_LIMIT`.
///
@@ -28,7 +27,7 @@ use crate::{DocAddress, DocId, DynInvertedIndexReader};
pub const MAX_DOC_LIMIT: u32 = 1 << 31;

fn estimate_total_num_tokens_in_single_segment(
    reader: &dyn SegmentReader,
    reader: &SegmentReader,
    field: Field,
) -> crate::Result<u64> {
    // There are no deletes. We can simply use the exact value saved into the posting list.
@@ -40,7 +39,7 @@ fn estimate_total_num_tokens_in_single_segment(

    // When there are deletes, we use an approximation
    // based on the fieldnorm.
    if let Ok(fieldnorm_reader) = reader.get_fieldnorms_reader(field) {
    if let Some(fieldnorm_reader) = reader.fieldnorms_readers().get_field(field)? {
        let mut count: [usize; 256] = [0; 256];
        for doc in reader.doc_ids_alive() {
            let fieldnorm_id = fieldnorm_reader.fieldnorm_id(doc);
@@ -69,20 +68,17 @@ fn estimate_total_num_tokens_in_single_segment(
    Ok((segment_num_tokens as f64 * ratio) as u64)
}

fn estimate_total_num_tokens(
    readers: &[Arc<dyn SegmentReader>],
    field: Field,
) -> crate::Result<u64> {
fn estimate_total_num_tokens(readers: &[SegmentReader], field: Field) -> crate::Result<u64> {
    let mut total_num_tokens: u64 = 0;
    for reader in readers {
        total_num_tokens += estimate_total_num_tokens_in_single_segment(reader.as_ref(), field)?;
        total_num_tokens += estimate_total_num_tokens_in_single_segment(reader, field)?;
    }
    Ok(total_num_tokens)
}

pub struct IndexMerger {
    schema: Schema,
    pub(crate) readers: Vec<Arc<dyn SegmentReader>>,
    pub(crate) readers: Vec<SegmentReader>,
    max_doc: u32,
}

@@ -166,25 +162,16 @@ impl IndexMerger {
    // This can be used to merge but also apply an additional filter.
    // One use case is demux, which is basically taking a list of
    // segments and partitioning them, e.g. by a value in a field.
    //
    // # Panics if segments is empty.
    pub fn open_with_custom_alive_set(
        schema: Schema,
        segments: &[Segment],
        alive_bitset_opt: Vec<Option<AliveBitSet>>,
    ) -> crate::Result<IndexMerger> {
        assert!(!segments.is_empty());
        let mut readers = vec![];
        for (segment, new_alive_bitset_opt) in segments.iter().zip(alive_bitset_opt) {
            if segment.meta().num_docs() > 0 {
                let reader =
                    crate::TantivySegmentReader::open_with_custom_alive_set_from_directory(
                        segment.index().directory(),
                        segment.meta(),
                        segment.schema(),
                        new_alive_bitset_opt,
                    )?;
                let reader: Arc<dyn SegmentReader> = Arc::new(reader);
                    SegmentReader::open_with_custom_alive_set(segment, new_alive_bitset_opt)?;
                readers.push(reader);
            }
        }
@@ -275,7 +262,7 @@ impl IndexMerger {
            }),
        );

        let has_deletes: bool = self.readers.iter().any(|reader| reader.has_deletes());
        let has_deletes: bool = self.readers.iter().any(SegmentReader::has_deletes);
        let mapping_type = if has_deletes {
            MappingType::StackedWithDeletes
        } else {
@@ -310,7 +297,7 @@ impl IndexMerger {

        let mut max_term_ords: Vec<TermOrdinal> = Vec::new();

        let field_readers: Vec<Arc<dyn DynInvertedIndexReader>> = self
        let field_readers: Vec<Arc<InvertedIndexReader>> = self
            .readers
            .iter()
            .map(|reader| reader.inverted_index(indexed_field))
@@ -368,8 +355,7 @@ impl IndexMerger {
            indexed. Have you modified the schema?",
        );

        let mut segment_postings_containing_the_term: Vec<(usize, Box<dyn Postings>)> =
            Vec::with_capacity(self.readers.len());
        let mut segment_postings_containing_the_term: Vec<(usize, SegmentPostings)> = vec![];

        while merged_terms.advance() {
            segment_postings_containing_the_term.clear();
@@ -380,15 +366,18 @@ impl IndexMerger {
            // Let's compute the list of non-empty posting lists
            for (segment_ord, term_info) in merged_terms.current_segment_ords_and_term_infos() {
                let segment_reader = &self.readers[segment_ord];
                let inverted_index = &field_readers[segment_ord];
                if let Some((doc_freq, postings)) = postings_for_merge(
                    inverted_index.as_ref(),
                    &term_info,
                    segment_postings_option,
                    segment_reader.alive_bitset(),
                )? {
                let inverted_index: &InvertedIndexReader = &field_readers[segment_ord];
                let segment_postings = inverted_index
                    .read_postings_from_terminfo(&term_info, segment_postings_option)?;
                let alive_bitset_opt = segment_reader.alive_bitset();
                let doc_freq = if let Some(alive_bitset) = alive_bitset_opt {
                    segment_postings.doc_freq_given_deletes(alive_bitset)
                } else {
                    segment_postings.doc_freq()
                };
                if doc_freq > 0u32 {
                    total_doc_freq += doc_freq;
                    segment_postings_containing_the_term.push((segment_ord, postings));
                    segment_postings_containing_the_term.push((segment_ord, segment_postings));
                }
            }

@@ -406,7 +395,11 @@ impl IndexMerger {
            assert!(!segment_postings_containing_the_term.is_empty());

            let has_term_freq = {
                let has_term_freq = segment_postings_containing_the_term[0].1.has_freq();
                let has_term_freq = !segment_postings_containing_the_term[0]
                    .1
                    .block_cursor
                    .freqs()
                    .is_empty();
                for (_, postings) in &segment_postings_containing_the_term[1..] {
                    // This may look like a strange way to test whether we have term freqs or not.
                    // With a JSON object, the schema is not sufficient to know whether a term
@@ -422,7 +415,7 @@ impl IndexMerger {
                    //
                    // Overall, the reliable way to know if we have actual frequencies loaded or
                    // not is to check whether the actual decoded array is empty or not.
                    if postings.has_freq() != has_term_freq {
                    if has_term_freq == postings.block_cursor.freqs().is_empty() {
                        return Err(DataCorruption::comment_only(
                            "Term freqs are inconsistent across segments",
                        )
@@ -497,7 +490,33 @@ impl IndexMerger {
        debug_time!("write-storable-fields");
        debug!("write-storable-field");

        store_writer.merge_segment_readers(&self.readers)?;
        for reader in &self.readers {
            let store_reader = reader.get_store_reader(1)?;
            if reader.has_deletes()
                // If there is not enough data in the store, we avoid stacking in order to
                // avoid creating many small blocks in the doc store. Once we have 5 full blocks,
                // we start stacking. In the worst case 2/7 of the blocks would be very small.
                // [segment 1 - {1 doc}][segment 2 - {fullblock * 5}{1doc}]
                // => 5 * full blocks, 2 * 1 document blocks
                //
                // In a more realistic scenario the segments are of the same size, so 1/6 of
                // the doc stores would be on average half full, given total randomness (which
                // is not the case here, but not sure how it behaves exactly).
                //
                // https://github.com/quickwit-oss/tantivy/issues/1053
                //
                // take 7 in order to not walk over all checkpoints.
                || store_reader.block_checkpoints().take(7).count() < 6
                || store_reader.decompressor() != store_writer.compressor().into()
            {
                for doc_bytes_res in store_reader.iter_raw(reader.alive_bitset()) {
                    let doc_bytes = doc_bytes_res?;
                    store_writer.store_bytes(&doc_bytes)?;
                }
            } else {
                store_writer.stack(store_reader)?;
            }
        }
        Ok(())
    }
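The doc-store merge loop above only stacks a segment's store wholesale when it has no deletes, has at least 6 full blocks among the first 7 checkpoints, and uses the same compressor as the writer; otherwise it re-writes documents one by one. A minimal sketch of that decision, with hypothetical names (not tantivy's API):

```rust
// Stacking decision sketch: `checkpoints_in_first_7` mirrors
// `store_reader.block_checkpoints().take(7).count()` from the code above.
fn should_stack(has_deletes: bool, checkpoints_in_first_7: usize, same_compressor: bool) -> bool {
    !has_deletes && checkpoints_in_first_7 >= 6 && same_compressor
}

fn main() {
    // A small store (fewer than 6 blocks among the first 7) is re-written doc by doc.
    assert!(!should_stack(false, 3, true));
    // A large, delete-free store with a matching compressor is stacked.
    assert!(should_stack(false, 7, true));
    // Deletes or a compressor mismatch always force the re-write path.
    assert!(!should_stack(true, 7, true));
    assert!(!should_stack(false, 7, false));
}
```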

@@ -534,75 +553,6 @@ impl IndexMerger {
    }
}

/// Compute the number of non-deleted documents.
///
/// This method will scan through the posting lists, consuming them
/// (this is a rather expensive operation).
pub(crate) fn doc_freq_given_deletes(
    mut postings: Box<dyn Postings>,
    alive_bitset: &AliveBitSet,
) -> u32 {
    let mut doc_freq = 0;
    loop {
        let doc = postings.doc();
        if doc == TERMINATED {
            return doc_freq;
        }
        if alive_bitset.is_alive(doc) {
            doc_freq += 1u32;
        }
        postings.advance();
    }
}

fn read_postings_for_merge(
    inverted_index: &dyn DynInvertedIndexReader,
    term_info: &TermInfo,
    option: IndexRecordOption,
) -> io::Result<Box<dyn Postings>> {
    crate::index::load_postings_from_terminfo(inverted_index, term_info, option)
}

fn postings_for_merge(
    inverted_index: &dyn DynInvertedIndexReader,
    term_info: &TermInfo,
    option: IndexRecordOption,
    alive_bitset_opt: Option<&AliveBitSet>,
) -> io::Result<Option<(u32, Box<dyn Postings>)>> {
    // TODO: avoid loading postings twice — once for counting, once for writing
    let count_postings = read_postings_for_merge(inverted_index, term_info, option)?;
    let doc_freq = if let Some(alive_bitset) = alive_bitset_opt {
        doc_freq_given_deletes(count_postings, alive_bitset)
    } else {
        // We need an exact document frequency here.
        match count_postings.doc_freq() {
            crate::postings::DocFreq::Exact(doc_freq) => doc_freq,
            crate::postings::DocFreq::Approximate(_) => exact_doc_freq(count_postings),
        }
    };

    if doc_freq == 0u32 {
        return Ok(None);
    }

    let postings = read_postings_for_merge(inverted_index, term_info, option)?;
    Ok(Some((doc_freq, postings)))
}

/// If the postings are not able to inform us of the document frequency,
/// we just scan through them.
pub(crate) fn exact_doc_freq(mut postings: Box<dyn Postings>) -> u32 {
    let mut doc_freq = 0;
    loop {
        let doc = postings.doc();
        if doc == TERMINATED {
            return doc_freq;
        }
        doc_freq += 1u32;
        postings.advance();
    }
}
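The two scan helpers above walk a postings cursor to its end, counting either all documents or only the alive ones. The same logic can be sketched with a `&[u32]` doc list standing in for the postings cursor and a `Vec<bool>` for the alive bitset (assumptions for illustration; the real code consumes a `Box<dyn Postings>`):

```rust
// Count only the documents that the alive bitset marks as alive.
fn doc_freq_given_deletes(postings: &[u32], alive: &[bool]) -> u32 {
    postings.iter().filter(|&&doc| alive[doc as usize]).count() as u32
}

// Without deletes, the document frequency is just the length of the list.
fn exact_doc_freq(postings: &[u32]) -> u32 {
    postings.len() as u32
}

fn main() {
    let postings = [0u32, 2, 10];
    let mut alive = vec![true; 12];
    alive[2] = false; // doc 2 is deleted
    assert_eq!(exact_doc_freq(&postings), 3);
    assert_eq!(doc_freq_given_deletes(&postings, &alive), 2);
}
```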

#[cfg(test)]
mod tests {

@@ -615,10 +565,8 @@ mod tests {
        BytesFastFieldTestCollector, FastFieldTestCollector, TEST_COLLECTOR_WITH_SCORE,
    };
    use crate::collector::{Count, FacetCollector};
    use crate::fastfield::AliveBitSet;
    use crate::index::{Index, SegmentId};
    use crate::indexer::NoMergePolicy;
    use crate::postings::{DocFreq, Postings as _, SegmentPostings};
    use crate::query::{AllQuery, BooleanQuery, EnableScoring, Scorer, TermQuery};
    use crate::schema::{
        Facet, FacetOptions, IndexRecordOption, NumericOptions, TantivyDocument, Term,
@@ -733,32 +681,32 @@ mod tests {
            );
        }
        {
            let doc = searcher.doc(DocAddress::new(0, 0))?;
            let doc = searcher.doc::<TantivyDocument>(DocAddress::new(0, 0))?;
            assert_eq!(
                doc.get_first(text_field).unwrap().as_value().as_str(),
                Some("af b")
            );
        }
        {
            let doc = searcher.doc(DocAddress::new(0, 1))?;
            let doc = searcher.doc::<TantivyDocument>(DocAddress::new(0, 1))?;
            assert_eq!(
                doc.get_first(text_field).unwrap().as_value().as_str(),
                Some("a b c")
            );
        }
        {
            let doc = searcher.doc(DocAddress::new(0, 2))?;
            let doc = searcher.doc::<TantivyDocument>(DocAddress::new(0, 2))?;
            assert_eq!(
                doc.get_first(text_field).unwrap().as_value().as_str(),
                Some("a b c d")
            );
        }
        {
            let doc = searcher.doc(DocAddress::new(0, 3))?;
            let doc = searcher.doc::<TantivyDocument>(DocAddress::new(0, 3))?;
            assert_eq!(doc.get_first(text_field).unwrap().as_str(), Some("af b"));
        }
        {
            let doc = searcher.doc(DocAddress::new(0, 4))?;
            let doc = searcher.doc::<TantivyDocument>(DocAddress::new(0, 4))?;
            assert_eq!(doc.get_first(text_field).unwrap().as_str(), Some("a b c g"));
        }

@@ -1570,10 +1518,10 @@ mod tests {
        let searcher = reader.searcher();
        let mut term_scorer = term_query
            .specialized_weight(EnableScoring::enabled_from_searcher(&searcher))?
            .term_scorer_for_test(searcher.segment_reader(0u32), 1.0)
            .term_scorer_for_test(searcher.segment_reader(0u32), 1.0)?
            .unwrap();
        assert_eq!(term_scorer.doc(), 0);
        assert_nearly_equals!(term_scorer.seek_block_max(0), 0.0079681855);
        assert_nearly_equals!(term_scorer.block_max_score(), 0.0079681855);
        assert_nearly_equals!(term_scorer.score(), 0.0079681855);
        for _ in 0..81 {
            writer.add_document(doc!(text=>"hello happy tax payer"))?;
@@ -1586,13 +1534,13 @@ mod tests {
        for segment_reader in searcher.segment_readers() {
            let mut term_scorer = term_query
                .specialized_weight(EnableScoring::enabled_from_searcher(&searcher))?
                .term_scorer_for_test(segment_reader.as_ref(), 1.0)
                .term_scorer_for_test(segment_reader, 1.0)?
                .unwrap();
            // the difference compared to before is intrinsic to the bm25 formula. no worries
            // there.
            for doc in segment_reader.doc_ids_alive() {
                assert_eq!(term_scorer.doc(), doc);
                assert_nearly_equals!(term_scorer.seek_block_max(doc), 0.003478312);
                assert_nearly_equals!(term_scorer.block_max_score(), 0.003478312);
                assert_nearly_equals!(term_scorer.score(), 0.003478312);
                term_scorer.advance();
            }
@@ -1612,12 +1560,12 @@ mod tests {
        let segment_reader = searcher.segment_reader(0u32);
        let mut term_scorer = term_query
            .specialized_weight(EnableScoring::enabled_from_searcher(&searcher))?
            .term_scorer_for_test(segment_reader, 1.0)
            .term_scorer_for_test(segment_reader, 1.0)?
            .unwrap();
        // the difference compared to before is intrinsic to the bm25 formula. no worries there.
        for doc in segment_reader.doc_ids_alive() {
            assert_eq!(term_scorer.doc(), doc);
            assert_nearly_equals!(term_scorer.seek_block_max(doc), 0.003478312);
            assert_nearly_equals!(term_scorer.block_max_score(), 0.003478312);
            assert_nearly_equals!(term_scorer.score(), 0.003478312);
            term_scorer.advance();
        }
@@ -1631,19 +1579,4 @@ mod tests {
        assert!(((super::MAX_DOC_LIMIT - 1) as i32) >= 0);
        assert!((super::MAX_DOC_LIMIT as i32) < 0);
    }

    #[test]
    fn test_doc_freq_given_delete() {
        let docs = SegmentPostings::create_from_docs(&[0, 2, 10]);
        assert_eq!(docs.doc_freq(), DocFreq::Exact(3));
        let alive_bitset = AliveBitSet::for_test_from_deleted_docs(&[2], 12);
        let docs_boxed: Box<dyn crate::postings::Postings> =
            Box::new(SegmentPostings::create_from_docs(&[0, 2, 10]));
        assert_eq!(super::doc_freq_given_deletes(docs_boxed, &alive_bitset), 2);
        let all_deleted =
            AliveBitSet::for_test_from_deleted_docs(&[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], 12);
        let docs_boxed: Box<dyn crate::postings::Postings> =
            Box::new(SegmentPostings::create_from_docs(&[0, 2, 10]));
        assert_eq!(super::doc_freq_given_deletes(docs_boxed, &all_deleted), 0);
    }
}

@@ -13,9 +13,7 @@ use super::segment_manager::SegmentManager;
use crate::core::META_FILEPATH;
use crate::directory::{Directory, DirectoryClone, GarbageCollectionResult};
use crate::fastfield::AliveBitSet;
use crate::index::{
    CodecConfiguration, Index, IndexMeta, IndexSettings, Segment, SegmentId, SegmentMeta,
};
use crate::index::{Index, IndexMeta, IndexSettings, Segment, SegmentId, SegmentMeta};
use crate::indexer::delete_queue::DeleteCursor;
use crate::indexer::index_writer::advance_deletes;
use crate::indexer::merge_operation::MergeOperationInventory;
@@ -141,9 +139,9 @@ fn merge(
/// meant to work if you have an `IndexWriter` running for the origin indices, or
/// the destination `Index`.
#[doc(hidden)]
pub fn merge_indices(
pub fn merge_indices<T: Into<Box<dyn Directory>>>(
    indices: &[Index],
    output_directory: Box<dyn Directory>,
    output_directory: T,
) -> crate::Result<Index> {
    if indices.is_empty() {
        // If there are no indices to merge, there is no need to do anything.
@@ -213,11 +211,11 @@ pub fn merge_filtered_segments<T: Into<Box<dyn Directory>>>(
        ));
    }

    let mut merged_index: Index = Index::builder()
        .schema(target_schema.clone())
        .settings(target_settings.clone())
        .create(output_directory.into())?;

    let mut merged_index = Index::create(
        output_directory,
        target_schema.clone(),
        target_settings.clone(),
    )?;
    let merged_segment = merged_index.new_segment();
    let merged_segment_id = merged_segment.id();
    let merger: IndexMerger =
@@ -237,7 +235,6 @@ pub fn merge_filtered_segments<T: Into<Box<dyn Directory>>>(
        ))
        .trim_end()
    );
    let codec_configuration = CodecConfiguration::default();

    let index_meta = IndexMeta {
        index_settings: target_settings, // index_settings of all segments should be the same
@@ -245,7 +242,6 @@ pub fn merge_filtered_segments<T: Into<Box<dyn Directory>>>(
        schema: target_schema,
        opstamp: 0u64,
        payload: Some(stats),
        codec: codec_configuration,
    };

    // save the meta.json
@@ -279,7 +275,7 @@ impl SegmentUpdater {
        stamper: Stamper,
        delete_cursor: &DeleteCursor,
        num_merge_threads: usize,
    ) -> crate::Result<Self> {
    ) -> crate::Result<SegmentUpdater> {
        let segments = index.searchable_segment_metas()?;
        let segment_manager = SegmentManager::from_segments(segments, delete_cursor);
        let pool = ThreadPoolBuilder::new()
@@ -415,7 +411,6 @@ impl SegmentUpdater {
            schema: index.schema(),
            opstamp,
            payload: commit_message,
            codec: CodecConfiguration::default(),
        };
        // TODO add context to the error.
        save_metas(&index_meta, directory.box_clone().borrow_mut())?;
@@ -935,7 +930,7 @@ mod tests {

    #[test]
    fn test_merge_empty_indices_array() {
        let merge_result = merge_indices(&[], Box::new(RamDirectory::default()));
        let merge_result = merge_indices(&[], RamDirectory::default());
        assert!(merge_result.is_err());
    }
||||
@@ -962,10 +957,7 @@ mod tests {
|
||||
};
|
||||
|
||||
// mismatched schema index list
|
||||
let result = merge_indices(
|
||||
&[first_index, second_index],
|
||||
Box::new(RamDirectory::default()),
|
||||
);
|
||||
let result = merge_indices(&[first_index, second_index], RamDirectory::default());
|
||||
assert!(result.is_err());
|
||||
|
||||
Ok(())
|
||||
|
||||
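The `merge_indices` change above replaces a concrete `Box<dyn Directory>` parameter with a generic `T: Into<Box<dyn Directory>>`, so callers can pass `RamDirectory::default()` directly instead of boxing it themselves. A minimal sketch of that pattern with a stand-in trait (all names here are hypothetical, not tantivy's API):

```rust
// Stand-in for a directory abstraction (hypothetical, for illustration).
trait Directory {
    fn name(&self) -> &'static str;
}

struct RamDirectory;

impl Directory for RamDirectory {
    fn name(&self) -> &'static str {
        "ram"
    }
}

// Any concrete directory can be converted into a boxed trait object.
impl From<RamDirectory> for Box<dyn Directory> {
    fn from(dir: RamDirectory) -> Box<dyn Directory> {
        Box::new(dir)
    }
}

// Old style: the caller must box the directory themselves.
fn merge_old(output_directory: Box<dyn Directory>) -> &'static str {
    output_directory.name()
}

// New style: accept anything convertible into `Box<dyn Directory>`.
fn merge_new<T: Into<Box<dyn Directory>>>(output_directory: T) -> &'static str {
    output_directory.into().name()
}

fn main() {
    assert_eq!(merge_old(Box::new(RamDirectory)), "ram");
    // The caller can now pass the value directly, or still pass a box
    // (the blanket `From<T> for T` impl covers the already-boxed case).
    assert_eq!(merge_new(RamDirectory), "ram");
    assert_eq!(merge_new(Box::new(RamDirectory) as Box<dyn Directory>), "ram");
}
```

The generic bound keeps the old call sites compiling while removing boilerplate from new ones.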
@@ -12,7 +12,7 @@ use crate::indexer::segment_serializer::SegmentSerializer;
use crate::json_utils::{index_json_value, IndexingPositionsPerPath};
use crate::postings::{
compute_table_memory_size, serialize_postings, IndexingContext, IndexingPosition,
PerFieldPostingsWriter, PostingsWriter, PostingsWriterEnum,
PerFieldPostingsWriter, PostingsWriter,
};
use crate::schema::document::{Document, Value};
use crate::schema::{FieldEntry, FieldType, Schema, DATE_TIME_PRECISION_INDEXED};
@@ -169,7 +169,7 @@ impl SegmentWriter {
}

let (term_buffer, ctx) = (&mut self.term_buffer, &mut self.ctx);
let postings_writer: &mut PostingsWriterEnum =
let postings_writer: &mut dyn PostingsWriter =
self.per_field_postings_writers.get_for_field_mut(field);
term_buffer.clear_with_field(field);

@@ -434,7 +434,7 @@ mod tests {
Document, IndexRecordOption, OwnedValue, Schema, TextFieldIndexing, TextOptions, Value,
DATE_TIME_PRECISION_INDEXED, FAST, STORED, STRING, TEXT,
};
use crate::store::{Compressor, StoreWriter, TantivyStoreReader};
use crate::store::{Compressor, StoreReader, StoreWriter};
use crate::time::format_description::well_known::Rfc3339;
use crate::time::OffsetDateTime;
use crate::tokenizer::{PreTokenizedString, Token};
@@ -482,8 +482,8 @@ mod tests {
store_writer.store(&doc, &schema).unwrap();
store_writer.close().unwrap();

let reader = TantivyStoreReader::open(directory.open_read(path).unwrap(), 0).unwrap();
let doc = reader.get(0).unwrap();
let reader = StoreReader::open(directory.open_read(path).unwrap(), 0).unwrap();
let doc = reader.get::<TantivyDocument>(0).unwrap();

assert_eq!(doc.field_values().count(), 2);
assert_eq!(
@@ -600,12 +600,16 @@ mod tests {
let reader = index.reader().unwrap();
let searcher = reader.searcher();
let doc = searcher
.doc(DocAddress {
.doc::<TantivyDocument>(DocAddress {
segment_ord: 0u32,
doc_id: 0u32,
})
.unwrap();
let serdeser_json_val = doc.to_json(&schema).get("json").unwrap().clone();
let serdeser_json_val = serde_json::from_str::<serde_json::Value>(&doc.to_json(&schema))
.unwrap()
.get("json")
.unwrap()[0]
.clone();
assert_eq!(json_val, serdeser_json_val);
let segment_reader = searcher.segment_reader(0u32);
let inv_idx = segment_reader.inverted_index(json_field).unwrap();
@@ -867,7 +871,7 @@ mod tests {
let searcher = reader.searcher();
let segment_reader = searcher.segment_reader(0u32);

fn assert_type(reader: &dyn SegmentReader, field: &str, typ: ColumnType) {
fn assert_type(reader: &SegmentReader, field: &str, typ: ColumnType) {
let cols = reader.fast_fields().dynamic_column_handles(field).unwrap();
assert_eq!(cols.len(), 1, "{field}");
assert_eq!(cols[0].column_type(), typ, "{field}");
@@ -886,7 +890,7 @@ mod tests {
assert_type(segment_reader, "json.my_arr", ColumnType::I64);
assert_type(segment_reader, "json.my_arr.my_key", ColumnType::Str);

fn assert_empty(reader: &dyn SegmentReader, field: &str) {
fn assert_empty(reader: &SegmentReader, field: &str) {
let cols = reader.fast_fields().dynamic_column_handles(field).unwrap();
assert_eq!(cols.len(), 0);
}
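The `SegmentWriter` hunk above switches between two dispatch styles for the postings writer: an enum wrapper (`PostingsWriterEnum`) and a `&mut dyn PostingsWriter` trait object. A toy sketch contrasting the two (names are illustrative, not tantivy's types):

```rust
// Trait shared by both dispatch styles.
trait Writer {
    fn write(&mut self, doc: u32);
    fn count(&self) -> usize;
}

struct TextWriter { docs: Vec<u32> }
struct NumWriter { docs: Vec<u32> }

impl Writer for TextWriter {
    fn write(&mut self, doc: u32) { self.docs.push(doc); }
    fn count(&self) -> usize { self.docs.len() }
}
impl Writer for NumWriter {
    fn write(&mut self, doc: u32) { self.docs.push(doc); }
    fn count(&self) -> usize { self.docs.len() }
}

// Enum dispatch: the concrete type is matched at the call site,
// so the compiler can inline each arm (no vtable indirection).
enum WriterEnum {
    Text(TextWriter),
    Num(NumWriter),
}

impl WriterEnum {
    fn write(&mut self, doc: u32) {
        match self {
            WriterEnum::Text(w) => w.write(doc),
            WriterEnum::Num(w) => w.write(doc),
        }
    }
    fn count(&self) -> usize {
        match self {
            WriterEnum::Text(w) => w.count(),
            WriterEnum::Num(w) => w.count(),
        }
    }
}

fn main() {
    // Dynamic dispatch through a trait object.
    let mut dynamic: Box<dyn Writer> = Box::new(TextWriter { docs: Vec::new() });
    dynamic.write(1);
    assert_eq!(dynamic.count(), 1);

    // Static dispatch through the enum.
    let mut static_disp = WriterEnum::Num(NumWriter { docs: Vec::new() });
    static_disp.write(1);
    static_disp.write(2);
    assert_eq!(static_disp.count(), 2);
}
```

Enum dispatch trades a closed set of variants for better inlining in hot indexing loops; the trait object keeps the set of writers open.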
@@ -1,6 +1,5 @@
use std::marker::PhantomData;

use crate::index::CodecConfiguration;
use crate::indexer::operation::AddOperation;
use crate::indexer::segment_updater::save_metas;
use crate::indexer::SegmentWriter;
@@ -12,7 +11,7 @@ pub struct SingleSegmentIndexWriter<D: Document = TantivyDocument> {
segment_writer: SegmentWriter,
segment: Segment,
opstamp: Opstamp,
_doc: PhantomData<D>,
_phantom: PhantomData<D>,
}

impl<D: Document> SingleSegmentIndexWriter<D> {
@@ -23,7 +22,7 @@ impl<D: Document> SingleSegmentIndexWriter<D> {
segment_writer,
segment,
opstamp: 0,
_doc: PhantomData,
_phantom: PhantomData,
})
}

@@ -41,7 +40,7 @@ impl<D: Document> SingleSegmentIndexWriter<D> {
pub fn finalize(self) -> crate::Result<Index> {
let max_doc = self.segment_writer.max_doc();
self.segment_writer.finalize()?;
let segment = self.segment.with_max_doc(max_doc);
let segment: Segment = self.segment.with_max_doc(max_doc);
let index = segment.index();
let index_meta = IndexMeta {
index_settings: index.settings().clone(),
@@ -49,7 +48,6 @@ impl<D: Document> SingleSegmentIndexWriter<D> {
schema: index.schema(),
opstamp: 0,
payload: None,
codec: CodecConfiguration::default(),
};
save_metas(&index_meta, index.directory())?;
index.directory().sync_directory()?;

src/lib.rs (16 changed lines)
@@ -93,7 +93,7 @@
//!
//! for (_score, doc_address) in top_docs {
//! // Retrieve the actual content of documents given its `doc_address`.
//! let retrieved_doc = searcher.doc(doc_address)?;
//! let retrieved_doc = searcher.doc::<TantivyDocument>(doc_address)?;
//! println!("{}", retrieved_doc.to_json(&schema));
//! }
//!
@@ -166,7 +166,6 @@ mod functional_test;

#[macro_use]
mod macros;

mod future_result;

// Re-exports
@@ -224,12 +223,11 @@ use once_cell::sync::Lazy;
use serde::{Deserialize, Serialize};

pub use self::docset::{DocSet, COLLECT_BLOCK_BUFFER_LEN, TERMINATED};
pub use crate::core::{json_utils, Executor, Searcher, SearcherContext, SearcherGeneration};
pub use crate::core::{json_utils, Executor, Searcher, SearcherGeneration};
pub use crate::directory::Directory;
pub use crate::index::{
try_downcast_and_call, DynInvertedIndexReader, Index, IndexBuilder, IndexMeta, IndexSettings,
InvertedIndexReader, Order, Segment, SegmentMeta, SegmentReader, TantivyInvertedIndexReader,
TantivySegmentReader, TypedInvertedIndexReaderCb,
Index, IndexBuilder, IndexMeta, IndexSettings, InvertedIndexReader, Order, Segment,
SegmentMeta, SegmentReader,
};
pub use crate::indexer::{IndexWriter, SingleSegmentIndexWriter};
pub use crate::schema::{Document, TantivyDocument, Term};
@@ -549,7 +547,7 @@ pub mod tests {
index_writer.commit()?;
let reader = index.reader()?;
let searcher = reader.searcher();
let segment_reader: &dyn SegmentReader = searcher.segment_reader(0);
let segment_reader: &SegmentReader = searcher.segment_reader(0);
let fieldnorms_reader = segment_reader.get_fieldnorms_reader(text_field)?;
assert_eq!(fieldnorms_reader.fieldnorm(0), 3);
assert_eq!(fieldnorms_reader.fieldnorm(1), 0);
@@ -557,7 +555,7 @@ pub mod tests {
Ok(())
}

fn advance_undeleted(docset: &mut dyn DocSet, reader: &dyn SegmentReader) -> bool {
fn advance_undeleted(docset: &mut dyn DocSet, reader: &SegmentReader) -> bool {
let mut doc = docset.advance();
while doc != TERMINATED {
if !reader.is_deleted(doc) {
@@ -1074,7 +1072,7 @@ pub mod tests {
}
let reader = index.reader()?;
let searcher = reader.searcher();
let segment_reader: &dyn SegmentReader = searcher.segment_reader(0);
let segment_reader: &SegmentReader = searcher.segment_reader(0);
{
let fast_field_reader_res = segment_reader.fast_fields().u64("text");
assert!(fast_field_reader_res.is_err());
@@ -1,17 +1,26 @@
use std::io;

use common::{OwnedBytes, VInt};
use common::VInt;

use super::FreqReadingOption;
use crate::directory::{FileSlice, OwnedBytes};
use crate::fieldnorm::FieldNormReader;
use crate::postings::compression::{BlockDecoder, VIntDecoder as _, COMPRESSION_BLOCK_SIZE};
use crate::postings::skip::{BlockInfo, SkipReader};
use crate::postings::compression::{BlockDecoder, VIntDecoder, COMPRESSION_BLOCK_SIZE};
use crate::postings::{BlockInfo, FreqReadingOption, SkipReader};
use crate::query::Bm25Weight;
use crate::schema::IndexRecordOption;
use crate::{DocId, Score, TERMINATED};

fn max_score<I: Iterator<Item = Score>>(mut it: I) -> Option<Score> {
it.next().map(|first| it.fold(first, Score::max))
}

/// `BlockSegmentPostings` is a cursor iterating over blocks
/// of documents.
///
/// # Warning
///
/// While it is useful for some very specific high-performance
/// use cases, you should prefer using `SegmentPostings` for most usage.
#[derive(Clone)]
pub struct BlockSegmentPostings {
pub(crate) doc_decoder: BlockDecoder,
@@ -79,18 +88,19 @@ fn split_into_skips_and_postings(
}

impl BlockSegmentPostings {
/// Opens a `StandardPostingsReader`.
/// Opens a `BlockSegmentPostings`.
/// `doc_freq` is the number of documents in the posting list.
/// `record_option` represents the amount of data available according to the schema.
/// `requested_option` is the amount of data requested by the user.
/// If, for instance, we do not request term frequencies, this function will not decompress
/// term frequency blocks.
pub fn open(
pub(crate) fn open(
doc_freq: u32,
bytes: OwnedBytes,
data: FileSlice,
mut record_option: IndexRecordOption,
requested_option: IndexRecordOption,
) -> io::Result<BlockSegmentPostings> {
let bytes = data.read_bytes()?;
let (skip_data_opt, postings_data) = split_into_skips_and_postings(doc_freq, bytes)?;
let skip_reader = match skip_data_opt {
Some(skip_data) => {
@@ -128,86 +138,6 @@ impl BlockSegmentPostings {
block_segment_postings.load_block();
Ok(block_segment_postings)
}
}

fn max_score<I: Iterator<Item = Score>>(mut it: I) -> Option<Score> {
it.next().map(|first| it.fold(first, Score::max))
}

impl BlockSegmentPostings {
/// Returns the overall number of documents in the block postings.
/// It does not take into account whether documents are deleted or not.
///
/// This `doc_freq` is simply the sum of the lengths of all of the blocks,
/// and it does not take into account deleted documents.
pub fn doc_freq(&self) -> u32 {
self.doc_freq
}

/// Returns the array of docs in the current block.
///
/// Before the first call to `.advance()`, the block
/// returned by `.docs()` is empty.
#[inline]
pub fn docs(&self) -> &[DocId] {
debug_assert!(self.block_loaded);
self.doc_decoder.output_array()
}

/// Return the document at index `idx` of the block.
#[inline]
pub fn doc(&self, idx: usize) -> u32 {
self.doc_decoder.output(idx)
}

/// Return the array of `term freq` in the block.
#[inline]
pub fn freqs(&self) -> &[u32] {
debug_assert!(self.block_loaded);
self.freq_decoder.output_array()
}

/// Return the frequency at index `idx` of the block.
#[inline]
pub fn freq(&self, idx: usize) -> u32 {
debug_assert!(self.block_loaded);
self.freq_decoder.output(idx)
}

/// Position on a block that may contain `target_doc`.
///
/// If all docs are smaller than target, the block loaded may be empty,
/// or be the last, incomplete VInt block.
pub fn seek(&mut self, target_doc: DocId) -> usize {
// Move to the block that might contain our document.
self.seek_block_without_loading(target_doc);
self.load_block();

// At this point we are on the block that might contain our document.
let doc = self.doc_decoder.seek_within_block(target_doc);

// The last block is not full and padded with TERMINATED,
// so we are guaranteed to have at least one value (real or padding)
// that is >= target_doc.
debug_assert!(doc < COMPRESSION_BLOCK_SIZE);

// `doc` is now the first element >= `target_doc`.
// If all docs are smaller than target, the current block is incomplete and padded
// with TERMINATED. After the search, the cursor points to the first TERMINATED.
doc
}
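The `seek` logic above relies on an invariant: the last block is padded with `TERMINATED`, so a search within a fixed-size block always finds some element `>= target_doc`. A self-contained sketch of that invariant (the constants and helper name here are illustrative, not tantivy's; tantivy's block size is `COMPRESSION_BLOCK_SIZE`):

```rust
const BLOCK_SIZE: usize = 8; // stand-in for COMPRESSION_BLOCK_SIZE
const TERMINATED: u32 = u32::MAX;

// Returns the index of the first element >= target.
// Because the block is sorted and padded with TERMINATED,
// a valid index always exists.
fn seek_within_block(block: &[u32; BLOCK_SIZE], target: u32) -> usize {
    block.partition_point(|&doc| doc < target)
}

fn main() {
    // A last, partially filled block: 3 real docs, padded with TERMINATED.
    let mut block = [TERMINATED; BLOCK_SIZE];
    block[0] = 3;
    block[1] = 7;
    block[2] = 12;

    assert_eq!(seek_within_block(&block, 7), 1); // exact hit
    assert_eq!(seek_within_block(&block, 8), 2); // first doc >= 8 is 12
    // All real docs are smaller than the target:
    // the cursor lands on the first TERMINATED pad.
    let idx = seek_within_block(&block, 100);
    assert_eq!(idx, 3);
    assert_eq!(block[idx], TERMINATED);
}
```

The padding is what makes the `debug_assert!(doc < COMPRESSION_BLOCK_SIZE)` in the diff safe: the search can never run off the end of the block.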

pub fn position_offset(&self) -> u64 {
self.skip_reader.position_offset()
}

/// Advance to the next block.
pub fn advance(&mut self) {
self.skip_reader.advance();
self.block_loaded = false;
self.block_max_score_cache = None;
self.load_block();
}

/// Returns the block_max_score for the current block.
/// It does not require the block to be loaded. For instance, it is ok to call this method
@@ -230,7 +160,7 @@ impl BlockSegmentPostings {
}
// this is the last block of the segment posting list.
// If it is actually loaded, we can compute block max manually.
if self.block_loaded {
if self.block_is_loaded() {
let docs = self.doc_decoder.output_array().iter().cloned();
let freqs = self.freq_decoder.output_array().iter().cloned();
let bm25_scores = docs.zip(freqs).map(|(doc, term_freq)| {
@@ -247,25 +177,112 @@ impl BlockSegmentPostings {
// We do not cache it however, so that it gets computed once the block is loaded.
bm25_weight.max_score()
}
}

impl BlockSegmentPostings {
/// Returns an empty segment postings object
pub fn empty() -> BlockSegmentPostings {
BlockSegmentPostings {
doc_decoder: BlockDecoder::with_val(TERMINATED),
block_loaded: true,
freq_decoder: BlockDecoder::with_val(1),
freq_reading_option: FreqReadingOption::NoFreq,
block_max_score_cache: None,
doc_freq: 0,
data: OwnedBytes::empty(),
skip_reader: SkipReader::new(OwnedBytes::empty(), 0, IndexRecordOption::Basic),
}
pub(crate) fn freq_reading_option(&self) -> FreqReadingOption {
self.freq_reading_option
}

pub(crate) fn skip_reader(&self) -> &SkipReader {
&self.skip_reader
// Resets the block segment postings on another position
// in the postings file.
//
// This is useful for enumerating through a list of terms,
// and consuming the associated posting lists while avoiding
// reallocating a `BlockSegmentPostings`.
//
// # Warning
//
// This does not reset the positions list.
pub(crate) fn reset(&mut self, doc_freq: u32, postings_data: OwnedBytes) -> io::Result<()> {
let (skip_data_opt, postings_data) =
split_into_skips_and_postings(doc_freq, postings_data)?;
self.data = postings_data;
self.block_max_score_cache = None;
self.block_loaded = false;
if let Some(skip_data) = skip_data_opt {
self.skip_reader.reset(skip_data, doc_freq);
} else {
self.skip_reader.reset(OwnedBytes::empty(), doc_freq);
}
self.doc_freq = doc_freq;
self.load_block();
Ok(())
}
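The `reset` method above re-points one `BlockSegmentPostings` at another term's postings so that its decoder buffers are reused across many terms instead of being reallocated per term. A minimal sketch of this buffer-reuse pattern (the type and methods here are illustrative stand-ins, not tantivy's):

```rust
// A decoder that owns a reusable output buffer.
struct BlockDecoder {
    output: Vec<u32>,
}

impl BlockDecoder {
    fn new() -> BlockDecoder {
        BlockDecoder { output: Vec::with_capacity(128) }
    }

    // Decode new data into the existing buffer. `clear` keeps the
    // allocation, so repeated resets avoid reallocating.
    fn reset(&mut self, data: &[u32]) {
        self.output.clear();
        self.output.extend_from_slice(data);
    }

    fn docs(&self) -> &[u32] {
        &self.output
    }
}

fn main() {
    let mut decoder = BlockDecoder::new();

    // First term's postings.
    decoder.reset(&[0, 2, 4]);
    assert_eq!(decoder.docs(), &[0, 2, 4]);
    let capacity = decoder.output.capacity();

    // Second term reuses the same buffer: no new allocation is needed.
    decoder.reset(&[1, 3, 5]);
    assert_eq!(decoder.docs(), &[1, 3, 5]);
    assert_eq!(decoder.output.capacity(), capacity);
}
```

This matters when enumerating a dictionary with millions of terms: one long-lived reader amortizes all buffer allocations.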

/// Returns the overall number of documents in the block postings.
/// It does not take into account whether documents are deleted or not.
///
/// This `doc_freq` is simply the sum of the lengths of all of the blocks,
/// and it does not take into account deleted documents.
pub fn doc_freq(&self) -> u32 {
self.doc_freq
}

/// Returns the array of docs in the current block.
///
/// Before the first call to `.advance()`, the block
/// returned by `.docs()` is empty.
#[inline]
pub fn docs(&self) -> &[DocId] {
debug_assert!(self.block_is_loaded());
self.doc_decoder.output_array()
}

/// Return the document at index `idx` of the block.
#[inline]
pub fn doc(&self, idx: usize) -> u32 {
self.doc_decoder.output(idx)
}

/// Return the array of `term freq` in the block.
#[inline]
pub fn freqs(&self) -> &[u32] {
debug_assert!(self.block_is_loaded());
self.freq_decoder.output_array()
}

/// Return the frequency at index `idx` of the block.
#[inline]
pub fn freq(&self, idx: usize) -> u32 {
debug_assert!(self.block_is_loaded());
self.freq_decoder.output(idx)
}

/// Returns the length of the current block.
///
/// All blocks have a length of `NUM_DOCS_PER_BLOCK`,
/// except the last block that may have a length
/// of any number between 1 and `NUM_DOCS_PER_BLOCK - 1`
#[inline]
pub fn block_len(&self) -> usize {
debug_assert!(self.block_is_loaded());
self.doc_decoder.output_len
}

/// Position on a block that may contain `target_doc`.
///
/// If all docs are smaller than target, the block loaded may be empty,
/// or be the last, incomplete VInt block.
pub fn seek(&mut self, target_doc: DocId) -> usize {
// Move to the block that might contain our document.
self.seek_block(target_doc);
self.load_block();

// At this point we are on the block that might contain our document.
let doc = self.doc_decoder.seek_within_block(target_doc);

// The last block is not full and padded with TERMINATED,
// so we are guaranteed to have at least one value (real or padding)
// that is >= target_doc.
debug_assert!(doc < COMPRESSION_BLOCK_SIZE);

// `doc` is now the first element >= `target_doc`.
// If all docs are smaller than target, the current block is incomplete and padded
// with TERMINATED. After the search, the cursor points to the first TERMINATED.
doc
}

pub(crate) fn position_offset(&self) -> u64 {
self.skip_reader.position_offset()
}
/// Dangerous API! This call seeks the next block on the skip list,
@@ -274,15 +291,19 @@ impl BlockSegmentPostings {
/// `.load_block()` needs to be called manually afterwards.
/// If all docs are smaller than target, the block loaded may be empty,
/// or be the last, incomplete VInt block.
pub(crate) fn seek_block_without_loading(&mut self, target_doc: DocId) {
pub(crate) fn seek_block(&mut self, target_doc: DocId) {
if self.skip_reader.seek(target_doc) {
self.block_max_score_cache = None;
self.block_loaded = false;
}
}

pub(crate) fn block_is_loaded(&self) -> bool {
self.block_loaded
}

pub(crate) fn load_block(&mut self) {
if self.block_loaded {
if self.block_is_loaded() {
return;
}
let offset = self.skip_reader.byte_offset();
@@ -330,39 +351,68 @@ impl BlockSegmentPostings {
}
self.block_loaded = true;
}

/// Advance to the next block.
pub fn advance(&mut self) {
self.skip_reader.advance();
self.block_loaded = false;
self.block_max_score_cache = None;
self.load_block();
}

/// Returns an empty segment postings object
pub fn empty() -> BlockSegmentPostings {
BlockSegmentPostings {
doc_decoder: BlockDecoder::with_val(TERMINATED),
block_loaded: true,
freq_decoder: BlockDecoder::with_val(1),
freq_reading_option: FreqReadingOption::NoFreq,
block_max_score_cache: None,
doc_freq: 0,
data: OwnedBytes::empty(),
skip_reader: SkipReader::new(OwnedBytes::empty(), 0, IndexRecordOption::Basic),
}
}

pub(crate) fn skip_reader(&self) -> &SkipReader {
&self.skip_reader
}
}

#[cfg(test)]
mod tests {
use common::OwnedBytes;
use common::HasLen;

use super::BlockSegmentPostings;
use crate::docset::{DocSet, TERMINATED};
use crate::index::Index;
use crate::postings::compression::COMPRESSION_BLOCK_SIZE;
use crate::postings::serializer::PostingsSerializer;
use crate::postings::postings::Postings;
use crate::postings::SegmentPostings;
use crate::schema::IndexRecordOption;
use crate::schema::{IndexRecordOption, Schema, Term, INDEXED};
use crate::DocId;

#[cfg(test)]
fn build_block_postings(docs: &[u32]) -> BlockSegmentPostings {
let doc_freq = docs.len() as u32;
let mut postings_serializer =
PostingsSerializer::new(1.0f32, IndexRecordOption::Basic, None);
postings_serializer.new_term(docs.len() as u32, false);
for doc in docs {
postings_serializer.write_doc(*doc, 1u32);
}
let mut buffer: Vec<u8> = Vec::new();
postings_serializer
.close_term(doc_freq, &mut buffer)
.unwrap();
BlockSegmentPostings::open(
doc_freq,
OwnedBytes::new(buffer),
IndexRecordOption::Basic,
IndexRecordOption::Basic,
)
.unwrap()
#[test]
fn test_empty_segment_postings() {
let mut postings = SegmentPostings::empty();
assert_eq!(postings.doc(), TERMINATED);
assert_eq!(postings.advance(), TERMINATED);
assert_eq!(postings.advance(), TERMINATED);
assert_eq!(postings.doc_freq(), 0);
assert_eq!(postings.len(), 0);
}

#[test]
fn test_empty_postings_doc_returns_terminated() {
let mut postings = SegmentPostings::empty();
assert_eq!(postings.doc(), TERMINATED);
assert_eq!(postings.advance(), TERMINATED);
}

#[test]
fn test_empty_postings_doc_term_freq_returns_0() {
let postings = SegmentPostings::empty();
assert_eq!(postings.term_freq(), 1);
}

#[test]
@@ -377,7 +427,7 @@ mod tests {

#[test]
fn test_block_segment_postings() -> crate::Result<()> {
let mut block_segments = build_block_postings(&(0..100_000).collect::<Vec<u32>>());
let mut block_segments = build_block_postings(&(0..100_000).collect::<Vec<u32>>())?;
let mut offset: u32 = 0u32;
// checking that the `doc_freq` is correct
assert_eq!(block_segments.doc_freq(), 100_000);
@@ -402,7 +452,7 @@ mod tests {
doc_ids.push(129);
doc_ids.push(130);
{
let block_segments = build_block_postings(&doc_ids);
let block_segments = build_block_postings(&doc_ids)?;
let mut docset = SegmentPostings::from_block_postings(block_segments, None);
assert_eq!(docset.seek(128), 129);
assert_eq!(docset.doc(), 129);
@@ -411,7 +461,7 @@ mod tests {
assert_eq!(docset.advance(), TERMINATED);
}
{
let block_segments = build_block_postings(&doc_ids);
let block_segments = build_block_postings(&doc_ids).unwrap();
let mut docset = SegmentPostings::from_block_postings(block_segments, None);
assert_eq!(docset.seek(129), 129);
assert_eq!(docset.doc(), 129);
@@ -420,7 +470,7 @@ mod tests {
assert_eq!(docset.advance(), TERMINATED);
}
{
let block_segments = build_block_postings(&doc_ids);
let block_segments = build_block_postings(&doc_ids)?;
let mut docset = SegmentPostings::from_block_postings(block_segments, None);
assert_eq!(docset.doc(), 0);
assert_eq!(docset.seek(131), TERMINATED);
@@ -429,13 +479,38 @@ mod tests {
Ok(())
}

fn build_block_postings(docs: &[DocId]) -> crate::Result<BlockSegmentPostings> {
let mut schema_builder = Schema::builder();
let int_field = schema_builder.add_u64_field("id", INDEXED);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests()?;
let mut last_doc = 0u32;
for &doc in docs {
for _ in last_doc..doc {
index_writer.add_document(doc!(int_field=>1u64))?;
}
index_writer.add_document(doc!(int_field=>0u64))?;
last_doc = doc + 1;
}
index_writer.commit()?;
let searcher = index.reader()?.searcher();
let segment_reader = searcher.segment_reader(0);
let inverted_index = segment_reader.inverted_index(int_field).unwrap();
let term = Term::from_field_u64(int_field, 0u64);
let term_info = inverted_index.get_term_info(&term)?.unwrap();
let block_postings = inverted_index
.read_block_postings_from_terminfo(&term_info, IndexRecordOption::Basic)?;
Ok(block_postings)
}

#[test]
fn test_block_segment_postings_seek() -> crate::Result<()> {
let mut docs = Vec::new();
let mut docs = vec![0];
for i in 0..1300 {
docs.push((i * i / 100) + i);
}
let mut block_postings = build_block_postings(&docs[..]);
let mut block_postings = build_block_postings(&docs[..])?;
for i in &[0, 424, 10000] {
block_postings.seek(*i);
let docs = block_postings.docs();
@@ -446,4 +521,40 @@ mod tests {
assert_eq!(block_postings.doc(COMPRESSION_BLOCK_SIZE - 1), TERMINATED);
Ok(())
}

#[test]
fn test_reset_block_segment_postings() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
let int_field = schema_builder.add_u64_field("id", INDEXED);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests()?;
// create two postings lists, one containing even numbers,
// the other containing odd numbers.
for i in 0..6 {
let doc = doc!(int_field=> (i % 2) as u64);
index_writer.add_document(doc)?;
}
index_writer.commit()?;
let searcher = index.reader()?.searcher();
let segment_reader = searcher.segment_reader(0);

let mut block_segments;
{
let term = Term::from_field_u64(int_field, 0u64);
let inverted_index = segment_reader.inverted_index(int_field)?;
let term_info = inverted_index.get_term_info(&term)?.unwrap();
block_segments = inverted_index
.read_block_postings_from_terminfo(&term_info, IndexRecordOption::Basic)?;
}
assert_eq!(block_segments.docs(), &[0, 2, 4]);
{
let term = Term::from_field_u64(int_field, 1u64);
let inverted_index = segment_reader.inverted_index(int_field)?;
let term_info = inverted_index.get_term_info(&term)?.unwrap();
inverted_index.reset_block_postings_from_terminfo(&term_info, &mut block_segments)?;
}
assert_eq!(block_segments.docs(), &[1, 3, 5]);
Ok(())
}
}
@@ -22,6 +22,12 @@ pub(crate) struct JsonPostingsWriter<Rec: Recorder> {
non_str_posting_writer: SpecializedPostingsWriter<DocIdRecorder>,
}

impl<Rec: Recorder> From<JsonPostingsWriter<Rec>> for Box<dyn PostingsWriter> {
fn from(json_postings_writer: JsonPostingsWriter<Rec>) -> Box<dyn PostingsWriter> {
Box::new(json_postings_writer)
}
}

impl<Rec: Recorder> PostingsWriter for JsonPostingsWriter<Rec> {
#[inline]
fn subscribe(

@@ -1,5 +1,5 @@
use crate::docset::{DocSet, TERMINATED};
use crate::postings::{DocFreq, Postings};
use crate::postings::{Postings, SegmentPostings};
use crate::DocId;

/// `LoadedPostings` is a `DocSet` and `Postings` implementation.
@@ -25,16 +25,16 @@ impl LoadedPostings {
/// Creates a new `LoadedPostings` from a `SegmentPostings`.
///
/// It will also preload positions, if positions are available in the SegmentPostings.
pub fn load(postings: &mut Box<dyn Postings>) -> LoadedPostings {
let num_docs: usize = u32::from(postings.doc_freq()) as usize;
pub fn load(segment_postings: &mut SegmentPostings) -> LoadedPostings {
let num_docs = segment_postings.doc_freq() as usize;
let mut doc_ids = Vec::with_capacity(num_docs);
let mut positions = Vec::with_capacity(num_docs);
let mut position_offsets = Vec::with_capacity(num_docs);
while postings.doc() != TERMINATED {
while segment_postings.doc() != TERMINATED {
position_offsets.push(positions.len() as u32);
doc_ids.push(postings.doc());
postings.append_positions_with_offset(0, &mut positions);
postings.advance();
doc_ids.push(segment_postings.doc());
segment_postings.append_positions_with_offset(0, &mut positions);
segment_postings.advance();
}
position_offsets.push(positions.len() as u32);
LoadedPostings {
@@ -101,14 +101,6 @@ impl Postings for LoadedPostings {
output.push(*pos + offset);
}
}

fn has_freq(&self) -> bool {
true
}

fn doc_freq(&self) -> DocFreq {
DocFreq::Exact(self.doc_ids.len() as u32)
}
}

#[cfg(test)]
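`LoadedPostings::load` above flattens every document's positions into one `Vec` and records one `position_offsets` entry per document, a CSR-style layout: the positions of doc `i` live between offsets `i` and `i + 1`. A toy version of the same layout (simplified types, not tantivy's):

```rust
// Flattened per-doc position lists: positions for doc `i` live in
// positions[position_offsets[i]..position_offsets[i + 1]].
struct LoadedPositions {
    positions: Vec<u32>,
    position_offsets: Vec<u32>,
}

impl LoadedPositions {
    fn load(per_doc: &[Vec<u32>]) -> LoadedPositions {
        let mut positions = Vec::new();
        let mut position_offsets = Vec::with_capacity(per_doc.len() + 1);
        for doc_positions in per_doc {
            position_offsets.push(positions.len() as u32);
            positions.extend_from_slice(doc_positions);
        }
        // Sentinel offset so the last doc's range is well defined
        // (this mirrors the trailing `position_offsets.push` in the diff).
        position_offsets.push(positions.len() as u32);
        LoadedPositions { positions, position_offsets }
    }

    fn positions_for(&self, doc_ord: usize) -> &[u32] {
        let start = self.position_offsets[doc_ord] as usize;
        let end = self.position_offsets[doc_ord + 1] as usize;
        &self.positions[start..end]
    }
}

fn main() {
    let loaded = LoadedPositions::load(&[vec![1, 5], vec![], vec![2]]);
    assert_eq!(loaded.positions_for(0), &[1, 5]);
    assert_eq!(loaded.positions_for(1), &[] as &[u32]);
    assert_eq!(loaded.positions_for(2), &[2]);
}
```

Two flat vectors replace a `Vec<Vec<u32>>`, avoiding one heap allocation per document and keeping all positions contiguous in memory.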
@@ -1,16 +1,9 @@
//! Postings module (also called inverted index)

use std::io;

use common::OwnedBytes;

use crate::fieldnorm::FieldNormReader;
use crate::positions::PositionReader;
use crate::query::Bm25Weight;
use crate::schema::IndexRecordOption;
use crate::Score;

mod block_search;

pub(crate) use self::block_search::branchless_binary_search;

mod block_segment_postings;
pub(crate) mod compression;
mod indexing_context;
@@ -23,53 +16,22 @@ mod recorder;
mod segment_postings;
/// Serializer module for the inverted index
pub mod serializer;
pub(crate) mod skip;
mod skip;
mod term_info;

pub(crate) use loaded_postings::LoadedPostings;
pub(crate) use stacker::compute_table_memory_size;

pub(crate) use self::block_search::branchless_binary_search;
pub use self::block_segment_postings::BlockSegmentPostings;
pub(crate) use self::indexing_context::IndexingContext;
pub(crate) use self::per_field_postings_writer::PerFieldPostingsWriter;
pub use self::postings::{DocFreq, Postings};
pub(crate) use self::postings_writer::{
    serialize_postings, IndexingPosition, PostingsWriter, PostingsWriterEnum,
};
pub use self::postings::Postings;
pub(crate) use self::postings_writer::{serialize_postings, IndexingPosition, PostingsWriter};
pub use self::segment_postings::SegmentPostings;
pub use self::serializer::{FieldSerializer, InvertedIndexSerializer};
pub(crate) use self::skip::{BlockInfo, SkipReader};
pub use self::term_info::TermInfo;

/// Raw postings bytes and metadata read from storage.
#[derive(Debug, Clone)]
pub struct RawPostingsData {
    /// Raw postings bytes for the term.
    pub postings_data: OwnedBytes,
    /// Raw positions bytes for the term, if positions are available.
    pub positions_data: Option<OwnedBytes>,
    /// Record option of the indexed field.
    pub record_option: IndexRecordOption,
    /// Effective record option after downgrading to the indexed field capability.
    pub effective_option: IndexRecordOption,
}

/// A light complement interface to Postings to allow block-max wand acceleration.
pub trait PostingsWithBlockMax: Postings {
    /// Moves the postings to the block containing `target_doc` and returns
    /// an upperbound of the score for documents in the block.
    fn seek_block_max(
        &mut self,
        target_doc: crate::DocId,
        fieldnorm_reader: &FieldNormReader,
        similarity_weight: &Bm25Weight,
    ) -> Score;

    /// Returns the last document in the current block (or Terminated if this
    /// is the last block).
    fn last_doc_in_block(&self) -> crate::DocId;
}

#[expect(clippy::enum_variant_names)]
#[derive(Debug, PartialEq, Clone, Copy, Eq)]
pub(crate) enum FreqReadingOption {
@@ -78,26 +40,6 @@ pub(crate) enum FreqReadingOption {
    ReadFreq,
}

pub fn load_postings_from_raw_data(
    doc_freq: u32,
    postings_data: RawPostingsData,
) -> io::Result<SegmentPostings> {
    let RawPostingsData {
        postings_data,
        positions_data: positions_data_opt,
        record_option,
        effective_option,
    } = postings_data;
    let requested_option = effective_option;
    let block_segment_postings =
        BlockSegmentPostings::open(doc_freq, postings_data, record_option, requested_option)?;
    let position_reader = positions_data_opt.map(PositionReader::open).transpose()?;
    Ok(SegmentPostings::from_block_postings(
        block_segment_postings,
        position_reader,
    ))
}

#[cfg(test)]
pub(crate) mod tests {
    use std::mem;
@@ -105,10 +47,9 @@ pub(crate) mod tests {
    use super::{InvertedIndexSerializer, Postings};
    use crate::docset::{DocSet, TERMINATED};
    use crate::fieldnorm::FieldNormReader;
    use crate::index::{Index, SegmentComponent};
    use crate::index::{Index, SegmentComponent, SegmentReader};
    use crate::indexer::operation::AddOperation;
    use crate::indexer::SegmentWriter;
    use crate::postings::DocFreq;
    use crate::query::Scorer;
    use crate::schema::{
        Field, IndexRecordOption, Schema, Term, TextFieldIndexing, TextOptions, INDEXED, TEXT,
@@ -318,7 +259,7 @@ pub(crate) mod tests {
        segment_writer.finalize()?;
    }
    {
        let segment_reader = crate::TantivySegmentReader::open(&segment)?;
        let segment_reader = SegmentReader::open(&segment)?;
        {
            let fieldnorm_reader = segment_reader.get_fieldnorms_reader(text_field)?;
            assert_eq!(fieldnorm_reader.fieldnorm(0), 8 + 5);
@@ -339,11 +280,11 @@ pub(crate) mod tests {
        }
        {
            let term_a = Term::from_field_text(text_field, "a");
            let mut postings_a: Box<dyn Postings> = segment_reader
            let mut postings_a = segment_reader
                .inverted_index(term_a.field())?
                .read_postings(&term_a, IndexRecordOption::WithFreqsAndPositions)?
                .unwrap();
            assert_eq!(postings_a.doc_freq(), DocFreq::Exact(1000));
            assert_eq!(postings_a.len(), 1000);
            assert_eq!(postings_a.doc(), 0);
            assert_eq!(postings_a.term_freq(), 6);
            postings_a.positions(&mut positions);
@@ -366,7 +307,7 @@ pub(crate) mod tests {
                .inverted_index(term_e.field())?
                .read_postings(&term_e, IndexRecordOption::WithFreqsAndPositions)?
                .unwrap();
            assert_eq!(postings_e.doc_freq(), DocFreq::Exact(1000 - 2));
            assert_eq!(postings_e.len(), 1000 - 2);
            for i in 2u32..1000u32 {
                assert_eq!(postings_e.term_freq(), i);
                postings_e.positions(&mut positions);

@@ -1,15 +1,16 @@
use crate::postings::json_postings_writer::JsonPostingsWriter;
use crate::postings::postings_writer::{PostingsWriterEnum, SpecializedPostingsWriter};
use crate::postings::postings_writer::SpecializedPostingsWriter;
use crate::postings::recorder::{DocIdRecorder, TermFrequencyRecorder, TfAndPositionRecorder};
use crate::postings::PostingsWriter;
use crate::schema::{Field, FieldEntry, FieldType, IndexRecordOption, Schema};

pub(crate) struct PerFieldPostingsWriter {
    per_field_postings_writers: Vec<PostingsWriterEnum>,
    per_field_postings_writers: Vec<Box<dyn PostingsWriter>>,
}

impl PerFieldPostingsWriter {
    pub fn for_schema(schema: &Schema) -> Self {
        let per_field_postings_writers: Vec<PostingsWriterEnum> = schema
        let per_field_postings_writers = schema
            .fields()
            .map(|(_, field_entry)| posting_writer_from_field_entry(field_entry))
            .collect();
@@ -18,16 +19,16 @@ impl PerFieldPostingsWriter {
    }
}

    pub(crate) fn get_for_field(&self, field: Field) -> &PostingsWriterEnum {
        &self.per_field_postings_writers[field.field_id() as usize]
    pub(crate) fn get_for_field(&self, field: Field) -> &dyn PostingsWriter {
        self.per_field_postings_writers[field.field_id() as usize].as_ref()
    }

    pub(crate) fn get_for_field_mut(&mut self, field: Field) -> &mut PostingsWriterEnum {
        &mut self.per_field_postings_writers[field.field_id() as usize]
    pub(crate) fn get_for_field_mut(&mut self, field: Field) -> &mut dyn PostingsWriter {
        self.per_field_postings_writers[field.field_id() as usize].as_mut()
    }
}

fn posting_writer_from_field_entry(field_entry: &FieldEntry) -> PostingsWriterEnum {
fn posting_writer_from_field_entry(field_entry: &FieldEntry) -> Box<dyn PostingsWriter> {
    match *field_entry.field_type() {
        FieldType::Str(ref text_options) => text_options
            .get_indexing_options()
@@ -50,7 +51,7 @@ fn posting_writer_from_field_entry(field_entry: &FieldEntry) -> PostingsWriterEn
        | FieldType::Date(_)
        | FieldType::Bytes(_)
        | FieldType::IpAddr(_)
        | FieldType::Facet(_) => <SpecializedPostingsWriter<DocIdRecorder>>::default().into(),
        | FieldType::Facet(_) => Box::<SpecializedPostingsWriter<DocIdRecorder>>::default(),
        FieldType::JsonObject(ref json_object_options) => {
            if let Some(text_indexing_option) = json_object_options.get_text_indexing_options() {
                match text_indexing_option.index_option() {

@@ -1,25 +1,5 @@
use crate::docset::DocSet;

/// Result of the doc_freq method.
///
/// Postings can inform us that the document frequency is approximate.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DocFreq {
    /// The document frequency is approximate.
    Approximate(u32),
    /// The document frequency is exact.
    Exact(u32),
}

impl From<DocFreq> for u32 {
    fn from(doc_freq: DocFreq) -> Self {
        match doc_freq {
            DocFreq::Approximate(approximate_doc_freq) => approximate_doc_freq,
            DocFreq::Exact(doc_freq) => doc_freq,
        }
    }
}

/// Postings (also called inverted list)
///
/// For a given term, it is the list of doc ids of the doc
@@ -34,9 +14,6 @@ pub trait Postings: DocSet + 'static {
    /// The number of times the term appears in the document.
    fn term_freq(&self) -> u32;

    /// Returns the number of documents containing the term in the segment.
    fn doc_freq(&self) -> DocFreq;

    /// Returns the positions offset by a given value.
    /// It is not necessary to clear the `output` before calling this method.
    /// The output vector will be resized to the `term_freq`.
@@ -54,16 +31,6 @@ pub trait Postings: DocSet + 'static {
    fn positions(&mut self, output: &mut Vec<u32>) {
        self.positions_with_offset(0u32, output);
    }

    /// Returns true if the term_frequency is available.
    ///
    /// This is a tricky question, because on JSON fields, it is possible
    /// for a text term to have term freq, whereas a number term in the field has none.
    ///
    /// This function returns whether the actual term has term frequencies or not.
    /// In the JSON field example above, `has_freq` should return true for the
    /// former and false for the latter.
    fn has_freq(&self) -> bool;
}

impl Postings for Box<dyn Postings> {
@@ -74,12 +41,4 @@ impl Postings for Box<dyn Postings> {
    fn append_positions_with_offset(&mut self, offset: u32, output: &mut Vec<u32>) {
        (**self).append_positions_with_offset(offset, output);
    }

    fn has_freq(&self) -> bool {
        (**self).has_freq()
    }

    fn doc_freq(&self) -> DocFreq {
        (**self).doc_freq()
    }
}

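The `DocFreq` enum removed on the optim side can be exercised on its own; this sketch mirrors the enum and the `From<DocFreq> for u32` conversion shown in the hunk above, collapsing the two match arms into an or-pattern:

```rust
// Mirror of the DocFreq enum from the storage_ab branch: postings can
// report an exact or an approximate document frequency, and callers that
// only need a number can fall back to u32::from(..).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DocFreq {
    /// The document frequency is approximate.
    Approximate(u32),
    /// The document frequency is exact.
    Exact(u32),
}

impl From<DocFreq> for u32 {
    fn from(doc_freq: DocFreq) -> u32 {
        match doc_freq {
            // both variants carry the count; only the exactness differs
            DocFreq::Approximate(n) | DocFreq::Exact(n) => n,
        }
    }
}

fn main() {
    assert_eq!(u32::from(DocFreq::Exact(1000)), 1000);
    assert_eq!(u32::from(DocFreq::Approximate(950)), 950);
    // exactness participates in equality
    assert_ne!(DocFreq::Exact(950), DocFreq::Approximate(950));
    println!("ok");
}
```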
@@ -7,10 +7,7 @@ use stacker::Addr;
use crate::fieldnorm::FieldNormReaders;
use crate::indexer::indexing_term::IndexingTerm;
use crate::indexer::path_to_unordered_id::OrderedPathId;
use crate::postings::json_postings_writer::JsonPostingsWriter;
use crate::postings::recorder::{
    BufferLender, DocIdRecorder, Recorder, TermFrequencyRecorder, TfAndPositionRecorder,
};
use crate::postings::recorder::{BufferLender, Recorder};
use crate::postings::{
    FieldSerializer, IndexingContext, InvertedIndexSerializer, PerFieldPostingsWriter,
};
@@ -103,141 +100,6 @@ pub(crate) struct IndexingPosition {
    pub end_position: u32,
}

pub enum PostingsWriterEnum {
    DocId(SpecializedPostingsWriter<DocIdRecorder>),
    DocIdTf(SpecializedPostingsWriter<TermFrequencyRecorder>),
    DocTfAndPosition(SpecializedPostingsWriter<TfAndPositionRecorder>),
    JsonDocId(JsonPostingsWriter<DocIdRecorder>),
    JsonDocIdTf(JsonPostingsWriter<TermFrequencyRecorder>),
    JsonDocTfAndPosition(JsonPostingsWriter<TfAndPositionRecorder>),
}

impl From<SpecializedPostingsWriter<DocIdRecorder>> for PostingsWriterEnum {
    fn from(doc_id_recorder_writer: SpecializedPostingsWriter<DocIdRecorder>) -> Self {
        PostingsWriterEnum::DocId(doc_id_recorder_writer)
    }
}

impl From<SpecializedPostingsWriter<TermFrequencyRecorder>> for PostingsWriterEnum {
    fn from(doc_id_tf_recorder_writer: SpecializedPostingsWriter<TermFrequencyRecorder>) -> Self {
        PostingsWriterEnum::DocIdTf(doc_id_tf_recorder_writer)
    }
}

impl From<SpecializedPostingsWriter<TfAndPositionRecorder>> for PostingsWriterEnum {
    fn from(
        doc_id_tf_and_positions_recorder_writer: SpecializedPostingsWriter<TfAndPositionRecorder>,
    ) -> Self {
        PostingsWriterEnum::DocTfAndPosition(doc_id_tf_and_positions_recorder_writer)
    }
}

impl From<JsonPostingsWriter<DocIdRecorder>> for PostingsWriterEnum {
    fn from(doc_id_recorder_writer: JsonPostingsWriter<DocIdRecorder>) -> Self {
        PostingsWriterEnum::JsonDocId(doc_id_recorder_writer)
    }
}

impl From<JsonPostingsWriter<TermFrequencyRecorder>> for PostingsWriterEnum {
    fn from(doc_id_tf_recorder_writer: JsonPostingsWriter<TermFrequencyRecorder>) -> Self {
        PostingsWriterEnum::JsonDocIdTf(doc_id_tf_recorder_writer)
    }
}

impl From<JsonPostingsWriter<TfAndPositionRecorder>> for PostingsWriterEnum {
    fn from(
        doc_id_tf_and_positions_recorder_writer: JsonPostingsWriter<TfAndPositionRecorder>,
    ) -> Self {
        PostingsWriterEnum::JsonDocTfAndPosition(doc_id_tf_and_positions_recorder_writer)
    }
}

impl PostingsWriter for PostingsWriterEnum {
    fn subscribe(&mut self, doc: DocId, pos: u32, term: &IndexingTerm, ctx: &mut IndexingContext) {
        match self {
            PostingsWriterEnum::DocId(writer) => writer.subscribe(doc, pos, term, ctx),
            PostingsWriterEnum::DocIdTf(writer) => writer.subscribe(doc, pos, term, ctx),
            PostingsWriterEnum::DocTfAndPosition(writer) => writer.subscribe(doc, pos, term, ctx),
            PostingsWriterEnum::JsonDocId(writer) => writer.subscribe(doc, pos, term, ctx),
            PostingsWriterEnum::JsonDocIdTf(writer) => writer.subscribe(doc, pos, term, ctx),
            PostingsWriterEnum::JsonDocTfAndPosition(writer) => {
                writer.subscribe(doc, pos, term, ctx)
            }
        }
    }

    fn serialize(
        &self,
        term_addrs: &[(Field, OrderedPathId, &[u8], Addr)],
        ordered_id_to_path: &[&str],
        ctx: &IndexingContext,
        serializer: &mut FieldSerializer,
    ) -> io::Result<()> {
        match self {
            PostingsWriterEnum::DocId(writer) => {
                writer.serialize(term_addrs, ordered_id_to_path, ctx, serializer)
            }
            PostingsWriterEnum::DocIdTf(writer) => {
                writer.serialize(term_addrs, ordered_id_to_path, ctx, serializer)
            }
            PostingsWriterEnum::DocTfAndPosition(writer) => {
                writer.serialize(term_addrs, ordered_id_to_path, ctx, serializer)
            }
            PostingsWriterEnum::JsonDocId(writer) => {
                writer.serialize(term_addrs, ordered_id_to_path, ctx, serializer)
            }
            PostingsWriterEnum::JsonDocIdTf(writer) => {
                writer.serialize(term_addrs, ordered_id_to_path, ctx, serializer)
            }
            PostingsWriterEnum::JsonDocTfAndPosition(writer) => {
                writer.serialize(term_addrs, ordered_id_to_path, ctx, serializer)
            }
        }
    }

    /// Tokenize a text and subscribe all of its tokens.
    fn index_text(
        &mut self,
        doc_id: DocId,
        token_stream: &mut dyn TokenStream,
        term_buffer: &mut IndexingTerm,
        ctx: &mut IndexingContext,
        indexing_position: &mut IndexingPosition,
    ) {
        match self {
            PostingsWriterEnum::DocId(writer) => {
                writer.index_text(doc_id, token_stream, term_buffer, ctx, indexing_position)
            }
            PostingsWriterEnum::DocIdTf(writer) => {
                writer.index_text(doc_id, token_stream, term_buffer, ctx, indexing_position)
            }
            PostingsWriterEnum::DocTfAndPosition(writer) => {
                writer.index_text(doc_id, token_stream, term_buffer, ctx, indexing_position)
            }
            PostingsWriterEnum::JsonDocId(writer) => {
                writer.index_text(doc_id, token_stream, term_buffer, ctx, indexing_position)
            }
            PostingsWriterEnum::JsonDocIdTf(writer) => {
                writer.index_text(doc_id, token_stream, term_buffer, ctx, indexing_position)
            }
            PostingsWriterEnum::JsonDocTfAndPosition(writer) => {
                writer.index_text(doc_id, token_stream, term_buffer, ctx, indexing_position)
            }
        }
    }

    fn total_num_tokens(&self) -> u64 {
        match self {
            PostingsWriterEnum::DocId(writer) => writer.total_num_tokens(),
            PostingsWriterEnum::DocIdTf(writer) => writer.total_num_tokens(),
            PostingsWriterEnum::DocTfAndPosition(writer) => writer.total_num_tokens(),
            PostingsWriterEnum::JsonDocId(writer) => writer.total_num_tokens(),
            PostingsWriterEnum::JsonDocIdTf(writer) => writer.total_num_tokens(),
            PostingsWriterEnum::JsonDocTfAndPosition(writer) => writer.total_num_tokens(),
        }
    }
}

/// The `PostingsWriter` is in charge of receiving documents
/// and building a `Segment` in anonymous memory.
///
@@ -309,6 +171,14 @@ pub(crate) struct SpecializedPostingsWriter<Rec: Recorder> {
    _recorder_type: PhantomData<Rec>,
}

impl<Rec: Recorder> From<SpecializedPostingsWriter<Rec>> for Box<dyn PostingsWriter> {
    fn from(
        specialized_postings_writer: SpecializedPostingsWriter<Rec>,
    ) -> Box<dyn PostingsWriter> {
        Box::new(specialized_postings_writer)
    }
}

impl<Rec: Recorder> SpecializedPostingsWriter<Rec> {
    #[inline]
    pub(crate) fn serialize_one_term(

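The `PostingsWriterEnum` hunk replaces `Box<dyn PostingsWriter>` with static enum dispatch: each concrete writer becomes a variant, `From` impls let call sites use `.into()`, and the trait impl forwards every method through a `match`. A reduced sketch of that pattern, using hypothetical writer types rather than tantivy's real API:

```rust
// Enum-dispatch sketch: store writers by value in an enum instead of
// behind Box<dyn Trait>, forwarding trait methods with a match. The
// Writer trait and both writer types here are illustrative.
trait Writer {
    fn record(&mut self, doc: u32);
    fn num_docs(&self) -> usize;
}

#[derive(Default)]
struct DocIdWriter {
    docs: Vec<u32>,
}

#[derive(Default)]
struct CountingWriter {
    count: usize,
}

impl Writer for DocIdWriter {
    fn record(&mut self, doc: u32) {
        self.docs.push(doc);
    }
    fn num_docs(&self) -> usize {
        self.docs.len()
    }
}

impl Writer for CountingWriter {
    fn record(&mut self, _doc: u32) {
        self.count += 1;
    }
    fn num_docs(&self) -> usize {
        self.count
    }
}

enum WriterEnum {
    DocId(DocIdWriter),
    Counting(CountingWriter),
}

impl From<DocIdWriter> for WriterEnum {
    fn from(w: DocIdWriter) -> Self {
        WriterEnum::DocId(w)
    }
}

impl From<CountingWriter> for WriterEnum {
    fn from(w: CountingWriter) -> Self {
        WriterEnum::Counting(w)
    }
}

impl Writer for WriterEnum {
    fn record(&mut self, doc: u32) {
        match self {
            WriterEnum::DocId(w) => w.record(doc),
            WriterEnum::Counting(w) => w.record(doc),
        }
    }
    fn num_docs(&self) -> usize {
        match self {
            WriterEnum::DocId(w) => w.num_docs(),
            WriterEnum::Counting(w) => w.num_docs(),
        }
    }
}

fn main() {
    // per-field writers stored by value: no heap allocation per field
    let mut writers: Vec<WriterEnum> =
        vec![DocIdWriter::default().into(), CountingWriter::default().into()];
    for w in writers.iter_mut() {
        w.record(7);
        w.record(9);
    }
    assert_eq!(writers[0].num_docs(), 2);
    assert_eq!(writers[1].num_docs(), 2);
    println!("ok");
}
```

The trade-off this diff is weighing: enum dispatch keeps the concrete type visible to the optimizer (monomorphized calls, no vtable), at the cost of a `match` arm per variant in every forwarded method.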
@@ -70,7 +70,7 @@ pub(crate) trait Recorder: Copy + Default + Send + Sync + 'static {
    fn serialize(
        &self,
        arena: &MemoryArena,
        serializer: &mut FieldSerializer,
        serializer: &mut FieldSerializer<'_>,
        buffer_lender: &mut BufferLender,
    );
    /// Returns the number of documents containing this term.
@@ -113,7 +113,7 @@ impl Recorder for DocIdRecorder {
    fn serialize(
        &self,
        arena: &MemoryArena,
        serializer: &mut FieldSerializer,
        serializer: &mut FieldSerializer<'_>,
        buffer_lender: &mut BufferLender,
    ) {
        let buffer = buffer_lender.lend_u8();
@@ -181,7 +181,7 @@ impl Recorder for TermFrequencyRecorder {
    fn serialize(
        &self,
        arena: &MemoryArena,
        serializer: &mut FieldSerializer,
        serializer: &mut FieldSerializer<'_>,
        buffer_lender: &mut BufferLender,
    ) {
        let buffer = buffer_lender.lend_u8();
@@ -238,7 +238,7 @@ impl Recorder for TfAndPositionRecorder {
    fn serialize(
        &self,
        arena: &MemoryArena,
        serializer: &mut FieldSerializer,
        serializer: &mut FieldSerializer<'_>,
        buffer_lender: &mut BufferLender,
    ) {
        let (buffer_u8, buffer_positions) = buffer_lender.lend_all();

@@ -1,13 +1,11 @@
use common::BitSet;
use common::HasLen;

use super::{BlockSegmentPostings, PostingsWithBlockMax};
use crate::docset::DocSet;
use crate::fieldnorm::FieldNormReader;
use crate::fastfield::AliveBitSet;
use crate::positions::PositionReader;
use crate::postings::compression::COMPRESSION_BLOCK_SIZE;
use crate::postings::{DocFreq, Postings};
use crate::query::Bm25Weight;
use crate::{DocId, Score};
use crate::postings::{BlockSegmentPostings, Postings};
use crate::{DocId, TERMINATED};

/// `SegmentPostings` represents the inverted list or postings associated with
/// a term in a `Segment`.
@@ -31,6 +29,31 @@ impl SegmentPostings {
    }
}

    /// Compute the number of non-deleted documents.
    ///
    /// This method will clone and scan through the posting lists.
    /// (this is a rather expensive operation).
    pub fn doc_freq_given_deletes(&self, alive_bitset: &AliveBitSet) -> u32 {
        let mut docset = self.clone();
        let mut doc_freq = 0;
        loop {
            let doc = docset.doc();
            if doc == TERMINATED {
                return doc_freq;
            }
            if alive_bitset.is_alive(doc) {
                doc_freq += 1u32;
            }
            docset.advance();
        }
    }

    /// Returns the overall number of documents in the block postings.
    /// It does not take into account whether documents are deleted or not.
    pub fn doc_freq(&self) -> u32 {
        self.block_cursor.doc_freq()
    }

    /// Creates a segment postings object with the given documents
    /// and no frequency encoded.
    ///
@@ -41,13 +64,11 @@ impl SegmentPostings {
    /// buffer with the serialized data.
    #[cfg(test)]
    pub fn create_from_docs(docs: &[u32]) -> SegmentPostings {
        use common::OwnedBytes;

        use crate::directory::FileSlice;
        use crate::postings::serializer::PostingsSerializer;
        use crate::schema::IndexRecordOption;
        let mut buffer = Vec::new();
        {
            use crate::postings::serializer::PostingsSerializer;

            let mut postings_serializer =
                PostingsSerializer::new(0.0, IndexRecordOption::Basic, None);
            postings_serializer.new_term(docs.len() as u32, false);
@@ -60,7 +81,7 @@ impl SegmentPostings {
        }
        let block_segment_postings = BlockSegmentPostings::open(
            docs.len() as u32,
            OwnedBytes::new(buffer),
            FileSlice::from(buffer),
            IndexRecordOption::Basic,
            IndexRecordOption::Basic,
        )
@@ -74,8 +95,7 @@ impl SegmentPostings {
        doc_and_tfs: &[(u32, u32)],
        fieldnorms: Option<&[u32]>,
    ) -> SegmentPostings {
        use common::OwnedBytes;

        use crate::directory::FileSlice;
        use crate::fieldnorm::FieldNormReader;
        use crate::postings::serializer::PostingsSerializer;
        use crate::schema::IndexRecordOption;
@@ -108,7 +128,7 @@ impl SegmentPostings {
            .unwrap();
        let block_segment_postings = BlockSegmentPostings::open(
            doc_and_tfs.len() as u32,
            OwnedBytes::new(buffer),
            FileSlice::from(buffer),
            IndexRecordOption::WithFreqs,
            IndexRecordOption::WithFreqs,
        )
@@ -138,6 +158,7 @@ impl DocSet for SegmentPostings {
    // next needs to be called a first time to point to the correct element.
    #[inline]
    fn advance(&mut self) -> DocId {
        debug_assert!(self.block_cursor.block_is_loaded());
        if self.cur == COMPRESSION_BLOCK_SIZE - 1 {
            self.cur = 0;
            self.block_cursor.advance();
@@ -176,31 +197,13 @@ impl DocSet for SegmentPostings {
    }

    fn size_hint(&self) -> u32 {
        self.doc_freq().into()
        self.len() as u32
    }
}

    fn fill_bitset(&mut self, bitset: &mut BitSet) {
        let bitset_max_value: DocId = bitset.max_value();
        loop {
            let docs = self.block_cursor.docs();
            let Some(&last_doc) = docs.last() else {
                break;
            };
            if last_doc < bitset_max_value {
                // All docs are within the range of the bitset
                for &doc in docs {
                    bitset.insert(doc);
                }
            } else {
                for &doc in docs {
                    if doc < bitset_max_value {
                        bitset.insert(doc);
                    }
                }
                break;
            }
            self.block_cursor.advance();
        }
impl HasLen for SegmentPostings {
    fn len(&self) -> usize {
        self.block_cursor.doc_freq() as usize
    }
}

@@ -226,13 +229,6 @@ impl Postings for SegmentPostings {
        self.block_cursor.freq(self.cur)
    }

    /// Returns the overall number of documents in the block postings.
    /// It does not take into account whether documents are deleted or not.
    #[inline(always)]
    fn doc_freq(&self) -> DocFreq {
        DocFreq::Exact(self.block_cursor.doc_freq())
    }

    fn append_positions_with_offset(&mut self, offset: u32, output: &mut Vec<u32>) {
        let term_freq = self.term_freq();
        let prev_len = output.len();
@@ -256,44 +252,24 @@ impl Postings for SegmentPostings {
        }
    }
}

    fn has_freq(&self) -> bool {
        !self.block_cursor.freqs().is_empty()
    }
}

impl PostingsWithBlockMax for SegmentPostings {
    #[inline]
    fn seek_block_max(
        &mut self,
        target_doc: crate::DocId,
        fieldnorm_reader: &FieldNormReader,
        similarity_weight: &Bm25Weight,
    ) -> Score {
        self.block_cursor.seek_block_without_loading(target_doc);
        self.block_cursor
            .block_max_score(fieldnorm_reader, similarity_weight)
    }

    #[inline]
    fn last_doc_in_block(&self) -> crate::DocId {
        self.block_cursor.skip_reader().last_doc_in_block()
    }
}

#[cfg(test)]
mod tests {

    use common::HasLen;

    use super::SegmentPostings;
    use crate::docset::{DocSet, TERMINATED};
    use crate::postings::Postings;
    use crate::fastfield::AliveBitSet;
    use crate::postings::postings::Postings;

    #[test]
    fn test_empty_segment_postings() {
        let mut postings = SegmentPostings::empty();
        assert_eq!(postings.doc(), TERMINATED);
        assert_eq!(postings.advance(), TERMINATED);
        assert_eq!(postings.advance(), TERMINATED);
        assert_eq!(postings.doc_freq(), crate::postings::DocFreq::Exact(0));
        assert_eq!(postings.len(), 0);
    }

    #[test]
@@ -308,4 +284,15 @@ mod tests {
        let postings = SegmentPostings::empty();
        assert_eq!(postings.term_freq(), 1);
    }

    #[test]
    fn test_doc_freq() {
        let docs = SegmentPostings::create_from_docs(&[0, 2, 10]);
        assert_eq!(docs.doc_freq(), 3);
        let alive_bitset = AliveBitSet::for_test_from_deleted_docs(&[2], 12);
        assert_eq!(docs.doc_freq_given_deletes(&alive_bitset), 2);
        let all_deleted =
            AliveBitSet::for_test_from_deleted_docs(&[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], 12);
        assert_eq!(docs.doc_freq_given_deletes(&all_deleted), 0);
    }
}

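`doc_freq_given_deletes` above is, at its core, a single filtered scan: walk the postings and count only documents still alive in the delete bitset. A simplified sketch with plain slices standing in for tantivy's `SegmentPostings` and `AliveBitSet`:

```rust
// Sketch of doc_freq_given_deletes: count docs that survive the
// alive-check. The slice/closure types are simplified stand-ins.
fn doc_freq_given_deletes(docs: &[u32], alive: &dyn Fn(u32) -> bool) -> u32 {
    docs.iter().filter(|&&doc| alive(doc)).count() as u32
}

fn main() {
    // mirrors the test_doc_freq case in the hunk above: docs [0, 2, 10],
    // doc 2 deleted
    let docs = [0u32, 2, 10];
    let deleted = [2u32];
    let alive = |doc: u32| !deleted.contains(&doc);
    assert_eq!(doc_freq_given_deletes(&docs, &alive), 2);

    // everything deleted
    let none_alive = |_doc: u32| false;
    assert_eq!(doc_freq_given_deletes(&docs, &none_alive), 0);
    println!("ok");
}
```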
@@ -8,7 +8,7 @@ use crate::directory::{CompositeWrite, WritePtr};
use crate::fieldnorm::FieldNormReader;
use crate::index::Segment;
use crate::positions::PositionSerializer;
use crate::postings::compression::{BlockEncoder, VIntEncoder as _, COMPRESSION_BLOCK_SIZE};
use crate::postings::compression::{BlockEncoder, VIntEncoder, COMPRESSION_BLOCK_SIZE};
use crate::postings::skip::SkipSerializer;
use crate::query::Bm25Weight;
use crate::schema::{Field, FieldEntry, IndexRecordOption, Schema};

@@ -146,6 +146,23 @@ impl SkipReader {
        skip_reader
    }

    pub fn reset(&mut self, data: OwnedBytes, doc_freq: u32) {
        self.last_doc_in_block = if doc_freq >= COMPRESSION_BLOCK_SIZE as u32 {
            0
        } else {
            TERMINATED
        };
        self.last_doc_in_previous_block = 0u32;
        self.owned_read = data;
        self.block_info = BlockInfo::VInt { num_docs: doc_freq };
        self.byte_offset = 0;
        self.remaining_docs = doc_freq;
        self.position_offset = 0u64;
        if doc_freq >= COMPRESSION_BLOCK_SIZE as u32 {
            self.read_block_info();
        }
    }

    // Returns the block max score for this block if available.
    //
    // The block max score is available for all full bitpacked blocks,

@@ -2,7 +2,7 @@ use crate::docset::{DocSet, COLLECT_BLOCK_BUFFER_LEN, TERMINATED};
use crate::index::SegmentReader;
use crate::query::boost_query::BoostScorer;
use crate::query::explanation::does_not_match;
use crate::query::{box_scorer, EnableScoring, Explanation, Query, Scorer, Weight};
use crate::query::{EnableScoring, Explanation, Query, Scorer, Weight};
use crate::{DocId, Score};

/// Query that matches all of the documents.
@@ -21,16 +21,16 @@ impl Query for AllQuery {
pub struct AllWeight;

impl Weight for AllWeight {
    fn scorer(&self, reader: &dyn SegmentReader, boost: Score) -> crate::Result<Box<dyn Scorer>> {
    fn scorer(&self, reader: &SegmentReader, boost: Score) -> crate::Result<Box<dyn Scorer>> {
        let all_scorer = AllScorer::new(reader.max_doc());
        if boost != 1.0 {
            Ok(box_scorer(BoostScorer::new(all_scorer, boost)))
            Ok(Box::new(BoostScorer::new(all_scorer, boost)))
        } else {
            Ok(box_scorer(all_scorer))
            Ok(Box::new(all_scorer))
        }
    }

    fn explain(&self, reader: &dyn SegmentReader, doc: DocId) -> crate::Result<Explanation> {
    fn explain(&self, reader: &SegmentReader, doc: DocId) -> crate::Result<Explanation> {
        if doc >= reader.max_doc() {
            return Err(does_not_match(doc));
        }

@@ -5,14 +5,12 @@ use common::BitSet;
use tantivy_fst::Automaton;

use super::phrase_prefix_query::prefix_end;
use crate::index::{
    try_downcast_and_call, InvertedIndexReader, SegmentReader, TypedInvertedIndexReaderCb,
};
use crate::index::SegmentReader;
use crate::postings::TermInfo;
use crate::query::{BitSetDocSet, ConstScorer, Explanation, Scorer, Weight};
use crate::schema::Field;
use crate::schema::{Field, IndexRecordOption};
use crate::termdict::{TermDictionary, TermStreamer};
use crate::{DocId, DocSet, Score, TantivyError};
use crate::{DocId, Score, TantivyError};

/// A weight struct for Fuzzy Term and Regex Queries
pub struct AutomatonWeight<A> {
@@ -69,7 +67,7 @@ where
}

    /// Returns the term infos that match the automaton
    pub fn get_match_term_infos(&self, reader: &dyn SegmentReader) -> crate::Result<Vec<TermInfo>> {
    pub fn get_match_term_infos(&self, reader: &SegmentReader) -> crate::Result<Vec<TermInfo>> {
        let inverted_index = reader.inverted_index(self.field)?;
        let term_dict = inverted_index.terms();
        let mut term_stream = self.automaton_stream(term_dict)?;
@@ -86,42 +84,33 @@ where
    A: Automaton + Send + Sync + 'static,
    A::State: Clone,
{
    fn scorer(&self, reader: &dyn SegmentReader, boost: Score) -> crate::Result<Box<dyn Scorer>> {
    fn scorer(&self, reader: &SegmentReader, boost: Score) -> crate::Result<Box<dyn Scorer>> {
        let max_doc = reader.max_doc();
        let mut doc_bitset = BitSet::with_max_value(max_doc);
        let inverted_index = reader.inverted_index(self.field)?;
        let term_dict = inverted_index.terms();
        let mut term_stream = self.automaton_stream(term_dict)?;
        struct FillBitsetLoop<'a, 'b, A: Automaton>
        where A::State: Clone
        {
            term_stream: &'a mut TermStreamer<'b, &'b A>,
            bitset: &'a mut BitSet,
        }
        impl<A: Automaton> TypedInvertedIndexReaderCb<io::Result<()>> for FillBitsetLoop<'_, '_, A>
        where A::State: Clone
        {
            fn call<I: InvertedIndexReader + ?Sized>(&mut self, reader: &I) -> io::Result<()> {
                while self.term_stream.advance() {
                    let term_info = self.term_stream.value();
                    reader.fill_bitset_from_terminfo(term_info, self.bitset)?;
        while term_stream.advance() {
            let term_info = term_stream.value();
            let mut block_segment_postings = inverted_index
                .read_block_postings_from_terminfo(term_info, IndexRecordOption::Basic)?;
            loop {
                let docs = block_segment_postings.docs();
                if docs.is_empty() {
                    break;
                }
                Ok(())
                for &doc in docs {
                    doc_bitset.insert(doc);
                }
                block_segment_postings.advance();
            }
        }
        try_downcast_and_call(
            inverted_index.as_ref(),
            &mut FillBitsetLoop {
                term_stream: &mut term_stream,
                bitset: &mut doc_bitset,
            },
        )?;
        let doc_bitset = BitSetDocSet::from(doc_bitset);
        let const_scorer = ConstScorer::new(doc_bitset, boost);
        Ok(Box::new(const_scorer))
    }

    fn explain(&self, reader: &dyn SegmentReader, doc: DocId) -> crate::Result<Explanation> {
    fn explain(&self, reader: &SegmentReader, doc: DocId) -> crate::Result<Explanation> {
        let mut scorer = self.scorer(reader, 1.0)?;
        if scorer.seek(doc) == doc {
            Ok(Explanation::new("AutomatonScorer", 1.0))

@@ -24,13 +24,6 @@ impl BitSetDocSet {
         self.cursor_bucket = bucket_addr;
         self.cursor_tinybitset = self.docs.tinyset(bucket_addr);
     }

-    /// Returns the number of documents in the bitset.
-    ///
-    /// This call is not free: it will bitcount the number of bits in the bitset.
-    pub fn doc_freq(&self) -> u32 {
-        self.docs.len() as u32
-    }
 }
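The removed `doc_freq` doc comment notes that the call is not free because it bit-counts the whole set. A minimal sketch of why, using a word-backed bitset; `WordBitSet` here is a hypothetical stand-in, not tantivy's `BitSet`:

```rust
// Hypothetical word-backed bitset; `len` must popcount every word on each call.
struct WordBitSet {
    words: Vec<u64>,
}

impl WordBitSet {
    fn with_max_value(max_value: u32) -> Self {
        WordBitSet {
            words: vec![0u64; (max_value as usize + 63) / 64],
        }
    }

    fn insert(&mut self, el: u32) {
        self.words[(el / 64) as usize] |= 1u64 << (el % 64);
    }

    // Not free: O(number of words), one `count_ones` per 64-bit word.
    fn len(&self) -> usize {
        self.words.iter().map(|w| w.count_ones() as usize).sum()
    }
}

fn main() {
    let mut bitset = WordBitSet::with_max_value(1_000);
    for doc in [3u32, 64, 64, 999] {
        bitset.insert(doc);
    }
    // 64 was inserted twice, so only 3 distinct bits are set.
    println!("{}", bitset.len()); // 3
}
```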

 impl From<BitSet> for BitSetDocSet {
@@ -1,6 +1,5 @@
 use std::ops::{Deref, DerefMut};

-use crate::postings::PostingsWithBlockMax;
 use crate::query::term_query::TermScorer;
 use crate::query::Scorer;
 use crate::{DocId, DocSet, Score, TERMINATED};
@@ -14,8 +13,8 @@ use crate::{DocId, DocSet, Score, TERMINATED};
 /// We always have `before_pivot_len` < `pivot_len`.
 ///
 /// `None` is returned if we establish that no document can exceed the threshold.
-fn find_pivot_doc<TPostings: PostingsWithBlockMax>(
-    term_scorers: &[TermScorerWithMaxScore<TPostings>],
+fn find_pivot_doc(
+    term_scorers: &[TermScorerWithMaxScore],
     threshold: Score,
 ) -> Option<(usize, usize, DocId)> {
     let mut max_score = 0.0;
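The doc comments above describe the pivot search: scorers stay sorted by their current doc, and per-term score upper bounds are accumulated until the sum can beat the threshold. A hedged sketch of that idea over plain `(doc, max_score)` pairs; this is a simplification, not tantivy's `find_pivot_doc` signature:

```rust
// WAND pivot selection over scorers sorted by current doc: accumulate
// per-term score upper bounds until the running sum can beat the threshold.
// Simplified stand-in for tantivy's `find_pivot_doc`, not its real signature.
fn find_pivot(scorers: &[(u32, f32)], threshold: f32) -> Option<u32> {
    let mut upper_bound = 0.0f32;
    for &(doc, max_score) in scorers {
        upper_bound += max_score;
        if upper_bound > threshold {
            // First doc whose score could exceed the threshold.
            return Some(doc);
        }
    }
    // Even the sum of all max scores cannot exceed the threshold.
    None
}

fn main() {
    let scorers = [(2u32, 0.4f32), (5, 0.7), (9, 1.0)];
    println!("{:?}", find_pivot(&scorers, 1.0)); // Some(5)
    println!("{:?}", find_pivot(&scorers, 5.0)); // None
}
```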
@@ -47,8 +46,8 @@ fn find_pivot_doc<TPostings: PostingsWithBlockMax>(
 /// the next doc candidate defined by the min of `last_doc_in_block + 1` for
 /// scorer in scorers[..pivot_len] and `scorer.doc()` for scorer in scorers[pivot_len..].
 /// Note: before and after calling this method, scorers need to be sorted by their `.doc()`.
-fn block_max_was_too_low_advance_one_scorer<TPostings: PostingsWithBlockMax>(
-    scorers: &mut [TermScorerWithMaxScore<TPostings>],
+fn block_max_was_too_low_advance_one_scorer(
+    scorers: &mut [TermScorerWithMaxScore],
     pivot_len: usize,
 ) {
     debug_assert!(is_sorted(scorers.iter().map(|scorer| scorer.doc())));
@@ -83,10 +82,7 @@ fn block_max_was_too_low_advance_one_scorer<TPostings: PostingsWithBlockMax>(
 // Given a list of term_scorers and a `ord` and assuming that `term_scorers[ord]` is sorted
 // except term_scorers[ord] that might be in advance compared to its ranks,
 // bubble up term_scorers[ord] in order to restore the ordering.
-fn restore_ordering<TPostings: PostingsWithBlockMax>(
-    term_scorers: &mut [TermScorerWithMaxScore<TPostings>],
-    ord: usize,
-) {
+fn restore_ordering(term_scorers: &mut [TermScorerWithMaxScore], ord: usize) {
     let doc = term_scorers[ord].doc();
     for i in ord + 1..term_scorers.len() {
         if term_scorers[i].doc() >= doc {
@@ -101,10 +97,9 @@ fn restore_ordering<TPostings: PostingsWithBlockMax>(
 // If this works, return true.
 // If this fails (ie: one of the term_scorer does not contain `pivot_doc` and seek goes past the
 // pivot), reorder the term_scorers to ensure the list is still sorted and returns `false`.
-// If a term_scorer reach TERMINATED in the process return false remove the term_scorer and
-// return.
-fn align_scorers<TPostings: PostingsWithBlockMax>(
-    term_scorers: &mut Vec<TermScorerWithMaxScore<TPostings>>,
+// If a term_scorer reach TERMINATED in the process return false remove the term_scorer and return.
+fn align_scorers(
+    term_scorers: &mut Vec<TermScorerWithMaxScore>,
     pivot_doc: DocId,
     before_pivot_len: usize,
 ) -> bool {
@@ -131,10 +126,7 @@ fn align_scorers<TPostings: PostingsWithBlockMax>(
 // Assumes terms_scorers[..pivot_len] are positioned on the same doc (pivot_doc).
 // Advance term_scorers[..pivot_len] and out of these removes the terminated scores.
 // Restores the ordering of term_scorers.
-fn advance_all_scorers_on_pivot<TPostings: PostingsWithBlockMax>(
-    term_scorers: &mut Vec<TermScorerWithMaxScore<TPostings>>,
-    pivot_len: usize,
-) {
+fn advance_all_scorers_on_pivot(term_scorers: &mut Vec<TermScorerWithMaxScore>, pivot_len: usize) {
     for term_scorer in &mut term_scorers[..pivot_len] {
         term_scorer.advance();
     }
@@ -153,12 +145,12 @@ fn advance_all_scorers_on_pivot<TPostings: PostingsWithBlockMax>(
 /// Implements the WAND (Weak AND) algorithm for dynamic pruning
 /// described in the paper "Faster Top-k Document Retrieval Using Block-Max Indexes".
 /// Link: <http://engineering.nyu.edu/~suel/papers/bmw.pdf>
-pub fn block_wand<TPostings: PostingsWithBlockMax>(
-    mut scorers: Vec<TermScorer<TPostings>>,
+pub fn block_wand(
+    mut scorers: Vec<TermScorer>,
     mut threshold: Score,
     callback: &mut dyn FnMut(u32, Score) -> Score,
 ) {
-    let mut scorers: Vec<TermScorerWithMaxScore<TPostings>> = scorers
+    let mut scorers: Vec<TermScorerWithMaxScore> = scorers
         .iter_mut()
         .map(TermScorerWithMaxScore::from)
         .collect();
@@ -174,7 +166,10 @@ pub fn block_wand<TPostings: PostingsWithBlockMax>(

         let block_max_score_upperbound: Score = scorers[..pivot_len]
             .iter_mut()
-            .map(|scorer| scorer.seek_block_max(pivot_doc))
+            .map(|scorer| {
+                scorer.seek_block(pivot_doc);
+                scorer.block_max_score()
+            })
             .sum();

         // Beware after shallow advance, skip readers can be in advance compared to
@@ -225,22 +220,21 @@ pub fn block_wand<TPostings: PostingsWithBlockMax>(
 /// - On a block, advance until the end and execute `callback` when the doc score is greater or
 /// equal to the `threshold`.
 pub fn block_wand_single_scorer(
-    mut scorer: TermScorer<impl PostingsWithBlockMax>,
+    mut scorer: TermScorer,
     mut threshold: Score,
     callback: &mut dyn FnMut(u32, Score) -> Score,
 ) {
     let mut doc = scorer.doc();
-    let mut block_max_score = scorer.seek_block_max(doc);
     loop {
         // We position the scorer on a block that can reach
         // the threshold.
-        while block_max_score < threshold {
+        while scorer.block_max_score() < threshold {
             let last_doc_in_block = scorer.last_doc_in_block();
             if last_doc_in_block == TERMINATED {
                 return;
             }
             doc = last_doc_in_block + 1;
-            block_max_score = scorer.seek_block_max(doc);
+            scorer.seek_block(doc);
         }
         // Seek will effectively load that block.
         doc = scorer.seek(doc);
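The loop above only decodes blocks whose precomputed `block_max_score` can reach the threshold. A toy sketch of the same block-skipping idea; the `Block` layout is hypothetical, not tantivy's posting format:

```rust
// Toy block-max skipping: each posting block carries a precomputed max score.
// Hypothetical structure, not tantivy's posting format.
struct Block {
    block_max_score: f32,
    docs: Vec<(u32, f32)>, // (doc id, score)
}

// Visit only docs that can beat `threshold`, skipping whole blocks whose
// precomputed max score is already too low (they are never decoded).
fn collect_above_threshold(blocks: &[Block], threshold: f32) -> Vec<u32> {
    let mut hits = Vec::new();
    for block in blocks {
        if block.block_max_score < threshold {
            continue; // skip the whole block
        }
        for &(doc, score) in &block.docs {
            if score >= threshold {
                hits.push(doc);
            }
        }
    }
    hits
}

fn main() {
    let blocks = vec![
        Block { block_max_score: 0.5, docs: vec![(1, 0.5), (2, 0.3)] },
        Block { block_max_score: 2.0, docs: vec![(10, 1.2), (11, 2.0)] },
    ];
    println!("{:?}", collect_above_threshold(&blocks, 1.0)); // [10, 11]
}
```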
@@ -262,33 +256,31 @@ pub fn block_wand_single_scorer(
             }
         }
         doc += 1;
-        block_max_score = scorer.seek_block_max(doc);
+        scorer.seek_block(doc);
     }
 }

-struct TermScorerWithMaxScore<'a, TPostings: PostingsWithBlockMax> {
-    scorer: &'a mut TermScorer<TPostings>,
+struct TermScorerWithMaxScore<'a> {
+    scorer: &'a mut TermScorer,
     max_score: Score,
 }

-impl<'a, TPostings: PostingsWithBlockMax> From<&'a mut TermScorer<TPostings>>
-    for TermScorerWithMaxScore<'a, TPostings>
-{
-    fn from(scorer: &'a mut TermScorer<TPostings>) -> Self {
+impl<'a> From<&'a mut TermScorer> for TermScorerWithMaxScore<'a> {
+    fn from(scorer: &'a mut TermScorer) -> Self {
         let max_score = scorer.max_score();
         TermScorerWithMaxScore { scorer, max_score }
     }
 }

-impl<TPostings: PostingsWithBlockMax> Deref for TermScorerWithMaxScore<'_, TPostings> {
-    type Target = TermScorer<TPostings>;
+impl Deref for TermScorerWithMaxScore<'_> {
+    type Target = TermScorer;

     fn deref(&self) -> &Self::Target {
         self.scorer
     }
 }

-impl<TPostings: PostingsWithBlockMax> DerefMut for TermScorerWithMaxScore<'_, TPostings> {
+impl DerefMut for TermScorerWithMaxScore<'_> {
     fn deref_mut(&mut self) -> &mut Self::Target {
         self.scorer
     }

@@ -2,21 +2,21 @@ use std::collections::HashMap;

 use crate::docset::COLLECT_BLOCK_BUFFER_LEN;
 use crate::index::SegmentReader;
 use crate::postings::FreqReadingOption;
 use crate::query::disjunction::Disjunction;
 use crate::query::explanation::does_not_match;
 use crate::query::score_combiner::{DoNothingCombiner, ScoreCombiner};
 use crate::query::term_query::TermScorer;
-use crate::query::weight::for_each_docset_buffered;
+use crate::query::weight::{for_each_docset_buffered, for_each_pruning_scorer, for_each_scorer};
 use crate::query::{
-    box_scorer, intersect_scorers, AllScorer, BufferedUnionScorer, EmptyScorer, Exclude,
-    Explanation, Occur, RequiredOptionalScorer, Scorer, SumCombiner, Weight,
+    intersect_scorers, AllScorer, BufferedUnionScorer, EmptyScorer, Exclude, Explanation, Occur,
+    RequiredOptionalScorer, Scorer, Weight,
 };
 use crate::{DocId, Score};

-#[derive(Copy, Clone)]
-enum SumOrDoNothingCombiner {
-    Sum,
-    DoNothing,
+enum SpecializedScorer {
+    TermUnion(Vec<TermScorer>),
+    Other(Box<dyn Scorer>),
 }

 fn scorer_disjunction<TScoreCombiner>(
@@ -32,7 +32,7 @@ where
     if scorers.len() == 1 {
         return scorers.into_iter().next().unwrap(); // Safe unwrap.
     }
-    box_scorer(Disjunction::new(
+    Box::new(Disjunction::new(
         scorers,
         score_combiner,
         minimum_match_required,
@@ -44,60 +44,57 @@ fn scorer_union<TScoreCombiner>(
     scorers: Vec<Box<dyn Scorer>>,
     score_combiner_fn: impl Fn() -> TScoreCombiner,
     num_docs: u32,
-) -> Box<dyn Scorer>
+) -> SpecializedScorer
 where
     TScoreCombiner: ScoreCombiner,
 {
-    match scorers.len() {
-        0 => box_scorer(EmptyScorer),
-        1 => scorers.into_iter().next().unwrap(),
-        _ => {
-            let combiner_opt: Option<SumOrDoNothingCombiner> = if std::any::TypeId::of::<
-                TScoreCombiner,
-            >() == std::any::TypeId::of::<
-                SumCombiner,
-            >() {
-                Some(SumOrDoNothingCombiner::Sum)
-            } else if std::any::TypeId::of::<TScoreCombiner>()
-                == std::any::TypeId::of::<DoNothingCombiner>()
-            {
-                Some(SumOrDoNothingCombiner::DoNothing)
-            } else {
-                None
-            };
-            if let Some(combiner) = combiner_opt {
-                if scorers.iter().all(|scorer| scorer.is::<TermScorer>()) {
-                    let scorers: Vec<TermScorer> = scorers
-                        .into_iter()
-                        .map(|scorer| {
-                            *scorer.downcast::<TermScorer>().ok().expect(
-                                "downcast failed despite the fact we already checked the type",
-                            )
-                        })
-                        .collect();
-                    return match combiner {
-                        SumOrDoNothingCombiner::Sum => box_scorer(BufferedUnionScorer::build(
-                            scorers,
-                            SumCombiner::default,
-                            num_docs,
-                        )),
-                        SumOrDoNothingCombiner::DoNothing => {
-                            box_scorer(BufferedUnionScorer::build(
-                                scorers,
-                                DoNothingCombiner::default,
-                                num_docs,
-                            ))
-                        }
-                    };
-                }
-            }
-            box_scorer(BufferedUnionScorer::build(
-                scorers,
-                score_combiner_fn,
-                num_docs,
-            ))
-        }
-    }
+    assert!(!scorers.is_empty());
+    if scorers.len() == 1 {
+        return SpecializedScorer::Other(scorers.into_iter().next().unwrap()); //< we checked the size beforehand
+    }
+
+    {
+        let is_all_term_queries = scorers.iter().all(|scorer| scorer.is::<TermScorer>());
+        if is_all_term_queries {
+            let scorers: Vec<TermScorer> = scorers
+                .into_iter()
+                .map(|scorer| *(scorer.downcast::<TermScorer>().map_err(|_| ()).unwrap()))
+                .collect();
+            if scorers
+                .iter()
+                .all(|scorer| scorer.freq_reading_option() == FreqReadingOption::ReadFreq)
+            {
+                // Block wand is only available if we read frequencies.
+                return SpecializedScorer::TermUnion(scorers);
+            }
+            return SpecializedScorer::Other(Box::new(BufferedUnionScorer::build(
+                scorers,
+                score_combiner_fn,
+                num_docs,
+            )));
+        }
+    }
+    SpecializedScorer::Other(Box::new(BufferedUnionScorer::build(
+        scorers,
+        score_combiner_fn,
+        num_docs,
+    )))
 }

+fn into_box_scorer<TScoreCombiner: ScoreCombiner>(
+    scorer: SpecializedScorer,
+    score_combiner_fn: impl Fn() -> TScoreCombiner,
+    num_docs: u32,
+) -> Box<dyn Scorer> {
+    match scorer {
+        SpecializedScorer::TermUnion(term_scorers) => {
+            let union_scorer =
+                BufferedUnionScorer::build(term_scorers, score_combiner_fn, num_docs);
+            Box::new(union_scorer)
+        }
+        SpecializedScorer::Other(scorer) => scorer,
+    }
+}
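The removed `scorer_union` branch selected a combiner by comparing `TypeId`s at runtime. A minimal sketch of that dispatch pattern; `SumCombiner` and `DoNothingCombiner` here are local stand-ins, not tantivy's types:

```rust
use std::any::TypeId;

// Runtime type dispatch via `TypeId`, the pattern behind the removed
// `SumOrDoNothingCombiner` selection. The combiner structs are local stand-ins.
struct SumCombiner;
struct DoNothingCombiner;

fn is_sum_combiner<T: 'static>() -> bool {
    TypeId::of::<T>() == TypeId::of::<SumCombiner>()
}

fn main() {
    println!("{}", is_sum_combiner::<SumCombiner>()); // true
    println!("{}", is_sum_combiner::<DoNothingCombiner>()); // false
}
```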

 /// Returns the effective MUST scorer, accounting for removed AllScorers.
@@ -113,7 +110,7 @@ fn effective_must_scorer(
     if must_scorers.is_empty() {
         if removed_all_scorer_count > 0 {
             // Had AllScorer(s) only - all docs match
-            Some(box_scorer(AllScorer::new(max_doc)))
+            Some(Box::new(AllScorer::new(max_doc)))
         } else {
             // No MUST constraint at all
             None
@@ -131,26 +128,28 @@ fn effective_must_scorer(
 /// When `scoring_enabled` is false, we can just return AllScorer alone since
 /// we don't need score contributions from the should_scorer.
 fn effective_should_scorer_for_union<TScoreCombiner: ScoreCombiner>(
-    should_scorer: Box<dyn Scorer>,
+    should_scorer: SpecializedScorer,
     removed_all_scorer_count: usize,
     max_doc: DocId,
     num_docs: u32,
     score_combiner_fn: impl Fn() -> TScoreCombiner,
     scoring_enabled: bool,
-) -> Box<dyn Scorer> {
+) -> SpecializedScorer {
     if removed_all_scorer_count > 0 {
         if scoring_enabled {
             // Need to union to get score contributions from both
-            let all_scorers: Vec<Box<dyn Scorer>> =
-                vec![should_scorer, box_scorer(AllScorer::new(max_doc))];
-            box_scorer(BufferedUnionScorer::build(
+            let all_scorers: Vec<Box<dyn Scorer>> = vec![
+                into_box_scorer(should_scorer, &score_combiner_fn, num_docs),
+                Box::new(AllScorer::new(max_doc)),
+            ];
+            SpecializedScorer::Other(Box::new(BufferedUnionScorer::build(
                 all_scorers,
                 score_combiner_fn,
                 num_docs,
-            ))
+            )))
         } else {
             // Scoring disabled - AllScorer alone is sufficient
-            box_scorer(AllScorer::new(max_doc))
+            SpecializedScorer::Other(Box::new(AllScorer::new(max_doc)))
         }
     } else {
         should_scorer
@@ -161,9 +160,9 @@ enum ShouldScorersCombinationMethod {
     // Should scorers are irrelevant.
     Ignored,
     // Only contributes to final score.
-    Optional(Box<dyn Scorer>),
+    Optional(SpecializedScorer),
     // Regardless of score, the should scorers may impact whether a document is matching or not.
-    Required(Box<dyn Scorer>),
+    Required(SpecializedScorer),
 }

 /// Weight associated to the `BoolQuery`.
@@ -206,7 +205,7 @@ impl<TScoreCombiner: ScoreCombiner> BooleanWeight<TScoreCombiner> {

     fn per_occur_scorers(
         &self,
-        reader: &dyn SegmentReader,
+        reader: &SegmentReader,
         boost: Score,
     ) -> crate::Result<HashMap<Occur, Vec<Box<dyn Scorer>>>> {
         let mut per_occur_scorers: HashMap<Occur, Vec<Box<dyn Scorer>>> = HashMap::new();
@@ -222,10 +221,10 @@ impl<TScoreCombiner: ScoreCombiner> BooleanWeight<TScoreCombiner> {

     fn complex_scorer<TComplexScoreCombiner: ScoreCombiner>(
         &self,
-        reader: &dyn SegmentReader,
+        reader: &SegmentReader,
         boost: Score,
         score_combiner_fn: impl Fn() -> TComplexScoreCombiner,
-    ) -> crate::Result<Box<dyn Scorer>> {
+    ) -> crate::Result<SpecializedScorer> {
         let num_docs = reader.num_docs();
         let mut per_occur_scorers = self.per_occur_scorers(reader, boost)?;

@@ -235,7 +234,7 @@ impl<TScoreCombiner: ScoreCombiner> BooleanWeight<TScoreCombiner> {
         let must_special_scorer_counts = remove_and_count_all_and_empty_scorers(&mut must_scorers);

         if must_special_scorer_counts.num_empty_scorers > 0 {
-            return Ok(box_scorer(EmptyScorer));
+            return Ok(SpecializedScorer::Other(Box::new(EmptyScorer)));
         }

         let mut should_scorers = per_occur_scorers.remove(&Occur::Should).unwrap_or_default();
@@ -250,7 +249,7 @@ impl<TScoreCombiner: ScoreCombiner> BooleanWeight<TScoreCombiner> {

         if exclude_special_scorer_counts.num_all_scorers > 0 {
             // We exclude all documents at one point.
-            return Ok(box_scorer(EmptyScorer));
+            return Ok(SpecializedScorer::Other(Box::new(EmptyScorer)));
         }

         let effective_minimum_number_should_match = self
@@ -262,7 +261,7 @@ impl<TScoreCombiner: ScoreCombiner> BooleanWeight<TScoreCombiner> {
         if effective_minimum_number_should_match > num_of_should_scorers {
             // We don't have enough scorers to satisfy the minimum number of should matches.
             // The request will match no documents.
-            return Ok(box_scorer(EmptyScorer));
+            return Ok(SpecializedScorer::Other(Box::new(EmptyScorer)));
         }
         match effective_minimum_number_should_match {
             0 if num_of_should_scorers == 0 => ShouldScorersCombinationMethod::Ignored,
@@ -282,10 +281,12 @@ impl<TScoreCombiner: ScoreCombiner> BooleanWeight<TScoreCombiner> {
                 must_scorers.append(&mut should_scorers);
                 ShouldScorersCombinationMethod::Ignored
             }
-            _ => ShouldScorersCombinationMethod::Required(scorer_disjunction(
-                should_scorers,
-                score_combiner_fn(),
-                effective_minimum_number_should_match,
+            _ => ShouldScorersCombinationMethod::Required(SpecializedScorer::Other(
+                scorer_disjunction(
+                    should_scorers,
+                    score_combiner_fn(),
+                    effective_minimum_number_should_match,
+                ),
             )),
         }
     };
@@ -302,8 +303,8 @@ impl<TScoreCombiner: ScoreCombiner> BooleanWeight<TScoreCombiner> {
                     reader.max_doc(),
                     num_docs,
                 )
-                .unwrap_or_else(|| box_scorer(EmptyScorer));
-                boxed_scorer
+                .unwrap_or_else(|| Box::new(EmptyScorer));
+                SpecializedScorer::Other(boxed_scorer)
             }
             (ShouldScorersCombinationMethod::Optional(should_scorer), must_scorers) => {
                 // Optional SHOULD: contributes to scoring but not required for matching.
@@ -328,12 +329,16 @@ impl<TScoreCombiner: ScoreCombiner> BooleanWeight<TScoreCombiner> {
                     Some(must_scorer) => {
                         // Has MUST constraint: SHOULD only affects scoring.
                         if self.scoring_enabled {
-                            box_scorer(RequiredOptionalScorer::<_, _, TScoreCombiner>::new(
+                            SpecializedScorer::Other(Box::new(RequiredOptionalScorer::<
+                                _,
+                                _,
+                                TScoreCombiner,
+                            >::new(
                                 must_scorer,
-                                should_scorer,
-                            ))
+                                into_box_scorer(should_scorer, &score_combiner_fn, num_docs),
+                            )))
                         } else {
-                            must_scorer
+                            SpecializedScorer::Other(must_scorer)
                         }
                     }
                 }
@@ -353,7 +358,12 @@ impl<TScoreCombiner: ScoreCombiner> BooleanWeight<TScoreCombiner> {
                     }
                     Some(must_scorer) => {
                         // Has MUST constraint: intersect MUST with SHOULD.
-                        intersect_scorers(vec![must_scorer, should_scorer], num_docs)
+                        let should_boxed =
+                            into_box_scorer(should_scorer, &score_combiner_fn, num_docs);
+                        SpecializedScorer::Other(intersect_scorers(
+                            vec![must_scorer, should_boxed],
+                            num_docs,
+                        ))
                     }
                 }
             }
@@ -362,18 +372,19 @@ impl<TScoreCombiner: ScoreCombiner> BooleanWeight<TScoreCombiner> {
             return Ok(include_scorer);
         }

+        let include_scorer_boxed = into_box_scorer(include_scorer, &score_combiner_fn, num_docs);
         let scorer: Box<dyn Scorer> = if exclude_scorers.len() == 1 {
             let exclude_scorer = exclude_scorers.pop().unwrap();
             match exclude_scorer.downcast::<TermScorer>() {
                 // Cast to TermScorer succeeded
-                Ok(exclude_scorer) => Box::new(Exclude::new(include_scorer, *exclude_scorer)),
+                Ok(exclude_scorer) => Box::new(Exclude::new(include_scorer_boxed, *exclude_scorer)),
                 // We get back the original Box<dyn Scorer>
-                Err(exclude_scorer) => Box::new(Exclude::new(include_scorer, exclude_scorer)),
+                Err(exclude_scorer) => Box::new(Exclude::new(include_scorer_boxed, exclude_scorer)),
             }
         } else {
-            Box::new(Exclude::new(include_scorer, exclude_scorers))
+            Box::new(Exclude::new(include_scorer_boxed, exclude_scorers))
         };
-        Ok(scorer)
+        Ok(SpecializedScorer::Other(scorer))
     }
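The exclude branch depends on `downcast` either yielding the concrete box or handing the original boxed trait object back. A small sketch of the same pattern with `std::any::Any`; `TermScorerLike` is a hypothetical stand-in:

```rust
use std::any::Any;

// On success `downcast` returns the concrete box; on failure it hands the
// original `Box<dyn Any>` back untouched. `TermScorerLike` is a stand-in.
struct TermScorerLike(u32);

fn describe(scorer: Box<dyn Any>) -> String {
    match scorer.downcast::<TermScorerLike>() {
        // Cast succeeded: use the concrete type directly.
        Ok(term_scorer) => format!("term scorer on doc {}", term_scorer.0),
        // Cast failed: the original box is still usable here.
        Err(_other) => "generic scorer".to_string(),
    }
}

fn main() {
    println!("{}", describe(Box::new(TermScorerLike(7))));
    println!("{}", describe(Box::new(42i32)));
}
```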
 }

@@ -402,7 +413,8 @@ fn remove_and_count_all_and_empty_scorers(
 }

 impl<TScoreCombiner: ScoreCombiner + Sync> Weight for BooleanWeight<TScoreCombiner> {
-    fn scorer(&self, reader: &dyn SegmentReader, boost: Score) -> crate::Result<Box<dyn Scorer>> {
+    fn scorer(&self, reader: &SegmentReader, boost: Score) -> crate::Result<Box<dyn Scorer>> {
+        let num_docs = reader.num_docs();
         if self.weights.is_empty() {
             Ok(Box::new(EmptyScorer))
         } else if self.weights.len() == 1 {
@@ -414,12 +426,18 @@ impl<TScoreCombiner: ScoreCombiner + Sync> Weight for BooleanWeight<TScoreCombin
             }
         } else if self.scoring_enabled {
             self.complex_scorer(reader, boost, &self.score_combiner_fn)
+                .map(|specialized_scorer| {
+                    into_box_scorer(specialized_scorer, &self.score_combiner_fn, num_docs)
+                })
         } else {
             self.complex_scorer(reader, boost, DoNothingCombiner::default)
+                .map(|specialized_scorer| {
+                    into_box_scorer(specialized_scorer, DoNothingCombiner::default, num_docs)
+                })
         }
     }

-    fn explain(&self, reader: &dyn SegmentReader, doc: DocId) -> crate::Result<Explanation> {
+    fn explain(&self, reader: &SegmentReader, doc: DocId) -> crate::Result<Explanation> {
         let mut scorer = self.scorer(reader, 1.0)?;
         if scorer.seek(doc) != doc {
             return Err(does_not_match(doc));
@@ -441,22 +459,47 @@ impl<TScoreCombiner: ScoreCombiner + Sync> Weight for BooleanWeight<TScoreCombin

     fn for_each(
         &self,
-        reader: &dyn SegmentReader,
+        reader: &SegmentReader,
         callback: &mut dyn FnMut(DocId, Score),
     ) -> crate::Result<()> {
-        let mut scorer = self.complex_scorer(reader, 1.0, &self.score_combiner_fn)?;
-        scorer.for_each(callback);
+        let scorer = self.complex_scorer(reader, 1.0, &self.score_combiner_fn)?;
+        match scorer {
+            SpecializedScorer::TermUnion(term_scorers) => {
+                let mut union_scorer = BufferedUnionScorer::build(
+                    term_scorers,
+                    &self.score_combiner_fn,
+                    reader.num_docs(),
+                );
+                for_each_scorer(&mut union_scorer, callback);
+            }
+            SpecializedScorer::Other(mut scorer) => {
+                for_each_scorer(scorer.as_mut(), callback);
+            }
+        }
         Ok(())
     }

     fn for_each_no_score(
         &self,
-        reader: &dyn SegmentReader,
+        reader: &SegmentReader,
         callback: &mut dyn FnMut(&[DocId]),
     ) -> crate::Result<()> {
-        let mut scorer = self.complex_scorer(reader, 1.0, || DoNothingCombiner)?;
+        let scorer = self.complex_scorer(reader, 1.0, || DoNothingCombiner)?;
         let mut buffer = [0u32; COLLECT_BLOCK_BUFFER_LEN];
-        for_each_docset_buffered(scorer.as_mut(), &mut buffer, callback);

+        match scorer {
+            SpecializedScorer::TermUnion(term_scorers) => {
+                let mut union_scorer = BufferedUnionScorer::build(
+                    term_scorers,
+                    &self.score_combiner_fn,
+                    reader.num_docs(),
+                );
+                for_each_docset_buffered(&mut union_scorer, &mut buffer, callback);
+            }
+            SpecializedScorer::Other(mut scorer) => {
+                for_each_docset_buffered(scorer.as_mut(), &mut buffer, callback);
+            }
+        }
         Ok(())
     }

@@ -473,11 +516,18 @@ impl<TScoreCombiner: ScoreCombiner + Sync> Weight for BooleanWeight<TScoreCombin
     fn for_each_pruning(
         &self,
         threshold: Score,
-        reader: &dyn SegmentReader,
+        reader: &SegmentReader,
         callback: &mut dyn FnMut(DocId, Score) -> Score,
     ) -> crate::Result<()> {
         let scorer = self.complex_scorer(reader, 1.0, &self.score_combiner_fn)?;
-        reader.for_each_pruning(threshold, scorer, callback);
+        match scorer {
+            SpecializedScorer::TermUnion(term_scorers) => {
+                super::block_wand(term_scorers, threshold, callback);
+            }
+            SpecializedScorer::Other(mut scorer) => {
+                for_each_pruning_scorer(scorer.as_mut(), threshold, callback);
+            }
+        }
         Ok(())
     }
 }

@@ -1,7 +1,8 @@
-pub(crate) mod block_wand;
+mod block_wand;
 mod boolean_query;
 mod boolean_weight;

+pub(crate) use self::block_wand::{block_wand, block_wand_single_scorer};
 pub use self::boolean_query::BooleanQuery;
 pub use self::boolean_weight::BooleanWeight;

@@ -15,8 +16,8 @@ mod tests {
     use crate::collector::{Count, TopDocs};
-    use crate::query::term_query::TermScorer;
     use crate::query::{
-        AllScorer, BufferedUnionScorer, EmptyScorer, EnableScoring, Intersection, Occur, Query,
-        QueryParser, RangeQuery, RequiredOptionalScorer, Scorer, SumCombiner, TermQuery,
+        AllScorer, EmptyScorer, EnableScoring, Intersection, Occur, Query, QueryParser, RangeQuery,
+        RequiredOptionalScorer, Scorer, SumCombiner, TermQuery,
     };
     use crate::schema::*;
     use crate::{assert_nearly_equals, DocAddress, DocId, Index, IndexWriter, Score};
@@ -61,19 +62,6 @@ mod tests {
         Ok(())
     }

-    #[test]
-    pub fn test_boolean_termonly_union_specialization() -> crate::Result<()> {
-        let (index, text_field) = aux_test_helper()?;
-        let query_parser = QueryParser::for_index(&index, vec![text_field]);
-        let query = query_parser.parse_query("a b")?;
-        let searcher = index.reader()?.searcher();
-        let weight = query.weight(EnableScoring::enabled_from_searcher(&searcher))?;
-        let scorer = weight.scorer(searcher.segment_reader(0u32), 1.0)?;
-        assert!(scorer.is::<BufferedUnionScorer<TermScorer, SumCombiner>>());
-        assert_eq!(query.count(&searcher)?, 4);
-        Ok(())
-    }
-
     #[test]
     pub fn test_boolean_termonly_intersection() -> crate::Result<()> {
         let (index, text_field) = aux_test_helper()?;

@@ -67,11 +67,11 @@ impl BoostWeight {
 }

 impl Weight for BoostWeight {
-    fn scorer(&self, reader: &dyn SegmentReader, boost: Score) -> crate::Result<Box<dyn Scorer>> {
+    fn scorer(&self, reader: &SegmentReader, boost: Score) -> crate::Result<Box<dyn Scorer>> {
         self.weight.scorer(reader, boost * self.boost)
     }

-    fn explain(&self, reader: &dyn SegmentReader, doc: u32) -> crate::Result<Explanation> {
+    fn explain(&self, reader: &SegmentReader, doc: u32) -> crate::Result<Explanation> {
         let underlying_explanation = self.weight.explain(reader, doc)?;
         let score = underlying_explanation.value() * self.boost;
         let mut explanation =
@@ -80,7 +80,7 @@ impl Weight for BoostWeight {
         Ok(explanation)
     }

-    fn count(&self, reader: &dyn SegmentReader) -> crate::Result<u32> {
+    fn count(&self, reader: &SegmentReader) -> crate::Result<u32> {
         self.weight.count(reader)
     }
 }

@@ -1,7 +1,7 @@
 use std::fmt;

 use crate::docset::COLLECT_BLOCK_BUFFER_LEN;
-use crate::query::{box_scorer, EnableScoring, Explanation, Query, Scorer, Weight};
+use crate::query::{EnableScoring, Explanation, Query, Scorer, Weight};
 use crate::{DocId, DocSet, Score, SegmentReader, TantivyError, Term};

 /// `ConstScoreQuery` is a wrapper over a query to provide a constant score.
@@ -63,15 +63,12 @@ impl ConstWeight {
 }

 impl Weight for ConstWeight {
-    fn scorer(&self, reader: &dyn SegmentReader, boost: Score) -> crate::Result<Box<dyn Scorer>> {
+    fn scorer(&self, reader: &SegmentReader, boost: Score) -> crate::Result<Box<dyn Scorer>> {
         let inner_scorer = self.weight.scorer(reader, boost)?;
-        Ok(box_scorer(ConstScorer::new(
-            inner_scorer,
-            boost * self.score,
-        )))
+        Ok(Box::new(ConstScorer::new(inner_scorer, boost * self.score)))
     }

-    fn explain(&self, reader: &dyn SegmentReader, doc: u32) -> crate::Result<Explanation> {
+    fn explain(&self, reader: &SegmentReader, doc: u32) -> crate::Result<Explanation> {
         let mut scorer = self.scorer(reader, 1.0)?;
         if scorer.seek(doc) != doc {
             return Err(TantivyError::InvalidArgument(format!(
@@ -84,7 +81,7 @@ impl Weight for ConstWeight {
         Ok(explanation)
     }

-    fn count(&self, reader: &dyn SegmentReader) -> crate::Result<u32> {
+    fn count(&self, reader: &SegmentReader) -> crate::Result<u32> {
         self.weight.count(reader)
     }
 }

@@ -2,7 +2,7 @@ use super::Scorer;
 use crate::docset::TERMINATED;
 use crate::index::SegmentReader;
 use crate::query::explanation::does_not_match;
-use crate::query::{box_scorer, EnableScoring, Explanation, Query, Weight};
+use crate::query::{EnableScoring, Explanation, Query, Weight};
 use crate::{DocId, DocSet, Score, Searcher};

 /// `EmptyQuery` is a dummy `Query` in which no document matches.
@@ -26,11 +26,11 @@ impl Query for EmptyQuery {
 /// It is useful for tests and handling edge cases.
 pub struct EmptyWeight;
 impl Weight for EmptyWeight {
-    fn scorer(&self, _reader: &dyn SegmentReader, _boost: Score) -> crate::Result<Box<dyn Scorer>> {
-        Ok(box_scorer(EmptyScorer))
+    fn scorer(&self, _reader: &SegmentReader, _boost: Score) -> crate::Result<Box<dyn Scorer>> {
+        Ok(Box::new(EmptyScorer))
     }

-    fn explain(&self, _reader: &dyn SegmentReader, doc: DocId) -> crate::Result<Explanation> {
+    fn explain(&self, _reader: &SegmentReader, doc: DocId) -> crate::Result<Explanation> {
         Err(does_not_match(doc))
     }
 }

@@ -3,7 +3,7 @@ use core::fmt::Debug;
 use columnar::{ColumnIndex, DynamicColumn};
 use common::BitSet;

-use super::{box_scorer, ConstScorer, EmptyScorer};
+use super::{ConstScorer, EmptyScorer};
 use crate::docset::{DocSet, TERMINATED};
 use crate::index::SegmentReader;
 use crate::query::all_query::AllScorer;
@@ -98,7 +98,7 @@ pub struct ExistsWeight {
 }

 impl Weight for ExistsWeight {
-    fn scorer(&self, reader: &dyn SegmentReader, boost: Score) -> crate::Result<Box<dyn Scorer>> {
+    fn scorer(&self, reader: &SegmentReader, boost: Score) -> crate::Result<Box<dyn Scorer>> {
         let fast_field_reader = reader.fast_fields();
         let mut column_handles = fast_field_reader.dynamic_column_handles(&self.field_name)?;
         if self.field_type == Type::Json && self.json_subpaths {
@@ -117,7 +117,7 @@ impl Weight for ExistsWeight {
             }
         }
         if non_empty_columns.is_empty() {
-            return Ok(box_scorer(EmptyScorer));
+            return Ok(Box::new(EmptyScorer));
         }

         // If any column is full, all docs match.
@@ -128,9 +128,9 @@ impl Weight for ExistsWeight {
         {
             let all_scorer = AllScorer::new(max_doc);
             if boost != 1.0f32 {
-                return Ok(box_scorer(BoostScorer::new(all_scorer, boost)));
+                return Ok(Box::new(BoostScorer::new(all_scorer, boost)));
             } else {
-                return Ok(box_scorer(all_scorer));
+                return Ok(Box::new(all_scorer));
             }
         }

@@ -138,7 +138,7 @@ impl Weight for ExistsWeight {
         // NOTE: A lower number may be better for very sparse columns
         if non_empty_columns.len() < 4 {
             let docset = ExistsDocSet::new(non_empty_columns, reader.max_doc());
-            return Ok(box_scorer(ConstScorer::new(docset, boost)));
+            return Ok(Box::new(ConstScorer::new(docset, boost)));
         }

         // If we have many dynamic columns, precompute a bitset of matching docs
@@ -162,10 +162,10 @@ impl Weight for ExistsWeight {
             }
         }
         let docset = BitSetDocSet::from(doc_bitset);
-        Ok(box_scorer(ConstScorer::new(docset, boost)))
+        Ok(Box::new(ConstScorer::new(docset, boost)))
     }
|
||||
|
||||
fn explain(&self, reader: &dyn SegmentReader, doc: DocId) -> crate::Result<Explanation> {
|
||||
fn explain(&self, reader: &SegmentReader, doc: DocId) -> crate::Result<Explanation> {
|
||||
let mut scorer = self.scorer(reader, 1.0)?;
|
||||
if scorer.seek(doc) != doc {
|
||||
return Err(does_not_match(doc));
|
||||
|
||||
@@ -1,7 +1,9 @@
use common::TinySet;

use super::size_hint::estimate_intersection;
use crate::docset::{DocSet, SeekDangerResult, TERMINATED};
use crate::docset::{DocSet, SeekDangerResult, BLOCK_NUM_TINYBITSETS, TERMINATED};
use crate::query::term_query::TermScorer;
use crate::query::{box_scorer, EmptyScorer, Scorer};
use crate::query::{EmptyScorer, Scorer};
use crate::{DocId, Score};

/// Returns the intersection scorer.
@@ -17,10 +19,10 @@ use crate::{DocId, Score};
/// `size_hint` of the intersection.
pub fn intersect_scorers(
    mut scorers: Vec<Box<dyn Scorer>>,
    num_docs_segment: u32,
    segment_num_docs: u32,
) -> Box<dyn Scorer> {
    if scorers.is_empty() {
        return box_scorer(EmptyScorer);
        return Box::new(EmptyScorer);
    }
    if scorers.len() == 1 {
        return scorers.pop().unwrap();
@@ -29,7 +31,7 @@ pub fn intersect_scorers(
    scorers.sort_by_key(|scorer| scorer.cost());
    let doc = go_to_first_doc(&mut scorers[..]);
    if doc == TERMINATED {
        return box_scorer(EmptyScorer);
        return Box::new(EmptyScorer);
    }
    // We know that we have at least 2 elements.
    let left = scorers.remove(0);
@@ -38,18 +40,18 @@ pub fn intersect_scorers(
        .iter()
        .all(|&scorer| scorer.is::<TermScorer>());
    if all_term_scorers {
        return box_scorer(Intersection {
        return Box::new(Intersection {
            left: *(left.downcast::<TermScorer>().map_err(|_| ()).unwrap()),
            right: *(right.downcast::<TermScorer>().map_err(|_| ()).unwrap()),
            others: scorers,
            num_docs: num_docs_segment,
            segment_num_docs,
        });
    }
    box_scorer(Intersection {
    Box::new(Intersection {
        left,
        right,
        others: scorers,
        num_docs: num_docs_segment,
        segment_num_docs,
    })
}

@@ -58,7 +60,7 @@ pub struct Intersection<TDocSet: DocSet, TOtherDocSet: DocSet = Box<dyn Scorer>>
    left: TDocSet,
    right: TDocSet,
    others: Vec<TOtherDocSet>,
    num_docs: u32,
    segment_num_docs: u32,
}

fn go_to_first_doc<TDocSet: DocSet>(docsets: &mut [TDocSet]) -> DocId {
@@ -78,7 +80,10 @@ fn go_to_first_doc<TDocSet: DocSet>(docsets: &mut [TDocSet]) -> DocId {

impl<TDocSet: DocSet> Intersection<TDocSet, TDocSet> {
    /// num_docs is the number of documents in the segment.
    pub(crate) fn new(mut docsets: Vec<TDocSet>, num_docs: u32) -> Intersection<TDocSet, TDocSet> {
    pub(crate) fn new(
        mut docsets: Vec<TDocSet>,
        segment_num_docs: u32,
    ) -> Intersection<TDocSet, TDocSet> {
        let num_docsets = docsets.len();
        assert!(num_docsets >= 2);
        docsets.sort_by_key(|docset| docset.cost());
@@ -97,7 +102,7 @@ impl<TDocSet: DocSet> Intersection<TDocSet, TDocSet> {
            left,
            right,
            others: docsets,
            num_docs,
            segment_num_docs,
        }
    }
}
@@ -214,7 +219,7 @@ impl<TDocSet: DocSet, TOtherDocSet: DocSet> DocSet for Intersection<TDocSet, TOt
            [self.left.size_hint(), self.right.size_hint()]
                .into_iter()
                .chain(self.others.iter().map(DocSet::size_hint)),
            self.num_docs,
            self.segment_num_docs,
        )
    }

@@ -224,6 +229,91 @@ impl<TDocSet: DocSet, TOtherDocSet: DocSet> DocSet for Intersection<TDocSet, TOt
        // If there are docsets that are bad at skipping, they should also influence the cost.
        self.left.cost()
    }

    fn count_including_deleted(&mut self) -> u32 {
        const DENSITY_THRESHOLD_INVERSE: u32 = 32;
        if self
            .left
            .size_hint()
            .saturating_mul(DENSITY_THRESHOLD_INVERSE)
            < self.segment_num_docs
        {
            // Sparse path: if the lead iterator covers less than ~3% of docs,
            // the block approach wastes time on mostly-empty blocks.
            self.count_including_deleted_sparse()
        } else {
            // Dense approach. We push documents into a block bitset to then
            // perform count using popcount.
            self.count_including_deleted_dense()
        }
    }
}

const EMPTY_BLOCK: [TinySet; BLOCK_NUM_TINYBITSETS] = [TinySet::EMPTY; BLOCK_NUM_TINYBITSETS];

/// ANDs `other` into `mask` in-place. Returns `true` if the result is all zeros.
#[inline]
fn and_is_empty(
    mask: &mut [TinySet; BLOCK_NUM_TINYBITSETS],
    other: &[TinySet; BLOCK_NUM_TINYBITSETS],
) -> bool {
    let mut all_empty = true;
    for (m, t) in mask.iter_mut().zip(other.iter()) {
        *m = m.intersect(*t);
        all_empty &= m.is_empty();
    }
    all_empty
}

impl<TDocSet: DocSet, TOtherDocSet: DocSet> Intersection<TDocSet, TOtherDocSet> {
    fn count_including_deleted_sparse(&mut self) -> u32 {
        let mut count = 0u32;
        let mut doc = self.doc();
        while doc != TERMINATED {
            count += 1;
            doc = self.advance();
        }
        count
    }

    /// Dense block-wise bitmask intersection count.
    ///
    /// Fills a 1024-doc window from each iterator, ANDs the bitmasks together,
    /// and popcounts the result. `fill_bitset_block` handles seeking tails forward
    /// when they lag behind the current block.
    fn count_including_deleted_dense(&mut self) -> u32 {
        let mut count = 0u32;
        let mut next_base = self.left.doc();

        while next_base < TERMINATED {
            let base = next_base;

            // Fill lead bitmask.
            let mut mask = EMPTY_BLOCK;
            next_base = next_base.max(self.left.fill_bitset_block(base, &mut mask));

            let mut tail_mask = EMPTY_BLOCK;
            next_base = next_base.max(self.right.fill_bitset_block(base, &mut tail_mask));

            if and_is_empty(&mut mask, &tail_mask) {
                continue;
            }
            // AND with each additional tail.
            for other in &mut self.others {
                let mut other_mask = EMPTY_BLOCK;
                next_base = next_base.max(other.fill_bitset_block(base, &mut other_mask));
                if and_is_empty(&mut mask, &other_mask) {
                    continue;
                }
            }

            for tinyset in &mask {
                count += tinyset.len();
            }
        }

        count
    }
}

impl<TScorer, TOtherScorer> Scorer for Intersection<TScorer, TOtherScorer>
@@ -421,6 +511,82 @@ mod tests {
        }
    }

    proptest! {
        #[test]
        fn prop_test_count_including_deleted_matches_default(
            a in sorted_deduped_vec(1200, 400),
            b in sorted_deduped_vec(1200, 400),
            c in sorted_deduped_vec(1200, 400),
            num_docs in 1200u32..2000u32,
        ) {
            // Compute expected count via set intersection.
            let expected: u32 = a.iter()
                .filter(|doc| b.contains(doc) && c.contains(doc))
                .count() as u32;

            // Test count_including_deleted (dense path).
            let make_intersection = || {
                Intersection::new(
                    vec![
                        VecDocSet::from(a.clone()),
                        VecDocSet::from(b.clone()),
                        VecDocSet::from(c.clone()),
                    ],
                    num_docs,
                )
            };

            let mut intersection = make_intersection();
            let count = intersection.count_including_deleted();
            prop_assert_eq!(count, expected,
                "count_including_deleted mismatch: a={:?}, b={:?}, c={:?}", a, b, c);
        }
    }

    #[test]
    fn test_count_including_deleted_two_way() {
        let left = VecDocSet::from(vec![1, 3, 9]);
        let right = VecDocSet::from(vec![3, 4, 9, 18]);
        let mut intersection = Intersection::new(vec![left, right], 100);
        assert_eq!(intersection.count_including_deleted(), 2);
    }

    #[test]
    fn test_count_including_deleted_empty() {
        let a = VecDocSet::from(vec![1, 3]);
        let b = VecDocSet::from(vec![1, 4]);
        let c = VecDocSet::from(vec![3, 9]);
        let mut intersection = Intersection::new(vec![a, b, c], 100);
        assert_eq!(intersection.count_including_deleted(), 0);
    }

    /// Test with enough documents to exercise the dense path (>= num_docs/32).
    #[test]
    fn test_count_including_deleted_dense_path() {
        // Create dense docsets: many docs relative to segment size.
        let docs_a: Vec<u32> = (0..2000).step_by(2).collect(); // even numbers 0..2000
        let docs_b: Vec<u32> = (0..2000).step_by(3).collect(); // multiples of 3
        let expected = docs_a.iter().filter(|d| *d % 3 == 0).count() as u32;

        let a = VecDocSet::from(docs_a);
        let b = VecDocSet::from(docs_b);
        let mut intersection = Intersection::new(vec![a, b], 2000);
        assert_eq!(intersection.count_including_deleted(), expected);
    }

    /// Test that spans multiple blocks (>1024 docs).
    #[test]
    fn test_count_including_deleted_multi_block() {
        let docs_a: Vec<u32> = (0..5000).collect();
        let docs_b: Vec<u32> = (0..5000).step_by(7).collect();
        let expected = docs_b.len() as u32; // all of b is in a

        let a = VecDocSet::from(docs_a);
        let b = VecDocSet::from(docs_b);
        let mut intersection = Intersection::new(vec![a, b], 5000);
        assert_eq!(intersection.count_including_deleted(), expected);
    }

    #[test]
    fn test_bug_2811_intersection_candidate_should_increase() {
        let mut schema_builder = Schema::builder();

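The sparse/dense split in `count_including_deleted` above boils down to one idea: when the lead iterator is dense, bucket doc ids into fixed-size windows, AND the per-iterator bitmasks for each window, and popcount the result. A standalone sketch of that counting scheme follows; `block_mask` and `count_intersection_dense` are hypothetical helper names for illustration, not tantivy's API, and the window size (64 docs per `u64` word) is a simplification of tantivy's 1024-doc `TinySet` blocks.

```rust
// Hypothetical sketch of block-wise intersection counting: not tantivy code.

/// Builds a 64-bit mask for the doc ids of `docs` that fall in [base, base + 64).
fn block_mask(docs: &[u32], base: u32) -> u64 {
    let mut mask = 0u64;
    for &d in docs {
        if d >= base && d < base + 64 {
            mask |= 1u64 << (d - base);
        }
    }
    mask
}

/// Counts docs present in every sorted list by ANDing per-window masks
/// and popcounting (`count_ones`) the surviving bits.
fn count_intersection_dense(lists: &[&[u32]], max_doc: u32) -> u32 {
    let mut count = 0u32;
    let mut base = 0u32;
    while base < max_doc {
        let mut mask = u64::MAX;
        for docs in lists {
            mask &= block_mask(docs, base);
            if mask == 0 {
                break; // window already empty, skip remaining lists
            }
        }
        count += mask.count_ones();
        base += 64;
    }
    count
}

fn main() {
    let a: Vec<u32> = (0..200).step_by(2).collect(); // even numbers
    let b: Vec<u32> = (0..200).step_by(3).collect(); // multiples of 3
    // Intersection = multiples of 6 below 200: 0, 6, ..., 198 -> 34 values.
    assert_eq!(count_intersection_dense(&[&a[..], &b[..]], 200), 34);
}
```

The real implementation avoids the rescans this sketch does per window: each `DocSet` fills its block once while advancing, which is what makes the dense path pay off only above the density threshold.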
@@ -2,7 +2,7 @@ mod all_query;
mod automaton_weight;
mod bitset;
mod bm25;
pub(crate) mod boolean_query;
mod boolean_query;
mod boost_query;
mod const_score_query;
mod disjunction;
@@ -24,7 +24,7 @@ mod reqopt_scorer;
mod scorer;
mod set_query;
mod size_hint;
pub(crate) mod term_query;
mod term_query;
mod union;
mod weight;

@@ -53,17 +53,17 @@ pub use self::intersection::{intersect_scorers, Intersection};
pub use self::more_like_this::{MoreLikeThisQuery, MoreLikeThisQueryBuilder};
pub use self::phrase_prefix_query::PhrasePrefixQuery;
pub use self::phrase_query::regex_phrase_query::{wildcard_query_to_regex_str, RegexPhraseQuery};
pub use self::phrase_query::{PhraseQuery, PhraseScorer};
pub use self::phrase_query::PhraseQuery;
pub use self::query::{EnableScoring, Query, QueryClone};
pub use self::query_parser::{QueryParser, QueryParserError};
pub use self::range_query::*;
pub use self::regex_query::RegexQuery;
pub use self::reqopt_scorer::RequiredOptionalScorer;
pub use self::score_combiner::{DisjunctionMaxCombiner, ScoreCombiner, SumCombiner};
pub use self::scorer::{box_scorer, Scorer};
pub use self::scorer::Scorer;
pub use self::set_query::TermSetQuery;
pub use self::term_query::{BoxedTermScorer, TermQuery, TermScorer};
pub use self::union::{BufferedUnionScorer, SimpleUnion};
pub use self::term_query::TermQuery;
pub use self::union::BufferedUnionScorer;
#[cfg(test)]
pub use self::vec_docset::VecDocSet;
pub use self::weight::Weight;

@@ -8,7 +8,7 @@ use crate::query::{BooleanQuery, BoostQuery, Occur, Query, TermQuery};
use crate::schema::document::{Document, Value};
use crate::schema::{Field, FieldType, IndexRecordOption, Term};
use crate::tokenizer::{FacetTokenizer, PreTokenizedStream, TokenStream, Tokenizer};
use crate::{DocAddress, Result, Searcher, TantivyError};
use crate::{DocAddress, Result, Searcher, TantivyDocument, TantivyError};

#[derive(Debug, PartialEq)]
struct ScoreTerm {
@@ -129,7 +129,7 @@ impl MoreLikeThis {
        searcher: &Searcher,
        doc_address: DocAddress,
    ) -> Result<Vec<ScoreTerm>> {
        let doc = searcher.doc(doc_address)?;
        let doc = searcher.doc::<TantivyDocument>(doc_address)?;

        let field_to_values = doc.get_sorted_field_values();
        self.retrieve_terms_from_doc_fields(searcher, &field_to_values)
@@ -167,7 +167,7 @@ impl MoreLikeThis {
        term_frequencies: &mut HashMap<Term, usize>,
    ) -> Result<()> {
        let schema = searcher.schema();
        let tokenizer_manager = searcher.tokenizers();
        let tokenizer_manager = searcher.index().tokenizers();

        let field_entry = schema.get_field_entry(field);
        if !field_entry.is_indexed() {

@@ -2,7 +2,7 @@ use crate::docset::{DocSet, SeekDangerResult, TERMINATED};
use crate::fieldnorm::FieldNormReader;
use crate::postings::Postings;
use crate::query::bm25::Bm25Weight;
use crate::query::phrase_query::{intersection_exists, PhraseScorer};
use crate::query::phrase_query::{intersection_count, PhraseScorer};
use crate::query::Scorer;
use crate::{DocId, Score};

@@ -100,6 +100,7 @@ pub struct PhrasePrefixScorer<TPostings: Postings> {
    phrase_scorer: PhraseKind<TPostings>,
    suffixes: Vec<TPostings>,
    suffix_offset: u32,
    phrase_count: u32,
    suffix_position_buffer: Vec<u32>,
}

@@ -143,6 +144,7 @@ impl<TPostings: Postings> PhrasePrefixScorer<TPostings> {
            phrase_scorer,
            suffixes,
            suffix_offset: (max_offset - suffix_pos) as u32,
            phrase_count: 0,
            suffix_position_buffer: Vec::with_capacity(100),
        };
        if phrase_prefix_scorer.doc() != TERMINATED && !phrase_prefix_scorer.matches_prefix() {
@@ -151,7 +153,12 @@ impl<TPostings: Postings> PhrasePrefixScorer<TPostings> {
        phrase_prefix_scorer
    }

    pub fn phrase_count(&self) -> u32 {
        self.phrase_count
    }

    fn matches_prefix(&mut self) -> bool {
        let mut count = 0;
        let current_doc = self.doc();
        let pos_matching = self.phrase_scorer.get_intersection();
        for suffix in &mut self.suffixes {
@@ -161,12 +168,11 @@ impl<TPostings: Postings> PhrasePrefixScorer<TPostings> {
            let doc = suffix.seek(current_doc);
            if doc == current_doc {
                suffix.positions_with_offset(self.suffix_offset, &mut self.suffix_position_buffer);
                if intersection_exists(pos_matching, &self.suffix_position_buffer) {
                    return true;
                }
                count += intersection_count(pos_matching, &self.suffix_position_buffer);
            }
        }
        false
        self.phrase_count = count as u32;
        count != 0
    }
}

@@ -1,11 +1,12 @@
use super::{prefix_end, PhrasePrefixScorer};
use crate::fieldnorm::FieldNormReader;
use crate::index::SegmentReader;
use crate::postings::Postings;
use crate::postings::SegmentPostings;
use crate::query::bm25::Bm25Weight;
use crate::query::{box_scorer, EmptyScorer, Scorer, Weight};
use crate::query::explanation::does_not_match;
use crate::query::{EmptyScorer, Explanation, Scorer, Weight};
use crate::schema::{IndexRecordOption, Term};
use crate::Score;
use crate::{DocId, DocSet, Score};

pub struct PhrasePrefixWeight {
    phrase_terms: Vec<(usize, Term)>,
@@ -31,10 +32,10 @@ impl PhrasePrefixWeight {
        }
    }

    fn fieldnorm_reader(&self, reader: &dyn SegmentReader) -> crate::Result<FieldNormReader> {
    fn fieldnorm_reader(&self, reader: &SegmentReader) -> crate::Result<FieldNormReader> {
        let field = self.phrase_terms[0].1.field();
        if self.similarity_weight_opt.is_some() {
            if let Ok(fieldnorm_reader) = reader.get_fieldnorms_reader(field) {
            if let Some(fieldnorm_reader) = reader.fieldnorms_readers().get_field(field)? {
                return Ok(fieldnorm_reader);
            }
        }
@@ -43,15 +44,15 @@ impl PhrasePrefixWeight {

    pub(crate) fn phrase_scorer(
        &self,
        reader: &dyn SegmentReader,
        reader: &SegmentReader,
        boost: Score,
    ) -> crate::Result<Option<Box<dyn Scorer>>> {
    ) -> crate::Result<Option<PhrasePrefixScorer<SegmentPostings>>> {
        let similarity_weight_opt = self
            .similarity_weight_opt
            .as_ref()
            .map(|similarity_weight| similarity_weight.boost_by(boost));
        let fieldnorm_reader = self.fieldnorm_reader(reader)?;
        let mut term_postings_list: Vec<(usize, Box<dyn Postings>)> = Vec::new();
        let mut term_postings_list = Vec::new();
        for &(offset, ref term) in &self.phrase_terms {
            if let Some(postings) = reader
                .inverted_index(term.field())?
@@ -102,32 +103,49 @@ impl PhrasePrefixWeight {
            }
        }

        Ok(Some(box_scorer(PhrasePrefixScorer::new(
        Ok(Some(PhrasePrefixScorer::new(
            term_postings_list,
            similarity_weight_opt,
            fieldnorm_reader,
            suffixes,
            self.prefix.0,
        ))))
        )))
    }
}

impl Weight for PhrasePrefixWeight {
    fn scorer(&self, reader: &dyn SegmentReader, boost: Score) -> crate::Result<Box<dyn Scorer>> {
    fn scorer(&self, reader: &SegmentReader, boost: Score) -> crate::Result<Box<dyn Scorer>> {
        if let Some(scorer) = self.phrase_scorer(reader, boost)? {
            Ok(scorer)
            Ok(Box::new(scorer))
        } else {
            Ok(box_scorer(EmptyScorer))
            Ok(Box::new(EmptyScorer))
        }
    }

    fn explain(&self, reader: &SegmentReader, doc: DocId) -> crate::Result<Explanation> {
        let scorer_opt = self.phrase_scorer(reader, 1.0)?;
        if scorer_opt.is_none() {
            return Err(does_not_match(doc));
        }
        let mut scorer = scorer_opt.unwrap();
        if scorer.seek(doc) != doc {
            return Err(does_not_match(doc));
        }
        let fieldnorm_reader = self.fieldnorm_reader(reader)?;
        let fieldnorm_id = fieldnorm_reader.fieldnorm_id(doc);
        let phrase_count = scorer.phrase_count();
        let mut explanation = Explanation::new("Phrase Prefix Scorer", scorer.score());
        if let Some(similarity_weight) = self.similarity_weight_opt.as_ref() {
            explanation.add_detail(similarity_weight.explain(fieldnorm_id, phrase_count));
        }
        Ok(explanation)
    }
}

#[cfg(test)]
mod tests {
    use crate::docset::TERMINATED;
    use crate::index::Index;
    use crate::postings::Postings;
    use crate::query::phrase_prefix_query::PhrasePrefixScorer;
    use crate::query::{EnableScoring, PhrasePrefixQuery, Query};
    use crate::schema::{Schema, TEXT};
    use crate::{DocSet, IndexWriter, Term};
@@ -168,14 +186,14 @@ mod tests {
            .phrase_prefix_query_weight(enable_scoring)
            .unwrap()
            .unwrap();
        let mut phrase_scorer_boxed = phrase_weight
        let mut phrase_scorer = phrase_weight
            .phrase_scorer(searcher.segment_reader(0u32), 1.0)?
            .unwrap();
        let phrase_scorer: &mut PhrasePrefixScorer<Box<dyn Postings>> =
            phrase_scorer_boxed.as_any_mut().downcast_mut().unwrap();
        assert_eq!(phrase_scorer.doc(), 1);
        assert_eq!(phrase_scorer.phrase_count(), 2);
        assert_eq!(phrase_scorer.advance(), 2);
        assert_eq!(phrase_scorer.doc(), 2);
        assert_eq!(phrase_scorer.phrase_count(), 1);
        assert_eq!(phrase_scorer.advance(), TERMINATED);
        Ok(())
    }
@@ -195,15 +213,14 @@ mod tests {
            .phrase_prefix_query_weight(enable_scoring)
            .unwrap()
            .unwrap();
        let mut phrase_scorer_boxed = phrase_weight
        let mut phrase_scorer = phrase_weight
            .phrase_scorer(searcher.segment_reader(0u32), 1.0)?
            .unwrap();
        let phrase_scorer = phrase_scorer_boxed
            .downcast_mut::<PhrasePrefixScorer<Box<dyn Postings>>>()
            .unwrap();
        assert_eq!(phrase_scorer.doc(), 1);
        assert_eq!(phrase_scorer.phrase_count(), 2);
        assert_eq!(phrase_scorer.advance(), 2);
        assert_eq!(phrase_scorer.doc(), 2);
        assert_eq!(phrase_scorer.phrase_count(), 1);
        assert_eq!(phrase_scorer.advance(), TERMINATED);
        Ok(())
    }

@@ -5,7 +5,7 @@ pub mod regex_phrase_query;
mod regex_phrase_weight;

pub use self::phrase_query::PhraseQuery;
pub(crate) use self::phrase_scorer::intersection_exists;
pub(crate) use self::phrase_scorer::intersection_count;
pub use self::phrase_scorer::PhraseScorer;
pub use self::phrase_weight::PhraseWeight;

@@ -126,7 +126,7 @@ impl PhraseQuery {
        };
        let mut weight = PhraseWeight::new(self.phrase_terms.clone(), bm25_weight_opt);
        if self.slop > 0 {
            weight.set_slop(self.slop);
            weight.slop(self.slop);
        }
        Ok(weight)
    }

@@ -2,9 +2,9 @@ use std::cmp::Ordering;

use crate::docset::{DocSet, SeekDangerResult, TERMINATED};
use crate::fieldnorm::FieldNormReader;
use crate::postings::{Postings, SegmentPostings as StandardPostings};
use crate::postings::Postings;
use crate::query::bm25::Bm25Weight;
use crate::query::{Explanation, Intersection, Scorer};
use crate::query::{Intersection, Scorer};
use crate::{DocId, Score};

struct PostingsWithOffset<TPostings> {
@@ -43,14 +43,7 @@ impl<TPostings: Postings> DocSet for PostingsWithOffset<TPostings> {
    }
}

/// `PhraseScorer` is a `Scorer` that matches documents that match a phrase query, and scores them
/// based on the number of times the phrase appears in the document and the fieldnorm of the
/// document.
///
/// It is implemented as an intersection of the postings of each term in the
/// phrase, where the intersection condition is that the positions of the terms are next to each
/// other (or within a certain slop).
pub struct PhraseScorer<TPostings: Postings = StandardPostings> {
pub struct PhraseScorer<TPostings: Postings> {
    intersection_docset: Intersection<PostingsWithOffset<TPostings>, PostingsWithOffset<TPostings>>,
    num_terms: usize,
    left_positions: Vec<u32>,
@@ -65,7 +58,7 @@ pub struct PhraseScorer<TPostings: Postings = StandardPostings> {
}

/// Returns true if and only if the two sorted arrays contain a common element
pub(crate) fn intersection_exists(left: &[u32], right: &[u32]) -> bool {
fn intersection_exists(left: &[u32], right: &[u32]) -> bool {
    let mut left_index = 0;
    let mut right_index = 0;
    while left_index < left.len() && right_index < right.len() {
@@ -86,7 +79,7 @@ pub(crate) fn intersection_exists(left: &[u32], right: &[u32]) -> bool {
    false
}

fn intersection_count(left: &[u32], right: &[u32]) -> usize {
pub(crate) fn intersection_count(left: &[u32], right: &[u32]) -> usize {
    let mut left_index = 0;
    let mut right_index = 0;
    let mut count = 0;
@@ -353,9 +346,6 @@ fn intersection_count_with_carrying_slop(

impl<TPostings: Postings> PhraseScorer<TPostings> {
    // If similarity_weight is None, then scoring is disabled.
    /// Creates a phrase scorer from term postings and phrase matching options.
    ///
    /// `slop` controls the maximum positional distance allowed between terms.
    pub fn new(
        term_postings: Vec<(usize, TPostings)>,
        similarity_weight_opt: Option<Bm25Weight>,
@@ -412,7 +402,6 @@ impl<TPostings: Postings> PhraseScorer<TPostings> {
        scorer
    }

    /// Returns the number of phrases identified in the current matching doc.
    pub fn phrase_count(&self) -> u32 {
        self.phrase_count
    }
@@ -595,17 +584,6 @@ impl<TPostings: Postings> Scorer for PhraseScorer<TPostings> {
            1.0f32
        }
    }

    fn explain(&mut self) -> Explanation {
        let doc = self.doc();
        let phrase_count = self.phrase_count();
        let fieldnorm_id = self.fieldnorm_reader.fieldnorm_id(doc);
        let mut explanation = Explanation::new("Phrase Scorer", self.score());
        if let Some(similarity_weight) = self.similarity_weight_opt.as_ref() {
            explanation.add_detail(similarity_weight.explain(fieldnorm_id, phrase_count));
        }
        explanation
    }
}

#[cfg(test)]

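The `intersection_exists` / `intersection_count` helpers changed above both rely on the same two-pointer walk over sorted position arrays. A standalone sketch of that walk (illustrative only; the function name mirrors tantivy's but this is not its exact code):

```rust
use std::cmp::Ordering;

/// Counts the elements common to two sorted, ascending u32 slices
/// by advancing whichever pointer currently holds the smaller value.
fn intersection_count(left: &[u32], right: &[u32]) -> usize {
    let (mut i, mut j, mut count) = (0, 0, 0);
    while i < left.len() && j < right.len() {
        match left[i].cmp(&right[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                count += 1;
                i += 1;
                j += 1;
            }
        }
    }
    count
}

fn main() {
    // Positions 3 and 9 are shared.
    assert_eq!(intersection_count(&[1, 3, 5, 9], &[3, 4, 9, 18]), 2);
    assert_eq!(intersection_count(&[1, 2], &[3, 4]), 0);
}
```

The `exists` variant is the same loop with an early `return true` on the first equal pair, which is why the diff can swap one for the other at the call site while additionally accumulating a count.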
@@ -1,43 +1,13 @@
|
||||
use std::io;
|
||||
|
||||
use super::PhraseScorer;
|
||||
use crate::fieldnorm::FieldNormReader;
|
||||
use crate::index::{
|
||||
try_downcast_and_call, InvertedIndexReader, SegmentReader, TypedInvertedIndexReaderCb,
|
||||
};
|
||||
use crate::postings::TermInfo;
|
||||
use crate::index::SegmentReader;
|
||||
use crate::postings::SegmentPostings;
|
||||
use crate::query::bm25::Bm25Weight;
|
||||
use crate::query::explanation::does_not_match;
|
||||
use crate::query::{box_scorer, EmptyScorer, Explanation, Scorer, Weight};
|
||||
use crate::schema::Term;
|
||||
use crate::query::{EmptyScorer, Explanation, Scorer, Weight};
|
||||
use crate::schema::{IndexRecordOption, Term};
|
||||
use crate::{DocId, DocSet, Score};
|
||||
|
||||
struct BuildPhraseScorer<'a> {
|
||||
term_infos: &'a [(usize, TermInfo)],
|
||||
similarity_weight_opt: Option<Bm25Weight>,
|
||||
fieldnorm_reader: FieldNormReader,
|
||||
slop: u32,
|
||||
}
|
||||
|
||||
impl TypedInvertedIndexReaderCb<io::Result<Box<dyn Scorer>>> for BuildPhraseScorer<'_> {
|
||||
fn call<I: InvertedIndexReader + ?Sized>(&mut self, reader: &I) -> io::Result<Box<dyn Scorer>> {
|
||||
let mut offset_and_term_postings = Vec::with_capacity(self.term_infos.len());
|
||||
for (offset, term_info) in self.term_infos {
|
||||
let postings = reader.read_postings_from_terminfo(
|
||||
term_info,
|
||||
crate::schema::IndexRecordOption::WithFreqsAndPositions,
|
||||
)?;
|
||||
offset_and_term_postings.push((*offset, postings));
|
||||
}
|
||||
let scorer = super::PhraseScorer::new(
|
||||
offset_and_term_postings,
|
||||
self.similarity_weight_opt.clone(),
|
||||
self.fieldnorm_reader.clone(),
|
||||
self.slop,
|
||||
);
|
||||
Ok(box_scorer(scorer))
|
||||
}
|
||||
}
|
||||
|
||||
pub struct PhraseWeight {
|
||||
phrase_terms: Vec<(usize, Term)>,
|
||||
similarity_weight_opt: Option<Bm25Weight>,
|
||||
@@ -51,17 +21,18 @@ impl PhraseWeight {
|
||||
phrase_terms: Vec<(usize, Term)>,
|
||||
similarity_weight_opt: Option<Bm25Weight>,
|
||||
) -> PhraseWeight {
|
||||
let slop = 0;
|
||||
PhraseWeight {
|
||||
phrase_terms,
|
||||
similarity_weight_opt,
|
||||
slop: 0,
|
||||
slop,
|
||||
}
|
||||
}
|
||||
|
||||
fn fieldnorm_reader(&self, reader: &dyn SegmentReader) -> crate::Result<FieldNormReader> {
|
||||
fn fieldnorm_reader(&self, reader: &SegmentReader) -> crate::Result<FieldNormReader> {
|
||||
let field = self.phrase_terms[0].1.field();
|
||||
if self.similarity_weight_opt.is_some() {
|
||||
if let Ok(fieldnorm_reader) = reader.get_fieldnorms_reader(field) {
|
||||
if let Some(fieldnorm_reader) = reader.fieldnorms_readers().get_field(field)? {
|
||||
return Ok(fieldnorm_reader);
|
||||
}
|
||||
}
|
||||
@@ -70,69 +41,48 @@ impl PhraseWeight {
|
||||
|
||||
pub(crate) fn phrase_scorer(
|
||||
&self,
|
||||
reader: &dyn SegmentReader,
|
||||
reader: &SegmentReader,
|
||||
boost: Score,
|
||||
) -> crate::Result<Option<Box<dyn Scorer>>> {
|
||||
) -> crate::Result<Option<PhraseScorer<SegmentPostings>>> {
|
||||
let similarity_weight_opt = self
|
||||
.similarity_weight_opt
|
||||
.as_ref()
|
||||
.map(|similarity_weight| similarity_weight.boost_by(boost));
|
||||
let fieldnorm_reader = self.fieldnorm_reader(reader)?;
|
||||
|
||||
if self.phrase_terms.is_empty() {
|
||||
return Ok(None);
|
||||
}
|
||||
let field = self.phrase_terms[0].1.field();
|
||||
|
||||
if !self
|
||||
.phrase_terms
|
||||
.iter()
|
||||
.all(|(_offset, term)| term.field() == field)
|
||||
{
|
||||
return Err(crate::TantivyError::InvalidArgument(
"All terms in a phrase query must belong to the same field".to_string(),
));
}

let inverted_index_reader = reader.inverted_index(field)?;

let mut term_infos: Vec<(usize, TermInfo)> = Vec::with_capacity(self.phrase_terms.len());

let mut term_postings_list = Vec::new();
for &(offset, ref term) in &self.phrase_terms {
let Some(term_info) = inverted_index_reader.get_term_info(term)? else {
if let Some(postings) = reader
.inverted_index(term.field())?
.read_postings(term, IndexRecordOption::WithFreqsAndPositions)?
{
term_postings_list.push((offset, postings));
} else {
return Ok(None);
};
term_infos.push((offset, term_info));
}
}

let mut phrase_scorer_builder = BuildPhraseScorer {
term_infos: &term_infos,
Ok(Some(PhraseScorer::new(
term_postings_list,
similarity_weight_opt,
fieldnorm_reader,
slop: self.slop,
};
let scorer =
try_downcast_and_call(inverted_index_reader.as_ref(), &mut phrase_scorer_builder)?;

Ok(Some(scorer))
self.slop,
)))
}

/// Sets the slop for the given PhraseWeight.
pub fn set_slop(&mut self, slop: u32) {
pub fn slop(&mut self, slop: u32) {
self.slop = slop;
}
}

impl Weight for PhraseWeight {
fn scorer(&self, reader: &dyn SegmentReader, boost: Score) -> crate::Result<Box<dyn Scorer>> {
fn scorer(&self, reader: &SegmentReader, boost: Score) -> crate::Result<Box<dyn Scorer>> {
if let Some(scorer) = self.phrase_scorer(reader, boost)? {
Ok(scorer)
Ok(Box::new(scorer))
} else {
Ok(box_scorer(EmptyScorer))
Ok(Box::new(EmptyScorer))
}
}

fn explain(&self, reader: &dyn SegmentReader, doc: DocId) -> crate::Result<Explanation> {
fn explain(&self, reader: &SegmentReader, doc: DocId) -> crate::Result<Explanation> {
let scorer_opt = self.phrase_scorer(reader, 1.0)?;
if scorer_opt.is_none() {
return Err(does_not_match(doc));
@@ -141,7 +91,14 @@ impl Weight for PhraseWeight {
if scorer.seek(doc) != doc {
return Err(does_not_match(doc));
}
Ok(scorer.explain())
let fieldnorm_reader = self.fieldnorm_reader(reader)?;
let fieldnorm_id = fieldnorm_reader.fieldnorm_id(doc);
let phrase_count = scorer.phrase_count();
let mut explanation = Explanation::new("Phrase Scorer", scorer.score());
if let Some(similarity_weight) = self.similarity_weight_opt.as_ref() {
explanation.add_detail(similarity_weight.explain(fieldnorm_id, phrase_count));
}
Ok(explanation)
}
}

@@ -149,8 +106,7 @@ impl Weight for PhraseWeight {
mod tests {
use super::super::tests::create_index;
use crate::docset::TERMINATED;
use crate::query::phrase_query::PhraseScorer;
use crate::query::{EnableScoring, PhraseQuery, Scorer};
use crate::query::{EnableScoring, PhraseQuery};
use crate::{DocSet, Term};

#[test]
@@ -165,11 +121,9 @@ mod tests {
]);
let enable_scoring = EnableScoring::enabled_from_searcher(&searcher);
let phrase_weight = phrase_query.phrase_weight(enable_scoring).unwrap();
let phrase_scorer_boxed: Box<dyn Scorer> = phrase_weight
let mut phrase_scorer = phrase_weight
.phrase_scorer(searcher.segment_reader(0u32), 1.0)?
.unwrap();
let mut phrase_scorer: Box<PhraseScorer> =
phrase_scorer_boxed.downcast::<PhraseScorer>().ok().unwrap();
assert_eq!(phrase_scorer.doc(), 1);
assert_eq!(phrase_scorer.phrase_count(), 2);
assert_eq!(phrase_scorer.advance(), 2);

@@ -5,16 +5,14 @@ use tantivy_fst::Regex;

use super::PhraseScorer;
use crate::fieldnorm::FieldNormReader;
use crate::index::{InvertedIndexReader, SegmentReader};
use crate::postings::{LoadedPostings, Postings, TermInfo};
use crate::index::SegmentReader;
use crate::postings::{LoadedPostings, Postings, SegmentPostings, TermInfo};
use crate::query::bm25::Bm25Weight;
use crate::query::explanation::does_not_match;
use crate::query::union::{BitSetPostingUnion, SimpleUnion};
use crate::query::{
box_scorer, AutomatonWeight, BitSetDocSet, EmptyScorer, Explanation, Scorer, Weight,
};
use crate::query::{AutomatonWeight, BitSetDocSet, EmptyScorer, Explanation, Scorer, Weight};
use crate::schema::{Field, IndexRecordOption};
use crate::{DocId, DocSet, DynInvertedIndexReader, Score};
use crate::{DocId, DocSet, InvertedIndexReader, Score};

type UnionType = SimpleUnion<Box<dyn Postings + 'static>>;

@@ -47,9 +45,9 @@ impl RegexPhraseWeight {
}
}

fn fieldnorm_reader(&self, reader: &dyn SegmentReader) -> crate::Result<FieldNormReader> {
fn fieldnorm_reader(&self, reader: &SegmentReader) -> crate::Result<FieldNormReader> {
if self.similarity_weight_opt.is_some() {
if let Ok(fieldnorm_reader) = reader.get_fieldnorms_reader(self.field) {
if let Some(fieldnorm_reader) = reader.fieldnorms_readers().get_field(self.field)? {
return Ok(fieldnorm_reader);
}
}
@@ -58,7 +56,7 @@ impl RegexPhraseWeight {

pub(crate) fn phrase_scorer(
&self,
reader: &dyn SegmentReader,
reader: &SegmentReader,
boost: Score,
) -> crate::Result<Option<PhraseScorer<UnionType>>> {
let similarity_weight_opt = self
@@ -86,8 +84,7 @@ impl RegexPhraseWeight {
"Phrase query exceeded max expansions {num_terms}"
)));
}
let union =
Self::get_union_from_term_infos(&term_infos, reader, inverted_index.as_ref())?;
let union = Self::get_union_from_term_infos(&term_infos, reader, &inverted_index)?;

posting_lists.push((offset, union));
}
@@ -102,11 +99,22 @@ impl RegexPhraseWeight {

/// Add all docs of the term to the docset
fn add_to_bitset(
inverted_index: &(impl InvertedIndexReader + ?Sized),
inverted_index: &InvertedIndexReader,
term_info: &TermInfo,
doc_bitset: &mut BitSet,
) -> crate::Result<()> {
inverted_index.fill_bitset_from_terminfo(term_info, doc_bitset)?;
let mut block_segment_postings = inverted_index
.read_block_postings_from_terminfo(term_info, IndexRecordOption::Basic)?;
loop {
let docs = block_segment_postings.docs();
if docs.is_empty() {
break;
}
for &doc in docs {
doc_bitset.insert(doc);
}
block_segment_postings.advance();
}
Ok(())
}
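The change above replaces the `fill_bitset_from_terminfo` helper with an explicit loop over block postings. A minimal stand-alone sketch of that loop, using simplified hypothetical `BitSet` and `BlockPostings` types (not tantivy's own), could look like:

```rust
// Minimal stand-ins for a doc-id bitset and a block-wise postings cursor.
// Both types here are simplified illustrations, not tantivy's real API.
struct BitSet {
    words: Vec<u64>,
}

impl BitSet {
    fn with_max_value(max_value: u32) -> BitSet {
        BitSet { words: vec![0u64; (max_value as usize + 63) / 64] }
    }
    fn insert(&mut self, doc: u32) {
        self.words[(doc / 64) as usize] |= 1u64 << (doc % 64);
    }
    fn contains(&self, doc: u32) -> bool {
        self.words[(doc / 64) as usize] & (1u64 << (doc % 64)) != 0
    }
}

struct BlockPostings {
    blocks: Vec<Vec<u32>>,
    cursor: usize,
}

impl BlockPostings {
    // An empty slice signals that the postings are exhausted.
    fn docs(&self) -> &[u32] {
        match self.blocks.get(self.cursor) {
            Some(block) => block,
            None => &[],
        }
    }
    fn advance(&mut self) {
        self.cursor += 1;
    }
}

// Same shape as the loop in the diff: drain block after block,
// inserting every doc id into the bitset.
fn fill_bitset(postings: &mut BlockPostings, doc_bitset: &mut BitSet) {
    loop {
        let docs = postings.docs();
        if docs.is_empty() {
            break;
        }
        for &doc in docs {
            doc_bitset.insert(doc);
        }
        postings.advance();
    }
}

fn main() {
    let mut postings = BlockPostings { blocks: vec![vec![1, 5, 9], vec![64, 70]], cursor: 0 };
    let mut bitset = BitSet::with_max_value(128);
    fill_bitset(&mut postings, &mut bitset);
    assert!(bitset.contains(5) && bitset.contains(70) && !bitset.contains(2));
}
```

The empty-block check is what terminates the loop, since block cursors of this shape report exhaustion by returning an empty slice rather than an `Option`.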

@@ -166,8 +174,8 @@ impl RegexPhraseWeight {
/// Use Roaring Bitmaps for sparse terms. The full bitvec is main memory consumer currently.
pub(crate) fn get_union_from_term_infos(
term_infos: &[TermInfo],
reader: &dyn SegmentReader,
inverted_index: &dyn DynInvertedIndexReader,
reader: &SegmentReader,
inverted_index: &InvertedIndexReader,
) -> crate::Result<UnionType> {
let max_doc = reader.max_doc();

@@ -180,19 +188,16 @@ impl RegexPhraseWeight {
// - Bucket 1: Terms appearing in 0.1% to 1% of documents
// - Bucket 2: Terms appearing in 1% to 10% of documents
// - Bucket 3: Terms appearing in more than 10% of documents
let mut buckets: Vec<(BitSet, Vec<Box<dyn Postings>>)> = (0..4)
let mut buckets: Vec<(BitSet, Vec<SegmentPostings>)> = (0..4)
.map(|_| (BitSet::with_max_value(max_doc), Vec::new()))
.collect();

const SPARSE_TERM_DOC_THRESHOLD: u32 = 100;

for term_info in term_infos {
let mut term_posting = crate::index::load_postings_from_terminfo(
inverted_index,
term_info,
IndexRecordOption::WithFreqsAndPositions,
)?;
let num_docs = u32::from(term_posting.doc_freq());
let mut term_posting = inverted_index
.read_postings_from_terminfo(term_info, IndexRecordOption::WithFreqsAndPositions)?;
let num_docs = term_posting.doc_freq();

if num_docs < SPARSE_TERM_DOC_THRESHOLD {
let current_bucket = &mut sparse_buckets[0];
@@ -264,15 +269,15 @@ impl RegexPhraseWeight {
}

impl Weight for RegexPhraseWeight {
fn scorer(&self, reader: &dyn SegmentReader, boost: Score) -> crate::Result<Box<dyn Scorer>> {
fn scorer(&self, reader: &SegmentReader, boost: Score) -> crate::Result<Box<dyn Scorer>> {
if let Some(scorer) = self.phrase_scorer(reader, boost)? {
Ok(box_scorer(scorer))
Ok(Box::new(scorer))
} else {
Ok(box_scorer(EmptyScorer))
Ok(Box::new(EmptyScorer))
}
}

fn explain(&self, reader: &dyn SegmentReader, doc: DocId) -> crate::Result<Explanation> {
fn explain(&self, reader: &SegmentReader, doc: DocId) -> crate::Result<Explanation> {
let scorer_opt = self.phrase_scorer(reader, 1.0)?;
if scorer_opt.is_none() {
return Err(does_not_match(doc));

@@ -146,7 +146,7 @@ pub trait Query: QueryClone + Send + Sync + downcast_rs::Downcast + fmt::Debug {
let weight = self.weight(EnableScoring::disabled_from_searcher(searcher))?;
let mut result = 0;
for reader in searcher.segment_readers() {
result += weight.count(reader.as_ref())? as usize;
result += weight.count(reader)? as usize;
}
Ok(result)
}

@@ -5,15 +5,13 @@ use common::bounds::{map_bound, BoundsRange};
use common::BitSet;

use super::range_query_fastfield::FastFieldRangeWeight;
use crate::index::{InvertedIndexReader as _, SegmentReader};
use crate::index::SegmentReader;
use crate::query::explanation::does_not_match;
use crate::query::range_query::is_type_valid_for_fastfield_range_query;
use crate::query::{
box_scorer, BitSetDocSet, ConstScorer, EnableScoring, Explanation, Query, Scorer, Weight,
};
use crate::schema::{Field, Term, Type};
use crate::query::{BitSetDocSet, ConstScorer, EnableScoring, Explanation, Query, Scorer, Weight};
use crate::schema::{Field, IndexRecordOption, Term, Type};
use crate::termdict::{TermDictionary, TermStreamer};
use crate::{DocId, DocSet, Score};
use crate::{DocId, Score};

/// `RangeQuery` matches all documents that have at least one term within a defined range.
///
@@ -214,7 +212,7 @@ impl InvertedIndexRangeWeight {
}

impl Weight for InvertedIndexRangeWeight {
fn scorer(&self, reader: &dyn SegmentReader, boost: Score) -> crate::Result<Box<dyn Scorer>> {
fn scorer(&self, reader: &SegmentReader, boost: Score) -> crate::Result<Box<dyn Scorer>> {
let max_doc = reader.max_doc();
let mut doc_bitset = BitSet::with_max_value(max_doc);

@@ -230,13 +228,24 @@ impl Weight for InvertedIndexRangeWeight {
}
processed_count += 1;
let term_info = term_range.value();
inverted_index.fill_bitset_from_terminfo(term_info, &mut doc_bitset)?;
let mut block_segment_postings = inverted_index
.read_block_postings_from_terminfo(term_info, IndexRecordOption::Basic)?;
loop {
let docs = block_segment_postings.docs();
if docs.is_empty() {
break;
}
for &doc in block_segment_postings.docs() {
doc_bitset.insert(doc);
}
block_segment_postings.advance();
}
}
let doc_bitset = BitSetDocSet::from(doc_bitset);
Ok(box_scorer(ConstScorer::new(doc_bitset, boost)))
Ok(Box::new(ConstScorer::new(doc_bitset, boost)))
}

fn explain(&self, reader: &dyn SegmentReader, doc: DocId) -> crate::Result<Explanation> {
fn explain(&self, reader: &SegmentReader, doc: DocId) -> crate::Result<Explanation> {
let mut scorer = self.scorer(reader, 1.0)?;
if scorer.seek(doc) != doc {
return Err(does_not_match(doc));
@@ -677,7 +686,7 @@ mod tests {
.weight(EnableScoring::disabled_from_schema(&schema))
.unwrap();
let range_scorer = range_weight
.scorer(searcher.segment_readers()[0].as_ref(), 1.0f32)
.scorer(&searcher.segment_readers()[0], 1.0f32)
.unwrap();
range_scorer
};

@@ -13,8 +13,7 @@ use common::bounds::{BoundsRange, TransformBound};

use super::fast_field_range_doc_set::RangeDocSet;
use crate::query::{
box_scorer, AllScorer, ConstScorer, EmptyScorer, EnableScoring, Explanation, Query, Scorer,
Weight,
AllScorer, ConstScorer, EmptyScorer, EnableScoring, Explanation, Query, Scorer, Weight,
};
use crate::schema::{Type, ValueBytes};
use crate::{DocId, DocSet, Score, SegmentReader, TantivyError, Term};
@@ -53,10 +52,10 @@ impl FastFieldRangeWeight {
}

impl Weight for FastFieldRangeWeight {
fn scorer(&self, reader: &dyn SegmentReader, boost: Score) -> crate::Result<Box<dyn Scorer>> {
fn scorer(&self, reader: &SegmentReader, boost: Score) -> crate::Result<Box<dyn Scorer>> {
// Check if both bounds are Bound::Unbounded
if self.bounds.is_unbounded() {
return Ok(box_scorer(AllScorer::new(reader.max_doc())));
return Ok(Box::new(AllScorer::new(reader.max_doc())));
}

let term = self
@@ -96,7 +95,7 @@ impl Weight for FastFieldRangeWeight {
let Some(str_dict_column): Option<StrColumn> =
reader.fast_fields().str(&field_name)?
else {
return Ok(box_scorer(EmptyScorer));
return Ok(Box::new(EmptyScorer));
};
let dict = str_dict_column.dictionary();

@@ -108,7 +107,7 @@ impl Weight for FastFieldRangeWeight {
let Some((column, _col_type)) = fast_field_reader
.u64_lenient_for_type(Some(&[ColumnType::Str]), &field_name)?
else {
return Ok(box_scorer(EmptyScorer));
return Ok(Box::new(EmptyScorer));
};
search_on_u64_ff(column, boost, BoundsRange::new(lower_bound, upper_bound))
}
@@ -120,7 +119,7 @@ impl Weight for FastFieldRangeWeight {
let Some((column, _col_type)) = fast_field_reader
.u64_lenient_for_type(Some(&[ColumnType::DateTime]), &field_name)?
else {
return Ok(box_scorer(EmptyScorer));
return Ok(Box::new(EmptyScorer));
};
let bounds = bounds.map_bound(|term| term.as_date().unwrap().to_u64());
search_on_u64_ff(
@@ -147,7 +146,7 @@ impl Weight for FastFieldRangeWeight {
let Some(ip_addr_column): Option<Column<Ipv6Addr>> =
reader.fast_fields().column_opt(&field_name)?
else {
return Ok(box_scorer(EmptyScorer));
return Ok(Box::new(EmptyScorer));
};
let value_range = bound_range_inclusive_ip(
&bounds.lower_bound,
@@ -156,11 +155,11 @@ impl Weight for FastFieldRangeWeight {
ip_addr_column.max_value(),
);
let docset = RangeDocSet::new(value_range, ip_addr_column);
Ok(box_scorer(ConstScorer::new(docset, boost)))
Ok(Box::new(ConstScorer::new(docset, boost)))
} else if field_type.is_str() {
let Some(str_dict_column): Option<StrColumn> = reader.fast_fields().str(&field_name)?
else {
return Ok(box_scorer(EmptyScorer));
return Ok(Box::new(EmptyScorer));
};
let dict = str_dict_column.dictionary();

@@ -172,7 +171,7 @@ impl Weight for FastFieldRangeWeight {
let Some((column, _col_type)) =
fast_field_reader.u64_lenient_for_type(None, &field_name)?
else {
return Ok(box_scorer(EmptyScorer));
return Ok(Box::new(EmptyScorer));
};
search_on_u64_ff(column, boost, BoundsRange::new(lower_bound, upper_bound))
} else if field_type.is_bytes() {
@@ -229,7 +228,7 @@ impl Weight for FastFieldRangeWeight {
&field_name,
)?
else {
return Ok(box_scorer(EmptyScorer));
return Ok(Box::new(EmptyScorer));
};
search_on_u64_ff(
column,
@@ -239,7 +238,7 @@ impl Weight for FastFieldRangeWeight {
}
}

fn explain(&self, reader: &dyn SegmentReader, doc: DocId) -> crate::Result<Explanation> {
fn explain(&self, reader: &SegmentReader, doc: DocId) -> crate::Result<Explanation> {
let mut scorer = self.scorer(reader, 1.0)?;
if scorer.seek(doc) != doc {
return Err(TantivyError::InvalidArgument(format!(
@@ -256,7 +255,7 @@ impl Weight for FastFieldRangeWeight {
///
/// Convert into fast field value space and search.
fn search_on_json_numerical_field(
reader: &dyn SegmentReader,
reader: &SegmentReader,
field_name: &str,
typ: Type,
bounds: BoundsRange<ValueBytes<Vec<u8>>>,
@@ -270,7 +269,7 @@ fn search_on_json_numerical_field(
let Some((column, col_type)) =
fast_field_reader.u64_lenient_for_type(allowed_column_types, field_name)?
else {
return Ok(box_scorer(EmptyScorer));
return Ok(Box::new(EmptyScorer));
};
let actual_column_type: NumericalType = col_type
.numerical_type()
@@ -428,18 +427,18 @@ fn search_on_u64_ff(
)
.unwrap_or(1..=0); // empty range
if value_range.is_empty() {
return Ok(box_scorer(EmptyScorer));
return Ok(Box::new(EmptyScorer));
}
if col_min_value >= *value_range.start() && col_max_value <= *value_range.end() {
// all values in the column are within the range.
if column.index.get_cardinality() == Cardinality::Full {
if boost != 1.0f32 {
return Ok(box_scorer(ConstScorer::new(
return Ok(Box::new(ConstScorer::new(
AllScorer::new(column.num_docs()),
boost,
)));
} else {
return Ok(box_scorer(AllScorer::new(column.num_docs())));
return Ok(Box::new(AllScorer::new(column.num_docs())));
}
} else {
// TODO Make it a field presence request for that specific column
@@ -447,7 +446,7 @@ fn search_on_u64_ff(
}

let docset = RangeDocSet::new(value_range, column);
Ok(box_scorer(ConstScorer::new(docset, boost)))
Ok(Box::new(ConstScorer::new(docset, boost)))
}

/// Returns true if the type maps to a u64 fast field

@@ -1,11 +1,9 @@
use std::mem::{transmute_copy, ManuallyDrop};
use std::ops::DerefMut;

use downcast_rs::impl_downcast;

use crate::docset::DocSet;
use crate::query::Explanation;
use crate::{DocId, Score, TERMINATED};
use crate::Score;

/// Scored set of documents matching a query within a specific segment.
///
@@ -15,53 +13,6 @@ pub trait Scorer: downcast_rs::Downcast + DocSet + 'static {
///
/// This method will perform a bit of computation and is not cached.
fn score(&mut self) -> Score;

/// Calls `callback` with all of the `(doc, score)` for which score
/// is exceeding a given threshold.
///
/// This method is useful for the TopDocs collector.
/// For all docsets, the blanket implementation has the benefit
/// of prefiltering (doc, score) pairs, avoiding the
/// virtual dispatch cost.
///
/// More importantly, it makes it possible for scorers to implement
/// important optimization (e.g. BlockWAND for union).
fn for_each_pruning(
&mut self,
threshold: Score,
callback: &mut dyn FnMut(DocId, Score) -> Score,
) {
for_each_pruning_scorer_default_impl(self, threshold, callback);
}

/// Calls `callback` with all of the `(doc, score)` in the scorer.
fn for_each(&mut self, callback: &mut dyn FnMut(DocId, Score)) {
let mut doc = self.doc();
while doc != TERMINATED {
callback(doc, self.score());
doc = self.advance();
}
}

/// Returns an explanation for the score of the current document.
fn explain(&mut self) -> Explanation {
let score = self.score();
let name = std::any::type_name_of_val(self);
Explanation::new(name, score)
}
}

/// Boxes a scorer. Prefer this to Box::new as it avoids double boxing
/// when TScorer is already a Box<dyn Scorer>.
pub fn box_scorer<TScorer: Scorer>(scorer: TScorer) -> Box<dyn Scorer> {
if std::any::TypeId::of::<TScorer>() == std::any::TypeId::of::<Box<dyn Scorer>>() {
unsafe {
let forget_me = ManuallyDrop::new(scorer);
transmute_copy::<TScorer, Box<dyn Scorer>>(&forget_me)
}
} else {
Box::new(scorer)
}
}
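The removed `box_scorer` helper avoids double boxing by checking, at monomorphization time, whether the concrete scorer type is already `Box<dyn Scorer>`. A self-contained sketch of the same trick, with a simplified stand-in trait (tantivy's real `Scorer` also requires `DocSet` and `Downcast`):

```rust
use std::any::TypeId;
use std::mem::{transmute_copy, ManuallyDrop};

// Simplified stand-in trait for the sketch, not tantivy's real `Scorer`.
trait Scorer: 'static {
    fn score(&mut self) -> f32;
}

struct Constant(f32);

impl Scorer for Constant {
    fn score(&mut self) -> f32 {
        self.0
    }
}

impl Scorer for Box<dyn Scorer> {
    fn score(&mut self) -> f32 {
        (**self).score()
    }
}

fn box_scorer<TScorer: Scorer>(scorer: TScorer) -> Box<dyn Scorer> {
    if TypeId::of::<TScorer>() == TypeId::of::<Box<dyn Scorer>>() {
        // TScorer *is* Box<dyn Scorer>: reuse the existing allocation
        // instead of wrapping the box in a second box.
        unsafe {
            let forget_me = ManuallyDrop::new(scorer);
            transmute_copy::<TScorer, Box<dyn Scorer>>(&forget_me)
        }
    } else {
        Box::new(scorer)
    }
}

fn main() {
    let mut scorer = box_scorer(Constant(1.5));
    assert_eq!(scorer.score(), 1.5);
    // Boxing an already-boxed scorer does not add another indirection layer.
    let mut reboxed = box_scorer(scorer);
    assert_eq!(reboxed.score(), 1.5);
}
```

`ManuallyDrop` prevents the original box from being dropped after `transmute_copy` duplicates it, so ownership of the single allocation transfers cleanly.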

impl_downcast!(Scorer);
@@ -71,41 +22,4 @@ impl Scorer for Box<dyn Scorer> {
fn score(&mut self) -> Score {
self.deref_mut().score()
}

fn for_each_pruning(
&mut self,
threshold: Score,
callback: &mut dyn FnMut(DocId, Score) -> Score,
) {
self.deref_mut().for_each_pruning(threshold, callback);
}

fn for_each(&mut self, callback: &mut dyn FnMut(DocId, Score)) {
self.deref_mut().for_each(callback);
}
}

/// Calls `callback` with all of the `(doc, score)` for which score
/// is exceeding a given threshold.
///
/// This method is useful for the [`TopDocs`](crate::collector::TopDocs) collector.
/// For all docsets, the blanket implementation has the benefit
/// of prefiltering (doc, score) pairs, avoiding the
/// virtual dispatch cost.
///
/// More importantly, it makes it possible for scorers to implement
/// important optimization (e.g. BlockWAND for union).
pub(crate) fn for_each_pruning_scorer_default_impl<TScorer: Scorer + ?Sized>(
scorer: &mut TScorer,
mut threshold: Score,
callback: &mut dyn FnMut(DocId, Score) -> Score,
) {
let mut doc = scorer.doc();
while doc != TERMINATED {
let score = scorer.score();
if score > threshold {
threshold = callback(doc, score);
}
doc = scorer.advance();
}
}
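The default pruning loop above only reports documents whose score strictly exceeds the current threshold, and the callback's return value becomes the new threshold; this contract is what lets collectors like TopDocs skip low-scoring documents. A small simulation with a hypothetical vec-backed cursor standing in for a real `Scorer`:

```rust
const TERMINATED: u32 = u32::MAX;

// Hypothetical cursor over precomputed (doc, score) pairs,
// standing in for a real Scorer/DocSet.
struct VecScorer {
    hits: Vec<(u32, f32)>,
    pos: usize,
}

impl VecScorer {
    fn doc(&self) -> u32 {
        self.hits.get(self.pos).map_or(TERMINATED, |&(doc, _)| doc)
    }
    fn score(&self) -> f32 {
        self.hits[self.pos].1
    }
    fn advance(&mut self) -> u32 {
        self.pos += 1;
        self.doc()
    }
}

// Mirrors the default implementation in the diff: only documents whose
// score strictly exceeds the current threshold are reported, and the
// callback's return value becomes the new threshold.
fn for_each_pruning(
    scorer: &mut VecScorer,
    mut threshold: f32,
    callback: &mut dyn FnMut(u32, f32) -> f32,
) {
    let mut doc = scorer.doc();
    while doc != TERMINATED {
        let score = scorer.score();
        if score > threshold {
            threshold = callback(doc, score);
        }
        doc = scorer.advance();
    }
}

fn main() {
    let mut scorer = VecScorer {
        hits: vec![(0, 1.0), (3, 2.0), (7, 1.5), (9, 3.0)],
        pos: 0,
    };
    let mut reported = Vec::new();
    // A top-1 style callback: raise the threshold to the best score so far.
    for_each_pruning(&mut scorer, 0.5, &mut |doc, score| {
        reported.push(doc);
        score
    });
    // Doc 7 (score 1.5) is pruned: it never beats the threshold of 2.0.
    assert_eq!(reported, vec![0, 3, 9]);
}
```

Specialized scorers can override this loop with block-level bounds (BlockWAND) to skip whole blocks whose maximum possible score is below the threshold.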

@@ -3,10 +3,10 @@ mod term_scorer;
mod term_weight;

pub use self::term_query::TermQuery;
pub use self::term_scorer::{BoxedTermScorer, TermScorer};

pub use self::term_scorer::TermScorer;
#[cfg(test)]
mod tests {

use crate::collector::TopDocs;
use crate::docset::DocSet;
use crate::postings::compression::COMPRESSION_BLOCK_SIZE;

@@ -1,44 +1,23 @@
use crate::docset::DocSet;
use crate::fieldnorm::FieldNormReader;
use crate::postings::{Postings, PostingsWithBlockMax, SegmentPostings};
use crate::postings::{FreqReadingOption, Postings, SegmentPostings};
use crate::query::bm25::Bm25Weight;
use crate::query::{box_scorer, Explanation, Scorer};
use crate::query::{Explanation, Scorer};
use crate::{DocId, Score};

/// Type-erased term scorer guaranteed to wrap a Tantivy [`TermScorer`].
pub struct BoxedTermScorer(Box<dyn Scorer>);

impl BoxedTermScorer {
/// Creates a boxed term scorer from a concrete Tantivy [`TermScorer`].
pub fn new<TPostings: Postings>(term_scorer: TermScorer<TPostings>) -> BoxedTermScorer {
BoxedTermScorer(box_scorer(term_scorer))
}

/// Converts this boxed term scorer into a generic boxed scorer.
pub fn into_boxed_scorer(self) -> Box<dyn Scorer> {
self.0
}
}

#[derive(Clone)]
/// Scorer for a single term over a postings list.
///
/// `TermScorer` combines postings data, fieldnorms, and BM25 term weight to
/// produce per-document scores.
pub struct TermScorer<TPostings: Postings = SegmentPostings> {
postings: TPostings,
pub struct TermScorer {
postings: SegmentPostings,
fieldnorm_reader: FieldNormReader,
similarity_weight: Bm25Weight,
}

impl<TPostings: Postings> TermScorer<TPostings> {
/// Creates a new term scorer from postings, fieldnorm reader, and BM25
/// term weight.
impl TermScorer {
pub fn new(
postings: TPostings,
postings: SegmentPostings,
fieldnorm_reader: FieldNormReader,
similarity_weight: Bm25Weight,
) -> TermScorer<TPostings> {
) -> TermScorer {
TermScorer {
postings,
fieldnorm_reader,
@@ -46,38 +25,10 @@ impl<TPostings: Postings> TermScorer<TPostings> {
}
}

/// Returns the term frequency for the current document.
pub fn term_freq(&self) -> u32 {
self.postings.term_freq()
pub(crate) fn seek_block(&mut self, target_doc: DocId) {
self.postings.block_cursor.seek_block(target_doc);
}

/// Returns the fieldnorm id for the current document.
pub fn fieldnorm_id(&self) -> u8 {
self.fieldnorm_reader.fieldnorm_id(self.doc())
}

/// Returns the maximum score upper bound for this scorer.
pub fn max_score(&self) -> Score {
self.similarity_weight.max_score()
}
}

impl<TPostingsWithBlockMax: PostingsWithBlockMax> TermScorer<TPostingsWithBlockMax> {
pub(crate) fn last_doc_in_block(&self) -> DocId {
self.postings.last_doc_in_block()
}

/// Advances the term scorer to the block containing target_doc and returns
/// an upperbound for the score all of the documents in the block.
/// (BlockMax). This score is not guaranteed to be the
/// effective maximum score of the block.
pub(crate) fn seek_block_max(&mut self, target_doc: DocId) -> Score {
self.postings
.seek_block_max(target_doc, &self.fieldnorm_reader, &self.similarity_weight)
}
}

impl TermScorer {
#[cfg(test)]
pub fn create_for_test(
doc_and_tfs: &[(DocId, u32)],
@@ -98,9 +49,55 @@ impl TermScorer {
let fieldnorm_reader = FieldNormReader::for_test(fieldnorms);
TermScorer::new(segment_postings, fieldnorm_reader, similarity_weight)
}

/// See `FreqReadingOption`.
pub(crate) fn freq_reading_option(&self) -> FreqReadingOption {
self.postings.block_cursor.freq_reading_option()
}

/// Returns the maximum score for the current block.
///
/// In some rare case, the result may not be exact. In this case a lower value is returned,
/// (and may lead us to return a lesser document).
///
/// At index time, we store the (fieldnorm_id, term frequency) pair that maximizes the
/// score assuming the average fieldnorm computed on this segment.
///
/// Though extremely rare, it is theoretically possible that the actual average fieldnorm
/// is different enough from the current segment average fieldnorm that the maximum over a
/// specific is achieved on a different document.
///
/// (The result is on the other hand guaranteed to be correct if there is only one segment).
pub fn block_max_score(&mut self) -> Score {
self.postings
.block_cursor
.block_max_score(&self.fieldnorm_reader, &self.similarity_weight)
}

pub fn term_freq(&self) -> u32 {
self.postings.term_freq()
}

pub fn fieldnorm_id(&self) -> u8 {
self.fieldnorm_reader.fieldnorm_id(self.doc())
}

pub fn explain(&self) -> Explanation {
let fieldnorm_id = self.fieldnorm_id();
let term_freq = self.term_freq();
self.similarity_weight.explain(fieldnorm_id, term_freq)
}

pub fn max_score(&self) -> Score {
self.similarity_weight.max_score()
}

pub fn last_doc_in_block(&self) -> DocId {
self.postings.block_cursor.skip_reader().last_doc_in_block()
}
}

impl<TPostings: Postings> DocSet for TermScorer<TPostings> {
impl DocSet for TermScorer {
#[inline]
fn advance(&mut self) -> DocId {
self.postings.advance()
@@ -120,21 +117,21 @@ impl<TPostings: Postings> DocSet for TermScorer<TPostings> {
fn size_hint(&self) -> u32 {
self.postings.size_hint()
}

// TODO
// It is probably possible to optimize fill_bitset_block for TermScorer,
// working directly with the blocks, enabling vectorization.
// I did not manage to get a performance improvement on Mac ARM,
// and do not have access to x86 to investigate.
}

impl<TPostings: Postings> Scorer for TermScorer<TPostings> {
impl Scorer for TermScorer {
#[inline]
fn score(&mut self) -> Score {
let fieldnorm_id = self.fieldnorm_id();
let term_freq = self.term_freq();
self.similarity_weight.score(fieldnorm_id, term_freq)
}

fn explain(&mut self) -> Explanation {
let fieldnorm_id = self.fieldnorm_id();
let term_freq = self.term_freq();
self.similarity_weight.explain(fieldnorm_id, term_freq)
}
}

#[cfg(test)]
@@ -143,7 +140,7 @@ mod tests {

use crate::index::SegmentId;
use crate::indexer::index_writer::MEMORY_BUDGET_NUM_BYTES_MIN;
use crate::indexer::NoMergePolicy;
use crate::merge_policy::NoMergePolicy;
use crate::postings::compression::COMPRESSION_BLOCK_SIZE;
use crate::query::term_query::TermScorer;
use crate::query::{Bm25Weight, EnableScoring, Scorer, TermQuery};
@@ -164,7 +161,7 @@ mod tests {
crate::assert_nearly_equals!(max_scorer, 1.3990127);
assert_eq!(term_scorer.doc(), 2);
assert_eq!(term_scorer.term_freq(), 3);
assert_nearly_equals!(term_scorer.seek_block_max(2), 1.3676447);
assert_nearly_equals!(term_scorer.block_max_score(), 1.3676447);
assert_nearly_equals!(term_scorer.score(), 1.0892314);
assert_eq!(term_scorer.advance(), 3);
assert_eq!(term_scorer.doc(), 3);
@@ -179,9 +176,9 @@ mod tests {
}

#[test]
fn test_term_scorer_shallow_advance() {
fn test_term_scorer_shallow_advance() -> crate::Result<()> {
let bm25_weight = Bm25Weight::for_one_term(300, 1024, 10.0);
let mut doc_and_tfs = Vec::new();
let mut doc_and_tfs = vec![];
for i in 0u32..300u32 {
let doc = i * 10;
doc_and_tfs.push((doc, 1u32 + doc % 3u32));
@@ -189,10 +186,11 @@ mod tests {
let fieldnorms: Vec<u32> = std::iter::repeat_n(10u32, 3_000).collect();
let mut term_scorer = TermScorer::create_for_test(&doc_and_tfs, &fieldnorms, bm25_weight);
assert_eq!(term_scorer.doc(), 0u32);
term_scorer.seek_block_max(1289);
term_scorer.seek_block(1289);
assert_eq!(term_scorer.doc(), 0u32);
term_scorer.seek(1289);
assert_eq!(term_scorer.doc(), 1290);
Ok(())
}

proptest! {
@@ -226,7 +224,7 @@ mod tests {

let docs: Vec<DocId> = (0..term_doc_freq).map(|doc| doc as DocId).collect();
for block in docs.chunks(COMPRESSION_BLOCK_SIZE) {
let block_max_score: Score = term_scorer.seek_block_max(0);
let block_max_score: Score = term_scorer.block_max_score();
let mut block_max_score_computed: Score = 0.0;
for &doc in block {
assert_eq!(term_scorer.doc(), doc);
@@ -254,26 +252,25 @@ mod tests {
let fieldnorms: Vec<u32> = std::iter::repeat_n(20u32, 300).collect();
let bm25_weight = Bm25Weight::for_one_term(10, 129, 20.0);
let mut docs = TermScorer::create_for_test(&doc_tfs[..], &fieldnorms[..], bm25_weight);
assert_nearly_equals!(docs.seek_block_max(0), 2.5161593);
assert_nearly_equals!(docs.seek_block_max(135), 3.4597192);
assert_nearly_equals!(docs.block_max_score(), 2.5161593);
docs.seek_block(135);
assert_nearly_equals!(docs.block_max_score(), 3.4597192);
docs.seek_block(256);
// the block is not loaded yet.
assert_nearly_equals!(docs.seek_block_max(256), 5.2971773);
assert_nearly_equals!(docs.block_max_score(), 5.2971773);
assert_eq!(256, docs.seek(256));
assert_nearly_equals!(docs.seek_block_max(256), 3.9539647);
assert_nearly_equals!(docs.block_max_score(), 3.9539647);
}

fn test_block_wand_aux(term_query: &TermQuery, searcher: &Searcher) {
let term_weight = term_query
.specialized_weight(EnableScoring::enabled_from_searcher(searcher))
.unwrap();
fn test_block_wand_aux(term_query: &TermQuery, searcher: &Searcher) -> crate::Result<()> {
let term_weight =
term_query.specialized_weight(EnableScoring::enabled_from_searcher(searcher))?;
for reader in searcher.segment_readers() {
let mut block_max_scores = vec![];
let mut block_max_scores_b = vec![];
let mut docs = vec![];
{
let mut term_scorer = term_weight
.term_scorer_for_test(reader.as_ref(), 1.0)
.unwrap();
let mut term_scorer = term_weight.term_scorer_for_test(reader, 1.0)?.unwrap();
while term_scorer.doc() != TERMINATED {
let mut score = term_scorer.score();
docs.push(term_scorer.doc());
@@ -287,12 +284,10 @@ mod tests {
}
}
{
let mut term_scorer = term_weight
.term_scorer_for_test(reader.as_ref(), 1.0)
.unwrap();
let mut term_scorer = term_weight.term_scorer_for_test(reader, 1.0)?.unwrap();
for d in docs {
let block_max_score = term_scorer.seek_block_max(d);
block_max_scores_b.push(block_max_score);
term_scorer.seek_block(d);
block_max_scores_b.push(term_scorer.block_max_score());
}
}
for (l, r) in block_max_scores
@@ -303,18 +298,18 @@ mod tests {
assert_nearly_equals!(l, r);
}
}
Ok(())
}

#[ignore]
#[test]
fn test_block_wand_long_test() {
fn test_block_wand_long_test() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
let text_field = schema_builder.add_text_field("text", TEXT);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let mut writer: IndexWriter = index
.writer_with_num_threads(3, 3 * MEMORY_BUDGET_NUM_BYTES_MIN)
.unwrap();
let mut writer: IndexWriter =
index.writer_with_num_threads(3, 3 * MEMORY_BUDGET_NUM_BYTES_MIN)?;
use rand::Rng;
let mut rng = rand::rng();
writer.set_merge_policy(Box::new(NoMergePolicy));
@@ -322,15 +317,15 @@ mod tests {
let term_freq = rng.random_range(1..10000);
let words: Vec<&str> = std::iter::repeat_n("bbbb", term_freq).collect();
let text = words.join(" ");
writer.add_document(doc!(text_field=>text)).unwrap();
writer.add_document(doc!(text_field=>text))?;
}
writer.commit().unwrap();
writer.commit()?;
let term_query = TermQuery::new(
Term::from_field_text(text_field, "bbbb"),
IndexRecordOption::WithFreqs,
|
||||
);
|
||||
let segment_ids: Vec<SegmentId>;
|
||||
let reader = index.reader().unwrap();
|
||||
let reader = index.reader()?;
|
||||
{
|
||||
let searcher = reader.searcher();
|
||||
segment_ids = searcher
|
||||
@@ -338,14 +333,15 @@ mod tests {
|
||||
.iter()
|
||||
.map(|segment| segment.segment_id())
|
||||
.collect();
|
||||
test_block_wand_aux(&term_query, &searcher);
|
||||
test_block_wand_aux(&term_query, &searcher)?;
|
||||
}
|
||||
writer.merge(&segment_ids[..]).wait().unwrap();
|
||||
{
|
||||
reader.reload().unwrap();
|
||||
reader.reload()?;
|
||||
let searcher = reader.searcher();
|
||||
assert_eq!(searcher.segment_readers().len(), 1);
|
||||
test_block_wand_aux(&term_query, &searcher);
|
||||
test_block_wand_aux(&term_query, &searcher)?;
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
}
|
||||
|
||||
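The API change running through these tests splits the old `seek_block_max(target)` (which seeked to a block and returned its max score in one call) into two operations: `seek_block(target)` to position the cursor and `block_max_score()` to read the cached maximum. A minimal, tantivy-independent sketch of that split — the `Blocks` type, field names, and scores below are invented for illustration, not tantivy's actual internals:

```rust
/// Toy stand-in for a postings cursor over fixed-size blocks,
/// each with a precomputed maximum score (as used by block-max WAND).
struct Blocks {
    block_maxes: Vec<f32>, // max score per block, precomputed
    block_size: u32,       // docs per block
    current_block: usize,  // cursor position
}

impl Blocks {
    /// New-style API: positioning is its own operation.
    fn seek_block(&mut self, doc: u32) {
        self.current_block = (doc / self.block_size) as usize;
    }

    /// New-style API: reading the current block's max is a cheap accessor.
    fn block_max_score(&self) -> f32 {
        self.block_maxes[self.current_block]
    }

    /// Old-style API, expressible as the two new calls combined.
    fn seek_block_max(&mut self, doc: u32) -> f32 {
        self.seek_block(doc);
        self.block_max_score()
    }
}
```

Splitting seek from read lets a caller re-query the block max repeatedly (as the scoring loop in `test_block_wand_aux` does) without paying for a seek each time.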
Some files were not shown because too many files have changed in this diff.
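The other pattern in this diff is mechanical: tests that returned `()` and called `.unwrap()` now return `crate::Result<()>` and propagate errors with `?`. Rust's test harness accepts `Result`-returning test functions, so a failed setup step reports the error value instead of a panic. The same conversion outside tantivy — the helper and its name are invented for illustration:

```rust
use std::num::ParseIntError;

// Fallible helper standing in for index/reader setup that can fail.
fn parse_doc_count(raw: &str) -> Result<u32, ParseIntError> {
    raw.trim().parse::<u32>()
}

// Before: any failure panics via `.unwrap()`.
#[test]
fn doc_count_with_unwrap() {
    let n = parse_doc_count(" 300 ").unwrap();
    assert_eq!(n, 300);
}

// After: the test returns a Result, so `?` propagates the failure
// and the harness prints the error instead of a panic backtrace.
#[test]
fn doc_count_with_question_mark() -> Result<(), ParseIntError> {
    let n = parse_doc_count(" 300 ")?;
    assert_eq!(n, 300);
    Ok(())
}
```

The payoff shows in `test_block_wand_long_test` above: five `.unwrap()` calls collapse into `?`, and the one remaining `.unwrap()` (on `writer.merge(...).wait()`) stays only because its error type isn't converted here.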