Mirror of https://github.com/quickwit-oss/tantivy.git (synced 2026-04-18 10:30:40 +00:00)

Compare commits: postings-w...postings-w (41 commits)
Commits: f5221215b3, 7d9427e9d6, 62b50bb254, 2dcd550b74, 4840886a87, 1de872ba71, 93915ce1bf, bf87c54f0e, 4f27b503ed, 0346942174, 6ec38276a6, 476960e89b, 955ce6477c, daaecf2afb, d64178906f, f1377018b0, 9fe96d06af, d9c4270acb, 16d1611f4d, b296948bcd, 6b7380eda8, 947459a0a9, eb18182901, 01d670f60c, b77338b590, c75fa94d25, cf632673ac, 6f00d96127, a5ccb62c99, c42505a043, 3e57eb9add, 0955b44ce1, 783a2a6bef, 1e3c353e21, 799e88adbd, 1d5fe6bc7c, d768b2a491, 7453df8db3, ba6abba20a, d128e5c2a2, e6d062bf2d
@@ -1,125 +0,0 @@
---
name: rationalize-deps
description: Analyze Cargo.toml dependencies and attempt to remove unused features to reduce compile times and binary size
---

# Rationalize Dependencies

This skill analyzes Cargo.toml dependencies to identify and remove unused features.

## Overview

Many crates enable features by default that may not be needed. This skill:
1. Identifies dependencies with default features enabled
2. Tests whether `default-features = false` works
3. Identifies which specific features are actually needed
4. Verifies compilation after changes

## Step 1: Identify the target

Ask the user which crate(s) to analyze:
- A specific crate name (e.g., "tokio", "serde")
- A specific workspace member (e.g., "quickwit-search")
- "all" to scan the entire workspace

## Step 2: Analyze current dependencies

For the workspace Cargo.toml (`quickwit/Cargo.toml`), list dependencies that:
- Do NOT have `default-features = false`
- Have default features that might be unnecessary

Run: `cargo tree -p <crate> -f "{p} {f}" --edges features` to see which features are actually used.

## Step 3: For each candidate dependency

### 3a: Check the crate's default features

Look up the crate on crates.io or check its Cargo.toml to understand:
- Which features are enabled by default
- What each feature provides

Use: `cargo metadata --format-version=1 | jq '.packages[] | select(.name == "<crate>") | .features'`

### 3b: Try disabling default features

Modify the dependency in `quickwit/Cargo.toml`:

From:
```toml
some-crate = { version = "1.0" }
```

To:
```toml
some-crate = { version = "1.0", default-features = false }
```

### 3c: Run cargo check

Run: `cargo check --workspace` (or target specific packages for faster feedback)

If compilation fails:
1. Read the error messages to identify which features are needed
2. Add only the required features explicitly:
```toml
some-crate = { version = "1.0", default-features = false, features = ["needed-feature"] }
```
3. Re-run cargo check

### 3d: Binary search for minimal features

If there are many default features, use binary search (a sketch follows below):
1. Start with no features
2. If the build fails, add half of the default features
3. Continue until you find the minimal set
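
A rough sketch of that loop, using `cargo add` to rewrite the dependency entry between attempts; `some-crate` and the feature names below are placeholders, not actual dependencies of this workspace:

```bash
# Hypothetical walk-through for a dependency `some-crate` whose default
# features are foo, bar, baz, qux (names chosen for illustration only).

# Attempt 1: disable every default feature.
cargo add some-crate --no-default-features
cargo check --workspace

# Attempt 2 (if the check fails): re-enable half of the candidates.
cargo add some-crate --no-default-features --features foo,bar
cargo check --workspace

# Keep halving (or swapping) the candidate set until the smallest feature
# list that still compiles is found, then keep that entry in Cargo.toml.
```

If the workspace pins exact versions, it may be safer to edit the `features` list in `Cargo.toml` by hand instead of re-running `cargo add`, since `cargo add` resolves the version requirement itself.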

## Step 4: Document findings

For each dependency analyzed, report:
- Original configuration
- New configuration (if changed)
- Features that were removed
- Any features that are required

## Step 5: Verify full build

After all changes, run:
```bash
cargo check --workspace --all-targets
cargo test --workspace --no-run
```

## Common Patterns

### Serde
Often only needs `derive`:
```toml
serde = { version = "1.0", default-features = false, features = ["derive", "std"] }
```

### Tokio
Identify which runtime features are actually used:
```toml
tokio = { version = "1.0", default-features = false, features = ["rt-multi-thread", "macros", "sync"] }
```

### Reqwest
Often doesn't need all TLS backends:
```toml
reqwest = { version = "0.11", default-features = false, features = ["rustls-tls", "json"] }
```

## Rollback

If changes cause issues:
```bash
git checkout quickwit/Cargo.toml
cargo check --workspace
```

## Tips

- Start with large crates that have many default features (tokio, reqwest, hyper)
- Use `cargo bloat --crates` to identify large dependencies
- Check `cargo tree -d` for duplicate dependencies that might indicate feature conflicts
- Some features are needed only for tests - consider using `[dev-dependencies]` features
@@ -1,60 +0,0 @@
---
name: simple-pr
description: Create a simple PR from staged changes with an auto-generated commit message
disable-model-invocation: true
---

# Simple PR

Follow these steps to create a simple PR from staged changes:

## Step 1: Check workspace state

Run: `git status`

Verify that all changes have been staged (no unstaged changes). If there are unstaged changes, abort and ask the user to stage their changes first with `git add`.

Also verify that we are on the `main` branch. If not, abort and ask the user to switch to main first. A sketch of both checks follows below.
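
A small shell sketch of those two checks (not part of the original skill; the error messages are illustrative):

```bash
# Abort if any tracked file has unstaged modifications.
if ! git diff --quiet; then
    echo "unstaged changes present; run 'git add' first" >&2
    exit 1
fi

# Abort unless the current branch is main.
if [ "$(git rev-parse --abbrev-ref HEAD)" != "main" ]; then
    echo "not on the main branch; switch to main first" >&2
    exit 1
fi
```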

## Step 2: Ensure main is up to date

Run: `git pull origin main`

This ensures we're working from the latest code.

## Step 3: Review staged changes

Run: `git diff --cached`

Review the staged changes to understand what the PR will contain.

## Step 4: Generate commit message

Based on the staged changes, generate a concise commit message (1-2 sentences) that describes the "why" rather than the "what".

Display the proposed commit message to the user and ask for confirmation before proceeding.

## Step 5: Create a new branch

Get the git username: `git config user.name | tr ' ' '-' | tr '[:upper:]' '[:lower:]'`

Create a short, descriptive branch name based on the changes (e.g., `fix-typo-in-readme`, `add-retry-logic`, `update-deps`).

Create and checkout the branch: `git checkout -b {username}/{short-descriptive-name}`

## Step 6: Commit changes

Commit with the message from step 4:
```
git commit -m "{commit-message}"
```

## Step 7: Push and open a PR

Push the branch and open a PR:
```
git push -u origin {branch-name}
gh pr create --title "{commit-message-title}" --body "{longer-description-if-needed}"
```

Report the PR URL to the user when complete.
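
Putting steps 5 through 7 together, a sketch of the full command sequence; the branch name, commit message, and PR text below are placeholders that would come from steps 4 and 5:

```bash
# Placeholder values; in practice they come from the generated message and branch name.
username=$(git config user.name | tr ' ' '-' | tr '[:upper:]' '[:lower:]')
branch="${username}/fix-typo-in-readme"

git checkout -b "$branch"
git commit -m "Fix a README typo that reversed the meaning of a sentence"
git push -u origin "$branch"
gh pr create --title "Fix README typo" --body "Corrects a sentence that read backwards."
```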
Cargo.toml (23 changed lines)
@@ -15,7 +15,7 @@ rust-version = "1.85"
exclude = ["benches/*.json", "benches/*.txt"]

[dependencies]
oneshot = "0.1.13"
oneshot = "0.1.7"
base64 = "0.22.0"
byteorder = "1.4.3"
crc32fast = "1.3.2"
@@ -27,7 +27,7 @@ regex = { version = "1.5.5", default-features = false, features = [
aho-corasick = "1.0"
tantivy-fst = "0.5"
memmap2 = { version = "0.9.0", optional = true }
lz4_flex = { version = "0.12", default-features = false, optional = true }
lz4_flex = { version = "0.11", default-features = false, optional = true }
zstd = { version = "0.13", optional = true, default-features = false }
tempfile = { version = "3.12.0", optional = true }
log = "0.4.16"
@@ -50,7 +50,7 @@ fail = { version = "0.5.0", optional = true }
time = { version = "0.3.35", features = ["serde-well-known"] }
smallvec = "1.8.0"
rayon = "1.5.2"
lru = "0.16.3"
lru = "0.12.0"
fastdivide = "0.4.0"
itertools = "0.14.0"
measure_time = "0.9.0"
@@ -76,7 +76,7 @@ winapi = "0.3.9"

[dev-dependencies]
binggan = "0.14.2"
rand = "0.9"
rand = "0.8.5"
maplit = "1.0.2"
matches = "0.1.9"
pretty_assertions = "1.2.1"
@@ -85,7 +85,7 @@ test-log = "0.2.10"
futures = "0.3.21"
paste = "1.0.11"
more-asserts = "0.3.1"
rand_distr = "0.5"
rand_distr = "0.4.3"
time = { version = "0.3.10", features = ["serde-well-known", "macros"] }
postcard = { version = "1.0.4", features = [
"use-std",
@@ -189,16 +189,3 @@ harness = false
[[bench]]
name = "bool_queries_with_range"
harness = false

[[bench]]
name = "str_search_and_get"
harness = false

[[bench]]
name = "merge_segments"
harness = false

[[bench]]
name = "regex_all_terms"
harness = false
@@ -1,8 +1,8 @@
|
||||
use binggan::plugins::PeakMemAllocPlugin;
|
||||
use binggan::{black_box, InputGroup, PeakMemAlloc, INSTRUMENTED_SYSTEM};
|
||||
use rand::distr::weighted::WeightedIndex;
|
||||
use rand::distributions::WeightedIndex;
|
||||
use rand::prelude::SliceRandom;
|
||||
use rand::rngs::StdRng;
|
||||
use rand::seq::IndexedRandom;
|
||||
use rand::{Rng, SeedableRng};
|
||||
use rand_distr::Distribution;
|
||||
use serde_json::json;
|
||||
@@ -532,7 +532,7 @@ fn get_test_index_bench(cardinality: Cardinality) -> tantivy::Result<Index> {
|
||||
// Prepare 1000 unique terms sampled using a Zipf distribution.
|
||||
// Exponent ~1.1 approximates top-20 terms covering around ~20%.
|
||||
let terms_1000: Vec<String> = (1..=1000).map(|i| format!("term_{i}")).collect();
|
||||
let zipf_1000 = rand_distr::Zipf::new(1000.0, 1.1f64).unwrap();
|
||||
let zipf_1000 = rand_distr::Zipf::new(1000, 1.1f64).unwrap();
|
||||
|
||||
{
|
||||
let mut rng = StdRng::from_seed([1u8; 32]);
|
||||
@@ -576,8 +576,8 @@ fn get_test_index_bench(cardinality: Cardinality) -> tantivy::Result<Index> {
|
||||
}
|
||||
let _val_max = 1_000_000.0;
|
||||
for _ in 0..doc_with_value {
|
||||
let val: f64 = rng.random_range(0.0..1_000_000.0);
|
||||
let json = if rng.random_bool(0.1) {
|
||||
let val: f64 = rng.gen_range(0.0..1_000_000.0);
|
||||
let json = if rng.gen_bool(0.1) {
|
||||
// 10% are numeric values
|
||||
json!({ "mixed_type": val })
|
||||
} else {
|
||||
@@ -586,7 +586,7 @@ fn get_test_index_bench(cardinality: Cardinality) -> tantivy::Result<Index> {
|
||||
index_writer.add_document(doc!(
|
||||
text_field => "cool",
|
||||
json_field => json,
|
||||
text_field_all_unique_terms => format!("unique_term_{}", rng.random::<u64>()),
|
||||
text_field_all_unique_terms => format!("unique_term_{}", rng.gen::<u64>()),
|
||||
text_field_many_terms => many_terms_data.choose(&mut rng).unwrap().to_string(),
|
||||
text_field_few_terms_status => status_field_data[log_level_distribution.sample(&mut rng)].0,
|
||||
text_field_1000_terms_zipf => terms_1000[zipf_1000.sample(&mut rng) as usize - 1].as_str(),
|
||||
|
||||
@@ -55,29 +55,29 @@ fn build_shared_indices(num_docs: usize, p_a: f32, p_b: f32, p_c: f32) -> (Bench
|
||||
{
|
||||
let mut writer = index.writer_with_num_threads(1, 500_000_000).unwrap();
|
||||
for _ in 0..num_docs {
|
||||
let has_a = rng.random_bool(p_a as f64);
|
||||
let has_b = rng.random_bool(p_b as f64);
|
||||
let has_c = rng.random_bool(p_c as f64);
|
||||
let score = rng.random_range(0u64..100u64);
|
||||
let score2 = rng.random_range(0u64..100_000u64);
|
||||
let has_a = rng.gen_bool(p_a as f64);
|
||||
let has_b = rng.gen_bool(p_b as f64);
|
||||
let has_c = rng.gen_bool(p_c as f64);
|
||||
let score = rng.gen_range(0u64..100u64);
|
||||
let score2 = rng.gen_range(0u64..100_000u64);
|
||||
let mut title_tokens: Vec<&str> = Vec::new();
|
||||
let mut body_tokens: Vec<&str> = Vec::new();
|
||||
if has_a {
|
||||
if rng.random_bool(0.1) {
|
||||
if rng.gen_bool(0.1) {
|
||||
title_tokens.push("a");
|
||||
} else {
|
||||
body_tokens.push("a");
|
||||
}
|
||||
}
|
||||
if has_b {
|
||||
if rng.random_bool(0.1) {
|
||||
if rng.gen_bool(0.1) {
|
||||
title_tokens.push("b");
|
||||
} else {
|
||||
body_tokens.push("b");
|
||||
}
|
||||
}
|
||||
if has_c {
|
||||
if rng.random_bool(0.1) {
|
||||
if rng.gen_bool(0.1) {
|
||||
title_tokens.push("c");
|
||||
} else {
|
||||
body_tokens.push("c");
|
||||
|
||||
@@ -36,13 +36,13 @@ fn build_shared_indices(num_docs: usize, p_title_a: f32, distribution: &str) ->
|
||||
"dense" => {
|
||||
for doc_id in 0..num_docs {
|
||||
// Always add title to avoid empty documents
|
||||
let title_token = if rng.random_bool(p_title_a as f64) {
|
||||
let title_token = if rng.gen_bool(p_title_a as f64) {
|
||||
"a"
|
||||
} else {
|
||||
"b"
|
||||
};
|
||||
|
||||
let num_rand = rng.random_range(0u64..1000u64);
|
||||
let num_rand = rng.gen_range(0u64..1000u64);
|
||||
|
||||
let num_asc = (doc_id / 10000) as u64;
|
||||
|
||||
@@ -60,13 +60,13 @@ fn build_shared_indices(num_docs: usize, p_title_a: f32, distribution: &str) ->
|
||||
"sparse" => {
|
||||
for doc_id in 0..num_docs {
|
||||
// Always add title to avoid empty documents
|
||||
let title_token = if rng.random_bool(p_title_a as f64) {
|
||||
let title_token = if rng.gen_bool(p_title_a as f64) {
|
||||
"a"
|
||||
} else {
|
||||
"b"
|
||||
};
|
||||
|
||||
let num_rand = rng.random_range(0u64..10000000u64);
|
||||
let num_rand = rng.gen_range(0u64..10000000u64);
|
||||
|
||||
let num_asc = doc_id as u64;
|
||||
|
||||
|
||||
@@ -1,224 +0,0 @@
|
||||
// Benchmarks segment merging
|
||||
//
|
||||
// Notes:
|
||||
// - Input segments are kept intact (no deletes / no IndexWriter merge).
|
||||
// - Output is written to a `NullDirectory` that discards all files except
|
||||
// fieldnorms (needed for merging).
|
||||
|
||||
use std::collections::HashMap;
|
||||
use std::io::{self, Write};
|
||||
use std::path::{Path, PathBuf};
|
||||
use std::sync::{Arc, RwLock};
|
||||
|
||||
use binggan::{black_box, BenchRunner};
|
||||
use rand::prelude::*;
|
||||
use rand::rngs::StdRng;
|
||||
use rand::SeedableRng;
|
||||
use tantivy::directory::error::{DeleteError, OpenReadError, OpenWriteError};
|
||||
use tantivy::directory::{
|
||||
AntiCallToken, Directory, FileHandle, OwnedBytes, TerminatingWrite, WatchCallback, WatchHandle,
|
||||
WritePtr,
|
||||
};
|
||||
use tantivy::indexer::{merge_filtered_segments, NoMergePolicy};
|
||||
use tantivy::schema::{Schema, TEXT};
|
||||
use tantivy::{doc, HasLen, Index, IndexSettings, Segment};
|
||||
|
||||
#[derive(Clone, Default, Debug)]
|
||||
struct NullDirectory {
|
||||
blobs: Arc<RwLock<HashMap<PathBuf, OwnedBytes>>>,
|
||||
}
|
||||
|
||||
struct NullWriter;
|
||||
|
||||
impl Write for NullWriter {
|
||||
fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
|
||||
Ok(buf.len())
|
||||
}
|
||||
|
||||
fn flush(&mut self) -> io::Result<()> {
|
||||
Ok(())
|
||||
}
|
||||
}
|
||||
|
||||
impl TerminatingWrite for NullWriter {
|
||||
fn terminate_ref(&mut self, _token: AntiCallToken) -> io::Result<()> {
|
||||
Ok(())
|
||||
}
|
||||
}
|
||||
|
||||
struct InMemoryWriter {
|
||||
path: PathBuf,
|
||||
buffer: Vec<u8>,
|
||||
blobs: Arc<RwLock<HashMap<PathBuf, OwnedBytes>>>,
|
||||
}
|
||||
|
||||
impl Write for InMemoryWriter {
|
||||
fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
|
||||
self.buffer.extend_from_slice(buf);
|
||||
Ok(buf.len())
|
||||
}
|
||||
|
||||
fn flush(&mut self) -> io::Result<()> {
|
||||
Ok(())
|
||||
}
|
||||
}
|
||||
|
||||
impl TerminatingWrite for InMemoryWriter {
|
||||
fn terminate_ref(&mut self, _token: AntiCallToken) -> io::Result<()> {
|
||||
let bytes = OwnedBytes::new(std::mem::take(&mut self.buffer));
|
||||
self.blobs.write().unwrap().insert(self.path.clone(), bytes);
|
||||
Ok(())
|
||||
}
|
||||
}
|
||||
|
||||
#[derive(Debug, Default)]
|
||||
struct NullFileHandle;
|
||||
impl HasLen for NullFileHandle {
|
||||
fn len(&self) -> usize {
|
||||
0
|
||||
}
|
||||
}
|
||||
impl FileHandle for NullFileHandle {
|
||||
fn read_bytes(&self, _range: std::ops::Range<usize>) -> io::Result<OwnedBytes> {
|
||||
unimplemented!()
|
||||
}
|
||||
}
|
||||
|
||||
impl Directory for NullDirectory {
|
||||
fn get_file_handle(&self, path: &Path) -> Result<Arc<dyn FileHandle>, OpenReadError> {
|
||||
if let Some(bytes) = self.blobs.read().unwrap().get(path) {
|
||||
return Ok(Arc::new(bytes.clone()));
|
||||
}
|
||||
Ok(Arc::new(NullFileHandle))
|
||||
}
|
||||
|
||||
fn delete(&self, _path: &Path) -> Result<(), DeleteError> {
|
||||
Ok(())
|
||||
}
|
||||
|
||||
fn exists(&self, _path: &Path) -> Result<bool, OpenReadError> {
|
||||
Ok(true)
|
||||
}
|
||||
|
||||
fn open_write(&self, path: &Path) -> Result<WritePtr, OpenWriteError> {
|
||||
let path_buf = path.to_path_buf();
|
||||
if path.to_string_lossy().ends_with(".fieldnorm") {
|
||||
let writer = InMemoryWriter {
|
||||
path: path_buf,
|
||||
buffer: Vec::new(),
|
||||
blobs: Arc::clone(&self.blobs),
|
||||
};
|
||||
Ok(io::BufWriter::new(Box::new(writer)))
|
||||
} else {
|
||||
Ok(io::BufWriter::new(Box::new(NullWriter)))
|
||||
}
|
||||
}
|
||||
|
||||
fn atomic_read(&self, path: &Path) -> Result<Vec<u8>, OpenReadError> {
|
||||
if let Some(bytes) = self.blobs.read().unwrap().get(path) {
|
||||
return Ok(bytes.as_slice().to_vec());
|
||||
}
|
||||
Err(OpenReadError::FileDoesNotExist(path.to_path_buf()))
|
||||
}
|
||||
|
||||
fn atomic_write(&self, _path: &Path, _data: &[u8]) -> io::Result<()> {
|
||||
Ok(())
|
||||
}
|
||||
|
||||
fn sync_directory(&self) -> io::Result<()> {
|
||||
Ok(())
|
||||
}
|
||||
|
||||
fn watch(&self, _watch_callback: WatchCallback) -> tantivy::Result<WatchHandle> {
|
||||
Ok(WatchHandle::empty())
|
||||
}
|
||||
}
|
||||
|
||||
struct MergeScenario {
|
||||
#[allow(dead_code)]
|
||||
index: Index,
|
||||
segments: Vec<Segment>,
|
||||
settings: IndexSettings,
|
||||
label: String,
|
||||
}
|
||||
|
||||
fn build_index(
|
||||
num_segments: usize,
|
||||
docs_per_segment: usize,
|
||||
tokens_per_doc: usize,
|
||||
vocab_size: usize,
|
||||
) -> MergeScenario {
|
||||
let mut schema_builder = Schema::builder();
|
||||
let body = schema_builder.add_text_field("body", TEXT);
|
||||
let schema = schema_builder.build();
|
||||
let index = Index::create_in_ram(schema.clone());
|
||||
|
||||
assert!(vocab_size > 0);
|
||||
let total_tokens = num_segments * docs_per_segment * tokens_per_doc;
|
||||
let use_unique_terms = vocab_size >= total_tokens;
|
||||
let mut rng = StdRng::from_seed([7u8; 32]);
|
||||
let mut next_token_id: u64 = 0;
|
||||
|
||||
{
|
||||
let mut writer = index.writer_with_num_threads(1, 256_000_000).unwrap();
|
||||
writer.set_merge_policy(Box::new(NoMergePolicy));
|
||||
for _ in 0..num_segments {
|
||||
for _ in 0..docs_per_segment {
|
||||
let mut tokens = Vec::with_capacity(tokens_per_doc);
|
||||
for _ in 0..tokens_per_doc {
|
||||
let token_id = if use_unique_terms {
|
||||
let id = next_token_id;
|
||||
next_token_id += 1;
|
||||
id
|
||||
} else {
|
||||
rng.random_range(0..vocab_size as u64)
|
||||
};
|
||||
tokens.push(format!("term_{token_id}"));
|
||||
}
|
||||
writer.add_document(doc!(body => tokens.join(" "))).unwrap();
|
||||
}
|
||||
writer.commit().unwrap();
|
||||
}
|
||||
}
|
||||
|
||||
let segments = index.searchable_segments().unwrap();
|
||||
let settings = index.settings().clone();
|
||||
let label = format!(
|
||||
"segments={}, docs/seg={}, tokens/doc={}, vocab={}",
|
||||
num_segments, docs_per_segment, tokens_per_doc, vocab_size
|
||||
);
|
||||
|
||||
MergeScenario {
|
||||
index,
|
||||
segments,
|
||||
settings,
|
||||
label,
|
||||
}
|
||||
}
|
||||
|
||||
fn main() {
|
||||
let scenarios = vec![
|
||||
build_index(8, 50_000, 12, 8),
|
||||
build_index(16, 50_000, 12, 8),
|
||||
build_index(16, 100_000, 12, 8),
|
||||
build_index(8, 50_000, 8, 8 * 50_000 * 8),
|
||||
];
|
||||
|
||||
let mut runner = BenchRunner::new();
|
||||
for scenario in scenarios {
|
||||
let mut group = runner.new_group();
|
||||
group.set_name(format!("merge_segments inv_index — {}", scenario.label));
|
||||
let segments = scenario.segments.clone();
|
||||
let settings = scenario.settings.clone();
|
||||
group.register("merge", move |_| {
|
||||
let output_dir = NullDirectory::default();
|
||||
let filter_doc_ids = vec![None; segments.len()];
|
||||
let merged_index =
|
||||
merge_filtered_segments(&segments, settings.clone(), filter_doc_ids, output_dir)
|
||||
.unwrap();
|
||||
black_box(merged_index);
|
||||
});
|
||||
|
||||
group.run();
|
||||
}
|
||||
}
|
||||
@@ -33,7 +33,7 @@ fn build_shared_indices(num_docs: usize, distribution: &str) -> BenchIndex {
|
||||
match distribution {
|
||||
"dense" => {
|
||||
for doc_id in 0..num_docs {
|
||||
let num_rand = rng.random_range(0u64..1000u64);
|
||||
let num_rand = rng.gen_range(0u64..1000u64);
|
||||
let num_asc = (doc_id / 10000) as u64;
|
||||
|
||||
writer
|
||||
@@ -46,7 +46,7 @@ fn build_shared_indices(num_docs: usize, distribution: &str) -> BenchIndex {
|
||||
}
|
||||
"sparse" => {
|
||||
for doc_id in 0..num_docs {
|
||||
let num_rand = rng.random_range(0u64..10000000u64);
|
||||
let num_rand = rng.gen_range(0u64..10000000u64);
|
||||
let num_asc = doc_id as u64;
|
||||
|
||||
writer
|
||||
|
||||
@@ -97,20 +97,20 @@ fn get_index_0_to_100() -> Index {
|
||||
let num_vals = 100_000;
|
||||
let docs: Vec<_> = (0..num_vals)
|
||||
.map(|_i| {
|
||||
let id_name = if rng.random_bool(0.01) {
|
||||
let id_name = if rng.gen_bool(0.01) {
|
||||
"veryfew".to_string() // 1%
|
||||
} else if rng.random_bool(0.1) {
|
||||
} else if rng.gen_bool(0.1) {
|
||||
"few".to_string() // 9%
|
||||
} else {
|
||||
"most".to_string() // 90%
|
||||
};
|
||||
Doc {
|
||||
id_name,
|
||||
id: rng.random_range(0..100),
|
||||
id: rng.gen_range(0..100),
|
||||
// Multiply by 1000, so that we create most buckets in the compact space
|
||||
// The benches depend on this range to select n-percent of elements with the
|
||||
// methods below.
|
||||
ip: Ipv6Addr::from_u128(rng.random_range(0..100) * 1000),
|
||||
ip: Ipv6Addr::from_u128(rng.gen_range(0..100) * 1000),
|
||||
}
|
||||
})
|
||||
.collect();
|
||||
|
||||
@@ -1,113 +0,0 @@
|
||||
// Benchmarks regex query that matches all terms in a synthetic index.
|
||||
//
|
||||
// Corpus model:
|
||||
// - N unique terms: t000000, t000001, ...
|
||||
// - M docs
|
||||
// - K tokens per doc: doc i gets terms derived from (i, token_index)
|
||||
//
|
||||
// Query:
|
||||
// - Regex "t.*" to match all terms
|
||||
//
|
||||
// Run with:
|
||||
// - cargo bench --bench regex_all_terms
|
||||
//
|
||||
|
||||
use std::fmt::Write;
|
||||
|
||||
use binggan::{black_box, BenchRunner};
|
||||
use tantivy::collector::Count;
|
||||
use tantivy::query::RegexQuery;
|
||||
use tantivy::schema::{Schema, TEXT};
|
||||
use tantivy::{doc, Index, ReloadPolicy};
|
||||
|
||||
const HEAP_SIZE_BYTES: usize = 200_000_000;
|
||||
|
||||
#[derive(Clone, Copy)]
|
||||
struct BenchConfig {
|
||||
num_terms: usize,
|
||||
num_docs: usize,
|
||||
tokens_per_doc: usize,
|
||||
}
|
||||
|
||||
fn main() {
|
||||
let configs = default_configs();
|
||||
|
||||
let mut runner = BenchRunner::new();
|
||||
for config in configs {
|
||||
let (index, text_field) = build_index(config, HEAP_SIZE_BYTES);
|
||||
let reader = index
|
||||
.reader_builder()
|
||||
.reload_policy(ReloadPolicy::Manual)
|
||||
.try_into()
|
||||
.expect("reader");
|
||||
let searcher = reader.searcher();
|
||||
let query = RegexQuery::from_pattern("t.*", text_field).expect("regex query");
|
||||
|
||||
let mut group = runner.new_group();
|
||||
group.set_name(format!(
|
||||
"regex_all_terms_t{}_d{}_k{}",
|
||||
config.num_terms, config.num_docs, config.tokens_per_doc
|
||||
));
|
||||
group.register("regex_count", move |_| {
|
||||
let count = searcher.search(&query, &Count).expect("search");
|
||||
black_box(count);
|
||||
});
|
||||
group.run();
|
||||
}
|
||||
}
|
||||
|
||||
fn default_configs() -> Vec<BenchConfig> {
|
||||
vec![
|
||||
BenchConfig {
|
||||
num_terms: 10_000,
|
||||
num_docs: 100_000,
|
||||
tokens_per_doc: 1,
|
||||
},
|
||||
BenchConfig {
|
||||
num_terms: 10_000,
|
||||
num_docs: 100_000,
|
||||
tokens_per_doc: 8,
|
||||
},
|
||||
BenchConfig {
|
||||
num_terms: 100_000,
|
||||
num_docs: 100_000,
|
||||
tokens_per_doc: 1,
|
||||
},
|
||||
BenchConfig {
|
||||
num_terms: 100_000,
|
||||
num_docs: 100_000,
|
||||
tokens_per_doc: 8,
|
||||
},
|
||||
]
|
||||
}
|
||||
|
||||
fn build_index(config: BenchConfig, heap_size_bytes: usize) -> (Index, tantivy::schema::Field) {
|
||||
let mut schema_builder = Schema::builder();
|
||||
let text_field = schema_builder.add_text_field("text", TEXT);
|
||||
let schema = schema_builder.build();
|
||||
let index = Index::create_in_ram(schema);
|
||||
|
||||
let term_width = config.num_terms.to_string().len();
|
||||
{
|
||||
let mut writer = index
|
||||
.writer_with_num_threads(1, heap_size_bytes)
|
||||
.expect("writer");
|
||||
let mut buffer = String::new();
|
||||
for doc_id in 0..config.num_docs {
|
||||
buffer.clear();
|
||||
for token_idx in 0..config.tokens_per_doc {
|
||||
if token_idx > 0 {
|
||||
buffer.push(' ');
|
||||
}
|
||||
let term_id = (doc_id * config.tokens_per_doc + token_idx) % config.num_terms;
|
||||
write!(&mut buffer, "t{term_id:0term_width$}").expect("write token");
|
||||
}
|
||||
writer
|
||||
.add_document(doc!(text_field => buffer.as_str()))
|
||||
.expect("add_document");
|
||||
}
|
||||
writer.commit().expect("commit");
|
||||
}
|
||||
|
||||
(index, text_field)
|
||||
}
|
||||
@@ -1,420 +0,0 @@
|
||||
// This benchmark compares different approaches for retrieving string values:
|
||||
//
|
||||
// 1. Fast Field Approach: retrieves string values via term_ords() and ord_to_str()
|
||||
//
|
||||
// 2. Doc Store Approach: retrieves string values via searcher.doc() and field extraction
|
||||
//
|
||||
// The benchmark includes various data distributions:
|
||||
// - Dense Sequential: Sequential document IDs with dense data
|
||||
// - Dense Random: Random document IDs with dense data
|
||||
// - Sparse Sequential: Sequential document IDs with sparse data
|
||||
// - Sparse Random: Random document IDs with sparse data
|
||||
use std::ops::Bound;
|
||||
|
||||
use binggan::{black_box, BenchGroup, BenchRunner};
|
||||
use rand::prelude::*;
|
||||
use rand::rngs::StdRng;
|
||||
use rand::SeedableRng;
|
||||
use tantivy::collector::{Count, DocSetCollector};
|
||||
use tantivy::query::RangeQuery;
|
||||
use tantivy::schema::{Schema, Value, FAST, STORED, STRING};
|
||||
use tantivy::{doc, Index, ReloadPolicy, Searcher, Term};
|
||||
|
||||
#[derive(Clone)]
|
||||
struct BenchIndex {
|
||||
#[allow(dead_code)]
|
||||
index: Index,
|
||||
searcher: Searcher,
|
||||
}
|
||||
|
||||
fn build_shared_indices(num_docs: usize, distribution: &str) -> BenchIndex {
|
||||
// Schema with string fast field and stored field for doc access
|
||||
let mut schema_builder = Schema::builder();
|
||||
let f_str_fast = schema_builder.add_text_field("str_fast", STRING | STORED | FAST);
|
||||
let f_str_stored = schema_builder.add_text_field("str_stored", STRING | STORED);
|
||||
let schema = schema_builder.build();
|
||||
let index = Index::create_in_ram(schema.clone());
|
||||
|
||||
// Populate index with stable RNG for reproducibility.
|
||||
let mut rng = StdRng::from_seed([7u8; 32]);
|
||||
|
||||
{
|
||||
let mut writer = index.writer_with_num_threads(1, 4_000_000_000).unwrap();
|
||||
|
||||
match distribution {
|
||||
"dense_random" => {
|
||||
for _doc_id in 0..num_docs {
|
||||
let suffix = rng.random_range(0u64..1000u64);
|
||||
let str_val = format!("str_{:03}", suffix);
|
||||
|
||||
writer
|
||||
.add_document(doc!(
|
||||
f_str_fast=>str_val.clone(),
|
||||
f_str_stored=>str_val,
|
||||
))
|
||||
.unwrap();
|
||||
}
|
||||
}
|
||||
"dense_sequential" => {
|
||||
for doc_id in 0..num_docs {
|
||||
let suffix = doc_id as u64 % 1000;
|
||||
let str_val = format!("str_{:03}", suffix);
|
||||
|
||||
writer
|
||||
.add_document(doc!(
|
||||
f_str_fast=>str_val.clone(),
|
||||
f_str_stored=>str_val,
|
||||
))
|
||||
.unwrap();
|
||||
}
|
||||
}
|
||||
"sparse_random" => {
|
||||
for _doc_id in 0..num_docs {
|
||||
let suffix = rng.random_range(0u64..1000000u64);
|
||||
let str_val = format!("str_{:07}", suffix);
|
||||
|
||||
writer
|
||||
.add_document(doc!(
|
||||
f_str_fast=>str_val.clone(),
|
||||
f_str_stored=>str_val,
|
||||
))
|
||||
.unwrap();
|
||||
}
|
||||
}
|
||||
"sparse_sequential" => {
|
||||
for doc_id in 0..num_docs {
|
||||
let suffix = doc_id as u64;
|
||||
let str_val = format!("str_{:07}", suffix);
|
||||
|
||||
writer
|
||||
.add_document(doc!(
|
||||
f_str_fast=>str_val.clone(),
|
||||
f_str_stored=>str_val,
|
||||
))
|
||||
.unwrap();
|
||||
}
|
||||
}
|
||||
_ => {
|
||||
panic!("Unsupported distribution type");
|
||||
}
|
||||
}
|
||||
writer.commit().unwrap();
|
||||
}
|
||||
|
||||
// Prepare reader/searcher once.
|
||||
let reader = index
|
||||
.reader_builder()
|
||||
.reload_policy(ReloadPolicy::Manual)
|
||||
.try_into()
|
||||
.unwrap();
|
||||
let searcher = reader.searcher();
|
||||
|
||||
BenchIndex { index, searcher }
|
||||
}
|
||||
|
||||
fn main() {
|
||||
// Prepare corpora with varying scenarios
|
||||
let scenarios = vec![
|
||||
(
|
||||
"dense_random_search_low_range".to_string(),
|
||||
1_000_000,
|
||||
"dense_random",
|
||||
0,
|
||||
9,
|
||||
),
|
||||
(
|
||||
"dense_random_search_high_range".to_string(),
|
||||
1_000_000,
|
||||
"dense_random",
|
||||
990,
|
||||
999,
|
||||
),
|
||||
(
|
||||
"dense_sequential_search_low_range".to_string(),
|
||||
1_000_000,
|
||||
"dense_sequential",
|
||||
0,
|
||||
9,
|
||||
),
|
||||
(
|
||||
"dense_sequential_search_high_range".to_string(),
|
||||
1_000_000,
|
||||
"dense_sequential",
|
||||
990,
|
||||
999,
|
||||
),
|
||||
(
|
||||
"sparse_random_search_low_range".to_string(),
|
||||
1_000_000,
|
||||
"sparse_random",
|
||||
0,
|
||||
9999,
|
||||
),
|
||||
(
|
||||
"sparse_random_search_high_range".to_string(),
|
||||
1_000_000,
|
||||
"sparse_random",
|
||||
990_000,
|
||||
999_999,
|
||||
),
|
||||
(
|
||||
"sparse_sequential_search_low_range".to_string(),
|
||||
1_000_000,
|
||||
"sparse_sequential",
|
||||
0,
|
||||
9999,
|
||||
),
|
||||
(
|
||||
"sparse_sequential_search_high_range".to_string(),
|
||||
1_000_000,
|
||||
"sparse_sequential",
|
||||
990_000,
|
||||
999_999,
|
||||
),
|
||||
];
|
||||
|
||||
let mut runner = BenchRunner::new();
|
||||
for (scenario_id, n, distribution, range_low, range_high) in scenarios {
|
||||
let bench_index = build_shared_indices(n, distribution);
|
||||
let mut group = runner.new_group();
|
||||
group.set_name(scenario_id);
|
||||
|
||||
let field = bench_index.searcher.schema().get_field("str_fast").unwrap();
|
||||
|
||||
let (lower_str, upper_str) =
|
||||
if distribution == "dense_sequential" || distribution == "dense_random" {
|
||||
(
|
||||
format!("str_{:03}", range_low),
|
||||
format!("str_{:03}", range_high),
|
||||
)
|
||||
} else {
|
||||
(
|
||||
format!("str_{:07}", range_low),
|
||||
format!("str_{:07}", range_high),
|
||||
)
|
||||
};
|
||||
|
||||
let lower_term = Term::from_field_text(field, &lower_str);
|
||||
let upper_term = Term::from_field_text(field, &upper_str);
|
||||
|
||||
let query = RangeQuery::new(Bound::Included(lower_term), Bound::Included(upper_term));
|
||||
|
||||
run_benchmark_tasks(&mut group, &bench_index, query, range_low, range_high);
|
||||
|
||||
group.run();
|
||||
}
|
||||
}
|
||||
|
||||
/// Run all benchmark tasks for a given range query
|
||||
fn run_benchmark_tasks(
|
||||
bench_group: &mut BenchGroup,
|
||||
bench_index: &BenchIndex,
|
||||
query: RangeQuery,
|
||||
range_low: u64,
|
||||
range_high: u64,
|
||||
) {
|
||||
// Test count of matching documents
|
||||
add_bench_task_count(
|
||||
bench_group,
|
||||
bench_index,
|
||||
query.clone(),
|
||||
range_low,
|
||||
range_high,
|
||||
);
|
||||
|
||||
// Test fetching all DocIds of matching documents
|
||||
add_bench_task_docset(
|
||||
bench_group,
|
||||
bench_index,
|
||||
query.clone(),
|
||||
range_low,
|
||||
range_high,
|
||||
);
|
||||
|
||||
// Test fetching all string fast field values of matching documents
|
||||
add_bench_task_fetch_all_strings(
|
||||
bench_group,
|
||||
bench_index,
|
||||
query.clone(),
|
||||
range_low,
|
||||
range_high,
|
||||
);
|
||||
|
||||
// Test fetching all string values of matching documents through doc() method
|
||||
add_bench_task_fetch_all_strings_from_doc(
|
||||
bench_group,
|
||||
bench_index,
|
||||
query,
|
||||
range_low,
|
||||
range_high,
|
||||
);
|
||||
}
|
||||
|
||||
fn add_bench_task_count(
|
||||
bench_group: &mut BenchGroup,
|
||||
bench_index: &BenchIndex,
|
||||
query: RangeQuery,
|
||||
range_low: u64,
|
||||
range_high: u64,
|
||||
) {
|
||||
let task_name = format!("string_search_count_[{}-{}]", range_low, range_high);
|
||||
|
||||
let search_task = CountSearchTask {
|
||||
searcher: bench_index.searcher.clone(),
|
||||
query,
|
||||
};
|
||||
bench_group.register(task_name, move |_| black_box(search_task.run()));
|
||||
}
|
||||
|
||||
fn add_bench_task_docset(
|
||||
bench_group: &mut BenchGroup,
|
||||
bench_index: &BenchIndex,
|
||||
query: RangeQuery,
|
||||
range_low: u64,
|
||||
range_high: u64,
|
||||
) {
|
||||
let task_name = format!("string_fetch_all_docset_[{}-{}]", range_low, range_high);
|
||||
|
||||
let search_task = DocSetSearchTask {
|
||||
searcher: bench_index.searcher.clone(),
|
||||
query,
|
||||
};
|
||||
bench_group.register(task_name, move |_| black_box(search_task.run()));
|
||||
}
|
||||
|
||||
fn add_bench_task_fetch_all_strings(
|
||||
bench_group: &mut BenchGroup,
|
||||
bench_index: &BenchIndex,
|
||||
query: RangeQuery,
|
||||
range_low: u64,
|
||||
range_high: u64,
|
||||
) {
|
||||
let task_name = format!(
|
||||
"string_fastfield_fetch_all_strings_[{}-{}]",
|
||||
range_low, range_high
|
||||
);
|
||||
|
||||
let search_task = FetchAllStringsSearchTask {
|
||||
searcher: bench_index.searcher.clone(),
|
||||
query,
|
||||
};
|
||||
|
||||
bench_group.register(task_name, move |_| {
|
||||
let result = black_box(search_task.run());
|
||||
result.len()
|
||||
});
|
||||
}
|
||||
|
||||
fn add_bench_task_fetch_all_strings_from_doc(
|
||||
bench_group: &mut BenchGroup,
|
||||
bench_index: &BenchIndex,
|
||||
query: RangeQuery,
|
||||
range_low: u64,
|
||||
range_high: u64,
|
||||
) {
|
||||
let task_name = format!(
|
||||
"string_doc_fetch_all_strings_[{}-{}]",
|
||||
range_low, range_high
|
||||
);
|
||||
|
||||
let search_task = FetchAllStringsFromDocTask {
|
||||
searcher: bench_index.searcher.clone(),
|
||||
query,
|
||||
};
|
||||
|
||||
bench_group.register(task_name, move |_| {
|
||||
let result = black_box(search_task.run());
|
||||
result.len()
|
||||
});
|
||||
}
|
||||
|
||||
struct CountSearchTask {
|
||||
searcher: Searcher,
|
||||
query: RangeQuery,
|
||||
}
|
||||
|
||||
impl CountSearchTask {
|
||||
#[inline(never)]
|
||||
pub fn run(&self) -> usize {
|
||||
self.searcher.search(&self.query, &Count).unwrap()
|
||||
}
|
||||
}
|
||||
|
||||
struct DocSetSearchTask {
|
||||
searcher: Searcher,
|
||||
query: RangeQuery,
|
||||
}
|
||||
|
||||
impl DocSetSearchTask {
|
||||
#[inline(never)]
|
||||
pub fn run(&self) -> usize {
|
||||
let result = self.searcher.search(&self.query, &DocSetCollector).unwrap();
|
||||
result.len()
|
||||
}
|
||||
}
|
||||
|
||||
struct FetchAllStringsSearchTask {
|
||||
searcher: Searcher,
|
||||
query: RangeQuery,
|
||||
}
|
||||
|
||||
impl FetchAllStringsSearchTask {
|
||||
#[inline(never)]
|
||||
pub fn run(&self) -> Vec<String> {
|
||||
let doc_addresses = self.searcher.search(&self.query, &DocSetCollector).unwrap();
|
||||
let mut docs = doc_addresses.into_iter().collect::<Vec<_>>();
|
||||
docs.sort();
|
||||
let mut strings = Vec::with_capacity(docs.len());
|
||||
|
||||
for doc_address in docs {
|
||||
let segment_reader = &self.searcher.segment_readers()[doc_address.segment_ord as usize];
|
||||
let str_column_opt = segment_reader.fast_fields().str("str_fast");
|
||||
|
||||
if let Ok(Some(str_column)) = str_column_opt {
|
||||
let doc_id = doc_address.doc_id;
|
||||
let term_ord = str_column.term_ords(doc_id).next().unwrap();
|
||||
let mut str_buffer = String::new();
|
||||
if str_column.ord_to_str(term_ord, &mut str_buffer).is_ok() {
|
||||
strings.push(str_buffer);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
strings
|
||||
}
|
||||
}
|
||||
|
||||
struct FetchAllStringsFromDocTask {
|
||||
searcher: Searcher,
|
||||
query: RangeQuery,
|
||||
}
|
||||
|
||||
impl FetchAllStringsFromDocTask {
|
||||
#[inline(never)]
|
||||
pub fn run(&self) -> Vec<String> {
|
||||
let doc_addresses = self.searcher.search(&self.query, &DocSetCollector).unwrap();
|
||||
let mut docs = doc_addresses.into_iter().collect::<Vec<_>>();
|
||||
docs.sort();
|
||||
let mut strings = Vec::with_capacity(docs.len());
|
||||
|
||||
let str_stored_field = self
|
||||
.searcher
|
||||
.schema()
|
||||
.get_field("str_stored")
|
||||
.expect("str_stored field should exist");
|
||||
|
||||
for doc_address in docs {
|
||||
// Get the document from the doc store (row store access)
|
||||
if let Ok(doc) = self.searcher.doc(doc_address) {
|
||||
// Extract string values from the stored field
|
||||
if let Some(field_value) = doc.get_first(str_stored_field) {
|
||||
if let Some(text) = field_value.as_value().as_str() {
|
||||
strings.push(text.to_string());
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
strings
|
||||
}
|
||||
}
|
||||
@@ -18,5 +18,5 @@ homepage = "https://github.com/quickwit-oss/tantivy"
|
||||
bitpacking = { version = "0.9.2", default-features = false, features = ["bitpacker1x"] }
|
||||
|
||||
[dev-dependencies]
|
||||
rand = "0.9"
|
||||
rand = "0.8"
|
||||
proptest = "1"
|
||||
|
||||
@@ -4,8 +4,8 @@ extern crate test;
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use rand::rng;
|
||||
use rand::seq::IteratorRandom;
|
||||
use rand::thread_rng;
|
||||
use tantivy_bitpacker::{BitPacker, BitUnpacker, BlockedBitpacker};
|
||||
use test::Bencher;
|
||||
|
||||
@@ -27,7 +27,7 @@ mod tests {
|
||||
let num_els = 1_000_000u32;
|
||||
let bit_unpacker = BitUnpacker::new(bit_width);
|
||||
let data = create_bitpacked_data(bit_width, num_els);
|
||||
let idxs: Vec<u32> = (0..num_els).choose_multiple(&mut rng(), 100_000);
|
||||
let idxs: Vec<u32> = (0..num_els).choose_multiple(&mut thread_rng(), 100_000);
|
||||
b.iter(|| {
|
||||
let mut out = 0u64;
|
||||
for &idx in &idxs {
|
||||
|
||||
@@ -22,7 +22,7 @@ downcast-rs = "2.0.1"
|
||||
[dev-dependencies]
|
||||
proptest = "1"
|
||||
more-asserts = "0.3.1"
|
||||
rand = "0.9"
|
||||
rand = "0.8"
|
||||
binggan = "0.14.0"
|
||||
|
||||
[[bench]]
|
||||
|
||||
@@ -9,7 +9,7 @@ use tantivy_columnar::column_values::{CodecType, serialize_and_load_u64_based_co
|
||||
fn get_data() -> Vec<u64> {
|
||||
let mut rng = StdRng::seed_from_u64(2u64);
|
||||
let mut data: Vec<_> = (100..55_000_u64)
|
||||
.map(|num| num + rng.random::<u8>() as u64)
|
||||
.map(|num| num + rng.r#gen::<u8>() as u64)
|
||||
.collect();
|
||||
data.push(99_000);
|
||||
data.insert(1000, 2000);
|
||||
|
||||
@@ -6,7 +6,7 @@ use tantivy_columnar::column_values::{CodecType, serialize_u64_based_column_valu
|
||||
fn get_data() -> Vec<u64> {
|
||||
let mut rng = StdRng::seed_from_u64(2u64);
|
||||
let mut data: Vec<_> = (100..55_000_u64)
|
||||
.map(|num| num + rng.random::<u8>() as u64)
|
||||
.map(|num| num + rng.r#gen::<u8>() as u64)
|
||||
.collect();
|
||||
data.push(99_000);
|
||||
data.insert(1000, 2000);
|
||||
|
||||
@@ -8,7 +8,7 @@ const TOTAL_NUM_VALUES: u32 = 1_000_000;
|
||||
fn gen_optional_index(fill_ratio: f64) -> OptionalIndex {
|
||||
let mut rng: StdRng = StdRng::from_seed([1u8; 32]);
|
||||
let vals: Vec<u32> = (0..TOTAL_NUM_VALUES)
|
||||
.map(|_| rng.random_bool(fill_ratio))
|
||||
.map(|_| rng.gen_bool(fill_ratio))
|
||||
.enumerate()
|
||||
.filter(|(_pos, val)| *val)
|
||||
.map(|(pos, _)| pos as u32)
|
||||
@@ -25,7 +25,7 @@ fn random_range_iterator(
|
||||
let mut rng: StdRng = StdRng::from_seed([1u8; 32]);
|
||||
let mut current = start;
|
||||
std::iter::from_fn(move || {
|
||||
current += rng.random_range(avg_step_size - avg_deviation..=avg_step_size + avg_deviation);
|
||||
current += rng.gen_range(avg_step_size - avg_deviation..=avg_step_size + avg_deviation);
|
||||
if current >= end { None } else { Some(current) }
|
||||
})
|
||||
}
|
||||
|
||||
@@ -39,7 +39,7 @@ fn get_data_50percent_item() -> Vec<u128> {
|
||||
|
||||
let mut data = vec![];
|
||||
for _ in 0..300_000 {
|
||||
let val = rng.random_range(1..=100);
|
||||
let val = rng.gen_range(1..=100);
|
||||
data.push(val);
|
||||
}
|
||||
data.push(SINGLE_ITEM);
|
||||
|
||||
@@ -34,7 +34,7 @@ fn get_data_50percent_item() -> Vec<u128> {
|
||||
|
||||
let mut data = vec![];
|
||||
for _ in 0..300_000 {
|
||||
let val = rng.random_range(1..=100);
|
||||
let val = rng.gen_range(1..=100);
|
||||
data.push(val);
|
||||
}
|
||||
data.push(SINGLE_ITEM);
|
||||
|
||||
@@ -268,7 +268,7 @@ mod tests {
|
||||
|
||||
#[test]
|
||||
fn linear_interpol_fast_field_rand() {
|
||||
let mut rng = rand::rng();
|
||||
let mut rng = rand::thread_rng();
|
||||
for _ in 0..50 {
|
||||
let mut data = (0..10_000).map(|_| rng.next_u64()).collect::<Vec<_>>();
|
||||
create_and_validate::<LinearCodec>(&data, "random");
|
||||
|
||||
@@ -122,7 +122,7 @@ pub(crate) fn create_and_validate<TColumnCodec: ColumnCodec>(
|
||||
assert_eq!(vals, buffer);
|
||||
|
||||
if !vals.is_empty() {
|
||||
let test_rand_idx = rand::rng().random_range(0..=vals.len() - 1);
|
||||
let test_rand_idx = rand::thread_rng().gen_range(0..=vals.len() - 1);
|
||||
let expected_positions: Vec<u32> = vals
|
||||
.iter()
|
||||
.enumerate()
|
||||
|
||||
@@ -21,5 +21,5 @@ serde = { version = "1.0.136", features = ["derive"] }
|
||||
[dev-dependencies]
|
||||
binggan = "0.14.0"
|
||||
proptest = "1.0.0"
|
||||
rand = "0.9"
|
||||
rand = "0.8.4"
|
||||
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
use binggan::{BenchRunner, black_box};
|
||||
use rand::rng;
|
||||
use rand::seq::IteratorRandom;
|
||||
use rand::thread_rng;
|
||||
use tantivy_common::{BitSet, TinySet, serialize_vint_u32};
|
||||
|
||||
fn bench_vint() {
|
||||
@@ -17,7 +17,7 @@ fn bench_vint() {
|
||||
black_box(out);
|
||||
});
|
||||
|
||||
let vals: Vec<u32> = (0..20_000).choose_multiple(&mut rng(), 100_000);
|
||||
let vals: Vec<u32> = (0..20_000).choose_multiple(&mut thread_rng(), 100_000);
|
||||
runner.bench_function("bench_vint_rand", move |_| {
|
||||
let mut out = 0u64;
|
||||
for val in vals.iter().cloned() {
|
||||
|
||||
@@ -297,9 +297,6 @@ impl BitSet {
|
||||
.map(|delta_bucket| bucket + delta_bucket as u32)
|
||||
}
|
||||
|
||||
/// Returns the maximum number of elements in the bitset.
|
||||
///
|
||||
/// Warning: The largest element the bitset can contain is `max_value - 1`.
|
||||
#[inline]
|
||||
pub fn max_value(&self) -> u32 {
|
||||
self.max_value
|
||||
@@ -417,7 +414,7 @@ mod tests {
|
||||
use std::collections::HashSet;
|
||||
|
||||
use ownedbytes::OwnedBytes;
|
||||
use rand::distr::Bernoulli;
|
||||
use rand::distributions::Bernoulli;
|
||||
use rand::rngs::StdRng;
|
||||
use rand::{Rng, SeedableRng};
|
||||
|
||||
|
||||
@@ -62,9 +62,7 @@ impl<W: TerminatingWrite> TerminatingWrite for CountingWriter<W> {
|
||||
pub struct AntiCallToken(());
|
||||
|
||||
/// Trait used to indicate when no more write need to be done on a writer
|
||||
///
|
||||
/// Thread-safety is enforced at the call sites that require it.
|
||||
pub trait TerminatingWrite: Write {
|
||||
pub trait TerminatingWrite: Write + Send + Sync {
|
||||
/// Indicate that the writer will no longer be used. Internally call terminate_ref.
|
||||
fn terminate(mut self) -> io::Result<()>
|
||||
where Self: Sized {
|
||||
|
||||
@@ -60,7 +60,7 @@ At indexing, tantivy will try to interpret number and strings as different type
|
||||
priority order.
|
||||
|
||||
Numbers will be interpreted as u64, i64 and f64 in that order.
|
||||
Strings will be interpreted as rfc3339 dates or simple strings.
|
||||
Strings will be interpreted as rfc3999 dates or simple strings.
|
||||
|
||||
The first working type is picked and is the only term that is emitted for indexing.
|
||||
Note this interpretation happens on a per-document basis, and there is no effort to try to sniff
|
||||
@@ -81,7 +81,7 @@ Will be interpreted as
|
||||
(my_path.my_segment, String, 233) or (my_path.my_segment, u64, 233)
|
||||
```
|
||||
|
||||
Likewise, we need to emit two tokens if the query contains an rfc3339 date.
|
||||
Likewise, we need to emit two tokens if the query contains an rfc3999 date.
|
||||
Indeed the date could have been actually a single token inside the text of a document at ingestion time. Generally speaking, we will always at least emit a string token in query parsing, and sometimes more.
|
||||
|
||||
If one more json field is defined, things get even more complicated.
|
||||
|
||||
@@ -70,7 +70,7 @@ impl Collector for StatsCollector {
|
||||
fn for_segment(
|
||||
&self,
|
||||
_segment_local_id: u32,
|
||||
segment_reader: &dyn SegmentReader,
|
||||
segment_reader: &SegmentReader,
|
||||
) -> tantivy::Result<StatsSegmentCollector> {
|
||||
let fast_field_reader = segment_reader.fast_fields().u64(&self.field)?;
|
||||
Ok(StatsSegmentCollector {
|
||||
|
||||
@@ -60,7 +60,7 @@ fn main() -> tantivy::Result<()> {
|
||||
let count_docs = searcher.search(&*query, &TopDocs::with_limit(4).order_by_score())?;
|
||||
assert_eq!(count_docs.len(), 1);
|
||||
for (_score, doc_address) in count_docs {
|
||||
let retrieved_doc = searcher.doc(doc_address)?;
|
||||
let retrieved_doc = searcher.doc::<TantivyDocument>(doc_address)?;
|
||||
assert!(retrieved_doc
|
||||
.get_first(occurred_at)
|
||||
.unwrap()
|
||||
|
||||
@@ -65,7 +65,7 @@ fn main() -> tantivy::Result<()> {
|
||||
);
|
||||
let top_docs_by_custom_score =
|
||||
// Call TopDocs with a custom tweak score
|
||||
TopDocs::with_limit(2).tweak_score(move |segment_reader: &dyn SegmentReader| {
|
||||
TopDocs::with_limit(2).tweak_score(move |segment_reader: &SegmentReader| {
|
||||
let ingredient_reader = segment_reader.facet_reader("ingredient").unwrap();
|
||||
let facet_dict = ingredient_reader.facet_dict();
|
||||
|
||||
@@ -91,7 +91,7 @@ fn main() -> tantivy::Result<()> {
|
||||
.iter()
|
||||
.map(|(_, doc_id)| {
|
||||
searcher
|
||||
.doc(*doc_id)
|
||||
.doc::<TantivyDocument>(*doc_id)
|
||||
.unwrap()
|
||||
.get_first(title)
|
||||
.and_then(|v| v.as_str().map(|el| el.to_string()))
|
||||
|
||||
@@ -67,7 +67,7 @@ fn main() -> Result<()> {
|
||||
let mut titles = top_docs
|
||||
.into_iter()
|
||||
.map(|(_score, doc_address)| {
|
||||
let doc = searcher.doc(doc_address)?;
|
||||
let doc = searcher.doc::<TantivyDocument>(doc_address)?;
|
||||
let title = doc
|
||||
.get_first(title)
|
||||
.and_then(|v| v.as_str())
|
||||
|
||||
@@ -55,7 +55,7 @@ fn main() -> tantivy::Result<()> {
|
||||
let snippet_generator = SnippetGenerator::create(&searcher, &*query, body)?;
|
||||
|
||||
for (score, doc_address) in top_docs {
|
||||
let doc = searcher.doc(doc_address)?;
|
||||
let doc = searcher.doc::<TantivyDocument>(doc_address)?;
|
||||
let snippet = snippet_generator.snippet_from_doc(&doc);
|
||||
println!("Document score {score}:");
|
||||
println!("title: {}", doc.get_first(title).unwrap().as_str().unwrap());
|
||||
|
||||
@@ -43,7 +43,7 @@ impl DynamicPriceColumn {
|
||||
}
|
||||
}
|
||||
|
||||
pub fn price_for_segment(&self, segment_reader: &dyn SegmentReader) -> Option<Arc<Vec<Price>>> {
|
||||
pub fn price_for_segment(&self, segment_reader: &SegmentReader) -> Option<Arc<Vec<Price>>> {
|
||||
let segment_key = (segment_reader.segment_id(), segment_reader.delete_opstamp());
|
||||
self.price_cache.read().unwrap().get(&segment_key).cloned()
|
||||
}
|
||||
@@ -157,7 +157,7 @@ fn main() -> tantivy::Result<()> {
|
||||
let query = query_parser.parse_query("cooking")?;
|
||||
|
||||
let searcher = reader.searcher();
|
||||
let score_by_price = move |segment_reader: &dyn SegmentReader| {
|
||||
let score_by_price = move |segment_reader: &SegmentReader| {
|
||||
let price = price_dynamic_column
|
||||
.price_for_segment(segment_reader)
|
||||
.unwrap();
|
||||
|
||||
@@ -560,7 +560,7 @@ fn range_infallible(inp: &str) -> JResult<&str, UserInputLeaf> {
|
||||
(
|
||||
(
|
||||
value((), tag(">=")),
|
||||
map(word_infallible(")", false), |(bound, err)| {
|
||||
map(word_infallible("", false), |(bound, err)| {
|
||||
(
|
||||
(
|
||||
bound
|
||||
@@ -574,7 +574,7 @@ fn range_infallible(inp: &str) -> JResult<&str, UserInputLeaf> {
|
||||
),
|
||||
(
|
||||
value((), tag("<=")),
|
||||
map(word_infallible(")", false), |(bound, err)| {
|
||||
map(word_infallible("", false), |(bound, err)| {
|
||||
(
|
||||
(
|
||||
UserInputBound::Unbounded,
|
||||
@@ -588,7 +588,7 @@ fn range_infallible(inp: &str) -> JResult<&str, UserInputLeaf> {
|
||||
),
|
||||
(
|
||||
value((), tag(">")),
|
||||
map(word_infallible(")", false), |(bound, err)| {
|
||||
map(word_infallible("", false), |(bound, err)| {
|
||||
(
|
||||
(
|
||||
bound
|
||||
@@ -602,7 +602,7 @@ fn range_infallible(inp: &str) -> JResult<&str, UserInputLeaf> {
|
||||
),
|
||||
(
|
||||
value((), tag("<")),
|
||||
map(word_infallible(")", false), |(bound, err)| {
|
||||
map(word_infallible("", false), |(bound, err)| {
|
||||
(
|
||||
(
|
||||
UserInputBound::Unbounded,
|
||||
@@ -704,11 +704,7 @@ fn regex(inp: &str) -> IResult<&str, UserInputLeaf> {
|
||||
many1(alt((preceded(char('\\'), char('/')), none_of("/")))),
|
||||
char('/'),
|
||||
),
|
||||
peek(alt((
|
||||
value((), multispace1),
|
||||
value((), char(')')),
|
||||
value((), eof),
|
||||
))),
|
||||
peek(alt((multispace1, eof))),
|
||||
),
|
||||
|elements| UserInputLeaf::Regex {
|
||||
field: None,
|
||||
@@ -725,12 +721,8 @@ fn regex_infallible(inp: &str) -> JResult<&str, UserInputLeaf> {
|
||||
opt_i_err(char('/'), "missing delimiter /"),
|
||||
),
|
||||
opt_i_err(
|
||||
peek(alt((
|
||||
value((), multispace1),
|
||||
value((), char(')')),
|
||||
value((), eof),
|
||||
))),
|
||||
"expected whitespace, closing parenthesis, or end of input",
|
||||
peek(alt((multispace1, eof))),
|
||||
"expected whitespace or end of input",
|
||||
),
|
||||
)(inp)
|
||||
{
|
||||
@@ -1331,14 +1323,6 @@ mod test {
|
||||
test_parse_query_to_ast_helper("<a", "{\"*\" TO \"a\"}");
|
||||
test_parse_query_to_ast_helper("<=a", "{\"*\" TO \"a\"]");
|
||||
test_parse_query_to_ast_helper("<=bsd", "{\"*\" TO \"bsd\"]");
|
||||
|
||||
test_parse_query_to_ast_helper("(<=42)", "{\"*\" TO \"42\"]");
|
||||
test_parse_query_to_ast_helper("(<=42 )", "{\"*\" TO \"42\"]");
|
||||
test_parse_query_to_ast_helper("(age:>5)", "\"age\":{\"5\" TO \"*\"}");
|
||||
test_parse_query_to_ast_helper(
|
||||
"(title:bar AND age:>12)",
|
||||
"(+\"title\":bar +\"age\":{\"12\" TO \"*\"})",
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
@@ -1715,10 +1699,6 @@ mod test {
|
||||
test_parse_query_to_ast_helper("foo:(A OR B)", "(?\"foo\":A ?\"foo\":B)");
|
||||
test_parse_query_to_ast_helper("foo:(A* OR B*)", "(?\"foo\":A* ?\"foo\":B*)");
|
||||
test_parse_query_to_ast_helper("foo:(*A OR *B)", "(?\"foo\":*A ?\"foo\":*B)");
|
||||
|
||||
// Regexes between parentheses
|
||||
test_parse_query_to_ast_helper("foo:(/A.*/)", "\"foo\":/A.*/");
|
||||
test_parse_query_to_ast_helper("foo:(/A.*/ OR /B.*/)", "(?\"foo\":/A.*/ ?\"foo\":/B.*/)");
|
||||
}
|
||||
|
||||
#[test]
|
||||
|
||||
@@ -66,7 +66,6 @@ impl UserInputLeaf {
|
||||
}
|
||||
UserInputLeaf::Range { field, .. } if field.is_none() => *field = Some(default_field),
|
||||
UserInputLeaf::Set { field, .. } if field.is_none() => *field = Some(default_field),
|
||||
UserInputLeaf::Regex { field, .. } if field.is_none() => *field = Some(default_field),
|
||||
_ => (), // field was already set, do nothing
|
||||
}
|
||||
}
|
||||
|
||||
@@ -57,7 +57,7 @@ pub(crate) fn get_numeric_or_date_column_types() -> &'static [ColumnType] {
|
||||
|
||||
/// Get fast field reader or empty as default.
|
||||
pub(crate) fn get_ff_reader(
|
||||
reader: &dyn SegmentReader,
|
||||
reader: &SegmentReader,
|
||||
field_name: &str,
|
||||
allowed_column_types: Option<&[ColumnType]>,
|
||||
) -> crate::Result<(columnar::Column<u64>, ColumnType)> {
|
||||
@@ -74,7 +74,7 @@ pub(crate) fn get_ff_reader(
|
||||
}
|
||||
|
||||
pub(crate) fn get_dynamic_columns(
|
||||
reader: &dyn SegmentReader,
|
||||
reader: &SegmentReader,
|
||||
field_name: &str,
|
||||
) -> crate::Result<Vec<columnar::DynamicColumn>> {
|
||||
let ff_fields = reader.fast_fields().dynamic_column_handles(field_name)?;
|
||||
@@ -90,7 +90,7 @@ pub(crate) fn get_dynamic_columns(
|
||||
///
|
||||
/// Is guaranteed to return at least one column.
|
||||
pub(crate) fn get_all_ff_reader_or_empty(
|
||||
reader: &dyn SegmentReader,
|
||||
reader: &SegmentReader,
|
||||
field_name: &str,
|
||||
allowed_column_types: Option<&[ColumnType]>,
|
||||
fallback_type: ColumnType,
|
||||
|
||||
@@ -469,7 +469,7 @@ impl AggKind {
|
||||
/// Build AggregationsData by walking the request tree.
|
||||
pub(crate) fn build_aggregations_data_from_req(
|
||||
aggs: &Aggregations,
|
||||
reader: &dyn SegmentReader,
|
||||
reader: &SegmentReader,
|
||||
segment_ordinal: SegmentOrdinal,
|
||||
context: AggContextParams,
|
||||
) -> crate::Result<AggregationsSegmentCtx> {
|
||||
@@ -489,7 +489,7 @@ pub(crate) fn build_aggregations_data_from_req(
|
||||
fn build_nodes(
|
||||
agg_name: &str,
|
||||
req: &Aggregation,
|
||||
reader: &dyn SegmentReader,
|
||||
reader: &SegmentReader,
|
||||
segment_ordinal: SegmentOrdinal,
|
||||
data: &mut AggregationsSegmentCtx,
|
||||
is_top_level: bool,
|
||||
@@ -728,7 +728,7 @@ fn build_nodes(
|
||||
let idx_in_req_data = data.push_filter_req_data(FilterAggReqData {
|
||||
name: agg_name.to_string(),
|
||||
req: filter_req.clone(),
|
||||
segment_reader: reader.clone_arc(),
|
||||
segment_reader: reader.clone(),
|
||||
evaluator,
|
||||
matching_docs_buffer,
|
||||
is_top_level,
|
||||
@@ -745,7 +745,7 @@ fn build_nodes(
|
||||
|
||||
fn build_children(
|
||||
aggs: &Aggregations,
|
||||
reader: &dyn SegmentReader,
|
||||
reader: &SegmentReader,
|
||||
segment_ordinal: SegmentOrdinal,
|
||||
data: &mut AggregationsSegmentCtx,
|
||||
) -> crate::Result<Vec<AggRefNode>> {
|
||||
@@ -764,7 +764,7 @@ fn build_children(
|
||||
}
|
||||
|
||||
fn get_term_agg_accessors(
|
||||
reader: &dyn SegmentReader,
|
||||
reader: &SegmentReader,
|
||||
field_name: &str,
|
||||
missing: &Option<Key>,
|
||||
) -> crate::Result<Vec<(Column<u64>, ColumnType)>> {
|
||||
@@ -817,7 +817,7 @@ fn build_terms_or_cardinality_nodes(
|
||||
agg_name: &str,
|
||||
field_name: &str,
|
||||
missing: &Option<Key>,
|
||||
reader: &dyn SegmentReader,
|
||||
reader: &SegmentReader,
|
||||
segment_ordinal: SegmentOrdinal,
|
||||
data: &mut AggregationsSegmentCtx,
|
||||
sub_aggs: &Aggregations,
|
||||
|
||||
@@ -1,5 +1,4 @@
|
||||
use std::fmt::Debug;
|
||||
use std::sync::Arc;
|
||||
|
||||
use common::BitSet;
|
||||
use serde::{Deserialize, Deserializer, Serialize, Serializer};
|
||||
@@ -403,7 +402,7 @@ pub struct FilterAggReqData {
|
||||
/// The filter aggregation
|
||||
pub req: FilterAggregation,
|
||||
/// The segment reader
|
||||
pub segment_reader: Arc<dyn SegmentReader>,
|
||||
pub segment_reader: SegmentReader,
|
||||
/// Document evaluator for the filter query (precomputed BitSet)
|
||||
/// This is built once when the request data is created
|
||||
pub evaluator: DocumentQueryEvaluator,
|
||||
@@ -417,7 +416,7 @@ impl FilterAggReqData {
|
||||
pub(crate) fn get_memory_consumption(&self) -> usize {
|
||||
// Estimate: name + segment reader reference + bitset + buffer capacity
|
||||
self.name.len()
|
||||
+ std::mem::size_of::<Arc<dyn SegmentReader>>()
|
||||
+ std::mem::size_of::<SegmentReader>()
|
||||
+ self.evaluator.bitset.len() / 8 // BitSet memory (bits to bytes)
|
||||
+ self.matching_docs_buffer.capacity() * std::mem::size_of::<DocId>()
|
||||
+ std::mem::size_of::<bool>()
|
||||
@@ -439,7 +438,7 @@ impl DocumentQueryEvaluator {
|
||||
pub(crate) fn new(
|
||||
query: Box<dyn Query>,
|
||||
schema: Schema,
|
||||
segment_reader: &dyn SegmentReader,
|
||||
segment_reader: &SegmentReader,
|
||||
) -> crate::Result<Self> {
|
||||
let max_doc = segment_reader.max_doc();
|
||||
|
||||
|
||||
@@ -66,7 +66,7 @@ impl Collector for DistributedAggregationCollector {
|
||||
fn for_segment(
|
||||
&self,
|
||||
segment_local_id: crate::SegmentOrdinal,
|
||||
reader: &dyn SegmentReader,
|
||||
reader: &crate::SegmentReader,
|
||||
) -> crate::Result<Self::Child> {
|
||||
AggregationSegmentCollector::from_agg_req_and_reader(
|
||||
&self.agg,
|
||||
@@ -96,7 +96,7 @@ impl Collector for AggregationCollector {
|
||||
fn for_segment(
|
||||
&self,
|
||||
segment_local_id: crate::SegmentOrdinal,
|
||||
reader: &dyn SegmentReader,
|
||||
reader: &crate::SegmentReader,
|
||||
) -> crate::Result<Self::Child> {
|
||||
AggregationSegmentCollector::from_agg_req_and_reader(
|
||||
&self.agg,
|
||||
@@ -145,7 +145,7 @@ impl AggregationSegmentCollector {
|
||||
/// reader. Also includes validation, e.g. checking field types and existence.
|
||||
pub fn from_agg_req_and_reader(
|
||||
agg: &Aggregations,
|
||||
reader: &dyn SegmentReader,
|
||||
reader: &SegmentReader,
|
||||
segment_ordinal: SegmentOrdinal,
|
||||
context: &AggContextParams,
|
||||
) -> crate::Result<Self> {
|
||||
|
||||
@@ -90,19 +90,6 @@ impl From<IntermediateKey> for Key {
|
||||
|
||||
impl Eq for IntermediateKey {}
|
||||
|
||||
impl std::fmt::Display for IntermediateKey {
|
||||
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
|
||||
match self {
|
||||
IntermediateKey::Str(val) => f.write_str(val),
|
||||
IntermediateKey::F64(val) => f.write_str(&val.to_string()),
|
||||
IntermediateKey::U64(val) => f.write_str(&val.to_string()),
|
||||
IntermediateKey::I64(val) => f.write_str(&val.to_string()),
|
||||
IntermediateKey::Bool(val) => f.write_str(&val.to_string()),
|
||||
IntermediateKey::IpAddr(val) => f.write_str(&val.to_string()),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl std::hash::Hash for IntermediateKey {
|
||||
fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
|
||||
core::mem::discriminant(self).hash(state);
|
||||
@@ -118,21 +105,6 @@ impl std::hash::Hash for IntermediateKey {
|
||||
}
|
||||
|
||||
impl IntermediateAggregationResults {
|
||||
/// Returns a reference to the intermediate aggregation result for the given key.
|
||||
pub fn get(&self, key: &str) -> Option<&IntermediateAggregationResult> {
|
||||
self.aggs_res.get(key)
|
||||
}
|
||||
|
||||
/// Removes and returns the intermediate aggregation result for the given key.
|
||||
pub fn remove(&mut self, key: &str) -> Option<IntermediateAggregationResult> {
|
||||
self.aggs_res.remove(key)
|
||||
}
|
||||
|
||||
/// Returns an iterator over the keys in the intermediate aggregation results.
|
||||
pub fn keys(&self) -> impl Iterator<Item = &String> {
|
||||
self.aggs_res.keys()
|
||||
}
|
||||
|
||||
/// Add a result
|
||||
pub fn push(&mut self, key: String, value: IntermediateAggregationResult) -> crate::Result<()> {
|
||||
let entry = self.aggs_res.entry(key);
|
||||
@@ -667,21 +639,6 @@ pub struct IntermediateTermBucketResult {
|
||||
}
|
||||
|
||||
impl IntermediateTermBucketResult {
|
||||
/// Returns a reference to the map of bucket entries keyed by [`IntermediateKey`].
|
||||
pub fn entries(&self) -> &FxHashMap<IntermediateKey, IntermediateTermBucketEntry> {
|
||||
&self.entries
|
||||
}
|
||||
|
||||
/// Returns the count of documents not included in the returned buckets.
|
||||
pub fn sum_other_doc_count(&self) -> u64 {
|
||||
self.sum_other_doc_count
|
||||
}
|
||||
|
||||
/// Returns the upper bound of the error on document counts in the returned buckets.
|
||||
pub fn doc_count_error_upper_bound(&self) -> u64 {
|
||||
self.doc_count_error_upper_bound
|
||||
}
|
||||
|
||||
pub(crate) fn into_final_result(
|
||||
self,
|
||||
req: &TermsAggregation,
|
||||
@@ -863,7 +820,7 @@ impl IntermediateRangeBucketEntry {
|
||||
};
|
||||
|
||||
// If we have a date type on the histogram buckets, we add the `key_as_string` field as
|
||||
// rfc3339
|
||||
// rfc339
|
||||
if column_type == Some(ColumnType::DateTime) {
|
||||
if let Some(val) = range_bucket_entry.to {
|
||||
let key_as_string = format_date(val as i64)?;
|
||||
|
||||
@@ -55,12 +55,6 @@ impl IntermediateAverage {
|
||||
pub(crate) fn from_stats(stats: IntermediateStats) -> Self {
|
||||
Self { stats }
|
||||
}
|
||||
|
||||
/// Returns a reference to the underlying [`IntermediateStats`].
|
||||
pub fn stats(&self) -> &IntermediateStats {
|
||||
&self.stats
|
||||
}
|
||||
|
||||
/// Merges the other intermediate result into self.
|
||||
pub fn merge_fruits(&mut self, other: IntermediateAverage) {
|
||||
self.stats.merge_fruits(other.stats);
|
||||
|
||||
@@ -107,11 +107,8 @@ pub enum PercentileValues {
|
||||
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
|
||||
/// The entry when requesting percentiles with keyed: false
|
||||
pub struct PercentileValuesVecEntry {
|
||||
/// Percentile
|
||||
pub key: f64,
|
||||
|
||||
/// Value at the percentile
|
||||
pub value: f64,
|
||||
key: f64,
|
||||
value: f64,
|
||||
}
|
||||
|
||||
/// Single-metric aggregations use this common result structure.
|
||||
|
||||
@@ -110,16 +110,6 @@ impl Default for IntermediateStats {
|
||||
}
|
||||
|
||||
impl IntermediateStats {
|
||||
/// Returns the number of values collected.
|
||||
pub fn count(&self) -> u64 {
|
||||
self.count
|
||||
}
|
||||
|
||||
/// Returns the sum of all values collected.
|
||||
pub fn sum(&self) -> f64 {
|
||||
self.sum
|
||||
}
|
||||
|
||||
/// Merges the other stats intermediate result into self.
|
||||
pub fn merge_fruits(&mut self, other: IntermediateStats) {
|
||||
self.count += other.count;
|
||||
|
||||
197
src/codec/mod.rs
@@ -4,18 +4,16 @@ pub mod postings;
/// Standard tantivy codec. This is the codec you use by default.
pub mod standard;

use std::sync::Arc;
use std::io;

pub use standard::StandardCodec;

use crate::codec::postings::PostingsCodec;
use crate::directory::Directory;
use crate::fastfield::AliveBitSet;
use crate::query::score_combiner::DoNothingCombiner;
use crate::query::term_query::TermScorer;
use crate::query::{box_scorer, BufferedUnionScorer, Scorer, SumCombiner};
use crate::schema::Schema;
use crate::{DocId, Score, SegmentMeta, SegmentReader, TantivySegmentReader};
use crate::fieldnorm::FieldNormReader;
use crate::postings::{Postings, TermInfo};
use crate::query::{box_scorer, Bm25Weight, Scorer};
use crate::schema::IndexRecordOption;
use crate::{DocId, InvertedIndexReader, Score};

/// Codecs describes how data is layed out on disk.
///
@@ -24,9 +22,8 @@ pub trait Codec: Clone + std::fmt::Debug + Send + Sync + 'static {
/// The specific postings type used by this codec.
type PostingsCodec: PostingsCodec;

/// ID of the codec. It should be unique to your codec.
/// Make it human-readable, descriptive, short and unique.
const ID: &'static str;
/// Name of the codec. It should be unique to your codec.
const NAME: &'static str;

/// Load codec based on the codec configuration.
fn from_json_props(json_value: &serde_json::Value) -> crate::Result<Self>;
@@ -36,46 +33,58 @@ pub trait Codec: Clone + std::fmt::Debug + Send + Sync + 'static {
|
||||
|
||||
/// Returns the postings codec.
|
||||
fn postings_codec(&self) -> &Self::PostingsCodec;
|
||||
|
||||
/// Loads postings using the codec's concrete postings type.
|
||||
fn load_postings_typed(
|
||||
&self,
|
||||
reader: &dyn crate::index::InvertedIndexReader,
|
||||
term_info: &crate::postings::TermInfo,
|
||||
option: crate::schema::IndexRecordOption,
|
||||
) -> std::io::Result<<Self::PostingsCodec as crate::codec::postings::PostingsCodec>::Postings>
|
||||
{
|
||||
let postings_data = reader.read_raw_postings_data(term_info, option)?;
|
||||
self.postings_codec()
|
||||
.load_postings(term_info.doc_freq, postings_data)
|
||||
}
|
||||
|
||||
/// Opens a segment reader using this codec.
|
||||
///
|
||||
/// Override this if your codec uses a custom segment reader implementation.
|
||||
fn open_segment_reader(
|
||||
&self,
|
||||
directory: &dyn Directory,
|
||||
segment_meta: &SegmentMeta,
|
||||
schema: Schema,
|
||||
custom_bitset: Option<AliveBitSet>,
|
||||
) -> crate::Result<Arc<dyn SegmentReader>> {
|
||||
let codec: Arc<dyn ObjectSafeCodec> = Arc::new(self.clone());
|
||||
let reader = TantivySegmentReader::open_with_custom_alive_set_from_directory(
|
||||
directory,
|
||||
segment_meta,
|
||||
schema,
|
||||
codec,
|
||||
custom_bitset,
|
||||
)?;
|
||||
Ok(Arc::new(reader))
|
||||
}
|
||||
}
|
||||
|
||||
/// Object-safe codec is a Codec that can be used in a trait object.
|
||||
///
|
||||
/// The point of it is to offer a way to use a codec without a proliferation of generics.
|
||||
pub trait ObjectSafeCodec: 'static + Send + Sync {
|
||||
/// Loads a type-erased Postings object for the given term.
|
||||
///
|
||||
/// If the schema used to build the index did not provide enough
|
||||
/// information to match the requested `option`, a Postings is still
|
||||
/// returned in a best-effort manner.
|
||||
fn load_postings_type_erased(
|
||||
&self,
|
||||
term_info: &TermInfo,
|
||||
option: IndexRecordOption,
|
||||
inverted_index_reader: &InvertedIndexReader,
|
||||
) -> io::Result<Box<dyn Postings>>;
|
||||
|
||||
/// Loads a type-erased TermScorer object for the given term.
|
||||
///
|
||||
/// If the schema used to build the index did not provide enough
|
||||
/// information to match the requested `option`, a TermScorer is still
|
||||
/// returned in a best-effort manner.
|
||||
///
|
||||
/// The point of this contraption is that the return TermScorer is backed,
|
||||
/// not by Box<dyn Postings> but by the codec's concrete Postings type.
|
||||
fn load_term_scorer_type_erased(
|
||||
&self,
|
||||
term_info: &TermInfo,
|
||||
option: IndexRecordOption,
|
||||
inverted_index_reader: &InvertedIndexReader,
|
||||
fieldnorm_reader: FieldNormReader,
|
||||
similarity_weight: Bm25Weight,
|
||||
) -> io::Result<Box<dyn Scorer>>;
|
||||
|
||||
/// Loads a type-erased PhraseScorer object for the given term.
|
||||
///
|
||||
/// If the schema used to build the index did not provide enough
|
||||
/// information to match the requested `option`, a TermScorer is still
|
||||
/// returned in a best-effort manner.
|
||||
///
|
||||
/// The point of this contraption is that the return PhraseScorer is backed,
|
||||
/// not by Box<dyn Postings> but by the codec's concrete Postings type.
|
||||
fn new_phrase_scorer_type_erased(
|
||||
&self,
|
||||
term_infos: &[(usize, TermInfo)],
|
||||
similarity_weight: Option<Bm25Weight>,
|
||||
fieldnorm_reader: FieldNormReader,
|
||||
slop: u32,
|
||||
inverted_index_reader: &InvertedIndexReader,
|
||||
) -> io::Result<Box<dyn Scorer>>;
|
||||
|
||||
/// Performs a for_each_pruning operation on the given scorer.
|
||||
///
|
||||
/// The function will go through matching documents and call the callback
|
||||
@@ -92,55 +101,54 @@ pub trait ObjectSafeCodec: 'static + Send + Sync {
|
||||
scorer: Box<dyn Scorer>,
|
||||
callback: &mut dyn FnMut(DocId, Score) -> Score,
|
||||
);
|
||||
|
||||
/// Builds a union scorer possibly specialized if
|
||||
/// all scorers are `Term<Self::Postings>`.
|
||||
fn build_union_scorer_with_sum_combiner(
|
||||
&self,
|
||||
scorers: Vec<Box<dyn Scorer>>,
|
||||
num_docs: DocId,
|
||||
score_combiner_type: SumOrDoNothingCombiner,
|
||||
) -> Box<dyn Scorer>;
|
||||
}
|
||||
|
||||
impl<TCodec: Codec> ObjectSafeCodec for TCodec {
|
||||
fn build_union_scorer_with_sum_combiner(
|
||||
fn load_postings_type_erased(
|
||||
&self,
|
||||
scorers: Vec<Box<dyn Scorer>>,
|
||||
num_docs: DocId,
|
||||
sum_or_do_nothing_combiner: SumOrDoNothingCombiner,
|
||||
) -> Box<dyn Scorer> {
|
||||
if !scorers.iter().all(|scorer| {
|
||||
scorer.is::<TermScorer<<<Self as Codec>::PostingsCodec as PostingsCodec>::Postings>>()
|
||||
}) {
|
||||
return box_scorer(BufferedUnionScorer::build(
|
||||
scorers,
|
||||
SumCombiner::default,
|
||||
num_docs,
|
||||
));
|
||||
}
|
||||
let specialized_scorers: Vec<
|
||||
TermScorer<<<Self as Codec>::PostingsCodec as PostingsCodec>::Postings>,
|
||||
> = scorers
|
||||
.into_iter()
|
||||
.map(|scorer| {
|
||||
*scorer.downcast::<TermScorer<_>>().ok().expect(
|
||||
"Downcast failed despite the fact we already checked the type was correct",
|
||||
)
|
||||
})
|
||||
.collect();
|
||||
match sum_or_do_nothing_combiner {
|
||||
SumOrDoNothingCombiner::Sum => box_scorer(BufferedUnionScorer::build(
|
||||
specialized_scorers,
|
||||
SumCombiner::default,
|
||||
num_docs,
|
||||
)),
|
||||
SumOrDoNothingCombiner::DoNothing => box_scorer(BufferedUnionScorer::build(
|
||||
specialized_scorers,
|
||||
DoNothingCombiner::default,
|
||||
num_docs,
|
||||
)),
|
||||
}
|
||||
term_info: &TermInfo,
|
||||
option: IndexRecordOption,
|
||||
inverted_index_reader: &InvertedIndexReader,
|
||||
) -> io::Result<Box<dyn Postings>> {
|
||||
let postings = inverted_index_reader
|
||||
.read_postings_from_terminfo_specialized(term_info, option, self)?;
|
||||
Ok(Box::new(postings))
|
||||
}
|
||||
|
||||
fn load_term_scorer_type_erased(
|
||||
&self,
|
||||
term_info: &TermInfo,
|
||||
option: IndexRecordOption,
|
||||
inverted_index_reader: &InvertedIndexReader,
|
||||
fieldnorm_reader: FieldNormReader,
|
||||
similarity_weight: Bm25Weight,
|
||||
) -> io::Result<Box<dyn Scorer>> {
|
||||
let scorer = inverted_index_reader.new_term_scorer_specialized(
|
||||
term_info,
|
||||
option,
|
||||
fieldnorm_reader,
|
||||
similarity_weight,
|
||||
self,
|
||||
)?;
|
||||
Ok(box_scorer(scorer))
|
||||
}
|
||||
|
||||
fn new_phrase_scorer_type_erased(
|
||||
&self,
|
||||
term_infos: &[(usize, TermInfo)],
|
||||
similarity_weight: Option<Bm25Weight>,
|
||||
fieldnorm_reader: FieldNormReader,
|
||||
slop: u32,
|
||||
inverted_index_reader: &InvertedIndexReader,
|
||||
) -> io::Result<Box<dyn Scorer>> {
|
||||
let scorer = inverted_index_reader.new_phrase_scorer_type_specialized(
|
||||
term_infos,
|
||||
similarity_weight,
|
||||
fieldnorm_reader,
|
||||
slop,
|
||||
self,
|
||||
)?;
|
||||
Ok(box_scorer(scorer))
|
||||
}
|
||||
|
||||
fn for_each_pruning(
|
||||
@@ -159,12 +167,3 @@ impl<TCodec: Codec> ObjectSafeCodec for TCodec {
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// SumCombiner or DoNothingCombiner
#[derive(Copy, Clone)]
pub enum SumOrDoNothingCombiner {
    /// Sum scores together
    Sum,
    /// Do not track any score.
    DoNothing,
}

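The specialization in `build_union_scorer_with_sum_combiner` above hinges on a runtime type check followed by a downcast: if every boxed scorer is the codec's concrete `TermScorer`, the union is rebuilt over the concrete type, otherwise the generic boxed path is kept. The standalone sketch below illustrates that check-then-downcast pattern with `std::any::Any`; the trait, the `as_any`/`into_any` helpers and the scoring logic are simplified stand-ins, not tantivy's actual types.

```rust
use std::any::Any;

trait Scorer {
    fn score(&self) -> f32;
    // Helpers that make runtime type checks possible (illustrative, not tantivy's API).
    fn as_any(&self) -> &dyn Any;
    fn into_any(self: Box<Self>) -> Box<dyn Any>;
}

struct TermScorer {
    value: f32,
}

impl Scorer for TermScorer {
    fn score(&self) -> f32 {
        self.value
    }
    fn as_any(&self) -> &dyn Any {
        self
    }
    fn into_any(self: Box<Self>) -> Box<dyn Any> {
        self
    }
}

/// Sums the scores, taking the "specialized" path only if every scorer is a `TermScorer`.
fn sum_scores(scorers: Vec<Box<dyn Scorer>>) -> f32 {
    if scorers.iter().all(|scorer| scorer.as_any().is::<TermScorer>()) {
        // Fast path: the check above guarantees every downcast succeeds.
        return scorers
            .into_iter()
            .map(|scorer| {
                scorer
                    .into_any()
                    .downcast::<TermScorer>()
                    .expect("type was checked above")
                    .value
            })
            .sum();
    }
    // Generic fallback: keep working through the trait object.
    scorers.iter().map(|scorer| scorer.score()).sum()
}

fn main() {
    let scorers: Vec<Box<dyn Scorer>> = vec![
        Box::new(TermScorer { value: 1.5 }),
        Box::new(TermScorer { value: 2.0 }),
    ];
    assert_eq!(sum_scores(scorers), 3.5);
}
```
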
@@ -1,7 +1,7 @@
|
||||
/// Block-max WAND algorithm.
|
||||
pub mod block_wand;
|
||||
use std::io;
|
||||
|
||||
/// Block-max WAND algorithm.
|
||||
pub mod block_wand;
|
||||
use common::OwnedBytes;
|
||||
|
||||
use crate::fieldnorm::FieldNormReader;
|
||||
@@ -10,16 +10,38 @@ use crate::query::{Bm25Weight, Scorer};
|
||||
use crate::schema::IndexRecordOption;
|
||||
use crate::{DocId, Score};
|
||||
|
||||
/// Postings codec (read path).
|
||||
/// Postings codec.
|
||||
pub trait PostingsCodec: Send + Sync + 'static {
|
||||
/// Serializer type for the postings codec.
|
||||
type PostingsSerializer: PostingsSerializer;
|
||||
/// Postings type for the postings codec.
|
||||
type Postings: Postings + Clone;
|
||||
/// Creates a new postings serializer.
|
||||
fn new_serializer(
|
||||
&self,
|
||||
avg_fieldnorm: Score,
|
||||
mode: IndexRecordOption,
|
||||
fieldnorm_reader: Option<FieldNormReader>,
|
||||
) -> Self::PostingsSerializer;
|
||||
|
||||
/// Load postings from raw bytes and metadata.
|
||||
/// Loads postings
|
||||
///
|
||||
/// Record option is the option that was passed at indexing time.
|
||||
/// Requested option is the option that is requested.
|
||||
///
|
||||
/// For instance, we may have term_freq in the posting list
|
||||
/// but we can skip decompressing as we read the posting list.
|
||||
///
|
||||
/// If record option does not support the requested option,
|
||||
/// this method does NOT return an error and will in fact restrict
|
||||
/// requested_option to what is available.
|
||||
fn load_postings(
|
||||
&self,
|
||||
doc_freq: u32,
|
||||
postings_data: RawPostingsData,
|
||||
postings_data: OwnedBytes,
|
||||
record_option: IndexRecordOption,
|
||||
requested_option: IndexRecordOption,
|
||||
positions_data: Option<OwnedBytes>,
|
||||
) -> io::Result<Self::Postings>;
|
||||
|
||||
/// If your codec supports different ways to accelerate `for_each_pruning` that's
|
||||
@@ -41,17 +63,43 @@ pub trait PostingsCodec: Send + Sync + 'static {
|
||||
}
|
||||
}
|
||||
|
||||
/// Raw postings bytes and metadata read from storage.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct RawPostingsData {
|
||||
/// Raw postings bytes for the term.
|
||||
pub postings_data: OwnedBytes,
|
||||
/// Raw positions bytes for the term, if positions are available.
|
||||
pub positions_data: Option<OwnedBytes>,
|
||||
/// Record option of the indexed field.
|
||||
pub record_option: IndexRecordOption,
|
||||
/// Effective record option after downgrading to the indexed field capability.
|
||||
pub effective_option: IndexRecordOption,
|
||||
/// A postings serializer is a listener that is in charge of serializing postings
|
||||
///
|
||||
/// IO is done only once per postings, once all of the data has been received.
|
||||
/// A serializer will therefore contain internal buffers.
|
||||
///
|
||||
/// A serializer is created once and recycled for all postings.
|
||||
///
|
||||
/// Clients should use PostingsSerializer as follows.
|
||||
/// ```
|
||||
/// // First postings list
|
||||
/// serializer.new_term(2, true);
|
||||
/// serializer.write_doc(2, 1);
|
||||
/// serializer.write_doc(6, 2);
|
||||
/// serializer.close_term(3);
|
||||
/// serializer.clear();
|
||||
/// // Second postings list
|
||||
/// serializer.new_term(1, true);
|
||||
/// serializer.write_doc(3, 1);
|
||||
/// serializer.close_term(3);
|
||||
/// ```
|
||||
pub trait PostingsSerializer {
    /// The term_doc_freq here is the number of documents
    /// in the postings lists.
    ///
    /// It can be used to compute the idf that will be used for the
    /// blockmax parameters.
    ///
    /// If not available (e.g. if we do not collect `term_frequencies`
    /// blockwand is disabled), the term_doc_freq passed will be set 0.
    fn new_term(&mut self, term_doc_freq: u32, record_term_freq: bool);

    /// Records a new document id for the current term.
    /// The serializer may ignore it.
    fn write_doc(&mut self, doc_id: DocId, term_freq: u32);

    /// Closes the current term and writes the postings list associated.
    fn close_term(&mut self, doc_freq: u32, wrt: &mut impl io::Write) -> io::Result<()>;
}
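To make the `new_term` / `write_doc` / `close_term` lifecycle above concrete, here is a minimal toy listener that follows the same protocol: documents are buffered per term and IO happens only once, in `close_term`. It is a self-contained sketch, not tantivy's `StandardPostingsSerializer`, and the plain-text output format is purely illustrative.

```rust
use std::io::{self, Write};

type DocId = u32;

/// Toy serializer following the new_term / write_doc / close_term protocol.
struct PlainTextPostingsSerializer {
    // Internal buffer: IO is deferred until `close_term`.
    buffer: Vec<(DocId, u32)>,
    record_term_freq: bool,
}

impl PlainTextPostingsSerializer {
    fn new() -> Self {
        Self {
            buffer: Vec::new(),
            record_term_freq: false,
        }
    }

    fn new_term(&mut self, _term_doc_freq: u32, record_term_freq: bool) {
        self.buffer.clear();
        self.record_term_freq = record_term_freq;
    }

    fn write_doc(&mut self, doc_id: DocId, term_freq: u32) {
        self.buffer.push((doc_id, term_freq));
    }

    fn close_term(&mut self, _doc_freq: u32, wrt: &mut impl Write) -> io::Result<()> {
        for &(doc_id, term_freq) in &self.buffer {
            if self.record_term_freq {
                writeln!(wrt, "{doc_id}:{term_freq}")?;
            } else {
                writeln!(wrt, "{doc_id}")?;
            }
        }
        self.buffer.clear();
        Ok(())
    }
}

fn main() -> io::Result<()> {
    let mut serializer = PlainTextPostingsSerializer::new();
    let mut out: Vec<u8> = Vec::new();
    // First postings list, with term frequencies.
    serializer.new_term(2, true);
    serializer.write_doc(2, 1);
    serializer.write_doc(6, 2);
    serializer.close_term(2, &mut out)?;
    assert_eq!(out, b"2:1\n6:2\n".to_vec());
    Ok(())
}
```
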
|
||||
/// A light complement interface to Postings to allow block-max wand acceleration.
|
||||
@@ -62,12 +110,8 @@ pub trait PostingsWithBlockMax: Postings {
|
||||
/// `Warning`: Calling this method may leave the postings in an invalid state.
|
||||
/// callers are required to call seek before calling any other of the
|
||||
/// `Postings` method (like doc / advance etc.).
|
||||
fn seek_block_max(
|
||||
&mut self,
|
||||
target_doc: crate::DocId,
|
||||
fieldnorm_reader: &FieldNormReader,
|
||||
similarity_weight: &Bm25Weight,
|
||||
) -> Score;
|
||||
fn seek_block_max(&mut self, target_doc: crate::DocId, similarity_weight: &Bm25Weight)
|
||||
-> Score;
|
||||
|
||||
/// Returns the last document in the current block (or Terminated if this
|
||||
/// is the last block).
|
||||
|
||||
@@ -13,7 +13,7 @@ pub struct StandardCodec;
|
||||
impl Codec for StandardCodec {
|
||||
type PostingsCodec = StandardPostingsCodec;
|
||||
|
||||
const ID: &'static str = "tantivy-default";
|
||||
const NAME: &'static str = "standard";
|
||||
|
||||
fn from_json_props(json_value: &serde_json::Value) -> crate::Result<Self> {
|
||||
if !json_value.is_null() {
|
||||
|
||||
50
src/codec/standard/postings/block.rs
Normal file
@@ -0,0 +1,50 @@
use crate::postings::compression::COMPRESSION_BLOCK_SIZE;
use crate::DocId;

pub struct Block {
    doc_ids: [DocId; COMPRESSION_BLOCK_SIZE],
    term_freqs: [u32; COMPRESSION_BLOCK_SIZE],
    len: usize,
}

impl Block {
    pub fn new() -> Self {
        Block {
            doc_ids: [0u32; COMPRESSION_BLOCK_SIZE],
            term_freqs: [0u32; COMPRESSION_BLOCK_SIZE],
            len: 0,
        }
    }

    pub fn doc_ids(&self) -> &[DocId] {
        &self.doc_ids[..self.len]
    }

    pub fn term_freqs(&self) -> &[u32] {
        &self.term_freqs[..self.len]
    }

    pub fn clear(&mut self) {
        self.len = 0;
    }

    pub fn append_doc(&mut self, doc: DocId, term_freq: u32) {
        let len = self.len;
        self.doc_ids[len] = doc;
        self.term_freqs[len] = term_freq;
        self.len = len + 1;
    }

    pub fn is_full(&self) -> bool {
        self.len == COMPRESSION_BLOCK_SIZE
    }

    pub fn is_empty(&self) -> bool {
        self.len == 0
    }

    pub fn last_doc(&self) -> DocId {
        assert_eq!(self.len, COMPRESSION_BLOCK_SIZE);
        self.doc_ids[COMPRESSION_BLOCK_SIZE - 1]
    }
}
@@ -2,10 +2,9 @@ use std::io;
|
||||
|
||||
use common::{OwnedBytes, VInt};
|
||||
|
||||
use crate::codec::standard::postings::skip::{BlockInfo, SkipReader};
|
||||
use crate::codec::standard::postings::FreqReadingOption;
|
||||
use crate::fieldnorm::FieldNormReader;
|
||||
use crate::postings::compression::{BlockDecoder, VIntDecoder as _, COMPRESSION_BLOCK_SIZE};
|
||||
use crate::postings::skip::{BlockInfo, SkipReader};
|
||||
use crate::query::Bm25Weight;
|
||||
use crate::schema::IndexRecordOption;
|
||||
use crate::{DocId, Score, TERMINATED};
|
||||
@@ -130,10 +129,6 @@ impl BlockSegmentPostings {
|
||||
}
|
||||
}
|
||||
|
||||
fn max_score<I: Iterator<Item = Score>>(mut it: I) -> Option<Score> {
|
||||
it.next().map(|first| it.fold(first, Score::max))
|
||||
}
|
||||
|
||||
impl BlockSegmentPostings {
|
||||
/// Returns the overall number of documents in the block postings.
|
||||
/// It does not take in account whether documents are deleted or not.
|
||||
@@ -214,11 +209,7 @@ impl BlockSegmentPostings {
|
||||
/// after having called `.shallow_advance(..)`.
|
||||
///
|
||||
/// See `TermScorer::block_max_score(..)` for more information.
|
||||
pub fn block_max_score(
|
||||
&mut self,
|
||||
fieldnorm_reader: &FieldNormReader,
|
||||
bm25_weight: &Bm25Weight,
|
||||
) -> Score {
|
||||
pub fn block_max_score(&mut self, bm25_weight: &Bm25Weight) -> Score {
|
||||
if let Some(score) = self.block_max_score_cache {
|
||||
return score;
|
||||
}
|
||||
@@ -228,21 +219,9 @@ impl BlockSegmentPostings {
|
||||
self.block_max_score_cache = Some(skip_reader_max_score);
|
||||
return skip_reader_max_score;
|
||||
}
|
||||
// this is the last block of the segment posting list.
|
||||
// If it is actually loaded, we can compute block max manually.
|
||||
if self.block_loaded {
|
||||
let docs = self.doc_decoder.output_array().iter().cloned();
|
||||
let freqs = self.freq_decoder.output_array().iter().cloned();
|
||||
let bm25_scores = docs.zip(freqs).map(|(doc, term_freq)| {
|
||||
let fieldnorm_id = fieldnorm_reader.fieldnorm_id(doc);
|
||||
bm25_weight.score(fieldnorm_id, term_freq)
|
||||
});
|
||||
let block_max_score = max_score(bm25_scores).unwrap_or(0.0);
|
||||
self.block_max_score_cache = Some(block_max_score);
|
||||
return block_max_score;
|
||||
}
|
||||
// We do not have access to any good block max value. We return bm25_weight.max_score()
|
||||
// as it is a valid upperbound.
|
||||
// We do not have access to any good block max value.
|
||||
// It happens if this is the last block.
|
||||
// We return bm25_weight.max_score() as it is a valid upperbound.
|
||||
//
|
||||
// We do not cache it however, so that it gets computed when once block is loaded.
|
||||
bm25_weight.max_score()
|
||||
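The caching strategy described in this hunk can be summarized as: serve a cached block max if there is one, compute and cache an exact maximum when the block data is at hand, and otherwise fall back to a global upper bound without caching it. Below is a minimal standalone sketch of that logic, with the BM25 machinery replaced by plain scores; it is illustrative only, not tantivy's `BlockSegmentPostings`.

```rust
/// Sketch of the block-max caching strategy: prefer an exact per-block maximum when the
/// block is loaded, otherwise fall back to a global upper bound, and only cache exact values.
struct BlockMax {
    cache: Option<f32>,
    global_upper_bound: f32,
}

impl BlockMax {
    fn block_max_score(&mut self, loaded_block_scores: Option<&[f32]>) -> f32 {
        if let Some(score) = self.cache {
            return score;
        }
        if let Some(scores) = loaded_block_scores {
            let block_max = scores.iter().copied().fold(0.0f32, f32::max);
            self.cache = Some(block_max);
            return block_max;
        }
        // No exact information available: the global maximum is still a valid upper bound,
        // but it is deliberately not cached, so an exact value can replace it later.
        self.global_upper_bound
    }
}

fn main() {
    let mut bm = BlockMax { cache: None, global_upper_bound: 10.0 };
    assert_eq!(bm.block_max_score(None), 10.0); // fallback, not cached
    assert_eq!(bm.block_max_score(Some(&[1.0, 3.5, 2.0])), 3.5); // exact, cached
    assert_eq!(bm.block_max_score(None), 3.5); // served from the cache
}
```
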
@@ -337,17 +316,18 @@ mod tests {
|
||||
use common::OwnedBytes;
|
||||
|
||||
use super::BlockSegmentPostings;
|
||||
use crate::codec::postings::PostingsSerializer;
|
||||
use crate::codec::standard::postings::segment_postings::SegmentPostings;
|
||||
use crate::codec::standard::postings::StandardPostingsSerializer;
|
||||
use crate::docset::{DocSet, TERMINATED};
|
||||
use crate::postings::compression::COMPRESSION_BLOCK_SIZE;
|
||||
use crate::postings::serializer::PostingsSerializer;
|
||||
use crate::schema::IndexRecordOption;
|
||||
|
||||
#[cfg(test)]
|
||||
fn build_block_postings(docs: &[u32]) -> BlockSegmentPostings {
|
||||
let doc_freq = docs.len() as u32;
|
||||
let mut postings_serializer =
|
||||
PostingsSerializer::new(1.0f32, IndexRecordOption::Basic, None);
|
||||
StandardPostingsSerializer::new(1.0f32, IndexRecordOption::Basic, None);
|
||||
postings_serializer.new_term(docs.len() as u32, false);
|
||||
for doc in docs {
|
||||
postings_serializer.write_doc(*doc, 1u32);
|
||||
|
||||
@@ -1,20 +1,24 @@
|
||||
use std::io;
|
||||
|
||||
use common::BitSet;
|
||||
|
||||
use crate::codec::postings::block_wand::{block_wand, block_wand_single_scorer};
|
||||
use crate::codec::postings::{PostingsCodec, RawPostingsData};
|
||||
use crate::codec::postings::PostingsCodec;
|
||||
use crate::codec::standard::postings::block_segment_postings::BlockSegmentPostings;
|
||||
pub use crate::codec::standard::postings::segment_postings::SegmentPostings;
|
||||
use crate::fieldnorm::FieldNormReader;
|
||||
use crate::positions::PositionReader;
|
||||
use crate::query::term_query::TermScorer;
|
||||
use crate::query::{BufferedUnionScorer, Scorer, SumCombiner};
|
||||
use crate::schema::IndexRecordOption;
|
||||
use crate::{DocSet as _, Score, TERMINATED};
|
||||
|
||||
mod block;
|
||||
mod block_segment_postings;
|
||||
mod segment_postings;
|
||||
mod skip;
|
||||
mod standard_postings_serializer;
|
||||
|
||||
pub use segment_postings::SegmentPostings as StandardPostings;
|
||||
pub use standard_postings_serializer::StandardPostingsSerializer;
|
||||
|
||||
/// The default postings codec for tantivy.
|
||||
pub struct StandardPostingsCodec;
|
||||
@@ -28,14 +32,35 @@ pub(crate) enum FreqReadingOption {
|
||||
}
|
||||
|
||||
impl PostingsCodec for StandardPostingsCodec {
|
||||
type PostingsSerializer = StandardPostingsSerializer;
|
||||
type Postings = SegmentPostings;
|
||||
|
||||
fn new_serializer(
|
||||
&self,
|
||||
avg_fieldnorm: Score,
|
||||
mode: IndexRecordOption,
|
||||
fieldnorm_reader: Option<FieldNormReader>,
|
||||
) -> Self::PostingsSerializer {
|
||||
StandardPostingsSerializer::new(avg_fieldnorm, mode, fieldnorm_reader)
|
||||
}
|
||||
|
||||
fn load_postings(
|
||||
&self,
|
||||
doc_freq: u32,
|
||||
postings_data: RawPostingsData,
|
||||
postings_data: common::OwnedBytes,
|
||||
record_option: IndexRecordOption,
|
||||
requested_option: IndexRecordOption,
|
||||
positions_data_opt: Option<common::OwnedBytes>,
|
||||
) -> io::Result<Self::Postings> {
|
||||
load_postings_from_raw_data(doc_freq, postings_data)
|
||||
// Rationalize record_option/requested_option.
|
||||
let requested_option = requested_option.downgrade(record_option);
|
||||
let block_segment_postings =
|
||||
BlockSegmentPostings::open(doc_freq, postings_data, record_option, requested_option)?;
|
||||
let position_reader = positions_data_opt.map(PositionReader::open).transpose()?;
|
||||
Ok(SegmentPostings::from_block_postings(
|
||||
block_segment_postings,
|
||||
position_reader,
|
||||
))
|
||||
}
|
||||
|
||||
fn try_accelerated_for_each_pruning(
|
||||
@@ -51,7 +76,14 @@ impl PostingsCodec for StandardPostingsCodec {
|
||||
Err(scorer) => scorer,
|
||||
};
|
||||
let mut union_scorer =
|
||||
scorer.downcast::<BufferedUnionScorer<TermScorer<Self::Postings>, SumCombiner>>()?;
|
||||
scorer.downcast::<BufferedUnionScorer<Box<dyn Scorer>, SumCombiner>>()?;
|
||||
if !union_scorer
|
||||
.scorers()
|
||||
.iter()
|
||||
.all(|scorer| scorer.is::<TermScorer<Self::Postings>>())
|
||||
{
|
||||
return Err(union_scorer);
|
||||
}
|
||||
let doc = union_scorer.doc();
|
||||
if doc == TERMINATED {
|
||||
return Ok(());
|
||||
@@ -60,112 +92,16 @@ impl PostingsCodec for StandardPostingsCodec {
|
||||
if score > threshold {
|
||||
threshold = callback(doc, score);
|
||||
}
|
||||
let scorers: Vec<TermScorer<Self::Postings>> = union_scorer.into_scorers();
|
||||
let boxed_scorers: Vec<Box<dyn Scorer>> = union_scorer.into_scorers();
|
||||
let scorers: Vec<TermScorer<Self::Postings>> = boxed_scorers
|
||||
.into_iter()
|
||||
.map(|scorer| {
|
||||
*scorer.downcast::<TermScorer<Self::Postings>>().ok().expect(
|
||||
"Downcast failed despite the fact we already checked the type was correct",
|
||||
)
|
||||
})
|
||||
.collect();
|
||||
block_wand(scorers, threshold, callback);
|
||||
Ok(())
|
||||
}
|
||||
}
|
||||
pub(crate) fn load_postings_from_raw_data(
|
||||
doc_freq: u32,
|
||||
postings_data: RawPostingsData,
|
||||
) -> io::Result<SegmentPostings> {
|
||||
let RawPostingsData {
|
||||
postings_data,
|
||||
positions_data: positions_data_opt,
|
||||
record_option,
|
||||
effective_option,
|
||||
} = postings_data;
|
||||
let requested_option = effective_option;
|
||||
let block_segment_postings =
|
||||
BlockSegmentPostings::open(doc_freq, postings_data, record_option, requested_option)?;
|
||||
let position_reader = positions_data_opt.map(PositionReader::open).transpose()?;
|
||||
Ok(SegmentPostings::from_block_postings(
|
||||
block_segment_postings,
|
||||
position_reader,
|
||||
))
|
||||
}
|
||||
|
||||
pub(crate) fn fill_bitset_from_raw_data(
|
||||
doc_freq: u32,
|
||||
postings_data: RawPostingsData,
|
||||
doc_bitset: &mut BitSet,
|
||||
) -> io::Result<()> {
|
||||
let RawPostingsData {
|
||||
postings_data,
|
||||
record_option,
|
||||
effective_option,
|
||||
..
|
||||
} = postings_data;
|
||||
let mut block_postings =
|
||||
BlockSegmentPostings::open(doc_freq, postings_data, record_option, effective_option)?;
|
||||
loop {
|
||||
let docs = block_postings.docs();
|
||||
if docs.is_empty() {
|
||||
break;
|
||||
}
|
||||
for &doc in docs {
|
||||
doc_bitset.insert(doc);
|
||||
}
|
||||
block_postings.advance();
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use common::OwnedBytes;
|
||||
|
||||
use super::*;
|
||||
use crate::postings::serializer::PostingsSerializer;
|
||||
use crate::postings::Postings as _;
|
||||
use crate::schema::IndexRecordOption;
|
||||
|
||||
fn test_segment_postings_tf_aux(num_docs: u32, include_term_freq: bool) -> SegmentPostings {
|
||||
let mut postings_serializer =
|
||||
PostingsSerializer::new(1.0f32, IndexRecordOption::WithFreqs, None);
|
||||
let mut buffer = Vec::new();
|
||||
postings_serializer.new_term(num_docs, include_term_freq);
|
||||
for i in 0..num_docs {
|
||||
postings_serializer.write_doc(i, 2);
|
||||
}
|
||||
postings_serializer
|
||||
.close_term(num_docs, &mut buffer)
|
||||
.unwrap();
|
||||
load_postings_from_raw_data(
|
||||
num_docs,
|
||||
RawPostingsData {
|
||||
postings_data: OwnedBytes::new(buffer),
|
||||
positions_data: None,
|
||||
record_option: IndexRecordOption::WithFreqs,
|
||||
effective_option: IndexRecordOption::WithFreqs,
|
||||
},
|
||||
)
|
||||
.unwrap()
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_segment_postings_small_block_with_and_without_freq() {
|
||||
let small_block_without_term_freq = test_segment_postings_tf_aux(1, false);
|
||||
assert!(!small_block_without_term_freq.has_freq());
|
||||
assert_eq!(small_block_without_term_freq.doc(), 0);
|
||||
assert_eq!(small_block_without_term_freq.term_freq(), 1);
|
||||
|
||||
let small_block_with_term_freq = test_segment_postings_tf_aux(1, true);
|
||||
assert!(small_block_with_term_freq.has_freq());
|
||||
assert_eq!(small_block_with_term_freq.doc(), 0);
|
||||
assert_eq!(small_block_with_term_freq.term_freq(), 2);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_segment_postings_large_block_with_and_without_freq() {
|
||||
let large_block_without_term_freq = test_segment_postings_tf_aux(128, false);
|
||||
assert!(!large_block_without_term_freq.has_freq());
|
||||
assert_eq!(large_block_without_term_freq.doc(), 0);
|
||||
assert_eq!(large_block_without_term_freq.term_freq(), 1);
|
||||
|
||||
let large_block_with_term_freq = test_segment_postings_tf_aux(128, true);
|
||||
assert!(large_block_with_term_freq.has_freq());
|
||||
assert_eq!(large_block_with_term_freq.doc(), 0);
|
||||
assert_eq!(large_block_with_term_freq.term_freq(), 2);
|
||||
}
|
||||
}
|
||||
|
||||
@@ -1,9 +1,8 @@
|
||||
use common::BitSet;
|
||||
use common::{BitSet, HasLen};
|
||||
|
||||
use super::BlockSegmentPostings;
|
||||
use crate::codec::postings::PostingsWithBlockMax;
|
||||
use crate::docset::DocSet;
|
||||
use crate::fieldnorm::FieldNormReader;
|
||||
use crate::positions::PositionReader;
|
||||
use crate::postings::compression::COMPRESSION_BLOCK_SIZE;
|
||||
use crate::postings::{DocFreq, Postings};
|
||||
@@ -47,10 +46,14 @@ impl SegmentPostings {
|
||||
use crate::schema::IndexRecordOption;
|
||||
let mut buffer = Vec::new();
|
||||
{
|
||||
use crate::postings::serializer::PostingsSerializer;
|
||||
use crate::codec::postings::PostingsSerializer;
|
||||
|
||||
let mut postings_serializer =
|
||||
PostingsSerializer::new(0.0, IndexRecordOption::Basic, None);
|
||||
crate::codec::standard::postings::StandardPostingsSerializer::new(
|
||||
0.0,
|
||||
IndexRecordOption::Basic,
|
||||
None,
|
||||
);
|
||||
postings_serializer.new_term(docs.len() as u32, false);
|
||||
for &doc in docs {
|
||||
postings_serializer.write_doc(doc, 1u32);
|
||||
@@ -77,8 +80,9 @@ impl SegmentPostings {
|
||||
) -> SegmentPostings {
|
||||
use common::OwnedBytes;
|
||||
|
||||
use crate::codec::postings::PostingsSerializer as _;
|
||||
use crate::codec::standard::postings::StandardPostingsSerializer;
|
||||
use crate::fieldnorm::FieldNormReader;
|
||||
use crate::postings::serializer::PostingsSerializer;
|
||||
use crate::schema::IndexRecordOption;
|
||||
use crate::Score;
|
||||
let mut buffer: Vec<u8> = Vec::new();
|
||||
@@ -95,7 +99,7 @@ impl SegmentPostings {
|
||||
total_num_tokens as Score / fieldnorms.len() as Score
|
||||
})
|
||||
.unwrap_or(0.0);
|
||||
let mut postings_serializer = PostingsSerializer::new(
|
||||
let mut postings_serializer = StandardPostingsSerializer::new(
|
||||
average_field_norm,
|
||||
IndexRecordOption::WithFreqs,
|
||||
fieldnorm_reader,
|
||||
@@ -148,20 +152,12 @@ impl DocSet for SegmentPostings {
self.doc()
}

#[inline]
fn seek(&mut self, target: DocId) -> DocId {
debug_assert!(self.doc() <= target);
if self.doc() >= target {
return self.doc();
}

// As an optimization, if the block is already loaded, we can
// cheaply check the next doc.
self.cur = (self.cur + 1).min(COMPRESSION_BLOCK_SIZE - 1);
if self.doc() >= target {
return self.doc();
}

// Delegate block-local search to BlockSegmentPostings::seek, which returns
// the in-block index of the first doc >= target.
self.cur = self.block_cursor.seek(target);
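The block-local part of `seek` boils down to finding the first position in a sorted block of doc ids that is greater than or equal to the target. A tiny standalone sketch of that step follows; tantivy's `BlockSegmentPostings::seek` uses its own search over the decoded block, and `partition_point` here is just one way to express the same idea.

```rust
/// Within one decoded, sorted block of doc ids, return the index of the first doc >= target.
fn seek_in_block(block_docs: &[u32], target: u32) -> usize {
    block_docs.partition_point(|&doc| doc < target)
}

fn main() {
    let block = [3, 7, 7, 12, 40];
    assert_eq!(seek_in_block(&block, 8), 3); // first doc >= 8 is 12, at index 3
    assert_eq!(seek_in_block(&block, 3), 0);
    assert_eq!(seek_in_block(&block, 100), 5); // past the end of the block
}
```
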
@@ -177,34 +173,29 @@ impl DocSet for SegmentPostings {
|
||||
}
|
||||
|
||||
fn size_hint(&self) -> u32 {
|
||||
self.doc_freq().into()
|
||||
self.len() as u32
|
||||
}
|
||||
|
||||
fn fill_bitset(&mut self, bitset: &mut BitSet) {
|
||||
let bitset_max_value: DocId = bitset.max_value();
|
||||
loop {
|
||||
let docs = self.block_cursor.docs();
|
||||
let Some(&last_doc) = docs.last() else {
|
||||
break;
|
||||
};
|
||||
if last_doc < bitset_max_value {
|
||||
// All docs are within the range of the bitset
|
||||
for &doc in docs {
|
||||
bitset.insert(doc);
|
||||
}
|
||||
} else {
|
||||
for &doc in docs {
|
||||
if doc < bitset_max_value {
|
||||
bitset.insert(doc);
|
||||
}
|
||||
}
|
||||
if docs.is_empty() {
|
||||
break;
|
||||
}
|
||||
for &doc in docs {
|
||||
bitset.insert(doc);
|
||||
}
|
||||
self.block_cursor.advance();
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl HasLen for SegmentPostings {
|
||||
fn len(&self) -> usize {
|
||||
self.block_cursor.doc_freq() as usize
|
||||
}
|
||||
}
|
||||
|
||||
impl Postings for SegmentPostings {
|
||||
/// Returns the frequency associated with the current document.
|
||||
/// If the schema is set up so that no frequency have been encoded,
|
||||
@@ -212,7 +203,7 @@ impl Postings for SegmentPostings {
|
||||
///
|
||||
/// # Panics
|
||||
///
|
||||
/// Will panics if called without having called advance before.
|
||||
/// Will panics if called without having cagled advance before.
|
||||
fn term_freq(&self) -> u32 {
|
||||
debug_assert!(
|
||||
// Here we do not use the len of `freqs()`
|
||||
@@ -264,19 +255,15 @@ impl Postings for SegmentPostings {
|
||||
}
|
||||
|
||||
impl PostingsWithBlockMax for SegmentPostings {
|
||||
#[inline]
|
||||
fn seek_block_max(
|
||||
&mut self,
|
||||
target_doc: crate::DocId,
|
||||
fieldnorm_reader: &FieldNormReader,
|
||||
similarity_weight: &Bm25Weight,
|
||||
) -> Score {
|
||||
self.block_cursor.seek_block_without_loading(target_doc);
|
||||
self.block_cursor
|
||||
.block_max_score(fieldnorm_reader, similarity_weight)
|
||||
self.block_cursor.block_max_score(similarity_weight)
|
||||
}
|
||||
|
||||
#[inline]
|
||||
fn last_doc_in_block(&self) -> crate::DocId {
|
||||
self.block_cursor.skip_reader().last_doc_in_block()
|
||||
}
|
||||
@@ -284,6 +271,9 @@ impl PostingsWithBlockMax for SegmentPostings {
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
|
||||
use common::HasLen;
|
||||
|
||||
use super::SegmentPostings;
|
||||
use crate::docset::{DocSet, TERMINATED};
|
||||
use crate::postings::Postings;
|
||||
@@ -295,6 +285,7 @@ mod tests {
|
||||
assert_eq!(postings.advance(), TERMINATED);
|
||||
assert_eq!(postings.advance(), TERMINATED);
|
||||
assert_eq!(postings.doc_freq(), crate::postings::DocFreq::Exact(0));
|
||||
assert_eq!(postings.len(), 0);
|
||||
}
|
||||
|
||||
#[test]
|
||||
|
||||
@@ -14,11 +14,7 @@ use crate::{DocId, Score, TERMINATED};
|
||||
// (requiring a 6th bit), but the biggest doc_id we can want to encode is TERMINATED-1, which can
|
||||
// be represented on 31b without delta encoding.
|
||||
fn encode_bitwidth(bitwidth: u8, delta_1: bool) -> u8 {
|
||||
assert!(
|
||||
bitwidth < 32,
|
||||
"bitwidth needs to be less than 32, but got {}",
|
||||
bitwidth
|
||||
);
|
||||
assert!(bitwidth < 32);
|
||||
bitwidth | ((delta_1 as u8) << 6)
|
||||
}
|
||||
|
||||
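The packing above stores the bit width in the low bits of a single byte and the `delta_1` flag in bit 6. In the sketch below, the encoder mirrors the function shown in the hunk; the decoder is an assumed counterpart added only to illustrate the layout, not code from the repository.

```rust
/// Pack a bit width (< 32) together with a `delta_1` flag into one byte, flag in bit 6.
fn encode_bitwidth(bitwidth: u8, delta_1: bool) -> u8 {
    assert!(bitwidth < 32);
    bitwidth | ((delta_1 as u8) << 6)
}

/// Assumed inverse of `encode_bitwidth`, shown only to make the layout explicit.
fn decode_bitwidth(encoded: u8) -> (u8, bool) {
    (encoded & 0b0011_1111, encoded & 0b0100_0000 != 0)
}

fn main() {
    let encoded = encode_bitwidth(13, true);
    assert_eq!(decode_bitwidth(encoded), (13, true));
}
```
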
183
src/codec/standard/postings/standard_postings_serializer.rs
Normal file
@@ -0,0 +1,183 @@
|
||||
use std::cmp::Ordering;
|
||||
use std::io::{self, Write as _};
|
||||
|
||||
use common::{BinarySerializable as _, VInt};
|
||||
|
||||
use crate::codec::postings::PostingsSerializer;
|
||||
use crate::codec::standard::postings::block::Block;
|
||||
use crate::codec::standard::postings::skip::SkipSerializer;
|
||||
use crate::fieldnorm::FieldNormReader;
|
||||
use crate::postings::compression::{BlockEncoder, VIntEncoder as _, COMPRESSION_BLOCK_SIZE};
|
||||
use crate::query::Bm25Weight;
|
||||
use crate::schema::IndexRecordOption;
|
||||
use crate::{DocId, Score};
|
||||
|
||||
pub struct StandardPostingsSerializer {
|
||||
last_doc_id_encoded: u32,
|
||||
|
||||
block_encoder: BlockEncoder,
|
||||
block: Box<Block>,
|
||||
|
||||
postings_write: Vec<u8>,
|
||||
skip_write: SkipSerializer,
|
||||
|
||||
mode: IndexRecordOption,
|
||||
fieldnorm_reader: Option<FieldNormReader>,
|
||||
|
||||
bm25_weight: Option<Bm25Weight>,
|
||||
avg_fieldnorm: Score, /* Average number of term in the field for that segment.
|
||||
* this value is used to compute the block wand information. */
|
||||
term_has_freq: bool,
|
||||
}
|
||||
|
||||
impl StandardPostingsSerializer {
|
||||
pub fn new(
|
||||
avg_fieldnorm: Score,
|
||||
mode: IndexRecordOption,
|
||||
fieldnorm_reader: Option<FieldNormReader>,
|
||||
) -> StandardPostingsSerializer {
|
||||
Self {
|
||||
last_doc_id_encoded: 0,
|
||||
block_encoder: BlockEncoder::new(),
|
||||
block: Box::new(Block::new()),
|
||||
postings_write: Vec::new(),
|
||||
skip_write: SkipSerializer::new(),
|
||||
mode,
|
||||
fieldnorm_reader,
|
||||
bm25_weight: None,
|
||||
avg_fieldnorm,
|
||||
term_has_freq: false,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl PostingsSerializer for StandardPostingsSerializer {
|
||||
fn new_term(&mut self, term_doc_freq: u32, record_term_freq: bool) {
|
||||
self.clear();
|
||||
|
||||
self.term_has_freq = self.mode.has_freq() && record_term_freq;
|
||||
if !self.term_has_freq {
|
||||
return;
|
||||
}
|
||||
|
||||
let num_docs_in_segment: u64 =
|
||||
if let Some(fieldnorm_reader) = self.fieldnorm_reader.as_ref() {
|
||||
fieldnorm_reader.num_docs() as u64
|
||||
} else {
|
||||
return;
|
||||
};
|
||||
|
||||
if num_docs_in_segment == 0 {
|
||||
return;
|
||||
}
|
||||
|
||||
self.bm25_weight = Some(Bm25Weight::for_one_term_without_explain(
|
||||
term_doc_freq as u64,
|
||||
num_docs_in_segment,
|
||||
self.avg_fieldnorm,
|
||||
));
|
||||
}
|
||||
|
||||
fn write_doc(&mut self, doc_id: DocId, term_freq: u32) {
|
||||
self.block.append_doc(doc_id, term_freq);
|
||||
if self.block.is_full() {
|
||||
self.write_block();
|
||||
}
|
||||
}
|
||||
|
||||
fn close_term(&mut self, doc_freq: u32, output_write: &mut impl io::Write) -> io::Result<()> {
|
||||
if !self.block.is_empty() {
|
||||
// we have doc ids waiting to be written
|
||||
// this happens when the number of doc ids is
|
||||
// not a perfect multiple of our block size.
|
||||
//
|
||||
// In that case, the remaining part is encoded
|
||||
// using variable int encoding.
|
||||
{
|
||||
let block_encoded = self
|
||||
.block_encoder
|
||||
.compress_vint_sorted(self.block.doc_ids(), self.last_doc_id_encoded);
|
||||
self.postings_write.write_all(block_encoded)?;
|
||||
}
|
||||
// ... Idem for term frequencies
|
||||
if self.term_has_freq {
|
||||
let block_encoded = self
|
||||
.block_encoder
|
||||
.compress_vint_unsorted(self.block.term_freqs());
|
||||
self.postings_write.write_all(block_encoded)?;
|
||||
}
|
||||
self.block.clear();
|
||||
}
|
||||
if doc_freq >= COMPRESSION_BLOCK_SIZE as u32 {
|
||||
let skip_data = self.skip_write.data();
|
||||
VInt(skip_data.len() as u64).serialize(output_write)?;
|
||||
output_write.write_all(skip_data)?;
|
||||
}
|
||||
output_write.write_all(&self.postings_write[..])?;
|
||||
self.skip_write.clear();
|
||||
self.postings_write.clear();
|
||||
self.bm25_weight = None;
|
||||
Ok(())
|
||||
}
|
||||
}
|
||||
|
||||
impl StandardPostingsSerializer {
|
||||
fn clear(&mut self) {
|
||||
self.bm25_weight = None;
|
||||
self.block.clear();
|
||||
self.last_doc_id_encoded = 0;
|
||||
}
|
||||
|
||||
fn write_block(&mut self) {
|
||||
{
|
||||
// encode the doc ids
|
||||
let (num_bits, block_encoded): (u8, &[u8]) = self
|
||||
.block_encoder
|
||||
.compress_block_sorted(self.block.doc_ids(), self.last_doc_id_encoded);
|
||||
self.last_doc_id_encoded = self.block.last_doc();
|
||||
self.skip_write
|
||||
.write_doc(self.last_doc_id_encoded, num_bits);
|
||||
// last el block 0, offset block 1,
|
||||
self.postings_write.extend(block_encoded);
|
||||
}
|
||||
if self.term_has_freq {
|
||||
let (num_bits, block_encoded): (u8, &[u8]) = self
|
||||
.block_encoder
|
||||
.compress_block_unsorted(self.block.term_freqs(), true);
|
||||
self.postings_write.extend(block_encoded);
|
||||
self.skip_write.write_term_freq(num_bits);
|
||||
if self.mode.has_positions() {
|
||||
// We serialize the sum of term freqs within the skip information
|
||||
// in order to navigate through positions.
|
||||
let sum_freq = self.block.term_freqs().iter().cloned().sum();
|
||||
self.skip_write.write_total_term_freq(sum_freq);
|
||||
}
|
||||
let mut blockwand_params = (0u8, 0u32);
|
||||
if let Some(bm25_weight) = self.bm25_weight.as_ref() {
|
||||
if let Some(fieldnorm_reader) = self.fieldnorm_reader.as_ref() {
|
||||
let docs = self.block.doc_ids().iter().cloned();
|
||||
let term_freqs = self.block.term_freqs().iter().cloned();
|
||||
let fieldnorms = docs.map(|doc| fieldnorm_reader.fieldnorm_id(doc));
|
||||
blockwand_params = fieldnorms
|
||||
.zip(term_freqs)
|
||||
.max_by(
|
||||
|(left_fieldnorm_id, left_term_freq),
|
||||
(right_fieldnorm_id, right_term_freq)| {
|
||||
let left_score =
|
||||
bm25_weight.tf_factor(*left_fieldnorm_id, *left_term_freq);
|
||||
let right_score =
|
||||
bm25_weight.tf_factor(*right_fieldnorm_id, *right_term_freq);
|
||||
left_score
|
||||
.partial_cmp(&right_score)
|
||||
.unwrap_or(Ordering::Equal)
|
||||
},
|
||||
)
|
||||
.unwrap();
|
||||
}
|
||||
}
|
||||
let (fieldnorm_id, term_freq) = blockwand_params;
|
||||
self.skip_write.write_blockwand_max(fieldnorm_id, term_freq);
|
||||
}
|
||||
self.block.clear();
|
||||
}
|
||||
}
|
||||
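`close_term` above falls back to variable-length integers for the partial last block, after delta-encoding the sorted doc ids against `last_doc_id_encoded`. The standalone sketch below shows that delta-then-varint idea; the byte layout is illustrative and not tantivy's actual `compress_vint_sorted` format.

```rust
/// Write one value as a 7-bits-per-byte varint (illustrative layout, high bit = "more bytes").
fn write_vint(mut value: u32, out: &mut Vec<u8>) {
    loop {
        let byte = (value & 0x7F) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80); // more bytes follow
    }
}

/// Delta-encode a sorted run of doc ids against `last_doc_encoded`, then varint each gap.
fn vint_delta_encode_sorted(doc_ids: &[u32], mut last_doc_encoded: u32, out: &mut Vec<u8>) {
    for &doc in doc_ids {
        // Sorted doc ids -> small non-negative deltas -> short varints.
        write_vint(doc - last_doc_encoded, out);
        last_doc_encoded = doc;
    }
}

fn main() {
    let mut out = Vec::new();
    vint_delta_encode_sorted(&[130, 131, 200], 128, &mut out);
    // Deltas are 2, 1 and 69: each fits in a single byte here.
    assert_eq!(out, vec![2, 1, 69]);
}
```
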
@@ -43,7 +43,7 @@ impl Collector for Count {
|
||||
fn for_segment(
|
||||
&self,
|
||||
_: SegmentOrdinal,
|
||||
_: &dyn SegmentReader,
|
||||
_: &SegmentReader,
|
||||
) -> crate::Result<SegmentCountCollector> {
|
||||
Ok(SegmentCountCollector::default())
|
||||
}
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
use std::collections::HashSet;
|
||||
|
||||
use super::{Collector, SegmentCollector};
|
||||
use crate::{DocAddress, DocId, Score, SegmentReader};
|
||||
use crate::{DocAddress, DocId, Score};
|
||||
|
||||
/// Collectors that returns the set of DocAddress that matches the query.
|
||||
///
|
||||
@@ -15,7 +15,7 @@ impl Collector for DocSetCollector {
|
||||
fn for_segment(
|
||||
&self,
|
||||
segment_local_id: crate::SegmentOrdinal,
|
||||
_segment: &dyn SegmentReader,
|
||||
_segment: &crate::SegmentReader,
|
||||
) -> crate::Result<Self::Child> {
|
||||
Ok(DocSetChildCollector {
|
||||
segment_local_id,
|
||||
|
||||
@@ -265,7 +265,7 @@ impl Collector for FacetCollector {
|
||||
fn for_segment(
|
||||
&self,
|
||||
_: SegmentOrdinal,
|
||||
reader: &dyn SegmentReader,
|
||||
reader: &SegmentReader,
|
||||
) -> crate::Result<FacetSegmentCollector> {
|
||||
let facet_reader = reader.facet_reader(&self.field_name)?;
|
||||
let facet_dict = facet_reader.facet_dict();
|
||||
@@ -486,9 +486,9 @@ mod tests {
|
||||
use std::collections::BTreeSet;
|
||||
|
||||
use columnar::Dictionary;
|
||||
use rand::distr::Uniform;
|
||||
use rand::distributions::Uniform;
|
||||
use rand::prelude::SliceRandom;
|
||||
use rand::{rng, Rng};
|
||||
use rand::{thread_rng, Rng};
|
||||
|
||||
use super::{FacetCollector, FacetCounts};
|
||||
use crate::collector::facet_collector::compress_mapping;
|
||||
@@ -731,7 +731,7 @@ mod tests {
|
||||
let schema = schema_builder.build();
|
||||
let index = Index::create_in_ram(schema);
|
||||
|
||||
let uniform = Uniform::new_inclusive(1, 100_000).unwrap();
|
||||
let uniform = Uniform::new_inclusive(1, 100_000);
|
||||
let mut docs: Vec<TantivyDocument> =
|
||||
vec![("a", 10), ("b", 100), ("c", 7), ("d", 12), ("e", 21)]
|
||||
.into_iter()
|
||||
@@ -741,11 +741,14 @@ mod tests {
|
||||
std::iter::repeat_n(doc, count)
|
||||
})
|
||||
.map(|mut doc| {
|
||||
doc.add_facet(facet_field, &format!("/facet/{}", rng().sample(uniform)));
|
||||
doc.add_facet(
|
||||
facet_field,
|
||||
&format!("/facet/{}", thread_rng().sample(uniform)),
|
||||
);
|
||||
doc
|
||||
})
|
||||
.collect();
|
||||
docs[..].shuffle(&mut rng());
|
||||
docs[..].shuffle(&mut thread_rng());
|
||||
|
||||
let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
|
||||
for doc in docs {
|
||||
@@ -819,8 +822,8 @@ mod tests {
|
||||
#[cfg(all(test, feature = "unstable"))]
|
||||
mod bench {
|
||||
|
||||
use rand::rng;
|
||||
use rand::seq::SliceRandom;
|
||||
use rand::thread_rng;
|
||||
use test::Bencher;
|
||||
|
||||
use crate::collector::FacetCollector;
|
||||
@@ -843,7 +846,7 @@ mod bench {
|
||||
}
|
||||
}
|
||||
// 40425 docs
|
||||
docs[..].shuffle(&mut rng());
|
||||
docs[..].shuffle(&mut thread_rng());
|
||||
|
||||
let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
|
||||
for doc in docs {
|
||||
|
||||
@@ -113,7 +113,7 @@ where
|
||||
fn for_segment(
|
||||
&self,
|
||||
segment_local_id: u32,
|
||||
segment_reader: &dyn SegmentReader,
|
||||
segment_reader: &SegmentReader,
|
||||
) -> crate::Result<Self::Child> {
|
||||
let column_opt = segment_reader.fast_fields().column_opt(&self.field)?;
|
||||
|
||||
@@ -287,7 +287,7 @@ where
|
||||
fn for_segment(
|
||||
&self,
|
||||
segment_local_id: u32,
|
||||
segment_reader: &dyn SegmentReader,
|
||||
segment_reader: &SegmentReader,
|
||||
) -> crate::Result<Self::Child> {
|
||||
let column_opt = segment_reader.fast_fields().bytes(&self.field)?;
|
||||
|
||||
|
||||
@@ -6,7 +6,7 @@ use fastdivide::DividerU64;
|
||||
use crate::collector::{Collector, SegmentCollector};
|
||||
use crate::fastfield::{FastFieldNotAvailableError, FastValue};
|
||||
use crate::schema::Type;
|
||||
use crate::{DocId, Score, SegmentReader};
|
||||
use crate::{DocId, Score};
|
||||
|
||||
/// Histogram builds an histogram of the values of a fastfield for the
|
||||
/// collected DocSet.
|
||||
@@ -110,7 +110,7 @@ impl Collector for HistogramCollector {
|
||||
fn for_segment(
|
||||
&self,
|
||||
_segment_local_id: crate::SegmentOrdinal,
|
||||
segment: &dyn SegmentReader,
|
||||
segment: &crate::SegmentReader,
|
||||
) -> crate::Result<Self::Child> {
|
||||
let column_opt = segment.fast_fields().u64_lenient(&self.field)?;
|
||||
let (column, _column_type) = column_opt.ok_or_else(|| FastFieldNotAvailableError {
|
||||
|
||||
@@ -156,7 +156,7 @@ pub trait Collector: Sync + Send {
|
||||
fn for_segment(
|
||||
&self,
|
||||
segment_local_id: SegmentOrdinal,
|
||||
segment: &dyn SegmentReader,
|
||||
segment: &SegmentReader,
|
||||
) -> crate::Result<Self::Child>;
|
||||
|
||||
/// Returns true iff the collector requires to compute scores for documents.
|
||||
@@ -174,7 +174,7 @@ pub trait Collector: Sync + Send {
|
||||
&self,
|
||||
weight: &dyn Weight,
|
||||
segment_ord: u32,
|
||||
reader: &dyn SegmentReader,
|
||||
reader: &SegmentReader,
|
||||
) -> crate::Result<<Self::Child as SegmentCollector>::Fruit> {
|
||||
let with_scoring = self.requires_scoring();
|
||||
let mut segment_collector = self.for_segment(segment_ord, reader)?;
|
||||
@@ -186,7 +186,7 @@ pub trait Collector: Sync + Send {
|
||||
pub(crate) fn default_collect_segment_impl<TSegmentCollector: SegmentCollector>(
|
||||
segment_collector: &mut TSegmentCollector,
|
||||
weight: &dyn Weight,
|
||||
reader: &dyn SegmentReader,
|
||||
reader: &SegmentReader,
|
||||
with_scoring: bool,
|
||||
) -> crate::Result<()> {
|
||||
match (reader.alive_bitset(), with_scoring) {
|
||||
@@ -255,7 +255,7 @@ impl<TCollector: Collector> Collector for Option<TCollector> {
|
||||
fn for_segment(
|
||||
&self,
|
||||
segment_local_id: SegmentOrdinal,
|
||||
segment: &dyn SegmentReader,
|
||||
segment: &SegmentReader,
|
||||
) -> crate::Result<Self::Child> {
|
||||
Ok(if let Some(inner) = self {
|
||||
let inner_segment_collector = inner.for_segment(segment_local_id, segment)?;
|
||||
@@ -336,7 +336,7 @@ where
|
||||
fn for_segment(
|
||||
&self,
|
||||
segment_local_id: u32,
|
||||
segment: &dyn SegmentReader,
|
||||
segment: &SegmentReader,
|
||||
) -> crate::Result<Self::Child> {
|
||||
let left = self.0.for_segment(segment_local_id, segment)?;
|
||||
let right = self.1.for_segment(segment_local_id, segment)?;
|
||||
@@ -407,7 +407,7 @@ where
|
||||
fn for_segment(
|
||||
&self,
|
||||
segment_local_id: u32,
|
||||
segment: &dyn SegmentReader,
|
||||
segment: &SegmentReader,
|
||||
) -> crate::Result<Self::Child> {
|
||||
let one = self.0.for_segment(segment_local_id, segment)?;
|
||||
let two = self.1.for_segment(segment_local_id, segment)?;
|
||||
@@ -487,7 +487,7 @@ where
|
||||
fn for_segment(
|
||||
&self,
|
||||
segment_local_id: u32,
|
||||
segment: &dyn SegmentReader,
|
||||
segment: &SegmentReader,
|
||||
) -> crate::Result<Self::Child> {
|
||||
let one = self.0.for_segment(segment_local_id, segment)?;
|
||||
let two = self.1.for_segment(segment_local_id, segment)?;
|
||||
|
||||
@@ -24,7 +24,7 @@ impl<TCollector: Collector> Collector for CollectorWrapper<TCollector> {
|
||||
fn for_segment(
|
||||
&self,
|
||||
segment_local_id: u32,
|
||||
reader: &dyn SegmentReader,
|
||||
reader: &SegmentReader,
|
||||
) -> crate::Result<Box<dyn BoxableSegmentCollector>> {
|
||||
let child = self.0.for_segment(segment_local_id, reader)?;
|
||||
Ok(Box::new(SegmentCollectorWrapper(child)))
|
||||
@@ -209,7 +209,7 @@ impl Collector for MultiCollector<'_> {
|
||||
fn for_segment(
|
||||
&self,
|
||||
segment_local_id: SegmentOrdinal,
|
||||
segment: &dyn SegmentReader,
|
||||
segment: &SegmentReader,
|
||||
) -> crate::Result<MultiCollectorChild> {
|
||||
let children = self
|
||||
.collector_wrappers
|
||||
|
||||
@@ -1,5 +1,4 @@
|
||||
mod order;
|
||||
mod sort_by_bytes;
|
||||
mod sort_by_erased_type;
|
||||
mod sort_by_score;
|
||||
mod sort_by_static_fast_value;
|
||||
@@ -7,7 +6,6 @@ mod sort_by_string;
|
||||
mod sort_key_computer;
|
||||
|
||||
pub use order::*;
|
||||
pub use sort_by_bytes::SortByBytes;
|
||||
pub use sort_by_erased_type::SortByErasedType;
|
||||
pub use sort_by_score::SortBySimilarityScore;
|
||||
pub use sort_by_static_fast_value::SortByStaticFastValue;
|
||||
|
||||
@@ -5,7 +5,7 @@ use serde::{Deserialize, Serialize};
|
||||
|
||||
use crate::collector::{SegmentSortKeyComputer, SortKeyComputer};
|
||||
use crate::schema::{OwnedValue, Schema};
|
||||
use crate::{DocId, Order, Score, SegmentReader};
|
||||
use crate::{DocId, Order, Score};
|
||||
|
||||
fn compare_owned_value<const NULLS_FIRST: bool>(lhs: &OwnedValue, rhs: &OwnedValue) -> Ordering {
|
||||
match (lhs, rhs) {
|
||||
@@ -430,7 +430,7 @@ where
|
||||
|
||||
fn segment_sort_key_computer(
|
||||
&self,
|
||||
segment_reader: &dyn SegmentReader,
|
||||
segment_reader: &crate::SegmentReader,
|
||||
) -> crate::Result<Self::Child> {
|
||||
let child = self.0.segment_sort_key_computer(segment_reader)?;
|
||||
Ok(SegmentSortKeyComputerWithComparator {
|
||||
@@ -468,7 +468,7 @@ where
|
||||
|
||||
fn segment_sort_key_computer(
|
||||
&self,
|
||||
segment_reader: &dyn SegmentReader,
|
||||
segment_reader: &crate::SegmentReader,
|
||||
) -> crate::Result<Self::Child> {
|
||||
let child = self.0.segment_sort_key_computer(segment_reader)?;
|
||||
Ok(SegmentSortKeyComputerWithComparator {
|
||||
|
||||
@@ -1,168 +0,0 @@
|
||||
use columnar::BytesColumn;
|
||||
|
||||
use crate::collector::sort_key::NaturalComparator;
|
||||
use crate::collector::{SegmentSortKeyComputer, SortKeyComputer};
|
||||
use crate::termdict::TermOrdinal;
|
||||
use crate::{DocId, Score};
|
||||
|
||||
/// Sort by the first value of a bytes column.
|
||||
///
|
||||
/// If the field is multivalued, only the first value is considered.
|
||||
///
|
||||
/// Documents that do not have this value are still considered.
|
||||
/// Their sort key will simply be `None`.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct SortByBytes {
|
||||
column_name: String,
|
||||
}
|
||||
|
||||
impl SortByBytes {
|
||||
/// Creates a new sort by bytes sort key computer.
|
||||
pub fn for_field(column_name: impl ToString) -> Self {
|
||||
SortByBytes {
|
||||
column_name: column_name.to_string(),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl SortKeyComputer for SortByBytes {
|
||||
type SortKey = Option<Vec<u8>>;
|
||||
type Child = ByBytesColumnSegmentSortKeyComputer;
|
||||
type Comparator = NaturalComparator;
|
||||
|
||||
fn segment_sort_key_computer(
|
||||
&self,
|
||||
segment_reader: &dyn crate::SegmentReader,
|
||||
) -> crate::Result<Self::Child> {
|
||||
let bytes_column_opt = segment_reader.fast_fields().bytes(&self.column_name)?;
|
||||
Ok(ByBytesColumnSegmentSortKeyComputer { bytes_column_opt })
|
||||
}
|
||||
}
|
||||
|
||||
/// Segment-level sort key computer for bytes columns.
|
||||
pub struct ByBytesColumnSegmentSortKeyComputer {
|
||||
bytes_column_opt: Option<BytesColumn>,
|
||||
}
|
||||
|
||||
impl SegmentSortKeyComputer for ByBytesColumnSegmentSortKeyComputer {
|
||||
type SortKey = Option<Vec<u8>>;
|
||||
type SegmentSortKey = Option<TermOrdinal>;
|
||||
type SegmentComparator = NaturalComparator;
|
||||
|
||||
#[inline(always)]
|
||||
fn segment_sort_key(&mut self, doc: DocId, _score: Score) -> Option<TermOrdinal> {
|
||||
let bytes_column = self.bytes_column_opt.as_ref()?;
|
||||
bytes_column.ords().first(doc)
|
||||
}
|
||||
|
||||
fn convert_segment_sort_key(&self, term_ord_opt: Option<TermOrdinal>) -> Option<Vec<u8>> {
|
||||
// TODO: Individual lookups to the dictionary like this are very likely to repeatedly
|
||||
// decompress the same blocks. See https://github.com/quickwit-oss/tantivy/issues/2776
|
||||
let term_ord = term_ord_opt?;
|
||||
let bytes_column = self.bytes_column_opt.as_ref()?;
|
||||
let mut bytes = Vec::new();
|
||||
bytes_column
|
||||
.dictionary()
|
||||
.ord_to_term(term_ord, &mut bytes)
|
||||
.ok()?;
|
||||
Some(bytes)
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::SortByBytes;
|
||||
use crate::collector::TopDocs;
|
||||
use crate::query::AllQuery;
|
||||
use crate::schema::{BytesOptions, Schema, FAST, INDEXED};
|
||||
use crate::{Index, IndexWriter, Order, TantivyDocument};
|
||||
|
||||
#[test]
|
||||
fn test_sort_by_bytes_asc() -> crate::Result<()> {
|
||||
let mut schema_builder = Schema::builder();
|
||||
let bytes_field = schema_builder
|
||||
.add_bytes_field("data", BytesOptions::default().set_fast().set_indexed());
|
||||
let id_field = schema_builder.add_u64_field("id", FAST | INDEXED);
|
||||
let schema = schema_builder.build();
|
||||
let index = Index::create_in_ram(schema);
|
||||
let mut index_writer: IndexWriter = index.writer_for_tests()?;
|
||||
|
||||
// Insert documents with byte values in non-sorted order
|
||||
let test_data: Vec<(u64, Vec<u8>)> = vec![
|
||||
(1, vec![0x02, 0x00]),
|
||||
(2, vec![0x00, 0x10]),
|
||||
(3, vec![0x01, 0x00]),
|
||||
(4, vec![0x00, 0x20]),
|
||||
];
|
||||
|
||||
for (id, bytes) in &test_data {
|
||||
let mut doc = TantivyDocument::new();
|
||||
doc.add_u64(id_field, *id);
|
||||
doc.add_bytes(bytes_field, bytes);
|
||||
index_writer.add_document(doc)?;
|
||||
}
|
||||
index_writer.commit()?;
|
||||
|
||||
let reader = index.reader()?;
|
||||
let searcher = reader.searcher();
|
||||
|
||||
// Sort ascending by bytes
|
||||
let top_docs =
|
||||
TopDocs::with_limit(10).order_by((SortByBytes::for_field("data"), Order::Asc));
|
||||
let results: Vec<(Option<Vec<u8>>, _)> = searcher.search(&AllQuery, &top_docs)?;
|
||||
|
||||
// Expected order: [0x00,0x10], [0x00,0x20], [0x01,0x00], [0x02,0x00]
|
||||
let sorted_bytes: Vec<Option<Vec<u8>>> = results.into_iter().map(|(b, _)| b).collect();
|
||||
assert_eq!(
|
||||
sorted_bytes,
|
||||
vec![
|
||||
Some(vec![0x00, 0x10]),
|
||||
Some(vec![0x00, 0x20]),
|
||||
Some(vec![0x01, 0x00]),
|
||||
Some(vec![0x02, 0x00]),
|
||||
]
|
||||
);
|
||||
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_sort_by_bytes_desc() -> crate::Result<()> {
|
||||
let mut schema_builder = Schema::builder();
|
||||
let bytes_field = schema_builder
|
||||
.add_bytes_field("data", BytesOptions::default().set_fast().set_indexed());
|
||||
let schema = schema_builder.build();
|
||||
let index = Index::create_in_ram(schema);
|
||||
let mut index_writer: IndexWriter = index.writer_for_tests()?;
|
||||
|
||||
let test_data: Vec<Vec<u8>> = vec![vec![0x00, 0x10], vec![0x02, 0x00], vec![0x01, 0x00]];
|
||||
|
||||
for bytes in &test_data {
|
||||
let mut doc = TantivyDocument::new();
|
||||
doc.add_bytes(bytes_field, bytes);
|
||||
index_writer.add_document(doc)?;
|
||||
}
|
||||
index_writer.commit()?;
|
||||
|
||||
let reader = index.reader()?;
|
||||
let searcher = reader.searcher();
|
||||
|
||||
// Sort descending by bytes
|
||||
let top_docs =
|
||||
TopDocs::with_limit(10).order_by((SortByBytes::for_field("data"), Order::Desc));
|
||||
let results: Vec<(Option<Vec<u8>>, _)> = searcher.search(&AllQuery, &top_docs)?;
|
||||
|
||||
// Expected order (descending): [0x02,0x00], [0x01,0x00], [0x00,0x10]
|
||||
let sorted_bytes: Vec<Option<Vec<u8>>> = results.into_iter().map(|(b, _)| b).collect();
|
||||
assert_eq!(
|
||||
sorted_bytes,
|
||||
vec![
|
||||
Some(vec![0x02, 0x00]),
|
||||
Some(vec![0x01, 0x00]),
|
||||
Some(vec![0x00, 0x10]),
|
||||
]
|
||||
);
|
||||
|
||||
Ok(())
|
||||
}
|
||||
}
|
||||
@@ -1,12 +1,12 @@
|
||||
use columnar::{ColumnType, MonotonicallyMappableToU64};
|
||||
|
||||
use crate::collector::sort_key::{
|
||||
NaturalComparator, SortByBytes, SortBySimilarityScore, SortByStaticFastValue, SortByString,
|
||||
NaturalComparator, SortBySimilarityScore, SortByStaticFastValue, SortByString,
|
||||
};
|
||||
use crate::collector::{SegmentSortKeyComputer, SortKeyComputer};
|
||||
use crate::fastfield::FastFieldNotAvailableError;
|
||||
use crate::schema::OwnedValue;
|
||||
use crate::{DateTime, DocId, Score, SegmentReader};
|
||||
use crate::{DateTime, DocId, Score};
|
||||
|
||||
/// Sort by the boxed / OwnedValue representation of either a fast field, or of the score.
|
||||
///
|
||||
@@ -86,7 +86,7 @@ impl SortKeyComputer for SortByErasedType {
|
||||
|
||||
fn segment_sort_key_computer(
|
||||
&self,
|
||||
segment_reader: &dyn SegmentReader,
|
||||
segment_reader: &crate::SegmentReader,
|
||||
) -> crate::Result<Self::Child> {
|
||||
let inner: Box<dyn ErasedSegmentSortKeyComputer> = match self {
|
||||
Self::Field(column_name) => {
|
||||
@@ -114,16 +114,6 @@ impl SortKeyComputer for SortByErasedType {
|
||||
},
|
||||
})
|
||||
}
|
||||
ColumnType::Bytes => {
|
||||
let computer = SortByBytes::for_field(column_name);
|
||||
let inner = computer.segment_sort_key_computer(segment_reader)?;
|
||||
Box::new(ErasedSegmentSortKeyComputerWrapper {
|
||||
inner,
|
||||
converter: |val: Option<Vec<u8>>| {
|
||||
val.map(OwnedValue::Bytes).unwrap_or(OwnedValue::Null)
|
||||
},
|
||||
})
|
||||
}
|
||||
ColumnType::U64 => {
|
||||
let computer = SortByStaticFastValue::<u64>::for_field(column_name);
|
||||
let inner = computer.segment_sort_key_computer(segment_reader)?;
|
||||
@@ -291,65 +281,6 @@ mod tests {
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_sort_by_owned_bytes() {
|
||||
let mut schema_builder = Schema::builder();
|
||||
let data_field = schema_builder.add_bytes_field("data", FAST);
|
||||
let schema = schema_builder.build();
|
||||
let index = Index::create_in_ram(schema);
|
||||
let mut writer = index.writer_for_tests().unwrap();
|
||||
writer
|
||||
.add_document(doc!(data_field => vec![0x03u8, 0x00]))
|
||||
.unwrap();
|
||||
writer
|
||||
.add_document(doc!(data_field => vec![0x01u8, 0x00]))
|
||||
.unwrap();
|
||||
writer
|
||||
.add_document(doc!(data_field => vec![0x02u8, 0x00]))
|
||||
.unwrap();
|
||||
writer.add_document(doc!()).unwrap();
|
||||
writer.commit().unwrap();
|
||||
|
||||
let reader = index.reader().unwrap();
|
||||
let searcher = reader.searcher();
|
||||
|
||||
// Sort descending (Natural - highest first)
|
||||
let collector = TopDocs::with_limit(10)
|
||||
.order_by((SortByErasedType::for_field("data"), ComparatorEnum::Natural));
|
||||
let top_docs = searcher.search(&AllQuery, &collector).unwrap();
|
||||
|
||||
let values: Vec<OwnedValue> = top_docs.into_iter().map(|(key, _)| key).collect();
|
||||
|
||||
assert_eq!(
|
||||
values,
|
||||
vec![
|
||||
OwnedValue::Bytes(vec![0x03, 0x00]),
|
||||
OwnedValue::Bytes(vec![0x02, 0x00]),
|
||||
OwnedValue::Bytes(vec![0x01, 0x00]),
|
||||
OwnedValue::Null
|
||||
]
|
||||
);
|
||||
|
||||
// Sort ascending (ReverseNoneLower - lowest first, nulls last)
|
||||
let collector = TopDocs::with_limit(10).order_by((
|
||||
SortByErasedType::for_field("data"),
|
||||
ComparatorEnum::ReverseNoneLower,
|
||||
));
|
||||
let top_docs = searcher.search(&AllQuery, &collector).unwrap();
|
||||
|
||||
let values: Vec<OwnedValue> = top_docs.into_iter().map(|(key, _)| key).collect();
|
||||
|
||||
assert_eq!(
|
||||
values,
|
||||
vec![
|
||||
OwnedValue::Bytes(vec![0x01, 0x00]),
|
||||
OwnedValue::Bytes(vec![0x02, 0x00]),
|
||||
OwnedValue::Bytes(vec![0x03, 0x00]),
|
||||
OwnedValue::Null
|
||||
]
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_sort_by_owned_reverse() {
|
||||
let mut schema_builder = Schema::builder();
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
use crate::collector::sort_key::NaturalComparator;
|
||||
use crate::collector::{SegmentSortKeyComputer, SortKeyComputer, TopNComputer};
|
||||
use crate::{DocAddress, DocId, Score, SegmentReader};
|
||||
use crate::{DocAddress, DocId, Score};
|
||||
|
||||
/// Sort by similarity score.
|
||||
#[derive(Clone, Debug, Copy)]
|
||||
@@ -19,7 +19,7 @@ impl SortKeyComputer for SortBySimilarityScore {
|
||||
|
||||
fn segment_sort_key_computer(
|
||||
&self,
|
||||
_segment_reader: &dyn SegmentReader,
|
||||
_segment_reader: &crate::SegmentReader,
|
||||
) -> crate::Result<Self::Child> {
|
||||
Ok(SortBySimilarityScore)
|
||||
}
|
||||
@@ -29,7 +29,7 @@ impl SortKeyComputer for SortBySimilarityScore {
|
||||
&self,
|
||||
k: usize,
|
||||
weight: &dyn crate::query::Weight,
|
||||
reader: &dyn SegmentReader,
|
||||
reader: &crate::SegmentReader,
|
||||
segment_ord: u32,
|
||||
) -> crate::Result<Vec<(Self::SortKey, DocAddress)>> {
|
||||
let mut top_n: TopNComputer<Score, DocId, Self::Comparator> =
|
||||
|
||||
@@ -61,7 +61,7 @@ impl<T: FastValue> SortKeyComputer for SortByStaticFastValue<T> {
|
||||
|
||||
fn segment_sort_key_computer(
|
||||
&self,
|
||||
segment_reader: &dyn SegmentReader,
|
||||
segment_reader: &SegmentReader,
|
||||
) -> crate::Result<Self::Child> {
|
||||
let sort_column_opt = segment_reader.fast_fields().u64_lenient(&self.field)?;
|
||||
let (sort_column, _sort_column_type) =
|
||||
|
||||
@@ -3,7 +3,7 @@ use columnar::StrColumn;
|
||||
use crate::collector::sort_key::NaturalComparator;
|
||||
use crate::collector::{SegmentSortKeyComputer, SortKeyComputer};
|
||||
use crate::termdict::TermOrdinal;
|
||||
use crate::{DocId, Score, SegmentReader};
|
||||
use crate::{DocId, Score};
|
||||
|
||||
/// Sort by the first value of a string column.
|
||||
///
|
||||
@@ -35,7 +35,7 @@ impl SortKeyComputer for SortByString {
|
||||
|
||||
fn segment_sort_key_computer(
|
||||
&self,
|
||||
segment_reader: &dyn SegmentReader,
|
||||
segment_reader: &crate::SegmentReader,
|
||||
) -> crate::Result<Self::Child> {
|
||||
let str_column_opt = segment_reader.fast_fields().str(&self.column_name)?;
|
||||
Ok(ByStringColumnSegmentSortKeyComputer { str_column_opt })
|
||||
|
||||
@@ -119,7 +119,7 @@ pub trait SortKeyComputer: Sync {
|
||||
&self,
|
||||
k: usize,
|
||||
weight: &dyn crate::query::Weight,
|
||||
reader: &dyn SegmentReader,
|
||||
reader: &crate::SegmentReader,
|
||||
segment_ord: u32,
|
||||
) -> crate::Result<Vec<(Self::SortKey, DocAddress)>> {
|
||||
let with_scoring = self.requires_scoring();
|
||||
@@ -135,7 +135,7 @@ pub trait SortKeyComputer: Sync {
|
||||
}
|
||||
|
||||
/// Builds a child sort key computer for a specific segment.
|
||||
fn segment_sort_key_computer(&self, segment_reader: &dyn SegmentReader) -> Result<Self::Child>;
|
||||
fn segment_sort_key_computer(&self, segment_reader: &SegmentReader) -> Result<Self::Child>;
|
||||
}
|
||||
|
||||
impl<HeadSortKeyComputer, TailSortKeyComputer> SortKeyComputer
|
||||
@@ -156,7 +156,7 @@ where
|
||||
(self.0.comparator(), self.1.comparator())
|
||||
}
|
||||
|
||||
fn segment_sort_key_computer(&self, segment_reader: &dyn SegmentReader) -> Result<Self::Child> {
|
||||
fn segment_sort_key_computer(&self, segment_reader: &SegmentReader) -> Result<Self::Child> {
|
||||
Ok((
|
||||
self.0.segment_sort_key_computer(segment_reader)?,
|
||||
self.1.segment_sort_key_computer(segment_reader)?,
|
||||
@@ -357,7 +357,7 @@ where
|
||||
)
|
||||
}
|
||||
|
||||
fn segment_sort_key_computer(&self, segment_reader: &dyn SegmentReader) -> Result<Self::Child> {
|
||||
fn segment_sort_key_computer(&self, segment_reader: &SegmentReader) -> Result<Self::Child> {
|
||||
let sort_key_computer1 = self.0.segment_sort_key_computer(segment_reader)?;
|
||||
let sort_key_computer2 = self.1.segment_sort_key_computer(segment_reader)?;
|
||||
let sort_key_computer3 = self.2.segment_sort_key_computer(segment_reader)?;
|
||||
@@ -420,7 +420,7 @@ where
|
||||
SortKeyComputer4::Comparator,
|
||||
);
|
||||
|
||||
fn segment_sort_key_computer(&self, segment_reader: &dyn SegmentReader) -> Result<Self::Child> {
|
||||
fn segment_sort_key_computer(&self, segment_reader: &SegmentReader) -> Result<Self::Child> {
|
||||
let sort_key_computer1 = self.0.segment_sort_key_computer(segment_reader)?;
|
||||
let sort_key_computer2 = self.1.segment_sort_key_computer(segment_reader)?;
|
||||
let sort_key_computer3 = self.2.segment_sort_key_computer(segment_reader)?;
|
||||
@@ -454,7 +454,7 @@ where
|
||||
|
||||
impl<F, SegmentF, TSortKey> SortKeyComputer for F
|
||||
where
|
||||
F: 'static + Send + Sync + Fn(&dyn SegmentReader) -> SegmentF,
|
||||
F: 'static + Send + Sync + Fn(&SegmentReader) -> SegmentF,
|
||||
SegmentF: 'static + FnMut(DocId) -> TSortKey,
|
||||
TSortKey: 'static + PartialOrd + Clone + Send + Sync + std::fmt::Debug,
|
||||
{
|
||||
@@ -462,7 +462,7 @@ where
|
||||
type Child = SegmentF;
|
||||
type Comparator = NaturalComparator;
|
||||
|
||||
fn segment_sort_key_computer(&self, segment_reader: &dyn SegmentReader) -> Result<Self::Child> {
|
||||
fn segment_sort_key_computer(&self, segment_reader: &SegmentReader) -> Result<Self::Child> {
|
||||
Ok((self)(segment_reader))
|
||||
}
|
||||
}
|
||||
@@ -509,10 +509,10 @@ mod tests {
|
||||
|
||||
#[test]
|
||||
fn test_lazy_score_computer() {
|
||||
let score_computer_primary = |_segment_reader: &dyn SegmentReader| |_doc: DocId| 200u32;
|
||||
let score_computer_primary = |_segment_reader: &SegmentReader| |_doc: DocId| 200u32;
|
||||
let call_count = Arc::new(AtomicUsize::new(0));
|
||||
let call_count_clone = call_count.clone();
|
||||
let score_computer_secondary = move |_segment_reader: &dyn SegmentReader| {
|
||||
let score_computer_secondary = move |_segment_reader: &SegmentReader| {
|
||||
let call_count_new_clone = call_count_clone.clone();
|
||||
move |_doc: DocId| {
|
||||
call_count_new_clone.fetch_add(1, AtomicOrdering::SeqCst);
|
||||
@@ -572,10 +572,10 @@ mod tests {
|
||||
|
||||
#[test]
|
||||
fn test_lazy_score_computer_dynamic_ordering() {
|
||||
let score_computer_primary = |_segment_reader: &dyn SegmentReader| |_doc: DocId| 200u32;
|
||||
let score_computer_primary = |_segment_reader: &SegmentReader| |_doc: DocId| 200u32;
|
||||
let call_count = Arc::new(AtomicUsize::new(0));
|
||||
let call_count_clone = call_count.clone();
|
||||
let score_computer_secondary = move |_segment_reader: &dyn SegmentReader| {
|
||||
let score_computer_secondary = move |_segment_reader: &SegmentReader| {
|
||||
let call_count_new_clone = call_count_clone.clone();
|
||||
move |_doc: DocId| {
|
||||
call_count_new_clone.fetch_add(1, AtomicOrdering::SeqCst);
|
||||
|
||||
@@ -32,11 +32,7 @@ where TSortKeyComputer: SortKeyComputer + Send + Sync + 'static
|
||||
self.sort_key_computer.check_schema(schema)
|
||||
}
|
||||
|
||||
fn for_segment(
|
||||
&self,
|
||||
segment_ord: u32,
|
||||
segment_reader: &dyn SegmentReader,
|
||||
) -> Result<Self::Child> {
|
||||
fn for_segment(&self, segment_ord: u32, segment_reader: &SegmentReader) -> Result<Self::Child> {
|
||||
let segment_sort_key_computer = self
|
||||
.sort_key_computer
|
||||
.segment_sort_key_computer(segment_reader)?;
|
||||
@@ -67,7 +63,7 @@ where TSortKeyComputer: SortKeyComputer + Send + Sync + 'static
|
||||
&self,
|
||||
weight: &dyn Weight,
|
||||
segment_ord: u32,
|
||||
reader: &dyn SegmentReader,
|
||||
reader: &SegmentReader,
|
||||
) -> crate::Result<Vec<(TSortKeyComputer::SortKey, DocAddress)>> {
|
||||
let k = self.doc_range.end;
|
||||
let docs = self
|
||||
@@ -164,7 +160,7 @@ mod tests {
|
||||
expected: &[(crate::Score, usize)],
|
||||
) {
|
||||
let mut vals: Vec<(crate::Score, usize)> = (0..10).map(|val| (val as f32, val)).collect();
|
||||
vals.shuffle(&mut rand::rng());
|
||||
vals.shuffle(&mut rand::thread_rng());
|
||||
let vals_merged = merge_top_k(vals.into_iter(), doc_range, ComparatorEnum::from(order));
|
||||
assert_eq!(&vals_merged, expected);
|
||||
}
|
||||
|
||||
@@ -5,7 +5,7 @@ use crate::query::{AllQuery, QueryParser};
|
||||
use crate::schema::{Schema, FAST, TEXT};
|
||||
use crate::time::format_description::well_known::Rfc3339;
|
||||
use crate::time::OffsetDateTime;
|
||||
use crate::{DateTime, DocAddress, Index, Searcher, SegmentReader, TantivyDocument};
|
||||
use crate::{DateTime, DocAddress, Index, Searcher, TantivyDocument};
|
||||
|
||||
pub const TEST_COLLECTOR_WITH_SCORE: TestCollector = TestCollector {
|
||||
compute_score: true,
|
||||
@@ -109,7 +109,7 @@ impl Collector for TestCollector {
|
||||
fn for_segment(
|
||||
&self,
|
||||
segment_id: SegmentOrdinal,
|
||||
_reader: &dyn SegmentReader,
|
||||
_reader: &SegmentReader,
|
||||
) -> crate::Result<TestSegmentCollector> {
|
||||
Ok(TestSegmentCollector {
|
||||
segment_id,
|
||||
@@ -180,7 +180,7 @@ impl Collector for FastFieldTestCollector {
|
||||
fn for_segment(
|
||||
&self,
|
||||
_: SegmentOrdinal,
|
||||
segment_reader: &dyn SegmentReader,
|
||||
segment_reader: &SegmentReader,
|
||||
) -> crate::Result<FastFieldSegmentCollector> {
|
||||
let reader = segment_reader
|
||||
.fast_fields()
|
||||
@@ -243,7 +243,7 @@ impl Collector for BytesFastFieldTestCollector {
|
||||
fn for_segment(
|
||||
&self,
|
||||
_segment_local_id: u32,
|
||||
segment_reader: &dyn SegmentReader,
|
||||
segment_reader: &SegmentReader,
|
||||
) -> crate::Result<BytesFastFieldSegmentCollector> {
|
||||
let column_opt = segment_reader.fast_fields().bytes(&self.field)?;
|
||||
Ok(BytesFastFieldSegmentCollector {
|
||||
|
||||
@@ -393,7 +393,7 @@ impl TopDocs {
|
||||
/// // This is where we build our collector with our custom score.
|
||||
/// let top_docs_by_custom_score = TopDocs
|
||||
/// ::with_limit(10)
|
||||
/// .tweak_score(move |segment_reader: &dyn SegmentReader| {
|
||||
/// .tweak_score(move |segment_reader: &SegmentReader| {
|
||||
/// // The argument is a function that returns our scoring
|
||||
/// // function.
|
||||
/// //
|
||||
@@ -442,7 +442,7 @@ pub struct TweakScoreFn<F>(F);
|
||||
|
||||
impl<F, TTweakScoreSortKeyFn, TSortKey> SortKeyComputer for TweakScoreFn<F>
|
||||
where
|
||||
F: 'static + Send + Sync + Fn(&dyn SegmentReader) -> TTweakScoreSortKeyFn,
|
||||
F: 'static + Send + Sync + Fn(&SegmentReader) -> TTweakScoreSortKeyFn,
|
||||
TTweakScoreSortKeyFn: 'static + Fn(DocId, Score) -> TSortKey,
|
||||
TweakScoreSegmentSortKeyComputer<TTweakScoreSortKeyFn>:
|
||||
SegmentSortKeyComputer<SortKey = TSortKey, SegmentSortKey = TSortKey>,
|
||||
@@ -458,7 +458,7 @@ where
|
||||
|
||||
fn segment_sort_key_computer(
|
||||
&self,
|
||||
segment_reader: &dyn SegmentReader,
|
||||
segment_reader: &SegmentReader,
|
||||
) -> crate::Result<Self::Child> {
|
||||
Ok({
|
||||
TweakScoreSegmentSortKeyComputer {
|
||||
@@ -1525,7 +1525,7 @@ mod tests {
|
||||
let text_query = query_parser.parse_query("droopy tax")?;
|
||||
let collector = TopDocs::with_limit(2)
|
||||
.and_offset(1)
|
||||
.order_by(move |_segment_reader: &dyn SegmentReader| move |doc: DocId| doc);
|
||||
.order_by(move |_segment_reader: &SegmentReader| move |doc: DocId| doc);
|
||||
let score_docs: Vec<(u32, DocAddress)> =
|
||||
index.reader()?.searcher().search(&text_query, &collector)?;
|
||||
assert_eq!(
|
||||
@@ -1543,7 +1543,7 @@ mod tests {
|
||||
let text_query = query_parser.parse_query("droopy tax").unwrap();
|
||||
let collector = TopDocs::with_limit(2)
|
||||
.and_offset(1)
|
||||
.order_by(move |_segment_reader: &dyn SegmentReader| move |doc: DocId| doc);
|
||||
.order_by(move |_segment_reader: &SegmentReader| move |doc: DocId| doc);
|
||||
let score_docs: Vec<(u32, DocAddress)> = index
|
||||
.reader()
|
||||
.unwrap()
|
||||
|
||||
@@ -8,7 +8,7 @@ use std::path::Path;
|
||||
use once_cell::sync::Lazy;
|
||||
|
||||
pub use self::executor::Executor;
|
||||
pub use self::searcher::{Searcher, SearcherContext, SearcherGeneration};
|
||||
pub use self::searcher::{Searcher, SearcherGeneration};
|
||||
|
||||
/// The meta file contains all the information about the list of segments and the schema
|
||||
/// of the index.
|
||||
|
||||
@@ -4,13 +4,13 @@ use std::{fmt, io};
|
||||
|
||||
use crate::collector::Collector;
|
||||
use crate::core::Executor;
|
||||
use crate::index::{Index, SegmentId, SegmentReader};
|
||||
use crate::index::{SegmentId, SegmentReader};
|
||||
use crate::query::{Bm25StatisticsProvider, EnableScoring, Query};
|
||||
use crate::schema::{Field, FieldType, Schema, TantivyDocument, Term};
|
||||
use crate::schema::document::DocumentDeserialize;
|
||||
use crate::schema::{Schema, Term};
|
||||
use crate::space_usage::SearcherSpaceUsage;
|
||||
use crate::store::{CacheStats, StoreReader, DOCSTORE_CACHE_CAPACITY};
|
||||
use crate::tokenizer::{TextAnalyzer, TokenizerManager};
|
||||
use crate::{DocAddress, Inventory, Opstamp, TantivyError, TrackedObject};
|
||||
use crate::store::{CacheStats, StoreReader};
|
||||
use crate::{DocAddress, Index, Opstamp, TrackedObject};
|
||||
|
||||
/// Identifies the searcher generation accessed by a [`Searcher`].
|
||||
///
|
||||
@@ -36,7 +36,7 @@ pub struct SearcherGeneration {
|
||||
|
||||
impl SearcherGeneration {
|
||||
pub(crate) fn from_segment_readers(
|
||||
segment_readers: &[Arc<dyn SegmentReader>],
|
||||
segment_readers: &[SegmentReader],
|
||||
generation_id: u64,
|
||||
) -> Self {
|
||||
let mut segment_id_to_del_opstamp = BTreeMap::new();
|
||||
@@ -61,103 +61,6 @@ impl SearcherGeneration {
|
||||
}
|
||||
}
|
||||
|
||||
/// Search-time context required by a [`Searcher`].
|
||||
#[derive(Clone)]
|
||||
pub struct SearcherContext {
|
||||
schema: Schema,
|
||||
executor: Executor,
|
||||
tokenizers: TokenizerManager,
|
||||
fast_field_tokenizers: TokenizerManager,
|
||||
}
|
||||
|
||||
impl SearcherContext {
|
||||
/// Creates a context from explicit search-time components.
|
||||
pub fn new(
|
||||
schema: Schema,
|
||||
executor: Executor,
|
||||
tokenizers: TokenizerManager,
|
||||
fast_field_tokenizers: TokenizerManager,
|
||||
) -> SearcherContext {
|
||||
SearcherContext {
|
||||
schema,
|
||||
executor,
|
||||
tokenizers,
|
||||
fast_field_tokenizers,
|
||||
}
|
||||
}
|
||||
|
||||
/// Creates a context from an index.
|
||||
pub fn from_index<C: crate::codec::Codec>(index: &Index<C>) -> SearcherContext {
|
||||
SearcherContext::new(
|
||||
index.schema(),
|
||||
index.search_executor().clone(),
|
||||
index.tokenizers().clone(),
|
||||
index.fast_field_tokenizer().clone(),
|
||||
)
|
||||
}
|
||||
|
||||
/// Access the schema associated with this context.
|
||||
pub fn schema(&self) -> &Schema {
|
||||
&self.schema
|
||||
}
|
||||
|
||||
/// Access the executor associated with this context.
|
||||
pub fn search_executor(&self) -> &Executor {
|
||||
&self.executor
|
||||
}
|
||||
|
||||
/// Access the tokenizer manager associated with this context.
|
||||
pub fn tokenizers(&self) -> &TokenizerManager {
|
||||
&self.tokenizers
|
||||
}
|
||||
|
||||
/// Access the fast field tokenizer manager associated with this context.
|
||||
pub fn fast_field_tokenizer(&self) -> &TokenizerManager {
|
||||
&self.fast_field_tokenizers
|
||||
}
|
||||
|
||||
/// Get the tokenizer associated with a specific field.
|
||||
pub fn tokenizer_for_field(&self, field: Field) -> crate::Result<TextAnalyzer> {
|
||||
let field_entry = self.schema.get_field_entry(field);
|
||||
let field_type = field_entry.field_type();
|
||||
let indexing_options_opt = match field_type {
|
||||
FieldType::JsonObject(options) => options.get_text_indexing_options(),
|
||||
FieldType::Str(options) => options.get_indexing_options(),
|
||||
_ => {
|
||||
return Err(TantivyError::SchemaError(format!(
|
||||
"{:?} is not a text field.",
|
||||
field_entry.name()
|
||||
)))
|
||||
}
|
||||
};
|
||||
let indexing_options = indexing_options_opt.ok_or_else(|| {
|
||||
TantivyError::InvalidArgument(format!(
|
||||
"No indexing options set for field {field_entry:?}"
|
||||
))
|
||||
})?;
|
||||
|
||||
self.tokenizers
|
||||
.get(indexing_options.tokenizer())
|
||||
.ok_or_else(|| {
|
||||
TantivyError::InvalidArgument(format!(
|
||||
"No Tokenizer found for field {field_entry:?}"
|
||||
))
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
impl<C: crate::codec::Codec> From<&Index<C>> for SearcherContext {
|
||||
fn from(index: &Index<C>) -> Self {
|
||||
SearcherContext::from_index(index)
|
||||
}
|
||||
}
|
||||
|
||||
impl<C: crate::codec::Codec> From<Index<C>> for SearcherContext {
|
||||
fn from(index: Index<C>) -> Self {
|
||||
SearcherContext::from(&index)
|
||||
}
|
||||
}
|
||||
|
||||
/// Holds a list of `SegmentReader`s ready for search.
|
||||
///
|
||||
/// It guarantees that the `Segment` will not be removed before
|
||||
@@ -168,66 +71,9 @@ pub struct Searcher {
|
||||
}
|
||||
|
||||
impl Searcher {
|
||||
/// Creates a `Searcher` from an arbitrary list of segment readers.
|
||||
///
|
||||
/// This is useful when segment readers are not opened from
|
||||
/// `IndexReader` / `meta.json` (e.g. external segment sources).
|
||||
/// The generated [`SearcherGeneration`] uses `generation_id = 0`.
|
||||
pub fn from_segment_readers<Ctx: Into<SearcherContext>>(
|
||||
context: Ctx,
|
||||
segment_readers: Vec<Arc<dyn SegmentReader>>,
|
||||
) -> crate::Result<Searcher> {
|
||||
Self::from_segment_readers_with_generation_id(context, segment_readers, 0)
|
||||
}
|
||||
|
||||
/// Same as [`Searcher::from_segment_readers`] but allows setting
|
||||
/// a custom generation id.
|
||||
pub fn from_segment_readers_with_generation_id<Ctx: Into<SearcherContext>>(
|
||||
context: Ctx,
|
||||
segment_readers: Vec<Arc<dyn SegmentReader>>,
|
||||
generation_id: u64,
|
||||
) -> crate::Result<Searcher> {
|
||||
let context = context.into();
|
||||
let generation = SearcherGeneration::from_segment_readers(&segment_readers, generation_id);
|
||||
let tracked_generation = Inventory::default().track(generation);
|
||||
let inner = SearcherInner::new(
|
||||
context,
|
||||
segment_readers,
|
||||
tracked_generation,
|
||||
DOCSTORE_CACHE_CAPACITY,
|
||||
)?;
|
||||
Ok(Arc::new(inner).into())
|
||||
}
|
||||
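A minimal usage sketch of the constructor above, following the `SearcherContext` / `Arc<dyn SegmentReader>` side of this diff (the helper name and variable names are illustrative, not part of the change); the `test_searcher_from_external_segment_readers` test further down exercises the same flow.

```rust
use std::sync::Arc;
use tantivy::{Searcher, SearcherContext, SegmentReader};

// Hypothetical helper: wrap externally obtained segment readers in a Searcher.
// `from_segment_readers` uses generation_id = 0; the `_with_generation_id`
// variant lets you pick another value.
fn searcher_from_external(
    context: SearcherContext,
    segment_readers: Vec<Arc<dyn SegmentReader>>,
) -> tantivy::Result<Searcher> {
    Searcher::from_segment_readers(context, segment_readers)
}
```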
|
||||
/// Returns the search context associated with the `Searcher`.
|
||||
pub fn context(&self) -> &SearcherContext {
|
||||
&self.inner.context
|
||||
}
|
||||
|
||||
/// Deprecated alias for [`Searcher::context`].
|
||||
#[deprecated(note = "use Searcher::context()")]
|
||||
pub fn index(&self) -> &SearcherContext {
|
||||
self.context()
|
||||
}
|
||||
|
||||
/// Access the search executor associated with this searcher.
|
||||
pub fn search_executor(&self) -> &Executor {
|
||||
self.context().search_executor()
|
||||
}
|
||||
|
||||
/// Access the tokenizer manager associated with this searcher.
|
||||
pub fn tokenizers(&self) -> &TokenizerManager {
|
||||
self.context().tokenizers()
|
||||
}
|
||||
|
||||
/// Access the fast field tokenizer manager associated with this searcher.
|
||||
pub fn fast_field_tokenizer(&self) -> &TokenizerManager {
|
||||
self.context().fast_field_tokenizer()
|
||||
}
|
||||
|
||||
/// Get the tokenizer associated with a specific field.
|
||||
pub fn tokenizer_for_field(&self, field: Field) -> crate::Result<TextAnalyzer> {
|
||||
self.context().tokenizer_for_field(field)
|
||||
/// Returns the `Index` associated with the `Searcher`
|
||||
pub fn index(&self) -> &Index {
|
||||
&self.inner.index
|
||||
}
|
||||
|
||||
/// [`SearcherGeneration`] which identifies the version of the snapshot held by this `Searcher`.
|
||||
@@ -239,7 +85,7 @@ impl Searcher {
|
||||
///
|
||||
/// The searcher uses the segment ordinal to route the
|
||||
/// request to the right `Segment`.
|
||||
pub fn doc(&self, doc_address: DocAddress) -> crate::Result<TantivyDocument> {
|
||||
pub fn doc<D: DocumentDeserialize>(&self, doc_address: DocAddress) -> crate::Result<D> {
|
||||
let store_reader = &self.inner.store_readers[doc_address.segment_ord as usize];
|
||||
store_reader.get(doc_address.doc_id)
|
||||
}
|
||||
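For illustration, the generic signature above lets callers pick the document type at the call site. A minimal sketch, assuming a `Searcher` is in scope and that `TantivyDocument`, `DocAddress`, and `Result` remain re-exported at the crate root:

```rust
use tantivy::{DocAddress, Searcher, TantivyDocument};

// Hypothetical helper: fetch the first stored document of the first segment,
// naming the concrete document type through the return type.
fn first_doc(searcher: &Searcher) -> tantivy::Result<TantivyDocument> {
    searcher.doc(DocAddress::new(0, 0))
}
```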
@@ -259,15 +105,18 @@ impl Searcher {
|
||||
|
||||
/// Fetches a document in an asynchronous manner.
|
||||
#[cfg(feature = "quickwit")]
|
||||
pub async fn doc_async(&self, doc_address: DocAddress) -> crate::Result<TantivyDocument> {
|
||||
let executor = self.search_executor();
|
||||
pub async fn doc_async<D: DocumentDeserialize>(
|
||||
&self,
|
||||
doc_address: DocAddress,
|
||||
) -> crate::Result<D> {
|
||||
let executor = self.inner.index.search_executor();
|
||||
let store_reader = &self.inner.store_readers[doc_address.segment_ord as usize];
|
||||
store_reader.get_async(doc_address.doc_id, executor).await
|
||||
}
|
||||
|
||||
/// Access the schema associated with the index of this searcher.
|
||||
pub fn schema(&self) -> &Schema {
|
||||
self.context().schema()
|
||||
&self.inner.schema
|
||||
}
|
||||
|
||||
/// Returns the overall number of documents in the index.
|
||||
@@ -305,13 +154,13 @@ impl Searcher {
|
||||
}
|
||||
|
||||
/// Return the list of segment readers
|
||||
pub fn segment_readers(&self) -> &[Arc<dyn SegmentReader>] {
|
||||
pub fn segment_readers(&self) -> &[SegmentReader] {
|
||||
&self.inner.segment_readers
|
||||
}
|
||||
|
||||
/// Returns the segment_reader associated with the given segment_ord
|
||||
pub fn segment_reader(&self, segment_ord: u32) -> &dyn SegmentReader {
|
||||
self.inner.segment_readers[segment_ord as usize].as_ref()
|
||||
pub fn segment_reader(&self, segment_ord: u32) -> &SegmentReader {
|
||||
&self.inner.segment_readers[segment_ord as usize]
|
||||
}
|
||||
|
||||
/// Runs a query on the segment readers wrapped by the searcher.
|
||||
@@ -352,7 +201,7 @@ impl Searcher {
|
||||
} else {
|
||||
EnableScoring::disabled_from_searcher(self)
|
||||
};
|
||||
let executor = self.search_executor();
|
||||
let executor = self.inner.index.search_executor();
|
||||
self.search_with_executor(query, collector, executor, enabled_scoring)
|
||||
}
|
||||
|
||||
@@ -380,11 +229,7 @@ impl Searcher {
|
||||
let segment_readers = self.segment_readers();
|
||||
let fruits = executor.map(
|
||||
|(segment_ord, segment_reader)| {
|
||||
collector.collect_segment(
|
||||
weight.as_ref(),
|
||||
segment_ord as u32,
|
||||
segment_reader.as_ref(),
|
||||
)
|
||||
collector.collect_segment(weight.as_ref(), segment_ord as u32, segment_reader)
|
||||
},
|
||||
segment_readers.iter().enumerate(),
|
||||
)?;
|
||||
@@ -412,17 +257,19 @@ impl From<Arc<SearcherInner>> for Searcher {
|
||||
/// It guarantees that the `Segment` will not be removed before
|
||||
/// the destruction of the `Searcher`.
|
||||
pub(crate) struct SearcherInner {
|
||||
context: SearcherContext,
|
||||
segment_readers: Vec<Arc<dyn SegmentReader>>,
|
||||
store_readers: Vec<Box<dyn StoreReader>>,
|
||||
schema: Schema,
|
||||
index: Index,
|
||||
segment_readers: Vec<SegmentReader>,
|
||||
store_readers: Vec<StoreReader>,
|
||||
generation: TrackedObject<SearcherGeneration>,
|
||||
}
|
||||
|
||||
impl SearcherInner {
|
||||
/// Creates a new `Searcher`
|
||||
pub(crate) fn new(
|
||||
context: SearcherContext,
|
||||
segment_readers: Vec<Arc<dyn SegmentReader>>,
|
||||
schema: Schema,
|
||||
index: Index,
|
||||
segment_readers: Vec<SegmentReader>,
|
||||
generation: TrackedObject<SearcherGeneration>,
|
||||
doc_store_cache_num_blocks: usize,
|
||||
) -> io::Result<SearcherInner> {
|
||||
@@ -434,13 +281,14 @@ impl SearcherInner {
|
||||
generation.segments(),
|
||||
"Set of segments referenced by this Searcher and its SearcherGeneration must match"
|
||||
);
|
||||
let store_readers: Vec<Box<dyn StoreReader>> = segment_readers
|
||||
let store_readers: Vec<StoreReader> = segment_readers
|
||||
.iter()
|
||||
.map(|segment_reader| segment_reader.get_store_reader(doc_store_cache_num_blocks))
|
||||
.collect::<io::Result<Vec<_>>>()?;
|
||||
|
||||
Ok(SearcherInner {
|
||||
context,
|
||||
schema,
|
||||
index,
|
||||
segment_readers,
|
||||
store_readers,
|
||||
generation,
|
||||
@@ -453,7 +301,7 @@ impl fmt::Debug for Searcher {
|
||||
let segment_ids = self
|
||||
.segment_readers()
|
||||
.iter()
|
||||
.map(|segment_reader| segment_reader.segment_id())
|
||||
.map(SegmentReader::segment_id)
|
||||
.collect::<Vec<_>>();
|
||||
write!(f, "Searcher({segment_ids:?})")
|
||||
}
|
||||
|
||||
@@ -7,8 +7,8 @@ use crate::query::TermQuery;
|
||||
use crate::schema::{Field, IndexRecordOption, Schema, INDEXED, STRING, TEXT};
|
||||
use crate::tokenizer::TokenizerManager;
|
||||
use crate::{
|
||||
Directory, DocSet, Executor, Index, IndexBuilder, IndexReader, IndexSettings, IndexWriter,
|
||||
ReloadPolicy, Searcher, SearcherContext, TantivyDocument, Term,
|
||||
Directory, DocSet, Index, IndexBuilder, IndexReader, IndexSettings, IndexWriter, ReloadPolicy,
|
||||
TantivyDocument, Term,
|
||||
};
|
||||
|
||||
#[test]
|
||||
@@ -300,40 +300,6 @@ fn test_single_segment_index_writer() -> crate::Result<()> {
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_searcher_from_external_segment_readers() -> crate::Result<()> {
|
||||
let mut schema_builder = Schema::builder();
|
||||
let text_field = schema_builder.add_text_field("text", TEXT);
|
||||
let schema = schema_builder.build();
|
||||
let index = Index::create_in_ram(schema.clone());
|
||||
let mut writer: IndexWriter = index.writer_for_tests()?;
|
||||
writer.add_document(doc!(text_field => "hello"))?;
|
||||
writer.add_document(doc!(text_field => "hello"))?;
|
||||
writer.commit()?;
|
||||
|
||||
let reader = index.reader()?;
|
||||
let searcher = reader.searcher();
|
||||
let segment_readers = searcher.segment_readers().to_vec();
|
||||
let context = SearcherContext::new(
|
||||
schema,
|
||||
Executor::single_thread(),
|
||||
TokenizerManager::default(),
|
||||
TokenizerManager::default(),
|
||||
);
|
||||
let custom_searcher =
|
||||
Searcher::from_segment_readers_with_generation_id(context, segment_readers, 42)?;
|
||||
|
||||
let term_query = TermQuery::new(
|
||||
Term::from_field_text(text_field, "hello"),
|
||||
IndexRecordOption::Basic,
|
||||
);
|
||||
let count = custom_searcher.search(&term_query, &Count)?;
|
||||
assert_eq!(count, 2);
|
||||
assert_eq!(custom_searcher.generation().generation_id(), 42);
|
||||
assert_eq!(custom_searcher.segment_readers().len(), 1);
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_merging_segment_update_docfreq() {
|
||||
let mut schema_builder = Schema::builder();
|
||||
|
||||
@@ -167,9 +167,6 @@ impl CompositeFile {
|
||||
.map(|byte_range| self.data.slice(byte_range.clone()))
|
||||
}
|
||||
|
||||
/// Returns per-field byte usage for all slices stored in this composite file.
|
||||
///
|
||||
/// The provided `schema` is used to resolve field ids into field names.
|
||||
pub fn space_usage(&self, schema: &Schema) -> PerFieldSpaceUsage {
|
||||
let mut fields = Vec::new();
|
||||
for (&field_addr, byte_range) in &self.offsets_index {
|
||||
|
||||
@@ -676,7 +676,7 @@ mod tests {
|
||||
let num_segments = reader.searcher().segment_readers().len();
|
||||
assert!(num_segments <= 4);
|
||||
let num_components_except_deletes_and_tempstore =
|
||||
crate::index::SegmentComponent::iterator().len() - 1;
|
||||
crate::index::SegmentComponent::iterator().len() - 2;
|
||||
let max_num_mmapped = num_components_except_deletes_and_tempstore * num_segments;
|
||||
assert_eventually(|| {
|
||||
let num_mmapped = mmap_directory.get_cache_info().mmapped.len();
|
||||
|
||||
@@ -21,7 +21,7 @@ use std::path::PathBuf;
|
||||
pub use common::file_slice::{FileHandle, FileSlice};
|
||||
pub use common::{AntiCallToken, OwnedBytes, TerminatingWrite};
|
||||
|
||||
pub use self::composite_file::{CompositeFile, CompositeWrite};
|
||||
pub(crate) use self::composite_file::{CompositeFile, CompositeWrite};
|
||||
pub use self::directory::{Directory, DirectoryClone, DirectoryLock};
|
||||
pub use self::directory_lock::{Lock, INDEX_WRITER_LOCK, META_LOCK};
|
||||
pub use self::ram_directory::RamDirectory;
|
||||
@@ -52,7 +52,7 @@ pub use self::mmap_directory::MmapDirectory;
|
||||
///
|
||||
/// `WritePtr` are required to implement both Write
|
||||
/// and Seek.
|
||||
pub type WritePtr = BufWriter<Box<dyn TerminatingWrite + Send + Sync>>;
|
||||
pub type WritePtr = BufWriter<Box<dyn TerminatingWrite>>;
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests;
|
||||
|
||||
146 src/docset.rs
@@ -1,5 +1,4 @@
|
||||
use std::borrow::BorrowMut;
|
||||
use std::ops::{Deref as _, DerefMut as _};
|
||||
use std::borrow::{Borrow, BorrowMut};
|
||||
|
||||
use common::BitSet;
|
||||
|
||||
@@ -54,55 +53,31 @@ pub trait DocSet: Send {
|
||||
doc
|
||||
}
|
||||
|
||||
/// !!!Dragons ahead!!!
|
||||
/// In spirit, this is an approximate and dangerous version of `seek`.
|
||||
///
|
||||
/// It can leave the DocSet in an `invalid` state and might return a
|
||||
/// lower bound of what the result of Seek would have been.
|
||||
///
|
||||
///
|
||||
/// More accurately, it returns either:
/// - Found if the target is in the docset. In that case, the DocSet is left in a valid state.
/// - SeekLowerBound(seek_lower_bound) if the target is not in the docset. In that case, the
/// DocSet can be left in an invalid state. The DocSet should then only receive calls to
/// `seek_danger(..)` until it returns `Found`, at which point it is back in a valid state.
|
||||
///
|
||||
/// `seek_lower_bound` can be any `DocId` (in the docset or not) as long as it is in
|
||||
/// `(target .. seek_result] U {TERMINATED}` where `seek_result` is the first document in the
|
||||
/// docset greater than `target`.
|
||||
///
|
||||
/// `seek_danger` may return `SeekLowerBound(TERMINATED)`.
|
||||
///
|
||||
/// Calling `seek_danger` with TERMINATED as a target is allowed,
/// and should always return `SeekLowerBound(TERMINATED)`, since TERMINATED is NOT in
/// the DocSet.
|
||||
/// Seeks to the target if possible and returns true if the target is in the DocSet.
|
||||
///
|
||||
/// DocSets that already have an efficient `seek` method don't need to implement
|
||||
/// `seek_danger`.
|
||||
/// `seek_into_the_danger_zone`. All wrapper DocSets should forward
|
||||
/// `seek_into_the_danger_zone` to the underlying DocSet.
|
||||
///
|
||||
/// Consecutive calls to seek_danger are guaranteed to have strictly increasing `target`
|
||||
/// values.
|
||||
fn seek_danger(&mut self, target: DocId) -> SeekDangerResult {
|
||||
if target >= TERMINATED {
|
||||
debug_assert!(target == TERMINATED);
|
||||
// No need to advance.
|
||||
return SeekDangerResult::SeekLowerBound(target);
|
||||
}
|
||||
|
||||
// The default implementation does not include any
|
||||
// `danger zone` behavior.
|
||||
//
|
||||
// It does not leave the scorer in an invalid state.
|
||||
// For this reason, we can safely call `self.doc()`.
|
||||
let mut doc = self.doc();
|
||||
if doc < target {
|
||||
doc = self.seek(target);
|
||||
}
|
||||
if doc == target {
|
||||
SeekDangerResult::Found
|
||||
} else {
|
||||
SeekDangerResult::SeekLowerBound(doc)
|
||||
/// ## API Behaviour
|
||||
/// If `seek_into_the_danger_zone` returns true, a call to `doc()` must return `target`.
/// If `seek_into_the_danger_zone` returns false, a call to `doc()` may return any doc
/// between the last doc that matched and `target`, or a doc that is a valid next hit after
/// `target`. The DocSet is considered to be in an invalid state until
/// `seek_into_the_danger_zone` returns true again.
///
/// `target` needs to be equal to or larger than `doc` when in a valid state.
|
||||
///
|
||||
/// Consecutive calls are not allowed to have decreasing `target` values.
|
||||
///
|
||||
/// # Warning
|
||||
/// This is an advanced API used by intersections. The API contract is tricky; avoid using it.
|
||||
fn seek_into_the_danger_zone(&mut self, target: DocId) -> bool {
|
||||
let current_doc = self.doc();
|
||||
if current_doc < target {
|
||||
self.seek(target);
|
||||
}
|
||||
self.doc() == target
|
||||
}
|
||||
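The contract above is easiest to see from the caller's side. Below is a hypothetical, minimal sketch (not part of this diff) of an intersection-style loop driving `seek_into_the_danger_zone`; the `next_match` helper and the `left`/`right` names are illustrative, and the sketch assumes `DocSet`, `DocId`, and `TERMINATED` stay re-exported at the crate root.

```rust
use tantivy::{DocId, DocSet, TERMINATED};

// Advance two DocSets to their next common document.
// `right` may be left in an invalid state whenever the call returns false;
// we only rely on the boolean, and every target we pass is strictly larger
// than the previous one, as the contract requires.
fn next_match(left: &mut dyn DocSet, right: &mut dyn DocSet) -> DocId {
    let mut candidate = left.doc();
    loop {
        if candidate == TERMINATED {
            return TERMINATED;
        }
        if right.seek_into_the_danger_zone(candidate) {
            // Here `right.doc()` is guaranteed to equal `candidate`.
            return candidate;
        }
        // `advance` moves past the current doc, so targets keep increasing.
        candidate = left.advance();
    }
}
```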
|
||||
/// Fills a given mutable buffer with the next doc ids from the
|
||||
@@ -133,14 +108,10 @@ pub trait DocSet: Send {
|
||||
buffer.len()
|
||||
}
|
||||
|
||||
/// Fills the given bitset with the documents in the docset.
|
||||
///
|
||||
/// If the bitset's max_value is smaller than the largest doc in the docset, this function
/// might not consume the docset entirely.
|
||||
/// TODO comment on the size of the bitset
|
||||
fn fill_bitset(&mut self, bitset: &mut BitSet) {
|
||||
let bitset_max_value: u32 = bitset.max_value();
|
||||
let mut doc = self.doc();
|
||||
while doc < bitset_max_value {
|
||||
while doc != TERMINATED {
|
||||
bitset.insert(doc);
|
||||
doc = self.advance();
|
||||
}
|
||||
@@ -206,17 +177,6 @@ pub trait DocSet: Send {
|
||||
}
|
||||
}
|
||||
|
||||
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
|
||||
pub enum SeekDangerResult {
|
||||
/// The target was found in the DocSet.
|
||||
Found,
|
||||
/// The target was not found in the DocSet.
|
||||
/// We return a lower bound for where the value could be.
/// The returned DocId can be any DocId that is less than or equal to the first document
/// in the docset after the target.
|
||||
SeekLowerBound(DocId),
|
||||
}
|
||||
|
||||
impl DocSet for &mut dyn DocSet {
|
||||
fn advance(&mut self) -> u32 {
|
||||
(**self).advance()
|
||||
@@ -226,8 +186,8 @@ impl DocSet for &mut dyn DocSet {
|
||||
(**self).seek(target)
|
||||
}
|
||||
|
||||
fn seek_danger(&mut self, target: DocId) -> SeekDangerResult {
|
||||
(**self).seek_danger(target)
|
||||
fn seek_into_the_danger_zone(&mut self, target: DocId) -> bool {
|
||||
(**self).seek_into_the_danger_zone(target)
|
||||
}
|
||||
|
||||
fn doc(&self) -> u32 {
|
||||
@@ -249,59 +209,51 @@ impl DocSet for &mut dyn DocSet {
|
||||
fn count_including_deleted(&mut self) -> u32 {
|
||||
(**self).count_including_deleted()
|
||||
}
|
||||
|
||||
fn fill_bitset(&mut self, bitset: &mut BitSet) {
|
||||
(**self).fill_bitset(bitset);
|
||||
}
|
||||
}
|
||||
|
||||
impl<TDocSet: DocSet + ?Sized> DocSet for Box<TDocSet> {
|
||||
#[inline]
|
||||
fn advance(&mut self) -> DocId {
|
||||
self.deref_mut().advance()
|
||||
}
|
||||
|
||||
#[inline]
|
||||
fn seek(&mut self, target: DocId) -> DocId {
|
||||
self.deref_mut().seek(target)
|
||||
}
|
||||
|
||||
#[inline]
|
||||
fn seek_danger(&mut self, target: DocId) -> SeekDangerResult {
|
||||
let unboxed: &mut TDocSet = self.borrow_mut();
|
||||
unboxed.seek_danger(target)
|
||||
unboxed.advance()
|
||||
}
|
||||
|
||||
fn seek(&mut self, target: DocId) -> DocId {
|
||||
let unboxed: &mut TDocSet = self.borrow_mut();
|
||||
unboxed.seek(target)
|
||||
}
|
||||
|
||||
fn seek_into_the_danger_zone(&mut self, target: DocId) -> bool {
|
||||
let unboxed: &mut TDocSet = self.borrow_mut();
|
||||
unboxed.seek_into_the_danger_zone(target)
|
||||
}
|
||||
|
||||
#[inline]
|
||||
fn fill_buffer(&mut self, buffer: &mut [DocId; COLLECT_BLOCK_BUFFER_LEN]) -> usize {
|
||||
self.deref_mut().fill_buffer(buffer)
|
||||
let unboxed: &mut TDocSet = self.borrow_mut();
|
||||
unboxed.fill_buffer(buffer)
|
||||
}
|
||||
|
||||
#[inline]
|
||||
fn doc(&self) -> DocId {
|
||||
self.deref().doc()
|
||||
let unboxed: &TDocSet = self.borrow();
|
||||
unboxed.doc()
|
||||
}
|
||||
|
||||
#[inline]
|
||||
fn size_hint(&self) -> u32 {
|
||||
self.deref().size_hint()
|
||||
let unboxed: &TDocSet = self.borrow();
|
||||
unboxed.size_hint()
|
||||
}
|
||||
|
||||
#[inline]
|
||||
fn cost(&self) -> u64 {
|
||||
self.deref().cost()
|
||||
let unboxed: &TDocSet = self.borrow();
|
||||
unboxed.cost()
|
||||
}
|
||||
|
||||
#[inline]
|
||||
fn count(&mut self, alive_bitset: &AliveBitSet) -> u32 {
|
||||
self.deref_mut().count(alive_bitset)
|
||||
let unboxed: &mut TDocSet = self.borrow_mut();
|
||||
unboxed.count(alive_bitset)
|
||||
}
|
||||
|
||||
fn count_including_deleted(&mut self) -> u32 {
|
||||
self.deref_mut().count_including_deleted()
|
||||
}
|
||||
|
||||
fn fill_bitset(&mut self, bitset: &mut BitSet) {
|
||||
self.deref_mut().fill_bitset(bitset);
|
||||
let unboxed: &mut TDocSet = self.borrow_mut();
|
||||
unboxed.count_including_deleted()
|
||||
}
|
||||
}
|
||||
|
||||
@@ -162,7 +162,7 @@ mod tests {
|
||||
mod bench {
|
||||
|
||||
use rand::prelude::IteratorRandom;
|
||||
use rand::rng;
|
||||
use rand::thread_rng;
|
||||
use test::Bencher;
|
||||
|
||||
use super::AliveBitSet;
|
||||
@@ -176,7 +176,7 @@ mod bench {
|
||||
}
|
||||
|
||||
fn remove_rand(raw: &mut Vec<u32>) {
|
||||
let i = (0..raw.len()).choose(&mut rng()).unwrap();
|
||||
let i = (0..raw.len()).choose(&mut thread_rng()).unwrap();
|
||||
raw.remove(i);
|
||||
}
|
||||
|
||||
|
||||
@@ -84,7 +84,9 @@ mod tests {
|
||||
let mut facet = Facet::default();
|
||||
facet_reader.facet_from_ord(0, &mut facet).unwrap();
|
||||
assert_eq!(facet.to_path_string(), "/a/b");
|
||||
let doc = searcher.doc(DocAddress::new(0u32, 0u32)).unwrap();
|
||||
let doc = searcher
|
||||
.doc::<TantivyDocument>(DocAddress::new(0u32, 0u32))
|
||||
.unwrap();
|
||||
let value = doc
|
||||
.get_first(facet_field)
|
||||
.and_then(|v| v.as_value().as_facet());
|
||||
@@ -143,7 +145,7 @@ mod tests {
|
||||
let mut facet_ords = Vec::new();
|
||||
facet_ords.extend(facet_reader.facet_ords(0u32));
|
||||
assert_eq!(&facet_ords, &[0u64]);
|
||||
let doc = searcher.doc(DocAddress::new(0u32, 0u32))?;
|
||||
let doc = searcher.doc::<TantivyDocument>(DocAddress::new(0u32, 0u32))?;
|
||||
let value: Option<Facet> = doc
|
||||
.get_first(facet_field)
|
||||
.and_then(|v| v.as_facet())
|
||||
|
||||
@@ -96,7 +96,7 @@ mod tests {
|
||||
};
|
||||
use crate::time::OffsetDateTime;
|
||||
use crate::tokenizer::{LowerCaser, RawTokenizer, TextAnalyzer, TokenizerManager};
|
||||
use crate::{Index, IndexWriter};
|
||||
use crate::{Index, IndexWriter, SegmentReader};
|
||||
|
||||
pub static SCHEMA: Lazy<Schema> = Lazy::new(|| {
|
||||
let mut schema_builder = Schema::builder();
|
||||
@@ -430,7 +430,7 @@ mod tests {
|
||||
.searcher()
|
||||
.segment_readers()
|
||||
.iter()
|
||||
.map(|segment_reader| segment_reader.segment_id())
|
||||
.map(SegmentReader::segment_id)
|
||||
.collect();
|
||||
assert_eq!(segment_ids.len(), 2);
|
||||
index_writer.merge(&segment_ids[..]).wait().unwrap();
|
||||
@@ -879,7 +879,7 @@ mod tests {
|
||||
const ONE_HOUR_IN_MICROSECS: i64 = 3_600 * 1_000_000;
|
||||
let times: Vec<DateTime> = std::iter::repeat_with(|| {
|
||||
// +- One hour.
|
||||
let t = T0 + rng.random_range(-ONE_HOUR_IN_MICROSECS..ONE_HOUR_IN_MICROSECS);
|
||||
let t = T0 + rng.gen_range(-ONE_HOUR_IN_MICROSECS..ONE_HOUR_IN_MICROSECS);
|
||||
DateTime::from_timestamp_micros(t)
|
||||
})
|
||||
.take(1_000)
|
||||
|
||||
@@ -25,8 +25,7 @@ pub struct FastFieldReaders {
|
||||
}
|
||||
|
||||
impl FastFieldReaders {
|
||||
/// Opens the segment fast-field container and binds it to a schema.
|
||||
pub fn open(fast_field_file: FileSlice, schema: Schema) -> io::Result<FastFieldReaders> {
|
||||
pub(crate) fn open(fast_field_file: FileSlice, schema: Schema) -> io::Result<FastFieldReaders> {
|
||||
let columnar = Arc::new(ColumnarReader::open(fast_field_file)?);
|
||||
Ok(FastFieldReaders { columnar, schema })
|
||||
}
|
||||
@@ -40,8 +39,7 @@ impl FastFieldReaders {
|
||||
self.resolve_column_name_given_default_field(column_name, default_field_opt)
|
||||
}
|
||||
|
||||
/// Returns per-field space usage for all loaded fast-field columns.
|
||||
pub fn space_usage(&self) -> io::Result<PerFieldSpaceUsage> {
|
||||
pub(crate) fn space_usage(&self) -> io::Result<PerFieldSpaceUsage> {
|
||||
let mut per_field_usages: Vec<FieldUsage> = Default::default();
|
||||
for (mut field_name, column_handle) in self.columnar.iter_columns()? {
|
||||
json_path_sep_to_dot(&mut field_name);
|
||||
@@ -53,8 +51,7 @@ impl FastFieldReaders {
|
||||
Ok(PerFieldSpaceUsage::new(per_field_usages))
|
||||
}
|
||||
|
||||
/// Returns the underlying `ColumnarReader`.
|
||||
pub fn columnar(&self) -> &ColumnarReader {
|
||||
pub(crate) fn columnar(&self) -> &ColumnarReader {
|
||||
self.columnar.as_ref()
|
||||
}
|
||||
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
use std::collections::HashSet;
|
||||
|
||||
use rand::{rng, Rng};
|
||||
use rand::{thread_rng, Rng};
|
||||
|
||||
use crate::indexer::index_writer::MEMORY_BUDGET_NUM_BYTES_MIN;
|
||||
use crate::schema::*;
|
||||
@@ -29,7 +29,7 @@ fn test_functional_store() -> crate::Result<()> {
|
||||
let index = Index::create_in_ram(schema);
|
||||
let reader = index.reader()?;
|
||||
|
||||
let mut rng = rng();
|
||||
let mut rng = thread_rng();
|
||||
|
||||
let mut index_writer: IndexWriter =
|
||||
index.writer_with_num_threads(3, 3 * MEMORY_BUDGET_NUM_BYTES_MIN)?;
|
||||
@@ -38,9 +38,9 @@ fn test_functional_store() -> crate::Result<()> {
|
||||
|
||||
let mut doc_id = 0u64;
|
||||
for _iteration in 0..get_num_iterations() {
|
||||
let num_docs: usize = rng.random_range(0..4);
|
||||
let num_docs: usize = rng.gen_range(0..4);
|
||||
if !doc_set.is_empty() {
|
||||
let doc_to_remove_id = rng.random_range(0..doc_set.len());
|
||||
let doc_to_remove_id = rng.gen_range(0..doc_set.len());
|
||||
let removed_doc_id = doc_set.swap_remove(doc_to_remove_id);
|
||||
index_writer.delete_term(Term::from_field_u64(id_field, removed_doc_id));
|
||||
}
|
||||
@@ -70,10 +70,10 @@ const LOREM: &str = "Doc Lorem ipsum dolor sit amet, consectetur adipiscing elit
|
||||
cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat \
|
||||
non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.";
|
||||
fn get_text() -> String {
|
||||
use rand::seq::IndexedRandom;
|
||||
let mut rng = rng();
|
||||
use rand::seq::SliceRandom;
|
||||
let mut rng = thread_rng();
|
||||
let tokens: Vec<_> = LOREM.split(' ').collect();
|
||||
let random_val = rng.random_range(0..20);
|
||||
let random_val = rng.gen_range(0..20);
|
||||
|
||||
(0..random_val)
|
||||
.map(|_| tokens.choose(&mut rng).unwrap())
|
||||
@@ -101,7 +101,7 @@ fn test_functional_indexing_unsorted() -> crate::Result<()> {
|
||||
let index = Index::create_from_tempdir(schema)?;
|
||||
let reader = index.reader()?;
|
||||
|
||||
let mut rng = rng();
|
||||
let mut rng = thread_rng();
|
||||
|
||||
let mut index_writer: IndexWriter =
|
||||
index.writer_with_num_threads(3, 3 * MEMORY_BUDGET_NUM_BYTES_MIN)?;
|
||||
@@ -110,7 +110,7 @@ fn test_functional_indexing_unsorted() -> crate::Result<()> {
|
||||
let mut uncommitted_docs: HashSet<u64> = HashSet::new();
|
||||
|
||||
for _ in 0..get_num_iterations() {
|
||||
let random_val = rng.random_range(0..20);
|
||||
let random_val = rng.gen_range(0..20);
|
||||
if random_val == 0 {
|
||||
index_writer.commit()?;
|
||||
committed_docs.extend(&uncommitted_docs);
|
||||
|
||||
@@ -4,46 +4,35 @@ use serde::{Deserialize, Serialize};
|
||||
|
||||
use crate::codec::{Codec, StandardCodec};
|
||||
|
||||
/// A Codec configuration is just a serializable object.
|
||||
#[derive(Serialize, Deserialize, Clone, Debug)]
|
||||
pub struct CodecConfiguration {
|
||||
codec_id: Cow<'static, str>,
|
||||
name: Cow<'static, str>,
|
||||
#[serde(default, skip_serializing_if = "serde_json::Value::is_null")]
|
||||
props: serde_json::Value,
|
||||
}
|
||||
|
||||
impl CodecConfiguration {
|
||||
/// Returns true if the codec is the standard codec.
|
||||
pub fn is_standard(&self) -> bool {
|
||||
self.codec_id == StandardCodec::ID && self.props.is_null()
|
||||
pub fn from_codec<C: Codec>(codec: &C) -> Self {
|
||||
CodecConfiguration {
|
||||
name: Cow::Borrowed(C::NAME),
|
||||
props: codec.to_json_props(),
|
||||
}
|
||||
}
|
||||
|
||||
/// Creates a codec instance from the configuration.
|
||||
///
|
||||
/// If the codec id does not match the codec's name, an error is returned.
|
||||
pub fn to_codec<C: Codec>(&self) -> crate::Result<C> {
|
||||
if self.codec_id != C::ID {
|
||||
if self.name != C::NAME {
|
||||
return Err(crate::TantivyError::InvalidArgument(format!(
|
||||
"Codec id mismatch: expected {}, got {}",
|
||||
C::ID,
|
||||
self.codec_id
|
||||
"Codec name mismatch: expected {}, got {}",
|
||||
C::NAME,
|
||||
self.name
|
||||
)));
|
||||
}
|
||||
C::from_json_props(&self.props)
|
||||
}
|
||||
}
|
||||
|
||||
impl<'a, C: Codec> From<&'a C> for CodecConfiguration {
|
||||
fn from(codec: &'a C) -> Self {
|
||||
CodecConfiguration {
|
||||
codec_id: Cow::Borrowed(C::ID),
|
||||
props: codec.to_json_props(),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl Default for CodecConfiguration {
|
||||
fn default() -> Self {
|
||||
CodecConfiguration::from(&StandardCodec)
|
||||
CodecConfiguration::from_codec(&StandardCodec)
|
||||
}
|
||||
}
|
||||
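As a small illustration of the configuration type above, here is a hedged sketch of round-tripping it through JSON. The `from_codec` constructor and the `crate::index::CodecConfiguration` path are assumptions taken from one side of this diff; whether `from_codec` or the `From` impl survives depends on which branch lands.

```rust
use crate::codec::StandardCodec;
use crate::index::CodecConfiguration; // assumed module path

// Serialize the configuration of the standard codec and rebuild the codec from it.
fn roundtrip_codec_config() -> crate::Result<()> {
    let config = CodecConfiguration::from_codec(&StandardCodec);
    let json = serde_json::to_string(&config).expect("codec config serializes to JSON");
    let restored: CodecConfiguration = serde_json::from_str(&json).expect("valid JSON");
    let _codec: StandardCodec = restored.to_codec()?;
    Ok(())
}
```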
|
||||
@@ -26,6 +26,7 @@ use crate::reader::{IndexReader, IndexReaderBuilder};
|
||||
use crate::schema::document::Document;
|
||||
use crate::schema::{Field, FieldType, Schema};
|
||||
use crate::tokenizer::{TextAnalyzer, TokenizerManager};
|
||||
use crate::SegmentReader;
|
||||
|
||||
fn load_metas(
|
||||
directory: &dyn Directory,
|
||||
@@ -275,7 +276,7 @@ impl<Codec: crate::codec::Codec> IndexBuilder<Codec> {
|
||||
fn create_avoid_monomorphization(self, dir: Box<dyn Directory>) -> crate::Result<Index<Codec>> {
|
||||
self.validate()?;
|
||||
let directory = ManagedDirectory::wrap(dir)?;
|
||||
let codec: CodecConfiguration = CodecConfiguration::from(&self.codec);
|
||||
let codec: CodecConfiguration = CodecConfiguration::from_codec(&self.codec);
|
||||
save_new_metas(
|
||||
self.get_expect_schema()?,
|
||||
self.index_settings.clone(),
|
||||
@@ -393,7 +394,6 @@ impl Index {
|
||||
Self::open_in_dir_to_avoid_monomorphization(directory_path.as_ref())
|
||||
}
|
||||
|
||||
#[cfg(feature = "mmap")]
|
||||
#[inline(never)]
|
||||
fn open_in_dir_to_avoid_monomorphization(directory_path: &Path) -> crate::Result<Index> {
|
||||
let mmap_directory = MmapDirectory::open(directory_path)?;
|
||||
@@ -407,6 +407,22 @@ impl Index {
|
||||
}
|
||||
|
||||
impl<Codec: crate::codec::Codec> Index<Codec> {
|
||||
/// Returns a version of this index with the standard codec.
|
||||
/// This is useful when you need to pass the index to APIs that
|
||||
/// don't care about the codec (e.g., for reading).
|
||||
pub(crate) fn with_standard_codec(&self) -> Index<StandardCodec> {
|
||||
Index {
|
||||
directory: self.directory.clone(),
|
||||
schema: self.schema.clone(),
|
||||
settings: self.settings.clone(),
|
||||
executor: self.executor.clone(),
|
||||
tokenizers: self.tokenizers.clone(),
|
||||
fast_field_tokenizers: self.fast_field_tokenizers.clone(),
|
||||
inventory: self.inventory.clone(),
|
||||
codec: StandardCodec,
|
||||
}
|
||||
}
|
||||
|
||||
/// Open the index using the provided directory
|
||||
#[inline(never)]
|
||||
pub fn open_with_codec(directory: Box<dyn Directory>) -> crate::Result<Index<Codec>> {
|
||||
@@ -562,15 +578,7 @@ impl<Codec: crate::codec::Codec> Index<Codec> {
|
||||
let segments = self.searchable_segments()?;
|
||||
let fields_metadata: Vec<Vec<FieldMetadata>> = segments
|
||||
.into_iter()
|
||||
.map(|segment| {
|
||||
let segment_reader = segment.index().codec().open_segment_reader(
|
||||
segment.index().directory(),
|
||||
segment.meta(),
|
||||
segment.schema(),
|
||||
None,
|
||||
)?;
|
||||
segment_reader.fields_metadata()
|
||||
})
|
||||
.map(|segment| SegmentReader::open(&segment)?.fields_metadata())
|
||||
.collect::<Result<_, _>>()?;
|
||||
Ok(merge_field_meta_data(fields_metadata))
|
||||
}
|
||||
@@ -776,7 +784,7 @@ impl<Codec: crate::codec::Codec> Index<Codec> {
|
||||
}
|
||||
|
||||
impl fmt::Debug for Index {
|
||||
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
|
||||
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
|
||||
write!(f, "Index({:?})", self.directory)
|
||||
}
|
||||
}
|
||||
|
||||
@@ -1,6 +1,8 @@
|
||||
use std::collections::HashSet;
|
||||
use std::fmt;
|
||||
use std::path::PathBuf;
|
||||
use std::sync::atomic::AtomicBool;
|
||||
use std::sync::Arc;
|
||||
|
||||
use serde::{Deserialize, Serialize};
|
||||
|
||||
@@ -36,6 +38,7 @@ impl SegmentMetaInventory {
|
||||
let inner = InnerSegmentMeta {
|
||||
segment_id,
|
||||
max_doc,
|
||||
include_temp_doc_store: Arc::new(AtomicBool::new(true)),
|
||||
deletes: None,
|
||||
};
|
||||
SegmentMeta::from(self.inventory.track(inner))
|
||||
@@ -83,6 +86,15 @@ impl SegmentMeta {
|
||||
self.tracked.segment_id
|
||||
}
|
||||
|
||||
/// Removes the SegmentComponent::TempStore from the alive list and
/// therefore marks the temp docstore file to be deleted by
/// garbage collection.
|
||||
pub fn untrack_temp_docstore(&self) {
|
||||
self.tracked
|
||||
.include_temp_doc_store
|
||||
.store(false, std::sync::atomic::Ordering::Relaxed);
|
||||
}
|
||||
|
||||
/// Returns the number of deleted documents.
|
||||
pub fn num_deleted_docs(&self) -> u32 {
|
||||
self.tracked
|
||||
@@ -100,9 +112,20 @@ impl SegmentMeta {
|
||||
/// is by removing all files that have been created by tantivy
|
||||
/// and are not used by any segment anymore.
|
||||
pub fn list_files(&self) -> HashSet<PathBuf> {
|
||||
SegmentComponent::iterator()
|
||||
.map(|component| self.relative_path(*component))
|
||||
.collect::<HashSet<PathBuf>>()
|
||||
if self
|
||||
.tracked
|
||||
.include_temp_doc_store
|
||||
.load(std::sync::atomic::Ordering::Relaxed)
|
||||
{
|
||||
SegmentComponent::iterator()
|
||||
.map(|component| self.relative_path(*component))
|
||||
.collect::<HashSet<PathBuf>>()
|
||||
} else {
|
||||
SegmentComponent::iterator()
|
||||
.filter(|comp| *comp != &SegmentComponent::TempStore)
|
||||
.map(|component| self.relative_path(*component))
|
||||
.collect::<HashSet<PathBuf>>()
|
||||
}
|
||||
}
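The two methods above cooperate: while `include_temp_doc_store` is `true`, `list_files()` keeps the `.store.temp` path alive; calling `untrack_temp_docstore()` drops it from the list so the next garbage-collection pass may delete the file. A hedged sketch, assuming `SegmentMeta` is reachable from the crate root as on mainline tantivy:

```rust
use tantivy::SegmentMeta;

// Sketch: once the merged doc store has been written out, untrack the
// temporary store so garbage collection is allowed to remove it.
fn release_temp_store(meta: &SegmentMeta) {
    meta.untrack_temp_docstore();
    let still_listed = meta
        .list_files()
        .iter()
        .any(|path| path.to_string_lossy().ends_with(".store.temp"));
    assert!(!still_listed);
}
```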
|
||||
|
||||
/// Returns the relative path of a component of our segment.
|
||||
@@ -116,6 +139,7 @@ impl SegmentMeta {
|
||||
SegmentComponent::Positions => ".pos".to_string(),
|
||||
SegmentComponent::Terms => ".term".to_string(),
|
||||
SegmentComponent::Store => ".store".to_string(),
|
||||
SegmentComponent::TempStore => ".store.temp".to_string(),
|
||||
SegmentComponent::FastFields => ".fast".to_string(),
|
||||
SegmentComponent::FieldNorms => ".fieldnorm".to_string(),
|
||||
SegmentComponent::Delete => format!(".{}.del", self.delete_opstamp().unwrap_or(0)),
|
||||
@@ -160,6 +184,7 @@ impl SegmentMeta {
|
||||
segment_id: inner_meta.segment_id,
|
||||
max_doc,
|
||||
deletes: None,
|
||||
include_temp_doc_store: Arc::new(AtomicBool::new(true)),
|
||||
});
|
||||
SegmentMeta { tracked }
|
||||
}
|
||||
@@ -178,6 +203,7 @@ impl SegmentMeta {
|
||||
let tracked = self.tracked.map(move |inner_meta| InnerSegmentMeta {
|
||||
segment_id: inner_meta.segment_id,
|
||||
max_doc: inner_meta.max_doc,
|
||||
include_temp_doc_store: Arc::new(AtomicBool::new(true)),
|
||||
deletes: Some(delete_meta),
|
||||
});
|
||||
SegmentMeta { tracked }
|
||||
@@ -189,6 +215,14 @@ struct InnerSegmentMeta {
|
||||
segment_id: SegmentId,
|
||||
max_doc: u32,
|
||||
pub deletes: Option<DeleteMeta>,
|
||||
/// Set this to `true` to prevent the `SegmentComponent::TempStore` file from being
/// picked up and deleted by garbage collection. This is used during merge.
|
||||
#[serde(skip)]
|
||||
#[serde(default = "default_temp_store")]
|
||||
pub(crate) include_temp_doc_store: Arc<AtomicBool>,
|
||||
}
|
||||
fn default_temp_store() -> Arc<AtomicBool> {
|
||||
Arc::new(AtomicBool::new(false))
|
||||
}
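The `#[serde(skip)]` / `#[serde(default = ...)]` pair above means the flag is never written to the serialized meta, and every deserialized `InnerSegmentMeta` comes back with `false`, i.e. a reloaded segment does not protect its temp doc store. A generic sketch of the same pattern (illustrative struct, not tantivy code):

```rust
use std::sync::atomic::AtomicBool;
use std::sync::Arc;

use serde::{Deserialize, Serialize};

// Runtime-only flag: skipped on serialization, reset to `false` on load.
#[derive(Serialize, Deserialize)]
struct Meta {
    max_doc: u32,
    #[serde(skip)]
    #[serde(default = "default_flag")]
    runtime_flag: Arc<AtomicBool>,
}

fn default_flag() -> Arc<AtomicBool> {
    Arc::new(AtomicBool::new(false))
}

fn main() {
    let loaded: Meta = serde_json::from_str(r#"{"max_doc": 3}"#).unwrap();
    assert!(!loaded.runtime_flag.load(std::sync::atomic::Ordering::Relaxed));
}
```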
|
||||
|
||||
impl InnerSegmentMeta {
|
||||
@@ -288,9 +322,9 @@ pub struct IndexMeta {
|
||||
#[serde(skip_serializing_if = "Option::is_none")]
|
||||
pub payload: Option<String>,
|
||||
/// Codec configuration for the index.
|
||||
#[serde(skip_serializing_if = "CodecConfiguration::is_standard")]
|
||||
pub codec: CodecConfiguration,
|
||||
}
|
||||
|
||||
#[derive(Deserialize, Debug)]
|
||||
struct UntrackedIndexMeta {
|
||||
pub segments: Vec<InnerSegmentMeta>,
|
||||
@@ -334,7 +368,7 @@ impl IndexMeta {
|
||||
schema,
|
||||
opstamp: 0u64,
|
||||
payload: None,
|
||||
codec: CodecConfiguration::from(codec),
|
||||
codec: CodecConfiguration::from_codec(codec),
|
||||
}
|
||||
}
|
||||
|
||||
@@ -387,36 +421,13 @@ mod tests {
|
||||
payload: None,
|
||||
codec: Default::default(),
|
||||
};
|
||||
let json_value: serde_json::Value =
|
||||
serde_json::to_value(&index_metas).expect("serialization failed");
|
||||
let json = serde_json::ser::to_string(&index_metas).expect("serialization failed");
|
||||
assert_eq!(
|
||||
&json_value,
|
||||
&serde_json::json!(
|
||||
{
|
||||
"index_settings": {
|
||||
"docstore_compression": "none",
|
||||
"docstore_blocksize": 16384
|
||||
},
|
||||
"segments": [],
|
||||
"schema": [
|
||||
{
|
||||
"name": "text",
|
||||
"type": "text",
|
||||
"options": {
|
||||
"indexing": {
|
||||
"record": "position",
|
||||
"fieldnorms": true,
|
||||
"tokenizer": "default"
|
||||
},
|
||||
"stored": false,
|
||||
"fast": false
|
||||
}
|
||||
}
|
||||
],
|
||||
"opstamp": 0
|
||||
})
|
||||
json,
|
||||
r#"{"index_settings":{"docstore_compression":"none","docstore_blocksize":16384},"segments":[],"schema":[{"name":"text","type":"text","options":{"indexing":{"record":"position","fieldnorms":true,"tokenizer":"default"},"stored":false,"fast":false}}],"opstamp":0,"codec":{"name":"standard"}}"#
|
||||
);
|
||||
let deser_meta: UntrackedIndexMeta = serde_json::from_value(json_value).unwrap();
|
||||
|
||||
let deser_meta: UntrackedIndexMeta = serde_json::from_str(&json).unwrap();
|
||||
assert_eq!(index_metas.index_settings, deser_meta.index_settings);
|
||||
assert_eq!(index_metas.schema, deser_meta.schema);
|
||||
assert_eq!(index_metas.opstamp, deser_meta.opstamp);
|
||||
@@ -442,39 +453,14 @@ mod tests {
|
||||
schema,
|
||||
opstamp: 0u64,
|
||||
payload: None,
|
||||
codec: Default::default(),
|
||||
};
|
||||
let json_value = serde_json::to_value(&index_metas).expect("serialization failed");
|
||||
let json = serde_json::ser::to_string(&index_metas).expect("serialization failed");
|
||||
assert_eq!(
|
||||
&json_value,
|
||||
&serde_json::json!(
|
||||
{
|
||||
"index_settings": {
|
||||
"docstore_compression": "zstd(compression_level=4)",
|
||||
"docstore_blocksize": 1000000
|
||||
},
|
||||
"segments": [],
|
||||
"schema": [
|
||||
{
|
||||
"name": "text",
|
||||
"type": "text",
|
||||
"options": {
|
||||
"indexing": {
|
||||
"record": "position",
|
||||
"fieldnorms": true,
|
||||
"tokenizer": "default"
|
||||
},
|
||||
"stored": false,
|
||||
"fast": false
|
||||
}
|
||||
}
|
||||
],
|
||||
"opstamp": 0
|
||||
}
|
||||
)
|
||||
json,
|
||||
r#"{"index_settings":{"docstore_compression":"zstd(compression_level=4)","docstore_blocksize":1000000},"segments":[],"schema":[{"name":"text","type":"text","options":{"indexing":{"record":"position","fieldnorms":true,"tokenizer":"default"},"stored":false,"fast":false}}],"opstamp":0}"#
|
||||
);
|
||||
|
||||
let deser_meta: UntrackedIndexMeta = serde_json::from_value(json_value).unwrap();
|
||||
let deser_meta: UntrackedIndexMeta = serde_json::from_str(&json).unwrap();
|
||||
assert_eq!(index_metas.index_settings, deser_meta.index_settings);
|
||||
assert_eq!(index_metas.schema, deser_meta.schema);
|
||||
assert_eq!(index_metas.opstamp, deser_meta.opstamp);
|
||||
|
||||
@@ -1,11 +1,8 @@
|
||||
#[cfg(feature = "quickwit")]
|
||||
use std::future::Future;
|
||||
use std::io;
|
||||
#[cfg(feature = "quickwit")]
|
||||
use std::pin::Pin;
|
||||
use std::sync::Arc;
|
||||
|
||||
use common::json_path_writer::JSON_END_OF_PATH;
|
||||
use common::{BinarySerializable, BitSet, ByteCount, OwnedBytes};
|
||||
use common::{BinarySerializable, ByteCount, OwnedBytes};
|
||||
#[cfg(feature = "quickwit")]
|
||||
use futures_util::{FutureExt, StreamExt, TryStreamExt};
|
||||
#[cfg(feature = "quickwit")]
|
||||
@@ -13,213 +10,43 @@ use itertools::Itertools;
|
||||
#[cfg(feature = "quickwit")]
|
||||
use tantivy_fst::automaton::{AlwaysMatch, Automaton};
|
||||
|
||||
use crate::codec::postings::RawPostingsData;
|
||||
use crate::codec::standard::postings::{
|
||||
fill_bitset_from_raw_data, load_postings_from_raw_data, SegmentPostings,
|
||||
};
|
||||
use crate::codec::postings::PostingsCodec;
|
||||
use crate::codec::{Codec, ObjectSafeCodec, StandardCodec};
|
||||
use crate::directory::FileSlice;
|
||||
use crate::fieldnorm::FieldNormReader;
|
||||
use crate::postings::{Postings, TermInfo};
|
||||
use crate::query::term_query::TermScorer;
|
||||
use crate::query::{box_scorer, Bm25Weight, PhraseScorer, Scorer};
|
||||
use crate::query::{Bm25Weight, PhraseScorer, Scorer};
|
||||
use crate::schema::{IndexRecordOption, Term, Type};
|
||||
use crate::termdict::TermDictionary;
|
||||
|
||||
#[cfg(feature = "quickwit")]
|
||||
pub type TermRangeBounds = (std::ops::Bound<Term>, std::ops::Bound<Term>);
|
||||
|
||||
/// Type-erased term scorer guaranteed to wrap a Tantivy [`TermScorer`].
|
||||
pub struct BoxedTermScorer(Box<dyn Scorer>);
|
||||
|
||||
impl BoxedTermScorer {
|
||||
/// Creates a boxed term scorer from a concrete Tantivy [`TermScorer`].
|
||||
pub fn new<TPostings: Postings>(term_scorer: TermScorer<TPostings>) -> BoxedTermScorer {
|
||||
BoxedTermScorer(box_scorer(term_scorer))
|
||||
}
|
||||
|
||||
/// Converts this boxed term scorer into a generic boxed scorer.
|
||||
pub fn into_boxed_scorer(self) -> Box<dyn Scorer> {
|
||||
self.0
|
||||
}
|
||||
}
|
||||
|
||||
/// Trait defining the contract for inverted index readers.
|
||||
pub trait InvertedIndexReader: Send + Sync {
|
||||
/// Returns the term info associated with the term.
|
||||
fn get_term_info(&self, term: &Term) -> io::Result<Option<TermInfo>> {
|
||||
self.terms().get(term.serialized_value_bytes())
|
||||
}
|
||||
|
||||
/// Return the term dictionary datastructure.
|
||||
fn terms(&self) -> &TermDictionary;
|
||||
|
||||
/// Return the fields and types encoded in the dictionary in lexicographic order.
|
||||
/// Only valid on JSON fields.
|
||||
///
|
||||
/// Notice: This requires a full scan and is therefore **very expensive**.
|
||||
fn list_encoded_json_fields(&self) -> io::Result<Vec<InvertedIndexFieldSpace>>;
|
||||
|
||||
/// Build a new term scorer.
|
||||
fn new_term_scorer(
|
||||
&self,
|
||||
term_info: &TermInfo,
|
||||
option: IndexRecordOption,
|
||||
fieldnorm_reader: FieldNormReader,
|
||||
similarity_weight: Bm25Weight,
|
||||
) -> io::Result<BoxedTermScorer>;
|
||||
|
||||
/// Returns a posting object given a `term_info`.
|
||||
/// This method is for an advanced usage only.
|
||||
///
|
||||
/// Most users should prefer using [`Self::read_postings()`] instead.
|
||||
fn read_postings_from_terminfo(
|
||||
&self,
|
||||
term_info: &TermInfo,
|
||||
option: IndexRecordOption,
|
||||
) -> io::Result<Box<dyn Postings>>;
|
||||
|
||||
/// Returns the raw postings bytes and metadata for a term.
|
||||
fn read_raw_postings_data(
|
||||
&self,
|
||||
term_info: &TermInfo,
|
||||
option: IndexRecordOption,
|
||||
) -> io::Result<RawPostingsData>;
|
||||
|
||||
/// Fills a bitset with documents containing the term.
|
||||
///
|
||||
/// Implementers can override this to avoid boxing postings.
|
||||
fn fill_bitset_for_term(
|
||||
&self,
|
||||
term_info: &TermInfo,
|
||||
option: IndexRecordOption,
|
||||
doc_bitset: &mut BitSet,
|
||||
) -> io::Result<()> {
|
||||
let mut postings = self.read_postings_from_terminfo(term_info, option)?;
|
||||
postings.fill_bitset(doc_bitset);
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Builds a phrase scorer for the given term infos.
|
||||
fn new_phrase_scorer(
|
||||
&self,
|
||||
term_infos: &[(usize, TermInfo)],
|
||||
similarity_weight: Option<Bm25Weight>,
|
||||
fieldnorm_reader: FieldNormReader,
|
||||
slop: u32,
|
||||
) -> io::Result<Box<dyn Scorer>>;
|
||||
|
||||
/// Returns the total number of tokens recorded for all documents
|
||||
/// (including deleted documents).
|
||||
fn total_num_tokens(&self) -> u64;
|
||||
|
||||
/// Returns the segment postings associated with the term, and with the given option,
|
||||
/// or `None` if the term has never been encountered and indexed.
|
||||
fn read_postings(
|
||||
&self,
|
||||
term: &Term,
|
||||
option: IndexRecordOption,
|
||||
) -> io::Result<Option<Box<dyn Postings>>> {
|
||||
self.get_term_info(term)?
|
||||
.map(move |term_info| self.read_postings_from_terminfo(&term_info, option))
|
||||
.transpose()
|
||||
}
|
||||
|
||||
/// Returns the number of documents containing the term.
|
||||
fn doc_freq(&self, term: &Term) -> io::Result<u32>;
|
||||
|
||||
/// Returns the number of documents containing the term asynchronously.
|
||||
#[cfg(feature = "quickwit")]
|
||||
fn doc_freq_async<'a>(
|
||||
&'a self,
|
||||
term: &'a Term,
|
||||
) -> Pin<Box<dyn Future<Output = io::Result<u32>> + Send + 'a>>;
|
||||
|
||||
/// Warmup fieldnorm readers for this inverted index field.
|
||||
#[cfg(feature = "quickwit")]
|
||||
fn warm_fieldnorms_readers<'a>(
|
||||
&'a self,
|
||||
) -> Pin<Box<dyn Future<Output = io::Result<()>> + Send + 'a>>;
|
||||
|
||||
/// Warmup the block postings for all terms.
|
||||
///
|
||||
/// Default implementation is a no-op.
|
||||
#[cfg(feature = "quickwit")]
|
||||
fn warm_postings_full<'a>(
|
||||
&'a self,
|
||||
_with_positions: bool,
|
||||
) -> Pin<Box<dyn Future<Output = io::Result<()>> + Send + 'a>> {
|
||||
Box::pin(async { Ok(()) })
|
||||
}
|
||||
|
||||
/// Warmup a block postings given a `Term`.
|
||||
///
|
||||
/// Returns whether the term was found in the dictionary.
|
||||
#[cfg(feature = "quickwit")]
|
||||
fn warm_postings<'a>(
|
||||
&'a self,
|
||||
term: &'a Term,
|
||||
with_positions: bool,
|
||||
) -> Pin<Box<dyn Future<Output = io::Result<bool>> + Send + 'a>>;
|
||||
|
||||
/// Warmup block postings for terms in a range.
|
||||
///
|
||||
/// Returns whether at least one matching term was found.
|
||||
#[cfg(feature = "quickwit")]
|
||||
fn warm_postings_range<'a>(
|
||||
&'a self,
|
||||
terms: TermRangeBounds,
|
||||
limit: Option<u64>,
|
||||
with_positions: bool,
|
||||
) -> Pin<Box<dyn Future<Output = io::Result<bool>> + Send + 'a>>;
|
||||
|
||||
/// Warmup block postings for terms matching an automaton.
|
||||
///
|
||||
/// Returns whether at least one matching term was found.
|
||||
#[cfg(feature = "quickwit")]
|
||||
fn warm_postings_automaton<'a, A: Automaton + Clone + Send + Sync + 'static>(
|
||||
&'a self,
|
||||
automaton: A,
|
||||
) -> Pin<Box<dyn Future<Output = io::Result<bool>> + Send + 'a>>
|
||||
where
|
||||
A::State: Clone + Send,
|
||||
Self: Sized;
|
||||
}
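The `where A::State: Clone + Send, Self: Sized` bound on `warm_postings_automaton` is what keeps the trait above usable as `dyn InvertedIndexReader`: a generic method would otherwise make the trait non-object-safe, and the `Self: Sized` clause simply excludes it from the vtable. A minimal illustration of the mechanism (toy trait, not tantivy code):

```rust
// Toy example: the generic `warm` method is excluded from the vtable by
// `where Self: Sized`, so `Box<dyn Reader>` still compiles.
trait Reader {
    fn doc_freq(&self) -> u32;

    fn warm<A: Clone>(&self, _automaton: A) -> bool
    where
        Self: Sized,
    {
        true
    }
}

struct Dummy;

impl Reader for Dummy {
    fn doc_freq(&self) -> u32 {
        0
    }
}

fn main() {
    let boxed: Box<dyn Reader> = Box::new(Dummy);
    assert_eq!(boxed.doc_freq(), 0);
    assert!(Dummy.warm(42u32)); // still callable on concrete types
}
```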
|
||||
|
||||
/// Tantivy's default inverted index reader implementation.
|
||||
///
|
||||
/// The inverted index reader is in charge of accessing
|
||||
/// the inverted index associated with a specific field.
|
||||
///
|
||||
/// # Note
|
||||
///
|
||||
/// It is safe to delete the segment associated with
|
||||
/// an `InvertedIndexReader` implementation. As long as it is open,
|
||||
/// an `InvertedIndexReader`. As long as it is open,
|
||||
/// the [`FileSlice`] it is relying on should
|
||||
/// stay available.
|
||||
///
|
||||
/// `TantivyInvertedIndexReader` instances are created by calling
|
||||
/// Instances of `InvertedIndexReader` are created by calling
|
||||
/// [`SegmentReader::inverted_index()`](crate::SegmentReader::inverted_index).
|
||||
pub struct TantivyInvertedIndexReader {
|
||||
pub struct InvertedIndexReader {
|
||||
termdict: TermDictionary,
|
||||
postings_file_slice: FileSlice,
|
||||
positions_file_slice: FileSlice,
|
||||
#[cfg_attr(not(feature = "quickwit"), allow(dead_code))]
|
||||
fieldnorms_file_slice: FileSlice,
|
||||
record_option: IndexRecordOption,
|
||||
total_num_tokens: u64,
|
||||
codec: Arc<dyn ObjectSafeCodec>,
|
||||
}
|
||||
|
||||
/// Object that records the amount of space used by a field in an inverted index.
|
||||
pub struct InvertedIndexFieldSpace {
|
||||
/// Field name as encoded in the term dictionary.
|
||||
pub(crate) struct InvertedIndexFieldSpace {
|
||||
pub field_name: String,
|
||||
/// Value type for the encoded field.
|
||||
pub field_type: Type,
|
||||
/// Total bytes used by postings for this field.
|
||||
pub postings_size: ByteCount,
|
||||
/// Total bytes used by positions for this field.
|
||||
pub positions_size: ByteCount,
|
||||
/// Number of terms in the field.
|
||||
pub num_terms: u64,
|
||||
}
|
||||
|
||||
@@ -241,86 +68,55 @@ impl InvertedIndexFieldSpace {
|
||||
}
|
||||
}
|
||||
|
||||
impl TantivyInvertedIndexReader {
|
||||
pub(crate) fn read_raw_postings_data_inner(
|
||||
&self,
|
||||
term_info: &TermInfo,
|
||||
option: IndexRecordOption,
|
||||
) -> io::Result<RawPostingsData> {
|
||||
let effective_option = option.downgrade(self.record_option);
|
||||
let postings_data = self
|
||||
.postings_file_slice
|
||||
.slice(term_info.postings_range.clone())
|
||||
.read_bytes()?;
|
||||
let positions_data: Option<OwnedBytes> = if effective_option.has_positions() {
|
||||
let positions_data = self
|
||||
.positions_file_slice
|
||||
.slice(term_info.positions_range.clone())
|
||||
.read_bytes()?;
|
||||
Some(positions_data)
|
||||
} else {
|
||||
None
|
||||
};
|
||||
Ok(RawPostingsData {
|
||||
postings_data,
|
||||
positions_data,
|
||||
record_option: self.record_option,
|
||||
effective_option,
|
||||
})
|
||||
}
|
||||
|
||||
/// Opens an inverted index reader from already-loaded term/postings/positions slices.
|
||||
///
|
||||
/// The first 8 bytes of `postings_file_slice` are expected to contain
|
||||
/// the serialized total token count.
|
||||
pub fn new(
|
||||
impl InvertedIndexReader {
|
||||
pub(crate) fn new(
|
||||
termdict: TermDictionary,
|
||||
postings_file_slice: FileSlice,
|
||||
positions_file_slice: FileSlice,
|
||||
fieldnorms_file_slice: FileSlice,
|
||||
record_option: IndexRecordOption,
|
||||
) -> io::Result<TantivyInvertedIndexReader> {
|
||||
codec: Arc<dyn ObjectSafeCodec>,
|
||||
) -> io::Result<InvertedIndexReader> {
|
||||
let (total_num_tokens_slice, postings_body) = postings_file_slice.split(8);
|
||||
let total_num_tokens = u64::deserialize(&mut total_num_tokens_slice.read_bytes()?)?;
|
||||
Ok(TantivyInvertedIndexReader {
|
||||
Ok(InvertedIndexReader {
|
||||
termdict,
|
||||
postings_file_slice: postings_body,
|
||||
positions_file_slice,
|
||||
fieldnorms_file_slice,
|
||||
record_option,
|
||||
total_num_tokens,
|
||||
codec,
|
||||
})
|
||||
}
|
||||
|
||||
/// Creates an empty `TantivyInvertedIndexReader` object, which
|
||||
/// Creates an empty `InvertedIndexReader` object, which
|
||||
/// contains no terms at all.
|
||||
pub fn empty(record_option: IndexRecordOption) -> TantivyInvertedIndexReader {
|
||||
TantivyInvertedIndexReader {
|
||||
pub fn empty(record_option: IndexRecordOption) -> InvertedIndexReader {
|
||||
InvertedIndexReader {
|
||||
termdict: TermDictionary::empty(),
|
||||
postings_file_slice: FileSlice::empty(),
|
||||
positions_file_slice: FileSlice::empty(),
|
||||
fieldnorms_file_slice: FileSlice::empty(),
|
||||
record_option,
|
||||
total_num_tokens: 0u64,
|
||||
codec: Arc::new(StandardCodec),
|
||||
}
|
||||
}
|
||||
|
||||
fn load_segment_postings(
|
||||
&self,
|
||||
term_info: &TermInfo,
|
||||
option: IndexRecordOption,
|
||||
) -> io::Result<SegmentPostings> {
|
||||
let postings_data = self.read_raw_postings_data_inner(term_info, option)?;
|
||||
load_postings_from_raw_data(term_info.doc_freq, postings_data)
|
||||
/// Returns the term info associated with the term.
|
||||
pub fn get_term_info(&self, term: &Term) -> io::Result<Option<TermInfo>> {
|
||||
self.termdict.get(term.serialized_value_bytes())
|
||||
}
|
||||
}
|
||||
|
||||
impl InvertedIndexReader for TantivyInvertedIndexReader {
|
||||
fn terms(&self) -> &TermDictionary {
|
||||
/// Return the term dictionary datastructure.
|
||||
pub fn terms(&self) -> &TermDictionary {
|
||||
&self.termdict
|
||||
}
|
||||
|
||||
fn list_encoded_json_fields(&self) -> io::Result<Vec<InvertedIndexFieldSpace>> {
|
||||
/// Return the fields and types encoded in the dictionary in lexicographic order.
|
||||
/// Only valid on JSON fields.
|
||||
///
|
||||
/// Notice: This requires a full scan and is therefore **very expensive**.
|
||||
/// TODO: Move to sstable to use the index.
|
||||
pub(crate) fn list_encoded_json_fields(&self) -> io::Result<Vec<InvertedIndexFieldSpace>> {
|
||||
let mut stream = self.termdict.stream()?;
|
||||
let mut fields: Vec<InvertedIndexFieldSpace> = Vec::new();
|
||||
|
||||
@@ -373,73 +169,130 @@ impl InvertedIndexReader for TantivyInvertedIndexReader {
|
||||
Ok(fields)
|
||||
}
|
||||
|
||||
fn new_term_scorer(
|
||||
pub(crate) fn new_term_scorer_specialized<C: Codec>(
|
||||
&self,
|
||||
term_info: &TermInfo,
|
||||
option: IndexRecordOption,
|
||||
fieldnorm_reader: FieldNormReader,
|
||||
similarity_weight: Bm25Weight,
|
||||
) -> io::Result<BoxedTermScorer> {
|
||||
let postings = self.load_segment_postings(term_info, option)?;
|
||||
codec: &C,
|
||||
) -> io::Result<TermScorer<<<C as Codec>::PostingsCodec as PostingsCodec>::Postings>> {
|
||||
let postings = self.read_postings_from_terminfo_specialized(term_info, option, codec)?;
|
||||
let term_scorer = TermScorer::new(postings, fieldnorm_reader, similarity_weight);
|
||||
Ok(BoxedTermScorer::new(term_scorer))
|
||||
Ok(term_scorer)
|
||||
}
|
||||
|
||||
fn read_postings_from_terminfo(
|
||||
pub(crate) fn new_phrase_scorer_type_specialized<C: Codec>(
|
||||
&self,
|
||||
term_infos: &[(usize, TermInfo)],
|
||||
similarity_weight_opt: Option<Bm25Weight>,
|
||||
fieldnorm_reader: FieldNormReader,
|
||||
slop: u32,
|
||||
codec: &C,
|
||||
) -> io::Result<PhraseScorer<<<C as Codec>::PostingsCodec as PostingsCodec>::Postings>> {
|
||||
let mut offset_and_term_postings: Vec<(
|
||||
usize,
|
||||
<<C as Codec>::PostingsCodec as PostingsCodec>::Postings,
|
||||
)> = Vec::with_capacity(term_infos.len());
|
||||
for (offset, term_info) in term_infos {
|
||||
let postings = self.read_postings_from_terminfo_specialized(
|
||||
term_info,
|
||||
IndexRecordOption::WithFreqsAndPositions,
|
||||
codec,
|
||||
)?;
|
||||
offset_and_term_postings.push((*offset, postings));
|
||||
}
|
||||
let phrase_scorer = PhraseScorer::new(
|
||||
offset_and_term_postings,
|
||||
similarity_weight_opt,
|
||||
fieldnorm_reader,
|
||||
slop,
|
||||
);
|
||||
Ok(phrase_scorer)
|
||||
}
|
||||
|
||||
/// Build a new term scorer.
|
||||
pub fn new_term_scorer(
|
||||
&self,
|
||||
term_info: &TermInfo,
|
||||
option: IndexRecordOption,
|
||||
fieldnorm_reader: FieldNormReader,
|
||||
similarity_weight: Bm25Weight,
|
||||
) -> io::Result<Box<dyn Scorer>> {
|
||||
let term_scorer = self.codec.load_term_scorer_type_erased(
|
||||
term_info,
|
||||
option,
|
||||
self,
|
||||
fieldnorm_reader,
|
||||
similarity_weight,
|
||||
)?;
|
||||
Ok(term_scorer)
|
||||
}
|
||||
|
||||
/// Returns a postings object with a concrete, codec-specific type.
///
/// This requires you to provide the actual codec.
|
||||
pub fn read_postings_from_terminfo_specialized<C: Codec>(
|
||||
&self,
|
||||
term_info: &TermInfo,
|
||||
option: IndexRecordOption,
|
||||
codec: &C,
|
||||
) -> io::Result<<<C as Codec>::PostingsCodec as PostingsCodec>::Postings> {
|
||||
let option = option.downgrade(self.record_option);
|
||||
let postings_data = self
|
||||
.postings_file_slice
|
||||
.slice(term_info.postings_range.clone())
|
||||
.read_bytes()?;
|
||||
let positions_data: Option<OwnedBytes> = if option.has_positions() {
|
||||
let positions_data = self
|
||||
.positions_file_slice
|
||||
.slice(term_info.positions_range.clone())
|
||||
.read_bytes()?;
|
||||
Some(positions_data)
|
||||
} else {
|
||||
None
|
||||
};
|
||||
let postings: <<C as Codec>::PostingsCodec as PostingsCodec>::Postings =
|
||||
codec.postings_codec().load_postings(
|
||||
term_info.doc_freq,
|
||||
postings_data,
|
||||
self.record_option,
|
||||
option,
|
||||
positions_data,
|
||||
)?;
|
||||
Ok(postings)
|
||||
}
|
||||
|
||||
/// Returns a posting object given a `term_info`.
|
||||
/// This method is for an advanced usage only.
|
||||
///
|
||||
/// Most users should prefer using [`Self::read_postings()`] instead.
|
||||
pub fn read_postings_from_terminfo(
|
||||
&self,
|
||||
term_info: &TermInfo,
|
||||
option: IndexRecordOption,
|
||||
) -> io::Result<Box<dyn Postings>> {
|
||||
let postings = self.load_segment_postings(term_info, option)?;
|
||||
Ok(Box::new(postings))
|
||||
self.codec
|
||||
.load_postings_type_erased(term_info, option, self)
|
||||
}
|
||||
|
||||
fn read_raw_postings_data(
|
||||
&self,
|
||||
term_info: &TermInfo,
|
||||
option: IndexRecordOption,
|
||||
) -> io::Result<RawPostingsData> {
|
||||
self.read_raw_postings_data_inner(term_info, option)
|
||||
}
|
||||
|
||||
fn fill_bitset_for_term(
|
||||
&self,
|
||||
term_info: &TermInfo,
|
||||
option: IndexRecordOption,
|
||||
doc_bitset: &mut BitSet,
|
||||
) -> io::Result<()> {
|
||||
let postings_data = self.read_raw_postings_data_inner(term_info, option)?;
|
||||
fill_bitset_from_raw_data(term_info.doc_freq, postings_data, doc_bitset)
|
||||
}
|
||||
|
||||
fn new_phrase_scorer(
|
||||
&self,
|
||||
term_infos: &[(usize, TermInfo)],
|
||||
similarity_weight: Option<Bm25Weight>,
|
||||
fieldnorm_reader: FieldNormReader,
|
||||
slop: u32,
|
||||
) -> io::Result<Box<dyn Scorer>> {
|
||||
let mut offset_and_term_postings: Vec<(usize, SegmentPostings)> =
|
||||
Vec::with_capacity(term_infos.len());
|
||||
for (offset, term_info) in term_infos {
|
||||
let postings =
|
||||
self.load_segment_postings(term_info, IndexRecordOption::WithFreqsAndPositions)?;
|
||||
offset_and_term_postings.push((*offset, postings));
|
||||
}
|
||||
let scorer = PhraseScorer::new(
|
||||
offset_and_term_postings,
|
||||
similarity_weight,
|
||||
fieldnorm_reader,
|
||||
slop,
|
||||
);
|
||||
Ok(box_scorer(scorer))
|
||||
}
|
||||
|
||||
fn total_num_tokens(&self) -> u64 {
|
||||
/// Returns the total number of tokens recorded for all documents
|
||||
/// (including deleted documents).
|
||||
pub fn total_num_tokens(&self) -> u64 {
|
||||
self.total_num_tokens
|
||||
}
|
||||
|
||||
fn read_postings(
|
||||
/// Returns the segment postings associated with the term, and with the given option,
|
||||
/// or `None` if the term has never been encountered and indexed.
|
||||
///
|
||||
/// If the field was not indexed with the indexing options that cover
/// the requested options, the method does not fail and returns a
/// [`SegmentPostings`] with as much information as possible.
|
||||
///
|
||||
/// For instance, requesting [`IndexRecordOption::WithFreqs`] for a
|
||||
/// [`TextOptions`](crate::schema::TextOptions) that does not index position
|
||||
/// will return a [`SegmentPostings`] with `DocId`s and frequencies.
|
||||
pub fn read_postings(
|
||||
&self,
|
||||
term: &Term,
|
||||
option: IndexRecordOption,
|
||||
@@ -449,184 +302,24 @@ impl InvertedIndexReader for TantivyInvertedIndexReader {
|
||||
.transpose()
|
||||
}
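A hedged usage sketch for `read_postings` as described above: count the documents containing a term. The re-export paths (`tantivy::InvertedIndexReader`, `tantivy::TERMINATED`) and the `DocSet` iteration protocol are assumed from mainline tantivy and may differ on this branch.

```rust
use std::io;

use tantivy::schema::IndexRecordOption;
use tantivy::{DocSet, InvertedIndexReader, Term, TERMINATED};

// Sketch: walk the posting list of one term and count matching documents.
fn count_docs_for_term(reader: &InvertedIndexReader, term: &Term) -> io::Result<u32> {
    let mut count = 0;
    if let Some(mut postings) = reader.read_postings(term, IndexRecordOption::Basic)? {
        let mut doc = postings.doc();
        while doc != TERMINATED {
            count += 1;
            doc = postings.advance();
        }
    }
    Ok(count)
}
```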
|
||||
|
||||
fn doc_freq(&self, term: &Term) -> io::Result<u32> {
|
||||
/// Returns the number of documents containing the term.
|
||||
pub fn doc_freq(&self, term: &Term) -> io::Result<u32> {
|
||||
Ok(self
|
||||
.get_term_info(term)?
|
||||
.map(|term_info| term_info.doc_freq)
|
||||
.unwrap_or(0u32))
|
||||
}
|
||||
|
||||
#[cfg(feature = "quickwit")]
|
||||
fn doc_freq_async<'a>(
|
||||
&'a self,
|
||||
term: &'a Term,
|
||||
) -> Pin<Box<dyn Future<Output = io::Result<u32>> + Send + 'a>> {
|
||||
Box::pin(async move {
|
||||
Ok(self
|
||||
.get_term_info_async(term)
|
||||
.await?
|
||||
.map(|term_info| term_info.doc_freq)
|
||||
.unwrap_or(0u32))
|
||||
})
|
||||
}
|
||||
|
||||
#[cfg(feature = "quickwit")]
|
||||
fn warm_fieldnorms_readers<'a>(
|
||||
&'a self,
|
||||
) -> Pin<Box<dyn Future<Output = io::Result<()>> + Send + 'a>> {
|
||||
Box::pin(async move {
|
||||
self.fieldnorms_file_slice.read_bytes_async().await?;
|
||||
Ok(())
|
||||
})
|
||||
}
|
||||
|
||||
#[cfg(feature = "quickwit")]
|
||||
fn warm_postings_full<'a>(
|
||||
&'a self,
|
||||
with_positions: bool,
|
||||
) -> Pin<Box<dyn Future<Output = io::Result<()>> + Send + 'a>> {
|
||||
Box::pin(async move {
|
||||
self.postings_file_slice.read_bytes_async().await?;
|
||||
if with_positions {
|
||||
self.positions_file_slice.read_bytes_async().await?;
|
||||
}
|
||||
Ok(())
|
||||
})
|
||||
}
|
||||
|
||||
#[cfg(feature = "quickwit")]
|
||||
fn warm_postings<'a>(
|
||||
&'a self,
|
||||
term: &'a Term,
|
||||
with_positions: bool,
|
||||
) -> Pin<Box<dyn Future<Output = io::Result<bool>> + Send + 'a>> {
|
||||
Box::pin(async move {
|
||||
let term_info_opt: Option<TermInfo> = self.get_term_info_async(term).await?;
|
||||
if let Some(term_info) = term_info_opt {
|
||||
let postings = self
|
||||
.postings_file_slice
|
||||
.read_bytes_slice_async(term_info.postings_range.clone());
|
||||
if with_positions {
|
||||
let positions = self
|
||||
.positions_file_slice
|
||||
.read_bytes_slice_async(term_info.positions_range.clone());
|
||||
futures_util::future::try_join(postings, positions).await?;
|
||||
} else {
|
||||
postings.await?;
|
||||
}
|
||||
Ok(true)
|
||||
} else {
|
||||
Ok(false)
|
||||
}
|
||||
})
|
||||
}
|
||||
|
||||
#[cfg(feature = "quickwit")]
|
||||
fn warm_postings_range<'a>(
|
||||
&'a self,
|
||||
terms: TermRangeBounds,
|
||||
limit: Option<u64>,
|
||||
with_positions: bool,
|
||||
) -> Pin<Box<dyn Future<Output = io::Result<bool>> + Send + 'a>> {
|
||||
Box::pin(async move {
|
||||
let mut term_info = self
|
||||
.get_term_range_async(terms, AlwaysMatch, limit, 0)
|
||||
.await?;
|
||||
|
||||
let Some(first_terminfo) = term_info.next() else {
|
||||
// no key matches, nothing more to load
|
||||
return Ok(false);
|
||||
};
|
||||
|
||||
let last_terminfo = term_info.last().unwrap_or_else(|| first_terminfo.clone());
|
||||
|
||||
let postings_range =
|
||||
first_terminfo.postings_range.start..last_terminfo.postings_range.end;
|
||||
let positions_range =
|
||||
first_terminfo.positions_range.start..last_terminfo.positions_range.end;
|
||||
|
||||
let postings = self
|
||||
.postings_file_slice
|
||||
.read_bytes_slice_async(postings_range);
|
||||
if with_positions {
|
||||
let positions = self
|
||||
.positions_file_slice
|
||||
.read_bytes_slice_async(positions_range);
|
||||
futures_util::future::try_join(postings, positions).await?;
|
||||
} else {
|
||||
postings.await?;
|
||||
}
|
||||
Ok(true)
|
||||
})
|
||||
}
|
||||
|
||||
#[cfg(feature = "quickwit")]
|
||||
fn warm_postings_automaton<'a, A: Automaton + Clone + Send + Sync + 'static>(
|
||||
&'a self,
|
||||
automaton: A,
|
||||
) -> Pin<Box<dyn Future<Output = io::Result<bool>> + Send + 'a>>
|
||||
where
|
||||
A::State: Clone + Send,
|
||||
Self: Sized,
|
||||
{
|
||||
Box::pin(async move {
|
||||
// merge holes under 4MiB, that's how many bytes we can hope to receive during a TTFB
|
||||
// from S3 (~80MiB/s, and 50ms latency)
|
||||
const MERGE_HOLES_UNDER_BYTES: usize = (80 * 1024 * 1024 * 50) / 1000;
|
||||
// Trigger async prefetch of relevant termdict blocks.
|
||||
let _term_info_iter = self
|
||||
.get_term_range_async(
|
||||
(std::ops::Bound::Unbounded, std::ops::Bound::Unbounded),
|
||||
automaton.clone(),
|
||||
None,
|
||||
MERGE_HOLES_UNDER_BYTES,
|
||||
)
|
||||
.await?;
|
||||
drop(_term_info_iter);
|
||||
|
||||
// Build a 2nd stream without merged holes so we only scan matching blocks.
|
||||
// This assumes the storage layer caches data fetched by the first pass.
|
||||
let mut stream = self.termdict.search(automaton).into_stream()?;
|
||||
let posting_ranges_iter =
|
||||
std::iter::from_fn(move || stream.next().map(|(_k, v)| v.postings_range.clone()));
|
||||
let merged_posting_ranges: Vec<std::ops::Range<usize>> = posting_ranges_iter
|
||||
.coalesce(|range1, range2| {
|
||||
if range1.end + MERGE_HOLES_UNDER_BYTES >= range2.start {
|
||||
Ok(range1.start..range2.end)
|
||||
} else {
|
||||
Err((range1, range2))
|
||||
}
|
||||
})
|
||||
.collect();
|
||||
|
||||
if merged_posting_ranges.is_empty() {
|
||||
return Ok(false);
|
||||
}
|
||||
|
||||
let slices_downloaded = futures_util::stream::iter(merged_posting_ranges.into_iter())
|
||||
.map(|posting_slice| {
|
||||
self.postings_file_slice
|
||||
.read_bytes_slice_async(posting_slice)
|
||||
.map(|result| result.map(|_slice| ()))
|
||||
})
|
||||
.buffer_unordered(5)
|
||||
.try_collect::<Vec<()>>()
|
||||
.await?;
|
||||
|
||||
Ok(!slices_downloaded.is_empty())
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(feature = "quickwit")]
|
||||
impl TantivyInvertedIndexReader {
|
||||
impl InvertedIndexReader {
|
||||
pub(crate) async fn get_term_info_async(&self, term: &Term) -> io::Result<Option<TermInfo>> {
|
||||
self.termdict.get_async(term.serialized_value_bytes()).await
|
||||
}
|
||||
|
||||
async fn get_term_range_async<'a, A: Automaton + 'a>(
|
||||
&'a self,
|
||||
terms: TermRangeBounds,
|
||||
terms: impl std::ops::RangeBounds<Term>,
|
||||
automaton: A,
|
||||
limit: Option<u64>,
|
||||
merge_holes_under_bytes: usize,
|
||||
@@ -634,17 +327,17 @@ impl TantivyInvertedIndexReader {
|
||||
where
|
||||
A::State: Clone,
|
||||
{
|
||||
use std::ops::Bound;
|
||||
let range_builder = self.termdict.search(automaton);
|
||||
let (start_bound, end_bound) = terms;
|
||||
let range_builder = match start_bound {
|
||||
std::ops::Bound::Included(bound) => range_builder.ge(bound.serialized_value_bytes()),
|
||||
std::ops::Bound::Excluded(bound) => range_builder.gt(bound.serialized_value_bytes()),
|
||||
std::ops::Bound::Unbounded => range_builder,
|
||||
let range_builder = match terms.start_bound() {
|
||||
Bound::Included(bound) => range_builder.ge(bound.serialized_value_bytes()),
|
||||
Bound::Excluded(bound) => range_builder.gt(bound.serialized_value_bytes()),
|
||||
Bound::Unbounded => range_builder,
|
||||
};
|
||||
let range_builder = match end_bound {
|
||||
std::ops::Bound::Included(bound) => range_builder.le(bound.serialized_value_bytes()),
|
||||
std::ops::Bound::Excluded(bound) => range_builder.lt(bound.serialized_value_bytes()),
|
||||
std::ops::Bound::Unbounded => range_builder,
|
||||
let range_builder = match terms.end_bound() {
|
||||
Bound::Included(bound) => range_builder.le(bound.serialized_value_bytes()),
|
||||
Bound::Excluded(bound) => range_builder.lt(bound.serialized_value_bytes()),
|
||||
Bound::Unbounded => range_builder,
|
||||
};
|
||||
let range_builder = if let Some(limit) = limit {
|
||||
range_builder.limit(limit)
|
||||
@@ -665,4 +358,167 @@ impl TantivyInvertedIndexReader {
|
||||
|
||||
Ok(iter)
|
||||
}
|
||||
|
||||
/// Warmup a block postings given a `Term`.
|
||||
/// This method is for an advanced usage only.
|
||||
///
|
||||
/// returns a boolean, whether the term was found in the dictionary
|
||||
pub async fn warm_postings(&self, term: &Term, with_positions: bool) -> io::Result<bool> {
|
||||
let term_info_opt: Option<TermInfo> = self.get_term_info_async(term).await?;
|
||||
if let Some(term_info) = term_info_opt {
|
||||
let postings = self
|
||||
.postings_file_slice
|
||||
.read_bytes_slice_async(term_info.postings_range.clone());
|
||||
if with_positions {
|
||||
let positions = self
|
||||
.positions_file_slice
|
||||
.read_bytes_slice_async(term_info.positions_range.clone());
|
||||
futures_util::future::try_join(postings, positions).await?;
|
||||
} else {
|
||||
postings.await?;
|
||||
}
|
||||
Ok(true)
|
||||
} else {
|
||||
Ok(false)
|
||||
}
|
||||
}
|
||||
|
||||
/// Warmup a block postings given a range of `Term`s.
|
||||
/// This method is for an advanced usage only.
|
||||
///
|
||||
/// returns a boolean, whether a term matching the range was found in the dictionary
|
||||
pub async fn warm_postings_range(
|
||||
&self,
|
||||
terms: impl std::ops::RangeBounds<Term>,
|
||||
limit: Option<u64>,
|
||||
with_positions: bool,
|
||||
) -> io::Result<bool> {
|
||||
let mut term_info = self
|
||||
.get_term_range_async(terms, AlwaysMatch, limit, 0)
|
||||
.await?;
|
||||
|
||||
let Some(first_terminfo) = term_info.next() else {
|
||||
// no key matches, nothing more to load
|
||||
return Ok(false);
|
||||
};
|
||||
|
||||
let last_terminfo = term_info.last().unwrap_or_else(|| first_terminfo.clone());
|
||||
|
||||
let postings_range = first_terminfo.postings_range.start..last_terminfo.postings_range.end;
|
||||
let positions_range =
|
||||
first_terminfo.positions_range.start..last_terminfo.positions_range.end;
|
||||
|
||||
let postings = self
|
||||
.postings_file_slice
|
||||
.read_bytes_slice_async(postings_range);
|
||||
if with_positions {
|
||||
let positions = self
|
||||
.positions_file_slice
|
||||
.read_bytes_slice_async(positions_range);
|
||||
futures_util::future::try_join(postings, positions).await?;
|
||||
} else {
|
||||
postings.await?;
|
||||
}
|
||||
Ok(true)
|
||||
}
|
||||
|
||||
/// Warmup the block postings for terms matching an automaton.
/// This method is for an advanced usage only.
///
/// returns a boolean, whether a term matching the automaton was found in the dictionary
|
||||
pub async fn warm_postings_automaton<
|
||||
A: Automaton + Clone + Send + 'static,
|
||||
E: FnOnce(Box<dyn FnOnce() -> io::Result<()> + Send>) -> F,
|
||||
F: std::future::Future<Output = io::Result<()>>,
|
||||
>(
|
||||
&self,
|
||||
automaton: A,
|
||||
// with_positions: bool, at the moment we have no use for it, and supporting it would add
|
||||
// complexity to the coalesce
|
||||
executor: E,
|
||||
) -> io::Result<bool>
|
||||
where
|
||||
A::State: Clone,
|
||||
{
|
||||
// merge holes under 4MiB, that's how many bytes we can hope to receive during a TTFB from
|
||||
// S3 (~80MiB/s, and 50ms latency)
|
||||
const MERGE_HOLES_UNDER_BYTES: usize = (80 * 1024 * 1024 * 50) / 1000;
|
||||
// we build a first iterator to download everything. Simply calling the function already
// downloads everything we need from the sstable, but doesn't start iterating over it.
|
||||
let _term_info_iter = self
|
||||
.get_term_range_async(.., automaton.clone(), None, MERGE_HOLES_UNDER_BYTES)
|
||||
.await?;
|
||||
|
||||
let (sender, posting_ranges_to_load_stream) = futures_channel::mpsc::unbounded();
|
||||
let termdict = self.termdict.clone();
|
||||
let cpu_bound_task = move || {
|
||||
// then we build a 2nd iterator, this one with no holes, so we don't go through blocks
|
||||
// we can't match.
|
||||
// This makes the assumption there is a caching layer below us, which gives sync read
|
||||
// for free after the initial async access. This might not always be true, but is in
|
||||
// Quickwit.
|
||||
// We build things from this closure, otherwise we get into lifetime issues that can only
// be solved with self-referential structs. Returning an io::Result from here is a bit
// more leaky abstraction-wise, but a lot better than the alternative.
|
||||
let mut stream = termdict.search(automaton).into_stream()?;
|
||||
|
||||
// we could do without an iterator, but this allows us access to coalesce which simplify
|
||||
// things
|
||||
let posting_ranges_iter =
|
||||
std::iter::from_fn(move || stream.next().map(|(_k, v)| v.postings_range.clone()));
|
||||
|
||||
let merged_posting_ranges_iter = posting_ranges_iter.coalesce(|range1, range2| {
|
||||
if range1.end + MERGE_HOLES_UNDER_BYTES >= range2.start {
|
||||
Ok(range1.start..range2.end)
|
||||
} else {
|
||||
Err((range1, range2))
|
||||
}
|
||||
});
|
||||
|
||||
for posting_range in merged_posting_ranges_iter {
|
||||
if let Err(_) = sender.unbounded_send(posting_range) {
|
||||
// this should happen only when search is cancelled
|
||||
return Err(io::Error::other("failed to send posting range back"));
|
||||
}
|
||||
}
|
||||
Ok(())
|
||||
};
|
||||
let task_handle = executor(Box::new(cpu_bound_task));
|
||||
|
||||
let posting_downloader = posting_ranges_to_load_stream
|
||||
.map(|posting_slice| {
|
||||
self.postings_file_slice
|
||||
.read_bytes_slice_async(posting_slice)
|
||||
.map(|result| result.map(|_slice| ()))
|
||||
})
|
||||
.buffer_unordered(5)
|
||||
.try_collect::<Vec<()>>();
|
||||
|
||||
let (_, slices_downloaded) =
|
||||
futures_util::future::try_join(task_handle, posting_downloader).await?;
|
||||
|
||||
Ok(!slices_downloaded.is_empty())
|
||||
}
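The warm-up path above fuses adjacent posting byte ranges whose gap is under `MERGE_HOLES_UNDER_BYTES` (80 MiB/s times 50 ms is roughly 4 MiB, hence the constant), trading a little over-fetch for fewer object-store requests. A standalone sketch of the same coalescing step, using the `itertools::coalesce` adapter already imported in this file; names and values are illustrative only.

```rust
use std::ops::Range;

use itertools::Itertools;

// Byte ranges separated by a gap smaller than `merge_holes_under_bytes`
// are fused into a single fetch.
fn merge_ranges(ranges: Vec<Range<usize>>, merge_holes_under_bytes: usize) -> Vec<Range<usize>> {
    ranges
        .into_iter()
        .coalesce(|left, right| {
            if left.end + merge_holes_under_bytes >= right.start {
                Ok(left.start..right.end)
            } else {
                Err((left, right))
            }
        })
        .collect()
}

fn main() {
    let merged = merge_ranges(vec![0..10, 12..20, 1_000_000..1_000_010], 100);
    assert_eq!(merged, vec![0..20, 1_000_000..1_000_010]);
}
```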
|
||||
|
||||
/// Warmup the block postings for all terms.
|
||||
/// This method is for an advanced usage only.
|
||||
///
|
||||
/// If you know which terms to pre-load, prefer using [`Self::warm_postings`] or
/// [`Self::warm_postings_range`] instead.
|
||||
pub async fn warm_postings_full(&self, with_positions: bool) -> io::Result<()> {
|
||||
self.postings_file_slice.read_bytes_async().await?;
|
||||
if with_positions {
|
||||
self.positions_file_slice.read_bytes_async().await?;
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Returns the number of documents containing the term asynchronously.
|
||||
pub async fn doc_freq_async(&self, term: &Term) -> io::Result<u32> {
|
||||
Ok(self
|
||||
.get_term_info_async(term)
|
||||
.await?
|
||||
.map(|term_info| term_info.doc_freq)
|
||||
.unwrap_or(0u32))
|
||||
}
|
||||
}
|
||||
|
||||
@@ -15,10 +15,8 @@ pub use self::codec_configuration::CodecConfiguration;
|
||||
pub use self::index::{Index, IndexBuilder};
|
||||
pub(crate) use self::index_meta::SegmentMetaInventory;
|
||||
pub use self::index_meta::{IndexMeta, IndexSettings, Order, SegmentMeta};
|
||||
pub use self::inverted_index_reader::{
|
||||
BoxedTermScorer, InvertedIndexFieldSpace, InvertedIndexReader, TantivyInvertedIndexReader,
|
||||
};
|
||||
pub use self::inverted_index_reader::InvertedIndexReader;
|
||||
pub use self::segment::Segment;
|
||||
pub use self::segment_component::SegmentComponent;
|
||||
pub use self::segment_id::SegmentId;
|
||||
pub use self::segment_reader::{FieldMetadata, SegmentReader, TantivySegmentReader};
|
||||
pub use self::segment_reader::{FieldMetadata, SegmentReader};
|
||||
|
||||
@@ -23,6 +23,8 @@ pub enum SegmentComponent {
|
||||
/// Accessing a document from the store is relatively slow, as it
|
||||
/// requires to decompress the entire block it belongs to.
|
||||
Store,
|
||||
/// Temporary storage of the documents, before streamed to `Store`.
|
||||
TempStore,
|
||||
/// Bitset describing which document of the segment is alive.
|
||||
/// (It was representing deleted docs but changed to represent alive docs from v0.17)
|
||||
Delete,
|
||||
@@ -31,13 +33,14 @@ pub enum SegmentComponent {
|
||||
impl SegmentComponent {
|
||||
/// Iterates through the components.
|
||||
pub fn iterator() -> slice::Iter<'static, SegmentComponent> {
|
||||
static SEGMENT_COMPONENTS: [SegmentComponent; 7] = [
|
||||
static SEGMENT_COMPONENTS: [SegmentComponent; 8] = [
|
||||
SegmentComponent::Postings,
|
||||
SegmentComponent::Positions,
|
||||
SegmentComponent::FastFields,
|
||||
SegmentComponent::FieldNorms,
|
||||
SegmentComponent::Terms,
|
||||
SegmentComponent::Store,
|
||||
SegmentComponent::TempStore,
|
||||
SegmentComponent::Delete,
|
||||
];
|
||||
SEGMENT_COMPONENTS.iter()
|
||||
|
||||
@@ -44,7 +44,7 @@ fn create_uuid() -> Uuid {
|
||||
}
|
||||
|
||||
impl SegmentId {
|
||||
/// Generates a new random `SegmentId`.
|
||||
#[doc(hidden)]
|
||||
pub fn generate_random() -> SegmentId {
|
||||
SegmentId(create_uuid())
|
||||
}
|
||||
|
||||
@@ -6,107 +6,18 @@ use common::{ByteCount, HasLen};
|
||||
use fnv::FnvHashMap;
|
||||
use itertools::Itertools;
|
||||
|
||||
use crate::codec::{ObjectSafeCodec, SumOrDoNothingCombiner};
|
||||
use crate::directory::{CompositeFile, Directory, FileSlice};
|
||||
use crate::codec::ObjectSafeCodec;
|
||||
use crate::directory::{CompositeFile, FileSlice};
|
||||
use crate::error::DataCorruption;
|
||||
use crate::fastfield::{intersect_alive_bitsets, AliveBitSet, FacetReader, FastFieldReaders};
|
||||
use crate::fieldnorm::{FieldNormReader, FieldNormReaders};
|
||||
use crate::index::{
|
||||
InvertedIndexReader, Segment, SegmentComponent, SegmentId, SegmentMeta,
|
||||
TantivyInvertedIndexReader,
|
||||
};
|
||||
use crate::index::{InvertedIndexReader, Segment, SegmentComponent, SegmentId};
|
||||
use crate::json_utils::json_path_sep_to_dot;
|
||||
use crate::query::Scorer;
|
||||
use crate::schema::{Field, IndexRecordOption, Schema, Type};
|
||||
use crate::space_usage::SegmentSpaceUsage;
|
||||
use crate::store::{StoreReader, TantivyStoreReader};
|
||||
use crate::store::StoreReader;
|
||||
use crate::termdict::TermDictionary;
|
||||
use crate::{DocId, Opstamp, Score};
|
||||
|
||||
/// Trait defining the contract for a segment reader.
|
||||
pub trait SegmentReader: Send + Sync {
|
||||
/// Returns the highest document id ever attributed in this segment + 1.
|
||||
fn max_doc(&self) -> DocId;
|
||||
|
||||
/// Returns the number of alive documents. Deleted documents are not counted.
|
||||
fn num_docs(&self) -> DocId;
|
||||
|
||||
/// Returns the schema of the index this segment belongs to.
|
||||
fn schema(&self) -> &Schema;
|
||||
|
||||
/// Performs a for_each_pruning operation on the given scorer.
|
||||
fn for_each_pruning(
|
||||
&self,
|
||||
threshold: Score,
|
||||
scorer: Box<dyn Scorer>,
|
||||
callback: &mut dyn FnMut(DocId, Score) -> Score,
|
||||
);
|
||||
|
||||
/// Builds a union scorer possibly specialized if all scorers are term scorers.
|
||||
fn build_union_scorer_with_sum_combiner(
|
||||
&self,
|
||||
scorers: Vec<Box<dyn Scorer>>,
|
||||
num_docs: DocId,
|
||||
score_combiner_type: SumOrDoNothingCombiner,
|
||||
) -> Box<dyn Scorer>;
|
||||
|
||||
/// Return the number of documents that have been deleted in the segment.
|
||||
fn num_deleted_docs(&self) -> DocId;
|
||||
|
||||
/// Returns true if some of the documents of the segment have been deleted.
|
||||
fn has_deletes(&self) -> bool;
|
||||
|
||||
/// Accessor to a segment's fast field reader given a field.
|
||||
fn fast_fields(&self) -> &FastFieldReaders;
|
||||
|
||||
/// Accessor to the `FacetReader` associated with a given `Field`.
|
||||
fn facet_reader(&self, field_name: &str) -> crate::Result<FacetReader> {
|
||||
let field = self.schema().get_field(field_name)?;
|
||||
let field_entry = self.schema().get_field_entry(field);
|
||||
if field_entry.field_type().value_type() != Type::Facet {
|
||||
return Err(crate::TantivyError::SchemaError(format!(
|
||||
"`{field_name}` is not a facet field.`"
|
||||
)));
|
||||
}
|
||||
let Some(facet_column) = self.fast_fields().str(field_name)? else {
|
||||
panic!("Facet Field `{field_name}` is missing. This should not happen");
|
||||
};
|
||||
Ok(FacetReader::new(facet_column))
|
||||
}
|
||||
|
||||
/// Accessor to the segment's `Field norms`'s reader.
|
||||
fn get_fieldnorms_reader(&self, field: Field) -> crate::Result<FieldNormReader>;
|
||||
|
||||
/// Accessor to the segment's [`StoreReader`](crate::store::StoreReader).
|
||||
fn get_store_reader(&self, cache_num_blocks: usize) -> io::Result<Box<dyn StoreReader>>;
|
||||
|
||||
/// Returns a field reader associated with the field given in argument.
|
||||
fn inverted_index(&self, field: Field) -> crate::Result<Arc<dyn InvertedIndexReader>>;
|
||||
|
||||
/// Returns the list of fields that have been indexed in the segment.
|
||||
fn fields_metadata(&self) -> crate::Result<Vec<FieldMetadata>>;
|
||||
|
||||
/// Returns the segment id.
|
||||
fn segment_id(&self) -> SegmentId;
|
||||
|
||||
/// Returns the delete opstamp.
|
||||
fn delete_opstamp(&self) -> Option<Opstamp>;
|
||||
|
||||
/// Returns the bitset representing the alive `DocId`s.
|
||||
fn alive_bitset(&self) -> Option<&AliveBitSet>;
|
||||
|
||||
/// Returns true if the `doc` is marked as deleted.
|
||||
fn is_deleted(&self, doc: DocId) -> bool;
|
||||
|
||||
/// Returns an iterator that will iterate over the alive document ids.
|
||||
fn doc_ids_alive(&self) -> Box<dyn Iterator<Item = DocId> + Send + '_>;
|
||||
|
||||
/// Summarize total space usage of this segment.
|
||||
fn space_usage(&self) -> io::Result<SegmentSpaceUsage>;
|
||||
|
||||
/// Clones this reader into a shared trait object.
|
||||
fn clone_arc(&self) -> Arc<dyn SegmentReader>;
|
||||
}
|
||||
use crate::{DocId, Opstamp};
|
||||
|
||||
/// Entry point to access all of the datastructures of the `Segment`
|
||||
///
|
||||
@@ -119,8 +30,8 @@ pub trait SegmentReader: Send + Sync {
|
||||
/// The segment reader has a very low memory footprint,
|
||||
/// as close to all of the memory data is mmapped.
|
||||
#[derive(Clone)]
|
||||
pub struct TantivySegmentReader {
|
||||
inv_idx_reader_cache: Arc<RwLock<HashMap<Field, Arc<dyn InvertedIndexReader>>>>,
|
||||
pub struct SegmentReader {
|
||||
inv_idx_reader_cache: Arc<RwLock<HashMap<Field, Arc<InvertedIndexReader>>>>,
|
||||
|
||||
segment_id: SegmentId,
|
||||
delete_opstamp: Option<Opstamp>,
|
||||
@@ -137,148 +48,77 @@ pub struct TantivySegmentReader {
|
||||
store_file: FileSlice,
|
||||
alive_bitset_opt: Option<AliveBitSet>,
|
||||
schema: Schema,
|
||||
codec: Arc<dyn ObjectSafeCodec>,
|
||||
|
||||
pub(crate) codec: Arc<dyn ObjectSafeCodec>,
|
||||
}
|
||||
|
||||
impl TantivySegmentReader {
|
||||
/// Open a new segment for reading.
|
||||
pub fn open<C: crate::codec::Codec>(
|
||||
segment: &Segment<C>,
|
||||
) -> crate::Result<Arc<dyn SegmentReader>> {
|
||||
Self::open_with_custom_alive_set(segment, None)
|
||||
}
|
||||
|
||||
/// Open a new segment for reading.
|
||||
pub fn open_with_custom_alive_set<C: crate::codec::Codec>(
|
||||
segment: &Segment<C>,
|
||||
custom_bitset: Option<AliveBitSet>,
|
||||
) -> crate::Result<Arc<dyn SegmentReader>> {
|
||||
segment.index().codec().open_segment_reader(
|
||||
segment.index().directory(),
|
||||
segment.meta(),
|
||||
segment.schema(),
|
||||
custom_bitset,
|
||||
)
|
||||
}
|
||||
|
||||
pub(crate) fn open_with_custom_alive_set_from_directory(
|
||||
directory: &dyn Directory,
|
||||
segment_meta: &SegmentMeta,
|
||||
schema: Schema,
|
||||
codec: Arc<dyn ObjectSafeCodec>,
|
||||
custom_bitset: Option<AliveBitSet>,
|
||||
) -> crate::Result<TantivySegmentReader> {
|
||||
let termdict_file =
|
||||
directory.open_read(&segment_meta.relative_path(SegmentComponent::Terms))?;
|
||||
let termdict_composite = CompositeFile::open(&termdict_file)?;
|
||||
|
||||
let store_file =
|
||||
directory.open_read(&segment_meta.relative_path(SegmentComponent::Store))?;
|
||||
|
||||
crate::fail_point!("SegmentReader::open#middle");
|
||||
|
||||
let postings_file =
|
||||
directory.open_read(&segment_meta.relative_path(SegmentComponent::Postings))?;
|
||||
let postings_composite = CompositeFile::open(&postings_file)?;
|
||||
|
||||
let positions_composite = {
|
||||
if let Ok(positions_file) =
|
||||
directory.open_read(&segment_meta.relative_path(SegmentComponent::Positions))
|
||||
{
|
||||
CompositeFile::open(&positions_file)?
|
||||
} else {
|
||||
CompositeFile::empty()
|
||||
}
|
||||
};
|
||||
|
||||
let fast_fields_data =
|
||||
directory.open_read(&segment_meta.relative_path(SegmentComponent::FastFields))?;
|
||||
let fast_fields_readers = FastFieldReaders::open(fast_fields_data, schema.clone())?;
|
||||
let fieldnorm_data =
|
||||
directory.open_read(&segment_meta.relative_path(SegmentComponent::FieldNorms))?;
|
||||
let fieldnorm_readers = FieldNormReaders::open(fieldnorm_data)?;
|
||||
|
||||
let original_bitset = if segment_meta.has_deletes() {
|
||||
let alive_doc_file_slice =
|
||||
directory.open_read(&segment_meta.relative_path(SegmentComponent::Delete))?;
|
||||
let alive_doc_data = alive_doc_file_slice.read_bytes()?;
|
||||
Some(AliveBitSet::open(alive_doc_data))
|
||||
} else {
|
||||
None
|
||||
};
|
||||
|
||||
let alive_bitset_opt = intersect_alive_bitset(original_bitset, custom_bitset);
|
||||
|
||||
let max_doc = segment_meta.max_doc();
|
||||
let num_docs = alive_bitset_opt
|
||||
.as_ref()
|
||||
.map(|alive_bitset| alive_bitset.num_alive_docs() as u32)
|
||||
.unwrap_or(max_doc);
|
||||
|
||||
Ok(TantivySegmentReader {
|
||||
inv_idx_reader_cache: Default::default(),
|
||||
num_docs,
|
||||
max_doc,
|
||||
termdict_composite,
|
||||
postings_composite,
|
||||
fast_fields_readers,
|
||||
fieldnorm_readers,
|
||||
segment_id: segment_meta.id(),
|
||||
delete_opstamp: segment_meta.delete_opstamp(),
|
||||
store_file,
|
||||
alive_bitset_opt,
|
||||
positions_composite,
|
||||
schema,
|
||||
codec,
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
impl SegmentReader for TantivySegmentReader {
|
||||
fn max_doc(&self) -> DocId {
|
||||
impl SegmentReader {
|
||||
/// Returns the highest document id ever attributed in
|
||||
/// this segment + 1.
|
||||
pub fn max_doc(&self) -> DocId {
|
||||
self.max_doc
|
||||
}
|
||||
|
||||
fn num_docs(&self) -> DocId {
|
||||
/// Returns the number of alive documents.
|
||||
/// Deleted documents are not counted.
|
||||
pub fn num_docs(&self) -> DocId {
|
||||
self.num_docs
|
||||
}
|
||||
|
||||
fn schema(&self) -> &Schema {
|
||||
/// Returns the schema of the index this segment belongs to.
|
||||
pub fn schema(&self) -> &Schema {
|
||||
&self.schema
|
||||
}
|
||||
|
||||
fn for_each_pruning(
|
||||
&self,
|
||||
threshold: Score,
|
||||
scorer: Box<dyn Scorer>,
|
||||
callback: &mut dyn FnMut(DocId, Score) -> Score,
|
||||
) {
|
||||
self.codec.for_each_pruning(threshold, scorer, callback);
|
||||
}
|
||||
|
||||
fn build_union_scorer_with_sum_combiner(
|
||||
&self,
|
||||
scorers: Vec<Box<dyn Scorer>>,
|
||||
num_docs: DocId,
|
||||
score_combiner_type: SumOrDoNothingCombiner,
|
||||
) -> Box<dyn Scorer> {
|
||||
self.codec
|
||||
.build_union_scorer_with_sum_combiner(scorers, num_docs, score_combiner_type)
|
||||
}
|
||||
|
||||
fn num_deleted_docs(&self) -> DocId {
|
||||
/// Return the number of documents that have been
|
||||
/// deleted in the segment.
|
||||
pub fn num_deleted_docs(&self) -> DocId {
|
||||
self.max_doc - self.num_docs
|
||||
}
|
||||
|
||||
fn has_deletes(&self) -> bool {
|
||||
self.num_docs != self.max_doc
|
||||
/// Returns true if some of the documents of the segment have been deleted.
|
||||
pub fn has_deletes(&self) -> bool {
|
||||
self.num_deleted_docs() > 0
|
||||
}
|
||||
|
||||
fn fast_fields(&self) -> &FastFieldReaders {
|
||||
/// Accessor to a segment's fast field reader given a field.
|
||||
///
|
||||
/// Returns the u64 fast value reader if the field
|
||||
/// is a u64 field indexed as "fast".
|
||||
///
|
||||
/// Return a FastFieldNotAvailableError if the field is not
|
||||
/// declared as a fast field in the schema.
|
||||
///
|
||||
/// # Panics
|
||||
/// May panic if the index is corrupted.
|
||||
pub fn fast_fields(&self) -> &FastFieldReaders {
|
||||
&self.fast_fields_readers
|
||||
}
|
||||
|
||||
fn get_fieldnorms_reader(&self, field: Field) -> crate::Result<FieldNormReader> {
|
||||
/// Accessor to the `FacetReader` associated with a given `Field`.
|
||||
pub fn facet_reader(&self, field_name: &str) -> crate::Result<FacetReader> {
|
||||
let schema = self.schema();
|
||||
let field = schema.get_field(field_name)?;
|
||||
let field_entry = schema.get_field_entry(field);
|
||||
if field_entry.field_type().value_type() != Type::Facet {
|
||||
return Err(crate::TantivyError::SchemaError(format!(
|
||||
"`{field_name}` is not a facet field.`"
|
||||
)));
|
||||
}
|
||||
let Some(facet_column) = self.fast_fields().str(field_name)? else {
|
||||
panic!("Facet Field `{field_name}` is missing. This should not happen");
|
||||
};
|
||||
Ok(FacetReader::new(facet_column))
|
||||
}
|
||||
|
||||
/// Accessor to the segment's `Field norms`'s reader.
|
||||
///
|
||||
/// Field norms are the length (in tokens) of the fields.
|
||||
/// It is used in the computation of the [TfIdf](https://fulmicoton.gitbooks.io/tantivy-doc/content/tfidf.html).
|
||||
///
|
||||
/// They are simply stored as a fast field, serialized in
|
||||
/// the `.fieldnorm` file of the segment.
|
||||
pub fn get_fieldnorms_reader(&self, field: Field) -> crate::Result<FieldNormReader> {
|
||||
self.fieldnorm_readers.get_field(field)?.ok_or_else(|| {
|
||||
let field_name = self.schema.get_field_name(field);
|
||||
let err_msg = format!(
|
||||
@@ -289,14 +129,102 @@ impl SegmentReader for TantivySegmentReader {
|
||||
})
|
||||
}
|
||||
|
||||
fn get_store_reader(&self, cache_num_blocks: usize) -> io::Result<Box<dyn StoreReader>> {
|
||||
Ok(Box::new(TantivyStoreReader::open(
|
||||
self.store_file.clone(),
|
||||
cache_num_blocks,
|
||||
)?))
|
||||
#[doc(hidden)]
|
||||
pub fn fieldnorms_readers(&self) -> &FieldNormReaders {
|
||||
&self.fieldnorm_readers
|
||||
}
|
||||
|
||||
fn inverted_index(&self, field: Field) -> crate::Result<Arc<dyn InvertedIndexReader>> {
|
||||
/// Accessor to the segment's [`StoreReader`](crate::store::StoreReader).
|
||||
///
|
||||
/// `cache_num_blocks` sets the number of decompressed blocks to be cached in an LRU.
|
||||
/// The size of blocks is configurable, this should be reflected in the
|
||||
pub fn get_store_reader(&self, cache_num_blocks: usize) -> io::Result<StoreReader> {
|
||||
StoreReader::open(self.store_file.clone(), cache_num_blocks)
|
||||
}
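A hedged sketch of using the store reader opened above to fetch one stored document. The 50-block LRU size and the `TantivyDocument` type are illustrative assumptions taken from mainline tantivy, not values required by this branch.

```rust
use tantivy::schema::TantivyDocument;
use tantivy::SegmentReader;

// Sketch: fetch the stored version of doc 0, if the segment has any documents.
fn fetch_first_stored_doc(
    segment_reader: &SegmentReader,
) -> tantivy::Result<Option<TantivyDocument>> {
    if segment_reader.max_doc() == 0 {
        return Ok(None);
    }
    let store_reader = segment_reader.get_store_reader(50)?;
    Ok(Some(store_reader.get(0)?))
}
```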
|
||||
|
||||
/// Open a new segment for reading.
|
||||
pub fn open<C: crate::codec::Codec>(segment: &Segment<C>) -> crate::Result<SegmentReader> {
|
||||
Self::open_with_custom_alive_set(segment, None)
|
||||
}
|
||||
|
||||
/// Open a new segment for reading.
|
||||
pub fn open_with_custom_alive_set<C: crate::codec::Codec>(
|
||||
segment: &Segment<C>,
|
||||
custom_bitset: Option<AliveBitSet>,
|
||||
) -> crate::Result<SegmentReader> {
let codec: Arc<dyn ObjectSafeCodec> = Arc::new(segment.index().codec().clone());
let termdict_file = segment.open_read(SegmentComponent::Terms)?;
let termdict_composite = CompositeFile::open(&termdict_file)?;

let store_file = segment.open_read(SegmentComponent::Store)?;

crate::fail_point!("SegmentReader::open#middle");

let postings_file = segment.open_read(SegmentComponent::Postings)?;
let postings_composite = CompositeFile::open(&postings_file)?;

let positions_composite = {
if let Ok(positions_file) = segment.open_read(SegmentComponent::Positions) {
CompositeFile::open(&positions_file)?
} else {
CompositeFile::empty()
}
};

let schema = segment.schema();

let fast_fields_data = segment.open_read(SegmentComponent::FastFields)?;
let fast_fields_readers = FastFieldReaders::open(fast_fields_data, schema.clone())?;
let fieldnorm_data = segment.open_read(SegmentComponent::FieldNorms)?;
let fieldnorm_readers = FieldNormReaders::open(fieldnorm_data)?;

let original_bitset = if segment.meta().has_deletes() {
let alive_doc_file_slice = segment.open_read(SegmentComponent::Delete)?;
let alive_doc_data = alive_doc_file_slice.read_bytes()?;
Some(AliveBitSet::open(alive_doc_data))
} else {
None
};

let alive_bitset_opt = intersect_alive_bitset(original_bitset, custom_bitset);

let max_doc = segment.meta().max_doc();
let num_docs = alive_bitset_opt
.as_ref()
.map(|alive_bitset| alive_bitset.num_alive_docs() as u32)
.unwrap_or(max_doc);

Ok(SegmentReader {
inv_idx_reader_cache: Default::default(),
num_docs,
max_doc,
termdict_composite,
postings_composite,
fast_fields_readers,
fieldnorm_readers,
segment_id: segment.id(),
delete_opstamp: segment.meta().delete_opstamp(),
store_file,
alive_bitset_opt,
positions_composite,
schema,
codec,
})
}
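A sketch of opening a reader with an extra, caller-supplied delete on top of the segment's own deletes. It assumes `Segment` defaults its codec type parameter on this branch (as `IndexWriter` does) and uses the `AliveBitSet::for_test_from_deleted_docs` helper that the tests in this diff rely on:

```rust
use tantivy::fastfield::AliveBitSet;
use tantivy::{Segment, SegmentReader};

// Sketch: treat doc 0 as deleted in addition to the segment's own deletes.
// The custom bitset is intersected with the on-disk delete bitset, as above.
fn open_with_extra_delete(segment: &Segment) -> tantivy::Result<SegmentReader> {
    let custom_alive =
        AliveBitSet::for_test_from_deleted_docs(&[0], segment.meta().max_doc());
    SegmentReader::open_with_custom_alive_set(segment, Some(custom_alive))
}
```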

/// Returns a field reader associated with the field given in argument.
/// If the field was not present in the index during indexing time,
/// the InvertedIndexReader is empty.
///
/// The field reader is in charge of iterating through the
/// term dictionary associated with a specific field,
/// and opening the posting list associated with any term.
///
/// If the field is not marked as indexed, a warning is logged and an empty `InvertedIndexReader`
/// is returned.
/// Similarly, if the field is marked as indexed but no term has been indexed for the given
/// index, an empty `InvertedIndexReader` is returned (but no warning is logged).
pub fn inverted_index(&self, field: Field) -> crate::Result<Arc<InvertedIndexReader>> {
if let Some(inv_idx_reader) = self
.inv_idx_reader_cache
.read()
@@ -321,9 +249,7 @@ impl SegmentReader for TantivySegmentReader {
//
// Returns an empty inverted index.
let record_option = record_option_opt.unwrap_or(IndexRecordOption::Basic);
let inv_idx_reader: Arc<dyn InvertedIndexReader> =
Arc::new(TantivyInvertedIndexReader::empty(record_option));
return Ok(inv_idx_reader);
return Ok(Arc::new(InvertedIndexReader::empty(record_option)));
}

let record_option = record_option_opt.unwrap();
@@ -346,20 +272,14 @@ impl SegmentReader for TantivySegmentReader {
);
DataCorruption::comment_only(error_msg)
})?;
let fieldnorms_file = self
.fieldnorm_readers
.get_inner_file()
.open_read(field)
.unwrap_or_else(FileSlice::empty);

let inv_idx_reader: Arc<dyn InvertedIndexReader> =
Arc::new(TantivyInvertedIndexReader::new(
TermDictionary::open(termdict_file)?,
postings_file,
positions_file,
fieldnorms_file,
record_option,
)?);
let inv_idx_reader = Arc::new(InvertedIndexReader::new(
TermDictionary::open(termdict_file)?,
postings_file,
positions_file,
record_option,
self.codec.clone(),
)?);

// by releasing the lock in between, we may end up opening the inverted index
// twice, but this is fine.
@@ -371,10 +291,23 @@ impl SegmentReader for TantivySegmentReader {
Ok(inv_idx_reader)
}
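A hedged sketch of going from a term to its posting list through this accessor, using `get_term_info` and `read_postings_from_terminfo` the same way the tests in this diff do. The document count is obtained by walking the `DocSet` rather than relying on the branch-specific `DocFreq` return type:

```rust
use tantivy::schema::{Field, IndexRecordOption};
use tantivy::{DocSet, SegmentReader, Term, TERMINATED};

// Sketch: count the documents containing `text` in `field` within this segment,
// by walking the term's posting list.
fn count_matching_docs(
    segment_reader: &SegmentReader,
    field: Field,
    text: &str,
) -> tantivy::Result<u32> {
    let inverted_index = segment_reader.inverted_index(field)?;
    let term = Term::from_field_text(field, text);
    // Absent term: the segment simply contains no matching document.
    let Some(term_info) = inverted_index.get_term_info(&term)? else {
        return Ok(0);
    };
    let mut postings =
        inverted_index.read_postings_from_terminfo(&term_info, IndexRecordOption::Basic)?;
    let mut count = 0u32;
    while postings.doc() != TERMINATED {
        count += 1;
        postings.advance();
    }
    Ok(count)
}
```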

fn fields_metadata(&self) -> crate::Result<Vec<FieldMetadata>> {
/// Returns the list of fields that have been indexed in the segment.
/// The field list includes the fields defined in the schema as well as the fields
/// that have been indexed as a part of a JSON field.
/// The returned field name is the full field name, including the name of the JSON field.
///
/// The returned field names can be used in queries.
///
/// Notice: If your data contains JSON fields this is **very expensive**, as it requires
/// browsing through the inverted index term dictionary and the columnar field dictionary.
///
/// Disclaimer: Some fields may not be listed here. For instance, if the schema contains a JSON
/// field that is neither indexed nor a fast field but is stored, it is possible for the field
/// to not be listed.
pub fn fields_metadata(&self) -> crate::Result<Vec<FieldMetadata>> {
let mut indexed_fields: Vec<FieldMetadata> = Vec::new();
let mut map_to_canonical = FnvHashMap::default();
for (field, field_entry) in self.schema.fields() {
for (field, field_entry) in self.schema().fields() {
let field_name = field_entry.name().to_string();
let is_indexed = field_entry.is_indexed();
if is_indexed {
@@ -464,7 +397,7 @@ impl SegmentReader for TantivySegmentReader {
}
}
let fast_fields: Vec<FieldMetadata> = self
.fast_fields_readers
.fast_fields()
.columnar()
.iter_columns()?
.map(|(mut field_name, handle)| {
@@ -492,26 +425,31 @@ impl SegmentReader for TantivySegmentReader {
Ok(merged_field_metadatas)
}
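A minimal sketch of listing the queryable fields of a segment. Printing with `{:?}` assumes `FieldMetadata` derives `Debug`, as the derive shown further down suggests:

```rust
use tantivy::SegmentReader;

// Sketch: dump every field (including JSON sub-paths) known to this segment.
fn dump_fields(segment_reader: &SegmentReader) -> tantivy::Result<()> {
    for field_metadata in segment_reader.fields_metadata()? {
        println!("{field_metadata:?}");
    }
    Ok(())
}
```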

fn segment_id(&self) -> SegmentId {
/// Returns the segment id
pub fn segment_id(&self) -> SegmentId {
self.segment_id
}

fn delete_opstamp(&self) -> Option<Opstamp> {
/// Returns the delete opstamp
pub fn delete_opstamp(&self) -> Option<Opstamp> {
self.delete_opstamp
}

fn alive_bitset(&self) -> Option<&AliveBitSet> {
/// Returns the bitset representing the alive `DocId`s.
pub fn alive_bitset(&self) -> Option<&AliveBitSet> {
self.alive_bitset_opt.as_ref()
}

fn is_deleted(&self, doc: DocId) -> bool {
self.alive_bitset_opt
.as_ref()
/// Returns true if the `doc` is marked
/// as deleted.
pub fn is_deleted(&self, doc: DocId) -> bool {
self.alive_bitset()
.map(|alive_bitset| alive_bitset.is_deleted(doc))
.unwrap_or(false)
}

fn doc_ids_alive(&self) -> Box<dyn Iterator<Item = DocId> + Send + '_> {
/// Returns an iterator that will iterate over the alive document ids
pub fn doc_ids_alive(&self) -> Box<dyn Iterator<Item = DocId> + Send + '_> {
if let Some(alive_bitset) = &self.alive_bitset_opt {
Box::new(alive_bitset.iter_alive())
} else {
@@ -519,25 +457,22 @@ impl SegmentReader for TantivySegmentReader {
}
}
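A small sketch combining the accessors above to report how many documents survive deletes in a segment (the alive count equals `num_docs()`):

```rust
use tantivy::SegmentReader;

// Sketch: (alive, deleted) document counts for a segment.
fn deletion_stats(segment_reader: &SegmentReader) -> (u32, u32) {
    let alive = segment_reader.doc_ids_alive().count() as u32;
    (alive, segment_reader.max_doc() - alive)
}
```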

fn space_usage(&self) -> io::Result<SegmentSpaceUsage> {
/// Summarize total space usage of this segment.
pub fn space_usage(&self) -> io::Result<SegmentSpaceUsage> {
Ok(SegmentSpaceUsage::new(
self.num_docs,
self.termdict_composite.space_usage(&self.schema),
self.postings_composite.space_usage(&self.schema),
self.positions_composite.space_usage(&self.schema),
self.num_docs(),
self.termdict_composite.space_usage(self.schema()),
self.postings_composite.space_usage(self.schema()),
self.positions_composite.space_usage(self.schema()),
self.fast_fields_readers.space_usage()?,
self.fieldnorm_readers.space_usage(&self.schema),
TantivyStoreReader::open(self.store_file.clone(), 0)?.space_usage(),
self.fieldnorm_readers.space_usage(self.schema()),
self.get_store_reader(0)?.space_usage(),
self.alive_bitset_opt
.as_ref()
.map(AliveBitSet::space_usage)
.unwrap_or_default(),
))
}
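A usage sketch for the accessor above; `total()` on `SegmentSpaceUsage` is assumed from the upstream tantivy API and may differ on this branch:

```rust
use tantivy::SegmentReader;

// Sketch: print the per-segment disk footprint. `total()` is an assumed helper.
fn print_segment_size(segment_reader: &SegmentReader) -> std::io::Result<()> {
    let space_usage = segment_reader.space_usage()?;
    println!("segment footprint: {:?}", space_usage.total());
    Ok(())
}
```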
|
||||
|
||||
fn clone_arc(&self) -> Arc<dyn SegmentReader> {
|
||||
Arc::new(self.clone())
|
||||
}
|
||||
}
|
||||
|
||||
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
|
||||
@@ -647,7 +582,7 @@ fn intersect_alive_bitset(
|
||||
}
|
||||
}
|
||||
|
||||
impl fmt::Debug for TantivySegmentReader {
|
||||
impl fmt::Debug for SegmentReader {
|
||||
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
|
||||
write!(f, "SegmentReader({:?})", self.segment_id)
|
||||
}
|
||||
|
||||
@@ -250,15 +250,11 @@ mod tests {
|
||||
|
||||
struct DummyWeight;
|
||||
impl Weight for DummyWeight {
|
||||
fn scorer(
|
||||
&self,
|
||||
_reader: &dyn SegmentReader,
|
||||
_boost: Score,
|
||||
) -> crate::Result<Box<dyn Scorer>> {
|
||||
fn scorer(&self, _reader: &SegmentReader, _boost: Score) -> crate::Result<Box<dyn Scorer>> {
|
||||
Err(crate::TantivyError::InternalError("dummy impl".to_owned()))
|
||||
}
|
||||
|
||||
fn explain(&self, _reader: &dyn SegmentReader, _doc: DocId) -> crate::Result<Explanation> {
|
||||
fn explain(&self, _reader: &SegmentReader, _doc: DocId) -> crate::Result<Explanation> {
|
||||
Err(crate::TantivyError::InternalError("dummy impl".to_owned()))
|
||||
}
|
||||
}
|
||||
|
||||
@@ -95,7 +95,7 @@ pub struct IndexWriter<C: Codec = StandardCodec, D: Document = TantivyDocument>
|
||||
|
||||
fn compute_deleted_bitset(
|
||||
alive_bitset: &mut BitSet,
|
||||
segment_reader: &dyn SegmentReader,
|
||||
segment_reader: &SegmentReader,
|
||||
delete_cursor: &mut DeleteCursor,
|
||||
doc_opstamps: &DocToOpstampMapping,
|
||||
target_opstamp: Opstamp,
|
||||
@@ -144,12 +144,7 @@ pub fn advance_deletes<C: Codec>(
|
||||
return Ok(());
|
||||
}
|
||||
|
||||
let segment_reader = segment.index().codec().open_segment_reader(
|
||||
segment.index().directory(),
|
||||
segment.meta(),
|
||||
segment.schema(),
|
||||
None,
|
||||
)?;
|
||||
let segment_reader = SegmentReader::open(&segment)?;
|
||||
|
||||
let max_doc = segment_reader.max_doc();
|
||||
let mut alive_bitset: BitSet = match segment_entry.alive_bitset() {
|
||||
@@ -161,7 +156,7 @@ pub fn advance_deletes<C: Codec>(
|
||||
|
||||
compute_deleted_bitset(
|
||||
&mut alive_bitset,
|
||||
segment_reader.as_ref(),
|
||||
&segment_reader,
|
||||
segment_entry.delete_cursor(),
|
||||
&DocToOpstampMapping::None,
|
||||
target_opstamp,
|
||||
@@ -224,7 +219,7 @@ fn index_documents<C: crate::codec::Codec, D: Document>(
|
||||
let alive_bitset_opt = apply_deletes(&segment_with_max_doc, &mut delete_cursor, &doc_opstamps)?;
|
||||
|
||||
let meta = segment_with_max_doc.meta().clone();
|
||||
|
||||
meta.untrack_temp_docstore();
|
||||
// update segment_updater inventory to remove tempstore
|
||||
let segment_entry = SegmentEntry::new(meta, delete_cursor, alive_bitset_opt);
|
||||
segment_updater.schedule_add_segment(segment_entry).wait()?;
|
||||
@@ -249,19 +244,14 @@ fn apply_deletes<C: crate::codec::Codec>(
|
||||
.max()
|
||||
.expect("Empty DocOpstamp is forbidden");
|
||||
|
||||
let segment_reader = segment.index().codec().open_segment_reader(
|
||||
segment.index().directory(),
|
||||
segment.meta(),
|
||||
segment.schema(),
|
||||
None,
|
||||
)?;
|
||||
let segment_reader = SegmentReader::open(segment)?;
|
||||
let doc_to_opstamps = DocToOpstampMapping::WithMap(doc_opstamps);
|
||||
|
||||
let max_doc = segment.meta().max_doc();
|
||||
let mut deleted_bitset = BitSet::with_max_value_and_full(max_doc);
|
||||
let may_have_deletes = compute_deleted_bitset(
|
||||
&mut deleted_bitset,
|
||||
segment_reader.as_ref(),
|
||||
&segment_reader,
|
||||
delete_cursor,
|
||||
&doc_to_opstamps,
|
||||
max_doc_opstamp,
|
||||
@@ -1976,9 +1966,9 @@ mod tests {
|
||||
.get_store_reader(DOCSTORE_CACHE_CAPACITY)
|
||||
.unwrap();
|
||||
// test store iterator
|
||||
for doc_id in segment_reader.doc_ids_alive() {
|
||||
let doc = store_reader.get(doc_id).unwrap();
|
||||
for doc in store_reader.iter::<TantivyDocument>(segment_reader.alive_bitset()) {
|
||||
let id = doc
|
||||
.unwrap()
|
||||
.get_first(id_field)
|
||||
.unwrap()
|
||||
.as_value()
|
||||
@@ -1989,7 +1979,7 @@ mod tests {
|
||||
// test store random access
|
||||
for doc_id in segment_reader.doc_ids_alive() {
|
||||
let id = store_reader
|
||||
.get(doc_id)
|
||||
.get::<TantivyDocument>(doc_id)
|
||||
.unwrap()
|
||||
.get_first(id_field)
|
||||
.unwrap()
|
||||
@@ -1998,7 +1988,7 @@ mod tests {
|
||||
assert!(expected_ids_and_num_occurrences.contains_key(&id));
|
||||
if id_is_full_doc(id) {
|
||||
let id2 = store_reader
|
||||
.get(doc_id)
|
||||
.get::<TantivyDocument>(doc_id)
|
||||
.unwrap()
|
||||
.get_first(multi_numbers)
|
||||
.unwrap()
|
||||
@@ -2006,13 +1996,13 @@ mod tests {
|
||||
.unwrap();
|
||||
assert_eq!(id, id2);
|
||||
let bool = store_reader
|
||||
.get(doc_id)
|
||||
.get::<TantivyDocument>(doc_id)
|
||||
.unwrap()
|
||||
.get_first(bool_field)
|
||||
.unwrap()
|
||||
.as_bool()
|
||||
.unwrap();
|
||||
let doc = store_reader.get(doc_id).unwrap();
|
||||
let doc = store_reader.get::<TantivyDocument>(doc_id).unwrap();
|
||||
let mut bool2 = doc.get_all(multi_bools);
|
||||
assert_eq!(bool, bool2.next().unwrap().as_bool().unwrap());
|
||||
assert_ne!(bool, bool2.next().unwrap().as_bool().unwrap());
|
||||
|
||||
@@ -1,5 +1,6 @@
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use crate::codec::StandardCodec;
|
||||
use crate::collector::TopDocs;
|
||||
use crate::fastfield::AliveBitSet;
|
||||
use crate::index::Index;
|
||||
@@ -122,28 +123,22 @@ mod tests {
|
||||
let term_a = Term::from_field_text(my_text_field, "text");
|
||||
let inverted_index = segment_reader.inverted_index(my_text_field).unwrap();
|
||||
let term_info = inverted_index.get_term_info(&term_a).unwrap().unwrap();
|
||||
let typed_postings = crate::codec::Codec::load_postings_typed(
|
||||
index.codec(),
|
||||
inverted_index.as_ref(),
|
||||
&term_info,
|
||||
IndexRecordOption::WithFreqsAndPositions,
|
||||
)
|
||||
.unwrap();
|
||||
let mut postings = inverted_index
|
||||
.read_postings_from_terminfo_specialized(
|
||||
&term_info,
|
||||
IndexRecordOption::WithFreqsAndPositions,
|
||||
&StandardCodec,
|
||||
)
|
||||
.unwrap();
|
||||
assert_eq!(postings.doc_freq(), DocFreq::Exact(2));
|
||||
let fallback_bitset = AliveBitSet::for_test_from_deleted_docs(&[0], 100);
|
||||
assert_eq!(
|
||||
crate::indexer::merger::doc_freq_given_deletes(
|
||||
&typed_postings,
|
||||
&postings,
|
||||
segment_reader.alive_bitset().unwrap_or(&fallback_bitset)
|
||||
),
|
||||
2
|
||||
);
|
||||
let mut postings = inverted_index
|
||||
.read_postings_from_terminfo(&term_info, IndexRecordOption::WithFreqsAndPositions)
|
||||
.unwrap();
|
||||
assert_eq!(postings.doc_freq(), DocFreq::Exact(2));
|
||||
let mut postings = inverted_index
|
||||
.read_postings_from_terminfo(&term_info, IndexRecordOption::WithFreqsAndPositions)
|
||||
.unwrap();
|
||||
|
||||
assert_eq!(postings.term_freq(), 1);
|
||||
let mut output = Vec::new();
|
||||
|
||||
@@ -1,5 +1,3 @@
|
||||
use std::io;
|
||||
use std::marker::PhantomData;
|
||||
use std::sync::Arc;
|
||||
|
||||
use columnar::{
|
||||
@@ -19,8 +17,8 @@ use crate::fieldnorm::{FieldNormReader, FieldNormReaders, FieldNormsSerializer,
|
||||
use crate::index::{Segment, SegmentComponent, SegmentReader};
|
||||
use crate::indexer::doc_id_mapping::{MappingType, SegmentDocIdMapping};
|
||||
use crate::indexer::SegmentSerializer;
|
||||
use crate::postings::{InvertedIndexSerializer, Postings, TermInfo};
|
||||
use crate::schema::{value_type_to_column_type, Field, FieldType, IndexRecordOption, Schema};
|
||||
use crate::postings::{InvertedIndexSerializer, Postings};
|
||||
use crate::schema::{value_type_to_column_type, Field, FieldType, Schema};
|
||||
use crate::store::StoreWriter;
|
||||
use crate::termdict::{TermMerger, TermOrdinal};
|
||||
use crate::{DocAddress, DocId, InvertedIndexReader};
|
||||
@@ -31,7 +29,7 @@ use crate::{DocAddress, DocId, InvertedIndexReader};
|
||||
pub const MAX_DOC_LIMIT: u32 = 1 << 31;
|
||||
|
||||
fn estimate_total_num_tokens_in_single_segment(
|
||||
reader: &dyn SegmentReader,
|
||||
reader: &SegmentReader,
|
||||
field: Field,
|
||||
) -> crate::Result<u64> {
|
||||
// There are no deletes. We can simply use the exact value saved into the posting list.
|
||||
@@ -43,7 +41,7 @@ fn estimate_total_num_tokens_in_single_segment(
|
||||
|
||||
// When there are deletes, we use an approximation
// based on the fieldnorm.
|
||||
if let Ok(fieldnorm_reader) = reader.get_fieldnorms_reader(field) {
|
||||
if let Some(fieldnorm_reader) = reader.fieldnorms_readers().get_field(field)? {
|
||||
let mut count: [usize; 256] = [0; 256];
|
||||
for doc in reader.doc_ids_alive() {
|
||||
let fieldnorm_id = fieldnorm_reader.fieldnorm_id(doc);
|
||||
@@ -72,23 +70,19 @@ fn estimate_total_num_tokens_in_single_segment(
|
||||
Ok((segment_num_tokens as f64 * ratio) as u64)
|
||||
}
|
||||
|
||||
fn estimate_total_num_tokens(
|
||||
readers: &[Arc<dyn SegmentReader>],
|
||||
field: Field,
|
||||
) -> crate::Result<u64> {
|
||||
fn estimate_total_num_tokens(readers: &[SegmentReader], field: Field) -> crate::Result<u64> {
|
||||
let mut total_num_tokens: u64 = 0;
|
||||
for reader in readers {
|
||||
total_num_tokens += estimate_total_num_tokens_in_single_segment(reader.as_ref(), field)?;
|
||||
total_num_tokens += estimate_total_num_tokens_in_single_segment(reader, field)?;
|
||||
}
|
||||
Ok(total_num_tokens)
|
||||
}
|
||||
|
||||
pub struct IndexMerger<C: Codec = StandardCodec> {
|
||||
schema: Schema,
|
||||
pub(crate) readers: Vec<Arc<dyn SegmentReader>>,
|
||||
pub(crate) readers: Vec<SegmentReader>,
|
||||
max_doc: u32,
|
||||
codec: C,
|
||||
phantom: PhantomData<C>,
|
||||
}
|
||||
|
||||
struct DeltaComputer {
|
||||
@@ -183,12 +177,8 @@ impl<C: Codec> IndexMerger<C> {
|
||||
let mut readers = vec![];
|
||||
for (segment, new_alive_bitset_opt) in segments.iter().zip(alive_bitset_opt) {
|
||||
if segment.meta().num_docs() > 0 {
|
||||
let reader = segment.index().codec().open_segment_reader(
|
||||
segment.index().directory(),
|
||||
segment.meta(),
|
||||
segment.schema(),
|
||||
new_alive_bitset_opt,
|
||||
)?;
|
||||
let reader =
|
||||
SegmentReader::open_with_custom_alive_set(segment, new_alive_bitset_opt)?;
|
||||
readers.push(reader);
|
||||
}
|
||||
}
|
||||
@@ -207,7 +197,6 @@ impl<C: Codec> IndexMerger<C> {
|
||||
readers,
|
||||
max_doc,
|
||||
codec,
|
||||
phantom: PhantomData,
|
||||
})
|
||||
}
|
||||
|
||||
@@ -281,7 +270,7 @@ impl<C: Codec> IndexMerger<C> {
|
||||
}),
|
||||
);
|
||||
|
||||
let has_deletes: bool = self.readers.iter().any(|reader| reader.has_deletes());
|
||||
let has_deletes: bool = self.readers.iter().any(SegmentReader::has_deletes);
|
||||
let mapping_type = if has_deletes {
|
||||
MappingType::StackedWithDeletes
|
||||
} else {
|
||||
@@ -306,7 +295,7 @@ impl<C: Codec> IndexMerger<C> {
|
||||
&self,
|
||||
indexed_field: Field,
|
||||
_field_type: &FieldType,
|
||||
serializer: &mut InvertedIndexSerializer,
|
||||
serializer: &mut InvertedIndexSerializer<C>,
|
||||
fieldnorm_reader: Option<FieldNormReader>,
|
||||
doc_id_mapping: &SegmentDocIdMapping,
|
||||
) -> crate::Result<()> {
|
||||
@@ -316,7 +305,7 @@ impl<C: Codec> IndexMerger<C> {
|
||||
|
||||
let mut max_term_ords: Vec<TermOrdinal> = Vec::new();
|
||||
|
||||
let field_readers: Vec<Arc<dyn InvertedIndexReader>> = self
|
||||
let field_readers: Vec<Arc<InvertedIndexReader>> = self
|
||||
.readers
|
||||
.iter()
|
||||
.map(|reader| reader.inverted_index(indexed_field))
|
||||
@@ -388,14 +377,23 @@ impl<C: Codec> IndexMerger<C> {
|
||||
// Let's compute the list of non-empty posting lists
|
||||
for (segment_ord, term_info) in merged_terms.current_segment_ords_and_term_infos() {
|
||||
let segment_reader = &self.readers[segment_ord];
|
||||
let inverted_index = &field_readers[segment_ord];
|
||||
if let Some((doc_freq, postings)) = postings_for_merge::<C>(
|
||||
inverted_index.as_ref(),
|
||||
&self.codec,
|
||||
let inverted_index: &InvertedIndexReader = &field_readers[segment_ord];
|
||||
let postings = inverted_index.read_postings_from_terminfo_specialized(
|
||||
&term_info,
|
||||
segment_postings_option,
|
||||
segment_reader.alive_bitset(),
|
||||
)? {
|
||||
&self.codec,
|
||||
)?;
|
||||
let alive_bitset_opt = segment_reader.alive_bitset();
|
||||
let doc_freq = if let Some(alive_bitset) = alive_bitset_opt {
|
||||
doc_freq_given_deletes(&postings, alive_bitset)
|
||||
} else {
|
||||
// We do not need an exact document frequency here.
|
||||
match postings.doc_freq() {
|
||||
crate::postings::DocFreq::Approximate(_) => exact_doc_freq(&postings),
|
||||
crate::postings::DocFreq::Exact(doc_freq) => doc_freq,
|
||||
}
|
||||
};
|
||||
if doc_freq > 0u32 {
|
||||
total_doc_freq += doc_freq;
|
||||
segment_postings_containing_the_term.push((segment_ord, postings));
|
||||
}
|
||||
@@ -483,7 +481,7 @@ impl<C: Codec> IndexMerger<C> {
|
||||
|
||||
fn write_postings(
|
||||
&self,
|
||||
serializer: &mut InvertedIndexSerializer,
|
||||
serializer: &mut InvertedIndexSerializer<C>,
|
||||
fieldnorm_readers: FieldNormReaders,
|
||||
doc_id_mapping: &SegmentDocIdMapping,
|
||||
) -> crate::Result<()> {
|
||||
@@ -506,7 +504,33 @@ impl<C: Codec> IndexMerger<C> {
|
||||
debug_time!("write-storable-fields");
|
||||
debug!("write-storable-field");
|
||||
|
||||
store_writer.merge_segment_readers(&self.readers)?;
|
||||
for reader in &self.readers {
|
||||
let store_reader = reader.get_store_reader(1)?;
|
||||
if reader.has_deletes()
|
||||
// If there is not enough data in the store, we avoid stacking in order to
|
||||
// avoid creating many small blocks in the doc store. Once we have 5 full blocks,
|
||||
// we start stacking. In the worst case 2/7 of the blocks would be very small.
|
||||
// [segment 1 - {1 doc}][segment 2 - {fullblock * 5}{1doc}]
|
||||
// => 5 * full blocks, 2 * 1 document blocks
|
||||
//
|
||||
// In a more realistic scenario the segments are of the same size, so 1/6 of
|
||||
// the doc stores would be on average half full, given total randomness (which
|
||||
// is not the case here, but not sure how it behaves exactly).
|
||||
//
|
||||
// https://github.com/quickwit-oss/tantivy/issues/1053
|
||||
//
|
||||
// take 7 in order to not walk over all checkpoints.
|
||||
|| store_reader.block_checkpoints().take(7).count() < 6
|
||||
|| store_reader.decompressor() != store_writer.compressor().into()
|
||||
{
|
||||
for doc_bytes_res in store_reader.iter_raw(reader.alive_bitset()) {
|
||||
let doc_bytes = doc_bytes_res?;
|
||||
store_writer.store_bytes(&doc_bytes)?;
|
||||
}
|
||||
} else {
|
||||
store_writer.stack(store_reader)?;
|
||||
}
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
|
||||
@@ -551,66 +575,32 @@ pub(crate) fn doc_freq_given_deletes<P: Postings + Clone>(
|
||||
postings: &P,
|
||||
alive_bitset: &AliveBitSet,
|
||||
) -> u32 {
|
||||
let mut postings = postings.clone();
|
||||
let mut docset = postings.clone();
|
||||
let mut doc_freq = 0;
|
||||
loop {
|
||||
let doc = postings.doc();
|
||||
let doc = docset.doc();
|
||||
if doc == TERMINATED {
|
||||
return doc_freq;
|
||||
}
|
||||
if alive_bitset.is_alive(doc) {
|
||||
doc_freq += 1u32;
|
||||
}
|
||||
postings.advance();
|
||||
docset.advance();
|
||||
}
|
||||
}
|
||||
|
||||
fn read_postings_for_merge<C: Codec>(
|
||||
inverted_index: &dyn InvertedIndexReader,
|
||||
codec: &C,
|
||||
term_info: &TermInfo,
|
||||
option: IndexRecordOption,
|
||||
) -> io::Result<<C::PostingsCodec as PostingsCodec>::Postings> {
|
||||
codec.load_postings_typed(inverted_index, term_info, option)
|
||||
}
|
||||
|
||||
fn postings_for_merge<C: Codec>(
|
||||
inverted_index: &dyn InvertedIndexReader,
|
||||
codec: &C,
|
||||
term_info: &TermInfo,
|
||||
option: IndexRecordOption,
|
||||
alive_bitset_opt: Option<&AliveBitSet>,
|
||||
) -> io::Result<Option<(u32, <C::PostingsCodec as PostingsCodec>::Postings)>> {
|
||||
let postings = read_postings_for_merge(inverted_index, codec, term_info, option)?;
|
||||
let doc_freq = if let Some(alive_bitset) = alive_bitset_opt {
|
||||
doc_freq_given_deletes(&postings, alive_bitset)
|
||||
} else {
|
||||
// We do not need an exact document frequency here.
|
||||
match postings.doc_freq() {
|
||||
crate::postings::DocFreq::Exact(doc_freq) => doc_freq,
|
||||
crate::postings::DocFreq::Approximate(_) => exact_doc_freq(&postings),
|
||||
}
|
||||
};
|
||||
|
||||
if doc_freq == 0u32 {
|
||||
return Ok(None);
|
||||
}
|
||||
|
||||
Ok(Some((doc_freq, postings)))
|
||||
}
|
||||
|
||||
/// If the posting list is not able to inform us of the document frequency,
/// we just scan through it.
|
||||
pub(crate) fn exact_doc_freq<P: Postings + Clone>(postings: &P) -> u32 {
|
||||
let mut postings = postings.clone();
|
||||
let mut docset = postings.clone();
|
||||
let mut doc_freq = 0;
|
||||
loop {
|
||||
let doc = postings.doc();
|
||||
let doc = docset.doc();
|
||||
if doc == TERMINATED {
|
||||
return doc_freq;
|
||||
}
|
||||
doc_freq += 1u32;
|
||||
postings.advance();
|
||||
docset.advance();
|
||||
}
|
||||
}
|
||||
|
||||
@@ -746,32 +736,32 @@ mod tests {
|
||||
);
|
||||
}
|
||||
{
|
||||
let doc = searcher.doc(DocAddress::new(0, 0))?;
|
||||
let doc = searcher.doc::<TantivyDocument>(DocAddress::new(0, 0))?;
|
||||
assert_eq!(
|
||||
doc.get_first(text_field).unwrap().as_value().as_str(),
|
||||
Some("af b")
|
||||
);
|
||||
}
|
||||
{
|
||||
let doc = searcher.doc(DocAddress::new(0, 1))?;
|
||||
let doc = searcher.doc::<TantivyDocument>(DocAddress::new(0, 1))?;
|
||||
assert_eq!(
|
||||
doc.get_first(text_field).unwrap().as_value().as_str(),
|
||||
Some("a b c")
|
||||
);
|
||||
}
|
||||
{
|
||||
let doc = searcher.doc(DocAddress::new(0, 2))?;
|
||||
let doc = searcher.doc::<TantivyDocument>(DocAddress::new(0, 2))?;
|
||||
assert_eq!(
|
||||
doc.get_first(text_field).unwrap().as_value().as_str(),
|
||||
Some("a b c d")
|
||||
);
|
||||
}
|
||||
{
|
||||
let doc = searcher.doc(DocAddress::new(0, 3))?;
|
||||
let doc = searcher.doc::<TantivyDocument>(DocAddress::new(0, 3))?;
|
||||
assert_eq!(doc.get_first(text_field).unwrap().as_str(), Some("af b"));
|
||||
}
|
||||
{
|
||||
let doc = searcher.doc(DocAddress::new(0, 4))?;
|
||||
let doc = searcher.doc::<TantivyDocument>(DocAddress::new(0, 4))?;
|
||||
assert_eq!(doc.get_first(text_field).unwrap().as_str(), Some("a b c g"));
|
||||
}
|
||||
|
||||
@@ -1599,7 +1589,7 @@ mod tests {
|
||||
for segment_reader in searcher.segment_readers() {
|
||||
let mut term_scorer = term_query
|
||||
.specialized_weight(EnableScoring::enabled_from_searcher(&searcher))?
|
||||
.term_scorer_for_test(segment_reader.as_ref(), 1.0)
|
||||
.term_scorer_for_test(segment_reader, 1.0)
|
||||
.unwrap();
|
||||
// the difference compared to before is intrinsic to the bm25 formula. no worries
|
||||
// there.
|
||||
@@ -1654,8 +1644,6 @@ mod tests {
|
||||
assert_eq!(super::doc_freq_given_deletes(&docs, &alive_bitset), 2);
|
||||
let all_deleted =
|
||||
AliveBitSet::for_test_from_deleted_docs(&[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], 12);
|
||||
let docs =
|
||||
<StandardPostingsCodec as PostingsCodec>::Postings::create_from_docs(&[0, 2, 10]);
|
||||
assert_eq!(super::doc_freq_given_deletes(&docs, &all_deleted), 0);
|
||||
}
|
||||
}
|
||||
|
||||
@@ -13,7 +13,7 @@ pub struct SegmentSerializer<C: crate::codec::Codec> {
|
||||
pub(crate) store_writer: StoreWriter,
|
||||
fast_field_write: WritePtr,
|
||||
fieldnorms_serializer: Option<FieldNormsSerializer>,
|
||||
postings_serializer: InvertedIndexSerializer,
|
||||
postings_serializer: InvertedIndexSerializer<C>,
|
||||
}
|
||||
|
||||
impl<C: crate::codec::Codec> SegmentSerializer<C> {
|
||||
@@ -55,7 +55,7 @@ impl<C: crate::codec::Codec> SegmentSerializer<C> {
|
||||
}
|
||||
|
||||
/// Accessor to the `PostingsSerializer`.
|
||||
pub fn get_postings_serializer(&mut self) -> &mut InvertedIndexSerializer {
|
||||
pub fn get_postings_serializer(&mut self) -> &mut InvertedIndexSerializer<C> {
|
||||
&mut self.postings_serializer
|
||||
}
|
||||
|
||||
|
||||
@@ -239,7 +239,7 @@ pub fn merge_filtered_segments<C: crate::codec::Codec, T: Into<Box<dyn Directory
|
||||
))
|
||||
.trim_end()
|
||||
);
|
||||
let codec_configuration = CodecConfiguration::from(segments[0].index().codec());
|
||||
let codec_configuration = CodecConfiguration::from_codec(segments[0].index().codec());
|
||||
|
||||
let index_meta = IndexMeta {
|
||||
index_settings: target_settings, // index_settings of all segments should be the same
|
||||
@@ -410,7 +410,7 @@ impl<Codec: crate::codec::Codec> SegmentUpdater<Codec> {
|
||||
//
|
||||
// Segment 1 from disk 1, Segment 1 from disk 2, etc.
|
||||
committed_segment_metas.sort_by_key(|segment_meta| -(segment_meta.max_doc() as i32));
|
||||
let codec = CodecConfiguration::from(index.codec());
|
||||
let codec = CodecConfiguration::from_codec(index.codec());
|
||||
let index_meta = IndexMeta {
|
||||
index_settings: index.settings().clone(),
|
||||
segments: committed_segment_metas,
|
||||
|
||||
@@ -438,7 +438,7 @@ mod tests {
|
||||
Document, IndexRecordOption, OwnedValue, Schema, TextFieldIndexing, TextOptions, Value,
|
||||
DATE_TIME_PRECISION_INDEXED, FAST, STORED, STRING, TEXT,
|
||||
};
|
||||
use crate::store::{Compressor, StoreWriter, TantivyStoreReader};
|
||||
use crate::store::{Compressor, StoreReader, StoreWriter};
|
||||
use crate::time::format_description::well_known::Rfc3339;
|
||||
use crate::time::OffsetDateTime;
|
||||
use crate::tokenizer::{PreTokenizedString, Token};
|
||||
@@ -486,8 +486,8 @@ mod tests {
|
||||
store_writer.store(&doc, &schema).unwrap();
|
||||
store_writer.close().unwrap();
|
||||
|
||||
let reader = TantivyStoreReader::open(directory.open_read(path).unwrap(), 0).unwrap();
|
||||
let doc = reader.get(0).unwrap();
|
||||
let reader = StoreReader::open(directory.open_read(path).unwrap(), 0).unwrap();
|
||||
let doc = reader.get::<TantivyDocument>(0).unwrap();
|
||||
|
||||
assert_eq!(doc.field_values().count(), 2);
|
||||
assert_eq!(
|
||||
@@ -604,12 +604,16 @@ mod tests {
|
||||
let reader = index.reader().unwrap();
|
||||
let searcher = reader.searcher();
|
||||
let doc = searcher
|
||||
.doc(DocAddress {
|
||||
.doc::<TantivyDocument>(DocAddress {
|
||||
segment_ord: 0u32,
|
||||
doc_id: 0u32,
|
||||
})
|
||||
.unwrap();
|
||||
let serdeser_json_val = doc.to_json(&schema).get("json").unwrap().clone();
|
||||
let serdeser_json_val = serde_json::from_str::<serde_json::Value>(&doc.to_json(&schema))
|
||||
.unwrap()
|
||||
.get("json")
|
||||
.unwrap()[0]
|
||||
.clone();
|
||||
assert_eq!(json_val, serdeser_json_val);
|
||||
let segment_reader = searcher.segment_reader(0u32);
|
||||
let inv_idx = segment_reader.inverted_index(json_field).unwrap();
|
||||
@@ -871,7 +875,7 @@ mod tests {
|
||||
let searcher = reader.searcher();
|
||||
let segment_reader = searcher.segment_reader(0u32);
|
||||
|
||||
fn assert_type(reader: &dyn SegmentReader, field: &str, typ: ColumnType) {
|
||||
fn assert_type(reader: &SegmentReader, field: &str, typ: ColumnType) {
|
||||
let cols = reader.fast_fields().dynamic_column_handles(field).unwrap();
|
||||
assert_eq!(cols.len(), 1, "{field}");
|
||||
assert_eq!(cols[0].column_type(), typ, "{field}");
|
||||
@@ -890,7 +894,7 @@ mod tests {
|
||||
assert_type(segment_reader, "json.my_arr", ColumnType::I64);
|
||||
assert_type(segment_reader, "json.my_arr.my_key", ColumnType::Str);
|
||||
|
||||
fn assert_empty(reader: &dyn SegmentReader, field: &str) {
|
||||
fn assert_empty(reader: &SegmentReader, field: &str) {
|
||||
let cols = reader.fast_fields().dynamic_column_handles(field).unwrap();
|
||||
assert_eq!(cols.len(), 0);
|
||||
}
|
||||
|
||||
@@ -53,7 +53,7 @@ impl<Codec: crate::codec::Codec, D: Document> SingleSegmentIndexWriter<Codec, D>
|
||||
schema: index.schema(),
|
||||
opstamp: 0,
|
||||
payload: None,
|
||||
codec: CodecConfiguration::from(index.codec()),
|
||||
codec: CodecConfiguration::from_codec(index.codec()),
|
||||
};
|
||||
save_metas(&index_meta, index.directory())?;
|
||||
index.directory().sync_directory()?;
|
||||
|
||||
src/lib.rs
@@ -93,7 +93,7 @@
|
||||
//!
|
||||
//! for (_score, doc_address) in top_docs {
|
||||
//! // Retrieve the actual content of documents given its `doc_address`.
|
||||
//! let retrieved_doc = searcher.doc(doc_address)?;
|
||||
//! let retrieved_doc = searcher.doc::<TantivyDocument>(doc_address)?;
|
||||
//! println!("{}", retrieved_doc.to_json(&schema));
|
||||
//! }
|
||||
//!
|
||||
@@ -224,11 +224,11 @@ use once_cell::sync::Lazy;
|
||||
use serde::{Deserialize, Serialize};
|
||||
|
||||
pub use self::docset::{DocSet, COLLECT_BLOCK_BUFFER_LEN, TERMINATED};
|
||||
pub use crate::core::{json_utils, Executor, Searcher, SearcherContext, SearcherGeneration};
|
||||
pub use crate::core::{json_utils, Executor, Searcher, SearcherGeneration};
|
||||
pub use crate::directory::Directory;
|
||||
pub use crate::index::{
|
||||
Index, IndexBuilder, IndexMeta, IndexSettings, InvertedIndexReader, Order, Segment,
|
||||
SegmentMeta, SegmentReader, TantivyInvertedIndexReader, TantivySegmentReader,
|
||||
SegmentMeta, SegmentReader,
|
||||
};
|
||||
pub use crate::indexer::{IndexWriter, SingleSegmentIndexWriter};
|
||||
pub use crate::schema::{Document, TantivyDocument, Term};
|
||||
@@ -380,7 +380,7 @@ pub mod tests {
|
||||
|
||||
use common::{BinarySerializable, FixedSize};
|
||||
use query_grammar::{UserInputAst, UserInputLeaf, UserInputLiteral};
|
||||
use rand::distr::{Bernoulli, Uniform};
|
||||
use rand::distributions::{Bernoulli, Uniform};
|
||||
use rand::rngs::StdRng;
|
||||
use rand::{Rng, SeedableRng};
|
||||
use time::OffsetDateTime;
|
||||
@@ -431,7 +431,7 @@ pub mod tests {
|
||||
pub fn generate_nonunique_unsorted(max_value: u32, n_elems: usize) -> Vec<u32> {
|
||||
let seed: [u8; 32] = [1; 32];
|
||||
StdRng::from_seed(seed)
|
||||
.sample_iter(&Uniform::new(0u32, max_value).unwrap())
|
||||
.sample_iter(&Uniform::new(0u32, max_value))
|
||||
.take(n_elems)
|
||||
.collect::<Vec<u32>>()
|
||||
}
|
||||
@@ -548,7 +548,7 @@ pub mod tests {
|
||||
index_writer.commit()?;
|
||||
let reader = index.reader()?;
|
||||
let searcher = reader.searcher();
|
||||
let segment_reader: &dyn SegmentReader = searcher.segment_reader(0);
|
||||
let segment_reader: &SegmentReader = searcher.segment_reader(0);
|
||||
let fieldnorms_reader = segment_reader.get_fieldnorms_reader(text_field)?;
|
||||
assert_eq!(fieldnorms_reader.fieldnorm(0), 3);
|
||||
assert_eq!(fieldnorms_reader.fieldnorm(1), 0);
|
||||
@@ -556,7 +556,7 @@ pub mod tests {
|
||||
Ok(())
|
||||
}
|
||||
|
||||
fn advance_undeleted(docset: &mut dyn DocSet, reader: &dyn SegmentReader) -> bool {
|
||||
fn advance_undeleted(docset: &mut dyn DocSet, reader: &SegmentReader) -> bool {
|
||||
let mut doc = docset.advance();
|
||||
while doc != TERMINATED {
|
||||
if !reader.is_deleted(doc) {
|
||||
@@ -1073,7 +1073,7 @@ pub mod tests {
|
||||
}
|
||||
let reader = index.reader()?;
|
||||
let searcher = reader.searcher();
|
||||
let segment_reader: &dyn SegmentReader = searcher.segment_reader(0);
|
||||
let segment_reader: &SegmentReader = searcher.segment_reader(0);
|
||||
{
|
||||
let fast_field_reader_res = segment_reader.fast_fields().u64("text");
|
||||
assert!(fast_field_reader_res.is_err());
|
||||
|
||||
@@ -397,10 +397,7 @@ mod bench {
|
||||
let mut seed: [u8; 32] = [0; 32];
|
||||
seed[31] = seed_val;
|
||||
let mut rng = StdRng::from_seed(seed);
|
||||
(0u32..)
|
||||
.filter(|_| rng.random_bool(ratio))
|
||||
.take(n)
|
||||
.collect()
|
||||
(0u32..).filter(|_| rng.gen_bool(ratio)).take(n).collect()
|
||||
}
|
||||
|
||||
pub fn generate_array(n: usize, ratio: f64) -> Vec<u32> {
|
||||
|
||||
@@ -3,6 +3,7 @@ use std::io;
|
||||
use common::json_path_writer::JSON_END_OF_PATH;
|
||||
use stacker::Addr;
|
||||
|
||||
use crate::codec::Codec;
|
||||
use crate::indexer::indexing_term::IndexingTerm;
|
||||
use crate::indexer::path_to_unordered_id::OrderedPathId;
|
||||
use crate::postings::postings_writer::SpecializedPostingsWriter;
|
||||
@@ -52,12 +53,12 @@ impl<Rec: Recorder> PostingsWriter for JsonPostingsWriter<Rec> {
|
||||
}
|
||||
|
||||
/// The actual serialization format is handled by the `PostingsSerializer`.
|
||||
fn serialize(
|
||||
fn serialize<C: Codec>(
|
||||
&self,
|
||||
ordered_term_addrs: &[(Field, OrderedPathId, &[u8], Addr)],
|
||||
ordered_id_to_path: &[&str],
|
||||
ctx: &IndexingContext,
|
||||
serializer: &mut FieldSerializer,
|
||||
serializer: &mut FieldSerializer<C>,
|
||||
) -> io::Result<()> {
|
||||
let mut term_buffer = JsonTermSerializer(Vec::with_capacity(48));
|
||||
let mut buffer_lender = BufferLender::default();
|
||||
|
||||