mirror of https://github.com/quickwit-oss/tantivy.git
synced 2026-02-03 23:00:37 +00:00

Compare commits: stuhood.in... → main (11 commits)
28db952131
98ebbf922d
4a89e74597
4d99e51e50
9b619998bd
765c448945
943594ebaa
df17daae0d
0ae94baef5
3f448ecf79
b86caeefe2
.claude/skills/rationalize-deps/SKILL.md (new file, 125 lines)
@@ -0,0 +1,125 @@
---
name: rationalize-deps
description: Analyze Cargo.toml dependencies and attempt to remove unused features to reduce compile times and binary size
---

# Rationalize Dependencies

This skill analyzes Cargo.toml dependencies to identify and remove unused features.

## Overview

Many crates enable features by default that may not be needed. This skill:

1. Identifies dependencies with default features enabled
2. Tests if `default-features = false` works
3. Identifies which specific features are actually needed
4. Verifies compilation after changes

## Step 1: Identify the target

Ask the user which crate(s) to analyze:

- A specific crate name (e.g., "tokio", "serde")
- A specific workspace member (e.g., "quickwit-search")
- "all" to scan the entire workspace

## Step 2: Analyze current dependencies

For the workspace Cargo.toml (`quickwit/Cargo.toml`), list dependencies that:

- Do NOT have `default-features = false`
- Have default features that might be unnecessary

Run: `cargo tree -p <crate> -f "{p} {f}" --edges features` to see which features are actually used.

## Step 3: For each candidate dependency

### 3a: Check the crate's default features

Look up the crate on crates.io or check its Cargo.toml to understand:

- Which features are enabled by default
- What each feature provides

Use: `cargo metadata --format-version=1 | jq '.packages[] | select(.name == "<crate>") | .features'`

### 3b: Try disabling default features

Modify the dependency in `quickwit/Cargo.toml`:

From:
```toml
some-crate = { version = "1.0" }
```

To:
```toml
some-crate = { version = "1.0", default-features = false }
```

### 3c: Run cargo check

Run: `cargo check --workspace` (or target specific packages for faster feedback)

If compilation fails:

1. Read the error messages to identify which features are needed
2. Add only the required features explicitly:
```toml
some-crate = { version = "1.0", default-features = false, features = ["needed-feature"] }
```
3. Re-run cargo check

### 3d: Binary search for minimal features

If there are many default features, use binary search (a shell sketch follows this list):

1. Start with no features
2. If it fails, add half the default features
3. Continue until you find the minimal set
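A minimal shell sketch of that loop, assuming a hypothetical crate `some-crate` whose default features are `std`, `derive`, and `rt` (the crate and feature names are illustrative, not taken from a real crate; `cargo add` ships with stable cargo and rewrites the Cargo.toml entry):

```bash
#!/usr/bin/env bash
# Hypothetical example: escalate from no features toward the full default
# set of `some-crate` until the workspace compiles again.
try() {
  # $1: comma-separated feature list (may be empty)
  cargo add some-crate --no-default-features ${1:+--features "$1"} >/dev/null
  cargo check --workspace --quiet
}

try "" ||            # no features at all
try "std" ||         # half of the (hypothetical) defaults
try "std,derive" ||  # three quarters
try "std,derive,rt"  # fall back to the full default set
```

A true bisection would recursively split whichever half still fails; for the handful of features most crates expose, a linear escalation like this is usually enough.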
## Step 4: Document findings

For each dependency analyzed, report:

- Original configuration
- New configuration (if changed)
- Features that were removed
- Any features that are required

## Step 5: Verify full build

After all changes, run:

```bash
cargo check --workspace --all-targets
cargo test --workspace --no-run
```

## Common Patterns

### Serde
Often only needs `derive`:
```toml
serde = { version = "1.0", default-features = false, features = ["derive", "std"] }
```

### Tokio
Identify which runtime features are actually used:
```toml
tokio = { version = "1.0", default-features = false, features = ["rt-multi-thread", "macros", "sync"] }
```

### Reqwest
Often doesn't need all TLS backends:
```toml
reqwest = { version = "0.11", default-features = false, features = ["rustls-tls", "json"] }
```

## Rollback

If changes cause issues:

```bash
git checkout quickwit/Cargo.toml
cargo check --workspace
```

## Tips

- Start with large crates that have many default features (tokio, reqwest, hyper)
- Use `cargo bloat --crates` to identify large dependencies
- Check `cargo tree -d` for duplicate dependencies that might indicate feature conflicts
- Some features are needed only for tests - consider enabling those only for `[dev-dependencies]`
.claude/skills/simple-pr/SKILL.md (new file, 60 lines)
@@ -0,0 +1,60 @@
---
name: simple-pr
description: Create a simple PR from staged changes with an auto-generated commit message
disable-model-invocation: true
---

# Simple PR

Follow these steps to create a simple PR from staged changes:

## Step 1: Check workspace state

Run: `git status`

Verify that all changes have been staged (no unstaged changes). If there are unstaged changes, abort and ask the user to stage their changes first with `git add`.

Also verify that we are on the `main` branch. If not, abort and ask the user to switch to main first.

## Step 2: Ensure main is up to date

Run: `git pull origin main`

This ensures we're working from the latest code.

## Step 3: Review staged changes

Run: `git diff --cached`

Review the staged changes to understand what the PR will contain.

## Step 4: Generate commit message

Based on the staged changes, generate a concise commit message (1-2 sentences) that describes the "why" rather than the "what".

Display the proposed commit message to the user and ask for confirmation before proceeding.

## Step 5: Create a new branch

Get the git username: `git config user.name | tr ' ' '-' | tr '[:upper:]' '[:lower:]'`

Create a short, descriptive branch name based on the changes (e.g., `fix-typo-in-readme`, `add-retry-logic`, `update-deps`).

Create and checkout the branch: `git checkout -b {username}/{short-descriptive-name}`

## Step 6: Commit changes

Commit with the message from step 4:
```
git commit -m "{commit-message}"
```

## Step 7: Push and open a PR

Push the branch and open a PR:
```
git push -u origin {branch-name}
gh pr create --title "{commit-message-title}" --body "{longer-description-if-needed}"
```

Report the PR URL to the user when complete.
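End to end, steps 1-7 amount to a short script. The sketch below is illustrative only (not part of the skill file); the branch name and commit message are placeholders that the skill would generate from the staged diff, and it assumes an authenticated GitHub CLI (`gh`):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Abort on unstaged or untracked changes (second status column non-blank).
if git status --porcelain | grep -q '^.[^ ]'; then
  echo "unstaged changes; run git add first" >&2; exit 1
fi
if [ "$(git branch --show-current)" != "main" ]; then
  echo "switch to main first" >&2; exit 1
fi
git pull origin main

username=$(git config user.name | tr ' ' '-' | tr '[:upper:]' '[:lower:]')
branch="$username/fix-typo-in-readme"   # placeholder branch name
message="Fix typo in README"            # placeholder commit message

git checkout -b "$branch"
git commit -m "$message"
git push -u origin "$branch"
gh pr create --title "$message" --body ""
```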
Cargo.toml (11 changed lines)
@@ -15,7 +15,7 @@ rust-version = "1.85"
exclude = ["benches/*.json", "benches/*.txt"]

[dependencies]
-oneshot = "0.1.7"
+oneshot = "0.1.13"
base64 = "0.22.0"
byteorder = "1.4.3"
crc32fast = "1.3.2"

@@ -193,3 +193,12 @@ harness = false
[[bench]]
name = "str_search_and_get"
harness = false
+
+[[bench]]
+name = "merge_segments"
+harness = false
+
+[[bench]]
+name = "regex_all_terms"
+harness = false
benches/merge_segments.rs (new file, 224 lines)
@@ -0,0 +1,224 @@
// Benchmarks segment merging
//
// Notes:
// - Input segments are kept intact (no deletes / no IndexWriter merge).
// - Output is written to a `NullDirectory` that discards all files except
//   fieldnorms (needed for merging).
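// Run with: `cargo bench --bench merge_segments` (assumed from the
// `[[bench]]` registration with `harness = false` in Cargo.toml above).
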
use std::collections::HashMap;
use std::io::{self, Write};
use std::path::{Path, PathBuf};
use std::sync::{Arc, RwLock};

use binggan::{black_box, BenchRunner};
use rand::prelude::*;
use rand::rngs::StdRng;
use rand::SeedableRng;
use tantivy::directory::error::{DeleteError, OpenReadError, OpenWriteError};
use tantivy::directory::{
    AntiCallToken, Directory, FileHandle, OwnedBytes, TerminatingWrite, WatchCallback, WatchHandle,
    WritePtr,
};
use tantivy::indexer::{merge_filtered_segments, NoMergePolicy};
use tantivy::schema::{Schema, TEXT};
use tantivy::{doc, HasLen, Index, IndexSettings, Segment};

#[derive(Clone, Default, Debug)]
struct NullDirectory {
    blobs: Arc<RwLock<HashMap<PathBuf, OwnedBytes>>>,
}

struct NullWriter;

impl Write for NullWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

impl TerminatingWrite for NullWriter {
    fn terminate_ref(&mut self, _token: AntiCallToken) -> io::Result<()> {
        Ok(())
    }
}

struct InMemoryWriter {
    path: PathBuf,
    buffer: Vec<u8>,
    blobs: Arc<RwLock<HashMap<PathBuf, OwnedBytes>>>,
}

impl Write for InMemoryWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.buffer.extend_from_slice(buf);
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

impl TerminatingWrite for InMemoryWriter {
    fn terminate_ref(&mut self, _token: AntiCallToken) -> io::Result<()> {
        let bytes = OwnedBytes::new(std::mem::take(&mut self.buffer));
        self.blobs.write().unwrap().insert(self.path.clone(), bytes);
        Ok(())
    }
}

#[derive(Debug, Default)]
struct NullFileHandle;
impl HasLen for NullFileHandle {
    fn len(&self) -> usize {
        0
    }
}
impl FileHandle for NullFileHandle {
    fn read_bytes(&self, _range: std::ops::Range<usize>) -> io::Result<OwnedBytes> {
        unimplemented!()
    }
}

impl Directory for NullDirectory {
    fn get_file_handle(&self, path: &Path) -> Result<Arc<dyn FileHandle>, OpenReadError> {
        if let Some(bytes) = self.blobs.read().unwrap().get(path) {
            return Ok(Arc::new(bytes.clone()));
        }
        Ok(Arc::new(NullFileHandle))
    }

    fn delete(&self, _path: &Path) -> Result<(), DeleteError> {
        Ok(())
    }

    fn exists(&self, _path: &Path) -> Result<bool, OpenReadError> {
        Ok(true)
    }

    fn open_write(&self, path: &Path) -> Result<WritePtr, OpenWriteError> {
        let path_buf = path.to_path_buf();
        if path.to_string_lossy().ends_with(".fieldnorm") {
            let writer = InMemoryWriter {
                path: path_buf,
                buffer: Vec::new(),
                blobs: Arc::clone(&self.blobs),
            };
            Ok(io::BufWriter::new(Box::new(writer)))
        } else {
            Ok(io::BufWriter::new(Box::new(NullWriter)))
        }
    }

    fn atomic_read(&self, path: &Path) -> Result<Vec<u8>, OpenReadError> {
        if let Some(bytes) = self.blobs.read().unwrap().get(path) {
            return Ok(bytes.as_slice().to_vec());
        }
        Err(OpenReadError::FileDoesNotExist(path.to_path_buf()))
    }

    fn atomic_write(&self, _path: &Path, _data: &[u8]) -> io::Result<()> {
        Ok(())
    }

    fn sync_directory(&self) -> io::Result<()> {
        Ok(())
    }

    fn watch(&self, _watch_callback: WatchCallback) -> tantivy::Result<WatchHandle> {
        Ok(WatchHandle::empty())
    }
}

struct MergeScenario {
    #[allow(dead_code)]
    index: Index,
    segments: Vec<Segment>,
    settings: IndexSettings,
    label: String,
}

fn build_index(
    num_segments: usize,
    docs_per_segment: usize,
    tokens_per_doc: usize,
    vocab_size: usize,
) -> MergeScenario {
    let mut schema_builder = Schema::builder();
    let body = schema_builder.add_text_field("body", TEXT);
    let schema = schema_builder.build();
    let index = Index::create_in_ram(schema.clone());

    assert!(vocab_size > 0);
    let total_tokens = num_segments * docs_per_segment * tokens_per_doc;
    let use_unique_terms = vocab_size >= total_tokens;
    let mut rng = StdRng::from_seed([7u8; 32]);
    let mut next_token_id: u64 = 0;

    {
        let mut writer = index.writer_with_num_threads(1, 256_000_000).unwrap();
        writer.set_merge_policy(Box::new(NoMergePolicy));
        for _ in 0..num_segments {
            for _ in 0..docs_per_segment {
                let mut tokens = Vec::with_capacity(tokens_per_doc);
                for _ in 0..tokens_per_doc {
                    let token_id = if use_unique_terms {
                        let id = next_token_id;
                        next_token_id += 1;
                        id
                    } else {
                        rng.random_range(0..vocab_size as u64)
                    };
                    tokens.push(format!("term_{token_id}"));
                }
                writer.add_document(doc!(body => tokens.join(" "))).unwrap();
            }
            writer.commit().unwrap();
        }
    }

    let segments = index.searchable_segments().unwrap();
    let settings = index.settings().clone();
    let label = format!(
        "segments={}, docs/seg={}, tokens/doc={}, vocab={}",
        num_segments, docs_per_segment, tokens_per_doc, vocab_size
    );

    MergeScenario {
        index,
        segments,
        settings,
        label,
    }
}

fn main() {
    let scenarios = vec![
        build_index(8, 50_000, 12, 8),
        build_index(16, 50_000, 12, 8),
        build_index(16, 100_000, 12, 8),
        build_index(8, 50_000, 8, 8 * 50_000 * 8),
    ];

    let mut runner = BenchRunner::new();
    for scenario in scenarios {
        let mut group = runner.new_group();
        group.set_name(format!("merge_segments inv_index — {}", scenario.label));
        let segments = scenario.segments.clone();
        let settings = scenario.settings.clone();
        group.register("merge", move |_| {
            let output_dir = NullDirectory::default();
            let filter_doc_ids = vec![None; segments.len()];
            let merged_index =
                merge_filtered_segments(&segments, settings.clone(), filter_doc_ids, output_dir)
                    .unwrap();
            black_box(merged_index);
        });

        group.run();
    }
}
benches/regex_all_terms.rs (new file, 113 lines)
@@ -0,0 +1,113 @@
// Benchmarks regex query that matches all terms in a synthetic index.
//
// Corpus model:
// - N unique terms: t000000, t000001, ...
// - M docs
// - K tokens per doc: doc i gets terms derived from (i, token_index)
//
// Query:
// - Regex "t.*" to match all terms
//
// Run with:
// - cargo bench --bench regex_all_terms

use std::fmt::Write;

use binggan::{black_box, BenchRunner};
use tantivy::collector::Count;
use tantivy::query::RegexQuery;
use tantivy::schema::{Schema, TEXT};
use tantivy::{doc, Index, ReloadPolicy};

const HEAP_SIZE_BYTES: usize = 200_000_000;

#[derive(Clone, Copy)]
struct BenchConfig {
    num_terms: usize,
    num_docs: usize,
    tokens_per_doc: usize,
}

fn main() {
    let configs = default_configs();

    let mut runner = BenchRunner::new();
    for config in configs {
        let (index, text_field) = build_index(config, HEAP_SIZE_BYTES);
        let reader = index
            .reader_builder()
            .reload_policy(ReloadPolicy::Manual)
            .try_into()
            .expect("reader");
        let searcher = reader.searcher();
        let query = RegexQuery::from_pattern("t.*", text_field).expect("regex query");

        let mut group = runner.new_group();
        group.set_name(format!(
            "regex_all_terms_t{}_d{}_k{}",
            config.num_terms, config.num_docs, config.tokens_per_doc
        ));
        group.register("regex_count", move |_| {
            let count = searcher.search(&query, &Count).expect("search");
            black_box(count);
        });
        group.run();
    }
}

fn default_configs() -> Vec<BenchConfig> {
    vec![
        BenchConfig {
            num_terms: 10_000,
            num_docs: 100_000,
            tokens_per_doc: 1,
        },
        BenchConfig {
            num_terms: 10_000,
            num_docs: 100_000,
            tokens_per_doc: 8,
        },
        BenchConfig {
            num_terms: 100_000,
            num_docs: 100_000,
            tokens_per_doc: 1,
        },
        BenchConfig {
            num_terms: 100_000,
            num_docs: 100_000,
            tokens_per_doc: 8,
        },
    ]
}

fn build_index(config: BenchConfig, heap_size_bytes: usize) -> (Index, tantivy::schema::Field) {
    let mut schema_builder = Schema::builder();
    let text_field = schema_builder.add_text_field("text", TEXT);
    let schema = schema_builder.build();
    let index = Index::create_in_ram(schema);

    let term_width = config.num_terms.to_string().len();
    {
        let mut writer = index
            .writer_with_num_threads(1, heap_size_bytes)
            .expect("writer");
        let mut buffer = String::new();
        for doc_id in 0..config.num_docs {
            buffer.clear();
            for token_idx in 0..config.tokens_per_doc {
                if token_idx > 0 {
                    buffer.push(' ');
                }
                let term_id = (doc_id * config.tokens_per_doc + token_idx) % config.num_terms;
                write!(&mut buffer, "t{term_id:0term_width$}").expect("write token");
            }
            writer
                .add_document(doc!(text_field => buffer.as_str()))
                .expect("add_document");
        }
        writer.commit().expect("commit");
    }

    (index, text_field)
}
@@ -60,7 +60,7 @@ At indexing, tantivy will try to interpret number and strings as different type
priority order.

Numbers will be interpreted as u64, i64 and f64 in that order.
-Strings will be interpreted as rfc3999 dates or simple strings.
+Strings will be interpreted as rfc3339 dates or simple strings.

The first working type is picked and is the only term that is emitted for indexing.
Note this interpretation happens on a per-document basis, and there is no effort to try to sniff
@@ -81,7 +81,7 @@ Will be interpreted as
(my_path.my_segment, String, 233) or (my_path.my_segment, u64, 233)
```

-Likewise, we need to emit two tokens if the query contains an rfc3999 date.
+Likewise, we need to emit two tokens if the query contains an rfc3339 date.
Indeed the date could have been actually a single token inside the text of a document at ingestion time. Generally speaking, we will always at least emit a string token in query parsing, and sometimes more.

If one more json field is defined, things get even more complicated.
@@ -560,7 +560,7 @@ fn range_infallible(inp: &str) -> JResult<&str, UserInputLeaf> {
        (
            (
                value((), tag(">=")),
-                map(word_infallible("", false), |(bound, err)| {
+                map(word_infallible(")", false), |(bound, err)| {
                    (
                        (
                            bound
@@ -574,7 +574,7 @@ fn range_infallible(inp: &str) -> JResult<&str, UserInputLeaf> {
        ),
        (
            value((), tag("<=")),
-            map(word_infallible("", false), |(bound, err)| {
+            map(word_infallible(")", false), |(bound, err)| {
                (
                    (
                        UserInputBound::Unbounded,
@@ -588,7 +588,7 @@ fn range_infallible(inp: &str) -> JResult<&str, UserInputLeaf> {
        ),
        (
            value((), tag(">")),
-            map(word_infallible("", false), |(bound, err)| {
+            map(word_infallible(")", false), |(bound, err)| {
                (
                    (
                        bound
@@ -602,7 +602,7 @@ fn range_infallible(inp: &str) -> JResult<&str, UserInputLeaf> {
        ),
        (
            value((), tag("<")),
-            map(word_infallible("", false), |(bound, err)| {
+            map(word_infallible(")", false), |(bound, err)| {
                (
                    (
                        UserInputBound::Unbounded,
@@ -1323,6 +1323,14 @@ mod test {
        test_parse_query_to_ast_helper("<a", "{\"*\" TO \"a\"}");
        test_parse_query_to_ast_helper("<=a", "{\"*\" TO \"a\"]");
        test_parse_query_to_ast_helper("<=bsd", "{\"*\" TO \"bsd\"]");
+
+        test_parse_query_to_ast_helper("(<=42)", "{\"*\" TO \"42\"]");
+        test_parse_query_to_ast_helper("(<=42 )", "{\"*\" TO \"42\"]");
+        test_parse_query_to_ast_helper("(age:>5)", "\"age\":{\"5\" TO \"*\"}");
+        test_parse_query_to_ast_helper(
+            "(title:bar AND age:>12)",
+            "(+\"title\":bar +\"age\":{\"12\" TO \"*\"})",
+        );
    }

    #[test]
@@ -820,7 +820,7 @@ impl IntermediateRangeBucketEntry {
        };

        // If we have a date type on the histogram buckets, we add the `key_as_string` field as
-        // rfc339
+        // rfc3339
        if column_type == Some(ColumnType::DateTime) {
            if let Some(val) = range_bucket_entry.to {
                let key_as_string = format_date(val as i64)?;
@@ -676,7 +676,7 @@ mod tests {
        let num_segments = reader.searcher().segment_readers().len();
        assert!(num_segments <= 4);
        let num_components_except_deletes_and_tempstore =
-            crate::index::SegmentComponent::iterator().len() - 2;
+            crate::index::SegmentComponent::iterator().len() - 1;
        let max_num_mmapped = num_components_except_deletes_and_tempstore * num_segments;
        assert_eventually(|| {
            let num_mmapped = mmap_directory.get_cache_info().mmapped.len();
@@ -51,31 +51,55 @@ pub trait DocSet: Send {
        doc
    }

-    /// Seeks to the target if possible and returns true if the target is in the DocSet.
+    /// !!!Dragons ahead!!!
+    /// In spirit, this is an approximate and dangerous version of `seek`.
    ///
+    /// It can leave the DocSet in an `invalid` state and might return a
+    /// lower bound of what the result of Seek would have been.
+    ///
+    ///
+    /// More accurately it returns either:
+    /// - Found if the target is in the docset. In that case, the DocSet is left in a valid state.
+    /// - SeekLowerBound(seek_lower_bound) if the target is not in the docset. In that case, The
+    ///   DocSet can be the left in a invalid state. The DocSet should then only receives call to
+    ///   `seek_danger(..)` until it returns `Found`, and get back to a valid state.
+    ///
+    /// `seek_lower_bound` can be any `DocId` (in the docset or not) as long as it is in
+    /// `(target .. seek_result] U {TERMINATED}` where `seek_result` is the first document in the
+    /// docset greater than to `target`.
+    ///
+    /// `seek_danger` may return `SeekLowerBound(TERMINATED)`.
+    ///
+    /// Calling `seek_danger` with TERMINATED as a target is allowed,
+    /// and should always return NewTarget(TERMINATED) or anything larger as TERMINATED is NOT in
+    /// the DocSet.
+    ///
    /// DocSets that already have an efficient `seek` method don't need to implement
-    /// `seek_into_the_danger_zone`. All wrapper DocSets should forward
-    /// `seek_into_the_danger_zone` to the underlying DocSet.
+    /// `seek_danger`.
    ///
-    /// ## API Behaviour
-    /// If `seek_into_the_danger_zone` is returning true, a call to `doc()` has to return target.
-    /// If `seek_into_the_danger_zone` is returning false, a call to `doc()` may return any doc
-    /// between the last doc that matched and target or a doc that is a valid next hit after
-    /// target. The DocSet is considered to be in an invalid state until
-    /// `seek_into_the_danger_zone` returns true again.
-    ///
-    /// `target` needs to be equal or larger than `doc` when in a valid state.
-    ///
-    /// Consecutive calls are not allowed to have decreasing `target` values.
+    /// Consecutive calls to seek_danger are guaranteed to have strictly increasing `target`
+    /// values.
    ///
    /// # Warning
    /// This is an advanced API used by intersection. The API contract is tricky, avoid using it.
-    fn seek_into_the_danger_zone(&mut self, target: DocId) -> bool {
-        let current_doc = self.doc();
-        if current_doc < target {
-            self.seek(target);
+    fn seek_danger(&mut self, target: DocId) -> SeekDangerResult {
+        if target >= TERMINATED {
+            debug_assert!(target == TERMINATED);
+            // No need to advance.
+            return SeekDangerResult::SeekLowerBound(target);
        }
-        self.doc() == target
+
+        // The default implementation does not include any
+        // `danger zone` behavior.
+        //
+        // It does not leave the scorer in an invalid state.
+        // For this reason, we can safely call `self.doc()`.
+        let mut doc = self.doc();
+        if doc < target {
+            doc = self.seek(target);
+        }
+        if doc == target {
+            SeekDangerResult::Found
+        } else {
+            SeekDangerResult::SeekLowerBound(doc)
+        }
    }

    /// Fills a given mutable buffer with the next doc ids from the
@@ -166,6 +190,17 @@ pub trait DocSet: Send {
    }
}

+#[derive(Clone, Copy, Debug, PartialEq, Eq)]
+pub enum SeekDangerResult {
+    /// The target was found in the DocSet.
+    Found,
+    /// The target was not found in the DocSet.
+    /// We return a range in which the value could be.
+    /// The given target can be any DocId, that is <= than the first document
+    /// in the docset after the target.
+    SeekLowerBound(DocId),
+}
+
impl DocSet for &mut dyn DocSet {
    fn advance(&mut self) -> u32 {
        (**self).advance()
@@ -175,8 +210,8 @@ impl DocSet for &mut dyn DocSet {
        (**self).seek(target)
    }

-    fn seek_into_the_danger_zone(&mut self, target: DocId) -> bool {
-        (**self).seek_into_the_danger_zone(target)
+    fn seek_danger(&mut self, target: DocId) -> SeekDangerResult {
+        (**self).seek_danger(target)
    }

    fn doc(&self) -> u32 {
@@ -211,9 +246,9 @@ impl<TDocSet: DocSet + ?Sized> DocSet for Box<TDocSet> {
        unboxed.seek(target)
    }

-    fn seek_into_the_danger_zone(&mut self, target: DocId) -> bool {
+    fn seek_danger(&mut self, target: DocId) -> SeekDangerResult {
        let unboxed: &mut TDocSet = self.borrow_mut();
-        unboxed.seek_into_the_danger_zone(target)
+        unboxed.seek_danger(target)
    }

    fn fill_buffer(&mut self, buffer: &mut [DocId; COLLECT_BLOCK_BUFFER_LEN]) -> usize {
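The contract above is easiest to see from the caller's side. Below is an illustrative sketch (not part of the diff) of the driving loop an intersection-style caller is expected to run; the helper name and import paths are assumptions:

```rust
use tantivy::{DocId, DocSet, TERMINATED};
// Assumed path; the public re-export location of SeekDangerResult may differ.
use tantivy::docset::SeekDangerResult;

/// Hypothetical helper: find the next doc >= `candidate` contained in `docset`.
fn next_match(docset: &mut dyn DocSet, mut candidate: DocId) -> DocId {
    while candidate < TERMINATED {
        match docset.seek_danger(candidate) {
            // The docset is in a valid state again, positioned on `candidate`.
            SeekDangerResult::Found => return candidate,
            // The docset may now be in an invalid state; the only legal next
            // step is another `seek_danger` call with a strictly larger target.
            SeekDangerResult::SeekLowerBound(lower_bound) => candidate = lower_bound,
        }
    }
    TERMINATED
}
```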
@@ -1,8 +1,6 @@
use std::collections::HashSet;
use std::fmt;
use std::path::PathBuf;
-use std::sync::atomic::AtomicBool;
-use std::sync::Arc;

use serde::{Deserialize, Serialize};

@@ -37,7 +35,6 @@ impl SegmentMetaInventory {
        let inner = InnerSegmentMeta {
            segment_id,
            max_doc,
-            include_temp_doc_store: Arc::new(AtomicBool::new(true)),
            deletes: None,
        };
        SegmentMeta::from(self.inventory.track(inner))
@@ -85,15 +82,6 @@ impl SegmentMeta {
        self.tracked.segment_id
    }

-    /// Removes the Component::TempStore from the alive list and
-    /// therefore marks the temp docstore file to be deleted by
-    /// the garbage collection.
-    pub fn untrack_temp_docstore(&self) {
-        self.tracked
-            .include_temp_doc_store
-            .store(false, std::sync::atomic::Ordering::Relaxed);
-    }
-
    /// Returns the number of deleted documents.
    pub fn num_deleted_docs(&self) -> u32 {
        self.tracked
@@ -111,20 +99,9 @@ impl SegmentMeta {
    /// is by removing all files that have been created by tantivy
    /// and are not used by any segment anymore.
    pub fn list_files(&self) -> HashSet<PathBuf> {
-        if self
-            .tracked
-            .include_temp_doc_store
-            .load(std::sync::atomic::Ordering::Relaxed)
-        {
-            SegmentComponent::iterator()
-                .map(|component| self.relative_path(*component))
-                .collect::<HashSet<PathBuf>>()
-        } else {
-            SegmentComponent::iterator()
-                .filter(|comp| *comp != &SegmentComponent::TempStore)
-                .map(|component| self.relative_path(*component))
-                .collect::<HashSet<PathBuf>>()
-        }
+        SegmentComponent::iterator()
+            .map(|component| self.relative_path(*component))
+            .collect::<HashSet<PathBuf>>()
    }

    /// Returns the relative path of a component of our segment.
@@ -138,7 +115,6 @@ impl SegmentMeta {
            SegmentComponent::Positions => ".pos".to_string(),
            SegmentComponent::Terms => ".term".to_string(),
            SegmentComponent::Store => ".store".to_string(),
-            SegmentComponent::TempStore => ".store.temp".to_string(),
            SegmentComponent::FastFields => ".fast".to_string(),
            SegmentComponent::FieldNorms => ".fieldnorm".to_string(),
            SegmentComponent::Delete => format!(".{}.del", self.delete_opstamp().unwrap_or(0)),
@@ -183,7 +159,6 @@ impl SegmentMeta {
            segment_id: inner_meta.segment_id,
            max_doc,
            deletes: None,
-            include_temp_doc_store: Arc::new(AtomicBool::new(true)),
        });
        SegmentMeta { tracked }
    }
@@ -202,7 +177,6 @@ impl SegmentMeta {
        let tracked = self.tracked.map(move |inner_meta| InnerSegmentMeta {
            segment_id: inner_meta.segment_id,
            max_doc: inner_meta.max_doc,
-            include_temp_doc_store: Arc::new(AtomicBool::new(true)),
            deletes: Some(delete_meta),
        });
        SegmentMeta { tracked }
@@ -214,14 +188,6 @@ struct InnerSegmentMeta {
    segment_id: SegmentId,
    max_doc: u32,
    pub deletes: Option<DeleteMeta>,
-    /// If you want to avoid the SegmentComponent::TempStore file to be covered by
-    /// garbage collection and deleted, set this to true. This is used during merge.
-    #[serde(skip)]
-    #[serde(default = "default_temp_store")]
-    pub(crate) include_temp_doc_store: Arc<AtomicBool>,
}
-fn default_temp_store() -> Arc<AtomicBool> {
-    Arc::new(AtomicBool::new(false))
-}

impl InnerSegmentMeta {
@@ -23,8 +23,6 @@ pub enum SegmentComponent {
    /// Accessing a document from the store is relatively slow, as it
    /// requires to decompress the entire block it belongs to.
    Store,
-    /// Temporary storage of the documents, before streamed to `Store`.
-    TempStore,
    /// Bitset describing which document of the segment is alive.
    /// (It was representing deleted docs but changed to represent alive docs from v0.17)
    Delete,
@@ -33,14 +31,13 @@ impl SegmentComponent {
    /// Iterates through the components.
    pub fn iterator() -> slice::Iter<'static, SegmentComponent> {
-        static SEGMENT_COMPONENTS: [SegmentComponent; 8] = [
+        static SEGMENT_COMPONENTS: [SegmentComponent; 7] = [
            SegmentComponent::Postings,
            SegmentComponent::Positions,
            SegmentComponent::FastFields,
            SegmentComponent::FieldNorms,
            SegmentComponent::Terms,
            SegmentComponent::Store,
-            SegmentComponent::TempStore,
            SegmentComponent::Delete,
        ];
        SEGMENT_COMPONENTS.iter()
@@ -218,7 +218,7 @@ fn index_documents<D: Document>(
    let alive_bitset_opt = apply_deletes(&segment_with_max_doc, &mut delete_cursor, &doc_opstamps)?;

    let meta = segment_with_max_doc.meta().clone();
-    meta.untrack_temp_docstore();
-
-    // update segment_updater inventory to remove tempstore
    let segment_entry = SegmentEntry::new(meta, delete_cursor, alive_bitset_opt);
    segment_updater.schedule_add_segment(segment_entry).wait()?;
@@ -303,10 +303,10 @@ impl BlockSegmentPostings {
    }

    pub(crate) fn load_block(&mut self) {
-        let offset = self.skip_reader.byte_offset();
        if self.block_is_loaded() {
            return;
        }
+        let offset = self.skip_reader.byte_offset();
        match self.skip_reader.block_info() {
            BlockInfo::BitPacked {
                doc_num_bits,
@@ -168,12 +168,20 @@ impl DocSet for SegmentPostings {
        self.doc()
    }

    #[inline]
    fn seek(&mut self, target: DocId) -> DocId {
        debug_assert!(self.doc() <= target);
        if self.doc() >= target {
            return self.doc();
        }

        // As an optimization, if the block is already loaded, we can
        // cheaply check the next doc.
        self.cur = (self.cur + 1).min(COMPRESSION_BLOCK_SIZE - 1);
        if self.doc() >= target {
            return self.doc();
        }

        // Delegate block-local search to BlockSegmentPostings::seek, which returns
        // the in-block index of the first doc >= target.
        self.cur = self.block_cursor.seek(target);
@@ -291,18 +291,6 @@ impl<TScoreCombiner: ScoreCombiner> BooleanWeight<TScoreCombiner> {
            }
        };

-        let exclude_scorer_opt: Option<Box<dyn Scorer>> = if exclude_scorers.is_empty() {
-            None
-        } else {
-            let exclude_specialized_scorer: SpecializedScorer =
-                scorer_union(exclude_scorers, DoNothingCombiner::default, num_docs);
-            Some(into_box_scorer(
-                exclude_specialized_scorer,
-                DoNothingCombiner::default,
-                num_docs,
-            ))
-        };
-
        let include_scorer = match (should_scorers, must_scorers) {
            (ShouldScorersCombinationMethod::Ignored, must_scorers) => {
                // No SHOULD clauses (or they were absorbed into MUST).
@@ -380,16 +368,23 @@ impl<TScoreCombiner: ScoreCombiner> BooleanWeight<TScoreCombiner> {
                }
            }
        };
-        if let Some(exclude_scorer) = exclude_scorer_opt {
-            let include_scorer_boxed =
-                into_box_scorer(include_scorer, &score_combiner_fn, num_docs);
-            Ok(SpecializedScorer::Other(Box::new(Exclude::new(
-                include_scorer_boxed,
-                exclude_scorer,
-            ))))
-        } else {
-            Ok(include_scorer)
+        if exclude_scorers.is_empty() {
+            return Ok(include_scorer);
        }
+
+        let include_scorer_boxed = into_box_scorer(include_scorer, &score_combiner_fn, num_docs);
+        let scorer: Box<dyn Scorer> = if exclude_scorers.len() == 1 {
+            let exclude_scorer = exclude_scorers.pop().unwrap();
+            match exclude_scorer.downcast::<TermScorer>() {
+                // Cast to TermScorer succeeded
+                Ok(exclude_scorer) => Box::new(Exclude::new(include_scorer_boxed, *exclude_scorer)),
+                // We get back the original Box<dyn Scorer>
+                Err(exclude_scorer) => Box::new(Exclude::new(include_scorer_boxed, exclude_scorer)),
+            }
+        } else {
+            Box::new(Exclude::new(include_scorer_boxed, exclude_scorers))
+        };
+        Ok(SpecializedScorer::Other(scorer))
    }
}
@@ -1,6 +1,6 @@
use std::fmt;

-use crate::docset::COLLECT_BLOCK_BUFFER_LEN;
+use crate::docset::{SeekDangerResult, COLLECT_BLOCK_BUFFER_LEN};
use crate::fastfield::AliveBitSet;
use crate::query::{EnableScoring, Explanation, Query, Scorer, Weight};
use crate::{DocId, DocSet, Score, SegmentReader, Term};
@@ -104,8 +104,8 @@ impl<S: Scorer> DocSet for BoostScorer<S> {
    fn seek(&mut self, target: DocId) -> DocId {
        self.underlying.seek(target)
    }
-    fn seek_into_the_danger_zone(&mut self, target: DocId) -> bool {
-        self.underlying.seek_into_the_danger_zone(target)
+    fn seek_danger(&mut self, target: DocId) -> SeekDangerResult {
+        self.underlying.seek_danger(target)
    }

    fn fill_buffer(&mut self, buffer: &mut [DocId; COLLECT_BLOCK_BUFFER_LEN]) -> usize {
@@ -1,6 +1,7 @@
use std::cmp::Ordering;
use std::collections::BinaryHeap;

+use crate::docset::SeekDangerResult;
use crate::query::score_combiner::DoNothingCombiner;
use crate::query::{ScoreCombiner, Scorer};
use crate::{DocId, DocSet, Score, TERMINATED};
@@ -67,10 +68,12 @@ impl<T: Scorer> DocSet for ScorerWrapper<T> {
        self.current_doc = doc_id;
        doc_id
    }
-    fn seek_into_the_danger_zone(&mut self, target: DocId) -> bool {
-        let found = self.scorer.seek_into_the_danger_zone(target);
-        self.current_doc = self.scorer.doc();
-        found
+    fn seek_danger(&mut self, target: DocId) -> SeekDangerResult {
+        let result = self.scorer.seek_danger(target);
+        if result == SeekDangerResult::Found {
+            self.current_doc = target;
+        }
+        result
    }

    fn doc(&self) -> DocId {
@@ -1,48 +1,71 @@
-use crate::docset::{DocSet, TERMINATED};
+use crate::docset::{DocSet, SeekDangerResult, TERMINATED};
use crate::query::Scorer;
use crate::{DocId, Score};

-#[inline]
-fn is_within<TDocSetExclude: DocSet>(docset: &mut TDocSetExclude, doc: DocId) -> bool {
-    docset.doc() <= doc && docset.seek(doc) == doc
-}
-
-/// Filters a given `DocSet` by removing the docs from a given `DocSet`.
+/// An exclusion set is a set of documents
+/// that should be excluded from a given DocSet.
///
-/// The excluding docset has no impact on scoring.
-pub struct Exclude<TDocSet, TDocSetExclude> {
-    underlying_docset: TDocSet,
-    excluding_docset: TDocSetExclude,
+/// It can be a single DocSet, or a Vec of DocSets.
+pub trait ExclusionSet: Send {
+    /// Returns `true` if the given `doc` is in the exclusion set.
+    fn contains(&mut self, doc: DocId) -> bool;
}

-impl<TDocSet, TDocSetExclude> Exclude<TDocSet, TDocSetExclude>
+impl<TDocSet: DocSet> ExclusionSet for TDocSet {
+    #[inline]
+    fn contains(&mut self, doc: DocId) -> bool {
+        self.seek_danger(doc) == SeekDangerResult::Found
+    }
+}
+
+impl<TDocSet: DocSet> ExclusionSet for Vec<TDocSet> {
+    #[inline]
+    fn contains(&mut self, doc: DocId) -> bool {
+        for docset in self.iter_mut() {
+            if docset.seek_danger(doc) == SeekDangerResult::Found {
+                return true;
+            }
+        }
+        false
+    }
+}
+
+/// Filters a given `DocSet` by removing the docs from an exclusion set.
+///
+/// The excluding docsets have no impact on scoring.
+pub struct Exclude<TDocSet, TExclusionSet> {
+    underlying_docset: TDocSet,
+    exclusion_set: TExclusionSet,
+}
+
+impl<TDocSet, TExclusionSet> Exclude<TDocSet, TExclusionSet>
where
    TDocSet: DocSet,
-    TDocSetExclude: DocSet,
+    TExclusionSet: ExclusionSet,
{
    /// Creates a new `ExcludeScorer`
    pub fn new(
        mut underlying_docset: TDocSet,
-        mut excluding_docset: TDocSetExclude,
-    ) -> Exclude<TDocSet, TDocSetExclude> {
+        mut exclusion_set: TExclusionSet,
+    ) -> Exclude<TDocSet, TExclusionSet> {
        while underlying_docset.doc() != TERMINATED {
            let target = underlying_docset.doc();
-            if !is_within(&mut excluding_docset, target) {
+            if !exclusion_set.contains(target) {
                break;
            }
            underlying_docset.advance();
        }
        Exclude {
            underlying_docset,
-            excluding_docset,
+            exclusion_set,
        }
    }
}

-impl<TDocSet, TDocSetExclude> DocSet for Exclude<TDocSet, TDocSetExclude>
+impl<TDocSet, TExclusionSet> DocSet for Exclude<TDocSet, TExclusionSet>
where
    TDocSet: DocSet,
-    TDocSetExclude: DocSet,
+    TExclusionSet: ExclusionSet,
{
    fn advance(&mut self) -> DocId {
        loop {
@@ -50,7 +73,7 @@ where
            if candidate == TERMINATED {
                return TERMINATED;
            }
-            if !is_within(&mut self.excluding_docset, candidate) {
+            if !self.exclusion_set.contains(candidate) {
                return candidate;
            }
        }
@@ -61,7 +84,7 @@ where
        if candidate == TERMINATED {
            return TERMINATED;
        }
-        if !is_within(&mut self.excluding_docset, candidate) {
+        if !self.exclusion_set.contains(candidate) {
            return candidate;
        }
        self.advance()
@@ -79,10 +102,10 @@ where
    }
}

-impl<TScorer, TDocSetExclude> Scorer for Exclude<TScorer, TDocSetExclude>
+impl<TScorer, TExclusionSet> Scorer for Exclude<TScorer, TExclusionSet>
where
    TScorer: Scorer,
-    TDocSetExclude: DocSet + 'static,
+    TExclusionSet: ExclusionSet + 'static,
{
    #[inline]
    fn score(&mut self) -> Score {
@@ -1,5 +1,5 @@
use super::size_hint::estimate_intersection;
-use crate::docset::{DocSet, TERMINATED};
+use crate::docset::{DocSet, SeekDangerResult, TERMINATED};
use crate::query::term_query::TermScorer;
use crate::query::{EmptyScorer, Scorer};
use crate::{DocId, Score};
@@ -84,6 +84,14 @@ impl<TDocSet: DocSet> Intersection<TDocSet, TDocSet> {
        docsets.sort_by_key(|docset| docset.cost());
        go_to_first_doc(&mut docsets);
        let left = docsets.remove(0);
+        debug_assert!({
+            let doc = left.doc();
+            if doc == TERMINATED {
+                true
+            } else {
+                docsets.iter().all(|docset| docset.doc() == doc)
+            }
+        });
        let right = docsets.remove(0);
        Intersection {
            left,
@@ -108,46 +116,61 @@ impl<TDocSet: DocSet, TOtherDocSet: DocSet> DocSet for Intersection<TDocSet, TOt
    #[inline]
    fn advance(&mut self) -> DocId {
        let (left, right) = (&mut self.left, &mut self.right);
-        let mut candidate = left.advance();
-        if candidate == TERMINATED {
-            return TERMINATED;
-        }
-
-        loop {
-            // In the first part we look for a document in the intersection
-            // of the two rarest `DocSet` in the intersection.
+        // Invariant:
+        // - candidate is always <= to the next document in the intersection.
+        // - candidate strictly increases at every occurence of the loop.
+        let mut candidate = left.doc() + 1;

-            loop {
-                if right.seek_into_the_danger_zone(candidate) {
-                    break;
-                }
-                let right_doc = right.doc();
-                // TODO: Think about which value would make sense here
-                // It depends on the DocSet implementation, when a seek would outweigh an advance.
-                if right_doc > candidate.wrapping_add(100) {
-                    candidate = left.seek(right_doc);
-                } else {
-                    candidate = left.advance();
-                }
-                if candidate == TERMINATED {
-                    return TERMINATED;
-                }
-            }
+        // Termination: candidate strictly increases.
+        'outer: while candidate < TERMINATED {
+            // As we enter the loop, we should always have candidate < next_doc.

-            debug_assert_eq!(left.doc(), right.doc());
-            // test the remaining scorers
-            if self
-                .others
-                .iter_mut()
-                .all(|docset| docset.seek_into_the_danger_zone(candidate))
-            {
-                debug_assert_eq!(candidate, self.left.doc());
-                debug_assert_eq!(candidate, self.right.doc());
-                debug_assert!(self.others.iter().all(|docset| docset.doc() == candidate));
-                return candidate;
-            }
-            candidate = left.advance();
+            candidate = left.seek(candidate);
+
+            // Left is positionned on `candidate`.
+            debug_assert_eq!(left.doc(), candidate);
+
+            if let SeekDangerResult::SeekLowerBound(seek_lower_bound) = right.seek_danger(candidate)
+            {
+                debug_assert!(
+                    seek_lower_bound == TERMINATED || seek_lower_bound > candidate,
+                    "seek_lower_bound {seek_lower_bound} must be greater than candidate \
+                     {candidate}"
+                );
+                candidate = seek_lower_bound;
+                continue;
+            }
+
+            // Left and right are positionned on `candidate`.
+            debug_assert_eq!(right.doc(), candidate);
+
+            for other in &mut self.others {
+                if let SeekDangerResult::SeekLowerBound(seek_lower_bound) =
+                    other.seek_danger(candidate)
+                {
+                    // One of the scorer does not match, let's restart at the top of the loop.
+                    debug_assert!(
+                        seek_lower_bound == TERMINATED || seek_lower_bound > candidate,
+                        "seek_lower_bound {seek_lower_bound} must be greater than candidate \
+                         {candidate}"
+                    );
+                    candidate = seek_lower_bound;
+                    continue 'outer;
+                }
+            }
+
+            // At this point all scorers are in a valid state, aligned on the next document in the
+            // intersection.
+            debug_assert!(self.others.iter().all(|docset| docset.doc() == candidate));
+            return candidate;
        }
+
+        // We make sure our docset is in a valid state.
+        // In particular, we want .doc() to return TERMINATED.
+        left.seek(TERMINATED);
+
+        TERMINATED
    }

    fn seek(&mut self, target: DocId) -> DocId {
@@ -166,13 +189,19 @@ impl<TDocSet: DocSet, TOtherDocSet: DocSet> DocSet for Intersection<TDocSet, TOt
    ///
    /// Some implementations may choose to advance past the target if beneficial for performance.
    /// The return value is `true` if the target is in the docset, and `false` otherwise.
-    fn seek_into_the_danger_zone(&mut self, target: DocId) -> bool {
-        self.left.seek_into_the_danger_zone(target)
-            && self.right.seek_into_the_danger_zone(target)
-            && self
-                .others
-                .iter_mut()
-                .all(|docset| docset.seek_into_the_danger_zone(target))
+    fn seek_danger(&mut self, target: DocId) -> SeekDangerResult {
+        if let SeekDangerResult::SeekLowerBound(new_target) = self.left.seek_danger(target) {
+            return SeekDangerResult::SeekLowerBound(new_target);
+        }
+        if let SeekDangerResult::SeekLowerBound(new_target) = self.right.seek_danger(target) {
+            return SeekDangerResult::SeekLowerBound(new_target);
+        }
+        for docset in &mut self.others {
+            if let SeekDangerResult::SeekLowerBound(new_target) = docset.seek_danger(target) {
+                return SeekDangerResult::SeekLowerBound(new_target);
+            }
+        }
+        SeekDangerResult::Found
    }

    #[inline]
@@ -215,9 +244,12 @@ mod tests {
    use proptest::prelude::*;

    use super::Intersection;
+    use crate::collector::Count;
    use crate::docset::{DocSet, TERMINATED};
    use crate::postings::tests::test_skip_against_unoptimized;
-    use crate::query::VecDocSet;
+    use crate::query::{QueryParser, VecDocSet};
    use crate::schema::{Schema, TEXT};
    use crate::Index;

    #[test]
    fn test_intersection() {
@@ -304,6 +336,58 @@ mod tests {
        assert_eq!(intersection.doc(), TERMINATED);
    }

+    #[test]
+    fn test_intersection_abc() {
+        let a = VecDocSet::from(vec![2, 3, 6]);
+        let b = VecDocSet::from(vec![1, 3, 5]);
+        let c = VecDocSet::from(vec![1, 3, 5]);
+        let mut intersection = Intersection::new(vec![c, b, a], 10);
+        let mut docs = Vec::new();
+        use crate::DocSet;
+        while intersection.doc() != TERMINATED {
+            docs.push(intersection.doc());
+            intersection.advance();
+        }
+        assert_eq!(&docs, &[3]);
+    }
+
+    #[test]
+    fn test_intersection_termination() {
+        use crate::query::score_combiner::DoNothingCombiner;
+        use crate::query::{BufferedUnionScorer, ConstScorer, VecDocSet};
+
+        let a1 = ConstScorer::new(VecDocSet::from(vec![0u32, 10000]), 1.0);
+        let a2 = ConstScorer::new(VecDocSet::from(vec![0u32, 10000]), 1.0);
+
+        let mut b_scorers = vec![];
+        for _ in 0..2 {
+            // Union matches 0 and 10000.
+            b_scorers.push(ConstScorer::new(VecDocSet::from(vec![0, 10000]), 1.0));
+        }
+        // That's the union of two scores matching 0, and 10_000.
+        let union = BufferedUnionScorer::build(b_scorers, DoNothingCombiner::default, 30000);
+
+        // Mismatching scorer: matches 0 and 20000. We then append more docs at the end to ensure it
+        // is last.
+        let mut m_docs = vec![0, 20000];
+        for i in 30000..30100 {
+            m_docs.push(i);
+        }
+        let m = ConstScorer::new(VecDocSet::from(m_docs), 1.0);
+
+        // Costs: A1=2, A2=2, Union=4, M=102.
+        // Sorted: A1, A2, Union, M.
+        // Left=A1, Right=A2, Others=[Union, M].
+        let mut intersection = crate::query::intersect_scorers(
+            vec![Box::new(a1), Box::new(a2), Box::new(union), Box::new(m)],
+            40000,
+        );
+
+        while intersection.doc() != TERMINATED {
+            intersection.advance();
+        }
+    }
+
    // Strategy to generate sorted and deduplicated vectors of u32 document IDs
    fn sorted_deduped_vec(max_val: u32, max_size: usize) -> impl Strategy<Value = Vec<u32>> {
        prop::collection::vec(0..max_val, 0..max_size).prop_map(|mut vec| {
@@ -335,6 +419,30 @@ mod tests {
            }
            assert_eq!(intersection.doc(), TERMINATED);
        }
    }

+    #[test]
+    fn test_bug_2811_intersection_candidate_should_increase() {
+        let mut schema_builder = Schema::builder();
+        let text_field = schema_builder.add_text_field("text", TEXT);
+        let schema = schema_builder.build();
+
+        let index = Index::create_in_ram(schema);
+        let mut writer = index.writer_for_tests().unwrap();
+        writer
+            .add_document(doc!(text_field=>"hello happy tax"))
+            .unwrap();
+        writer.add_document(doc!(text_field=>"hello")).unwrap();
+        writer.add_document(doc!(text_field=>"hello")).unwrap();
+        writer.add_document(doc!(text_field=>"happy tax")).unwrap();
+
+        writer.commit().unwrap();
+        let query_parser = QueryParser::for_index(&index, Vec::new());
+        let query = query_parser
+            .parse_query(r#"+text:hello +text:"happy tax""#)
+            .unwrap();
+        let searcher = index.reader().unwrap().searcher();
+        let c = searcher.search(&*query, &Count).unwrap();
+        assert_eq!(c, 1);
+    }
}
@@ -43,7 +43,7 @@ pub use self::boost_query::{BoostQuery, BoostWeight};
pub use self::const_score_query::{ConstScoreQuery, ConstScorer};
pub use self::disjunction_max_query::DisjunctionMaxQuery;
pub use self::empty_query::{EmptyQuery, EmptyScorer, EmptyWeight};
-pub use self::exclude::Exclude;
+pub use self::exclude::{Exclude, ExclusionSet};
pub use self::exist_query::ExistsQuery;
pub use self::explanation::Explanation;
#[cfg(test)]
@@ -1,4 +1,4 @@
-use crate::docset::{DocSet, TERMINATED};
+use crate::docset::{DocSet, SeekDangerResult, TERMINATED};
use crate::fieldnorm::FieldNormReader;
use crate::postings::Postings;
use crate::query::bm25::Bm25Weight;
@@ -194,11 +194,16 @@ impl<TPostings: Postings> DocSet for PhrasePrefixScorer<TPostings> {
        self.advance()
    }

-    fn seek_into_the_danger_zone(&mut self, target: DocId) -> bool {
-        if self.phrase_scorer.seek_into_the_danger_zone(target) {
-            self.matches_prefix()
+    fn seek_danger(&mut self, target: DocId) -> SeekDangerResult {
+        let seek_res = self.phrase_scorer.seek_danger(target);
+        if seek_res != SeekDangerResult::Found {
+            return seek_res;
+        }
+        // The intersection matched. Now let's see if we match the prefix.
+        if self.matches_prefix() {
+            SeekDangerResult::Found
        } else {
-            false
+            SeekDangerResult::SeekLowerBound(target + 1)
        }
    }
@@ -1,6 +1,6 @@
use std::cmp::Ordering;

-use crate::docset::{DocSet, TERMINATED};
+use crate::docset::{DocSet, SeekDangerResult, TERMINATED};
use crate::fieldnorm::FieldNormReader;
use crate::postings::Postings;
use crate::query::bm25::Bm25Weight;
@@ -530,12 +530,23 @@ impl<TPostings: Postings> DocSet for PhraseScorer<TPostings> {
        self.advance()
    }

-    fn seek_into_the_danger_zone(&mut self, target: DocId) -> bool {
-        debug_assert!(target >= self.doc());
-        if self.intersection_docset.seek_into_the_danger_zone(target) && self.phrase_match() {
-            return true;
-        }
-        false
+    fn seek_danger(&mut self, target: DocId) -> SeekDangerResult {
+        debug_assert!(
+            target >= self.doc(),
+            "target ({}) should be greater than or equal to doc ({})",
+            target,
+            self.doc()
+        );
+        let seek_res = self.intersection_docset.seek_danger(target);
+        if seek_res != SeekDangerResult::Found {
+            return seek_res;
+        }
+        // The intersection matched. Now let's see if we match the phrase.
+        if self.phrase_match() {
+            SeekDangerResult::Found
+        } else {
+            SeekDangerResult::SeekLowerBound(target + 1)
+        }
    }

    fn doc(&self) -> DocId {
@@ -1,6 +1,6 @@
use std::marker::PhantomData;

-use crate::docset::DocSet;
+use crate::docset::{DocSet, SeekDangerResult};
use crate::query::score_combiner::ScoreCombiner;
use crate::query::Scorer;
use crate::{DocId, Score};
@@ -56,9 +56,9 @@ where
        self.req_scorer.seek(target)
    }

-    fn seek_into_the_danger_zone(&mut self, target: DocId) -> bool {
+    fn seek_danger(&mut self, target: DocId) -> SeekDangerResult {
        self.score_cache = None;
-        self.req_scorer.seek_into_the_danger_zone(target)
+        self.req_scorer.seek_danger(target)
    }

    fn doc(&self) -> DocId {
@@ -105,6 +105,7 @@ impl DocSet for TermScorer {

    #[inline]
    fn seek(&mut self, target: DocId) -> DocId {
+        debug_assert!(target >= self.doc());
        self.postings.seek(target)
    }
@@ -1,6 +1,6 @@
use common::TinySet;

-use crate::docset::{DocSet, TERMINATED};
+use crate::docset::{DocSet, SeekDangerResult, TERMINATED};
use crate::query::score_combiner::{DoNothingCombiner, ScoreCombiner};
use crate::query::size_hint::estimate_union;
use crate::query::Scorer;
@@ -225,25 +225,47 @@ where
        }
    }

-    fn seek_into_the_danger_zone(&mut self, target: DocId) -> bool {
+    fn seek_danger(&mut self, target: DocId) -> SeekDangerResult {
+        if target >= TERMINATED {
+            return SeekDangerResult::SeekLowerBound(TERMINATED);
+        }
        if self.is_in_horizon(target) {
            // Our value is within the buffered horizon and the docset may already have been
            // processed and removed, so we need to use seek, which uses the regular advance.
-            self.seek(target) == target
-        } else {
-            // The docsets are not in the buffered range, so we can use seek_into_the_danger_zone
-            // of the underlying docsets
-            let is_hit = self
-                .docsets
-                .iter_mut()
-                .any(|docset| docset.seek_into_the_danger_zone(target));
-
-            // The API requires the DocSet to be in a valid state when `seek_into_the_danger_zone`
-            // returns true.
-            if is_hit {
-                self.seek(target);
-            }
-            is_hit
-        }
+            let seek_doc = self.seek(target);
+            if seek_doc == target {
+                return SeekDangerResult::Found;
+            } else {
+                return SeekDangerResult::SeekLowerBound(seek_doc);
+            };
+        }
+
+        // The docsets are not in the buffered range, so we can use seek_into_the_danger_zone
+        // of the underlying docsets
+        let mut is_hit = false;
+        let mut min_new_target = TERMINATED;
+
+        for docset in self.docsets.iter_mut() {
+            match docset.seek_danger(target) {
+                SeekDangerResult::Found => {
+                    is_hit = true;
+                    break;
+                }
+                SeekDangerResult::SeekLowerBound(new_target) => {
+                    min_new_target = min_new_target.min(new_target);
+                }
+            }
+        }
+
+        // The API requires the DocSet to be in a valid state when `seek_into_the_danger_zone`
+        // returns Found.
+        if is_hit {
+            // The doc is found. Let's make sure we position the union on the target
+            // to bring it back to a valid state.
+            self.seek(target);
+            SeekDangerResult::Found
+        } else {
+            SeekDangerResult::SeekLowerBound(min_new_target)
+        }
    }
@@ -14,7 +14,7 @@ mod tests {
    use common::BitSet;

    use super::{SimpleUnion, *};
-    use crate::docset::{DocSet, TERMINATED};
+    use crate::docset::{DocSet, SeekDangerResult, TERMINATED};
    use crate::postings::tests::test_skip_against_unoptimized;
    use crate::query::score_combiner::DoNothingCombiner;
    use crate::query::union::bitset_union::BitSetPostingUnion;
@@ -254,6 +254,27 @@ mod tests {
            vec![1, 2, 3, 7, 8, 9, 99, 100, 101, 500, 20000],
        );
    }
+
+    #[test]
+    fn test_buffered_union_seek_into_danger_zone_terminated() {
+        let scorer1 = ConstScorer::new(VecDocSet::from(vec![1, 2]), 1.0);
+        let scorer2 = ConstScorer::new(VecDocSet::from(vec![2, 3]), 1.0);
+
+        let mut union_scorer =
+            BufferedUnionScorer::build(vec![scorer1, scorer2], DoNothingCombiner::default, 100);
+
+        // Advance to end
+        while union_scorer.doc() != TERMINATED {
+            union_scorer.advance();
+        }
+
+        assert_eq!(union_scorer.doc(), TERMINATED);
+
+        assert_eq!(
+            union_scorer.seek_danger(TERMINATED),
+            SeekDangerResult::SeekLowerBound(TERMINATED)
+        );
+    }
}

#[cfg(all(test, feature = "unstable"))]
@@ -17,6 +17,9 @@ pub struct VecDocSet {

impl From<Vec<DocId>> for VecDocSet {
    fn from(doc_ids: Vec<DocId>) -> VecDocSet {
+        // We do not use `slice::is_sorted`, as we want to check for doc ids to be strictly
+        // sorted.
+        assert!(doc_ids.windows(2).all(|w| w[0] < w[1]));
        VecDocSet { doc_ids, cursor: 0 }
    }
}
@@ -124,7 +124,6 @@ impl SegmentSpaceUsage {
            FieldNorms => PerField(self.fieldnorms().clone()),
            Terms => PerField(self.termdict().clone()),
            SegmentComponent::Store => ComponentSpaceUsage::Store(self.store().clone()),
-            SegmentComponent::TempStore => ComponentSpaceUsage::Store(self.store().clone()),
            Delete => Basic(self.deletes()),
        }
    }