tantivy/examples/snippet.rs
Pascal Seitz 3cbcb5d0aa Abstract tantivy's data storage behind traits for pluggable backends
Extract trait interfaces from tantivy's core reader types so that
alternative storage backends (e.g. Quickwit) can provide their own
implementations while tantivy's query engine works through dynamic
dispatch.

Reader trait extraction:

- SegmentReader is now a trait; the concrete implementation is renamed
  to TantivySegmentReader.
- DynInvertedIndexReader trait for object-safe dynamic dispatch, plus
  a typed InvertedIndexReader trait with associated Postings/DocSet
  types for static dispatch.  The concrete reader becomes
  TantivyInvertedIndexReader.
- StoreReader is now a trait; the concrete implementation is renamed
  to TantivyStoreReader.  get() returns TantivyDocument directly
  instead of requiring a generic DocumentDeserialize bound.

Typed downcast for performance-critical paths:

- try_downcast_and_call() + TypedInvertedIndexReaderCb allow query
  weights (TermWeight, PhraseWeight) to attempt a downcast to the
  concrete TantivyInvertedIndexReader, obtaining typed postings for
  zero-cost scoring, and falling back to the dynamic path otherwise.
- TermScorer<TPostings> is now generic over its postings type.
- PostingsWithBlockMax trait enables block-max WAND acceleration
  through the trait boundary.
- block_wand() and block_wand_single_scorer() are generic over
  PostingsWithBlockMax, and for_each_pruning is dispatched through
  the SegmentReader trait so custom backends can provide their own
  block-max implementations.

Searcher decoupled from Index:

- New SearcherContext holds schema, executor, and tokenizers.
- Searcher can be constructed from Vec<Arc<dyn SegmentReader>>
  via Searcher::from_segment_readers(), without needing an Index.
- Searcher::index() is deprecated in favor of Searcher::context().

Postings and DocSet changes:

- Postings trait gains doc_freq() -> DocFreq (Exact/Approximate)
  and has_freq().
- RawPostingsData struct carries raw postings bytes across the trait
  boundary for custom reader implementations.
- BlockSegmentPostings::open() takes OwnedBytes instead of FileSlice.
- DocSet gains fill_bitset() method.

Scorer improvements:

- Scorer trait absorbs for_each, for_each_pruning, and explain
  (previously free functions or on Weight).
- box_scorer() helper avoids double-boxing Box<dyn Scorer>.
- BoxedTermScorer wraps a type-erased term scorer.
- BufferedUnionScorer initialization fixed to avoid an extra
  advance() on construction.

Other changes:

- Document::to_json() now returns serde_json::Value; the old
  string serialization is renamed to to_serialized_json().
- DocumentDeserialize removed from the store reader public API.
2026-03-30 12:57:53 +08:00

// # Snippet example
//
// This example shows how to return a representative snippet of
// your hit result.
// Snippets are extracts of a target document, returned in HTML format.
// The keywords searched by the user are highlighted with a `<b>` tag.
// ---
// Importing tantivy...
use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::*;
use tantivy::snippet::{Snippet, SnippetGenerator};
use tantivy::{doc, Index, IndexWriter};
use tempfile::TempDir;
fn main() -> tantivy::Result<()> {
    // Let's create a temporary directory for the
    // sake of this example
    let index_path = TempDir::new()?;

    // # Defining the schema
    let mut schema_builder = Schema::builder();
    let title = schema_builder.add_text_field("title", TEXT | STORED);
    let body = schema_builder.add_text_field("body", TEXT | STORED);
    let schema = schema_builder.build();

    // # Indexing documents
    let index = Index::create_in_dir(&index_path, schema)?;
    let mut index_writer: IndexWriter = index.writer(50_000_000)?;

    // we'll only need one doc for this example.
    index_writer.add_document(doc!(
        title => "Of Mice and Men",
        body => "A few miles south of Soledad, the Salinas River drops in close to the hillside \
                 bank and runs deep and green. The water is warm too, for it has slipped twinkling \
                 over the yellow sands in the sunlight before reaching the narrow pool. On one \
                 side of the river the golden foothill slopes curve up to the strong and rocky \
                 Gabilan Mountains, but on the valley side the water is lined with trees—willows \
                 fresh and green with every spring, carrying in their lower leaf junctures the \
                 debris of the winter's flooding; and sycamores with mottled, white, recumbent \
                 limbs and branches that arch over the pool"
    ))?;
    // ...
    index_writer.commit()?;

    let reader = index.reader()?;
    let searcher = reader.searcher();
    let query_parser = QueryParser::for_index(&index, vec![title, body]);
    let query = query_parser.parse_query("sycamore spring")?;

    let top_docs = searcher.search(&query, &TopDocs::with_limit(10).order_by_score())?;

    let snippet_generator = SnippetGenerator::create(&searcher, &*query, body)?;

    for (score, doc_address) in top_docs {
        let doc = searcher.doc(doc_address)?;
        let snippet = snippet_generator.snippet_from_doc(&doc);
        println!("Document score {score}:");
        println!("title: {}", doc.get_first(title).unwrap().as_str().unwrap());
        println!("snippet: {}", snippet.to_html());
        println!("custom highlighting: {}", highlight(snippet));
    }

    Ok(())
}

fn highlight(snippet: Snippet) -> String {
    let mut result = String::new();
    let mut start_from = 0;

    for fragment_range in snippet.highlighted() {
        result.push_str(&snippet.fragment()[start_from..fragment_range.start]);
        result.push_str(" --> ");
        result.push_str(&snippet.fragment()[fragment_range.clone()]);
        result.push_str(" <-- ");
        start_from = fragment_range.end;
    }

    result.push_str(&snippet.fragment()[start_from..]);
    result
}
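
The marker logic in `highlight` can be tried without building an index. The sketch below is a standalone illustration, not part of the example file: `highlight_ranges` is a hypothetical helper that takes a plain `&str` and explicit byte ranges, standing in for what `snippet.fragment()` and `snippet.highlighted()` provide above.

```rust
use std::ops::Range;

/// Hypothetical helper (not a tantivy API): applies the same " --> " / " <-- "
/// markers as `highlight`, but over a plain string and explicit byte ranges.
/// Assumes the ranges are sorted, non-overlapping, and on char boundaries.
fn highlight_ranges(fragment: &str, ranges: &[Range<usize>]) -> String {
    let mut result = String::new();
    let mut start_from = 0;

    for range in ranges {
        // Copy the text between the previous highlight and this one.
        result.push_str(&fragment[start_from..range.start]);
        result.push_str(" --> ");
        result.push_str(&fragment[range.clone()]);
        result.push_str(" <-- ");
        start_from = range.end;
    }

    // Copy whatever follows the last highlighted range.
    result.push_str(&fragment[start_from..]);
    result
}

fn main() {
    let fragment = "green with every spring";
    // Locate the term by hand; in the example above the ranges come from tantivy.
    let start = fragment.find("spring").expect("term is present");
    let marked = highlight_ranges(fragment, &[start..start + "spring".len()]);
    println!("{marked}");
}
```

Because the markers carry their own surrounding spaces, any whitespace already next to a highlighted term is kept, so a term preceded by a space ends up with two spaces before the arrow.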