tantivy

mirror of https://github.com/quickwit-oss/tantivy.git synced 2026-05-17 08:40:41 +00:00

Pascal Seitz 3cbcb5d0aa Abstract tantivy's data storage behind traits for pluggable backends

Extract trait interfaces from tantivy's core reader types so that
alternative storage backends (e.g. Quickwit) can provide their own
implementations while tantivy's query engine works through dynamic
dispatch.

Reader trait extraction:

- SegmentReader is now a trait; the concrete implementation is renamed
  to TantivySegmentReader.
- DynInvertedIndexReader trait for object-safe dynamic dispatch, plus
  a typed InvertedIndexReader trait with associated Postings/DocSet
  types for static dispatch.  The concrete reader becomes
  TantivyInvertedIndexReader.
- StoreReader is now a trait; the concrete implementation is renamed
  to TantivyStoreReader.  get() returns TantivyDocument directly
  instead of requiring a generic DocumentDeserialize bound.

Typed downcast for performance-critical paths:

- try_downcast_and_call() + TypedInvertedIndexReaderCb allow query
  weights (TermWeight, PhraseWeight) to attempt a downcast to the
  concrete TantivyInvertedIndexReader, obtaining typed postings for
  zero-cost scoring, and falling back to the dynamic path otherwise.
- TermScorer<TPostings> is now generic over its postings type.
- PostingsWithBlockMax trait enables block-max WAND acceleration
  through the trait boundary.
- block_wand() and block_wand_single_scorer() are generic over
  PostingsWithBlockMax, and for_each_pruning is dispatched through
  the SegmentReader trait so custom backends can provide their own
  block-max implementations.
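
The downcast-then-fall-back pattern described above can be sketched in plain Rust with `std::any::Any`. This is only an illustration of the idea: `DynReader`, `ConcreteReader`, `RemoteReader`, `Path`, and `count_docs` are hypothetical names made up for this sketch and are not tantivy's actual types or signatures.

```rust
use std::any::Any;

/// Hypothetical stand-in for an object-safe reader trait.
trait DynReader: Any {
    /// Slow path: postings served through dynamic dispatch.
    fn doc_ids_dyn(&self, term: &str) -> Box<dyn Iterator<Item = u32> + '_>;
    /// Exposes the value as `Any` so callers can attempt a downcast.
    fn as_any(&self) -> &dyn Any;
}

/// Hypothetical concrete reader that also offers a typed fast path.
struct ConcreteReader {
    docs: Vec<u32>,
}

impl ConcreteReader {
    /// Fast path: returns a concrete iterator type, so calls can be inlined.
    fn doc_ids_typed(&self, _term: &str) -> std::slice::Iter<'_, u32> {
        self.docs.iter()
    }
}

impl DynReader for ConcreteReader {
    fn doc_ids_dyn(&self, term: &str) -> Box<dyn Iterator<Item = u32> + '_> {
        Box::new(self.doc_ids_typed(term).copied())
    }
    fn as_any(&self) -> &dyn Any {
        self
    }
}

/// Hypothetical alternative backend that only supports the dynamic path.
struct RemoteReader {
    docs: Vec<u32>,
}

impl DynReader for RemoteReader {
    fn doc_ids_dyn(&self, _term: &str) -> Box<dyn Iterator<Item = u32> + '_> {
        Box::new(self.docs.iter().copied())
    }
    fn as_any(&self) -> &dyn Any {
        self
    }
}

/// Records which path was taken, for demonstration purposes only.
#[derive(Debug, PartialEq)]
enum Path {
    Typed,
    Dynamic,
}

/// Tries the typed path first and falls back to dynamic dispatch otherwise.
fn count_docs(reader: &dyn DynReader, term: &str) -> (Path, usize) {
    if let Some(concrete) = reader.as_any().downcast_ref::<ConcreteReader>() {
        (Path::Typed, concrete.doc_ids_typed(term).count())
    } else {
        (Path::Dynamic, reader.doc_ids_dyn(term).count())
    }
}
```

Calling `count_docs` with a `ConcreteReader` takes the typed branch, while any other `DynReader` implementation goes through the boxed iterator. The cost of the downcast is one type-id comparison per call, paid once per segment rather than once per document.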

Searcher decoupled from Index:

- New SearcherContext holds schema, executor, and tokenizers.
- Searcher can be constructed from Vec<Arc<dyn SegmentReader>>
  via Searcher::from_segment_readers(), without needing an Index.
- Searcher::index() is deprecated in favor of Searcher::context().

Postings and DocSet changes:

- Postings trait gains doc_freq() -> DocFreq (Exact/Approximate)
  and has_freq().
- RawPostingsData struct carries raw postings bytes across the trait
  boundary for custom reader implementations.
- BlockSegmentPostings::open() takes OwnedBytes instead of FileSlice.
- DocSet gains fill_bitset() method.
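
As a rough illustration of why an Exact/Approximate distinction on document frequency can matter to callers, here is a minimal sketch. The enum payloads (`u32`) and the `count_matches` helper are assumptions invented for this example; the real `DocFreq` type in tantivy may be shaped differently.

```rust
/// Hypothetical shape of a document-frequency value.
/// The `u32` payloads are an assumption made for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DocFreq {
    /// The reader knows the precise number of matching documents.
    Exact(u32),
    /// The reader can only give an estimate (e.g. a remote backend).
    Approximate(u32),
}

impl DocFreq {
    /// Returns the carried number regardless of its precision.
    fn value(self) -> u32 {
        match self {
            DocFreq::Exact(n) | DocFreq::Approximate(n) => n,
        }
    }
}

/// Hypothetical helper: uses the stored frequency as the match count only
/// when it is exact, and otherwise falls back to scanning the postings.
fn count_matches(doc_freq: DocFreq, scan: impl FnOnce() -> u32) -> u32 {
    match doc_freq {
        DocFreq::Exact(n) => n,
        DocFreq::Approximate(_) => scan(),
    }
}
```

An estimate is still good enough for scoring statistics, which is why `value` ignores the precision, whereas a count collector would need the exact figure.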

Scorer improvements:

- Scorer trait absorbs for_each, for_each_pruning, and explain
  (previously free functions or on Weight).
- box_scorer() helper avoids double-boxing Box<dyn Scorer>.
- BoxedTermScorer wraps a type-erased term scorer.
- BufferedUnionScorer initialization fixed to avoid an extra
  advance() on construction.

Other changes:

- Document::to_json() now returns serde_json::Value; the old
  string serialization is renamed to to_serialized_json().
- DocumentDeserialize removed from the store reader public API.

2026-03-30 12:57:53 +08:00

README.md

Fast full-text search engine library written in Rust

If you are looking for an alternative to Elasticsearch or Apache Solr, check out Quickwit, our distributed search engine built on top of Tantivy.

Tantivy is closer to Apache Lucene than to Elasticsearch or Apache Solr in the sense that it is not an off-the-shelf search engine server, but rather a crate that can be used to build such a search engine.

Tantivy is, in fact, strongly inspired by Lucene's design.

Benchmark

The following benchmark breaks down the performance for different types of queries/collections.

Your mileage WILL vary depending on the nature of queries and their load.

Details about the benchmark can be found at this repository.

Features

Full-text search
Configurable tokenizer (stemming available for 17 Latin languages) with third-party support for Chinese (tantivy-jieba and cang-jie), Japanese (lindera, Vaporetto, and tantivy-tokenizer-tiny-segmenter), and Korean (lindera + lindera-ko-dic-builder)
Fast (check out the 🐎 ✨ benchmark ✨ 🐎)
Tiny startup time (<10ms), perfect for command-line tools
BM25 scoring (the same as Lucene)
Natural query language (e.g. (michael AND jackson) OR "king of pop")
Phrase queries (e.g. "michael jackson")
Incremental indexing
Multithreaded indexing (indexing English Wikipedia takes < 3 minutes on my desktop)
Mmap directory
SIMD integer compression when the platform/CPU includes the SSE2 instruction set
Single-valued and multivalued u64, i64, and f64 fast fields (equivalent of doc values in Lucene)
&[u8] fast fields
Text, i64, u64, f64, dates, ip, bool, and hierarchical facet fields
Compressed document store (LZ4, Zstd, None)
Range queries
Faceted search
Configurable indexing (optional term frequency and position indexing)
JSON Field
Aggregation Collector: histogram, range buckets, average, and stats metrics
LogMergePolicy with deletes
Searcher Warmer API
Cheesy logo with a horse
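
To make the BM25 scoring item above concrete, here is a minimal sketch of the textbook BM25 formula for a single term. The function names and the `K1`/`B` defaults (1.2 and 0.75, the values commonly used) are choices made for this example; the exact constants, field-length encoding, and normalization inside tantivy or Lucene may differ.

```rust
/// Commonly used default BM25 parameters (assumed for this sketch).
const K1: f32 = 1.2;
const B: f32 = 0.75;

/// Inverse document frequency: rarer terms get a larger weight.
/// `doc_count` is the number of documents, `doc_freq` the number containing the term.
fn idf(doc_count: u64, doc_freq: u64) -> f32 {
    let n = doc_count as f32;
    let df = doc_freq as f32;
    (1.0 + (n - df + 0.5) / (df + 0.5)).ln()
}

/// BM25 score of one term in one document.
/// `field_len` is the document's field length in tokens and
/// `avg_field_len` the average field length over the whole index.
fn bm25(
    term_freq: u32,
    field_len: u32,
    avg_field_len: f32,
    doc_count: u64,
    doc_freq: u64,
) -> f32 {
    let tf = term_freq as f32;
    // Length normalization: longer-than-average fields are penalized.
    let norm = K1 * (1.0 - B + B * field_len as f32 / avg_field_len);
    idf(doc_count, doc_freq) * tf * (K1 + 1.0) / (tf + norm)
}
```

The score grows with term frequency but saturates, grows as the term gets rarer across the index, and shrinks as the field gets longer than average.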

Non-features

Distributed search is out of the scope of Tantivy, but if you are looking for this feature, check out Quickwit.

Getting started

Tantivy works on stable Rust and supports Linux, macOS, and Windows.

Tantivy's simple search example
tantivy-cli and its tutorial - tantivy-cli is an actual command-line interface that makes it easy for you to create a search engine, index documents, and search via the CLI or a small server with a REST API. It walks you through getting a Wikipedia search engine up and running in a few minutes.
Reference doc for the last released version

How can I support this project?

There are many ways to support this project.

Use Tantivy and tell us about your experience on Discord or by email (paul.masurel@gmail.com)
Report bugs
Write a blog post
Help with documentation by asking questions or submitting PRs
Contribute code (you can join our Discord server)
Talk about Tantivy around you

Contributing code

We use the GitHub Pull Request workflow: reference a GitHub ticket and/or include a comprehensive commit message when opening a PR. Feel free to update CHANGELOG.md with your contribution.

Tokenizer

When implementing a tokenizer for tantivy, depend on the tantivy-tokenizer-api crate.

Clone and build locally

Tantivy compiles on stable Rust. To check out and run tests, you can simply run:

git clone https://github.com/quickwit-oss/tantivy.git
cd tantivy
cargo test

FAQ

Can I use Tantivy in other languages?

Python → tantivy-py
Ruby → tantiny

You can also find other bindings on GitHub, but they may be less actively maintained.

What are some examples of Tantivy use?

seshat: A matrix message database/indexer
tantiny: Tiny full-text search for Ruby
lnx: adaptable, typo tolerant search engine with a REST API
Bichon: A lightweight, high-performance Rust email archiver with WebUI
and more!

On average, how much faster is Tantivy compared to Lucene?

According to our search latency benchmark, Tantivy is approximately 2x faster than Lucene.

Does tantivy support incremental indexing?

Yes.

How can I edit documents?

Data in tantivy is immutable. To edit a document, the document needs to be deleted and reindexed.

When will my documents be searchable during indexing?

Documents become searchable once commit is called on the IndexWriter. Existing IndexReaders must also be reloaded to reflect the changes, and even then the changes are visible only to Searchers acquired after the reload.
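
The visibility rules can be illustrated with a toy in-memory model. This is not tantivy's API: `ToyIndex`, `ToyWriter`, `ToyReader`, `ToySearcher`, and `new_index` are hypothetical types written only to mimic the commit / reload / acquire sequence with standard-library primitives.

```rust
use std::sync::{Arc, Mutex};

/// An immutable view of the committed documents.
type Snapshot = Arc<Vec<String>>;

/// Toy index: holds the latest committed snapshot.
struct ToyIndex {
    committed: Mutex<Snapshot>,
}

/// Creates an empty toy index shared between writer and readers.
fn new_index() -> Arc<ToyIndex> {
    Arc::new(ToyIndex { committed: Mutex::new(Arc::new(Vec::new())) })
}

/// Toy writer: buffers documents until `commit` publishes them.
struct ToyWriter {
    index: Arc<ToyIndex>,
    pending: Vec<String>,
}

impl ToyWriter {
    fn new(index: Arc<ToyIndex>) -> Self {
        ToyWriter { index, pending: Vec::new() }
    }

    fn add_document(&mut self, doc: &str) {
        self.pending.push(doc.to_string());
    }

    /// Publishes a new snapshot containing the old and the pending documents.
    fn commit(&mut self) {
        let mut committed = self.index.committed.lock().unwrap();
        let mut next: Vec<String> = (**committed).clone();
        next.append(&mut self.pending);
        *committed = Arc::new(next);
    }
}

/// Toy reader: keeps pointing at one snapshot until `reload` is called.
struct ToyReader {
    index: Arc<ToyIndex>,
    current: Snapshot,
}

impl ToyReader {
    fn new(index: Arc<ToyIndex>) -> Self {
        let current = index.committed.lock().unwrap().clone();
        ToyReader { index, current }
    }

    /// Picks up the latest committed snapshot.
    fn reload(&mut self) {
        self.current = self.index.committed.lock().unwrap().clone();
    }

    /// Hands out a searcher pinned to the reader's current snapshot.
    fn searcher(&self) -> ToySearcher {
        ToySearcher { snapshot: self.current.clone() }
    }
}

/// Toy searcher: sees exactly the snapshot it was created with.
struct ToySearcher {
    snapshot: Snapshot,
}

impl ToySearcher {
    fn num_docs(&self) -> usize {
        self.snapshot.len()
    }
}
```

In this model a document added with `add_document` stays invisible until `commit`, a reader keeps serving its old snapshot until `reload`, and a searcher obtained before the reload never observes the new document. Pinning each searcher to one snapshot is what keeps results consistent for the duration of a query.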
