tantivy

mirror of https://github.com/quickwit-oss/tantivy.git synced 2026-06-03 09:00:42 +00:00

Files

Francois Massot b3da16fa7b bench: compare UnicodeSegmenterTokenizer vs alyze UAX#29 tokenizer on Wikipedia

Adds a new criterion benchmark (`tokenizer_compare`) that measures throughput
(MiB/s) of two UAX#29 tokenizer implementations on 64 MiB of English Wikipedia,
matching alyze's own benchmark methodology.

Implementations compared:
- UnicodeSegmenterTokenizer: unicode_segmentation::unicode_word_indices() wrapped
  in tantivy's Tokenizer trait, with LowerCaser + RemoveLongFilter(255)
- alyze: hand-rolled DFA with ASCII fast-path, via its Analyzer API

Results on this machine:
  unicode_seg/tokenize_only  ~88 MiB/s
  unicode_seg/full_pipeline  ~74 MiB/s
  alyze/tokenize_only       ~359 MiB/s
  alyze/full_pipeline       ~225 MiB/s
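
For readers unfamiliar with the two measured variants, here is a minimal std-only sketch of what "tokenize_only" versus "full_pipeline" throughput in MiB/s means. It is not the `tokenizer_compare` benchmark itself. It uses neither criterion, tantivy, unicode_segmentation, nor alyze. The splitter is a hypothetical stand-in that breaks on non-alphanumeric characters and is not UAX#29 segmentation. The function names, the tiny repeated corpus, and the exact length cutoff (tokens are kept when their byte length is below `max_len`) are assumptions made for illustration.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in tokenizer: splits on non-alphanumeric characters.
/// This is NOT UAX#29 word segmentation; it only illustrates the shape of the stage.
fn tokenize_only<'a>(text: &'a str) -> impl Iterator<Item = &'a str> + 'a {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
}

/// Sketch of a full pipeline: tokenize, lowercase, then drop long tokens.
/// Assumption: a token is kept when its UTF-8 byte length is below `max_len`.
fn full_pipeline(text: &str, max_len: usize) -> Vec<String> {
    tokenize_only(text)
        .map(|t| t.to_lowercase())
        .filter(|t| t.len() < max_len)
        .collect()
}

/// Throughput in MiB/s for `bytes` processed in `elapsed`.
fn mib_per_sec(bytes: usize, elapsed: Duration) -> f64 {
    (bytes as f64 / (1024.0 * 1024.0)) / elapsed.as_secs_f64()
}

fn main() {
    // Tiny synthetic corpus; the real benchmark uses 64 MiB of English Wikipedia.
    let corpus = "The Quick brown fox, naïve café! ".repeat(100_000);

    let start = Instant::now();
    let token_count = tokenize_only(&corpus).count();
    let tokenize_time = start.elapsed();

    let start = Instant::now();
    let tokens = full_pipeline(&corpus, 255);
    let pipeline_time = start.elapsed();

    println!(
        "tokenize_only: {} tokens, {:.1} MiB/s",
        token_count,
        mib_per_sec(corpus.len(), tokenize_time)
    );
    println!(
        "full_pipeline: {} tokens, {:.1} MiB/s",
        tokens.len(),
        mib_per_sec(corpus.len(), pipeline_time)
    );
}
```

A single timed pass like this is noisy. A statistical harness such as criterion, which the commit uses, repeats the measurement and reports a distribution, so the numbers printed here should not be compared with the results above.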

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-06-02 05:13:06 +02:00

agg_bench.rs

add nested term benchmark

2026-04-21 07:26:58 +02:00

alice.txt

added a simple bench for the default analyzer

2021-01-06 19:11:26 +09:00

analyzer.rs

Fix stackoverflow and add docs.

2023-07-03 22:05:11 +09:00

and_or_queries.rs

Improve Union Performance for non-score unions (#2863)