mirror of
https://github.com/quickwit-oss/tantivy.git
synced 2026-06-03 09:00:42 +00:00
Adds a new criterion benchmark (`tokenizer_compare`) that measures throughput (MiB/s) of two UAX#29 tokenizer implementations on 64 MiB of English Wikipedia, matching alyze's own benchmark methodology.

Implementations compared:
- UnicodeSegmenterTokenizer: unicode_segmentation::unicode_word_indices() wrapped in tantivy's Tokenizer trait, with LowerCaser + RemoveLongFilter(255)
- alyze: hand-rolled DFA with ASCII fast-path, via its Analyzer API

Results on this machine:
  unicode_seg/tokenize_only   ~88 MiB/s
  unicode_seg/full_pipeline   ~74 MiB/s
  alyze/tokenize_only        ~359 MiB/s
  alyze/full_pipeline        ~225 MiB/s

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
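
For readers unfamiliar with how a MiB/s figure like the ones above is derived, here is a minimal, std-only sketch of the measurement idea: time a tokenizer over a known number of bytes and divide. It is not the `tokenizer_compare` benchmark itself. The names `mib_per_sec`, `count_tokens`, `full_pipeline`, and `measure` are hypothetical. The splitter is a naive alphanumeric splitter rather than a UAX#29 implementation, the corpus is a repeated sentence rather than Wikipedia, and the length filter only approximates what a filter such as RemoveLongFilter does.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Converts a processed byte count and elapsed time into MiB/s.
fn mib_per_sec(bytes: usize, elapsed: Duration) -> f64 {
    (bytes as f64 / (1024.0 * 1024.0)) / elapsed.as_secs_f64()
}

/// Hypothetical "tokenize only" stage: splits on non-alphanumeric
/// characters and counts the tokens. This is NOT UAX#29 segmentation.
fn count_tokens(text: &str) -> usize {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .count()
}

/// Hypothetical "full pipeline" stage: tokenize, drop tokens longer than
/// `max_len` bytes, then lowercase. The exact boundary behaviour of a real
/// length filter may differ; this is only an approximation.
fn full_pipeline(text: &str, max_len: usize) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty() && t.len() <= max_len)
        .map(|t| t.to_lowercase())
        .collect()
}

/// Runs `f` over `text` `iters` times and returns the throughput in MiB/s.
/// `black_box` keeps the optimizer from discarding the work.
fn measure<F: FnMut(&str) -> usize>(text: &str, iters: u32, mut f: F) -> f64 {
    let start = Instant::now();
    for _ in 0..iters {
        black_box(f(black_box(text)));
    }
    mib_per_sec(text.len() * iters as usize, start.elapsed())
}

fn main() {
    // Small synthetic corpus (assumption); the real benchmark uses 64 MiB of text.
    let corpus = "The quick brown fox jumps over the lazy dog. ".repeat(20_000);
    let tok = measure(&corpus, 5, count_tokens);
    let full = measure(&corpus, 5, |t| full_pipeline(t, 255).len());
    println!("tokenize_only: {tok:.1} MiB/s");
    println!("full_pipeline: {full:.1} MiB/s");
}
```

A harness such as criterion adds warm-up, repeated sampling, and statistical analysis on top of this basic bytes-over-time calculation, so numbers from a hand-rolled loop like this one should be treated as rough.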