tantivy/benches
Francois Massot b3da16fa7b bench: compare UnicodeSegmenterTokenizer vs alyze UAX#29 tokenizer on Wikipedia
Adds a new criterion benchmark (`tokenizer_compare`) that measures throughput
(MiB/s) of two UAX#29 tokenizer implementations on 64 MiB of English Wikipedia,
matching alyze's own benchmark methodology.

Implementations compared:
- UnicodeSegmenterTokenizer: unicode_segmentation::unicode_word_indices() wrapped
  in tantivy's Tokenizer trait, with LowerCaser + RemoveLongFilter(255)
- alyze: hand-rolled DFA with ASCII fast-path, via its Analyzer API
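
To make the tokenize_only / full_pipeline distinction concrete, here is a minimal
std-only sketch. It is not the benchmark code from this commit: `word_indices` is a
hypothetical stand-in that splits on non-alphanumeric characters (it is NOT UAX#29
and only mimics the `(offset, word)` shape of `unicode_word_indices()`), and
`full_pipeline` imitates a lowercasing step followed by a long-token filter. The
exact boundary of the length filter (keep tokens strictly shorter than the limit,
measured in UTF-8 bytes) is an assumption.

```rust
/// Hypothetical stand-in for a word segmenter: returns (byte_offset, word)
/// for each maximal run of alphanumeric characters. Not UAX#29 compliant.
fn word_indices(text: &str) -> Vec<(usize, &str)> {
    let mut out = Vec::new();
    let mut start: Option<usize> = None;
    for (i, c) in text.char_indices() {
        if c.is_alphanumeric() {
            // Remember where the current word began.
            if start.is_none() {
                start = Some(i);
            }
        } else if let Some(s) = start.take() {
            // A non-word character closes the current word.
            out.push((s, &text[s..i]));
        }
    }
    // Flush a word that runs to the end of the input.
    if let Some(s) = start {
        out.push((s, &text[s..]));
    }
    out
}

/// "Full pipeline" sketch: lowercase every token, then drop long ones.
/// Assumption: a token is kept only if its UTF-8 length is below `limit`.
fn full_pipeline(text: &str, limit: usize) -> Vec<String> {
    word_indices(text)
        .into_iter()
        .map(|(_, word)| word.to_lowercase())
        .filter(|word| word.len() < limit)
        .collect()
}

fn main() {
    // tokenize_only corresponds to word_indices alone;
    // full_pipeline adds the two filters on top.
    println!("{:?}", word_indices("Hello, Wikipedia World!"));
    println!("{:?}", full_pipeline("Hello, Wikipedia World!", 255));
}
```

The gap between the two variants in the results comes from the extra per-token work
(allocation and case mapping in this sketch), which is paid on top of segmentation.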

Results on this machine:
  unicode_seg/tokenize_only  ~88 MiB/s
  unicode_seg/full_pipeline  ~74 MiB/s
  alyze/tokenize_only       ~359 MiB/s
  alyze/full_pipeline       ~225 MiB/s
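
For readers who want to relate these figures to each other, the sketch below shows
how a MiB/s value can be derived from a byte count and an elapsed time, and how the
ratios between the reported numbers work out (roughly 4.1x for tokenize_only and
3.0x for full_pipeline). The helper names `throughput_mib_s` and `speedup` are
hypothetical, and the timed workload is a placeholder (`split_whitespace`), not
either tokenizer; criterion performs the real measurement in the benchmark.

```rust
use std::time::{Duration, Instant};

const MIB: f64 = 1024.0 * 1024.0;

/// Throughput in MiB/s for `bytes` processed in `elapsed`.
fn throughput_mib_s(bytes: usize, elapsed: Duration) -> f64 {
    (bytes as f64 / MIB) / elapsed.as_secs_f64()
}

/// Ratio between two throughput figures.
fn speedup(fast: f64, slow: f64) -> f64 {
    fast / slow
}

fn main() {
    // Placeholder workload, timed once with a wall clock. A real benchmark
    // would repeat the measurement and report statistics, as criterion does.
    let input = "the quick brown fox ".repeat(50_000);
    let start = Instant::now();
    let tokens = input.split_whitespace().count();
    let elapsed = start.elapsed();
    println!(
        "{} tokens, {:.1} MiB/s",
        tokens,
        throughput_mib_s(input.len(), elapsed)
    );

    // Ratios computed from the figures reported in this commit message.
    println!("tokenize_only speedup: {:.1}x", speedup(359.0, 88.0));
    println!("full_pipeline speedup: {:.1}x", speedup(225.0, 74.0));
}
```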

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-02 05:13:06 +02:00