tantivy/src/tokenizer
Piotr Olszak 538da08eb5 Add Polish stemmer (#82)
This commit adds support for Polish language stemming.
The previously used rust-stemmers crate is unmaintained, which blocked the addition of new languages. This change addresses a user request for Polish stemming to improve BM25 recall. The tantivy-stemmers crate is a maintained alternative that also makes it possible to support more languages in the future.
- Added the tantivy-stemmers crate as a dependency to the workspace, alongside the existing rust-stemmers dependency (for backward compatibility)
- Introduced an internal enum that can hold an algorithm from either rust-stemmers or tantivy-stemmers
- Added Polish to the main Language enum, mapped to the new tantivy-stemmers implementation
- Updated the token stream to handle both types of stemmers internally
- Added the POLISH variant to the stopwords list
- Existing tests pass
- Added test_pl_tokenizer to verify that the Polish stemmer works correctly
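The dispatch described above (one internal enum wrapping an algorithm from either crate, with the Language enum choosing the variant) might look roughly like the sketch below. It is illustrative only and uses no external crates: `LegacyAlgorithm`, `ModernStemFn`, `StemmerAlgorithm`, and both toy stemming functions are hypothetical stand-ins, not the actual rust-stemmers or tantivy-stemmers APIs, and the suffix stripping is not real stemming.

```rust
use std::borrow::Cow;

// Hypothetical stand-in for an algorithm selector from the older crate.
enum LegacyAlgorithm {
    English,
}

// Hypothetical stand-in for a stemming function from the newer crate.
type ModernStemFn = for<'a> fn(&'a str) -> Cow<'a, str>;

// Toy English "stemmer": strips a trailing "ing" (illustration only).
fn toy_english_stem(word: &str) -> Cow<'_, str> {
    match word.strip_suffix("ing") {
        Some(stem) if !stem.is_empty() => Cow::Borrowed(stem),
        _ => Cow::Borrowed(word),
    }
}

// Toy Polish "stemmer": strips a trailing "ami" (illustration only).
fn toy_polish_stem(word: &str) -> Cow<'_, str> {
    match word.strip_suffix("ami") {
        Some(stem) if !stem.is_empty() => Cow::Borrowed(stem),
        _ => Cow::Borrowed(word),
    }
}

// Internal enum that can hold an algorithm from either backend.
enum StemmerAlgorithm {
    Legacy(LegacyAlgorithm),
    Modern(ModernStemFn),
}

impl StemmerAlgorithm {
    // The token stream only calls `stem`; it does not need to know
    // which backend produced the result.
    fn stem<'a>(&self, word: &'a str) -> Cow<'a, str> {
        match self {
            StemmerAlgorithm::Legacy(LegacyAlgorithm::English) => toy_english_stem(word),
            StemmerAlgorithm::Modern(stem_fn) => (*stem_fn)(word),
        }
    }
}

// Public language enum: existing languages keep the legacy backend,
// while the new Polish variant maps to the modern one.
enum Language {
    English,
    Polish,
}

impl Language {
    fn algorithm(self) -> StemmerAlgorithm {
        match self {
            Language::English => StemmerAlgorithm::Legacy(LegacyAlgorithm::English),
            Language::Polish => StemmerAlgorithm::Modern(toy_polish_stem),
        }
    }
}
```

With this shape, a caller would write something like `Language::Polish.algorithm().stem("kotami")` and get the same `Cow<str>` return type regardless of which backend handled the word, so existing languages stay on their current code path.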
2025-12-10 10:17:28 -08:00