tantivy

mirror of https://github.com/quickwit-oss/tantivy.git synced 2026-05-27 05:30:45 +00:00

Author	SHA1	Message	Date
Stu Hood	e8a4adeedd	Replace `Column::first_vals` with `Column::first_vals_in_value_range`.	2025-12-25 15:39:18 -07:00
Stu Hood	efc9e585a9	WIP: Add ValueRange::All	2025-12-25 15:16:26 -07:00
Stu Hood	f4252fc184	WIP: Add ValueRange.	2025-12-25 14:53:15 -07:00
Stu Hood	53c067d1f3	Restore laziness in `ChainSegmentSortKeyComputer`.	2025-12-24 10:39:26 -07:00
Stu Hood	259c1ed965	Isolate `accept_sort_key_lazy` to `ChainSegmentSortKeyComputer`.	2025-12-23 17:37:33 -07:00
Stu Hood	1afc432df8	Use an internal buffer in the SegmentSortKeyComputer.	2025-12-23 17:23:10 -07:00
Stu Hood	b8acd3ac94	WIP: Add and use `segment_sort_keys` to remove dynamic dispatch to the column.	2025-12-23 16:44:50 -07:00
Stu Hood	b5321d2125	Implement laziness for `collect_block`.	2025-12-23 15:48:36 -07:00
Stu Hood	ad3e2363fe	WIP: Add failing test.	2025-12-23 15:48:34 -07:00
Stu Hood	9ec5750c25	Implement `collect_block` for lazy scorers.	2025-12-23 15:46:41 -07:00
Stu Hood	03f09a2b5b	chore: Add support for natural-order-with-none-highest in `TopDocs::order_by` (#90 ) Add `ComparatorEnum::NaturalNoneHigher`, which matches Postgres's `DESC NULLS FIRST` behavior in `TopDocs::order_by`. Expands comments on `Comparator` implementations to ensure that behavior for `None` is explicit. Upstream as https://github.com/quickwit-oss/tantivy/pull/2780	2025-12-23 09:15:31 -08:00
Stu Hood	9ffe4af096	Fix TopN performance regression. https://github.com/quickwit-oss/tantivy/pull/2777	2025-12-17 10:43:29 -07:00
Stu Hood	c56ddcb6d7	Add an erased SortKeyComputer to sort on types which are not known until runtime. https://github.com/quickwit-oss/tantivy/pull/2770	2025-12-17 10:43:29 -07:00
Ming	5b8fff154b	fix: overflow in vint buffer (#88 )	2025-12-17 10:43:29 -07:00
Mohammad Dashti	ff6ee3a5db	fix: post-rebase fixes - Add missing size_hint module declaration - Remove test-only export serialize_and_load_u64_based_column_values - fixed quickwit CI issues	2025-12-10 10:17:28 -08:00
Moe	eda9aa437f	fix: boolean query incorrectly dropping documents when AllScorer is present (#84 ) Co-authored-by: Stu Hood <stuhood@gmail.com>	2025-12-10 10:17:28 -08:00
Piotr Olszak	538da08eb5	Add polish stemmer (#82 ) This commit adds support for Polish language stemming. The previously used rust-stemmers crate is abandoned and unmaintained, which blocked the addition of new languages. This change addresses a user request for Polish stemming to improve BM25 recall in their use case. The tantivy-stemmers crate is a modern, maintained alternative that also opens the door for supporting many other languages in the future. - Added the tantivy-stemmers crate as a dependency to the workspace, alongside the existing rust-stemmers dependency (for backward compatibility) - Introduced an internal enum that can hold an algorithm from either rust-stemmers or tantivy-stemmers - Added Polish to the main Language enum, mapped to the new tantivy-stemmers implementation - Updated the token stream to handle both types of stemmers internally - Added the POLISH variant to the stopwords list - Existing tests pass - Added test_pl_tokenizer to verify that the Polish stemmer works correctly	2025-12-10 10:17:28 -08:00
Moe	7bd5cc5417	fix: fixed integer overflow in ExpUnrolledLinkedList for large datasets (#80 )	2025-12-10 10:17:28 -08:00
Moe	5d46137556	feat: Added multiple snippet support (#76 ) Adds `SnippetGenerator::snippets` to render multiple snippets in either score or position order. Additionally: renames the existing `limit` and `offset` arguments to disambiguate between "match" positions (which are concatenated into fragments), and "snippet" positions. Co-authored-by: Stu Hood <stuhood@gmail.com>	2025-12-10 10:17:28 -08:00
Stu Hood	92c784f697	perf: Optimize `TermSet` for very large sets of terms. (#75 ) * Removes allocation in a bunch of places * Removes sorting of terms if we're going to use the fast field execution method * Adds back the (accidentally dropped) cardinality threshold * Removes `bool` support -- using the posting lists is always more efficient for a `bool`, since there are at most two of them * More eagerly constructs the term `HashSet` so that it happens once, rather than once per segment	2025-12-10 10:17:28 -08:00
Stu Hood	b3541d10e1	chore: Use smaller merge buffers. (#74 ) ## What Reduce the per-segment buffer sizes from 4MB to 512KB. ## Why #71 moved from buffers which covered the entire file to maximum 4MB buffers. But for merges with very large segment counts, we need to be using more conservative buffer sizes. 512KB will still eliminate most posting list reads: posting lists larger than 512KB will skip the buffer.	2025-12-10 10:17:28 -08:00
Stu Hood	7183ac6cbc	fix: Use smaller buffers during merging (#71 ) `MergeOptimizedInvertedIndexReader` was added in #32 in order to avoid making small reads to our underlying `FileHandle`. It did so by reading the entire content of the posting lists and positions at open time. As that PR says: > A likely downside to this approach is that now pg_search will be, indirectly, holding onto a lot of heap-allocated memory that was read from its block storage. Perhaps in the (near) future we can further optimize the new `MergeOptimizedInvertedIndexReader` such that it pages in blocks of a few megabytes at a time, on demand, rather than the whole file. This PR makes that change. But it additionally removes code that was later added in #47 to borrow individual entries rather than creating `OwnedBytes` for them. I believe that this code was added due to a misunderstanding: `OwnedBytes` is a total misnomer: the bytes are not "owned": they are immutably borrowed and reference counted. An `OwnedBytes` object can be created for any type which derefs to a slice of bytes, and can be cheaply cloned and sliced. So there is no need to actually borrow _or_ copy the buffer under the `OwnedBytes`. Removing the code that was doing so allows us to safely recreate our buffer without worrying about the lifetimes of buffers that we've handed out.	2025-12-10 10:17:28 -08:00
Stu Hood	e0476d2eb2	fix: Add support for `bool` to the fast field `TermSet` implementation (#70 ) Missed in #69. The `TermSet` fast fields implementation cribbed from `RangeQuery`'s fast fields implementation: ... which also has this bug. Will fix upstream.	2025-12-10 10:17:28 -08:00
Stu Hood	9fe0899934	perf: Implement a TermSet variant which uses fast fields (#69 ) The `TermSet` `Query` currently produces one `Scorer`/`DocSet` per matched term by scanning the term dictionary and then consuming posting lists. For very large sets of terms and a fast field, it is faster to scan the fast field column while intersecting with a `HashSet` of (encoded) term values. Following the pattern set by the two execution modes of `RangeQuery`, this PR introduces a variant of `TermSet` which uses fast fields, and then uses it when there are more than 1024 input terms (an arbitrary threshold!). Performance is significantly improved for large `TermSet`s of primitives.	2025-12-10 10:17:28 -08:00
Stu Hood	aaa5abb7d6	chore: Expose a method to create a segment with a particular id (#68 ) In support of https://github.com/paradedb/paradedb/pull/3203	2025-12-10 10:17:28 -08:00
Ming	f8b8fd0321	feat: `SnippetGenerator` accepts limit/offset (#66 )	2025-12-10 10:17:27 -08:00
Eric Ridge	cd878a5c90	fix: support MemoryArena allocations up to 4GB (#62 ) A MemoryArena should support allocations up to 4GB and https://github.com/paradedb/tantivy/pull/60 broke this by not accounting for the "max page id" when pages are now 50% the size of what they originally were. This cleans up the code so things stay in sync if we change NUM_BITS_PAGE_ADDR again and adds a unit test.	2025-12-10 10:17:27 -08:00
Eric Ridge	30c237e895	perf: various optimizations around arenas (#60 ) - Use a bitset to track used buckets in the `SharedArenaHashmap`, allowing for more efficient iteration - Create a global pool for both `MemoryArena` and `IndexingContext` - Reduce the `MemoryArena` page size by half (it's now 512KB instead of 1MB) - Centralize thread pool instances in `SegmentUpdater` so we can elide making them if all nthread sizes are zero	2025-12-10 10:17:27 -08:00
Eric Ridge	b6cd39872b	fix: Allow zero indexing & merging threads (#59 ) This removes a check against `IndexWriterOptions` which disallowed zero indexing worker threads (`num_worker_threads`).	2025-12-10 10:17:27 -08:00
Stu Hood	c96d801c68	perf: Lazily load in `BitpackedCodec` (#56 ) We would like to be able to lazily load `BitpackedCodec` columns (similar to what `020bdffd61` did for `BlockwiseLinearCodec`), because in the context of `pg_search`, immediately constructing `OwnedBytes` means copying the entire content of the column into memory. To do so, we expose some (slightly overlapped) block boundaries from `BitUnpacker`, and then lazily load each block when it is requested. Only the `get_val` function uses the cache: `get_row_ids_for_value_range` does not (yet), because it would be necessary to partition the row ids by block, and most of the time consumers using it are already loading reasonably large ranges anyway. See https://github.com/paradedb/paradedb/pull/2894 for usage. There are a few 2x speedups in the benchmark suite, as well as a 1.8x speedup on a representative customer query. Unfortunately there are also some 13-19% slowdowns on aggregates: it looks like that is because aggregates use `get_vals`, for which the default implementation is to just call `get_val` in a loop.	2025-12-10 10:17:27 -08:00
Stu Hood	7a13e0294d	Avoid copying into `OwnedBytes` when opening a fast field column `Dictionary`. (#55 ) When a fast fields string/bytes `Dictionary` is opened, we currently read the entire dictionary from `FileSlice` -> `OwnedBytes`... and then immediately wrap it back into a `FileSlice`. Switching to `Dictionary::open` preserves the `FileSlice`, such that only the portions of the `Dictionary` which are actually accessed are read from disk/buffers.	2025-12-10 10:17:27 -08:00
Eric Ridge	20d00701ee	perf: lazily open positions file (#54 ) Not all queries use positions and it's okay if we (from the perspective of `pg_search`, anyways) defer opening/loading them until they're first needed. This is probably completely wrong for a mmap-based Directory, but we (again, `pg_search`) decided long ago that we don't care about that use case. This saves a lot of disk I/O when an index has lots of segments and the query doesn't need positions. As a drive by, make sure a random Vec has enough space before pushing items to it. This showed up in the profiler, believe it or not.	2025-12-10 10:17:27 -08:00
Eric Ridge	526afc6111	chore: internal API visibility adjustments (#53 )	2025-12-10 10:17:27 -08:00
Ming	f9e4a8413b	make the directory `BufWriter` capacity configurable (#52 )	2025-12-10 10:17:27 -08:00
Ming	58124bb164	changes to make merging work (#48 )	2025-12-10 10:17:27 -08:00
Eric Ridge	176f7e852a	perf: remove general overhead during segment merging (#47 )	2025-12-10 10:17:27 -08:00
Ming	cfa5f94114	chore: Make some delete-related functions public (#46 )	2025-12-10 10:17:26 -08:00
Ming	5e449e7dda	feat: `SnippetGenerator` can handle JSON fields (#42 )	2025-12-10 10:17:26 -08:00
Stu Hood	1617459b01	Expose some methods which are necessary to create a streaming version of `sorted_ords_to_term_cb`. (#43 ) See https://github.com/paradedb/paradedb/pull/2612. We might eventually want that function upstreamed, but there are more changes planned to it for https://github.com/paradedb/paradedb/issues/2619, so doing the expedient thing now.	2025-12-10 10:17:26 -08:00
Ming	0e1a7e213e	chore: allow `merge_foreground` to ignore the store (#40 )	2025-12-10 10:17:26 -08:00
Ming	b0660ba196	chore: make some structs pub (#39 )	2025-12-10 10:17:26 -08:00
Eric Ridge	936d6af471	feat: ability to directly merge segments in the foreground (#36 ) This adds new public-facing (and internal) APIs for being able to merge a list of segments in the foreground, without using any threads. It's largely a cut-n-paste of the existing background merge code. For pg_search, this is beneficial because it allows us to merge directly using our `MVCCDirectory` rather than going through the `ChannelDirectory`, which has quite a bit of overhead.	2025-12-10 10:17:26 -08:00
Eric Ridge	2560de3a01	feat: `IndexWriter::wait_merging_threads()` return `Err` on merge failure (#34 )	2025-12-10 10:17:26 -08:00
Eric Ridge	75a8384c2b	feat: remove `Directory::reconsider_merge_policy()` and add other niceties to Directory API (#33 ) This removes `Directory::reconsider_merge_policy()`. After reconsidering this, it's better to make this decision ahead of time. Also adds a `Directory::log(message: &str)` function along with passing a `Directory` reference to `MergePolicy::compute_merge_candidates()`. It also hits some `#[derive(Debug)]` and `#[derive(Serialize)]` on a couple of structs that can benefit.	2025-12-10 10:17:26 -08:00
Eric Ridge	5b6da9123c	feat: introduce a `MergeOptimizedInvertedIndexReader` (#32 ) This is probably a bit of a misnomer as it's really a "PgSearchOptimizedInvertedIndexReaderForMerge". What we've done here is copied `InvertedIndexReader` and internally adjusted it to hold onto the complete `OwnedBytes` of the index's postings and positions. One or two other small touch points were required to make other internal APIs compatible with this but they don't otherwise change functionality or I/O patterns. `MergeOptimizedInvertedIndexReader` does change I/O patterns, however, in that the merge process now does two (potentially) very large reads when it obtains the new "merge optimized inverted index reader" for each segment. This changes access patterns such that all the reads happen up-front rather than term-by-term as the merge process is solving. A likely downside to this approach is that now pg_search will be, indirectly, holding onto a lot of heap-allocated memory that was read from its block storage. Perhaps in the (near) future we can further optimize the new `MergeOptimizedInvertedIndexReader` such that it pages in blocks of a few megabytes at a time, on demand, rather than the whole file. --- Some unit tests were also updated to resolve compilation problems by PR https://github.com/paradedb/tantivy/pull/31 that for some reason didn't show in CI. #weird	2025-12-10 10:17:26 -08:00
Eric Ridge	8b7db36c99	feat: Add `Directory::wants_cancel()` function (#31 ) This adds a function named `wants_cancel() -> bool` to the `Directory` trait. It allows a Directory implementation to indicate that it would like Tantivy to cancel an operation. Right now, querying this function only happens during key points of index merging, but _could_ be used in other places. Technically, segment merging is the only "black box" in tantivy that users don't otherwise have the direct ability to control. The default implementation of `wants_cancel()` returns false, so there's no fear of default tantivy spuriously cancelling a merge. The cancels happen "cleanly" such that if `wants_cancel()` returns true an `Err(TantivyError::Cancelled)` is returned from the calling function at that point, and the error result will be propagated up the stack. No panics are raised.	2025-12-10 10:17:26 -08:00
Eric Ridge	eabe589814	feat: ability to assign a panic handler to a Directory (#30 ) Tantivy creates thread pools for some of its background work, specifically committing and merging. It's possible if one of the thread workers panics that rayon will simply abort the process. This is terrible for pg_search as that takes down the entire Postgres cluster. These changes allow a Directory to assign a panic handler that gets called in such cases, which allows pg_search to gracefully roll back the current transaction while presenting the panic message to the user.	2025-12-10 10:17:26 -08:00
Eric Ridge	65d3574dfd	feat: make garbage collection opt-out (#28 )	2025-12-10 10:17:26 -08:00
Ming	26d623c411	Change default index precision to microseconds (#27 )	2025-12-10 10:17:25 -08:00
Eric Ridge	0552dddeb9	feat: delete docs by (`SegmentId`, `DocId`) (#26 ) This teaches tantivy how to "directly" delete a document in a segment. Our use case from pg_search is that we already know the segment_id and doc_id so it's waaaaay more efficient for us to delete docs through our `ambulkdelete()` routine. It avoids doing a search, and all the stuff around that, for each of our "ctid" terms that we want to delete.	2025-12-10 10:17:25 -08:00
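
The `Directory::wants_cancel()` entry (#31) above describes a cooperative cancellation pattern: a trait method that defaults to `false`, polled at key points of a merge, with a clean `Err` returned instead of a panic. The following is a minimal, self-contained sketch of that pattern only. It does not use tantivy's actual types: `SketchDirectory`, `SketchError`, `CancellableDirectory`, and `merge_segments` are hypothetical names introduced for illustration, and the "merge" simply sums segment sizes.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical stand-in for an error type with a `Cancelled` variant.
#[derive(Debug, PartialEq)]
enum SketchError {
    Cancelled,
}

/// Hypothetical stand-in for a directory trait (not tantivy's `Directory`).
trait SketchDirectory {
    /// Default implementation never requests cancellation,
    /// so existing implementors keep their behavior.
    fn wants_cancel(&self) -> bool {
        false
    }
}

/// Relies on the default: merges are never cancelled.
struct DefaultDirectory;
impl SketchDirectory for DefaultDirectory {}

/// Opts in to cancellation through a shared flag.
struct CancellableDirectory {
    cancel: AtomicBool,
}
impl SketchDirectory for CancellableDirectory {
    fn wants_cancel(&self) -> bool {
        self.cancel.load(Ordering::Relaxed)
    }
}

/// Toy "merge": sums segment sizes, polling the directory before each segment.
/// On cancellation it returns an error that the caller can propagate with `?`.
fn merge_segments(
    dir: &dyn SketchDirectory,
    segment_sizes: &[u64],
) -> Result<u64, SketchError> {
    let mut merged: u64 = 0;
    for &size in segment_sizes {
        if dir.wants_cancel() {
            return Err(SketchError::Cancelled);
        }
        merged += size;
    }
    Ok(merged)
}

fn main() {
    // With the default implementation the merge runs to completion.
    println!("{:?}", merge_segments(&DefaultDirectory, &[10, 20, 30]));

    // With the flag set, the merge stops at the first check without panicking.
    let dir = CancellableDirectory { cancel: AtomicBool::new(true) };
    println!("{:?}", merge_segments(&dir, &[10, 20, 30]));
}
```

Polling a flag at loop boundaries keeps the cancellation point explicit, which is presumably why the commit message stresses that the error is returned "at that point" and propagated up the stack rather than raised as a panic.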

3482 Commits