Commit Graph

3442 Commits

Author SHA1 Message Date
Ming
b0660ba196 chore: make some structs pub (#39) 2025-12-10 10:17:26 -08:00
Eric Ridge
936d6af471 feat: ability to directly merge segments in the foreground (#36)
This adds new public-facing (and internal) APIs for being able to merge a list of segments in the foreground, without using any threads.  It's largely a cut-n-paste of the existing background merge code.

For pg_search, this is beneficial because it allows us to merge directly using our `MVCCDirectory` rather than going through the `ChannelDirectory`, which has quite a bit of overhead.
2025-12-10 10:17:26 -08:00
Eric Ridge
2560de3a01 feat: IndexWriter::wait_merging_threads() return Err on merge failure (#34) 2025-12-10 10:17:26 -08:00
Eric Ridge
75a8384c2b feat: remove Directory::reconsider_merge_policy() and add other niceties to Directory API (#33)
This removes `Directory::reconsider_merge_policy()`.  After reconsidering this, it's better to make this decision ahead of time.

Also adds a `Directory::log(message: &str)` function along with passing a `Directory` reference to `MergePolicy::compute_merge_candidates()`.

It also adds `#[derive(Debug)]` and `#[derive(Serialize)]` to a couple of structs that can benefit.
2025-12-10 10:17:26 -08:00
Eric Ridge
5b6da9123c feat: introduce a MergeOptimizedInvertedIndexReader (#32)
This is probably a bit of a misnomer as it's really a "PgSearchOptimizedInvertedIndexReaderForMerge".

What we've done here is copied `InvertedIndexReader` and internally adjusted it to hold onto the complete `OwnedBytes` of the index's postings and positions.  One or two other small touch points were required to make other internal APIs compatible with this, but they don't otherwise change functionality or I/O patterns.

`MergeOptimizedInvertedIndexReader` does change I/O patterns, however, in that the merge process now does two (potentially) very large reads when it obtains the new "merge optimized inverted index reader" for each segment.  This changes access patterns such that all the reads happen up-front rather than term-by-term as the merge process is solving.

A likely downside to this approach is that now pg_search will be, indirectly, holding onto a lot of heap-allocated memory that was read from its block storage.  Perhaps in the (near) future we can further optimize the new `MergeOptimizedInvertedIndexReader` such that it pages in blocks of a few megabytes at a time, on demand, rather than the whole file.

---

Some unit tests were also updated to resolve compilation problems caused by PR https://github.com/paradedb/tantivy/pull/31 that for some reason didn't show in CI.  #weird
2025-12-10 10:17:26 -08:00
Eric Ridge
8b7db36c99 feat: Add Directory::wants_cancel() function (#31)
This adds a function named `wants_cancel() -> bool` to the `Directory` trait.  It allows a Directory implementation to indicate that it would like Tantivy to cancel an operation.

Right now, querying this function only happens during key points of index merging, but _could_ be used in other places.  Technically, segment merging is the only "black box" in tantivy that users don't otherwise have the direct ability to control.

The default implementation of `wants_cancel()` returns false, so there's no fear of default tantivy spuriously cancelling a merge.

The cancels happen "cleanly" such that if `wants_cancel()` returns true an `Err(TantivyError::Cancelled)` is returned from the calling function at that point, and the error result will be propagated up the stack.  No panics are raised.
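
The cancellation flow described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this commit: the `Directory` trait here is a heavily simplified stand-in, and `MergeError`, `PlainDirectory`, `CancellableDirectory`, and `merge_steps` are hypothetical names invented for the example.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical stand-in for the error type; only a `Cancelled` variant
/// is named in the commit message.
#[derive(Debug, PartialEq)]
pub enum MergeError {
    Cancelled,
}

/// Hypothetical, heavily simplified stand-in for the `Directory` trait.
pub trait Directory {
    /// The default never asks for cancellation, so implementations that
    /// do not override it are never cancelled.
    fn wants_cancel(&self) -> bool {
        false
    }
}

/// A directory that keeps the default behaviour.
pub struct PlainDirectory;
impl Directory for PlainDirectory {}

/// A directory whose owner can request cancellation through a flag.
pub struct CancellableDirectory {
    pub cancel: AtomicBool,
}

impl Directory for CancellableDirectory {
    fn wants_cancel(&self) -> bool {
        self.cancel.load(Ordering::Relaxed)
    }
}

/// Illustrative merge loop: polls the directory at each "key point" and
/// returns an error instead of panicking when cancellation is requested.
pub fn merge_steps(dir: &dyn Directory, steps: usize) -> Result<usize, MergeError> {
    let mut done = 0;
    for _ in 0..steps {
        if dir.wants_cancel() {
            return Err(MergeError::Cancelled);
        }
        done += 1; // stand-in for a unit of real merge work
    }
    Ok(done)
}
```

Because the error is an ordinary `Result`, callers can forward it with `?`, which matches the "propagated up the stack" behaviour the message describes.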
2025-12-10 10:17:26 -08:00
Eric Ridge
eabe589814 feat: ability to assign a panic handler to a Directory (#30)
Tantivy creates thread pools for some of its background work, specifically committing and merging.

If one of the thread workers panics, it's possible that rayon will simply abort the process.  This is terrible for pg_search as that takes down the entire Postgres cluster.

These changes allow a Directory to assign a panic handler that gets called in such cases, which allows pg_search to gracefully roll back the current transaction while presenting the panic message to the user.
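
The general idea of routing a worker panic to a caller-supplied handler can be sketched with the standard library alone. This is an assumption-laden illustration, not the mechanism this commit uses: it relies on `std::panic::catch_unwind` rather than rayon, and `PanicHandler`, `run_guarded`, and `panic_message` are hypothetical names.

```rust
use std::any::Any;
use std::panic::{catch_unwind, AssertUnwindSafe};
use std::sync::Arc;

/// Hypothetical handler type: it receives the raw panic payload.
pub type PanicHandler = Arc<dyn Fn(Box<dyn Any + Send>) + Send + Sync>;

/// Runs `work`. If it panics, the payload is handed to `handler` instead
/// of unwinding further. Returns `true` when `work` finished normally.
pub fn run_guarded<F: FnOnce()>(work: F, handler: &PanicHandler) -> bool {
    match catch_unwind(AssertUnwindSafe(work)) {
        Ok(()) => true,
        Err(payload) => {
            handler(payload);
            false
        }
    }
}

/// Extracts a readable message from a panic payload when it is a string;
/// other payload types fall back to a fixed placeholder.
pub fn panic_message(payload: &(dyn Any + Send)) -> String {
    if let Some(s) = payload.downcast_ref::<&str>() {
        s.to_string()
    } else if let Some(s) = payload.downcast_ref::<String>() {
        s.clone()
    } else {
        "unknown panic".to_string()
    }
}
```

A handler built this way can record the message and let the caller decide how to recover, which is the role the commit message gives to pg_search's transaction rollback.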
2025-12-10 10:17:26 -08:00
Eric Ridge
65d3574dfd feat: make garbage collection opt-out (#28) 2025-12-10 10:17:26 -08:00
Ming
26d623c411 Change default index precision to microseconds (#27) 2025-12-10 10:17:25 -08:00
Eric Ridge
0552dddeb9 feat: delete docs by (SegmentId, DocId) (#26)
This teaches tantivy how to "directly" delete a document in a segment.
    
Our use case from pg_search is that we already know the segment_id and doc_id so it's waaaaay more efficient for us to delete docs through our `ambulkdelete()` routine.

It avoids doing a search, and all the stuff around that, for each of our "ctid" terms that we want to delete.
2025-12-10 10:17:25 -08:00
Eric Ridge
1b88bb61f9 feat: Add ability to construct a SegmentId from raw bytes (#24)
This allows a `SegmentId` to be constructed from a `[u8; 16]` byte array.  

It also adds an `impl Default for SegmentId`, which defaults to all nulls.
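
As a rough sketch of what this looks like from the caller's side, assuming a simplified stand-in type: the `SegmentId` below is modeled as a plain 16-byte wrapper, and the method names `from_bytes` and `as_bytes` are hypothetical and may not match the real API.

```rust
/// Hypothetical stand-in for `SegmentId`, modeled as a 16-byte identifier.
/// Deriving `Default` on a `[u8; 16]` wrapper yields sixteen zero bytes.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct SegmentId([u8; 16]);

impl SegmentId {
    /// Builds an id directly from raw bytes (hypothetical method name).
    pub fn from_bytes(bytes: [u8; 16]) -> Self {
        SegmentId(bytes)
    }

    /// Returns the raw bytes backing this id (hypothetical method name).
    pub fn as_bytes(&self) -> &[u8; 16] {
        &self.0
    }
}
```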
2025-12-10 10:17:25 -08:00
Eric Ridge
16da31cf06 perf: make the footer fixed width (#23)
Prior to this commit, the Footer tantivy serialized at the end of every file included a json blob that could be an arbitrary size.

This changes the Footer to be exactly 24 bytes (6 u32s), making sure to keep the `crc` value.  The other change we make here is to not actually read/validate the footer bytes when opening a file.

From pg_search's perspective, reading and validating the footer is unnecessary Postgres buffer cache I/O that increases index/segment opening overhead, which is something pg_search does often for each query.

Two tests are ignored here as they test physical index files stored here in the repo that this change completely breaks.
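
A fixed-width footer of six `u32` values can be sketched as below. Only the total size (24 bytes) and the presence of a `crc` value come from the message above; the field order, the little-endian encoding, the `reserved` field, and the names `FOOTER_LEN`, `to_bytes`, and `from_file_tail` are assumptions made for illustration.

```rust
/// Size of the fixed-width footer: six u32 values.
pub const FOOTER_LEN: usize = 24;

/// Hypothetical layout. Only `crc` is named in the commit message; the
/// remaining five words are treated as opaque here.
#[derive(Debug, PartialEq, Clone, Copy)]
pub struct Footer {
    pub crc: u32,
    pub reserved: [u32; 5],
}

impl Footer {
    /// Serializes the footer as six little-endian u32 values (assumed encoding).
    pub fn to_bytes(&self) -> [u8; FOOTER_LEN] {
        let mut out = [0u8; FOOTER_LEN];
        out[0..4].copy_from_slice(&self.crc.to_le_bytes());
        for (i, word) in self.reserved.iter().enumerate() {
            let start = 4 + i * 4;
            out[start..start + 4].copy_from_slice(&word.to_le_bytes());
        }
        out
    }

    /// Reads the footer from the last 24 bytes of a file's contents.
    /// Returns `None` when the input is too short to contain a footer.
    pub fn from_file_tail(file: &[u8]) -> Option<Footer> {
        if file.len() < FOOTER_LEN {
            return None;
        }
        let tail = &file[file.len() - FOOTER_LEN..];
        let mut words = [0u32; 6];
        for (i, chunk) in tail.chunks_exact(4).enumerate() {
            words[i] = u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]);
        }
        let mut reserved = [0u32; 5];
        reserved.copy_from_slice(&words[1..]);
        Some(Footer { crc: words[0], reserved })
    }
}
```

Since the length is a constant, a reader can locate the footer with a single offset computation and does not need to parse a variable-length blob first.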
2025-12-10 10:17:25 -08:00
Eric Ridge
658b9b22e0 perf: remove some fast fields loading overhead (#22)
This removes some overhead the profiler exposed.  In the case I was testing, fast fields no longer shows up in the profile at all.

I also renamed `BlockWithLength` to `BlockWithData`
2025-12-10 10:17:25 -08:00
Eric Ridge
95661fba30 perf: teach SegmentReader to lazily open/read its various SegmentComponents (#20)
This overhauls `SegmentReader` to put its various components behind `OnceLock`s such that they can be opened and read on their first use, as opposed to when a SegmentReader is constructed -- which is once for every segment when an Index is opened.

This has a negative impact on some of Tantivy's expectations in that an existing SegmentReader can still read from physical files that were deleted by a merge.  This isn't true now that the segment's physical files aren't opened until needed.  As such, I've `#[ignore]`'d six tests that expose this problem.

From our (pg_search's) side of things, we don't really have physical files and don't need to rely on the filesystem/kernel to allow reading unlinked files that are still open.

Overall, this cuts down a significant number of disk reads during pg_search's query planning.  With my test data it goes from 808 individual reads totalling 999,799 bytes, to 18 reads totalling 814,514 bytes.

This reduces the time it takes to plan a simple query from about 1.4ms to 0.436ms -- roughly a 3.2x improvement.
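
The lazy-open pattern described above can be sketched with `std::sync::OnceLock`. This is a simplified illustration under stated assumptions, not the actual `SegmentReader`: `LazySegmentReader`, its single `postings` component, and the `open_count` counter are hypothetical and exist only to show that the component is opened on first use and at most once.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Hypothetical, simplified reader with one lazily opened component.
pub struct LazySegmentReader {
    /// Counts how many times the component was actually opened.
    opens: AtomicUsize,
    /// Stand-in for an opened segment component.
    postings: OnceLock<Vec<u8>>,
}

impl LazySegmentReader {
    /// Constructing the reader performs no reads at all.
    pub fn new() -> Self {
        LazySegmentReader {
            opens: AtomicUsize::new(0),
            postings: OnceLock::new(),
        }
    }

    /// Opens the component on first access and reuses it afterwards.
    pub fn postings(&self) -> &[u8] {
        self.postings.get_or_init(|| {
            self.opens.fetch_add(1, Ordering::Relaxed);
            vec![1, 2, 3] // stand-in for reading the component from storage
        })
    }

    /// Number of times the component has been opened so far.
    pub fn open_count(&self) -> usize {
        self.opens.load(Ordering::Relaxed)
    }
}
```

The trade-off noted in the message follows directly from this shape: nothing is held open until it is requested, so a component whose backing file has since been removed can no longer be read.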
2025-12-10 10:17:25 -08:00
Philippe Noël
ddd169b77c chore: Don't do codecov (#21) 2025-12-10 10:17:25 -08:00
Eric Ridge
bb4c4b8522 perf: push FileSlices down through most of fast fields (#19)
This PR modifies internal API signatures and implementation details so that `FileSlice`s are passed down into the innards of (at least) the `BlockwiseLinearCodec`.  This allows tantivy to defer dereferencing large slices of bytes when reading numeric fast fields, and instead dereference only the slice of bytes it needs for any given compressed Block.

The motivation here is for external `Directory` implementations where it's not exactly efficient to dereference large slices of bytes.
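
The idea of deferring the read until a specific block is needed can be sketched as follows. This does not use the real `FileSlice` type: `CountingStore` and `LazySlice` are hypothetical stand-ins, and the byte counter exists only to make the deferred read observable.

```rust
use std::cell::Cell;
use std::ops::Range;

/// Hypothetical backing store that counts how many bytes were actually read.
pub struct CountingStore {
    data: Vec<u8>,
    bytes_read: Cell<usize>,
}

impl CountingStore {
    pub fn new(data: Vec<u8>) -> Self {
        CountingStore { data, bytes_read: Cell::new(0) }
    }

    /// Copies the requested range out of the store and records its size.
    fn read(&self, range: Range<usize>) -> Vec<u8> {
        self.bytes_read.set(self.bytes_read.get() + range.len());
        self.data[range].to_vec()
    }

    pub fn bytes_read(&self) -> usize {
        self.bytes_read.get()
    }
}

/// A lazy view over the store: narrowing it is free, and only
/// `read_bytes` touches the underlying data.
pub struct LazySlice<'a> {
    store: &'a CountingStore,
    range: Range<usize>,
}

impl<'a> LazySlice<'a> {
    /// A view covering the whole store, without reading anything.
    pub fn whole(store: &'a CountingStore) -> Self {
        LazySlice { store, range: 0..store.data.len() }
    }

    /// Narrows the view; `sub` is relative to the current view.
    /// Panics if `sub` does not fit inside the current view.
    pub fn slice(&self, sub: Range<usize>) -> LazySlice<'a> {
        let start = self.range.start + sub.start;
        let end = self.range.start + sub.end;
        assert!(start <= end && end <= self.range.end, "sub-range out of bounds");
        LazySlice { store: self.store, range: start..end }
    }

    /// Reads only the bytes covered by this view.
    pub fn read_bytes(&self) -> Vec<u8> {
        self.store.read(self.range.clone())
    }
}
```

Passing the narrow view down to the code that decodes a single block means the store is asked for that block's bytes only, which is the behaviour the message describes for external `Directory` implementations.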
2025-12-10 10:17:25 -08:00
Neil Hansen
ffa558e3a9 fix: tests in ci (#18) 2025-12-10 10:17:25 -08:00
Neil Hansen
a35e3dcb5a suppress warnings after rebase 2025-12-10 10:17:25 -08:00
Neil Hansen
1e3998fbad implement fuzzy scoring in sstable 2025-12-10 10:17:25 -08:00
Neil Hansen
f3df079d6b chore: point tantivy-fst to paradedb fork to fix regex 2025-12-10 10:17:24 -08:00
Ming Ying
f7c0335857 comments 2025-12-10 10:17:24 -08:00
Ming Ying
2584325e0d add reconsider_merge_policy to directory 2025-12-10 10:17:24 -08:00
Eric B. Ridge
1f2c2d0c8a fix compilation warnings on rust v1.83 2025-12-10 10:17:24 -08:00
Eric Ridge
91db6909d1 Add a payload: &mut (dyn Any + '_) argument to Directory::save_meta() (#17) 2025-12-10 10:17:24 -08:00
Ming Ying
7639b47615 small changes to make MVCC work with delete 2025-12-10 10:17:24 -08:00
Ming Ying
8b55f0f355 Make DeleteMeta pub 2025-12-10 10:17:24 -08:00
Ming Ying
8d29f19110 make save_metas provide previous metas 2025-12-10 10:17:24 -08:00
Ming Ying
d742d3277a undo changes to segment_updater.rs 2025-12-10 10:17:24 -08:00
Eric B. Ridge
3afe3714a2 no pgrx, please 2025-12-10 10:17:24 -08:00
Ming Ying
67ea8e53a8 quickwit compiles 2025-12-10 10:17:24 -08:00
Ming Ying
3adc85c017 Directory trait can read/write meta/managed 2025-12-10 10:17:24 -08:00
Ming
6bb3a22c98 expose AddOperation and with_max_doc (#7) 2025-12-10 10:17:23 -08:00
Ming
5503cfb8ef Fix managed paths (#5) 2025-12-10 10:17:23 -08:00
Alexander Alexandrov
ea0e88ae4b feat: implement TokenFilter for Option<F> (#4) 2025-12-10 10:17:23 -08:00
Neil Hansen
dee2dd3f21 Use Levenshtein distance to score documents in fuzzy term queries 2025-12-10 10:17:19 -08:00
Philippe Noël
794ff1ffc9 chore: Make Language hashable (#79) (#2763)
Co-authored-by: Ming <ming.ying.nyc@gmail.com>
2025-12-10 15:38:43 +01:00
PSeitz-dd
c6912ce89a Handle JSON fields and columnar in space_usage (#2761)
return field names in space_usage instead of `Field`
more detailed info for columns
2025-12-10 20:33:33 +08:00
PSeitz
618e3bd11b Term and IndexingTerm cleanup (#2750)
* refactor term

* add deprecated functions

---------

Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
2025-12-05 09:48:40 +08:00
PSeitz
b2f99c6217 add term->histogram benchmark (#2758)
* add term->histogram benchmark

* add more term aggs

---------

Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
2025-12-04 02:29:37 +01:00
PSeitz
76de5bab6f fix unsafe warnings (#2757) 2025-12-03 20:15:21 +08:00
rustmailer
b7eb31162b docs: add usage example to README (#2743) 2025-12-02 21:56:57 +01:00
Paul Masurel
63c66005db Lazy scorers (#2726)
* Refactoring of the score tweaker into `SortKeyComputer`s to unlock two features.

- Allow lazy evaluation of score. As soon as we identified that a doc won't
reach the topK threshold, we can stop the evaluation.
- Allow for a different segment level score, segment level score and their conversion.

This PR breaks public API, but fixing code is straightforward.

* Bumping tantivy version

---------

Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
2025-12-01 15:38:57 +01:00
Paul Masurel
7d513a44c5 Added some benchmark for top K by a fast field (#2754)
Also removed query parsing from the bench code.

Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
2025-12-01 14:58:29 +01:00
Stu Hood
ca87fcd454 Implement collect_block for Collectors which wrap other Collectors (#2727)
* Implement `collect_block` for tuple Collectors, and for MultiCollector.

* Two more.
2025-12-01 12:26:29 +01:00
Ang
08a92675dc Fix typos again (#2753)
Found via `codespell -S benches,stopwords.rs -L womens,parth,abd,childs,ond,ser,ue,mot,hel,atleast,pris,claus,allo`
2025-12-01 12:15:41 +01:00
Raphaël Cohen
f7f4b354d6 fix: Handle phrase prefixed with star (#2751)
Signed-off-by: Darkheir <raphael.cohen@sekoia.io>
2025-12-01 11:43:25 +01:00
Paul Masurel
25d44fcec8 Revert "remove unused columnar api (#2742)" (#2748)
* Revert "remove unused columnar api (#2742)"

This reverts commit 8725594d47.

* Clippy comment + removing fill_vals

---------

Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
2025-11-26 17:44:02 +01:00
PSeitz-dd
842fe9295f split Term in Term and IndexingTerm (#2744)
* split Term in Term and IndexingTerm

* add append_json_path to JsonTermSerializer
2025-11-26 16:48:59 +01:00
Paul Masurel
f88b7200b2 Optimization when posting list are saturated. (#2745)
* Optimization when posting list are saturated.

If a posting list's doc freq is the segment reader's
max_doc, and if scoring does not matter, we can replace it
by an AllScorer.

In turn, in a boolean query, we can dismiss all scorers and
empty scorers, to accelerate the request.

* Added range query optimization

* CR comment

* CR comments

* CR comment

---------

Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
2025-11-26 15:50:57 +01:00
PSeitz-dd
8725594d47 remove unused columnar api (#2742) 2025-11-21 18:07:25 +01:00