A MemoryArena should support allocations up to 4GB and https://github.com/paradedb/tantivy/pull/60 broke this by not accounting for the "max page id" when pages are now 50% the size of what the originally were.
This cleans up the code so things stay in sync if we change NUM_BITS_PAGE_ADDR again and adds a unit test
- Use a bitset to track used buckets in the `SharedArenaHashmap`, allowing for more efficient iteration
- Create a global pool for both `MemoryArena` and `IndexingContext`
- Reduce the MemoryArea page size by half (it's now 512KB instead of 1MB)
- Centralize thread pool instances in `SegmentUpdater` so we can elide making them if all nthread sizes are zero
We would like to be able to lazily load `BitpackedCodec` columns (similar to what 020bdffd61 did for `BlockwiseLinearCodec`), because in the context of `pg_search`, immediately constructing `OwnedBytes` means copying the entire content of the column into memory.
To do so, we expose some (slightly overlapped) block boundaries from `BitUnpacker`, and then lazily load each block when it is requested. Only the `get_val` function uses the cache: `get_row_ids_for_value_range` does not (yet), because it would be necessary to partition the row ids by block, and most of the time consumers using it are already loading reasonably large ranges anyway.
See https://github.com/paradedb/paradedb/pull/2894 for usage. There are a few 2x speedups in the benchmark suite, as well as a 1.8x speedup on a representative customer query. Unfortunately there are also some 13-19% slowdowns on aggregates: it looks like that is because aggregates use `get_vals`, for which the default implementation is to just call `get_val` in a loop.
When a fast fields string/bytes `Dictionary` is opened, we currently read the entire dictionary from `FileSlice` -> `OwnedBytes`... and then immediately wrap it back into a `FileSlice`.
Switching to `Dictionary::open` preserves the `FileSlice`, such that only the portions of the `Dictionary` which are actually accessed are read from disk/buffers.
Not all queries use positions and it's okay if we (from the perspective of `pg_search`, anyways) defer opening/loading them until they're first needed.
This is probably completely wrong for a mmap-based Directory, but we (again, `pg_search`) decided long ago that we don't care about that use case.
This saves a lot of disk I/O when an index has lots of segments and the query doesn't need positions.
As a drive by, make sure a random Vec has enough space before pushing items to it. This showed up in the profiler, believe it or not.
This adds new public-facing (and internal) APIs for being able to merge a list of segments in the foreground, without using any threads. It's largely a cut-n-paste of the existing background merge code.
For pg_search, this is beneficial because it allows us to merge directly using our `MVCCDirectory` rather than going through the `ChannelDirectory`, which has quite a bit of overhead.
This removes `Directory::reconsider_merge_policy()`. After reconsidering this, it's better to make this decision ahead of time.
Also adds a `Directory::log(message: &str)` function along with passing a `Directory` reference to `MergePolicy::compute_merge_candidates()`.
It also hits some `#[derive(Debug)]` and `#[derive(Serialize)]` on a couple of structs that can benefit.
This is probably a bit of a misnomer as it's really a "PgSearchOptimizedInvertedIndexReaderForMerge".
What we've done here is copied `InvertedIndexReader` and internally adjusted it to hold onto the complete `OwnedBytes` of the index's postings and positions. One or two other small touch points were required to make other internal APIs compatabile with this but they don't otherwise change functionality or I/O patterns.
`MergeOptimizedInvertedIndexReader` does change I/O patterns, however, in that the merge process now does two (potentially) very large reads when it obtains the new "merge optimized inverted index reader" for each segment. This changes access patterns such that all the reads happen up-front rather than term-by-term as the merge process is solving.
A likely downside to this approach is that now pg_search will be, indirectly, holding onto a lot of heap-allocated memory that was read from its block storage. Perhaps in the (near) future we can further optimize the new `MergeOptimizedInvertedIndexReader` such that it pages in blocks of a few megabytes at a time, on demand, rather than the whole file.
---
Some unit tests were also updated to resolve compilation problems by PR https://github.com/paradedb/tantivy/pull/31 that for some reason didn't show in CI. #weird
This adds a function named `wants_cancel() -> bool` to the `Directory` trait. It allows a Directory implementation to indicate that it would like Tantivy to cancel an operation.
Right now, querying this function only happens during key points of index merging, but _could_ be used in other places. Technically, segment merging is the only "black box" in tantivy that users don't otherwise have the direct ability to control.
The default implementaiton of `wants_cancel()` returns false, so there's no fear of default tantivy spuriously cancelling a merge.
The cancels happen "cleanly" such that if `wants_cancel()` returns true an `Err(TantivyError::Cancelled)` is returned from the calling function at that point, and the error result will be propogated up the stack. No panics are raised.
Tantivy creates thread pools for some of its background work, specifically committing and merging.
It's possible if one of the thread workers panics that rayon will simply abort the process. This is terrible from pg_search as that takes down the entire Postgres cluster.
These changes allow a Directory to assign a panic handler that gets called in such cases. Which allows pg_search to gracefully rollback the current transaction, while presenting the panic message to the user.
This teaches tantivy how to "directly" delete a document in a segment.
Our use case from pg_search is that we already know the segment_id and doc_id so it's waaaaay more efficient for us to delete docs through our `ambulkdelete()` routine.
It avoids doing a search, and all the stuff around that, for each of our "ctid" terms that we want to delete.
Prior to this commit, the Footer tantivy serialized at the end of every file included a json blob that could be an arbitrary size.
This changes the Footer to be exactly 24 bytes (6 u32s), making sure to keep the `crc` value. The other change we make here is to not actually read/validate the footer bytes when opening a file.
From pg_search's perspective, this is quite unnecessary Postgres buffer cache I/O and increases index/segment opening overhead, which is something pg_search does often for each query.
Two tests are ignored here as they test physical index files stored here in the repo that this change completely breaks.
This removes up some overhead the profiler exposed. In the case I was testing, fast fields no longer shows up in the profile at all.
I also renamed `BlockWithLength` to `BlockWithData`
This overhauls `SegmentReader` to put its various components behind `OnceLock`s such that they can be opened and read on their first use, as oppoed when a SegmentReader is constructed -- which is once for every segment when an Index is opened.
This has a negative impact on some of Tantivy's expectations in that an existing SegementReader can still read from physical files that were deleted by a merge. This isn't true now that the segment's physical files aren't opened until needed. As such, I've `#[ignore]`'d six tests that expose this problem.
From our (pg_search's) side of things, we don't really have physical files and don't need to rely on the filesystem/kernel to allow reading unlinked files that are still open.
Overall, this cuts down a signficiant number of disk reads during pg_search's query planning. With my test data it goes from 808 individual reads totalling 999,799 bytes, to 18 reads totalling 814,514 bytes.
This reduces the time it takes to plan a simple query from about 1.4ms to 0.436ms -- roughly a 3.2x improvement.
This PR modifies internal API signatures and implementation details so that `FileSlice`s are passed down into the innards of (at least) the `BlockwiseLinearCodec`. This allows tantivy to defer dereferencing large slices of bytes when reading numeric fast fields, and instead dereference only the slice of bytes it needs for any given compressed Block.
The motivation here is for external `Directory` implementations where it's not exactly efficient to dereference large slices of bytes.