`MergeOptimizedInvertedIndexReader` was added in #32 in order to avoid making small reads to our underlying `FileHandle`. It did so by reading the entire content of the posting lists and positions at open time.
As that PR says:
> A likely downside to this approach is that now pg_search will be, indirectly, holding onto a lot of heap-allocated memory that was read from its block storage. Perhaps in the (near) future we can further optimize the new `MergeOptimizedInvertedIndexReader` such that it pages in blocks of a few megabytes at a time, on demand, rather than the whole file.
This PR makes that change. But it additionally removes code that was later added in #47 to borrow individual entries rather than creating `OwnedBytes` for them. I believe that this code was added due to a misunderstanding:
`OwnedBytes` is a misnomer: the bytes are not "owned" but immutably borrowed and reference counted. An `OwnedBytes` object can be created for any type which derefs to a slice of bytes, and can be cheaply cloned and sliced. So there is no need to actually borrow _or_ copy the buffer under the `OwnedBytes`. Removing the code that was doing so allows us to safely recreate our buffer without worrying about the lifetimes of buffers that we've handed out.
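To illustrate why cloning and slicing are cheap, here is a minimal std-only sketch of a reference-counted byte view. `SharedBytes` and its methods are hypothetical stand-ins written for this note; they are not the real `OwnedBytes` API, and the real type may differ in its details.

```rust
use std::ops::Range;
use std::sync::Arc;

/// Hypothetical stand-in for `OwnedBytes`: an immutable, reference-counted
/// buffer plus the window of it that this handle exposes.
#[derive(Clone)]
pub struct SharedBytes {
    buf: Arc<[u8]>,
    range: Range<usize>,
}

impl SharedBytes {
    /// Moves the bytes into a shared allocation once; later clones and
    /// slices only bump the reference count.
    pub fn new(data: Vec<u8>) -> Self {
        let len = data.len();
        SharedBytes { buf: Arc::from(data), range: 0..len }
    }

    /// Returns a narrower view relative to this view. No bytes are copied.
    pub fn slice(&self, range: Range<usize>) -> Self {
        assert!(range.start <= range.end && range.end <= self.range.len());
        let start = self.range.start + range.start;
        SharedBytes {
            buf: Arc::clone(&self.buf),
            range: start..start + range.len(),
        }
    }

    /// The bytes visible through this handle.
    pub fn as_slice(&self) -> &[u8] {
        &self.buf[self.range.clone()]
    }

    /// Number of live handles sharing the underlying buffer.
    pub fn handle_count(&self) -> usize {
        Arc::strong_count(&self.buf)
    }
}
```

In this model, a slice handed out earlier keeps the old buffer alive on its own, so the reader is free to drop or replace its handle when it pages in the next block.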
This adds new public-facing (and internal) APIs for being able to merge a list of segments in the foreground, without using any threads. It's largely a cut-n-paste of the existing background merge code.
For pg_search, this is beneficial because it allows us to merge directly using our `MVCCDirectory` rather than going through the `ChannelDirectory`, which has quite a bit of overhead.
This removes `Directory::reconsider_merge_policy()`. After reconsidering this, it's better to make this decision ahead of time.
Also adds a `Directory::log(message: &str)` function along with passing a `Directory` reference to `MergePolicy::compute_merge_candidates()`.
It also adds `#[derive(Debug)]` and `#[derive(Serialize)]` to a couple of structs that benefit from them.
This adds a function named `wants_cancel() -> bool` to the `Directory` trait. It allows a Directory implementation to indicate that it would like Tantivy to cancel an operation.
Right now, querying this function only happens during key points of index merging, but _could_ be used in other places. Technically, segment merging is the only "black box" in tantivy that users don't otherwise have the direct ability to control.
The default implementation of `wants_cancel()` returns false, so there's no fear of default tantivy spuriously cancelling a merge.
The cancels happen "cleanly" such that if `wants_cancel()` returns true, an `Err(TantivyError::Cancelled)` is returned from the calling function at that point, and the error result will be propagated up the stack. No panics are raised.
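The pattern can be sketched in isolation as follows. Apart from the method name `wants_cancel()`, everything here (`CancelAware`, `MergeError`, `CancelAfter`, `merge_steps`) is a hypothetical stand-in for illustration and not the code in this change.

```rust
use std::cell::Cell;

/// Hypothetical error type standing in for `TantivyError::Cancelled`.
#[derive(Debug, PartialEq)]
pub enum MergeError {
    Cancelled,
}

/// Hypothetical slice of the `Directory` trait: the default never cancels.
pub trait CancelAware {
    fn wants_cancel(&self) -> bool {
        false
    }
}

/// Relies on the default implementation, like an unmodified directory.
pub struct NeverCancels;
impl CancelAware for NeverCancels {}

/// Asks for cancellation once `polls_left` polls have been answered.
pub struct CancelAfter {
    pub polls_left: Cell<usize>,
}

impl CancelAware for CancelAfter {
    fn wants_cancel(&self) -> bool {
        let left = self.polls_left.get();
        if left == 0 {
            return true;
        }
        self.polls_left.set(left - 1);
        false
    }
}

/// Polls the directory at each key point and returns an error, rather than
/// panicking, as soon as cancellation is requested.
pub fn merge_steps(dir: &dyn CancelAware, steps: usize) -> Result<usize, MergeError> {
    for _ in 0..steps {
        if dir.wants_cancel() {
            return Err(MergeError::Cancelled);
        }
        // One unit of merge work would happen here.
    }
    Ok(steps)
}
```

Callers can propagate the error with `?`, which is what lets the cancellation travel up the stack without unwinding.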
Tantivy creates thread pools for some of its background work, specifically committing and merging.
If one of the thread workers panics, it's possible that rayon will simply abort the process. This is terrible for pg_search, as that takes down the entire Postgres cluster.
These changes allow a Directory to assign a panic handler that gets called in such cases. This allows pg_search to gracefully roll back the current transaction while presenting the panic message to the user.
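A rough sketch of the idea, using a plain `std::thread` worker and `catch_unwind` instead of the rayon pools that Tantivy actually uses. `PanicHandler` and `run_worker` are hypothetical names chosen for this example; the real hook on `Directory` may look different.

```rust
use std::panic::{self, AssertUnwindSafe};
use std::sync::Arc;
use std::thread;

/// Hypothetical handler type: receives the panic message as a string.
pub type PanicHandler = Arc<dyn Fn(String) + Send + Sync>;

/// Runs `job` on a worker thread. If the job panics, the panic is caught,
/// its message is forwarded to `handler`, and `false` is returned instead
/// of letting the panic escalate.
pub fn run_worker<F>(job: F, handler: PanicHandler) -> bool
where
    F: FnOnce() + Send + 'static,
{
    let worker = thread::spawn(move || {
        match panic::catch_unwind(AssertUnwindSafe(job)) {
            Ok(()) => true,
            Err(payload) => {
                // Panic payloads are usually `&str` or `String`.
                let message = if let Some(s) = payload.downcast_ref::<&str>() {
                    (*s).to_string()
                } else if let Some(s) = payload.downcast_ref::<String>() {
                    s.clone()
                } else {
                    "unknown panic".to_string()
                };
                handler(message);
                false
            }
        }
    });
    // The closure catches its own panics, so `join` is expected to succeed.
    worker.join().unwrap_or(false)
}
```

With a handler in place, the host (here, pg_search) decides how to surface the message, for example by rolling back the current transaction.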
Prior to this commit, the Footer that tantivy serialized at the end of every file included a JSON blob of arbitrary size.
This changes the Footer to be exactly 24 bytes (6 u32s), making sure to keep the `crc` value. The other change we make here is to not actually read/validate the footer bytes when opening a file.
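For illustration, a fixed-size footer of six `u32`s could be written and read as below. Only the total size (24 bytes) and the presence of a `crc` value come from this change; the field order, the little-endian encoding, and the names `FixedFooter` and `other` are assumptions made for this sketch.

```rust
/// Size of the footer in bytes: six u32 values.
pub const FOOTER_LEN: usize = 24;

/// Hypothetical layout: the crc followed by five other u32 fields.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FixedFooter {
    pub crc: u32,
    pub other: [u32; 5],
}

impl FixedFooter {
    /// Serializes the footer into exactly `FOOTER_LEN` bytes (little-endian).
    pub fn to_bytes(&self) -> [u8; FOOTER_LEN] {
        let mut out = [0u8; FOOTER_LEN];
        out[..4].copy_from_slice(&self.crc.to_le_bytes());
        for (i, word) in self.other.iter().enumerate() {
            let start = 4 * (i + 1);
            out[start..start + 4].copy_from_slice(&word.to_le_bytes());
        }
        out
    }

    /// Reads the footer from the last `FOOTER_LEN` bytes of a file image.
    /// Returns `None` if the file is too short to contain a footer.
    pub fn from_file_tail(file: &[u8]) -> Option<Self> {
        let tail_start = file.len().checked_sub(FOOTER_LEN)?;
        let mut words = [0u32; 6];
        for (i, chunk) in file[tail_start..].chunks_exact(4).enumerate() {
            words[i] = u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]);
        }
        let mut other = [0u32; 5];
        other.copy_from_slice(&words[1..]);
        Some(FixedFooter { crc: words[0], other })
    }
}
```

Because the length is constant, the payload boundary is simply `file.len() - FOOTER_LEN`, so a reader can locate the data without touching the footer bytes at all.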
From pg_search's perspective, reading and validating the footer is unnecessary Postgres buffer cache I/O that increases index/segment opening overhead, and opening segments is something pg_search does often for each query.
Two tests are ignored here as they test physical index files stored here in the repo that this change completely breaks.
use usize in bitpacker to enable larger columns in the columnar store
Godbolt comparison with u32 vs u64 for get access: https://godbolt.org/z/cjf7nenYP
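As a small illustration of why a 32-bit index limits column size, consider the bit offset of an entry in a bitpacked array. This is a sketch under the assumption that the offset is computed as `idx * num_bits` and that the target is 64-bit; the helper names are hypothetical and not the bitpacker's actual functions.

```rust
/// Bit offset of entry `idx` when every entry is `num_bits` wide,
/// computed in u32. Returns `None` when the product does not fit.
pub fn bit_offset_u32(idx: u32, num_bits: u32) -> Option<u32> {
    idx.checked_mul(num_bits)
}

/// Same computation in usize, which has far more headroom on 64-bit targets.
pub fn bit_offset_usize(idx: usize, num_bits: usize) -> Option<usize> {
    idx.checked_mul(num_bits)
}
```

With 32-bit-wide values, entry 200,000,000 sits at bit 6,400,000,000, which no longer fits in a `u32` but is fine as a 64-bit `usize`.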
Add a mini-tool to inspect columnar files created by tantivy. (very basic functionality which can be extended later)
* fix windows build (#1)
* Fix windows build
* Add doc traits
* Add field value iter
* Add value and serialization
* Adjust order
* Fix bug
* Correct type
* Fix generic bugs
* Reformat code
* Add generic to index writer which I forgot about
* Fix missing generics on single segment writer
* Add missing type export
* Add default methods for convenience
* Cleanup
* Fix more-like-this query to use standard types
* Update API and fix tests
* Add doc traits
* Add field value iter
* Add value and serialization
* Adjust order
* Fix bug
* Correct type
* Rebase main and fix conflicts
* Reformat code
* Merge upstream
* Fix missing generics on single segment writer
* Add missing type export
* Add default methods for convenience
* Cleanup
* Fix more-like-this query to use standard types
* Update API and fix tests
* Add tokenizer improvements from previous commits
* Add tokenizer improvements from previous commits
* Reformat
* Fix unit tests
* Fix unit tests
* Use enum in changes
* Stage changes
* Add new deserializer logic
* Add serializer integration
* Add document deserializer
* Implement new (de)serialization api for existing types
* Fix bugs and type errors
* Add helper implementations
* Fix errors
* Reformat code
* Add unit tests and some code organisation for serialization
* Add unit tests to deserializer
* Add some small docs
* Add support for deserializing serde values
* Reformat
* Fix typo
* Fix typo
* Change repr of facet
* Remove unused trait methods
* Add child value type
* Resolve comments
* Fix build
* Fix more build errors
* Fix more build errors
* Fix the tests I missed
* Fix examples
* fix numerical order, serialize PreTok Str
* fix coverage
* rename Document to TantivyDocument, rename DocumentAccess to Document
add Binary prefix to binary de/serialization
* fix coverage
---------
Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>
Applied these commands to the code, making it a bit shorter and slightly
more readable.
```
cargo +nightly clippy --all-features --benches --tests --workspace --fix -- -A clippy::all -W clippy::uninlined_format_args
cargo +nightly fmt --all
```
* add memory limit for aggregations
introduce AggregationLimits to set memory consumption limit and bucket limits
memory limit is checked during aggregation, bucket limit is checked before returning the aggregation request.
* Apply suggestions from code review
Co-authored-by: Paul Masurel <paul@quickwit.io>
* add ByteCount with human readable format
---------
Co-authored-by: Paul Masurel <paul@quickwit.io>
* Make nightly Clippy mostly happy.
* Document how to produce TermSetQuery queries using QueryParser.
* Enable construction of queries using FuzzyTermQuery via the QueryParser
* Use FxHashMap instead of HashMap in the QueryParser as these hash tables are not exposed to DoS attacks.
* Use a struct instead of a tuple to improve readability.