mirror of
https://github.com/quickwit-oss/tantivy.git
synced 2026-07-03 07:40:41 +00:00
We ([ParadeDB](https://github.com/paradedb/paradedb)) have restored the removed [index sorting](https://github.com/quickwit-oss/tantivy/issues/2352) feature in our Tantivy fork and have been using it there. Our use case is sorting the index by Postgres' internal `ctid` identifier. Results returned from Tantivy must be checked against Postgres' visibility map, and checking them in `ctid` order is much more cache-friendly, resulting in up to 80% speedups for certain queries.

This PR is split into 5 commits, corresponding to the index sorting reversal plus bug fixes we uncovered during our usage of index sorting.

| Commit | Maps to | What it does |
|---|---|---|
| `2aea0ad9f` | foundation ([#104](https://github.com/paradedb/tantivy/pull/104)) | Restore `SegmentComponent::TempStore` (revert of upstream #2815). Subsumes fork PR [#104](https://github.com/paradedb/tantivy/pull/104)'s CI fix. |
| `9205bcb0c` | [#92](https://github.com/paradedb/tantivy/pull/92) | Restore sort-by-field (single-segment + merge paths). |
| `39c790f0f` | [#101](https://github.com/paradedb/tantivy/pull/101) | Enable `sort_by` for `Str`/`Bytes` fast fields. |
| `9c4341a87` | [#105](https://github.com/paradedb/tantivy/pull/105) | Native typed numeric sort-key comparison (precision/NULL fix). |
| `2d9ba2418` | [#106](https://github.com/paradedb/tantivy/pull/106) | Preserve NULL ordering in numeric segment merges. |

We have discussed this with the Tantivy maintainers, and they indicated they would be open to this PR. Another motivation for landing it is that we are planning to contribute a significant refactor that makes Tantivy's segment components extensible, and landing that without index sorting leads to too many conflicts.
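To make the `ctid`-ordering argument in the description concrete, here is a minimal sketch of why sorted hits are friendlier to page-level checks. It assumes only that a `ctid` is a (heap block number, offset) pair; the `Ctid` struct and the `block_switches` and `sorted_by_ctid` helpers are hypothetical names for illustration and are not part of Tantivy or ParadeDB. The count of block switches is a rough proxy for how often a visibility check has to move to a different heap page, not a measurement of the reported speedup.

```rust
/// Illustrative stand-in for a Postgres tuple identifier:
/// (heap block number, offset within that block).
/// Deriving `Ord` on fields in this order gives (block, offset) ordering.
#[derive(Copy, Clone, Debug, Eq, PartialEq, Ord, PartialOrd)]
pub struct Ctid {
    pub block: u32,
    pub offset: u16,
}

/// Counts how often a sequential pass over `hits` moves to a different
/// heap block than the previous hit. Fewer switches means consecutive
/// checks tend to land on the same page.
pub fn block_switches(hits: &[Ctid]) -> usize {
    hits.windows(2)
        .filter(|pair| pair[0].block != pair[1].block)
        .count()
}

/// Returns the hits ordered by (block, offset), which is the order an
/// index sorted by ctid would naturally produce.
pub fn sorted_by_ctid(mut hits: Vec<Ctid>) -> Vec<Ctid> {
    hits.sort_unstable();
    hits
}
```

With hits scattered across two blocks, the unsorted pass bounces between pages on almost every step, while the sorted pass visits each block once.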
49 lines
1.9 KiB
Rust
use std::slice;

/// Enum describing each component of a tantivy segment.
///
/// Each component is stored in its own file,
/// using the pattern `segment_uuid`.`component_extension`,
/// except the delete component, which uses `segment_uuid`.`delete_opstamp`.`component_extension`.
#[derive(Copy, Clone, Eq, PartialEq)]
pub enum SegmentComponent {
    /// Postings (or inverted list). Sorted lists of document ids, associated with terms
    Postings,
    /// Positions of terms in each document.
    Positions,
    /// Column-oriented random-access storage of fields.
    FastFields,
    /// Stores the sum of the length (in terms) of each field for each document.
    /// Field norms are stored as a special u64 fast field.
    FieldNorms,
    /// Dictionary associating `Term`s to `TermInfo`s which is
    /// simply an address into the `postings` file and the `positions` file.
    Terms,
    /// Row-oriented, compressed storage of the documents.
    /// Accessing a document from the store is relatively slow, as it
    /// requires decompressing the entire block it belongs to.
    Store,
    /// Temporary storage of the documents, before they are streamed to `Store`.
    TempStore,
    /// Bitset describing which documents of the segment are alive.
    /// (It used to represent deleted docs, but has represented alive docs since v0.17.)
    Delete,
}

impl SegmentComponent {
    /// Iterates through the components.
    pub fn iterator() -> slice::Iter<'static, SegmentComponent> {
        static SEGMENT_COMPONENTS: [SegmentComponent; 8] = [
            SegmentComponent::Postings,
            SegmentComponent::Positions,
            SegmentComponent::FastFields,
            SegmentComponent::FieldNorms,
            SegmentComponent::Terms,
            SegmentComponent::Store,
            SegmentComponent::TempStore,
            SegmentComponent::Delete,
        ];
        SEGMENT_COMPONENTS.iter()
    }
}
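The doc comment on `SegmentComponent` describes a file-naming pattern, which the following standalone sketch illustrates. It copies the enum so that it compiles on its own. The `component_file_name` helper, the extension strings, and the fallback to opstamp `0` are assumptions made for illustration; the actual mapping lives elsewhere in the crate and is not shown in this file.

```rust
use std::slice;

/// Local copy of the enum from this file, so the sketch is self-contained.
#[derive(Copy, Clone, Debug, Eq, PartialEq)]
pub enum SegmentComponent {
    Postings,
    Positions,
    FastFields,
    FieldNorms,
    Terms,
    Store,
    TempStore,
    Delete,
}

impl SegmentComponent {
    /// Iterates through the components, in declaration order.
    pub fn iterator() -> slice::Iter<'static, SegmentComponent> {
        static SEGMENT_COMPONENTS: [SegmentComponent; 8] = [
            SegmentComponent::Postings,
            SegmentComponent::Positions,
            SegmentComponent::FastFields,
            SegmentComponent::FieldNorms,
            SegmentComponent::Terms,
            SegmentComponent::Store,
            SegmentComponent::TempStore,
            SegmentComponent::Delete,
        ];
        SEGMENT_COMPONENTS.iter()
    }
}

/// Hypothetical helper: builds `segment_uuid.extension`, or
/// `segment_uuid.delete_opstamp.extension` for the delete component.
/// The extension strings below are illustrative assumptions.
pub fn component_file_name(
    segment_uuid: &str,
    component: SegmentComponent,
    delete_opstamp: Option<u64>,
) -> String {
    let extension = match component {
        SegmentComponent::Postings => "idx",
        SegmentComponent::Positions => "pos",
        SegmentComponent::FastFields => "fast",
        SegmentComponent::FieldNorms => "fieldnorm",
        SegmentComponent::Terms => "term",
        SegmentComponent::Store => "store",
        SegmentComponent::TempStore => "store.temp",
        SegmentComponent::Delete => {
            // Assumption: a missing opstamp falls back to 0.
            let opstamp = delete_opstamp.unwrap_or(0);
            return format!("{}.{}.del", segment_uuid, opstamp);
        }
    };
    format!("{}.{}", segment_uuid, extension)
}
```

Mapping `SegmentComponent::iterator()` through `component_file_name` yields one candidate file name per component, which is one way the static array of eight variants can be used to enumerate the files belonging to a segment.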