Compare commits


1 Commit

Author SHA1 Message Date
dependabot[bot]
ec5748d795 Bump codecov/codecov-action from 3 to 5
Bumps [codecov/codecov-action](https://github.com/codecov/codecov-action) from 3 to 5.
- [Release notes](https://github.com/codecov/codecov-action/releases)
- [Changelog](https://github.com/codecov/codecov-action/blob/main/CHANGELOG.md)
- [Commits](https://github.com/codecov/codecov-action/compare/v3...v5)

---
updated-dependencies:
- dependency-name: codecov/codecov-action
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-11-14 20:12:50 +00:00
78 changed files with 505 additions and 2248 deletions

View File

@@ -21,7 +21,7 @@ jobs:
- name: Generate code coverage - name: Generate code coverage
run: cargo +nightly-2024-07-01 llvm-cov --all-features --workspace --doctests --lcov --output-path lcov.info run: cargo +nightly-2024-07-01 llvm-cov --all-features --workspace --doctests --lcov --output-path lcov.info
- name: Upload coverage to Codecov - name: Upload coverage to Codecov
uses: codecov/codecov-action@v3 uses: codecov/codecov-action@v5
continue-on-error: true continue-on-error: true
with: with:
token: ${{ secrets.CODECOV_TOKEN }} # not required for public repos token: ${{ secrets.CODECOV_TOKEN }} # not required for public repos

View File

@@ -1,14 +1,11 @@
Tantivy 0.23 - Unreleased Tantivy 0.23 - Unreleased
================================ ================================
Tantivy 0.23 will be backwards compatible with indices created with v0.22 and v0.21. The new minimum rust version will be 1.75. Tantivy 0.23 will be backwards compatible with indices created with v0.22 and v0.21.
#### Bugfixes #### Bugfixes
- fix potential endless loop in merge [#2457](https://github.com/quickwit-oss/tantivy/pull/2457)(@PSeitz) - fix potential endless loop in merge [#2457](https://github.com/quickwit-oss/tantivy/pull/2457)(@PSeitz)
- fix bug that causes out-of-order sstable key. [#2445](https://github.com/quickwit-oss/tantivy/pull/2445)(@fulmicoton) - fix bug that causes out-of-order sstable key. [#2445](https://github.com/quickwit-oss/tantivy/pull/2445)(@fulmicoton)
- fix ReferenceValue API flaw [#2372](https://github.com/quickwit-oss/tantivy/pull/2372)(@PSeitz) - fix ReferenceValue API flaw [#2372](https://github.com/quickwit-oss/tantivy/pull/2372)(@PSeitz)
- fix `OwnedBytes` debug panic [#2512](https://github.com/quickwit-oss/tantivy/pull/2512)(@b41sh)
- catch panics during merges [#2582](https://github.com/quickwit-oss/tantivy/pull/2582)(@rdettai)
- switch from u32 to usize in bitpacker. This enables multivalued columns larger than 4GB, which crashed during merge before. [#2581](https://github.com/quickwit-oss/tantivy/pull/2581) [#2586](https://github.com/quickwit-oss/tantivy/pull/2586)(@fulmicoton-dd @PSeitz)
#### Breaking API Changes #### Breaking API Changes
- remove index sorting [#2434](https://github.com/quickwit-oss/tantivy/pull/2434)(@PSeitz) - remove index sorting [#2434](https://github.com/quickwit-oss/tantivy/pull/2434)(@PSeitz)
@@ -26,7 +23,6 @@ Tantivy 0.23 will be backwards compatible with indices created with v0.22 and v0
- reduce top hits memory consumption [#2426](https://github.com/quickwit-oss/tantivy/pull/2426)(@PSeitz) - reduce top hits memory consumption [#2426](https://github.com/quickwit-oss/tantivy/pull/2426)(@PSeitz)
- check unsupported parameters top_hits [#2351](https://github.com/quickwit-oss/tantivy/pull/2351)(@PSeitz) - check unsupported parameters top_hits [#2351](https://github.com/quickwit-oss/tantivy/pull/2351)(@PSeitz)
- Change AggregationLimits to AggregationLimitsGuard [#2495](https://github.com/quickwit-oss/tantivy/pull/2495)(@PSeitz) - Change AggregationLimits to AggregationLimitsGuard [#2495](https://github.com/quickwit-oss/tantivy/pull/2495)(@PSeitz)
- add support for counting non integer in aggregation [#2547](https://github.com/quickwit-oss/tantivy/pull/2547)(@trinity-1686a)
- **Range Queries** - **Range Queries**
- Support fast field range queries on json fields [#2456](https://github.com/quickwit-oss/tantivy/pull/2456)(@PSeitz) - Support fast field range queries on json fields [#2456](https://github.com/quickwit-oss/tantivy/pull/2456)(@PSeitz)
- Add support for str fast field range query [#2460](https://github.com/quickwit-oss/tantivy/pull/2460) [#2452](https://github.com/quickwit-oss/tantivy/pull/2452) [#2453](https://github.com/quickwit-oss/tantivy/pull/2453)(@PSeitz) - Add support for str fast field range query [#2460](https://github.com/quickwit-oss/tantivy/pull/2460) [#2452](https://github.com/quickwit-oss/tantivy/pull/2452) [#2453](https://github.com/quickwit-oss/tantivy/pull/2453)(@PSeitz)
@@ -37,18 +33,9 @@ Tantivy 0.23 will be backwards compatible with indices created with v0.22 and v0
- add columnar format compatibility tests [#2433](https://github.com/quickwit-oss/tantivy/pull/2433)(@PSeitz) - add columnar format compatibility tests [#2433](https://github.com/quickwit-oss/tantivy/pull/2433)(@PSeitz)
- Improved snippet ranges algorithm [#2474](https://github.com/quickwit-oss/tantivy/pull/2474)(@gezihuzi) - Improved snippet ranges algorithm [#2474](https://github.com/quickwit-oss/tantivy/pull/2474)(@gezihuzi)
- make find_field_with_default return json fields without path [#2476](https://github.com/quickwit-oss/tantivy/pull/2476)(@trinity-1686a) - make find_field_with_default return json fields without path [#2476](https://github.com/quickwit-oss/tantivy/pull/2476)(@trinity-1686a)
- Make `BooleanQuery` support `minimum_number_should_match` [#2405](https://github.com/quickwit-oss/tantivy/pull/2405)(@LebranceBW) - feat(query): Make `BooleanQuery` support `minimum_number_should_match` [#2405](https://github.com/quickwit-oss/tantivy/pull/2405)(@LebranceBW)
- Make `NUM_MERGE_THREADS` configurable [#2535](https://github.com/quickwit-oss/tantivy/pull/2535)(@Barre)
- **RegexPhraseQuery** - **Optional Index in Multivalue Columnar Index** For mostly empty multivalued indices there was a large overhead during creation when iterating all docids (merge case). This is alleviated by placing an optional index in the multivalued index to mark documents that have values. This will slightly increase space and access time. [#2439](https://github.com/quickwit-oss/tantivy/pull/2439)(@PSeitz)
`RegexPhraseQuery` supports phrase queries with regex. E.g. query "b.* b.* wolf" matches "big bad wolf". Slop is supported as well: "b.* wolf"~2 matches "big bad wolf" [#2516](https://github.com/quickwit-oss/tantivy/pull/2516)(@PSeitz)
- **Optional Index in Multivalue Columnar Index**
For mostly empty multivalued indices there was a large overhead during creation when iterating all docids (merge case).
This is alleviated by placing an optional index in the multivalued index to mark documents that have values.
This will slightly increase space and access time. [#2439](https://github.com/quickwit-oss/tantivy/pull/2439)(@PSeitz)
- **Store DateTime as nanoseconds in doc store** DateTime in the doc store was truncated to microseconds previously. This removes this truncation, while still keeping backwards compatibility. [#2486](https://github.com/quickwit-oss/tantivy/pull/2486)(@PSeitz)
- **Performance/Memory** - **Performance/Memory**
- lift clauses in LogicalAst for optimized ast during execution [#2449](https://github.com/quickwit-oss/tantivy/pull/2449)(@PSeitz) - lift clauses in LogicalAst for optimized ast during execution [#2449](https://github.com/quickwit-oss/tantivy/pull/2449)(@PSeitz)
@@ -64,21 +51,18 @@ This will slightly increase space and access time. [#2439](https://github.com/qu
- fix de-escaping too much in query parser [#2427](https://github.com/quickwit-oss/tantivy/pull/2427)(@trinity-1686a) - fix de-escaping too much in query parser [#2427](https://github.com/quickwit-oss/tantivy/pull/2427)(@trinity-1686a)
- improve query parser [#2416](https://github.com/quickwit-oss/tantivy/pull/2416)(@trinity-1686a) - improve query parser [#2416](https://github.com/quickwit-oss/tantivy/pull/2416)(@trinity-1686a)
- Support field grouping `title:(return AND "pink panther")` [#2333](https://github.com/quickwit-oss/tantivy/pull/2333)(@trinity-1686a) - Support field grouping `title:(return AND "pink panther")` [#2333](https://github.com/quickwit-oss/tantivy/pull/2333)(@trinity-1686a)
- allow term starting with wildcard [#2568](https://github.com/quickwit-oss/tantivy/pull/2568)(@trinity-1686a)
- Exist queries match subpath fields [#2558](https://github.com/quickwit-oss/tantivy/pull/2558)(@rdettai)
- add access benchmark for columnar [#2432](https://github.com/quickwit-oss/tantivy/pull/2432)(@PSeitz) - add access benchmark for columnar [#2432](https://github.com/quickwit-oss/tantivy/pull/2432)(@PSeitz)
- extend indexwriter proptests [#2342](https://github.com/quickwit-oss/tantivy/pull/2342)(@PSeitz) - extend indexwriter proptests [#2342](https://github.com/quickwit-oss/tantivy/pull/2342)(@PSeitz)
- add bench & test for columnar merging [#2428](https://github.com/quickwit-oss/tantivy/pull/2428)(@PSeitz) - add bench & test for columnar merging [#2428](https://github.com/quickwit-oss/tantivy/pull/2428)(@PSeitz)
- Change in Executor API [#2391](https://github.com/quickwit-oss/tantivy/pull/2391)(@fulmicoton) - Change in Executor API [#2391](https://github.com/quickwit-oss/tantivy/pull/2391)(@fulmicoton)
- Removed usage of num_cpus [#2387](https://github.com/quickwit-oss/tantivy/pull/2387)(@fulmicoton) - Removed usage of num_cpus [#2387](https://github.com/quickwit-oss/tantivy/pull/2387)(@fulmicoton)
- use binggan for agg and stacker benchmark [#2378](https://github.com/quickwit-oss/tantivy/pull/2378)[#2492](https://github.com/quickwit-oss/tantivy/pull/2492)(@PSeitz) - use binggan for agg benchmark [#2378](https://github.com/quickwit-oss/tantivy/pull/2378)(@PSeitz)
- cleanup top level exports [#2382](https://github.com/quickwit-oss/tantivy/pull/2382)(@PSeitz) - cleanup top level exports [#2382](https://github.com/quickwit-oss/tantivy/pull/2382)(@PSeitz)
- make convert_to_fast_value_and_append_to_json_term pub [#2370](https://github.com/quickwit-oss/tantivy/pull/2370)(@PSeitz) - make convert_to_fast_value_and_append_to_json_term pub [#2370](https://github.com/quickwit-oss/tantivy/pull/2370)(@PSeitz)
- remove JsonTermWriter [#2238](https://github.com/quickwit-oss/tantivy/pull/2238)(@PSeitz) - remove JsonTermWriter [#2238](https://github.com/quickwit-oss/tantivy/pull/2238)(@PSeitz)
- validate sort by field type [#2336](https://github.com/quickwit-oss/tantivy/pull/2336)(@PSeitz) - validate sort by field type [#2336](https://github.com/quickwit-oss/tantivy/pull/2336)(@PSeitz)
- Fix trait bound of StoreReader::iter [#2360](https://github.com/quickwit-oss/tantivy/pull/2360)(@adamreichold) - Fix trait bound of StoreReader::iter [#2360](https://github.com/quickwit-oss/tantivy/pull/2360)(@adamreichold)
- remove read_postings_no_deletes [#2526](https://github.com/quickwit-oss/tantivy/pull/2526)(@PSeitz)
Tantivy 0.22 Tantivy 0.22
================================ ================================
@@ -733,7 +717,7 @@ Tantivy 0.4.0
- Raise the limit of number of fields (previously 256 fields) (@fulmicoton) - Raise the limit of number of fields (previously 256 fields) (@fulmicoton)
- Removed u32 fields. They are replaced by u64 and i64 fields (#65) (@fulmicoton) - Removed u32 fields. They are replaced by u64 and i64 fields (#65) (@fulmicoton)
- Optimized skip in SegmentPostings (#130) (@lnicola) - Optimized skip in SegmentPostings (#130) (@lnicola)
- Replacing rustc_serialize by serde. Kudos to @KodrAus and @lnicola - Replacing rustc_serialize by serde. Kudos to @KodrAus and @lnicola
- Using error-chain (@KodrAus) - Using error-chain (@KodrAus)
- QueryParser: (@fulmicoton) - QueryParser: (@fulmicoton)
- Explicit error returned when searched for a term that is not indexed - Explicit error returned when searched for a term that is not indexed

View File

@@ -1,6 +1,6 @@
[package] [package]
name = "tantivy" name = "tantivy"
version = "0.24.2" version = "0.23.0"
authors = ["Paul Masurel <paul.masurel@gmail.com>"] authors = ["Paul Masurel <paul.masurel@gmail.com>"]
license = "MIT" license = "MIT"
categories = ["database-implementations", "data-structures"] categories = ["database-implementations", "data-structures"]
@@ -11,7 +11,7 @@ repository = "https://github.com/quickwit-oss/tantivy"
readme = "README.md" readme = "README.md"
keywords = ["search", "information", "retrieval"] keywords = ["search", "information", "retrieval"]
edition = "2021" edition = "2021"
rust-version = "1.81" rust-version = "1.75"
exclude = ["benches/*.json", "benches/*.txt"] exclude = ["benches/*.json", "benches/*.txt"]
[dependencies] [dependencies]
@@ -31,14 +31,14 @@ lz4_flex = { version = "0.11", default-features = false, optional = true }
zstd = { version = "0.13", optional = true, default-features = false } zstd = { version = "0.13", optional = true, default-features = false }
tempfile = { version = "3.12.0", optional = true } tempfile = { version = "3.12.0", optional = true }
log = "0.4.16" log = "0.4.16"
serde = { version = "1.0.219", features = ["derive"] } serde = { version = "1.0.136", features = ["derive"] }
serde_json = "1.0.140" serde_json = "1.0.79"
fs4 = { version = "0.8.0", optional = true } fs4 = { version = "0.8.0", optional = true }
levenshtein_automata = "0.2.1" levenshtein_automata = "0.2.1"
uuid = { version = "1.0.0", features = ["v4", "serde"] } uuid = { version = "1.0.0", features = ["v4", "serde"] }
crossbeam-channel = "0.5.4" crossbeam-channel = "0.5.4"
rust-stemmers = "1.2.0" rust-stemmers = "1.2.0"
downcast-rs = "2.0.1" downcast-rs = "1.2.1"
bitpacking = { version = "0.9.2", default-features = false, features = [ bitpacking = { version = "0.9.2", default-features = false, features = [
"bitpacker4x", "bitpacker4x",
] } ] }
@@ -52,22 +52,20 @@ smallvec = "1.8.0"
rayon = "1.5.2" rayon = "1.5.2"
lru = "0.12.0" lru = "0.12.0"
fastdivide = "0.4.0" fastdivide = "0.4.0"
itertools = "0.14.0" itertools = "0.13.0"
measure_time = "0.9.0" measure_time = "0.8.2"
arc-swap = "1.5.0" arc-swap = "1.5.0"
bon = "3.3.1"
columnar = { version = "0.5", path = "./columnar", package = "tantivy-columnar" } columnar = { version = "0.3", path = "./columnar", package = "tantivy-columnar" }
sstable = { version = "0.5", path = "./sstable", package = "tantivy-sstable", optional = true } sstable = { version = "0.3", path = "./sstable", package = "tantivy-sstable", optional = true }
stacker = { version = "0.5", path = "./stacker", package = "tantivy-stacker" } stacker = { version = "0.3", path = "./stacker", package = "tantivy-stacker" }
query-grammar = { version = "0.24.0", path = "./query-grammar", package = "tantivy-query-grammar" } query-grammar = { version = "0.22.0", path = "./query-grammar", package = "tantivy-query-grammar" }
tantivy-bitpacker = { version = "0.8", path = "./bitpacker" } tantivy-bitpacker = { version = "0.6", path = "./bitpacker" }
common = { version = "0.9", path = "./common/", package = "tantivy-common" } common = { version = "0.7", path = "./common/", package = "tantivy-common" }
tokenizer-api = { version = "0.5", path = "./tokenizer-api", package = "tantivy-tokenizer-api" } tokenizer-api = { version = "0.3", path = "./tokenizer-api", package = "tantivy-tokenizer-api" }
sketches-ddsketch = { version = "0.3.0", features = ["use_serde"] } sketches-ddsketch = { version = "0.3.0", features = ["use_serde"] }
hyperloglogplus = { version = "0.4.1", features = ["const-loop"] } hyperloglogplus = { version = "0.4.1", features = ["const-loop"] }
futures-util = { version = "0.3.28", optional = true } futures-util = { version = "0.3.28", optional = true }
futures-channel = { version = "0.3.28", optional = true }
fnv = "1.0.7" fnv = "1.0.7"
[target.'cfg(windows)'.dependencies] [target.'cfg(windows)'.dependencies]
@@ -122,7 +120,7 @@ zstd-compression = ["zstd"]
failpoints = ["fail", "fail/failpoints"] failpoints = ["fail", "fail/failpoints"]
unstable = [] # useful for benches. unstable = [] # useful for benches.
quickwit = ["sstable", "futures-util", "futures-channel"] quickwit = ["sstable", "futures-util"]
# Compares only the hash of a string when indexing data. # Compares only the hash of a string when indexing data.
# Increases indexing speed, but may lead to extremely rare missing terms, when there's a hash collision. # Increases indexing speed, but may lead to extremely rare missing terms, when there's a hash collision.

View File

@@ -1,6 +1,6 @@
[package] [package]
name = "tantivy-bitpacker" name = "tantivy-bitpacker"
version = "0.8.0" version = "0.6.0"
edition = "2021" edition = "2021"
authors = ["Paul Masurel <paul.masurel@gmail.com>"] authors = ["Paul Masurel <paul.masurel@gmail.com>"]
license = "MIT" license = "MIT"

View File

@@ -65,7 +65,7 @@ impl BitPacker {
#[derive(Clone, Debug, Default, Copy)] #[derive(Clone, Debug, Default, Copy)]
pub struct BitUnpacker { pub struct BitUnpacker {
num_bits: usize, num_bits: u32,
mask: u64, mask: u64,
} }
@@ -83,7 +83,7 @@ impl BitUnpacker {
(1u64 << num_bits) - 1u64 (1u64 << num_bits) - 1u64
}; };
BitUnpacker { BitUnpacker {
num_bits: usize::from(num_bits), num_bits: u32::from(num_bits),
mask, mask,
} }
} }
@@ -94,14 +94,14 @@ impl BitUnpacker {
#[inline] #[inline]
pub fn get(&self, idx: u32, data: &[u8]) -> u64 { pub fn get(&self, idx: u32, data: &[u8]) -> u64 {
let addr_in_bits = idx as usize * self.num_bits; let addr_in_bits = idx * self.num_bits;
let addr = addr_in_bits >> 3; let addr = (addr_in_bits >> 3) as usize;
if addr + 8 > data.len() { if addr + 8 > data.len() {
if self.num_bits == 0 { if self.num_bits == 0 {
return 0; return 0;
} }
let bit_shift = addr_in_bits & 7; let bit_shift = addr_in_bits & 7;
return self.get_slow_path(addr, bit_shift as u32, data); return self.get_slow_path(addr, bit_shift, data);
} }
let bit_shift = addr_in_bits & 7; let bit_shift = addr_in_bits & 7;
let bytes: [u8; 8] = (&data[addr..addr + 8]).try_into().unwrap(); let bytes: [u8; 8] = (&data[addr..addr + 8]).try_into().unwrap();
@@ -134,13 +134,12 @@ impl BitUnpacker {
"Bitwidth must be <= 32 to use this method." "Bitwidth must be <= 32 to use this method."
); );
let end_idx: u32 = start_idx + output.len() as u32; let end_idx = start_idx + output.len() as u32;
// We use `usize` here to avoid overflow issues. let end_bit_read = end_idx * self.num_bits;
let end_bit_read = (end_idx as usize) * self.num_bits;
let end_byte_read = (end_bit_read + 7) / 8; let end_byte_read = (end_bit_read + 7) / 8;
assert!( assert!(
end_byte_read <= data.len(), end_byte_read as usize <= data.len(),
"Requested index is out of bounds." "Requested index is out of bounds."
); );
@@ -160,24 +159,24 @@ impl BitUnpacker {
// We want the start of the fast track to start aligned with bytes. // We want the start of the fast track to start aligned with bytes.
// A sufficient condition is to start with an idx that is a multiple of 8, // A sufficient condition is to start with an idx that is a multiple of 8,
// so highway start is the closest multiple of 8 that is >= start_idx. // so highway start is the closest multiple of 8 that is >= start_idx.
let entrance_ramp_len: u32 = 8 - (start_idx % 8) % 8; let entrance_ramp_len = 8 - (start_idx % 8) % 8;
let highway_start: u32 = start_idx + entrance_ramp_len; let highway_start: u32 = start_idx + entrance_ramp_len;
if highway_start + (BitPacker1x::BLOCK_LEN as u32) > end_idx { if highway_start + BitPacker1x::BLOCK_LEN as u32 > end_idx {
// We don't have enough values to have even a single block of highway. // We don't have enough values to have even a single block of highway.
// Let's just supply the values the simple way. // Let's just supply the values the simple way.
get_batch_ramp(start_idx, output); get_batch_ramp(start_idx, output);
return; return;
} }
let num_blocks: usize = (end_idx - highway_start) as usize / BitPacker1x::BLOCK_LEN; let num_blocks: u32 = (end_idx - highway_start) / BitPacker1x::BLOCK_LEN as u32;
// Entrance ramp // Entrance ramp
get_batch_ramp(start_idx, &mut output[..entrance_ramp_len as usize]); get_batch_ramp(start_idx, &mut output[..entrance_ramp_len as usize]);
// Highway // Highway
let mut offset = (highway_start as usize * self.num_bits) / 8; let mut offset = (highway_start * self.num_bits) as usize / 8;
let mut output_cursor = (highway_start - start_idx) as usize; let mut output_cursor = (highway_start - start_idx) as usize;
for _ in 0..num_blocks { for _ in 0..num_blocks {
offset += BitPacker1x.decompress( offset += BitPacker1x.decompress(
@@ -189,7 +188,7 @@ impl BitUnpacker {
} }
// Exit ramp // Exit ramp
let highway_end: u32 = highway_start + (num_blocks * BitPacker1x::BLOCK_LEN) as u32; let highway_end = highway_start + num_blocks * BitPacker1x::BLOCK_LEN as u32;
get_batch_ramp(highway_end, &mut output[output_cursor..]); get_batch_ramp(highway_end, &mut output[output_cursor..]);
} }
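The two sides of this hunk differ in how the bit address is computed: one stores `num_bits` as `u32` and casts late, the other widens to `usize` up front so that `idx * num_bits` cannot overflow 32 bits on very large columns. A minimal, self-contained sketch of the fast-path addressing (illustrative names, not tantivy's API; as the hunk shows, the real `get` also has a slow path for reads near the end of the buffer):

```rust
/// Sketch of the bit-address arithmetic in `BitUnpacker::get`.
/// Assumes `num_bits < 64` and that 8 bytes are readable at `addr`
/// (the real implementation falls back to a slow path otherwise).
fn unpack(idx: u32, num_bits: usize, data: &[u8]) -> u64 {
    if num_bits == 0 {
        return 0;
    }
    let mask = (1u64 << num_bits) - 1;
    let addr_in_bits = idx as usize * num_bits; // usize: no 32-bit overflow
    let addr = addr_in_bits >> 3; // byte address of the value's first bit
    let bit_shift = addr_in_bits & 7; // bit offset inside that byte
    let bytes: [u8; 8] = data[addr..addr + 8].try_into().unwrap();
    (u64::from_le_bytes(bytes) >> bit_shift) & mask
}

fn main() {
    // With num_bits = 8 the scheme degenerates to plain bytes, easy to check by eye.
    let data = [3u8, 9, 250, 0, 0, 0, 0, 0, 0, 0, 0, 0];
    assert_eq!(unpack(0, 8, &data), 3);
    assert_eq!(unpack(2, 8, &data), 250);
}
```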

View File

@@ -34,7 +34,7 @@ struct BlockedBitpackerEntryMetaData {
impl BlockedBitpackerEntryMetaData { impl BlockedBitpackerEntryMetaData {
fn new(offset: u64, num_bits: u8, base_value: u64) -> Self { fn new(offset: u64, num_bits: u8, base_value: u64) -> Self {
let encoded = offset | (u64::from(num_bits) << (64 - 8)); let encoded = offset | (num_bits as u64) << (64 - 8);
Self { Self {
encoded, encoded,
base_value, base_value,
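The metadata packs a byte offset and the bit width into a single `u64`, keeping `num_bits` in the top 8 bits. A round-trip sketch of that packing; the decode half is my inference from the encode expression above, not code from this file:

```rust
/// Low 56 bits: offset. High 8 bits: num_bits. Mirrors the encode expression above.
fn encode(offset: u64, num_bits: u8) -> u64 {
    debug_assert!(offset < (1u64 << 56), "offset must fit in 56 bits");
    offset | (u64::from(num_bits) << (64 - 8))
}

/// Inverse of `encode` (an assumption; the real struct exposes accessor methods instead).
fn decode(encoded: u64) -> (u64, u8) {
    let offset = encoded & ((1u64 << 56) - 1);
    let num_bits = (encoded >> 56) as u8;
    (offset, num_bits)
}

fn main() {
    let encoded = encode(1_234_567, 17);
    assert_eq!(decode(encoded), (1_234_567, 17));
}
```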

View File

@@ -16,14 +16,14 @@ body = """
{%- if version %} in {{ version }}{%- endif -%} {%- if version %} in {{ version }}{%- endif -%}
{% for commit in commits %} {% for commit in commits %}
{% if commit.remote.pr_title -%} {% if commit.github.pr_title -%}
{%- set commit_message = commit.remote.pr_title -%} {%- set commit_message = commit.github.pr_title -%}
{%- else -%} {%- else -%}
{%- set commit_message = commit.message -%} {%- set commit_message = commit.message -%}
{%- endif -%} {%- endif -%}
- {{ commit_message | split(pat="\n") | first | trim }}\ - {{ commit_message | split(pat="\n") | first | trim }}\
{% if commit.remote.pr_number %} \ {% if commit.github.pr_number %} \
[#{{ commit.remote.pr_number }}]({{ self::remote_url() }}/pull/{{ commit.remote.pr_number }}){% if commit.remote.username %}(@{{ commit.remote.username }}){%- endif -%} \ [#{{ commit.github.pr_number }}]({{ self::remote_url() }}/pull/{{ commit.github.pr_number }}){% if commit.github.username %}(@{{ commit.github.username }}){%- endif -%} \
{%- endif %} {%- endif %}
{%- endfor -%} {%- endfor -%}

View File

@@ -1,6 +1,6 @@
[package] [package]
name = "tantivy-columnar" name = "tantivy-columnar"
version = "0.5.0" version = "0.3.0"
edition = "2021" edition = "2021"
license = "MIT" license = "MIT"
homepage = "https://github.com/quickwit-oss/tantivy" homepage = "https://github.com/quickwit-oss/tantivy"
@@ -9,15 +9,15 @@ description = "column oriented storage for tantivy"
categories = ["database-implementations", "data-structures", "compression"] categories = ["database-implementations", "data-structures", "compression"]
[dependencies] [dependencies]
itertools = "0.14.0" itertools = "0.13.0"
fastdivide = "0.4.0" fastdivide = "0.4.0"
stacker = { version= "0.5", path = "../stacker", package="tantivy-stacker"} stacker = { version= "0.3", path = "../stacker", package="tantivy-stacker"}
sstable = { version= "0.5", path = "../sstable", package = "tantivy-sstable" } sstable = { version= "0.3", path = "../sstable", package = "tantivy-sstable" }
common = { version= "0.9", path = "../common", package = "tantivy-common" } common = { version= "0.7", path = "../common", package = "tantivy-common" }
tantivy-bitpacker = { version= "0.8", path = "../bitpacker/" } tantivy-bitpacker = { version= "0.6", path = "../bitpacker/" }
serde = "1.0.152" serde = "1.0.152"
downcast-rs = "2.0.1" downcast-rs = "1.2.0"
[dev-dependencies] [dev-dependencies]
proptest = "1" proptest = "1"

View File

@@ -1,18 +0,0 @@
[package]
name = "tantivy-columnar-inspect"
version = "0.1.0"
edition = "2021"
license = "MIT"
[dependencies]
tantivy = {path="../..", package="tantivy"}
columnar = {path="../", package="tantivy-columnar"}
common = {path="../../common", package="tantivy-common"}
[workspace]
members = []
[profile.release]
debug = true
#debug-assertions = true
#overflow-checks = true

View File

@@ -1,54 +0,0 @@
use columnar::ColumnarReader;
use common::file_slice::{FileSlice, WrapFile};
use std::io;
use std::path::Path;
use tantivy::directory::footer::Footer;
fn main() -> io::Result<()> {
println!("Opens a columnar file written by tantivy and validates it.");
let path = std::env::args().nth(1).unwrap();
let path = Path::new(&path);
println!("Reading {:?}", path);
let _reader = open_and_validate_columnar(path.to_str().unwrap())?;
Ok(())
}
pub fn validate_columnar_reader(reader: &ColumnarReader) {
let num_rows = reader.num_rows();
println!("num_rows: {}", num_rows);
let columns = reader.list_columns().unwrap();
println!("num columns: {:?}", columns.len());
for (col_name, dynamic_column_handle) in columns {
let col = dynamic_column_handle.open().unwrap();
match col {
columnar::DynamicColumn::Bool(_)
| columnar::DynamicColumn::I64(_)
| columnar::DynamicColumn::U64(_)
| columnar::DynamicColumn::F64(_)
| columnar::DynamicColumn::IpAddr(_)
| columnar::DynamicColumn::DateTime(_)
| columnar::DynamicColumn::Bytes(_) => {}
columnar::DynamicColumn::Str(str_column) => {
let num_vals = str_column.ords().values.num_vals();
let num_terms_dict = str_column.num_terms() as u64;
let max_ord = str_column.ords().values.iter().max().unwrap_or_default();
println!("{col_name:35} num_vals {num_vals:10} \t num_terms_dict {num_terms_dict:8} max_ord: {max_ord:8}",);
for ord in str_column.ords().values.iter() {
assert!(ord < num_terms_dict);
}
}
}
}
}
/// Opens a columnar file that was written by tantivy and validates it.
pub fn open_and_validate_columnar(path: &str) -> io::Result<ColumnarReader> {
let wrap_file = WrapFile::new(std::fs::File::open(path)?)?;
let slice = FileSlice::new(std::sync::Arc::new(wrap_file));
let (_footer, slice) = Footer::extract_footer(slice.clone()).unwrap();
let reader = ColumnarReader::open(slice).unwrap();
validate_columnar_reader(&reader);
Ok(reader)
}

View File

@@ -139,7 +139,7 @@ mod tests {
missing_docs.push(missing_doc); missing_docs.push(missing_doc);
}); });
assert_eq!(missing_docs, Vec::<u32>::new()); assert_eq!(missing_docs, vec![]);
} }
#[test] #[test]

View File

@@ -58,7 +58,7 @@ struct ShuffledIndex<'a> {
merge_order: &'a ShuffleMergeOrder, merge_order: &'a ShuffleMergeOrder,
} }
impl Iterable<u32> for ShuffledIndex<'_> { impl<'a> Iterable<u32> for ShuffledIndex<'a> {
fn boxed_iter(&self) -> Box<dyn Iterator<Item = u32> + '_> { fn boxed_iter(&self) -> Box<dyn Iterator<Item = u32> + '_> {
Box::new( Box::new(
self.merge_order self.merge_order
@@ -127,7 +127,7 @@ fn integrate_num_vals(num_vals: impl Iterator<Item = u32>) -> impl Iterator<Item
) )
} }
impl Iterable<u32> for ShuffledMultivaluedIndex<'_> { impl<'a> Iterable<u32> for ShuffledMultivaluedIndex<'a> {
fn boxed_iter(&self) -> Box<dyn Iterator<Item = u32> + '_> { fn boxed_iter(&self) -> Box<dyn Iterator<Item = u32> + '_> {
let num_vals_per_row = iter_num_values(self.column_indexes, self.merge_order); let num_vals_per_row = iter_num_values(self.column_indexes, self.merge_order);
Box::new(integrate_num_vals(num_vals_per_row)) Box::new(integrate_num_vals(num_vals_per_row))
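This hunk, and many of the ones below, toggle between two spellings of the same impl header: an explicit lifetime parameter (`impl<'a> Iterable<u32> for ShuffledIndex<'a>`) and the anonymous lifetime (`impl Iterable<u32> for ShuffledIndex<'_>`). The two forms are equivalent; no behavior changes. A small sketch, with `Iterable` re-declared here as an assumption about its shape:

```rust
/// Stand-in for the crate's `Iterable` trait (its exact definition is not shown in the diff).
trait Iterable<T> {
    fn boxed_iter(&self) -> Box<dyn Iterator<Item = T> + '_>;
}

struct ShuffledLike<'a> {
    data: &'a [u32],
}

// Equivalent to `impl<'a> Iterable<u32> for ShuffledLike<'a>`: since the lifetime is only
// mentioned in the self type, it can be elided with `'_`.
impl Iterable<u32> for ShuffledLike<'_> {
    fn boxed_iter(&self) -> Box<dyn Iterator<Item = u32> + '_> {
        Box::new(self.data.iter().copied())
    }
}

fn main() {
    let values = [1u32, 2, 3];
    let shuffled = ShuffledLike { data: &values };
    assert_eq!(shuffled.boxed_iter().sum::<u32>(), 6);
}
```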

View File

@@ -56,7 +56,7 @@ fn get_doc_ids_with_values<'a>(
ColumnIndex::Full => Box::new(doc_range), ColumnIndex::Full => Box::new(doc_range),
ColumnIndex::Optional(optional_index) => Box::new( ColumnIndex::Optional(optional_index) => Box::new(
optional_index optional_index
.iter_docs() .iter_rows()
.map(move |row| row + doc_range.start), .map(move |row| row + doc_range.start),
), ),
ColumnIndex::Multivalued(multivalued_index) => match multivalued_index { ColumnIndex::Multivalued(multivalued_index) => match multivalued_index {
@@ -73,7 +73,7 @@ fn get_doc_ids_with_values<'a>(
MultiValueIndex::MultiValueIndexV2(multivalued_index) => Box::new( MultiValueIndex::MultiValueIndexV2(multivalued_index) => Box::new(
multivalued_index multivalued_index
.optional_index .optional_index
.iter_docs() .iter_rows()
.map(move |row| row + doc_range.start), .map(move |row| row + doc_range.start),
), ),
}, },
@@ -123,7 +123,7 @@ fn get_num_values_iterator<'a>(
} }
} }
impl Iterable<u32> for StackedStartOffsets<'_> { impl<'a> Iterable<u32> for StackedStartOffsets<'a> {
fn boxed_iter(&self) -> Box<dyn Iterator<Item = u32> + '_> { fn boxed_iter(&self) -> Box<dyn Iterator<Item = u32> + '_> {
let num_values_it = (0..self.column_indexes.len()).flat_map(|columnar_id| { let num_values_it = (0..self.column_indexes.len()).flat_map(|columnar_id| {
let num_docs = self.stack_merge_order.columnar_range(columnar_id).len() as u32; let num_docs = self.stack_merge_order.columnar_range(columnar_id).len() as u32;
@@ -177,7 +177,7 @@ impl<'a> Iterable<RowId> for StackedOptionalIndex<'a> {
ColumnIndex::Full => Box::new(columnar_row_range), ColumnIndex::Full => Box::new(columnar_row_range),
ColumnIndex::Optional(optional_index) => Box::new( ColumnIndex::Optional(optional_index) => Box::new(
optional_index optional_index
.iter_docs() .iter_rows()
.map(move |row_id: RowId| columnar_row_range.start + row_id), .map(move |row_id: RowId| columnar_row_range.start + row_id),
), ),
ColumnIndex::Multivalued(_) => { ColumnIndex::Multivalued(_) => {

View File

@@ -80,23 +80,23 @@ impl BlockVariant {
/// index is the block index. For each block `byte_start` and `offset` is computed. /// index is the block index. For each block `byte_start` and `offset` is computed.
#[derive(Clone)] #[derive(Clone)]
pub struct OptionalIndex { pub struct OptionalIndex {
num_docs: RowId, num_rows: RowId,
num_non_null_docs: RowId, num_non_null_rows: RowId,
block_data: OwnedBytes, block_data: OwnedBytes,
block_metas: Arc<[BlockMeta]>, block_metas: Arc<[BlockMeta]>,
} }
impl Iterable<u32> for &OptionalIndex { impl<'a> Iterable<u32> for &'a OptionalIndex {
fn boxed_iter(&self) -> Box<dyn Iterator<Item = u32> + '_> { fn boxed_iter(&self) -> Box<dyn Iterator<Item = u32> + '_> {
Box::new(self.iter_docs()) Box::new(self.iter_rows())
} }
} }
impl std::fmt::Debug for OptionalIndex { impl std::fmt::Debug for OptionalIndex {
fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result { fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
f.debug_struct("OptionalIndex") f.debug_struct("OptionalIndex")
.field("num_docs", &self.num_docs) .field("num_rows", &self.num_rows)
.field("num_non_null_docs", &self.num_non_null_docs) .field("num_non_null_rows", &self.num_non_null_rows)
.finish_non_exhaustive() .finish_non_exhaustive()
} }
} }
@@ -123,7 +123,7 @@ enum BlockSelectCursor<'a> {
Sparse(<SparseBlock<'a> as Set<u16>>::SelectCursor<'a>), Sparse(<SparseBlock<'a> as Set<u16>>::SelectCursor<'a>),
} }
impl BlockSelectCursor<'_> { impl<'a> BlockSelectCursor<'a> {
fn select(&mut self, rank: u16) -> u16 { fn select(&mut self, rank: u16) -> u16 {
match self { match self {
BlockSelectCursor::Dense(dense_select_cursor) => dense_select_cursor.select(rank), BlockSelectCursor::Dense(dense_select_cursor) => dense_select_cursor.select(rank),
@@ -141,7 +141,7 @@ pub struct OptionalIndexSelectCursor<'a> {
num_null_rows_before_block: RowId, num_null_rows_before_block: RowId,
} }
impl OptionalIndexSelectCursor<'_> { impl<'a> OptionalIndexSelectCursor<'a> {
fn search_and_load_block(&mut self, rank: RowId) { fn search_and_load_block(&mut self, rank: RowId) {
if rank < self.current_block_end_rank { if rank < self.current_block_end_rank {
// we are already in the right block // we are already in the right block
@@ -165,7 +165,7 @@ impl OptionalIndexSelectCursor<'_> {
} }
} }
impl SelectCursor<RowId> for OptionalIndexSelectCursor<'_> { impl<'a> SelectCursor<RowId> for OptionalIndexSelectCursor<'a> {
fn select(&mut self, rank: RowId) -> RowId { fn select(&mut self, rank: RowId) -> RowId {
self.search_and_load_block(rank); self.search_and_load_block(rank);
let index_in_block = (rank - self.num_null_rows_before_block) as u16; let index_in_block = (rank - self.num_null_rows_before_block) as u16;
@@ -271,17 +271,17 @@ impl OptionalIndex {
} }
pub fn num_docs(&self) -> RowId { pub fn num_docs(&self) -> RowId {
self.num_docs self.num_rows
} }
pub fn num_non_nulls(&self) -> RowId { pub fn num_non_nulls(&self) -> RowId {
self.num_non_null_docs self.num_non_null_rows
} }
pub fn iter_docs(&self) -> impl Iterator<Item = RowId> + '_ { pub fn iter_rows(&self) -> impl Iterator<Item = RowId> + '_ {
// TODO optimize // TODO optimize
let mut select_batch = self.select_cursor(); let mut select_batch = self.select_cursor();
(0..self.num_non_null_docs).map(move |rank| select_batch.select(rank)) (0..self.num_non_null_rows).map(move |rank| select_batch.select(rank))
} }
pub fn select_batch(&self, ranks: &mut [RowId]) { pub fn select_batch(&self, ranks: &mut [RowId]) {
let mut select_cursor = self.select_cursor(); let mut select_cursor = self.select_cursor();
@@ -505,7 +505,7 @@ fn deserialize_optional_index_block_metadatas(
non_null_rows_before_block += num_non_null_rows; non_null_rows_before_block += num_non_null_rows;
} }
block_metas.resize( block_metas.resize(
num_rows.div_ceil(ELEMENTS_PER_BLOCK) as usize, ((num_rows + ELEMENTS_PER_BLOCK - 1) / ELEMENTS_PER_BLOCK) as usize,
BlockMeta { BlockMeta {
non_null_rows_before_block, non_null_rows_before_block,
start_byte_offset, start_byte_offset,
@@ -519,15 +519,15 @@ pub fn open_optional_index(bytes: OwnedBytes) -> io::Result<OptionalIndex> {
let (mut bytes, num_non_empty_blocks_bytes) = bytes.rsplit(2); let (mut bytes, num_non_empty_blocks_bytes) = bytes.rsplit(2);
let num_non_empty_block_bytes = let num_non_empty_block_bytes =
u16::from_le_bytes(num_non_empty_blocks_bytes.as_slice().try_into().unwrap()); u16::from_le_bytes(num_non_empty_blocks_bytes.as_slice().try_into().unwrap());
let num_docs = VInt::deserialize_u64(&mut bytes)? as u32; let num_rows = VInt::deserialize_u64(&mut bytes)? as u32;
let block_metas_num_bytes = let block_metas_num_bytes =
num_non_empty_block_bytes as usize * SERIALIZED_BLOCK_META_NUM_BYTES; num_non_empty_block_bytes as usize * SERIALIZED_BLOCK_META_NUM_BYTES;
let (block_data, block_metas) = bytes.rsplit(block_metas_num_bytes); let (block_data, block_metas) = bytes.rsplit(block_metas_num_bytes);
let (block_metas, num_non_null_docs) = let (block_metas, num_non_null_rows) =
deserialize_optional_index_block_metadatas(block_metas.as_slice(), num_docs); deserialize_optional_index_block_metadatas(block_metas.as_slice(), num_rows);
let optional_index = OptionalIndex { let optional_index = OptionalIndex {
num_docs, num_rows,
num_non_null_docs, num_non_null_rows,
block_data, block_data,
block_metas: block_metas.into(), block_metas: block_metas.into(),
}; };
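Whatever the field names (`num_docs`/`num_non_null_docs` on one side, `num_rows`/`num_non_null_rows` on the other), the iteration contract is the same: `select(rank)` returns the id of the rank-th document that has a value, and iterating ranks `0..num_non_nulls()` enumerates every such document, exactly as the `iter_docs`/`iter_rows` method above does. A toy model of that contract (a plain sorted vector, not the block-based index in this file):

```rust
/// Toy model of the OptionalIndex select/iter contract.
struct ToyOptionalIndex {
    non_null_docs: Vec<u32>, // sorted ids of documents that carry a value
    num_docs: u32,           // total number of documents, with or without a value
}

impl ToyOptionalIndex {
    fn num_non_nulls(&self) -> u32 {
        self.non_null_docs.len() as u32
    }
    /// Rank -> doc id, the "select" operation.
    fn select(&self, rank: u32) -> u32 {
        self.non_null_docs[rank as usize]
    }
    /// Same shape as the method in the hunk: map every rank to its doc id.
    fn iter_docs(&self) -> impl Iterator<Item = u32> + '_ {
        (0..self.num_non_nulls()).map(move |rank| self.select(rank))
    }
}

fn main() {
    let index = ToyOptionalIndex { non_null_docs: vec![0, 3, 7], num_docs: 10 };
    assert_eq!(index.num_docs, 10);
    assert_eq!(index.num_non_nulls(), 3);
    assert_eq!(index.iter_docs().collect::<Vec<_>>(), vec![0, 3, 7]);
}
```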

View File

@@ -23,6 +23,7 @@ fn set_bit_at(input: &mut u64, n: u16) {
/// ///
/// When translating a dense index to the original index, we can use the offset to find the correct /// When translating a dense index to the original index, we can use the offset to find the correct
/// block. Direct computation is not possible, but we can employ a linear or binary search. /// block. Direct computation is not possible, but we can employ a linear or binary search.
const ELEMENTS_PER_MINI_BLOCK: u16 = 64; const ELEMENTS_PER_MINI_BLOCK: u16 = 64;
const MINI_BLOCK_BITVEC_NUM_BYTES: usize = 8; const MINI_BLOCK_BITVEC_NUM_BYTES: usize = 8;
const MINI_BLOCK_OFFSET_NUM_BYTES: usize = 2; const MINI_BLOCK_OFFSET_NUM_BYTES: usize = 2;
@@ -108,7 +109,7 @@ pub struct DenseBlockSelectCursor<'a> {
dense_block: DenseBlock<'a>, dense_block: DenseBlock<'a>,
} }
impl SelectCursor<u16> for DenseBlockSelectCursor<'_> { impl<'a> SelectCursor<u16> for DenseBlockSelectCursor<'a> {
#[inline] #[inline]
fn select(&mut self, rank: u16) -> u16 { fn select(&mut self, rank: u16) -> u16 {
self.block_id = self self.block_id = self
@@ -174,7 +175,7 @@ impl<'a> Set<u16> for DenseBlock<'a> {
} }
} }
impl DenseBlock<'_> { impl<'a> DenseBlock<'a> {
#[inline] #[inline]
fn mini_block(&self, mini_block_id: u16) -> DenseMiniBlock { fn mini_block(&self, mini_block_id: u16) -> DenseMiniBlock {
let data_start_pos = mini_block_id as usize * MINI_BLOCK_NUM_BYTES; let data_start_pos = mini_block_id as usize * MINI_BLOCK_NUM_BYTES;
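The doc comment above describes the select path: each 64-element mini block stores a small bitvec plus the number of hits before it, and because blocks hold different numbers of hits, the right block is found by a linear or binary search over those offsets rather than by direct computation. A toy sketch of that idea with in-memory structs (the real `DenseBlock` reads mini blocks out of `MINI_BLOCK_BITVEC_NUM_BYTES`/`MINI_BLOCK_OFFSET_NUM_BYTES` slices of raw bytes):

```rust
/// One 64-element mini block: a bitvec plus the count of set bits in all earlier blocks.
struct MiniBlock {
    bitvec: u64,
    offset: u16,
}

/// Return the position of the `rank`-th set bit. Assumes `rank` is in range.
fn select(mini_blocks: &[MiniBlock], rank: u16) -> u16 {
    // Binary search: last mini block whose cumulative offset is <= rank.
    let block_id = mini_blocks.partition_point(|b| b.offset <= rank) - 1;
    let block = &mini_blocks[block_id];
    let mut remaining = rank - block.offset;
    let mut bits = block.bitvec;
    loop {
        let pos = bits.trailing_zeros() as u16;
        if remaining == 0 {
            return block_id as u16 * 64 + pos;
        }
        bits &= bits - 1; // clear the lowest set bit
        remaining -= 1;
    }
}

fn main() {
    // Set positions 1 and 65: one hit in mini block 0, one hit in mini block 1.
    let blocks = [
        MiniBlock { bitvec: 1 << 1, offset: 0 },
        MiniBlock { bitvec: 1 << 1, offset: 1 },
    ];
    assert_eq!(select(&blocks, 0), 1);
    assert_eq!(select(&blocks, 1), 65);
}
```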

View File

@@ -31,7 +31,7 @@ impl<'a> SelectCursor<u16> for SparseBlock<'a> {
} }
} }
impl Set<u16> for SparseBlock<'_> { impl<'a> Set<u16> for SparseBlock<'a> {
type SelectCursor<'b> type SelectCursor<'b>
= Self = Self
where Self: 'b; where Self: 'b;
@@ -69,7 +69,7 @@ fn get_u16(data: &[u8], byte_position: usize) -> u16 {
u16::from_le_bytes(bytes) u16::from_le_bytes(bytes)
} }
impl SparseBlock<'_> { impl<'a> SparseBlock<'a> {
#[inline(always)] #[inline(always)]
fn value_at_idx(&self, data: &[u8], idx: u16) -> u16 { fn value_at_idx(&self, data: &[u8], idx: u16) -> u16 {
let start_offset: usize = idx as usize * 2; let start_offset: usize = idx as usize * 2;

View File

@@ -164,7 +164,7 @@ fn test_optional_index_large() {
fn test_optional_index_iter_aux(row_ids: &[RowId], num_rows: RowId) { fn test_optional_index_iter_aux(row_ids: &[RowId], num_rows: RowId) {
let optional_index = OptionalIndex::for_test(num_rows, row_ids); let optional_index = OptionalIndex::for_test(num_rows, row_ids);
assert_eq!(optional_index.num_docs(), num_rows); assert_eq!(optional_index.num_docs(), num_rows);
assert!(optional_index.iter_docs().eq(row_ids.iter().copied())); assert!(optional_index.iter_rows().eq(row_ids.iter().copied()));
} }
#[test] #[test]

View File

@@ -31,7 +31,7 @@ pub enum SerializableColumnIndex<'a> {
Multivalued(SerializableMultivalueIndex<'a>), Multivalued(SerializableMultivalueIndex<'a>),
} }
impl SerializableColumnIndex<'_> { impl<'a> SerializableColumnIndex<'a> {
pub fn get_cardinality(&self) -> Cardinality { pub fn get_cardinality(&self) -> Cardinality {
match self { match self {
SerializableColumnIndex::Full => Cardinality::Full, SerializableColumnIndex::Full => Cardinality::Full,

View File

@@ -10,7 +10,7 @@ pub(crate) struct MergedColumnValues<'a, T> {
pub(crate) merge_row_order: &'a MergeRowOrder, pub(crate) merge_row_order: &'a MergeRowOrder,
} }
impl<T: Copy + PartialOrd + Debug + 'static> Iterable<T> for MergedColumnValues<'_, T> { impl<'a, T: Copy + PartialOrd + Debug + 'static> Iterable<T> for MergedColumnValues<'a, T> {
fn boxed_iter(&self) -> Box<dyn Iterator<Item = T> + '_> { fn boxed_iter(&self) -> Box<dyn Iterator<Item = T> + '_> {
match self.merge_row_order { match self.merge_row_order {
MergeRowOrder::Stack(_) => Box::new( MergeRowOrder::Stack(_) => Box::new(

View File

@@ -39,7 +39,7 @@ impl BinarySerializable for Block {
} }
fn compute_num_blocks(num_vals: u32) -> u32 { fn compute_num_blocks(num_vals: u32) -> u32 {
num_vals.div_ceil(BLOCK_SIZE) (num_vals + BLOCK_SIZE - 1) / BLOCK_SIZE
} }
pub struct BlockwiseLinearEstimator { pub struct BlockwiseLinearEstimator {
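Both sides of this hunk compute the same ceiling division; `div_ceil` (stable since Rust 1.73) is simply the built-in spelling and also avoids the overflow that `num_vals + BLOCK_SIZE - 1` can hit near `u32::MAX`. A quick equivalence check with an illustrative block size (the real constant lives in this module):

```rust
const BLOCK_SIZE: u32 = 512; // illustrative value

fn num_blocks_manual(num_vals: u32) -> u32 {
    (num_vals + BLOCK_SIZE - 1) / BLOCK_SIZE
}

fn num_blocks_div_ceil(num_vals: u32) -> u32 {
    num_vals.div_ceil(BLOCK_SIZE)
}

fn main() {
    for num_vals in [0, 1, 511, 512, 513, 1024, 100_000] {
        assert_eq!(num_blocks_manual(num_vals), num_blocks_div_ceil(num_vals));
    }
}
```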

View File

@@ -3,7 +3,7 @@ use std::io::{self, Write};
use common::{BitSet, CountingWriter, ReadOnlyBitSet}; use common::{BitSet, CountingWriter, ReadOnlyBitSet};
use sstable::{SSTable, Streamer, TermOrdinal, VoidSSTable}; use sstable::{SSTable, Streamer, TermOrdinal, VoidSSTable};
use super::term_merger::{TermMerger, TermsWithSegmentOrd}; use super::term_merger::TermMerger;
use crate::column::serialize_column_mappable_to_u64; use crate::column::serialize_column_mappable_to_u64;
use crate::column_index::SerializableColumnIndex; use crate::column_index::SerializableColumnIndex;
use crate::iterable::Iterable; use crate::iterable::Iterable;
@@ -39,7 +39,7 @@ struct RemappedTermOrdinalsValues<'a> {
merge_row_order: &'a MergeRowOrder, merge_row_order: &'a MergeRowOrder,
} }
impl Iterable for RemappedTermOrdinalsValues<'_> { impl<'a> Iterable for RemappedTermOrdinalsValues<'a> {
fn boxed_iter(&self) -> Box<dyn Iterator<Item = u64> + '_> { fn boxed_iter(&self) -> Box<dyn Iterator<Item = u64> + '_> {
match self.merge_row_order { match self.merge_row_order {
MergeRowOrder::Stack(_) => self.boxed_iter_stacked(), MergeRowOrder::Stack(_) => self.boxed_iter_stacked(),
@@ -50,7 +50,7 @@ impl Iterable for RemappedTermOrdinalsValues<'_> {
} }
} }
impl RemappedTermOrdinalsValues<'_> { impl<'a> RemappedTermOrdinalsValues<'a> {
fn boxed_iter_stacked(&self) -> Box<dyn Iterator<Item = u64> + '_> { fn boxed_iter_stacked(&self) -> Box<dyn Iterator<Item = u64> + '_> {
let iter = self let iter = self
.bytes_columns .bytes_columns
@@ -126,17 +126,14 @@ fn serialize_merged_dict(
let mut term_ord_mapping = TermOrdinalMapping::default(); let mut term_ord_mapping = TermOrdinalMapping::default();
let mut field_term_streams = Vec::new(); let mut field_term_streams = Vec::new();
for (segment_ord, column_opt) in bytes_columns.iter().enumerate() { for column_opt in bytes_columns.iter() {
if let Some(column) = column_opt { if let Some(column) = column_opt {
term_ord_mapping.add_segment(column.dictionary.num_terms()); term_ord_mapping.add_segment(column.dictionary.num_terms());
let terms: Streamer<VoidSSTable> = column.dictionary.stream()?; let terms: Streamer<VoidSSTable> = column.dictionary.stream()?;
field_term_streams.push(TermsWithSegmentOrd { terms, segment_ord }); field_term_streams.push(terms);
} else { } else {
term_ord_mapping.add_segment(0); term_ord_mapping.add_segment(0);
field_term_streams.push(TermsWithSegmentOrd { field_term_streams.push(Streamer::empty());
terms: Streamer::empty(),
segment_ord,
});
} }
} }
@@ -194,7 +191,6 @@ fn serialize_merged_dict(
#[derive(Default, Debug)] #[derive(Default, Debug)]
struct TermOrdinalMapping { struct TermOrdinalMapping {
/// Contains the new term ordinals for each segment.
per_segment_new_term_ordinals: Vec<Vec<TermOrdinal>>, per_segment_new_term_ordinals: Vec<Vec<TermOrdinal>>,
} }
@@ -209,6 +205,6 @@ impl TermOrdinalMapping {
} }
fn get_segment(&self, segment_ord: u32) -> &[TermOrdinal] { fn get_segment(&self, segment_ord: u32) -> &[TermOrdinal] {
&self.per_segment_new_term_ordinals[segment_ord as usize] &(self.per_segment_new_term_ordinals[segment_ord as usize])[..]
} }
} }

View File

@@ -26,7 +26,7 @@ impl StackMergeOrder {
let mut cumulated_row_ids: Vec<RowId> = Vec::with_capacity(columnars.len()); let mut cumulated_row_ids: Vec<RowId> = Vec::with_capacity(columnars.len());
let mut cumulated_row_id = 0; let mut cumulated_row_id = 0;
for columnar in columnars { for columnar in columnars {
cumulated_row_id += columnar.num_docs(); cumulated_row_id += columnar.num_rows();
cumulated_row_ids.push(cumulated_row_id); cumulated_row_ids.push(cumulated_row_id);
} }
StackMergeOrder { cumulated_row_ids } StackMergeOrder { cumulated_row_ids }
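`StackMergeOrder` records a running total of document counts, so each stacked columnar starts where the previous ones ended. A short sketch of that prefix sum and of how I read the resulting mapping (local row id plus the preceding total gives the global row):

```rust
/// Prefix sum over per-columnar document counts, as built in the loop above.
fn cumulated_row_ids(num_docs_per_columnar: &[u32]) -> Vec<u32> {
    let mut cumulated = Vec::with_capacity(num_docs_per_columnar.len());
    let mut total = 0u32;
    for &num_docs in num_docs_per_columnar {
        total += num_docs;
        cumulated.push(total);
    }
    cumulated
}

fn main() {
    // Three stacked columnars with 10, 0 and 5 documents.
    assert_eq!(cumulated_row_ids(&[10, 0, 5]), vec![10, 10, 15]);
    // Local row 3 of the third columnar then lands at global row 10 + 3 = 13
    // (my reading of the stack order; the type itself only stores the totals).
}
```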

View File

@@ -80,12 +80,13 @@ pub fn merge_columnar(
output: &mut impl io::Write, output: &mut impl io::Write,
) -> io::Result<()> { ) -> io::Result<()> {
let mut serializer = ColumnarSerializer::new(output); let mut serializer = ColumnarSerializer::new(output);
let num_docs_per_columnar = columnar_readers let num_rows_per_columnar = columnar_readers
.iter() .iter()
.map(|reader| reader.num_docs()) .map(|reader| reader.num_rows())
.collect::<Vec<u32>>(); .collect::<Vec<u32>>();
let columns_to_merge = group_columns_for_merge(columnar_readers, required_columns)?; let columns_to_merge =
group_columns_for_merge(columnar_readers, required_columns, &merge_row_order)?;
for res in columns_to_merge { for res in columns_to_merge {
let ((column_name, _column_type_category), grouped_columns) = res; let ((column_name, _column_type_category), grouped_columns) = res;
let grouped_columns = grouped_columns.open(&merge_row_order)?; let grouped_columns = grouped_columns.open(&merge_row_order)?;
@@ -93,18 +94,15 @@ pub fn merge_columnar(
continue; continue;
} }
let column_type_after_merge = grouped_columns.column_type_after_merge(); let column_type = grouped_columns.column_type_after_merge();
let mut columns = grouped_columns.columns; let mut columns = grouped_columns.columns;
// Make sure the number of columns is the same as the number of columnar readers. coerce_columns(column_type, &mut columns)?;
// Or num_docs_per_columnar would be incorrect.
assert_eq!(columns.len(), columnar_readers.len());
coerce_columns(column_type_after_merge, &mut columns)?;
let mut column_serializer = let mut column_serializer =
serializer.start_serialize_column(column_name.as_bytes(), column_type_after_merge); serializer.start_serialize_column(column_name.as_bytes(), column_type);
merge_column( merge_column(
column_type_after_merge, column_type,
&num_docs_per_columnar, &num_rows_per_columnar,
columns, columns,
&merge_row_order, &merge_row_order,
&mut column_serializer, &mut column_serializer,
@@ -130,7 +128,7 @@ fn dynamic_column_to_u64_monotonic(dynamic_column: DynamicColumn) -> Option<Colu
fn merge_column( fn merge_column(
column_type: ColumnType, column_type: ColumnType,
num_docs_per_column: &[u32], num_docs_per_column: &[u32],
columns_to_merge: Vec<Option<DynamicColumn>>, columns: Vec<Option<DynamicColumn>>,
merge_row_order: &MergeRowOrder, merge_row_order: &MergeRowOrder,
wrt: &mut impl io::Write, wrt: &mut impl io::Write,
) -> io::Result<()> { ) -> io::Result<()> {
@@ -140,10 +138,10 @@ fn merge_column(
| ColumnType::F64 | ColumnType::F64
| ColumnType::DateTime | ColumnType::DateTime
| ColumnType::Bool => { | ColumnType::Bool => {
let mut column_indexes: Vec<ColumnIndex> = Vec::with_capacity(columns_to_merge.len()); let mut column_indexes: Vec<ColumnIndex> = Vec::with_capacity(columns.len());
let mut column_values: Vec<Option<Arc<dyn ColumnValues>>> = let mut column_values: Vec<Option<Arc<dyn ColumnValues>>> =
Vec::with_capacity(columns_to_merge.len()); Vec::with_capacity(columns.len());
for (i, dynamic_column_opt) in columns_to_merge.into_iter().enumerate() { for (i, dynamic_column_opt) in columns.into_iter().enumerate() {
if let Some(Column { index: idx, values }) = if let Some(Column { index: idx, values }) =
dynamic_column_opt.and_then(dynamic_column_to_u64_monotonic) dynamic_column_opt.and_then(dynamic_column_to_u64_monotonic)
{ {
@@ -166,10 +164,10 @@ fn merge_column(
serialize_column_mappable_to_u64(merged_column_index, &merge_column_values, wrt)?; serialize_column_mappable_to_u64(merged_column_index, &merge_column_values, wrt)?;
} }
ColumnType::IpAddr => { ColumnType::IpAddr => {
let mut column_indexes: Vec<ColumnIndex> = Vec::with_capacity(columns_to_merge.len()); let mut column_indexes: Vec<ColumnIndex> = Vec::with_capacity(columns.len());
let mut column_values: Vec<Option<Arc<dyn ColumnValues<Ipv6Addr>>>> = let mut column_values: Vec<Option<Arc<dyn ColumnValues<Ipv6Addr>>>> =
Vec::with_capacity(columns_to_merge.len()); Vec::with_capacity(columns.len());
for (i, dynamic_column_opt) in columns_to_merge.into_iter().enumerate() { for (i, dynamic_column_opt) in columns.into_iter().enumerate() {
if let Some(DynamicColumn::IpAddr(Column { index: idx, values })) = if let Some(DynamicColumn::IpAddr(Column { index: idx, values })) =
dynamic_column_opt dynamic_column_opt
{ {
@@ -194,10 +192,9 @@ fn merge_column(
serialize_column_mappable_to_u128(merged_column_index, &merge_column_values, wrt)?; serialize_column_mappable_to_u128(merged_column_index, &merge_column_values, wrt)?;
} }
ColumnType::Bytes | ColumnType::Str => { ColumnType::Bytes | ColumnType::Str => {
let mut column_indexes: Vec<ColumnIndex> = Vec::with_capacity(columns_to_merge.len()); let mut column_indexes: Vec<ColumnIndex> = Vec::with_capacity(columns.len());
let mut bytes_columns: Vec<Option<BytesColumn>> = let mut bytes_columns: Vec<Option<BytesColumn>> = Vec::with_capacity(columns.len());
Vec::with_capacity(columns_to_merge.len()); for (i, dynamic_column_opt) in columns.into_iter().enumerate() {
for (i, dynamic_column_opt) in columns_to_merge.into_iter().enumerate() {
match dynamic_column_opt { match dynamic_column_opt {
Some(DynamicColumn::Str(str_column)) => { Some(DynamicColumn::Str(str_column)) => {
column_indexes.push(str_column.term_ord_column.index.clone()); column_indexes.push(str_column.term_ord_column.index.clone());
@@ -251,7 +248,7 @@ impl GroupedColumns {
if column_type.len() == 1 { if column_type.len() == 1 {
return column_type.into_iter().next().unwrap(); return column_type.into_iter().next().unwrap();
} }
// At the moment, only the numerical column type category has more than one possible // At the moment, only the numerical categorical column type has more than one possible
// column type. // column type.
assert!(self assert!(self
.columns .columns
@@ -364,7 +361,7 @@ fn is_empty_after_merge(
ColumnIndex::Empty { .. } => true, ColumnIndex::Empty { .. } => true,
ColumnIndex::Full => alive_bitset.len() == 0, ColumnIndex::Full => alive_bitset.len() == 0,
ColumnIndex::Optional(optional_index) => { ColumnIndex::Optional(optional_index) => {
for doc in optional_index.iter_docs() { for doc in optional_index.iter_rows() {
if alive_bitset.contains(doc) { if alive_bitset.contains(doc) {
return false; return false;
} }
@@ -394,6 +391,7 @@ fn is_empty_after_merge(
fn group_columns_for_merge<'a>( fn group_columns_for_merge<'a>(
columnar_readers: &'a [&'a ColumnarReader], columnar_readers: &'a [&'a ColumnarReader],
required_columns: &'a [(String, ColumnType)], required_columns: &'a [(String, ColumnType)],
_merge_row_order: &'a MergeRowOrder,
) -> io::Result<BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle>> { ) -> io::Result<BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle>> {
let mut columns: BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle> = BTreeMap::new(); let mut columns: BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle> = BTreeMap::new();

View File

@@ -5,29 +5,28 @@ use sstable::TermOrdinal;
use crate::Streamer; use crate::Streamer;
/// The terms of a column with the ordinal of the segment. pub struct HeapItem<'a> {
pub struct TermsWithSegmentOrd<'a> { pub streamer: Streamer<'a>,
pub terms: Streamer<'a>,
pub segment_ord: usize, pub segment_ord: usize,
} }
impl PartialEq for TermsWithSegmentOrd<'_> { impl<'a> PartialEq for HeapItem<'a> {
fn eq(&self, other: &Self) -> bool { fn eq(&self, other: &Self) -> bool {
self.segment_ord == other.segment_ord self.segment_ord == other.segment_ord
} }
} }
impl Eq for TermsWithSegmentOrd<'_> {} impl<'a> Eq for HeapItem<'a> {}
impl<'a> PartialOrd for TermsWithSegmentOrd<'a> { impl<'a> PartialOrd for HeapItem<'a> {
fn partial_cmp(&self, other: &TermsWithSegmentOrd<'a>) -> Option<Ordering> { fn partial_cmp(&self, other: &HeapItem<'a>) -> Option<Ordering> {
Some(self.cmp(other)) Some(self.cmp(other))
} }
} }
impl<'a> Ord for TermsWithSegmentOrd<'a> { impl<'a> Ord for HeapItem<'a> {
fn cmp(&self, other: &TermsWithSegmentOrd<'a>) -> Ordering { fn cmp(&self, other: &HeapItem<'a>) -> Ordering {
(&other.terms.key(), &other.segment_ord).cmp(&(&self.terms.key(), &self.segment_ord)) (&other.streamer.key(), &other.segment_ord).cmp(&(&self.streamer.key(), &self.segment_ord))
} }
} }
@@ -38,32 +37,39 @@ impl<'a> Ord for TermsWithSegmentOrd<'a> {
/// - the term /// - the term
/// - a slice with the ordinal of the segments containing the terms. /// - a slice with the ordinal of the segments containing the terms.
pub struct TermMerger<'a> { pub struct TermMerger<'a> {
heap: BinaryHeap<TermsWithSegmentOrd<'a>>, heap: BinaryHeap<HeapItem<'a>>,
term_streams_with_segment: Vec<TermsWithSegmentOrd<'a>>, current_streamers: Vec<HeapItem<'a>>,
} }
impl<'a> TermMerger<'a> { impl<'a> TermMerger<'a> {
/// Stream of merged term dictionary /// Stream of merged term dictionary
pub fn new(term_streams_with_segment: Vec<TermsWithSegmentOrd<'a>>) -> TermMerger<'a> { pub fn new(streams: Vec<Streamer<'a>>) -> TermMerger<'a> {
TermMerger { TermMerger {
heap: BinaryHeap::new(), heap: BinaryHeap::new(),
term_streams_with_segment, current_streamers: streams
.into_iter()
.enumerate()
.map(|(ord, streamer)| HeapItem {
streamer,
segment_ord: ord,
})
.collect(),
} }
} }
pub(crate) fn matching_segments<'b: 'a>( pub(crate) fn matching_segments<'b: 'a>(
&'b self, &'b self,
) -> impl 'b + Iterator<Item = (usize, TermOrdinal)> { ) -> impl 'b + Iterator<Item = (usize, TermOrdinal)> {
self.term_streams_with_segment self.current_streamers
.iter() .iter()
.map(|heap_item| (heap_item.segment_ord, heap_item.terms.term_ord())) .map(|heap_item| (heap_item.segment_ord, heap_item.streamer.term_ord()))
} }
fn advance_segments(&mut self) { fn advance_segments(&mut self) {
let streamers = &mut self.term_streams_with_segment; let streamers = &mut self.current_streamers;
let heap = &mut self.heap; let heap = &mut self.heap;
for mut heap_item in streamers.drain(..) { for mut heap_item in streamers.drain(..) {
if heap_item.terms.advance() { if heap_item.streamer.advance() {
heap.push(heap_item); heap.push(heap_item);
} }
} }
@@ -75,13 +81,13 @@ impl<'a> TermMerger<'a> {
pub fn advance(&mut self) -> bool { pub fn advance(&mut self) -> bool {
self.advance_segments(); self.advance_segments();
if let Some(head) = self.heap.pop() { if let Some(head) = self.heap.pop() {
self.term_streams_with_segment.push(head); self.current_streamers.push(head);
while let Some(next_streamer) = self.heap.peek() { while let Some(next_streamer) = self.heap.peek() {
if self.term_streams_with_segment[0].terms.key() != next_streamer.terms.key() { if self.current_streamers[0].streamer.key() != next_streamer.streamer.key() {
break; break;
} }
let next_heap_it = self.heap.pop().unwrap(); // safe : we peeked beforehand let next_heap_it = self.heap.pop().unwrap(); // safe : we peeked beforehand
self.term_streams_with_segment.push(next_heap_it); self.current_streamers.push(next_heap_it);
} }
true true
} else { } else {
@@ -95,6 +101,6 @@ impl<'a> TermMerger<'a> {
/// if and only if advance() has been called before /// if and only if advance() has been called before
/// and "true" was returned. /// and "true" was returned.
pub fn key(&self) -> &[u8] { pub fn key(&self) -> &[u8] {
self.term_streams_with_segment[0].terms.key() self.current_streamers[0].streamer.key()
} }
} }
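Both versions of `TermMerger` implement the same k-way merge: every segment's term stream sits in a `BinaryHeap` whose `Ord` is written in reverse (comparing `other` against `self`, as in the hunk above) so the max-heap pops the smallest key first, and `advance()` gathers every stream currently positioned on that key. A self-contained sketch of the pattern over plain sorted slices, using `std::cmp::Reverse` instead of a hand-written reversed `Ord` (the real merger streams sstable terms and also tracks term ordinals):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// K-way merge of sorted per-segment term lists, grouping the segment ordinals
/// that contain each term.
fn merged_terms(segments: &[&[&str]]) -> Vec<(String, Vec<usize>)> {
    // Heap entries: (term, segment ordinal, position in that segment), min-heap via Reverse.
    let mut heap: BinaryHeap<Reverse<(&str, usize, usize)>> = segments
        .iter()
        .enumerate()
        .filter_map(|(ord, terms)| terms.first().map(|&term| Reverse((term, ord, 0))))
        .collect();

    let mut merged: Vec<(String, Vec<usize>)> = Vec::new();
    while let Some(Reverse((term, ord, pos))) = heap.pop() {
        // Group identical keys coming from different segments, as advance() does.
        match merged.last_mut() {
            Some((last_term, ords)) if last_term.as_str() == term => ords.push(ord),
            _ => merged.push((term.to_string(), vec![ord])),
        }
        // Re-insert the stream that was consumed, positioned on its next term.
        if let Some(&next) = segments[ord].get(pos + 1) {
            heap.push(Reverse((next, ord, pos + 1)));
        }
    }
    merged
}

fn main() {
    let merged = merged_terms(&[&["apple", "wolf"], &["apple", "zebra"]]);
    assert_eq!(
        merged,
        vec![
            ("apple".to_string(), vec![0, 1]),
            ("wolf".to_string(), vec![0]),
            ("zebra".to_string(), vec![1]),
        ]
    );
}
```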

View File

@@ -1,10 +1,7 @@
use itertools::Itertools; use itertools::Itertools;
use proptest::collection::vec;
use proptest::prelude::*;
use super::*; use super::*;
use crate::columnar::{merge_columnar, ColumnarReader, MergeRowOrder, StackMergeOrder}; use crate::{Cardinality, ColumnarWriter, HasAssociatedColumnType, RowId};
use crate::{Cardinality, ColumnarWriter, DynamicColumn, HasAssociatedColumnType, RowId};
fn make_columnar<T: Into<NumericalValue> + HasAssociatedColumnType + Copy>( fn make_columnar<T: Into<NumericalValue> + HasAssociatedColumnType + Copy>(
column_name: &str, column_name: &str,
@@ -29,8 +26,9 @@ fn test_column_coercion_to_u64() {
// u64 type // u64 type
let columnar2 = make_columnar("numbers", &[u64::MAX]); let columnar2 = make_columnar("numbers", &[u64::MAX]);
let columnars = &[&columnar1, &columnar2]; let columnars = &[&columnar1, &columnar2];
let merge_order = StackMergeOrder::stack(columnars).into();
let column_map: BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle> = let column_map: BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle> =
group_columns_for_merge(columnars, &[]).unwrap(); group_columns_for_merge(columnars, &[], &merge_order).unwrap();
assert_eq!(column_map.len(), 1); assert_eq!(column_map.len(), 1);
assert!(column_map.contains_key(&("numbers".to_string(), ColumnTypeCategory::Numerical))); assert!(column_map.contains_key(&("numbers".to_string(), ColumnTypeCategory::Numerical)));
} }
@@ -40,8 +38,9 @@ fn test_column_coercion_to_i64() {
let columnar1 = make_columnar("numbers", &[-1i64]); let columnar1 = make_columnar("numbers", &[-1i64]);
let columnar2 = make_columnar("numbers", &[2u64]); let columnar2 = make_columnar("numbers", &[2u64]);
let columnars = &[&columnar1, &columnar2]; let columnars = &[&columnar1, &columnar2];
let merge_order = StackMergeOrder::stack(columnars).into();
let column_map: BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle> = let column_map: BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle> =
group_columns_for_merge(columnars, &[]).unwrap(); group_columns_for_merge(columnars, &[], &merge_order).unwrap();
assert_eq!(column_map.len(), 1); assert_eq!(column_map.len(), 1);
assert!(column_map.contains_key(&("numbers".to_string(), ColumnTypeCategory::Numerical))); assert!(column_map.contains_key(&("numbers".to_string(), ColumnTypeCategory::Numerical)));
} }
@@ -64,8 +63,14 @@ fn test_group_columns_with_required_column() {
let columnar1 = make_columnar("numbers", &[1i64]); let columnar1 = make_columnar("numbers", &[1i64]);
let columnar2 = make_columnar("numbers", &[2u64]); let columnar2 = make_columnar("numbers", &[2u64]);
let columnars = &[&columnar1, &columnar2]; let columnars = &[&columnar1, &columnar2];
let merge_order = StackMergeOrder::stack(columnars).into();
let column_map: BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle> = let column_map: BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle> =
group_columns_for_merge(columnars, &[("numbers".to_string(), ColumnType::U64)]).unwrap(); group_columns_for_merge(
&[&columnar1, &columnar2],
&[("numbers".to_string(), ColumnType::U64)],
&merge_order,
)
.unwrap();
assert_eq!(column_map.len(), 1); assert_eq!(column_map.len(), 1);
assert!(column_map.contains_key(&("numbers".to_string(), ColumnTypeCategory::Numerical))); assert!(column_map.contains_key(&("numbers".to_string(), ColumnTypeCategory::Numerical)));
} }
@@ -75,9 +80,13 @@ fn test_group_columns_required_column_with_no_existing_columns() {
let columnar1 = make_columnar("numbers", &[2u64]); let columnar1 = make_columnar("numbers", &[2u64]);
let columnar2 = make_columnar("numbers", &[2u64]); let columnar2 = make_columnar("numbers", &[2u64]);
let columnars = &[&columnar1, &columnar2]; let columnars = &[&columnar1, &columnar2];
let column_map: BTreeMap<_, _> = let merge_order = StackMergeOrder::stack(columnars).into();
group_columns_for_merge(columnars, &[("required_col".to_string(), ColumnType::Str)]) let column_map: BTreeMap<_, _> = group_columns_for_merge(
.unwrap(); columnars,
&[("required_col".to_string(), ColumnType::Str)],
&merge_order,
)
.unwrap();
assert_eq!(column_map.len(), 2); assert_eq!(column_map.len(), 2);
let columns = &column_map let columns = &column_map
.get(&("required_col".to_string(), ColumnTypeCategory::Str)) .get(&("required_col".to_string(), ColumnTypeCategory::Str))
@@ -93,8 +102,14 @@ fn test_group_columns_required_column_is_above_all_columns_have_the_same_type_ru
let columnar1 = make_columnar("numbers", &[2i64]); let columnar1 = make_columnar("numbers", &[2i64]);
let columnar2 = make_columnar("numbers", &[2i64]); let columnar2 = make_columnar("numbers", &[2i64]);
let columnars = &[&columnar1, &columnar2]; let columnars = &[&columnar1, &columnar2];
let merge_order = StackMergeOrder::stack(columnars).into();
let column_map: BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle> = let column_map: BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle> =
group_columns_for_merge(columnars, &[("numbers".to_string(), ColumnType::U64)]).unwrap(); group_columns_for_merge(
columnars,
&[("numbers".to_string(), ColumnType::U64)],
&merge_order,
)
.unwrap();
assert_eq!(column_map.len(), 1); assert_eq!(column_map.len(), 1);
assert!(column_map.contains_key(&("numbers".to_string(), ColumnTypeCategory::Numerical))); assert!(column_map.contains_key(&("numbers".to_string(), ColumnTypeCategory::Numerical)));
} }
@@ -104,8 +119,9 @@ fn test_missing_column() {
let columnar1 = make_columnar("numbers", &[-1i64]); let columnar1 = make_columnar("numbers", &[-1i64]);
let columnar2 = make_columnar("numbers2", &[2u64]); let columnar2 = make_columnar("numbers2", &[2u64]);
let columnars = &[&columnar1, &columnar2]; let columnars = &[&columnar1, &columnar2];
let merge_order = StackMergeOrder::stack(columnars).into();
let column_map: BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle> = let column_map: BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle> =
group_columns_for_merge(columnars, &[]).unwrap(); group_columns_for_merge(columnars, &[], &merge_order).unwrap();
assert_eq!(column_map.len(), 2); assert_eq!(column_map.len(), 2);
assert!(column_map.contains_key(&("numbers".to_string(), ColumnTypeCategory::Numerical))); assert!(column_map.contains_key(&("numbers".to_string(), ColumnTypeCategory::Numerical)));
{ {
@@ -208,7 +224,7 @@ fn test_merge_columnar_numbers() {
) )
.unwrap(); .unwrap();
let columnar_reader = ColumnarReader::open(buffer).unwrap(); let columnar_reader = ColumnarReader::open(buffer).unwrap();
assert_eq!(columnar_reader.num_docs(), 3); assert_eq!(columnar_reader.num_rows(), 3);
assert_eq!(columnar_reader.num_columns(), 1); assert_eq!(columnar_reader.num_columns(), 1);
let cols = columnar_reader.read_columns("numbers").unwrap(); let cols = columnar_reader.read_columns("numbers").unwrap();
let dynamic_column = cols[0].open().unwrap(); let dynamic_column = cols[0].open().unwrap();
@@ -236,7 +252,7 @@ fn test_merge_columnar_texts() {
) )
.unwrap(); .unwrap();
let columnar_reader = ColumnarReader::open(buffer).unwrap(); let columnar_reader = ColumnarReader::open(buffer).unwrap();
assert_eq!(columnar_reader.num_docs(), 3); assert_eq!(columnar_reader.num_rows(), 3);
assert_eq!(columnar_reader.num_columns(), 1); assert_eq!(columnar_reader.num_columns(), 1);
let cols = columnar_reader.read_columns("texts").unwrap(); let cols = columnar_reader.read_columns("texts").unwrap();
let dynamic_column = cols[0].open().unwrap(); let dynamic_column = cols[0].open().unwrap();
@@ -285,7 +301,7 @@ fn test_merge_columnar_byte() {
) )
.unwrap(); .unwrap();
let columnar_reader = ColumnarReader::open(buffer).unwrap(); let columnar_reader = ColumnarReader::open(buffer).unwrap();
assert_eq!(columnar_reader.num_docs(), 4); assert_eq!(columnar_reader.num_rows(), 4);
assert_eq!(columnar_reader.num_columns(), 1); assert_eq!(columnar_reader.num_columns(), 1);
let cols = columnar_reader.read_columns("bytes").unwrap(); let cols = columnar_reader.read_columns("bytes").unwrap();
let dynamic_column = cols[0].open().unwrap(); let dynamic_column = cols[0].open().unwrap();
@@ -341,7 +357,7 @@ fn test_merge_columnar_byte_with_missing() {
) )
.unwrap(); .unwrap();
let columnar_reader = ColumnarReader::open(buffer).unwrap(); let columnar_reader = ColumnarReader::open(buffer).unwrap();
assert_eq!(columnar_reader.num_docs(), 3 + 2 + 3); assert_eq!(columnar_reader.num_rows(), 3 + 2 + 3);
assert_eq!(columnar_reader.num_columns(), 2); assert_eq!(columnar_reader.num_columns(), 2);
let cols = columnar_reader.read_columns("col").unwrap(); let cols = columnar_reader.read_columns("col").unwrap();
let dynamic_column = cols[0].open().unwrap(); let dynamic_column = cols[0].open().unwrap();
@@ -393,7 +409,7 @@ fn test_merge_columnar_different_types() {
) )
.unwrap(); .unwrap();
let columnar_reader = ColumnarReader::open(buffer).unwrap(); let columnar_reader = ColumnarReader::open(buffer).unwrap();
assert_eq!(columnar_reader.num_docs(), 4); assert_eq!(columnar_reader.num_rows(), 4);
assert_eq!(columnar_reader.num_columns(), 2); assert_eq!(columnar_reader.num_columns(), 2);
let cols = columnar_reader.read_columns("mixed").unwrap(); let cols = columnar_reader.read_columns("mixed").unwrap();
@@ -403,11 +419,11 @@ fn test_merge_columnar_different_types() {
panic!() panic!()
}; };
assert_eq!(vals.get_cardinality(), Cardinality::Optional); assert_eq!(vals.get_cardinality(), Cardinality::Optional);
assert_eq!(vals.values_for_doc(0).collect_vec(), Vec::<i64>::new()); assert_eq!(vals.values_for_doc(0).collect_vec(), vec![]);
assert_eq!(vals.values_for_doc(1).collect_vec(), Vec::<i64>::new()); assert_eq!(vals.values_for_doc(1).collect_vec(), vec![]);
assert_eq!(vals.values_for_doc(2).collect_vec(), Vec::<i64>::new()); assert_eq!(vals.values_for_doc(2).collect_vec(), vec![]);
assert_eq!(vals.values_for_doc(3).collect_vec(), vec![1]); assert_eq!(vals.values_for_doc(3).collect_vec(), vec![1]);
assert_eq!(vals.values_for_doc(4).collect_vec(), Vec::<i64>::new()); assert_eq!(vals.values_for_doc(4).collect_vec(), vec![]);
// text column // text column
let dynamic_column = cols[1].open().unwrap(); let dynamic_column = cols[1].open().unwrap();
@@ -458,7 +474,7 @@ fn test_merge_columnar_different_empty_cardinality() {
) )
.unwrap(); .unwrap();
let columnar_reader = ColumnarReader::open(buffer).unwrap(); let columnar_reader = ColumnarReader::open(buffer).unwrap();
assert_eq!(columnar_reader.num_docs(), 2); assert_eq!(columnar_reader.num_rows(), 2);
assert_eq!(columnar_reader.num_columns(), 2); assert_eq!(columnar_reader.num_columns(), 2);
let cols = columnar_reader.read_columns("mixed").unwrap(); let cols = columnar_reader.read_columns("mixed").unwrap();
@@ -470,119 +486,3 @@ fn test_merge_columnar_different_empty_cardinality() {
let dynamic_column = cols[1].open().unwrap(); let dynamic_column = cols[1].open().unwrap();
assert_eq!(dynamic_column.get_cardinality(), Cardinality::Optional); assert_eq!(dynamic_column.get_cardinality(), Cardinality::Optional);
} }
#[derive(Debug, Clone)]
struct ColumnSpec {
column_name: String,
/// (row_id, term)
terms: Vec<(RowId, Vec<u8>)>,
}
#[derive(Clone, Debug)]
struct ColumnarSpec {
columns: Vec<ColumnSpec>,
}
/// Generate a random (row_id, term) pair:
/// - row_id in [0..10]
/// - term is either from POSSIBLE_TERMS or random bytes
fn rowid_and_term_strategy() -> impl Strategy<Value = (RowId, Vec<u8>)> {
const POSSIBLE_TERMS: &[&[u8]] = &[b"a", b"b", b"allo"];
let term_strat = prop_oneof![
// pick from the fixed list
(0..POSSIBLE_TERMS.len()).prop_map(|i| POSSIBLE_TERMS[i].to_vec()),
// or random bytes (length 0..10)
prop::collection::vec(any::<u8>(), 0..10),
];
(0u32..11, term_strat)
}
/// Generate one ColumnSpec, with a random name and a random list of (row_id, term).
/// We sort it by row_id so that data is in ascending order.
fn column_spec_strategy() -> impl Strategy<Value = ColumnSpec> {
let column_name = prop_oneof![
Just("col".to_string()),
Just("col2".to_string()),
"col.*".prop_map(|s| s),
];
// We'll produce 0..8 (rowid,term) entries for this column
let data_strat = vec(rowid_and_term_strategy(), 0..8).prop_map(|mut pairs| {
// Sort by row_id
pairs.sort_by_key(|(row_id, _)| *row_id);
pairs
});
(column_name, data_strat).prop_map(|(name, data)| ColumnSpec {
column_name: name,
terms: data,
})
}
/// Strategy to generate a ColumnarSpec
fn columnar_strategy() -> impl Strategy<Value = ColumnarSpec> {
vec(column_spec_strategy(), 0..3).prop_map(|columns| ColumnarSpec { columns })
}
/// Strategy to generate multiple ColumnarSpecs, each of which we will treat
/// as one "columnar" to be merged together.
fn columnars_strategy() -> impl Strategy<Value = Vec<ColumnarSpec>> {
vec(columnar_strategy(), 1..4)
}
/// Build a `ColumnarReader` from a `ColumnarSpec`
fn build_columnar(spec: &ColumnarSpec) -> ColumnarReader {
let mut writer = ColumnarWriter::default();
let mut max_row_id = 0;
for col in &spec.columns {
for &(row_id, ref term) in &col.terms {
writer.record_bytes(row_id, &col.column_name, term);
max_row_id = max_row_id.max(row_id);
}
}
let mut buffer = Vec::new();
writer.serialize(max_row_id + 1, &mut buffer).unwrap();
ColumnarReader::open(buffer).unwrap()
}
proptest! {
// We just test that the merge_columnar function doesn't crash.
#![proptest_config(ProptestConfig::with_cases(256))]
#[test]
fn test_merge_columnar_bytes_no_crash(columnars in columnars_strategy(), second_merge_columnars in columnars_strategy()) {
let columnars: Vec<ColumnarReader> = columnars.iter()
.map(build_columnar)
.collect();
let mut out = Vec::new();
let columnar_refs: Vec<&ColumnarReader> = columnars.iter().collect();
let stack_merge_order = StackMergeOrder::stack(&columnar_refs);
merge_columnar(
&columnar_refs,
&[],
MergeRowOrder::Stack(stack_merge_order),
&mut out,
).unwrap();
let merged_reader = ColumnarReader::open(out).unwrap();
// Merge the second set of columnars with the result of the first merge
let mut columnars: Vec<ColumnarReader> = second_merge_columnars.iter()
.map(build_columnar)
.collect();
columnars.push(merged_reader);
let mut out = Vec::new();
let columnar_refs: Vec<&ColumnarReader> = columnars.iter().collect();
let stack_merge_order = StackMergeOrder::stack(&columnar_refs);
merge_columnar(
&columnar_refs,
&[],
MergeRowOrder::Stack(stack_merge_order),
&mut out,
).unwrap();
}
}
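
The removed proptest above doubles as a compact reference for the columnar build-and-merge flow: `ColumnarWriter::record_bytes` plus `serialize`, `ColumnarReader::open`, then `StackMergeOrder::stack` feeding `merge_columnar`. A condensed, hedged sketch reusing only calls visible in that test; the flat `tantivy_columnar` import path is an assumption (inside the repository these items live under `crate::columnar` and the crate root):

```rust
use tantivy_columnar::{
    merge_columnar, ColumnarReader, ColumnarWriter, MergeRowOrder, StackMergeOrder,
};

fn build_and_stack_merge() -> ColumnarReader {
    // Build two tiny columnars, each with a single bytes column named "col".
    let mut columnars = Vec::new();
    for term in [&b"a"[..], &b"b"[..]] {
        let mut writer = ColumnarWriter::default();
        writer.record_bytes(0, "col", term);
        let mut buffer = Vec::new();
        writer.serialize(1, &mut buffer).unwrap();
        columnars.push(ColumnarReader::open(buffer).unwrap());
    }
    // "Stacking" simply concatenates the inputs: the merged row ids are the rows
    // of the first columnar followed by the rows of the second.
    let refs: Vec<&ColumnarReader> = columnars.iter().collect();
    let merge_order = StackMergeOrder::stack(&refs);
    let mut out = Vec::new();
    merge_columnar(&refs, &[], MergeRowOrder::Stack(merge_order), &mut out).unwrap();
    ColumnarReader::open(out).unwrap()
}
```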

View File

@@ -1,7 +1,6 @@
use std::{fmt, io, mem}; use std::{fmt, io, mem};
use common::file_slice::FileSlice; use common::file_slice::FileSlice;
use common::json_path_writer::JSON_PATH_SEGMENT_SEP;
use common::BinarySerializable; use common::BinarySerializable;
use sstable::{Dictionary, RangeSSTable}; use sstable::{Dictionary, RangeSSTable};
@@ -19,13 +18,13 @@ fn io_invalid_data(msg: String) -> io::Error {
pub struct ColumnarReader { pub struct ColumnarReader {
column_dictionary: Dictionary<RangeSSTable>, column_dictionary: Dictionary<RangeSSTable>,
column_data: FileSlice, column_data: FileSlice,
num_docs: RowId, num_rows: RowId,
format_version: Version, format_version: Version,
} }
impl fmt::Debug for ColumnarReader { impl fmt::Debug for ColumnarReader {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
let num_rows = self.num_docs(); let num_rows = self.num_rows();
let columns = self.list_columns().unwrap(); let columns = self.list_columns().unwrap();
let num_cols = columns.len(); let num_cols = columns.len();
let mut debug_struct = f.debug_struct("Columnar"); let mut debug_struct = f.debug_struct("Columnar");
@@ -77,19 +76,6 @@ fn read_all_columns_in_stream(
Ok(results) Ok(results)
} }
fn column_dictionary_prefix_for_column_name(column_name: &str) -> String {
// Each column is associated to a given `column_key`,
// that starts by `column_name\0column_header`.
//
// Listing the columns associated to the given column name is therefore equivalent to
// listing `column_key` with the prefix `column_name\0`.
format!("{}{}", column_name, '\0')
}
fn column_dictionary_prefix_for_subpath(root_path: &str) -> String {
format!("{}{}", root_path, JSON_PATH_SEGMENT_SEP as char)
}
impl ColumnarReader { impl ColumnarReader {
/// Opens a new Columnar file. /// Opens a new Columnar file.
pub fn open<F>(file_slice: F) -> io::Result<ColumnarReader> pub fn open<F>(file_slice: F) -> io::Result<ColumnarReader>
@@ -112,13 +98,13 @@ impl ColumnarReader {
Ok(ColumnarReader { Ok(ColumnarReader {
column_dictionary, column_dictionary,
column_data, column_data,
num_docs: num_rows, num_rows,
format_version, format_version,
}) })
} }
pub fn num_docs(&self) -> RowId { pub fn num_rows(&self) -> RowId {
self.num_docs self.num_rows
} }
// Iterate over the columns in a sorted way // Iterate over the columns in a sorted way
pub fn iter_columns( pub fn iter_columns(
@@ -158,14 +144,32 @@ impl ColumnarReader {
Ok(self.iter_columns()?.collect()) Ok(self.iter_columns()?.collect())
} }
fn stream_for_column_range(&self, column_name: &str) -> sstable::StreamerBuilder<RangeSSTable> {
// Each column is associated to a given `column_key`,
// that starts by `column_name\0column_header`.
//
// Listing the columns associated to the given column name is therefore equivalent to
// listing `column_key` with the prefix `column_name\0`.
//
// This is in turn equivalent to searching for the range
// `[column_name\0 .. column_name\1)`.
// TODO can we get some more generic `prefix(..)` logic in the dictionary.
let mut start_key = column_name.to_string();
start_key.push('\0');
let mut end_key = column_name.to_string();
end_key.push(1u8 as char);
self.column_dictionary
.range()
.ge(start_key.as_bytes())
.lt(end_key.as_bytes())
}
pub async fn read_columns_async( pub async fn read_columns_async(
&self, &self,
column_name: &str, column_name: &str,
) -> io::Result<Vec<DynamicColumnHandle>> { ) -> io::Result<Vec<DynamicColumnHandle>> {
let prefix = column_dictionary_prefix_for_column_name(column_name);
let stream = self let stream = self
.column_dictionary .stream_for_column_range(column_name)
.prefix_range(prefix)
.into_stream_async() .into_stream_async()
.await?; .await?;
read_all_columns_in_stream(stream, &self.column_data, self.format_version) read_all_columns_in_stream(stream, &self.column_data, self.format_version)
@@ -176,35 +180,7 @@ impl ColumnarReader {
/// There can be more than one column associated to a given column name, provided they have /// There can be more than one column associated to a given column name, provided they have
/// different types. /// different types.
pub fn read_columns(&self, column_name: &str) -> io::Result<Vec<DynamicColumnHandle>> { pub fn read_columns(&self, column_name: &str) -> io::Result<Vec<DynamicColumnHandle>> {
let prefix = column_dictionary_prefix_for_column_name(column_name); let stream = self.stream_for_column_range(column_name).into_stream()?;
let stream = self.column_dictionary.prefix_range(prefix).into_stream()?;
read_all_columns_in_stream(stream, &self.column_data, self.format_version)
}
pub async fn read_subpath_columns_async(
&self,
root_path: &str,
) -> io::Result<Vec<DynamicColumnHandle>> {
let prefix = column_dictionary_prefix_for_subpath(root_path);
let stream = self
.column_dictionary
.prefix_range(prefix)
.into_stream_async()
.await?;
read_all_columns_in_stream(stream, &self.column_data, self.format_version)
}
/// Get all inner columns for a given JSON prefix, i.e columns for which the name starts
/// with the prefix then contain the [`JSON_PATH_SEGMENT_SEP`].
///
/// There can be more than one column associated to each path within the JSON structure,
/// provided they have different types.
pub fn read_subpath_columns(&self, root_path: &str) -> io::Result<Vec<DynamicColumnHandle>> {
let prefix = column_dictionary_prefix_for_subpath(root_path);
let stream = self
.column_dictionary
.prefix_range(prefix.as_bytes())
.into_stream()?;
read_all_columns_in_stream(stream, &self.column_data, self.format_version) read_all_columns_in_stream(stream, &self.column_data, self.format_version)
} }
@@ -216,8 +192,6 @@ impl ColumnarReader {
#[cfg(test)] #[cfg(test)]
mod tests { mod tests {
use common::json_path_writer::JSON_PATH_SEGMENT_SEP;
use crate::{ColumnType, ColumnarReader, ColumnarWriter}; use crate::{ColumnType, ColumnarReader, ColumnarWriter};
#[test] #[test]
@@ -250,64 +224,6 @@ mod tests {
assert_eq!(columns[0].1.column_type(), ColumnType::U64); assert_eq!(columns[0].1.column_type(), ColumnType::U64);
} }
#[test]
fn test_read_columns() {
let mut columnar_writer = ColumnarWriter::default();
columnar_writer.record_column_type("col", ColumnType::U64, false);
columnar_writer.record_numerical(1, "col", 1u64);
let mut buffer = Vec::new();
columnar_writer.serialize(2, &mut buffer).unwrap();
let columnar = ColumnarReader::open(buffer).unwrap();
{
let columns = columnar.read_columns("col").unwrap();
assert_eq!(columns.len(), 1);
assert_eq!(columns[0].column_type(), ColumnType::U64);
}
{
let columns = columnar.read_columns("other").unwrap();
assert_eq!(columns.len(), 0);
}
}
#[test]
fn test_read_subpath_columns() {
let mut columnar_writer = ColumnarWriter::default();
columnar_writer.record_str(
0,
&format!("col1{}subcol1", JSON_PATH_SEGMENT_SEP as char),
"hello",
);
columnar_writer.record_numerical(
0,
&format!("col1{}subcol2", JSON_PATH_SEGMENT_SEP as char),
1i64,
);
columnar_writer.record_str(1, "col1", "hello");
columnar_writer.record_str(0, "col2", "hello");
let mut buffer = Vec::new();
columnar_writer.serialize(2, &mut buffer).unwrap();
let columnar = ColumnarReader::open(buffer).unwrap();
{
let columns = columnar.read_subpath_columns("col1").unwrap();
assert_eq!(columns.len(), 2);
assert_eq!(columns[0].column_type(), ColumnType::Str);
assert_eq!(columns[1].column_type(), ColumnType::I64);
}
{
let columns = columnar.read_subpath_columns("col1.subcol1").unwrap();
assert_eq!(columns.len(), 0);
}
{
let columns = columnar.read_subpath_columns("col2").unwrap();
assert_eq!(columns.len(), 0);
}
{
let columns = columnar.read_subpath_columns("other").unwrap();
assert_eq!(columns.len(), 0);
}
}
#[test] #[test]
#[should_panic(expected = "Input type forbidden")] #[should_panic(expected = "Input type forbidden")]
fn test_list_columns_strict_typing_panics_on_wrong_types() { fn test_list_columns_strict_typing_panics_on_wrong_types() {

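
The comments in the hunks above describe the key layout that makes per-name column lookups a range (or prefix) scan: each column is stored under `column_name\0<column header>`, so every column sharing a name sits in the contiguous key range `[column_name\0 .. column_name\1)`. A toy illustration of that trick with a `BTreeMap` standing in for the sstable-backed dictionary (standard library only):

```rust
use std::collections::BTreeMap;
use std::ops::Bound;

/// Return all dictionary entries whose key starts with `column_name\0`.
fn columns_for_name<'a>(
    dict: &'a BTreeMap<Vec<u8>, u64>,
    column_name: &str,
) -> Vec<(&'a [u8], u64)> {
    let mut start = column_name.as_bytes().to_vec();
    start.push(0u8); // `column_name\0`, inclusive lower bound
    let mut end = column_name.as_bytes().to_vec();
    end.push(1u8); // `column_name\1`, exclusive upper bound
    dict.range((Bound::Included(start), Bound::Excluded(end)))
        .map(|(key, &addr)| (key.as_slice(), addr))
        .collect()
}

fn main() {
    let mut dict = BTreeMap::new();
    dict.insert(b"numbers\0u64".to_vec(), 0u64);
    dict.insert(b"numbers\0i64".to_vec(), 1);
    dict.insert(b"numbers2\0u64".to_vec(), 2);
    // Only the two `numbers` columns fall in the range; `numbers2` is excluded
    // because its `2` sorts above the `\1` upper bound.
    assert_eq!(columns_for_name(&dict, "numbers").len(), 2);
}
```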
View File

@@ -285,6 +285,7 @@ impl ColumnarWriter {
.map(|(column_name, addr)| (column_name, ColumnType::DateTime, addr)), .map(|(column_name, addr)| (column_name, ColumnType::DateTime, addr)),
); );
columns.sort_unstable_by_key(|(column_name, col_type, _)| (*column_name, *col_type)); columns.sort_unstable_by_key(|(column_name, col_type, _)| (*column_name, *col_type));
let (arena, buffers, dictionaries) = (&self.arena, &mut self.buffers, &self.dictionaries); let (arena, buffers, dictionaries) = (&self.arena, &mut self.buffers, &self.dictionaries);
let mut symbol_byte_buffer: Vec<u8> = Vec::new(); let mut symbol_byte_buffer: Vec<u8> = Vec::new();
for (column_name, column_type, addr) in columns { for (column_name, column_type, addr) in columns {

View File

@@ -67,7 +67,7 @@ pub struct ColumnSerializer<'a, W: io::Write> {
start_offset: u64, start_offset: u64,
} }
impl<W: io::Write> ColumnSerializer<'_, W> { impl<'a, W: io::Write> ColumnSerializer<'a, W> {
pub fn finalize(self) -> io::Result<()> { pub fn finalize(self) -> io::Result<()> {
let end_offset: u64 = self.columnar_serializer.wrt.written_bytes(); let end_offset: u64 = self.columnar_serializer.wrt.written_bytes();
let byte_range = self.start_offset..end_offset; let byte_range = self.start_offset..end_offset;
@@ -80,7 +80,7 @@ impl<W: io::Write> ColumnSerializer<'_, W> {
} }
} }
impl<W: io::Write> io::Write for ColumnSerializer<'_, W> { impl<'a, W: io::Write> io::Write for ColumnSerializer<'a, W> {
fn write(&mut self, buf: &[u8]) -> io::Result<usize> { fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
self.columnar_serializer.wrt.write(buf) self.columnar_serializer.wrt.write(buf)
} }

View File

@@ -7,7 +7,7 @@ pub trait Iterable<T = u64> {
fn boxed_iter(&self) -> Box<dyn Iterator<Item = T> + '_>; fn boxed_iter(&self) -> Box<dyn Iterator<Item = T> + '_>;
} }
impl<T: Copy> Iterable<T> for &[T] { impl<'a, T: Copy> Iterable<T> for &'a [T] {
fn boxed_iter(&self) -> Box<dyn Iterator<Item = T> + '_> { fn boxed_iter(&self) -> Box<dyn Iterator<Item = T> + '_> {
Box::new(self.iter().copied()) Box::new(self.iter().copied())
} }
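
The serializer and `Iterable` hunks above only change how the lifetime is written: a lifetime mentioned once can be elided from the impl header or spelled as the anonymous lifetime `'_`, with no behavioral difference. A tiny sketch with a made-up trait and wrapper showing both forms:

```rust
trait Iterable<T> {
    fn boxed_iter(&self) -> Box<dyn Iterator<Item = T> + '_>;
}

// Elided form, as in the newer impl header above.
impl<T: Copy> Iterable<T> for &[T] {
    fn boxed_iter(&self) -> Box<dyn Iterator<Item = T> + '_> {
        Box::new(self.iter().copied())
    }
}

struct Wrapper<'a>(&'a [u32]);

// Anonymous-lifetime form for a type carrying a lifetime parameter.
impl Wrapper<'_> {
    fn first(&self) -> Option<u32> {
        self.0.first().copied()
    }
}

fn main() {
    let data = [1u32, 2, 3];
    let total: u32 = (&data[..]).boxed_iter().sum();
    assert_eq!(total, 6);
    assert_eq!(Wrapper(&data).first(), Some(1));
}
```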

View File

@@ -380,7 +380,7 @@ fn assert_columnar_eq(
right: &ColumnarReader, right: &ColumnarReader,
lenient_on_numerical_value: bool, lenient_on_numerical_value: bool,
) { ) {
assert_eq!(left.num_docs(), right.num_docs()); assert_eq!(left.num_rows(), right.num_rows());
let left_columns = left.list_columns().unwrap(); let left_columns = left.list_columns().unwrap();
let right_columns = right.list_columns().unwrap(); let right_columns = right.list_columns().unwrap();
assert_eq!(left_columns.len(), right_columns.len()); assert_eq!(left_columns.len(), right_columns.len());
@@ -588,7 +588,7 @@ proptest! {
#[test] #[test]
fn test_single_columnar_builder_proptest(docs in columnar_docs_strategy()) { fn test_single_columnar_builder_proptest(docs in columnar_docs_strategy()) {
let columnar = build_columnar(&docs[..]); let columnar = build_columnar(&docs[..]);
assert_eq!(columnar.num_docs() as usize, docs.len()); assert_eq!(columnar.num_rows() as usize, docs.len());
let mut expected_columns: HashMap<(&str, ColumnTypeCategory), HashMap<u32, Vec<&ColumnValue>> > = Default::default(); let mut expected_columns: HashMap<(&str, ColumnTypeCategory), HashMap<u32, Vec<&ColumnValue>> > = Default::default();
for (doc_id, doc_vals) in docs.iter().enumerate() { for (doc_id, doc_vals) in docs.iter().enumerate() {
for (col_name, col_val) in doc_vals { for (col_name, col_val) in doc_vals {
@@ -715,7 +715,6 @@ fn test_columnar_merging_number_columns() {
// TODO test required_columns // TODO test required_columns
// TODO document edge case: required_columns incompatible with values. // TODO document edge case: required_columns incompatible with values.
#[allow(clippy::type_complexity)]
fn columnar_docs_and_remap( fn columnar_docs_and_remap(
) -> impl Strategy<Value = (Vec<Vec<Vec<(&'static str, ColumnValue)>>>, Vec<RowAddr>)> { ) -> impl Strategy<Value = (Vec<Vec<Vec<(&'static str, ColumnValue)>>>, Vec<RowAddr>)> {
proptest::collection::vec(columnar_docs_strategy(), 2..=3).prop_flat_map( proptest::collection::vec(columnar_docs_strategy(), 2..=3).prop_flat_map(
@@ -820,7 +819,7 @@ fn test_columnar_merge_empty() {
) )
.unwrap(); .unwrap();
let merged_columnar = ColumnarReader::open(output).unwrap(); let merged_columnar = ColumnarReader::open(output).unwrap();
assert_eq!(merged_columnar.num_docs(), 0); assert_eq!(merged_columnar.num_rows(), 0);
assert_eq!(merged_columnar.num_columns(), 0); assert_eq!(merged_columnar.num_columns(), 0);
} }
@@ -846,7 +845,7 @@ fn test_columnar_merge_single_str_column() {
) )
.unwrap(); .unwrap();
let merged_columnar = ColumnarReader::open(output).unwrap(); let merged_columnar = ColumnarReader::open(output).unwrap();
assert_eq!(merged_columnar.num_docs(), 1); assert_eq!(merged_columnar.num_rows(), 1);
assert_eq!(merged_columnar.num_columns(), 1); assert_eq!(merged_columnar.num_columns(), 1);
} }
@@ -878,7 +877,7 @@ fn test_delete_decrease_cardinality() {
) )
.unwrap(); .unwrap();
let merged_columnar = ColumnarReader::open(output).unwrap(); let merged_columnar = ColumnarReader::open(output).unwrap();
assert_eq!(merged_columnar.num_docs(), 1); assert_eq!(merged_columnar.num_rows(), 1);
assert_eq!(merged_columnar.num_columns(), 1); assert_eq!(merged_columnar.num_columns(), 1);
let cols = merged_columnar.read_columns("c").unwrap(); let cols = merged_columnar.read_columns("c").unwrap();
assert_eq!(cols.len(), 1); assert_eq!(cols.len(), 1);

View File

@@ -1,6 +1,6 @@
[package] [package]
name = "tantivy-common" name = "tantivy-common"
version = "0.9.0" version = "0.7.0"
authors = ["Paul Masurel <paul@quickwit.io>", "Pascal Seitz <pascal@quickwit.io>"] authors = ["Paul Masurel <paul@quickwit.io>", "Pascal Seitz <pascal@quickwit.io>"]
license = "MIT" license = "MIT"
edition = "2021" edition = "2021"
@@ -13,7 +13,7 @@ repository = "https://github.com/quickwit-oss/tantivy"
[dependencies] [dependencies]
byteorder = "1.4.3" byteorder = "1.4.3"
ownedbytes = { version= "0.9", path="../ownedbytes" } ownedbytes = { version= "0.7", path="../ownedbytes" }
async-trait = "0.1" async-trait = "0.1"
time = { version = "0.3.10", features = ["serde-well-known"] } time = { version = "0.3.10", features = ["serde-well-known"] }
serde = { version = "1.0.136", features = ["derive"] } serde = { version = "1.0.136", features = ["derive"] }

View File

@@ -1,6 +1,5 @@
use std::fs::File; use std::fs::File;
use std::ops::{Deref, Range, RangeBounds}; use std::ops::{Deref, Range, RangeBounds};
use std::path::Path;
use std::sync::Arc; use std::sync::Arc;
use std::{fmt, io}; use std::{fmt, io};
@@ -178,12 +177,6 @@ fn combine_ranges<R: RangeBounds<usize>>(orig_range: Range<usize>, rel_range: R)
} }
impl FileSlice { impl FileSlice {
/// Creates a FileSlice from a path.
pub fn open(path: &Path) -> io::Result<FileSlice> {
let wrap_file = WrapFile::new(File::open(path)?)?;
Ok(FileSlice::new(Arc::new(wrap_file)))
}
/// Wraps a FileHandle. /// Wraps a FileHandle.
pub fn new(file_handle: Arc<dyn FileHandle>) -> Self { pub fn new(file_handle: Arc<dyn FileHandle>) -> Self {
let num_bytes = file_handle.len(); let num_bytes = file_handle.len();

View File

@@ -87,7 +87,7 @@ impl<W: TerminatingWrite> TerminatingWrite for BufWriter<W> {
} }
} }
impl TerminatingWrite for &mut Vec<u8> { impl<'a> TerminatingWrite for &'a mut Vec<u8> {
fn terminate_ref(&mut self, _a: AntiCallToken) -> io::Result<()> { fn terminate_ref(&mut self, _a: AntiCallToken) -> io::Result<()> {
self.flush() self.flush()
} }

View File

@@ -1,7 +1,7 @@
[package] [package]
authors = ["Paul Masurel <paul@quickwit.io>", "Pascal Seitz <pascal@quickwit.io>"] authors = ["Paul Masurel <paul@quickwit.io>", "Pascal Seitz <pascal@quickwit.io>"]
name = "ownedbytes" name = "ownedbytes"
version = "0.9.0" version = "0.7.0"
edition = "2021" edition = "2021"
description = "Expose data as static slice" description = "Expose data as static slice"
license = "MIT" license = "MIT"

View File

@@ -1,6 +1,6 @@
[package] [package]
name = "tantivy-query-grammar" name = "tantivy-query-grammar"
version = "0.24.0" version = "0.22.0"
authors = ["Paul Masurel <paul.masurel@gmail.com>"] authors = ["Paul Masurel <paul.masurel@gmail.com>"]
license = "MIT" license = "MIT"
categories = ["database-implementations", "data-structures"] categories = ["database-implementations", "data-structures"]
@@ -13,5 +13,3 @@ edition = "2021"
[dependencies] [dependencies]
nom = "7" nom = "7"
serde = { version = "1.0.219", features = ["derive"] }
serde_json = "1.0.140"

View File

@@ -3,7 +3,6 @@
use std::convert::Infallible; use std::convert::Infallible;
use nom::{AsChar, IResult, InputLength, InputTakeAtPosition}; use nom::{AsChar, IResult, InputLength, InputTakeAtPosition};
use serde::Serialize;
pub(crate) type ErrorList = Vec<LenientErrorInternal>; pub(crate) type ErrorList = Vec<LenientErrorInternal>;
pub(crate) type JResult<I, O> = IResult<I, (O, ErrorList), Infallible>; pub(crate) type JResult<I, O> = IResult<I, (O, ErrorList), Infallible>;
@@ -16,8 +15,7 @@ pub(crate) struct LenientErrorInternal {
} }
/// A recoverable error and the position it happened at /// A recoverable error and the position it happened at
#[derive(Debug, PartialEq, Serialize)] #[derive(Debug, PartialEq)]
#[serde(rename_all = "snake_case")]
pub struct LenientError { pub struct LenientError {
pub pos: usize, pub pos: usize,
pub message: String, pub message: String,
@@ -355,21 +353,3 @@ where
{ {
move |i: I| l.choice(i.clone()).unwrap_or_else(|| default.parse(i)) move |i: I| l.choice(i.clone()).unwrap_or_else(|| default.parse(i))
} }
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_lenient_error_serialization() {
let error = LenientError {
pos: 42,
message: "test error message".to_string(),
};
assert_eq!(
serde_json::to_string(&error).unwrap(),
"{\"pos\":42,\"message\":\"test error message\"}"
);
}
}

View File

@@ -1,7 +1,5 @@
#![allow(clippy::derive_partial_eq_without_eq)] #![allow(clippy::derive_partial_eq_without_eq)]
use serde::Serialize;
mod infallible; mod infallible;
mod occur; mod occur;
mod query_grammar; mod query_grammar;
@@ -14,8 +12,6 @@ pub use crate::user_input_ast::{
Delimiter, UserInputAst, UserInputBound, UserInputLeaf, UserInputLiteral, Delimiter, UserInputAst, UserInputBound, UserInputLeaf, UserInputLiteral,
}; };
#[derive(Debug, Serialize)]
#[serde(rename_all = "snake_case")]
pub struct Error; pub struct Error;
/// Parse a query /// Parse a query
@@ -28,31 +24,3 @@ pub fn parse_query(query: &str) -> Result<UserInputAst, Error> {
pub fn parse_query_lenient(query: &str) -> (UserInputAst, Vec<LenientError>) { pub fn parse_query_lenient(query: &str) -> (UserInputAst, Vec<LenientError>) {
parse_to_ast_lenient(query) parse_to_ast_lenient(query)
} }
#[cfg(test)]
mod tests {
use crate::{parse_query, parse_query_lenient};
#[test]
fn test_parse_query_serialization() {
let ast = parse_query("title:hello OR title:x").unwrap();
let json = serde_json::to_string(&ast).unwrap();
assert_eq!(
json,
r#"{"type":"bool","clauses":[["should",{"type":"literal","field_name":"title","phrase":"hello","delimiter":"none","slop":0,"prefix":false}],["should",{"type":"literal","field_name":"title","phrase":"x","delimiter":"none","slop":0,"prefix":false}]]}"#
);
}
#[test]
fn test_parse_query_wrong_query() {
assert!(parse_query("title:").is_err());
}
#[test]
fn test_parse_query_lenient_wrong_query() {
let (_, errors) = parse_query_lenient("title:");
assert!(errors.len() == 1);
let json = serde_json::to_string(&errors).unwrap();
assert_eq!(json, r#"[{"pos":6,"message":"expected word"}]"#);
}
}

View File

@@ -1,12 +1,9 @@
use std::fmt; use std::fmt;
use std::fmt::Write; use std::fmt::Write;
use serde::Serialize;
/// Defines whether a term in a query must be present, /// Defines whether a term in a query must be present,
/// should be present or must not be present. /// should be present or must not be present.
#[derive(Debug, Clone, Hash, Copy, Eq, PartialEq, Serialize)] #[derive(Debug, Clone, Hash, Copy, Eq, PartialEq)]
#[serde(rename_all = "snake_case")]
pub enum Occur { pub enum Occur {
/// For a given document to be considered for scoring, /// For a given document to be considered for scoring,
/// at least one of the queries with the Should or the Must /// at least one of the queries with the Should or the Must

View File

@@ -321,17 +321,7 @@ fn exists(inp: &str) -> IResult<&str, UserInputLeaf> {
UserInputLeaf::Exists { UserInputLeaf::Exists {
field: String::new(), field: String::new(),
}, },
tuple(( tuple((multispace0, char('*'))),
multispace0,
char('*'),
peek(alt((
value(
"",
satisfy(|c: char| c.is_whitespace() || ESCAPE_IN_WORD.contains(&c)),
),
eof,
))),
)),
)(inp) )(inp)
} }
@@ -341,14 +331,7 @@ fn exists_precond(inp: &str) -> IResult<&str, (), ()> {
peek(tuple(( peek(tuple((
field_name, field_name,
multispace0, multispace0,
char('*'), char('*'), // when we are here, we know it can't be anything but a exists
peek(alt((
value(
"",
satisfy(|c: char| c.is_whitespace() || ESCAPE_IN_WORD.contains(&c)),
),
eof,
))), // we need to check this isn't a wildcard query
))), ))),
)(inp) )(inp)
.map_err(|e| e.map(|_| ())) .map_err(|e| e.map(|_| ()))
@@ -1514,11 +1497,6 @@ mod test {
test_is_parse_err(r#"field:(+a -"b c""#, r#"(+"field":a -"field":"b c")"#); test_is_parse_err(r#"field:(+a -"b c""#, r#"(+"field":a -"field":"b c")"#);
} }
#[test]
fn field_re_specification() {
test_parse_query_to_ast_helper(r#"field:(abc AND b:cde)"#, r#"(+"field":abc +"b":cde)"#);
}
#[test] #[test]
fn test_parse_query_single_term() { fn test_parse_query_single_term() {
test_parse_query_to_ast_helper("abc", "abc"); test_parse_query_to_ast_helper("abc", "abc");
@@ -1641,19 +1619,13 @@ mod test {
#[test] #[test]
fn test_exist_query() { fn test_exist_query() {
test_parse_query_to_ast_helper("a:*", "$exists(\"a\")"); test_parse_query_to_ast_helper("a:*", "\"a\":*");
test_parse_query_to_ast_helper("a: *", "$exists(\"a\")"); test_parse_query_to_ast_helper("a: *", "\"a\":*");
// an exists query followed by a term `b` on the default field
test_is_parse_err("a:*b", "(*\"a\":* *b)");
test_parse_query_to_ast_helper( // this is a term query (not a phrase prefix)
"(hello AND toto:*) OR happy",
"(?(+hello +$exists(\"toto\")) ?happy)",
);
test_parse_query_to_ast_helper("(a:*)", "$exists(\"a\")");
// these are term/wildcard query (not a phrase prefix)
test_parse_query_to_ast_helper("a:b*", "\"a\":b*"); test_parse_query_to_ast_helper("a:b*", "\"a\":b*");
test_parse_query_to_ast_helper("a:*b", "\"a\":*b");
test_parse_query_to_ast_helper(r#"a:*def*"#, "\"a\":*def*");
} }
#[test] #[test]

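
The stricter variant of the `exists` parser above only treats `field:*` as an exists query when the `*` is followed by whitespace or end of input (the real grammar also accepts the characters in `ESCAPE_IN_WORD`); otherwise inputs like `a:*b` remain wildcard terms. A standard-library approximation of that decision rule, for illustration only, not the nom-based grammar itself:

```rust
/// Classify the value part of `field:<value>` the way the stricter grammar does.
fn classify(value: &str) -> &'static str {
    let trimmed = value.trim_start();
    match trimmed.strip_prefix('*') {
        // Bare `*` (possibly followed by whitespace): an exists query.
        Some(rest) if rest.is_empty() || rest.starts_with(char::is_whitespace) => "exists",
        // `*` glued to more characters: a wildcard term such as `*b`.
        Some(_) => "wildcard",
        None => "term",
    }
}

fn main() {
    assert_eq!(classify("*"), "exists");   // a:*  -> $exists("a")
    assert_eq!(classify(" *"), "exists");  // a: * -> $exists("a")
    assert_eq!(classify("*b"), "wildcard"); // a:*b -> wildcard term, not exists
    assert_eq!(classify("b*"), "term");     // a:b* -> ordinary (wildcard) term
}
```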
View File

@@ -1,13 +1,9 @@
use std::fmt; use std::fmt;
use std::fmt::{Debug, Formatter}; use std::fmt::{Debug, Formatter};
use serde::Serialize;
use crate::Occur; use crate::Occur;
#[derive(PartialEq, Clone, Serialize)] #[derive(PartialEq, Clone)]
#[serde(tag = "type")]
#[serde(rename_all = "snake_case")]
pub enum UserInputLeaf { pub enum UserInputLeaf {
Literal(UserInputLiteral), Literal(UserInputLiteral),
All, All,
@@ -105,22 +101,20 @@ impl Debug for UserInputLeaf {
} }
UserInputLeaf::All => write!(formatter, "*"), UserInputLeaf::All => write!(formatter, "*"),
UserInputLeaf::Exists { field } => { UserInputLeaf::Exists { field } => {
write!(formatter, "$exists(\"{field}\")") write!(formatter, "\"{field}\":*")
} }
} }
} }
} }
#[derive(Copy, Clone, Eq, PartialEq, Debug, Serialize)] #[derive(Copy, Clone, Eq, PartialEq, Debug)]
#[serde(rename_all = "snake_case")]
pub enum Delimiter { pub enum Delimiter {
SingleQuotes, SingleQuotes,
DoubleQuotes, DoubleQuotes,
None, None,
} }
#[derive(PartialEq, Clone, Serialize)] #[derive(PartialEq, Clone)]
#[serde(rename_all = "snake_case")]
pub struct UserInputLiteral { pub struct UserInputLiteral {
pub field_name: Option<String>, pub field_name: Option<String>,
pub phrase: String, pub phrase: String,
@@ -158,9 +152,7 @@ impl fmt::Debug for UserInputLiteral {
} }
} }
#[derive(PartialEq, Debug, Clone, Serialize)] #[derive(PartialEq, Debug, Clone)]
#[serde(tag = "type", content = "value")]
#[serde(rename_all = "snake_case")]
pub enum UserInputBound { pub enum UserInputBound {
Inclusive(String), Inclusive(String),
Exclusive(String), Exclusive(String),
@@ -195,38 +187,11 @@ impl UserInputBound {
} }
} }
#[derive(PartialEq, Clone, Serialize)] #[derive(PartialEq, Clone)]
#[serde(into = "UserInputAstSerde")]
pub enum UserInputAst { pub enum UserInputAst {
Clause(Vec<(Option<Occur>, UserInputAst)>), Clause(Vec<(Option<Occur>, UserInputAst)>),
Leaf(Box<UserInputLeaf>),
Boost(Box<UserInputAst>, f64), Boost(Box<UserInputAst>, f64),
Leaf(Box<UserInputLeaf>),
}
#[derive(Serialize)]
#[serde(tag = "type", rename_all = "snake_case")]
enum UserInputAstSerde {
Bool {
clauses: Vec<(Option<Occur>, UserInputAst)>,
},
Boost {
underlying: Box<UserInputAst>,
boost: f64,
},
#[serde(untagged)]
Leaf(Box<UserInputLeaf>),
}
impl From<UserInputAst> for UserInputAstSerde {
fn from(ast: UserInputAst) -> Self {
match ast {
UserInputAst::Clause(clause) => UserInputAstSerde::Bool { clauses: clause },
UserInputAst::Boost(underlying, boost) => {
UserInputAstSerde::Boost { underlying, boost }
}
UserInputAst::Leaf(leaf) => UserInputAstSerde::Leaf(leaf),
}
}
} }
impl UserInputAst { impl UserInputAst {
@@ -320,126 +285,3 @@ impl fmt::Debug for UserInputAst {
} }
} }
} }
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_all_leaf_serialization() {
let ast = UserInputAst::Leaf(Box::new(UserInputLeaf::All));
let json = serde_json::to_string(&ast).unwrap();
assert_eq!(json, r#"{"type":"all"}"#);
}
#[test]
fn test_literal_leaf_serialization() {
let literal = UserInputLiteral {
field_name: Some("title".to_string()),
phrase: "hello".to_string(),
delimiter: Delimiter::None,
slop: 0,
prefix: false,
};
let ast = UserInputAst::Leaf(Box::new(UserInputLeaf::Literal(literal)));
let json = serde_json::to_string(&ast).unwrap();
assert_eq!(
json,
r#"{"type":"literal","field_name":"title","phrase":"hello","delimiter":"none","slop":0,"prefix":false}"#
);
}
#[test]
fn test_range_leaf_serialization() {
let range = UserInputLeaf::Range {
field: Some("price".to_string()),
lower: UserInputBound::Inclusive("10".to_string()),
upper: UserInputBound::Exclusive("100".to_string()),
};
let ast = UserInputAst::Leaf(Box::new(range));
let json = serde_json::to_string(&ast).unwrap();
assert_eq!(
json,
r#"{"type":"range","field":"price","lower":{"type":"inclusive","value":"10"},"upper":{"type":"exclusive","value":"100"}}"#
);
}
#[test]
fn test_range_leaf_unbounded_serialization() {
let range = UserInputLeaf::Range {
field: Some("price".to_string()),
lower: UserInputBound::Inclusive("10".to_string()),
upper: UserInputBound::Unbounded,
};
let ast = UserInputAst::Leaf(Box::new(range));
let json = serde_json::to_string(&ast).unwrap();
assert_eq!(
json,
r#"{"type":"range","field":"price","lower":{"type":"inclusive","value":"10"},"upper":{"type":"unbounded"}}"#
);
}
#[test]
fn test_boost_serialization() {
let inner_ast = UserInputAst::Leaf(Box::new(UserInputLeaf::All));
let boost_ast = UserInputAst::Boost(Box::new(inner_ast), 2.5);
let json = serde_json::to_string(&boost_ast).unwrap();
assert_eq!(
json,
r#"{"type":"boost","underlying":{"type":"all"},"boost":2.5}"#
);
}
#[test]
fn test_boost_serialization2() {
let boost_ast = UserInputAst::Boost(
Box::new(UserInputAst::Clause(vec![
(
Some(Occur::Must),
UserInputAst::Leaf(Box::new(UserInputLeaf::All)),
),
(
Some(Occur::Should),
UserInputAst::Leaf(Box::new(UserInputLeaf::Literal(UserInputLiteral {
field_name: Some("title".to_string()),
phrase: "hello".to_string(),
delimiter: Delimiter::None,
slop: 0,
prefix: false,
}))),
),
])),
2.5,
);
let json = serde_json::to_string(&boost_ast).unwrap();
assert_eq!(
json,
r#"{"type":"boost","underlying":{"type":"bool","clauses":[["must",{"type":"all"}],["should",{"type":"literal","field_name":"title","phrase":"hello","delimiter":"none","slop":0,"prefix":false}]]},"boost":2.5}"#
);
}
#[test]
fn test_clause_serialization() {
let clause = UserInputAst::Clause(vec![
(
Some(Occur::Must),
UserInputAst::Leaf(Box::new(UserInputLeaf::All)),
),
(
Some(Occur::Should),
UserInputAst::Leaf(Box::new(UserInputLeaf::Literal(UserInputLiteral {
field_name: Some("title".to_string()),
phrase: "hello".to_string(),
delimiter: Delimiter::None,
slop: 0,
prefix: false,
}))),
),
]);
let json = serde_json::to_string(&clause).unwrap();
assert_eq!(
json,
r#"{"type":"bool","clauses":[["must",{"type":"all"}],["should",{"type":"literal","field_name":"title","phrase":"hello","delimiter":"none","slop":0,"prefix":false}]]}"#
);
}
}

View File

@@ -271,6 +271,10 @@ impl AggregationWithAccessor {
field: ref field_name, field: ref field_name,
.. ..
}) })
| Count(CountAggregation {
field: ref field_name,
..
})
| Max(MaxAggregation { | Max(MaxAggregation {
field: ref field_name, field: ref field_name,
.. ..
@@ -295,24 +299,6 @@ impl AggregationWithAccessor {
get_ff_reader(reader, field_name, Some(get_numeric_or_date_column_types()))?; get_ff_reader(reader, field_name, Some(get_numeric_or_date_column_types()))?;
add_agg_with_accessor(&agg, accessor, column_type, &mut res)?; add_agg_with_accessor(&agg, accessor, column_type, &mut res)?;
} }
Count(CountAggregation {
field: ref field_name,
..
}) => {
let allowed_column_types = [
ColumnType::I64,
ColumnType::U64,
ColumnType::F64,
ColumnType::Str,
ColumnType::DateTime,
ColumnType::Bool,
ColumnType::IpAddr,
// ColumnType::Bytes Unsupported
];
let (accessor, column_type) =
get_ff_reader(reader, field_name, Some(&allowed_column_types))?;
add_agg_with_accessor(&agg, accessor, column_type, &mut res)?;
}
Percentiles(ref percentiles) => { Percentiles(ref percentiles) => {
let (accessor, column_type) = get_ff_reader( let (accessor, column_type) = get_ff_reader(
reader, reader,

View File

@@ -34,10 +34,10 @@ use crate::aggregation::*;
pub struct DateHistogramAggregationReq { pub struct DateHistogramAggregationReq {
#[doc(hidden)] #[doc(hidden)]
/// Only for validation /// Only for validation
pub interval: Option<String>, interval: Option<String>,
#[doc(hidden)] #[doc(hidden)]
/// Only for validation /// Only for validation
pub calendar_interval: Option<String>, calendar_interval: Option<String>,
/// The field to aggregate on. /// The field to aggregate on.
pub field: String, pub field: String,
/// The format to format dates. Unsupported currently. /// The format to format dates. Unsupported currently.

View File

@@ -220,23 +220,9 @@ impl SegmentStatsCollector {
.column_block_accessor .column_block_accessor
.fetch_block(docs, &agg_accessor.accessor); .fetch_block(docs, &agg_accessor.accessor);
} }
if [ for val in agg_accessor.column_block_accessor.iter_vals() {
ColumnType::I64, let val1 = f64_from_fastfield_u64(val, &self.field_type);
ColumnType::U64, self.stats.collect(val1);
ColumnType::F64,
ColumnType::DateTime,
]
.contains(&self.field_type)
{
for val in agg_accessor.column_block_accessor.iter_vals() {
let val1 = f64_from_fastfield_u64(val, &self.field_type);
self.stats.collect(val1);
}
} else {
for _val in agg_accessor.column_block_accessor.iter_vals() {
// we ignore the value and simply record that we got something
self.stats.collect(0.0);
}
} }
} }
} }
@@ -449,11 +435,6 @@ mod tests {
"field": "score", "field": "score",
}, },
}, },
"count_str": {
"value_count": {
"field": "text",
},
},
"range": range_agg "range": range_agg
})) }))
.unwrap(); .unwrap();
@@ -519,13 +500,6 @@ mod tests {
}) })
); );
assert_eq!(
res["count_str"],
json!({
"value": 7.0,
})
);
Ok(()) Ok(())
} }
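
The removed `count_str` assertion and the dedicated `Count(CountAggregation)` accessor hunk above show `value_count` being exercised on a string fast field. The request shape from that test can be reproduced with `serde_json` (assumed as a dependency); wiring it into an index and aggregation collector is outside this sketch:

```rust
use serde_json::json;

/// Build the same `value_count` request the removed test used.
fn value_count_request() -> serde_json::Value {
    json!({
        "count_str": {
            "value_count": { "field": "text" }
        }
    })
}

fn main() {
    // For the seven matching documents in that test, the reported result
    // was `{"value": 7.0}`.
    println!("{}", value_count_request());
}
```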

View File

@@ -366,12 +366,8 @@ impl PartialEq for Key {
fn eq(&self, other: &Self) -> bool { fn eq(&self, other: &Self) -> bool {
match (self, other) { match (self, other) {
(Self::Str(l), Self::Str(r)) => l == r, (Self::Str(l), Self::Str(r)) => l == r,
(Self::F64(l), Self::F64(r)) => l.to_bits() == r.to_bits(), (Self::F64(l), Self::F64(r)) => l == r,
(Self::I64(l), Self::I64(r)) => l == r, _ => false,
(Self::U64(l), Self::U64(r)) => l == r,
// we list all variant of left operand to make sure this gets updated when we add
// variants to the enum
(Self::Str(_) | Self::F64(_) | Self::I64(_) | Self::U64(_), _) => false,
} }
} }
} }
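
The newer `PartialEq` arm above compares `F64` keys through `to_bits()` and enumerates every left-hand variant, so adding a variant forces the match to be revisited. A standalone copy of that impl (for illustration, not the aggregation module's actual type) that makes the behavioral consequences visible, `NaN == NaN` holds while `0.0 != -0.0`:

```rust
#[derive(Debug)]
enum Key {
    Str(String),
    F64(f64),
    I64(i64),
    U64(u64),
}

impl PartialEq for Key {
    fn eq(&self, other: &Self) -> bool {
        match (self, other) {
            (Key::Str(l), Key::Str(r)) => l == r,
            // Bit-pattern comparison: total, and usable as a map key.
            (Key::F64(l), Key::F64(r)) => l.to_bits() == r.to_bits(),
            (Key::I64(l), Key::I64(r)) => l == r,
            (Key::U64(l), Key::U64(r)) => l == r,
            // Listing every left-hand variant keeps the match exhaustive.
            (Key::Str(_) | Key::F64(_) | Key::I64(_) | Key::U64(_), _) => false,
        }
    }
}

fn main() {
    assert_eq!(Key::F64(f64::NAN), Key::F64(f64::NAN));
    assert_ne!(Key::F64(0.0), Key::F64(-0.0));
}
```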
@@ -582,7 +578,7 @@ mod tests {
.set_indexing_options( .set_indexing_options(
TextFieldIndexing::default().set_index_option(IndexRecordOption::WithFreqs), TextFieldIndexing::default().set_index_option(IndexRecordOption::WithFreqs),
) )
.set_fast(Some("raw")) .set_fast(None)
.set_stored(); .set_stored();
let text_field = schema_builder.add_text_field("text", text_fieldtype); let text_field = schema_builder.add_text_field("text", text_fieldtype);
let date_field = schema_builder.add_date_field("date", FAST); let date_field = schema_builder.add_date_field("date", FAST);

View File

@@ -786,7 +786,7 @@ impl<Score, D, const R: bool> From<TopNComputerDeser<Score, D, R>> for TopNCompu
} }
} }
impl<Score, D, const REVERSE_ORDER: bool> TopNComputer<Score, D, REVERSE_ORDER> impl<Score, D, const R: bool> TopNComputer<Score, D, R>
where where
Score: PartialOrd + Clone, Score: PartialOrd + Clone,
D: Ord, D: Ord,
@@ -807,10 +807,7 @@ where
#[inline] #[inline]
pub fn push(&mut self, feature: Score, doc: D) { pub fn push(&mut self, feature: Score, doc: D) {
if let Some(last_median) = self.threshold.clone() { if let Some(last_median) = self.threshold.clone() {
if !REVERSE_ORDER && feature > last_median { if feature < last_median {
return;
}
if REVERSE_ORDER && feature < last_median {
return; return;
} }
} }
@@ -845,7 +842,7 @@ where
} }
/// Returns the top n elements in sorted order. /// Returns the top n elements in sorted order.
pub fn into_sorted_vec(mut self) -> Vec<ComparableDoc<Score, D, REVERSE_ORDER>> { pub fn into_sorted_vec(mut self) -> Vec<ComparableDoc<Score, D, R>> {
if self.buffer.len() > self.top_n { if self.buffer.len() > self.top_n {
self.truncate_top_n(); self.truncate_top_n();
} }
@@ -856,7 +853,7 @@ where
/// Returns the top n elements in stored order. /// Returns the top n elements in stored order.
/// Useful if you do not need the elements in sorted order, /// Useful if you do not need the elements in sorted order,
/// for example when merging the results of multiple segments. /// for example when merging the results of multiple segments.
pub fn into_vec(mut self) -> Vec<ComparableDoc<Score, D, REVERSE_ORDER>> { pub fn into_vec(mut self) -> Vec<ComparableDoc<Score, D, R>> {
if self.buffer.len() > self.top_n { if self.buffer.len() > self.top_n {
self.truncate_top_n(); self.truncate_top_n();
} }
@@ -866,11 +863,9 @@ where
#[cfg(test)] #[cfg(test)]
mod tests { mod tests {
use proptest::prelude::*;
use super::{TopDocs, TopNComputer}; use super::{TopDocs, TopNComputer};
use crate::collector::top_collector::ComparableDoc; use crate::collector::top_collector::ComparableDoc;
use crate::collector::{Collector, DocSetCollector}; use crate::collector::Collector;
use crate::query::{AllQuery, Query, QueryParser}; use crate::query::{AllQuery, Query, QueryParser};
use crate::schema::{Field, Schema, FAST, STORED, TEXT}; use crate::schema::{Field, Schema, FAST, STORED, TEXT};
use crate::time::format_description::well_known::Rfc3339; use crate::time::format_description::well_known::Rfc3339;
@@ -965,44 +960,6 @@ mod tests {
} }
} }
proptest! {
#[test]
fn test_topn_computer_asc_prop(
limit in 0..10_usize,
docs in proptest::collection::vec((0..100_u64, 0..100_u64), 0..100_usize),
) {
let mut computer: TopNComputer<_, _, false> = TopNComputer::new(limit);
for (feature, doc) in &docs {
computer.push(*feature, *doc);
}
let mut comparable_docs = docs.into_iter().map(|(feature, doc)| ComparableDoc { feature, doc }).collect::<Vec<_>>();
comparable_docs.sort();
comparable_docs.truncate(limit);
prop_assert_eq!(
computer.into_sorted_vec(),
comparable_docs,
);
}
#[test]
fn test_topn_computer_desc_prop(
limit in 0..10_usize,
docs in proptest::collection::vec((0..100_u64, 0..100_u64), 0..100_usize),
) {
let mut computer: TopNComputer<_, _, true> = TopNComputer::new(limit);
for (feature, doc) in &docs {
computer.push(*feature, *doc);
}
let mut comparable_docs = docs.into_iter().map(|(feature, doc)| ComparableDoc { feature, doc }).collect::<Vec<_>>();
comparable_docs.sort();
comparable_docs.truncate(limit);
prop_assert_eq!(
computer.into_sorted_vec(),
comparable_docs,
);
}
}
#[test] #[test]
fn test_top_collector_not_at_capacity_without_offset() -> crate::Result<()> { fn test_top_collector_not_at_capacity_without_offset() -> crate::Result<()> {
let index = make_index()?; let index = make_index()?;
@@ -1416,29 +1373,4 @@ mod tests {
); );
Ok(()) Ok(())
} }
#[test]
fn test_topn_computer_asc() {
let mut computer: TopNComputer<u32, u32, false> = TopNComputer::new(2);
computer.push(1u32, 1u32);
computer.push(2u32, 2u32);
computer.push(3u32, 3u32);
computer.push(2u32, 4u32);
computer.push(4u32, 5u32);
computer.push(1u32, 6u32);
assert_eq!(
computer.into_sorted_vec(),
&[
ComparableDoc {
feature: 1u32,
doc: 1u32,
},
ComparableDoc {
feature: 1u32,
doc: 6u32,
}
]
);
}
} }
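
The `TopNComputer` hunks and the removed `test_topn_computer_asc` above revolve around one idea: buffer up to roughly `2 * n` candidates, truncate to the best `n` whenever the buffer overflows, and remember the worst retained score as a cheap rejection threshold. An ascending-order, standard-library sketch of that idea (the real type is additionally generic over score, doc and sort direction):

```rust
struct TopN {
    top_n: usize,
    buffer: Vec<(u64, u32)>, // (score, doc)
    threshold: Option<u64>,
}

impl TopN {
    fn new(top_n: usize) -> Self {
        TopN { top_n, buffer: Vec::with_capacity(2 * top_n), threshold: None }
    }

    fn push(&mut self, score: u64, doc: u32) {
        if let Some(threshold) = self.threshold {
            if score > threshold {
                return; // cannot make it into the ascending top-n
            }
        }
        self.buffer.push((score, doc));
        if self.buffer.len() >= 2 * self.top_n {
            self.truncate_top_n();
        }
    }

    fn truncate_top_n(&mut self) {
        self.buffer.sort_unstable();
        self.buffer.truncate(self.top_n);
        // Worst retained score becomes the new rejection threshold.
        self.threshold = self.buffer.last().map(|&(score, _)| score);
    }

    fn into_sorted_vec(mut self) -> Vec<(u64, u32)> {
        self.buffer.sort_unstable();
        self.buffer.truncate(self.top_n);
        self.buffer
    }
}

fn main() {
    let mut c = TopN::new(2);
    for (score, doc) in [(1, 1), (2, 2), (3, 3), (2, 4), (4, 5), (1, 6)] {
        c.push(score, doc);
    }
    // Matches the removed `test_topn_computer_asc` expectation: the two docs with score 1.
    assert_eq!(c.into_sorted_vec(), vec![(1, 1), (1, 6)]);
}
```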

View File

@@ -41,12 +41,16 @@ impl Executor {
/// ///
/// Regardless of the executor (`SingleThread` or `ThreadPool`), panics in the task /// Regardless of the executor (`SingleThread` or `ThreadPool`), panics in the task
/// will propagate to the caller. /// will propagate to the caller.
pub fn map<A, R, F>(&self, f: F, args: impl Iterator<Item = A>) -> crate::Result<Vec<R>> pub fn map<
where
A: Send, A: Send,
R: Send, R: Send,
AIterator: Iterator<Item = A>,
F: Sized + Sync + Fn(A) -> crate::Result<R>, F: Sized + Sync + Fn(A) -> crate::Result<R>,
{ >(
&self,
f: F,
args: AIterator,
) -> crate::Result<Vec<R>> {
match self { match self {
Executor::SingleThread => args.map(f).collect::<crate::Result<_>>(), Executor::SingleThread => args.map(f).collect::<crate::Result<_>>(),
Executor::ThreadPool(pool) => { Executor::ThreadPool(pool) => {

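
The `Executor::map` hunk above only rewrites how the generics are declared; an explicit iterator type parameter and an `impl Iterator<Item = A>` argument are equivalent at the call site (the `impl Trait` form merely cannot be named with turbofish). A small sketch unrelated to tantivy's executor:

```rust
fn map_explicit<A, R, I, F>(f: F, args: I) -> Result<Vec<R>, String>
where
    I: Iterator<Item = A>,
    F: Fn(A) -> Result<R, String>,
{
    args.map(f).collect()
}

fn map_impl_trait<A, R, F>(f: F, args: impl Iterator<Item = A>) -> Result<Vec<R>, String>
where
    F: Fn(A) -> Result<R, String>,
{
    args.map(f).collect()
}

fn main() {
    let doubled = map_impl_trait(|x: u32| Ok(x * 2), [1, 2, 3].into_iter()).unwrap();
    assert_eq!(doubled, vec![2, 4, 6]);
    assert_eq!(map_explicit(|x: u32| Ok(x * 2), [1, 2, 3].into_iter()).unwrap(), doubled);
}
```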
View File

@@ -1,9 +1,3 @@
//! The footer is a small metadata structure that is appended at the end of every file.
//!
//! The footer is used to store a checksum of the file content.
//! The footer also stores the version of the index format.
//! This version is used to detect incompatibility between the index and the library version.
use std::io; use std::io;
use std::io::Write; use std::io::Write;
@@ -26,22 +20,20 @@ type CrcHashU32 = u32;
/// A Footer is appended to every file /// A Footer is appended to every file
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)] #[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
pub struct Footer { pub struct Footer {
/// The version of the index format
pub version: Version, pub version: Version,
/// The crc32 hash of the body
pub crc: CrcHashU32, pub crc: CrcHashU32,
} }
impl Footer { impl Footer {
pub(crate) fn new(crc: CrcHashU32) -> Self { pub fn new(crc: CrcHashU32) -> Self {
let version = crate::VERSION.clone(); let version = crate::VERSION.clone();
Footer { version, crc } Footer { version, crc }
} }
pub(crate) fn crc(&self) -> CrcHashU32 { pub fn crc(&self) -> CrcHashU32 {
self.crc self.crc
} }
pub(crate) fn append_footer<W: io::Write>(&self, mut write: &mut W) -> io::Result<()> { pub fn append_footer<W: io::Write>(&self, mut write: &mut W) -> io::Result<()> {
let mut counting_write = CountingWriter::wrap(&mut write); let mut counting_write = CountingWriter::wrap(&mut write);
counting_write.write_all(serde_json::to_string(&self)?.as_ref())?; counting_write.write_all(serde_json::to_string(&self)?.as_ref())?;
let footer_payload_len = counting_write.written_bytes(); let footer_payload_len = counting_write.written_bytes();
@@ -50,7 +42,6 @@ impl Footer {
Ok(()) Ok(())
} }
/// Extracts the tantivy Footer from the file and returns the footer and the rest of the file
pub fn extract_footer(file: FileSlice) -> io::Result<(Footer, FileSlice)> { pub fn extract_footer(file: FileSlice) -> io::Result<(Footer, FileSlice)> {
if file.len() < 4 { if file.len() < 4 {
return Err(io::Error::new( return Err(io::Error::new(

View File

@@ -6,7 +6,7 @@ mod mmap_directory;
mod directory; mod directory;
mod directory_lock; mod directory_lock;
mod file_watcher; mod file_watcher;
pub mod footer; mod footer;
mod managed_directory; mod managed_directory;
mod ram_directory; mod ram_directory;
mod watch_event_router; mod watch_event_router;
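
With `pub mod footer`, the layout handled by `Footer::append_footer` / `extract_footer` a few hunks earlier becomes part of the public surface: the file body is followed by a small JSON footer (index version plus a crc32 of the body) and a trailing footer length. A standard-library sketch of that layout; the JSON content and the 4-byte little-endian length encoding are stand-ins, not the exact on-disk format:

```rust
use std::io::{self, Write};

/// Append a footer string followed by its length to a writer.
fn append_footer<W: Write>(footer_json: &str, wrt: &mut W) -> io::Result<()> {
    wrt.write_all(footer_json.as_bytes())?;
    wrt.write_all(&(footer_json.len() as u32).to_le_bytes())?;
    Ok(())
}

/// Split a file back into (body, footer) by reading the trailing length.
fn extract_footer(file: &[u8]) -> Option<(&[u8], &[u8])> {
    let (rest, len_bytes) = file.split_at(file.len().checked_sub(4)?);
    let footer_len = u32::from_le_bytes(len_bytes.try_into().ok()?) as usize;
    let (body, footer) = rest.split_at(rest.len().checked_sub(footer_len)?);
    Some((body, footer))
}

fn main() -> io::Result<()> {
    let mut file = b"file body".to_vec();
    append_footer(r#"{"version":"0.23.0","crc":12345}"#, &mut file)?;
    let (body, footer) = extract_footer(&file).unwrap();
    assert_eq!(body, b"file body");
    assert_eq!(footer, br#"{"version":"0.23.0","crc":12345}"#);
    Ok(())
}
```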

View File

@@ -942,7 +942,7 @@ mod tests {
let numbers = [100, 200, 300]; let numbers = [100, 200, 300];
let test_range = |range: RangeInclusive<u64>| { let test_range = |range: RangeInclusive<u64>| {
let expected_count = numbers.iter().filter(|num| range.contains(*num)).count(); let expected_count = numbers.iter().filter(|num| range.contains(num)).count();
let mut vec = vec![]; let mut vec = vec![];
field.get_row_ids_for_value_range(range, 0..u32::MAX, &mut vec); field.get_row_ids_for_value_range(range, 0..u32::MAX, &mut vec);
assert_eq!(vec.len(), expected_count); assert_eq!(vec.len(), expected_count);
@@ -1020,7 +1020,7 @@ mod tests {
let numbers = [1000, 1001, 1003]; let numbers = [1000, 1001, 1003];
let test_range = |range: RangeInclusive<u64>| { let test_range = |range: RangeInclusive<u64>| {
let expected_count = numbers.iter().filter(|num| range.contains(*num)).count(); let expected_count = numbers.iter().filter(|num| range.contains(num)).count();
let mut vec = vec![]; let mut vec = vec![];
field.get_row_ids_for_value_range(range, 0..u32::MAX, &mut vec); field.get_row_ids_for_value_range(range, 0..u32::MAX, &mut vec);
assert_eq!(vec.len(), expected_count); assert_eq!(vec.len(), expected_count);

View File

@@ -217,7 +217,7 @@ impl FastFieldReaders {
Ok(dynamic_column.into()) Ok(dynamic_column.into())
} }
/// Returns a `dynamic_column_handle`. /// Returning a `dynamic_column_handle`.
pub fn dynamic_column_handle( pub fn dynamic_column_handle(
&self, &self,
field_name: &str, field_name: &str,
@@ -234,7 +234,7 @@ impl FastFieldReaders {
Ok(dynamic_column_handle_opt) Ok(dynamic_column_handle_opt)
} }
/// Returns all `dynamic_column_handle` that match the given field name. /// Returning all `dynamic_column_handle`.
pub fn dynamic_column_handles( pub fn dynamic_column_handles(
&self, &self,
field_name: &str, field_name: &str,
@@ -250,22 +250,6 @@ impl FastFieldReaders {
Ok(dynamic_column_handles) Ok(dynamic_column_handles)
} }
/// Returns all `dynamic_column_handle` that are inner fields of the provided JSON path.
pub fn dynamic_subpath_column_handles(
&self,
root_path: &str,
) -> crate::Result<Vec<DynamicColumnHandle>> {
let Some(resolved_field_name) = self.resolve_field(root_path)? else {
return Ok(Vec::new());
};
let dynamic_column_handles = self
.columnar
.read_subpath_columns(&resolved_field_name)?
.into_iter()
.collect();
Ok(dynamic_column_handles)
}
#[doc(hidden)] #[doc(hidden)]
pub async fn list_dynamic_column_handles( pub async fn list_dynamic_column_handles(
&self, &self,
@@ -281,21 +265,6 @@ impl FastFieldReaders {
Ok(columns) Ok(columns)
} }
#[doc(hidden)]
pub async fn list_subpath_dynamic_column_handles(
&self,
root_path: &str,
) -> crate::Result<Vec<DynamicColumnHandle>> {
let Some(resolved_field_name) = self.resolve_field(root_path)? else {
return Ok(Vec::new());
};
let columns = self
.columnar
.read_subpath_columns_async(&resolved_field_name)
.await?;
Ok(columns)
}
/// Returns the `u64` column used to represent any `u64`-mapped typed (String/Bytes term ids, /// Returns the `u64` column used to represent any `u64`-mapped typed (String/Bytes term ids,
/// i64, u64, f64, DateTime). /// i64, u64, f64, DateTime).
/// ///
@@ -507,15 +476,6 @@ mod tests {
.iter() .iter()
.any(|column| column.column_type() == ColumnType::Str)); .any(|column| column.column_type() == ColumnType::Str));
let json_columns = fast_fields.dynamic_column_handles("json").unwrap(); println!("*** {:?}", fast_fields.columnar().list_columns());
assert_eq!(json_columns.len(), 0);
let json_subcolumns = fast_fields.dynamic_subpath_column_handles("json").unwrap();
assert_eq!(json_subcolumns.len(), 3);
let foo_subcolumns = fast_fields
.dynamic_subpath_column_handles("json.foo")
.unwrap();
assert_eq!(foo_subcolumns.len(), 0);
} }
} }

View File

@@ -15,9 +15,7 @@ use crate::directory::MmapDirectory;
use crate::directory::{Directory, ManagedDirectory, RamDirectory, INDEX_WRITER_LOCK}; use crate::directory::{Directory, ManagedDirectory, RamDirectory, INDEX_WRITER_LOCK};
use crate::error::{DataCorruption, TantivyError}; use crate::error::{DataCorruption, TantivyError};
use crate::index::{IndexMeta, SegmentId, SegmentMeta, SegmentMetaInventory}; use crate::index::{IndexMeta, SegmentId, SegmentMeta, SegmentMetaInventory};
use crate::indexer::index_writer::{ use crate::indexer::index_writer::{MAX_NUM_THREAD, MEMORY_BUDGET_NUM_BYTES_MIN};
IndexWriterOptions, MAX_NUM_THREAD, MEMORY_BUDGET_NUM_BYTES_MIN,
};
use crate::indexer::segment_updater::save_metas; use crate::indexer::segment_updater::save_metas;
use crate::indexer::{IndexWriter, SingleSegmentIndexWriter}; use crate::indexer::{IndexWriter, SingleSegmentIndexWriter};
use crate::reader::{IndexReader, IndexReaderBuilder}; use crate::reader::{IndexReader, IndexReaderBuilder};
@@ -521,43 +519,6 @@ impl Index {
load_metas(self.directory(), &self.inventory) load_metas(self.directory(), &self.inventory)
} }
/// Open a new index writer with the given options. Attempts to acquire a lockfile.
///
/// The lockfile should be deleted on drop, but it is possible
/// that due to a panic or other error, a stale lockfile will be
/// left in the index directory. If you are sure that no other
/// `IndexWriter` on the system is accessing the index directory,
/// it is safe to manually delete the lockfile.
///
/// - `options` defines the writer configuration which includes things like buffer sizes,
/// indexer threads, etc...
///
/// # Errors
/// If the lockfile already exists, returns `TantivyError::LockFailure`.
/// If the memory arena per thread is too small or too big, returns
/// `TantivyError::InvalidArgument`
pub fn writer_with_options<D: Document>(
&self,
options: IndexWriterOptions,
) -> crate::Result<IndexWriter<D>> {
let directory_lock = self
.directory
.acquire_lock(&INDEX_WRITER_LOCK)
.map_err(|err| {
TantivyError::LockFailure(
err,
Some(
"Failed to acquire index lock. If you are using a regular directory, this \
means there is already an `IndexWriter` working on this `Directory`, in \
this process or in a different process."
.to_string(),
),
)
})?;
IndexWriter::new(self, options, directory_lock)
}
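A sketch of how the removed entry point is meant to be used, combining it with the `IndexWriterOptions` builder that appears later in this diff; `index` is an assumed `Index`, the memory budget is just an example value within the accepted range, and error handling is elided:
    let options = IndexWriterOptions::builder()
        .num_worker_threads(2)
        .memory_budget_per_thread(64 * 1024 * 1024)
        .build();
    let writer: IndexWriter<TantivyDocument> = index.writer_with_options(options)?;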
/// Open a new index writer. Attempts to acquire a lockfile. /// Open a new index writer. Attempts to acquire a lockfile.
/// ///
/// The lockfile should be deleted on drop, but it is possible /// The lockfile should be deleted on drop, but it is possible
@@ -582,12 +543,27 @@ impl Index {
num_threads: usize, num_threads: usize,
overall_memory_budget_in_bytes: usize, overall_memory_budget_in_bytes: usize,
) -> crate::Result<IndexWriter<D>> { ) -> crate::Result<IndexWriter<D>> {
let directory_lock = self
.directory
.acquire_lock(&INDEX_WRITER_LOCK)
.map_err(|err| {
TantivyError::LockFailure(
err,
Some(
"Failed to acquire index lock. If you are using a regular directory, this \
means there is already an `IndexWriter` working on this `Directory`, in \
this process or in a different process."
.to_string(),
),
)
})?;
let memory_arena_in_bytes_per_thread = overall_memory_budget_in_bytes / num_threads; let memory_arena_in_bytes_per_thread = overall_memory_budget_in_bytes / num_threads;
let options = IndexWriterOptions::builder() IndexWriter::new(
.num_worker_threads(num_threads) self,
.memory_budget_per_thread(memory_arena_in_bytes_per_thread) num_threads,
.build(); memory_arena_in_bytes_per_thread,
self.writer_with_options(options) directory_lock,
)
} }
/// Helper to create an index writer for tests. /// Helper to create an index writer for tests.

View File

@@ -3,12 +3,6 @@ use std::io;
use common::json_path_writer::JSON_END_OF_PATH; use common::json_path_writer::JSON_END_OF_PATH;
use common::BinarySerializable; use common::BinarySerializable;
use fnv::FnvHashSet; use fnv::FnvHashSet;
#[cfg(feature = "quickwit")]
use futures_util::{FutureExt, StreamExt, TryStreamExt};
#[cfg(feature = "quickwit")]
use itertools::Itertools;
#[cfg(feature = "quickwit")]
use tantivy_fst::automaton::{AlwaysMatch, Automaton};
use crate::directory::FileSlice; use crate::directory::FileSlice;
use crate::positions::PositionReader; use crate::positions::PositionReader;
@@ -225,18 +219,13 @@ impl InvertedIndexReader {
self.termdict.get_async(term.serialized_value_bytes()).await self.termdict.get_async(term.serialized_value_bytes()).await
} }
async fn get_term_range_async<'a, A: Automaton + 'a>( async fn get_term_range_async(
&'a self, &self,
terms: impl std::ops::RangeBounds<Term>, terms: impl std::ops::RangeBounds<Term>,
automaton: A,
limit: Option<u64>, limit: Option<u64>,
merge_holes_under_bytes: usize, ) -> io::Result<impl Iterator<Item = TermInfo> + '_> {
) -> io::Result<impl Iterator<Item = TermInfo> + 'a>
where
A::State: Clone,
{
use std::ops::Bound; use std::ops::Bound;
let range_builder = self.termdict.search(automaton); let range_builder = self.termdict.range();
let range_builder = match terms.start_bound() { let range_builder = match terms.start_bound() {
Bound::Included(bound) => range_builder.ge(bound.serialized_value_bytes()), Bound::Included(bound) => range_builder.ge(bound.serialized_value_bytes()),
Bound::Excluded(bound) => range_builder.gt(bound.serialized_value_bytes()), Bound::Excluded(bound) => range_builder.gt(bound.serialized_value_bytes()),
@@ -253,9 +242,7 @@ impl InvertedIndexReader {
range_builder range_builder
}; };
let mut stream = range_builder let mut stream = range_builder.into_stream_async().await?;
.into_stream_async_merging_holes(merge_holes_under_bytes)
.await?;
let iter = std::iter::from_fn(move || stream.next().map(|(_k, v)| v.clone())); let iter = std::iter::from_fn(move || stream.next().map(|(_k, v)| v.clone()));
@@ -301,9 +288,7 @@ impl InvertedIndexReader {
limit: Option<u64>, limit: Option<u64>,
with_positions: bool, with_positions: bool,
) -> io::Result<bool> { ) -> io::Result<bool> {
let mut term_info = self let mut term_info = self.get_term_range_async(terms, limit).await?;
.get_term_range_async(terms, AlwaysMatch, limit, 0)
.await?;
let Some(first_terminfo) = term_info.next() else { let Some(first_terminfo) = term_info.next() else {
// no key matches, nothing more to load // no key matches, nothing more to load
@@ -330,84 +315,6 @@ impl InvertedIndexReader {
Ok(true) Ok(true)
} }
/// Warms up the block postings for all terms matching a given automaton.
/// This method is for advanced usage only.
///
/// Returns a boolean indicating whether a term matching the automaton was found in the dictionary.
pub async fn warm_postings_automaton<
A: Automaton + Clone + Send + 'static,
E: FnOnce(Box<dyn FnOnce() -> io::Result<()> + Send>) -> F,
F: std::future::Future<Output = io::Result<()>>,
>(
&self,
automaton: A,
// with_positions: bool, at the moment we have no use for it, and supporting it would add
// complexity to the coalesce
executor: E,
) -> io::Result<bool>
where
A::State: Clone,
{
// merge holes under 4MiB, that's how many bytes we can hope to receive during a TTFB from
// S3 (~80MiB/s, and 50ms latency)
const MERGE_HOLES_UNDER_BYTES: usize = (80 * 1024 * 1024 * 50) / 1000;
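        // Editor's note (illustrative check, not part of the original diff): 80 MiB/s sustained
        // over a 50 ms round-trip is 80 * 1024 * 1024 * 50 / 1000 = 4_194_304 bytes, i.e. exactly
        // 4 MiB, which is the value the constant above evaluates to.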
// we build a first iterator to download everything. Simply calling the function already
// downloads everything we need from the sstable, but doesn't start iterating over it.
let _term_info_iter = self
.get_term_range_async(.., automaton.clone(), None, MERGE_HOLES_UNDER_BYTES)
.await?;
let (sender, posting_ranges_to_load_stream) = futures_channel::mpsc::unbounded();
let termdict = self.termdict.clone();
let cpu_bound_task = move || {
// then we build a 2nd iterator, this one with no holes, so we don't go through blocks
// we can't match.
// This makes the assumption that there is a caching layer below us, which gives sync reads
// for free after the initial async access. This might not always be true, but it is in
// Quickwit.
// We build things from this closure because otherwise we get into lifetime issues that
// can only be solved with self-referential structs. Returning an io::Result from here is
// a bit more leaky abstraction-wise, but a lot better than the alternative.
let mut stream = termdict.search(automaton).into_stream()?;
// we could do without an iterator, but this allows us access to coalesce, which simplifies
// things
let posting_ranges_iter =
std::iter::from_fn(move || stream.next().map(|(_k, v)| v.postings_range.clone()));
let merged_posting_ranges_iter = posting_ranges_iter.coalesce(|range1, range2| {
if range1.end + MERGE_HOLES_UNDER_BYTES >= range2.start {
Ok(range1.start..range2.end)
} else {
Err((range1, range2))
}
});
for posting_range in merged_posting_ranges_iter {
if let Err(_) = sender.unbounded_send(posting_range) {
// this should happen only when search is cancelled
return Err(io::Error::other("failed to send posting range back"));
}
}
Ok(())
};
let task_handle = executor(Box::new(cpu_bound_task));
let posting_downloader = posting_ranges_to_load_stream
.map(|posting_slice| {
self.postings_file_slice
.read_bytes_slice_async(posting_slice)
.map(|result| result.map(|_slice| ()))
})
.buffer_unordered(5)
.try_collect::<Vec<()>>();
let (_, slices_downloaded) =
futures_util::future::try_join(task_handle, posting_downloader).await?;
Ok(!slices_downloaded.is_empty())
}
/// Warmup the block postings for all terms. /// Warmup the block postings for all terms.
/// This method is for advanced usage only. /// This method is for advanced usage only.
/// ///

View File

@@ -45,23 +45,6 @@ fn error_in_index_worker_thread(context: &str) -> TantivyError {
)) ))
} }
#[derive(Clone, bon::Builder)]
/// A builder for creating a new [IndexWriter] for an index.
pub struct IndexWriterOptions {
#[builder(default = MEMORY_BUDGET_NUM_BYTES_MIN)]
/// The memory budget per indexer thread.
///
/// When an indexer thread has buffered this much data in memory,
/// it will flush the segment to disk (although the data is not searchable until commit is called).
memory_budget_per_thread: usize,
#[builder(default = 1)]
/// The number of indexer worker threads to use.
num_worker_threads: usize,
#[builder(default = 4)]
/// Defines the number of merger threads to use.
num_merge_threads: usize,
}
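Assuming the bon-generated builder honours the `#[builder(default = ...)]` attributes above, the defaults can be spelled out explicitly, for illustration:
    let implicit = IndexWriterOptions::builder().build();
    // equivalent to:
    let explicit = IndexWriterOptions::builder()
        .memory_budget_per_thread(MEMORY_BUDGET_NUM_BYTES_MIN)
        .num_worker_threads(1)
        .num_merge_threads(4)
        .build();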
/// `IndexWriter` is the user entry-point to add documents to an index. /// `IndexWriter` is the user entry-point to add documents to an index.
/// ///
/// It manages a small number of indexing threads, as well as a shared /// It manages a small number of indexing threads, as well as a shared
@@ -75,7 +58,8 @@ pub struct IndexWriter<D: Document = TantivyDocument> {
index: Index, index: Index,
options: IndexWriterOptions, // The memory budget per thread, after which a commit is triggered.
memory_budget_in_bytes_per_thread: usize,
workers_join_handle: Vec<JoinHandle<crate::Result<()>>>, workers_join_handle: Vec<JoinHandle<crate::Result<()>>>,
@@ -86,6 +70,8 @@ pub struct IndexWriter<D: Document = TantivyDocument> {
worker_id: usize, worker_id: usize,
num_threads: usize,
delete_queue: DeleteQueue, delete_queue: DeleteQueue,
stamper: Stamper, stamper: Stamper,
@@ -279,27 +265,23 @@ impl<D: Document> IndexWriter<D> {
/// `TantivyError::InvalidArgument` /// `TantivyError::InvalidArgument`
pub(crate) fn new( pub(crate) fn new(
index: &Index, index: &Index,
options: IndexWriterOptions, num_threads: usize,
memory_budget_in_bytes_per_thread: usize,
directory_lock: DirectoryLock, directory_lock: DirectoryLock,
) -> crate::Result<Self> { ) -> crate::Result<Self> {
if options.memory_budget_per_thread < MEMORY_BUDGET_NUM_BYTES_MIN { if memory_budget_in_bytes_per_thread < MEMORY_BUDGET_NUM_BYTES_MIN {
let err_msg = format!( let err_msg = format!(
"The memory arena in bytes per thread needs to be at least \ "The memory arena in bytes per thread needs to be at least \
{MEMORY_BUDGET_NUM_BYTES_MIN}." {MEMORY_BUDGET_NUM_BYTES_MIN}."
); );
return Err(TantivyError::InvalidArgument(err_msg)); return Err(TantivyError::InvalidArgument(err_msg));
} }
if options.memory_budget_per_thread >= MEMORY_BUDGET_NUM_BYTES_MAX { if memory_budget_in_bytes_per_thread >= MEMORY_BUDGET_NUM_BYTES_MAX {
let err_msg = format!( let err_msg = format!(
"The memory arena in bytes per thread cannot exceed {MEMORY_BUDGET_NUM_BYTES_MAX}" "The memory arena in bytes per thread cannot exceed {MEMORY_BUDGET_NUM_BYTES_MAX}"
); );
return Err(TantivyError::InvalidArgument(err_msg)); return Err(TantivyError::InvalidArgument(err_msg));
} }
if options.num_worker_threads == 0 {
let err_msg = "At least one worker thread is required, got 0".to_string();
return Err(TantivyError::InvalidArgument(err_msg));
}
let (document_sender, document_receiver) = let (document_sender, document_receiver) =
crossbeam_channel::bounded(PIPELINE_MAX_SIZE_IN_DOCS); crossbeam_channel::bounded(PIPELINE_MAX_SIZE_IN_DOCS);
@@ -309,17 +291,13 @@ impl<D: Document> IndexWriter<D> {
let stamper = Stamper::new(current_opstamp); let stamper = Stamper::new(current_opstamp);
let segment_updater = SegmentUpdater::create( let segment_updater =
index.clone(), SegmentUpdater::create(index.clone(), stamper.clone(), &delete_queue.cursor())?;
stamper.clone(),
&delete_queue.cursor(),
options.num_merge_threads,
)?;
let mut index_writer = Self { let mut index_writer = Self {
_directory_lock: Some(directory_lock), _directory_lock: Some(directory_lock),
options: options.clone(), memory_budget_in_bytes_per_thread,
index: index.clone(), index: index.clone(),
index_writer_status: IndexWriterStatus::from(document_receiver), index_writer_status: IndexWriterStatus::from(document_receiver),
operation_sender: document_sender, operation_sender: document_sender,
@@ -327,6 +305,7 @@ impl<D: Document> IndexWriter<D> {
segment_updater, segment_updater,
workers_join_handle: vec![], workers_join_handle: vec![],
num_threads,
delete_queue, delete_queue,
@@ -419,7 +398,7 @@ impl<D: Document> IndexWriter<D> {
let mut delete_cursor = self.delete_queue.cursor(); let mut delete_cursor = self.delete_queue.cursor();
let mem_budget = self.options.memory_budget_per_thread; let mem_budget = self.memory_budget_in_bytes_per_thread;
let index = self.index.clone(); let index = self.index.clone();
let join_handle: JoinHandle<crate::Result<()>> = thread::Builder::new() let join_handle: JoinHandle<crate::Result<()>> = thread::Builder::new()
.name(format!("thrd-tantivy-index{}", self.worker_id)) .name(format!("thrd-tantivy-index{}", self.worker_id))
@@ -472,7 +451,7 @@ impl<D: Document> IndexWriter<D> {
} }
fn start_workers(&mut self) -> crate::Result<()> { fn start_workers(&mut self) -> crate::Result<()> {
for _ in 0..self.options.num_worker_threads { for _ in 0..self.num_threads {
self.add_indexing_worker()?; self.add_indexing_worker()?;
} }
Ok(()) Ok(())
@@ -574,7 +553,12 @@ impl<D: Document> IndexWriter<D> {
.take() .take()
.expect("The IndexWriter does not have any lock. This is a bug, please report."); .expect("The IndexWriter does not have any lock. This is a bug, please report.");
let new_index_writer = IndexWriter::new(&self.index, self.options.clone(), directory_lock)?; let new_index_writer = IndexWriter::new(
&self.index,
self.num_threads,
self.memory_budget_in_bytes_per_thread,
directory_lock,
)?;
// the current `self` is dropped right away because of this call. // the current `self` is dropped right away because of this call.
// //
@@ -828,7 +812,7 @@ mod tests {
use crate::directory::error::LockError; use crate::directory::error::LockError;
use crate::error::*; use crate::error::*;
use crate::indexer::index_writer::MEMORY_BUDGET_NUM_BYTES_MIN; use crate::indexer::index_writer::MEMORY_BUDGET_NUM_BYTES_MIN;
use crate::indexer::{IndexWriterOptions, NoMergePolicy}; use crate::indexer::NoMergePolicy;
use crate::query::{QueryParser, TermQuery}; use crate::query::{QueryParser, TermQuery};
use crate::schema::{ use crate::schema::{
self, Facet, FacetOptions, IndexRecordOption, IpAddrOptions, JsonObjectOptions, self, Facet, FacetOptions, IndexRecordOption, IpAddrOptions, JsonObjectOptions,
@@ -2549,36 +2533,4 @@ mod tests {
index_writer.commit().unwrap(); index_writer.commit().unwrap();
Ok(()) Ok(())
} }
#[test]
fn test_writer_options_validation() {
let mut schema_builder = Schema::builder();
let _field = schema_builder.add_bool_field("example", STORED);
let index = Index::create_in_ram(schema_builder.build());
let opt_wo_threads = IndexWriterOptions::builder().num_worker_threads(0).build();
let result = index.writer_with_options::<TantivyDocument>(opt_wo_threads);
assert!(result.is_err(), "Writer should reject 0 thread count");
assert!(matches!(result, Err(TantivyError::InvalidArgument(_))));
let opt_with_low_memory = IndexWriterOptions::builder()
.memory_budget_per_thread(10 << 10)
.build();
let result = index.writer_with_options::<TantivyDocument>(opt_with_low_memory);
assert!(
result.is_err(),
"Writer should reject options with too low memory size"
);
assert!(matches!(result, Err(TantivyError::InvalidArgument(_))));
let opt_with_low_memory = IndexWriterOptions::builder()
.memory_budget_per_thread(5 << 30)
.build();
let result = index.writer_with_options::<TantivyDocument>(opt_with_low_memory);
assert!(
result.is_err(),
"Writer should reject options with too high memory size"
);
assert!(matches!(result, Err(TantivyError::InvalidArgument(_))));
}
} }

View File

@@ -31,7 +31,7 @@ mod stamper;
use crossbeam_channel as channel; use crossbeam_channel as channel;
use smallvec::SmallVec; use smallvec::SmallVec;
pub use self::index_writer::{IndexWriter, IndexWriterOptions}; pub use self::index_writer::IndexWriter;
pub use self::log_merge_policy::LogMergePolicy; pub use self::log_merge_policy::LogMergePolicy;
pub use self::merge_operation::MergeOperation; pub use self::merge_operation::MergeOperation;
pub use self::merge_policy::{MergeCandidate, MergePolicy, NoMergePolicy}; pub use self::merge_policy::{MergeCandidate, MergePolicy, NoMergePolicy};

View File

@@ -1,4 +1,3 @@
use std::any::Any;
use std::borrow::BorrowMut; use std::borrow::BorrowMut;
use std::collections::HashSet; use std::collections::HashSet;
use std::io::Write; use std::io::Write;
@@ -24,9 +23,9 @@ use crate::indexer::{
DefaultMergePolicy, MergeCandidate, MergeOperation, MergePolicy, SegmentEntry, DefaultMergePolicy, MergeCandidate, MergeOperation, MergePolicy, SegmentEntry,
SegmentSerializer, SegmentSerializer,
}; };
use crate::{FutureResult, Opstamp, TantivyError}; use crate::{FutureResult, Opstamp};
const PANIC_CAUGHT: &str = "Panic caught in merge thread"; const NUM_MERGE_THREADS: usize = 4;
/// Save the index meta file. /// Save the index meta file.
/// This operation is atomic: /// This operation is atomic:
@@ -274,7 +273,6 @@ impl SegmentUpdater {
index: Index, index: Index,
stamper: Stamper, stamper: Stamper,
delete_cursor: &DeleteCursor, delete_cursor: &DeleteCursor,
num_merge_threads: usize,
) -> crate::Result<SegmentUpdater> { ) -> crate::Result<SegmentUpdater> {
let segments = index.searchable_segment_metas()?; let segments = index.searchable_segment_metas()?;
let segment_manager = SegmentManager::from_segments(segments, delete_cursor); let segment_manager = SegmentManager::from_segments(segments, delete_cursor);
@@ -289,16 +287,7 @@ impl SegmentUpdater {
})?; })?;
let merge_thread_pool = ThreadPoolBuilder::new() let merge_thread_pool = ThreadPoolBuilder::new()
.thread_name(|i| format!("merge_thread_{i}")) .thread_name(|i| format!("merge_thread_{i}"))
.num_threads(num_merge_threads) .num_threads(NUM_MERGE_THREADS)
.panic_handler(move |panic| {
// We don't print the panic content itself,
// it is already printed during the unwinding
if let Some(message) = panic.downcast_ref::<&str>() {
if *message != PANIC_CAUGHT {
error!("uncaught merge panic")
}
}
})
.build() .build()
.map_err(|_| { .map_err(|_| {
crate::TantivyError::SystemError( crate::TantivyError::SystemError(
@@ -518,34 +507,11 @@ impl SegmentUpdater {
// Its lifetime is used to track how many merging thread are currently running, // Its lifetime is used to track how many merging thread are currently running,
// as well as which segment is currently in merge and therefore should not be // as well as which segment is currently in merge and therefore should not be
// candidate for another merge. // candidate for another merge.
let merge_panic_res = std::panic::catch_unwind(std::panic::AssertUnwindSafe(|| { match merge(
merge( &segment_updater.index,
&segment_updater.index, segment_entries,
segment_entries, merge_operation.target_opstamp(),
merge_operation.target_opstamp(), ) {
)
}));
let merge_res = match merge_panic_res {
Ok(merge_res) => merge_res,
Err(panic_err) => {
let panic_str = if let Some(msg) = panic_err.downcast_ref::<&str>() {
*msg
} else if let Some(msg) = panic_err.downcast_ref::<String>() {
msg.as_str()
} else {
"UNKNOWN"
};
let _send_result = merging_future_send.send(Err(TantivyError::SystemError(
format!("Merge thread panicked: {panic_str}"),
)));
// Resume unwinding because we forced unwind safety with
// `std::panic::AssertUnwindSafe`. Use a specific message so
// the panic_handler can double check that we properly caught the panic.
let boxed_panic_message: Box<dyn Any + Send> = Box::new(PANIC_CAUGHT);
std::panic::resume_unwind(boxed_panic_message);
}
};
match merge_res {
Ok(after_merge_segment_entry) => { Ok(after_merge_segment_entry) => {
let res = segment_updater.end_merge(merge_operation, after_merge_segment_entry); let res = segment_updater.end_merge(merge_operation, after_merge_segment_entry);
let _send_result = merging_future_send.send(res); let _send_result = merging_future_send.send(res);

View File

@@ -422,7 +422,6 @@ mod tests {
use std::collections::BTreeMap; use std::collections::BTreeMap;
use std::path::{Path, PathBuf}; use std::path::{Path, PathBuf};
use columnar::ColumnType;
use tempfile::TempDir; use tempfile::TempDir;
use crate::collector::{Count, TopDocs}; use crate::collector::{Count, TopDocs};
@@ -432,15 +431,15 @@ mod tests {
use crate::query::{PhraseQuery, QueryParser}; use crate::query::{PhraseQuery, QueryParser};
use crate::schema::{ use crate::schema::{
Document, IndexRecordOption, OwnedValue, Schema, TextFieldIndexing, TextOptions, Value, Document, IndexRecordOption, OwnedValue, Schema, TextFieldIndexing, TextOptions, Value,
DATE_TIME_PRECISION_INDEXED, FAST, STORED, STRING, TEXT, DATE_TIME_PRECISION_INDEXED, STORED, STRING, TEXT,
}; };
use crate::store::{Compressor, StoreReader, StoreWriter}; use crate::store::{Compressor, StoreReader, StoreWriter};
use crate::time::format_description::well_known::Rfc3339; use crate::time::format_description::well_known::Rfc3339;
use crate::time::OffsetDateTime; use crate::time::OffsetDateTime;
use crate::tokenizer::{PreTokenizedString, Token}; use crate::tokenizer::{PreTokenizedString, Token};
use crate::{ use crate::{
DateTime, Directory, DocAddress, DocSet, Index, IndexWriter, SegmentReader, DateTime, Directory, DocAddress, DocSet, Index, IndexWriter, TantivyDocument, Term,
TantivyDocument, Term, TERMINATED, TERMINATED,
}; };
#[test] #[test]
@@ -842,75 +841,6 @@ mod tests {
assert_eq!(searcher.search(&phrase_query, &Count).unwrap(), 0); assert_eq!(searcher.search(&phrase_query, &Count).unwrap(), 0);
} }
#[test]
fn test_json_fast() {
let mut schema_builder = Schema::builder();
let json_field = schema_builder.add_json_field("json", FAST);
let schema = schema_builder.build();
let json_val: serde_json::Value = serde_json::from_str(
r#"{
"toto": "titi",
"float": -0.2,
"bool": true,
"unsigned": 1,
"signed": -2,
"complexobject": {
"field.with.dot": 1
},
"date": "1985-04-12T23:20:50.52Z",
"my_arr": [2, 3, {"my_key": "two tokens"}, 4]
}"#,
)
.unwrap();
let doc = doc!(json_field=>json_val.clone());
let index = Index::create_in_ram(schema.clone());
let mut writer = index.writer_for_tests().unwrap();
writer.add_document(doc).unwrap();
writer.commit().unwrap();
let reader = index.reader().unwrap();
let searcher = reader.searcher();
let segment_reader = searcher.segment_reader(0u32);
fn assert_type(reader: &SegmentReader, field: &str, typ: ColumnType) {
let cols = reader.fast_fields().dynamic_column_handles(field).unwrap();
assert_eq!(cols.len(), 1, "{}", field);
assert_eq!(cols[0].column_type(), typ, "{}", field);
}
assert_type(segment_reader, "json.toto", ColumnType::Str);
assert_type(segment_reader, "json.float", ColumnType::F64);
assert_type(segment_reader, "json.bool", ColumnType::Bool);
assert_type(segment_reader, "json.unsigned", ColumnType::I64);
assert_type(segment_reader, "json.signed", ColumnType::I64);
assert_type(
segment_reader,
"json.complexobject.field\\.with\\.dot",
ColumnType::I64,
);
assert_type(segment_reader, "json.date", ColumnType::DateTime);
assert_type(segment_reader, "json.my_arr", ColumnType::I64);
assert_type(segment_reader, "json.my_arr.my_key", ColumnType::Str);
fn assert_empty(reader: &SegmentReader, field: &str) {
let cols = reader.fast_fields().dynamic_column_handles(field).unwrap();
assert_eq!(cols.len(), 0);
}
assert_empty(segment_reader, "unknown");
assert_empty(segment_reader, "json");
assert_empty(segment_reader, "json.toto.titi");
let sub_columns = segment_reader
.fast_fields()
.dynamic_subpath_column_handles("json")
.unwrap();
assert_eq!(sub_columns.len(), 9);
let subsub_columns = segment_reader
.fast_fields()
.dynamic_subpath_column_handles("json.complexobject")
.unwrap();
assert_eq!(subsub_columns.len(), 1);
}
#[test] #[test]
fn test_json_term_with_numeric_merge_panic_regression_bug_2283() { fn test_json_term_with_numeric_merge_panic_regression_bug_2283() {
// https://github.com/quickwit-oss/tantivy/issues/2283 // https://github.com/quickwit-oss/tantivy/issues/2283

View File

@@ -1,19 +1,79 @@
use std::ops::Range; use std::ops::Range;
use std::sync::atomic::{AtomicU64, Ordering}; use std::sync::atomic::Ordering;
use std::sync::Arc; use std::sync::Arc;
use crate::Opstamp; use crate::Opstamp;
#[cfg(not(target_arch = "arm"))]
mod atomic_impl {
use std::sync::atomic::{AtomicU64, Ordering};
use crate::Opstamp;
#[derive(Default)]
pub struct AtomicU64Wrapper(AtomicU64);
impl AtomicU64Wrapper {
pub fn new(first_opstamp: Opstamp) -> AtomicU64Wrapper {
AtomicU64Wrapper(AtomicU64::new(first_opstamp))
}
pub fn fetch_add(&self, val: u64, order: Ordering) -> u64 {
self.0.fetch_add(val, order)
}
pub fn revert(&self, val: u64, order: Ordering) -> u64 {
self.0.store(val, order);
val
}
}
}
#[cfg(target_arch = "arm")]
mod atomic_impl {
/// On this architecture, we fall back to an `RwLock` instead of a native 64-bit atomic.
use std::sync::atomic::Ordering;
use std::sync::RwLock;
use crate::Opstamp;
#[derive(Default)]
pub struct AtomicU64Wrapper(RwLock<u64>);
impl AtomicU64Wrapper {
pub fn new(first_opstamp: Opstamp) -> AtomicU64Wrapper {
AtomicU64Wrapper(RwLock::new(first_opstamp))
}
pub fn fetch_add(&self, incr: u64, _order: Ordering) -> u64 {
let mut lock = self.0.write().unwrap();
let previous_val = *lock;
*lock = previous_val + incr;
previous_val
}
pub fn revert(&self, val: u64, _order: Ordering) -> u64 {
let mut lock = self.0.write().unwrap();
*lock = val;
val
}
}
}
use self::atomic_impl::AtomicU64Wrapper;
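A hedged sketch of the wrapper's contract on either architecture (values are illustrative): `fetch_add` returns the previous value, `revert` stores and echoes the new one.
    let counter = AtomicU64Wrapper::new(10);
    assert_eq!(counter.fetch_add(1, Ordering::SeqCst), 10);
    assert_eq!(counter.fetch_add(1, Ordering::SeqCst), 11);
    assert_eq!(counter.revert(10, Ordering::SeqCst), 10);
    assert_eq!(counter.fetch_add(1, Ordering::SeqCst), 10);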
/// Stamper provides Opstamps, which is just an auto-increment id to label /// Stamper provides Opstamps, which is just an auto-increment id to label
/// an operation. /// an operation.
/// ///
/// Cloning does not "fork" the stamp generation. The stamper actually wraps an `Arc`. /// Cloning does not "fork" the stamp generation. The stamper actually wraps an `Arc`.
#[derive(Clone, Default)] #[derive(Clone, Default)]
pub struct Stamper(Arc<AtomicU64>); pub struct Stamper(Arc<AtomicU64Wrapper>);
impl Stamper { impl Stamper {
pub fn new(first_opstamp: Opstamp) -> Stamper { pub fn new(first_opstamp: Opstamp) -> Stamper {
Stamper(Arc::new(AtomicU64::new(first_opstamp))) Stamper(Arc::new(AtomicU64Wrapper::new(first_opstamp)))
} }
pub fn stamp(&self) -> Opstamp { pub fn stamp(&self) -> Opstamp {
@@ -32,8 +92,7 @@ impl Stamper {
/// Reverts the stamper to a given `Opstamp` value and returns it /// Reverts the stamper to a given `Opstamp` value and returns it
pub fn revert(&self, to_opstamp: Opstamp) -> Opstamp { pub fn revert(&self, to_opstamp: Opstamp) -> Opstamp {
self.0.store(to_opstamp, Ordering::SeqCst); self.0.revert(to_opstamp, Ordering::SeqCst)
to_opstamp
} }
} }

View File

@@ -15,7 +15,7 @@ fn encode_bitwidth(bitwidth: u8, delta_1: bool) -> u8 {
} }
fn decode_bitwidth(raw_bitwidth: u8) -> (u8, bool) { fn decode_bitwidth(raw_bitwidth: u8) -> (u8, bool) {
let delta_1 = ((raw_bitwidth >> 6) & 1) != 0; let delta_1 = (raw_bitwidth >> 6 & 1) != 0;
let bitwidth = raw_bitwidth & 0x3f; let bitwidth = raw_bitwidth & 0x3f;
(bitwidth, delta_1) (bitwidth, delta_1)
} }
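A worked example of the layout decoded above: bit 6 carries the `delta_1` flag and the low 6 bits carry the bitwidth (values are illustrative).
    assert_eq!(decode_bitwidth(0b0100_0101), (5, true));
    assert_eq!(decode_bitwidth(0b0000_0101), (5, false));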

View File

@@ -7,32 +7,14 @@ use crate::docset::{DocSet, TERMINATED};
use crate::index::SegmentReader; use crate::index::SegmentReader;
use crate::query::explanation::does_not_match; use crate::query::explanation::does_not_match;
use crate::query::{EnableScoring, Explanation, Query, Scorer, Weight}; use crate::query::{EnableScoring, Explanation, Query, Scorer, Weight};
use crate::schema::Type;
use crate::{DocId, Score, TantivyError}; use crate::{DocId, Score, TantivyError};
/// Query that matches all documents with a non-null value in the specified /// Query that matches all documents with a non-null value in the specified field.
/// field.
///
/// When querying inside a JSON field, "exists" queries can be executed strictly
/// on the field name or can also check all of its subpaths. In the latter case a document
/// will be matched if a non-null value exists in any subpath. For example,
/// assuming the following document where `myfield` is a JSON fast field:
/// ```json
/// {
/// "myfield": {
/// "mysubfield": "hello"
/// }
/// }
/// ```
/// With `json_subpaths` enabled queries on either `myfield` or
/// `myfield.mysubfield` will match the document. If it is set to false, only
/// `myfield.mysubfield` will match it.
/// ///
/// All of the matched documents get the score 1.0. /// All of the matched documents get the score 1.0.
#[derive(Clone, Debug)] #[derive(Clone, Debug)]
pub struct ExistsQuery { pub struct ExistsQuery {
field_name: String, field_name: String,
json_subpaths: bool,
} }
impl ExistsQuery { impl ExistsQuery {
@@ -41,28 +23,8 @@ impl ExistsQuery {
/// This query matches all documents with at least one non-null value in the specified field. /// This query matches all documents with at least one non-null value in the specified field.
/// This constructor never fails, but executing the search with this query will return an /// This constructor never fails, but executing the search with this query will return an
/// error if the specified field doesn't exist or is not a fast field. /// error if the specified field doesn't exist or is not a fast field.
#[deprecated]
pub fn new_exists_query(field: String) -> ExistsQuery { pub fn new_exists_query(field: String) -> ExistsQuery {
ExistsQuery { ExistsQuery { field_name: field }
field_name: field,
json_subpaths: false,
}
}
/// Creates a new `ExistsQuery` from the given field.
///
/// This query matches all documents with at least one non-null value in the
/// specified field. If `json_subpaths` is set to true, documents with
/// non-null values in any JSON subpath will also be matched.
///
/// This constructor never fails, but executing the search with this query will
/// return an error if the specified field doesn't exist or is not a fast
/// field.
pub fn new(field: String, json_subpaths: bool) -> Self {
Self {
field_name: field,
json_subpaths,
}
} }
} }
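An illustrative use of the two constructors above ("myfield" comes from the doc example; `searcher` is an assumed `Searcher` over the index and error handling is elided):
    // matches only documents with a non-null fast value at the exact path
    let strict = ExistsQuery::new("myfield".to_string(), false);
    // additionally matches documents with a value anywhere under "myfield.*"
    let with_subpaths = ExistsQuery::new("myfield".to_string(), true);
    let num_hits = searcher.search(&with_subpaths, &Count)?;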
@@ -81,8 +43,6 @@ impl Query for ExistsQuery {
} }
Ok(Box::new(ExistsWeight { Ok(Box::new(ExistsWeight {
field_name: self.field_name.clone(), field_name: self.field_name.clone(),
field_type: field_type.value_type(),
json_subpaths: self.json_subpaths,
})) }))
} }
} }
@@ -90,20 +50,13 @@ impl Query for ExistsQuery {
/// Weight associated with the `ExistsQuery` query. /// Weight associated with the `ExistsQuery` query.
pub struct ExistsWeight { pub struct ExistsWeight {
field_name: String, field_name: String,
field_type: Type,
json_subpaths: bool,
} }
impl Weight for ExistsWeight { impl Weight for ExistsWeight {
fn scorer(&self, reader: &SegmentReader, boost: Score) -> crate::Result<Box<dyn Scorer>> { fn scorer(&self, reader: &SegmentReader, boost: Score) -> crate::Result<Box<dyn Scorer>> {
let fast_field_reader = reader.fast_fields(); let fast_field_reader = reader.fast_fields();
let mut column_handles = fast_field_reader.dynamic_column_handles(&self.field_name)?; let dynamic_columns: crate::Result<Vec<DynamicColumn>> = fast_field_reader
if self.field_type == Type::Json && self.json_subpaths { .dynamic_column_handles(&self.field_name)?
let mut sub_columns =
fast_field_reader.dynamic_subpath_column_handles(&self.field_name)?;
column_handles.append(&mut sub_columns);
}
let dynamic_columns: crate::Result<Vec<DynamicColumn>> = column_handles
.into_iter() .into_iter()
.map(|handle| handle.open().map_err(|io_error| io_error.into())) .map(|handle| handle.open().map_err(|io_error| io_error.into()))
.collect(); .collect();
@@ -227,12 +180,11 @@ mod tests {
let reader = index.reader()?; let reader = index.reader()?;
let searcher = reader.searcher(); let searcher = reader.searcher();
assert_eq!(count_existing_fields(&searcher, "all", false)?, 100); assert_eq!(count_existing_fields(&searcher, "all")?, 100);
assert_eq!(count_existing_fields(&searcher, "odd", false)?, 50); assert_eq!(count_existing_fields(&searcher, "odd")?, 50);
assert_eq!(count_existing_fields(&searcher, "even", false)?, 50); assert_eq!(count_existing_fields(&searcher, "even")?, 50);
assert_eq!(count_existing_fields(&searcher, "multi", false)?, 10); assert_eq!(count_existing_fields(&searcher, "multi")?, 10);
assert_eq!(count_existing_fields(&searcher, "multi", true)?, 10); assert_eq!(count_existing_fields(&searcher, "never")?, 0);
assert_eq!(count_existing_fields(&searcher, "never", false)?, 0);
// exercise seek // exercise seek
let query = BooleanQuery::intersection(vec![ let query = BooleanQuery::intersection(vec![
@@ -240,7 +192,7 @@ mod tests {
Bound::Included(Term::from_field_u64(all_field, 50)), Bound::Included(Term::from_field_u64(all_field, 50)),
Bound::Unbounded, Bound::Unbounded,
)), )),
Box::new(ExistsQuery::new("even".to_string(), false)), Box::new(ExistsQuery::new_exists_query("even".to_string())),
]); ]);
assert_eq!(searcher.search(&query, &Count)?, 25); assert_eq!(searcher.search(&query, &Count)?, 25);
@@ -249,7 +201,7 @@ mod tests {
Bound::Included(Term::from_field_u64(all_field, 0)), Bound::Included(Term::from_field_u64(all_field, 0)),
Bound::Included(Term::from_field_u64(all_field, 50)), Bound::Included(Term::from_field_u64(all_field, 50)),
)), )),
Box::new(ExistsQuery::new("odd".to_string(), false)), Box::new(ExistsQuery::new_exists_query("odd".to_string())),
]); ]);
assert_eq!(searcher.search(&query, &Count)?, 25); assert_eq!(searcher.search(&query, &Count)?, 25);
@@ -278,18 +230,22 @@ mod tests {
let reader = index.reader()?; let reader = index.reader()?;
let searcher = reader.searcher(); let searcher = reader.searcher();
assert_eq!(count_existing_fields(&searcher, "json.all", false)?, 100); assert_eq!(count_existing_fields(&searcher, "json.all")?, 100);
assert_eq!(count_existing_fields(&searcher, "json.even", false)?, 50); assert_eq!(count_existing_fields(&searcher, "json.even")?, 50);
assert_eq!(count_existing_fields(&searcher, "json.even", true)?, 50); assert_eq!(count_existing_fields(&searcher, "json.odd")?, 50);
assert_eq!(count_existing_fields(&searcher, "json.odd", false)?, 50);
assert_eq!(count_existing_fields(&searcher, "json", false)?, 0);
assert_eq!(count_existing_fields(&searcher, "json", true)?, 100);
// Handling of non-existing fields: // Handling of non-existing fields:
assert_eq!(count_existing_fields(&searcher, "json.absent", false)?, 0); assert_eq!(count_existing_fields(&searcher, "json.absent")?, 0);
assert_eq!(count_existing_fields(&searcher, "json.absent", true)?, 0); assert_eq!(
assert_does_not_exist(&searcher, "does_not_exists.absent", true); searcher
assert_does_not_exist(&searcher, "does_not_exists.absent", false); .search(
&ExistsQuery::new_exists_query("does_not_exists.absent".to_string()),
&Count
)
.unwrap_err()
.to_string(),
"The field does not exist: 'does_not_exists.absent'"
);
Ok(()) Ok(())
} }
@@ -328,13 +284,12 @@ mod tests {
let reader = index.reader()?; let reader = index.reader()?;
let searcher = reader.searcher(); let searcher = reader.searcher();
assert_eq!(count_existing_fields(&searcher, "bool", false)?, 50); assert_eq!(count_existing_fields(&searcher, "bool")?, 50);
assert_eq!(count_existing_fields(&searcher, "bool", true)?, 50); assert_eq!(count_existing_fields(&searcher, "bytes")?, 50);
assert_eq!(count_existing_fields(&searcher, "bytes", false)?, 50); assert_eq!(count_existing_fields(&searcher, "date")?, 50);
assert_eq!(count_existing_fields(&searcher, "date", false)?, 50); assert_eq!(count_existing_fields(&searcher, "f64")?, 50);
assert_eq!(count_existing_fields(&searcher, "f64", false)?, 50); assert_eq!(count_existing_fields(&searcher, "ip_addr")?, 50);
assert_eq!(count_existing_fields(&searcher, "ip_addr", false)?, 50); assert_eq!(count_existing_fields(&searcher, "facet")?, 50);
assert_eq!(count_existing_fields(&searcher, "facet", false)?, 50);
Ok(()) Ok(())
} }
@@ -358,33 +313,31 @@ mod tests {
assert_eq!( assert_eq!(
searcher searcher
.search(&ExistsQuery::new("not_fast".to_string(), false), &Count) .search(
&ExistsQuery::new_exists_query("not_fast".to_string()),
&Count
)
.unwrap_err() .unwrap_err()
.to_string(), .to_string(),
"Schema error: 'Field not_fast is not a fast field.'" "Schema error: 'Field not_fast is not a fast field.'"
); );
assert_does_not_exist(&searcher, "does_not_exists", false); assert_eq!(
searcher
.search(
&ExistsQuery::new_exists_query("does_not_exists".to_string()),
&Count
)
.unwrap_err()
.to_string(),
"The field does not exist: 'does_not_exists'"
);
Ok(()) Ok(())
} }
fn count_existing_fields( fn count_existing_fields(searcher: &Searcher, field: &str) -> crate::Result<usize> {
searcher: &Searcher, let query = ExistsQuery::new_exists_query(field.to_string());
field: &str,
json_subpaths: bool,
) -> crate::Result<usize> {
let query = ExistsQuery::new(field.to_string(), json_subpaths);
searcher.search(&query, &Count) searcher.search(&query, &Count)
} }
fn assert_does_not_exist(searcher: &Searcher, field: &str, json_subpaths: bool) {
assert_eq!(
searcher
.search(&ExistsQuery::new(field.to_string(), json_subpaths), &Count)
.unwrap_err()
.to_string(),
format!("The field does not exist: '{}'", field)
);
}
} }

View File

@@ -15,7 +15,7 @@ use crate::schema::field_type::ValueParsingError;
use crate::schema::{Facet, Field, NamedFieldDocument, OwnedValue, Schema}; use crate::schema::{Facet, Field, NamedFieldDocument, OwnedValue, Schema};
use crate::tokenizer::PreTokenizedString; use crate::tokenizer::PreTokenizedString;
#[repr(C, packed)] #[repr(packed)]
#[derive(Debug, Clone)] #[derive(Debug, Clone)]
/// A field value pair in the compact tantivy document /// A field value pair in the compact tantivy document
struct FieldValueAddr { struct FieldValueAddr {
@@ -480,7 +480,7 @@ impl<'a> CompactDocValue<'a> {
type Addr = u32; type Addr = u32;
#[derive(Clone, Copy, Default)] #[derive(Clone, Copy, Default)]
#[repr(C, packed)] #[repr(packed)]
/// The value type and the address to its payload in the container. /// The value type and the address to its payload in the container.
struct ValueAddr { struct ValueAddr {
type_id: ValueType, type_id: ValueType,
@@ -734,7 +734,7 @@ mod tests {
#[test] #[test]
fn test_json_value() { fn test_json_value() {
let json_str = r#"{ let json_str = r#"{
"toto": "titi", "toto": "titi",
"float": -0.2, "float": -0.2,
"bool": true, "bool": true,

View File

@@ -93,7 +93,6 @@ impl TermInfoBlockMeta {
} }
} }
#[derive(Clone)]
pub struct TermInfoStore { pub struct TermInfoStore {
num_terms: usize, num_terms: usize,
block_meta_bytes: OwnedBytes, block_meta_bytes: OwnedBytes,

View File

@@ -1,5 +1,4 @@
use std::io::{self, Write}; use std::io::{self, Write};
use std::sync::Arc;
use common::{BinarySerializable, CountingWriter}; use common::{BinarySerializable, CountingWriter};
use once_cell::sync::Lazy; use once_cell::sync::Lazy;
@@ -114,9 +113,8 @@ static EMPTY_TERM_DICT_FILE: Lazy<FileSlice> = Lazy::new(|| {
/// The `Fst` crate is used to associate terms to their /// The `Fst` crate is used to associate terms to their
/// respective `TermOrdinal`. The `TermInfoStore` then makes it /// respective `TermOrdinal`. The `TermInfoStore` then makes it
/// possible to fetch the associated `TermInfo`. /// possible to fetch the associated `TermInfo`.
#[derive(Clone)]
pub struct TermDictionary { pub struct TermDictionary {
fst_index: Arc<tantivy_fst::Map<OwnedBytes>>, fst_index: tantivy_fst::Map<OwnedBytes>,
term_info_store: TermInfoStore, term_info_store: TermInfoStore,
} }
@@ -138,7 +136,7 @@ impl TermDictionary {
let fst_index = open_fst_index(fst_file_slice)?; let fst_index = open_fst_index(fst_file_slice)?;
let term_info_store = TermInfoStore::open(values_file_slice)?; let term_info_store = TermInfoStore::open(values_file_slice)?;
Ok(TermDictionary { Ok(TermDictionary {
fst_index: Arc::new(fst_index), fst_index,
term_info_store, term_info_store,
}) })
} }

View File

@@ -74,7 +74,6 @@ const CURRENT_TYPE: DictionaryType = DictionaryType::SSTable;
// TODO in the future this should become an enum of supported dictionaries // TODO in the future this should become an enum of supported dictionaries
/// A TermDictionary wrapping either an FST based dictionary or a SSTable based one. /// A TermDictionary wrapping either an FST based dictionary or a SSTable based one.
#[derive(Clone)]
pub struct TermDictionary(InnerTermDict); pub struct TermDictionary(InnerTermDict);
impl TermDictionary { impl TermDictionary {

View File

@@ -28,7 +28,6 @@ pub type TermDictionaryBuilder<W> = sstable::Writer<W, TermInfoValueWriter>;
pub type TermStreamer<'a, A = AlwaysMatch> = sstable::Streamer<'a, TermSSTable, A>; pub type TermStreamer<'a, A = AlwaysMatch> = sstable::Streamer<'a, TermSSTable, A>;
/// SSTable used to store TermInfo objects. /// SSTable used to store TermInfo objects.
#[derive(Clone)]
pub struct TermSSTable; pub struct TermSSTable;
pub type TermStreamerBuilder<'a, A = AlwaysMatch> = sstable::StreamerBuilder<'a, TermSSTable, A>; pub type TermStreamerBuilder<'a, A = AlwaysMatch> = sstable::StreamerBuilder<'a, TermSSTable, A>;

View File

@@ -1,6 +1,6 @@
[package] [package]
name = "tantivy-sstable" name = "tantivy-sstable"
version = "0.5.0" version = "0.3.0"
edition = "2021" edition = "2021"
license = "MIT" license = "MIT"
homepage = "https://github.com/quickwit-oss/tantivy" homepage = "https://github.com/quickwit-oss/tantivy"
@@ -10,10 +10,8 @@ categories = ["database-implementations", "data-structures", "compression"]
description = "sstables for tantivy" description = "sstables for tantivy"
[dependencies] [dependencies]
common = {version= "0.9", path="../common", package="tantivy-common"} common = {version= "0.7", path="../common", package="tantivy-common"}
futures-util = "0.3.30" tantivy-bitpacker = { version= "0.6", path="../bitpacker" }
itertools = "0.14.0"
tantivy-bitpacker = { version= "0.8", path="../bitpacker" }
tantivy-fst = "0.5" tantivy-fst = "0.5"
# experimental gives us access to Decompressor::upper_bound # experimental gives us access to Decompressor::upper_bound
zstd = { version = "0.13", features = ["experimental"] } zstd = { version = "0.13", features = ["experimental"] }

View File

@@ -1,271 +0,0 @@
use tantivy_fst::Automaton;
/// Returns whether a block can match an automaton based on its bounds.
///
/// The start key is exclusive, and optional to account for the first block. The end key is
/// inclusive and mandatory.
pub(crate) fn can_block_match_automaton(
start_key_opt: Option<&[u8]>,
end_key: &[u8],
automaton: &impl Automaton,
) -> bool {
let start_key = if let Some(start_key) = start_key_opt {
start_key
} else {
// if start_key_opt is None, we would allow an automaton matching the empty string to match
if automaton.is_match(&automaton.start()) {
return true;
}
&[]
};
can_block_match_automaton_with_start(start_key, end_key, automaton)
}
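A small, hedged illustration using the `EqBuffer` automaton from the test module below, which matches exactly one key:
    let automaton = EqBuffer(b"abcd".to_vec());
    // the block covering keys in (b"abc", b"abd"] may contain b"abcd"...
    assert!(can_block_match_automaton(Some(b"abc"), b"abd", &automaton));
    // ...but a block whose last key sorts before b"abcd" cannot
    assert!(!can_block_match_automaton(Some(b"ab"), b"abc", &automaton));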
// similar to can_block_match_automaton, ignoring the edge case of the initial block
fn can_block_match_automaton_with_start(
start_key: &[u8],
end_key: &[u8],
automaton: &impl Automaton,
) -> bool {
// notation: in loops, we use `kb` to denote a key byte (a byte taken from the start/end key),
// and `rb`, a range byte (usually all values higher than a `kb` when comparing with
// start_key, or all values lower than a `kb` when comparing with end_key)
if start_key >= end_key {
return false;
}
let common_prefix_len = crate::common_prefix_len(start_key, end_key);
let mut base_state = automaton.start();
for kb in &start_key[0..common_prefix_len] {
base_state = automaton.accept(&base_state, *kb);
}
// this is not required for correctness, but allows dodging more expensive checks
if !automaton.can_match(&base_state) {
return false;
}
// we have 3 distinct cases:
// - keys are `abc` and `abcd` => we test for abc[\0-d].*
// - keys are `abcd` and `abce` => we test for abc[d-e].*
// - keys are `abcd` and `abc` => contradiction with start_key < end_key.
//
// ideally for (abc, abcde] we could test for abc([\0-c].*|d([\0-d].*|e)?)
// but let's start simple (and correct), and tighten our bounds later
//
// and for (abcde, abcfg] we could test for abc(d(e.+|[f-\xff].*)|e.*|f([\0-f].*|g)?)
// abc (
// d(e.+|[f-\xff].*) |
// e.* |
// f([\0-f].*|g)?
// )
//
// these are all written as regex, but can be converted to operations we can do:
// - [x-y] is a for c in x..=y
// - .* is a can_match()
// - .+ is a for c in 0..=255 { accept(c).can_match() }
// - ? is a the thing before can_match(), or current state.is_match()
// - | means test both side
// we have two cases, either start_key is a prefix of end_key (e.g. (abc, abcjp]),
// or it is not (e.g. (abcdg, abcjp]). It is not possible, however, for end_key to be a prefix of
// start_key (or that both are equal) because we already handled start_key >= end_key.
//
// if we are in the first case, we want to visit the following states:
// abc (
// [\0-i].* |
// j (
// [\0-o].* |
// p
// )?
// )
// Everything after `abc` is handled by `match_range_end`
//
// if we are in the 2nd case, we want to visit the following states:
// abc (
// d(g.+|[h-\xff].*) | // this is handled by match_range_start
//
// [e-i].* | // this is handled here
//
// j ( // this is handled by match_range_end (but contrary to the other
// [\0-o].* | // case, j is already consumed so to not check [\0-i].* )
// p
// )?
// )
let Some(start_range) = start_key.get(common_prefix_len) else {
return match_range_end(&end_key[common_prefix_len..], &automaton, base_state);
};
let end_range = end_key[common_prefix_len];
// things starting with start_range were handled in match_range_start
// things starting with end_range are handled below.
// this can run for 0 iterations in cases such as (abc, abd]
for rb in (start_range + 1)..end_range {
let new_state = automaton.accept(&base_state, rb);
if automaton.can_match(&new_state) {
return true;
}
}
let state_for_start = automaton.accept(&base_state, *start_range);
if match_range_start(
&start_key[common_prefix_len + 1..],
&automaton,
state_for_start,
) {
return true;
}
let state_for_end = automaton.accept(&base_state, end_range);
if automaton.is_match(&state_for_end) {
return true;
}
match_range_end(&end_key[common_prefix_len + 1..], &automaton, state_for_end)
}
fn match_range_start<S, A: Automaton<State = S>>(
start_key: &[u8],
automaton: &A,
mut state: S,
) -> bool {
// case (abcdgj, abcpqr], `abcd` is already consumed, we need to handle:
// - [h-\xff].*
// - g[k-\xff].*
// - gj.+ == gj[\0-\xff].*
for kb in start_key {
// this is an optimisation, and is not needed for correctness
if !automaton.can_match(&state) {
return false;
}
// does the [h-\xff].* part. we skip if kb==255 as [\{0100}-\xff] is an empty range, and
// this would overflow in our u8 world
if *kb < u8::MAX {
for rb in (kb + 1)..=u8::MAX {
let temp_state = automaton.accept(&state, rb);
if automaton.can_match(&temp_state) {
return true;
}
}
}
// push g
state = automaton.accept(&state, *kb);
}
// this isn't required for correctness, but can save us from looping 256 times below
if !automaton.can_match(&state) {
return false;
}
// does the final `.+`, which is the same as `[\0-\xff].*`
for rb in 0..=u8::MAX {
let temp_state = automaton.accept(&state, rb);
if automaton.can_match(&temp_state) {
return true;
}
}
false
}
fn match_range_end<S, A: Automaton<State = S>>(
end_key: &[u8],
automaton: &A,
mut state: S,
) -> bool {
// for (abcdef, abcmps]. the prefix `abcm` has been consumed, `[d-l].*` was handled elsewhere,
// we just need to handle
// - [\0-o].*
// - p
// - p[\0-r].*
// - ps
for kb in end_key {
// this is an optimisation, and is not needed for correctness
if !automaton.can_match(&state) {
return false;
}
// does the `[\0-o].*`
for rb in 0..*kb {
let temp_state = automaton.accept(&state, rb);
if automaton.can_match(&temp_state) {
return true;
}
}
// push p
state = automaton.accept(&state, *kb);
// verify the `p` case
if automaton.is_match(&state) {
return true;
}
}
false
}
#[cfg(test)]
pub(crate) mod tests {
use proptest::prelude::*;
use tantivy_fst::Automaton;
use super::*;
pub(crate) struct EqBuffer(pub Vec<u8>);
impl Automaton for EqBuffer {
type State = Option<usize>;
fn start(&self) -> Self::State {
Some(0)
}
fn is_match(&self, state: &Self::State) -> bool {
*state == Some(self.0.len())
}
fn accept(&self, state: &Self::State, byte: u8) -> Self::State {
state
.filter(|pos| self.0.get(*pos) == Some(&byte))
.map(|pos| pos + 1)
}
fn can_match(&self, state: &Self::State) -> bool {
state.is_some()
}
fn will_always_match(&self, _state: &Self::State) -> bool {
false
}
}
fn gen_key_strategy() -> impl Strategy<Value = Vec<u8>> {
// we only generate bytes in [0, 1, 2, 254, 255] to reduce the search space without
// ignoring edge cases that might occur with integer over/underflow
proptest::collection::vec(prop_oneof![0u8..=2, 254u8..=255], 0..5)
}
proptest! {
#![proptest_config(ProptestConfig {
cases: 10000, .. ProptestConfig::default()
})]
#[test]
fn test_proptest_automaton_match_block(start in gen_key_strategy(), end in gen_key_strategy(), key in gen_key_strategy()) {
let expected = start < key && end >= key;
let automaton = EqBuffer(key);
assert_eq!(can_block_match_automaton(Some(&start), &end, &automaton), expected);
}
#[test]
fn test_proptest_automaton_match_first_block(end in gen_key_strategy(), key in gen_key_strategy()) {
let expected = end >= key;
let automaton = EqBuffer(key);
assert_eq!(can_block_match_automaton(None, &end, &automaton), expected);
}
}
}

View File

@@ -7,7 +7,6 @@ use zstd::bulk::Decompressor;
pub struct BlockReader { pub struct BlockReader {
buffer: Vec<u8>, buffer: Vec<u8>,
reader: OwnedBytes, reader: OwnedBytes,
next_readers: std::vec::IntoIter<OwnedBytes>,
offset: usize, offset: usize,
} }
@@ -16,18 +15,6 @@ impl BlockReader {
BlockReader { BlockReader {
buffer: Vec::new(), buffer: Vec::new(),
reader, reader,
next_readers: Vec::new().into_iter(),
offset: 0,
}
}
pub fn from_multiple_blocks(readers: Vec<OwnedBytes>) -> BlockReader {
let mut next_readers = readers.into_iter();
let reader = next_readers.next().unwrap_or_else(OwnedBytes::empty);
BlockReader {
buffer: Vec::new(),
reader,
next_readers,
offset: 0, offset: 0,
} }
} }
@@ -47,52 +34,42 @@ impl BlockReader {
self.offset = 0; self.offset = 0;
self.buffer.clear(); self.buffer.clear();
loop { let block_len = match self.reader.len() {
let block_len = match self.reader.len() { 0 => return Ok(false),
0 => { 1..=3 => {
// we are out of data for this block. Check if we have another block after
if let Some(new_reader) = self.next_readers.next() {
self.reader = new_reader;
continue;
} else {
return Ok(false);
}
}
1..=3 => {
return Err(io::Error::new(
io::ErrorKind::UnexpectedEof,
"failed to read block_len",
))
}
_ => self.reader.read_u32() as usize,
};
if block_len <= 1 {
return Ok(false);
}
let compress = self.reader.read_u8();
let block_len = block_len - 1;
if self.reader.len() < block_len {
return Err(io::Error::new( return Err(io::Error::new(
io::ErrorKind::UnexpectedEof, io::ErrorKind::UnexpectedEof,
"failed to read block content", "failed to read block_len",
)); ))
} }
if compress == 1 { _ => self.reader.read_u32() as usize,
let required_capacity = };
Decompressor::upper_bound(&self.reader[..block_len]).unwrap_or(1024 * 1024); if block_len <= 1 {
self.buffer.reserve(required_capacity); return Ok(false);
Decompressor::new()?
.decompress_to_buffer(&self.reader[..block_len], &mut self.buffer)?;
self.reader.advance(block_len);
} else {
self.buffer.resize(block_len, 0u8);
self.reader.read_exact(&mut self.buffer[..])?;
}
return Ok(true);
} }
let compress = self.reader.read_u8();
let block_len = block_len - 1;
if self.reader.len() < block_len {
return Err(io::Error::new(
io::ErrorKind::UnexpectedEof,
"failed to read block content",
));
}
if compress == 1 {
let required_capacity =
Decompressor::upper_bound(&self.reader[..block_len]).unwrap_or(1024 * 1024);
self.buffer.reserve(required_capacity);
Decompressor::new()?
.decompress_to_buffer(&self.reader[..block_len], &mut self.buffer)?;
self.reader.advance(block_len);
} else {
self.buffer.resize(block_len, 0u8);
self.reader.read_exact(&mut self.buffer[..])?;
}
Ok(true)
} }
#[inline(always)] #[inline(always)]

View File

@@ -89,7 +89,7 @@ where
fn encode_keep_add(&mut self, keep_len: usize, add_len: usize) { fn encode_keep_add(&mut self, keep_len: usize, add_len: usize) {
if keep_len < FOUR_BIT_LIMITS && add_len < FOUR_BIT_LIMITS { if keep_len < FOUR_BIT_LIMITS && add_len < FOUR_BIT_LIMITS {
let b = (keep_len | (add_len << 4)) as u8; let b = (keep_len | add_len << 4) as u8;
self.block.extend_from_slice(&[b]) self.block.extend_from_slice(&[b])
} else { } else {
let mut buf = [VINT_MODE; 20]; let mut buf = [VINT_MODE; 20];
@@ -143,16 +143,6 @@ where TValueReader: value::ValueReader
} }
} }
pub fn from_multiple_blocks(reader: Vec<OwnedBytes>) -> Self {
DeltaReader {
idx: 0,
common_prefix_len: 0,
suffix_range: 0..0,
value_reader: TValueReader::default(),
block_reader: BlockReader::from_multiple_blocks(reader),
}
}
pub fn empty() -> Self { pub fn empty() -> Self {
DeltaReader::new(OwnedBytes::empty()) DeltaReader::new(OwnedBytes::empty())
} }

View File

@@ -1,5 +1,3 @@
#![allow(clippy::needless_borrows_for_generic_args)]
use std::cmp::Ordering; use std::cmp::Ordering;
use std::io; use std::io;
use std::marker::PhantomData; use std::marker::PhantomData;
@@ -9,8 +7,6 @@ use std::sync::Arc;
use common::bounds::{transform_bound_inner_res, TransformBound}; use common::bounds::{transform_bound_inner_res, TransformBound};
use common::file_slice::FileSlice; use common::file_slice::FileSlice;
use common::{BinarySerializable, OwnedBytes}; use common::{BinarySerializable, OwnedBytes};
use futures_util::{stream, StreamExt, TryStreamExt};
use itertools::Itertools;
use tantivy_fst::automaton::AlwaysMatch; use tantivy_fst::automaton::AlwaysMatch;
use tantivy_fst::Automaton; use tantivy_fst::Automaton;
@@ -102,52 +98,20 @@ impl<TSSTable: SSTable> Dictionary<TSSTable> {
&self, &self,
key_range: impl RangeBounds<[u8]>, key_range: impl RangeBounds<[u8]>,
limit: Option<u64>, limit: Option<u64>,
automaton: &impl Automaton,
merge_holes_under_bytes: usize,
) -> io::Result<DeltaReader<TSSTable::ValueReader>> { ) -> io::Result<DeltaReader<TSSTable::ValueReader>> {
let match_all = automaton.will_always_match(&automaton.start()); let slice = self.file_slice_for_range(key_range, limit);
if match_all { let data = slice.read_bytes_async().await?;
let slice = self.file_slice_for_range(key_range, limit); Ok(TSSTable::delta_reader(data))
let data = slice.read_bytes_async().await?;
Ok(TSSTable::delta_reader(data))
} else {
let blocks = stream::iter(self.get_block_iterator_for_range_and_automaton(
key_range,
automaton,
merge_holes_under_bytes,
));
let data = blocks
.map(|block_addr| {
self.sstable_slice
.read_bytes_slice_async(block_addr.byte_range)
})
.buffered(5)
.try_collect::<Vec<_>>()
.await?;
Ok(DeltaReader::from_multiple_blocks(data))
}
} }
pub(crate) fn sstable_delta_reader_for_key_range( pub(crate) fn sstable_delta_reader_for_key_range(
&self, &self,
key_range: impl RangeBounds<[u8]>, key_range: impl RangeBounds<[u8]>,
limit: Option<u64>, limit: Option<u64>,
automaton: &impl Automaton,
) -> io::Result<DeltaReader<TSSTable::ValueReader>> { ) -> io::Result<DeltaReader<TSSTable::ValueReader>> {
let match_all = automaton.will_always_match(&automaton.start()); let slice = self.file_slice_for_range(key_range, limit);
if match_all { let data = slice.read_bytes()?;
let slice = self.file_slice_for_range(key_range, limit); Ok(TSSTable::delta_reader(data))
let data = slice.read_bytes()?;
Ok(TSSTable::delta_reader(data))
} else {
// if operations are sync, we assume latency is negligible, and there is no point in
// merging across holes
let blocks = self.get_block_iterator_for_range_and_automaton(key_range, automaton, 0);
let data = blocks
.map(|block_addr| self.sstable_slice.read_bytes_slice(block_addr.byte_range))
.collect::<Result<Vec<_>, _>>()?;
Ok(DeltaReader::from_multiple_blocks(data))
}
} }
pub(crate) fn sstable_delta_reader_block( pub(crate) fn sstable_delta_reader_block(
@@ -240,42 +204,6 @@ impl<TSSTable: SSTable> Dictionary<TSSTable> {
self.sstable_slice.slice((start_bound, end_bound)) self.sstable_slice.slice((start_bound, end_bound))
} }
fn get_block_iterator_for_range_and_automaton<'a>(
&'a self,
key_range: impl RangeBounds<[u8]>,
automaton: &'a impl Automaton,
merge_holes_under_bytes: usize,
) -> impl Iterator<Item = BlockAddr> + 'a {
let lower_bound = match key_range.start_bound() {
Bound::Included(key) | Bound::Excluded(key) => {
self.sstable_index.locate_with_key(key).unwrap_or(u64::MAX)
}
Bound::Unbounded => 0,
};
let upper_bound = match key_range.end_bound() {
Bound::Included(key) | Bound::Excluded(key) => {
self.sstable_index.locate_with_key(key).unwrap_or(u64::MAX)
}
Bound::Unbounded => u64::MAX,
};
let block_range = lower_bound..=upper_bound;
self.sstable_index
.get_block_for_automaton(automaton)
.filter(move |(block_id, _)| block_range.contains(block_id))
.map(|(_, block_addr)| block_addr)
.coalesce(move |first, second| {
if first.byte_range.end + merge_holes_under_bytes >= second.byte_range.start {
Ok(BlockAddr {
first_ordinal: first.first_ordinal,
byte_range: first.byte_range.start..second.byte_range.end,
})
} else {
Err((first, second))
}
})
}
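The `coalesce` step in the helper above fuses candidate blocks whose byte ranges sit close together, so several nearby blocks can be fetched with a single read. Below is a minimal standalone sketch of that idea, using a hypothetical `merge_ranges` helper over plain `Range<usize>` values (not part of the crate):

```rust
use std::ops::Range;

// Fuse consecutive byte ranges into a single read when the gap separating
// them is smaller than `merge_holes_under_bytes`, trading a few wasted bytes
// for fewer IO calls.
fn merge_ranges(ranges: &[Range<usize>], merge_holes_under_bytes: usize) -> Vec<Range<usize>> {
    let mut merged: Vec<Range<usize>> = Vec::new();
    for range in ranges {
        match merged.last_mut() {
            // The hole is small enough: extend the previous range instead of
            // issuing a separate read.
            Some(last) if last.end + merge_holes_under_bytes >= range.start => {
                last.end = range.end;
            }
            _ => merged.push(range.clone()),
        }
    }
    merged
}

fn main() {
    // 10..20 and 22..30 are only 2 bytes apart, so they collapse into one read.
    assert_eq!(
        merge_ranges(&[0..10, 10..20, 22..30, 100..110], 4),
        vec![0..30, 100..110]
    );
}
```

With a threshold of 0, as in the synchronous path above, only adjacent or overlapping ranges are fused.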
/// Opens a `TermDictionary`. /// Opens a `TermDictionary`.
pub fn open(term_dictionary_file: FileSlice) -> io::Result<Self> { pub fn open(term_dictionary_file: FileSlice) -> io::Result<Self> {
let (main_slice, footer_len_slice) = term_dictionary_file.split_from_end(20); let (main_slice, footer_len_slice) = term_dictionary_file.split_from_end(20);
@@ -593,25 +521,6 @@ impl<TSSTable: SSTable> Dictionary<TSSTable> {
StreamerBuilder::new(self, AlwaysMatch) StreamerBuilder::new(self, AlwaysMatch)
} }
/// Returns a range builder filtered with a prefix.
pub fn prefix_range<K: AsRef<[u8]>>(&self, prefix: K) -> StreamerBuilder<TSSTable> {
let lower_bound = prefix.as_ref();
let mut upper_bound = lower_bound.to_vec();
for idx in (0..upper_bound.len()).rev() {
if upper_bound[idx] == 255 {
upper_bound.pop();
} else {
upper_bound[idx] += 1;
break;
}
}
let mut builder = self.range().ge(lower_bound);
if !upper_bound.is_empty() {
builder = builder.lt(upper_bound);
}
builder
}
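The removed `prefix_range` above turns a prefix into a half-open key range: the lower bound is the prefix itself, and the exclusive upper bound is derived by stripping trailing 0xFF bytes and incrementing the last remaining byte, leaving the range unbounded above when nothing remains. A sketch of just that computation, as a hypothetical standalone helper:

```rust
// Smallest byte string strictly greater than every key starting with `prefix`:
// drop trailing 0xFF bytes, then increment the last remaining byte.
// An all-0xFF (or empty) prefix has no finite upper bound.
fn prefix_upper_bound(prefix: &[u8]) -> Option<Vec<u8>> {
    let mut upper = prefix.to_vec();
    while let Some(&last) = upper.last() {
        if last == 0xFF {
            upper.pop();
        } else {
            *upper.last_mut().unwrap() += 1;
            return Some(upper);
        }
    }
    None
}

fn main() {
    // Keys with prefix "0FF" are exactly those in the range ["0FF", "0FG").
    assert_eq!(prefix_upper_bound(b"0FF"), Some(b"0FG".to_vec()));
    // [0, 255] bumps to [1] after the trailing 0xFF is dropped.
    assert_eq!(prefix_upper_bound(&[0, 0xFF]), Some(vec![1]));
    // An all-0xFF prefix matches up to the end of the key space.
    assert_eq!(prefix_upper_bound(&[0xFF, 0xFF]), None);
}
```

This is the behaviour exercised by `test_prefix_edge` further down: `[0, 255]` and `[0, 255, 12]` fall inside the range, while `[1]` does not.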
/// A stream of all the sorted terms. /// A stream of all the sorted terms.
pub fn stream(&self) -> io::Result<Streamer<TSSTable>> { pub fn stream(&self) -> io::Result<Streamer<TSSTable>> {
self.range().into_stream() self.range().into_stream()
@@ -1019,62 +928,4 @@ mod tests {
} }
assert!(!stream.advance()); assert!(!stream.advance());
} }
#[test]
fn test_prefix() {
let (dic, _slice) = make_test_sstable();
{
let mut stream = dic.prefix_range("1").into_stream().unwrap();
for i in 0x10000..0x20000 {
assert!(stream.advance());
assert_eq!(stream.term_ord(), i);
assert_eq!(stream.value(), &i);
assert_eq!(stream.key(), format!("{i:05X}").into_bytes());
}
assert!(!stream.advance());
}
{
let mut stream = dic.prefix_range("").into_stream().unwrap();
for i in 0..0x3ffff {
assert!(stream.advance(), "failed at {i:05X}");
assert_eq!(stream.term_ord(), i);
assert_eq!(stream.value(), &i);
assert_eq!(stream.key(), format!("{i:05X}").into_bytes());
}
assert!(!stream.advance());
}
{
let mut stream = dic.prefix_range("0FF").into_stream().unwrap();
for i in 0x0ff00..=0x0ffff {
assert!(stream.advance(), "failed at {i:05X}");
assert_eq!(stream.term_ord(), i);
assert_eq!(stream.value(), &i);
assert_eq!(stream.key(), format!("{i:05X}").into_bytes());
}
assert!(!stream.advance());
}
}
#[test]
fn test_prefix_edge() {
let dict = {
let mut builder = Dictionary::<MonotonicU64SSTable>::builder(Vec::new()).unwrap();
builder.insert(&[0, 254], &0).unwrap();
builder.insert(&[0, 255], &1).unwrap();
builder.insert(&[0, 255, 12], &2).unwrap();
builder.insert(&[1], &2).unwrap();
builder.insert(&[1, 0], &2).unwrap();
let table = builder.finish().unwrap();
let table = Arc::new(PermissionedHandle::new(table));
let slice = common::file_slice::FileSlice::new(table.clone());
Dictionary::<MonotonicU64SSTable>::open(slice).unwrap()
};
let mut stream = dict.prefix_range(&[0, 255]).into_stream().unwrap();
assert!(stream.advance());
assert_eq!(stream.key(), &[0, 255]);
assert!(stream.advance());
assert_eq!(stream.key(), &[0, 255, 12]);
assert!(!stream.advance());
}
} }

View File

@@ -3,7 +3,6 @@ use std::ops::Range;
use merge::ValueMerger; use merge::ValueMerger;
mod block_match_automaton;
mod delta; mod delta;
mod dictionary; mod dictionary;
pub mod merge; pub mod merge;

View File

@@ -1,12 +1,10 @@
use common::OwnedBytes; use common::OwnedBytes;
use tantivy_fst::Automaton;
use crate::block_match_automaton::can_block_match_automaton;
use crate::{BlockAddr, SSTable, SSTableDataCorruption, TermOrdinal}; use crate::{BlockAddr, SSTable, SSTableDataCorruption, TermOrdinal};
#[derive(Default, Debug, Clone)] #[derive(Default, Debug, Clone)]
pub struct SSTableIndex { pub struct SSTableIndex {
pub(crate) blocks: Vec<BlockMeta>, blocks: Vec<BlockMeta>,
} }
impl SSTableIndex { impl SSTableIndex {
@@ -76,31 +74,6 @@ impl SSTableIndex {
// locate_with_ord always returns an index within range // locate_with_ord always returns an index within range
self.get_block(self.locate_with_ord(ord)).unwrap() self.get_block(self.locate_with_ord(ord)).unwrap()
} }
pub(crate) fn get_block_for_automaton<'a>(
&'a self,
automaton: &'a impl Automaton,
) -> impl Iterator<Item = (u64, BlockAddr)> + 'a {
std::iter::once((None, &self.blocks[0]))
.chain(self.blocks.windows(2).map(|window| {
let [prev, curr] = window else {
unreachable!();
};
(Some(&*prev.last_key_or_greater), curr)
}))
.enumerate()
.filter_map(move |(pos, (prev_key, current_block))| {
if can_block_match_automaton(
prev_key,
&current_block.last_key_or_greater,
automaton,
) {
Some((pos as u64, current_block.block_addr.clone()))
} else {
None
}
})
}
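The `once(None)` plus `windows(2)` chain above pairs each block with the `last_key_or_greater` of the block before it: block `i` can only hold keys in `(blocks[i-1].last_key_or_greater, blocks[i].last_key_or_greater]`, with no lower bound for the first block, and `can_block_match_automaton` is then asked whether the automaton could accept a key in that interval. A small sketch of just the pairing step (hypothetical helper, for illustration only):

```rust
// Pair every block's exclusive lower bound (the previous block's last key,
// None for the first block) with its inclusive upper bound (its own
// `last_key_or_greater`), mirroring the once(None) + windows(2) chain above.
fn block_key_bounds(last_keys: &[Vec<u8>]) -> Vec<(Option<&[u8]>, &[u8])> {
    last_keys
        .iter()
        .enumerate()
        .map(|(i, last_key)| {
            let prev_key = if i == 0 {
                None
            } else {
                Some(last_keys[i - 1].as_slice())
            };
            (prev_key, last_key.as_slice())
        })
        .collect()
}

fn main() {
    let last_keys = vec![vec![0, 1, 2], vec![0, 2, 2], vec![0, 3, 2]];
    let bounds = block_key_bounds(&last_keys);
    // Block 1 covers keys in ([0, 1, 2], [0, 2, 2]], e.g. [0, 2, 1].
    assert_eq!(bounds[1], (Some(&[0u8, 1, 2][..]), &[0u8, 2, 2][..]));
}
```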
} }
#[derive(Debug, Clone)] #[derive(Debug, Clone)]
@@ -126,106 +99,3 @@ impl SSTable for IndexSSTable {
type ValueWriter = crate::value::index::IndexValueWriter; type ValueWriter = crate::value::index::IndexValueWriter;
} }
#[cfg(test)]
mod tests {
use super::*;
use crate::block_match_automaton::tests::EqBuffer;
#[test]
fn test_get_block_for_automaton() {
let sstable = SSTableIndex {
blocks: vec![
BlockMeta {
last_key_or_greater: vec![0, 1, 2],
block_addr: BlockAddr {
first_ordinal: 0,
byte_range: 0..10,
},
},
BlockMeta {
last_key_or_greater: vec![0, 2, 2],
block_addr: BlockAddr {
first_ordinal: 5,
byte_range: 10..20,
},
},
BlockMeta {
last_key_or_greater: vec![0, 3, 2],
block_addr: BlockAddr {
first_ordinal: 10,
byte_range: 20..30,
},
},
],
};
let res = sstable
.get_block_for_automaton(&EqBuffer(vec![0, 1, 1]))
.collect::<Vec<_>>();
assert_eq!(
res,
vec![(
0,
BlockAddr {
first_ordinal: 0,
byte_range: 0..10
}
)]
);
let res = sstable
.get_block_for_automaton(&EqBuffer(vec![0, 2, 1]))
.collect::<Vec<_>>();
assert_eq!(
res,
vec![(
1,
BlockAddr {
first_ordinal: 5,
byte_range: 10..20
}
)]
);
let res = sstable
.get_block_for_automaton(&EqBuffer(vec![0, 3, 1]))
.collect::<Vec<_>>();
assert_eq!(
res,
vec![(
2,
BlockAddr {
first_ordinal: 10,
byte_range: 20..30
}
)]
);
let res = sstable
.get_block_for_automaton(&EqBuffer(vec![0, 4, 1]))
.collect::<Vec<_>>();
assert!(res.is_empty());
let complex_automaton = EqBuffer(vec![0, 1, 1]).union(EqBuffer(vec![0, 3, 1]));
let res = sstable
.get_block_for_automaton(&complex_automaton)
.collect::<Vec<_>>();
assert_eq!(
res,
vec![
(
0,
BlockAddr {
first_ordinal: 0,
byte_range: 0..10
}
),
(
2,
BlockAddr {
first_ordinal: 10,
byte_range: 20..30
}
)
]
);
}
}

View File

@@ -5,9 +5,8 @@ use std::sync::Arc;
use common::{BinarySerializable, FixedSize, OwnedBytes}; use common::{BinarySerializable, FixedSize, OwnedBytes};
use tantivy_bitpacker::{compute_num_bits, BitPacker}; use tantivy_bitpacker::{compute_num_bits, BitPacker};
use tantivy_fst::raw::Fst; use tantivy_fst::raw::Fst;
use tantivy_fst::{Automaton, IntoStreamer, Map, MapBuilder, Streamer}; use tantivy_fst::{IntoStreamer, Map, MapBuilder, Streamer};
use crate::block_match_automaton::can_block_match_automaton;
use crate::{common_prefix_len, SSTableDataCorruption, TermOrdinal}; use crate::{common_prefix_len, SSTableDataCorruption, TermOrdinal};
#[derive(Debug, Clone)] #[derive(Debug, Clone)]
@@ -65,41 +64,6 @@ impl SSTableIndex {
SSTableIndex::V3Empty(v3_empty) => v3_empty.get_block_with_ord(ord), SSTableIndex::V3Empty(v3_empty) => v3_empty.get_block_with_ord(ord),
} }
} }
pub fn get_block_for_automaton<'a>(
&'a self,
automaton: &'a impl Automaton,
) -> impl Iterator<Item = (u64, BlockAddr)> + 'a {
match self {
SSTableIndex::V2(v2_index) => {
BlockIter::V2(v2_index.get_block_for_automaton(automaton))
}
SSTableIndex::V3(v3_index) => {
BlockIter::V3(v3_index.get_block_for_automaton(automaton))
}
SSTableIndex::V3Empty(v3_empty) => {
BlockIter::V3Empty(std::iter::once((0, v3_empty.block_addr.clone())))
}
}
}
}
enum BlockIter<V2, V3, T> {
V2(V2),
V3(V3),
V3Empty(std::iter::Once<T>),
}
impl<V2: Iterator<Item = T>, V3: Iterator<Item = T>, T> Iterator for BlockIter<V2, V3, T> {
type Item = T;
fn next(&mut self) -> Option<Self::Item> {
match self {
BlockIter::V2(v2) => v2.next(),
BlockIter::V3(v3) => v3.next(),
BlockIter::V3Empty(once) => once.next(),
}
}
} }
#[derive(Debug, Clone)] #[derive(Debug, Clone)]
@@ -159,59 +123,6 @@ impl SSTableIndexV3 {
pub(crate) fn get_block_with_ord(&self, ord: TermOrdinal) -> BlockAddr { pub(crate) fn get_block_with_ord(&self, ord: TermOrdinal) -> BlockAddr {
self.block_addr_store.binary_search_ord(ord).1 self.block_addr_store.binary_search_ord(ord).1
} }
pub(crate) fn get_block_for_automaton<'a>(
&'a self,
automaton: &'a impl Automaton,
) -> impl Iterator<Item = (u64, BlockAddr)> + 'a {
// this is more complicated than other index formats: we don't have a ready-made list of
// blocks, and instead need to stream-decode the sstable.
GetBlockForAutomaton {
streamer: self.fst_index.stream(),
block_addr_store: &self.block_addr_store,
prev_key: None,
automaton,
}
}
}
// TODO: we iterate over the entire Map to find matching blocks. We could manually iterate
// on the underlying Fst and skip whole branches when our Automaton says they cannot match;
// this isn't as bad as it sounds given the fst is a lot smaller than the rest of the
// sstable.
// To do that, we can't use tantivy_fst's Stream with an automaton, as we need to know 2
// consecutive fst keys to form a proper opinion on whether this is a match, which we can't
// translate into a single automaton.
struct GetBlockForAutomaton<'a, A: Automaton> {
streamer: tantivy_fst::map::Stream<'a>,
block_addr_store: &'a BlockAddrStore,
prev_key: Option<Vec<u8>>,
automaton: &'a A,
}
impl<A: Automaton> Iterator for GetBlockForAutomaton<'_, A> {
type Item = (u64, BlockAddr);
fn next(&mut self) -> Option<Self::Item> {
while let Some((new_key, block_id)) = self.streamer.next() {
if let Some(prev_key) = self.prev_key.as_mut() {
if can_block_match_automaton(Some(prev_key), new_key, self.automaton) {
prev_key.clear();
prev_key.extend_from_slice(new_key);
return Some((block_id, self.block_addr_store.get(block_id).unwrap()));
}
prev_key.clear();
prev_key.extend_from_slice(new_key);
} else {
self.prev_key = Some(new_key.to_owned());
if can_block_match_automaton(None, new_key, self.automaton) {
return Some((block_id, self.block_addr_store.get(block_id).unwrap()));
}
}
}
None
}
} }
#[derive(Debug, Clone)] #[derive(Debug, Clone)]
@@ -823,8 +734,7 @@ fn find_best_slope(elements: impl Iterator<Item = (usize, u64)> + Clone) -> (u32
mod tests { mod tests {
use common::OwnedBytes; use common::OwnedBytes;
use super::*; use super::{BlockAddr, SSTableIndexBuilder, SSTableIndexV3};
use crate::block_match_automaton::tests::EqBuffer;
use crate::SSTableDataCorruption; use crate::SSTableDataCorruption;
#[test] #[test]
@@ -913,108 +823,4 @@ mod tests {
(12345, 1) (12345, 1)
); );
} }
#[test]
fn test_get_block_for_automaton() {
let sstable_index_builder = SSTableIndexBuilder {
blocks: vec![
BlockMeta {
last_key_or_greater: vec![0, 1, 2],
block_addr: BlockAddr {
first_ordinal: 0,
byte_range: 0..10,
},
},
BlockMeta {
last_key_or_greater: vec![0, 2, 2],
block_addr: BlockAddr {
first_ordinal: 5,
byte_range: 10..20,
},
},
BlockMeta {
last_key_or_greater: vec![0, 3, 2],
block_addr: BlockAddr {
first_ordinal: 10,
byte_range: 20..30,
},
},
],
};
let mut sstable_index_bytes = Vec::new();
let fst_len = sstable_index_builder
.serialize(&mut sstable_index_bytes)
.unwrap();
let sstable = SSTableIndexV3::load(OwnedBytes::new(sstable_index_bytes), fst_len).unwrap();
let res = sstable
.get_block_for_automaton(&EqBuffer(vec![0, 1, 1]))
.collect::<Vec<_>>();
assert_eq!(
res,
vec![(
0,
BlockAddr {
first_ordinal: 0,
byte_range: 0..10
}
)]
);
let res = sstable
.get_block_for_automaton(&EqBuffer(vec![0, 2, 1]))
.collect::<Vec<_>>();
assert_eq!(
res,
vec![(
1,
BlockAddr {
first_ordinal: 5,
byte_range: 10..20
}
)]
);
let res = sstable
.get_block_for_automaton(&EqBuffer(vec![0, 3, 1]))
.collect::<Vec<_>>();
assert_eq!(
res,
vec![(
2,
BlockAddr {
first_ordinal: 10,
byte_range: 20..30
}
)]
);
let res = sstable
.get_block_for_automaton(&EqBuffer(vec![0, 4, 1]))
.collect::<Vec<_>>();
assert!(res.is_empty());
let complex_automaton = EqBuffer(vec![0, 1, 1]).union(EqBuffer(vec![0, 3, 1]));
let res = sstable
.get_block_for_automaton(&complex_automaton)
.collect::<Vec<_>>();
assert_eq!(
res,
vec![
(
0,
BlockAddr {
first_ordinal: 0,
byte_range: 0..10
}
),
(
2,
BlockAddr {
first_ordinal: 10,
byte_range: 20..30
}
)
]
);
}
} }

View File

@@ -86,24 +86,16 @@ where
bound_as_byte_slice(&self.upper), bound_as_byte_slice(&self.upper),
); );
self.term_dict self.term_dict
.sstable_delta_reader_for_key_range(key_range, self.limit, &self.automaton) .sstable_delta_reader_for_key_range(key_range, self.limit)
} }
async fn delta_reader_async( async fn delta_reader_async(&self) -> io::Result<DeltaReader<TSSTable::ValueReader>> {
&self,
merge_holes_under_bytes: usize,
) -> io::Result<DeltaReader<TSSTable::ValueReader>> {
let key_range = ( let key_range = (
bound_as_byte_slice(&self.lower), bound_as_byte_slice(&self.lower),
bound_as_byte_slice(&self.upper), bound_as_byte_slice(&self.upper),
); );
self.term_dict self.term_dict
.sstable_delta_reader_for_key_range_async( .sstable_delta_reader_for_key_range_async(key_range, self.limit)
key_range,
self.limit,
&self.automaton,
merge_holes_under_bytes,
)
.await .await
} }
@@ -138,16 +130,7 @@ where
/// See `into_stream(..)` /// See `into_stream(..)`
pub async fn into_stream_async(self) -> io::Result<Streamer<'a, TSSTable, A>> { pub async fn into_stream_async(self) -> io::Result<Streamer<'a, TSSTable, A>> {
self.into_stream_async_merging_holes(0).await let delta_reader = self.delta_reader_async().await?;
}
/// Same as `into_stream_async`, but tries to issue a single IO operation when requesting
/// blocks that are not consecutive but are less than `merge_holes_under_bytes` bytes apart.
pub async fn into_stream_async_merging_holes(
self,
merge_holes_under_bytes: usize,
) -> io::Result<Streamer<'a, TSSTable, A>> {
let delta_reader = self.delta_reader_async(merge_holes_under_bytes).await?;
self.into_stream_given_delta_reader(delta_reader) self.into_stream_given_delta_reader(delta_reader)
} }
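As a usage sketch of the removed `into_stream_async_merging_holes` above (assuming the `tantivy_sstable` crate path and the `MonotonicU64SSTable` dictionary type used in the tests earlier in this diff): collect every key in a bounded range through the async path, allowing block reads whose gaps are under 4 KiB to be merged into one fetch.

```rust
use std::io;

use tantivy_sstable::{Dictionary, MonotonicU64SSTable};

// Hypothetical helper: stream all keys in [lower, upper) from an sstable
// dictionary, merging non-consecutive block reads that are < 4 KiB apart.
async fn keys_in_range(
    dict: &Dictionary<MonotonicU64SSTable>,
    lower: &[u8],
    upper: &[u8],
) -> io::Result<Vec<Vec<u8>>> {
    let mut stream = dict
        .range()
        .ge(lower)
        .lt(upper)
        .into_stream_async_merging_holes(4096)
        .await?;
    let mut keys = Vec::new();
    while stream.advance() {
        keys.push(stream.key().to_vec());
    }
    Ok(keys)
}
```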
@@ -178,7 +161,7 @@ where
_lifetime: std::marker::PhantomData<&'a ()>, _lifetime: std::marker::PhantomData<&'a ()>,
} }
impl<TSSTable> Streamer<'_, TSSTable, AlwaysMatch> impl<'a, TSSTable> Streamer<'a, TSSTable, AlwaysMatch>
where TSSTable: SSTable where TSSTable: SSTable
{ {
pub fn empty() -> Self { pub fn empty() -> Self {
@@ -195,7 +178,7 @@ where TSSTable: SSTable
} }
} }
impl<TSSTable, A> Streamer<'_, TSSTable, A> impl<'a, TSSTable, A> Streamer<'a, TSSTable, A>
where where
A: Automaton, A: Automaton,
A::State: Clone, A::State: Clone,
@@ -344,7 +327,4 @@ mod tests {
assert!(!term_streamer.advance()); assert!(!term_streamer.advance());
Ok(()) Ok(())
} }
// TODO: add a test for sparse search with a poison block (one starting with 0xffffffff) => such
// a block instantly causes an unexpected EOF error
} }

View File

@@ -1,6 +1,6 @@
[package] [package]
name = "tantivy-stacker" name = "tantivy-stacker"
version = "0.5.0" version = "0.3.0"
edition = "2021" edition = "2021"
license = "MIT" license = "MIT"
homepage = "https://github.com/quickwit-oss/tantivy" homepage = "https://github.com/quickwit-oss/tantivy"
@@ -9,7 +9,7 @@ description = "term hashmap used for indexing"
[dependencies] [dependencies]
murmurhash32 = "0.3" murmurhash32 = "0.3"
common = { version = "0.9", path = "../common/", package = "tantivy-common" } common = { version = "0.7", path = "../common/", package = "tantivy-common" }
ahash = { version = "0.8.11", default-features = false, optional = true } ahash = { version = "0.8.11", default-features = false, optional = true }
rand_distr = "0.4.3" rand_distr = "0.4.3"
@@ -26,7 +26,7 @@ path = "example/hashmap.rs"
[dev-dependencies] [dev-dependencies]
rand = "0.8.5" rand = "0.8.5"
zipf = "7.0.0" zipf = "7.0.0"
rustc-hash = "2.1.0" rustc-hash = "1.1.0"
proptest = "1.2.0" proptest = "1.2.0"
binggan = { version = "0.14.0" } binggan = { version = "0.14.0" }

View File

@@ -74,7 +74,7 @@ fn ensure_capacity<'a>(
eull.remaining_cap = allocate as u16; eull.remaining_cap = allocate as u16;
} }
impl ExpUnrolledLinkedListWriter<'_> { impl<'a> ExpUnrolledLinkedListWriter<'a> {
#[inline] #[inline]
pub fn write_u32_vint(&mut self, val: u32) { pub fn write_u32_vint(&mut self, val: u32) {
let mut buf = [0u8; 8]; let mut buf = [0u8; 8];
@@ -160,7 +160,7 @@ mod tests {
{ {
let mut buffer = Vec::new(); let mut buffer = Vec::new();
stack.read_to_end(&arena, &mut buffer); stack.read_to_end(&arena, &mut buffer);
assert_eq!(&buffer[..], &[] as &[u8]); assert_eq!(&buffer[..], &[]);
} }
} }

View File

@@ -54,7 +54,7 @@ impl Addr {
#[inline] #[inline]
fn new(page_id: usize, local_addr: usize) -> Addr { fn new(page_id: usize, local_addr: usize) -> Addr {
Addr(((page_id << NUM_BITS_PAGE_ADDR) | local_addr) as u32) Addr((page_id << NUM_BITS_PAGE_ADDR | local_addr) as u32)
} }
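The two forms of `Addr::new` above differ only in parentheses; since `<<` binds tighter than `|` in Rust, both pack the same value. A standalone sketch of the packing and its inverse, with an assumed page-address width (the real `NUM_BITS_PAGE_ADDR` lives elsewhere in the crate):

```rust
// Assumed width for illustration; the actual constant is defined in the crate.
const NUM_BITS_PAGE_ADDR: usize = 20;
const PAGE_ADDR_MASK: u32 = (1 << NUM_BITS_PAGE_ADDR) - 1;

// Pack a page id and an offset within that page into a single u32,
// as `Addr::new` does above.
fn pack(page_id: usize, local_addr: usize) -> u32 {
    ((page_id << NUM_BITS_PAGE_ADDR) | local_addr) as u32
}

// Recover the two components from a packed address.
fn unpack(addr: u32) -> (usize, usize) {
    (
        (addr >> NUM_BITS_PAGE_ADDR) as usize,
        (addr & PAGE_ADDR_MASK) as usize,
    )
}

fn main() {
    let addr = pack(3, 0x1234);
    assert_eq!(unpack(addr), (3, 0x1234));
}
```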
#[inline] #[inline]

View File

@@ -1,6 +1,6 @@
[package] [package]
name = "tantivy-tokenizer-api" name = "tantivy-tokenizer-api"
version = "0.5.0" version = "0.3.0"
license = "MIT" license = "MIT"
edition = "2021" edition = "2021"
description = "Tokenizer API of tantivy" description = "Tokenizer API of tantivy"

View File

@@ -63,7 +63,7 @@ pub trait Tokenizer: 'static + Clone + Send + Sync {
/// Simple wrapper of `Box<dyn TokenStream + 'a>`. /// Simple wrapper of `Box<dyn TokenStream + 'a>`.
pub struct BoxTokenStream<'a>(Box<dyn TokenStream + 'a>); pub struct BoxTokenStream<'a>(Box<dyn TokenStream + 'a>);
impl TokenStream for BoxTokenStream<'_> { impl<'a> TokenStream for BoxTokenStream<'a> {
fn advance(&mut self) -> bool { fn advance(&mut self) -> bool {
self.0.advance() self.0.advance()
} }
@@ -90,7 +90,7 @@ impl<'a> Deref for BoxTokenStream<'a> {
&*self.0 &*self.0
} }
} }
impl DerefMut for BoxTokenStream<'_> { impl<'a> DerefMut for BoxTokenStream<'a> {
fn deref_mut(&mut self) -> &mut Self::Target { fn deref_mut(&mut self) -> &mut Self::Target {
&mut *self.0 &mut *self.0
} }