Compare commits

23 Commits

Author SHA1 Message Date
Pascal Seitz
a88e659e02 make convert_to_fast_value_and_append_to_json_term pub 2024-07-11 08:54:01 +08:00
Pascal Seitz
dd2c4a8963 clippy 2024-04-22 09:56:49 +08:00
Pascal Seitz
786781d0fc cleanup 2024-04-22 09:44:41 +08:00
Pascal Seitz
2d7483e3d4 add JsonTermSerializer 2024-04-20 18:56:27 +08:00
Pascal Seitz
87b9f0678c split term and indexing term 2024-04-18 23:38:21 +08:00
PSeitz
0e9fced336 remove JsonTermWriter (#2238)
* remove JsonTermWriter

remove JsonTermWriter
remove path truncation logic, add assertion

* fix json_path_writer add sep logic
2024-04-18 16:28:05 +02:00
PSeitz
b257b960b3 validate sort by field type (#2336)
* validate sort by field type

* Update src/index/index.rs

Co-authored-by: Adam Reichold <adamreichold@users.noreply.github.com>

---------

Co-authored-by: Adam Reichold <adamreichold@users.noreply.github.com>
2024-04-16 04:42:24 +02:00
Adam Reichold
4708171a32 Fix some of the things current Clippy complains about (#2363) 2024-04-16 04:27:06 +02:00
Adam Reichold
b493743f8d Fix trait bound of StoreReader::iter (#2360)
* Fix trait bound of StoreReader::iter

Similar to `StoreReader::get`, `StoreReader::iter` should only require
`DocumentDeserialize` and not `Document`.

* Mark the iterator returned by SegmentReader::doc_ids_alive as Send so it can be used in impls of Stream/AsyncIterator.
2024-04-15 15:50:02 +02:00
trinity-1686a
d2955a3fd2 extend field grouping (#2333)
* extend field grouping
2024-04-15 10:36:32 +02:00
PSeitz
17d5869ad6 update CHANGELOG, use github API in cliff (#2354)
* update CHANGELOG, use github API in cliff

* reset version to 0.21.1, before release

* chore: Release

* remove unreleased from CHANGELOG
2024-04-15 10:07:20 +02:00
PSeitz
dfa3aed32d check unsupported parameters top_hits (#2351)
* check unsupported parameters top_hits

* move to function
2024-04-10 08:20:52 +02:00
PSeitz
398817ce7b add index sorting deprecation warning (#2353)
* add index sorting deprecation warning

* remove deprecated IntOptions and DatePrecision
2024-04-10 08:09:09 +02:00
PSeitz
74940e9345 clippy (#2349)
* fix clippy

* fix clippy

* fix duplicate imports
2024-04-09 07:54:44 +02:00
PSeitz
1e9fc51535 update ahash (#2344) 2024-04-09 06:35:39 +02:00
PSeitz
92c32979d2 fix postcard compatibility for top_hits, add postcard test (#2346)
* fix postcard compatibility for top_hits, add postcard test

* fix top_hits naming, delay data fetch

closes #2347

* fix import
2024-04-09 06:17:25 +02:00
PSeitz
b644d78a32 fix null byte handling in JSON paths (#2345)
* fix null byte handling in JSON paths

closes https://github.com/quickwit-oss/tantivy/issues/2193
closes https://github.com/quickwit-oss/tantivy/issues/2340

* avoid repeated term truncation

* fix test

* Apply suggestions from code review

Co-authored-by: Paul Masurel <paul@quickwit.io>

* add comment

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2024-04-05 09:53:35 +02:00
PSeitz
4e79e11007 add collect_block to BoxableSegmentCollector (#2331) 2024-03-21 09:10:25 +01:00
PSeitz
67ebba3c3c expose collect_block buffer size (#2326)
* expose buffer of collect_block

* flip shard_size segment_size
2024-03-15 08:02:08 +01:00
PSeitz
7ce950f141 add method to fetch block of first vals in columnar (#2330)
* add method to fetch block of first vals in columnar

add method to fetch block of first vals in columnar (this is way faster
than single calls for full columns)
add benchmark
fix import warnings

```
test bench_get_block_first_on_full_column                  ... bench:          56 ns/iter (+/- 26)
test bench_get_block_first_on_full_column_single_calls     ... bench:         311 ns/iter (+/- 6)
test bench_get_block_first_on_multi_column                 ... bench:         378 ns/iter (+/- 15)
test bench_get_block_first_on_multi_column_single_calls    ... bench:         546 ns/iter (+/- 13)
test bench_get_block_first_on_optional_column              ... bench:         291 ns/iter (+/- 6)
test bench_get_block_first_on_optional_column_single_calls ... bench:         362 ns/iter (+/- 8)
```

* use remainder
2024-03-15 08:01:47 +01:00
dependabot[bot]
0cffe5fb09 Update base64 requirement from 0.21.0 to 0.22.0 (#2324)
Updates the requirements on [base64](https://github.com/marshallpierce/rust-base64) to permit the latest version.
- [Changelog](https://github.com/marshallpierce/rust-base64/blob/master/RELEASE-NOTES.md)
- [Commits](https://github.com/marshallpierce/rust-base64/compare/v0.21.0...v0.22.0)

---
updated-dependencies:
- dependency-name: base64
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-03-15 15:50:34 +09:00
PSeitz
b0e65560a1 handle ip addresses in term aggregation (#2319)
* handle ip addresses in term aggregation

Stores IP addresses during the segment term aggregation via their u64 representation
and converts them to u128 (IPv6 address) via downcast when converting to intermediate results.

Enable Downcasting on `ColumnValues`
Expose u64 variant for u128 encoded data via `open_u64_lenient` method.
Remove lifetime in VecColumn, to avoid 'static lifetime requirement coming
from downcast trait.

* rename method
2024-03-14 09:41:18 +01:00
PSeitz
ec37295b2f add fast path for full columns in fetch_block (#2328)
Spotted in `range_date_histogram` query in quickwit benchmark:
5% of time copying docs around, which is not needed in the full index case

remove Column to ColumnIndex deref
2024-03-14 04:07:11 +01:00
130 changed files with 1920 additions and 1401 deletions

View File

@@ -1,3 +1,65 @@
Tantivy 0.22
================================
Tantivy 0.22 will be able to read indices created with Tantivy 0.21.
#### Bugfixes
- Fix null byte handling in JSON paths (null bytes in json keys caused panic during indexing) [#2345](https://github.com/quickwit-oss/tantivy/pull/2345)(@PSeitz)
- Fix bug that can cause `get_docids_for_value_range` to panic. [#2295](https://github.com/quickwit-oss/tantivy/pull/2295)(@fulmicoton)
- Avoid 1 document indices by increasing the minimum memory for indexing to 15MB [#2176](https://github.com/quickwit-oss/tantivy/pull/2176)(@PSeitz)
- Fix merge panic for JSON fields [#2284](https://github.com/quickwit-oss/tantivy/pull/2284)(@PSeitz)
- Fix bug occurring when merging JSON object indexed with positions. [#2253](https://github.com/quickwit-oss/tantivy/pull/2253)(@fulmicoton)
- Fix empty DateHistogram gap bug [#2183](https://github.com/quickwit-oss/tantivy/pull/2183)(@PSeitz)
- Fix range query end check (fields with less than 1 value per doc are affected) [#2226](https://github.com/quickwit-oss/tantivy/pull/2226)(@PSeitz)
- Handle exclusive out of bounds ranges on fastfield range queries [#2174](https://github.com/quickwit-oss/tantivy/pull/2174)(@PSeitz)
#### Breaking API Changes
- rename ReloadPolicy onCommit to onCommitWithDelay [#2235](https://github.com/quickwit-oss/tantivy/pull/2235)(@giovannicuccu)
- Move exports from the root into modules [#2220](https://github.com/quickwit-oss/tantivy/pull/2220)(@PSeitz)
- Accept field name instead of `Field` in FilterCollector [#2196](https://github.com/quickwit-oss/tantivy/pull/2196)(@PSeitz)
- remove deprecated IntOptions and DatePrecision [#2353](https://github.com/quickwit-oss/tantivy/pull/2353)(@PSeitz)
#### Features/Improvements
- Tantivy documents as a trait: Index data directly without converting to tantivy types first [#2071](https://github.com/quickwit-oss/tantivy/pull/2071)(@ChillFish8)
- encode some part of posting list as -1 instead of direct values (smaller inverted indices) [#2185](https://github.com/quickwit-oss/tantivy/pull/2185)(@trinity-1686a)
- **Aggregation**
- Support to deserialize f64 from string [#2311](https://github.com/quickwit-oss/tantivy/pull/2311)(@PSeitz)
- Add a top_hits aggregator [#2198](https://github.com/quickwit-oss/tantivy/pull/2198)(@ditsuke)
- Support bool type in term aggregation [#2318](https://github.com/quickwit-oss/tantivy/pull/2318)(@PSeitz)
- Support ip addresses in term aggregation [#2319](https://github.com/quickwit-oss/tantivy/pull/2319)(@PSeitz)
- Support date type in term aggregation [#2172](https://github.com/quickwit-oss/tantivy/pull/2172)(@PSeitz)
- Support escaped dot when addressing field [#2250](https://github.com/quickwit-oss/tantivy/pull/2250)(@PSeitz)
- Add ExistsQuery to check documents that have a value [#2160](https://github.com/quickwit-oss/tantivy/pull/2160)(@imotov)
- Expose TopDocs::order_by_u64_field again [#2282](https://github.com/quickwit-oss/tantivy/pull/2282)(@ditsuke)
- **Memory/Performance**
- Faster TopN: replace BinaryHeap with TopNComputer [#2186](https://github.com/quickwit-oss/tantivy/pull/2186)(@PSeitz)
- reduce number of allocations during indexing [#2257](https://github.com/quickwit-oss/tantivy/pull/2257)(@PSeitz)
- Less Memory while indexing: docid deltas while indexing [#2249](https://github.com/quickwit-oss/tantivy/pull/2249)(@PSeitz)
- Faster indexing: use term hashmap in fastfield [#2243](https://github.com/quickwit-oss/tantivy/pull/2243)(@PSeitz)
- term hashmap remove copy in is_empty, unused unordered_id [#2229](https://github.com/quickwit-oss/tantivy/pull/2229)(@PSeitz)
- add method to fetch block of first values in columnar [#2330](https://github.com/quickwit-oss/tantivy/pull/2330)(@PSeitz)
- Faster aggregations: add fast path for full columns in fetch_block [#2328](https://github.com/quickwit-oss/tantivy/pull/2328)(@PSeitz)
- Faster sstable loading: use fst for sstable index [#2268](https://github.com/quickwit-oss/tantivy/pull/2268)(@trinity-1686a)
- **QueryParser**
- allow newline where we allow space in query parser [#2302](https://github.com/quickwit-oss/tantivy/pull/2302)(@trinity-1686a)
- allow some mixing of occur and bool in strict query parser [#2323](https://github.com/quickwit-oss/tantivy/pull/2323)(@trinity-1686a)
- handle * inside term in lenient query parser [#2228](https://github.com/quickwit-oss/tantivy/pull/2228)(@trinity-1686a)
- add support for exists query syntax in query parser [#2170](https://github.com/quickwit-oss/tantivy/pull/2170)(@trinity-1686a)
- Add shared search executor [#2312](https://github.com/quickwit-oss/tantivy/pull/2312)(@MochiXu)
- Truncate keys to u16::MAX in term hashmap [#2299](https://github.com/quickwit-oss/tantivy/pull/2299)(@PSeitz)
- report if a term matched when warming up posting list [#2309](https://github.com/quickwit-oss/tantivy/pull/2309)(@trinity-1686a)
- Support json fields in FuzzyTermQuery [#2173](https://github.com/quickwit-oss/tantivy/pull/2173)(@PingXia-at)
- Read list of fields encoded in term dictionary for JSON fields [#2184](https://github.com/quickwit-oss/tantivy/pull/2184)(@PSeitz)
- add collect_block to BoxableSegmentCollector [#2331](https://github.com/quickwit-oss/tantivy/pull/2331)(@PSeitz)
- expose collect_block buffer size [#2326](https://github.com/quickwit-oss/tantivy/pull/2326)(@PSeitz)
- Forward regex parser errors [#2288](https://github.com/quickwit-oss/tantivy/pull/2288)(@adamreichold)
- Make FacetCounts defaultable and cloneable. [#2322](https://github.com/quickwit-oss/tantivy/pull/2322)(@adamreichold)
- Derive Debug for SchemaBuilder [#2254](https://github.com/quickwit-oss/tantivy/pull/2254)(@GodTamIt)
- add missing inlines to tantivy options [#2245](https://github.com/quickwit-oss/tantivy/pull/2245)(@PSeitz)
Tantivy 0.21.1
================================
#### Bugfixes

View File

@@ -1,6 +1,6 @@
[package]
name = "tantivy"
version = "0.22.0-dev"
version = "0.22.0"
authors = ["Paul Masurel <paul.masurel@gmail.com>"]
license = "MIT"
categories = ["database-implementations", "data-structures"]
@@ -11,12 +11,12 @@ repository = "https://github.com/quickwit-oss/tantivy"
readme = "README.md"
keywords = ["search", "information", "retrieval"]
edition = "2021"
rust-version = "1.62"
rust-version = "1.63"
exclude = ["benches/*.json", "benches/*.txt"]
[dependencies]
oneshot = "0.1.5"
base64 = "0.21.0"
base64 = "0.22.0"
byteorder = "1.4.3"
crc32fast = "1.3.2"
once_cell = "1.10.0"
@@ -52,13 +52,13 @@ itertools = "0.12.0"
measure_time = "0.8.2"
arc-swap = "1.5.0"
columnar = { version= "0.2", path="./columnar", package ="tantivy-columnar" }
sstable = { version= "0.2", path="./sstable", package ="tantivy-sstable", optional = true }
stacker = { version= "0.2", path="./stacker", package ="tantivy-stacker" }
query-grammar = { version= "0.21.0", path="./query-grammar", package = "tantivy-query-grammar" }
tantivy-bitpacker = { version= "0.5", path="./bitpacker" }
common = { version= "0.6", path = "./common/", package = "tantivy-common" }
tokenizer-api = { version= "0.2", path="./tokenizer-api", package="tantivy-tokenizer-api" }
columnar = { version= "0.3", path="./columnar", package ="tantivy-columnar" }
sstable = { version= "0.3", path="./sstable", package ="tantivy-sstable", optional = true }
stacker = { version= "0.3", path="./stacker", package ="tantivy-stacker" }
query-grammar = { version= "0.22.0", path="./query-grammar", package = "tantivy-query-grammar" }
tantivy-bitpacker = { version= "0.6", path="./bitpacker" }
common = { version= "0.7", path = "./common/", package = "tantivy-common" }
tokenizer-api = { version= "0.3", path="./tokenizer-api", package="tantivy-tokenizer-api" }
sketches-ddsketch = { version = "0.2.1", features = ["use_serde"] }
futures-util = { version = "0.3.28", optional = true }
fnv = "1.0.7"
@@ -78,6 +78,9 @@ paste = "1.0.11"
more-asserts = "0.3.1"
rand_distr = "0.4.3"
time = { version = "0.3.10", features = ["serde-well-known", "macros"] }
postcard = { version = "1.0.4", features = [
"use-std",
], default-features = false }
[target.'cfg(not(windows))'.dev-dependencies]
criterion = { version = "0.5", default-features = false }

View File

@@ -1,6 +1,6 @@
[package]
name = "tantivy-bitpacker"
version = "0.5.0"
version = "0.6.0"
edition = "2021"
authors = ["Paul Masurel <paul.masurel@gmail.com>"]
license = "MIT"

View File

@@ -1,4 +1,3 @@
use std::convert::TryInto;
use std::io;
use std::ops::{Range, RangeInclusive};

View File

@@ -1,6 +1,10 @@
# configuration file for git-cliff
# see https://github.com/orhun/git-cliff#configuration-file
[remote.github]
owner = "quickwit-oss"
repo = "tantivy"
[changelog]
# changelog header
header = """
@@ -8,15 +12,43 @@ header = """
# template for the changelog body
# https://tera.netlify.app/docs/#introduction
body = """
{% if version %}\
{{ version | trim_start_matches(pat="v") }} ({{ timestamp | date(format="%Y-%m-%d") }})
==================
{% else %}\
## [unreleased]
{% endif %}\
## What's Changed
{%- if version %} in {{ version }}{%- endif -%}
{% for commit in commits %}
- {% if commit.breaking %}[**breaking**] {% endif %}{{ commit.message | split(pat="\n") | first | trim | upper_first }}(@{{ commit.author.name }})\
{% endfor %}
{% if commit.github.pr_title -%}
{%- set commit_message = commit.github.pr_title -%}
{%- else -%}
{%- set commit_message = commit.message -%}
{%- endif -%}
- {{ commit_message | split(pat="\n") | first | trim }}\
{% if commit.github.pr_number %} \
[#{{ commit.github.pr_number }}]({{ self::remote_url() }}/pull/{{ commit.github.pr_number }}){% if commit.github.username %}(@{{ commit.github.username }}){%- endif -%} \
{%- endif %}
{%- endfor -%}
{% if github.contributors | filter(attribute="is_first_time", value=true) | length != 0 %}
{% raw %}\n{% endraw -%}
## New Contributors
{%- endif %}\
{% for contributor in github.contributors | filter(attribute="is_first_time", value=true) %}
* @{{ contributor.username }} made their first contribution
{%- if contributor.pr_number %} in \
[#{{ contributor.pr_number }}]({{ self::remote_url() }}/pull/{{ contributor.pr_number }}) \
{%- endif %}
{%- endfor -%}
{% if version %}
{% if previous.version %}
**Full Changelog**: {{ self::remote_url() }}/compare/{{ previous.version }}...{{ version }}
{% endif %}
{% else -%}
{% raw %}\n{% endraw %}
{% endif %}
{%- macro remote_url() -%}
https://github.com/{{ remote.github.owner }}/{{ remote.github.repo }}
{%- endmacro -%}
"""
# remove the leading and trailing whitespace from the template
trim = true
@@ -25,53 +57,24 @@ footer = """
"""
postprocessors = [
{ pattern = 'Paul Masurel', replace = "fulmicoton"}, # replace with github user
{ pattern = 'PSeitz', replace = "PSeitz"}, # replace with github user
{ pattern = 'Adam Reichold', replace = "adamreichold"}, # replace with github user
{ pattern = 'trinity-1686a', replace = "trinity-1686a"}, # replace with github user
{ pattern = 'Michael Kleen', replace = "mkleen"}, # replace with github user
{ pattern = 'Adrien Guillo', replace = "guilload"}, # replace with github user
{ pattern = 'François Massot', replace = "fmassot"}, # replace with github user
{ pattern = 'Naveen Aiathurai', replace = "naveenann"}, # replace with github user
{ pattern = '', replace = ""}, # replace with github user
]
[git]
# parse the commits based on https://www.conventionalcommits.org
# This is required or commit.message contains the whole commit message and not just the title
conventional_commits = true
conventional_commits = false
# filter out the commits that are not conventional
filter_unconventional = false
filter_unconventional = true
# process each line of a commit as an individual commit
split_commits = false
# regex for preprocessing the commit messages
commit_preprocessors = [
{ pattern = '\((\w+\s)?#([0-9]+)\)', replace = "[#${2}](https://github.com/quickwit-oss/tantivy/issues/${2})"}, # replace issue numbers
{ pattern = '\((\w+\s)?#([0-9]+)\)', replace = ""},
]
#link_parsers = [
#{ pattern = "#(\\d+)", href = "https://github.com/quickwit-oss/tantivy/pulls/$1"},
#]
# regex for parsing and grouping commits
commit_parsers = [
{ message = "^feat", group = "Features"},
{ message = "^fix", group = "Bug Fixes"},
{ message = "^doc", group = "Documentation"},
{ message = "^perf", group = "Performance"},
{ message = "^refactor", group = "Refactor"},
{ message = "^style", group = "Styling"},
{ message = "^test", group = "Testing"},
{ message = "^chore\\(release\\): prepare for", skip = true},
{ message = "(?i)clippy", skip = true},
{ message = "(?i)dependabot", skip = true},
{ message = "(?i)fmt", skip = true},
{ message = "(?i)bump", skip = true},
{ message = "(?i)readme", skip = true},
{ message = "(?i)comment", skip = true},
{ message = "(?i)spelling", skip = true},
{ message = "^chore", group = "Miscellaneous Tasks"},
{ body = ".*security", group = "Security"},
{ message = ".*", group = "Other", default_scope = "other"},
]
# protect breaking changes from being skipped due to matching a skipping commit_parser
protect_breaking_commits = false
# filter out the commits that are not matched by commit parsers

View File

@@ -1,6 +1,6 @@
[package]
name = "tantivy-columnar"
version = "0.2.0"
version = "0.3.0"
edition = "2021"
license = "MIT"
homepage = "https://github.com/quickwit-oss/tantivy"
@@ -12,11 +12,12 @@ categories = ["database-implementations", "data-structures", "compression"]
itertools = "0.12.0"
fastdivide = "0.4.0"
stacker = { version= "0.2", path = "../stacker", package="tantivy-stacker"}
sstable = { version= "0.2", path = "../sstable", package = "tantivy-sstable" }
common = { version= "0.6", path = "../common", package = "tantivy-common" }
tantivy-bitpacker = { version= "0.5", path = "../bitpacker/" }
stacker = { version= "0.3", path = "../stacker", package="tantivy-stacker"}
sstable = { version= "0.3", path = "../sstable", package = "tantivy-sstable" }
common = { version= "0.7", path = "../common", package = "tantivy-common" }
tantivy-bitpacker = { version= "0.6", path = "../bitpacker/" }
serde = "1.0.152"
downcast-rs = "1.2.0"
[dev-dependencies]
proptest = "1"

View File

@@ -0,0 +1,155 @@
#![feature(test)]
extern crate test;
use std::sync::Arc;
use rand::prelude::*;
use tantivy_columnar::column_values::{serialize_and_load_u64_based_column_values, CodecType};
use tantivy_columnar::*;
use test::{black_box, Bencher};
struct Columns {
pub optional: Column,
pub full: Column,
pub multi: Column,
}
fn get_test_columns() -> Columns {
let data = generate_permutation();
let mut dataframe_writer = ColumnarWriter::default();
for (idx, val) in data.iter().enumerate() {
dataframe_writer.record_numerical(idx as u32, "full_values", NumericalValue::U64(*val));
if idx % 2 == 0 {
dataframe_writer.record_numerical(
idx as u32,
"optional_values",
NumericalValue::U64(*val),
);
}
dataframe_writer.record_numerical(idx as u32, "multi_values", NumericalValue::U64(*val));
dataframe_writer.record_numerical(idx as u32, "multi_values", NumericalValue::U64(*val));
}
let mut buffer: Vec<u8> = Vec::new();
dataframe_writer
.serialize(data.len() as u32, None, &mut buffer)
.unwrap();
let columnar = ColumnarReader::open(buffer).unwrap();
let cols: Vec<DynamicColumnHandle> = columnar.read_columns("optional_values").unwrap();
assert_eq!(cols.len(), 1);
let optional = cols[0].open_u64_lenient().unwrap().unwrap();
assert_eq!(optional.index.get_cardinality(), Cardinality::Optional);
let cols: Vec<DynamicColumnHandle> = columnar.read_columns("full_values").unwrap();
assert_eq!(cols.len(), 1);
let column_full = cols[0].open_u64_lenient().unwrap().unwrap();
assert_eq!(column_full.index.get_cardinality(), Cardinality::Full);
let cols: Vec<DynamicColumnHandle> = columnar.read_columns("multi_values").unwrap();
assert_eq!(cols.len(), 1);
let multi = cols[0].open_u64_lenient().unwrap().unwrap();
assert_eq!(multi.index.get_cardinality(), Cardinality::Multivalued);
Columns {
optional,
full: column_full,
multi,
}
}
const NUM_VALUES: u64 = 100_000;
fn generate_permutation() -> Vec<u64> {
let mut permutation: Vec<u64> = (0u64..NUM_VALUES).collect();
permutation.shuffle(&mut StdRng::from_seed([1u8; 32]));
permutation
}
pub fn serialize_and_load(column: &[u64], codec_type: CodecType) -> Arc<dyn ColumnValues<u64>> {
serialize_and_load_u64_based_column_values(&column, &[codec_type])
}
fn run_bench_on_column_full_scan(b: &mut Bencher, column: Column) {
let num_iter = black_box(NUM_VALUES);
b.iter(|| {
let mut sum = 0u64;
for i in 0..num_iter as u32 {
let val = column.first(i);
sum += val.unwrap_or(0);
}
sum
});
}
fn run_bench_on_column_block_fetch(b: &mut Bencher, column: Column) {
let mut block: Vec<Option<u64>> = vec![None; 64];
let fetch_docids = (0..64).collect::<Vec<_>>();
b.iter(move || {
column.first_vals(&fetch_docids, &mut block);
block[0]
});
}
fn run_bench_on_column_block_single_calls(b: &mut Bencher, column: Column) {
let mut block: Vec<Option<u64>> = vec![None; 64];
let fetch_docids = (0..64).collect::<Vec<_>>();
b.iter(move || {
for i in 0..fetch_docids.len() {
block[i] = column.first(fetch_docids[i]);
}
block[0]
});
}
/// Column first method
#[bench]
fn bench_get_first_on_full_column_full_scan(b: &mut Bencher) {
let column = get_test_columns().full;
run_bench_on_column_full_scan(b, column);
}
#[bench]
fn bench_get_first_on_optional_column_full_scan(b: &mut Bencher) {
let column = get_test_columns().optional;
run_bench_on_column_full_scan(b, column);
}
#[bench]
fn bench_get_first_on_multi_column_full_scan(b: &mut Bencher) {
let column = get_test_columns().multi;
run_bench_on_column_full_scan(b, column);
}
/// Block fetch column accessor
#[bench]
fn bench_get_block_first_on_optional_column(b: &mut Bencher) {
let column = get_test_columns().optional;
run_bench_on_column_block_fetch(b, column);
}
#[bench]
fn bench_get_block_first_on_multi_column(b: &mut Bencher) {
let column = get_test_columns().multi;
run_bench_on_column_block_fetch(b, column);
}
#[bench]
fn bench_get_block_first_on_full_column(b: &mut Bencher) {
let column = get_test_columns().full;
run_bench_on_column_block_fetch(b, column);
}
#[bench]
fn bench_get_block_first_on_optional_column_single_calls(b: &mut Bencher) {
let column = get_test_columns().optional;
run_bench_on_column_block_single_calls(b, column);
}
#[bench]
fn bench_get_block_first_on_multi_column_single_calls(b: &mut Bencher) {
let column = get_test_columns().multi;
run_bench_on_column_block_single_calls(b, column);
}
#[bench]
fn bench_get_block_first_on_full_column_single_calls(b: &mut Bencher) {
let column = get_test_columns().full;
run_bench_on_column_block_single_calls(b, column);
}

View File

@@ -16,14 +16,6 @@ fn generate_permutation() -> Vec<u64> {
permutation
}
fn generate_random() -> Vec<u64> {
let mut permutation: Vec<u64> = (0u64..100_000u64)
.map(|el| el + random::<u16>() as u64)
.collect();
permutation.shuffle(&mut StdRng::from_seed([1u8; 32]));
permutation
}
// Warning: this generates the same permutation at each call
fn generate_permutation_gcd() -> Vec<u64> {
let mut permutation: Vec<u64> = (1u64..100_000u64).map(|el| el * 1000).collect();

View File

@@ -14,20 +14,32 @@ impl<T: PartialOrd + Copy + std::fmt::Debug + Send + Sync + 'static + Default>
ColumnBlockAccessor<T>
{
#[inline]
pub fn fetch_block(&mut self, docs: &[u32], accessor: &Column<T>) {
self.docid_cache.clear();
self.row_id_cache.clear();
accessor.row_ids_for_docs(docs, &mut self.docid_cache, &mut self.row_id_cache);
self.val_cache.resize(self.row_id_cache.len(), T::default());
accessor
.values
.get_vals(&self.row_id_cache, &mut self.val_cache);
pub fn fetch_block<'a>(&'a mut self, docs: &'a [u32], accessor: &Column<T>) {
if accessor.index.get_cardinality().is_full() {
self.val_cache.resize(docs.len(), T::default());
accessor.values.get_vals(docs, &mut self.val_cache);
} else {
self.docid_cache.clear();
self.row_id_cache.clear();
accessor.row_ids_for_docs(docs, &mut self.docid_cache, &mut self.row_id_cache);
self.val_cache.resize(self.row_id_cache.len(), T::default());
accessor
.values
.get_vals(&self.row_id_cache, &mut self.val_cache);
}
}
#[inline]
pub fn fetch_block_with_missing(&mut self, docs: &[u32], accessor: &Column<T>, missing: T) {
self.fetch_block(docs, accessor);
// We can compare docid_cache with docs to find missing docs
if docs.len() != self.docid_cache.len() || accessor.index.is_multivalue() {
// no missing values
if accessor.index.get_cardinality().is_full() {
return;
}
// We can compare docid_cache length with docs to find missing docs
// For multi value columns we can't rely on the length and always need to scan
if accessor.index.get_cardinality().is_multivalue() || docs.len() != self.docid_cache.len()
{
self.missing_docids_cache.clear();
find_missing_docs(docs, &self.docid_cache, |doc| {
self.missing_docids_cache.push(doc);
@@ -44,11 +56,25 @@ impl<T: PartialOrd + Copy + std::fmt::Debug + Send + Sync + 'static + Default>
}
#[inline]
pub fn iter_docid_vals(&self) -> impl Iterator<Item = (DocId, T)> + '_ {
self.docid_cache
.iter()
.cloned()
.zip(self.val_cache.iter().cloned())
/// Returns an iterator over the docids and values
/// The passed in `docs` slice needs to be the same slice that was passed to `fetch_block` or
/// `fetch_block_with_missing`.
///
/// The `docs` slice is used if the column is full (each doc has exactly one value), otherwise
/// the internal docid vec is used for the iterator, which e.g. may contain duplicate docs.
pub fn iter_docid_vals<'a>(
&'a self,
docs: &'a [u32],
accessor: &Column<T>,
) -> impl Iterator<Item = (DocId, T)> + '_ {
if accessor.index.get_cardinality().is_full() {
docs.iter().cloned().zip(self.val_cache.iter().cloned())
} else {
self.docid_cache
.iter()
.cloned()
.zip(self.val_cache.iter().cloned())
}
}
}
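
To make the fast path in this hunk easier to follow, here is a minimal standalone sketch. It assumes a toy model, not tantivy's real types: `ToyColumn` and `ToyBlockAccessor` are hypothetical names, and an optional column is modeled as a sorted list of doc ids plus their values. For a full column the doc id doubles as the row id, so values are read directly and the doc id cache is never filled, which is why `iter_docid_vals` needs the caller's `docs` slice.

```rust
/// Toy stand-in for a column (hypothetical, not part of tantivy).
enum ToyColumn {
    /// One value per doc: the doc id is also the row id.
    Full(Vec<u64>),
    /// Only the sorted `docs` have a value; the row id is the position in `docs`.
    Optional { docs: Vec<u32>, values: Vec<u64> },
}

/// Toy stand-in for the block accessor caches (hypothetical).
#[derive(Default)]
struct ToyBlockAccessor {
    docid_cache: Vec<u32>,
    val_cache: Vec<u64>,
}

impl ToyBlockAccessor {
    fn fetch_block(&mut self, docs: &[u32], column: &ToyColumn) {
        self.val_cache.clear();
        match column {
            ToyColumn::Full(values) => {
                // Fast path: no doc id to row id translation, no doc id copy.
                self.val_cache
                    .extend(docs.iter().map(|&doc| values[doc as usize]));
            }
            ToyColumn::Optional { docs: present, values } => {
                // Slow path: remember which docs actually had a value.
                self.docid_cache.clear();
                for &doc in docs {
                    if let Ok(row) = present.binary_search(&doc) {
                        self.docid_cache.push(doc);
                        self.val_cache.push(values[row]);
                    }
                }
            }
        }
    }

    /// `docs` must be the slice that was passed to `fetch_block`.
    fn iter_docid_vals<'a>(
        &'a self,
        docs: &'a [u32],
        column: &'a ToyColumn,
    ) -> impl Iterator<Item = (u32, u64)> + 'a {
        let ids: &'a [u32] = match column {
            ToyColumn::Full(_) => docs,
            ToyColumn::Optional { .. } => self.docid_cache.as_slice(),
        };
        ids.iter().copied().zip(self.val_cache.iter().copied())
    }
}

fn main() {
    let mut acc = ToyBlockAccessor::default();
    let full = ToyColumn::Full(vec![10, 11, 12, 13]);
    let docs = [1u32, 3];
    acc.fetch_block(&docs, &full);
    for (doc, val) in acc.iter_docid_vals(&docs, &full) {
        println!("doc {} -> {}", doc, val);
    }
}
```

The sketch leaves out multivalued columns and the missing-value handling of `fetch_block_with_missing`. It only illustrates why the signature of `iter_docid_vals` gained the `docs` and `accessor` parameters.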

View File

@@ -3,17 +3,17 @@ mod serialize;
use std::fmt::{self, Debug};
use std::io::Write;
use std::ops::{Deref, Range, RangeInclusive};
use std::ops::{Range, RangeInclusive};
use std::sync::Arc;
use common::BinarySerializable;
pub use dictionary_encoded::{BytesColumn, StrColumn};
pub use serialize::{
open_column_bytes, open_column_str, open_column_u128, open_column_u64,
serialize_column_mappable_to_u128, serialize_column_mappable_to_u64,
open_column_bytes, open_column_str, open_column_u128, open_column_u128_as_compact_u64,
open_column_u64, serialize_column_mappable_to_u128, serialize_column_mappable_to_u64,
};
use crate::column_index::ColumnIndex;
use crate::column_index::{ColumnIndex, Set};
use crate::column_values::monotonic_mapping::StrictlyMonotonicMappingToInternal;
use crate::column_values::{monotonic_map_column, ColumnValues};
use crate::{Cardinality, DocId, EmptyColumnValues, MonotonicallyMappableToU64, RowId};
@@ -83,10 +83,36 @@ impl<T: PartialOrd + Copy + Debug + Send + Sync + 'static> Column<T> {
self.values.max_value()
}
#[inline]
pub fn first(&self, row_id: RowId) -> Option<T> {
self.values_for_doc(row_id).next()
}
/// Load the first value for each docid in the provided slice.
#[inline]
pub fn first_vals(&self, docids: &[DocId], output: &mut [Option<T>]) {
match &self.index {
ColumnIndex::Empty { .. } => {}
ColumnIndex::Full => self.values.get_vals_opt(docids, output),
ColumnIndex::Optional(optional_index) => {
for (i, docid) in docids.iter().enumerate() {
output[i] = optional_index
.rank_if_exists(*docid)
.map(|rowid| self.values.get_val(rowid));
}
}
ColumnIndex::Multivalued(multivalued_index) => {
for (i, docid) in docids.iter().enumerate() {
let range = multivalued_index.range(*docid);
let is_empty = range.start == range.end;
if !is_empty {
output[i] = Some(self.values.get_val(range.start));
}
}
}
}
}
/// Translates a block of docids to row_ids.
///
/// returns the row_ids and the matching docids on the same index
@@ -105,7 +131,8 @@ impl<T: PartialOrd + Copy + Debug + Send + Sync + 'static> Column<T> {
}
pub fn values_for_doc(&self, doc_id: DocId) -> impl Iterator<Item = T> + '_ {
self.value_row_ids(doc_id)
self.index
.value_row_ids(doc_id)
.map(|value_row_id: RowId| self.values.get_val(value_row_id))
}
@@ -147,14 +174,6 @@ impl<T: PartialOrd + Copy + Debug + Send + Sync + 'static> Column<T> {
}
}
impl<T> Deref for Column<T> {
type Target = ColumnIndex;
fn deref(&self) -> &Self::Target {
&self.index
}
}
impl BinarySerializable for Cardinality {
fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> std::io::Result<()> {
self.to_code().serialize(writer)
@@ -176,6 +195,7 @@ struct FirstValueWithDefault<T: Copy> {
impl<T: PartialOrd + Debug + Send + Sync + Copy + 'static> ColumnValues<T>
for FirstValueWithDefault<T>
{
#[inline(always)]
fn get_val(&self, idx: u32) -> T {
self.column.first(idx).unwrap_or(self.default_value)
}

View File

@@ -76,6 +76,26 @@ pub fn open_column_u128<T: MonotonicallyMappableToU128>(
})
}
/// Open the column as u64.
///
/// See [`open_u128_as_compact_u64`] for more details.
pub fn open_column_u128_as_compact_u64(bytes: OwnedBytes) -> io::Result<Column<u64>> {
let (body, column_index_num_bytes_payload) = bytes.rsplit(4);
let column_index_num_bytes = u32::from_le_bytes(
column_index_num_bytes_payload
.as_slice()
.try_into()
.unwrap(),
);
let (column_index_data, column_values_data) = body.split(column_index_num_bytes as usize);
let column_index = crate::column_index::open_column_index(column_index_data)?;
let column_values = crate::column_values::open_u128_as_compact_u64(column_values_data)?;
Ok(Column {
index: column_index,
values: column_values,
})
}
pub fn open_column_bytes(data: OwnedBytes) -> io::Result<BytesColumn> {
let (body, dictionary_len_bytes) = data.rsplit(4);
let dictionary_len = u32::from_le_bytes(dictionary_len_bytes.as_slice().try_into().unwrap());
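
The byte layout that `open_column_u128_as_compact_u64` appears to parse is: column index bytes, then column values bytes, then a 4-byte little-endian footer holding the length of the column index. The following sketch shows that split on a plain byte slice. It assumes `rsplit(4)` splits off the last four bytes, uses a hypothetical helper name, and returns `None` on malformed input where the diff unwraps.

```rust
use std::convert::TryInto;

/// Splits `[index bytes | values bytes | u32 LE index length]` into
/// `(index bytes, values bytes)`. Hypothetical helper, not a tantivy API.
fn split_index_and_values(bytes: &[u8]) -> Option<(&[u8], &[u8])> {
    if bytes.len() < 4 {
        return None;
    }
    // The last four bytes store the length of the index section.
    let (body, footer) = bytes.split_at(bytes.len() - 4);
    let footer: [u8; 4] = footer.try_into().ok()?;
    let index_len = u32::from_le_bytes(footer) as usize;
    if index_len > body.len() {
        return None;
    }
    Some(body.split_at(index_len))
}

fn main() {
    // Three index bytes, two values bytes, then the footer encoding 3.
    let mut bytes = vec![1u8, 2, 3, 9, 9];
    bytes.extend_from_slice(&3u32.to_le_bytes());
    println!("{:?}", split_index_and_values(&bytes));
}
```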

View File

@@ -140,7 +140,7 @@ mod tests {
#[test]
fn test_merge_column_index_optional_shuffle() {
let optional_index: ColumnIndex = OptionalIndex::for_test(2, &[0]).into();
let column_indexes = vec![optional_index, ColumnIndex::Full];
let column_indexes = [optional_index, ColumnIndex::Full];
let row_addrs = vec![
RowAddr {
segment_ord: 0u32,

View File

@@ -42,10 +42,6 @@ impl From<MultiValueIndex> for ColumnIndex {
}
impl ColumnIndex {
#[inline]
pub fn is_multivalue(&self) -> bool {
matches!(self, ColumnIndex::Multivalued(_))
}
/// Returns the cardinality of the column index.
///
/// By convention, if the column contains no docs, we consider that it is

View File

@@ -1,4 +1,3 @@
use std::convert::TryInto;
use std::io::{self, Write};
use common::BinarySerializable;

View File

@@ -1,5 +1,4 @@
use proptest::prelude::{any, prop, *};
use proptest::strategy::Strategy;
use proptest::prelude::*;
use proptest::{prop_oneof, proptest};
use super::*;

View File

@@ -10,7 +10,7 @@ pub(crate) struct MergedColumnValues<'a, T> {
pub(crate) merge_row_order: &'a MergeRowOrder,
}
impl<'a, T: Copy + PartialOrd + Debug> Iterable<T> for MergedColumnValues<'a, T> {
impl<'a, T: Copy + PartialOrd + Debug + 'static> Iterable<T> for MergedColumnValues<'a, T> {
fn boxed_iter(&self) -> Box<dyn Iterator<Item = T> + '_> {
match self.merge_row_order {
MergeRowOrder::Stack(_) => Box::new(

View File

@@ -10,6 +10,7 @@ use std::fmt::Debug;
use std::ops::{Range, RangeInclusive};
use std::sync::Arc;
use downcast_rs::DowncastSync;
pub use monotonic_mapping::{MonotonicallyMappableToU64, StrictlyMonotonicFn};
pub use monotonic_mapping_u128::MonotonicallyMappableToU128;
@@ -25,7 +26,10 @@ mod monotonic_column;
pub(crate) use merge::MergedColumnValues;
pub use stats::ColumnStats;
pub use u128_based::{open_u128_mapped, serialize_column_values_u128};
pub use u128_based::{
open_u128_as_compact_u64, open_u128_mapped, serialize_column_values_u128,
CompactSpaceU64Accessor,
};
pub use u64_based::{
load_u64_based_column_values, serialize_and_load_u64_based_column_values,
serialize_u64_based_column_values, CodecType, ALL_U64_CODEC_TYPES,
@@ -41,7 +45,7 @@ use crate::RowId;
///
/// Any methods with a default and specialized implementation need to be called in the
/// wrappers that implement the trait: Arc and MonotonicMappingColumn
pub trait ColumnValues<T: PartialOrd = u64>: Send + Sync {
pub trait ColumnValues<T: PartialOrd = u64>: Send + Sync + DowncastSync {
/// Return the value associated with the given idx.
///
/// This accessor should return as fast as possible.
@@ -68,11 +72,40 @@ pub trait ColumnValues<T: PartialOrd = u64>: Send + Sync {
out_x4[3] = self.get_val(idx_x4[3]);
}
let step_size = 4;
let cutoff = indexes.len() - indexes.len() % step_size;
let out_and_idx_chunks = output
.chunks_exact_mut(4)
.into_remainder()
.iter_mut()
.zip(indexes.chunks_exact(4).remainder());
for (out, idx) in out_and_idx_chunks {
*out = self.get_val(*idx);
}
}
for idx in cutoff..indexes.len() {
output[idx] = self.get_val(indexes[idx]);
/// Pushes down multiple fetch calls to avoid dynamic dispatch overhead.
/// The slightly unusual `Option<T>` in `output` allows pushdown to full columns.
///
/// `indexes` and `output` must have the same length.
///
/// # Panics
///
/// Panics if `indexes` and `output` differ in length. May panic if any index is out of
/// bounds for the column.
fn get_vals_opt(&self, indexes: &[u32], output: &mut [Option<T>]) {
assert!(indexes.len() == output.len());
let out_and_idx_chunks = output.chunks_exact_mut(4).zip(indexes.chunks_exact(4));
for (out_x4, idx_x4) in out_and_idx_chunks {
out_x4[0] = Some(self.get_val(idx_x4[0]));
out_x4[1] = Some(self.get_val(idx_x4[1]));
out_x4[2] = Some(self.get_val(idx_x4[2]));
out_x4[3] = Some(self.get_val(idx_x4[3]));
}
let out_and_idx_chunks = output
.chunks_exact_mut(4)
.into_remainder()
.iter_mut()
.zip(indexes.chunks_exact(4).remainder());
for (out, idx) in out_and_idx_chunks {
*out = Some(self.get_val(*idx));
}
}
@@ -139,6 +172,7 @@ pub trait ColumnValues<T: PartialOrd = u64>: Send + Sync {
Box::new((0..self.num_vals()).map(|idx| self.get_val(idx)))
}
}
downcast_rs::impl_downcast!(sync ColumnValues<T> where T: PartialOrd);
/// Empty column of values.
pub struct EmptyColumnValues;
@@ -161,12 +195,17 @@ impl<T: PartialOrd + Default> ColumnValues<T> for EmptyColumnValues {
}
}
impl<T: Copy + PartialOrd + Debug> ColumnValues<T> for Arc<dyn ColumnValues<T>> {
impl<T: Copy + PartialOrd + Debug + 'static> ColumnValues<T> for Arc<dyn ColumnValues<T>> {
#[inline(always)]
fn get_val(&self, idx: u32) -> T {
self.as_ref().get_val(idx)
}
#[inline(always)]
fn get_vals_opt(&self, indexes: &[u32], output: &mut [Option<T>]) {
self.as_ref().get_vals_opt(indexes, output)
}
#[inline(always)]
fn min_value(&self) -> T {
self.as_ref().min_value()
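
The batched accessors in this file process four values per loop iteration and then handle the remainder separately. A minimal standalone sketch of that pattern, assuming a closure `get_val` in place of the trait method (the name `fetch_batch` is hypothetical):

```rust
/// Fills `output` with `Some(get_val(idx))` for every entry of `indexes`,
/// handling blocks of four first and the leftover tail afterwards.
fn fetch_batch(get_val: impl Fn(u32) -> u64, indexes: &[u32], output: &mut [Option<u64>]) {
    assert_eq!(indexes.len(), output.len());
    let mut out_chunks = output.chunks_exact_mut(4);
    let idx_chunks = indexes.chunks_exact(4);
    // `remainder` borrows from the original slice, so it stays valid after
    // `idx_chunks` is consumed by `zip` below.
    let idx_remainder = idx_chunks.remainder();
    for (out_x4, idx_x4) in out_chunks.by_ref().zip(idx_chunks) {
        out_x4[0] = Some(get_val(idx_x4[0]));
        out_x4[1] = Some(get_val(idx_x4[1]));
        out_x4[2] = Some(get_val(idx_x4[2]));
        out_x4[3] = Some(get_val(idx_x4[3]));
    }
    // Fewer than four elements remain on both sides.
    for (out, idx) in out_chunks.into_remainder().iter_mut().zip(idx_remainder) {
        *out = Some(get_val(*idx));
    }
}
```

Unlike the code in the diff, this sketch reuses one `chunks_exact_mut` iterator for both loops; the observable result should be the same.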

View File

@@ -31,10 +31,10 @@ pub fn monotonic_map_column<C, T, Input, Output>(
monotonic_mapping: T,
) -> impl ColumnValues<Output>
where
C: ColumnValues<Input>,
T: StrictlyMonotonicFn<Input, Output> + Send + Sync,
Input: PartialOrd + Debug + Send + Sync + Clone,
Output: PartialOrd + Debug + Send + Sync + Clone,
C: ColumnValues<Input> + 'static,
T: StrictlyMonotonicFn<Input, Output> + Send + Sync + 'static,
Input: PartialOrd + Debug + Send + Sync + Clone + 'static,
Output: PartialOrd + Debug + Send + Sync + Clone + 'static,
{
MonotonicMappingColumn {
from_column,
@@ -45,10 +45,10 @@ where
impl<C, T, Input, Output> ColumnValues<Output> for MonotonicMappingColumn<C, T, Input>
where
C: ColumnValues<Input>,
T: StrictlyMonotonicFn<Input, Output> + Send + Sync,
Input: PartialOrd + Send + Debug + Sync + Clone,
Output: PartialOrd + Send + Debug + Sync + Clone,
C: ColumnValues<Input> + 'static,
T: StrictlyMonotonicFn<Input, Output> + Send + Sync + 'static,
Input: PartialOrd + Send + Debug + Sync + Clone + 'static,
Output: PartialOrd + Send + Debug + Sync + Clone + 'static,
{
#[inline(always)]
fn get_val(&self, idx: u32) -> Output {
@@ -107,7 +107,7 @@ mod tests {
#[test]
fn test_monotonic_mapping_iter() {
let vals: Vec<u64> = (0..100u64).map(|el| el * 10).collect();
let col = VecColumn::from(&vals);
let col = VecColumn::from(vals);
let mapped = monotonic_map_column(
col,
StrictlyMonotonicMappingInverter::from(StrictlyMonotonicMappingToInternal::<i64>::new()),

View File

@@ -22,7 +22,7 @@ mod build_compact_space;
use build_compact_space::get_compact_space;
use common::{BinarySerializable, CountingWriter, OwnedBytes, VInt, VIntU128};
use tantivy_bitpacker::{self, BitPacker, BitUnpacker};
use tantivy_bitpacker::{BitPacker, BitUnpacker};
use crate::column_values::ColumnValues;
use crate::RowId;
@@ -148,7 +148,7 @@ impl CompactSpace {
.binary_search_by_key(&compact, |range_mapping| range_mapping.compact_start)
// Correctness: Overflow. The first range starts at compact space 0, the error from
// binary search can never be 0
.map_or_else(|e| e - 1, |v| v);
.unwrap_or_else(|e| e - 1);
let range_mapping = &self.ranges_mapping[pos];
let diff = compact - range_mapping.compact_start;
@@ -292,6 +292,63 @@ impl BinarySerializable for IPCodecParams {
}
}
/// Exposes the compact-space-compressed values as u64.
///
/// This allows faster access to the values, as u64 is faster to work with than u128.
/// It also lets u128 values be handled like u64 values through `open_u64_lenient`, which
/// then serves as a uniform access interface.
///
/// To convert a compact u64 value back to u128, use `compact_to_u128`.
pub struct CompactSpaceU64Accessor(CompactSpaceDecompressor);
impl CompactSpaceU64Accessor {
pub(crate) fn open(data: OwnedBytes) -> io::Result<CompactSpaceU64Accessor> {
let decompressor = CompactSpaceU64Accessor(CompactSpaceDecompressor::open(data)?);
Ok(decompressor)
}
/// Convert a compact space value to u128
pub fn compact_to_u128(&self, compact: u32) -> u128 {
self.0.compact_to_u128(compact)
}
}
impl ColumnValues<u64> for CompactSpaceU64Accessor {
#[inline]
fn get_val(&self, doc: u32) -> u64 {
let compact = self.0.get_compact(doc);
compact as u64
}
fn min_value(&self) -> u64 {
self.0.u128_to_compact(self.0.min_value()).unwrap() as u64
}
fn max_value(&self) -> u64 {
self.0.u128_to_compact(self.0.max_value()).unwrap() as u64
}
fn num_vals(&self) -> u32 {
self.0.params.num_vals
}
#[inline]
fn iter(&self) -> Box<dyn Iterator<Item = u64> + '_> {
Box::new(self.0.iter_compact().map(|el| el as u64))
}
#[inline]
fn get_row_ids_for_value_range(
&self,
value_range: RangeInclusive<u64>,
position_range: Range<u32>,
positions: &mut Vec<u32>,
) {
let value_range = self.0.compact_to_u128(*value_range.start() as u32)
..=self.0.compact_to_u128(*value_range.end() as u32);
self.0
.get_row_ids_for_value_range(value_range, position_range, positions)
}
}
impl ColumnValues<u128> for CompactSpaceDecompressor {
#[inline]
fn get_val(&self, doc: u32) -> u128 {
@@ -402,9 +459,14 @@ impl CompactSpaceDecompressor {
.map(|compact| self.compact_to_u128(compact))
}
#[inline]
pub fn get_compact(&self, idx: u32) -> u32 {
self.params.bit_unpacker.get(idx, &self.data) as u32
}
#[inline]
pub fn get(&self, idx: u32) -> u128 {
let compact = self.params.bit_unpacker.get(idx, &self.data) as u32;
let compact = self.get_compact(idx);
self.compact_to_u128(compact)
}
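
The compact-to-u128 lookup relies on `binary_search_by_key` followed by `unwrap_or_else(|e| e - 1)`: an exact hit returns the matching range, and a miss falls back to the range that starts just before the insertion point. A minimal sketch under assumed, simplified types (`RangeMapping` here has only two fields and `compact_to_value` is a hypothetical name):

```rust
/// Simplified stand-in for the crate's range mapping (assumption for illustration).
struct RangeMapping {
    value_start: u128,
    compact_start: u32,
}

/// Maps a compact value back to the original value space.
/// `ranges` must be sorted by `compact_start`, and the first range must start at 0,
/// so the `e - 1` below cannot underflow.
fn compact_to_value(ranges: &[RangeMapping], compact: u32) -> u128 {
    let pos = ranges
        .binary_search_by_key(&compact, |range| range.compact_start)
        .unwrap_or_else(|insert_pos| insert_pos - 1);
    let range = &ranges[pos];
    range.value_start + u128::from(compact - range.compact_start)
}
```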

View File

@@ -6,7 +6,9 @@ use std::sync::Arc;
mod compact_space;
use common::{BinarySerializable, OwnedBytes, VInt};
use compact_space::{CompactSpaceCompressor, CompactSpaceDecompressor};
pub use compact_space::{
CompactSpaceCompressor, CompactSpaceDecompressor, CompactSpaceU64Accessor,
};
use crate::column_values::monotonic_map_column;
use crate::column_values::monotonic_mapping::{
@@ -108,6 +110,23 @@ pub fn open_u128_mapped<T: MonotonicallyMappableToU128 + Debug>(
StrictlyMonotonicMappingToInternal::<T>::new().into();
Ok(Arc::new(monotonic_map_column(reader, inverted)))
}
/// Returns the u64 representation of the u128 data.
/// The internal representation of the data as u64 is useful for faster processing.
///
/// To convert back to u128, downcast to `CompactSpaceU64Accessor` and call
/// `compact_to_u128`.
///
/// # Notice
/// If new codecs are added, check for usages of `CompactSpaceU64Accessor` and
/// also handle the new codecs.
pub fn open_u128_as_compact_u64(mut bytes: OwnedBytes) -> io::Result<Arc<dyn ColumnValues<u64>>> {
let header = U128Header::deserialize(&mut bytes)?;
assert_eq!(header.codec_type, U128FastFieldCodecType::CompactSpace);
let reader = CompactSpaceU64Accessor::open(bytes)?;
Ok(Arc::new(reader))
}
#[cfg(test)]
pub mod tests {
use super::*;

View File

@@ -63,7 +63,6 @@ impl ColumnValues for BitpackedReader {
fn get_val(&self, doc: u32) -> u64 {
self.stats.min_value + self.stats.gcd.get() * self.bit_unpacker.get(doc, &self.data)
}
#[inline]
fn min_value(&self) -> u64 {
self.stats.min_value

View File

@@ -63,7 +63,10 @@ impl BlockwiseLinearEstimator {
if self.block.is_empty() {
return;
}
let line = Line::train(&VecColumn::from(&self.block));
let column = VecColumn::from(std::mem::take(&mut self.block));
let line = Line::train(&column);
self.block = column.into();
let mut max_value = 0u64;
for (i, buffer_val) in self.block.iter().enumerate() {
let interpolated_val = line.eval(i as u32);
@@ -125,7 +128,7 @@ impl ColumnCodecEstimator for BlockwiseLinearEstimator {
*buffer_val = gcd_divider.divide(*buffer_val - stats.min_value);
}
let line = Line::train(&VecColumn::from(&buffer));
let line = Line::train(&VecColumn::from(buffer.to_vec()));
assert!(!buffer.is_empty());

View File

@@ -184,7 +184,7 @@ mod tests {
}
fn test_eval_max_err(ys: &[u64]) -> Option<u64> {
let line = Line::train(&VecColumn::from(&ys));
let line = Line::train(&VecColumn::from(ys.to_vec()));
ys.iter()
.enumerate()
.map(|(x, y)| y.wrapping_sub(line.eval(x as u32)))

View File

@@ -173,7 +173,9 @@ impl LinearCodecEstimator {
fn collect_before_line_estimation(&mut self, value: u64) {
self.block.push(value);
if self.block.len() == LINE_ESTIMATION_BLOCK_LEN {
let line = Line::train(&VecColumn::from(&self.block));
let column = VecColumn::from(std::mem::take(&mut self.block));
let line = Line::train(&column);
self.block = column.into();
let block = std::mem::take(&mut self.block);
for val in block {
self.collect_after_line_estimation(&line, val);

View File

@@ -1,5 +1,4 @@
use proptest::prelude::*;
use proptest::strategy::Strategy;
use proptest::{prop_oneof, proptest};
#[test]

View File

@@ -4,14 +4,14 @@ use tantivy_bitpacker::minmax;
use crate::ColumnValues;
/// VecColumn provides `Column` over a slice.
pub struct VecColumn<'a, T = u64> {
pub(crate) values: &'a [T],
/// VecColumn provides `Column` over a `Vec<T>`.
pub struct VecColumn<T = u64> {
pub(crate) values: Vec<T>,
pub(crate) min_value: T,
pub(crate) max_value: T,
}
impl<'a, T: Copy + PartialOrd + Send + Sync + Debug> ColumnValues<T> for VecColumn<'a, T> {
impl<T: Copy + PartialOrd + Send + Sync + Debug + 'static> ColumnValues<T> for VecColumn<T> {
fn get_val(&self, position: u32) -> T {
self.values[position as usize]
}
@@ -37,11 +37,8 @@ impl<'a, T: Copy + PartialOrd + Send + Sync + Debug> ColumnValues<T> for VecColu
}
}
impl<'a, T: Copy + PartialOrd + Default, V> From<&'a V> for VecColumn<'a, T>
where V: AsRef<[T]> + ?Sized
{
fn from(values: &'a V) -> Self {
let values = values.as_ref();
impl<T: Copy + PartialOrd + Default> From<Vec<T>> for VecColumn<T> {
fn from(values: Vec<T>) -> Self {
let (min_value, max_value) = minmax(values.iter().copied()).unwrap_or_default();
Self {
values,
@@ -50,3 +47,8 @@ where V: AsRef<[T]> + ?Sized
}
}
}
impl From<VecColumn> for Vec<u64> {
fn from(column: VecColumn) -> Self {
column.values
}
}
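
With `VecColumn` now owning a `Vec<T>` instead of borrowing a slice, the estimators move their block into the column with `std::mem::take` and move it back afterwards. A minimal sketch of that round trip, using hypothetical stand-in types (`OwnedColumn`, `Estimator`) rather than the crate's own:

```rust
/// Stand-in for an owning column with cached min/max (assumption for illustration).
struct OwnedColumn {
    values: Vec<u64>,
    min_value: u64,
    max_value: u64,
}

impl From<Vec<u64>> for OwnedColumn {
    fn from(values: Vec<u64>) -> Self {
        let min_value = values.iter().copied().min().unwrap_or_default();
        let max_value = values.iter().copied().max().unwrap_or_default();
        OwnedColumn { values, min_value, max_value }
    }
}

impl From<OwnedColumn> for Vec<u64> {
    fn from(column: OwnedColumn) -> Self {
        column.values
    }
}

struct Estimator {
    block: Vec<u64>,
}

impl Estimator {
    /// Temporarily moves `block` into a column, reads from it, then restores it.
    fn value_range(&mut self) -> u64 {
        // `take` leaves an empty Vec behind, so no allocation is copied.
        let column = OwnedColumn::from(std::mem::take(&mut self.block));
        let range = column.max_value - column.min_value;
        self.block = column.into();
        range
    }
}
```

The round trip avoids cloning the buffer, at the cost of `block` being empty while the column exists.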

View File

@@ -1,7 +1,3 @@
use std::collections::BTreeMap;
use itertools::Itertools;
use super::*;
use crate::{Cardinality, ColumnarWriter, HasAssociatedColumnType, RowId};

View File

@@ -13,9 +13,7 @@ pub(crate) use serializer::ColumnarSerializer;
use stacker::{Addr, ArenaHashMap, MemoryArena};
use crate::column_index::SerializableColumnIndex;
use crate::column_values::{
ColumnValues, MonotonicallyMappableToU128, MonotonicallyMappableToU64, VecColumn,
};
use crate::column_values::{MonotonicallyMappableToU128, MonotonicallyMappableToU64};
use crate::columnar::column_type::ColumnType;
use crate::columnar::writer::column_writers::{
ColumnWriter, NumericalColumnWriter, StrOrBytesColumnWriter,
@@ -645,10 +643,7 @@ fn send_to_serialize_column_mappable_to_u128<
value_index_builders: &mut PreallocatedIndexBuilders,
values: &mut Vec<T>,
mut wrt: impl io::Write,
) -> io::Result<()>
where
for<'a> VecColumn<'a, T>: ColumnValues<T>,
{
) -> io::Result<()> {
values.clear();
// TODO: split index and values
let serializable_column_index = match cardinality {
@@ -701,10 +696,7 @@ fn send_to_serialize_column_mappable_to_u64(
value_index_builders: &mut PreallocatedIndexBuilders,
values: &mut Vec<u64>,
mut wrt: impl io::Write,
) -> io::Result<()>
where
for<'a> VecColumn<'a, u64>: ColumnValues<u64>,
{
) -> io::Result<()> {
values.clear();
let serializable_column_index = match cardinality {
Cardinality::Full => {

View File

@@ -18,7 +18,12 @@ pub struct ColumnarSerializer<W: io::Write> {
/// code.
fn prepare_key(key: &[u8], column_type: ColumnType, buffer: &mut Vec<u8>) {
buffer.clear();
buffer.extend_from_slice(key);
// Convert 0 bytes to '0' string, as 0 bytes are reserved for the end of the path.
if key.contains(&0u8) {
buffer.extend(key.iter().map(|&b| if b == 0 { b'0' } else { b }));
} else {
buffer.extend_from_slice(key);
}
buffer.push(0u8);
buffer.push(column_type.to_code());
}
@@ -96,14 +101,13 @@ impl<'a, W: io::Write> io::Write for ColumnSerializer<'a, W> {
#[cfg(test)]
mod tests {
use super::*;
use crate::columnar::column_type::ColumnType;
#[test]
fn test_prepare_key_bytes() {
let mut buffer: Vec<u8> = b"somegarbage".to_vec();
prepare_key(b"root\0child", ColumnType::Str, &mut buffer);
assert_eq!(buffer.len(), 12);
assert_eq!(&buffer[..10], b"root\0child");
assert_eq!(&buffer[..10], b"root0child");
assert_eq!(buffer[10], 0u8);
assert_eq!(buffer[11], ColumnType::Str.to_code());
}
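
The updated `prepare_key` replaces any zero byte in the key with ASCII `'0'`, because the zero byte is reserved as the end-of-path marker. A minimal standalone sketch follows; `prepare_key_sketch` and the raw `type_code: u8` parameter are assumptions that stand in for the crate's `ColumnType`:

```rust
/// Writes `key`, a 0 terminator, and a type code into `buffer`.
/// Zero bytes inside `key` are rewritten to b'0' so the terminator stays unambiguous.
fn prepare_key_sketch(key: &[u8], type_code: u8, buffer: &mut Vec<u8>) {
    buffer.clear();
    buffer.extend(key.iter().map(|&b| if b == 0 { b'0' } else { b }));
    buffer.push(0u8);
    buffer.push(type_code);
}
```

The version in the diff additionally checks `key.contains(&0u8)` first so that the common case can use `extend_from_slice`; the sketch omits that fast path.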

View File

@@ -8,7 +8,7 @@ use common::{ByteCount, DateTime, HasLen, OwnedBytes};
use crate::column::{BytesColumn, Column, StrColumn};
use crate::column_values::{monotonic_map_column, StrictlyMonotonicFn};
use crate::columnar::ColumnType;
use crate::{Cardinality, ColumnIndex, NumericalType};
use crate::{Cardinality, ColumnIndex, ColumnValues, NumericalType};
#[derive(Clone)]
pub enum DynamicColumn {
@@ -247,7 +247,12 @@ impl DynamicColumnHandle {
}
/// Returns the `u64` fast field reader associated with `fields` of types
/// Str, u64, i64, f64, bool, or datetime.
/// Str, u64, i64, f64, bool, ip, or datetime.
///
/// Note that for IpAddr, the fast field reader returns the compact u64 representation of
/// the IpAddr.
/// To convert back to u128, downcast to `CompactSpaceU64Accessor` and call
/// `compact_to_u128`.
///
/// If not, the fast field reader returns the u64-value associated with the original
/// FastValue.
@@ -258,7 +263,10 @@ impl DynamicColumnHandle {
let column: BytesColumn = crate::column::open_column_bytes(column_bytes)?;
Ok(Some(column.term_ord_column))
}
ColumnType::IpAddr => Ok(None),
ColumnType::IpAddr => {
let column = crate::column::open_column_u128_as_compact_u64(column_bytes)?;
Ok(Some(column))
}
ColumnType::Bool
| ColumnType::I64
| ColumnType::U64

View File

@@ -113,6 +113,9 @@ impl Cardinality {
pub fn is_multivalue(&self) -> bool {
matches!(self, Cardinality::Multivalued)
}
pub fn is_full(&self) -> bool {
matches!(self, Cardinality::Full)
}
pub(crate) fn to_code(self) -> u8 {
self as u8
}

View File

@@ -1,6 +1,6 @@
[package]
name = "tantivy-common"
version = "0.6.0"
version = "0.7.0"
authors = ["Paul Masurel <paul@quickwit.io>", "Pascal Seitz <pascal@quickwit.io>"]
license = "MIT"
edition = "2021"
@@ -14,7 +14,7 @@ repository = "https://github.com/quickwit-oss/tantivy"
[dependencies]
byteorder = "1.4.3"
ownedbytes = { version= "0.6", path="../ownedbytes" }
ownedbytes = { version= "0.7", path="../ownedbytes" }
async-trait = "0.1"
time = { version = "0.3.10", features = ["serde-well-known"] }
serde = { version = "1.0.136", features = ["derive"] }

View File

@@ -1,6 +1,5 @@
use std::convert::TryInto;
use std::io::Write;
use std::{fmt, io, u64};
use std::{fmt, io};
use ownedbytes::OwnedBytes;

View File

@@ -1,5 +1,3 @@
#![allow(deprecated)]
use std::fmt;
use std::io::{Read, Write};
@@ -27,9 +25,6 @@ pub enum DateTimePrecision {
Nanoseconds,
}
#[deprecated(since = "0.20.0", note = "Use `DateTimePrecision` instead")]
pub type DatePrecision = DateTimePrecision;
/// A date/time value with nanoseconds precision.
///
/// This timestamp does not carry any explicit time zone information.
@@ -40,7 +35,7 @@ pub type DatePrecision = DateTimePrecision;
/// All constructors and conversions are provided as explicit
/// functions and not by implementing any `From`/`Into` traits
/// to prevent unintended usage.
#[derive(Clone, Default, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
#[derive(Clone, Default, Copy, PartialEq, Eq, PartialOrd, Ord, Hash, Serialize, Deserialize)]
pub struct DateTime {
// Timestamp in nanoseconds.
pub(crate) timestamp_nanos: i64,

View File

@@ -5,6 +5,12 @@ pub const JSON_PATH_SEGMENT_SEP: u8 = 1u8;
pub const JSON_PATH_SEGMENT_SEP_STR: &str =
unsafe { std::str::from_utf8_unchecked(&[JSON_PATH_SEGMENT_SEP]) };
/// Separates the json path and the value in
/// a JSON term binary representation.
pub const JSON_END_OF_PATH: u8 = 0u8;
pub const JSON_END_OF_PATH_STR: &str =
unsafe { std::str::from_utf8_unchecked(&[JSON_END_OF_PATH]) };
/// Create a new JsonPathWriter, that creates flattened json paths for tantivy.
#[derive(Clone, Debug, Default)]
pub struct JsonPathWriter {
@@ -14,6 +20,14 @@ pub struct JsonPathWriter {
}
impl JsonPathWriter {
pub fn with_expand_dots(expand_dots: bool) -> Self {
JsonPathWriter {
path: String::new(),
indices: Vec::new(),
expand_dots,
}
}
pub fn new() -> Self {
JsonPathWriter {
path: String::new(),
@@ -39,8 +53,8 @@ impl JsonPathWriter {
pub fn push(&mut self, segment: &str) {
let len_path = self.path.len();
self.indices.push(len_path);
if !self.path.is_empty() {
self.path.push_str(JSON_PATH_SEGMENT_SEP_STR);
if self.indices.len() > 1 {
self.path.push(JSON_PATH_SEGMENT_SEP as char);
}
self.path.push_str(segment);
if self.expand_dots {
@@ -55,6 +69,12 @@ impl JsonPathWriter {
}
}
/// Set the end of JSON path marker.
#[inline]
pub fn set_end(&mut self) {
self.path.push_str(JSON_END_OF_PATH_STR);
}
/// Remove the last segment. Does nothing if the path is empty.
#[inline]
pub fn pop(&mut self) {
@@ -91,6 +111,7 @@ mod tests {
#[test]
fn json_path_writer_test() {
let mut writer = JsonPathWriter::new();
writer.set_expand_dots(false);
writer.push("root");
assert_eq!(writer.as_str(), "root");
@@ -109,4 +130,15 @@ mod tests {
writer.push("k8s.node.id");
assert_eq!(writer.as_str(), "root\u{1}k8s\u{1}node\u{1}id");
}
#[test]
fn test_json_path_expand_dots_enabled_pop_segment() {
let mut json_writer = JsonPathWriter::with_expand_dots(true);
json_writer.push("hello");
assert_eq!(json_writer.as_str(), "hello");
json_writer.push("color.hue");
assert_eq!(json_writer.as_str(), "hello\x01color\x01hue");
json_writer.pop();
assert_eq!(json_writer.as_str(), "hello");
}
}
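
The `push` change switches the separator condition from "path is non-empty" to "more than one segment recorded". One case where the two conditions differ is an empty first segment; whether that was the motivation for the fix is an assumption. A minimal sketch with a hypothetical `PathSketch` type, without the `expand_dots` handling:

```rust
const SEP: char = '\u{1}';

/// Simplified stand-in for `JsonPathWriter` (assumption for illustration).
#[derive(Default)]
struct PathSketch {
    path: String,
    indices: Vec<usize>,
}

impl PathSketch {
    fn push(&mut self, segment: &str) {
        // Remember where this segment starts so `pop` can truncate back to it.
        self.indices.push(self.path.len());
        // Separator decision is based on segment count, not on path contents.
        if self.indices.len() > 1 {
            self.path.push(SEP);
        }
        self.path.push_str(segment);
    }

    fn pop(&mut self) {
        if let Some(len) = self.indices.pop() {
            self.path.truncate(len);
        }
    }
}
```

With an empty first segment, the second `push` still inserts a separator, which the `!self.path.is_empty()` check would have skipped.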

View File

@@ -9,14 +9,12 @@ mod byte_count;
mod datetime;
pub mod file_slice;
mod group_by;
mod json_path_writer;
pub mod json_path_writer;
mod serialize;
mod vint;
mod writer;
pub use bitset::*;
pub use byte_count::ByteCount;
#[allow(deprecated)]
pub use datetime::DatePrecision;
pub use datetime::{DateTime, DateTimePrecision};
pub use group_by::GroupByIteratorExtended;
pub use json_path_writer::JsonPathWriter;

View File

@@ -290,8 +290,7 @@ impl<'a> BinarySerializable for Cow<'a, [u8]> {
#[cfg(test)]
pub mod test {
use super::{VInt, *};
use crate::serialize::BinarySerializable;
use super::*;
pub fn fixed_size_test<O: BinarySerializable + FixedSize + Default>() {
let mut buffer = Vec::new();
O::default().serialize(&mut buffer).unwrap();

View File

@@ -1,7 +1,7 @@
[package]
authors = ["Paul Masurel <paul@quickwit.io>", "Pascal Seitz <pascal@quickwit.io>"]
name = "ownedbytes"
version = "0.6.0"
version = "0.7.0"
edition = "2021"
description = "Expose data as static slice"
license = "MIT"

View File

@@ -1,4 +1,3 @@
use std::convert::TryInto;
use std::ops::{Deref, Range};
use std::sync::Arc;
use std::{fmt, io};

View File

@@ -1,6 +1,6 @@
[package]
name = "tantivy-query-grammar"
version = "0.21.0"
version = "0.22.0"
authors = ["Paul Masurel <paul.masurel@gmail.com>"]
license = "MIT"
categories = ["database-implementations", "data-structures"]

View File

@@ -218,27 +218,14 @@ fn term_or_phrase_infallible(inp: &str) -> JResult<&str, Option<UserInputLeaf>>
}
fn term_group(inp: &str) -> IResult<&str, UserInputAst> {
let occur_symbol = alt((
value(Occur::MustNot, char('-')),
value(Occur::Must, char('+')),
));
map(
tuple((
terminated(field_name, multispace0),
delimited(
tuple((char('('), multispace0)),
separated_list0(multispace1, tuple((opt(occur_symbol), term_or_phrase))),
char(')'),
),
delimited(tuple((char('('), multispace0)), ast, char(')')),
)),
|(field_name, terms)| {
UserInputAst::Clause(
terms
.into_iter()
.map(|(occur, leaf)| (occur, leaf.set_field(Some(field_name.clone())).into()))
.collect(),
)
|(field_name, mut ast)| {
ast.set_default_field(field_name);
ast
},
)(inp)
}
@@ -258,46 +245,18 @@ fn term_group_precond(inp: &str) -> IResult<&str, (), ()> {
}
fn term_group_infallible(inp: &str) -> JResult<&str, UserInputAst> {
let (mut inp, (field_name, _, _, _)) =
let (inp, (field_name, _, _, _)) =
tuple((field_name, multispace0, char('('), multispace0))(inp).expect("precondition failed");
let mut terms = Vec::new();
let mut errs = Vec::new();
let mut first_round = true;
loop {
let mut space_error = if first_round {
first_round = false;
Vec::new()
} else {
let (rest, (_, err)) = space1_infallible(inp)?;
inp = rest;
err
};
if inp.is_empty() {
errs.push(LenientErrorInternal {
pos: inp.len(),
message: "missing )".to_string(),
});
break Ok((inp, (UserInputAst::Clause(terms), errs)));
}
if let Some(inp) = inp.strip_prefix(')') {
break Ok((inp, (UserInputAst::Clause(terms), errs)));
}
// only append missing space error if we did not reach the end of group
errs.append(&mut space_error);
// here we do the assumption term_or_phrase_infallible always consume something if the
// first byte is not `)` or ' '. If it did not, we would end up looping.
let (rest, ((occur, leaf), mut err)) =
tuple_infallible((occur_symbol, term_or_phrase_infallible))(inp)?;
errs.append(&mut err);
if let Some(leaf) = leaf {
terms.push((occur, leaf.set_field(Some(field_name.clone())).into()));
}
inp = rest;
}
let res = delimited_infallible(
nothing,
map(ast_infallible, |(mut ast, errors)| {
ast.set_default_field(field_name.to_string());
(ast, errors)
}),
opt_i_err(char(')'), "expected ')'"),
)(inp);
res
}
fn exists(inp: &str) -> IResult<&str, UserInputLeaf> {
@@ -1468,8 +1427,18 @@ mod test {
#[test]
fn test_parse_query_term_group() {
test_parse_query_to_ast_helper(r#"field:(abc)"#, r#"(*"field":abc)"#);
test_parse_query_to_ast_helper(r#"field:(abc)"#, r#""field":abc"#);
test_parse_query_to_ast_helper(r#"field:(+a -"b c")"#, r#"(+"field":a -"field":"b c")"#);
test_parse_query_to_ast_helper(r#"field:(a AND "b c")"#, r#"(+"field":a +"field":"b c")"#);
test_parse_query_to_ast_helper(r#"field:(a OR "b c")"#, r#"(?"field":a ?"field":"b c")"#);
test_parse_query_to_ast_helper(
r#"field:(a OR (b AND c))"#,
r#"(?"field":a ?(+"field":b +"field":c))"#,
);
test_parse_query_to_ast_helper(
r#"field:(a [b TO c])"#,
r#"(*"field":a *"field":["b" TO "c"])"#,
);
test_is_parse_err(r#"field:(+a -"b c""#, r#"(+"field":a -"field":"b c")"#);
}

View File

@@ -44,6 +44,26 @@ impl UserInputLeaf {
},
}
}
pub(crate) fn set_default_field(&mut self, default_field: String) {
match self {
UserInputLeaf::Literal(ref mut literal) if literal.field_name.is_none() => {
literal.field_name = Some(default_field)
}
UserInputLeaf::All => {
*self = UserInputLeaf::Exists {
field: default_field,
}
}
UserInputLeaf::Range { ref mut field, .. } if field.is_none() => {
*field = Some(default_field)
}
UserInputLeaf::Set { ref mut field, .. } if field.is_none() => {
*field = Some(default_field)
}
_ => (), // field was already set, do nothing
}
}
}
impl Debug for UserInputLeaf {
@@ -205,6 +225,16 @@ impl UserInputAst {
pub fn or(asts: Vec<UserInputAst>) -> UserInputAst {
UserInputAst::compose(Occur::Should, asts)
}
pub(crate) fn set_default_field(&mut self, field: String) {
match self {
UserInputAst::Clause(clauses) => clauses
.iter_mut()
.for_each(|(_, ast)| ast.set_default_field(field.clone())),
UserInputAst::Leaf(leaf) => leaf.set_default_field(field),
UserInputAst::Boost(ref mut ast, _) => ast.set_default_field(field),
}
}
}
impl From<UserInputLiteral> for UserInputLeaf {
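
The new `set_default_field` walks the parsed AST and assigns the group's field only to leaves that do not already name one. A minimal sketch of that recursion on a hypothetical, much smaller `Ast` type (the real `UserInputAst` and `UserInputLeaf` have more variants, including the `All` to `Exists` rewrite shown above):

```rust
/// Simplified stand-in for the query AST (assumption for illustration).
#[derive(Debug, PartialEq)]
enum Ast {
    Leaf { field: Option<String>, term: String },
    Clause(Vec<Ast>),
    Boost(Box<Ast>, f64),
}

impl Ast {
    /// Sets `default_field` on every leaf that has no explicit field.
    fn set_default_field(&mut self, default_field: &str) {
        match self {
            Ast::Leaf { field, .. } => {
                if field.is_none() {
                    *field = Some(default_field.to_string());
                }
            }
            Ast::Clause(children) => {
                for child in children.iter_mut() {
                    child.set_default_field(default_field);
                }
            }
            Ast::Boost(inner, _) => inner.set_default_field(default_field),
        }
    }
}
```

Because explicit fields are left untouched, a nested `title:b` inside `body:(a title:b)` would keep its own field in this sketch.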

View File

@@ -170,8 +170,8 @@ impl AggregationWithAccessor {
ColumnType::Str,
ColumnType::DateTime,
ColumnType::Bool,
ColumnType::IpAddr,
// ColumnType::Bytes Unsupported
// ColumnType::IpAddr Unsupported
];
// In case the column is empty we want the shim column to match the missing type
@@ -292,7 +292,7 @@ impl AggregationWithAccessor {
add_agg_with_accessor(&agg, accessor, column_type, &mut res)?;
}
TopHits(ref mut top_hits) => {
top_hits.validate_and_resolve(reader.fast_fields().columnar())?;
top_hits.validate_and_resolve_field_names(reader.fast_fields().columnar())?;
let accessors: Vec<(Column<u64>, ColumnType)> = top_hits
.field_names()
.iter()

View File

@@ -4,6 +4,7 @@ use crate::aggregation::agg_req::{Aggregation, Aggregations};
use crate::aggregation::agg_result::AggregationResults;
use crate::aggregation::buf_collector::DOC_BLOCK_SIZE;
use crate::aggregation::collector::AggregationCollector;
use crate::aggregation::intermediate_agg_result::IntermediateAggregationResults;
use crate::aggregation::segment_agg_result::AggregationLimits;
use crate::aggregation::tests::{get_test_index_2_segments, get_test_index_from_values_and_terms};
use crate::aggregation::DistributedAggregationCollector;
@@ -66,6 +67,22 @@ fn test_aggregation_flushing(
}
}
},
"top_hits_test":{
"terms": {
"field": "string_id"
},
"aggs": {
"bucketsL2": {
"top_hits": {
"size": 2,
"sort": [
{ "score": "asc" }
],
"docvalue_fields": ["score"]
}
}
}
},
"histogram_test":{
"histogram": {
"field": "score",
@@ -108,6 +125,16 @@ fn test_aggregation_flushing(
let searcher = reader.searcher();
let intermediate_agg_result = searcher.search(&AllQuery, &collector).unwrap();
// Test postcard roundtrip serialization
let intermediate_agg_result_bytes = postcard::to_allocvec(&intermediate_agg_result).expect(
"Postcard Serialization failed, flatten etc. is not supported in the intermediate \
result",
);
let intermediate_agg_result: IntermediateAggregationResults =
postcard::from_bytes(&intermediate_agg_result_bytes)
.expect("Post deserialization failed");
intermediate_agg_result
.into_final_result(agg_req, &Default::default())
.unwrap()
@@ -816,38 +843,38 @@ fn test_aggregation_on_json_object_mixed_types() {
let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
// => Segment with all values numeric
index_writer
.add_document(doc!(json => json!({"mixed_type": 10.0})))
.add_document(doc!(json => json!({"mixed_type": 10.0, "mixed_price": 10.0})))
.unwrap();
index_writer.commit().unwrap();
// => Segment with all values text
index_writer
.add_document(doc!(json => json!({"mixed_type": "blue"})))
.add_document(doc!(json => json!({"mixed_type": "blue", "mixed_price": 5.0})))
.unwrap();
index_writer
.add_document(doc!(json => json!({"mixed_type": "blue"})))
.add_document(doc!(json => json!({"mixed_type": "blue", "mixed_price": 5.0})))
.unwrap();
index_writer
.add_document(doc!(json => json!({"mixed_type": "blue"})))
.add_document(doc!(json => json!({"mixed_type": "blue", "mixed_price": 5.0})))
.unwrap();
index_writer.commit().unwrap();
// => Segment with all boolean
index_writer
.add_document(doc!(json => json!({"mixed_type": true})))
.add_document(doc!(json => json!({"mixed_type": true, "mixed_price": "no_price"})))
.unwrap();
index_writer.commit().unwrap();
// => Segment with mixed values
index_writer
.add_document(doc!(json => json!({"mixed_type": "red"})))
.add_document(doc!(json => json!({"mixed_type": "red", "mixed_price": 1.0})))
.unwrap();
index_writer
.add_document(doc!(json => json!({"mixed_type": "red"})))
.add_document(doc!(json => json!({"mixed_type": "red", "mixed_price": 1.0})))
.unwrap();
index_writer
.add_document(doc!(json => json!({"mixed_type": -20.5})))
.add_document(doc!(json => json!({"mixed_type": -20.5, "mixed_price": -20.5})))
.unwrap();
index_writer
.add_document(doc!(json => json!({"mixed_type": true})))
.add_document(doc!(json => json!({"mixed_type": true, "mixed_price": "no_price"})))
.unwrap();
index_writer.commit().unwrap();
@@ -861,7 +888,7 @@ fn test_aggregation_on_json_object_mixed_types() {
"order": { "min_price": "desc" }
},
"aggs": {
"min_price": { "min": { "field": "json.mixed_type" } }
"min_price": { "min": { "field": "json.mixed_price" } }
}
},
"rangeagg": {
@@ -885,7 +912,6 @@ fn test_aggregation_on_json_object_mixed_types() {
let aggregation_results = searcher.search(&AllQuery, &aggregation_collector).unwrap();
let aggregation_res_json = serde_json::to_value(aggregation_results).unwrap();
// pretty print as json
use pretty_assertions::assert_eq;
assert_eq!(
&aggregation_res_json,
@@ -901,10 +927,10 @@ fn test_aggregation_on_json_object_mixed_types() {
"termagg": {
"buckets": [
{ "doc_count": 1, "key": 10.0, "min_price": { "value": 10.0 } },
{ "doc_count": 3, "key": "blue", "min_price": { "value": 5.0 } },
{ "doc_count": 2, "key": "red", "min_price": { "value": 1.0 } },
{ "doc_count": 1, "key": -20.5, "min_price": { "value": -20.5 } },
{ "doc_count": 2, "key": "red", "min_price": { "value": null } },
{ "doc_count": 2, "key": 1.0, "key_as_string": "true", "min_price": { "value": null } },
{ "doc_count": 3, "key": "blue", "min_price": { "value": null } },
],
"sum_other_doc_count": 0
}

View File

@@ -1,8 +1,5 @@
use std::cmp::Ordering;
use std::fmt::Display;
use columnar::ColumnType;
use itertools::Itertools;
use rustc_hash::FxHashMap;
use serde::{Deserialize, Serialize};
use tantivy_bitpacker::minmax;
@@ -18,7 +15,7 @@ use crate::aggregation::intermediate_agg_result::{
IntermediateHistogramBucketEntry,
};
use crate::aggregation::segment_agg_result::{
build_segment_agg_collector, AggregationLimits, SegmentAggregationCollector,
build_segment_agg_collector, SegmentAggregationCollector,
};
use crate::aggregation::*;
use crate::TantivyError;
@@ -310,7 +307,10 @@ impl SegmentAggregationCollector for SegmentHistogramCollector {
.column_block_accessor
.fetch_block(docs, &bucket_agg_accessor.accessor);
for (doc, val) in bucket_agg_accessor.column_block_accessor.iter_docid_vals() {
for (doc, val) in bucket_agg_accessor
.column_block_accessor
.iter_docid_vals(docs, &bucket_agg_accessor.accessor)
{
let val = self.f64_from_fastfield_u64(val);
let bucket_pos = get_bucket_pos(val);
@@ -597,13 +597,11 @@ mod tests {
use serde_json::Value;
use super::*;
use crate::aggregation::agg_req::Aggregations;
use crate::aggregation::agg_result::AggregationResults;
use crate::aggregation::tests::{
exec_request, exec_request_with_query, exec_request_with_query_and_memory_limit,
get_test_index_2_segments, get_test_index_from_values, get_test_index_with_num_docs,
};
use crate::aggregation::AggregationCollector;
use crate::query::AllQuery;
#[test]


@@ -28,6 +28,7 @@ mod term_agg;
mod term_missing_agg;
use std::collections::HashMap;
use std::fmt;
pub use histogram::*;
pub use range::*;
@@ -72,12 +73,12 @@ impl From<&str> for OrderTarget {
}
}
impl ToString for OrderTarget {
fn to_string(&self) -> String {
impl fmt::Display for OrderTarget {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
match self {
OrderTarget::Key => "_key".to_string(),
OrderTarget::Count => "_count".to_string(),
OrderTarget::SubAggregation(agg) => agg.to_string(),
OrderTarget::Key => f.write_str("_key"),
OrderTarget::Count => f.write_str("_count"),
OrderTarget::SubAggregation(agg) => agg.fmt(f),
}
}
}
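Note on the hunk above: replacing `impl ToString` with `impl fmt::Display` keeps `.to_string()` available to callers, because the standard library implements `ToString` for every `Display` type. A minimal standalone sketch follows. It assumes a simplified `OrderTarget` whose `SubAggregation` variant holds a plain `String`; the real type in the crate may differ.

```rust
use std::fmt;

// Simplified stand-in for the crate's `OrderTarget` (assumption: the
// sub-aggregation variant carries an owned `String`).
#[derive(Debug, Clone, PartialEq)]
enum OrderTarget {
    Key,
    Count,
    SubAggregation(String),
}

impl fmt::Display for OrderTarget {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // Write directly into the formatter instead of allocating a String.
            OrderTarget::Key => f.write_str("_key"),
            OrderTarget::Count => f.write_str("_count"),
            OrderTarget::SubAggregation(agg) => f.write_str(agg),
        }
    }
}
```

With this in place, `OrderTarget::Key.to_string()` and `format!("{}", target)` both work without a hand-written `ToString` impl, which is also what Clippy's `to_string_trait_impl` lint recommends.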


@@ -1,7 +1,6 @@
use std::fmt::Debug;
use std::ops::Range;
use columnar::{ColumnType, MonotonicallyMappableToU64};
use rustc_hash::FxHashMap;
use serde::{Deserialize, Serialize};
@@ -236,7 +235,10 @@ impl SegmentAggregationCollector for SegmentRangeCollector {
.column_block_accessor
.fetch_block(docs, &bucket_agg_accessor.accessor);
for (doc, val) in bucket_agg_accessor.column_block_accessor.iter_docid_vals() {
for (doc, val) in bucket_agg_accessor
.column_block_accessor
.iter_docid_vals(docs, &bucket_agg_accessor.accessor)
{
let bucket_pos = self.get_bucket_pos(val);
let bucket = &mut self.buckets[bucket_pos];
@@ -447,7 +449,6 @@ pub(crate) fn range_to_key(range: &Range<u64>, field_type: &ColumnType) -> crate
#[cfg(test)]
mod tests {
use columnar::MonotonicallyMappableToU64;
use serde_json::Value;
use super::*;
@@ -456,7 +457,6 @@ mod tests {
exec_request, exec_request_with_query, get_test_index_2_segments,
get_test_index_with_num_docs,
};
use crate::aggregation::AggregationLimits;
pub fn get_collector_from_ranges(
ranges: Vec<RangeAggregationRange>,


@@ -1,6 +1,10 @@
use std::fmt::Debug;
use std::net::Ipv6Addr;
use columnar::{BytesColumn, ColumnType, MonotonicallyMappableToU64, StrColumn};
use columnar::column_values::CompactSpaceU64Accessor;
use columnar::{
BytesColumn, ColumnType, MonotonicallyMappableToU128, MonotonicallyMappableToU64, StrColumn,
};
use rustc_hash::FxHashMap;
use serde::{Deserialize, Serialize};
@@ -105,9 +109,9 @@ pub struct TermsAggregation {
///
/// Defaults to 10 * size.
#[serde(skip_serializing_if = "Option::is_none", default)]
#[serde(alias = "segment_size")]
#[serde(alias = "shard_size")]
#[serde(alias = "split_size")]
pub shard_size: Option<u32>,
pub segment_size: Option<u32>,
/// If you set the `show_term_doc_count_error` parameter to true, the terms aggregation will
/// include doc_count_error_upper_bound, which is an upper bound to the error on the
@@ -196,7 +200,7 @@ impl TermsAggregationInternal {
pub(crate) fn from_req(req: &TermsAggregation) -> Self {
let size = req.size.unwrap_or(10);
let mut segment_size = req.shard_size.unwrap_or(size * 10);
let mut segment_size = req.segment_size.unwrap_or(size * 10);
let order = req.order.clone().unwrap_or_default();
segment_size = segment_size.max(size);
@@ -306,7 +310,10 @@ impl SegmentAggregationCollector for SegmentTermCollector {
}
// has subagg
if let Some(blueprint) = self.blueprint.as_ref() {
for (doc, term_id) in bucket_agg_accessor.column_block_accessor.iter_docid_vals() {
for (doc, term_id) in bucket_agg_accessor
.column_block_accessor
.iter_docid_vals(docs, &bucket_agg_accessor.accessor)
{
let sub_aggregations = self
.term_buckets
.sub_aggs
@@ -535,6 +542,27 @@ impl SegmentTermCollector {
let val = bool::from_u64(val);
dict.insert(IntermediateKey::Bool(val), intermediate_entry);
}
} else if self.column_type == ColumnType::IpAddr {
let compact_space_accessor = agg_with_accessor
.accessor
.values
.clone()
.downcast_arc::<CompactSpaceU64Accessor>()
.map_err(|_| {
TantivyError::AggregationError(
crate::aggregation::AggregationError::InternalError(
"Type mismatch: Could not downcast to CompactSpaceU64Accessor"
.to_string(),
),
)
})?;
for (val, doc_count) in entries {
let intermediate_entry = into_intermediate_bucket_entry(val, doc_count)?;
let val: u128 = compact_space_accessor.compact_to_u128(val as u32);
let val = Ipv6Addr::from_u128(val);
dict.insert(IntermediateKey::IpAddr(val), intermediate_entry);
}
} else {
for (val, doc_count) in entries {
let intermediate_entry = into_intermediate_bucket_entry(val, doc_count)?;
@@ -587,6 +615,9 @@ pub(crate) fn cut_off_buckets<T: GetDocCount + Debug>(
#[cfg(test)]
mod tests {
use std::net::IpAddr;
use std::str::FromStr;
use common::DateTime;
use time::{Date, Month};
@@ -597,7 +628,7 @@ mod tests {
};
use crate::aggregation::AggregationLimits;
use crate::indexer::NoMergePolicy;
use crate::schema::{Schema, FAST, STRING};
use crate::schema::{IntoIpv6Addr, Schema, FAST, STRING};
use crate::{Index, IndexWriter};
#[test]
@@ -1179,9 +1210,9 @@ mod tests {
assert_eq!(res["my_texts"]["buckets"][0]["key"], "terma");
assert_eq!(res["my_texts"]["buckets"][0]["doc_count"], 4);
assert_eq!(res["my_texts"]["buckets"][1]["key"], "termc");
assert_eq!(res["my_texts"]["buckets"][1]["key"], "termb");
assert_eq!(res["my_texts"]["buckets"][1]["doc_count"], 0);
assert_eq!(res["my_texts"]["buckets"][2]["key"], "termb");
assert_eq!(res["my_texts"]["buckets"][2]["key"], "termc");
assert_eq!(res["my_texts"]["buckets"][2]["doc_count"], 0);
assert_eq!(res["my_texts"]["sum_other_doc_count"], 0);
assert_eq!(res["my_texts"]["doc_count_error_upper_bound"], 0);
@@ -1927,4 +1958,44 @@ mod tests {
Ok(())
}
#[test]
fn terms_aggregation_ip_addr() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
let field = schema_builder.add_ip_addr_field("ip_field", FAST);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
{
let mut writer = index.writer_with_num_threads(1, 15_000_000)?;
// IpV6 loopback
writer.add_document(doc!(field=>IpAddr::from_str("::1").unwrap().into_ipv6_addr()))?;
writer.add_document(doc!(field=>IpAddr::from_str("::1").unwrap().into_ipv6_addr()))?;
// IpV4
writer.add_document(
doc!(field=>IpAddr::from_str("127.0.0.1").unwrap().into_ipv6_addr()),
)?;
writer.commit()?;
}
let agg_req: Aggregations = serde_json::from_value(json!({
"my_bool": {
"terms": {
"field": "ip_field"
},
}
}))
.unwrap();
let res = exec_request_with_query(agg_req, &index, None)?;
// print as json
// println!("{}", serde_json::to_string_pretty(&res).unwrap());
assert_eq!(res["my_bool"]["buckets"][0]["key"], "::1");
assert_eq!(res["my_bool"]["buckets"][0]["doc_count"], 2);
assert_eq!(res["my_bool"]["buckets"][1]["key"], "127.0.0.1");
assert_eq!(res["my_bool"]["buckets"][1]["doc_count"], 1);
assert_eq!(res["my_bool"]["buckets"][2]["key"], serde_json::Value::Null);
Ok(())
}
}


@@ -5,6 +5,7 @@
use std::cmp::Ordering;
use std::collections::hash_map::Entry;
use std::hash::Hash;
use std::net::Ipv6Addr;
use columnar::ColumnType;
use itertools::Itertools;
@@ -19,7 +20,7 @@ use super::bucket::{
};
use super::metric::{
IntermediateAverage, IntermediateCount, IntermediateMax, IntermediateMin, IntermediateStats,
IntermediateSum, PercentilesCollector, TopHitsCollector,
IntermediateSum, PercentilesCollector, TopHitsTopNComputer,
};
use super::segment_agg_result::AggregationLimits;
use super::{format_date, AggregationError, Key, SerializedKey};
@@ -41,6 +42,8 @@ pub struct IntermediateAggregationResults {
/// This might seem redundant with `Key`, but the point is to have a different
/// Serialize implementation.
pub enum IntermediateKey {
/// Ip Addr key
IpAddr(Ipv6Addr),
/// Bool key
Bool(bool),
/// String key
@@ -60,6 +63,14 @@ impl From<IntermediateKey> for Key {
fn from(value: IntermediateKey) -> Self {
match value {
IntermediateKey::Str(s) => Self::Str(s),
IntermediateKey::IpAddr(s) => {
// Prefer to use the IPv4 representation if possible
if let Some(ip) = s.to_ipv4_mapped() {
Self::Str(ip.to_string())
} else {
Self::Str(s.to_string())
}
}
IntermediateKey::F64(f) => Self::F64(f),
IntermediateKey::Bool(f) => Self::F64(f as u64 as f64),
}
@@ -75,6 +86,7 @@ impl std::hash::Hash for IntermediateKey {
IntermediateKey::Str(text) => text.hash(state),
IntermediateKey::F64(val) => val.to_bits().hash(state),
IntermediateKey::Bool(val) => val.hash(state),
IntermediateKey::IpAddr(val) => val.hash(state),
}
}
}
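Note on the `IntermediateKey::IpAddr` arm of `From<IntermediateKey> for Key` above: the key is stored as an `Ipv6Addr`, and the conversion prefers the IPv4 text form when the address is IPv4-mapped. A minimal std-only sketch of that rendering rule follows. The helper name `ip_key_to_string` is hypothetical and not part of the crate.

```rust
use std::net::Ipv6Addr;

// Hypothetical helper mirroring the conversion in the diff: render an
// IPv4-mapped address (::ffff:a.b.c.d) in dotted-quad form, and anything
// else in its normal IPv6 form.
fn ip_key_to_string(addr: Ipv6Addr) -> String {
    match addr.to_ipv4_mapped() {
        Some(v4) => v4.to_string(),
        None => addr.to_string(),
    }
}
```

Under this rule, `std::net::Ipv4Addr::new(127, 0, 0, 1).to_ipv6_mapped()` renders as `127.0.0.1`, while `Ipv6Addr::LOCALHOST` stays `::1`, which matches the bucket keys asserted in `terms_aggregation_ip_addr`.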
@@ -209,9 +221,9 @@ pub(crate) fn empty_from_req(req: &Aggregation) -> IntermediateAggregationResult
Percentiles(_) => IntermediateAggregationResult::Metric(
IntermediateMetricResult::Percentiles(PercentilesCollector::default()),
),
TopHits(_) => IntermediateAggregationResult::Metric(IntermediateMetricResult::TopHits(
TopHitsCollector::default(),
)),
TopHits(ref req) => IntermediateAggregationResult::Metric(
IntermediateMetricResult::TopHits(TopHitsTopNComputer::new(req.clone())),
),
}
}
@@ -273,7 +285,7 @@ pub enum IntermediateMetricResult {
/// Intermediate sum result.
Sum(IntermediateSum),
/// Intermediate top_hits result
TopHits(TopHitsCollector),
TopHits(TopHitsTopNComputer),
}
impl IntermediateMetricResult {
@@ -302,7 +314,7 @@ impl IntermediateMetricResult {
.into_final_result(req.agg.as_percentile().expect("unexpected metric type")),
),
IntermediateMetricResult::TopHits(top_hits) => {
MetricResult::TopHits(top_hits.finalize())
MetricResult::TopHits(top_hits.into_final_result())
}
}
}


@@ -25,6 +25,8 @@ mod stats;
mod sum;
mod top_hits;
use std::collections::HashMap;
pub use average::*;
pub use count::*;
pub use max::*;
@@ -36,6 +38,8 @@ pub use stats::*;
pub use sum::*;
pub use top_hits::*;
use crate::schema::OwnedValue;
/// Single-metric aggregations use this common result structure.
///
/// Main reason to wrap it in value is to match elasticsearch output structure.
@@ -92,8 +96,9 @@ pub struct TopHitsVecEntry {
/// Search results, for queries that include field retrieval requests
/// (`docvalue_fields`).
#[serde(flatten)]
pub search_results: FieldRetrivalResult,
#[serde(rename = "docvalue_fields")]
#[serde(skip_serializing_if = "HashMap::is_empty")]
pub doc_value_fields: HashMap<String, OwnedValue>,
}
/// The top_hits metric aggregation returns a list of top hits by sort criteria.


@@ -1,6 +1,5 @@
use std::fmt::Debug;
use columnar::ColumnType;
use serde::{Deserialize, Serialize};
use super::*;


@@ -1,4 +1,3 @@
use columnar::ColumnType;
use serde::{Deserialize, Serialize};
use super::*;


@@ -1,7 +1,9 @@
use std::collections::HashMap;
use std::fmt::Formatter;
use std::net::Ipv6Addr;
use columnar::{ColumnarReader, DynamicColumn};
use common::json_path_writer::JSON_PATH_SEGMENT_SEP_STR;
use common::DateTime;
use regex::Regex;
use serde::ser::SerializeMap;
use serde::{Deserialize, Deserializer, Serialize, Serializer};
@@ -12,8 +14,8 @@ use crate::aggregation::intermediate_agg_result::{
IntermediateAggregationResult, IntermediateMetricResult,
};
use crate::aggregation::segment_agg_result::SegmentAggregationCollector;
use crate::aggregation::AggregationError;
use crate::collector::TopNComputer;
use crate::schema::term::JSON_PATH_SEGMENT_SEP_STR;
use crate::schema::OwnedValue;
use crate::{DocAddress, DocId, SegmentOrdinal};
@@ -92,53 +94,106 @@ pub struct TopHitsAggregation {
size: usize,
from: Option<usize>,
#[serde(flatten)]
retrieval: RetrievalFields,
}
const fn default_doc_value_fields() -> Vec<String> {
Vec::new()
}
/// Search query spec for each matched document
/// TODO: move this to a common module
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize, Default)]
pub struct RetrievalFields {
/// The fast fields to return for each hit.
/// This is the only variant supported for now.
/// TODO: support the {field, format} variant for custom formatting.
#[serde(rename = "docvalue_fields")]
#[serde(default = "default_doc_value_fields")]
pub doc_value_fields: Vec<String>,
#[serde(default)]
doc_value_fields: Vec<String>,
// Not supported
_source: Option<serde_json::Value>,
fields: Option<serde_json::Value>,
script_fields: Option<serde_json::Value>,
highlight: Option<serde_json::Value>,
explain: Option<serde_json::Value>,
version: Option<serde_json::Value>,
}
/// Search query result for each matched document
/// TODO: move this to a common module
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize, Default)]
pub struct FieldRetrivalResult {
/// The fast fields returned for each hit.
#[serde(rename = "docvalue_fields")]
#[serde(skip_serializing_if = "HashMap::is_empty")]
pub doc_value_fields: HashMap<String, OwnedValue>,
#[derive(Debug, Clone, PartialEq, Default)]
struct KeyOrder {
field: String,
order: Order,
}
impl RetrievalFields {
fn get_field_names(&self) -> Vec<&str> {
self.doc_value_fields.iter().map(|s| s.as_str()).collect()
impl Serialize for KeyOrder {
fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
let KeyOrder { field, order } = self;
let mut map = serializer.serialize_map(Some(1))?;
map.serialize_entry(field, order)?;
map.end()
}
}
impl<'de> Deserialize<'de> for KeyOrder {
fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
where D: Deserializer<'de> {
let mut key_order = <HashMap<String, Order>>::deserialize(deserializer)?.into_iter();
let (field, order) = key_order.next().ok_or(serde::de::Error::custom(
"Expected exactly one key-value pair in sort parameter of top_hits, found none",
))?;
if key_order.next().is_some() {
return Err(serde::de::Error::custom(format!(
"Expected exactly one key-value pair in sort parameter of top_hits, found {:?}",
key_order
)));
}
Ok(Self { field, order })
}
}
// Transform a glob (`pattern*`, for example) into a regex::Regex (`^pattern.*$`)
fn globbed_string_to_regex(glob: &str) -> Result<Regex, crate::TantivyError> {
// Replace `*` glob with `.*` regex
let sanitized = format!("^{}$", regex::escape(glob).replace(r"\*", ".*"));
Regex::new(&sanitized.replace('*', ".*")).map_err(|e| {
crate::TantivyError::SchemaError(format!(
"Invalid regex '{}' in docvalue_fields: {}",
glob, e
))
})
}
fn use_doc_value_fields_err(parameter: &str) -> crate::Result<()> {
Err(crate::TantivyError::AggregationError(
AggregationError::InvalidRequest(format!(
"The `{}` parameter is not supported, only `docvalue_fields` is supported in \
`top_hits` aggregation",
parameter
)),
))
}
fn unsupported_err(parameter: &str) -> crate::Result<()> {
Err(crate::TantivyError::AggregationError(
AggregationError::InvalidRequest(format!(
"The `{}` parameter is not supported in the `top_hits` aggregation",
parameter
)),
))
}
impl TopHitsAggregation {
/// Validate and resolve field retrieval parameters
pub fn validate_and_resolve_field_names(
&mut self,
reader: &ColumnarReader,
) -> crate::Result<()> {
if self._source.is_some() {
use_doc_value_fields_err("_source")?;
}
if self.fields.is_some() {
use_doc_value_fields_err("fields")?;
}
if self.script_fields.is_some() {
use_doc_value_fields_err("script_fields")?;
}
if self.explain.is_some() {
unsupported_err("explain")?;
}
if self.highlight.is_some() {
unsupported_err("highlight")?;
}
if self.version.is_some() {
unsupported_err("version")?;
}
fn resolve_field_names(&mut self, reader: &ColumnarReader) -> crate::Result<()> {
// Transform a glob (`pattern*`, for example) into a regex::Regex (`^pattern.*$`)
let globbed_string_to_regex = |glob: &str| {
// Replace `*` glob with `.*` regex
let sanitized = format!("^{}$", regex::escape(glob).replace(r"\*", ".*"));
Regex::new(&sanitized.replace('*', ".*")).map_err(|e| {
crate::TantivyError::SchemaError(format!(
"Invalid regex '{}' in docvalue_fields: {}",
glob, e
))
})
};
self.doc_value_fields = self
.doc_value_fields
.iter()
@@ -175,12 +230,25 @@ impl RetrievalFields {
Ok(())
}
/// Return fields accessed by the aggregator, in order.
pub fn field_names(&self) -> Vec<&str> {
self.sort
.iter()
.map(|KeyOrder { field, .. }| field.as_str())
.collect()
}
/// Return fields accessed by the aggregator's value retrieval.
pub fn value_field_names(&self) -> Vec<&str> {
self.doc_value_fields.iter().map(|s| s.as_str()).collect()
}
fn get_document_field_data(
&self,
accessors: &HashMap<String, Vec<DynamicColumn>>,
doc_id: DocId,
) -> FieldRetrivalResult {
let dvf = self
) -> HashMap<String, FastFieldValue> {
let doc_value_fields = self
.doc_value_fields
.iter()
.map(|field| {
@@ -188,20 +256,20 @@ impl RetrievalFields {
.get(field)
.unwrap_or_else(|| panic!("field '{}' not found in accessors", field));
let values: Vec<OwnedValue> = accessors
let values: Vec<FastFieldValue> = accessors
.iter()
.flat_map(|accessor| match accessor {
DynamicColumn::U64(accessor) => accessor
.values_for_doc(doc_id)
.map(OwnedValue::U64)
.map(FastFieldValue::U64)
.collect::<Vec<_>>(),
DynamicColumn::I64(accessor) => accessor
.values_for_doc(doc_id)
.map(OwnedValue::I64)
.map(FastFieldValue::I64)
.collect::<Vec<_>>(),
DynamicColumn::F64(accessor) => accessor
.values_for_doc(doc_id)
.map(OwnedValue::F64)
.map(FastFieldValue::F64)
.collect::<Vec<_>>(),
DynamicColumn::Bytes(accessor) => accessor
.term_ords(doc_id)
@@ -213,7 +281,7 @@ impl RetrievalFields {
.expect("could not read term dictionary"),
"term corresponding to term_ord does not exist"
);
OwnedValue::Bytes(buffer)
FastFieldValue::Bytes(buffer)
})
.collect::<Vec<_>>(),
DynamicColumn::Str(accessor) => accessor
@@ -226,94 +294,82 @@ impl RetrievalFields {
.expect("could not read term dictionary"),
"term corresponding to term_ord does not exist"
);
OwnedValue::Str(String::from_utf8(buffer).unwrap())
FastFieldValue::Str(String::from_utf8(buffer).unwrap())
})
.collect::<Vec<_>>(),
DynamicColumn::Bool(accessor) => accessor
.values_for_doc(doc_id)
.map(OwnedValue::Bool)
.map(FastFieldValue::Bool)
.collect::<Vec<_>>(),
DynamicColumn::IpAddr(accessor) => accessor
.values_for_doc(doc_id)
.map(OwnedValue::IpAddr)
.map(FastFieldValue::IpAddr)
.collect::<Vec<_>>(),
DynamicColumn::DateTime(accessor) => accessor
.values_for_doc(doc_id)
.map(OwnedValue::Date)
.map(FastFieldValue::Date)
.collect::<Vec<_>>(),
})
.collect();
(field.to_owned(), OwnedValue::Array(values))
(field.to_owned(), FastFieldValue::Array(values))
})
.collect();
FieldRetrivalResult {
doc_value_fields: dvf,
doc_value_fields
}
}
/// A retrieved value from a fast field.
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
pub enum FastFieldValue {
/// The str type is used for any text information.
Str(String),
/// Unsigned 64-bits Integer `u64`
U64(u64),
/// Signed 64-bits Integer `i64`
I64(i64),
/// 64-bits Float `f64`
F64(f64),
/// Bool value
Bool(bool),
/// Date/time with nanoseconds precision
Date(DateTime),
/// Arbitrarily sized byte array
Bytes(Vec<u8>),
/// IpV6 Address. Internally there is no IpV4, it needs to be converted to `Ipv6Addr`.
IpAddr(Ipv6Addr),
/// A list of values.
Array(Vec<Self>),
}
impl From<FastFieldValue> for OwnedValue {
fn from(value: FastFieldValue) -> Self {
match value {
FastFieldValue::Str(s) => OwnedValue::Str(s),
FastFieldValue::U64(u) => OwnedValue::U64(u),
FastFieldValue::I64(i) => OwnedValue::I64(i),
FastFieldValue::F64(f) => OwnedValue::F64(f),
FastFieldValue::Bool(b) => OwnedValue::Bool(b),
FastFieldValue::Date(d) => OwnedValue::Date(d),
FastFieldValue::Bytes(b) => OwnedValue::Bytes(b),
FastFieldValue::IpAddr(ip) => OwnedValue::IpAddr(ip),
FastFieldValue::Array(a) => {
OwnedValue::Array(a.into_iter().map(OwnedValue::from).collect())
}
}
}
}
#[derive(Debug, Clone, PartialEq, Default)]
struct KeyOrder {
field: String,
order: Order,
}
impl Serialize for KeyOrder {
fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
let KeyOrder { field, order } = self;
let mut map = serializer.serialize_map(Some(1))?;
map.serialize_entry(field, order)?;
map.end()
}
}
impl<'de> Deserialize<'de> for KeyOrder {
fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
where D: Deserializer<'de> {
let mut k_o = <HashMap<String, Order>>::deserialize(deserializer)?.into_iter();
let (k, v) = k_o.next().ok_or(serde::de::Error::custom(
"Expected exactly one key-value pair in KeyOrder, found none",
))?;
if k_o.next().is_some() {
return Err(serde::de::Error::custom(
"Expected exactly one key-value pair in KeyOrder, found more",
));
}
Ok(Self { field: k, order: v })
}
}
impl TopHitsAggregation {
/// Validate and resolve field retrieval parameters
pub fn validate_and_resolve(&mut self, reader: &ColumnarReader) -> crate::Result<()> {
self.retrieval.resolve_field_names(reader)
}
/// Return fields accessed by the aggregator, in order.
pub fn field_names(&self) -> Vec<&str> {
self.sort
.iter()
.map(|KeyOrder { field, .. }| field.as_str())
.collect()
}
/// Return fields accessed by the aggregator's value retrieval.
pub fn value_field_names(&self) -> Vec<&str> {
self.retrieval.get_field_names()
}
}
/// Holds a single comparable doc feature, and the order in which it should be sorted.
/// Holds a fast field value in its u64 representation, and the order in which it should be sorted.
#[derive(Clone, Serialize, Deserialize, Debug)]
struct ComparableDocFeature {
/// Stores any u64-mappable feature.
struct DocValueAndOrder {
/// A fast field value in its u64 representation.
value: Option<u64>,
/// Sort order for the doc feature
/// Sort order for the value
order: Order,
}
impl Ord for ComparableDocFeature {
impl Ord for DocValueAndOrder {
fn cmp(&self, other: &Self) -> std::cmp::Ordering {
let invert = |cmp: std::cmp::Ordering| match self.order {
Order::Asc => cmp,
@@ -329,26 +385,32 @@ impl Ord for ComparableDocFeature {
}
}
impl PartialOrd for ComparableDocFeature {
impl PartialOrd for DocValueAndOrder {
fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
Some(self.cmp(other))
}
}
impl PartialEq for ComparableDocFeature {
impl PartialEq for DocValueAndOrder {
fn eq(&self, other: &Self) -> bool {
self.value.cmp(&other.value) == std::cmp::Ordering::Equal
}
}
impl Eq for ComparableDocFeature {}
impl Eq for DocValueAndOrder {}
#[derive(Clone, Serialize, Deserialize, Debug)]
struct ComparableDocFeatures(Vec<ComparableDocFeature>, FieldRetrivalResult);
struct DocSortValuesAndFields {
sorts: Vec<DocValueAndOrder>,
impl Ord for ComparableDocFeatures {
#[serde(rename = "docvalue_fields")]
#[serde(skip_serializing_if = "HashMap::is_empty")]
doc_value_fields: HashMap<String, FastFieldValue>,
}
impl Ord for DocSortValuesAndFields {
fn cmp(&self, other: &Self) -> std::cmp::Ordering {
for (self_feature, other_feature) in self.0.iter().zip(other.0.iter()) {
for (self_feature, other_feature) in self.sorts.iter().zip(other.sorts.iter()) {
let cmp = self_feature.cmp(other_feature);
if cmp != std::cmp::Ordering::Equal {
return cmp;
@@ -358,53 +420,43 @@ impl Ord for ComparableDocFeatures {
}
}
impl PartialOrd for ComparableDocFeatures {
impl PartialOrd for DocSortValuesAndFields {
fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
Some(self.cmp(other))
}
}
impl PartialEq for ComparableDocFeatures {
impl PartialEq for DocSortValuesAndFields {
fn eq(&self, other: &Self) -> bool {
self.cmp(other) == std::cmp::Ordering::Equal
}
}
impl Eq for ComparableDocFeatures {}
impl Eq for DocSortValuesAndFields {}
/// The TopHitsCollector used for collecting over segments and merging results.
#[derive(Clone, Serialize, Deserialize)]
pub struct TopHitsCollector {
#[derive(Clone, Serialize, Deserialize, Debug)]
pub struct TopHitsTopNComputer {
req: TopHitsAggregation,
top_n: TopNComputer<ComparableDocFeatures, DocAddress, false>,
top_n: TopNComputer<DocSortValuesAndFields, DocAddress, false>,
}
impl Default for TopHitsCollector {
fn default() -> Self {
Self {
req: TopHitsAggregation::default(),
top_n: TopNComputer::new(1),
}
}
}
impl std::fmt::Debug for TopHitsCollector {
fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
f.debug_struct("TopHitsCollector")
.field("req", &self.req)
.field("top_n_threshold", &self.top_n.threshold)
.finish()
}
}
impl std::cmp::PartialEq for TopHitsCollector {
impl std::cmp::PartialEq for TopHitsTopNComputer {
fn eq(&self, _other: &Self) -> bool {
false
}
}
impl TopHitsCollector {
fn collect(&mut self, features: ComparableDocFeatures, doc: DocAddress) {
impl TopHitsTopNComputer {
/// Create a new TopHitsTopNComputer
pub fn new(req: TopHitsAggregation) -> Self {
Self {
top_n: TopNComputer::new(req.size + req.from.unwrap_or(0)),
req,
}
}
fn collect(&mut self, features: DocSortValuesAndFields, doc: DocAddress) {
self.top_n.push(features, doc);
}
@@ -416,14 +468,19 @@ impl TopHitsCollector {
}
/// Finalize by converting self into the final result form
pub fn finalize(self) -> TopHitsMetricResult {
pub fn into_final_result(self) -> TopHitsMetricResult {
let mut hits: Vec<TopHitsVecEntry> = self
.top_n
.into_sorted_vec()
.into_iter()
.map(|doc| TopHitsVecEntry {
sort: doc.feature.0.iter().map(|f| f.value).collect(),
search_results: doc.feature.1,
sort: doc.feature.sorts.iter().map(|f| f.value).collect(),
doc_value_fields: doc
.feature
.doc_value_fields
.into_iter()
.map(|(k, v)| (k, v.into()))
.collect(),
})
.collect();
@@ -436,48 +493,63 @@ impl TopHitsCollector {
}
}
#[derive(Clone)]
pub(crate) struct SegmentTopHitsCollector {
#[derive(Clone, Debug)]
pub(crate) struct TopHitsSegmentCollector {
segment_ordinal: SegmentOrdinal,
accessor_idx: usize,
inner_collector: TopHitsCollector,
req: TopHitsAggregation,
top_n: TopNComputer<Vec<DocValueAndOrder>, DocAddress, false>,
}
impl SegmentTopHitsCollector {
impl TopHitsSegmentCollector {
pub fn from_req(
req: &TopHitsAggregation,
accessor_idx: usize,
segment_ordinal: SegmentOrdinal,
) -> Self {
Self {
inner_collector: TopHitsCollector {
req: req.clone(),
top_n: TopNComputer::new(req.size + req.from.unwrap_or(0)),
},
req: req.clone(),
top_n: TopNComputer::new(req.size + req.from.unwrap_or(0)),
segment_ordinal,
accessor_idx,
}
}
}
fn into_top_hits_collector(
self,
value_accessors: &HashMap<String, Vec<DynamicColumn>>,
) -> TopHitsTopNComputer {
let mut top_hits_computer = TopHitsTopNComputer::new(self.req.clone());
let top_results = self.top_n.into_vec();
impl std::fmt::Debug for SegmentTopHitsCollector {
fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
f.debug_struct("SegmentTopHitsCollector")
.field("segment_id", &self.segment_ordinal)
.field("accessor_idx", &self.accessor_idx)
.field("inner_collector", &self.inner_collector)
.finish()
for res in top_results {
let doc_value_fields = self
.req
.get_document_field_data(value_accessors, res.doc.doc_id);
top_hits_computer.collect(
DocSortValuesAndFields {
sorts: res.feature,
doc_value_fields,
},
res.doc,
);
}
top_hits_computer
}
}
impl SegmentAggregationCollector for SegmentTopHitsCollector {
impl SegmentAggregationCollector for TopHitsSegmentCollector {
fn add_intermediate_aggregation_result(
self: Box<Self>,
agg_with_accessor: &crate::aggregation::agg_req_with_accessor::AggregationsWithAccessor,
results: &mut crate::aggregation::intermediate_agg_result::IntermediateAggregationResults,
) -> crate::Result<()> {
let name = agg_with_accessor.aggs.keys[self.accessor_idx].to_string();
let intermediate_result = IntermediateMetricResult::TopHits(self.inner_collector);
let value_accessors = &agg_with_accessor.aggs.values[self.accessor_idx].value_accessors;
let intermediate_result =
IntermediateMetricResult::TopHits(self.into_top_hits_collector(value_accessors));
results.push(
name,
IntermediateAggregationResult::Metric(intermediate_result),
@@ -490,9 +562,7 @@ impl SegmentAggregationCollector for SegmentTopHitsCollector {
agg_with_accessor: &mut crate::aggregation::agg_req_with_accessor::AggregationsWithAccessor,
) -> crate::Result<()> {
let accessors = &agg_with_accessor.aggs.values[self.accessor_idx].accessors;
let value_accessors = &agg_with_accessor.aggs.values[self.accessor_idx].value_accessors;
let features: Vec<ComparableDocFeature> = self
.inner_collector
let sorts: Vec<DocValueAndOrder> = self
.req
.sort
.iter()
@@ -505,18 +575,12 @@ impl SegmentAggregationCollector for SegmentTopHitsCollector {
.0
.values_for_doc(doc_id)
.next();
ComparableDocFeature { value, order }
DocValueAndOrder { value, order }
})
.collect();
let retrieval_result = self
.inner_collector
.req
.retrieval
.get_document_field_data(value_accessors, doc_id);
self.inner_collector.collect(
ComparableDocFeatures(features, retrieval_result),
self.top_n.push(
sorts,
DocAddress {
segment_ord: self.segment_ordinal,
doc_id,
@@ -530,11 +594,7 @@ impl SegmentAggregationCollector for SegmentTopHitsCollector {
docs: &[crate::DocId],
agg_with_accessor: &mut crate::aggregation::agg_req_with_accessor::AggregationsWithAccessor,
) -> crate::Result<()> {
// TODO: Consider getting fields with the column block accessor and refactor this.
// ---
// Would the additional complexity of getting fields with the column_block_accessor
// make sense here? Probably yes, but I want to get a first-pass review first
// before proceeding.
// TODO: Consider getting fields with the column block accessor.
for doc in docs {
self.collect(*doc, agg_with_accessor)?;
}
@@ -549,7 +609,7 @@ mod tests {
use serde_json::Value;
use time::macros::datetime;
use super::{ComparableDocFeature, ComparableDocFeatures, Order};
use super::{DocSortValuesAndFields, DocValueAndOrder, Order};
use crate::aggregation::agg_req::Aggregations;
use crate::aggregation::agg_result::AggregationResults;
use crate::aggregation::bucket::tests::get_test_index_from_docs;
@@ -557,44 +617,44 @@ mod tests {
use crate::aggregation::AggregationCollector;
use crate::collector::ComparableDoc;
use crate::query::AllQuery;
use crate::schema::OwnedValue as SchemaValue;
use crate::schema::OwnedValue;
fn invert_order(cmp_feature: ComparableDocFeature) -> ComparableDocFeature {
let ComparableDocFeature { value, order } = cmp_feature;
fn invert_order(cmp_feature: DocValueAndOrder) -> DocValueAndOrder {
let DocValueAndOrder { value, order } = cmp_feature;
let order = match order {
Order::Asc => Order::Desc,
Order::Desc => Order::Asc,
};
ComparableDocFeature { value, order }
DocValueAndOrder { value, order }
}
fn collector_with_capacity(capacity: usize) -> super::TopHitsCollector {
super::TopHitsCollector {
fn collector_with_capacity(capacity: usize) -> super::TopHitsTopNComputer {
super::TopHitsTopNComputer {
top_n: super::TopNComputer::new(capacity),
..Default::default()
req: Default::default(),
}
}
fn invert_order_features(cmp_features: ComparableDocFeatures) -> ComparableDocFeatures {
let ComparableDocFeatures(cmp_features, search_results) = cmp_features;
let cmp_features = cmp_features
fn invert_order_features(mut cmp_features: DocSortValuesAndFields) -> DocSortValuesAndFields {
cmp_features.sorts = cmp_features
.sorts
.into_iter()
.map(invert_order)
.collect::<Vec<_>>();
ComparableDocFeatures(cmp_features, search_results)
cmp_features
}
#[test]
fn test_comparable_doc_feature() -> crate::Result<()> {
let small = ComparableDocFeature {
let small = DocValueAndOrder {
value: Some(1),
order: Order::Asc,
};
let big = ComparableDocFeature {
let big = DocValueAndOrder {
value: Some(2),
order: Order::Asc,
};
let none = ComparableDocFeature {
let none = DocValueAndOrder {
value: None,
order: Order::Asc,
};
@@ -616,21 +676,21 @@ mod tests {
#[test]
fn test_comparable_doc_features() -> crate::Result<()> {
let features_1 = ComparableDocFeatures(
vec![ComparableDocFeature {
let features_1 = DocSortValuesAndFields {
sorts: vec![DocValueAndOrder {
value: Some(1),
order: Order::Asc,
}],
Default::default(),
);
doc_value_fields: Default::default(),
};
let features_2 = ComparableDocFeatures(
vec![ComparableDocFeature {
let features_2 = DocSortValuesAndFields {
sorts: vec![DocValueAndOrder {
value: Some(2),
order: Order::Asc,
}],
Default::default(),
);
doc_value_fields: Default::default(),
};
assert!(features_1 < features_2);
@@ -689,39 +749,39 @@ mod tests {
segment_ord: 0,
doc_id: 0,
},
feature: ComparableDocFeatures(
vec![ComparableDocFeature {
feature: DocSortValuesAndFields {
sorts: vec![DocValueAndOrder {
value: Some(1),
order: Order::Asc,
}],
Default::default(),
),
doc_value_fields: Default::default(),
},
},
ComparableDoc {
doc: crate::DocAddress {
segment_ord: 0,
doc_id: 2,
},
feature: ComparableDocFeatures(
vec![ComparableDocFeature {
feature: DocSortValuesAndFields {
sorts: vec![DocValueAndOrder {
value: Some(3),
order: Order::Asc,
}],
Default::default(),
),
doc_value_fields: Default::default(),
},
},
ComparableDoc {
doc: crate::DocAddress {
segment_ord: 0,
doc_id: 1,
},
feature: ComparableDocFeatures(
vec![ComparableDocFeature {
feature: DocSortValuesAndFields {
sorts: vec![DocValueAndOrder {
value: Some(5),
order: Order::Asc,
}],
Default::default(),
),
doc_value_fields: Default::default(),
},
},
];
@@ -730,23 +790,23 @@ mod tests {
collector.collect(doc.feature, doc.doc);
}
let res = collector.finalize();
let res = collector.into_final_result();
assert_eq!(
res,
super::TopHitsMetricResult {
hits: vec![
super::TopHitsVecEntry {
sort: vec![docs[0].feature.0[0].value],
search_results: Default::default(),
sort: vec![docs[0].feature.sorts[0].value],
doc_value_fields: Default::default(),
},
super::TopHitsVecEntry {
sort: vec![docs[1].feature.0[0].value],
search_results: Default::default(),
sort: vec![docs[1].feature.sorts[0].value],
doc_value_fields: Default::default(),
},
super::TopHitsVecEntry {
sort: vec![docs[2].feature.0[0].value],
search_results: Default::default(),
sort: vec![docs[2].feature.sorts[0].value],
doc_value_fields: Default::default(),
},
]
}
@@ -803,7 +863,7 @@ mod tests {
{
"sort": [common::i64_to_u64(date_2017.unix_timestamp_nanos() as i64)],
"docvalue_fields": {
"date": [ SchemaValue::Date(DateTime::from_utc(date_2017)) ],
"date": [ OwnedValue::Date(DateTime::from_utc(date_2017)) ],
"text": [ "ccc" ],
"text2": [ "ddd" ],
"mixed.dyn_arr": [ 3, "4" ],
@@ -812,7 +872,7 @@ mod tests {
{
"sort": [common::i64_to_u64(date_2016.unix_timestamp_nanos() as i64)],
"docvalue_fields": {
"date": [ SchemaValue::Date(DateTime::from_utc(date_2016)) ],
"date": [ OwnedValue::Date(DateTime::from_utc(date_2016)) ],
"text": [ "aaa" ],
"text2": [ "bbb" ],
"mixed.dyn_arr": [ 6, "7" ],

@@ -417,7 +417,6 @@ mod tests {
use time::OffsetDateTime;
use super::agg_req::Aggregations;
use super::segment_agg_result::AggregationLimits;
use super::*;
use crate::indexer::NoMergePolicy;
use crate::query::{AllQuery, TermQuery};

@@ -16,7 +16,7 @@ use super::metric::{
SumAggregation,
};
use crate::aggregation::bucket::TermMissingAgg;
use crate::aggregation::metric::SegmentTopHitsCollector;
use crate::aggregation::metric::TopHitsSegmentCollector;
pub(crate) trait SegmentAggregationCollector: CollectorClone + Debug {
fn add_intermediate_aggregation_result(
@@ -161,7 +161,7 @@ pub(crate) fn build_single_agg_segment_collector(
accessor_idx,
)?,
)),
TopHits(top_hits_req) => Ok(Box::new(SegmentTopHitsCollector::from_req(
TopHits(top_hits_req) => Ok(Box::new(TopHitsSegmentCollector::from_req(
top_hits_req,
accessor_idx,
req.segment_ordinal,

@@ -1,7 +1,7 @@
use std::cmp::Ordering;
use std::collections::{btree_map, BTreeMap, BTreeSet, BinaryHeap};
use std::io;
use std::ops::Bound;
use std::{io, u64, usize};
use crate::collector::{Collector, SegmentCollector};
use crate::fastfield::FacetReader;

@@ -160,7 +160,7 @@ mod tests {
use super::{add_vecs, HistogramCollector, HistogramComputer};
use crate::schema::{Schema, FAST};
use crate::time::{Date, Month};
use crate::{doc, query, DateTime, Index};
use crate::{query, DateTime, Index};
#[test]
fn test_add_histograms_simple() {

@@ -274,6 +274,10 @@ pub trait SegmentCollector: 'static {
fn collect(&mut self, doc: DocId, score: Score);
/// The query pushes a block of documents to the collector via this method.
/// It is used when the collector does not require scoring.
///
/// See [`COLLECT_BLOCK_BUFFER_LEN`](crate::COLLECT_BLOCK_BUFFER_LEN) for the
/// buffer size passed to the collector.
fn collect_block(&mut self, docs: &[DocId]) {
for doc in docs {
self.collect(*doc, 0.0);

@@ -52,10 +52,16 @@ impl<TCollector: Collector> Collector for CollectorWrapper<TCollector> {
impl SegmentCollector for Box<dyn BoxableSegmentCollector> {
type Fruit = Box<dyn Fruit>;
#[inline]
fn collect(&mut self, doc: u32, score: Score) {
self.as_mut().collect(doc, score);
}
#[inline]
fn collect_block(&mut self, docs: &[DocId]) {
self.as_mut().collect_block(docs);
}
fn harvest(self) -> Box<dyn Fruit> {
BoxableSegmentCollector::harvest_from_box(self)
}
@@ -63,6 +69,11 @@ impl SegmentCollector for Box<dyn BoxableSegmentCollector> {
pub trait BoxableSegmentCollector {
fn collect(&mut self, doc: u32, score: Score);
fn collect_block(&mut self, docs: &[DocId]) {
for &doc in docs {
self.collect(doc, 0.0);
}
}
fn harvest_from_box(self: Box<Self>) -> Box<dyn Fruit>;
}
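
The forwarding added in this hunk matters because a wrapper that implements only `collect` silently falls back to the per-document default of `collect_block`. A minimal sketch of that effect follows; `BlockCollector`, `CountingCollector`, `Wrapper` and `LossyWrapper` are hypothetical stand-ins, not types from the crate.

```rust
type DocId = u32;
type Score = f32;

/// Hypothetical stand-in for a segment collector trait.
trait BlockCollector {
    fn collect(&mut self, doc: DocId, score: Score);

    /// Default: fall back to one `collect` call per document.
    fn collect_block(&mut self, docs: &[DocId]) {
        for &doc in docs {
            self.collect(doc, 0.0);
        }
    }
}

/// Counts documents and overrides `collect_block` to handle a block in one step.
#[derive(Default)]
struct CountingCollector {
    count: usize,
    block_calls: usize,
}

impl BlockCollector for CountingCollector {
    fn collect(&mut self, _doc: DocId, _score: Score) {
        self.count += 1;
    }

    fn collect_block(&mut self, docs: &[DocId]) {
        self.count += docs.len();
        self.block_calls += 1;
    }
}

/// Forwards both methods, so the inner block implementation is reached.
struct Wrapper<C: BlockCollector>(C);

impl<C: BlockCollector> BlockCollector for Wrapper<C> {
    fn collect(&mut self, doc: DocId, score: Score) {
        self.0.collect(doc, score);
    }

    fn collect_block(&mut self, docs: &[DocId]) {
        self.0.collect_block(docs);
    }
}

/// Forwards only `collect`; blocks are processed through the trait default.
struct LossyWrapper<C: BlockCollector>(C);

impl<C: BlockCollector> BlockCollector for LossyWrapper<C> {
    fn collect(&mut self, doc: DocId, score: Score) {
        self.0.collect(doc, score);
    }
}
```

Both wrappers count the same documents, but only `Wrapper` reaches the inner block implementation; with `LossyWrapper` the result is correct and the batching is lost.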
@@ -71,9 +82,14 @@ pub struct SegmentCollectorWrapper<TSegmentCollector: SegmentCollector>(TSegment
impl<TSegmentCollector: SegmentCollector> BoxableSegmentCollector
for SegmentCollectorWrapper<TSegmentCollector>
{
#[inline]
fn collect(&mut self, doc: u32, score: Score) {
self.0.collect(doc, score);
}
#[inline]
fn collect_block(&mut self, docs: &[DocId]) {
self.0.collect_block(docs);
}
fn harvest_from_box(self: Box<Self>) -> Box<dyn Fruit> {
Box::new(self.0.harvest())

@@ -1,15 +1,11 @@
use columnar::{BytesColumn, Column};
use super::*;
use crate::collector::{Count, FilterCollector, TopDocs};
use crate::index::SegmentReader;
use crate::query::{AllQuery, QueryParser};
use crate::schema::{Schema, FAST, TEXT};
use crate::time::format_description::well_known::Rfc3339;
use crate::time::OffsetDateTime;
use crate::{
doc, DateTime, DocAddress, DocId, Index, Score, Searcher, SegmentOrdinal, TantivyDocument,
};
use crate::{DateTime, DocAddress, Index, Searcher, TantivyDocument};
pub const TEST_COLLECTOR_WITH_SCORE: TestCollector = TestCollector {
compute_score: true,

@@ -732,6 +732,19 @@ pub struct TopNComputer<Score, D, const REVERSE_ORDER: bool = true> {
top_n: usize,
pub(crate) threshold: Option<Score>,
}
impl<Score: std::fmt::Debug, D, const REVERSE_ORDER: bool> std::fmt::Debug
for TopNComputer<Score, D, REVERSE_ORDER>
{
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> std::fmt::Result {
f.debug_struct("TopNComputer")
.field("buffer_len", &self.buffer.len())
.field("top_n", &self.top_n)
.field("current_threshold", &self.threshold)
.finish()
}
}
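
A hand-written `Debug` impl like the one above only needs `Score: Debug`, because it prints the buffer length instead of the buffer contents. A small sketch of the same pattern, assuming a simplified hypothetical `TopN` struct and an `Opaque` payload type that has no `Debug` impl:

```rust
use std::fmt;

/// Hypothetical, simplified top-n buffer (not the crate's `TopNComputer`).
struct TopN<S, D> {
    buffer: Vec<(S, D)>,
    top_n: usize,
    threshold: Option<S>,
}

// Only `S` must be `Debug`: the payload `D` is never formatted,
// the impl reports the number of buffered entries instead.
impl<S: fmt::Debug, D> fmt::Debug for TopN<S, D> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.debug_struct("TopN")
            .field("buffer_len", &self.buffer.len())
            .field("top_n", &self.top_n)
            .field("threshold", &self.threshold)
            .finish()
    }
}

/// Payload type that deliberately does not implement `Debug`.
struct Opaque;
```

A derived `Debug` would require `D: Debug` as well, which is the restriction the manual impl avoids.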
// Intermediate struct for TopNComputer for deserialization, to keep vec capacity
#[derive(Deserialize)]
struct TopNComputerDeser<Score, D, const REVERSE_ORDER: bool> {

@@ -1,12 +1,11 @@
use columnar::MonotonicallyMappableToU64;
use common::json_path_writer::JSON_PATH_SEGMENT_SEP;
use common::{replace_in_place, JsonPathWriter};
use rustc_hash::FxHashMap;
use crate::fastfield::FastValue;
use crate::postings::{IndexingContext, IndexingPosition, PostingsWriter};
use crate::schema::document::{ReferenceValue, ReferenceValueLeaf, Value};
use crate::schema::term::JSON_PATH_SEGMENT_SEP;
use crate::schema::{Field, Type, DATE_TIME_PRECISION_INDEXED};
use crate::schema::indexing_term::IndexingTerm;
use crate::schema::{Field, Type};
use crate::time::format_description::well_known::Rfc3339;
use crate::time::{OffsetDateTime, UtcOffset};
use crate::tokenizer::TextAnalyzer;
@@ -76,7 +75,7 @@ pub(crate) fn index_json_values<'a, V: Value<'a>>(
json_visitors: impl Iterator<Item = crate::Result<V::ObjectIter>>,
text_analyzer: &mut TextAnalyzer,
expand_dots_enabled: bool,
term_buffer: &mut Term,
term_buffer: &mut IndexingTerm,
postings_writer: &mut dyn PostingsWriter,
json_path_writer: &mut JsonPathWriter,
ctx: &mut IndexingContext,
@@ -105,7 +104,7 @@ fn index_json_object<'a, V: Value<'a>>(
doc: DocId,
json_visitor: V::ObjectIter,
text_analyzer: &mut TextAnalyzer,
term_buffer: &mut Term,
term_buffer: &mut IndexingTerm,
json_path_writer: &mut JsonPathWriter,
postings_writer: &mut dyn PostingsWriter,
ctx: &mut IndexingContext,
@@ -132,19 +131,16 @@ fn index_json_value<'a, V: Value<'a>>(
doc: DocId,
json_value: V,
text_analyzer: &mut TextAnalyzer,
term_buffer: &mut Term,
term_buffer: &mut IndexingTerm,
json_path_writer: &mut JsonPathWriter,
postings_writer: &mut dyn PostingsWriter,
ctx: &mut IndexingContext,
positions_per_path: &mut IndexingPositionsPerPath,
) {
let set_path_id = |term_buffer: &mut Term, unordered_id: u32| {
let set_path_id = |term_buffer: &mut IndexingTerm, unordered_id: u32| {
term_buffer.truncate_value_bytes(0);
term_buffer.append_bytes(&unordered_id.to_be_bytes());
};
let set_type = |term_buffer: &mut Term, typ: Type| {
term_buffer.append_bytes(&[typ.to_code()]);
};
match json_value.as_value() {
ReferenceValue::Leaf(leaf) => match leaf {
@@ -157,7 +153,7 @@ fn index_json_value<'a, V: Value<'a>>(
// TODO: make sure the chain position works out.
set_path_id(term_buffer, unordered_id);
set_type(term_buffer, Type::Str);
term_buffer.append_bytes(&[Type::Str.to_code()]);
let indexing_position = positions_per_path.get_position_from_id(unordered_id);
postings_writer.index_text(
doc,
@@ -213,18 +209,16 @@ fn index_json_value<'a, V: Value<'a>>(
postings_writer.subscribe(doc, 0u32, term_buffer, ctx);
}
ReferenceValueLeaf::PreTokStr(_) => {
unimplemented!(
"Pre-tokenized string support in dynamic fields is not yet implemented"
)
unimplemented!("Pre-tokenized string support in JSON fields is not yet implemented")
}
ReferenceValueLeaf::Bytes(_) => {
unimplemented!("Bytes support in dynamic fields is not yet implemented")
unimplemented!("Bytes support in JSON fields is not yet implemented")
}
ReferenceValueLeaf::Facet(_) => {
unimplemented!("Facet support in dynamic fields is not yet implemented")
unimplemented!("Facet support in JSON fields is not yet implemented")
}
ReferenceValueLeaf::IpAddr(_) => {
unimplemented!("IP address support in dynamic fields is not yet implemented")
unimplemented!("IP address support in JSON fields is not yet implemented")
}
},
ReferenceValue::Array(elements) => {
@@ -256,71 +250,43 @@ fn index_json_value<'a, V: Value<'a>>(
}
}
// Tries to infer a JSON type from a string.
pub fn convert_to_fast_value_and_get_term(
json_term_writer: &mut JsonTermWriter,
phrase: &str,
) -> Option<Term> {
/// Tries to infer a JSON type from a string and append it to the term.
///
/// The term must be a JSON term that already contains the JSON path and no value bytes.
pub fn convert_to_fast_value_and_append_to_json_term(mut term: Term, phrase: &str) -> Option<Term> {
assert_eq!(
term.value()
.as_json()
.expect("expecting a Term with a json type and json path")
.1
.as_serialized()
.len(),
0,
"JSON value bytes should be empty"
);
if let Ok(dt) = OffsetDateTime::parse(phrase, &Rfc3339) {
let dt_utc = dt.to_offset(UtcOffset::UTC);
return Some(set_fastvalue_and_get_term(
json_term_writer,
DateTime::from_utc(dt_utc),
));
term.append_type_and_fast_value(DateTime::from_utc(dt_utc));
return Some(term);
}
if let Ok(i64_val) = str::parse::<i64>(phrase) {
return Some(set_fastvalue_and_get_term(json_term_writer, i64_val));
term.append_type_and_fast_value(i64_val);
return Some(term);
}
if let Ok(u64_val) = str::parse::<u64>(phrase) {
return Some(set_fastvalue_and_get_term(json_term_writer, u64_val));
term.append_type_and_fast_value(u64_val);
return Some(term);
}
if let Ok(f64_val) = str::parse::<f64>(phrase) {
return Some(set_fastvalue_and_get_term(json_term_writer, f64_val));
term.append_type_and_fast_value(f64_val);
return Some(term);
}
if let Ok(bool_val) = str::parse::<bool>(phrase) {
return Some(set_fastvalue_and_get_term(json_term_writer, bool_val));
term.append_type_and_fast_value(bool_val);
return Some(term);
}
None
}
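
The order of the parse attempts in the function above decides which type a phrase gets. The sketch below reproduces that order with the standard library only; the `Inferred` enum and `infer_fast_value` are hypothetical names, and the RFC 3339 date step is left out because it depends on the `time` crate.

```rust
/// Hypothetical result type for the sketch.
#[derive(Debug, PartialEq)]
enum Inferred {
    I64(i64),
    U64(u64),
    F64(f64),
    Bool(bool),
}

/// Tries i64, then u64, then f64, then bool; the first successful parse wins.
fn infer_fast_value(phrase: &str) -> Option<Inferred> {
    if let Ok(v) = phrase.parse::<i64>() {
        return Some(Inferred::I64(v));
    }
    // Only reached when the value does not fit into an i64.
    if let Ok(v) = phrase.parse::<u64>() {
        return Some(Inferred::U64(v));
    }
    if let Ok(v) = phrase.parse::<f64>() {
        return Some(Inferred::F64(v));
    }
    if let Ok(v) = phrase.parse::<bool>() {
        return Some(Inferred::Bool(v));
    }
    None
}
```

With this order, `"42"` is typed as an `i64`, and a `u64` is only produced for values above `i64::MAX`.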
// helper function to generate a Term from a json fastvalue
pub(crate) fn set_fastvalue_and_get_term<T: FastValue>(
json_term_writer: &mut JsonTermWriter,
value: T,
) -> Term {
json_term_writer.set_fast_value(value);
json_term_writer.term().clone()
}
// helper function to generate a list of terms with their positions from a textual json value
pub(crate) fn set_string_and_get_terms(
json_term_writer: &mut JsonTermWriter,
value: &str,
text_analyzer: &mut TextAnalyzer,
) -> Vec<(usize, Term)> {
let mut positions_and_terms = Vec::<(usize, Term)>::new();
json_term_writer.close_path_and_set_type(Type::Str);
let term_num_bytes = json_term_writer.term_buffer.len_bytes();
let mut token_stream = text_analyzer.token_stream(value);
token_stream.process(&mut |token| {
json_term_writer
.term_buffer
.truncate_value_bytes(term_num_bytes);
json_term_writer
.term_buffer
.append_bytes(token.text.as_bytes());
positions_and_terms.push((token.position, json_term_writer.term().clone()));
});
positions_and_terms
}
/// Writes a value of a JSON field to a `Term`.
/// The Term format is as follows:
/// `[JSON_TYPE][JSON_PATH][JSON_END_OF_PATH][VALUE_BYTES]`
pub struct JsonTermWriter<'a> {
term_buffer: &'a mut Term,
path_stack: Vec<usize>,
expand_dots_enabled: bool,
}
/// Splits a json path supplied to the query parser in such a way that
/// `.` can be escaped.
@@ -377,275 +343,106 @@ pub(crate) fn encode_column_name(
path.into()
}
impl<'a> JsonTermWriter<'a> {
pub fn from_field_and_json_path(
field: Field,
json_path: &str,
expand_dots_enabled: bool,
term_buffer: &'a mut Term,
) -> Self {
term_buffer.set_field_and_type(field, Type::Json);
let mut json_term_writer = Self::wrap(term_buffer, expand_dots_enabled);
for segment in split_json_path(json_path) {
json_term_writer.push_path_segment(&segment);
}
json_term_writer
pub fn term_from_json_paths<'a>(
json_field: Field,
paths: impl Iterator<Item = &'a str>,
expand_dots_enabled: bool,
) -> Term {
let mut json_path = JsonPathWriter::with_expand_dots(expand_dots_enabled);
for path in paths {
json_path.push(path);
}
json_path.set_end();
let mut term = Term::with_type_and_field(Type::Json, json_field);
pub fn wrap(term_buffer: &'a mut Term, expand_dots_enabled: bool) -> Self {
term_buffer.clear_with_type(Type::Json);
let mut path_stack = Vec::with_capacity(10);
path_stack.push(0);
Self {
term_buffer,
path_stack,
expand_dots_enabled,
}
}
fn trim_to_end_of_path(&mut self) {
let end_of_path = *self.path_stack.last().unwrap();
self.term_buffer.truncate_value_bytes(end_of_path);
}
pub fn close_path_and_set_type(&mut self, typ: Type) {
self.trim_to_end_of_path();
self.term_buffer.set_json_path_end();
self.term_buffer.append_bytes(&[typ.to_code()]);
}
// TODO: Remove this function and use JsonPathWriter instead.
pub fn push_path_segment(&mut self, segment: &str) {
// the path stack should never be empty.
self.trim_to_end_of_path();
if self.path_stack.len() > 1 {
self.term_buffer.set_json_path_separator();
}
let appended_segment = self.term_buffer.append_bytes(segment.as_bytes());
if self.expand_dots_enabled {
// We need to replace `.` by JSON_PATH_SEGMENT_SEP.
replace_in_place(b'.', JSON_PATH_SEGMENT_SEP, appended_segment);
}
self.term_buffer.add_json_path_separator();
self.path_stack.push(self.term_buffer.len_bytes());
}
pub fn pop_path_segment(&mut self) {
self.path_stack.pop();
assert!(!self.path_stack.is_empty());
self.trim_to_end_of_path();
}
/// Returns the json path of the term being currently built.
#[cfg(test)]
pub(crate) fn path(&self) -> &[u8] {
let end_of_path = self.path_stack.last().cloned().unwrap_or(1);
&self.term().serialized_value_bytes()[..end_of_path - 1]
}
pub(crate) fn set_fast_value<T: FastValue>(&mut self, val: T) {
self.close_path_and_set_type(T::to_type());
let value = if T::to_type() == Type::Date {
DateTime::from_u64(val.to_u64())
.truncate(DATE_TIME_PRECISION_INDEXED)
.to_u64()
} else {
val.to_u64()
};
self.term_buffer
.append_bytes(value.to_be_bytes().as_slice());
}
pub fn set_str(&mut self, text: &str) {
self.close_path_and_set_type(Type::Str);
self.term_buffer.append_bytes(text.as_bytes());
}
pub fn term(&self) -> &Term {
self.term_buffer
}
term.append_bytes(json_path.as_str().as_bytes());
term
}
#[cfg(test)]
mod tests {
use super::{split_json_path, JsonTermWriter};
use crate::schema::{Field, Type};
use crate::Term;
use super::split_json_path;
use crate::json_utils::term_from_json_paths;
use crate::schema::Field;
#[test]
fn test_json_writer() {
let field = Field::from_field_id(1);
let mut term = Term::with_type_and_field(Type::Json, field);
let mut json_writer = JsonTermWriter::wrap(&mut term, false);
json_writer.push_path_segment("attributes");
json_writer.push_path_segment("color");
json_writer.set_str("red");
let mut term = term_from_json_paths(field, ["attributes", "color"].into_iter(), false);
term.append_type_and_str("red");
assert_eq!(
format!("{:?}", json_writer.term()),
format!("{:?}", term),
"Term(field=1, type=Json, path=attributes.color, type=Str, \"red\")"
);
json_writer.set_str("blue");
assert_eq!(
format!("{:?}", json_writer.term()),
"Term(field=1, type=Json, path=attributes.color, type=Str, \"blue\")"
let mut term = term_from_json_paths(
field,
["attributes", "dimensions", "width"].into_iter(),
false,
);
json_writer.pop_path_segment();
json_writer.push_path_segment("dimensions");
json_writer.push_path_segment("width");
json_writer.set_fast_value(400i64);
term.append_type_and_fast_value(400i64);
assert_eq!(
format!("{:?}", json_writer.term()),
format!("{:?}", term),
"Term(field=1, type=Json, path=attributes.dimensions.width, type=I64, 400)"
);
json_writer.pop_path_segment();
json_writer.push_path_segment("height");
json_writer.set_fast_value(300i64);
assert_eq!(
format!("{:?}", json_writer.term()),
"Term(field=1, type=Json, path=attributes.dimensions.height, type=I64, 300)"
);
}
#[test]
fn test_string_term() {
let field = Field::from_field_id(1);
let mut term = Term::with_type_and_field(Type::Json, field);
let mut json_writer = JsonTermWriter::wrap(&mut term, false);
json_writer.push_path_segment("color");
json_writer.set_str("red");
assert_eq!(
json_writer.term().serialized_term(),
b"\x00\x00\x00\x01jcolor\x00sred"
)
let mut term = term_from_json_paths(field, ["color"].into_iter(), false);
term.append_type_and_str("red");
assert_eq!(term.serialized_term(), b"\x00\x00\x00\x01jcolor\x00sred")
}
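
The expected bytes in `test_string_term` suggest the serialized layout: a big-endian field id, the JSON type code `j`, the path, an end-of-path byte `\x00`, the value type code `s`, then the value. The sketch below rebuilds those bytes by hand. The constants are read off the test expectations in this diff (the `\x01` segment separator comes from the removed `JsonTermWriter` tests), and `json_str_term_bytes` is a hypothetical helper, not a crate API.

```rust
// Values inferred from the byte strings asserted in the tests.
const JSON_TYPE_CODE: u8 = b'j';
const STR_TYPE_CODE: u8 = b's';
const PATH_SEGMENT_SEP: u8 = 1;
const END_OF_PATH: u8 = 0;

/// Builds `[field id][j][path][end of path][s][value]` for a string value.
fn json_str_term_bytes(field_id: u32, path: &[&str], value: &str) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&field_id.to_be_bytes());
    out.push(JSON_TYPE_CODE);
    for (i, segment) in path.iter().enumerate() {
        if i > 0 {
            // Segments are joined with the separator byte.
            out.push(PATH_SEGMENT_SEP);
        }
        out.extend_from_slice(segment.as_bytes());
    }
    out.push(END_OF_PATH);
    out.push(STR_TYPE_CODE);
    out.extend_from_slice(value.as_bytes());
    out
}
```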
#[test]
fn test_i64_term() {
let field = Field::from_field_id(1);
let mut term = Term::with_type_and_field(Type::Json, field);
let mut json_writer = JsonTermWriter::wrap(&mut term, false);
json_writer.push_path_segment("color");
json_writer.set_fast_value(-4i64);
let mut term = term_from_json_paths(field, ["color"].into_iter(), false);
term.append_type_and_fast_value(-4i64);
assert_eq!(
json_writer.term().serialized_term(),
b"\x00\x00\x00\x01jcolor\x00i\x7f\xff\xff\xff\xff\xff\xff\xfc"
term.value().as_serialized(),
b"jcolor\x00i\x7f\xff\xff\xff\xff\xff\xff\xfc"
)
}
#[test]
fn test_u64_term() {
let field = Field::from_field_id(1);
let mut term = Term::with_type_and_field(Type::Json, field);
let mut json_writer = JsonTermWriter::wrap(&mut term, false);
json_writer.push_path_segment("color");
json_writer.set_fast_value(4u64);
let mut term = term_from_json_paths(field, ["color"].into_iter(), false);
term.append_type_and_fast_value(4u64);
assert_eq!(
json_writer.term().serialized_term(),
b"\x00\x00\x00\x01jcolor\x00u\x00\x00\x00\x00\x00\x00\x00\x04"
term.value().as_serialized(),
b"jcolor\x00u\x00\x00\x00\x00\x00\x00\x00\x04"
)
}
#[test]
fn test_f64_term() {
let field = Field::from_field_id(1);
let mut term = Term::with_type_and_field(Type::Json, field);
let mut json_writer = JsonTermWriter::wrap(&mut term, false);
json_writer.push_path_segment("color");
json_writer.set_fast_value(4.0f64);
let mut term = term_from_json_paths(field, ["color"].into_iter(), false);
term.append_type_and_fast_value(4.0f64);
assert_eq!(
json_writer.term().serialized_term(),
b"\x00\x00\x00\x01jcolor\x00f\xc0\x10\x00\x00\x00\x00\x00\x00"
term.value().as_serialized(),
b"jcolor\x00f\xc0\x10\x00\x00\x00\x00\x00\x00"
)
}
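
The value bytes expected for `-4i64` and `4.0f64` are consistent with an order-preserving mapping to `u64` followed by big-endian serialization. A sketch of one such mapping that reproduces the bytes in these tests; that the crate uses exactly this formula is an assumption, and the function names are illustrative.

```rust
/// Flips the sign bit so that signed order matches unsigned order.
fn i64_to_u64(val: i64) -> u64 {
    (val as u64) ^ (1u64 << 63)
}

/// Non-negative floats get the sign bit set; negative floats have all bits
/// inverted, so larger magnitudes sort lower.
fn f64_to_u64(val: f64) -> u64 {
    let bits = val.to_bits();
    if val.is_sign_positive() {
        bits ^ (1u64 << 63)
    } else {
        !bits
    }
}
```

Because the mapping preserves order, comparing the big-endian bytes lexicographically gives the same result as comparing the original numbers.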
#[test]
fn test_bool_term() {
let field = Field::from_field_id(1);
let mut term = Term::with_type_and_field(Type::Json, field);
let mut json_writer = JsonTermWriter::wrap(&mut term, false);
json_writer.push_path_segment("color");
json_writer.set_fast_value(true);
let mut term = term_from_json_paths(field, ["color"].into_iter(), false);
term.append_type_and_fast_value(true);
assert_eq!(
json_writer.term().serialized_term(),
b"\x00\x00\x00\x01jcolor\x00o\x00\x00\x00\x00\x00\x00\x00\x01"
term.value().as_serialized(),
b"jcolor\x00o\x00\x00\x00\x00\x00\x00\x00\x01"
)
}
#[test]
fn test_push_after_set_path_segment() {
let field = Field::from_field_id(1);
let mut term = Term::with_type_and_field(Type::Json, field);
let mut json_writer = JsonTermWriter::wrap(&mut term, false);
json_writer.push_path_segment("attribute");
json_writer.set_str("something");
json_writer.push_path_segment("color");
json_writer.set_str("red");
assert_eq!(
json_writer.term().serialized_term(),
b"\x00\x00\x00\x01jattribute\x01color\x00sred"
)
}
#[test]
fn test_pop_segment() {
let field = Field::from_field_id(1);
let mut term = Term::with_type_and_field(Type::Json, field);
let mut json_writer = JsonTermWriter::wrap(&mut term, false);
json_writer.push_path_segment("color");
json_writer.push_path_segment("hue");
json_writer.pop_path_segment();
json_writer.set_str("red");
assert_eq!(
json_writer.term().serialized_term(),
b"\x00\x00\x00\x01jcolor\x00sred"
)
}
#[test]
fn test_json_writer_path() {
let field = Field::from_field_id(1);
let mut term = Term::with_type_and_field(Type::Json, field);
let mut json_writer = JsonTermWriter::wrap(&mut term, false);
json_writer.push_path_segment("color");
assert_eq!(json_writer.path(), b"color");
json_writer.push_path_segment("hue");
assert_eq!(json_writer.path(), b"color\x01hue");
json_writer.set_str("pink");
assert_eq!(json_writer.path(), b"color\x01hue");
}
#[test]
fn test_json_path_expand_dots_disabled() {
let field = Field::from_field_id(1);
let mut term = Term::with_type_and_field(Type::Json, field);
let mut json_writer = JsonTermWriter::wrap(&mut term, false);
json_writer.push_path_segment("color.hue");
assert_eq!(json_writer.path(), b"color.hue");
}
#[test]
fn test_json_path_expand_dots_enabled() {
let field = Field::from_field_id(1);
let mut term = Term::with_type_and_field(Type::Json, field);
let mut json_writer = JsonTermWriter::wrap(&mut term, true);
json_writer.push_path_segment("color.hue");
assert_eq!(json_writer.path(), b"color\x01hue");
}
#[test]
fn test_json_path_expand_dots_enabled_pop_segment() {
let field = Field::from_field_id(1);
let mut term = Term::with_type_and_field(Type::Json, field);
let mut json_writer = JsonTermWriter::wrap(&mut term, true);
json_writer.push_path_segment("hello");
assert_eq!(json_writer.path(), b"hello");
json_writer.push_path_segment("color.hue");
assert_eq!(json_writer.path(), b"hello\x01color\x01hue");
json_writer.pop_path_segment();
assert_eq!(json_writer.path(), b"hello");
}
#[test]
fn test_split_json_path_simple() {
let json_path = split_json_path("titi.toto");

@@ -1,9 +1,9 @@
use crate::collector::Count;
use crate::directory::{RamDirectory, WatchCallback};
use crate::indexer::{LogMergePolicy, NoMergePolicy};
use crate::json_utils::JsonTermWriter;
use crate::json_utils::term_from_json_paths;
use crate::query::TermQuery;
use crate::schema::{Field, IndexRecordOption, Schema, Type, INDEXED, STRING, TEXT};
use crate::schema::{Field, IndexRecordOption, Schema, INDEXED, STRING, TEXT};
use crate::tokenizer::TokenizerManager;
use crate::{
Directory, DocSet, Index, IndexBuilder, IndexReader, IndexSettings, IndexWriter, Postings,
@@ -137,7 +137,6 @@ mod mmap_specific {
use tempfile::TempDir;
use super::*;
use crate::Directory;
#[test]
fn test_index_on_commit_reload_policy_mmap() -> crate::Result<()> {
@@ -417,16 +416,12 @@ fn test_non_text_json_term_freq() {
let searcher = reader.searcher();
let segment_reader = searcher.segment_reader(0u32);
let inv_idx = segment_reader.inverted_index(field).unwrap();
let mut term = Term::with_type_and_field(Type::Json, field);
let mut json_term_writer = JsonTermWriter::wrap(&mut term, false);
json_term_writer.push_path_segment("tenant_id");
json_term_writer.close_path_and_set_type(Type::U64);
json_term_writer.set_fast_value(75u64);
let mut term = term_from_json_paths(field, ["tenant_id"].iter().cloned(), false);
term.append_type_and_fast_value(75u64);
let postings = inv_idx
.read_postings(
json_term_writer.term(),
IndexRecordOption::WithFreqsAndPositions,
)
.read_postings(&term, IndexRecordOption::WithFreqsAndPositions)
.unwrap()
.unwrap();
assert_eq!(postings.doc(), 0);
@@ -455,16 +450,12 @@ fn test_non_text_json_term_freq_bitpacked() {
let searcher = reader.searcher();
let segment_reader = searcher.segment_reader(0u32);
let inv_idx = segment_reader.inverted_index(field).unwrap();
let mut term = Term::with_type_and_field(Type::Json, field);
let mut json_term_writer = JsonTermWriter::wrap(&mut term, false);
json_term_writer.push_path_segment("tenant_id");
json_term_writer.close_path_and_set_type(Type::U64);
json_term_writer.set_fast_value(75u64);
let mut term = term_from_json_paths(field, ["tenant_id"].iter().cloned(), false);
term.append_type_and_fast_value(75u64);
let mut postings = inv_idx
.read_postings(
json_term_writer.term(),
IndexRecordOption::WithFreqsAndPositions,
)
.read_postings(&term, IndexRecordOption::WithFreqsAndPositions)
.unwrap()
.unwrap();
assert_eq!(postings.doc(), 0);

@@ -1,6 +1,5 @@
use std::collections::HashMap;
use std::io::{self, Read, Write};
use std::iter::ExactSizeIterator;
use std::ops::Range;
use common::{BinarySerializable, CountingWriter, HasLen, VInt};

@@ -1,5 +1,4 @@
use std::io::Write;
use std::marker::{Send, Sync};
use std::path::{Path, PathBuf};
use std::sync::Arc;
use std::time::Duration;
@@ -40,6 +39,7 @@ impl RetryPolicy {
/// The `DirectoryLock` is an object that represents a file lock.
///
/// It is associated with a lock file that gets deleted on `Drop`.
#[allow(dead_code)]
pub struct DirectoryLock(Box<dyn Send + Sync + 'static>);
struct DirectoryLockGuard {

@@ -1,6 +1,6 @@
use std::io::Write;
use std::mem;
use std::path::{Path, PathBuf};
use std::path::Path;
use std::sync::atomic::Ordering::SeqCst;
use std::sync::atomic::{AtomicBool, AtomicUsize};
use std::sync::Arc;

@@ -32,6 +32,7 @@ pub struct WatchCallbackList {
/// file change is detected.
#[must_use = "This `WatchHandle` controls the lifetime of the watch and should therefore be used."]
#[derive(Clone)]
#[allow(dead_code)]
pub struct WatchHandle(Arc<WatchCallback>);
impl WatchHandle {

@@ -9,7 +9,10 @@ use crate::DocId;
/// to compare `[u32; 4]`.
pub const TERMINATED: DocId = i32::MAX as u32;
pub const BUFFER_LEN: usize = 64;
/// The `collect_block` method on `SegmentCollector` uses a buffer of this size.
/// The slice passed to `collect_block` never exceeds this size, and it is
/// exactly this size as long as the buffer can be filled.
pub const COLLECT_BLOCK_BUFFER_LEN: usize = 64;
/// Represents an iterable set of sorted doc ids.
pub trait DocSet: Send {
@@ -61,7 +64,7 @@ pub trait DocSet: Send {
/// This method is only here for specific high-performance
/// use cases where batching is beneficial. The normal way to
/// go through the `DocId`s is to call `.advance()`.
fn fill_buffer(&mut self, buffer: &mut [DocId; BUFFER_LEN]) -> usize {
fn fill_buffer(&mut self, buffer: &mut [DocId; COLLECT_BLOCK_BUFFER_LEN]) -> usize {
if self.doc() == TERMINATED {
return 0;
}
@@ -151,7 +154,7 @@ impl<TDocSet: DocSet + ?Sized> DocSet for Box<TDocSet> {
unboxed.seek(target)
}
fn fill_buffer(&mut self, buffer: &mut [DocId; BUFFER_LEN]) -> usize {
fn fill_buffer(&mut self, buffer: &mut [DocId; COLLECT_BLOCK_BUFFER_LEN]) -> usize {
let unboxed: &mut TDocSet = self.borrow_mut();
unboxed.fill_buffer(buffer)
}
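
The documented contract (blocks are full while documents remain, and only the last one may be shorter) can be illustrated with a toy doc set. `VecDocSet` and `block_sizes` are hypothetical, and `BLOCK_LEN` stands in for `COLLECT_BLOCK_BUFFER_LEN`; this is a sketch of the pattern, not the crate's implementation.

```rust
const BLOCK_LEN: usize = 64;
const TERMINATED: u32 = i32::MAX as u32;

/// Toy doc set backed by a sorted vector.
struct VecDocSet {
    docs: Vec<u32>,
    cursor: usize,
}

impl VecDocSet {
    fn doc(&self) -> u32 {
        self.docs.get(self.cursor).copied().unwrap_or(TERMINATED)
    }

    fn advance(&mut self) -> u32 {
        self.cursor += 1;
        self.doc()
    }

    /// Fills the buffer from the current position and returns how many
    /// slots were written.
    fn fill_buffer(&mut self, buffer: &mut [u32; BLOCK_LEN]) -> usize {
        if self.doc() == TERMINATED {
            return 0;
        }
        for (i, slot) in buffer.iter_mut().enumerate() {
            *slot = self.doc();
            if self.advance() == TERMINATED {
                return i + 1;
            }
        }
        buffer.len()
    }
}

/// Drains the doc set block by block and records each block's length.
fn block_sizes(mut docs: VecDocSet) -> Vec<usize> {
    let mut buffer = [0u32; BLOCK_LEN];
    let mut sizes = Vec::new();
    loop {
        let n = docs.fill_buffer(&mut buffer);
        if n == 0 {
            break;
        }
        sizes.push(n);
        if n < BLOCK_LEN {
            break;
        }
    }
    sizes
}
```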

@@ -79,7 +79,7 @@ mod tests {
use std::ops::{Range, RangeInclusive};
use std::path::Path;
use columnar::{Column, MonotonicallyMappableToU64, StrColumn};
use columnar::StrColumn;
use common::{ByteCount, HasLen, TerminatingWrite};
use once_cell::sync::Lazy;
use rand::prelude::SliceRandom;

@@ -1,3 +1,5 @@
#![allow(deprecated)] // Remove with index sorting
use std::collections::HashSet;
use rand::{thread_rng, Rng};

@@ -20,7 +20,7 @@ use crate::indexer::segment_updater::save_metas;
use crate::indexer::{IndexWriter, SingleSegmentIndexWriter};
use crate::reader::{IndexReader, IndexReaderBuilder};
use crate::schema::document::Document;
use crate::schema::{Field, FieldType, Schema};
use crate::schema::{Field, FieldType, Schema, Type};
use crate::tokenizer::{TextAnalyzer, TokenizerManager};
use crate::SegmentReader;
@@ -83,7 +83,7 @@ fn save_new_metas(
///
/// ```
/// use tantivy::schema::*;
/// use tantivy::{Index, IndexSettings, IndexSortByField, Order};
/// use tantivy::{Index, IndexSettings};
///
/// let mut schema_builder = Schema::builder();
/// let id_field = schema_builder.add_text_field("id", STRING);
@@ -96,10 +96,7 @@ fn save_new_metas(
///
/// let schema = schema_builder.build();
/// let settings = IndexSettings{
/// sort_by_field: Some(IndexSortByField{
/// field: "number".to_string(),
/// order: Order::Asc
/// }),
/// docstore_blocksize: 100_000,
/// ..Default::default()
/// };
/// let index = Index::builder().schema(schema).settings(settings).create_in_ram();
@@ -251,6 +248,15 @@ impl IndexBuilder {
sort_by_field.field
)));
}
let supported_field_types = [Type::I64, Type::U64, Type::F64, Type::Date];
let field_type = entry.field_type().value_type();
if !supported_field_types.contains(&field_type) {
return Err(TantivyError::InvalidArgument(format!(
"Unsupported field type in sort_by_field: {:?}. Supported field types: \
{:?} ",
field_type, supported_field_types,
)));
}
}
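
The new check rejects sort fields whose value type is not numeric or a date. A self-contained sketch of the same allow-list check; the local `Type` enum and `validate_sort_field_type` are hypothetical, and the error is a plain `String` rather than the crate's error type.

```rust
/// Hypothetical subset of field value types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Type {
    Str,
    U64,
    I64,
    F64,
    Bool,
    Date,
}

/// Types accepted for the sort field, as listed in the diff.
const SUPPORTED_SORT_TYPES: [Type; 4] = [Type::I64, Type::U64, Type::F64, Type::Date];

/// Returns an error message when the field type cannot be used for sorting.
fn validate_sort_field_type(field_type: Type) -> Result<(), String> {
    if !SUPPORTED_SORT_TYPES.contains(&field_type) {
        return Err(format!(
            "Unsupported field type in sort_by_field: {:?}. Supported field types: {:?}",
            field_type, SUPPORTED_SORT_TYPES
        ));
    }
    Ok(())
}
```

This mirrors what `test_text_sort` later in the diff asserts: a text field as sort key yields an error containing "Unsupported field type".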
Ok(())
} else {

@@ -288,6 +288,10 @@ impl Default for IndexSettings {
/// Presorting documents can greatly improve performance
/// in some scenarios, by applying top n
/// optimizations.
#[deprecated(
since = "0.22.0",
note = "We plan to remove index sorting in `0.23`. If you need index sorting, please comment on the related issue https://github.com/quickwit-oss/tantivy/issues/2352 and explain your use case."
)]
#[derive(Clone, Debug, Serialize, Deserialize, Eq, PartialEq)]
pub struct IndexSortByField {
/// The field to sort the documents by

@@ -1,12 +1,13 @@
use std::io;
use common::json_path_writer::JSON_END_OF_PATH;
use common::BinarySerializable;
use fnv::FnvHashSet;
use crate::directory::FileSlice;
use crate::positions::PositionReader;
use crate::postings::{BlockSegmentPostings, SegmentPostings, TermInfo};
use crate::schema::{IndexRecordOption, Term, Type, JSON_END_OF_PATH};
use crate::schema::{IndexRecordOption, Term, Type};
use crate::termdict::TermDictionary;
/// The inverted index reader is in charge of accessing

View File

@@ -1,4 +1,4 @@
use std::cmp::{Ord, Ordering};
use std::cmp::Ordering;
use std::error::Error;
use std::fmt;
use std::str::FromStr;

@@ -406,7 +406,7 @@ impl SegmentReader {
}
/// Returns an iterator that will iterate over the alive document ids
pub fn doc_ids_alive(&self) -> Box<dyn Iterator<Item = DocId> + '_> {
pub fn doc_ids_alive(&self) -> Box<dyn Iterator<Item = DocId> + Send + '_> {
if let Some(alive_bitset) = &self.alive_bitset_opt {
Box::new(alive_bitset.iter_alive())
} else {
@@ -516,8 +516,8 @@ impl fmt::Debug for SegmentReader {
mod test {
use super::*;
use crate::index::Index;
use crate::schema::{Schema, SchemaBuilder, Term, STORED, TEXT};
use crate::{DocId, IndexWriter};
use crate::schema::{SchemaBuilder, Term, STORED, TEXT};
use crate::IndexWriter;
#[test]
fn test_merge_field_meta_data_same() {

@@ -158,9 +158,8 @@ mod tests_indexsorting {
use crate::indexer::doc_id_mapping::DocIdMapping;
use crate::indexer::NoMergePolicy;
use crate::query::QueryParser;
use crate::schema::document::Value;
use crate::schema::{Schema, *};
use crate::{DocAddress, Index, IndexSettings, IndexSortByField, Order};
use crate::schema::*;
use crate::{DocAddress, Index, IndexBuilder, IndexSettings, IndexSortByField, Order};
fn create_test_index(
index_settings: Option<IndexSettings>,
@@ -558,4 +557,28 @@ mod tests_indexsorting {
&[2000, 8000, 3000]
);
}
#[test]
fn test_text_sort() -> crate::Result<()> {
let mut schema_builder = SchemaBuilder::new();
schema_builder.add_text_field("id", STRING | FAST | STORED);
schema_builder.add_text_field("name", TEXT | STORED);
let resp = IndexBuilder::new()
.schema(schema_builder.build())
.settings(IndexSettings {
sort_by_field: Some(IndexSortByField {
field: "id".to_string(),
order: Order::Asc,
}),
..Default::default()
})
.create_in_ram();
assert!(resp
.unwrap_err()
.to_string()
.contains("Unsupported field type"));
Ok(())
}
}

@@ -22,6 +22,7 @@ where
}
}
#[allow(dead_code)]
pub trait FlatMapWithBufferIter: Iterator {
/// Function similar to `flat_map`, but allows reusing a shared `Vec`.
fn flat_map_with_buffer<F, T>(self, fill_buffer: F) -> FlatMapWithBuffer<T, F, Self>

@@ -806,7 +806,6 @@ mod tests {
use columnar::{Cardinality, Column, MonotonicallyMappableToU128};
use itertools::Itertools;
use proptest::prop_oneof;
use proptest::strategy::Strategy;
use super::super::operation::UserOperation;
use crate::collector::TopDocs;

@@ -144,10 +144,9 @@ mod tests {
use once_cell::sync::Lazy;
use super::*;
use crate::index::{SegmentId, SegmentMeta, SegmentMetaInventory};
use crate::indexer::merge_policy::MergePolicy;
use crate::schema;
use crate::index::SegmentMetaInventory;
use crate::schema::INDEXED;
use crate::{schema, SegmentId};
static INVENTORY: Lazy<SegmentMetaInventory> = Lazy::new(SegmentMetaInventory::default);

@@ -39,7 +39,6 @@ impl MergePolicy for NoMergePolicy {
pub mod tests {
use super::*;
use crate::index::{SegmentId, SegmentMeta};
/// `MergePolicy` useful for test purposes.
///

@@ -576,7 +576,7 @@ impl IndexMerger {
//
// Overall the reliable way to know if we have actual frequencies loaded or not
// is to check whether the actual decoded array is empty or not.
if has_term_freq != !postings.block_cursor.freqs().is_empty() {
if has_term_freq == postings.block_cursor.freqs().is_empty() {
return Err(DataCorruption::comment_only(
"Term freqs are inconsistent across segments",
)
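Review note on the condition change in this hunk: the old and new checks appear to be the same predicate written two ways, since `a != !b` and `a == b` agree for every pair of booleans. The sketch below checks this with two hypothetical helper functions; they are illustration only and are not part of the merger code.

```rust
/// Old form of the check: "has_term_freq differs from freqs being non-empty".
/// Hypothetical helper, named for illustration only.
pub fn inconsistent_old(has_term_freq: bool, freqs_is_empty: bool) -> bool {
    has_term_freq != !freqs_is_empty
}

/// New form of the check: "has_term_freq equals freqs being empty".
/// Hypothetical helper, named for illustration only.
pub fn inconsistent_new(has_term_freq: bool, freqs_is_empty: bool) -> bool {
    has_term_freq == freqs_is_empty
}

/// Returns true when both forms agree on all four boolean combinations.
pub fn forms_agree() -> bool {
    [false, true].iter().all(|&a| {
        [false, true]
            .iter()
            .all(|&b| inconsistent_old(a, b) == inconsistent_new(a, b))
    })
}
```

The rewrite therefore reads as a readability fix of the kind Clippy suggests, with no change in behavior.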

@@ -144,6 +144,123 @@ mod tests_mmap {
assert_eq!(num_docs, 256);
}
}
#[test]
fn test_json_field_null_byte() {
// Test when field name contains a zero byte, which has special meaning in tantivy.
// As a workaround, we convert the zero byte to the ASCII character '0'.
// https://github.com/quickwit-oss/tantivy/issues/2340
// https://github.com/quickwit-oss/tantivy/issues/2193
let field_name_in = "\u{0000}";
let field_name_out = "0";
test_json_field_name(field_name_in, field_name_out);
}
#[test]
fn test_json_field_1byte() {
// Test when field name contains a '1' byte, which has special meaning in tantivy.
// The 1 byte can be addressed as '1' byte or '.'.
let field_name_in = "\u{0001}";
let field_name_out = "\u{0001}";
test_json_field_name(field_name_in, field_name_out);
// Test when field name contains a '1' byte, which has special meaning in tantivy.
let field_name_in = "\u{0001}";
let field_name_out = ".";
test_json_field_name(field_name_in, field_name_out);
}
#[test]
fn test_json_field_dot() {
// Test when field name contains a '.'
let field_name_in = ".";
let field_name_out = ".";
test_json_field_name(field_name_in, field_name_out);
}
fn test_json_field_name(field_name_in: &str, field_name_out: &str) {
let mut schema_builder = Schema::builder();
let options = JsonObjectOptions::from(TEXT | FAST).set_expand_dots_enabled();
let field = schema_builder.add_json_field("json", options);
let index = Index::create_in_ram(schema_builder.build());
let mut index_writer = index.writer_for_tests().unwrap();
index_writer
.add_document(doc!(field=>json!({format!("{field_name_in}"): "test1"})))
.unwrap();
index_writer
.add_document(doc!(field=>json!({format!("a{field_name_in}"): "test2"})))
.unwrap();
index_writer
.add_document(doc!(field=>json!({format!("a{field_name_in}a"): "test3"})))
.unwrap();
index_writer
.add_document(
doc!(field=>json!({format!("a{field_name_in}a{field_name_in}"): "test4"})),
)
.unwrap();
index_writer
.add_document(
doc!(field=>json!({format!("a{field_name_in}.ab{field_name_in}"): "test5"})),
)
.unwrap();
index_writer
.add_document(
doc!(field=>json!({format!("a{field_name_in}"): json!({format!("a{field_name_in}"): "test6"}) })),
)
.unwrap();
index_writer
.add_document(doc!(field=>json!({format!("{field_name_in}a" ): "test7"})))
.unwrap();
index_writer.commit().unwrap();
let reader = index.reader().unwrap();
let searcher = reader.searcher();
let parse_query = QueryParser::for_index(&index, Vec::new());
let test_query = |query_str: &str| {
let query = parse_query.parse_query(query_str).unwrap();
let num_docs = searcher.search(&query, &Count).unwrap();
assert_eq!(num_docs, 1, "{}", query_str);
};
test_query(format!("json.{field_name_out}:test1").as_str());
test_query(format!("json.a{field_name_out}:test2").as_str());
test_query(format!("json.a{field_name_out}a:test3").as_str());
test_query(format!("json.a{field_name_out}a{field_name_out}:test4").as_str());
test_query(format!("json.a{field_name_out}.ab{field_name_out}:test5").as_str());
test_query(format!("json.a{field_name_out}.a{field_name_out}:test6").as_str());
test_query(format!("json.{field_name_out}a:test7").as_str());
let test_agg = |field_name: &str, expected: &str| {
let agg_req_str = json!(
{
"termagg": {
"terms": {
"field": field_name,
}
}
});
let agg_req: Aggregations = serde_json::from_value(agg_req_str).unwrap();
let collector = AggregationCollector::from_aggs(agg_req, Default::default());
let agg_res: AggregationResults = searcher.search(&AllQuery, &collector).unwrap();
let res = serde_json::to_value(agg_res).unwrap();
assert_eq!(res["termagg"]["buckets"][0]["doc_count"], 1);
assert_eq!(res["termagg"]["buckets"][0]["key"], expected);
};
test_agg(format!("json.{field_name_out}").as_str(), "test1");
test_agg(format!("json.a{field_name_out}").as_str(), "test2");
test_agg(format!("json.a{field_name_out}a").as_str(), "test3");
test_agg(
format!("json.a{field_name_out}a{field_name_out}").as_str(),
"test4",
);
test_agg(
format!("json.a{field_name_out}.ab{field_name_out}").as_str(),
"test5",
);
test_agg(
format!("json.a{field_name_out}.a{field_name_out}").as_str(),
"test6",
);
test_agg(format!("json.{field_name_out}a").as_str(), "test7");
}
#[test]
fn test_json_field_expand_dots_enabled_dot_escape_not_required() {

@@ -103,7 +103,7 @@ impl SegmentRegister {
#[cfg(test)]
mod tests {
use super::*;
use crate::index::{SegmentId, SegmentMetaInventory};
use crate::index::SegmentMetaInventory;
use crate::indexer::delete_queue::*;
fn segment_ids(segment_register: &SegmentRegister) -> Vec<SegmentId> {

@@ -1,4 +1,3 @@
use columnar::MonotonicallyMappableToU64;
use common::JsonPathWriter;
use itertools::Itertools;
use tokenizer_api::BoxTokenStream;
@@ -15,7 +14,8 @@ use crate::postings::{
PerFieldPostingsWriter, PostingsWriter,
};
use crate::schema::document::{Document, ReferenceValue, Value};
use crate::schema::{FieldEntry, FieldType, Schema, Term, DATE_TIME_PRECISION_INDEXED};
use crate::schema::indexing_term::IndexingTerm;
use crate::schema::{FieldEntry, FieldType, Schema};
use crate::store::{StoreReader, StoreWriter};
use crate::tokenizer::{FacetTokenizer, PreTokenizedStream, TextAnalyzer, Tokenizer};
use crate::{DocId, Opstamp, SegmentComponent, TantivyError};
@@ -70,7 +70,7 @@ pub struct SegmentWriter {
pub(crate) json_path_writer: JsonPathWriter,
pub(crate) doc_opstamps: Vec<Opstamp>,
per_field_text_analyzers: Vec<TextAnalyzer>,
term_buffer: Term,
term_buffer: IndexingTerm,
schema: Schema,
}
@@ -126,7 +126,7 @@ impl SegmentWriter {
)?,
doc_opstamps: Vec::with_capacity(1_000),
per_field_text_analyzers,
term_buffer: Term::with_capacity(16),
term_buffer: IndexingTerm::new(),
schema,
})
}
@@ -195,7 +195,7 @@ impl SegmentWriter {
let (term_buffer, ctx) = (&mut self.term_buffer, &mut self.ctx);
let postings_writer: &mut dyn PostingsWriter =
self.per_field_postings_writers.get_for_field_mut(field);
term_buffer.clear_with_field_and_type(field_entry.field_type().value_type(), field);
term_buffer.clear_with_field(field);
match field_entry.field_type() {
FieldType::Facet(_) => {
@@ -271,8 +271,7 @@ impl SegmentWriter {
num_vals += 1;
let date_val = value.as_datetime().ok_or_else(make_schema_error)?;
term_buffer
.set_u64(date_val.truncate(DATE_TIME_PRECISION_INDEXED).to_u64());
term_buffer.set_date(date_val);
postings_writer.subscribe(doc_id, 0u32, term_buffer, ctx);
}
if field_entry.has_fieldnorms() {
@@ -332,7 +331,7 @@ impl SegmentWriter {
num_vals += 1;
let bytes = value.as_bytes().ok_or_else(make_schema_error)?;
term_buffer.set_bytes(bytes);
term_buffer.set_value_bytes(bytes);
postings_writer.subscribe(doc_id, 0u32, term_buffer, ctx);
}
if field_entry.has_fieldnorms() {
@@ -496,14 +495,14 @@ mod tests {
use tempfile::TempDir;
use crate::collector::{Count, TopDocs};
use crate::core::json_utils::JsonTermWriter;
use crate::directory::RamDirectory;
use crate::fastfield::FastValue;
use crate::json_utils::term_from_json_paths;
use crate::postings::TermInfo;
use crate::query::{PhraseQuery, QueryParser};
use crate::schema::document::Value;
use crate::schema::{
Document, IndexRecordOption, Schema, TextFieldIndexing, TextOptions, Type, STORED, STRING,
TEXT,
Document, IndexRecordOption, Schema, TextFieldIndexing, TextOptions, STORED, STRING, TEXT,
};
use crate::store::{Compressor, StoreReader, StoreWriter};
use crate::time::format_description::well_known::Rfc3339;
@@ -645,115 +644,117 @@ mod tests {
let inv_idx = segment_reader.inverted_index(json_field).unwrap();
let term_dict = inv_idx.terms();
let mut term = Term::with_type_and_field(Type::Json, json_field);
let mut term_stream = term_dict.stream().unwrap();
let mut json_term_writer = JsonTermWriter::wrap(&mut term, false);
let term_from_path = |paths: &[&str]| -> Term {
term_from_json_paths(json_field, paths.iter().cloned(), false)
};
json_term_writer.push_path_segment("bool");
json_term_writer.set_fast_value(true);
fn set_fast_val<T: FastValue>(val: T, mut term: Term) -> Term {
term.append_type_and_fast_value(val);
term
}
fn set_str(val: &str, mut term: Term) -> Term {
term.append_type_and_str(val);
term
}
let term = term_from_path(&["bool"]);
assert!(term_stream.advance());
assert_eq!(
term_stream.key(),
json_term_writer.term().serialized_value_bytes()
set_fast_val(true, term).serialized_value_bytes()
);
json_term_writer.pop_path_segment();
json_term_writer.push_path_segment("complexobject");
json_term_writer.push_path_segment("field.with.dot");
json_term_writer.set_fast_value(1i64);
let term = term_from_path(&["complexobject", "field.with.dot"]);
assert!(term_stream.advance());
assert_eq!(
term_stream.key(),
json_term_writer.term().serialized_value_bytes()
set_fast_val(1i64, term).serialized_value_bytes()
);
json_term_writer.pop_path_segment();
json_term_writer.pop_path_segment();
json_term_writer.push_path_segment("date");
json_term_writer.set_fast_value(DateTime::from_utc(
OffsetDateTime::parse("1985-04-12T23:20:50.52Z", &Rfc3339).unwrap(),
));
// Date
let term = term_from_path(&["date"]);
assert!(term_stream.advance());
assert_eq!(
term_stream.key(),
json_term_writer.term().serialized_value_bytes()
set_fast_val(
DateTime::from_utc(
OffsetDateTime::parse("1985-04-12T23:20:50.52Z", &Rfc3339).unwrap(),
),
term
)
.serialized_value_bytes()
);
json_term_writer.pop_path_segment();
json_term_writer.push_path_segment("float");
json_term_writer.set_fast_value(-0.2f64);
// Float
let term = term_from_path(&["float"]);
assert!(term_stream.advance());
assert_eq!(
term_stream.key(),
json_term_writer.term().serialized_value_bytes()
set_fast_val(-0.2f64, term).serialized_value_bytes()
);
json_term_writer.pop_path_segment();
json_term_writer.push_path_segment("my_arr");
json_term_writer.set_fast_value(2i64);
// Number In Array
let term = term_from_path(&["my_arr"]);
assert!(term_stream.advance());
assert_eq!(
term_stream.key(),
json_term_writer.term().serialized_value_bytes()
set_fast_val(2i64, term).serialized_value_bytes()
);
json_term_writer.set_fast_value(3i64);
let term = term_from_path(&["my_arr"]);
assert!(term_stream.advance());
assert_eq!(
term_stream.key(),
json_term_writer.term().serialized_value_bytes()
set_fast_val(3i64, term).serialized_value_bytes()
);
json_term_writer.set_fast_value(4i64);
let term = term_from_path(&["my_arr"]);
assert!(term_stream.advance());
assert_eq!(
term_stream.key(),
json_term_writer.term().serialized_value_bytes()
set_fast_val(4i64, term).serialized_value_bytes()
);
json_term_writer.push_path_segment("my_key");
json_term_writer.set_str("tokens");
// El in Array
let term = term_from_path(&["my_arr", "my_key"]);
assert!(term_stream.advance());
assert_eq!(
term_stream.key(),
json_term_writer.term().serialized_value_bytes()
set_str("tokens", term).serialized_value_bytes()
);
json_term_writer.set_str("two");
let term = term_from_path(&["my_arr", "my_key"]);
assert!(term_stream.advance());
assert_eq!(
term_stream.key(),
json_term_writer.term().serialized_value_bytes()
set_str("two", term).serialized_value_bytes()
);
json_term_writer.pop_path_segment();
json_term_writer.pop_path_segment();
json_term_writer.push_path_segment("signed");
json_term_writer.set_fast_value(-2i64);
// Signed
let term = term_from_path(&["signed"]);
assert!(term_stream.advance());
assert_eq!(
term_stream.key(),
json_term_writer.term().serialized_value_bytes()
set_fast_val(-2i64, term).serialized_value_bytes()
);
json_term_writer.pop_path_segment();
json_term_writer.push_path_segment("toto");
json_term_writer.set_str("titi");
let term = term_from_path(&["toto"]);
assert!(term_stream.advance());
assert_eq!(
term_stream.key(),
json_term_writer.term().serialized_value_bytes()
set_str("titi", term).serialized_value_bytes()
);
json_term_writer.pop_path_segment();
json_term_writer.push_path_segment("unsigned");
json_term_writer.set_fast_value(1i64);
// Unsigned
let term = term_from_path(&["unsigned"]);
assert!(term_stream.advance());
assert_eq!(
term_stream.key(),
json_term_writer.term().serialized_value_bytes()
set_fast_val(1i64, term).serialized_value_bytes()
);
assert!(!term_stream.advance());
}
@@ -774,14 +775,9 @@ mod tests {
let searcher = reader.searcher();
let segment_reader = searcher.segment_reader(0u32);
let inv_index = segment_reader.inverted_index(json_field).unwrap();
let mut term = Term::with_type_and_field(Type::Json, json_field);
let mut json_term_writer = JsonTermWriter::wrap(&mut term, false);
json_term_writer.push_path_segment("mykey");
json_term_writer.set_str("token");
let term_info = inv_index
.get_term_info(json_term_writer.term())
.unwrap()
.unwrap();
let mut term = term_from_json_paths(json_field, ["mykey"].into_iter(), false);
term.append_type_and_str("token");
let term_info = inv_index.get_term_info(&term).unwrap().unwrap();
assert_eq!(
term_info,
TermInfo {
@@ -818,14 +814,9 @@ mod tests {
let searcher = reader.searcher();
let segment_reader = searcher.segment_reader(0u32);
let inv_index = segment_reader.inverted_index(json_field).unwrap();
let mut term = Term::with_type_and_field(Type::Json, json_field);
let mut json_term_writer = JsonTermWriter::wrap(&mut term, false);
json_term_writer.push_path_segment("mykey");
json_term_writer.set_str("two tokens");
let term_info = inv_index
.get_term_info(json_term_writer.term())
.unwrap()
.unwrap();
let mut term = term_from_json_paths(json_field, ["mykey"].into_iter(), false);
term.append_type_and_str("two tokens");
let term_info = inv_index.get_term_info(&term).unwrap().unwrap();
assert_eq!(
term_info,
TermInfo {
@@ -863,16 +854,18 @@ mod tests {
writer.commit().unwrap();
let reader = index.reader().unwrap();
let searcher = reader.searcher();
let mut term = Term::with_type_and_field(Type::Json, json_field);
let mut json_term_writer = JsonTermWriter::wrap(&mut term, false);
json_term_writer.push_path_segment("mykey");
json_term_writer.push_path_segment("field");
json_term_writer.set_str("hello");
let hello_term = json_term_writer.term().clone();
json_term_writer.set_str("nothello");
let nothello_term = json_term_writer.term().clone();
json_term_writer.set_str("happy");
let happy_term = json_term_writer.term().clone();
let term = term_from_json_paths(json_field, ["mykey", "field"].into_iter(), false);
let mut hello_term = term.clone();
hello_term.append_type_and_str("hello");
let mut nothello_term = term.clone();
nothello_term.append_type_and_str("nothello");
let mut happy_term = term.clone();
happy_term.append_type_and_str("happy");
let phrase_query = PhraseQuery::new(vec![hello_term, happy_term.clone()]);
assert_eq!(searcher.search(&phrase_query, &Count).unwrap(), 1);
let phrase_query = PhraseQuery::new(vec![nothello_term, happy_term]);

@@ -178,6 +178,7 @@ pub use crate::future_result::FutureResult;
pub type Result<T> = std::result::Result<T, TantivyError>;
mod core;
#[allow(deprecated)] // Remove with index sorting
pub mod indexer;
#[allow(unused_doc_comments)]
@@ -189,6 +190,7 @@ pub mod collector;
pub mod directory;
pub mod fastfield;
pub mod fieldnorm;
#[allow(deprecated)] // Remove with index sorting
pub mod index;
pub mod positions;
pub mod postings;
@@ -213,7 +215,7 @@ pub use common::{f64_to_u64, i64_to_u64, u64_to_f64, u64_to_i64, HasLen};
use once_cell::sync::Lazy;
use serde::{Deserialize, Serialize};
pub use self::docset::{DocSet, TERMINATED};
pub use self::docset::{DocSet, COLLECT_BLOCK_BUFFER_LEN, TERMINATED};
#[deprecated(
since = "0.22.0",
note = "Will be removed in tantivy 0.23. Use export from snippet module instead"
@@ -223,6 +225,7 @@ pub use self::snippet::{Snippet, SnippetGenerator};
pub use crate::core::json_utils;
pub use crate::core::{Executor, Searcher, SearcherGeneration};
pub use crate::directory::Directory;
#[allow(deprecated)] // Remove with index sorting
pub use crate::index::{
Index, IndexBuilder, IndexMeta, IndexSettings, IndexSortByField, InvertedIndexReader, Order,
Segment, SegmentComponent, SegmentId, SegmentMeta, SegmentReader,
@@ -234,8 +237,6 @@ pub use crate::index::{
pub use crate::indexer::PreparedCommit;
pub use crate::indexer::{IndexWriter, SingleSegmentIndexWriter};
pub use crate::postings::Postings;
#[allow(deprecated)]
pub use crate::schema::DatePrecision;
pub use crate::schema::{DateOptions, DateTimePrecision, Document, TantivyDocument, Term};
/// Index format version.
@@ -254,7 +255,7 @@ pub struct Version {
impl fmt::Debug for Version {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
write!(f, "{}", self.to_string())
fmt::Display::fmt(self, f)
}
}
@@ -265,9 +266,10 @@ static VERSION: Lazy<Version> = Lazy::new(|| Version {
index_format_version: INDEX_FORMAT_VERSION,
});
impl ToString for Version {
fn to_string(&self) -> String {
format!(
impl fmt::Display for Version {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
write!(
f,
"tantivy v{}.{}.{}, index_format v{}",
self.major, self.minor, self.patch, self.index_format_version
)
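Review note on the `Version` change: implementing `fmt::Display` instead of `ToString` still provides `to_string()` through the standard library's blanket impl, and lets `Debug` delegate without allocating a `String`. A minimal standalone sketch follows, assuming a `Version` struct with the four fields used in the format string; the `u32` field types are an assumption.

```rust
use std::fmt;

/// Stand-in for the crate's `Version` struct; field types are assumed.
pub struct Version {
    pub major: u32,
    pub minor: u32,
    pub patch: u32,
    pub index_format_version: u32,
}

impl fmt::Display for Version {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        // Writes straight into the formatter, no intermediate String.
        write!(
            f,
            "tantivy v{}.{}.{}, index_format v{}",
            self.major, self.minor, self.patch, self.index_format_version
        )
    }
}

impl fmt::Debug for Version {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        // Debug output reuses the Display implementation.
        fmt::Display::fmt(self, f)
    }
}
```

Existing callers of `version.to_string()` keep compiling, because every `Display` type gets `ToString` for free.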
@@ -391,7 +393,6 @@ pub mod tests {
use crate::index::SegmentReader;
use crate::merge_policy::NoMergePolicy;
use crate::query::BooleanQuery;
use crate::schema::document::Value;
use crate::schema::*;
use crate::{DateTime, DocAddress, Index, IndexWriter, Postings, ReloadPolicy};

@@ -14,7 +14,6 @@ pub fn compressed_block_size(num_bits: u8) -> usize {
pub struct BlockEncoder {
bitpacker: BitPacker4x,
pub output: [u8; COMPRESSED_BLOCK_MAX_SIZE],
pub output_len: usize,
}
impl Default for BlockEncoder {
@@ -28,7 +27,6 @@ impl BlockEncoder {
BlockEncoder {
bitpacker: BitPacker4x::new(),
output: [0u8; COMPRESSED_BLOCK_MAX_SIZE],
output_len: 0,
}
}

@@ -1,5 +1,6 @@
use std::io;
use common::json_path_writer::JSON_END_OF_PATH;
use stacker::Addr;
use crate::indexer::doc_id_mapping::DocIdMapping;
@@ -7,9 +8,10 @@ use crate::indexer::path_to_unordered_id::OrderedPathId;
use crate::postings::postings_writer::SpecializedPostingsWriter;
use crate::postings::recorder::{BufferLender, DocIdRecorder, Recorder};
use crate::postings::{FieldSerializer, IndexingContext, IndexingPosition, PostingsWriter};
use crate::schema::{Field, Type, JSON_END_OF_PATH};
use crate::schema::indexing_term::IndexingTerm;
use crate::schema::{Field, Type, ValueBytes};
use crate::tokenizer::TokenStream;
use crate::{DocId, Term};
use crate::DocId;
/// The `JsonPostingsWriter` is odd in that it relies on a hidden contract:
///
@@ -33,7 +35,7 @@ impl<Rec: Recorder> PostingsWriter for JsonPostingsWriter<Rec> {
&mut self,
doc: crate::DocId,
pos: u32,
term: &crate::Term,
term: &IndexingTerm,
ctx: &mut IndexingContext,
) {
self.non_str_posting_writer.subscribe(doc, pos, term, ctx);
@@ -43,7 +45,7 @@ impl<Rec: Recorder> PostingsWriter for JsonPostingsWriter<Rec> {
&mut self,
doc_id: DocId,
token_stream: &mut dyn TokenStream,
term_buffer: &mut Term,
term_buffer: &mut IndexingTerm,
ctx: &mut IndexingContext,
indexing_position: &mut IndexingPosition,
) {
@@ -65,34 +67,40 @@ impl<Rec: Recorder> PostingsWriter for JsonPostingsWriter<Rec> {
ctx: &IndexingContext,
serializer: &mut FieldSerializer,
) -> io::Result<()> {
let mut term_buffer = Term::with_capacity(48);
let mut term_buffer = JsonTermSerializer(Vec::with_capacity(48));
let mut buffer_lender = BufferLender::default();
let mut prev_term_id = u32::MAX;
let mut term_path_len = 0; // this will be set in the first iteration
for (_field, path_id, term, addr) in term_addrs {
term_buffer.clear_with_field_and_type(Type::Json, Field::from_field_id(0));
term_buffer.append_bytes(ordered_id_to_path[path_id.path_id() as usize].as_bytes());
term_buffer.append_bytes(&[JSON_END_OF_PATH]);
if prev_term_id != path_id.path_id() {
term_buffer.clear();
term_buffer.append_path(ordered_id_to_path[path_id.path_id() as usize].as_bytes());
term_buffer.append_bytes(&[JSON_END_OF_PATH]);
term_path_len = term_buffer.len();
prev_term_id = path_id.path_id();
}
term_buffer.truncate(term_path_len);
term_buffer.append_bytes(term);
if let Some(json_value) = term_buffer.value().as_json_value_bytes() {
let typ = json_value.typ();
if typ == Type::Str {
SpecializedPostingsWriter::<Rec>::serialize_one_term(
term_buffer.serialized_value_bytes(),
*addr,
doc_id_map,
&mut buffer_lender,
ctx,
serializer,
)?;
} else {
SpecializedPostingsWriter::<DocIdRecorder>::serialize_one_term(
term_buffer.serialized_value_bytes(),
*addr,
doc_id_map,
&mut buffer_lender,
ctx,
serializer,
)?;
}
let json_value = ValueBytes::wrap(term);
let typ = json_value.typ();
if typ == Type::Str {
SpecializedPostingsWriter::<Rec>::serialize_one_term(
term_buffer.as_bytes(),
*addr,
doc_id_map,
&mut buffer_lender,
ctx,
serializer,
)?;
} else {
SpecializedPostingsWriter::<DocIdRecorder>::serialize_one_term(
term_buffer.as_bytes(),
*addr,
doc_id_map,
&mut buffer_lender,
ctx,
serializer,
)?;
}
}
Ok(())
@@ -102,3 +110,40 @@ impl<Rec: Recorder> PostingsWriter for JsonPostingsWriter<Rec> {
self.str_posting_writer.total_num_tokens() + self.non_str_posting_writer.total_num_tokens()
}
}
struct JsonTermSerializer(Vec<u8>);
impl JsonTermSerializer {
#[inline]
pub fn append_path(&mut self, bytes: &[u8]) {
if bytes.contains(&0u8) {
self.0
.extend(bytes.iter().map(|&b| if b == 0 { b'0' } else { b }));
} else {
self.0.extend_from_slice(bytes);
}
}
/// Appends value bytes to the Term.
///
/// This function returns the segment that has just been added.
#[inline]
pub fn append_bytes(&mut self, bytes: &[u8]) -> &mut [u8] {
let len_before = self.0.len();
self.0.extend_from_slice(bytes);
&mut self.0[len_before..]
}
fn clear(&mut self) {
self.0.clear();
}
fn truncate(&mut self, len: usize) {
self.0.truncate(len);
}
fn len(&self) -> usize {
self.0.len()
}
fn as_bytes(&self) -> &[u8] {
&self.0
}
}
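Review note on the new serialization loop: the path prefix is rebuilt only when the path id changes, and each term truncates the buffer back to that prefix before appending its value bytes. The standalone sketch below mimics that pattern on plain byte slices. `sanitize_json_path` and `build_keys` are hypothetical names, paths are compared by content rather than by ordered path id, and `END_OF_PATH = 0` is an assumption about the value of `JSON_END_OF_PATH`.

```rust
/// Assumed value of `JSON_END_OF_PATH`; not confirmed by this diff.
const END_OF_PATH: u8 = 0;

/// Replaces zero bytes in a path with ASCII '0', as `append_path` does.
pub fn sanitize_json_path(path: &[u8]) -> Vec<u8> {
    path.iter().map(|&b| if b == 0 { b'0' } else { b }).collect()
}

/// Builds one key per (path, value) entry, reusing the serialized path
/// prefix while consecutive entries share the same path.
pub fn build_keys(entries: &[(&[u8], &[u8])]) -> Vec<Vec<u8>> {
    let mut buffer: Vec<u8> = Vec::with_capacity(48);
    let mut prev_path: Option<&[u8]> = None;
    let mut path_len = 0;
    let mut keys = Vec::with_capacity(entries.len());
    for &(path, value) in entries {
        if prev_path != Some(path) {
            // New path: rebuild the prefix and remember where it ends.
            buffer.clear();
            buffer.extend(sanitize_json_path(path));
            buffer.push(END_OF_PATH);
            path_len = buffer.len();
            prev_path = Some(path);
        }
        // Drop the previous value, keep the prefix, append the new value.
        buffer.truncate(path_len);
        buffer.extend_from_slice(value);
        keys.push(buffer.clone());
    }
    keys
}
```

Because the input is sorted by path, the prefix is serialized once per distinct path instead of once per term.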

@@ -11,7 +11,8 @@ use crate::postings::recorder::{BufferLender, Recorder};
use crate::postings::{
FieldSerializer, IndexingContext, InvertedIndexSerializer, PerFieldPostingsWriter,
};
use crate::schema::{Field, Schema, Term, Type};
use crate::schema::indexing_term::{get_field_from_indexing_term, IndexingTerm};
use crate::schema::{Field, Schema, Type};
use crate::tokenizer::{Token, TokenStream, MAX_TOKEN_LEN};
use crate::DocId;
@@ -60,14 +61,14 @@ pub(crate) fn serialize_postings(
let mut term_offsets: Vec<(Field, OrderedPathId, &[u8], Addr)> =
Vec::with_capacity(ctx.term_index.len());
term_offsets.extend(ctx.term_index.iter().map(|(key, addr)| {
let field = Term::wrap(key).field();
let field = get_field_from_indexing_term(key);
if schema.get_field_entry(field).field_type().value_type() == Type::Json {
let byte_range_path = 5..5 + 4;
let byte_range_path = 4..4 + 4;
let unordered_id = u32::from_be_bytes(key[byte_range_path.clone()].try_into().unwrap());
let path_id = unordered_id_to_ordered_id[unordered_id as usize];
(field, path_id, &key[byte_range_path.end..], addr)
} else {
(field, 0.into(), &key[5..], addr)
(field, 0.into(), &key[4..], addr)
}
}));
// Sort by field, path, and term
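Review note on the offset change from 5 to 4: the hashmap key no longer seems to carry a type byte after the field id, so the JSON path id now starts at byte 4. The sketch below decodes a key under that assumed layout. `split_key` is a hypothetical helper, and reading the first four bytes as a big-endian field id is an assumption; the diff only shows that `get_field_from_indexing_term` extracts the field from the key.

```rust
use std::convert::TryInto;

/// Splits a term-hashmap key into (field id, unordered path id, value bytes).
///
/// Assumed layout, inferred from the offsets in the diff:
///   [0..4)  field id, big-endian u32 (assumption)
///   [4..8)  unordered JSON path id, big-endian u32 (JSON fields only)
///   rest    value bytes
///
/// Returns `None` when the key is too short for the requested layout.
pub fn split_key(key: &[u8], is_json: bool) -> Option<(u32, Option<u32>, &[u8])> {
    let field_id = u32::from_be_bytes(key.get(0..4)?.try_into().ok()?);
    if is_json {
        let path_id = u32::from_be_bytes(key.get(4..8)?.try_into().ok()?);
        Some((field_id, Some(path_id), &key[8..]))
    } else {
        Some((field_id, None, &key[4..]))
    }
}
```

Unlike the diff, which indexes and unwraps directly, this version returns `None` on short keys so it can be exercised safely in isolation.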
@@ -114,7 +115,7 @@ pub(crate) trait PostingsWriter: Send + Sync {
/// * term - the term
/// * ctx - Contains a term hashmap and a memory arena to store all necessary posting list
/// information.
fn subscribe(&mut self, doc: DocId, pos: u32, term: &Term, ctx: &mut IndexingContext);
fn subscribe(&mut self, doc: DocId, pos: u32, term: &IndexingTerm, ctx: &mut IndexingContext);
/// Serializes the postings on disk.
/// The actual serialization format is handled by the `PostingsSerializer`.
@@ -132,7 +133,7 @@ pub(crate) trait PostingsWriter: Send + Sync {
&mut self,
doc_id: DocId,
token_stream: &mut dyn TokenStream,
term_buffer: &mut Term,
term_buffer: &mut IndexingTerm,
ctx: &mut IndexingContext,
indexing_position: &mut IndexingPosition,
) {
@@ -203,26 +204,35 @@ impl<Rec: Recorder> SpecializedPostingsWriter<Rec> {
impl<Rec: Recorder> PostingsWriter for SpecializedPostingsWriter<Rec> {
#[inline]
fn subscribe(&mut self, doc: DocId, position: u32, term: &Term, ctx: &mut IndexingContext) {
debug_assert!(term.serialized_term().len() >= 4);
fn subscribe(
&mut self,
doc: DocId,
position: u32,
term: &IndexingTerm,
ctx: &mut IndexingContext,
) {
debug_assert!(term.serialized_for_hashmap().len() >= 4);
self.total_num_tokens += 1;
let (term_index, arena) = (&mut ctx.term_index, &mut ctx.arena);
term_index.mutate_or_create(term.serialized_term(), |opt_recorder: Option<Rec>| {
if let Some(mut recorder) = opt_recorder {
let current_doc = recorder.current_doc();
if current_doc != doc {
recorder.close_doc(arena);
term_index.mutate_or_create(
term.serialized_for_hashmap(),
|opt_recorder: Option<Rec>| {
if let Some(mut recorder) = opt_recorder {
let current_doc = recorder.current_doc();
if current_doc != doc {
recorder.close_doc(arena);
recorder.new_doc(doc, arena);
}
recorder.record_position(position, arena);
recorder
} else {
let mut recorder = Rec::default();
recorder.new_doc(doc, arena);
recorder.record_position(position, arena);
recorder
}
recorder.record_position(position, arena);
recorder
} else {
let mut recorder = Rec::default();
recorder.new_doc(doc, arena);
recorder.record_position(position, arena);
recorder
}
});
},
);
}
fn serialize(

@@ -1,5 +1,3 @@
use std::convert::TryInto;
use crate::directory::OwnedBytes;
use crate::postings::compression::{compressed_block_size, COMPRESSION_BLOCK_SIZE};
use crate::query::Bm25Weight;

@@ -1,5 +1,4 @@
use std::io;
use std::iter::ExactSizeIterator;
use std::ops::Range;
use common::{BinarySerializable, FixedSize};

@@ -1,4 +1,4 @@
use crate::docset::{DocSet, BUFFER_LEN, TERMINATED};
use crate::docset::{DocSet, COLLECT_BLOCK_BUFFER_LEN, TERMINATED};
use crate::index::SegmentReader;
use crate::query::boost_query::BoostScorer;
use crate::query::explanation::does_not_match;
@@ -54,7 +54,7 @@ impl DocSet for AllScorer {
self.doc
}
fn fill_buffer(&mut self, buffer: &mut [DocId; BUFFER_LEN]) -> usize {
fn fill_buffer(&mut self, buffer: &mut [DocId; COLLECT_BLOCK_BUFFER_LEN]) -> usize {
if self.doc() == TERMINATED {
return 0;
}
@@ -96,7 +96,7 @@ impl Scorer for AllScorer {
#[cfg(test)]
mod tests {
use super::AllQuery;
use crate::docset::{DocSet, BUFFER_LEN, TERMINATED};
use crate::docset::{DocSet, COLLECT_BLOCK_BUFFER_LEN, TERMINATED};
use crate::query::{AllScorer, EnableScoring, Query};
use crate::schema::{Schema, TEXT};
use crate::{Index, IndexWriter};
@@ -162,16 +162,16 @@ mod tests {
pub fn test_fill_buffer() {
let mut postings = AllScorer {
doc: 0u32,
max_doc: BUFFER_LEN as u32 * 2 + 9,
max_doc: COLLECT_BLOCK_BUFFER_LEN as u32 * 2 + 9,
};
let mut buffer = [0u32; BUFFER_LEN];
assert_eq!(postings.fill_buffer(&mut buffer), BUFFER_LEN);
for i in 0u32..BUFFER_LEN as u32 {
let mut buffer = [0u32; COLLECT_BLOCK_BUFFER_LEN];
assert_eq!(postings.fill_buffer(&mut buffer), COLLECT_BLOCK_BUFFER_LEN);
for i in 0u32..COLLECT_BLOCK_BUFFER_LEN as u32 {
assert_eq!(buffer[i as usize], i);
}
assert_eq!(postings.fill_buffer(&mut buffer), BUFFER_LEN);
for i in 0u32..BUFFER_LEN as u32 {
assert_eq!(buffer[i as usize], i + BUFFER_LEN as u32);
assert_eq!(postings.fill_buffer(&mut buffer), COLLECT_BLOCK_BUFFER_LEN);
for i in 0u32..COLLECT_BLOCK_BUFFER_LEN as u32 {
assert_eq!(buffer[i as usize], i + COLLECT_BLOCK_BUFFER_LEN as u32);
}
assert_eq!(postings.fill_buffer(&mut buffer), 9);
}
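Review note on the renamed constant: `fill_buffer` takes a fixed-size array of `COLLECT_BLOCK_BUFFER_LEN` doc ids and returns how many slots it filled. The sketch below is a standalone imitation of an all-documents scorer, written to match what `test_fill_buffer` expects. The constant values (`64` and `i32::MAX as u32`) and the body of `advance` are assumptions, not copied from the crate, and the sketch assumes `max_doc > 0`.

```rust
pub type DocId = u32;

/// Assumed sentinel marking an exhausted doc set.
pub const TERMINATED: DocId = i32::MAX as u32;
/// Assumed block size; the real value is defined in the `docset` module.
pub const COLLECT_BLOCK_BUFFER_LEN: usize = 64;

/// Stand-in scorer that matches every doc id in `0..max_doc`.
pub struct AllScorer {
    pub doc: DocId,
    pub max_doc: DocId,
}

impl AllScorer {
    pub fn doc(&self) -> DocId {
        self.doc
    }

    /// Moves to the next doc id, or to TERMINATED past the last one.
    pub fn advance(&mut self) -> DocId {
        if self.doc + 1 >= self.max_doc {
            self.doc = TERMINATED;
        } else {
            self.doc += 1;
        }
        self.doc
    }

    /// Fills `buffer` with consecutive doc ids and returns the count written.
    pub fn fill_buffer(&mut self, buffer: &mut [DocId; COLLECT_BLOCK_BUFFER_LEN]) -> usize {
        if self.doc() == TERMINATED {
            return 0;
        }
        for (i, slot) in buffer.iter_mut().enumerate() {
            *slot = self.doc();
            if self.advance() == TERMINATED {
                return i + 1;
            }
        }
        buffer.len()
    }
}
```

With `max_doc` set to `COLLECT_BLOCK_BUFFER_LEN * 2 + 9`, the first two calls fill the whole buffer and the third writes the remaining nine ids.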

@@ -1,6 +1,6 @@
use std::collections::HashMap;
use crate::docset::BUFFER_LEN;
use crate::docset::COLLECT_BLOCK_BUFFER_LEN;
use crate::index::SegmentReader;
use crate::postings::FreqReadingOption;
use crate::query::explanation::does_not_match;
@@ -228,7 +228,7 @@ impl<TScoreCombiner: ScoreCombiner + Sync> Weight for BooleanWeight<TScoreCombin
callback: &mut dyn FnMut(&[DocId]),
) -> crate::Result<()> {
let scorer = self.complex_scorer(reader, 1.0, || DoNothingCombiner)?;
let mut buffer = [0u32; BUFFER_LEN];
let mut buffer = [0u32; COLLECT_BLOCK_BUFFER_LEN];
match scorer {
SpecializedScorer::TermUnion(term_scorers) => {

@@ -1,6 +1,6 @@
use std::fmt;
use crate::docset::BUFFER_LEN;
use crate::docset::COLLECT_BLOCK_BUFFER_LEN;
use crate::fastfield::AliveBitSet;
use crate::query::{EnableScoring, Explanation, Query, Scorer, Weight};
use crate::{DocId, DocSet, Score, SegmentReader, Term};
@@ -105,7 +105,7 @@ impl<S: Scorer> DocSet for BoostScorer<S> {
self.underlying.seek(target)
}
fn fill_buffer(&mut self, buffer: &mut [DocId; BUFFER_LEN]) -> usize {
fn fill_buffer(&mut self, buffer: &mut [DocId; COLLECT_BLOCK_BUFFER_LEN]) -> usize {
self.underlying.fill_buffer(buffer)
}

@@ -1,6 +1,6 @@
use std::fmt;
use crate::docset::BUFFER_LEN;
use crate::docset::COLLECT_BLOCK_BUFFER_LEN;
use crate::query::{EnableScoring, Explanation, Query, Scorer, Weight};
use crate::{DocId, DocSet, Score, SegmentReader, TantivyError, Term};
@@ -119,7 +119,7 @@ impl<TDocSet: DocSet> DocSet for ConstScorer<TDocSet> {
self.docset.seek(target)
}
fn fill_buffer(&mut self, buffer: &mut [DocId; BUFFER_LEN]) -> usize {
fn fill_buffer(&mut self, buffer: &mut [DocId; COLLECT_BLOCK_BUFFER_LEN]) -> usize {
self.docset.fill_buffer(buffer)
}

@@ -149,7 +149,7 @@ mod tests {
use crate::query::exist_query::ExistsQuery;
use crate::query::{BooleanQuery, RangeQuery};
use crate::schema::{Facet, FacetOptions, Schema, FAST, INDEXED, STRING, TEXT};
use crate::{doc, Index, Searcher};
use crate::{Index, Searcher};
#[test]
fn test_exists_query_simple() -> crate::Result<()> {

@@ -3,7 +3,7 @@ use once_cell::sync::OnceCell;
use tantivy_fst::Automaton;
use crate::query::{AutomatonWeight, EnableScoring, Query, Weight};
use crate::schema::{Term, Type};
use crate::schema::Term;
use crate::TantivyError::InvalidArgument;
pub(crate) struct DfaWrapper(pub DFA);
@@ -84,7 +84,7 @@ pub struct FuzzyTermQuery {
distance: u8,
/// Should a transposition cost 1 or 2?
transposition_cost_one: bool,
///
/// Whether this is a prefix (starts-with) query.
prefix: bool,
}
@@ -133,40 +133,33 @@ impl FuzzyTermQuery {
let term_value = self.term.value();
let term_text = if term_value.typ() == Type::Json {
if let Some(json_path_type) = term_value.json_path_type() {
if json_path_type != Type::Str {
return Err(InvalidArgument(format!(
"The fuzzy term query requires a string path type for a json term. Found \
{:?}",
json_path_type
)));
}
let get_automaton = |term_text: &str| {
if self.prefix {
automaton_builder.build_prefix_dfa(term_text)
} else {
automaton_builder.build_dfa(term_text)
}
std::str::from_utf8(self.term.serialized_value_bytes()).map_err(|_| {
InvalidArgument(
"Failed to convert json term value bytes to utf8 string.".to_string(),
)
})?
} else {
term_value.as_str().ok_or_else(|| {
InvalidArgument("The fuzzy term query requires a string term.".to_string())
})?
};
let automaton = if self.prefix {
automaton_builder.build_prefix_dfa(term_text)
} else {
automaton_builder.build_dfa(term_text)
};
if let Some((json_path_bytes, _)) = term_value.as_json() {
if let Some((json_path_bytes, _term_value)) = term_value.as_json() {
let term_text =
std::str::from_utf8(self.term.serialized_value_bytes()).map_err(|_| {
InvalidArgument(
"Failed to convert json term value bytes to utf8 string.".to_string(),
)
})?;
let automaton = get_automaton(term_text);
Ok(AutomatonWeight::new_for_json_path(
self.term.field(),
DfaWrapper(automaton),
json_path_bytes,
))
} else {
let term_text = term_value.as_str().ok_or_else(|| {
InvalidArgument("The fuzzy term query requires a string term.".to_string())
})?;
let automaton = get_automaton(term_text);
Ok(AutomatonWeight::new(
self.term.field(),
DfaWrapper(automaton),


@@ -137,7 +137,7 @@ impl Query for PhrasePrefixQuery {
// There are no prefix. Let's just match the suffix.
let end_term =
if let Some(end_value) = prefix_end(self.prefix.1.serialized_value_bytes()) {
let mut end_term = Term::with_capacity(end_value.len());
let mut end_term = Term::new();
end_term.set_field_and_type(self.field, self.prefix.1.typ());
end_term.append_bytes(&end_value);
Bound::Excluded(end_term)


@@ -10,10 +10,10 @@ use query_grammar::{UserInputAst, UserInputBound, UserInputLeaf, UserInputLitera
use rustc_hash::FxHashMap;
use super::logical_ast::*;
use crate::core::json_utils::{
convert_to_fast_value_and_get_term, set_string_and_get_terms, JsonTermWriter,
};
use crate::index::Index;
use crate::json_utils::{
convert_to_fast_value_and_append_to_json_term, split_json_path, term_from_json_paths,
};
use crate::query::range_query::{is_type_valid_for_fastfield_range_query, RangeQuery};
use crate::query::{
AllQuery, BooleanQuery, BoostQuery, EmptyQuery, FuzzyTermQuery, Occur, PhrasePrefixQuery,
@@ -965,20 +965,33 @@ fn generate_literals_for_json_object(
})?;
let index_record_option = text_options.index_option();
let mut logical_literals = Vec::new();
let mut term = Term::with_capacity(100);
let mut json_term_writer = JsonTermWriter::from_field_and_json_path(
field,
json_path,
json_options.is_expand_dots_enabled(),
&mut term,
);
if let Some(term) = convert_to_fast_value_and_get_term(&mut json_term_writer, phrase) {
let paths = split_json_path(json_path);
let get_term_with_path = || {
term_from_json_paths(
field,
paths.iter().map(|el| el.as_str()),
json_options.is_expand_dots_enabled(),
)
};
// Try to convert the phrase to a fast value
if let Some(term) = convert_to_fast_value_and_append_to_json_term(get_term_with_path(), phrase)
{
logical_literals.push(LogicalLiteral::Term(term));
}
let terms = set_string_and_get_terms(&mut json_term_writer, phrase, &mut text_analyzer);
drop(json_term_writer);
if terms.len() <= 1 {
for (_, term) in terms {
// Try to tokenize the phrase and create Terms.
let mut positions_and_terms = Vec::<(usize, Term)>::new();
let mut token_stream = text_analyzer.token_stream(phrase);
token_stream.process(&mut |token| {
let mut term = get_term_with_path();
term.append_type_and_str(&token.text);
positions_and_terms.push((token.position, term.clone()));
});
if positions_and_terms.len() <= 1 {
for (_, term) in positions_and_terms {
logical_literals.push(LogicalLiteral::Term(term));
}
return Ok(logical_literals);
@@ -989,7 +1002,7 @@ fn generate_literals_for_json_object(
));
}
logical_literals.push(LogicalLiteral::Phrase {
terms,
terms: positions_and_terms,
slop: 0,
prefix: false,
});
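
Note on the hunk above: the new code no longer mutates one shared `JsonTermWriter`. It builds a fresh path-prefixed term for each token and records it with the token position. The sketch below shows that pattern in isolation, assuming stand-in types. `SketchTerm`, `term_with_path`, `positions_and_terms`, the separator and end-of-path bytes, and the whitespace/lowercase tokenizer are all hypothetical and are not tantivy's API or on-disk encoding.

```rust
/// Hypothetical stand-in for a term: just the raw bytes.
#[derive(Clone, Debug, PartialEq)]
pub struct SketchTerm(pub Vec<u8>);

/// Build a term that contains only the encoded JSON path.
/// The separator (1) and end-of-path marker (0) are assumptions for this sketch.
pub fn term_with_path(path: &[&str]) -> SketchTerm {
    let mut bytes = Vec::new();
    for (i, segment) in path.iter().enumerate() {
        if i > 0 {
            bytes.push(1u8); // hypothetical path-segment separator
        }
        bytes.extend_from_slice(segment.as_bytes());
    }
    bytes.push(0u8); // hypothetical end-of-path marker
    SketchTerm(bytes)
}

/// Tokenize `phrase` and return one (position, term) pair per token.
/// Each term starts from a fresh path-only term, so no state is shared
/// between tokens.
pub fn positions_and_terms(path: &[&str], phrase: &str) -> Vec<(usize, SketchTerm)> {
    let fresh_term = || term_with_path(path);
    phrase
        .split_whitespace() // stand-in for a real text analyzer
        .enumerate()
        .map(|(position, token)| {
            let mut term = fresh_term();
            term.0.extend_from_slice(token.to_lowercase().as_bytes());
            (position, term)
        })
        .collect()
}
```

With this shape, a result of length 0 or 1 can be emitted as plain term literals, and a longer result can be passed on as the `terms` of a phrase literal, which mirrors the branch on `positions_and_terms.len() <= 1` in the hunk.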


@@ -477,7 +477,7 @@ mod tests {
use crate::schema::{
Field, IntoIpv6Addr, Schema, TantivyDocument, FAST, INDEXED, STORED, TEXT,
};
use crate::{doc, Index, IndexWriter};
use crate::{Index, IndexWriter};
#[test]
fn test_range_query_simple() -> crate::Result<()> {

Some files were not shown because too many files have changed in this diff.