Compare commits


1 Commit

Author: Pascal Seitz
SHA1: bb57e63522
Date: 2023-12-13 15:52:41 +08:00

Store List of Fields in Segment

Fields may be encoded in the columnar storage or in the inverted index
for JSON fields.
Add a new Segment file that contains the list of fields (schema +
encoded).
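For orientation, a minimal sketch of what one entry of such a field list could carry. The `FieldMetadata`/`FieldEncoding` names and the JSON serialization are illustrative assumptions, not the actual file format added by this commit:

```rust
// Hypothetical sketch, not the real tantivy segment file format:
// one entry of a per-segment field list, recording a field and
// where its data is encoded.
use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Copy, Serialize, Deserialize)]
enum FieldEncoding {
    Columnar,      // stored in the columnar (fast field) storage
    InvertedIndex, // stored in the inverted index (e.g. JSON sub-paths)
}

#[derive(Debug, Clone, Serialize, Deserialize)]
struct FieldMetadata {
    field_name: String,
    encoding: FieldEncoding,
}

fn main() {
    // A JSON field may surface the same path under both encodings.
    let fields = vec![
        FieldMetadata {
            field_name: "attributes.color".into(),
            encoding: FieldEncoding::Columnar,
        },
        FieldMetadata {
            field_name: "attributes.color".into(),
            encoding: FieldEncoding::InvertedIndex,
        },
    ];
    println!("{}", serde_json::to_string_pretty(&fields).unwrap());
}
```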
172 changed files with 1323 additions and 3334 deletions


@@ -1,65 +1,3 @@
Tantivy 0.22
================================
Tantivy 0.22 will be able to read indices created with Tantivy 0.21.
#### Bugfixes
- Fix null byte handling in JSON paths (null bytes in json keys caused panic during indexing) [#2345](https://github.com/quickwit-oss/tantivy/pull/2345)(@PSeitz)
- Fix bug that can cause `get_docids_for_value_range` to panic. [#2295](https://github.com/quickwit-oss/tantivy/pull/2295)(@fulmicoton)
- Avoid 1-document indices by increasing min memory to 15MB for indexing [#2176](https://github.com/quickwit-oss/tantivy/pull/2176)(@PSeitz)
- Fix merge panic for JSON fields [#2284](https://github.com/quickwit-oss/tantivy/pull/2284)(@PSeitz)
- Fix bug occurring when merging JSON object indexed with positions. [#2253](https://github.com/quickwit-oss/tantivy/pull/2253)(@fulmicoton)
- Fix empty DateHistogram gap bug [#2183](https://github.com/quickwit-oss/tantivy/pull/2183)(@PSeitz)
- Fix range query end check (fields with less than 1 value per doc are affected) [#2226](https://github.com/quickwit-oss/tantivy/pull/2226)(@PSeitz)
- Handle exclusive out of bounds ranges on fastfield range queries [#2174](https://github.com/quickwit-oss/tantivy/pull/2174)(@PSeitz)
#### Breaking API Changes
- rename ReloadPolicy onCommit to onCommitWithDelay [#2235](https://github.com/quickwit-oss/tantivy/pull/2235)(@giovannicuccu)
- Move exports from the root into modules [#2220](https://github.com/quickwit-oss/tantivy/pull/2220)(@PSeitz)
- Accept field name instead of `Field` in FilterCollector [#2196](https://github.com/quickwit-oss/tantivy/pull/2196)(@PSeitz)
- remove deprecated IntOptions and DateTime [#2353](https://github.com/quickwit-oss/tantivy/pull/2353)(@PSeitz)
#### Features/Improvements
- Tantivy documents as a trait: Index data directly without converting to tantivy types first [#2071](https://github.com/quickwit-oss/tantivy/pull/2071)(@ChillFish8)
- encode some part of posting list as -1 instead of direct values (smaller inverted indices) [#2185](https://github.com/quickwit-oss/tantivy/pull/2185)(@trinity-1686a)
- **Aggregation**
- Support to deserialize f64 from string [#2311](https://github.com/quickwit-oss/tantivy/pull/2311)(@PSeitz)
- Add a top_hits aggregator [#2198](https://github.com/quickwit-oss/tantivy/pull/2198)(@ditsuke)
- Support bool type in term aggregation [#2318](https://github.com/quickwit-oss/tantivy/pull/2318)(@PSeitz)
- Support IP addresses in term aggregation [#2319](https://github.com/quickwit-oss/tantivy/pull/2319)(@PSeitz)
- Support date type in term aggregation [#2172](https://github.com/quickwit-oss/tantivy/pull/2172)(@PSeitz)
- Support escaped dot when addressing field [#2250](https://github.com/quickwit-oss/tantivy/pull/2250)(@PSeitz)
- Add ExistsQuery to check documents that have a value [#2160](https://github.com/quickwit-oss/tantivy/pull/2160)(@imotov)
- Expose TopDocs::order_by_u64_field again [#2282](https://github.com/quickwit-oss/tantivy/pull/2282)(@ditsuke)
- **Memory/Performance**
- Faster TopN: replace BinaryHeap with TopNComputer [#2186](https://github.com/quickwit-oss/tantivy/pull/2186)(@PSeitz)
- reduce number of allocations during indexing [#2257](https://github.com/quickwit-oss/tantivy/pull/2257)(@PSeitz)
- Less Memory while indexing: docid deltas while indexing [#2249](https://github.com/quickwit-oss/tantivy/pull/2249)(@PSeitz)
- Faster indexing: use term hashmap in fastfield [#2243](https://github.com/quickwit-oss/tantivy/pull/2243)(@PSeitz)
- term hashmap remove copy in is_empty, unused unordered_id [#2229](https://github.com/quickwit-oss/tantivy/pull/2229)(@PSeitz)
- add method to fetch block of first values in columnar [#2330](https://github.com/quickwit-oss/tantivy/pull/2330)(@PSeitz)
- Faster aggregations: add fast path for full columns in fetch_block [#2328](https://github.com/quickwit-oss/tantivy/pull/2328)(@PSeitz)
- Faster sstable loading: use fst for sstable index [#2268](https://github.com/quickwit-oss/tantivy/pull/2268)(@trinity-1686a)
- **QueryParser**
- allow newline where we allow space in query parser [#2302](https://github.com/quickwit-oss/tantivy/pull/2302)(@trinity-1686a)
- allow some mixing of occur and bool in strict query parser [#2323](https://github.com/quickwit-oss/tantivy/pull/2323)(@trinity-1686a)
- handle * inside term in lenient query parser [#2228](https://github.com/quickwit-oss/tantivy/pull/2228)(@trinity-1686a)
- add support for exists query syntax in query parser [#2170](https://github.com/quickwit-oss/tantivy/pull/2170)(@trinity-1686a)
- Add shared search executor [#2312](https://github.com/quickwit-oss/tantivy/pull/2312)(@MochiXu)
- Truncate keys to u16::MAX in term hashmap [#2299](https://github.com/quickwit-oss/tantivy/pull/2299)(@PSeitz)
- report if a term matched when warming up posting list [#2309](https://github.com/quickwit-oss/tantivy/pull/2309)(@trinity-1686a)
- Support json fields in FuzzyTermQuery [#2173](https://github.com/quickwit-oss/tantivy/pull/2173)(@PingXia-at)
- Read list of fields encoded in term dictionary for JSON fields [#2184](https://github.com/quickwit-oss/tantivy/pull/2184)(@PSeitz)
- add collect_block to BoxableSegmentCollector [#2331](https://github.com/quickwit-oss/tantivy/pull/2331)(@PSeitz)
- expose collect_block buffer size [#2326](https://github.com/quickwit-oss/tantivy/pull/2326)(@PSeitz)
- Forward regex parser errors [#2288](https://github.com/quickwit-oss/tantivy/pull/2288)(@adamreichold)
- Make FacetCounts defaultable and cloneable. [#2322](https://github.com/quickwit-oss/tantivy/pull/2322)(@adamreichold)
- Derive Debug for SchemaBuilder [#2254](https://github.com/quickwit-oss/tantivy/pull/2254)(@GodTamIt)
- add missing inlines to tantivy options [#2245](https://github.com/quickwit-oss/tantivy/pull/2245)(@PSeitz)
Tantivy 0.21.1
================================
#### Bugfixes


@@ -1,6 +1,6 @@
[package]
name = "tantivy"
-version = "0.22.1"
+version = "0.22.0-dev"
authors = ["Paul Masurel <paul.masurel@gmail.com>"]
license = "MIT"
categories = ["database-implementations", "data-structures"]
@@ -11,12 +11,12 @@ repository = "https://github.com/quickwit-oss/tantivy"
readme = "README.md"
keywords = ["search", "information", "retrieval"]
edition = "2021"
-rust-version = "1.63"
+rust-version = "1.62"
exclude = ["benches/*.json", "benches/*.txt"]
[dependencies]
oneshot = "0.1.5"
-base64 = "0.22.0"
+base64 = "0.21.0"
byteorder = "1.4.3"
crc32fast = "1.3.2"
once_cell = "1.10.0"
@@ -25,20 +25,20 @@ aho-corasick = "1.0"
tantivy-fst = "0.5"
memmap2 = { version = "0.9.0", optional = true }
lz4_flex = { version = "0.11", default-features = false, optional = true }
-zstd = { version = "0.13", optional = true, default-features = false }
+zstd = { version = "0.13", default-features = false }
tempfile = { version = "3.3.0", optional = true }
log = "0.4.16"
serde = { version = "1.0.136", features = ["derive"] }
serde_json = "1.0.79"
num_cpus = "1.13.1"
-fs4 = { version = "0.8.0", optional = true }
+fs4 = { version = "0.7.0", optional = true }
levenshtein_automata = "0.2.1"
uuid = { version = "1.0.0", features = ["v4", "serde"] }
crossbeam-channel = "0.5.4"
rust-stemmers = "1.2.0"
downcast-rs = "1.2.0"
bitpacking = { version = "0.9.2", default-features = false, features = ["bitpacker4x"] }
-census = "0.4.2"
+census = "0.4.0"
rustc-hash = "1.1.0"
thiserror = "1.0.30"
htmlescape = "0.3.1"
@@ -52,13 +52,13 @@ itertools = "0.12.0"
measure_time = "0.8.2"
arc-swap = "1.5.0"
-columnar = { version= "0.3", path="./columnar", package ="tantivy-columnar" }
+columnar = { version= "0.2", path="./columnar", package ="tantivy-columnar" }
-sstable = { version= "0.3", path="./sstable", package ="tantivy-sstable", optional = true }
+sstable = { version= "0.2", path="./sstable", package ="tantivy-sstable", optional = true }
-stacker = { version= "0.3", path="./stacker", package ="tantivy-stacker" }
+stacker = { version= "0.2", path="./stacker", package ="tantivy-stacker" }
-query-grammar = { version= "0.22.0", path="./query-grammar", package = "tantivy-query-grammar" }
+query-grammar = { version= "0.21.0", path="./query-grammar", package = "tantivy-query-grammar" }
-tantivy-bitpacker = { version= "0.6", path="./bitpacker" }
+tantivy-bitpacker = { version= "0.5", path="./bitpacker" }
-common = { version= "0.7", path = "./common/", package = "tantivy-common" }
+common = { version= "0.6", path = "./common/", package = "tantivy-common" }
-tokenizer-api = { version= "0.3", path="./tokenizer-api", package="tantivy-tokenizer-api" }
+tokenizer-api = { version= "0.2", path="./tokenizer-api", package="tantivy-tokenizer-api" }
sketches-ddsketch = { version = "0.2.1", features = ["use_serde"] }
futures-util = { version = "0.3.28", optional = true }
fnv = "1.0.7"
@@ -77,10 +77,6 @@ futures = "0.3.21"
paste = "1.0.11"
more-asserts = "0.3.1"
rand_distr = "0.4.3"
-time = { version = "0.3.10", features = ["serde-well-known", "macros"] }
-postcard = { version = "1.0.4", features = [
-    "use-std",
-], default-features = false }
[target.'cfg(not(windows))'.dev-dependencies]
criterion = { version = "0.5", default-features = false }
@@ -109,7 +105,7 @@ mmap = ["fs4", "tempfile", "memmap2"]
stopwords = []
lz4-compression = ["lz4_flex"]
-zstd-compression = ["zstd"]
+zstd-compression = []
failpoints = ["fail", "fail/failpoints"]
unstable = [] # useful for benches.


@@ -5,18 +5,19 @@
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Crates.io](https://img.shields.io/crates/v/tantivy.svg)](https://crates.io/crates/tantivy)
-<img src="https://tantivy-search.github.io/logo/tantivy-logo.png" alt="Tantivy, the fastest full-text search engine library written in Rust" height="250">
+![Tantivy](https://tantivy-search.github.io/logo/tantivy-logo.png)
-## Fast full-text search engine library written in Rust
-**If you are looking for an alternative to Elasticsearch or Apache Solr, check out [Quickwit](https://github.com/quickwit-oss/quickwit), our distributed search engine built on top of Tantivy.**
-Tantivy is closer to [Apache Lucene](https://lucene.apache.org/) than to [Elasticsearch](https://www.elastic.co/products/elasticsearch) or [Apache Solr](https://lucene.apache.org/solr/) in the sense it is not
-an off-the-shelf search engine server, but rather a crate that can be used to build such a search engine.
+**Tantivy** is a **full-text search engine library** written in Rust.
+It is closer to [Apache Lucene](https://lucene.apache.org/) than to [Elasticsearch](https://www.elastic.co/products/elasticsearch) or [Apache Solr](https://lucene.apache.org/solr/) in the sense it is not
+an off-the-shelf search engine server, but rather a crate that can be used
+to build such a search engine.
Tantivy is, in fact, strongly inspired by Lucene's design.
-## Benchmark
+If you are looking for an alternative to Elasticsearch or Apache Solr, check out [Quickwit](https://github.com/quickwit-oss/quickwit), our search engine built on top of Tantivy.
+# Benchmark
The following [benchmark](https://tantivy-search.github.io/bench/) breaks down
performance for different types of queries/collections.
@@ -27,7 +28,7 @@ Your mileage WILL vary depending on the nature of queries and their load.
Details about the benchmark can be found at this [repository](https://github.com/quickwit-oss/search-benchmark-game).
-## Features
+# Features
- Full-text search
- Configurable tokenizer (stemming available for 17 Latin languages) with third party support for Chinese ([tantivy-jieba](https://crates.io/crates/tantivy-jieba) and [cang-jie](https://crates.io/crates/cang-jie)), Japanese ([lindera](https://github.com/lindera-morphology/lindera-tantivy), [Vaporetto](https://crates.io/crates/vaporetto_tantivy), and [tantivy-tokenizer-tiny-segmenter](https://crates.io/crates/tantivy-tokenizer-tiny-segmenter)) and Korean ([lindera](https://github.com/lindera-morphology/lindera-tantivy) + [lindera-ko-dic-builder](https://github.com/lindera-morphology/lindera-ko-dic-builder))
@@ -53,11 +54,11 @@ Details about the benchmark can be found at this [repository](https://github.com
- Searcher Warmer API
- Cheesy logo with a horse
-### Non-features
+## Non-features
Distributed search is out of the scope of Tantivy, but if you are looking for this feature, check out [Quickwit](https://github.com/quickwit-oss/quickwit/).
-## Getting started
+# Getting started
Tantivy works on stable Rust and supports Linux, macOS, and Windows.
@@ -67,7 +68,7 @@ index documents, and search via the CLI or a small server with a REST API.
It walks you through getting a Wikipedia search engine up and running in a few minutes.
- [Reference doc for the last released version](https://docs.rs/tantivy/)
-## How can I support this project?
+# How can I support this project?
There are many ways to support this project.
@@ -78,16 +79,16 @@ There are many ways to support this project.
- Contribute code (you can join [our Discord server](https://discord.gg/MT27AG5EVE))
- Talk about Tantivy around you
-## Contributing code
+# Contributing code
We use the GitHub Pull Request workflow: reference a GitHub ticket and/or include a comprehensive commit message when opening a PR.
Feel free to update CHANGELOG.md with your contribution.
-### Tokenizer
+## Tokenizer
When implementing a tokenizer for tantivy depend on the `tantivy-tokenizer-api` crate.
-### Clone and build locally
+## Clone and build locally
Tantivy compiles on stable Rust.
To check out and run tests, you can simply run:
@@ -98,7 +99,7 @@ cd tantivy
cargo test
```
-## Companies Using Tantivy
+# Companies Using Tantivy
<p align="left">
<img align="center" src="doc/assets/images/etsy.png" alt="Etsy" height="25" width="auto" />&nbsp;
@@ -110,7 +111,7 @@ cargo test
<img align="center" src="doc/assets/images/element-dark-theme.png#gh-dark-mode-only" alt="Element.io" height="25" width="auto" />
</p>
-## FAQ
+# FAQ
### Can I use Tantivy in other languages?


@@ -1,6 +1,6 @@
[package]
name = "tantivy-bitpacker"
-version = "0.6.0"
+version = "0.5.0"
edition = "2021"
authors = ["Paul Masurel <paul.masurel@gmail.com>"]
license = "MIT"


@@ -1,3 +1,4 @@
+use std::convert::TryInto;
use std::io;
use std::ops::{Range, RangeInclusive};


@@ -1,10 +1,6 @@
# configuration file for git-cliff{ pattern = "foo", replace = "bar"}
# see https://github.com/orhun/git-cliff#configuration-file
-[remote.github]
-owner = "quickwit-oss"
-repo = "tantivy"
[changelog]
# changelog header
header = """
@@ -12,43 +8,15 @@ header = """
# template for the changelog body
# https://tera.netlify.app/docs/#introduction
body = """
-## What's Changed
-{%- if version %} in {{ version }}{%- endif -%}
+{% if version %}\
+{{ version | trim_start_matches(pat="v") }} ({{ timestamp | date(format="%Y-%m-%d") }})
+==================
+{% else %}\
+## [unreleased]
+{% endif %}\
{% for commit in commits %}
-{% if commit.github.pr_title -%}
-{%- set commit_message = commit.github.pr_title -%}
-{%- else -%}
-{%- set commit_message = commit.message -%}
-{%- endif -%}
-- {{ commit_message | split(pat="\n") | first | trim }}\
-{% if commit.github.pr_number %} \
-[#{{ commit.github.pr_number }}]({{ self::remote_url() }}/pull/{{ commit.github.pr_number }}){% if commit.github.username %}(@{{ commit.github.username }}){%- endif -%} \
-{%- endif %}
-{%- endfor -%}
-{% if github.contributors | filter(attribute="is_first_time", value=true) | length != 0 %}
-{% raw %}\n{% endraw -%}
-## New Contributors
-{%- endif %}\
-{% for contributor in github.contributors | filter(attribute="is_first_time", value=true) %}
-* @{{ contributor.username }} made their first contribution
-{%- if contributor.pr_number %} in \
-[#{{ contributor.pr_number }}]({{ self::remote_url() }}/pull/{{ contributor.pr_number }}) \
-{%- endif %}
-{%- endfor -%}
-{% if version %}
-{% if previous.version %}
-**Full Changelog**: {{ self::remote_url() }}/compare/{{ previous.version }}...{{ version }}
-{% endif %}
-{% else -%}
-{% raw %}\n{% endraw %}
-{% endif %}
-{%- macro remote_url() -%}
-https://github.com/{{ remote.github.owner }}/{{ remote.github.repo }}
-{%- endmacro -%}
+- {% if commit.breaking %}[**breaking**] {% endif %}{{ commit.message | split(pat="\n") | first | trim | upper_first }}(@{{ commit.author.name }})\
+{% endfor %}
"""
# remove the leading and trailing whitespace from the template
trim = true
@@ -57,24 +25,53 @@ footer = """
"""
postprocessors = [
+{ pattern = 'Paul Masurel', replace = "fulmicoton"}, # replace with github user
+{ pattern = 'PSeitz', replace = "PSeitz"}, # replace with github user
+{ pattern = 'Adam Reichold', replace = "adamreichold"}, # replace with github user
+{ pattern = 'trinity-1686a', replace = "trinity-1686a"}, # replace with github user
+{ pattern = 'Michael Kleen', replace = "mkleen"}, # replace with github user
+{ pattern = 'Adrien Guillo', replace = "guilload"}, # replace with github user
+{ pattern = 'François Massot', replace = "fmassot"}, # replace with github user
+{ pattern = 'Naveen Aiathurai', replace = "naveenann"}, # replace with github user
+{ pattern = '', replace = ""}, # replace with github user
]
[git]
# parse the commits based on https://www.conventionalcommits.org
-# This is required or commit.message contains the whole commit message and not just the title
-conventional_commits = false
+conventional_commits = true
# filter out the commits that are not conventional
-filter_unconventional = true
+filter_unconventional = false
# process each line of a commit as an individual commit
split_commits = false
# regex for preprocessing the commit messages
commit_preprocessors = [
-{ pattern = '\((\w+\s)?#([0-9]+)\)', replace = ""},
+{ pattern = '\((\w+\s)?#([0-9]+)\)', replace = "[#${2}](https://github.com/quickwit-oss/tantivy/issues/${2})"}, # replace issue numbers
]
#link_parsers = [
#{ pattern = "#(\\d+)", href = "https://github.com/quickwit-oss/tantivy/pulls/$1"},
#]
# regex for parsing and grouping commits
+commit_parsers = [
+{ message = "^feat", group = "Features"},
+{ message = "^fix", group = "Bug Fixes"},
+{ message = "^doc", group = "Documentation"},
+{ message = "^perf", group = "Performance"},
+{ message = "^refactor", group = "Refactor"},
+{ message = "^style", group = "Styling"},
+{ message = "^test", group = "Testing"},
+{ message = "^chore\\(release\\): prepare for", skip = true},
+{ message = "(?i)clippy", skip = true},
+{ message = "(?i)dependabot", skip = true},
+{ message = "(?i)fmt", skip = true},
+{ message = "(?i)bump", skip = true},
+{ message = "(?i)readme", skip = true},
+{ message = "(?i)comment", skip = true},
+{ message = "(?i)spelling", skip = true},
+{ message = "^chore", group = "Miscellaneous Tasks"},
+{ body = ".*security", group = "Security"},
+{ message = ".*", group = "Other", default_scope = "other"},
+]
# protect breaking changes from being skipped due to matching a skipping commit_parser
protect_breaking_commits = false
# filter out the commits that are not matched by commit parsers
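To see how the `commit_parsers` rules above bucket messages, here is a small first-match-wins sketch (illustrative Rust using the `regex` crate; git-cliff evaluates the TOML rules itself, and only a subset of the rules is mirrored here):

```rust
// Sketch: the first matching rule wins, mirroring a subset of the
// commit_parsers table above. "SKIP" stands in for `skip = true`.
use regex::Regex;

fn group_for(message: &str) -> &'static str {
    let rules = [
        ("^feat", "Features"),
        ("^fix", "Bug Fixes"),
        ("(?i)clippy", "SKIP"),
        (".*", "Other"),
    ];
    for (pattern, group) in rules {
        if Regex::new(pattern).unwrap().is_match(message) {
            return group;
        }
    }
    unreachable!("the catch-all `.*` rule always matches")
}

fn main() {
    assert_eq!(group_for("feat: add top_hits aggregator"), "Features");
    assert_eq!(group_for("fix: merge panic for JSON fields"), "Bug Fixes");
    assert_eq!(group_for("update benchmarks"), "Other");
}
```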


@@ -1,6 +1,6 @@
[package]
name = "tantivy-columnar"
-version = "0.3.0"
+version = "0.2.0"
edition = "2021"
license = "MIT"
homepage = "https://github.com/quickwit-oss/tantivy"
@@ -12,12 +12,11 @@ categories = ["database-implementations", "data-structures", "compression"]
itertools = "0.12.0"
fastdivide = "0.4.0"
-stacker = { version= "0.3", path = "../stacker", package="tantivy-stacker"}
+stacker = { version= "0.2", path = "../stacker", package="tantivy-stacker"}
-sstable = { version= "0.3", path = "../sstable", package = "tantivy-sstable" }
+sstable = { version= "0.2", path = "../sstable", package = "tantivy-sstable" }
-common = { version= "0.7", path = "../common", package = "tantivy-common" }
+common = { version= "0.6", path = "../common", package = "tantivy-common" }
-tantivy-bitpacker = { version= "0.6", path = "../bitpacker/" }
+tantivy-bitpacker = { version= "0.5", path = "../bitpacker/" }
serde = "1.0.152"
-downcast-rs = "1.2.0"
[dev-dependencies]
proptest = "1"


@@ -1,155 +0,0 @@
#![feature(test)]
extern crate test;
use std::sync::Arc;
use rand::prelude::*;
use tantivy_columnar::column_values::{serialize_and_load_u64_based_column_values, CodecType};
use tantivy_columnar::*;
use test::{black_box, Bencher};
struct Columns {
pub optional: Column,
pub full: Column,
pub multi: Column,
}
fn get_test_columns() -> Columns {
let data = generate_permutation();
let mut dataframe_writer = ColumnarWriter::default();
for (idx, val) in data.iter().enumerate() {
dataframe_writer.record_numerical(idx as u32, "full_values", NumericalValue::U64(*val));
if idx % 2 == 0 {
dataframe_writer.record_numerical(
idx as u32,
"optional_values",
NumericalValue::U64(*val),
);
}
dataframe_writer.record_numerical(idx as u32, "multi_values", NumericalValue::U64(*val));
dataframe_writer.record_numerical(idx as u32, "multi_values", NumericalValue::U64(*val));
}
let mut buffer: Vec<u8> = Vec::new();
dataframe_writer
.serialize(data.len() as u32, None, &mut buffer)
.unwrap();
let columnar = ColumnarReader::open(buffer).unwrap();
let cols: Vec<DynamicColumnHandle> = columnar.read_columns("optional_values").unwrap();
assert_eq!(cols.len(), 1);
let optional = cols[0].open_u64_lenient().unwrap().unwrap();
assert_eq!(optional.index.get_cardinality(), Cardinality::Optional);
let cols: Vec<DynamicColumnHandle> = columnar.read_columns("full_values").unwrap();
assert_eq!(cols.len(), 1);
let column_full = cols[0].open_u64_lenient().unwrap().unwrap();
assert_eq!(column_full.index.get_cardinality(), Cardinality::Full);
let cols: Vec<DynamicColumnHandle> = columnar.read_columns("multi_values").unwrap();
assert_eq!(cols.len(), 1);
let multi = cols[0].open_u64_lenient().unwrap().unwrap();
assert_eq!(multi.index.get_cardinality(), Cardinality::Multivalued);
Columns {
optional,
full: column_full,
multi,
}
}
const NUM_VALUES: u64 = 100_000;
fn generate_permutation() -> Vec<u64> {
let mut permutation: Vec<u64> = (0u64..NUM_VALUES).collect();
permutation.shuffle(&mut StdRng::from_seed([1u8; 32]));
permutation
}
pub fn serialize_and_load(column: &[u64], codec_type: CodecType) -> Arc<dyn ColumnValues<u64>> {
serialize_and_load_u64_based_column_values(&column, &[codec_type])
}
fn run_bench_on_column_full_scan(b: &mut Bencher, column: Column) {
let num_iter = black_box(NUM_VALUES);
b.iter(|| {
let mut sum = 0u64;
for i in 0..num_iter as u32 {
let val = column.first(i);
sum += val.unwrap_or(0);
}
sum
});
}
fn run_bench_on_column_block_fetch(b: &mut Bencher, column: Column) {
let mut block: Vec<Option<u64>> = vec![None; 64];
let fetch_docids = (0..64).collect::<Vec<_>>();
b.iter(move || {
column.first_vals(&fetch_docids, &mut block);
block[0]
});
}
fn run_bench_on_column_block_single_calls(b: &mut Bencher, column: Column) {
let mut block: Vec<Option<u64>> = vec![None; 64];
let fetch_docids = (0..64).collect::<Vec<_>>();
b.iter(move || {
for i in 0..fetch_docids.len() {
block[i] = column.first(fetch_docids[i]);
}
block[0]
});
}
/// Column first method
#[bench]
fn bench_get_first_on_full_column_full_scan(b: &mut Bencher) {
let column = get_test_columns().full;
run_bench_on_column_full_scan(b, column);
}
#[bench]
fn bench_get_first_on_optional_column_full_scan(b: &mut Bencher) {
let column = get_test_columns().optional;
run_bench_on_column_full_scan(b, column);
}
#[bench]
fn bench_get_first_on_multi_column_full_scan(b: &mut Bencher) {
let column = get_test_columns().multi;
run_bench_on_column_full_scan(b, column);
}
/// Block fetch column accessor
#[bench]
fn bench_get_block_first_on_optional_column(b: &mut Bencher) {
let column = get_test_columns().optional;
run_bench_on_column_block_fetch(b, column);
}
#[bench]
fn bench_get_block_first_on_multi_column(b: &mut Bencher) {
let column = get_test_columns().multi;
run_bench_on_column_block_fetch(b, column);
}
#[bench]
fn bench_get_block_first_on_full_column(b: &mut Bencher) {
let column = get_test_columns().full;
run_bench_on_column_block_fetch(b, column);
}
#[bench]
fn bench_get_block_first_on_optional_column_single_calls(b: &mut Bencher) {
let column = get_test_columns().optional;
run_bench_on_column_block_single_calls(b, column);
}
#[bench]
fn bench_get_block_first_on_multi_column_single_calls(b: &mut Bencher) {
let column = get_test_columns().multi;
run_bench_on_column_block_single_calls(b, column);
}
#[bench]
fn bench_get_block_first_on_full_column_single_calls(b: &mut Bencher) {
let column = get_test_columns().full;
run_bench_on_column_block_single_calls(b, column);
}


@@ -16,6 +16,14 @@ fn generate_permutation() -> Vec<u64> {
    permutation
}
fn generate_random() -> Vec<u64> {
let mut permutation: Vec<u64> = (0u64..100_000u64)
.map(|el| el + random::<u16>() as u64)
.collect();
permutation.shuffle(&mut StdRng::from_seed([1u8; 32]));
permutation
}
// Warning: this generates the same permutation at each call
fn generate_permutation_gcd() -> Vec<u64> {
    let mut permutation: Vec<u64> = (1u64..100_000u64).map(|el| el * 1000).collect();


@@ -14,32 +14,20 @@ impl<T: PartialOrd + Copy + std::fmt::Debug + Send + Sync + 'static + Default>
ColumnBlockAccessor<T>
{
    #[inline]
-    pub fn fetch_block<'a>(&'a mut self, docs: &'a [u32], accessor: &Column<T>) {
-        if accessor.index.get_cardinality().is_full() {
-            self.val_cache.resize(docs.len(), T::default());
-            accessor.values.get_vals(docs, &mut self.val_cache);
-        } else {
-            self.docid_cache.clear();
-            self.row_id_cache.clear();
-            accessor.row_ids_for_docs(docs, &mut self.docid_cache, &mut self.row_id_cache);
-            self.val_cache.resize(self.row_id_cache.len(), T::default());
-            accessor
-                .values
-                .get_vals(&self.row_id_cache, &mut self.val_cache);
-        }
+    pub fn fetch_block(&mut self, docs: &[u32], accessor: &Column<T>) {
+        self.docid_cache.clear();
+        self.row_id_cache.clear();
+        accessor.row_ids_for_docs(docs, &mut self.docid_cache, &mut self.row_id_cache);
+        self.val_cache.resize(self.row_id_cache.len(), T::default());
+        accessor
+            .values
+            .get_vals(&self.row_id_cache, &mut self.val_cache);
    }

    #[inline]
    pub fn fetch_block_with_missing(&mut self, docs: &[u32], accessor: &Column<T>, missing: T) {
        self.fetch_block(docs, accessor);
-        // no missing values
-        if accessor.index.get_cardinality().is_full() {
-            return;
-        }
-        // We can compare docid_cache length with docs to find missing docs
-        // For multi value columns we can't rely on the length and always need to scan
-        if accessor.index.get_cardinality().is_multivalue() || docs.len() != self.docid_cache.len()
-        {
+        // We can compare docid_cache with docs to find missing docs
+        if docs.len() != self.docid_cache.len() || accessor.index.is_multivalue() {
            self.missing_docids_cache.clear();
            find_missing_docs(docs, &self.docid_cache, |doc| {
                self.missing_docids_cache.push(doc);
@@ -56,25 +44,11 @@ impl<T: PartialOrd + Copy + std::fmt::Debug + Send + Sync + 'static + Default>
    }

    #[inline]
-    /// Returns an iterator over the docids and values
-    /// The passed in `docs` slice needs to be the same slice that was passed to `fetch_block` or
-    /// `fetch_block_with_missing`.
-    ///
-    /// The docs is used if the column is full (each docs has exactly one value), otherwise the
-    /// internal docid vec is used for the iterator, which e.g. may contain duplicate docs.
-    pub fn iter_docid_vals<'a>(
-        &'a self,
-        docs: &'a [u32],
-        accessor: &Column<T>,
-    ) -> impl Iterator<Item = (DocId, T)> + '_ {
-        if accessor.index.get_cardinality().is_full() {
-            docs.iter().cloned().zip(self.val_cache.iter().cloned())
-        } else {
-            self.docid_cache
-                .iter()
-                .cloned()
-                .zip(self.val_cache.iter().cloned())
-        }
+    pub fn iter_docid_vals(&self) -> impl Iterator<Item = (DocId, T)> + '_ {
+        self.docid_cache
+            .iter()
+            .cloned()
+            .zip(self.val_cache.iter().cloned())
    }
}
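The zip in `iter_docid_vals` relies on `fetch_block` filling the value cache in docid-cache order; a self-contained sketch of that pairing contract (plain slices standing in for the caches, not tantivy's actual types):

```rust
// Sketch: after a block fetch, values sit in the same order as the docids
// they belong to, so pairing docids with values is a plain zip.
fn iter_docid_vals_sketch<'a>(
    docid_cache: &'a [u32],
    val_cache: &'a [u64],
) -> impl Iterator<Item = (u32, u64)> + 'a {
    docid_cache.iter().copied().zip(val_cache.iter().copied())
}

fn main() {
    // A multivalued column can repeat a docid, one entry per value.
    let docids = [0u32, 0, 2];
    let vals = [7u64, 8, 9];
    let pairs: Vec<_> = iter_docid_vals_sketch(&docids, &vals).collect();
    assert_eq!(pairs, vec![(0, 7), (0, 8), (2, 9)]);
}
```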


@@ -3,17 +3,17 @@ mod serialize;
use std::fmt::{self, Debug};
use std::io::Write;
-use std::ops::{Range, RangeInclusive};
+use std::ops::{Deref, Range, RangeInclusive};
use std::sync::Arc;
use common::BinarySerializable;
pub use dictionary_encoded::{BytesColumn, StrColumn};
pub use serialize::{
-    open_column_bytes, open_column_str, open_column_u128, open_column_u128_as_compact_u64,
-    open_column_u64, serialize_column_mappable_to_u128, serialize_column_mappable_to_u64,
+    open_column_bytes, open_column_str, open_column_u128, open_column_u64,
+    serialize_column_mappable_to_u128, serialize_column_mappable_to_u64,
};
-use crate::column_index::{ColumnIndex, Set};
+use crate::column_index::ColumnIndex;
use crate::column_values::monotonic_mapping::StrictlyMonotonicMappingToInternal;
use crate::column_values::{monotonic_map_column, ColumnValues};
use crate::{Cardinality, DocId, EmptyColumnValues, MonotonicallyMappableToU64, RowId};
@@ -83,36 +83,10 @@ impl<T: PartialOrd + Copy + Debug + Send + Sync + 'static> Column<T> {
        self.values.max_value()
    }

-    #[inline]
    pub fn first(&self, row_id: RowId) -> Option<T> {
        self.values_for_doc(row_id).next()
    }

-    /// Load the first value for each docid in the provided slice.
-    #[inline]
-    pub fn first_vals(&self, docids: &[DocId], output: &mut [Option<T>]) {
-        match &self.index {
-            ColumnIndex::Empty { .. } => {}
-            ColumnIndex::Full => self.values.get_vals_opt(docids, output),
-            ColumnIndex::Optional(optional_index) => {
-                for (i, docid) in docids.iter().enumerate() {
-                    output[i] = optional_index
-                        .rank_if_exists(*docid)
-                        .map(|rowid| self.values.get_val(rowid));
-                }
-            }
-            ColumnIndex::Multivalued(multivalued_index) => {
-                for (i, docid) in docids.iter().enumerate() {
-                    let range = multivalued_index.range(*docid);
-                    let is_empty = range.start == range.end;
-                    if !is_empty {
-                        output[i] = Some(self.values.get_val(range.start));
-                    }
-                }
-            }
-        }
-    }

    /// Translates a block of docids to row_ids.
    ///
    /// returns the row_ids and the matching docids on the same index
@@ -131,8 +105,7 @@ impl<T: PartialOrd + Copy + Debug + Send + Sync + 'static> Column<T> {
    }

    pub fn values_for_doc(&self, doc_id: DocId) -> impl Iterator<Item = T> + '_ {
-        self.index
-            .value_row_ids(doc_id)
+        self.value_row_ids(doc_id)
            .map(|value_row_id: RowId| self.values.get_val(value_row_id))
    }
@@ -174,6 +147,14 @@ impl<T: PartialOrd + Copy + Debug + Send + Sync + 'static> Column<T> {
    }
}

+impl<T> Deref for Column<T> {
+    type Target = ColumnIndex;
+
+    fn deref(&self) -> &Self::Target {
+        &self.index
+    }
+}

impl BinarySerializable for Cardinality {
    fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> std::io::Result<()> {
        self.to_code().serialize(writer)
@@ -195,7 +176,6 @@ struct FirstValueWithDefault<T: Copy> {
impl<T: PartialOrd + Debug + Send + Sync + Copy + 'static> ColumnValues<T>
    for FirstValueWithDefault<T>
{
-    #[inline(always)]
    fn get_val(&self, idx: u32) -> T {
        self.column.first(idx).unwrap_or(self.default_value)
    }
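The full-column fast path in `first_vals` rests on one observation: when every doc has exactly one value, the docid is the row id. A standalone sketch of the idea (hypothetical helper, not the tantivy API):

```rust
// Sketch: fetch the first value per docid; full columns skip the
// docid -> rowid translation because the two ids coincide.
fn first_vals_sketch(
    values: &[u64],
    docs: &[u32],
    is_full: bool,
    doc_to_row: impl Fn(u32) -> Option<u32>,
) -> Vec<Option<u64>> {
    docs.iter()
        .map(|&doc| {
            let row = if is_full { Some(doc) } else { doc_to_row(doc) };
            row.map(|r| values[r as usize])
        })
        .collect()
}

fn main() {
    let values = [10u64, 20, 30];
    // Full column: docid == rowid.
    assert_eq!(
        first_vals_sketch(&values, &[0, 2], true, |_| None),
        vec![Some(10), Some(30)]
    );
    // Optional column: docid 1 holds no value.
    let doc_to_row = |doc: u32| if doc == 1 { None } else { Some(doc.min(1)) };
    assert_eq!(
        first_vals_sketch(&values, &[0, 1], false, doc_to_row),
        vec![Some(10), None]
    );
}
```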


@@ -76,26 +76,6 @@ pub fn open_column_u128<T: MonotonicallyMappableToU128>(
    })
}

-/// Open the column as u64.
-///
-/// See [`open_u128_as_compact_u64`] for more details.
-pub fn open_column_u128_as_compact_u64(bytes: OwnedBytes) -> io::Result<Column<u64>> {
-    let (body, column_index_num_bytes_payload) = bytes.rsplit(4);
-    let column_index_num_bytes = u32::from_le_bytes(
-        column_index_num_bytes_payload
-            .as_slice()
-            .try_into()
-            .unwrap(),
-    );
-    let (column_index_data, column_values_data) = body.split(column_index_num_bytes as usize);
-    let column_index = crate::column_index::open_column_index(column_index_data)?;
-    let column_values = crate::column_values::open_u128_as_compact_u64(column_values_data)?;
-    Ok(Column {
-        index: column_index,
-        values: column_values,
-    })
-}

pub fn open_column_bytes(data: OwnedBytes) -> io::Result<BytesColumn> {
    let (body, dictionary_len_bytes) = data.rsplit(4);
    let dictionary_len = u32::from_le_bytes(dictionary_len_bytes.as_slice().try_into().unwrap());


@@ -140,7 +140,7 @@ mod tests {
#[test]
fn test_merge_column_index_optional_shuffle() {
    let optional_index: ColumnIndex = OptionalIndex::for_test(2, &[0]).into();
-    let column_indexes = [optional_index, ColumnIndex::Full];
+    let column_indexes = vec![optional_index, ColumnIndex::Full];
    let row_addrs = vec![
        RowAddr {
            segment_ord: 0u32,


@@ -111,7 +111,10 @@ fn stack_multivalued_indexes<'a>(
    let mut last_row_id = 0;
    let mut current_it = multivalued_indexes.next();
    Box::new(std::iter::from_fn(move || loop {
-        if let Some(row_id) = current_it.as_mut()?.next() {
+        let Some(multivalued_index) = current_it.as_mut() else {
+            return None;
+        };
+        if let Some(row_id) = multivalued_index.next() {
            last_row_id = offset + row_id;
            return Some(last_row_id);
        }


@@ -42,6 +42,10 @@ impl From<MultiValueIndex> for ColumnIndex {
}

impl ColumnIndex {
+    #[inline]
+    pub fn is_multivalue(&self) -> bool {
+        matches!(self, ColumnIndex::Multivalued(_))
+    }

    /// Returns the cardinality of the column index.
    ///
    /// By convention, if the column contains no docs, we consider that it is
@@ -122,18 +126,18 @@
        }
    }

-    pub fn docid_range_to_rowids(&self, doc_id_range: Range<DocId>) -> Range<RowId> {
+    pub fn docid_range_to_rowids(&self, doc_id: Range<DocId>) -> Range<RowId> {
        match self {
            ColumnIndex::Empty { .. } => 0..0,
-            ColumnIndex::Full => doc_id_range,
+            ColumnIndex::Full => doc_id,
            ColumnIndex::Optional(optional_index) => {
-                let row_start = optional_index.rank(doc_id_range.start);
-                let row_end = optional_index.rank(doc_id_range.end);
+                let row_start = optional_index.rank(doc_id.start);
+                let row_end = optional_index.rank(doc_id.end);
                row_start..row_end
            }
            ColumnIndex::Multivalued(multivalued_index) => {
-                let end_docid = doc_id_range.end.min(multivalued_index.num_docs() - 1) + 1;
-                let start_docid = doc_id_range.start.min(end_docid);
+                let end_docid = doc_id.end.min(multivalued_index.num_docs() - 1) + 1;
+                let start_docid = doc_id.start.min(end_docid);
                let row_start = multivalued_index.start_index_column.get_val(start_docid);
                let row_end = multivalued_index.start_index_column.get_val(end_docid);
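In isolation, the `Optional` arm above is just rank arithmetic: with `rank(d)` = number of value-carrying docs before `d`, a doc range maps to the row range `rank(start)..rank(end)`. A toy, self-contained illustration (hypothetical code, not the tantivy types):

```rust
// Toy rank: count value-holding docids below `doc`.
// The real OptionalIndex answers this from per-block metadata.
fn rank(non_null_docs: &[u32], doc: u32) -> u32 {
    non_null_docs.partition_point(|&d| d < doc) as u32
}

fn main() {
    let non_null_docs = [1u32, 4, 5]; // sorted docids that carry a value
    let docs = 2u32..6u32;
    let rows = rank(&non_null_docs, docs.start)..rank(&non_null_docs, docs.end);
    assert_eq!(rows, 1..3); // the rows belonging to docs 4 and 5
}
```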


@@ -21,6 +21,8 @@ const DENSE_BLOCK_THRESHOLD: u32 =
const ELEMENTS_PER_BLOCK: u32 = u16::MAX as u32 + 1;
+const BLOCK_SIZE: RowId = 1 << 16;

#[derive(Copy, Clone, Debug)]
struct BlockMeta {
    non_null_rows_before_block: u32,
@@ -107,8 +109,8 @@ struct RowAddr {
#[inline(always)]
fn row_addr_from_row_id(row_id: RowId) -> RowAddr {
    RowAddr {
-        block_id: (row_id / ELEMENTS_PER_BLOCK) as u16,
-        in_block_row_id: (row_id % ELEMENTS_PER_BLOCK) as u16,
+        block_id: (row_id / BLOCK_SIZE) as u16,
+        in_block_row_id: (row_id % BLOCK_SIZE) as u16,
    }
}
@@ -183,13 +185,8 @@
    }
}

-    /// Any value doc_id is allowed.
-    /// In particular, doc_id = num_rows.
    #[inline]
    fn rank(&self, doc_id: DocId) -> RowId {
-        if doc_id >= self.num_docs() {
-            return self.num_non_nulls();
-        }
        let RowAddr {
            block_id,
            in_block_row_id,
@@ -203,15 +200,13 @@
        block_meta.non_null_rows_before_block + block_offset_row_id
    }

-    /// Any value doc_id is allowed.
-    /// In particular, doc_id = num_rows.
    #[inline]
    fn rank_if_exists(&self, doc_id: DocId) -> Option<RowId> {
        let RowAddr {
            block_id,
            in_block_row_id,
        } = row_addr_from_row_id(doc_id);
-        let block_meta = *self.block_metas.get(block_id as usize)?;
+        let block_meta = self.block_metas[block_id as usize];
        let block = self.block(block_meta);
        let block_offset_row_id = match block {
            Block::Dense(dense_block) => dense_block.rank_if_exists(in_block_row_id),
@@ -496,7 +491,7 @@
            non_null_rows_before_block += num_non_null_rows;
        }
        block_metas.resize(
-            ((num_rows + ELEMENTS_PER_BLOCK - 1) / ELEMENTS_PER_BLOCK) as usize,
+            ((num_rows + BLOCK_SIZE - 1) / BLOCK_SIZE) as usize,
            BlockMeta {
                non_null_rows_before_block,
                start_byte_offset,
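The bounds check added to `rank` on the newer side exists because exclusive range ends pass `doc_id == num_docs`. A minimal model of that contract (hypothetical code, not the tantivy types):

```rust
// Sketch: rank must tolerate doc_id == num_docs (one past the end),
// returning the total number of non-null rows instead of indexing
// block metadata out of bounds.
fn rank(non_nulls_sorted: &[u32], num_docs: u32, doc_id: u32) -> u32 {
    if doc_id >= num_docs {
        return non_nulls_sorted.len() as u32;
    }
    non_nulls_sorted.partition_point(|&d| d < doc_id) as u32
}

fn main() {
    let non_nulls = [0u32, 3];
    assert_eq!(rank(&non_nulls, 4, 4), 2); // exclusive end of range 0..4
    assert_eq!(rank(&non_nulls, 4, 3), 1);
}
```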


@@ -39,8 +39,7 @@ pub trait Set<T> {
    ///
    /// # Panics
    ///
-    /// May panic if rank is greater or equal to the number of
-    /// elements in the Set.
+    /// May panic if rank is greater than the number of elements in the Set.
    fn select(&self, rank: T) -> T;

    /// Creates a brand new select cursor.


@@ -1,3 +1,4 @@
+use std::convert::TryInto;
use std::io::{self, Write};
use common::BinarySerializable;


@@ -1,31 +1,8 @@
-use proptest::prelude::*;
+use proptest::prelude::{any, prop, *};
use proptest::strategy::Strategy;
use proptest::{prop_oneof, proptest};
use super::*;
use crate::{ColumnarReader, ColumnarWriter, DynamicColumnHandle};
#[test]
fn test_optional_index_bug_2293() {
// tests for panic in docid_range_to_rowids for docid == num_docs
test_optional_index_with_num_docs(ELEMENTS_PER_BLOCK - 1);
test_optional_index_with_num_docs(ELEMENTS_PER_BLOCK);
test_optional_index_with_num_docs(ELEMENTS_PER_BLOCK + 1);
}
fn test_optional_index_with_num_docs(num_docs: u32) {
let mut dataframe_writer = ColumnarWriter::default();
dataframe_writer.record_numerical(100, "score", 80i64);
let mut buffer: Vec<u8> = Vec::new();
dataframe_writer
.serialize(num_docs, None, &mut buffer)
.unwrap();
let columnar = ColumnarReader::open(buffer).unwrap();
assert_eq!(columnar.num_columns(), 1);
let cols: Vec<DynamicColumnHandle> = columnar.read_columns("score").unwrap();
assert_eq!(cols.len(), 1);
let col = cols[0].open().unwrap();
col.column_index().docid_range_to_rowids(0..num_docs);
}
#[test]
fn test_dense_block_threshold() {
@@ -58,7 +35,7 @@ proptest! {
#[test]
fn test_with_random_sets_simple() {
-    let vals = 10..ELEMENTS_PER_BLOCK * 2;
+    let vals = 10..BLOCK_SIZE * 2;
    let mut out: Vec<u8> = Vec::new();
    serialize_optional_index(&vals, 100, &mut out).unwrap();
    let null_index = open_optional_index(OwnedBytes::new(out)).unwrap();
@@ -194,7 +171,7 @@ fn test_optional_index_rank() {
    test_optional_index_rank_aux(&[0u32, 1u32]);
    let mut block = Vec::new();
    block.push(3u32);
-    block.extend((0..ELEMENTS_PER_BLOCK).map(|i| i + ELEMENTS_PER_BLOCK + 1));
+    block.extend((0..BLOCK_SIZE).map(|i| i + BLOCK_SIZE + 1));
    test_optional_index_rank_aux(&block);
}
@@ -208,8 +185,8 @@ fn test_optional_index_iter_empty_one() {
fn test_optional_index_iter_dense_block() {
    let mut block = Vec::new();
    block.push(3u32);
-    block.extend((0..ELEMENTS_PER_BLOCK).map(|i| i + ELEMENTS_PER_BLOCK + 1));
-    test_optional_index_iter_aux(&block, 3 * ELEMENTS_PER_BLOCK);
+    block.extend((0..BLOCK_SIZE).map(|i| i + BLOCK_SIZE + 1));
+    test_optional_index_iter_aux(&block, 3 * BLOCK_SIZE);
}

#[test]


@@ -10,7 +10,7 @@ pub(crate) struct MergedColumnValues<'a, T> {
    pub(crate) merge_row_order: &'a MergeRowOrder,
}

-impl<'a, T: Copy + PartialOrd + Debug + 'static> Iterable<T> for MergedColumnValues<'a, T> {
+impl<'a, T: Copy + PartialOrd + Debug> Iterable<T> for MergedColumnValues<'a, T> {
    fn boxed_iter(&self) -> Box<dyn Iterator<Item = T> + '_> {
        match self.merge_row_order {
            MergeRowOrder::Stack(_) => Box::new(


@@ -10,7 +10,6 @@ use std::fmt::Debug;
use std::ops::{Range, RangeInclusive};
use std::sync::Arc;

-use downcast_rs::DowncastSync;
pub use monotonic_mapping::{MonotonicallyMappableToU64, StrictlyMonotonicFn};
pub use monotonic_mapping_u128::MonotonicallyMappableToU128;
@@ -26,10 +25,7 @@ mod monotonic_column;
pub(crate) use merge::MergedColumnValues;
pub use stats::ColumnStats;
-pub use u128_based::{
-    open_u128_as_compact_u64, open_u128_mapped, serialize_column_values_u128,
-    CompactSpaceU64Accessor,
-};
+pub use u128_based::{open_u128_mapped, serialize_column_values_u128};
pub use u64_based::{
    load_u64_based_column_values, serialize_and_load_u64_based_column_values,
    serialize_u64_based_column_values, CodecType, ALL_U64_CODEC_TYPES,
@@ -45,7 +41,7 @@ use crate::RowId;
///
/// Any methods with a default and specialized implementation need to be called in the
/// wrappers that implement the trait: Arc and MonotonicMappingColumn
-pub trait ColumnValues<T: PartialOrd = u64>: Send + Sync + DowncastSync {
+pub trait ColumnValues<T: PartialOrd = u64>: Send + Sync {
    /// Return the value associated with the given idx.
    ///
    /// This accessor should return as fast as possible.
@@ -72,40 +68,11 @@ pub trait ColumnValues<T: PartialOrd = u64>: Send + Sync + DowncastSync {
            out_x4[3] = self.get_val(idx_x4[3]);
        }

-        let out_and_idx_chunks = output
-            .chunks_exact_mut(4)
-            .into_remainder()
-            .iter_mut()
-            .zip(indexes.chunks_exact(4).remainder());
-        for (out, idx) in out_and_idx_chunks {
-            *out = self.get_val(*idx);
-        }
-    }
-
-    /// Allows to push down multiple fetch calls, to avoid dynamic dispatch overhead.
-    /// The slightly weird `Option<T>` in output allows pushdown to full columns.
-    ///
-    /// idx and output should have the same length
-    ///
-    /// # Panics
-    ///
-    /// May panic if `idx` is greater than the column length.
-    fn get_vals_opt(&self, indexes: &[u32], output: &mut [Option<T>]) {
-        assert!(indexes.len() == output.len());
-        let out_and_idx_chunks = output.chunks_exact_mut(4).zip(indexes.chunks_exact(4));
-        for (out_x4, idx_x4) in out_and_idx_chunks {
-            out_x4[0] = Some(self.get_val(idx_x4[0]));
-            out_x4[1] = Some(self.get_val(idx_x4[1]));
-            out_x4[2] = Some(self.get_val(idx_x4[2]));
-            out_x4[3] = Some(self.get_val(idx_x4[3]));
-        }
-        let out_and_idx_chunks = output
-            .chunks_exact_mut(4)
-            .into_remainder()
-            .iter_mut()
-            .zip(indexes.chunks_exact(4).remainder());
-        for (out, idx) in out_and_idx_chunks {
-            *out = Some(self.get_val(*idx));
-        }
+        let step_size = 4;
+        let cutoff = indexes.len() - indexes.len() % step_size;
+        for idx in cutoff..indexes.len() {
+            output[idx] = self.get_val(indexes[idx]);
+        }
    }
@@ -134,7 +101,7 @@ pub trait ColumnValues<T: PartialOrd = u64>: Send + Sync + DowncastSync {
        row_id_hits: &mut Vec<RowId>,
    ) {
        let row_id_range = row_id_range.start..row_id_range.end.min(self.num_vals());
-        for idx in row_id_range {
+        for idx in row_id_range.start..row_id_range.end {
            let val = self.get_val(idx);
            if value_range.contains(&val) {
                row_id_hits.push(idx);
@@ -172,7 +139,6 @@ pub trait ColumnValues<T: PartialOrd = u64>: Send + Sync + DowncastSync {
        Box::new((0..self.num_vals()).map(|idx| self.get_val(idx)))
    }
}
-downcast_rs::impl_downcast!(sync ColumnValues<T> where T: PartialOrd);

/// Empty column of values.
pub struct EmptyColumnValues;
@@ -195,17 +161,12 @@ impl<T: PartialOrd + Default> ColumnValues<T> for EmptyColumnValues {
    }
}

-impl<T: Copy + PartialOrd + Debug + 'static> ColumnValues<T> for Arc<dyn ColumnValues<T>> {
+impl<T: Copy + PartialOrd + Debug> ColumnValues<T> for Arc<dyn ColumnValues<T>> {
    #[inline(always)]
    fn get_val(&self, idx: u32) -> T {
        self.as_ref().get_val(idx)
    }

-    #[inline(always)]
-    fn get_vals_opt(&self, indexes: &[u32], output: &mut [Option<T>]) {
-        self.as_ref().get_vals_opt(indexes, output)
-    }

    #[inline(always)]
    fn min_value(&self) -> T {
        self.as_ref().min_value()
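The 4-way unrolled batch fetch shown in this trait's default implementation can be sketched standalone (assuming a plain getter closure; this is not the trait itself):

```rust
// Sketch: process chunks of 4 so the compiler can pipeline the loads,
// then handle the remainder one element at a time.
fn get_vals_sketch(get_val: impl Fn(u32) -> u64, indexes: &[u32], output: &mut [u64]) {
    assert_eq!(indexes.len(), output.len());
    for (out_x4, idx_x4) in output.chunks_exact_mut(4).zip(indexes.chunks_exact(4)) {
        out_x4[0] = get_val(idx_x4[0]);
        out_x4[1] = get_val(idx_x4[1]);
        out_x4[2] = get_val(idx_x4[2]);
        out_x4[3] = get_val(idx_x4[3]);
    }
    let rest = output
        .chunks_exact_mut(4)
        .into_remainder()
        .iter_mut()
        .zip(indexes.chunks_exact(4).remainder());
    for (out, idx) in rest {
        *out = get_val(*idx);
    }
}

fn main() {
    let values = [10u64, 20, 30, 40, 50, 60];
    let idx = [5u32, 0, 3, 1, 2];
    let mut out = vec![0u64; idx.len()];
    get_vals_sketch(|i| values[i as usize], &idx, &mut out);
    assert_eq!(out, vec![60, 10, 40, 20, 30]);
}
```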


@@ -31,10 +31,10 @@ pub fn monotonic_map_column<C, T, Input, Output>(
    monotonic_mapping: T,
) -> impl ColumnValues<Output>
where
-    C: ColumnValues<Input> + 'static,
-    T: StrictlyMonotonicFn<Input, Output> + Send + Sync + 'static,
-    Input: PartialOrd + Debug + Send + Sync + Clone + 'static,
-    Output: PartialOrd + Debug + Send + Sync + Clone + 'static,
+    C: ColumnValues<Input>,
+    T: StrictlyMonotonicFn<Input, Output> + Send + Sync,
+    Input: PartialOrd + Debug + Send + Sync + Clone,
+    Output: PartialOrd + Debug + Send + Sync + Clone,
{
    MonotonicMappingColumn {
        from_column,
@@ -45,10 +45,10 @@ where
impl<C, T, Input, Output> ColumnValues<Output> for MonotonicMappingColumn<C, T, Input>
where
-    C: ColumnValues<Input> + 'static,
-    T: StrictlyMonotonicFn<Input, Output> + Send + Sync + 'static,
-    Input: PartialOrd + Send + Debug + Sync + Clone + 'static,
-    Output: PartialOrd + Send + Debug + Sync + Clone + 'static,
+    C: ColumnValues<Input>,
+    T: StrictlyMonotonicFn<Input, Output> + Send + Sync,
+    Input: PartialOrd + Send + Debug + Sync + Clone,
+    Output: PartialOrd + Send + Debug + Sync + Clone,
{
    #[inline(always)]
    fn get_val(&self, idx: u32) -> Output {
@@ -107,7 +107,7 @@ mod tests {
    #[test]
    fn test_monotonic_mapping_iter() {
        let vals: Vec<u64> = (0..100u64).map(|el| el * 10).collect();
-        let col = VecColumn::from(vals);
+        let col = VecColumn::from(&vals);
        let mapped = monotonic_map_column(
            col,
            StrictlyMonotonicMappingInverter::from(StrictlyMonotonicMappingToInternal::<i64>::new()),
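The idea behind `monotonic_map_column` can be sketched in a few lines (hypothetical types, with closures standing in for the column and mapping traits): a strictly monotonic function preserves ordering, so the wrapped column keeps valid min/max and range-query semantics.

```rust
// Sketch: view a column of one type through a strictly monotonic function.
struct MonotonicMapColumn<C, F> {
    from_column: C,
    mapping: F,
}

impl<C: Fn(u32) -> u64, F: Fn(u64) -> i64> MonotonicMapColumn<C, F> {
    fn get_val(&self, idx: u32) -> i64 {
        (self.mapping)((self.from_column)(idx))
    }
}

fn main() {
    let vals = vec![0u64, 10, 20];
    let col = MonotonicMapColumn {
        from_column: move |idx: u32| vals[idx as usize],
        mapping: |v: u64| v as i64 - 5, // strictly monotonic: preserves order
    };
    assert_eq!(col.get_val(0), -5);
    assert_eq!(col.get_val(2), 15);
}
```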


@@ -22,7 +22,7 @@ mod build_compact_space;
use build_compact_space::get_compact_space;
use common::{BinarySerializable, CountingWriter, OwnedBytes, VInt, VIntU128};
-use tantivy_bitpacker::{BitPacker, BitUnpacker};
+use tantivy_bitpacker::{self, BitPacker, BitUnpacker};
use crate::column_values::ColumnValues;
use crate::RowId;
@@ -148,7 +148,7 @@ impl CompactSpace {
            .binary_search_by_key(&compact, |range_mapping| range_mapping.compact_start)
            // Correctness: Overflow. The first range starts at compact space 0, the error from
            // binary search can never be 0
-            .unwrap_or_else(|e| e - 1);
+            .map_or_else(|e| e - 1, |v| v);
        let range_mapping = &self.ranges_mapping[pos];
        let diff = compact - range_mapping.compact_start;
@@ -292,63 +292,6 @@
    }
}
/// Exposes the compact space compressed values as u64.
///
/// This allows faster access to the values, as u64 is faster to work with than u128.
/// It also allows to handle u128 values like u64, via the `open_u64_lenient` as a uniform
/// access interface.
///
/// When converting from the internal u64 to u128 `compact_to_u128` can be used.
pub struct CompactSpaceU64Accessor(CompactSpaceDecompressor);
impl CompactSpaceU64Accessor {
pub(crate) fn open(data: OwnedBytes) -> io::Result<CompactSpaceU64Accessor> {
let decompressor = CompactSpaceU64Accessor(CompactSpaceDecompressor::open(data)?);
Ok(decompressor)
}
/// Convert a compact space value to u128
pub fn compact_to_u128(&self, compact: u32) -> u128 {
self.0.compact_to_u128(compact)
}
}
impl ColumnValues<u64> for CompactSpaceU64Accessor {
#[inline]
fn get_val(&self, doc: u32) -> u64 {
let compact = self.0.get_compact(doc);
compact as u64
}
fn min_value(&self) -> u64 {
self.0.u128_to_compact(self.0.min_value()).unwrap() as u64
}
fn max_value(&self) -> u64 {
self.0.u128_to_compact(self.0.max_value()).unwrap() as u64
}
fn num_vals(&self) -> u32 {
self.0.params.num_vals
}
#[inline]
fn iter(&self) -> Box<dyn Iterator<Item = u64> + '_> {
Box::new(self.0.iter_compact().map(|el| el as u64))
}
#[inline]
fn get_row_ids_for_value_range(
&self,
value_range: RangeInclusive<u64>,
position_range: Range<u32>,
positions: &mut Vec<u32>,
) {
let value_range = self.0.compact_to_u128(*value_range.start() as u32)
..=self.0.compact_to_u128(*value_range.end() as u32);
self.0
.get_row_ids_for_value_range(value_range, position_range, positions)
}
}
impl ColumnValues<u128> for CompactSpaceDecompressor { impl ColumnValues<u128> for CompactSpaceDecompressor {
#[inline] #[inline]
fn get_val(&self, doc: u32) -> u128 { fn get_val(&self, doc: u32) -> u128 {
@@ -459,14 +402,9 @@ impl CompactSpaceDecompressor {
.map(|compact| self.compact_to_u128(compact)) .map(|compact| self.compact_to_u128(compact))
} }
#[inline]
pub fn get_compact(&self, idx: u32) -> u32 {
self.params.bit_unpacker.get(idx, &self.data) as u32
}
#[inline] #[inline]
pub fn get(&self, idx: u32) -> u128 { pub fn get(&self, idx: u32) -> u128 {
let compact = self.get_compact(idx); let compact = self.params.bit_unpacker.get(idx, &self.data) as u32;
self.compact_to_u128(compact) self.compact_to_u128(compact)
} }
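
For orientation, a deliberately simplified sketch of the compact-space idea behind `get`/`compact_to_u128` (assumption: single sorted values stand in for the real codec's ranges): sparse `u128` values are addressed through dense codes that fit in `u32`/`u64`.

```rust
fn main() {
    // Sparse u128 values (e.g. IP addresses), sorted.
    let values: Vec<u128> = vec![7, 1_000, 5_000_000_000_000];
    // A dense "compact" code is just a position in that ordering.
    let compact_of = |v: u128| values.binary_search(&v).unwrap() as u32;
    let u128_of = |c: u32| values[c as usize];
    let compact = compact_of(5_000_000_000_000);
    assert_eq!(compact, 2);
    assert_eq!(u128_of(compact), 5_000_000_000_000);
}
```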


@@ -6,9 +6,7 @@ use std::sync::Arc;
mod compact_space; mod compact_space;
use common::{BinarySerializable, OwnedBytes, VInt}; use common::{BinarySerializable, OwnedBytes, VInt};
pub use compact_space::{ use compact_space::{CompactSpaceCompressor, CompactSpaceDecompressor};
CompactSpaceCompressor, CompactSpaceDecompressor, CompactSpaceU64Accessor,
};
use crate::column_values::monotonic_map_column; use crate::column_values::monotonic_map_column;
use crate::column_values::monotonic_mapping::{ use crate::column_values::monotonic_mapping::{
@@ -110,23 +108,6 @@ pub fn open_u128_mapped<T: MonotonicallyMappableToU128 + Debug>(
StrictlyMonotonicMappingToInternal::<T>::new().into(); StrictlyMonotonicMappingToInternal::<T>::new().into();
Ok(Arc::new(monotonic_map_column(reader, inverted))) Ok(Arc::new(monotonic_map_column(reader, inverted)))
} }
/// Returns the u64 representation of the u128 data.
/// The internal representation of the data as u64 is useful for faster processing.
///
/// To convert back to u128, cast to `CompactSpaceU64Accessor` and call
/// `compact_to_u128`.
///
/// # Notice
/// In case there are new codecs added, check for usages of `CompactSpaceDecompressorU64` and
/// also handle the new codecs.
pub fn open_u128_as_compact_u64(mut bytes: OwnedBytes) -> io::Result<Arc<dyn ColumnValues<u64>>> {
let header = U128Header::deserialize(&mut bytes)?;
assert_eq!(header.codec_type, U128FastFieldCodecType::CompactSpace);
let reader = CompactSpaceU64Accessor::open(bytes)?;
Ok(Arc::new(reader))
}
#[cfg(test)] #[cfg(test)]
pub mod tests { pub mod tests {
use super::*; use super::*;


@@ -63,6 +63,7 @@ impl ColumnValues for BitpackedReader {
fn get_val(&self, doc: u32) -> u64 { fn get_val(&self, doc: u32) -> u64 {
self.stats.min_value + self.stats.gcd.get() * self.bit_unpacker.get(doc, &self.data) self.stats.min_value + self.stats.gcd.get() * self.bit_unpacker.get(doc, &self.data)
} }
#[inline] #[inline]
fn min_value(&self) -> u64 { fn min_value(&self) -> u64 {
self.stats.min_value self.stats.min_value
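
The `get_val` line above is the whole bitpacked codec in one formula: values are stored as `(value - min) / gcd` in as few bits as possible and reconstructed as `min + gcd * stored`. A standalone sketch with assumed numbers:

```rust
fn main() {
    let (min_value, gcd) = (1000u64, 50u64);
    let values = [1000u64, 1050, 1250];
    // Encoding: small integers that need only a few bits each.
    let stored: Vec<u64> = values.iter().map(|v| (v - min_value) / gcd).collect();
    assert_eq!(stored, vec![0, 1, 5]);
    // Decoding: the formula from `get_val` above.
    let decoded: Vec<u64> = stored.iter().map(|s| min_value + gcd * s).collect();
    assert_eq!(decoded, values);
}
```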


@@ -63,10 +63,7 @@ impl BlockwiseLinearEstimator {
if self.block.is_empty() { if self.block.is_empty() {
return; return;
} }
let column = VecColumn::from(std::mem::take(&mut self.block)); let line = Line::train(&VecColumn::from(&self.block));
let line = Line::train(&column);
self.block = column.into();
let mut max_value = 0u64; let mut max_value = 0u64;
for (i, buffer_val) in self.block.iter().enumerate() { for (i, buffer_val) in self.block.iter().enumerate() {
let interpolated_val = line.eval(i as u32); let interpolated_val = line.eval(i as u32);
@@ -128,7 +125,7 @@ impl ColumnCodecEstimator for BlockwiseLinearEstimator {
*buffer_val = gcd_divider.divide(*buffer_val - stats.min_value); *buffer_val = gcd_divider.divide(*buffer_val - stats.min_value);
} }
let line = Line::train(&VecColumn::from(buffer.to_vec())); let line = Line::train(&VecColumn::from(&buffer));
assert!(!buffer.is_empty()); assert!(!buffer.is_empty());
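
What the estimator above measures, as a standalone sketch with assumed `f64` math (the real `Line` uses integer arithmetic and wrapping offsets): train a line on the block, then take the maximum deviation, which bounds the bit width needed per stored offset.

```rust
fn main() {
    let block: Vec<u64> = vec![10, 22, 29, 41, 50];
    // Fit a line through the first and last points.
    let slope = (block[block.len() - 1] - block[0]) as f64 / (block.len() - 1) as f64;
    let eval = |i: usize| (block[0] as f64 + slope * i as f64).round() as u64;
    // The worst-case deviation determines how many bits each offset needs.
    let max_err = block
        .iter()
        .enumerate()
        .map(|(i, &y)| y.abs_diff(eval(i)))
        .max()
        .unwrap();
    assert_eq!(max_err, 2);
}
```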


@@ -184,7 +184,7 @@ mod tests {
} }
fn test_eval_max_err(ys: &[u64]) -> Option<u64> { fn test_eval_max_err(ys: &[u64]) -> Option<u64> {
let line = Line::train(&VecColumn::from(ys.to_vec())); let line = Line::train(&VecColumn::from(&ys));
ys.iter() ys.iter()
.enumerate() .enumerate()
.map(|(x, y)| y.wrapping_sub(line.eval(x as u32))) .map(|(x, y)| y.wrapping_sub(line.eval(x as u32)))


@@ -173,9 +173,7 @@ impl LinearCodecEstimator {
fn collect_before_line_estimation(&mut self, value: u64) { fn collect_before_line_estimation(&mut self, value: u64) {
self.block.push(value); self.block.push(value);
if self.block.len() == LINE_ESTIMATION_BLOCK_LEN { if self.block.len() == LINE_ESTIMATION_BLOCK_LEN {
let column = VecColumn::from(std::mem::take(&mut self.block)); let line = Line::train(&VecColumn::from(&self.block));
let line = Line::train(&column);
self.block = column.into();
let block = std::mem::take(&mut self.block); let block = std::mem::take(&mut self.block);
for val in block { for val in block {
self.collect_after_line_estimation(&line, val); self.collect_after_line_estimation(&line, val);


@@ -1,4 +1,5 @@
use proptest::prelude::*; use proptest::prelude::*;
use proptest::strategy::Strategy;
use proptest::{prop_oneof, proptest}; use proptest::{prop_oneof, proptest};
#[test] #[test]


@@ -4,14 +4,14 @@ use tantivy_bitpacker::minmax;
use crate::ColumnValues; use crate::ColumnValues;
/// VecColumn provides `Column` over a `Vec<T>`. /// VecColumn provides `Column` over a slice.
pub struct VecColumn<T = u64> { pub struct VecColumn<'a, T = u64> {
pub(crate) values: Vec<T>, pub(crate) values: &'a [T],
pub(crate) min_value: T, pub(crate) min_value: T,
pub(crate) max_value: T, pub(crate) max_value: T,
} }
impl<T: Copy + PartialOrd + Send + Sync + Debug + 'static> ColumnValues<T> for VecColumn<T> { impl<'a, T: Copy + PartialOrd + Send + Sync + Debug> ColumnValues<T> for VecColumn<'a, T> {
fn get_val(&self, position: u32) -> T { fn get_val(&self, position: u32) -> T {
self.values[position as usize] self.values[position as usize]
} }
@@ -37,8 +37,11 @@ impl<T: Copy + PartialOrd + Send + Sync + Debug + 'static> ColumnValues<T> for V
} }
} }
impl<T: Copy + PartialOrd + Default> From<Vec<T>> for VecColumn<T> { impl<'a, T: Copy + PartialOrd + Default, V> From<&'a V> for VecColumn<'a, T>
fn from(values: Vec<T>) -> Self { where V: AsRef<[T]> + ?Sized
{
fn from(values: &'a V) -> Self {
let values = values.as_ref();
let (min_value, max_value) = minmax(values.iter().copied()).unwrap_or_default(); let (min_value, max_value) = minmax(values.iter().copied()).unwrap_or_default();
Self { Self {
values, values,
@@ -47,8 +50,3 @@ impl<T: Copy + PartialOrd + Default> From<Vec<T>> for VecColumn<T> {
} }
} }
} }
impl From<VecColumn> for Vec<u64> {
fn from(column: VecColumn) -> Self {
column.values
}
}
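
The `From<&'a V> where V: AsRef<[T]> + ?Sized` bound above is what lets call sites pass a `&Vec<T>`, a slice, or an array without copying. A standalone sketch of the pattern (toy `Column` type, not the crate's):

```rust
struct Column<'a> {
    values: &'a [u64],
}

impl<'a, V> From<&'a V> for Column<'a>
where V: AsRef<[u64]> + ?Sized
{
    fn from(values: &'a V) -> Self {
        Column { values: values.as_ref() }
    }
}

fn main() {
    let vec = vec![10u64, 20, 30];
    let from_vec = Column::from(&vec);       // from &Vec<u64>
    let from_slice = Column::from(&vec[..]); // from &[u64]
    assert_eq!(from_vec.values[1], 20);
    assert_eq!(from_slice.values.len(), 3);
    let _still_mine = vec; // the column only borrowed; ownership never moved
}
```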


@@ -58,7 +58,7 @@ impl ColumnType {
self == &ColumnType::DateTime self == &ColumnType::DateTime
} }
pub(crate) fn try_from_code(code: u8) -> Result<ColumnType, InvalidData> { pub fn try_from_code(code: u8) -> Result<ColumnType, InvalidData> {
COLUMN_TYPES.get(code as usize).copied().ok_or(InvalidData) COLUMN_TYPES.get(code as usize).copied().ok_or(InvalidData)
} }
} }
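
Making `try_from_code` public exposes the usual table-driven decoding: the type code doubles as an index into a static array, and unknown codes become `InvalidData`. A standalone sketch with an assumed, shortened table:

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum ColumnType { I64, U64, F64 }

// Assumed table; the real COLUMN_TYPES lists every variant in code order.
const COLUMN_TYPES: [ColumnType; 3] = [ColumnType::I64, ColumnType::U64, ColumnType::F64];

fn try_from_code(code: u8) -> Result<ColumnType, ()> {
    COLUMN_TYPES.get(code as usize).copied().ok_or(())
}

fn main() {
    assert_eq!(try_from_code(1), Ok(ColumnType::U64));
    assert!(try_from_code(42).is_err()); // out-of-range code -> InvalidData
}
```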


@@ -1,3 +1,7 @@
use std::collections::BTreeMap;
use itertools::Itertools;
use super::*; use super::*;
use crate::{Cardinality, ColumnarWriter, HasAssociatedColumnType, RowId}; use crate::{Cardinality, ColumnarWriter, HasAssociatedColumnType, RowId};


@@ -13,7 +13,9 @@ pub(crate) use serializer::ColumnarSerializer;
use stacker::{Addr, ArenaHashMap, MemoryArena}; use stacker::{Addr, ArenaHashMap, MemoryArena};
use crate::column_index::SerializableColumnIndex; use crate::column_index::SerializableColumnIndex;
use crate::column_values::{MonotonicallyMappableToU128, MonotonicallyMappableToU64}; use crate::column_values::{
ColumnValues, MonotonicallyMappableToU128, MonotonicallyMappableToU64, VecColumn,
};
use crate::columnar::column_type::ColumnType; use crate::columnar::column_type::ColumnType;
use crate::columnar::writer::column_writers::{ use crate::columnar::writer::column_writers::{
ColumnWriter, NumericalColumnWriter, StrOrBytesColumnWriter, ColumnWriter, NumericalColumnWriter, StrOrBytesColumnWriter,
@@ -331,7 +333,7 @@ impl ColumnarWriter {
num_docs: RowId, num_docs: RowId,
old_to_new_row_ids: Option<&[RowId]>, old_to_new_row_ids: Option<&[RowId]>,
wrt: &mut dyn io::Write, wrt: &mut dyn io::Write,
) -> io::Result<()> { ) -> io::Result<Vec<(String, ColumnType)>> {
let mut serializer = ColumnarSerializer::new(wrt); let mut serializer = ColumnarSerializer::new(wrt);
let mut columns: Vec<(&[u8], ColumnType, Addr)> = self let mut columns: Vec<(&[u8], ColumnType, Addr)> = self
.numerical_field_hash_map .numerical_field_hash_map
@@ -372,7 +374,9 @@ impl ColumnarWriter {
let (arena, buffers, dictionaries) = (&self.arena, &mut self.buffers, &self.dictionaries); let (arena, buffers, dictionaries) = (&self.arena, &mut self.buffers, &self.dictionaries);
let mut symbol_byte_buffer: Vec<u8> = Vec::new(); let mut symbol_byte_buffer: Vec<u8> = Vec::new();
for (column_name, column_type, addr) in columns { for (column_name, column_type, addr) in columns.iter() {
let column_type = *column_type;
let addr = *addr;
match column_type { match column_type {
ColumnType::Bool => { ColumnType::Bool => {
let column_writer: ColumnWriter = self.bool_field_hash_map.read(addr); let column_writer: ColumnWriter = self.bool_field_hash_map.read(addr);
@@ -483,7 +487,15 @@ impl ColumnarWriter {
}; };
} }
serializer.finalize(num_docs)?; serializer.finalize(num_docs)?;
Ok(()) Ok(columns
.into_iter()
.map(|(column_name, column_type, _)| {
(
String::from_utf8_lossy(column_name).to_string(),
column_type,
)
})
.collect())
} }
} }
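
This signature change is the heart of the commit in this compare: serialization now also reports which columns were written, so the caller can persist the segment's list of fields. A sketch of the final mapping step, with assumed toy data:

```rust
fn main() {
    // (raw column name bytes, type code) pairs as collected during serialization.
    let columns: Vec<(&[u8], u8)> = vec![(b"title".as_slice(), 0), (b"score".as_slice(), 3)];
    let listed: Vec<(String, u8)> = columns
        .into_iter()
        .map(|(name, code)| (String::from_utf8_lossy(name).to_string(), code))
        .collect();
    assert_eq!(listed, vec![("title".to_string(), 0), ("score".to_string(), 3)]);
}
```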
@@ -643,7 +655,10 @@ fn send_to_serialize_column_mappable_to_u128<
value_index_builders: &mut PreallocatedIndexBuilders, value_index_builders: &mut PreallocatedIndexBuilders,
values: &mut Vec<T>, values: &mut Vec<T>,
mut wrt: impl io::Write, mut wrt: impl io::Write,
) -> io::Result<()> { ) -> io::Result<()>
where
for<'a> VecColumn<'a, T>: ColumnValues<T>,
{
values.clear(); values.clear();
// TODO: split index and values // TODO: split index and values
let serializable_column_index = match cardinality { let serializable_column_index = match cardinality {
@@ -696,7 +711,10 @@ fn send_to_serialize_column_mappable_to_u64(
value_index_builders: &mut PreallocatedIndexBuilders, value_index_builders: &mut PreallocatedIndexBuilders,
values: &mut Vec<u64>, values: &mut Vec<u64>,
mut wrt: impl io::Write, mut wrt: impl io::Write,
) -> io::Result<()> { ) -> io::Result<()>
where
for<'a> VecColumn<'a, u64>: ColumnValues<u64>,
{
values.clear(); values.clear();
let serializable_column_index = match cardinality { let serializable_column_index = match cardinality {
Cardinality::Full => { Cardinality::Full => {


@@ -18,12 +18,7 @@ pub struct ColumnarSerializer<W: io::Write> {
/// code. /// code.
fn prepare_key(key: &[u8], column_type: ColumnType, buffer: &mut Vec<u8>) { fn prepare_key(key: &[u8], column_type: ColumnType, buffer: &mut Vec<u8>) {
buffer.clear(); buffer.clear();
// Convert 0 bytes to '0' string, as 0 bytes are reserved for the end of the path. buffer.extend_from_slice(key);
if key.contains(&0u8) {
buffer.extend(key.iter().map(|&b| if b == 0 { b'0' } else { b }));
} else {
buffer.extend_from_slice(key);
}
buffer.push(0u8); buffer.push(0u8);
buffer.push(column_type.to_code()); buffer.push(column_type.to_code());
} }
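
The key layout built by `prepare_key` is `[column name][0u8 terminator][type code]`; the dropped branch escaped `0` bytes inside the name because `0` is reserved as the terminator. A minimal standalone sketch of the layout (the type code value is assumed):

```rust
fn prepare_key(key: &[u8], type_code: u8, buffer: &mut Vec<u8>) {
    buffer.clear();
    buffer.extend_from_slice(key);
    buffer.push(0u8); // the 0 byte marks the end of the column name
    buffer.push(type_code);
}

fn main() {
    let mut buf = Vec::new();
    prepare_key(b"price", 3u8, &mut buf); // 3: hypothetical code for F64
    assert_eq!(&buf[..5], b"price");
    assert_eq!(buf[5], 0u8);
    assert_eq!(buf[6], 3u8);
}
```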
@@ -101,13 +96,14 @@ impl<'a, W: io::Write> io::Write for ColumnSerializer<'a, W> {
#[cfg(test)] #[cfg(test)]
mod tests { mod tests {
use super::*; use super::*;
use crate::columnar::column_type::ColumnType;
#[test] #[test]
fn test_prepare_key_bytes() { fn test_prepare_key_bytes() {
let mut buffer: Vec<u8> = b"somegarbage".to_vec(); let mut buffer: Vec<u8> = b"somegarbage".to_vec();
prepare_key(b"root\0child", ColumnType::Str, &mut buffer); prepare_key(b"root\0child", ColumnType::Str, &mut buffer);
assert_eq!(buffer.len(), 12); assert_eq!(buffer.len(), 12);
assert_eq!(&buffer[..10], b"root0child"); assert_eq!(&buffer[..10], b"root\0child");
assert_eq!(buffer[10], 0u8); assert_eq!(buffer[10], 0u8);
assert_eq!(buffer[11], ColumnType::Str.to_code()); assert_eq!(buffer[11], ColumnType::Str.to_code());
} }


@@ -8,7 +8,7 @@ use common::{ByteCount, DateTime, HasLen, OwnedBytes};
use crate::column::{BytesColumn, Column, StrColumn}; use crate::column::{BytesColumn, Column, StrColumn};
use crate::column_values::{monotonic_map_column, StrictlyMonotonicFn}; use crate::column_values::{monotonic_map_column, StrictlyMonotonicFn};
use crate::columnar::ColumnType; use crate::columnar::ColumnType;
use crate::{Cardinality, ColumnIndex, ColumnValues, NumericalType}; use crate::{Cardinality, ColumnIndex, NumericalType};
#[derive(Clone)] #[derive(Clone)]
pub enum DynamicColumn { pub enum DynamicColumn {
@@ -247,12 +247,7 @@ impl DynamicColumnHandle {
} }
/// Returns the `u64` fast field reader associated with `fields` of types /// Returns the `u64` fast field reader associated with `fields` of types
/// Str, u64, i64, f64, bool, ip, or datetime. /// Str, u64, i64, f64, bool, or datetime.
///
/// Notice that for IpAddr, the fastfield reader will return the u64 representation of the
/// IpAddr.
/// To convert back to u128, cast to `CompactSpaceU64Accessor` and call
/// `compact_to_u128`.
/// ///
/// If not, the fastfield reader will return the u64-value associated with the original /// If not, the fastfield reader will return the u64-value associated with the original
/// FastValue. /// FastValue.
@@ -263,10 +258,7 @@ impl DynamicColumnHandle {
let column: BytesColumn = crate::column::open_column_bytes(column_bytes)?; let column: BytesColumn = crate::column::open_column_bytes(column_bytes)?;
Ok(Some(column.term_ord_column)) Ok(Some(column.term_ord_column))
} }
ColumnType::IpAddr => { ColumnType::IpAddr => Ok(None),
let column = crate::column::open_column_u128_as_compact_u64(column_bytes)?;
Ok(Some(column))
}
ColumnType::Bool ColumnType::Bool
| ColumnType::I64 | ColumnType::I64
| ColumnType::U64 | ColumnType::U64


@@ -113,9 +113,6 @@ impl Cardinality {
pub fn is_multivalue(&self) -> bool { pub fn is_multivalue(&self) -> bool {
matches!(self, Cardinality::Multivalued) matches!(self, Cardinality::Multivalued)
} }
pub fn is_full(&self) -> bool {
matches!(self, Cardinality::Full)
}
pub(crate) fn to_code(self) -> u8 { pub(crate) fn to_code(self) -> u8 {
self as u8 self as u8
} }


@@ -1,6 +1,6 @@
[package] [package]
name = "tantivy-common" name = "tantivy-common"
version = "0.7.0" version = "0.6.0"
authors = ["Paul Masurel <paul@quickwit.io>", "Pascal Seitz <pascal@quickwit.io>"] authors = ["Paul Masurel <paul@quickwit.io>", "Pascal Seitz <pascal@quickwit.io>"]
license = "MIT" license = "MIT"
edition = "2021" edition = "2021"
@@ -14,7 +14,7 @@ repository = "https://github.com/quickwit-oss/tantivy"
[dependencies] [dependencies]
byteorder = "1.4.3" byteorder = "1.4.3"
ownedbytes = { version= "0.7", path="../ownedbytes" } ownedbytes = { version= "0.6", path="../ownedbytes" }
async-trait = "0.1" async-trait = "0.1"
time = { version = "0.3.10", features = ["serde-well-known"] } time = { version = "0.3.10", features = ["serde-well-known"] }
serde = { version = "1.0.136", features = ["derive"] } serde = { version = "1.0.136", features = ["derive"] }


@@ -1,3 +1,4 @@
use std::convert::TryInto;
use std::io::Write; use std::io::Write;
use std::{fmt, io, u64}; use std::{fmt, io, u64};
@@ -5,7 +6,7 @@ use ownedbytes::OwnedBytes;
use crate::ByteCount; use crate::ByteCount;
#[derive(Clone, Copy, Eq, PartialEq)] #[derive(Clone, Copy, Eq, PartialEq, Hash)]
pub struct TinySet(u64); pub struct TinySet(u64);
impl fmt::Debug for TinySet { impl fmt::Debug for TinySet {


@@ -1,3 +1,5 @@
#![allow(deprecated)]
use std::fmt; use std::fmt;
use std::io::{Read, Write}; use std::io::{Read, Write};
@@ -25,6 +27,9 @@ pub enum DateTimePrecision {
Nanoseconds, Nanoseconds,
} }
#[deprecated(since = "0.20.0", note = "Use `DateTimePrecision` instead")]
pub type DatePrecision = DateTimePrecision;
/// A date/time value with nanoseconds precision. /// A date/time value with nanoseconds precision.
/// ///
/// This timestamp does not carry any explicit time zone information. /// This timestamp does not carry any explicit time zone information.
@@ -35,7 +40,7 @@ pub enum DateTimePrecision {
/// All constructors and conversions are provided as explicit /// All constructors and conversions are provided as explicit
/// functions and not by implementing any `From`/`Into` traits /// functions and not by implementing any `From`/`Into` traits
/// to prevent unintended usage. /// to prevent unintended usage.
#[derive(Clone, Default, Copy, PartialEq, Eq, PartialOrd, Ord, Hash, Serialize, Deserialize)] #[derive(Clone, Default, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct DateTime { pub struct DateTime {
// Timestamp in nanoseconds. // Timestamp in nanoseconds.
pub(crate) timestamp_nanos: i64, pub(crate) timestamp_nanos: i64,


@@ -15,6 +15,8 @@ mod vint;
mod writer; mod writer;
pub use bitset::*; pub use bitset::*;
pub use byte_count::ByteCount; pub use byte_count::ByteCount;
#[allow(deprecated)]
pub use datetime::DatePrecision;
pub use datetime::{DateTime, DateTimePrecision}; pub use datetime::{DateTime, DateTimePrecision};
pub use group_by::GroupByIteratorExtended; pub use group_by::GroupByIteratorExtended;
pub use json_path_writer::JsonPathWriter; pub use json_path_writer::JsonPathWriter;


@@ -290,7 +290,8 @@ impl<'a> BinarySerializable for Cow<'a, [u8]> {
#[cfg(test)] #[cfg(test)]
pub mod test { pub mod test {
use super::*; use super::{VInt, *};
use crate::serialize::BinarySerializable;
pub fn fixed_size_test<O: BinarySerializable + FixedSize + Default>() { pub fn fixed_size_test<O: BinarySerializable + FixedSize + Default>() {
let mut buffer = Vec::new(); let mut buffer = Vec::new();
O::default().serialize(&mut buffer).unwrap(); O::default().serialize(&mut buffer).unwrap();


@@ -1,7 +1,7 @@
[package] [package]
authors = ["Paul Masurel <paul@quickwit.io>", "Pascal Seitz <pascal@quickwit.io>"] authors = ["Paul Masurel <paul@quickwit.io>", "Pascal Seitz <pascal@quickwit.io>"]
name = "ownedbytes" name = "ownedbytes"
version = "0.7.0" version = "0.6.0"
edition = "2021" edition = "2021"
description = "Expose data as static slice" description = "Expose data as static slice"
license = "MIT" license = "MIT"


@@ -1,3 +1,4 @@
use std::convert::TryInto;
use std::ops::{Deref, Range}; use std::ops::{Deref, Range};
use std::sync::Arc; use std::sync::Arc;
use std::{fmt, io}; use std::{fmt, io};


@@ -1,6 +1,6 @@
[package] [package]
name = "tantivy-query-grammar" name = "tantivy-query-grammar"
version = "0.22.0" version = "0.21.0"
authors = ["Paul Masurel <paul.masurel@gmail.com>"] authors = ["Paul Masurel <paul.masurel@gmail.com>"]
license = "MIT" license = "MIT"
categories = ["database-implementations", "data-structures"] categories = ["database-implementations", "data-structures"]


@@ -81,8 +81,8 @@ where
T: InputTakeAtPosition + Clone, T: InputTakeAtPosition + Clone,
<T as InputTakeAtPosition>::Item: AsChar + Clone, <T as InputTakeAtPosition>::Item: AsChar + Clone,
{ {
opt_i(nom::character::complete::multispace0)(input) opt_i(nom::character::complete::space0)(input)
.map(|(left, (spaces, errors))| (left, (spaces.expect("multispace0 can't fail"), errors))) .map(|(left, (spaces, errors))| (left, (spaces.expect("space0 can't fail"), errors)))
} }
pub(crate) fn space1_infallible<T>(input: T) -> JResult<T, Option<T>> pub(crate) fn space1_infallible<T>(input: T) -> JResult<T, Option<T>>
@@ -90,7 +90,7 @@ where
T: InputTakeAtPosition + Clone + InputLength, T: InputTakeAtPosition + Clone + InputLength,
<T as InputTakeAtPosition>::Item: AsChar + Clone, <T as InputTakeAtPosition>::Item: AsChar + Clone,
{ {
opt_i(nom::character::complete::multispace1)(input).map(|(left, (spaces, mut errors))| { opt_i(nom::character::complete::space1)(input).map(|(left, (spaces, mut errors))| {
if spaces.is_none() { if spaces.is_none() {
errors.push(LenientErrorInternal { errors.push(LenientErrorInternal {
pos: left.input_len(), pos: left.input_len(),
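
The `multispace*` vs `space*` switch decides whether newlines may separate query clauses: `multispace0` also consumes `'\n'` and `'\r'`, while `space0` stops at spaces and tabs. A standalone sketch of the difference, assuming `nom` 7 as a dependency:

```rust
use nom::character::complete::{multispace0, space0};
use nom::IResult;

fn main() {
    let ms: IResult<&str, &str> = multispace0("\nAND b");
    let sp: IResult<&str, &str> = space0("\nAND b");
    assert_eq!(ms, Ok(("AND b", "\n"))); // newline consumed
    assert_eq!(sp, Ok(("\nAND b", ""))); // newline left for the caller
}
```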


@@ -3,11 +3,11 @@ use std::iter::once;
use nom::branch::alt; use nom::branch::alt;
use nom::bytes::complete::tag; use nom::bytes::complete::tag;
use nom::character::complete::{ use nom::character::complete::{
anychar, char, digit1, multispace0, multispace1, none_of, one_of, satisfy, u32, anychar, char, digit1, none_of, one_of, satisfy, space0, space1, u32,
}; };
use nom::combinator::{eof, map, map_res, opt, peek, recognize, value, verify}; use nom::combinator::{eof, map, map_res, opt, peek, recognize, value, verify};
use nom::error::{Error, ErrorKind}; use nom::error::{Error, ErrorKind};
use nom::multi::{many0, many1, separated_list0}; use nom::multi::{many0, many1, separated_list0, separated_list1};
use nom::sequence::{delimited, preceded, separated_pair, terminated, tuple}; use nom::sequence::{delimited, preceded, separated_pair, terminated, tuple};
use nom::IResult; use nom::IResult;
@@ -65,7 +65,7 @@ fn word_infallible(delimiter: &str) -> impl Fn(&str) -> JResult<&str, Option<&st
|inp| { |inp| {
opt_i_err( opt_i_err(
preceded( preceded(
multispace0, space0,
recognize(many1(satisfy(|c| { recognize(many1(satisfy(|c| {
!c.is_whitespace() && !delimiter.contains(c) !c.is_whitespace() && !delimiter.contains(c)
}))), }))),
@@ -225,10 +225,10 @@ fn term_group(inp: &str) -> IResult<&str, UserInputAst> {
map( map(
tuple(( tuple((
terminated(field_name, multispace0), terminated(field_name, space0),
delimited( delimited(
tuple((char('('), multispace0)), tuple((char('('), space0)),
separated_list0(multispace1, tuple((opt(occur_symbol), term_or_phrase))), separated_list0(space1, tuple((opt(occur_symbol), term_or_phrase))),
char(')'), char(')'),
), ),
)), )),
@@ -250,7 +250,7 @@ fn term_group_precond(inp: &str) -> IResult<&str, (), ()> {
(), (),
peek(tuple(( peek(tuple((
field_name, field_name,
multispace0, space0,
char('('), // when we are here, we know it can't be anything but a term group char('('), // when we are here, we know it can't be anything but a term group
))), ))),
)(inp) )(inp)
@@ -259,7 +259,7 @@ fn term_group_precond(inp: &str) -> IResult<&str, (), ()> {
fn term_group_infallible(inp: &str) -> JResult<&str, UserInputAst> { fn term_group_infallible(inp: &str) -> JResult<&str, UserInputAst> {
let (mut inp, (field_name, _, _, _)) = let (mut inp, (field_name, _, _, _)) =
tuple((field_name, multispace0, char('('), multispace0))(inp).expect("precondition failed"); tuple((field_name, space0, char('('), space0))(inp).expect("precondition failed");
let mut terms = Vec::new(); let mut terms = Vec::new();
let mut errs = Vec::new(); let mut errs = Vec::new();
@@ -305,7 +305,7 @@ fn exists(inp: &str) -> IResult<&str, UserInputLeaf> {
UserInputLeaf::Exists { UserInputLeaf::Exists {
field: String::new(), field: String::new(),
}, },
tuple((multispace0, char('*'))), tuple((space0, char('*'))),
)(inp) )(inp)
} }
@@ -314,7 +314,7 @@ fn exists_precond(inp: &str) -> IResult<&str, (), ()> {
(), (),
peek(tuple(( peek(tuple((
field_name, field_name,
multispace0, space0,
char('*'), // when we are here, we know it can't be anything but a exists char('*'), // when we are here, we know it can't be anything but a exists
))), ))),
)(inp) )(inp)
@@ -323,7 +323,7 @@ fn exists_precond(inp: &str) -> IResult<&str, (), ()> {
fn exists_infallible(inp: &str) -> JResult<&str, UserInputAst> { fn exists_infallible(inp: &str) -> JResult<&str, UserInputAst> {
let (inp, (field_name, _, _)) = let (inp, (field_name, _, _)) =
tuple((field_name, multispace0, char('*')))(inp).expect("precondition failed"); tuple((field_name, space0, char('*')))(inp).expect("precondition failed");
let exists = UserInputLeaf::Exists { field: field_name }.into(); let exists = UserInputLeaf::Exists { field: field_name }.into();
Ok((inp, (exists, Vec::new()))) Ok((inp, (exists, Vec::new())))
@@ -349,7 +349,7 @@ fn literal_no_group_infallible(inp: &str) -> JResult<&str, Option<UserInputAst>>
alt_infallible( alt_infallible(
( (
( (
value((), tuple((tag("IN"), multispace0, char('[')))), value((), tuple((tag("IN"), space0, char('[')))),
map(set_infallible, |(set, errs)| (Some(set), errs)), map(set_infallible, |(set, errs)| (Some(set), errs)),
), ),
( (
@@ -430,8 +430,8 @@ fn range(inp: &str) -> IResult<&str, UserInputLeaf> {
// check for unbounded range in the form of <5, <=10, >5, >=5 // check for unbounded range in the form of <5, <=10, >5, >=5
let elastic_unbounded_range = map( let elastic_unbounded_range = map(
tuple(( tuple((
preceded(multispace0, alt((tag(">="), tag("<="), tag("<"), tag(">")))), preceded(space0, alt((tag(">="), tag("<="), tag("<"), tag(">")))),
preceded(multispace0, range_term_val()), preceded(space0, range_term_val()),
)), )),
|(comparison_sign, bound)| match comparison_sign { |(comparison_sign, bound)| match comparison_sign {
">=" => (UserInputBound::Inclusive(bound), UserInputBound::Unbounded), ">=" => (UserInputBound::Inclusive(bound), UserInputBound::Unbounded),
@@ -444,7 +444,7 @@ fn range(inp: &str) -> IResult<&str, UserInputLeaf> {
); );
let lower_bound = map( let lower_bound = map(
separated_pair(one_of("{["), multispace0, range_term_val()), separated_pair(one_of("{["), space0, range_term_val()),
|(boundary_char, lower_bound)| { |(boundary_char, lower_bound)| {
if lower_bound == "*" { if lower_bound == "*" {
UserInputBound::Unbounded UserInputBound::Unbounded
@@ -457,7 +457,7 @@ fn range(inp: &str) -> IResult<&str, UserInputLeaf> {
); );
let upper_bound = map( let upper_bound = map(
separated_pair(range_term_val(), multispace0, one_of("}]")), separated_pair(range_term_val(), space0, one_of("}]")),
|(upper_bound, boundary_char)| { |(upper_bound, boundary_char)| {
if upper_bound == "*" { if upper_bound == "*" {
UserInputBound::Unbounded UserInputBound::Unbounded
@@ -469,11 +469,8 @@ fn range(inp: &str) -> IResult<&str, UserInputLeaf> {
}, },
); );
let lower_to_upper = separated_pair( let lower_to_upper =
lower_bound, separated_pair(lower_bound, tuple((space1, tag("TO"), space1)), upper_bound);
tuple((multispace1, tag("TO"), multispace1)),
upper_bound,
);
map( map(
alt((elastic_unbounded_range, lower_to_upper)), alt((elastic_unbounded_range, lower_to_upper)),
@@ -493,16 +490,13 @@ fn range_infallible(inp: &str) -> JResult<&str, UserInputLeaf> {
word_infallible("]}"), word_infallible("]}"),
space1_infallible, space1_infallible,
opt_i_err( opt_i_err(
terminated(tag("TO"), alt((value((), multispace1), value((), eof)))), terminated(tag("TO"), alt((value((), space1), value((), eof)))),
"missing keyword TO", "missing keyword TO",
), ),
word_infallible("]}"), word_infallible("]}"),
opt_i_err(one_of("]}"), "missing range delimiter"), opt_i_err(one_of("]}"), "missing range delimiter"),
)), )),
|( |((lower_bound_kind, _space0, lower, _space1, to, upper, upper_bound_kind), errs)| {
(lower_bound_kind, _multispace0, lower, _multispace1, to, upper, upper_bound_kind),
errs,
)| {
let lower_bound = match (lower_bound_kind, lower) { let lower_bound = match (lower_bound_kind, lower) {
(_, Some("*")) => UserInputBound::Unbounded, (_, Some("*")) => UserInputBound::Unbounded,
(_, None) => UserInputBound::Unbounded, (_, None) => UserInputBound::Unbounded,
@@ -602,10 +596,10 @@ fn range_infallible(inp: &str) -> JResult<&str, UserInputLeaf> {
fn set(inp: &str) -> IResult<&str, UserInputLeaf> { fn set(inp: &str) -> IResult<&str, UserInputLeaf> {
map( map(
preceded( preceded(
tuple((multispace0, tag("IN"), multispace1)), tuple((space0, tag("IN"), space1)),
delimited( delimited(
tuple((char('['), multispace0)), tuple((char('['), space0)),
separated_list0(multispace1, map(simple_term, |(_, term)| term)), separated_list0(space1, map(simple_term, |(_, term)| term)),
char(']'), char(']'),
), ),
), ),
@@ -673,7 +667,7 @@ fn leaf(inp: &str) -> IResult<&str, UserInputAst> {
alt(( alt((
delimited(char('('), ast, char(')')), delimited(char('('), ast, char(')')),
map(char('*'), |_| UserInputAst::from(UserInputLeaf::All)), map(char('*'), |_| UserInputAst::from(UserInputLeaf::All)),
map(preceded(tuple((tag("NOT"), multispace1)), leaf), negate), map(preceded(tuple((tag("NOT"), space1)), leaf), negate),
literal, literal,
))(inp) ))(inp)
} }
@@ -786,23 +780,27 @@ fn binary_operand(inp: &str) -> IResult<&str, BinaryOperand> {
} }
fn aggregate_binary_expressions( fn aggregate_binary_expressions(
left: (Option<Occur>, UserInputAst), left: UserInputAst,
others: Vec<(Option<BinaryOperand>, Option<Occur>, UserInputAst)>, others: Vec<(BinaryOperand, UserInputAst)>,
) -> Result<UserInputAst, LenientErrorInternal> { ) -> UserInputAst {
let mut leafs = Vec::with_capacity(others.len() + 1); let mut dnf: Vec<Vec<UserInputAst>> = vec![vec![left]];
leafs.push((None, left.0, Some(left.1))); for (operator, operand_ast) in others {
leafs.extend( match operator {
others BinaryOperand::And => {
.into_iter() if let Some(last) = dnf.last_mut() {
.map(|(operand, occur, ast)| (operand, occur, Some(ast))), last.push(operand_ast);
); }
// the parameters we pass should statically guarantee we can't get errors }
// (no prefix BinaryOperand is provided) BinaryOperand::Or => {
let (res, mut errors) = aggregate_infallible_expressions(leafs); dnf.push(vec![operand_ast]);
if errors.is_empty() { }
Ok(res) }
}
if dnf.len() == 1 {
UserInputAst::and(dnf.into_iter().next().unwrap()) //< safe
} else { } else {
Err(errors.swap_remove(0)) let conjunctions = dnf.into_iter().map(UserInputAst::and).collect();
UserInputAst::or(conjunctions)
} }
} }
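
The right-hand version folds the operand list into a disjunctive normal form: `AND` extends the current conjunction, `OR` opens a new one. A standalone sketch of that fold (toy strings stand in for `UserInputAst` nodes):

```rust
enum Op { And, Or }

fn to_dnf<'a>(left: &'a str, others: Vec<(Op, &'a str)>) -> Vec<Vec<&'a str>> {
    let mut dnf = vec![vec![left]];
    for (op, term) in others {
        match op {
            Op::And => dnf.last_mut().unwrap().push(term), // extend current conjunction
            Op::Or => dnf.push(vec![term]),                // start a new conjunction
        }
    }
    dnf
}

fn main() {
    // "a OR b AND c" groups as (?a ?(+b +c)), matching the tests further down.
    let dnf = to_dnf("a", vec![(Op::Or, "b"), (Op::And, "c")]);
    assert_eq!(dnf, vec![vec!["a"], vec!["b", "c"]]);
}
```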
@@ -818,10 +816,30 @@ fn aggregate_infallible_expressions(
return (UserInputAst::empty_query(), err); return (UserInputAst::empty_query(), err);
} }
let use_operand = leafs.iter().any(|(operand, _, _)| operand.is_some());
let all_operand = leafs
.iter()
.skip(1)
.all(|(operand, _, _)| operand.is_some());
let early_operand = leafs let early_operand = leafs
.iter() .iter()
.take(1) .take(1)
.all(|(operand, _, _)| operand.is_some()); .all(|(operand, _, _)| operand.is_some());
let use_occur = leafs.iter().any(|(_, occur, _)| occur.is_some());
if use_operand && use_occur {
err.push(LenientErrorInternal {
pos: 0,
message: "Use of mixed occur and boolean operator".to_string(),
});
}
if use_operand && !all_operand {
err.push(LenientErrorInternal {
pos: 0,
message: "Missing boolean operator".to_string(),
});
}
if early_operand { if early_operand {
err.push(LenientErrorInternal { err.push(LenientErrorInternal {
@@ -848,15 +866,7 @@ fn aggregate_infallible_expressions(
Some(BinaryOperand::And) => Some(Occur::Must), Some(BinaryOperand::And) => Some(Occur::Must),
_ => Some(Occur::Should), _ => Some(Occur::Should),
}; };
if occur == &Some(Occur::MustNot) && default_op == Some(Occur::Should) { clauses.push(vec![(occur.or(default_op), ast.clone())]);
// if occur is MustNot *and* operation is OR, we synthesize a ShouldNot
clauses.push(vec![(
Some(Occur::Should),
ast.clone().unary(Occur::MustNot),
)])
} else {
clauses.push(vec![(occur.or(default_op), ast.clone())]);
}
} }
None => { None => {
let default_op = match next_operator { let default_op = match next_operator {
@@ -864,15 +874,7 @@ fn aggregate_infallible_expressions(
Some(BinaryOperand::Or) => Some(Occur::Should), Some(BinaryOperand::Or) => Some(Occur::Should),
None => None, None => None,
}; };
if occur == &Some(Occur::MustNot) && default_op == Some(Occur::Should) { clauses.push(vec![(occur.or(default_op), ast.clone())])
// if occur is MustNot *and* operation is OR, we synthesize a ShouldNot
clauses.push(vec![(
Some(Occur::Should),
ast.clone().unary(Occur::MustNot),
)])
} else {
clauses.push(vec![(occur.or(default_op), ast.clone())])
}
} }
} }
} }
@@ -889,12 +891,7 @@ fn aggregate_infallible_expressions(
} }
} }
Some(BinaryOperand::Or) => { Some(BinaryOperand::Or) => {
if last_occur == Some(Occur::MustNot) { clauses.push(vec![(last_occur.or(Some(Occur::Should)), last_ast)]);
// if occur is MustNot *and* operation is OR, we synthesize a ShouldNot
clauses.push(vec![(Some(Occur::Should), last_ast.unary(Occur::MustNot))]);
} else {
clauses.push(vec![(last_occur.or(Some(Occur::Should)), last_ast)]);
}
} }
None => clauses.push(vec![(last_occur, last_ast)]), None => clauses.push(vec![(last_occur, last_ast)]),
} }
@@ -920,29 +917,35 @@ fn aggregate_infallible_expressions(
} }
} }
fn operand_leaf(inp: &str) -> IResult<&str, (Option<BinaryOperand>, Option<Occur>, UserInputAst)> { fn operand_leaf(inp: &str) -> IResult<&str, (BinaryOperand, UserInputAst)> {
map( tuple((
tuple(( terminated(binary_operand, space0),
terminated(opt(binary_operand), multispace0), terminated(boosted_leaf, space0),
terminated(occur_leaf, multispace0), ))(inp)
)),
|(operand, (occur, ast))| (operand, occur, ast),
)(inp)
} }
fn ast(inp: &str) -> IResult<&str, UserInputAst> { fn ast(inp: &str) -> IResult<&str, UserInputAst> {
let boolean_expr = map_res( let boolean_expr = map(
separated_pair(occur_leaf, multispace1, many1(operand_leaf)), separated_pair(boosted_leaf, space1, many1(operand_leaf)),
|(left, right)| aggregate_binary_expressions(left, right), |(left, right)| aggregate_binary_expressions(left, right),
); );
let single_leaf = map(occur_leaf, |(occur, ast)| { let whitespace_separated_leaves = map(separated_list1(space1, occur_leaf), |subqueries| {
if occur == Some(Occur::MustNot) { if subqueries.len() == 1 {
ast.unary(Occur::MustNot) let (occur_opt, ast) = subqueries.into_iter().next().unwrap();
match occur_opt.unwrap_or(Occur::Should) {
Occur::Must | Occur::Should => ast,
Occur::MustNot => UserInputAst::Clause(vec![(Some(Occur::MustNot), ast)]),
}
} else { } else {
ast UserInputAst::Clause(subqueries.into_iter().collect())
} }
}); });
delimited(multispace0, alt((boolean_expr, single_leaf)), multispace0)(inp)
delimited(
space0,
alt((boolean_expr, whitespace_separated_leaves)),
space0,
)(inp)
} }
fn ast_infallible(inp: &str) -> JResult<&str, UserInputAst> { fn ast_infallible(inp: &str) -> JResult<&str, UserInputAst> {
@@ -966,7 +969,7 @@ fn ast_infallible(inp: &str) -> JResult<&str, UserInputAst> {
} }
pub fn parse_to_ast(inp: &str) -> IResult<&str, UserInputAst> { pub fn parse_to_ast(inp: &str) -> IResult<&str, UserInputAst> {
map(delimited(multispace0, opt(ast), eof), |opt_ast| { map(delimited(space0, opt(ast), eof), |opt_ast| {
rewrite_ast(opt_ast.unwrap_or_else(UserInputAst::empty_query)) rewrite_ast(opt_ast.unwrap_or_else(UserInputAst::empty_query))
})(inp) })(inp)
} }
@@ -1142,43 +1145,24 @@ mod test {
#[test] #[test]
fn test_parse_query_to_ast_binary_op() { fn test_parse_query_to_ast_binary_op() {
test_parse_query_to_ast_helper("a AND b", "(+a +b)"); test_parse_query_to_ast_helper("a AND b", "(+a +b)");
test_parse_query_to_ast_helper("a\nAND b", "(+a +b)");
test_parse_query_to_ast_helper("a OR b", "(?a ?b)"); test_parse_query_to_ast_helper("a OR b", "(?a ?b)");
test_parse_query_to_ast_helper("a OR b AND c", "(?a ?(+b +c))"); test_parse_query_to_ast_helper("a OR b AND c", "(?a ?(+b +c))");
test_parse_query_to_ast_helper("a AND b AND c", "(+a +b +c)"); test_parse_query_to_ast_helper("a AND b AND c", "(+a +b +c)");
test_parse_query_to_ast_helper("a OR b aaa", "(?a ?b *aaa)"); test_is_parse_err("a OR b aaa", "(?a ?b *aaa)");
test_parse_query_to_ast_helper("a AND b aaa", "(?(+a +b) *aaa)"); test_is_parse_err("a AND b aaa", "(?(+a +b) *aaa)");
test_parse_query_to_ast_helper("aaa a OR b ", "(*aaa ?a ?b)"); test_is_parse_err("aaa a OR b ", "(*aaa ?a ?b)");
test_parse_query_to_ast_helper("aaa ccc a OR b ", "(*aaa *ccc ?a ?b)"); test_is_parse_err("aaa ccc a OR b ", "(*aaa *ccc ?a ?b)");
test_parse_query_to_ast_helper("aaa a AND b ", "(*aaa ?(+a +b))"); test_is_parse_err("aaa a AND b ", "(*aaa ?(+a +b))");
test_parse_query_to_ast_helper("aaa ccc a AND b ", "(*aaa *ccc ?(+a +b))"); test_is_parse_err("aaa ccc a AND b ", "(*aaa *ccc ?(+a +b))");
} }
#[test] #[test]
fn test_parse_mixed_bool_occur() { fn test_parse_mixed_bool_occur() {
test_parse_query_to_ast_helper("+a OR +b", "(+a +b)"); test_is_parse_err("a OR b +aaa", "(?a ?b +aaa)");
test_is_parse_err("a AND b -aaa", "(?(+a +b) -aaa)");
test_parse_query_to_ast_helper("a AND -b", "(+a -b)"); test_is_parse_err("+a OR +b aaa", "(+a +b *aaa)");
test_parse_query_to_ast_helper("-a AND b", "(-a +b)"); test_is_parse_err("-a AND -b aaa", "(?(-a -b) *aaa)");
test_parse_query_to_ast_helper("a AND NOT b", "(+a +(-b))"); test_is_parse_err("-aaa +ccc -a OR b ", "(-aaa +ccc -a ?b)");
test_parse_query_to_ast_helper("NOT a AND b", "(+(-a) +b)");
test_parse_query_to_ast_helper("a AND NOT b AND c", "(+a +(-b) +c)");
test_parse_query_to_ast_helper("a AND -b AND c", "(+a -b +c)");
test_parse_query_to_ast_helper("a OR -b", "(?a ?(-b))");
test_parse_query_to_ast_helper("-a OR b", "(?(-a) ?b)");
test_parse_query_to_ast_helper("a OR NOT b", "(?a ?(-b))");
test_parse_query_to_ast_helper("NOT a OR b", "(?(-a) ?b)");
test_parse_query_to_ast_helper("a OR NOT b OR c", "(?a ?(-b) ?c)");
test_parse_query_to_ast_helper("a OR -b OR c", "(?a ?(-b) ?c)");
test_parse_query_to_ast_helper("a OR b +aaa", "(?a ?b +aaa)");
test_parse_query_to_ast_helper("a AND b -aaa", "(?(+a +b) -aaa)");
test_parse_query_to_ast_helper("+a OR +b aaa", "(+a +b *aaa)");
test_parse_query_to_ast_helper("-a AND -b aaa", "(?(-a -b) *aaa)");
test_parse_query_to_ast_helper("-aaa +ccc -a OR b ", "(-aaa +ccc ?(-a) ?b)");
} }
#[test] #[test]


@@ -290,41 +290,6 @@ mod bench {
}); });
} }
bench_all_cardinalities!(bench_aggregation_terms_many_with_top_hits_agg);
fn bench_aggregation_terms_many_with_top_hits_agg_card(
b: &mut Bencher,
cardinality: Cardinality,
) {
let index = get_test_index_bench(cardinality).unwrap();
let reader = index.reader().unwrap();
b.iter(|| {
let agg_req: Aggregations = serde_json::from_value(json!({
"my_texts": {
"terms": { "field": "text_many_terms" },
"aggs": {
"top_hits": { "top_hits":
{
"sort": [
{ "score": "desc" }
],
"size": 2,
"doc_value_fields": ["score_f64"]
}
}
}
},
}))
.unwrap();
let collector = get_collector(agg_req);
let searcher = reader.searcher();
searcher.search(&AllQuery, &collector).unwrap()
});
}
bench_all_cardinalities!(bench_aggregation_terms_many_with_sub_agg); bench_all_cardinalities!(bench_aggregation_terms_many_with_sub_agg);
fn bench_aggregation_terms_many_with_sub_agg_card(b: &mut Bencher, cardinality: Cardinality) { fn bench_aggregation_terms_many_with_sub_agg_card(b: &mut Bencher, cardinality: Cardinality) {


@@ -35,7 +35,7 @@ use super::bucket::{
}; };
use super::metric::{ use super::metric::{
AverageAggregation, CountAggregation, MaxAggregation, MinAggregation, AverageAggregation, CountAggregation, MaxAggregation, MinAggregation,
PercentilesAggregationReq, StatsAggregation, SumAggregation, TopHitsAggregation, PercentilesAggregationReq, StatsAggregation, SumAggregation,
}; };
/// The top-level aggregation request structure, which contains [`Aggregation`] and their user /// The top-level aggregation request structure, which contains [`Aggregation`] and their user
@@ -93,12 +93,7 @@ impl Aggregation {
} }
fn get_fast_field_names(&self, fast_field_names: &mut HashSet<String>) { fn get_fast_field_names(&self, fast_field_names: &mut HashSet<String>) {
fast_field_names.extend( fast_field_names.insert(self.agg.get_fast_field_name().to_string());
self.agg
.get_fast_field_names()
.iter()
.map(|s| s.to_string()),
);
fast_field_names.extend(get_fast_field_names(&self.sub_aggregation)); fast_field_names.extend(get_fast_field_names(&self.sub_aggregation));
} }
} }
@@ -152,27 +147,23 @@ pub enum AggregationVariants {
/// Computes the sum of the extracted values. /// Computes the sum of the extracted values.
#[serde(rename = "percentiles")] #[serde(rename = "percentiles")]
Percentiles(PercentilesAggregationReq), Percentiles(PercentilesAggregationReq),
/// Finds the top k values matching some order
#[serde(rename = "top_hits")]
TopHits(TopHitsAggregation),
} }
impl AggregationVariants { impl AggregationVariants {
/// Returns the name of the fields used by the aggregation. /// Returns the name of the field used by the aggregation.
pub fn get_fast_field_names(&self) -> Vec<&str> { pub fn get_fast_field_name(&self) -> &str {
match self { match self {
AggregationVariants::Terms(terms) => vec![terms.field.as_str()], AggregationVariants::Terms(terms) => terms.field.as_str(),
AggregationVariants::Range(range) => vec![range.field.as_str()], AggregationVariants::Range(range) => range.field.as_str(),
AggregationVariants::Histogram(histogram) => vec![histogram.field.as_str()], AggregationVariants::Histogram(histogram) => histogram.field.as_str(),
AggregationVariants::DateHistogram(histogram) => vec![histogram.field.as_str()], AggregationVariants::DateHistogram(histogram) => histogram.field.as_str(),
AggregationVariants::Average(avg) => vec![avg.field_name()], AggregationVariants::Average(avg) => avg.field_name(),
AggregationVariants::Count(count) => vec![count.field_name()], AggregationVariants::Count(count) => count.field_name(),
AggregationVariants::Max(max) => vec![max.field_name()], AggregationVariants::Max(max) => max.field_name(),
AggregationVariants::Min(min) => vec![min.field_name()], AggregationVariants::Min(min) => min.field_name(),
AggregationVariants::Stats(stats) => vec![stats.field_name()], AggregationVariants::Stats(stats) => stats.field_name(),
AggregationVariants::Sum(sum) => vec![sum.field_name()], AggregationVariants::Sum(sum) => sum.field_name(),
AggregationVariants::Percentiles(per) => vec![per.field_name()], AggregationVariants::Percentiles(per) => per.field_name(),
AggregationVariants::TopHits(top_hits) => top_hits.field_names(),
} }
} }


@@ -1,9 +1,6 @@
//! This will enhance the request tree with access to the fastfield and metadata. //! This will enhance the request tree with access to the fastfield and metadata.
use std::collections::HashMap; use columnar::{Column, ColumnBlockAccessor, ColumnType, StrColumn};
use std::io;
use columnar::{Column, ColumnBlockAccessor, ColumnType, DynamicColumn, StrColumn};
use super::agg_limits::ResourceLimitGuard; use super::agg_limits::ResourceLimitGuard;
use super::agg_req::{Aggregation, AggregationVariants, Aggregations}; use super::agg_req::{Aggregation, AggregationVariants, Aggregations};
@@ -17,7 +14,7 @@ use super::metric::{
use super::segment_agg_result::AggregationLimits; use super::segment_agg_result::AggregationLimits;
use super::VecWithNames; use super::VecWithNames;
use crate::aggregation::{f64_to_fastfield_u64, Key}; use crate::aggregation::{f64_to_fastfield_u64, Key};
use crate::{SegmentOrdinal, SegmentReader}; use crate::SegmentReader;
#[derive(Default)] #[derive(Default)]
pub(crate) struct AggregationsWithAccessor { pub(crate) struct AggregationsWithAccessor {
@@ -35,7 +32,6 @@ impl AggregationsWithAccessor {
} }
pub struct AggregationWithAccessor { pub struct AggregationWithAccessor {
pub(crate) segment_ordinal: SegmentOrdinal,
/// In general there can be buckets without fast field access, e.g. buckets that are created /// In general there can be buckets without fast field access, e.g. buckets that are created
/// based on search terms. That is not that case currently, but eventually this needs to be /// based on search terms. That is not that case currently, but eventually this needs to be
/// Option or moved. /// Option or moved.
@@ -48,16 +44,10 @@ pub struct AggregationWithAccessor {
pub(crate) limits: ResourceLimitGuard, pub(crate) limits: ResourceLimitGuard,
pub(crate) column_block_accessor: ColumnBlockAccessor<u64>, pub(crate) column_block_accessor: ColumnBlockAccessor<u64>,
/// Used for missing term aggregation, which checks all columns for existence. /// Used for missing term aggregation, which checks all columns for existence.
/// And also for `top_hits` aggregation, which may sort on multiple fields.
/// By convention, the missing aggregation is chosen when this property is set /// By convention, the missing aggregation is chosen when this property is set
/// (instead of being set in `agg`). /// (instead of being set in `agg`).
/// If this needs to be used by other aggregations, we need to refactor this. /// If this needs to be used by other aggregations, we need to refactor this.
// NOTE: we can make all other aggregations use this instead of the `accessor` and `field_type` pub(crate) accessors: Vec<Column<u64>>,
// (making them obsolete) But will it have a performance impact?
pub(crate) accessors: Vec<(Column<u64>, ColumnType)>,
/// Map field names to all associated column accessors.
/// This field is used for `docvalue_fields`, which is currently only supported for `top_hits`.
pub(crate) value_accessors: HashMap<String, Vec<DynamicColumn>>,
pub(crate) agg: Aggregation, pub(crate) agg: Aggregation,
} }
@@ -67,55 +57,19 @@ impl AggregationWithAccessor {
agg: &Aggregation, agg: &Aggregation,
sub_aggregation: &Aggregations, sub_aggregation: &Aggregations,
reader: &SegmentReader, reader: &SegmentReader,
segment_ordinal: SegmentOrdinal,
limits: AggregationLimits, limits: AggregationLimits,
) -> crate::Result<Vec<AggregationWithAccessor>> { ) -> crate::Result<Vec<AggregationWithAccessor>> {
let mut agg = agg.clone(); let add_agg_with_accessor = |accessor: Column<u64>,
let add_agg_with_accessor = |agg: &Aggregation,
accessor: Column<u64>,
column_type: ColumnType, column_type: ColumnType,
aggs: &mut Vec<AggregationWithAccessor>| aggs: &mut Vec<AggregationWithAccessor>|
-> crate::Result<()> { -> crate::Result<()> {
let res = AggregationWithAccessor { let res = AggregationWithAccessor {
segment_ordinal,
accessor, accessor,
accessors: Default::default(), accessors: Vec::new(),
value_accessors: Default::default(),
field_type: column_type, field_type: column_type,
sub_aggregation: get_aggs_with_segment_accessor_and_validate( sub_aggregation: get_aggs_with_segment_accessor_and_validate(
sub_aggregation, sub_aggregation,
reader, reader,
segment_ordinal,
&limits,
)?,
agg: agg.clone(),
limits: limits.new_guard(),
missing_value_for_accessor: None,
str_dict_column: None,
column_block_accessor: Default::default(),
};
aggs.push(res);
Ok(())
};
let add_agg_with_accessors = |agg: &Aggregation,
accessors: Vec<(Column<u64>, ColumnType)>,
aggs: &mut Vec<AggregationWithAccessor>,
value_accessors: HashMap<String, Vec<DynamicColumn>>|
-> crate::Result<()> {
let (accessor, field_type) = accessors.first().expect("at least one accessor");
let res = AggregationWithAccessor {
segment_ordinal,
// TODO: We should do away with the `accessor` field altogether
accessor: accessor.clone(),
value_accessors,
field_type: *field_type,
accessors,
sub_aggregation: get_aggs_with_segment_accessor_and_validate(
sub_aggregation,
reader,
segment_ordinal,
&limits, &limits,
)?, )?,
agg: agg.clone(), agg: agg.clone(),
@@ -130,36 +84,32 @@ impl AggregationWithAccessor {
let mut res: Vec<AggregationWithAccessor> = Vec::new(); let mut res: Vec<AggregationWithAccessor> = Vec::new();
use AggregationVariants::*; use AggregationVariants::*;
match &agg.agg {
match agg.agg {
Range(RangeAggregation { Range(RangeAggregation {
field: ref field_name, field: field_name, ..
..
}) => { }) => {
let (accessor, column_type) = let (accessor, column_type) =
get_ff_reader(reader, field_name, Some(get_numeric_or_date_column_types()))?; get_ff_reader(reader, field_name, Some(get_numeric_or_date_column_types()))?;
add_agg_with_accessor(&agg, accessor, column_type, &mut res)?; add_agg_with_accessor(accessor, column_type, &mut res)?;
} }
Histogram(HistogramAggregation { Histogram(HistogramAggregation {
field: ref field_name, field: field_name, ..
..
}) => { }) => {
let (accessor, column_type) = let (accessor, column_type) =
get_ff_reader(reader, field_name, Some(get_numeric_or_date_column_types()))?; get_ff_reader(reader, field_name, Some(get_numeric_or_date_column_types()))?;
add_agg_with_accessor(&agg, accessor, column_type, &mut res)?; add_agg_with_accessor(accessor, column_type, &mut res)?;
} }
DateHistogram(DateHistogramAggregationReq { DateHistogram(DateHistogramAggregationReq {
field: ref field_name, field: field_name, ..
..
}) => { }) => {
let (accessor, column_type) = let (accessor, column_type) =
// Only DateTime is supported for DateHistogram // Only DateTime is supported for DateHistogram
get_ff_reader(reader, field_name, Some(&[ColumnType::DateTime]))?; get_ff_reader(reader, field_name, Some(&[ColumnType::DateTime]))?;
add_agg_with_accessor(&agg, accessor, column_type, &mut res)?; add_agg_with_accessor(accessor, column_type, &mut res)?;
} }
Terms(TermsAggregation { Terms(TermsAggregation {
field: ref field_name, field: field_name,
ref missing, missing,
.. ..
}) => { }) => {
let str_dict_column = reader.fast_fields().str(field_name)?; let str_dict_column = reader.fast_fields().str(field_name)?;
@@ -169,9 +119,9 @@ impl AggregationWithAccessor {
ColumnType::F64, ColumnType::F64,
ColumnType::Str, ColumnType::Str,
ColumnType::DateTime, ColumnType::DateTime,
ColumnType::Bool,
ColumnType::IpAddr,
// ColumnType::Bytes Unsupported // ColumnType::Bytes Unsupported
// ColumnType::Bool Unsupported
// ColumnType::IpAddr Unsupported
]; ];
// In case the column is empty we want the shim column to match the missing type // In case the column is empty we want the shim column to match the missing type
@@ -212,11 +162,24 @@ impl AggregationWithAccessor {
let column_and_types = let column_and_types =
get_all_ff_reader_or_empty(reader, field_name, None, fallback_type)?; get_all_ff_reader_or_empty(reader, field_name, None, fallback_type)?;
let accessors = column_and_types let accessors: Vec<Column> =
.iter() column_and_types.iter().map(|(a, _)| a.clone()).collect();
.map(|c_t| (c_t.0.clone(), c_t.1)) let agg_wit_acc = AggregationWithAccessor {
.collect(); missing_value_for_accessor: None,
add_agg_with_accessors(&agg, accessors, &mut res, Default::default())?; accessor: accessors[0].clone(),
accessors,
field_type: ColumnType::U64,
sub_aggregation: get_aggs_with_segment_accessor_and_validate(
sub_aggregation,
reader,
&limits,
)?,
agg: agg.clone(),
str_dict_column: str_dict_column.clone(),
limits: limits.new_guard(),
column_block_accessor: Default::default(),
};
res.push(agg_wit_acc);
} }
for (accessor, column_type) in column_and_types { for (accessor, column_type) in column_and_types {
@@ -226,25 +189,21 @@ impl AggregationWithAccessor {
missing.clone() missing.clone()
}; };
let missing_value_for_accessor = if let Some(missing) = let missing_value_for_accessor =
missing_value_term_agg.as_ref() if let Some(missing) = missing_value_term_agg.as_ref() {
{ get_missing_val(column_type, missing, agg.agg.get_fast_field_name())?
get_missing_val(column_type, missing, agg.agg.get_fast_field_names()[0])? } else {
} else { None
None };
};
let agg = AggregationWithAccessor { let agg = AggregationWithAccessor {
segment_ordinal,
missing_value_for_accessor, missing_value_for_accessor,
accessor, accessor,
accessors: Default::default(), accessors: Vec::new(),
value_accessors: Default::default(),
field_type: column_type, field_type: column_type,
sub_aggregation: get_aggs_with_segment_accessor_and_validate( sub_aggregation: get_aggs_with_segment_accessor_and_validate(
sub_aggregation, sub_aggregation,
reader, reader,
segment_ordinal,
&limits, &limits,
)?, )?,
agg: agg.clone(), agg: agg.clone(),
@@ -256,63 +215,34 @@ impl AggregationWithAccessor {
} }
} }
Average(AverageAggregation { Average(AverageAggregation {
field: ref field_name, field: field_name, ..
..
}) })
| Count(CountAggregation { | Count(CountAggregation {
field: ref field_name, field: field_name, ..
..
}) })
| Max(MaxAggregation { | Max(MaxAggregation {
field: ref field_name, field: field_name, ..
..
}) })
| Min(MinAggregation { | Min(MinAggregation {
field: ref field_name, field: field_name, ..
..
}) })
| Stats(StatsAggregation { | Stats(StatsAggregation {
field: ref field_name, field: field_name, ..
..
}) })
| Sum(SumAggregation { | Sum(SumAggregation {
field: ref field_name, field: field_name, ..
..
}) => { }) => {
let (accessor, column_type) = let (accessor, column_type) =
get_ff_reader(reader, field_name, Some(get_numeric_or_date_column_types()))?; get_ff_reader(reader, field_name, Some(get_numeric_or_date_column_types()))?;
add_agg_with_accessor(&agg, accessor, column_type, &mut res)?; add_agg_with_accessor(accessor, column_type, &mut res)?;
} }
Percentiles(ref percentiles) => { Percentiles(percentiles) => {
let (accessor, column_type) = get_ff_reader( let (accessor, column_type) = get_ff_reader(
reader, reader,
percentiles.field_name(), percentiles.field_name(),
Some(get_numeric_or_date_column_types()), Some(get_numeric_or_date_column_types()),
)?; )?;
add_agg_with_accessor(&agg, accessor, column_type, &mut res)?; add_agg_with_accessor(accessor, column_type, &mut res)?;
}
TopHits(ref mut top_hits) => {
top_hits.validate_and_resolve_field_names(reader.fast_fields().columnar())?;
let accessors: Vec<(Column<u64>, ColumnType)> = top_hits
.field_names()
.iter()
.map(|field| {
get_ff_reader(reader, field, Some(get_numeric_or_date_column_types()))
})
.collect::<crate::Result<_>>()?;
let value_accessors = top_hits
.value_field_names()
.iter()
.map(|field_name| {
Ok((
field_name.to_string(),
get_dynamic_columns(reader, field_name)?,
))
})
.collect::<crate::Result<_>>()?;
add_agg_with_accessors(&agg, accessors, &mut res, value_accessors)?;
} }
}; };
@@ -354,7 +284,6 @@ fn get_numeric_or_date_column_types() -> &'static [ColumnType] {
pub(crate) fn get_aggs_with_segment_accessor_and_validate( pub(crate) fn get_aggs_with_segment_accessor_and_validate(
aggs: &Aggregations, aggs: &Aggregations,
reader: &SegmentReader, reader: &SegmentReader,
segment_ordinal: SegmentOrdinal,
limits: &AggregationLimits, limits: &AggregationLimits,
) -> crate::Result<AggregationsWithAccessor> { ) -> crate::Result<AggregationsWithAccessor> {
let mut aggss = Vec::new(); let mut aggss = Vec::new();
@@ -363,7 +292,6 @@ pub(crate) fn get_aggs_with_segment_accessor_and_validate(
agg, agg,
agg.sub_aggregation(), agg.sub_aggregation(),
reader, reader,
segment_ordinal,
limits.clone(), limits.clone(),
)?; )?;
for agg in aggs { for agg in aggs {
@@ -393,19 +321,6 @@ fn get_ff_reader(
Ok(ff_field_with_type) Ok(ff_field_with_type)
} }
fn get_dynamic_columns(
reader: &SegmentReader,
field_name: &str,
) -> crate::Result<Vec<columnar::DynamicColumn>> {
let ff_fields = reader.fast_fields().dynamic_column_handles(field_name)?;
let cols = ff_fields
.iter()
.map(|h| h.open())
.collect::<io::Result<_>>()?;
assert!(!ff_fields.is_empty(), "field {} not found", field_name);
Ok(cols)
}
/// Get all fast field reader or empty as default. /// Get all fast field reader or empty as default.
/// ///
/// Is guaranteed to return at least one column. /// Is guaranteed to return at least one column.


@@ -8,7 +8,7 @@ use rustc_hash::FxHashMap;
use serde::{Deserialize, Serialize}; use serde::{Deserialize, Serialize};
use super::bucket::GetDocCount; use super::bucket::GetDocCount;
use super::metric::{PercentilesMetricResult, SingleMetricResult, Stats, TopHitsMetricResult}; use super::metric::{PercentilesMetricResult, SingleMetricResult, Stats};
use super::{AggregationError, Key}; use super::{AggregationError, Key};
use crate::TantivyError; use crate::TantivyError;
@@ -90,10 +90,8 @@ pub enum MetricResult {
Stats(Stats), Stats(Stats),
/// Sum metric result. /// Sum metric result.
Sum(SingleMetricResult), Sum(SingleMetricResult),
/// Percentiles metric result. /// Sum metric result.
Percentiles(PercentilesMetricResult), Percentiles(PercentilesMetricResult),
/// Top hits metric result
TopHits(TopHitsMetricResult),
} }
impl MetricResult { impl MetricResult {
@@ -108,9 +106,6 @@ impl MetricResult {
MetricResult::Percentiles(_) => Err(TantivyError::AggregationError( MetricResult::Percentiles(_) => Err(TantivyError::AggregationError(
AggregationError::InvalidRequest("percentiles can't be used to order".to_string()), AggregationError::InvalidRequest("percentiles can't be used to order".to_string()),
)), )),
MetricResult::TopHits(_) => Err(TantivyError::AggregationError(
AggregationError::InvalidRequest("top_hits can't be used to order".to_string()),
)),
} }
} }
} }
@@ -4,7 +4,6 @@ use crate::aggregation::agg_req::{Aggregation, Aggregations};
use crate::aggregation::agg_result::AggregationResults; use crate::aggregation::agg_result::AggregationResults;
use crate::aggregation::buf_collector::DOC_BLOCK_SIZE; use crate::aggregation::buf_collector::DOC_BLOCK_SIZE;
use crate::aggregation::collector::AggregationCollector; use crate::aggregation::collector::AggregationCollector;
use crate::aggregation::intermediate_agg_result::IntermediateAggregationResults;
use crate::aggregation::segment_agg_result::AggregationLimits; use crate::aggregation::segment_agg_result::AggregationLimits;
use crate::aggregation::tests::{get_test_index_2_segments, get_test_index_from_values_and_terms}; use crate::aggregation::tests::{get_test_index_2_segments, get_test_index_from_values_and_terms};
use crate::aggregation::DistributedAggregationCollector; use crate::aggregation::DistributedAggregationCollector;
@@ -67,22 +66,6 @@ fn test_aggregation_flushing(
} }
} }
}, },
"top_hits_test":{
"terms": {
"field": "string_id"
},
"aggs": {
"bucketsL2": {
"top_hits": {
"size": 2,
"sort": [
{ "score": "asc" }
],
"docvalue_fields": ["score"]
}
}
}
},
"histogram_test":{ "histogram_test":{
"histogram": { "histogram": {
"field": "score", "field": "score",
@@ -125,16 +108,6 @@ fn test_aggregation_flushing(
let searcher = reader.searcher(); let searcher = reader.searcher();
let intermediate_agg_result = searcher.search(&AllQuery, &collector).unwrap(); let intermediate_agg_result = searcher.search(&AllQuery, &collector).unwrap();
// Test postcard roundtrip serialization
let intermediate_agg_result_bytes = postcard::to_allocvec(&intermediate_agg_result).expect(
"Postcard Serialization failed, flatten etc. is not supported in the intermediate \
result",
);
let intermediate_agg_result: IntermediateAggregationResults =
postcard::from_bytes(&intermediate_agg_result_bytes)
.expect("Post deserialization failed");
intermediate_agg_result intermediate_agg_result
.into_final_result(agg_req, &Default::default()) .into_final_result(agg_req, &Default::default())
.unwrap() .unwrap()
@@ -614,9 +587,6 @@ fn test_aggregation_on_json_object() {
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer: IndexWriter = index.writer_for_tests().unwrap(); let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
index_writer
.add_document(doc!(json => json!({"color": "red"})))
.unwrap();
index_writer index_writer
.add_document(doc!(json => json!({"color": "red"}))) .add_document(doc!(json => json!({"color": "red"})))
.unwrap(); .unwrap();
@@ -644,8 +614,8 @@ fn test_aggregation_on_json_object() {
&serde_json::json!({ &serde_json::json!({
"jsonagg": { "jsonagg": {
"buckets": [ "buckets": [
{"doc_count": 2, "key": "red"},
{"doc_count": 1, "key": "blue"}, {"doc_count": 1, "key": "blue"},
{"doc_count": 1, "key": "red"}
], ],
"doc_count_error_upper_bound": 0, "doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0 "sum_other_doc_count": 0
@@ -667,9 +637,6 @@ fn test_aggregation_on_nested_json_object() {
index_writer index_writer
.add_document(doc!(json => json!({"color.dot": "blue", "color": {"nested":"blue"} }))) .add_document(doc!(json => json!({"color.dot": "blue", "color": {"nested":"blue"} })))
.unwrap(); .unwrap();
index_writer
.add_document(doc!(json => json!({"color.dot": "blue", "color": {"nested":"blue"} })))
.unwrap();
index_writer.commit().unwrap(); index_writer.commit().unwrap();
let reader = index.reader().unwrap(); let reader = index.reader().unwrap();
let searcher = reader.searcher(); let searcher = reader.searcher();
@@ -697,7 +664,7 @@ fn test_aggregation_on_nested_json_object() {
&serde_json::json!({ &serde_json::json!({
"jsonagg1": { "jsonagg1": {
"buckets": [ "buckets": [
{"doc_count": 2, "key": "blue"}, {"doc_count": 1, "key": "blue"},
{"doc_count": 1, "key": "red"} {"doc_count": 1, "key": "red"}
], ],
"doc_count_error_upper_bound": 0, "doc_count_error_upper_bound": 0,
@@ -705,7 +672,7 @@ fn test_aggregation_on_nested_json_object() {
}, },
"jsonagg2": { "jsonagg2": {
"buckets": [ "buckets": [
{"doc_count": 2, "key": "blue"}, {"doc_count": 1, "key": "blue"},
{"doc_count": 1, "key": "red"} {"doc_count": 1, "key": "red"}
], ],
"doc_count_error_upper_bound": 0, "doc_count_error_upper_bound": 0,
@@ -843,38 +810,29 @@ fn test_aggregation_on_json_object_mixed_types() {
let mut index_writer: IndexWriter = index.writer_for_tests().unwrap(); let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
// => Segment with all values numeric // => Segment with all values numeric
index_writer index_writer
.add_document(doc!(json => json!({"mixed_type": 10.0, "mixed_price": 10.0}))) .add_document(doc!(json => json!({"mixed_type": 10.0})))
.unwrap(); .unwrap();
index_writer.commit().unwrap(); index_writer.commit().unwrap();
// => Segment with all values text // => Segment with all values text
index_writer index_writer
.add_document(doc!(json => json!({"mixed_type": "blue", "mixed_price": 5.0}))) .add_document(doc!(json => json!({"mixed_type": "blue"})))
.unwrap();
index_writer
.add_document(doc!(json => json!({"mixed_type": "blue", "mixed_price": 5.0})))
.unwrap();
index_writer
.add_document(doc!(json => json!({"mixed_type": "blue", "mixed_price": 5.0})))
.unwrap(); .unwrap();
index_writer.commit().unwrap(); index_writer.commit().unwrap();
// => Segment with all boolean // => Segment with all boolean
index_writer index_writer
.add_document(doc!(json => json!({"mixed_type": true, "mixed_price": "no_price"}))) .add_document(doc!(json => json!({"mixed_type": true})))
.unwrap(); .unwrap();
index_writer.commit().unwrap(); index_writer.commit().unwrap();
// => Segment with mixed values // => Segment with mixed values
index_writer index_writer
.add_document(doc!(json => json!({"mixed_type": "red", "mixed_price": 1.0}))) .add_document(doc!(json => json!({"mixed_type": "red"})))
.unwrap(); .unwrap();
index_writer index_writer
.add_document(doc!(json => json!({"mixed_type": "red", "mixed_price": 1.0}))) .add_document(doc!(json => json!({"mixed_type": -20.5})))
.unwrap(); .unwrap();
index_writer index_writer
.add_document(doc!(json => json!({"mixed_type": -20.5, "mixed_price": -20.5}))) .add_document(doc!(json => json!({"mixed_type": true})))
.unwrap();
index_writer
.add_document(doc!(json => json!({"mixed_type": true, "mixed_price": "no_price"})))
.unwrap(); .unwrap();
index_writer.commit().unwrap(); index_writer.commit().unwrap();
@@ -888,7 +846,7 @@ fn test_aggregation_on_json_object_mixed_types() {
"order": { "min_price": "desc" } "order": { "min_price": "desc" }
}, },
"aggs": { "aggs": {
"min_price": { "min": { "field": "json.mixed_price" } } "min_price": { "min": { "field": "json.mixed_type" } }
} }
}, },
"rangeagg": { "rangeagg": {
@@ -912,7 +870,6 @@ fn test_aggregation_on_json_object_mixed_types() {
let aggregation_results = searcher.search(&AllQuery, &aggregation_collector).unwrap(); let aggregation_results = searcher.search(&AllQuery, &aggregation_collector).unwrap();
let aggregation_res_json = serde_json::to_value(aggregation_results).unwrap(); let aggregation_res_json = serde_json::to_value(aggregation_results).unwrap();
use pretty_assertions::assert_eq;
assert_eq!( assert_eq!(
&aggregation_res_json, &aggregation_res_json,
&serde_json::json!({ &serde_json::json!({
@@ -927,10 +884,10 @@ fn test_aggregation_on_json_object_mixed_types() {
"termagg": { "termagg": {
"buckets": [ "buckets": [
{ "doc_count": 1, "key": 10.0, "min_price": { "value": 10.0 } }, { "doc_count": 1, "key": 10.0, "min_price": { "value": 10.0 } },
{ "doc_count": 3, "key": "blue", "min_price": { "value": 5.0 } },
{ "doc_count": 2, "key": "red", "min_price": { "value": 1.0 } },
{ "doc_count": 1, "key": -20.5, "min_price": { "value": -20.5 } }, { "doc_count": 1, "key": -20.5, "min_price": { "value": -20.5 } },
{ "doc_count": 2, "key": 1.0, "key_as_string": "true", "min_price": { "value": null } }, // TODO bool is also not yet handled in aggregation
{ "doc_count": 1, "key": "blue", "min_price": { "value": null } },
{ "doc_count": 1, "key": "red", "min_price": { "value": null } },
], ],
"sum_other_doc_count": 0 "sum_other_doc_count": 0
} }
@@ -1,7 +1,7 @@
use serde::{Deserialize, Serialize}; use serde::{Deserialize, Serialize};
use super::{HistogramAggregation, HistogramBounds}; use super::{HistogramAggregation, HistogramBounds};
use crate::aggregation::*; use crate::aggregation::AggregationError;
/// DateHistogramAggregation is similar to `HistogramAggregation`, but it can only be used with date /// DateHistogramAggregation is similar to `HistogramAggregation`, but it can only be used with date
/// type. /// type.
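
For orientation, a date histogram request is plain JSON deserialized into the aggregation DSL. A minimal sketch, not part of this diff, assuming a FAST date field named `date` and a hypothetical aggregation name `sales_over_time` (matching the tests below):

```rust
use tantivy::aggregation::agg_req::Aggregations;

fn main() -> serde_json::Result<()> {
    // "1d" buckets documents into fixed 1-day windows; only fixed intervals are accepted.
    let agg_req: Aggregations = serde_json::from_value(serde_json::json!({
        "sales_over_time": {
            "date_histogram": { "field": "date", "fixed_interval": "1d" }
        }
    }))?;
    println!("{agg_req:?}");
    Ok(())
}
```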
@@ -307,7 +307,6 @@ pub mod tests {
) -> crate::Result<Index> { ) -> crate::Result<Index> {
let mut schema_builder = Schema::builder(); let mut schema_builder = Schema::builder();
schema_builder.add_date_field("date", FAST); schema_builder.add_date_field("date", FAST);
schema_builder.add_json_field("mixed", FAST);
schema_builder.add_text_field("text", FAST | STRING); schema_builder.add_text_field("text", FAST | STRING);
schema_builder.add_text_field("text2", FAST | STRING); schema_builder.add_text_field("text2", FAST | STRING);
let schema = schema_builder.build(); let schema = schema_builder.build();
@@ -352,10 +351,8 @@ pub mod tests {
let docs = vec![ let docs = vec![
vec![r#"{ "date": "2015-01-01T12:10:30Z", "text": "aaa" }"#], vec![r#"{ "date": "2015-01-01T12:10:30Z", "text": "aaa" }"#],
vec![r#"{ "date": "2015-01-01T11:11:30Z", "text": "bbb" }"#], vec![r#"{ "date": "2015-01-01T11:11:30Z", "text": "bbb" }"#],
vec![r#"{ "date": "2015-01-01T11:11:30Z", "text": "bbb" }"#],
vec![r#"{ "date": "2015-01-02T00:00:00Z", "text": "bbb" }"#], vec![r#"{ "date": "2015-01-02T00:00:00Z", "text": "bbb" }"#],
vec![r#"{ "date": "2015-01-06T00:00:00Z", "text": "ccc" }"#], vec![r#"{ "date": "2015-01-06T00:00:00Z", "text": "ccc" }"#],
vec![r#"{ "date": "2015-01-06T00:00:00Z", "text": "ccc" }"#],
]; ];
let index = get_test_index_from_docs(merge_segments, &docs).unwrap(); let index = get_test_index_from_docs(merge_segments, &docs).unwrap();
@@ -384,7 +381,7 @@ pub mod tests {
{ {
"key_as_string" : "2015-01-01T00:00:00Z", "key_as_string" : "2015-01-01T00:00:00Z",
"key" : 1420070400000.0, "key" : 1420070400000.0,
"doc_count" : 6 "doc_count" : 4
} }
] ]
} }
@@ -422,15 +419,15 @@ pub mod tests {
{ {
"key_as_string" : "2015-01-01T00:00:00Z", "key_as_string" : "2015-01-01T00:00:00Z",
"key" : 1420070400000.0, "key" : 1420070400000.0,
"doc_count" : 6, "doc_count" : 4,
"texts": { "texts": {
"buckets": [ "buckets": [
{ {
"doc_count": 3, "doc_count": 2,
"key": "bbb" "key": "bbb"
}, },
{ {
"doc_count": 2, "doc_count": 1,
"key": "ccc" "key": "ccc"
}, },
{ {
@@ -469,7 +466,7 @@ pub mod tests {
"sales_over_time": { "sales_over_time": {
"buckets": [ "buckets": [
{ {
"doc_count": 3, "doc_count": 2,
"key": 1420070400000.0, "key": 1420070400000.0,
"key_as_string": "2015-01-01T00:00:00Z" "key_as_string": "2015-01-01T00:00:00Z"
}, },
@@ -494,7 +491,7 @@ pub mod tests {
"key_as_string": "2015-01-05T00:00:00Z" "key_as_string": "2015-01-05T00:00:00Z"
}, },
{ {
"doc_count": 2, "doc_count": 1,
"key": 1420502400000.0, "key": 1420502400000.0,
"key_as_string": "2015-01-06T00:00:00Z" "key_as_string": "2015-01-06T00:00:00Z"
} }
@@ -535,7 +532,7 @@ pub mod tests {
"key_as_string": "2014-12-31T00:00:00Z" "key_as_string": "2014-12-31T00:00:00Z"
}, },
{ {
"doc_count": 3, "doc_count": 2,
"key": 1420070400000.0, "key": 1420070400000.0,
"key_as_string": "2015-01-01T00:00:00Z" "key_as_string": "2015-01-01T00:00:00Z"
}, },
@@ -560,7 +557,7 @@ pub mod tests {
"key_as_string": "2015-01-05T00:00:00Z" "key_as_string": "2015-01-05T00:00:00Z"
}, },
{ {
"doc_count": 2, "doc_count": 1,
"key": 1420502400000.0, "key": 1420502400000.0,
"key_as_string": "2015-01-06T00:00:00Z" "key_as_string": "2015-01-06T00:00:00Z"
}, },
@@ -1,5 +1,8 @@
use std::cmp::Ordering; use std::cmp::Ordering;
use std::fmt::Display;
use columnar::ColumnType;
use itertools::Itertools;
use rustc_hash::FxHashMap; use rustc_hash::FxHashMap;
use serde::{Deserialize, Serialize}; use serde::{Deserialize, Serialize};
use tantivy_bitpacker::minmax; use tantivy_bitpacker::minmax;
@@ -15,9 +18,9 @@ use crate::aggregation::intermediate_agg_result::{
IntermediateHistogramBucketEntry, IntermediateHistogramBucketEntry,
}; };
use crate::aggregation::segment_agg_result::{ use crate::aggregation::segment_agg_result::{
build_segment_agg_collector, SegmentAggregationCollector, build_segment_agg_collector, AggregationLimits, SegmentAggregationCollector,
}; };
use crate::aggregation::*; use crate::aggregation::{f64_from_fastfield_u64, format_date};
use crate::TantivyError; use crate::TantivyError;
/// Histogram is a bucket aggregation, where buckets are created dynamically for a given `interval`. /// Histogram is a bucket aggregation, where buckets are created dynamically for a given `interval`.
@@ -70,7 +73,6 @@ pub struct HistogramAggregation {
pub field: String, pub field: String,
/// The interval to chunk your data range. Each bucket spans a value range of [0..interval). /// The interval to chunk your data range. Each bucket spans a value range of [0..interval).
/// Must be a positive value. /// Must be a positive value.
#[serde(deserialize_with = "deserialize_f64")]
pub interval: f64, pub interval: f64,
/// Intervals implicitly define an absolute grid of buckets `[interval * k, interval * (k + /// Intervals implicitly define an absolute grid of buckets `[interval * k, interval * (k +
/// 1))`. /// 1))`.
@@ -83,7 +85,6 @@ pub struct HistogramAggregation {
/// fall into the buckets with the key 0 and 10. /// fall into the buckets with the key 0 and 10.
/// With offset 5 and interval 10, they would both fall into the bucket with the key 5 and the /// With offset 5 and interval 10, they would both fall into the bucket with the key 5 and the
/// range [5..15) /// range [5..15)
#[serde(default, deserialize_with = "deserialize_option_f64")]
pub offset: Option<f64>, pub offset: Option<f64>,
/// The minimum number of documents in a bucket to be returned. Defaults to 0. /// The minimum number of documents in a bucket to be returned. Defaults to 0.
pub min_doc_count: Option<u64>, pub min_doc_count: Option<u64>,
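
The offset semantics above are easiest to see in a request. A minimal sketch, not from this diff, with a hypothetical FAST numeric field `prices`: values 8 and 12 fall into [0..10) and [10..20) without an offset, and both into [5..15) with `"offset": 5`:

```rust
use tantivy::aggregation::agg_req::Aggregations;

fn main() -> serde_json::Result<()> {
    // The offset shifts the implicit bucket grid by 5.
    let agg_req: Aggregations = serde_json::from_value(serde_json::json!({
        "price_histogram": {
            "histogram": { "field": "prices", "interval": 10.0, "offset": 5.0 }
        }
    }))?;
    println!("{agg_req:?}");
    Ok(())
}
```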
@@ -307,10 +308,7 @@ impl SegmentAggregationCollector for SegmentHistogramCollector {
.column_block_accessor .column_block_accessor
.fetch_block(docs, &bucket_agg_accessor.accessor); .fetch_block(docs, &bucket_agg_accessor.accessor);
for (doc, val) in bucket_agg_accessor for (doc, val) in bucket_agg_accessor.column_block_accessor.iter_docid_vals() {
.column_block_accessor
.iter_docid_vals(docs, &bucket_agg_accessor.accessor)
{
let val = self.f64_from_fastfield_u64(val); let val = self.f64_from_fastfield_u64(val);
let bucket_pos = get_bucket_pos(val); let bucket_pos = get_bucket_pos(val);
@@ -597,12 +595,11 @@ mod tests {
use serde_json::Value; use serde_json::Value;
use super::*; use super::*;
use crate::aggregation::agg_result::AggregationResults; use crate::aggregation::agg_req::Aggregations;
use crate::aggregation::tests::{ use crate::aggregation::tests::{
exec_request, exec_request_with_query, exec_request_with_query_and_memory_limit, exec_request, exec_request_with_query, exec_request_with_query_and_memory_limit,
get_test_index_2_segments, get_test_index_from_values, get_test_index_with_num_docs, get_test_index_2_segments, get_test_index_from_values, get_test_index_with_num_docs,
}; };
use crate::query::AllQuery;
#[test] #[test]
fn histogram_test_crooked_values() -> crate::Result<()> { fn histogram_test_crooked_values() -> crate::Result<()> {
@@ -1354,35 +1351,6 @@ mod tests {
}) })
); );
Ok(())
}
#[test]
fn test_aggregation_histogram_empty_index() -> crate::Result<()> {
// test index without segments
let values = vec![];
let index = get_test_index_from_values(false, &values)?;
let agg_req_1: Aggregations = serde_json::from_value(json!({
"myhisto": {
"histogram": {
"field": "score",
"interval": 10.0
},
}
}))
.unwrap();
let collector = AggregationCollector::from_aggs(agg_req_1, Default::default());
let reader = index.reader()?;
let searcher = reader.searcher();
let agg_res: AggregationResults = searcher.search(&AllQuery, &collector).unwrap();
let res: Value = serde_json::from_str(&serde_json::to_string(&agg_res)?)?;
// Make sure the result structure is correct
assert_eq!(res["myhisto"]["buckets"].as_array().unwrap().len(), 0);
Ok(()) Ok(())
} }
} }
@@ -1,6 +1,7 @@
use std::fmt::Debug; use std::fmt::Debug;
use std::ops::Range; use std::ops::Range;
use columnar::{ColumnType, MonotonicallyMappableToU64};
use rustc_hash::FxHashMap; use rustc_hash::FxHashMap;
use serde::{Deserialize, Serialize}; use serde::{Deserialize, Serialize};
@@ -13,7 +14,9 @@ use crate::aggregation::intermediate_agg_result::{
use crate::aggregation::segment_agg_result::{ use crate::aggregation::segment_agg_result::{
build_segment_agg_collector, SegmentAggregationCollector, build_segment_agg_collector, SegmentAggregationCollector,
}; };
use crate::aggregation::*; use crate::aggregation::{
f64_from_fastfield_u64, f64_to_fastfield_u64, format_date, Key, SerializedKey,
};
use crate::TantivyError; use crate::TantivyError;
/// Provide user-defined buckets to aggregate on. /// Provide user-defined buckets to aggregate on.
@@ -69,19 +72,11 @@ pub struct RangeAggregationRange {
pub key: Option<String>, pub key: Option<String>,
/// The from range value, which is inclusive in the range. /// The from range value, which is inclusive in the range.
/// `None` equals to an open ended interval. /// `None` equals to an open ended interval.
#[serde( #[serde(skip_serializing_if = "Option::is_none", default)]
skip_serializing_if = "Option::is_none",
default,
deserialize_with = "deserialize_option_f64"
)]
pub from: Option<f64>, pub from: Option<f64>,
/// The to range value, which is not inclusive in the range. /// The to range value, which is not inclusive in the range.
/// `None` equals to an open ended interval. /// `None` equals to an open ended interval.
#[serde( #[serde(skip_serializing_if = "Option::is_none", default)]
skip_serializing_if = "Option::is_none",
default,
deserialize_with = "deserialize_option_f64"
)]
pub to: Option<f64>, pub to: Option<f64>,
} }
@@ -235,10 +230,7 @@ impl SegmentAggregationCollector for SegmentRangeCollector {
.column_block_accessor .column_block_accessor
.fetch_block(docs, &bucket_agg_accessor.accessor); .fetch_block(docs, &bucket_agg_accessor.accessor);
for (doc, val) in bucket_agg_accessor for (doc, val) in bucket_agg_accessor.column_block_accessor.iter_docid_vals() {
.column_block_accessor
.iter_docid_vals(docs, &bucket_agg_accessor.accessor)
{
let bucket_pos = self.get_bucket_pos(val); let bucket_pos = self.get_bucket_pos(val);
let bucket = &mut self.buckets[bucket_pos]; let bucket = &mut self.buckets[bucket_pos];
@@ -449,6 +441,7 @@ pub(crate) fn range_to_key(range: &Range<u64>, field_type: &ColumnType) -> crate
#[cfg(test)] #[cfg(test)]
mod tests { mod tests {
use columnar::MonotonicallyMappableToU64;
use serde_json::Value; use serde_json::Value;
use super::*; use super::*;
@@ -457,6 +450,7 @@ mod tests {
exec_request, exec_request_with_query, get_test_index_2_segments, exec_request, exec_request_with_query, get_test_index_2_segments,
get_test_index_with_num_docs, get_test_index_with_num_docs,
}; };
use crate::aggregation::AggregationLimits;
pub fn get_collector_from_ranges( pub fn get_collector_from_ranges(
ranges: Vec<RangeAggregationRange>, ranges: Vec<RangeAggregationRange>,
@@ -1,10 +1,6 @@
use std::fmt::Debug; use std::fmt::Debug;
use std::net::Ipv6Addr;
use columnar::column_values::CompactSpaceU64Accessor; use columnar::{BytesColumn, ColumnType, MonotonicallyMappableToU64, StrColumn};
use columnar::{
BytesColumn, ColumnType, MonotonicallyMappableToU128, MonotonicallyMappableToU64, StrColumn,
};
use rustc_hash::FxHashMap; use rustc_hash::FxHashMap;
use serde::{Deserialize, Serialize}; use serde::{Deserialize, Serialize};
@@ -103,14 +99,23 @@ pub struct TermsAggregation {
#[serde(skip_serializing_if = "Option::is_none", default)] #[serde(skip_serializing_if = "Option::is_none", default)]
pub size: Option<u32>, pub size: Option<u32>,
/// To get more accurate results, we fetch more than `size` from each segment. /// Unused by tantivy.
///
/// Since tantivy doesn't know shards, this parameter is merely there to be used by consumers
/// of tantivy. shard_size is the number of terms returned by each shard.
/// The default value in elasticsearch is size * 1.5 + 10.
///
/// Should never be smaller than size.
#[serde(skip_serializing_if = "Option::is_none", default)]
#[serde(alias = "shard_size")]
pub split_size: Option<u32>,
/// To get more accurate results, we fetch more than `size` from each segment.
/// ///
/// Increasing this value will increase the cost for more accuracy. /// Increasing this value will increase the cost for more accuracy.
/// ///
/// Defaults to 10 * size. /// Defaults to 10 * size.
#[serde(skip_serializing_if = "Option::is_none", default)] #[serde(skip_serializing_if = "Option::is_none", default)]
#[serde(alias = "shard_size")]
#[serde(alias = "split_size")]
pub segment_size: Option<u32>, pub segment_size: Option<u32>,
/// If you set the `show_term_doc_count_error` parameter to true, the terms aggregation will /// If you set the `show_term_doc_count_error` parameter to true, the terms aggregation will
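
A sketch of how `size` and `segment_size` combine in a terms request; the field name `category` is hypothetical, and note that which alias (`shard_size`/`split_size`) maps to which field differs between the two sides of this diff:

```rust
use tantivy::aggregation::agg_req::Aggregations;

fn main() -> serde_json::Result<()> {
    // Return the top 10 terms overall, but collect the top 100 per segment
    // to reduce the cut-off error described above.
    let agg_req: Aggregations = serde_json::from_value(serde_json::json!({
        "top_categories": {
            "terms": { "field": "category", "size": 10, "segment_size": 100 }
        }
    }))?;
    println!("{agg_req:?}");
    Ok(())
}
```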
@@ -251,7 +256,7 @@ pub struct SegmentTermCollector {
term_buckets: TermBuckets, term_buckets: TermBuckets,
req: TermsAggregationInternal, req: TermsAggregationInternal,
blueprint: Option<Box<dyn SegmentAggregationCollector>>, blueprint: Option<Box<dyn SegmentAggregationCollector>>,
column_type: ColumnType, field_type: ColumnType,
accessor_idx: usize, accessor_idx: usize,
} }
@@ -310,10 +315,7 @@ impl SegmentAggregationCollector for SegmentTermCollector {
} }
// has subagg // has subagg
if let Some(blueprint) = self.blueprint.as_ref() { if let Some(blueprint) = self.blueprint.as_ref() {
for (doc, term_id) in bucket_agg_accessor for (doc, term_id) in bucket_agg_accessor.column_block_accessor.iter_docid_vals() {
.column_block_accessor
.iter_docid_vals(docs, &bucket_agg_accessor.accessor)
{
let sub_aggregations = self let sub_aggregations = self
.term_buckets .term_buckets
.sub_aggs .sub_aggs
@@ -353,7 +355,7 @@ impl SegmentTermCollector {
field_type: ColumnType, field_type: ColumnType,
accessor_idx: usize, accessor_idx: usize,
) -> crate::Result<Self> { ) -> crate::Result<Self> {
if field_type == ColumnType::Bytes { if field_type == ColumnType::Bytes || field_type == ColumnType::Bool {
return Err(TantivyError::InvalidArgument(format!( return Err(TantivyError::InvalidArgument(format!(
"terms aggregation is not supported for column type {:?}", "terms aggregation is not supported for column type {:?}",
field_type field_type
@@ -387,7 +389,7 @@ impl SegmentTermCollector {
req: TermsAggregationInternal::from_req(req), req: TermsAggregationInternal::from_req(req),
term_buckets, term_buckets,
blueprint, blueprint,
column_type: field_type, field_type,
accessor_idx, accessor_idx,
}) })
} }
@@ -464,7 +466,7 @@ impl SegmentTermCollector {
Ok(intermediate_entry) Ok(intermediate_entry)
}; };
if self.column_type == ColumnType::Str { if self.field_type == ColumnType::Str {
let term_dict = agg_with_accessor let term_dict = agg_with_accessor
.str_dict_column .str_dict_column
.as_ref() .as_ref()
@@ -529,55 +531,28 @@ impl SegmentTermCollector {
}); });
} }
} }
} else if self.column_type == ColumnType::DateTime { } else if self.field_type == ColumnType::DateTime {
for (val, doc_count) in entries { for (val, doc_count) in entries {
let intermediate_entry = into_intermediate_bucket_entry(val, doc_count)?; let intermediate_entry = into_intermediate_bucket_entry(val, doc_count)?;
let val = i64::from_u64(val); let val = i64::from_u64(val);
let date = format_date(val)?; let date = format_date(val)?;
dict.insert(IntermediateKey::Str(date), intermediate_entry); dict.insert(IntermediateKey::Str(date), intermediate_entry);
} }
} else if self.column_type == ColumnType::Bool {
for (val, doc_count) in entries {
let intermediate_entry = into_intermediate_bucket_entry(val, doc_count)?;
let val = bool::from_u64(val);
dict.insert(IntermediateKey::Bool(val), intermediate_entry);
}
} else if self.column_type == ColumnType::IpAddr {
let compact_space_accessor = agg_with_accessor
.accessor
.values
.clone()
.downcast_arc::<CompactSpaceU64Accessor>()
.map_err(|_| {
TantivyError::AggregationError(
crate::aggregation::AggregationError::InternalError(
"Type mismatch: Could not downcast to CompactSpaceU64Accessor"
.to_string(),
),
)
})?;
for (val, doc_count) in entries {
let intermediate_entry = into_intermediate_bucket_entry(val, doc_count)?;
let val: u128 = compact_space_accessor.compact_to_u128(val as u32);
let val = Ipv6Addr::from_u128(val);
dict.insert(IntermediateKey::IpAddr(val), intermediate_entry);
}
} else { } else {
for (val, doc_count) in entries { for (val, doc_count) in entries {
let intermediate_entry = into_intermediate_bucket_entry(val, doc_count)?; let intermediate_entry = into_intermediate_bucket_entry(val, doc_count)?;
let val = f64_from_fastfield_u64(val, &self.column_type); let val = f64_from_fastfield_u64(val, &self.field_type);
dict.insert(IntermediateKey::F64(val), intermediate_entry); dict.insert(IntermediateKey::F64(val), intermediate_entry);
} }
}; };
Ok(IntermediateBucketResult::Terms { Ok(IntermediateBucketResult::Terms(
buckets: IntermediateTermBucketResult { IntermediateTermBucketResult {
entries: dict, entries: dict,
sum_other_doc_count, sum_other_doc_count,
doc_count_error_upper_bound: term_doc_count_before_cutoff, doc_count_error_upper_bound: term_doc_count_before_cutoff,
}, },
}) ))
} }
} }
@@ -615,9 +590,6 @@ pub(crate) fn cut_off_buckets<T: GetDocCount + Debug>(
#[cfg(test)] #[cfg(test)]
mod tests { mod tests {
use std::net::IpAddr;
use std::str::FromStr;
use common::DateTime; use common::DateTime;
use time::{Date, Month}; use time::{Date, Month};
@@ -628,7 +600,7 @@ mod tests {
}; };
use crate::aggregation::AggregationLimits; use crate::aggregation::AggregationLimits;
use crate::indexer::NoMergePolicy; use crate::indexer::NoMergePolicy;
use crate::schema::{IntoIpv6Addr, Schema, FAST, STRING}; use crate::schema::{Schema, FAST, STRING};
use crate::{Index, IndexWriter}; use crate::{Index, IndexWriter};
#[test] #[test]
@@ -1210,9 +1182,9 @@ mod tests {
assert_eq!(res["my_texts"]["buckets"][0]["key"], "terma"); assert_eq!(res["my_texts"]["buckets"][0]["key"], "terma");
assert_eq!(res["my_texts"]["buckets"][0]["doc_count"], 4); assert_eq!(res["my_texts"]["buckets"][0]["doc_count"], 4);
assert_eq!(res["my_texts"]["buckets"][1]["key"], "termb"); assert_eq!(res["my_texts"]["buckets"][1]["key"], "termc");
assert_eq!(res["my_texts"]["buckets"][1]["doc_count"], 0); assert_eq!(res["my_texts"]["buckets"][1]["doc_count"], 0);
assert_eq!(res["my_texts"]["buckets"][2]["key"], "termc"); assert_eq!(res["my_texts"]["buckets"][2]["key"], "termb");
assert_eq!(res["my_texts"]["buckets"][2]["doc_count"], 0); assert_eq!(res["my_texts"]["buckets"][2]["doc_count"], 0);
assert_eq!(res["my_texts"]["sum_other_doc_count"], 0); assert_eq!(res["my_texts"]["sum_other_doc_count"], 0);
assert_eq!(res["my_texts"]["doc_count_error_upper_bound"], 0); assert_eq!(res["my_texts"]["doc_count_error_upper_bound"], 0);
@@ -1393,7 +1365,7 @@ mod tests {
#[test] #[test]
fn terms_aggregation_different_tokenizer_on_ff_test() -> crate::Result<()> { fn terms_aggregation_different_tokenizer_on_ff_test() -> crate::Result<()> {
let terms = vec!["Hello Hello", "Hallo Hallo", "Hallo Hallo"]; let terms = vec!["Hello Hello", "Hallo Hallo"];
let index = get_test_index_from_terms(true, &[terms])?; let index = get_test_index_from_terms(true, &[terms])?;
@@ -1411,7 +1383,7 @@ mod tests {
println!("{}", serde_json::to_string_pretty(&res).unwrap()); println!("{}", serde_json::to_string_pretty(&res).unwrap());
assert_eq!(res["my_texts"]["buckets"][0]["key"], "Hallo Hallo"); assert_eq!(res["my_texts"]["buckets"][0]["key"], "Hallo Hallo");
assert_eq!(res["my_texts"]["buckets"][0]["doc_count"], 2); assert_eq!(res["my_texts"]["buckets"][0]["doc_count"], 1);
assert_eq!(res["my_texts"]["buckets"][1]["key"], "Hello Hello"); assert_eq!(res["my_texts"]["buckets"][1]["key"], "Hello Hello");
assert_eq!(res["my_texts"]["buckets"][1]["doc_count"], 1); assert_eq!(res["my_texts"]["buckets"][1]["doc_count"], 1);
@@ -1922,80 +1894,4 @@ mod tests {
Ok(()) Ok(())
} }
#[test]
fn terms_aggregation_bool() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
let field = schema_builder.add_bool_field("bool_field", FAST);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
{
let mut writer = index.writer_with_num_threads(1, 15_000_000)?;
writer.add_document(doc!(field=>true))?;
writer.add_document(doc!(field=>false))?;
writer.add_document(doc!(field=>true))?;
writer.commit()?;
}
let agg_req: Aggregations = serde_json::from_value(json!({
"my_bool": {
"terms": {
"field": "bool_field"
},
}
}))
.unwrap();
let res = exec_request_with_query(agg_req, &index, None)?;
assert_eq!(res["my_bool"]["buckets"][0]["key"], 1.0);
assert_eq!(res["my_bool"]["buckets"][0]["key_as_string"], "true");
assert_eq!(res["my_bool"]["buckets"][0]["doc_count"], 2);
assert_eq!(res["my_bool"]["buckets"][1]["key"], 0.0);
assert_eq!(res["my_bool"]["buckets"][1]["key_as_string"], "false");
assert_eq!(res["my_bool"]["buckets"][1]["doc_count"], 1);
assert_eq!(res["my_bool"]["buckets"][2]["key"], serde_json::Value::Null);
Ok(())
}
#[test]
fn terms_aggregation_ip_addr() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
let field = schema_builder.add_ip_addr_field("ip_field", FAST);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
{
let mut writer = index.writer_with_num_threads(1, 15_000_000)?;
// IpV6 loopback
writer.add_document(doc!(field=>IpAddr::from_str("::1").unwrap().into_ipv6_addr()))?;
writer.add_document(doc!(field=>IpAddr::from_str("::1").unwrap().into_ipv6_addr()))?;
// IpV4
writer.add_document(
doc!(field=>IpAddr::from_str("127.0.0.1").unwrap().into_ipv6_addr()),
)?;
writer.commit()?;
}
let agg_req: Aggregations = serde_json::from_value(json!({
"my_bool": {
"terms": {
"field": "ip_field"
},
}
}))
.unwrap();
let res = exec_request_with_query(agg_req, &index, None)?;
// print as json
// println!("{}", serde_json::to_string_pretty(&res).unwrap());
assert_eq!(res["my_bool"]["buckets"][0]["key"], "::1");
assert_eq!(res["my_bool"]["buckets"][0]["doc_count"], 2);
assert_eq!(res["my_bool"]["buckets"][1]["key"], "127.0.0.1");
assert_eq!(res["my_bool"]["buckets"][1]["doc_count"], 1);
assert_eq!(res["my_bool"]["buckets"][2]["key"], serde_json::Value::Null);
Ok(())
}
} }
@@ -73,13 +73,11 @@ impl SegmentAggregationCollector for TermMissingAgg {
entries.insert(missing.into(), missing_entry); entries.insert(missing.into(), missing_entry);
let bucket = IntermediateBucketResult::Terms { let bucket = IntermediateBucketResult::Terms(IntermediateTermBucketResult {
buckets: IntermediateTermBucketResult { entries,
entries, sum_other_doc_count: 0,
sum_other_doc_count: 0, doc_count_error_upper_bound: 0,
doc_count_error_upper_bound: 0, });
},
};
results.push(name, IntermediateAggregationResult::Bucket(bucket))?; results.push(name, IntermediateAggregationResult::Bucket(bucket))?;
@@ -92,10 +90,7 @@ impl SegmentAggregationCollector for TermMissingAgg {
agg_with_accessor: &mut AggregationsWithAccessor, agg_with_accessor: &mut AggregationsWithAccessor,
) -> crate::Result<()> { ) -> crate::Result<()> {
let agg = &mut agg_with_accessor.aggs.values[self.accessor_idx]; let agg = &mut agg_with_accessor.aggs.values[self.accessor_idx];
let has_value = agg let has_value = agg.accessors.iter().any(|acc| acc.index.has_value(doc));
.accessors
.iter()
.any(|(acc, _)| acc.index.has_value(doc));
if !has_value { if !has_value {
self.missing_count += 1; self.missing_count += 1;
if let Some(sub_agg) = self.sub_agg.as_mut() { if let Some(sub_agg) = self.sub_agg.as_mut() {
@@ -8,7 +8,7 @@ use super::segment_agg_result::{
}; };
use crate::aggregation::agg_req_with_accessor::get_aggs_with_segment_accessor_and_validate; use crate::aggregation::agg_req_with_accessor::get_aggs_with_segment_accessor_and_validate;
use crate::collector::{Collector, SegmentCollector}; use crate::collector::{Collector, SegmentCollector};
use crate::{DocId, SegmentOrdinal, SegmentReader, TantivyError}; use crate::{DocId, SegmentReader, TantivyError};
/// The default max bucket count, before the aggregation fails. /// The default max bucket count, before the aggregation fails.
pub const DEFAULT_BUCKET_LIMIT: u32 = 65000; pub const DEFAULT_BUCKET_LIMIT: u32 = 65000;
@@ -64,15 +64,10 @@ impl Collector for DistributedAggregationCollector {
fn for_segment( fn for_segment(
&self, &self,
segment_local_id: crate::SegmentOrdinal, _segment_local_id: crate::SegmentOrdinal,
reader: &crate::SegmentReader, reader: &crate::SegmentReader,
) -> crate::Result<Self::Child> { ) -> crate::Result<Self::Child> {
AggregationSegmentCollector::from_agg_req_and_reader( AggregationSegmentCollector::from_agg_req_and_reader(&self.agg, reader, &self.limits)
&self.agg,
reader,
segment_local_id,
&self.limits,
)
} }
fn requires_scoring(&self) -> bool { fn requires_scoring(&self) -> bool {
@@ -94,15 +89,10 @@ impl Collector for AggregationCollector {
fn for_segment( fn for_segment(
&self, &self,
segment_local_id: crate::SegmentOrdinal, _segment_local_id: crate::SegmentOrdinal,
reader: &crate::SegmentReader, reader: &crate::SegmentReader,
) -> crate::Result<Self::Child> { ) -> crate::Result<Self::Child> {
AggregationSegmentCollector::from_agg_req_and_reader( AggregationSegmentCollector::from_agg_req_and_reader(&self.agg, reader, &self.limits)
&self.agg,
reader,
segment_local_id,
&self.limits,
)
} }
fn requires_scoring(&self) -> bool { fn requires_scoring(&self) -> bool {
@@ -145,11 +135,10 @@ impl AggregationSegmentCollector {
pub fn from_agg_req_and_reader( pub fn from_agg_req_and_reader(
agg: &Aggregations, agg: &Aggregations,
reader: &SegmentReader, reader: &SegmentReader,
segment_ordinal: SegmentOrdinal,
limits: &AggregationLimits, limits: &AggregationLimits,
) -> crate::Result<Self> { ) -> crate::Result<Self> {
let mut aggs_with_accessor = let mut aggs_with_accessor =
get_aggs_with_segment_accessor_and_validate(agg, reader, segment_ordinal, limits)?; get_aggs_with_segment_accessor_and_validate(agg, reader, limits)?;
let result = let result =
BufAggregationCollector::new(build_segment_agg_collector(&mut aggs_with_accessor)?); BufAggregationCollector::new(build_segment_agg_collector(&mut aggs_with_accessor)?);
Ok(AggregationSegmentCollector { Ok(AggregationSegmentCollector {
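
Both collectors above implement tantivy's `Collector` trait, so running an aggregation is an ordinary `search` call. A minimal end-to-end sketch, not taken from this diff; the schema and request are illustrative:

```rust
use tantivy::aggregation::agg_req::Aggregations;
use tantivy::aggregation::AggregationCollector;
use tantivy::query::AllQuery;
use tantivy::schema::{Schema, FAST};
use tantivy::{doc, Index, IndexWriter};

fn main() -> tantivy::Result<()> {
    // Build a tiny in-RAM index with one FAST numeric field.
    let mut schema_builder = Schema::builder();
    let score = schema_builder.add_f64_field("score", FAST);
    let index = Index::create_in_ram(schema_builder.build());
    let mut writer: IndexWriter = index.writer(15_000_000)?;
    writer.add_document(doc!(score => 1.5))?;
    writer.add_document(doc!(score => 2.5))?;
    writer.commit()?;

    // Any aggregation request works here; "avg_score" is illustrative.
    let agg_req: Aggregations = serde_json::from_value(serde_json::json!({
        "avg_score": { "avg": { "field": "score" } }
    }))
    .unwrap();
    // Default AggregationLimits cap memory use and bucket counts.
    let collector = AggregationCollector::from_aggs(agg_req, Default::default());
    let agg_res = index.reader()?.searcher().search(&AllQuery, &collector)?;
    println!("{}", serde_json::to_value(agg_res).unwrap());
    Ok(())
}
```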
@@ -5,7 +5,6 @@
use std::cmp::Ordering; use std::cmp::Ordering;
use std::collections::hash_map::Entry; use std::collections::hash_map::Entry;
use std::hash::Hash; use std::hash::Hash;
use std::net::Ipv6Addr;
use columnar::ColumnType; use columnar::ColumnType;
use itertools::Itertools; use itertools::Itertools;
@@ -20,7 +19,7 @@ use super::bucket::{
}; };
use super::metric::{ use super::metric::{
IntermediateAverage, IntermediateCount, IntermediateMax, IntermediateMin, IntermediateStats, IntermediateAverage, IntermediateCount, IntermediateMax, IntermediateMin, IntermediateStats,
IntermediateSum, PercentilesCollector, TopHitsTopNComputer, IntermediateSum, PercentilesCollector,
}; };
use super::segment_agg_result::AggregationLimits; use super::segment_agg_result::AggregationLimits;
use super::{format_date, AggregationError, Key, SerializedKey}; use super::{format_date, AggregationError, Key, SerializedKey};
@@ -42,10 +41,6 @@ pub struct IntermediateAggregationResults {
/// This might seem redundant with `Key`, but the point is to have a different /// This might seem redundant with `Key`, but the point is to have a different
/// Serialize implementation. /// Serialize implementation.
pub enum IntermediateKey { pub enum IntermediateKey {
/// Ip Addr key
IpAddr(Ipv6Addr),
/// Bool key
Bool(bool),
/// String key /// String key
Str(String), Str(String),
/// `f64` key /// `f64` key
@@ -63,16 +58,7 @@ impl From<IntermediateKey> for Key {
fn from(value: IntermediateKey) -> Self { fn from(value: IntermediateKey) -> Self {
match value { match value {
IntermediateKey::Str(s) => Self::Str(s), IntermediateKey::Str(s) => Self::Str(s),
IntermediateKey::IpAddr(s) => {
// Prefer to use the IPv4 representation if possible
if let Some(ip) = s.to_ipv4_mapped() {
Self::Str(ip.to_string())
} else {
Self::Str(s.to_string())
}
}
IntermediateKey::F64(f) => Self::F64(f), IntermediateKey::F64(f) => Self::F64(f),
IntermediateKey::Bool(f) => Self::F64(f as u64 as f64),
} }
} }
} }
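
The removed `IpAddr` handling preferred the IPv4 form of an IPv4-mapped address when building keys. The conversion itself is plain `std`, as this small sketch shows:

```rust
use std::net::Ipv6Addr;

fn main() {
    // Fast-field ip values are stored as IPv6; IPv4 addresses appear as
    // IPv4-mapped IPv6 ("::ffff:a.b.c.d"). Prefer the IPv4 form for display.
    let ip: Ipv6Addr = "::ffff:127.0.0.1".parse().unwrap();
    let key = match ip.to_ipv4_mapped() {
        Some(v4) => v4.to_string(),
        None => ip.to_string(),
    };
    assert_eq!(key, "127.0.0.1");
}
```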
@@ -85,8 +71,6 @@ impl std::hash::Hash for IntermediateKey {
match self { match self {
IntermediateKey::Str(text) => text.hash(state), IntermediateKey::Str(text) => text.hash(state),
IntermediateKey::F64(val) => val.to_bits().hash(state), IntermediateKey::F64(val) => val.to_bits().hash(state),
IntermediateKey::Bool(val) => val.hash(state),
IntermediateKey::IpAddr(val) => val.hash(state),
} }
} }
} }
@@ -182,9 +166,9 @@ impl IntermediateAggregationResults {
pub(crate) fn empty_from_req(req: &Aggregation) -> IntermediateAggregationResult { pub(crate) fn empty_from_req(req: &Aggregation) -> IntermediateAggregationResult {
use AggregationVariants::*; use AggregationVariants::*;
match req.agg { match req.agg {
Terms(_) => IntermediateAggregationResult::Bucket(IntermediateBucketResult::Terms { Terms(_) => IntermediateAggregationResult::Bucket(IntermediateBucketResult::Terms(
buckets: Default::default(), Default::default(),
}), )),
Range(_) => IntermediateAggregationResult::Bucket(IntermediateBucketResult::Range( Range(_) => IntermediateAggregationResult::Bucket(IntermediateBucketResult::Range(
Default::default(), Default::default(),
)), )),
@@ -221,9 +205,6 @@ pub(crate) fn empty_from_req(req: &Aggregation) -> IntermediateAggregationResult
Percentiles(_) => IntermediateAggregationResult::Metric( Percentiles(_) => IntermediateAggregationResult::Metric(
IntermediateMetricResult::Percentiles(PercentilesCollector::default()), IntermediateMetricResult::Percentiles(PercentilesCollector::default()),
), ),
TopHits(ref req) => IntermediateAggregationResult::Metric(
IntermediateMetricResult::TopHits(TopHitsTopNComputer::new(req.clone())),
),
} }
} }
@@ -284,8 +265,6 @@ pub enum IntermediateMetricResult {
Stats(IntermediateStats), Stats(IntermediateStats),
/// Intermediate sum result. /// Intermediate sum result.
Sum(IntermediateSum), Sum(IntermediateSum),
/// Intermediate top_hits result
TopHits(TopHitsTopNComputer),
} }
impl IntermediateMetricResult { impl IntermediateMetricResult {
@@ -313,13 +292,9 @@ impl IntermediateMetricResult {
percentiles percentiles
.into_final_result(req.agg.as_percentile().expect("unexpected metric type")), .into_final_result(req.agg.as_percentile().expect("unexpected metric type")),
), ),
IntermediateMetricResult::TopHits(top_hits) => {
MetricResult::TopHits(top_hits.into_final_result())
}
} }
} }
// TODO: this is our top-of-the-chain fruit merge mech
fn merge_fruits(&mut self, other: IntermediateMetricResult) -> crate::Result<()> { fn merge_fruits(&mut self, other: IntermediateMetricResult) -> crate::Result<()> {
match (self, other) { match (self, other) {
( (
@@ -355,9 +330,6 @@ impl IntermediateMetricResult {
) => { ) => {
left.merge_fruits(right)?; left.merge_fruits(right)?;
} }
(IntermediateMetricResult::TopHits(left), IntermediateMetricResult::TopHits(right)) => {
left.merge_fruits(right)?;
}
_ => { _ => {
panic!("incompatible fruit types in tree or missing merge_fruits handler"); panic!("incompatible fruit types in tree or missing merge_fruits handler");
} }
@@ -379,14 +351,11 @@ pub enum IntermediateBucketResult {
Histogram { Histogram {
/// The column_type of the underlying `Column` is DateTime /// The column_type of the underlying `Column` is DateTime
is_date_agg: bool, is_date_agg: bool,
/// The histogram buckets /// The buckets
buckets: Vec<IntermediateHistogramBucketEntry>, buckets: Vec<IntermediateHistogramBucketEntry>,
}, },
/// Term aggregation /// Term aggregation
Terms { Terms(IntermediateTermBucketResult),
/// The term buckets
buckets: IntermediateTermBucketResult,
},
} }
impl IntermediateBucketResult { impl IntermediateBucketResult {
@@ -463,7 +432,7 @@ impl IntermediateBucketResult {
}; };
Ok(BucketResult::Histogram { buckets }) Ok(BucketResult::Histogram { buckets })
} }
IntermediateBucketResult::Terms { buckets: terms } => terms.into_final_result( IntermediateBucketResult::Terms(terms) => terms.into_final_result(
req.agg req.agg
.as_term() .as_term()
.expect("unexpected aggregation, expected term aggregation"), .expect("unexpected aggregation, expected term aggregation"),
@@ -476,12 +445,8 @@ impl IntermediateBucketResult {
fn merge_fruits(&mut self, other: IntermediateBucketResult) -> crate::Result<()> { fn merge_fruits(&mut self, other: IntermediateBucketResult) -> crate::Result<()> {
match (self, other) { match (self, other) {
( (
IntermediateBucketResult::Terms { IntermediateBucketResult::Terms(term_res_left),
buckets: term_res_left, IntermediateBucketResult::Terms(term_res_right),
},
IntermediateBucketResult::Terms {
buckets: term_res_right,
},
) => { ) => {
merge_maps(&mut term_res_left.entries, term_res_right.entries)?; merge_maps(&mut term_res_left.entries, term_res_right.entries)?;
term_res_left.sum_other_doc_count += term_res_right.sum_other_doc_count; term_res_left.sum_other_doc_count += term_res_right.sum_other_doc_count;
@@ -565,15 +530,8 @@ impl IntermediateTermBucketResult {
.into_iter() .into_iter()
.filter(|bucket| bucket.1.doc_count as u64 >= req.min_doc_count) .filter(|bucket| bucket.1.doc_count as u64 >= req.min_doc_count)
.map(|(key, entry)| { .map(|(key, entry)| {
let key_as_string = match key {
IntermediateKey::Bool(key) => {
let val = if key { "true" } else { "false" };
Some(val.to_string())
}
_ => None,
};
Ok(BucketEntry { Ok(BucketEntry {
key_as_string, key_as_string: None,
key: key.into(), key: key.into(),
doc_count: entry.doc_count as u64, doc_count: entry.doc_count as u64,
sub_aggregation: entry sub_aggregation: entry
@@ -2,8 +2,7 @@ use std::fmt::Debug;
use serde::{Deserialize, Serialize}; use serde::{Deserialize, Serialize};
use super::*; use super::{IntermediateStats, SegmentStatsCollector};
use crate::aggregation::*;
/// A single-value metric aggregation that computes the average of numeric values that are /// A single-value metric aggregation that computes the average of numeric values that are
/// extracted from the aggregated documents. /// extracted from the aggregated documents.
@@ -25,7 +24,7 @@ pub struct AverageAggregation {
/// By default they will be ignored but it is also possible to treat them as if they had a /// By default they will be ignored but it is also possible to treat them as if they had a
/// value. Examples in JSON format: /// value. Examples in JSON format:
/// { "field": "my_numbers", "missing": "10.0" } /// { "field": "my_numbers", "missing": "10.0" }
#[serde(default, deserialize_with = "deserialize_option_f64")] #[serde(default)]
pub missing: Option<f64>, pub missing: Option<f64>,
} }
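
The `missing` parameter documented above substitutes a value for documents without one; on one side of this diff it may also be given as a string (`"10.0"`). A minimal request sketch with a hypothetical field `my_numbers`:

```rust
use tantivy::aggregation::agg_req::Aggregations;

fn main() -> serde_json::Result<()> {
    // Documents with no value in "my_numbers" are averaged as if they had 10.0.
    let agg_req: Aggregations = serde_json::from_value(serde_json::json!({
        "avg_with_default": { "avg": { "field": "my_numbers", "missing": 10.0 } }
    }))?;
    println!("{agg_req:?}");
    Ok(())
}
```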
@@ -66,71 +65,3 @@ impl IntermediateAverage {
self.stats.finalize().avg self.stats.finalize().avg
} }
} }
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn deserialization_with_missing_test1() {
let json = r#"{
"field": "score",
"missing": "10.0"
}"#;
let avg: AverageAggregation = serde_json::from_str(json).unwrap();
assert_eq!(avg.field, "score");
assert_eq!(avg.missing, Some(10.0));
// no dot
let json = r#"{
"field": "score",
"missing": "10"
}"#;
let avg: AverageAggregation = serde_json::from_str(json).unwrap();
assert_eq!(avg.field, "score");
assert_eq!(avg.missing, Some(10.0));
// from value
let avg: AverageAggregation = serde_json::from_value(json!({
"field": "score_f64",
"missing": 10u64,
}))
.unwrap();
assert_eq!(avg.missing, Some(10.0));
// from value
let avg: AverageAggregation = serde_json::from_value(json!({
"field": "score_f64",
"missing": 10u32,
}))
.unwrap();
assert_eq!(avg.missing, Some(10.0));
let avg: AverageAggregation = serde_json::from_value(json!({
"field": "score_f64",
"missing": 10i8,
}))
.unwrap();
assert_eq!(avg.missing, Some(10.0));
}
#[test]
fn deserialization_with_missing_test_fail() {
let json = r#"{
"field": "score",
"missing": "a"
}"#;
let avg: Result<AverageAggregation, _> = serde_json::from_str(json);
assert!(avg.is_err());
assert!(avg
.unwrap_err()
.to_string()
.contains("Failed to parse f64 from string: \"a\""));
// Disallow NaN
let json = r#"{
"field": "score",
"missing": "NaN"
}"#;
let avg: Result<AverageAggregation, _> = serde_json::from_str(json);
assert!(avg.is_err());
assert!(avg.unwrap_err().to_string().contains("NaN"));
}
}
@@ -2,8 +2,7 @@ use std::fmt::Debug;
use serde::{Deserialize, Serialize}; use serde::{Deserialize, Serialize};
use super::*; use super::{IntermediateStats, SegmentStatsCollector};
use crate::aggregation::*;
/// A single-value metric aggregation that counts the number of values that are /// A single-value metric aggregation that counts the number of values that are
/// extracted from the aggregated documents. /// extracted from the aggregated documents.
@@ -25,7 +24,7 @@ pub struct CountAggregation {
/// By default they will be ignored but it is also possible to treat them as if they had a /// By default they will be ignored but it is also possible to treat them as if they had a
/// value. Examples in JSON format: /// value. Examples in JSON format:
/// { "field": "my_numbers", "missing": "10.0" } /// { "field": "my_numbers", "missing": "10.0" }
#[serde(default, deserialize_with = "deserialize_option_f64")] #[serde(default)]
pub missing: Option<f64>, pub missing: Option<f64>,
} }
@@ -2,8 +2,7 @@ use std::fmt::Debug;
use serde::{Deserialize, Serialize}; use serde::{Deserialize, Serialize};
use super::*; use super::{IntermediateStats, SegmentStatsCollector};
use crate::aggregation::*;
/// A single-value metric aggregation that computes the maximum of numeric values that are /// A single-value metric aggregation that computes the maximum of numeric values that are
/// extracted from the aggregated documents. /// extracted from the aggregated documents.
@@ -25,7 +24,7 @@ pub struct MaxAggregation {
/// By default they will be ignored but it is also possible to treat them as if they had a /// By default they will be ignored but it is also possible to treat them as if they had a
/// value. Examples in JSON format: /// value. Examples in JSON format:
/// { "field": "my_numbers", "missing": "10.0" } /// { "field": "my_numbers", "missing": "10.0" }
#[serde(default, deserialize_with = "deserialize_option_f64")] #[serde(default)]
pub missing: Option<f64>, pub missing: Option<f64>,
} }
@@ -2,8 +2,7 @@ use std::fmt::Debug;
use serde::{Deserialize, Serialize}; use serde::{Deserialize, Serialize};
use super::*; use super::{IntermediateStats, SegmentStatsCollector};
use crate::aggregation::*;
/// A single-value metric aggregation that computes the minimum of numeric values that are /// A single-value metric aggregation that computes the minimum of numeric values that are
/// extracted from the aggregated documents. /// extracted from the aggregated documents.
@@ -25,7 +24,7 @@ pub struct MinAggregation {
/// By default they will be ignored but it is also possible to treat them as if they had a /// By default they will be ignored but it is also possible to treat them as if they had a
/// value. Examples in JSON format: /// value. Examples in JSON format:
/// { "field": "my_numbers", "missing": "10.0" } /// { "field": "my_numbers", "missing": "10.0" }
#[serde(default, deserialize_with = "deserialize_option_f64")] #[serde(default)]
pub missing: Option<f64>, pub missing: Option<f64>,
} }
@@ -23,10 +23,6 @@ mod min;
mod percentiles; mod percentiles;
mod stats; mod stats;
mod sum; mod sum;
mod top_hits;
use std::collections::HashMap;
pub use average::*; pub use average::*;
pub use count::*; pub use count::*;
pub use max::*; pub use max::*;
@@ -36,9 +32,6 @@ use rustc_hash::FxHashMap;
use serde::{Deserialize, Serialize}; use serde::{Deserialize, Serialize};
pub use stats::*; pub use stats::*;
pub use sum::*; pub use sum::*;
pub use top_hits::*;
use crate::schema::OwnedValue;
/// Single-metric aggregations use this common result structure. /// Single-metric aggregations use this common result structure.
/// ///
@@ -88,28 +81,6 @@ pub struct PercentilesMetricResult {
pub values: PercentileValues, pub values: PercentileValues,
} }
/// The top_hits metric results entry
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
pub struct TopHitsVecEntry {
/// The sort values of the document, depending on the sort criteria in the request.
pub sort: Vec<Option<u64>>,
/// Search results, for queries that include field retrieval requests
/// (`docvalue_fields`).
#[serde(rename = "docvalue_fields")]
#[serde(skip_serializing_if = "HashMap::is_empty")]
pub doc_value_fields: HashMap<String, OwnedValue>,
}
/// The top_hits metric aggregation results a list of top hits by sort criteria.
///
/// The main reason for wrapping it in `hits` is to match elasticsearch output structure.
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
pub struct TopHitsMetricResult {
/// The result of the top_hits metric.
pub hits: Vec<TopHitsVecEntry>,
}
#[cfg(test)] #[cfg(test)]
mod tests { mod tests {
use crate::aggregation::agg_req::Aggregations; use crate::aggregation::agg_req::Aggregations;
@@ -1,5 +1,6 @@
use std::fmt::Debug; use std::fmt::Debug;
use columnar::ColumnType;
use serde::{Deserialize, Serialize}; use serde::{Deserialize, Serialize};
use super::*; use super::*;
@@ -10,7 +11,7 @@ use crate::aggregation::intermediate_agg_result::{
IntermediateAggregationResult, IntermediateAggregationResults, IntermediateMetricResult, IntermediateAggregationResult, IntermediateAggregationResults, IntermediateMetricResult,
}; };
use crate::aggregation::segment_agg_result::SegmentAggregationCollector; use crate::aggregation::segment_agg_result::SegmentAggregationCollector;
use crate::aggregation::*; use crate::aggregation::{f64_from_fastfield_u64, f64_to_fastfield_u64, AggregationError};
use crate::{DocId, TantivyError}; use crate::{DocId, TantivyError};
/// # Percentiles /// # Percentiles
@@ -83,11 +84,7 @@ pub struct PercentilesAggregationReq {
/// By default they will be ignored but it is also possible to treat them as if they had a /// By default they will be ignored but it is also possible to treat them as if they had a
/// value. Examples in JSON format: /// value. Examples in JSON format:
/// { "field": "my_numbers", "missing": "10.0" } /// { "field": "my_numbers", "missing": "10.0" }
#[serde( #[serde(skip_serializing_if = "Option::is_none", default)]
skip_serializing_if = "Option::is_none",
default,
deserialize_with = "deserialize_option_f64"
)]
pub missing: Option<f64>, pub missing: Option<f64>,
} }
fn default_percentiles() -> &'static [f64] { fn default_percentiles() -> &'static [f64] {
@@ -136,6 +133,7 @@ pub(crate) struct SegmentPercentilesCollector {
field_type: ColumnType, field_type: ColumnType,
pub(crate) percentiles: PercentilesCollector, pub(crate) percentiles: PercentilesCollector,
pub(crate) accessor_idx: usize, pub(crate) accessor_idx: usize,
val_cache: Vec<u64>,
missing: Option<u64>, missing: Option<u64>,
} }
@@ -245,6 +243,7 @@ impl SegmentPercentilesCollector {
field_type, field_type,
percentiles: PercentilesCollector::new(), percentiles: PercentilesCollector::new(),
accessor_idx, accessor_idx,
val_cache: Default::default(),
missing, missing,
}) })
} }
@@ -1,3 +1,4 @@
use columnar::ColumnType;
use serde::{Deserialize, Serialize}; use serde::{Deserialize, Serialize};
use super::*; use super::*;
@@ -8,7 +9,7 @@ use crate::aggregation::intermediate_agg_result::{
IntermediateAggregationResult, IntermediateAggregationResults, IntermediateMetricResult, IntermediateAggregationResult, IntermediateAggregationResults, IntermediateMetricResult,
}; };
use crate::aggregation::segment_agg_result::SegmentAggregationCollector; use crate::aggregation::segment_agg_result::SegmentAggregationCollector;
use crate::aggregation::*; use crate::aggregation::{f64_from_fastfield_u64, f64_to_fastfield_u64};
use crate::{DocId, TantivyError}; use crate::{DocId, TantivyError};
/// A multi-value metric aggregation that computes a collection of statistics on numeric values that /// A multi-value metric aggregation that computes a collection of statistics on numeric values that
@@ -32,7 +33,7 @@ pub struct StatsAggregation {
/// By default they will be ignored but it is also possible to treat them as if they had a /// By default they will be ignored but it is also possible to treat them as if they had a
/// value. Examples in JSON format: /// value. Examples in JSON format:
/// { "field": "my_numbers", "missing": "10.0" } /// { "field": "my_numbers", "missing": "10.0" }
#[serde(default, deserialize_with = "deserialize_option_f64")] #[serde(default)]
pub missing: Option<f64>, pub missing: Option<f64>,
} }
@@ -579,30 +580,6 @@ mod tests {
}) })
); );
// From string
let agg_req: Aggregations = serde_json::from_value(json!({
"my_stats": {
"stats": {
"field": "json.partially_empty",
"missing": "0.0"
},
}
}))
.unwrap();
let res = exec_request_with_query(agg_req, &index, None)?;
assert_eq!(
res["my_stats"],
json!({
"avg": 2.5,
"count": 4,
"max": 10.0,
"min": 0.0,
"sum": 10.0
})
);
Ok(()) Ok(())
} }
@@ -2,8 +2,7 @@ use std::fmt::Debug;
use serde::{Deserialize, Serialize}; use serde::{Deserialize, Serialize};
use super::*; use super::{IntermediateStats, SegmentStatsCollector};
use crate::aggregation::*;
/// A single-value metric aggregation that sums up numeric values that are /// A single-value metric aggregation that sums up numeric values that are
/// extracted from the aggregated documents. /// extracted from the aggregated documents.
@@ -25,7 +24,7 @@ pub struct SumAggregation {
/// By default they will be ignored but it is also possible to treat them as if they had a /// By default they will be ignored but it is also possible to treat them as if they had a
/// value. Examples in JSON format: /// value. Examples in JSON format:
/// { "field": "my_numbers", "missing": "10.0" } /// { "field": "my_numbers", "missing": "10.0" }
#[serde(default, deserialize_with = "deserialize_option_f64")] #[serde(default)]
pub missing: Option<f64>, pub missing: Option<f64>,
} }
@@ -1,897 +0,0 @@
use std::collections::HashMap;
use std::net::Ipv6Addr;
use columnar::{ColumnarReader, DynamicColumn};
use common::DateTime;
use regex::Regex;
use serde::ser::SerializeMap;
use serde::{Deserialize, Deserializer, Serialize, Serializer};
use super::{TopHitsMetricResult, TopHitsVecEntry};
use crate::aggregation::bucket::Order;
use crate::aggregation::intermediate_agg_result::{
IntermediateAggregationResult, IntermediateMetricResult,
};
use crate::aggregation::segment_agg_result::SegmentAggregationCollector;
use crate::aggregation::AggregationError;
use crate::collector::TopNComputer;
use crate::schema::term::JSON_PATH_SEGMENT_SEP_STR;
use crate::schema::OwnedValue;
use crate::{DocAddress, DocId, SegmentOrdinal};
/// # Top Hits
///
/// The top hits aggregation is a useful tool to answer questions like:
/// - "What are the most recent posts by each author?"
/// - "What are the most popular items in each category?"
///
/// It does so by keeping track of the most relevant document being aggregated,
/// in terms of a sort criterion that can consist of multiple fields and their
/// sort-orders (ascending or descending).
///
/// `top_hits` should not be used as a top-level aggregation. It is intended to be
/// used as a sub-aggregation, inside a `terms` aggregation or a `filters` aggregation,
/// for example.
///
/// Note that this aggregator does not return the actual document addresses, but
/// rather a list of the values of the fields that were requested to be retrieved.
/// These values can be specified in the `docvalue_fields` parameter, which can include
/// a list of fast fields to be retrieved. At the moment, only fast fields are supported,
/// but the `fields` parameter may be supported in the future to retrieve any stored
/// field.
///
/// The following example demonstrates a request for the top_hits aggregation:
/// ```JSON
/// {
///   "aggs": {
///     "top_authors": {
///       "terms": {
///         "field": "author",
///         "size": 5
///       },
///       "aggs": {
///         "recent_hits": {
///           "top_hits": {
///             "size": 2,
///             "from": 0,
///             "sort": [
///               { "date": "desc" }
///             ],
///             "docvalue_fields": ["date", "title", "iden"]
///           }
///         }
///       }
///     }
///   }
/// }
/// ```
///
/// This request will return an object containing the top two documents, sorted
/// by the `date` field in descending order. You can also sort by multiple fields, which
/// helps to resolve ties. The aggregation object for each bucket will look like:
/// ```JSON
/// {
/// "hits": [
/// {
/// "score": [<time_u64>],
/// "docvalue_fields": {
/// "date": "<date_RFC3339>",
/// "title": "<title>",
/// "iden": "<iden>"
/// }
/// },
/// {
/// "score": [<time_u64>]
/// "docvalue_fields": {
/// "date": "<date_RFC3339>",
/// "title": "<title>",
/// "iden": "<iden>"
/// }
/// }
/// ]
/// }
/// ```
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize, Default)]
pub struct TopHitsAggregation {
sort: Vec<KeyOrder>,
size: usize,
from: Option<usize>,
#[serde(rename = "docvalue_fields")]
#[serde(default)]
doc_value_fields: Vec<String>,
// Not supported
_source: Option<serde_json::Value>,
fields: Option<serde_json::Value>,
script_fields: Option<serde_json::Value>,
highlight: Option<serde_json::Value>,
explain: Option<serde_json::Value>,
version: Option<serde_json::Value>,
}
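// A minimal sketch of how such a request deserializes in practice, mirroring the
// integration tests at the bottom of this file (the aggregation name
// `top_hits_req` and the field names are illustrative):
#[cfg(test)]
#[test]
fn top_hits_request_shape_sketch() {
    use serde_json::json;

    use crate::aggregation::agg_req::Aggregations;

    let agg_req: Aggregations = serde_json::from_value(json!({
        "top_hits_req": {
            "top_hits": {
                "size": 2,
                "from": 0,
                "sort": [ { "date": "desc" } ],
                "docvalue_fields": ["date"]
            }
        }
    }))
    .unwrap();
    assert_eq!(agg_req.len(), 1);
}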
#[derive(Debug, Clone, PartialEq, Default)]
struct KeyOrder {
field: String,
order: Order,
}
impl Serialize for KeyOrder {
fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
let KeyOrder { field, order } = self;
let mut map = serializer.serialize_map(Some(1))?;
map.serialize_entry(field, order)?;
map.end()
}
}
impl<'de> Deserialize<'de> for KeyOrder {
fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
where D: Deserializer<'de> {
let mut key_order = <HashMap<String, Order>>::deserialize(deserializer)?.into_iter();
let (field, order) = key_order.next().ok_or(serde::de::Error::custom(
"Expected exactly one key-value pair in sort parameter of top_hits, found none",
))?;
if key_order.next().is_some() {
return Err(serde::de::Error::custom(format!(
"Expected exactly one key-value pair in sort parameter of top_hits, found {:?}",
key_order
)));
}
Ok(Self { field, order })
}
}
// Transform a glob (`pattern*`, for example) into a regex::Regex (`^pattern.*$`)
fn globbed_string_to_regex(glob: &str) -> Result<Regex, crate::TantivyError> {
// Escape regex metacharacters, then replace the escaped `*` glob with the `.*` regex
let sanitized = format!("^{}$", regex::escape(glob).replace(r"\*", ".*"));
Regex::new(&sanitized).map_err(|e| {
crate::TantivyError::SchemaError(format!(
"Invalid regex '{}' in docvalue_fields: {}",
glob, e
))
})
}
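// To make the transform concrete (dots are escaped by `regex::escape`, the glob
// star becomes `.*`), a couple of illustrative input/output pairs:
//
//   "tex*"    -> ^tex.*$      matches "text" and "text2"
//   "mixed.*" -> ^mixed\..*$  matches "mixed.dyn_arr"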
fn use_doc_value_fields_err(parameter: &str) -> crate::Result<()> {
Err(crate::TantivyError::AggregationError(
AggregationError::InvalidRequest(format!(
"The `{}` parameter is not supported, only `docvalue_fields` is supported in \
`top_hits` aggregation",
parameter
)),
))
}
fn unsupported_err(parameter: &str) -> crate::Result<()> {
Err(crate::TantivyError::AggregationError(
AggregationError::InvalidRequest(format!(
"The `{}` parameter is not supported in the `top_hits` aggregation",
parameter
)),
))
}
impl TopHitsAggregation {
/// Validate and resolve field retrieval parameters
pub fn validate_and_resolve_field_names(
&mut self,
reader: &ColumnarReader,
) -> crate::Result<()> {
if self._source.is_some() {
use_doc_value_fields_err("_source")?;
}
if self.fields.is_some() {
use_doc_value_fields_err("fields")?;
}
if self.script_fields.is_some() {
use_doc_value_fields_err("script_fields")?;
}
if self.explain.is_some() {
unsupported_err("explain")?;
}
if self.highlight.is_some() {
unsupported_err("highlight")?;
}
if self.version.is_some() {
unsupported_err("version")?;
}
self.doc_value_fields = self
.doc_value_fields
.iter()
.map(|field| {
if !field.contains('*')
&& reader
.iter_columns()?
.any(|(name, _)| name.as_str() == field)
{
return Ok(vec![field.to_owned()]);
}
let pattern = globbed_string_to_regex(field)?;
let fields = reader
.iter_columns()?
.map(|(name, _)| {
// normalize path from internal fast field repr
name.replace(JSON_PATH_SEGMENT_SEP_STR, ".")
})
.filter(|name| pattern.is_match(name))
.collect::<Vec<_>>();
assert!(
!fields.is_empty(),
"No fields matched the glob '{}' in docvalue_fields",
field
);
Ok(fields)
})
.collect::<crate::Result<Vec<_>>>()?
.into_iter()
.flatten()
.collect();
Ok(())
}
/// Return fields accessed by the aggregator, in order.
pub fn field_names(&self) -> Vec<&str> {
self.sort
.iter()
.map(|KeyOrder { field, .. }| field.as_str())
.collect()
}
/// Return fields accessed by the aggregator's value retrieval.
pub fn value_field_names(&self) -> Vec<&str> {
self.doc_value_fields.iter().map(|s| s.as_str()).collect()
}
fn get_document_field_data(
&self,
accessors: &HashMap<String, Vec<DynamicColumn>>,
doc_id: DocId,
) -> HashMap<String, FastFieldValue> {
let doc_value_fields = self
.doc_value_fields
.iter()
.map(|field| {
let accessors = accessors
.get(field)
.unwrap_or_else(|| panic!("field '{}' not found in accessors", field));
let values: Vec<FastFieldValue> = accessors
.iter()
.flat_map(|accessor| match accessor {
DynamicColumn::U64(accessor) => accessor
.values_for_doc(doc_id)
.map(FastFieldValue::U64)
.collect::<Vec<_>>(),
DynamicColumn::I64(accessor) => accessor
.values_for_doc(doc_id)
.map(FastFieldValue::I64)
.collect::<Vec<_>>(),
DynamicColumn::F64(accessor) => accessor
.values_for_doc(doc_id)
.map(FastFieldValue::F64)
.collect::<Vec<_>>(),
DynamicColumn::Bytes(accessor) => accessor
.term_ords(doc_id)
.map(|term_ord| {
let mut buffer = vec![];
assert!(
accessor
.ord_to_bytes(term_ord, &mut buffer)
.expect("could not read term dictionary"),
"term corresponding to term_ord does not exist"
);
FastFieldValue::Bytes(buffer)
})
.collect::<Vec<_>>(),
DynamicColumn::Str(accessor) => accessor
.term_ords(doc_id)
.map(|term_ord| {
let mut buffer = vec![];
assert!(
accessor
.ord_to_bytes(term_ord, &mut buffer)
.expect("could not read term dictionary"),
"term corresponding to term_ord does not exist"
);
FastFieldValue::Str(String::from_utf8(buffer).unwrap())
})
.collect::<Vec<_>>(),
DynamicColumn::Bool(accessor) => accessor
.values_for_doc(doc_id)
.map(FastFieldValue::Bool)
.collect::<Vec<_>>(),
DynamicColumn::IpAddr(accessor) => accessor
.values_for_doc(doc_id)
.map(FastFieldValue::IpAddr)
.collect::<Vec<_>>(),
DynamicColumn::DateTime(accessor) => accessor
.values_for_doc(doc_id)
.map(FastFieldValue::Date)
.collect::<Vec<_>>(),
})
.collect();
(field.to_owned(), FastFieldValue::Array(values))
})
.collect();
doc_value_fields
}
}
/// A retrieved value from a fast field.
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
pub enum FastFieldValue {
/// The str type is used for any text information.
Str(String),
/// Unsigned 64-bit integer `u64`
U64(u64),
/// Signed 64-bit integer `i64`
I64(i64),
/// 64-bit float `f64`
F64(f64),
/// Bool value
Bool(bool),
/// Date/time with nanosecond precision
Date(DateTime),
/// Arbitrarily sized byte array
Bytes(Vec<u8>),
/// IPv6 address. There is no separate IPv4 representation internally; IPv4 addresses must be converted to `Ipv6Addr`.
IpAddr(Ipv6Addr),
/// A list of values.
Array(Vec<Self>),
}
impl From<FastFieldValue> for OwnedValue {
fn from(value: FastFieldValue) -> Self {
match value {
FastFieldValue::Str(s) => OwnedValue::Str(s),
FastFieldValue::U64(u) => OwnedValue::U64(u),
FastFieldValue::I64(i) => OwnedValue::I64(i),
FastFieldValue::F64(f) => OwnedValue::F64(f),
FastFieldValue::Bool(b) => OwnedValue::Bool(b),
FastFieldValue::Date(d) => OwnedValue::Date(d),
FastFieldValue::Bytes(b) => OwnedValue::Bytes(b),
FastFieldValue::IpAddr(ip) => OwnedValue::IpAddr(ip),
FastFieldValue::Array(a) => {
OwnedValue::Array(a.into_iter().map(OwnedValue::from).collect())
}
}
}
}
/// Holds a fast field value in its u64 representation, and the order in which it should be sorted.
#[derive(Clone, Serialize, Deserialize, Debug)]
struct DocValueAndOrder {
/// A fast field value in its u64 representation.
value: Option<u64>,
/// Sort order for the value
order: Order,
}
impl Ord for DocValueAndOrder {
fn cmp(&self, other: &Self) -> std::cmp::Ordering {
let invert = |cmp: std::cmp::Ordering| match self.order {
Order::Asc => cmp,
Order::Desc => cmp.reverse(),
};
match (self.value, other.value) {
(Some(self_value), Some(other_value)) => invert(self_value.cmp(&other_value)),
(Some(_), None) => std::cmp::Ordering::Greater,
(None, Some(_)) => std::cmp::Ordering::Less,
(None, None) => std::cmp::Ordering::Equal,
}
}
}
impl PartialOrd for DocValueAndOrder {
fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
Some(self.cmp(other))
}
}
impl PartialEq for DocValueAndOrder {
fn eq(&self, other: &Self) -> bool {
self.value.cmp(&other.value) == std::cmp::Ordering::Equal
}
}
impl Eq for DocValueAndOrder {}
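// A quick sketch of the resulting semantics, mirroring the unit tests below:
// present values compare in the requested direction, while missing values sort
// below present ones regardless of direction.
#[cfg(test)]
#[test]
fn doc_value_and_order_ordering_sketch() {
    let some_asc = DocValueAndOrder { value: Some(1), order: Order::Asc };
    let none_asc = DocValueAndOrder { value: None, order: Order::Asc };
    assert!(none_asc < some_asc); // a missing value loses in ascending order...

    let some_desc = DocValueAndOrder { value: Some(1), order: Order::Desc };
    let none_desc = DocValueAndOrder { value: None, order: Order::Desc };
    assert!(none_desc < some_desc); // ...and in descending order too
}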
#[derive(Clone, Serialize, Deserialize, Debug)]
struct DocSortValuesAndFields {
sorts: Vec<DocValueAndOrder>,
#[serde(rename = "docvalue_fields")]
#[serde(skip_serializing_if = "HashMap::is_empty")]
doc_value_fields: HashMap<String, FastFieldValue>,
}
impl Ord for DocSortValuesAndFields {
fn cmp(&self, other: &Self) -> std::cmp::Ordering {
for (self_feature, other_feature) in self.sorts.iter().zip(other.sorts.iter()) {
let cmp = self_feature.cmp(other_feature);
if cmp != std::cmp::Ordering::Equal {
return cmp;
}
}
std::cmp::Ordering::Equal
}
}
impl PartialOrd for DocSortValuesAndFields {
fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
Some(self.cmp(other))
}
}
impl PartialEq for DocSortValuesAndFields {
fn eq(&self, other: &Self) -> bool {
self.cmp(other) == std::cmp::Ordering::Equal
}
}
impl Eq for DocSortValuesAndFields {}
/// The top-level `top_hits` computer, used for collecting over segments and merging results.
#[derive(Clone, Serialize, Deserialize, Debug)]
pub struct TopHitsTopNComputer {
req: TopHitsAggregation,
top_n: TopNComputer<DocSortValuesAndFields, DocAddress, false>,
}
impl std::cmp::PartialEq for TopHitsTopNComputer {
fn eq(&self, _other: &Self) -> bool {
false
}
}
impl TopHitsTopNComputer {
/// Create a new `TopHitsTopNComputer`
pub fn new(req: TopHitsAggregation) -> Self {
Self {
top_n: TopNComputer::new(req.size + req.from.unwrap_or(0)),
req,
}
}
fn collect(&mut self, features: DocSortValuesAndFields, doc: DocAddress) {
self.top_n.push(features, doc);
}
pub(crate) fn merge_fruits(&mut self, other_fruit: Self) -> crate::Result<()> {
for doc in other_fruit.top_n.into_vec() {
self.collect(doc.feature, doc.doc);
}
Ok(())
}
/// Finalize by converting self into the final result form
pub fn into_final_result(self) -> TopHitsMetricResult {
let mut hits: Vec<TopHitsVecEntry> = self
.top_n
.into_sorted_vec()
.into_iter()
.map(|doc| TopHitsVecEntry {
sort: doc.feature.sorts.iter().map(|f| f.value).collect(),
doc_value_fields: doc
.feature
.doc_value_fields
.into_iter()
.map(|(k, v)| (k, v.into()))
.collect(),
})
.collect();
// Remove the first `from` elements.
// Truncating from the end would be more efficient, but we need to truncate from the
// front: `into_sorted_vec` yields descending order due to the inverted `Ord`
// semantics of the heap elements.
hits.drain(..self.req.from.unwrap_or(0));
TopHitsMetricResult { hits }
}
}
#[derive(Clone, Debug)]
pub(crate) struct TopHitsSegmentCollector {
segment_ordinal: SegmentOrdinal,
accessor_idx: usize,
req: TopHitsAggregation,
top_n: TopNComputer<Vec<DocValueAndOrder>, DocAddress, false>,
}
impl TopHitsSegmentCollector {
pub fn from_req(
req: &TopHitsAggregation,
accessor_idx: usize,
segment_ordinal: SegmentOrdinal,
) -> Self {
Self {
req: req.clone(),
top_n: TopNComputer::new(req.size + req.from.unwrap_or(0)),
segment_ordinal,
accessor_idx,
}
}
fn into_top_hits_collector(
self,
value_accessors: &HashMap<String, Vec<DynamicColumn>>,
) -> TopHitsTopNComputer {
let mut top_hits_computer = TopHitsTopNComputer::new(self.req.clone());
let top_results = self.top_n.into_vec();
for res in top_results {
let doc_value_fields = self
.req
.get_document_field_data(value_accessors, res.doc.doc_id);
top_hits_computer.collect(
DocSortValuesAndFields {
sorts: res.feature,
doc_value_fields,
},
res.doc,
);
}
top_hits_computer
}
}
impl SegmentAggregationCollector for TopHitsSegmentCollector {
fn add_intermediate_aggregation_result(
self: Box<Self>,
agg_with_accessor: &crate::aggregation::agg_req_with_accessor::AggregationsWithAccessor,
results: &mut crate::aggregation::intermediate_agg_result::IntermediateAggregationResults,
) -> crate::Result<()> {
let name = agg_with_accessor.aggs.keys[self.accessor_idx].to_string();
let value_accessors = &agg_with_accessor.aggs.values[self.accessor_idx].value_accessors;
let intermediate_result =
IntermediateMetricResult::TopHits(self.into_top_hits_collector(value_accessors));
results.push(
name,
IntermediateAggregationResult::Metric(intermediate_result),
)
}
fn collect(
&mut self,
doc_id: crate::DocId,
agg_with_accessor: &mut crate::aggregation::agg_req_with_accessor::AggregationsWithAccessor,
) -> crate::Result<()> {
let accessors = &agg_with_accessor.aggs.values[self.accessor_idx].accessors;
let sorts: Vec<DocValueAndOrder> = self
.req
.sort
.iter()
.enumerate()
.map(|(idx, KeyOrder { order, .. })| {
let order = *order;
let value = accessors
.get(idx)
.expect("could not find field in accessors")
.0
.values_for_doc(doc_id)
.next();
DocValueAndOrder { value, order }
})
.collect();
self.top_n.push(
sorts,
DocAddress {
segment_ord: self.segment_ordinal,
doc_id,
},
);
Ok(())
}
fn collect_block(
&mut self,
docs: &[crate::DocId],
agg_with_accessor: &mut crate::aggregation::agg_req_with_accessor::AggregationsWithAccessor,
) -> crate::Result<()> {
// TODO: Consider getting fields with the column block accessor.
for doc in docs {
self.collect(*doc, agg_with_accessor)?;
}
Ok(())
}
}
#[cfg(test)]
mod tests {
use common::DateTime;
use pretty_assertions::assert_eq;
use serde_json::Value;
use time::macros::datetime;
use super::{DocSortValuesAndFields, DocValueAndOrder, Order};
use crate::aggregation::agg_req::Aggregations;
use crate::aggregation::agg_result::AggregationResults;
use crate::aggregation::bucket::tests::get_test_index_from_docs;
use crate::aggregation::tests::get_test_index_from_values;
use crate::aggregation::AggregationCollector;
use crate::collector::ComparableDoc;
use crate::query::AllQuery;
use crate::schema::OwnedValue;
fn invert_order(cmp_feature: DocValueAndOrder) -> DocValueAndOrder {
let DocValueAndOrder { value, order } = cmp_feature;
let order = match order {
Order::Asc => Order::Desc,
Order::Desc => Order::Asc,
};
DocValueAndOrder { value, order }
}
fn collector_with_capacity(capacity: usize) -> super::TopHitsTopNComputer {
super::TopHitsTopNComputer {
top_n: super::TopNComputer::new(capacity),
req: Default::default(),
}
}
fn invert_order_features(mut cmp_features: DocSortValuesAndFields) -> DocSortValuesAndFields {
cmp_features.sorts = cmp_features
.sorts
.into_iter()
.map(invert_order)
.collect::<Vec<_>>();
cmp_features
}
#[test]
fn test_comparable_doc_feature() -> crate::Result<()> {
let small = DocValueAndOrder {
value: Some(1),
order: Order::Asc,
};
let big = DocValueAndOrder {
value: Some(2),
order: Order::Asc,
};
let none = DocValueAndOrder {
value: None,
order: Order::Asc,
};
assert!(small < big);
assert!(none < small);
assert!(none < big);
let small = invert_order(small);
let big = invert_order(big);
let none = invert_order(none);
assert!(small > big);
assert!(none < small);
assert!(none < big);
Ok(())
}
#[test]
fn test_comparable_doc_features() -> crate::Result<()> {
let features_1 = DocSortValuesAndFields {
sorts: vec![DocValueAndOrder {
value: Some(1),
order: Order::Asc,
}],
doc_value_fields: Default::default(),
};
let features_2 = DocSortValuesAndFields {
sorts: vec![DocValueAndOrder {
value: Some(2),
order: Order::Asc,
}],
doc_value_fields: Default::default(),
};
assert!(features_1 < features_2);
assert!(invert_order_features(features_1.clone()) > invert_order_features(features_2));
Ok(())
}
#[test]
fn test_aggregation_top_hits_empty_index() -> crate::Result<()> {
let values = vec![];
let index = get_test_index_from_values(false, &values)?;
let d: Aggregations = serde_json::from_value(json!({
"top_hits_req": {
"top_hits": {
"size": 2,
"sort": [
{ "date": "desc" }
],
"from": 0,
}
}
}))
.unwrap();
let collector = AggregationCollector::from_aggs(d, Default::default());
let reader = index.reader()?;
let searcher = reader.searcher();
let agg_res: AggregationResults = searcher.search(&AllQuery, &collector).unwrap();
let res: Value = serde_json::from_str(
&serde_json::to_string(&agg_res).expect("JSON serialization failed"),
)
.expect("JSON parsing failed");
assert_eq!(
res,
json!({
"top_hits_req": {
"hits": []
}
})
);
Ok(())
}
#[test]
fn test_top_hits_collector_single_feature() -> crate::Result<()> {
let docs = vec![
ComparableDoc::<_, _, false> {
doc: crate::DocAddress {
segment_ord: 0,
doc_id: 0,
},
feature: DocSortValuesAndFields {
sorts: vec![DocValueAndOrder {
value: Some(1),
order: Order::Asc,
}],
doc_value_fields: Default::default(),
},
},
ComparableDoc {
doc: crate::DocAddress {
segment_ord: 0,
doc_id: 2,
},
feature: DocSortValuesAndFields {
sorts: vec![DocValueAndOrder {
value: Some(3),
order: Order::Asc,
}],
doc_value_fields: Default::default(),
},
},
ComparableDoc {
doc: crate::DocAddress {
segment_ord: 0,
doc_id: 1,
},
feature: DocSortValuesAndFields {
sorts: vec![DocValueAndOrder {
value: Some(5),
order: Order::Asc,
}],
doc_value_fields: Default::default(),
},
},
];
let mut collector = collector_with_capacity(3);
for doc in docs.clone() {
collector.collect(doc.feature, doc.doc);
}
let res = collector.into_final_result();
assert_eq!(
res,
super::TopHitsMetricResult {
hits: vec![
super::TopHitsVecEntry {
sort: vec![docs[0].feature.sorts[0].value],
doc_value_fields: Default::default(),
},
super::TopHitsVecEntry {
sort: vec![docs[1].feature.sorts[0].value],
doc_value_fields: Default::default(),
},
super::TopHitsVecEntry {
sort: vec![docs[2].feature.sorts[0].value],
doc_value_fields: Default::default(),
},
]
}
);
Ok(())
}
fn test_aggregation_top_hits(merge_segments: bool) -> crate::Result<()> {
let docs = vec![
vec![
r#"{ "date": "2015-01-02T00:00:00Z", "text": "bbb", "text2": "bbb", "mixed": { "dyn_arr": [1, "2"] } }"#,
r#"{ "date": "2017-06-15T00:00:00Z", "text": "ccc", "text2": "ddd", "mixed": { "dyn_arr": [3, "4"] } }"#,
],
vec![
r#"{ "text": "aaa", "text2": "bbb", "date": "2018-01-02T00:00:00Z", "mixed": { "dyn_arr": ["9", 8] } }"#,
r#"{ "text": "aaa", "text2": "bbb", "date": "2016-01-02T00:00:00Z", "mixed": { "dyn_arr": ["7", 6] } }"#,
],
];
let index = get_test_index_from_docs(merge_segments, &docs)?;
let d: Aggregations = serde_json::from_value(json!({
"top_hits_req": {
"top_hits": {
"size": 2,
"sort": [
{ "date": "desc" }
],
"from": 1,
"docvalue_fields": [
"date",
"tex*",
"mixed.*",
],
}
}
}))?;
let collector = AggregationCollector::from_aggs(d, Default::default());
let reader = index.reader()?;
let searcher = reader.searcher();
let agg_res =
serde_json::to_value(searcher.search(&AllQuery, &collector).unwrap()).unwrap();
let date_2017 = datetime!(2017-06-15 00:00:00 UTC);
let date_2016 = datetime!(2016-01-02 00:00:00 UTC);
assert_eq!(
agg_res["top_hits_req"],
json!({
"hits": [
{
"sort": [common::i64_to_u64(date_2017.unix_timestamp_nanos() as i64)],
"docvalue_fields": {
"date": [ OwnedValue::Date(DateTime::from_utc(date_2017)) ],
"text": [ "ccc" ],
"text2": [ "ddd" ],
"mixed.dyn_arr": [ 3, "4" ],
}
},
{
"sort": [common::i64_to_u64(date_2016.unix_timestamp_nanos() as i64)],
"docvalue_fields": {
"date": [ OwnedValue::Date(DateTime::from_utc(date_2016)) ],
"text": [ "aaa" ],
"text2": [ "bbb" ],
"mixed.dyn_arr": [ 6, "7" ],
}
}
]
}),
);
Ok(())
}
#[test]
fn test_aggregation_top_hits_single_segment() -> crate::Result<()> {
test_aggregation_top_hits(true)
}
#[test]
fn test_aggregation_top_hits_multi_segment() -> crate::Result<()> {
test_aggregation_top_hits(false)
}
}

View File

@@ -145,8 +145,6 @@ mod agg_tests;
mod agg_bench; mod agg_bench;
use core::fmt;
pub use agg_limits::AggregationLimits; pub use agg_limits::AggregationLimits;
pub use collector::{ pub use collector::{
AggregationCollector, AggregationSegmentCollector, DistributedAggregationCollector, AggregationCollector, AggregationSegmentCollector, DistributedAggregationCollector,
@@ -156,110 +154,7 @@ use columnar::{ColumnType, MonotonicallyMappableToU64};
pub(crate) use date::format_date; pub(crate) use date::format_date;
pub use error::AggregationError; pub use error::AggregationError;
use itertools::Itertools; use itertools::Itertools;
use serde::de::{self, Visitor}; use serde::{Deserialize, Serialize};
use serde::{Deserialize, Deserializer, Serialize};
pub(crate) fn invalid_agg_request(message: String) -> crate::TantivyError {
crate::TantivyError::AggregationError(AggregationError::InvalidRequest(message))
}
fn parse_str_into_f64<E: de::Error>(value: &str) -> Result<f64, E> {
let parsed = value.parse::<f64>().map_err(|_err| {
de::Error::custom(format!("Failed to parse f64 from string: {:?}", value))
})?;
// Check if the parsed value is NaN or infinity
if parsed.is_nan() || parsed.is_infinite() {
Err(de::Error::custom(format!(
"Value is not a valid f64 (NaN or Infinity): {:?}",
value
)))
} else {
Ok(parsed)
}
}
/// Deserialize an `Option<f64>` from a string or a float
pub(crate) fn deserialize_option_f64<'de, D>(deserializer: D) -> Result<Option<f64>, D::Error>
where D: Deserializer<'de> {
struct StringOrFloatVisitor;
impl<'de> Visitor<'de> for StringOrFloatVisitor {
type Value = Option<f64>;
fn expecting(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
formatter.write_str("a string or a float")
}
fn visit_str<E>(self, value: &str) -> Result<Self::Value, E>
where E: de::Error {
parse_str_into_f64(value).map(Some)
}
fn visit_f64<E>(self, value: f64) -> Result<Self::Value, E>
where E: de::Error {
Ok(Some(value))
}
fn visit_i64<E>(self, value: i64) -> Result<Self::Value, E>
where E: de::Error {
Ok(Some(value as f64))
}
fn visit_u64<E>(self, value: u64) -> Result<Self::Value, E>
where E: de::Error {
Ok(Some(value as f64))
}
fn visit_none<E>(self) -> Result<Self::Value, E>
where E: de::Error {
Ok(None)
}
fn visit_unit<E>(self) -> Result<Self::Value, E>
where E: de::Error {
Ok(None)
}
}
deserializer.deserialize_any(StringOrFloatVisitor)
}
/// Deserialize an `f64` from a string or a float
pub(crate) fn deserialize_f64<'de, D>(deserializer: D) -> Result<f64, D::Error>
where D: Deserializer<'de> {
struct StringOrFloatVisitor;
impl<'de> Visitor<'de> for StringOrFloatVisitor {
type Value = f64;
fn expecting(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
formatter.write_str("a string or a float")
}
fn visit_str<E>(self, value: &str) -> Result<Self::Value, E>
where E: de::Error {
parse_str_into_f64(value)
}
fn visit_f64<E>(self, value: f64) -> Result<Self::Value, E>
where E: de::Error {
Ok(value)
}
fn visit_i64<E>(self, value: i64) -> Result<Self::Value, E>
where E: de::Error {
Ok(value as f64)
}
fn visit_u64<E>(self, value: u64) -> Result<Self::Value, E>
where E: de::Error {
Ok(value as f64)
}
}
deserializer.deserialize_any(StringOrFloatVisitor)
}
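// A minimal sketch of how these helpers are wired up on an aggregation request
// struct (`MyMetricAgg` is hypothetical; the real use sites are the `missing`
// fields of the metric aggregations such as stats and sum):
#[cfg(test)]
#[test]
fn deserialize_option_f64_sketch() {
    #[derive(serde::Deserialize)]
    struct MyMetricAgg {
        #[serde(default, deserialize_with = "deserialize_option_f64")]
        missing: Option<f64>,
    }
    // Both a JSON number and a numeric string are accepted; an absent key is None.
    let a: MyMetricAgg = serde_json::from_str(r#"{ "missing": 10.0 }"#).unwrap();
    let b: MyMetricAgg = serde_json::from_str(r#"{ "missing": "10.0" }"#).unwrap();
    let c: MyMetricAgg = serde_json::from_str("{}").unwrap();
    assert_eq!(a.missing, Some(10.0));
    assert_eq!(b.missing, Some(10.0));
    assert_eq!(c.missing, None);
}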
/// Represents an associative array `(key => values)` in a very efficient manner. /// Represents an associative array `(key => values)` in a very efficient manner.
#[derive(PartialEq, Serialize, Deserialize)] #[derive(PartialEq, Serialize, Deserialize)]
@@ -386,7 +281,6 @@ pub(crate) fn f64_from_fastfield_u64(val: u64, field_type: &ColumnType) -> f64 {
ColumnType::U64 => val as f64, ColumnType::U64 => val as f64,
ColumnType::I64 | ColumnType::DateTime => i64::from_u64(val) as f64, ColumnType::I64 | ColumnType::DateTime => i64::from_u64(val) as f64,
ColumnType::F64 => f64::from_u64(val), ColumnType::F64 => f64::from_u64(val),
ColumnType::Bool => val as f64,
_ => { _ => {
panic!("unexpected type {field_type:?}. This should not happen") panic!("unexpected type {field_type:?}. This should not happen")
} }
@@ -407,7 +301,6 @@ pub(crate) fn f64_to_fastfield_u64(val: f64, field_type: &ColumnType) -> Option<
ColumnType::U64 => Some(val as u64), ColumnType::U64 => Some(val as u64),
ColumnType::I64 | ColumnType::DateTime => Some((val as i64).to_u64()), ColumnType::I64 | ColumnType::DateTime => Some((val as i64).to_u64()),
ColumnType::F64 => Some(val.to_u64()), ColumnType::F64 => Some(val.to_u64()),
ColumnType::Bool => Some(val as u64),
_ => None, _ => None,
} }
} }
@@ -421,6 +314,7 @@ mod tests {
use time::OffsetDateTime; use time::OffsetDateTime;
use super::agg_req::Aggregations; use super::agg_req::Aggregations;
use super::segment_agg_result::AggregationLimits;
use super::*; use super::*;
use crate::indexer::NoMergePolicy; use crate::indexer::NoMergePolicy;
use crate::query::{AllQuery, TermQuery}; use crate::query::{AllQuery, TermQuery};

View File

@@ -16,7 +16,6 @@ use super::metric::{
SumAggregation, SumAggregation,
}; };
use crate::aggregation::bucket::TermMissingAgg; use crate::aggregation::bucket::TermMissingAgg;
use crate::aggregation::metric::TopHitsSegmentCollector;
pub(crate) trait SegmentAggregationCollector: CollectorClone + Debug { pub(crate) trait SegmentAggregationCollector: CollectorClone + Debug {
fn add_intermediate_aggregation_result( fn add_intermediate_aggregation_result(
@@ -161,11 +160,6 @@ pub(crate) fn build_single_agg_segment_collector(
accessor_idx, accessor_idx,
)?, )?,
)), )),
TopHits(top_hits_req) => Ok(Box::new(TopHitsSegmentCollector::from_req(
top_hits_req,
accessor_idx,
req.segment_ordinal,
))),
} }
} }

View File

@@ -410,7 +410,6 @@ impl SegmentCollector for FacetSegmentCollector {
/// Intermediary result of the `FacetCollector` that stores /// Intermediary result of the `FacetCollector` that stores
/// the facet counts for all the segments. /// the facet counts for all the segments.
#[derive(Default, Clone)]
pub struct FacetCounts { pub struct FacetCounts {
facet_counts: BTreeMap<Facet, u64>, facet_counts: BTreeMap<Facet, u64>,
} }
@@ -494,7 +493,7 @@ mod tests {
use super::{FacetCollector, FacetCounts}; use super::{FacetCollector, FacetCounts};
use crate::collector::facet_collector::compress_mapping; use crate::collector::facet_collector::compress_mapping;
use crate::collector::Count; use crate::collector::Count;
use crate::index::Index; use crate::core::Index;
use crate::query::{AllQuery, QueryParser, TermQuery}; use crate::query::{AllQuery, QueryParser, TermQuery};
use crate::schema::{Facet, FacetOptions, IndexRecordOption, Schema, TantivyDocument}; use crate::schema::{Facet, FacetOptions, IndexRecordOption, Schema, TantivyDocument};
use crate::{IndexWriter, Term}; use crate::{IndexWriter, Term};

View File

@@ -160,7 +160,7 @@ mod tests {
use super::{add_vecs, HistogramCollector, HistogramComputer}; use super::{add_vecs, HistogramCollector, HistogramComputer};
use crate::schema::{Schema, FAST}; use crate::schema::{Schema, FAST};
use crate::time::{Date, Month}; use crate::time::{Date, Month};
use crate::{query, DateTime, Index}; use crate::{doc, query, DateTime, Index};
#[test] #[test]
fn test_add_histograms_simple() { fn test_add_histograms_simple() {

View File

@@ -97,7 +97,6 @@ pub use self::multi_collector::{FruitHandle, MultiCollector, MultiFruit};
mod top_collector; mod top_collector;
mod top_score_collector; mod top_score_collector;
pub use self::top_collector::ComparableDoc;
pub use self::top_score_collector::{TopDocs, TopNComputer}; pub use self::top_score_collector::{TopDocs, TopNComputer};
mod custom_score_top_collector; mod custom_score_top_collector;
@@ -274,10 +273,6 @@ pub trait SegmentCollector: 'static {
fn collect(&mut self, doc: DocId, score: Score); fn collect(&mut self, doc: DocId, score: Score);
/// The query pushes the scored document to the collector via this method. /// The query pushes the scored document to the collector via this method.
/// This method is used when the collector does not require scoring.
///
/// See [`COLLECT_BLOCK_BUFFER_LEN`](crate::COLLECT_BLOCK_BUFFER_LEN) for the
/// buffer size passed to the collector.
fn collect_block(&mut self, docs: &[DocId]) { fn collect_block(&mut self, docs: &[DocId]) {
for doc in docs { for doc in docs {
self.collect(*doc, 0.0); self.collect(*doc, 0.0);

View File

@@ -52,16 +52,10 @@ impl<TCollector: Collector> Collector for CollectorWrapper<TCollector> {
impl SegmentCollector for Box<dyn BoxableSegmentCollector> { impl SegmentCollector for Box<dyn BoxableSegmentCollector> {
type Fruit = Box<dyn Fruit>; type Fruit = Box<dyn Fruit>;
#[inline]
fn collect(&mut self, doc: u32, score: Score) { fn collect(&mut self, doc: u32, score: Score) {
self.as_mut().collect(doc, score); self.as_mut().collect(doc, score);
} }
#[inline]
fn collect_block(&mut self, docs: &[DocId]) {
self.as_mut().collect_block(docs);
}
fn harvest(self) -> Box<dyn Fruit> { fn harvest(self) -> Box<dyn Fruit> {
BoxableSegmentCollector::harvest_from_box(self) BoxableSegmentCollector::harvest_from_box(self)
} }
@@ -69,11 +63,6 @@ impl SegmentCollector for Box<dyn BoxableSegmentCollector> {
pub trait BoxableSegmentCollector { pub trait BoxableSegmentCollector {
fn collect(&mut self, doc: u32, score: Score); fn collect(&mut self, doc: u32, score: Score);
fn collect_block(&mut self, docs: &[DocId]) {
for &doc in docs {
self.collect(doc, 0.0);
}
}
fn harvest_from_box(self: Box<Self>) -> Box<dyn Fruit>; fn harvest_from_box(self: Box<Self>) -> Box<dyn Fruit>;
} }
@@ -82,14 +71,9 @@ pub struct SegmentCollectorWrapper<TSegmentCollector: SegmentCollector>(TSegment
impl<TSegmentCollector: SegmentCollector> BoxableSegmentCollector impl<TSegmentCollector: SegmentCollector> BoxableSegmentCollector
for SegmentCollectorWrapper<TSegmentCollector> for SegmentCollectorWrapper<TSegmentCollector>
{ {
#[inline]
fn collect(&mut self, doc: u32, score: Score) { fn collect(&mut self, doc: u32, score: Score) {
self.0.collect(doc, score); self.0.collect(doc, score);
} }
#[inline]
fn collect_block(&mut self, docs: &[DocId]) {
self.0.collect_block(docs);
}
fn harvest_from_box(self: Box<Self>) -> Box<dyn Fruit> { fn harvest_from_box(self: Box<Self>) -> Box<dyn Fruit> {
Box::new(self.0.harvest()) Box::new(self.0.harvest())

View File

@@ -1,11 +1,15 @@
use columnar::{BytesColumn, Column}; use columnar::{BytesColumn, Column};
use super::*; use super::*;
use crate::collector::{Count, FilterCollector, TopDocs};
use crate::core::SegmentReader;
use crate::query::{AllQuery, QueryParser}; use crate::query::{AllQuery, QueryParser};
use crate::schema::{Schema, FAST, TEXT}; use crate::schema::{Schema, FAST, TEXT};
use crate::time::format_description::well_known::Rfc3339; use crate::time::format_description::well_known::Rfc3339;
use crate::time::OffsetDateTime; use crate::time::OffsetDateTime;
use crate::{DateTime, DocAddress, Index, Searcher, TantivyDocument}; use crate::{
doc, DateTime, DocAddress, DocId, Index, Score, Searcher, SegmentOrdinal, TantivyDocument,
};
pub const TEST_COLLECTOR_WITH_SCORE: TestCollector = TestCollector { pub const TEST_COLLECTOR_WITH_SCORE: TestCollector = TestCollector {
compute_score: true, compute_score: true,

View File

@@ -1,58 +1,47 @@
use std::cmp::Ordering; use std::cmp::Ordering;
use std::marker::PhantomData; use std::marker::PhantomData;
use serde::{Deserialize, Serialize};
use super::top_score_collector::TopNComputer; use super::top_score_collector::TopNComputer;
use crate::{DocAddress, DocId, SegmentOrdinal, SegmentReader}; use crate::{DocAddress, DocId, SegmentOrdinal, SegmentReader};
/// Contains a feature (field, score, etc.) of a document along with the document address. /// Contains a feature (field, score, etc.) of a document along with the document address.
/// ///
/// It guarantees stable sorting: in case of a tie on the feature, the document /// It has a custom implementation of `PartialOrd` that reverses the order. This is because the
/// address is used. /// default Rust heap is a max heap, whereas a min heap is needed.
/// ///
/// The REVERSE_ORDER generic parameter controls whether the by-feature order /// Additionally, it guarantees stable sorting: in case of a tie on the feature, the document
/// should be reversed, which is useful for achieving for example largest-first /// address is used.
/// semantics without having to wrap the feature in a `Reverse`.
/// ///
/// WARNING: equality is not what you would expect here. /// WARNING: equality is not what you would expect here.
/// Two elements are equal if their feature is equal, and regardless of whether `doc` /// Two elements are equal if their feature is equal, and regardless of whether `doc`
/// is equal. This should be perfectly fine for this usage, but let's make sure this /// is equal. This should be perfectly fine for this usage, but let's make sure this
/// struct is never public. /// struct is never public.
#[derive(Clone, Default, Serialize, Deserialize)] pub(crate) struct ComparableDoc<T, D> {
pub struct ComparableDoc<T, D, const REVERSE_ORDER: bool = false> {
/// The feature of the document. In practice, this is
/// any type that implements `PartialOrd`.
pub feature: T, pub feature: T,
/// The document address. In practice, this is any
/// type that implements `PartialOrd`, and is guaranteed
/// to be unique for each document.
pub doc: D, pub doc: D,
} }
impl<T: std::fmt::Debug, D: std::fmt::Debug, const R: bool> std::fmt::Debug impl<T: std::fmt::Debug, D: std::fmt::Debug> std::fmt::Debug for ComparableDoc<T, D> {
for ComparableDoc<T, D, R>
{
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.debug_struct(format!("ComparableDoc<_, _ {R}").as_str()) f.debug_struct("ComparableDoc")
.field("feature", &self.feature) .field("feature", &self.feature)
.field("doc", &self.doc) .field("doc", &self.doc)
.finish() .finish()
} }
} }
impl<T: PartialOrd, D: PartialOrd, const R: bool> PartialOrd for ComparableDoc<T, D, R> { impl<T: PartialOrd, D: PartialOrd> PartialOrd for ComparableDoc<T, D> {
fn partial_cmp(&self, other: &Self) -> Option<Ordering> { fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
Some(self.cmp(other)) Some(self.cmp(other))
} }
} }
impl<T: PartialOrd, D: PartialOrd, const R: bool> Ord for ComparableDoc<T, D, R> { impl<T: PartialOrd, D: PartialOrd> Ord for ComparableDoc<T, D> {
#[inline] #[inline]
fn cmp(&self, other: &Self) -> Ordering { fn cmp(&self, other: &Self) -> Ordering {
let by_feature = self // Reversed to make BinaryHeap work as a min-heap
let by_feature = other
.feature .feature
.partial_cmp(&other.feature) .partial_cmp(&self.feature)
.map(|ord| if R { ord.reverse() } else { ord })
.unwrap_or(Ordering::Equal); .unwrap_or(Ordering::Equal);
let lazy_by_doc_address = || self.doc.partial_cmp(&other.doc).unwrap_or(Ordering::Equal); let lazy_by_doc_address = || self.doc.partial_cmp(&other.doc).unwrap_or(Ordering::Equal);
@@ -64,13 +53,13 @@ impl<T: PartialOrd, D: PartialOrd, const R: bool> Ord for ComparableDoc<T, D, R>
} }
} }
impl<T: PartialOrd, D: PartialOrd, const R: bool> PartialEq for ComparableDoc<T, D, R> { impl<T: PartialOrd, D: PartialOrd> PartialEq for ComparableDoc<T, D> {
fn eq(&self, other: &Self) -> bool { fn eq(&self, other: &Self) -> bool {
self.cmp(other) == Ordering::Equal self.cmp(other) == Ordering::Equal
} }
} }
impl<T: PartialOrd, D: PartialOrd, const R: bool> Eq for ComparableDoc<T, D, R> {} impl<T: PartialOrd, D: PartialOrd> Eq for ComparableDoc<T, D> {}
pub(crate) struct TopCollector<T> { pub(crate) struct TopCollector<T> {
pub limit: usize, pub limit: usize,
@@ -110,10 +99,10 @@ where T: PartialOrd + Clone
if self.limit == 0 { if self.limit == 0 {
return Ok(Vec::new()); return Ok(Vec::new());
} }
let mut top_collector: TopNComputer<_, _> = TopNComputer::new(self.limit + self.offset); let mut top_collector = TopNComputer::new(self.limit + self.offset);
for child_fruit in children { for child_fruit in children {
for (feature, doc) in child_fruit { for (feature, doc) in child_fruit {
top_collector.push(feature, doc); top_collector.push(ComparableDoc { feature, doc });
} }
} }
@@ -154,8 +143,6 @@ where T: PartialOrd + Clone
/// The theoretical complexity for collecting the top `K` out of `n` documents /// The theoretical complexity for collecting the top `K` out of `n` documents
/// is `O(n + K)`. /// is `O(n + K)`.
pub(crate) struct TopSegmentCollector<T> { pub(crate) struct TopSegmentCollector<T> {
/// We reverse the order of the feature in order to
/// have top-semantics instead of bottom semantics.
topn_computer: TopNComputer<T, DocId>, topn_computer: TopNComputer<T, DocId>,
segment_ord: u32, segment_ord: u32,
} }
@@ -193,7 +180,7 @@ impl<T: PartialOrd + Clone> TopSegmentCollector<T> {
/// will compare the lowest scoring item with the given one and keep whichever is greater. /// will compare the lowest scoring item with the given one and keep whichever is greater.
#[inline] #[inline]
pub fn collect(&mut self, doc: DocId, feature: T) { pub fn collect(&mut self, doc: DocId, feature: T) {
self.topn_computer.push(feature, doc); self.topn_computer.push(ComparableDoc { feature, doc });
} }
} }

View File

@@ -3,8 +3,6 @@ use std::marker::PhantomData;
use std::sync::Arc; use std::sync::Arc;
use columnar::ColumnValues; use columnar::ColumnValues;
use serde::de::DeserializeOwned;
use serde::{Deserialize, Serialize};
use super::Collector; use super::Collector;
use crate::collector::custom_score_top_collector::CustomScoreTopCollector; use crate::collector::custom_score_top_collector::CustomScoreTopCollector;
@@ -311,7 +309,7 @@ impl TopDocs {
/// ///
/// To comfortably work with `u64`s, `i64`s, `f64`s, or `date`s, please refer to /// To comfortably work with `u64`s, `i64`s, `f64`s, or `date`s, please refer to
/// the [.order_by_fast_field(...)](TopDocs::order_by_fast_field) method. /// the [.order_by_fast_field(...)](TopDocs::order_by_fast_field) method.
pub fn order_by_u64_field( fn order_by_u64_field(
self, self,
field: impl ToString, field: impl ToString,
order: Order, order: Order,
@@ -665,7 +663,7 @@ impl Collector for TopDocs {
reader: &SegmentReader, reader: &SegmentReader,
) -> crate::Result<<Self::Child as SegmentCollector>::Fruit> { ) -> crate::Result<<Self::Child as SegmentCollector>::Fruit> {
let heap_len = self.0.limit + self.0.offset; let heap_len = self.0.limit + self.0.offset;
let mut top_n: TopNComputer<_, _> = TopNComputer::new(heap_len); let mut top_n = TopNComputer::new(heap_len);
if let Some(alive_bitset) = reader.alive_bitset() { if let Some(alive_bitset) = reader.alive_bitset() {
let mut threshold = Score::MIN; let mut threshold = Score::MIN;
@@ -674,13 +672,21 @@ impl Collector for TopDocs {
if alive_bitset.is_deleted(doc) { if alive_bitset.is_deleted(doc) {
return threshold; return threshold;
} }
top_n.push(score, doc); let doc = ComparableDoc {
feature: score,
doc,
};
top_n.push(doc);
threshold = top_n.threshold.unwrap_or(Score::MIN); threshold = top_n.threshold.unwrap_or(Score::MIN);
threshold threshold
})?; })?;
} else { } else {
weight.for_each_pruning(Score::MIN, reader, &mut |doc, score| { weight.for_each_pruning(Score::MIN, reader, &mut |doc, score| {
top_n.push(score, doc); let doc = ComparableDoc {
feature: score,
doc,
};
top_n.push(doc);
top_n.threshold.unwrap_or(Score::MIN) top_n.threshold.unwrap_or(Score::MIN)
})?; })?;
} }
@@ -719,78 +725,17 @@ impl SegmentCollector for TopScoreSegmentCollector {
/// Fast TopN Computation /// Fast TopN Computation
/// ///
/// Capacity of the vec is 2 * top_n.
/// The buffer is truncated to the top_n elements when it reaches the capacity of the Vec.
/// That means capacity has special meaning and should be carried over when cloning or serializing.
///
/// For TopN == 0, it will be relatively expensive. /// For TopN == 0, it will be relatively expensive.
#[derive(Serialize, Deserialize)] pub struct TopNComputer<Score, DocId> {
#[serde(from = "TopNComputerDeser<Score, D, REVERSE_ORDER>")] buffer: Vec<ComparableDoc<Score, DocId>>,
pub struct TopNComputer<Score, D, const REVERSE_ORDER: bool = true> {
/// The buffer reverses sort order to get top-semantics instead of bottom-semantics
buffer: Vec<ComparableDoc<Score, D, REVERSE_ORDER>>,
top_n: usize, top_n: usize,
pub(crate) threshold: Option<Score>, pub(crate) threshold: Option<Score>,
} }
impl<Score: std::fmt::Debug, D, const REVERSE_ORDER: bool> std::fmt::Debug impl<Score, DocId> TopNComputer<Score, DocId>
for TopNComputer<Score, D, REVERSE_ORDER>
{
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> std::fmt::Result {
f.debug_struct("TopNComputer")
.field("buffer_len", &self.buffer.len())
.field("top_n", &self.top_n)
.field("current_threshold", &self.threshold)
.finish()
}
}
// Intermediate struct used when deserializing TopNComputer, so the vec capacity invariant can be restored
#[derive(Deserialize)]
struct TopNComputerDeser<Score, D, const REVERSE_ORDER: bool> {
buffer: Vec<ComparableDoc<Score, D, REVERSE_ORDER>>,
top_n: usize,
threshold: Option<Score>,
}
// Custom clone to keep capacity
impl<Score: Clone, D: Clone, const REVERSE_ORDER: bool> Clone
for TopNComputer<Score, D, REVERSE_ORDER>
{
fn clone(&self) -> Self {
let mut buffer_clone = Vec::with_capacity(self.buffer.capacity());
buffer_clone.extend(self.buffer.iter().cloned());
TopNComputer {
buffer: buffer_clone,
top_n: self.top_n,
threshold: self.threshold.clone(),
}
}
}
impl<Score, D, const R: bool> From<TopNComputerDeser<Score, D, R>> for TopNComputer<Score, D, R> {
fn from(mut value: TopNComputerDeser<Score, D, R>) -> Self {
let expected_cap = value.top_n.max(1) * 2;
let current_cap = value.buffer.capacity();
if current_cap < expected_cap {
value.buffer.reserve_exact(expected_cap - current_cap);
} else {
value.buffer.shrink_to(expected_cap);
}
TopNComputer {
buffer: value.buffer,
top_n: value.top_n,
threshold: value.threshold,
}
}
}
impl<Score, D, const REVERSE_ORDER: bool> TopNComputer<Score, D, REVERSE_ORDER>
where where
Score: PartialOrd + Clone, Score: PartialOrd + Clone,
D: Serialize + DeserializeOwned + Ord + Clone, DocId: Ord + Clone,
{ {
/// Create a new `TopNComputer`. /// Create a new `TopNComputer`.
/// Internally it will allocate a buffer of size `2 * top_n`. /// Internally it will allocate a buffer of size `2 * top_n`.
@@ -803,15 +748,10 @@ where
} }
} }
/// Push a new document to the top n.
/// If the document is below the current threshold, it will be ignored.
#[inline] #[inline]
pub fn push(&mut self, feature: Score, doc: D) { pub(crate) fn push(&mut self, doc: ComparableDoc<Score, DocId>) {
if let Some(last_median) = self.threshold.clone() { if let Some(last_median) = self.threshold.clone() {
if !REVERSE_ORDER && feature > last_median { if doc.feature < last_median {
return;
}
if REVERSE_ORDER && feature < last_median {
return; return;
} }
} }
@@ -826,7 +766,7 @@ where
let uninit = self.buffer.spare_capacity_mut(); let uninit = self.buffer.spare_capacity_mut();
// This cannot panic, because truncate_median will remove at least one element, since // This cannot panic, because truncate_median will remove at least one element, since
// the min capacity is 2. // the min capacity is 2.
uninit[0].write(ComparableDoc { doc, feature }); uninit[0].write(doc);
// This is safe because the write above would have panicked if there were no spare capacity // This is safe because the write above would have panicked if there were no spare capacity
unsafe { unsafe {
self.buffer.set_len(self.buffer.len() + 1); self.buffer.set_len(self.buffer.len() + 1);
@@ -845,33 +785,20 @@ where
median_score median_score
} }
/// Returns the top n elements in sorted order. pub(crate) fn into_sorted_vec(mut self) -> Vec<ComparableDoc<Score, DocId>> {
pub fn into_sorted_vec(mut self) -> Vec<ComparableDoc<Score, D, REVERSE_ORDER>> {
if self.buffer.len() > self.top_n { if self.buffer.len() > self.top_n {
self.truncate_top_n(); self.truncate_top_n();
} }
self.buffer.sort_unstable(); self.buffer.sort_unstable();
self.buffer self.buffer
} }
/// Returns the top n elements in stored order.
/// Useful if you do not need the elements in sorted order,
/// for example when merging the results of multiple segments.
pub fn into_vec(mut self) -> Vec<ComparableDoc<Score, D, REVERSE_ORDER>> {
if self.buffer.len() > self.top_n {
self.truncate_top_n();
}
self.buffer
}
} }
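// The strategy above in isolation: keep a buffer of capacity `2 * top_n`; when
// it fills up, partition around the n-th largest element, cut the buffer back to
// `top_n`, and use the evicted cutoff as the new pruning threshold. A simplified,
// self-contained sketch (largest-first over plain u64s), not the exact
// implementation:
#[cfg(test)]
#[test]
fn top_n_strategy_sketch() {
    fn top_n_desc(items: impl IntoIterator<Item = u64>, n: usize) -> Vec<u64> {
        if n == 0 {
            return Vec::new();
        }
        let mut buf: Vec<u64> = Vec::with_capacity(2 * n);
        let mut threshold: Option<u64> = None;
        for item in items {
            // Cheap pruning: below the current cutoff can never make the top n.
            if threshold.map_or(false, |t| item < t) {
                continue;
            }
            if buf.len() == buf.capacity() {
                // Move the n largest elements to the front, then drop the rest.
                buf.select_nth_unstable_by(n - 1, |a, b| b.cmp(a));
                buf.truncate(n);
                threshold = buf.last().copied();
            }
            buf.push(item);
        }
        buf.sort_unstable_by(|a, b| b.cmp(a));
        buf.truncate(n);
        buf
    }
    assert_eq!(top_n_desc(0..100, 3), vec![99, 98, 97]);
}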
#[cfg(test)] #[cfg(test)]
mod tests { mod tests {
use proptest::prelude::*;
use super::{TopDocs, TopNComputer}; use super::{TopDocs, TopNComputer};
use crate::collector::top_collector::ComparableDoc; use crate::collector::top_collector::ComparableDoc;
use crate::collector::{Collector, DocSetCollector}; use crate::collector::Collector;
use crate::query::{AllQuery, Query, QueryParser}; use crate::query::{AllQuery, Query, QueryParser};
use crate::schema::{Field, Schema, FAST, STORED, TEXT}; use crate::schema::{Field, Schema, FAST, STORED, TEXT};
use crate::time::format_description::well_known::Rfc3339; use crate::time::format_description::well_known::Rfc3339;
@@ -898,44 +825,49 @@ mod tests {
crate::assert_nearly_equals!(result.0, expected.0); crate::assert_nearly_equals!(result.0, expected.0);
} }
} }
#[test]
fn test_topn_computer_serde() {
let computer: TopNComputer<u32, u32> = TopNComputer::new(1);
let computer_ser = serde_json::to_string(&computer).unwrap();
let mut computer: TopNComputer<u32, u32> = serde_json::from_str(&computer_ser).unwrap();
computer.push(1u32, 5u32);
computer.push(1u32, 0u32);
computer.push(1u32, 7u32);
assert_eq!(
computer.into_sorted_vec(),
&[ComparableDoc {
feature: 1u32,
doc: 0u32,
},]
);
}
#[test] #[test]
fn test_empty_topn_computer() { fn test_empty_topn_computer() {
let mut computer: TopNComputer<u32, u32> = TopNComputer::new(0); let mut computer: TopNComputer<u32, u32> = TopNComputer::new(0);
computer.push(1u32, 1u32); computer.push(ComparableDoc {
computer.push(1u32, 2u32); feature: 1u32,
computer.push(1u32, 3u32); doc: 1u32,
});
computer.push(ComparableDoc {
feature: 1u32,
doc: 2u32,
});
computer.push(ComparableDoc {
feature: 1u32,
doc: 3u32,
});
assert!(computer.into_sorted_vec().is_empty()); assert!(computer.into_sorted_vec().is_empty());
} }
#[test] #[test]
fn test_topn_computer() { fn test_topn_computer() {
let mut computer: TopNComputer<u32, u32> = TopNComputer::new(2); let mut computer: TopNComputer<u32, u32> = TopNComputer::new(2);
computer.push(1u32, 1u32); computer.push(ComparableDoc {
computer.push(2u32, 2u32); feature: 1u32,
computer.push(3u32, 3u32); doc: 1u32,
computer.push(2u32, 4u32); });
computer.push(1u32, 5u32); computer.push(ComparableDoc {
feature: 2u32,
doc: 2u32,
});
computer.push(ComparableDoc {
feature: 3u32,
doc: 3u32,
});
computer.push(ComparableDoc {
feature: 2u32,
doc: 4u32,
});
computer.push(ComparableDoc {
feature: 1u32,
doc: 5u32,
});
assert_eq!( assert_eq!(
computer.into_sorted_vec(), computer.into_sorted_vec(),
&[ &[
@@ -957,50 +889,15 @@ mod tests {
let mut computer: TopNComputer<u32, u32> = TopNComputer::new(top_n); let mut computer: TopNComputer<u32, u32> = TopNComputer::new(top_n);
for _ in 0..1 + top_n * 2 { for _ in 0..1 + top_n * 2 {
computer.push(1u32, 1u32); computer.push(ComparableDoc {
feature: 1u32,
doc: 1u32,
});
} }
let _vals = computer.into_sorted_vec(); let _vals = computer.into_sorted_vec();
} }
} }
proptest! {
#[test]
fn test_topn_computer_asc_prop(
limit in 0..10_usize,
docs in proptest::collection::vec((0..100_u64, 0..100_u64), 0..100_usize),
) {
let mut computer: TopNComputer<_, _, false> = TopNComputer::new(limit);
for (feature, doc) in &docs {
computer.push(*feature, *doc);
}
let mut comparable_docs = docs.into_iter().map(|(feature, doc)| ComparableDoc { feature, doc }).collect::<Vec<_>>();
comparable_docs.sort();
comparable_docs.truncate(limit);
prop_assert_eq!(
computer.into_sorted_vec(),
comparable_docs,
);
}
#[test]
fn test_topn_computer_desc_prop(
limit in 0..10_usize,
docs in proptest::collection::vec((0..100_u64, 0..100_u64), 0..100_usize),
) {
let mut computer: TopNComputer<_, _, true> = TopNComputer::new(limit);
for (feature, doc) in &docs {
computer.push(*feature, *doc);
}
let mut comparable_docs = docs.into_iter().map(|(feature, doc)| ComparableDoc { feature, doc }).collect::<Vec<_>>();
comparable_docs.sort();
comparable_docs.truncate(limit);
prop_assert_eq!(
computer.into_sorted_vec(),
comparable_docs,
);
}
}
#[test] #[test]
fn test_top_collector_not_at_capacity_without_offset() -> crate::Result<()> { fn test_top_collector_not_at_capacity_without_offset() -> crate::Result<()> {
let index = make_index()?; let index = make_index()?;
@@ -1414,29 +1311,4 @@ mod tests {
); );
Ok(()) Ok(())
} }
#[test]
fn test_topn_computer_asc() {
let mut computer: TopNComputer<u32, u32, false> = TopNComputer::new(2);
computer.push(1u32, 1u32);
computer.push(2u32, 2u32);
computer.push(3u32, 3u32);
computer.push(2u32, 4u32);
computer.push(4u32, 5u32);
computer.push(1u32, 6u32);
assert_eq!(
computer.into_sorted_vec(),
&[
ComparableDoc {
feature: 1u32,
doc: 1u32,
},
ComparableDoc {
feature: 1u32,
doc: 6u32,
}
]
);
}
} }

View File

@@ -6,23 +6,24 @@ use std::path::PathBuf;
use std::sync::Arc; use std::sync::Arc;
use super::segment::Segment; use super::segment::Segment;
use super::segment_reader::merge_field_meta_data; use super::IndexSettings;
use super::{FieldMetadata, IndexSettings}; use crate::core::single_segment_index_writer::SingleSegmentIndexWriter;
use crate::core::{Executor, META_FILEPATH}; use crate::core::{
Executor, IndexMeta, SegmentId, SegmentMeta, SegmentMetaInventory, META_FILEPATH,
};
use crate::directory::error::OpenReadError; use crate::directory::error::OpenReadError;
#[cfg(feature = "mmap")] #[cfg(feature = "mmap")]
use crate::directory::MmapDirectory; use crate::directory::MmapDirectory;
use crate::directory::{Directory, ManagedDirectory, RamDirectory, INDEX_WRITER_LOCK}; use crate::directory::{Directory, ManagedDirectory, RamDirectory, INDEX_WRITER_LOCK};
use crate::error::{DataCorruption, TantivyError}; use crate::error::{DataCorruption, TantivyError};
use crate::index::{IndexMeta, SegmentId, SegmentMeta, SegmentMetaInventory};
use crate::indexer::index_writer::{MAX_NUM_THREAD, MEMORY_BUDGET_NUM_BYTES_MIN}; use crate::indexer::index_writer::{MAX_NUM_THREAD, MEMORY_BUDGET_NUM_BYTES_MIN};
use crate::indexer::segment_updater::save_metas; use crate::indexer::segment_updater::save_metas;
use crate::indexer::{IndexWriter, SingleSegmentIndexWriter}; use crate::indexer::IndexWriter;
use crate::reader::{IndexReader, IndexReaderBuilder}; use crate::reader::{IndexReader, IndexReaderBuilder};
use crate::schema::document::Document; use crate::schema::document::Document;
use crate::schema::{Field, FieldType, Schema}; use crate::schema::{Field, FieldType, Schema};
use crate::tokenizer::{TextAnalyzer, TokenizerManager}; use crate::tokenizer::{TextAnalyzer, TokenizerManager};
use crate::SegmentReader; use crate::{merge_field_meta_data, FieldMetadata, SegmentReader};
fn load_metas( fn load_metas(
directory: &dyn Directory, directory: &dyn Directory,
@@ -83,7 +84,7 @@ fn save_new_metas(
/// ///
/// ``` /// ```
/// use tantivy::schema::*; /// use tantivy::schema::*;
/// use tantivy::{Index, IndexSettings}; /// use tantivy::{Index, IndexSettings, IndexSortByField, Order};
/// ///
/// let mut schema_builder = Schema::builder(); /// let mut schema_builder = Schema::builder();
/// let id_field = schema_builder.add_text_field("id", STRING); /// let id_field = schema_builder.add_text_field("id", STRING);
@@ -96,7 +97,10 @@ fn save_new_metas(
/// ///
/// let schema = schema_builder.build(); /// let schema = schema_builder.build();
/// let settings = IndexSettings{ /// let settings = IndexSettings{
/// docstore_blocksize: 100_000, /// sort_by_field: Some(IndexSortByField{
/// field: "number".to_string(),
/// order: Order::Asc
/// }),
/// ..Default::default() /// ..Default::default()
/// }; /// };
/// let index = Index::builder().schema(schema).settings(settings).create_in_ram(); /// let index = Index::builder().schema(schema).settings(settings).create_in_ram();
@@ -319,15 +323,6 @@ impl Index {
Ok(()) Ok(())
} }
/// Replace the search executor with a thread pool shared from an outer thread pool.
pub fn set_shared_multithread_executor(
&mut self,
shared_thread_pool: Arc<Executor>,
) -> crate::Result<()> {
self.executor = shared_thread_pool.clone();
Ok(())
}
/// Replace the default single thread search executor pool /// Replace the default single thread search executor pool
/// by a thread pool with as many threads as there are CPUs on the system. /// by a thread pool with as many threads as there are CPUs on the system.
pub fn set_default_multithread_executor(&mut self) -> crate::Result<()> { pub fn set_default_multithread_executor(&mut self) -> crate::Result<()> {

View File

@@ -7,7 +7,7 @@ use std::sync::Arc;
use serde::{Deserialize, Serialize}; use serde::{Deserialize, Serialize};
use super::SegmentComponent; use super::SegmentComponent;
use crate::index::SegmentId; use crate::core::SegmentId;
use crate::schema::Schema; use crate::schema::Schema;
use crate::store::Compressor; use crate::store::Compressor;
use crate::{Inventory, Opstamp, TrackedObject}; use crate::{Inventory, Opstamp, TrackedObject};
@@ -19,7 +19,7 @@ struct DeleteMeta {
} }
#[derive(Clone, Default)] #[derive(Clone, Default)]
pub(crate) struct SegmentMetaInventory { pub struct SegmentMetaInventory {
inventory: Inventory<InnerSegmentMeta>, inventory: Inventory<InnerSegmentMeta>,
} }
@@ -142,6 +142,7 @@ impl SegmentMeta {
SegmentComponent::FastFields => ".fast".to_string(), SegmentComponent::FastFields => ".fast".to_string(),
SegmentComponent::FieldNorms => ".fieldnorm".to_string(), SegmentComponent::FieldNorms => ".fieldnorm".to_string(),
SegmentComponent::Delete => format!(".{}.del", self.delete_opstamp().unwrap_or(0)), SegmentComponent::Delete => format!(".{}.del", self.delete_opstamp().unwrap_or(0)),
SegmentComponent::FieldList => ".fieldlist".to_string(),
}); });
PathBuf::from(path) PathBuf::from(path)
} }
@@ -288,10 +289,6 @@ impl Default for IndexSettings {
/// Presorting documents can greatly improve performance /// Presorting documents can greatly improve performance
/// in some scenarios, by applying top n /// in some scenarios, by applying top n
/// optimizations. /// optimizations.
#[deprecated(
since = "0.22.0",
note = "We plan to remove index sorting in `0.23`. If you need index sorting, please comment on the related issue https://github.com/quickwit-oss/tantivy/issues/2352 and explain your use case."
)]
#[derive(Clone, Debug, Serialize, Deserialize, Eq, PartialEq)] #[derive(Clone, Debug, Serialize, Deserialize, Eq, PartialEq)]
pub struct IndexSortByField { pub struct IndexSortByField {
/// The field to sort the documents by /// The field to sort the documents by
@@ -412,7 +409,7 @@ impl fmt::Debug for IndexMeta {
mod tests { mod tests {
use super::IndexMeta; use super::IndexMeta;
use crate::index::index_meta::UntrackedIndexMeta; use crate::core::index_meta::UntrackedIndexMeta;
use crate::schema::{Schema, TEXT}; use crate::schema::{Schema, TEXT};
use crate::store::Compressor; use crate::store::Compressor;
#[cfg(feature = "zstd-compression")] #[cfg(feature = "zstd-compression")]

View File

@@ -70,7 +70,7 @@ impl InvertedIndexReader {
&self.termdict &self.termdict
} }
/// Return the fields and types encoded in the dictionary in lexicographic oder. /// Return the fields and types encoded in the dictionary in lexicographic order.
/// Only valid on JSON fields. /// Only valid on JSON fields.
/// ///
/// Notice: This requires a full scan and is therefore **very expensive**. /// Notice: This requires a full scan and is therefore **very expensive**.
@@ -266,9 +266,7 @@ impl InvertedIndexReader {
/// Warmup a block postings given a `Term`. /// Warmup a block postings given a `Term`.
/// This method is for advanced usage only. /// This method is for advanced usage only.
/// pub async fn warm_postings(&self, term: &Term, with_positions: bool) -> io::Result<()> {
/// returns a boolean, whether the term was found in the dictionary
pub async fn warm_postings(&self, term: &Term, with_positions: bool) -> io::Result<bool> {
let term_info_opt: Option<TermInfo> = self.get_term_info_async(term).await?; let term_info_opt: Option<TermInfo> = self.get_term_info_async(term).await?;
if let Some(term_info) = term_info_opt { if let Some(term_info) = term_info_opt {
let postings = self let postings = self
@@ -282,27 +280,23 @@ impl InvertedIndexReader {
} else { } else {
postings.await?; postings.await?;
} }
Ok(true)
} else {
Ok(false)
} }
Ok(())
} }
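// A hypothetical usage sketch (names illustrative): warm up the postings of a
// single term ahead of latency-sensitive searches, from an async context:
//
// let inv_index = segment_reader.inverted_index(field)?;
// let found: bool = inv_index.warm_postings(&term, /* with_positions */ false).await?;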
/// Warmup a block postings given a range of `Term`s. /// Warmup a block postings given a range of `Term`s.
/// This method is for advanced usage only. /// This method is for advanced usage only.
///
/// returns a boolean, whether a term matching the range was found in the dictionary
pub async fn warm_postings_range( pub async fn warm_postings_range(
&self, &self,
terms: impl std::ops::RangeBounds<Term>, terms: impl std::ops::RangeBounds<Term>,
limit: Option<u64>, limit: Option<u64>,
with_positions: bool, with_positions: bool,
) -> io::Result<bool> { ) -> io::Result<()> {
let mut term_info = self.get_term_range_async(terms, limit).await?; let mut term_info = self.get_term_range_async(terms, limit).await?;
let Some(first_terminfo) = term_info.next() else { let Some(first_terminfo) = term_info.next() else {
// no key matches, nothing more to load // no key matches, nothing more to load
return Ok(false); return Ok(());
}; };
let last_terminfo = term_info.last().unwrap_or_else(|| first_terminfo.clone()); let last_terminfo = term_info.last().unwrap_or_else(|| first_terminfo.clone());
@@ -322,7 +316,7 @@ impl InvertedIndexReader {
} else { } else {
postings.await?; postings.await?;
} }
Ok(true) Ok(())
} }
/// Warmup the block postings for all terms. /// Warmup the block postings for all terms.
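The `warm_postings` family loses its `bool` return here. A minimal caller sketch under the new signature; the helper name, word list, and error mapping are mine, and the crate-root re-export paths are assumed:

```rust
use std::io;

use tantivy::schema::Field;
use tantivy::{SegmentReader, Term};

// Warm the postings of a few terms on one segment. With io::Result<()>,
// the "was the term in the dictionary?" signal is gone; callers that need
// it must look the term up separately.
async fn warm_terms(reader: &SegmentReader, field: Field, words: &[&str]) -> io::Result<()> {
    let inverted_index = reader
        .inverted_index(field)
        .map_err(|err| io::Error::new(io::ErrorKind::Other, err))?;
    for word in words {
        let term = Term::from_field_text(field, word);
        // `false`: load block postings only, no position data.
        inverted_index.warm_postings(&term, false).await?;
    }
    Ok(())
}
```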


@@ -1,4 +1,4 @@
-use columnar::MonotonicallyMappableToU64;
+use columnar::{ColumnType, MonotonicallyMappableToU64};
 use common::{replace_in_place, JsonPathWriter};
 use rustc_hash::FxHashMap;
@@ -153,7 +153,7 @@ fn index_json_value<'a, V: Value<'a>>(
 let mut token_stream = text_analyzer.token_stream(val);
 let unordered_id = ctx
 .path_to_unordered_id
-.get_or_allocate_unordered_id(json_path_writer.as_str());
+.get_or_allocate_unordered_id(json_path_writer.as_str(), ColumnType::Str);
 // TODO: make sure the chain position works out.
 set_path_id(term_buffer, unordered_id);
@@ -171,7 +171,7 @@ fn index_json_value<'a, V: Value<'a>>(
 set_path_id(
 term_buffer,
 ctx.path_to_unordered_id
-.get_or_allocate_unordered_id(json_path_writer.as_str()),
+.get_or_allocate_unordered_id(json_path_writer.as_str(), ColumnType::U64),
 );
 term_buffer.append_type_and_fast_value(val);
 postings_writer.subscribe(doc, 0u32, term_buffer, ctx);
@@ -180,7 +180,7 @@ fn index_json_value<'a, V: Value<'a>>(
 set_path_id(
 term_buffer,
 ctx.path_to_unordered_id
-.get_or_allocate_unordered_id(json_path_writer.as_str()),
+.get_or_allocate_unordered_id(json_path_writer.as_str(), ColumnType::I64),
 );
 term_buffer.append_type_and_fast_value(val);
 postings_writer.subscribe(doc, 0u32, term_buffer, ctx);
@@ -189,7 +189,7 @@ fn index_json_value<'a, V: Value<'a>>(
 set_path_id(
 term_buffer,
 ctx.path_to_unordered_id
-.get_or_allocate_unordered_id(json_path_writer.as_str()),
+.get_or_allocate_unordered_id(json_path_writer.as_str(), ColumnType::F64),
 );
 term_buffer.append_type_and_fast_value(val);
 postings_writer.subscribe(doc, 0u32, term_buffer, ctx);
@@ -198,7 +198,7 @@ fn index_json_value<'a, V: Value<'a>>(
 set_path_id(
 term_buffer,
 ctx.path_to_unordered_id
-.get_or_allocate_unordered_id(json_path_writer.as_str()),
+.get_or_allocate_unordered_id(json_path_writer.as_str(), ColumnType::Bool),
 );
 term_buffer.append_type_and_fast_value(val);
 postings_writer.subscribe(doc, 0u32, term_buffer, ctx);
@@ -206,8 +206,10 @@ fn index_json_value<'a, V: Value<'a>>(
 ReferenceValueLeaf::Date(val) => {
 set_path_id(
 term_buffer,
-ctx.path_to_unordered_id
-.get_or_allocate_unordered_id(json_path_writer.as_str()),
+ctx.path_to_unordered_id.get_or_allocate_unordered_id(
+json_path_writer.as_str(),
+ColumnType::DateTime,
+),
 );
 term_buffer.append_type_and_fast_value(val);
 postings_writer.subscribe(doc, 0u32, term_buffer, ctx);
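Every scalar branch now passes its `ColumnType` down when allocating the path id. A minimal, self-contained sketch (not tantivy's implementation) of the bookkeeping this enables: remembering which types were observed under each JSON path, so the field-list file can later enumerate (path, type) pairs per segment (tantivy packs the types into a `TinySet`, as seen in `serialize_segment_fields` below):

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum ColumnType { Str, U64, I64, F64, Bool, DateTime }

#[derive(Default)]
struct PathToUnorderedId {
    ids: HashMap<String, u32>,
    // Observed types per allocated id.
    types: Vec<Vec<ColumnType>>,
}

impl PathToUnorderedId {
    fn get_or_allocate_unordered_id(&mut self, path: &str, typ: ColumnType) -> u32 {
        let next_id = self.ids.len() as u32;
        let id = *self.ids.entry(path.to_string()).or_insert(next_id);
        if self.types.len() <= id as usize {
            self.types.push(Vec::new());
        }
        if !self.types[id as usize].contains(&typ) {
            self.types[id as usize].push(typ);
        }
        id
    }
}

fn main() {
    let mut paths = PathToUnorderedId::default();
    let id = paths.get_or_allocate_unordered_id("attributes.price", ColumnType::U64);
    paths.get_or_allocate_unordered_id("attributes.price", ColumnType::Str);
    // The same path keeps its id; only the type set grows.
    assert_eq!(id, paths.get_or_allocate_unordered_id("attributes.price", ColumnType::U64));
}
```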


@@ -1,14 +1,32 @@
 mod executor;
+pub mod index;
+mod index_meta;
+mod inverted_index_reader;
 #[doc(hidden)]
 pub mod json_utils;
 pub mod searcher;
+mod segment;
+mod segment_component;
+mod segment_id;
+mod segment_reader;
+mod single_segment_index_writer;
 use std::path::Path;
 use once_cell::sync::Lazy;
 pub use self::executor::Executor;
+pub use self::index::{Index, IndexBuilder};
+pub use self::index_meta::{
+IndexMeta, IndexSettings, IndexSortByField, Order, SegmentMeta, SegmentMetaInventory,
+};
+pub use self::inverted_index_reader::InvertedIndexReader;
 pub use self::searcher::{Searcher, SearcherGeneration};
+pub use self::segment::Segment;
+pub use self::segment_component::SegmentComponent;
+pub use self::segment_id::SegmentId;
+pub use self::segment_reader::{merge_field_meta_data, FieldMetadata, SegmentReader};
+pub use self::single_segment_index_writer::SingleSegmentIndexWriter;
 /// The meta file contains all the information about the list of segments and the schema
 /// of the index.


@@ -3,8 +3,7 @@ use std::sync::Arc;
 use std::{fmt, io};
 use crate::collector::Collector;
-use crate::core::Executor;
-use crate::index::SegmentReader;
+use crate::core::{Executor, SegmentReader};
 use crate::query::{Bm25StatisticsProvider, EnableScoring, Query};
 use crate::schema::document::DocumentDeserialize;
 use crate::schema::{Schema, Term};


@@ -2,9 +2,9 @@ use std::fmt;
 use std::path::PathBuf;
 use super::SegmentComponent;
+use crate::core::{Index, SegmentId, SegmentMeta};
 use crate::directory::error::{OpenReadError, OpenWriteError};
 use crate::directory::{Directory, FileSlice, WritePtr};
-use crate::index::{Index, SegmentId, SegmentMeta};
 use crate::schema::Schema;
 use crate::Opstamp;


@@ -27,12 +27,14 @@ pub enum SegmentComponent {
 /// Bitset describing which document of the segment is alive.
 /// (It was representing deleted docs but changed to represent alive docs from v0.17)
 Delete,
+/// Field list describing the fields in the segment.
+FieldList,
 }
 impl SegmentComponent {
 /// Iterates through the components.
 pub fn iterator() -> slice::Iter<'static, SegmentComponent> {
-static SEGMENT_COMPONENTS: [SegmentComponent; 8] = [
+static SEGMENT_COMPONENTS: [SegmentComponent; 9] = [
 SegmentComponent::Postings,
 SegmentComponent::Positions,
 SegmentComponent::FastFields,
@@ -41,6 +43,7 @@ impl SegmentComponent {
 SegmentComponent::Store,
 SegmentComponent::TempStore,
 SegmentComponent::Delete,
+SegmentComponent::FieldList,
 ];
 SEGMENT_COMPONENTS.iter()
 }
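With the new variant wired into `iterator()`, every segment gains one more potential file, mapped to the `.fieldlist` extension by the `relative_path` change at the top of this diff. A short sketch listing the per-segment paths; the crate-root re-export paths are assumed, and `new_segment` merely allocates an empty segment handle:

```rust
use tantivy::{Index, SegmentComponent};

// Print the relative path of every component of a fresh segment,
// the new FieldList (".fieldlist") component included.
fn print_segment_files(index: &Index) {
    let segment = index.new_segment();
    for component in SegmentComponent::iterator() {
        println!("{}", segment.relative_path(*component).display());
    }
}
```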


@@ -1,4 +1,4 @@
-use std::cmp::Ordering;
+use std::cmp::{Ord, Ordering};
 use std::error::Error;
 use std::fmt;
 use std::str::FromStr;


@@ -3,15 +3,14 @@ use std::ops::BitOrAssign;
 use std::sync::{Arc, RwLock};
 use std::{fmt, io};
-use fnv::FnvHashMap;
 use itertools::Itertools;
+use crate::core::{InvertedIndexReader, Segment, SegmentComponent, SegmentId};
 use crate::directory::{CompositeFile, FileSlice};
 use crate::error::DataCorruption;
 use crate::fastfield::{intersect_alive_bitsets, AliveBitSet, FacetReader, FastFieldReaders};
+use crate::field_list::read_split_fields;
 use crate::fieldnorm::{FieldNormReader, FieldNormReaders};
-use crate::index::{InvertedIndexReader, Segment, SegmentComponent, SegmentId};
-use crate::json_utils::json_path_sep_to_dot;
 use crate::schema::{Field, IndexRecordOption, Schema, Type};
 use crate::space_usage::SegmentSpaceUsage;
 use crate::store::StoreReader;
@@ -44,6 +43,7 @@ pub struct SegmentReader {
 fast_fields_readers: FastFieldReaders,
 fieldnorm_readers: FieldNormReaders,
+list_fields_file: Option<FileSlice>, // Optional field list file for backwards compatibility
 store_file: FileSlice,
 alive_bitset_opt: Option<AliveBitSet>,
 schema: Schema,
@@ -153,6 +153,7 @@ impl SegmentReader {
 let termdict_composite = CompositeFile::open(&termdict_file)?;
 let store_file = segment.open_read(SegmentComponent::Store)?;
+let list_fields_file = segment.open_read(SegmentComponent::FieldList).ok();
 crate::fail_point!("SegmentReader::open#middle");
@@ -201,6 +202,7 @@ impl SegmentReader {
 segment_id: segment.id(),
 delete_opstamp: segment.meta().delete_opstamp(),
 store_file,
+list_fields_file,
 alive_bitset_opt,
 positions_composite,
 schema,
@@ -299,87 +301,25 @@ impl SegmentReader {
 /// field that is not indexed nor a fast field but is stored, it is possible for the field
 /// to not be listed.
 pub fn fields_metadata(&self) -> crate::Result<Vec<FieldMetadata>> {
-let mut indexed_fields: Vec<FieldMetadata> = Vec::new();
-let mut map_to_canonical = FnvHashMap::default();
-for (field, field_entry) in self.schema().fields() {
-let field_name = field_entry.name().to_string();
-let is_indexed = field_entry.is_indexed();
-if is_indexed {
-let is_json = field_entry.field_type().value_type() == Type::Json;
-if is_json {
-let inv_index = self.inverted_index(field)?;
-let encoded_fields_in_index = inv_index.list_encoded_fields()?;
-let mut build_path = |field_name: &str, mut json_path: String| {
-// In this case we need to map the potential fast field to the field name
-// accepted by the query parser.
-let create_canonical =
-!field_entry.is_expand_dots_enabled() && json_path.contains('.');
-if create_canonical {
-// Without expand dots enabled dots need to be escaped.
-let escaped_json_path = json_path.replace('.', "\\.");
-let full_path = format!("{}.{}", field_name, escaped_json_path);
-let full_path_unescaped = format!("{}.{}", field_name, &json_path);
-map_to_canonical.insert(full_path_unescaped, full_path.to_string());
-full_path
-} else {
-// With expand dots enabled, we can use '.' instead of '\u{1}'.
-json_path_sep_to_dot(&mut json_path);
-format!("{}.{}", field_name, json_path)
-}
-};
-indexed_fields.extend(
-encoded_fields_in_index
-.into_iter()
-.map(|(name, typ)| (build_path(&field_name, name), typ))
-.map(|(field_name, typ)| FieldMetadata {
-indexed: true,
-stored: false,
-field_name,
-fast: false,
-typ,
-}),
-);
-} else {
-indexed_fields.push(FieldMetadata {
-indexed: true,
-stored: false,
-field_name: field_name.to_string(),
-fast: false,
-typ: field_entry.field_type().value_type(),
-});
-}
-}
+if let Some(list_fields_file) = self.list_fields_file.as_ref() {
+let file = list_fields_file.read_bytes()?;
+let fields_metadata =
+read_split_fields(file)?.collect::<io::Result<Vec<FieldMetadata>>>();
+fields_metadata.map_err(|e| e.into())
+} else {
+// Schema fallback
+Ok(self
+.schema()
+.fields()
+.map(|(_field, entry)| FieldMetadata {
+field_name: entry.name().to_string(),
+typ: entry.field_type().value_type(),
+indexed: entry.is_indexed(),
+stored: entry.is_stored(),
+fast: entry.is_fast(),
+})
+.collect())
 }
-let mut fast_fields: Vec<FieldMetadata> = self
-.fast_fields()
-.columnar()
-.iter_columns()?
-.map(|(mut field_name, handle)| {
-json_path_sep_to_dot(&mut field_name);
-// map to canonical path, to avoid similar but different entries.
-// Eventually we should just accept '.' seperated for all cases.
-let field_name = map_to_canonical
-.get(&field_name)
-.unwrap_or(&field_name)
-.to_string();
-FieldMetadata {
-indexed: false,
-stored: false,
-field_name,
-fast: true,
-typ: Type::from(handle.column_type()),
-}
-})
-.collect();
-// Since the type is encoded differently in the fast field and in the inverted index,
-// the order of the fields is not guaranteed to be the same. Therefore, we sort the fields.
-// If we are sure that the order is the same, we can remove this sort.
-indexed_fields.sort_unstable();
-fast_fields.sort_unstable();
-let merged = merge_field_meta_data(vec![indexed_fields, fast_fields], &self.schema);
-Ok(merged)
 }
 /// Returns the segment id
@@ -515,9 +455,9 @@ impl fmt::Debug for SegmentReader {
 #[cfg(test)]
 mod test {
 use super::*;
-use crate::index::Index;
-use crate::schema::{SchemaBuilder, Term, STORED, TEXT};
-use crate::IndexWriter;
+use crate::core::Index;
+use crate::schema::{Schema, SchemaBuilder, Term, STORED, TEXT};
+use crate::{DocId, FieldMetadata, IndexWriter};
 #[test]
 fn test_merge_field_meta_data_same() {
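The reader side is now a thin dispatch: segments written with a `.fieldlist` file are served from it, while older segments fall back to the (coarser, schema-level) listing. A sketch of the caller view, which is unchanged by this diff; the function name is mine:

```rust
use tantivy::Index;

// Enumerate per-segment field metadata, whether or not the segment
// carries a .fieldlist file.
fn list_fields(index: &Index) -> tantivy::Result<()> {
    let searcher = index.reader()?.searcher();
    for segment_reader in searcher.segment_readers() {
        for field in segment_reader.fields_metadata()? {
            println!(
                "{}: typ={:?} indexed={} stored={} fast={}",
                field.field_name, field.typ, field.indexed, field.stored, field.fast
            );
        }
    }
    Ok(())
}
```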


@@ -137,6 +137,7 @@ mod mmap_specific {
 use tempfile::TempDir;
 use super::*;
+use crate::Directory;
 #[test]
 fn test_index_on_commit_reload_policy_mmap() -> crate::Result<()> {
@@ -423,7 +424,7 @@ fn test_non_text_json_term_freq() {
 json_term_writer.set_fast_value(75u64);
 let postings = inv_idx
 .read_postings(
-json_term_writer.term(),
+&json_term_writer.term(),
 IndexRecordOption::WithFreqsAndPositions,
 )
 .unwrap()
@@ -461,7 +462,7 @@ fn test_non_text_json_term_freq_bitpacked() {
 json_term_writer.set_fast_value(75u64);
 let mut postings = inv_idx
 .read_postings(
-json_term_writer.term(),
+&json_term_writer.term(),
 IndexRecordOption::WithFreqsAndPositions,
 )
 .unwrap()


@@ -1,5 +1,6 @@
 use std::collections::HashMap;
 use std::io::{self, Read, Write};
+use std::iter::ExactSizeIterator;
 use std::ops::Range;
 use common::{BinarySerializable, CountingWriter, HasLen, VInt};


@@ -1,4 +1,5 @@
 use std::io::Write;
+use std::marker::{Send, Sync};
 use std::path::{Path, PathBuf};
 use std::sync::Arc;
 use std::time::Duration;
@@ -39,7 +40,6 @@ impl RetryPolicy {
 /// The `DirectoryLock` is an object that represents a file lock.
 ///
 /// It is associated with a lock file, that gets deleted on `Drop.`
-#[allow(dead_code)]
 pub struct DirectoryLock(Box<dyn Send + Sync + 'static>);
 struct DirectoryLockGuard {


@@ -479,7 +479,6 @@ impl Directory for MmapDirectory {
 let file: File = OpenOptions::new()
 .write(true)
 .create(true) //< if the file does not exist yet, create it.
-.truncate(false)
 .open(full_path)
 .map_err(LockError::wrap_io_error)?;
 if lock.is_blocking {
@@ -674,7 +673,7 @@ mod tests {
 let num_segments = reader.searcher().segment_readers().len();
 assert!(num_segments <= 4);
 let num_components_except_deletes_and_tempstore =
-crate::index::SegmentComponent::iterator().len() - 2;
+crate::core::SegmentComponent::iterator().len() - 2;
 let max_num_mmapped = num_components_except_deletes_and_tempstore * num_segments;
 assert_eventually(|| {
 let num_mmapped = mmap_directory.get_cache_info().mmapped.len();


@@ -85,7 +85,7 @@ impl InnerDirectory {
 self.fs
 .get(path)
 .ok_or_else(|| OpenReadError::FileDoesNotExist(PathBuf::from(path)))
-.cloned()
+.map(Clone::clone)
 }
 fn delete(&mut self, path: &Path) -> result::Result<(), DeleteError> {


@@ -1,6 +1,6 @@
 use std::io::Write;
 use std::mem;
-use std::path::Path;
+use std::path::{Path, PathBuf};
 use std::sync::atomic::Ordering::SeqCst;
 use std::sync::atomic::{AtomicBool, AtomicUsize};
 use std::sync::Arc;


@@ -32,7 +32,6 @@ pub struct WatchCallbackList {
 /// file change is detected.
 #[must_use = "This `WatchHandle` controls the lifetime of the watch and should therefore be used."]
 #[derive(Clone)]
-#[allow(dead_code)]
 pub struct WatchHandle(Arc<WatchCallback>);
 impl WatchHandle {


@@ -9,10 +9,7 @@ use crate::DocId;
 /// to compare `[u32; 4]`.
 pub const TERMINATED: DocId = i32::MAX as u32;
-/// The collect_block method on `SegmentCollector` uses a buffer of this size.
-/// Passed results to `collect_block` will not exceed this size and will be
-/// exactly this size as long as we can fill the buffer.
-pub const COLLECT_BLOCK_BUFFER_LEN: usize = 64;
+pub const BUFFER_LEN: usize = 64;
 /// Represents an iterable set of sorted doc ids.
 pub trait DocSet: Send {
@@ -64,7 +61,7 @@ pub trait DocSet: Send {
 /// This method is only here for specific high-performance
 /// use case where batching. The normal way to
 /// go through the `DocId`'s is to call `.advance()`.
-fn fill_buffer(&mut self, buffer: &mut [DocId; COLLECT_BLOCK_BUFFER_LEN]) -> usize {
+fn fill_buffer(&mut self, buffer: &mut [DocId; BUFFER_LEN]) -> usize {
 if self.doc() == TERMINATED {
 return 0;
 }
@@ -154,7 +151,7 @@ impl<TDocSet: DocSet + ?Sized> DocSet for Box<TDocSet> {
 unboxed.seek(target)
 }
-fn fill_buffer(&mut self, buffer: &mut [DocId; COLLECT_BLOCK_BUFFER_LEN]) -> usize {
+fn fill_buffer(&mut self, buffer: &mut [DocId; BUFFER_LEN]) -> usize {
 let unboxed: &mut TDocSet = self.borrow_mut();
 unboxed.fill_buffer(buffer)
 }
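A sketch of draining a `DocSet` in fixed-size blocks via `fill_buffer`, using the `BUFFER_LEN` name from this branch (upstream calls the same 64-slot constant `COLLECT_BLOCK_BUFFER_LEN`); the import paths assume crate-root re-exports and the helper is illustrative:

```rust
use tantivy::{DocId, DocSet};
use tantivy::docset::BUFFER_LEN;

// Collect all remaining doc ids of a docset, one 64-doc block at a time.
fn drain<D: DocSet>(docset: &mut D) -> Vec<DocId> {
    let mut buffer = [0u32; BUFFER_LEN];
    let mut all_docs = Vec::new();
    loop {
        // `fill_buffer` reports how many slots it filled; anything short
        // of BUFFER_LEN means the docset is exhausted.
        let filled = docset.fill_buffer(&mut buffer);
        all_docs.extend_from_slice(&buffer[..filled]);
        if filled < BUFFER_LEN {
            return all_docs;
        }
    }
}
```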


@@ -79,7 +79,7 @@ mod tests {
 use std::ops::{Range, RangeInclusive};
 use std::path::Path;
-use columnar::StrColumn;
+use columnar::{Column, MonotonicallyMappableToU64, StrColumn};
 use common::{ByteCount, HasLen, TerminatingWrite};
 use once_cell::sync::Lazy;
 use rand::prelude::SliceRandom;


@@ -238,13 +238,17 @@ impl FastFieldsWriter {
 mut self,
 wrt: &mut dyn io::Write,
 doc_id_map_opt: Option<&DocIdMapping>,
-) -> io::Result<()> {
+) -> io::Result<Vec<(String, Type)>> {
 let num_docs = self.num_docs;
 let old_to_new_row_ids =
 doc_id_map_opt.map(|doc_id_mapping| doc_id_mapping.old_to_new_ids());
-self.columnar_writer
+let columns = self
+.columnar_writer
 .serialize(num_docs, old_to_new_row_ids, wrt)?;
-Ok(())
+Ok(columns
+.into_iter()
+.map(|(field_name, column)| (field_name.to_string(), column.into()))
+.collect())
 }
 }

src/field_list/mod.rs (new file, 369 lines)

@@ -0,0 +1,369 @@
//! The list of fields that are stored in a `tantivy` `Index`.
use std::collections::HashSet;
use std::io::{self, ErrorKind, Read};
use columnar::ColumnType;
use common::TinySet;
use fnv::FnvHashMap;
use crate::indexer::path_to_unordered_id::OrderedPathId;
use crate::json_utils::json_path_sep_to_dot;
use crate::postings::IndexingContext;
use crate::schema::{Field, Schema, Type};
use crate::{merge_field_meta_data, FieldMetadata, Term};
#[derive(Debug, PartialEq, Eq, Clone, Copy, Hash)]
pub(crate) struct FieldConfig {
pub typ: Type,
pub indexed: bool,
pub stored: bool,
pub fast: bool,
}
impl FieldConfig {
fn serialize(&self) -> [u8; 2] {
let typ = self.typ.to_code();
let flags = (self.indexed as u8) << 2 | (self.stored as u8) << 1 | (self.fast as u8);
[typ, flags]
}
fn deserialize_from(data: [u8; 2]) -> io::Result<FieldConfig> {
let typ = Type::from_code(data[0]).ok_or_else(|| {
io::Error::new(
ErrorKind::InvalidData,
format!("could not deserialize type {}", data[0]),
)
})?;
let data = data[1];
let indexed = (data & 0b100) != 0;
let stored = (data & 0b010) != 0;
let fast = (data & 0b001) != 0;
Ok(FieldConfig {
typ,
indexed,
stored,
fast,
})
}
}
/// Serializes the split fields.
pub(crate) fn serialize_segment_fields(
ctx: IndexingContext,
wrt: &mut dyn io::Write,
schema: &Schema,
unordered_id_to_ordered_id: &[(OrderedPathId, TinySet)],
mut columns: Vec<(String, Type)>,
) -> crate::Result<()> {
let mut field_list_set: HashSet<(Field, OrderedPathId, TinySet)> = HashSet::default();
let mut encoded_fields = Vec::new();
let mut map_to_canonical = FnvHashMap::default();
// Replace unordered ids by ordered ids to be able to sort
let ordered_id_to_path = ctx.path_to_unordered_id.ordered_id_to_path();
for (key, _addr) in ctx.term_index.iter() {
let field = Term::wrap(key).field();
let field_entry = schema.get_field_entry(field);
if field_entry.field_type().value_type() == Type::Json {
let byte_range_unordered_id = 5..5 + 4;
let unordered_id =
u32::from_be_bytes(key[byte_range_unordered_id.clone()].try_into().unwrap());
let (path_id, typ_code_bitvec) = unordered_id_to_ordered_id[unordered_id as usize];
if !field_list_set.contains(&(field, path_id, typ_code_bitvec)) {
field_list_set.insert((field, path_id, typ_code_bitvec));
let mut build_path = |field_name: &str, mut json_path: String| {
// In this case we need to map the potential fast field to the field name
// accepted by the query parser.
let create_canonical =
!field_entry.is_expand_dots_enabled() && json_path.contains('.');
if create_canonical {
// Without expand dots enabled dots need to be escaped.
let escaped_json_path = json_path.replace('.', "\\.");
let full_path = format!("{}.{}", field_name, escaped_json_path);
let full_path_unescaped = format!("{}.{}", field_name, &json_path);
map_to_canonical.insert(full_path_unescaped, full_path.to_string());
full_path
} else {
// With expand dots enabled, we can use '.' instead of '\u{1}'.
json_path_sep_to_dot(&mut json_path);
format!("{}.{}", field_name, json_path)
}
};
let path = build_path(
field_entry.name(),
ordered_id_to_path[path_id.path_id() as usize].to_string(), /* String::from_utf8(key[5..].to_vec()).unwrap(), */
);
encoded_fields.push((path, typ_code_bitvec));
}
}
}
let mut indexed_fields: Vec<FieldMetadata> = Vec::new();
for (_field, field_entry) in schema.fields() {
let field_name = field_entry.name().to_string();
let is_indexed = field_entry.is_indexed();
let is_json = field_entry.field_type().value_type() == Type::Json;
if is_indexed && !is_json {
indexed_fields.push(FieldMetadata {
indexed: true,
stored: false,
field_name: field_name.to_string(),
fast: false,
typ: field_entry.field_type().value_type(),
});
}
}
for (field_name, field_type_set) in encoded_fields {
for field_type in field_type_set {
let column_type = ColumnType::try_from_code(field_type as u8).unwrap();
indexed_fields.push(FieldMetadata {
indexed: true,
stored: false,
field_name: field_name.to_string(),
fast: false,
typ: Type::from(column_type),
});
}
}
let mut fast_fields: Vec<FieldMetadata> = columns
.iter_mut()
.map(|(field_name, typ)| {
json_path_sep_to_dot(field_name);
// map to canonical path, to avoid similar but different entries.
// Eventually we should just accept '.' separated for all cases.
let field_name = map_to_canonical
.get(field_name)
.unwrap_or(field_name)
.to_string();
FieldMetadata {
indexed: false,
stored: false,
field_name,
fast: true,
typ: *typ,
}
})
.collect();
// Since the type is encoded differently in the fast field and in the inverted index,
// the order of the fields is not guaranteed to be the same. Therefore, we sort the fields.
// If we are sure that the order is the same, we can remove this sort.
indexed_fields.sort_unstable();
fast_fields.sort_unstable();
let merged = merge_field_meta_data(vec![indexed_fields, fast_fields], schema);
let out = serialize_split_fields(&merged);
wrt.write_all(&out)?;
Ok(())
}
/// Serializes the Split fields.
///
/// `fields_metadata` has to be sorted.
pub fn serialize_split_fields(fields_metadata: &[FieldMetadata]) -> Vec<u8> {
// ensure that fields_metadata is strictly sorted.
debug_assert!(fields_metadata.windows(2).all(|w| w[0] < w[1]));
let mut payload = Vec::new();
// Write Num Fields
let length = fields_metadata.len() as u32;
payload.extend_from_slice(&length.to_le_bytes());
for field_metadata in fields_metadata {
write_field(field_metadata, &mut payload);
}
let compression_level = 3;
let payload_compressed = zstd::stream::encode_all(&mut &payload[..], compression_level)
.expect("zstd encoding failed");
let mut out = Vec::new();
// Write Header -- Format Version
let format_version = 1u8;
out.push(format_version);
// Write Payload
out.extend_from_slice(&payload_compressed);
out
}
fn write_field(field_metadata: &FieldMetadata, out: &mut Vec<u8>) {
let field_config = FieldConfig {
typ: field_metadata.typ,
indexed: field_metadata.indexed,
stored: field_metadata.stored,
fast: field_metadata.fast,
};
// Write Config 2 bytes
out.extend_from_slice(&field_config.serialize());
let str_length = field_metadata.field_name.len() as u16;
// Write String length 2 bytes
out.extend_from_slice(&str_length.to_le_bytes());
out.extend_from_slice(field_metadata.field_name.as_bytes());
}
/// Reads a fixed number of bytes into an array and returns the array.
fn read_exact_array<R: Read, const N: usize>(reader: &mut R) -> io::Result<[u8; N]> {
let mut buffer = [0u8; N];
reader.read_exact(&mut buffer)?;
Ok(buffer)
}
/// Reads the Split fields from a zstd compressed stream of bytes
pub fn read_split_fields<R: Read>(
mut reader: R,
) -> io::Result<impl Iterator<Item = io::Result<FieldMetadata>>> {
let format_version = read_exact_array::<_, 1>(&mut reader)?[0];
assert_eq!(format_version, 1);
let reader = zstd::Decoder::new(reader)?;
read_split_fields_from_zstd(reader)
}
fn read_field<R: Read>(reader: &mut R) -> io::Result<FieldMetadata> {
// Read FieldConfig (2 bytes)
let config_bytes = read_exact_array::<_, 2>(reader)?;
let field_config = FieldConfig::deserialize_from(config_bytes)?; // Assuming this returns a Result
// Read field name length and the field name
let name_len = u16::from_le_bytes(read_exact_array::<_, 2>(reader)?) as usize;
let mut data = vec![0; name_len];
reader.read_exact(&mut data)?;
let field_name = String::from_utf8(data).map_err(|err| {
io::Error::new(
ErrorKind::InvalidData,
format!(
"Encountered invalid utf8 when deserializing field name: {}",
err
),
)
})?;
Ok(FieldMetadata {
field_name,
typ: field_config.typ,
indexed: field_config.indexed,
stored: field_config.stored,
fast: field_config.fast,
})
}
/// Reads the Split fields from a stream of bytes
fn read_split_fields_from_zstd<R: Read>(
mut reader: R,
) -> io::Result<impl Iterator<Item = io::Result<FieldMetadata>>> {
let mut num_fields = u32::from_le_bytes(read_exact_array::<_, 4>(&mut reader)?);
Ok(std::iter::from_fn(move || {
if num_fields == 0 {
return None;
}
num_fields -= 1;
Some(read_field(&mut reader))
}))
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn field_config_deser_test() {
let field_config = FieldConfig {
typ: Type::Str,
indexed: true,
stored: false,
fast: true,
};
let serialized = field_config.serialize();
let deserialized = FieldConfig::deserialize_from(serialized).unwrap();
assert_eq!(field_config, deserialized);
}
#[test]
fn write_read_field_test() {
for typ in Type::iter_values() {
let field_metadata = FieldMetadata {
field_name: "test".to_string(),
typ,
indexed: true,
stored: true,
fast: true,
};
let mut out = Vec::new();
write_field(&field_metadata, &mut out);
let deserialized = read_field(&mut &out[..]).unwrap();
assert_eq!(field_metadata, deserialized);
}
let field_metadata = FieldMetadata {
field_name: "test".to_string(),
typ: Type::Str,
indexed: false,
stored: true,
fast: true,
};
let mut out = Vec::new();
write_field(&field_metadata, &mut out);
let deserialized = read_field(&mut &out[..]).unwrap();
assert_eq!(field_metadata, deserialized);
let field_metadata = FieldMetadata {
field_name: "test".to_string(),
typ: Type::Str,
indexed: false,
stored: false,
fast: true,
};
let mut out = Vec::new();
write_field(&field_metadata, &mut out);
let deserialized = read_field(&mut &out[..]).unwrap();
assert_eq!(field_metadata, deserialized);
let field_metadata = FieldMetadata {
field_name: "test".to_string(),
typ: Type::Str,
indexed: true,
stored: false,
fast: false,
};
let mut out = Vec::new();
write_field(&field_metadata, &mut out);
let deserialized = read_field(&mut &out[..]).unwrap();
assert_eq!(field_metadata, deserialized);
}
#[test]
fn write_split_fields_test() {
let fields_metadata = vec![
FieldMetadata {
field_name: "test".to_string(),
typ: Type::Str,
indexed: true,
stored: true,
fast: true,
},
FieldMetadata {
field_name: "test2".to_string(),
typ: Type::Str,
indexed: true,
stored: false,
fast: false,
},
FieldMetadata {
field_name: "test3".to_string(),
typ: Type::U64,
indexed: true,
stored: false,
fast: true,
},
];
let out = serialize_split_fields(&fields_metadata);
let deserialized: Vec<FieldMetadata> = read_split_fields(&mut &out[..])
.unwrap()
.map(|el| el.unwrap())
.collect();
assert_eq!(fields_metadata, deserialized);
}
}
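A standalone worked example of the version-1 layout produced by `serialize_split_fields` above, before zstd compression: a little-endian u32 field count, then per field a 2-byte config (`[type code, flags]` where flags packs `indexed << 2 | stored << 1 | fast`), a little-endian u16 name length, and the raw name bytes. `TYPE_CODE_STR` is a placeholder; the real value comes from `Type::to_code`:

```rust
fn main() {
    const TYPE_CODE_STR: u8 = b's'; // placeholder for Type::Str.to_code()
    let (indexed, stored, fast) = (true, false, true);
    let flags = (indexed as u8) << 2 | (stored as u8) << 1 | (fast as u8);
    let name = "test";

    let mut payload = Vec::new();
    payload.extend_from_slice(&1u32.to_le_bytes()); // num fields
    payload.extend_from_slice(&[TYPE_CODE_STR, flags]); // 2-byte FieldConfig
    payload.extend_from_slice(&(name.len() as u16).to_le_bytes()); // name length
    payload.extend_from_slice(name.as_bytes());

    // flags = 0b101: indexed and fast, not stored.
    assert_eq!(flags, 0b101);
    // The final file is [version: u8 = 1] followed by zstd(payload).
    println!("{payload:?}");
}
```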


@@ -1,22 +0,0 @@
//! # Index Module
//!
//! The `index` module in Tantivy contains core components to read and write indexes.
//!
//! It contains `Index` and `Segment`, where a `Index` consists of one or more `Segment`s.
mod index;
mod index_meta;
mod inverted_index_reader;
mod segment;
mod segment_component;
mod segment_id;
mod segment_reader;
pub use self::index::{Index, IndexBuilder};
pub(crate) use self::index_meta::SegmentMetaInventory;
pub use self::index_meta::{IndexMeta, IndexSettings, IndexSortByField, Order, SegmentMeta};
pub use self::inverted_index_reader::InvertedIndexReader;
pub use self::segment::Segment;
pub use self::segment_component::SegmentComponent;
pub use self::segment_id::SegmentId;
pub use self::segment_reader::{FieldMetadata, SegmentReader};

Some files were not shown because too many files have changed in this diff.