Compare commits

85 Commits

Author SHA1 Message Date
Paul Masurel
0928597a43 Bumping version 2022-10-20 10:32:14 +09:00
Paul Masurel
f72abe9b9c Bugfix position broken.
For a Field with several FieldValues, one of which
contained no token at all, the token position
was reinitialized to 0.

As a result, PhraseQueries could return false positives.
In addition, the computation of the position delta could then
underflow u32 and produce a gigantic delta.

We haven't been able to actually explain the bug in #1629, but it
is assumed that in some corner cases these deltas can cause a panic.

Closes #1629
2022-10-20 10:27:57 +09:00
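The underflow described in this commit message can be illustrated with a minimal sketch. It assumes positions are delta-encoded against the previous position; the function names are hypothetical and this is not tantivy's actual position writer.

```rust
/// Hypothetical delta encoder: each output is `position - previous position`.
/// Wrapping subtraction is used on purpose so the underflow is visible
/// instead of panicking in debug builds.
fn delta_encode_wrapping(positions: &[u32]) -> Vec<u32> {
    let mut prev = 0u32;
    positions
        .iter()
        .map(|&pos| {
            let delta = pos.wrapping_sub(prev);
            prev = pos;
            delta
        })
        .collect()
}

/// Same encoding, but returns `None` as soon as a position is smaller
/// than the previous one, i.e. when the delta would underflow.
fn delta_encode_checked(positions: &[u32]) -> Option<Vec<u32>> {
    let mut prev = 0u32;
    let mut deltas = Vec::with_capacity(positions.len());
    for &pos in positions {
        deltas.push(pos.checked_sub(prev)?);
        prev = pos;
    }
    Some(deltas)
}
```

If the position counter is reset to 0 in the middle of a field, as in the sequence `[3, 7, 0, 2]`, the wrapping variant yields a delta close to `u32::MAX`, while the checked variant reports the problem.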
Paul Masurel
f0a2b1cc44 Bumped tantivy and subcrate versions. 2022-05-25 22:50:33 +09:00
Paul Masurel
fcfdc44c61 Bumped tantivy-grammar version 2022-05-25 21:52:46 +09:00
Paul Masurel
3171f0b9ba Added ZSTD support in CHANGELOG 2022-05-25 21:51:46 +09:00
PSeitz
89e19f14b5 Merge pull request #1374 from kryesh/main
Add Zstd compression support, Make block size configurable via IndexSettings
2022-05-25 07:39:46 +02:00
PSeitz
1a6a1396cd Merge pull request #1376 from saroh/json-example
Add examples to explain default field handling in the json example
2022-05-24 07:09:37 +02:00
saroh
e766375700 remove useless example 2022-05-23 19:49:31 +02:00
PSeitz
496b4a4fdb Update examples/json_field.rs 2022-05-23 12:24:36 +02:00
PSeitz
93cc8498b3 Update examples/json_field.rs 2022-05-23 11:59:42 +02:00
PSeitz
0aa3d63a9f Update examples/json_field.rs 2022-05-23 11:39:45 +02:00
PSeitz
4e2a053b69 Update examples/json_field.rs 2022-05-23 11:27:05 +02:00
Paul Masurel
71c4393ec4 Clippy 2022-05-23 10:20:37 +09:00
saroh
b2e97e266a more examples to explain default field handling 2022-05-21 17:36:39 +02:00
Antoine G
9ee4772140 Fix deps for unicode regex compiling (#1373)
* lint doc warning

* fix regex build
2022-05-20 10:18:44 +09:00
Kryesh
c95013b11e Add zstd-compression feature to github workflow tests 2022-05-19 22:15:18 +10:00
Kryesh
fc045e6bf9 Cleanup imports, remove unneeded error mapping 2022-05-19 10:34:02 +10:00
Kryesh
6837a4d468 Fix bench 2022-05-18 20:35:29 +10:00
Kryesh
0759bf9448 Cleanup zstd structure and serialise to u32 in line with lz4 2022-05-18 20:31:22 +10:00
Kryesh
152e8238d7 Fix silly errors from running tests without feature flag 2022-05-18 19:49:10 +10:00
Kryesh
d4e5b48437 Apply feedback - standardise on u64 and fix correct compression bounds 2022-05-18 19:37:28 +10:00
Kryesh
03040ed81d Add Zstd compression support 2022-05-18 14:04:43 +10:00
Kryesh
aaa22ad225 Make block size configurable to allow for better compression ratios on large documents 2022-05-18 11:13:15 +10:00
Antoine G
3223bdf254 Refactorize PhraseScorer::compute_phrase_match (#1364)
* Refactorize PhraseScorer::compute_phrase_match
* implem optim for slop
2022-05-13 09:57:21 +09:00
dependabot[bot]
cbd06ab189 Update pprof requirement from 0.8.0 to 0.9.0 (#1365)
Updates the requirements on [pprof](https://github.com/tikv/pprof-rs) to permit the latest version.
- [Release notes](https://github.com/tikv/pprof-rs/releases)
- [Changelog](https://github.com/tikv/pprof-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/tikv/pprof-rs/commits)

---
updated-dependencies:
- dependency-name: pprof
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-05-11 11:42:04 +09:00
Paul Masurel
749395bbb8 Added rustdoc for MultiFruit extract function (#1369) 2022-05-11 11:41:39 +09:00
Paul Masurel
617ba1f0c0 Bugfix in the document deserialization. (#1368)
Deserializing a json field does not expect the
end of the document anymore.

This behavior is well documented in serde_json.
https://docs.serde.rs/serde_json/fn.from_reader.html

Closes #1366
2022-05-11 11:38:10 +09:00
Paul Masurel
2f1cd7e7f0 Bugfix in the document deserialization. (#1367)
Deserializing a json field does not expect the
end of the document anymore.

This behavior is well documented in serde_json.
https://docs.serde.rs/serde_json/fn.from_reader.html

Closes #1366
2022-05-11 11:27:04 +09:00
PSeitz
58c0cb5fc4 Merge pull request #1357 from saroh/1302-json-term-writer-API
Expose helpers to generate json field writer terms
2022-05-10 11:02:05 +08:00
PSeitz
7f45a6ac96 allow setting tokenizer manager on index (#1362)
handle json in tokenizer_for_field
2022-05-09 18:15:45 +09:00
saroh
0ade871126 rename constructor to be more explicit 2022-05-06 13:29:07 +02:00
PSeitz
aab65490c9 Merge pull request #1358 from quickwit-oss/fix_docs
add alias shard_size to split_size for quickwit
2022-05-06 18:41:34 +08:00
Pascal Seitz
d77e8de36a flip alias variable name 2022-05-06 17:52:36 +08:00
Pascal Seitz
d11a8cce26 minor docs fix 2022-05-06 17:52:36 +08:00
Pascal Seitz
bc607a921b add alias shard_size split_size for quickwit
improve some docs
2022-05-06 17:52:36 +08:00
Paul Masurel
1273f33338 Fixed comment. 2022-05-06 18:35:25 +09:00
Paul Masurel
e30449743c Shortens blocks' last_key in the SSTable block index. (#1361)
Right now we store the last key of each block in the SSTable index.
This PR replaces the last key with a shorter string that is greater than or
equal to it and still less than the next key.
This property is sufficient to ensure the block index
works properly.

Related to quickwit#1366
2022-05-06 16:29:06 +08:00
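One way to pick such a shorter key is sketched below. This is an illustrative assumption, not the implementation from the PR: `short_separator` is a hypothetical name, keys are compared byte-wise, and the sketch falls back to the full last key when it finds nothing shorter.

```rust
/// Returns a key `sep` with `last_key <= sep < next_key` (byte-wise order),
/// trying to make it shorter than `last_key`.
/// Precondition: `last_key < next_key`.
fn short_separator(last_key: &[u8], next_key: &[u8]) -> Vec<u8> {
    assert!(last_key < next_key, "keys must be strictly increasing");
    // Length of the common prefix of both keys.
    let prefix_len = last_key
        .iter()
        .zip(next_key.iter())
        .take_while(|(a, b)| a == b)
        .count();
    // `last_key` is a prefix of `next_key`: nothing shorter can work.
    if prefix_len == last_key.len() {
        return last_key.to_vec();
    }
    // Here last_key[prefix_len] < next_key[prefix_len], so adding 1 cannot overflow.
    let bumped = last_key[prefix_len] + 1;
    if bumped < next_key[prefix_len] || next_key.len() > prefix_len + 1 {
        // Either strictly between both keys at this byte,
        // or a proper prefix of `next_key` (and therefore smaller than it).
        let mut sep = last_key[..prefix_len].to_vec();
        sep.push(bumped);
        return sep;
    }
    // No shorter candidate found by this simple strategy.
    last_key.to_vec()
}
```

For example, for the keys `search` and `seed` the sketch returns `seb`, which routes lookups to the same block as the full key `search` would.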
Paul Masurel
ed26552296 Minor changes in query parsing for quickwit#1334. (#1356)
Quickwit still relies heavily on generating field names
containing a '.' for nested objects, yet allows
user-defined field names to contain a dot.

In order to reuse the tantivy query parser, we will end up
using quickwit field names directly in tantivy.
Only '.' will be escaped.

This PR makes minor changes in how the tantivy query parser parses
a field name and resolves it to a field.
Some of the new edge-case behavior is hacky.

Closes #1355
2022-05-06 13:20:10 +09:00
Saroh
65d129afbd better function names 2022-05-05 10:12:28 +02:00
Antoine G
386ffab76c Fix documentation regression (#1359)
This breaks the docs on docs.rs, as the type seems to shadow the struct: https://docs.rs/tantivy/latest/tantivy/termdict/type.TermDictionary.html
The regression was introduced by #1293, which may not have been up to date with what was done in #1242.
2022-05-05 14:59:25 +09:00
Pasha Podolsky
57a8d0359c Make FruitHandle and MultiFruit public (#1360)
* Make `FruitHandle` and `MultiFruit` public

* Add docs for `MultiFruit` and `FruitHandle`
2022-05-05 14:58:33 +09:00
Saroh
14cb66ee00 move helper to indexer module 2022-05-04 18:01:57 +02:00
Saroh
9e38343352 expose helpers for json field writer manipulation
closes #1302
2022-05-04 18:01:45 +02:00
PSeitz
944302ae2f Merge pull request #1350 from quickwit-oss/update_edition
update edition
2022-05-04 11:02:52 +02:00
Paul Masurel
be70804d17 Removed AtomicUsize. 2022-05-04 16:45:24 +09:00
PSeitz
a1afc80600 Update src/core/executor.rs
Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-05-04 08:39:44 +02:00
Paul Masurel
02e24fda52 Clippy fix 2022-05-04 12:24:07 +09:00
PSeitz
7e3c0c5392 Merge pull request #1353 from quickwit-oss/fix_docs
minor docs fixes
2022-05-02 07:48:25 +02:00
Pascal Seitz
fdb2524f9e minor docs fixes 2022-05-02 12:26:12 +08:00
Pascal Seitz
4db655ae82 update dependencies, update edition 2022-04-28 22:50:55 +08:00
Pascal Seitz
bb44cc84c4 update dependencies 2022-04-28 20:55:36 +08:00
PSeitz
8c1e1cf1ad Merge pull request #1349 from quickwit-oss/fix_error_message
print whole query on syntax error
2022-04-28 09:31:45 +02:00
Pascal Seitz
b5b16948b0 print whole query on syntax error 2022-04-27 12:48:30 +08:00
PSeitz
c305d3a2a2 Merge pull request #1346 from quickwit-oss/term_agg
term agg
2022-04-26 07:08:07 +02:00
PSeitz
038d234ff1 Merge pull request #1347 from quickwit-oss/query_parser_error
fix query parser error field not found
2022-04-26 07:01:48 +02:00
Pascal Seitz
c45eb9a9fa improve readability, add json test 2022-04-26 11:22:34 +08:00
Pascal Seitz
824d6f96fe return query on parse error 2022-04-22 16:11:36 +08:00
Pascal Seitz
7cf821bac0 fix query parser error field not found 2022-04-22 12:40:00 +08:00
PSeitz
ae83fc8298 bump uuid to 1.0 (#1345) 2022-04-22 10:02:24 +09:00
dependabot[bot]
a7bc361145 Update pprof requirement from 0.7 to 0.8 (#1343)
Updates the requirements on [pprof](https://github.com/tikv/pprof-rs) to permit the latest version.
- [Release notes](https://github.com/tikv/pprof-rs/releases)
- [Changelog](https://github.com/tikv/pprof-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/tikv/pprof-rs/commits)

---
updated-dependencies:
- dependency-name: pprof
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-04-21 09:35:13 +09:00
Pascal Seitz
2805291400 minor fixes 2022-04-20 14:22:44 +08:00
Pascal Seitz
6614a2cba0 fix is_fast for bytes field 2022-04-20 12:02:38 +08:00
Pascal Seitz
6f4d203d1b return error on missing sub aggregation 2022-04-20 11:19:36 +08:00
Pascal Seitz
1be6c6111c support order property on term aggregations
order can be by doc_count, key, or a metric sub_aggregation
2022-04-20 00:34:38 +08:00
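The three ordering targets named in this commit can be pictured with a small standalone sketch. The `Bucket` and `OrderTarget` types below are hypothetical and only model the idea; they are not tantivy's aggregation types, and `metric` stands in for the value of a single metric sub-aggregation.

```rust
/// Hypothetical term bucket: a term, its document count, and one metric value.
#[derive(Debug, Clone, PartialEq)]
struct Bucket {
    key: String,
    doc_count: u64,
    metric: f64,
}

/// What the buckets are ordered by.
#[derive(Debug, Clone, Copy)]
enum OrderTarget {
    DocCount,
    Key,
    Metric,
}

/// Convenience constructor used in the usage example.
fn bucket(key: &str, doc_count: u64, metric: f64) -> Bucket {
    Bucket { key: key.to_string(), doc_count, metric }
}

/// Sorts buckets in place by the chosen target, ascending or descending.
fn sort_buckets(buckets: &mut [Bucket], target: OrderTarget, descending: bool) {
    buckets.sort_by(|a, b| {
        let ord = match target {
            OrderTarget::DocCount => a.doc_count.cmp(&b.doc_count),
            OrderTarget::Key => a.key.cmp(&b.key),
            // total_cmp gives a total order on f64, including NaN.
            OrderTarget::Metric => a.metric.total_cmp(&b.metric),
        };
        if descending { ord.reverse() } else { ord }
    });
}

/// Extracts the keys in their current order.
fn keys(buckets: &[Bucket]) -> Vec<&str> {
    buckets.iter().map(|b| b.key.as_str()).collect()
}
```

Calling `sort_buckets(&mut buckets, OrderTarget::DocCount, true)` puts the most frequent terms first, which is the usual default for a terms aggregation.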
PSeitz
c7c3eab256 Merge pull request #1340 from PSeitz/term_agg
fix collecting term_dict field names
2022-04-18 08:21:27 +02:00
Pascal Seitz
ec69875d15 fix collecting term_dict field names
fix collecting term_dict field names for sub_aggregations, minor refactoring
2022-04-15 17:49:20 +08:00
PSeitz
d832cfcfd8 Merge pull request #1329 from quickwit-oss/term_agg
add term aggregation
2022-04-14 14:45:21 +08:00
Pascal Seitz
ab6b532cc4 add comments 2022-04-14 12:06:36 +08:00
Pascal Seitz
4b6047f7d7 return Option from as_ methods 2022-04-14 10:48:36 +08:00
Pascal Seitz
5ca04beb94 add min_doc_count test 2022-04-13 19:51:18 +08:00
Pascal Seitz
902d05ebec refactor getffreader function 2022-04-13 19:51:18 +08:00
Pascal Seitz
f1b298642a remove unnecessary benchmarks 2022-04-13 19:51:18 +08:00
Pascal Seitz
dd13dedaeb forward errors, remove unwrap 2022-04-13 19:51:18 +08:00
Pascal Seitz
46724b4a05 add segment_size, add get term dict fields, add tests 2022-04-13 19:51:18 +08:00
Pascal Seitz
24432bf523 add term aggregation 2022-04-13 19:51:18 +08:00
PSeitz
31d3bcfff2 Merge pull request #1334 from PSeitz/minor_fixes
fix DateTime naming, fix docs, cleanup
2022-04-13 13:13:57 +08:00
Pascal Seitz
706fbd6886 fix DateTime naming, fix docs, cleanup 2022-04-13 13:01:00 +08:00
PSeitz
8a8a048015 fix coverage (#1335) 2022-04-13 13:47:47 +09:00
dependabot[bot]
c72549cb9a Bump codecov/codecov-action from 2 to 3 (#1328)
Bumps [codecov/codecov-action](https://github.com/codecov/codecov-action) from 2 to 3.
- [Release notes](https://github.com/codecov/codecov-action/releases)
- [Changelog](https://github.com/codecov/codecov-action/blob/master/CHANGELOG.md)
- [Commits](https://github.com/codecov/codecov-action/compare/v2...v3)

---
updated-dependencies:
- dependency-name: codecov/codecov-action
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-04-11 21:26:52 +09:00
PSeitz
d6f803212c Merge pull request #1325 from quickwit-oss/term_agg
fast field on string
2022-04-04 15:34:31 +08:00
Pascal Seitz
dac73537d2 update changelog 2022-04-04 14:15:40 +08:00
Pascal Seitz
bb5254de12 always serialize, use enum as param 2022-04-04 13:50:23 +08:00
Maxim Kraynyuchenko
be5218c2f6 Company Logos were not visible in Dark Theme. (#1326) 2022-04-04 11:53:31 +09:00
Pascal Seitz
ec9478830a add text test
move get multiple values to test code
remove sorting term ids per docid for non facets
2022-03-30 11:31:33 +08:00
Pascal Seitz
8807bfd13d fast field on string
enables FAST on string fields, which creates a fastfield containing the term ordinals
2022-03-29 12:40:10 +08:00
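The idea of a fast field holding term ordinals can be sketched as follows. This is a simplified model under stated assumptions: one untokenized string value per document, a sorted dictionary of unique terms, and a hypothetical function name. It does not reproduce tantivy's on-disk format.

```rust
use std::collections::BTreeSet;

/// Builds a sorted term dictionary and, for each document, the ordinal
/// (index in the dictionary) of its term.
fn build_term_ordinals(doc_terms: &[&str]) -> (Vec<String>, Vec<u64>) {
    // Sorted, deduplicated terms: the position in this vector is the ordinal.
    let dictionary: Vec<String> = doc_terms
        .iter()
        .map(|&term| term.to_string())
        .collect::<BTreeSet<String>>()
        .into_iter()
        .collect();
    // One ordinal per document, in document order.
    let ordinals: Vec<u64> = doc_terms
        .iter()
        .map(|&term| {
            dictionary
                .binary_search_by(|probe| probe.as_str().cmp(term))
                .expect("every term was inserted into the dictionary") as u64
        })
        .collect();
    (dictionary, ordinals)
}
```

Storing small integers per document instead of strings is what makes column-style access cheap; an aggregation can count ordinals and resolve them back to terms through the dictionary only at the end.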
94 changed files with 3967 additions and 1881 deletions

View File

@@ -13,12 +13,11 @@ jobs:
- uses: actions/checkout@v3
- name: Install Rust
run: rustup toolchain install nightly --component llvm-tools-preview
- name: Install cargo-llvm-cov
run: curl -LsSf https://github.com/taiki-e/cargo-llvm-cov/releases/latest/download/cargo-llvm-cov-x86_64-unknown-linux-gnu.tar.gz | tar xzf - -C ~/.cargo/bin
- uses: taiki-e/install-action@cargo-llvm-cov
- name: Generate code coverage
run: cargo llvm-cov --all-features --workspace --lcov --output-path lcov.info
run: cargo +nightly llvm-cov --all-features --workspace --lcov --output-path lcov.info
- name: Upload coverage to Codecov
uses: codecov/codecov-action@v2
uses: codecov/codecov-action@v3
with:
token: ${{ secrets.CODECOV_TOKEN }} # not required for public repos
files: lcov.info

View File

@@ -33,7 +33,7 @@ jobs:
components: rustfmt, clippy
- name: Run tests
run: cargo +stable test --features mmap,brotli-compression,lz4-compression,snappy-compression,failpoints --verbose --workspace
run: cargo +stable test --features mmap,brotli-compression,lz4-compression,snappy-compression,zstd-compression,failpoints --verbose --workspace
- name: Run tests quickwit feature
run: cargo +stable test --features mmap,quickwit,failpoints --verbose --workspace

View File

@@ -1,4 +1,8 @@
Unreleased
Tantivy 0.18.1
================================
- Hotfix: positions computation. #1629 (@fmassot, @fulmicoton, @PSeitz)
Tantivy 0.18
================================
- For date values `chrono` has been replaced with `time` (@uklotzde) #1304 :
- The `time` crate is re-exported as `tantivy::time` instead of `tantivy::chrono`.
@@ -8,6 +12,10 @@ Unreleased
- Converting a `time::OffsetDateTime` to `Value::Date` implicitly converts the value into UTC.
If this is not desired do the time zone conversion yourself and use `time::PrimitiveDateTime`
directly instead.
- Add [histogram](https://github.com/quickwit-oss/tantivy/pull/1306) aggregation (@PSeitz)
- Add support for fastfield on text fields (@PSeitz)
- Add terms aggregation (@PSeitz)
- Add support for zstd compression (@kryesh)
Tantivy 0.17
================================
@@ -19,7 +27,7 @@ Tantivy 0.17
- Schema now offers not indexing fieldnorms (@lpouget) [#922](https://github.com/quickwit-oss/tantivy/issues/922)
- Reduce the number of fsync calls [#1225](https://github.com/quickwit-oss/tantivy/issues/1225)
- Fix opening bytes index with dynamic codec (@PSeitz) [#1278](https://github.com/quickwit-oss/tantivy/issues/1278)
- Added an aggregation collector compatible with Elasticsearch (@PSeitz)
- Added an aggregation collector for range, average and stats compatible with Elasticsearch. (@PSeitz)
- Added a JSON schema type @fulmicoton [#1251](https://github.com/quickwit-oss/tantivy/issues/1251)
- Added support for slop in phrase queries @halvorboe [#1068](https://github.com/quickwit-oss/tantivy/issues/1068)

View File

@@ -1,6 +1,6 @@
[package]
name = "tantivy"
version = "0.17.0"
version = "0.18.1"
authors = ["Paul Masurel <paul.masurel@gmail.com>"]
license = "MIT"
categories = ["database-implementations", "data-structures"]
@@ -10,71 +10,72 @@ homepage = "https://github.com/quickwit-oss/tantivy"
repository = "https://github.com/quickwit-oss/tantivy"
readme = "README.md"
keywords = ["search", "information", "retrieval"]
edition = "2018"
edition = "2021"
[dependencies]
oneshot = "0.1"
base64 = "0.13"
oneshot = "0.1.3"
base64 = "0.13.0"
byteorder = "1.4.3"
crc32fast = "1.2.1"
once_cell = "1.7.2"
regex ={ version = "1.5.4", default-features = false, features = ["std"] }
tantivy-fst = "0.3"
memmap2 = {version = "0.5", optional=true}
lz4_flex = { version = "0.9", default-features = false, features = ["checked-decode"], optional = true }
brotli = { version = "3.3", optional = true }
crc32fast = "1.3.2"
once_cell = "1.10.0"
regex = { version = "1.5.5", default-features = false, features = ["std", "unicode"] }
tantivy-fst = "0.3.0"
memmap2 = { version = "0.5.3", optional = true }
lz4_flex = { version = "0.9.2", default-features = false, features = ["checked-decode"], optional = true }
brotli = { version = "3.3.4", optional = true }
zstd = { version = "0.11", optional = true }
snap = { version = "1.0.5", optional = true }
tempfile = { version = "3.2", optional = true }
log = "0.4.14"
serde = { version = "1.0.126", features = ["derive"] }
serde_json = "1.0.64"
num_cpus = "1.13"
tempfile = { version = "3.3.0", optional = true }
log = "0.4.16"
serde = { version = "1.0.136", features = ["derive"] }
serde_json = "1.0.79"
num_cpus = "1.13.1"
fs2={ version = "0.4.3", optional = true }
levenshtein_automata = "0.2"
uuid = { version = "0.8.2", features = ["v4", "serde"] }
crossbeam = "0.8.1"
tantivy-query-grammar = { version="0.15.0", path="./query-grammar" }
tantivy-bitpacker = { version="0.1", path="./bitpacker" }
common = { version = "0.2", path = "./common/", package = "tantivy-common" }
fastfield_codecs = { version="0.1", path="./fastfield_codecs", default-features = false }
ownedbytes = { version="0.2", path="./ownedbytes" }
stable_deref_trait = "1.2"
rust-stemmers = "1.2"
downcast-rs = "1.2"
levenshtein_automata = "0.2.1"
uuid = { version = "1.0.0", features = ["v4", "serde"] }
crossbeam-channel = "0.5.4"
tantivy-query-grammar = { version="0.18.0", path="./query-grammar" }
tantivy-bitpacker = { version="0.2", path="./bitpacker" }
common = { version = "0.3", path = "./common/", package = "tantivy-common" }
fastfield_codecs = { version="0.2", path="./fastfield_codecs", default-features = false }
ownedbytes = { version="0.3", path="./ownedbytes" }
stable_deref_trait = "1.2.0"
rust-stemmers = "1.2.0"
downcast-rs = "1.2.0"
bitpacking = { version = "0.8.4", default-features = false, features = ["bitpacker4x"] }
census = "0.4"
census = "0.4.0"
fnv = "1.0.7"
thiserror = "1.0.24"
thiserror = "1.0.30"
htmlescape = "0.3.1"
fail = "0.5"
murmurhash32 = "0.2"
time = { version = "0.3.7", features = ["serde-well-known"] }
smallvec = "1.6.1"
rayon = "1.5"
lru = "0.7.0"
fastdivide = "0.4"
itertools = "0.10.0"
measure_time = "0.8.0"
pretty_assertions = "1.1.0"
serde_cbor = {version="0.11", optional=true}
async-trait = "0.1"
fail = "0.5.0"
murmurhash32 = "0.2.0"
time = { version = "0.3.9", features = ["serde-well-known"] }
smallvec = "1.8.0"
rayon = "1.5.2"
lru = "0.7.5"
fastdivide = "0.4.0"
itertools = "0.10.3"
measure_time = "0.8.2"
pretty_assertions = "1.2.1"
serde_cbor = { version = "0.11.2", optional = true }
async-trait = "0.1.53"
[target.'cfg(windows)'.dependencies]
winapi = "0.3.9"
[dev-dependencies]
rand = "0.8.3"
rand = "0.8.5"
maplit = "1.0.2"
matches = "0.1.8"
proptest = "1.0"
matches = "0.1.9"
proptest = "1.0.0"
criterion = "0.3.5"
test-log = "0.2.8"
test-log = "0.2.10"
env_logger = "0.9.0"
pprof = {version= "0.7", features=["flamegraph", "criterion"]}
futures = "0.3.15"
pprof = { version = "0.9.0", features = ["flamegraph", "criterion"] }
futures = "0.3.21"
[dev-dependencies.fail]
version = "0.5"
version = "0.5.0"
features = ["failpoints"]
[profile.release]
@@ -93,6 +94,7 @@ mmap = ["fs2", "tempfile", "memmap2"]
brotli-compression = ["brotli"]
lz4-compression = ["lz4_flex"]
snappy-compression = ["snap"]
zstd-compression = ["zstd"]
failpoints = ["fail/failpoints"]
unstable = [] # useful for benches.

View File

@@ -128,10 +128,13 @@ $ gdb run
# Companies Using Tantivy
<p align="left">
<img align="center" src="doc/assets/images/Nuclia.png" alt="Nuclia" height="25" width="auto" /> &nbsp;
<img align="center" src="doc/assets/images/humanfirst.png" alt="Humanfirst.ai" height="30" width="auto" />&nbsp;
<img align="center" src="doc/assets/images/element.io.svg" alt="Element.io" height="25" width="auto" />
</p>
<img align="center" src="doc/assets/images/Nuclia.png#gh-light-mode-only" alt="Nuclia" height="25" width="auto" /> &nbsp;
<img align="center" src="doc/assets/images/humanfirst.png#gh-light-mode-only" alt="Humanfirst.ai" height="30" width="auto" />
<img align="center" src="doc/assets/images/element.io.svg#gh-light-mode-only" alt="Element.io" height="25" width="auto" />
<img align="center" src="doc/assets/images/nuclia-dark-theme.png#gh-dark-mode-only" alt="Nuclia" height="35" width="auto" /> &nbsp;
<img align="center" src="doc/assets/images/humanfirst.ai-dark-theme.png#gh-dark-mode-only" alt="Humanfirst.ai" height="25" width="auto" />&nbsp; &nbsp;
<img align="center" src="doc/assets/images/element-dark-theme.png#gh-dark-mode-only" alt="Element.io" height="25" width="auto" />
</p>
# FAQ

View File

@@ -1,6 +1,6 @@
[package]
name = "tantivy-bitpacker"
version = "0.1.1"
version = "0.2.0"
edition = "2018"
authors = ["Paul Masurel <paul.masurel@gmail.com>"]
license = "MIT"

View File

@@ -1,6 +1,6 @@
[package]
name = "tantivy-common"
version = "0.2.0"
version = "0.3.0"
authors = ["Paul Masurel <paul@quickwit.io>", "Pascal Seitz <pascal@quickwit.io>"]
license = "MIT"
edition = "2018"
@@ -10,7 +10,7 @@ description = "common traits and utility functions used by multiple tantivy subc
[dependencies]
byteorder = "1.4.3"
ownedbytes = { version="0.2", path="../ownedbytes" }
ownedbytes = { version="0.3", path="../ownedbytes" }
[dev-dependencies]
proptest = "1.0.0"

Binary file not shown. Size: 56 KiB

Binary file not shown. Size: 23 KiB

Binary file not shown. Size: 7.8 KiB

View File

@@ -122,7 +122,7 @@ fn main() -> tantivy::Result<()> {
let searcher = reader.searcher();
let agg_res: AggregationResults = searcher.search(&term_query, &collector).unwrap();
let res: Value = serde_json::from_str(&serde_json::to_string(&agg_res)?)?;
let res: Value = serde_json::to_value(&agg_res)?;
println!("{}", serde_json::to_string_pretty(&res)?);
Ok(())

View File

@@ -1,7 +1,8 @@
// # Json field example
//
// This example shows how the json field can be used
// to make tantivy partially schemaless.
// to make tantivy partially schemaless by setting it as
// default query parser field.
use tantivy::collector::{Count, TopDocs};
use tantivy::query::QueryParser;
@@ -10,10 +11,6 @@ use tantivy::Index;
fn main() -> tantivy::Result<()> {
// # Defining the schema
//
// We need two fields:
// - a timestamp
// - a json object field
let mut schema_builder = Schema::builder();
schema_builder.add_date_field("timestamp", FAST | STORED);
let event_type = schema_builder.add_text_field("event_type", STRING | STORED);
@@ -43,7 +40,8 @@ fn main() -> tantivy::Result<()> {
"attributes": {
"target": "submit-button",
"cart": {"product_id": 133},
"description": "das keyboard"
"description": "das keyboard",
"event_type": "holiday-sale"
}
}"#,
)?;
@@ -53,6 +51,9 @@ fn main() -> tantivy::Result<()> {
let reader = index.reader()?;
let searcher = reader.searcher();
// # Default fields: event_type and attributes
// By setting attributes as a default field it allows omitting attributes itself, e.g. "target",
// instead of "attributes.target"
let query_parser = QueryParser::for_index(&index, vec![event_type, attributes]);
{
let query = query_parser.parse_query("target:submit-button")?;
@@ -70,10 +71,34 @@ fn main() -> tantivy::Result<()> {
assert_eq!(count_docs, 1);
}
{
let query = query_parser
.parse_query("event_type:click AND cart.product_id:133")
.unwrap();
let hits = searcher.search(&*query, &TopDocs::with_limit(2)).unwrap();
let query = query_parser.parse_query("click AND cart.product_id:133")?;
let hits = searcher.search(&*query, &TopDocs::with_limit(2))?;
assert_eq!(hits.len(), 1);
}
{
// The sub-fields in the json field marked as default field still need to be explicitly
// addressed
let query = query_parser.parse_query("click AND 133")?;
let hits = searcher.search(&*query, &TopDocs::with_limit(2))?;
assert_eq!(hits.len(), 0);
}
{
// Default json fields are ignored if they collide with the schema
let query = query_parser.parse_query("event_type:holiday-sale")?;
let hits = searcher.search(&*query, &TopDocs::with_limit(2))?;
assert_eq!(hits.len(), 0);
}
// # Query via full attribute path
{
// This only searches in our schema's `event_type` field
let query = query_parser.parse_query("event_type:click")?;
let hits = searcher.search(&*query, &TopDocs::with_limit(2))?;
assert_eq!(hits.len(), 2);
}
{
// Default json fields can still be accessed by full path
let query = query_parser.parse_query("attributes.event_type:holiday-sale")?;
let hits = searcher.search(&*query, &TopDocs::with_limit(2))?;
assert_eq!(hits.len(), 1);
}
Ok(())

View File

@@ -1 +0,0 @@
datasets/

View File

@@ -1,14 +1,16 @@
[package]
name = "fastfield_codecs"
version = "0.1.0"
version = "0.2.0"
authors = ["Pascal Seitz <pascal@quickwit.io>"]
license = "MIT"
edition = "2018"
description = "Fast field codecs used by tantivy"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
common = { version = "0.2", path = "../common/", package = "tantivy-common" }
tantivy-bitpacker = { version="0.1.1", path = "../bitpacker/" }
common = { version = "0.3", path = "../common/", package = "tantivy-common" }
tantivy-bitpacker = { version="0.2", path = "../bitpacker/" }
prettytable-rs = {version="0.8.0", optional= true}
rand = {version="0.8.3", optional= true}
@@ -17,6 +19,6 @@ more-asserts = "0.2.1"
rand = "0.8.3"
[features]
unstable = [] # useful for benches and experimental codecs.
bin = ["prettytable-rs", "rand"]
default = ["bin"]

View File

@@ -1,6 +0,0 @@
DATASETS ?= hdfs_logs_timestamps http_logs_timestamps amazon_reviews_product_ids nooc_temperatures
download:
@echo "--- Downloading datasets ---"
mkdir -p datasets
@for dataset in $(DATASETS); do curl -o - https://quickwit-datasets-public.s3.amazonaws.com/benchmarks/fastfields/$$dataset.txt.gz | gunzip > datasets/$$dataset.txt; done

View File

@@ -13,10 +13,6 @@ A codec needs to implement 2 traits:
- A reader implementing `FastFieldCodecReader` to read the codec.
- A serializer implementing `FastFieldCodecSerializer` for compression estimation and codec name + id.
### Download real world datasets for codecs comparison
Before comparing codecs, you need to execute `make download` to download real world datasets hosted on AWS S3.
To run with the unstable codecs, execute `cargo run --features unstable`.
### Tests
Once the traits are implemented test and benchmark integration is pretty easy (see `test_with_codec_data_sets` and `bench.rs`).
@@ -27,101 +23,46 @@ cargo run --features bin
```
### TODO
- Add real world data sets in comparison
- Add codec to cover sparse data sets
### Codec Comparison
```
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| | Compression ratio | Compression ratio estimation | Compression time (micro) | Reading time (micro) |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| Autoincrement | | | | |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| PiecewiseLinear | 0.0051544965 | 0.17251475 | 960 | 211 |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| FOR | 0.118189104 | 0.14172314 | 708 | 212 |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| Bitpacked | 0.28126493 | 0.28125 | 474 | 112 |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| Monotonically increasing concave | | | | |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| PiecewiseLinear | 0.005955 | 0.18813984 | 885 | 211 |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| FOR | 0.16113 | 0.15734828 | 704 | 212 |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| Bitpacked | 0.31251436 | 0.3125 | 478 | 113 |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| Monotonically increasing convex | | | | |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| PiecewiseLinear | 0.00613 | 0.20376484 | 889 | 211 |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| FOR | 0.157175 | 0.17297328 | 706 | 212 |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| Bitpacked | 0.31251436 | 0.3125 | 471 | 113 |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| Almost monotonically increasing | | | | |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| PiecewiseLinear | 0.14549863 | 0.17251475 | 923 | 210 |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| FOR | 0.14943957 | 0.15734814 | 703 | 211 |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| Bitpacked | 0.28126493 | 0.28125 | 462 | 112 |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| Random | | | | |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| PiecewiseLinear | 0.14533783 | 0.14126475 | 924 | 211 |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| FOR | 0.13381402 | 0.15734814 | 695 | 211 |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| Bitpacked | 0.12501445 | 0.125 | 422 | 112 |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| HDFS logs timestamps | | | | |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| PiecewiseLinear | 0.39826187 | 0.4068908 | 5545 | 1086 |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| FOR | 0.39214826 | 0.40734857 | 5082 | 1073 |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| Bitpacked | 0.39062786 | 0.390625 | 2864 | 567 |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| HDFS logs timestamps SORTED | | | | |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| PiecewiseLinear | 0.032736875 | 0.094390824 | 4942 | 1067 |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| FOR | 0.02667125 | 0.079223566 | 3626 | 994 |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| Bitpacked | 0.39062786 | 0.390625 | 2493 | 566 |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| HTTP logs timestamps SORTED | | | | |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| PiecewiseLinear | 0.047942877 | 0.20376582 | 5121 | 1065 |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| FOR | 0.06637425 | 0.18859856 | 3929 | 1093 |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| Bitpacked | 0.26562786 | 0.265625 | 2221 | 526 |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| Amazon review product ids | | | | |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| PiecewiseLinear | 0.41900787 | 0.4225158 | 5239 | 1089 |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| FOR | 0.41504425 | 0.43859857 | 4158 | 1052 |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| Bitpacked | 0.40625286 | 0.40625 | 2603 | 513 |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| Amazon review product ids SORTED | | | | |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| PiecewiseLinear | 0.18364687 | 0.25064084 | 5036 | 990 |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| FOR | 0.21239226 | 0.21984856 | 4087 | 1072 |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| Bitpacked | 0.40625286 | 0.40625 | 2702 | 525 |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| Temperatures | | | | |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| PiecewiseLinear | | Codec Disabled | 0 | 0 |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| FOR | 1.0088086 | 1.001098 | 1306 | 237 |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
| Bitpacked | 1.000012 | 1 | 950 | 108 |
+----------------------------------+-------------------+------------------------------+--------------------------+----------------------+
+----------------------------------+-------------------+------------------------+
| | Compression Ratio | Compression Estimation |
+----------------------------------+-------------------+------------------------+
| Autoincrement | | |
+----------------------------------+-------------------+------------------------+
| LinearInterpol | 0.000039572664 | 0.000004396963 |
+----------------------------------+-------------------+------------------------+
| MultiLinearInterpol | 0.1477348 | 0.17275847 |
+----------------------------------+-------------------+------------------------+
| Bitpacked | 0.28126493 | 0.28125 |
+----------------------------------+-------------------+------------------------+
| Monotonically increasing concave | | |
+----------------------------------+-------------------+------------------------+
| LinearInterpol | 0.25003937 | 0.26562938 |
+----------------------------------+-------------------+------------------------+
| MultiLinearInterpol | 0.190665 | 0.1883836 |
+----------------------------------+-------------------+------------------------+
| Bitpacked | 0.31251436 | 0.3125 |
+----------------------------------+-------------------+------------------------+
| Monotonically increasing convex | | |
+----------------------------------+-------------------+------------------------+
| LinearInterpol | 0.25003937 | 0.28125438 |
+----------------------------------+-------------------+------------------------+
| MultiLinearInterpol | 0.18676 | 0.2040086 |
+----------------------------------+-------------------+------------------------+
| Bitpacked | 0.31251436 | 0.3125 |
+----------------------------------+-------------------+------------------------+
| Almost monotonically increasing | | |
+----------------------------------+-------------------+------------------------+
| LinearInterpol | 0.14066513 | 0.1562544 |
+----------------------------------+-------------------+------------------------+
| MultiLinearInterpol | 0.16335973 | 0.17275847 |
+----------------------------------+-------------------+------------------------+
| Bitpacked | 0.28126493 | 0.28125 |
+----------------------------------+-------------------+------------------------+
```

@@ -5,8 +5,11 @@ extern crate test;
#[cfg(test)]
mod tests {
use fastfield_codecs::bitpacked::{BitpackedFastFieldReader, BitpackedFastFieldSerializer};
use fastfield_codecs::piecewise_linear::{
PiecewiseLinearFastFieldReader, PiecewiseLinearFastFieldSerializer,
use fastfield_codecs::linearinterpol::{
LinearInterpolFastFieldReader, LinearInterpolFastFieldSerializer,
};
use fastfield_codecs::multilinearinterpol::{
MultiLinearInterpolFastFieldReader, MultiLinearInterpolFastFieldSerializer,
};
use fastfield_codecs::*;
@@ -67,9 +70,14 @@ mod tests {
bench_create::<BitpackedFastFieldSerializer>(b, &data);
}
#[bench]
fn bench_fastfield_piecewise_linear_create(b: &mut Bencher) {
fn bench_fastfield_linearinterpol_create(b: &mut Bencher) {
let data: Vec<_> = get_data();
bench_create::<PiecewiseLinearFastFieldSerializer>(b, &data);
bench_create::<LinearInterpolFastFieldSerializer>(b, &data);
}
#[bench]
fn bench_fastfield_multilinearinterpol_create(b: &mut Bencher) {
let data: Vec<_> = get_data();
bench_create::<MultiLinearInterpolFastFieldSerializer>(b, &data);
}
#[bench]
fn bench_fastfield_bitpack_get(b: &mut Bencher) {
@@ -77,9 +85,16 @@ mod tests {
bench_get::<BitpackedFastFieldSerializer, BitpackedFastFieldReader>(b, &data);
}
#[bench]
fn bench_fastfield_piecewise_linear_get(b: &mut Bencher) {
fn bench_fastfield_linearinterpol_get(b: &mut Bencher) {
let data: Vec<_> = get_data();
bench_get::<PiecewiseLinearFastFieldSerializer, PiecewiseLinearFastFieldReader>(b, &data);
bench_get::<LinearInterpolFastFieldSerializer, LinearInterpolFastFieldReader>(b, &data);
}
#[bench]
fn bench_fastfield_multilinearinterpol_get(b: &mut Bencher) {
let data: Vec<_> = get_data();
bench_get::<MultiLinearInterpolFastFieldSerializer, MultiLinearInterpolFastFieldReader>(
b, &data,
);
}
pub fn stats_from_vec(data: &[u64]) -> FastFieldStats {
let min_value = data.iter().cloned().min().unwrap_or(0);

@@ -128,10 +128,7 @@ impl FastFieldCodecSerializer for BitpackedFastFieldSerializer {
) -> bool {
true
}
fn estimate_compression_ratio(
_fastfield_accessor: &impl FastFieldDataAccess,
stats: FastFieldStats,
) -> f32 {
fn estimate(_fastfield_accessor: &impl FastFieldDataAccess, stats: FastFieldStats) -> f32 {
let amplitude = stats.max_value - stats.min_value;
let num_bits = compute_num_bits(amplitude);
let num_bits_uncompressed = 64;
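The bitpacked estimate in this hunk depends only on the value range. A minimal sketch of that arithmetic follows. `num_bits_for` is a hypothetical, simplified stand-in for `tantivy_bitpacker::compute_num_bits`, and the real function may round differently for very large amplitudes.

```rust
/// Number of bits needed to represent `amplitude` (0 needs 0 bits).
/// Hypothetical stand-in for `compute_num_bits`, not the library function.
fn num_bits_for(amplitude: u64) -> u8 {
    (64 - amplitude.leading_zeros()) as u8
}

/// Estimated ratio of the bitpacked size to the raw size,
/// where every raw value takes 64 bits.
fn bitpacked_ratio(min_value: u64, max_value: u64) -> f32 {
    let amplitude = max_value - min_value;
    num_bits_for(amplitude) as f32 / 64.0
}
```

Under this sketch a 25-bit range gives 25 / 64 = 0.390625, which matches the Bitpacked estimate reported for the HDFS timestamps in the benchmark table.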

@@ -1,272 +0,0 @@
use std::io::{self, Read, Write};
use common::{BinarySerializable, DeserializeFrom};
use tantivy_bitpacker::{compute_num_bits, BitPacker, BitUnpacker};
use crate::{FastFieldCodecReader, FastFieldCodecSerializer, FastFieldDataAccess, FastFieldStats};
const BLOCK_SIZE: u64 = 128;
#[derive(Clone)]
pub struct FORFastFieldReader {
num_vals: u64,
min_value: u64,
max_value: u64,
block_readers: Vec<BlockReader>,
}
#[derive(Clone, Debug, Default)]
struct BlockMetadata {
min: u64,
num_bits: u8,
}
#[derive(Clone, Debug, Default)]
struct BlockReader {
metadata: BlockMetadata,
start_offset: u64,
bit_unpacker: BitUnpacker,
}
impl BlockReader {
fn new(metadata: BlockMetadata, start_offset: u64) -> Self {
Self {
bit_unpacker: BitUnpacker::new(metadata.num_bits),
metadata,
start_offset,
}
}
#[inline]
fn get_u64(&self, block_pos: u64, data: &[u8]) -> u64 {
let diff = self
.bit_unpacker
.get(block_pos, &data[self.start_offset as usize..]);
self.metadata.min + diff
}
}
impl BinarySerializable for BlockMetadata {
fn serialize<W: Write>(&self, write: &mut W) -> io::Result<()> {
self.min.serialize(write)?;
self.num_bits.serialize(write)?;
Ok(())
}
fn deserialize<R: Read>(reader: &mut R) -> io::Result<Self> {
let min = u64::deserialize(reader)?;
let num_bits = u8::deserialize(reader)?;
Ok(Self { min, num_bits })
}
}
#[derive(Clone, Debug)]
pub struct FORFooter {
pub num_vals: u64,
pub min_value: u64,
pub max_value: u64,
block_metadatas: Vec<BlockMetadata>,
}
impl BinarySerializable for FORFooter {
fn serialize<W: Write>(&self, write: &mut W) -> io::Result<()> {
let mut out = vec![];
self.num_vals.serialize(&mut out)?;
self.min_value.serialize(&mut out)?;
self.max_value.serialize(&mut out)?;
self.block_metadatas.serialize(&mut out)?;
write.write_all(&out)?;
(out.len() as u32).serialize(write)?;
Ok(())
}
fn deserialize<R: Read>(reader: &mut R) -> io::Result<Self> {
let footer = Self {
num_vals: u64::deserialize(reader)?,
min_value: u64::deserialize(reader)?,
max_value: u64::deserialize(reader)?,
block_metadatas: Vec::<BlockMetadata>::deserialize(reader)?,
};
Ok(footer)
}
}
impl FastFieldCodecReader for FORFastFieldReader {
/// Opens a fast field given a file.
fn open_from_bytes(bytes: &[u8]) -> io::Result<Self> {
let footer_len: u32 = (&bytes[bytes.len() - 4..]).deserialize()?;
let (_, mut footer) = bytes.split_at(bytes.len() - (4 + footer_len) as usize);
let footer = FORFooter::deserialize(&mut footer)?;
let mut block_readers = Vec::with_capacity(footer.block_metadatas.len());
let mut current_data_offset = 0;
for block_metadata in footer.block_metadatas {
let num_bits = block_metadata.num_bits;
block_readers.push(BlockReader::new(block_metadata, current_data_offset));
current_data_offset += num_bits as u64 * BLOCK_SIZE / 8;
}
Ok(Self {
num_vals: footer.num_vals,
min_value: footer.min_value,
max_value: footer.max_value,
block_readers,
})
}
#[inline]
fn get_u64(&self, idx: u64, data: &[u8]) -> u64 {
let block_idx = (idx / BLOCK_SIZE) as usize;
let block_pos = idx - (block_idx as u64) * BLOCK_SIZE;
let block_reader = &self.block_readers[block_idx];
block_reader.get_u64(block_pos, data)
}
#[inline]
fn min_value(&self) -> u64 {
self.min_value
}
#[inline]
fn max_value(&self) -> u64 {
self.max_value
}
}
/// Frame-of-reference serializer: each block of BLOCK_SIZE values is bitpacked relative to the block's minimum.
pub struct FORFastFieldSerializer {}
impl FastFieldCodecSerializer for FORFastFieldSerializer {
const NAME: &'static str = "FOR";
const ID: u8 = 5;
/// Creates a new fast field serializer.
fn serialize(
write: &mut impl Write,
_: &impl FastFieldDataAccess,
stats: FastFieldStats,
data_iter: impl Iterator<Item = u64>,
_data_iter1: impl Iterator<Item = u64>,
) -> io::Result<()> {
let data = data_iter.collect::<Vec<_>>();
let mut bit_packer = BitPacker::new();
let mut block_metadatas = Vec::new();
for data_pos in (0..data.len() as u64).step_by(BLOCK_SIZE as usize) {
let block_num_vals = BLOCK_SIZE.min(data.len() as u64 - data_pos) as usize;
let block_values = &data[data_pos as usize..data_pos as usize + block_num_vals];
let mut min = block_values[0];
let mut max = block_values[0];
for &current_value in block_values[1..].iter() {
min = min.min(current_value);
max = max.max(current_value);
}
let num_bits = compute_num_bits(max - min);
for current_value in block_values.iter() {
bit_packer.write(current_value - min, num_bits, write)?;
}
bit_packer.flush(write)?;
block_metadatas.push(BlockMetadata { min, num_bits });
}
bit_packer.close(write)?;
let footer = FORFooter {
num_vals: stats.num_vals,
min_value: stats.min_value,
max_value: stats.max_value,
block_metadatas,
};
footer.serialize(write)?;
Ok(())
}
fn is_applicable(
_fastfield_accessor: &impl FastFieldDataAccess,
stats: FastFieldStats,
) -> bool {
stats.num_vals > BLOCK_SIZE
}
/// Estimates the compression ratio by computing the ratio of the first block.
fn estimate_compression_ratio(
fastfield_accessor: &impl FastFieldDataAccess,
stats: FastFieldStats,
) -> f32 {
let last_elem_in_first_chunk = BLOCK_SIZE.min(stats.num_vals);
let max_distance = (0..last_elem_in_first_chunk)
.into_iter()
.map(|pos| {
let actual_value = fastfield_accessor.get_val(pos as u64);
actual_value - stats.min_value
})
.max()
.unwrap();
// Estimate one block and multiply by a magic number 3 to select this codec
// when we are almost sure that this is relevant.
let relative_max_value = max_distance as f32 * 3.0;
let num_bits = compute_num_bits(relative_max_value as u64) as u64 * stats.num_vals as u64
// function metadata per block
+ 9 * (stats.num_vals / BLOCK_SIZE);
let num_bits_uncompressed = 64 * stats.num_vals;
num_bits as f32 / num_bits_uncompressed as f32
}
}
#[cfg(test)]
mod tests {
use super::*;
use crate::tests::get_codec_test_data_sets;
fn create_and_validate(data: &[u64], name: &str) -> (f32, f32) {
crate::tests::create_and_validate::<FORFastFieldSerializer, FORFastFieldReader>(data, name)
}
#[test]
fn test_compression() {
let data = (10..=6_000_u64).collect::<Vec<_>>();
let (estimate, actual_compression) =
create_and_validate(&data, "simple monotonically large");
println!("{}", actual_compression);
assert!(actual_compression < 0.2);
assert!(actual_compression > 0.006);
assert!(estimate < 0.20);
assert!(estimate > 0.10);
}
#[test]
fn test_with_codec_data_sets() {
let data_sets = get_codec_test_data_sets();
for (mut data, name) in data_sets {
create_and_validate(&data, name);
data.reverse();
create_and_validate(&data, name);
}
}
#[test]
fn test_simple() {
let data = (10..=20_u64).collect::<Vec<_>>();
create_and_validate(&data, "simple monotonically");
}
#[test]
fn border_cases_1() {
let data = (0..1024).collect::<Vec<_>>();
create_and_validate(&data, "border case");
}
#[test]
fn border_case_2() {
let data = (0..1025).collect::<Vec<_>>();
create_and_validate(&data, "border case");
}
#[test]
fn rand() {
for _ in 0..10 {
let mut data = (5_000..20_000)
.map(|_| rand::random::<u32>() as u64)
.collect::<Vec<_>>();
let (estimate, actual_compression) = create_and_validate(&data, "random");
dbg!(estimate);
dbg!(actual_compression);
data.reverse();
create_and_validate(&data, "random");
}
}
}
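The removed frame-of-reference file stores, per block, a minimum and a bit width, and keeps each value as its distance from that minimum. The sketch below illustrates the idea only. It keeps the deltas in a plain `Vec<u64>` instead of bitpacking them, and the `Block`, `encode` and `get` names are hypothetical rather than taken from the crate.

```rust
const BLOCK_SIZE: usize = 128;

/// One frame-of-reference block: values are stored as `value - min`.
struct Block {
    min: u64,
    /// Bits a bitpacker would need per delta in this block.
    num_bits: u8,
    /// Deltas, left unpacked here to keep the sketch short.
    deltas: Vec<u64>,
}

fn encode(data: &[u64]) -> Vec<Block> {
    data.chunks(BLOCK_SIZE)
        .map(|chunk| {
            // `chunks` never yields an empty slice, so unwrap is safe.
            let min = *chunk.iter().min().unwrap();
            let max = *chunk.iter().max().unwrap();
            Block {
                min,
                num_bits: (64 - (max - min).leading_zeros()) as u8,
                deltas: chunk.iter().map(|v| v - min).collect(),
            }
        })
        .collect()
}

/// Random access: locate the block, then add the delta back to its minimum.
fn get(blocks: &[Block], idx: usize) -> u64 {
    let block = &blocks[idx / BLOCK_SIZE];
    block.min + block.deltas[idx % BLOCK_SIZE]
}
```

A narrow block only pays for its own range, so one outlier widens a single block instead of the whole column.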

@@ -6,20 +6,15 @@ use std::io;
use std::io::Write;
pub mod bitpacked;
#[cfg(feature = "unstable")]
pub mod frame_of_reference;
pub mod linearinterpol;
pub mod multilinearinterpol;
pub mod piecewise_linear;
pub trait FastFieldCodecReader: Sized {
/// Reads the metadata and returns the CodecReader.
/// reads the metadata and returns the CodecReader
fn open_from_bytes(bytes: &[u8]) -> std::io::Result<Self>;
/// Reads the u64 value at index `idx`.
/// `idx` can be either a `DocId` or an index used for
/// `multivalued` fast field.
fn get_u64(&self, idx: u64, data: &[u8]) -> u64;
fn get_u64(&self, doc: u64, data: &[u8]) -> u64;
fn min_value(&self) -> u64;
fn max_value(&self) -> u64;
}
@@ -40,10 +35,7 @@ pub trait FastFieldCodecSerializer {
///
/// It could make sense to also return a value representing
/// computational complexity.
fn estimate_compression_ratio(
fastfield_accessor: &impl FastFieldDataAccess,
stats: FastFieldStats,
) -> f32;
fn estimate(fastfield_accessor: &impl FastFieldDataAccess, stats: FastFieldStats) -> f32;
/// Serializes the data using the serializer into write.
/// There are multiple iterators, in case the codec needs to read the data multiple times.
@@ -93,8 +85,9 @@ impl FastFieldDataAccess for Vec<u64> {
#[cfg(test)]
mod tests {
use crate::bitpacked::{BitpackedFastFieldReader, BitpackedFastFieldSerializer};
use crate::piecewise_linear::{
PiecewiseLinearFastFieldReader, PiecewiseLinearFastFieldSerializer,
use crate::linearinterpol::{LinearInterpolFastFieldReader, LinearInterpolFastFieldSerializer};
use crate::multilinearinterpol::{
MultiLinearInterpolFastFieldReader, MultiLinearInterpolFastFieldSerializer,
};
pub fn create_and_validate<S: FastFieldCodecSerializer, R: FastFieldCodecReader>(
@@ -104,7 +97,7 @@ mod tests {
if !S::is_applicable(&data, crate::tests::stats_from_vec(data)) {
return (f32::MAX, 0.0);
}
let estimation = S::estimate_compression_ratio(&data, crate::tests::stats_from_vec(data));
let estimation = S::estimate(&data, crate::tests::stats_from_vec(data));
let mut out = vec![];
S::serialize(
&mut out,
@@ -164,10 +157,13 @@ mod tests {
fn test_codec_bitpacking() {
test_codec::<BitpackedFastFieldSerializer, BitpackedFastFieldReader>();
}
#[test]
fn test_codec_piecewise_linear() {
test_codec::<PiecewiseLinearFastFieldSerializer, PiecewiseLinearFastFieldReader>();
fn test_codec_interpolation() {
test_codec::<LinearInterpolFastFieldSerializer, LinearInterpolFastFieldReader>();
}
#[test]
fn test_codec_multi_interpolation() {
test_codec::<MultiLinearInterpolFastFieldSerializer, MultiLinearInterpolFastFieldReader>();
}
use super::*;
@@ -185,50 +181,45 @@ mod tests {
fn estimation_good_interpolation_case() {
let data = (10..=20000_u64).collect::<Vec<_>>();
let piecewise_interpol_estimation =
PiecewiseLinearFastFieldSerializer::estimate_compression_ratio(
&data,
stats_from_vec(&data),
);
assert_le!(piecewise_interpol_estimation, 0.2);
let linear_interpol_estimation =
LinearInterpolFastFieldSerializer::estimate(&data, stats_from_vec(&data));
assert_le!(linear_interpol_estimation, 0.01);
let multi_linear_interpol_estimation =
MultiLinearInterpolFastFieldSerializer::estimate(&data, stats_from_vec(&data));
assert_le!(multi_linear_interpol_estimation, 0.2);
assert_le!(linear_interpol_estimation, multi_linear_interpol_estimation);
let bitpacked_estimation =
BitpackedFastFieldSerializer::estimate_compression_ratio(&data, stats_from_vec(&data));
assert_le!(piecewise_interpol_estimation, bitpacked_estimation);
BitpackedFastFieldSerializer::estimate(&data, stats_from_vec(&data));
assert_le!(linear_interpol_estimation, bitpacked_estimation);
}
#[test]
fn estimation_test_bad_interpolation_case() {
let data = vec![200, 10, 10, 10, 10, 1000, 20];
let piecewise_interpol_estimation =
PiecewiseLinearFastFieldSerializer::estimate_compression_ratio(
&data,
stats_from_vec(&data),
);
assert_le!(piecewise_interpol_estimation, 0.32);
let linear_interpol_estimation =
LinearInterpolFastFieldSerializer::estimate(&data, stats_from_vec(&data));
assert_le!(linear_interpol_estimation, 0.32);
let bitpacked_estimation =
BitpackedFastFieldSerializer::estimate_compression_ratio(&data, stats_from_vec(&data));
assert_le!(bitpacked_estimation, piecewise_interpol_estimation);
BitpackedFastFieldSerializer::estimate(&data, stats_from_vec(&data));
assert_le!(bitpacked_estimation, linear_interpol_estimation);
}
#[test]
fn estimation_test_interpolation_case_monotonically_increasing() {
fn estimation_test_bad_interpolation_case_monotonically_increasing() {
let mut data = (200..=20000_u64).collect::<Vec<_>>();
data.push(1_000_000);
// in this case the linear interpolation can in fact not be worse than bitpacking,
// but the estimator adds some threshold, which leads to estimated worse behavior
let piecewise_interpol_estimation =
PiecewiseLinearFastFieldSerializer::estimate_compression_ratio(
&data,
stats_from_vec(&data),
);
assert_le!(piecewise_interpol_estimation, 0.2);
let linear_interpol_estimation =
LinearInterpolFastFieldSerializer::estimate(&data, stats_from_vec(&data));
assert_le!(linear_interpol_estimation, 0.35);
let bitpacked_estimation =
BitpackedFastFieldSerializer::estimate_compression_ratio(&data, stats_from_vec(&data));
println!("{}", bitpacked_estimation);
BitpackedFastFieldSerializer::estimate(&data, stats_from_vec(&data));
assert_le!(bitpacked_estimation, 0.32);
assert_le!(piecewise_interpol_estimation, bitpacked_estimation);
assert_le!(bitpacked_estimation, linear_interpol_estimation);
}
}
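The tests above compare per-codec estimates against each other. Assuming the estimates are meant to drive codec choice (the commented-out `best_estimation_codec` lines in the benchmark hint at this), a selection step could look like the sketch below. `Candidate` and `pick_best` are hypothetical names and not part of the crate.

```rust
/// Result of asking one codec about a column (hypothetical type).
struct Candidate {
    name: &'static str,
    applicable: bool,
    /// Estimated compressed size / raw size; lower is better.
    estimate: f32,
}

/// Returns the applicable codec with the lowest estimated ratio, if any.
fn pick_best(candidates: &[Candidate]) -> Option<&'static str> {
    candidates
        .iter()
        .filter(|c| c.applicable)
        // total_cmp gives a total order on f32, so NaN cannot cause a panic.
        .min_by(|a, b| a.estimate.total_cmp(&b.estimate))
        .map(|c| c.name)
}
```

With the Autoincrement estimates from the benchmark table (0.000004396963, 0.17275847 and 0.28125), this sketch returns `LinearInterpol`.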

@@ -71,9 +71,9 @@ impl FastFieldCodecReader for LinearInterpolFastFieldReader {
})
}
#[inline]
fn get_u64(&self, idx: u64, data: &[u8]) -> u64 {
let calculated_value = get_calculated_value(self.footer.first_val, idx, self.slope);
(calculated_value + self.bit_unpacker.get(idx, data)) - self.footer.offset
fn get_u64(&self, doc: u64, data: &[u8]) -> u64 {
let calculated_value = get_calculated_value(self.footer.first_val, doc, self.slope);
(calculated_value + self.bit_unpacker.get(doc, data)) - self.footer.offset
}
#[inline]
@@ -88,10 +88,6 @@ impl FastFieldCodecReader for LinearInterpolFastFieldReader {
/// Fastfield serializer, which tries to guess values by linear interpolation
/// and stores the difference bitpacked.
#[deprecated(
note = "Linear interpolation works best only on very rare cases and piecewise linear codec \
already works great on them."
)]
pub struct LinearInterpolFastFieldSerializer {}
#[inline]
@@ -109,7 +105,6 @@ fn get_calculated_value(first_val: u64, pos: u64, slope: f32) -> u64 {
first_val + (pos as f32 * slope) as u64
}
#[allow(deprecated)]
impl FastFieldCodecSerializer for LinearInterpolFastFieldSerializer {
const NAME: &'static str = "LinearInterpol";
const ID: u8 = 2;
@@ -187,16 +182,10 @@ impl FastFieldCodecSerializer for LinearInterpolFastFieldSerializer {
}
true
}
/// Estimation for linear interpolation is hard because, you don't know
/// estimation for linear interpolation is hard because, you don't know
/// where the local maxima for the deviation of the calculated value are and
/// the offset to shift all values to >=0 is also unknown.
fn estimate_compression_ratio(
fastfield_accessor: &impl FastFieldDataAccess,
stats: FastFieldStats,
) -> f32 {
if stats.num_vals < 3 {
return f32::MAX;
}
fn estimate(fastfield_accessor: &impl FastFieldDataAccess, stats: FastFieldStats) -> f32 {
let first_val = fastfield_accessor.get_val(0);
let last_val = fastfield_accessor.get_val(stats.num_vals as u64 - 1);
let slope = get_slope(first_val, last_val, stats.num_vals);
@@ -240,7 +229,6 @@ fn distance<T: Sub<Output = T> + Ord>(x: T, y: T) -> T {
}
}
#[allow(deprecated)]
#[cfg(test)]
mod tests {
use super::*;
@@ -301,10 +289,8 @@ mod tests {
#[test]
fn linear_interpol_fast_field_rand() {
for _ in 0..10 {
let mut data = (5_000..20_000)
.map(|_| rand::random::<u32>() as u64)
.collect::<Vec<_>>();
for _ in 0..5000 {
let mut data = (0..50).map(|_| rand::random::<u64>()).collect::<Vec<_>>();
create_and_validate(&data, "random");
data.reverse();
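The reader in this file rebuilds a value as `(calculated_value + diff) - offset`, where the calculated value comes from a straight line through the first and last values. The sketch below shows how such an encoding round-trips. The slope formula and the `Encoded`, `encode` and `get` names are assumptions for illustration, the diffs are left unpacked, and overflow for values near `u64::MAX` is not handled.

```rust
/// Same shape as `get_calculated_value` in the diff above.
fn predict(first_val: u64, pos: u64, slope: f32) -> u64 {
    first_val + (pos as f32 * slope) as u64
}

struct Encoded {
    first_val: u64,
    slope: f32,
    /// Shift that keeps every stored diff non-negative.
    offset: u64,
    diffs: Vec<u64>,
}

fn encode(data: &[u64]) -> Encoded {
    assert!(!data.is_empty(), "sketch assumes at least one value");
    let first_val = data[0];
    let last_val = data[data.len() - 1];
    // Assumed slope: straight line from the first to the last value.
    let slope = if data.len() > 1 {
        (last_val as f32 - first_val as f32) / (data.len() - 1) as f32
    } else {
        0.0
    };
    // Largest amount by which a prediction overshoots the real value.
    let offset = data
        .iter()
        .enumerate()
        .map(|(pos, &val)| predict(first_val, pos as u64, slope).saturating_sub(val))
        .max()
        .unwrap_or(0);
    let diffs = data
        .iter()
        .enumerate()
        .map(|(pos, &val)| val + offset - predict(first_val, pos as u64, slope))
        .collect();
    Encoded { first_val, slope, offset, diffs }
}

fn get(enc: &Encoded, pos: usize) -> u64 {
    (predict(enc.first_val, pos as u64, enc.slope) + enc.diffs[pos]) - enc.offset
}
```

The closer the data follows the line, the smaller the diffs and the fewer bits a bitpacker needs for them, which is why the Autoincrement dataset compresses so well with this codec.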

@@ -1,52 +1,31 @@
#[macro_use]
extern crate prettytable;
use std::fs::File;
use std::io;
use std::io::BufRead;
use std::time::{Duration, Instant};
use common::f64_to_u64;
use fastfield_codecs::bitpacked::BitpackedFastFieldReader;
#[cfg(feature = "unstable")]
use fastfield_codecs::frame_of_reference::{FORFastFieldReader, FORFastFieldSerializer};
use fastfield_codecs::piecewise_linear::{
PiecewiseLinearFastFieldReader, PiecewiseLinearFastFieldSerializer,
};
use fastfield_codecs::{FastFieldCodecReader, FastFieldCodecSerializer, FastFieldStats};
use fastfield_codecs::linearinterpol::LinearInterpolFastFieldSerializer;
use fastfield_codecs::multilinearinterpol::MultiLinearInterpolFastFieldSerializer;
use fastfield_codecs::{FastFieldCodecSerializer, FastFieldStats};
use prettytable::{Cell, Row, Table};
use rand::prelude::StdRng;
use rand::Rng;
fn main() {
let mut table = Table::new();
// Add a row per time
table.add_row(row![
"",
"Compression ratio",
"Compression ratio estimation",
"Compression time (micro)",
"Reading time (micro)"
]);
table.add_row(row!["", "Compression Ratio", "Compression Estimation"]);
for (data, data_set_name) in get_codec_test_data_sets() {
let mut results = vec![];
let res = serialize_with_codec::<
PiecewiseLinearFastFieldSerializer,
PiecewiseLinearFastFieldReader,
>(&data);
let res = serialize_with_codec::<LinearInterpolFastFieldSerializer>(&data);
results.push(res);
#[cfg(feature = "unstable")]
{
let res = serialize_with_codec::<FORFastFieldSerializer, FORFastFieldReader>(&data);
results.push(res);
}
let res = serialize_with_codec::<
fastfield_codecs::bitpacked::BitpackedFastFieldSerializer,
BitpackedFastFieldReader,
>(&data);
let res = serialize_with_codec::<MultiLinearInterpolFastFieldSerializer>(&data);
results.push(res);
let res = serialize_with_codec::<fastfield_codecs::bitpacked::BitpackedFastFieldSerializer>(
&data,
);
results.push(res);
// let best_estimation_codec = results
//.iter()
//.min_by(|res1, res2| res1.partial_cmp(&res2).unwrap())
//.unwrap();
let best_compression_ratio_codec = results
.iter()
.min_by(|res1, res2| res1.partial_cmp(res2).unwrap())
@@ -54,7 +33,7 @@ fn main() {
.unwrap();
table.add_row(Row::new(vec![Cell::new(data_set_name).style_spec("Bbb")]));
for (is_applicable, est, comp, name, compression_duration, read_duration) in results {
for (is_applicable, est, comp, name) in results {
let (est_cell, ratio_cell) = if !is_applicable {
("Codec Disabled".to_string(), "".to_string())
} else {
@@ -70,8 +49,6 @@ fn main() {
Cell::new(name).style_spec("bFg"),
Cell::new(&ratio_cell).style_spec(style),
Cell::new(&est_cell).style_spec(""),
Cell::new(&compression_duration.as_micros().to_string()),
Cell::new(&read_duration.as_micros().to_string()),
]));
}
}
@@ -93,6 +70,7 @@ pub fn get_codec_test_data_sets() -> Vec<(Vec<u64>, &'static str)> {
current_cumulative
})
.collect::<Vec<_>>();
// let data = (1..=200000_u64).map(|num| num + num).collect::<Vec<_>>();
data_and_names.push((data, "Monotonically increasing concave"));
let mut current_cumulative = 0;
@@ -105,79 +83,22 @@ pub fn get_codec_test_data_sets() -> Vec<(Vec<u64>, &'static str)> {
.collect::<Vec<_>>();
data_and_names.push((data, "Monotonically increasing convex"));
let mut rng: StdRng = rand::SeedableRng::seed_from_u64(1);
let data = (1000..=200_000_u64)
.map(|num| num + rng.gen::<u8>() as u64)
.map(|num| num + rand::random::<u8>() as u64)
.collect::<Vec<_>>();
data_and_names.push((data, "Almost monotonically increasing"));
let data = (1000..=200_000_u64)
.map(|_| rng.gen::<u8>() as u64)
.collect::<Vec<_>>();
data_and_names.push((data, "Random"));
let mut data = load_dataset("datasets/hdfs_logs_timestamps.txt");
data_and_names.push((data.clone(), "HDFS logs timestamps"));
data.sort_unstable();
data_and_names.push((data, "HDFS logs timestamps SORTED"));
let data = load_dataset("datasets/http_logs_timestamps.txt");
data_and_names.push((data, "HTTP logs timestamps SORTED"));
let mut data = load_dataset("datasets/amazon_reviews_product_ids.txt");
data_and_names.push((data.clone(), "Amazon review product ids"));
data.sort_unstable();
data_and_names.push((data, "Amazon review product ids SORTED"));
let data = load_float_dataset("datasets/nooc_temperatures.txt");
data_and_names.push((data, "Temperatures"));
data_and_names
}
pub fn load_dataset(file_path: &str) -> Vec<u64> {
println!("Load dataset from `{}`", file_path);
let file = File::open(file_path).expect("Error when opening file.");
let lines = io::BufReader::new(file).lines();
let mut data = Vec::new();
for line in lines {
let l = line.unwrap();
data.push(l.parse::<u64>().unwrap());
}
data
}
pub fn load_float_dataset(file_path: &str) -> Vec<u64> {
println!("Load float dataset from `{}`", file_path);
let file = File::open(file_path).expect("Error when opening file.");
let lines = io::BufReader::new(file).lines();
let mut data = Vec::new();
for line in lines {
let line_string = line.unwrap();
let value = line_string.parse::<f64>().unwrap();
data.push(f64_to_u64(value));
}
data
}
pub fn serialize_with_codec<S: FastFieldCodecSerializer, R: FastFieldCodecReader>(
pub fn serialize_with_codec<S: FastFieldCodecSerializer>(
data: &[u64],
) -> (bool, f32, f32, &'static str, Duration, Duration) {
) -> (bool, f32, f32, &'static str) {
let is_applicable = S::is_applicable(&data, stats_from_vec(data));
if !is_applicable {
return (
false,
0.0,
0.0,
S::NAME,
Duration::from_secs(0),
Duration::from_secs(0),
);
return (false, 0.0, 0.0, S::NAME);
}
let start_time_compression = Instant::now();
let estimation = S::estimate_compression_ratio(&data, stats_from_vec(data));
let estimation = S::estimate(&data, stats_from_vec(data));
let mut out = vec![];
S::serialize(
&mut out,
@@ -187,22 +108,9 @@ pub fn serialize_with_codec<S: FastFieldCodecSerializer, R: FastFieldCodecReader
data.iter().cloned(),
)
.unwrap();
let elasped_time_compression = start_time_compression.elapsed();
let actual_compression = out.len() as f32 / (data.len() * 8) as f32;
let reader = R::open_from_bytes(&out).unwrap();
let start_time_read = Instant::now();
for doc in 0..data.len() {
reader.get_u64(doc as u64, &out);
}
let elapsed_time_read = start_time_read.elapsed();
(
true,
estimation,
actual_compression,
S::NAME,
elasped_time_compression,
elapsed_time_read,
)
(true, estimation, actual_compression, S::NAME)
}
pub fn stats_from_vec(data: &[u64]) -> FastFieldStats {

@@ -155,17 +155,14 @@ impl FastFieldCodecReader for MultiLinearInterpolFastFieldReader {
}
#[inline]
fn get_u64(&self, idx: u64, data: &[u8]) -> u64 {
let interpolation = get_interpolation_function(idx, &self.footer.interpolations);
let block_idx = idx - interpolation.start_pos;
let calculated_value = get_calculated_value(
interpolation.value_start_pos,
block_idx,
interpolation.slope,
);
fn get_u64(&self, doc: u64, data: &[u8]) -> u64 {
let interpolation = get_interpolation_function(doc, &self.footer.interpolations);
let doc = doc - interpolation.start_pos;
let calculated_value =
get_calculated_value(interpolation.value_start_pos, doc, interpolation.slope);
let diff = interpolation
.bit_unpacker
.get(block_idx, &data[interpolation.data_start_offset as usize..]);
.get(doc, &data[interpolation.data_start_offset as usize..]);
(calculated_value + diff) - interpolation.positive_val_offset
}
@@ -190,13 +187,8 @@ fn get_calculated_value(first_val: u64, pos: u64, slope: f32) -> u64 {
}
/// Same as LinearInterpolFastFieldSerializer, but working on chunks of CHUNK_SIZE elements.
#[deprecated(
note = "MultiLinearInterpol is replaced by PiecewiseLinear codec which fixes the slope and is \
a little bit more optimized."
)]
pub struct MultiLinearInterpolFastFieldSerializer {}
#[allow(deprecated)]
impl FastFieldCodecSerializer for MultiLinearInterpolFastFieldSerializer {
const NAME: &'static str = "MultiLinearInterpol";
const ID: u8 = 3;
@@ -319,13 +311,10 @@ impl FastFieldCodecSerializer for MultiLinearInterpolFastFieldSerializer {
}
true
}
/// Estimation for linear interpolation is hard because, you don't know
/// estimation for linear interpolation is hard because, you don't know
/// where the local maxima are for the deviation of the calculated value and
/// the offset is also unknown.
fn estimate_compression_ratio(
fastfield_accessor: &impl FastFieldDataAccess,
stats: FastFieldStats,
) -> f32 {
fn estimate(fastfield_accessor: &impl FastFieldDataAccess, stats: FastFieldStats) -> f32 {
let first_val_in_first_block = fastfield_accessor.get_val(0);
let last_elem_in_first_chunk = CHUNK_SIZE.min(stats.num_vals);
let last_val_in_first_block =
@@ -377,7 +366,6 @@ fn distance<T: Sub<Output = T> + Ord>(x: T, y: T) -> T {
}
#[cfg(test)]
#[allow(deprecated)]
mod tests {
use super::*;
use crate::tests::get_codec_test_data_sets;

@@ -1,365 +0,0 @@
//! PiecewiseLinear codec uses a piecewise linear function for every block of 512 values to predict
//! fast field values. The difference with the real fast field values is then stored.
//! For every block, the linear function can be expressed as
//! `computed_value = slope * block_position + first_value + positive_offset`
//! where:
//! - `block_position` is the position inside of the block from 0 to 511
//! - `first_value` is the first value on the block
//! - `positive_offset` is computed such that we ensure the diff `real_value - computed_value` is
//! always positive.
//!
//! 21 bytes are needed to store the block metadata, which adds an overhead of
//! 21 * 8 / 512 = 0.33 bits per element.
use std::io::{self, Read, Write};
use std::ops::Sub;
use common::{BinarySerializable, DeserializeFrom};
use tantivy_bitpacker::{compute_num_bits, BitPacker, BitUnpacker};
use crate::{FastFieldCodecReader, FastFieldCodecSerializer, FastFieldDataAccess, FastFieldStats};
const BLOCK_SIZE: u64 = 512;
#[derive(Clone)]
pub struct PiecewiseLinearFastFieldReader {
min_value: u64,
max_value: u64,
block_readers: Vec<BlockReader>,
}
/// Block that stores metadata to predict value with a linear
/// function `predicted_value = slope * position + first_value + positive_offset`
/// where `positive_offset` is computed such that predicted values
/// are always positive.
#[derive(Clone, Debug, Default)]
struct BlockMetadata {
first_value: u64,
positive_offset: u64,
slope: f32,
num_bits: u8,
}
#[derive(Clone, Debug, Default)]
struct BlockReader {
metadata: BlockMetadata,
start_offset: u64,
bit_unpacker: BitUnpacker,
}
impl BlockReader {
fn new(metadata: BlockMetadata, start_offset: u64) -> Self {
Self {
bit_unpacker: BitUnpacker::new(metadata.num_bits),
metadata,
start_offset,
}
}
#[inline]
fn get_u64(&self, block_pos: u64, data: &[u8]) -> u64 {
let diff = self
.bit_unpacker
.get(block_pos, &data[self.start_offset as usize..]);
let predicted_value =
predict_value(self.metadata.first_value, block_pos, self.metadata.slope);
(predicted_value + diff) - self.metadata.positive_offset
}
}
impl BinarySerializable for BlockMetadata {
fn serialize<W: Write>(&self, write: &mut W) -> io::Result<()> {
self.first_value.serialize(write)?;
self.positive_offset.serialize(write)?;
self.slope.serialize(write)?;
self.num_bits.serialize(write)?;
Ok(())
}
fn deserialize<R: Read>(reader: &mut R) -> io::Result<Self> {
let first_value = u64::deserialize(reader)?;
let positive_offset = u64::deserialize(reader)?;
let slope = f32::deserialize(reader)?;
let num_bits = u8::deserialize(reader)?;
Ok(Self {
first_value,
positive_offset,
slope,
num_bits,
})
}
}
#[derive(Clone, Debug)]
pub struct PiecewiseLinearFooter {
pub num_vals: u64,
pub min_value: u64,
pub max_value: u64,
block_metadatas: Vec<BlockMetadata>,
}
impl BinarySerializable for PiecewiseLinearFooter {
fn serialize<W: Write>(&self, write: &mut W) -> io::Result<()> {
let mut out = vec![];
self.num_vals.serialize(&mut out)?;
self.min_value.serialize(&mut out)?;
self.max_value.serialize(&mut out)?;
self.block_metadatas.serialize(&mut out)?;
write.write_all(&out)?;
(out.len() as u32).serialize(write)?;
Ok(())
}
fn deserialize<R: Read>(reader: &mut R) -> io::Result<Self> {
let footer = Self {
num_vals: u64::deserialize(reader)?,
min_value: u64::deserialize(reader)?,
max_value: u64::deserialize(reader)?,
block_metadatas: Vec::<BlockMetadata>::deserialize(reader)?,
};
Ok(footer)
}
}
impl FastFieldCodecReader for PiecewiseLinearFastFieldReader {
/// Opens a fast field given a file.
fn open_from_bytes(bytes: &[u8]) -> io::Result<Self> {
let footer_len: u32 = (&bytes[bytes.len() - 4..]).deserialize()?;
let (_, mut footer) = bytes.split_at(bytes.len() - (4 + footer_len) as usize);
let footer = PiecewiseLinearFooter::deserialize(&mut footer)?;
let mut block_readers = Vec::with_capacity(footer.block_metadatas.len());
let mut current_data_offset = 0;
for block_metadata in footer.block_metadatas.into_iter() {
let num_bits = block_metadata.num_bits;
block_readers.push(BlockReader::new(block_metadata, current_data_offset));
current_data_offset += num_bits as u64 * BLOCK_SIZE / 8;
}
Ok(Self {
min_value: footer.min_value,
max_value: footer.max_value,
block_readers,
})
}
#[inline]
fn get_u64(&self, idx: u64, data: &[u8]) -> u64 {
let block_idx = (idx / BLOCK_SIZE) as usize;
let block_pos = idx - (block_idx as u64) * BLOCK_SIZE;
let block_reader = &self.block_readers[block_idx];
block_reader.get_u64(block_pos, data)
}
#[inline]
fn min_value(&self) -> u64 {
self.min_value
}
#[inline]
fn max_value(&self) -> u64 {
self.max_value
}
}
#[inline]
fn predict_value(first_val: u64, pos: u64, slope: f32) -> u64 {
(first_val as i64 + (pos as f32 * slope) as i64) as u64
}
pub struct PiecewiseLinearFastFieldSerializer;
impl FastFieldCodecSerializer for PiecewiseLinearFastFieldSerializer {
const NAME: &'static str = "PiecewiseLinear";
const ID: u8 = 4;
/// Creates a new fast field serializer.
fn serialize(
write: &mut impl Write,
_: &impl FastFieldDataAccess,
stats: FastFieldStats,
data_iter: impl Iterator<Item = u64>,
_data_iter1: impl Iterator<Item = u64>,
) -> io::Result<()> {
let mut data = data_iter.collect::<Vec<_>>();
let mut bit_packer = BitPacker::new();
let mut block_metadatas = Vec::new();
for data_pos in (0..data.len() as u64).step_by(BLOCK_SIZE as usize) {
let block_num_vals = BLOCK_SIZE.min(data.len() as u64 - data_pos) as usize;
let block_values = &mut data[data_pos as usize..data_pos as usize + block_num_vals];
let slope = if block_num_vals == 1 {
0f32
} else {
((block_values[block_values.len() - 1] as f64 - block_values[0] as f64)
/ (block_num_vals - 1) as f64) as f32
};
let first_value = block_values[0];
let mut positive_offset = 0;
let mut max_delta = 0;
for (pos, &current_value) in block_values[1..].iter().enumerate() {
let computed_value = predict_value(first_value, pos as u64 + 1, slope);
if computed_value > current_value {
positive_offset = positive_offset.max(computed_value - current_value);
} else {
max_delta = max_delta.max(current_value - computed_value);
}
}
let num_bits = compute_num_bits(max_delta + positive_offset);
for (pos, current_value) in block_values.iter().enumerate() {
let computed_value = predict_value(first_value, pos as u64, slope);
let diff = (current_value + positive_offset) - computed_value;
bit_packer.write(diff, num_bits, write)?;
}
bit_packer.flush(write)?;
block_metadatas.push(BlockMetadata {
first_value,
positive_offset,
slope,
num_bits,
});
}
bit_packer.close(write)?;
let footer = PiecewiseLinearFooter {
num_vals: stats.num_vals,
min_value: stats.min_value,
max_value: stats.max_value,
block_metadatas,
};
footer.serialize(write)?;
Ok(())
}
fn is_applicable(
_fastfield_accessor: &impl FastFieldDataAccess,
stats: FastFieldStats,
) -> bool {
if stats.num_vals < 10 * BLOCK_SIZE {
return false;
}
// On serialization the offset is added to the actual value.
// We need to make sure this won't run into overflow calculation issues.
// For this we take the maximum theoretical offset and add this to the max value.
// If this doesn't overflow the algorithm should be fine
let theorethical_maximum_offset = stats.max_value - stats.min_value;
if stats
.max_value
.checked_add(theorethical_maximum_offset)
.is_none()
{
return false;
}
true
}
/// Estimation for linear interpolation is hard because you don't know
/// where the local maxima are for the deviation of the calculated value and
/// the offset is also unknown.
fn estimate_compression_ratio(
fastfield_accessor: &impl FastFieldDataAccess,
stats: FastFieldStats,
) -> f32 {
let first_val_in_first_block = fastfield_accessor.get_val(0);
let last_elem_in_first_chunk = BLOCK_SIZE.min(stats.num_vals);
let last_val_in_first_block =
fastfield_accessor.get_val(last_elem_in_first_chunk as u64 - 1);
let slope = ((last_val_in_first_block as f64 - first_val_in_first_block as f64)
/ (stats.num_vals - 1) as f64) as f32;
// let's sample at 0%, 5%, 10% .. 95%, 100%, but for the first block only
let sample_positions = (0..20)
.map(|pos| (last_elem_in_first_chunk as f32 / 100.0 * pos as f32 * 5.0) as usize)
.collect::<Vec<_>>();
let max_distance = sample_positions
.iter()
.map(|&pos| {
let calculated_value = predict_value(first_val_in_first_block, pos as u64, slope);
let actual_value = fastfield_accessor.get_val(pos as u64);
distance(calculated_value, actual_value)
})
.max()
.unwrap();
// Estimate one block and extrapolate the cost to all blocks.
// the theory would be that we don't have the actual max_distance, but we are close within
// 50% threshold.
// It is multiplied by 2 because in a log case scenario the line would be as much above as
// below. So the offset would = max_distance
let relative_max_value = (max_distance as f32 * 1.5) * 2.0;
let num_bits = compute_num_bits(relative_max_value as u64) as u64 * stats.num_vals as u64
// function metadata per block
+ 21 * (stats.num_vals / BLOCK_SIZE);
let num_bits_uncompressed = 64 * stats.num_vals;
num_bits as f32 / num_bits_uncompressed as f32
}
}
fn distance<T: Sub<Output = T> + Ord>(x: T, y: T) -> T {
if x < y {
y - x
} else {
x - y
}
}
#[cfg(test)]
mod tests {
use super::*;
use crate::tests::get_codec_test_data_sets;
fn create_and_validate(data: &[u64], name: &str) -> (f32, f32) {
crate::tests::create_and_validate::<
PiecewiseLinearFastFieldSerializer,
PiecewiseLinearFastFieldReader,
>(data, name)
}
#[test]
fn test_compression() {
let data = (10..=6_000_u64).collect::<Vec<_>>();
let (estimate, actual_compression) =
create_and_validate(&data, "simple monotonically large");
assert!(actual_compression < 0.2);
assert!(estimate < 0.20);
assert!(estimate > 0.15);
assert!(actual_compression > 0.001);
}
#[test]
fn test_with_codec_data_sets() {
let data_sets = get_codec_test_data_sets();
for (mut data, name) in data_sets {
create_and_validate(&data, name);
data.reverse();
create_and_validate(&data, name);
}
}
#[test]
fn test_simple() {
let data = (10..=20_u64).collect::<Vec<_>>();
create_and_validate(&data, "simple monotonically");
}
#[test]
fn border_cases_1() {
let data = (0..1024).collect::<Vec<_>>();
create_and_validate(&data, "border case");
}
#[test]
fn border_case_2() {
let data = (0..1025).collect::<Vec<_>>();
create_and_validate(&data, "border case");
}
#[test]
fn rand() {
for _ in 0..10 {
let mut data = (5_000..20_000)
.map(|_| rand::random::<u32>() as u64)
.collect::<Vec<_>>();
let (estimate, actual_compression) = create_and_validate(&data, "random");
dbg!(estimate);
dbg!(actual_compression);
data.reverse();
create_and_validate(&data, "random");
}
}
}
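
The block encoding described in the module comment can be sketched outside tantivy as follows. This is a minimal illustration, not the codec itself: `predict_value` copies the function from the diff, while `encode_block` and `decode_value` are hypothetical helpers that skip bit packing and assume a non-empty block.

```rust
/// Predicts the value at `pos` from the first value of the block and the slope
/// (same arithmetic as `predict_value` in the diff).
fn predict_value(first_val: u64, pos: u64, slope: f32) -> u64 {
    (first_val as i64 + (pos as f32 * slope) as i64) as u64
}

/// Hypothetical helper: returns (slope, positive_offset, diffs) for one non-empty block.
/// The real serializer bit-packs the diffs instead of returning them in a Vec.
fn encode_block(values: &[u64]) -> (f32, u64, Vec<u64>) {
    let first = values[0];
    let slope = if values.len() == 1 {
        0f32
    } else {
        ((values[values.len() - 1] as f64 - first as f64) / (values.len() - 1) as f64) as f32
    };
    // Largest amount by which a prediction overshoots the real value.
    let positive_offset = values
        .iter()
        .enumerate()
        .map(|(pos, &value)| predict_value(first, pos as u64, slope).saturating_sub(value))
        .max()
        .unwrap_or(0);
    // Adding the offset keeps every stored diff non-negative.
    let diffs = values
        .iter()
        .enumerate()
        .map(|(pos, &value)| (value + positive_offset) - predict_value(first, pos as u64, slope))
        .collect();
    (slope, positive_offset, diffs)
}

/// Hypothetical helper: inverse of `encode_block` for a single position.
fn decode_value(first: u64, slope: f32, positive_offset: u64, pos: u64, diff: u64) -> u64 {
    (predict_value(first, pos, slope) + diff) - positive_offset
}

/// Metadata overhead in bits per value: 21 bytes per block of 512 values.
fn metadata_overhead_bits_per_value() -> f64 {
    21.0 * 8.0 / 512.0
}
```

For the block `[10, 12, 11, 20, 19]` the slope is 2.25, the prediction at position 2 overshoots by 3, and so every diff is shifted up by 3 before being stored.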

View File

@@ -1,7 +1,7 @@
[package]
authors = ["Paul Masurel <paul@quickwit.io>", "Pascal Seitz <pascal@quickwit.io>"]
name = "ownedbytes"
version = "0.2.0"
version = "0.3.0"
edition = "2018"
description = "Expose data as static slice"
license = "MIT"

View File

@@ -1,6 +1,6 @@
[package]
name = "tantivy-query-grammar"
version = "0.15.0"
version = "0.18.0"
authors = ["Paul Masurel <paul.masurel@gmail.com>"]
license = "MIT"
categories = ["database-implementations", "data-structures"]

View File

@@ -18,7 +18,7 @@ use crate::Occur;
const SPECIAL_CHARS: &[char] = &[
'+', '^', '`', ':', '{', '}', '"', '[', ']', '(', ')', '~', '!', '\\', '*', ' ',
];
const ESCAPED_SPECIAL_CHARS_PATTERN: &str = r#"\\(\+|\^|`|:|\{|\}|"|\[|\]|\(|\)|\~|!|\\|\*| )"#;
const ESCAPED_SPECIAL_CHARS_PATTERN: &str = r#"\\(\+|\^|`|:|\{|\}|"|\[|\]|\(|\)|\~|!|\\|\*|\s)"#;
/// Parses a field_name
/// A field name must have at least one character and be followed by a colon.
@@ -34,7 +34,8 @@ fn field_name<'a>() -> impl Parser<&'a str, Output = String> {
take_while(|c| !SPECIAL_CHARS.contains(&c)),
),
'\\',
satisfy(|c| SPECIAL_CHARS.contains(&c)),
satisfy(|_| true), /* if the next character is not a special char, the \ will be treated
* as the \ character. */
))
.skip(char(':'))
.map(|s| ESCAPED_SPECIAL_CHARS_RE.replace_all(&s, "$1").to_string())
@@ -516,15 +517,27 @@ mod test {
}
#[test]
fn test_field_name() -> TestParseResult {
fn test_field_name() {
assert_eq!(
super::field_name().parse(".my.field.name:a"),
Ok((".my.field.name".to_string(), "a"))
);
assert_eq!(
super::field_name().parse(r#"my\ field:a"#),
Ok(("my field".to_string(), "a"))
);
assert_eq!(
super::field_name().parse(r#"にんじん:a"#),
Ok(("にんじん".to_string(), "a"))
);
assert_eq!(
super::field_name().parse("my\\ field\\ name:a"),
Ok(("my field name".to_string(), "a"))
);
assert_eq!(
super::field_name().parse(r#"my\field:a"#),
Ok((r#"my\field"#.to_string(), "a"))
);
assert!(super::field_name().parse("my field:a").is_err());
assert_eq!(
super::field_name().parse("\\(1\\+1\\):2"),
@@ -534,14 +547,21 @@ mod test {
super::field_name().parse("my_field_name:a"),
Ok(("my_field_name".to_string(), "a"))
);
assert_eq!(
super::field_name().parse("myfield.b:hello").unwrap(),
("myfield.b".to_string(), "hello")
);
assert_eq!(
super::field_name().parse(r#"myfield\.b:hello"#).unwrap(),
(r#"myfield\.b"#.to_string(), "hello")
);
assert!(super::field_name().parse("my_field_name").is_err());
assert!(super::field_name().parse(":a").is_err());
assert!(super::field_name().parse("-my_field:a").is_err());
assert_eq!(
super::field_name().parse("_my_field:a")?,
("_my_field".to_string(), "a")
super::field_name().parse("_my_field:a"),
Ok(("_my_field".to_string(), "a"))
);
Ok(())
}
#[test]
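
The escaping behaviour exercised by `test_field_name` can be approximated with the standard library alone. The sketch below is an assumption-laden stand-in for the parser plus regex replacement: `unescape_field_name` is a hypothetical function, and it treats only the literal space as whitespace, whereas the updated pattern uses `\s`.

```rust
/// Characters that must be escaped inside a field name (copied from the diff).
const SPECIAL_CHARS: &[char] = &[
    '+', '^', '`', ':', '{', '}', '"', '[', ']', '(', ')', '~', '!', '\\', '*', ' ',
];

/// Hypothetical helper: drops the backslash in front of a special character and
/// keeps it when the next character is not special.
fn unescape_field_name(raw: &str) -> String {
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars().peekable();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.peek().copied() {
            // Escaped special character: emit it without the backslash.
            Some(next) if SPECIAL_CHARS.contains(&next) => {
                out.push(next);
                chars.next();
            }
            // Anything else: the backslash is an ordinary character.
            _ => out.push('\\'),
        }
    }
    out
}
```

Under these assumptions `my\ field` becomes `my field`, while `my\field` and `myfield\.b` are returned unchanged, matching the expectations in the tests above.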

View File

@@ -48,8 +48,8 @@ use std::collections::{HashMap, HashSet};
use serde::{Deserialize, Serialize};
use super::bucket::HistogramAggregation;
pub use super::bucket::RangeAggregation;
use super::bucket::{HistogramAggregation, TermsAggregation};
use super::metric::{AverageAggregation, StatsAggregation};
use super::VecWithNames;
@@ -100,12 +100,27 @@ pub(crate) struct BucketAggregationInternal {
}
impl BucketAggregationInternal {
pub(crate) fn as_histogram(&self) -> &HistogramAggregation {
pub(crate) fn as_histogram(&self) -> Option<&HistogramAggregation> {
match &self.bucket_agg {
BucketAggregationType::Range(_) => panic!("unexpected aggregation"),
BucketAggregationType::Histogram(histogram) => histogram,
BucketAggregationType::Histogram(histogram) => Some(histogram),
_ => None,
}
}
pub(crate) fn as_term(&self) -> Option<&TermsAggregation> {
match &self.bucket_agg {
BucketAggregationType::Terms(terms) => Some(terms),
_ => None,
}
}
}
/// Extract all fields where the term dictionary is used in the tree.
pub fn get_term_dict_field_names(aggs: &Aggregations) -> HashSet<String> {
let mut term_dict_field_names = Default::default();
for el in aggs.values() {
el.get_term_dict_field_names(&mut term_dict_field_names)
}
term_dict_field_names
}
/// Extract all fast field names used in the tree.
@@ -130,6 +145,12 @@ pub enum Aggregation {
}
impl Aggregation {
fn get_term_dict_field_names(&self, term_field_names: &mut HashSet<String>) {
if let Aggregation::Bucket(bucket) = self {
bucket.get_term_dict_field_names(term_field_names)
}
}
fn get_fast_field_names(&self, fast_field_names: &mut HashSet<String>) {
match self {
Aggregation::Bucket(bucket) => bucket.get_fast_field_names(fast_field_names),
@@ -162,6 +183,12 @@ pub struct BucketAggregation {
}
impl BucketAggregation {
fn get_term_dict_field_names(&self, term_dict_field_names: &mut HashSet<String>) {
if let BucketAggregationType::Terms(terms) = &self.bucket_agg {
term_dict_field_names.insert(terms.field.to_string());
}
term_dict_field_names.extend(get_term_dict_field_names(&self.sub_aggregation));
}
fn get_fast_field_names(&self, fast_field_names: &mut HashSet<String>) {
self.bucket_agg.get_fast_field_names(fast_field_names);
fast_field_names.extend(get_fast_field_names(&self.sub_aggregation));
@@ -177,11 +204,15 @@ pub enum BucketAggregationType {
/// Put data into buckets of user-defined ranges.
#[serde(rename = "histogram")]
Histogram(HistogramAggregation),
/// Put data into buckets of terms.
#[serde(rename = "terms")]
Terms(TermsAggregation),
}
impl BucketAggregationType {
fn get_fast_field_names(&self, fast_field_names: &mut HashSet<String>) {
match self {
BucketAggregationType::Terms(terms) => fast_field_names.insert(terms.field.to_string()),
BucketAggregationType::Range(range) => fast_field_names.insert(range.field.to_string()),
BucketAggregationType::Histogram(histogram) => {
fast_field_names.insert(histogram.field.to_string())

View File

@@ -1,12 +1,16 @@
//! This will enhance the request tree with access to the fastfield and metadata.
use std::sync::Arc;
use super::agg_req::{Aggregation, Aggregations, BucketAggregationType, MetricAggregation};
use super::bucket::{HistogramAggregation, RangeAggregation};
use super::bucket::{HistogramAggregation, RangeAggregation, TermsAggregation};
use super::metric::{AverageAggregation, StatsAggregation};
use super::VecWithNames;
use crate::fastfield::{type_and_cardinality, DynamicFastFieldReader, FastType};
use crate::fastfield::{
type_and_cardinality, DynamicFastFieldReader, FastType, MultiValuedFastFieldReader,
};
use crate::schema::{Cardinality, Type};
use crate::{SegmentReader, TantivyError};
use crate::{InvertedIndexReader, SegmentReader, TantivyError};
#[derive(Clone, Default)]
pub(crate) struct AggregationsWithAccessor {
@@ -27,11 +31,32 @@ impl AggregationsWithAccessor {
}
}
#[derive(Clone)]
pub(crate) enum FastFieldAccessor {
Multi(MultiValuedFastFieldReader<u64>),
Single(DynamicFastFieldReader<u64>),
}
impl FastFieldAccessor {
pub fn as_single(&self) -> Option<&DynamicFastFieldReader<u64>> {
match self {
FastFieldAccessor::Multi(_) => None,
FastFieldAccessor::Single(reader) => Some(reader),
}
}
pub fn as_multi(&self) -> Option<&MultiValuedFastFieldReader<u64>> {
match self {
FastFieldAccessor::Multi(reader) => Some(reader),
FastFieldAccessor::Single(_) => None,
}
}
}
#[derive(Clone)]
pub struct BucketAggregationWithAccessor {
/// In general there can be buckets without fast field access, e.g. buckets that are created
/// based on search terms. So eventually this needs to be Option or moved.
pub(crate) accessor: DynamicFastFieldReader<u64>,
pub(crate) accessor: FastFieldAccessor,
pub(crate) inverted_index: Option<Arc<InvertedIndexReader>>,
pub(crate) field_type: Type,
pub(crate) bucket_agg: BucketAggregationType,
pub(crate) sub_aggregation: AggregationsWithAccessor,
@@ -43,14 +68,25 @@ impl BucketAggregationWithAccessor {
sub_aggregation: &Aggregations,
reader: &SegmentReader,
) -> crate::Result<BucketAggregationWithAccessor> {
let mut inverted_index = None;
let (accessor, field_type) = match &bucket {
BucketAggregationType::Range(RangeAggregation {
field: field_name,
ranges: _,
}) => get_ff_reader_and_validate(reader, field_name)?,
}) => get_ff_reader_and_validate(reader, field_name, Cardinality::SingleValue)?,
BucketAggregationType::Histogram(HistogramAggregation {
field: field_name, ..
}) => get_ff_reader_and_validate(reader, field_name)?,
}) => get_ff_reader_and_validate(reader, field_name, Cardinality::SingleValue)?,
BucketAggregationType::Terms(TermsAggregation {
field: field_name, ..
}) => {
let field = reader
.schema()
.get_field(field_name)
.ok_or_else(|| TantivyError::FieldNotFound(field_name.to_string()))?;
inverted_index = Some(reader.inverted_index(field)?);
get_ff_reader_and_validate(reader, field_name, Cardinality::MultiValues)?
}
};
let sub_aggregation = sub_aggregation.clone();
Ok(BucketAggregationWithAccessor {
@@ -58,6 +94,7 @@ impl BucketAggregationWithAccessor {
field_type,
sub_aggregation: get_aggs_with_accessor_and_validate(&sub_aggregation, reader)?,
bucket_agg: bucket.clone(),
inverted_index,
})
}
}
@@ -78,10 +115,14 @@ impl MetricAggregationWithAccessor {
match &metric {
MetricAggregation::Average(AverageAggregation { field: field_name })
| MetricAggregation::Stats(StatsAggregation { field: field_name }) => {
let (accessor, field_type) = get_ff_reader_and_validate(reader, field_name)?;
let (accessor, field_type) =
get_ff_reader_and_validate(reader, field_name, Cardinality::SingleValue)?;
Ok(MetricAggregationWithAccessor {
accessor,
accessor: accessor
.as_single()
.expect("unexpected fast field cardinality")
.clone(),
field_type,
metric: metric.clone(),
})
@@ -118,32 +159,45 @@ pub(crate) fn get_aggs_with_accessor_and_validate(
))
}
/// Get fast field reader with given cardinality.
fn get_ff_reader_and_validate(
reader: &SegmentReader,
field_name: &str,
) -> crate::Result<(DynamicFastFieldReader<u64>, Type)> {
cardinality: Cardinality,
) -> crate::Result<(FastFieldAccessor, Type)> {
let field = reader
.schema()
.get_field(field_name)
.ok_or_else(|| TantivyError::FieldNotFound(field_name.to_string()))?;
let field_type = reader.schema().get_field_entry(field).field_type();
if let Some((ff_type, cardinality)) = type_and_cardinality(field_type) {
if cardinality == Cardinality::MultiValues || ff_type == FastType::Date {
if let Some((ff_type, field_cardinality)) = type_and_cardinality(field_type) {
if ff_type == FastType::Date {
return Err(TantivyError::InvalidArgument(
"Unsupported field type date in aggregation".to_string(),
));
}
if cardinality != field_cardinality {
return Err(TantivyError::InvalidArgument(format!(
"Invalid field type in aggregation {:?}, only Cardinality::SingleValue supported",
field_type.value_type()
"Invalid field cardinality on field {} expected {:?}, but got {:?}",
field_name, cardinality, field_cardinality
)));
}
} else {
return Err(TantivyError::InvalidArgument(format!(
"Only single value fast fields of type f64, u64, i64 are supported, but got {:?} ",
"Only fast fields of type f64, u64, i64 are supported, but got {:?} ",
field_type.value_type()
)));
};
let ff_fields = reader.fast_fields();
ff_fields
.u64_lenient(field)
.map(|field| (field, field_type.value_type()))
match cardinality {
Cardinality::SingleValue => ff_fields
.u64_lenient(field)
.map(|field| (FastFieldAccessor::Single(field), field_type.value_type())),
Cardinality::MultiValues => ff_fields
.u64s_lenient(field)
.map(|field| (FastFieldAccessor::Multi(field), field_type.value_type())),
}
}
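
The accessor enum and the cardinality check introduced in this file follow a pattern that can be shown in isolation. The sketch below uses stand-in types: `Accessor` wraps plain vectors instead of tantivy's fast field readers, and `check_cardinality` is a hypothetical helper that returns a `String` error instead of `TantivyError`.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Cardinality {
    SingleValue,
    MultiValues,
}

/// Stand-in for `FastFieldAccessor`; the payloads are assumptions for illustration.
#[derive(Debug, Clone, PartialEq)]
enum Accessor {
    Single(Vec<u64>),
    Multi(Vec<Vec<u64>>),
}

impl Accessor {
    /// Returns the single-valued reader, or None for a multi-valued field.
    fn as_single(&self) -> Option<&Vec<u64>> {
        match self {
            Accessor::Single(reader) => Some(reader),
            Accessor::Multi(_) => None,
        }
    }

    /// Returns the multi-valued reader, or None for a single-valued field.
    fn as_multi(&self) -> Option<&Vec<Vec<u64>>> {
        match self {
            Accessor::Multi(reader) => Some(reader),
            Accessor::Single(_) => None,
        }
    }
}

/// Hypothetical helper: rejects a field whose cardinality differs from what the
/// aggregation expects, using the error message format from the diff.
fn check_cardinality(
    field_name: &str,
    expected: Cardinality,
    actual: Cardinality,
) -> Result<(), String> {
    if expected != actual {
        return Err(format!(
            "Invalid field cardinality on field {} expected {:?}, but got {:?}",
            field_name, expected, actual
        ));
    }
    Ok(())
}
```

Returning `Option` from `as_single` and `as_multi` moves the decision about a mismatch to the caller, which is why the metric aggregation in the diff calls `expect` after validation has already passed.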

View File

@@ -7,86 +7,134 @@
use std::cmp::Ordering;
use std::collections::HashMap;
use itertools::Itertools;
use serde::{Deserialize, Serialize};
use super::agg_req::{Aggregations, AggregationsInternal, BucketAggregationInternal};
use super::bucket::intermediate_buckets_to_final_buckets;
use super::agg_req::{
Aggregations, AggregationsInternal, BucketAggregationInternal, MetricAggregation,
};
use super::bucket::{intermediate_buckets_to_final_buckets, GetDocCount};
use super::intermediate_agg_result::{
IntermediateAggregationResults, IntermediateBucketResult, IntermediateHistogramBucketEntry,
IntermediateMetricResult, IntermediateRangeBucketEntry,
};
use super::metric::{SingleMetricResult, Stats};
use super::Key;
use super::{Key, VecWithNames};
use crate::TantivyError;
#[derive(Clone, Default, Debug, PartialEq, Serialize, Deserialize)]
/// The final aggregation result.
pub struct AggregationResults(pub HashMap<String, AggregationResult>);
impl AggregationResults {
pub(crate) fn get_value_from_aggregation(
&self,
name: &str,
agg_property: &str,
) -> crate::Result<Option<f64>> {
if let Some(agg) = self.0.get(name) {
agg.get_value_from_aggregation(name, agg_property)
} else {
// Validation is done during request parsing, so we can't reach this state.
Err(TantivyError::InternalError(format!(
"Can't find aggregation {:?} in sub_aggregations",
name
)))
}
}
/// Convert an intermediate result and its aggregation request to the final result
pub fn from_intermediate_and_req(
results: IntermediateAggregationResults,
agg: Aggregations,
) -> Self {
) -> crate::Result<Self> {
AggregationResults::from_intermediate_and_req_internal(results, &(agg.into()))
}
/// Convert an intermediate result and its aggregation request to the final result
///
/// Internal function, CollectorAggregations is used instead of Aggregations, which is optimized
/// for internal processing
fn from_intermediate_and_req_internal(
results: IntermediateAggregationResults,
/// for internal processing, by splitting metric and buckets into separate groups.
pub(crate) fn from_intermediate_and_req_internal(
intermediate_results: IntermediateAggregationResults,
req: &AggregationsInternal,
) -> Self {
let mut result = HashMap::default();
) -> crate::Result<Self> {
// Important assumption:
// When the tree contains buckets/metric, we expect it to have all buckets/metrics from the
// request
if let Some(buckets) = results.buckets {
result.extend(buckets.into_iter().zip(req.buckets.values()).map(
|((key, bucket), req)| {
(
key,
AggregationResult::BucketResult(BucketResult::from_intermediate_and_req(
bucket, req,
)),
)
},
));
} else {
result.extend(req.buckets.iter().map(|(key, req)| {
let empty_bucket = IntermediateBucketResult::empty_from_req(&req.bucket_agg);
(
key.to_string(),
AggregationResult::BucketResult(BucketResult::from_intermediate_and_req(
empty_bucket,
req,
)),
)
}));
}
let mut results: HashMap<String, AggregationResult> = HashMap::new();
if let Some(metrics) = results.metrics {
result.extend(
metrics
.into_iter()
.map(|(key, metric)| (key, AggregationResult::MetricResult(metric.into()))),
);
if let Some(buckets) = intermediate_results.buckets {
add_coverted_final_buckets_to_result(&mut results, buckets, &req.buckets)?
} else {
result.extend(req.metrics.iter().map(|(key, req)| {
let empty_bucket = IntermediateMetricResult::empty_from_req(req);
(
key.to_string(),
AggregationResult::MetricResult(empty_bucket.into()),
)
}));
// When there are no buckets, we create empty buckets, so that the serialized json
// format is constant
add_empty_final_buckets_to_result(&mut results, &req.buckets)?
};
if let Some(metrics) = intermediate_results.metrics {
add_converted_final_metrics_to_result(&mut results, metrics);
} else {
// When there are no metrics, we create empty metric results, so that the serialized
// json format is constant
add_empty_final_metrics_to_result(&mut results, &req.metrics)?;
}
Self(result)
Ok(Self(results))
}
}
fn add_converted_final_metrics_to_result(
results: &mut HashMap<String, AggregationResult>,
metrics: VecWithNames<IntermediateMetricResult>,
) {
results.extend(
metrics
.into_iter()
.map(|(key, metric)| (key, AggregationResult::MetricResult(metric.into()))),
);
}
fn add_empty_final_metrics_to_result(
results: &mut HashMap<String, AggregationResult>,
req_metrics: &VecWithNames<MetricAggregation>,
) -> crate::Result<()> {
results.extend(req_metrics.iter().map(|(key, req)| {
let empty_bucket = IntermediateMetricResult::empty_from_req(req);
(
key.to_string(),
AggregationResult::MetricResult(empty_bucket.into()),
)
}));
Ok(())
}
fn add_empty_final_buckets_to_result(
results: &mut HashMap<String, AggregationResult>,
req_buckets: &VecWithNames<BucketAggregationInternal>,
) -> crate::Result<()> {
let requested_buckets = req_buckets.iter();
for (key, req) in requested_buckets {
let empty_bucket = AggregationResult::BucketResult(BucketResult::empty_from_req(req)?);
results.insert(key.to_string(), empty_bucket);
}
Ok(())
}
fn add_coverted_final_buckets_to_result(
results: &mut HashMap<String, AggregationResult>,
buckets: VecWithNames<IntermediateBucketResult>,
req_buckets: &VecWithNames<BucketAggregationInternal>,
) -> crate::Result<()> {
assert_eq!(buckets.len(), req_buckets.len());
let buckets_with_request = buckets.into_iter().zip(req_buckets.values());
for ((key, bucket), req) in buckets_with_request {
let result =
AggregationResult::BucketResult(BucketResult::from_intermediate_and_req(bucket, req)?);
results.insert(key, result);
}
Ok(())
}
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
#[serde(untagged)]
/// An aggregation is either a bucket or a metric.
@@ -97,6 +145,23 @@ pub enum AggregationResult {
MetricResult(MetricResult),
}
impl AggregationResult {
pub(crate) fn get_value_from_aggregation(
&self,
_name: &str,
agg_property: &str,
) -> crate::Result<Option<f64>> {
match self {
AggregationResult::BucketResult(_bucket) => Err(TantivyError::InternalError(
"Tried to retrieve value from bucket aggregation. This is not supported and \
should not happen during collection, but should be caught during validation"
.to_string(),
)),
AggregationResult::MetricResult(metric) => metric.get_value(agg_property),
}
}
}
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
#[serde(untagged)]
/// MetricResult
@@ -107,6 +172,14 @@ pub enum MetricResult {
Stats(Stats),
}
impl MetricResult {
fn get_value(&self, agg_property: &str) -> crate::Result<Option<f64>> {
match self {
MetricResult::Average(avg) => Ok(avg.value),
MetricResult::Stats(stats) => stats.get_value(agg_property),
}
}
}
impl From<IntermediateMetricResult> for MetricResult {
fn from(metric: IntermediateMetricResult) -> Self {
match metric {
@@ -140,39 +213,64 @@ pub enum BucketResult {
/// See [HistogramAggregation](super::bucket::HistogramAggregation)
buckets: Vec<BucketEntry>,
},
/// This is the term result
Terms {
/// The buckets.
///
/// See [TermsAggregation](super::bucket::TermsAggregation)
buckets: Vec<BucketEntry>,
/// The number of documents that didn't make it into the top N due to shard_size or size
sum_other_doc_count: u64,
#[serde(skip_serializing_if = "Option::is_none")]
/// The upper bound error for the doc count of each term.
doc_count_error_upper_bound: Option<u64>,
},
}
impl BucketResult {
pub(crate) fn empty_from_req(req: &BucketAggregationInternal) -> crate::Result<Self> {
let empty_bucket = IntermediateBucketResult::empty_from_req(&req.bucket_agg);
BucketResult::from_intermediate_and_req(empty_bucket, req)
}
fn from_intermediate_and_req(
bucket_result: IntermediateBucketResult,
req: &BucketAggregationInternal,
) -> Self {
) -> crate::Result<Self> {
match bucket_result {
IntermediateBucketResult::Range(range_map) => {
let mut buckets: Vec<RangeBucketEntry> = range_map
IntermediateBucketResult::Range(range_res) => {
let mut buckets: Vec<RangeBucketEntry> = range_res
.buckets
.into_iter()
.map(|(_, bucket)| {
RangeBucketEntry::from_intermediate_and_req(bucket, &req.sub_aggregation)
})
.collect_vec();
.collect::<crate::Result<Vec<_>>>()?;
buckets.sort_by(|a, b| {
a.from
buckets.sort_by(|left, right| {
// TODO use total_cmp next stable rust release
left.from
.unwrap_or(f64::MIN)
.partial_cmp(&b.from.unwrap_or(f64::MIN))
.partial_cmp(&right.from.unwrap_or(f64::MIN))
.unwrap_or(Ordering::Equal)
});
BucketResult::Range { buckets }
Ok(BucketResult::Range { buckets })
}
IntermediateBucketResult::Histogram { buckets } => {
let buckets = intermediate_buckets_to_final_buckets(
buckets,
req.as_histogram(),
req.as_histogram()
.expect("unexpected aggregation, expected histogram aggregation"),
&req.sub_aggregation,
);
)?;
BucketResult::Histogram { buckets }
Ok(BucketResult::Histogram { buckets })
}
IntermediateBucketResult::Terms(terms) => terms.into_final_result(
req.as_term()
.expect("unexpected aggregation, expected term aggregation"),
&req.sub_aggregation,
),
}
}
}
@@ -210,7 +308,7 @@ pub struct BucketEntry {
/// Number of documents in the bucket.
pub doc_count: u64,
#[serde(flatten)]
/// sub-aggregations in this bucket.
/// Sub-aggregations in this bucket.
pub sub_aggregation: AggregationResults,
}
@@ -218,15 +316,25 @@ impl BucketEntry {
pub(crate) fn from_intermediate_and_req(
entry: IntermediateHistogramBucketEntry,
req: &AggregationsInternal,
) -> Self {
BucketEntry {
) -> crate::Result<Self> {
Ok(BucketEntry {
key: Key::F64(entry.key),
doc_count: entry.doc_count,
sub_aggregation: AggregationResults::from_intermediate_and_req_internal(
entry.sub_aggregation,
req,
),
}
)?,
})
}
}
impl GetDocCount for &BucketEntry {
fn doc_count(&self) -> u64 {
self.doc_count
}
}
impl GetDocCount for BucketEntry {
fn doc_count(&self) -> u64 {
self.doc_count
}
}
@@ -281,16 +389,16 @@ impl RangeBucketEntry {
fn from_intermediate_and_req(
entry: IntermediateRangeBucketEntry,
req: &AggregationsInternal,
) -> Self {
RangeBucketEntry {
) -> crate::Result<Self> {
Ok(RangeBucketEntry {
key: entry.key,
doc_count: entry.doc_count,
sub_aggregation: AggregationResults::from_intermediate_and_req_internal(
entry.sub_aggregation,
req,
),
)?,
to: entry.to,
from: entry.from,
}
})
}
}
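
Most signatures in this file change from returning `Self` to returning `crate::Result<Self>`, and the bucket conversion collects into a `Result` before sorting. A minimal sketch of that pattern is shown below, assuming a simplified `RangeEntry` type and `String` errors; `convert_entry` and `convert_and_sort` are hypothetical names.

```rust
use std::cmp::Ordering;

/// Simplified stand-in for `RangeBucketEntry`.
#[derive(Debug, Clone, PartialEq)]
struct RangeEntry {
    from: Option<f64>,
    doc_count: u64,
}

/// Hypothetical fallible conversion of one intermediate entry.
fn convert_entry(from: Option<f64>, doc_count: i64) -> Result<RangeEntry, String> {
    if doc_count < 0 {
        return Err(format!("negative doc_count {}", doc_count));
    }
    Ok(RangeEntry {
        from,
        doc_count: doc_count as u64,
    })
}

/// Converts all entries, stopping at the first error, then sorts by lower bound.
fn convert_and_sort(raw: &[(Option<f64>, i64)]) -> Result<Vec<RangeEntry>, String> {
    let mut entries = raw
        .iter()
        .map(|&(from, doc_count)| convert_entry(from, doc_count))
        // Collecting into Result short-circuits on the first Err.
        .collect::<Result<Vec<_>, String>>()?;
    entries.sort_by(|left, right| {
        // An open lower bound sorts first, as in the diff.
        left.from
            .unwrap_or(f64::MIN)
            .partial_cmp(&right.from.unwrap_or(f64::MIN))
            .unwrap_or(Ordering::Equal)
    });
    Ok(entries)
}
```

Collecting into `Result<Vec<_>, _>` lets a single `?` propagate a failure from any nested sub-aggregation, which is what allows the recursive `from_intermediate_and_req_internal` calls to bubble errors up.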

View File

@@ -13,9 +13,7 @@ use crate::aggregation::f64_from_fastfield_u64;
use crate::aggregation::intermediate_agg_result::{
IntermediateAggregationResults, IntermediateBucketResult, IntermediateHistogramBucketEntry,
};
use crate::aggregation::segment_agg_result::{
SegmentAggregationResultsCollector, SegmentHistogramBucketEntry,
};
use crate::aggregation::segment_agg_result::SegmentAggregationResultsCollector;
use crate::fastfield::{DynamicFastFieldReader, FastFieldReader};
use crate::schema::Type;
use crate::{DocId, TantivyError};
@@ -58,7 +56,7 @@ use crate::{DocId, TantivyError};
/// "prices": {
/// "histogram": {
/// "field": "price",
/// "interval": 10,
/// "interval": 10
/// }
/// }
/// }
@@ -71,16 +69,17 @@ use crate::{DocId, TantivyError};
pub struct HistogramAggregation {
/// The field to aggregate on.
pub field: String,
/// The interval to chunk your data range. The buckets span ranges of [0..interval).
/// The interval to chunk your data range. Each bucket spans a value range of [0..interval).
/// Must be a positive value.
pub interval: f64,
/// The interval implicitly defines an absolute grid of buckets `[interval * k, interval * (k +
/// 1))`.
///
/// Offset makes it possible to shift this grid into `[offset + interval * k, offset + interval
/// * (k + 1)) Offset has to be in the range [0, interval).
/// Offset makes it possible to shift this grid into
/// `[offset + interval * k, offset + interval * (k + 1))`. Offset has to be in the range [0,
/// interval).
///
/// As an example. If there are two documents with value 8 and 12 and interval 10.0, they would
/// As an example, if there are two documents with value 9 and 12 and interval 10.0, they would
/// fall into the buckets with the key 0 and 10.
/// With offset 5 and interval 10, they would both fall into the bucket with the key 5 and the
/// range [5..15)
@@ -93,6 +92,22 @@ pub struct HistogramAggregation {
///
/// hard_bounds only limits the buckets, to force a range set both extended_bounds and
/// hard_bounds to the same range.
///
/// ## Example
/// ```json
/// {
/// "prices": {
/// "histogram": {
/// "field": "price",
/// "interval": 10,
/// "hard_bounds": {
/// "min": 0,
/// "max": 100
/// }
/// }
/// }
/// }
/// ```
pub hard_bounds: Option<HistogramBounds>,
/// Can be set to extend your bounds. The range of the buckets is by default defined by the
/// data range of the values of the documents. As the name suggests, this can only be used to
@@ -159,6 +174,27 @@ impl HistogramBounds {
}
}
#[derive(Clone, Debug, PartialEq)]
pub(crate) struct SegmentHistogramBucketEntry {
pub key: f64,
pub doc_count: u64,
}
impl SegmentHistogramBucketEntry {
pub(crate) fn into_intermediate_bucket_entry(
self,
sub_aggregation: SegmentAggregationResultsCollector,
agg_with_accessor: &AggregationsWithAccessor,
) -> crate::Result<IntermediateHistogramBucketEntry> {
Ok(IntermediateHistogramBucketEntry {
key: self.key,
doc_count: self.doc_count,
sub_aggregation: sub_aggregation
.into_intermediate_aggregations_result(agg_with_accessor)?,
})
}
}
/// The collector puts values from the fast field into the correct buckets and does a conversion to
/// the correct datatype.
#[derive(Clone, Debug, PartialEq)]
@@ -174,7 +210,10 @@ pub struct SegmentHistogramCollector {
}
impl SegmentHistogramCollector {
pub fn into_intermediate_bucket_result(self) -> IntermediateBucketResult {
pub fn into_intermediate_bucket_result(
self,
agg_with_accessor: &BucketAggregationWithAccessor,
) -> crate::Result<IntermediateBucketResult> {
let mut buckets = Vec::with_capacity(
self.buckets
.iter()
@@ -188,13 +227,20 @@ impl SegmentHistogramCollector {
//
// Empty buckets may be added later again in the final result, depending on the request.
if let Some(sub_aggregations) = self.sub_aggregations {
buckets.extend(
self.buckets
.into_iter()
.zip(sub_aggregations.into_iter())
.filter(|(bucket, _sub_aggregation)| bucket.doc_count != 0)
.map(|(bucket, sub_aggregation)| (bucket, sub_aggregation).into()),
)
for bucket_res in self
.buckets
.into_iter()
.zip(sub_aggregations.into_iter())
.filter(|(bucket, _sub_aggregation)| bucket.doc_count != 0)
.map(|(bucket, sub_aggregation)| {
bucket.into_intermediate_bucket_entry(
sub_aggregation,
&agg_with_accessor.sub_aggregation,
)
})
{
buckets.push(bucket_res?);
}
} else {
buckets.extend(
self.buckets
@@ -204,7 +250,7 @@ impl SegmentHistogramCollector {
);
};
IntermediateBucketResult::Histogram { buckets }
Ok(IntermediateBucketResult::Histogram { buckets })
}
pub(crate) fn from_req_and_validate(
@@ -273,12 +319,16 @@ impl SegmentHistogramCollector {
let get_bucket_num =
|val| (get_bucket_num_f64(val, interval, offset) as i64 - first_bucket_num) as usize;
let accessor = bucket_with_accessor
.accessor
.as_single()
.expect("unexpected fast field cardinality");
let mut iter = doc.chunks_exact(4);
for docs in iter.by_ref() {
let val0 = self.f64_from_fastfield_u64(bucket_with_accessor.accessor.get(docs[0]));
let val1 = self.f64_from_fastfield_u64(bucket_with_accessor.accessor.get(docs[1]));
let val2 = self.f64_from_fastfield_u64(bucket_with_accessor.accessor.get(docs[2]));
let val3 = self.f64_from_fastfield_u64(bucket_with_accessor.accessor.get(docs[3]));
let val0 = self.f64_from_fastfield_u64(accessor.get(docs[0]));
let val1 = self.f64_from_fastfield_u64(accessor.get(docs[1]));
let val2 = self.f64_from_fastfield_u64(accessor.get(docs[2]));
let val3 = self.f64_from_fastfield_u64(accessor.get(docs[3]));
let bucket_pos0 = get_bucket_num(val0);
let bucket_pos1 = get_bucket_num(val1);
@@ -315,8 +365,7 @@ impl SegmentHistogramCollector {
);
}
for doc in iter.remainder() {
let val =
f64_from_fastfield_u64(bucket_with_accessor.accessor.get(*doc), &self.field_type);
let val = f64_from_fastfield_u64(accessor.get(*doc), &self.field_type);
if !bounds.contains(val) {
continue;
}
@@ -393,7 +442,7 @@ fn intermediate_buckets_to_final_buckets_fill_gaps(
buckets: Vec<IntermediateHistogramBucketEntry>,
histogram_req: &HistogramAggregation,
sub_aggregation: &AggregationsInternal,
) -> Vec<BucketEntry> {
) -> crate::Result<Vec<BucketEntry>> {
// Generate the full list of buckets without gaps.
//
// The bounds are the min max from the current buckets, optionally extended by
@@ -436,7 +485,7 @@ fn intermediate_buckets_to_final_buckets_fill_gaps(
.map(|intermediate_bucket| {
BucketEntry::from_intermediate_and_req(intermediate_bucket, sub_aggregation)
})
.collect_vec()
.collect::<crate::Result<Vec<_>>>()
}
// Convert to BucketEntry
@@ -444,7 +493,7 @@ pub(crate) fn intermediate_buckets_to_final_buckets(
buckets: Vec<IntermediateHistogramBucketEntry>,
histogram_req: &HistogramAggregation,
sub_aggregation: &AggregationsInternal,
) -> Vec<BucketEntry> {
) -> crate::Result<Vec<BucketEntry>> {
if histogram_req.min_doc_count() == 0 {
// With min_doc_count == 0, we may need to add buckets, so that there are no
// gaps, since intermediate result does not contain empty buckets (filtered to
@@ -456,7 +505,7 @@ pub(crate) fn intermediate_buckets_to_final_buckets(
.into_iter()
.filter(|bucket| bucket.doc_count >= histogram_req.min_doc_count())
.map(|bucket| BucketEntry::from_intermediate_and_req(bucket, sub_aggregation))
.collect_vec()
.collect::<crate::Result<Vec<_>>>()
}
}
@@ -630,41 +679,9 @@ mod tests {
};
use crate::aggregation::metric::{AverageAggregation, StatsAggregation};
use crate::aggregation::tests::{
get_test_index_2_segments, get_test_index_from_values, get_test_index_with_num_docs,
exec_request, exec_request_with_query, get_test_index_2_segments,
get_test_index_from_values, get_test_index_with_num_docs,
};
use crate::aggregation::AggregationCollector;
use crate::query::{AllQuery, TermQuery};
use crate::schema::IndexRecordOption;
use crate::{Index, Term};
fn exec_request(agg_req: Aggregations, index: &Index) -> crate::Result<Value> {
exec_request_with_query(agg_req, index, None)
}
fn exec_request_with_query(
agg_req: Aggregations,
index: &Index,
query: Option<(&str, &str)>,
) -> crate::Result<Value> {
let collector = AggregationCollector::from_aggs(agg_req);
let reader = index.reader()?;
let searcher = reader.searcher();
let agg_res = if let Some((field, term)) = query {
let text_field = reader.searcher().schema().get_field(field).unwrap();
let term_query = TermQuery::new(
Term::from_field_text(text_field, term),
IndexRecordOption::Basic,
);
searcher.search(&term_query, &collector)?
} else {
searcher.search(&AllQuery, &collector)?
};
let res: Value = serde_json::from_str(&serde_json::to_string(&agg_res)?)?;
Ok(res)
}
#[test]
fn histogram_test_crooked_values() -> crate::Result<()> {
@@ -1347,4 +1364,29 @@ mod tests {
Ok(())
}
#[test]
fn histogram_invalid_request() -> crate::Result<()> {
let index = get_test_index_2_segments(true)?;
let agg_req: Aggregations = vec![(
"histogram".to_string(),
Aggregation::Bucket(BucketAggregation {
bucket_agg: BucketAggregationType::Histogram(HistogramAggregation {
field: "score_f64".to_string(),
interval: 0.0,
..Default::default()
}),
sub_aggregation: Default::default(),
}),
)]
.into_iter()
.collect();
let agg_res = exec_request(agg_req, &index);
assert!(agg_res.is_err());
Ok(())
}
}
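
The bucket grid described in the `HistogramAggregation` doc comments, `[offset + interval * k, offset + interval * (k + 1))`, can be illustrated with a small standalone function. This is a sketch under stated assumptions, not the crate's implementation: the name `bucket_key`, the `String` error type, and the choice to reject an out-of-range offset are all hypothetical. The documented rules that the interval must be positive and that the offset lies in `[0, interval)` are taken from the comments above.

```rust
/// Returns the key (lower bound) of the bucket that `val` falls into.
///
/// Hypothetical helper: validates the documented constraints and then
/// computes `offset + interval * k` for the matching grid index `k`.
fn bucket_key(val: f64, interval: f64, offset: f64) -> Result<f64, String> {
    // Written as a negated comparison so that NaN is rejected as well.
    if !(interval > 0.0) {
        return Err(format!("interval must be positive, got {}", interval));
    }
    if !(offset >= 0.0 && offset < interval) {
        return Err(format!("offset must be in [0, {}), got {}", interval, offset));
    }
    // Grid index k such that offset + interval * k <= val < offset + interval * (k + 1).
    let k = ((val - offset) / interval).floor();
    Ok(offset + interval * k)
}
```

With interval 10 and no offset, the values 9 and 12 land in the buckets keyed 0 and 10. With offset 5 both land in the bucket keyed 5, matching the example in the doc comment. An interval of 0 is rejected, which is the case the `histogram_invalid_request` test exercises against the real collector.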


@@ -9,8 +9,132 @@
mod histogram;
mod range;
mod term_agg;
use std::collections::HashMap;
pub(crate) use histogram::SegmentHistogramCollector;
pub use histogram::*;
pub(crate) use range::SegmentRangeCollector;
pub use range::*;
use serde::{de, Deserialize, Deserializer, Serialize, Serializer};
pub use term_agg::*;
/// Order for buckets in a bucket aggregation.
#[derive(Clone, Copy, Debug, PartialEq, Serialize, Deserialize)]
pub enum Order {
/// Asc order
#[serde(rename = "asc")]
Asc,
/// Desc order
#[serde(rename = "desc")]
Desc,
}
impl Default for Order {
fn default() -> Self {
Order::Desc
}
}
#[derive(Clone, Debug, PartialEq)]
/// Order property by which to apply the order
pub enum OrderTarget {
/// The key of the bucket
Key,
/// The doc count of the bucket
Count,
/// Order by the value of the sub-aggregation metric identified by the given `String`.
///
/// Only single value metrics are supported currently
SubAggregation(String),
}
impl Default for OrderTarget {
fn default() -> Self {
OrderTarget::Count
}
}
impl From<&str> for OrderTarget {
fn from(val: &str) -> Self {
match val {
"_key" => OrderTarget::Key,
"_count" => OrderTarget::Count,
_ => OrderTarget::SubAggregation(val.to_string()),
}
}
}
impl ToString for OrderTarget {
fn to_string(&self) -> String {
match self {
OrderTarget::Key => "_key".to_string(),
OrderTarget::Count => "_count".to_string(),
OrderTarget::SubAggregation(agg) => agg.to_string(),
}
}
}
/// Set the order. target is either "_count", "_key", or the name of
/// a metric sub_aggregation.
///
/// De/Serializes to elasticsearch compatible JSON.
///
/// Examples in JSON format:
/// { "_count": "asc" }
/// { "_key": "asc" }
/// { "average_price": "asc" }
#[derive(Clone, Default, Debug, PartialEq)]
pub struct CustomOrder {
/// The target property to sort by
pub target: OrderTarget,
/// The order asc or desc
pub order: Order,
}
impl Serialize for CustomOrder {
fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
where S: Serializer {
let map: HashMap<String, Order> =
std::iter::once((self.target.to_string(), self.order)).collect();
map.serialize(serializer)
}
}
impl<'de> Deserialize<'de> for CustomOrder {
fn deserialize<D>(deserializer: D) -> Result<CustomOrder, D::Error>
where D: Deserializer<'de> {
HashMap::<String, Order>::deserialize(deserializer).and_then(|map| {
if let Some((key, value)) = map.into_iter().next() {
Ok(CustomOrder {
target: key.as_str().into(),
order: value,
})
} else {
Err(de::Error::custom(
"unexpected empty map in order".to_string(),
))
}
})
}
}
#[test]
fn custom_order_serde_test() {
let order = CustomOrder {
target: OrderTarget::Key,
order: Order::Desc,
};
let order_str = serde_json::to_string(&order).unwrap();
assert_eq!(order_str, "{\"_key\":\"desc\"}");
let order_deser = serde_json::from_str(&order_str).unwrap();
assert_eq!(order, order_deser);
let order_deser: serde_json::Result<CustomOrder> = serde_json::from_str("{}");
assert!(order_deser.is_err());
let order_deser: serde_json::Result<CustomOrder> = serde_json::from_str("[]");
assert!(order_deser.is_err());
}
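
The hunk above implements `ToString` directly for `OrderTarget`. A common alternative is to implement `std::fmt::Display`, which provides `to_string()` through the standard library's blanket implementation. The sketch below redeclares a local copy of the enum purely for illustration. It is not a proposed patch, and the serde side of `CustomOrder` is left out so the example depends only on the standard library.

```rust
use std::fmt;

/// Local copy of the enum from the diff, redeclared for this sketch.
#[derive(Clone, Debug, PartialEq)]
enum OrderTarget {
    Key,
    Count,
    SubAggregation(String),
}

impl From<&str> for OrderTarget {
    fn from(val: &str) -> Self {
        match val {
            "_key" => OrderTarget::Key,
            "_count" => OrderTarget::Count,
            // Anything else is treated as the name of a sub-aggregation.
            _ => OrderTarget::SubAggregation(val.to_string()),
        }
    }
}

impl fmt::Display for OrderTarget {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            OrderTarget::Key => f.write_str("_key"),
            OrderTarget::Count => f.write_str("_count"),
            OrderTarget::SubAggregation(name) => f.write_str(name),
        }
    }
}
```

Parsing a target and formatting it again yields the original string, which is the property the single-entry map serialization of `CustomOrder` relies on.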


@@ -1,3 +1,4 @@
use std::fmt::Debug;
use std::ops::Range;
use serde::{Deserialize, Serialize};
@@ -5,10 +6,10 @@ use serde::{Deserialize, Serialize};
use crate::aggregation::agg_req_with_accessor::{
AggregationsWithAccessor, BucketAggregationWithAccessor,
};
use crate::aggregation::intermediate_agg_result::IntermediateBucketResult;
use crate::aggregation::segment_agg_result::{
SegmentAggregationResultsCollector, SegmentRangeBucketEntry,
use crate::aggregation::intermediate_agg_result::{
IntermediateBucketResult, IntermediateRangeBucketEntry, IntermediateRangeBucketResult,
};
use crate::aggregation::segment_agg_result::SegmentAggregationResultsCollector;
use crate::aggregation::{f64_from_fastfield_u64, f64_to_fastfield_u64, Key};
use crate::fastfield::FastFieldReader;
use crate::schema::Type;
@@ -38,12 +39,12 @@ use crate::{DocId, TantivyError};
/// # Request JSON Format
/// ```json
/// {
/// "range": {
/// "my_ranges": {
/// "field": "score",
/// "ranges": [
/// { "to": 3.0 },
/// { "from": 3.0, "to": 7.0 },
/// { "from": 7.0, "to": 20.0 }
/// { "from": 7.0, "to": 20.0 },
/// { "from": 20.0 }
/// ]
/// }
@@ -102,22 +103,72 @@ pub struct SegmentRangeCollector {
field_type: Type,
}
#[derive(Clone, PartialEq)]
pub(crate) struct SegmentRangeBucketEntry {
pub key: Key,
pub doc_count: u64,
pub sub_aggregation: Option<SegmentAggregationResultsCollector>,
/// The from range of the bucket. Equals f64::MIN when None.
pub from: Option<f64>,
/// The to range of the bucket. Equals f64::MAX when None. Open interval, `to` is not
/// inclusive.
pub to: Option<f64>,
}
impl Debug for SegmentRangeBucketEntry {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.debug_struct("SegmentRangeBucketEntry")
.field("key", &self.key)
.field("doc_count", &self.doc_count)
.field("from", &self.from)
.field("to", &self.to)
.finish()
}
}
impl SegmentRangeBucketEntry {
pub(crate) fn into_intermediate_bucket_entry(
self,
agg_with_accessor: &AggregationsWithAccessor,
) -> crate::Result<IntermediateRangeBucketEntry> {
let sub_aggregation = if let Some(sub_aggregation) = self.sub_aggregation {
sub_aggregation.into_intermediate_aggregations_result(agg_with_accessor)?
} else {
Default::default()
};
Ok(IntermediateRangeBucketEntry {
key: self.key,
doc_count: self.doc_count,
sub_aggregation,
from: self.from,
to: self.to,
})
}
}
impl SegmentRangeCollector {
pub fn into_intermediate_bucket_result(self) -> IntermediateBucketResult {
pub fn into_intermediate_bucket_result(
self,
agg_with_accessor: &BucketAggregationWithAccessor,
) -> crate::Result<IntermediateBucketResult> {
let field_type = self.field_type;
let buckets = self
.buckets
.into_iter()
.map(move |range_bucket| {
(
Ok((
range_to_string(&range_bucket.range, &field_type),
range_bucket.bucket.into(),
)
range_bucket
.bucket
.into_intermediate_bucket_entry(&agg_with_accessor.sub_aggregation)?,
))
})
.collect();
.collect::<crate::Result<_>>()?;
IntermediateBucketResult::Range(buckets)
Ok(IntermediateBucketResult::Range(
IntermediateRangeBucketResult { buckets },
))
}
pub(crate) fn from_req_and_validate(
@@ -175,11 +226,15 @@ impl SegmentRangeCollector {
force_flush: bool,
) {
let mut iter = doc.chunks_exact(4);
let accessor = bucket_with_accessor
.accessor
.as_single()
.expect("unexpected fast field cardinality");
for docs in iter.by_ref() {
let val1 = bucket_with_accessor.accessor.get(docs[0]);
let val2 = bucket_with_accessor.accessor.get(docs[1]);
let val3 = bucket_with_accessor.accessor.get(docs[2]);
let val4 = bucket_with_accessor.accessor.get(docs[3]);
let val1 = accessor.get(docs[0]);
let val2 = accessor.get(docs[1]);
let val3 = accessor.get(docs[2]);
let val4 = accessor.get(docs[3]);
let bucket_pos1 = self.get_bucket_pos(val1);
let bucket_pos2 = self.get_bucket_pos(val2);
let bucket_pos3 = self.get_bucket_pos(val3);
@@ -191,7 +246,7 @@ impl SegmentRangeCollector {
self.increment_bucket(bucket_pos4, docs[3], &bucket_with_accessor.sub_aggregation);
}
for doc in iter.remainder() {
let val = bucket_with_accessor.accessor.get(*doc);
let val = accessor.get(*doc);
let bucket_pos = self.get_bucket_pos(val);
self.increment_bucket(bucket_pos, *doc, &bucket_with_accessor.sub_aggregation);
}
@@ -346,7 +401,8 @@ mod tests {
ranges,
};
SegmentRangeCollector::from_req_and_validate(&req, &Default::default(), field_type).unwrap()
SegmentRangeCollector::from_req_and_validate(&req, &Default::default(), field_type)
.expect("unexpected error")
}
#[test]
@@ -487,11 +543,7 @@ mod tests {
#[test]
fn range_binary_search_test_f64() {
let ranges = vec![
//(f64::MIN..10.0).into(),
(10.0..100.0).into(),
//(100.0..f64::MAX).into(),
];
let ranges = vec![(10.0..100.0).into()];
let collector = get_collector_from_ranges(ranges, Type::F64);
let search = |val: u64| collector.get_bucket_pos(val);

File diff suppressed because it is too large


@@ -86,17 +86,18 @@ impl Collector for AggregationCollector {
&self,
segment_fruits: Vec<<Self::Child as SegmentCollector>::Fruit>,
) -> crate::Result<Self::Fruit> {
merge_fruits(segment_fruits)
.map(|res| AggregationResults::from_intermediate_and_req(res, self.agg.clone()))
let res = merge_fruits(segment_fruits)?;
AggregationResults::from_intermediate_and_req(res, self.agg.clone())
}
}
fn merge_fruits(
mut segment_fruits: Vec<IntermediateAggregationResults>,
mut segment_fruits: Vec<crate::Result<IntermediateAggregationResults>>,
) -> crate::Result<IntermediateAggregationResults> {
if let Some(mut fruit) = segment_fruits.pop() {
if let Some(fruit) = segment_fruits.pop() {
let mut fruit = fruit?;
for next_fruit in segment_fruits {
fruit.merge_fruits(next_fruit);
fruit.merge_fruits(next_fruit?);
}
Ok(fruit)
} else {
@@ -106,7 +107,7 @@ fn merge_fruits(
/// AggregationSegmentCollector does the aggregation collection on a segment.
pub struct AggregationSegmentCollector {
aggs: AggregationsWithAccessor,
aggs_with_accessor: AggregationsWithAccessor,
result: SegmentAggregationResultsCollector,
}
@@ -121,22 +122,24 @@ impl AggregationSegmentCollector {
let result =
SegmentAggregationResultsCollector::from_req_and_validate(&aggs_with_accessor)?;
Ok(AggregationSegmentCollector {
aggs: aggs_with_accessor,
aggs_with_accessor,
result,
})
}
}
impl SegmentCollector for AggregationSegmentCollector {
type Fruit = IntermediateAggregationResults;
type Fruit = crate::Result<IntermediateAggregationResults>;
#[inline]
fn collect(&mut self, doc: crate::DocId, _score: crate::Score) {
self.result.collect(doc, &self.aggs);
self.result.collect(doc, &self.aggs_with_accessor);
}
fn harvest(mut self) -> Self::Fruit {
self.result.flush_staged_docs(&self.aggs, true);
self.result.into()
self.result
.flush_staged_docs(&self.aggs_with_accessor, true);
self.result
.into_intermediate_aggregations_result(&self.aggs_with_accessor)
}
}
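
With `Fruit` changed to `crate::Result<IntermediateAggregationResults>`, `merge_fruits` receives one `Result` per segment and unwraps each with `?` before merging. The sketch below mirrors that shape using hypothetical names (`Counts`, `merge_into`, `merge_segment_results`) and a plain `HashMap` of counts in place of the real intermediate results. The hunk does not show what happens for an empty input, so returning an error in that case is an assumption of this sketch.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a per-segment intermediate result.
type Counts = HashMap<String, u64>;

/// Adds the counts of `other` into `target`.
fn merge_into(target: &mut Counts, other: Counts) {
    for (key, count) in other {
        *target.entry(key).or_insert(0) += count;
    }
}

/// Merges per-segment results; the first failed segment aborts the merge.
fn merge_segment_results(mut segments: Vec<Result<Counts, String>>) -> Result<Counts, String> {
    match segments.pop() {
        Some(first) => {
            let mut merged = first?;
            for next in segments {
                merge_into(&mut merged, next?);
            }
            Ok(merged)
        }
        // Assumption: an empty input is reported as an error.
        None => Err("no segment results provided".to_string()),
    }
}
```

Carrying the `Result` inside the fruit defers error handling to the merge step, where it can be returned to the caller of the search.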


@@ -9,30 +9,27 @@ use itertools::Itertools;
use serde::{Deserialize, Serialize};
use super::agg_req::{AggregationsInternal, BucketAggregationType, MetricAggregation};
use super::metric::{IntermediateAverage, IntermediateStats};
use super::segment_agg_result::{
SegmentAggregationResultsCollector, SegmentBucketResultCollector, SegmentHistogramBucketEntry,
SegmentMetricResultCollector, SegmentRangeBucketEntry,
use super::agg_result::BucketResult;
use super::bucket::{
cut_off_buckets, get_agg_name_and_property, GetDocCount, Order, OrderTarget,
SegmentHistogramBucketEntry, TermsAggregation,
};
use super::metric::{IntermediateAverage, IntermediateStats};
use super::segment_agg_result::SegmentMetricResultCollector;
use super::{Key, SerializedKey, VecWithNames};
use crate::aggregation::agg_result::{AggregationResults, BucketEntry};
use crate::aggregation::bucket::TermsAggregationInternal;
/// Contains the intermediate aggregation result, which is optimized to be merged with other
/// intermediate results.
#[derive(Default, Clone, Debug, PartialEq, Serialize, Deserialize)]
pub struct IntermediateAggregationResults {
#[serde(skip_serializing_if = "Option::is_none")]
pub(crate) metrics: Option<VecWithNames<IntermediateMetricResult>>,
#[serde(skip_serializing_if = "Option::is_none")]
pub(crate) buckets: Option<VecWithNames<IntermediateBucketResult>>,
}
impl From<SegmentAggregationResultsCollector> for IntermediateAggregationResults {
fn from(tree: SegmentAggregationResultsCollector) -> Self {
let metrics = tree.metrics.map(VecWithNames::from_other);
let buckets = tree.buckets.map(VecWithNames::from_other);
Self { metrics, buckets }
}
}
impl IntermediateAggregationResults {
pub(crate) fn empty_from_req(req: &AggregationsInternal) -> Self {
let metrics = if req.metrics.is_empty() {
@@ -162,29 +159,21 @@ impl IntermediateMetricResult {
pub enum IntermediateBucketResult {
/// This is the range entry for a bucket, which contains a key, count, from, to, and optionally
/// sub_aggregations.
Range(FnvHashMap<SerializedKey, IntermediateRangeBucketEntry>),
Range(IntermediateRangeBucketResult),
/// This is the histogram entry for a bucket, which contains a key, count, and optionally
/// sub_aggregations.
Histogram {
/// The buckets
buckets: Vec<IntermediateHistogramBucketEntry>,
},
}
impl From<SegmentBucketResultCollector> for IntermediateBucketResult {
fn from(collector: SegmentBucketResultCollector) -> Self {
match collector {
SegmentBucketResultCollector::Range(range) => range.into_intermediate_bucket_result(),
SegmentBucketResultCollector::Histogram(histogram) => {
histogram.into_intermediate_bucket_result()
}
}
}
/// Term aggregation
Terms(IntermediateTermBucketResult),
}
impl IntermediateBucketResult {
pub(crate) fn empty_from_req(req: &BucketAggregationType) -> Self {
match req {
BucketAggregationType::Terms(_) => IntermediateBucketResult::Terms(Default::default()),
BucketAggregationType::Range(_) => IntermediateBucketResult::Range(Default::default()),
BucketAggregationType::Histogram(_) => {
IntermediateBucketResult::Histogram { buckets: vec![] }
@@ -194,24 +183,34 @@ impl IntermediateBucketResult {
fn merge_fruits(&mut self, other: IntermediateBucketResult) {
match (self, other) {
(
IntermediateBucketResult::Range(entries_left),
IntermediateBucketResult::Range(entries_right),
IntermediateBucketResult::Terms(term_res_left),
IntermediateBucketResult::Terms(term_res_right),
) => {
merge_maps(entries_left, entries_right);
merge_maps(&mut term_res_left.entries, term_res_right.entries);
term_res_left.sum_other_doc_count += term_res_right.sum_other_doc_count;
term_res_left.doc_count_error_upper_bound +=
term_res_right.doc_count_error_upper_bound;
}
(
IntermediateBucketResult::Range(range_res_left),
IntermediateBucketResult::Range(range_res_right),
) => {
merge_maps(&mut range_res_left.buckets, range_res_right.buckets);
}
(
IntermediateBucketResult::Histogram {
buckets: entries_left,
buckets: buckets_left,
..
},
IntermediateBucketResult::Histogram {
buckets: entries_right,
buckets: buckets_right,
..
},
) => {
let mut buckets = entries_left
let buckets = buckets_left
.drain(..)
.merge_join_by(entries_right.into_iter(), |left, right| {
.merge_join_by(buckets_right.into_iter(), |left, right| {
left.key.partial_cmp(&right.key).unwrap_or(Ordering::Equal)
})
.map(|either| match either {
@@ -224,7 +223,7 @@ impl IntermediateBucketResult {
})
.collect();
std::mem::swap(entries_left, &mut buckets);
*buckets_left = buckets;
}
(IntermediateBucketResult::Range(_), _) => {
panic!("try merge on different types")
@@ -232,10 +231,118 @@ impl IntermediateBucketResult {
(IntermediateBucketResult::Histogram { .. }, _) => {
panic!("try merge on different types")
}
(IntermediateBucketResult::Terms { .. }, _) => {
panic!("try merge on different types")
}
}
}
}
#[derive(Default, Clone, Debug, PartialEq, Serialize, Deserialize)]
/// Range aggregation including error counts
pub struct IntermediateRangeBucketResult {
pub(crate) buckets: FnvHashMap<SerializedKey, IntermediateRangeBucketEntry>,
}
#[derive(Default, Clone, Debug, PartialEq, Serialize, Deserialize)]
/// Term aggregation including error counts
pub struct IntermediateTermBucketResult {
pub(crate) entries: FnvHashMap<String, IntermediateTermBucketEntry>,
pub(crate) sum_other_doc_count: u64,
pub(crate) doc_count_error_upper_bound: u64,
}
impl IntermediateTermBucketResult {
pub(crate) fn into_final_result(
self,
req: &TermsAggregation,
sub_aggregation_req: &AggregationsInternal,
) -> crate::Result<BucketResult> {
let req = TermsAggregationInternal::from_req(req);
let mut buckets: Vec<BucketEntry> = self
.entries
.into_iter()
.filter(|bucket| bucket.1.doc_count >= req.min_doc_count)
.map(|(key, entry)| {
Ok(BucketEntry {
key: Key::Str(key),
doc_count: entry.doc_count,
sub_aggregation: AggregationResults::from_intermediate_and_req_internal(
entry.sub_aggregation,
sub_aggregation_req,
)?,
})
})
.collect::<crate::Result<_>>()?;
let order = req.order.order;
match req.order.target {
OrderTarget::Key => {
buckets.sort_by(|left, right| {
if req.order.order == Order::Desc {
left.key.partial_cmp(&right.key)
} else {
right.key.partial_cmp(&left.key)
}
.expect("expected type string, which is always sortable")
});
}
OrderTarget::Count => {
if req.order.order == Order::Desc {
buckets.sort_unstable_by_key(|bucket| std::cmp::Reverse(bucket.doc_count()));
} else {
buckets.sort_unstable_by_key(|bucket| bucket.doc_count());
}
}
OrderTarget::SubAggregation(name) => {
let (agg_name, agg_property) = get_agg_name_and_property(&name);
let mut buckets_with_val = buckets
.into_iter()
.map(|bucket| {
let val = bucket
.sub_aggregation
.get_value_from_aggregation(agg_name, agg_property)?
.unwrap_or(f64::NAN);
Ok((bucket, val))
})
.collect::<crate::Result<Vec<_>>>()?;
buckets_with_val.sort_by(|(_, val1), (_, val2)| {
// TODO use total_cmp in next rust stable release
match &order {
Order::Desc => val2.partial_cmp(val1).unwrap_or(std::cmp::Ordering::Equal),
Order::Asc => val1.partial_cmp(val2).unwrap_or(std::cmp::Ordering::Equal),
}
});
buckets = buckets_with_val
.into_iter()
.map(|(bucket, _val)| bucket)
.collect_vec();
}
}
// We ignore _term_doc_count_before_cutoff here, because it increases the upperbound error
// only for terms that didn't make it into the top N.
//
// This can be interesting as a measure of the quality of the results, but it is not suitable
// for checking the actual error count of the returned terms.
let (_term_doc_count_before_cutoff, sum_other_doc_count) =
cut_off_buckets(&mut buckets, req.size as usize);
let doc_count_error_upper_bound = if req.show_term_doc_count_error {
Some(self.doc_count_error_upper_bound)
} else {
None
};
Ok(BucketResult::Terms {
buckets,
sum_other_doc_count: self.sum_other_doc_count + sum_other_doc_count,
doc_count_error_upper_bound,
})
}
}
trait MergeFruits {
fn merge_fruits(&mut self, other: Self);
}
@@ -277,26 +384,6 @@ impl From<SegmentHistogramBucketEntry> for IntermediateHistogramBucketEntry {
}
}
impl
From<(
SegmentHistogramBucketEntry,
SegmentAggregationResultsCollector,
)> for IntermediateHistogramBucketEntry
{
fn from(
entry: (
SegmentHistogramBucketEntry,
SegmentAggregationResultsCollector,
),
) -> Self {
IntermediateHistogramBucketEntry {
key: entry.0.key,
doc_count: entry.0.doc_count,
sub_aggregation: entry.1.into(),
}
}
}
/// This is the range entry for a bucket, which contains a key, count, and optionally
/// sub_aggregations.
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
@@ -305,7 +392,6 @@ pub struct IntermediateRangeBucketEntry {
pub key: Key,
/// The number of documents in the bucket.
pub doc_count: u64,
pub(crate) values: Option<Vec<u64>>,
/// The sub_aggregation in this bucket.
pub sub_aggregation: IntermediateAggregationResults,
/// The from range of the bucket. Equals f64::MIN when None.
@@ -316,22 +402,20 @@ pub struct IntermediateRangeBucketEntry {
pub to: Option<f64>,
}
impl From<SegmentRangeBucketEntry> for IntermediateRangeBucketEntry {
fn from(entry: SegmentRangeBucketEntry) -> Self {
let sub_aggregation = if let Some(sub_aggregation) = entry.sub_aggregation {
sub_aggregation.into()
} else {
Default::default()
};
/// This is the term entry for a bucket, which contains a count, and optionally
/// sub_aggregations.
#[derive(Clone, Default, Debug, PartialEq, Serialize, Deserialize)]
pub struct IntermediateTermBucketEntry {
/// The number of documents in the bucket.
pub doc_count: u64,
/// The sub_aggregation in this bucket.
pub sub_aggregation: IntermediateAggregationResults,
}
IntermediateRangeBucketEntry {
key: entry.key,
doc_count: entry.doc_count,
values: None,
sub_aggregation,
to: entry.to,
from: entry.from,
}
impl MergeFruits for IntermediateTermBucketEntry {
fn merge_fruits(&mut self, other: IntermediateTermBucketEntry) {
self.doc_count += other.doc_count;
self.sub_aggregation.merge_fruits(other.sub_aggregation);
}
}
@@ -366,7 +450,6 @@ mod tests {
IntermediateRangeBucketEntry {
key: Key::Str(key.to_string()),
doc_count: *doc_count,
values: None,
sub_aggregation: Default::default(),
from: None,
to: None,
@@ -375,7 +458,7 @@ mod tests {
}
map.insert(
"my_agg_level2".to_string(),
IntermediateBucketResult::Range(buckets),
IntermediateBucketResult::Range(IntermediateRangeBucketResult { buckets }),
);
IntermediateAggregationResults {
buckets: Some(VecWithNames::from_entries(map.into_iter().collect())),
@@ -394,7 +477,6 @@ mod tests {
IntermediateRangeBucketEntry {
key: Key::Str(key.to_string()),
doc_count: *doc_count,
values: None,
from: None,
to: None,
sub_aggregation: get_sub_test_tree(&[(
@@ -406,7 +488,7 @@ mod tests {
}
map.insert(
"my_agg_level1".to_string(),
IntermediateBucketResult::Range(buckets),
IntermediateBucketResult::Range(IntermediateRangeBucketResult { buckets }),
);
IntermediateAggregationResults {
buckets: Some(VecWithNames::from_entries(map.into_iter().collect())),
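
The `SubAggregation` ordering in `into_final_result` sorts buckets by an `f64` metric and falls back to `Ordering::Equal` when `partial_cmp` returns `None`, with a TODO to switch to `total_cmp`. A minimal sketch of that switch follows, assuming a toolchain where `f64::total_cmp` is available. The `(String, f64)` tuples and the function name `sort_by_metric` are hypothetical stand-ins for the bucket entries and their extracted metric values.

```rust
/// Local copy of the sort direction, redeclared for this sketch.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Order {
    Asc,
    Desc,
}

/// Sorts `(bucket key, metric value)` pairs by the metric value.
///
/// `total_cmp` defines a total order over all `f64` values, so no fallback
/// for incomparable values is needed.
fn sort_by_metric(buckets: &mut Vec<(String, f64)>, order: Order) {
    buckets.sort_by(|(_, left), (_, right)| match order {
        Order::Asc => left.total_cmp(right),
        Order::Desc => right.total_cmp(left),
    });
}
```

Treating every NaN comparison as `Equal`, as the `partial_cmp` fallback does, does not give the sort a consistent ordering, so where NaN-valued buckets end up is not well defined. A total order removes that ambiguity.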


@@ -19,7 +19,7 @@ use crate::DocId;
/// "avg": {
/// "field": "score",
/// }
/// }
/// }
/// ```
pub struct AverageAggregation {
/// The field name to compute the stats on.


@@ -3,7 +3,7 @@ use serde::{Deserialize, Serialize};
use crate::aggregation::f64_from_fastfield_u64;
use crate::fastfield::{DynamicFastFieldReader, FastFieldReader};
use crate::schema::Type;
use crate::DocId;
use crate::{DocId, TantivyError};
/// A multi-value metric aggregation that computes stats of numeric values that are
/// extracted from the aggregated documents.
@@ -53,6 +53,23 @@ pub struct Stats {
pub avg: Option<f64>,
}
impl Stats {
pub(crate) fn get_value(&self, agg_property: &str) -> crate::Result<Option<f64>> {
match agg_property {
"count" => Ok(Some(self.count as f64)),
"sum" => Ok(Some(self.sum)),
"standard_deviation" => Ok(self.standard_deviation),
"min" => Ok(self.min),
"max" => Ok(self.max),
"avg" => Ok(self.avg),
_ => Err(TantivyError::InvalidArgument(format!(
"unknown property {} on stats metric aggregation",
agg_property
))),
}
}
}
/// IntermediateStats contains the mergeable version for stats.
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
pub struct IntermediateStats {


@@ -20,7 +20,8 @@
//!
//! #### Limitations
//!
//! Currently aggregations work only on single value fast fields of type u64, f64 and i64.
//! Currently aggregations work only on single value fast fields of type u64, f64, i64 and
//! fast fields on text fields.
//!
//! # JSON Format
//! Aggregations request and result structures de/serialize into elasticsearch compatible JSON.
@@ -37,6 +38,7 @@
//! - [Bucket](bucket)
//! - [Histogram](bucket::HistogramAggregation)
//! - [Range](bucket::RangeAggregation)
//! - [Terms](bucket::TermsAggregation)
//! - [Metric](metric)
//! - [Average](metric::AverageAggregation)
//! - [Stats](metric::StatsAggregation)
@@ -147,7 +149,8 @@
//! IntermediateAggregationResults provides the
//! [merge_fruits](intermediate_agg_result::IntermediateAggregationResults::merge_fruits) method to
//! merge multiple results. The merged result can then be converted into
//! [agg_result::AggregationResults] via the [Into] trait.
//! [agg_result::AggregationResults] via the
//! [agg_result::AggregationResults::from_intermediate_and_req] method.
pub mod agg_req;
mod agg_req_with_accessor;
@@ -245,6 +248,14 @@ impl<T: Clone> VecWithNames<T> {
fn is_empty(&self) -> bool {
self.keys.is_empty()
}
fn len(&self) -> usize {
self.keys.len()
}
fn get(&self, name: &str) -> Option<&T> {
self.keys()
.position(|key| key == name)
.map(|pos| &self.values[pos])
}
}
/// The serialized key is used in a HashMap.
@@ -311,13 +322,16 @@ mod tests {
use super::bucket::RangeAggregation;
use super::collector::AggregationCollector;
use super::metric::AverageAggregation;
use crate::aggregation::agg_req::{BucketAggregationType, MetricAggregation};
use crate::aggregation::agg_req::{
get_term_dict_field_names, BucketAggregationType, MetricAggregation,
};
use crate::aggregation::agg_result::AggregationResults;
use crate::aggregation::bucket::TermsAggregation;
use crate::aggregation::intermediate_agg_result::IntermediateAggregationResults;
use crate::aggregation::segment_agg_result::DOC_BLOCK_SIZE;
use crate::aggregation::DistributedAggregationCollector;
use crate::query::{AllQuery, TermQuery};
use crate::schema::{Cardinality, IndexRecordOption, Schema, TextFieldIndexing};
use crate::schema::{Cardinality, IndexRecordOption, Schema, TextFieldIndexing, FAST, STRING};
use crate::{Index, Term};
fn get_avg_req(field_name: &str) -> Aggregation {
@@ -336,17 +350,80 @@ mod tests {
)
}
pub fn exec_request(agg_req: Aggregations, index: &Index) -> crate::Result<Value> {
exec_request_with_query(agg_req, index, None)
}
pub fn exec_request_with_query(
agg_req: Aggregations,
index: &Index,
query: Option<(&str, &str)>,
) -> crate::Result<Value> {
let collector = AggregationCollector::from_aggs(agg_req);
let reader = index.reader()?;
let searcher = reader.searcher();
let agg_res = if let Some((field, term)) = query {
let text_field = reader.searcher().schema().get_field(field).unwrap();
let term_query = TermQuery::new(
Term::from_field_text(text_field, term),
IndexRecordOption::Basic,
);
searcher.search(&term_query, &collector)?
} else {
searcher.search(&AllQuery, &collector)?
};
// Test serialization/deserialization roundtrip
let res: Value = serde_json::from_str(&serde_json::to_string(&agg_res)?)?;
Ok(res)
}
pub fn get_test_index_from_values(
merge_segments: bool,
values: &[f64],
) -> crate::Result<Index> {
// Every value gets its own segment
let mut segment_and_values = vec![];
for value in values {
segment_and_values.push(vec![(*value, value.to_string())]);
}
get_test_index_from_values_and_terms(merge_segments, &segment_and_values)
}
pub fn get_test_index_from_terms(
merge_segments: bool,
values: &[Vec<&str>],
) -> crate::Result<Index> {
// Every value gets its own segment
let segment_and_values = values
.iter()
.map(|terms| {
terms
.iter()
.enumerate()
.map(|(i, term)| (i as f64, term.to_string()))
.collect()
})
.collect::<Vec<_>>();
get_test_index_from_values_and_terms(merge_segments, &segment_and_values)
}
pub fn get_test_index_from_values_and_terms(
merge_segments: bool,
segment_and_values: &[Vec<(f64, String)>],
) -> crate::Result<Index> {
let mut schema_builder = Schema::builder();
let text_fieldtype = crate::schema::TextOptions::default()
.set_indexing_options(
TextFieldIndexing::default().set_index_option(IndexRecordOption::WithFreqs),
)
.set_fast()
.set_stored();
let text_field = schema_builder.add_text_field("text", text_fieldtype);
let text_field = schema_builder.add_text_field("text", text_fieldtype.clone());
let text_field_id = schema_builder.add_text_field("text_id", text_fieldtype);
let string_field_id = schema_builder.add_text_field("string_id", STRING | FAST);
let score_fieldtype =
crate::schema::NumericOptions::default().set_fast(Cardinality::SingleValue);
let score_field = schema_builder.add_u64_field("score", score_fieldtype.clone());
@@ -359,15 +436,20 @@ mod tests {
let index = Index::create_in_ram(schema_builder.build());
{
let mut index_writer = index.writer_for_tests()?;
for &i in values {
// writing the segment
index_writer.add_document(doc!(
text_field => "cool",
score_field => i as u64,
score_field_f64 => i as f64,
score_field_i64 => i as i64,
fraction_field => i as f64/100.0,
))?;
for values in segment_and_values {
for (i, term) in values {
let i = *i;
// writing the segment
index_writer.add_document(doc!(
text_field => "cool",
text_field_id => term.to_string(),
string_field_id => term.to_string(),
score_field => i as u64,
score_field_f64 => i as f64,
score_field_i64 => i as i64,
fraction_field => i as f64/100.0,
))?;
}
index_writer.commit()?;
}
}
@@ -388,15 +470,13 @@ mod tests {
merge_segments: bool,
use_distributed_collector: bool,
) -> crate::Result<()> {
let index = get_test_index_with_num_docs(merge_segments, 80)?;
let mut values_and_terms = (0..80)
.map(|val| vec![(val as f64, "terma".to_string())])
.collect::<Vec<_>>();
values_and_terms.last_mut().unwrap()[0].1 = "termb".to_string();
let index = get_test_index_from_values_and_terms(merge_segments, &values_and_terms)?;
let reader = index.reader()?;
let text_field = reader.searcher().schema().get_field("text").unwrap();
let term_query = TermQuery::new(
Term::from_field_text(text_field, "cool"),
IndexRecordOption::Basic,
);
assert_eq!(DOC_BLOCK_SIZE, 64);
// In the tree we cache Documents of DOC_BLOCK_SIZE, before passing them down as one block.
@@ -441,6 +521,19 @@ mod tests {
}
}
}
},
"term_agg_test":{
"terms": {
"field": "string_id"
},
"aggs": {
"bucketsL2": {
"histogram": {
"field": "score",
"interval": 70.0
}
}
}
}
});
@@ -453,14 +546,15 @@ mod tests {
let searcher = reader.searcher();
AggregationResults::from_intermediate_and_req(
searcher.search(&term_query, &collector).unwrap(),
searcher.search(&AllQuery, &collector).unwrap(),
agg_req,
)
.unwrap()
} else {
let collector = AggregationCollector::from_aggs(agg_req);
let searcher = reader.searcher();
searcher.search(&term_query, &collector).unwrap()
searcher.search(&AllQuery, &collector).unwrap()
};
let res: Value = serde_json::from_str(&serde_json::to_string(&agg_res)?)?;
@@ -490,6 +584,46 @@ mod tests {
);
assert_eq!(res["bucketsL1"]["buckets"][2]["doc_count"], 80 - 70);
assert_eq!(
res["term_agg_test"],
json!(
{
"buckets": [
{
"bucketsL2": {
"buckets": [
{
"doc_count": 70,
"key": 0.0
},
{
"doc_count": 9,
"key": 70.0
}
]
},
"doc_count": 79,
"key": "terma"
},
{
"bucketsL2": {
"buckets": [
{
"doc_count": 1,
"key": 70.0
}
]
},
"doc_count": 1,
"key": "termb"
}
],
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0
}
)
);
Ok(())
}
@@ -507,8 +641,10 @@ mod tests {
.set_indexing_options(
TextFieldIndexing::default().set_index_option(IndexRecordOption::WithFreqs),
)
.set_fast()
.set_stored();
let text_field = schema_builder.add_text_field("text", text_fieldtype);
schema_builder.add_text_field("dummy_text", STRING);
let score_fieldtype =
crate::schema::NumericOptions::default().set_fast(Cardinality::SingleValue);
let score_field = schema_builder.add_u64_field("score", score_fieldtype.clone());
@@ -713,10 +849,21 @@ mod tests {
IndexRecordOption::Basic,
);
let sub_agg_req: Aggregations =
vec![("average_in_range".to_string(), get_avg_req("score"))]
.into_iter()
.collect();
let sub_agg_req: Aggregations = vec![
("average_in_range".to_string(), get_avg_req("score")),
(
"term_agg".to_string(),
Aggregation::Bucket(BucketAggregation {
bucket_agg: BucketAggregationType::Terms(TermsAggregation {
field: "text".to_string(),
..Default::default()
}),
sub_aggregation: Default::default(),
}),
),
]
.into_iter()
.collect();
let agg_req: Aggregations = if use_elastic_json_req {
let elasticsearch_compatible_json_req = r#"
{
@@ -732,7 +879,8 @@ mod tests {
]
},
"aggs": {
"average_in_range": { "avg": { "field": "score" } }
"average_in_range": { "avg": { "field": "score" } },
"term_agg": { "terms": { "field": "text" } }
}
},
"rangei64": {
@@ -747,7 +895,8 @@ mod tests {
]
},
"aggs": {
"average_in_range": { "avg": { "field": "score" } }
"average_in_range": { "avg": { "field": "score" } },
"term_agg": { "terms": { "field": "text" } }
}
},
"average": {
@@ -765,7 +914,8 @@ mod tests {
]
},
"aggs": {
"average_in_range": { "avg": { "field": "score" } }
"average_in_range": { "avg": { "field": "score" } },
"term_agg": { "terms": { "field": "text" } }
}
}
}
@@ -824,6 +974,9 @@ mod tests {
agg_req
};
let field_names = get_term_dict_field_names(&agg_req);
assert_eq!(field_names, vec!["text".to_string()].into_iter().collect());
let agg_res: AggregationResults = if use_distributed_collector {
let collector = DistributedAggregationCollector::from_aggs(agg_req.clone());
@@ -832,7 +985,7 @@ mod tests {
// Test de/serialization roundtrip on intermediate_agg_result
let res: IntermediateAggregationResults =
serde_json::from_str(&serde_json::to_string(&res).unwrap()).unwrap();
AggregationResults::from_intermediate_and_req(res, agg_req.clone())
AggregationResults::from_intermediate_and_req(res, agg_req.clone()).unwrap()
} else {
let collector = AggregationCollector::from_aggs(agg_req.clone());
@@ -964,10 +1117,10 @@ mod tests {
searcher.search(&AllQuery, &collector).unwrap_err()
};
let agg_res = avg_on_field("text");
let agg_res = avg_on_field("dummy_text");
assert_eq!(
format!("{:?}", agg_res),
r#"InvalidArgument("Only single value fast fields of type f64, u64, i64 are supported, but got Str ")"#
r#"InvalidArgument("Only fast fields of type f64, u64, i64 are supported, but got Str ")"#
);
let agg_res = avg_on_field("not_exist_field");
@@ -979,7 +1132,7 @@ mod tests {
let agg_res = avg_on_field("scores_i64");
assert_eq!(
format!("{:?}", agg_res),
r#"InvalidArgument("Invalid field type in aggregation I64, only Cardinality::SingleValue supported")"#
r#"InvalidArgument("Invalid field cardinality on field scores_i64 expected SingleValue, but got MultiValues")"#
);
Ok(())
@@ -988,11 +1141,12 @@ mod tests {
#[cfg(all(test, feature = "unstable"))]
mod bench {
use rand::prelude::SliceRandom;
use rand::{thread_rng, Rng};
use test::{self, Bencher};
use super::*;
use crate::aggregation::bucket::{HistogramAggregation, HistogramBounds};
use crate::aggregation::bucket::{HistogramAggregation, HistogramBounds, TermsAggregation};
use crate::aggregation::metric::StatsAggregation;
use crate::query::AllQuery;
@@ -1004,6 +1158,10 @@ mod tests {
)
.set_stored();
let text_field = schema_builder.add_text_field("text", text_fieldtype);
let text_field_many_terms =
schema_builder.add_text_field("text_many_terms", STRING | FAST);
let text_field_few_terms =
schema_builder.add_text_field("text_few_terms", STRING | FAST);
let score_fieldtype =
crate::schema::NumericOptions::default().set_fast(Cardinality::SingleValue);
let score_field = schema_builder.add_u64_field("score", score_fieldtype.clone());
@@ -1011,6 +1169,10 @@ mod tests {
schema_builder.add_f64_field("score_f64", score_fieldtype.clone());
let score_field_i64 = schema_builder.add_i64_field("score_i64", score_fieldtype);
let index = Index::create_from_tempdir(schema_builder.build())?;
let few_terms_data = vec!["INFO", "ERROR", "WARN", "DEBUG"];
let many_terms_data = (0..15_000)
.map(|num| format!("author{}", num))
.collect::<Vec<_>>();
{
let mut rng = thread_rng();
let mut index_writer = index.writer_for_tests()?;
@@ -1019,6 +1181,8 @@ mod tests {
let val: f64 = rng.gen_range(0.0..1_000_000.0);
index_writer.add_document(doc!(
text_field => "cool",
text_field_many_terms => many_terms_data.choose(&mut rng).unwrap().to_string(),
text_field_few_terms => few_terms_data.choose(&mut rng).unwrap().to_string(),
score_field => val as u64,
score_field_f64 => val as f64,
score_field_i64 => val as i64,
@@ -1170,6 +1334,64 @@ mod tests {
});
}
#[bench]
fn bench_aggregation_terms_few(b: &mut Bencher) {
let index = get_test_index_bench(false).unwrap();
let reader = index.reader().unwrap();
b.iter(|| {
let agg_req: Aggregations = vec![(
"my_texts".to_string(),
Aggregation::Bucket(BucketAggregation {
bucket_agg: BucketAggregationType::Terms(TermsAggregation {
field: "text_few_terms".to_string(),
..Default::default()
}),
sub_aggregation: Default::default(),
}),
)]
.into_iter()
.collect();
let collector = AggregationCollector::from_aggs(agg_req);
let searcher = reader.searcher();
let agg_res: AggregationResults =
searcher.search(&AllQuery, &collector).unwrap().into();
agg_res
});
}
#[bench]
fn bench_aggregation_terms_many(b: &mut Bencher) {
let index = get_test_index_bench(false).unwrap();
let reader = index.reader().unwrap();
b.iter(|| {
let agg_req: Aggregations = vec![(
"my_texts".to_string(),
Aggregation::Bucket(BucketAggregation {
bucket_agg: BucketAggregationType::Terms(TermsAggregation {
field: "text_many_terms".to_string(),
..Default::default()
}),
sub_aggregation: Default::default(),
}),
)]
.into_iter()
.collect();
let collector = AggregationCollector::from_aggs(agg_req);
let searcher = reader.searcher();
let agg_res: AggregationResults =
searcher.search(&AllQuery, &collector).unwrap().into();
agg_res
});
}
#[bench]
fn bench_aggregation_range_only(b: &mut Bencher) {
let index = get_test_index_bench(false).unwrap();
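Note on the terms aggregation exercised above: the expected JSON in the test (79 documents under "terma", 1 under "termb", `sum_other_doc_count` of 0) follows from plain per-term counting. Below is a minimal sketch of that counting using only the standard library. `TermBucket`, `terms_agg` and the `size` cutoff are hypothetical names for illustration and are not tantivy's `SegmentTermCollector`.

```rust
use std::collections::HashMap;

/// One bucket of a hypothetical terms aggregation result.
#[derive(Debug, PartialEq)]
struct TermBucket {
    key: String,
    doc_count: u64,
}

/// Counts documents per term and returns the `size` largest buckets,
/// plus the number of documents that fell outside the returned buckets.
fn terms_agg(terms: &[&str], size: usize) -> (Vec<TermBucket>, u64) {
    let mut counts: HashMap<&str, u64> = HashMap::new();
    for &term in terms {
        *counts.entry(term).or_insert(0) += 1;
    }
    let mut buckets: Vec<TermBucket> = counts
        .into_iter()
        .map(|(key, doc_count)| TermBucket {
            key: key.to_string(),
            doc_count,
        })
        .collect();
    // Largest buckets first; ties are broken by key so the order is deterministic.
    buckets.sort_by(|a, b| {
        b.doc_count
            .cmp(&a.doc_count)
            .then_with(|| a.key.cmp(&b.key))
    });
    // Everything past the cutoff is only reported as a summed count.
    let sum_other_doc_count: u64 = buckets.iter().skip(size).map(|b| b.doc_count).sum();
    buckets.truncate(size);
    (buckets, sum_other_doc_count)
}

fn main() {
    // Same shape as the test data: 79 documents with "terma", one with "termb".
    let mut terms = vec!["terma"; 79];
    terms.push("termb");
    let (buckets, other) = terms_agg(&terms, 10);
    println!("{:?} sum_other_doc_count={}", buckets, other);
}
```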


@@ -9,11 +9,12 @@ use super::agg_req::MetricAggregation;
use super::agg_req_with_accessor::{
AggregationsWithAccessor, BucketAggregationWithAccessor, MetricAggregationWithAccessor,
};
use super::bucket::{SegmentHistogramCollector, SegmentRangeCollector};
use super::bucket::{SegmentHistogramCollector, SegmentRangeCollector, SegmentTermCollector};
use super::intermediate_agg_result::{IntermediateAggregationResults, IntermediateBucketResult};
use super::metric::{
AverageAggregation, SegmentAverageCollector, SegmentStatsCollector, StatsAggregation,
};
use super::{Key, VecWithNames};
use super::VecWithNames;
use crate::aggregation::agg_req::BucketAggregationType;
use crate::DocId;
@@ -28,6 +29,17 @@ pub(crate) struct SegmentAggregationResultsCollector {
num_staged_docs: usize,
}
impl Default for SegmentAggregationResultsCollector {
fn default() -> Self {
Self {
metrics: Default::default(),
buckets: Default::default(),
staged_docs: [0; DOC_BLOCK_SIZE],
num_staged_docs: Default::default(),
}
}
}
impl Debug for SegmentAggregationResultsCollector {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.debug_struct("SegmentAggregationResultsCollector")
@@ -40,6 +52,25 @@ impl Debug for SegmentAggregationResultsCollector {
}
impl SegmentAggregationResultsCollector {
pub fn into_intermediate_aggregations_result(
self,
agg_with_accessor: &AggregationsWithAccessor,
) -> crate::Result<IntermediateAggregationResults> {
let buckets = if let Some(buckets) = self.buckets {
let entries = buckets
.into_iter()
.zip(agg_with_accessor.buckets.values())
.map(|((key, bucket), acc)| Ok((key, bucket.into_intermediate_bucket_result(acc)?)))
.collect::<crate::Result<Vec<(String, _)>>>()?;
Some(VecWithNames::from_entries(entries))
} else {
None
};
let metrics = self.metrics.map(VecWithNames::from_other);
Ok(IntermediateAggregationResults { metrics, buckets })
}
pub(crate) fn from_req_and_validate(req: &AggregationsWithAccessor) -> crate::Result<Self> {
let buckets = req
.buckets
@@ -97,6 +128,9 @@ impl SegmentAggregationResultsCollector {
agg_with_accessor: &AggregationsWithAccessor,
force_flush: bool,
) {
if self.num_staged_docs == 0 {
return;
}
if let Some(metrics) = &mut self.metrics {
for (collector, agg_with_accessor) in
metrics.values_mut().zip(agg_with_accessor.metrics.values())
@@ -162,12 +196,40 @@ impl SegmentMetricResultCollector {
#[derive(Clone, Debug, PartialEq)]
pub(crate) enum SegmentBucketResultCollector {
Range(SegmentRangeCollector),
Histogram(SegmentHistogramCollector),
Histogram(Box<SegmentHistogramCollector>),
Terms(Box<SegmentTermCollector>),
}
impl SegmentBucketResultCollector {
pub fn into_intermediate_bucket_result(
self,
agg_with_accessor: &BucketAggregationWithAccessor,
) -> crate::Result<IntermediateBucketResult> {
match self {
SegmentBucketResultCollector::Terms(terms) => {
terms.into_intermediate_bucket_result(agg_with_accessor)
}
SegmentBucketResultCollector::Range(range) => {
range.into_intermediate_bucket_result(agg_with_accessor)
}
SegmentBucketResultCollector::Histogram(histogram) => {
histogram.into_intermediate_bucket_result(agg_with_accessor)
}
}
}
pub fn from_req_and_validate(req: &BucketAggregationWithAccessor) -> crate::Result<Self> {
match &req.bucket_agg {
BucketAggregationType::Terms(terms_req) => Ok(Self::Terms(Box::new(
SegmentTermCollector::from_req_and_validate(
terms_req,
&req.sub_aggregation,
req.field_type,
req.accessor
.as_multi()
.expect("unexpected fast field cardinality"),
)?,
))),
BucketAggregationType::Range(range_req) => {
Ok(Self::Range(SegmentRangeCollector::from_req_and_validate(
range_req,
@@ -175,14 +237,16 @@ impl SegmentBucketResultCollector {
req.field_type,
)?))
}
BucketAggregationType::Histogram(histogram) => Ok(Self::Histogram(
BucketAggregationType::Histogram(histogram) => Ok(Self::Histogram(Box::new(
SegmentHistogramCollector::from_req_and_validate(
histogram,
&req.sub_aggregation,
req.field_type,
&req.accessor,
req.accessor
.as_single()
.expect("unexpected fast field cardinality"),
)?,
)),
))),
}
}
@@ -200,34 +264,9 @@ impl SegmentBucketResultCollector {
SegmentBucketResultCollector::Histogram(histogram) => {
histogram.collect_block(doc, bucket_with_accessor, force_flush)
}
SegmentBucketResultCollector::Terms(terms) => {
terms.collect_block(doc, bucket_with_accessor, force_flush)
}
}
}
}
#[derive(Clone, Debug, PartialEq)]
pub(crate) struct SegmentHistogramBucketEntry {
pub key: f64,
pub doc_count: u64,
}
#[derive(Clone, PartialEq)]
pub(crate) struct SegmentRangeBucketEntry {
pub key: Key,
pub doc_count: u64,
pub sub_aggregation: Option<SegmentAggregationResultsCollector>,
/// The from range of the bucket. Equals f64::MIN when None.
pub from: Option<f64>,
/// The to range of the bucket. Equals f64::MAX when None.
pub to: Option<f64>,
}
impl Debug for SegmentRangeBucketEntry {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.debug_struct("SegmentRangeBucketEntry")
.field("key", &self.key)
.field("doc_count", &self.doc_count)
.field("from", &self.from)
.field("to", &self.to)
.finish()
}
}
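Note on the `Box<SegmentHistogramCollector>` and `Box<SegmentTermCollector>` variants above: the diff does not state why the variants are boxed. One plausible reason (an assumption here) is that an enum is as large as its largest variant, so boxing the big collectors keeps `SegmentBucketResultCollector` small. The sketch below shows the effect with hypothetical stand-in types.

```rust
use std::mem::size_of;

// Hypothetical stand-ins for a small and a large collector.
#[allow(dead_code)]
struct SmallCollector {
    count: u64,
}

#[allow(dead_code)]
struct LargeCollector {
    staged: [u64; 64],
    len: usize,
}

// Every value of this enum reserves room for the large variant.
#[allow(dead_code)]
enum Unboxed {
    Small(SmallCollector),
    Large(LargeCollector),
}

// Boxing moves the large payload to the heap; the enum only holds a pointer.
#[allow(dead_code)]
enum Boxed {
    Small(SmallCollector),
    Large(Box<LargeCollector>),
}

fn main() {
    println!("unboxed enum: {} bytes", size_of::<Unboxed>());
    println!("boxed enum:   {} bytes", size_of::<Boxed>());
}
```

The trade-off is one extra heap allocation and pointer indirection for the boxed variants.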


@@ -273,18 +273,18 @@ mod tests {
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let mut writer = index.writer_with_num_threads(1, 4_000_000)?;
writer.add_document(doc!(date_field=>DateTime::new_primitive(Date::from_calendar_date(1982, Month::September, 17)?.with_hms(0, 0, 0)?)))?;
writer.add_document(doc!(date_field=>DateTime::from_primitive(Date::from_calendar_date(1982, Month::September, 17)?.with_hms(0, 0, 0)?)))?;
writer.add_document(
doc!(date_field=>DateTime::new_primitive(Date::from_calendar_date(1986, Month::March, 9)?.with_hms(0, 0, 0)?)),
doc!(date_field=>DateTime::from_primitive(Date::from_calendar_date(1986, Month::March, 9)?.with_hms(0, 0, 0)?)),
)?;
writer.add_document(doc!(date_field=>DateTime::new_primitive(Date::from_calendar_date(1983, Month::September, 27)?.with_hms(0, 0, 0)?)))?;
writer.add_document(doc!(date_field=>DateTime::from_primitive(Date::from_calendar_date(1983, Month::September, 27)?.with_hms(0, 0, 0)?)))?;
writer.commit()?;
let reader = index.reader()?;
let searcher = reader.searcher();
let all_query = AllQuery;
let week_histogram_collector = HistogramCollector::new(
date_field,
DateTime::new_primitive(
DateTime::from_primitive(
Date::from_calendar_date(1980, Month::January, 1)?.with_hms(0, 0, 0)?,
),
3600 * 24 * 365, // it is just for a unit test... sorry leap years.


@@ -92,7 +92,7 @@ mod histogram_collector;
pub use histogram_collector::HistogramCollector;
mod multi_collector;
pub use self::multi_collector::MultiCollector;
pub use self::multi_collector::{FruitHandle, MultiCollector, MultiFruit};
mod top_collector;


@@ -5,6 +5,7 @@ use super::{Collector, SegmentCollector};
use crate::collector::Fruit;
use crate::{DocId, Score, SegmentOrdinal, SegmentReader, TantivyError};
/// MultiFruit keeps Fruits from every nested Collector
pub struct MultiFruit {
sub_fruits: Vec<Option<Box<dyn Fruit>>>,
}
@@ -79,12 +80,17 @@ impl<TSegmentCollector: SegmentCollector> BoxableSegmentCollector
}
}
/// FruitHandle stores a reference to the corresponding collector inside MultiCollector
pub struct FruitHandle<TFruit: Fruit> {
pos: usize,
_phantom: PhantomData<TFruit>,
}
impl<TFruit: Fruit> FruitHandle<TFruit> {
/// Extract a typed fruit off a multifruit.
///
/// This function involves downcasting and can panic if the multifruit was
/// created using faulty code.
pub fn extract(self, fruits: &mut MultiFruit) -> TFruit {
let boxed_fruit = fruits.sub_fruits[self.pos].take().expect("");
*boxed_fruit
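Note on `FruitHandle::extract` above: the doc comment says extraction involves downcasting and can panic. The sketch below illustrates that pattern with `std::any::Any` instead of tantivy's `Fruit` trait. `Fruits`, `Handle` and `push` are hypothetical names, and the real `MultiCollector` wiring is not reproduced.

```rust
use std::any::Any;
use std::marker::PhantomData;

/// Type-erased results, one slot per registered collector.
struct Fruits {
    slots: Vec<Option<Box<dyn Any>>>,
}

/// Remembers the slot position and, through `PhantomData`, the expected type.
struct Handle<T: 'static> {
    pos: usize,
    _phantom: PhantomData<T>,
}

impl<T: 'static> Handle<T> {
    /// Takes the value out of its slot and downcasts it back to `T`.
    /// Panics if the slot was already emptied or holds another type.
    fn extract(self, fruits: &mut Fruits) -> T {
        let boxed = fruits.slots[self.pos]
            .take()
            .expect("fruit already extracted");
        *boxed
            .downcast::<T>()
            .expect("handle type does not match the stored fruit")
    }
}

/// Stores a value and returns a typed handle to it.
fn push<T: 'static>(fruits: &mut Fruits, value: T) -> Handle<T> {
    fruits.slots.push(Some(Box::new(value)));
    Handle {
        pos: fruits.slots.len() - 1,
        _phantom: PhantomData,
    }
}

fn main() {
    let mut fruits = Fruits { slots: Vec::new() };
    let count = push(&mut fruits, 3usize);
    let names = push(&mut fruits, vec!["a".to_string()]);
    println!("{:?}", names.extract(&mut fruits));
    println!("{}", count.extract(&mut fruits));
}
```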


@@ -26,11 +26,11 @@ pub fn test_filter_collector() -> crate::Result<()> {
let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_with_num_threads(1, 10_000_000)?;
index_writer.add_document(doc!(title => "The Name of the Wind", price => 30_200u64, date => DateTime::new_utc(OffsetDateTime::parse("1898-04-09T00:00:00+00:00", &Rfc3339).unwrap())))?;
index_writer.add_document(doc!(title => "The Diary of Muadib", price => 29_240u64, date => DateTime::new_utc(OffsetDateTime::parse("2020-04-09T00:00:00+00:00", &Rfc3339).unwrap())))?;
index_writer.add_document(doc!(title => "The Diary of Anne Frank", price => 18_240u64, date => DateTime::new_utc(OffsetDateTime::parse("2019-04-20T00:00:00+00:00", &Rfc3339).unwrap())))?;
index_writer.add_document(doc!(title => "A Dairy Cow", price => 21_240u64, date => DateTime::new_utc(OffsetDateTime::parse("2019-04-09T00:00:00+00:00", &Rfc3339).unwrap())))?;
index_writer.add_document(doc!(title => "The Diary of a Young Girl", price => 20_120u64, date => DateTime::new_utc(OffsetDateTime::parse("2018-04-09T00:00:00+00:00", &Rfc3339).unwrap())))?;
index_writer.add_document(doc!(title => "The Name of the Wind", price => 30_200u64, date => DateTime::from_utc(OffsetDateTime::parse("1898-04-09T00:00:00+00:00", &Rfc3339).unwrap())))?;
index_writer.add_document(doc!(title => "The Diary of Muadib", price => 29_240u64, date => DateTime::from_utc(OffsetDateTime::parse("2020-04-09T00:00:00+00:00", &Rfc3339).unwrap())))?;
index_writer.add_document(doc!(title => "The Diary of Anne Frank", price => 18_240u64, date => DateTime::from_utc(OffsetDateTime::parse("2019-04-20T00:00:00+00:00", &Rfc3339).unwrap())))?;
index_writer.add_document(doc!(title => "A Dairy Cow", price => 21_240u64, date => DateTime::from_utc(OffsetDateTime::parse("2019-04-09T00:00:00+00:00", &Rfc3339).unwrap())))?;
index_writer.add_document(doc!(title => "The Diary of a Young Girl", price => 20_120u64, date => DateTime::from_utc(OffsetDateTime::parse("2018-04-09T00:00:00+00:00", &Rfc3339).unwrap())))?;
index_writer.commit()?;
let reader = index.reader()?;
@@ -55,7 +55,7 @@ pub fn test_filter_collector() -> crate::Result<()> {
assert_eq!(filtered_top_docs.len(), 0);
fn date_filter(value: DateTime) -> bool {
(value.to_utc() - OffsetDateTime::parse("2019-04-09T00:00:00+00:00", &Rfc3339).unwrap())
(value.into_utc() - OffsetDateTime::parse("2019-04-09T00:00:00+00:00", &Rfc3339).unwrap())
.whole_weeks()
> 0
}


@@ -898,7 +898,7 @@ mod tests {
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests()?;
let pr_birthday = DateTime::new_utc(OffsetDateTime::parse(
let pr_birthday = DateTime::from_utc(OffsetDateTime::parse(
"1898-04-09T00:00:00+00:00",
&Rfc3339,
)?);
@@ -906,7 +906,7 @@ mod tests {
name => "Paul Robeson",
birthday => pr_birthday,
))?;
let mr_birthday = DateTime::new_utc(OffsetDateTime::parse(
let mr_birthday = DateTime::from_utc(OffsetDateTime::parse(
"1947-11-08T00:00:00+00:00",
&Rfc3339,
)?);


@@ -1,6 +1,7 @@
use crossbeam::channel;
use rayon::{ThreadPool, ThreadPoolBuilder};
use crate::TantivyError;
/// Search executor, whether search requests run on a single thread or on multiple threads.
///
/// We don't expose Rayon thread pool directly here for several reasons.
@@ -47,16 +48,19 @@ impl Executor {
match self {
Executor::SingleThread => args.map(f).collect::<crate::Result<_>>(),
Executor::ThreadPool(pool) => {
let args_with_indices: Vec<(usize, A)> = args.enumerate().collect();
let num_fruits = args_with_indices.len();
let args: Vec<A> = args.collect();
let num_fruits = args.len();
let fruit_receiver = {
let (fruit_sender, fruit_receiver) = channel::unbounded();
let (fruit_sender, fruit_receiver) = crossbeam_channel::unbounded();
pool.scope(|scope| {
for arg_with_idx in args_with_indices {
scope.spawn(|_| {
let (idx, arg) = arg_with_idx;
let fruit = f(arg);
if let Err(err) = fruit_sender.send((idx, fruit)) {
for (idx, arg) in args.into_iter().enumerate() {
// We take named references to f and fruit_sender because we do not
// want these two to be moved into the closure.
let f_ref = &f;
let fruit_sender_ref = &fruit_sender;
scope.spawn(move |_| {
let fruit = f_ref(arg);
if let Err(err) = fruit_sender_ref.send((idx, fruit)) {
error!(
"Failed to send search task. It probably means all search \
threads have panicked. {:?}",
@@ -71,18 +75,19 @@ impl Executor {
// This is important as it makes it possible for the fruit_receiver iteration to
// terminate.
};
// This is lame, but safe.
let mut results_with_position = Vec::with_capacity(num_fruits);
let mut result_placeholders: Vec<Option<R>> =
std::iter::repeat_with(|| None).take(num_fruits).collect();
for (pos, fruit_res) in fruit_receiver {
let fruit = fruit_res?;
results_with_position.push((pos, fruit));
result_placeholders[pos] = Some(fruit);
}
results_with_position.sort_by_key(|(pos, _)| *pos);
assert_eq!(results_with_position.len(), num_fruits);
Ok(results_with_position
.into_iter()
.map(|(_, fruit)| fruit)
.collect::<Vec<_>>())
let results: Vec<R> = result_placeholders.into_iter().flatten().collect();
if results.len() != num_fruits {
return Err(TantivyError::InternalError(
"One of the mapped executions failed.".to_string(),
));
}
Ok(results)
}
}
}
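Note on the placeholder approach above: results arrive in completion order, so each one is written into the slot for its original index rather than being collected and sorted afterwards. The sketch below reproduces that idea with the standard library only, swapping in `std::thread::scope` and `std::sync::mpsc` for the rayon pool and crossbeam channel used in the diff. `map_in_order` and the `String` error type are hypothetical.

```rust
use std::sync::mpsc;
use std::thread;

/// Applies `f` to every argument on its own scoped thread and returns the
/// results in argument order, or the first error that is received.
fn map_in_order<A, R, F>(f: F, args: Vec<A>) -> Result<Vec<R>, String>
where
    A: Send,
    R: Send,
    F: Fn(A) -> Result<R, String> + Sync,
{
    let num_fruits = args.len();
    let (sender, receiver) = mpsc::channel();
    thread::scope(|scope| {
        for (idx, arg) in args.into_iter().enumerate() {
            // Borrow `f` and clone the sender so that only `arg` and the
            // clone are moved into the spawned closure.
            let f_ref = &f;
            let sender = sender.clone();
            scope.spawn(move || {
                // A send error only means the receiver is gone; ignore it here.
                let _ = sender.send((idx, f_ref(arg)));
            });
        }
    });
    // Drop the last sender so that iterating the receiver terminates.
    drop(sender);
    let mut placeholders: Vec<Option<R>> =
        std::iter::repeat_with(|| None).take(num_fruits).collect();
    for (pos, fruit_res) in receiver {
        placeholders[pos] = Some(fruit_res?);
    }
    let results: Vec<R> = placeholders.into_iter().flatten().collect();
    if results.len() != num_fruits {
        return Err("a mapped execution did not report a result".to_string());
    }
    Ok(results)
}

fn main() {
    let doubled = map_in_order(|x: u32| Ok(x * 2), vec![1, 2, 3, 4]);
    println!("{:?}", doubled);
}
```

Writing into indexed slots avoids the sort over `(pos, fruit)` pairs that the removed code performed, and the length check turns a missing result into an error instead of an assertion failure.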


@@ -74,6 +74,7 @@ fn load_metas(
pub struct IndexBuilder {
schema: Option<Schema>,
index_settings: IndexSettings,
tokenizer_manager: TokenizerManager,
}
impl Default for IndexBuilder {
fn default() -> Self {
@@ -86,6 +87,7 @@ impl IndexBuilder {
Self {
schema: None,
index_settings: IndexSettings::default(),
tokenizer_manager: TokenizerManager::default(),
}
}
@@ -103,6 +105,12 @@ impl IndexBuilder {
self
}
/// Set the tokenizers.
pub fn tokenizers(mut self, tokenizers: TokenizerManager) -> Self {
self.tokenizer_manager = tokenizers;
self
}
/// Creates a new index using the `RAMDirectory`.
///
/// The index will be allocated in anonymous memory.
@@ -154,7 +162,8 @@ impl IndexBuilder {
if !Index::exists(&*dir)? {
return self.create(dir);
}
let index = Index::open(dir)?;
let mut index = Index::open(dir)?;
index.set_tokenizers(self.tokenizer_manager.clone());
if index.schema() == self.get_expect_schema()? {
Ok(index)
} else {
@@ -176,7 +185,8 @@ impl IndexBuilder {
)?;
let mut metas = IndexMeta::with_schema(self.get_expect_schema()?);
metas.index_settings = self.index_settings;
let index = Index::open_from_metas(directory, &metas, SegmentMetaInventory::default());
let mut index = Index::open_from_metas(directory, &metas, SegmentMetaInventory::default());
index.set_tokenizers(self.tokenizer_manager);
Ok(index)
}
}
@@ -304,6 +314,11 @@ impl Index {
}
}
/// Setter for the tokenizer manager.
pub fn set_tokenizers(&mut self, tokenizers: TokenizerManager) {
self.tokenizers = tokenizers;
}
/// Accessor for the tokenizer manager.
pub fn tokenizers(&self) -> &TokenizerManager {
&self.tokenizers
@@ -314,20 +329,31 @@ impl Index {
let field_entry = self.schema.get_field_entry(field);
let field_type = field_entry.field_type();
let tokenizer_manager: &TokenizerManager = self.tokenizers();
let tokenizer_name_opt: Option<TextAnalyzer> = match field_type {
FieldType::Str(text_options) => text_options
.get_indexing_options()
.map(|text_indexing_options| text_indexing_options.tokenizer().to_string())
.and_then(|tokenizer_name| tokenizer_manager.get(&tokenizer_name)),
_ => None,
let indexing_options_opt = match field_type {
FieldType::JsonObject(options) => options.get_text_indexing_options(),
FieldType::Str(options) => options.get_indexing_options(),
_ => {
return Err(TantivyError::SchemaError(format!(
"{:?} is not a text field.",
field_entry.name()
)))
}
};
match tokenizer_name_opt {
Some(tokenizer) => Ok(tokenizer),
None => Err(TantivyError::SchemaError(format!(
"{:?} is not a text field.",
field_entry.name()
))),
}
let indexing_options = indexing_options_opt.ok_or_else(|| {
TantivyError::InvalidArgument(format!(
"No indexing options set for field {:?}",
field_entry
))
})?;
tokenizer_manager
.get(indexing_options.tokenizer())
.ok_or_else(|| {
TantivyError::InvalidArgument(format!(
"No Tokenizer found for field {:?}",
field_entry
))
})
}
/// Create a default `IndexReader` for the given index.
@@ -557,7 +583,8 @@ impl fmt::Debug for Index {
mod tests {
use crate::directory::{RamDirectory, WatchCallback};
use crate::schema::{Field, Schema, INDEXED, TEXT};
use crate::{Directory, Index, IndexReader, IndexSettings, ReloadPolicy};
use crate::tokenizer::TokenizerManager;
use crate::{Directory, Index, IndexBuilder, IndexReader, IndexSettings, ReloadPolicy};
#[test]
fn test_indexer_for_field() {
@@ -573,6 +600,21 @@ mod tests {
);
}
#[test]
fn test_set_tokenizer_manager() {
let mut schema_builder = Schema::builder();
schema_builder.add_u64_field("num_likes", INDEXED);
schema_builder.add_text_field("body", TEXT);
let schema = schema_builder.build();
let index = IndexBuilder::new()
// set empty tokenizer manager
.tokenizers(TokenizerManager::new())
.schema(schema)
.create_in_ram()
.unwrap();
assert!(index.tokenizers().get("raw").is_none());
}
#[test]
fn test_index_exists() {
let directory: Box<dyn Directory> = Box::new(RamDirectory::create());
@@ -702,7 +744,7 @@ mod tests {
.try_into()?;
assert_eq!(reader.searcher().num_docs(), 0);
writer.add_document(doc!(field=>1u64))?;
let (sender, receiver) = crossbeam::channel::unbounded();
let (sender, receiver) = crossbeam_channel::unbounded();
let _handle = index.directory_mut().watch(WatchCallback::new(move || {
let _ = sender.send(());
}));
@@ -737,7 +779,7 @@ mod tests {
reader: &IndexReader,
) -> crate::Result<()> {
let mut reader_index = reader.index();
let (sender, receiver) = crossbeam::channel::unbounded();
let (sender, receiver) = crossbeam_channel::unbounded();
let _watch_handle = reader_index
.directory_mut()
.watch(WatchCallback::new(move || {
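Note on the reworked `tokenizer_for_field` above: the new version distinguishes three failure cases (not a text or JSON field, no indexing options, tokenizer not registered) instead of folding them into one "is not a text field" error. The sketch below shows the same early-return plus `ok_or_else` chain with hypothetical stand-in types. `FieldKind`, `LookupError` and the string-valued registry are illustrative assumptions, not tantivy's schema or `TokenizerManager`.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum LookupError {
    NotATextField(String),
    NoIndexingOptions(String),
    UnknownTokenizer(String),
}

#[allow(dead_code)]
enum FieldKind {
    Text { tokenizer: Option<String> },
    Json { tokenizer: Option<String> },
    U64,
}

/// Resolves the tokenizer registered for a field, reporting which step failed.
fn tokenizer_for_field<'a>(
    name: &str,
    kind: &FieldKind,
    registry: &'a HashMap<String, String>,
) -> Result<&'a String, LookupError> {
    // Step 1: only text-like fields can carry a tokenizer name.
    let tokenizer_name_opt = match kind {
        FieldKind::Text { tokenizer } | FieldKind::Json { tokenizer } => tokenizer.as_ref(),
        FieldKind::U64 => return Err(LookupError::NotATextField(name.to_string())),
    };
    // Step 2: the field must actually be indexed.
    let tokenizer_name =
        tokenizer_name_opt.ok_or_else(|| LookupError::NoIndexingOptions(name.to_string()))?;
    // Step 3: the tokenizer name must be known to the registry.
    registry
        .get(tokenizer_name)
        .ok_or_else(|| LookupError::UnknownTokenizer(name.to_string()))
}

fn main() {
    let mut registry = HashMap::new();
    registry.insert("default".to_string(), "SimpleTokenizer".to_string());
    let body = FieldKind::Text {
        tokenizer: Some("default".to_string()),
    };
    println!("{:?}", tokenizer_for_field("body", &body, &registry));
}
```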


@@ -239,7 +239,7 @@ impl InnerSegmentMeta {
///
/// Contains settings which are applied on the whole
/// index, like presort documents.
#[derive(Clone, Debug, Default, Serialize, Deserialize, Eq, PartialEq)]
#[derive(Clone, Debug, Serialize, Deserialize, Eq, PartialEq)]
pub struct IndexSettings {
/// Sorts the documents by information
/// provided in `IndexSortByField`
@@ -248,7 +248,26 @@ pub struct IndexSettings {
/// The `Compressor` used to compress the doc store.
#[serde(default)]
pub docstore_compression: Compressor,
#[serde(default = "default_docstore_blocksize")]
/// The size of each block that will be compressed and written to disk
pub docstore_blocksize: usize,
}
/// Must be a function to be compatible with serde defaults
fn default_docstore_blocksize() -> usize {
16_384
}
impl Default for IndexSettings {
fn default() -> Self {
Self {
sort_by_field: None,
docstore_compression: Compressor::default(),
docstore_blocksize: default_docstore_blocksize(),
}
}
}
/// Settings to presort the documents in an index
///
/// Presorting documents can greatly improve performance
@@ -401,7 +420,7 @@ mod tests {
let json = serde_json::ser::to_string(&index_metas).expect("serialization failed");
assert_eq!(
json,
r#"{"index_settings":{"sort_by_field":{"field":"text","order":"Asc"},"docstore_compression":"lz4"},"segments":[],"schema":[{"name":"text","type":"text","options":{"indexing":{"record":"position","fieldnorms":true,"tokenizer":"default"},"stored":false}}],"opstamp":0}"#
r#"{"index_settings":{"sort_by_field":{"field":"text","order":"Asc"},"docstore_compression":"lz4","docstore_blocksize":16384},"segments":[],"schema":[{"name":"text","type":"text","options":{"indexing":{"record":"position","fieldnorms":true,"tokenizer":"default"},"stored":false,"fast":false}}],"opstamp":0}"#
);
let deser_meta: UntrackedIndexMeta = serde_json::from_str(&json).unwrap();


@@ -35,7 +35,7 @@ const ZERO_ARRAY: [u8; 8] = [0u8; 8];
#[cfg(test)]
fn create_uuid() -> Uuid {
let new_auto_inc_id = (*AUTO_INC_COUNTER).fetch_add(1, atomic::Ordering::SeqCst);
Uuid::from_fields(new_auto_inc_id as u32, 0, 0, &ZERO_ARRAY).unwrap()
Uuid::from_fields(new_auto_inc_id as u32, 0, 0, &ZERO_ARRAY)
}
#[cfg(not(test))]
@@ -57,7 +57,7 @@ impl SegmentId {
/// Picking the first 8 chars is ok to identify
/// segments in a display message (e.g. a5c4dfcb).
pub fn short_uuid_string(&self) -> String {
(&self.0.to_simple_ref().to_string()[..8]).to_string()
(&self.0.as_simple().to_string()[..8]).to_string()
}
/// Returns a segment uuid string.
@@ -65,7 +65,7 @@ impl SegmentId {
/// It consists of 32 lowercase hexadecimal chars
/// (e.g. a5c4dfcbdfe645089129e308e26d5523)
pub fn uuid_string(&self) -> String {
self.0.to_simple_ref().to_string()
self.0.as_simple().to_string()
}
/// Build a `SegmentId` string from the full uuid string.

View File

@@ -169,7 +169,7 @@ impl SegmentReader {
let fast_fields_data = segment.open_read(SegmentComponent::FastFields)?;
let fast_fields_composite = CompositeFile::open(&fast_fields_data)?;
let fast_field_readers =
let fast_fields_readers =
Arc::new(FastFieldReaders::new(schema.clone(), fast_fields_composite));
let fieldnorm_data = segment.open_read(SegmentComponent::FieldNorms)?;
let fieldnorm_readers = FieldNormReaders::open(fieldnorm_data)?;
@@ -196,7 +196,7 @@ impl SegmentReader {
max_doc,
termdict_composite,
postings_composite,
fast_fields_readers: fast_field_readers,
fast_fields_readers,
fieldnorm_readers,
segment_id: segment.id(),
delete_opstamp: segment.meta().delete_opstamp(),

View File

@@ -110,7 +110,7 @@ mod tests {
let tmp_file = tmp_dir.path().join("watched.txt");
let counter: Arc<AtomicUsize> = Default::default();
let (tx, rx) = crossbeam::channel::unbounded();
let (tx, rx) = crossbeam_channel::unbounded();
let timeout = Duration::from_millis(100);
let watcher = FileWatcher::new(&tmp_file);
@@ -153,7 +153,7 @@ mod tests {
let tmp_file = tmp_dir.path().join("watched.txt");
let counter: Arc<AtomicUsize> = Default::default();
let (tx, rx) = crossbeam::channel::unbounded();
let (tx, rx) = crossbeam_channel::unbounded();
let timeout = Duration::from_millis(100);
let watcher = FileWatcher::new(&tmp_file);

View File

@@ -181,7 +181,7 @@ fn test_directory_delete(directory: &dyn Directory) -> crate::Result<()> {
fn test_watch(directory: &dyn Directory) {
let counter: Arc<AtomicUsize> = Default::default();
let (tx, rx) = crossbeam::channel::unbounded();
let (tx, rx) = crossbeam_channel::unbounded();
let timeout = Duration::from_millis(500);
let handle = directory

View File

@@ -97,6 +97,10 @@ pub enum TantivyError {
/// Index incompatible with current version of Tantivy.
#[error("{0:?}")]
IncompatibleIndex(Incompatibility),
/// An internal error occurred. These are internal states that should not be reached,
/// e.g. a data structure is incorrectly initialized.
#[error("Internal error: '{0}'")]
InternalError(String),
}
#[cfg(feature = "quickwit")]

View File

@@ -188,14 +188,14 @@ mod bench {
}
#[bench]
fn bench_deletebitset_iter_deser_on_fly(bench: &mut Bencher) {
fn bench_alive_bitset_iter_deser_on_fly(bench: &mut Bencher) {
let alive_bitset = AliveBitSet::for_test_from_deleted_docs(&[0, 1, 1000, 10000], 1_000_000);
bench.iter(|| alive_bitset.iter_alive().collect::<Vec<_>>());
}
#[bench]
fn bench_deletebitset_access(bench: &mut Bencher) {
fn bench_alive_bitset_access(bench: &mut Bencher) {
let alive_bitset = AliveBitSet::for_test_from_deleted_docs(&[0, 1, 1000, 10000], 1_000_000);
bench.iter(|| {
@@ -206,14 +206,14 @@ mod bench {
}
#[bench]
fn bench_deletebitset_iter_deser_on_fly_1_8_alive(bench: &mut Bencher) {
fn bench_alive_bitset_iter_deser_on_fly_1_8_alive(bench: &mut Bencher) {
let alive_bitset = AliveBitSet::for_test_from_deleted_docs(&get_alive(), 1_000_000);
bench.iter(|| alive_bitset.iter_alive().collect::<Vec<_>>());
}
#[bench]
fn bench_deletebitset_access_1_8_alive(bench: &mut Bencher) {
fn bench_alive_bitset_access_1_8_alive(bench: &mut Bencher) {
let alive_bitset = AliveBitSet::for_test_from_deleted_docs(&get_alive(), 1_000_000);
bench.iter(|| {

View File

@@ -167,7 +167,7 @@ impl FastValue for DateTime {
}
fn to_u64(&self) -> u64 {
self.to_unix_timestamp().to_u64()
self.into_unix_timestamp().to_u64()
}
fn fast_field_cardinality(field_type: &FieldType) -> Option<Cardinality> {
@@ -178,7 +178,7 @@ impl FastValue for DateTime {
}
fn as_u64(&self) -> u64 {
self.to_unix_timestamp().as_u64()
self.into_unix_timestamp().as_u64()
}
fn to_type() -> Type {
@@ -196,10 +196,31 @@ fn value_to_u64(value: &Value) -> u64 {
}
}
/// The fast field type
pub enum FastFieldType {
/// Numeric type, e.g. f64.
Numeric,
/// Fast field stores string ids.
String,
/// Fast field stores string ids for facets.
Facet,
}
impl FastFieldType {
fn is_storing_term_ids(&self) -> bool {
matches!(self, FastFieldType::String | FastFieldType::Facet)
}
fn is_facet(&self) -> bool {
matches!(self, FastFieldType::Facet)
}
}
#[cfg(test)]
mod tests {
use std::collections::HashMap;
use std::ops::Range;
use std::path::Path;
use common::HasLen;
@@ -211,7 +232,7 @@ mod tests {
use super::*;
use crate::directory::{CompositeFile, Directory, RamDirectory, WritePtr};
use crate::merge_policy::NoMergePolicy;
use crate::schema::{Document, Field, NumericOptions, Schema, FAST};
use crate::schema::{Document, Field, NumericOptions, Schema, FAST, STRING, TEXT};
use crate::time::OffsetDateTime;
use crate::{Index, SegmentId, SegmentReader};
@@ -233,7 +254,7 @@ mod tests {
#[test]
pub fn test_fastfield_i64_u64() {
let datetime = DateTime::new_utc(OffsetDateTime::UNIX_EPOCH);
let datetime = DateTime::from_utc(OffsetDateTime::UNIX_EPOCH);
assert_eq!(i64::from_u64(datetime.to_u64()), 0i64);
}
@@ -392,7 +413,8 @@ mod tests {
serializer.close().unwrap();
}
let file = directory.open_read(path).unwrap();
assert_eq!(file.len(), 12471_usize); // Piecewise linear codec size
// assert_eq!(file.len(), 17710 as usize); //bitpacked size
assert_eq!(file.len(), 10175_usize); // linear interpol size
{
let fast_fields_composite = CompositeFile::open(&file)?;
let data = fast_fields_composite.open_read(i64_field).unwrap();
@@ -489,7 +511,7 @@ mod tests {
let mut index_writer = index.writer_for_tests().unwrap();
index_writer.set_merge_policy(Box::new(NoMergePolicy));
index_writer
.add_document(doc!(date_field =>DateTime::new_utc(OffsetDateTime::now_utc())))?;
.add_document(doc!(date_field =>DateTime::from_utc(OffsetDateTime::now_utc())))?;
index_writer.commit()?;
index_writer.add_document(doc!())?;
index_writer.commit()?;
@@ -509,7 +531,206 @@ mod tests {
#[test]
fn test_default_datetime() {
assert_eq!(0, DateTime::make_zero().to_unix_timestamp());
assert_eq!(0, DateTime::make_zero().into_unix_timestamp());
}
fn get_vals_for_docs(ff: &MultiValuedFastFieldReader<u64>, docs: Range<u32>) -> Vec<u64> {
let mut all = vec![];
for doc in docs {
let mut out = vec![];
ff.get_vals(doc, &mut out);
all.extend(out);
}
all
}
#[test]
fn test_text_fastfield() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
let text_field = schema_builder.add_text_field("text", TEXT | FAST);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
{
// first segment
let mut index_writer = index.writer_for_tests()?;
index_writer.set_merge_policy(Box::new(NoMergePolicy));
index_writer.add_document(doc!(
text_field => "BBBBB AAAAA", // term_ord 1,0
))?;
index_writer.add_document(doc!())?;
index_writer.add_document(doc!(
text_field => "AAAAA", // term_ord 0
))?;
index_writer.add_document(doc!(
text_field => "AAAAA BBBBB", // term_ord 0,1
))?;
index_writer.add_document(doc!(
text_field => "zumberthree", // term_ord 2, after merge term_ord 3
))?;
index_writer.add_document(doc!())?;
index_writer.commit()?;
let reader = index.reader()?;
let searcher = reader.searcher();
assert_eq!(searcher.segment_readers().len(), 1);
let segment_reader = searcher.segment_reader(0);
let fast_fields = segment_reader.fast_fields();
let text_fast_field = fast_fields.u64s(text_field).unwrap();
assert_eq!(
get_vals_for_docs(&text_fast_field, 0..5),
vec![1, 0, 0, 0, 1, 2]
);
let mut out = vec![];
text_fast_field.get_vals(3, &mut out);
assert_eq!(out, vec![0, 1]);
let inverted_index = segment_reader.inverted_index(text_field)?;
assert_eq!(inverted_index.terms().num_terms(), 3);
let mut bytes = vec![];
assert!(inverted_index.terms().ord_to_term(0, &mut bytes)?);
// default tokenizer applies lower case
assert_eq!(bytes, "aaaaa".as_bytes());
}
{
// second segment
let mut index_writer = index.writer_for_tests()?;
index_writer.add_document(doc!(
text_field => "AAAAA", // term_ord 0
))?;
index_writer.add_document(doc!(
text_field => "CCCCC AAAAA", // term_ord 1,0, after merge 2,0
))?;
index_writer.add_document(doc!())?;
index_writer.commit()?;
let reader = index.reader()?;
let searcher = reader.searcher();
assert_eq!(searcher.segment_readers().len(), 2);
let segment_reader = searcher.segment_reader(1);
let fast_fields = segment_reader.fast_fields();
let text_fast_field = fast_fields.u64s(text_field).unwrap();
assert_eq!(get_vals_for_docs(&text_fast_field, 0..3), vec![0, 1, 0]);
}
// Merging the segments
{
let segment_ids = index.searchable_segment_ids()?;
let mut index_writer = index.writer_for_tests()?;
index_writer.merge(&segment_ids).wait()?;
index_writer.wait_merging_threads()?;
}
let reader = index.reader()?;
let searcher = reader.searcher();
let segment_reader = searcher.segment_reader(0);
let fast_fields = segment_reader.fast_fields();
let text_fast_field = fast_fields.u64s(text_field).unwrap();
assert_eq!(
get_vals_for_docs(&text_fast_field, 0..8),
vec![1, 0, 0, 0, 1, 3 /* next segment */, 0, 2, 0]
);
Ok(())
}
#[test]
fn test_string_fastfield() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
let text_field = schema_builder.add_text_field("text", STRING | FAST);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
{
// first segment
let mut index_writer = index.writer_for_tests()?;
index_writer.set_merge_policy(Box::new(NoMergePolicy));
index_writer.add_document(doc!(
text_field => "BBBBB", // term_ord 1
))?;
index_writer.add_document(doc!())?;
index_writer.add_document(doc!(
text_field => "AAAAA", // term_ord 0
))?;
index_writer.add_document(doc!(
text_field => "AAAAA", // term_ord 0
))?;
index_writer.add_document(doc!(
text_field => "zumberthree", // term_ord 2, after merge term_ord 3
))?;
index_writer.add_document(doc!())?;
index_writer.commit()?;
let reader = index.reader()?;
let searcher = reader.searcher();
assert_eq!(searcher.segment_readers().len(), 1);
let segment_reader = searcher.segment_reader(0);
let fast_fields = segment_reader.fast_fields();
let text_fast_field = fast_fields.u64s(text_field).unwrap();
assert_eq!(get_vals_for_docs(&text_fast_field, 0..6), vec![1, 0, 0, 2]);
let inverted_index = segment_reader.inverted_index(text_field)?;
assert_eq!(inverted_index.terms().num_terms(), 3);
let mut bytes = vec![];
assert!(inverted_index.terms().ord_to_term(0, &mut bytes)?);
assert_eq!(bytes, "AAAAA".as_bytes());
}
{
// second segment
let mut index_writer = index.writer_for_tests()?;
index_writer.add_document(doc!(
text_field => "AAAAA", // term_ord 0
))?;
index_writer.add_document(doc!(
text_field => "CCCCC", // term_ord 1, after merge 2
))?;
index_writer.add_document(doc!())?;
index_writer.commit()?;
let reader = index.reader()?;
let searcher = reader.searcher();
assert_eq!(searcher.segment_readers().len(), 2);
let segment_reader = searcher.segment_reader(1);
let fast_fields = segment_reader.fast_fields();
let text_fast_field = fast_fields.u64s(text_field).unwrap();
assert_eq!(get_vals_for_docs(&text_fast_field, 0..2), vec![0, 1]);
}
// Merging the segments
{
let segment_ids = index.searchable_segment_ids()?;
let mut index_writer = index.writer_for_tests()?;
index_writer.merge(&segment_ids).wait()?;
index_writer.wait_merging_threads()?;
}
let reader = index.reader()?;
let searcher = reader.searcher();
let segment_reader = searcher.segment_reader(0);
let fast_fields = segment_reader.fast_fields();
let text_fast_field = fast_fields.u64s(text_field).unwrap();
assert_eq!(
get_vals_for_docs(&text_fast_field, 0..9),
vec![1, 0, 0, 3 /* next segment */, 0, 2]
);
Ok(())
}
#[test]
@@ -547,23 +768,23 @@ mod tests {
let dates_fast_field = fast_fields.dates(multi_date_field).unwrap();
let mut dates = vec![];
{
assert_eq!(date_fast_field.get(0u32).to_unix_timestamp(), 1i64);
assert_eq!(date_fast_field.get(0u32).into_unix_timestamp(), 1i64);
dates_fast_field.get_vals(0u32, &mut dates);
assert_eq!(dates.len(), 2);
assert_eq!(dates[0].to_unix_timestamp(), 2i64);
assert_eq!(dates[1].to_unix_timestamp(), 3i64);
assert_eq!(dates[0].into_unix_timestamp(), 2i64);
assert_eq!(dates[1].into_unix_timestamp(), 3i64);
}
{
assert_eq!(date_fast_field.get(1u32).to_unix_timestamp(), 4i64);
assert_eq!(date_fast_field.get(1u32).into_unix_timestamp(), 4i64);
dates_fast_field.get_vals(1u32, &mut dates);
assert!(dates.is_empty());
}
{
assert_eq!(date_fast_field.get(2u32).to_unix_timestamp(), 0i64);
assert_eq!(date_fast_field.get(2u32).into_unix_timestamp(), 0i64);
dates_fast_field.get_vals(2u32, &mut dates);
assert_eq!(dates.len(), 2);
assert_eq!(dates[0].to_unix_timestamp(), 5i64);
assert_eq!(dates[1].to_unix_timestamp(), 6i64);
assert_eq!(dates[0].into_unix_timestamp(), 5i64);
assert_eq!(dates[1].into_unix_timestamp(), 6i64);
}
Ok(())
}

View File

@@ -71,24 +71,24 @@ mod tests {
let mut index_writer = index.writer_for_tests()?;
let first_time_stamp = OffsetDateTime::now_utc();
index_writer.add_document(doc!(
date_field => DateTime::new_utc(first_time_stamp),
date_field => DateTime::new_utc(first_time_stamp),
date_field => DateTime::from_utc(first_time_stamp),
date_field => DateTime::from_utc(first_time_stamp),
time_i=>1i64))?;
index_writer.add_document(doc!(time_i => 0i64))?;
// add one second
index_writer.add_document(doc!(
date_field => DateTime::new_utc(first_time_stamp + Duration::seconds(1)),
date_field => DateTime::from_utc(first_time_stamp + Duration::seconds(1)),
time_i => 2i64))?;
// add another second
let two_secs_ahead = first_time_stamp + Duration::seconds(2);
index_writer.add_document(doc!(
date_field => DateTime::new_utc(two_secs_ahead),
date_field => DateTime::new_utc(two_secs_ahead),
date_field => DateTime::new_utc(two_secs_ahead),
date_field => DateTime::from_utc(two_secs_ahead),
date_field => DateTime::from_utc(two_secs_ahead),
date_field => DateTime::from_utc(two_secs_ahead),
time_i => 3i64))?;
// add three seconds
index_writer.add_document(doc!(
date_field => DateTime::new_utc(first_time_stamp + Duration::seconds(3)),
date_field => DateTime::from_utc(first_time_stamp + Duration::seconds(3)),
time_i => 4i64))?;
index_writer.commit()?;
@@ -113,7 +113,7 @@ mod tests {
.expect("cannot find value")
.as_date()
.unwrap(),
DateTime::new_utc(first_time_stamp),
DateTime::from_utc(first_time_stamp),
);
assert_eq!(
retrieved_doc
@@ -140,7 +140,7 @@ mod tests {
.expect("cannot find value")
.as_date()
.unwrap(),
DateTime::new_utc(two_secs_ahead)
DateTime::from_utc(two_secs_ahead)
);
assert_eq!(
retrieved_doc
@@ -181,7 +181,7 @@ mod tests {
.expect("cannot find value")
.as_date()
.expect("value not of Date type"),
DateTime::new_utc(first_time_stamp + Duration::seconds(offset_sec)),
DateTime::from_utc(first_time_stamp + Duration::seconds(offset_sec)),
);
assert_eq!(
retrieved_doc

View File

@@ -27,22 +27,28 @@ impl<Item: FastValue> MultiValuedFastFieldReader<Item> {
}
}
/// Returns `(start, stop)`, such that the values associated
/// to the given document are `start..stop`.
/// Returns `[start, end)`, such that the values associated
/// to the given document are `start..end`.
#[inline]
fn range(&self, doc: DocId) -> Range<u64> {
let start = self.idx_reader.get(doc);
let stop = self.idx_reader.get(doc + 1);
start..stop
let end = self.idx_reader.get(doc + 1);
start..end
}
/// Returns the array of values stored at the given index `range`.
#[inline]
fn get_vals_for_range(&self, range: Range<u64>, vals: &mut Vec<Item>) {
let len = (range.end - range.start) as usize;
vals.resize(len, Item::make_zero());
self.vals_reader.get_range(range.start, &mut vals[..]);
}
/// Returns the array of values associated to the given `doc`.
#[inline]
pub fn get_vals(&self, doc: DocId, vals: &mut Vec<Item>) {
let range = self.range(doc);
let len = (range.end - range.start) as usize;
vals.resize(len, Item::make_zero());
self.vals_reader.get_range(range.start, &mut vals[..]);
self.get_vals_for_range(range, vals);
}
/// Returns the minimum value for this fast field.
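
The reader refactor above splits lookup into two steps: resolve a document to a half-open range `start..end` in an index array, then copy that slice of values. A minimal in-memory sketch of this layout follows. `SketchMultiValued` and its plain `Vec` fields are hypothetical; the real reader goes through codec-backed readers instead of vectors.

```rust
use std::ops::Range;

/// Hypothetical in-memory model of a multi-valued fast field.
pub struct SketchMultiValued {
    /// `doc_index[doc]..doc_index[doc + 1]` delimits the values of `doc`,
    /// so this vector holds one more entry than there are documents.
    doc_index: Vec<u64>,
    /// All values of all documents, concatenated in document order.
    vals: Vec<u64>,
}

impl SketchMultiValued {
    pub fn new(doc_index: Vec<u64>, vals: Vec<u64>) -> Self {
        Self { doc_index, vals }
    }

    /// Returns `start..end`, the positions of the values of `doc`.
    fn range(&self, doc: u32) -> Range<u64> {
        let start = self.doc_index[doc as usize];
        let end = self.doc_index[doc as usize + 1];
        start..end
    }

    /// Copies the values stored at positions `range` into `vals`.
    fn get_vals_for_range(&self, range: Range<u64>, vals: &mut Vec<u64>) {
        vals.clear();
        vals.extend_from_slice(&self.vals[range.start as usize..range.end as usize]);
    }

    /// Copies the values of `doc` into `vals`, replacing its previous content.
    pub fn get_vals(&self, doc: u32, vals: &mut Vec<u64>) {
        let range = self.range(doc);
        self.get_vals_for_range(range, vals);
    }
}
```

A document without values simply has `start == end`, which is how the empty documents in the tests of this diff are represented.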

View File

@@ -4,7 +4,7 @@ use fnv::FnvHashMap;
use tantivy_bitpacker::minmax;
use crate::fastfield::serializer::BitpackedFastFieldSerializerLegacy;
use crate::fastfield::{value_to_u64, CompositeFastFieldSerializer};
use crate::fastfield::{value_to_u64, CompositeFastFieldSerializer, FastFieldType};
use crate::indexer::doc_id_mapping::DocIdMapping;
use crate::postings::UnorderedTermId;
use crate::schema::{Document, Field};
@@ -38,17 +38,17 @@ pub struct MultiValuedFastFieldWriter {
field: Field,
vals: Vec<UnorderedTermId>,
doc_index: Vec<u64>,
is_facet: bool,
fast_field_type: FastFieldType,
}
impl MultiValuedFastFieldWriter {
/// Creates a new `IntFastFieldWriter`
pub(crate) fn new(field: Field, is_facet: bool) -> Self {
/// Creates a new `MultiValuedFastFieldWriter`
pub(crate) fn new(field: Field, fast_field_type: FastFieldType) -> Self {
MultiValuedFastFieldWriter {
field,
vals: Vec::new(),
doc_index: Vec::new(),
is_facet,
fast_field_type,
}
}
@@ -77,12 +77,13 @@ impl MultiValuedFastFieldWriter {
/// all of the matching field values present in the document.
pub fn add_document(&mut self, doc: &Document) {
self.next_doc();
// facets are indexed in the `SegmentWriter` as we encode their unordered id.
if !self.is_facet {
for field_value in doc.field_values() {
if field_value.field == self.field {
self.add_val(value_to_u64(field_value.value()));
}
// facets/texts are indexed in the `SegmentWriter` as we encode their unordered id.
if self.fast_field_type.is_storing_term_ids() {
return;
}
for field_value in doc.field_values() {
if field_value.field == self.field {
self.add_val(value_to_u64(field_value.value()));
}
}
}
@@ -158,15 +159,15 @@ impl MultiValuedFastFieldWriter {
{
// writing the values themselves.
let mut value_serializer: BitpackedFastFieldSerializerLegacy<'_, _>;
match mapping_opt {
Some(mapping) => {
value_serializer = serializer.new_u64_fast_field_with_idx(
self.field,
0u64,
mapping.len() as u64,
1,
)?;
if let Some(mapping) = mapping_opt {
value_serializer = serializer.new_u64_fast_field_with_idx(
self.field,
0u64,
mapping.len() as u64,
1,
)?;
if self.fast_field_type.is_facet() {
let mut doc_vals: Vec<u64> = Vec::with_capacity(100);
for vals in self.get_ordered_values(doc_id_map) {
doc_vals.clear();
@@ -179,19 +180,27 @@ impl MultiValuedFastFieldWriter {
value_serializer.add_val(val)?;
}
}
}
None => {
let val_min_max = minmax(self.vals.iter().cloned());
let (val_min, val_max) = val_min_max.unwrap_or((0u64, 0u64));
value_serializer =
serializer.new_u64_fast_field_with_idx(self.field, val_min, val_max, 1)?;
} else {
for vals in self.get_ordered_values(doc_id_map) {
// sort values in case of remapped doc_ids?
for &val in vals {
let remapped_vals = vals
.iter()
.map(|val| *mapping.get(val).expect("Missing term ordinal"));
for val in remapped_vals {
value_serializer.add_val(val)?;
}
}
}
} else {
let val_min_max = minmax(self.vals.iter().cloned());
let (val_min, val_max) = val_min_max.unwrap_or((0u64, 0u64));
value_serializer =
serializer.new_u64_fast_field_with_idx(self.field, val_min, val_max, 1)?;
for vals in self.get_ordered_values(doc_id_map) {
// sort values in case of remapped doc_ids?
for &val in vals {
value_serializer.add_val(val)?;
}
}
}
value_serializer.close_field()?;
}
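
For string fast fields, the writer above records unordered term ids during indexing and rewrites them to sorted term ordinals at serialization time through `mapping`. The sketch below illustrates that remapping with std types only. Both helper functions are hypothetical, the terms are assumed to be distinct, and a missing id yields `None` here, whereas the code in the diff calls `expect`.

```rust
use std::collections::HashMap;

/// Builds a map from unordered id (position in first-seen order) to the
/// term ordinal (rank of the term in sorted order). Assumes distinct terms.
pub fn build_mapping(terms_in_first_seen_order: &[&str]) -> HashMap<u64, u64> {
    let mut order: Vec<usize> = (0..terms_in_first_seen_order.len()).collect();
    // Sort the unordered ids by the term they stand for.
    order.sort_by_key(|&unordered_id| terms_in_first_seen_order[unordered_id]);
    order
        .iter()
        .enumerate()
        .map(|(ordinal, &unordered_id)| (unordered_id as u64, ordinal as u64))
        .collect()
}

/// Rewrites unordered ids to term ordinals.
/// Returns `None` if any id has no entry in the mapping.
pub fn remap_term_ids(vals: &[u64], mapping: &HashMap<u64, u64>) -> Option<Vec<u64>> {
    vals.iter().map(|val| mapping.get(val).copied()).collect()
}
```

With the first segment of `test_string_fastfield`, the terms are first seen as `BBBBB`, `AAAAA`, `zumberthree`, so the recorded ids `[0, 1, 1, 2]` become the ordinals `[1, 0, 0, 2]` that the test asserts.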

View File

@@ -6,17 +6,12 @@ use common::BinarySerializable;
use fastfield_codecs::bitpacked::{
BitpackedFastFieldReader as BitpackedReader, BitpackedFastFieldSerializer,
};
#[allow(deprecated)]
use fastfield_codecs::linearinterpol::{
LinearInterpolFastFieldReader, LinearInterpolFastFieldSerializer,
};
#[allow(deprecated)]
use fastfield_codecs::multilinearinterpol::{
MultiLinearInterpolFastFieldReader, MultiLinearInterpolFastFieldSerializer,
};
use fastfield_codecs::piecewise_linear::{
PiecewiseLinearFastFieldReader, PiecewiseLinearFastFieldSerializer,
};
use fastfield_codecs::{FastFieldCodecReader, FastFieldCodecSerializer};
use super::FastValue;
@@ -76,8 +71,6 @@ pub enum DynamicFastFieldReader<Item: FastValue> {
LinearInterpol(FastFieldReaderCodecWrapper<Item, LinearInterpolFastFieldReader>),
/// Blockwise linear interpolated values + bitpacked
MultiLinearInterpol(FastFieldReaderCodecWrapper<Item, MultiLinearInterpolFastFieldReader>),
/// Piecewise linear interpolated values + bitpacked
PiecewiseLinear(FastFieldReaderCodecWrapper<Item, PiecewiseLinearFastFieldReader>),
}
impl<Item: FastValue> DynamicFastFieldReader<Item> {
@@ -93,14 +86,12 @@ impl<Item: FastValue> DynamicFastFieldReader<Item> {
BitpackedReader,
>::open_from_bytes(bytes)?)
}
#[allow(deprecated)]
LinearInterpolFastFieldSerializer::ID => {
DynamicFastFieldReader::LinearInterpol(FastFieldReaderCodecWrapper::<
Item,
LinearInterpolFastFieldReader,
>::open_from_bytes(bytes)?)
}
#[allow(deprecated)]
MultiLinearInterpolFastFieldSerializer::ID => {
DynamicFastFieldReader::MultiLinearInterpol(FastFieldReaderCodecWrapper::<
Item,
@@ -109,12 +100,6 @@ impl<Item: FastValue> DynamicFastFieldReader<Item> {
bytes
)?)
}
PiecewiseLinearFastFieldSerializer::ID => {
DynamicFastFieldReader::PiecewiseLinear(FastFieldReaderCodecWrapper::<
Item,
PiecewiseLinearFastFieldReader,
>::open_from_bytes(bytes)?)
}
_ => {
panic!(
"unknown fastfield id {:?}. Data corrupted or using old tantivy version.",
@@ -133,7 +118,6 @@ impl<Item: FastValue> FastFieldReader<Item> for DynamicFastFieldReader<Item> {
Self::Bitpacked(reader) => reader.get(doc),
Self::LinearInterpol(reader) => reader.get(doc),
Self::MultiLinearInterpol(reader) => reader.get(doc),
Self::PiecewiseLinear(reader) => reader.get(doc),
}
}
#[inline]
@@ -142,7 +126,6 @@ impl<Item: FastValue> FastFieldReader<Item> for DynamicFastFieldReader<Item> {
Self::Bitpacked(reader) => reader.get_range(start, output),
Self::LinearInterpol(reader) => reader.get_range(start, output),
Self::MultiLinearInterpol(reader) => reader.get_range(start, output),
Self::PiecewiseLinear(reader) => reader.get_range(start, output),
}
}
fn min_value(&self) -> Item {
@@ -150,7 +133,6 @@ impl<Item: FastValue> FastFieldReader<Item> for DynamicFastFieldReader<Item> {
Self::Bitpacked(reader) => reader.min_value(),
Self::LinearInterpol(reader) => reader.min_value(),
Self::MultiLinearInterpol(reader) => reader.min_value(),
Self::PiecewiseLinear(reader) => reader.min_value(),
}
}
fn max_value(&self) -> Item {
@@ -158,7 +140,6 @@ impl<Item: FastValue> FastFieldReader<Item> for DynamicFastFieldReader<Item> {
Self::Bitpacked(reader) => reader.max_value(),
Self::LinearInterpol(reader) => reader.max_value(),
Self::MultiLinearInterpol(reader) => reader.max_value(),
Self::PiecewiseLinear(reader) => reader.max_value(),
}
}
}
@@ -195,12 +176,9 @@ impl<Item: FastValue, C: FastFieldCodecReader> FastFieldReaderCodecWrapper<Item,
_phantom: PhantomData,
})
}
/// Get u64 for indice `idx`.
/// `idx` can be either a `DocId` or an index used for
/// `multivalued` fast field. See [`get_range`] for more details.
pub(crate) fn get_u64(&self, idx: u64) -> Item {
Item::from_u64(self.reader.get_u64(idx, self.bytes.as_slice()))
#[inline]
pub(crate) fn get_u64(&self, doc: u64) -> Item {
Item::from_u64(self.reader.get_u64(doc, self.bytes.as_slice()))
}
/// Internally `multivalued` also uses SingleValue Fast fields.

View File

@@ -39,6 +39,9 @@ pub(crate) fn type_and_cardinality(field_type: &FieldType) -> Option<(FastType,
.get_fastfield_cardinality()
.map(|cardinality| (FastType::Date, cardinality)),
FieldType::Facet(_) => Some((FastType::U64, Cardinality::MultiValues)),
FieldType::Str(options) if options.is_fast() => {
Some((FastType::U64, Cardinality::MultiValues))
}
_ => None,
}
}

View File

@@ -4,9 +4,9 @@ use common::{BinarySerializable, CountingWriter};
pub use fastfield_codecs::bitpacked::{
BitpackedFastFieldSerializer, BitpackedFastFieldSerializerLegacy,
};
use fastfield_codecs::piecewise_linear::PiecewiseLinearFastFieldSerializer;
use fastfield_codecs::linearinterpol::LinearInterpolFastFieldSerializer;
use fastfield_codecs::multilinearinterpol::MultiLinearInterpolFastFieldSerializer;
pub use fastfield_codecs::{FastFieldCodecSerializer, FastFieldDataAccess, FastFieldStats};
use itertools::Itertools;
use crate::directory::{CompositeWrite, WritePtr};
use crate::schema::Field;
@@ -35,31 +35,18 @@ pub struct CompositeFastFieldSerializer {
composite_write: CompositeWrite<WritePtr>,
}
#[derive(Debug)]
pub struct CodecEstimationResult<'a> {
pub ratio: f32,
pub name: &'a str,
pub id: u8,
}
// TODO: use this when this is merged and stabilized explicit_generic_args_with_impl_trait
// use this, when this is merged and stabilized explicit_generic_args_with_impl_trait
// https://github.com/rust-lang/rust/pull/86176
fn codec_estimation<T: FastFieldCodecSerializer, A: FastFieldDataAccess>(
stats: FastFieldStats,
fastfield_accessor: &A,
) -> CodecEstimationResult {
estimations: &mut Vec<(f32, &str, u8)>,
) {
if !T::is_applicable(fastfield_accessor, stats.clone()) {
return CodecEstimationResult {
ratio: f32::MAX,
name: T::NAME,
id: T::ID,
};
}
CodecEstimationResult {
ratio: T::estimate_compression_ratio(fastfield_accessor, stats),
name: T::NAME,
id: T::ID,
return;
}
let (ratio, name, id) = (T::estimate(fastfield_accessor, stats), T::NAME, T::ID);
estimations.push((ratio, name, id));
}
impl CompositeFastFieldSerializer {
@@ -72,7 +59,7 @@ impl CompositeFastFieldSerializer {
/// Serialize data into a new u64 fast field. The best compression codec will be chosen
/// automatically.
pub fn new_u64_fast_field_with_best_codec(
pub fn create_auto_detect_u64_fast_field(
&mut self,
field: Field,
stats: FastFieldStats,
@@ -80,7 +67,7 @@ impl CompositeFastFieldSerializer {
data_iter_1: impl Iterator<Item = u64>,
data_iter_2: impl Iterator<Item = u64>,
) -> io::Result<()> {
self.new_u64_fast_field_with_idx_with_best_codec(
self.create_auto_detect_u64_fast_field_with_idx(
field,
stats,
fastfield_accessor,
@@ -91,7 +78,7 @@ impl CompositeFastFieldSerializer {
}
/// Serialize data into a new u64 fast field. The best compression codec will be chosen
/// automatically.
pub fn new_u64_fast_field_with_idx_with_best_codec(
pub fn create_auto_detect_u64_fast_field_with_idx(
&mut self,
field: Field,
stats: FastFieldStats,
@@ -101,29 +88,42 @@ impl CompositeFastFieldSerializer {
idx: usize,
) -> io::Result<()> {
let field_write = self.composite_write.for_field_with_idx(field, idx);
let estimations = vec![
codec_estimation::<BitpackedFastFieldSerializer, _>(stats.clone(), &fastfield_accessor),
codec_estimation::<PiecewiseLinearFastFieldSerializer, _>(
stats.clone(),
&fastfield_accessor,
),
];
let best_codec_result = estimations
.iter()
.sorted_by(|result_a, result_b| {
result_a
.ratio
.partial_cmp(&result_b.ratio)
.expect("Ratio cannot be nan.")
})
.next()
.expect("A codec must be present.");
debug!(
"Choosing fast field codec {} for field_id {:?} among {:?}",
best_codec_result.name, field, estimations,
let mut estimations = vec![];
codec_estimation::<BitpackedFastFieldSerializer, _>(
stats.clone(),
&fastfield_accessor,
&mut estimations,
);
best_codec_result.id.serialize(field_write)?;
match best_codec_result.name {
codec_estimation::<LinearInterpolFastFieldSerializer, _>(
stats.clone(),
&fastfield_accessor,
&mut estimations,
);
codec_estimation::<MultiLinearInterpolFastFieldSerializer, _>(
stats.clone(),
&fastfield_accessor,
&mut estimations,
);
if let Some(broken_estimation) = estimations.iter().find(|estimation| estimation.0.is_nan())
{
warn!(
"broken estimation for fast field codec {}",
broken_estimation.1
);
}
// removing NaN values from codecs with broken calculations, and max values, which
// disable codecs
estimations.retain(|estimation| !estimation.0.is_nan() && estimation.0 != f32::MAX);
estimations.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap());
let (_ratio, name, id) = estimations[0];
debug!(
"choosing fast field codec {} for field_id {:?}",
name, field
); // todo print actual field name
id.serialize(field_write)?;
match name {
BitpackedFastFieldSerializer::NAME => {
BitpackedFastFieldSerializer::serialize(
field_write,
@@ -133,8 +133,17 @@ impl CompositeFastFieldSerializer {
data_iter_2,
)?;
}
PiecewiseLinearFastFieldSerializer::NAME => {
PiecewiseLinearFastFieldSerializer::serialize(
LinearInterpolFastFieldSerializer::NAME => {
LinearInterpolFastFieldSerializer::serialize(
field_write,
&fastfield_accessor,
stats,
data_iter_1,
data_iter_2,
)?;
}
MultiLinearInterpolFastFieldSerializer::NAME => {
MultiLinearInterpolFastFieldSerializer::serialize(
field_write,
&fastfield_accessor,
stats,
@@ -143,7 +152,7 @@ impl CompositeFastFieldSerializer {
)?;
}
_ => {
panic!("unknown fastfield serializer {}", best_codec_result.name)
panic!("unknown fastfield serializer {}", name)
}
};
field_write.flush()?;
@@ -207,45 +216,3 @@ impl<'a, W: Write> FastBytesFieldSerializer<'a, W> {
self.write.flush()
}
}
#[cfg(test)]
mod tests {
use std::path::Path;
use common::BinarySerializable;
use fastfield_codecs::FastFieldStats;
use itertools::Itertools;
use super::CompositeFastFieldSerializer;
use crate::directory::{RamDirectory, WritePtr};
use crate::schema::Field;
use crate::Directory;
#[test]
fn new_u64_fast_field_with_best_codec() -> crate::Result<()> {
let directory: RamDirectory = RamDirectory::create();
let path = Path::new("test");
let write: WritePtr = directory.open_write(path)?;
let mut serializer = CompositeFastFieldSerializer::from_write(write)?;
let vals = (0..10000u64).into_iter().collect_vec();
let stats = FastFieldStats {
min_value: 0,
max_value: 9999,
num_vals: vals.len() as u64,
};
serializer.new_u64_fast_field_with_best_codec(
Field::from_field_id(0),
stats,
vals.clone(),
vals.clone().into_iter(),
vals.into_iter(),
)?;
serializer.close()?;
// get the codecs id
let mut bytes = directory.open_read(path)?.read_bytes()?;
let codec_id = u8::deserialize(&mut bytes)?;
// Codec id = 4 is piecewise linear.
assert_eq!(codec_id, 4);
Ok(())
}
}
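
The codec selection in this file collects `(ratio, name, id)` estimates, drops NaN and `f32::MAX` entries, and takes the smallest remaining ratio. A compact sketch of that selection follows. `pick_best_codec` is a hypothetical helper, and it returns `None` for an empty candidate list where the code in the diff indexes `estimations[0]` directly.

```rust
/// Picks the candidate with the lowest estimated ratio, skipping NaN
/// estimates and the `f32::MAX` sentinel used to disable a codec.
/// Returns `None` if no usable candidate remains.
pub fn pick_best_codec<'a>(
    estimations: &[(f32, &'a str, u8)],
) -> Option<(f32, &'a str, u8)> {
    estimations
        .iter()
        .copied()
        .filter(|(ratio, _, _)| !ratio.is_nan() && *ratio != f32::MAX)
        // `total_cmp` is a total order on f32, so no `unwrap` is needed.
        .min_by(|a, b| a.0.total_cmp(&b.0))
}
```

Using `min_by` avoids sorting the whole list when only the best entry is needed, and filtering first means a NaN estimate can never reach the comparison.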

View File

@@ -7,7 +7,7 @@ use tantivy_bitpacker::BlockedBitpacker;
use super::multivalued::MultiValuedFastFieldWriter;
use super::serializer::FastFieldStats;
use super::FastFieldDataAccess;
use super::{FastFieldDataAccess, FastFieldType};
use crate::fastfield::{BytesFastFieldWriter, CompositeFastFieldSerializer};
use crate::indexer::doc_id_mapping::DocIdMapping;
use crate::postings::UnorderedTermId;
@@ -16,6 +16,7 @@ use crate::termdict::TermOrdinal;
/// The `FastFieldsWriter` groups all of the fast field writers.
pub struct FastFieldsWriter {
term_id_writers: Vec<MultiValuedFastFieldWriter>,
single_value_writers: Vec<IntFastFieldWriter>,
multi_values_writers: Vec<MultiValuedFastFieldWriter>,
bytes_value_writers: Vec<BytesFastFieldWriter>,
@@ -33,6 +34,7 @@ impl FastFieldsWriter {
/// Create all `FastFieldWriter` required by the schema.
pub fn from_schema(schema: &Schema) -> FastFieldsWriter {
let mut single_value_writers = Vec::new();
let mut term_id_writers = Vec::new();
let mut multi_values_writers = Vec::new();
let mut bytes_value_writers = Vec::new();
@@ -50,15 +52,22 @@ impl FastFieldsWriter {
single_value_writers.push(fast_field_writer);
}
Some(Cardinality::MultiValues) => {
let fast_field_writer = MultiValuedFastFieldWriter::new(field, false);
let fast_field_writer =
MultiValuedFastFieldWriter::new(field, FastFieldType::Numeric);
multi_values_writers.push(fast_field_writer);
}
None => {}
}
}
FieldType::Facet(_) => {
let fast_field_writer = MultiValuedFastFieldWriter::new(field, true);
multi_values_writers.push(fast_field_writer);
let fast_field_writer =
MultiValuedFastFieldWriter::new(field, FastFieldType::Facet);
term_id_writers.push(fast_field_writer);
}
FieldType::Str(_) if field_entry.is_fast() => {
let fast_field_writer =
MultiValuedFastFieldWriter::new(field, FastFieldType::String);
term_id_writers.push(fast_field_writer);
}
FieldType::Bytes(bytes_option) => {
if bytes_option.is_fast() {
@@ -70,6 +79,7 @@ impl FastFieldsWriter {
}
}
FastFieldsWriter {
term_id_writers,
single_value_writers,
multi_values_writers,
bytes_value_writers,
@@ -78,10 +88,15 @@ impl FastFieldsWriter {
/// The memory used (including children)
pub fn mem_usage(&self) -> usize {
self.single_value_writers
self.term_id_writers
.iter()
.map(|w| w.mem_usage())
.sum::<usize>()
+ self
.single_value_writers
.iter()
.map(|w| w.mem_usage())
.sum::<usize>()
+ self
.multi_values_writers
.iter()
@@ -94,6 +109,14 @@ impl FastFieldsWriter {
.sum::<usize>()
}
/// Get the `FastFieldWriter` associated to a field.
pub fn get_term_id_writer(&self, field: Field) -> Option<&MultiValuedFastFieldWriter> {
// TODO optimize
self.term_id_writers
.iter()
.find(|field_writer| field_writer.field() == field)
}
/// Get the `FastFieldWriter` associated to a field.
pub fn get_field_writer(&self, field: Field) -> Option<&IntFastFieldWriter> {
// TODO optimize
@@ -110,6 +133,17 @@ impl FastFieldsWriter {
.find(|field_writer| field_writer.field() == field)
}
/// Get the `FastFieldWriter` associated to a field.
pub fn get_term_id_writer_mut(
&mut self,
field: Field,
) -> Option<&mut MultiValuedFastFieldWriter> {
// TODO optimize
self.term_id_writers
.iter_mut()
.find(|field_writer| field_writer.field() == field)
}
/// Returns the fast field multi-value writer for the given field.
///
/// Returns None if the field does not exist, or is not
@@ -137,6 +171,9 @@ impl FastFieldsWriter {
/// Indexes all of the fastfields of a new document.
pub fn add_document(&mut self, doc: &Document) {
for field_writer in &mut self.term_id_writers {
field_writer.add_document(doc);
}
for field_writer in &mut self.single_value_writers {
field_writer.add_document(doc);
}
@@ -156,6 +193,10 @@ impl FastFieldsWriter {
mapping: &HashMap<Field, FnvHashMap<UnorderedTermId, TermOrdinal>>,
doc_id_map: Option<&DocIdMapping>,
) -> io::Result<()> {
for field_writer in &self.term_id_writers {
let field = field_writer.field();
field_writer.serialize(serializer, mapping.get(&field), doc_id_map)?;
}
for field_writer in &self.single_value_writers {
field_writer.serialize(serializer, doc_id_map)?;
}
@@ -244,6 +285,10 @@ impl IntFastFieldWriter {
self.val_count += 1;
}
/// Extract the fast field value from the document
/// (or use the default value) and records it.
///
///
/// Extract the value associated to the fast field for
/// this document.
///
@@ -254,18 +299,17 @@ impl IntFastFieldWriter {
/// instead.
/// If the document has more than one value for the given field,
/// only the first one is taken in account.
fn extract_val(&self, doc: &Document) -> u64 {
match doc.get_first(self.field) {
Some(v) => super::value_to_u64(v),
None => self.val_if_missing,
}
}
/// Extract the fast field value from the document
/// (or use the default value) and records it.
///
/// Values on text fast fields are skipped.
pub fn add_document(&mut self, doc: &Document) {
let val = self.extract_val(doc);
self.add_val(val);
match doc.get_first(self.field) {
Some(v) => {
self.add_val(super::value_to_u64(v));
}
None => {
self.add_val(self.val_if_missing);
}
};
}
/// get iterator over the data
@@ -284,6 +328,7 @@ impl IntFastFieldWriter {
} else {
(self.val_min, self.val_max)
};
let fastfield_accessor = WriterFastFieldAccessProvider {
doc_id_map,
vals: &self.vals,
@@ -298,7 +343,7 @@ impl IntFastFieldWriter {
let iter = doc_id_map
.iter_old_doc_ids()
.map(|doc_id| self.vals.get(doc_id as usize));
serializer.new_u64_fast_field_with_best_codec(
serializer.create_auto_detect_u64_fast_field(
self.field,
stats,
fastfield_accessor,
@@ -306,7 +351,7 @@ impl IntFastFieldWriter {
iter,
)?;
} else {
serializer.new_u64_fast_field_with_best_codec(
serializer.create_auto_detect_u64_fast_field(
self.field,
stats,
fastfield_accessor,

View File

@@ -116,14 +116,14 @@ pub fn demux(
) -> crate::Result<Vec<Index>> {
let mut indices = vec![];
for (target_segment_ord, output_directory) in output_directories.into_iter().enumerate() {
let delete_bitsets = get_alive_bitsets(demux_mapping, target_segment_ord as u32)
let alive_bitset = get_alive_bitsets(demux_mapping, target_segment_ord as u32)
.into_iter()
.map(Some)
.collect_vec();
let index = merge_filtered_segments(
segments,
target_settings.clone(),
delete_bitsets,
alive_bitset,
output_directory,
)?;
indices.push(index);
@@ -141,7 +141,7 @@ mod tests {
use crate::{DocAddress, Term};
#[test]
fn test_demux_map_to_deletebitset() {
fn test_demux_map_to_alive_bitset() {
let max_value = 2;
let mut demux_mapping = DemuxMapping::default();
// segment ordinal 0 mapping

View File

@@ -4,7 +4,6 @@ use std::thread;
use std::thread::JoinHandle;
use common::BitSet;
use crossbeam::channel;
use smallvec::smallvec;
use super::operation::{AddOperation, UserOperation};
@@ -289,7 +288,7 @@ impl IndexWriter {
return Err(TantivyError::InvalidArgument(err_msg));
}
let (document_sender, document_receiver): (AddBatchSender, AddBatchReceiver) =
channel::bounded(PIPELINE_MAX_SIZE_IN_DOCS);
crossbeam_channel::bounded(PIPELINE_MAX_SIZE_IN_DOCS);
let delete_queue = DeleteQueue::new();
@@ -326,7 +325,7 @@ impl IndexWriter {
}
fn drop_sender(&mut self) {
let (sender, _receiver) = channel::bounded(1);
let (sender, _receiver) = crossbeam_channel::bounded(1);
self.operation_sender = sender;
}
@@ -532,7 +531,7 @@ impl IndexWriter {
/// Returns the former segment_ready channel.
fn recreate_document_channel(&mut self) {
let (document_sender, document_receiver): (AddBatchSender, AddBatchReceiver) =
channel::bounded(PIPELINE_MAX_SIZE_IN_DOCS);
crossbeam_channel::bounded(PIPELINE_MAX_SIZE_IN_DOCS);
self.operation_sender = document_sender;
self.index_writer_status = IndexWriterStatus::from(document_receiver);
}

View File

@@ -92,7 +92,7 @@ impl Drop for IndexWriterBomb {
mod tests {
use std::mem;
use crossbeam::channel;
use crossbeam_channel as channel;
use super::IndexWriterStatus;

View File

@@ -4,7 +4,7 @@ use murmurhash32::murmurhash2;
use crate::fastfield::FastValue;
use crate::postings::{IndexingContext, IndexingPosition, PostingsWriter};
use crate::schema::term::{JSON_END_OF_PATH, JSON_PATH_SEGMENT_SEP};
use crate::schema::Type;
use crate::schema::{Field, Type};
use crate::time::format_description::well_known::Rfc3339;
use crate::time::{OffsetDateTime, UtcOffset};
use crate::tokenizer::TextAnalyzer;
@@ -149,10 +149,11 @@ fn index_json_value<'a>(
json_term_writer.term_buffer,
ctx,
indexing_position,
None,
);
}
TextOrDateTime::DateTime(dt) => {
json_term_writer.set_fast_value(DateTime::new_utc(dt));
json_term_writer.set_fast_value(DateTime::from_utc(dt));
postings_writer.subscribe(doc, 0u32, json_term_writer.term(), ctx);
}
},
@@ -198,12 +199,77 @@ fn infer_type_from_str(text: &str) -> TextOrDateTime {
}
}
// Tries to infer a JSON type from a string
pub(crate) fn convert_to_fast_value_and_get_term(
json_term_writer: &mut JsonTermWriter,
phrase: &str,
) -> Option<Term> {
if let Ok(dt) = OffsetDateTime::parse(phrase, &Rfc3339) {
let dt_utc = dt.to_offset(UtcOffset::UTC);
return Some(set_fastvalue_and_get_term(
json_term_writer,
DateTime::from_utc(dt_utc),
));
}
if let Ok(u64_val) = str::parse::<u64>(phrase) {
return Some(set_fastvalue_and_get_term(json_term_writer, u64_val));
}
if let Ok(i64_val) = str::parse::<i64>(phrase) {
return Some(set_fastvalue_and_get_term(json_term_writer, i64_val));
}
if let Ok(f64_val) = str::parse::<f64>(phrase) {
return Some(set_fastvalue_and_get_term(json_term_writer, f64_val));
}
None
}
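The fallback order used by `convert_to_fast_value_and_get_term` can be illustrated with a minimal, std-only sketch. The `Inferred` enum and `infer_number` function are hypothetical names introduced for this illustration only. The RFC 3339 date attempt that comes first in the real function is left out, since it needs the `time` crate.

```rust
/// Hypothetical stand-in for the typed term produced by the real function.
#[derive(Debug, PartialEq)]
enum Inferred {
    U64(u64),
    I64(i64),
    F64(f64),
}

/// Tries u64, then i64, then f64, mirroring the order in the hunk above.
/// The date attempt that precedes these in the real code is omitted here.
fn infer_number(phrase: &str) -> Option<Inferred> {
    if let Ok(val) = phrase.parse::<u64>() {
        return Some(Inferred::U64(val));
    }
    if let Ok(val) = phrase.parse::<i64>() {
        return Some(Inferred::I64(val));
    }
    if let Ok(val) = phrase.parse::<f64>() {
        return Some(Inferred::F64(val));
    }
    // Nothing numeric matched: the caller has to treat the phrase as text.
    None
}
```

Because `u64` is tried first, a non-negative integer such as `42` never reaches the `i64` or `f64` branches; only negative integers fall through to `i64`, and only values with a fractional part or exponent reach `f64`.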
// helper function to generate a Term from a json fastvalue
pub(crate) fn set_fastvalue_and_get_term<T: FastValue>(
json_term_writer: &mut JsonTermWriter,
value: T,
) -> Term {
json_term_writer.set_fast_value(value);
json_term_writer.term().clone()
}
// helper function to generate a list of terms with their positions from a textual json value
pub(crate) fn set_string_and_get_terms(
json_term_writer: &mut JsonTermWriter,
value: &str,
text_analyzer: &TextAnalyzer,
) -> Vec<(usize, Term)> {
let mut positions_and_terms = Vec::<(usize, Term)>::new();
json_term_writer.close_path_and_set_type(Type::Str);
let term_num_bytes = json_term_writer.term_buffer.as_slice().len();
let mut token_stream = text_analyzer.token_stream(value);
token_stream.process(&mut |token| {
json_term_writer.term_buffer.truncate(term_num_bytes);
json_term_writer
.term_buffer
.append_bytes(token.text.as_bytes());
positions_and_terms.push((token.position, json_term_writer.term().clone()));
});
positions_and_terms
}
pub struct JsonTermWriter<'a> {
term_buffer: &'a mut Term,
path_stack: Vec<usize>,
}
impl<'a> JsonTermWriter<'a> {
pub fn from_field_and_json_path(
field: Field,
json_path: &str,
term_buffer: &'a mut Term,
) -> Self {
term_buffer.set_field(Type::Json, field);
let mut json_term_writer = Self::wrap(term_buffer);
for segment in json_path.split('.') {
json_term_writer.push_path_segment(segment);
}
json_term_writer
}
pub fn wrap(term_buffer: &'a mut Term) -> Self {
term_buffer.clear_with_type(Type::Json);
let mut path_stack = Vec::with_capacity(10);

View File

@@ -170,8 +170,8 @@ impl IndexMerger {
index_settings: IndexSettings,
segments: &[Segment],
) -> crate::Result<IndexMerger> {
let delete_bitsets = segments.iter().map(|_| None).collect_vec();
Self::open_with_custom_alive_set(schema, index_settings, segments, delete_bitsets)
let alive_bitset = segments.iter().map(|_| None).collect_vec();
Self::open_with_custom_alive_set(schema, index_settings, segments, alive_bitset)
}
// Create merge with a custom delete set.
@@ -180,7 +180,7 @@ impl IndexMerger {
// corresponds to the segment index.
//
// If `None` is provided for custom alive set, the regular alive set will be used.
// If a delete_bitsets is provided, the union between the provided and regular
// If an alive_bitset is provided, the union between the provided and regular
// alive set will be used.
//
// This can be used to merge but also apply an additional filter.
@@ -283,12 +283,12 @@ impl IndexMerger {
for (field, field_entry) in self.schema.fields() {
let field_type = field_entry.field_type();
match field_type {
FieldType::Facet(_) => {
FieldType::Facet(_) | FieldType::Str(_) if field_type.is_fast() => {
let term_ordinal_mapping = term_ord_mappings.remove(&field).expect(
"Logic Error in Tantivy (Please report). Facet field should have required \
a`term_ordinal_mapping`.",
);
self.write_hierarchical_facet_field(
self.write_term_id_fast_field(
field,
&term_ordinal_mapping,
fast_field_serializer,
@@ -312,8 +312,8 @@ impl IndexMerger {
self.write_bytes_fast_field(field, fast_field_serializer, doc_id_mapping)?;
}
}
FieldType::Str(_) | FieldType::JsonObject(_) => {
// We don't handle json / string fast field for the moment
_ => {
// We don't handle json fast field for the moment
// They can be implemented using what is done
// for facets in the future
}
@@ -384,7 +384,7 @@ impl IndexMerger {
let fast_field_reader = &fast_field_readers[*reader_ordinal as usize];
fast_field_reader.get(*doc_id)
});
fast_field_serializer.new_u64_fast_field_with_best_codec(
fast_field_serializer.create_auto_detect_u64_fast_field(
field,
stats,
fastfield_accessor,
@@ -551,7 +551,7 @@ impl IndexMerger {
}
offsets.push(offset);
fast_field_serializer.new_u64_fast_field_with_best_codec(
fast_field_serializer.create_auto_detect_u64_fast_field(
field,
stats,
&offsets[..],
@@ -590,14 +590,14 @@ impl IndexMerger {
)
}
fn write_hierarchical_facet_field(
fn write_term_id_fast_field(
&self,
field: Field,
term_ordinal_mappings: &TermOrdinalMapping,
fast_field_serializer: &mut CompositeFastFieldSerializer,
doc_id_mapping: &SegmentDocIdMapping,
) -> crate::Result<()> {
debug_time!("write-hierarchical-facet-field");
debug_time!("write-term-id-fast-field");
// Multifastfield consists of 2 fastfields.
// The first serves as an index into the second one and is strictly increasing.
@@ -771,7 +771,7 @@ impl IndexMerger {
ff_reader.get_vals(*doc_id, &mut vals);
vals.into_iter()
});
fast_field_serializer.new_u64_fast_field_with_idx_with_best_codec(
fast_field_serializer.create_auto_detect_u64_fast_field_with_idx(
field,
stats,
fastfield_accessor,
@@ -848,6 +848,9 @@ impl IndexMerger {
let mut term_ord_mapping_opt = match field_type {
FieldType::Facet(_) => Some(TermOrdinalMapping::new(max_term_ords)),
FieldType::Str(options) if options.is_fast() => {
Some(TermOrdinalMapping::new(max_term_ords))
}
_ => None,
};
@@ -1174,7 +1177,7 @@ mod tests {
index_writer.add_document(doc!(
text_field => "af b",
score_field => 3u64,
date_field => DateTime::new_utc(curr_time),
date_field => DateTime::from_utc(curr_time),
bytes_score_field => 3u32.to_be_bytes().as_ref()
))?;
index_writer.add_document(doc!(
@@ -1191,7 +1194,7 @@ mod tests {
// writing the segment
index_writer.add_document(doc!(
text_field => "af b",
date_field => DateTime::new_utc(curr_time),
date_field => DateTime::from_utc(curr_time),
score_field => 11u64,
bytes_score_field => 11u32.to_be_bytes().as_ref()
))?;
@@ -1249,7 +1252,7 @@ mod tests {
assert_eq!(
get_doc_ids(vec![Term::from_field_date(
date_field,
DateTime::new_utc(curr_time)
DateTime::from_utc(curr_time)
)])?,
vec![DocAddress::new(0, 0), DocAddress::new(0, 3)]
);

View File

@@ -21,11 +21,13 @@ pub mod segment_updater;
mod segment_writer;
mod stamper;
use crossbeam::channel;
use crossbeam_channel as channel;
use smallvec::SmallVec;
pub use self::index_writer::IndexWriter;
pub(crate) use self::json_term_writer::JsonTermWriter;
pub(crate) use self::json_term_writer::{
convert_to_fast_value_and_get_term, set_string_and_get_terms, JsonTermWriter,
};
pub use self::log_merge_policy::LogMergePolicy;
pub use self::merge_operation::MergeOperation;
pub use self::merge_policy::{MergeCandidate, MergePolicy, NoMergePolicy};

View File

@@ -39,9 +39,10 @@ impl SegmentSerializer {
let postings_serializer = InvertedIndexSerializer::open(&mut segment)?;
let compressor = segment.index().settings().docstore_compression;
let blocksize = segment.index().settings().docstore_blocksize;
Ok(SegmentSerializer {
segment,
store_writer: StoreWriter::new(store_write, compressor),
store_writer: StoreWriter::new(store_write, compressor, blocksize),
fast_field_serializer,
fieldnorms_serializer: Some(fieldnorms_serializer),
postings_serializer,

View File

@@ -1,6 +1,5 @@
use std::borrow::BorrowMut;
use std::collections::HashSet;
use std::io;
use std::io::Write;
use std::ops::Deref;
use std::path::PathBuf;
@@ -27,7 +26,7 @@ use crate::indexer::{
SegmentSerializer,
};
use crate::schema::Schema;
use crate::{FutureResult, Opstamp, TantivyError};
use crate::{FutureResult, Opstamp};
const NUM_MERGE_THREADS: usize = 4;
@@ -73,10 +72,12 @@ fn save_metas(metas: &IndexMeta, directory: &dyn Directory) -> crate::Result<()>
let mut buffer = serde_json::to_vec_pretty(metas)?;
// Just adding a new line at the end of the buffer.
writeln!(&mut buffer)?;
fail_point!("save_metas", |msg| Err(TantivyError::from(io::Error::new(
io::ErrorKind::Other,
msg.unwrap_or_else(|| "Undefined".to_string())
))));
fail_point!("save_metas", |msg| Err(crate::TantivyError::from(
std::io::Error::new(
std::io::ErrorKind::Other,
msg.unwrap_or_else(|| "Undefined".to_string())
)
)));
directory.sync_directory()?;
directory.atomic_write(&META_FILEPATH, &buffer[..])?;
debug!("Saved metas {:?}", serde_json::to_string_pretty(&metas));

View File

@@ -188,7 +188,7 @@ impl SegmentWriter {
});
if let Some(unordered_term_id) = unordered_term_id_opt {
self.fast_field_writers
.get_multivalue_writer_mut(field)
.get_term_id_writer_mut(field)
.expect("writer for facet missing")
.add_val(unordered_term_id);
}
@@ -221,6 +221,7 @@ impl SegmentWriter {
}
let mut indexing_position = IndexingPosition::default();
for mut token_stream in token_streams {
assert_eq!(term_buffer.as_slice().len(), 5);
postings_writer.index_text(
@@ -229,10 +230,13 @@ impl SegmentWriter {
term_buffer,
ctx,
&mut indexing_position,
self.fast_field_writers.get_term_id_writer_mut(field),
);
}
self.fieldnorms_writer
.record(doc_id, field, indexing_position.num_tokens);
if field_entry.has_fieldnorms() {
self.fieldnorms_writer
.record(doc_id, field, indexing_position.num_tokens);
}
}
FieldType::U64(_) => {
for value in values {
@@ -368,9 +372,10 @@ fn remap_and_write(
.segment_mut()
.open_write(SegmentComponent::Store)?;
let compressor = serializer.segment().index().settings().docstore_compression;
let block_size = serializer.segment().index().settings().docstore_blocksize;
let old_store_writer = std::mem::replace(
&mut serializer.store_writer,
StoreWriter::new(store_write, compressor),
StoreWriter::new(store_write, compressor, block_size),
);
old_store_writer.close()?;
let store_read = StoreReader::open(
@@ -523,7 +528,7 @@ mod tests {
json_term_writer.pop_path_segment();
json_term_writer.pop_path_segment();
json_term_writer.push_path_segment("date");
json_term_writer.set_fast_value(DateTime::new_utc(
json_term_writer.set_fast_value(DateTime::from_utc(
OffsetDateTime::parse("1985-04-12T23:20:50.52Z", &Rfc3339).unwrap(),
));
assert!(term_stream.advance());
@@ -703,4 +708,38 @@ mod tests {
let phrase_query = PhraseQuery::new(vec![nothello_term, happy_term]);
assert_eq!(searcher.search(&phrase_query, &Count).unwrap(), 0);
}
#[test]
fn test_bug_regression_1629_position_when_array_with_a_field_value_that_does_not_contain_any_token(
) {
// We experienced a bug where we would have a position underflow when computing position
// delta in a horrible corner case.
//
// See the commit with this unit test if you want the details.
let mut schema_builder = Schema::builder();
let text = schema_builder.add_text_field("text", TEXT);
let schema = schema_builder.build();
let doc = schema
.parse_document(r#"{"text": [ "bbb", "aaa", "", "aaa"]}"#)
.unwrap();
let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests().unwrap();
index_writer.add_document(doc).unwrap();
// On debug this did panic on the underflow
index_writer.commit().unwrap();
let reader = index.reader().unwrap();
let searcher = reader.searcher();
let seg_reader = searcher.segment_reader(0);
let inv_index = seg_reader.inverted_index(text).unwrap();
let term = Term::from_field_text(text, "aaa");
let mut postings = inv_index
.read_postings(&term, IndexRecordOption::WithFreqsAndPositions)
.unwrap()
.unwrap();
assert_eq!(postings.doc(), 0u32);
let mut positions = Vec::new();
postings.positions(&mut positions);
// On release this was [2, 1]. (< note the decreasing values)
assert_eq!(positions, &[2, 5]);
}
}
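The positions asserted in the regression test above can be reproduced with a simplified, std-only model of the position bookkeeping. This is a sketch under stated assumptions, not tantivy's code: the function `positions` and its `carry_over` flag are hypothetical, tokens are obtained by splitting on whitespace, every token has a position length of 1, and `POSITION_GAP` is assumed to be 1 (which is consistent with the expected `[2, 5]`).

```rust
/// Assumed gap inserted between consecutive values of the same field.
const POSITION_GAP: u32 = 1;

/// Returns the token positions of each value of a multi-valued text field.
///
/// `carry_over = true` models the fixed behaviour (the end position starts
/// from the running value); `carry_over = false` models the old behaviour
/// (the end position starts from 0 for every value).
fn positions(values: &[&str], carry_over: bool) -> Vec<Vec<u32>> {
    let mut field_end_position = 0u32;
    let mut all_positions = Vec::new();
    for value in values {
        // With the fix, a value without any token keeps the running position.
        let mut end_position = if carry_over { field_end_position } else { 0 };
        let mut value_positions = Vec::new();
        for (token_position, _token) in value.split_whitespace().enumerate() {
            let start_position = field_end_position + token_position as u32;
            end_position = start_position + 1;
            value_positions.push(start_position);
        }
        field_end_position = end_position + POSITION_GAP;
        all_positions.push(value_positions);
    }
    all_positions
}
```

For the values `["bbb", "aaa", "", "aaa"]`, the model places the two `aaa` tokens at positions 2 and 5 with `carry_over = true`, and at 2 and 1 with `carry_over = false`. The second case is the decreasing sequence mentioned in the test comment, which is what makes the position delta underflow.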

View File

@@ -158,7 +158,7 @@ impl DateTime {
///
/// The given date/time is converted to UTC and the actual
/// time zone is discarded.
pub const fn new_utc(dt: OffsetDateTime) -> Self {
pub const fn from_utc(dt: OffsetDateTime) -> Self {
Self::from_unix_timestamp(dt.unix_timestamp())
}
@@ -166,19 +166,19 @@ impl DateTime {
///
/// Implicitly assumes that the given date/time is in UTC!
/// Otherwise the original value must only be reobtained with
/// [`to_primitive()`].
pub const fn new_primitive(dt: PrimitiveDateTime) -> Self {
Self::new_utc(dt.assume_utc())
/// [`Self::into_primitive()`].
pub const fn from_primitive(dt: PrimitiveDateTime) -> Self {
Self::from_utc(dt.assume_utc())
}
/// Convert to UNIX timestamp
pub const fn to_unix_timestamp(self) -> i64 {
pub const fn into_unix_timestamp(self) -> i64 {
let Self { unix_timestamp } = self;
unix_timestamp
}
/// Convert to UTC `OffsetDateTime`
pub fn to_utc(self) -> OffsetDateTime {
pub fn into_utc(self) -> OffsetDateTime {
let Self { unix_timestamp } = self;
let utc_datetime =
OffsetDateTime::from_unix_timestamp(unix_timestamp).expect("valid UNIX timestamp");
@@ -187,16 +187,16 @@ impl DateTime {
}
/// Convert to `OffsetDateTime` with the given time zone
pub fn to_offset(self, offset: UtcOffset) -> OffsetDateTime {
self.to_utc().to_offset(offset)
pub fn into_offset(self, offset: UtcOffset) -> OffsetDateTime {
self.into_utc().to_offset(offset)
}
/// Convert to `PrimitiveDateTime` without any time zone
///
/// The value should have been constructed with [`from_primitive()`].
/// The value should have been constructed with [`Self::from_primitive()`].
/// Otherwise the time zone is implicitly assumed to be UTC.
pub fn to_primitive(self) -> PrimitiveDateTime {
let utc_datetime = self.to_utc();
pub fn into_primitive(self) -> PrimitiveDateTime {
let utc_datetime = self.into_utc();
// Discard the UTC time zone offset
debug_assert_eq!(UtcOffset::UTC, utc_datetime.offset());
PrimitiveDateTime::new(utc_datetime.date(), utc_datetime.time())
@@ -205,7 +205,7 @@ impl DateTime {
impl fmt::Debug for DateTime {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
let utc_rfc3339 = self.to_utc().format(&Rfc3339).map_err(|_| fmt::Error)?;
let utc_rfc3339 = self.into_utc().format(&Rfc3339).map_err(|_| fmt::Error)?;
f.write_str(&utc_rfc3339)
}
}

View File

@@ -1,5 +1,6 @@
use std::io;
use crate::fastfield::MultiValuedFastFieldWriter;
use crate::indexer::doc_id_mapping::DocIdMapping;
use crate::postings::postings_writer::SpecializedPostingsWriter;
use crate::postings::recorder::{BufferLender, NothingRecorder, Recorder};
@@ -42,6 +43,7 @@ impl<Rec: Recorder> PostingsWriter for JsonPostingsWriter<Rec> {
term_buffer: &mut Term,
ctx: &mut IndexingContext,
indexing_position: &mut IndexingPosition,
_fast_field_writer: Option<&mut MultiValuedFastFieldWriter>,
) {
self.str_posting_writer.index_text(
doc_id,
@@ -49,6 +51,7 @@ impl<Rec: Recorder> PostingsWriter for JsonPostingsWriter<Rec> {
term_buffer,
ctx,
indexing_position,
None,
);
}

View File

@@ -6,6 +6,7 @@ use std::ops::Range;
use fnv::FnvHashMap;
use super::stacker::Addr;
use crate::fastfield::MultiValuedFastFieldWriter;
use crate::fieldnorm::FieldNormReaders;
use crate::indexer::doc_id_mapping::DocIdMapping;
use crate::postings::recorder::{BufferLender, Recorder};
@@ -145,10 +146,11 @@ pub(crate) trait PostingsWriter {
term_buffer: &mut Term,
ctx: &mut IndexingContext,
indexing_position: &mut IndexingPosition,
mut term_id_fast_field_writer_opt: Option<&mut MultiValuedFastFieldWriter>,
) {
let end_of_path_idx = term_buffer.as_slice().len();
let mut num_tokens = 0;
let mut end_position = 0;
let mut end_position = indexing_position.end_position;
token_stream.process(&mut |token: &Token| {
// We skip all tokens with a len greater than u16.
if token.text.len() > MAX_TOKEN_LEN {
@@ -164,9 +166,14 @@ pub(crate) trait PostingsWriter {
term_buffer.append_bytes(token.text.as_bytes());
let start_position = indexing_position.end_position + token.position as u32;
end_position = start_position + token.position_length as u32;
self.subscribe(doc_id, start_position, term_buffer, ctx);
let unordered_term_id = self.subscribe(doc_id, start_position, term_buffer, ctx);
if let Some(term_id_fast_field_writer) = term_id_fast_field_writer_opt.as_mut() {
term_id_fast_field_writer.add_val(unordered_term_id);
}
num_tokens += 1;
});
indexing_position.end_position = end_position + POSITION_GAP;
indexing_position.num_tokens += num_tokens;
term_buffer.truncate(end_of_path_idx);

View File

@@ -247,7 +247,7 @@ impl MoreLikeThis {
let unix_timestamp = value
.as_date()
.ok_or_else(|| TantivyError::InvalidArgument("invalid value".to_string()))?
.to_unix_timestamp();
.into_unix_timestamp();
if !self.is_noise_word(unix_timestamp.to_string()) {
let term = Term::from_field_i64(field, unix_timestamp);
*term_frequencies.entry(term).or_insert(0) += 1;

View File

@@ -184,6 +184,66 @@ fn intersection_with_slop(left: &mut [u32], right: &[u32], slop: u32) -> usize {
count
}
fn intersection_count_with_slop(left: &[u32], right: &[u32], slop: u32) -> usize {
let mut left_index = 0;
let mut right_index = 0;
let mut count = 0;
let left_len = left.len();
let right_len = right.len();
while left_index < left_len && right_index < right_len {
let left_val = left[left_index];
let right_val = right[right_index];
let right_slop = if right_val >= slop {
right_val - slop
} else {
0
};
if left_val < right_slop {
left_index += 1;
} else if right_slop <= left_val && left_val <= right_val {
while left_index + 1 < left_len {
let next_left_val = left[left_index + 1];
if next_left_val > right_val {
break;
}
left_index += 1;
}
count += 1;
left_index += 1;
right_index += 1;
} else if left_val > right_val {
right_index += 1;
}
}
count
}
fn intersection_exists_with_slop(left: &[u32], right: &[u32], slop: u32) -> bool {
let mut left_index = 0;
let mut right_index = 0;
let left_len = left.len();
let right_len = right.len();
while left_index < left_len && right_index < right_len {
let left_val = left[left_index];
let right_val = right[right_index];
let right_slop = if right_val >= slop {
right_val - slop
} else {
0
};
if left_val < right_slop {
left_index += 1;
} else if right_slop <= left_val && left_val <= right_val {
return true;
} else if left_val > right_val {
right_index += 1;
}
}
false
}
impl<TPostings: Postings> PhraseScorer<TPostings> {
pub fn new(
term_postings: Vec<(usize, TPostings)>,
@@ -237,11 +297,25 @@ impl<TPostings: Postings> PhraseScorer<TPostings> {
fn phrase_exists(&mut self) -> bool {
let intersection_len = self.compute_phrase_match();
if self.has_slop() {
return intersection_exists_with_slop(
&self.left[..intersection_len],
&self.right[..],
self.slop,
);
}
intersection_exists(&self.left[..intersection_len], &self.right[..])
}
fn compute_phrase_count(&mut self) -> u32 {
let intersection_len = self.compute_phrase_match();
if self.has_slop() {
return intersection_count_with_slop(
&self.left[..intersection_len],
&self.right[..],
self.slop,
) as u32;
}
intersection_count(&self.left[..intersection_len], &self.right[..]) as u32
}
@@ -252,12 +326,7 @@ impl<TPostings: Postings> PhraseScorer<TPostings> {
.positions(&mut self.left);
}
let mut intersection_len = self.left.len();
let end_term = if self.has_slop() {
self.num_terms
} else {
self.num_terms - 1
};
for i in 1..end_term {
for i in 1..self.num_terms - 1 {
{
self.intersection_docset
.docset_mut_specialized(i)
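To make the behaviour of the two slop-aware helpers added in this file concrete, here is a condensed, self-contained copy of them. The only change is that `saturating_sub` replaces the explicit `if right_val >= slop` branch, which is equivalent for `u32`. The inputs used in the note below are made up for illustration.

```rust
/// Counts the positions of `right` that have a position of `left`
/// at most `slop` before them. Both slices are assumed to be sorted.
fn intersection_count_with_slop(left: &[u32], right: &[u32], slop: u32) -> usize {
    let (mut left_index, mut right_index, mut count) = (0, 0, 0);
    while left_index < left.len() && right_index < right.len() {
        let left_val = left[left_index];
        let right_val = right[right_index];
        // Lower bound of the accepted window [right_val - slop, right_val].
        let right_slop = right_val.saturating_sub(slop);
        if left_val < right_slop {
            left_index += 1;
        } else if left_val <= right_val {
            // Skip the other left positions that precede this right position,
            // so that one right position is counted only once.
            while left_index + 1 < left.len() && left[left_index + 1] <= right_val {
                left_index += 1;
            }
            count += 1;
            left_index += 1;
            right_index += 1;
        } else {
            right_index += 1;
        }
    }
    count
}

/// Same walk as above, but stops at the first match.
fn intersection_exists_with_slop(left: &[u32], right: &[u32], slop: u32) -> bool {
    let (mut left_index, mut right_index) = (0, 0);
    while left_index < left.len() && right_index < right.len() {
        let left_val = left[left_index];
        let right_val = right[right_index];
        let right_slop = right_val.saturating_sub(slop);
        if left_val < right_slop {
            left_index += 1;
        } else if left_val <= right_val {
            return true;
        } else {
            right_index += 1;
        }
    }
    false
}
```

With `left = [1, 5]` and `right = [3]`, a slop of 2 accepts the pair (1, 3), while a slop of 1 rejects it because 1 falls outside the window `[2, 3]`. With `left = [0, 1, 10]`, `right = [2, 12]` and a slop of 2, the count is 2: positions 0 and 1 both precede 2 but are counted once.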

View File

@@ -1,4 +1,4 @@
use std::collections::{BTreeSet, HashMap};
use std::collections::HashMap;
use std::num::{ParseFloatError, ParseIntError};
use std::ops::Bound;
use std::str::FromStr;
@@ -7,7 +7,9 @@ use tantivy_query_grammar::{UserInputAst, UserInputBound, UserInputLeaf, UserInp
use super::logical_ast::*;
use crate::core::Index;
use crate::indexer::JsonTermWriter;
use crate::indexer::{
convert_to_fast_value_and_get_term, set_string_and_get_terms, JsonTermWriter,
};
use crate::query::{
AllQuery, BooleanQuery, BoostQuery, EmptyQuery, Occur, PhraseQuery, Query, RangeQuery,
TermQuery,
@@ -16,7 +18,7 @@ use crate::schema::{
Facet, FacetParseError, Field, FieldType, IndexRecordOption, Schema, Term, Type,
};
use crate::time::format_description::well_known::Rfc3339;
use crate::time::{OffsetDateTime, UtcOffset};
use crate::time::OffsetDateTime;
use crate::tokenizer::{TextAnalyzer, TokenizerManager};
use crate::{DateTime, Score};
@@ -24,13 +26,13 @@ use crate::{DateTime, Score};
#[derive(Debug, PartialEq, Eq, Error)]
pub enum QueryParserError {
/// Error in the query syntax
#[error("Syntax Error")]
SyntaxError,
#[error("Syntax Error: {0}")]
SyntaxError(String),
/// This query is unsupported.
#[error("Unsupported query: {0}")]
UnsupportedQuery(String),
/// The query references a field that is not in the schema
#[error("Field does not exists: '{0:?}'")]
#[error("Field does not exists: '{0}'")]
FieldDoesNotExist(String),
/// The query contains a term for a `u64` or `i64`-field, but the value
/// is neither.
@@ -53,11 +55,11 @@ pub enum QueryParserError {
NoDefaultFieldDeclared,
/// The field searched for is not declared
/// as indexed in the schema.
#[error("The field '{0:?}' is not declared as indexed")]
#[error("The field '{0}' is not declared as indexed")]
FieldNotIndexed(String),
/// A phrase query was requested for a field that does not
/// have any positions indexed.
#[error("The field '{0:?}' does not have positions indexed")]
#[error("The field '{0}' does not have positions indexed")]
FieldDoesNotHavePositionsIndexed(String),
/// The tokenizer for the given field is unknown
/// The two argument strings are the name of the field, the name of the tokenizer
@@ -169,7 +171,7 @@ pub struct QueryParser {
conjunction_by_default: bool,
tokenizer_manager: TokenizerManager,
boost: HashMap<Field, Score>,
field_names: BTreeSet<String>,
field_names: HashMap<String, Field>,
}
fn all_negative(ast: &LogicalAst) -> bool {
@@ -182,6 +184,31 @@ fn all_negative(ast: &LogicalAst) -> bool {
}
}
// Returns the position (in byte offsets) of the unescaped '.' in the `field_path`.
//
// This function operates directly on bytes (as opposed to codepoint), relying
// on an encoding property of utf-8 for its correctness.
fn locate_splitting_dots(field_path: &str) -> Vec<usize> {
let mut splitting_dots_pos = Vec::new();
let mut escape_state = false;
for (pos, b) in field_path.bytes().enumerate() {
if escape_state {
escape_state = false;
continue;
}
match b {
b'\\' => {
escape_state = true;
}
b'.' => {
splitting_dots_pos.push(pos);
}
_ => {}
}
}
splitting_dots_pos
}
impl QueryParser {
/// Creates a `QueryParser`, given
/// * schema - index Schema
@@ -193,7 +220,7 @@ impl QueryParser {
) -> QueryParser {
let field_names = schema
.fields()
.map(|(_, field_entry)| field_entry.name().to_string())
.map(|(field, field_entry)| (field_entry.name().to_string(), field))
.collect();
QueryParser {
schema,
@@ -207,25 +234,18 @@ impl QueryParser {
// Splits a full_path as written in a query, into a field name and a
// json path.
pub(crate) fn split_full_path<'a>(&self, full_path: &'a str) -> (&'a str, &'a str) {
if full_path.is_empty() {
return ("", "");
pub(crate) fn split_full_path<'a>(&self, full_path: &'a str) -> Option<(Field, &'a str)> {
if let Some(field) = self.field_names.get(full_path) {
return Some((*field, ""));
}
if self.field_names.contains(full_path) {
return (full_path, "");
}
let mut result = ("", full_path);
let mut cursor = 0;
while let Some(pos) = full_path[cursor..].find('.') {
cursor += pos;
let prefix = &full_path[..cursor];
let suffix = &full_path[cursor + 1..];
if self.field_names.contains(prefix) {
result = (prefix, suffix);
let mut splitting_period_pos: Vec<usize> = locate_splitting_dots(full_path);
while let Some(pos) = splitting_period_pos.pop() {
let (prefix, suffix) = full_path.split_at(pos);
if let Some(field) = self.field_names.get(prefix) {
return Some((*field, &suffix[1..]));
}
cursor += 1;
}
result
None
}
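The longest-field-name rule and the escaped-dot handling of the new `split_full_path` can be tried out in isolation with the sketch below. It is a simplified stand-in, not the parser's code: fields are plain `u32` ids instead of `Field`, the field map is passed as an argument instead of being read from `self`, and `example_fields` builds a hypothetical schema.

```rust
use std::collections::HashMap;

/// Byte offsets of every '.' that is not preceded by an escaping '\'.
fn locate_splitting_dots(field_path: &str) -> Vec<usize> {
    let mut splitting_dots_pos = Vec::new();
    let mut escape_state = false;
    for (pos, b) in field_path.bytes().enumerate() {
        if escape_state {
            // The byte right after a backslash is never a split point.
            escape_state = false;
            continue;
        }
        match b {
            b'\\' => escape_state = true,
            b'.' => splitting_dots_pos.push(pos),
            _ => {}
        }
    }
    splitting_dots_pos
}

/// Splits `full_path` into (field id, json path), preferring the longest
/// prefix that is a known field name. Returns `None` if no prefix matches.
fn split_full_path<'a>(
    field_names: &HashMap<String, u32>,
    full_path: &'a str,
) -> Option<(u32, &'a str)> {
    if let Some(field) = field_names.get(full_path) {
        return Some((*field, ""));
    }
    let mut splitting_dots_pos = locate_splitting_dots(full_path);
    // Popping from the end tries the rightmost dot, i.e. the longest prefix, first.
    while let Some(pos) = splitting_dots_pos.pop() {
        let (prefix, suffix) = full_path.split_at(pos);
        if let Some(field) = field_names.get(prefix) {
            // `suffix` still starts with the splitting dot: drop it.
            return Some((*field, &suffix[1..]));
        }
    }
    None
}

/// Hypothetical schema: `one` -> 0, `one.two` -> 1, `attr` -> 2.
fn example_fields() -> HashMap<String, u32> {
    [("one", 0), ("one.two", 1), ("attr", 2)]
        .into_iter()
        .map(|(name, id)| (name.to_string(), id))
        .collect()
}
```

With this schema, `one.two.three` resolves to field `one.two` with path `three` rather than to field `one` with path `two.three`, and `attr.k8s\.node` resolves to field `attr` with the path `k8s\.node` kept in one piece because the second dot is escaped.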
/// Creates a `QueryParser`, given
@@ -273,17 +293,11 @@ impl QueryParser {
/// Parse the user query into an AST.
fn parse_query_to_logical_ast(&self, query: &str) -> Result<LogicalAst, QueryParserError> {
let user_input_ast =
tantivy_query_grammar::parse_query(query).map_err(|_| QueryParserError::SyntaxError)?;
let user_input_ast = tantivy_query_grammar::parse_query(query)
.map_err(|_| QueryParserError::SyntaxError(query.to_string()))?;
self.compute_logical_ast(user_input_ast)
}
fn resolve_field_name(&self, field_name: &str) -> Result<Field, QueryParserError> {
self.schema
.get_field(field_name)
.ok_or_else(|| QueryParserError::FieldDoesNotExist(String::from(field_name)))
}
fn compute_logical_ast(
&self,
user_input_ast: UserInputAst,
@@ -334,7 +348,7 @@ impl QueryParser {
}
FieldType::Date(_) => {
let dt = OffsetDateTime::parse(phrase, &Rfc3339)?;
Ok(Term::from_field_date(field, DateTime::new_utc(dt)))
Ok(Term::from_field_date(field, DateTime::from_utc(dt)))
}
FieldType::Str(ref str_options) => {
let option = str_options.get_indexing_options().ok_or_else(|| {
@@ -390,6 +404,12 @@ impl QueryParser {
if !field_type.is_indexed() {
return Err(QueryParserError::FieldNotIndexed(field_name.to_string()));
}
if field_type.value_type() != Type::Json && !json_path.is_empty() {
let field_name = self.schema.get_field_name(field);
return Err(QueryParserError::FieldDoesNotExist(format!(
"{field_name}.{json_path}"
)));
}
match *field_type {
FieldType::U64(_) => {
let val: u64 = u64::from_str(phrase)?;
@@ -408,7 +428,7 @@ impl QueryParser {
}
FieldType::Date(_) => {
let dt = OffsetDateTime::parse(phrase, &Rfc3339)?;
let dt_term = Term::from_field_date(field, DateTime::new_utc(dt));
let dt_term = Term::from_field_date(field, DateTime::from_utc(dt));
Ok(vec![LogicalLiteral::Term(dt_term)])
}
FieldType::Str(ref str_options) => {
@@ -531,37 +551,56 @@ impl QueryParser {
})
}
fn compute_path_triplet_for_literal<'a>(
/// Given a literal, returns the list of terms that should be searched.
///
/// The terms are identified by a triplet:
/// - tantivy field
/// - field_path: tantivy has JSON fields. It is possible to target a member of a JSON
/// object by naturally extending the json field name with a "." separated field_path
/// - field_phrase: the phrase that is being searched.
///
/// The literal identifies the targeted field by a so-called *full field path*,
/// specified before the ":". (e.g. identity.username:fulmicoton).
///
/// The way we split the full field path into (field_name, field_path) can be ambiguous,
/// because field_names can contain "." themselves.
/// For instance, if a field is named `one.two` and another one is named `one`,
/// should `one.two:three` target `one.two` with field path `` or `one` with
/// the field path `two`?
///
/// In this case, tantivy simply picks the solution with the longest field name.
///
/// Quirk: As a hack for quickwit, we do not split over a dot that appears escaped (`\.`).
fn compute_path_triplets_for_literal<'a>(
&self,
literal: &'a UserInputLiteral,
) -> Result<Vec<(Field, &'a str, &'a str)>, QueryParserError> {
match &literal.field_name {
Some(ref full_path) => {
// We need to add terms associated to json default fields.
let (field_name, path) = self.split_full_path(full_path);
if let Ok(field) = self.resolve_field_name(field_name) {
return Ok(vec![(field, path, literal.phrase.as_str())]);
}
let triplets: Vec<(Field, &str, &str)> = self
.default_indexed_json_fields()
.map(|json_field| (json_field, full_path.as_str(), literal.phrase.as_str()))
.collect();
if triplets.is_empty() {
return Err(QueryParserError::FieldDoesNotExist(field_name.to_string()));
}
Ok(triplets)
}
None => {
if self.default_fields.is_empty() {
return Err(QueryParserError::NoDefaultFieldDeclared);
}
Ok(self
.default_fields
.iter()
.map(|default_field| (*default_field, "", literal.phrase.as_str()))
.collect::<Vec<(Field, &str, &str)>>())
let full_path = if let Some(full_path) = &literal.field_name {
full_path
} else {
// The user did not specify any path...
// We simply target default fields.
if self.default_fields.is_empty() {
return Err(QueryParserError::NoDefaultFieldDeclared);
}
return Ok(self
.default_fields
.iter()
.map(|default_field| (*default_field, "", literal.phrase.as_str()))
.collect::<Vec<(Field, &str, &str)>>());
};
if let Some((field, path)) = self.split_full_path(full_path) {
return Ok(vec![(field, path, literal.phrase.as_str())]);
}
// We need to add terms associated to json default fields.
let triplets: Vec<(Field, &str, &str)> = self
.default_indexed_json_fields()
.map(|json_field| (json_field, full_path.as_str(), literal.phrase.as_str()))
.collect();
if triplets.is_empty() {
return Err(QueryParserError::FieldDoesNotExist(full_path.to_string()));
}
Ok(triplets)
}
fn compute_logical_ast_from_leaf(
@@ -571,7 +610,7 @@ impl QueryParser {
match leaf {
UserInputLeaf::Literal(literal) => {
let term_phrases: Vec<(Field, &str, &str)> =
self.compute_path_triplet_for_literal(&literal)?;
self.compute_path_triplets_for_literal(&literal)?;
let mut asts: Vec<LogicalAst> = Vec::new();
for (field, json_path, phrase) in term_phrases {
for ast in self.compute_logical_ast_for_leaf(field, json_path, phrase)? {
@@ -598,8 +637,9 @@ impl QueryParser {
"Range query need to target a specific field.".to_string(),
)
})?;
let (field_name, json_path) = self.split_full_path(&full_path);
let field = self.resolve_field_name(field_name)?;
let (field, json_path) = self
.split_full_path(&full_path)
.ok_or_else(|| QueryParserError::FieldDoesNotExist(full_path.clone()))?;
let field_entry = self.schema.get_field_entry(field);
let value_type = field_entry.field_type().value_type();
let logical_ast = LogicalAst::Leaf(Box::new(LogicalLiteral::Range {
@@ -660,30 +700,6 @@ fn generate_literals_for_str(
Ok(Some(LogicalLiteral::Phrase(terms)))
}
enum NumValue {
U64(u64),
I64(i64),
F64(f64),
DateTime(OffsetDateTime),
}
fn infer_type_num(phrase: &str) -> Option<NumValue> {
if let Ok(dt) = OffsetDateTime::parse(phrase, &Rfc3339) {
let dt_utc = dt.to_offset(UtcOffset::UTC);
return Some(NumValue::DateTime(dt_utc));
}
if let Ok(u64_val) = str::parse::<u64>(phrase) {
return Some(NumValue::U64(u64_val));
}
if let Ok(i64_val) = str::parse::<i64>(phrase) {
return Some(NumValue::I64(i64_val));
}
if let Ok(f64_val) = str::parse::<f64>(phrase) {
return Some(NumValue::F64(f64_val));
}
None
}
fn generate_literals_for_json_object(
field_name: &str,
field: Field,
@@ -694,38 +710,13 @@ fn generate_literals_for_json_object(
) -> Result<Vec<LogicalLiteral>, QueryParserError> {
let mut logical_literals = Vec::new();
let mut term = Term::new();
term.set_field(Type::Json, field);
let mut json_term_writer = JsonTermWriter::wrap(&mut term);
for segment in json_path.split('.') {
json_term_writer.push_path_segment(segment);
let mut json_term_writer =
JsonTermWriter::from_field_and_json_path(field, json_path, &mut term);
if let Some(term) = convert_to_fast_value_and_get_term(&mut json_term_writer, phrase) {
logical_literals.push(LogicalLiteral::Term(term));
}
if let Some(num_value) = infer_type_num(phrase) {
match num_value {
NumValue::U64(u64_val) => {
json_term_writer.set_fast_value(u64_val);
}
NumValue::I64(i64_val) => {
json_term_writer.set_fast_value(i64_val);
}
NumValue::F64(f64_val) => {
json_term_writer.set_fast_value(f64_val);
}
NumValue::DateTime(dt_val) => {
json_term_writer.set_fast_value(DateTime::new_utc(dt_val));
}
}
logical_literals.push(LogicalLiteral::Term(json_term_writer.term().clone()));
}
json_term_writer.close_path_and_set_type(Type::Str);
let terms = set_string_and_get_terms(&mut json_term_writer, phrase, text_analyzer);
drop(json_term_writer);
let term_num_bytes = term.as_slice().len();
let mut token_stream = text_analyzer.token_stream(phrase);
let mut terms: Vec<(usize, Term)> = Vec::new();
token_stream.process(&mut |token| {
term.truncate(term_num_bytes);
term.append_bytes(token.text.as_bytes());
terms.push((token.position, term.clone()));
});
if terms.len() <= 1 {
for (_, term) in terms {
logical_literals.push(LogicalLiteral::Term(term));
@@ -1220,9 +1211,11 @@ mod test {
#[test]
pub fn test_query_parser_field_does_not_exist() {
let query_parser = make_query_parser();
assert_matches!(
query_parser.parse_query("boujou:\"18446744073709551615\""),
Err(QueryParserError::FieldDoesNotExist(_))
assert_eq!(
query_parser
.parse_query("boujou:\"18446744073709551615\"")
.unwrap_err(),
QueryParserError::FieldDoesNotExist("boujou".to_string())
);
}
@@ -1397,29 +1390,56 @@ mod test {
}
}
#[test]
fn test_escaped_field() {
let mut schema_builder = Schema::builder();
schema_builder.add_text_field(r#"a\.b"#, STRING);
let schema = schema_builder.build();
let query_parser = QueryParser::new(schema, Vec::new(), TokenizerManager::default());
let query = query_parser.parse_query(r#"a\.b:hello"#).unwrap();
assert_eq!(
format!("{:?}", query),
"TermQuery(Term(type=Str, field=0, \"hello\"))"
);
}
#[test]
fn test_split_full_path() {
let mut schema_builder = Schema::builder();
schema_builder.add_text_field("second", STRING);
schema_builder.add_text_field("first", STRING);
schema_builder.add_text_field("first.toto", STRING);
schema_builder.add_text_field("first.toto.titi", STRING);
schema_builder.add_text_field("third.a.b.c", STRING);
let schema = schema_builder.build();
let query_parser = QueryParser::new(schema, Vec::new(), TokenizerManager::default());
let query_parser =
QueryParser::new(schema.clone(), Vec::new(), TokenizerManager::default());
assert_eq!(
query_parser.split_full_path("first.toto"),
("first.toto", "")
Some((schema.get_field("first.toto").unwrap(), ""))
);
assert_eq!(
query_parser.split_full_path("first.toto.bubu"),
Some((schema.get_field("first.toto").unwrap(), "bubu"))
);
assert_eq!(
query_parser.split_full_path("first.toto.titi"),
Some((schema.get_field("first.toto.titi").unwrap(), ""))
);
assert_eq!(
query_parser.split_full_path("first.titi"),
("first", "titi")
Some((schema.get_field("first").unwrap(), "titi"))
);
assert_eq!(query_parser.split_full_path("third"), ("", "third"));
assert_eq!(
query_parser.split_full_path("hello.toto"),
("", "hello.toto")
);
assert_eq!(query_parser.split_full_path(""), ("", ""));
assert_eq!(query_parser.split_full_path("firsty"), ("", "firsty"));
assert_eq!(query_parser.split_full_path("third"), None);
assert_eq!(query_parser.split_full_path("hello.toto"), None);
assert_eq!(query_parser.split_full_path(""), None);
assert_eq!(query_parser.split_full_path("firsty"), None);
}
#[test]
fn test_locate_splitting_dots() {
assert_eq!(&super::locate_splitting_dots("a.b.c"), &[1, 3]);
assert_eq!(&super::locate_splitting_dots(r#"a\.b.c"#), &[4]);
assert_eq!(&super::locate_splitting_dots(r#"a\..b.c"#), &[3, 5]);
}
}
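
The splitting rule exercised by `test_split_full_path` and `test_locate_splitting_dots` can be sketched outside the parser. This is a minimal, std-only sketch rather than the crate's code: the `HashMap<&str, u32>` standing in for the schema's field-name table, the `u32` field ids, and the check of the whole path before any prefix are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Byte offsets of the dots that may split a full path.
/// A dot preceded by a backslash (`\.`) is treated as escaped and skipped.
fn locate_splitting_dots(full_path: &str) -> Vec<usize> {
    let mut positions = Vec::new();
    let mut escaped = false;
    for (pos, ch) in full_path.char_indices() {
        if escaped {
            // The previous character was a backslash: ignore this one.
            escaped = false;
            continue;
        }
        match ch {
            '\\' => escaped = true,
            '.' => positions.push(pos),
            _ => {}
        }
    }
    positions
}

/// Splits `full_path` into (field id, json path), preferring the longest
/// field name. `field_names` is a hypothetical stand-in for the schema.
fn split_full_path<'a>(
    field_names: &HashMap<&str, u32>,
    full_path: &'a str,
) -> Option<(u32, &'a str)> {
    // Assumption: the whole path is tried first, with an empty json path.
    if let Some(field) = field_names.get(full_path) {
        return Some((*field, ""));
    }
    let mut dots = locate_splitting_dots(full_path);
    // Popping yields the rightmost dot first, i.e. the longest prefix.
    while let Some(pos) = dots.pop() {
        let (prefix, suffix) = full_path.split_at(pos);
        if let Some(field) = field_names.get(prefix) {
            // Skip the dot itself.
            return Some((*field, &suffix[1..]));
        }
    }
    None
}
```

With fields `first` and `first.toto` registered, `first.toto.bubu` resolves to `first.toto` with path `bubu`, while `hello.toto` resolves to nothing and falls through to the default JSON fields.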

View File

@@ -2,7 +2,7 @@ use std::ops::{Deref, DerefMut};
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use crossbeam::channel::{unbounded, Receiver, RecvError, Sender};
use crossbeam_channel::{unbounded, Receiver, RecvError, Sender};
pub struct GenerationItem<T> {
generation: usize,
@@ -197,7 +197,7 @@ mod tests {
use std::{iter, mem};
use crossbeam::channel;
use crossbeam_channel as channel;
use super::{Pool, Queue};

View File

@@ -147,7 +147,7 @@ impl WarmingStateInner {
/// Every [GC_INTERVAL] attempt to GC, with panics caught and logged using
/// [std::panic::catch_unwind].
fn gc_loop(inner: Weak<Mutex<WarmingStateInner>>) {
for _ in crossbeam::channel::tick(GC_INTERVAL) {
for _ in crossbeam_channel::tick(GC_INTERVAL) {
if let Some(inner) = inner.upgrade() {
// rely on deterministic gc in tests
#[cfg(not(test))]

View File

@@ -213,6 +213,8 @@ impl BinarySerializable for Document {
#[cfg(test)]
mod tests {
use common::BinarySerializable;
use crate::schema::*;
#[test]
@@ -223,4 +225,22 @@ mod tests {
doc.add_text(text_field, "My title");
assert_eq!(doc.field_values().len(), 1);
}
#[test]
fn test_doc_serialization_issue() {
let mut doc = Document::default();
doc.add_json_object(
Field::from_field_id(0),
serde_json::json!({"key": 2u64})
.as_object()
.unwrap()
.clone(),
);
doc.add_text(Field::from_field_id(1), "hello");
assert_eq!(doc.field_values().len(), 2);
let mut payload: Vec<u8> = Vec::new();
doc.serialize(&mut payload).unwrap();
assert_eq!(payload.len(), 26);
Document::deserialize(&mut &payload[..]).unwrap();
}
}

View File

@@ -93,13 +93,7 @@ impl FieldEntry {
/// Returns true if the field is an int (signed or unsigned) fast field
pub fn is_fast(&self) -> bool {
match self.field_type {
FieldType::U64(ref options)
| FieldType::I64(ref options)
| FieldType::Date(ref options)
| FieldType::F64(ref options) => options.is_fast(),
_ => false,
}
self.field_type.is_fast()
}
/// Returns true if the field is stored
@@ -144,7 +138,8 @@ mod tests {
"fieldnorms": true,
"tokenizer": "default"
},
"stored": false
"stored": false,
"fast": false
}
}"#;
let field_value_json = serde_json::to_string_pretty(&field_value).unwrap();

View File

@@ -185,6 +185,20 @@ impl FieldType {
}
}
/// returns true if the field is fast.
pub fn is_fast(&self) -> bool {
match *self {
FieldType::Bytes(ref bytes_options) => bytes_options.is_fast(),
FieldType::Str(ref text_options) => text_options.is_fast(),
FieldType::U64(ref int_options)
| FieldType::I64(ref int_options)
| FieldType::F64(ref int_options)
| FieldType::Date(ref int_options) => int_options.get_fastfield_cardinality().is_some(),
FieldType::Facet(_) => true,
FieldType::JsonObject(_) => false,
}
}
/// returns true if the field is normed (see [fieldnorms](crate::fieldnorm)).
pub fn has_fieldnorms(&self) -> bool {
match *self {
@@ -254,7 +268,7 @@ impl FieldType {
expected: "rfc3339 format",
json: JsonValue::String(field_text),
})?;
Ok(DateTime::new_utc(dt_with_fixed_tz).into())
Ok(DateTime::from_utc(dt_with_fixed_tz).into())
}
FieldType::Str(_) => Ok(Value::Str(field_text)),
FieldType::U64(_) | FieldType::I64(_) | FieldType::F64(_) => {
@@ -374,7 +388,7 @@ mod tests {
let naive_date = Date::from_calendar_date(1982, Month::September, 17).unwrap();
let naive_time = Time::from_hms(13, 20, 0).unwrap();
let date_time = PrimitiveDateTime::new(naive_date, naive_time);
doc.add_date(date_field, DateTime::new_primitive(date_time));
doc.add_date(date_field, DateTime::from_primitive(date_time));
let doc_json = schema.to_json(&doc);
assert_eq!(doc_json, r#"{"date":["1982-09-17T13:20:00Z"]}"#);
}

View File

@@ -417,6 +417,7 @@ mod tests {
use std::collections::BTreeMap;
use matches::{assert_matches, matches};
use pretty_assertions::assert_eq;
use serde_json;
use crate::schema::field_type::ValueParsingError;
@@ -469,7 +470,8 @@ mod tests {
"fieldnorms": true,
"tokenizer": "default"
},
"stored": false
"stored": false,
"fast": false
}
},
{
@@ -481,7 +483,8 @@ mod tests {
"fieldnorms": false,
"tokenizer": "raw"
},
"stored": false
"stored": false,
"fast": false
}
},
{
@@ -784,7 +787,8 @@ mod tests {
"fieldnorms": true,
"tokenizer": "default"
},
"stored": false
"stored": false,
"fast": false
}
},
{
@@ -816,7 +820,8 @@ mod tests {
"fieldnorms": true,
"tokenizer": "raw"
},
"stored": true
"stored": true,
"fast": false
}
},
{
@@ -838,7 +843,8 @@ mod tests {
"fieldnorms": true,
"tokenizer": "default"
},
"stored": false
"stored": false,
"fast": false
}
},
{

View File

@@ -3,6 +3,7 @@ use std::ops::BitOr;
use serde::{Deserialize, Serialize};
use super::flags::FastFlag;
use crate::schema::flags::{SchemaFlagList, StoredFlag};
use crate::schema::IndexRecordOption;
@@ -14,6 +15,8 @@ pub struct TextOptions {
indexing: Option<TextFieldIndexing>,
#[serde(default)]
stored: bool,
#[serde(default)]
fast: bool,
}
impl TextOptions {
@@ -27,6 +30,30 @@ impl TextOptions {
self.stored
}
/// Returns true iff the value is a fast field.
pub fn is_fast(&self) -> bool {
self.fast
}
/// Set the field as a fast field.
///
/// Fast fields are designed for random access.
/// Access times are similar to a random lookup in an array.
/// Text fast fields will have the term ids stored in the fast field.
/// The fast field will be a multivalued fast field.
///
/// The effective cardinality depends on the tokenizer. When creating fast fields on text
/// fields it is recommended to use the "raw" tokenizer, since it will store the original text
/// unchanged. The "default" tokenizer will store the terms as lower case and this will be
/// reflected in the dictionary.
///
/// The original text can be retrieved via `ord_to_term` from the dictionary.
#[must_use]
pub fn set_fast(mut self) -> TextOptions {
self.fast = true;
self
}
/// Sets the field as stored
#[must_use]
pub fn set_stored(mut self) -> TextOptions {
@@ -45,9 +72,13 @@ impl TextOptions {
#[derive(Clone, PartialEq, Debug, Eq, Serialize, Deserialize)]
struct TokenizerName(Cow<'static, str>);
const DEFAULT_TOKENIZER_NAME: &str = "default";
const NO_TOKENIZER_NAME: &str = "raw";
impl Default for TokenizerName {
fn default() -> Self {
TokenizerName::from_static("default")
TokenizerName::from_static(DEFAULT_TOKENIZER_NAME)
}
}
@@ -141,21 +172,23 @@ impl TextFieldIndexing {
/// The field will be untokenized and indexed.
pub const STRING: TextOptions = TextOptions {
indexing: Some(TextFieldIndexing {
tokenizer: TokenizerName::from_static("raw"),
tokenizer: TokenizerName::from_static(NO_TOKENIZER_NAME),
fieldnorms: true,
record: IndexRecordOption::Basic,
}),
stored: false,
fast: false,
};
/// The field will be tokenized and indexed.
pub const TEXT: TextOptions = TextOptions {
indexing: Some(TextFieldIndexing {
tokenizer: TokenizerName::from_static("default"),
tokenizer: TokenizerName::from_static(DEFAULT_TOKENIZER_NAME),
fieldnorms: true,
record: IndexRecordOption::WithFreqsAndPositions,
}),
stored: false,
fast: false,
};
impl<T: Into<TextOptions>> BitOr<T> for TextOptions {
@@ -166,6 +199,7 @@ impl<T: Into<TextOptions>> BitOr<T> for TextOptions {
TextOptions {
indexing: self.indexing.or(other.indexing),
stored: self.stored | other.stored,
fast: self.fast | other.fast,
}
}
}
@@ -181,6 +215,17 @@ impl From<StoredFlag> for TextOptions {
TextOptions {
indexing: None,
stored: true,
fast: false,
}
}
}
impl From<FastFlag> for TextOptions {
fn from(_: FastFlag) -> TextOptions {
TextOptions {
indexing: None,
stored: false,
fast: true,
}
}
}

View File

@@ -43,7 +43,7 @@ impl Serialize for Value {
Value::U64(u) => serializer.serialize_u64(u),
Value::I64(u) => serializer.serialize_i64(u),
Value::F64(u) => serializer.serialize_f64(u),
Value::Date(ref date) => time::serde::rfc3339::serialize(&date.to_utc(), serializer),
Value::Date(ref date) => time::serde::rfc3339::serialize(&date.into_utc(), serializer),
Value::Facet(ref facet) => facet.serialize(serializer),
Value::Bytes(ref bytes) => serializer.serialize_bytes(bytes),
Value::JsonObject(ref obj) => obj.serialize(serializer),
@@ -388,8 +388,16 @@ mod binary_serialize {
}
}
JSON_OBJ_CODE => {
let map = serde_json::from_reader(reader)?;
Ok(Value::JsonObject(map))
// As explained in
// https://docs.serde.rs/serde_json/fn.from_reader.html
//
// `T::from_reader(..)` expects EOF after reading the object,
// which is not what we want here.
//
// For this reason we need to create our own `Deserializer`.
let mut de = serde_json::Deserializer::from_reader(reader);
let json_map = <serde_json::Map::<String, serde_json::Value> as serde::Deserialize>::deserialize(&mut de)?;
Ok(Value::JsonObject(json_map))
}
_ => Err(io::Error::new(
io::ErrorKind::InvalidData,
@@ -409,12 +417,12 @@ mod tests {
#[test]
fn test_serialize_date() {
let value = Value::from(DateTime::new_utc(
let value = Value::from(DateTime::from_utc(
OffsetDateTime::parse("1996-12-20T00:39:57+00:00", &Rfc3339).unwrap(),
));
let serialized_value_json = serde_json::to_string_pretty(&value).unwrap();
assert_eq!(serialized_value_json, r#""1996-12-20T00:39:57Z""#);
let value = Value::from(DateTime::new_utc(
let value = Value::from(DateTime::from_utc(
OffsetDateTime::parse("1996-12-20T00:39:57-01:00", &Rfc3339).unwrap(),
));
let serialized_value_json = serde_json::to_string_pretty(&value).unwrap();

View File

@@ -0,0 +1,50 @@
use std::io;
use zstd::bulk::{compress_to_buffer, decompress_to_buffer};
use zstd::DEFAULT_COMPRESSION_LEVEL;
#[inline]
pub fn compress(uncompressed: &[u8], compressed: &mut Vec<u8>) -> io::Result<()> {
let count_size = std::mem::size_of::<u32>();
let max_size = zstd::zstd_safe::compress_bound(uncompressed.len()) + count_size;
compressed.clear();
compressed.resize(max_size, 0);
let compressed_size = compress_to_buffer(
uncompressed,
&mut compressed[count_size..],
DEFAULT_COMPRESSION_LEVEL,
)?;
compressed[0..count_size].copy_from_slice(&(uncompressed.len() as u32).to_le_bytes());
compressed.resize(compressed_size + count_size, 0);
Ok(())
}
#[inline]
pub fn decompress(compressed: &[u8], decompressed: &mut Vec<u8>) -> io::Result<()> {
let count_size = std::mem::size_of::<u32>();
let uncompressed_size = u32::from_le_bytes(
compressed
.get(..count_size)
.ok_or(io::ErrorKind::InvalidData)?
.try_into()
.unwrap(),
) as usize;
decompressed.clear();
decompressed.resize(uncompressed_size, 0);
let decompressed_size = decompress_to_buffer(&compressed[count_size..], decompressed)?;
if decompressed_size != uncompressed_size {
return Err(io::Error::new(
io::ErrorKind::InvalidData,
"doc store block not completely decompressed, data corruption".to_string(),
));
}
Ok(())
}
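
The block layout used above (a little-endian `u32` holding the uncompressed length, followed by the payload) can be tried without the `zstd` crate. In this sketch the payload is copied verbatim instead of being compressed, so it only illustrates the framing and the length check; `frame_block` and `unframe_block` are hypothetical names.

```rust
use std::io;

/// Number of bytes used by the little-endian length prefix.
const COUNT_SIZE: usize = std::mem::size_of::<u32>();

/// Writes `[len as u32 LE][payload]` into `framed`.
/// The payload is stored as-is here, as a stand-in for compressed bytes.
pub fn frame_block(uncompressed: &[u8], framed: &mut Vec<u8>) {
    framed.clear();
    framed.extend_from_slice(&(uncompressed.len() as u32).to_le_bytes());
    framed.extend_from_slice(uncompressed);
}

/// Reads the length prefix back and checks it against the payload.
pub fn unframe_block(framed: &[u8], out: &mut Vec<u8>) -> io::Result<()> {
    // A block shorter than the prefix is reported as invalid data.
    let header = framed.get(..COUNT_SIZE).ok_or(io::ErrorKind::InvalidData)?;
    let mut len_bytes = [0u8; COUNT_SIZE];
    len_bytes.copy_from_slice(header);
    let expected_len = u32::from_le_bytes(len_bytes) as usize;

    let payload = &framed[COUNT_SIZE..];
    if payload.len() != expected_len {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "block length does not match its header",
        ));
    }
    out.clear();
    out.extend_from_slice(payload);
    Ok(())
}
```

Storing the uncompressed length up front is what lets the real `decompress` size its output buffer before calling into the codec.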

View File

@@ -26,6 +26,9 @@ pub enum Compressor {
#[serde(rename = "snappy")]
/// Use the snap compressor
Snappy,
#[serde(rename = "zstd")]
/// Use the zstd compressor
Zstd,
}
impl Default for Compressor {
@@ -36,6 +39,8 @@ impl Default for Compressor {
Compressor::Brotli
} else if cfg!(feature = "snappy-compression") {
Compressor::Snappy
} else if cfg!(feature = "zstd-compression") {
Compressor::Zstd
} else {
Compressor::None
}
@@ -49,6 +54,7 @@ impl Compressor {
1 => Compressor::Lz4,
2 => Compressor::Brotli,
3 => Compressor::Snappy,
4 => Compressor::Zstd,
_ => panic!("unknown compressor id {:?}", id),
}
}
@@ -58,6 +64,7 @@ impl Compressor {
Self::Lz4 => 1,
Self::Brotli => 2,
Self::Snappy => 3,
Self::Zstd => 4,
}
}
#[inline]
@@ -98,6 +105,16 @@ impl Compressor {
panic!("snappy-compression feature flag not activated");
}
}
Self::Zstd => {
#[cfg(feature = "zstd-compression")]
{
super::compression_zstd_block::compress(uncompressed, compressed)
}
#[cfg(not(feature = "zstd-compression"))]
{
panic!("zstd-compression feature flag not activated");
}
}
}
}
@@ -143,6 +160,16 @@ impl Compressor {
panic!("snappy-compression feature flag not activated");
}
}
Self::Zstd => {
#[cfg(feature = "zstd-compression")]
{
super::compression_zstd_block::decompress(compressed, decompressed)
}
#[cfg(not(feature = "zstd-compression"))]
{
panic!("zstd-compression feature flag not activated");
}
}
}
}
}
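
The numeric ids added for the new variant must round-trip between the two `match` blocks. A small sketch of that invariant follows; `Codec` is a hypothetical enum, the id `0` for the uncompressed variant is an assumption (that arm is not visible in this diff), and returning `Option` instead of panicking is a deliberate deviation from the code above.

```rust
/// Hypothetical stand-in for the compressor enum.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Codec {
    Uncompressed,
    Lz4,
    Brotli,
    Snappy,
    Zstd,
}

impl Codec {
    /// Maps an on-disk id to a codec; unknown ids yield `None`.
    pub fn from_id(id: u8) -> Option<Codec> {
        match id {
            0 => Some(Codec::Uncompressed), // assumed id, not shown in the diff
            1 => Some(Codec::Lz4),
            2 => Some(Codec::Brotli),
            3 => Some(Codec::Snappy),
            4 => Some(Codec::Zstd),
            _ => None,
        }
    }

    /// Maps a codec back to its on-disk id.
    pub fn get_id(self) -> u8 {
        match self {
            Codec::Uncompressed => 0,
            Codec::Lz4 => 1,
            Codec::Brotli => 2,
            Codec::Snappy => 3,
            Codec::Zstd => 4,
        }
    }
}
```

A loop over all known ids asserting `from_id(id).map(Codec::get_id) == Some(id)` is a cheap guard against the two tables drifting apart when another variant is added.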

View File

@@ -50,6 +50,9 @@ mod compression_brotli;
#[cfg(feature = "snappy-compression")]
mod compression_snap;
#[cfg(feature = "zstd-compression")]
mod compression_zstd_block;
#[cfg(test)]
pub mod tests {
@@ -69,10 +72,13 @@ pub mod tests {
sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt \
mollit anim id est laborum.";
const BLOCK_SIZE: usize = 16_384;
pub fn write_lorem_ipsum_store(
writer: WritePtr,
num_docs: usize,
compressor: Compressor,
blocksize: usize,
) -> Schema {
let mut schema_builder = Schema::builder();
let field_body = schema_builder.add_text_field("body", TextOptions::default().set_stored());
@@ -80,7 +86,7 @@ pub mod tests {
schema_builder.add_text_field("title", TextOptions::default().set_stored());
let schema = schema_builder.build();
{
let mut store_writer = StoreWriter::new(writer, compressor);
let mut store_writer = StoreWriter::new(writer, compressor, blocksize);
for i in 0..num_docs {
let mut doc = Document::default();
doc.add_field_value(field_body, LOREM.to_string());
@@ -103,7 +109,7 @@ pub mod tests {
let path = Path::new("store");
let directory = RamDirectory::create();
let store_wrt = directory.open_write(path)?;
let schema = write_lorem_ipsum_store(store_wrt, NUM_DOCS, Compressor::Lz4);
let schema = write_lorem_ipsum_store(store_wrt, NUM_DOCS, Compressor::Lz4, BLOCK_SIZE);
let field_title = schema.get_field("title").unwrap();
let store_file = directory.open_read(path)?;
let store = StoreReader::open(store_file)?;
@@ -139,11 +145,11 @@ pub mod tests {
Ok(())
}
fn test_store(compressor: Compressor) -> crate::Result<()> {
fn test_store(compressor: Compressor, blocksize: usize) -> crate::Result<()> {
let path = Path::new("store");
let directory = RamDirectory::create();
let store_wrt = directory.open_write(path)?;
let schema = write_lorem_ipsum_store(store_wrt, NUM_DOCS, compressor);
let schema = write_lorem_ipsum_store(store_wrt, NUM_DOCS, compressor, blocksize);
let field_title = schema.get_field("title").unwrap();
let store_file = directory.open_read(path)?;
let store = StoreReader::open(store_file)?;
@@ -169,22 +175,28 @@ pub mod tests {
#[test]
fn test_store_noop() -> crate::Result<()> {
test_store(Compressor::None)
test_store(Compressor::None, BLOCK_SIZE)
}
#[cfg(feature = "lz4-compression")]
#[test]
fn test_store_lz4_block() -> crate::Result<()> {
test_store(Compressor::Lz4)
test_store(Compressor::Lz4, BLOCK_SIZE)
}
#[cfg(feature = "snappy-compression")]
#[test]
fn test_store_snap() -> crate::Result<()> {
test_store(Compressor::Snappy)
test_store(Compressor::Snappy, BLOCK_SIZE)
}
#[cfg(feature = "brotli-compression")]
#[test]
fn test_store_brotli() -> crate::Result<()> {
test_store(Compressor::Brotli)
test_store(Compressor::Brotli, BLOCK_SIZE)
}
#[cfg(feature = "zstd-compression")]
#[test]
fn test_store_zstd() -> crate::Result<()> {
test_store(Compressor::Zstd, BLOCK_SIZE)
}
#[test]
@@ -348,6 +360,7 @@ mod bench {
directory.open_write(path).unwrap(),
1_000,
Compressor::default(),
16_384,
);
directory.delete(path).unwrap();
});
@@ -361,6 +374,7 @@ mod bench {
directory.open_write(path).unwrap(),
1_000,
Compressor::default(),
16_384,
);
let store_file = directory.open_read(path).unwrap();
let store = StoreReader::open(store_file).unwrap();

View File

@@ -304,6 +304,8 @@ mod tests {
use crate::store::tests::write_lorem_ipsum_store;
use crate::Directory;
const BLOCK_SIZE: usize = 16_384;
fn get_text_field<'a>(doc: &'a Document, field: &'a Field) -> Option<&'a str> {
doc.get_first(*field).and_then(|f| f.as_text())
}
@@ -313,7 +315,7 @@ mod tests {
let directory = RamDirectory::create();
let path = Path::new("store");
let writer = directory.open_write(path)?;
let schema = write_lorem_ipsum_store(writer, 500, Compressor::default());
let schema = write_lorem_ipsum_store(writer, 500, Compressor::default(), BLOCK_SIZE);
let title = schema.get_field("title").unwrap();
let store_file = directory.open_read(path)?;
let store = StoreReader::open(store_file)?;

View File

@@ -11,8 +11,6 @@ use crate::schema::Document;
use crate::store::index::Checkpoint;
use crate::DocId;
const BLOCK_SIZE: usize = 16_384;
/// Write tantivy's [`Store`](./index.html)
///
/// Contrary to the other components of `tantivy`,
@@ -22,6 +20,7 @@ const BLOCK_SIZE: usize = 16_384;
/// The skip list index on the other hand, is built in memory.
pub struct StoreWriter {
compressor: Compressor,
block_size: usize,
doc: DocId,
first_doc_in_block: DocId,
offset_index_writer: SkipIndexBuilder,
@@ -35,9 +34,10 @@ impl StoreWriter {
///
/// The store writer writes blocks to disk as
/// documents are added.
pub fn new(writer: WritePtr, compressor: Compressor) -> StoreWriter {
pub fn new(writer: WritePtr, compressor: Compressor, block_size: usize) -> StoreWriter {
StoreWriter {
compressor,
block_size,
doc: 0,
first_doc_in_block: 0,
offset_index_writer: SkipIndexBuilder::new(),
@@ -65,7 +65,7 @@ impl StoreWriter {
VInt(doc_num_bytes as u64).serialize(&mut self.current_block)?;
self.current_block.write_all(serialized_document)?;
self.doc += 1;
if self.current_block.len() > BLOCK_SIZE {
if self.current_block.len() > self.block_size {
self.write_and_compress_block()?;
}
Ok(())
@@ -86,7 +86,7 @@ impl StoreWriter {
self.current_block
.write_all(&self.intermediary_buffer[..])?;
self.doc += 1;
if self.current_block.len() > BLOCK_SIZE {
if self.current_block.len() > self.block_size {
self.write_and_compress_block()?;
}
Ok(())
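
The effect of turning `BLOCK_SIZE` into a per-writer `block_size` can be seen with a toy buffer. This is a sketch under stated assumptions: `BlockBuffer` is hypothetical, documents are raw byte slices, and "flushing" just moves the block into a `Vec` instead of compressing and writing it.

```rust
/// Hypothetical buffer that closes a block once it exceeds `block_size`.
pub struct BlockBuffer {
    block_size: usize,
    current_block: Vec<u8>,
    flushed_blocks: Vec<Vec<u8>>,
}

impl BlockBuffer {
    pub fn new(block_size: usize) -> BlockBuffer {
        BlockBuffer {
            block_size,
            current_block: Vec::new(),
            flushed_blocks: Vec::new(),
        }
    }

    /// Appends a document; the block is flushed only once its length is
    /// strictly greater than the threshold, mirroring the `>` check above.
    pub fn store(&mut self, doc: &[u8]) {
        self.current_block.extend_from_slice(doc);
        if self.current_block.len() > self.block_size {
            self.flush();
        }
    }

    /// Moves the pending block, if any, to the list of flushed blocks.
    pub fn flush(&mut self) {
        if !self.current_block.is_empty() {
            let block = std::mem::replace(&mut self.current_block, Vec::new());
            self.flushed_blocks.push(block);
        }
    }

    pub fn num_flushed_blocks(&self) -> usize {
        self.flushed_blocks.len()
    }
}
```

Because the check happens after the append, a block can exceed `block_size` by up to one document; the threshold is a flush trigger, not a hard cap.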

View File

@@ -28,7 +28,6 @@ use fst_termdict as termdict;
mod sstable_termdict;
#[cfg(feature = "quickwit")]
use sstable_termdict as termdict;
use tantivy_fst::automaton::AlwaysMatch;
#[cfg(test)]
mod tests;
@@ -36,24 +35,4 @@ mod tests;
/// Position of the term in the sorted list of terms.
pub type TermOrdinal = u64;
/// The term dictionary contains all of the terms in
/// `tantivy index` in a sorted manner.
pub type TermDictionary = self::termdict::TermDictionary;
/// Builder for the new term dictionary.
///
/// Inserting must be done in the order of the `keys`.
pub type TermDictionaryBuilder<W> = self::termdict::TermDictionaryBuilder<W>;
/// Given a list of sorted term streams,
/// returns an iterator over sorted unique terms.
///
/// The item yield is actually a pair with
/// - the term
/// - a slice with the ordinal of the segments containing
/// the terms.
pub type TermMerger<'a> = self::termdict::TermMerger<'a>;
/// `TermStreamer` acts as a cursor over a range of terms of a segment.
/// Terms are guaranteed to be sorted.
pub type TermStreamer<'a, A = AlwaysMatch> = self::termdict::TermStreamer<'a, A>;
pub use self::termdict::{TermDictionary, TermDictionaryBuilder, TermMerger, TermStreamer};

View File

@@ -145,6 +145,12 @@ where
}
pub fn write_key(&mut self, key: &[u8]) {
// If this is the first key in the block, we use it to
// shorten the last term in the last block.
if self.first_ordinal_of_the_block == self.num_terms {
self.index_builder
.shorten_last_block_key_given_next_key(key);
}
let keep_len = common_prefix_len(&self.previous_key, key);
let add_len = key.len() - keep_len;
let increasing_keys = add_len > 0 && (self.previous_key.len() == keep_len)
@@ -273,11 +279,12 @@ mod test {
33u8, 18u8, 19u8, // keep 1 push 1 | 20
17u8, 20u8, 0u8, 0u8, 0u8, 0u8, // no more blocks
// index
161, 102, 98, 108, 111, 99, 107, 115, 129, 162, 104, 108, 97, 115, 116, 95, 107,
101, 121, 130, 17, 20, 106, 98, 108, 111, 99, 107, 95, 97, 100, 100, 114, 162, 106,
98, 121, 116, 101, 95, 114, 97, 110, 103, 101, 162, 101, 115, 116, 97, 114, 116, 0,
99, 101, 110, 100, 11, 109, 102, 105, 114, 115, 116, 95, 111, 114, 100, 105, 110,
97, 108, 0, 15, 0, 0, 0, 0, 0, 0, 0, // offset for the index
161, 102, 98, 108, 111, 99, 107, 115, 129, 162, 115, 108, 97, 115, 116, 95, 107,
101, 121, 95, 111, 114, 95, 103, 114, 101, 97, 116, 101, 114, 130, 17, 20, 106, 98,
108, 111, 99, 107, 95, 97, 100, 100, 114, 162, 106, 98, 121, 116, 101, 95, 114, 97,
110, 103, 101, 162, 101, 115, 116, 97, 114, 116, 0, 99, 101, 110, 100, 11, 109,
102, 105, 114, 115, 116, 95, 111, 114, 100, 105, 110, 97, 108, 0, 15, 0, 0, 0, 0,
0, 0, 0, // offset for the index
3u8, 0u8, 0u8, 0u8, 0u8, 0u8, 0u8, 0u8 // num terms
]
);

View File

@@ -4,6 +4,7 @@ use std::ops::Range;
use serde::{Deserialize, Serialize};
use crate::error::DataCorruption;
use crate::termdict::sstable_termdict::sstable::common_prefix_len;
#[derive(Default, Debug, Serialize, Deserialize)]
pub struct SSTableIndex {
@@ -19,7 +20,7 @@ impl SSTableIndex {
pub fn search(&self, key: &[u8]) -> Option<BlockAddr> {
self.blocks
.iter()
.find(|block| &block.last_key[..] >= key)
.find(|block| &block.last_key_or_greater[..] >= key)
.map(|block| block.block_addr.clone())
}
}
@@ -32,7 +33,10 @@ pub struct BlockAddr {
#[derive(Debug, Serialize, Deserialize)]
struct BlockMeta {
pub last_key: Vec<u8>,
/// Any byte string that is lexicographically greater or equal to
/// the last key in the block,
/// and yet strictly smaller than the first key in the next block.
pub last_key_or_greater: Vec<u8>,
pub block_addr: BlockAddr,
}
@@ -41,10 +45,39 @@ pub struct SSTableIndexBuilder {
index: SSTableIndex,
}
/// Given that `left < right`,
/// mutates `left` into a shorter byte string `left'` that
/// matches `left <= left' < right`.
fn find_shorter_str_in_between(left: &mut Vec<u8>, right: &[u8]) {
assert!(&left[..] < right);
let common_len = common_prefix_len(&left, right);
if left.len() == common_len {
return;
}
// It is possible to do one character shorter in some cases,
// but it is not worth the extra complexity
for pos in (common_len + 1)..left.len() {
if left[pos] != u8::MAX {
left[pos] += 1;
left.truncate(pos + 1);
return;
}
}
}
impl SSTableIndexBuilder {
/// In order to make the index as light as possible, we
/// try to find a shorter alternative to the last key of the last block
/// that is still smaller than the next key.
pub(crate) fn shorten_last_block_key_given_next_key(&mut self, next_key: &[u8]) {
if let Some(last_block) = self.index.blocks.last_mut() {
find_shorter_str_in_between(&mut last_block.last_key_or_greater, next_key);
}
}
pub fn add_block(&mut self, last_key: &[u8], byte_range: Range<usize>, first_ordinal: u64) {
self.index.blocks.push(BlockMeta {
last_key: last_key.to_vec(),
last_key_or_greater: last_key.to_vec(),
block_addr: BlockAddr {
byte_range,
first_ordinal,
@@ -97,4 +130,35 @@ mod tests {
"Data corruption: SSTable index is corrupted."
);
}
#[track_caller]
fn test_find_shorter_str_in_between_aux(left: &[u8], right: &[u8]) {
let mut left_buf = left.to_vec();
super::find_shorter_str_in_between(&mut left_buf, right);
assert!(left_buf.len() <= left.len());
assert!(left <= &left_buf);
assert!(&left_buf[..] < &right);
}
#[test]
fn test_find_shorter_str_in_between() {
test_find_shorter_str_in_between_aux(b"", b"hello");
test_find_shorter_str_in_between_aux(b"abc", b"abcd");
test_find_shorter_str_in_between_aux(b"abcd", b"abd");
test_find_shorter_str_in_between_aux(&[0, 0, 0], &[1]);
test_find_shorter_str_in_between_aux(&[0, 0, 0], &[0, 0, 1]);
test_find_shorter_str_in_between_aux(&[0, 0, 255, 255, 255, 0u8], &[0, 1]);
}
use proptest::prelude::*;
proptest! {
#![proptest_config(ProptestConfig::with_cases(100))]
#[test]
fn test_proptest_find_shorter_str(left in any::<Vec<u8>>(), right in any::<Vec<u8>>()) {
if left < right {
test_find_shorter_str_in_between_aux(&left, &right);
}
}
}
}
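
To see what the key shortening does on concrete input, the function can be run standalone. The body of `find_shorter_str_in_between` below follows the diff; `common_prefix_len` is re-implemented here as an assumption, since its definition is not part of this excerpt.

```rust
/// Length of the longest common prefix of two byte strings
/// (assumed behavior of the helper imported in the diff).
fn common_prefix_len(left: &[u8], right: &[u8]) -> usize {
    left.iter()
        .zip(right.iter())
        .take_while(|(l, r)| l == r)
        .count()
}

/// Given `left < right`, shortens `left` in place into some `left'`
/// with `left <= left' < right`.
fn find_shorter_str_in_between(left: &mut Vec<u8>, right: &[u8]) {
    assert!(&left[..] < right);
    let common_len = common_prefix_len(&left[..], right);
    if left.len() == common_len {
        // `left` is a prefix of `right`: nothing shorter to pick.
        return;
    }
    // `left[common_len] < right[common_len]` holds, so bumping any later
    // byte and truncating after it keeps the result below `right`.
    for pos in (common_len + 1)..left.len() {
        if left[pos] != u8::MAX {
            left[pos] += 1;
            left.truncate(pos + 1);
            return;
        }
    }
}
```

For instance, with the last key `apple_pie_recipe` and the next block starting at `banana`, the index entry shrinks to `aq`, which still separates the two blocks.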

View File

@@ -25,6 +25,13 @@ pub struct TokenizerManager {
}
impl TokenizerManager {
/// Creates an empty tokenizer manager.
pub fn new() -> Self {
Self {
tokenizers: Arc::new(RwLock::new(HashMap::new())),
}
}
/// Registers a new tokenizer associated with a given name.
pub fn register<T>(&self, tokenizer_name: &str, tokenizer: T)
where TextAnalyzer: From<T> {
@@ -52,9 +59,7 @@ impl Default for TokenizerManager {
/// - en_stem
/// - ja
fn default() -> TokenizerManager {
let manager = TokenizerManager {
tokenizers: Arc::new(RwLock::new(HashMap::new())),
};
let manager = TokenizerManager::new();
manager.register("raw", RawTokenizer);
manager.register(
"default",