mirror of
https://github.com/quickwit-oss/tantivy.git
synced 2025-12-28 04:52:55 +00:00
Compare commits
1 Commits
0.24
...
test_order
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
e2dae2f433 |
4
.github/workflows/coverage.yml
vendored
4
.github/workflows/coverage.yml
vendored
@@ -15,11 +15,11 @@ jobs:
|
||||
steps:
|
||||
- uses: actions/checkout@v4
|
||||
- name: Install Rust
|
||||
run: rustup toolchain install nightly-2024-07-01 --profile minimal --component llvm-tools-preview
|
||||
run: rustup toolchain install nightly-2024-04-10 --profile minimal --component llvm-tools-preview
|
||||
- uses: Swatinem/rust-cache@v2
|
||||
- uses: taiki-e/install-action@cargo-llvm-cov
|
||||
- name: Generate code coverage
|
||||
run: cargo +nightly-2024-07-01 llvm-cov --all-features --workspace --doctests --lcov --output-path lcov.info
|
||||
run: cargo +nightly-2024-04-10 llvm-cov --all-features --workspace --doctests --lcov --output-path lcov.info
|
||||
- name: Upload coverage to Codecov
|
||||
uses: codecov/codecov-action@v3
|
||||
continue-on-error: true
|
||||
|
||||
@@ -46,7 +46,7 @@ The file of a segment has the format
|
||||
|
||||
```segment-id . ext```
|
||||
|
||||
The extension signals which data structure (or [`SegmentComponent`](src/index/segment_component.rs)) is stored in the file.
|
||||
The extension signals which data structure (or [`SegmentComponent`](src/core/segment_component.rs)) is stored in the file.
|
||||
|
||||
A small `meta.json` file is in charge of keeping track of the list of segments, as well as the schema.
|
||||
|
||||
@@ -102,7 +102,7 @@ but users can extend tantivy with their own implementation.
|
||||
|
||||
Tantivy's document follows a very strict schema, decided before building any index.
|
||||
|
||||
The schema defines all of the fields that the indexes [`Document`](src/schema/document/mod.rs) may and should contain, their types (`text`, `i64`, `u64`, `Date`, ...) as well as how it should be indexed / represented in tantivy.
|
||||
The schema defines all of the fields that the indexes [`Document`](src/schema/document.rs) may and should contain, their types (`text`, `i64`, `u64`, `Date`, ...) as well as how it should be indexed / represented in tantivy.
|
||||
|
||||
Depending on the type of the field, you can decide to
|
||||
|
||||
|
||||
102
CHANGELOG.md
102
CHANGELOG.md
@@ -1,85 +1,3 @@
|
||||
Tantivy 0.23 - Unreleased
|
||||
================================
|
||||
Tantivy 0.23 will be backwards compatible with indices created with v0.22 and v0.21. The new minimum rust version will be 1.75.
|
||||
|
||||
#### Bugfixes
|
||||
- fix potential endless loop in merge [#2457](https://github.com/quickwit-oss/tantivy/pull/2457)(@PSeitz)
|
||||
- fix bug that causes out-of-order sstable key. [#2445](https://github.com/quickwit-oss/tantivy/pull/2445)(@fulmicoton)
|
||||
- fix ReferenceValue API flaw [#2372](https://github.com/quickwit-oss/tantivy/pull/2372)(@PSeitz)
|
||||
- fix `OwnedBytes` debug panic [#2512](https://github.com/quickwit-oss/tantivy/pull/2512)(@b41sh)
|
||||
- catch panics during merges [#2582](https://github.com/quickwit-oss/tantivy/pull/2582)(@rdettai)
|
||||
- switch from u32 to usize in bitpacker. This enables multivalued columns larger than 4GB, which crashed during merge before. [#2581](https://github.com/quickwit-oss/tantivy/pull/2581) [#2586](https://github.com/quickwit-oss/tantivy/pull/2586)(@fulmicoton-dd @PSeitz)
|
||||
|
||||
#### Breaking API Changes
|
||||
- remove index sorting [#2434](https://github.com/quickwit-oss/tantivy/pull/2434)(@PSeitz)
|
||||
|
||||
#### Features/Improvements
|
||||
- **Aggregation**
|
||||
- Support for cardinality aggregation [#2337](https://github.com/quickwit-oss/tantivy/pull/2337) [#2446](https://github.com/quickwit-oss/tantivy/pull/2446) (@raphaelcoeffic @PSeitz)
|
||||
- Support for extended stats aggregation [#2247](https://github.com/quickwit-oss/tantivy/pull/2247)(@giovannicuccu)
|
||||
- Add Key::I64 and Key::U64 variants in aggregation to avoid f64 precision issues [#2468](https://github.com/quickwit-oss/tantivy/pull/2468)(@PSeitz)
|
||||
- Faster term aggregation fetch terms [#2447](https://github.com/quickwit-oss/tantivy/pull/2447)(@PSeitz)
|
||||
- Improve custom order deserialization [#2451](https://github.com/quickwit-oss/tantivy/pull/2451)(@PSeitz)
|
||||
- Change AggregationLimits behavior [#2495](https://github.com/quickwit-oss/tantivy/pull/2495)(@PSeitz)
|
||||
- lower contention on AggregationLimits [#2394](https://github.com/quickwit-oss/tantivy/pull/2394)(@PSeitz)
|
||||
- fix postcard compatibility for top_hits, add postcard test [#2346](https://github.com/quickwit-oss/tantivy/pull/2346)(@PSeitz)
|
||||
- reduce top hits memory consumption [#2426](https://github.com/quickwit-oss/tantivy/pull/2426)(@PSeitz)
|
||||
- check unsupported parameters top_hits [#2351](https://github.com/quickwit-oss/tantivy/pull/2351)(@PSeitz)
|
||||
- Change AggregationLimits to AggregationLimitsGuard [#2495](https://github.com/quickwit-oss/tantivy/pull/2495)(@PSeitz)
|
||||
- add support for counting non integer in aggregation [#2547](https://github.com/quickwit-oss/tantivy/pull/2547)(@trinity-1686a)
|
||||
- **Range Queries**
|
||||
- Support fast field range queries on json fields [#2456](https://github.com/quickwit-oss/tantivy/pull/2456)(@PSeitz)
|
||||
- Add support for str fast field range query [#2460](https://github.com/quickwit-oss/tantivy/pull/2460) [#2452](https://github.com/quickwit-oss/tantivy/pull/2452) [#2453](https://github.com/quickwit-oss/tantivy/pull/2453)(@PSeitz)
|
||||
- modify fastfield range query heuristic [#2375](https://github.com/quickwit-oss/tantivy/pull/2375)(@trinity-1686a)
|
||||
- add FastFieldRangeQuery for explicit range queries on fast field (for `RangeQuery` it is autodetected) [#2477](https://github.com/quickwit-oss/tantivy/pull/2477)(@PSeitz)
|
||||
|
||||
- add format backwards-compatibility tests [#2485](https://github.com/quickwit-oss/tantivy/pull/2485)(@PSeitz)
|
||||
- add columnar format compatibility tests [#2433](https://github.com/quickwit-oss/tantivy/pull/2433)(@PSeitz)
|
||||
- Improved snippet ranges algorithm [#2474](https://github.com/quickwit-oss/tantivy/pull/2474)(@gezihuzi)
|
||||
- make find_field_with_default return json fields without path [#2476](https://github.com/quickwit-oss/tantivy/pull/2476)(@trinity-1686a)
|
||||
- Make `BooleanQuery` support `minimum_number_should_match` [#2405](https://github.com/quickwit-oss/tantivy/pull/2405)(@LebranceBW)
|
||||
- Make `NUM_MERGE_THREADS` configurable [#2535](https://github.com/quickwit-oss/tantivy/pull/2535)(@Barre)
|
||||
|
||||
- **RegexPhraseQuery**
|
||||
`RegexPhraseQuery` supports phrase queries with regex. E.g. query "b.* b.* wolf" matches "big bad wolf". Slop is supported as well: "b.* wolf"~2 matches "big bad wolf" [#2516](https://github.com/quickwit-oss/tantivy/pull/2516)(@PSeitz)
|
||||
|
||||
- **Optional Index in Multivalue Columnar Index**
|
||||
For mostly empty multivalued indices there was a large overhead during creation when iterating all docids (merge case).
|
||||
This is alleviated by placing an optional index in the multivalued index to mark documents that have values.
|
||||
This will slightly increase space and access time. [#2439](https://github.com/quickwit-oss/tantivy/pull/2439)(@PSeitz)
|
||||
|
||||
- **Store DateTime as nanoseconds in doc store** DateTime in the doc store was truncated to microseconds previously. This removes this truncation, while still keeping backwards compatibility. [#2486](https://github.com/quickwit-oss/tantivy/pull/2486)(@PSeitz)
|
||||
|
||||
- **Performace/Memory**
|
||||
- lift clauses in LogicalAst for optimized ast during execution [#2449](https://github.com/quickwit-oss/tantivy/pull/2449)(@PSeitz)
|
||||
- Use Vec instead of BTreeMap to back OwnedValue object [#2364](https://github.com/quickwit-oss/tantivy/pull/2364)(@fulmicoton)
|
||||
- Replace TantivyDocument with CompactDoc. CompactDoc is much smaller and provides similar performance. [#2402](https://github.com/quickwit-oss/tantivy/pull/2402)(@PSeitz)
|
||||
- Recycling buffer in PrefixPhraseScorer [#2443](https://github.com/quickwit-oss/tantivy/pull/2443)(@fulmicoton)
|
||||
|
||||
- **Json Type**
|
||||
- JSON supports now all values on the root level. Previously an object was required. This enables support for flat mixed types. allow more JSON values, fix i64 special case [#2383](https://github.com/quickwit-oss/tantivy/pull/2383)(@PSeitz)
|
||||
- add json path constructor to term [#2367](https://github.com/quickwit-oss/tantivy/pull/2367)(@PSeitz)
|
||||
|
||||
- **QueryParser**
|
||||
- fix de-escaping too much in query parser [#2427](https://github.com/quickwit-oss/tantivy/pull/2427)(@trinity-1686a)
|
||||
- improve query parser [#2416](https://github.com/quickwit-oss/tantivy/pull/2416)(@trinity-1686a)
|
||||
- Support field grouping `title:(return AND "pink panther")` [#2333](https://github.com/quickwit-oss/tantivy/pull/2333)(@trinity-1686a)
|
||||
- allow term starting with wildcard [#2568](https://github.com/quickwit-oss/tantivy/pull/2568)(@trinity-1686a)
|
||||
|
||||
- Exist queries match subpath fields [#2558](https://github.com/quickwit-oss/tantivy/pull/2558)(@rdettai)
|
||||
- add access benchmark for columnar [#2432](https://github.com/quickwit-oss/tantivy/pull/2432)(@PSeitz)
|
||||
- extend indexwriter proptests [#2342](https://github.com/quickwit-oss/tantivy/pull/2342)(@PSeitz)
|
||||
- add bench & test for columnar merging [#2428](https://github.com/quickwit-oss/tantivy/pull/2428)(@PSeitz)
|
||||
- Change in Executor API [#2391](https://github.com/quickwit-oss/tantivy/pull/2391)(@fulmicoton)
|
||||
- Removed usage of num_cpus [#2387](https://github.com/quickwit-oss/tantivy/pull/2387)(@fulmicoton)
|
||||
- use bingang for agg and stacker benchmark [#2378](https://github.com/quickwit-oss/tantivy/pull/2378)[#2492](https://github.com/quickwit-oss/tantivy/pull/2492)(@PSeitz)
|
||||
- cleanup top level exports [#2382](https://github.com/quickwit-oss/tantivy/pull/2382)(@PSeitz)
|
||||
- make convert_to_fast_value_and_append_to_json_term pub [#2370](https://github.com/quickwit-oss/tantivy/pull/2370)(@PSeitz)
|
||||
- remove JsonTermWriter [#2238](https://github.com/quickwit-oss/tantivy/pull/2238)(@PSeitz)
|
||||
- validate sort by field type [#2336](https://github.com/quickwit-oss/tantivy/pull/2336)(@PSeitz)
|
||||
- Fix trait bound of StoreReader::iter [#2360](https://github.com/quickwit-oss/tantivy/pull/2360)(@adamreichold)
|
||||
- remove read_postings_no_deletes [#2526](https://github.com/quickwit-oss/tantivy/pull/2526)(@PSeitz)
|
||||
|
||||
Tantivy 0.22
|
||||
================================
|
||||
|
||||
@@ -90,7 +8,7 @@ Tantivy 0.22 will be able to read indices created with Tantivy 0.21.
|
||||
- Fix bug that can cause `get_docids_for_value_range` to panic. [#2295](https://github.com/quickwit-oss/tantivy/pull/2295)(@fulmicoton)
|
||||
- Avoid 1 document indices by increase min memory to 15MB for indexing [#2176](https://github.com/quickwit-oss/tantivy/pull/2176)(@PSeitz)
|
||||
- Fix merge panic for JSON fields [#2284](https://github.com/quickwit-oss/tantivy/pull/2284)(@PSeitz)
|
||||
- Fix bug occurring when merging JSON object indexed with positions. [#2253](https://github.com/quickwit-oss/tantivy/pull/2253)(@fulmicoton)
|
||||
- Fix bug occuring when merging JSON object indexed with positions. [#2253](https://github.com/quickwit-oss/tantivy/pull/2253)(@fulmicoton)
|
||||
- Fix empty DateHistogram gap bug [#2183](https://github.com/quickwit-oss/tantivy/pull/2183)(@PSeitz)
|
||||
- Fix range query end check (fields with less than 1 value per doc are affected) [#2226](https://github.com/quickwit-oss/tantivy/pull/2226)(@PSeitz)
|
||||
- Handle exclusive out of bounds ranges on fastfield range queries [#2174](https://github.com/quickwit-oss/tantivy/pull/2174)(@PSeitz)
|
||||
@@ -108,7 +26,7 @@ Tantivy 0.22 will be able to read indices created with Tantivy 0.21.
|
||||
- Support to deserialize f64 from string [#2311](https://github.com/quickwit-oss/tantivy/pull/2311)(@PSeitz)
|
||||
- Add a top_hits aggregator [#2198](https://github.com/quickwit-oss/tantivy/pull/2198)(@ditsuke)
|
||||
- Support bool type in term aggregation [#2318](https://github.com/quickwit-oss/tantivy/pull/2318)(@PSeitz)
|
||||
- Support ip addresses in term aggregation [#2319](https://github.com/quickwit-oss/tantivy/pull/2319)(@PSeitz)
|
||||
- Support ip adresses in term aggregation [#2319](https://github.com/quickwit-oss/tantivy/pull/2319)(@PSeitz)
|
||||
- Support date type in term aggregation [#2172](https://github.com/quickwit-oss/tantivy/pull/2172)(@PSeitz)
|
||||
- Support escaped dot when addressing field [#2250](https://github.com/quickwit-oss/tantivy/pull/2250)(@PSeitz)
|
||||
|
||||
@@ -198,7 +116,7 @@ Tantivy 0.20
|
||||
- Add PhrasePrefixQuery [#1842](https://github.com/quickwit-oss/tantivy/issues/1842) (@trinity-1686a)
|
||||
- Add `coerce` option for text and numbers types (convert the value instead of returning an error during indexing) [#1904](https://github.com/quickwit-oss/tantivy/issues/1904) (@PSeitz)
|
||||
- Add regex tokenizer [#1759](https://github.com/quickwit-oss/tantivy/issues/1759)(@mkleen)
|
||||
- Move tokenizer API to separate crate. Having a separate crate with a stable API will allow us to use tokenizers with different tantivy versions. [#1767](https://github.com/quickwit-oss/tantivy/issues/1767) (@PSeitz)
|
||||
- Move tokenizer API to seperate crate. Having a seperate crate with a stable API will allow us to use tokenizers with different tantivy versions. [#1767](https://github.com/quickwit-oss/tantivy/issues/1767) (@PSeitz)
|
||||
- **Columnar crate**: New fast field handling (@fulmicoton @PSeitz) [#1806](https://github.com/quickwit-oss/tantivy/issues/1806)[#1809](https://github.com/quickwit-oss/tantivy/issues/1809)
|
||||
- Support for fast fields with optional values. Previously tantivy supported only single-valued and multi-value fast fields. The encoding of optional fast fields is now very compact.
|
||||
- Fast field Support for JSON (schemaless fast fields). Support multiple types on the same column. [#1876](https://github.com/quickwit-oss/tantivy/issues/1876) (@fulmicoton)
|
||||
@@ -245,13 +163,13 @@ Tantivy 0.20
|
||||
- Auto downgrade index record option, instead of vint error [#1857](https://github.com/quickwit-oss/tantivy/issues/1857) (@PSeitz)
|
||||
- Enable range query on fast field for u64 compatible types [#1762](https://github.com/quickwit-oss/tantivy/issues/1762) (@PSeitz) [#1876]
|
||||
- sstable
|
||||
- Isolating sstable and stacker in independent crates. [#1718](https://github.com/quickwit-oss/tantivy/issues/1718) (@fulmicoton)
|
||||
- Isolating sstable and stacker in independant crates. [#1718](https://github.com/quickwit-oss/tantivy/issues/1718) (@fulmicoton)
|
||||
- New sstable format [#1943](https://github.com/quickwit-oss/tantivy/issues/1943)[#1953](https://github.com/quickwit-oss/tantivy/issues/1953) (@trinity-1686a)
|
||||
- Use DeltaReader directly to implement Dictionary::ord_to_term [#1928](https://github.com/quickwit-oss/tantivy/issues/1928) (@trinity-1686a)
|
||||
- Use DeltaReader directly to implement Dictionary::term_ord [#1925](https://github.com/quickwit-oss/tantivy/issues/1925) (@trinity-1686a)
|
||||
- Add separate tokenizer manager for fast fields [#2019](https://github.com/quickwit-oss/tantivy/issues/2019) (@PSeitz)
|
||||
- Use DeltaReader directly to implement Dictionnary::ord_to_term [#1928](https://github.com/quickwit-oss/tantivy/issues/1928) (@trinity-1686a)
|
||||
- Use DeltaReader directly to implement Dictionnary::term_ord [#1925](https://github.com/quickwit-oss/tantivy/issues/1925) (@trinity-1686a)
|
||||
- Add seperate tokenizer manager for fast fields [#2019](https://github.com/quickwit-oss/tantivy/issues/2019) (@PSeitz)
|
||||
- Make construction of LevenshteinAutomatonBuilder for FuzzyTermQuery instances lazy. [#1756](https://github.com/quickwit-oss/tantivy/issues/1756) (@adamreichold)
|
||||
- Added support for madvise when opening an mmapped Index [#2036](https://github.com/quickwit-oss/tantivy/issues/2036) (@fulmicoton)
|
||||
- Added support for madvise when opening an mmaped Index [#2036](https://github.com/quickwit-oss/tantivy/issues/2036) (@fulmicoton)
|
||||
- Rename `DatePrecision` to `DateTimePrecision` [#2051](https://github.com/quickwit-oss/tantivy/issues/2051) (@guilload)
|
||||
- Query Parser
|
||||
- Quotation mark can now be used for phrase queries. [#2050](https://github.com/quickwit-oss/tantivy/issues/2050) (@fulmicoton)
|
||||
@@ -290,7 +208,7 @@ Tantivy 0.19
|
||||
- Add support for phrase slop in query language [#1393](https://github.com/quickwit-oss/tantivy/pull/1393) (@saroh)
|
||||
- Aggregation
|
||||
- Add aggregation support for date type [#1693](https://github.com/quickwit-oss/tantivy/pull/1693)(@PSeitz)
|
||||
- Add support for keyed parameter in range and histogram aggregations [#1424](https://github.com/quickwit-oss/tantivy/pull/1424) (@k-yomo)
|
||||
- Add support for keyed parameter in range and histgram aggregations [#1424](https://github.com/quickwit-oss/tantivy/pull/1424) (@k-yomo)
|
||||
- Add aggregation bucket limit [#1363](https://github.com/quickwit-oss/tantivy/pull/1363) (@PSeitz)
|
||||
- Faster indexing
|
||||
- [#1610](https://github.com/quickwit-oss/tantivy/pull/1610) (@PSeitz)
|
||||
@@ -733,7 +651,7 @@ Tantivy 0.4.0
|
||||
- Raise the limit of number of fields (previously 256 fields) (@fulmicoton)
|
||||
- Removed u32 fields. They are replaced by u64 and i64 fields (#65) (@fulmicoton)
|
||||
- Optimized skip in SegmentPostings (#130) (@lnicola)
|
||||
- Replacing rustc_serialize by serde. Kudos to benchmark@KodrAus and @lnicola
|
||||
- Replacing rustc_serialize by serde. Kudos to @KodrAus and @lnicola
|
||||
- Using error-chain (@KodrAus)
|
||||
- QueryParser: (@fulmicoton)
|
||||
- Explicit error returned when searched for a term that is not indexed
|
||||
|
||||
10
CITATION.cff
10
CITATION.cff
@@ -1,10 +0,0 @@
|
||||
cff-version: 1.2.0
|
||||
message: "If you use this software, please cite it as below."
|
||||
authors:
|
||||
- alias: Quickwit Inc.
|
||||
website: "https://quickwit.io"
|
||||
title: "tantivy"
|
||||
version: 0.22.0
|
||||
doi: 10.5281/zenodo.13942948
|
||||
date-released: 2024-10-17
|
||||
url: "https://github.com/quickwit-oss/tantivy"
|
||||
43
Cargo.toml
43
Cargo.toml
@@ -1,6 +1,6 @@
|
||||
[package]
|
||||
name = "tantivy"
|
||||
version = "0.24.0"
|
||||
version = "0.23.0"
|
||||
authors = ["Paul Masurel <paul.masurel@gmail.com>"]
|
||||
license = "MIT"
|
||||
categories = ["database-implementations", "data-structures"]
|
||||
@@ -11,7 +11,7 @@ repository = "https://github.com/quickwit-oss/tantivy"
|
||||
readme = "README.md"
|
||||
keywords = ["search", "information", "retrieval"]
|
||||
edition = "2021"
|
||||
rust-version = "1.75"
|
||||
rust-version = "1.63"
|
||||
exclude = ["benches/*.json", "benches/*.txt"]
|
||||
|
||||
[dependencies]
|
||||
@@ -29,52 +29,49 @@ tantivy-fst = "0.5"
|
||||
memmap2 = { version = "0.9.0", optional = true }
|
||||
lz4_flex = { version = "0.11", default-features = false, optional = true }
|
||||
zstd = { version = "0.13", optional = true, default-features = false }
|
||||
tempfile = { version = "3.12.0", optional = true }
|
||||
tempfile = { version = "3.3.0", optional = true }
|
||||
log = "0.4.16"
|
||||
serde = { version = "1.0.219", features = ["derive"] }
|
||||
serde_json = "1.0.140"
|
||||
serde = { version = "1.0.136", features = ["derive"] }
|
||||
serde_json = "1.0.79"
|
||||
fs4 = { version = "0.8.0", optional = true }
|
||||
levenshtein_automata = "0.2.1"
|
||||
uuid = { version = "1.0.0", features = ["v4", "serde"] }
|
||||
crossbeam-channel = "0.5.4"
|
||||
rust-stemmers = "1.2.0"
|
||||
downcast-rs = "2.0.1"
|
||||
downcast-rs = "1.2.0"
|
||||
bitpacking = { version = "0.9.2", default-features = false, features = [
|
||||
"bitpacker4x",
|
||||
] }
|
||||
census = "0.4.2"
|
||||
rustc-hash = "2.0.0"
|
||||
thiserror = "2.0.1"
|
||||
rustc-hash = "1.1.0"
|
||||
thiserror = "1.0.30"
|
||||
htmlescape = "0.3.1"
|
||||
fail = { version = "0.5.0", optional = true }
|
||||
time = { version = "0.3.35", features = ["serde-well-known"] }
|
||||
time = { version = "0.3.10", features = ["serde-well-known"] }
|
||||
smallvec = "1.8.0"
|
||||
rayon = "1.5.2"
|
||||
lru = "0.12.0"
|
||||
fastdivide = "0.4.0"
|
||||
itertools = "0.14.0"
|
||||
measure_time = "0.9.0"
|
||||
itertools = "0.13.0"
|
||||
measure_time = "0.8.2"
|
||||
arc-swap = "1.5.0"
|
||||
bon = "3.3.1"
|
||||
|
||||
columnar = { version = "0.5", path = "./columnar", package = "tantivy-columnar" }
|
||||
sstable = { version = "0.5", path = "./sstable", package = "tantivy-sstable", optional = true }
|
||||
stacker = { version = "0.5", path = "./stacker", package = "tantivy-stacker" }
|
||||
query-grammar = { version = "0.24.0", path = "./query-grammar", package = "tantivy-query-grammar" }
|
||||
tantivy-bitpacker = { version = "0.8", path = "./bitpacker" }
|
||||
common = { version = "0.9", path = "./common/", package = "tantivy-common" }
|
||||
tokenizer-api = { version = "0.5", path = "./tokenizer-api", package = "tantivy-tokenizer-api" }
|
||||
columnar = { version = "0.3", path = "./columnar", package = "tantivy-columnar" }
|
||||
sstable = { version = "0.3", path = "./sstable", package = "tantivy-sstable", optional = true }
|
||||
stacker = { version = "0.3", path = "./stacker", package = "tantivy-stacker" }
|
||||
query-grammar = { version = "0.22.0", path = "./query-grammar", package = "tantivy-query-grammar" }
|
||||
tantivy-bitpacker = { version = "0.6", path = "./bitpacker" }
|
||||
common = { version = "0.7", path = "./common/", package = "tantivy-common" }
|
||||
tokenizer-api = { version = "0.3", path = "./tokenizer-api", package = "tantivy-tokenizer-api" }
|
||||
sketches-ddsketch = { version = "0.3.0", features = ["use_serde"] }
|
||||
hyperloglogplus = { version = "0.4.1", features = ["const-loop"] }
|
||||
futures-util = { version = "0.3.28", optional = true }
|
||||
futures-channel = { version = "0.3.28", optional = true }
|
||||
fnv = "1.0.7"
|
||||
|
||||
[target.'cfg(windows)'.dependencies]
|
||||
winapi = "0.3.9"
|
||||
|
||||
[dev-dependencies]
|
||||
binggan = "0.14.0"
|
||||
binggan = "0.8.0"
|
||||
rand = "0.8.5"
|
||||
maplit = "1.0.2"
|
||||
matches = "0.1.9"
|
||||
@@ -122,7 +119,7 @@ zstd-compression = ["zstd"]
|
||||
failpoints = ["fail", "fail/failpoints"]
|
||||
unstable = [] # useful for benches.
|
||||
|
||||
quickwit = ["sstable", "futures-util", "futures-channel"]
|
||||
quickwit = ["sstable", "futures-util"]
|
||||
|
||||
# Compares only the hash of a string when indexing data.
|
||||
# Increases indexing speed, but may lead to extremely rare missing terms, when there's a hash collision.
|
||||
|
||||
@@ -18,7 +18,7 @@ Tantivy is, in fact, strongly inspired by Lucene's design.
|
||||
|
||||
## Benchmark
|
||||
|
||||
The following [benchmark](https://tantivy-search.github.io/bench/) breaks down the
|
||||
The following [benchmark](https://tantivy-search.github.io/bench/) breakdowns
|
||||
performance for different types of queries/collections.
|
||||
|
||||
Your mileage WILL vary depending on the nature of queries and their load.
|
||||
|
||||
2
TODO.txt
2
TODO.txt
@@ -1,7 +1,7 @@
|
||||
Make schema_builder API fluent.
|
||||
fix doc serialization and prevent compression problems
|
||||
|
||||
u64 , etc. should return Result<Option> now that we support optional missing a column is really not an error
|
||||
u64 , etc. shoudl return Resutl<Option> now that we support optional missing a column is really not an error
|
||||
remove fastfield codecs
|
||||
ditch the first_or_default trick. if it is still useful, improve its implementation.
|
||||
rename FastFieldReaders::open to load
|
||||
|
||||
@@ -1,4 +1,3 @@
|
||||
use binggan::plugins::PeakMemAllocPlugin;
|
||||
use binggan::{black_box, InputGroup, PeakMemAlloc, INSTRUMENTED_SYSTEM};
|
||||
use rand::prelude::SliceRandom;
|
||||
use rand::rngs::StdRng;
|
||||
@@ -18,9 +17,7 @@ pub static GLOBAL: &PeakMemAlloc<std::alloc::System> = &INSTRUMENTED_SYSTEM;
|
||||
/// runner.register("average_u64", move |index| average_u64(index));
|
||||
macro_rules! register {
|
||||
($runner:expr, $func:ident) => {
|
||||
$runner.register(stringify!($func), move |index| {
|
||||
$func(index);
|
||||
})
|
||||
$runner.register(stringify!($func), move |index| $func(index))
|
||||
};
|
||||
}
|
||||
|
||||
@@ -45,8 +42,7 @@ fn main() {
|
||||
}
|
||||
|
||||
fn bench_agg(mut group: InputGroup<Index>) {
|
||||
group.add_plugin(PeakMemAllocPlugin::new(GLOBAL));
|
||||
|
||||
group.set_alloc(GLOBAL); // Set the peak mem allocator. This will enable peak memory reporting.
|
||||
register!(group, average_u64);
|
||||
register!(group, average_f64);
|
||||
register!(group, average_f64_u64);
|
||||
@@ -55,15 +51,10 @@ fn bench_agg(mut group: InputGroup<Index>) {
|
||||
register!(group, percentiles_f64);
|
||||
register!(group, terms_few);
|
||||
register!(group, terms_many);
|
||||
register!(group, terms_many_top_1000);
|
||||
register!(group, terms_many_order_by_term);
|
||||
register!(group, terms_many_with_top_hits);
|
||||
register!(group, terms_many_with_avg_sub_agg);
|
||||
register!(group, terms_many_json_mixed_type_with_avg_sub_agg);
|
||||
|
||||
register!(group, cardinality_agg);
|
||||
register!(group, terms_few_with_cardinality_agg);
|
||||
|
||||
register!(group, terms_many_json_mixed_type_with_sub_agg_card);
|
||||
register!(group, range_agg);
|
||||
register!(group, range_agg_with_avg_sub_agg);
|
||||
register!(group, range_agg_with_term_agg_few);
|
||||
@@ -132,33 +123,6 @@ fn percentiles_f64(index: &Index) {
|
||||
});
|
||||
execute_agg(index, agg_req);
|
||||
}
|
||||
|
||||
fn cardinality_agg(index: &Index) {
|
||||
let agg_req = json!({
|
||||
"cardinality": {
|
||||
"cardinality": {
|
||||
"field": "text_many_terms"
|
||||
},
|
||||
}
|
||||
});
|
||||
execute_agg(index, agg_req);
|
||||
}
|
||||
fn terms_few_with_cardinality_agg(index: &Index) {
|
||||
let agg_req = json!({
|
||||
"my_texts": {
|
||||
"terms": { "field": "text_few_terms" },
|
||||
"aggs": {
|
||||
"cardinality": {
|
||||
"cardinality": {
|
||||
"field": "text_many_terms"
|
||||
},
|
||||
}
|
||||
}
|
||||
},
|
||||
});
|
||||
execute_agg(index, agg_req);
|
||||
}
|
||||
|
||||
fn terms_few(index: &Index) {
|
||||
let agg_req = json!({
|
||||
"my_texts": { "terms": { "field": "text_few_terms" } },
|
||||
@@ -171,12 +135,6 @@ fn terms_many(index: &Index) {
|
||||
});
|
||||
execute_agg(index, agg_req);
|
||||
}
|
||||
fn terms_many_top_1000(index: &Index) {
|
||||
let agg_req = json!({
|
||||
"my_texts": { "terms": { "field": "text_many_terms", "size": 1000 } },
|
||||
});
|
||||
execute_agg(index, agg_req);
|
||||
}
|
||||
fn terms_many_order_by_term(index: &Index) {
|
||||
let agg_req = json!({
|
||||
"my_texts": { "terms": { "field": "text_many_terms", "order": { "_key": "desc" } } },
|
||||
@@ -213,7 +171,7 @@ fn terms_many_with_avg_sub_agg(index: &Index) {
|
||||
});
|
||||
execute_agg(index, agg_req);
|
||||
}
|
||||
fn terms_many_json_mixed_type_with_avg_sub_agg(index: &Index) {
|
||||
fn terms_many_json_mixed_type_with_sub_agg_card(index: &Index) {
|
||||
let agg_req = json!({
|
||||
"my_texts": {
|
||||
"terms": { "field": "json.mixed_type" },
|
||||
@@ -310,7 +268,6 @@ fn range_agg_with_term_agg_many(index: &Index) {
|
||||
});
|
||||
execute_agg(index, agg_req);
|
||||
}
|
||||
|
||||
fn histogram(index: &Index) {
|
||||
let agg_req = json!({
|
||||
"rangef64": {
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
[package]
|
||||
name = "tantivy-bitpacker"
|
||||
version = "0.8.0"
|
||||
version = "0.6.0"
|
||||
edition = "2021"
|
||||
authors = ["Paul Masurel <paul.masurel@gmail.com>"]
|
||||
license = "MIT"
|
||||
|
||||
@@ -65,7 +65,7 @@ impl BitPacker {
|
||||
|
||||
#[derive(Clone, Debug, Default, Copy)]
|
||||
pub struct BitUnpacker {
|
||||
num_bits: usize,
|
||||
num_bits: u32,
|
||||
mask: u64,
|
||||
}
|
||||
|
||||
@@ -83,7 +83,7 @@ impl BitUnpacker {
|
||||
(1u64 << num_bits) - 1u64
|
||||
};
|
||||
BitUnpacker {
|
||||
num_bits: usize::from(num_bits),
|
||||
num_bits: u32::from(num_bits),
|
||||
mask,
|
||||
}
|
||||
}
|
||||
@@ -94,14 +94,14 @@ impl BitUnpacker {
|
||||
|
||||
#[inline]
|
||||
pub fn get(&self, idx: u32, data: &[u8]) -> u64 {
|
||||
let addr_in_bits = idx as usize * self.num_bits;
|
||||
let addr = addr_in_bits >> 3;
|
||||
let addr_in_bits = idx * self.num_bits;
|
||||
let addr = (addr_in_bits >> 3) as usize;
|
||||
if addr + 8 > data.len() {
|
||||
if self.num_bits == 0 {
|
||||
return 0;
|
||||
}
|
||||
let bit_shift = addr_in_bits & 7;
|
||||
return self.get_slow_path(addr, bit_shift as u32, data);
|
||||
return self.get_slow_path(addr, bit_shift, data);
|
||||
}
|
||||
let bit_shift = addr_in_bits & 7;
|
||||
let bytes: [u8; 8] = (&data[addr..addr + 8]).try_into().unwrap();
|
||||
@@ -134,13 +134,12 @@ impl BitUnpacker {
|
||||
"Bitwidth must be <= 32 to use this method."
|
||||
);
|
||||
|
||||
let end_idx: u32 = start_idx + output.len() as u32;
|
||||
let end_idx = start_idx + output.len() as u32;
|
||||
|
||||
// We use `usize` here to avoid overflow issues.
|
||||
let end_bit_read = (end_idx as usize) * self.num_bits;
|
||||
let end_bit_read = end_idx * self.num_bits;
|
||||
let end_byte_read = (end_bit_read + 7) / 8;
|
||||
assert!(
|
||||
end_byte_read <= data.len(),
|
||||
end_byte_read as usize <= data.len(),
|
||||
"Requested index is out of bounds."
|
||||
);
|
||||
|
||||
@@ -160,24 +159,24 @@ impl BitUnpacker {
|
||||
// We want the start of the fast track to start align with bytes.
|
||||
// A sufficient condition is to start with an idx that is a multiple of 8,
|
||||
// so highway start is the closest multiple of 8 that is >= start_idx.
|
||||
let entrance_ramp_len: u32 = 8 - (start_idx % 8) % 8;
|
||||
let entrance_ramp_len = 8 - (start_idx % 8) % 8;
|
||||
|
||||
let highway_start: u32 = start_idx + entrance_ramp_len;
|
||||
|
||||
if highway_start + (BitPacker1x::BLOCK_LEN as u32) > end_idx {
|
||||
if highway_start + BitPacker1x::BLOCK_LEN as u32 > end_idx {
|
||||
// We don't have enough values to have even a single block of highway.
|
||||
// Let's just supply the values the simple way.
|
||||
get_batch_ramp(start_idx, output);
|
||||
return;
|
||||
}
|
||||
|
||||
let num_blocks: usize = (end_idx - highway_start) as usize / BitPacker1x::BLOCK_LEN;
|
||||
let num_blocks: u32 = (end_idx - highway_start) / BitPacker1x::BLOCK_LEN as u32;
|
||||
|
||||
// Entrance ramp
|
||||
get_batch_ramp(start_idx, &mut output[..entrance_ramp_len as usize]);
|
||||
|
||||
// Highway
|
||||
let mut offset = (highway_start as usize * self.num_bits) / 8;
|
||||
let mut offset = (highway_start * self.num_bits) as usize / 8;
|
||||
let mut output_cursor = (highway_start - start_idx) as usize;
|
||||
for _ in 0..num_blocks {
|
||||
offset += BitPacker1x.decompress(
|
||||
@@ -189,7 +188,7 @@ impl BitUnpacker {
|
||||
}
|
||||
|
||||
// Exit ramp
|
||||
let highway_end: u32 = highway_start + (num_blocks * BitPacker1x::BLOCK_LEN) as u32;
|
||||
let highway_end = highway_start + num_blocks * BitPacker1x::BLOCK_LEN as u32;
|
||||
get_batch_ramp(highway_end, &mut output[output_cursor..]);
|
||||
}
|
||||
|
||||
@@ -369,9 +368,9 @@ mod test {
|
||||
for start_idx in 0u32..32u32 {
|
||||
output.resize(len, 0);
|
||||
bitunpacker.get_batch_u32s(start_idx, &buffer, &mut output);
|
||||
for (i, output_byte) in output.iter().enumerate() {
|
||||
for i in 0..len {
|
||||
let expected = (start_idx + i as u32) & mask;
|
||||
assert_eq!(*output_byte, expected);
|
||||
assert_eq!(output[i], expected);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@@ -34,7 +34,7 @@ struct BlockedBitpackerEntryMetaData {
|
||||
|
||||
impl BlockedBitpackerEntryMetaData {
|
||||
fn new(offset: u64, num_bits: u8, base_value: u64) -> Self {
|
||||
let encoded = offset | (u64::from(num_bits) << (64 - 8));
|
||||
let encoded = offset | (num_bits as u64) << (64 - 8);
|
||||
Self {
|
||||
encoded,
|
||||
base_value,
|
||||
|
||||
@@ -35,8 +35,8 @@ const IMPLS: [FilterImplPerInstructionSet; 2] = [
|
||||
const IMPLS: [FilterImplPerInstructionSet; 1] = [FilterImplPerInstructionSet::Scalar];
|
||||
|
||||
impl FilterImplPerInstructionSet {
|
||||
#[allow(unused_variables)]
|
||||
#[inline]
|
||||
#[allow(unused_variables)] // on non-x86_64, code is unused.
|
||||
fn from(code: u8) -> FilterImplPerInstructionSet {
|
||||
#[cfg(target_arch = "x86_64")]
|
||||
if code == FilterImplPerInstructionSet::AVX2 as u8 {
|
||||
|
||||
@@ -16,14 +16,14 @@ body = """
|
||||
|
||||
{%- if version %} in {{ version }}{%- endif -%}
|
||||
{% for commit in commits %}
|
||||
{% if commit.remote.pr_title -%}
|
||||
{%- set commit_message = commit.remote.pr_title -%}
|
||||
{% if commit.github.pr_title -%}
|
||||
{%- set commit_message = commit.github.pr_title -%}
|
||||
{%- else -%}
|
||||
{%- set commit_message = commit.message -%}
|
||||
{%- endif -%}
|
||||
- {{ commit_message | split(pat="\n") | first | trim }}\
|
||||
{% if commit.remote.pr_number %} \
|
||||
[#{{ commit.remote.pr_number }}]({{ self::remote_url() }}/pull/{{ commit.remote.pr_number }}){% if commit.remote.username %}(@{{ commit.remote.username }}){%- endif -%} \
|
||||
{% if commit.github.pr_number %} \
|
||||
[#{{ commit.github.pr_number }}]({{ self::remote_url() }}/pull/{{ commit.github.pr_number }}){% if commit.github.username %}(@{{ commit.github.username }}){%- endif -%} \
|
||||
{%- endif %}
|
||||
{%- endfor -%}
|
||||
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
[package]
|
||||
name = "tantivy-columnar"
|
||||
version = "0.5.0"
|
||||
version = "0.3.0"
|
||||
edition = "2021"
|
||||
license = "MIT"
|
||||
homepage = "https://github.com/quickwit-oss/tantivy"
|
||||
@@ -9,21 +9,21 @@ description = "column oriented storage for tantivy"
|
||||
categories = ["database-implementations", "data-structures", "compression"]
|
||||
|
||||
[dependencies]
|
||||
itertools = "0.14.0"
|
||||
itertools = "0.13.0"
|
||||
fastdivide = "0.4.0"
|
||||
|
||||
stacker = { version= "0.5", path = "../stacker", package="tantivy-stacker"}
|
||||
sstable = { version= "0.5", path = "../sstable", package = "tantivy-sstable" }
|
||||
common = { version= "0.9", path = "../common", package = "tantivy-common" }
|
||||
tantivy-bitpacker = { version= "0.8", path = "../bitpacker/" }
|
||||
stacker = { version= "0.3", path = "../stacker", package="tantivy-stacker"}
|
||||
sstable = { version= "0.3", path = "../sstable", package = "tantivy-sstable" }
|
||||
common = { version= "0.7", path = "../common", package = "tantivy-common" }
|
||||
tantivy-bitpacker = { version= "0.6", path = "../bitpacker/" }
|
||||
serde = "1.0.152"
|
||||
downcast-rs = "2.0.1"
|
||||
downcast-rs = "1.2.0"
|
||||
|
||||
[dev-dependencies]
|
||||
proptest = "1"
|
||||
more-asserts = "0.3.1"
|
||||
rand = "0.8"
|
||||
binggan = "0.14.0"
|
||||
binggan = "0.8.1"
|
||||
|
||||
[[bench]]
|
||||
name = "bench_merge"
|
||||
|
||||
@@ -31,7 +31,7 @@ restriction on 50% of the values (e.g. a 64-bit hash). On the other hand, a lot
|
||||
# Columnar format
|
||||
|
||||
This columnar format may have more than one column (with different types) associated to the same `column_name` (see [Coercion rules](#coercion-rules) above).
|
||||
The `(column_name, column_type)` couple however uniquely identifies a column.
|
||||
The `(column_name, columne_type)` couple however uniquely identifies a column.
|
||||
That couple is serialized as a column `column_key`. The format of that key is:
|
||||
`[column_name][ZERO_BYTE][column_type_header: u8]`
|
||||
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
pub mod common;
|
||||
|
||||
use binggan::BenchRunner;
|
||||
use binggan::{black_box, BenchRunner};
|
||||
use common::{generate_columnar_with_name, Card};
|
||||
use tantivy_columnar::*;
|
||||
|
||||
@@ -29,7 +29,7 @@ fn main() {
|
||||
add_combo(Card::Multi, Card::Dense);
|
||||
add_combo(Card::Multi, Card::Sparse);
|
||||
|
||||
let mut runner: BenchRunner = BenchRunner::new();
|
||||
let runner: BenchRunner = BenchRunner::new();
|
||||
let mut group = runner.new_group();
|
||||
for (input_name, columnar_readers) in inputs.iter() {
|
||||
group.register_with_input(
|
||||
@@ -41,7 +41,7 @@ fn main() {
|
||||
let merge_row_order = StackMergeOrder::stack(&columnar_readers[..]);
|
||||
|
||||
merge_columnar(&columnar_readers, &[], merge_row_order.into(), &mut out).unwrap();
|
||||
Some(out.len() as u64)
|
||||
black_box(out);
|
||||
},
|
||||
);
|
||||
}
|
||||
|
||||
@@ -1,18 +0,0 @@
|
||||
[package]
|
||||
name = "tantivy-columnar-inspect"
|
||||
version = "0.1.0"
|
||||
edition = "2021"
|
||||
license = "MIT"
|
||||
|
||||
[dependencies]
|
||||
tantivy = {path="../..", package="tantivy"}
|
||||
columnar = {path="../", package="tantivy-columnar"}
|
||||
common = {path="../../common", package="tantivy-common"}
|
||||
|
||||
[workspace]
|
||||
members = []
|
||||
|
||||
[profile.release]
|
||||
debug = true
|
||||
#debug-assertions = true
|
||||
#overflow-checks = true
|
||||
@@ -1,54 +0,0 @@
|
||||
use columnar::ColumnarReader;
|
||||
use common::file_slice::{FileSlice, WrapFile};
|
||||
use std::io;
|
||||
use std::path::Path;
|
||||
use tantivy::directory::footer::Footer;
|
||||
|
||||
fn main() -> io::Result<()> {
|
||||
println!("Opens a columnar file written by tantivy and validates it.");
|
||||
let path = std::env::args().nth(1).unwrap();
|
||||
|
||||
let path = Path::new(&path);
|
||||
println!("Reading {:?}", path);
|
||||
let _reader = open_and_validate_columnar(path.to_str().unwrap())?;
|
||||
|
||||
Ok(())
|
||||
}
|
||||
|
||||
pub fn validate_columnar_reader(reader: &ColumnarReader) {
|
||||
let num_rows = reader.num_rows();
|
||||
println!("num_rows: {}", num_rows);
|
||||
let columns = reader.list_columns().unwrap();
|
||||
println!("num columns: {:?}", columns.len());
|
||||
for (col_name, dynamic_column_handle) in columns {
|
||||
let col = dynamic_column_handle.open().unwrap();
|
||||
match col {
|
||||
columnar::DynamicColumn::Bool(_)
|
||||
| columnar::DynamicColumn::I64(_)
|
||||
| columnar::DynamicColumn::U64(_)
|
||||
| columnar::DynamicColumn::F64(_)
|
||||
| columnar::DynamicColumn::IpAddr(_)
|
||||
| columnar::DynamicColumn::DateTime(_)
|
||||
| columnar::DynamicColumn::Bytes(_) => {}
|
||||
columnar::DynamicColumn::Str(str_column) => {
|
||||
let num_vals = str_column.ords().values.num_vals();
|
||||
let num_terms_dict = str_column.num_terms() as u64;
|
||||
let max_ord = str_column.ords().values.iter().max().unwrap_or_default();
|
||||
println!("{col_name:35} num_vals {num_vals:10} \t num_terms_dict {num_terms_dict:8} max_ord: {max_ord:8}",);
|
||||
for ord in str_column.ords().values.iter() {
|
||||
assert!(ord < num_terms_dict);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// Opens a columnar file that was written by tantivy and validates it.
|
||||
pub fn open_and_validate_columnar(path: &str) -> io::Result<ColumnarReader> {
|
||||
let wrap_file = WrapFile::new(std::fs::File::open(path)?)?;
|
||||
let slice = FileSlice::new(std::sync::Arc::new(wrap_file));
|
||||
let (_footer, slice) = Footer::extract_footer(slice.clone()).unwrap();
|
||||
let reader = ColumnarReader::open(slice).unwrap();
|
||||
validate_columnar_reader(&reader);
|
||||
Ok(reader)
|
||||
}
|
||||
@@ -10,7 +10,7 @@
|
||||
|
||||
# Perf and Size
|
||||
* remove alloc in `ord_to_term`
|
||||
+ multivaued range queries restart from the beginning all of the time.
|
||||
+ multivaued range queries restrat frm the beginning all of the time.
|
||||
* re-add ZSTD compression for dictionaries
|
||||
no systematic monotonic mapping
|
||||
consider removing multilinear
|
||||
@@ -30,7 +30,7 @@ investigate if should have better errors? io::Error is overused at the moment.
|
||||
rename rank/select in unit tests
|
||||
Review the public API via cargo doc
|
||||
go through TODOs
|
||||
remove all doc_id occurrences -> row_id
|
||||
remove all doc_id occurences -> row_id
|
||||
use the rank & select naming in unit tests branch.
|
||||
multi-linear -> blockwise
|
||||
linear codec -> simply a multiplication for the index column
|
||||
@@ -43,5 +43,5 @@ isolate u128_based and uniform naming
|
||||
# Other
|
||||
fix enhance column-cli
|
||||
|
||||
# Santa Claus
|
||||
# Santa claus
|
||||
autodetect datetime ipaddr, plug customizable tokenizer.
|
||||
|
||||
@@ -66,7 +66,7 @@ impl<T: PartialOrd + Copy + std::fmt::Debug + Send + Sync + 'static + Default>
|
||||
&'a self,
|
||||
docs: &'a [u32],
|
||||
accessor: &Column<T>,
|
||||
) -> impl Iterator<Item = (DocId, T)> + 'a {
|
||||
) -> impl Iterator<Item = (DocId, T)> + '_ {
|
||||
if accessor.index.get_cardinality().is_full() {
|
||||
docs.iter().cloned().zip(self.val_cache.iter().cloned())
|
||||
} else {
|
||||
@@ -139,7 +139,7 @@ mod tests {
|
||||
missing_docs.push(missing_doc);
|
||||
});
|
||||
|
||||
assert_eq!(missing_docs, Vec::<u32>::new());
|
||||
assert_eq!(missing_docs, vec![]);
|
||||
}
|
||||
|
||||
#[test]
|
||||
|
||||
@@ -173,7 +173,7 @@ mod tests {
|
||||
.into();
|
||||
let merged_column_index = merge_column_index(&column_indexes[..], &merge_row_order);
|
||||
let SerializableColumnIndex::Multivalued(start_index_iterable) = merged_column_index else {
|
||||
panic!("Expected a multivalued index")
|
||||
panic!("Excpected a multivalued index")
|
||||
};
|
||||
let mut output = Vec::new();
|
||||
serialize_multivalued_index(&start_index_iterable, &mut output).unwrap();
|
||||
@@ -211,7 +211,7 @@ mod tests {
|
||||
|
||||
let merged_column_index = merge_column_index(&column_indexes[..], &merge_row_order);
|
||||
let SerializableColumnIndex::Multivalued(start_index_iterable) = merged_column_index else {
|
||||
panic!("Expected a multivalued index")
|
||||
panic!("Excpected a multivalued index")
|
||||
};
|
||||
let mut output = Vec::new();
|
||||
serialize_multivalued_index(&start_index_iterable, &mut output).unwrap();
|
||||
|
||||
@@ -58,7 +58,7 @@ struct ShuffledIndex<'a> {
|
||||
merge_order: &'a ShuffleMergeOrder,
|
||||
}
|
||||
|
||||
impl Iterable<u32> for ShuffledIndex<'_> {
|
||||
impl<'a> Iterable<u32> for ShuffledIndex<'a> {
|
||||
fn boxed_iter(&self) -> Box<dyn Iterator<Item = u32> + '_> {
|
||||
Box::new(
|
||||
self.merge_order
|
||||
@@ -127,7 +127,7 @@ fn integrate_num_vals(num_vals: impl Iterator<Item = u32>) -> impl Iterator<Item
|
||||
)
|
||||
}
|
||||
|
||||
impl Iterable<u32> for ShuffledMultivaluedIndex<'_> {
|
||||
impl<'a> Iterable<u32> for ShuffledMultivaluedIndex<'a> {
|
||||
fn boxed_iter(&self) -> Box<dyn Iterator<Item = u32> + '_> {
|
||||
let num_vals_per_row = iter_num_values(self.column_indexes, self.merge_order);
|
||||
Box::new(integrate_num_vals(num_vals_per_row))
|
||||
|
||||
@@ -56,7 +56,7 @@ fn get_doc_ids_with_values<'a>(
|
||||
ColumnIndex::Full => Box::new(doc_range),
|
||||
ColumnIndex::Optional(optional_index) => Box::new(
|
||||
optional_index
|
||||
.iter_docs()
|
||||
.iter_rows()
|
||||
.map(move |row| row + doc_range.start),
|
||||
),
|
||||
ColumnIndex::Multivalued(multivalued_index) => match multivalued_index {
|
||||
@@ -73,7 +73,7 @@ fn get_doc_ids_with_values<'a>(
|
||||
MultiValueIndex::MultiValueIndexV2(multivalued_index) => Box::new(
|
||||
multivalued_index
|
||||
.optional_index
|
||||
.iter_docs()
|
||||
.iter_rows()
|
||||
.map(move |row| row + doc_range.start),
|
||||
),
|
||||
},
|
||||
@@ -123,7 +123,7 @@ fn get_num_values_iterator<'a>(
|
||||
}
|
||||
}
|
||||
|
||||
impl Iterable<u32> for StackedStartOffsets<'_> {
|
||||
impl<'a> Iterable<u32> for StackedStartOffsets<'a> {
|
||||
fn boxed_iter(&self) -> Box<dyn Iterator<Item = u32> + '_> {
|
||||
let num_values_it = (0..self.column_indexes.len()).flat_map(|columnar_id| {
|
||||
let num_docs = self.stack_merge_order.columnar_range(columnar_id).len() as u32;
|
||||
@@ -177,7 +177,7 @@ impl<'a> Iterable<RowId> for StackedOptionalIndex<'a> {
|
||||
ColumnIndex::Full => Box::new(columnar_row_range),
|
||||
ColumnIndex::Optional(optional_index) => Box::new(
|
||||
optional_index
|
||||
.iter_docs()
|
||||
.iter_rows()
|
||||
.map(move |row_id: RowId| columnar_row_range.start + row_id),
|
||||
),
|
||||
ColumnIndex::Multivalued(_) => {
|
||||
|
||||
@@ -28,7 +28,7 @@ pub enum ColumnIndex {
|
||||
Full,
|
||||
Optional(OptionalIndex),
|
||||
/// In addition, at index num_rows, an extra value is added
|
||||
/// containing the overall number of values.
|
||||
/// containing the overal number of values.
|
||||
Multivalued(MultiValueIndex),
|
||||
}
|
||||
|
||||
|
||||
@@ -80,23 +80,23 @@ impl BlockVariant {
|
||||
/// index is the block index. For each block `byte_start` and `offset` is computed.
|
||||
#[derive(Clone)]
|
||||
pub struct OptionalIndex {
|
||||
num_docs: RowId,
|
||||
num_non_null_docs: RowId,
|
||||
num_rows: RowId,
|
||||
num_non_null_rows: RowId,
|
||||
block_data: OwnedBytes,
|
||||
block_metas: Arc<[BlockMeta]>,
|
||||
}
|
||||
|
||||
impl Iterable<u32> for &OptionalIndex {
|
||||
impl<'a> Iterable<u32> for &'a OptionalIndex {
|
||||
fn boxed_iter(&self) -> Box<dyn Iterator<Item = u32> + '_> {
|
||||
Box::new(self.iter_docs())
|
||||
Box::new(self.iter_rows())
|
||||
}
|
||||
}
|
||||
|
||||
impl std::fmt::Debug for OptionalIndex {
|
||||
fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
|
||||
f.debug_struct("OptionalIndex")
|
||||
.field("num_docs", &self.num_docs)
|
||||
.field("num_non_null_docs", &self.num_non_null_docs)
|
||||
.field("num_rows", &self.num_rows)
|
||||
.field("num_non_null_rows", &self.num_non_null_rows)
|
||||
.finish_non_exhaustive()
|
||||
}
|
||||
}
|
||||
@@ -123,7 +123,7 @@ enum BlockSelectCursor<'a> {
|
||||
Sparse(<SparseBlock<'a> as Set<u16>>::SelectCursor<'a>),
|
||||
}
|
||||
|
||||
impl BlockSelectCursor<'_> {
|
||||
impl<'a> BlockSelectCursor<'a> {
|
||||
fn select(&mut self, rank: u16) -> u16 {
|
||||
match self {
|
||||
BlockSelectCursor::Dense(dense_select_cursor) => dense_select_cursor.select(rank),
|
||||
@@ -141,7 +141,7 @@ pub struct OptionalIndexSelectCursor<'a> {
|
||||
num_null_rows_before_block: RowId,
|
||||
}
|
||||
|
||||
impl OptionalIndexSelectCursor<'_> {
|
||||
impl<'a> OptionalIndexSelectCursor<'a> {
|
||||
fn search_and_load_block(&mut self, rank: RowId) {
|
||||
if rank < self.current_block_end_rank {
|
||||
// we are already in the right block
|
||||
@@ -165,7 +165,7 @@ impl OptionalIndexSelectCursor<'_> {
|
||||
}
|
||||
}
|
||||
|
||||
impl SelectCursor<RowId> for OptionalIndexSelectCursor<'_> {
|
||||
impl<'a> SelectCursor<RowId> for OptionalIndexSelectCursor<'a> {
|
||||
fn select(&mut self, rank: RowId) -> RowId {
|
||||
self.search_and_load_block(rank);
|
||||
let index_in_block = (rank - self.num_null_rows_before_block) as u16;
|
||||
@@ -174,9 +174,7 @@ impl SelectCursor<RowId> for OptionalIndexSelectCursor<'_> {
|
||||
}
|
||||
|
||||
impl Set<RowId> for OptionalIndex {
|
||||
type SelectCursor<'b>
|
||||
= OptionalIndexSelectCursor<'b>
|
||||
where Self: 'b;
|
||||
type SelectCursor<'b> = OptionalIndexSelectCursor<'b> where Self: 'b;
|
||||
// Check if value at position is not null.
|
||||
#[inline]
|
||||
fn contains(&self, row_id: RowId) -> bool {
|
||||
@@ -271,17 +269,17 @@ impl OptionalIndex {
|
||||
}
|
||||
|
||||
pub fn num_docs(&self) -> RowId {
|
||||
self.num_docs
|
||||
self.num_rows
|
||||
}
|
||||
|
||||
pub fn num_non_nulls(&self) -> RowId {
|
||||
self.num_non_null_docs
|
||||
self.num_non_null_rows
|
||||
}
|
||||
|
||||
pub fn iter_docs(&self) -> impl Iterator<Item = RowId> + '_ {
|
||||
pub fn iter_rows(&self) -> impl Iterator<Item = RowId> + '_ {
|
||||
// TODO optimize
|
||||
let mut select_batch = self.select_cursor();
|
||||
(0..self.num_non_null_docs).map(move |rank| select_batch.select(rank))
|
||||
(0..self.num_non_null_rows).map(move |rank| select_batch.select(rank))
|
||||
}
|
||||
pub fn select_batch(&self, ranks: &mut [RowId]) {
|
||||
let mut select_cursor = self.select_cursor();
|
||||
@@ -505,7 +503,7 @@ fn deserialize_optional_index_block_metadatas(
|
||||
non_null_rows_before_block += num_non_null_rows;
|
||||
}
|
||||
block_metas.resize(
|
||||
num_rows.div_ceil(ELEMENTS_PER_BLOCK) as usize,
|
||||
((num_rows + ELEMENTS_PER_BLOCK - 1) / ELEMENTS_PER_BLOCK) as usize,
|
||||
BlockMeta {
|
||||
non_null_rows_before_block,
|
||||
start_byte_offset,
|
||||
@@ -519,15 +517,15 @@ pub fn open_optional_index(bytes: OwnedBytes) -> io::Result<OptionalIndex> {
|
||||
let (mut bytes, num_non_empty_blocks_bytes) = bytes.rsplit(2);
|
||||
let num_non_empty_block_bytes =
|
||||
u16::from_le_bytes(num_non_empty_blocks_bytes.as_slice().try_into().unwrap());
|
||||
let num_docs = VInt::deserialize_u64(&mut bytes)? as u32;
|
||||
let num_rows = VInt::deserialize_u64(&mut bytes)? as u32;
|
||||
let block_metas_num_bytes =
|
||||
num_non_empty_block_bytes as usize * SERIALIZED_BLOCK_META_NUM_BYTES;
|
||||
let (block_data, block_metas) = bytes.rsplit(block_metas_num_bytes);
|
||||
let (block_metas, num_non_null_docs) =
|
||||
deserialize_optional_index_block_metadatas(block_metas.as_slice(), num_docs);
|
||||
let (block_metas, num_non_null_rows) =
|
||||
deserialize_optional_index_block_metadatas(block_metas.as_slice(), num_rows);
|
||||
let optional_index = OptionalIndex {
|
||||
num_docs,
|
||||
num_non_null_docs,
|
||||
num_rows,
|
||||
num_non_null_rows,
|
||||
block_data,
|
||||
block_metas: block_metas.into(),
|
||||
};
|
||||
|
||||
@@ -23,6 +23,7 @@ fn set_bit_at(input: &mut u64, n: u16) {
|
||||
///
|
||||
/// When translating a dense index to the original index, we can use the offset to find the correct
|
||||
/// block. Direct computation is not possible, but we can employ a linear or binary search.
|
||||
|
||||
const ELEMENTS_PER_MINI_BLOCK: u16 = 64;
|
||||
const MINI_BLOCK_BITVEC_NUM_BYTES: usize = 8;
|
||||
const MINI_BLOCK_OFFSET_NUM_BYTES: usize = 2;
|
||||
@@ -108,7 +109,7 @@ pub struct DenseBlockSelectCursor<'a> {
|
||||
dense_block: DenseBlock<'a>,
|
||||
}
|
||||
|
||||
impl SelectCursor<u16> for DenseBlockSelectCursor<'_> {
|
||||
impl<'a> SelectCursor<u16> for DenseBlockSelectCursor<'a> {
|
||||
#[inline]
|
||||
fn select(&mut self, rank: u16) -> u16 {
|
||||
self.block_id = self
|
||||
@@ -122,9 +123,7 @@ impl SelectCursor<u16> for DenseBlockSelectCursor<'_> {
|
||||
}
|
||||
|
||||
impl<'a> Set<u16> for DenseBlock<'a> {
|
||||
type SelectCursor<'b>
|
||||
= DenseBlockSelectCursor<'a>
|
||||
where Self: 'b;
|
||||
type SelectCursor<'b> = DenseBlockSelectCursor<'a> where Self: 'b;
|
||||
|
||||
#[inline(always)]
|
||||
fn contains(&self, el: u16) -> bool {
|
||||
@@ -174,7 +173,7 @@ impl<'a> Set<u16> for DenseBlock<'a> {
|
||||
}
|
||||
}
|
||||
|
||||
impl DenseBlock<'_> {
|
||||
impl<'a> DenseBlock<'a> {
|
||||
#[inline]
|
||||
fn mini_block(&self, mini_block_id: u16) -> DenseMiniBlock {
|
||||
let data_start_pos = mini_block_id as usize * MINI_BLOCK_NUM_BYTES;
|
||||
|
||||
@@ -31,10 +31,8 @@ impl<'a> SelectCursor<u16> for SparseBlock<'a> {
|
||||
}
|
||||
}
|
||||
|
||||
impl Set<u16> for SparseBlock<'_> {
|
||||
type SelectCursor<'b>
|
||||
= Self
|
||||
where Self: 'b;
|
||||
impl<'a> Set<u16> for SparseBlock<'a> {
|
||||
type SelectCursor<'b> = Self where Self: 'b;
|
||||
|
||||
#[inline(always)]
|
||||
fn contains(&self, el: u16) -> bool {
|
||||
@@ -69,7 +67,7 @@ fn get_u16(data: &[u8], byte_position: usize) -> u16 {
|
||||
u16::from_le_bytes(bytes)
|
||||
}
|
||||
|
||||
impl SparseBlock<'_> {
|
||||
impl<'a> SparseBlock<'a> {
|
||||
#[inline(always)]
|
||||
fn value_at_idx(&self, data: &[u8], idx: u16) -> u16 {
|
||||
let start_offset: usize = idx as usize * 2;
|
||||
@@ -82,7 +80,7 @@ impl SparseBlock<'_> {
|
||||
}
|
||||
|
||||
#[inline]
|
||||
#[expect(clippy::comparison_chain)]
|
||||
#[allow(clippy::comparison_chain)]
|
||||
// Looks for the element in the block. Returns the positions if found.
|
||||
fn binary_search(&self, target: u16) -> Result<u16, u16> {
|
||||
let data = &self.0;
|
||||
|
||||
@@ -110,8 +110,8 @@ fn test_null_index(data: &[bool]) {
|
||||
.map(|(pos, _val)| pos as u32)
|
||||
.collect();
|
||||
let mut select_iter = null_index.select_cursor();
|
||||
for (i, expected) in orig_idx_with_value.iter().enumerate() {
|
||||
assert_eq!(select_iter.select(i as u32), *expected);
|
||||
for i in 0..orig_idx_with_value.len() {
|
||||
assert_eq!(select_iter.select(i as u32), orig_idx_with_value[i]);
|
||||
}
|
||||
|
||||
let step_size = (orig_idx_with_value.len() / 100).max(1);
|
||||
@@ -164,7 +164,7 @@ fn test_optional_index_large() {
|
||||
fn test_optional_index_iter_aux(row_ids: &[RowId], num_rows: RowId) {
|
||||
let optional_index = OptionalIndex::for_test(num_rows, row_ids);
|
||||
assert_eq!(optional_index.num_docs(), num_rows);
|
||||
assert!(optional_index.iter_docs().eq(row_ids.iter().copied()));
|
||||
assert!(optional_index.iter_rows().eq(row_ids.iter().copied()));
|
||||
}
|
||||
|
||||
#[test]
|
||||
|
||||
@@ -31,7 +31,7 @@ pub enum SerializableColumnIndex<'a> {
|
||||
Multivalued(SerializableMultivalueIndex<'a>),
|
||||
}
|
||||
|
||||
impl SerializableColumnIndex<'_> {
|
||||
impl<'a> SerializableColumnIndex<'a> {
|
||||
pub fn get_cardinality(&self) -> Cardinality {
|
||||
match self {
|
||||
SerializableColumnIndex::Full => Cardinality::Full,
|
||||
|
||||
@@ -34,7 +34,6 @@ fn compute_stats(vals: impl Iterator<Item = u64>) -> ColumnStats {
|
||||
fn value_iter() -> impl Iterator<Item = u64> {
|
||||
0..20_000
|
||||
}
|
||||
|
||||
fn get_reader_for_bench<Codec: ColumnCodec>(data: &[u64]) -> Codec::ColumnValues {
|
||||
let mut bytes = Vec::new();
|
||||
let stats = compute_stats(data.iter().cloned());
|
||||
@@ -42,13 +41,10 @@ fn get_reader_for_bench<Codec: ColumnCodec>(data: &[u64]) -> Codec::ColumnValues
|
||||
for val in data {
|
||||
codec_serializer.collect(*val);
|
||||
}
|
||||
codec_serializer
|
||||
.serialize(&stats, Box::new(data.iter().copied()).as_mut(), &mut bytes)
|
||||
.unwrap();
|
||||
codec_serializer.serialize(&stats, Box::new(data.iter().copied()).as_mut(), &mut bytes);
|
||||
|
||||
Codec::load(OwnedBytes::new(bytes)).unwrap()
|
||||
}
|
||||
|
||||
fn bench_get<Codec: ColumnCodec>(b: &mut Bencher, data: &[u64]) {
|
||||
let col = get_reader_for_bench::<Codec>(data);
|
||||
b.iter(|| {
|
||||
|
||||
@@ -10,7 +10,7 @@ pub(crate) struct MergedColumnValues<'a, T> {
|
||||
pub(crate) merge_row_order: &'a MergeRowOrder,
|
||||
}
|
||||
|
||||
impl<T: Copy + PartialOrd + Debug + 'static> Iterable<T> for MergedColumnValues<'_, T> {
|
||||
impl<'a, T: Copy + PartialOrd + Debug + 'static> Iterable<T> for MergedColumnValues<'a, T> {
|
||||
fn boxed_iter(&self) -> Box<dyn Iterator<Item = T> + '_> {
|
||||
match self.merge_row_order {
|
||||
MergeRowOrder::Stack(_) => Box::new(
|
||||
|
||||
@@ -184,7 +184,7 @@ impl CompactSpaceBuilder {
|
||||
|
||||
let mut covered_space = Vec::with_capacity(self.blanks.len());
|
||||
|
||||
// beginning of the blanks
|
||||
// begining of the blanks
|
||||
if let Some(first_blank_start) = self.blanks.first().map(RangeInclusive::start) {
|
||||
if *first_blank_start != 0 {
|
||||
covered_space.push(0..=first_blank_start - 1);
|
||||
|
||||
@@ -128,7 +128,7 @@ pub fn open_u128_as_compact_u64(mut bytes: OwnedBytes) -> io::Result<Arc<dyn Col
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
pub(crate) mod tests {
|
||||
pub mod tests {
|
||||
use super::*;
|
||||
use crate::column_values::u64_based::{
|
||||
serialize_and_load_u64_based_column_values, serialize_u64_based_column_values,
|
||||
|
||||
@@ -39,7 +39,7 @@ impl BinarySerializable for Block {
|
||||
}
|
||||
|
||||
fn compute_num_blocks(num_vals: u32) -> u32 {
|
||||
num_vals.div_ceil(BLOCK_SIZE)
|
||||
(num_vals + BLOCK_SIZE - 1) / BLOCK_SIZE
|
||||
}
|
||||
|
||||
pub struct BlockwiseLinearEstimator {
|
||||
|
||||
@@ -122,11 +122,12 @@ impl Line {
|
||||
line
|
||||
}
|
||||
|
||||
/// Returns a line that attempts to approximate a function
|
||||
/// Returns a line that attemps to approximate a function
|
||||
/// f: i in 0..[ys.num_vals()) -> ys[i].
|
||||
///
|
||||
/// - The approximation is always lower than the actual value. Or more rigorously, formally
|
||||
/// `f(i).wrapping_sub(ys[i])` is small for any i in [0..ys.len()).
|
||||
/// - The approximation is always lower than the actual value.
|
||||
/// Or more rigorously, formally `f(i).wrapping_sub(ys[i])` is small
|
||||
/// for any i in [0..ys.len()).
|
||||
/// - It computes without panicking for any value of it.
|
||||
///
|
||||
/// This function is only invariable by translation if all of the
|
||||
|
||||
@@ -3,7 +3,7 @@ use std::io::{self, Write};
|
||||
use common::{BitSet, CountingWriter, ReadOnlyBitSet};
|
||||
use sstable::{SSTable, Streamer, TermOrdinal, VoidSSTable};
|
||||
|
||||
use super::term_merger::{TermMerger, TermsWithSegmentOrd};
|
||||
use super::term_merger::TermMerger;
|
||||
use crate::column::serialize_column_mappable_to_u64;
|
||||
use crate::column_index::SerializableColumnIndex;
|
||||
use crate::iterable::Iterable;
|
||||
@@ -39,7 +39,7 @@ struct RemappedTermOrdinalsValues<'a> {
|
||||
merge_row_order: &'a MergeRowOrder,
|
||||
}
|
||||
|
||||
impl Iterable for RemappedTermOrdinalsValues<'_> {
|
||||
impl<'a> Iterable for RemappedTermOrdinalsValues<'a> {
|
||||
fn boxed_iter(&self) -> Box<dyn Iterator<Item = u64> + '_> {
|
||||
match self.merge_row_order {
|
||||
MergeRowOrder::Stack(_) => self.boxed_iter_stacked(),
|
||||
@@ -50,7 +50,7 @@ impl Iterable for RemappedTermOrdinalsValues<'_> {
|
||||
}
|
||||
}
|
||||
|
||||
impl RemappedTermOrdinalsValues<'_> {
|
||||
impl<'a> RemappedTermOrdinalsValues<'a> {
|
||||
fn boxed_iter_stacked(&self) -> Box<dyn Iterator<Item = u64> + '_> {
|
||||
let iter = self
|
||||
.bytes_columns
|
||||
@@ -126,17 +126,14 @@ fn serialize_merged_dict(
|
||||
let mut term_ord_mapping = TermOrdinalMapping::default();
|
||||
|
||||
let mut field_term_streams = Vec::new();
|
||||
for (segment_ord, column_opt) in bytes_columns.iter().enumerate() {
|
||||
for column_opt in bytes_columns.iter() {
|
||||
if let Some(column) = column_opt {
|
||||
term_ord_mapping.add_segment(column.dictionary.num_terms());
|
||||
let terms: Streamer<VoidSSTable> = column.dictionary.stream()?;
|
||||
field_term_streams.push(TermsWithSegmentOrd { terms, segment_ord });
|
||||
field_term_streams.push(terms);
|
||||
} else {
|
||||
term_ord_mapping.add_segment(0);
|
||||
field_term_streams.push(TermsWithSegmentOrd {
|
||||
terms: Streamer::empty(),
|
||||
segment_ord,
|
||||
});
|
||||
field_term_streams.push(Streamer::empty());
|
||||
}
|
||||
}
|
||||
|
||||
@@ -194,7 +191,6 @@ fn serialize_merged_dict(
|
||||
|
||||
#[derive(Default, Debug)]
|
||||
struct TermOrdinalMapping {
|
||||
/// Contains the new term ordinals for each segment.
|
||||
per_segment_new_term_ordinals: Vec<Vec<TermOrdinal>>,
|
||||
}
|
||||
|
||||
@@ -209,6 +205,6 @@ impl TermOrdinalMapping {
|
||||
}
|
||||
|
||||
fn get_segment(&self, segment_ord: u32) -> &[TermOrdinal] {
|
||||
&self.per_segment_new_term_ordinals[segment_ord as usize]
|
||||
&(self.per_segment_new_term_ordinals[segment_ord as usize])[..]
|
||||
}
|
||||
}
|
||||
|
||||
@@ -26,7 +26,7 @@ impl StackMergeOrder {
|
||||
let mut cumulated_row_ids: Vec<RowId> = Vec::with_capacity(columnars.len());
|
||||
let mut cumulated_row_id = 0;
|
||||
for columnar in columnars {
|
||||
cumulated_row_id += columnar.num_docs();
|
||||
cumulated_row_id += columnar.num_rows();
|
||||
cumulated_row_ids.push(cumulated_row_id);
|
||||
}
|
||||
StackMergeOrder { cumulated_row_ids }
|
||||
|
||||
@@ -25,7 +25,7 @@ use crate::{
|
||||
/// After merge, all columns belonging to the same category are coerced to
|
||||
/// the same column type.
|
||||
///
|
||||
/// In practise, today, only Numerical columns are coerced into one type today.
|
||||
/// In practise, today, only Numerical colummns are coerced into one type today.
|
||||
///
|
||||
/// See also [README.md].
|
||||
///
|
||||
@@ -63,10 +63,11 @@ impl From<ColumnType> for ColumnTypeCategory {
|
||||
/// `require_columns` makes it possible to ensure that some columns will be present in the
|
||||
/// resulting columnar. When a required column is a numerical column type, one of two things can
|
||||
/// happen:
|
||||
/// - If the required column type is compatible with all of the input columnar, the resulting merged
|
||||
/// columnar will simply coerce the input column and use the required column type.
|
||||
/// - If the required column type is incompatible with one of the input columnar, the merged will
|
||||
/// fail with an InvalidData error.
|
||||
/// - If the required column type is compatible with all of the input columnar, the resulsting
|
||||
/// merged
|
||||
/// columnar will simply coerce the input column and use the required column type.
|
||||
/// - If the required column type is incompatible with one of the input columnar, the merged
|
||||
/// will fail with an InvalidData error.
|
||||
///
|
||||
/// `merge_row_order` makes it possible to remove or reorder row in the resulting
|
||||
/// `Columnar` table.
|
||||
@@ -80,12 +81,13 @@ pub fn merge_columnar(
|
||||
output: &mut impl io::Write,
|
||||
) -> io::Result<()> {
|
||||
let mut serializer = ColumnarSerializer::new(output);
|
||||
let num_docs_per_columnar = columnar_readers
|
||||
let num_rows_per_columnar = columnar_readers
|
||||
.iter()
|
||||
.map(|reader| reader.num_docs())
|
||||
.map(|reader| reader.num_rows())
|
||||
.collect::<Vec<u32>>();
|
||||
|
||||
let columns_to_merge = group_columns_for_merge(columnar_readers, required_columns)?;
|
||||
let columns_to_merge =
|
||||
group_columns_for_merge(columnar_readers, required_columns, &merge_row_order)?;
|
||||
for res in columns_to_merge {
|
||||
let ((column_name, _column_type_category), grouped_columns) = res;
|
||||
let grouped_columns = grouped_columns.open(&merge_row_order)?;
|
||||
@@ -93,18 +95,15 @@ pub fn merge_columnar(
|
||||
continue;
|
||||
}
|
||||
|
||||
let column_type_after_merge = grouped_columns.column_type_after_merge();
|
||||
let column_type = grouped_columns.column_type_after_merge();
|
||||
let mut columns = grouped_columns.columns;
|
||||
// Make sure the number of columns is the same as the number of columnar readers.
|
||||
// Or num_docs_per_columnar would be incorrect.
|
||||
assert_eq!(columns.len(), columnar_readers.len());
|
||||
coerce_columns(column_type_after_merge, &mut columns)?;
|
||||
coerce_columns(column_type, &mut columns)?;
|
||||
|
||||
let mut column_serializer =
|
||||
serializer.start_serialize_column(column_name.as_bytes(), column_type_after_merge);
|
||||
serializer.start_serialize_column(column_name.as_bytes(), column_type);
|
||||
merge_column(
|
||||
column_type_after_merge,
|
||||
&num_docs_per_columnar,
|
||||
column_type,
|
||||
&num_rows_per_columnar,
|
||||
columns,
|
||||
&merge_row_order,
|
||||
&mut column_serializer,
|
||||
@@ -130,7 +129,7 @@ fn dynamic_column_to_u64_monotonic(dynamic_column: DynamicColumn) -> Option<Colu
|
||||
fn merge_column(
|
||||
column_type: ColumnType,
|
||||
num_docs_per_column: &[u32],
|
||||
columns_to_merge: Vec<Option<DynamicColumn>>,
|
||||
columns: Vec<Option<DynamicColumn>>,
|
||||
merge_row_order: &MergeRowOrder,
|
||||
wrt: &mut impl io::Write,
|
||||
) -> io::Result<()> {
|
||||
@@ -140,10 +139,10 @@ fn merge_column(
|
||||
| ColumnType::F64
|
||||
| ColumnType::DateTime
|
||||
| ColumnType::Bool => {
|
||||
let mut column_indexes: Vec<ColumnIndex> = Vec::with_capacity(columns_to_merge.len());
|
||||
let mut column_indexes: Vec<ColumnIndex> = Vec::with_capacity(columns.len());
|
||||
let mut column_values: Vec<Option<Arc<dyn ColumnValues>>> =
|
||||
Vec::with_capacity(columns_to_merge.len());
|
||||
for (i, dynamic_column_opt) in columns_to_merge.into_iter().enumerate() {
|
||||
Vec::with_capacity(columns.len());
|
||||
for (i, dynamic_column_opt) in columns.into_iter().enumerate() {
|
||||
if let Some(Column { index: idx, values }) =
|
||||
dynamic_column_opt.and_then(dynamic_column_to_u64_monotonic)
|
||||
{
|
||||
@@ -166,10 +165,10 @@ fn merge_column(
|
||||
serialize_column_mappable_to_u64(merged_column_index, &merge_column_values, wrt)?;
|
||||
}
|
||||
ColumnType::IpAddr => {
|
||||
let mut column_indexes: Vec<ColumnIndex> = Vec::with_capacity(columns_to_merge.len());
|
||||
let mut column_indexes: Vec<ColumnIndex> = Vec::with_capacity(columns.len());
|
||||
let mut column_values: Vec<Option<Arc<dyn ColumnValues<Ipv6Addr>>>> =
|
||||
Vec::with_capacity(columns_to_merge.len());
|
||||
for (i, dynamic_column_opt) in columns_to_merge.into_iter().enumerate() {
|
||||
Vec::with_capacity(columns.len());
|
||||
for (i, dynamic_column_opt) in columns.into_iter().enumerate() {
|
||||
if let Some(DynamicColumn::IpAddr(Column { index: idx, values })) =
|
||||
dynamic_column_opt
|
||||
{
|
||||
@@ -194,10 +193,9 @@ fn merge_column(
|
||||
serialize_column_mappable_to_u128(merged_column_index, &merge_column_values, wrt)?;
|
||||
}
|
||||
ColumnType::Bytes | ColumnType::Str => {
|
||||
let mut column_indexes: Vec<ColumnIndex> = Vec::with_capacity(columns_to_merge.len());
|
||||
let mut bytes_columns: Vec<Option<BytesColumn>> =
|
||||
Vec::with_capacity(columns_to_merge.len());
|
||||
for (i, dynamic_column_opt) in columns_to_merge.into_iter().enumerate() {
|
||||
let mut column_indexes: Vec<ColumnIndex> = Vec::with_capacity(columns.len());
|
||||
let mut bytes_columns: Vec<Option<BytesColumn>> = Vec::with_capacity(columns.len());
|
||||
for (i, dynamic_column_opt) in columns.into_iter().enumerate() {
|
||||
match dynamic_column_opt {
|
||||
Some(DynamicColumn::Str(str_column)) => {
|
||||
column_indexes.push(str_column.term_ord_column.index.clone());
|
||||
@@ -251,7 +249,7 @@ impl GroupedColumns {
|
||||
if column_type.len() == 1 {
|
||||
return column_type.into_iter().next().unwrap();
|
||||
}
|
||||
// At the moment, only the numerical column type category has more than one possible
|
||||
// At the moment, only the numerical categorical column type has more than one possible
|
||||
// column type.
|
||||
assert!(self
|
||||
.columns
|
||||
@@ -364,7 +362,7 @@ fn is_empty_after_merge(
|
||||
ColumnIndex::Empty { .. } => true,
|
||||
ColumnIndex::Full => alive_bitset.len() == 0,
|
||||
ColumnIndex::Optional(optional_index) => {
|
||||
for doc in optional_index.iter_docs() {
|
||||
for doc in optional_index.iter_rows() {
|
||||
if alive_bitset.contains(doc) {
|
||||
return false;
|
||||
}
|
||||
@@ -394,6 +392,7 @@ fn is_empty_after_merge(
|
||||
fn group_columns_for_merge<'a>(
|
||||
columnar_readers: &'a [&'a ColumnarReader],
|
||||
required_columns: &'a [(String, ColumnType)],
|
||||
_merge_row_order: &'a MergeRowOrder,
|
||||
) -> io::Result<BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle>> {
|
||||
let mut columns: BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle> = BTreeMap::new();
|
||||
|
||||
|
||||
@@ -5,29 +5,28 @@ use sstable::TermOrdinal;
|
||||
|
||||
use crate::Streamer;
|
||||
|
||||
/// The terms of a column with the ordinal of the segment.
|
||||
pub struct TermsWithSegmentOrd<'a> {
|
||||
pub terms: Streamer<'a>,
|
||||
pub struct HeapItem<'a> {
|
||||
pub streamer: Streamer<'a>,
|
||||
pub segment_ord: usize,
|
||||
}
|
||||
|
||||
impl PartialEq for TermsWithSegmentOrd<'_> {
|
||||
impl<'a> PartialEq for HeapItem<'a> {
|
||||
fn eq(&self, other: &Self) -> bool {
|
||||
self.segment_ord == other.segment_ord
|
||||
}
|
||||
}
|
||||
|
||||
impl Eq for TermsWithSegmentOrd<'_> {}
|
||||
impl<'a> Eq for HeapItem<'a> {}
|
||||
|
||||
impl<'a> PartialOrd for TermsWithSegmentOrd<'a> {
|
||||
fn partial_cmp(&self, other: &TermsWithSegmentOrd<'a>) -> Option<Ordering> {
|
||||
impl<'a> PartialOrd for HeapItem<'a> {
|
||||
fn partial_cmp(&self, other: &HeapItem<'a>) -> Option<Ordering> {
|
||||
Some(self.cmp(other))
|
||||
}
|
||||
}
|
||||
|
||||
impl<'a> Ord for TermsWithSegmentOrd<'a> {
|
||||
fn cmp(&self, other: &TermsWithSegmentOrd<'a>) -> Ordering {
|
||||
(&other.terms.key(), &other.segment_ord).cmp(&(&self.terms.key(), &self.segment_ord))
|
||||
impl<'a> Ord for HeapItem<'a> {
|
||||
fn cmp(&self, other: &HeapItem<'a>) -> Ordering {
|
||||
(&other.streamer.key(), &other.segment_ord).cmp(&(&self.streamer.key(), &self.segment_ord))
|
||||
}
|
||||
}
|
||||
|
||||
@@ -36,34 +35,42 @@ impl<'a> Ord for TermsWithSegmentOrd<'a> {
|
||||
///
|
||||
/// The item yield is actually a pair with
|
||||
/// - the term
|
||||
/// - a slice with the ordinal of the segments containing the terms.
|
||||
/// - a slice with the ordinal of the segments containing
|
||||
/// the terms.
|
||||
pub struct TermMerger<'a> {
|
||||
heap: BinaryHeap<TermsWithSegmentOrd<'a>>,
|
||||
term_streams_with_segment: Vec<TermsWithSegmentOrd<'a>>,
|
||||
heap: BinaryHeap<HeapItem<'a>>,
|
||||
current_streamers: Vec<HeapItem<'a>>,
|
||||
}
|
||||
|
||||
impl<'a> TermMerger<'a> {
|
||||
/// Stream of merged term dictionary
|
||||
pub fn new(term_streams_with_segment: Vec<TermsWithSegmentOrd<'a>>) -> TermMerger<'a> {
|
||||
pub fn new(streams: Vec<Streamer<'a>>) -> TermMerger<'a> {
|
||||
TermMerger {
|
||||
heap: BinaryHeap::new(),
|
||||
term_streams_with_segment,
|
||||
current_streamers: streams
|
||||
.into_iter()
|
||||
.enumerate()
|
||||
.map(|(ord, streamer)| HeapItem {
|
||||
streamer,
|
||||
segment_ord: ord,
|
||||
})
|
||||
.collect(),
|
||||
}
|
||||
}
|
||||
|
||||
pub(crate) fn matching_segments<'b: 'a>(
|
||||
&'b self,
|
||||
) -> impl 'b + Iterator<Item = (usize, TermOrdinal)> {
|
||||
self.term_streams_with_segment
|
||||
self.current_streamers
|
||||
.iter()
|
||||
.map(|heap_item| (heap_item.segment_ord, heap_item.terms.term_ord()))
|
||||
.map(|heap_item| (heap_item.segment_ord, heap_item.streamer.term_ord()))
|
||||
}
|
||||
|
||||
fn advance_segments(&mut self) {
|
||||
let streamers = &mut self.term_streams_with_segment;
|
||||
let streamers = &mut self.current_streamers;
|
||||
let heap = &mut self.heap;
|
||||
for mut heap_item in streamers.drain(..) {
|
||||
if heap_item.terms.advance() {
|
||||
if heap_item.streamer.advance() {
|
||||
heap.push(heap_item);
|
||||
}
|
||||
}
|
||||
@@ -75,13 +82,13 @@ impl<'a> TermMerger<'a> {
|
||||
pub fn advance(&mut self) -> bool {
|
||||
self.advance_segments();
|
||||
if let Some(head) = self.heap.pop() {
|
||||
self.term_streams_with_segment.push(head);
|
||||
self.current_streamers.push(head);
|
||||
while let Some(next_streamer) = self.heap.peek() {
|
||||
if self.term_streams_with_segment[0].terms.key() != next_streamer.terms.key() {
|
||||
if self.current_streamers[0].streamer.key() != next_streamer.streamer.key() {
|
||||
break;
|
||||
}
|
||||
let next_heap_it = self.heap.pop().unwrap(); // safe : we peeked beforehand
|
||||
self.term_streams_with_segment.push(next_heap_it);
|
||||
self.current_streamers.push(next_heap_it);
|
||||
}
|
||||
true
|
||||
} else {
|
||||
@@ -95,6 +102,6 @@ impl<'a> TermMerger<'a> {
|
||||
/// if and only if advance() has been called before
|
||||
/// and "true" was returned.
|
||||
pub fn key(&self) -> &[u8] {
|
||||
self.term_streams_with_segment[0].terms.key()
|
||||
self.current_streamers[0].streamer.key()
|
||||
}
|
||||
}
|
||||
|
||||
@@ -1,10 +1,7 @@
|
||||
use itertools::Itertools;
|
||||
use proptest::collection::vec;
|
||||
use proptest::prelude::*;
|
||||
|
||||
use super::*;
|
||||
use crate::columnar::{merge_columnar, ColumnarReader, MergeRowOrder, StackMergeOrder};
|
||||
use crate::{Cardinality, ColumnarWriter, DynamicColumn, HasAssociatedColumnType, RowId};
|
||||
use crate::{Cardinality, ColumnarWriter, HasAssociatedColumnType, RowId};
|
||||
|
||||
fn make_columnar<T: Into<NumericalValue> + HasAssociatedColumnType + Copy>(
|
||||
column_name: &str,
|
||||
@@ -29,8 +26,9 @@ fn test_column_coercion_to_u64() {
|
||||
// u64 type
|
||||
let columnar2 = make_columnar("numbers", &[u64::MAX]);
|
||||
let columnars = &[&columnar1, &columnar2];
|
||||
let merge_order = StackMergeOrder::stack(columnars).into();
|
||||
let column_map: BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle> =
|
||||
group_columns_for_merge(columnars, &[]).unwrap();
|
||||
group_columns_for_merge(columnars, &[], &merge_order).unwrap();
|
||||
assert_eq!(column_map.len(), 1);
|
||||
assert!(column_map.contains_key(&("numbers".to_string(), ColumnTypeCategory::Numerical)));
|
||||
}
|
||||
@@ -40,8 +38,9 @@ fn test_column_coercion_to_i64() {
|
||||
let columnar1 = make_columnar("numbers", &[-1i64]);
|
||||
let columnar2 = make_columnar("numbers", &[2u64]);
|
||||
let columnars = &[&columnar1, &columnar2];
|
||||
let merge_order = StackMergeOrder::stack(columnars).into();
|
||||
let column_map: BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle> =
|
||||
group_columns_for_merge(columnars, &[]).unwrap();
|
||||
group_columns_for_merge(columnars, &[], &merge_order).unwrap();
|
||||
assert_eq!(column_map.len(), 1);
|
||||
assert!(column_map.contains_key(&("numbers".to_string(), ColumnTypeCategory::Numerical)));
|
||||
}
|
||||
@@ -64,8 +63,14 @@ fn test_group_columns_with_required_column() {
|
||||
let columnar1 = make_columnar("numbers", &[1i64]);
|
||||
let columnar2 = make_columnar("numbers", &[2u64]);
|
||||
let columnars = &[&columnar1, &columnar2];
|
||||
let merge_order = StackMergeOrder::stack(columnars).into();
|
||||
let column_map: BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle> =
|
||||
group_columns_for_merge(columnars, &[("numbers".to_string(), ColumnType::U64)]).unwrap();
|
||||
group_columns_for_merge(
|
||||
&[&columnar1, &columnar2],
|
||||
&[("numbers".to_string(), ColumnType::U64)],
|
||||
&merge_order,
|
||||
)
|
||||
.unwrap();
|
||||
assert_eq!(column_map.len(), 1);
|
||||
assert!(column_map.contains_key(&("numbers".to_string(), ColumnTypeCategory::Numerical)));
|
||||
}
|
||||
@@ -75,9 +80,13 @@ fn test_group_columns_required_column_with_no_existing_columns() {
|
||||
let columnar1 = make_columnar("numbers", &[2u64]);
|
||||
let columnar2 = make_columnar("numbers", &[2u64]);
|
||||
let columnars = &[&columnar1, &columnar2];
|
||||
let column_map: BTreeMap<_, _> =
|
||||
group_columns_for_merge(columnars, &[("required_col".to_string(), ColumnType::Str)])
|
||||
.unwrap();
|
||||
let merge_order = StackMergeOrder::stack(columnars).into();
|
||||
let column_map: BTreeMap<_, _> = group_columns_for_merge(
|
||||
columnars,
|
||||
&[("required_col".to_string(), ColumnType::Str)],
|
||||
&merge_order,
|
||||
)
|
||||
.unwrap();
|
||||
assert_eq!(column_map.len(), 2);
|
||||
let columns = &column_map
|
||||
.get(&("required_col".to_string(), ColumnTypeCategory::Str))
|
||||
@@ -93,8 +102,14 @@ fn test_group_columns_required_column_is_above_all_columns_have_the_same_type_ru
|
||||
let columnar1 = make_columnar("numbers", &[2i64]);
|
||||
let columnar2 = make_columnar("numbers", &[2i64]);
|
||||
let columnars = &[&columnar1, &columnar2];
|
||||
let merge_order = StackMergeOrder::stack(columnars).into();
|
||||
let column_map: BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle> =
|
||||
group_columns_for_merge(columnars, &[("numbers".to_string(), ColumnType::U64)]).unwrap();
|
||||
group_columns_for_merge(
|
||||
columnars,
|
||||
&[("numbers".to_string(), ColumnType::U64)],
|
||||
&merge_order,
|
||||
)
|
||||
.unwrap();
|
||||
assert_eq!(column_map.len(), 1);
|
||||
assert!(column_map.contains_key(&("numbers".to_string(), ColumnTypeCategory::Numerical)));
|
||||
}
|
||||
@@ -104,8 +119,9 @@ fn test_missing_column() {
|
||||
let columnar1 = make_columnar("numbers", &[-1i64]);
|
||||
let columnar2 = make_columnar("numbers2", &[2u64]);
|
||||
let columnars = &[&columnar1, &columnar2];
|
||||
let merge_order = StackMergeOrder::stack(columnars).into();
|
||||
let column_map: BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle> =
|
||||
group_columns_for_merge(columnars, &[]).unwrap();
|
||||
group_columns_for_merge(columnars, &[], &merge_order).unwrap();
|
||||
assert_eq!(column_map.len(), 2);
|
||||
assert!(column_map.contains_key(&("numbers".to_string(), ColumnTypeCategory::Numerical)));
|
||||
{
|
||||
@@ -208,7 +224,7 @@ fn test_merge_columnar_numbers() {
|
||||
)
|
||||
.unwrap();
|
||||
let columnar_reader = ColumnarReader::open(buffer).unwrap();
|
||||
assert_eq!(columnar_reader.num_docs(), 3);
|
||||
assert_eq!(columnar_reader.num_rows(), 3);
|
||||
assert_eq!(columnar_reader.num_columns(), 1);
|
||||
let cols = columnar_reader.read_columns("numbers").unwrap();
|
||||
let dynamic_column = cols[0].open().unwrap();
|
||||
@@ -236,7 +252,7 @@ fn test_merge_columnar_texts() {
|
||||
)
|
||||
.unwrap();
|
||||
let columnar_reader = ColumnarReader::open(buffer).unwrap();
|
||||
assert_eq!(columnar_reader.num_docs(), 3);
|
||||
assert_eq!(columnar_reader.num_rows(), 3);
|
||||
assert_eq!(columnar_reader.num_columns(), 1);
|
||||
let cols = columnar_reader.read_columns("texts").unwrap();
|
||||
let dynamic_column = cols[0].open().unwrap();
|
||||
@@ -285,7 +301,7 @@ fn test_merge_columnar_byte() {
|
||||
)
|
||||
.unwrap();
|
||||
let columnar_reader = ColumnarReader::open(buffer).unwrap();
|
||||
assert_eq!(columnar_reader.num_docs(), 4);
|
||||
assert_eq!(columnar_reader.num_rows(), 4);
|
||||
assert_eq!(columnar_reader.num_columns(), 1);
|
||||
let cols = columnar_reader.read_columns("bytes").unwrap();
|
||||
let dynamic_column = cols[0].open().unwrap();
|
||||
@@ -341,7 +357,7 @@ fn test_merge_columnar_byte_with_missing() {
|
||||
)
|
||||
.unwrap();
|
||||
let columnar_reader = ColumnarReader::open(buffer).unwrap();
|
||||
assert_eq!(columnar_reader.num_docs(), 3 + 2 + 3);
|
||||
assert_eq!(columnar_reader.num_rows(), 3 + 2 + 3);
|
||||
assert_eq!(columnar_reader.num_columns(), 2);
|
||||
let cols = columnar_reader.read_columns("col").unwrap();
|
||||
let dynamic_column = cols[0].open().unwrap();
|
||||
@@ -393,7 +409,7 @@ fn test_merge_columnar_different_types() {
|
||||
)
|
||||
.unwrap();
|
||||
let columnar_reader = ColumnarReader::open(buffer).unwrap();
|
||||
assert_eq!(columnar_reader.num_docs(), 4);
|
||||
assert_eq!(columnar_reader.num_rows(), 4);
|
||||
assert_eq!(columnar_reader.num_columns(), 2);
|
||||
let cols = columnar_reader.read_columns("mixed").unwrap();
|
||||
|
||||
@@ -403,11 +419,11 @@ fn test_merge_columnar_different_types() {
|
||||
panic!()
|
||||
};
|
||||
assert_eq!(vals.get_cardinality(), Cardinality::Optional);
|
||||
assert_eq!(vals.values_for_doc(0).collect_vec(), Vec::<i64>::new());
|
||||
assert_eq!(vals.values_for_doc(1).collect_vec(), Vec::<i64>::new());
|
||||
assert_eq!(vals.values_for_doc(2).collect_vec(), Vec::<i64>::new());
|
||||
assert_eq!(vals.values_for_doc(0).collect_vec(), vec![]);
|
||||
assert_eq!(vals.values_for_doc(1).collect_vec(), vec![]);
|
||||
assert_eq!(vals.values_for_doc(2).collect_vec(), vec![]);
|
||||
assert_eq!(vals.values_for_doc(3).collect_vec(), vec![1]);
|
||||
assert_eq!(vals.values_for_doc(4).collect_vec(), Vec::<i64>::new());
|
||||
assert_eq!(vals.values_for_doc(4).collect_vec(), vec![]);
|
||||
|
||||
// text column
|
||||
let dynamic_column = cols[1].open().unwrap();
|
||||
@@ -458,7 +474,7 @@ fn test_merge_columnar_different_empty_cardinality() {
|
||||
)
|
||||
.unwrap();
|
||||
let columnar_reader = ColumnarReader::open(buffer).unwrap();
|
||||
assert_eq!(columnar_reader.num_docs(), 2);
|
||||
assert_eq!(columnar_reader.num_rows(), 2);
|
||||
assert_eq!(columnar_reader.num_columns(), 2);
|
||||
let cols = columnar_reader.read_columns("mixed").unwrap();
|
||||
|
||||
@@ -470,119 +486,3 @@ fn test_merge_columnar_different_empty_cardinality() {
|
||||
let dynamic_column = cols[1].open().unwrap();
|
||||
assert_eq!(dynamic_column.get_cardinality(), Cardinality::Optional);
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone)]
|
||||
struct ColumnSpec {
|
||||
column_name: String,
|
||||
/// (row_id, term)
|
||||
terms: Vec<(RowId, Vec<u8>)>,
|
||||
}
|
||||
|
||||
#[derive(Clone, Debug)]
|
||||
struct ColumnarSpec {
|
||||
columns: Vec<ColumnSpec>,
|
||||
}
|
||||
|
||||
/// Generate a random (row_id, term) pair:
|
||||
/// - row_id in [0..10]
|
||||
/// - term is either from POSSIBLE_TERMS or random bytes
|
||||
fn rowid_and_term_strategy() -> impl Strategy<Value = (RowId, Vec<u8>)> {
|
||||
const POSSIBLE_TERMS: &[&[u8]] = &[b"a", b"b", b"allo"];
|
||||
|
||||
let term_strat = prop_oneof![
|
||||
// pick from the fixed list
|
||||
(0..POSSIBLE_TERMS.len()).prop_map(|i| POSSIBLE_TERMS[i].to_vec()),
|
||||
// or random bytes (length 0..10)
|
||||
prop::collection::vec(any::<u8>(), 0..10),
|
||||
];
|
||||
|
||||
(0u32..11, term_strat)
|
||||
}
|
||||
|
||||
/// Generate one ColumnSpec, with a random name and a random list of (row_id, term).
|
||||
/// We sort it by row_id so that data is in ascending order.
|
||||
fn column_spec_strategy() -> impl Strategy<Value = ColumnSpec> {
|
||||
let column_name = prop_oneof![
|
||||
Just("col".to_string()),
|
||||
Just("col2".to_string()),
|
||||
"col.*".prop_map(|s| s),
|
||||
];
|
||||
|
||||
// We'll produce 0..8 (rowid,term) entries for this column
|
||||
let data_strat = vec(rowid_and_term_strategy(), 0..8).prop_map(|mut pairs| {
|
||||
// Sort by row_id
|
||||
pairs.sort_by_key(|(row_id, _)| *row_id);
|
||||
pairs
|
||||
});
|
||||
|
||||
(column_name, data_strat).prop_map(|(name, data)| ColumnSpec {
|
||||
column_name: name,
|
||||
terms: data,
|
||||
})
|
||||
}
|
||||
|
||||
/// Strategy to generate an ColumnarSpec
|
||||
fn columnar_strategy() -> impl Strategy<Value = ColumnarSpec> {
|
||||
vec(column_spec_strategy(), 0..3).prop_map(|columns| ColumnarSpec { columns })
|
||||
}
|
||||
|
||||
/// Strategy to generate multiple ColumnarSpecs, each of which we will treat
|
||||
/// as one "columnar" to be merged together.
|
||||
fn columnars_strategy() -> impl Strategy<Value = Vec<ColumnarSpec>> {
|
||||
vec(columnar_strategy(), 1..4)
|
||||
}
|
||||
|
||||
/// Build a `ColumnarReader` from a `ColumnarSpec`
|
||||
fn build_columnar(spec: &ColumnarSpec) -> ColumnarReader {
|
||||
let mut writer = ColumnarWriter::default();
|
||||
let mut max_row_id = 0;
|
||||
for col in &spec.columns {
|
||||
for &(row_id, ref term) in &col.terms {
|
||||
writer.record_bytes(row_id, &col.column_name, term);
|
||||
max_row_id = max_row_id.max(row_id);
|
||||
}
|
||||
}
|
||||
|
||||
let mut buffer = Vec::new();
|
||||
writer.serialize(max_row_id + 1, &mut buffer).unwrap();
|
||||
ColumnarReader::open(buffer).unwrap()
|
||||
}
|
||||
|
||||
proptest! {
|
||||
// We just test that the merge_columnar function doesn't crash.
|
||||
#![proptest_config(ProptestConfig::with_cases(256))]
|
||||
#[test]
|
||||
fn test_merge_columnar_bytes_no_crash(columnars in columnars_strategy(), second_merge_columnars in columnars_strategy()) {
|
||||
let columnars: Vec<ColumnarReader> = columnars.iter()
|
||||
.map(build_columnar)
|
||||
.collect();
|
||||
|
||||
let mut out = Vec::new();
|
||||
let columnar_refs: Vec<&ColumnarReader> = columnars.iter().collect();
|
||||
let stack_merge_order = StackMergeOrder::stack(&columnar_refs);
|
||||
merge_columnar(
|
||||
&columnar_refs,
|
||||
&[],
|
||||
MergeRowOrder::Stack(stack_merge_order),
|
||||
&mut out,
|
||||
).unwrap();
|
||||
|
||||
let merged_reader = ColumnarReader::open(out).unwrap();
|
||||
|
||||
// Merge the second set of columnars with the result of the first merge
|
||||
let mut columnars: Vec<ColumnarReader> = second_merge_columnars.iter()
|
||||
.map(build_columnar)
|
||||
.collect();
|
||||
columnars.push(merged_reader);
|
||||
let mut out = Vec::new();
|
||||
let columnar_refs: Vec<&ColumnarReader> = columnars.iter().collect();
|
||||
let stack_merge_order = StackMergeOrder::stack(&columnar_refs);
|
||||
merge_columnar(
|
||||
&columnar_refs,
|
||||
&[],
|
||||
MergeRowOrder::Stack(stack_merge_order),
|
||||
&mut out,
|
||||
).unwrap();
|
||||
|
||||
}
|
||||
}
|
||||
|
||||
@@ -1,7 +1,6 @@
|
||||
use std::{fmt, io, mem};
|
||||
|
||||
use common::file_slice::FileSlice;
|
||||
use common::json_path_writer::JSON_PATH_SEGMENT_SEP;
|
||||
use common::BinarySerializable;
|
||||
use sstable::{Dictionary, RangeSSTable};
|
||||
|
||||
@@ -19,13 +18,13 @@ fn io_invalid_data(msg: String) -> io::Error {
|
||||
pub struct ColumnarReader {
|
||||
column_dictionary: Dictionary<RangeSSTable>,
|
||||
column_data: FileSlice,
|
||||
num_docs: RowId,
|
||||
num_rows: RowId,
|
||||
format_version: Version,
|
||||
}
|
||||
|
||||
impl fmt::Debug for ColumnarReader {
|
||||
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
|
||||
let num_rows = self.num_docs();
|
||||
let num_rows = self.num_rows();
|
||||
let columns = self.list_columns().unwrap();
|
||||
let num_cols = columns.len();
|
||||
let mut debug_struct = f.debug_struct("Columnar");
|
||||
@@ -77,19 +76,6 @@ fn read_all_columns_in_stream(
|
||||
Ok(results)
|
||||
}
|
||||
|
||||
fn column_dictionary_prefix_for_column_name(column_name: &str) -> String {
|
||||
// Each column is a associated to a given `column_key`,
|
||||
// that starts by `column_name\0column_header`.
|
||||
//
|
||||
// Listing the columns associated to the given column name is therefore equivalent to
|
||||
// listing `column_key` with the prefix `column_name\0`.
|
||||
format!("{}{}", column_name, '\0')
|
||||
}
|
||||
|
||||
fn column_dictionary_prefix_for_subpath(root_path: &str) -> String {
|
||||
format!("{}{}", root_path, JSON_PATH_SEGMENT_SEP as char)
|
||||
}
|
||||
|
||||
impl ColumnarReader {
|
||||
/// Opens a new Columnar file.
|
||||
pub fn open<F>(file_slice: F) -> io::Result<ColumnarReader>
|
||||
@@ -112,13 +98,13 @@ impl ColumnarReader {
|
||||
Ok(ColumnarReader {
|
||||
column_dictionary,
|
||||
column_data,
|
||||
num_docs: num_rows,
|
||||
num_rows,
|
||||
format_version,
|
||||
})
|
||||
}
|
||||
|
||||
pub fn num_docs(&self) -> RowId {
|
||||
self.num_docs
|
||||
pub fn num_rows(&self) -> RowId {
|
||||
self.num_rows
|
||||
}
|
||||
// Iterate over the columns in a sorted way
|
||||
pub fn iter_columns(
|
||||
@@ -158,14 +144,32 @@ impl ColumnarReader {
|
||||
Ok(self.iter_columns()?.collect())
|
||||
}
|
||||
|
||||
fn stream_for_column_range(&self, column_name: &str) -> sstable::StreamerBuilder<RangeSSTable> {
|
||||
// Each column is a associated to a given `column_key`,
|
||||
// that starts by `column_name\0column_header`.
|
||||
//
|
||||
// Listing the columns associated to the given column name is therefore equivalent to
|
||||
// listing `column_key` with the prefix `column_name\0`.
|
||||
//
|
||||
// This is in turn equivalent to searching for the range
|
||||
// `[column_name,\0`..column_name\1)`.
|
||||
// TODO can we get some more generic `prefix(..)` logic in the dictionary.
|
||||
let mut start_key = column_name.to_string();
|
||||
start_key.push('\0');
|
||||
let mut end_key = column_name.to_string();
|
||||
end_key.push(1u8 as char);
|
||||
self.column_dictionary
|
||||
.range()
|
||||
.ge(start_key.as_bytes())
|
||||
.lt(end_key.as_bytes())
|
||||
}
|
||||
|
||||
pub async fn read_columns_async(
|
||||
&self,
|
||||
column_name: &str,
|
||||
) -> io::Result<Vec<DynamicColumnHandle>> {
|
||||
let prefix = column_dictionary_prefix_for_column_name(column_name);
|
||||
let stream = self
|
||||
.column_dictionary
|
||||
.prefix_range(prefix)
|
||||
.stream_for_column_range(column_name)
|
||||
.into_stream_async()
|
||||
.await?;
|
||||
read_all_columns_in_stream(stream, &self.column_data, self.format_version)
|
||||
@@ -176,35 +180,7 @@ impl ColumnarReader {
|
||||
/// There can be more than one column associated to a given column name, provided they have
|
||||
/// different types.
|
||||
pub fn read_columns(&self, column_name: &str) -> io::Result<Vec<DynamicColumnHandle>> {
|
||||
let prefix = column_dictionary_prefix_for_column_name(column_name);
|
||||
let stream = self.column_dictionary.prefix_range(prefix).into_stream()?;
|
||||
read_all_columns_in_stream(stream, &self.column_data, self.format_version)
|
||||
}
|
||||
|
||||
pub async fn read_subpath_columns_async(
|
||||
&self,
|
||||
root_path: &str,
|
||||
) -> io::Result<Vec<DynamicColumnHandle>> {
|
||||
let prefix = column_dictionary_prefix_for_subpath(root_path);
|
||||
let stream = self
|
||||
.column_dictionary
|
||||
.prefix_range(prefix)
|
||||
.into_stream_async()
|
||||
.await?;
|
||||
read_all_columns_in_stream(stream, &self.column_data, self.format_version)
|
||||
}
|
||||
|
||||
/// Get all inner columns for a given JSON prefix, i.e columns for which the name starts
|
||||
/// with the prefix then contain the [`JSON_PATH_SEGMENT_SEP`].
|
||||
///
|
||||
/// There can be more than one column associated to each path within the JSON structure,
|
||||
/// provided they have different types.
|
||||
pub fn read_subpath_columns(&self, root_path: &str) -> io::Result<Vec<DynamicColumnHandle>> {
|
||||
let prefix = column_dictionary_prefix_for_subpath(root_path);
|
||||
let stream = self
|
||||
.column_dictionary
|
||||
.prefix_range(prefix.as_bytes())
|
||||
.into_stream()?;
|
||||
let stream = self.stream_for_column_range(column_name).into_stream()?;
|
||||
read_all_columns_in_stream(stream, &self.column_data, self.format_version)
|
||||
}
|
||||
|
||||
@@ -216,8 +192,6 @@ impl ColumnarReader {
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use common::json_path_writer::JSON_PATH_SEGMENT_SEP;
|
||||
|
||||
use crate::{ColumnType, ColumnarReader, ColumnarWriter};
|
||||
|
||||
#[test]
|
||||
@@ -250,64 +224,6 @@ mod tests {
|
||||
assert_eq!(columns[0].1.column_type(), ColumnType::U64);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_read_columns() {
|
||||
let mut columnar_writer = ColumnarWriter::default();
|
||||
columnar_writer.record_column_type("col", ColumnType::U64, false);
|
||||
columnar_writer.record_numerical(1, "col", 1u64);
|
||||
let mut buffer = Vec::new();
|
||||
columnar_writer.serialize(2, &mut buffer).unwrap();
|
||||
let columnar = ColumnarReader::open(buffer).unwrap();
|
||||
{
|
||||
let columns = columnar.read_columns("col").unwrap();
|
||||
assert_eq!(columns.len(), 1);
|
||||
assert_eq!(columns[0].column_type(), ColumnType::U64);
|
||||
}
|
||||
{
|
||||
let columns = columnar.read_columns("other").unwrap();
|
||||
assert_eq!(columns.len(), 0);
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_read_subpath_columns() {
|
||||
let mut columnar_writer = ColumnarWriter::default();
|
||||
columnar_writer.record_str(
|
||||
0,
|
||||
&format!("col1{}subcol1", JSON_PATH_SEGMENT_SEP as char),
|
||||
"hello",
|
||||
);
|
||||
columnar_writer.record_numerical(
|
||||
0,
|
||||
&format!("col1{}subcol2", JSON_PATH_SEGMENT_SEP as char),
|
||||
1i64,
|
||||
);
|
||||
columnar_writer.record_str(1, "col1", "hello");
|
||||
columnar_writer.record_str(0, "col2", "hello");
|
||||
let mut buffer = Vec::new();
|
||||
columnar_writer.serialize(2, &mut buffer).unwrap();
|
||||
|
||||
let columnar = ColumnarReader::open(buffer).unwrap();
|
||||
{
|
||||
let columns = columnar.read_subpath_columns("col1").unwrap();
|
||||
assert_eq!(columns.len(), 2);
|
||||
assert_eq!(columns[0].column_type(), ColumnType::Str);
|
||||
assert_eq!(columns[1].column_type(), ColumnType::I64);
|
||||
}
|
||||
{
|
||||
let columns = columnar.read_subpath_columns("col1.subcol1").unwrap();
|
||||
assert_eq!(columns.len(), 0);
|
||||
}
|
||||
{
|
||||
let columns = columnar.read_subpath_columns("col2").unwrap();
|
||||
assert_eq!(columns.len(), 0);
|
||||
}
|
||||
{
|
||||
let columns = columnar.read_subpath_columns("other").unwrap();
|
||||
assert_eq!(columns.len(), 0);
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
#[should_panic(expected = "Input type forbidden")]
|
||||
fn test_list_columns_strict_typing_panics_on_wrong_types() {
|
||||
|
||||
@@ -87,7 +87,7 @@ impl<V: SymbolValue> ColumnOperation<V> {
|
||||
minibuf
|
||||
}
|
||||
|
||||
/// Deserialize a column operation.
|
||||
/// Deserialize a colummn operation.
|
||||
/// Returns None if the buffer is empty.
|
||||
///
|
||||
/// Panics if the payload is invalid:
|
||||
@@ -122,6 +122,7 @@ impl<T> From<T> for ColumnOperation<T> {
|
||||
// In order to limit memory usage, and in order
|
||||
// to benefit from the stacker, we do this by serialization our data
|
||||
// as "Symbols".
|
||||
#[allow(clippy::from_over_into)]
|
||||
pub(super) trait SymbolValue: Clone + Copy {
|
||||
// Serializes the symbol into the given buffer.
|
||||
// Returns the number of bytes written into the buffer.
|
||||
|
||||
@@ -8,7 +8,6 @@ use std::net::Ipv6Addr;
|
||||
|
||||
use column_operation::ColumnOperation;
|
||||
pub(crate) use column_writers::CompatibleNumericalTypes;
|
||||
use common::json_path_writer::JSON_END_OF_PATH;
|
||||
use common::CountingWriter;
|
||||
pub(crate) use serializer::ColumnarSerializer;
|
||||
use stacker::{Addr, ArenaHashMap, MemoryArena};
|
||||
@@ -284,16 +283,12 @@ impl ColumnarWriter {
|
||||
.iter()
|
||||
.map(|(column_name, addr)| (column_name, ColumnType::DateTime, addr)),
|
||||
);
|
||||
// TODO: replace JSON_END_OF_PATH with b'0' in columns
|
||||
columns.sort_unstable_by_key(|(column_name, col_type, _)| (*column_name, *col_type));
|
||||
|
||||
let (arena, buffers, dictionaries) = (&self.arena, &mut self.buffers, &self.dictionaries);
|
||||
let mut symbol_byte_buffer: Vec<u8> = Vec::new();
|
||||
for (column_name, column_type, addr) in columns {
|
||||
if column_name.contains(&JSON_END_OF_PATH) {
|
||||
// Tantivy uses b'0' as a separator for nested fields in JSON.
|
||||
// Column names with a b'0' are not simply ignored by the columnar (and the inverted
|
||||
// index).
|
||||
continue;
|
||||
}
|
||||
match column_type {
|
||||
ColumnType::Bool => {
|
||||
let column_writer: ColumnWriter = self.bool_field_hash_map.read(addr);
|
||||
@@ -391,7 +386,7 @@ impl ColumnarWriter {
|
||||
|
||||
// Serialize [Dictionary, Column, dictionary num bytes U32::LE]
|
||||
// Column: [Column Index, Column Values, column index num bytes U32::LE]
|
||||
#[expect(clippy::too_many_arguments)]
|
||||
#[allow(clippy::too_many_arguments)]
|
||||
fn serialize_bytes_or_str_column(
|
||||
cardinality: Cardinality,
|
||||
num_docs: RowId,
|
||||
|
||||
@@ -67,7 +67,7 @@ pub struct ColumnSerializer<'a, W: io::Write> {
|
||||
start_offset: u64,
|
||||
}
|
||||
|
||||
impl<W: io::Write> ColumnSerializer<'_, W> {
|
||||
impl<'a, W: io::Write> ColumnSerializer<'a, W> {
|
||||
pub fn finalize(self) -> io::Result<()> {
|
||||
let end_offset: u64 = self.columnar_serializer.wrt.written_bytes();
|
||||
let byte_range = self.start_offset..end_offset;
|
||||
@@ -80,7 +80,7 @@ impl<W: io::Write> ColumnSerializer<'_, W> {
|
||||
}
|
||||
}
|
||||
|
||||
impl<W: io::Write> io::Write for ColumnSerializer<'_, W> {
|
||||
impl<'a, W: io::Write> io::Write for ColumnSerializer<'a, W> {
|
||||
fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
|
||||
self.columnar_serializer.wrt.write(buf)
|
||||
}
|
||||
@@ -93,3 +93,18 @@ impl<W: io::Write> io::Write for ColumnSerializer<'_, W> {
|
||||
self.columnar_serializer.wrt.write_all(buf)
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn test_prepare_key_bytes() {
|
||||
let mut buffer: Vec<u8> = b"somegarbage".to_vec();
|
||||
prepare_key(b"root\0child", ColumnType::Str, &mut buffer);
|
||||
assert_eq!(buffer.len(), 12);
|
||||
assert_eq!(&buffer[..10], b"root0child");
|
||||
assert_eq!(buffer[10], 0u8);
|
||||
assert_eq!(buffer[11], ColumnType::Str.to_code());
|
||||
}
|
||||
}
|
||||
|
||||
@@ -7,7 +7,7 @@ pub trait Iterable<T = u64> {
|
||||
fn boxed_iter(&self) -> Box<dyn Iterator<Item = T> + '_>;
|
||||
}
|
||||
|
||||
impl<T: Copy> Iterable<T> for &[T] {
|
||||
impl<'a, T: Copy> Iterable<T> for &'a [T] {
|
||||
fn boxed_iter(&self) -> Box<dyn Iterator<Item = T> + '_> {
|
||||
Box::new(self.iter().copied())
|
||||
}
|
||||
|
||||
@@ -380,7 +380,7 @@ fn assert_columnar_eq(
|
||||
right: &ColumnarReader,
|
||||
lenient_on_numerical_value: bool,
|
||||
) {
|
||||
assert_eq!(left.num_docs(), right.num_docs());
|
||||
assert_eq!(left.num_rows(), right.num_rows());
|
||||
let left_columns = left.list_columns().unwrap();
|
||||
let right_columns = right.list_columns().unwrap();
|
||||
assert_eq!(left_columns.len(), right_columns.len());
|
||||
@@ -588,7 +588,7 @@ proptest! {
|
||||
#[test]
|
||||
fn test_single_columnar_builder_proptest(docs in columnar_docs_strategy()) {
|
||||
let columnar = build_columnar(&docs[..]);
|
||||
assert_eq!(columnar.num_docs() as usize, docs.len());
|
||||
assert_eq!(columnar.num_rows() as usize, docs.len());
|
||||
let mut expected_columns: HashMap<(&str, ColumnTypeCategory), HashMap<u32, Vec<&ColumnValue>> > = Default::default();
|
||||
for (doc_id, doc_vals) in docs.iter().enumerate() {
|
||||
for (col_name, col_val) in doc_vals {
|
||||
@@ -715,7 +715,6 @@ fn test_columnar_merging_number_columns() {
|
||||
// TODO test required_columns
|
||||
// TODO document edge case: required_columns incompatible with values.
|
||||
|
||||
#[allow(clippy::type_complexity)]
|
||||
fn columnar_docs_and_remap(
|
||||
) -> impl Strategy<Value = (Vec<Vec<Vec<(&'static str, ColumnValue)>>>, Vec<RowAddr>)> {
|
||||
proptest::collection::vec(columnar_docs_strategy(), 2..=3).prop_flat_map(
|
||||
@@ -820,7 +819,7 @@ fn test_columnar_merge_empty() {
|
||||
)
|
||||
.unwrap();
|
||||
let merged_columnar = ColumnarReader::open(output).unwrap();
|
||||
assert_eq!(merged_columnar.num_docs(), 0);
|
||||
assert_eq!(merged_columnar.num_rows(), 0);
|
||||
assert_eq!(merged_columnar.num_columns(), 0);
|
||||
}
|
||||
|
||||
@@ -846,7 +845,7 @@ fn test_columnar_merge_single_str_column() {
|
||||
)
|
||||
.unwrap();
|
||||
let merged_columnar = ColumnarReader::open(output).unwrap();
|
||||
assert_eq!(merged_columnar.num_docs(), 1);
|
||||
assert_eq!(merged_columnar.num_rows(), 1);
|
||||
assert_eq!(merged_columnar.num_columns(), 1);
|
||||
}
|
||||
|
||||
@@ -878,7 +877,7 @@ fn test_delete_decrease_cardinality() {
|
||||
)
|
||||
.unwrap();
|
||||
let merged_columnar = ColumnarReader::open(output).unwrap();
|
||||
assert_eq!(merged_columnar.num_docs(), 1);
|
||||
assert_eq!(merged_columnar.num_rows(), 1);
|
||||
assert_eq!(merged_columnar.num_columns(), 1);
|
||||
let cols = merged_columnar.read_columns("c").unwrap();
|
||||
assert_eq!(cols.len(), 1);
|
||||
|
||||
@@ -17,31 +17,6 @@ impl NumericalValue {
|
||||
NumericalValue::F64(_) => NumericalType::F64,
|
||||
}
|
||||
}
|
||||
|
||||
/// Tries to normalize the numerical value in the following priorities:
|
||||
/// i64, i64, f64
|
||||
pub fn normalize(self) -> Self {
|
||||
match self {
|
||||
NumericalValue::U64(val) => {
|
||||
if val <= i64::MAX as u64 {
|
||||
NumericalValue::I64(val as i64)
|
||||
} else {
|
||||
NumericalValue::F64(val as f64)
|
||||
}
|
||||
}
|
||||
NumericalValue::I64(val) => NumericalValue::I64(val),
|
||||
NumericalValue::F64(val) => {
|
||||
let fract = val.fract();
|
||||
if fract == 0.0 && val >= i64::MIN as f64 && val <= i64::MAX as f64 {
|
||||
NumericalValue::I64(val as i64)
|
||||
} else if fract == 0.0 && val >= u64::MIN as f64 && val <= u64::MAX as f64 {
|
||||
NumericalValue::U64(val as u64)
|
||||
} else {
|
||||
NumericalValue::F64(val)
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl From<u64> for NumericalValue {
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
[package]
|
||||
name = "tantivy-common"
|
||||
version = "0.9.0"
|
||||
version = "0.7.0"
|
||||
authors = ["Paul Masurel <paul@quickwit.io>", "Pascal Seitz <pascal@quickwit.io>"]
|
||||
license = "MIT"
|
||||
edition = "2021"
|
||||
@@ -9,17 +9,19 @@ documentation = "https://docs.rs/tantivy_common/"
|
||||
homepage = "https://github.com/quickwit-oss/tantivy"
|
||||
repository = "https://github.com/quickwit-oss/tantivy"
|
||||
|
||||
|
||||
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
|
||||
|
||||
[dependencies]
|
||||
byteorder = "1.4.3"
|
||||
ownedbytes = { version= "0.9", path="../ownedbytes" }
|
||||
ownedbytes = { version= "0.7", path="../ownedbytes" }
|
||||
async-trait = "0.1"
|
||||
time = { version = "0.3.10", features = ["serde-well-known"] }
|
||||
serde = { version = "1.0.136", features = ["derive"] }
|
||||
|
||||
[dev-dependencies]
|
||||
binggan = "0.14.0"
|
||||
proptest = "1.0.0"
|
||||
rand = "0.8.4"
|
||||
|
||||
[features]
|
||||
unstable = [] # useful for benches.
|
||||
|
||||
@@ -1,64 +1,39 @@
|
||||
use binggan::{black_box, BenchRunner};
|
||||
use rand::seq::IteratorRandom;
|
||||
use rand::thread_rng;
|
||||
use tantivy_common::{serialize_vint_u32, BitSet, TinySet};
|
||||
#![feature(test)]
|
||||
|
||||
fn bench_vint() {
|
||||
let mut runner = BenchRunner::new();
|
||||
extern crate test;
|
||||
|
||||
let vals: Vec<u32> = (0..20_000).collect();
|
||||
runner.bench_function("bench_vint", move |_| {
|
||||
let mut out = 0u64;
|
||||
for val in vals.iter().cloned() {
|
||||
let mut buf = [0u8; 8];
|
||||
serialize_vint_u32(val, &mut buf);
|
||||
out += u64::from(buf[0]);
|
||||
}
|
||||
black_box(out);
|
||||
});
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use rand::seq::IteratorRandom;
|
||||
use rand::thread_rng;
|
||||
use tantivy_common::serialize_vint_u32;
|
||||
use test::Bencher;
|
||||
|
||||
let vals: Vec<u32> = (0..20_000).choose_multiple(&mut thread_rng(), 100_000);
|
||||
runner.bench_function("bench_vint_rand", move |_| {
|
||||
let mut out = 0u64;
|
||||
for val in vals.iter().cloned() {
|
||||
let mut buf = [0u8; 8];
|
||||
serialize_vint_u32(val, &mut buf);
|
||||
out += u64::from(buf[0]);
|
||||
}
|
||||
black_box(out);
|
||||
});
|
||||
}
|
||||
|
||||
fn bench_bitset() {
|
||||
let mut runner = BenchRunner::new();
|
||||
|
||||
runner.bench_function("bench_tinyset_pop", move |_| {
|
||||
let mut tinyset = TinySet::singleton(black_box(31u32));
|
||||
tinyset.pop_lowest();
|
||||
tinyset.pop_lowest();
|
||||
tinyset.pop_lowest();
|
||||
tinyset.pop_lowest();
|
||||
tinyset.pop_lowest();
|
||||
tinyset.pop_lowest();
|
||||
black_box(tinyset);
|
||||
});
|
||||
|
||||
let tiny_set = TinySet::empty().insert(10u32).insert(14u32).insert(21u32);
|
||||
runner.bench_function("bench_tinyset_sum", move |_| {
|
||||
assert_eq!(black_box(tiny_set).into_iter().sum::<u32>(), 45u32);
|
||||
});
|
||||
|
||||
let v = [10u32, 14u32, 21u32];
|
||||
runner.bench_function("bench_tinyarr_sum", move |_| {
|
||||
black_box(v.iter().cloned().sum::<u32>());
|
||||
});
|
||||
|
||||
runner.bench_function("bench_bitset_initialize", move |_| {
|
||||
black_box(BitSet::with_max_value(1_000_000));
|
||||
});
|
||||
}
|
||||
|
||||
fn main() {
|
||||
bench_vint();
|
||||
bench_bitset();
|
||||
#[bench]
|
||||
fn bench_vint(b: &mut Bencher) {
|
||||
let vals: Vec<u32> = (0..20_000).collect();
|
||||
b.iter(|| {
|
||||
let mut out = 0u64;
|
||||
for val in vals.iter().cloned() {
|
||||
let mut buf = [0u8; 8];
|
||||
serialize_vint_u32(val, &mut buf);
|
||||
out += u64::from(buf[0]);
|
||||
}
|
||||
out
|
||||
});
|
||||
}
|
||||
|
||||
#[bench]
|
||||
fn bench_vint_rand(b: &mut Bencher) {
|
||||
let vals: Vec<u32> = (0..20_000).choose_multiple(&mut thread_rng(), 100_000);
|
||||
b.iter(|| {
|
||||
let mut out = 0u64;
|
||||
for val in vals.iter().cloned() {
|
||||
let mut buf = [0u8; 8];
|
||||
serialize_vint_u32(val, &mut buf);
|
||||
out += u64::from(buf[0]);
|
||||
}
|
||||
out
|
||||
});
|
||||
}
|
||||
}
|
||||
|
||||
@@ -696,3 +696,43 @@ mod tests {
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(all(test, feature = "unstable"))]
|
||||
mod bench {
|
||||
|
||||
use test;
|
||||
|
||||
use super::{BitSet, TinySet};
|
||||
|
||||
#[bench]
|
||||
fn bench_tinyset_pop(b: &mut test::Bencher) {
|
||||
b.iter(|| {
|
||||
let mut tinyset = TinySet::singleton(test::black_box(31u32));
|
||||
tinyset.pop_lowest();
|
||||
tinyset.pop_lowest();
|
||||
tinyset.pop_lowest();
|
||||
tinyset.pop_lowest();
|
||||
tinyset.pop_lowest();
|
||||
tinyset.pop_lowest();
|
||||
});
|
||||
}
|
||||
|
||||
#[bench]
|
||||
fn bench_tinyset_sum(b: &mut test::Bencher) {
|
||||
let tiny_set = TinySet::empty().insert(10u32).insert(14u32).insert(21u32);
|
||||
b.iter(|| {
|
||||
assert_eq!(test::black_box(tiny_set).into_iter().sum::<u32>(), 45u32);
|
||||
});
|
||||
}
|
||||
|
||||
#[bench]
|
||||
fn bench_tinyarr_sum(b: &mut test::Bencher) {
|
||||
let v = [10u32, 14u32, 21u32];
|
||||
b.iter(|| test::black_box(v).iter().cloned().sum::<u32>());
|
||||
}
|
||||
|
||||
#[bench]
|
||||
fn bench_bitset_initialize(b: &mut test::Bencher) {
|
||||
b.iter(|| BitSet::with_max_value(1_000_000));
|
||||
}
|
||||
}
|
||||
|
||||
@@ -1,130 +0,0 @@
|
||||
use std::io;
|
||||
use std::ops::Bound;
|
||||
|
||||
#[derive(Clone, Debug)]
|
||||
pub struct BoundsRange<T> {
|
||||
pub lower_bound: Bound<T>,
|
||||
pub upper_bound: Bound<T>,
|
||||
}
|
||||
impl<T> BoundsRange<T> {
|
||||
pub fn new(lower_bound: Bound<T>, upper_bound: Bound<T>) -> Self {
|
||||
BoundsRange {
|
||||
lower_bound,
|
||||
upper_bound,
|
||||
}
|
||||
}
|
||||
pub fn is_unbounded(&self) -> bool {
|
||||
matches!(self.lower_bound, Bound::Unbounded) && matches!(self.upper_bound, Bound::Unbounded)
|
||||
}
|
||||
pub fn map_bound<TTo>(&self, transform: impl Fn(&T) -> TTo) -> BoundsRange<TTo> {
|
||||
BoundsRange {
|
||||
lower_bound: map_bound(&self.lower_bound, &transform),
|
||||
upper_bound: map_bound(&self.upper_bound, &transform),
|
||||
}
|
||||
}
|
||||
|
||||
pub fn map_bound_res<TTo, Err>(
|
||||
&self,
|
||||
transform: impl Fn(&T) -> Result<TTo, Err>,
|
||||
) -> Result<BoundsRange<TTo>, Err> {
|
||||
Ok(BoundsRange {
|
||||
lower_bound: map_bound_res(&self.lower_bound, &transform)?,
|
||||
upper_bound: map_bound_res(&self.upper_bound, &transform)?,
|
||||
})
|
||||
}
|
||||
|
||||
pub fn transform_inner<TTo>(
|
||||
&self,
|
||||
transform_lower: impl Fn(&T) -> TransformBound<TTo>,
|
||||
transform_upper: impl Fn(&T) -> TransformBound<TTo>,
|
||||
) -> BoundsRange<TTo> {
|
||||
BoundsRange {
|
||||
lower_bound: transform_bound_inner(&self.lower_bound, &transform_lower),
|
||||
upper_bound: transform_bound_inner(&self.upper_bound, &transform_upper),
|
||||
}
|
||||
}
|
||||
|
||||
/// Returns the first set inner value
|
||||
pub fn get_inner(&self) -> Option<&T> {
|
||||
inner_bound(&self.lower_bound).or(inner_bound(&self.upper_bound))
|
||||
}
|
||||
}
|
||||
|
||||
pub enum TransformBound<T> {
|
||||
/// Overwrite the bounds
|
||||
NewBound(Bound<T>),
|
||||
/// Use Existing bounds with new value
|
||||
Existing(T),
|
||||
}
|
||||
|
||||
/// Takes a bound and transforms the inner value into a new bound via a closure.
|
||||
/// The bound variant may change by the value returned value from the closure.
|
||||
pub fn transform_bound_inner_res<TFrom, TTo>(
|
||||
bound: &Bound<TFrom>,
|
||||
transform: impl Fn(&TFrom) -> io::Result<TransformBound<TTo>>,
|
||||
) -> io::Result<Bound<TTo>> {
|
||||
use self::Bound::*;
|
||||
Ok(match bound {
|
||||
Excluded(ref from_val) => match transform(from_val)? {
|
||||
TransformBound::NewBound(new_val) => new_val,
|
||||
TransformBound::Existing(new_val) => Excluded(new_val),
|
||||
},
|
||||
Included(ref from_val) => match transform(from_val)? {
|
||||
TransformBound::NewBound(new_val) => new_val,
|
||||
TransformBound::Existing(new_val) => Included(new_val),
|
||||
},
|
||||
Unbounded => Unbounded,
|
||||
})
|
||||
}
|
||||
|
||||
/// Takes a bound and transforms the inner value into a new bound via a closure.
|
||||
/// The bound variant may change by the value returned value from the closure.
|
||||
pub fn transform_bound_inner<TFrom, TTo>(
|
||||
bound: &Bound<TFrom>,
|
||||
transform: impl Fn(&TFrom) -> TransformBound<TTo>,
|
||||
) -> Bound<TTo> {
|
||||
use self::Bound::*;
|
||||
match bound {
|
||||
Excluded(ref from_val) => match transform(from_val) {
|
||||
TransformBound::NewBound(new_val) => new_val,
|
||||
TransformBound::Existing(new_val) => Excluded(new_val),
|
||||
},
|
||||
Included(ref from_val) => match transform(from_val) {
|
||||
TransformBound::NewBound(new_val) => new_val,
|
||||
TransformBound::Existing(new_val) => Included(new_val),
|
||||
},
|
||||
Unbounded => Unbounded,
|
||||
}
|
||||
}
|
||||
|
||||
/// Returns the inner value of a `Bound`
|
||||
pub fn inner_bound<T>(val: &Bound<T>) -> Option<&T> {
|
||||
match val {
|
||||
Bound::Included(term) | Bound::Excluded(term) => Some(term),
|
||||
Bound::Unbounded => None,
|
||||
}
|
||||
}
|
||||
|
||||
pub fn map_bound<TFrom, TTo>(
|
||||
bound: &Bound<TFrom>,
|
||||
transform: impl Fn(&TFrom) -> TTo,
|
||||
) -> Bound<TTo> {
|
||||
use self::Bound::*;
|
||||
match bound {
|
||||
Excluded(ref from_val) => Bound::Excluded(transform(from_val)),
|
||||
Included(ref from_val) => Bound::Included(transform(from_val)),
|
||||
Unbounded => Unbounded,
|
||||
}
|
||||
}
|
||||
|
||||
pub fn map_bound_res<TFrom, TTo, Err>(
|
||||
bound: &Bound<TFrom>,
|
||||
transform: impl Fn(&TFrom) -> Result<TTo, Err>,
|
||||
) -> Result<Bound<TTo>, Err> {
|
||||
use self::Bound::*;
|
||||
Ok(match bound {
|
||||
Excluded(ref from_val) => Excluded(transform(from_val)?),
|
||||
Included(ref from_val) => Included(transform(from_val)?),
|
||||
Unbounded => Unbounded,
|
||||
})
|
||||
}
|
||||
@@ -1,6 +1,5 @@
|
||||
use std::fs::File;
|
||||
use std::ops::{Deref, Range, RangeBounds};
|
||||
use std::path::Path;
|
||||
use std::sync::Arc;
|
||||
use std::{fmt, io};
|
||||
|
||||
@@ -178,12 +177,6 @@ fn combine_ranges<R: RangeBounds<usize>>(orig_range: Range<usize>, rel_range: R)
|
||||
}
|
||||
|
||||
impl FileSlice {
|
||||
/// Creates a FileSlice from a path.
|
||||
pub fn open(path: &Path) -> io::Result<FileSlice> {
|
||||
let wrap_file = WrapFile::new(File::open(path)?)?;
|
||||
Ok(FileSlice::new(Arc::new(wrap_file)))
|
||||
}
|
||||
|
||||
/// Wraps a FileHandle.
|
||||
pub fn new(file_handle: Arc<dyn FileHandle>) -> Self {
|
||||
let num_bytes = file_handle.len();
|
||||
|
||||
@@ -5,7 +5,6 @@ use std::ops::Deref;
|
||||
pub use byteorder::LittleEndian as Endianness;
|
||||
|
||||
mod bitset;
|
||||
pub mod bounds;
|
||||
mod byte_count;
|
||||
mod datetime;
|
||||
pub mod file_slice;
|
||||
@@ -130,11 +129,11 @@ pub fn replace_in_place(needle: u8, replacement: u8, bytes: &mut [u8]) {
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
pub(crate) mod test {
|
||||
pub mod test {
|
||||
|
||||
use proptest::prelude::*;
|
||||
|
||||
use super::{f64_to_u64, i64_to_u64, u64_to_f64, u64_to_i64};
|
||||
use super::{f64_to_u64, i64_to_u64, u64_to_f64, u64_to_i64, BinarySerializable, FixedSize};
|
||||
|
||||
fn test_i64_converter_helper(val: i64) {
|
||||
assert_eq!(u64_to_i64(i64_to_u64(val)), val);
|
||||
@@ -144,6 +143,12 @@ pub(crate) mod test {
|
||||
assert_eq!(u64_to_f64(f64_to_u64(val)), val);
|
||||
}
|
||||
|
||||
pub fn fixed_size_test<O: BinarySerializable + FixedSize + Default>() {
|
||||
let mut buffer = Vec::new();
|
||||
O::default().serialize(&mut buffer).unwrap();
|
||||
assert_eq!(buffer.len(), O::SIZE_IN_BYTES);
|
||||
}
|
||||
|
||||
proptest! {
|
||||
#[test]
|
||||
fn test_f64_converter_monotonicity_proptest((left, right) in (proptest::num::f64::NORMAL, proptest::num::f64::NORMAL)) {
|
||||
|
||||
@@ -74,14 +74,14 @@ impl FixedSize for () {
|
||||
|
||||
impl<T: BinarySerializable> BinarySerializable for Vec<T> {
|
||||
fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
|
||||
BinarySerializable::serialize(&VInt(self.len() as u64), writer)?;
|
||||
VInt(self.len() as u64).serialize(writer)?;
|
||||
for it in self {
|
||||
it.serialize(writer)?;
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
fn deserialize<R: Read>(reader: &mut R) -> io::Result<Vec<T>> {
|
||||
let num_items = <VInt as BinarySerializable>::deserialize(reader)?.val();
|
||||
let num_items = VInt::deserialize(reader)?.val();
|
||||
let mut items: Vec<T> = Vec::with_capacity(num_items as usize);
|
||||
for _ in 0..num_items {
|
||||
let item = T::deserialize(reader)?;
|
||||
@@ -236,12 +236,12 @@ impl FixedSize for bool {
|
||||
impl BinarySerializable for String {
|
||||
fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
|
||||
let data: &[u8] = self.as_bytes();
|
||||
BinarySerializable::serialize(&VInt(data.len() as u64), writer)?;
|
||||
VInt(data.len() as u64).serialize(writer)?;
|
||||
writer.write_all(data)
|
||||
}
|
||||
|
||||
fn deserialize<R: Read>(reader: &mut R) -> io::Result<String> {
|
||||
let string_length = <VInt as BinarySerializable>::deserialize(reader)?.val() as usize;
|
||||
let string_length = VInt::deserialize(reader)?.val() as usize;
|
||||
let mut result = String::with_capacity(string_length);
|
||||
reader
|
||||
.take(string_length as u64)
|
||||
@@ -253,12 +253,12 @@ impl BinarySerializable for String {
|
||||
impl<'a> BinarySerializable for Cow<'a, str> {
|
||||
fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
|
||||
let data: &[u8] = self.as_bytes();
|
||||
BinarySerializable::serialize(&VInt(data.len() as u64), writer)?;
|
||||
VInt(data.len() as u64).serialize(writer)?;
|
||||
writer.write_all(data)
|
||||
}
|
||||
|
||||
fn deserialize<R: Read>(reader: &mut R) -> io::Result<Cow<'a, str>> {
|
||||
let string_length = <VInt as BinarySerializable>::deserialize(reader)?.val() as usize;
|
||||
let string_length = VInt::deserialize(reader)?.val() as usize;
|
||||
let mut result = String::with_capacity(string_length);
|
||||
reader
|
||||
.take(string_length as u64)
|
||||
@@ -269,18 +269,18 @@ impl<'a> BinarySerializable for Cow<'a, str> {
|
||||
|
||||
impl<'a> BinarySerializable for Cow<'a, [u8]> {
|
||||
fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
|
||||
BinarySerializable::serialize(&VInt(self.len() as u64), writer)?;
|
||||
VInt(self.len() as u64).serialize(writer)?;
|
||||
for it in self.iter() {
|
||||
BinarySerializable::serialize(it, writer)?;
|
||||
it.serialize(writer)?;
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
|
||||
fn deserialize<R: Read>(reader: &mut R) -> io::Result<Cow<'a, [u8]>> {
|
||||
let num_items = <VInt as BinarySerializable>::deserialize(reader)?.val();
|
||||
let num_items = VInt::deserialize(reader)?.val();
|
||||
let mut items: Vec<u8> = Vec::with_capacity(num_items as usize);
|
||||
for _ in 0..num_items {
|
||||
let item = <u8 as BinarySerializable>::deserialize(reader)?;
|
||||
let item = u8::deserialize(reader)?;
|
||||
items.push(item);
|
||||
}
|
||||
Ok(Cow::Owned(items))
|
||||
|
||||
@@ -87,7 +87,7 @@ impl<W: TerminatingWrite> TerminatingWrite for BufWriter<W> {
|
||||
}
|
||||
}
|
||||
|
||||
impl TerminatingWrite for &mut Vec<u8> {
|
||||
impl<'a> TerminatingWrite for &'a mut Vec<u8> {
|
||||
fn terminate_ref(&mut self, _a: AntiCallToken) -> io::Result<()> {
|
||||
self.flush()
|
||||
}
|
||||
|
||||
@@ -2,7 +2,7 @@
|
||||
|
||||
> Tantivy is a **search** engine **library** for Rust.
|
||||
|
||||
If you are familiar with Lucene, it's an excellent approximation to consider tantivy as Lucene for Rust. Tantivy is heavily inspired by Lucene's design and
|
||||
If you are familiar with Lucene, it's an excellent approximation to consider tantivy as Lucene for rust. tantivy is heavily inspired by Lucene's design and
|
||||
they both have the same scope and targeted use cases.
|
||||
|
||||
If you are not familiar with Lucene, let's break down our little tagline.
|
||||
@@ -17,7 +17,7 @@ relevancy, collapsing, highlighting, spatial search.
|
||||
experience. But keep in mind this is just a toolbox.
|
||||
Which bring us to the second keyword...
|
||||
|
||||
- **Library** means that you will have to write code. Tantivy is not an *all-in-one* server solution like Elasticsearch for instance.
|
||||
- **Library** means that you will have to write code. tantivy is not an *all-in-one* server solution like elastic search for instance.
|
||||
|
||||
Sometimes a functionality will not be available in tantivy because it is too
|
||||
specific to your use case. By design, tantivy should make it possible to extend
|
||||
@@ -31,4 +31,4 @@ relevancy, collapsing, highlighting, spatial search.
|
||||
index from a different format.
|
||||
|
||||
Tantivy exposes a lot of low level API to do all of these things.
|
||||
|
||||
|
||||
@@ -11,7 +11,7 @@ directory shipped with tantivy is the `MmapDirectory`.
|
||||
While this design has some downsides, this greatly simplifies the source code of
|
||||
tantivy. Caching is also entirely delegated to the OS.
|
||||
|
||||
Tantivy works entirely (or almost) by directly reading the datastructures as they are laid on disk. As a result, the act of opening an indexing does not involve loading different datastructures from the disk into random access memory : starting a process, opening an index, and performing your first query can typically be done in a matter of milliseconds.
|
||||
`tantivy` works entirely (or almost) by directly reading the datastructures as they are laid on disk. As a result, the act of opening an indexing does not involve loading different datastructures from the disk into random access memory : starting a process, opening an index, and performing your first query can typically be done in a matter of milliseconds.
|
||||
|
||||
This is an interesting property for a command line search engine, or for some multi-tenant log search engine : spawning a new process for each new query can be a perfectly sensible solution in some use case.
|
||||
|
||||
|
||||
@@ -31,13 +31,13 @@ Compression ratio is mainly affected on the fast field of the sorted property, e
|
||||
When data is presorted by a field and search queries request sorting by the same field, we can leverage the natural order of the documents.
|
||||
E.g. if the data is sorted by timestamp and want the top n newest docs containing a term, we can simply leveraging the order of the docids.
|
||||
|
||||
Note: tantivy 0.16 does not do this optimization yet.
|
||||
Note: Tantivy 0.16 does not do this optimization yet.
|
||||
|
||||
### Pruning
|
||||
|
||||
Let's say we want all documents and want to apply the filter `>= 2010-08-11`. When the data is sorted, we could make a lookup in the fast field to find the docid range and use this as the filter.
|
||||
|
||||
Note: tantivy 0.16 does not do this optimization yet.
|
||||
Note: Tantivy 0.16 does not do this optimization yet.
|
||||
|
||||
### Other?
|
||||
|
||||
@@ -45,7 +45,7 @@ In principle there are many algorithms possible that exploit the monotonically i
|
||||
|
||||
## Usage
|
||||
|
||||
The index sorting can be configured setting [`sort_by_field`](https://github.com/quickwit-oss/tantivy/blob/000d76b11a139a84b16b9b95060a1c93e8b9851c/src/core/index_meta.rs#L238) on `IndexSettings` and passing it to a `IndexBuilder`. As of tantivy 0.16 only fast fields are allowed to be used.
|
||||
The index sorting can be configured setting [`sort_by_field`](https://github.com/quickwit-oss/tantivy/blob/000d76b11a139a84b16b9b95060a1c93e8b9851c/src/core/index_meta.rs#L238) on `IndexSettings` and passing it to a `IndexBuilder`. As of Tantivy 0.16 only fast fields are allowed to be used.
|
||||
|
||||
```rust
|
||||
let settings = IndexSettings {
|
||||
|
||||
@@ -39,7 +39,7 @@ Its representation is done by separating segments by a unicode char `\x01`, and
|
||||
- `value`: The value representation is just the regular Value representation.
|
||||
|
||||
This representation is designed to align the natural sort of Terms with the lexicographical sort
|
||||
of their binary representation (tantivy's dictionary (whether fst or sstable) is sorted and does prefix encoding).
|
||||
of their binary representation (Tantivy's dictionary (whether fst or sstable) is sorted and does prefix encoding).
|
||||
|
||||
In the example above, the terms will be sorted as
|
||||
|
||||
|
||||
@@ -1,5 +1,3 @@
|
||||
use std::ops::Bound;
|
||||
|
||||
// # Searching a range on an indexed int field.
|
||||
//
|
||||
// Below is an example of creating an indexed integer field in your schema
|
||||
@@ -7,7 +5,7 @@ use std::ops::Bound;
|
||||
use tantivy::collector::Count;
|
||||
use tantivy::query::RangeQuery;
|
||||
use tantivy::schema::{Schema, INDEXED};
|
||||
use tantivy::{doc, Index, IndexWriter, Result, Term};
|
||||
use tantivy::{doc, Index, IndexWriter, Result};
|
||||
|
||||
fn main() -> Result<()> {
|
||||
// For the sake of simplicity, this schema will only have 1 field
|
||||
@@ -29,10 +27,7 @@ fn main() -> Result<()> {
|
||||
reader.reload()?;
|
||||
let searcher = reader.searcher();
|
||||
// The end is excluded i.e. here we are searching up to 1969
|
||||
let docs_in_the_sixties = RangeQuery::new(
|
||||
Bound::Included(Term::from_field_u64(year_field, 1960)),
|
||||
Bound::Excluded(Term::from_field_u64(year_field, 1970)),
|
||||
);
|
||||
let docs_in_the_sixties = RangeQuery::new_u64("year".to_string(), 1960..1970);
|
||||
// Uses a Count collector to sum the total number of docs in the range
|
||||
let num_60s_books = searcher.search(&docs_in_the_sixties, &Count)?;
|
||||
assert_eq!(num_60s_books, 10);
|
||||
|
||||
@@ -28,7 +28,7 @@ fn main() -> tantivy::Result<()> {
|
||||
let mut index_writer: IndexWriter = index.writer_with_num_threads(1, 50_000_000)?;
|
||||
index_writer.add_document(doc!(title => "The Old Man and the Sea"))?;
|
||||
index_writer.add_document(doc!(title => "Of Mice and Men"))?;
|
||||
index_writer.add_document(doc!(title => "The modern Prometheus"))?;
|
||||
index_writer.add_document(doc!(title => "The modern Promotheus"))?;
|
||||
index_writer.commit()?;
|
||||
|
||||
let reader = index.reader()?;
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
[package]
|
||||
authors = ["Paul Masurel <paul@quickwit.io>", "Pascal Seitz <pascal@quickwit.io>"]
|
||||
name = "ownedbytes"
|
||||
version = "0.9.0"
|
||||
version = "0.7.0"
|
||||
edition = "2021"
|
||||
description = "Expose data as static slice"
|
||||
license = "MIT"
|
||||
|
||||
@@ -151,7 +151,7 @@ impl fmt::Debug for OwnedBytes {
|
||||
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
|
||||
// We truncate the bytes in order to make sure the debug string
|
||||
// is not too long.
|
||||
let bytes_truncated: &[u8] = if self.len() > 10 {
|
||||
let bytes_truncated: &[u8] = if self.len() > 8 {
|
||||
&self.as_slice()[..10]
|
||||
} else {
|
||||
self.as_slice()
|
||||
@@ -252,11 +252,6 @@ mod tests {
|
||||
format!("{short_bytes:?}"),
|
||||
"OwnedBytes([97, 98, 99, 100], len=4)"
|
||||
);
|
||||
let medium_bytes = OwnedBytes::new(b"abcdefghi".as_ref());
|
||||
assert_eq!(
|
||||
format!("{medium_bytes:?}"),
|
||||
"OwnedBytes([97, 98, 99, 100, 101, 102, 103, 104, 105], len=9)"
|
||||
);
|
||||
let long_bytes = OwnedBytes::new(b"abcdefghijklmnopq".as_ref());
|
||||
assert_eq!(
|
||||
format!("{long_bytes:?}"),
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
[package]
|
||||
name = "tantivy-query-grammar"
|
||||
version = "0.24.0"
|
||||
version = "0.22.0"
|
||||
authors = ["Paul Masurel <paul.masurel@gmail.com>"]
|
||||
license = "MIT"
|
||||
categories = ["database-implementations", "data-structures"]
|
||||
@@ -13,5 +13,3 @@ edition = "2021"
|
||||
|
||||
[dependencies]
|
||||
nom = "7"
|
||||
serde = { version = "1.0.219", features = ["derive"] }
|
||||
serde_json = "1.0.140"
|
||||
|
||||
@@ -3,7 +3,6 @@
|
||||
use std::convert::Infallible;
|
||||
|
||||
use nom::{AsChar, IResult, InputLength, InputTakeAtPosition};
|
||||
use serde::Serialize;
|
||||
|
||||
pub(crate) type ErrorList = Vec<LenientErrorInternal>;
|
||||
pub(crate) type JResult<I, O> = IResult<I, (O, ErrorList), Infallible>;
|
||||
@@ -16,8 +15,7 @@ pub(crate) struct LenientErrorInternal {
|
||||
}
|
||||
|
||||
/// A recoverable error and the position it happened at
|
||||
#[derive(Debug, PartialEq, Serialize)]
|
||||
#[serde(rename_all = "snake_case")]
|
||||
#[derive(Debug, PartialEq)]
|
||||
pub struct LenientError {
|
||||
pub pos: usize,
|
||||
pub message: String,
|
||||
@@ -111,8 +109,6 @@ where F: nom::Parser<I, (O, ErrorList), Infallible> {
|
||||
move |input: I| match f.parse(input) {
|
||||
Ok((input, (output, _err))) => Ok((input, output)),
|
||||
Err(Err::Incomplete(needed)) => Err(Err::Incomplete(needed)),
|
||||
// old versions don't understand this is uninhabited and need the empty match to help,
|
||||
// newer versions warn because this arm is unreachable (which it is indeed).
|
||||
Err(Err::Error(val)) | Err(Err::Failure(val)) => match val {},
|
||||
}
|
||||
}
|
||||
@@ -355,21 +351,3 @@ where
|
||||
{
|
||||
move |i: I| l.choice(i.clone()).unwrap_or_else(|| default.parse(i))
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn test_lenient_error_serialization() {
|
||||
let error = LenientError {
|
||||
pos: 42,
|
||||
message: "test error message".to_string(),
|
||||
};
|
||||
|
||||
assert_eq!(
|
||||
serde_json::to_string(&error).unwrap(),
|
||||
"{\"pos\":42,\"message\":\"test error message\"}"
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
@@ -1,7 +1,5 @@
|
||||
#![allow(clippy::derive_partial_eq_without_eq)]
|
||||
|
||||
use serde::Serialize;
|
||||
|
||||
mod infallible;
|
||||
mod occur;
|
||||
mod query_grammar;
|
||||
@@ -14,8 +12,6 @@ pub use crate::user_input_ast::{
|
||||
Delimiter, UserInputAst, UserInputBound, UserInputLeaf, UserInputLiteral,
|
||||
};
|
||||
|
||||
#[derive(Debug, Serialize)]
|
||||
#[serde(rename_all = "snake_case")]
|
||||
pub struct Error;
|
||||
|
||||
/// Parse a query
|
||||
@@ -28,31 +24,3 @@ pub fn parse_query(query: &str) -> Result<UserInputAst, Error> {
|
||||
pub fn parse_query_lenient(query: &str) -> (UserInputAst, Vec<LenientError>) {
|
||||
parse_to_ast_lenient(query)
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use crate::{parse_query, parse_query_lenient};
|
||||
|
||||
#[test]
|
||||
fn test_parse_query_serialization() {
|
||||
let ast = parse_query("title:hello OR title:x").unwrap();
|
||||
let json = serde_json::to_string(&ast).unwrap();
|
||||
assert_eq!(
|
||||
json,
|
||||
r#"{"type":"bool","clauses":[["should",{"type":"literal","field_name":"title","phrase":"hello","delimiter":"none","slop":0,"prefix":false}],["should",{"type":"literal","field_name":"title","phrase":"x","delimiter":"none","slop":0,"prefix":false}]]}"#
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_parse_query_wrong_query() {
|
||||
assert!(parse_query("title:").is_err());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_parse_query_lenient_wrong_query() {
|
||||
let (_, errors) = parse_query_lenient("title:");
|
||||
assert!(errors.len() == 1);
|
||||
let json = serde_json::to_string(&errors).unwrap();
|
||||
assert_eq!(json, r#"[{"pos":6,"message":"expected word"}]"#);
|
||||
}
|
||||
}
|
||||
|
||||
@@ -1,20 +1,17 @@
|
||||
use std::fmt;
|
||||
use std::fmt::Write;
|
||||
|
||||
use serde::Serialize;
|
||||
|
||||
/// Defines whether a term in a query must be present,
|
||||
/// should be present or must not be present.
|
||||
#[derive(Debug, Clone, Hash, Copy, Eq, PartialEq, Serialize)]
|
||||
#[serde(rename_all = "snake_case")]
|
||||
#[derive(Debug, Clone, Hash, Copy, Eq, PartialEq)]
|
||||
pub enum Occur {
|
||||
/// For a given document to be considered for scoring,
|
||||
/// at least one of the queries with the Should or the Must
|
||||
/// at least one of the terms with the Should or the Must
|
||||
/// Occur constraint must be within the document.
|
||||
Should,
|
||||
/// Document without the queries are excluded from the search.
|
||||
/// Document without the term are excluded from the search.
|
||||
Must,
|
||||
/// Document that contain the query are excluded from the
|
||||
/// Document that contain the term are excluded from the
|
||||
/// search.
|
||||
MustNot,
|
||||
}
|
||||
|
||||
@@ -321,17 +321,7 @@ fn exists(inp: &str) -> IResult<&str, UserInputLeaf> {
|
||||
UserInputLeaf::Exists {
|
||||
field: String::new(),
|
||||
},
|
||||
tuple((
|
||||
multispace0,
|
||||
char('*'),
|
||||
peek(alt((
|
||||
value(
|
||||
"",
|
||||
satisfy(|c: char| c.is_whitespace() || ESCAPE_IN_WORD.contains(&c)),
|
||||
),
|
||||
eof,
|
||||
))),
|
||||
)),
|
||||
tuple((multispace0, char('*'))),
|
||||
)(inp)
|
||||
}
|
||||
|
||||
@@ -341,14 +331,7 @@ fn exists_precond(inp: &str) -> IResult<&str, (), ()> {
|
||||
peek(tuple((
|
||||
field_name,
|
||||
multispace0,
|
||||
char('*'),
|
||||
peek(alt((
|
||||
value(
|
||||
"",
|
||||
satisfy(|c: char| c.is_whitespace() || ESCAPE_IN_WORD.contains(&c)),
|
||||
),
|
||||
eof,
|
||||
))), // we need to check this isn't a wildcard query
|
||||
char('*'), // when we are here, we know it can't be anything but a exists
|
||||
))),
|
||||
)(inp)
|
||||
.map_err(|e| e.map(|_| ()))
|
||||
@@ -784,7 +767,7 @@ fn occur_leaf(inp: &str) -> IResult<&str, (Option<Occur>, UserInputAst)> {
|
||||
tuple((fallible(occur_symbol), boosted_leaf))(inp)
|
||||
}
|
||||
|
||||
#[expect(clippy::type_complexity)]
|
||||
#[allow(clippy::type_complexity)]
|
||||
fn operand_occur_leaf_infallible(
|
||||
inp: &str,
|
||||
) -> JResult<&str, (Option<BinaryOperand>, Option<Occur>, Option<UserInputAst>)> {
|
||||
@@ -850,7 +833,7 @@ fn aggregate_infallible_expressions(
|
||||
if early_operand {
|
||||
err.push(LenientErrorInternal {
|
||||
pos: 0,
|
||||
message: "Found unexpected boolean operator before term".to_string(),
|
||||
message: "Found unexpeted boolean operator before term".to_string(),
|
||||
});
|
||||
}
|
||||
|
||||
@@ -873,7 +856,7 @@ fn aggregate_infallible_expressions(
|
||||
_ => Some(Occur::Should),
|
||||
};
|
||||
if occur == &Some(Occur::MustNot) && default_op == Some(Occur::Should) {
|
||||
// if occur is MustNot *and* operation is OR, we synthesize a ShouldNot
|
||||
// if occur is MustNot *and* operation is OR, we synthetize a ShouldNot
|
||||
clauses.push(vec![(
|
||||
Some(Occur::Should),
|
||||
ast.clone().unary(Occur::MustNot),
|
||||
@@ -889,7 +872,7 @@ fn aggregate_infallible_expressions(
|
||||
None => None,
|
||||
};
|
||||
if occur == &Some(Occur::MustNot) && default_op == Some(Occur::Should) {
|
||||
// if occur is MustNot *and* operation is OR, we synthesize a ShouldNot
|
||||
// if occur is MustNot *and* operation is OR, we synthetize a ShouldNot
|
||||
clauses.push(vec![(
|
||||
Some(Occur::Should),
|
||||
ast.clone().unary(Occur::MustNot),
|
||||
@@ -914,7 +897,7 @@ fn aggregate_infallible_expressions(
|
||||
}
|
||||
Some(BinaryOperand::Or) => {
|
||||
if last_occur == Some(Occur::MustNot) {
|
||||
// if occur is MustNot *and* operation is OR, we synthesize a ShouldNot
|
||||
// if occur is MustNot *and* operation is OR, we synthetize a ShouldNot
|
||||
clauses.push(vec![(Some(Occur::Should), last_ast.unary(Occur::MustNot))]);
|
||||
} else {
|
||||
clauses.push(vec![(last_occur.or(Some(Occur::Should)), last_ast)]);
|
||||
@@ -1074,7 +1057,7 @@ mod test {
|
||||
valid_parse("1", 1.0, "");
|
||||
valid_parse("0.234234 aaa", 0.234234f64, " aaa");
|
||||
error_parse(".3332");
|
||||
// TODO trinity-1686a: I disagree that it should fail, I think it should succeed,
|
||||
// TODO trinity-1686a: I disagree that it should fail, I think it should succeeed,
|
||||
// consuming only "1", and leave "." for the next thing (which will likely fail then)
|
||||
// error_parse("1.");
|
||||
error_parse("-1.");
|
||||
@@ -1484,7 +1467,7 @@ mod test {
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_parse_query_to_trimming_spaces() {
|
||||
fn test_parse_query_to_triming_spaces() {
|
||||
test_parse_query_to_ast_helper(" abc", "abc");
|
||||
test_parse_query_to_ast_helper("abc ", "abc");
|
||||
test_parse_query_to_ast_helper("( a OR abc)", "(?a ?abc)");
|
||||
@@ -1514,11 +1497,6 @@ mod test {
|
||||
test_is_parse_err(r#"field:(+a -"b c""#, r#"(+"field":a -"field":"b c")"#);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn field_re_specification() {
|
||||
test_parse_query_to_ast_helper(r#"field:(abc AND b:cde)"#, r#"(+"field":abc +"b":cde)"#);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_parse_query_single_term() {
|
||||
test_parse_query_to_ast_helper("abc", "abc");
|
||||
@@ -1641,19 +1619,13 @@ mod test {
|
||||
|
||||
#[test]
|
||||
fn test_exist_query() {
|
||||
test_parse_query_to_ast_helper("a:*", "$exists(\"a\")");
|
||||
test_parse_query_to_ast_helper("a: *", "$exists(\"a\")");
|
||||
test_parse_query_to_ast_helper("a:*", "\"a\":*");
|
||||
test_parse_query_to_ast_helper("a: *", "\"a\":*");
|
||||
// an exist followed by default term being b
|
||||
test_is_parse_err("a:*b", "(*\"a\":* *b)");
|
||||
|
||||
test_parse_query_to_ast_helper(
|
||||
"(hello AND toto:*) OR happy",
|
||||
"(?(+hello +$exists(\"toto\")) ?happy)",
|
||||
);
|
||||
test_parse_query_to_ast_helper("(a:*)", "$exists(\"a\")");
|
||||
|
||||
// these are term/wildcard query (not a phrase prefix)
|
||||
// this is a term query (not a phrase prefix)
|
||||
test_parse_query_to_ast_helper("a:b*", "\"a\":b*");
|
||||
test_parse_query_to_ast_helper("a:*b", "\"a\":*b");
|
||||
test_parse_query_to_ast_helper(r#"a:*def*"#, "\"a\":*def*");
|
||||
}
|
||||
|
||||
#[test]
|
||||
|
||||
@@ -1,13 +1,9 @@
|
||||
use std::fmt;
|
||||
use std::fmt::{Debug, Formatter};
|
||||
|
||||
use serde::Serialize;
|
||||
|
||||
use crate::Occur;
|
||||
|
||||
#[derive(PartialEq, Clone, Serialize)]
|
||||
#[serde(tag = "type")]
|
||||
#[serde(rename_all = "snake_case")]
|
||||
#[derive(PartialEq, Clone)]
|
||||
pub enum UserInputLeaf {
|
||||
Literal(UserInputLiteral),
|
||||
All,
|
||||
@@ -105,22 +101,20 @@ impl Debug for UserInputLeaf {
|
||||
}
|
||||
UserInputLeaf::All => write!(formatter, "*"),
|
||||
UserInputLeaf::Exists { field } => {
|
||||
write!(formatter, "$exists(\"{field}\")")
|
||||
write!(formatter, "\"{field}\":*")
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
#[derive(Copy, Clone, Eq, PartialEq, Debug, Serialize)]
|
||||
#[serde(rename_all = "snake_case")]
|
||||
#[derive(Copy, Clone, Eq, PartialEq, Debug)]
|
||||
pub enum Delimiter {
|
||||
SingleQuotes,
|
||||
DoubleQuotes,
|
||||
None,
|
||||
}
|
||||
|
||||
#[derive(PartialEq, Clone, Serialize)]
|
||||
#[serde(rename_all = "snake_case")]
|
||||
#[derive(PartialEq, Clone)]
|
||||
pub struct UserInputLiteral {
|
||||
pub field_name: Option<String>,
|
||||
pub phrase: String,
|
||||
@@ -158,9 +152,7 @@ impl fmt::Debug for UserInputLiteral {
|
||||
}
|
||||
}
|
||||
|
||||
#[derive(PartialEq, Debug, Clone, Serialize)]
|
||||
#[serde(tag = "type", content = "value")]
|
||||
#[serde(rename_all = "snake_case")]
|
||||
#[derive(PartialEq, Debug, Clone)]
|
||||
pub enum UserInputBound {
|
||||
Inclusive(String),
|
||||
Exclusive(String),
|
||||
@@ -195,38 +187,11 @@ impl UserInputBound {
|
||||
}
|
||||
}
|
||||
|
||||
#[derive(PartialEq, Clone, Serialize)]
|
||||
#[serde(into = "UserInputAstSerde")]
|
||||
#[derive(PartialEq, Clone)]
|
||||
pub enum UserInputAst {
|
||||
Clause(Vec<(Option<Occur>, UserInputAst)>),
|
||||
Leaf(Box<UserInputLeaf>),
|
||||
Boost(Box<UserInputAst>, f64),
|
||||
Leaf(Box<UserInputLeaf>),
|
||||
}
|
||||
|
||||
#[derive(Serialize)]
|
||||
#[serde(tag = "type", rename_all = "snake_case")]
|
||||
enum UserInputAstSerde {
|
||||
Bool {
|
||||
clauses: Vec<(Option<Occur>, UserInputAst)>,
|
||||
},
|
||||
Boost {
|
||||
underlying: Box<UserInputAst>,
|
||||
boost: f64,
|
||||
},
|
||||
#[serde(untagged)]
|
||||
Leaf(Box<UserInputLeaf>),
|
||||
}
|
||||
|
||||
impl From<UserInputAst> for UserInputAstSerde {
|
||||
fn from(ast: UserInputAst) -> Self {
|
||||
match ast {
|
||||
UserInputAst::Clause(clause) => UserInputAstSerde::Bool { clauses: clause },
|
||||
UserInputAst::Boost(underlying, boost) => {
|
||||
UserInputAstSerde::Boost { underlying, boost }
|
||||
}
|
||||
UserInputAst::Leaf(leaf) => UserInputAstSerde::Leaf(leaf),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl UserInputAst {
|
||||
@@ -302,7 +267,7 @@ impl fmt::Debug for UserInputAst {
|
||||
match *self {
|
||||
UserInputAst::Clause(ref subqueries) => {
|
||||
if subqueries.is_empty() {
|
||||
// TODO this will break ast reserialization, is writing "( )" enough?
|
||||
// TODO this will break ast reserialization, is writing "( )" enought?
|
||||
write!(formatter, "<emptyclause>")?;
|
||||
} else {
|
||||
write!(formatter, "(")?;
|
||||
@@ -320,126 +285,3 @@ impl fmt::Debug for UserInputAst {
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn test_all_leaf_serialization() {
|
||||
let ast = UserInputAst::Leaf(Box::new(UserInputLeaf::All));
|
||||
let json = serde_json::to_string(&ast).unwrap();
|
||||
assert_eq!(json, r#"{"type":"all"}"#);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_literal_leaf_serialization() {
|
||||
let literal = UserInputLiteral {
|
||||
field_name: Some("title".to_string()),
|
||||
phrase: "hello".to_string(),
|
||||
delimiter: Delimiter::None,
|
||||
slop: 0,
|
||||
prefix: false,
|
||||
};
|
||||
let ast = UserInputAst::Leaf(Box::new(UserInputLeaf::Literal(literal)));
|
||||
let json = serde_json::to_string(&ast).unwrap();
|
||||
assert_eq!(
|
||||
json,
|
||||
r#"{"type":"literal","field_name":"title","phrase":"hello","delimiter":"none","slop":0,"prefix":false}"#
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_range_leaf_serialization() {
|
||||
let range = UserInputLeaf::Range {
|
||||
field: Some("price".to_string()),
|
||||
lower: UserInputBound::Inclusive("10".to_string()),
|
||||
upper: UserInputBound::Exclusive("100".to_string()),
|
||||
};
|
||||
let ast = UserInputAst::Leaf(Box::new(range));
|
||||
let json = serde_json::to_string(&ast).unwrap();
|
||||
assert_eq!(
|
||||
json,
|
||||
r#"{"type":"range","field":"price","lower":{"type":"inclusive","value":"10"},"upper":{"type":"exclusive","value":"100"}}"#
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_range_leaf_unbounded_serialization() {
|
||||
let range = UserInputLeaf::Range {
|
||||
field: Some("price".to_string()),
|
||||
lower: UserInputBound::Inclusive("10".to_string()),
|
||||
upper: UserInputBound::Unbounded,
|
||||
};
|
||||
let ast = UserInputAst::Leaf(Box::new(range));
|
||||
let json = serde_json::to_string(&ast).unwrap();
|
||||
assert_eq!(
|
||||
json,
|
||||
r#"{"type":"range","field":"price","lower":{"type":"inclusive","value":"10"},"upper":{"type":"unbounded"}}"#
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_boost_serialization() {
|
||||
let inner_ast = UserInputAst::Leaf(Box::new(UserInputLeaf::All));
|
||||
let boost_ast = UserInputAst::Boost(Box::new(inner_ast), 2.5);
|
||||
let json = serde_json::to_string(&boost_ast).unwrap();
|
||||
assert_eq!(
|
||||
json,
|
||||
r#"{"type":"boost","underlying":{"type":"all"},"boost":2.5}"#
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_boost_serialization2() {
|
||||
let boost_ast = UserInputAst::Boost(
|
||||
Box::new(UserInputAst::Clause(vec![
|
||||
(
|
||||
Some(Occur::Must),
|
||||
UserInputAst::Leaf(Box::new(UserInputLeaf::All)),
|
||||
),
|
||||
(
|
||||
Some(Occur::Should),
|
||||
UserInputAst::Leaf(Box::new(UserInputLeaf::Literal(UserInputLiteral {
|
||||
field_name: Some("title".to_string()),
|
||||
phrase: "hello".to_string(),
|
||||
delimiter: Delimiter::None,
|
||||
slop: 0,
|
||||
prefix: false,
|
||||
}))),
|
||||
),
|
||||
])),
|
||||
2.5,
|
||||
);
|
||||
let json = serde_json::to_string(&boost_ast).unwrap();
|
||||
assert_eq!(
|
||||
json,
|
||||
r#"{"type":"boost","underlying":{"type":"bool","clauses":[["must",{"type":"all"}],["should",{"type":"literal","field_name":"title","phrase":"hello","delimiter":"none","slop":0,"prefix":false}]]},"boost":2.5}"#
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_clause_serialization() {
|
||||
let clause = UserInputAst::Clause(vec![
|
||||
(
|
||||
Some(Occur::Must),
|
||||
UserInputAst::Leaf(Box::new(UserInputLeaf::All)),
|
||||
),
|
||||
(
|
||||
Some(Occur::Should),
|
||||
UserInputAst::Leaf(Box::new(UserInputLeaf::Literal(UserInputLiteral {
|
||||
field_name: Some("title".to_string()),
|
||||
phrase: "hello".to_string(),
|
||||
delimiter: Delimiter::None,
|
||||
slop: 0,
|
||||
prefix: false,
|
||||
}))),
|
||||
),
|
||||
]);
|
||||
let json = serde_json::to_string(&clause).unwrap();
|
||||
assert_eq!(
|
||||
json,
|
||||
r#"{"type":"bool","clauses":[["must",{"type":"all"}],["should",{"type":"literal","field_name":"title","phrase":"hello","delimiter":"none","slop":0,"prefix":false}]]}"#
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
@@ -21,10 +21,7 @@ impl<K, V, S> MemoryConsumption for HashMap<K, V, S> {
|
||||
|
||||
/// Aggregation memory limit after which the request fails. Defaults to DEFAULT_MEMORY_LIMIT
|
||||
/// (500MB). The limit is shared by all SegmentCollectors
|
||||
///
|
||||
/// The memory limit is also a guard, which tracks how much it allocated and releases it's memory
|
||||
/// on the shared counter. Cloning will create a new guard.
|
||||
pub struct AggregationLimitsGuard {
|
||||
pub struct AggregationLimits {
|
||||
/// The counter which is shared between the aggregations for one request.
|
||||
memory_consumption: Arc<AtomicU64>,
|
||||
/// The memory_limit in bytes
|
||||
@@ -32,41 +29,28 @@ pub struct AggregationLimitsGuard {
|
||||
/// The maximum number of buckets _returned_
|
||||
/// This is not counting intermediate buckets.
|
||||
bucket_limit: u32,
|
||||
/// Allocated memory with this guard.
|
||||
allocated_with_the_guard: u64,
|
||||
}
|
||||
impl Clone for AggregationLimitsGuard {
|
||||
impl Clone for AggregationLimits {
|
||||
fn clone(&self) -> Self {
|
||||
Self {
|
||||
memory_consumption: Arc::clone(&self.memory_consumption),
|
||||
memory_limit: self.memory_limit,
|
||||
bucket_limit: self.bucket_limit,
|
||||
allocated_with_the_guard: 0,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl Drop for AggregationLimitsGuard {
|
||||
/// Removes the memory consumed tracked by this _instance_ of AggregationLimits.
|
||||
/// This is used to clear the segment specific memory consumption all at once.
|
||||
fn drop(&mut self) {
|
||||
self.memory_consumption
|
||||
.fetch_sub(self.allocated_with_the_guard, Ordering::Relaxed);
|
||||
}
|
||||
}
|
||||
|
||||
impl Default for AggregationLimitsGuard {
|
||||
impl Default for AggregationLimits {
|
||||
fn default() -> Self {
|
||||
Self {
|
||||
memory_consumption: Default::default(),
|
||||
memory_limit: DEFAULT_MEMORY_LIMIT.into(),
|
||||
bucket_limit: DEFAULT_BUCKET_LIMIT,
|
||||
allocated_with_the_guard: 0,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl AggregationLimitsGuard {
|
||||
impl AggregationLimits {
|
||||
/// *memory_limit*
|
||||
/// memory_limit is defined in bytes.
|
||||
/// Aggregation fails when the estimated memory consumption of the aggregation is higher than
|
||||
@@ -83,15 +67,24 @@ impl AggregationLimitsGuard {
|
||||
memory_consumption: Default::default(),
|
||||
memory_limit: memory_limit.unwrap_or(DEFAULT_MEMORY_LIMIT).into(),
|
||||
bucket_limit: bucket_limit.unwrap_or(DEFAULT_BUCKET_LIMIT),
|
||||
}
|
||||
}
|
||||
|
||||
/// Create a new ResourceLimitGuard, that will release the memory when dropped.
|
||||
pub fn new_guard(&self) -> ResourceLimitGuard {
|
||||
ResourceLimitGuard {
|
||||
// The counter which is shared between the aggregations for one request.
|
||||
memory_consumption: Arc::clone(&self.memory_consumption),
|
||||
// The memory_limit in bytes
|
||||
memory_limit: self.memory_limit,
|
||||
allocated_with_the_guard: 0,
|
||||
}
|
||||
}
|
||||
|
||||
pub(crate) fn add_memory_consumed(&mut self, add_num_bytes: u64) -> crate::Result<()> {
|
||||
pub(crate) fn add_memory_consumed(&self, add_num_bytes: u64) -> crate::Result<()> {
|
||||
let prev_value = self
|
||||
.memory_consumption
|
||||
.fetch_add(add_num_bytes, Ordering::Relaxed);
|
||||
self.allocated_with_the_guard += add_num_bytes;
|
||||
validate_memory_consumption(prev_value + add_num_bytes, self.memory_limit)?;
|
||||
Ok(())
|
||||
}
|
||||
@@ -116,6 +109,34 @@ fn validate_memory_consumption(
|
||||
Ok(())
|
||||
}
|
||||
|
||||
pub struct ResourceLimitGuard {
|
||||
/// The counter which is shared between the aggregations for one request.
|
||||
memory_consumption: Arc<AtomicU64>,
|
||||
/// The memory_limit in bytes
|
||||
memory_limit: ByteCount,
|
||||
/// Allocated memory with this guard.
|
||||
allocated_with_the_guard: u64,
|
||||
}
|
||||
|
||||
impl ResourceLimitGuard {
|
||||
pub(crate) fn add_memory_consumed(&self, add_num_bytes: u64) -> crate::Result<()> {
|
||||
let prev_value = self
|
||||
.memory_consumption
|
||||
.fetch_add(add_num_bytes, Ordering::Relaxed);
|
||||
validate_memory_consumption(prev_value + add_num_bytes, self.memory_limit)?;
|
||||
Ok(())
|
||||
}
|
||||
}
|
||||
|
||||
impl Drop for ResourceLimitGuard {
|
||||
/// Removes the memory consumed tracked by this _instance_ of AggregationLimits.
|
||||
/// This is used to clear the segment specific memory consumption all at once.
|
||||
fn drop(&mut self) {
|
||||
self.memory_consumption
|
||||
.fetch_sub(self.allocated_with_the_guard, Ordering::Relaxed);
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use crate::aggregation::tests::exec_request_with_query;
|
||||
|
||||
@@ -34,9 +34,8 @@ use super::bucket::{
|
||||
DateHistogramAggregationReq, HistogramAggregation, RangeAggregation, TermsAggregation,
|
||||
};
|
||||
use super::metric::{
|
||||
AverageAggregation, CardinalityAggregationReq, CountAggregation, ExtendedStatsAggregation,
|
||||
MaxAggregation, MinAggregation, PercentilesAggregationReq, StatsAggregation, SumAggregation,
|
||||
TopHitsAggregationReq,
|
||||
AverageAggregation, CountAggregation, ExtendedStatsAggregation, MaxAggregation, MinAggregation,
|
||||
PercentilesAggregationReq, StatsAggregation, SumAggregation, TopHitsAggregation,
|
||||
};
|
||||
|
||||
/// The top-level aggregation request structure, which contains [`Aggregation`] and their user
|
||||
@@ -160,10 +159,7 @@ pub enum AggregationVariants {
|
||||
Percentiles(PercentilesAggregationReq),
|
||||
/// Finds the top k values matching some order
|
||||
#[serde(rename = "top_hits")]
|
||||
TopHits(TopHitsAggregationReq),
|
||||
/// Computes an estimate of the number of unique values
|
||||
#[serde(rename = "cardinality")]
|
||||
Cardinality(CardinalityAggregationReq),
|
||||
TopHits(TopHitsAggregation),
|
||||
}
|
||||
|
||||
impl AggregationVariants {
|
||||
@@ -183,7 +179,6 @@ impl AggregationVariants {
|
||||
AggregationVariants::Sum(sum) => vec![sum.field_name()],
|
||||
AggregationVariants::Percentiles(per) => vec![per.field_name()],
|
||||
AggregationVariants::TopHits(top_hits) => top_hits.field_names(),
|
||||
AggregationVariants::Cardinality(per) => vec![per.field_name()],
|
||||
}
|
||||
}
|
||||
|
||||
@@ -208,7 +203,7 @@ impl AggregationVariants {
|
||||
_ => None,
|
||||
}
|
||||
}
|
||||
pub(crate) fn as_top_hits(&self) -> Option<&TopHitsAggregationReq> {
|
||||
pub(crate) fn as_top_hits(&self) -> Option<&TopHitsAggregation> {
|
||||
match &self {
|
||||
AggregationVariants::TopHits(top_hits) => Some(top_hits),
|
||||
_ => None,
|
||||
|
||||
@@ -5,15 +5,16 @@ use std::io;
|
||||
|
||||
use columnar::{Column, ColumnBlockAccessor, ColumnType, DynamicColumn, StrColumn};
|
||||
|
||||
use super::agg_limits::ResourceLimitGuard;
|
||||
use super::agg_req::{Aggregation, AggregationVariants, Aggregations};
|
||||
use super::bucket::{
|
||||
DateHistogramAggregationReq, HistogramAggregation, RangeAggregation, TermsAggregation,
|
||||
};
|
||||
use super::metric::{
|
||||
AverageAggregation, CardinalityAggregationReq, CountAggregation, ExtendedStatsAggregation,
|
||||
MaxAggregation, MinAggregation, StatsAggregation, SumAggregation,
|
||||
AverageAggregation, CountAggregation, ExtendedStatsAggregation, MaxAggregation, MinAggregation,
|
||||
StatsAggregation, SumAggregation,
|
||||
};
|
||||
use super::segment_agg_result::AggregationLimitsGuard;
|
||||
use super::segment_agg_result::AggregationLimits;
|
||||
use super::VecWithNames;
|
||||
use crate::aggregation::{f64_to_fastfield_u64, Key};
|
||||
use crate::index::SegmentReader;
|
||||
@@ -45,7 +46,7 @@ pub struct AggregationWithAccessor {
|
||||
pub(crate) str_dict_column: Option<StrColumn>,
|
||||
pub(crate) field_type: ColumnType,
|
||||
pub(crate) sub_aggregation: AggregationsWithAccessor,
|
||||
pub(crate) limits: AggregationLimitsGuard,
|
||||
pub(crate) limits: ResourceLimitGuard,
|
||||
pub(crate) column_block_accessor: ColumnBlockAccessor<u64>,
|
||||
/// Used for missing term aggregation, which checks all columns for existence.
|
||||
/// And also for `top_hits` aggregation, which may sort on multiple fields.
|
||||
@@ -68,7 +69,7 @@ impl AggregationWithAccessor {
|
||||
sub_aggregation: &Aggregations,
|
||||
reader: &SegmentReader,
|
||||
segment_ordinal: SegmentOrdinal,
|
||||
limits: AggregationLimitsGuard,
|
||||
limits: AggregationLimits,
|
||||
) -> crate::Result<Vec<AggregationWithAccessor>> {
|
||||
let mut agg = agg.clone();
|
||||
|
||||
@@ -90,7 +91,7 @@ impl AggregationWithAccessor {
|
||||
&limits,
|
||||
)?,
|
||||
agg: agg.clone(),
|
||||
limits: limits.clone(),
|
||||
limits: limits.new_guard(),
|
||||
missing_value_for_accessor: None,
|
||||
str_dict_column: None,
|
||||
column_block_accessor: Default::default(),
|
||||
@@ -105,7 +106,6 @@ impl AggregationWithAccessor {
|
||||
value_accessors: HashMap<String, Vec<DynamicColumn>>|
|
||||
-> crate::Result<()> {
|
||||
let (accessor, field_type) = accessors.first().expect("at least one accessor");
|
||||
let limits = limits.clone();
|
||||
let res = AggregationWithAccessor {
|
||||
segment_ordinal,
|
||||
// TODO: We should do away with the `accessor` field altogether
|
||||
@@ -120,7 +120,7 @@ impl AggregationWithAccessor {
|
||||
&limits,
|
||||
)?,
|
||||
agg: agg.clone(),
|
||||
limits,
|
||||
limits: limits.new_guard(),
|
||||
missing_value_for_accessor: None,
|
||||
str_dict_column: None,
|
||||
column_block_accessor: Default::default(),
|
||||
@@ -162,11 +162,6 @@ impl AggregationWithAccessor {
|
||||
field: ref field_name,
|
||||
ref missing,
|
||||
..
|
||||
})
|
||||
| Cardinality(CardinalityAggregationReq {
|
||||
field: ref field_name,
|
||||
ref missing,
|
||||
..
|
||||
}) => {
|
||||
let str_dict_column = reader.fast_fields().str(field_name)?;
|
||||
let allowed_column_types = [
|
||||
@@ -186,8 +181,6 @@ impl AggregationWithAccessor {
|
||||
.map(|missing| match missing {
|
||||
Key::Str(_) => ColumnType::Str,
|
||||
Key::F64(_) => ColumnType::F64,
|
||||
Key::I64(_) => ColumnType::I64,
|
||||
Key::U64(_) => ColumnType::U64,
|
||||
})
|
||||
.unwrap_or(ColumnType::U64);
|
||||
let column_and_types = get_all_ff_reader_or_empty(
|
||||
@@ -234,18 +227,14 @@ impl AggregationWithAccessor {
|
||||
missing.clone()
|
||||
};
|
||||
|
||||
let missing_value_for_accessor =
|
||||
if let Some(missing) = missing_value_term_agg.as_ref() {
|
||||
get_missing_val_as_u64_lenient(
|
||||
column_type,
|
||||
missing,
|
||||
agg.agg.get_fast_field_names()[0],
|
||||
)?
|
||||
} else {
|
||||
None
|
||||
};
|
||||
let missing_value_for_accessor = if let Some(missing) =
|
||||
missing_value_term_agg.as_ref()
|
||||
{
|
||||
get_missing_val(column_type, missing, agg.agg.get_fast_field_names()[0])?
|
||||
} else {
|
||||
None
|
||||
};
|
||||
|
||||
let limits = limits.clone();
|
||||
let agg = AggregationWithAccessor {
|
||||
segment_ordinal,
|
||||
missing_value_for_accessor,
|
||||
@@ -261,7 +250,7 @@ impl AggregationWithAccessor {
|
||||
)?,
|
||||
agg: agg.clone(),
|
||||
str_dict_column: str_dict_column.clone(),
|
||||
limits,
|
||||
limits: limits.new_guard(),
|
||||
column_block_accessor: Default::default(),
|
||||
};
|
||||
res.push(agg);
|
||||
@@ -271,6 +260,10 @@ impl AggregationWithAccessor {
|
||||
field: ref field_name,
|
||||
..
|
||||
})
|
||||
| Count(CountAggregation {
|
||||
field: ref field_name,
|
||||
..
|
||||
})
|
||||
| Max(MaxAggregation {
|
||||
field: ref field_name,
|
||||
..
|
||||
@@ -295,24 +288,6 @@ impl AggregationWithAccessor {
|
||||
get_ff_reader(reader, field_name, Some(get_numeric_or_date_column_types()))?;
|
||||
add_agg_with_accessor(&agg, accessor, column_type, &mut res)?;
|
||||
}
|
||||
Count(CountAggregation {
|
||||
field: ref field_name,
|
||||
..
|
||||
}) => {
|
||||
let allowed_column_types = [
|
||||
ColumnType::I64,
|
||||
ColumnType::U64,
|
||||
ColumnType::F64,
|
||||
ColumnType::Str,
|
||||
ColumnType::DateTime,
|
||||
ColumnType::Bool,
|
||||
ColumnType::IpAddr,
|
||||
// ColumnType::Bytes Unsupported
|
||||
];
|
||||
let (accessor, column_type) =
|
||||
get_ff_reader(reader, field_name, Some(&allowed_column_types))?;
|
||||
add_agg_with_accessor(&agg, accessor, column_type, &mut res)?;
|
||||
}
|
||||
Percentiles(ref percentiles) => {
|
||||
let (accessor, column_type) = get_ff_reader(
|
||||
reader,
|
||||
@@ -350,14 +325,7 @@ impl AggregationWithAccessor {
|
||||
}
|
||||
}
|
||||
|
||||
/// Get the missing value as internal u64 representation
|
||||
///
|
||||
/// For terms we use u64::MAX as sentinel value
|
||||
/// For numerical data we convert the value into the representation
|
||||
/// we would get from the fast field, when we open it as u64_lenient_for_type.
|
||||
///
|
||||
/// That way we can use it the same way as if it would come from the fastfield.
|
||||
fn get_missing_val_as_u64_lenient(
|
||||
fn get_missing_val(
|
||||
column_type: ColumnType,
|
||||
missing: &Key,
|
||||
field_name: &str,
|
||||
@@ -366,18 +334,9 @@ fn get_missing_val_as_u64_lenient(
|
||||
Key::Str(_) if column_type == ColumnType::Str => Some(u64::MAX),
|
||||
// Allow fallback to number on text fields
|
||||
Key::F64(_) if column_type == ColumnType::Str => Some(u64::MAX),
|
||||
Key::U64(_) if column_type == ColumnType::Str => Some(u64::MAX),
|
||||
Key::I64(_) if column_type == ColumnType::Str => Some(u64::MAX),
|
||||
Key::F64(val) if column_type.numerical_type().is_some() => {
|
||||
f64_to_fastfield_u64(*val, &column_type)
|
||||
}
|
||||
// NOTE: We may loose precision of the passed missing value by casting i64 and u64 to f64.
|
||||
Key::I64(val) if column_type.numerical_type().is_some() => {
|
||||
f64_to_fastfield_u64(*val as f64, &column_type)
|
||||
}
|
||||
Key::U64(val) if column_type.numerical_type().is_some() => {
|
||||
f64_to_fastfield_u64(*val as f64, &column_type)
|
||||
}
|
||||
_ => {
|
||||
return Err(crate::TantivyError::InvalidArgument(format!(
|
||||
"Missing value {missing:?} for field {field_name} is not supported for column \
|
||||
@@ -401,7 +360,7 @@ pub(crate) fn get_aggs_with_segment_accessor_and_validate(
|
||||
aggs: &Aggregations,
|
||||
reader: &SegmentReader,
|
||||
segment_ordinal: SegmentOrdinal,
|
||||
limits: &AggregationLimitsGuard,
|
||||
limits: &AggregationLimits,
|
||||
) -> crate::Result<AggregationsWithAccessor> {
|
||||
let mut aggss = Vec::new();
|
||||
for (key, agg) in aggs.iter() {
|
||||
|
||||
@@ -1,5 +1,4 @@
|
||||
//! Contains the final aggregation tree.
|
||||
//!
|
||||
//! This tree can be converted via the `into()` method from `IntermediateAggregationResults`.
|
||||
//! This conversion computes the final result. For example: The intermediate result contains
|
||||
//! intermediate average results, which is the sum and the number of values. The actual average is
|
||||
@@ -99,8 +98,6 @@ pub enum MetricResult {
|
||||
Percentiles(PercentilesMetricResult),
|
||||
/// Top hits metric result
|
||||
TopHits(TopHitsMetricResult),
|
||||
/// Cardinality metric result
|
||||
Cardinality(SingleMetricResult),
|
||||
}
|
||||
|
||||
impl MetricResult {
|
||||
@@ -119,7 +116,6 @@ impl MetricResult {
|
||||
MetricResult::TopHits(_) => Err(TantivyError::AggregationError(
|
||||
AggregationError::InvalidRequest("top_hits can't be used to order".to_string()),
|
||||
)),
|
||||
MetricResult::Cardinality(card) => Ok(card.value),
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -188,7 +184,7 @@ pub enum BucketEntries<T> {
|
||||
}
|
||||
|
||||
impl<T> BucketEntries<T> {
|
||||
fn iter<'a>(&'a self) -> Box<dyn Iterator<Item = &'a T> + 'a> {
|
||||
fn iter<'a>(&'a self) -> Box<dyn Iterator<Item = &T> + 'a> {
|
||||
match self {
|
||||
BucketEntries::Vec(vec) => Box::new(vec.iter()),
|
||||
BucketEntries::HashMap(map) => Box::new(map.values()),
|
||||
|
||||
@@ -5,7 +5,7 @@ use crate::aggregation::agg_result::AggregationResults;
|
||||
use crate::aggregation::buf_collector::DOC_BLOCK_SIZE;
|
||||
use crate::aggregation::collector::AggregationCollector;
|
||||
use crate::aggregation::intermediate_agg_result::IntermediateAggregationResults;
|
||||
use crate::aggregation::segment_agg_result::AggregationLimitsGuard;
|
||||
use crate::aggregation::segment_agg_result::AggregationLimits;
|
||||
use crate::aggregation::tests::{get_test_index_2_segments, get_test_index_from_values_and_terms};
|
||||
use crate::aggregation::DistributedAggregationCollector;
|
||||
use crate::query::{AllQuery, TermQuery};
|
||||
@@ -110,16 +110,6 @@ fn test_aggregation_flushing(
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"cardinality_string_id":{
|
||||
"cardinality": {
|
||||
"field": "string_id"
|
||||
}
|
||||
},
|
||||
"cardinality_score":{
|
||||
"cardinality": {
|
||||
"field": "score"
|
||||
}
|
||||
}
|
||||
});
|
||||
|
||||
@@ -130,7 +120,7 @@ fn test_aggregation_flushing(
|
||||
let agg_res: AggregationResults = if use_distributed_collector {
|
||||
let collector = DistributedAggregationCollector::from_aggs(
|
||||
agg_req.clone(),
|
||||
AggregationLimitsGuard::default(),
|
||||
AggregationLimits::default(),
|
||||
);
|
||||
|
||||
let searcher = reader.searcher();
|
||||
@@ -146,7 +136,7 @@ fn test_aggregation_flushing(
|
||||
.expect("Post deserialization failed");
|
||||
|
||||
intermediate_agg_result
|
||||
.into_final_result(agg_req, Default::default())
|
||||
.into_final_result(agg_req, &Default::default())
|
||||
.unwrap()
|
||||
} else {
|
||||
let collector = get_collector(agg_req);
|
||||
@@ -222,9 +212,6 @@ fn test_aggregation_flushing(
|
||||
)
|
||||
);
|
||||
|
||||
assert_eq!(res["cardinality_string_id"]["value"], 2.0);
|
||||
assert_eq!(res["cardinality_score"]["value"], 80.0);
|
||||
|
||||
Ok(())
|
||||
}
|
||||
|
||||
@@ -460,7 +447,7 @@ fn test_aggregation_level2(
|
||||
|
||||
let searcher = reader.searcher();
|
||||
let res = searcher.search(&term_query, &collector).unwrap();
|
||||
res.into_final_result(agg_req.clone(), Default::default())
|
||||
res.into_final_result(agg_req.clone(), &Default::default())
|
||||
.unwrap()
|
||||
} else {
|
||||
let collector = get_collector(agg_req.clone());
|
||||
@@ -870,7 +857,7 @@ fn test_aggregation_on_json_object_mixed_types() {
|
||||
.add_document(doc!(json => json!({"mixed_type": "blue", "mixed_price": 5.0})))
|
||||
.unwrap();
|
||||
index_writer.commit().unwrap();
|
||||
// => Segment with all boolean
|
||||
// => Segment with all boolen
|
||||
index_writer
|
||||
.add_document(doc!(json => json!({"mixed_type": true, "mixed_price": "no_price"})))
|
||||
.unwrap();
|
||||
@@ -939,11 +926,11 @@ fn test_aggregation_on_json_object_mixed_types() {
|
||||
},
|
||||
"termagg": {
|
||||
"buckets": [
|
||||
{ "doc_count": 1, "key": 10, "min_price": { "value": 10.0 } },
|
||||
{ "doc_count": 1, "key": 10.0, "min_price": { "value": 10.0 } },
|
||||
{ "doc_count": 3, "key": "blue", "min_price": { "value": 5.0 } },
|
||||
{ "doc_count": 2, "key": "red", "min_price": { "value": 1.0 } },
|
||||
{ "doc_count": 1, "key": -20.5, "min_price": { "value": -20.5 } },
|
||||
{ "doc_count": 2, "key": 1, "key_as_string": "true", "min_price": { "value": null } },
|
||||
{ "doc_count": 2, "key": 1.0, "key_as_string": "true", "min_price": { "value": null } },
|
||||
],
|
||||
"sum_other_doc_count": 0
|
||||
}
|
||||
@@ -951,60 +938,3 @@ fn test_aggregation_on_json_object_mixed_types() {
|
||||
)
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_aggregation_on_json_object_mixed_numerical_segments() {
|
||||
let mut schema_builder = Schema::builder();
|
||||
let json = schema_builder.add_json_field("json", FAST);
|
||||
let schema = schema_builder.build();
|
||||
let index = Index::create_in_ram(schema);
|
||||
let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
|
||||
// => Segment with all values f64 numeric
|
||||
index_writer
|
||||
.add_document(doc!(json => json!({"mixed_price": 10.5})))
|
||||
.unwrap();
|
||||
// Gets converted to f64!
|
||||
index_writer
|
||||
.add_document(doc!(json => json!({"mixed_price": 10})))
|
||||
.unwrap();
|
||||
index_writer.commit().unwrap();
|
||||
// => Segment with all values i64 numeric
|
||||
index_writer
|
||||
.add_document(doc!(json => json!({"mixed_price": 10})))
|
||||
.unwrap();
|
||||
index_writer.commit().unwrap();
|
||||
|
||||
index_writer.commit().unwrap();
|
||||
|
||||
// All bucket types
|
||||
let agg_req_str = r#"
|
||||
{
|
||||
"termagg": {
|
||||
"terms": {
|
||||
"field": "json.mixed_price"
|
||||
}
|
||||
}
|
||||
} "#;
|
||||
let agg: Aggregations = serde_json::from_str(agg_req_str).unwrap();
|
||||
let aggregation_collector = get_collector(agg);
|
||||
let reader = index.reader().unwrap();
|
||||
let searcher = reader.searcher();
|
||||
|
||||
let aggregation_results = searcher.search(&AllQuery, &aggregation_collector).unwrap();
|
||||
let aggregation_res_json = serde_json::to_value(aggregation_results).unwrap();
|
||||
use pretty_assertions::assert_eq;
|
||||
assert_eq!(
|
||||
&aggregation_res_json,
|
||||
&serde_json::json!({
|
||||
"termagg": {
|
||||
"buckets": [
|
||||
{ "doc_count": 2, "key": 10},
|
||||
{ "doc_count": 1, "key": 10.5},
|
||||
],
|
||||
"doc_count_error_upper_bound": 0,
|
||||
"sum_other_doc_count": 0
|
||||
}
|
||||
}
|
||||
)
|
||||
);
|
||||
}
|
||||
|
||||
@@ -34,10 +34,10 @@ use crate::aggregation::*;
|
||||
pub struct DateHistogramAggregationReq {
|
||||
#[doc(hidden)]
|
||||
/// Only for validation
|
||||
pub interval: Option<String>,
|
||||
interval: Option<String>,
|
||||
#[doc(hidden)]
|
||||
/// Only for validation
|
||||
pub calendar_interval: Option<String>,
|
||||
calendar_interval: Option<String>,
|
||||
/// The field to aggregate on.
|
||||
pub field: String,
|
||||
/// The format to format dates. Unsupported currently.
|
||||
@@ -244,7 +244,7 @@ fn parse_into_milliseconds(input: &str) -> Result<i64, AggregationError> {
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
pub(crate) mod tests {
|
||||
pub mod tests {
|
||||
use pretty_assertions::assert_eq;
|
||||
|
||||
use super::*;
|
||||
|
||||
@@ -438,7 +438,7 @@ fn intermediate_buckets_to_final_buckets_fill_gaps(
|
||||
buckets: Vec<IntermediateHistogramBucketEntry>,
|
||||
histogram_req: &HistogramAggregation,
|
||||
sub_aggregation: &Aggregations,
|
||||
limits: &mut AggregationLimitsGuard,
|
||||
limits: &AggregationLimits,
|
||||
) -> crate::Result<Vec<BucketEntry>> {
|
||||
// Generate the full list of buckets without gaps.
|
||||
//
|
||||
@@ -496,7 +496,7 @@ pub(crate) fn intermediate_histogram_buckets_to_final_buckets(
|
||||
is_date_agg: bool,
|
||||
histogram_req: &HistogramAggregation,
|
||||
sub_aggregation: &Aggregations,
|
||||
limits: &mut AggregationLimitsGuard,
|
||||
limits: &AggregationLimits,
|
||||
) -> crate::Result<Vec<BucketEntry>> {
|
||||
// Normalization is column type dependent.
|
||||
// The request used in the the call to final is not yet be normalized.
|
||||
@@ -750,7 +750,7 @@ mod tests {
|
||||
agg_req,
|
||||
&index,
|
||||
None,
|
||||
AggregationLimitsGuard::new(Some(5_000), None),
|
||||
AggregationLimits::new(Some(5_000), None),
|
||||
)
|
||||
.unwrap_err();
|
||||
assert!(res.to_string().starts_with(
|
||||
|
||||
@@ -112,64 +112,18 @@ impl Serialize for CustomOrder {
|
||||
impl<'de> Deserialize<'de> for CustomOrder {
|
||||
fn deserialize<D>(deserializer: D) -> Result<CustomOrder, D::Error>
|
||||
where D: Deserializer<'de> {
|
||||
let value = serde_json::Value::deserialize(deserializer)?;
|
||||
let return_err = |message, val: serde_json::Value| {
|
||||
de::Error::custom(format!(
|
||||
"{}, but got {}",
|
||||
message,
|
||||
serde_json::to_string(&val).unwrap()
|
||||
))
|
||||
};
|
||||
|
||||
match value {
|
||||
serde_json::Value::Object(map) => {
|
||||
if map.len() != 1 {
|
||||
return Err(return_err(
|
||||
"expected exactly one key-value pair in the order map",
|
||||
map.into(),
|
||||
));
|
||||
}
|
||||
|
||||
let (key, value) = map.into_iter().next().unwrap();
|
||||
let order = serde_json::from_value(value).map_err(de::Error::custom)?;
|
||||
|
||||
HashMap::<String, Order>::deserialize(deserializer).and_then(|map| {
|
||||
if let Some((key, value)) = map.into_iter().next() {
|
||||
Ok(CustomOrder {
|
||||
target: key.as_str().into(),
|
||||
order,
|
||||
order: value,
|
||||
})
|
||||
} else {
|
||||
Err(de::Error::custom(
|
||||
"unexpected empty map in order".to_string(),
|
||||
))
|
||||
}
|
||||
serde_json::Value::Array(arr) => {
|
||||
if arr.is_empty() {
|
||||
return Err(return_err("unexpected empty array in order", arr.into()));
|
||||
}
|
||||
if arr.len() != 1 {
|
||||
return Err(return_err(
|
||||
"only one sort order supported currently",
|
||||
arr.into(),
|
||||
));
|
||||
}
|
||||
let entry = arr.into_iter().next().unwrap();
|
||||
let map = entry
|
||||
.as_object()
|
||||
.ok_or_else(|| return_err("expected object as sort order", entry.clone()))?;
|
||||
let (key, value) = map.into_iter().next().ok_or_else(|| {
|
||||
return_err(
|
||||
"expected exactly one key-value pair in the order map",
|
||||
entry.clone(),
|
||||
)
|
||||
})?;
|
||||
let order = serde_json::from_value(value.clone()).map_err(de::Error::custom)?;
|
||||
|
||||
Ok(CustomOrder {
|
||||
target: key.as_str().into(),
|
||||
order,
|
||||
})
|
||||
}
|
||||
_ => Err(return_err(
|
||||
"unexpected type, expected an object or array",
|
||||
value,
|
||||
)),
|
||||
}
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
@@ -184,23 +138,11 @@ fn custom_order_serde_test() {
|
||||
assert_eq!(order_str, "{\"_key\":\"desc\"}");
|
||||
let order_deser = serde_json::from_str(&order_str).unwrap();
|
||||
|
||||
assert_eq!(order, order_deser);
|
||||
let order_deser: CustomOrder = serde_json::from_str("[{\"_key\":\"desc\"}]").unwrap();
|
||||
assert_eq!(order, order_deser);
|
||||
|
||||
let order_deser: serde_json::Result<CustomOrder> = serde_json::from_str("{}");
|
||||
assert!(order_deser.is_err());
|
||||
|
||||
let order_deser: serde_json::Result<CustomOrder> = serde_json::from_str("[]");
|
||||
assert!(order_deser
|
||||
.unwrap_err()
|
||||
.to_string()
|
||||
.contains("unexpected empty array in order"));
|
||||
|
||||
let order_deser: serde_json::Result<CustomOrder> =
|
||||
serde_json::from_str(r#"[{"_key":"desc"},{"_key":"desc"}]"#);
|
||||
assert_eq!(
|
||||
order_deser.unwrap_err().to_string(),
|
||||
r#"only one sort order supported currently, but got [{"_key":"desc"},{"_key":"desc"}]"#
|
||||
);
|
||||
assert!(order_deser.is_err());
|
||||
}
|
||||
|
||||
@@ -4,6 +4,7 @@ use std::ops::Range;
|
||||
use rustc_hash::FxHashMap;
|
||||
use serde::{Deserialize, Serialize};
|
||||
|
||||
use crate::aggregation::agg_limits::ResourceLimitGuard;
|
||||
use crate::aggregation::agg_req_with_accessor::AggregationsWithAccessor;
|
||||
use crate::aggregation::intermediate_agg_result::{
|
||||
IntermediateAggregationResult, IntermediateAggregationResults, IntermediateBucketResult,
|
||||
@@ -16,7 +17,6 @@ use crate::aggregation::*;
|
||||
use crate::TantivyError;
|
||||
|
||||
/// Provide user-defined buckets to aggregate on.
|
||||
///
|
||||
/// Two special buckets will automatically be created to cover the whole range of values.
|
||||
/// The provided buckets have to be continuous.
|
||||
/// During the aggregation, the values extracted from the fast_field `field` will be checked
|
||||
@@ -270,7 +270,7 @@ impl SegmentRangeCollector {
|
||||
pub(crate) fn from_req_and_validate(
|
||||
req: &RangeAggregation,
|
||||
sub_aggregation: &mut AggregationsWithAccessor,
|
||||
limits: &mut AggregationLimitsGuard,
|
||||
limits: &ResourceLimitGuard,
|
||||
field_type: ColumnType,
|
||||
accessor_idx: usize,
|
||||
) -> crate::Result<Self> {
|
||||
@@ -471,7 +471,7 @@ mod tests {
|
||||
SegmentRangeCollector::from_req_and_validate(
|
||||
&req,
|
||||
&mut Default::default(),
|
||||
&mut AggregationLimitsGuard::default(),
|
||||
&AggregationLimits::default().new_guard(),
|
||||
field_type,
|
||||
0,
|
||||
)
|
||||
|
||||
@@ -1,10 +1,9 @@
|
||||
use std::fmt::Debug;
|
||||
use std::io;
|
||||
use std::net::Ipv6Addr;
|
||||
|
||||
use columnar::column_values::CompactSpaceU64Accessor;
|
||||
use columnar::{
|
||||
ColumnType, Dictionary, MonotonicallyMappableToU128, MonotonicallyMappableToU64, NumericalValue,
|
||||
BytesColumn, ColumnType, MonotonicallyMappableToU128, MonotonicallyMappableToU64, StrColumn,
|
||||
};
|
||||
use rustc_hash::FxHashMap;
|
||||
use serde::{Deserialize, Serialize};
|
||||
@@ -21,11 +20,11 @@ use crate::aggregation::intermediate_agg_result::{
|
||||
use crate::aggregation::segment_agg_result::{
|
||||
build_segment_agg_collector, SegmentAggregationCollector,
|
||||
};
|
||||
use crate::aggregation::{format_date, Key};
|
||||
use crate::aggregation::{f64_from_fastfield_u64, format_date, Key};
|
||||
use crate::error::DataCorruption;
|
||||
use crate::TantivyError;
|
||||
|
||||
/// Creates a bucket for every unique term and counts the number of occurrences.
|
||||
/// Creates a bucket for every unique term and counts the number of occurences.
|
||||
/// Note that doc_count in the response buckets equals term count here.
|
||||
///
|
||||
/// If the text is untokenized and single value, that means one term per document and therefore it
|
||||
@@ -158,7 +157,7 @@ pub struct TermsAggregation {
|
||||
/// when loading the text.
|
||||
/// Special Case 1:
|
||||
/// If we have multiple columns on one field, we need to have a union on the indices on both
|
||||
/// columns, to find docids without a value. That requires a special missing aggregation.
|
||||
/// columns, to find docids without a value. That requires a special missing aggreggation.
|
||||
/// Special Case 2: if the key is of type text and the column is numerical, we also need to use
|
||||
/// the special missing aggregation, since there is no mechanism in the numerical column to
|
||||
/// add text.
|
||||
@@ -364,7 +363,7 @@ impl SegmentTermCollector {
|
||||
let term_buckets = TermBuckets::default();
|
||||
|
||||
if let Some(custom_order) = req.order.as_ref() {
|
||||
// Validate sub aggregation exists
|
||||
// Validate sub aggregtion exists
|
||||
if let OrderTarget::SubAggregation(sub_agg_name) = &custom_order.target {
|
||||
let (agg_name, _agg_property) = get_agg_name_and_property(sub_agg_name);
|
||||
|
||||
@@ -467,72 +466,49 @@ impl SegmentTermCollector {
|
||||
};
|
||||
|
||||
if self.column_type == ColumnType::Str {
|
||||
let fallback_dict = Dictionary::empty();
|
||||
let term_dict = agg_with_accessor
|
||||
.str_dict_column
|
||||
.as_ref()
|
||||
.map(|el| el.dictionary())
|
||||
.unwrap_or_else(|| &fallback_dict);
|
||||
let mut buffer = Vec::new();
|
||||
|
||||
// special case for missing key
|
||||
if let Some(index) = entries.iter().position(|value| value.0 == u64::MAX) {
|
||||
let entry = entries[index];
|
||||
let intermediate_entry = into_intermediate_bucket_entry(entry.0, entry.1)?;
|
||||
let missing_key = self
|
||||
.req
|
||||
.missing
|
||||
.as_ref()
|
||||
.expect("Found placeholder term_id but `missing` is None");
|
||||
match missing_key {
|
||||
Key::Str(missing) => {
|
||||
buffer.clear();
|
||||
buffer.extend_from_slice(missing.as_bytes());
|
||||
dict.insert(
|
||||
IntermediateKey::Str(
|
||||
String::from_utf8(buffer.to_vec())
|
||||
.expect("could not convert to String"),
|
||||
),
|
||||
intermediate_entry,
|
||||
);
|
||||
.cloned()
|
||||
.unwrap_or_else(|| {
|
||||
StrColumn::wrap(BytesColumn::empty(agg_with_accessor.accessor.num_docs()))
|
||||
});
|
||||
let mut buffer = String::new();
|
||||
for (term_id, doc_count) in entries {
|
||||
let intermediate_entry = into_intermediate_bucket_entry(term_id, doc_count)?;
|
||||
// Special case for missing key
|
||||
if term_id == u64::MAX {
|
||||
let missing_key = self
|
||||
.req
|
||||
.missing
|
||||
.as_ref()
|
||||
.expect("Found placeholder term_id but `missing` is None");
|
||||
match missing_key {
|
||||
Key::Str(missing) => {
|
||||
buffer.clear();
|
||||
buffer.push_str(missing);
|
||||
dict.insert(
|
||||
IntermediateKey::Str(buffer.to_string()),
|
||||
intermediate_entry,
|
||||
);
|
||||
}
|
||||
Key::F64(val) => {
|
||||
buffer.push_str(&val.to_string());
|
||||
dict.insert(IntermediateKey::F64(*val), intermediate_entry);
|
||||
}
|
||||
}
|
||||
Key::F64(val) => {
|
||||
dict.insert(IntermediateKey::F64(*val), intermediate_entry);
|
||||
}
|
||||
Key::U64(val) => {
|
||||
dict.insert(IntermediateKey::U64(*val), intermediate_entry);
|
||||
}
|
||||
Key::I64(val) => {
|
||||
dict.insert(IntermediateKey::I64(*val), intermediate_entry);
|
||||
} else {
|
||||
if !term_dict.ord_to_str(term_id, &mut buffer)? {
|
||||
return Err(TantivyError::InternalError(format!(
|
||||
"Couldn't find term_id {term_id} in dict"
|
||||
)));
|
||||
}
|
||||
dict.insert(IntermediateKey::Str(buffer.to_string()), intermediate_entry);
|
||||
}
|
||||
|
||||
entries.swap_remove(index);
|
||||
}
|
||||
|
||||
// Sort by term ord
|
||||
entries.sort_unstable_by_key(|bucket| bucket.0);
|
||||
let mut idx = 0;
|
||||
term_dict.sorted_ords_to_term_cb(
|
||||
entries.iter().map(|(term_id, _)| *term_id),
|
||||
|term| {
|
||||
let entry = entries[idx];
|
||||
let intermediate_entry = into_intermediate_bucket_entry(entry.0, entry.1)
|
||||
.map_err(|err| io::Error::new(io::ErrorKind::Other, err))?;
|
||||
dict.insert(
|
||||
IntermediateKey::Str(
|
||||
String::from_utf8(term.to_vec()).expect("could not convert to String"),
|
||||
),
|
||||
intermediate_entry,
|
||||
);
|
||||
idx += 1;
|
||||
Ok(())
|
||||
},
|
||||
)?;
|
||||
|
||||
if self.req.min_doc_count == 0 {
|
||||
// TODO: Handle rev streaming for descending sorting by keys
|
||||
let mut stream = term_dict.stream()?;
|
||||
let mut stream = term_dict.dictionary().stream()?;
|
||||
let empty_sub_aggregation = IntermediateAggregationResults::empty_from_req(
|
||||
agg_with_accessor.agg.sub_aggregation(),
|
||||
);
|
||||
@@ -591,26 +567,8 @@ impl SegmentTermCollector {
|
||||
} else {
|
||||
for (val, doc_count) in entries {
|
||||
let intermediate_entry = into_intermediate_bucket_entry(val, doc_count)?;
|
||||
if self.column_type == ColumnType::U64 {
|
||||
dict.insert(IntermediateKey::U64(val), intermediate_entry);
|
||||
} else if self.column_type == ColumnType::I64 {
|
||||
dict.insert(IntermediateKey::I64(i64::from_u64(val)), intermediate_entry);
|
||||
} else {
|
||||
let val = f64::from_u64(val);
|
||||
let val: NumericalValue = val.into();
|
||||
|
||||
match val.normalize() {
|
||||
NumericalValue::U64(val) => {
|
||||
dict.insert(IntermediateKey::U64(val), intermediate_entry);
|
||||
}
|
||||
NumericalValue::I64(val) => {
|
||||
dict.insert(IntermediateKey::I64(val), intermediate_entry);
|
||||
}
|
||||
NumericalValue::F64(val) => {
|
||||
dict.insert(IntermediateKey::F64(val), intermediate_entry);
|
||||
}
|
||||
}
|
||||
};
|
||||
let val = f64_from_fastfield_u64(val, &self.column_type);
|
||||
dict.insert(IntermediateKey::F64(val), intermediate_entry);
|
||||
}
|
||||
};
|
||||
|
||||
@@ -669,7 +627,7 @@ mod tests {
|
||||
exec_request, exec_request_with_query, exec_request_with_query_and_memory_limit,
|
||||
get_test_index_from_terms, get_test_index_from_values_and_terms,
|
||||
};
|
||||
use crate::aggregation::AggregationLimitsGuard;
|
||||
use crate::aggregation::AggregationLimits;
|
||||
use crate::indexer::NoMergePolicy;
|
||||
use crate::schema::{IntoIpv6Addr, Schema, FAST, STRING};
|
||||
use crate::{Index, IndexWriter};
|
||||
@@ -1232,8 +1190,8 @@ mod tests {
|
||||
#[test]
|
||||
fn terms_aggregation_min_doc_count_special_case() -> crate::Result<()> {
|
||||
let terms_per_segment = vec![
|
||||
vec!["terma", "terma", "termb", "termb", "termb"],
|
||||
vec!["terma", "terma", "termb"],
|
||||
vec!["terma", "terma", "termb", "termb", "termb", "termc"],
|
||||
vec!["terma", "terma", "termb", "termc", "termc"],
|
||||
];
|
||||
|
||||
let index = get_test_index_from_terms(false, &terms_per_segment)?;
|
||||
@@ -1255,6 +1213,8 @@ mod tests {
|
||||
assert_eq!(res["my_texts"]["buckets"][0]["doc_count"], 4);
|
||||
assert_eq!(res["my_texts"]["buckets"][1]["key"], "termb");
|
||||
assert_eq!(res["my_texts"]["buckets"][1]["doc_count"], 0);
|
||||
assert_eq!(res["my_texts"]["buckets"][2]["key"], "termc");
|
||||
assert_eq!(res["my_texts"]["buckets"][2]["doc_count"], 0);
|
||||
assert_eq!(res["my_texts"]["sum_other_doc_count"], 0);
|
||||
assert_eq!(res["my_texts"]["doc_count_error_upper_bound"], 0);
|
||||
|
||||
@@ -1422,7 +1382,7 @@ mod tests {
|
||||
agg_req,
|
||||
&index,
|
||||
None,
|
||||
AggregationLimitsGuard::new(Some(50_000), None),
|
||||
AggregationLimits::new(Some(50_000), None),
|
||||
)
|
||||
.unwrap_err();
|
||||
assert!(res
|
||||
@@ -1683,7 +1643,7 @@ mod tests {
|
||||
res["my_texts"]["buckets"][2]["key"],
|
||||
serde_json::Value::Null
|
||||
);
|
||||
// text field with number as missing fallback
|
||||
// text field with numner as missing fallback
|
||||
assert_eq!(res["my_texts2"]["buckets"][0]["key"], "Hello Hello");
|
||||
assert_eq!(res["my_texts2"]["buckets"][0]["doc_count"], 5);
|
||||
assert_eq!(res["my_texts2"]["buckets"][1]["key"], 1337.0);
|
||||
@@ -1743,54 +1703,6 @@ mod tests {
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn terms_aggregation_u64_value() -> crate::Result<()> {
|
||||
// Make sure that large u64 are not truncated
|
||||
let mut schema_builder = Schema::builder();
|
||||
let id_field = schema_builder.add_u64_field("id", FAST);
|
||||
let index = Index::create_in_ram(schema_builder.build());
|
||||
{
|
||||
let mut index_writer = index.writer_with_num_threads(1, 20_000_000)?;
|
||||
index_writer.set_merge_policy(Box::new(NoMergePolicy));
|
||||
index_writer.add_document(doc!(
|
||||
id_field => 9_223_372_036_854_775_807u64,
|
||||
))?;
|
||||
index_writer.add_document(doc!(
|
||||
id_field => 1_769_070_189_829_214_202u64,
|
||||
))?;
|
||||
index_writer.add_document(doc!(
|
||||
id_field => 1_769_070_189_829_214_202u64,
|
||||
))?;
|
||||
index_writer.commit()?;
|
||||
}
|
||||
|
||||
let agg_req: Aggregations = serde_json::from_value(json!({
|
||||
"my_ids": {
|
||||
"terms": {
|
||||
"field": "id"
|
||||
},
|
||||
}
|
||||
}))
|
||||
.unwrap();
|
||||
|
||||
let res = exec_request_with_query(agg_req, &index, None)?;
|
||||
|
||||
// id field
|
||||
assert_eq!(
|
||||
res["my_ids"]["buckets"][0]["key"],
|
||||
1_769_070_189_829_214_202u64
|
||||
);
|
||||
assert_eq!(res["my_ids"]["buckets"][0]["doc_count"], 2);
|
||||
assert_eq!(
|
||||
res["my_ids"]["buckets"][1]["key"],
|
||||
9_223_372_036_854_775_807u64
|
||||
);
|
||||
assert_eq!(res["my_ids"]["buckets"][1]["doc_count"], 1);
|
||||
assert_eq!(res["my_ids"]["buckets"][2]["key"], serde_json::Value::Null);
|
||||
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn terms_aggregation_missing1() -> crate::Result<()> {
|
||||
let mut schema_builder = Schema::builder();
|
||||
@@ -1857,7 +1769,7 @@ mod tests {
|
||||
res["my_texts"]["buckets"][2]["key"],
|
||||
serde_json::Value::Null
|
||||
);
|
||||
// text field with number as missing fallback
|
||||
// text field with numner as missing fallback
|
||||
assert_eq!(res["my_texts2"]["buckets"][0]["key"], "Hello Hello");
|
||||
assert_eq!(res["my_texts2"]["buckets"][0]["doc_count"], 4);
|
||||
assert_eq!(res["my_texts2"]["buckets"][1]["key"], 1337.0);
|
||||
|
||||
@@ -70,6 +70,7 @@ impl SegmentAggregationCollector for TermMissingAgg {
|
||||
)?;
|
||||
missing_entry.sub_aggregation = res;
|
||||
}
|
||||
|
||||
entries.insert(missing.into(), missing_entry);
|
||||
|
||||
let bucket = IntermediateBucketResult::Terms {
|
||||
|
||||
@@ -4,7 +4,7 @@ use super::agg_result::AggregationResults;
|
||||
use super::buf_collector::BufAggregationCollector;
|
||||
use super::intermediate_agg_result::IntermediateAggregationResults;
|
||||
use super::segment_agg_result::{
|
||||
build_segment_agg_collector, AggregationLimitsGuard, SegmentAggregationCollector,
|
||||
build_segment_agg_collector, AggregationLimits, SegmentAggregationCollector,
|
||||
};
|
||||
use crate::aggregation::agg_req_with_accessor::get_aggs_with_segment_accessor_and_validate;
|
||||
use crate::collector::{Collector, SegmentCollector};
|
||||
@@ -22,7 +22,7 @@ pub const DEFAULT_MEMORY_LIMIT: u64 = 500_000_000;
|
||||
/// The collector collects all aggregations by the underlying aggregation request.
|
||||
pub struct AggregationCollector {
|
||||
agg: Aggregations,
|
||||
limits: AggregationLimitsGuard,
|
||||
limits: AggregationLimits,
|
||||
}
|
||||
|
||||
impl AggregationCollector {
|
||||
@@ -30,7 +30,7 @@ impl AggregationCollector {
|
||||
///
|
||||
/// Aggregation fails when the limits in `AggregationLimits` is exceeded. (memory limit and
|
||||
/// bucket limit)
|
||||
pub fn from_aggs(agg: Aggregations, limits: AggregationLimitsGuard) -> Self {
|
||||
pub fn from_aggs(agg: Aggregations, limits: AggregationLimits) -> Self {
|
||||
Self { agg, limits }
|
||||
}
|
||||
}
|
||||
@@ -45,7 +45,7 @@ impl AggregationCollector {
|
||||
/// into the final `AggregationResults` via the `into_final_result()` method.
|
||||
pub struct DistributedAggregationCollector {
|
||||
agg: Aggregations,
|
||||
limits: AggregationLimitsGuard,
|
||||
limits: AggregationLimits,
|
||||
}
|
||||
|
||||
impl DistributedAggregationCollector {
|
||||
@@ -53,7 +53,7 @@ impl DistributedAggregationCollector {
|
||||
///
|
||||
/// Aggregation fails when the limits in `AggregationLimits` is exceeded. (memory limit and
|
||||
/// bucket limit)
|
||||
pub fn from_aggs(agg: Aggregations, limits: AggregationLimitsGuard) -> Self {
|
||||
pub fn from_aggs(agg: Aggregations, limits: AggregationLimits) -> Self {
|
||||
Self { agg, limits }
|
||||
}
|
||||
}
|
||||
@@ -115,7 +115,7 @@ impl Collector for AggregationCollector {
|
||||
segment_fruits: Vec<<Self::Child as SegmentCollector>::Fruit>,
|
||||
) -> crate::Result<Self::Fruit> {
|
||||
let res = merge_fruits(segment_fruits)?;
|
||||
res.into_final_result(self.agg.clone(), self.limits.clone())
|
||||
res.into_final_result(self.agg.clone(), &self.limits)
|
||||
}
|
||||
}
|
||||
|
||||
@@ -147,7 +147,7 @@ impl AggregationSegmentCollector {
|
||||
agg: &Aggregations,
|
||||
reader: &SegmentReader,
|
||||
segment_ordinal: SegmentOrdinal,
|
||||
limits: &AggregationLimitsGuard,
|
||||
limits: &AggregationLimits,
|
||||
) -> crate::Result<Self> {
|
||||
let mut aggs_with_accessor =
|
||||
get_aggs_with_segment_accessor_and_validate(agg, reader, segment_ordinal, limits)?;
|
||||
|
||||
@@ -22,11 +22,10 @@ use super::metric::{
|
||||
IntermediateAverage, IntermediateCount, IntermediateExtendedStats, IntermediateMax,
|
||||
IntermediateMin, IntermediateStats, IntermediateSum, PercentilesCollector, TopHitsTopNComputer,
|
||||
};
|
||||
use super::segment_agg_result::AggregationLimitsGuard;
|
||||
use super::segment_agg_result::AggregationLimits;
|
||||
use super::{format_date, AggregationError, Key, SerializedKey};
|
||||
use crate::aggregation::agg_result::{AggregationResults, BucketEntries, BucketEntry};
|
||||
use crate::aggregation::bucket::TermsAggregationInternal;
|
||||
use crate::aggregation::metric::CardinalityCollector;
|
||||
use crate::TantivyError;
|
||||
|
||||
/// Contains the intermediate aggregation result, which is optimized to be merged with other
|
||||
@@ -51,18 +50,12 @@ pub enum IntermediateKey {
|
||||
Str(String),
|
||||
/// `f64` key
|
||||
F64(f64),
|
||||
/// `i64` key
|
||||
I64(i64),
|
||||
/// `u64` key
|
||||
U64(u64),
|
||||
}
|
||||
impl From<Key> for IntermediateKey {
|
||||
fn from(value: Key) -> Self {
|
||||
match value {
|
||||
Key::Str(s) => Self::Str(s),
|
||||
Key::F64(f) => Self::F64(f),
|
||||
Key::U64(f) => Self::U64(f),
|
||||
Key::I64(f) => Self::I64(f),
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -79,9 +72,7 @@ impl From<IntermediateKey> for Key {
|
||||
}
|
||||
}
|
||||
IntermediateKey::F64(f) => Self::F64(f),
|
||||
IntermediateKey::Bool(f) => Self::U64(f as u64),
|
||||
IntermediateKey::U64(f) => Self::U64(f),
|
||||
IntermediateKey::I64(f) => Self::I64(f),
|
||||
IntermediateKey::Bool(f) => Self::F64(f as u64 as f64),
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -94,8 +85,6 @@ impl std::hash::Hash for IntermediateKey {
|
||||
match self {
|
||||
IntermediateKey::Str(text) => text.hash(state),
|
||||
IntermediateKey::F64(val) => val.to_bits().hash(state),
|
||||
IntermediateKey::U64(val) => val.hash(state),
|
||||
IntermediateKey::I64(val) => val.hash(state),
|
||||
IntermediateKey::Bool(val) => val.hash(state),
|
||||
IntermediateKey::IpAddr(val) => val.hash(state),
|
||||
}
|
||||
@@ -122,9 +111,9 @@ impl IntermediateAggregationResults {
|
||||
pub fn into_final_result(
|
||||
self,
|
||||
req: Aggregations,
|
||||
mut limits: AggregationLimitsGuard,
|
||||
limits: &AggregationLimits,
|
||||
) -> crate::Result<AggregationResults> {
|
||||
let res = self.into_final_result_internal(&req, &mut limits)?;
|
||||
let res = self.into_final_result_internal(&req, limits)?;
|
||||
let bucket_count = res.get_bucket_count() as u32;
|
||||
if bucket_count > limits.get_bucket_limit() {
|
||||
return Err(TantivyError::AggregationError(
|
||||
@@ -141,7 +130,7 @@ impl IntermediateAggregationResults {
|
||||
pub(crate) fn into_final_result_internal(
|
||||
self,
|
||||
req: &Aggregations,
|
||||
limits: &mut AggregationLimitsGuard,
|
||||
limits: &AggregationLimits,
|
||||
) -> crate::Result<AggregationResults> {
|
||||
let mut results: FxHashMap<String, AggregationResult> = FxHashMap::default();
|
||||
for (key, agg_res) in self.aggs_res.into_iter() {
|
||||
@@ -238,9 +227,6 @@ pub(crate) fn empty_from_req(req: &Aggregation) -> IntermediateAggregationResult
|
||||
TopHits(ref req) => IntermediateAggregationResult::Metric(
|
||||
IntermediateMetricResult::TopHits(TopHitsTopNComputer::new(req)),
|
||||
),
|
||||
Cardinality(_) => IntermediateAggregationResult::Metric(
|
||||
IntermediateMetricResult::Cardinality(CardinalityCollector::default()),
|
||||
),
|
||||
}
|
||||
}
|
||||
|
||||
@@ -257,7 +243,7 @@ impl IntermediateAggregationResult {
|
||||
pub(crate) fn into_final_result(
|
||||
self,
|
||||
req: &Aggregation,
|
||||
limits: &mut AggregationLimitsGuard,
|
||||
limits: &AggregationLimits,
|
||||
) -> crate::Result<AggregationResult> {
|
||||
let res = match self {
|
||||
IntermediateAggregationResult::Bucket(bucket) => {
|
||||
@@ -305,8 +291,6 @@ pub enum IntermediateMetricResult {
|
||||
Sum(IntermediateSum),
|
||||
/// Intermediate top_hits result
|
||||
TopHits(TopHitsTopNComputer),
|
||||
/// Intermediate cardinality result
|
||||
Cardinality(CardinalityCollector),
|
||||
}
|
||||
|
||||
impl IntermediateMetricResult {
|
||||
@@ -340,9 +324,6 @@ impl IntermediateMetricResult {
|
||||
IntermediateMetricResult::TopHits(top_hits) => {
|
||||
MetricResult::TopHits(top_hits.into_final_result())
|
||||
}
|
||||
IntermediateMetricResult::Cardinality(cardinality) => {
|
||||
MetricResult::Cardinality(cardinality.finalize().into())
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@@ -391,12 +372,6 @@ impl IntermediateMetricResult {
|
||||
(IntermediateMetricResult::TopHits(left), IntermediateMetricResult::TopHits(right)) => {
|
||||
left.merge_fruits(right)?;
|
||||
}
|
||||
(
|
||||
IntermediateMetricResult::Cardinality(left),
|
||||
IntermediateMetricResult::Cardinality(right),
|
||||
) => {
|
||||
left.merge_fruits(right)?;
|
||||
}
|
||||
_ => {
|
||||
panic!("incompatible fruit types in tree or missing merge_fruits handler");
|
||||
}
|
||||
@@ -432,7 +407,7 @@ impl IntermediateBucketResult {
|
||||
pub(crate) fn into_final_bucket_result(
|
||||
self,
|
||||
req: &Aggregation,
|
||||
limits: &mut AggregationLimitsGuard,
|
||||
limits: &AggregationLimits,
|
||||
) -> crate::Result<BucketResult> {
|
||||
match self {
|
||||
IntermediateBucketResult::Range(range_res) => {
|
||||
@@ -596,7 +571,7 @@ impl IntermediateTermBucketResult {
|
||||
self,
|
||||
req: &TermsAggregation,
|
||||
sub_aggregation_req: &Aggregations,
|
||||
limits: &mut AggregationLimitsGuard,
|
||||
limits: &AggregationLimits,
|
||||
) -> crate::Result<BucketResult> {
|
||||
let req = TermsAggregationInternal::from_req(req);
|
||||
let mut buckets: Vec<BucketEntry> = self
|
||||
@@ -723,7 +698,7 @@ impl IntermediateHistogramBucketEntry {
|
||||
pub(crate) fn into_final_bucket_entry(
|
||||
self,
|
||||
req: &Aggregations,
|
||||
limits: &mut AggregationLimitsGuard,
|
||||
limits: &AggregationLimits,
|
||||
) -> crate::Result<BucketEntry> {
|
||||
Ok(BucketEntry {
|
||||
key_as_string: None,
|
||||
@@ -758,7 +733,7 @@ impl IntermediateRangeBucketEntry {
|
||||
req: &Aggregations,
|
||||
_range_req: &RangeAggregation,
|
||||
column_type: Option<ColumnType>,
|
||||
limits: &mut AggregationLimitsGuard,
|
||||
limits: &AggregationLimits,
|
||||
) -> crate::Result<RangeBucketEntry> {
|
||||
let mut range_bucket_entry = RangeBucketEntry {
|
||||
key: self.key.into(),
|
||||
@@ -860,7 +835,7 @@ mod tests {
|
||||
}
|
||||
}
|
||||
|
||||
fn get_intermediate_tree_with_ranges(
|
||||
fn get_intermediat_tree_with_ranges(
|
||||
data: &[(String, u64, String, u64)],
|
||||
) -> IntermediateAggregationResults {
|
||||
let mut map = HashMap::new();
|
||||
@@ -896,18 +871,18 @@ mod tests {
|
||||
|
||||
#[test]
|
||||
fn test_merge_fruits_tree_1() {
|
||||
let mut tree_left = get_intermediate_tree_with_ranges(&[
|
||||
let mut tree_left = get_intermediat_tree_with_ranges(&[
|
||||
("red".to_string(), 50, "1900".to_string(), 25),
|
||||
("blue".to_string(), 30, "1900".to_string(), 30),
|
||||
]);
|
||||
let tree_right = get_intermediate_tree_with_ranges(&[
|
||||
let tree_right = get_intermediat_tree_with_ranges(&[
|
||||
("red".to_string(), 60, "1900".to_string(), 30),
|
||||
("blue".to_string(), 25, "1900".to_string(), 50),
|
||||
]);
|
||||
|
||||
tree_left.merge_fruits(tree_right).unwrap();
|
||||
|
||||
let tree_expected = get_intermediate_tree_with_ranges(&[
|
||||
let tree_expected = get_intermediat_tree_with_ranges(&[
|
||||
("red".to_string(), 110, "1900".to_string(), 55),
|
||||
("blue".to_string(), 55, "1900".to_string(), 80),
|
||||
]);
|
||||
@@ -917,18 +892,18 @@ mod tests {
|
||||
|
||||
#[test]
|
||||
fn test_merge_fruits_tree_2() {
|
||||
let mut tree_left = get_intermediate_tree_with_ranges(&[
|
||||
let mut tree_left = get_intermediat_tree_with_ranges(&[
|
||||
("red".to_string(), 50, "1900".to_string(), 25),
|
||||
("blue".to_string(), 30, "1900".to_string(), 30),
|
||||
]);
|
||||
let tree_right = get_intermediate_tree_with_ranges(&[
|
||||
let tree_right = get_intermediat_tree_with_ranges(&[
|
||||
("red".to_string(), 60, "1900".to_string(), 30),
|
||||
("green".to_string(), 25, "1900".to_string(), 50),
|
||||
]);
|
||||
|
||||
tree_left.merge_fruits(tree_right).unwrap();
|
||||
|
||||
let tree_expected = get_intermediate_tree_with_ranges(&[
|
||||
let tree_expected = get_intermediat_tree_with_ranges(&[
|
||||
("red".to_string(), 110, "1900".to_string(), 55),
|
||||
("blue".to_string(), 30, "1900".to_string(), 30),
|
||||
("green".to_string(), 25, "1900".to_string(), 50),
|
||||
@@ -939,7 +914,7 @@ mod tests {
|
||||
|
||||
#[test]
|
||||
fn test_merge_fruits_tree_empty() {
|
||||
let mut tree_left = get_intermediate_tree_with_ranges(&[
|
||||
let mut tree_left = get_intermediat_tree_with_ranges(&[
|
||||
("red".to_string(), 50, "1900".to_string(), 25),
|
||||
("blue".to_string(), 30, "1900".to_string(), 30),
|
||||
]);
|
||||
|
||||
@@ -1,473 +0,0 @@
|
||||
use std::collections::hash_map::DefaultHasher;
|
||||
use std::hash::{BuildHasher, Hasher};
|
||||
|
||||
use columnar::column_values::CompactSpaceU64Accessor;
|
||||
use columnar::Dictionary;
|
||||
use common::f64_to_u64;
|
||||
use hyperloglogplus::{HyperLogLog, HyperLogLogPlus};
|
||||
use rustc_hash::FxHashSet;
|
||||
use serde::{Deserialize, Serialize};
|
||||
|
||||
use crate::aggregation::agg_req_with_accessor::{
|
||||
AggregationWithAccessor, AggregationsWithAccessor,
|
||||
};
|
||||
use crate::aggregation::intermediate_agg_result::{
|
||||
IntermediateAggregationResult, IntermediateAggregationResults, IntermediateMetricResult,
|
||||
};
|
||||
use crate::aggregation::segment_agg_result::SegmentAggregationCollector;
|
||||
use crate::aggregation::*;
|
||||
use crate::TantivyError;
|
||||
|
||||
#[derive(Clone, Debug, Serialize, Deserialize)]
|
||||
struct BuildSaltedHasher {
|
||||
salt: u8,
|
||||
}
|
||||
|
||||
impl BuildHasher for BuildSaltedHasher {
|
||||
type Hasher = DefaultHasher;
|
||||
|
||||
fn build_hasher(&self) -> Self::Hasher {
|
||||
let mut hasher = DefaultHasher::new();
|
||||
hasher.write_u8(self.salt);
|
||||
|
||||
hasher
|
||||
}
|
||||
}
|
||||
|
||||
/// # Cardinality
|
||||
///
|
||||
/// The cardinality aggregation allows for computing an estimate
|
||||
/// of the number of different values in a data set based on the
|
||||
/// HyperLogLog++ algorithm. This is particularly useful for understanding the
|
||||
/// uniqueness of values in a large dataset where counting each unique value
|
||||
/// individually would be computationally expensive.
|
||||
///
|
||||
/// For example, you might use a cardinality aggregation to estimate the number
|
||||
/// of unique visitors to a website by aggregating on a field that contains
|
||||
/// user IDs or session IDs.
|
||||
///
|
||||
/// To use the cardinality aggregation, you'll need to provide a field to
|
||||
/// aggregate on. The following example demonstrates a request for the cardinality
|
||||
/// of the "user_id" field:
|
||||
///
|
||||
/// ```JSON
|
||||
/// {
|
||||
/// "cardinality": {
|
||||
/// "field": "user_id"
|
||||
/// }
|
||||
/// }
|
||||
/// ```
|
||||
///
|
||||
/// This request will return an estimate of the number of unique values in the
|
||||
/// "user_id" field.
|
||||
///
|
||||
/// ## Missing Values
|
||||
///
|
||||
/// The `missing` parameter defines how documents that are missing a value should be treated.
|
||||
/// By default, documents without a value for the specified field are ignored. However, you can
|
||||
/// specify a default value for these documents using the `missing` parameter. This can be useful
|
||||
/// when you want to include documents with missing values in the aggregation.
|
||||
///
|
||||
/// For example, the following request treats documents with missing values in the "user_id"
|
||||
/// field as if they had a value of "unknown":
|
||||
///
|
||||
/// ```JSON
|
||||
/// {
|
||||
/// "cardinality": {
|
||||
/// "field": "user_id",
|
||||
/// "missing": "unknown"
|
||||
/// }
|
||||
/// }
|
||||
/// ```
|
||||
///
|
||||
/// # Estimation Accuracy
|
||||
///
|
||||
/// The cardinality aggregation provides an approximate count, which is usually
|
||||
/// accurate within a small error range. This trade-off allows for efficient
|
||||
/// computation even on very large datasets.
|
||||
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
|
||||
pub struct CardinalityAggregationReq {
|
||||
/// The field name to compute the percentiles on.
|
||||
pub field: String,
|
||||
/// The missing parameter defines how documents that are missing a value should be treated.
|
||||
/// By default they will be ignored but it is also possible to treat them as if they had a
|
||||
/// value. Examples in JSON format:
|
||||
/// { "field": "my_numbers", "missing": "10.0" }
|
||||
#[serde(skip_serializing_if = "Option::is_none", default)]
|
||||
pub missing: Option<Key>,
|
||||
}
|
||||
|
||||
impl CardinalityAggregationReq {
|
||||
/// Creates a new [`CardinalityAggregationReq`] instance from a field name.
|
||||
pub fn from_field_name(field_name: String) -> Self {
|
||||
Self {
|
||||
field: field_name,
|
||||
missing: None,
|
||||
}
|
||||
}
|
||||
/// Returns the field name the aggregation is computed on.
|
||||
pub fn field_name(&self) -> &str {
|
||||
&self.field
|
||||
}
|
||||
}
|
||||
|
||||
#[derive(Clone, Debug, PartialEq)]
|
||||
pub(crate) struct SegmentCardinalityCollector {
|
||||
cardinality: CardinalityCollector,
|
||||
entries: FxHashSet<u64>,
|
||||
column_type: ColumnType,
|
||||
accessor_idx: usize,
|
||||
missing: Option<Key>,
|
||||
}
|
||||
|
||||
impl SegmentCardinalityCollector {
|
||||
pub fn from_req(column_type: ColumnType, accessor_idx: usize, missing: &Option<Key>) -> Self {
|
||||
Self {
|
||||
cardinality: CardinalityCollector::new(column_type as u8),
|
||||
entries: Default::default(),
|
||||
column_type,
|
||||
accessor_idx,
|
||||
missing: missing.clone(),
|
||||
}
|
||||
}
|
||||
|
||||
fn fetch_block_with_field(
|
||||
&mut self,
|
||||
docs: &[crate::DocId],
|
||||
agg_accessor: &mut AggregationWithAccessor,
|
||||
) {
|
||||
if let Some(missing) = agg_accessor.missing_value_for_accessor {
|
||||
agg_accessor.column_block_accessor.fetch_block_with_missing(
|
||||
docs,
|
||||
&agg_accessor.accessor,
|
||||
missing,
|
||||
);
|
||||
} else {
|
||||
agg_accessor
|
||||
.column_block_accessor
|
||||
.fetch_block(docs, &agg_accessor.accessor);
|
||||
}
|
||||
}
|
||||
|
||||
fn into_intermediate_metric_result(
|
||||
mut self,
|
||||
agg_with_accessor: &AggregationWithAccessor,
|
||||
) -> crate::Result<IntermediateMetricResult> {
|
||||
if self.column_type == ColumnType::Str {
|
||||
let fallback_dict = Dictionary::empty();
|
||||
let dict = agg_with_accessor
|
||||
.str_dict_column
|
||||
.as_ref()
|
||||
.map(|el| el.dictionary())
|
||||
.unwrap_or_else(|| &fallback_dict);
|
||||
let mut has_missing = false;
|
||||
|
||||
// TODO: replace FxHashSet with something that allows iterating in order
|
||||
// (e.g. sparse bitvec)
|
||||
let mut term_ids = Vec::new();
|
||||
for term_ord in self.entries.into_iter() {
|
||||
if term_ord == u64::MAX {
|
||||
has_missing = true;
|
||||
} else {
|
||||
// we can reasonably exclude values above u32::MAX
|
||||
term_ids.push(term_ord as u32);
|
||||
}
|
||||
}
|
||||
term_ids.sort_unstable();
|
||||
dict.sorted_ords_to_term_cb(term_ids.iter().map(|term| *term as u64), |term| {
|
||||
self.cardinality.sketch.insert_any(&term);
|
||||
Ok(())
|
||||
})?;
|
||||
if has_missing {
|
||||
// Replace missing with the actual value provided
|
||||
let missing_key = self
|
||||
.missing
|
||||
.as_ref()
|
||||
.expect("Found sentinel value u64::MAX for term_ord but `missing` is not set");
|
||||
match missing_key {
|
||||
Key::Str(missing) => {
|
||||
self.cardinality.sketch.insert_any(&missing);
|
||||
}
|
||||
Key::F64(val) => {
|
||||
let val = f64_to_u64(*val);
|
||||
self.cardinality.sketch.insert_any(&val);
|
||||
}
|
||||
Key::U64(val) => {
|
||||
self.cardinality.sketch.insert_any(&val);
|
||||
}
|
||||
Key::I64(val) => {
|
||||
self.cardinality.sketch.insert_any(&val);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
Ok(IntermediateMetricResult::Cardinality(self.cardinality))
|
||||
}
|
||||
}
|
||||
|
||||
impl SegmentAggregationCollector for SegmentCardinalityCollector {
|
||||
fn add_intermediate_aggregation_result(
|
||||
self: Box<Self>,
|
||||
agg_with_accessor: &AggregationsWithAccessor,
|
||||
results: &mut IntermediateAggregationResults,
|
||||
) -> crate::Result<()> {
|
||||
let name = agg_with_accessor.aggs.keys[self.accessor_idx].to_string();
|
||||
let agg_with_accessor = &agg_with_accessor.aggs.values[self.accessor_idx];
|
||||
|
||||
let intermediate_result = self.into_intermediate_metric_result(agg_with_accessor)?;
|
||||
results.push(
|
||||
name,
|
||||
IntermediateAggregationResult::Metric(intermediate_result),
|
||||
)?;
|
||||
|
||||
Ok(())
|
||||
}
|
||||
|
||||
fn collect(
|
||||
&mut self,
|
||||
doc: crate::DocId,
|
||||
agg_with_accessor: &mut AggregationsWithAccessor,
|
||||
) -> crate::Result<()> {
|
||||
self.collect_block(&[doc], agg_with_accessor)
|
||||
}
|
||||
|
||||
fn collect_block(
|
||||
&mut self,
|
||||
docs: &[crate::DocId],
|
||||
agg_with_accessor: &mut AggregationsWithAccessor,
|
||||
) -> crate::Result<()> {
|
||||
let bucket_agg_accessor = &mut agg_with_accessor.aggs.values[self.accessor_idx];
|
||||
self.fetch_block_with_field(docs, bucket_agg_accessor);
|
||||
|
||||
let col_block_accessor = &bucket_agg_accessor.column_block_accessor;
|
||||
if self.column_type == ColumnType::Str {
|
||||
for term_ord in col_block_accessor.iter_vals() {
|
||||
self.entries.insert(term_ord);
|
||||
}
|
||||
} else if self.column_type == ColumnType::IpAddr {
|
||||
let compact_space_accessor = bucket_agg_accessor
|
||||
.accessor
|
||||
.values
|
||||
.clone()
|
||||
.downcast_arc::<CompactSpaceU64Accessor>()
|
||||
.map_err(|_| {
|
||||
TantivyError::AggregationError(
|
||||
crate::aggregation::AggregationError::InternalError(
|
||||
"Type mismatch: Could not downcast to CompactSpaceU64Accessor"
|
||||
.to_string(),
|
||||
),
|
||||
)
|
||||
})?;
|
||||
for val in col_block_accessor.iter_vals() {
|
||||
let val: u128 = compact_space_accessor.compact_to_u128(val as u32);
|
||||
self.cardinality.sketch.insert_any(&val);
|
||||
}
|
||||
} else {
|
||||
for val in col_block_accessor.iter_vals() {
|
||||
self.cardinality.sketch.insert_any(&val);
|
||||
}
|
||||
}
|
||||
|
||||
Ok(())
|
||||
}
|
||||
}
|
||||
|
||||
#[derive(Clone, Debug, Serialize, Deserialize)]
|
||||
/// The percentiles collector used during segment collection and for merging results.
|
||||
pub struct CardinalityCollector {
|
||||
sketch: HyperLogLogPlus<u64, BuildSaltedHasher>,
|
||||
}
|
||||
impl Default for CardinalityCollector {
|
||||
fn default() -> Self {
|
||||
Self::new(0)
|
||||
}
|
||||
}
|
||||
|
||||
impl PartialEq for CardinalityCollector {
|
||||
fn eq(&self, _other: &Self) -> bool {
|
||||
false
|
||||
}
|
||||
}
|
||||
|
||||
impl CardinalityCollector {
|
||||
/// Compute the final cardinality estimate.
|
||||
pub fn finalize(self) -> Option<f64> {
|
||||
Some(self.sketch.clone().count().trunc())
|
||||
}
|
||||
|
||||
fn new(salt: u8) -> Self {
|
||||
Self {
|
||||
sketch: HyperLogLogPlus::new(16, BuildSaltedHasher { salt }).unwrap(),
|
||||
}
|
||||
}
|
||||
|
||||
pub(crate) fn merge_fruits(&mut self, right: CardinalityCollector) -> crate::Result<()> {
|
||||
self.sketch.merge(&right.sketch).map_err(|err| {
|
||||
TantivyError::AggregationError(AggregationError::InternalError(format!(
|
||||
"Error while merging cardinality {err:?}"
|
||||
)))
|
||||
})?;
|
||||
|
||||
Ok(())
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
|
||||
use std::net::IpAddr;
|
||||
use std::str::FromStr;
|
||||
|
||||
use columnar::MonotonicallyMappableToU64;
|
||||
|
||||
use crate::aggregation::agg_req::Aggregations;
|
||||
use crate::aggregation::tests::{exec_request, get_test_index_from_terms};
|
||||
use crate::schema::{IntoIpv6Addr, Schema, FAST};
|
||||
use crate::Index;
|
||||
|
||||
#[test]
|
||||
fn cardinality_aggregation_test_empty_index() -> crate::Result<()> {
|
||||
let values = vec![];
|
||||
let index = get_test_index_from_terms(false, &values)?;
|
||||
let agg_req: Aggregations = serde_json::from_value(json!({
|
||||
"cardinality": {
|
||||
"cardinality": {
|
||||
"field": "string_id",
|
||||
}
|
||||
},
|
||||
}))
|
||||
.unwrap();
|
||||
|
||||
let res = exec_request(agg_req, &index)?;
|
||||
assert_eq!(res["cardinality"]["value"], 0.0);
|
||||
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn cardinality_aggregation_test_single_segment() -> crate::Result<()> {
|
||||
cardinality_aggregation_test_merge_segment(true)
|
||||
}
|
||||
#[test]
|
||||
fn cardinality_aggregation_test() -> crate::Result<()> {
|
||||
cardinality_aggregation_test_merge_segment(false)
|
||||
}
|
||||
fn cardinality_aggregation_test_merge_segment(merge_segments: bool) -> crate::Result<()> {
|
||||
let segment_and_terms = vec![
|
||||
vec!["terma"],
|
||||
vec!["termb"],
|
||||
vec!["termc"],
|
||||
vec!["terma"],
|
||||
vec!["terma"],
|
||||
vec!["terma"],
|
||||
vec!["termb"],
|
||||
vec!["terma"],
|
||||
];
|
||||
let index = get_test_index_from_terms(merge_segments, &segment_and_terms)?;
|
||||
let agg_req: Aggregations = serde_json::from_value(json!({
|
||||
"cardinality": {
|
||||
"cardinality": {
|
||||
"field": "string_id",
|
||||
}
|
||||
},
|
||||
}))
|
||||
.unwrap();
|
||||
|
||||
let res = exec_request(agg_req, &index)?;
|
||||
assert_eq!(res["cardinality"]["value"], 3.0);
|
||||
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn cardinality_aggregation_u64() -> crate::Result<()> {
|
||||
let mut schema_builder = Schema::builder();
|
||||
let id_field = schema_builder.add_u64_field("id", FAST);
|
||||
let index = Index::create_in_ram(schema_builder.build());
|
||||
{
|
||||
let mut writer = index.writer_for_tests()?;
|
||||
writer.add_document(doc!(id_field => 1u64))?;
|
||||
writer.add_document(doc!(id_field => 2u64))?;
|
||||
writer.add_document(doc!(id_field => 3u64))?;
|
||||
writer.add_document(doc!())?;
|
||||
writer.commit()?;
|
||||
}
|
||||
|
||||
let agg_req: Aggregations = serde_json::from_value(json!({
|
||||
"cardinality": {
|
||||
"cardinality": {
|
||||
"field": "id",
|
||||
"missing": 0u64
|
||||
},
|
||||
}
|
||||
}))
|
||||
.unwrap();
|
||||
|
||||
let res = exec_request(agg_req, &index)?;
|
||||
assert_eq!(res["cardinality"]["value"], 4.0);
|
||||
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn cardinality_aggregation_ip_addr() -> crate::Result<()> {
|
||||
let mut schema_builder = Schema::builder();
|
||||
let field = schema_builder.add_ip_addr_field("ip_field", FAST);
|
||||
let index = Index::create_in_ram(schema_builder.build());
|
||||
{
|
||||
let mut writer = index.writer_for_tests()?;
|
||||
// IpV6 loopback
|
||||
writer.add_document(doc!(field=>IpAddr::from_str("::1").unwrap().into_ipv6_addr()))?;
|
||||
writer.add_document(doc!(field=>IpAddr::from_str("::1").unwrap().into_ipv6_addr()))?;
|
||||
// IpV4
|
||||
writer.add_document(
|
||||
doc!(field=>IpAddr::from_str("127.0.0.1").unwrap().into_ipv6_addr()),
|
||||
)?;
|
||||
writer.commit()?;
|
||||
}
|
||||
|
||||
let agg_req: Aggregations = serde_json::from_value(json!({
|
||||
"cardinality": {
|
||||
"cardinality": {
|
||||
"field": "ip_field"
|
||||
},
|
||||
}
|
||||
}))
|
||||
.unwrap();
|
||||
|
||||
let res = exec_request(agg_req, &index)?;
|
||||
assert_eq!(res["cardinality"]["value"], 2.0);
|
||||
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn cardinality_aggregation_json() -> crate::Result<()> {
|
||||
let mut schema_builder = Schema::builder();
|
||||
let field = schema_builder.add_json_field("json", FAST);
|
||||
let index = Index::create_in_ram(schema_builder.build());
|
||||
{
|
||||
let mut writer = index.writer_for_tests()?;
|
||||
writer.add_document(doc!(field => json!({"value": false})))?;
|
||||
writer.add_document(doc!(field => json!({"value": true})))?;
|
||||
writer.add_document(doc!(field => json!({"value": i64::from_u64(0u64)})))?;
|
||||
writer.add_document(doc!(field => json!({"value": i64::from_u64(1u64)})))?;
|
||||
writer.commit()?;
|
||||
}
|
||||
|
||||
let agg_req: Aggregations = serde_json::from_value(json!({
|
||||
"cardinality": {
|
||||
"cardinality": {
|
||||
"field": "json.value"
|
||||
},
|
||||
}
|
||||
}))
|
||||
.unwrap();
|
||||
|
||||
let res = exec_request(agg_req, &index)?;
|
||||
assert_eq!(res["cardinality"]["value"], 4.0);
|
||||
|
||||
Ok(())
|
||||
}
|
||||
}
|
||||
@@ -17,7 +17,6 @@
|
||||
//! - [Percentiles](PercentilesAggregationReq)
|
||||
|
||||
mod average;
|
||||
mod cardinality;
|
||||
mod count;
|
||||
mod extended_stats;
|
||||
mod max;
|
||||
@@ -30,7 +29,6 @@ mod top_hits;
|
||||
use std::collections::HashMap;
|
||||
|
||||
pub use average::*;
|
||||
pub use cardinality::*;
|
||||
pub use count::*;
|
||||
pub use extended_stats::*;
|
||||
pub use max::*;
|
||||
|
||||
@@ -163,8 +163,8 @@ impl PartialEq for PercentilesCollector {
|
||||
}
|
||||
}
|
||||
|
||||
fn format_percentile(percentile: f64) -> String {
|
||||
let mut out = percentile.to_string();
|
||||
fn format_percentil(percentil: f64) -> String {
|
||||
let mut out = percentil.to_string();
|
||||
// Slightly silly way to format trailing decimals
|
||||
if !out.contains('.') {
|
||||
out.push_str(".0");
|
||||
@@ -197,7 +197,7 @@ impl PercentilesCollector {
|
||||
let values = if req.keyed {
|
||||
PercentileValues::HashMap(
|
||||
iter_quantile_and_values
|
||||
.map(|(val, quantil)| (format_percentile(val), quantil))
|
||||
.map(|(val, quantil)| (format_percentil(val), quantil))
|
||||
.collect(),
|
||||
)
|
||||
} else {
|
||||
|
||||
@@ -220,23 +220,9 @@ impl SegmentStatsCollector {
|
||||
.column_block_accessor
|
||||
.fetch_block(docs, &agg_accessor.accessor);
|
||||
}
|
||||
if [
|
||||
ColumnType::I64,
|
||||
ColumnType::U64,
|
||||
ColumnType::F64,
|
||||
ColumnType::DateTime,
|
||||
]
|
||||
.contains(&self.field_type)
|
||||
{
|
||||
for val in agg_accessor.column_block_accessor.iter_vals() {
|
||||
let val1 = f64_from_fastfield_u64(val, &self.field_type);
|
||||
self.stats.collect(val1);
|
||||
}
|
||||
} else {
|
||||
for _val in agg_accessor.column_block_accessor.iter_vals() {
|
||||
// we ignore the value and simply record that we got something
|
||||
self.stats.collect(0.0);
|
||||
}
|
||||
for val in agg_accessor.column_block_accessor.iter_vals() {
|
||||
let val1 = f64_from_fastfield_u64(val, &self.field_type);
|
||||
self.stats.collect(val1);
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -449,11 +435,6 @@ mod tests {
|
||||
"field": "score",
|
||||
},
|
||||
},
|
||||
"count_str": {
|
||||
"value_count": {
|
||||
"field": "text",
|
||||
},
|
||||
},
|
||||
"range": range_agg
|
||||
}))
|
||||
.unwrap();
|
||||
@@ -519,13 +500,6 @@ mod tests {
|
||||
})
|
||||
);
|
||||
|
||||
assert_eq!(
|
||||
res["count_str"],
|
||||
json!({
|
||||
"value": 7.0,
|
||||
})
|
||||
);
|
||||
|
||||
Ok(())
|
||||
}
|
||||
|
||||
|
||||
@@ -89,7 +89,7 @@ use crate::{DocAddress, DocId, SegmentOrdinal};
|
||||
/// }
|
||||
/// ```
|
||||
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize, Default)]
|
||||
pub struct TopHitsAggregationReq {
|
||||
pub struct TopHitsAggregation {
|
||||
sort: Vec<KeyOrder>,
|
||||
size: usize,
|
||||
from: Option<usize>,
|
||||
@@ -139,7 +139,7 @@ impl<'de> Deserialize<'de> for KeyOrder {
|
||||
}
|
||||
}
|
||||
|
||||
// Transform a glob (`pattern*`, for example) into a regex::Regex (`^pattern.*$`)
|
||||
// Tranform a glob (`pattern*`, for example) into a regex::Regex (`^pattern.*$`)
|
||||
fn globbed_string_to_regex(glob: &str) -> Result<Regex, crate::TantivyError> {
|
||||
// Replace `*` glob with `.*` regex
|
||||
let sanitized = format!("^{}$", regex::escape(glob).replace(r"\*", ".*"));
|
||||
@@ -164,7 +164,7 @@ fn unsupported_err(parameter: &str) -> crate::Result<()> {
|
||||
))
|
||||
}
|
||||
|
||||
impl TopHitsAggregationReq {
|
||||
impl TopHitsAggregation {
|
||||
/// Validate and resolve field retrieval parameters
|
||||
pub fn validate_and_resolve_field_names(
|
||||
&mut self,
|
||||
@@ -431,7 +431,7 @@ impl Eq for DocSortValuesAndFields {}
|
||||
/// The TopHitsCollector used for collecting over segments and merging results.
|
||||
#[derive(Clone, Serialize, Deserialize, Debug)]
|
||||
pub struct TopHitsTopNComputer {
|
||||
req: TopHitsAggregationReq,
|
||||
req: TopHitsAggregation,
|
||||
top_n: TopNComputer<DocSortValuesAndFields, DocAddress, false>,
|
||||
}
|
||||
|
||||
@@ -443,7 +443,7 @@ impl std::cmp::PartialEq for TopHitsTopNComputer {
|
||||
|
||||
impl TopHitsTopNComputer {
|
||||
/// Create a new TopHitsCollector
|
||||
pub fn new(req: &TopHitsAggregationReq) -> Self {
|
||||
pub fn new(req: &TopHitsAggregation) -> Self {
|
||||
Self {
|
||||
top_n: TopNComputer::new(req.size + req.from.unwrap_or(0)),
|
||||
req: req.clone(),
|
||||
@@ -496,7 +496,7 @@ pub(crate) struct TopHitsSegmentCollector {
|
||||
|
||||
impl TopHitsSegmentCollector {
|
||||
pub fn from_req(
|
||||
req: &TopHitsAggregationReq,
|
||||
req: &TopHitsAggregation,
|
||||
accessor_idx: usize,
|
||||
segment_ordinal: SegmentOrdinal,
|
||||
) -> Self {
|
||||
@@ -509,7 +509,7 @@ impl TopHitsSegmentCollector {
|
||||
fn into_top_hits_collector(
|
||||
self,
|
||||
value_accessors: &HashMap<String, Vec<DynamicColumn>>,
|
||||
req: &TopHitsAggregationReq,
|
||||
req: &TopHitsAggregation,
|
||||
) -> TopHitsTopNComputer {
|
||||
let mut top_hits_computer = TopHitsTopNComputer::new(req);
|
||||
let top_results = self.top_n.into_vec();
|
||||
@@ -532,7 +532,7 @@ impl TopHitsSegmentCollector {
|
||||
fn collect_with(
|
||||
&mut self,
|
||||
doc_id: crate::DocId,
|
||||
req: &TopHitsAggregationReq,
|
||||
req: &TopHitsAggregation,
|
||||
accessors: &[(Column<u64>, ColumnType)],
|
||||
) -> crate::Result<()> {
|
||||
let sorts: Vec<DocValueAndOrder> = req
|
||||
|
||||
@@ -44,14 +44,11 @@
|
||||
//! - [Metric](metric)
|
||||
//! - [Average](metric::AverageAggregation)
|
||||
//! - [Stats](metric::StatsAggregation)
|
||||
//! - [ExtendedStats](metric::ExtendedStatsAggregation)
|
||||
//! - [Min](metric::MinAggregation)
|
||||
//! - [Max](metric::MaxAggregation)
|
||||
//! - [Sum](metric::SumAggregation)
|
||||
//! - [Count](metric::CountAggregation)
|
||||
//! - [Percentiles](metric::PercentilesAggregationReq)
|
||||
//! - [Cardinality](metric::CardinalityAggregationReq)
|
||||
//! - [TopHits](metric::TopHitsAggregationReq)
|
||||
//!
|
||||
//! # Example
|
||||
//! Compute the average metric, by building [`agg_req::Aggregations`], which is built from an
|
||||
@@ -148,7 +145,7 @@ mod agg_tests;
|
||||
|
||||
use core::fmt;
|
||||
|
||||
pub use agg_limits::AggregationLimitsGuard;
|
||||
pub use agg_limits::AggregationLimits;
|
||||
pub use collector::{
|
||||
AggregationCollector, AggregationSegmentCollector, DistributedAggregationCollector,
|
||||
DEFAULT_BUCKET_LIMIT,
|
||||
@@ -180,7 +177,7 @@ pub(crate) fn deserialize_option_f64<'de, D>(deserializer: D) -> Result<Option<f
|
||||
where D: Deserializer<'de> {
|
||||
struct StringOrFloatVisitor;
|
||||
|
||||
impl Visitor<'_> for StringOrFloatVisitor {
|
||||
impl<'de> Visitor<'de> for StringOrFloatVisitor {
|
||||
type Value = Option<f64>;
|
||||
|
||||
fn expecting(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
|
||||
@@ -226,7 +223,7 @@ pub(crate) fn deserialize_f64<'de, D>(deserializer: D) -> Result<f64, D::Error>
|
||||
where D: Deserializer<'de> {
|
||||
struct StringOrFloatVisitor;
|
||||
|
||||
impl Visitor<'_> for StringOrFloatVisitor {
|
||||
impl<'de> Visitor<'de> for StringOrFloatVisitor {
|
||||
type Value = f64;
|
||||
|
||||
fn expecting(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
|
||||
@@ -336,16 +333,10 @@ pub type SerializedKey = String;
|
||||
|
||||
#[derive(Clone, Debug, Serialize, Deserialize, PartialOrd)]
|
||||
/// The key to identify a bucket.
|
||||
///
|
||||
/// The order is important, with serde untagged, that we try to deserialize into i64 first.
|
||||
#[serde(untagged)]
|
||||
pub enum Key {
|
||||
/// String key
|
||||
Str(String),
|
||||
/// `i64` key
|
||||
I64(i64),
|
||||
/// `u64` key
|
||||
U64(u64),
|
||||
/// `f64` key
|
||||
F64(f64),
|
||||
}
|
||||
@@ -356,8 +347,6 @@ impl std::hash::Hash for Key {
|
||||
match self {
|
||||
Key::Str(text) => text.hash(state),
|
||||
Key::F64(val) => val.to_bits().hash(state),
|
||||
Key::U64(val) => val.hash(state),
|
||||
Key::I64(val) => val.hash(state),
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -366,12 +355,8 @@ impl PartialEq for Key {
|
||||
fn eq(&self, other: &Self) -> bool {
|
||||
match (self, other) {
|
||||
(Self::Str(l), Self::Str(r)) => l == r,
|
||||
(Self::F64(l), Self::F64(r)) => l.to_bits() == r.to_bits(),
|
||||
(Self::I64(l), Self::I64(r)) => l == r,
|
||||
(Self::U64(l), Self::U64(r)) => l == r,
|
||||
// we list all variant of left operand to make sure this gets updated when we add
|
||||
// variants to the enum
|
||||
(Self::Str(_) | Self::F64(_) | Self::I64(_) | Self::U64(_), _) => false,
|
||||
(Self::F64(l), Self::F64(r)) => l == r,
|
||||
_ => false,
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -381,8 +366,6 @@ impl Display for Key {
|
||||
match self {
|
||||
Key::Str(val) => f.write_str(val),
|
||||
Key::F64(val) => f.write_str(&val.to_string()),
|
||||
Key::U64(val) => f.write_str(&val.to_string()),
|
||||
Key::I64(val) => f.write_str(&val.to_string()),
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -462,7 +445,7 @@ mod tests {
|
||||
agg_req: Aggregations,
|
||||
index: &Index,
|
||||
query: Option<(&str, &str)>,
|
||||
limits: AggregationLimitsGuard,
|
||||
limits: AggregationLimits,
|
||||
) -> crate::Result<Value> {
|
||||
let collector = AggregationCollector::from_aggs(agg_req, limits);
|
||||
|
||||
@@ -582,7 +565,7 @@ mod tests {
|
||||
.set_indexing_options(
|
||||
TextFieldIndexing::default().set_index_option(IndexRecordOption::WithFreqs),
|
||||
)
|
||||
.set_fast(Some("raw"))
|
||||
.set_fast(None)
|
||||
.set_stored();
|
||||
let text_field = schema_builder.add_text_field("text", text_fieldtype);
|
||||
let date_field = schema_builder.add_date_field("date", FAST);
|
||||
|
||||
@@ -5,7 +5,7 @@
|
||||
|
||||
use std::fmt::Debug;
|
||||
|
||||
pub(crate) use super::agg_limits::AggregationLimitsGuard;
|
||||
pub(crate) use super::agg_limits::AggregationLimits;
|
||||
use super::agg_req::AggregationVariants;
|
||||
use super::agg_req_with_accessor::{AggregationWithAccessor, AggregationsWithAccessor};
|
||||
use super::bucket::{SegmentHistogramCollector, SegmentRangeCollector, SegmentTermCollector};
|
||||
@@ -16,10 +16,7 @@ use super::metric::{
|
||||
SumAggregation,
|
||||
};
|
||||
use crate::aggregation::bucket::TermMissingAgg;
|
||||
use crate::aggregation::metric::{
|
||||
CardinalityAggregationReq, SegmentCardinalityCollector, SegmentExtendedStatsCollector,
|
||||
TopHitsSegmentCollector,
|
||||
};
|
||||
use crate::aggregation::metric::{SegmentExtendedStatsCollector, TopHitsSegmentCollector};
|
||||
|
||||
pub(crate) trait SegmentAggregationCollector: CollectorClone + Debug {
|
||||
fn add_intermediate_aggregation_result(
|
||||
@@ -103,7 +100,7 @@ pub(crate) fn build_single_agg_segment_collector(
|
||||
Range(range_req) => Ok(Box::new(SegmentRangeCollector::from_req_and_validate(
|
||||
range_req,
|
||||
&mut req.sub_aggregation,
|
||||
&mut req.limits,
|
||||
&req.limits,
|
||||
req.field_type,
|
||||
accessor_idx,
|
||||
)?)),
|
||||
@@ -172,9 +169,6 @@ pub(crate) fn build_single_agg_segment_collector(
|
||||
accessor_idx,
|
||||
req.segment_ordinal,
|
||||
))),
|
||||
Cardinality(CardinalityAggregationReq { missing, .. }) => Ok(Box::new(
|
||||
SegmentCardinalityCollector::from_req(req.field_type, accessor_idx, missing),
|
||||
)),
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
@@ -13,7 +13,7 @@ struct Hit<'a> {
|
||||
facet: &'a Facet,
|
||||
}
|
||||
|
||||
impl Eq for Hit<'_> {}
|
||||
impl<'a> Eq for Hit<'a> {}
|
||||
|
||||
impl<'a> PartialEq<Hit<'a>> for Hit<'a> {
|
||||
fn eq(&self, other: &Hit<'_>) -> bool {
|
||||
@@ -27,7 +27,7 @@ impl<'a> PartialOrd<Hit<'a>> for Hit<'a> {
|
||||
}
|
||||
}
|
||||
|
||||
impl Ord for Hit<'_> {
|
||||
impl<'a> Ord for Hit<'a> {
|
||||
fn cmp(&self, other: &Self) -> Ordering {
|
||||
other
|
||||
.count
|
||||
|
||||
@@ -182,7 +182,6 @@ where
|
||||
}
|
||||
|
||||
/// A variant of the [`FilterCollector`] specialized for bytes fast fields, i.e.
|
||||
///
|
||||
/// it transparently wraps an inner [`Collector`] but filters documents
|
||||
/// based on the result of applying the predicate to the bytes fast field.
|
||||
///
|
||||
|
||||
@@ -495,4 +495,4 @@ where
|
||||
impl_downcast!(Fruit);
|
||||
|
||||
#[cfg(test)]
|
||||
pub(crate) mod tests;
|
||||
pub mod tests;
|
||||
|
||||
@@ -161,7 +161,7 @@ impl<TFruit: Fruit> FruitHandle<TFruit> {
|
||||
/// # Ok(())
|
||||
/// # }
|
||||
/// ```
|
||||
#[expect(clippy::type_complexity)]
|
||||
#[allow(clippy::type_complexity)]
|
||||
#[derive(Default)]
|
||||
pub struct MultiCollector<'a> {
|
||||
collector_wrappers: Vec<
|
||||
@@ -190,7 +190,7 @@ impl<'a> MultiCollector<'a> {
|
||||
}
|
||||
}
|
||||
|
||||
impl Collector for MultiCollector<'_> {
|
||||
impl<'a> Collector for MultiCollector<'a> {
|
||||
type Fruit = MultiFruit;
|
||||
type Child = MultiCollectorChild;
|
||||
|
||||
|
||||
@@ -15,6 +15,11 @@ use crate::{DocAddress, DocId, SegmentOrdinal};
|
||||
/// The REVERSE_ORDER generic parameter controls whether the by-feature order
|
||||
/// should be reversed, which is useful for achieving for example largest-first
|
||||
/// semantics without having to wrap the feature in a `Reverse`.
|
||||
///
|
||||
/// WARNING: equality is not what you would expect here.
|
||||
/// Two elements are equal if their feature is equal, and regardless of whether `doc`
|
||||
/// is equal. This should be perfectly fine for this usage, but let's make sure this
|
||||
/// struct is never public.
|
||||
#[derive(Clone, Default, Serialize, Deserialize)]
|
||||
pub struct ComparableDoc<T, D, const REVERSE_ORDER: bool = false> {
|
||||
/// The feature of the document. In practice, this is
|
||||
|
||||
@@ -3,6 +3,7 @@ use std::marker::PhantomData;
|
||||
use std::sync::Arc;
|
||||
|
||||
use columnar::ColumnValues;
|
||||
use serde::de::DeserializeOwned;
|
||||
use serde::{Deserialize, Serialize};
|
||||
|
||||
use super::Collector;
|
||||
@@ -789,7 +790,7 @@ impl<Score, D, const R: bool> From<TopNComputerDeser<Score, D, R>> for TopNCompu
|
||||
impl<Score, D, const R: bool> TopNComputer<Score, D, R>
|
||||
where
|
||||
Score: PartialOrd + Clone,
|
||||
D: Ord,
|
||||
D: Serialize + DeserializeOwned + Ord + Clone,
|
||||
{
|
||||
/// Create a new `TopNComputer`.
|
||||
/// Internally it will allocate a buffer of size `2 * top_n`.
|
||||
|
||||
@@ -1,91 +0,0 @@
|
||||
use std::path::PathBuf;
|
||||
|
||||
use schema::*;
|
||||
|
||||
use crate::*;
|
||||
|
||||
fn create_index(path: &str) {
|
||||
let mut schema_builder = Schema::builder();
|
||||
let label = schema_builder.add_text_field("label", TEXT | STORED);
|
||||
let date = schema_builder.add_date_field("date", INDEXED | STORED);
|
||||
let schema = schema_builder.build();
|
||||
std::fs::create_dir_all(path).unwrap();
|
||||
let index = Index::create_in_dir(path, schema).unwrap();
|
||||
let mut index_writer = index.writer_with_num_threads(1, 20_000_000).unwrap();
|
||||
index_writer
|
||||
.add_document(doc!(label => "dateformat", date => DateTime::from_timestamp_nanos(123456)))
|
||||
.unwrap();
|
||||
index_writer.commit().unwrap();
|
||||
}
|
||||
|
||||
#[test]
|
||||
/// Writes an Index for the current INDEX_FORMAT_VERSION to disk.
|
||||
fn create_format() {
|
||||
let version = INDEX_FORMAT_VERSION.to_string();
|
||||
let file_path = path_for_version(&version);
|
||||
if PathBuf::from(file_path.clone()).exists() {
|
||||
return;
|
||||
}
|
||||
create_index(&file_path);
|
||||
}
|
||||
|
||||
fn path_for_version(version: &str) -> String {
|
||||
format!("./tests/compat_tests_data/index_v{}/", version)
|
||||
}
|
||||
|
||||
/// feature flag quickwit uses a different dictionary type
|
||||
#[test]
|
||||
#[cfg(not(feature = "quickwit"))]
|
||||
fn test_format_6() {
|
||||
let path = path_for_version("6");
|
||||
|
||||
let index = Index::open_in_dir(path).expect("Failed to open index");
|
||||
// dates are truncated to Microseconds in v6
|
||||
assert_date_time_precision(&index, DateTimePrecision::Microseconds);
|
||||
}
|
||||
|
||||
/// feature flag quickwit uses a different dictionary type
|
||||
#[test]
|
||||
#[cfg(not(feature = "quickwit"))]
|
||||
fn test_format_7() {
|
||||
let path = path_for_version("7");
|
||||
|
||||
let index = Index::open_in_dir(path).expect("Failed to open index");
|
||||
// dates are not truncated in v7 in the docstore
|
||||
assert_date_time_precision(&index, DateTimePrecision::Nanoseconds);
|
||||
}
|
||||
|
||||
#[cfg(not(feature = "quickwit"))]
|
||||
fn assert_date_time_precision(index: &Index, doc_store_precision: DateTimePrecision) {
|
||||
use collector::TopDocs;
|
||||
let reader = index.reader().expect("Failed to create reader");
|
||||
let searcher = reader.searcher();
|
||||
|
||||
let schema = index.schema();
|
||||
let label_field = schema.get_field("label").expect("Field 'label' not found");
|
||||
let query_parser = query::QueryParser::for_index(index, vec![label_field]);
|
||||
|
||||
let query = query_parser
|
||||
.parse_query("dateformat")
|
||||
.expect("Failed to parse query");
|
||||
let top_docs = searcher
|
||||
.search(&query, &TopDocs::with_limit(1))
|
||||
.expect("Search failed");
|
||||
|
||||
assert_eq!(top_docs.len(), 1, "Expected 1 search result");
|
||||
|
||||
let doc_address = top_docs[0].1;
|
||||
let retrieved_doc: TantivyDocument = searcher
|
||||
.doc(doc_address)
|
||||
.expect("Failed to retrieve document");
|
||||
|
||||
let date_field = schema.get_field("date").expect("Field 'date' not found");
|
||||
let date_value = retrieved_doc
|
||||
.get_first(date_field)
|
||||
.expect("Date field not found in document")
|
||||
.as_datetime()
|
||||
.unwrap();
|
||||
|
||||
let expected = DateTime::from_timestamp_nanos(123456).truncate(doc_store_precision);
|
||||
assert_eq!(date_value, expected,);
|
||||
}
|
||||
@@ -41,12 +41,16 @@ impl Executor {
|
||||
///
|
||||
/// Regardless of the executor (`SingleThread` or `ThreadPool`), panics in the task
|
||||
/// will propagate to the caller.
|
||||
pub fn map<A, R, F>(&self, f: F, args: impl Iterator<Item = A>) -> crate::Result<Vec<R>>
|
||||
where
|
||||
pub fn map<
|
||||
A: Send,
|
||||
R: Send,
|
||||
AIterator: Iterator<Item = A>,
|
||||
F: Sized + Sync + Fn(A) -> crate::Result<R>,
|
||||
{
|
||||
>(
|
||||
&self,
|
||||
f: F,
|
||||
args: AIterator,
|
||||
) -> crate::Result<Vec<R>> {
|
||||
match self {
|
||||
Executor::SingleThread => args.map(f).collect::<crate::Result<_>>(),
|
||||
Executor::ThreadPool(pool) => {
|
||||
@@ -96,7 +100,7 @@ impl Executor {
|
||||
|
||||
/// Spawn a task on the pool, returning a future completing on task success.
|
||||
///
|
||||
/// If the task panics, returns `Err(())`.
|
||||
/// If the task panic, returns `Err(())`.
|
||||
#[cfg(feature = "quickwit")]
|
||||
pub fn spawn_blocking<T: Send + 'static>(
|
||||
&self,
|
||||
|
||||
@@ -1,10 +1,10 @@
|
||||
use common::json_path_writer::{JSON_END_OF_PATH, JSON_PATH_SEGMENT_SEP};
|
||||
use common::json_path_writer::JSON_PATH_SEGMENT_SEP;
|
||||
use common::{replace_in_place, JsonPathWriter};
|
||||
use rustc_hash::FxHashMap;
|
||||
|
||||
use crate::postings::{IndexingContext, IndexingPosition, PostingsWriter};
|
||||
use crate::schema::document::{ReferenceValue, ReferenceValueLeaf, Value};
|
||||
use crate::schema::{Type, DATE_TIME_PRECISION_INDEXED};
|
||||
use crate::schema::Type;
|
||||
use crate::time::format_description::well_known::Rfc3339;
|
||||
use crate::time::{OffsetDateTime, UtcOffset};
|
||||
use crate::tokenizer::TextAnalyzer;
|
||||
@@ -71,7 +71,7 @@ pub fn json_path_sep_to_dot(path: &mut str) {
|
||||
}
|
||||
}
|
||||
|
||||
#[expect(clippy::too_many_arguments)]
|
||||
#[allow(clippy::too_many_arguments)]
|
||||
fn index_json_object<'a, V: Value<'a>>(
|
||||
doc: DocId,
|
||||
json_visitor: V::ObjectIter,
|
||||
@@ -83,9 +83,6 @@ fn index_json_object<'a, V: Value<'a>>(
|
||||
positions_per_path: &mut IndexingPositionsPerPath,
|
||||
) {
|
||||
for (json_path_segment, json_value_visitor) in json_visitor {
|
||||
if json_path_segment.as_bytes().contains(&JSON_END_OF_PATH) {
|
||||
continue;
|
||||
}
|
||||
json_path_writer.push(json_path_segment);
|
||||
index_json_value(
|
||||
doc,
|
||||
@@ -101,7 +98,7 @@ fn index_json_object<'a, V: Value<'a>>(
|
||||
}
|
||||
}
|
||||
|
||||
#[expect(clippy::too_many_arguments)]
|
||||
#[allow(clippy::too_many_arguments)]
|
||||
pub(crate) fn index_json_value<'a, V: Value<'a>>(
|
||||
doc: DocId,
|
||||
json_value: V,
|
||||
@@ -189,7 +186,6 @@ pub(crate) fn index_json_value<'a, V: Value<'a>>(
|
||||
ctx.path_to_unordered_id
|
||||
.get_or_allocate_unordered_id(json_path_writer.as_str()),
|
||||
);
|
||||
let val = val.truncate(DATE_TIME_PRECISION_INDEXED);
|
||||
term_buffer.append_type_and_fast_value(val);
|
||||
postings_writer.subscribe(doc, 0u32, term_buffer, ctx);
|
||||
}
|
||||
@@ -240,11 +236,7 @@ pub(crate) fn index_json_value<'a, V: Value<'a>>(
|
||||
/// Tries to infer a JSON type from a string and append it to the term.
|
||||
///
|
||||
/// The term must be json + JSON path.
|
||||
pub fn convert_to_fast_value_and_append_to_json_term(
|
||||
mut term: Term,
|
||||
phrase: &str,
|
||||
truncate_date_for_search: bool,
|
||||
) -> Option<Term> {
|
||||
pub fn convert_to_fast_value_and_append_to_json_term(mut term: Term, phrase: &str) -> Option<Term> {
|
||||
assert_eq!(
|
||||
term.value()
|
||||
.as_json_value_bytes()
|
||||
@@ -255,11 +247,8 @@ pub fn convert_to_fast_value_and_append_to_json_term(
|
||||
"JSON value bytes should be empty"
|
||||
);
|
||||
if let Ok(dt) = OffsetDateTime::parse(phrase, &Rfc3339) {
|
||||
let mut dt = DateTime::from_utc(dt.to_offset(UtcOffset::UTC));
|
||||
if truncate_date_for_search {
|
||||
dt = dt.truncate(DATE_TIME_PRECISION_INDEXED);
|
||||
}
|
||||
term.append_type_and_fast_value(dt);
|
||||
let dt_utc = dt.to_offset(UtcOffset::UTC);
|
||||
term.append_type_and_fast_value(DateTime::from_utc(dt_utc));
|
||||
return Some(term);
|
||||
}
|
||||
if let Ok(i64_val) = str::parse::<i64>(phrase) {
|
||||
|
||||
@@ -39,7 +39,7 @@ impl RetryPolicy {
|
||||
/// The `DirectoryLock` is an object that represents a file lock.
|
||||
///
|
||||
/// It is associated with a lock file, that gets deleted on `Drop.`
|
||||
#[expect(dead_code)]
|
||||
#[allow(dead_code)]
|
||||
pub struct DirectoryLock(Box<dyn Send + Sync + 'static>);
|
||||
|
||||
struct DirectoryLockGuard {
|
||||
@@ -102,8 +102,10 @@ fn retry_policy(is_blocking: bool) -> RetryPolicy {
|
||||
///
|
||||
/// There are currently two implementations of `Directory`
|
||||
///
|
||||
/// - The [`MMapDirectory`][crate::directory::MmapDirectory], this should be your default choice.
|
||||
/// - The [`RamDirectory`][crate::directory::RamDirectory], which should be used mostly for tests.
|
||||
/// - The [`MMapDirectory`][crate::directory::MmapDirectory], this
|
||||
/// should be your default choice.
|
||||
/// - The [`RamDirectory`][crate::directory::RamDirectory], which
|
||||
/// should be used mostly for tests.
|
||||
pub trait Directory: DirectoryClone + fmt::Debug + Send + Sync + 'static {
|
||||
/// Opens a file and returns a boxed `FileHandle`.
|
||||
///
|
||||
|
||||
@@ -48,7 +48,6 @@ pub static INDEX_WRITER_LOCK: Lazy<Lock> = Lazy::new(|| Lock {
|
||||
});
|
||||
/// The meta lock file is here to protect the segment files being opened by
|
||||
/// `IndexReader::reload()` from being garbage collected.
|
||||
///
|
||||
/// It makes it possible for another process to safely consume
|
||||
/// our index in-writing. Ideally, we may have preferred `RWLock` semantics
|
||||
/// here, but it is difficult to achieve on Windows.
|
||||
|
||||
Some files were not shown because too many files have changed in this diff Show More
Reference in New Issue
Block a user