Compare commits

...

93 Commits

Author SHA1 Message Date
Pascal Seitz
dd57b7fa3a term_freq in TermFrequencyRecorder untested
PR to demonstrate #2285
2023-12-20 23:38:47 +08:00
PSeitz
bff7c58497 improve indexing benchmark (#2275) 2023-12-11 09:04:42 +01:00
trinity-1686a
9ebc5ed053 use fst for sstable index (#2268)
* read path for new fst based index

* implement BlockAddrStoreWriter

* extract slop/derivation computation

* use better linear approximator and allow negative correction to approximator

* document format and reorder some fields

* optimize single block sstable size

* plug backward compat
2023-12-04 15:13:15 +01:00
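To make the block-address idea in the commit above concrete, here is a hedged Rust sketch (hypothetical code, not the sstable crate's actual BlockAddrStoreWriter): block start offsets are approximated by a straight line and only small signed corrections are stored per block, which is why negative corrections are needed.

// Hypothetical sketch, not the sstable crate's actual BlockAddrStoreWriter.
// Assumes `offsets` is non-empty and sorted ascending.
fn build_linear_approximation(offsets: &[u64]) -> (f64, f64, Vec<i64>) {
    let first = offsets[0] as f64;
    let last = *offsets.last().unwrap() as f64;
    let slope = if offsets.len() > 1 {
        (last - first) / (offsets.len() - 1) as f64
    } else {
        0.0
    };
    // Correction = real offset - predicted offset; small and possibly negative.
    let corrections: Vec<i64> = offsets
        .iter()
        .enumerate()
        .map(|(i, &off)| off as i64 - (first + slope * i as f64) as i64)
        .collect();
    (first, slope, corrections)
}

fn block_offset(first: f64, slope: f64, corrections: &[i64], block: usize) -> u64 {
    ((first + slope * block as f64) as i64 + corrections[block]) as u64
}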
PSeitz
0b56c88e69 Revert "Preparing for 0.21.2 release." (#2258)
* Revert "Preparing for 0.21.2 release. (#2256)"

This reverts commit 9caab45136.

* bump version to 0.21.1

* set version to 0.22.0-dev
2023-12-01 13:46:12 +01:00
PSeitz
24841f0b2a update bitpacker dep (#2269) 2023-12-01 13:45:52 +01:00
PSeitz
1a9fc10be9 add fields_metadata to SegmentReader, add columnar docs (#2222)
* add fields_metadata to SegmentReader, add columnar docs

* use schema to resolve field, add test

* normalize paths

* merge for FieldsMetadata, add fields_metadata on Index

* Update src/core/segment_reader.rs

Co-authored-by: Paul Masurel <paul@quickwit.io>

* merge code paths

* add Hash

* move function outside

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2023-11-22 12:29:53 +01:00
PSeitz
07573a7f19 update fst (#2267)
update fst to 0.5 (deduplicates regex-syntax in the dep tree)
deps cleanup
2023-11-21 16:06:57 +01:00
BlackHoleFox
daad2dc151 Take string references instead of owned values building Facet paths (#2265) 2023-11-20 09:40:44 +01:00
PSeitz
054f49dc31 support escaped dot, add agg test (#2250)
add agg test for nested JSON
allow escaping of dot
2023-11-20 03:00:57 +01:00
PSeitz
47009ed2d3 remove unused deps (#2264)
found with cargo machete
remove pprof (doesn't work)
2023-11-20 02:59:59 +01:00
PSeitz
0aae31d7d7 reduce number of allocations (#2257)
* reduce number of allocations

Explanation makes up around 50% of all allocations (by allocation count, not runtime).
It's created during serialization but not called.

- Make Explanation optional in BM25
- Avoid allocations when using Explanation

* use Cow
2023-11-16 13:47:36 +01:00
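A minimal illustration of the Cow part of this change (assumed type, not tantivy's actual Explanation struct): with Cow, the common case of a static description borrows instead of allocating a String.

// Illustrative only, not tantivy's real Explanation type.
use std::borrow::Cow;

struct ExplanationSketch {
    description: Cow<'static, str>,
}

impl ExplanationSketch {
    fn from_static(description: &'static str) -> Self {
        // No allocation: the &'static str is borrowed.
        ExplanationSketch { description: Cow::Borrowed(description) }
    }

    fn from_string(description: String) -> Self {
        // Allocation only happens when the description really is dynamic.
        ExplanationSketch { description: Cow::Owned(description) }
    }

    fn description(&self) -> &str {
        &self.description
    }
}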
Paul Masurel
9caab45136 Preparing for 0.21.2 release. (#2256) 2023-11-15 10:43:36 +09:00
Chris Tam
6d9a7b7eb0 Derive Debug for SchemaBuilder (#2254) 2023-11-15 01:03:44 +01:00
dependabot[bot]
7a2c5804b1 Update itertools requirement from 0.11.0 to 0.12.0 (#2255)
Updates the requirements on [itertools](https://github.com/rust-itertools/itertools) to permit the latest version.
- [Changelog](https://github.com/rust-itertools/itertools/blob/master/CHANGELOG.md)
- [Commits](https://github.com/rust-itertools/itertools/compare/v0.11.0...v0.12.0)

---
updated-dependencies:
- dependency-name: itertools
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-11-15 01:03:08 +01:00
François Massot
5319977171 Merge pull request #2253 from quickwit-oss/issue/2251-bug-merge-json-object-with-number
Fix bug occurring when merging JSON objects indexed with positions.
2023-11-14 17:28:29 +01:00
trinity-1686a
828632e8c4 rustfmt 2023-11-14 15:05:16 +01:00
Paul Masurel
6b59ec6fd5 Fix bug occurring when merging JSON objects indexed with positions.
In a JSON object field, the presence of term frequencies depends on the
field. Typically, a string indexed with positions will have positions,
while numbers won't.

The presence or absence of term freqs for a given term is unfortunately
encoded in a very passive way.

It is given by the presence of extra information in the skip info, or
by the lack of term freqs after decoding vint blocks.

Before, after writing a segment, we would encode the segment correctly
(without any term freqs for numbers in a JSON object field).
However, during a merge we would get the default term freq = 1 value
(the default in the absence of encoded term freqs).

The merger would then proceed and attempt to decode 1 position when
there are in fact none.

This PR requires explicitly telling the posting serializer, for each new
term, whether term frequencies should be serialized.

Closes #2251
2023-11-14 22:41:48 +09:00
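A hypothetical sketch of the idea in the fix above (not tantivy's exact serializer API): the writer is told explicitly, per term, whether term frequencies exist, instead of the reader later guessing and silently falling back to a default tf = 1 during a merge.

// Hypothetical types and names, for illustration only.
struct PostingSerializerSketch {
    term_has_freqs: bool,
    out: Vec<u8>,
}

impl PostingSerializerSketch {
    fn new_term(&mut self, record_term_freq: bool) {
        // Explicit per-term flag: true for text indexed with positions,
        // false e.g. for numbers inside a JSON object field.
        self.term_has_freqs = record_term_freq;
    }

    fn write_doc(&mut self, doc_delta: u32, term_freq: u32) {
        self.out.extend_from_slice(&doc_delta.to_le_bytes());
        if self.term_has_freqs {
            self.out.extend_from_slice(&term_freq.to_le_bytes());
        }
        // When term_has_freqs is false, nothing is written and the reader must not
        // try to decode positions for this term.
    }
}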
PSeitz
b60d862150 docid deltas while indexing (#2249)
* docid deltas while indexing

storing deltas is especially helpful for repetitive data like logs.
In those cases, recording a doc on a term used to cost 4 bytes and now
costs 1 byte.

HDFS Indexing 1.1GB Total memory consumption:
Before:  760 MB
Now:     590 MB

* use scan for delta decoding
2023-11-13 05:14:27 +01:00
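A minimal sketch of the delta encoding described above (assumed helper names, not tantivy's actual recorder): store the gap to the previous doc id and variable-length encode it, so repetitive data such as logs usually needs one byte per doc instead of four.

// Assumes `doc_ids` is sorted in ascending order.
fn write_vint(mut val: u32, out: &mut Vec<u8>) {
    loop {
        let byte = (val & 0x7f) as u8;
        val >>= 7;
        if val == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80); // continuation bit: more bytes follow
    }
}

fn encode_doc_deltas(doc_ids: &[u32]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut prev = 0u32;
    for &doc in doc_ids {
        write_vint(doc - prev, &mut out); // small delta usually fits in a single byte
        prev = doc;
    }
    out
}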
PSeitz
4837c7811a add missing inlines (#2245) 2023-11-10 08:00:42 +01:00
PSeitz
5a2397d57e add sstable ord_to_term benchmark (#2242) 2023-11-10 07:27:48 +01:00
PSeitz
927b4432c9 Perf: use term hashmap in fastfield (#2243)
* add shared arena hashmap

* bench fastfield indexing

* use shared arena hashmap in columnar

lower minimum resize in hashtable

* clippy

* add comments
2023-11-09 13:44:02 +01:00
trinity-1686a
7a0064db1f bump index version (#2237)
* bump index version

and add constant for lowest supported version

* use range instead of handcoded bounds
2023-11-06 19:02:37 +01:00
PSeitz
2e7327205d fix coverage run (#2232)
The coverage run uses the compare_hash_only feature, which is not compatible
with the test_hashmap_size test.
2023-11-06 11:18:38 +00:00
Paul Masurel
7bc5bf78e2 Fixing functional tests. (#2239) 2023-11-05 18:18:39 +09:00
giovannicuccu
ef603c8c7e rename ReloadPolicy onCommit to onCommitWithDelay (#2235)
* rename ReloadPolicy onCommit to onCommitWithDelay

* fix format issues

---------

Co-authored-by: Giovanni Cuccu <gcuccu@imolainformatica.it>
2023-11-03 12:22:10 +01:00
PSeitz
28dd6b6546 collect json paths in indexing (#2231)
* collect json paths in indexing

* remove unsafe iter_mut_keys
2023-11-01 11:25:17 +01:00
trinity-1686a
1dda2bb537 handle * inside term in query parser (#2228) 2023-10-27 08:57:02 +02:00
PSeitz
bf6544cf28 fix mmap::Advice reexport (#2230) 2023-10-27 14:09:25 +09:00
PSeitz
ccecf946f7 tantivy 0.21.1 (#2227) 2023-10-27 05:01:44 +02:00
PSeitz
19a859d6fd term hashmap remove copy in is_empty, unused unordered_id (#2229) 2023-10-27 05:01:32 +02:00
PSeitz
83af14caa4 Fix range query (#2226)
Fix range query end check in advance
Rename vars to reduce ambiguity
add tests

Fixes #2225
2023-10-25 09:17:31 +02:00
PSeitz
4feeb2323d fix clippy (#2223) 2023-10-24 10:05:22 +02:00
PSeitz
07bf66a197 json path writer (#2224)
* refactor logic to JsonPathWriter

* use in encode_column_name

* add inlines

* move unsafe block
2023-10-24 09:45:50 +02:00
trinity-1686a
0d4589219b encode some part of posting list as -1 instead of direct values (#2185)
* add support for delta-1 encoding posting list

* encode term frequency minus one

* don't emit tf for json integer terms

* make skipreader not pub(crate) mutable
2023-10-20 16:58:26 +02:00
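An illustration of the "minus one" encoding from the commit above (not the exact tantivy encoder): a term frequency is always at least 1, so storing tf - 1 turns the very common tf == 1 case into a 0, which bitpacks and compresses to almost nothing.

fn encode_tf(tf: u32) -> u32 {
    debug_assert!(tf >= 1);
    tf - 1
}

fn decode_tf(encoded: u32) -> u32 {
    encoded + 1
}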
PSeitz
c2b0469180 improve docs, rework exports (#2220)
* rework exports

move snippet and advice
make indexer pub, remove indexer reexports

* add deprecation warning

* add architecture overview
2023-10-18 09:22:24 +02:00
PSeitz
7e1980b218 run coverage only after merge (#2212)
* run coverage only after merge

Coverage is quite a slow step in CI. It can be run only after merging.

* Apply suggestions from code review

Co-authored-by: Paul Masurel <paul@quickwit.io>

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2023-10-18 07:19:36 +02:00
PSeitz
ecb9a89a9f add compat mode for JSON (#2219) 2023-10-17 10:00:55 +02:00
PSeitz
5e06e504e6 split into ReferenceValueLeaf (#2217) 2023-10-16 16:31:30 +02:00
PSeitz
182f58cea6 remove Document: DocumentDeserialize dependency (#2211)
* remove Document: DocumentDeserialize dependency

The dependency requires users to implement an API they may not use.

* remove unnecessary Document bounds
2023-10-13 07:59:54 +02:00
dependabot[bot]
337ffadefd Update lru requirement from 0.11.0 to 0.12.0 (#2208)
Updates the requirements on [lru](https://github.com/jeromefroe/lru-rs) to permit the latest version.
- [Changelog](https://github.com/jeromefroe/lru-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/jeromefroe/lru-rs/compare/0.11.0...0.12.0)

---
updated-dependencies:
- dependency-name: lru
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-10-12 12:09:56 +02:00
dependabot[bot]
22aa4daf19 Update zstd requirement from 0.12 to 0.13 (#2214)
Updates the requirements on [zstd](https://github.com/gyscos/zstd-rs) to permit the latest version.
- [Release notes](https://github.com/gyscos/zstd-rs/releases)
- [Commits](https://github.com/gyscos/zstd-rs/compare/v0.12.0...v0.13.0)

---
updated-dependencies:
- dependency-name: zstd
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-10-12 04:24:44 +02:00
PSeitz
493f9b2f2a Read list of JSON fields encoded in dictionary (#2184)
* Read list of JSON fields encoded in dictionary

add method to get list of fields on InvertedIndexReader

* add field type
2023-10-09 12:06:22 +02:00
PSeitz
e246e5765d replace ReferenceValue with Self in Value (#2210) 2023-10-06 08:22:15 +02:00
PSeitz
6097235eff fix numeric order, refactor Document (#2209)
fix numeric order to prefer i64
rename and move Document stuff
2023-10-05 16:39:56 +02:00
PSeitz
b700c42246 add AsRef, expose object and array iter on Value (#2207)
add AsRef
expose object and array iter
add to_json on Document
2023-10-05 03:55:35 +02:00
PSeitz
5b1bf1a993 replace Field with field name (#2196) 2023-10-04 06:21:40 +02:00
PSeitz
041d4fced7 move to_named_doc to Document trait (#2205) 2023-10-04 06:03:07 +02:00
dependabot[bot]
166fc15239 Update memmap2 requirement from 0.7.1 to 0.9.0 (#2204)
Updates the requirements on [memmap2](https://github.com/RazrFalcon/memmap2-rs) to permit the latest version.
- [Changelog](https://github.com/RazrFalcon/memmap2-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/RazrFalcon/memmap2-rs/compare/v0.7.1...v0.9.0)

---
updated-dependencies:
- dependency-name: memmap2
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-10-04 05:00:46 +02:00
PSeitz
514a6e7fef fix bench compile, fix Document reexport (#2203) 2023-10-03 17:28:36 +02:00
dependabot[bot]
82d9127191 Update fs4 requirement from 0.6.3 to 0.7.0 (#2199)
Updates the requirements on [fs4](https://github.com/al8n/fs4-rs) to permit the latest version.
- [Release notes](https://github.com/al8n/fs4-rs/releases)
- [Commits](https://github.com/al8n/fs4-rs/commits/0.7.0)

---
updated-dependencies:
- dependency-name: fs4
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-10-03 04:43:09 +02:00
PSeitz
03a1f40767 rename DocValue to Value (#2197)
rename DocValue to Value to avoid confusion with lucene DocValues
rename Value to OwnedValue
2023-10-02 17:03:00 +02:00
Harrison Burt
1c7c6fd591 POC: Tantivy documents as a trait (#2071)
* fix windows build (#1)

* Fix windows build

* Add doc traits

* Add field value iter

* Add value and serialization

* Adjust order

* Fix bug

* Correct type

* Fix generic bugs

* Reformat code

* Add generic to index writer which I forgot about

* Fix missing generics on single segment writer

* Add missing type export

* Add default methods for convenience

* Cleanup

* Fix more-like-this query to use standard types

* Update API and fix tests

* Add doc traits

* Add field value iter

* Add value and serialization

* Adjust order

* Fix bug

* Correct type

* Rebase main and fix conflicts

* Reformat code

* Merge upstream

* Fix missing generics on single segment writer

* Add missing type export

* Add default methods for convenience

* Cleanup

* Fix more-like-this query to use standard types

* Update API and fix tests

* Add tokenizer improvements from previous commits

* Add tokenizer improvements from previous commits

* Reformat

* Fix unit tests

* Fix unit tests

* Use enum in changes

* Stage changes

* Add new deserializer logic

* Add serializer integration

* Add document deserializer

* Implement new (de)serialization api for existing types

* Fix bugs and type errors

* Add helper implementations

* Fix errors

* Reformat code

* Add unit tests and some code organisation for serialization

* Add unit tests to deserializer

* Add some small docs

* Add support for deserializing serde values

* Reformat

* Fix typo

* Fix typo

* Change repr of facet

* Remove unused trait methods

* Add child value type

* Resolve comments

* Fix build

* Fix more build errors

* Fix more build errors

* Fix the tests I missed

* Fix examples

* fix numerical order, serialize PreTok Str

* fix coverage

* rename Document to TantivyDocument, rename DocumentAccess to Document

add Binary prefix to binary de/serialization

* fix coverage

---------

Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>
2023-10-02 10:01:16 +02:00
PSeitz
b525f653c0 replace BinaryHeap for TopN (#2186)
* replace BinaryHeap for TopN

replace BinaryHeap for TopN with a variant that selects the median via a
quickselect-style partition, which runs in O(n) time.

add merge_fruits fast path

* call truncate unconditionally, extend test

* remove special early exit

* add TODO, fmt

* truncate top n instead of median, return vec

* simplify code
2023-09-27 09:25:30 +02:00
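A sketch of the idea behind the change above (not the actual tantivy TopN collector): buffer the hits, then use the standard library's O(n) quickselect-style partition to keep only the n best before doing a single final sort. Assumes scores are never NaN.

fn keep_top_n(mut hits: Vec<(f32, u32)>, n: usize) -> Vec<(f32, u32)> {
    if hits.len() > n {
        // Partition so the n best scores end up in hits[..n]; order inside is arbitrary.
        hits.select_nth_unstable_by(n, |a, b| b.0.partial_cmp(&a.0).unwrap());
        hits.truncate(n);
    }
    // A single final sort on at most n elements.
    hits.sort_unstable_by(|a, b| b.0.partial_cmp(&a.0).unwrap());
    hits
}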
ethever.eth
90586bc1e2 chore: remove unused Seek impl for Writers (#2187) (#2189)
Co-authored-by: famouscat <onismaa@gmail.com>
2023-09-26 17:03:28 +09:00
PSeitz
832f1633de handle exclusive out of bounds ranges on fastfield range queries (#2174)
closes https://github.com/quickwit-oss/quickwit/issues/3790
2023-09-26 08:00:40 +02:00
PSeitz
38db53c465 make column_index pub (#2181) 2023-09-22 08:06:45 +02:00
PSeitz
34920d31f5 Fix DateHistogram bucket gap (#2183)
* Fix DateHistogram bucket gap

Fixes a miscomputation of the number of buckets needed in the
DateHistogram.

This is due to a missing normalization from request values (ms) to fast field
values (ns) when converting an intermediate result to the final result.
This makes the computation wrong by a factor of 1_000_000.
The Histogram normalizes values to nanoseconds to make user input like
extended_bounds (ms precision) and the values from the fast field (ns precision for the date type) compatible.
This normalization happens only for date type fields, as other field types don't have precision settings.
The normalization did not happen due to a missing `column_type`, which is not
correctly passed after merging an empty aggregation (which does not have a `column_type` set) with a regular aggregation.

A related issue: an empty aggregation, which will not have `column_type` set,
will not convert the result to a human-readable format.

This PR fixes the issue by:
- Limiting the allowed field types of DateHistogram to DateType
- Flagging the aggregation as `is_date_agg` instead of passing the column_type, which is only available at the segment level
- Fixing the merge logic

It also adds a flag so normalization happens only once. This is not an issue
currently, but it could easily become one.

closes https://github.com/quickwit-oss/quickwit/issues/3837

* use older nightly for time crate (breaks build)
2023-09-21 10:41:35 +02:00
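A worked example of the normalization described above (illustrative code, not the aggregation crate's actual implementation): request-side parameters such as extended_bounds are in milliseconds, while date fast field values are stored in nanoseconds, so the request side must be scaled by 1_000_000 before the two are compared.

const MS_TO_NS: i64 = 1_000_000;

fn num_buckets(min_ms: i64, max_ms: i64, interval_ms: i64) -> i64 {
    let min_ns = min_ms * MS_TO_NS;
    let max_ns = max_ms * MS_TO_NS;
    let interval_ns = interval_ms * MS_TO_NS;
    // Comparing ms bounds against ns values without this scaling is exactly the
    // factor-1_000_000 error the fix addresses.
    (max_ns - min_ns) / interval_ns + 1
}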
trinity-1686a
0241a05b90 add support for exists query syntax in query parser (#2170)
* add support for exists query syntax in query parser

* rustfmt

* make Exists require a field
2023-09-19 11:10:39 +02:00
PSeitz
e125f3b041 fix test (#2178) 2023-09-19 08:21:50 +02:00
PSeitz
c520ac46fc add support for date in term agg (#2172)
support DateTime in TermsAggregation
Format dates with Rfc3339
2023-09-14 09:22:18 +02:00
PSeitz
2d7390341c increase min memory to 15MB for indexing (#2176)
With tantivy 0.20 the minimum memory consumption per SegmentWriter increased to
12MB. 7MB of that is for the different fast field collector types (they could be
created lazily). Increase the minimum memory from 3MB to 15MB.

Change the memory variable naming from arena to budget.

closes #2156
2023-09-13 07:38:34 +02:00
dependabot[bot]
03fcdce016 Bump actions/checkout from 3 to 4 (#2171)
Bumps [actions/checkout](https://github.com/actions/checkout) from 3 to 4.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v3...v4)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-09-11 10:47:33 +02:00
Ping Xia
e4e416ac42 extend FuzzyTermQuery to support json field (#2173)
* extend fuzzy search for json field

* comments

* comments

* fmt fix

* comments
2023-09-11 05:59:40 +02:00
Igor Motov
19325132b7 Fast-field based implementation of ExistsQuery (#2160)
Adds an implementation of ExistsQuery that takes advantage of fast fields.

Fixes #2159
2023-09-07 11:51:49 +09:00
Paul Masurel
389d36f760 Added comments 2023-09-04 11:06:56 +09:00
PSeitz
49448b31c6 chore: Release (#2168)
* chore: Release

* update CHANGELOG
2023-09-01 13:58:58 +02:00
PSeitz
ebede0bed7 update CHANGELOG (#2167) 2023-08-31 10:01:44 +02:00
PSeitz
b1d8b072db add missing aggregation part 2 (#2149)
* add missing aggregation part 2

Add missing support for:
- Mixed types columns
- Key of type string on numerical fields

The special aggregation is slower than the integrated one in TermsAggregation and therefore not
chosen by default, although it can cover all use cases.

* simplify, add num_docs to empty
2023-08-31 07:55:33 +02:00
ethever.eth
ee6a7c2bbb fix a small typo (#2165)
Co-authored-by: famouscat <onismaa@gmail.com>
2023-08-30 20:14:26 +02:00
PSeitz
c4e2708901 fix clippy, fmt (#2162) 2023-08-30 08:04:26 +02:00
PSeitz
5c8cfa50eb add missing parameter for percentiles (#2157) 2023-08-29 13:04:24 +02:00
PSeitz
73cb71762f add missing parameter for stats,min,max,count,sum,avg (#2151)
* add missing parameter for stats,min,max,count,sum,avg

add missing parameter for stats,min,max,count,sum,avg
closes #1913
partially #1789

* Apply suggestions from code review

Co-authored-by: Paul Masurel <paul@quickwit.io>

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2023-08-28 08:59:51 +02:00
Harrison Burt
267dfe58d7 Fix testing on windows (#2155)
* Fix missing trait imports

* Fix building tests on windows

* Revert other PR change
2023-08-27 09:20:44 +09:00
Harrison Burt
131c10d318 Fix missing trait imports (#2154) 2023-08-27 09:20:26 +09:00
Chris Tam
e6cacc40a9 Remove outdated fast field documentation (#2145) 2023-08-24 07:49:49 +02:00
PSeitz
48d4847b38 Improve aggregation error message (#2150)
* Improve aggregation error message

Improve the aggregation error message by wrapping the deserialization with a
custom struct. This deserialization variant is slower, since we need to
keep the deserialized data around twice with this approach.
For now the list of valid variants is updated manually. This could be
replaced with a proc macro.
closes #2143

* Simpler implementation

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2023-08-23 20:52:15 +02:00
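A hedged sketch of the pattern described in the commit above (hypothetical types, not tantivy's real aggregation structs): deserialize into a generic serde_json::Value first and keep it around, so a failure can report the offending input and the list of valid variants. The duplicate copy of the data is the cost mentioned in the message.

use serde::Deserialize;

#[derive(Deserialize)]
#[serde(rename_all = "snake_case")]
enum AggregationSketch {
    Terms(serde_json::Value),
    Histogram(serde_json::Value),
    Avg(serde_json::Value),
}

// Maintained by hand for now; the commit notes this could become a proc macro.
const VALID_VARIANTS: &[&str] = &["terms", "histogram", "avg"];

fn parse_aggregation(raw: &serde_json::Value) -> Result<AggregationSketch, String> {
    AggregationSketch::deserialize(raw.clone()).map_err(|err| {
        format!(
            "invalid aggregation {}: {}. Valid variants are: {:?}",
            raw, err, VALID_VARIANTS
        )
    })
}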
PSeitz
59460c767f delayed column opening during merge (#2132)
* lazy columnar merge

This is the first part of addressing #3633
Instead of loading all Column into memory for the merge, only the current column_name
group is loaded. This can be done since the sstable streams the columns lexicographically.

* refactor

* add rustdoc

* replace iterator with BTreeMap
2023-08-21 08:55:35 +02:00
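A minimal sketch of the grouping idea above (hypothetical signature, not the actual merger): because the columns arrive sorted by name, the merger can collect one column-name group at a time and merge it before moving on, instead of opening every column up front.

fn merge_columns<R>(
    // (column_name, reader) pairs, already merged into one lexicographic stream.
    sorted_columns: impl Iterator<Item = (String, R)>,
    mut merge_group: impl FnMut(&str, Vec<R>),
) {
    let mut current_name: Option<String> = None;
    let mut current_group: Vec<R> = Vec::new();
    for (name, reader) in sorted_columns {
        if current_name.as_deref() != Some(name.as_str()) {
            // A new column name starts: flush the previous group.
            if let Some(done_name) = current_name.take() {
                merge_group(&done_name, std::mem::take(&mut current_group));
            }
            current_name = Some(name);
        }
        current_group.push(reader);
    }
    if let Some(done_name) = current_name {
        merge_group(&done_name, current_group);
    }
}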
Paul Masurel
756156beaf Fix doc 2023-08-17 17:47:45 +09:00
PSeitz
480763db0d track memory arena memory usage (#2148) 2023-08-16 18:19:42 +02:00
PSeitz
62ece86f24 track ff dictionary indexing memory consumption (#2147) 2023-08-16 14:00:08 +02:00
Caleb Hattingh
52d9e6f298 Fix doc typos in count aggregation metric (#2127) 2023-08-15 08:50:23 +02:00
Caleb Hattingh
47b315ff18 doc: escape the backslash (#2144) 2023-08-14 19:10:07 +02:00
PSeitz
ed1deee902 fix sort index by date (#2124)
closes #2112
2023-08-14 17:36:52 +02:00
PSeitz
2e109018b7 add missing parameter to term agg (#2103)
* add missing parameter to term agg

* move missing handling to block accessor

* add multivalue test, fix multivalue case, add comments

* add documentation, deactivate special case

* cargo fmt

* resolve merge conflict
2023-08-14 14:22:18 +02:00
Adam Reichold
22c35b1e00 Fix explanation of boost queries seeking beyond query result. (#2142)
* Make current nightly Clippy happy.

* Fix explanation of boost queries seeking beyond query result.
2023-08-14 11:59:11 +09:00
trinity-1686a
b92082b748 implement lenient parser (#2129)
* move query parser to nom

* add support for term grouping

* initial work on infallible parser

* fmt

* add tests and fix minor parsing bugs

* address review comments

* add support for lenient queries in tantivy

* make lenient parser report errors

* allow mixing occur and bool in query
2023-08-08 15:41:29 +02:00
PSeitz
c2be6603a2 alternative mixed field aggregation collection (#2135)
* alternative mixed field aggregation collection

instead of having multiple accessors in one AggregationWithAccessor, split it into
multiple independent AggregationWithAccessor instances

* Update src/aggregation/agg_req_with_accessor.rs

Co-authored-by: Paul Masurel <paul@quickwit.io>

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2023-07-27 12:25:31 +02:00
Adam Reichold
c805f08ca7 Fix a few more upcoming Clippy lints (#2133) 2023-07-24 17:07:57 +09:00
Adam Reichold
ccc0335158 Minor improvements to OwnedBytes (#2134)
This makes it obvious where the `StableDerefTrait` is invoked and avoids
`transmute` when only a lifetime needs to be extended. Furthermore, it makes use
of `slice::split_at` where that seemed appropriate.
2023-07-24 17:06:33 +09:00
Adam Reichold
42acd334f4 Fixes the new deny-by-default incorrect_partial_ord_impl_on_ord_type Clippy lint (#2131) 2023-07-21 11:36:17 +09:00
Adam Reichold
820f126075 Remove support for Brotli and Snappy compression (#2123)
LZ4 provides fast and simple compression whereas Zstd is exceptionally flexible
so that the additional support for Brotli and Snappy does not really add
any distinct functionality on top of those two algorithms.

Removing them reduces our maintenance burden and reduces the number of choices
users have to make when setting up their project based on Tantivy.
2023-07-14 16:54:59 +09:00
Adam Reichold
7e6c4a1856 Include only built-in compression algorithms as enum variants (#2121)
* Include only built-in compression algorithms as enum variants

This enables compile-time errors when a compression algorithm is requested which
is not actually enabled for the current Cargo project. The cost is that indexes
using other compression algorithms cannot even be loaded (even though they
are not fully accessible in any case).

As a drive-by, this also fixes `--no-default-features` on `cfg(unix)`.

* Provide more instructive error messages for unsupported, but not unknown compression variants.
2023-07-14 11:02:49 +09:00
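A hedged sketch of the approach above (not tantivy's exact enum or error type): only compile the variants whose Cargo features are enabled, so requesting a disabled algorithm is a compile-time error, while loading an index that uses a disabled algorithm yields an instructive runtime message.

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Compressor {
    None,
    #[cfg(feature = "lz4-compression")]
    Lz4,
    #[cfg(feature = "zstd-compression")]
    Zstd,
}

impl Compressor {
    pub fn from_id(id: u8) -> Result<Compressor, String> {
        match id {
            0 => Ok(Compressor::None),
            #[cfg(feature = "lz4-compression")]
            1 => Ok(Compressor::Lz4),
            #[cfg(not(feature = "lz4-compression"))]
            1 => Err("index uses lz4 but the `lz4-compression` feature is disabled".to_string()),
            #[cfg(feature = "zstd-compression")]
            2 => Ok(Compressor::Zstd),
            #[cfg(not(feature = "zstd-compression"))]
            2 => Err("index uses zstd but the `zstd-compression` feature is disabled".to_string()),
            _ => Err(format!("unknown compressor id {id}")),
        }
    }
}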
Adam Reichold
5fafe4b1ab Add missing query_terms impl for TermSetQuery. (#2120) 2023-07-13 14:54:29 +02:00
225 changed files with 13778 additions and 4471 deletions

View File

@@ -3,8 +3,6 @@ name: Coverage
on:
push:
branches: [main]
pull_request:
branches: [main]
# Ensures that we cancel running jobs for the same PR / same workflow.
concurrency:
@@ -15,13 +13,13 @@ jobs:
coverage:
runs-on: ubuntu-latest
steps:
-- uses: actions/checkout@v3
+- uses: actions/checkout@v4
- name: Install Rust
-run: rustup toolchain install nightly --profile minimal --component llvm-tools-preview
+run: rustup toolchain install nightly-2023-09-10 --profile minimal --component llvm-tools-preview
- uses: Swatinem/rust-cache@v2
- uses: taiki-e/install-action@cargo-llvm-cov
- name: Generate code coverage
-run: cargo +nightly llvm-cov --all-features --workspace --doctests --lcov --output-path lcov.info
+run: cargo +nightly-2023-09-10 llvm-cov --all-features --workspace --doctests --lcov --output-path lcov.info
- name: Upload coverage to Codecov
uses: codecov/codecov-action@v3
continue-on-error: true

View File

@@ -19,7 +19,7 @@ jobs:
runs-on: ubuntu-latest
steps:
-- uses: actions/checkout@v3
+- uses: actions/checkout@v4
- name: Install stable
uses: actions-rs/toolchain@v1
with:

View File

@@ -20,7 +20,7 @@ jobs:
runs-on: ubuntu-latest
steps:
-- uses: actions/checkout@v3
+- uses: actions/checkout@v4
- name: Install nightly
uses: actions-rs/toolchain@v1
@@ -39,6 +39,13 @@
- name: Check Formatting
run: cargo +nightly fmt --all -- --check
- name: Check Stable Compilation
run: cargo build --all-features
- name: Check Bench Compilation
run: cargo +nightly bench --no-run --profile=dev --all-features
- uses: actions-rs/clippy-check@v1
with:
@@ -53,14 +60,14 @@
strategy:
matrix:
features: [
-{ label: "all", flags: "mmap,stopwords,brotli-compression,lz4-compression,snappy-compression,zstd-compression,failpoints" },
+{ label: "all", flags: "mmap,stopwords,lz4-compression,zstd-compression,failpoints" },
{ label: "quickwit", flags: "mmap,quickwit,failpoints" }
]
name: test-${{ matrix.features.label}}
steps:
-- uses: actions/checkout@v3
+- uses: actions/checkout@v4
- name: Install stable
uses: actions-rs/toolchain@v1

View File

@@ -1,3 +1,36 @@
Tantivy 0.21.1
================================
#### Bugfixes
- Range queries on fast fields with fewer values on that field than documents had an invalid end condition, leading to missing results. [#2226](https://github.com/quickwit-oss/tantivy/issues/2226)(@appaquet @PSeitz)
- Increase the minimum memory budget from 3MB to 15MB to avoid single doc segments (API fix). [#2176](https://github.com/quickwit-oss/tantivy/issues/2176)(@PSeitz)
Tantivy 0.21
================================
#### Bugfixes
- Fix fast field memory consumption tracking, which led to higher memory consumption than the budget allowed during indexing [#2148](https://github.com/quickwit-oss/tantivy/issues/2148)[#2147](https://github.com/quickwit-oss/tantivy/issues/2147)(@PSeitz)
- Fix a regression from 0.20 where sort index by date wasn't working anymore [#2124](https://github.com/quickwit-oss/tantivy/issues/2124)(@PSeitz)
- Fix getting the root facet on the `FacetCollector`. [#2086](https://github.com/quickwit-oss/tantivy/issues/2086)(@adamreichold)
- Align numerical type priority order of columnar and query. [#2088](https://github.com/quickwit-oss/tantivy/issues/2088)(@fmassot)
#### Breaking Changes
- Remove support for Brotli and Snappy compression [#2123](https://github.com/quickwit-oss/tantivy/issues/2123)(@adamreichold)
#### Features/Improvements
- Implement lenient query parser [#2129](https://github.com/quickwit-oss/tantivy/pull/2129)(@trinity-1686a)
- order_by_u64_field and order_by_fast_field allow sorting in ascending and descending order [#2111](https://github.com/quickwit-oss/tantivy/issues/2111)(@naveenann)
- Allow dynamic filters in text analyzer builder [#2110](https://github.com/quickwit-oss/tantivy/issues/2110)(@fulmicoton @fmassot)
- **Aggregation**
- Add missing parameter for term aggregation [#2149](https://github.com/quickwit-oss/tantivy/issues/2149)[#2103](https://github.com/quickwit-oss/tantivy/issues/2103)(@PSeitz)
- Add missing parameter for percentiles [#2157](https://github.com/quickwit-oss/tantivy/issues/2157)(@PSeitz)
- Add missing parameter for stats,min,max,count,sum,avg [#2151](https://github.com/quickwit-oss/tantivy/issues/2151)(@PSeitz)
- Improve aggregation deserialization error message [#2150](https://github.com/quickwit-oss/tantivy/issues/2150)(@PSeitz)
- Add validation for type Bytes to term_agg [#2077](https://github.com/quickwit-oss/tantivy/issues/2077)(@PSeitz)
- Alternative mixed field collection [#2135](https://github.com/quickwit-oss/tantivy/issues/2135)(@PSeitz)
- Add missing query_terms impl for TermSetQuery. [#2120](https://github.com/quickwit-oss/tantivy/issues/2120)(@adamreichold)
- Minor improvements to OwnedBytes [#2134](https://github.com/quickwit-oss/tantivy/issues/2134)(@adamreichold)
- Remove allocations in split compound words [#2080](https://github.com/quickwit-oss/tantivy/issues/2080)(@PSeitz)
- Ngram tokenizer now returns an error with invalid arguments [#2102](https://github.com/quickwit-oss/tantivy/issues/2102)(@fmassot)
- Make TextAnalyzerBuilder public [#2097](https://github.com/quickwit-oss/tantivy/issues/2097)(@adamreichold)
- Return an error when tokenizer is not found while indexing [#2093](https://github.com/quickwit-oss/tantivy/issues/2093)(@naveenann)
- Delayed column opening during merge [#2132](https://github.com/quickwit-oss/tantivy/issues/2132)(@PSeitz)
Tantivy 0.20.2
================================

View File

@@ -1,6 +1,6 @@
[package]
name = "tantivy"
-version = "0.20.2"
+version = "0.22.0-dev"
authors = ["Paul Masurel <paul.masurel@gmail.com>"]
license = "MIT"
categories = ["database-implementations", "data-structures"]
@@ -22,49 +22,46 @@ crc32fast = "1.3.2"
once_cell = "1.10.0"
regex = { version = "1.5.5", default-features = false, features = ["std", "unicode"] }
aho-corasick = "1.0"
-tantivy-fst = "0.4.0"
+tantivy-fst = "0.5"
-memmap2 = { version = "0.7.1", optional = true }
+memmap2 = { version = "0.9.0", optional = true }
lz4_flex = { version = "0.11", default-features = false, optional = true }
-brotli = { version = "3.3.4", optional = true }
-zstd = { version = "0.12", optional = true, default-features = false }
-snap = { version = "1.0.5", optional = true }
+zstd = { version = "0.13", optional = true, default-features = false }
tempfile = { version = "3.3.0", optional = true }
log = "0.4.16"
serde = { version = "1.0.136", features = ["derive"] }
serde_json = "1.0.79"
num_cpus = "1.13.1"
-fs4 = { version = "0.6.3", optional = true }
+fs4 = { version = "0.7.0", optional = true }
levenshtein_automata = "0.2.1"
uuid = { version = "1.0.0", features = ["v4", "serde"] }
crossbeam-channel = "0.5.4"
rust-stemmers = "1.2.0"
downcast-rs = "1.2.0"
-bitpacking = { version = "0.8.4", default-features = false, features = ["bitpacker4x"] }
+bitpacking = { version = "0.9.2", default-features = false, features = ["bitpacker4x"] }
census = "0.4.0"
rustc-hash = "1.1.0"
thiserror = "1.0.30"
htmlescape = "0.3.1"
fail = { version = "0.5.0", optional = true }
murmurhash32 = "0.3.0"
time = { version = "0.3.10", features = ["serde-well-known"] }
smallvec = "1.8.0"
rayon = "1.5.2"
-lru = "0.11.0"
+lru = "0.12.0"
fastdivide = "0.4.0"
-itertools = "0.11.0"
+itertools = "0.12.0"
measure_time = "0.8.2"
async-trait = "0.1.53"
arc-swap = "1.5.0"
-columnar = { version= "0.1", path="./columnar", package ="tantivy-columnar" }
+columnar = { version= "0.2", path="./columnar", package ="tantivy-columnar" }
-sstable = { version= "0.1", path="./sstable", package ="tantivy-sstable", optional = true }
+sstable = { version= "0.2", path="./sstable", package ="tantivy-sstable", optional = true }
-stacker = { version= "0.1", path="./stacker", package ="tantivy-stacker" }
+stacker = { version= "0.2", path="./stacker", package ="tantivy-stacker" }
-query-grammar = { version= "0.20.0", path="./query-grammar", package = "tantivy-query-grammar" }
+query-grammar = { version= "0.21.0", path="./query-grammar", package = "tantivy-query-grammar" }
-tantivy-bitpacker = { version= "0.4", path="./bitpacker" }
+tantivy-bitpacker = { version= "0.5", path="./bitpacker" }
-common = { version= "0.5", path = "./common/", package = "tantivy-common" }
+common = { version= "0.6", path = "./common/", package = "tantivy-common" }
-tokenizer-api = { version= "0.1", path="./tokenizer-api", package="tantivy-tokenizer-api" }
+tokenizer-api = { version= "0.2", path="./tokenizer-api", package="tantivy-tokenizer-api" }
sketches-ddsketch = { version = "0.2.1", features = ["use_serde"] }
futures-util = { version = "0.3.28", optional = true }
fnv = "1.0.7"
[target.'cfg(windows)'.dependencies]
winapi = "0.3.9"
@@ -75,15 +72,15 @@ maplit = "1.0.2"
matches = "0.1.9"
pretty_assertions = "1.2.1"
proptest = "1.0.0"
-criterion = "0.5"
test-log = "0.2.10"
env_logger = "0.10.0"
-pprof = { git = "https://github.com/PSeitz/pprof-rs/", rev = "53af24b", features = ["flamegraph", "criterion"] } # temp fork that works with criterion 0.5
futures = "0.3.21"
paste = "1.0.11"
more-asserts = "0.3.1"
rand_distr = "0.4.3"
+[target.'cfg(not(windows))'.dev-dependencies]
+criterion = { version = "0.5", default-features = false }
[dev-dependencies.fail]
version = "0.5.0"
features = ["failpoints"]
@@ -107,9 +104,7 @@ default = ["mmap", "stopwords", "lz4-compression"]
mmap = ["fs4", "tempfile", "memmap2"]
stopwords = []
-brotli-compression = ["brotli"]
lz4-compression = ["lz4_flex"]
-snappy-compression = ["snap"]
zstd-compression = ["zstd"]
failpoints = ["fail", "fail/failpoints"]
@@ -117,6 +112,11 @@ unstable = [] # useful for benches.
quickwit = ["sstable", "futures-util"]
+# Compares only the hash of a string when indexing data.
+# Increases indexing speed, but may lead to extremely rare missing terms, when there's a hash collision.
+# Uses 64bit ahash.
+compare_hash_only = ["stacker/compare_hash_only"]
[workspace]
members = ["query-grammar", "bitpacker", "common", "ownedbytes", "stacker", "sstable", "tokenizer-api", "columnar"]
@@ -130,7 +130,7 @@ members = ["query-grammar", "bitpacker", "common", "ownedbytes", "stacker", "sst
[[test]]
name = "failpoints"
path = "tests/failpoints/mod.rs"
-required-features = ["fail/failpoints"]
+required-features = ["failpoints"]
[[bench]]
name = "analyzer"

View File

@@ -44,7 +44,7 @@ Details about the benchmark can be found at this [repository](https://github.com
- Single valued and multivalued u64, i64, and f64 fast fields (equivalent of doc values in Lucene)
- `&[u8]` fast fields
- Text, i64, u64, f64, dates, ip, bool, and hierarchical facet fields
-- Compressed document store (LZ4, Zstd, None, Brotli, Snap)
+- Compressed document store (LZ4, Zstd, None)
- Range queries
- Faceted search
- Configurable indexing (optional term frequency and position indexing)

View File

@@ -1,14 +1,99 @@
-use criterion::{criterion_group, criterion_main, Criterion, Throughput};
-use pprof::criterion::{Output, PProfProfiler};
-use tantivy::schema::{FAST, INDEXED, STORED, STRING, TEXT};
-use tantivy::Index;
+use criterion::{criterion_group, criterion_main, BatchSize, Bencher, Criterion, Throughput};
+use tantivy::schema::{TantivyDocument, FAST, INDEXED, STORED, STRING, TEXT};
+use tantivy::{tokenizer, Index, IndexWriter};
const HDFS_LOGS: &str = include_str!("hdfs.json");
const GH_LOGS: &str = include_str!("gh.json");
const WIKI: &str = include_str!("wiki.json");
fn get_lines(input: &str) -> Vec<&str> { fn benchmark(
input.trim().split('\n').collect() b: &mut Bencher,
input: &str,
schema: tantivy::schema::Schema,
commit: bool,
parse_json: bool,
is_dynamic: bool,
) {
if is_dynamic {
benchmark_dynamic_json(b, input, schema, commit, parse_json)
} else {
_benchmark(b, input, schema, commit, parse_json, |schema, doc_json| {
TantivyDocument::parse_json(&schema, doc_json).unwrap()
})
}
}
fn get_index(schema: tantivy::schema::Schema) -> Index {
let mut index = Index::create_in_ram(schema.clone());
let ff_tokenizer_manager = tokenizer::TokenizerManager::default();
ff_tokenizer_manager.register(
"raw",
tokenizer::TextAnalyzer::builder(tokenizer::RawTokenizer::default())
.filter(tokenizer::RemoveLongFilter::limit(255))
.build(),
);
index.set_fast_field_tokenizers(ff_tokenizer_manager.clone());
index
}
fn _benchmark(
b: &mut Bencher,
input: &str,
schema: tantivy::schema::Schema,
commit: bool,
include_json_parsing: bool,
create_doc: impl Fn(&tantivy::schema::Schema, &str) -> TantivyDocument,
) {
if include_json_parsing {
let lines: Vec<&str> = input.trim().split('\n').collect();
b.iter(|| {
let index = get_index(schema.clone());
let mut index_writer: IndexWriter =
index.writer_with_num_threads(1, 100_000_000).unwrap();
for doc_json in &lines {
let doc = create_doc(&schema, doc_json);
index_writer.add_document(doc).unwrap();
}
if commit {
index_writer.commit().unwrap();
}
})
} else {
let docs: Vec<_> = input
.trim()
.split('\n')
.map(|doc_json| create_doc(&schema, doc_json))
.collect();
b.iter_batched(
|| docs.clone(),
|docs| {
let index = get_index(schema.clone());
let mut index_writer: IndexWriter =
index.writer_with_num_threads(1, 100_000_000).unwrap();
for doc in docs {
index_writer.add_document(doc).unwrap();
}
if commit {
index_writer.commit().unwrap();
}
},
BatchSize::SmallInput,
)
}
}
fn benchmark_dynamic_json(
b: &mut Bencher,
input: &str,
schema: tantivy::schema::Schema,
commit: bool,
parse_json: bool,
) {
let json_field = schema.get_field("json").unwrap();
_benchmark(b, input, schema, commit, parse_json, |_schema, doc_json| {
let json_val: serde_json::Map<String, serde_json::Value> =
serde_json::from_str(doc_json).unwrap();
tantivy::doc!(json_field=>json_val)
})
} }
pub fn hdfs_index_benchmark(c: &mut Criterion) { pub fn hdfs_index_benchmark(c: &mut Criterion) {
@@ -19,7 +104,14 @@ pub fn hdfs_index_benchmark(c: &mut Criterion) {
schema_builder.add_text_field("severity", STRING); schema_builder.add_text_field("severity", STRING);
schema_builder.build() schema_builder.build()
}; };
let schema_with_store = { let schema_only_fast = {
let mut schema_builder = tantivy::schema::SchemaBuilder::new();
schema_builder.add_u64_field("timestamp", FAST);
schema_builder.add_text_field("body", FAST);
schema_builder.add_text_field("severity", FAST);
schema_builder.build()
};
let _schema_with_store = {
let mut schema_builder = tantivy::schema::SchemaBuilder::new(); let mut schema_builder = tantivy::schema::SchemaBuilder::new();
schema_builder.add_u64_field("timestamp", INDEXED | STORED); schema_builder.add_u64_field("timestamp", INDEXED | STORED);
schema_builder.add_text_field("body", TEXT | STORED); schema_builder.add_text_field("body", TEXT | STORED);
@@ -28,74 +120,39 @@ pub fn hdfs_index_benchmark(c: &mut Criterion) {
}; };
let dynamic_schema = { let dynamic_schema = {
let mut schema_builder = tantivy::schema::SchemaBuilder::new(); let mut schema_builder = tantivy::schema::SchemaBuilder::new();
schema_builder.add_json_field("json", TEXT); schema_builder.add_json_field("json", TEXT | FAST);
schema_builder.build() schema_builder.build()
}; };
let mut group = c.benchmark_group("index-hdfs"); let mut group = c.benchmark_group("index-hdfs");
group.throughput(Throughput::Bytes(HDFS_LOGS.len() as u64)); group.throughput(Throughput::Bytes(HDFS_LOGS.len() as u64));
group.sample_size(20); group.sample_size(20);
group.bench_function("index-hdfs-no-commit", |b| {
let lines = get_lines(HDFS_LOGS); let benches = [
b.iter(|| { ("only-indexed-".to_string(), schema, false),
let index = Index::create_in_ram(schema.clone()); //("stored-".to_string(), _schema_with_store, false),
let index_writer = index.writer_with_num_threads(1, 100_000_000).unwrap(); ("only-fast-".to_string(), schema_only_fast, false),
for doc_json in &lines { ("dynamic-".to_string(), dynamic_schema, true),
let doc = schema.parse_document(doc_json).unwrap(); ];
index_writer.add_document(doc).unwrap();
for (prefix, schema, is_dynamic) in benches {
for commit in [false, true] {
let suffix = if commit { "with-commit" } else { "no-commit" };
for parse_json in [false] {
// for parse_json in [false, true] {
let suffix = if parse_json {
format!("{}-with-json-parsing", suffix)
} else {
format!("{}", suffix)
};
let bench_name = format!("{}{}", prefix, suffix);
group.bench_function(bench_name, |b| {
benchmark(b, HDFS_LOGS, schema.clone(), commit, parse_json, is_dynamic)
});
} }
}) }
}); }
group.bench_function("index-hdfs-with-commit", |b| {
let lines = get_lines(HDFS_LOGS);
b.iter(|| {
let index = Index::create_in_ram(schema.clone());
let mut index_writer = index.writer_with_num_threads(1, 100_000_000).unwrap();
for doc_json in &lines {
let doc = schema.parse_document(doc_json).unwrap();
index_writer.add_document(doc).unwrap();
}
index_writer.commit().unwrap();
})
});
group.bench_function("index-hdfs-no-commit-with-docstore", |b| {
let lines = get_lines(HDFS_LOGS);
b.iter(|| {
let index = Index::create_in_ram(schema_with_store.clone());
let index_writer = index.writer_with_num_threads(1, 100_000_000).unwrap();
for doc_json in &lines {
let doc = schema.parse_document(doc_json).unwrap();
index_writer.add_document(doc).unwrap();
}
})
});
group.bench_function("index-hdfs-with-commit-with-docstore", |b| {
let lines = get_lines(HDFS_LOGS);
b.iter(|| {
let index = Index::create_in_ram(schema_with_store.clone());
let mut index_writer = index.writer_with_num_threads(1, 100_000_000).unwrap();
for doc_json in &lines {
let doc = schema.parse_document(doc_json).unwrap();
index_writer.add_document(doc).unwrap();
}
index_writer.commit().unwrap();
})
});
group.bench_function("index-hdfs-no-commit-json-without-docstore", |b| {
let lines = get_lines(HDFS_LOGS);
b.iter(|| {
let index = Index::create_in_ram(dynamic_schema.clone());
let json_field = dynamic_schema.get_field("json").unwrap();
let mut index_writer = index.writer_with_num_threads(1, 100_000_000).unwrap();
for doc_json in &lines {
let json_val: serde_json::Map<String, serde_json::Value> =
serde_json::from_str(doc_json).unwrap();
let doc = tantivy::doc!(json_field=>json_val);
index_writer.add_document(doc).unwrap();
}
index_writer.commit().unwrap();
})
});
} }
pub fn gh_index_benchmark(c: &mut Criterion) { pub fn gh_index_benchmark(c: &mut Criterion) {
@@ -104,38 +161,24 @@ pub fn gh_index_benchmark(c: &mut Criterion) {
schema_builder.add_json_field("json", TEXT | FAST); schema_builder.add_json_field("json", TEXT | FAST);
schema_builder.build() schema_builder.build()
}; };
let dynamic_schema_fast = {
let mut schema_builder = tantivy::schema::SchemaBuilder::new();
schema_builder.add_json_field("json", FAST);
schema_builder.build()
};
let mut group = c.benchmark_group("index-gh"); let mut group = c.benchmark_group("index-gh");
group.throughput(Throughput::Bytes(GH_LOGS.len() as u64)); group.throughput(Throughput::Bytes(GH_LOGS.len() as u64));
group.bench_function("index-gh-no-commit", |b| { group.bench_function("index-gh-no-commit", |b| {
let lines = get_lines(GH_LOGS); benchmark_dynamic_json(b, GH_LOGS, dynamic_schema.clone(), false, false)
b.iter(|| {
let json_field = dynamic_schema.get_field("json").unwrap();
let index = Index::create_in_ram(dynamic_schema.clone());
let index_writer = index.writer_with_num_threads(1, 100_000_000).unwrap();
for doc_json in &lines {
let json_val: serde_json::Map<String, serde_json::Value> =
serde_json::from_str(doc_json).unwrap();
let doc = tantivy::doc!(json_field=>json_val);
index_writer.add_document(doc).unwrap();
}
})
}); });
group.bench_function("index-gh-with-commit", |b| { group.bench_function("index-gh-fast", |b| {
let lines = get_lines(GH_LOGS); benchmark_dynamic_json(b, GH_LOGS, dynamic_schema_fast.clone(), false, false)
b.iter(|| { });
let json_field = dynamic_schema.get_field("json").unwrap();
let index = Index::create_in_ram(dynamic_schema.clone()); group.bench_function("index-gh-fast-with-commit", |b| {
let mut index_writer = index.writer_with_num_threads(1, 100_000_000).unwrap(); benchmark_dynamic_json(b, GH_LOGS, dynamic_schema_fast.clone(), true, false)
for doc_json in &lines {
let json_val: serde_json::Map<String, serde_json::Value> =
serde_json::from_str(doc_json).unwrap();
let doc = tantivy::doc!(json_field=>json_val);
index_writer.add_document(doc).unwrap();
}
index_writer.commit().unwrap();
})
}); });
} }
@@ -150,33 +193,10 @@ pub fn wiki_index_benchmark(c: &mut Criterion) {
group.throughput(Throughput::Bytes(WIKI.len() as u64)); group.throughput(Throughput::Bytes(WIKI.len() as u64));
group.bench_function("index-wiki-no-commit", |b| { group.bench_function("index-wiki-no-commit", |b| {
let lines = get_lines(WIKI); benchmark_dynamic_json(b, WIKI, dynamic_schema.clone(), false, false)
b.iter(|| {
let json_field = dynamic_schema.get_field("json").unwrap();
let index = Index::create_in_ram(dynamic_schema.clone());
let index_writer = index.writer_with_num_threads(1, 100_000_000).unwrap();
for doc_json in &lines {
let json_val: serde_json::Map<String, serde_json::Value> =
serde_json::from_str(doc_json).unwrap();
let doc = tantivy::doc!(json_field=>json_val);
index_writer.add_document(doc).unwrap();
}
})
}); });
group.bench_function("index-wiki-with-commit", |b| { group.bench_function("index-wiki-with-commit", |b| {
let lines = get_lines(WIKI); benchmark_dynamic_json(b, WIKI, dynamic_schema.clone(), true, false)
b.iter(|| {
let json_field = dynamic_schema.get_field("json").unwrap();
let index = Index::create_in_ram(dynamic_schema.clone());
let mut index_writer = index.writer_with_num_threads(1, 100_000_000).unwrap();
for doc_json in &lines {
let json_val: serde_json::Map<String, serde_json::Value> =
serde_json::from_str(doc_json).unwrap();
let doc = tantivy::doc!(json_field=>json_val);
index_writer.add_document(doc).unwrap();
}
index_writer.commit().unwrap();
})
}); });
} }
@@ -187,12 +207,12 @@ criterion_group! {
}
criterion_group! {
name = gh_benches;
-config = Criterion::default().with_profiler(PProfProfiler::new(100, Output::Flamegraph(None)));
+config = Criterion::default();
targets = gh_index_benchmark
}
criterion_group! {
name = wiki_benches;
-config = Criterion::default().with_profiler(PProfProfiler::new(100, Output::Flamegraph(None)));
+config = Criterion::default();
targets = wiki_index_benchmark
}
criterion_main!(benches, gh_benches, wiki_benches);

View File

@@ -1,6 +1,6 @@
[package]
name = "tantivy-bitpacker"
-version = "0.4.0"
+version = "0.5.0"
edition = "2021"
authors = ["Paul Masurel <paul.masurel@gmail.com>"]
license = "MIT"
@@ -15,7 +15,7 @@ homepage = "https://github.com/quickwit-oss/tantivy"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
-bitpacking = {version="0.8", default-features=false, features = ["bitpacker1x"]}
+bitpacking = { version = "0.9.2", default-features = false, features = ["bitpacker1x"] }
[dev-dependencies]
rand = "0.8"

View File

@@ -367,7 +367,7 @@ mod test {
let mut output: Vec<u32> = Vec::new();
for len in [0, 1, 2, 32, 33, 34, 64] {
for start_idx in 0u32..32u32 {
-output.resize(len as usize, 0);
+output.resize(len, 0);
bitunpacker.get_batch_u32s(start_idx, &buffer, &mut output);
for i in 0..len {
let expected = (start_idx + i as u32) & mask;

View File

@@ -64,10 +64,8 @@ fn mem_usage<T>(items: &Vec<T>) -> usize {
impl BlockedBitpacker {
pub fn new() -> Self {
-let mut compressed_blocks = vec![];
-compressed_blocks.resize(8, 0);
Self {
-compressed_blocks,
+compressed_blocks: vec![0; 8],
buffer: vec![],
offset_and_bits: vec![],
}

View File

@@ -32,6 +32,7 @@ postprocessors = [
{ pattern = 'Michael Kleen', replace = "mkleen"}, # replace with github user
{ pattern = 'Adrien Guillo', replace = "guilload"}, # replace with github user
{ pattern = 'François Massot', replace = "fmassot"}, # replace with github user
+{ pattern = 'Naveen Aiathurai', replace = "naveenann"}, # replace with github user
{ pattern = '', replace = ""}, # replace with github user
]

View File

@@ -1,6 +1,6 @@
[package]
name = "tantivy-columnar"
-version = "0.1.0"
+version = "0.2.0"
edition = "2021"
license = "MIT"
homepage = "https://github.com/quickwit-oss/tantivy"
@@ -9,14 +9,13 @@ description = "column oriented storage for tantivy"
categories = ["database-implementations", "data-structures", "compression"]
[dependencies]
-itertools = "0.11.0"
+itertools = "0.12.0"
-fnv = "1.0.7"
fastdivide = "0.4.0"
-stacker = { version= "0.1", path = "../stacker", package="tantivy-stacker"}
+stacker = { version= "0.2", path = "../stacker", package="tantivy-stacker"}
-sstable = { version= "0.1", path = "../sstable", package = "tantivy-sstable" }
+sstable = { version= "0.2", path = "../sstable", package = "tantivy-sstable" }
-common = { version= "0.5", path = "../common", package = "tantivy-common" }
+common = { version= "0.6", path = "../common", package = "tantivy-common" }
-tantivy-bitpacker = { version= "0.4", path = "../bitpacker/" }
+tantivy-bitpacker = { version= "0.5", path = "../bitpacker/" }
serde = "1.0.152"
[dev-dependencies]

View File

@@ -8,7 +8,6 @@ license = "MIT"
columnar = {path="../", package="tantivy-columnar"} columnar = {path="../", package="tantivy-columnar"}
serde_json = "1" serde_json = "1"
serde_json_borrow = {git="https://github.com/PSeitz/serde_json_borrow/"} serde_json_borrow = {git="https://github.com/PSeitz/serde_json_borrow/"}
serde = "1"
[workspace] [workspace]
members = [] members = []

View File

@@ -1,9 +1,12 @@
use std::cmp::Ordering;
use crate::{Column, DocId, RowId};
#[derive(Debug, Default, Clone)]
pub struct ColumnBlockAccessor<T> {
val_cache: Vec<T>,
docid_cache: Vec<DocId>,
missing_docids_cache: Vec<DocId>,
row_id_cache: Vec<RowId>,
}
@@ -20,6 +23,20 @@ impl<T: PartialOrd + Copy + std::fmt::Debug + Send + Sync + 'static + Default>
.values
.get_vals(&self.row_id_cache, &mut self.val_cache);
}
#[inline]
pub fn fetch_block_with_missing(&mut self, docs: &[u32], accessor: &Column<T>, missing: T) {
self.fetch_block(docs, accessor);
// We can compare docid_cache with docs to find missing docs
if docs.len() != self.docid_cache.len() || accessor.index.is_multivalue() {
self.missing_docids_cache.clear();
find_missing_docs(docs, &self.docid_cache, |doc| {
self.missing_docids_cache.push(doc);
self.val_cache.push(missing);
});
self.docid_cache
.extend_from_slice(&self.missing_docids_cache);
}
}
#[inline]
pub fn iter_vals(&self) -> impl Iterator<Item = T> + '_ {
@@ -34,3 +51,82 @@ impl<T: PartialOrd + Copy + std::fmt::Debug + Send + Sync + 'static + Default>
.zip(self.val_cache.iter().cloned())
}
}
/// Given two sorted lists of docids `docs` and `hits`, hits is a subset of `docs`.
/// Return all docs that are not in `hits`.
fn find_missing_docs<F>(docs: &[u32], hits: &[u32], mut callback: F)
where F: FnMut(u32) {
let mut docs_iter = docs.iter();
let mut hits_iter = hits.iter();
let mut doc = docs_iter.next();
let mut hit = hits_iter.next();
while let (Some(&current_doc), Some(&current_hit)) = (doc, hit) {
match current_doc.cmp(&current_hit) {
Ordering::Less => {
callback(current_doc);
doc = docs_iter.next();
}
Ordering::Equal => {
doc = docs_iter.next();
hit = hits_iter.next();
}
Ordering::Greater => {
hit = hits_iter.next();
}
}
}
while let Some(&current_doc) = doc {
callback(current_doc);
doc = docs_iter.next();
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_find_missing_docs() {
let docs: Vec<u32> = vec![1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
let hits: Vec<u32> = vec![2, 4, 6, 8, 10];
let mut missing_docs: Vec<u32> = Vec::new();
find_missing_docs(&docs, &hits, |missing_doc| {
missing_docs.push(missing_doc);
});
assert_eq!(missing_docs, vec![1, 3, 5, 7, 9]);
}
#[test]
fn test_find_missing_docs_empty() {
let docs: Vec<u32> = Vec::new();
let hits: Vec<u32> = vec![2, 4, 6, 8, 10];
let mut missing_docs: Vec<u32> = Vec::new();
find_missing_docs(&docs, &hits, |missing_doc| {
missing_docs.push(missing_doc);
});
assert_eq!(missing_docs, vec![]);
}
#[test]
fn test_find_missing_docs_all_missing() {
let docs: Vec<u32> = vec![1, 2, 3, 4, 5];
let hits: Vec<u32> = Vec::new();
let mut missing_docs: Vec<u32> = Vec::new();
find_missing_docs(&docs, &hits, |missing_doc| {
missing_docs.push(missing_doc);
});
assert_eq!(missing_docs, vec![1, 2, 3, 4, 5]);
}
}

View File

@@ -30,6 +30,13 @@ impl fmt::Debug for BytesColumn {
}
impl BytesColumn {
pub fn empty(num_docs: u32) -> BytesColumn {
BytesColumn {
dictionary: Arc::new(Dictionary::empty()),
term_ord_column: Column::build_empty_column(num_docs),
}
}
/// Fills the given `output` buffer with the term associated to the ordinal `ord`.
///
/// Returns `false` if the term does not exist (e.g. `term_ord` is greater or equal to the
@@ -77,7 +84,7 @@ impl From<StrColumn> for BytesColumn {
}
impl StrColumn {
-pub(crate) fn wrap(bytes_column: BytesColumn) -> StrColumn {
+pub fn wrap(bytes_column: BytesColumn) -> StrColumn {
StrColumn(bytes_column)
}


@@ -130,7 +130,7 @@ impl<T: PartialOrd + Copy + Debug + Send + Sync + 'static> Column<T> {
.select_batch_in_place(selected_docid_range.start, doc_ids); .select_batch_in_place(selected_docid_range.start, doc_ids);
} }
/// Fils the output vector with the (possibly multiple values that are associated_with /// Fills the output vector with the (possibly multiple values that are associated_with
/// `row_id`. /// `row_id`.
/// ///
/// This method clears the `output` vector. /// This method clears the `output` vector.


@@ -1,3 +1,8 @@
//! # `column_index`
//!
//! `column_index` provides rank and select operations to associate positions when not all
//! documents have exactly one element.
mod merge; mod merge;
mod multivalued_index; mod multivalued_index;
mod optional_index; mod optional_index;
@@ -37,10 +42,14 @@ impl From<MultiValueIndex> for ColumnIndex {
} }
impl ColumnIndex { impl ColumnIndex {
// Returns the cardinality of the column index. #[inline]
// pub fn is_multivalue(&self) -> bool {
// By convention, if the column contains no docs, we consider that it is matches!(self, ColumnIndex::Multivalued(_))
// full. }
/// Returns the cardinality of the column index.
///
/// By convention, if the column contains no docs, we consider that it is
/// full.
#[inline] #[inline]
pub fn get_cardinality(&self) -> Cardinality { pub fn get_cardinality(&self) -> Cardinality {
match self { match self {


@@ -215,12 +215,12 @@ mod bench {
let vals: Vec<RowId> = (0..TOTAL_NUM_VALUES) let vals: Vec<RowId> = (0..TOTAL_NUM_VALUES)
.map(|_| rng.gen_bool(fill_ratio)) .map(|_| rng.gen_bool(fill_ratio))
.enumerate() .enumerate()
.filter(|(pos, val)| *val) .filter(|(_pos, val)| *val)
.map(|(pos, _)| pos as RowId) .map(|(pos, _)| pos as RowId)
.collect(); .collect();
serialize_optional_index(&&vals[..], TOTAL_NUM_VALUES, &mut out).unwrap(); serialize_optional_index(&&vals[..], TOTAL_NUM_VALUES, &mut out).unwrap();
let codec = open_optional_index(OwnedBytes::new(out)).unwrap();
codec open_optional_index(OwnedBytes::new(out)).unwrap()
} }
fn random_range_iterator( fn random_range_iterator(
@@ -242,7 +242,7 @@ mod bench {
} }
fn n_percent_step_iterator(percent: f32, num_values: u32) -> impl Iterator<Item = u32> { fn n_percent_step_iterator(percent: f32, num_values: u32) -> impl Iterator<Item = u32> {
let ratio = percent as f32 / 100.0; let ratio = percent / 100.0;
let step_size = (1f32 / ratio) as u32; let step_size = (1f32 / ratio) as u32;
let deviation = step_size - 1; let deviation = step_size - 1;
random_range_iterator(0, num_values, step_size, deviation) random_range_iterator(0, num_values, step_size, deviation)


@@ -30,6 +30,7 @@ impl<'a> SerializableColumnIndex<'a> {
} }
} }
/// Serialize a column index.
pub fn serialize_column_index( pub fn serialize_column_index(
column_index: SerializableColumnIndex, column_index: SerializableColumnIndex,
output: &mut impl Write, output: &mut impl Write,
@@ -51,6 +52,7 @@ pub fn serialize_column_index(
Ok(column_index_num_bytes) Ok(column_index_num_bytes)
} }
/// Open a serialized column index.
pub fn open_column_index(mut bytes: OwnedBytes) -> io::Result<ColumnIndex> { pub fn open_column_index(mut bytes: OwnedBytes) -> io::Result<ColumnIndex> {
if bytes.is_empty() { if bytes.is_empty() {
return Err(io::Error::new( return Err(io::Error::new(


@@ -38,6 +38,6 @@ impl Ord for BlankRange {
} }
impl PartialOrd for BlankRange { impl PartialOrd for BlankRange {
fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> { fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
Some(self.blank_size().cmp(&other.blank_size())) Some(self.cmp(other))
} }
} }
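The new body is the usual idiom when a type has a hand-written `Ord`: `partial_cmp` delegates to `cmp` so the two orderings cannot drift apart (this is also what clippy suggests for non-canonical `PartialOrd` impls). A standalone illustration with a made-up type, unrelated to tantivy:

    use std::cmp::Ordering;

    struct BlankLike {
        size: u32,
    }

    impl PartialEq for BlankLike {
        fn eq(&self, other: &Self) -> bool {
            self.size == other.size
        }
    }

    impl Eq for BlankLike {}

    impl Ord for BlankLike {
        fn cmp(&self, other: &Self) -> Ordering {
            self.size.cmp(&other.size)
        }
    }

    impl PartialOrd for BlankLike {
        // Delegate to `Ord` instead of duplicating the comparison logic.
        fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
            Some(self.cmp(other))
        }
    }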


@@ -2,7 +2,7 @@ mod merge_dict_column;
mod merge_mapping; mod merge_mapping;
mod term_merger; mod term_merger;
use std::collections::{BTreeMap, HashMap, HashSet}; use std::collections::{BTreeMap, HashSet};
use std::io; use std::io;
use std::net::Ipv6Addr; use std::net::Ipv6Addr;
use std::sync::Arc; use std::sync::Arc;
@@ -18,7 +18,8 @@ use crate::columnar::writer::CompatibleNumericalTypes;
use crate::columnar::ColumnarReader; use crate::columnar::ColumnarReader;
use crate::dynamic_column::DynamicColumn; use crate::dynamic_column::DynamicColumn;
use crate::{ use crate::{
BytesColumn, Column, ColumnIndex, ColumnType, ColumnValues, NumericalType, NumericalValue, BytesColumn, Column, ColumnIndex, ColumnType, ColumnValues, DynamicColumnHandle, NumericalType,
NumericalValue,
}; };
/// Column types are grouped into different categories. /// Column types are grouped into different categories.
@@ -28,14 +29,16 @@ use crate::{
/// In practise, today, only Numerical colummns are coerced into one type today. /// In practise, today, only Numerical colummns are coerced into one type today.
/// ///
/// See also [README.md]. /// See also [README.md].
#[derive(Copy, Clone, Eq, PartialEq, Hash, Debug)] ///
/// The ordering has to match the ordering of the variants in [ColumnType].
#[derive(Copy, Clone, Eq, PartialOrd, Ord, PartialEq, Hash, Debug)]
pub(crate) enum ColumnTypeCategory { pub(crate) enum ColumnTypeCategory {
Bool,
Str,
Numerical, Numerical,
DateTime,
Bytes, Bytes,
Str,
Bool,
IpAddr, IpAddr,
DateTime,
} }
impl From<ColumnType> for ColumnTypeCategory { impl From<ColumnType> for ColumnTypeCategory {
@@ -83,9 +86,20 @@ pub fn merge_columnar(
.iter() .iter()
.map(|reader| reader.num_rows()) .map(|reader| reader.num_rows())
.collect::<Vec<u32>>(); .collect::<Vec<u32>>();
let columns_to_merge = let columns_to_merge =
group_columns_for_merge(columnar_readers, required_columns, &merge_row_order)?; group_columns_for_merge(columnar_readers, required_columns, &merge_row_order)?;
for ((column_name, column_type), columns) in columns_to_merge { for res in columns_to_merge {
let ((column_name, _column_type_category), grouped_columns) = res;
let grouped_columns = grouped_columns.open(&merge_row_order)?;
if grouped_columns.is_empty() {
continue;
}
let column_type = grouped_columns.column_type_after_merge();
let mut columns = grouped_columns.columns;
coerce_columns(column_type, &mut columns)?;
let mut column_serializer = let mut column_serializer =
serializer.start_serialize_column(column_name.as_bytes(), column_type); serializer.start_serialize_column(column_name.as_bytes(), column_type);
merge_column( merge_column(
@@ -97,6 +111,7 @@ pub fn merge_columnar(
)?; )?;
column_serializer.finalize()?; column_serializer.finalize()?;
} }
serializer.finalize(merge_row_order.num_rows())?; serializer.finalize(merge_row_order.num_rows())?;
Ok(()) Ok(())
} }
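The call shape of the merge entry point is unchanged by this refactor. A minimal sketch of stacking two segments, assuming `ColumnarReader`, `merge_columnar` and `StackMergeOrder` are reachable from the crate root and that `ColumnarReader::open` returns `io::Result` as in the tests further down:

    use std::io;

    use tantivy_columnar::{merge_columnar, ColumnarReader, StackMergeOrder};

    /// Stacks two columnar segments into one, keeping every column (no required columns).
    fn stack_two(left: &ColumnarReader, right: &ColumnarReader) -> io::Result<ColumnarReader> {
        let readers = [left, right];
        let merge_order = StackMergeOrder::stack(&readers[..]).into();
        let mut output: Vec<u8> = Vec::new();
        merge_columnar(&readers[..], &[], merge_order, &mut output)?;
        ColumnarReader::open(output)
    }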
@@ -210,40 +225,12 @@ fn merge_column(
struct GroupedColumns { struct GroupedColumns {
required_column_type: Option<ColumnType>, required_column_type: Option<ColumnType>,
columns: Vec<Option<DynamicColumn>>, columns: Vec<Option<DynamicColumn>>,
column_category: ColumnTypeCategory,
} }
impl GroupedColumns { impl GroupedColumns {
fn for_category(column_category: ColumnTypeCategory, num_columnars: usize) -> Self { /// Check if the column group can be skipped during serialization.
GroupedColumns { fn is_empty(&self) -> bool {
required_column_type: None, self.required_column_type.is_none() && self.columns.iter().all(Option::is_none)
columns: vec![None; num_columnars],
column_category,
}
}
/// Set the dynamic column for a given columnar.
fn set_column(&mut self, columnar_id: usize, column: DynamicColumn) {
self.columns[columnar_id] = Some(column);
}
/// Force the existence of a column, as well as its type.
fn require_type(&mut self, required_type: ColumnType) -> io::Result<()> {
if let Some(existing_required_type) = self.required_column_type {
if existing_required_type == required_type {
// This was just a duplicate in the `required_columns`.
// Nothing to do.
return Ok(());
} else {
return Err(io::Error::new(
io::ErrorKind::InvalidInput,
"Required column conflicts with another required column of the same type \
category.",
));
}
}
self.required_column_type = Some(required_type);
Ok(())
} }
/// Returns the column type after merge. /// Returns the column type after merge.
@@ -265,11 +252,76 @@ impl GroupedColumns {
} }
// At the moment, only the numerical categorical column type has more than one possible // At the moment, only the numerical categorical column type has more than one possible
// column type. // column type.
assert_eq!(self.column_category, ColumnTypeCategory::Numerical); assert!(self
.columns
.iter()
.flatten()
.all(|el| ColumnTypeCategory::from(el.column_type()) == ColumnTypeCategory::Numerical));
merged_numerical_columns_type(self.columns.iter().flatten()).into() merged_numerical_columns_type(self.columns.iter().flatten()).into()
} }
} }
struct GroupedColumnsHandle {
required_column_type: Option<ColumnType>,
columns: Vec<Option<DynamicColumnHandle>>,
}
impl GroupedColumnsHandle {
fn new(num_columnars: usize) -> Self {
GroupedColumnsHandle {
required_column_type: None,
columns: vec![None; num_columnars],
}
}
fn open(self, merge_row_order: &MergeRowOrder) -> io::Result<GroupedColumns> {
let mut columns: Vec<Option<DynamicColumn>> = Vec::new();
for (columnar_id, column) in self.columns.iter().enumerate() {
if let Some(column) = column {
let column = column.open()?;
// We skip columns that end up with 0 documents.
// That way, we make sure they don't end up influencing the merge type or
// creating empty columns.
if is_empty_after_merge(merge_row_order, &column, columnar_id) {
columns.push(None);
} else {
columns.push(Some(column));
}
} else {
columns.push(None);
}
}
Ok(GroupedColumns {
required_column_type: self.required_column_type,
columns,
})
}
/// Set the dynamic column for a given columnar.
fn set_column(&mut self, columnar_id: usize, column: DynamicColumnHandle) {
self.columns[columnar_id] = Some(column);
}
/// Force the existence of a column, as well as its type.
fn require_type(&mut self, required_type: ColumnType) -> io::Result<()> {
if let Some(existing_required_type) = self.required_column_type {
if existing_required_type == required_type {
// This was just a duplicate in the `required_columns`.
// Nothing to do.
return Ok(());
} else {
return Err(io::Error::new(
io::ErrorKind::InvalidInput,
"Required column conflicts with another required column of the same type \
category.",
));
}
}
self.required_column_type = Some(required_type);
Ok(())
}
}
/// Returns the type of the merged numerical column. /// Returns the type of the merged numerical column.
/// ///
/// This function picks the first numerical type out of i64, u64, f64 (order matters /// This function picks the first numerical type out of i64, u64, f64 (order matters
@@ -293,7 +345,7 @@ fn merged_numerical_columns_type<'a>(
fn is_empty_after_merge( fn is_empty_after_merge(
merge_row_order: &MergeRowOrder, merge_row_order: &MergeRowOrder,
column: &DynamicColumn, column: &DynamicColumn,
columnar_id: usize, columnar_ord: usize,
) -> bool { ) -> bool {
if column.num_values() == 0u32 { if column.num_values() == 0u32 {
// It was empty before the merge. // It was empty before the merge.
@@ -305,7 +357,7 @@ fn is_empty_after_merge(
false false
} }
MergeRowOrder::Shuffled(shuffled) => { MergeRowOrder::Shuffled(shuffled) => {
if let Some(alive_bitset) = &shuffled.alive_bitsets[columnar_id] { if let Some(alive_bitset) = &shuffled.alive_bitsets[columnar_ord] {
let column_index = column.column_index(); let column_index = column.column_index();
match column_index { match column_index {
ColumnIndex::Empty { .. } => true, ColumnIndex::Empty { .. } => true,
@@ -348,56 +400,34 @@ fn is_empty_after_merge(
} }
} }
#[allow(clippy::type_complexity)] /// Iterates over the columns of the columnar readers, grouped by column name.
fn group_columns_for_merge( /// Key functionality is that `open` of the Columns is done lazily, per group.
columnar_readers: &[&ColumnarReader], fn group_columns_for_merge<'a>(
required_columns: &[(String, ColumnType)], columnar_readers: &'a [&'a ColumnarReader],
merge_row_order: &MergeRowOrder, required_columns: &'a [(String, ColumnType)],
) -> io::Result<BTreeMap<(String, ColumnType), Vec<Option<DynamicColumn>>>> { _merge_row_order: &'a MergeRowOrder,
// Each column name may have multiple types of column associated. ) -> io::Result<BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle>> {
// For merging we are interested in the same column type category since they can be merged. let mut columns: BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle> = BTreeMap::new();
let mut columns_grouped: HashMap<(String, ColumnTypeCategory), GroupedColumns> = HashMap::new();
for &(ref column_name, column_type) in required_columns { for &(ref column_name, column_type) in required_columns {
columns_grouped columns
.entry((column_name.clone(), column_type.into())) .entry((column_name.clone(), column_type.into()))
.or_insert_with(|| { .or_insert_with(|| GroupedColumnsHandle::new(columnar_readers.len()))
GroupedColumns::for_category(column_type.into(), columnar_readers.len())
})
.require_type(column_type)?; .require_type(column_type)?;
} }
for (columnar_id, columnar_reader) in columnar_readers.iter().enumerate() { for (columnar_id, columnar_reader) in columnar_readers.iter().enumerate() {
let column_name_and_handle = columnar_reader.list_columns()?; let column_name_and_handle = columnar_reader.iter_columns()?;
// We skip columns that end up with 0 documents.
// That way, we make sure they don't end up influencing the merge type or
// creating empty columns.
for (column_name, handle) in column_name_and_handle { for (column_name, handle) in column_name_and_handle {
let column_category: ColumnTypeCategory = handle.column_type().into(); let column_category: ColumnTypeCategory = handle.column_type().into();
let column = handle.open()?; columns
if is_empty_after_merge(merge_row_order, &column, columnar_id) {
continue;
}
columns_grouped
.entry((column_name, column_category)) .entry((column_name, column_category))
.or_insert_with(|| { .or_insert_with(|| GroupedColumnsHandle::new(columnar_readers.len()))
GroupedColumns::for_category(column_category, columnar_readers.len()) .set_column(columnar_id, handle);
})
.set_column(columnar_id, column);
} }
} }
Ok(columns)
let mut merge_columns: BTreeMap<(String, ColumnType), Vec<Option<DynamicColumn>>> =
Default::default();
for ((column_name, _), mut grouped_columns) in columns_grouped {
let column_type = grouped_columns.column_type_after_merge();
coerce_columns(column_type, &mut grouped_columns.columns)?;
merge_columns.insert((column_name, column_type), grouped_columns.columns);
}
Ok(merge_columns)
} }
fn coerce_columns( fn coerce_columns(


@@ -1,3 +1,5 @@
use std::collections::BTreeMap;
use itertools::Itertools; use itertools::Itertools;
use super::*; use super::*;
@@ -27,22 +29,10 @@ fn test_column_coercion_to_u64() {
let columnar2 = make_columnar("numbers", &[u64::MAX]); let columnar2 = make_columnar("numbers", &[u64::MAX]);
let columnars = &[&columnar1, &columnar2]; let columnars = &[&columnar1, &columnar2];
let merge_order = StackMergeOrder::stack(columnars).into(); let merge_order = StackMergeOrder::stack(columnars).into();
let column_map: BTreeMap<(String, ColumnType), Vec<Option<DynamicColumn>>> = let column_map: BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle> =
group_columns_for_merge(columnars, &[], &merge_order).unwrap(); group_columns_for_merge(columnars, &[], &merge_order).unwrap();
assert_eq!(column_map.len(), 1); assert_eq!(column_map.len(), 1);
assert!(column_map.contains_key(&("numbers".to_string(), ColumnType::U64))); assert!(column_map.contains_key(&("numbers".to_string(), ColumnTypeCategory::Numerical)));
}
#[test]
fn test_column_no_coercion_if_all_the_same() {
let columnar1 = make_columnar("numbers", &[1u64]);
let columnar2 = make_columnar("numbers", &[2u64]);
let columnars = &[&columnar1, &columnar2];
let merge_order = StackMergeOrder::stack(columnars).into();
let column_map: BTreeMap<(String, ColumnType), Vec<Option<DynamicColumn>>> =
group_columns_for_merge(columnars, &[], &merge_order).unwrap();
assert_eq!(column_map.len(), 1);
assert!(column_map.contains_key(&("numbers".to_string(), ColumnType::U64)));
} }
#[test] #[test]
@@ -51,24 +41,24 @@ fn test_column_coercion_to_i64() {
let columnar2 = make_columnar("numbers", &[2u64]); let columnar2 = make_columnar("numbers", &[2u64]);
let columnars = &[&columnar1, &columnar2]; let columnars = &[&columnar1, &columnar2];
let merge_order = StackMergeOrder::stack(columnars).into(); let merge_order = StackMergeOrder::stack(columnars).into();
let column_map: BTreeMap<(String, ColumnType), Vec<Option<DynamicColumn>>> = let column_map: BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle> =
group_columns_for_merge(columnars, &[], &merge_order).unwrap(); group_columns_for_merge(columnars, &[], &merge_order).unwrap();
assert_eq!(column_map.len(), 1); assert_eq!(column_map.len(), 1);
assert!(column_map.contains_key(&("numbers".to_string(), ColumnType::I64))); assert!(column_map.contains_key(&("numbers".to_string(), ColumnTypeCategory::Numerical)));
} }
#[test] //#[test]
fn test_impossible_coercion_returns_an_error() { // fn test_impossible_coercion_returns_an_error() {
let columnar1 = make_columnar("numbers", &[u64::MAX]); // let columnar1 = make_columnar("numbers", &[u64::MAX]);
let merge_order = StackMergeOrder::stack(&[&columnar1]).into(); // let merge_order = StackMergeOrder::stack(&[&columnar1]).into();
let group_error = group_columns_for_merge( // let group_error = group_columns_for_merge_iter(
&[&columnar1], //&[&columnar1],
&[("numbers".to_string(), ColumnType::I64)], //&[("numbers".to_string(), ColumnType::I64)],
&merge_order, //&merge_order,
) //)
.unwrap_err(); //.unwrap_err();
assert_eq!(group_error.kind(), io::ErrorKind::InvalidInput); // assert_eq!(group_error.kind(), io::ErrorKind::InvalidInput);
} //}
#[test] #[test]
fn test_group_columns_with_required_column() { fn test_group_columns_with_required_column() {
@@ -76,7 +66,7 @@ fn test_group_columns_with_required_column() {
let columnar2 = make_columnar("numbers", &[2u64]); let columnar2 = make_columnar("numbers", &[2u64]);
let columnars = &[&columnar1, &columnar2]; let columnars = &[&columnar1, &columnar2];
let merge_order = StackMergeOrder::stack(columnars).into(); let merge_order = StackMergeOrder::stack(columnars).into();
let column_map: BTreeMap<(String, ColumnType), Vec<Option<DynamicColumn>>> = let column_map: BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle> =
group_columns_for_merge( group_columns_for_merge(
&[&columnar1, &columnar2], &[&columnar1, &columnar2],
&[("numbers".to_string(), ColumnType::U64)], &[("numbers".to_string(), ColumnType::U64)],
@@ -84,7 +74,7 @@ fn test_group_columns_with_required_column() {
) )
.unwrap(); .unwrap();
assert_eq!(column_map.len(), 1); assert_eq!(column_map.len(), 1);
assert!(column_map.contains_key(&("numbers".to_string(), ColumnType::U64))); assert!(column_map.contains_key(&("numbers".to_string(), ColumnTypeCategory::Numerical)));
} }
#[test] #[test]
@@ -93,17 +83,17 @@ fn test_group_columns_required_column_with_no_existing_columns() {
let columnar2 = make_columnar("numbers", &[2u64]); let columnar2 = make_columnar("numbers", &[2u64]);
let columnars = &[&columnar1, &columnar2]; let columnars = &[&columnar1, &columnar2];
let merge_order = StackMergeOrder::stack(columnars).into(); let merge_order = StackMergeOrder::stack(columnars).into();
let column_map: BTreeMap<(String, ColumnType), Vec<Option<DynamicColumn>>> = let column_map: BTreeMap<_, _> = group_columns_for_merge(
group_columns_for_merge( columnars,
columnars, &[("required_col".to_string(), ColumnType::Str)],
&[("required_col".to_string(), ColumnType::Str)], &merge_order,
&merge_order, )
) .unwrap();
.unwrap();
assert_eq!(column_map.len(), 2); assert_eq!(column_map.len(), 2);
let columns = column_map let columns = &column_map
.get(&("required_col".to_string(), ColumnType::Str)) .get(&("required_col".to_string(), ColumnTypeCategory::Str))
.unwrap(); .unwrap()
.columns;
assert_eq!(columns.len(), 2); assert_eq!(columns.len(), 2);
assert!(columns[0].is_none()); assert!(columns[0].is_none());
assert!(columns[1].is_none()); assert!(columns[1].is_none());
@@ -115,7 +105,7 @@ fn test_group_columns_required_column_is_above_all_columns_have_the_same_type_ru
let columnar2 = make_columnar("numbers", &[2i64]); let columnar2 = make_columnar("numbers", &[2i64]);
let columnars = &[&columnar1, &columnar2]; let columnars = &[&columnar1, &columnar2];
let merge_order = StackMergeOrder::stack(columnars).into(); let merge_order = StackMergeOrder::stack(columnars).into();
let column_map: BTreeMap<(String, ColumnType), Vec<Option<DynamicColumn>>> = let column_map: BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle> =
group_columns_for_merge( group_columns_for_merge(
columnars, columnars,
&[("numbers".to_string(), ColumnType::U64)], &[("numbers".to_string(), ColumnType::U64)],
@@ -123,7 +113,7 @@ fn test_group_columns_required_column_is_above_all_columns_have_the_same_type_ru
) )
.unwrap(); .unwrap();
assert_eq!(column_map.len(), 1); assert_eq!(column_map.len(), 1);
assert!(column_map.contains_key(&("numbers".to_string(), ColumnType::U64))); assert!(column_map.contains_key(&("numbers".to_string(), ColumnTypeCategory::Numerical)));
} }
#[test] #[test]
@@ -132,21 +122,23 @@ fn test_missing_column() {
let columnar2 = make_columnar("numbers2", &[2u64]); let columnar2 = make_columnar("numbers2", &[2u64]);
let columnars = &[&columnar1, &columnar2]; let columnars = &[&columnar1, &columnar2];
let merge_order = StackMergeOrder::stack(columnars).into(); let merge_order = StackMergeOrder::stack(columnars).into();
let column_map: BTreeMap<(String, ColumnType), Vec<Option<DynamicColumn>>> = let column_map: BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle> =
group_columns_for_merge(columnars, &[], &merge_order).unwrap(); group_columns_for_merge(columnars, &[], &merge_order).unwrap();
assert_eq!(column_map.len(), 2); assert_eq!(column_map.len(), 2);
assert!(column_map.contains_key(&("numbers".to_string(), ColumnType::I64))); assert!(column_map.contains_key(&("numbers".to_string(), ColumnTypeCategory::Numerical)));
{ {
let columns = column_map let columns = &column_map
.get(&("numbers".to_string(), ColumnType::I64)) .get(&("numbers".to_string(), ColumnTypeCategory::Numerical))
.unwrap(); .unwrap()
.columns;
assert!(columns[0].is_some()); assert!(columns[0].is_some());
assert!(columns[1].is_none()); assert!(columns[1].is_none());
} }
{ {
let columns = column_map let columns = &column_map
.get(&("numbers2".to_string(), ColumnType::U64)) .get(&("numbers2".to_string(), ColumnTypeCategory::Numerical))
.unwrap(); .unwrap()
.columns;
assert!(columns[0].is_none()); assert!(columns[0].is_none());
assert!(columns[1].is_some()); assert!(columns[1].is_some());
} }


@@ -102,30 +102,41 @@ impl ColumnarReader {
pub fn num_rows(&self) -> RowId { pub fn num_rows(&self) -> RowId {
self.num_rows self.num_rows
} }
// Iterate over the columns in a sorted way
pub fn iter_columns(
&self,
) -> io::Result<impl Iterator<Item = (String, DynamicColumnHandle)> + '_> {
let mut stream = self.column_dictionary.stream()?;
Ok(std::iter::from_fn(move || {
if stream.advance() {
let key_bytes: &[u8] = stream.key();
let column_code: u8 = key_bytes.last().cloned().unwrap();
// TODO Error Handling. The API gets quite ugly when returning the error here, so
// instead we could just check the first N columns upfront.
let column_type: ColumnType = ColumnType::try_from_code(column_code)
.map_err(|_| io_invalid_data(format!("Unknown column code `{column_code}`")))
.unwrap();
let range = stream.value().clone();
let column_name =
// The last two bytes are respectively the 0u8 separator and the column_type.
String::from_utf8_lossy(&key_bytes[..key_bytes.len() - 2]).to_string();
let file_slice = self
.column_data
.slice(range.start as usize..range.end as usize);
let column_handle = DynamicColumnHandle {
file_slice,
column_type,
};
Some((column_name, column_handle))
} else {
None
}
}))
}
// TODO Add unit tests // TODO Add unit tests
pub fn list_columns(&self) -> io::Result<Vec<(String, DynamicColumnHandle)>> { pub fn list_columns(&self) -> io::Result<Vec<(String, DynamicColumnHandle)>> {
let mut stream = self.column_dictionary.stream()?; Ok(self.iter_columns()?.collect())
let mut results = Vec::new();
while stream.advance() {
let key_bytes: &[u8] = stream.key();
let column_code: u8 = key_bytes.last().cloned().unwrap();
let column_type: ColumnType = ColumnType::try_from_code(column_code)
.map_err(|_| io_invalid_data(format!("Unknown column code `{column_code}`")))?;
let range = stream.value().clone();
let column_name =
// The last two bytes are respectively the 0u8 separator and the column_type.
String::from_utf8_lossy(&key_bytes[..key_bytes.len() - 2]).to_string();
let file_slice = self
.column_data
.slice(range.start as usize..range.end as usize);
let column_handle = DynamicColumnHandle {
file_slice,
column_type,
};
results.push((column_name, column_handle));
}
Ok(results)
} }
fn stream_for_column_range(&self, column_name: &str) -> sstable::StreamerBuilder<RangeSSTable> { fn stream_for_column_range(&self, column_name: &str) -> sstable::StreamerBuilder<RangeSSTable> {
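A small usage sketch for the new iterator, assuming the reader is consumed from outside the crate as `tantivy_columnar::ColumnarReader`: because it streams the column dictionary, the `(name, handle)` pairs come back sorted by key, and no column data is read until `open()` is called on a handle.

    use std::io;

    use tantivy_columnar::ColumnarReader;

    /// Lists the column names stored in a columnar, in sorted order.
    fn column_names(reader: &ColumnarReader) -> io::Result<Vec<String>> {
        Ok(reader
            .iter_columns()?
            .map(|(column_name, _handle)| column_name)
            .collect())
    }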


@@ -269,7 +269,8 @@ impl StrOrBytesColumnWriter {
dictionaries: &mut [DictionaryBuilder], dictionaries: &mut [DictionaryBuilder],
arena: &mut MemoryArena, arena: &mut MemoryArena,
) { ) {
let unordered_id = dictionaries[self.dictionary_id as usize].get_or_allocate_id(bytes); let unordered_id =
dictionaries[self.dictionary_id as usize].get_or_allocate_id(bytes, arena);
self.column_writer.record(doc, unordered_id, arena); self.column_writer.record(doc, unordered_id, arena);
} }


@@ -79,7 +79,6 @@ fn mutate_or_create_column<V, TMutator>(
impl ColumnarWriter { impl ColumnarWriter {
pub fn mem_usage(&self) -> usize { pub fn mem_usage(&self) -> usize {
// TODO add dictionary builders.
self.arena.mem_usage() self.arena.mem_usage()
+ self.numerical_field_hash_map.mem_usage() + self.numerical_field_hash_map.mem_usage()
+ self.bool_field_hash_map.mem_usage() + self.bool_field_hash_map.mem_usage()
@@ -87,6 +86,11 @@ impl ColumnarWriter {
+ self.str_field_hash_map.mem_usage() + self.str_field_hash_map.mem_usage()
+ self.ip_addr_field_hash_map.mem_usage() + self.ip_addr_field_hash_map.mem_usage()
+ self.datetime_field_hash_map.mem_usage() + self.datetime_field_hash_map.mem_usage()
+ self
.dictionaries
.iter()
.map(|dict| dict.mem_usage())
.sum::<usize>()
} }
/// Returns the list of doc ids from 0..num_docs sorted by the `sort_field` /// Returns the list of doc ids from 0..num_docs sorted by the `sort_field`
@@ -101,6 +105,10 @@ impl ColumnarWriter {
let Some(numerical_col_writer) = self let Some(numerical_col_writer) = self
.numerical_field_hash_map .numerical_field_hash_map
.get::<NumericalColumnWriter>(sort_field.as_bytes()) .get::<NumericalColumnWriter>(sort_field.as_bytes())
.or_else(|| {
self.datetime_field_hash_map
.get::<NumericalColumnWriter>(sort_field.as_bytes())
})
else { else {
return Vec::new(); return Vec::new();
}; };
@@ -330,7 +338,7 @@ impl ColumnarWriter {
let mut columns: Vec<(&[u8], ColumnType, Addr)> = self let mut columns: Vec<(&[u8], ColumnType, Addr)> = self
.numerical_field_hash_map .numerical_field_hash_map
.iter() .iter()
.map(|(column_name, addr, _)| { .map(|(column_name, addr)| {
let numerical_column_writer: NumericalColumnWriter = let numerical_column_writer: NumericalColumnWriter =
self.numerical_field_hash_map.read(addr); self.numerical_field_hash_map.read(addr);
let column_type = numerical_column_writer.numerical_type().into(); let column_type = numerical_column_writer.numerical_type().into();
@@ -340,27 +348,27 @@ impl ColumnarWriter {
columns.extend( columns.extend(
self.bytes_field_hash_map self.bytes_field_hash_map
.iter() .iter()
.map(|(term, addr, _)| (term, ColumnType::Bytes, addr)), .map(|(term, addr)| (term, ColumnType::Bytes, addr)),
); );
columns.extend( columns.extend(
self.str_field_hash_map self.str_field_hash_map
.iter() .iter()
.map(|(column_name, addr, _)| (column_name, ColumnType::Str, addr)), .map(|(column_name, addr)| (column_name, ColumnType::Str, addr)),
); );
columns.extend( columns.extend(
self.bool_field_hash_map self.bool_field_hash_map
.iter() .iter()
.map(|(column_name, addr, _)| (column_name, ColumnType::Bool, addr)), .map(|(column_name, addr)| (column_name, ColumnType::Bool, addr)),
); );
columns.extend( columns.extend(
self.ip_addr_field_hash_map self.ip_addr_field_hash_map
.iter() .iter()
.map(|(column_name, addr, _)| (column_name, ColumnType::IpAddr, addr)), .map(|(column_name, addr)| (column_name, ColumnType::IpAddr, addr)),
); );
columns.extend( columns.extend(
self.datetime_field_hash_map self.datetime_field_hash_map
.iter() .iter()
.map(|(column_name, addr, _)| (column_name, ColumnType::DateTime, addr)), .map(|(column_name, addr)| (column_name, ColumnType::DateTime, addr)),
); );
columns.sort_unstable_by_key(|(column_name, col_type, _)| (*column_name, *col_type)); columns.sort_unstable_by_key(|(column_name, col_type, _)| (*column_name, *col_type));
@@ -429,6 +437,7 @@ impl ColumnarWriter {
&mut symbol_byte_buffer, &mut symbol_byte_buffer,
), ),
buffers, buffers,
&self.arena,
&mut column_serializer, &mut column_serializer,
)?; )?;
column_serializer.finalize()?; column_serializer.finalize()?;
@@ -482,6 +491,7 @@ impl ColumnarWriter {
// Serialize [Dictionary, Column, dictionary num bytes U32::LE] // Serialize [Dictionary, Column, dictionary num bytes U32::LE]
// Column: [Column Index, Column Values, column index num bytes U32::LE] // Column: [Column Index, Column Values, column index num bytes U32::LE]
#[allow(clippy::too_many_arguments)]
fn serialize_bytes_or_str_column( fn serialize_bytes_or_str_column(
cardinality: Cardinality, cardinality: Cardinality,
num_docs: RowId, num_docs: RowId,
@@ -489,6 +499,7 @@ fn serialize_bytes_or_str_column(
dictionary_builder: &DictionaryBuilder, dictionary_builder: &DictionaryBuilder,
operation_it: impl Iterator<Item = ColumnOperation<UnorderedId>>, operation_it: impl Iterator<Item = ColumnOperation<UnorderedId>>,
buffers: &mut SpareBuffers, buffers: &mut SpareBuffers,
arena: &MemoryArena,
wrt: impl io::Write, wrt: impl io::Write,
) -> io::Result<()> { ) -> io::Result<()> {
let SpareBuffers { let SpareBuffers {
@@ -497,7 +508,8 @@ fn serialize_bytes_or_str_column(
.. ..
} = buffers; } = buffers;
let mut counting_writer = CountingWriter::wrap(wrt); let mut counting_writer = CountingWriter::wrap(wrt);
let term_id_mapping: TermIdMapping = dictionary_builder.serialize(&mut counting_writer)?; let term_id_mapping: TermIdMapping =
dictionary_builder.serialize(arena, &mut counting_writer)?;
let dictionary_num_bytes: u32 = counting_writer.written_bytes() as u32; let dictionary_num_bytes: u32 = counting_writer.written_bytes() as u32;
let mut wrt = counting_writer.finish(); let mut wrt = counting_writer.finish();
let operation_iterator = operation_it.map(|symbol: ColumnOperation<UnorderedId>| { let operation_iterator = operation_it.map(|symbol: ColumnOperation<UnorderedId>| {


@@ -1,7 +1,7 @@
use std::io; use std::io;
use fnv::FnvHashMap;
use sstable::SSTable; use sstable::SSTable;
use stacker::{MemoryArena, SharedArenaHashMap};
pub(crate) struct TermIdMapping { pub(crate) struct TermIdMapping {
unordered_to_ord: Vec<OrderedId>, unordered_to_ord: Vec<OrderedId>,
@@ -31,26 +31,38 @@ pub struct OrderedId(pub u32);
/// mapping. /// mapping.
#[derive(Default)] #[derive(Default)]
pub(crate) struct DictionaryBuilder { pub(crate) struct DictionaryBuilder {
dict: FnvHashMap<Vec<u8>, UnorderedId>, dict: SharedArenaHashMap,
} }
impl DictionaryBuilder { impl DictionaryBuilder {
/// Get or allocate an unordered id. /// Get or allocate an unordered id.
/// (This ID is simply an auto-incremented id.) /// (This ID is simply an auto-incremented id.)
pub fn get_or_allocate_id(&mut self, term: &[u8]) -> UnorderedId { pub fn get_or_allocate_id(&mut self, term: &[u8], arena: &mut MemoryArena) -> UnorderedId {
if let Some(term_id) = self.dict.get(term) { let next_id = self.dict.len() as u32;
return *term_id; let unordered_id = self
} .dict
let new_id = UnorderedId(self.dict.len() as u32); .mutate_or_create(term, arena, |unordered_id: Option<u32>| {
self.dict.insert(term.to_vec(), new_id); if let Some(unordered_id) = unordered_id {
new_id unordered_id
} else {
next_id
}
});
UnorderedId(unordered_id)
} }
/// Serialize the dictionary into an fst, and returns the /// Serialize the dictionary into an fst, and returns the
/// `UnorderedId -> TermOrdinal` map. /// `UnorderedId -> TermOrdinal` map.
pub fn serialize<'a, W: io::Write + 'a>(&self, wrt: &mut W) -> io::Result<TermIdMapping> { pub fn serialize<'a, W: io::Write + 'a>(
let mut terms: Vec<(&[u8], UnorderedId)> = &self,
self.dict.iter().map(|(k, v)| (k.as_slice(), *v)).collect(); arena: &MemoryArena,
wrt: &mut W,
) -> io::Result<TermIdMapping> {
let mut terms: Vec<(&[u8], UnorderedId)> = self
.dict
.iter(arena)
.map(|(k, v)| (k, arena.read(v)))
.collect();
terms.sort_unstable_by_key(|(key, _)| *key); terms.sort_unstable_by_key(|(key, _)| *key);
// TODO Remove the allocation. // TODO Remove the allocation.
let mut unordered_to_ord: Vec<OrderedId> = vec![OrderedId(0u32); terms.len()]; let mut unordered_to_ord: Vec<OrderedId> = vec![OrderedId(0u32); terms.len()];
@@ -63,6 +75,10 @@ impl DictionaryBuilder {
sstable_builder.finish()?; sstable_builder.finish()?;
Ok(TermIdMapping { unordered_to_ord }) Ok(TermIdMapping { unordered_to_ord })
} }
pub(crate) fn mem_usage(&self) -> usize {
self.dict.mem_usage()
}
} }
#[cfg(test)] #[cfg(test)]
@@ -71,12 +87,13 @@ mod tests {
#[test] #[test]
fn test_dictionary_builder() { fn test_dictionary_builder() {
let mut arena = MemoryArena::default();
let mut dictionary_builder = DictionaryBuilder::default(); let mut dictionary_builder = DictionaryBuilder::default();
let hello_uid = dictionary_builder.get_or_allocate_id(b"hello"); let hello_uid = dictionary_builder.get_or_allocate_id(b"hello", &mut arena);
let happy_uid = dictionary_builder.get_or_allocate_id(b"happy"); let happy_uid = dictionary_builder.get_or_allocate_id(b"happy", &mut arena);
let tax_uid = dictionary_builder.get_or_allocate_id(b"tax"); let tax_uid = dictionary_builder.get_or_allocate_id(b"tax", &mut arena);
let mut buffer = Vec::new(); let mut buffer = Vec::new();
let id_mapping = dictionary_builder.serialize(&mut buffer).unwrap(); let id_mapping = dictionary_builder.serialize(&arena, &mut buffer).unwrap();
assert_eq!(id_mapping.to_ord(hello_uid), OrderedId(1)); assert_eq!(id_mapping.to_ord(hello_uid), OrderedId(1));
assert_eq!(id_mapping.to_ord(happy_uid), OrderedId(0)); assert_eq!(id_mapping.to_ord(happy_uid), OrderedId(0));
assert_eq!(id_mapping.to_ord(tax_uid), OrderedId(2)); assert_eq!(id_mapping.to_ord(tax_uid), OrderedId(2));
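The assertions above hinge on the `UnorderedId` to `OrderedId` remapping that `serialize` builds: ids are handed out in insertion order, but the sstable is written in sorted term order, so a translation table is produced. A standalone sketch of just that remapping step (names here are illustrative, not tantivy API):

    /// For each term of `insertion_order`, computes the ordinal it receives once the
    /// terms are sorted, mirroring what `DictionaryBuilder::serialize` returns.
    fn unordered_to_ordered(insertion_order: &[&[u8]]) -> Vec<u32> {
        let mut sorted: Vec<(usize, &[u8])> =
            insertion_order.iter().copied().enumerate().collect();
        sorted.sort_unstable_by_key(|(_, term)| *term);
        let mut unordered_to_ord = vec![0u32; sorted.len()];
        for (ord, (unordered_id, _term)) in sorted.iter().enumerate() {
            unordered_to_ord[*unordered_id] = ord as u32;
        }
        unordered_to_ord
    }

    // With ["hello", "happy", "tax"] inserted in that order the table is [1, 0, 2],
    // which matches the assertions in the test above.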


@@ -228,7 +228,7 @@ static_dynamic_conversions!(StrColumn, Str);
static_dynamic_conversions!(BytesColumn, Bytes); static_dynamic_conversions!(BytesColumn, Bytes);
static_dynamic_conversions!(Column<Ipv6Addr>, IpAddr); static_dynamic_conversions!(Column<Ipv6Addr>, IpAddr);
#[derive(Clone)] #[derive(Clone, Debug)]
pub struct DynamicColumnHandle { pub struct DynamicColumnHandle {
pub(crate) file_slice: FileSlice, pub(crate) file_slice: FileSlice,
pub(crate) column_type: ColumnType, pub(crate) column_type: ColumnType,
@@ -247,7 +247,7 @@ impl DynamicColumnHandle {
} }
/// Returns the `u64` fast field reader reader associated with `fields` of types /// Returns the `u64` fast field reader reader associated with `fields` of types
/// Str, u64, i64, f64, or datetime. /// Str, u64, i64, f64, bool, or datetime.
/// ///
/// If not, the fastfield reader will returns the u64-value associated with the original /// If not, the fastfield reader will returns the u64-value associated with the original
/// FastValue. /// FastValue.
@@ -258,9 +258,12 @@ impl DynamicColumnHandle {
let column: BytesColumn = crate::column::open_column_bytes(column_bytes)?; let column: BytesColumn = crate::column::open_column_bytes(column_bytes)?;
Ok(Some(column.term_ord_column)) Ok(Some(column.term_ord_column))
} }
ColumnType::Bool => Ok(None),
ColumnType::IpAddr => Ok(None), ColumnType::IpAddr => Ok(None),
ColumnType::I64 | ColumnType::U64 | ColumnType::F64 | ColumnType::DateTime => { ColumnType::Bool
| ColumnType::I64
| ColumnType::U64
| ColumnType::F64
| ColumnType::DateTime => {
let column = crate::column::open_column_u64::<u64>(column_bytes)?; let column = crate::column::open_column_u64::<u64>(column_bytes)?;
Ok(Some(column)) Ok(Some(column))
} }


@@ -1,3 +1,22 @@
//! # Tantivy-Columnar
//!
//! `tantivy-columnar` provides columnar storage for tantivy.
//! The crate allows for efficient read operations on specific columns rather than entire records.
//!
//! ## Overview
//!
//! - **columnar**: Reading, writing, and merging multiple columns:
//! - **[ColumnarWriter]**: Makes it possible to create a new columnar.
//! - **[ColumnarReader]**: The ColumnarReader makes it possible to access a set of columns
//! associated to field names.
//! - **[merge_columnar]**: Merges multiple ColumnarReader instances or
//! segments into a single one.
//!
//! - **column**: A single column, which contains
//! - [column_index]: Resolves the rows for a document id. Manages the cardinality of the
//! column.
//! - [column_values]: Stores the values of a column in a dense format.
#![cfg_attr(all(feature = "unstable", test), feature(test))] #![cfg_attr(all(feature = "unstable", test), feature(test))]
#[cfg(test)] #[cfg(test)]
@@ -12,7 +31,7 @@ use std::io;
mod block_accessor; mod block_accessor;
mod column; mod column;
mod column_index; pub mod column_index;
pub mod column_values; pub mod column_values;
mod columnar; mod columnar;
mod dictionary; mod dictionary;
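To tie the overview above to concrete calls, a small read-side sketch that only uses methods appearing elsewhere in this compare (the crate name `tantivy_columnar` is an assumption):

    use std::io;

    use tantivy_columnar::ColumnarReader;

    /// Opens a serialized columnar and reports whether a column with that name exists.
    fn has_column(columnar_bytes: Vec<u8>, column_name: &str) -> io::Result<bool> {
        let reader = ColumnarReader::open(columnar_bytes)?;
        println!("{} rows, {} columns", reader.num_rows(), reader.num_columns());
        let handles = reader.read_columns(column_name)?;
        Ok(!handles.is_empty())
    }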


@@ -26,7 +26,7 @@ fn test_dataframe_writer_str() {
assert_eq!(columnar.num_columns(), 1); assert_eq!(columnar.num_columns(), 1);
let cols: Vec<DynamicColumnHandle> = columnar.read_columns("my_string").unwrap(); let cols: Vec<DynamicColumnHandle> = columnar.read_columns("my_string").unwrap();
assert_eq!(cols.len(), 1); assert_eq!(cols.len(), 1);
assert_eq!(cols[0].num_bytes(), 87); assert_eq!(cols[0].num_bytes(), 73);
} }
#[test] #[test]
@@ -40,7 +40,7 @@ fn test_dataframe_writer_bytes() {
assert_eq!(columnar.num_columns(), 1); assert_eq!(columnar.num_columns(), 1);
let cols: Vec<DynamicColumnHandle> = columnar.read_columns("my_string").unwrap(); let cols: Vec<DynamicColumnHandle> = columnar.read_columns("my_string").unwrap();
assert_eq!(cols.len(), 1); assert_eq!(cols.len(), 1);
assert_eq!(cols[0].num_bytes(), 87); assert_eq!(cols[0].num_bytes(), 73);
} }
#[test] #[test]
@@ -330,9 +330,9 @@ fn bytes_strategy() -> impl Strategy<Value = &'static [u8]> {
// A random column value // A random column value
fn column_value_strategy() -> impl Strategy<Value = ColumnValue> { fn column_value_strategy() -> impl Strategy<Value = ColumnValue> {
prop_oneof![ prop_oneof![
10 => string_strategy().prop_map(|s| ColumnValue::Str(s)), 10 => string_strategy().prop_map(ColumnValue::Str),
1 => bytes_strategy().prop_map(|b| ColumnValue::Bytes(b)), 1 => bytes_strategy().prop_map(ColumnValue::Bytes),
40 => num_strategy().prop_map(|n| ColumnValue::Numerical(n)), 40 => num_strategy().prop_map(ColumnValue::Numerical),
1 => (1u16..3u16).prop_map(|ip_addr_byte| ColumnValue::IpAddr(Ipv6Addr::new( 1 => (1u16..3u16).prop_map(|ip_addr_byte| ColumnValue::IpAddr(Ipv6Addr::new(
127, 127,
0, 0,
@@ -343,7 +343,7 @@ fn column_value_strategy() -> impl Strategy<Value = ColumnValue> {
0, 0,
ip_addr_byte ip_addr_byte
))), ))),
1 => any::<bool>().prop_map(|b| ColumnValue::Bool(b)), 1 => any::<bool>().prop_map(ColumnValue::Bool),
1 => (0_679_723_993i64..1_679_723_995i64) 1 => (0_679_723_993i64..1_679_723_995i64)
.prop_map(|val| { ColumnValue::DateTime(DateTime::from_timestamp_secs(val)) }) .prop_map(|val| { ColumnValue::DateTime(DateTime::from_timestamp_secs(val)) })
] ]
@@ -419,8 +419,8 @@ fn build_columnar_with_mapping(
columnar_writer columnar_writer
.serialize(num_docs, old_to_new_row_ids_opt, &mut buffer) .serialize(num_docs, old_to_new_row_ids_opt, &mut buffer)
.unwrap(); .unwrap();
let columnar_reader = ColumnarReader::open(buffer).unwrap();
columnar_reader ColumnarReader::open(buffer).unwrap()
} }
fn build_columnar(docs: &[Vec<(&'static str, ColumnValue)>]) -> ColumnarReader { fn build_columnar(docs: &[Vec<(&'static str, ColumnValue)>]) -> ColumnarReader {
@@ -746,7 +746,7 @@ proptest! {
let stack_merge_order = StackMergeOrder::stack(&columnar_readers_arr[..]).into(); let stack_merge_order = StackMergeOrder::stack(&columnar_readers_arr[..]).into();
crate::merge_columnar(&columnar_readers_arr[..], &[], stack_merge_order, &mut output).unwrap(); crate::merge_columnar(&columnar_readers_arr[..], &[], stack_merge_order, &mut output).unwrap();
let merged_columnar = ColumnarReader::open(output).unwrap(); let merged_columnar = ColumnarReader::open(output).unwrap();
let concat_rows: Vec<Vec<(&'static str, ColumnValue)>> = columnar_docs.iter().cloned().flatten().collect(); let concat_rows: Vec<Vec<(&'static str, ColumnValue)>> = columnar_docs.iter().flatten().cloned().collect();
let expected_merged_columnar = build_columnar(&concat_rows[..]); let expected_merged_columnar = build_columnar(&concat_rows[..]);
assert_columnar_eq_strict(&merged_columnar, &expected_merged_columnar); assert_columnar_eq_strict(&merged_columnar, &expected_merged_columnar);
} }
@@ -772,7 +772,7 @@ fn test_columnar_merging_empty_columnar() {
.unwrap(); .unwrap();
let merged_columnar = ColumnarReader::open(output).unwrap(); let merged_columnar = ColumnarReader::open(output).unwrap();
let concat_rows: Vec<Vec<(&'static str, ColumnValue)>> = let concat_rows: Vec<Vec<(&'static str, ColumnValue)>> =
columnar_docs.iter().cloned().flatten().collect(); columnar_docs.iter().flatten().cloned().collect();
let expected_merged_columnar = build_columnar(&concat_rows[..]); let expected_merged_columnar = build_columnar(&concat_rows[..]);
assert_columnar_eq_strict(&merged_columnar, &expected_merged_columnar); assert_columnar_eq_strict(&merged_columnar, &expected_merged_columnar);
} }
@@ -809,7 +809,7 @@ fn test_columnar_merging_number_columns() {
.unwrap(); .unwrap();
let merged_columnar = ColumnarReader::open(output).unwrap(); let merged_columnar = ColumnarReader::open(output).unwrap();
let concat_rows: Vec<Vec<(&'static str, ColumnValue)>> = let concat_rows: Vec<Vec<(&'static str, ColumnValue)>> =
columnar_docs.iter().cloned().flatten().collect(); columnar_docs.iter().flatten().cloned().collect();
let expected_merged_columnar = build_columnar(&concat_rows[..]); let expected_merged_columnar = build_columnar(&concat_rows[..]);
assert_columnar_eq_strict(&merged_columnar, &expected_merged_columnar); assert_columnar_eq_strict(&merged_columnar, &expected_merged_columnar);
} }


@@ -1,6 +1,6 @@
[package] [package]
name = "tantivy-common" name = "tantivy-common"
version = "0.5.0" version = "0.6.0"
authors = ["Paul Masurel <paul@quickwit.io>", "Pascal Seitz <pascal@quickwit.io>"] authors = ["Paul Masurel <paul@quickwit.io>", "Pascal Seitz <pascal@quickwit.io>"]
license = "MIT" license = "MIT"
edition = "2021" edition = "2021"
@@ -14,7 +14,7 @@ repository = "https://github.com/quickwit-oss/tantivy"
[dependencies] [dependencies]
byteorder = "1.4.3" byteorder = "1.4.3"
ownedbytes = { version= "0.5", path="../ownedbytes" } ownedbytes = { version= "0.6", path="../ownedbytes" }
async-trait = "0.1" async-trait = "0.1"
time = { version = "0.3.10", features = ["serde-well-known"] } time = { version = "0.3.10", features = ["serde-well-known"] }
serde = { version = "1.0.136", features = ["derive"] } serde = { version = "1.0.136", features = ["derive"] }


@@ -1,11 +1,14 @@
#![allow(deprecated)] #![allow(deprecated)]
use std::fmt; use std::fmt;
use std::io::{Read, Write};
use serde::{Deserialize, Serialize}; use serde::{Deserialize, Serialize};
use time::format_description::well_known::Rfc3339; use time::format_description::well_known::Rfc3339;
use time::{OffsetDateTime, PrimitiveDateTime, UtcOffset}; use time::{OffsetDateTime, PrimitiveDateTime, UtcOffset};
use crate::BinarySerializable;
/// Precision with which datetimes are truncated when stored in fast fields. This setting is only /// Precision with which datetimes are truncated when stored in fast fields. This setting is only
/// relevant for fast fields. In the docstore, datetimes are always saved with nanosecond precision. /// relevant for fast fields. In the docstore, datetimes are always saved with nanosecond precision.
#[derive( #[derive(
@@ -164,3 +167,15 @@ impl fmt::Debug for DateTime {
f.write_str(&utc_rfc3339) f.write_str(&utc_rfc3339)
} }
} }
impl BinarySerializable for DateTime {
fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> std::io::Result<()> {
let timestamp_micros = self.into_timestamp_micros();
<i64 as BinarySerializable>::serialize(&timestamp_micros, writer)
}
fn deserialize<R: Read>(reader: &mut R) -> std::io::Result<Self> {
let timestamp_micros = <i64 as BinarySerializable>::deserialize(reader)?;
Ok(Self::from_timestamp_micros(timestamp_micros))
}
}
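A round-trip sketch for the new impl, assuming the crate is imported as `tantivy_common` (inside the workspace it is usually pulled in under a shorter name); both `DateTime` and `BinarySerializable` are re-exported at the crate root. Note that serialization goes through microseconds, so any finer-grained component is dropped.

    use std::io::Cursor;

    use tantivy_common::{BinarySerializable, DateTime};

    fn roundtrip(dt: DateTime) -> std::io::Result<DateTime> {
        let mut buf: Vec<u8> = Vec::new();
        dt.serialize(&mut buf)?; // written as an i64 of microseconds
        DateTime::deserialize(&mut Cursor::new(buf))
    }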


@@ -1,3 +1,4 @@
use std::fs::File;
use std::ops::{Deref, Range, RangeBounds}; use std::ops::{Deref, Range, RangeBounds};
use std::sync::Arc; use std::sync::Arc;
use std::{fmt, io}; use std::{fmt, io};
@@ -32,6 +33,62 @@ pub trait FileHandle: 'static + Send + Sync + HasLen + fmt::Debug {
} }
} }
#[derive(Debug)]
/// A File with its length included.
pub struct WrapFile {
file: File,
len: usize,
}
impl WrapFile {
/// Creates a new WrapFile and stores its length.
pub fn new(file: File) -> io::Result<Self> {
let len = file.metadata()?.len() as usize;
Ok(WrapFile { file, len })
}
}
#[async_trait]
impl FileHandle for WrapFile {
fn read_bytes(&self, range: Range<usize>) -> io::Result<OwnedBytes> {
let file_len = self.len();
// Calculate the actual range to read, ensuring it stays within file boundaries
let start = range.start;
let end = range.end.min(file_len);
// Ensure the start is before the end of the range
if start >= end {
return Err(io::Error::new(io::ErrorKind::InvalidInput, "Invalid range"));
}
let mut buffer = vec![0; end - start];
#[cfg(unix)]
{
use std::os::unix::prelude::FileExt;
self.file.read_exact_at(&mut buffer, start as u64)?;
}
#[cfg(not(unix))]
{
use std::io::{Read, Seek};
let mut file = self.file.try_clone()?; // Clone the file to read from it separately
// Seek to the start position in the file
file.seek(io::SeekFrom::Start(start as u64))?;
// Read the data into the buffer
file.read_exact(&mut buffer)?;
}
Ok(OwnedBytes::new(buffer))
}
// todo implement async
}
impl HasLen for WrapFile {
fn len(&self) -> usize {
self.len
}
}
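A short usage sketch for `WrapFile`; the module paths are assumptions (the type sits next to `FileHandle` in the common crate's `file_slice` module): wrap a `std::fs::File` once, then serve arbitrary byte ranges through the `FileHandle` trait.

    use std::fs::File;
    use std::io;
    use std::ops::Range;

    use tantivy_common::file_slice::{FileHandle, WrapFile};
    use tantivy_common::OwnedBytes;

    fn read_range(path: &str, range: Range<usize>) -> io::Result<OwnedBytes> {
        let handle = WrapFile::new(File::open(path)?)?;
        handle.read_bytes(range)
    }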
#[async_trait] #[async_trait]
impl FileHandle for &'static [u8] { impl FileHandle for &'static [u8] {
fn read_bytes(&self, range: Range<usize>) -> io::Result<OwnedBytes> { fn read_bytes(&self, range: Range<usize>) -> io::Result<OwnedBytes> {
@@ -67,6 +124,30 @@ impl fmt::Debug for FileSlice {
} }
} }
impl FileSlice {
pub fn stream_file_chunks(&self) -> impl Iterator<Item = io::Result<OwnedBytes>> + '_ {
let len = self.range.end;
let mut start = self.range.start;
std::iter::from_fn(move || {
/// Returns chunks of 1MB of data from the FileHandle.
const CHUNK_SIZE: usize = 1024 * 1024; // 1MB
if start < len {
let end = (start + CHUNK_SIZE).min(len);
let range = start..end;
let chunk = self.data.read_bytes(range);
start += CHUNK_SIZE;
match chunk {
Ok(chunk) => Some(Ok(chunk)),
Err(e) => Some(Err(e)),
}
} else {
None
}
})
}
}
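And a sketch of consuming the new chunk iterator, for example to copy a `FileSlice` into any writer without materializing the whole slice; it assumes `OwnedBytes` dereferences to `&[u8]`, which is how it is used throughout this compare.

    use std::io::{self, Write};

    use tantivy_common::file_slice::FileSlice;

    fn copy_file_slice(slice: &FileSlice, mut out: impl Write) -> io::Result<()> {
        for chunk in slice.stream_file_chunks() {
            let chunk = chunk?; // each chunk is an OwnedBytes of at most 1 MB
            out.write_all(&chunk[..])?;
        }
        Ok(())
    }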
/// Takes a range, a `RangeBounds` object, and returns /// Takes a range, a `RangeBounds` object, and returns
/// a `Range` that corresponds to the relative application of the /// a `Range` that corresponds to the relative application of the
/// `RangeBounds` object to the original `Range`. /// `RangeBounds` object to the original `Range`.


@@ -27,15 +27,15 @@ pub trait GroupByIteratorExtended: Iterator {
where where
Self: Sized, Self: Sized,
F: FnMut(&Self::Item) -> K, F: FnMut(&Self::Item) -> K,
K: PartialEq + Copy, K: PartialEq + Clone,
Self::Item: Copy, Self::Item: Clone,
{ {
GroupByIterator::new(self, key) GroupByIterator::new(self, key)
} }
} }
impl<I: Iterator> GroupByIteratorExtended for I {} impl<I: Iterator> GroupByIteratorExtended for I {}
pub struct GroupByIterator<I, F, K: Copy> pub struct GroupByIterator<I, F, K: Clone>
where where
I: Iterator, I: Iterator,
F: FnMut(&I::Item) -> K, F: FnMut(&I::Item) -> K,
@@ -50,7 +50,7 @@ where
inner: Rc<RefCell<GroupByShared<I, F, K>>>, inner: Rc<RefCell<GroupByShared<I, F, K>>>,
} }
struct GroupByShared<I, F, K: Copy> struct GroupByShared<I, F, K: Clone>
where where
I: Iterator, I: Iterator,
F: FnMut(&I::Item) -> K, F: FnMut(&I::Item) -> K,
@@ -63,7 +63,7 @@ impl<I, F, K> GroupByIterator<I, F, K>
where where
I: Iterator, I: Iterator,
F: FnMut(&I::Item) -> K, F: FnMut(&I::Item) -> K,
K: Copy, K: Clone,
{ {
fn new(inner: I, group_by_fn: F) -> Self { fn new(inner: I, group_by_fn: F) -> Self {
let inner = GroupByShared { let inner = GroupByShared {
@@ -80,28 +80,28 @@ where
impl<I, F, K> Iterator for GroupByIterator<I, F, K> impl<I, F, K> Iterator for GroupByIterator<I, F, K>
where where
I: Iterator, I: Iterator,
I::Item: Copy, I::Item: Clone,
F: FnMut(&I::Item) -> K, F: FnMut(&I::Item) -> K,
K: Copy, K: Clone,
{ {
type Item = (K, GroupIterator<I, F, K>); type Item = (K, GroupIterator<I, F, K>);
fn next(&mut self) -> Option<Self::Item> { fn next(&mut self) -> Option<Self::Item> {
let mut inner = self.inner.borrow_mut(); let mut inner = self.inner.borrow_mut();
let value = *inner.iter.peek()?; let value = inner.iter.peek()?.clone();
let key = (inner.group_by_fn)(&value); let key = (inner.group_by_fn)(&value);
let inner = self.inner.clone(); let inner = self.inner.clone();
let group_iter = GroupIterator { let group_iter = GroupIterator {
inner, inner,
group_key: key, group_key: key.clone(),
}; };
Some((key, group_iter)) Some((key, group_iter))
} }
} }
pub struct GroupIterator<I, F, K: Copy> pub struct GroupIterator<I, F, K: Clone>
where where
I: Iterator, I: Iterator,
F: FnMut(&I::Item) -> K, F: FnMut(&I::Item) -> K,
@@ -110,10 +110,10 @@ where
group_key: K, group_key: K,
} }
impl<I, F, K: PartialEq + Copy> Iterator for GroupIterator<I, F, K> impl<I, F, K: PartialEq + Clone> Iterator for GroupIterator<I, F, K>
where where
I: Iterator, I: Iterator,
I::Item: Copy, I::Item: Clone,
F: FnMut(&I::Item) -> K, F: FnMut(&I::Item) -> K,
{ {
type Item = I::Item; type Item = I::Item;
@@ -121,7 +121,7 @@ where
fn next(&mut self) -> Option<Self::Item> { fn next(&mut self) -> Option<Self::Item> {
let mut inner = self.inner.borrow_mut(); let mut inner = self.inner.borrow_mut();
// peek if next value is in group // peek if next value is in group
let peek_val = *inner.iter.peek()?; let peek_val = inner.iter.peek()?.clone();
if (inner.group_by_fn)(&peek_val) == self.group_key { if (inner.group_by_fn)(&peek_val) == self.group_key {
inner.iter.next() inner.iter.next()
} else { } else {
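For illustration, what the relaxed bounds buy in practice: keys and items only need `Clone`, so owned `String`s work. This is a sketch; it assumes the extension method exposed by `GroupByIteratorExtended` is called `group_by`, matching the constructor used above.

    use tantivy_common::GroupByIteratorExtended;

    /// Groups consecutive paths by their (owned, hence non-Copy) extension string.
    fn group_by_extension(paths: Vec<String>) -> Vec<(String, Vec<String>)> {
        paths
            .into_iter()
            .group_by(|path| path.rsplit('.').next().unwrap_or("").to_string())
            .map(|(ext, group)| (ext, group.collect()))
            .collect()
    }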


@@ -0,0 +1,112 @@
use crate::replace_in_place;
/// Separates the different segments of a json path.
pub const JSON_PATH_SEGMENT_SEP: u8 = 1u8;
pub const JSON_PATH_SEGMENT_SEP_STR: &str =
unsafe { std::str::from_utf8_unchecked(&[JSON_PATH_SEGMENT_SEP]) };
/// `JsonPathWriter` builds flattened json paths for tantivy.
#[derive(Clone, Debug, Default)]
pub struct JsonPathWriter {
path: String,
indices: Vec<usize>,
expand_dots: bool,
}
impl JsonPathWriter {
pub fn new() -> Self {
JsonPathWriter {
path: String::new(),
indices: Vec::new(),
expand_dots: false,
}
}
/// When expand_dots is enabled, json object like
/// `{"k8s.node.id": 5}` is processed as if it was
/// `{"k8s": {"node": {"id": 5}}}`.
/// This option has the merit of allowing users to
/// write queries like `k8s.node.id:5`.
/// On the other hand, enabling that feature can lead to
/// ambiguity.
#[inline]
pub fn set_expand_dots(&mut self, expand_dots: bool) {
self.expand_dots = expand_dots;
}
/// Push a new segment to the path.
#[inline]
pub fn push(&mut self, segment: &str) {
let len_path = self.path.len();
self.indices.push(len_path);
if !self.path.is_empty() {
self.path.push_str(JSON_PATH_SEGMENT_SEP_STR);
}
self.path.push_str(segment);
if self.expand_dots {
// This might include the separation byte, which is ok because it is not a dot.
let appended_segment = &mut self.path[len_path..];
// The unsafe below is safe as long as b'.' and JSON_PATH_SEGMENT_SEP are
// valid single byte utf8 strings.
// By utf-8 design, they cannot be part of another codepoint.
unsafe {
replace_in_place(b'.', JSON_PATH_SEGMENT_SEP, appended_segment.as_bytes_mut())
};
}
}
/// Remove the last segment. Does nothing if the path is empty.
#[inline]
pub fn pop(&mut self) {
if let Some(last_idx) = self.indices.pop() {
self.path.truncate(last_idx);
}
}
/// Clear the path.
#[inline]
pub fn clear(&mut self) {
self.path.clear();
self.indices.clear();
}
/// Get the current path.
#[inline]
pub fn as_str(&self) -> &str {
&self.path
}
}
impl From<JsonPathWriter> for String {
#[inline]
fn from(value: JsonPathWriter) -> Self {
value.path
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn json_path_writer_test() {
let mut writer = JsonPathWriter::new();
writer.push("root");
assert_eq!(writer.as_str(), "root");
writer.push("child");
assert_eq!(writer.as_str(), "root\u{1}child");
writer.pop();
assert_eq!(writer.as_str(), "root");
writer.push("k8s.node.id");
assert_eq!(writer.as_str(), "root\u{1}k8s.node.id");
writer.set_expand_dots(true);
writer.pop();
writer.push("k8s.node.id");
assert_eq!(writer.as_str(), "root\u{1}k8s\u{1}node\u{1}id");
}
}
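As a usage sketch, a `JsonPathWriter` can be threaded through a recursive walk of a JSON object to produce the flattened paths tantivy indexes (serde_json is used here purely for illustration and is not implied by this change):

    use serde_json::Value;
    use tantivy_common::JsonPathWriter;

    /// Collects the flattened path of every leaf value in `value`.
    fn collect_leaf_paths(value: &Value, path: &mut JsonPathWriter, out: &mut Vec<String>) {
        match value {
            Value::Object(map) => {
                for (key, child) in map {
                    path.push(key);
                    collect_leaf_paths(child, path, out);
                    path.pop();
                }
            }
            _ => out.push(path.as_str().to_string()),
        }
    }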


@@ -9,6 +9,7 @@ mod byte_count;
mod datetime; mod datetime;
pub mod file_slice; pub mod file_slice;
mod group_by; mod group_by;
mod json_path_writer;
mod serialize; mod serialize;
mod vint; mod vint;
mod writer; mod writer;
@@ -18,6 +19,7 @@ pub use byte_count::ByteCount;
pub use datetime::DatePrecision; pub use datetime::DatePrecision;
pub use datetime::{DateTime, DateTimePrecision}; pub use datetime::{DateTime, DateTimePrecision};
pub use group_by::GroupByIteratorExtended; pub use group_by::GroupByIteratorExtended;
pub use json_path_writer::JsonPathWriter;
pub use ownedbytes::{OwnedBytes, StableDeref}; pub use ownedbytes::{OwnedBytes, StableDeref};
pub use serialize::{BinarySerializable, DeserializeFrom, FixedSize}; pub use serialize::{BinarySerializable, DeserializeFrom, FixedSize};
pub use vint::{ pub use vint::{
@@ -116,6 +118,7 @@ pub fn u64_to_f64(val: u64) -> f64 {
/// ///
/// This function assumes that the needle is rarely contained in the bytes string /// This function assumes that the needle is rarely contained in the bytes string
/// and offers a fast path if the needle is not present. /// and offers a fast path if the needle is not present.
#[inline]
pub fn replace_in_place(needle: u8, replacement: u8, bytes: &mut [u8]) { pub fn replace_in_place(needle: u8, replacement: u8, bytes: &mut [u8]) {
if !bytes.contains(&needle) { if !bytes.contains(&needle) {
return; return;


@@ -1,3 +1,4 @@
+use std::borrow::Cow;
 use std::io::{Read, Write};
 use std::{fmt, io};
@@ -249,6 +250,43 @@ impl BinarySerializable for String {
     }
 }

+impl<'a> BinarySerializable for Cow<'a, str> {
+    fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
+        let data: &[u8] = self.as_bytes();
+        VInt(data.len() as u64).serialize(writer)?;
+        writer.write_all(data)
+    }
+
+    fn deserialize<R: Read>(reader: &mut R) -> io::Result<Cow<'a, str>> {
+        let string_length = VInt::deserialize(reader)?.val() as usize;
+        let mut result = String::with_capacity(string_length);
+        reader
+            .take(string_length as u64)
+            .read_to_string(&mut result)?;
+        Ok(Cow::Owned(result))
+    }
+}
+
+impl<'a> BinarySerializable for Cow<'a, [u8]> {
+    fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
+        VInt(self.len() as u64).serialize(writer)?;
+        for it in self.iter() {
+            it.serialize(writer)?;
+        }
+        Ok(())
+    }
+
+    fn deserialize<R: Read>(reader: &mut R) -> io::Result<Cow<'a, [u8]>> {
+        let num_items = VInt::deserialize(reader)?.val();
+        let mut items: Vec<u8> = Vec::with_capacity(num_items as usize);
+        for _ in 0..num_items {
+            let item = u8::deserialize(reader)?;
+            items.push(item);
+        }
+        Ok(Cow::Owned(items))
+    }
+}
+
 #[cfg(test)]
 pub mod test {
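
A quick round-trip sketch for the new `Cow<'_, str>` impl; the `tantivy_common` crate path is an assumption (the trait lives in this workspace's `common` crate):

use std::borrow::Cow;
use tantivy_common::BinarySerializable; // path assumed

fn cow_roundtrip() -> std::io::Result<()> {
    let original: Cow<str> = Cow::Borrowed("hello");
    let mut buffer: Vec<u8> = Vec::new();
    original.serialize(&mut buffer)?;        // writes a VInt length prefix, then the raw bytes
    let restored = Cow::<str>::deserialize(&mut &buffer[..])?;
    assert_eq!(restored, original);          // deserialization always returns Cow::Owned
    Ok(())
}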


@@ -12,7 +12,7 @@ use tantivy::aggregation::agg_result::AggregationResults;
 use tantivy::aggregation::AggregationCollector;
 use tantivy::query::AllQuery;
 use tantivy::schema::{self, IndexRecordOption, Schema, TextFieldIndexing, FAST};
-use tantivy::Index;
+use tantivy::{Index, IndexWriter, TantivyDocument};

 fn main() -> tantivy::Result<()> {
     // # Create Schema
@@ -132,10 +132,10 @@ fn main() -> tantivy::Result<()> {
     let stream = Deserializer::from_str(data).into_iter::<Value>();

-    let mut index_writer = index.writer(50_000_000)?;
+    let mut index_writer: IndexWriter = index.writer(50_000_000)?;
     let mut num_indexed = 0;
     for value in stream {
-        let doc = schema.parse_document(&serde_json::to_string(&value.unwrap())?)?;
+        let doc = TantivyDocument::parse_json(&schema, &serde_json::to_string(&value.unwrap())?)?;
         index_writer.add_document(doc)?;
         num_indexed += 1;
         if num_indexed > 4 {
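
This hunk shows the two API migrations that repeat throughout the example diffs below: `index.writer(...)` is now generic over the document type (hence the explicit `IndexWriter` annotation), and JSON parsing moved from `Schema::parse_document` onto the document type. A condensed sketch of the new calls (field name and budget are illustrative):

use tantivy::schema::{Schema, STORED, TEXT};
use tantivy::{Index, IndexWriter, TantivyDocument};

fn index_one_json_doc() -> tantivy::Result<()> {
    let mut schema_builder = Schema::builder();
    schema_builder.add_text_field("title", TEXT | STORED);
    let schema = schema_builder.build();
    let index = Index::create_in_ram(schema.clone());

    // The writer is generic over the document type, so name it explicitly.
    let mut index_writer: IndexWriter = index.writer(50_000_000)?;
    // 0.21: schema.parse_document(json)?  ->  0.22: the document type parses itself.
    let doc = TantivyDocument::parse_json(&schema, r#"{"title": "Of Mice and Men"}"#)?;
    index_writer.add_document(doc)?;
    index_writer.commit()?;
    Ok(())
}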


@@ -15,7 +15,7 @@
use tantivy::collector::TopDocs; use tantivy::collector::TopDocs;
use tantivy::query::QueryParser; use tantivy::query::QueryParser;
use tantivy::schema::*; use tantivy::schema::*;
use tantivy::{doc, Index, ReloadPolicy}; use tantivy::{doc, Index, IndexWriter, ReloadPolicy};
use tempfile::TempDir; use tempfile::TempDir;
fn main() -> tantivy::Result<()> { fn main() -> tantivy::Result<()> {
@@ -75,7 +75,7 @@ fn main() -> tantivy::Result<()> {
// Here we give tantivy a budget of `50MB`. // Here we give tantivy a budget of `50MB`.
// Using a bigger memory_arena for the indexer may increase // Using a bigger memory_arena for the indexer may increase
// throughput, but 50 MB is already plenty. // throughput, but 50 MB is already plenty.
let mut index_writer = index.writer(50_000_000)?; let mut index_writer: IndexWriter = index.writer(50_000_000)?;
// Let's index our documents! // Let's index our documents!
// We first need a handle on the title and the body field. // We first need a handle on the title and the body field.
@@ -87,7 +87,7 @@ fn main() -> tantivy::Result<()> {
let title = schema.get_field("title").unwrap(); let title = schema.get_field("title").unwrap();
let body = schema.get_field("body").unwrap(); let body = schema.get_field("body").unwrap();
let mut old_man_doc = Document::default(); let mut old_man_doc = TantivyDocument::default();
old_man_doc.add_text(title, "The Old Man and the Sea"); old_man_doc.add_text(title, "The Old Man and the Sea");
old_man_doc.add_text( old_man_doc.add_text(
body, body,
@@ -164,7 +164,7 @@ fn main() -> tantivy::Result<()> {
// will reload the index automatically after each commit. // will reload the index automatically after each commit.
let reader = index let reader = index
.reader_builder() .reader_builder()
.reload_policy(ReloadPolicy::OnCommit) .reload_policy(ReloadPolicy::OnCommitWithDelay)
.try_into()?; .try_into()?;
// We now need to acquire a searcher. // We now need to acquire a searcher.
@@ -217,9 +217,23 @@ fn main() -> tantivy::Result<()> {
// the document returned will only contain // the document returned will only contain
// a title. // a title.
for (_score, doc_address) in top_docs { for (_score, doc_address) in top_docs {
let retrieved_doc = searcher.doc(doc_address)?; let retrieved_doc: TantivyDocument = searcher.doc(doc_address)?;
println!("{}", schema.to_json(&retrieved_doc)); println!("{}", retrieved_doc.to_json(&schema));
} }
// We can also get an explanation to understand
// how a found document got its score.
let query = query_parser.parse_query("title:sea^20 body:whale^70")?;
let (_score, doc_address) = searcher
.search(&query, &TopDocs::with_limit(1))?
.into_iter()
.next()
.unwrap();
let explanation = query.explain(&searcher, doc_address)?;
println!("{}", explanation.to_pretty_json());
Ok(()) Ok(())
} }
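
The retrieval side changes the same way across these examples: `Searcher::doc` is now generic over the document type, and JSON rendering moved from `Schema::to_json(&doc)` to `doc.to_json(&schema)`. A small sketch of the new shape (the helper name is ours, not tantivy's):

use tantivy::schema::Schema;
use tantivy::{DocAddress, Searcher, TantivyDocument};

fn print_hit(searcher: &Searcher, schema: &Schema, addr: DocAddress) -> tantivy::Result<()> {
    // The caller picks the concrete document type to decode into.
    let doc: TantivyDocument = searcher.doc(addr)?;
    println!("{}", doc.to_json(schema));
    Ok(())
}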


@@ -13,7 +13,7 @@ use columnar::Column;
 use tantivy::collector::{Collector, SegmentCollector};
 use tantivy::query::QueryParser;
 use tantivy::schema::{Schema, FAST, INDEXED, TEXT};
-use tantivy::{doc, Index, Score, SegmentReader};
+use tantivy::{doc, Index, IndexWriter, Score, SegmentReader};

 #[derive(Default)]
 struct Stats {
@@ -142,7 +142,7 @@ fn main() -> tantivy::Result<()> {
     // this example.
     let index = Index::create_in_ram(schema);

-    let mut index_writer = index.writer(50_000_000)?;
+    let mut index_writer: IndexWriter = index.writer(50_000_000)?;
     index_writer.add_document(doc!(
         product_name => "Super Broom 2000",
         product_description => "While it is ok for short distance travel, this broom \


@@ -6,7 +6,7 @@ use tantivy::collector::TopDocs;
use tantivy::query::QueryParser; use tantivy::query::QueryParser;
use tantivy::schema::*; use tantivy::schema::*;
use tantivy::tokenizer::NgramTokenizer; use tantivy::tokenizer::NgramTokenizer;
use tantivy::{doc, Index}; use tantivy::{doc, Index, IndexWriter};
fn main() -> tantivy::Result<()> { fn main() -> tantivy::Result<()> {
// # Defining the schema // # Defining the schema
@@ -62,7 +62,7 @@ fn main() -> tantivy::Result<()> {
// //
// Here we use a buffer of 50MB per thread. Using a bigger // Here we use a buffer of 50MB per thread. Using a bigger
// memory arena for the indexer can increase its throughput. // memory arena for the indexer can increase its throughput.
let mut index_writer = index.writer(50_000_000)?; let mut index_writer: IndexWriter = index.writer(50_000_000)?;
index_writer.add_document(doc!( index_writer.add_document(doc!(
title => "The Old Man and the Sea", title => "The Old Man and the Sea",
body => "He was an old man who fished alone in a skiff in the Gulf Stream and \ body => "He was an old man who fished alone in a skiff in the Gulf Stream and \
@@ -103,8 +103,8 @@ fn main() -> tantivy::Result<()> {
let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?; let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
for (_, doc_address) in top_docs { for (_, doc_address) in top_docs {
let retrieved_doc = searcher.doc(doc_address)?; let retrieved_doc: TantivyDocument = searcher.doc(doc_address)?;
println!("{}", schema.to_json(&retrieved_doc)); println!("{}", retrieved_doc.to_json(&schema));
} }
Ok(()) Ok(())


@@ -4,8 +4,8 @@
use tantivy::collector::TopDocs; use tantivy::collector::TopDocs;
use tantivy::query::QueryParser; use tantivy::query::QueryParser;
use tantivy::schema::{DateOptions, Schema, Value, INDEXED, STORED, STRING}; use tantivy::schema::{DateOptions, Document, OwnedValue, Schema, INDEXED, STORED, STRING};
use tantivy::Index; use tantivy::{Index, IndexWriter, TantivyDocument};
fn main() -> tantivy::Result<()> { fn main() -> tantivy::Result<()> {
// # Defining the schema // # Defining the schema
@@ -22,16 +22,18 @@ fn main() -> tantivy::Result<()> {
// # Indexing documents // # Indexing documents
let index = Index::create_in_ram(schema.clone()); let index = Index::create_in_ram(schema.clone());
let mut index_writer = index.writer(50_000_000)?; let mut index_writer: IndexWriter = index.writer(50_000_000)?;
// The dates are passed as string in the RFC3339 format // The dates are passed as string in the RFC3339 format
let doc = schema.parse_document( let doc = TantivyDocument::parse_json(
&schema,
r#"{ r#"{
"occurred_at": "2022-06-22T12:53:50.53Z", "occurred_at": "2022-06-22T12:53:50.53Z",
"event": "pull-request" "event": "pull-request"
}"#, }"#,
)?; )?;
index_writer.add_document(doc)?; index_writer.add_document(doc)?;
let doc = schema.parse_document( let doc = TantivyDocument::parse_json(
&schema,
r#"{ r#"{
"occurred_at": "2022-06-22T13:00:00.22Z", "occurred_at": "2022-06-22T13:00:00.22Z",
"event": "comment" "event": "comment"
@@ -58,13 +60,13 @@ fn main() -> tantivy::Result<()> {
let count_docs = searcher.search(&*query, &TopDocs::with_limit(4))?; let count_docs = searcher.search(&*query, &TopDocs::with_limit(4))?;
assert_eq!(count_docs.len(), 1); assert_eq!(count_docs.len(), 1);
for (_score, doc_address) in count_docs { for (_score, doc_address) in count_docs {
let retrieved_doc = searcher.doc(doc_address)?; let retrieved_doc = searcher.doc::<TantivyDocument>(doc_address)?;
assert!(matches!( assert!(matches!(
retrieved_doc.get_first(occurred_at), retrieved_doc.get_first(occurred_at),
Some(Value::Date(_)) Some(OwnedValue::Date(_))
)); ));
assert_eq!( assert_eq!(
schema.to_json(&retrieved_doc), retrieved_doc.to_json(&schema),
r#"{"event":["comment"],"occurred_at":["2022-06-22T13:00:00.22Z"]}"# r#"{"event":["comment"],"occurred_at":["2022-06-22T13:00:00.22Z"]}"#
); );
} }


@@ -11,7 +11,7 @@
use tantivy::collector::TopDocs; use tantivy::collector::TopDocs;
use tantivy::query::TermQuery; use tantivy::query::TermQuery;
use tantivy::schema::*; use tantivy::schema::*;
use tantivy::{doc, Index, IndexReader}; use tantivy::{doc, Index, IndexReader, IndexWriter};
// A simple helper function to fetch a single document // A simple helper function to fetch a single document
// given its id from our index. // given its id from our index.
@@ -19,7 +19,7 @@ use tantivy::{doc, Index, IndexReader};
fn extract_doc_given_isbn( fn extract_doc_given_isbn(
reader: &IndexReader, reader: &IndexReader,
isbn_term: &Term, isbn_term: &Term,
) -> tantivy::Result<Option<Document>> { ) -> tantivy::Result<Option<TantivyDocument>> {
let searcher = reader.searcher(); let searcher = reader.searcher();
// This is the simplest query you can think of. // This is the simplest query you can think of.
@@ -69,10 +69,10 @@ fn main() -> tantivy::Result<()> {
let index = Index::create_in_ram(schema.clone()); let index = Index::create_in_ram(schema.clone());
let mut index_writer = index.writer(50_000_000)?; let mut index_writer: IndexWriter = index.writer(50_000_000)?;
// Let's add a couple of documents, for the sake of the example. // Let's add a couple of documents, for the sake of the example.
let mut old_man_doc = Document::default(); let mut old_man_doc = TantivyDocument::default();
old_man_doc.add_text(title, "The Old Man and the Sea"); old_man_doc.add_text(title, "The Old Man and the Sea");
index_writer.add_document(doc!( index_writer.add_document(doc!(
isbn => "978-0099908401", isbn => "978-0099908401",
@@ -94,7 +94,7 @@ fn main() -> tantivy::Result<()> {
// Oops our frankenstein doc seems misspelled // Oops our frankenstein doc seems misspelled
let frankenstein_doc_misspelled = extract_doc_given_isbn(&reader, &frankenstein_isbn)?.unwrap(); let frankenstein_doc_misspelled = extract_doc_given_isbn(&reader, &frankenstein_isbn)?.unwrap();
assert_eq!( assert_eq!(
schema.to_json(&frankenstein_doc_misspelled), frankenstein_doc_misspelled.to_json(&schema),
r#"{"isbn":["978-9176370711"],"title":["Frankentein"]}"#, r#"{"isbn":["978-9176370711"],"title":["Frankentein"]}"#,
); );
@@ -136,7 +136,7 @@ fn main() -> tantivy::Result<()> {
// No more typo! // No more typo!
let frankenstein_new_doc = extract_doc_given_isbn(&reader, &frankenstein_isbn)?.unwrap(); let frankenstein_new_doc = extract_doc_given_isbn(&reader, &frankenstein_isbn)?.unwrap();
assert_eq!( assert_eq!(
schema.to_json(&frankenstein_new_doc), frankenstein_new_doc.to_json(&schema),
r#"{"isbn":["978-9176370711"],"title":["Frankenstein"]}"#, r#"{"isbn":["978-9176370711"],"title":["Frankenstein"]}"#,
); );


@@ -17,7 +17,7 @@
use tantivy::collector::FacetCollector; use tantivy::collector::FacetCollector;
use tantivy::query::{AllQuery, TermQuery}; use tantivy::query::{AllQuery, TermQuery};
use tantivy::schema::*; use tantivy::schema::*;
use tantivy::{doc, Index}; use tantivy::{doc, Index, IndexWriter};
fn main() -> tantivy::Result<()> { fn main() -> tantivy::Result<()> {
// Let's create a temporary directory for the sake of this example // Let's create a temporary directory for the sake of this example
@@ -30,7 +30,7 @@ fn main() -> tantivy::Result<()> {
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer(30_000_000)?; let mut index_writer: IndexWriter = index.writer(30_000_000)?;
// For convenience, tantivy also comes with a macro to // For convenience, tantivy also comes with a macro to
// reduce the boilerplate above. // reduce the boilerplate above.


@@ -12,7 +12,7 @@ use std::collections::HashSet;
use tantivy::collector::TopDocs; use tantivy::collector::TopDocs;
use tantivy::query::BooleanQuery; use tantivy::query::BooleanQuery;
use tantivy::schema::*; use tantivy::schema::*;
use tantivy::{doc, DocId, Index, Score, SegmentReader}; use tantivy::{doc, DocId, Index, IndexWriter, Score, SegmentReader};
fn main() -> tantivy::Result<()> { fn main() -> tantivy::Result<()> {
let mut schema_builder = Schema::builder(); let mut schema_builder = Schema::builder();
@@ -23,7 +23,7 @@ fn main() -> tantivy::Result<()> {
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer(30_000_000)?; let mut index_writer: IndexWriter = index.writer(30_000_000)?;
index_writer.add_document(doc!( index_writer.add_document(doc!(
title => "Fried egg", title => "Fried egg",
@@ -91,11 +91,10 @@ fn main() -> tantivy::Result<()> {
.iter() .iter()
.map(|(_, doc_id)| { .map(|(_, doc_id)| {
searcher searcher
.doc(*doc_id) .doc::<TantivyDocument>(*doc_id)
.unwrap() .unwrap()
.get_first(title) .get_first(title)
.unwrap() .and_then(|v| v.as_str())
.as_text()
.unwrap() .unwrap()
.to_owned() .to_owned()
}) })


@@ -14,7 +14,7 @@
use tantivy::collector::{Count, TopDocs}; use tantivy::collector::{Count, TopDocs};
use tantivy::query::FuzzyTermQuery; use tantivy::query::FuzzyTermQuery;
use tantivy::schema::*; use tantivy::schema::*;
use tantivy::{doc, Index, ReloadPolicy}; use tantivy::{doc, Index, IndexWriter, ReloadPolicy};
use tempfile::TempDir; use tempfile::TempDir;
fn main() -> tantivy::Result<()> { fn main() -> tantivy::Result<()> {
@@ -66,7 +66,7 @@ fn main() -> tantivy::Result<()> {
// Here we give tantivy a budget of `50MB`. // Here we give tantivy a budget of `50MB`.
// Using a bigger memory_arena for the indexer may increase // Using a bigger memory_arena for the indexer may increase
// throughput, but 50 MB is already plenty. // throughput, but 50 MB is already plenty.
let mut index_writer = index.writer(50_000_000)?; let mut index_writer: IndexWriter = index.writer(50_000_000)?;
// Let's index our documents! // Let's index our documents!
// We first need a handle on the title and the body field. // We first need a handle on the title and the body field.
@@ -123,7 +123,7 @@ fn main() -> tantivy::Result<()> {
// will reload the index automatically after each commit. // will reload the index automatically after each commit.
let reader = index let reader = index
.reader_builder() .reader_builder()
.reload_policy(ReloadPolicy::OnCommit) .reload_policy(ReloadPolicy::OnCommitWithDelay)
.try_into()?; .try_into()?;
// We now need to acquire a searcher. // We now need to acquire a searcher.
@@ -151,10 +151,10 @@ fn main() -> tantivy::Result<()> {
assert_eq!(count, 3); assert_eq!(count, 3);
assert_eq!(top_docs.len(), 3); assert_eq!(top_docs.len(), 3);
for (score, doc_address) in top_docs { for (score, doc_address) in top_docs {
let retrieved_doc = searcher.doc(doc_address)?;
// Note that the score is not lower for the fuzzy hit. // Note that the score is not lower for the fuzzy hit.
// There's an issue open for that: https://github.com/quickwit-oss/tantivy/issues/563 // There's an issue open for that: https://github.com/quickwit-oss/tantivy/issues/563
println!("score {score:?} doc {}", schema.to_json(&retrieved_doc)); let retrieved_doc: TantivyDocument = searcher.doc(doc_address)?;
println!("score {score:?} doc {}", retrieved_doc.to_json(&schema));
// score 1.0 doc {"title":["The Diary of Muadib"]} // score 1.0 doc {"title":["The Diary of Muadib"]}
// //
// score 1.0 doc {"title":["The Diary of a Young Girl"]} // score 1.0 doc {"title":["The Diary of a Young Girl"]}


@@ -21,7 +21,7 @@ fn main() -> tantivy::Result<()> {
}"#; }"#;
// We can parse our document // We can parse our document
let _mice_and_men_doc = schema.parse_document(mice_and_men_doc_json)?; let _mice_and_men_doc = TantivyDocument::parse_json(&schema, mice_and_men_doc_json)?;
// Multi-valued field are allowed, they are // Multi-valued field are allowed, they are
// expressed in JSON by an array. // expressed in JSON by an array.
@@ -30,7 +30,7 @@ fn main() -> tantivy::Result<()> {
"title": ["Frankenstein", "The Modern Prometheus"], "title": ["Frankenstein", "The Modern Prometheus"],
"year": 1818 "year": 1818
}"#; }"#;
let _frankenstein_doc = schema.parse_document(frankenstein_json)?; let _frankenstein_doc = TantivyDocument::parse_json(&schema, frankenstein_json)?;
// Note that the schema is saved in your index directory. // Note that the schema is saved in your index directory.
// //


@@ -5,7 +5,7 @@
use tantivy::collector::Count; use tantivy::collector::Count;
use tantivy::query::RangeQuery; use tantivy::query::RangeQuery;
use tantivy::schema::{Schema, INDEXED}; use tantivy::schema::{Schema, INDEXED};
use tantivy::{doc, Index, Result}; use tantivy::{doc, Index, IndexWriter, Result};
fn main() -> Result<()> { fn main() -> Result<()> {
// For the sake of simplicity, this schema will only have 1 field // For the sake of simplicity, this schema will only have 1 field
@@ -17,7 +17,7 @@ fn main() -> Result<()> {
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let reader = index.reader()?; let reader = index.reader()?;
{ {
let mut index_writer = index.writer_with_num_threads(1, 6_000_000)?; let mut index_writer: IndexWriter = index.writer_with_num_threads(1, 6_000_000)?;
for year in 1950u64..2019u64 { for year in 1950u64..2019u64 {
index_writer.add_document(doc!(year_field => year))?; index_writer.add_document(doc!(year_field => year))?;
} }


@@ -6,7 +6,7 @@
use tantivy::collector::{Count, TopDocs}; use tantivy::collector::{Count, TopDocs};
use tantivy::query::QueryParser; use tantivy::query::QueryParser;
use tantivy::schema::{Schema, FAST, INDEXED, STORED, STRING}; use tantivy::schema::{Schema, FAST, INDEXED, STORED, STRING};
use tantivy::Index; use tantivy::{Index, IndexWriter, TantivyDocument};
fn main() -> tantivy::Result<()> { fn main() -> tantivy::Result<()> {
// # Defining the schema // # Defining the schema
@@ -22,20 +22,22 @@ fn main() -> tantivy::Result<()> {
// # Indexing documents // # Indexing documents
let index = Index::create_in_ram(schema.clone()); let index = Index::create_in_ram(schema.clone());
let mut index_writer = index.writer(50_000_000)?; let mut index_writer: IndexWriter = index.writer(50_000_000)?;
// ### IPv4 // ### IPv4
// Adding documents that contain an IPv4 address. Notice that the IP addresses are passed as // Adding documents that contain an IPv4 address. Notice that the IP addresses are passed as
// `String`. Since the field is of type ip, we parse the IP address from the string and store it // `String`. Since the field is of type ip, we parse the IP address from the string and store it
// internally as IPv6. // internally as IPv6.
let doc = schema.parse_document( let doc = TantivyDocument::parse_json(
&schema,
r#"{ r#"{
"ip": "192.168.0.33", "ip": "192.168.0.33",
"event_type": "login" "event_type": "login"
}"#, }"#,
)?; )?;
index_writer.add_document(doc)?; index_writer.add_document(doc)?;
let doc = schema.parse_document( let doc = TantivyDocument::parse_json(
&schema,
r#"{ r#"{
"ip": "192.168.0.80", "ip": "192.168.0.80",
"event_type": "checkout" "event_type": "checkout"
@@ -44,7 +46,8 @@ fn main() -> tantivy::Result<()> {
index_writer.add_document(doc)?; index_writer.add_document(doc)?;
// ### IPv6 // ### IPv6
// Adding a document that contains an IPv6 address. // Adding a document that contains an IPv6 address.
let doc = schema.parse_document( let doc = TantivyDocument::parse_json(
&schema,
r#"{ r#"{
"ip": "2001:0db8:85a3:0000:0000:8a2e:0370:7334", "ip": "2001:0db8:85a3:0000:0000:8a2e:0370:7334",
"event_type": "checkout" "event_type": "checkout"


@@ -10,7 +10,7 @@
// --- // ---
// Importing tantivy... // Importing tantivy...
use tantivy::schema::*; use tantivy::schema::*;
use tantivy::{doc, DocSet, Index, Postings, TERMINATED}; use tantivy::{doc, DocSet, Index, IndexWriter, Postings, TERMINATED};
fn main() -> tantivy::Result<()> { fn main() -> tantivy::Result<()> {
// We first create a schema for the sake of the // We first create a schema for the sake of the
@@ -24,7 +24,7 @@ fn main() -> tantivy::Result<()> {
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_with_num_threads(1, 50_000_000)?; let mut index_writer: IndexWriter = index.writer_with_num_threads(1, 50_000_000)?;
index_writer.add_document(doc!(title => "The Old Man and the Sea"))?; index_writer.add_document(doc!(title => "The Old Man and the Sea"))?;
index_writer.add_document(doc!(title => "Of Mice and Men"))?; index_writer.add_document(doc!(title => "Of Mice and Men"))?;
index_writer.add_document(doc!(title => "The modern Promotheus"))?; index_writer.add_document(doc!(title => "The modern Promotheus"))?;


@@ -7,7 +7,7 @@
use tantivy::collector::{Count, TopDocs}; use tantivy::collector::{Count, TopDocs};
use tantivy::query::QueryParser; use tantivy::query::QueryParser;
use tantivy::schema::{Schema, FAST, STORED, STRING, TEXT}; use tantivy::schema::{Schema, FAST, STORED, STRING, TEXT};
use tantivy::Index; use tantivy::{Index, IndexWriter, TantivyDocument};
fn main() -> tantivy::Result<()> { fn main() -> tantivy::Result<()> {
// # Defining the schema // # Defining the schema
@@ -20,8 +20,9 @@ fn main() -> tantivy::Result<()> {
// # Indexing documents // # Indexing documents
let index = Index::create_in_ram(schema.clone()); let index = Index::create_in_ram(schema.clone());
let mut index_writer = index.writer(50_000_000)?; let mut index_writer: IndexWriter = index.writer(50_000_000)?;
let doc = schema.parse_document( let doc = TantivyDocument::parse_json(
&schema,
r#"{ r#"{
"timestamp": "2022-02-22T23:20:50.53Z", "timestamp": "2022-02-22T23:20:50.53Z",
"event_type": "click", "event_type": "click",
@@ -33,7 +34,8 @@ fn main() -> tantivy::Result<()> {
}"#, }"#,
)?; )?;
index_writer.add_document(doc)?; index_writer.add_document(doc)?;
let doc = schema.parse_document( let doc = TantivyDocument::parse_json(
&schema,
r#"{ r#"{
"timestamp": "2022-02-22T23:20:51.53Z", "timestamp": "2022-02-22T23:20:51.53Z",
"event_type": "click", "event_type": "click",


@@ -1,7 +1,7 @@
use tantivy::collector::TopDocs; use tantivy::collector::TopDocs;
use tantivy::query::QueryParser; use tantivy::query::QueryParser;
use tantivy::schema::*; use tantivy::schema::*;
use tantivy::{doc, Index, ReloadPolicy, Result}; use tantivy::{doc, Index, IndexWriter, ReloadPolicy, Result};
use tempfile::TempDir; use tempfile::TempDir;
fn main() -> Result<()> { fn main() -> Result<()> {
@@ -17,7 +17,7 @@ fn main() -> Result<()> {
let index = Index::create_in_dir(&index_path, schema)?; let index = Index::create_in_dir(&index_path, schema)?;
let mut index_writer = index.writer(50_000_000)?; let mut index_writer: IndexWriter = index.writer(50_000_000)?;
index_writer.add_document(doc!( index_writer.add_document(doc!(
title => "The Old Man and the Sea", title => "The Old Man and the Sea",
@@ -51,7 +51,7 @@ fn main() -> Result<()> {
let reader = index let reader = index
.reader_builder() .reader_builder()
.reload_policy(ReloadPolicy::OnCommit) .reload_policy(ReloadPolicy::OnCommitWithDelay)
.try_into()?; .try_into()?;
let searcher = reader.searcher(); let searcher = reader.searcher();
@@ -67,8 +67,12 @@ fn main() -> Result<()> {
let mut titles = top_docs let mut titles = top_docs
.into_iter() .into_iter()
.map(|(_score, doc_address)| { .map(|(_score, doc_address)| {
let doc = searcher.doc(doc_address)?; let doc = searcher.doc::<TantivyDocument>(doc_address)?;
let title = doc.get_first(title).unwrap().as_text().unwrap().to_owned(); let title = doc
.get_first(title)
.and_then(|v| v.as_str())
.unwrap()
.to_owned();
Ok(title) Ok(title)
}) })
.collect::<Result<Vec<_>>>()?; .collect::<Result<Vec<_>>>()?;


@@ -13,7 +13,7 @@ use tantivy::collector::{Count, TopDocs};
use tantivy::query::TermQuery; use tantivy::query::TermQuery;
use tantivy::schema::*; use tantivy::schema::*;
use tantivy::tokenizer::{PreTokenizedString, SimpleTokenizer, Token, TokenStream, Tokenizer}; use tantivy::tokenizer::{PreTokenizedString, SimpleTokenizer, Token, TokenStream, Tokenizer};
use tantivy::{doc, Index, ReloadPolicy}; use tantivy::{doc, Index, IndexWriter, ReloadPolicy};
use tempfile::TempDir; use tempfile::TempDir;
fn pre_tokenize_text(text: &str) -> Vec<Token> { fn pre_tokenize_text(text: &str) -> Vec<Token> {
@@ -38,7 +38,7 @@ fn main() -> tantivy::Result<()> {
let index = Index::create_in_dir(&index_path, schema.clone())?; let index = Index::create_in_dir(&index_path, schema.clone())?;
let mut index_writer = index.writer(50_000_000)?; let mut index_writer: IndexWriter = index.writer(50_000_000)?;
// We can create a document manually, by setting the fields // We can create a document manually, by setting the fields
// one by one in a Document object. // one by one in a Document object.
@@ -83,7 +83,7 @@ fn main() -> tantivy::Result<()> {
}] }]
}"#; }"#;
let short_man_doc = schema.parse_document(short_man_json)?; let short_man_doc = TantivyDocument::parse_json(&schema, short_man_json)?;
index_writer.add_document(short_man_doc)?; index_writer.add_document(short_man_doc)?;
@@ -94,7 +94,7 @@ fn main() -> tantivy::Result<()> {
let reader = index let reader = index
.reader_builder() .reader_builder()
.reload_policy(ReloadPolicy::OnCommit) .reload_policy(ReloadPolicy::OnCommitWithDelay)
.try_into()?; .try_into()?;
let searcher = reader.searcher(); let searcher = reader.searcher();
@@ -115,8 +115,8 @@ fn main() -> tantivy::Result<()> {
// Note that the tokens are not stored along with the original text // Note that the tokens are not stored along with the original text
// in the document store // in the document store
for (_score, doc_address) in top_docs { for (_score, doc_address) in top_docs {
let retrieved_doc = searcher.doc(doc_address)?; let retrieved_doc: TantivyDocument = searcher.doc(doc_address)?;
println!("Document: {}", schema.to_json(&retrieved_doc)); println!("{}", retrieved_doc.to_json(&schema));
} }
// In contrary to the previous query, when we search for the "man" term we // In contrary to the previous query, when we search for the "man" term we


@@ -10,7 +10,8 @@
use tantivy::collector::TopDocs; use tantivy::collector::TopDocs;
use tantivy::query::QueryParser; use tantivy::query::QueryParser;
use tantivy::schema::*; use tantivy::schema::*;
use tantivy::{doc, Index, Snippet, SnippetGenerator}; use tantivy::snippet::{Snippet, SnippetGenerator};
use tantivy::{doc, Index, IndexWriter};
use tempfile::TempDir; use tempfile::TempDir;
fn main() -> tantivy::Result<()> { fn main() -> tantivy::Result<()> {
@@ -27,7 +28,7 @@ fn main() -> tantivy::Result<()> {
// # Indexing documents // # Indexing documents
let index = Index::create_in_dir(&index_path, schema)?; let index = Index::create_in_dir(&index_path, schema)?;
let mut index_writer = index.writer(50_000_000)?; let mut index_writer: IndexWriter = index.writer(50_000_000)?;
// we'll only need one doc for this example. // we'll only need one doc for this example.
index_writer.add_document(doc!( index_writer.add_document(doc!(
@@ -54,13 +55,10 @@ fn main() -> tantivy::Result<()> {
let snippet_generator = SnippetGenerator::create(&searcher, &*query, body)?; let snippet_generator = SnippetGenerator::create(&searcher, &*query, body)?;
for (score, doc_address) in top_docs { for (score, doc_address) in top_docs {
let doc = searcher.doc(doc_address)?; let doc = searcher.doc::<TantivyDocument>(doc_address)?;
let snippet = snippet_generator.snippet_from_doc(&doc); let snippet = snippet_generator.snippet_from_doc(&doc);
println!("Document score {score}:"); println!("Document score {score}:");
println!( println!("title: {}", doc.get_first(title).unwrap().as_str().unwrap());
"title: {}",
doc.get_first(title).unwrap().as_text().unwrap()
);
println!("snippet: {}", snippet.to_html()); println!("snippet: {}", snippet.to_html());
println!("custom highlighting: {}", highlight(snippet)); println!("custom highlighting: {}", highlight(snippet));
} }


@@ -15,7 +15,7 @@ use tantivy::collector::TopDocs;
use tantivy::query::QueryParser; use tantivy::query::QueryParser;
use tantivy::schema::*; use tantivy::schema::*;
use tantivy::tokenizer::*; use tantivy::tokenizer::*;
use tantivy::{doc, Index}; use tantivy::{doc, Index, IndexWriter};
fn main() -> tantivy::Result<()> { fn main() -> tantivy::Result<()> {
// this example assumes you understand the content in `basic_search` // this example assumes you understand the content in `basic_search`
@@ -60,7 +60,7 @@ fn main() -> tantivy::Result<()> {
index.tokenizers().register("stoppy", tokenizer); index.tokenizers().register("stoppy", tokenizer);
let mut index_writer = index.writer(50_000_000)?; let mut index_writer: IndexWriter = index.writer(50_000_000)?;
let title = schema.get_field("title").unwrap(); let title = schema.get_field("title").unwrap();
let body = schema.get_field("body").unwrap(); let body = schema.get_field("body").unwrap();
@@ -105,9 +105,9 @@ fn main() -> tantivy::Result<()> {
let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?; let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
for (score, doc_address) in top_docs { for (score, doc_address) in top_docs {
let retrieved_doc = searcher.doc(doc_address)?; let retrieved_doc: TantivyDocument = searcher.doc(doc_address)?;
println!("\n==\nDocument score {score}:"); println!("\n==\nDocument score {score}:");
println!("{}", schema.to_json(&retrieved_doc)); println!("{}", retrieved_doc.to_json(&schema));
} }
Ok(()) Ok(())


@@ -6,8 +6,8 @@ use tantivy::collector::TopDocs;
use tantivy::query::QueryParser; use tantivy::query::QueryParser;
use tantivy::schema::{Schema, FAST, TEXT}; use tantivy::schema::{Schema, FAST, TEXT};
use tantivy::{ use tantivy::{
doc, DocAddress, DocId, Index, Opstamp, Searcher, SearcherGeneration, SegmentId, SegmentReader, doc, DocAddress, DocId, Index, IndexWriter, Opstamp, Searcher, SearcherGeneration, SegmentId,
Warmer, SegmentReader, Warmer,
}; };
// This example shows how warmers can be used to // This example shows how warmers can be used to
@@ -143,7 +143,7 @@ fn main() -> tantivy::Result<()> {
const SNEAKERS: ProductId = 23222; const SNEAKERS: ProductId = 23222;
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut writer = index.writer_with_num_threads(1, 10_000_000)?; let mut writer: IndexWriter = index.writer_with_num_threads(1, 15_000_000)?;
writer.add_document(doc!(product_id=>OLIVE_OIL, text=>"cooking olive oil from greece"))?; writer.add_document(doc!(product_id=>OLIVE_OIL, text=>"cooking olive oil from greece"))?;
writer.add_document(doc!(product_id=>GLOVES, text=>"kitchen gloves, perfect for cooking"))?; writer.add_document(doc!(product_id=>GLOVES, text=>"kitchen gloves, perfect for cooking"))?;
writer.add_document(doc!(product_id=>SNEAKERS, text=>"uber sweet sneakers"))?; writer.add_document(doc!(product_id=>SNEAKERS, text=>"uber sweet sneakers"))?;


@@ -1,7 +1,7 @@
 [package]
 authors = ["Paul Masurel <paul@quickwit.io>", "Pascal Seitz <pascal@quickwit.io>"]
 name = "ownedbytes"
-version = "0.5.0"
+version = "0.6.0"
 edition = "2021"
 description = "Expose data as static slice"
 license = "MIT"


@@ -1,7 +1,7 @@
use std::convert::TryInto; use std::convert::TryInto;
use std::ops::{Deref, Range}; use std::ops::{Deref, Range};
use std::sync::Arc; use std::sync::Arc;
use std::{fmt, io, mem}; use std::{fmt, io};
pub use stable_deref_trait::StableDeref; pub use stable_deref_trait::StableDeref;
@@ -26,8 +26,8 @@ impl OwnedBytes {
data_holder: T, data_holder: T,
) -> OwnedBytes { ) -> OwnedBytes {
let box_stable_deref = Arc::new(data_holder); let box_stable_deref = Arc::new(data_holder);
let bytes: &[u8] = box_stable_deref.as_ref(); let bytes: &[u8] = box_stable_deref.deref();
let data = unsafe { mem::transmute::<_, &'static [u8]>(bytes.deref()) }; let data = unsafe { &*(bytes as *const [u8]) };
OwnedBytes { OwnedBytes {
data, data,
box_stable_deref, box_stable_deref,
@@ -57,6 +57,12 @@ impl OwnedBytes {
self.data.len() self.data.len()
} }
/// Returns true iff this `OwnedBytes` is empty.
#[inline]
pub fn is_empty(&self) -> bool {
self.data.is_empty()
}
/// Splits the OwnedBytes into two OwnedBytes `(left, right)`. /// Splits the OwnedBytes into two OwnedBytes `(left, right)`.
/// ///
/// Left will hold `split_len` bytes. /// Left will hold `split_len` bytes.
@@ -68,13 +74,14 @@ impl OwnedBytes {
#[inline] #[inline]
#[must_use] #[must_use]
pub fn split(self, split_len: usize) -> (OwnedBytes, OwnedBytes) { pub fn split(self, split_len: usize) -> (OwnedBytes, OwnedBytes) {
let (left_data, right_data) = self.data.split_at(split_len);
let right_box_stable_deref = self.box_stable_deref.clone(); let right_box_stable_deref = self.box_stable_deref.clone();
let left = OwnedBytes { let left = OwnedBytes {
data: &self.data[..split_len], data: left_data,
box_stable_deref: self.box_stable_deref, box_stable_deref: self.box_stable_deref,
}; };
let right = OwnedBytes { let right = OwnedBytes {
data: &self.data[split_len..], data: right_data,
box_stable_deref: right_box_stable_deref, box_stable_deref: right_box_stable_deref,
}; };
(left, right) (left, right)
@@ -99,55 +106,45 @@ impl OwnedBytes {
/// ///
/// `self` is truncated to `split_len`, left with the remaining bytes. /// `self` is truncated to `split_len`, left with the remaining bytes.
pub fn split_off(&mut self, split_len: usize) -> OwnedBytes { pub fn split_off(&mut self, split_len: usize) -> OwnedBytes {
let (left, right) = self.data.split_at(split_len);
let right_box_stable_deref = self.box_stable_deref.clone(); let right_box_stable_deref = self.box_stable_deref.clone();
let right_piece = OwnedBytes { let right_piece = OwnedBytes {
data: &self.data[split_len..], data: right,
box_stable_deref: right_box_stable_deref, box_stable_deref: right_box_stable_deref,
}; };
self.data = &self.data[..split_len]; self.data = left;
right_piece right_piece
} }
/// Returns true iff this `OwnedBytes` is empty.
#[inline]
pub fn is_empty(&self) -> bool {
self.as_slice().is_empty()
}
/// Drops the left most `advance_len` bytes. /// Drops the left most `advance_len` bytes.
#[inline] #[inline]
pub fn advance(&mut self, advance_len: usize) { pub fn advance(&mut self, advance_len: usize) -> &[u8] {
self.data = &self.data[advance_len..] let (data, rest) = self.data.split_at(advance_len);
self.data = rest;
data
} }
/// Reads an `u8` from the `OwnedBytes` and advance by one byte. /// Reads an `u8` from the `OwnedBytes` and advance by one byte.
#[inline] #[inline]
pub fn read_u8(&mut self) -> u8 { pub fn read_u8(&mut self) -> u8 {
assert!(!self.is_empty()); self.advance(1)[0]
let byte = self.as_slice()[0];
self.advance(1);
byte
} }
/// Reads an `u64` encoded as little-endian from the `OwnedBytes` and advance by 8 bytes.
#[inline] #[inline]
pub fn read_u64(&mut self) -> u64 { fn read_n<const N: usize>(&mut self) -> [u8; N] {
assert!(self.len() > 7); self.advance(N).try_into().unwrap()
let octlet: [u8; 8] = self.as_slice()[..8].try_into().unwrap();
self.advance(8);
u64::from_le_bytes(octlet)
} }
/// Reads an `u32` encoded as little-endian from the `OwnedBytes` and advance by 4 bytes. /// Reads an `u32` encoded as little-endian from the `OwnedBytes` and advance by 4 bytes.
#[inline] #[inline]
pub fn read_u32(&mut self) -> u32 { pub fn read_u32(&mut self) -> u32 {
assert!(self.len() > 3); u32::from_le_bytes(self.read_n())
}
let quad: [u8; 4] = self.as_slice()[..4].try_into().unwrap(); /// Reads an `u64` encoded as little-endian from the `OwnedBytes` and advance by 8 bytes.
self.advance(4); #[inline]
u32::from_le_bytes(quad) pub fn read_u64(&mut self) -> u64 {
u64::from_le_bytes(self.read_n())
} }
} }
@@ -201,32 +198,33 @@ impl Deref for OwnedBytes {
} }
} }
impl AsRef<[u8]> for OwnedBytes {
#[inline]
fn as_ref(&self) -> &[u8] {
self.as_slice()
}
}
impl io::Read for OwnedBytes { impl io::Read for OwnedBytes {
#[inline] #[inline]
fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> { fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
let read_len = { let data_len = self.data.len();
let data = self.as_slice(); let buf_len = buf.len();
if data.len() >= buf.len() { if data_len >= buf_len {
let buf_len = buf.len(); let data = self.advance(buf_len);
buf.copy_from_slice(&data[..buf_len]); buf.copy_from_slice(data);
buf.len() Ok(buf_len)
} else { } else {
let data_len = data.len(); buf[..data_len].copy_from_slice(self.data);
buf[..data_len].copy_from_slice(data); self.data = &[];
data_len Ok(data_len)
} }
};
self.advance(read_len);
Ok(read_len)
} }
#[inline] #[inline]
fn read_to_end(&mut self, buf: &mut Vec<u8>) -> io::Result<usize> { fn read_to_end(&mut self, buf: &mut Vec<u8>) -> io::Result<usize> {
let read_len = { buf.extend(self.data);
let data = self.as_slice(); let read_len = self.data.len();
buf.extend(data); self.data = &[];
data.len()
};
self.advance(read_len);
Ok(read_len) Ok(read_len)
} }
#[inline] #[inline]
@@ -242,13 +240,6 @@ impl io::Read for OwnedBytes {
} }
} }
impl AsRef<[u8]> for OwnedBytes {
#[inline]
fn as_ref(&self) -> &[u8] {
self.as_slice()
}
}
#[cfg(test)] #[cfg(test)]
mod tests { mod tests {
use std::io::{self, Read}; use std::io::{self, Read};
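
A short usage sketch of the revised cursor-style API (using the `ownedbytes` crate as packaged here): `split` keeps both halves backed by the same shared allocation, and `advance` now hands back the bytes it skipped.

use ownedbytes::OwnedBytes;

fn owned_bytes_demo() {
    let bytes = OwnedBytes::new(vec![1u8, 0, 0, 0, 42]);
    let (mut header, mut rest) = bytes.split(4);
    assert_eq!(header.read_u32(), 1);      // little-endian read, consumes 4 bytes
    assert!(header.is_empty());
    assert_eq!(rest.advance(1), &[42]);    // advance returns the skipped slice
    assert!(rest.is_empty());
}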


@@ -1,6 +1,6 @@
 [package]
 name = "tantivy-query-grammar"
-version = "0.20.0"
+version = "0.21.0"
 authors = ["Paul Masurel <paul.masurel@gmail.com>"]
 license = "MIT"
 categories = ["database-implementations", "data-structures"]
@@ -12,6 +12,4 @@ keywords = ["search", "information", "retrieval"]
 edition = "2021"

 [dependencies]
-combine = {version="4", default-features=false, features=[] }
-once_cell = "1.7.2"
-regex ={ version = "1.5.4", default-features = false, features = ["std", "unicode"] }
+nom = "7"


@@ -0,0 +1,353 @@
//! nom combinators for infallible operations
use std::convert::Infallible;
use nom::{AsChar, IResult, InputLength, InputTakeAtPosition};
pub(crate) type ErrorList = Vec<LenientErrorInternal>;
pub(crate) type JResult<I, O> = IResult<I, (O, ErrorList), Infallible>;
/// An error, with an end-of-string based offset
#[derive(Debug)]
pub(crate) struct LenientErrorInternal {
pub pos: usize,
pub message: String,
}
/// A recoverable error and the position it happened at
#[derive(Debug, PartialEq)]
pub struct LenientError {
pub pos: usize,
pub message: String,
}
impl LenientError {
pub(crate) fn from_internal(internal: LenientErrorInternal, str_len: usize) -> LenientError {
LenientError {
pos: str_len - internal.pos,
message: internal.message,
}
}
}
fn unwrap_infallible<T>(res: Result<T, nom::Err<Infallible>>) -> T {
match res {
Ok(val) => val,
Err(_) => unreachable!(),
}
}
// when rfcs#1733 get stabilized, this can make things clearer
// trait InfallibleParser<I, O> = nom::Parser<I, (O, ErrorList), std::convert::Infallible>;
/// A variant of the classical `opt` parser, except it returns an infallible error type.
///
/// It's less generic than the original to ease type resolution in the rest of the code.
pub(crate) fn opt_i<I: Clone, O, F>(mut f: F) -> impl FnMut(I) -> JResult<I, Option<O>>
where F: nom::Parser<I, O, nom::error::Error<I>> {
move |input: I| {
let i = input.clone();
match f.parse(input) {
Ok((i, o)) => Ok((i, (Some(o), Vec::new()))),
Err(_) => Ok((i, (None, Vec::new()))),
}
}
}
pub(crate) fn opt_i_err<'a, I: Clone + InputLength, O, F>(
mut f: F,
message: impl ToString + 'a,
) -> impl FnMut(I) -> JResult<I, Option<O>> + 'a
where
F: nom::Parser<I, O, nom::error::Error<I>> + 'a,
{
move |input: I| {
let i = input.clone();
match f.parse(input) {
Ok((i, o)) => Ok((i, (Some(o), Vec::new()))),
Err(_) => {
let errs = vec![LenientErrorInternal {
pos: i.input_len(),
message: message.to_string(),
}];
Ok((i, (None, errs)))
}
}
}
}
pub(crate) fn space0_infallible<T>(input: T) -> JResult<T, T>
where
T: InputTakeAtPosition + Clone,
<T as InputTakeAtPosition>::Item: AsChar + Clone,
{
opt_i(nom::character::complete::space0)(input)
.map(|(left, (spaces, errors))| (left, (spaces.expect("space0 can't fail"), errors)))
}
pub(crate) fn space1_infallible<T>(input: T) -> JResult<T, Option<T>>
where
T: InputTakeAtPosition + Clone + InputLength,
<T as InputTakeAtPosition>::Item: AsChar + Clone,
{
opt_i(nom::character::complete::space1)(input).map(|(left, (spaces, mut errors))| {
if spaces.is_none() {
errors.push(LenientErrorInternal {
pos: left.input_len(),
message: "missing space".to_string(),
})
}
(left, (spaces, errors))
})
}
pub(crate) fn fallible<I, O, E: nom::error::ParseError<I>, F>(
mut f: F,
) -> impl FnMut(I) -> IResult<I, O, E>
where F: nom::Parser<I, (O, ErrorList), Infallible> {
use nom::Err;
move |input: I| match f.parse(input) {
Ok((input, (output, _err))) => Ok((input, output)),
Err(Err::Incomplete(needed)) => Err(Err::Incomplete(needed)),
Err(Err::Error(val)) | Err(Err::Failure(val)) => match val {},
}
}
pub(crate) fn delimited_infallible<I, O1, O2, O3, F, G, H>(
mut first: F,
mut second: G,
mut third: H,
) -> impl FnMut(I) -> JResult<I, O2>
where
F: nom::Parser<I, (O1, ErrorList), Infallible>,
G: nom::Parser<I, (O2, ErrorList), Infallible>,
H: nom::Parser<I, (O3, ErrorList), Infallible>,
{
move |input: I| {
let (input, (_, mut err)) = first.parse(input)?;
let (input, (o2, mut err2)) = second.parse(input)?;
err.append(&mut err2);
let (input, (_, mut err3)) = third.parse(input)?;
err.append(&mut err3);
Ok((input, (o2, err)))
}
}
// Parse nothing. Just a lazy way to not implement terminated/preceded and use delimited instead
pub(crate) fn nothing(i: &str) -> JResult<&str, ()> {
Ok((i, ((), Vec::new())))
}
pub(crate) trait TupleInfallible<I, O> {
/// Parses the input and returns a tuple of results of each parser.
fn parse(&mut self, input: I) -> JResult<I, O>;
}
impl<Input, Output, F: nom::Parser<Input, (Output, ErrorList), Infallible>>
TupleInfallible<Input, (Output,)> for (F,)
{
fn parse(&mut self, input: Input) -> JResult<Input, (Output,)> {
self.0.parse(input).map(|(i, (o, e))| (i, ((o,), e)))
}
}
// these macros are heavily copied from nom, with some minor adaptations for our type
macro_rules! tuple_trait(
($name1:ident $ty1:ident, $name2: ident $ty2:ident, $($name:ident $ty:ident),*) => (
tuple_trait!(__impl $name1 $ty1, $name2 $ty2; $($name $ty),*);
);
(__impl $($name:ident $ty: ident),+; $name1:ident $ty1:ident, $($name2:ident $ty2:ident),*) => (
tuple_trait_impl!($($name $ty),+);
tuple_trait!(__impl $($name $ty),+ , $name1 $ty1; $($name2 $ty2),*);
);
(__impl $($name:ident $ty: ident),+; $name1:ident $ty1:ident) => (
tuple_trait_impl!($($name $ty),+);
tuple_trait_impl!($($name $ty),+, $name1 $ty1);
);
);
macro_rules! tuple_trait_impl(
($($name:ident $ty: ident),+) => (
impl<
Input: Clone, $($ty),+ ,
$($name: nom::Parser<Input, ($ty, ErrorList), Infallible>),+
> TupleInfallible<Input, ( $($ty),+ )> for ( $($name),+ ) {
fn parse(&mut self, input: Input) -> JResult<Input, ( $($ty),+ )> {
let mut error_list = Vec::new();
tuple_trait_inner!(0, self, input, (), error_list, $($name)+)
}
}
);
);
macro_rules! tuple_trait_inner(
($it:tt, $self:expr, $input:expr, (), $error_list:expr, $head:ident $($id:ident)+) => ({
let (i, (o, mut err)) = $self.$it.parse($input.clone())?;
$error_list.append(&mut err);
succ!($it, tuple_trait_inner!($self, i, ( o ), $error_list, $($id)+))
});
($it:tt, $self:expr, $input:expr, ($($parsed:tt)*), $error_list:expr, $head:ident $($id:ident)+) => ({
let (i, (o, mut err)) = $self.$it.parse($input.clone())?;
$error_list.append(&mut err);
succ!($it, tuple_trait_inner!($self, i, ($($parsed)* , o), $error_list, $($id)+))
});
($it:tt, $self:expr, $input:expr, ($($parsed:tt)*), $error_list:expr, $head:ident) => ({
let (i, (o, mut err)) = $self.$it.parse($input.clone())?;
$error_list.append(&mut err);
Ok((i, (($($parsed)* , o), $error_list)))
});
);
macro_rules! succ (
(0, $submac:ident ! ($($rest:tt)*)) => ($submac!(1, $($rest)*));
(1, $submac:ident ! ($($rest:tt)*)) => ($submac!(2, $($rest)*));
(2, $submac:ident ! ($($rest:tt)*)) => ($submac!(3, $($rest)*));
(3, $submac:ident ! ($($rest:tt)*)) => ($submac!(4, $($rest)*));
(4, $submac:ident ! ($($rest:tt)*)) => ($submac!(5, $($rest)*));
(5, $submac:ident ! ($($rest:tt)*)) => ($submac!(6, $($rest)*));
(6, $submac:ident ! ($($rest:tt)*)) => ($submac!(7, $($rest)*));
(7, $submac:ident ! ($($rest:tt)*)) => ($submac!(8, $($rest)*));
(8, $submac:ident ! ($($rest:tt)*)) => ($submac!(9, $($rest)*));
(9, $submac:ident ! ($($rest:tt)*)) => ($submac!(10, $($rest)*));
(10, $submac:ident ! ($($rest:tt)*)) => ($submac!(11, $($rest)*));
(11, $submac:ident ! ($($rest:tt)*)) => ($submac!(12, $($rest)*));
(12, $submac:ident ! ($($rest:tt)*)) => ($submac!(13, $($rest)*));
(13, $submac:ident ! ($($rest:tt)*)) => ($submac!(14, $($rest)*));
(14, $submac:ident ! ($($rest:tt)*)) => ($submac!(15, $($rest)*));
(15, $submac:ident ! ($($rest:tt)*)) => ($submac!(16, $($rest)*));
(16, $submac:ident ! ($($rest:tt)*)) => ($submac!(17, $($rest)*));
(17, $submac:ident ! ($($rest:tt)*)) => ($submac!(18, $($rest)*));
(18, $submac:ident ! ($($rest:tt)*)) => ($submac!(19, $($rest)*));
(19, $submac:ident ! ($($rest:tt)*)) => ($submac!(20, $($rest)*));
(20, $submac:ident ! ($($rest:tt)*)) => ($submac!(21, $($rest)*));
);
tuple_trait!(FnA A, FnB B, FnC C, FnD D, FnE E, FnF F, FnG G, FnH H, FnI I, FnJ J, FnK K, FnL L,
FnM M, FnN N, FnO O, FnP P, FnQ Q, FnR R, FnS S, FnT T, FnU U);
// Special case: implement `TupleInfallible` for `()`, the unit type.
// This can come up in macros which accept a variable number of arguments.
// Literally, `()` is an empty tuple, so it should simply parse nothing.
impl<I> TupleInfallible<I, ()> for () {
fn parse(&mut self, input: I) -> JResult<I, ()> {
Ok((input, ((), Vec::new())))
}
}
pub(crate) fn tuple_infallible<I, O, List: TupleInfallible<I, O>>(
mut l: List,
) -> impl FnMut(I) -> JResult<I, O> {
move |i: I| l.parse(i)
}
pub(crate) fn separated_list_infallible<I, O, O2, F, G>(
mut sep: G,
mut f: F,
) -> impl FnMut(I) -> JResult<I, Vec<O>>
where
I: Clone + InputLength,
F: nom::Parser<I, (O, ErrorList), Infallible>,
G: nom::Parser<I, (O2, ErrorList), Infallible>,
{
move |i: I| {
let mut res: Vec<O> = Vec::new();
let mut errors: ErrorList = Vec::new();
let (mut i, (o, mut err)) = unwrap_infallible(f.parse(i.clone()));
errors.append(&mut err);
res.push(o);
loop {
let (i_sep_parsed, (_, mut err_sep)) = unwrap_infallible(sep.parse(i.clone()));
let len_before = i_sep_parsed.input_len();
let (i_elem_parsed, (o, mut err_elem)) =
unwrap_infallible(f.parse(i_sep_parsed.clone()));
// infinite loop check: the parser must always consume
// if we consumed nothing here, don't produce an element.
if i_elem_parsed.input_len() == len_before {
return Ok((i, (res, errors)));
}
res.push(o);
errors.append(&mut err_sep);
errors.append(&mut err_elem);
i = i_elem_parsed;
}
}
}
pub(crate) trait Alt<I, O> {
/// Tests each parser in the tuple and returns the result of the first one that succeeds
fn choice(&mut self, input: I) -> Option<JResult<I, O>>;
}
macro_rules! alt_trait(
($first_cond:ident $first:ident, $($id_cond:ident $id: ident),+) => (
alt_trait!(__impl $first_cond $first; $($id_cond $id),+);
);
(__impl $($current_cond:ident $current:ident),*; $head_cond:ident $head:ident, $($id_cond:ident $id:ident),+) => (
alt_trait_impl!($($current_cond $current),*);
alt_trait!(__impl $($current_cond $current,)* $head_cond $head; $($id_cond $id),+);
);
(__impl $($current_cond:ident $current:ident),*; $head_cond:ident $head:ident) => (
alt_trait_impl!($($current_cond $current),*);
alt_trait_impl!($($current_cond $current,)* $head_cond $head);
);
);
macro_rules! alt_trait_impl(
($($id_cond:ident $id:ident),+) => (
impl<
Input: Clone, Output,
$(
// () are to make things easier on me, but I'm not entirely sure whether we can do better
// with rule E0207
$id_cond: nom::Parser<Input, (), ()>,
$id: nom::Parser<Input, (Output, ErrorList), Infallible>
),+
> Alt<Input, Output> for ( $(($id_cond, $id),)+ ) {
fn choice(&mut self, input: Input) -> Option<JResult<Input, Output>> {
match self.0.0.parse(input.clone()) {
Err(_) => alt_trait_inner!(1, self, input, $($id_cond $id),+),
Ok((input_left, _)) => Some(self.0.1.parse(input_left)),
}
}
}
);
);
macro_rules! alt_trait_inner(
($it:tt, $self:expr, $input:expr, $head_cond:ident $head:ident, $($id_cond:ident $id:ident),+) => (
match $self.$it.0.parse($input.clone()) {
Err(_) => succ!($it, alt_trait_inner!($self, $input, $($id_cond $id),+)),
Ok((input_left, _)) => Some($self.$it.1.parse(input_left)),
}
);
($it:tt, $self:expr, $input:expr, $head_cond:ident $head:ident) => (
None
);
);
alt_trait!(A1 A, B1 B, C1 C, D1 D, E1 E, F1 F, G1 G, H1 H, I1 I, J1 J, K1 K,
L1 L, M1 M, N1 N, O1 O, P1 P, Q1 Q, R1 R, S1 S, T1 T, U1 U);
/// An alt() like combinator. For each branch, it first tries a fallible parser, which commits to
/// this branch, or tells to check next branch, and the execute the infallible parser which follow.
///
/// In case no branch match, the default (fallible) parser is executed.
pub(crate) fn alt_infallible<I: Clone, O, F, List: Alt<I, O>>(
mut l: List,
mut default: F,
) -> impl FnMut(I) -> JResult<I, O>
where
F: nom::Parser<I, (O, ErrorList), Infallible>,
{
move |i: I| l.choice(i.clone()).unwrap_or_else(|| default.parse(i))
}
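
Since everything here is `pub(crate)`, a usage sketch has to live inside the query grammar itself. The idea is that parsers never fail: they return a best-effort value plus a list of `LenientErrorInternal` hints. A minimal in-crate sketch building on `opt_i_err` (the parser name is ours):

// Inside tantivy-query-grammar, e.g. next to the grammar rules:
use crate::infallible::{opt_i_err, JResult};

// Accept an optional leading '*', recording a hint instead of failing the parse.
fn star_or_hint(input: &str) -> JResult<&str, Option<char>> {
    opt_i_err(nom::character::complete::char('*'), "expected `*`")(input)
}

// star_or_hint("*foo") -> Ok(("foo", (Some('*'), vec![])))
// star_or_hint("foo")  -> Ok(("foo", (None, vec![one "expected `*`" hint])))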


@@ -1,19 +1,26 @@
 #![allow(clippy::derive_partial_eq_without_eq)]

+mod infallible;
 mod occur;
 mod query_grammar;
 mod user_input_ast;

-use combine::parser::Parser;
+pub use crate::infallible::LenientError;
 pub use crate::occur::Occur;
-use crate::query_grammar::parse_to_ast;
+use crate::query_grammar::{parse_to_ast, parse_to_ast_lenient};
 pub use crate::user_input_ast::{
     Delimiter, UserInputAst, UserInputBound, UserInputLeaf, UserInputLiteral,
 };

 pub struct Error;

+/// Parse a query
 pub fn parse_query(query: &str) -> Result<UserInputAst, Error> {
-    let (user_input_ast, _remaining) = parse_to_ast().parse(query).map_err(|_| Error)?;
+    let (_remaining, user_input_ast) = parse_to_ast(query).map_err(|_| Error)?;
     Ok(user_input_ast)
 }
+
+/// Parse a query, trying to recover from syntax errors, and giving hints toward fixing errors.
+pub fn parse_query_lenient(query: &str) -> (UserInputAst, Vec<LenientError>) {
+    parse_to_ast_lenient(query)
+}
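
And a matching caller-side sketch for the new lenient entry point (the sample query string is just an illustration):

use tantivy_query_grammar::parse_query_lenient;

fn main() {
    // A query the strict parser would reject outright.
    let (ast, errors) = parse_query_lenient("title:(hello AND");
    println!("best-effort ast: {ast:?}");
    for err in &errors {
        println!("hint at offset {}: {}", err.pos, err.message);
    }
}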

File diff suppressed because it is too large.


@@ -3,7 +3,7 @@ use std::fmt::{Debug, Formatter};
use crate::Occur; use crate::Occur;
#[derive(PartialEq)] #[derive(PartialEq, Clone)]
pub enum UserInputLeaf { pub enum UserInputLeaf {
Literal(UserInputLiteral), Literal(UserInputLiteral),
All, All,
@@ -16,6 +16,34 @@ pub enum UserInputLeaf {
field: Option<String>, field: Option<String>,
elements: Vec<String>, elements: Vec<String>,
}, },
Exists {
field: String,
},
}
impl UserInputLeaf {
pub(crate) fn set_field(self, field: Option<String>) -> Self {
match self {
UserInputLeaf::Literal(mut literal) => {
literal.field_name = field;
UserInputLeaf::Literal(literal)
}
UserInputLeaf::All => UserInputLeaf::All,
UserInputLeaf::Range {
field: _,
lower,
upper,
} => UserInputLeaf::Range {
field,
lower,
upper,
},
UserInputLeaf::Set { field: _, elements } => UserInputLeaf::Set { field, elements },
UserInputLeaf::Exists { field: _ } => UserInputLeaf::Exists {
field: field.expect("Exist query without a field isn't allowed"),
},
}
}
} }
impl Debug for UserInputLeaf { impl Debug for UserInputLeaf {
@@ -28,6 +56,7 @@ impl Debug for UserInputLeaf {
ref upper, ref upper,
} => { } => {
if let Some(ref field) = field { if let Some(ref field) = field {
// TODO properly escape field (in case of \")
write!(formatter, "\"{field}\":")?; write!(formatter, "\"{field}\":")?;
} }
lower.display_lower(formatter)?; lower.display_lower(formatter)?;
@@ -37,6 +66,7 @@ impl Debug for UserInputLeaf {
} }
UserInputLeaf::Set { field, elements } => { UserInputLeaf::Set { field, elements } => {
if let Some(ref field) = field { if let Some(ref field) = field {
// TODO properly escape field (in case of \")
write!(formatter, "\"{field}\": ")?; write!(formatter, "\"{field}\": ")?;
} }
write!(formatter, "IN [")?; write!(formatter, "IN [")?;
@@ -44,11 +74,15 @@ impl Debug for UserInputLeaf {
if i != 0 { if i != 0 {
write!(formatter, " ")?; write!(formatter, " ")?;
} }
// TODO properly escape element
write!(formatter, "\"{text}\"")?; write!(formatter, "\"{text}\"")?;
} }
write!(formatter, "]") write!(formatter, "]")
} }
UserInputLeaf::All => write!(formatter, "*"), UserInputLeaf::All => write!(formatter, "*"),
UserInputLeaf::Exists { field } => {
write!(formatter, "\"{field}\":*")
}
} }
} }
} }
@@ -60,7 +94,7 @@ pub enum Delimiter {
None, None,
} }
#[derive(PartialEq)] #[derive(PartialEq, Clone)]
pub struct UserInputLiteral { pub struct UserInputLiteral {
pub field_name: Option<String>, pub field_name: Option<String>,
pub phrase: String, pub phrase: String,
@@ -72,16 +106,20 @@ pub struct UserInputLiteral {
impl fmt::Debug for UserInputLiteral { impl fmt::Debug for UserInputLiteral {
fn fmt(&self, formatter: &mut fmt::Formatter) -> Result<(), fmt::Error> { fn fmt(&self, formatter: &mut fmt::Formatter) -> Result<(), fmt::Error> {
if let Some(ref field) = self.field_name { if let Some(ref field) = self.field_name {
// TODO properly escape field (in case of \")
write!(formatter, "\"{field}\":")?; write!(formatter, "\"{field}\":")?;
} }
match self.delimiter { match self.delimiter {
Delimiter::SingleQuotes => { Delimiter::SingleQuotes => {
// TODO properly escape element (in case of \')
write!(formatter, "'{}'", self.phrase)?; write!(formatter, "'{}'", self.phrase)?;
} }
Delimiter::DoubleQuotes => { Delimiter::DoubleQuotes => {
// TODO properly escape element (in case of \")
write!(formatter, "\"{}\"", self.phrase)?; write!(formatter, "\"{}\"", self.phrase)?;
} }
Delimiter::None => { Delimiter::None => {
// TODO properly escape element
write!(formatter, "{}", self.phrase)?; write!(formatter, "{}", self.phrase)?;
} }
} }
@@ -94,7 +132,7 @@ impl fmt::Debug for UserInputLiteral {
} }
} }
#[derive(PartialEq)] #[derive(PartialEq, Debug, Clone)]
pub enum UserInputBound { pub enum UserInputBound {
Inclusive(String), Inclusive(String),
Exclusive(String), Exclusive(String),
@@ -104,6 +142,7 @@ pub enum UserInputBound {
impl UserInputBound { impl UserInputBound {
fn display_lower(&self, formatter: &mut fmt::Formatter) -> Result<(), fmt::Error> { fn display_lower(&self, formatter: &mut fmt::Formatter) -> Result<(), fmt::Error> {
match *self { match *self {
// TODO properly escape word if required
UserInputBound::Inclusive(ref word) => write!(formatter, "[\"{word}\""), UserInputBound::Inclusive(ref word) => write!(formatter, "[\"{word}\""),
UserInputBound::Exclusive(ref word) => write!(formatter, "{{\"{word}\""), UserInputBound::Exclusive(ref word) => write!(formatter, "{{\"{word}\""),
UserInputBound::Unbounded => write!(formatter, "{{\"*\""), UserInputBound::Unbounded => write!(formatter, "{{\"*\""),
@@ -112,6 +151,7 @@ impl UserInputBound {
fn display_upper(&self, formatter: &mut fmt::Formatter) -> Result<(), fmt::Error> { fn display_upper(&self, formatter: &mut fmt::Formatter) -> Result<(), fmt::Error> {
match *self { match *self {
// TODO properly escape word if required
UserInputBound::Inclusive(ref word) => write!(formatter, "\"{word}\"]"), UserInputBound::Inclusive(ref word) => write!(formatter, "\"{word}\"]"),
UserInputBound::Exclusive(ref word) => write!(formatter, "\"{word}\"}}"), UserInputBound::Exclusive(ref word) => write!(formatter, "\"{word}\"}}"),
UserInputBound::Unbounded => write!(formatter, "\"*\"}}"), UserInputBound::Unbounded => write!(formatter, "\"*\"}}"),
@@ -127,6 +167,7 @@ impl UserInputBound {
} }
} }
#[derive(PartialEq, Clone)]
pub enum UserInputAst { pub enum UserInputAst {
Clause(Vec<(Option<Occur>, UserInputAst)>), Clause(Vec<(Option<Occur>, UserInputAst)>),
Leaf(Box<UserInputLeaf>), Leaf(Box<UserInputLeaf>),
@@ -196,6 +237,7 @@ impl fmt::Debug for UserInputAst {
match *self { match *self {
UserInputAst::Clause(ref subqueries) => { UserInputAst::Clause(ref subqueries) => {
if subqueries.is_empty() { if subqueries.is_empty() {
// TODO this will break ast reserialization, is writing "( )" enough?
write!(formatter, "<emptyclause>")?; write!(formatter, "<emptyclause>")?;
} else { } else {
write!(formatter, "(")?; write!(formatter, "(")?;


@@ -48,7 +48,7 @@ mod bench {
let score_field_f64 = schema_builder.add_f64_field("score_f64", score_fieldtype.clone()); let score_field_f64 = schema_builder.add_f64_field("score_f64", score_fieldtype.clone());
let score_field_i64 = schema_builder.add_i64_field("score_i64", score_fieldtype); let score_field_i64 = schema_builder.add_i64_field("score_i64", score_fieldtype);
let index = Index::create_from_tempdir(schema_builder.build())?; let index = Index::create_from_tempdir(schema_builder.build())?;
let few_terms_data = vec!["INFO", "ERROR", "WARN", "DEBUG"]; let few_terms_data = ["INFO", "ERROR", "WARN", "DEBUG"];
let lg_norm = rand_distr::LogNormal::new(2.996f64, 0.979f64).unwrap(); let lg_norm = rand_distr::LogNormal::new(2.996f64, 0.979f64).unwrap();
@@ -85,7 +85,7 @@ mod bench {
if cardinality == Cardinality::Sparse { if cardinality == Cardinality::Sparse {
doc_with_value /= 20; doc_with_value /= 20;
} }
let val_max = 1_000_000.0; let _val_max = 1_000_000.0;
for _ in 0..doc_with_value { for _ in 0..doc_with_value {
let val: f64 = rng.gen_range(0.0..1_000_000.0); let val: f64 = rng.gen_range(0.0..1_000_000.0);
let json = if rng.gen_bool(0.1) { let json = if rng.gen_bool(0.1) {


@@ -73,9 +73,9 @@ impl AggregationLimits {
/// Create a new ResourceLimitGuard, that will release the memory when dropped. /// Create a new ResourceLimitGuard, that will release the memory when dropped.
pub fn new_guard(&self) -> ResourceLimitGuard { pub fn new_guard(&self) -> ResourceLimitGuard {
ResourceLimitGuard { ResourceLimitGuard {
/// The counter which is shared between the aggregations for one request. // The counter which is shared between the aggregations for one request.
memory_consumption: Arc::clone(&self.memory_consumption), memory_consumption: Arc::clone(&self.memory_consumption),
/// The memory_limit in bytes // The memory_limit in bytes
memory_limit: self.memory_limit, memory_limit: self.memory_limit,
allocated_with_the_guard: 0, allocated_with_the_guard: 0,
} }
@@ -134,3 +134,142 @@ impl Drop for ResourceLimitGuard {
.fetch_sub(self.allocated_with_the_guard, Ordering::Relaxed); .fetch_sub(self.allocated_with_the_guard, Ordering::Relaxed);
} }
} }
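The guard released in the Drop impl above follows the usual RAII accounting pattern: bytes are added to a counter shared across the whole request while the guard lives, and subtracted again when the guard is dropped. A condensed, self-contained sketch of that pattern (simplified names, not the crate's actual types):

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Simplified stand-in for AggregationLimits / ResourceLimitGuard.
struct SharedCounterGuard {
    // Counter shared between all guards of one request.
    shared: Arc<AtomicU64>,
    // Bytes this particular guard has accounted for.
    allocated: u64,
}

impl SharedCounterGuard {
    fn add_memory(&mut self, bytes: u64) {
        self.allocated += bytes;
        self.shared.fetch_add(bytes, Ordering::Relaxed);
    }
}

impl Drop for SharedCounterGuard {
    fn drop(&mut self) {
        // Release everything this guard accounted for.
        self.shared.fetch_sub(self.allocated, Ordering::Relaxed);
    }
}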
#[cfg(test)]
mod tests {
use crate::aggregation::tests::exec_request_with_query;
// https://github.com/quickwit-oss/quickwit/issues/3837
#[test]
fn test_agg_limits_with_empty_merge() {
use crate::aggregation::agg_req::Aggregations;
use crate::aggregation::bucket::tests::get_test_index_from_docs;
let docs = vec![
vec![r#"{ "date": "2015-01-02T00:00:00Z", "text": "bbb", "text2": "bbb" }"#],
vec![r#"{ "text": "aaa", "text2": "bbb" }"#],
];
let index = get_test_index_from_docs(false, &docs).unwrap();
{
let elasticsearch_compatible_json = json!(
{
"1": {
"terms": {"field": "text2", "min_doc_count": 0},
"aggs": {
"2":{
"date_histogram": {
"field": "date",
"fixed_interval": "1d",
"extended_bounds": {
"min": "2015-01-01T00:00:00Z",
"max": "2015-01-10T00:00:00Z"
}
}
}
}
}
}
);
let agg_req: Aggregations = serde_json::from_str(
&serde_json::to_string(&elasticsearch_compatible_json).unwrap(),
)
.unwrap();
let res = exec_request_with_query(agg_req, &index, Some(("text", "bbb"))).unwrap();
let expected_res = json!({
"1": {
"buckets": [
{
"2": {
"buckets": [
{ "doc_count": 0, "key": 1420070400000.0, "key_as_string": "2015-01-01T00:00:00Z" },
{ "doc_count": 1, "key": 1420156800000.0, "key_as_string": "2015-01-02T00:00:00Z" },
{ "doc_count": 0, "key": 1420243200000.0, "key_as_string": "2015-01-03T00:00:00Z" },
{ "doc_count": 0, "key": 1420329600000.0, "key_as_string": "2015-01-04T00:00:00Z" },
{ "doc_count": 0, "key": 1420416000000.0, "key_as_string": "2015-01-05T00:00:00Z" },
{ "doc_count": 0, "key": 1420502400000.0, "key_as_string": "2015-01-06T00:00:00Z" },
{ "doc_count": 0, "key": 1420588800000.0, "key_as_string": "2015-01-07T00:00:00Z" },
{ "doc_count": 0, "key": 1420675200000.0, "key_as_string": "2015-01-08T00:00:00Z" },
{ "doc_count": 0, "key": 1420761600000.0, "key_as_string": "2015-01-09T00:00:00Z" },
{ "doc_count": 0, "key": 1420848000000.0, "key_as_string": "2015-01-10T00:00:00Z" }
]
},
"doc_count": 1,
"key": "bbb"
}
],
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0
}
});
assert_eq!(res, expected_res);
}
}
// https://github.com/quickwit-oss/quickwit/issues/3837
#[test]
fn test_agg_limits_with_empty_data() {
use crate::aggregation::agg_req::Aggregations;
use crate::aggregation::bucket::tests::get_test_index_from_docs;
let docs = vec![vec![r#"{ "text": "aaa", "text2": "bbb" }"#]];
let index = get_test_index_from_docs(false, &docs).unwrap();
{
// Empty result since there is no doc with dates
let elasticsearch_compatible_json = json!(
{
"1": {
"terms": {"field": "text2", "min_doc_count": 0},
"aggs": {
"2":{
"date_histogram": {
"field": "date",
"fixed_interval": "1d",
"extended_bounds": {
"min": "2015-01-01T00:00:00Z",
"max": "2015-01-10T00:00:00Z"
}
}
}
}
}
}
);
let agg_req: Aggregations = serde_json::from_str(
&serde_json::to_string(&elasticsearch_compatible_json).unwrap(),
)
.unwrap();
let res = exec_request_with_query(agg_req, &index, Some(("text", "bbb"))).unwrap();
let expected_res = json!({
"1": {
"buckets": [
{
"2": {
"buckets": [
{ "doc_count": 0, "key": 1420070400000.0, "key_as_string": "2015-01-01T00:00:00Z" },
{ "doc_count": 0, "key": 1420156800000.0, "key_as_string": "2015-01-02T00:00:00Z" },
{ "doc_count": 0, "key": 1420243200000.0, "key_as_string": "2015-01-03T00:00:00Z" },
{ "doc_count": 0, "key": 1420329600000.0, "key_as_string": "2015-01-04T00:00:00Z" },
{ "doc_count": 0, "key": 1420416000000.0, "key_as_string": "2015-01-05T00:00:00Z" },
{ "doc_count": 0, "key": 1420502400000.0, "key_as_string": "2015-01-06T00:00:00Z" },
{ "doc_count": 0, "key": 1420588800000.0, "key_as_string": "2015-01-07T00:00:00Z" },
{ "doc_count": 0, "key": 1420675200000.0, "key_as_string": "2015-01-08T00:00:00Z" },
{ "doc_count": 0, "key": 1420761600000.0, "key_as_string": "2015-01-09T00:00:00Z" },
{ "doc_count": 0, "key": 1420848000000.0, "key_as_string": "2015-01-10T00:00:00Z" }
]
},
"doc_count": 0,
"key": "bbb"
}
],
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0
}
});
assert_eq!(res, expected_res);
}
}
}


@@ -44,22 +44,49 @@ use super::metric::{
/// The key is the user defined name of the aggregation. /// The key is the user defined name of the aggregation.
pub type Aggregations = HashMap<String, Aggregation>; pub type Aggregations = HashMap<String, Aggregation>;
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
/// Aggregation request. /// Aggregation request.
/// ///
/// An aggregation is either a bucket or a metric. /// An aggregation is either a bucket or a metric.
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
#[serde(try_from = "AggregationForDeserialization")]
pub struct Aggregation { pub struct Aggregation {
/// The aggregation variant, which can be either a bucket or a metric. /// The aggregation variant, which can be either a bucket or a metric.
#[serde(flatten)] #[serde(flatten)]
pub agg: AggregationVariants, pub agg: AggregationVariants,
/// The sub_aggregations, only valid for bucket type aggregations. Each bucket will aggregate
/// on the document set in the bucket. /// on the document set in the bucket.
#[serde(rename = "aggs")] #[serde(rename = "aggs")]
#[serde(default)]
#[serde(skip_serializing_if = "Aggregations::is_empty")] #[serde(skip_serializing_if = "Aggregations::is_empty")]
pub sub_aggregation: Aggregations, pub sub_aggregation: Aggregations,
} }
/// In order to display a proper error message, we cannot rely on flattening
/// the JSON enum. Instead we introduce an intermediary struct to separate
/// the aggregation from the sub-aggregation.
#[derive(Deserialize)]
struct AggregationForDeserialization {
#[serde(flatten)]
pub aggs_remaining_json: serde_json::Value,
#[serde(rename = "aggs")]
#[serde(default)]
pub sub_aggregation: Aggregations,
}
impl TryFrom<AggregationForDeserialization> for Aggregation {
type Error = serde_json::Error;
fn try_from(value: AggregationForDeserialization) -> serde_json::Result<Self> {
let AggregationForDeserialization {
aggs_remaining_json,
sub_aggregation,
} = value;
let agg: AggregationVariants = serde_json::from_value(aggs_remaining_json)?;
Ok(Aggregation {
agg,
sub_aggregation,
})
}
}
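To illustrate the intent of the detour through AggregationForDeserialization: the sub-aggregations are peeled off first, and only the remaining JSON is matched against AggregationVariants, so an unknown aggregation name surfaces as an `unknown variant` error (as the updated test below expects). A hedged usage sketch, assuming the crate's Aggregation type is in scope:

// Sketch: deserializing a named aggregation with a nested sub-aggregation.
// A typo in the aggregation name now yields "unknown variant `...`, expected one of"
// instead of the generic "no variant of enum ... found in flattened data" message.
let agg: Aggregation = serde_json::from_value(serde_json::json!({
    "terms": { "field": "text" },
    "aggs": {
        "avg_price": { "avg": { "field": "price" } }
    }
}))
.unwrap();
assert_eq!(agg.sub_aggregation.len(), 1);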
impl Aggregation { impl Aggregation {
pub(crate) fn sub_aggregation(&self) -> &Aggregations { pub(crate) fn sub_aggregation(&self) -> &Aggregations {
&self.sub_aggregation &self.sub_aggregation
@@ -123,7 +150,8 @@ pub enum AggregationVariants {
} }
impl AggregationVariants { impl AggregationVariants {
fn get_fast_field_name(&self) -> &str { /// Returns the name of the field used by the aggregation.
pub fn get_fast_field_name(&self) -> &str {
match self { match self {
AggregationVariants::Terms(terms) => terms.field.as_str(), AggregationVariants::Terms(terms) => terms.field.as_str(),
AggregationVariants::Range(range) => range.field.as_str(), AggregationVariants::Range(range) => range.field.as_str(),


@@ -13,6 +13,7 @@ use super::metric::{
}; };
use super::segment_agg_result::AggregationLimits; use super::segment_agg_result::AggregationLimits;
use super::VecWithNames; use super::VecWithNames;
use crate::aggregation::{f64_to_fastfield_u64, Key};
use crate::SegmentReader; use crate::SegmentReader;
#[derive(Default)] #[derive(Default)]
@@ -35,96 +36,242 @@ pub struct AggregationWithAccessor {
/// based on search terms. That is not that case currently, but eventually this needs to be /// based on search terms. That is not that case currently, but eventually this needs to be
/// Option or moved. /// Option or moved.
pub(crate) accessor: Column<u64>, pub(crate) accessor: Column<u64>,
/// The `missing` value mapped to its u64 representation, inserted for documents without a value.
pub(crate) missing_value_for_accessor: Option<u64>,
pub(crate) str_dict_column: Option<StrColumn>, pub(crate) str_dict_column: Option<StrColumn>,
pub(crate) field_type: ColumnType, pub(crate) field_type: ColumnType,
/// In case there are multiple types of fast fields, e.g. string and numeric.
/// Only used for term aggregations currently.
pub(crate) accessor2: Option<(Column<u64>, ColumnType)>,
pub(crate) sub_aggregation: AggregationsWithAccessor, pub(crate) sub_aggregation: AggregationsWithAccessor,
pub(crate) limits: ResourceLimitGuard, pub(crate) limits: ResourceLimitGuard,
pub(crate) column_block_accessor: ColumnBlockAccessor<u64>, pub(crate) column_block_accessor: ColumnBlockAccessor<u64>,
/// Used for the missing term aggregation, which checks all columns for existence.
/// By convention the missing aggregation is chosen when this property is set
/// (instead of being set in `agg`).
/// If this needs to be used by other aggregations, we need to refactor this.
pub(crate) accessors: Vec<Column<u64>>,
pub(crate) agg: Aggregation, pub(crate) agg: Aggregation,
} }
impl AggregationWithAccessor { impl AggregationWithAccessor {
/// May return multiple accessors if the aggregation is e.g. on mixed field types.
fn try_from_agg( fn try_from_agg(
agg: &Aggregation, agg: &Aggregation,
sub_aggregation: &Aggregations, sub_aggregation: &Aggregations,
reader: &SegmentReader, reader: &SegmentReader,
limits: AggregationLimits, limits: AggregationLimits,
) -> crate::Result<AggregationWithAccessor> { ) -> crate::Result<Vec<AggregationWithAccessor>> {
let mut str_dict_column = None; let add_agg_with_accessor = |accessor: Column<u64>,
let mut accessor2 = None; column_type: ColumnType,
aggs: &mut Vec<AggregationWithAccessor>|
-> crate::Result<()> {
let res = AggregationWithAccessor {
accessor,
accessors: Vec::new(),
field_type: column_type,
sub_aggregation: get_aggs_with_segment_accessor_and_validate(
sub_aggregation,
reader,
&limits,
)?,
agg: agg.clone(),
limits: limits.new_guard(),
missing_value_for_accessor: None,
str_dict_column: None,
column_block_accessor: Default::default(),
};
aggs.push(res);
Ok(())
};
let mut res: Vec<AggregationWithAccessor> = Vec::new();
use AggregationVariants::*; use AggregationVariants::*;
let (accessor, field_type) = match &agg.agg { match &agg.agg {
Range(RangeAggregation { Range(RangeAggregation {
field: field_name, .. field: field_name, ..
}) => get_ff_reader(reader, field_name, Some(get_numeric_or_date_column_types()))?, }) => {
let (accessor, column_type) =
get_ff_reader(reader, field_name, Some(get_numeric_or_date_column_types()))?;
add_agg_with_accessor(accessor, column_type, &mut res)?;
}
Histogram(HistogramAggregation { Histogram(HistogramAggregation {
field: field_name, .. field: field_name, ..
}) => get_ff_reader(reader, field_name, Some(get_numeric_or_date_column_types()))?, }) => {
let (accessor, column_type) =
get_ff_reader(reader, field_name, Some(get_numeric_or_date_column_types()))?;
add_agg_with_accessor(accessor, column_type, &mut res)?;
}
DateHistogram(DateHistogramAggregationReq { DateHistogram(DateHistogramAggregationReq {
field: field_name, .. field: field_name, ..
}) => get_ff_reader(reader, field_name, Some(get_numeric_or_date_column_types()))?,
Terms(TermsAggregation {
field: field_name, ..
}) => { }) => {
str_dict_column = reader.fast_fields().str(field_name)?; let (accessor, column_type) =
// Only DateTime is supported for DateHistogram
get_ff_reader(reader, field_name, Some(&[ColumnType::DateTime]))?;
add_agg_with_accessor(accessor, column_type, &mut res)?;
}
Terms(TermsAggregation {
field: field_name,
missing,
..
}) => {
let str_dict_column = reader.fast_fields().str(field_name)?;
let allowed_column_types = [ let allowed_column_types = [
ColumnType::I64, ColumnType::I64,
ColumnType::U64, ColumnType::U64,
ColumnType::F64, ColumnType::F64,
ColumnType::Str, ColumnType::Str,
ColumnType::DateTime,
// ColumnType::Bytes Unsupported // ColumnType::Bytes Unsupported
// ColumnType::Bool Unsupported // ColumnType::Bool Unsupported
// ColumnType::IpAddr Unsupported // ColumnType::IpAddr Unsupported
// ColumnType::DateTime Unsupported
]; ];
let mut columns =
get_all_ff_reader_or_empty(reader, field_name, Some(&allowed_column_types))?;
let first = columns.pop().unwrap();
accessor2 = columns.pop();
first
}
Average(AverageAggregation { field: field_name })
| Count(CountAggregation { field: field_name })
| Max(MaxAggregation { field: field_name })
| Min(MinAggregation { field: field_name })
| Stats(StatsAggregation { field: field_name })
| Sum(SumAggregation { field: field_name }) => {
let (accessor, field_type) =
get_ff_reader(reader, field_name, Some(get_numeric_or_date_column_types()))?;
(accessor, field_type) // In case the column is empty we want the shim column to match the missing type
let fallback_type = missing
.as_ref()
.map(|missing| match missing {
Key::Str(_) => ColumnType::Str,
Key::F64(_) => ColumnType::F64,
})
.unwrap_or(ColumnType::U64);
let column_and_types = get_all_ff_reader_or_empty(
reader,
field_name,
Some(&allowed_column_types),
fallback_type,
)?;
let missing_and_more_than_one_col = column_and_types.len() > 1 && missing.is_some();
let text_on_non_text_col = column_and_types.len() == 1
&& column_and_types[0].1.numerical_type().is_some()
&& missing
.as_ref()
.map(|m| matches!(m, Key::Str(_)))
.unwrap_or(false);
// Actually we could convert the text to a number and have the fast path, if it is
// provided in Rfc3339 format. But this use case is probably not common
// enough to justify the effort.
let text_on_date_col = column_and_types.len() == 1
&& column_and_types[0].1 == ColumnType::DateTime
&& missing
.as_ref()
.map(|m| matches!(m, Key::Str(_)))
.unwrap_or(false);
let use_special_missing_agg =
missing_and_more_than_one_col || text_on_non_text_col || text_on_date_col;
if use_special_missing_agg {
let column_and_types =
get_all_ff_reader_or_empty(reader, field_name, None, fallback_type)?;
let accessors: Vec<Column> =
column_and_types.iter().map(|(a, _)| a.clone()).collect();
let agg_wit_acc = AggregationWithAccessor {
missing_value_for_accessor: None,
accessor: accessors[0].clone(),
accessors,
field_type: ColumnType::U64,
sub_aggregation: get_aggs_with_segment_accessor_and_validate(
sub_aggregation,
reader,
&limits,
)?,
agg: agg.clone(),
str_dict_column: str_dict_column.clone(),
limits: limits.new_guard(),
column_block_accessor: Default::default(),
};
res.push(agg_wit_acc);
}
for (accessor, column_type) in column_and_types {
let missing_value_term_agg = if use_special_missing_agg {
None
} else {
missing.clone()
};
let missing_value_for_accessor =
if let Some(missing) = missing_value_term_agg.as_ref() {
get_missing_val(column_type, missing, agg.agg.get_fast_field_name())?
} else {
None
};
let agg = AggregationWithAccessor {
missing_value_for_accessor,
accessor,
accessors: Vec::new(),
field_type: column_type,
sub_aggregation: get_aggs_with_segment_accessor_and_validate(
sub_aggregation,
reader,
&limits,
)?,
agg: agg.clone(),
str_dict_column: str_dict_column.clone(),
limits: limits.new_guard(),
column_block_accessor: Default::default(),
};
res.push(agg);
}
}
Average(AverageAggregation {
field: field_name, ..
})
| Count(CountAggregation {
field: field_name, ..
})
| Max(MaxAggregation {
field: field_name, ..
})
| Min(MinAggregation {
field: field_name, ..
})
| Stats(StatsAggregation {
field: field_name, ..
})
| Sum(SumAggregation {
field: field_name, ..
}) => {
let (accessor, column_type) =
get_ff_reader(reader, field_name, Some(get_numeric_or_date_column_types()))?;
add_agg_with_accessor(accessor, column_type, &mut res)?;
} }
Percentiles(percentiles) => { Percentiles(percentiles) => {
let (accessor, field_type) = get_ff_reader( let (accessor, column_type) = get_ff_reader(
reader, reader,
percentiles.field_name(), percentiles.field_name(),
Some(get_numeric_or_date_column_types()), Some(get_numeric_or_date_column_types()),
)?; )?;
(accessor, field_type) add_agg_with_accessor(accessor, column_type, &mut res)?;
} }
}; };
let sub_aggregation = sub_aggregation.clone(); Ok(res)
Ok(AggregationWithAccessor {
accessor,
accessor2,
field_type,
sub_aggregation: get_aggs_with_segment_accessor_and_validate(
&sub_aggregation,
reader,
&limits,
)?,
agg: agg.clone(),
str_dict_column,
limits: limits.new_guard(),
column_block_accessor: Default::default(),
})
} }
} }
fn get_missing_val(
column_type: ColumnType,
missing: &Key,
field_name: &str,
) -> crate::Result<Option<u64>> {
let missing_val = match missing {
Key::Str(_) if column_type == ColumnType::Str => Some(u64::MAX),
// Allow fallback to number on text fields
Key::F64(_) if column_type == ColumnType::Str => Some(u64::MAX),
Key::F64(val) if column_type.numerical_type().is_some() => {
f64_to_fastfield_u64(*val, &column_type)
}
_ => {
return Err(crate::TantivyError::InvalidArgument(format!(
"Missing value {:?} for field {} is not supported for column type {:?}",
missing, field_name, column_type
)));
}
};
Ok(missing_val)
}
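To make the sentinel convention above concrete, a hedged sketch of calling get_missing_val from within this module (the function is private; values are illustrative):

// A string `missing` key on a Str column uses the placeholder ordinal u64::MAX,
// which is swapped back for the user-provided key when the terms are loaded.
let placeholder = get_missing_val(ColumnType::Str, &Key::Str("NO_DATA".to_string()), "text").unwrap();
assert_eq!(placeholder, Some(u64::MAX));

// A numeric `missing` key on a numeric column is simply encoded with the column's u64 mapping.
let encoded = get_missing_val(ColumnType::F64, &Key::F64(1337.0), "price").unwrap();
assert_eq!(encoded, f64_to_fastfield_u64(1337.0, &ColumnType::F64));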
fn get_numeric_or_date_column_types() -> &'static [ColumnType] { fn get_numeric_or_date_column_types() -> &'static [ColumnType] {
&[ &[
ColumnType::F64, ColumnType::F64,
@@ -141,15 +288,15 @@ pub(crate) fn get_aggs_with_segment_accessor_and_validate(
) -> crate::Result<AggregationsWithAccessor> { ) -> crate::Result<AggregationsWithAccessor> {
let mut aggss = Vec::new(); let mut aggss = Vec::new();
for (key, agg) in aggs.iter() { for (key, agg) in aggs.iter() {
aggss.push(( let aggs = AggregationWithAccessor::try_from_agg(
key.to_string(), agg,
AggregationWithAccessor::try_from_agg( agg.sub_aggregation(),
agg, reader,
agg.sub_aggregation(), limits.clone(),
reader, )?;
limits.clone(), for agg in aggs {
)?, aggss.push((key.to_string(), agg));
)); }
} }
Ok(AggregationsWithAccessor::from_data( Ok(AggregationsWithAccessor::from_data(
VecWithNames::from_entries(aggss), VecWithNames::from_entries(aggss),
@@ -181,15 +328,13 @@ fn get_all_ff_reader_or_empty(
reader: &SegmentReader, reader: &SegmentReader,
field_name: &str, field_name: &str,
allowed_column_types: Option<&[ColumnType]>, allowed_column_types: Option<&[ColumnType]>,
fallback_type: ColumnType,
) -> crate::Result<Vec<(columnar::Column<u64>, ColumnType)>> { ) -> crate::Result<Vec<(columnar::Column<u64>, ColumnType)>> {
let ff_fields = reader.fast_fields(); let ff_fields = reader.fast_fields();
let mut ff_field_with_type = let mut ff_field_with_type =
ff_fields.u64_lenient_for_type_all(allowed_column_types, field_name)?; ff_fields.u64_lenient_for_type_all(allowed_column_types, field_name)?;
if ff_field_with_type.is_empty() { if ff_field_with_type.is_empty() {
ff_field_with_type.push(( ff_field_with_type.push((Column::build_empty_column(reader.num_docs()), fallback_type));
Column::build_empty_column(reader.num_docs()),
ColumnType::U64,
));
} }
Ok(ff_field_with_type) Ok(ff_field_with_type)
} }


@@ -9,7 +9,7 @@ use crate::aggregation::tests::{get_test_index_2_segments, get_test_index_from_v
use crate::aggregation::DistributedAggregationCollector; use crate::aggregation::DistributedAggregationCollector;
use crate::query::{AllQuery, TermQuery}; use crate::query::{AllQuery, TermQuery};
use crate::schema::{IndexRecordOption, Schema, FAST}; use crate::schema::{IndexRecordOption, Schema, FAST};
use crate::{Index, Term}; use crate::{Index, IndexWriter, Term};
fn get_avg_req(field_name: &str) -> Aggregation { fn get_avg_req(field_name: &str) -> Aggregation {
serde_json::from_value(json!({ serde_json::from_value(json!({
@@ -558,10 +558,10 @@ fn test_aggregation_invalid_requests() -> crate::Result<()> {
assert_eq!(agg_req_1.is_err(), true); assert_eq!(agg_req_1.is_err(), true);
// TODO: This should list valid values // TODO: This should list valid values
assert_eq!( assert!(agg_req_1
agg_req_1.unwrap_err().to_string(), .unwrap_err()
"no variant of enum AggregationVariants found in flattened data" .to_string()
); .contains("unknown variant `doesnotmatchanyagg`, expected one of"));
// TODO: This should return an error // TODO: This should return an error
// let agg_res = avg_on_field("not_exist_field").unwrap_err(); // let agg_res = avg_on_field("not_exist_field").unwrap_err();
@@ -586,7 +586,7 @@ fn test_aggregation_on_json_object() {
let json = schema_builder.add_json_field("json", FAST); let json = schema_builder.add_json_field("json", FAST);
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests().unwrap(); let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
index_writer index_writer
.add_document(doc!(json => json!({"color": "red"}))) .add_document(doc!(json => json!({"color": "red"})))
.unwrap(); .unwrap();
@@ -624,13 +624,72 @@ fn test_aggregation_on_json_object() {
); );
} }
#[test]
fn test_aggregation_on_nested_json_object() {
let mut schema_builder = Schema::builder();
let json = schema_builder.add_json_field("json.blub", FAST);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
index_writer
.add_document(doc!(json => json!({"color.dot": "red", "color": {"nested":"red"} })))
.unwrap();
index_writer
.add_document(doc!(json => json!({"color.dot": "blue", "color": {"nested":"blue"} })))
.unwrap();
index_writer.commit().unwrap();
let reader = index.reader().unwrap();
let searcher = reader.searcher();
let agg: Aggregations = serde_json::from_value(json!({
"jsonagg1": {
"terms": {
"field": "json\\.blub.color\\.dot",
}
},
"jsonagg2": {
"terms": {
"field": "json\\.blub.color.nested",
}
}
}))
.unwrap();
let aggregation_collector = get_collector(agg);
let aggregation_results = searcher.search(&AllQuery, &aggregation_collector).unwrap();
let aggregation_res_json = serde_json::to_value(aggregation_results).unwrap();
assert_eq!(
&aggregation_res_json,
&serde_json::json!({
"jsonagg1": {
"buckets": [
{"doc_count": 1, "key": "blue"},
{"doc_count": 1, "key": "red"}
],
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0
},
"jsonagg2": {
"buckets": [
{"doc_count": 1, "key": "blue"},
{"doc_count": 1, "key": "red"}
],
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0
}
})
);
}
#[test] #[test]
fn test_aggregation_on_json_object_empty_columns() { fn test_aggregation_on_json_object_empty_columns() {
let mut schema_builder = Schema::builder(); let mut schema_builder = Schema::builder();
let json = schema_builder.add_json_field("json", FAST); let json = schema_builder.add_json_field("json", FAST);
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests().unwrap(); let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
// => Empty column when accessing color // => Empty column when accessing color
index_writer index_writer
.add_document(doc!(json => json!({"price": 10.0}))) .add_document(doc!(json => json!({"price": 10.0})))
@@ -748,7 +807,7 @@ fn test_aggregation_on_json_object_mixed_types() {
let json = schema_builder.add_json_field("json", FAST); let json = schema_builder.add_json_field("json", FAST);
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests().unwrap(); let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
// => Segment with all values numeric // => Segment with all values numeric
index_writer index_writer
.add_document(doc!(json => json!({"mixed_type": 10.0}))) .add_document(doc!(json => json!({"mixed_type": 10.0})))


@@ -132,6 +132,7 @@ impl DateHistogramAggregationReq {
hard_bounds: self.hard_bounds, hard_bounds: self.hard_bounds,
extended_bounds: self.extended_bounds, extended_bounds: self.extended_bounds,
keyed: self.keyed, keyed: self.keyed,
is_normalized_to_ns: false,
}) })
} }
@@ -243,15 +244,15 @@ fn parse_into_milliseconds(input: &str) -> Result<i64, AggregationError> {
} }
#[cfg(test)] #[cfg(test)]
mod tests { pub mod tests {
use pretty_assertions::assert_eq; use pretty_assertions::assert_eq;
use super::*; use super::*;
use crate::aggregation::agg_req::Aggregations; use crate::aggregation::agg_req::Aggregations;
use crate::aggregation::tests::exec_request; use crate::aggregation::tests::exec_request;
use crate::indexer::NoMergePolicy; use crate::indexer::NoMergePolicy;
use crate::schema::{Schema, FAST}; use crate::schema::{Schema, FAST, STRING};
use crate::Index; use crate::{Index, IndexWriter, TantivyDocument};
#[test] #[test]
fn test_parse_into_millisecs() { fn test_parse_into_millisecs() {
@@ -306,7 +307,8 @@ mod tests {
) -> crate::Result<Index> { ) -> crate::Result<Index> {
let mut schema_builder = Schema::builder(); let mut schema_builder = Schema::builder();
schema_builder.add_date_field("date", FAST); schema_builder.add_date_field("date", FAST);
schema_builder.add_text_field("text", FAST); schema_builder.add_text_field("text", FAST | STRING);
schema_builder.add_text_field("text2", FAST | STRING);
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema.clone()); let index = Index::create_in_ram(schema.clone());
{ {
@@ -314,7 +316,7 @@ mod tests {
index_writer.set_merge_policy(Box::new(NoMergePolicy)); index_writer.set_merge_policy(Box::new(NoMergePolicy));
for values in segment_and_docs { for values in segment_and_docs {
for doc_str in values { for doc_str in values {
let doc = schema.parse_document(doc_str)?; let doc = TantivyDocument::parse_json(&schema, doc_str)?;
index_writer.add_document(doc)?; index_writer.add_document(doc)?;
} }
// writing the segment // writing the segment
@@ -326,7 +328,7 @@ mod tests {
.searchable_segment_ids() .searchable_segment_ids()
.expect("Searchable segments failed."); .expect("Searchable segments failed.");
if segment_ids.len() > 1 { if segment_ids.len() > 1 {
let mut index_writer = index.writer_for_tests()?; let mut index_writer: IndexWriter = index.writer_for_tests()?;
index_writer.merge(&segment_ids).wait()?; index_writer.merge(&segment_ids).wait()?;
index_writer.wait_merging_threads()?; index_writer.wait_merging_threads()?;
} }


@@ -122,11 +122,14 @@ pub struct HistogramAggregation {
/// Whether to return the buckets as a hash map /// Whether to return the buckets as a hash map
#[serde(default)] #[serde(default)]
pub keyed: bool, pub keyed: bool,
/// Whether the values are normalized to ns for date time values. Defaults to false.
#[serde(default)]
pub is_normalized_to_ns: bool,
} }
impl HistogramAggregation { impl HistogramAggregation {
pub(crate) fn normalize(&mut self, column_type: ColumnType) { pub(crate) fn normalize_date_time(&mut self) {
if column_type.is_date_time() { if !self.is_normalized_to_ns {
// values are provided in ms, but the fastfield is in nanoseconds // values are provided in ms, but the fastfield is in nanoseconds
self.interval *= 1_000_000.0; self.interval *= 1_000_000.0;
self.offset = self.offset.map(|off| off * 1_000_000.0); self.offset = self.offset.map(|off| off * 1_000_000.0);
@@ -138,6 +141,7 @@ impl HistogramAggregation {
min: bounds.min * 1_000_000.0, min: bounds.min * 1_000_000.0,
max: bounds.max * 1_000_000.0, max: bounds.max * 1_000_000.0,
}); });
self.is_normalized_to_ns = true;
} }
} }
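As a concrete instance of the ms-to-ns normalization guarded by the new is_normalized_to_ns flag: every interval, offset and bound from the request is scaled by 1_000_000 exactly once. A minimal worked example:

// A one-day interval as it arrives in the request (milliseconds).
let interval_ms = 86_400_000.0_f64;
// After normalize_date_time() the histogram operates in nanoseconds.
let interval_ns = interval_ms * 1_000_000.0;
assert_eq!(interval_ns, 86_400_000_000_000.0);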
@@ -351,6 +355,7 @@ impl SegmentHistogramCollector {
let buckets_mem = self.buckets.memory_consumption(); let buckets_mem = self.buckets.memory_consumption();
self_mem + sub_aggs_mem + buckets_mem self_mem + sub_aggs_mem + buckets_mem
} }
/// Converts the collector result into an intermediate bucket result.
pub fn into_intermediate_bucket_result( pub fn into_intermediate_bucket_result(
self, self,
agg_with_accessor: &AggregationWithAccessor, agg_with_accessor: &AggregationWithAccessor,
@@ -369,7 +374,7 @@ impl SegmentHistogramCollector {
Ok(IntermediateBucketResult::Histogram { Ok(IntermediateBucketResult::Histogram {
buckets, buckets,
column_type: Some(self.column_type), is_date_agg: self.column_type == ColumnType::DateTime,
}) })
} }
@@ -380,7 +385,9 @@ impl SegmentHistogramCollector {
accessor_idx: usize, accessor_idx: usize,
) -> crate::Result<Self> { ) -> crate::Result<Self> {
req.validate()?; req.validate()?;
req.normalize(field_type); if field_type == ColumnType::DateTime {
req.normalize_date_time();
}
let sub_aggregation_blueprint = if sub_aggregation.is_empty() { let sub_aggregation_blueprint = if sub_aggregation.is_empty() {
None None
@@ -438,6 +445,7 @@ fn intermediate_buckets_to_final_buckets_fill_gaps(
// memory check upfront // memory check upfront
let (_, first_bucket_num, last_bucket_num) = let (_, first_bucket_num, last_bucket_num) =
generate_bucket_pos_with_opt_minmax(histogram_req, min_max); generate_bucket_pos_with_opt_minmax(histogram_req, min_max);
// It's based on user input, so we need to account for overflows // It's based on user input, so we need to account for overflows
let added_buckets = ((last_bucket_num.saturating_sub(first_bucket_num)).max(0) as u64) let added_buckets = ((last_bucket_num.saturating_sub(first_bucket_num)).max(0) as u64)
.saturating_sub(buckets.len() as u64); .saturating_sub(buckets.len() as u64);
@@ -453,15 +461,12 @@ fn intermediate_buckets_to_final_buckets_fill_gaps(
let final_buckets: Vec<BucketEntry> = buckets let final_buckets: Vec<BucketEntry> = buckets
.into_iter() .into_iter()
.merge_join_by( .merge_join_by(fill_gaps_buckets, |existing_bucket, fill_gaps_bucket| {
fill_gaps_buckets.into_iter(), existing_bucket
|existing_bucket, fill_gaps_bucket| { .key
existing_bucket .partial_cmp(fill_gaps_bucket)
.key .unwrap_or(Ordering::Equal)
.partial_cmp(fill_gaps_bucket) })
.unwrap_or(Ordering::Equal)
},
)
.map(|either| match either { .map(|either| match either {
// Ignore the generated bucket // Ignore the generated bucket
itertools::EitherOrBoth::Both(existing, _) => existing, itertools::EitherOrBoth::Both(existing, _) => existing,
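The simplified call above leans on itertools' merge_join_by: both inputs are ordered by bucket key, equal keys yield EitherOrBoth::Both and the existing bucket wins, while keys present only in the generated range contribute the empty fill bucket. A self-contained sketch of that merge behaviour, with plain integers standing in for buckets:

use itertools::Itertools;

// Bucket positions that actually received documents.
let existing = vec![2u64, 5];
// The full range of bucket positions the request expects.
let fill_gaps = vec![1u64, 2, 3, 4, 5];

let merged: Vec<u64> = existing
    .into_iter()
    .merge_join_by(fill_gaps, |a, b| a.cmp(b))
    .map(|either| match either {
        // Prefer the existing bucket when both sides have the key.
        itertools::EitherOrBoth::Both(existing, _) => existing,
        itertools::EitherOrBoth::Left(existing) => existing,
        // Key only present in the generated range: take the fill bucket.
        itertools::EitherOrBoth::Right(generated) => generated,
    })
    .collect();

assert_eq!(merged, vec![1, 2, 3, 4, 5]);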
@@ -484,7 +489,7 @@ fn intermediate_buckets_to_final_buckets_fill_gaps(
// Convert to BucketEntry // Convert to BucketEntry
pub(crate) fn intermediate_histogram_buckets_to_final_buckets( pub(crate) fn intermediate_histogram_buckets_to_final_buckets(
buckets: Vec<IntermediateHistogramBucketEntry>, buckets: Vec<IntermediateHistogramBucketEntry>,
column_type: Option<ColumnType>, is_date_agg: bool,
histogram_req: &HistogramAggregation, histogram_req: &HistogramAggregation,
sub_aggregation: &Aggregations, sub_aggregation: &Aggregations,
limits: &AggregationLimits, limits: &AggregationLimits,
@@ -493,8 +498,8 @@ pub(crate) fn intermediate_histogram_buckets_to_final_buckets(
// The request used in the call to final is not yet normalized. // The request used in the call to final is not yet normalized.
// Normalization is changing the precision from milliseconds to nanoseconds. // Normalization is changing the precision from milliseconds to nanoseconds.
let mut histogram_req = histogram_req.clone(); let mut histogram_req = histogram_req.clone();
if let Some(column_type) = column_type { if is_date_agg {
histogram_req.normalize(column_type); histogram_req.normalize_date_time();
} }
let mut buckets = if histogram_req.min_doc_count() == 0 { let mut buckets = if histogram_req.min_doc_count() == 0 {
// With min_doc_count != 0, we may need to add buckets, so that there are no // With min_doc_count != 0, we may need to add buckets, so that there are no
@@ -518,7 +523,7 @@ pub(crate) fn intermediate_histogram_buckets_to_final_buckets(
// If we have a date type on the histogram buckets, we add the `key_as_string` field as rfc339 // If we have a date type on the histogram buckets, we add the `key_as_string` field as rfc339
// and normalize from nanoseconds to milliseconds // and normalize from nanoseconds to milliseconds
if column_type == Some(ColumnType::DateTime) { if is_date_agg {
for bucket in buckets.iter_mut() { for bucket in buckets.iter_mut() {
if let crate::aggregation::Key::F64(ref mut val) = bucket.key { if let crate::aggregation::Key::F64(ref mut val) = bucket.key {
let key_as_string = format_date(*val as i64)?; let key_as_string = format_date(*val as i64)?;


@@ -25,15 +25,15 @@
mod histogram; mod histogram;
mod range; mod range;
mod term_agg; mod term_agg;
mod term_missing_agg;
use std::collections::HashMap; use std::collections::HashMap;
pub(crate) use histogram::SegmentHistogramCollector;
pub use histogram::*; pub use histogram::*;
pub(crate) use range::SegmentRangeCollector;
pub use range::*; pub use range::*;
use serde::{de, Deserialize, Deserializer, Serialize, Serializer}; use serde::{de, Deserialize, Deserializer, Serialize, Serializer};
pub use term_agg::*; pub use term_agg::*;
pub use term_missing_agg::*;
/// Order for buckets in a bucket aggregation. /// Order for buckets in a bucket aggregation.
#[derive(Clone, Copy, Debug, PartialEq, Serialize, Deserialize, Default)] #[derive(Clone, Copy, Debug, PartialEq, Serialize, Deserialize, Default)]


@@ -262,7 +262,7 @@ impl SegmentRangeCollector {
pub(crate) fn from_req_and_validate( pub(crate) fn from_req_and_validate(
req: &RangeAggregation, req: &RangeAggregation,
sub_aggregation: &mut AggregationsWithAccessor, sub_aggregation: &mut AggregationsWithAccessor,
limits: &mut ResourceLimitGuard, limits: &ResourceLimitGuard,
field_type: ColumnType, field_type: ColumnType,
accessor_idx: usize, accessor_idx: usize,
) -> crate::Result<Self> { ) -> crate::Result<Self> {
@@ -465,7 +465,7 @@ mod tests {
SegmentRangeCollector::from_req_and_validate( SegmentRangeCollector::from_req_and_validate(
&req, &req,
&mut Default::default(), &mut Default::default(),
&mut AggregationLimits::default().new_guard(), &AggregationLimits::default().new_guard(),
field_type, field_type,
0, 0,
) )


@@ -1,6 +1,6 @@
use std::fmt::Debug; use std::fmt::Debug;
use columnar::ColumnType; use columnar::{BytesColumn, ColumnType, MonotonicallyMappableToU64, StrColumn};
use rustc_hash::FxHashMap; use rustc_hash::FxHashMap;
use serde::{Deserialize, Serialize}; use serde::{Deserialize, Serialize};
@@ -9,7 +9,6 @@ use crate::aggregation::agg_limits::MemoryConsumption;
use crate::aggregation::agg_req_with_accessor::{ use crate::aggregation::agg_req_with_accessor::{
AggregationWithAccessor, AggregationsWithAccessor, AggregationWithAccessor, AggregationsWithAccessor,
}; };
use crate::aggregation::f64_from_fastfield_u64;
use crate::aggregation::intermediate_agg_result::{ use crate::aggregation::intermediate_agg_result::{
IntermediateAggregationResult, IntermediateAggregationResults, IntermediateBucketResult, IntermediateAggregationResult, IntermediateAggregationResults, IntermediateBucketResult,
IntermediateKey, IntermediateTermBucketEntry, IntermediateTermBucketResult, IntermediateKey, IntermediateTermBucketEntry, IntermediateTermBucketResult,
@@ -17,6 +16,7 @@ use crate::aggregation::intermediate_agg_result::{
use crate::aggregation::segment_agg_result::{ use crate::aggregation::segment_agg_result::{
build_segment_agg_collector, SegmentAggregationCollector, build_segment_agg_collector, SegmentAggregationCollector,
}; };
use crate::aggregation::{f64_from_fastfield_u64, format_date, Key};
use crate::error::DataCorruption; use crate::error::DataCorruption;
use crate::TantivyError; use crate::TantivyError;
@@ -146,6 +146,28 @@ pub struct TermsAggregation {
/// { "average_price": "asc" } /// { "average_price": "asc" }
#[serde(skip_serializing_if = "Option::is_none", default)] #[serde(skip_serializing_if = "Option::is_none", default)]
pub order: Option<CustomOrder>, pub order: Option<CustomOrder>,
/// The missing parameter defines how documents that are missing a value should be treated.
/// By default they will be ignored but it is also possible to treat them as if they had a
/// value. Examples in JSON format:
/// { "missing": "NO_DATA" }
///
/// # Internal
///
/// Internally, `missing` requires some specialized handling in some scenarios.
///
/// Simple Case:
/// In the simplest case, we can just put the missing value in the term map and use that. In
/// the case of text, we put the special value u64::MAX and replace it at the end with the
/// actual missing key, when loading the text.
/// Special Case 1:
/// If we have multiple columns on one field, we need a union of the indices of both columns
/// to find docids without a value. That requires the special missing aggregation.
/// Special Case 2:
/// If the key is of type text and the column is numerical, we also need to use the special
/// missing aggregation, since there is no mechanism in the numerical column to add text.
#[serde(skip_serializing_if = "Option::is_none", default)]
pub missing: Option<Key>,
} }
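A hedged request-level sketch of the new parameter, mirroring the tests added below (the field name is illustrative): documents without a value in the field are bucketed under the supplied key instead of being dropped.

// Sketch: a terms aggregation that buckets documents missing `host` under "UNKNOWN".
let agg_req: Aggregations = serde_json::from_value(serde_json::json!({
    "hosts": {
        "terms": {
            "field": "host",
            "missing": "UNKNOWN"
        }
    }
}))
.unwrap();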
/// Same as TermsAggregation, but with populated defaults. /// Same as TermsAggregation, but with populated defaults.
@@ -176,6 +198,7 @@ pub(crate) struct TermsAggregationInternal {
pub min_doc_count: u64, pub min_doc_count: u64,
pub order: CustomOrder, pub order: CustomOrder,
pub missing: Option<Key>,
} }
impl TermsAggregationInternal { impl TermsAggregationInternal {
@@ -195,6 +218,7 @@ impl TermsAggregationInternal {
.unwrap_or_else(|| order == CustomOrder::default()), .unwrap_or_else(|| order == CustomOrder::default()),
min_doc_count: req.min_doc_count.unwrap_or(1), min_doc_count: req.min_doc_count.unwrap_or(1),
order, order,
missing: req.missing.clone(),
} }
} }
} }
@@ -224,110 +248,6 @@ impl TermBuckets {
} }
} }
/// The composite collector is used, when we have different types under one field, to support a term
/// aggregation on both.
#[derive(Clone, Debug)]
pub struct SegmentTermCollectorComposite {
term_agg1: SegmentTermCollector, // field type 1, e.g. strings
term_agg2: SegmentTermCollector, // field type 2, e.g. u64
accessor_idx: usize,
}
impl SegmentAggregationCollector for SegmentTermCollectorComposite {
fn add_intermediate_aggregation_result(
self: Box<Self>,
agg_with_accessor: &AggregationsWithAccessor,
results: &mut IntermediateAggregationResults,
) -> crate::Result<()> {
let name = agg_with_accessor.aggs.keys[self.accessor_idx].to_string();
let agg_with_accessor = &agg_with_accessor.aggs.values[self.accessor_idx];
let bucket = self
.term_agg1
.into_intermediate_bucket_result(agg_with_accessor)?;
results.push(
name.to_string(),
IntermediateAggregationResult::Bucket(bucket),
)?;
let bucket = self
.term_agg2
.into_intermediate_bucket_result(agg_with_accessor)?;
results.push(name, IntermediateAggregationResult::Bucket(bucket))?;
Ok(())
}
#[inline]
fn collect(
&mut self,
doc: crate::DocId,
agg_with_accessor: &mut AggregationsWithAccessor,
) -> crate::Result<()> {
self.term_agg1.collect_block(&[doc], agg_with_accessor)?;
self.swap_accessor(&mut agg_with_accessor.aggs.values[self.accessor_idx]);
self.term_agg2.collect_block(&[doc], agg_with_accessor)?;
self.swap_accessor(&mut agg_with_accessor.aggs.values[self.accessor_idx]);
Ok(())
}
#[inline]
fn collect_block(
&mut self,
docs: &[crate::DocId],
agg_with_accessor: &mut AggregationsWithAccessor,
) -> crate::Result<()> {
self.term_agg1.collect_block(docs, agg_with_accessor)?;
self.swap_accessor(&mut agg_with_accessor.aggs.values[self.accessor_idx]);
self.term_agg2.collect_block(docs, agg_with_accessor)?;
self.swap_accessor(&mut agg_with_accessor.aggs.values[self.accessor_idx]);
Ok(())
}
fn flush(&mut self, agg_with_accessor: &mut AggregationsWithAccessor) -> crate::Result<()> {
self.term_agg1.flush(agg_with_accessor)?;
self.swap_accessor(&mut agg_with_accessor.aggs.values[self.accessor_idx]);
self.term_agg2.flush(agg_with_accessor)?;
self.swap_accessor(&mut agg_with_accessor.aggs.values[self.accessor_idx]);
Ok(())
}
}
impl SegmentTermCollectorComposite {
/// Swaps the accessor and field type with the second accessor and field type.
/// This way we can use the same code for both aggregations.
fn swap_accessor(&self, aggregations: &mut AggregationWithAccessor) {
if let Some(accessor) = aggregations.accessor2.as_mut() {
std::mem::swap(&mut accessor.0, &mut aggregations.accessor);
std::mem::swap(&mut accessor.1, &mut aggregations.field_type);
}
}
pub(crate) fn from_req_and_validate(
req: &TermsAggregation,
sub_aggregations: &mut AggregationsWithAccessor,
field_type: ColumnType,
field_type2: ColumnType,
accessor_idx: usize,
) -> crate::Result<Self> {
Ok(Self {
term_agg1: SegmentTermCollector::from_req_and_validate(
req,
sub_aggregations,
field_type,
accessor_idx,
)?,
term_agg2: SegmentTermCollector::from_req_and_validate(
req,
sub_aggregations,
field_type2,
accessor_idx,
)?,
accessor_idx,
})
}
}
/// The collector puts values from the fast field into the correct buckets and does a conversion to /// The collector puts values from the fast field into the correct buckets and does a conversion to
/// the correct datatype. /// the correct datatype.
#[derive(Clone, Debug)] #[derive(Clone, Debug)]
@@ -379,9 +299,16 @@ impl SegmentAggregationCollector for SegmentTermCollector {
let mem_pre = self.get_memory_consumption(); let mem_pre = self.get_memory_consumption();
bucket_agg_accessor if let Some(missing) = bucket_agg_accessor.missing_value_for_accessor {
.column_block_accessor bucket_agg_accessor
.fetch_block(docs, &bucket_agg_accessor.accessor); .column_block_accessor
.fetch_block_with_missing(docs, &bucket_agg_accessor.accessor, missing);
} else {
bucket_agg_accessor
.column_block_accessor
.fetch_block(docs, &bucket_agg_accessor.accessor);
}
for term_id in bucket_agg_accessor.column_block_accessor.iter_vals() { for term_id in bucket_agg_accessor.column_block_accessor.iter_vals() {
let entry = self.term_buckets.entries.entry(term_id).or_default(); let entry = self.term_buckets.entries.entry(term_id).or_default();
*entry += 1; *entry += 1;
@@ -543,19 +470,42 @@ impl SegmentTermCollector {
let term_dict = agg_with_accessor let term_dict = agg_with_accessor
.str_dict_column .str_dict_column
.as_ref() .as_ref()
.expect("internal error: term dictionary not found for term aggregation"); .cloned()
.unwrap_or_else(|| {
StrColumn::wrap(BytesColumn::empty(agg_with_accessor.accessor.num_docs()))
});
let mut buffer = String::new(); let mut buffer = String::new();
for (term_id, doc_count) in entries { for (term_id, doc_count) in entries {
if !term_dict.ord_to_str(term_id, &mut buffer)? {
return Err(TantivyError::InternalError(format!(
"Couldn't find term_id {term_id} in dict"
)));
}
let intermediate_entry = into_intermediate_bucket_entry(term_id, doc_count)?; let intermediate_entry = into_intermediate_bucket_entry(term_id, doc_count)?;
// Special case for missing key
dict.insert(IntermediateKey::Str(buffer.to_string()), intermediate_entry); if term_id == u64::MAX {
let missing_key = self
.req
.missing
.as_ref()
.expect("Found placeholder term_id but `missing` is None");
match missing_key {
Key::Str(missing) => {
buffer.clear();
buffer.push_str(missing);
dict.insert(
IntermediateKey::Str(buffer.to_string()),
intermediate_entry,
);
}
Key::F64(val) => {
buffer.push_str(&val.to_string());
dict.insert(IntermediateKey::F64(*val), intermediate_entry);
}
}
} else {
if !term_dict.ord_to_str(term_id, &mut buffer)? {
return Err(TantivyError::InternalError(format!(
"Couldn't find term_id {term_id} in dict"
)));
}
dict.insert(IntermediateKey::Str(buffer.to_string()), intermediate_entry);
}
} }
if self.req.min_doc_count == 0 { if self.req.min_doc_count == 0 {
// TODO: Handle rev streaming for descending sorting by keys // TODO: Handle rev streaming for descending sorting by keys
@@ -581,6 +531,13 @@ impl SegmentTermCollector {
}); });
} }
} }
} else if self.field_type == ColumnType::DateTime {
for (val, doc_count) in entries {
let intermediate_entry = into_intermediate_bucket_entry(val, doc_count)?;
let val = i64::from_u64(val);
let date = format_date(val)?;
dict.insert(IntermediateKey::Str(date), intermediate_entry);
}
} else { } else {
for (val, doc_count) in entries { for (val, doc_count) in entries {
let intermediate_entry = into_intermediate_bucket_entry(val, doc_count)?; let intermediate_entry = into_intermediate_bucket_entry(val, doc_count)?;
@@ -633,6 +590,9 @@ pub(crate) fn cut_off_buckets<T: GetDocCount + Debug>(
#[cfg(test)] #[cfg(test)]
mod tests { mod tests {
use common::DateTime;
use time::{Date, Month};
use crate::aggregation::agg_req::Aggregations; use crate::aggregation::agg_req::Aggregations;
use crate::aggregation::tests::{ use crate::aggregation::tests::{
exec_request, exec_request_with_query, exec_request_with_query_and_memory_limit, exec_request, exec_request_with_query, exec_request_with_query_and_memory_limit,
@@ -641,7 +601,7 @@ mod tests {
use crate::aggregation::AggregationLimits; use crate::aggregation::AggregationLimits;
use crate::indexer::NoMergePolicy; use crate::indexer::NoMergePolicy;
use crate::schema::{Schema, FAST, STRING}; use crate::schema::{Schema, FAST, STRING};
use crate::Index; use crate::{Index, IndexWriter};
#[test] #[test]
fn terms_aggregation_test_single_segment() -> crate::Result<()> { fn terms_aggregation_test_single_segment() -> crate::Result<()> {
@@ -1321,6 +1281,7 @@ mod tests {
]; ];
let index = get_test_index_from_terms(false, &terms_per_segment)?; let index = get_test_index_from_terms(false, &terms_per_segment)?;
assert_eq!(index.searchable_segments().unwrap().len(), 2);
let agg_req: Aggregations = serde_json::from_value(json!({ let agg_req: Aggregations = serde_json::from_value(json!({
"my_texts": { "my_texts": {
@@ -1506,6 +1467,47 @@ mod tests {
Ok(()) Ok(())
} }
#[test]
fn terms_empty_json() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
let json = schema_builder.add_json_field("json", FAST);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
// => Segment with empty json
index_writer.add_document(doc!()).unwrap();
index_writer.commit().unwrap();
// => Segment with json, but no field partially_empty
index_writer
.add_document(doc!(json => json!({"different_field": "blue"})))
.unwrap();
index_writer.commit().unwrap();
// => Segment with field partially_empty
index_writer
.add_document(doc!(json => json!({"partially_empty": "blue"})))
.unwrap();
index_writer.add_document(doc!())?;
index_writer.commit().unwrap();
let agg_req: Aggregations = serde_json::from_value(json!({
"my_texts": {
"terms": {
"field": "json.partially_empty"
},
}
}))
.unwrap();
let res = exec_request_with_query(agg_req, &index, None)?;
assert_eq!(res["my_texts"]["buckets"][0]["key"], "blue");
assert_eq!(res["my_texts"]["buckets"][0]["doc_count"], 1);
assert_eq!(res["my_texts"]["buckets"][1], serde_json::Value::Null);
assert_eq!(res["my_texts"]["sum_other_doc_count"], 0);
assert_eq!(res["my_texts"]["doc_count_error_upper_bound"], 0);
Ok(())
}
#[test] #[test]
fn terms_aggregation_bytes() -> crate::Result<()> { fn terms_aggregation_bytes() -> crate::Result<()> {
@@ -1543,4 +1545,353 @@ mod tests {
Ok(()) Ok(())
} }
#[test]
fn terms_aggregation_missing_multi_value() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
let text_field = schema_builder.add_text_field("text", FAST);
let id_field = schema_builder.add_u64_field("id", FAST);
let index = Index::create_in_ram(schema_builder.build());
{
let mut index_writer = index.writer_with_num_threads(1, 20_000_000)?;
index_writer.set_merge_policy(Box::new(NoMergePolicy));
index_writer.add_document(doc!(
text_field => "Hello Hello",
text_field => "Hello Hello",
id_field => 1u64,
id_field => 1u64,
))?;
// Missing
index_writer.add_document(doc!())?;
index_writer.add_document(doc!(
text_field => "Hello Hello",
))?;
index_writer.add_document(doc!(
text_field => "Hello Hello",
))?;
index_writer.commit()?;
// Empty segment special case
index_writer.add_document(doc!())?;
index_writer.commit()?;
// Full segment special case
index_writer.add_document(doc!(
text_field => "Hello Hello",
id_field => 1u64,
))?;
index_writer.commit()?;
}
let agg_req: Aggregations = serde_json::from_value(json!({
"my_texts": {
"terms": {
"field": "text",
"missing": "Empty"
},
},
"my_texts2": {
"terms": {
"field": "text",
"missing": 1337
},
},
"my_ids": {
"terms": {
"field": "id",
"missing": 1337
},
}
}))
.unwrap();
let res = exec_request_with_query(agg_req, &index, None)?;
// text field
assert_eq!(res["my_texts"]["buckets"][0]["key"], "Hello Hello");
assert_eq!(res["my_texts"]["buckets"][0]["doc_count"], 5);
assert_eq!(res["my_texts"]["buckets"][1]["key"], "Empty");
assert_eq!(res["my_texts"]["buckets"][1]["doc_count"], 2);
assert_eq!(
res["my_texts"]["buckets"][2]["key"],
serde_json::Value::Null
);
// text field with number as missing fallback
assert_eq!(res["my_texts2"]["buckets"][0]["key"], "Hello Hello");
assert_eq!(res["my_texts2"]["buckets"][0]["doc_count"], 5);
assert_eq!(res["my_texts2"]["buckets"][1]["key"], 1337.0);
assert_eq!(res["my_texts2"]["buckets"][1]["doc_count"], 2);
assert_eq!(
res["my_texts2"]["buckets"][2]["key"],
serde_json::Value::Null
);
assert_eq!(res["my_texts"]["sum_other_doc_count"], 0);
assert_eq!(res["my_texts"]["doc_count_error_upper_bound"], 0);
// id field
assert_eq!(res["my_ids"]["buckets"][0]["key"], 1337.0);
assert_eq!(res["my_ids"]["buckets"][0]["doc_count"], 4);
assert_eq!(res["my_ids"]["buckets"][1]["key"], 1.0);
assert_eq!(res["my_ids"]["buckets"][1]["doc_count"], 3);
assert_eq!(res["my_ids"]["buckets"][2]["key"], serde_json::Value::Null);
Ok(())
}
#[test]
fn terms_aggregation_missing_simple_id() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
let id_field = schema_builder.add_u64_field("id", FAST);
let index = Index::create_in_ram(schema_builder.build());
{
let mut index_writer = index.writer_with_num_threads(1, 20_000_000)?;
index_writer.set_merge_policy(Box::new(NoMergePolicy));
index_writer.add_document(doc!(
id_field => 1u64,
))?;
// Missing
index_writer.add_document(doc!())?;
index_writer.add_document(doc!())?;
index_writer.commit()?;
}
let agg_req: Aggregations = serde_json::from_value(json!({
"my_ids": {
"terms": {
"field": "id",
"missing": 1337
},
}
}))
.unwrap();
let res = exec_request_with_query(agg_req, &index, None)?;
// id field
assert_eq!(res["my_ids"]["buckets"][0]["key"], 1337.0);
assert_eq!(res["my_ids"]["buckets"][0]["doc_count"], 2);
assert_eq!(res["my_ids"]["buckets"][1]["key"], 1.0);
assert_eq!(res["my_ids"]["buckets"][1]["doc_count"], 1);
assert_eq!(res["my_ids"]["buckets"][2]["key"], serde_json::Value::Null);
Ok(())
}
#[test]
fn terms_aggregation_missing1() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
let text_field = schema_builder.add_text_field("text", FAST);
let id_field = schema_builder.add_u64_field("id", FAST);
let index = Index::create_in_ram(schema_builder.build());
{
let mut index_writer = index.writer_with_num_threads(1, 20_000_000)?;
index_writer.set_merge_policy(Box::new(NoMergePolicy));
index_writer.add_document(doc!(
text_field => "Hello Hello",
id_field => 1u64,
))?;
// Missing
index_writer.add_document(doc!())?;
index_writer.add_document(doc!(
text_field => "Hello Hello",
))?;
index_writer.add_document(doc!(
text_field => "Hello Hello",
))?;
index_writer.commit()?;
// Empty segment special case
index_writer.add_document(doc!())?;
index_writer.commit()?;
// Full segment special case
index_writer.add_document(doc!(
text_field => "Hello Hello",
id_field => 1u64,
))?;
index_writer.commit()?;
}
let agg_req: Aggregations = serde_json::from_value(json!({
"my_texts": {
"terms": {
"field": "text",
"missing": "Empty"
},
},
"my_texts2": {
"terms": {
"field": "text",
"missing": 1337
},
},
"my_ids": {
"terms": {
"field": "id",
"missing": 1337
},
}
}))
.unwrap();
let res = exec_request_with_query(agg_req, &index, None)?;
// text field
assert_eq!(res["my_texts"]["buckets"][0]["key"], "Hello Hello");
assert_eq!(res["my_texts"]["buckets"][0]["doc_count"], 4);
assert_eq!(res["my_texts"]["buckets"][1]["key"], "Empty");
assert_eq!(res["my_texts"]["buckets"][1]["doc_count"], 2);
assert_eq!(
res["my_texts"]["buckets"][2]["key"],
serde_json::Value::Null
);
// text field with number as missing fallback
assert_eq!(res["my_texts2"]["buckets"][0]["key"], "Hello Hello");
assert_eq!(res["my_texts2"]["buckets"][0]["doc_count"], 4);
assert_eq!(res["my_texts2"]["buckets"][1]["key"], 1337.0);
assert_eq!(res["my_texts2"]["buckets"][1]["doc_count"], 2);
assert_eq!(
res["my_texts2"]["buckets"][2]["key"],
serde_json::Value::Null
);
assert_eq!(res["my_texts"]["sum_other_doc_count"], 0);
assert_eq!(res["my_texts"]["doc_count_error_upper_bound"], 0);
// id field
assert_eq!(res["my_ids"]["buckets"][0]["key"], 1337.0);
assert_eq!(res["my_ids"]["buckets"][0]["doc_count"], 4);
assert_eq!(res["my_ids"]["buckets"][1]["key"], 1.0);
assert_eq!(res["my_ids"]["buckets"][1]["doc_count"], 2);
assert_eq!(res["my_ids"]["buckets"][2]["key"], serde_json::Value::Null);
Ok(())
}
#[test]
fn terms_aggregation_missing_empty() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
schema_builder.add_text_field("text", FAST);
schema_builder.add_u64_field("id", FAST);
let index = Index::create_in_ram(schema_builder.build());
{
let mut index_writer = index.writer_with_num_threads(1, 20_000_000)?;
index_writer.set_merge_policy(Box::new(NoMergePolicy));
// Empty segment special case
index_writer.add_document(doc!())?;
index_writer.commit()?;
}
let agg_req: Aggregations = serde_json::from_value(json!({
"my_texts": {
"terms": {
"field": "text",
"missing": "Empty"
},
},
"my_texts2": {
"terms": {
"field": "text",
"missing": 1337
},
},
"my_ids": {
"terms": {
"field": "id",
"missing": 1337
},
}
}))
.unwrap();
let res = exec_request_with_query(agg_req, &index, None)?;
// text field
assert_eq!(res["my_texts"]["buckets"][0]["key"], "Empty");
assert_eq!(res["my_texts"]["buckets"][0]["doc_count"], 1);
assert_eq!(
res["my_texts"]["buckets"][1]["key"],
serde_json::Value::Null
);
// text field with number as missing fallback
assert_eq!(res["my_texts2"]["buckets"][0]["key"], 1337.0);
assert_eq!(res["my_texts2"]["buckets"][0]["doc_count"], 1);
assert_eq!(
res["my_texts2"]["buckets"][1]["key"],
serde_json::Value::Null
);
assert_eq!(res["my_texts"]["sum_other_doc_count"], 0);
assert_eq!(res["my_texts"]["doc_count_error_upper_bound"], 0);
// id field
assert_eq!(res["my_ids"]["buckets"][0]["key"], 1337.0);
assert_eq!(res["my_ids"]["buckets"][0]["doc_count"], 1);
assert_eq!(res["my_ids"]["buckets"][1]["key"], serde_json::Value::Null);
Ok(())
}
#[test]
fn terms_aggregation_date() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
let date_field = schema_builder.add_date_field("date_field", FAST);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
{
let mut writer = index.writer_with_num_threads(1, 15_000_000)?;
writer.add_document(doc!(date_field=>DateTime::from_primitive(Date::from_calendar_date(1982, Month::September, 17)?.with_hms(0, 0, 0)?)))?;
writer.add_document(doc!(date_field=>DateTime::from_primitive(Date::from_calendar_date(1982, Month::September, 17)?.with_hms(0, 0, 0)?)))?;
writer.add_document(doc!(date_field=>DateTime::from_primitive(Date::from_calendar_date(1983, Month::September, 27)?.with_hms(0, 0, 0)?)))?;
writer.commit()?;
}
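// Term buckets over a date fast field are keyed by the RFC 3339 string
// representation of the date, as asserted below.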
let agg_req: Aggregations = serde_json::from_value(json!({
"my_date": {
"terms": {
"field": "date_field"
},
}
}))
.unwrap();
let res = exec_request_with_query(agg_req, &index, None)?;
// date_field field
assert_eq!(res["my_date"]["buckets"][0]["key"], "1982-09-17T00:00:00Z");
assert_eq!(res["my_date"]["buckets"][0]["doc_count"], 2);
assert_eq!(res["my_date"]["buckets"][1]["key"], "1983-09-27T00:00:00Z");
assert_eq!(res["my_date"]["buckets"][1]["doc_count"], 1);
assert_eq!(res["my_date"]["buckets"][2]["key"], serde_json::Value::Null);
Ok(())
}
#[test]
fn terms_aggregation_date_missing() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
let date_field = schema_builder.add_date_field("date_field", FAST);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
{
let mut writer = index.writer_with_num_threads(1, 15_000_000)?;
writer.add_document(doc!(date_field=>DateTime::from_primitive(Date::from_calendar_date(1982, Month::September, 17)?.with_hms(0, 0, 0)?)))?;
writer.add_document(doc!(date_field=>DateTime::from_primitive(Date::from_calendar_date(1982, Month::September, 17)?.with_hms(0, 0, 0)?)))?;
writer.add_document(doc!(date_field=>DateTime::from_primitive(Date::from_calendar_date(1983, Month::September, 27)?.with_hms(0, 0, 0)?)))?;
writer.add_document(doc!())?;
writer.commit()?;
}
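// The `missing` fallback below is itself a date string, so the document without a
// `date_field` is folded into the existing 1982-09-17 bucket (doc_count 3).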
let agg_req: Aggregations = serde_json::from_value(json!({
"my_date": {
"terms": {
"field": "date_field",
"missing": "1982-09-17T00:00:00Z"
},
}
}))
.unwrap();
let res = exec_request_with_query(agg_req, &index, None)?;
// date_field field
assert_eq!(res["my_date"]["buckets"][0]["key"], "1982-09-17T00:00:00Z");
assert_eq!(res["my_date"]["buckets"][0]["doc_count"], 3);
assert_eq!(res["my_date"]["buckets"][1]["key"], "1983-09-27T00:00:00Z");
assert_eq!(res["my_date"]["buckets"][1]["doc_count"], 1);
assert_eq!(res["my_date"]["buckets"][2]["key"], serde_json::Value::Null);
Ok(())
}
} }

View File

@@ -0,0 +1,476 @@
use rustc_hash::FxHashMap;
use crate::aggregation::agg_req_with_accessor::AggregationsWithAccessor;
use crate::aggregation::intermediate_agg_result::{
IntermediateAggregationResult, IntermediateAggregationResults, IntermediateBucketResult,
IntermediateKey, IntermediateTermBucketEntry, IntermediateTermBucketResult,
};
use crate::aggregation::segment_agg_result::{
build_segment_agg_collector, SegmentAggregationCollector,
};
/// The specialized missing term aggregation.
#[derive(Default, Debug, Clone)]
pub struct TermMissingAgg {
missing_count: u32,
accessor_idx: usize,
sub_agg: Option<Box<dyn SegmentAggregationCollector>>,
}
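// This collector only tracks documents for which the target field has no value at
// all: `missing_count` accumulates them per segment, and `sub_agg`, when present,
// is fed exactly those documents so that nested aggregations are computed over the
// missing bucket only.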
impl TermMissingAgg {
pub(crate) fn new(
accessor_idx: usize,
sub_aggregations: &mut AggregationsWithAccessor,
) -> crate::Result<Self> {
let has_sub_aggregations = !sub_aggregations.is_empty();
let sub_agg = if has_sub_aggregations {
let sub_aggregation = build_segment_agg_collector(sub_aggregations)?;
Some(sub_aggregation)
} else {
None
};
Ok(Self {
accessor_idx,
sub_agg,
..Default::default()
})
}
}
impl SegmentAggregationCollector for TermMissingAgg {
fn add_intermediate_aggregation_result(
self: Box<Self>,
agg_with_accessor: &AggregationsWithAccessor,
results: &mut IntermediateAggregationResults,
) -> crate::Result<()> {
let name = agg_with_accessor.aggs.keys[self.accessor_idx].to_string();
let agg_with_accessor = &agg_with_accessor.aggs.values[self.accessor_idx];
let term_agg = agg_with_accessor
.agg
.agg
.as_term()
.expect("TermMissingAgg collector must be term agg req");
let missing = term_agg
.missing
.as_ref()
.expect("TermMissingAgg collector, but no missing found in agg req")
.clone();
let mut entries: FxHashMap<IntermediateKey, IntermediateTermBucketEntry> =
Default::default();
let mut missing_entry = IntermediateTermBucketEntry {
doc_count: self.missing_count,
sub_aggregation: Default::default(),
};
if let Some(sub_agg) = self.sub_agg {
let mut res = IntermediateAggregationResults::default();
sub_agg.add_intermediate_aggregation_result(
&agg_with_accessor.sub_aggregation,
&mut res,
)?;
missing_entry.sub_aggregation = res;
}
entries.insert(missing.into(), missing_entry);
let bucket = IntermediateBucketResult::Terms(IntermediateTermBucketResult {
entries,
sum_other_doc_count: 0,
doc_count_error_upper_bound: 0,
});
results.push(name, IntermediateAggregationResult::Bucket(bucket))?;
Ok(())
}
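// A document counts as missing only if none of the column accessors for this field
// has a value for it (a JSON path may be backed by several typed columns, e.g.
// when it holds mixed numeric and text values).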
fn collect(
&mut self,
doc: crate::DocId,
agg_with_accessor: &mut AggregationsWithAccessor,
) -> crate::Result<()> {
let agg = &mut agg_with_accessor.aggs.values[self.accessor_idx];
let has_value = agg.accessors.iter().any(|acc| acc.index.has_value(doc));
if !has_value {
self.missing_count += 1;
if let Some(sub_agg) = self.sub_agg.as_mut() {
sub_agg.collect(doc, &mut agg.sub_aggregation)?;
}
}
Ok(())
}
fn collect_block(
&mut self,
docs: &[crate::DocId],
agg_with_accessor: &mut AggregationsWithAccessor,
) -> crate::Result<()> {
for doc in docs {
self.collect(*doc, agg_with_accessor)?;
}
Ok(())
}
}
#[cfg(test)]
mod tests {
use crate::aggregation::agg_req::Aggregations;
use crate::aggregation::tests::exec_request_with_query;
use crate::schema::{Schema, FAST};
use crate::{Index, IndexWriter};
#[test]
fn terms_aggregation_missing_mixed_type_mult_seg_sub_agg() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
let json = schema_builder.add_json_field("json", FAST);
let score = schema_builder.add_f64_field("score", FAST);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
// => Segment with all values numeric
index_writer
.add_document(doc!(score => 1.0, json => json!({"mixed_type": 10.0})))
.unwrap();
index_writer.add_document(doc!(score => 5.0))?;
// index_writer.commit().unwrap();
// => Segment with all values text
index_writer
.add_document(doc!(score => 1.0, json => json!({"mixed_type": "blue"})))
.unwrap();
index_writer.add_document(doc!(score => 5.0))?;
// index_writer.commit().unwrap();
// => Segment with mixed values
index_writer.add_document(doc!(json => json!({"mixed_type": "red"})))?;
index_writer.add_document(doc!(json => json!({"mixed_type": -20.5})))?;
index_writer.add_document(doc!(json => json!({"mixed_type": true})))?;
index_writer.add_document(doc!(score => 5.0))?;
index_writer.commit().unwrap();
let agg_req: Aggregations = serde_json::from_value(json!({
"replace_null": {
"terms": {
"field": "json.mixed_type",
"missing": "NULL"
},
"aggs": {
"sum_score": {
"sum": {
"field": "score"
}
}
}
},
}))
.unwrap();
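// The three documents without a `mixed_type` value each carry score 5.0, so the
// "NULL" bucket ends up with doc_count 3 and a summed score of 15.0.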
let res = exec_request_with_query(agg_req, &index, None)?;
// text field
assert_eq!(res["replace_null"]["buckets"][0]["key"], "NULL");
assert_eq!(res["replace_null"]["buckets"][0]["doc_count"], 3);
assert_eq!(
res["replace_null"]["buckets"][0]["sum_score"]["value"],
15.0
);
assert_eq!(res["replace_null"]["sum_other_doc_count"], 0);
assert_eq!(res["replace_null"]["doc_count_error_upper_bound"], 0);
Ok(())
}
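// Illustrative sketch (not part of the original change set): the field name
// `json.never_indexed` and this test are made up to show that the same `missing`
// fallback applies when the requested JSON path does not exist in any document,
// in which case every document lands in the fallback bucket.
#[test]
fn terms_aggregation_missing_path_never_indexed_sketch() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
let json = schema_builder.add_json_field("json", FAST);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
index_writer.add_document(doc!(json => json!({"other_field": 1.0})))?;
index_writer.add_document(doc!())?;
index_writer.commit().unwrap();
let agg_req: Aggregations = serde_json::from_value(json!({
"replace_null": {
"terms": {
"field": "json.never_indexed",
"missing": "NULL"
},
},
}))
.unwrap();
let res = exec_request_with_query(agg_req, &index, None)?;
// Both documents miss the path, so both are counted in the "NULL" bucket.
assert_eq!(res["replace_null"]["buckets"][0]["key"], "NULL");
assert_eq!(res["replace_null"]["buckets"][0]["doc_count"], 2);
Ok(())
}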
#[test]
fn terms_aggregation_missing_mixed_type_sub_agg_reg1() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
let json = schema_builder.add_json_field("json", FAST);
let score = schema_builder.add_f64_field("score", FAST);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
// => Segment with all values numeric
index_writer.add_document(doc!(score => 1.0, json => json!({"mixed_type": 10.0})))?;
index_writer.add_document(doc!(score => 5.0))?;
index_writer.add_document(doc!(score => 5.0))?;
index_writer.commit().unwrap();
let agg_req: Aggregations = serde_json::from_value(json!({
"replace_null": {
"terms": {
"field": "json.mixed_type",
"missing": "NULL"
},
"aggs": {
"sum_score": {
"sum": {
"field": "score"
}
}
}
},
}))
.unwrap();
let res = exec_request_with_query(agg_req, &index, None)?;
// text field
assert_eq!(res["replace_null"]["buckets"][0]["key"], "NULL");
assert_eq!(res["replace_null"]["buckets"][0]["doc_count"], 2);
assert_eq!(
res["replace_null"]["buckets"][0]["sum_score"]["value"],
10.0
);
assert_eq!(res["replace_null"]["sum_other_doc_count"], 0);
assert_eq!(res["replace_null"]["doc_count_error_upper_bound"], 0);
Ok(())
}
#[test]
fn terms_aggregation_missing_mult_seg_empty() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
let score = schema_builder.add_f64_field("score", FAST);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
index_writer.add_document(doc!(score => 5.0))?;
index_writer.commit().unwrap();
index_writer.add_document(doc!(score => 5.0))?;
index_writer.commit().unwrap();
index_writer.add_document(doc!(score => 5.0))?;
index_writer.commit().unwrap();
let agg_req: Aggregations = serde_json::from_value(json!({
"replace_null": {
"terms": {
"field": "json.mixed_type",
"missing": "NULL"
},
"aggs": {
"sum_score": {
"sum": {
"field": "score"
}
}
}
},
}))
.unwrap();
let res = exec_request_with_query(agg_req, &index, None)?;
// text field
assert_eq!(res["replace_null"]["buckets"][0]["key"], "NULL");
assert_eq!(res["replace_null"]["buckets"][0]["doc_count"], 3);
assert_eq!(
res["replace_null"]["buckets"][0]["sum_score"]["value"],
15.0
);
assert_eq!(res["replace_null"]["sum_other_doc_count"], 0);
assert_eq!(res["replace_null"]["doc_count_error_upper_bound"], 0);
Ok(())
}
#[test]
fn terms_aggregation_missing_single_seg_empty() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
let score = schema_builder.add_f64_field("score", FAST);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
index_writer.add_document(doc!(score => 5.0))?;
index_writer.add_document(doc!(score => 5.0))?;
index_writer.add_document(doc!(score => 5.0))?;
index_writer.commit().unwrap();
let agg_req: Aggregations = serde_json::from_value(json!({
"replace_null": {
"terms": {
"field": "json.mixed_type",
"missing": "NULL"
},
"aggs": {
"sum_score": {
"sum": {
"field": "score"
}
}
}
},
}))
.unwrap();
let res = exec_request_with_query(agg_req, &index, None)?;
// text field
assert_eq!(res["replace_null"]["buckets"][0]["key"], "NULL");
assert_eq!(res["replace_null"]["buckets"][0]["doc_count"], 3);
assert_eq!(
res["replace_null"]["buckets"][0]["sum_score"]["value"],
15.0
);
assert_eq!(res["replace_null"]["sum_other_doc_count"], 0);
assert_eq!(res["replace_null"]["doc_count_error_upper_bound"], 0);
Ok(())
}
#[test]
fn terms_aggregation_missing_mixed_type_mult_seg() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
let json = schema_builder.add_json_field("json", FAST);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
// => Segment with all values numeric
index_writer
.add_document(doc!(json => json!({"mixed_type": 10.0})))
.unwrap();
index_writer.add_document(doc!())?;
index_writer.commit().unwrap();
// => Segment with all values text
index_writer
.add_document(doc!(json => json!({"mixed_type": "blue"})))
.unwrap();
index_writer.add_document(doc!())?;
index_writer.commit().unwrap();
// => Segment with mixed values
index_writer
.add_document(doc!(json => json!({"mixed_type": "red"})))
.unwrap();
index_writer
.add_document(doc!(json => json!({"mixed_type": -20.5})))
.unwrap();
index_writer
.add_document(doc!(json => json!({"mixed_type": true})))
.unwrap();
index_writer.add_document(doc!())?;
index_writer.commit().unwrap();
let agg_req: Aggregations = serde_json::from_value(json!({
"replace_null": {
"terms": {
"field": "json.mixed_type",
"missing": "NULL"
},
},
"replace_num": {
"terms": {
"field": "json.mixed_type",
"missing": 1337
},
},
}))
.unwrap();
let res = exec_request_with_query(agg_req, &index, None)?;
// text field
assert_eq!(res["replace_null"]["buckets"][0]["key"], "NULL");
assert_eq!(res["replace_null"]["buckets"][0]["doc_count"], 3);
assert_eq!(res["replace_num"]["buckets"][0]["key"], 1337.0);
assert_eq!(res["replace_num"]["buckets"][0]["doc_count"], 3);
assert_eq!(res["replace_null"]["sum_other_doc_count"], 0);
assert_eq!(res["replace_null"]["doc_count_error_upper_bound"], 0);
Ok(())
}
#[test]
fn terms_aggregation_missing_str_on_numeric_field() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
let json = schema_builder.add_json_field("json", FAST);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
// => Segment with all values numeric
index_writer
.add_document(doc!(json => json!({"mixed_type": 10.0})))
.unwrap();
index_writer.add_document(doc!())?;
index_writer.add_document(doc!())?;
index_writer
.add_document(doc!(json => json!({"mixed_type": -20.5})))
.unwrap();
index_writer.add_document(doc!())?;
index_writer.commit().unwrap();
let agg_req: Aggregations = serde_json::from_value(json!({
"replace_null": {
"terms": {
"field": "json.mixed_type",
"missing": "NULL"
},
},
}))
.unwrap();
let res = exec_request_with_query(agg_req, &index, None)?;
// text field
assert_eq!(res["replace_null"]["buckets"][0]["key"], "NULL");
assert_eq!(res["replace_null"]["buckets"][0]["doc_count"], 3);
assert_eq!(res["replace_null"]["sum_other_doc_count"], 0);
assert_eq!(res["replace_null"]["doc_count_error_upper_bound"], 0);
Ok(())
}
#[test]
fn terms_aggregation_missing_mixed_type_one_seg() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
let json = schema_builder.add_json_field("json", FAST);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
// => Segment with all values numeric
index_writer
.add_document(doc!(json => json!({"mixed_type": 10.0})))
.unwrap();
index_writer.add_document(doc!())?;
// => Segment with all values text
index_writer
.add_document(doc!(json => json!({"mixed_type": "blue"})))
.unwrap();
index_writer.add_document(doc!())?;
// => Segment with mixed values
index_writer
.add_document(doc!(json => json!({"mixed_type": "red"})))
.unwrap();
index_writer
.add_document(doc!(json => json!({"mixed_type": -20.5})))
.unwrap();
index_writer
.add_document(doc!(json => json!({"mixed_type": true})))
.unwrap();
index_writer.add_document(doc!())?;
index_writer.commit().unwrap();
let agg_req: Aggregations = serde_json::from_value(json!({
"replace_null": {
"terms": {
"field": "json.mixed_type",
"missing": "NULL"
},
},
}))
.unwrap();
let res = exec_request_with_query(agg_req, &index, None)?;
// text field
assert_eq!(res["replace_null"]["buckets"][0]["key"], "NULL");
assert_eq!(res["replace_null"]["buckets"][0]["doc_count"], 3);
assert_eq!(res["replace_null"]["sum_other_doc_count"], 0);
assert_eq!(res["replace_null"]["doc_count_error_upper_bound"], 0);
Ok(())
}
}

View File

@@ -111,9 +111,6 @@ impl IntermediateAggregationResults {
} }
/// Convert intermediate result and its aggregation request to the final result. /// Convert intermediate result and its aggregation request to the final result.
///
/// Internal function, AggregationsInternal is used instead Aggregations, which is optimized
/// for internal processing, by splitting metric and buckets into separate groups.
pub(crate) fn into_final_result_internal( pub(crate) fn into_final_result_internal(
self, self,
req: &Aggregations, req: &Aggregations,
@@ -121,7 +118,14 @@ impl IntermediateAggregationResults {
) -> crate::Result<AggregationResults> { ) -> crate::Result<AggregationResults> {
let mut results: FxHashMap<String, AggregationResult> = FxHashMap::default(); let mut results: FxHashMap<String, AggregationResult> = FxHashMap::default();
for (key, agg_res) in self.aggs_res.into_iter() { for (key, agg_res) in self.aggs_res.into_iter() {
let req = req.get(key.as_str()).unwrap(); let req = req.get(key.as_str()).unwrap_or_else(|| {
panic!(
"Could not find key {:?} in request keys {:?}. This probably means that \
add_intermediate_aggregation_result passed the wrong agg object.",
key,
req.keys().collect::<Vec<_>>()
)
});
results.insert(key, agg_res.into_final_result(req, limits)?); results.insert(key, agg_res.into_final_result(req, limits)?);
} }
// Handle empty results // Handle empty results
@@ -168,10 +172,16 @@ pub(crate) fn empty_from_req(req: &Aggregation) -> IntermediateAggregationResult
Range(_) => IntermediateAggregationResult::Bucket(IntermediateBucketResult::Range( Range(_) => IntermediateAggregationResult::Bucket(IntermediateBucketResult::Range(
Default::default(), Default::default(),
)), )),
Histogram(_) | DateHistogram(_) => { Histogram(_) => {
IntermediateAggregationResult::Bucket(IntermediateBucketResult::Histogram { IntermediateAggregationResult::Bucket(IntermediateBucketResult::Histogram {
buckets: Vec::new(), buckets: Vec::new(),
column_type: None, is_date_agg: false,
})
}
DateHistogram(_) => {
IntermediateAggregationResult::Bucket(IntermediateBucketResult::Histogram {
buckets: Vec::new(),
is_date_agg: true,
}) })
} }
Average(_) => IntermediateAggregationResult::Metric(IntermediateMetricResult::Average( Average(_) => IntermediateAggregationResult::Metric(IntermediateMetricResult::Average(
@@ -339,8 +349,8 @@ pub enum IntermediateBucketResult {
/// This is the histogram entry for a bucket, which contains a key, count, and optionally /// This is the histogram entry for a bucket, which contains a key, count, and optionally
/// sub_aggregations. /// sub_aggregations.
Histogram { Histogram {
/// The column_type of the underlying `Column` /// The column_type of the underlying `Column` is DateTime
column_type: Option<ColumnType>, is_date_agg: bool,
/// The buckets /// The buckets
buckets: Vec<IntermediateHistogramBucketEntry>, buckets: Vec<IntermediateHistogramBucketEntry>,
}, },
@@ -395,7 +405,7 @@ impl IntermediateBucketResult {
Ok(BucketResult::Range { buckets }) Ok(BucketResult::Range { buckets })
} }
IntermediateBucketResult::Histogram { IntermediateBucketResult::Histogram {
column_type, is_date_agg,
buckets, buckets,
} => { } => {
let histogram_req = &req let histogram_req = &req
@@ -404,7 +414,7 @@ impl IntermediateBucketResult {
.expect("unexpected aggregation, expected histogram aggregation"); .expect("unexpected aggregation, expected histogram aggregation");
let buckets = intermediate_histogram_buckets_to_final_buckets( let buckets = intermediate_histogram_buckets_to_final_buckets(
buckets, buckets,
column_type, is_date_agg,
histogram_req, histogram_req,
req.sub_aggregation(), req.sub_aggregation(),
limits, limits,
@@ -453,17 +463,17 @@ impl IntermediateBucketResult {
( (
IntermediateBucketResult::Histogram { IntermediateBucketResult::Histogram {
buckets: buckets_left, buckets: buckets_left,
.. is_date_agg: _,
}, },
IntermediateBucketResult::Histogram { IntermediateBucketResult::Histogram {
buckets: buckets_right, buckets: buckets_right,
.. is_date_agg: _,
}, },
) => { ) => {
let buckets: Result<Vec<IntermediateHistogramBucketEntry>, TantivyError> = let buckets: Result<Vec<IntermediateHistogramBucketEntry>, TantivyError> =
buckets_left buckets_left
.drain(..) .drain(..)
.merge_join_by(buckets_right.into_iter(), |left, right| { .merge_join_by(buckets_right, |left, right| {
left.key.partial_cmp(&right.key).unwrap_or(Ordering::Equal) left.key.partial_cmp(&right.key).unwrap_or(Ordering::Equal)
}) })
.map(|either| match either { .map(|either| match either {

View File

@@ -20,12 +20,21 @@ use super::{IntermediateStats, SegmentStatsCollector};
pub struct AverageAggregation { pub struct AverageAggregation {
/// The field name to compute the average on. /// The field name to compute the average on.
pub field: String, pub field: String,
/// The missing parameter defines how documents that are missing a value should be treated.
/// By default they will be ignored but it is also possible to treat them as if they had a
/// value. Examples in JSON format:
/// { "field": "my_numbers", "missing": "10.0" }
#[serde(default)]
pub missing: Option<f64>,
} }
impl AverageAggregation { impl AverageAggregation {
/// Creates a new [`AverageAggregation`] instance from a field name. /// Creates a new [`AverageAggregation`] instance from a field name.
pub fn from_field_name(field_name: String) -> Self { pub fn from_field_name(field_name: String) -> Self {
Self { field: field_name } Self {
field: field_name,
missing: None,
}
} }
/// Returns the field name the aggregation is computed on. /// Returns the field name the aggregation is computed on.
pub fn field_name(&self) -> &str { pub fn field_name(&self) -> &str {

View File

@@ -18,14 +18,23 @@ use super::{IntermediateStats, SegmentStatsCollector};
/// ``` /// ```
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)] #[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
pub struct CountAggregation { pub struct CountAggregation {
/// The field name to compute the minimum on. /// The field name to compute the count on.
pub field: String, pub field: String,
/// The missing parameter defines how documents that are missing a value should be treated.
/// By default they will be ignored but it is also possible to treat them as if they had a
/// value. Examples in JSON format:
/// { "field": "my_numbers", "missing": "10.0" }
#[serde(default)]
pub missing: Option<f64>,
} }
impl CountAggregation { impl CountAggregation {
/// Creates a new [`CountAggregation`] instance from a field name. /// Creates a new [`CountAggregation`] instance from a field name.
pub fn from_field_name(field_name: String) -> Self { pub fn from_field_name(field_name: String) -> Self {
Self { field: field_name } Self {
field: field_name,
missing: None,
}
} }
/// Returns the field name the aggregation is computed on. /// Returns the field name the aggregation is computed on.
pub fn field_name(&self) -> &str { pub fn field_name(&self) -> &str {
@@ -51,7 +60,7 @@ impl IntermediateCount {
pub fn merge_fruits(&mut self, other: IntermediateCount) { pub fn merge_fruits(&mut self, other: IntermediateCount) {
self.stats.merge_fruits(other.stats); self.stats.merge_fruits(other.stats);
} }
/// Computes the final minimum value. /// Computes the final count value.
pub fn finalize(&self) -> Option<f64> { pub fn finalize(&self) -> Option<f64> {
Some(self.stats.finalize().count as f64) Some(self.stats.finalize().count as f64)
} }

View File

@@ -20,12 +20,21 @@ use super::{IntermediateStats, SegmentStatsCollector};
pub struct MaxAggregation { pub struct MaxAggregation {
/// The field name to compute the maximum on. /// The field name to compute the maximum on.
pub field: String, pub field: String,
/// The missing parameter defines how documents that are missing a value should be treated.
/// By default they will be ignored but it is also possible to treat them as if they had a
/// value. Examples in JSON format:
/// { "field": "my_numbers", "missing": "10.0" }
#[serde(default)]
pub missing: Option<f64>,
} }
impl MaxAggregation { impl MaxAggregation {
/// Creates a new [`MaxAggregation`] instance from a field name. /// Creates a new [`MaxAggregation`] instance from a field name.
pub fn from_field_name(field_name: String) -> Self { pub fn from_field_name(field_name: String) -> Self {
Self { field: field_name } Self {
field: field_name,
missing: None,
}
} }
/// Returns the field name the aggregation is computed on. /// Returns the field name the aggregation is computed on.
pub fn field_name(&self) -> &str { pub fn field_name(&self) -> &str {
@@ -56,3 +65,55 @@ impl IntermediateMax {
self.stats.finalize().max self.stats.finalize().max
} }
} }
#[cfg(test)]
mod tests {
use crate::aggregation::agg_req::Aggregations;
use crate::aggregation::tests::exec_request_with_query;
use crate::schema::{Schema, FAST};
use crate::{Index, IndexWriter};
#[test]
fn test_max_agg_with_missing() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
let json = schema_builder.add_json_field("json", FAST);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
// => Segment with empty json
index_writer.add_document(doc!()).unwrap();
index_writer.commit().unwrap();
// => Segment with json, but no field partially_empty
index_writer
.add_document(doc!(json => json!({"different_field": "blue"})))
.unwrap();
index_writer.commit().unwrap();
// => Segment with field partially_empty
index_writer
.add_document(doc!(json => json!({"partially_empty": 10.0})))
.unwrap();
index_writer.add_document(doc!())?;
index_writer.commit().unwrap();
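// Only one document has `json.partially_empty` (10.0); the other three fall back
// to the configured `missing` of 100.0, which therefore dominates the max.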
let agg_req: Aggregations = serde_json::from_value(json!({
"my_stats": {
"max": {
"field": "json.partially_empty",
"missing": 100.0,
}
}
}))
.unwrap();
let res = exec_request_with_query(agg_req, &index, None)?;
assert_eq!(
res["my_stats"],
json!({
"value": 100.0,
})
);
Ok(())
}
}

View File

@@ -20,12 +20,21 @@ use super::{IntermediateStats, SegmentStatsCollector};
pub struct MinAggregation { pub struct MinAggregation {
/// The field name to compute the minimum on. /// The field name to compute the minimum on.
pub field: String, pub field: String,
/// The missing parameter defines how documents that are missing a value should be treated.
/// By default they will be ignored but it is also possible to treat them as if they had a
/// value. Examples in JSON format:
/// { "field": "my_numbers", "missing": "10.0" }
#[serde(default)]
pub missing: Option<f64>,
} }
impl MinAggregation { impl MinAggregation {
/// Creates a new [`MinAggregation`] instance from a field name. /// Creates a new [`MinAggregation`] instance from a field name.
pub fn from_field_name(field_name: String) -> Self { pub fn from_field_name(field_name: String) -> Self {
Self { field: field_name } Self {
field: field_name,
missing: None,
}
} }
/// Returns the field name the aggregation is computed on. /// Returns the field name the aggregation is computed on.
pub fn field_name(&self) -> &str { pub fn field_name(&self) -> &str {

View File

@@ -88,7 +88,7 @@ mod tests {
use crate::aggregation::AggregationCollector; use crate::aggregation::AggregationCollector;
use crate::query::AllQuery; use crate::query::AllQuery;
use crate::schema::{NumericOptions, Schema}; use crate::schema::{NumericOptions, Schema};
use crate::Index; use crate::{Index, IndexWriter};
#[test] #[test]
fn test_metric_aggregations() { fn test_metric_aggregations() {
@@ -96,7 +96,7 @@ mod tests {
let field_options = NumericOptions::default().set_fast(); let field_options = NumericOptions::default().set_fast();
let field = schema_builder.add_f64_field("price", field_options); let field = schema_builder.add_f64_field("price", field_options);
let index = Index::create_in_ram(schema_builder.build()); let index = Index::create_in_ram(schema_builder.build());
let mut index_writer = index.writer_for_tests().unwrap(); let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
for i in 0..3 { for i in 0..3 {
index_writer index_writer

View File

@@ -11,7 +11,7 @@ use crate::aggregation::intermediate_agg_result::{
IntermediateAggregationResult, IntermediateAggregationResults, IntermediateMetricResult, IntermediateAggregationResult, IntermediateAggregationResults, IntermediateMetricResult,
}; };
use crate::aggregation::segment_agg_result::SegmentAggregationCollector; use crate::aggregation::segment_agg_result::SegmentAggregationCollector;
use crate::aggregation::{f64_from_fastfield_u64, AggregationError}; use crate::aggregation::{f64_from_fastfield_u64, f64_to_fastfield_u64, AggregationError};
use crate::{DocId, TantivyError}; use crate::{DocId, TantivyError};
/// # Percentiles /// # Percentiles
@@ -80,6 +80,12 @@ pub struct PercentilesAggregationReq {
/// Whether to return the percentiles as a hash map /// Whether to return the percentiles as a hash map
#[serde(default = "default_as_true")] #[serde(default = "default_as_true")]
pub keyed: bool, pub keyed: bool,
/// The missing parameter defines how documents that are missing a value should be treated.
/// By default they will be ignored but it is also possible to treat them as if they had a
/// value. Examples in JSON format:
/// { "field": "my_numbers", "missing": "10.0" }
#[serde(skip_serializing_if = "Option::is_none", default)]
pub missing: Option<f64>,
} }
fn default_percentiles() -> &'static [f64] { fn default_percentiles() -> &'static [f64] {
&[1.0, 5.0, 25.0, 50.0, 75.0, 95.0, 99.0] &[1.0, 5.0, 25.0, 50.0, 75.0, 95.0, 99.0]
@@ -95,6 +101,7 @@ impl PercentilesAggregationReq {
field: field_name, field: field_name,
percents: None, percents: None,
keyed: default_as_true(), keyed: default_as_true(),
missing: None,
} }
} }
/// Returns the field name the aggregation is computed on. /// Returns the field name the aggregation is computed on.
@@ -127,6 +134,7 @@ pub(crate) struct SegmentPercentilesCollector {
pub(crate) percentiles: PercentilesCollector, pub(crate) percentiles: PercentilesCollector,
pub(crate) accessor_idx: usize, pub(crate) accessor_idx: usize,
val_cache: Vec<u64>, val_cache: Vec<u64>,
missing: Option<u64>,
} }
#[derive(Clone, Serialize, Deserialize)] #[derive(Clone, Serialize, Deserialize)]
@@ -227,11 +235,16 @@ impl SegmentPercentilesCollector {
accessor_idx: usize, accessor_idx: usize,
) -> crate::Result<Self> { ) -> crate::Result<Self> {
req.validate()?; req.validate()?;
let missing = req
.missing
.and_then(|val| f64_to_fastfield_u64(val, &field_type));
Ok(Self { Ok(Self {
field_type, field_type,
percentiles: PercentilesCollector::new(), percentiles: PercentilesCollector::new(),
accessor_idx, accessor_idx,
val_cache: Default::default(), val_cache: Default::default(),
missing,
}) })
} }
#[inline] #[inline]
@@ -240,9 +253,17 @@ impl SegmentPercentilesCollector {
docs: &[DocId], docs: &[DocId],
agg_accessor: &mut AggregationWithAccessor, agg_accessor: &mut AggregationWithAccessor,
) { ) {
agg_accessor if let Some(missing) = self.missing.as_ref() {
.column_block_accessor agg_accessor.column_block_accessor.fetch_block_with_missing(
.fetch_block(docs, &agg_accessor.accessor); docs,
&agg_accessor.accessor,
*missing,
);
} else {
agg_accessor
.column_block_accessor
.fetch_block(docs, &agg_accessor.accessor);
}
for val in agg_accessor.column_block_accessor.iter_vals() { for val in agg_accessor.column_block_accessor.iter_vals() {
let val1 = f64_from_fastfield_u64(val, &self.field_type); let val1 = f64_from_fastfield_u64(val, &self.field_type);
@@ -277,9 +298,22 @@ impl SegmentAggregationCollector for SegmentPercentilesCollector {
) -> crate::Result<()> { ) -> crate::Result<()> {
let field = &agg_with_accessor.aggs.values[self.accessor_idx].accessor; let field = &agg_with_accessor.aggs.values[self.accessor_idx].accessor;
for val in field.values_for_doc(doc) { if let Some(missing) = self.missing {
let val1 = f64_from_fastfield_u64(val, &self.field_type); let mut has_val = false;
self.percentiles.collect(val1); for val in field.values_for_doc(doc) {
let val1 = f64_from_fastfield_u64(val, &self.field_type);
self.percentiles.collect(val1);
has_val = true;
}
if !has_val {
self.percentiles
.collect(f64_from_fastfield_u64(missing, &self.field_type));
}
} else {
for val in field.values_for_doc(doc) {
let val1 = f64_from_fastfield_u64(val, &self.field_type);
self.percentiles.collect(val1);
}
} }
Ok(()) Ok(())
@@ -309,10 +343,12 @@ mod tests {
use crate::aggregation::agg_req::Aggregations; use crate::aggregation::agg_req::Aggregations;
use crate::aggregation::agg_result::AggregationResults; use crate::aggregation::agg_result::AggregationResults;
use crate::aggregation::tests::{ use crate::aggregation::tests::{
get_test_index_from_values, get_test_index_from_values_and_terms, exec_request_with_query, get_test_index_from_values, get_test_index_from_values_and_terms,
}; };
use crate::aggregation::AggregationCollector; use crate::aggregation::AggregationCollector;
use crate::query::AllQuery; use crate::query::AllQuery;
use crate::schema::{Schema, FAST};
use crate::Index;
#[test] #[test]
fn test_aggregation_percentiles_empty_index() -> crate::Result<()> { fn test_aggregation_percentiles_empty_index() -> crate::Result<()> {
@@ -463,7 +499,7 @@ mod tests {
fn test_aggregation_percentiles(merge_segments: bool) -> crate::Result<()> { fn test_aggregation_percentiles(merge_segments: bool) -> crate::Result<()> {
use rand_distr::Distribution; use rand_distr::Distribution;
let num_values_in_segment = vec![100, 30_000, 8000]; let num_values_in_segment = [100, 30_000, 8000];
let lg_norm = rand_distr::LogNormal::new(2.996f64, 0.979f64).unwrap(); let lg_norm = rand_distr::LogNormal::new(2.996f64, 0.979f64).unwrap();
let mut rng = StdRng::from_seed([1u8; 32]); let mut rng = StdRng::from_seed([1u8; 32]);
@@ -545,4 +581,110 @@ mod tests {
Ok(()) Ok(())
} }
#[test]
fn test_percentiles_missing_sub_agg() -> crate::Result<()> {
// This test verifies the `collect` method (in contrast to `collect_block`), which is
// called when the sub-aggregations are flushed.
let mut schema_builder = Schema::builder();
let text_field = schema_builder.add_text_field("texts", FAST);
let score_field_f64 = schema_builder.add_f64_field("score", FAST);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
{
let mut index_writer = index.writer_for_tests()?;
// writing the segment
index_writer.add_document(doc!(
score_field_f64 => 10.0f64,
text_field => "a"
))?;
index_writer.add_document(doc!(
score_field_f64 => 10.0f64,
text_field => "a"
))?;
index_writer.add_document(doc!(text_field => "a"))?;
index_writer.commit()?;
}
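// Two documents carry score 10.0 and one has none; with `missing: 5.0` the
// percentile sketch effectively sees [10.0, 10.0, 5.0], which the approximate
// values asserted below reflect.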
let agg_req: Aggregations = {
serde_json::from_value(json!({
"range_with_stats": {
"terms": {
"field": "texts"
},
"aggs": {
"percentiles": {
"percentiles": {
"field": "score",
"missing": 5.0
}
}
}
}
}))
.unwrap()
};
let res = exec_request_with_query(agg_req, &index, None)?;
assert_eq!(res["range_with_stats"]["buckets"][0]["doc_count"], 3);
assert_eq!(
res["range_with_stats"]["buckets"][0]["percentiles"]["values"]["1.0"],
5.0028295751107414
);
assert_eq!(
res["range_with_stats"]["buckets"][0]["percentiles"]["values"]["99.0"],
10.07469668951144
);
Ok(())
}
#[test]
fn test_percentiles_missing() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
let text_field = schema_builder.add_text_field("texts", FAST);
let score_field_f64 = schema_builder.add_f64_field("score", FAST);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
{
let mut index_writer = index.writer_for_tests()?;
// writing the segment
index_writer.add_document(doc!(
score_field_f64 => 10.0f64,
text_field => "a"
))?;
index_writer.add_document(doc!(
score_field_f64 => 10.0f64,
text_field => "a"
))?;
index_writer.add_document(doc!(text_field => "a"))?;
index_writer.commit()?;
}
let agg_req: Aggregations = {
serde_json::from_value(json!({
"percentiles": {
"percentiles": {
"field": "score",
"missing": 5.0
}
}
}))
.unwrap()
};
let res = exec_request_with_query(agg_req, &index, None)?;
assert_eq!(res["percentiles"]["values"]["1.0"], 5.0028295751107414);
assert_eq!(res["percentiles"]["values"]["99.0"], 10.07469668951144);
Ok(())
}
} }

View File

@@ -5,11 +5,11 @@ use super::*;
use crate::aggregation::agg_req_with_accessor::{ use crate::aggregation::agg_req_with_accessor::{
AggregationWithAccessor, AggregationsWithAccessor, AggregationWithAccessor, AggregationsWithAccessor,
}; };
use crate::aggregation::f64_from_fastfield_u64;
use crate::aggregation::intermediate_agg_result::{ use crate::aggregation::intermediate_agg_result::{
IntermediateAggregationResult, IntermediateAggregationResults, IntermediateMetricResult, IntermediateAggregationResult, IntermediateAggregationResults, IntermediateMetricResult,
}; };
use crate::aggregation::segment_agg_result::SegmentAggregationCollector; use crate::aggregation::segment_agg_result::SegmentAggregationCollector;
use crate::aggregation::{f64_from_fastfield_u64, f64_to_fastfield_u64};
use crate::{DocId, TantivyError}; use crate::{DocId, TantivyError};
/// A multi-value metric aggregation that computes a collection of statistics on numeric values that /// A multi-value metric aggregation that computes a collection of statistics on numeric values that
@@ -29,12 +29,21 @@ use crate::{DocId, TantivyError};
pub struct StatsAggregation { pub struct StatsAggregation {
/// The field name to compute the stats on. /// The field name to compute the stats on.
pub field: String, pub field: String,
/// The missing parameter defines how documents that are missing a value should be treated.
/// By default they will be ignored but it is also possible to treat them as if they had a
/// value. Examples in JSON format:
/// { "field": "my_numbers", "missing": "10.0" }
#[serde(default)]
pub missing: Option<f64>,
} }
impl StatsAggregation { impl StatsAggregation {
/// Creates a new [`StatsAggregation`] instance from a field name. /// Creates a new [`StatsAggregation`] instance from a field name.
pub fn from_field_name(field_name: String) -> Self { pub fn from_field_name(field_name: String) -> Self {
StatsAggregation { field: field_name } StatsAggregation {
field: field_name,
missing: None,
}
} }
/// Returns the field name the aggregation is computed on. /// Returns the field name the aggregation is computed on.
pub fn field_name(&self) -> &str { pub fn field_name(&self) -> &str {
@@ -153,6 +162,7 @@ pub(crate) enum SegmentStatsType {
#[derive(Clone, Debug, PartialEq)] #[derive(Clone, Debug, PartialEq)]
pub(crate) struct SegmentStatsCollector { pub(crate) struct SegmentStatsCollector {
missing: Option<u64>,
field_type: ColumnType, field_type: ColumnType,
pub(crate) collecting_for: SegmentStatsType, pub(crate) collecting_for: SegmentStatsType,
pub(crate) stats: IntermediateStats, pub(crate) stats: IntermediateStats,
@@ -165,12 +175,15 @@ impl SegmentStatsCollector {
field_type: ColumnType, field_type: ColumnType,
collecting_for: SegmentStatsType, collecting_for: SegmentStatsType,
accessor_idx: usize, accessor_idx: usize,
missing: Option<f64>,
) -> Self { ) -> Self {
let missing = missing.and_then(|val| f64_to_fastfield_u64(val, &field_type));
Self { Self {
field_type, field_type,
collecting_for, collecting_for,
stats: IntermediateStats::default(), stats: IntermediateStats::default(),
accessor_idx, accessor_idx,
missing,
val_cache: Default::default(), val_cache: Default::default(),
} }
} }
@@ -180,10 +193,17 @@ impl SegmentStatsCollector {
docs: &[DocId], docs: &[DocId],
agg_accessor: &mut AggregationWithAccessor, agg_accessor: &mut AggregationWithAccessor,
) { ) {
agg_accessor if let Some(missing) = self.missing.as_ref() {
.column_block_accessor agg_accessor.column_block_accessor.fetch_block_with_missing(
.fetch_block(docs, &agg_accessor.accessor); docs,
&agg_accessor.accessor,
*missing,
);
} else {
agg_accessor
.column_block_accessor
.fetch_block(docs, &agg_accessor.accessor);
}
for val in agg_accessor.column_block_accessor.iter_vals() { for val in agg_accessor.column_block_accessor.iter_vals() {
let val1 = f64_from_fastfield_u64(val, &self.field_type); let val1 = f64_from_fastfield_u64(val, &self.field_type);
self.stats.collect(val1); self.stats.collect(val1);
@@ -234,10 +254,22 @@ impl SegmentAggregationCollector for SegmentStatsCollector {
agg_with_accessor: &mut AggregationsWithAccessor, agg_with_accessor: &mut AggregationsWithAccessor,
) -> crate::Result<()> { ) -> crate::Result<()> {
let field = &agg_with_accessor.aggs.values[self.accessor_idx].accessor; let field = &agg_with_accessor.aggs.values[self.accessor_idx].accessor;
if let Some(missing) = self.missing {
for val in field.values_for_doc(doc) { let mut has_val = false;
let val1 = f64_from_fastfield_u64(val, &self.field_type); for val in field.values_for_doc(doc) {
self.stats.collect(val1); let val1 = f64_from_fastfield_u64(val, &self.field_type);
self.stats.collect(val1);
has_val = true;
}
if !has_val {
self.stats
.collect(f64_from_fastfield_u64(missing, &self.field_type));
}
} else {
for val in field.values_for_doc(doc) {
let val1 = f64_from_fastfield_u64(val, &self.field_type);
self.stats.collect(val1);
}
} }
Ok(()) Ok(())
@@ -262,11 +294,13 @@ mod tests {
use crate::aggregation::agg_req::{Aggregation, Aggregations}; use crate::aggregation::agg_req::{Aggregation, Aggregations};
use crate::aggregation::agg_result::AggregationResults; use crate::aggregation::agg_result::AggregationResults;
use crate::aggregation::tests::{get_test_index_2_segments, get_test_index_from_values}; use crate::aggregation::tests::{
exec_request_with_query, get_test_index_2_segments, get_test_index_from_values,
};
use crate::aggregation::AggregationCollector; use crate::aggregation::AggregationCollector;
use crate::query::{AllQuery, TermQuery}; use crate::query::{AllQuery, TermQuery};
use crate::schema::IndexRecordOption; use crate::schema::{IndexRecordOption, Schema, FAST};
use crate::Term; use crate::{Index, IndexWriter, Term};
#[test] #[test]
fn test_aggregation_stats_empty_index() -> crate::Result<()> { fn test_aggregation_stats_empty_index() -> crate::Result<()> {
@@ -453,4 +487,159 @@ mod tests {
Ok(()) Ok(())
} }
#[test]
fn test_stats_json() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
let json = schema_builder.add_json_field("json", FAST);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
// => Segment with empty json
index_writer.add_document(doc!()).unwrap();
index_writer.commit().unwrap();
// => Segment with json, but no field partially_empty
index_writer
.add_document(doc!(json => json!({"different_field": "blue"})))
.unwrap();
index_writer.commit().unwrap();
// => Segment with field partially_empty
index_writer
.add_document(doc!(json => json!({"partially_empty": 10.0})))
.unwrap();
index_writer.add_document(doc!())?;
index_writer.commit().unwrap();
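// Only one of the four documents has `json.partially_empty`, so without a
// `missing` fallback the stats below are computed over that single value (10.0).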
let agg_req: Aggregations = serde_json::from_value(json!({
"my_stats": {
"stats": {
"field": "json.partially_empty"
},
}
}))
.unwrap();
let res = exec_request_with_query(agg_req, &index, None)?;
assert_eq!(
res["my_stats"],
json!({
"avg": 10.0,
"count": 1,
"max": 10.0,
"min": 10.0,
"sum": 10.0
})
);
Ok(())
}
#[test]
fn test_stats_json_missing() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
let json = schema_builder.add_json_field("json", FAST);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
// => Segment with empty json
index_writer.add_document(doc!()).unwrap();
index_writer.commit().unwrap();
// => Segment with json, but no field partially_empty
index_writer
.add_document(doc!(json => json!({"different_field": "blue"})))
.unwrap();
index_writer.commit().unwrap();
// => Segment with field partially_empty
index_writer
.add_document(doc!(json => json!({"partially_empty": 10.0})))
.unwrap();
index_writer.add_document(doc!())?;
index_writer.commit().unwrap();
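// With `missing: 0.0`, the three documents without the field contribute zeros:
// count 4, sum 10.0, avg 2.5, min 0.0, max 10.0.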
let agg_req: Aggregations = serde_json::from_value(json!({
"my_stats": {
"stats": {
"field": "json.partially_empty",
"missing": 0.0
},
}
}))
.unwrap();
let res = exec_request_with_query(agg_req, &index, None)?;
assert_eq!(
res["my_stats"],
json!({
"avg": 2.5,
"count": 4,
"max": 10.0,
"min": 0.0,
"sum": 10.0
})
);
Ok(())
}
#[test]
fn test_stats_json_missing_sub_agg() -> crate::Result<()> {
// This test verifies the `collect` method (in contrast to `collect_block`), which is
// called when the sub-aggregations are flushed.
let mut schema_builder = Schema::builder();
let text_field = schema_builder.add_text_field("texts", FAST);
let score_field_f64 = schema_builder.add_f64_field("score", FAST);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
{
let mut index_writer = index.writer_for_tests()?;
// writing the segment
index_writer.add_document(doc!(
score_field_f64 => 10.0f64,
text_field => "a"
))?;
index_writer.add_document(doc!(text_field => "a"))?;
index_writer.commit()?;
}
let agg_req: Aggregations = {
serde_json::from_value(json!({
"range_with_stats": {
"terms": {
"field": "texts"
},
"aggs": {
"my_stats": {
"stats": {
"field": "score",
"missing": 0.0
}
}
}
}
}))
.unwrap()
};
let res = exec_request_with_query(agg_req, &index, None)?;
assert_eq!(
res["range_with_stats"]["buckets"][0]["my_stats"]["count"],
2
);
assert_eq!(
res["range_with_stats"]["buckets"][0]["my_stats"]["min"],
0.0
);
assert_eq!(
res["range_with_stats"]["buckets"][0]["my_stats"]["avg"],
5.0
);
Ok(())
}
} }

View File

@@ -20,12 +20,21 @@ use super::{IntermediateStats, SegmentStatsCollector};
pub struct SumAggregation { pub struct SumAggregation {
/// The field name to compute the minimum on. /// The field name to compute the minimum on.
pub field: String, pub field: String,
/// The missing parameter defines how documents that are missing a value should be treated.
/// By default they will be ignored but it is also possible to treat them as if they had a
/// value. Examples in JSON format:
/// { "field": "my_numbers", "missing": "10.0" }
#[serde(default)]
pub missing: Option<f64>,
} }
impl SumAggregation { impl SumAggregation {
/// Creates a new [`SumAggregation`] instance from a field name. /// Creates a new [`SumAggregation`] instance from a field name.
pub fn from_field_name(field_name: String) -> Self { pub fn from_field_name(field_name: String) -> Self {
Self { field: field_name } Self {
field: field_name,
missing: None,
}
} }
/// Returns the field name the aggregation is computed on. /// Returns the field name the aggregation is computed on.
pub fn field_name(&self) -> &str { pub fn field_name(&self) -> &str {

View File

@@ -319,7 +319,7 @@ mod tests {
use crate::indexer::NoMergePolicy; use crate::indexer::NoMergePolicy;
use crate::query::{AllQuery, TermQuery}; use crate::query::{AllQuery, TermQuery};
use crate::schema::{IndexRecordOption, Schema, TextFieldIndexing, FAST, STRING}; use crate::schema::{IndexRecordOption, Schema, TextFieldIndexing, FAST, STRING};
use crate::{Index, Term}; use crate::{Index, IndexWriter, Term};
pub fn get_test_index_with_num_docs( pub fn get_test_index_with_num_docs(
merge_segments: bool, merge_segments: bool,
@@ -451,7 +451,7 @@ mod tests {
.searchable_segment_ids() .searchable_segment_ids()
.expect("Searchable segments failed."); .expect("Searchable segments failed.");
if segment_ids.len() > 1 { if segment_ids.len() > 1 {
let mut index_writer = index.writer_for_tests()?; let mut index_writer: IndexWriter = index.writer_for_tests()?;
index_writer.merge(&segment_ids).wait()?; index_writer.merge(&segment_ids).wait()?;
index_writer.wait_merging_threads()?; index_writer.wait_merging_threads()?;
} }
@@ -565,7 +565,7 @@ mod tests {
let segment_ids = index let segment_ids = index
.searchable_segment_ids() .searchable_segment_ids()
.expect("Searchable segments failed."); .expect("Searchable segments failed.");
let mut index_writer = index.writer_for_tests()?; let mut index_writer: IndexWriter = index.writer_for_tests()?;
index_writer.merge(&segment_ids).wait()?; index_writer.merge(&segment_ids).wait()?;
index_writer.wait_merging_threads()?; index_writer.wait_merging_threads()?;
} }

View File

@@ -15,7 +15,7 @@ use super::metric::{
SegmentPercentilesCollector, SegmentStatsCollector, SegmentStatsType, StatsAggregation, SegmentPercentilesCollector, SegmentStatsCollector, SegmentStatsType, StatsAggregation,
SumAggregation, SumAggregation,
}; };
use crate::aggregation::bucket::SegmentTermCollectorComposite; use crate::aggregation::bucket::TermMissingAgg;
pub(crate) trait SegmentAggregationCollector: CollectorClone + Debug { pub(crate) trait SegmentAggregationCollector: CollectorClone + Debug {
fn add_intermediate_aggregation_result( fn add_intermediate_aggregation_result(
@@ -82,29 +82,24 @@ pub(crate) fn build_single_agg_segment_collector(
use AggregationVariants::*; use AggregationVariants::*;
match &req.agg.agg { match &req.agg.agg {
Terms(terms_req) => { Terms(terms_req) => {
if let Some(acc2) = req.accessor2.as_ref() { if req.accessors.is_empty() {
Ok(Box::new(
SegmentTermCollectorComposite::from_req_and_validate(
terms_req,
&mut req.sub_aggregation,
req.field_type,
acc2.1,
accessor_idx,
)?,
))
} else {
Ok(Box::new(SegmentTermCollector::from_req_and_validate( Ok(Box::new(SegmentTermCollector::from_req_and_validate(
terms_req, terms_req,
&mut req.sub_aggregation, &mut req.sub_aggregation,
req.field_type, req.field_type,
accessor_idx, accessor_idx,
)?)) )?))
} else {
Ok(Box::new(TermMissingAgg::new(
accessor_idx,
&mut req.sub_aggregation,
)?))
} }
} }
Range(range_req) => Ok(Box::new(SegmentRangeCollector::from_req_and_validate( Range(range_req) => Ok(Box::new(SegmentRangeCollector::from_req_and_validate(
range_req, range_req,
&mut req.sub_aggregation, &mut req.sub_aggregation,
&mut req.limits, &req.limits,
req.field_type, req.field_type,
accessor_idx, accessor_idx,
)?)), )?)),
@@ -120,35 +115,43 @@ pub(crate) fn build_single_agg_segment_collector(
req.field_type, req.field_type,
accessor_idx, accessor_idx,
)?)), )?)),
Average(AverageAggregation { .. }) => Ok(Box::new(SegmentStatsCollector::from_req( Average(AverageAggregation { missing, .. }) => {
req.field_type, Ok(Box::new(SegmentStatsCollector::from_req(
SegmentStatsType::Average, req.field_type,
accessor_idx, SegmentStatsType::Average,
))), accessor_idx,
Count(CountAggregation { .. }) => Ok(Box::new(SegmentStatsCollector::from_req( *missing,
)))
}
Count(CountAggregation { missing, .. }) => Ok(Box::new(SegmentStatsCollector::from_req(
req.field_type, req.field_type,
SegmentStatsType::Count, SegmentStatsType::Count,
accessor_idx, accessor_idx,
*missing,
))), ))),
Max(MaxAggregation { .. }) => Ok(Box::new(SegmentStatsCollector::from_req( Max(MaxAggregation { missing, .. }) => Ok(Box::new(SegmentStatsCollector::from_req(
req.field_type, req.field_type,
SegmentStatsType::Max, SegmentStatsType::Max,
accessor_idx, accessor_idx,
*missing,
))), ))),
Min(MinAggregation { .. }) => Ok(Box::new(SegmentStatsCollector::from_req( Min(MinAggregation { missing, .. }) => Ok(Box::new(SegmentStatsCollector::from_req(
req.field_type, req.field_type,
SegmentStatsType::Min, SegmentStatsType::Min,
accessor_idx, accessor_idx,
*missing,
))), ))),
Stats(StatsAggregation { .. }) => Ok(Box::new(SegmentStatsCollector::from_req( Stats(StatsAggregation { missing, .. }) => Ok(Box::new(SegmentStatsCollector::from_req(
req.field_type, req.field_type,
SegmentStatsType::Stats, SegmentStatsType::Stats,
accessor_idx, accessor_idx,
*missing,
))), ))),
Sum(SumAggregation { .. }) => Ok(Box::new(SegmentStatsCollector::from_req( Sum(SumAggregation { missing, .. }) => Ok(Box::new(SegmentStatsCollector::from_req(
req.field_type, req.field_type,
SegmentStatsType::Sum, SegmentStatsType::Sum,
accessor_idx, accessor_idx,
*missing,
))), ))),
Percentiles(percentiles_req) => Ok(Box::new( Percentiles(percentiles_req) => Ok(Box::new(
SegmentPercentilesCollector::from_req_and_validate( SegmentPercentilesCollector::from_req_and_validate(

View File

@@ -16,7 +16,7 @@ use crate::{DocId, Score, SegmentOrdinal, SegmentReader};
/// let schema = schema_builder.build(); /// let schema = schema_builder.build();
/// let index = Index::create_in_ram(schema); /// let index = Index::create_in_ram(schema);
/// ///
/// let mut index_writer = index.writer(3_000_000).unwrap(); /// let mut index_writer = index.writer(15_000_000).unwrap();
/// index_writer.add_document(doc!(title => "The Name of the Wind")).unwrap(); /// index_writer.add_document(doc!(title => "The Name of the Wind")).unwrap();
/// index_writer.add_document(doc!(title => "The Diary of Muadib")).unwrap(); /// index_writer.add_document(doc!(title => "The Diary of Muadib")).unwrap();
/// index_writer.add_document(doc!(title => "A Dairy Cow")).unwrap(); /// index_writer.add_document(doc!(title => "A Dairy Cow")).unwrap();

View File

@@ -89,7 +89,7 @@ fn facet_depth(facet_bytes: &[u8]) -> usize {
/// let schema = schema_builder.build(); /// let schema = schema_builder.build();
/// let index = Index::create_in_ram(schema); /// let index = Index::create_in_ram(schema);
/// { /// {
/// let mut index_writer = index.writer(3_000_000)?; /// let mut index_writer = index.writer(15_000_000)?;
/// // a document can be associated with any number of facets /// // a document can be associated with any number of facets
/// index_writer.add_document(doc!( /// index_writer.add_document(doc!(
/// title => "The Name of the Wind", /// title => "The Name of the Wind",
@@ -495,8 +495,8 @@ mod tests {
use crate::collector::Count; use crate::collector::Count;
use crate::core::Index; use crate::core::Index;
use crate::query::{AllQuery, QueryParser, TermQuery}; use crate::query::{AllQuery, QueryParser, TermQuery};
use crate::schema::{Document, Facet, FacetOptions, IndexRecordOption, Schema}; use crate::schema::{Facet, FacetOptions, IndexRecordOption, Schema, TantivyDocument};
use crate::Term; use crate::{IndexWriter, Term};
fn test_collapse_mapping_aux( fn test_collapse_mapping_aux(
facet_terms: &[&str], facet_terms: &[&str],
@@ -559,7 +559,7 @@ mod tests {
let facet_field = schema_builder.add_facet_field("facet", FacetOptions::default()); let facet_field = schema_builder.add_facet_field("facet", FacetOptions::default());
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests().unwrap(); let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
index_writer index_writer
.add_document(doc!(facet_field=>Facet::from("/facet/a"))) .add_document(doc!(facet_field=>Facet::from("/facet/a")))
.unwrap(); .unwrap();
@@ -588,7 +588,7 @@ mod tests {
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests().unwrap(); let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
let num_facets: usize = 3 * 4 * 5; let num_facets: usize = 3 * 4 * 5;
let facets: Vec<Facet> = (0..num_facets) let facets: Vec<Facet> = (0..num_facets)
.map(|mut n| { .map(|mut n| {
@@ -601,7 +601,7 @@ mod tests {
}) })
.collect(); .collect();
for i in 0..num_facets * 10 { for i in 0..num_facets * 10 {
let mut doc = Document::new(); let mut doc = TantivyDocument::new();
doc.add_facet(facet_field, facets[i % num_facets].clone()); doc.add_facet(facet_field, facets[i % num_facets].clone());
index_writer.add_document(doc).unwrap(); index_writer.add_document(doc).unwrap();
} }
@@ -732,24 +732,25 @@ mod tests {
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let uniform = Uniform::new_inclusive(1, 100_000); let uniform = Uniform::new_inclusive(1, 100_000);
let mut docs: Vec<Document> = vec![("a", 10), ("b", 100), ("c", 7), ("d", 12), ("e", 21)] let mut docs: Vec<TantivyDocument> =
.into_iter() vec![("a", 10), ("b", 100), ("c", 7), ("d", 12), ("e", 21)]
.flat_map(|(c, count)| { .into_iter()
let facet = Facet::from(&format!("/facet/{}", c)); .flat_map(|(c, count)| {
let doc = doc!(facet_field => facet); let facet = Facet::from(&format!("/facet/{}", c));
iter::repeat(doc).take(count) let doc = doc!(facet_field => facet);
}) iter::repeat(doc).take(count)
.map(|mut doc| { })
doc.add_facet( .map(|mut doc| {
facet_field, doc.add_facet(
&format!("/facet/{}", thread_rng().sample(uniform)), facet_field,
); &format!("/facet/{}", thread_rng().sample(uniform)),
doc );
}) doc
.collect(); })
.collect();
docs[..].shuffle(&mut thread_rng()); docs[..].shuffle(&mut thread_rng());
let mut index_writer = index.writer_for_tests().unwrap(); let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
for doc in docs { for doc in docs {
index_writer.add_document(doc).unwrap(); index_writer.add_document(doc).unwrap();
} }
@@ -780,7 +781,7 @@ mod tests {
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let docs: Vec<Document> = vec![("b", 2), ("a", 2), ("c", 4)] let docs: Vec<TantivyDocument> = vec![("b", 2), ("a", 2), ("c", 4)]
.into_iter() .into_iter()
.flat_map(|(c, count)| { .flat_map(|(c, count)| {
let facet = Facet::from(&format!("/facet/{}", c)); let facet = Facet::from(&format!("/facet/{}", c));
@@ -828,7 +829,7 @@ mod bench {
use crate::collector::FacetCollector; use crate::collector::FacetCollector;
use crate::query::AllQuery; use crate::query::AllQuery;
use crate::schema::{Facet, Schema, INDEXED}; use crate::schema::{Facet, Schema, INDEXED};
use crate::Index; use crate::{Index, IndexWriter};
#[bench] #[bench]
fn bench_facet_collector(b: &mut Bencher) { fn bench_facet_collector(b: &mut Bencher) {
@@ -847,7 +848,7 @@ mod bench {
// 40425 docs // 40425 docs
docs[..].shuffle(&mut thread_rng()); docs[..].shuffle(&mut thread_rng());
let mut index_writer = index.writer_for_tests().unwrap(); let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
for doc in docs { for doc in docs {
index_writer.add_document(doc).unwrap(); index_writer.add_document(doc).unwrap();
} }


@@ -12,8 +12,7 @@ use std::marker::PhantomData;
use columnar::{BytesColumn, Column, DynamicColumn, HasAssociatedColumnType}; use columnar::{BytesColumn, Column, DynamicColumn, HasAssociatedColumnType};
use crate::collector::{Collector, SegmentCollector}; use crate::collector::{Collector, SegmentCollector};
use crate::schema::Field; use crate::{DocId, Score, SegmentReader};
use crate::{DocId, Score, SegmentReader, TantivyError};
/// The `FilterCollector` filters docs using a fast field value and a predicate. /// The `FilterCollector` filters docs using a fast field value and a predicate.
/// ///
@@ -38,7 +37,7 @@ use crate::{DocId, Score, SegmentReader, TantivyError};
/// let schema = schema_builder.build(); /// let schema = schema_builder.build();
/// let index = Index::create_in_ram(schema); /// let index = Index::create_in_ram(schema);
/// ///
/// let mut index_writer = index.writer_with_num_threads(1, 10_000_000)?; /// let mut index_writer = index.writer_with_num_threads(1, 20_000_000)?;
/// index_writer.add_document(doc!(title => "The Name of the Wind", price => 30_200u64))?; /// index_writer.add_document(doc!(title => "The Name of the Wind", price => 30_200u64))?;
/// index_writer.add_document(doc!(title => "The Diary of Muadib", price => 29_240u64))?; /// index_writer.add_document(doc!(title => "The Diary of Muadib", price => 29_240u64))?;
/// index_writer.add_document(doc!(title => "A Dairy Cow", price => 21_240u64))?; /// index_writer.add_document(doc!(title => "A Dairy Cow", price => 21_240u64))?;
@@ -50,13 +49,13 @@ use crate::{DocId, Score, SegmentReader, TantivyError};
/// ///
/// let query_parser = QueryParser::for_index(&index, vec![title]); /// let query_parser = QueryParser::for_index(&index, vec![title]);
/// let query = query_parser.parse_query("diary")?; /// let query = query_parser.parse_query("diary")?;
/// let no_filter_collector = FilterCollector::new(price, |value: u64| value > 20_120u64, TopDocs::with_limit(2)); /// let no_filter_collector = FilterCollector::new("price".to_string(), |value: u64| value > 20_120u64, TopDocs::with_limit(2));
/// let top_docs = searcher.search(&query, &no_filter_collector)?; /// let top_docs = searcher.search(&query, &no_filter_collector)?;
/// ///
/// assert_eq!(top_docs.len(), 1); /// assert_eq!(top_docs.len(), 1);
/// assert_eq!(top_docs[0].1, DocAddress::new(0, 1)); /// assert_eq!(top_docs[0].1, DocAddress::new(0, 1));
/// ///
/// let filter_all_collector: FilterCollector<_, _, u64> = FilterCollector::new(price, |value| value < 5u64, TopDocs::with_limit(2)); /// let filter_all_collector: FilterCollector<_, _, u64> = FilterCollector::new("price".to_string(), |value| value < 5u64, TopDocs::with_limit(2));
/// let filtered_top_docs = searcher.search(&query, &filter_all_collector)?; /// let filtered_top_docs = searcher.search(&query, &filter_all_collector)?;
/// ///
/// assert_eq!(filtered_top_docs.len(), 0); /// assert_eq!(filtered_top_docs.len(), 0);
@@ -70,7 +69,7 @@ use crate::{DocId, Score, SegmentReader, TantivyError};
pub struct FilterCollector<TCollector, TPredicate, TPredicateValue> pub struct FilterCollector<TCollector, TPredicate, TPredicateValue>
where TPredicate: 'static + Clone where TPredicate: 'static + Clone
{ {
field: Field, field: String,
collector: TCollector, collector: TCollector,
predicate: TPredicate, predicate: TPredicate,
t_predicate_value: PhantomData<TPredicateValue>, t_predicate_value: PhantomData<TPredicateValue>,
@@ -83,7 +82,7 @@ where
TPredicate: Fn(TPredicateValue) -> bool + Send + Sync + Clone, TPredicate: Fn(TPredicateValue) -> bool + Send + Sync + Clone,
{ {
/// Create a new `FilterCollector`. /// Create a new `FilterCollector`.
pub fn new(field: Field, predicate: TPredicate, collector: TCollector) -> Self { pub fn new(field: String, predicate: TPredicate, collector: TCollector) -> Self {
Self { Self {
field, field,
predicate, predicate,
@@ -110,18 +109,7 @@ where
segment_local_id: u32, segment_local_id: u32,
segment_reader: &SegmentReader, segment_reader: &SegmentReader,
) -> crate::Result<Self::Child> { ) -> crate::Result<Self::Child> {
let schema = segment_reader.schema(); let column_opt = segment_reader.fast_fields().column_opt(&self.field)?;
let field_entry = schema.get_field_entry(self.field);
if !field_entry.is_fast() {
return Err(TantivyError::SchemaError(format!(
"Field {:?} is not a fast field.",
field_entry.name()
)));
}
let column_opt = segment_reader
.fast_fields()
.column_opt(field_entry.name())?;
let segment_collector = self let segment_collector = self
.collector .collector
@@ -216,7 +204,7 @@ where
/// let schema = schema_builder.build(); /// let schema = schema_builder.build();
/// let index = Index::create_in_ram(schema); /// let index = Index::create_in_ram(schema);
/// ///
/// let mut index_writer = index.writer_with_num_threads(1, 10_000_000)?; /// let mut index_writer = index.writer_with_num_threads(1, 20_000_000)?;
/// index_writer.add_document(doc!(title => "The Name of the Wind", barcode => &b"010101"[..]))?; /// index_writer.add_document(doc!(title => "The Name of the Wind", barcode => &b"010101"[..]))?;
/// index_writer.add_document(doc!(title => "The Diary of Muadib", barcode => &b"110011"[..]))?; /// index_writer.add_document(doc!(title => "The Diary of Muadib", barcode => &b"110011"[..]))?;
/// index_writer.add_document(doc!(title => "A Dairy Cow", barcode => &b"110111"[..]))?; /// index_writer.add_document(doc!(title => "A Dairy Cow", barcode => &b"110111"[..]))?;
@@ -229,7 +217,7 @@ where
/// ///
/// let query_parser = QueryParser::for_index(&index, vec![title]); /// let query_parser = QueryParser::for_index(&index, vec![title]);
/// let query = query_parser.parse_query("diary")?; /// let query = query_parser.parse_query("diary")?;
/// let filter_collector = BytesFilterCollector::new(barcode, |bytes: &[u8]| bytes.starts_with(b"01"), TopDocs::with_limit(2)); /// let filter_collector = BytesFilterCollector::new("barcode".to_string(), |bytes: &[u8]| bytes.starts_with(b"01"), TopDocs::with_limit(2));
/// let top_docs = searcher.search(&query, &filter_collector)?; /// let top_docs = searcher.search(&query, &filter_collector)?;
/// ///
/// assert_eq!(top_docs.len(), 1); /// assert_eq!(top_docs.len(), 1);
@@ -240,7 +228,7 @@ where
pub struct BytesFilterCollector<TCollector, TPredicate> pub struct BytesFilterCollector<TCollector, TPredicate>
where TPredicate: 'static + Clone where TPredicate: 'static + Clone
{ {
field: Field, field: String,
collector: TCollector, collector: TCollector,
predicate: TPredicate, predicate: TPredicate,
} }
@@ -251,7 +239,7 @@ where
TPredicate: Fn(&[u8]) -> bool + Send + Sync + Clone, TPredicate: Fn(&[u8]) -> bool + Send + Sync + Clone,
{ {
/// Create a new `BytesFilterCollector`. /// Create a new `BytesFilterCollector`.
pub fn new(field: Field, predicate: TPredicate, collector: TCollector) -> Self { pub fn new(field: String, predicate: TPredicate, collector: TCollector) -> Self {
Self { Self {
field, field,
predicate, predicate,
@@ -274,10 +262,7 @@ where
segment_local_id: u32, segment_local_id: u32,
segment_reader: &SegmentReader, segment_reader: &SegmentReader,
) -> crate::Result<Self::Child> { ) -> crate::Result<Self::Child> {
let schema = segment_reader.schema(); let column_opt = segment_reader.fast_fields().bytes(&self.field)?;
let field_name = schema.get_field_name(self.field);
let column_opt = segment_reader.fast_fields().bytes(field_name)?;
let segment_collector = self let segment_collector = self
.collector .collector


@@ -233,7 +233,7 @@ mod tests {
let val_field = schema_builder.add_i64_field("val_field", FAST); let val_field = schema_builder.add_i64_field("val_field", FAST);
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut writer = index.writer_with_num_threads(1, 4_000_000)?; let mut writer = index.writer_for_tests()?;
writer.add_document(doc!(val_field=>12i64))?; writer.add_document(doc!(val_field=>12i64))?;
writer.add_document(doc!(val_field=>-30i64))?; writer.add_document(doc!(val_field=>-30i64))?;
writer.add_document(doc!(val_field=>-12i64))?; writer.add_document(doc!(val_field=>-12i64))?;
@@ -255,7 +255,7 @@ mod tests {
let val_field = schema_builder.add_i64_field("val_field", FAST); let val_field = schema_builder.add_i64_field("val_field", FAST);
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut writer = index.writer_with_num_threads(1, 4_000_000)?; let mut writer = index.writer_for_tests()?;
writer.add_document(doc!(val_field=>12i64))?; writer.add_document(doc!(val_field=>12i64))?;
writer.commit()?; writer.commit()?;
writer.add_document(doc!(val_field=>-30i64))?; writer.add_document(doc!(val_field=>-30i64))?;
@@ -280,7 +280,7 @@ mod tests {
let date_field = schema_builder.add_date_field("date_field", FAST); let date_field = schema_builder.add_date_field("date_field", FAST);
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut writer = index.writer_with_num_threads(1, 4_000_000)?; let mut writer = index.writer_for_tests()?;
writer.add_document(doc!(date_field=>DateTime::from_primitive(Date::from_calendar_date(1982, Month::September, 17)?.with_hms(0, 0, 0)?)))?; writer.add_document(doc!(date_field=>DateTime::from_primitive(Date::from_calendar_date(1982, Month::September, 17)?.with_hms(0, 0, 0)?)))?;
writer.add_document( writer.add_document(
doc!(date_field=>DateTime::from_primitive(Date::from_calendar_date(1986, Month::March, 9)?.with_hms(0, 0, 0)?)), doc!(date_field=>DateTime::from_primitive(Date::from_calendar_date(1986, Month::March, 9)?.with_hms(0, 0, 0)?)),


@@ -44,7 +44,7 @@
//! # let title = schema_builder.add_text_field("title", TEXT); //! # let title = schema_builder.add_text_field("title", TEXT);
//! # let schema = schema_builder.build(); //! # let schema = schema_builder.build();
//! # let index = Index::create_in_ram(schema); //! # let index = Index::create_in_ram(schema);
//! # let mut index_writer = index.writer(3_000_000)?; //! # let mut index_writer = index.writer(15_000_000)?;
//! # index_writer.add_document(doc!( //! # index_writer.add_document(doc!(
//! # title => "The Name of the Wind", //! # title => "The Name of the Wind",
//! # ))?; //! # ))?;
@@ -97,7 +97,7 @@ pub use self::multi_collector::{FruitHandle, MultiCollector, MultiFruit};
mod top_collector; mod top_collector;
mod top_score_collector; mod top_score_collector;
pub use self::top_score_collector::TopDocs; pub use self::top_score_collector::{TopDocs, TopNComputer};
mod custom_score_top_collector; mod custom_score_top_collector;
pub use self::custom_score_top_collector::{CustomScorer, CustomSegmentScorer}; pub use self::custom_score_top_collector::{CustomScorer, CustomSegmentScorer};


@@ -120,7 +120,7 @@ impl<TFruit: Fruit> FruitHandle<TFruit> {
/// let title = schema_builder.add_text_field("title", TEXT); /// let title = schema_builder.add_text_field("title", TEXT);
/// let schema = schema_builder.build(); /// let schema = schema_builder.build();
/// let index = Index::create_in_ram(schema); /// let index = Index::create_in_ram(schema);
/// let mut index_writer = index.writer(3_000_000)?; /// let mut index_writer = index.writer(15_000_000)?;
/// index_writer.add_document(doc!(title => "The Name of the Wind"))?; /// index_writer.add_document(doc!(title => "The Name of the Wind"))?;
/// index_writer.add_document(doc!(title => "The Diary of Muadib"))?; /// index_writer.add_document(doc!(title => "The Diary of Muadib"))?;
/// index_writer.add_document(doc!(title => "A Dairy Cow"))?; /// index_writer.add_document(doc!(title => "A Dairy Cow"))?;


@@ -7,7 +7,9 @@ use crate::query::{AllQuery, QueryParser};
use crate::schema::{Schema, FAST, TEXT}; use crate::schema::{Schema, FAST, TEXT};
use crate::time::format_description::well_known::Rfc3339; use crate::time::format_description::well_known::Rfc3339;
use crate::time::OffsetDateTime; use crate::time::OffsetDateTime;
use crate::{doc, DateTime, DocAddress, DocId, Document, Index, Score, Searcher, SegmentOrdinal}; use crate::{
doc, DateTime, DocAddress, DocId, Index, Score, Searcher, SegmentOrdinal, TantivyDocument,
};
pub const TEST_COLLECTOR_WITH_SCORE: TestCollector = TestCollector { pub const TEST_COLLECTOR_WITH_SCORE: TestCollector = TestCollector {
compute_score: true, compute_score: true,
@@ -26,7 +28,7 @@ pub fn test_filter_collector() -> crate::Result<()> {
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_with_num_threads(1, 10_000_000)?; let mut index_writer = index.writer_with_num_threads(1, 20_000_000)?;
index_writer.add_document(doc!(title => "The Name of the Wind", price => 30_200u64, date => DateTime::from_utc(OffsetDateTime::parse("1898-04-09T00:00:00+00:00", &Rfc3339).unwrap())))?; index_writer.add_document(doc!(title => "The Name of the Wind", price => 30_200u64, date => DateTime::from_utc(OffsetDateTime::parse("1898-04-09T00:00:00+00:00", &Rfc3339).unwrap())))?;
index_writer.add_document(doc!(title => "The Diary of Muadib", price => 29_240u64, date => DateTime::from_utc(OffsetDateTime::parse("2020-04-09T00:00:00+00:00", &Rfc3339).unwrap())))?; index_writer.add_document(doc!(title => "The Diary of Muadib", price => 29_240u64, date => DateTime::from_utc(OffsetDateTime::parse("2020-04-09T00:00:00+00:00", &Rfc3339).unwrap())))?;
index_writer.add_document(doc!(title => "The Diary of Anne Frank", price => 18_240u64, date => DateTime::from_utc(OffsetDateTime::parse("2019-04-20T00:00:00+00:00", &Rfc3339).unwrap())))?; index_writer.add_document(doc!(title => "The Diary of Anne Frank", price => 18_240u64, date => DateTime::from_utc(OffsetDateTime::parse("2019-04-20T00:00:00+00:00", &Rfc3339).unwrap())))?;
@@ -40,7 +42,7 @@ pub fn test_filter_collector() -> crate::Result<()> {
let query_parser = QueryParser::for_index(&index, vec![title]); let query_parser = QueryParser::for_index(&index, vec![title]);
let query = query_parser.parse_query("diary")?; let query = query_parser.parse_query("diary")?;
let filter_some_collector = FilterCollector::new( let filter_some_collector = FilterCollector::new(
price, "price".to_string(),
&|value: u64| value > 20_120u64, &|value: u64| value > 20_120u64,
TopDocs::with_limit(2), TopDocs::with_limit(2),
); );
@@ -49,8 +51,11 @@ pub fn test_filter_collector() -> crate::Result<()> {
assert_eq!(top_docs.len(), 1); assert_eq!(top_docs.len(), 1);
assert_eq!(top_docs[0].1, DocAddress::new(0, 1)); assert_eq!(top_docs[0].1, DocAddress::new(0, 1));
let filter_all_collector: FilterCollector<_, _, u64> = let filter_all_collector: FilterCollector<_, _, u64> = FilterCollector::new(
FilterCollector::new(price, &|value| value < 5u64, TopDocs::with_limit(2)); "price".to_string(),
&|value| value < 5u64,
TopDocs::with_limit(2),
);
let filtered_top_docs = searcher.search(&query, &filter_all_collector).unwrap(); let filtered_top_docs = searcher.search(&query, &filter_all_collector).unwrap();
assert_eq!(filtered_top_docs.len(), 0); assert_eq!(filtered_top_docs.len(), 0);
@@ -61,7 +66,8 @@ pub fn test_filter_collector() -> crate::Result<()> {
> 0 > 0
} }
let filter_dates_collector = FilterCollector::new(date, &date_filter, TopDocs::with_limit(5)); let filter_dates_collector =
FilterCollector::new("date".to_string(), &date_filter, TopDocs::with_limit(5));
let filtered_date_docs = searcher.search(&query, &filter_dates_collector)?; let filtered_date_docs = searcher.search(&query, &filter_dates_collector)?;
assert_eq!(filtered_date_docs.len(), 2); assert_eq!(filtered_date_docs.len(), 2);
@@ -280,8 +286,8 @@ fn make_test_searcher() -> crate::Result<Searcher> {
let schema = Schema::builder().build(); let schema = Schema::builder().build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests()?; let mut index_writer = index.writer_for_tests()?;
index_writer.add_document(Document::default())?; index_writer.add_document(TantivyDocument::default())?;
index_writer.add_document(Document::default())?; index_writer.add_document(TantivyDocument::default())?;
index_writer.commit()?; index_writer.commit()?;
Ok(index.reader()?.searcher()) Ok(index.reader()?.searcher())
} }


@@ -1,7 +1,7 @@
use std::cmp::Ordering; use std::cmp::Ordering;
use std::collections::BinaryHeap;
use std::marker::PhantomData; use std::marker::PhantomData;
use super::top_score_collector::TopNComputer;
use crate::{DocAddress, DocId, SegmentOrdinal, SegmentReader}; use crate::{DocAddress, DocId, SegmentOrdinal, SegmentReader};
/// Contains a feature (field, score, etc.) of a document along with the document address. /// Contains a feature (field, score, etc.) of a document along with the document address.
@@ -20,6 +20,14 @@ pub(crate) struct ComparableDoc<T, D> {
pub feature: T, pub feature: T,
pub doc: D, pub doc: D,
} }
impl<T: std::fmt::Debug, D: std::fmt::Debug> std::fmt::Debug for ComparableDoc<T, D> {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.debug_struct("ComparableDoc")
.field("feature", &self.feature)
.field("doc", &self.doc)
.finish()
}
}
impl<T: PartialOrd, D: PartialOrd> PartialOrd for ComparableDoc<T, D> { impl<T: PartialOrd, D: PartialOrd> PartialOrd for ComparableDoc<T, D> {
fn partial_cmp(&self, other: &Self) -> Option<Ordering> { fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
@@ -91,18 +99,13 @@ where T: PartialOrd + Clone
if self.limit == 0 { if self.limit == 0 {
return Ok(Vec::new()); return Ok(Vec::new());
} }
let mut top_collector = BinaryHeap::new(); let mut top_collector = TopNComputer::new(self.limit + self.offset);
for child_fruit in children { for child_fruit in children {
for (feature, doc) in child_fruit { for (feature, doc) in child_fruit {
if top_collector.len() < (self.limit + self.offset) { top_collector.push(ComparableDoc { feature, doc });
top_collector.push(ComparableDoc { feature, doc });
} else if let Some(mut head) = top_collector.peek_mut() {
if head.feature < feature {
*head = ComparableDoc { feature, doc };
}
}
} }
} }
Ok(top_collector Ok(top_collector
.into_sorted_vec() .into_sorted_vec()
.into_iter() .into_iter()
@@ -111,7 +114,7 @@ where T: PartialOrd + Clone
.collect()) .collect())
} }
pub(crate) fn for_segment<F: PartialOrd>( pub(crate) fn for_segment<F: PartialOrd + Clone>(
&self, &self,
segment_id: SegmentOrdinal, segment_id: SegmentOrdinal,
_: &SegmentReader, _: &SegmentReader,
@@ -136,20 +139,18 @@ where T: PartialOrd + Clone
/// The Top Collector keeps track of the K documents /// The Top Collector keeps track of the K documents
/// sorted by type `T`. /// sorted by type `T`.
/// ///
/// The implementation is based on a `BinaryHeap`. /// The implementation is based on repeatedly truncating at the median once `K * 2` documents have been collected.
/// The theoretical complexity for collecting the top `K` out of `n` documents /// The theoretical complexity for collecting the top `K` out of `n` documents
/// is `O(n log K)`. /// is `O(n + K)`.
pub(crate) struct TopSegmentCollector<T> { pub(crate) struct TopSegmentCollector<T> {
limit: usize, topn_computer: TopNComputer<T, DocId>,
heap: BinaryHeap<ComparableDoc<T, DocId>>,
segment_ord: u32, segment_ord: u32,
} }
impl<T: PartialOrd> TopSegmentCollector<T> { impl<T: PartialOrd + Clone> TopSegmentCollector<T> {
fn new(segment_ord: SegmentOrdinal, limit: usize) -> TopSegmentCollector<T> { fn new(segment_ord: SegmentOrdinal, limit: usize) -> TopSegmentCollector<T> {
TopSegmentCollector { TopSegmentCollector {
limit, topn_computer: TopNComputer::new(limit),
heap: BinaryHeap::with_capacity(limit),
segment_ord, segment_ord,
} }
} }
@@ -158,7 +159,7 @@ impl<T: PartialOrd> TopSegmentCollector<T> {
impl<T: PartialOrd + Clone> TopSegmentCollector<T> { impl<T: PartialOrd + Clone> TopSegmentCollector<T> {
pub fn harvest(self) -> Vec<(T, DocAddress)> { pub fn harvest(self) -> Vec<(T, DocAddress)> {
let segment_ord = self.segment_ord; let segment_ord = self.segment_ord;
self.heap self.topn_computer
.into_sorted_vec() .into_sorted_vec()
.into_iter() .into_iter()
.map(|comparable_doc| { .map(|comparable_doc| {
@@ -173,33 +174,13 @@ impl<T: PartialOrd + Clone> TopSegmentCollector<T> {
.collect() .collect()
} }
/// Return true if more documents have been collected than the limit.
#[inline]
pub(crate) fn at_capacity(&self) -> bool {
self.heap.len() >= self.limit
}
/// Collects a document scored by the given feature /// Collects a document scored by the given feature
/// ///
/// It collects documents until it has reached the max capacity. Once it reaches capacity, it /// It collects documents until it has reached the max capacity. Once it reaches capacity, it
/// will compare the lowest scoring item with the given one and keep whichever is greater. /// will compare the lowest scoring item with the given one and keep whichever is greater.
#[inline] #[inline]
pub fn collect(&mut self, doc: DocId, feature: T) { pub fn collect(&mut self, doc: DocId, feature: T) {
if self.at_capacity() { self.topn_computer.push(ComparableDoc { feature, doc });
// It's ok to unwrap as long as a limit of 0 is forbidden.
if let Some(limit_feature) = self.heap.peek().map(|head| head.feature.clone()) {
if limit_feature < feature {
if let Some(mut head) = self.heap.peek_mut() {
head.feature = feature;
head.doc = doc;
}
}
}
} else {
// we have not reached capacity yet, so we can just push the
// element.
self.heap.push(ComparableDoc { feature, doc });
}
} }
} }
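The doc comment above only sketches the new collection strategy. Below is a minimal, standalone illustration of the median-truncation idea; it is not tantivy's actual `TopNComputer`, and the function name, the `f32` score type and the NaN-free assumption are illustrative only.

// Standalone sketch of the median-truncation top-K strategy described above.
fn top_k(scores: impl IntoIterator<Item = f32>, k: usize) -> Vec<f32> {
    if k == 0 {
        return Vec::new();
    }
    let mut buffer: Vec<f32> = Vec::with_capacity(2 * k);
    let mut threshold = f32::MIN;
    for score in scores {
        if score < threshold {
            continue; // cannot make it into the top k anymore
        }
        if buffer.len() >= 2 * k {
            // Partition so that the k largest scores end up in buffer[..k],
            // then drop the rest. Each truncation costs O(k) and happens at
            // most once every k insertions, so the total cost stays O(n).
            buffer.select_nth_unstable_by(k - 1, |a, b| b.partial_cmp(a).unwrap());
            threshold = buffer[k - 1];
            buffer.truncate(k);
        }
        buffer.push(score);
    }
    // Final O(k log k) sort of the surviving candidates, best score first.
    buffer.sort_unstable_by(|a, b| b.partial_cmp(a).unwrap());
    buffer.truncate(k);
    buffer
}

Skipping scores below the last median before buffering them is also what makes the computed threshold useful for query pruning, as the `TopDocs` collector in the next file does.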


@@ -1,4 +1,3 @@
use std::collections::BinaryHeap;
use std::fmt; use std::fmt;
use std::marker::PhantomData; use std::marker::PhantomData;
use std::sync::Arc; use std::sync::Arc;
@@ -86,12 +85,15 @@ where
/// The `TopDocs` collector keeps track of the top `K` documents /// The `TopDocs` collector keeps track of the top `K` documents
/// sorted by their score. /// sorted by their score.
/// ///
/// The implementation is based on a `BinaryHeap`. /// The implementation is based on repeatedly truncating at the median once `K * 2` documents
/// The theoretical complexity for collecting the top `K` out of `n` documents /// have been collected, using pattern-defeating quicksort.
/// is `O(n log K)`. /// The theoretical complexity for collecting the top `K` out of `N` documents
/// is `O(N + K)`.
/// ///
/// This collector guarantees a stable sorting in case of a tie on the /// This collector does not guarantee a stable sorting in case of a tie on the
/// document score. As such, it is suitable to implement pagination. /// document score; for a stable sorting, `PartialOrd` needs to resolve ties on other
/// fields, such as the doc id, in case of score equality.
/// Only then is it suitable for pagination.
/// ///
/// ```rust /// ```rust
/// use tantivy::collector::TopDocs; /// use tantivy::collector::TopDocs;
@@ -105,7 +107,7 @@ where
/// let schema = schema_builder.build(); /// let schema = schema_builder.build();
/// let index = Index::create_in_ram(schema); /// let index = Index::create_in_ram(schema);
/// ///
/// let mut index_writer = index.writer_with_num_threads(1, 10_000_000)?; /// let mut index_writer = index.writer_with_num_threads(1, 20_000_000)?;
/// index_writer.add_document(doc!(title => "The Name of the Wind"))?; /// index_writer.add_document(doc!(title => "The Name of the Wind"))?;
/// index_writer.add_document(doc!(title => "The Diary of Muadib"))?; /// index_writer.add_document(doc!(title => "The Diary of Muadib"))?;
/// index_writer.add_document(doc!(title => "A Dairy Cow"))?; /// index_writer.add_document(doc!(title => "A Dairy Cow"))?;
@@ -210,7 +212,7 @@ impl TopDocs {
/// let schema = schema_builder.build(); /// let schema = schema_builder.build();
/// let index = Index::create_in_ram(schema); /// let index = Index::create_in_ram(schema);
/// ///
/// let mut index_writer = index.writer_with_num_threads(1, 10_000_000)?; /// let mut index_writer = index.writer_with_num_threads(1, 20_000_000)?;
/// index_writer.add_document(doc!(title => "The Name of the Wind"))?; /// index_writer.add_document(doc!(title => "The Name of the Wind"))?;
/// index_writer.add_document(doc!(title => "The Diary of Muadib"))?; /// index_writer.add_document(doc!(title => "The Diary of Muadib"))?;
/// index_writer.add_document(doc!(title => "A Dairy Cow"))?; /// index_writer.add_document(doc!(title => "A Dairy Cow"))?;
@@ -261,7 +263,7 @@ impl TopDocs {
/// # let schema = schema_builder.build(); /// # let schema = schema_builder.build();
/// # /// #
/// # let index = Index::create_in_ram(schema); /// # let index = Index::create_in_ram(schema);
/// # let mut index_writer = index.writer_with_num_threads(1, 10_000_000)?; /// # let mut index_writer = index.writer_with_num_threads(1, 20_000_000)?;
/// # index_writer.add_document(doc!(title => "The Name of the Wind", rating => 92u64))?; /// # index_writer.add_document(doc!(title => "The Name of the Wind", rating => 92u64))?;
/// # index_writer.add_document(doc!(title => "The Diary of Muadib", rating => 97u64))?; /// # index_writer.add_document(doc!(title => "The Diary of Muadib", rating => 97u64))?;
/// # index_writer.add_document(doc!(title => "A Dairy Cow", rating => 63u64))?; /// # index_writer.add_document(doc!(title => "A Dairy Cow", rating => 63u64))?;
@@ -349,7 +351,7 @@ impl TopDocs {
/// # let schema = schema_builder.build(); /// # let schema = schema_builder.build();
/// # /// #
/// # let index = Index::create_in_ram(schema); /// # let index = Index::create_in_ram(schema);
/// # let mut index_writer = index.writer_with_num_threads(1, 10_000_000)?; /// # let mut index_writer = index.writer_with_num_threads(1, 20_000_000)?;
/// # index_writer.add_document(doc!(title => "MadCow Inc.", revenue => 92_000_000i64))?; /// # index_writer.add_document(doc!(title => "MadCow Inc.", revenue => 92_000_000i64))?;
/// # index_writer.add_document(doc!(title => "Zozo Cow KKK", revenue => 119_000_000i64))?; /// # index_writer.add_document(doc!(title => "Zozo Cow KKK", revenue => 119_000_000i64))?;
/// # index_writer.add_document(doc!(title => "Declining Cow", revenue => -63_000_000i64))?; /// # index_writer.add_document(doc!(title => "Declining Cow", revenue => -63_000_000i64))?;
@@ -449,7 +451,7 @@ impl TopDocs {
/// fn create_index() -> tantivy::Result<Index> { /// fn create_index() -> tantivy::Result<Index> {
/// let schema = create_schema(); /// let schema = create_schema();
/// let index = Index::create_in_ram(schema); /// let index = Index::create_in_ram(schema);
/// let mut index_writer = index.writer_with_num_threads(1, 10_000_000)?; /// let mut index_writer = index.writer_with_num_threads(1, 20_000_000)?;
/// let product_name = index.schema().get_field("product_name").unwrap(); /// let product_name = index.schema().get_field("product_name").unwrap();
/// let popularity: Field = index.schema().get_field("popularity").unwrap(); /// let popularity: Field = index.schema().get_field("popularity").unwrap();
/// index_writer.add_document(doc!(product_name => "The Diary of Muadib", popularity => 1u64))?; /// index_writer.add_document(doc!(product_name => "The Diary of Muadib", popularity => 1u64))?;
@@ -556,7 +558,7 @@ impl TopDocs {
/// # fn main() -> tantivy::Result<()> { /// # fn main() -> tantivy::Result<()> {
/// # let schema = create_schema(); /// # let schema = create_schema();
/// # let index = Index::create_in_ram(schema); /// # let index = Index::create_in_ram(schema);
/// # let mut index_writer = index.writer_with_num_threads(1, 10_000_000)?; /// # let mut index_writer = index.writer_with_num_threads(1, 20_000_000)?;
/// # let product_name = index.schema().get_field("product_name").unwrap(); /// # let product_name = index.schema().get_field("product_name").unwrap();
/// # /// #
/// let popularity: Field = index.schema().get_field("popularity").unwrap(); /// let popularity: Field = index.schema().get_field("popularity").unwrap();
@@ -661,50 +663,35 @@ impl Collector for TopDocs {
reader: &SegmentReader, reader: &SegmentReader,
) -> crate::Result<<Self::Child as SegmentCollector>::Fruit> { ) -> crate::Result<<Self::Child as SegmentCollector>::Fruit> {
let heap_len = self.0.limit + self.0.offset; let heap_len = self.0.limit + self.0.offset;
let mut heap: BinaryHeap<ComparableDoc<Score, DocId>> = BinaryHeap::with_capacity(heap_len); let mut top_n = TopNComputer::new(heap_len);
if let Some(alive_bitset) = reader.alive_bitset() { if let Some(alive_bitset) = reader.alive_bitset() {
let mut threshold = Score::MIN; let mut threshold = Score::MIN;
weight.for_each_pruning(threshold, reader, &mut |doc, score| { top_n.threshold = Some(threshold);
weight.for_each_pruning(Score::MIN, reader, &mut |doc, score| {
if alive_bitset.is_deleted(doc) { if alive_bitset.is_deleted(doc) {
return threshold; return threshold;
} }
let heap_item = ComparableDoc { let doc = ComparableDoc {
feature: score, feature: score,
doc, doc,
}; };
if heap.len() < heap_len { top_n.push(doc);
heap.push(heap_item); threshold = top_n.threshold.unwrap_or(Score::MIN);
if heap.len() == heap_len {
threshold = heap.peek().map(|el| el.feature).unwrap_or(Score::MIN);
}
return threshold;
}
*heap.peek_mut().unwrap() = heap_item;
threshold = heap.peek().map(|el| el.feature).unwrap_or(Score::MIN);
threshold threshold
})?; })?;
} else { } else {
weight.for_each_pruning(Score::MIN, reader, &mut |doc, score| { weight.for_each_pruning(Score::MIN, reader, &mut |doc, score| {
let heap_item = ComparableDoc { let doc = ComparableDoc {
feature: score, feature: score,
doc, doc,
}; };
if heap.len() < heap_len { top_n.push(doc);
heap.push(heap_item); top_n.threshold.unwrap_or(Score::MIN)
// TODO the threshold is suboptimal for heap.len == heap_len
if heap.len() == heap_len {
return heap.peek().map(|el| el.feature).unwrap_or(Score::MIN);
} else {
return Score::MIN;
}
}
*heap.peek_mut().unwrap() = heap_item;
heap.peek().map(|el| el.feature).unwrap_or(Score::MIN)
})?; })?;
} }
let fruit = heap let fruit = top_n
.into_sorted_vec() .into_sorted_vec()
.into_iter() .into_iter()
.map(|cid| { .map(|cid| {
@@ -736,9 +723,81 @@ impl SegmentCollector for TopScoreSegmentCollector {
} }
} }
/// Fast TopN Computation
///
/// For TopN == 0, it will be relatively expensive.
pub struct TopNComputer<Score, DocId> {
buffer: Vec<ComparableDoc<Score, DocId>>,
top_n: usize,
pub(crate) threshold: Option<Score>,
}
impl<Score, DocId> TopNComputer<Score, DocId>
where
Score: PartialOrd + Clone,
DocId: Ord + Clone,
{
/// Create a new `TopNComputer`.
/// Internally it will allocate a buffer of size `2 * top_n`.
pub fn new(top_n: usize) -> Self {
let vec_cap = top_n.max(1) * 2;
TopNComputer {
buffer: Vec::with_capacity(vec_cap),
top_n,
threshold: None,
}
}
#[inline]
pub(crate) fn push(&mut self, doc: ComparableDoc<Score, DocId>) {
if let Some(last_median) = self.threshold.clone() {
if doc.feature < last_median {
return;
}
}
if self.buffer.len() == self.buffer.capacity() {
let median = self.truncate_top_n();
self.threshold = Some(median);
}
// This is faster because it avoids inlining the buffer-resizing code of `Vec::push()`
// (this is in the hot path)
// TODO: Replace with `push_within_capacity` when it's stabilized
let uninit = self.buffer.spare_capacity_mut();
// This cannot panic, because `truncate_top_n` above removes at least one element, since
// the minimum buffer capacity is 2.
uninit[0].write(doc);
// This is safe because the write above would already have panicked if there were no spare capacity
unsafe {
self.buffer.set_len(self.buffer.len() + 1);
}
}
#[inline(never)]
fn truncate_top_n(&mut self) -> Score {
// Use select_nth_unstable to find the top nth score
let (_, median_el, _) = self.buffer.select_nth_unstable(self.top_n);
let median_score = median_el.feature.clone();
// Remove all elements below the top_n
self.buffer.truncate(self.top_n);
median_score
}
pub(crate) fn into_sorted_vec(mut self) -> Vec<ComparableDoc<Score, DocId>> {
if self.buffer.len() > self.top_n {
self.truncate_top_n();
}
self.buffer.sort_unstable();
self.buffer
}
}
#[cfg(test)] #[cfg(test)]
mod tests { mod tests {
use super::TopDocs; use super::{TopDocs, TopNComputer};
use crate::collector::top_collector::ComparableDoc;
use crate::collector::Collector; use crate::collector::Collector;
use crate::query::{AllQuery, Query, QueryParser}; use crate::query::{AllQuery, Query, QueryParser};
use crate::schema::{Field, Schema, FAST, STORED, TEXT}; use crate::schema::{Field, Schema, FAST, STORED, TEXT};
@@ -752,7 +811,7 @@ mod tests {
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
// writing the segment // writing the segment
let mut index_writer = index.writer_with_num_threads(1, 10_000_000)?; let mut index_writer = index.writer_with_num_threads(1, 20_000_000)?;
index_writer.add_document(doc!(text_field=>"Hello happy tax payer."))?; index_writer.add_document(doc!(text_field=>"Hello happy tax payer."))?;
index_writer.add_document(doc!(text_field=>"Droopy says hello happy tax payer"))?; index_writer.add_document(doc!(text_field=>"Droopy says hello happy tax payer"))?;
index_writer.add_document(doc!(text_field=>"I like Droopy"))?; index_writer.add_document(doc!(text_field=>"I like Droopy"))?;
@@ -767,6 +826,78 @@ mod tests {
} }
} }
#[test]
fn test_empty_topn_computer() {
let mut computer: TopNComputer<u32, u32> = TopNComputer::new(0);
computer.push(ComparableDoc {
feature: 1u32,
doc: 1u32,
});
computer.push(ComparableDoc {
feature: 1u32,
doc: 2u32,
});
computer.push(ComparableDoc {
feature: 1u32,
doc: 3u32,
});
assert!(computer.into_sorted_vec().is_empty());
}
#[test]
fn test_topn_computer() {
let mut computer: TopNComputer<u32, u32> = TopNComputer::new(2);
computer.push(ComparableDoc {
feature: 1u32,
doc: 1u32,
});
computer.push(ComparableDoc {
feature: 2u32,
doc: 2u32,
});
computer.push(ComparableDoc {
feature: 3u32,
doc: 3u32,
});
computer.push(ComparableDoc {
feature: 2u32,
doc: 4u32,
});
computer.push(ComparableDoc {
feature: 1u32,
doc: 5u32,
});
assert_eq!(
computer.into_sorted_vec(),
&[
ComparableDoc {
feature: 3u32,
doc: 3u32,
},
ComparableDoc {
feature: 2u32,
doc: 2u32,
}
]
);
}
#[test]
fn test_topn_computer_no_panic() {
for top_n in 0..10 {
let mut computer: TopNComputer<u32, u32> = TopNComputer::new(top_n);
for _ in 0..1 + top_n * 2 {
computer.push(ComparableDoc {
feature: 1u32,
doc: 1u32,
});
}
let _vals = computer.into_sorted_vec();
}
}
#[test] #[test]
fn test_top_collector_not_at_capacity_without_offset() -> crate::Result<()> { fn test_top_collector_not_at_capacity_without_offset() -> crate::Result<()> {
let index = make_index()?; let index = make_index()?;
@@ -852,20 +983,25 @@ mod tests {
// using AllQuery to get a constant score // using AllQuery to get a constant score
let searcher = index.reader().unwrap().searcher(); let searcher = index.reader().unwrap().searcher();
let page_0 = searcher.search(&AllQuery, &TopDocs::with_limit(1)).unwrap();
let page_1 = searcher.search(&AllQuery, &TopDocs::with_limit(2)).unwrap(); let page_1 = searcher.search(&AllQuery, &TopDocs::with_limit(2)).unwrap();
let page_2 = searcher.search(&AllQuery, &TopDocs::with_limit(3)).unwrap(); let page_2 = searcher.search(&AllQuery, &TopDocs::with_limit(3)).unwrap();
// precondition for the test to be meaningful: we did get documents // precondition for the test to be meaningful: we did get documents
// with the same score // with the same score
assert!(page_0.iter().all(|result| result.0 == page_1[0].0));
assert!(page_1.iter().all(|result| result.0 == page_1[0].0)); assert!(page_1.iter().all(|result| result.0 == page_1[0].0));
assert!(page_2.iter().all(|result| result.0 == page_2[0].0)); assert!(page_2.iter().all(|result| result.0 == page_2[0].0));
// sanity check since we're relying on make_index() // sanity check since we're relying on make_index()
assert_eq!(page_0.len(), 1);
assert_eq!(page_1.len(), 2); assert_eq!(page_1.len(), 2);
assert_eq!(page_2.len(), 3); assert_eq!(page_2.len(), 3);
assert_eq!(page_1, &page_2[..page_1.len()]); assert_eq!(page_1, &page_2[..page_1.len()]);
assert_eq!(page_0, &page_2[..page_0.len()]);
} }
#[test] #[test]
@@ -1122,7 +1258,7 @@ mod tests {
mut doc_adder: impl FnMut(&mut IndexWriter), mut doc_adder: impl FnMut(&mut IndexWriter),
) -> (Index, Box<dyn Query>) { ) -> (Index, Box<dyn Query>) {
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_with_num_threads(1, 10_000_000).unwrap(); let mut index_writer = index.writer_with_num_threads(1, 15_000_000).unwrap();
doc_adder(&mut index_writer); doc_adder(&mut index_writer);
index_writer.commit().unwrap(); index_writer.commit().unwrap();
let query_parser = QueryParser::for_index(&index, vec![query_field]); let query_parser = QueryParser::for_index(&index, vec![query_field]);
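In the rewritten `collect_segment` above, every invocation of the `for_each_pruning` callback returns the collector's current `top_n.threshold`, so the scorer can skip documents or whole blocks that cannot beat it. The following simplified, self-contained sketch only illustrates that contract; the function name and the plain score slice are assumptions, not tantivy's API.

// Simplified illustration of the pruning-threshold feedback loop used in
// `collect_segment` above: the driver calls back with (doc, score), the
// callback returns the collector's new threshold, and the driver skips
// anything that cannot beat it.
fn for_each_pruning_sketch(scores: &[f32], mut callback: impl FnMut(u32, f32) -> f32) {
    let mut threshold = f32::MIN;
    for (doc, &score) in scores.iter().enumerate() {
        if score <= threshold {
            // A real scorer (e.g. block-max WAND) would skip whole blocks here.
            continue;
        }
        threshold = callback(doc as u32, score);
    }
}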


@@ -16,12 +16,14 @@ use crate::directory::error::OpenReadError;
use crate::directory::MmapDirectory; use crate::directory::MmapDirectory;
use crate::directory::{Directory, ManagedDirectory, RamDirectory, INDEX_WRITER_LOCK}; use crate::directory::{Directory, ManagedDirectory, RamDirectory, INDEX_WRITER_LOCK};
use crate::error::{DataCorruption, TantivyError}; use crate::error::{DataCorruption, TantivyError};
use crate::indexer::index_writer::{MAX_NUM_THREAD, MEMORY_ARENA_NUM_BYTES_MIN}; use crate::indexer::index_writer::{MAX_NUM_THREAD, MEMORY_BUDGET_NUM_BYTES_MIN};
use crate::indexer::segment_updater::save_metas; use crate::indexer::segment_updater::save_metas;
use crate::indexer::IndexWriter;
use crate::reader::{IndexReader, IndexReaderBuilder}; use crate::reader::{IndexReader, IndexReaderBuilder};
use crate::schema::document::Document;
use crate::schema::{Field, FieldType, Schema}; use crate::schema::{Field, FieldType, Schema};
use crate::tokenizer::{TextAnalyzer, TokenizerManager}; use crate::tokenizer::{TextAnalyzer, TokenizerManager};
use crate::IndexWriter; use crate::{merge_field_meta_data, FieldMetadata, SegmentReader};
fn load_metas( fn load_metas(
directory: &dyn Directory, directory: &dyn Directory,
@@ -184,11 +186,11 @@ impl IndexBuilder {
/// ///
/// It expects an originally empty directory, and will not run any GC operation. /// It expects an originally empty directory, and will not run any GC operation.
#[doc(hidden)] #[doc(hidden)]
pub fn single_segment_index_writer( pub fn single_segment_index_writer<D: Document>(
self, self,
dir: impl Into<Box<dyn Directory>>, dir: impl Into<Box<dyn Directory>>,
mem_budget: usize, mem_budget: usize,
) -> crate::Result<SingleSegmentIndexWriter> { ) -> crate::Result<SingleSegmentIndexWriter<D>> {
let index = self.create(dir)?; let index = self.create(dir)?;
let index_simple_writer = SingleSegmentIndexWriter::new(index, mem_budget)?; let index_simple_writer = SingleSegmentIndexWriter::new(index, mem_budget)?;
Ok(index_simple_writer) Ok(index_simple_writer)
@@ -488,6 +490,28 @@ impl Index {
self.inventory.all() self.inventory.all()
} }
/// Returns the list of fields that have been indexed in the Index.
/// The field list includes the fields defined in the schema as well as the fields
/// that have been indexed as a part of a JSON field.
/// The returned field name is the full field name, including the name of the JSON field.
///
/// The returned field names can be used in queries.
///
/// Notice: If your data contains JSON fields this is **very expensive**, as it requires
/// browsing through the inverted index term dictionary and the columnar field dictionary.
///
/// Disclaimer: Some fields may not be listed here. For instance, if the schema contains a JSON
/// field that is neither indexed nor a fast field but is stored, it is possible for the field
/// to not be listed.
pub fn fields_metadata(&self) -> crate::Result<Vec<FieldMetadata>> {
let segments = self.searchable_segments()?;
let fields_metadata: Vec<Vec<FieldMetadata>> = segments
.into_iter()
.map(|segment| SegmentReader::open(&segment)?.fields_metadata())
.collect::<Result<_, _>>()?;
Ok(merge_field_meta_data(fields_metadata, &self.schema()))
}
/// Creates a new segment_meta (Advanced user only). /// Creates a new segment_meta (Advanced user only).
/// ///
/// As long as the `SegmentMeta` lives, the files associated with the /// As long as the `SegmentMeta` lives, the files associated with the
@@ -523,19 +547,19 @@ impl Index {
/// - `num_threads` defines the number of indexing workers that /// - `num_threads` defines the number of indexing workers that
/// should work at the same time. /// should work at the same time.
/// ///
/// - `overall_memory_arena_in_bytes` sets the amount of memory /// - `overall_memory_budget_in_bytes` sets the amount of memory
/// allocated for all indexing threads. /// allocated for all indexing threads.
/// Each thread will receive a budget of `overall_memory_arena_in_bytes / num_threads`. /// Each thread will receive a budget of `overall_memory_budget_in_bytes / num_threads`.
/// ///
/// # Errors /// # Errors
/// If the lockfile already exists, returns `Error::DirectoryLockBusy` or an `Error::IoError`. /// If the lockfile already exists, returns `Error::DirectoryLockBusy` or an `Error::IoError`.
/// If the memory arena per thread is too small or too big, returns /// If the memory arena per thread is too small or too big, returns
/// `TantivyError::InvalidArgument` /// `TantivyError::InvalidArgument`
pub fn writer_with_num_threads( pub fn writer_with_num_threads<D: Document>(
&self, &self,
num_threads: usize, num_threads: usize,
overall_memory_arena_in_bytes: usize, overall_memory_budget_in_bytes: usize,
) -> crate::Result<IndexWriter> { ) -> crate::Result<IndexWriter<D>> {
let directory_lock = self let directory_lock = self
.directory .directory
.acquire_lock(&INDEX_WRITER_LOCK) .acquire_lock(&INDEX_WRITER_LOCK)
@@ -550,7 +574,7 @@ impl Index {
), ),
) )
})?; })?;
let memory_arena_in_bytes_per_thread = overall_memory_arena_in_bytes / num_threads; let memory_arena_in_bytes_per_thread = overall_memory_budget_in_bytes / num_threads;
IndexWriter::new( IndexWriter::new(
self, self,
num_threads, num_threads,
@@ -561,11 +585,11 @@ impl Index {
/// Helper to create an index writer for tests. /// Helper to create an index writer for tests.
/// ///
/// That index writer simply has a single thread and a memory arena of 10 MB. /// That index writer simply has a single thread and a memory budget of 15 MB.
/// Using a single thread gives us a deterministic allocation of DocId. /// Using a single thread gives us a deterministic allocation of DocId.
#[cfg(test)] #[cfg(test)]
pub fn writer_for_tests(&self) -> crate::Result<IndexWriter> { pub fn writer_for_tests<D: Document>(&self) -> crate::Result<IndexWriter<D>> {
self.writer_with_num_threads(1, 10_000_000) self.writer_with_num_threads(1, MEMORY_BUDGET_NUM_BYTES_MIN)
} }
/// Creates a multithreaded writer /// Creates a multithreaded writer
@@ -579,13 +603,16 @@ impl Index {
/// If the lockfile already exists, returns `Error::FileAlreadyExists`. /// If the lockfile already exists, returns `Error::FileAlreadyExists`.
/// If the memory arena per thread is too small or too big, returns /// If the memory arena per thread is too small or too big, returns
/// `TantivyError::InvalidArgument` /// `TantivyError::InvalidArgument`
pub fn writer(&self, memory_arena_num_bytes: usize) -> crate::Result<IndexWriter> { pub fn writer<D: Document>(
&self,
memory_budget_in_bytes: usize,
) -> crate::Result<IndexWriter<D>> {
let mut num_threads = std::cmp::min(num_cpus::get(), MAX_NUM_THREAD); let mut num_threads = std::cmp::min(num_cpus::get(), MAX_NUM_THREAD);
let memory_arena_num_bytes_per_thread = memory_arena_num_bytes / num_threads; let memory_budget_num_bytes_per_thread = memory_budget_in_bytes / num_threads;
if memory_arena_num_bytes_per_thread < MEMORY_ARENA_NUM_BYTES_MIN { if memory_budget_num_bytes_per_thread < MEMORY_BUDGET_NUM_BYTES_MIN {
num_threads = (memory_arena_num_bytes / MEMORY_ARENA_NUM_BYTES_MIN).max(1); num_threads = (memory_budget_in_bytes / MEMORY_BUDGET_NUM_BYTES_MIN).max(1);
} }
self.writer_with_num_threads(num_threads, memory_arena_num_bytes) self.writer_with_num_threads(num_threads, memory_budget_in_bytes)
} }
/// Accessor to the index settings /// Accessor to the index settings
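A possible usage sketch for the `Index::fields_metadata()` method added above, written from a crate user's point of view; the schema, the 15 MB budget and the `Debug` formatting of `FieldMetadata` are assumptions for illustration, not part of this diff.

use tantivy::schema::{Schema, FAST, TEXT};
use tantivy::{doc, Index, IndexWriter};

fn list_indexed_fields() -> tantivy::Result<()> {
    let mut schema_builder = Schema::builder();
    let title = schema_builder.add_text_field("title", TEXT);
    let price = schema_builder.add_u64_field("price", FAST);
    let index = Index::create_in_ram(schema_builder.build());

    let mut writer: IndexWriter = index.writer(15_000_000)?;
    writer.add_document(doc!(title => "The Name of the Wind", price => 30_200u64))?;
    writer.commit()?;

    // With a plain schema this mostly mirrors the schema itself; the call is
    // mainly useful for discovering the paths indexed inside JSON fields,
    // which requires scanning the searchable segments (see the doc comment above).
    for field_metadata in index.fields_metadata()? {
        println!("{field_metadata:?}"); // assumes FieldMetadata implements Debug
    }
    Ok(())
}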


@@ -410,7 +410,9 @@ mod tests {
use super::IndexMeta; use super::IndexMeta;
use crate::core::index_meta::UntrackedIndexMeta; use crate::core::index_meta::UntrackedIndexMeta;
use crate::schema::{Schema, TEXT}; use crate::schema::{Schema, TEXT};
use crate::store::{Compressor, ZstdCompressor}; use crate::store::Compressor;
#[cfg(feature = "zstd-compression")]
use crate::store::ZstdCompressor;
use crate::{IndexSettings, IndexSortByField, Order}; use crate::{IndexSettings, IndexSortByField, Order};
#[test] #[test]
@@ -446,6 +448,7 @@ mod tests {
} }
#[test] #[test]
#[cfg(feature = "zstd-compression")]
fn test_serialize_metas_zstd_compressor() { fn test_serialize_metas_zstd_compressor() {
let schema = { let schema = {
let mut schema_builder = Schema::builder(); let mut schema_builder = Schema::builder();
@@ -482,13 +485,14 @@ mod tests {
} }
#[test] #[test]
#[cfg(all(feature = "lz4-compression", feature = "zstd-compression"))]
fn test_serialize_metas_invalid_comp() { fn test_serialize_metas_invalid_comp() {
let json = r#"{"index_settings":{"sort_by_field":{"field":"text","order":"Asc"},"docstore_compression":"zsstd","docstore_blocksize":1000000},"segments":[],"schema":[{"name":"text","type":"text","options":{"indexing":{"record":"position","fieldnorms":true,"tokenizer":"default"},"stored":false,"fast":false}}],"opstamp":0}"#; let json = r#"{"index_settings":{"sort_by_field":{"field":"text","order":"Asc"},"docstore_compression":"zsstd","docstore_blocksize":1000000},"segments":[],"schema":[{"name":"text","type":"text","options":{"indexing":{"record":"position","fieldnorms":true,"tokenizer":"default"},"stored":false,"fast":false}}],"opstamp":0}"#;
let err = serde_json::from_str::<UntrackedIndexMeta>(json).unwrap_err(); let err = serde_json::from_str::<UntrackedIndexMeta>(json).unwrap_err();
assert_eq!( assert_eq!(
err.to_string(), err.to_string(),
"unknown variant `zsstd`, expected one of `none`, `lz4`, `brotli`, `snappy`, `zstd`, \ "unknown variant `zsstd`, expected one of `none`, `lz4`, `zstd`, \
`zstd(compression_level=5)` at line 1 column 96" `zstd(compression_level=5)` at line 1 column 96"
.to_string() .to_string()
); );
@@ -502,6 +506,20 @@ mod tests {
); );
} }
#[test]
#[cfg(not(feature = "zstd-compression"))]
fn test_serialize_metas_unsupported_comp() {
let json = r#"{"index_settings":{"sort_by_field":{"field":"text","order":"Asc"},"docstore_compression":"zstd","docstore_blocksize":1000000},"segments":[],"schema":[{"name":"text","type":"text","options":{"indexing":{"record":"position","fieldnorms":true,"tokenizer":"default"},"stored":false,"fast":false}}],"opstamp":0}"#;
let err = serde_json::from_str::<UntrackedIndexMeta>(json).unwrap_err();
assert_eq!(
err.to_string(),
"unsupported variant `zstd`, please enable Tantivy's `zstd-compression` feature at \
line 1 column 95"
.to_string()
);
}
#[test] #[test]
#[cfg(feature = "lz4-compression")] #[cfg(feature = "lz4-compression")]
fn test_index_settings_default() { fn test_index_settings_default() {


@@ -1,11 +1,12 @@
use std::io; use std::io;
use common::BinarySerializable; use common::BinarySerializable;
use fnv::FnvHashSet;
use crate::directory::FileSlice; use crate::directory::FileSlice;
use crate::positions::PositionReader; use crate::positions::PositionReader;
use crate::postings::{BlockSegmentPostings, SegmentPostings, TermInfo}; use crate::postings::{BlockSegmentPostings, SegmentPostings, TermInfo};
use crate::schema::{IndexRecordOption, Term}; use crate::schema::{IndexRecordOption, Term, Type, JSON_END_OF_PATH};
use crate::termdict::TermDictionary; use crate::termdict::TermDictionary;
/// The inverted index reader is in charge of accessing /// The inverted index reader is in charge of accessing
@@ -69,6 +70,28 @@ impl InvertedIndexReader {
&self.termdict &self.termdict
} }
/// Return the fields and types encoded in the dictionary in lexicographic order.
/// Only valid on JSON fields.
///
/// Notice: This requires a full scan and is therefore **very expensive**.
/// TODO: Move to sstable to use the index.
pub fn list_encoded_fields(&self) -> io::Result<Vec<(String, Type)>> {
let mut stream = self.termdict.stream()?;
let mut fields = Vec::new();
let mut fields_set = FnvHashSet::default();
while let Some((term, _term_info)) = stream.next() {
if let Some(index) = term.iter().position(|&byte| byte == JSON_END_OF_PATH) {
if !fields_set.contains(&term[..index + 2]) {
fields_set.insert(term[..index + 2].to_vec());
let typ = Type::from_code(term[index + 1]).unwrap();
fields.push((String::from_utf8_lossy(&term[..index]).to_string(), typ));
}
}
}
Ok(fields)
}
/// Resets the block segment to another position of the postings /// Resets the block segment to another position of the postings
/// file. /// file.
/// ///
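A possible usage sketch for the new `InvertedIndexReader::list_encoded_fields()`; the `index` and `json_field` arguments are assumed to exist, and, as the doc comment warns, this walks the entire term dictionary of each segment.

use tantivy::schema::Field;
use tantivy::Index;

// Assumes `index` already contains committed documents and `json_field`
// points at a JSON field of its schema.
fn print_json_paths(index: &Index, json_field: Field) -> tantivy::Result<()> {
    let searcher = index.reader()?.searcher();
    for segment_reader in searcher.segment_readers() {
        let inverted_index = segment_reader.inverted_index(json_field)?;
        // Each entry is a JSON path together with the value type
        // (Str, I64, U64, F64, Bool, Date, ...) found under that path.
        for (path, typ) in inverted_index.list_encoded_fields()? {
            println!("{path} -> {typ:?}");
        }
    }
    Ok(())
}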


@@ -1,11 +1,11 @@
use columnar::MonotonicallyMappableToU64; use columnar::MonotonicallyMappableToU64;
use common::replace_in_place; use common::{replace_in_place, JsonPathWriter};
use murmurhash32::murmurhash2;
use rustc_hash::FxHashMap; use rustc_hash::FxHashMap;
use crate::fastfield::FastValue; use crate::fastfield::FastValue;
use crate::postings::{IndexingContext, IndexingPosition, PostingsWriter}; use crate::postings::{IndexingContext, IndexingPosition, PostingsWriter};
use crate::schema::term::{JSON_PATH_SEGMENT_SEP, JSON_PATH_SEGMENT_SEP_STR}; use crate::schema::document::{ReferenceValue, ReferenceValueLeaf, Value};
use crate::schema::term::JSON_PATH_SEGMENT_SEP;
use crate::schema::{Field, Type, DATE_TIME_PRECISION_INDEXED}; use crate::schema::{Field, Type, DATE_TIME_PRECISION_INDEXED};
use crate::time::format_description::well_known::Rfc3339; use crate::time::format_description::well_known::Rfc3339;
use crate::time::{OffsetDateTime, UtcOffset}; use crate::time::{OffsetDateTime, UtcOffset};
@@ -57,31 +57,41 @@ struct IndexingPositionsPerPath {
} }
impl IndexingPositionsPerPath { impl IndexingPositionsPerPath {
fn get_position(&mut self, term: &Term) -> &mut IndexingPosition { fn get_position_from_id(&mut self, id: u32) -> &mut IndexingPosition {
self.positions_per_path self.positions_per_path.entry(id).or_default()
.entry(murmurhash2(term.serialized_term()))
.or_insert_with(Default::default)
} }
} }
pub(crate) fn index_json_values<'a>( /// Convert JSON_PATH_SEGMENT_SEP to a dot.
pub fn json_path_sep_to_dot(path: &mut str) {
// This is safe since we are replacing an ASCII character with another ASCII character.
unsafe {
replace_in_place(JSON_PATH_SEGMENT_SEP, b'.', path.as_bytes_mut());
}
}
#[allow(clippy::too_many_arguments)]
pub(crate) fn index_json_values<'a, V: Value<'a>>(
doc: DocId, doc: DocId,
json_values: impl Iterator<Item = crate::Result<&'a serde_json::Map<String, serde_json::Value>>>, json_visitors: impl Iterator<Item = crate::Result<V::ObjectIter>>,
text_analyzer: &mut TextAnalyzer, text_analyzer: &mut TextAnalyzer,
expand_dots_enabled: bool, expand_dots_enabled: bool,
term_buffer: &mut Term, term_buffer: &mut Term,
postings_writer: &mut dyn PostingsWriter, postings_writer: &mut dyn PostingsWriter,
json_path_writer: &mut JsonPathWriter,
ctx: &mut IndexingContext, ctx: &mut IndexingContext,
) -> crate::Result<()> { ) -> crate::Result<()> {
let mut json_term_writer = JsonTermWriter::wrap(term_buffer, expand_dots_enabled); json_path_writer.clear();
json_path_writer.set_expand_dots(expand_dots_enabled);
let mut positions_per_path: IndexingPositionsPerPath = Default::default(); let mut positions_per_path: IndexingPositionsPerPath = Default::default();
for json_value_res in json_values { for json_visitor_res in json_visitors {
let json_value = json_value_res?; let json_visitor = json_visitor_res?;
index_json_object( index_json_object::<V>(
doc, doc,
json_value, json_visitor,
text_analyzer, text_analyzer,
&mut json_term_writer, term_buffer,
json_path_writer,
postings_writer, postings_writer,
ctx, ctx,
&mut positions_per_path, &mut positions_per_path,
@@ -90,93 +100,154 @@ pub(crate) fn index_json_values<'a>(
Ok(()) Ok(())
} }
fn index_json_object( #[allow(clippy::too_many_arguments)]
fn index_json_object<'a, V: Value<'a>>(
doc: DocId, doc: DocId,
json_value: &serde_json::Map<String, serde_json::Value>, json_visitor: V::ObjectIter,
text_analyzer: &mut TextAnalyzer, text_analyzer: &mut TextAnalyzer,
json_term_writer: &mut JsonTermWriter, term_buffer: &mut Term,
json_path_writer: &mut JsonPathWriter,
postings_writer: &mut dyn PostingsWriter, postings_writer: &mut dyn PostingsWriter,
ctx: &mut IndexingContext, ctx: &mut IndexingContext,
positions_per_path: &mut IndexingPositionsPerPath, positions_per_path: &mut IndexingPositionsPerPath,
) { ) {
for (json_path_segment, json_value) in json_value { for (json_path_segment, json_value_visitor) in json_visitor {
json_term_writer.push_path_segment(json_path_segment); json_path_writer.push(json_path_segment);
index_json_value( index_json_value(
doc, doc,
json_value, json_value_visitor,
text_analyzer, text_analyzer,
json_term_writer, term_buffer,
json_path_writer,
postings_writer, postings_writer,
ctx, ctx,
positions_per_path, positions_per_path,
); );
json_term_writer.pop_path_segment(); json_path_writer.pop();
} }
} }
fn index_json_value( #[allow(clippy::too_many_arguments)]
fn index_json_value<'a, V: Value<'a>>(
doc: DocId, doc: DocId,
json_value: &serde_json::Value, json_value: V,
text_analyzer: &mut TextAnalyzer, text_analyzer: &mut TextAnalyzer,
json_term_writer: &mut JsonTermWriter, term_buffer: &mut Term,
json_path_writer: &mut JsonPathWriter,
postings_writer: &mut dyn PostingsWriter, postings_writer: &mut dyn PostingsWriter,
ctx: &mut IndexingContext, ctx: &mut IndexingContext,
positions_per_path: &mut IndexingPositionsPerPath, positions_per_path: &mut IndexingPositionsPerPath,
) { ) {
-   match json_value {
-       serde_json::Value::Null => {}
-       serde_json::Value::Bool(val_bool) => {
-           json_term_writer.set_fast_value(*val_bool);
-           postings_writer.subscribe(doc, 0u32, json_term_writer.term(), ctx);
-       }
-       serde_json::Value::Number(number) => {
-           if let Some(number_i64) = number.as_i64() {
-               json_term_writer.set_fast_value(number_i64);
-           } else if let Some(number_u64) = number.as_u64() {
-               json_term_writer.set_fast_value(number_u64);
-           } else if let Some(number_f64) = number.as_f64() {
-               json_term_writer.set_fast_value(number_f64);
-           }
-           postings_writer.subscribe(doc, 0u32, json_term_writer.term(), ctx);
-       }
-       serde_json::Value::String(text) => match infer_type_from_str(text) {
-           TextOrDateTime::Text(text) => {
-               let mut token_stream = text_analyzer.token_stream(text);
-               // TODO make sure the chain position works out.
-               json_term_writer.close_path_and_set_type(Type::Str);
-               let indexing_position = positions_per_path.get_position(json_term_writer.term());
-               postings_writer.index_text(
-                   doc,
-                   &mut *token_stream,
-                   json_term_writer.term_buffer,
-                   ctx,
-                   indexing_position,
-               );
-           }
-           TextOrDateTime::DateTime(dt) => {
-               json_term_writer.set_fast_value(DateTime::from_utc(dt));
-               postings_writer.subscribe(doc, 0u32, json_term_writer.term(), ctx);
-           }
-       },
+   let set_path_id = |term_buffer: &mut Term, unordered_id: u32| {
+       term_buffer.truncate_value_bytes(0);
+       term_buffer.append_bytes(&unordered_id.to_be_bytes());
+   };
+   let set_type = |term_buffer: &mut Term, typ: Type| {
+       term_buffer.append_bytes(&[typ.to_code()]);
+   };
+   match json_value.as_value() {
+       ReferenceValue::Leaf(leaf) => match leaf {
+           ReferenceValueLeaf::Null => {}
+           ReferenceValueLeaf::Str(val) => {
+               let mut token_stream = text_analyzer.token_stream(val);
+               let unordered_id = ctx
+                   .path_to_unordered_id
+                   .get_or_allocate_unordered_id(json_path_writer.as_str());
+               // TODO: make sure the chain position works out.
+               set_path_id(term_buffer, unordered_id);
+               set_type(term_buffer, Type::Str);
+               let indexing_position = positions_per_path.get_position_from_id(unordered_id);
+               postings_writer.index_text(
+                   doc,
+                   &mut *token_stream,
+                   term_buffer,
+                   ctx,
+                   indexing_position,
+               );
+           }
+           ReferenceValueLeaf::U64(val) => {
+               set_path_id(
+                   term_buffer,
+                   ctx.path_to_unordered_id
+                       .get_or_allocate_unordered_id(json_path_writer.as_str()),
+               );
+               term_buffer.append_type_and_fast_value(val);
+               postings_writer.subscribe(doc, 0u32, term_buffer, ctx);
+           }
+           ReferenceValueLeaf::I64(val) => {
+               set_path_id(
+                   term_buffer,
+                   ctx.path_to_unordered_id
+                       .get_or_allocate_unordered_id(json_path_writer.as_str()),
+               );
+               term_buffer.append_type_and_fast_value(val);
+               postings_writer.subscribe(doc, 0u32, term_buffer, ctx);
+           }
+           ReferenceValueLeaf::F64(val) => {
+               set_path_id(
+                   term_buffer,
+                   ctx.path_to_unordered_id
+                       .get_or_allocate_unordered_id(json_path_writer.as_str()),
+               );
+               term_buffer.append_type_and_fast_value(val);
+               postings_writer.subscribe(doc, 0u32, term_buffer, ctx);
+           }
+           ReferenceValueLeaf::Bool(val) => {
+               set_path_id(
+                   term_buffer,
+                   ctx.path_to_unordered_id
+                       .get_or_allocate_unordered_id(json_path_writer.as_str()),
+               );
+               term_buffer.append_type_and_fast_value(val);
+               postings_writer.subscribe(doc, 0u32, term_buffer, ctx);
+           }
+           ReferenceValueLeaf::Date(val) => {
+               set_path_id(
+                   term_buffer,
+                   ctx.path_to_unordered_id
+                       .get_or_allocate_unordered_id(json_path_writer.as_str()),
+               );
+               term_buffer.append_type_and_fast_value(val);
+               postings_writer.subscribe(doc, 0u32, term_buffer, ctx);
+           }
+           ReferenceValueLeaf::PreTokStr(_) => {
+               unimplemented!(
+                   "Pre-tokenized string support in dynamic fields is not yet implemented"
+               )
+           }
+           ReferenceValueLeaf::Bytes(_) => {
+               unimplemented!("Bytes support in dynamic fields is not yet implemented")
+           }
+           ReferenceValueLeaf::Facet(_) => {
+               unimplemented!("Facet support in dynamic fields is not yet implemented")
+           }
+           ReferenceValueLeaf::IpAddr(_) => {
+               unimplemented!("IP address support in dynamic fields is not yet implemented")
+           }
+       },
-       serde_json::Value::Array(arr) => {
-           for val in arr {
+       ReferenceValue::Array(elements) => {
+           for val in elements {
                index_json_value(
                    doc,
                    val,
                    text_analyzer,
-                   json_term_writer,
+                   term_buffer,
+                   json_path_writer,
                    postings_writer,
                    ctx,
                    positions_per_path,
                );
            }
        }
-       serde_json::Value::Object(map) => {
-           index_json_object(
+       ReferenceValue::Object(object) => {
+           index_json_object::<V>(
                doc,
-               map,
+               object,
                text_analyzer,
-               json_term_writer,
+               term_buffer,
+               json_path_writer,
                postings_writer,
                ctx,
                positions_per_path,
@@ -185,21 +256,6 @@ fn index_json_value(
    }
}
-enum TextOrDateTime<'a> {
-    Text(&'a str),
-    DateTime(OffsetDateTime),
-}
-fn infer_type_from_str(text: &str) -> TextOrDateTime {
-    match OffsetDateTime::parse(text, &Rfc3339) {
-        Ok(dt) => {
-            let dt_utc = dt.to_offset(UtcOffset::UTC);
-            TextOrDateTime::DateTime(dt_utc)
-        }
-        Err(_) => TextOrDateTime::Text(text),
-    }
-}
// Tries to infer a JSON type from a string.
pub fn convert_to_fast_value_and_get_term(
    json_term_writer: &mut JsonTermWriter,
@@ -272,7 +328,7 @@ pub struct JsonTermWriter<'a> {
/// In other words,
/// - `k8s.node` ends up as `["k8s", "node"]`.
/// - `k8s\.node` ends up as `["k8s.node"]`.
-fn split_json_path(json_path: &str) -> Vec<String> {
+pub fn split_json_path(json_path: &str) -> Vec<String> {
    let mut escaped_state: bool = false;
    let mut json_path_segments = Vec::new();
    let mut buffer = String::new();
@@ -312,17 +368,13 @@ pub(crate) fn encode_column_name(
    json_path: &str,
    expand_dots_enabled: bool,
) -> String {
-   let mut column_key: String = String::with_capacity(field_name.len() + json_path.len() + 1);
-   column_key.push_str(field_name);
-   for mut segment in split_json_path(json_path) {
-       column_key.push_str(JSON_PATH_SEGMENT_SEP_STR);
-       if expand_dots_enabled {
-           // We need to replace `.` by JSON_PATH_SEGMENT_SEP.
-           unsafe { replace_in_place(b'.', JSON_PATH_SEGMENT_SEP, segment.as_bytes_mut()) };
-       }
-       column_key.push_str(&segment);
+   let mut path = JsonPathWriter::default();
+   path.push(field_name);
+   path.set_expand_dots(expand_dots_enabled);
+   for segment in split_json_path(json_path) {
+       path.push(&segment);
    }
-   column_key
+   path.into()
}
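// Illustrative sketch (not part of the diff): how the JsonPathWriter-based
// encoding above is expected to behave. It only uses the API visible in this
// change (`push`, `set_expand_dots` and the `String` conversion); the concrete
// separator byte is an internal detail.
fn encode_column_name_example() {
    let mut path = JsonPathWriter::default();
    path.push("attributes"); // the JSON field name
    path.set_expand_dots(false);
    for segment in split_json_path(r"k8s\.node") {
        // the escaped dot keeps `k8s.node` as a single path segment
        path.push(&segment);
    }
    // With expand_dots disabled, the result is "attributes" and "k8s.node"
    // joined by tantivy's internal JSON path separator.
    let _column_key: String = path.into();
}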
impl<'a> JsonTermWriter<'a> {
@@ -362,6 +414,7 @@ impl<'a> JsonTermWriter<'a> {
        self.term_buffer.append_bytes(&[typ.to_code()]);
    }
+   // TODO: Remove this function and use JsonPathWriter instead.
    pub fn push_path_segment(&mut self, segment: &str) {
        // the path stack should never be empty.
        self.trim_to_end_of_path();
@@ -619,21 +672,21 @@ mod tests {
    #[test]
    fn test_split_json_path_escaped_dot() {
-       let json_path = split_json_path(r#"toto\.titi"#);
+       let json_path = split_json_path(r"toto\.titi");
        assert_eq!(&json_path, &["toto.titi"]);
-       let json_path_2 = split_json_path(r#"k8s\.container\.name"#);
+       let json_path_2 = split_json_path(r"k8s\.container\.name");
        assert_eq!(&json_path_2, &["k8s.container.name"]);
    }
    #[test]
    fn test_split_json_path_escaped_backslash() {
-       let json_path = split_json_path(r#"toto\\titi"#);
-       assert_eq!(&json_path, &[r#"toto\titi"#]);
+       let json_path = split_json_path(r"toto\\titi");
+       assert_eq!(&json_path, &[r"toto\titi"]);
    }
    #[test]
    fn test_split_json_path_escaped_normal_letter() {
-       let json_path = split_json_path(r#"toto\titi"#);
+       let json_path = split_json_path(r"toto\titi");
        assert_eq!(&json_path, &[r#"tototiti"#]);
    }
}

View File

@@ -25,7 +25,7 @@ pub use self::searcher::{Searcher, SearcherGeneration};
pub use self::segment::Segment;
pub use self::segment_component::SegmentComponent;
pub use self::segment_id::SegmentId;
-pub use self::segment_reader::SegmentReader;
+pub use self::segment_reader::{merge_field_meta_data, FieldMetadata, SegmentReader};
pub use self::single_segment_index_writer::SingleSegmentIndexWriter;
/// The meta file contains all the information about the list of segments and the schema

View File

@@ -5,7 +5,8 @@ use std::{fmt, io};
use crate::collector::Collector;
use crate::core::{Executor, SegmentReader};
use crate::query::{Bm25StatisticsProvider, EnableScoring, Query};
-use crate::schema::{Document, Schema, Term};
+use crate::schema::document::DocumentDeserialize;
+use crate::schema::{Schema, Term};
use crate::space_usage::SearcherSpaceUsage;
use crate::store::{CacheStats, StoreReader};
use crate::{DocAddress, Index, Opstamp, SegmentId, TrackedObject};
@@ -83,7 +84,7 @@ impl Searcher {
    ///
    /// The searcher uses the segment ordinal to route the
    /// request to the right `Segment`.
-   pub fn doc(&self, doc_address: DocAddress) -> crate::Result<Document> {
+   pub fn doc<D: DocumentDeserialize>(&self, doc_address: DocAddress) -> crate::Result<D> {
        let store_reader = &self.inner.store_readers[doc_address.segment_ord as usize];
        store_reader.get(doc_address.doc_id)
    }
@@ -103,7 +104,10 @@ impl Searcher {
    /// Fetches a document in an asynchronous manner.
    #[cfg(feature = "quickwit")]
-   pub async fn doc_async(&self, doc_address: DocAddress) -> crate::Result<Document> {
+   pub async fn doc_async<D: DocumentDeserialize>(
+       &self,
+       doc_address: DocAddress,
+   ) -> crate::Result<D> {
        let store_reader = &self.inner.store_readers[doc_address.segment_ord as usize];
        store_reader.get_async(doc_address.doc_id).await
    }
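    // Illustrative usage sketch (not part of the diff): `doc` and `doc_async`
    // are now generic over `DocumentDeserialize`, so the caller picks the
    // concrete document type. `crate::TantivyDocument` is assumed here to be
    // the default document implementation.
    fn fetch_document_example(
        searcher: &Searcher,
        addr: DocAddress,
    ) -> crate::Result<crate::TantivyDocument> {
        // The searcher still routes the request to the right segment's doc store.
        searcher.doc::<crate::TantivyDocument>(addr)
    }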

View File

@@ -1,12 +1,17 @@
use std::collections::HashMap;
+use std::ops::BitOrAssign;
use std::sync::{Arc, RwLock};
use std::{fmt, io};
+use fnv::FnvHashMap;
+use itertools::Itertools;
use crate::core::{InvertedIndexReader, Segment, SegmentComponent, SegmentId};
use crate::directory::{CompositeFile, FileSlice};
use crate::error::DataCorruption;
use crate::fastfield::{intersect_alive_bitsets, AliveBitSet, FacetReader, FastFieldReaders};
use crate::fieldnorm::{FieldNormReader, FieldNormReaders};
+use crate::json_utils::json_path_sep_to_dot;
use crate::schema::{Field, IndexRecordOption, Schema, Type};
use crate::space_usage::SegmentSpaceUsage;
use crate::store::StoreReader;
@@ -280,6 +285,103 @@ impl SegmentReader {
        Ok(inv_idx_reader)
    }
/// Returns the list of fields that have been indexed in the segment.
/// The field list includes the field defined in the schema as well as the fields
/// that have been indexed as a part of a JSON field.
/// The returned field name is the full field name, including the name of the JSON field.
///
/// The returned field names can be used in queries.
///
/// Notice: If your data contains JSON fields this is **very expensive**, as it requires
/// browsing through the inverted index term dictionary and the columnar field dictionary.
///
    /// Disclaimer: Some fields may not be listed here. For instance, if the schema contains a JSON
    /// field that is neither indexed nor a fast field but is stored, the field may
    /// not be listed.
pub fn fields_metadata(&self) -> crate::Result<Vec<FieldMetadata>> {
let mut indexed_fields: Vec<FieldMetadata> = Vec::new();
let mut map_to_canonical = FnvHashMap::default();
for (field, field_entry) in self.schema().fields() {
let field_name = field_entry.name().to_string();
let is_indexed = field_entry.is_indexed();
if is_indexed {
let is_json = field_entry.field_type().value_type() == Type::Json;
if is_json {
let inv_index = self.inverted_index(field)?;
let encoded_fields_in_index = inv_index.list_encoded_fields()?;
let mut build_path = |field_name: &str, mut json_path: String| {
// In this case we need to map the potential fast field to the field name
// accepted by the query parser.
let create_canonical =
!field_entry.is_expand_dots_enabled() && json_path.contains('.');
if create_canonical {
// Without expand dots enabled dots need to be escaped.
let escaped_json_path = json_path.replace('.', "\\.");
let full_path = format!("{}.{}", field_name, escaped_json_path);
let full_path_unescaped = format!("{}.{}", field_name, &json_path);
map_to_canonical.insert(full_path_unescaped, full_path.to_string());
full_path
} else {
// With expand dots enabled, we can use '.' instead of '\u{1}'.
json_path_sep_to_dot(&mut json_path);
format!("{}.{}", field_name, json_path)
}
};
indexed_fields.extend(
encoded_fields_in_index
.into_iter()
.map(|(name, typ)| (build_path(&field_name, name), typ))
.map(|(field_name, typ)| FieldMetadata {
indexed: true,
stored: false,
field_name,
fast: false,
typ,
}),
);
} else {
indexed_fields.push(FieldMetadata {
indexed: true,
stored: false,
field_name: field_name.to_string(),
fast: false,
typ: field_entry.field_type().value_type(),
});
}
}
}
let mut fast_fields: Vec<FieldMetadata> = self
.fast_fields()
.columnar()
.iter_columns()?
.map(|(mut field_name, handle)| {
json_path_sep_to_dot(&mut field_name);
                // map to canonical path, to avoid similar but different entries.
                // Eventually we should just accept '.' separated for all cases.
let field_name = map_to_canonical
.get(&field_name)
.unwrap_or(&field_name)
.to_string();
FieldMetadata {
indexed: false,
stored: false,
field_name,
fast: true,
typ: Type::from(handle.column_type()),
}
})
.collect();
// Since the type is encoded differently in the fast field and in the inverted index,
// the order of the fields is not guaranteed to be the same. Therefore, we sort the fields.
// If we are sure that the order is the same, we can remove this sort.
indexed_fields.sort_unstable();
fast_fields.sort_unstable();
let merged = merge_field_meta_data(vec![indexed_fields, fast_fields], &self.schema);
Ok(merged)
}
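    // Illustrative usage sketch (not part of the diff): listing the fields of
    // every segment, including paths discovered inside JSON fields. Assumes an
    // existing index; `reader()`, `searcher()` and `segment_readers()` are the
    // usual tantivy accessors.
    fn list_fields_example(index: &crate::Index) -> crate::Result<()> {
        let searcher = index.reader()?.searcher();
        for segment_reader in searcher.segment_readers() {
            for field in segment_reader.fields_metadata()? {
                // e.g. "attributes.k8s\.node" with its Type and indexed/fast/stored flags
                println!("{} ({:?})", field.field_name, field.typ);
            }
        }
        Ok(())
    }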
    /// Returns the segment id
    pub fn segment_id(&self) -> SegmentId {
        self.segment_id
@@ -330,6 +432,65 @@ impl SegmentReader {
    }
}
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
/// FieldMetadata
pub struct FieldMetadata {
/// The field name
// Notice: Don't reorder the declaration of 1.field_name 2.typ, as it is used for ordering by
// field_name then typ.
pub field_name: String,
/// The field type
// Notice: Don't reorder the declaration of 1.field_name 2.typ, as it is used for ordering by
// field_name then typ.
pub typ: Type,
/// Is the field indexed for search
pub indexed: bool,
/// Is the field stored in the doc store
pub stored: bool,
/// Is the field stored in the columnar storage
pub fast: bool,
}
impl BitOrAssign for FieldMetadata {
fn bitor_assign(&mut self, rhs: Self) {
assert!(self.field_name == rhs.field_name);
assert!(self.typ == rhs.typ);
self.indexed |= rhs.indexed;
self.stored |= rhs.stored;
self.fast |= rhs.fast;
}
}
// Maybe too slow for the high cardinality case
fn is_field_stored(field_name: &str, schema: &Schema) -> bool {
schema
.find_field(field_name)
.map(|(field, _path)| schema.get_field_entry(field).is_stored())
.unwrap_or(false)
}
/// Helper to merge the field metadata from multiple segments.
pub fn merge_field_meta_data(
field_metadatas: Vec<Vec<FieldMetadata>>,
schema: &Schema,
) -> Vec<FieldMetadata> {
let mut merged_field_metadata = Vec::new();
for (_key, mut group) in &field_metadatas
.into_iter()
.kmerge_by(|left, right| left < right)
// TODO: Remove allocation
.group_by(|el| (el.field_name.to_string(), el.typ))
{
let mut merged: FieldMetadata = group.next().unwrap();
for el in group {
merged |= el;
}
// Currently is_field_stored is maybe too slow for the high cardinality case
merged.stored = is_field_stored(&merged.field_name, schema);
merged_field_metadata.push(merged);
}
merged_field_metadata
}
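// Illustrative sketch (not part of the diff): combining the per-segment field
// metadata of a whole index with `merge_field_meta_data`. The helper name and
// the surrounding glue are assumptions; only `fields_metadata` and
// `merge_field_meta_data` come from this change.
fn index_fields_metadata_example(index: &crate::Index) -> crate::Result<Vec<FieldMetadata>> {
    let searcher = index.reader()?.searcher();
    let per_segment: Vec<Vec<FieldMetadata>> = searcher
        .segment_readers()
        .iter()
        .map(|reader| reader.fields_metadata())
        .collect::<crate::Result<_>>()?;
    Ok(merge_field_meta_data(per_segment, &index.schema()))
}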
fn intersect_alive_bitset(
    left_opt: Option<AliveBitSet>,
    right_opt: Option<AliveBitSet>,
@@ -353,9 +514,127 @@ impl fmt::Debug for SegmentReader {
#[cfg(test)]
mod test {
+   use super::*;
    use crate::core::Index;
-   use crate::schema::{Schema, Term, STORED, TEXT};
-   use crate::DocId;
+   use crate::schema::{Schema, SchemaBuilder, Term, STORED, TEXT};
+   use crate::{DocId, FieldMetadata, IndexWriter};
#[test]
fn test_merge_field_meta_data_same() {
let schema = SchemaBuilder::new().build();
let field_metadata1 = FieldMetadata {
field_name: "a".to_string(),
typ: crate::schema::Type::Str,
indexed: true,
stored: false,
fast: true,
};
let field_metadata2 = FieldMetadata {
field_name: "a".to_string(),
typ: crate::schema::Type::Str,
indexed: true,
stored: false,
fast: true,
};
let res = merge_field_meta_data(
vec![vec![field_metadata1.clone()], vec![field_metadata2]],
&schema,
);
assert_eq!(res, vec![field_metadata1]);
}
#[test]
fn test_merge_field_meta_data_different() {
let schema = SchemaBuilder::new().build();
let field_metadata1 = FieldMetadata {
field_name: "a".to_string(),
typ: crate::schema::Type::Str,
indexed: false,
stored: false,
fast: true,
};
let field_metadata2 = FieldMetadata {
field_name: "b".to_string(),
typ: crate::schema::Type::Str,
indexed: false,
stored: false,
fast: true,
};
let field_metadata3 = FieldMetadata {
field_name: "a".to_string(),
typ: crate::schema::Type::Str,
indexed: true,
stored: false,
fast: false,
};
let res = merge_field_meta_data(
vec![
vec![field_metadata1.clone(), field_metadata2.clone()],
vec![field_metadata3],
],
&schema,
);
let field_metadata_expected1 = FieldMetadata {
field_name: "a".to_string(),
typ: crate::schema::Type::Str,
indexed: true,
stored: false,
fast: true,
};
assert_eq!(res, vec![field_metadata_expected1, field_metadata2.clone()]);
}
#[test]
fn test_merge_field_meta_data_merge() {
use pretty_assertions::assert_eq;
let get_meta_data = |name: &str, typ: Type| FieldMetadata {
field_name: name.to_string(),
typ,
indexed: false,
stored: false,
fast: true,
};
let schema = SchemaBuilder::new().build();
let mut metas = vec![get_meta_data("d", Type::Str), get_meta_data("e", Type::U64)];
metas.sort();
let res = merge_field_meta_data(vec![vec![get_meta_data("e", Type::Str)], metas], &schema);
assert_eq!(
res,
vec![
get_meta_data("d", Type::Str),
get_meta_data("e", Type::Str),
get_meta_data("e", Type::U64),
]
);
}
#[test]
fn test_merge_field_meta_data_bitxor() {
let field_metadata1 = FieldMetadata {
field_name: "a".to_string(),
typ: crate::schema::Type::Str,
indexed: false,
stored: false,
fast: true,
};
let field_metadata2 = FieldMetadata {
field_name: "a".to_string(),
typ: crate::schema::Type::Str,
indexed: true,
stored: false,
fast: false,
};
let field_metadata_expected = FieldMetadata {
field_name: "a".to_string(),
typ: crate::schema::Type::Str,
indexed: true,
stored: false,
fast: true,
};
let mut res1 = field_metadata1.clone();
res1 |= field_metadata2.clone();
let mut res2 = field_metadata2.clone();
res2 |= field_metadata1;
assert_eq!(res1, field_metadata_expected);
assert_eq!(res2, field_metadata_expected);
}
    #[test]
    fn test_num_alive() -> crate::Result<()> {
@@ -366,7 +645,7 @@ mod test {
        let name = schema.get_field("name").unwrap();
        {
-           let mut index_writer = index.writer_for_tests()?;
+           let mut index_writer: IndexWriter = index.writer_for_tests()?;
            index_writer.add_document(doc!(name => "tantivy"))?;
            index_writer.add_document(doc!(name => "horse"))?;
            index_writer.add_document(doc!(name => "jockey"))?;
@@ -392,7 +671,7 @@ mod test {
        let name = schema.get_field("name").unwrap();
        {
-           let mut index_writer = index.writer_for_tests()?;
+           let mut index_writer: IndexWriter = index.writer_for_tests()?;
            index_writer.add_document(doc!(name => "tantivy"))?;
            index_writer.add_document(doc!(name => "horse"))?;
            index_writer.add_document(doc!(name => "jockey"))?;
@@ -402,7 +681,7 @@ mod test {
        }
        {
-           let mut index_writer2 = index.writer(50_000_000)?;
+           let mut index_writer2: IndexWriter = index.writer(50_000_000)?;
            index_writer2.delete_term(Term::from_field_text(name, "horse"));
            index_writer2.delete_term(Term::from_field_text(name, "cap"));

Some files were not shown because too many files have changed in this diff.