Compare commits

..

90 Commits

Author SHA1 Message Date
Raphaël Marinier
0890503fc2 Speed up searches by removing repeated memsets coming from vec.resize()
Also, reserve exactly the size needed; surprisingly, this is required to
get the full ~5% speedup on a good fraction of the queries.
2024-03-12 17:50:23 +01:00
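The change boils down to the pattern visible in the bitpacker diff further down this page: clear the vector, reserve exactly the needed capacity, and let the decoder overwrite every slot. A minimal sketch of that pattern (illustrative, not the exact tantivy code):

```rust
/// Fill `positions` with `len` decoded values without memsetting the buffer first.
/// This is only sound because `decode_into` is guaranteed to overwrite every slot
/// and never reads from the buffer (the same contract as get_batch_u32s()).
fn fill_without_memset(positions: &mut Vec<u32>, len: usize, decode_into: impl Fn(&mut [u32])) {
    positions.clear();
    // reserve_exact() avoids over-allocation; a bigger-than-needed buffer was
    // measurably slower on a good fraction of queries.
    positions.reserve_exact(len);
    #[allow(clippy::uninit_vec)]
    unsafe {
        positions.set_len(len);
    }
    decode_into(positions);
}
```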
trinity-1686a
f6b0cc1aab allow some mixing of occur and bool in strict query parser (#2323)
* allow some mixing of occur and bool in strict query parser

* allow all mixing of binary and occur in strict parser
2024-03-07 15:17:48 +01:00
PSeitz
7e41d31c6e agg: support to deserialize f64 from string (#2311)
* agg: support to deserialize f64 from string

* remove visit_string

* disallow NaN
2024-03-05 05:49:41 +01:00
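A minimal sketch of what "deserialize f64 from string" can look like with serde, including the NaN rejection mentioned above. This is illustrative only; the actual tantivy deserializer may be structured differently:

```rust
use serde::{Deserialize, Deserializer};

/// Accept either a JSON number or a string containing a number, and reject NaN.
fn f64_from_number_or_string<'de, D>(deserializer: D) -> Result<f64, D::Error>
where
    D: Deserializer<'de>,
{
    #[derive(Deserialize)]
    #[serde(untagged)]
    enum NumOrStr {
        Num(f64),
        Str(String),
    }
    let value = match NumOrStr::deserialize(deserializer)? {
        NumOrStr::Num(v) => v,
        NumOrStr::Str(s) => s.parse::<f64>().map_err(serde::de::Error::custom)?,
    };
    if value.is_nan() {
        return Err(serde::de::Error::custom("NaN is not a valid aggregation value"));
    }
    Ok(value)
}
```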
Adam Reichold
40aa4abfe5 Make FacetCounts defaultable and cloneable. (#2322) 2024-03-05 04:11:11 +01:00
dependabot[bot]
2650317622 Update fs4 requirement from 0.7.0 to 0.8.0 (#2321)
Updates the requirements on [fs4](https://github.com/al8n/fs4-rs) to permit the latest version.
- [Release notes](https://github.com/al8n/fs4-rs/releases)
- [Commits](https://github.com/al8n/fs4-rs/commits)

---
updated-dependencies:
- dependency-name: fs4
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-02-27 03:38:04 +01:00
Paul Masurel
6739357314 Removing split_size and adding split_size and shard_size as segment_size (#2320)
aliases.
2024-02-26 11:35:22 +01:00
PSeitz
d57622d54b support bool type in term aggregation (#2318)
* support bool type in term aggregation

* add Bool to Intermediate Key
2024-02-20 03:22:22 +01:00
PSeitz
f745dbc054 fix Clone for TopNComputer, add top_hits bench (#2315)
* fix Clone for TopNComputer, add top_hits bench

add top_hits agg bench

test aggregation::agg_bench::bench::bench_aggregation_terms_many_with_sub_agg                                            ... bench: 123,475,175 ns/iter (+/- 30,608,889)
test aggregation::agg_bench::bench::bench_aggregation_terms_many_with_sub_agg_multi                                      ... bench: 194,170,414 ns/iter (+/- 36,495,516)
test aggregation::agg_bench::bench::bench_aggregation_terms_many_with_sub_agg_opt                                        ... bench: 179,742,809 ns/iter (+/- 29,976,507)
test aggregation::agg_bench::bench::bench_aggregation_terms_many_with_sub_agg_sparse                                     ... bench:  27,592,534 ns/iter (+/- 2,672,370)
test aggregation::agg_bench::bench::bench_aggregation_terms_many_with_top_hits_agg                                       ... bench: 552,851,227 ns/iter (+/- 71,975,886)
test aggregation::agg_bench::bench::bench_aggregation_terms_many_with_top_hits_agg_multi                                 ... bench: 558,616,384 ns/iter (+/- 100,890,124)
test aggregation::agg_bench::bench::bench_aggregation_terms_many_with_top_hits_agg_opt                                   ... bench: 554,031,368 ns/iter (+/- 165,452,650)
test aggregation::agg_bench::bench::bench_aggregation_terms_many_with_top_hits_agg_sparse                                ... bench:  46,435,919 ns/iter (+/- 13,681,935)

* add comment
2024-02-20 03:22:00 +01:00
PSeitz
79b041f81f clippy (#2314) 2024-02-13 05:56:31 +01:00
PSeitz
0e16ed9ef7 Fix serde for TopNComputer (#2313)
* Fix serde for TopNComputer

The top hits aggregation changed the TopNComputer to be serializable,
but the capacity needs to be carried over as well, since it is checked
when pushing elements (capacity == 0 is not allowed).

* use serde from deser

* remove pub, clippy
2024-02-07 12:52:06 +01:00
mochi
88a3275dbb add shared search executor (#2312) 2024-02-05 09:33:00 +01:00
PSeitz
1223a87eb2 add fuzz test for hashmap (#2310) 2024-01-31 10:30:21 +01:00
PSeitz
48630ceec9 move into new index module (#2259)
move core modules to index module
2024-01-31 10:30:04 +01:00
Adam Reichold
72002e8a89 Make test builds Clippy clean. (#2277) 2024-01-31 02:47:06 +01:00
trinity-1686a
3c9297dd64 report if posting list was actually loaded when warming it up (#2309) 2024-01-29 15:23:16 +01:00
Tushar
0e04ec3136 feat(aggregators/metric): Add a top_hits aggregator (#2198)
* feat(aggregators/metric): Implement a top_hits aggregator

* fix: Expose get_fields

* fix: Serializer for top_hits request

Also removes the extraneous third-party
serialization helper.

* chore: Avert panic on parsing invalid top_hits query

* refactor: Allow multiple field names from aggregations

* perf: Replace binary heap with TopNComputer

* fix: Avoid comparator inversion by ComparableDoc

* fix: Rank missing field values lower than present values

* refactor: Make KeyOrder a struct

* feat: Rough attempt at docvalue_fields

* feat: Complete stab at docvalue_fields

- Rename "SearchResult*" => "Retrieval*"
- Revert Vec => HashMap for aggregation accessors.
- Split accessors for core aggregation and field retrieval.
- Resolve globbed field names in docvalue_fields retrieval.
- Handle strings/bytes and other column types with DynamicColumn

* test(unit): Add tests for top_hits aggregator

* fix: docfield_value field globbing

* test(unit): Include dynamic fields

* fix: Value -> OwnedValue

* fix: Use OwnedValue's native Null variant

* chore: Improve readability of test asserts

* chore: Remove DocAddress from top_hits result

* docs: Update aggregator doc

* revert: accidental doc test

* chore: enable time macros only for tests

* chore: Apply suggestions from review

* chore: Apply suggestions from review

* fix: Retrieve all values for fields

* test(unit): Update for multi-value retrieval

* chore: Assert term existence

* feat: Include all columns for a column name

Since a (name, type) constitutes a unique column.

* fix: Resolve json fields

Introduces a translation step to bridge the difference between
ColumnarReader's `\0`-separated JSON field keys and the common
`.`-separated keys used by SegmentReader. Arguably, this should
be the default behavior of ColumnarReader's public API.

* chore: Address review on mutability

* chore: s/segment_id/segment_ordinal instances of SegmentOrdinal

* chore: Revert erroneous grammar change
2024-01-26 16:46:41 +01:00
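For orientation, a top_hits request in the Elasticsearch-style aggregation JSON that the module mimics looks roughly like the following. The field names ("size", "sort", "docvalue_fields") are taken from the bullet points above but should be treated as illustrative rather than a verified schema:

```rust
use serde_json::json;

/// Hypothetical top_hits request nested under a terms aggregation.
fn example_top_hits_request() -> serde_json::Value {
    json!({
        "by_severity": {
            "terms": { "field": "severity" },
            "aggs": {
                "latest": {
                    "top_hits": {
                        "size": 2,
                        "sort": [{ "timestamp": "desc" }],
                        "docvalue_fields": ["timestamp", "body"]
                    }
                }
            }
        }
    })
}
```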
Paul Masurel
9b7f3a55cf Bumped census version 2024-01-26 19:32:02 +09:00
PSeitz
1dacdb6c85 add histogram agg test on empty index (#2306) 2024-01-23 16:27:34 +01:00
François Massot
30483310ca Minor improvement of README.md (#2305)
* Update README.md

* Remove useless paragraph

* Wording.
2024-01-19 17:46:48 +09:00
Tushar
e1d18b5114 chore: Expose TopDocs::order_by_u64_field again (#2282) 2024-01-18 05:58:24 +01:00
trinity-1686a
108f30ba23 allow newline where we allow space in query parser (#2302)
fix regression from the new parser
2024-01-17 14:38:35 +01:00
PSeitz
5943ee46bd Truncate keys to u16::MAX in term hashmap (#2299)
Truncate keys to u16::MAX instead of, e.g., storing 0 bytes for keys of length u16::MAX + 1

The term hashmap has a hidden API contract to only accept terms with length up to u16::MAX.
2024-01-11 10:19:12 +01:00
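A sketch of the truncation described above (a hypothetical helper, not the actual tantivy-stacker code): the key length is stored as a u16, so clamping the key prevents the length from wrapping to 0 for a key of exactly u16::MAX + 1 bytes:

```rust
/// Clamp a key to the maximum length the term hashmap can represent (u16 length prefix).
fn clamp_key(key: &[u8]) -> &[u8] {
    &key[..key.len().min(u16::MAX as usize)]
}
```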
PSeitz
f95a76293f add memory arena test (#2298)
* add memory arena test

* add assert

* Update stacker/src/memory_arena.rs

Co-authored-by: Paul Masurel <paul@quickwit.io>

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2024-01-11 07:18:48 +01:00
Paul Masurel
014328e378 Fix bug that can cause get_docids_for_value_range to panic. (#2295)
* Fix bug that can cause `get_docids_for_value_range` to panic.

When `selected_docid_range.end == num_rows`, we would get a panic
as we try to access a non-existing blockmeta.

This PR accepts calls to rank with any value.
For any value above num_rows we simply return non_null_rows.

Fixes #2293

* add tests, merge variables

---------

Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>
2024-01-09 14:52:20 +01:00
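The guard described above also appears in the optional-index diff at the bottom of this page. A self-contained sketch of the idea, using assumed simplified types rather than the real OptionalIndex:

```rust
/// Simplified stand-in for the optional index: `non_null_doc_ids` is sorted.
struct OptionalIndexSketch {
    num_docs: u32,
    non_null_doc_ids: Vec<u32>,
}

impl OptionalIndexSketch {
    /// Any doc_id is allowed, including doc_id == num_docs and beyond:
    /// everything at or past the end ranks to the total number of non-null rows.
    fn rank(&self, doc_id: u32) -> u32 {
        if doc_id >= self.num_docs {
            return self.non_null_doc_ids.len() as u32;
        }
        // Number of non-null docs strictly before `doc_id`.
        self.non_null_doc_ids.partition_point(|&d| d < doc_id) as u32
    }
}
```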
Adam Reichold
53f2fe1fbe Forward regex parser errors to enable understanding their reason. (#2288) 2023-12-22 11:01:10 +01:00
PSeitz
9c75942aaf fix merge panic for JSON fields (#2284)
The root cause was that the positions buffer kept residual positions from the
previous term when terms alternated between having and not having
positions in JSON (text terms have positions, but numeric terms don't).

Fixes #2283
2023-12-21 11:05:34 +01:00
PSeitz
bff7c58497 improve indexing benchmark (#2275) 2023-12-11 09:04:42 +01:00
trinity-1686a
9ebc5ed053 use fst for sstable index (#2268)
* read path for new fst based index

* implement BlockAddrStoreWriter

* extract slop/derivation computation

* use better linear approximator and allow negative correction to approximator

* document format and reorder some fields

* optimize single block sstable size

* plug backward compat
2023-12-04 15:13:15 +01:00
PSeitz
0b56c88e69 Revert "Preparing for 0.21.2 release." (#2258)
* Revert "Preparing for 0.21.2 release. (#2256)"

This reverts commit 9caab45136.

* bump version to 0.21.1

* set version to 0.22.0-dev
2023-12-01 13:46:12 +01:00
PSeitz
24841f0b2a update bitpacker dep (#2269) 2023-12-01 13:45:52 +01:00
PSeitz
1a9fc10be9 add fields_metadata to SegmentReader, add columnar docs (#2222)
* add fields_metadata to SegmentReader, add columnar docs

* use schema to resolve field, add test

* normalize paths

* merge for FieldsMetadata, add fields_metadata on Index

* Update src/core/segment_reader.rs

Co-authored-by: Paul Masurel <paul@quickwit.io>

* merge code paths

* add Hash

* move function outside

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2023-11-22 12:29:53 +01:00
PSeitz
07573a7f19 update fst (#2267)
update fst to 0.5 (deduplicates regex-syntax in the dep tree)
deps cleanup
2023-11-21 16:06:57 +01:00
BlackHoleFox
daad2dc151 Take string references instead of owned values building Facet paths (#2265) 2023-11-20 09:40:44 +01:00
PSeitz
054f49dc31 support escaped dot, add agg test (#2250)
add agg test for nested JSON
allow escaping of dot
2023-11-20 03:00:57 +01:00
PSeitz
47009ed2d3 remove unused deps (#2264)
found with cargo machete
remove pprof (doesn't work)
2023-11-20 02:59:59 +01:00
PSeitz
0aae31d7d7 reduce number of allocations (#2257)
* reduce number of allocations

Explanation accounts for around 50% of all allocations (allocation counts, not perf impact).
It's created during serialization but not actually used.

- Make Explanation optional in BM25
- Avoid allocations when using Explanation

* use Cow
2023-11-16 13:47:36 +01:00
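A rough sketch of the `Cow` idea mentioned in the last bullet (illustrative types, not the actual Explanation API): static description strings stay borrowed, and only genuinely dynamic descriptions allocate:

```rust
use std::borrow::Cow;

struct ExplanationSketch {
    description: Cow<'static, str>,
    value: f32,
}

impl ExplanationSketch {
    // No allocation: the &'static str is borrowed.
    fn constant(value: f32) -> Self {
        Self { description: Cow::Borrowed("term frequency factor"), value }
    }

    // Allocation only when the description really is dynamic.
    fn formatted(field: &str, value: f32) -> Self {
        Self { description: Cow::Owned(format!("fieldnorm for field {field}")), value }
    }
}
```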
Paul Masurel
9caab45136 Preparing for 0.21.2 release. (#2256) 2023-11-15 10:43:36 +09:00
Chris Tam
6d9a7b7eb0 Derive Debug for SchemaBuilder (#2254) 2023-11-15 01:03:44 +01:00
dependabot[bot]
7a2c5804b1 Update itertools requirement from 0.11.0 to 0.12.0 (#2255)
Updates the requirements on [itertools](https://github.com/rust-itertools/itertools) to permit the latest version.
- [Changelog](https://github.com/rust-itertools/itertools/blob/master/CHANGELOG.md)
- [Commits](https://github.com/rust-itertools/itertools/compare/v0.11.0...v0.12.0)

---
updated-dependencies:
- dependency-name: itertools
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-11-15 01:03:08 +01:00
François Massot
5319977171 Merge pull request #2253 from quickwit-oss/issue/2251-bug-merge-json-object-with-number
Fix bug occurring when merging JSON object indexed with positions.
2023-11-14 17:28:29 +01:00
trinity-1686a
828632e8c4 rustfmt 2023-11-14 15:05:16 +01:00
Paul Masurel
6b59ec6fd5 Fix bug occurring when merging a JSON object field indexed with positions.
In a JSON object field, the presence of term frequencies depends on the
term: typically, a string indexed with positions will have positions,
while numbers won't.

The presence or absence of term freqs for a given term is unfortunately
encoded in a very passive way.

It is given by the presence of extra information in the skip info, or by
the lack of term freqs after decoding vint blocks.

Before this change, we would encode a freshly written segment correctly
(without any term freq for numbers in a JSON object field).
During a merge, however, we would get the default term freq = 1 value
(the default in the absence of encoded term freqs).

The merger would then proceed and attempt to decode 1 position when
there are in fact none.

This PR requires the caller to explicitly tell the posting serializer whether
term frequencies should be serialized for each new term.

Closes #2251
2023-11-14 22:41:48 +09:00
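A hypothetical sketch of the serializer contract this describes (the names and signature are illustrative, not the actual tantivy API): the writer states per term whether term frequencies follow, so the merger never has to guess:

```rust
struct PostingSerializerSketch {
    current_term_has_tf: bool,
}

impl PostingSerializerSketch {
    /// The caller must say explicitly whether this term records term frequencies.
    fn new_term(&mut self, _term: &[u8], record_term_freq: bool) {
        self.current_term_has_tf = record_term_freq;
    }

    fn write_doc(&mut self, doc: u32, term_freq: u32, out: &mut Vec<u32>) {
        out.push(doc);
        // Only emit the tf when the term actually carries frequencies; a numeric
        // term in a JSON field writes none, and the reader no longer assumes tf = 1.
        if self.current_term_has_tf {
            out.push(term_freq);
        }
    }
}
```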
PSeitz
b60d862150 docid deltas while indexing (#2249)
* docid deltas while indexing

storing deltas is especially helpful for repetitive data like logs.
In those cases, recording a doc for a term used to cost 4 bytes; it now
costs 1 byte.

HDFS Indexing 1.1GB Total memory consumption:
Before:  760 MB
Now:     590 MB

* use scan for delta decoding
2023-11-13 05:14:27 +01:00
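A minimal sketch of the delta idea above (illustrative, not the actual stacker code): recording the gap to the previous doc id instead of the absolute doc id lets a vint spend a single byte on the common case of small gaps:

```rust
/// Delta + vint encode a non-decreasing list of doc ids.
fn delta_vint_encode(doc_ids: &[u32]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut last = 0u32;
    for &doc in doc_ids {
        let mut delta = doc - last; // small for repetitive data like logs
        last = doc;
        // Plain LEB128-style vint: 7 bits per byte, high bit = "more bytes follow".
        loop {
            let byte = (delta & 0x7F) as u8;
            delta >>= 7;
            if delta == 0 {
                out.push(byte);
                break;
            }
            out.push(byte | 0x80);
        }
    }
    out
}
```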
PSeitz
4837c7811a add missing inlines (#2245) 2023-11-10 08:00:42 +01:00
PSeitz
5a2397d57e add sstable ord_to_term benchmark (#2242) 2023-11-10 07:27:48 +01:00
PSeitz
927b4432c9 Perf: use term hashmap in fastfield (#2243)
* add shared arena hashmap

* bench fastfield indexing

* use shared arena hashmap in columnar

lower minimum resize in hashtable

* clippy

* add comments
2023-11-09 13:44:02 +01:00
trinity-1686a
7a0064db1f bump index version (#2237)
* bump index version

and add constant for lowest supported version

* use range instead of handcoded bounds
2023-11-06 19:02:37 +01:00
PSeitz
2e7327205d fix coverage run (#2232)
the coverage run uses the compare_hash_only feature, which is not compatible
with the test_hashmap_size test
2023-11-06 11:18:38 +00:00
Paul Masurel
7bc5bf78e2 Fixing functional tests. (#2239) 2023-11-05 18:18:39 +09:00
giovannicuccu
ef603c8c7e rename ReloadPolicy onCommit to onCommitWithDelay (#2235)
* rename ReloadPolicy onCommit to onCommitWithDelay

* fix format issues

---------

Co-authored-by: Giovanni Cuccu <gcuccu@imolainformatica.it>
2023-11-03 12:22:10 +01:00
PSeitz
28dd6b6546 collect json paths in indexing (#2231)
* collect json paths in indexing

* remove unsafe iter_mut_keys
2023-11-01 11:25:17 +01:00
trinity-1686a
1dda2bb537 handle * inside term in query parser (#2228) 2023-10-27 08:57:02 +02:00
PSeitz
bf6544cf28 fix mmap::Advice reexport (#2230) 2023-10-27 14:09:25 +09:00
PSeitz
ccecf946f7 tantivy 0.21.1 (#2227) 2023-10-27 05:01:44 +02:00
PSeitz
19a859d6fd term hashmap remove copy in is_empty, unused unordered_id (#2229) 2023-10-27 05:01:32 +02:00
PSeitz
83af14caa4 Fix range query (#2226)
Fix range query end check in advance
Rename vars to reduce ambiguity
add tests

Fixes #2225
2023-10-25 09:17:31 +02:00
PSeitz
4feeb2323d fix clippy (#2223) 2023-10-24 10:05:22 +02:00
PSeitz
07bf66a197 json path writer (#2224)
* refactor logic to JsonPathWriter

* use in encode_column_name

* add inlines

* move unsafe block
2023-10-24 09:45:50 +02:00
trinity-1686a
0d4589219b encode some part of posting list as -1 instead of direct values (#2185)
* add support for delta-1 encoding posting list

* encode term frequency minus one

* don't emit tf for json integer terms

* make skipreader not pub(crate) mutable
2023-10-20 16:58:26 +02:00
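A small sketch of the "minus one" trick referenced above (illustrative): within a posting list, consecutive doc ids differ by at least 1 and recorded term frequencies are at least 1, so both can be stored minus one, which keeps more values in the smallest bitpacking width:

```rust
/// doc_ids must be strictly increasing; gaps are >= 1, so gap - 1 is lossless.
fn doc_gaps_minus_one(doc_ids: &[u32]) -> Vec<u32> {
    doc_ids.windows(2).map(|w| w[1] - w[0] - 1).collect()
}

/// Term frequencies are >= 1 whenever they are recorded, so tf - 1 is lossless too.
fn tfs_minus_one(tfs: &[u32]) -> Vec<u32> {
    tfs.iter().map(|&tf| tf - 1).collect()
}
```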
PSeitz
c2b0469180 improve docs, rework exports (#2220)
* rework exports

move snippet and advice
make indexer pub, remove indexer reexports

* add deprecation warning

* add architecture overview
2023-10-18 09:22:24 +02:00
PSeitz
7e1980b218 run coverage only after merge (#2212)
* run coverage only after merge

coverage is quite a slow step in CI. It can instead be run only after merging

* Apply suggestions from code review

Co-authored-by: Paul Masurel <paul@quickwit.io>

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2023-10-18 07:19:36 +02:00
PSeitz
ecb9a89a9f add compat mode for JSON (#2219) 2023-10-17 10:00:55 +02:00
PSeitz
5e06e504e6 split into ReferenceValueLeaf (#2217) 2023-10-16 16:31:30 +02:00
PSeitz
182f58cea6 remove Document: DocumentDeserialize dependency (#2211)
* remove Document: DocumentDeserialize dependency

The dependency requires users to implement an API they may not use.

* remove unnecessary Document bounds
2023-10-13 07:59:54 +02:00
dependabot[bot]
337ffadefd Update lru requirement from 0.11.0 to 0.12.0 (#2208)
Updates the requirements on [lru](https://github.com/jeromefroe/lru-rs) to permit the latest version.
- [Changelog](https://github.com/jeromefroe/lru-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/jeromefroe/lru-rs/compare/0.11.0...0.12.0)

---
updated-dependencies:
- dependency-name: lru
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-10-12 12:09:56 +02:00
dependabot[bot]
22aa4daf19 Update zstd requirement from 0.12 to 0.13 (#2214)
Updates the requirements on [zstd](https://github.com/gyscos/zstd-rs) to permit the latest version.
- [Release notes](https://github.com/gyscos/zstd-rs/releases)
- [Commits](https://github.com/gyscos/zstd-rs/compare/v0.12.0...v0.13.0)

---
updated-dependencies:
- dependency-name: zstd
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-10-12 04:24:44 +02:00
PSeitz
493f9b2f2a Read list of JSON fields encoded in dictionary (#2184)
* Read list of JSON fields encoded in dictionary

add method to get list of fields on InvertedIndexReader

* add field type
2023-10-09 12:06:22 +02:00
PSeitz
e246e5765d replace ReferenceValue with Self in Value (#2210) 2023-10-06 08:22:15 +02:00
PSeitz
6097235eff fix numeric order, refactor Document (#2209)
fix numeric order to prefer i64
rename and move Document stuff
2023-10-05 16:39:56 +02:00
PSeitz
b700c42246 add AsRef, expose object and array iter on Value (#2207)
add AsRef
expose object and array iter
add to_json on Document
2023-10-05 03:55:35 +02:00
PSeitz
5b1bf1a993 replace Field with field name (#2196) 2023-10-04 06:21:40 +02:00
PSeitz
041d4fced7 move to_named_doc to Document trait (#2205) 2023-10-04 06:03:07 +02:00
dependabot[bot]
166fc15239 Update memmap2 requirement from 0.7.1 to 0.9.0 (#2204)
Updates the requirements on [memmap2](https://github.com/RazrFalcon/memmap2-rs) to permit the latest version.
- [Changelog](https://github.com/RazrFalcon/memmap2-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/RazrFalcon/memmap2-rs/compare/v0.7.1...v0.9.0)

---
updated-dependencies:
- dependency-name: memmap2
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-10-04 05:00:46 +02:00
PSeitz
514a6e7fef fix bench compile, fix Document reexport (#2203) 2023-10-03 17:28:36 +02:00
dependabot[bot]
82d9127191 Update fs4 requirement from 0.6.3 to 0.7.0 (#2199)
Updates the requirements on [fs4](https://github.com/al8n/fs4-rs) to permit the latest version.
- [Release notes](https://github.com/al8n/fs4-rs/releases)
- [Commits](https://github.com/al8n/fs4-rs/commits/0.7.0)

---
updated-dependencies:
- dependency-name: fs4
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-10-03 04:43:09 +02:00
PSeitz
03a1f40767 rename DocValue to Value (#2197)
rename DocValue to Value to avoid confusion with lucene DocValues
rename Value to OwnedValue
2023-10-02 17:03:00 +02:00
Harrison Burt
1c7c6fd591 POC: Tantivy documents as a trait (#2071)
* fix windows build (#1)

* Fix windows build

* Add doc traits

* Add field value iter

* Add value and serialization

* Adjust order

* Fix bug

* Correct type

* Fix generic bugs

* Reformat code

* Add generic to index writer which I forgot about

* Fix missing generics on single segment writer

* Add missing type export

* Add default methods for convenience

* Cleanup

* Fix more-like-this query to use standard types

* Update API and fix tests

* Add doc traits

* Add field value iter

* Add value and serialization

* Adjust order

* Fix bug

* Correct type

* Rebase main and fix conflicts

* Reformat code

* Merge upstream

* Fix missing generics on single segment writer

* Add missing type export

* Add default methods for convenience

* Cleanup

* Fix more-like-this query to use standard types

* Update API and fix tests

* Add tokenizer improvements from previous commits

* Add tokenizer improvements from previous commits

* Reformat

* Fix unit tests

* Fix unit tests

* Use enum in changes

* Stage changes

* Add new deserializer logic

* Add serializer integration

* Add document deserializer

* Implement new (de)serialization api for existing types

* Fix bugs and type errors

* Add helper implementations

* Fix errors

* Reformat code

* Add unit tests and some code organisation for serialization

* Add unit tests to deserializer

* Add some small docs

* Add support for deserializing serde values

* Reformat

* Fix typo

* Fix typo

* Change repr of facet

* Remove unused trait methods

* Add child value type

* Resolve comments

* Fix build

* Fix more build errors

* Fix more build errors

* Fix the tests I missed

* Fix examples

* fix numerical order, serialize PreTok Str

* fix coverage

* rename Document to TantivyDocument, rename DocumentAccess to Document

add Binary prefix to binary de/serialization

* fix coverage

---------

Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>
2023-10-02 10:01:16 +02:00
PSeitz
b525f653c0 replace BinaryHeap for TopN (#2186)
* replace BinaryHeap for TopN

replace the BinaryHeap for TopN with a variant that selects the median via a
quickselect-style partition, which runs in O(n) time.

add merge_fruits fast path

* call truncate unconditionally, extend test

* remove special early exit

* add TODO, fmt

* truncate top n instead median, return vec

* simplify code
2023-09-27 09:25:30 +02:00
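A minimal sketch of the selection idea above (not the actual TopNComputer): collect hits into a buffer and, once it overflows, use an O(n) selection to keep only the best n, rather than maintaining a BinaryHeap:

```rust
/// Keep the `n` highest-scoring (score, doc) pairs. Assumes scores are never NaN.
fn keep_top_n(mut hits: Vec<(f32, u32)>, n: usize) -> Vec<(f32, u32)> {
    if n == 0 {
        hits.clear();
        return hits;
    }
    if hits.len() > n {
        // O(len) partition: afterwards the n best scores occupy hits[..n] (unsorted).
        hits.select_nth_unstable_by(n - 1, |a, b| b.0.partial_cmp(&a.0).unwrap());
        hits.truncate(n);
    }
    hits
}
```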
ethever.eth
90586bc1e2 chore: remove unused Seek impl for Writers (#2187) (#2189)
Co-authored-by: famouscat <onismaa@gmail.com>
2023-09-26 17:03:28 +09:00
PSeitz
832f1633de handle exclusive out of bounds ranges on fastfield range queries (#2174)
closes https://github.com/quickwit-oss/quickwit/issues/3790
2023-09-26 08:00:40 +02:00
PSeitz
38db53c465 make column_index pub (#2181) 2023-09-22 08:06:45 +02:00
PSeitz
34920d31f5 Fix DateHistogram bucket gap (#2183)
* Fix DateHistogram bucket gap

Fixes a miscomputation of the number of buckets needed in the
DateHistogram.

This is due to a missing normalization from request values (ms) to fast field
values (ns) when converting an intermediate result to the final result,
which throws the computation off by a factor of 1_000_000.
The Histogram normalizes values to nanoseconds to make user input like
extended_bounds (ms precision) and the values from the fast field (ns precision for the date type) compatible.
This normalization happens only for date type fields, as other field types don't have precision settings.
The normalization does not happen because of a missing `column_type`, which is not
correctly passed after merging an empty aggregation (which has no `column_type` set) with a regular aggregation.

A related issue: an empty aggregation, which has no `column_type` set,
will not convert the result to a human-readable format.

This PR fixes the issue by:
- Limiting the allowed field types of DateHistogram to DateType
- Flagging the aggregation as `is_date_agg` instead of passing the column_type, which is only available at the segment level
- Fixing the merge logic

It also adds a flag so the normalization is applied only once. This is not an issue
currently, but it could easily become one.

closes https://github.com/quickwit-oss/quickwit/issues/3837

* use older nightly for time crate (breaks build)
2023-09-21 10:41:35 +02:00
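The factor mentioned above comes straight from the unit mismatch: request-side values such as extended_bounds are milliseconds, date fast field values are nanoseconds, and 1 ms = 1_000_000 ns. A trivial illustration of the normalization that has to happen exactly once:

```rust
const MS_TO_NS: i64 = 1_000_000;

/// Normalize a request-side bound (ms) to fast-field resolution (ns).
/// Applying this twice, or not at all, is exactly the kind of bug described above.
fn normalize_extended_bound_ms_to_ns(bound_ms: i64) -> i64 {
    bound_ms * MS_TO_NS
}
```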
trinity-1686a
0241a05b90 add support for exists query syntax in query parser (#2170)
* add support for exists query syntax in query parser

* rustfmt

* make Exists require a field
2023-09-19 11:10:39 +02:00
PSeitz
e125f3b041 fix test (#2178) 2023-09-19 08:21:50 +02:00
PSeitz
c520ac46fc add support for date in term agg (#2172)
support DateTime in TermsAggregation
Format dates with Rfc3339
2023-09-14 09:22:18 +02:00
PSeitz
2d7390341c increase min memory to 15MB for indexing (#2176)
With tantivy 0.20 the minimum memory consumption per SegmentWriter increased to
12MB. 7MB of that is for the different fast field collector types (they could be
created lazily). Increase the minimum memory from 3MB to 15MB.

Change memory variable naming from arena to budget.

closes #2156
2023-09-13 07:38:34 +02:00
dependabot[bot]
03fcdce016 Bump actions/checkout from 3 to 4 (#2171)
Bumps [actions/checkout](https://github.com/actions/checkout) from 3 to 4.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v3...v4)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-09-11 10:47:33 +02:00
Ping Xia
e4e416ac42 extend FuzzyTermQuery to support json field (#2173)
* extend fuzzy search for json field

* comments

* comments

* fmt fix

* comments
2023-09-11 05:59:40 +02:00
Igor Motov
19325132b7 Fast-field based implementation of ExistsQuery (#2160)
Adds an implementation of ExistsQuery that takes advantage of fast fields.

Fixes #2159
2023-09-07 11:51:49 +09:00
Paul Masurel
389d36f760 Added comments 2023-09-04 11:06:56 +09:00
222 changed files with 11922 additions and 3659 deletions

View File

@@ -3,8 +3,6 @@ name: Coverage
on:
  push:
    branches: [main]
-  pull_request:
-    branches: [main]
# Ensures that we cancel running jobs for the same PR / same workflow.
concurrency:
@@ -15,13 +13,13 @@ jobs:
  coverage:
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v3
+      - uses: actions/checkout@v4
      - name: Install Rust
-        run: rustup toolchain install nightly --profile minimal --component llvm-tools-preview
+        run: rustup toolchain install nightly-2023-09-10 --profile minimal --component llvm-tools-preview
      - uses: Swatinem/rust-cache@v2
      - uses: taiki-e/install-action@cargo-llvm-cov
      - name: Generate code coverage
-        run: cargo +nightly llvm-cov --all-features --workspace --doctests --lcov --output-path lcov.info
+        run: cargo +nightly-2023-09-10 llvm-cov --all-features --workspace --doctests --lcov --output-path lcov.info
      - name: Upload coverage to Codecov
        uses: codecov/codecov-action@v3
        continue-on-error: true

View File

@@ -19,7 +19,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v3
+      - uses: actions/checkout@v4
      - name: Install stable
        uses: actions-rs/toolchain@v1
        with:

View File

@@ -20,7 +20,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v3
+      - uses: actions/checkout@v4
      - name: Install nightly
        uses: actions-rs/toolchain@v1
@@ -39,6 +39,13 @@ jobs:
      - name: Check Formatting
        run: cargo +nightly fmt --all -- --check
+      - name: Check Stable Compilation
+        run: cargo build --all-features
+      - name: Check Bench Compilation
+        run: cargo +nightly bench --no-run --profile=dev --all-features
      - uses: actions-rs/clippy-check@v1
        with:
@@ -60,7 +67,7 @@ jobs:
    name: test-${{ matrix.features.label}}
    steps:
-      - uses: actions/checkout@v3
+      - uses: actions/checkout@v4
      - name: Install stable
        uses: actions-rs/toolchain@v1

View File

@@ -1,3 +1,9 @@
+Tantivy 0.21.1
+================================
+#### Bugfixes
+- Range queries on fast fields with less values on that field than documents had an invalid end condition, leading to missing results. [#2226](https://github.com/quickwit-oss/tantivy/issues/2226)(@appaquet @PSeitz)
+- Increase the minimum memory budget from 3MB to 15MB to avoid single doc segments (API fix). [#2176](https://github.com/quickwit-oss/tantivy/issues/2176)(@PSeitz)
Tantivy 0.21
================================
#### Bugfixes

View File

@@ -1,6 +1,6 @@
[package]
name = "tantivy"
-version = "0.21.0"
+version = "0.22.0-dev"
authors = ["Paul Masurel <paul.masurel@gmail.com>"]
license = "MIT"
categories = ["database-implementations", "data-structures"]
@@ -22,36 +22,34 @@ crc32fast = "1.3.2"
once_cell = "1.10.0"
regex = { version = "1.5.5", default-features = false, features = ["std", "unicode"] }
aho-corasick = "1.0"
-tantivy-fst = "0.4.0"
+tantivy-fst = "0.5"
-memmap2 = { version = "0.7.1", optional = true }
+memmap2 = { version = "0.9.0", optional = true }
lz4_flex = { version = "0.11", default-features = false, optional = true }
-zstd = { version = "0.12", optional = true, default-features = false }
+zstd = { version = "0.13", optional = true, default-features = false }
tempfile = { version = "3.3.0", optional = true }
log = "0.4.16"
serde = { version = "1.0.136", features = ["derive"] }
serde_json = "1.0.79"
num_cpus = "1.13.1"
-fs4 = { version = "0.6.3", optional = true }
+fs4 = { version = "0.8.0", optional = true }
levenshtein_automata = "0.2.1"
uuid = { version = "1.0.0", features = ["v4", "serde"] }
crossbeam-channel = "0.5.4"
rust-stemmers = "1.2.0"
downcast-rs = "1.2.0"
-bitpacking = { version = "0.8.4", default-features = false, features = ["bitpacker4x"] }
+bitpacking = { version = "0.9.2", default-features = false, features = ["bitpacker4x"] }
-census = "0.4.0"
+census = "0.4.2"
rustc-hash = "1.1.0"
thiserror = "1.0.30"
htmlescape = "0.3.1"
fail = { version = "0.5.0", optional = true }
-murmurhash32 = "0.3.0"
time = { version = "0.3.10", features = ["serde-well-known"] }
smallvec = "1.8.0"
rayon = "1.5.2"
-lru = "0.11.0"
+lru = "0.12.0"
fastdivide = "0.4.0"
-itertools = "0.11.0"
+itertools = "0.12.0"
measure_time = "0.8.2"
-async-trait = "0.1.53"
arc-swap = "1.5.0"
columnar = { version= "0.2", path="./columnar", package ="tantivy-columnar" }
@@ -63,6 +61,7 @@ common = { version= "0.6", path = "./common/", package = "tantivy-common" }
tokenizer-api = { version= "0.2", path="./tokenizer-api", package="tantivy-tokenizer-api" }
sketches-ddsketch = { version = "0.2.1", features = ["use_serde"] }
futures-util = { version = "0.3.28", optional = true }
+fnv = "1.0.7"
[target.'cfg(windows)'.dependencies]
winapi = "0.3.9"
@@ -74,15 +73,14 @@ matches = "0.1.9"
pretty_assertions = "1.2.1"
proptest = "1.0.0"
test-log = "0.2.10"
-env_logger = "0.10.0"
futures = "0.3.21"
paste = "1.0.11"
more-asserts = "0.3.1"
rand_distr = "0.4.3"
+time = { version = "0.3.10", features = ["serde-well-known", "macros"] }
[target.'cfg(not(windows))'.dev-dependencies]
-criterion = "0.5"
+criterion = { version = "0.5", default-features = false }
-pprof = { git = "https://github.com/PSeitz/pprof-rs/", rev = "53af24b", features = ["flamegraph", "criterion"] } # temp fork that works with criterion 0.5
[dev-dependencies.fail]
version = "0.5.0"
@@ -115,6 +113,11 @@ unstable = [] # useful for benches.
quickwit = ["sstable", "futures-util"]
+# Compares only the hash of a string when indexing data.
+# Increases indexing speed, but may lead to extremely rare missing terms, when there's a hash collision.
+# Uses 64bit ahash.
+compare_hash_only = ["stacker/compare_hash_only"]
[workspace]
members = ["query-grammar", "bitpacker", "common", "ownedbytes", "stacker", "sstable", "tokenizer-api", "columnar"]
@@ -128,7 +131,7 @@ members = ["query-grammar", "bitpacker", "common", "ownedbytes", "stacker", "sst
[[test]]
name = "failpoints"
path = "tests/failpoints/mod.rs"
-required-features = ["fail/failpoints"]
+required-features = ["failpoints"]
[[bench]]
name = "analyzer"

View File

@@ -5,19 +5,18 @@
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Crates.io](https://img.shields.io/crates/v/tantivy.svg)](https://crates.io/crates/tantivy)
-![Tantivy](https://tantivy-search.github.io/logo/tantivy-logo.png)
+<img src="https://tantivy-search.github.io/logo/tantivy-logo.png" alt="Tantivy, the fastest full-text search engine library written in Rust" height="250">
-**Tantivy** is a **full-text search engine library** written in Rust.
+## Fast full-text search engine library written in Rust
-It is closer to [Apache Lucene](https://lucene.apache.org/) than to [Elasticsearch](https://www.elastic.co/products/elasticsearch) or [Apache Solr](https://lucene.apache.org/solr/) in the sense it is not
-an off-the-shelf search engine server, but rather a crate that can be used
-to build such a search engine.
+**If you are looking for an alternative to Elasticsearch or Apache Solr, check out [Quickwit](https://github.com/quickwit-oss/quickwit), our distributed search engine built on top of Tantivy.**
+Tantivy is closer to [Apache Lucene](https://lucene.apache.org/) than to [Elasticsearch](https://www.elastic.co/products/elasticsearch) or [Apache Solr](https://lucene.apache.org/solr/) in the sense it is not
+an off-the-shelf search engine server, but rather a crate that can be used to build such a search engine.
Tantivy is, in fact, strongly inspired by Lucene's design.
-If you are looking for an alternative to Elasticsearch or Apache Solr, check out [Quickwit](https://github.com/quickwit-oss/quickwit), our search engine built on top of Tantivy.
-# Benchmark
+## Benchmark
The following [benchmark](https://tantivy-search.github.io/bench/) breakdowns
performance for different types of queries/collections.
@@ -28,7 +27,7 @@ Your mileage WILL vary depending on the nature of queries and their load.
Details about the benchmark can be found at this [repository](https://github.com/quickwit-oss/search-benchmark-game).
-# Features
+## Features
- Full-text search
- Configurable tokenizer (stemming available for 17 Latin languages) with third party support for Chinese ([tantivy-jieba](https://crates.io/crates/tantivy-jieba) and [cang-jie](https://crates.io/crates/cang-jie)), Japanese ([lindera](https://github.com/lindera-morphology/lindera-tantivy), [Vaporetto](https://crates.io/crates/vaporetto_tantivy), and [tantivy-tokenizer-tiny-segmenter](https://crates.io/crates/tantivy-tokenizer-tiny-segmenter)) and Korean ([lindera](https://github.com/lindera-morphology/lindera-tantivy) + [lindera-ko-dic-builder](https://github.com/lindera-morphology/lindera-ko-dic-builder))
@@ -54,11 +53,11 @@ Details about the benchmark can be found at this [repository](https://github.com
- Searcher Warmer API
- Cheesy logo with a horse
-## Non-features
+### Non-features
Distributed search is out of the scope of Tantivy, but if you are looking for this feature, check out [Quickwit](https://github.com/quickwit-oss/quickwit/).
-# Getting started
+## Getting started
Tantivy works on stable Rust and supports Linux, macOS, and Windows.
@@ -68,7 +67,7 @@ index documents, and search via the CLI or a small server with a REST API.
It walks you through getting a Wikipedia search engine up and running in a few minutes.
- [Reference doc for the last released version](https://docs.rs/tantivy/)
-# How can I support this project?
+## How can I support this project?
There are many ways to support this project.
@@ -79,16 +78,16 @@ There are many ways to support this project.
- Contribute code (you can join [our Discord server](https://discord.gg/MT27AG5EVE))
- Talk about Tantivy around you
-# Contributing code
+## Contributing code
We use the GitHub Pull Request workflow: reference a GitHub ticket and/or include a comprehensive commit message when opening a PR.
Feel free to update CHANGELOG.md with your contribution.
-## Tokenizer
+### Tokenizer
When implementing a tokenizer for tantivy depend on the `tantivy-tokenizer-api` crate.
-## Clone and build locally
+### Clone and build locally
Tantivy compiles on stable Rust.
To check out and run tests, you can simply run:
@@ -99,7 +98,7 @@ cd tantivy
cargo test
```
-# Companies Using Tantivy
+## Companies Using Tantivy
<p align="left">
<img align="center" src="doc/assets/images/etsy.png" alt="Etsy" height="25" width="auto" />&nbsp;
@@ -111,7 +110,7 @@ cargo test
<img align="center" src="doc/assets/images/element-dark-theme.png#gh-dark-mode-only" alt="Element.io" height="25" width="auto" />
</p>
-# FAQ
+## FAQ
### Can I use Tantivy in other languages?

View File

@@ -1,14 +1,99 @@
use criterion::{criterion_group, criterion_main, Criterion, Throughput}; use criterion::{criterion_group, criterion_main, BatchSize, Bencher, Criterion, Throughput};
use pprof::criterion::{Output, PProfProfiler}; use tantivy::schema::{TantivyDocument, FAST, INDEXED, STORED, STRING, TEXT};
use tantivy::schema::{FAST, INDEXED, STORED, STRING, TEXT}; use tantivy::{tokenizer, Index, IndexWriter};
use tantivy::Index;
const HDFS_LOGS: &str = include_str!("hdfs.json"); const HDFS_LOGS: &str = include_str!("hdfs.json");
const GH_LOGS: &str = include_str!("gh.json"); const GH_LOGS: &str = include_str!("gh.json");
const WIKI: &str = include_str!("wiki.json"); const WIKI: &str = include_str!("wiki.json");
fn get_lines(input: &str) -> Vec<&str> { fn benchmark(
input.trim().split('\n').collect() b: &mut Bencher,
input: &str,
schema: tantivy::schema::Schema,
commit: bool,
parse_json: bool,
is_dynamic: bool,
) {
if is_dynamic {
benchmark_dynamic_json(b, input, schema, commit, parse_json)
} else {
_benchmark(b, input, schema, commit, parse_json, |schema, doc_json| {
TantivyDocument::parse_json(&schema, doc_json).unwrap()
})
}
}
fn get_index(schema: tantivy::schema::Schema) -> Index {
let mut index = Index::create_in_ram(schema.clone());
let ff_tokenizer_manager = tokenizer::TokenizerManager::default();
ff_tokenizer_manager.register(
"raw",
tokenizer::TextAnalyzer::builder(tokenizer::RawTokenizer::default())
.filter(tokenizer::RemoveLongFilter::limit(255))
.build(),
);
index.set_fast_field_tokenizers(ff_tokenizer_manager.clone());
index
}
fn _benchmark(
b: &mut Bencher,
input: &str,
schema: tantivy::schema::Schema,
commit: bool,
include_json_parsing: bool,
create_doc: impl Fn(&tantivy::schema::Schema, &str) -> TantivyDocument,
) {
if include_json_parsing {
let lines: Vec<&str> = input.trim().split('\n').collect();
b.iter(|| {
let index = get_index(schema.clone());
let mut index_writer: IndexWriter =
index.writer_with_num_threads(1, 100_000_000).unwrap();
for doc_json in &lines {
let doc = create_doc(&schema, doc_json);
index_writer.add_document(doc).unwrap();
}
if commit {
index_writer.commit().unwrap();
}
})
} else {
let docs: Vec<_> = input
.trim()
.split('\n')
.map(|doc_json| create_doc(&schema, doc_json))
.collect();
b.iter_batched(
|| docs.clone(),
|docs| {
let index = get_index(schema.clone());
let mut index_writer: IndexWriter =
index.writer_with_num_threads(1, 100_000_000).unwrap();
for doc in docs {
index_writer.add_document(doc).unwrap();
}
if commit {
index_writer.commit().unwrap();
}
},
BatchSize::SmallInput,
)
}
}
fn benchmark_dynamic_json(
b: &mut Bencher,
input: &str,
schema: tantivy::schema::Schema,
commit: bool,
parse_json: bool,
) {
let json_field = schema.get_field("json").unwrap();
_benchmark(b, input, schema, commit, parse_json, |_schema, doc_json| {
let json_val: serde_json::Map<String, serde_json::Value> =
serde_json::from_str(doc_json).unwrap();
tantivy::doc!(json_field=>json_val)
})
} }
pub fn hdfs_index_benchmark(c: &mut Criterion) { pub fn hdfs_index_benchmark(c: &mut Criterion) {
@@ -19,7 +104,14 @@ pub fn hdfs_index_benchmark(c: &mut Criterion) {
schema_builder.add_text_field("severity", STRING); schema_builder.add_text_field("severity", STRING);
schema_builder.build() schema_builder.build()
}; };
let schema_with_store = { let schema_only_fast = {
let mut schema_builder = tantivy::schema::SchemaBuilder::new();
schema_builder.add_u64_field("timestamp", FAST);
schema_builder.add_text_field("body", FAST);
schema_builder.add_text_field("severity", FAST);
schema_builder.build()
};
let _schema_with_store = {
let mut schema_builder = tantivy::schema::SchemaBuilder::new(); let mut schema_builder = tantivy::schema::SchemaBuilder::new();
schema_builder.add_u64_field("timestamp", INDEXED | STORED); schema_builder.add_u64_field("timestamp", INDEXED | STORED);
schema_builder.add_text_field("body", TEXT | STORED); schema_builder.add_text_field("body", TEXT | STORED);
@@ -28,74 +120,39 @@ pub fn hdfs_index_benchmark(c: &mut Criterion) {
}; };
let dynamic_schema = { let dynamic_schema = {
let mut schema_builder = tantivy::schema::SchemaBuilder::new(); let mut schema_builder = tantivy::schema::SchemaBuilder::new();
schema_builder.add_json_field("json", TEXT); schema_builder.add_json_field("json", TEXT | FAST);
schema_builder.build() schema_builder.build()
}; };
let mut group = c.benchmark_group("index-hdfs"); let mut group = c.benchmark_group("index-hdfs");
group.throughput(Throughput::Bytes(HDFS_LOGS.len() as u64)); group.throughput(Throughput::Bytes(HDFS_LOGS.len() as u64));
group.sample_size(20); group.sample_size(20);
group.bench_function("index-hdfs-no-commit", |b| {
let lines = get_lines(HDFS_LOGS); let benches = [
b.iter(|| { ("only-indexed-".to_string(), schema, false),
let index = Index::create_in_ram(schema.clone()); //("stored-".to_string(), _schema_with_store, false),
let index_writer = index.writer_with_num_threads(1, 100_000_000).unwrap(); ("only-fast-".to_string(), schema_only_fast, false),
for doc_json in &lines { ("dynamic-".to_string(), dynamic_schema, true),
let doc = schema.parse_document(doc_json).unwrap(); ];
index_writer.add_document(doc).unwrap();
for (prefix, schema, is_dynamic) in benches {
for commit in [false, true] {
let suffix = if commit { "with-commit" } else { "no-commit" };
for parse_json in [false] {
// for parse_json in [false, true] {
let suffix = if parse_json {
format!("{}-with-json-parsing", suffix)
} else {
format!("{}", suffix)
};
let bench_name = format!("{}{}", prefix, suffix);
group.bench_function(bench_name, |b| {
benchmark(b, HDFS_LOGS, schema.clone(), commit, parse_json, is_dynamic)
});
} }
}) }
}); }
group.bench_function("index-hdfs-with-commit", |b| {
let lines = get_lines(HDFS_LOGS);
b.iter(|| {
let index = Index::create_in_ram(schema.clone());
let mut index_writer = index.writer_with_num_threads(1, 100_000_000).unwrap();
for doc_json in &lines {
let doc = schema.parse_document(doc_json).unwrap();
index_writer.add_document(doc).unwrap();
}
index_writer.commit().unwrap();
})
});
group.bench_function("index-hdfs-no-commit-with-docstore", |b| {
let lines = get_lines(HDFS_LOGS);
b.iter(|| {
let index = Index::create_in_ram(schema_with_store.clone());
let index_writer = index.writer_with_num_threads(1, 100_000_000).unwrap();
for doc_json in &lines {
let doc = schema.parse_document(doc_json).unwrap();
index_writer.add_document(doc).unwrap();
}
})
});
group.bench_function("index-hdfs-with-commit-with-docstore", |b| {
let lines = get_lines(HDFS_LOGS);
b.iter(|| {
let index = Index::create_in_ram(schema_with_store.clone());
let mut index_writer = index.writer_with_num_threads(1, 100_000_000).unwrap();
for doc_json in &lines {
let doc = schema.parse_document(doc_json).unwrap();
index_writer.add_document(doc).unwrap();
}
index_writer.commit().unwrap();
})
});
group.bench_function("index-hdfs-no-commit-json-without-docstore", |b| {
let lines = get_lines(HDFS_LOGS);
b.iter(|| {
let index = Index::create_in_ram(dynamic_schema.clone());
let json_field = dynamic_schema.get_field("json").unwrap();
let mut index_writer = index.writer_with_num_threads(1, 100_000_000).unwrap();
for doc_json in &lines {
let json_val: serde_json::Map<String, serde_json::Value> =
serde_json::from_str(doc_json).unwrap();
let doc = tantivy::doc!(json_field=>json_val);
index_writer.add_document(doc).unwrap();
}
index_writer.commit().unwrap();
})
});
} }
pub fn gh_index_benchmark(c: &mut Criterion) { pub fn gh_index_benchmark(c: &mut Criterion) {
@@ -104,38 +161,24 @@ pub fn gh_index_benchmark(c: &mut Criterion) {
schema_builder.add_json_field("json", TEXT | FAST); schema_builder.add_json_field("json", TEXT | FAST);
schema_builder.build() schema_builder.build()
}; };
let dynamic_schema_fast = {
let mut schema_builder = tantivy::schema::SchemaBuilder::new();
schema_builder.add_json_field("json", FAST);
schema_builder.build()
};
let mut group = c.benchmark_group("index-gh"); let mut group = c.benchmark_group("index-gh");
group.throughput(Throughput::Bytes(GH_LOGS.len() as u64)); group.throughput(Throughput::Bytes(GH_LOGS.len() as u64));
group.bench_function("index-gh-no-commit", |b| { group.bench_function("index-gh-no-commit", |b| {
let lines = get_lines(GH_LOGS); benchmark_dynamic_json(b, GH_LOGS, dynamic_schema.clone(), false, false)
b.iter(|| {
let json_field = dynamic_schema.get_field("json").unwrap();
let index = Index::create_in_ram(dynamic_schema.clone());
let index_writer = index.writer_with_num_threads(1, 100_000_000).unwrap();
for doc_json in &lines {
let json_val: serde_json::Map<String, serde_json::Value> =
serde_json::from_str(doc_json).unwrap();
let doc = tantivy::doc!(json_field=>json_val);
index_writer.add_document(doc).unwrap();
}
})
}); });
group.bench_function("index-gh-with-commit", |b| { group.bench_function("index-gh-fast", |b| {
let lines = get_lines(GH_LOGS); benchmark_dynamic_json(b, GH_LOGS, dynamic_schema_fast.clone(), false, false)
b.iter(|| { });
let json_field = dynamic_schema.get_field("json").unwrap();
let index = Index::create_in_ram(dynamic_schema.clone()); group.bench_function("index-gh-fast-with-commit", |b| {
let mut index_writer = index.writer_with_num_threads(1, 100_000_000).unwrap(); benchmark_dynamic_json(b, GH_LOGS, dynamic_schema_fast.clone(), true, false)
for doc_json in &lines {
let json_val: serde_json::Map<String, serde_json::Value> =
serde_json::from_str(doc_json).unwrap();
let doc = tantivy::doc!(json_field=>json_val);
index_writer.add_document(doc).unwrap();
}
index_writer.commit().unwrap();
})
}); });
} }
@@ -150,33 +193,10 @@ pub fn wiki_index_benchmark(c: &mut Criterion) {
group.throughput(Throughput::Bytes(WIKI.len() as u64)); group.throughput(Throughput::Bytes(WIKI.len() as u64));
group.bench_function("index-wiki-no-commit", |b| { group.bench_function("index-wiki-no-commit", |b| {
let lines = get_lines(WIKI); benchmark_dynamic_json(b, WIKI, dynamic_schema.clone(), false, false)
b.iter(|| {
let json_field = dynamic_schema.get_field("json").unwrap();
let index = Index::create_in_ram(dynamic_schema.clone());
let index_writer = index.writer_with_num_threads(1, 100_000_000).unwrap();
for doc_json in &lines {
let json_val: serde_json::Map<String, serde_json::Value> =
serde_json::from_str(doc_json).unwrap();
let doc = tantivy::doc!(json_field=>json_val);
index_writer.add_document(doc).unwrap();
}
})
}); });
group.bench_function("index-wiki-with-commit", |b| { group.bench_function("index-wiki-with-commit", |b| {
let lines = get_lines(WIKI); benchmark_dynamic_json(b, WIKI, dynamic_schema.clone(), true, false)
b.iter(|| {
let json_field = dynamic_schema.get_field("json").unwrap();
let index = Index::create_in_ram(dynamic_schema.clone());
let mut index_writer = index.writer_with_num_threads(1, 100_000_000).unwrap();
for doc_json in &lines {
let json_val: serde_json::Map<String, serde_json::Value> =
serde_json::from_str(doc_json).unwrap();
let doc = tantivy::doc!(json_field=>json_val);
index_writer.add_document(doc).unwrap();
}
index_writer.commit().unwrap();
})
}); });
} }
@@ -187,12 +207,12 @@ criterion_group! {
} }
criterion_group! { criterion_group! {
name = gh_benches; name = gh_benches;
config = Criterion::default().with_profiler(PProfProfiler::new(100, Output::Flamegraph(None))); config = Criterion::default();
targets = gh_index_benchmark targets = gh_index_benchmark
} }
criterion_group! { criterion_group! {
name = wiki_benches; name = wiki_benches;
config = Criterion::default().with_profiler(PProfProfiler::new(100, Output::Flamegraph(None))); config = Criterion::default();
targets = wiki_index_benchmark targets = wiki_index_benchmark
} }
criterion_main!(benches, gh_benches, wiki_benches); criterion_main!(benches, gh_benches, wiki_benches);

View File

@@ -15,7 +15,7 @@ homepage = "https://github.com/quickwit-oss/tantivy"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
-bitpacking = {version="0.8", default-features=false, features = ["bitpacker1x"]}
+bitpacking = { version = "0.9.2", default-features = false, features = ["bitpacker1x"] }
[dev-dependencies]
rand = "0.8"

View File

@@ -125,6 +125,8 @@ impl BitUnpacker {
// Decodes the range of bitpacked `u32` values with idx
// in [start_idx, start_idx + output.len()).
+// It is guaranteed to completely fill `output` and not read from it, so passing a vector with
+// un-initialized values is safe.
//
// #Panics
//
@@ -237,7 +239,19 @@ impl BitUnpacker {
data: &[u8],
positions: &mut Vec<u32>,
) {
-positions.resize(id_range.len(), 0u32);
+// We use the code below instead of positions.resize(id_range.len(), 0u32) for performance
+// reasons: on some queries, the CPU cost of memsetting the array and of using a bigger
+// vector than necessary is noticeable (~5%).
+// In particular, searches are a few percent faster when using reserve_exact() as below
+// instead of reserve().
+// The un-initialized values are safe as get_batch_u32s() completely fills `positions`
+// and does not read from it.
+positions.clear();
+positions.reserve_exact(id_range.len());
+#[allow(clippy::uninit_vec)]
+unsafe {
+positions.set_len(id_range.len());
+}
self.get_batch_u32s(id_range.start, data, positions);
crate::filter_vec::filter_vec_in_place(value_range, id_range.start, positions)
}
@@ -367,7 +381,7 @@ mod test {
let mut output: Vec<u32> = Vec::new();
for len in [0, 1, 2, 32, 33, 34, 64] {
for start_idx in 0u32..32u32 {
-output.resize(len as usize, 0);
+output.resize(len, 0);
bitunpacker.get_batch_u32s(start_idx, &buffer, &mut output);
for i in 0..len {
let expected = (start_idx + i as u32) & mask;

View File

@@ -9,8 +9,7 @@ description = "column oriented storage for tantivy"
categories = ["database-implementations", "data-structures", "compression"]
[dependencies]
-itertools = "0.11.0"
+itertools = "0.12.0"
-fnv = "1.0.7"
fastdivide = "0.4.0"
stacker = { version= "0.2", path = "../stacker", package="tantivy-stacker"}

View File

@@ -8,7 +8,6 @@ license = "MIT"
columnar = {path="../", package="tantivy-columnar"}
serde_json = "1"
serde_json_borrow = {git="https://github.com/PSeitz/serde_json_borrow/"}
-serde = "1"
[workspace]
members = []

View File

@@ -111,10 +111,7 @@ fn stack_multivalued_indexes<'a>(
let mut last_row_id = 0;
let mut current_it = multivalued_indexes.next();
Box::new(std::iter::from_fn(move || loop {
-let Some(multivalued_index) = current_it.as_mut() else {
-return None;
-};
-if let Some(row_id) = multivalued_index.next() {
+if let Some(row_id) = current_it.as_mut()?.next() {
last_row_id = offset + row_id;
return Some(last_row_id);
}


@@ -1,3 +1,8 @@
//! # `column_index`
//!
//! `column_index` provides rank and select operations to associate positions when not all
//! documents have exactly one element.
mod merge; mod merge;
mod multivalued_index; mod multivalued_index;
mod optional_index; mod optional_index;
@@ -41,10 +46,10 @@ impl ColumnIndex {
pub fn is_multivalue(&self) -> bool { pub fn is_multivalue(&self) -> bool {
matches!(self, ColumnIndex::Multivalued(_)) matches!(self, ColumnIndex::Multivalued(_))
} }
// Returns the cardinality of the column index. /// Returns the cardinality of the column index.
// ///
// By convention, if the column contains no docs, we consider that it is /// By convention, if the column contains no docs, we consider that it is
// full. /// full.
#[inline] #[inline]
pub fn get_cardinality(&self) -> Cardinality { pub fn get_cardinality(&self) -> Cardinality {
match self { match self {
@@ -121,18 +126,18 @@ impl ColumnIndex {
} }
} }
pub fn docid_range_to_rowids(&self, doc_id: Range<DocId>) -> Range<RowId> { pub fn docid_range_to_rowids(&self, doc_id_range: Range<DocId>) -> Range<RowId> {
match self { match self {
ColumnIndex::Empty { .. } => 0..0, ColumnIndex::Empty { .. } => 0..0,
ColumnIndex::Full => doc_id, ColumnIndex::Full => doc_id_range,
ColumnIndex::Optional(optional_index) => { ColumnIndex::Optional(optional_index) => {
let row_start = optional_index.rank(doc_id.start); let row_start = optional_index.rank(doc_id_range.start);
let row_end = optional_index.rank(doc_id.end); let row_end = optional_index.rank(doc_id_range.end);
row_start..row_end row_start..row_end
} }
ColumnIndex::Multivalued(multivalued_index) => { ColumnIndex::Multivalued(multivalued_index) => {
let end_docid = doc_id.end.min(multivalued_index.num_docs() - 1) + 1; let end_docid = doc_id_range.end.min(multivalued_index.num_docs() - 1) + 1;
let start_docid = doc_id.start.min(end_docid); let start_docid = doc_id_range.start.min(end_docid);
let row_start = multivalued_index.start_index_column.get_val(start_docid); let row_start = multivalued_index.start_index_column.get_val(start_docid);
let row_end = multivalued_index.start_index_column.get_val(end_docid); let row_end = multivalued_index.start_index_column.get_val(end_docid);
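A hedged model of the Optional branch above, with a generic closure standing in for optional_index.rank() (illustrative only, not the crate's code):

// With rank(d) = number of non-null docs strictly before doc d, a doc-id range
// maps to the row-id range covered by its non-null docs.
fn optional_docid_range_to_rowids(
    rank: impl Fn(u32) -> u32,
    doc_id_range: std::ops::Range<u32>,
) -> std::ops::Range<u32> {
    rank(doc_id_range.start)..rank(doc_id_range.end)
}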


@@ -21,8 +21,6 @@ const DENSE_BLOCK_THRESHOLD: u32 =
const ELEMENTS_PER_BLOCK: u32 = u16::MAX as u32 + 1; const ELEMENTS_PER_BLOCK: u32 = u16::MAX as u32 + 1;
const BLOCK_SIZE: RowId = 1 << 16;
#[derive(Copy, Clone, Debug)] #[derive(Copy, Clone, Debug)]
struct BlockMeta { struct BlockMeta {
non_null_rows_before_block: u32, non_null_rows_before_block: u32,
@@ -109,8 +107,8 @@ struct RowAddr {
#[inline(always)] #[inline(always)]
fn row_addr_from_row_id(row_id: RowId) -> RowAddr { fn row_addr_from_row_id(row_id: RowId) -> RowAddr {
RowAddr { RowAddr {
block_id: (row_id / BLOCK_SIZE) as u16, block_id: (row_id / ELEMENTS_PER_BLOCK) as u16,
in_block_row_id: (row_id % BLOCK_SIZE) as u16, in_block_row_id: (row_id % ELEMENTS_PER_BLOCK) as u16,
} }
} }
@@ -185,8 +183,13 @@ impl Set<RowId> for OptionalIndex {
} }
} }
/// Any value doc_id is allowed.
/// In particular, doc_id = num_rows.
#[inline] #[inline]
fn rank(&self, doc_id: DocId) -> RowId { fn rank(&self, doc_id: DocId) -> RowId {
if doc_id >= self.num_docs() {
return self.num_non_nulls();
}
let RowAddr { let RowAddr {
block_id, block_id,
in_block_row_id, in_block_row_id,
@@ -200,13 +203,15 @@ impl Set<RowId> for OptionalIndex {
block_meta.non_null_rows_before_block + block_offset_row_id block_meta.non_null_rows_before_block + block_offset_row_id
} }
/// Any value doc_id is allowed.
/// In particular, doc_id = num_rows.
#[inline] #[inline]
fn rank_if_exists(&self, doc_id: DocId) -> Option<RowId> { fn rank_if_exists(&self, doc_id: DocId) -> Option<RowId> {
let RowAddr { let RowAddr {
block_id, block_id,
in_block_row_id, in_block_row_id,
} = row_addr_from_row_id(doc_id); } = row_addr_from_row_id(doc_id);
let block_meta = self.block_metas[block_id as usize]; let block_meta = *self.block_metas.get(block_id as usize)?;
let block = self.block(block_meta); let block = self.block(block_meta);
let block_offset_row_id = match block { let block_offset_row_id = match block {
Block::Dense(dense_block) => dense_block.rank_if_exists(in_block_row_id), Block::Dense(dense_block) => dense_block.rank_if_exists(in_block_row_id),
@@ -491,7 +496,7 @@ fn deserialize_optional_index_block_metadatas(
non_null_rows_before_block += num_non_null_rows; non_null_rows_before_block += num_non_null_rows;
} }
block_metas.resize( block_metas.resize(
((num_rows + BLOCK_SIZE - 1) / BLOCK_SIZE) as usize, ((num_rows + ELEMENTS_PER_BLOCK - 1) / ELEMENTS_PER_BLOCK) as usize,
BlockMeta { BlockMeta {
non_null_rows_before_block, non_null_rows_before_block,
start_byte_offset, start_byte_offset,
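A small hypothetical model of the clamped rank() contract introduced above, with a plain sorted list standing in for the block-based index:

// rank(doc_id) counts non-null doc ids strictly below doc_id; any doc_id >= num_docs
// (including doc_id == num_docs, the bug-2293 case) saturates to the total count.
fn rank_model(non_null_doc_ids: &[u32], num_docs: u32, doc_id: u32) -> u32 {
    if doc_id >= num_docs {
        return non_null_doc_ids.len() as u32;
    }
    non_null_doc_ids.partition_point(|&d| d < doc_id) as u32
}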


@@ -39,7 +39,8 @@ pub trait Set<T> {
/// ///
/// # Panics /// # Panics
/// ///
/// May panic if rank is greater than the number of elements in the Set. /// May panic if rank is greater than or equal to the number of
/// elements in the Set.
fn select(&self, rank: T) -> T; fn select(&self, rank: T) -> T;
/// Creates a brand new select cursor. /// Creates a brand new select cursor.
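And a matching hypothetical model of select(), reflecting the tightened panic contract (the rank must be strictly below the number of elements):

// select(r) returns the r-th (0-based) non-null doc id; indexing panics when
// r >= non_null_doc_ids.len(), which is what the updated doc comment allows.
fn select_model(non_null_doc_ids: &[u32], r: u32) -> u32 {
    non_null_doc_ids[r as usize]
}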


@@ -3,6 +3,30 @@ use proptest::strategy::Strategy;
use proptest::{prop_oneof, proptest}; use proptest::{prop_oneof, proptest};
use super::*; use super::*;
use crate::{ColumnarReader, ColumnarWriter, DynamicColumnHandle};
#[test]
fn test_optional_index_bug_2293() {
// tests for panic in docid_range_to_rowids for docid == num_docs
test_optional_index_with_num_docs(ELEMENTS_PER_BLOCK - 1);
test_optional_index_with_num_docs(ELEMENTS_PER_BLOCK);
test_optional_index_with_num_docs(ELEMENTS_PER_BLOCK + 1);
}
fn test_optional_index_with_num_docs(num_docs: u32) {
let mut dataframe_writer = ColumnarWriter::default();
dataframe_writer.record_numerical(100, "score", 80i64);
let mut buffer: Vec<u8> = Vec::new();
dataframe_writer
.serialize(num_docs, None, &mut buffer)
.unwrap();
let columnar = ColumnarReader::open(buffer).unwrap();
assert_eq!(columnar.num_columns(), 1);
let cols: Vec<DynamicColumnHandle> = columnar.read_columns("score").unwrap();
assert_eq!(cols.len(), 1);
let col = cols[0].open().unwrap();
col.column_index().docid_range_to_rowids(0..num_docs);
}
#[test] #[test]
fn test_dense_block_threshold() { fn test_dense_block_threshold() {
@@ -35,7 +59,7 @@ proptest! {
#[test] #[test]
fn test_with_random_sets_simple() { fn test_with_random_sets_simple() {
let vals = 10..BLOCK_SIZE * 2; let vals = 10..ELEMENTS_PER_BLOCK * 2;
let mut out: Vec<u8> = Vec::new(); let mut out: Vec<u8> = Vec::new();
serialize_optional_index(&vals, 100, &mut out).unwrap(); serialize_optional_index(&vals, 100, &mut out).unwrap();
let null_index = open_optional_index(OwnedBytes::new(out)).unwrap(); let null_index = open_optional_index(OwnedBytes::new(out)).unwrap();
@@ -171,7 +195,7 @@ fn test_optional_index_rank() {
test_optional_index_rank_aux(&[0u32, 1u32]); test_optional_index_rank_aux(&[0u32, 1u32]);
let mut block = Vec::new(); let mut block = Vec::new();
block.push(3u32); block.push(3u32);
block.extend((0..BLOCK_SIZE).map(|i| i + BLOCK_SIZE + 1)); block.extend((0..ELEMENTS_PER_BLOCK).map(|i| i + ELEMENTS_PER_BLOCK + 1));
test_optional_index_rank_aux(&block); test_optional_index_rank_aux(&block);
} }
@@ -185,8 +209,8 @@ fn test_optional_index_iter_empty_one() {
fn test_optional_index_iter_dense_block() { fn test_optional_index_iter_dense_block() {
let mut block = Vec::new(); let mut block = Vec::new();
block.push(3u32); block.push(3u32);
block.extend((0..BLOCK_SIZE).map(|i| i + BLOCK_SIZE + 1)); block.extend((0..ELEMENTS_PER_BLOCK).map(|i| i + ELEMENTS_PER_BLOCK + 1));
test_optional_index_iter_aux(&block, 3 * BLOCK_SIZE); test_optional_index_iter_aux(&block, 3 * ELEMENTS_PER_BLOCK);
} }
#[test] #[test]
@@ -215,12 +239,12 @@ mod bench {
let vals: Vec<RowId> = (0..TOTAL_NUM_VALUES) let vals: Vec<RowId> = (0..TOTAL_NUM_VALUES)
.map(|_| rng.gen_bool(fill_ratio)) .map(|_| rng.gen_bool(fill_ratio))
.enumerate() .enumerate()
.filter(|(pos, val)| *val) .filter(|(_pos, val)| *val)
.map(|(pos, _)| pos as RowId) .map(|(pos, _)| pos as RowId)
.collect(); .collect();
serialize_optional_index(&&vals[..], TOTAL_NUM_VALUES, &mut out).unwrap(); serialize_optional_index(&&vals[..], TOTAL_NUM_VALUES, &mut out).unwrap();
let codec = open_optional_index(OwnedBytes::new(out)).unwrap();
codec open_optional_index(OwnedBytes::new(out)).unwrap()
} }
fn random_range_iterator( fn random_range_iterator(
@@ -242,7 +266,7 @@ mod bench {
} }
fn n_percent_step_iterator(percent: f32, num_values: u32) -> impl Iterator<Item = u32> { fn n_percent_step_iterator(percent: f32, num_values: u32) -> impl Iterator<Item = u32> {
let ratio = percent as f32 / 100.0; let ratio = percent / 100.0;
let step_size = (1f32 / ratio) as u32; let step_size = (1f32 / ratio) as u32;
let deviation = step_size - 1; let deviation = step_size - 1;
random_range_iterator(0, num_values, step_size, deviation) random_range_iterator(0, num_values, step_size, deviation)


@@ -30,6 +30,7 @@ impl<'a> SerializableColumnIndex<'a> {
} }
} }
/// Serialize a column index.
pub fn serialize_column_index( pub fn serialize_column_index(
column_index: SerializableColumnIndex, column_index: SerializableColumnIndex,
output: &mut impl Write, output: &mut impl Write,
@@ -51,6 +52,7 @@ pub fn serialize_column_index(
Ok(column_index_num_bytes) Ok(column_index_num_bytes)
} }
/// Open a serialized column index.
pub fn open_column_index(mut bytes: OwnedBytes) -> io::Result<ColumnIndex> { pub fn open_column_index(mut bytes: OwnedBytes) -> io::Result<ColumnIndex> {
if bytes.is_empty() { if bytes.is_empty() {
return Err(io::Error::new( return Err(io::Error::new(


@@ -101,7 +101,7 @@ pub trait ColumnValues<T: PartialOrd = u64>: Send + Sync {
row_id_hits: &mut Vec<RowId>, row_id_hits: &mut Vec<RowId>,
) { ) {
let row_id_range = row_id_range.start..row_id_range.end.min(self.num_vals()); let row_id_range = row_id_range.start..row_id_range.end.min(self.num_vals());
for idx in row_id_range.start..row_id_range.end { for idx in row_id_range {
let val = self.get_val(idx); let val = self.get_val(idx);
if value_range.contains(&val) { if value_range.contains(&val) {
row_id_hits.push(idx); row_id_hits.push(idx);


@@ -269,7 +269,8 @@ impl StrOrBytesColumnWriter {
dictionaries: &mut [DictionaryBuilder], dictionaries: &mut [DictionaryBuilder],
arena: &mut MemoryArena, arena: &mut MemoryArena,
) { ) {
let unordered_id = dictionaries[self.dictionary_id as usize].get_or_allocate_id(bytes); let unordered_id =
dictionaries[self.dictionary_id as usize].get_or_allocate_id(bytes, arena);
self.column_writer.record(doc, unordered_id, arena); self.column_writer.record(doc, unordered_id, arena);
} }


@@ -338,7 +338,7 @@ impl ColumnarWriter {
let mut columns: Vec<(&[u8], ColumnType, Addr)> = self let mut columns: Vec<(&[u8], ColumnType, Addr)> = self
.numerical_field_hash_map .numerical_field_hash_map
.iter() .iter()
.map(|(column_name, addr, _)| { .map(|(column_name, addr)| {
let numerical_column_writer: NumericalColumnWriter = let numerical_column_writer: NumericalColumnWriter =
self.numerical_field_hash_map.read(addr); self.numerical_field_hash_map.read(addr);
let column_type = numerical_column_writer.numerical_type().into(); let column_type = numerical_column_writer.numerical_type().into();
@@ -348,27 +348,27 @@ impl ColumnarWriter {
columns.extend( columns.extend(
self.bytes_field_hash_map self.bytes_field_hash_map
.iter() .iter()
.map(|(term, addr, _)| (term, ColumnType::Bytes, addr)), .map(|(term, addr)| (term, ColumnType::Bytes, addr)),
); );
columns.extend( columns.extend(
self.str_field_hash_map self.str_field_hash_map
.iter() .iter()
.map(|(column_name, addr, _)| (column_name, ColumnType::Str, addr)), .map(|(column_name, addr)| (column_name, ColumnType::Str, addr)),
); );
columns.extend( columns.extend(
self.bool_field_hash_map self.bool_field_hash_map
.iter() .iter()
.map(|(column_name, addr, _)| (column_name, ColumnType::Bool, addr)), .map(|(column_name, addr)| (column_name, ColumnType::Bool, addr)),
); );
columns.extend( columns.extend(
self.ip_addr_field_hash_map self.ip_addr_field_hash_map
.iter() .iter()
.map(|(column_name, addr, _)| (column_name, ColumnType::IpAddr, addr)), .map(|(column_name, addr)| (column_name, ColumnType::IpAddr, addr)),
); );
columns.extend( columns.extend(
self.datetime_field_hash_map self.datetime_field_hash_map
.iter() .iter()
.map(|(column_name, addr, _)| (column_name, ColumnType::DateTime, addr)), .map(|(column_name, addr)| (column_name, ColumnType::DateTime, addr)),
); );
columns.sort_unstable_by_key(|(column_name, col_type, _)| (*column_name, *col_type)); columns.sort_unstable_by_key(|(column_name, col_type, _)| (*column_name, *col_type));
@@ -437,6 +437,7 @@ impl ColumnarWriter {
&mut symbol_byte_buffer, &mut symbol_byte_buffer,
), ),
buffers, buffers,
&self.arena,
&mut column_serializer, &mut column_serializer,
)?; )?;
column_serializer.finalize()?; column_serializer.finalize()?;
@@ -490,6 +491,7 @@ impl ColumnarWriter {
// Serialize [Dictionary, Column, dictionary num bytes U32::LE] // Serialize [Dictionary, Column, dictionary num bytes U32::LE]
// Column: [Column Index, Column Values, column index num bytes U32::LE] // Column: [Column Index, Column Values, column index num bytes U32::LE]
#[allow(clippy::too_many_arguments)]
fn serialize_bytes_or_str_column( fn serialize_bytes_or_str_column(
cardinality: Cardinality, cardinality: Cardinality,
num_docs: RowId, num_docs: RowId,
@@ -497,6 +499,7 @@ fn serialize_bytes_or_str_column(
dictionary_builder: &DictionaryBuilder, dictionary_builder: &DictionaryBuilder,
operation_it: impl Iterator<Item = ColumnOperation<UnorderedId>>, operation_it: impl Iterator<Item = ColumnOperation<UnorderedId>>,
buffers: &mut SpareBuffers, buffers: &mut SpareBuffers,
arena: &MemoryArena,
wrt: impl io::Write, wrt: impl io::Write,
) -> io::Result<()> { ) -> io::Result<()> {
let SpareBuffers { let SpareBuffers {
@@ -505,7 +508,8 @@ fn serialize_bytes_or_str_column(
.. ..
} = buffers; } = buffers;
let mut counting_writer = CountingWriter::wrap(wrt); let mut counting_writer = CountingWriter::wrap(wrt);
let term_id_mapping: TermIdMapping = dictionary_builder.serialize(&mut counting_writer)?; let term_id_mapping: TermIdMapping =
dictionary_builder.serialize(arena, &mut counting_writer)?;
let dictionary_num_bytes: u32 = counting_writer.written_bytes() as u32; let dictionary_num_bytes: u32 = counting_writer.written_bytes() as u32;
let mut wrt = counting_writer.finish(); let mut wrt = counting_writer.finish();
let operation_iterator = operation_it.map(|symbol: ColumnOperation<UnorderedId>| { let operation_iterator = operation_it.map(|symbol: ColumnOperation<UnorderedId>| {


@@ -1,7 +1,7 @@
use std::io; use std::io;
use fnv::FnvHashMap;
use sstable::SSTable; use sstable::SSTable;
use stacker::{MemoryArena, SharedArenaHashMap};
pub(crate) struct TermIdMapping { pub(crate) struct TermIdMapping {
unordered_to_ord: Vec<OrderedId>, unordered_to_ord: Vec<OrderedId>,
@@ -31,29 +31,38 @@ pub struct OrderedId(pub u32);
/// mapping. /// mapping.
#[derive(Default)] #[derive(Default)]
pub(crate) struct DictionaryBuilder { pub(crate) struct DictionaryBuilder {
dict: FnvHashMap<Vec<u8>, UnorderedId>, dict: SharedArenaHashMap,
memory_consumption: usize,
} }
impl DictionaryBuilder { impl DictionaryBuilder {
/// Get or allocate an unordered id. /// Get or allocate an unordered id.
/// (This ID is simply an auto-incremented id.) /// (This ID is simply an auto-incremented id.)
pub fn get_or_allocate_id(&mut self, term: &[u8]) -> UnorderedId { pub fn get_or_allocate_id(&mut self, term: &[u8], arena: &mut MemoryArena) -> UnorderedId {
if let Some(term_id) = self.dict.get(term) { let next_id = self.dict.len() as u32;
return *term_id; let unordered_id = self
} .dict
let new_id = UnorderedId(self.dict.len() as u32); .mutate_or_create(term, arena, |unordered_id: Option<u32>| {
self.dict.insert(term.to_vec(), new_id); if let Some(unordered_id) = unordered_id {
self.memory_consumption += term.len(); unordered_id
self.memory_consumption += 40; // Term Metadata + HashMap overhead } else {
new_id next_id
}
});
UnorderedId(unordered_id)
} }
/// Serialize the dictionary into an fst, and returns the /// Serialize the dictionary into an fst, and returns the
/// `UnorderedId -> TermOrdinal` map. /// `UnorderedId -> TermOrdinal` map.
pub fn serialize<'a, W: io::Write + 'a>(&self, wrt: &mut W) -> io::Result<TermIdMapping> { pub fn serialize<'a, W: io::Write + 'a>(
let mut terms: Vec<(&[u8], UnorderedId)> = &self,
self.dict.iter().map(|(k, v)| (k.as_slice(), *v)).collect(); arena: &MemoryArena,
wrt: &mut W,
) -> io::Result<TermIdMapping> {
let mut terms: Vec<(&[u8], UnorderedId)> = self
.dict
.iter(arena)
.map(|(k, v)| (k, arena.read(v)))
.collect();
terms.sort_unstable_by_key(|(key, _)| *key); terms.sort_unstable_by_key(|(key, _)| *key);
// TODO Remove the allocation. // TODO Remove the allocation.
let mut unordered_to_ord: Vec<OrderedId> = vec![OrderedId(0u32); terms.len()]; let mut unordered_to_ord: Vec<OrderedId> = vec![OrderedId(0u32); terms.len()];
@@ -68,7 +77,7 @@ impl DictionaryBuilder {
} }
pub(crate) fn mem_usage(&self) -> usize { pub(crate) fn mem_usage(&self) -> usize {
self.memory_consumption self.dict.mem_usage()
} }
} }
@@ -78,12 +87,13 @@ mod tests {
#[test] #[test]
fn test_dictionary_builder() { fn test_dictionary_builder() {
let mut arena = MemoryArena::default();
let mut dictionary_builder = DictionaryBuilder::default(); let mut dictionary_builder = DictionaryBuilder::default();
let hello_uid = dictionary_builder.get_or_allocate_id(b"hello"); let hello_uid = dictionary_builder.get_or_allocate_id(b"hello", &mut arena);
let happy_uid = dictionary_builder.get_or_allocate_id(b"happy"); let happy_uid = dictionary_builder.get_or_allocate_id(b"happy", &mut arena);
let tax_uid = dictionary_builder.get_or_allocate_id(b"tax"); let tax_uid = dictionary_builder.get_or_allocate_id(b"tax", &mut arena);
let mut buffer = Vec::new(); let mut buffer = Vec::new();
let id_mapping = dictionary_builder.serialize(&mut buffer).unwrap(); let id_mapping = dictionary_builder.serialize(&arena, &mut buffer).unwrap();
assert_eq!(id_mapping.to_ord(hello_uid), OrderedId(1)); assert_eq!(id_mapping.to_ord(hello_uid), OrderedId(1));
assert_eq!(id_mapping.to_ord(happy_uid), OrderedId(0)); assert_eq!(id_mapping.to_ord(happy_uid), OrderedId(0));
assert_eq!(id_mapping.to_ord(tax_uid), OrderedId(2)); assert_eq!(id_mapping.to_ord(tax_uid), OrderedId(2));
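Stepping back from the hashmap swap, a minimal illustrative sketch (not the crate's code) of how serialize() derives the UnorderedId -> OrderedId mapping: terms receive auto-incremented unordered ids at insertion time, then are re-numbered by sorted order:

fn build_unordered_to_ordered(terms_with_unordered_ids: &[(&[u8], u32)]) -> Vec<u32> {
    // Assumes unordered ids are exactly 0..len, as produced by the auto-increment above.
    let mut sorted: Vec<(&[u8], u32)> = terms_with_unordered_ids.to_vec();
    sorted.sort_unstable_by_key(|(term, _)| *term);
    let mut unordered_to_ord = vec![0u32; sorted.len()];
    for (ordered_id, (_term, unordered_id)) in sorted.iter().enumerate() {
        unordered_to_ord[*unordered_id as usize] = ordered_id as u32;
    }
    unordered_to_ord
}

For the test terms above (hello, happy, tax inserted in that order) this yields ordered ids 1, 0, 2, matching the assertions.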


@@ -1,3 +1,22 @@
//! # Tantivy-Columnar
//!
//! `tantivy-columnar` provides columnar storage for tantivy.
//! The crate allows for efficient read operations on specific columns rather than entire records.
//!
//! ## Overview
//!
//! - **columnar**: Reading, writing, and merging multiple columns:
//! - **[ColumnarWriter]**: Makes it possible to create a new columnar.
//! - **[ColumnarReader]**: The ColumnarReader makes it possible to access a set of columns
//! associated to field names.
//! - **[merge_columnar]**: Contains the functionalities to merge multiple ColumnarReader or
//! segments into a single one.
//!
//! - **column**: A single column, which contains
//! - [column_index]: Resolves the rows for a document id. Manages the cardinality of the
//! column.
//! - [column_values]: Stores the values of a column in a dense format.
#![cfg_attr(all(feature = "unstable", test), feature(test))] #![cfg_attr(all(feature = "unstable", test), feature(test))]
#[cfg(test)] #[cfg(test)]
@@ -12,7 +31,7 @@ use std::io;
mod block_accessor; mod block_accessor;
mod column; mod column;
mod column_index; pub mod column_index;
pub mod column_values; pub mod column_values;
mod columnar; mod columnar;
mod dictionary; mod dictionary;
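Based on the bug-2293 test earlier in this diff, a minimal usage sketch of the pieces the new crate docs point at (the `tantivy_columnar` crate path and these re-exports are assumed):

use tantivy_columnar::{ColumnarReader, ColumnarWriter, DynamicColumnHandle};

// Record one numerical value for doc 100, serialize 101 docs, then re-open the buffer.
fn roundtrip_columnar() {
    let mut writer = ColumnarWriter::default();
    writer.record_numerical(100, "score", 80i64);
    let mut buffer: Vec<u8> = Vec::new();
    writer.serialize(101, None, &mut buffer).unwrap();
    let columnar = ColumnarReader::open(buffer).unwrap();
    let cols: Vec<DynamicColumnHandle> = columnar.read_columns("score").unwrap();
    assert_eq!(cols.len(), 1);
}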


@@ -26,7 +26,7 @@ fn test_dataframe_writer_str() {
assert_eq!(columnar.num_columns(), 1); assert_eq!(columnar.num_columns(), 1);
let cols: Vec<DynamicColumnHandle> = columnar.read_columns("my_string").unwrap(); let cols: Vec<DynamicColumnHandle> = columnar.read_columns("my_string").unwrap();
assert_eq!(cols.len(), 1); assert_eq!(cols.len(), 1);
assert_eq!(cols[0].num_bytes(), 87); assert_eq!(cols[0].num_bytes(), 73);
} }
#[test] #[test]
@@ -40,7 +40,7 @@ fn test_dataframe_writer_bytes() {
assert_eq!(columnar.num_columns(), 1); assert_eq!(columnar.num_columns(), 1);
let cols: Vec<DynamicColumnHandle> = columnar.read_columns("my_string").unwrap(); let cols: Vec<DynamicColumnHandle> = columnar.read_columns("my_string").unwrap();
assert_eq!(cols.len(), 1); assert_eq!(cols.len(), 1);
assert_eq!(cols[0].num_bytes(), 87); assert_eq!(cols[0].num_bytes(), 73);
} }
#[test] #[test]
@@ -330,9 +330,9 @@ fn bytes_strategy() -> impl Strategy<Value = &'static [u8]> {
// A random column value // A random column value
fn column_value_strategy() -> impl Strategy<Value = ColumnValue> { fn column_value_strategy() -> impl Strategy<Value = ColumnValue> {
prop_oneof![ prop_oneof![
10 => string_strategy().prop_map(|s| ColumnValue::Str(s)), 10 => string_strategy().prop_map(ColumnValue::Str),
1 => bytes_strategy().prop_map(|b| ColumnValue::Bytes(b)), 1 => bytes_strategy().prop_map(ColumnValue::Bytes),
40 => num_strategy().prop_map(|n| ColumnValue::Numerical(n)), 40 => num_strategy().prop_map(ColumnValue::Numerical),
1 => (1u16..3u16).prop_map(|ip_addr_byte| ColumnValue::IpAddr(Ipv6Addr::new( 1 => (1u16..3u16).prop_map(|ip_addr_byte| ColumnValue::IpAddr(Ipv6Addr::new(
127, 127,
0, 0,
@@ -343,7 +343,7 @@ fn column_value_strategy() -> impl Strategy<Value = ColumnValue> {
0, 0,
ip_addr_byte ip_addr_byte
))), ))),
1 => any::<bool>().prop_map(|b| ColumnValue::Bool(b)), 1 => any::<bool>().prop_map(ColumnValue::Bool),
1 => (0_679_723_993i64..1_679_723_995i64) 1 => (0_679_723_993i64..1_679_723_995i64)
.prop_map(|val| { ColumnValue::DateTime(DateTime::from_timestamp_secs(val)) }) .prop_map(|val| { ColumnValue::DateTime(DateTime::from_timestamp_secs(val)) })
] ]
@@ -419,8 +419,8 @@ fn build_columnar_with_mapping(
columnar_writer columnar_writer
.serialize(num_docs, old_to_new_row_ids_opt, &mut buffer) .serialize(num_docs, old_to_new_row_ids_opt, &mut buffer)
.unwrap(); .unwrap();
let columnar_reader = ColumnarReader::open(buffer).unwrap();
columnar_reader ColumnarReader::open(buffer).unwrap()
} }
fn build_columnar(docs: &[Vec<(&'static str, ColumnValue)>]) -> ColumnarReader { fn build_columnar(docs: &[Vec<(&'static str, ColumnValue)>]) -> ColumnarReader {
@@ -746,7 +746,7 @@ proptest! {
let stack_merge_order = StackMergeOrder::stack(&columnar_readers_arr[..]).into(); let stack_merge_order = StackMergeOrder::stack(&columnar_readers_arr[..]).into();
crate::merge_columnar(&columnar_readers_arr[..], &[], stack_merge_order, &mut output).unwrap(); crate::merge_columnar(&columnar_readers_arr[..], &[], stack_merge_order, &mut output).unwrap();
let merged_columnar = ColumnarReader::open(output).unwrap(); let merged_columnar = ColumnarReader::open(output).unwrap();
let concat_rows: Vec<Vec<(&'static str, ColumnValue)>> = columnar_docs.iter().cloned().flatten().collect(); let concat_rows: Vec<Vec<(&'static str, ColumnValue)>> = columnar_docs.iter().flatten().cloned().collect();
let expected_merged_columnar = build_columnar(&concat_rows[..]); let expected_merged_columnar = build_columnar(&concat_rows[..]);
assert_columnar_eq_strict(&merged_columnar, &expected_merged_columnar); assert_columnar_eq_strict(&merged_columnar, &expected_merged_columnar);
} }
@@ -772,7 +772,7 @@ fn test_columnar_merging_empty_columnar() {
.unwrap(); .unwrap();
let merged_columnar = ColumnarReader::open(output).unwrap(); let merged_columnar = ColumnarReader::open(output).unwrap();
let concat_rows: Vec<Vec<(&'static str, ColumnValue)>> = let concat_rows: Vec<Vec<(&'static str, ColumnValue)>> =
columnar_docs.iter().cloned().flatten().collect(); columnar_docs.iter().flatten().cloned().collect();
let expected_merged_columnar = build_columnar(&concat_rows[..]); let expected_merged_columnar = build_columnar(&concat_rows[..]);
assert_columnar_eq_strict(&merged_columnar, &expected_merged_columnar); assert_columnar_eq_strict(&merged_columnar, &expected_merged_columnar);
} }
@@ -809,7 +809,7 @@ fn test_columnar_merging_number_columns() {
.unwrap(); .unwrap();
let merged_columnar = ColumnarReader::open(output).unwrap(); let merged_columnar = ColumnarReader::open(output).unwrap();
let concat_rows: Vec<Vec<(&'static str, ColumnValue)>> = let concat_rows: Vec<Vec<(&'static str, ColumnValue)>> =
columnar_docs.iter().cloned().flatten().collect(); columnar_docs.iter().flatten().cloned().collect();
let expected_merged_columnar = build_columnar(&concat_rows[..]); let expected_merged_columnar = build_columnar(&concat_rows[..]);
assert_columnar_eq_strict(&merged_columnar, &expected_merged_columnar); assert_columnar_eq_strict(&merged_columnar, &expected_merged_columnar);
} }


@@ -1,11 +1,14 @@
#![allow(deprecated)] #![allow(deprecated)]
use std::fmt; use std::fmt;
use std::io::{Read, Write};
use serde::{Deserialize, Serialize}; use serde::{Deserialize, Serialize};
use time::format_description::well_known::Rfc3339; use time::format_description::well_known::Rfc3339;
use time::{OffsetDateTime, PrimitiveDateTime, UtcOffset}; use time::{OffsetDateTime, PrimitiveDateTime, UtcOffset};
use crate::BinarySerializable;
/// Precision with which datetimes are truncated when stored in fast fields. This setting is only /// Precision with which datetimes are truncated when stored in fast fields. This setting is only
/// relevant for fast fields. In the docstore, datetimes are always saved with nanosecond precision. /// relevant for fast fields. In the docstore, datetimes are always saved with nanosecond precision.
#[derive( #[derive(
@@ -164,3 +167,15 @@ impl fmt::Debug for DateTime {
f.write_str(&utc_rfc3339) f.write_str(&utc_rfc3339)
} }
} }
impl BinarySerializable for DateTime {
fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> std::io::Result<()> {
let timestamp_micros = self.into_timestamp_micros();
<i64 as BinarySerializable>::serialize(&timestamp_micros, writer)
}
fn deserialize<R: Read>(reader: &mut R) -> std::io::Result<Self> {
let timestamp_micros = <i64 as BinarySerializable>::deserialize(reader)?;
Ok(Self::from_timestamp_micros(timestamp_micros))
}
}
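A hedged round-trip check for the impl above (the `tantivy_common` crate path and these exports are assumed from this diff; precision is microseconds, since the impl serializes into_timestamp_micros()):

use tantivy_common::{BinarySerializable, DateTime};

fn roundtrip_datetime() -> std::io::Result<()> {
    let dt = DateTime::from_timestamp_micros(1_700_000_000_123_456);
    let mut buf: Vec<u8> = Vec::new();
    dt.serialize(&mut buf)?;
    // &[u8] implements Read, so a shrinking slice works as the reader.
    let decoded = DateTime::deserialize(&mut &buf[..])?;
    assert_eq!(decoded.into_timestamp_micros(), dt.into_timestamp_micros());
    Ok(())
}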


@@ -0,0 +1,112 @@
use crate::replace_in_place;
/// Separates the different segments of a json path.
pub const JSON_PATH_SEGMENT_SEP: u8 = 1u8;
pub const JSON_PATH_SEGMENT_SEP_STR: &str =
unsafe { std::str::from_utf8_unchecked(&[JSON_PATH_SEGMENT_SEP]) };
/// `JsonPathWriter` builds flattened JSON paths for tantivy.
#[derive(Clone, Debug, Default)]
pub struct JsonPathWriter {
path: String,
indices: Vec<usize>,
expand_dots: bool,
}
impl JsonPathWriter {
pub fn new() -> Self {
JsonPathWriter {
path: String::new(),
indices: Vec::new(),
expand_dots: false,
}
}
/// When expand_dots is enabled, json object like
/// `{"k8s.node.id": 5}` is processed as if it was
/// `{"k8s": {"node": {"id": 5}}}`.
/// This option has the merit of allowing users to
/// write queries like `k8s.node.id:5`.
/// On the other hand, enabling that feature can lead to
/// ambiguity.
#[inline]
pub fn set_expand_dots(&mut self, expand_dots: bool) {
self.expand_dots = expand_dots;
}
/// Push a new segment to the path.
#[inline]
pub fn push(&mut self, segment: &str) {
let len_path = self.path.len();
self.indices.push(len_path);
if !self.path.is_empty() {
self.path.push_str(JSON_PATH_SEGMENT_SEP_STR);
}
self.path.push_str(segment);
if self.expand_dots {
// This might include the separation byte, which is ok because it is not a dot.
let appended_segment = &mut self.path[len_path..];
// The unsafe below is safe as long as b'.' and JSON_PATH_SEGMENT_SEP are
// valid single-byte utf8 strings.
// By utf-8 design, they cannot be part of another codepoint.
unsafe {
replace_in_place(b'.', JSON_PATH_SEGMENT_SEP, appended_segment.as_bytes_mut())
};
}
}
/// Remove the last segment. Does nothing if the path is empty.
#[inline]
pub fn pop(&mut self) {
if let Some(last_idx) = self.indices.pop() {
self.path.truncate(last_idx);
}
}
/// Clear the path.
#[inline]
pub fn clear(&mut self) {
self.path.clear();
self.indices.clear();
}
/// Get the current path.
#[inline]
pub fn as_str(&self) -> &str {
&self.path
}
}
impl From<JsonPathWriter> for String {
#[inline]
fn from(value: JsonPathWriter) -> Self {
value.path
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn json_path_writer_test() {
let mut writer = JsonPathWriter::new();
writer.push("root");
assert_eq!(writer.as_str(), "root");
writer.push("child");
assert_eq!(writer.as_str(), "root\u{1}child");
writer.pop();
assert_eq!(writer.as_str(), "root");
writer.push("k8s.node.id");
assert_eq!(writer.as_str(), "root\u{1}k8s.node.id");
writer.set_expand_dots(true);
writer.pop();
writer.push("k8s.node.id");
assert_eq!(writer.as_str(), "root\u{1}k8s\u{1}node\u{1}id");
}
}


@@ -9,6 +9,7 @@ mod byte_count;
mod datetime; mod datetime;
pub mod file_slice; pub mod file_slice;
mod group_by; mod group_by;
mod json_path_writer;
mod serialize; mod serialize;
mod vint; mod vint;
mod writer; mod writer;
@@ -18,6 +19,7 @@ pub use byte_count::ByteCount;
pub use datetime::DatePrecision; pub use datetime::DatePrecision;
pub use datetime::{DateTime, DateTimePrecision}; pub use datetime::{DateTime, DateTimePrecision};
pub use group_by::GroupByIteratorExtended; pub use group_by::GroupByIteratorExtended;
pub use json_path_writer::JsonPathWriter;
pub use ownedbytes::{OwnedBytes, StableDeref}; pub use ownedbytes::{OwnedBytes, StableDeref};
pub use serialize::{BinarySerializable, DeserializeFrom, FixedSize}; pub use serialize::{BinarySerializable, DeserializeFrom, FixedSize};
pub use vint::{ pub use vint::{
@@ -116,6 +118,7 @@ pub fn u64_to_f64(val: u64) -> f64 {
/// ///
/// This function assumes that the needle is rarely contained in the bytes string /// This function assumes that the needle is rarely contained in the bytes string
/// and offers a fast path if the needle is not present. /// and offers a fast path if the needle is not present.
#[inline]
pub fn replace_in_place(needle: u8, replacement: u8, bytes: &mut [u8]) { pub fn replace_in_place(needle: u8, replacement: u8, bytes: &mut [u8]) {
if !bytes.contains(&needle) { if !bytes.contains(&needle) {
return; return;


@@ -1,3 +1,4 @@
use std::borrow::Cow;
use std::io::{Read, Write}; use std::io::{Read, Write};
use std::{fmt, io}; use std::{fmt, io};
@@ -249,6 +250,43 @@ impl BinarySerializable for String {
} }
} }
impl<'a> BinarySerializable for Cow<'a, str> {
fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
let data: &[u8] = self.as_bytes();
VInt(data.len() as u64).serialize(writer)?;
writer.write_all(data)
}
fn deserialize<R: Read>(reader: &mut R) -> io::Result<Cow<'a, str>> {
let string_length = VInt::deserialize(reader)?.val() as usize;
let mut result = String::with_capacity(string_length);
reader
.take(string_length as u64)
.read_to_string(&mut result)?;
Ok(Cow::Owned(result))
}
}
impl<'a> BinarySerializable for Cow<'a, [u8]> {
fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
VInt(self.len() as u64).serialize(writer)?;
for it in self.iter() {
it.serialize(writer)?;
}
Ok(())
}
fn deserialize<R: Read>(reader: &mut R) -> io::Result<Cow<'a, [u8]>> {
let num_items = VInt::deserialize(reader)?.val();
let mut items: Vec<u8> = Vec::with_capacity(num_items as usize);
for _ in 0..num_items {
let item = u8::deserialize(reader)?;
items.push(item);
}
Ok(Cow::Owned(items))
}
}
#[cfg(test)] #[cfg(test)]
pub mod test { pub mod test {


@@ -12,7 +12,7 @@ use tantivy::aggregation::agg_result::AggregationResults;
use tantivy::aggregation::AggregationCollector; use tantivy::aggregation::AggregationCollector;
use tantivy::query::AllQuery; use tantivy::query::AllQuery;
use tantivy::schema::{self, IndexRecordOption, Schema, TextFieldIndexing, FAST}; use tantivy::schema::{self, IndexRecordOption, Schema, TextFieldIndexing, FAST};
use tantivy::Index; use tantivy::{Index, IndexWriter, TantivyDocument};
fn main() -> tantivy::Result<()> { fn main() -> tantivy::Result<()> {
// # Create Schema // # Create Schema
@@ -132,10 +132,10 @@ fn main() -> tantivy::Result<()> {
let stream = Deserializer::from_str(data).into_iter::<Value>(); let stream = Deserializer::from_str(data).into_iter::<Value>();
let mut index_writer = index.writer(50_000_000)?; let mut index_writer: IndexWriter = index.writer(50_000_000)?;
let mut num_indexed = 0; let mut num_indexed = 0;
for value in stream { for value in stream {
let doc = schema.parse_document(&serde_json::to_string(&value.unwrap())?)?; let doc = TantivyDocument::parse_json(&schema, &serde_json::to_string(&value.unwrap())?)?;
index_writer.add_document(doc)?; index_writer.add_document(doc)?;
num_indexed += 1; num_indexed += 1;
if num_indexed > 4 { if num_indexed > 4 {


@@ -15,7 +15,7 @@
use tantivy::collector::TopDocs; use tantivy::collector::TopDocs;
use tantivy::query::QueryParser; use tantivy::query::QueryParser;
use tantivy::schema::*; use tantivy::schema::*;
use tantivy::{doc, Index, ReloadPolicy}; use tantivy::{doc, Index, IndexWriter, ReloadPolicy};
use tempfile::TempDir; use tempfile::TempDir;
fn main() -> tantivy::Result<()> { fn main() -> tantivy::Result<()> {
@@ -75,7 +75,7 @@ fn main() -> tantivy::Result<()> {
// Here we give tantivy a budget of `50MB`. // Here we give tantivy a budget of `50MB`.
// Using a bigger memory_arena for the indexer may increase // Using a bigger memory_arena for the indexer may increase
// throughput, but 50 MB is already plenty. // throughput, but 50 MB is already plenty.
let mut index_writer = index.writer(50_000_000)?; let mut index_writer: IndexWriter = index.writer(50_000_000)?;
// Let's index our documents! // Let's index our documents!
// We first need a handle on the title and the body field. // We first need a handle on the title and the body field.
@@ -87,7 +87,7 @@ fn main() -> tantivy::Result<()> {
let title = schema.get_field("title").unwrap(); let title = schema.get_field("title").unwrap();
let body = schema.get_field("body").unwrap(); let body = schema.get_field("body").unwrap();
let mut old_man_doc = Document::default(); let mut old_man_doc = TantivyDocument::default();
old_man_doc.add_text(title, "The Old Man and the Sea"); old_man_doc.add_text(title, "The Old Man and the Sea");
old_man_doc.add_text( old_man_doc.add_text(
body, body,
@@ -164,7 +164,7 @@ fn main() -> tantivy::Result<()> {
// will reload the index automatically after each commit. // will reload the index automatically after each commit.
let reader = index let reader = index
.reader_builder() .reader_builder()
.reload_policy(ReloadPolicy::OnCommit) .reload_policy(ReloadPolicy::OnCommitWithDelay)
.try_into()?; .try_into()?;
// We now need to acquire a searcher. // We now need to acquire a searcher.
@@ -217,8 +217,8 @@ fn main() -> tantivy::Result<()> {
// the document returned will only contain // the document returned will only contain
// a title. // a title.
for (_score, doc_address) in top_docs { for (_score, doc_address) in top_docs {
let retrieved_doc = searcher.doc(doc_address)?; let retrieved_doc: TantivyDocument = searcher.doc(doc_address)?;
println!("{}", schema.to_json(&retrieved_doc)); println!("{}", retrieved_doc.to_json(&schema));
} }
// We can also get an explanation to understand // We can also get an explanation to understand
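The same retrieval migration recurs throughout the examples below; a hedged, self-contained helper capturing it (types and method names are the ones appearing in this diff):

use tantivy::schema::Schema;
use tantivy::{DocAddress, Searcher, TantivyDocument};

// Fetch the stored document as a concrete TantivyDocument, then render it with
// doc.to_json(&schema) instead of the old schema.to_json(&doc).
fn print_doc(searcher: &Searcher, schema: &Schema, doc_address: DocAddress) -> tantivy::Result<()> {
    let retrieved: TantivyDocument = searcher.doc(doc_address)?;
    println!("{}", retrieved.to_json(schema));
    Ok(())
}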


@@ -13,7 +13,7 @@ use columnar::Column;
use tantivy::collector::{Collector, SegmentCollector}; use tantivy::collector::{Collector, SegmentCollector};
use tantivy::query::QueryParser; use tantivy::query::QueryParser;
use tantivy::schema::{Schema, FAST, INDEXED, TEXT}; use tantivy::schema::{Schema, FAST, INDEXED, TEXT};
use tantivy::{doc, Index, Score, SegmentReader}; use tantivy::{doc, Index, IndexWriter, Score, SegmentReader};
#[derive(Default)] #[derive(Default)]
struct Stats { struct Stats {
@@ -142,7 +142,7 @@ fn main() -> tantivy::Result<()> {
// this example. // this example.
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer(50_000_000)?; let mut index_writer: IndexWriter = index.writer(50_000_000)?;
index_writer.add_document(doc!( index_writer.add_document(doc!(
product_name => "Super Broom 2000", product_name => "Super Broom 2000",
product_description => "While it is ok for short distance travel, this broom \ product_description => "While it is ok for short distance travel, this broom \


@@ -6,7 +6,7 @@ use tantivy::collector::TopDocs;
use tantivy::query::QueryParser; use tantivy::query::QueryParser;
use tantivy::schema::*; use tantivy::schema::*;
use tantivy::tokenizer::NgramTokenizer; use tantivy::tokenizer::NgramTokenizer;
use tantivy::{doc, Index}; use tantivy::{doc, Index, IndexWriter};
fn main() -> tantivy::Result<()> { fn main() -> tantivy::Result<()> {
// # Defining the schema // # Defining the schema
@@ -62,7 +62,7 @@ fn main() -> tantivy::Result<()> {
// //
// Here we use a buffer of 50MB per thread. Using a bigger // Here we use a buffer of 50MB per thread. Using a bigger
// memory arena for the indexer can increase its throughput. // memory arena for the indexer can increase its throughput.
let mut index_writer = index.writer(50_000_000)?; let mut index_writer: IndexWriter = index.writer(50_000_000)?;
index_writer.add_document(doc!( index_writer.add_document(doc!(
title => "The Old Man and the Sea", title => "The Old Man and the Sea",
body => "He was an old man who fished alone in a skiff in the Gulf Stream and \ body => "He was an old man who fished alone in a skiff in the Gulf Stream and \
@@ -103,8 +103,8 @@ fn main() -> tantivy::Result<()> {
let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?; let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
for (_, doc_address) in top_docs { for (_, doc_address) in top_docs {
let retrieved_doc = searcher.doc(doc_address)?; let retrieved_doc: TantivyDocument = searcher.doc(doc_address)?;
println!("{}", schema.to_json(&retrieved_doc)); println!("{}", retrieved_doc.to_json(&schema));
} }
Ok(()) Ok(())


@@ -4,8 +4,8 @@
use tantivy::collector::TopDocs; use tantivy::collector::TopDocs;
use tantivy::query::QueryParser; use tantivy::query::QueryParser;
use tantivy::schema::{DateOptions, Schema, Value, INDEXED, STORED, STRING}; use tantivy::schema::{DateOptions, Document, OwnedValue, Schema, INDEXED, STORED, STRING};
use tantivy::Index; use tantivy::{Index, IndexWriter, TantivyDocument};
fn main() -> tantivy::Result<()> { fn main() -> tantivy::Result<()> {
// # Defining the schema // # Defining the schema
@@ -22,16 +22,18 @@ fn main() -> tantivy::Result<()> {
// # Indexing documents // # Indexing documents
let index = Index::create_in_ram(schema.clone()); let index = Index::create_in_ram(schema.clone());
let mut index_writer = index.writer(50_000_000)?; let mut index_writer: IndexWriter = index.writer(50_000_000)?;
// The dates are passed as string in the RFC3339 format // The dates are passed as string in the RFC3339 format
let doc = schema.parse_document( let doc = TantivyDocument::parse_json(
&schema,
r#"{ r#"{
"occurred_at": "2022-06-22T12:53:50.53Z", "occurred_at": "2022-06-22T12:53:50.53Z",
"event": "pull-request" "event": "pull-request"
}"#, }"#,
)?; )?;
index_writer.add_document(doc)?; index_writer.add_document(doc)?;
let doc = schema.parse_document( let doc = TantivyDocument::parse_json(
&schema,
r#"{ r#"{
"occurred_at": "2022-06-22T13:00:00.22Z", "occurred_at": "2022-06-22T13:00:00.22Z",
"event": "comment" "event": "comment"
@@ -58,13 +60,13 @@ fn main() -> tantivy::Result<()> {
let count_docs = searcher.search(&*query, &TopDocs::with_limit(4))?; let count_docs = searcher.search(&*query, &TopDocs::with_limit(4))?;
assert_eq!(count_docs.len(), 1); assert_eq!(count_docs.len(), 1);
for (_score, doc_address) in count_docs { for (_score, doc_address) in count_docs {
let retrieved_doc = searcher.doc(doc_address)?; let retrieved_doc = searcher.doc::<TantivyDocument>(doc_address)?;
assert!(matches!( assert!(matches!(
retrieved_doc.get_first(occurred_at), retrieved_doc.get_first(occurred_at),
Some(Value::Date(_)) Some(OwnedValue::Date(_))
)); ));
assert_eq!( assert_eq!(
schema.to_json(&retrieved_doc), retrieved_doc.to_json(&schema),
r#"{"event":["comment"],"occurred_at":["2022-06-22T13:00:00.22Z"]}"# r#"{"event":["comment"],"occurred_at":["2022-06-22T13:00:00.22Z"]}"#
); );
} }


@@ -11,7 +11,7 @@
use tantivy::collector::TopDocs; use tantivy::collector::TopDocs;
use tantivy::query::TermQuery; use tantivy::query::TermQuery;
use tantivy::schema::*; use tantivy::schema::*;
use tantivy::{doc, Index, IndexReader}; use tantivy::{doc, Index, IndexReader, IndexWriter};
// A simple helper function to fetch a single document // A simple helper function to fetch a single document
// given its id from our index. // given its id from our index.
@@ -19,7 +19,7 @@ use tantivy::{doc, Index, IndexReader};
fn extract_doc_given_isbn( fn extract_doc_given_isbn(
reader: &IndexReader, reader: &IndexReader,
isbn_term: &Term, isbn_term: &Term,
) -> tantivy::Result<Option<Document>> { ) -> tantivy::Result<Option<TantivyDocument>> {
let searcher = reader.searcher(); let searcher = reader.searcher();
// This is the simplest query you can think of. // This is the simplest query you can think of.
@@ -69,10 +69,10 @@ fn main() -> tantivy::Result<()> {
let index = Index::create_in_ram(schema.clone()); let index = Index::create_in_ram(schema.clone());
let mut index_writer = index.writer(50_000_000)?; let mut index_writer: IndexWriter = index.writer(50_000_000)?;
// Let's add a couple of documents, for the sake of the example. // Let's add a couple of documents, for the sake of the example.
let mut old_man_doc = Document::default(); let mut old_man_doc = TantivyDocument::default();
old_man_doc.add_text(title, "The Old Man and the Sea"); old_man_doc.add_text(title, "The Old Man and the Sea");
index_writer.add_document(doc!( index_writer.add_document(doc!(
isbn => "978-0099908401", isbn => "978-0099908401",
@@ -94,7 +94,7 @@ fn main() -> tantivy::Result<()> {
// Oops our frankenstein doc seems misspelled // Oops our frankenstein doc seems misspelled
let frankenstein_doc_misspelled = extract_doc_given_isbn(&reader, &frankenstein_isbn)?.unwrap(); let frankenstein_doc_misspelled = extract_doc_given_isbn(&reader, &frankenstein_isbn)?.unwrap();
assert_eq!( assert_eq!(
schema.to_json(&frankenstein_doc_misspelled), frankenstein_doc_misspelled.to_json(&schema),
r#"{"isbn":["978-9176370711"],"title":["Frankentein"]}"#, r#"{"isbn":["978-9176370711"],"title":["Frankentein"]}"#,
); );
@@ -136,7 +136,7 @@ fn main() -> tantivy::Result<()> {
// No more typo! // No more typo!
let frankenstein_new_doc = extract_doc_given_isbn(&reader, &frankenstein_isbn)?.unwrap(); let frankenstein_new_doc = extract_doc_given_isbn(&reader, &frankenstein_isbn)?.unwrap();
assert_eq!( assert_eq!(
schema.to_json(&frankenstein_new_doc), frankenstein_new_doc.to_json(&schema),
r#"{"isbn":["978-9176370711"],"title":["Frankenstein"]}"#, r#"{"isbn":["978-9176370711"],"title":["Frankenstein"]}"#,
); );


@@ -17,7 +17,7 @@
use tantivy::collector::FacetCollector; use tantivy::collector::FacetCollector;
use tantivy::query::{AllQuery, TermQuery}; use tantivy::query::{AllQuery, TermQuery};
use tantivy::schema::*; use tantivy::schema::*;
use tantivy::{doc, Index}; use tantivy::{doc, Index, IndexWriter};
fn main() -> tantivy::Result<()> { fn main() -> tantivy::Result<()> {
// Let's create a temporary directory for the sake of this example // Let's create a temporary directory for the sake of this example
@@ -30,7 +30,7 @@ fn main() -> tantivy::Result<()> {
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer(30_000_000)?; let mut index_writer: IndexWriter = index.writer(30_000_000)?;
// For convenience, tantivy also comes with a macro to // For convenience, tantivy also comes with a macro to
// reduce the boilerplate above. // reduce the boilerplate above.


@@ -12,7 +12,7 @@ use std::collections::HashSet;
use tantivy::collector::TopDocs; use tantivy::collector::TopDocs;
use tantivy::query::BooleanQuery; use tantivy::query::BooleanQuery;
use tantivy::schema::*; use tantivy::schema::*;
use tantivy::{doc, DocId, Index, Score, SegmentReader}; use tantivy::{doc, DocId, Index, IndexWriter, Score, SegmentReader};
fn main() -> tantivy::Result<()> { fn main() -> tantivy::Result<()> {
let mut schema_builder = Schema::builder(); let mut schema_builder = Schema::builder();
@@ -23,7 +23,7 @@ fn main() -> tantivy::Result<()> {
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer(30_000_000)?; let mut index_writer: IndexWriter = index.writer(30_000_000)?;
index_writer.add_document(doc!( index_writer.add_document(doc!(
title => "Fried egg", title => "Fried egg",
@@ -91,11 +91,10 @@ fn main() -> tantivy::Result<()> {
.iter() .iter()
.map(|(_, doc_id)| { .map(|(_, doc_id)| {
searcher searcher
.doc(*doc_id) .doc::<TantivyDocument>(*doc_id)
.unwrap() .unwrap()
.get_first(title) .get_first(title)
.unwrap() .and_then(|v| v.as_str())
.as_text()
.unwrap() .unwrap()
.to_owned() .to_owned()
}) })


@@ -14,7 +14,7 @@
use tantivy::collector::{Count, TopDocs}; use tantivy::collector::{Count, TopDocs};
use tantivy::query::FuzzyTermQuery; use tantivy::query::FuzzyTermQuery;
use tantivy::schema::*; use tantivy::schema::*;
use tantivy::{doc, Index, ReloadPolicy}; use tantivy::{doc, Index, IndexWriter, ReloadPolicy};
use tempfile::TempDir; use tempfile::TempDir;
fn main() -> tantivy::Result<()> { fn main() -> tantivy::Result<()> {
@@ -66,7 +66,7 @@ fn main() -> tantivy::Result<()> {
// Here we give tantivy a budget of `50MB`. // Here we give tantivy a budget of `50MB`.
// Using a bigger memory_arena for the indexer may increase // Using a bigger memory_arena for the indexer may increase
// throughput, but 50 MB is already plenty. // throughput, but 50 MB is already plenty.
let mut index_writer = index.writer(50_000_000)?; let mut index_writer: IndexWriter = index.writer(50_000_000)?;
// Let's index our documents! // Let's index our documents!
// We first need a handle on the title and the body field. // We first need a handle on the title and the body field.
@@ -123,7 +123,7 @@ fn main() -> tantivy::Result<()> {
// will reload the index automatically after each commit. // will reload the index automatically after each commit.
let reader = index let reader = index
.reader_builder() .reader_builder()
.reload_policy(ReloadPolicy::OnCommit) .reload_policy(ReloadPolicy::OnCommitWithDelay)
.try_into()?; .try_into()?;
// We now need to acquire a searcher. // We now need to acquire a searcher.
@@ -151,10 +151,10 @@ fn main() -> tantivy::Result<()> {
assert_eq!(count, 3); assert_eq!(count, 3);
assert_eq!(top_docs.len(), 3); assert_eq!(top_docs.len(), 3);
for (score, doc_address) in top_docs { for (score, doc_address) in top_docs {
let retrieved_doc = searcher.doc(doc_address)?;
// Note that the score is not lower for the fuzzy hit. // Note that the score is not lower for the fuzzy hit.
// There's an issue open for that: https://github.com/quickwit-oss/tantivy/issues/563 // There's an issue open for that: https://github.com/quickwit-oss/tantivy/issues/563
println!("score {score:?} doc {}", schema.to_json(&retrieved_doc)); let retrieved_doc: TantivyDocument = searcher.doc(doc_address)?;
println!("score {score:?} doc {}", retrieved_doc.to_json(&schema));
// score 1.0 doc {"title":["The Diary of Muadib"]} // score 1.0 doc {"title":["The Diary of Muadib"]}
// //
// score 1.0 doc {"title":["The Diary of a Young Girl"]} // score 1.0 doc {"title":["The Diary of a Young Girl"]}


@@ -21,7 +21,7 @@ fn main() -> tantivy::Result<()> {
}"#; }"#;
// We can parse our document // We can parse our document
let _mice_and_men_doc = schema.parse_document(mice_and_men_doc_json)?; let _mice_and_men_doc = TantivyDocument::parse_json(&schema, mice_and_men_doc_json)?;
// Multi-valued field are allowed, they are // Multi-valued field are allowed, they are
// expressed in JSON by an array. // expressed in JSON by an array.
@@ -30,7 +30,7 @@ fn main() -> tantivy::Result<()> {
"title": ["Frankenstein", "The Modern Prometheus"], "title": ["Frankenstein", "The Modern Prometheus"],
"year": 1818 "year": 1818
}"#; }"#;
let _frankenstein_doc = schema.parse_document(frankenstein_json)?; let _frankenstein_doc = TantivyDocument::parse_json(&schema, frankenstein_json)?;
// Note that the schema is saved in your index directory. // Note that the schema is saved in your index directory.
// //


@@ -5,7 +5,7 @@
use tantivy::collector::Count; use tantivy::collector::Count;
use tantivy::query::RangeQuery; use tantivy::query::RangeQuery;
use tantivy::schema::{Schema, INDEXED}; use tantivy::schema::{Schema, INDEXED};
use tantivy::{doc, Index, Result}; use tantivy::{doc, Index, IndexWriter, Result};
fn main() -> Result<()> { fn main() -> Result<()> {
// For the sake of simplicity, this schema will only have 1 field // For the sake of simplicity, this schema will only have 1 field
@@ -17,7 +17,7 @@ fn main() -> Result<()> {
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let reader = index.reader()?; let reader = index.reader()?;
{ {
let mut index_writer = index.writer_with_num_threads(1, 6_000_000)?; let mut index_writer: IndexWriter = index.writer_with_num_threads(1, 6_000_000)?;
for year in 1950u64..2019u64 { for year in 1950u64..2019u64 {
index_writer.add_document(doc!(year_field => year))?; index_writer.add_document(doc!(year_field => year))?;
} }


@@ -6,7 +6,7 @@
use tantivy::collector::{Count, TopDocs}; use tantivy::collector::{Count, TopDocs};
use tantivy::query::QueryParser; use tantivy::query::QueryParser;
use tantivy::schema::{Schema, FAST, INDEXED, STORED, STRING}; use tantivy::schema::{Schema, FAST, INDEXED, STORED, STRING};
use tantivy::Index; use tantivy::{Index, IndexWriter, TantivyDocument};
fn main() -> tantivy::Result<()> { fn main() -> tantivy::Result<()> {
// # Defining the schema // # Defining the schema
@@ -22,20 +22,22 @@ fn main() -> tantivy::Result<()> {
// # Indexing documents // # Indexing documents
let index = Index::create_in_ram(schema.clone()); let index = Index::create_in_ram(schema.clone());
let mut index_writer = index.writer(50_000_000)?; let mut index_writer: IndexWriter = index.writer(50_000_000)?;
// ### IPv4 // ### IPv4
// Adding documents that contain an IPv4 address. Notice that the IP addresses are passed as // Adding documents that contain an IPv4 address. Notice that the IP addresses are passed as
// `String`. Since the field is of type ip, we parse the IP address from the string and store it // `String`. Since the field is of type ip, we parse the IP address from the string and store it
// internally as IPv6. // internally as IPv6.
let doc = schema.parse_document( let doc = TantivyDocument::parse_json(
&schema,
r#"{ r#"{
"ip": "192.168.0.33", "ip": "192.168.0.33",
"event_type": "login" "event_type": "login"
}"#, }"#,
)?; )?;
index_writer.add_document(doc)?; index_writer.add_document(doc)?;
let doc = schema.parse_document( let doc = TantivyDocument::parse_json(
&schema,
r#"{ r#"{
"ip": "192.168.0.80", "ip": "192.168.0.80",
"event_type": "checkout" "event_type": "checkout"
@@ -44,7 +46,8 @@ fn main() -> tantivy::Result<()> {
index_writer.add_document(doc)?; index_writer.add_document(doc)?;
// ### IPv6 // ### IPv6
// Adding a document that contains an IPv6 address. // Adding a document that contains an IPv6 address.
let doc = schema.parse_document( let doc = TantivyDocument::parse_json(
&schema,
r#"{ r#"{
"ip": "2001:0db8:85a3:0000:0000:8a2e:0370:7334", "ip": "2001:0db8:85a3:0000:0000:8a2e:0370:7334",
"event_type": "checkout" "event_type": "checkout"


@@ -10,7 +10,7 @@
// --- // ---
// Importing tantivy... // Importing tantivy...
use tantivy::schema::*; use tantivy::schema::*;
use tantivy::{doc, DocSet, Index, Postings, TERMINATED}; use tantivy::{doc, DocSet, Index, IndexWriter, Postings, TERMINATED};
fn main() -> tantivy::Result<()> { fn main() -> tantivy::Result<()> {
// We first create a schema for the sake of the // We first create a schema for the sake of the
@@ -24,7 +24,7 @@ fn main() -> tantivy::Result<()> {
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_with_num_threads(1, 50_000_000)?; let mut index_writer: IndexWriter = index.writer_with_num_threads(1, 50_000_000)?;
index_writer.add_document(doc!(title => "The Old Man and the Sea"))?; index_writer.add_document(doc!(title => "The Old Man and the Sea"))?;
index_writer.add_document(doc!(title => "Of Mice and Men"))?; index_writer.add_document(doc!(title => "Of Mice and Men"))?;
index_writer.add_document(doc!(title => "The modern Promotheus"))?; index_writer.add_document(doc!(title => "The modern Promotheus"))?;


@@ -7,7 +7,7 @@
use tantivy::collector::{Count, TopDocs}; use tantivy::collector::{Count, TopDocs};
use tantivy::query::QueryParser; use tantivy::query::QueryParser;
use tantivy::schema::{Schema, FAST, STORED, STRING, TEXT}; use tantivy::schema::{Schema, FAST, STORED, STRING, TEXT};
use tantivy::Index; use tantivy::{Index, IndexWriter, TantivyDocument};
fn main() -> tantivy::Result<()> { fn main() -> tantivy::Result<()> {
// # Defining the schema // # Defining the schema
@@ -20,8 +20,9 @@ fn main() -> tantivy::Result<()> {
// # Indexing documents // # Indexing documents
let index = Index::create_in_ram(schema.clone()); let index = Index::create_in_ram(schema.clone());
let mut index_writer = index.writer(50_000_000)?; let mut index_writer: IndexWriter = index.writer(50_000_000)?;
let doc = schema.parse_document( let doc = TantivyDocument::parse_json(
&schema,
r#"{ r#"{
"timestamp": "2022-02-22T23:20:50.53Z", "timestamp": "2022-02-22T23:20:50.53Z",
"event_type": "click", "event_type": "click",
@@ -33,7 +34,8 @@ fn main() -> tantivy::Result<()> {
}"#, }"#,
)?; )?;
index_writer.add_document(doc)?; index_writer.add_document(doc)?;
let doc = schema.parse_document( let doc = TantivyDocument::parse_json(
&schema,
r#"{ r#"{
"timestamp": "2022-02-22T23:20:51.53Z", "timestamp": "2022-02-22T23:20:51.53Z",
"event_type": "click", "event_type": "click",


@@ -1,7 +1,7 @@
use tantivy::collector::TopDocs; use tantivy::collector::TopDocs;
use tantivy::query::QueryParser; use tantivy::query::QueryParser;
use tantivy::schema::*; use tantivy::schema::*;
use tantivy::{doc, Index, ReloadPolicy, Result}; use tantivy::{doc, Index, IndexWriter, ReloadPolicy, Result};
use tempfile::TempDir; use tempfile::TempDir;
fn main() -> Result<()> { fn main() -> Result<()> {
@@ -17,7 +17,7 @@ fn main() -> Result<()> {
let index = Index::create_in_dir(&index_path, schema)?; let index = Index::create_in_dir(&index_path, schema)?;
let mut index_writer = index.writer(50_000_000)?; let mut index_writer: IndexWriter = index.writer(50_000_000)?;
index_writer.add_document(doc!( index_writer.add_document(doc!(
title => "The Old Man and the Sea", title => "The Old Man and the Sea",
@@ -51,7 +51,7 @@ fn main() -> Result<()> {
let reader = index let reader = index
.reader_builder() .reader_builder()
.reload_policy(ReloadPolicy::OnCommit) .reload_policy(ReloadPolicy::OnCommitWithDelay)
.try_into()?; .try_into()?;
let searcher = reader.searcher(); let searcher = reader.searcher();
@@ -67,8 +67,12 @@ fn main() -> Result<()> {
let mut titles = top_docs let mut titles = top_docs
.into_iter() .into_iter()
.map(|(_score, doc_address)| { .map(|(_score, doc_address)| {
let doc = searcher.doc(doc_address)?; let doc = searcher.doc::<TantivyDocument>(doc_address)?;
let title = doc.get_first(title).unwrap().as_text().unwrap().to_owned(); let title = doc
.get_first(title)
.and_then(|v| v.as_str())
.unwrap()
.to_owned();
Ok(title) Ok(title)
}) })
.collect::<Result<Vec<_>>>()?; .collect::<Result<Vec<_>>>()?;


@@ -13,7 +13,7 @@ use tantivy::collector::{Count, TopDocs};
use tantivy::query::TermQuery; use tantivy::query::TermQuery;
use tantivy::schema::*; use tantivy::schema::*;
use tantivy::tokenizer::{PreTokenizedString, SimpleTokenizer, Token, TokenStream, Tokenizer}; use tantivy::tokenizer::{PreTokenizedString, SimpleTokenizer, Token, TokenStream, Tokenizer};
use tantivy::{doc, Index, ReloadPolicy}; use tantivy::{doc, Index, IndexWriter, ReloadPolicy};
use tempfile::TempDir; use tempfile::TempDir;
fn pre_tokenize_text(text: &str) -> Vec<Token> { fn pre_tokenize_text(text: &str) -> Vec<Token> {
@@ -38,7 +38,7 @@ fn main() -> tantivy::Result<()> {
let index = Index::create_in_dir(&index_path, schema.clone())?; let index = Index::create_in_dir(&index_path, schema.clone())?;
let mut index_writer = index.writer(50_000_000)?; let mut index_writer: IndexWriter = index.writer(50_000_000)?;
// We can create a document manually, by setting the fields // We can create a document manually, by setting the fields
// one by one in a Document object. // one by one in a Document object.
@@ -83,7 +83,7 @@ fn main() -> tantivy::Result<()> {
}] }]
}"#; }"#;
let short_man_doc = schema.parse_document(short_man_json)?; let short_man_doc = TantivyDocument::parse_json(&schema, short_man_json)?;
index_writer.add_document(short_man_doc)?; index_writer.add_document(short_man_doc)?;
@@ -94,7 +94,7 @@ fn main() -> tantivy::Result<()> {
let reader = index let reader = index
.reader_builder() .reader_builder()
.reload_policy(ReloadPolicy::OnCommit) .reload_policy(ReloadPolicy::OnCommitWithDelay)
.try_into()?; .try_into()?;
let searcher = reader.searcher(); let searcher = reader.searcher();
@@ -115,8 +115,8 @@ fn main() -> tantivy::Result<()> {
// Note that the tokens are not stored along with the original text // Note that the tokens are not stored along with the original text
// in the document store // in the document store
for (_score, doc_address) in top_docs { for (_score, doc_address) in top_docs {
let retrieved_doc = searcher.doc(doc_address)?; let retrieved_doc: TantivyDocument = searcher.doc(doc_address)?;
println!("Document: {}", schema.to_json(&retrieved_doc)); println!("{}", retrieved_doc.to_json(&schema));
} }
// In contrary to the previous query, when we search for the "man" term we // In contrary to the previous query, when we search for the "man" term we
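Serializing a retrieved document also moved from the schema to the document: `retrieved_doc.to_json(&schema)` replaces `schema.to_json(&retrieved_doc)`. A small hedged sketch of that call, assuming the searcher and schema from the surrounding example:

use tantivy::schema::Schema;
use tantivy::{DocAddress, Searcher, TantivyDocument};

// Illustrative helper: fetch a hit and print it as JSON with the new method.
fn print_hit(searcher: &Searcher, schema: &Schema, addr: DocAddress) -> tantivy::Result<()> {
    let retrieved_doc: TantivyDocument = searcher.doc(addr)?;
    // The document renders itself; the schema is only needed to resolve field names.
    println!("{}", retrieved_doc.to_json(schema));
    Ok(())
}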

View File

@@ -10,7 +10,8 @@
use tantivy::collector::TopDocs; use tantivy::collector::TopDocs;
use tantivy::query::QueryParser; use tantivy::query::QueryParser;
use tantivy::schema::*; use tantivy::schema::*;
use tantivy::{doc, Index, Snippet, SnippetGenerator}; use tantivy::snippet::{Snippet, SnippetGenerator};
use tantivy::{doc, Index, IndexWriter};
use tempfile::TempDir; use tempfile::TempDir;
fn main() -> tantivy::Result<()> { fn main() -> tantivy::Result<()> {
@@ -27,7 +28,7 @@ fn main() -> tantivy::Result<()> {
// # Indexing documents // # Indexing documents
let index = Index::create_in_dir(&index_path, schema)?; let index = Index::create_in_dir(&index_path, schema)?;
let mut index_writer = index.writer(50_000_000)?; let mut index_writer: IndexWriter = index.writer(50_000_000)?;
// we'll only need one doc for this example. // we'll only need one doc for this example.
index_writer.add_document(doc!( index_writer.add_document(doc!(
@@ -54,13 +55,10 @@ fn main() -> tantivy::Result<()> {
let snippet_generator = SnippetGenerator::create(&searcher, &*query, body)?; let snippet_generator = SnippetGenerator::create(&searcher, &*query, body)?;
for (score, doc_address) in top_docs { for (score, doc_address) in top_docs {
let doc = searcher.doc(doc_address)?; let doc = searcher.doc::<TantivyDocument>(doc_address)?;
let snippet = snippet_generator.snippet_from_doc(&doc); let snippet = snippet_generator.snippet_from_doc(&doc);
println!("Document score {score}:"); println!("Document score {score}:");
println!( println!("title: {}", doc.get_first(title).unwrap().as_str().unwrap());
"title: {}",
doc.get_first(title).unwrap().as_text().unwrap()
);
println!("snippet: {}", snippet.to_html()); println!("snippet: {}", snippet.to_html());
println!("custom highlighting: {}", highlight(snippet)); println!("custom highlighting: {}", highlight(snippet));
} }
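Besides the usual writer and document-type annotations, this hunk reflects the relocation of the snippet types from the crate root into `tantivy::snippet`. A hedged sketch of the new import path using the same calls as the example:

use tantivy::query::Query;
use tantivy::schema::Field;
use tantivy::snippet::SnippetGenerator;
use tantivy::{Searcher, TantivyDocument};

// Illustrative helper: build a snippet generator for a field and render one hit.
fn snippet_html(
    searcher: &Searcher,
    query: &dyn Query,
    body: Field,
    doc: &TantivyDocument,
) -> tantivy::Result<String> {
    let snippet_generator = SnippetGenerator::create(searcher, query, body)?;
    Ok(snippet_generator.snippet_from_doc(doc).to_html())
}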

View File

@@ -15,7 +15,7 @@ use tantivy::collector::TopDocs;
use tantivy::query::QueryParser; use tantivy::query::QueryParser;
use tantivy::schema::*; use tantivy::schema::*;
use tantivy::tokenizer::*; use tantivy::tokenizer::*;
use tantivy::{doc, Index}; use tantivy::{doc, Index, IndexWriter};
fn main() -> tantivy::Result<()> { fn main() -> tantivy::Result<()> {
// this example assumes you understand the content in `basic_search` // this example assumes you understand the content in `basic_search`
@@ -60,7 +60,7 @@ fn main() -> tantivy::Result<()> {
index.tokenizers().register("stoppy", tokenizer); index.tokenizers().register("stoppy", tokenizer);
let mut index_writer = index.writer(50_000_000)?; let mut index_writer: IndexWriter = index.writer(50_000_000)?;
let title = schema.get_field("title").unwrap(); let title = schema.get_field("title").unwrap();
let body = schema.get_field("body").unwrap(); let body = schema.get_field("body").unwrap();
@@ -105,9 +105,9 @@ fn main() -> tantivy::Result<()> {
let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?; let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
for (score, doc_address) in top_docs { for (score, doc_address) in top_docs {
let retrieved_doc = searcher.doc(doc_address)?; let retrieved_doc: TantivyDocument = searcher.doc(doc_address)?;
println!("\n==\nDocument score {score}:"); println!("\n==\nDocument score {score}:");
println!("{}", schema.to_json(&retrieved_doc)); println!("{}", retrieved_doc.to_json(&schema));
} }
Ok(()) Ok(())

View File

@@ -6,8 +6,8 @@ use tantivy::collector::TopDocs;
use tantivy::query::QueryParser; use tantivy::query::QueryParser;
use tantivy::schema::{Schema, FAST, TEXT}; use tantivy::schema::{Schema, FAST, TEXT};
use tantivy::{ use tantivy::{
doc, DocAddress, DocId, Index, Opstamp, Searcher, SearcherGeneration, SegmentId, SegmentReader, doc, DocAddress, DocId, Index, IndexWriter, Opstamp, Searcher, SearcherGeneration, SegmentId,
Warmer, SegmentReader, Warmer,
}; };
// This example shows how warmers can be used to // This example shows how warmers can be used to
@@ -143,7 +143,7 @@ fn main() -> tantivy::Result<()> {
const SNEAKERS: ProductId = 23222; const SNEAKERS: ProductId = 23222;
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut writer = index.writer_with_num_threads(1, 15_000_000)?; let mut writer: IndexWriter = index.writer_with_num_threads(1, 15_000_000)?;
writer.add_document(doc!(product_id=>OLIVE_OIL, text=>"cooking olive oil from greece"))?; writer.add_document(doc!(product_id=>OLIVE_OIL, text=>"cooking olive oil from greece"))?;
writer.add_document(doc!(product_id=>GLOVES, text=>"kitchen gloves, perfect for cooking"))?; writer.add_document(doc!(product_id=>GLOVES, text=>"kitchen gloves, perfect for cooking"))?;
writer.add_document(doc!(product_id=>SNEAKERS, text=>"uber sweet sneakers"))?; writer.add_document(doc!(product_id=>SNEAKERS, text=>"uber sweet sneakers"))?;

View File

@@ -81,8 +81,8 @@ where
T: InputTakeAtPosition + Clone, T: InputTakeAtPosition + Clone,
<T as InputTakeAtPosition>::Item: AsChar + Clone, <T as InputTakeAtPosition>::Item: AsChar + Clone,
{ {
opt_i(nom::character::complete::space0)(input) opt_i(nom::character::complete::multispace0)(input)
.map(|(left, (spaces, errors))| (left, (spaces.expect("space0 can't fail"), errors))) .map(|(left, (spaces, errors))| (left, (spaces.expect("multispace0 can't fail"), errors)))
} }
pub(crate) fn space1_infallible<T>(input: T) -> JResult<T, Option<T>> pub(crate) fn space1_infallible<T>(input: T) -> JResult<T, Option<T>>
@@ -90,7 +90,7 @@ where
T: InputTakeAtPosition + Clone + InputLength, T: InputTakeAtPosition + Clone + InputLength,
<T as InputTakeAtPosition>::Item: AsChar + Clone, <T as InputTakeAtPosition>::Item: AsChar + Clone,
{ {
opt_i(nom::character::complete::space1)(input).map(|(left, (spaces, mut errors))| { opt_i(nom::character::complete::multispace1)(input).map(|(left, (spaces, mut errors))| {
if spaces.is_none() { if spaces.is_none() {
errors.push(LenientErrorInternal { errors.push(LenientErrorInternal {
pos: left.input_len(), pos: left.input_len(),
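The switch from `space0`/`space1` to `multispace0`/`multispace1` widens the accepted whitespace from spaces and tabs to any whitespace, including newlines, so queries split across lines now parse. A standalone hedged illustration of the difference using nom directly:

use nom::character::complete::{multispace0, space0};
use nom::IResult;

fn main() {
    let input = " \n\ttitle:hello";
    // `space0` only eats ' ' and '\t', so it stops at the newline...
    let spaces_only: IResult<&str, &str> = space0(input);
    assert_eq!(spaces_only, Ok(("\n\ttitle:hello", " ")));
    // ...while `multispace0` also consumes '\n' and '\r'.
    let any_whitespace: IResult<&str, &str> = multispace0(input);
    assert_eq!(any_whitespace, Ok(("title:hello", " \n\t")));
}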

View File

@@ -3,11 +3,11 @@ use std::iter::once;
use nom::branch::alt; use nom::branch::alt;
use nom::bytes::complete::tag; use nom::bytes::complete::tag;
use nom::character::complete::{ use nom::character::complete::{
anychar, char, digit1, none_of, one_of, satisfy, space0, space1, u32, anychar, char, digit1, multispace0, multispace1, none_of, one_of, satisfy, u32,
}; };
use nom::combinator::{eof, map, map_res, opt, peek, recognize, value, verify}; use nom::combinator::{eof, map, map_res, opt, peek, recognize, value, verify};
use nom::error::{Error, ErrorKind}; use nom::error::{Error, ErrorKind};
use nom::multi::{many0, many1, separated_list0, separated_list1}; use nom::multi::{many0, many1, separated_list0};
use nom::sequence::{delimited, preceded, separated_pair, terminated, tuple}; use nom::sequence::{delimited, preceded, separated_pair, terminated, tuple};
use nom::IResult; use nom::IResult;
@@ -24,7 +24,7 @@ const SPECIAL_CHARS: &[char] = &[
/// consume a field name followed by colon. Return the field name with escape sequence /// consume a field name followed by colon. Return the field name with escape sequence
/// already interpreted /// already interpreted
fn field_name(i: &str) -> IResult<&str, String> { fn field_name(inp: &str) -> IResult<&str, String> {
let simple_char = none_of(SPECIAL_CHARS); let simple_char = none_of(SPECIAL_CHARS);
let first_char = verify(none_of(SPECIAL_CHARS), |c| *c != '-'); let first_char = verify(none_of(SPECIAL_CHARS), |c| *c != '-');
let escape_sequence = || preceded(char('\\'), one_of(SPECIAL_CHARS)); let escape_sequence = || preceded(char('\\'), one_of(SPECIAL_CHARS));
@@ -38,12 +38,12 @@ fn field_name(i: &str) -> IResult<&str, String> {
char(':'), char(':'),
), ),
|(first_char, next)| once(first_char).chain(next).collect(), |(first_char, next)| once(first_char).chain(next).collect(),
)(i) )(inp)
} }
/// Consume a word outside of any context. /// Consume a word outside of any context.
// TODO should support escape sequences // TODO should support escape sequences
fn word(i: &str) -> IResult<&str, &str> { fn word(inp: &str) -> IResult<&str, &str> {
map_res( map_res(
recognize(tuple(( recognize(tuple((
satisfy(|c| { satisfy(|c| {
@@ -55,45 +55,45 @@ fn word(i: &str) -> IResult<&str, &str> {
})), })),
))), ))),
|s| match s { |s| match s {
"OR" | "AND" | "NOT" | "IN" => Err(Error::new(i, ErrorKind::Tag)), "OR" | "AND" | "NOT" | "IN" => Err(Error::new(inp, ErrorKind::Tag)),
_ => Ok(s), _ => Ok(s),
}, },
)(i) )(inp)
} }
fn word_infallible(delimiter: &str) -> impl Fn(&str) -> JResult<&str, Option<&str>> + '_ { fn word_infallible(delimiter: &str) -> impl Fn(&str) -> JResult<&str, Option<&str>> + '_ {
|i| { |inp| {
opt_i_err( opt_i_err(
preceded( preceded(
space0, multispace0,
recognize(many1(satisfy(|c| { recognize(many1(satisfy(|c| {
!c.is_whitespace() && !delimiter.contains(c) !c.is_whitespace() && !delimiter.contains(c)
}))), }))),
), ),
"expected word", "expected word",
)(i) )(inp)
} }
} }
/// Consume a word inside a Range context. More values are allowed as they are /// Consume a word inside a Range context. More values are allowed as they are
/// not ambiguous in this context. /// not ambiguous in this context.
fn relaxed_word(i: &str) -> IResult<&str, &str> { fn relaxed_word(inp: &str) -> IResult<&str, &str> {
recognize(tuple(( recognize(tuple((
satisfy(|c| !c.is_whitespace() && !['`', '{', '}', '"', '[', ']', '(', ')'].contains(&c)), satisfy(|c| !c.is_whitespace() && !['`', '{', '}', '"', '[', ']', '(', ')'].contains(&c)),
many0(satisfy(|c: char| { many0(satisfy(|c: char| {
!c.is_whitespace() && !['{', '}', '"', '[', ']', '(', ')'].contains(&c) !c.is_whitespace() && !['{', '}', '"', '[', ']', '(', ')'].contains(&c)
})), })),
)))(i) )))(inp)
} }
fn negative_number(i: &str) -> IResult<&str, &str> { fn negative_number(inp: &str) -> IResult<&str, &str> {
recognize(preceded( recognize(preceded(
char('-'), char('-'),
tuple((digit1, opt(tuple((char('.'), digit1))))), tuple((digit1, opt(tuple((char('.'), digit1))))),
))(i) ))(inp)
} }
fn simple_term(i: &str) -> IResult<&str, (Delimiter, String)> { fn simple_term(inp: &str) -> IResult<&str, (Delimiter, String)> {
let escaped_string = |delimiter| { let escaped_string = |delimiter| {
// we need this because none_of can't accept an owned array of char. // we need this because none_of can't accept an owned array of char.
let not_delimiter = verify(anychar, move |parsed| *parsed != delimiter); let not_delimiter = verify(anychar, move |parsed| *parsed != delimiter);
@@ -123,13 +123,13 @@ fn simple_term(i: &str) -> IResult<&str, (Delimiter, String)> {
simple_quotes, simple_quotes,
double_quotes, double_quotes,
text_no_delimiter, text_no_delimiter,
))(i) ))(inp)
} }
fn simple_term_infallible( fn simple_term_infallible(
delimiter: &str, delimiter: &str,
) -> impl Fn(&str) -> JResult<&str, Option<(Delimiter, String)>> + '_ { ) -> impl Fn(&str) -> JResult<&str, Option<(Delimiter, String)>> + '_ {
|i| { |inp| {
let escaped_string = |delimiter| { let escaped_string = |delimiter| {
// we need this because none_of can't accept an owned array of char. // we need this because none_of can't accept an owned array of char.
let not_delimiter = verify(anychar, move |parsed| *parsed != delimiter); let not_delimiter = verify(anychar, move |parsed| *parsed != delimiter);
@@ -162,11 +162,11 @@ fn simple_term_infallible(
map(word_infallible(delimiter), |(text, errors)| { map(word_infallible(delimiter), |(text, errors)| {
(text.map(|text| (Delimiter::None, text.to_string())), errors) (text.map(|text| (Delimiter::None, text.to_string())), errors)
}), }),
)(i) )(inp)
} }
} }
fn term_or_phrase(i: &str) -> IResult<&str, UserInputLeaf> { fn term_or_phrase(inp: &str) -> IResult<&str, UserInputLeaf> {
map( map(
tuple((simple_term, fallible(slop_or_prefix_val))), tuple((simple_term, fallible(slop_or_prefix_val))),
|((delimiter, phrase), (slop, prefix))| { |((delimiter, phrase), (slop, prefix))| {
@@ -179,13 +179,13 @@ fn term_or_phrase(i: &str) -> IResult<&str, UserInputLeaf> {
} }
.into() .into()
}, },
)(i) )(inp)
} }
fn term_or_phrase_infallible(i: &str) -> JResult<&str, Option<UserInputLeaf>> { fn term_or_phrase_infallible(inp: &str) -> JResult<&str, Option<UserInputLeaf>> {
map( map(
// ~* for slop/prefix, ) inside group or ast tree, ^ if boost // ~* for slop/prefix, ) inside group or ast tree, ^ if boost
tuple_infallible((simple_term_infallible("*)^"), slop_or_prefix_val)), tuple_infallible((simple_term_infallible(")^"), slop_or_prefix_val)),
|((delimiter_phrase, (slop, prefix)), errors)| { |((delimiter_phrase, (slop, prefix)), errors)| {
let leaf = if let Some((delimiter, phrase)) = delimiter_phrase { let leaf = if let Some((delimiter, phrase)) = delimiter_phrase {
Some( Some(
@@ -214,10 +214,10 @@ fn term_or_phrase_infallible(i: &str) -> JResult<&str, Option<UserInputLeaf>> {
}; };
(leaf, errors) (leaf, errors)
}, },
)(i) )(inp)
} }
fn term_group(i: &str) -> IResult<&str, UserInputAst> { fn term_group(inp: &str) -> IResult<&str, UserInputAst> {
let occur_symbol = alt(( let occur_symbol = alt((
value(Occur::MustNot, char('-')), value(Occur::MustNot, char('-')),
value(Occur::Must, char('+')), value(Occur::Must, char('+')),
@@ -225,10 +225,10 @@ fn term_group(i: &str) -> IResult<&str, UserInputAst> {
map( map(
tuple(( tuple((
terminated(field_name, space0), terminated(field_name, multispace0),
delimited( delimited(
tuple((char('('), space0)), tuple((char('('), multispace0)),
separated_list0(space1, tuple((opt(occur_symbol), term_or_phrase))), separated_list0(multispace1, tuple((opt(occur_symbol), term_or_phrase))),
char(')'), char(')'),
), ),
)), )),
@@ -240,26 +240,26 @@ fn term_group(i: &str) -> IResult<&str, UserInputAst> {
.collect(), .collect(),
) )
}, },
)(i) )(inp)
} }
// this is a precondition for term_group_infallible. Without it, term_group_infallible can fail // this is a precondition for term_group_infallible. Without it, term_group_infallible can fail
// with a panic. It does not consume its input. // with a panic. It does not consume its input.
fn term_group_precond(i: &str) -> IResult<&str, (), ()> { fn term_group_precond(inp: &str) -> IResult<&str, (), ()> {
value( value(
(), (),
peek(tuple(( peek(tuple((
field_name, field_name,
space0, multispace0,
char('('), // when we are here, we know it can't be anything but a term group char('('), // when we are here, we know it can't be anything but a term group
))), ))),
)(i) )(inp)
.map_err(|e| e.map(|_| ())) .map_err(|e| e.map(|_| ()))
} }
fn term_group_infallible(i: &str) -> JResult<&str, UserInputAst> { fn term_group_infallible(inp: &str) -> JResult<&str, UserInputAst> {
let (mut i, (field_name, _, _, _)) = let (mut inp, (field_name, _, _, _)) =
tuple((field_name, space0, char('('), space0))(i).expect("precondition failed"); tuple((field_name, multispace0, char('('), multispace0))(inp).expect("precondition failed");
let mut terms = Vec::new(); let mut terms = Vec::new();
let mut errs = Vec::new(); let mut errs = Vec::new();
@@ -270,19 +270,19 @@ fn term_group_infallible(i: &str) -> JResult<&str, UserInputAst> {
first_round = false; first_round = false;
Vec::new() Vec::new()
} else { } else {
let (rest, (_, err)) = space1_infallible(i)?; let (rest, (_, err)) = space1_infallible(inp)?;
i = rest; inp = rest;
err err
}; };
if i.is_empty() { if inp.is_empty() {
errs.push(LenientErrorInternal { errs.push(LenientErrorInternal {
pos: i.len(), pos: inp.len(),
message: "missing )".to_string(), message: "missing )".to_string(),
}); });
break Ok((i, (UserInputAst::Clause(terms), errs))); break Ok((inp, (UserInputAst::Clause(terms), errs)));
} }
if let Some(i) = i.strip_prefix(')') { if let Some(inp) = inp.strip_prefix(')') {
break Ok((i, (UserInputAst::Clause(terms), errs))); break Ok((inp, (UserInputAst::Clause(terms), errs)));
} }
// only append missing space error if we did not reach the end of group // only append missing space error if we did not reach the end of group
errs.append(&mut space_error); errs.append(&mut space_error);
@@ -291,26 +291,57 @@ fn term_group_infallible(i: &str) -> JResult<&str, UserInputAst> {
// first byte is not `)` or ' '. If it did not, we would end up looping. // first byte is not `)` or ' '. If it did not, we would end up looping.
let (rest, ((occur, leaf), mut err)) = let (rest, ((occur, leaf), mut err)) =
tuple_infallible((occur_symbol, term_or_phrase_infallible))(i)?; tuple_infallible((occur_symbol, term_or_phrase_infallible))(inp)?;
errs.append(&mut err); errs.append(&mut err);
if let Some(leaf) = leaf { if let Some(leaf) = leaf {
terms.push((occur, leaf.set_field(Some(field_name.clone())).into())); terms.push((occur, leaf.set_field(Some(field_name.clone())).into()));
} }
i = rest; inp = rest;
} }
} }
fn literal(i: &str) -> IResult<&str, UserInputAst> { fn exists(inp: &str) -> IResult<&str, UserInputLeaf> {
value(
UserInputLeaf::Exists {
field: String::new(),
},
tuple((multispace0, char('*'))),
)(inp)
}
fn exists_precond(inp: &str) -> IResult<&str, (), ()> {
value(
(),
peek(tuple((
field_name,
multispace0,
char('*'), // when we are here, we know it can't be anything but an exists query
))),
)(inp)
.map_err(|e| e.map(|_| ()))
}
fn exists_infallible(inp: &str) -> JResult<&str, UserInputAst> {
let (inp, (field_name, _, _)) =
tuple((field_name, multispace0, char('*')))(inp).expect("precondition failed");
let exists = UserInputLeaf::Exists { field: field_name }.into();
Ok((inp, (exists, Vec::new())))
}
fn literal(inp: &str) -> IResult<&str, UserInputAst> {
// * alone is already parsed by our caller, so if `exists` succeed, we can be confident
// something (a field name) got parsed before
alt(( alt((
map( map(
tuple((opt(field_name), alt((range, set, term_or_phrase)))), tuple((opt(field_name), alt((range, set, exists, term_or_phrase)))),
|(field_name, leaf): (Option<String>, UserInputLeaf)| leaf.set_field(field_name).into(), |(field_name, leaf): (Option<String>, UserInputLeaf)| leaf.set_field(field_name).into(),
), ),
term_group, term_group,
))(i) ))(inp)
} }
fn literal_no_group_infallible(i: &str) -> JResult<&str, Option<UserInputAst>> { fn literal_no_group_infallible(inp: &str) -> JResult<&str, Option<UserInputAst>> {
map( map(
tuple_infallible(( tuple_infallible((
opt_i(field_name), opt_i(field_name),
@@ -318,7 +349,7 @@ fn literal_no_group_infallible(i: &str) -> JResult<&str, Option<UserInputAst>> {
alt_infallible( alt_infallible(
( (
( (
value((), tuple((tag("IN"), space0, char('[')))), value((), tuple((tag("IN"), multispace0, char('[')))),
map(set_infallible, |(set, errs)| (Some(set), errs)), map(set_infallible, |(set, errs)| (Some(set), errs)),
), ),
( (
@@ -337,7 +368,7 @@ fn literal_no_group_infallible(i: &str) -> JResult<&str, Option<UserInputAst>> {
&& field_name.is_none() && field_name.is_none()
{ {
errors.push(LenientErrorInternal { errors.push(LenientErrorInternal {
pos: i.len(), pos: inp.len(),
message: "parsed possible invalid field as term".to_string(), message: "parsed possible invalid field as term".to_string(),
}); });
} }
@@ -346,7 +377,7 @@ fn literal_no_group_infallible(i: &str) -> JResult<&str, Option<UserInputAst>> {
&& field_name.is_none() && field_name.is_none()
{ {
errors.push(LenientErrorInternal { errors.push(LenientErrorInternal {
pos: i.len(), pos: inp.len(),
message: "parsed keyword NOT as term. It should be quoted".to_string(), message: "parsed keyword NOT as term. It should be quoted".to_string(),
}); });
} }
@@ -355,34 +386,40 @@ fn literal_no_group_infallible(i: &str) -> JResult<&str, Option<UserInputAst>> {
errors, errors,
) )
}, },
)(i) )(inp)
} }
fn literal_infallible(i: &str) -> JResult<&str, Option<UserInputAst>> { fn literal_infallible(inp: &str) -> JResult<&str, Option<UserInputAst>> {
alt_infallible( alt_infallible(
(( (
term_group_precond, (
map(term_group_infallible, |(group, errs)| (Some(group), errs)), term_group_precond,
),), map(term_group_infallible, |(group, errs)| (Some(group), errs)),
),
(
exists_precond,
map(exists_infallible, |(exists, errs)| (Some(exists), errs)),
),
),
literal_no_group_infallible, literal_no_group_infallible,
)(i) )(inp)
} }
fn slop_or_prefix_val(i: &str) -> JResult<&str, (u32, bool)> { fn slop_or_prefix_val(inp: &str) -> JResult<&str, (u32, bool)> {
map( map(
opt_i(alt(( opt_i(alt((
value((0, true), char('*')), value((0, true), char('*')),
map(preceded(char('~'), u32), |slop| (slop, false)), map(preceded(char('~'), u32), |slop| (slop, false)),
))), ))),
|(slop_or_prefix_opt, err)| (slop_or_prefix_opt.unwrap_or_default(), err), |(slop_or_prefix_opt, err)| (slop_or_prefix_opt.unwrap_or_default(), err),
)(i) )(inp)
} }
/// Function that parses a range out of a Stream /// Function that parses a range out of a Stream
/// Supports ranges like: /// Supports ranges like:
/// [5 TO 10], {5 TO 10}, [* TO 10], [10 TO *], {10 TO *], >5, <=10 /// [5 TO 10], {5 TO 10}, [* TO 10], [10 TO *], {10 TO *], >5, <=10
/// [a TO *], [a TO c], [abc TO bcd} /// [a TO *], [a TO c], [abc TO bcd}
fn range(i: &str) -> IResult<&str, UserInputLeaf> { fn range(inp: &str) -> IResult<&str, UserInputLeaf> {
let range_term_val = || { let range_term_val = || {
map( map(
alt((negative_number, relaxed_word, tag("*"))), alt((negative_number, relaxed_word, tag("*"))),
@@ -393,8 +430,8 @@ fn range(i: &str) -> IResult<&str, UserInputLeaf> {
// check for unbounded range in the form of <5, <=10, >5, >=5 // check for unbounded range in the form of <5, <=10, >5, >=5
let elastic_unbounded_range = map( let elastic_unbounded_range = map(
tuple(( tuple((
preceded(space0, alt((tag(">="), tag("<="), tag("<"), tag(">")))), preceded(multispace0, alt((tag(">="), tag("<="), tag("<"), tag(">")))),
preceded(space0, range_term_val()), preceded(multispace0, range_term_val()),
)), )),
|(comparison_sign, bound)| match comparison_sign { |(comparison_sign, bound)| match comparison_sign {
">=" => (UserInputBound::Inclusive(bound), UserInputBound::Unbounded), ">=" => (UserInputBound::Inclusive(bound), UserInputBound::Unbounded),
@@ -407,7 +444,7 @@ fn range(i: &str) -> IResult<&str, UserInputLeaf> {
); );
let lower_bound = map( let lower_bound = map(
separated_pair(one_of("{["), space0, range_term_val()), separated_pair(one_of("{["), multispace0, range_term_val()),
|(boundary_char, lower_bound)| { |(boundary_char, lower_bound)| {
if lower_bound == "*" { if lower_bound == "*" {
UserInputBound::Unbounded UserInputBound::Unbounded
@@ -420,7 +457,7 @@ fn range(i: &str) -> IResult<&str, UserInputLeaf> {
); );
let upper_bound = map( let upper_bound = map(
separated_pair(range_term_val(), space0, one_of("}]")), separated_pair(range_term_val(), multispace0, one_of("}]")),
|(upper_bound, boundary_char)| { |(upper_bound, boundary_char)| {
if upper_bound == "*" { if upper_bound == "*" {
UserInputBound::Unbounded UserInputBound::Unbounded
@@ -432,8 +469,11 @@ fn range(i: &str) -> IResult<&str, UserInputLeaf> {
}, },
); );
let lower_to_upper = let lower_to_upper = separated_pair(
separated_pair(lower_bound, tuple((space1, tag("TO"), space1)), upper_bound); lower_bound,
tuple((multispace1, tag("TO"), multispace1)),
upper_bound,
);
map( map(
alt((elastic_unbounded_range, lower_to_upper)), alt((elastic_unbounded_range, lower_to_upper)),
@@ -442,10 +482,10 @@ fn range(i: &str) -> IResult<&str, UserInputLeaf> {
lower, lower,
upper, upper,
}, },
)(i) )(inp)
} }
fn range_infallible(i: &str) -> JResult<&str, UserInputLeaf> { fn range_infallible(inp: &str) -> JResult<&str, UserInputLeaf> {
let lower_to_upper = map( let lower_to_upper = map(
tuple_infallible(( tuple_infallible((
opt_i(anychar), opt_i(anychar),
@@ -453,13 +493,16 @@ fn range_infallible(i: &str) -> JResult<&str, UserInputLeaf> {
word_infallible("]}"), word_infallible("]}"),
space1_infallible, space1_infallible,
opt_i_err( opt_i_err(
terminated(tag("TO"), alt((value((), space1), value((), eof)))), terminated(tag("TO"), alt((value((), multispace1), value((), eof)))),
"missing keyword TO", "missing keyword TO",
), ),
word_infallible("]}"), word_infallible("]}"),
opt_i_err(one_of("]}"), "missing range delimiter"), opt_i_err(one_of("]}"), "missing range delimiter"),
)), )),
|((lower_bound_kind, _space0, lower, _space1, to, upper, upper_bound_kind), errs)| { |(
(lower_bound_kind, _multispace0, lower, _multispace1, to, upper, upper_bound_kind),
errs,
)| {
let lower_bound = match (lower_bound_kind, lower) { let lower_bound = match (lower_bound_kind, lower) {
(_, Some("*")) => UserInputBound::Unbounded, (_, Some("*")) => UserInputBound::Unbounded,
(_, None) => UserInputBound::Unbounded, (_, None) => UserInputBound::Unbounded,
@@ -553,16 +596,16 @@ fn range_infallible(i: &str) -> JResult<&str, UserInputLeaf> {
errors, errors,
) )
}, },
)(i) )(inp)
} }
fn set(i: &str) -> IResult<&str, UserInputLeaf> { fn set(inp: &str) -> IResult<&str, UserInputLeaf> {
map( map(
preceded( preceded(
tuple((space0, tag("IN"), space1)), tuple((multispace0, tag("IN"), multispace1)),
delimited( delimited(
tuple((char('['), space0)), tuple((char('['), multispace0)),
separated_list0(space1, map(simple_term, |(_, term)| term)), separated_list0(multispace1, map(simple_term, |(_, term)| term)),
char(']'), char(']'),
), ),
), ),
@@ -570,10 +613,10 @@ fn set(i: &str) -> IResult<&str, UserInputLeaf> {
field: None, field: None,
elements, elements,
}, },
)(i) )(inp)
} }
fn set_infallible(mut i: &str) -> JResult<&str, UserInputLeaf> { fn set_infallible(mut inp: &str) -> JResult<&str, UserInputLeaf> {
// `IN [` has already been parsed when we enter, we only need to parse simple terms until we // `IN [` has already been parsed when we enter, we only need to parse simple terms until we
// find a `]` // find a `]`
let mut elements = Vec::new(); let mut elements = Vec::new();
@@ -584,41 +627,41 @@ fn set_infallible(mut i: &str) -> JResult<&str, UserInputLeaf> {
first_round = false; first_round = false;
Vec::new() Vec::new()
} else { } else {
let (rest, (_, err)) = space1_infallible(i)?; let (rest, (_, err)) = space1_infallible(inp)?;
i = rest; inp = rest;
err err
}; };
if i.is_empty() { if inp.is_empty() {
// TODO push error about missing ] // TODO push error about missing ]
// //
errs.push(LenientErrorInternal { errs.push(LenientErrorInternal {
pos: i.len(), pos: inp.len(),
message: "missing ]".to_string(), message: "missing ]".to_string(),
}); });
let res = UserInputLeaf::Set { let res = UserInputLeaf::Set {
field: None, field: None,
elements, elements,
}; };
return Ok((i, (res, errs))); return Ok((inp, (res, errs)));
} }
if let Some(i) = i.strip_prefix(']') { if let Some(inp) = inp.strip_prefix(']') {
let res = UserInputLeaf::Set { let res = UserInputLeaf::Set {
field: None, field: None,
elements, elements,
}; };
return Ok((i, (res, errs))); return Ok((inp, (res, errs)));
} }
errs.append(&mut space_error); errs.append(&mut space_error);
// TODO // TODO
// here we do the assumption term_or_phrase_infallible always consume something if the // here we do the assumption term_or_phrase_infallible always consume something if the
// first byte is not `)` or ' '. If it did not, we would end up looping. // first byte is not `)` or ' '. If it did not, we would end up looping.
let (rest, (delim_term, mut err)) = simple_term_infallible("]")(i)?; let (rest, (delim_term, mut err)) = simple_term_infallible("]")(inp)?;
errs.append(&mut err); errs.append(&mut err);
if let Some((_, term)) = delim_term { if let Some((_, term)) = delim_term {
elements.push(term); elements.push(term);
} }
i = rest; inp = rest;
} }
} }
@@ -626,16 +669,16 @@ fn negate(expr: UserInputAst) -> UserInputAst {
expr.unary(Occur::MustNot) expr.unary(Occur::MustNot)
} }
fn leaf(i: &str) -> IResult<&str, UserInputAst> { fn leaf(inp: &str) -> IResult<&str, UserInputAst> {
alt(( alt((
delimited(char('('), ast, char(')')), delimited(char('('), ast, char(')')),
map(char('*'), |_| UserInputAst::from(UserInputLeaf::All)), map(char('*'), |_| UserInputAst::from(UserInputLeaf::All)),
map(preceded(tuple((tag("NOT"), space1)), leaf), negate), map(preceded(tuple((tag("NOT"), multispace1)), leaf), negate),
literal, literal,
))(i) ))(inp)
} }
fn leaf_infallible(i: &str) -> JResult<&str, Option<UserInputAst>> { fn leaf_infallible(inp: &str) -> JResult<&str, Option<UserInputAst>> {
alt_infallible( alt_infallible(
( (
( (
@@ -665,23 +708,23 @@ fn leaf_infallible(i: &str) -> JResult<&str, Option<UserInputAst>> {
), ),
), ),
literal_infallible, literal_infallible,
)(i) )(inp)
} }
fn positive_float_number(i: &str) -> IResult<&str, f64> { fn positive_float_number(inp: &str) -> IResult<&str, f64> {
map( map(
recognize(tuple((digit1, opt(tuple((char('.'), digit1)))))), recognize(tuple((digit1, opt(tuple((char('.'), digit1)))))),
// TODO this is actually dangerous if the number is actually not representable as a f64 // TODO this is actually dangerous if the number is actually not representable as a f64
// (too big for instance) // (too big for instance)
|float_str: &str| float_str.parse::<f64>().unwrap(), |float_str: &str| float_str.parse::<f64>().unwrap(),
)(i) )(inp)
} }
fn boost(i: &str) -> JResult<&str, Option<f64>> { fn boost(inp: &str) -> JResult<&str, Option<f64>> {
opt_i(preceded(char('^'), positive_float_number))(i) opt_i(preceded(char('^'), positive_float_number))(inp)
} }
fn boosted_leaf(i: &str) -> IResult<&str, UserInputAst> { fn boosted_leaf(inp: &str) -> IResult<&str, UserInputAst> {
map( map(
tuple((leaf, fallible(boost))), tuple((leaf, fallible(boost))),
|(leaf, boost_opt)| match boost_opt { |(leaf, boost_opt)| match boost_opt {
@@ -690,10 +733,10 @@ fn boosted_leaf(i: &str) -> IResult<&str, UserInputAst> {
} }
_ => leaf, _ => leaf,
}, },
)(i) )(inp)
} }
fn boosted_leaf_infallible(i: &str) -> JResult<&str, Option<UserInputAst>> { fn boosted_leaf_infallible(inp: &str) -> JResult<&str, Option<UserInputAst>> {
map( map(
tuple_infallible((leaf_infallible, boost)), tuple_infallible((leaf_infallible, boost)),
|((leaf, boost_opt), error)| match boost_opt { |((leaf, boost_opt), error)| match boost_opt {
@@ -703,30 +746,30 @@ fn boosted_leaf_infallible(i: &str) -> JResult<&str, Option<UserInputAst>> {
), ),
_ => (leaf, error), _ => (leaf, error),
}, },
)(i) )(inp)
} }
fn occur_symbol(i: &str) -> JResult<&str, Option<Occur>> { fn occur_symbol(inp: &str) -> JResult<&str, Option<Occur>> {
opt_i(alt(( opt_i(alt((
value(Occur::MustNot, char('-')), value(Occur::MustNot, char('-')),
value(Occur::Must, char('+')), value(Occur::Must, char('+')),
)))(i) )))(inp)
} }
fn occur_leaf(i: &str) -> IResult<&str, (Option<Occur>, UserInputAst)> { fn occur_leaf(inp: &str) -> IResult<&str, (Option<Occur>, UserInputAst)> {
tuple((fallible(occur_symbol), boosted_leaf))(i) tuple((fallible(occur_symbol), boosted_leaf))(inp)
} }
#[allow(clippy::type_complexity)] #[allow(clippy::type_complexity)]
fn operand_occur_leaf_infallible( fn operand_occur_leaf_infallible(
i: &str, inp: &str,
) -> JResult<&str, (Option<BinaryOperand>, Option<Occur>, Option<UserInputAst>)> { ) -> JResult<&str, (Option<BinaryOperand>, Option<Occur>, Option<UserInputAst>)> {
// TODO maybe this should support multiple chained AND/OR, and "fuse" them? // TODO maybe this should support multiple chained AND/OR, and "fuse" them?
tuple_infallible(( tuple_infallible((
delimited_infallible(nothing, opt_i(binary_operand), space0_infallible), delimited_infallible(nothing, opt_i(binary_operand), space0_infallible),
occur_symbol, occur_symbol,
boosted_leaf_infallible, boosted_leaf_infallible,
))(i) ))(inp)
} }
#[derive(Clone, Copy, Debug, PartialEq, Eq)] #[derive(Clone, Copy, Debug, PartialEq, Eq)]
@@ -735,35 +778,31 @@ enum BinaryOperand {
And, And,
} }
fn binary_operand(i: &str) -> IResult<&str, BinaryOperand> { fn binary_operand(inp: &str) -> IResult<&str, BinaryOperand> {
alt(( alt((
value(BinaryOperand::And, tag("AND ")), value(BinaryOperand::And, tag("AND ")),
value(BinaryOperand::Or, tag("OR ")), value(BinaryOperand::Or, tag("OR ")),
))(i) ))(inp)
} }
fn aggregate_binary_expressions( fn aggregate_binary_expressions(
left: UserInputAst, left: (Option<Occur>, UserInputAst),
others: Vec<(BinaryOperand, UserInputAst)>, others: Vec<(Option<BinaryOperand>, Option<Occur>, UserInputAst)>,
) -> UserInputAst { ) -> Result<UserInputAst, LenientErrorInternal> {
let mut dnf: Vec<Vec<UserInputAst>> = vec![vec![left]]; let mut leafs = Vec::with_capacity(others.len() + 1);
for (operator, operand_ast) in others { leafs.push((None, left.0, Some(left.1)));
match operator { leafs.extend(
BinaryOperand::And => { others
if let Some(last) = dnf.last_mut() { .into_iter()
last.push(operand_ast); .map(|(operand, occur, ast)| (operand, occur, Some(ast))),
} );
} // the parameters we pass should statically guarantee we can't get errors
BinaryOperand::Or => { // (no prefix BinaryOperand is provided)
dnf.push(vec![operand_ast]); let (res, mut errors) = aggregate_infallible_expressions(leafs);
} if errors.is_empty() {
} Ok(res)
}
if dnf.len() == 1 {
UserInputAst::and(dnf.into_iter().next().unwrap()) //< safe
} else { } else {
let conjunctions = dnf.into_iter().map(UserInputAst::and).collect(); Err(errors.swap_remove(0))
UserInputAst::or(conjunctions)
} }
} }
@@ -779,30 +818,10 @@ fn aggregate_infallible_expressions(
return (UserInputAst::empty_query(), err); return (UserInputAst::empty_query(), err);
} }
let use_operand = leafs.iter().any(|(operand, _, _)| operand.is_some());
let all_operand = leafs
.iter()
.skip(1)
.all(|(operand, _, _)| operand.is_some());
let early_operand = leafs let early_operand = leafs
.iter() .iter()
.take(1) .take(1)
.all(|(operand, _, _)| operand.is_some()); .all(|(operand, _, _)| operand.is_some());
let use_occur = leafs.iter().any(|(_, occur, _)| occur.is_some());
if use_operand && use_occur {
err.push(LenientErrorInternal {
pos: 0,
message: "Use of mixed occur and boolean operator".to_string(),
});
}
if use_operand && !all_operand {
err.push(LenientErrorInternal {
pos: 0,
message: "Missing boolean operator".to_string(),
});
}
if early_operand { if early_operand {
err.push(LenientErrorInternal { err.push(LenientErrorInternal {
@@ -829,7 +848,15 @@ fn aggregate_infallible_expressions(
Some(BinaryOperand::And) => Some(Occur::Must), Some(BinaryOperand::And) => Some(Occur::Must),
_ => Some(Occur::Should), _ => Some(Occur::Should),
}; };
clauses.push(vec![(occur.or(default_op), ast.clone())]); if occur == &Some(Occur::MustNot) && default_op == Some(Occur::Should) {
// if occur is MustNot *and* operation is OR, we synthesize a ShouldNot
clauses.push(vec![(
Some(Occur::Should),
ast.clone().unary(Occur::MustNot),
)])
} else {
clauses.push(vec![(occur.or(default_op), ast.clone())]);
}
} }
None => { None => {
let default_op = match next_operator { let default_op = match next_operator {
@@ -837,7 +864,15 @@ fn aggregate_infallible_expressions(
Some(BinaryOperand::Or) => Some(Occur::Should), Some(BinaryOperand::Or) => Some(Occur::Should),
None => None, None => None,
}; };
clauses.push(vec![(occur.or(default_op), ast.clone())]) if occur == &Some(Occur::MustNot) && default_op == Some(Occur::Should) {
// if occur is MustNot *and* operation is OR, we synthesize a ShouldNot
clauses.push(vec![(
Some(Occur::Should),
ast.clone().unary(Occur::MustNot),
)])
} else {
clauses.push(vec![(occur.or(default_op), ast.clone())])
}
} }
} }
} }
@@ -854,7 +889,12 @@ fn aggregate_infallible_expressions(
} }
} }
Some(BinaryOperand::Or) => { Some(BinaryOperand::Or) => {
clauses.push(vec![(last_occur.or(Some(Occur::Should)), last_ast)]); if last_occur == Some(Occur::MustNot) {
// if occur is MustNot *and* operation is OR, we synthesize a ShouldNot
clauses.push(vec![(Some(Occur::Should), last_ast.unary(Occur::MustNot))]);
} else {
clauses.push(vec![(last_occur.or(Some(Occur::Should)), last_ast)]);
}
} }
None => clauses.push(vec![(last_occur, last_ast)]), None => clauses.push(vec![(last_occur, last_ast)]),
} }
@@ -880,38 +920,32 @@ fn aggregate_infallible_expressions(
} }
} }
fn operand_leaf(i: &str) -> IResult<&str, (BinaryOperand, UserInputAst)> { fn operand_leaf(inp: &str) -> IResult<&str, (Option<BinaryOperand>, Option<Occur>, UserInputAst)> {
tuple(( map(
terminated(binary_operand, space0), tuple((
terminated(boosted_leaf, space0), terminated(opt(binary_operand), multispace0),
))(i) terminated(occur_leaf, multispace0),
)),
|(operand, (occur, ast))| (operand, occur, ast),
)(inp)
} }
fn ast(i: &str) -> IResult<&str, UserInputAst> { fn ast(inp: &str) -> IResult<&str, UserInputAst> {
let boolean_expr = map( let boolean_expr = map_res(
separated_pair(boosted_leaf, space1, many1(operand_leaf)), separated_pair(occur_leaf, multispace1, many1(operand_leaf)),
|(left, right)| aggregate_binary_expressions(left, right), |(left, right)| aggregate_binary_expressions(left, right),
); );
let whitespace_separated_leaves = map(separated_list1(space1, occur_leaf), |subqueries| { let single_leaf = map(occur_leaf, |(occur, ast)| {
if subqueries.len() == 1 { if occur == Some(Occur::MustNot) {
let (occur_opt, ast) = subqueries.into_iter().next().unwrap(); ast.unary(Occur::MustNot)
match occur_opt.unwrap_or(Occur::Should) {
Occur::Must | Occur::Should => ast,
Occur::MustNot => UserInputAst::Clause(vec![(Some(Occur::MustNot), ast)]),
}
} else { } else {
UserInputAst::Clause(subqueries.into_iter().collect()) ast
} }
}); });
delimited(multispace0, alt((boolean_expr, single_leaf)), multispace0)(inp)
delimited(
space0,
alt((boolean_expr, whitespace_separated_leaves)),
space0,
)(i)
} }
fn ast_infallible(i: &str) -> JResult<&str, UserInputAst> { fn ast_infallible(inp: &str) -> JResult<&str, UserInputAst> {
// ast() parse either `term AND term OR term` or `+term term -term` // ast() parse either `term AND term OR term` or `+term term -term`
// both are locally ambiguous, and as we allow error, it's hard to permit backtracking. // both are locally ambiguous, and as we allow error, it's hard to permit backtracking.
// Instead, we allow a mix of both syntaxes, trying to make sense of what a user meant. // Instead, we allow a mix of both syntaxes, trying to make sense of what a user meant.
@@ -928,13 +962,13 @@ fn ast_infallible(i: &str) -> JResult<&str, UserInputAst> {
}, },
); );
delimited_infallible(space0_infallible, expression, space0_infallible)(i) delimited_infallible(space0_infallible, expression, space0_infallible)(inp)
} }
pub fn parse_to_ast(i: &str) -> IResult<&str, UserInputAst> { pub fn parse_to_ast(inp: &str) -> IResult<&str, UserInputAst> {
map(delimited(space0, opt(ast), eof), |opt_ast| { map(delimited(multispace0, opt(ast), eof), |opt_ast| {
rewrite_ast(opt_ast.unwrap_or_else(UserInputAst::empty_query)) rewrite_ast(opt_ast.unwrap_or_else(UserInputAst::empty_query))
})(i) })(inp)
} }
pub fn parse_to_ast_lenient(query_str: &str) -> (UserInputAst, Vec<LenientError>) { pub fn parse_to_ast_lenient(query_str: &str) -> (UserInputAst, Vec<LenientError>) {
@@ -1076,6 +1110,9 @@ mod test {
test_parse_query_to_ast_helper("'www-form-encoded'", "'www-form-encoded'"); test_parse_query_to_ast_helper("'www-form-encoded'", "'www-form-encoded'");
test_parse_query_to_ast_helper("www-form-encoded", "www-form-encoded"); test_parse_query_to_ast_helper("www-form-encoded", "www-form-encoded");
test_parse_query_to_ast_helper("www-form-encoded", "www-form-encoded"); test_parse_query_to_ast_helper("www-form-encoded", "www-form-encoded");
test_parse_query_to_ast_helper("mr james bo?d", "(*mr *james *bo?d)");
test_parse_query_to_ast_helper("mr james bo*", "(*mr *james *bo*)");
test_parse_query_to_ast_helper("mr james b*d", "(*mr *james *b*d)");
} }
#[test] #[test]
@@ -1105,24 +1142,43 @@ mod test {
#[test] #[test]
fn test_parse_query_to_ast_binary_op() { fn test_parse_query_to_ast_binary_op() {
test_parse_query_to_ast_helper("a AND b", "(+a +b)"); test_parse_query_to_ast_helper("a AND b", "(+a +b)");
test_parse_query_to_ast_helper("a\nAND b", "(+a +b)");
test_parse_query_to_ast_helper("a OR b", "(?a ?b)"); test_parse_query_to_ast_helper("a OR b", "(?a ?b)");
test_parse_query_to_ast_helper("a OR b AND c", "(?a ?(+b +c))"); test_parse_query_to_ast_helper("a OR b AND c", "(?a ?(+b +c))");
test_parse_query_to_ast_helper("a AND b AND c", "(+a +b +c)"); test_parse_query_to_ast_helper("a AND b AND c", "(+a +b +c)");
test_is_parse_err("a OR b aaa", "(?a ?b *aaa)"); test_parse_query_to_ast_helper("a OR b aaa", "(?a ?b *aaa)");
test_is_parse_err("a AND b aaa", "(?(+a +b) *aaa)"); test_parse_query_to_ast_helper("a AND b aaa", "(?(+a +b) *aaa)");
test_is_parse_err("aaa a OR b ", "(*aaa ?a ?b)"); test_parse_query_to_ast_helper("aaa a OR b ", "(*aaa ?a ?b)");
test_is_parse_err("aaa ccc a OR b ", "(*aaa *ccc ?a ?b)"); test_parse_query_to_ast_helper("aaa ccc a OR b ", "(*aaa *ccc ?a ?b)");
test_is_parse_err("aaa a AND b ", "(*aaa ?(+a +b))"); test_parse_query_to_ast_helper("aaa a AND b ", "(*aaa ?(+a +b))");
test_is_parse_err("aaa ccc a AND b ", "(*aaa *ccc ?(+a +b))"); test_parse_query_to_ast_helper("aaa ccc a AND b ", "(*aaa *ccc ?(+a +b))");
} }
#[test] #[test]
fn test_parse_mixed_bool_occur() { fn test_parse_mixed_bool_occur() {
test_is_parse_err("a OR b +aaa", "(?a ?b +aaa)"); test_parse_query_to_ast_helper("+a OR +b", "(+a +b)");
test_is_parse_err("a AND b -aaa", "(?(+a +b) -aaa)");
test_is_parse_err("+a OR +b aaa", "(+a +b *aaa)"); test_parse_query_to_ast_helper("a AND -b", "(+a -b)");
test_is_parse_err("-a AND -b aaa", "(?(-a -b) *aaa)"); test_parse_query_to_ast_helper("-a AND b", "(-a +b)");
test_is_parse_err("-aaa +ccc -a OR b ", "(-aaa +ccc -a ?b)"); test_parse_query_to_ast_helper("a AND NOT b", "(+a +(-b))");
test_parse_query_to_ast_helper("NOT a AND b", "(+(-a) +b)");
test_parse_query_to_ast_helper("a AND NOT b AND c", "(+a +(-b) +c)");
test_parse_query_to_ast_helper("a AND -b AND c", "(+a -b +c)");
test_parse_query_to_ast_helper("a OR -b", "(?a ?(-b))");
test_parse_query_to_ast_helper("-a OR b", "(?(-a) ?b)");
test_parse_query_to_ast_helper("a OR NOT b", "(?a ?(-b))");
test_parse_query_to_ast_helper("NOT a OR b", "(?(-a) ?b)");
test_parse_query_to_ast_helper("a OR NOT b OR c", "(?a ?(-b) ?c)");
test_parse_query_to_ast_helper("a OR -b OR c", "(?a ?(-b) ?c)");
test_parse_query_to_ast_helper("a OR b +aaa", "(?a ?b +aaa)");
test_parse_query_to_ast_helper("a AND b -aaa", "(?(+a +b) -aaa)");
test_parse_query_to_ast_helper("+a OR +b aaa", "(+a +b *aaa)");
test_parse_query_to_ast_helper("-a AND -b aaa", "(?(-a -b) *aaa)");
test_parse_query_to_ast_helper("-aaa +ccc -a OR b ", "(-aaa +ccc ?(-a) ?b)");
} }
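The updated expectations above hinge on the rewrite added to `aggregate_infallible_expressions`: when a `-` (MustNot) occur meets an OR operand, the leaf is negated and pushed under a Should slot, so `-a OR b` renders as `(?(-a) ?b)` instead of producing an error. A hedged sketch of that rewrite in isolation, using the types from this file (not a new helper in the crate, just the rule spelled out):

fn or_clause_entry(occur: Option<Occur>, ast: UserInputAst) -> (Option<Occur>, UserInputAst) {
    if occur == Some(Occur::MustNot) {
        // `-x OR y`: keep the OR slot positive and wrap the negation inside it.
        (Some(Occur::Should), ast.unary(Occur::MustNot))
    } else {
        (occur.or(Some(Occur::Should)), ast)
    }
}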
#[test] #[test]
@@ -1538,6 +1594,17 @@ mod test {
test_parse_query_to_ast_helper("foo:\"\"*", "\"foo\":\"\"*"); test_parse_query_to_ast_helper("foo:\"\"*", "\"foo\":\"\"*");
} }
#[test]
fn test_exist_query() {
test_parse_query_to_ast_helper("a:*", "\"a\":*");
test_parse_query_to_ast_helper("a: *", "\"a\":*");
// an exists query followed by the default term b
test_is_parse_err("a:*b", "(*\"a\":* *b)");
// this is a term query (not a phrase prefix)
test_parse_query_to_ast_helper("a:b*", "\"a\":b*");
}
#[test] #[test]
fn test_not_queries_are_consistent() { fn test_not_queries_are_consistent() {
test_parse_query_to_ast_helper("tata -toto", "(*tata -toto)"); test_parse_query_to_ast_helper("tata -toto", "(*tata -toto)");

View File

@@ -16,6 +16,9 @@ pub enum UserInputLeaf {
field: Option<String>, field: Option<String>,
elements: Vec<String>, elements: Vec<String>,
}, },
Exists {
field: String,
},
} }
impl UserInputLeaf { impl UserInputLeaf {
@@ -36,6 +39,9 @@ impl UserInputLeaf {
upper, upper,
}, },
UserInputLeaf::Set { field: _, elements } => UserInputLeaf::Set { field, elements }, UserInputLeaf::Set { field: _, elements } => UserInputLeaf::Set { field, elements },
UserInputLeaf::Exists { field: _ } => UserInputLeaf::Exists {
field: field.expect("Exist query without a field isn't allowed"),
},
} }
} }
} }
@@ -74,6 +80,9 @@ impl Debug for UserInputLeaf {
write!(formatter, "]") write!(formatter, "]")
} }
UserInputLeaf::All => write!(formatter, "*"), UserInputLeaf::All => write!(formatter, "*"),
UserInputLeaf::Exists { field } => {
write!(formatter, "\"{field}\":*")
}
} }
} }
} }
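A hedged sketch of how the new variant behaves, matching the `set_field` and `Debug` implementations above (the field name is illustrative):

#[test]
fn exists_leaf_roundtrip() {
    // The parser creates the leaf with an empty field and fills it in later
    // via `set_field`; the Debug rendering mirrors the `field:*` query syntax.
    let leaf = UserInputLeaf::Exists { field: String::new() }
        .set_field(Some("attributes".to_string()));
    assert_eq!(format!("{leaf:?}"), "\"attributes\":*");
}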

View File

@@ -48,7 +48,7 @@ mod bench {
let score_field_f64 = schema_builder.add_f64_field("score_f64", score_fieldtype.clone()); let score_field_f64 = schema_builder.add_f64_field("score_f64", score_fieldtype.clone());
let score_field_i64 = schema_builder.add_i64_field("score_i64", score_fieldtype); let score_field_i64 = schema_builder.add_i64_field("score_i64", score_fieldtype);
let index = Index::create_from_tempdir(schema_builder.build())?; let index = Index::create_from_tempdir(schema_builder.build())?;
let few_terms_data = vec!["INFO", "ERROR", "WARN", "DEBUG"]; let few_terms_data = ["INFO", "ERROR", "WARN", "DEBUG"];
let lg_norm = rand_distr::LogNormal::new(2.996f64, 0.979f64).unwrap(); let lg_norm = rand_distr::LogNormal::new(2.996f64, 0.979f64).unwrap();
@@ -85,7 +85,7 @@ mod bench {
if cardinality == Cardinality::Sparse { if cardinality == Cardinality::Sparse {
doc_with_value /= 20; doc_with_value /= 20;
} }
let val_max = 1_000_000.0; let _val_max = 1_000_000.0;
for _ in 0..doc_with_value { for _ in 0..doc_with_value {
let val: f64 = rng.gen_range(0.0..1_000_000.0); let val: f64 = rng.gen_range(0.0..1_000_000.0);
let json = if rng.gen_bool(0.1) { let json = if rng.gen_bool(0.1) {
@@ -290,6 +290,41 @@ mod bench {
}); });
} }
bench_all_cardinalities!(bench_aggregation_terms_many_with_top_hits_agg);
fn bench_aggregation_terms_many_with_top_hits_agg_card(
b: &mut Bencher,
cardinality: Cardinality,
) {
let index = get_test_index_bench(cardinality).unwrap();
let reader = index.reader().unwrap();
b.iter(|| {
let agg_req: Aggregations = serde_json::from_value(json!({
"my_texts": {
"terms": { "field": "text_many_terms" },
"aggs": {
"top_hits": { "top_hits":
{
"sort": [
{ "score": "desc" }
],
"size": 2,
"doc_value_fields": ["score_f64"]
}
}
}
},
}))
.unwrap();
let collector = get_collector(agg_req);
let searcher = reader.searcher();
searcher.search(&AllQuery, &collector).unwrap()
});
}
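Outside the bench harness, the same request shape can be handed to the regular aggregation collector. A hedged sketch (the `get_collector` helper above is bench-internal; `AggregationCollector::from_aggs` with default limits is an assumption here, as is an index with fast `text_many_terms`, `score` and `score_f64` fields):

use serde_json::json;
use tantivy::aggregation::agg_req::Aggregations;
use tantivy::aggregation::AggregationCollector;
use tantivy::query::AllQuery;
use tantivy::Index;

fn run_top_hits(index: &Index) -> tantivy::Result<()> {
    let agg_req: Aggregations = serde_json::from_value(json!({
        "my_texts": {
            "terms": { "field": "text_many_terms" },
            "aggs": {
                "top_hits": {
                    "top_hits": {
                        "sort": [{ "score": "desc" }],
                        "size": 2,
                        "doc_value_fields": ["score_f64"]
                    }
                }
            }
        }
    }))
    .expect("valid aggregation request");
    // Assumed API: build a collector from the request with default limits.
    let collector = AggregationCollector::from_aggs(agg_req, Default::default());
    let reader = index.reader()?;
    let agg_res = reader.searcher().search(&AllQuery, &collector)?;
    println!("{}", serde_json::to_string_pretty(&agg_res).expect("serializable results"));
    Ok(())
}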
bench_all_cardinalities!(bench_aggregation_terms_many_with_sub_agg); bench_all_cardinalities!(bench_aggregation_terms_many_with_sub_agg);
fn bench_aggregation_terms_many_with_sub_agg_card(b: &mut Bencher, cardinality: Cardinality) { fn bench_aggregation_terms_many_with_sub_agg_card(b: &mut Bencher, cardinality: Cardinality) {

View File

@@ -73,9 +73,9 @@ impl AggregationLimits {
/// Create a new ResourceLimitGuard, that will release the memory when dropped. /// Create a new ResourceLimitGuard, that will release the memory when dropped.
pub fn new_guard(&self) -> ResourceLimitGuard { pub fn new_guard(&self) -> ResourceLimitGuard {
ResourceLimitGuard { ResourceLimitGuard {
/// The counter which is shared between the aggregations for one request. // The counter which is shared between the aggregations for one request.
memory_consumption: Arc::clone(&self.memory_consumption), memory_consumption: Arc::clone(&self.memory_consumption),
/// The memory_limit in bytes // The memory_limit in bytes
memory_limit: self.memory_limit, memory_limit: self.memory_limit,
allocated_with_the_guard: 0, allocated_with_the_guard: 0,
} }
@@ -134,3 +134,142 @@ impl Drop for ResourceLimitGuard {
.fetch_sub(self.allocated_with_the_guard, Ordering::Relaxed); .fetch_sub(self.allocated_with_the_guard, Ordering::Relaxed);
} }
} }
#[cfg(test)]
mod tests {
use crate::aggregation::tests::exec_request_with_query;
// https://github.com/quickwit-oss/quickwit/issues/3837
#[test]
fn test_agg_limits_with_empty_merge() {
use crate::aggregation::agg_req::Aggregations;
use crate::aggregation::bucket::tests::get_test_index_from_docs;
let docs = vec![
vec![r#"{ "date": "2015-01-02T00:00:00Z", "text": "bbb", "text2": "bbb" }"#],
vec![r#"{ "text": "aaa", "text2": "bbb" }"#],
];
let index = get_test_index_from_docs(false, &docs).unwrap();
{
let elasticsearch_compatible_json = json!(
{
"1": {
"terms": {"field": "text2", "min_doc_count": 0},
"aggs": {
"2":{
"date_histogram": {
"field": "date",
"fixed_interval": "1d",
"extended_bounds": {
"min": "2015-01-01T00:00:00Z",
"max": "2015-01-10T00:00:00Z"
}
}
}
}
}
}
);
let agg_req: Aggregations = serde_json::from_str(
&serde_json::to_string(&elasticsearch_compatible_json).unwrap(),
)
.unwrap();
let res = exec_request_with_query(agg_req, &index, Some(("text", "bbb"))).unwrap();
let expected_res = json!({
"1": {
"buckets": [
{
"2": {
"buckets": [
{ "doc_count": 0, "key": 1420070400000.0, "key_as_string": "2015-01-01T00:00:00Z" },
{ "doc_count": 1, "key": 1420156800000.0, "key_as_string": "2015-01-02T00:00:00Z" },
{ "doc_count": 0, "key": 1420243200000.0, "key_as_string": "2015-01-03T00:00:00Z" },
{ "doc_count": 0, "key": 1420329600000.0, "key_as_string": "2015-01-04T00:00:00Z" },
{ "doc_count": 0, "key": 1420416000000.0, "key_as_string": "2015-01-05T00:00:00Z" },
{ "doc_count": 0, "key": 1420502400000.0, "key_as_string": "2015-01-06T00:00:00Z" },
{ "doc_count": 0, "key": 1420588800000.0, "key_as_string": "2015-01-07T00:00:00Z" },
{ "doc_count": 0, "key": 1420675200000.0, "key_as_string": "2015-01-08T00:00:00Z" },
{ "doc_count": 0, "key": 1420761600000.0, "key_as_string": "2015-01-09T00:00:00Z" },
{ "doc_count": 0, "key": 1420848000000.0, "key_as_string": "2015-01-10T00:00:00Z" }
]
},
"doc_count": 1,
"key": "bbb"
}
],
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0
}
});
assert_eq!(res, expected_res);
}
}
// https://github.com/quickwit-oss/quickwit/issues/3837
#[test]
fn test_agg_limits_with_empty_data() {
use crate::aggregation::agg_req::Aggregations;
use crate::aggregation::bucket::tests::get_test_index_from_docs;
let docs = vec![vec![r#"{ "text": "aaa", "text2": "bbb" }"#]];
let index = get_test_index_from_docs(false, &docs).unwrap();
{
// Empty result since there is no doc with dates
let elasticsearch_compatible_json = json!(
{
"1": {
"terms": {"field": "text2", "min_doc_count": 0},
"aggs": {
"2":{
"date_histogram": {
"field": "date",
"fixed_interval": "1d",
"extended_bounds": {
"min": "2015-01-01T00:00:00Z",
"max": "2015-01-10T00:00:00Z"
}
}
}
}
}
}
);
let agg_req: Aggregations = serde_json::from_str(
&serde_json::to_string(&elasticsearch_compatible_json).unwrap(),
)
.unwrap();
let res = exec_request_with_query(agg_req, &index, Some(("text", "bbb"))).unwrap();
let expected_res = json!({
"1": {
"buckets": [
{
"2": {
"buckets": [
{ "doc_count": 0, "key": 1420070400000.0, "key_as_string": "2015-01-01T00:00:00Z" },
{ "doc_count": 0, "key": 1420156800000.0, "key_as_string": "2015-01-02T00:00:00Z" },
{ "doc_count": 0, "key": 1420243200000.0, "key_as_string": "2015-01-03T00:00:00Z" },
{ "doc_count": 0, "key": 1420329600000.0, "key_as_string": "2015-01-04T00:00:00Z" },
{ "doc_count": 0, "key": 1420416000000.0, "key_as_string": "2015-01-05T00:00:00Z" },
{ "doc_count": 0, "key": 1420502400000.0, "key_as_string": "2015-01-06T00:00:00Z" },
{ "doc_count": 0, "key": 1420588800000.0, "key_as_string": "2015-01-07T00:00:00Z" },
{ "doc_count": 0, "key": 1420675200000.0, "key_as_string": "2015-01-08T00:00:00Z" },
{ "doc_count": 0, "key": 1420761600000.0, "key_as_string": "2015-01-09T00:00:00Z" },
{ "doc_count": 0, "key": 1420848000000.0, "key_as_string": "2015-01-10T00:00:00Z" }
]
},
"doc_count": 0,
"key": "bbb"
}
],
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0
}
});
assert_eq!(res, expected_res);
}
}
}

View File

@@ -35,7 +35,7 @@ use super::bucket::{
}; };
use super::metric::{ use super::metric::{
AverageAggregation, CountAggregation, MaxAggregation, MinAggregation, AverageAggregation, CountAggregation, MaxAggregation, MinAggregation,
PercentilesAggregationReq, StatsAggregation, SumAggregation, PercentilesAggregationReq, StatsAggregation, SumAggregation, TopHitsAggregation,
}; };
/// The top-level aggregation request structure, which contains [`Aggregation`] and their user /// The top-level aggregation request structure, which contains [`Aggregation`] and their user
@@ -93,7 +93,12 @@ impl Aggregation {
} }
fn get_fast_field_names(&self, fast_field_names: &mut HashSet<String>) { fn get_fast_field_names(&self, fast_field_names: &mut HashSet<String>) {
fast_field_names.insert(self.agg.get_fast_field_name().to_string()); fast_field_names.extend(
self.agg
.get_fast_field_names()
.iter()
.map(|s| s.to_string()),
);
fast_field_names.extend(get_fast_field_names(&self.sub_aggregation)); fast_field_names.extend(get_fast_field_names(&self.sub_aggregation));
} }
} }
@@ -147,23 +152,27 @@ pub enum AggregationVariants {
/// Computes the sum of the extracted values. /// Computes the sum of the extracted values.
#[serde(rename = "percentiles")] #[serde(rename = "percentiles")]
Percentiles(PercentilesAggregationReq), Percentiles(PercentilesAggregationReq),
/// Finds the top k values matching some order
#[serde(rename = "top_hits")]
TopHits(TopHitsAggregation),
} }
impl AggregationVariants { impl AggregationVariants {
/// Returns the name of the field used by the aggregation. /// Returns the name of the fields used by the aggregation.
pub fn get_fast_field_name(&self) -> &str { pub fn get_fast_field_names(&self) -> Vec<&str> {
match self { match self {
AggregationVariants::Terms(terms) => terms.field.as_str(), AggregationVariants::Terms(terms) => vec![terms.field.as_str()],
AggregationVariants::Range(range) => range.field.as_str(), AggregationVariants::Range(range) => vec![range.field.as_str()],
AggregationVariants::Histogram(histogram) => histogram.field.as_str(), AggregationVariants::Histogram(histogram) => vec![histogram.field.as_str()],
AggregationVariants::DateHistogram(histogram) => histogram.field.as_str(), AggregationVariants::DateHistogram(histogram) => vec![histogram.field.as_str()],
AggregationVariants::Average(avg) => avg.field_name(), AggregationVariants::Average(avg) => vec![avg.field_name()],
AggregationVariants::Count(count) => count.field_name(), AggregationVariants::Count(count) => vec![count.field_name()],
AggregationVariants::Max(max) => max.field_name(), AggregationVariants::Max(max) => vec![max.field_name()],
AggregationVariants::Min(min) => min.field_name(), AggregationVariants::Min(min) => vec![min.field_name()],
AggregationVariants::Stats(stats) => stats.field_name(), AggregationVariants::Stats(stats) => vec![stats.field_name()],
AggregationVariants::Sum(sum) => sum.field_name(), AggregationVariants::Sum(sum) => vec![sum.field_name()],
AggregationVariants::Percentiles(per) => per.field_name(), AggregationVariants::Percentiles(per) => vec![per.field_name()],
AggregationVariants::TopHits(top_hits) => top_hits.field_names(),
} }
} }
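The switch from a single `get_fast_field_name` to `get_fast_field_names` exists because a `top_hits` aggregation can reference several fast fields at once (its sort keys and its `doc_value_fields`), while every other variant still contributes exactly one. A hedged sketch of the difference (field names invented; the exact set returned for `top_hits` is assumed to cover both lists):

use serde_json::json;
use tantivy::aggregation::agg_req::AggregationVariants;

fn main() {
    // A single-field variant still yields exactly one name...
    let avg: AggregationVariants =
        serde_json::from_value(json!({ "avg": { "field": "price" } })).unwrap();
    assert_eq!(avg.get_fast_field_names(), vec!["price"]);

    // ...while a top_hits request can reference several fast fields at once.
    let top: AggregationVariants = serde_json::from_value(json!({
        "top_hits": {
            "sort": [{ "timestamp": "desc" }],
            "size": 3,
            "doc_value_fields": ["price"]
        }
    }))
    .unwrap();
    // Assumed: the returned names cover both the sort keys and doc_value_fields.
    println!("{:?}", top.get_fast_field_names());
}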

View File

@@ -1,6 +1,9 @@
//! This will enhance the request tree with access to the fastfield and metadata.
-use columnar::{Column, ColumnBlockAccessor, ColumnType, StrColumn};
+use std::collections::HashMap;
+use std::io;
+use columnar::{Column, ColumnBlockAccessor, ColumnType, DynamicColumn, StrColumn};
use super::agg_limits::ResourceLimitGuard;
use super::agg_req::{Aggregation, AggregationVariants, Aggregations};
@@ -14,7 +17,7 @@ use super::metric::{
use super::segment_agg_result::AggregationLimits;
use super::VecWithNames;
use crate::aggregation::{f64_to_fastfield_u64, Key};
-use crate::SegmentReader;
+use crate::{SegmentOrdinal, SegmentReader};
#[derive(Default)]
pub(crate) struct AggregationsWithAccessor {
@@ -32,6 +35,7 @@ impl AggregationsWithAccessor {
}
pub struct AggregationWithAccessor {
+    pub(crate) segment_ordinal: SegmentOrdinal,
    /// In general there can be buckets without fast field access, e.g. buckets that are created
    /// based on search terms. That is not that case currently, but eventually this needs to be
    /// Option or moved.
@@ -44,10 +48,16 @@ pub struct AggregationWithAccessor {
    pub(crate) limits: ResourceLimitGuard,
    pub(crate) column_block_accessor: ColumnBlockAccessor<u64>,
    /// Used for missing term aggregation, which checks all columns for existence.
+    /// And also for `top_hits` aggregation, which may sort on multiple fields.
    /// By convention the missing aggregation is chosen, when this property is set
    /// (instead bein set in `agg`).
    /// If this needs to used by other aggregations, we need to refactor this.
-    pub(crate) accessors: Vec<Column<u64>>,
+    // NOTE: we can make all other aggregations use this instead of the `accessor` and `field_type`
+    // (making them obsolete) But will it have a performance impact?
+    pub(crate) accessors: Vec<(Column<u64>, ColumnType)>,
+    /// Map field names to all associated column accessors.
+    /// This field is used for `docvalue_fields`, which is currently only supported for `top_hits`.
+    pub(crate) value_accessors: HashMap<String, Vec<DynamicColumn>>,
    pub(crate) agg: Aggregation,
}
@@ -57,19 +67,55 @@ impl AggregationWithAccessor {
        agg: &Aggregation,
        sub_aggregation: &Aggregations,
        reader: &SegmentReader,
+        segment_ordinal: SegmentOrdinal,
        limits: AggregationLimits,
    ) -> crate::Result<Vec<AggregationWithAccessor>> {
-        let add_agg_with_accessor = |accessor: Column<u64>,
+        let mut agg = agg.clone();
+        let add_agg_with_accessor = |agg: &Aggregation,
+                                     accessor: Column<u64>,
                                     column_type: ColumnType,
                                     aggs: &mut Vec<AggregationWithAccessor>|
         -> crate::Result<()> {
            let res = AggregationWithAccessor {
+                segment_ordinal,
                accessor,
-                accessors: Vec::new(),
+                accessors: Default::default(),
+                value_accessors: Default::default(),
                field_type: column_type,
                sub_aggregation: get_aggs_with_segment_accessor_and_validate(
                    sub_aggregation,
                    reader,
+                    segment_ordinal,
+                    &limits,
+                )?,
+                agg: agg.clone(),
+                limits: limits.new_guard(),
+                missing_value_for_accessor: None,
+                str_dict_column: None,
+                column_block_accessor: Default::default(),
+            };
+            aggs.push(res);
+            Ok(())
+        };
+        let add_agg_with_accessors = |agg: &Aggregation,
+                                      accessors: Vec<(Column<u64>, ColumnType)>,
+                                      aggs: &mut Vec<AggregationWithAccessor>,
+                                      value_accessors: HashMap<String, Vec<DynamicColumn>>|
+         -> crate::Result<()> {
+            let (accessor, field_type) = accessors.first().expect("at least one accessor");
+            let res = AggregationWithAccessor {
+                segment_ordinal,
+                // TODO: We should do away with the `accessor` field altogether
+                accessor: accessor.clone(),
+                value_accessors,
+                field_type: *field_type,
+                accessors,
+                sub_aggregation: get_aggs_with_segment_accessor_and_validate(
+                    sub_aggregation,
+                    reader,
+                    segment_ordinal,
                    &limits,
                )?,
                agg: agg.clone(),
@@ -84,31 +130,36 @@ impl AggregationWithAccessor {
        let mut res: Vec<AggregationWithAccessor> = Vec::new();
        use AggregationVariants::*;
-        match &agg.agg {
+        match agg.agg {
            Range(RangeAggregation {
-                field: field_name, ..
+                field: ref field_name,
+                ..
            }) => {
                let (accessor, column_type) =
                    get_ff_reader(reader, field_name, Some(get_numeric_or_date_column_types()))?;
-                add_agg_with_accessor(accessor, column_type, &mut res)?;
+                add_agg_with_accessor(&agg, accessor, column_type, &mut res)?;
            }
            Histogram(HistogramAggregation {
-                field: field_name, ..
+                field: ref field_name,
+                ..
            }) => {
                let (accessor, column_type) =
                    get_ff_reader(reader, field_name, Some(get_numeric_or_date_column_types()))?;
-                add_agg_with_accessor(accessor, column_type, &mut res)?;
+                add_agg_with_accessor(&agg, accessor, column_type, &mut res)?;
            }
            DateHistogram(DateHistogramAggregationReq {
-                field: field_name, ..
+                field: ref field_name,
+                ..
            }) => {
                let (accessor, column_type) =
-                    get_ff_reader(reader, field_name, Some(get_numeric_or_date_column_types()))?;
-                add_agg_with_accessor(accessor, column_type, &mut res)?;
+                    // Only DateTime is supported for DateHistogram
+                    get_ff_reader(reader, field_name, Some(&[ColumnType::DateTime]))?;
+                add_agg_with_accessor(&agg, accessor, column_type, &mut res)?;
            }
            Terms(TermsAggregation {
-                field: field_name,
-                missing,
+                field: ref field_name,
+                ref missing,
                ..
            }) => {
                let str_dict_column = reader.fast_fields().str(field_name)?;
@@ -117,10 +168,10 @@ impl AggregationWithAccessor {
                    ColumnType::U64,
                    ColumnType::F64,
                    ColumnType::Str,
+                    ColumnType::DateTime,
+                    ColumnType::Bool,
                    // ColumnType::Bytes Unsupported
-                    // ColumnType::Bool Unsupported
                    // ColumnType::IpAddr Unsupported
-                    // ColumnType::DateTime Unsupported
                ];
                // In case the column is empty we want the shim column to match the missing type
@@ -145,29 +196,27 @@ impl AggregationWithAccessor {
                    .map(|m| matches!(m, Key::Str(_)))
                    .unwrap_or(false);
-                let use_special_missing_agg = missing_and_more_than_one_col || text_on_non_text_col;
+                // Actually we could convert the text to a number and have the fast path, if it is
+                // provided in Rfc3339 format. But this use case is probably common
+                // enough to justify the effort.
+                let text_on_date_col = column_and_types.len() == 1
+                    && column_and_types[0].1 == ColumnType::DateTime
+                    && missing
+                        .as_ref()
+                        .map(|m| matches!(m, Key::Str(_)))
+                        .unwrap_or(false);
+                let use_special_missing_agg =
+                    missing_and_more_than_one_col || text_on_non_text_col || text_on_date_col;
                if use_special_missing_agg {
                    let column_and_types =
                        get_all_ff_reader_or_empty(reader, field_name, None, fallback_type)?;
-                    let accessors: Vec<Column> =
-                        column_and_types.iter().map(|(a, _)| a.clone()).collect();
-                    let agg_wit_acc = AggregationWithAccessor {
-                        missing_value_for_accessor: None,
-                        accessor: accessors[0].clone(),
-                        accessors,
-                        field_type: ColumnType::U64,
-                        sub_aggregation: get_aggs_with_segment_accessor_and_validate(
-                            sub_aggregation,
-                            reader,
-                            &limits,
-                        )?,
-                        agg: agg.clone(),
-                        str_dict_column: str_dict_column.clone(),
-                        limits: limits.new_guard(),
-                        column_block_accessor: Default::default(),
-                    };
-                    res.push(agg_wit_acc);
+                    let accessors = column_and_types
+                        .iter()
+                        .map(|c_t| (c_t.0.clone(), c_t.1))
+                        .collect();
+                    add_agg_with_accessors(&agg, accessors, &mut res, Default::default())?;
                }
                for (accessor, column_type) in column_and_types {
@@ -177,21 +226,25 @@ impl AggregationWithAccessor {
                        missing.clone()
                    };
-                    let missing_value_for_accessor =
-                        if let Some(missing) = missing_value_term_agg.as_ref() {
-                            get_missing_val(column_type, missing, agg.agg.get_fast_field_name())?
-                        } else {
-                            None
-                        };
+                    let missing_value_for_accessor = if let Some(missing) =
+                        missing_value_term_agg.as_ref()
+                    {
+                        get_missing_val(column_type, missing, agg.agg.get_fast_field_names()[0])?
+                    } else {
+                        None
+                    };
                    let agg = AggregationWithAccessor {
+                        segment_ordinal,
                        missing_value_for_accessor,
                        accessor,
-                        accessors: Vec::new(),
+                        accessors: Default::default(),
+                        value_accessors: Default::default(),
                        field_type: column_type,
                        sub_aggregation: get_aggs_with_segment_accessor_and_validate(
                            sub_aggregation,
                            reader,
+                            segment_ordinal,
                            &limits,
                        )?,
                        agg: agg.clone(),
@@ -203,34 +256,63 @@ impl AggregationWithAccessor {
                }
            }
            Average(AverageAggregation {
-                field: field_name, ..
+                field: ref field_name,
+                ..
            })
            | Count(CountAggregation {
-                field: field_name, ..
+                field: ref field_name,
+                ..
            })
            | Max(MaxAggregation {
-                field: field_name, ..
+                field: ref field_name,
+                ..
            })
            | Min(MinAggregation {
-                field: field_name, ..
+                field: ref field_name,
+                ..
            })
            | Stats(StatsAggregation {
-                field: field_name, ..
+                field: ref field_name,
+                ..
            })
            | Sum(SumAggregation {
-                field: field_name, ..
+                field: ref field_name,
+                ..
            }) => {
                let (accessor, column_type) =
                    get_ff_reader(reader, field_name, Some(get_numeric_or_date_column_types()))?;
-                add_agg_with_accessor(accessor, column_type, &mut res)?;
+                add_agg_with_accessor(&agg, accessor, column_type, &mut res)?;
            }
-            Percentiles(percentiles) => {
+            Percentiles(ref percentiles) => {
                let (accessor, column_type) = get_ff_reader(
                    reader,
                    percentiles.field_name(),
                    Some(get_numeric_or_date_column_types()),
                )?;
-                add_agg_with_accessor(accessor, column_type, &mut res)?;
+                add_agg_with_accessor(&agg, accessor, column_type, &mut res)?;
+            }
+            TopHits(ref mut top_hits) => {
+                top_hits.validate_and_resolve(reader.fast_fields().columnar())?;
+                let accessors: Vec<(Column<u64>, ColumnType)> = top_hits
+                    .field_names()
+                    .iter()
+                    .map(|field| {
+                        get_ff_reader(reader, field, Some(get_numeric_or_date_column_types()))
+                    })
+                    .collect::<crate::Result<_>>()?;
+                let value_accessors = top_hits
+                    .value_field_names()
+                    .iter()
+                    .map(|field_name| {
+                        Ok((
+                            field_name.to_string(),
+                            get_dynamic_columns(reader, field_name)?,
+                        ))
+                    })
+                    .collect::<crate::Result<_>>()?;
+                add_agg_with_accessors(&agg, accessors, &mut res, value_accessors)?;
            }
        };
@@ -272,6 +354,7 @@ fn get_numeric_or_date_column_types() -> &'static [ColumnType] {
pub(crate) fn get_aggs_with_segment_accessor_and_validate(
    aggs: &Aggregations,
    reader: &SegmentReader,
+    segment_ordinal: SegmentOrdinal,
    limits: &AggregationLimits,
) -> crate::Result<AggregationsWithAccessor> {
    let mut aggss = Vec::new();
@@ -280,6 +363,7 @@ pub(crate) fn get_aggs_with_segment_accessor_and_validate(
            agg,
            agg.sub_aggregation(),
            reader,
+            segment_ordinal,
            limits.clone(),
        )?;
        for agg in aggs {
@@ -309,6 +393,19 @@ fn get_ff_reader(
    Ok(ff_field_with_type)
}
+fn get_dynamic_columns(
+    reader: &SegmentReader,
+    field_name: &str,
+) -> crate::Result<Vec<columnar::DynamicColumn>> {
+    let ff_fields = reader.fast_fields().dynamic_column_handles(field_name)?;
+    let cols = ff_fields
+        .iter()
+        .map(|h| h.open())
+        .collect::<io::Result<_>>()?;
+    assert!(!ff_fields.is_empty(), "field {} not found", field_name);
+    Ok(cols)
+}
/// Get all fast field reader or empty as default.
///
/// Is guaranteed to return at least one column.

View File

@@ -8,7 +8,7 @@ use rustc_hash::FxHashMap;
use serde::{Deserialize, Serialize};
use super::bucket::GetDocCount;
-use super::metric::{PercentilesMetricResult, SingleMetricResult, Stats};
+use super::metric::{PercentilesMetricResult, SingleMetricResult, Stats, TopHitsMetricResult};
use super::{AggregationError, Key};
use crate::TantivyError;
@@ -90,8 +90,10 @@ pub enum MetricResult {
    Stats(Stats),
    /// Sum metric result.
    Sum(SingleMetricResult),
-    /// Sum metric result.
+    /// Percentiles metric result.
    Percentiles(PercentilesMetricResult),
+    /// Top hits metric result
+    TopHits(TopHitsMetricResult),
}
impl MetricResult {
@@ -106,6 +108,9 @@ impl MetricResult {
            MetricResult::Percentiles(_) => Err(TantivyError::AggregationError(
                AggregationError::InvalidRequest("percentiles can't be used to order".to_string()),
            )),
+            MetricResult::TopHits(_) => Err(TantivyError::AggregationError(
+                AggregationError::InvalidRequest("top_hits can't be used to order".to_string()),
+            )),
        }
    }
}

View File

@@ -9,7 +9,7 @@ use crate::aggregation::tests::{get_test_index_2_segments, get_test_index_from_v
use crate::aggregation::DistributedAggregationCollector;
use crate::query::{AllQuery, TermQuery};
use crate::schema::{IndexRecordOption, Schema, FAST};
-use crate::{Index, Term};
+use crate::{Index, IndexWriter, Term};
fn get_avg_req(field_name: &str) -> Aggregation {
    serde_json::from_value(json!({
@@ -586,7 +586,10 @@ fn test_aggregation_on_json_object() {
    let json = schema_builder.add_json_field("json", FAST);
    let schema = schema_builder.build();
    let index = Index::create_in_ram(schema);
-    let mut index_writer = index.writer_for_tests().unwrap();
+    let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
+    index_writer
+        .add_document(doc!(json => json!({"color": "red"})))
+        .unwrap();
    index_writer
        .add_document(doc!(json => json!({"color": "red"})))
        .unwrap();
@@ -614,12 +617,74 @@ fn test_aggregation_on_json_object() {
        &serde_json::json!({
            "jsonagg": {
                "buckets": [
+                    {"doc_count": 2, "key": "red"},
                    {"doc_count": 1, "key": "blue"},
+                ],
+                "doc_count_error_upper_bound": 0,
+                "sum_other_doc_count": 0
+            }
+        })
+    );
+}
#[test]
fn test_aggregation_on_nested_json_object() {
let mut schema_builder = Schema::builder();
let json = schema_builder.add_json_field("json.blub", FAST);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
index_writer
.add_document(doc!(json => json!({"color.dot": "red", "color": {"nested":"red"} })))
.unwrap();
index_writer
.add_document(doc!(json => json!({"color.dot": "blue", "color": {"nested":"blue"} })))
.unwrap();
index_writer
.add_document(doc!(json => json!({"color.dot": "blue", "color": {"nested":"blue"} })))
.unwrap();
index_writer.commit().unwrap();
let reader = index.reader().unwrap();
let searcher = reader.searcher();
let agg: Aggregations = serde_json::from_value(json!({
"jsonagg1": {
"terms": {
"field": "json\\.blub.color\\.dot",
}
},
"jsonagg2": {
"terms": {
"field": "json\\.blub.color.nested",
}
}
}))
.unwrap();
let aggregation_collector = get_collector(agg);
let aggregation_results = searcher.search(&AllQuery, &aggregation_collector).unwrap();
let aggregation_res_json = serde_json::to_value(aggregation_results).unwrap();
assert_eq!(
&aggregation_res_json,
&serde_json::json!({
"jsonagg1": {
"buckets": [
{"doc_count": 2, "key": "blue"},
{"doc_count": 1, "key": "red"}
],
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0
},
"jsonagg2": {
"buckets": [
{"doc_count": 2, "key": "blue"},
{"doc_count": 1, "key": "red"} {"doc_count": 1, "key": "red"}
], ],
"doc_count_error_upper_bound": 0, "doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0 "sum_other_doc_count": 0
} }
}) })
); );
} }
@@ -630,7 +695,7 @@ fn test_aggregation_on_json_object_empty_columns() {
let json = schema_builder.add_json_field("json", FAST); let json = schema_builder.add_json_field("json", FAST);
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests().unwrap(); let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
// => Empty column when accessing color // => Empty column when accessing color
index_writer index_writer
.add_document(doc!(json => json!({"price": 10.0}))) .add_document(doc!(json => json!({"price": 10.0})))
@@ -748,13 +813,19 @@ fn test_aggregation_on_json_object_mixed_types() {
let json = schema_builder.add_json_field("json", FAST); let json = schema_builder.add_json_field("json", FAST);
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests().unwrap(); let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
// => Segment with all values numeric // => Segment with all values numeric
index_writer index_writer
.add_document(doc!(json => json!({"mixed_type": 10.0}))) .add_document(doc!(json => json!({"mixed_type": 10.0})))
.unwrap(); .unwrap();
index_writer.commit().unwrap(); index_writer.commit().unwrap();
// => Segment with all values text // => Segment with all values text
index_writer
.add_document(doc!(json => json!({"mixed_type": "blue"})))
.unwrap();
index_writer
.add_document(doc!(json => json!({"mixed_type": "blue"})))
.unwrap();
index_writer index_writer
.add_document(doc!(json => json!({"mixed_type": "blue"}))) .add_document(doc!(json => json!({"mixed_type": "blue"})))
.unwrap(); .unwrap();
@@ -766,6 +837,9 @@ fn test_aggregation_on_json_object_mixed_types() {
index_writer.commit().unwrap(); index_writer.commit().unwrap();
// => Segment with mixed values // => Segment with mixed values
index_writer
.add_document(doc!(json => json!({"mixed_type": "red"})))
.unwrap();
index_writer index_writer
.add_document(doc!(json => json!({"mixed_type": "red"}))) .add_document(doc!(json => json!({"mixed_type": "red"})))
.unwrap(); .unwrap();
@@ -811,6 +885,8 @@ fn test_aggregation_on_json_object_mixed_types() {
let aggregation_results = searcher.search(&AllQuery, &aggregation_collector).unwrap(); let aggregation_results = searcher.search(&AllQuery, &aggregation_collector).unwrap();
let aggregation_res_json = serde_json::to_value(aggregation_results).unwrap(); let aggregation_res_json = serde_json::to_value(aggregation_results).unwrap();
// pretty print as json
use pretty_assertions::assert_eq;
assert_eq!( assert_eq!(
&aggregation_res_json, &aggregation_res_json,
&serde_json::json!({ &serde_json::json!({
@@ -826,9 +902,9 @@ fn test_aggregation_on_json_object_mixed_types() {
"buckets": [ "buckets": [
{ "doc_count": 1, "key": 10.0, "min_price": { "value": 10.0 } }, { "doc_count": 1, "key": 10.0, "min_price": { "value": 10.0 } },
{ "doc_count": 1, "key": -20.5, "min_price": { "value": -20.5 } }, { "doc_count": 1, "key": -20.5, "min_price": { "value": -20.5 } },
// TODO bool is also not yet handled in aggregation { "doc_count": 2, "key": "red", "min_price": { "value": null } },
{ "doc_count": 1, "key": "blue", "min_price": { "value": null } }, { "doc_count": 2, "key": 1.0, "key_as_string": "true", "min_price": { "value": null } },
{ "doc_count": 1, "key": "red", "min_price": { "value": null } }, { "doc_count": 3, "key": "blue", "min_price": { "value": null } },
], ],
"sum_other_doc_count": 0 "sum_other_doc_count": 0
} }

View File

@@ -1,7 +1,7 @@
use serde::{Deserialize, Serialize};
use super::{HistogramAggregation, HistogramBounds};
-use crate::aggregation::AggregationError;
+use crate::aggregation::*;
/// DateHistogramAggregation is similar to `HistogramAggregation`, but it can only be used with date
/// type.
@@ -132,6 +132,7 @@ impl DateHistogramAggregationReq {
            hard_bounds: self.hard_bounds,
            extended_bounds: self.extended_bounds,
            keyed: self.keyed,
+            is_normalized_to_ns: false,
        })
    }
@@ -243,15 +244,15 @@ fn parse_into_milliseconds(input: &str) -> Result<i64, AggregationError> {
}
#[cfg(test)]
-mod tests {
+pub mod tests {
    use pretty_assertions::assert_eq;
    use super::*;
    use crate::aggregation::agg_req::Aggregations;
    use crate::aggregation::tests::exec_request;
    use crate::indexer::NoMergePolicy;
-    use crate::schema::{Schema, FAST};
-    use crate::Index;
+    use crate::schema::{Schema, FAST, STRING};
+    use crate::{Index, IndexWriter, TantivyDocument};
    #[test]
    fn test_parse_into_millisecs() {
@@ -306,7 +307,9 @@ mod tests {
    ) -> crate::Result<Index> {
        let mut schema_builder = Schema::builder();
        schema_builder.add_date_field("date", FAST);
-        schema_builder.add_text_field("text", FAST);
+        schema_builder.add_json_field("mixed", FAST);
+        schema_builder.add_text_field("text", FAST | STRING);
+        schema_builder.add_text_field("text2", FAST | STRING);
        let schema = schema_builder.build();
        let index = Index::create_in_ram(schema.clone());
        {
@@ -314,7 +317,7 @@ mod tests {
            index_writer.set_merge_policy(Box::new(NoMergePolicy));
            for values in segment_and_docs {
                for doc_str in values {
-                    let doc = schema.parse_document(doc_str)?;
+                    let doc = TantivyDocument::parse_json(&schema, doc_str)?;
                    index_writer.add_document(doc)?;
                }
                // writing the segment
@@ -326,7 +329,7 @@ mod tests {
                .searchable_segment_ids()
                .expect("Searchable segments failed.");
            if segment_ids.len() > 1 {
-                let mut index_writer = index.writer_for_tests()?;
+                let mut index_writer: IndexWriter = index.writer_for_tests()?;
                index_writer.merge(&segment_ids).wait()?;
                index_writer.wait_merging_threads()?;
            }
@@ -349,8 +352,10 @@ mod tests {
        let docs = vec![
            vec![r#"{ "date": "2015-01-01T12:10:30Z", "text": "aaa" }"#],
            vec![r#"{ "date": "2015-01-01T11:11:30Z", "text": "bbb" }"#],
+            vec![r#"{ "date": "2015-01-01T11:11:30Z", "text": "bbb" }"#],
            vec![r#"{ "date": "2015-01-02T00:00:00Z", "text": "bbb" }"#],
            vec![r#"{ "date": "2015-01-06T00:00:00Z", "text": "ccc" }"#],
+            vec![r#"{ "date": "2015-01-06T00:00:00Z", "text": "ccc" }"#],
        ];
        let index = get_test_index_from_docs(merge_segments, &docs).unwrap();
@@ -379,7 +384,7 @@ mod tests {
                {
                    "key_as_string" : "2015-01-01T00:00:00Z",
                    "key" : 1420070400000.0,
-                    "doc_count" : 4
+                    "doc_count" : 6
                }
            ]
        }
@@ -417,15 +422,15 @@ mod tests {
                {
                    "key_as_string" : "2015-01-01T00:00:00Z",
                    "key" : 1420070400000.0,
-                    "doc_count" : 4,
+                    "doc_count" : 6,
                    "texts": {
                        "buckets": [
                            {
-                                "doc_count": 2,
+                                "doc_count": 3,
                                "key": "bbb"
                            },
                            {
-                                "doc_count": 1,
+                                "doc_count": 2,
                                "key": "ccc"
                            },
                            {
@@ -464,7 +469,7 @@ mod tests {
            "sales_over_time": {
                "buckets": [
                    {
-                        "doc_count": 2,
+                        "doc_count": 3,
                        "key": 1420070400000.0,
                        "key_as_string": "2015-01-01T00:00:00Z"
                    },
@@ -489,7 +494,7 @@ mod tests {
                        "key_as_string": "2015-01-05T00:00:00Z"
                    },
                    {
-                        "doc_count": 1,
+                        "doc_count": 2,
                        "key": 1420502400000.0,
                        "key_as_string": "2015-01-06T00:00:00Z"
                    }
@@ -530,7 +535,7 @@ mod tests {
                        "key_as_string": "2014-12-31T00:00:00Z"
                    },
                    {
-                        "doc_count": 2,
+                        "doc_count": 3,
                        "key": 1420070400000.0,
                        "key_as_string": "2015-01-01T00:00:00Z"
                    },
@@ -555,7 +560,7 @@ mod tests {
                        "key_as_string": "2015-01-05T00:00:00Z"
                    },
                    {
-                        "doc_count": 1,
+                        "doc_count": 2,
                        "key": 1420502400000.0,
                        "key_as_string": "2015-01-06T00:00:00Z"
                    },

View File

@@ -20,7 +20,7 @@ use crate::aggregation::intermediate_agg_result::{
use crate::aggregation::segment_agg_result::{
    build_segment_agg_collector, AggregationLimits, SegmentAggregationCollector,
};
-use crate::aggregation::{f64_from_fastfield_u64, format_date};
+use crate::aggregation::*;
use crate::TantivyError;
/// Histogram is a bucket aggregation, where buckets are created dynamically for given `interval`.
@@ -73,6 +73,7 @@ pub struct HistogramAggregation {
    pub field: String,
    /// The interval to chunk your data range. Each bucket spans a value range of [0..interval).
    /// Must be a positive value.
+    #[serde(deserialize_with = "deserialize_f64")]
    pub interval: f64,
    /// Intervals implicitly defines an absolute grid of buckets `[interval * k, interval * (k +
    /// 1))`.
@@ -85,6 +86,7 @@ pub struct HistogramAggregation {
    /// fall into the buckets with the key 0 and 10.
    /// With offset 5 and interval 10, they would both fall into the bucket with they key 5 and the
    /// range [5..15)
+    #[serde(default, deserialize_with = "deserialize_option_f64")]
    pub offset: Option<f64>,
    /// The minimum number of documents in a bucket to be returned. Defaults to 0.
    pub min_doc_count: Option<u64>,
@@ -122,11 +124,14 @@ pub struct HistogramAggregation {
    /// Whether to return the buckets as a hash map
    #[serde(default)]
    pub keyed: bool,
+    /// Whether the values are normalized to ns for date time values. Defaults to false.
+    #[serde(default)]
+    pub is_normalized_to_ns: bool,
}
impl HistogramAggregation {
-    pub(crate) fn normalize(&mut self, column_type: ColumnType) {
-        if column_type.is_date_time() {
+    pub(crate) fn normalize_date_time(&mut self) {
+        if !self.is_normalized_to_ns {
            // values are provided in ms, but the fastfield is in nano seconds
            self.interval *= 1_000_000.0;
            self.offset = self.offset.map(|off| off * 1_000_000.0);
@@ -138,6 +143,7 @@ impl HistogramAggregation {
                min: bounds.min * 1_000_000.0,
                max: bounds.max * 1_000_000.0,
            });
+            self.is_normalized_to_ns = true;
        }
    }
@@ -370,7 +376,7 @@ impl SegmentHistogramCollector {
        Ok(IntermediateBucketResult::Histogram {
            buckets,
-            column_type: Some(self.column_type),
+            is_date_agg: self.column_type == ColumnType::DateTime,
        })
    }
@@ -381,7 +387,9 @@ impl SegmentHistogramCollector {
        accessor_idx: usize,
    ) -> crate::Result<Self> {
        req.validate()?;
-        req.normalize(field_type);
+        if field_type == ColumnType::DateTime {
+            req.normalize_date_time();
+        }
        let sub_aggregation_blueprint = if sub_aggregation.is_empty() {
            None
@@ -439,6 +447,7 @@ fn intermediate_buckets_to_final_buckets_fill_gaps(
    // memory check upfront
    let (_, first_bucket_num, last_bucket_num) =
        generate_bucket_pos_with_opt_minmax(histogram_req, min_max);
+    // It's based on user input, so we need to account for overflows
    let added_buckets = ((last_bucket_num.saturating_sub(first_bucket_num)).max(0) as u64)
        .saturating_sub(buckets.len() as u64);
@@ -482,7 +491,7 @@ fn intermediate_buckets_to_final_buckets_fill_gaps(
// Convert to BucketEntry
pub(crate) fn intermediate_histogram_buckets_to_final_buckets(
    buckets: Vec<IntermediateHistogramBucketEntry>,
-    column_type: Option<ColumnType>,
+    is_date_agg: bool,
    histogram_req: &HistogramAggregation,
    sub_aggregation: &Aggregations,
    limits: &AggregationLimits,
@@ -491,8 +500,8 @@ pub(crate) fn intermediate_histogram_buckets_to_final_buckets(
    // The request used in the the call to final is not yet be normalized.
    // Normalization is changing the precision from milliseconds to nanoseconds.
    let mut histogram_req = histogram_req.clone();
-    if let Some(column_type) = column_type {
-        histogram_req.normalize(column_type);
+    if is_date_agg {
+        histogram_req.normalize_date_time();
    }
    let mut buckets = if histogram_req.min_doc_count() == 0 {
        // With min_doc_count != 0, we may need to add buckets, so that there are no
@@ -516,7 +525,7 @@ pub(crate) fn intermediate_histogram_buckets_to_final_buckets(
    // If we have a date type on the histogram buckets, we add the `key_as_string` field as rfc339
    // and normalize from nanoseconds to milliseconds
-    if column_type == Some(ColumnType::DateTime) {
+    if is_date_agg {
        for bucket in buckets.iter_mut() {
            if let crate::aggregation::Key::F64(ref mut val) = bucket.key {
                let key_as_string = format_date(*val as i64)?;
@@ -589,10 +598,13 @@ mod tests {
    use super::*;
    use crate::aggregation::agg_req::Aggregations;
+    use crate::aggregation::agg_result::AggregationResults;
    use crate::aggregation::tests::{
        exec_request, exec_request_with_query, exec_request_with_query_and_memory_limit,
        get_test_index_2_segments, get_test_index_from_values, get_test_index_with_num_docs,
    };
+    use crate::aggregation::AggregationCollector;
+    use crate::query::AllQuery;
    #[test]
    fn histogram_test_crooked_values() -> crate::Result<()> {
@@ -1344,6 +1356,35 @@ mod tests {
            })
        );
+        Ok(())
+    }
+    #[test]
+    fn test_aggregation_histogram_empty_index() -> crate::Result<()> {
+        // test index without segments
+        let values = vec![];
+        let index = get_test_index_from_values(false, &values)?;
+        let agg_req_1: Aggregations = serde_json::from_value(json!({
+            "myhisto": {
+                "histogram": {
+                    "field": "score",
+                    "interval": 10.0
+                },
+            }
+        }))
+        .unwrap();
+        let collector = AggregationCollector::from_aggs(agg_req_1, Default::default());
+        let reader = index.reader()?;
+        let searcher = reader.searcher();
+        let agg_res: AggregationResults = searcher.search(&AllQuery, &collector).unwrap();
+        let res: Value = serde_json::from_str(&serde_json::to_string(&agg_res)?)?;
+        // Make sure the result structure is correct
+        assert_eq!(res["myhisto"]["buckets"].as_array().unwrap().len(), 0);
        Ok(())
    }
}
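Illustration, not taken from the patch above: with `deserialize_f64` on `interval` and `deserialize_option_f64` on `offset`, a histogram request can also carry those values as JSON strings. A minimal sketch assuming `tantivy` (this branch) and `serde_json` as dependencies; the aggregation name `myhisto` and the field `score` are reused from the tests above, the string-encoded values are made up.

use tantivy::aggregation::agg_req::Aggregations;

fn main() {
    // `interval` and `offset` are accepted either as numbers or as numeric strings.
    let agg: Aggregations = serde_json::from_str(
        r#"{ "myhisto": { "histogram": { "field": "score", "interval": "10", "offset": "2.5" } } }"#,
    )
    .expect("string-encoded numbers should deserialize with this change");
    assert!(agg.contains_key("myhisto"));
}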

View File

@@ -14,9 +14,7 @@ use crate::aggregation::intermediate_agg_result::{
use crate::aggregation::segment_agg_result::{
    build_segment_agg_collector, SegmentAggregationCollector,
};
-use crate::aggregation::{
-    f64_from_fastfield_u64, f64_to_fastfield_u64, format_date, Key, SerializedKey,
-};
+use crate::aggregation::*;
use crate::TantivyError;
/// Provide user-defined buckets to aggregate on.
@@ -72,11 +70,19 @@ pub struct RangeAggregationRange {
    pub key: Option<String>,
    /// The from range value, which is inclusive in the range.
    /// `None` equals to an open ended interval.
-    #[serde(skip_serializing_if = "Option::is_none", default)]
+    #[serde(
+        skip_serializing_if = "Option::is_none",
+        default,
+        deserialize_with = "deserialize_option_f64"
+    )]
    pub from: Option<f64>,
    /// The to range value, which is not inclusive in the range.
    /// `None` equals to an open ended interval.
-    #[serde(skip_serializing_if = "Option::is_none", default)]
+    #[serde(
+        skip_serializing_if = "Option::is_none",
+        default,
+        deserialize_with = "deserialize_option_f64"
+    )]
    pub to: Option<f64>,
}
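The same idea applies to ranges, shown here only as a sketch: `from` and `to` now go through `deserialize_option_f64`, so string-encoded bounds parse as well. The surrounding `range`/`ranges` request layout is the usual one for this aggregation and is not shown in this diff; the aggregation and field names are made up.

use tantivy::aggregation::agg_req::Aggregations;

fn main() {
    // Bounds given as strings deserialize into Option<f64> like plain numbers.
    let agg: Aggregations = serde_json::from_str(
        r#"{ "my_ranges": { "range": { "field": "score", "ranges": [ { "from": "3", "to": "7.5" } ] } } }"#,
    )
    .expect("string-encoded bounds should deserialize");
    assert!(agg.contains_key("my_ranges"));
}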

View File

@@ -1,6 +1,6 @@
use std::fmt::Debug;
-use columnar::{BytesColumn, ColumnType, StrColumn};
+use columnar::{BytesColumn, ColumnType, MonotonicallyMappableToU64, StrColumn};
use rustc_hash::FxHashMap;
use serde::{Deserialize, Serialize};
@@ -16,7 +16,7 @@ use crate::aggregation::intermediate_agg_result::{
use crate::aggregation::segment_agg_result::{
    build_segment_agg_collector, SegmentAggregationCollector,
};
-use crate::aggregation::{f64_from_fastfield_u64, Key};
+use crate::aggregation::{f64_from_fastfield_u64, format_date, Key};
use crate::error::DataCorruption;
use crate::TantivyError;
@@ -99,24 +99,15 @@ pub struct TermsAggregation {
    #[serde(skip_serializing_if = "Option::is_none", default)]
    pub size: Option<u32>,
-    /// Unused by tantivy.
-    ///
-    /// Since tantivy doesn't know shards, this parameter is merely there to be used by consumers
-    /// of tantivy. shard_size is the number of terms returned by each shard.
-    /// The default value in elasticsearch is size * 1.5 + 10.
-    ///
-    /// Should never be smaller than size.
-    #[serde(skip_serializing_if = "Option::is_none", default)]
-    #[serde(alias = "shard_size")]
-    pub split_size: Option<u32>,
-    /// The get more accurate results, we fetch more than `size` from each segment.
-    ///
+    /// To get more accurate results, we fetch more than `size` from each segment.
+    ///
    /// Increasing this value is will increase the cost for more accuracy.
    ///
    /// Defaults to 10 * size.
    #[serde(skip_serializing_if = "Option::is_none", default)]
-    pub segment_size: Option<u32>,
+    #[serde(alias = "segment_size")]
+    #[serde(alias = "split_size")]
+    pub shard_size: Option<u32>,
    /// If you set the `show_term_doc_count_error` parameter to true, the terms aggregation will
    /// include doc_count_error_upper_bound, which is an upper bound to the error on the
@@ -205,7 +196,7 @@ impl TermsAggregationInternal {
    pub(crate) fn from_req(req: &TermsAggregation) -> Self {
        let size = req.size.unwrap_or(10);
-        let mut segment_size = req.segment_size.unwrap_or(size * 10);
+        let mut segment_size = req.shard_size.unwrap_or(size * 10);
        let order = req.order.clone().unwrap_or_default();
        segment_size = segment_size.max(size);
@@ -256,7 +247,7 @@ pub struct SegmentTermCollector {
    term_buckets: TermBuckets,
    req: TermsAggregationInternal,
    blueprint: Option<Box<dyn SegmentAggregationCollector>>,
-    field_type: ColumnType,
+    column_type: ColumnType,
    accessor_idx: usize,
}
@@ -355,7 +346,7 @@ impl SegmentTermCollector {
        field_type: ColumnType,
        accessor_idx: usize,
    ) -> crate::Result<Self> {
-        if field_type == ColumnType::Bytes || field_type == ColumnType::Bool {
+        if field_type == ColumnType::Bytes {
            return Err(TantivyError::InvalidArgument(format!(
                "terms aggregation is not supported for column type {:?}",
                field_type
@@ -389,7 +380,7 @@ impl SegmentTermCollector {
            req: TermsAggregationInternal::from_req(req),
            term_buckets,
            blueprint,
-            field_type,
+            column_type: field_type,
            accessor_idx,
        })
    }
@@ -466,7 +457,7 @@ impl SegmentTermCollector {
            Ok(intermediate_entry)
        };
-        if self.field_type == ColumnType::Str {
+        if self.column_type == ColumnType::Str {
            let term_dict = agg_with_accessor
                .str_dict_column
                .as_ref()
@@ -531,21 +522,34 @@ impl SegmentTermCollector {
                    });
                }
            }
+        } else if self.column_type == ColumnType::DateTime {
+            for (val, doc_count) in entries {
+                let intermediate_entry = into_intermediate_bucket_entry(val, doc_count)?;
+                let val = i64::from_u64(val);
+                let date = format_date(val)?;
+                dict.insert(IntermediateKey::Str(date), intermediate_entry);
+            }
+        } else if self.column_type == ColumnType::Bool {
+            for (val, doc_count) in entries {
+                let intermediate_entry = into_intermediate_bucket_entry(val, doc_count)?;
+                let val = bool::from_u64(val);
+                dict.insert(IntermediateKey::Bool(val), intermediate_entry);
+            }
        } else {
            for (val, doc_count) in entries {
                let intermediate_entry = into_intermediate_bucket_entry(val, doc_count)?;
-                let val = f64_from_fastfield_u64(val, &self.field_type);
+                let val = f64_from_fastfield_u64(val, &self.column_type);
                dict.insert(IntermediateKey::F64(val), intermediate_entry);
            }
        };
-        Ok(IntermediateBucketResult::Terms(
-            IntermediateTermBucketResult {
+        Ok(IntermediateBucketResult::Terms {
+            buckets: IntermediateTermBucketResult {
                entries: dict,
                sum_other_doc_count,
                doc_count_error_upper_bound: term_doc_count_before_cutoff,
            },
-        ))
+        })
    }
}
@@ -583,6 +587,9 @@ pub(crate) fn cut_off_buckets<T: GetDocCount + Debug>(
#[cfg(test)]
mod tests {
+    use common::DateTime;
+    use time::{Date, Month};
    use crate::aggregation::agg_req::Aggregations;
    use crate::aggregation::tests::{
        exec_request, exec_request_with_query, exec_request_with_query_and_memory_limit,
@@ -591,7 +598,7 @@ mod tests {
    use crate::aggregation::AggregationLimits;
    use crate::indexer::NoMergePolicy;
    use crate::schema::{Schema, FAST, STRING};
-    use crate::Index;
+    use crate::{Index, IndexWriter};
    #[test]
    fn terms_aggregation_test_single_segment() -> crate::Result<()> {
@@ -1355,7 +1362,7 @@ mod tests {
    #[test]
    fn terms_aggregation_different_tokenizer_on_ff_test() -> crate::Result<()> {
-        let terms = vec!["Hello Hello", "Hallo Hallo"];
+        let terms = vec!["Hello Hello", "Hallo Hallo", "Hallo Hallo"];
        let index = get_test_index_from_terms(true, &[terms])?;
@@ -1373,7 +1380,7 @@ mod tests {
        println!("{}", serde_json::to_string_pretty(&res).unwrap());
        assert_eq!(res["my_texts"]["buckets"][0]["key"], "Hallo Hallo");
-        assert_eq!(res["my_texts"]["buckets"][0]["doc_count"], 1);
+        assert_eq!(res["my_texts"]["buckets"][0]["doc_count"], 2);
        assert_eq!(res["my_texts"]["buckets"][1]["key"], "Hello Hello");
        assert_eq!(res["my_texts"]["buckets"][1]["doc_count"], 1);
@@ -1463,7 +1470,7 @@ mod tests {
        let json = schema_builder.add_json_field("json", FAST);
        let schema = schema_builder.build();
        let index = Index::create_in_ram(schema);
-        let mut index_writer = index.writer_for_tests().unwrap();
+        let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
        // => Segment with empty json
        index_writer.add_document(doc!()).unwrap();
        index_writer.commit().unwrap();
@@ -1813,4 +1820,111 @@ mod tests {
        Ok(())
    }
#[test]
fn terms_aggregation_date() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
let date_field = schema_builder.add_date_field("date_field", FAST);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
{
let mut writer = index.writer_with_num_threads(1, 15_000_000)?;
writer.add_document(doc!(date_field=>DateTime::from_primitive(Date::from_calendar_date(1982, Month::September, 17)?.with_hms(0, 0, 0)?)))?;
writer.add_document(doc!(date_field=>DateTime::from_primitive(Date::from_calendar_date(1982, Month::September, 17)?.with_hms(0, 0, 0)?)))?;
writer.add_document(doc!(date_field=>DateTime::from_primitive(Date::from_calendar_date(1983, Month::September, 27)?.with_hms(0, 0, 0)?)))?;
writer.commit()?;
}
let agg_req: Aggregations = serde_json::from_value(json!({
"my_date": {
"terms": {
"field": "date_field"
},
}
}))
.unwrap();
let res = exec_request_with_query(agg_req, &index, None)?;
// date_field field
assert_eq!(res["my_date"]["buckets"][0]["key"], "1982-09-17T00:00:00Z");
assert_eq!(res["my_date"]["buckets"][0]["doc_count"], 2);
assert_eq!(res["my_date"]["buckets"][1]["key"], "1983-09-27T00:00:00Z");
assert_eq!(res["my_date"]["buckets"][1]["doc_count"], 1);
assert_eq!(res["my_date"]["buckets"][2]["key"], serde_json::Value::Null);
Ok(())
}
#[test]
fn terms_aggregation_date_missing() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
let date_field = schema_builder.add_date_field("date_field", FAST);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
{
let mut writer = index.writer_with_num_threads(1, 15_000_000)?;
writer.add_document(doc!(date_field=>DateTime::from_primitive(Date::from_calendar_date(1982, Month::September, 17)?.with_hms(0, 0, 0)?)))?;
writer.add_document(doc!(date_field=>DateTime::from_primitive(Date::from_calendar_date(1982, Month::September, 17)?.with_hms(0, 0, 0)?)))?;
writer.add_document(doc!(date_field=>DateTime::from_primitive(Date::from_calendar_date(1983, Month::September, 27)?.with_hms(0, 0, 0)?)))?;
writer.add_document(doc!())?;
writer.commit()?;
}
let agg_req: Aggregations = serde_json::from_value(json!({
"my_date": {
"terms": {
"field": "date_field",
"missing": "1982-09-17T00:00:00Z"
},
}
}))
.unwrap();
let res = exec_request_with_query(agg_req, &index, None)?;
// date_field field
assert_eq!(res["my_date"]["buckets"][0]["key"], "1982-09-17T00:00:00Z");
assert_eq!(res["my_date"]["buckets"][0]["doc_count"], 3);
assert_eq!(res["my_date"]["buckets"][1]["key"], "1983-09-27T00:00:00Z");
assert_eq!(res["my_date"]["buckets"][1]["doc_count"], 1);
assert_eq!(res["my_date"]["buckets"][2]["key"], serde_json::Value::Null);
Ok(())
}
#[test]
fn terms_aggregation_bool() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
let field = schema_builder.add_bool_field("bool_field", FAST);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
{
let mut writer = index.writer_with_num_threads(1, 15_000_000)?;
writer.add_document(doc!(field=>true))?;
writer.add_document(doc!(field=>false))?;
writer.add_document(doc!(field=>true))?;
writer.commit()?;
}
let agg_req: Aggregations = serde_json::from_value(json!({
"my_bool": {
"terms": {
"field": "bool_field"
},
}
}))
.unwrap();
let res = exec_request_with_query(agg_req, &index, None)?;
assert_eq!(res["my_bool"]["buckets"][0]["key"], 1.0);
assert_eq!(res["my_bool"]["buckets"][0]["key_as_string"], "true");
assert_eq!(res["my_bool"]["buckets"][0]["doc_count"], 2);
assert_eq!(res["my_bool"]["buckets"][1]["key"], 0.0);
assert_eq!(res["my_bool"]["buckets"][1]["key_as_string"], "false");
assert_eq!(res["my_bool"]["buckets"][1]["doc_count"], 1);
assert_eq!(res["my_bool"]["buckets"][2]["key"], serde_json::Value::Null);
Ok(())
}
}
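A small sketch of the rename's compatibility story, not taken from the patch: `shard_size` is now the canonical key on `TermsAggregation`, while `segment_size` and `split_size` remain as serde aliases, so requests written against either old name keep deserializing. Assumes `tantivy` (this branch) and `serde_json` as dependencies; the aggregation and field names are made up.

use tantivy::aggregation::agg_req::Aggregations;

fn main() {
    // All three spellings map onto the same `shard_size` field.
    for key in ["shard_size", "segment_size", "split_size"] {
        let req = format!(
            r#"{{ "my_texts": {{ "terms": {{ "field": "text", "{key}": 50 }} }} }}"#
        );
        let agg: Aggregations = serde_json::from_str(&req)
            .expect("all three spellings should be accepted");
        assert!(agg.contains_key("my_texts"));
    }
}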

View File

@@ -73,11 +73,13 @@ impl SegmentAggregationCollector for TermMissingAgg {
            entries.insert(missing.into(), missing_entry);
-            let bucket = IntermediateBucketResult::Terms(IntermediateTermBucketResult {
-                entries,
-                sum_other_doc_count: 0,
-                doc_count_error_upper_bound: 0,
-            });
+            let bucket = IntermediateBucketResult::Terms {
+                buckets: IntermediateTermBucketResult {
+                    entries,
+                    sum_other_doc_count: 0,
+                    doc_count_error_upper_bound: 0,
+                },
+            };
            results.push(name, IntermediateAggregationResult::Bucket(bucket))?;
@@ -90,7 +92,10 @@ impl SegmentAggregationCollector for TermMissingAgg {
        agg_with_accessor: &mut AggregationsWithAccessor,
    ) -> crate::Result<()> {
        let agg = &mut agg_with_accessor.aggs.values[self.accessor_idx];
-        let has_value = agg.accessors.iter().any(|acc| acc.index.has_value(doc));
+        let has_value = agg
+            .accessors
+            .iter()
+            .any(|(acc, _)| acc.index.has_value(doc));
        if !has_value {
            self.missing_count += 1;
            if let Some(sub_agg) = self.sub_agg.as_mut() {
@@ -117,7 +122,7 @@ mod tests {
    use crate::aggregation::agg_req::Aggregations;
    use crate::aggregation::tests::exec_request_with_query;
    use crate::schema::{Schema, FAST};
-    use crate::Index;
+    use crate::{Index, IndexWriter};
    #[test]
    fn terms_aggregation_missing_mixed_type_mult_seg_sub_agg() -> crate::Result<()> {
@@ -126,7 +131,7 @@ mod tests {
        let score = schema_builder.add_f64_field("score", FAST);
        let schema = schema_builder.build();
        let index = Index::create_in_ram(schema);
-        let mut index_writer = index.writer_for_tests().unwrap();
+        let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
        // => Segment with all values numeric
        index_writer
            .add_document(doc!(score => 1.0, json => json!({"mixed_type": 10.0})))
@@ -186,7 +191,7 @@ mod tests {
        let score = schema_builder.add_f64_field("score", FAST);
        let schema = schema_builder.build();
        let index = Index::create_in_ram(schema);
-        let mut index_writer = index.writer_for_tests().unwrap();
+        let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
        // => Segment with all values numeric
        index_writer.add_document(doc!(score => 1.0, json => json!({"mixed_type": 10.0})))?;
        index_writer.add_document(doc!(score => 5.0))?;
@@ -231,7 +236,7 @@ mod tests {
        let score = schema_builder.add_f64_field("score", FAST);
        let schema = schema_builder.build();
        let index = Index::create_in_ram(schema);
-        let mut index_writer = index.writer_for_tests().unwrap();
+        let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
        index_writer.add_document(doc!(score => 5.0))?;
        index_writer.commit().unwrap();
@@ -278,7 +283,7 @@ mod tests {
        let score = schema_builder.add_f64_field("score", FAST);
        let schema = schema_builder.build();
        let index = Index::create_in_ram(schema);
-        let mut index_writer = index.writer_for_tests().unwrap();
+        let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
        index_writer.add_document(doc!(score => 5.0))?;
        index_writer.add_document(doc!(score => 5.0))?;
@@ -323,7 +328,7 @@ mod tests {
        let json = schema_builder.add_json_field("json", FAST);
        let schema = schema_builder.build();
        let index = Index::create_in_ram(schema);
-        let mut index_writer = index.writer_for_tests().unwrap();
+        let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
        // => Segment with all values numeric
        index_writer
            .add_document(doc!(json => json!({"mixed_type": 10.0})))
@@ -385,7 +390,7 @@ mod tests {
        let json = schema_builder.add_json_field("json", FAST);
        let schema = schema_builder.build();
        let index = Index::create_in_ram(schema);
-        let mut index_writer = index.writer_for_tests().unwrap();
+        let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
        // => Segment with all values numeric
        index_writer
            .add_document(doc!(json => json!({"mixed_type": 10.0})))
@@ -427,7 +432,7 @@ mod tests {
        let json = schema_builder.add_json_field("json", FAST);
        let schema = schema_builder.build();
        let index = Index::create_in_ram(schema);
-        let mut index_writer = index.writer_for_tests().unwrap();
+        let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
        // => Segment with all values numeric
        index_writer
            .add_document(doc!(json => json!({"mixed_type": 10.0})))

View File

@@ -8,7 +8,7 @@ use super::segment_agg_result::{
};
use crate::aggregation::agg_req_with_accessor::get_aggs_with_segment_accessor_and_validate;
use crate::collector::{Collector, SegmentCollector};
-use crate::{DocId, SegmentReader, TantivyError};
+use crate::{DocId, SegmentOrdinal, SegmentReader, TantivyError};
/// The default max bucket count, before the aggregation fails.
pub const DEFAULT_BUCKET_LIMIT: u32 = 65000;
@@ -64,10 +64,15 @@ impl Collector for DistributedAggregationCollector {
    fn for_segment(
        &self,
-        _segment_local_id: crate::SegmentOrdinal,
+        segment_local_id: crate::SegmentOrdinal,
        reader: &crate::SegmentReader,
    ) -> crate::Result<Self::Child> {
-        AggregationSegmentCollector::from_agg_req_and_reader(&self.agg, reader, &self.limits)
+        AggregationSegmentCollector::from_agg_req_and_reader(
+            &self.agg,
+            reader,
+            segment_local_id,
+            &self.limits,
+        )
    }
    fn requires_scoring(&self) -> bool {
@@ -89,10 +94,15 @@ impl Collector for AggregationCollector {
    fn for_segment(
        &self,
-        _segment_local_id: crate::SegmentOrdinal,
+        segment_local_id: crate::SegmentOrdinal,
        reader: &crate::SegmentReader,
    ) -> crate::Result<Self::Child> {
-        AggregationSegmentCollector::from_agg_req_and_reader(&self.agg, reader, &self.limits)
+        AggregationSegmentCollector::from_agg_req_and_reader(
+            &self.agg,
+            reader,
+            segment_local_id,
+            &self.limits,
+        )
    }
    fn requires_scoring(&self) -> bool {
@@ -135,10 +145,11 @@ impl AggregationSegmentCollector {
    pub fn from_agg_req_and_reader(
        agg: &Aggregations,
        reader: &SegmentReader,
+        segment_ordinal: SegmentOrdinal,
        limits: &AggregationLimits,
    ) -> crate::Result<Self> {
        let mut aggs_with_accessor =
-            get_aggs_with_segment_accessor_and_validate(agg, reader, limits)?;
+            get_aggs_with_segment_accessor_and_validate(agg, reader, segment_ordinal, limits)?;
        let result =
            BufAggregationCollector::new(build_segment_agg_collector(&mut aggs_with_accessor)?);
        Ok(AggregationSegmentCollector {

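The change above threads the segment ordinal from `Collector::for_segment` into the segment-level aggregation collector. A minimal sketch of the updated call shape, assuming the re-exported paths shown in this change set (illustrative only, not a prescribed usage):

```rust
use tantivy::aggregation::agg_req::Aggregations;
use tantivy::aggregation::{AggregationLimits, AggregationSegmentCollector};
use tantivy::{SegmentOrdinal, SegmentReader};

// Per-segment setup now also receives the segment ordinal, so sub-aggregations
// such as top_hits can later report full DocAddress values (segment_ord + doc_id).
fn segment_collector(
    agg: &Aggregations,
    reader: &SegmentReader,
    segment_local_id: SegmentOrdinal,
    limits: &AggregationLimits,
) -> tantivy::Result<AggregationSegmentCollector> {
    AggregationSegmentCollector::from_agg_req_and_reader(agg, reader, segment_local_id, limits)
}
```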
View File

@@ -19,7 +19,7 @@ use super::bucket::{
}; };
use super::metric::{ use super::metric::{
IntermediateAverage, IntermediateCount, IntermediateMax, IntermediateMin, IntermediateStats, IntermediateAverage, IntermediateCount, IntermediateMax, IntermediateMin, IntermediateStats,
IntermediateSum, PercentilesCollector, IntermediateSum, PercentilesCollector, TopHitsCollector,
}; };
use super::segment_agg_result::AggregationLimits; use super::segment_agg_result::AggregationLimits;
use super::{format_date, AggregationError, Key, SerializedKey}; use super::{format_date, AggregationError, Key, SerializedKey};
@@ -41,6 +41,8 @@ pub struct IntermediateAggregationResults {
/// This might seem redundant with `Key`, but the point is to have a different /// This might seem redundant with `Key`, but the point is to have a different
/// Serialize implementation. /// Serialize implementation.
pub enum IntermediateKey { pub enum IntermediateKey {
/// Bool key
Bool(bool),
/// String key /// String key
Str(String), Str(String),
/// `f64` key /// `f64` key
@@ -59,6 +61,7 @@ impl From<IntermediateKey> for Key {
match value { match value {
IntermediateKey::Str(s) => Self::Str(s), IntermediateKey::Str(s) => Self::Str(s),
IntermediateKey::F64(f) => Self::F64(f), IntermediateKey::F64(f) => Self::F64(f),
IntermediateKey::Bool(f) => Self::F64(f as u64 as f64),
} }
} }
} }
@@ -71,6 +74,7 @@ impl std::hash::Hash for IntermediateKey {
match self { match self {
IntermediateKey::Str(text) => text.hash(state), IntermediateKey::Str(text) => text.hash(state),
IntermediateKey::F64(val) => val.to_bits().hash(state), IntermediateKey::F64(val) => val.to_bits().hash(state),
IntermediateKey::Bool(val) => val.hash(state),
} }
} }
} }
@@ -166,16 +170,22 @@ impl IntermediateAggregationResults {
pub(crate) fn empty_from_req(req: &Aggregation) -> IntermediateAggregationResult { pub(crate) fn empty_from_req(req: &Aggregation) -> IntermediateAggregationResult {
use AggregationVariants::*; use AggregationVariants::*;
match req.agg { match req.agg {
Terms(_) => IntermediateAggregationResult::Bucket(IntermediateBucketResult::Terms( Terms(_) => IntermediateAggregationResult::Bucket(IntermediateBucketResult::Terms {
Default::default(), buckets: Default::default(),
)), }),
Range(_) => IntermediateAggregationResult::Bucket(IntermediateBucketResult::Range( Range(_) => IntermediateAggregationResult::Bucket(IntermediateBucketResult::Range(
Default::default(), Default::default(),
)), )),
Histogram(_) | DateHistogram(_) => { Histogram(_) => {
IntermediateAggregationResult::Bucket(IntermediateBucketResult::Histogram { IntermediateAggregationResult::Bucket(IntermediateBucketResult::Histogram {
buckets: Vec::new(), buckets: Vec::new(),
column_type: None, is_date_agg: false,
})
}
DateHistogram(_) => {
IntermediateAggregationResult::Bucket(IntermediateBucketResult::Histogram {
buckets: Vec::new(),
is_date_agg: true,
}) })
} }
Average(_) => IntermediateAggregationResult::Metric(IntermediateMetricResult::Average( Average(_) => IntermediateAggregationResult::Metric(IntermediateMetricResult::Average(
@@ -199,6 +209,9 @@ pub(crate) fn empty_from_req(req: &Aggregation) -> IntermediateAggregationResult
Percentiles(_) => IntermediateAggregationResult::Metric( Percentiles(_) => IntermediateAggregationResult::Metric(
IntermediateMetricResult::Percentiles(PercentilesCollector::default()), IntermediateMetricResult::Percentiles(PercentilesCollector::default()),
), ),
TopHits(_) => IntermediateAggregationResult::Metric(IntermediateMetricResult::TopHits(
TopHitsCollector::default(),
)),
} }
} }
@@ -259,6 +272,8 @@ pub enum IntermediateMetricResult {
Stats(IntermediateStats), Stats(IntermediateStats),
/// Intermediate sum result. /// Intermediate sum result.
Sum(IntermediateSum), Sum(IntermediateSum),
/// Intermediate top_hits result
TopHits(TopHitsCollector),
} }
impl IntermediateMetricResult { impl IntermediateMetricResult {
@@ -286,9 +301,13 @@ impl IntermediateMetricResult {
percentiles percentiles
.into_final_result(req.agg.as_percentile().expect("unexpected metric type")), .into_final_result(req.agg.as_percentile().expect("unexpected metric type")),
), ),
IntermediateMetricResult::TopHits(top_hits) => {
MetricResult::TopHits(top_hits.finalize())
}
} }
} }
// TODO: this is our top-of-the-chain fruit merge mech
fn merge_fruits(&mut self, other: IntermediateMetricResult) -> crate::Result<()> { fn merge_fruits(&mut self, other: IntermediateMetricResult) -> crate::Result<()> {
match (self, other) { match (self, other) {
( (
@@ -324,6 +343,9 @@ impl IntermediateMetricResult {
) => { ) => {
left.merge_fruits(right)?; left.merge_fruits(right)?;
} }
(IntermediateMetricResult::TopHits(left), IntermediateMetricResult::TopHits(right)) => {
left.merge_fruits(right)?;
}
_ => { _ => {
panic!("incompatible fruit types in tree or missing merge_fruits handler"); panic!("incompatible fruit types in tree or missing merge_fruits handler");
} }
@@ -343,13 +365,16 @@ pub enum IntermediateBucketResult {
/// This is the histogram entry for a bucket, which contains a key, count, and optionally /// This is the histogram entry for a bucket, which contains a key, count, and optionally
/// sub_aggregations. /// sub_aggregations.
Histogram { Histogram {
/// The column_type of the underlying `Column` /// Whether this aggregation is a date histogram (backed by a DateTime column)
column_type: Option<ColumnType>, is_date_agg: bool,
/// The buckets /// The histogram buckets
buckets: Vec<IntermediateHistogramBucketEntry>, buckets: Vec<IntermediateHistogramBucketEntry>,
}, },
/// Term aggregation /// Term aggregation
Terms(IntermediateTermBucketResult), Terms {
/// The term buckets
buckets: IntermediateTermBucketResult,
},
} }
impl IntermediateBucketResult { impl IntermediateBucketResult {
@@ -399,7 +424,7 @@ impl IntermediateBucketResult {
Ok(BucketResult::Range { buckets }) Ok(BucketResult::Range { buckets })
} }
IntermediateBucketResult::Histogram { IntermediateBucketResult::Histogram {
column_type, is_date_agg,
buckets, buckets,
} => { } => {
let histogram_req = &req let histogram_req = &req
@@ -408,7 +433,7 @@ impl IntermediateBucketResult {
.expect("unexpected aggregation, expected histogram aggregation"); .expect("unexpected aggregation, expected histogram aggregation");
let buckets = intermediate_histogram_buckets_to_final_buckets( let buckets = intermediate_histogram_buckets_to_final_buckets(
buckets, buckets,
column_type, is_date_agg,
histogram_req, histogram_req,
req.sub_aggregation(), req.sub_aggregation(),
limits, limits,
@@ -426,7 +451,7 @@ impl IntermediateBucketResult {
}; };
Ok(BucketResult::Histogram { buckets }) Ok(BucketResult::Histogram { buckets })
} }
IntermediateBucketResult::Terms(terms) => terms.into_final_result( IntermediateBucketResult::Terms { buckets: terms } => terms.into_final_result(
req.agg req.agg
.as_term() .as_term()
.expect("unexpected aggregation, expected term aggregation"), .expect("unexpected aggregation, expected term aggregation"),
@@ -439,8 +464,12 @@ impl IntermediateBucketResult {
fn merge_fruits(&mut self, other: IntermediateBucketResult) -> crate::Result<()> { fn merge_fruits(&mut self, other: IntermediateBucketResult) -> crate::Result<()> {
match (self, other) { match (self, other) {
( (
IntermediateBucketResult::Terms(term_res_left), IntermediateBucketResult::Terms {
IntermediateBucketResult::Terms(term_res_right), buckets: term_res_left,
},
IntermediateBucketResult::Terms {
buckets: term_res_right,
},
) => { ) => {
merge_maps(&mut term_res_left.entries, term_res_right.entries)?; merge_maps(&mut term_res_left.entries, term_res_right.entries)?;
term_res_left.sum_other_doc_count += term_res_right.sum_other_doc_count; term_res_left.sum_other_doc_count += term_res_right.sum_other_doc_count;
@@ -457,11 +486,11 @@ impl IntermediateBucketResult {
( (
IntermediateBucketResult::Histogram { IntermediateBucketResult::Histogram {
buckets: buckets_left, buckets: buckets_left,
.. is_date_agg: _,
}, },
IntermediateBucketResult::Histogram { IntermediateBucketResult::Histogram {
buckets: buckets_right, buckets: buckets_right,
.. is_date_agg: _,
}, },
) => { ) => {
let buckets: Result<Vec<IntermediateHistogramBucketEntry>, TantivyError> = let buckets: Result<Vec<IntermediateHistogramBucketEntry>, TantivyError> =
@@ -524,8 +553,15 @@ impl IntermediateTermBucketResult {
.into_iter() .into_iter()
.filter(|bucket| bucket.1.doc_count as u64 >= req.min_doc_count) .filter(|bucket| bucket.1.doc_count as u64 >= req.min_doc_count)
.map(|(key, entry)| { .map(|(key, entry)| {
let key_as_string = match key {
IntermediateKey::Bool(key) => {
let val = if key { "true" } else { "false" };
Some(val.to_string())
}
_ => None,
};
Ok(BucketEntry { Ok(BucketEntry {
key_as_string: None, key_as_string,
key: key.into(), key: key.into(),
doc_count: entry.doc_count as u64, doc_count: entry.doc_count as u64,
sub_aggregation: entry sub_aggregation: entry

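To illustrate the new `Bool` handling above: a boolean key is converted to `1.0`/`0.0` for the `key` field and additionally rendered as `key_as_string`. A hypothetical terms aggregation over a bool fast field would then serialize roughly as follows (aggregation name, field, and counts invented):

```JSON
{
  "published_terms": {
    "buckets": [
      { "key": 1.0, "key_as_string": "true", "doc_count": 12 },
      { "key": 0.0, "key_as_string": "false", "doc_count": 7 }
    ],
    "sum_other_doc_count": 0
  }
}
```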
View File

@@ -2,7 +2,8 @@ use std::fmt::Debug;
use serde::{Deserialize, Serialize}; use serde::{Deserialize, Serialize};
use super::{IntermediateStats, SegmentStatsCollector}; use super::*;
use crate::aggregation::*;
/// A single-value metric aggregation that computes the average of numeric values that are /// A single-value metric aggregation that computes the average of numeric values that are
/// extracted from the aggregated documents. /// extracted from the aggregated documents.
@@ -24,7 +25,7 @@ pub struct AverageAggregation {
/// By default they will be ignored but it is also possible to treat them as if they had a /// By default they will be ignored but it is also possible to treat them as if they had a
/// value. Examples in JSON format: /// value. Examples in JSON format:
/// { "field": "my_numbers", "missing": "10.0" } /// { "field": "my_numbers", "missing": "10.0" }
#[serde(default)] #[serde(default, deserialize_with = "deserialize_option_f64")]
pub missing: Option<f64>, pub missing: Option<f64>,
} }
@@ -65,3 +66,71 @@ impl IntermediateAverage {
self.stats.finalize().avg self.stats.finalize().avg
} }
} }
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn deserialization_with_missing_test1() {
let json = r#"{
"field": "score",
"missing": "10.0"
}"#;
let avg: AverageAggregation = serde_json::from_str(json).unwrap();
assert_eq!(avg.field, "score");
assert_eq!(avg.missing, Some(10.0));
// no dot
let json = r#"{
"field": "score",
"missing": "10"
}"#;
let avg: AverageAggregation = serde_json::from_str(json).unwrap();
assert_eq!(avg.field, "score");
assert_eq!(avg.missing, Some(10.0));
// from value
let avg: AverageAggregation = serde_json::from_value(json!({
"field": "score_f64",
"missing": 10u64,
}))
.unwrap();
assert_eq!(avg.missing, Some(10.0));
// from value
let avg: AverageAggregation = serde_json::from_value(json!({
"field": "score_f64",
"missing": 10u32,
}))
.unwrap();
assert_eq!(avg.missing, Some(10.0));
let avg: AverageAggregation = serde_json::from_value(json!({
"field": "score_f64",
"missing": 10i8,
}))
.unwrap();
assert_eq!(avg.missing, Some(10.0));
}
#[test]
fn deserialization_with_missing_test_fail() {
let json = r#"{
"field": "score",
"missing": "a"
}"#;
let avg: Result<AverageAggregation, _> = serde_json::from_str(json);
assert!(avg.is_err());
assert!(avg
.unwrap_err()
.to_string()
.contains("Failed to parse f64 from string: \"a\""));
// Disallow NaN
let json = r#"{
"field": "score",
"missing": "NaN"
}"#;
let avg: Result<AverageAggregation, _> = serde_json::from_str(json);
assert!(avg.is_err());
assert!(avg.unwrap_err().to_string().contains("NaN"));
}
}
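With `deserialize_option_f64` attached to `missing`, a request may pass the missing value either as a number or as a numeric string. A hypothetical request (aggregation and field names invented):

```JSON
{
  "avg_score": {
    "avg": {
      "field": "score",
      "missing": "10.0"
    }
  }
}
```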

View File

@@ -2,7 +2,8 @@ use std::fmt::Debug;
use serde::{Deserialize, Serialize}; use serde::{Deserialize, Serialize};
use super::{IntermediateStats, SegmentStatsCollector}; use super::*;
use crate::aggregation::*;
/// A single-value metric aggregation that counts the number of values that are /// A single-value metric aggregation that counts the number of values that are
/// extracted from the aggregated documents. /// extracted from the aggregated documents.
@@ -24,7 +25,7 @@ pub struct CountAggregation {
/// By default they will be ignored but it is also possible to treat them as if they had a /// By default they will be ignored but it is also possible to treat them as if they had a
/// value. Examples in JSON format: /// value. Examples in JSON format:
/// { "field": "my_numbers", "missing": "10.0" } /// { "field": "my_numbers", "missing": "10.0" }
#[serde(default)] #[serde(default, deserialize_with = "deserialize_option_f64")]
pub missing: Option<f64>, pub missing: Option<f64>,
} }

View File

@@ -2,7 +2,8 @@ use std::fmt::Debug;
use serde::{Deserialize, Serialize}; use serde::{Deserialize, Serialize};
use super::{IntermediateStats, SegmentStatsCollector}; use super::*;
use crate::aggregation::*;
/// A single-value metric aggregation that computes the maximum of numeric values that are /// A single-value metric aggregation that computes the maximum of numeric values that are
/// extracted from the aggregated documents. /// extracted from the aggregated documents.
@@ -24,7 +25,7 @@ pub struct MaxAggregation {
/// By default they will be ignored but it is also possible to treat them as if they had a /// By default they will be ignored but it is also possible to treat them as if they had a
/// value. Examples in JSON format: /// value. Examples in JSON format:
/// { "field": "my_numbers", "missing": "10.0" } /// { "field": "my_numbers", "missing": "10.0" }
#[serde(default)] #[serde(default, deserialize_with = "deserialize_option_f64")]
pub missing: Option<f64>, pub missing: Option<f64>,
} }
@@ -71,7 +72,7 @@ mod tests {
use crate::aggregation::agg_req::Aggregations; use crate::aggregation::agg_req::Aggregations;
use crate::aggregation::tests::exec_request_with_query; use crate::aggregation::tests::exec_request_with_query;
use crate::schema::{Schema, FAST}; use crate::schema::{Schema, FAST};
use crate::Index; use crate::{Index, IndexWriter};
#[test] #[test]
fn test_max_agg_with_missing() -> crate::Result<()> { fn test_max_agg_with_missing() -> crate::Result<()> {
@@ -79,7 +80,7 @@ mod tests {
let json = schema_builder.add_json_field("json", FAST); let json = schema_builder.add_json_field("json", FAST);
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests().unwrap(); let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
// => Segment with empty json // => Segment with empty json
index_writer.add_document(doc!()).unwrap(); index_writer.add_document(doc!()).unwrap();
index_writer.commit().unwrap(); index_writer.commit().unwrap();

View File

@@ -2,7 +2,8 @@ use std::fmt::Debug;
use serde::{Deserialize, Serialize}; use serde::{Deserialize, Serialize};
use super::{IntermediateStats, SegmentStatsCollector}; use super::*;
use crate::aggregation::*;
/// A single-value metric aggregation that computes the minimum of numeric values that are /// A single-value metric aggregation that computes the minimum of numeric values that are
/// extracted from the aggregated documents. /// extracted from the aggregated documents.
@@ -24,7 +25,7 @@ pub struct MinAggregation {
/// By default they will be ignored but it is also possible to treat them as if they had a /// By default they will be ignored but it is also possible to treat them as if they had a
/// value. Examples in JSON format: /// value. Examples in JSON format:
/// { "field": "my_numbers", "missing": "10.0" } /// { "field": "my_numbers", "missing": "10.0" }
#[serde(default)] #[serde(default, deserialize_with = "deserialize_option_f64")]
pub missing: Option<f64>, pub missing: Option<f64>,
} }

View File

@@ -23,6 +23,8 @@ mod min;
mod percentiles; mod percentiles;
mod stats; mod stats;
mod sum; mod sum;
mod top_hits;
pub use average::*; pub use average::*;
pub use count::*; pub use count::*;
pub use max::*; pub use max::*;
@@ -32,6 +34,7 @@ use rustc_hash::FxHashMap;
use serde::{Deserialize, Serialize}; use serde::{Deserialize, Serialize};
pub use stats::*; pub use stats::*;
pub use sum::*; pub use sum::*;
pub use top_hits::*;
/// Single-metric aggregations use this common result structure. /// Single-metric aggregations use this common result structure.
/// ///
@@ -81,6 +84,27 @@ pub struct PercentilesMetricResult {
pub values: PercentileValues, pub values: PercentileValues,
} }
/// The top_hits metric results entry
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
pub struct TopHitsVecEntry {
/// The sort values of the document, depending on the sort criteria in the request.
pub sort: Vec<Option<u64>>,
/// Search results, for queries that include field retrieval requests
/// (`docvalue_fields`).
#[serde(flatten)]
pub search_results: FieldRetrivalResult,
}
/// The top_hits metric aggregation returns a list of the top hits according to the sort criteria.
///
/// The main reason for wrapping it in `hits` is to match Elasticsearch's output structure.
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
pub struct TopHitsMetricResult {
/// The result of the top_hits metric.
pub hits: Vec<TopHitsVecEntry>,
}
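Given the two structs above (the `sort` values plus a flattened `FieldRetrivalResult`), a serialized `TopHitsMetricResult` takes roughly the following shape; the sort value and the retrieved field are illustrative only:

```JSON
{
  "hits": [
    {
      "sort": [42],
      "docvalue_fields": {
        "rating": [3.5]
      }
    }
  ]
}
```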
#[cfg(test)] #[cfg(test)]
mod tests { mod tests {
use crate::aggregation::agg_req::Aggregations; use crate::aggregation::agg_req::Aggregations;
@@ -88,7 +112,7 @@ mod tests {
use crate::aggregation::AggregationCollector; use crate::aggregation::AggregationCollector;
use crate::query::AllQuery; use crate::query::AllQuery;
use crate::schema::{NumericOptions, Schema}; use crate::schema::{NumericOptions, Schema};
use crate::Index; use crate::{Index, IndexWriter};
#[test] #[test]
fn test_metric_aggregations() { fn test_metric_aggregations() {
@@ -96,7 +120,7 @@ mod tests {
let field_options = NumericOptions::default().set_fast(); let field_options = NumericOptions::default().set_fast();
let field = schema_builder.add_f64_field("price", field_options); let field = schema_builder.add_f64_field("price", field_options);
let index = Index::create_in_ram(schema_builder.build()); let index = Index::create_in_ram(schema_builder.build());
let mut index_writer = index.writer_for_tests().unwrap(); let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
for i in 0..3 { for i in 0..3 {
index_writer index_writer

View File

@@ -11,7 +11,7 @@ use crate::aggregation::intermediate_agg_result::{
IntermediateAggregationResult, IntermediateAggregationResults, IntermediateMetricResult, IntermediateAggregationResult, IntermediateAggregationResults, IntermediateMetricResult,
}; };
use crate::aggregation::segment_agg_result::SegmentAggregationCollector; use crate::aggregation::segment_agg_result::SegmentAggregationCollector;
use crate::aggregation::{f64_from_fastfield_u64, f64_to_fastfield_u64, AggregationError}; use crate::aggregation::*;
use crate::{DocId, TantivyError}; use crate::{DocId, TantivyError};
/// # Percentiles /// # Percentiles
@@ -84,7 +84,11 @@ pub struct PercentilesAggregationReq {
/// By default they will be ignored but it is also possible to treat them as if they had a /// By default they will be ignored but it is also possible to treat them as if they had a
/// value. Examples in JSON format: /// value. Examples in JSON format:
/// { "field": "my_numbers", "missing": "10.0" } /// { "field": "my_numbers", "missing": "10.0" }
#[serde(skip_serializing_if = "Option::is_none", default)] #[serde(
skip_serializing_if = "Option::is_none",
default,
deserialize_with = "deserialize_option_f64"
)]
pub missing: Option<f64>, pub missing: Option<f64>,
} }
fn default_percentiles() -> &'static [f64] { fn default_percentiles() -> &'static [f64] {
@@ -133,7 +137,6 @@ pub(crate) struct SegmentPercentilesCollector {
field_type: ColumnType, field_type: ColumnType,
pub(crate) percentiles: PercentilesCollector, pub(crate) percentiles: PercentilesCollector,
pub(crate) accessor_idx: usize, pub(crate) accessor_idx: usize,
val_cache: Vec<u64>,
missing: Option<u64>, missing: Option<u64>,
} }
@@ -243,7 +246,6 @@ impl SegmentPercentilesCollector {
field_type, field_type,
percentiles: PercentilesCollector::new(), percentiles: PercentilesCollector::new(),
accessor_idx, accessor_idx,
val_cache: Default::default(),
missing, missing,
}) })
} }

View File

@@ -9,7 +9,7 @@ use crate::aggregation::intermediate_agg_result::{
IntermediateAggregationResult, IntermediateAggregationResults, IntermediateMetricResult, IntermediateAggregationResult, IntermediateAggregationResults, IntermediateMetricResult,
}; };
use crate::aggregation::segment_agg_result::SegmentAggregationCollector; use crate::aggregation::segment_agg_result::SegmentAggregationCollector;
use crate::aggregation::{f64_from_fastfield_u64, f64_to_fastfield_u64}; use crate::aggregation::*;
use crate::{DocId, TantivyError}; use crate::{DocId, TantivyError};
/// A multi-value metric aggregation that computes a collection of statistics on numeric values that /// A multi-value metric aggregation that computes a collection of statistics on numeric values that
@@ -33,7 +33,7 @@ pub struct StatsAggregation {
/// By default they will be ignored but it is also possible to treat them as if they had a /// By default they will be ignored but it is also possible to treat them as if they had a
/// value. Examples in JSON format: /// value. Examples in JSON format:
/// { "field": "my_numbers", "missing": "10.0" } /// { "field": "my_numbers", "missing": "10.0" }
#[serde(default)] #[serde(default, deserialize_with = "deserialize_option_f64")]
pub missing: Option<f64>, pub missing: Option<f64>,
} }
@@ -300,7 +300,7 @@ mod tests {
use crate::aggregation::AggregationCollector; use crate::aggregation::AggregationCollector;
use crate::query::{AllQuery, TermQuery}; use crate::query::{AllQuery, TermQuery};
use crate::schema::{IndexRecordOption, Schema, FAST}; use crate::schema::{IndexRecordOption, Schema, FAST};
use crate::{Index, Term}; use crate::{Index, IndexWriter, Term};
#[test] #[test]
fn test_aggregation_stats_empty_index() -> crate::Result<()> { fn test_aggregation_stats_empty_index() -> crate::Result<()> {
@@ -494,7 +494,7 @@ mod tests {
let json = schema_builder.add_json_field("json", FAST); let json = schema_builder.add_json_field("json", FAST);
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests().unwrap(); let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
// => Segment with empty json // => Segment with empty json
index_writer.add_document(doc!()).unwrap(); index_writer.add_document(doc!()).unwrap();
index_writer.commit().unwrap(); index_writer.commit().unwrap();
@@ -541,7 +541,7 @@ mod tests {
let json = schema_builder.add_json_field("json", FAST); let json = schema_builder.add_json_field("json", FAST);
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests().unwrap(); let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
// => Segment with empty json // => Segment with empty json
index_writer.add_document(doc!()).unwrap(); index_writer.add_document(doc!()).unwrap();
index_writer.commit().unwrap(); index_writer.commit().unwrap();
@@ -580,6 +580,30 @@ mod tests {
}) })
); );
// From string
let agg_req: Aggregations = serde_json::from_value(json!({
"my_stats": {
"stats": {
"field": "json.partially_empty",
"missing": "0.0"
},
}
}))
.unwrap();
let res = exec_request_with_query(agg_req, &index, None)?;
assert_eq!(
res["my_stats"],
json!({
"avg": 2.5,
"count": 4,
"max": 10.0,
"min": 0.0,
"sum": 10.0
})
);
Ok(()) Ok(())
} }

View File

@@ -2,7 +2,8 @@ use std::fmt::Debug;
use serde::{Deserialize, Serialize}; use serde::{Deserialize, Serialize};
use super::{IntermediateStats, SegmentStatsCollector}; use super::*;
use crate::aggregation::*;
/// A single-value metric aggregation that sums up numeric values that are /// A single-value metric aggregation that sums up numeric values that are
/// extracted from the aggregated documents. /// extracted from the aggregated documents.
@@ -24,7 +25,7 @@ pub struct SumAggregation {
/// By default they will be ignored but it is also possible to treat them as if they had a /// By default they will be ignored but it is also possible to treat them as if they had a
/// value. Examples in JSON format: /// value. Examples in JSON format:
/// { "field": "my_numbers", "missing": "10.0" } /// { "field": "my_numbers", "missing": "10.0" }
#[serde(default)] #[serde(default, deserialize_with = "deserialize_option_f64")]
pub missing: Option<f64>, pub missing: Option<f64>,
} }

View File

@@ -0,0 +1,837 @@
use std::collections::HashMap;
use std::fmt::Formatter;
use columnar::{ColumnarReader, DynamicColumn};
use regex::Regex;
use serde::ser::SerializeMap;
use serde::{Deserialize, Deserializer, Serialize, Serializer};
use super::{TopHitsMetricResult, TopHitsVecEntry};
use crate::aggregation::bucket::Order;
use crate::aggregation::intermediate_agg_result::{
IntermediateAggregationResult, IntermediateMetricResult,
};
use crate::aggregation::segment_agg_result::SegmentAggregationCollector;
use crate::collector::TopNComputer;
use crate::schema::term::JSON_PATH_SEGMENT_SEP_STR;
use crate::schema::OwnedValue;
use crate::{DocAddress, DocId, SegmentOrdinal};
/// # Top Hits
///
/// The top hits aggregation is a useful tool to answer questions like:
/// - "What are the most recent posts by each author?"
/// - "What are the most popular items in each category?"
///
/// It does so by keeping track of the most relevant documents being aggregated,
/// according to a sort criterion that can consist of multiple fields and their
/// sort orders (ascending or descending).
///
/// `top_hits` should not be used as a top-level aggregation. It is intended to be
/// used as a sub-aggregation, inside a `terms` aggregation or a `filters` aggregation,
/// for example.
///
/// Note that this aggregator does not return the actual document addresses, but
/// rather a list of the values of the fields that were requested to be retrieved.
/// The fields to retrieve are specified in the `docvalue_fields` parameter as a list of
/// fast fields. At the moment only fast fields are supported; support for the `fields`
/// parameter, which would allow retrieving any stored field, may be added in the future.
///
/// The following example demonstrates a request for the top_hits aggregation:
/// ```JSON
/// {
/// "aggs": {
/// "top_authors": {
/// "terms": {
/// "field": "author",
/// "size": 5
/// }
/// },
/// "aggs": {
/// "top_hits": {
/// "size": 2,
/// "from": 0,
/// "sort": [
/// { "date": "desc" }
/// ],
/// "docvalue_fields": ["date", "title", "iden"]
/// }
/// }
/// }
/// }
/// ```
///
/// This request will return an object containing the top two documents, sorted
/// by the `date` field in descending order. You can also sort by multiple fields, which
/// helps to resolve ties. The aggregation object for each bucket will look like:
/// ```JSON
/// {
/// "hits": [
/// {
/// "sort": [<time_u64>],
/// "docvalue_fields": {
/// "date": "<date_RFC3339>",
/// "title": "<title>",
/// "iden": "<iden>"
/// }
/// },
/// {
/// "sort": [<time_u64>],
/// "docvalue_fields": {
/// "date": "<date_RFC3339>",
/// "title": "<title>",
/// "iden": "<iden>"
/// }
/// }
/// ]
/// }
/// ```
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize, Default)]
pub struct TopHitsAggregation {
sort: Vec<KeyOrder>,
size: usize,
from: Option<usize>,
#[serde(flatten)]
retrieval: RetrievalFields,
}
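Since the request struct is populated through serde, here is a hedged sketch of building such a request from JSON, nesting `top_hits` under a named `terms` aggregation via the usual `"aggs"` sub-aggregation key (all names and fields here are invented):

```rust
use tantivy::aggregation::agg_req::Aggregations;

// Sketch: a terms bucket per author, each keeping its two most recent documents.
fn build_request() -> serde_json::Result<Aggregations> {
    serde_json::from_value(serde_json::json!({
        "top_authors": {
            "terms": { "field": "author", "size": 5 },
            "aggs": {
                "recent_posts": {
                    "top_hits": {
                        "size": 2,
                        "from": 0,
                        "sort": [{ "date": "desc" }],
                        "docvalue_fields": ["date", "title"]
                    }
                }
            }
        }
    }))
}
```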
const fn default_doc_value_fields() -> Vec<String> {
Vec::new()
}
/// Search query spec for each matched document
/// TODO: move this to a common module
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize, Default)]
pub struct RetrievalFields {
/// The fast fields to return for each hit.
/// This is the only variant supported for now.
/// TODO: support the {field, format} variant for custom formatting.
#[serde(rename = "docvalue_fields")]
#[serde(default = "default_doc_value_fields")]
pub doc_value_fields: Vec<String>,
}
/// Search query result for each matched document
/// TODO: move this to a common module
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize, Default)]
pub struct FieldRetrivalResult {
/// The fast fields returned for each hit.
#[serde(rename = "docvalue_fields")]
#[serde(skip_serializing_if = "HashMap::is_empty")]
pub doc_value_fields: HashMap<String, OwnedValue>,
}
impl RetrievalFields {
fn get_field_names(&self) -> Vec<&str> {
self.doc_value_fields.iter().map(|s| s.as_str()).collect()
}
fn resolve_field_names(&mut self, reader: &ColumnarReader) -> crate::Result<()> {
// Transform a glob (`pattern*`, for example) into a regex::Regex (`^pattern.*$`)
let globbed_string_to_regex = |glob: &str| {
// Replace `*` glob with `.*` regex
let sanitized = format!("^{}$", regex::escape(glob).replace(r"\*", ".*"));
Regex::new(&sanitized.replace('*', ".*")).map_err(|e| {
crate::TantivyError::SchemaError(format!(
"Invalid regex '{}' in docvalue_fields: {}",
glob, e
))
})
};
self.doc_value_fields = self
.doc_value_fields
.iter()
.map(|field| {
if !field.contains('*')
&& reader
.iter_columns()?
.any(|(name, _)| name.as_str() == field)
{
return Ok(vec![field.to_owned()]);
}
let pattern = globbed_string_to_regex(field)?;
let fields = reader
.iter_columns()?
.map(|(name, _)| {
// normalize path from internal fast field repr
name.replace(JSON_PATH_SEGMENT_SEP_STR, ".")
})
.filter(|name| pattern.is_match(name))
.collect::<Vec<_>>();
assert!(
!fields.is_empty(),
"No fields matched the glob '{}' in docvalue_fields",
field
);
Ok(fields)
})
.collect::<crate::Result<Vec<_>>>()?
.into_iter()
.flatten()
.collect();
Ok(())
}
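For reference, a simplified standalone illustration of the glob-to-regex step used in `resolve_field_names` above (field names invented; this mirrors the transformation, not the exact code path):

```rust
use regex::Regex;

fn main() {
    // "attributes.*" -> escape to r"attributes\.\*" -> restore the wildcard and anchor.
    let glob = "attributes.*";
    let pattern = format!("^{}$", regex::escape(glob).replace(r"\*", ".*"));
    assert_eq!(pattern, r"^attributes\..*$");

    let re = Regex::new(&pattern).unwrap();
    assert!(re.is_match("attributes.color"));
    assert!(!re.is_match("other.color"));
}
```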
fn get_document_field_data(
&self,
accessors: &HashMap<String, Vec<DynamicColumn>>,
doc_id: DocId,
) -> FieldRetrivalResult {
let dvf = self
.doc_value_fields
.iter()
.map(|field| {
let accessors = accessors
.get(field)
.unwrap_or_else(|| panic!("field '{}' not found in accessors", field));
let values: Vec<OwnedValue> = accessors
.iter()
.flat_map(|accessor| match accessor {
DynamicColumn::U64(accessor) => accessor
.values_for_doc(doc_id)
.map(OwnedValue::U64)
.collect::<Vec<_>>(),
DynamicColumn::I64(accessor) => accessor
.values_for_doc(doc_id)
.map(OwnedValue::I64)
.collect::<Vec<_>>(),
DynamicColumn::F64(accessor) => accessor
.values_for_doc(doc_id)
.map(OwnedValue::F64)
.collect::<Vec<_>>(),
DynamicColumn::Bytes(accessor) => accessor
.term_ords(doc_id)
.map(|term_ord| {
let mut buffer = vec![];
assert!(
accessor
.ord_to_bytes(term_ord, &mut buffer)
.expect("could not read term dictionary"),
"term corresponding to term_ord does not exist"
);
OwnedValue::Bytes(buffer)
})
.collect::<Vec<_>>(),
DynamicColumn::Str(accessor) => accessor
.term_ords(doc_id)
.map(|term_ord| {
let mut buffer = vec![];
assert!(
accessor
.ord_to_bytes(term_ord, &mut buffer)
.expect("could not read term dictionary"),
"term corresponding to term_ord does not exist"
);
OwnedValue::Str(String::from_utf8(buffer).unwrap())
})
.collect::<Vec<_>>(),
DynamicColumn::Bool(accessor) => accessor
.values_for_doc(doc_id)
.map(OwnedValue::Bool)
.collect::<Vec<_>>(),
DynamicColumn::IpAddr(accessor) => accessor
.values_for_doc(doc_id)
.map(OwnedValue::IpAddr)
.collect::<Vec<_>>(),
DynamicColumn::DateTime(accessor) => accessor
.values_for_doc(doc_id)
.map(OwnedValue::Date)
.collect::<Vec<_>>(),
})
.collect();
(field.to_owned(), OwnedValue::Array(values))
})
.collect();
FieldRetrivalResult {
doc_value_fields: dvf,
}
}
}
#[derive(Debug, Clone, PartialEq, Default)]
struct KeyOrder {
field: String,
order: Order,
}
impl Serialize for KeyOrder {
fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
let KeyOrder { field, order } = self;
let mut map = serializer.serialize_map(Some(1))?;
map.serialize_entry(field, order)?;
map.end()
}
}
impl<'de> Deserialize<'de> for KeyOrder {
fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
where D: Deserializer<'de> {
let mut k_o = <HashMap<String, Order>>::deserialize(deserializer)?.into_iter();
let (k, v) = k_o.next().ok_or(serde::de::Error::custom(
"Expected exactly one key-value pair in KeyOrder, found none",
))?;
if k_o.next().is_some() {
return Err(serde::de::Error::custom(
"Expected exactly one key-value pair in KeyOrder, found more",
));
}
Ok(Self { field: k, order: v })
}
}
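In other words, each entry of the request's `sort` array is a single-key map from field name to order, for example (field names invented):

```JSON
[
  { "date": "desc" },
  { "id": "asc" }
]
```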
impl TopHitsAggregation {
/// Validate and resolve field retrieval parameters
pub fn validate_and_resolve(&mut self, reader: &ColumnarReader) -> crate::Result<()> {
self.retrieval.resolve_field_names(reader)
}
/// Return fields accessed by the aggregator, in order.
pub fn field_names(&self) -> Vec<&str> {
self.sort
.iter()
.map(|KeyOrder { field, .. }| field.as_str())
.collect()
}
/// Return fields accessed by the aggregator's value retrieval.
pub fn value_field_names(&self) -> Vec<&str> {
self.retrieval.get_field_names()
}
}
/// Holds a single comparable doc feature, and the order in which it should be sorted.
#[derive(Clone, Serialize, Deserialize, Debug)]
struct ComparableDocFeature {
/// Stores any u64-mappable feature.
value: Option<u64>,
/// Sort order for the doc feature
order: Order,
}
impl Ord for ComparableDocFeature {
fn cmp(&self, other: &Self) -> std::cmp::Ordering {
let invert = |cmp: std::cmp::Ordering| match self.order {
Order::Asc => cmp,
Order::Desc => cmp.reverse(),
};
match (self.value, other.value) {
(Some(self_value), Some(other_value)) => invert(self_value.cmp(&other_value)),
(Some(_), None) => std::cmp::Ordering::Greater,
(None, Some(_)) => std::cmp::Ordering::Less,
(None, None) => std::cmp::Ordering::Equal,
}
}
}
impl PartialOrd for ComparableDocFeature {
fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
Some(self.cmp(other))
}
}
impl PartialEq for ComparableDocFeature {
fn eq(&self, other: &Self) -> bool {
self.value.cmp(&other.value) == std::cmp::Ordering::Equal
}
}
impl Eq for ComparableDocFeature {}
#[derive(Clone, Serialize, Deserialize, Debug)]
struct ComparableDocFeatures(Vec<ComparableDocFeature>, FieldRetrivalResult);
impl Ord for ComparableDocFeatures {
fn cmp(&self, other: &Self) -> std::cmp::Ordering {
for (self_feature, other_feature) in self.0.iter().zip(other.0.iter()) {
let cmp = self_feature.cmp(other_feature);
if cmp != std::cmp::Ordering::Equal {
return cmp;
}
}
std::cmp::Ordering::Equal
}
}
impl PartialOrd for ComparableDocFeatures {
fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
Some(self.cmp(other))
}
}
impl PartialEq for ComparableDocFeatures {
fn eq(&self, other: &Self) -> bool {
self.cmp(other) == std::cmp::Ordering::Equal
}
}
impl Eq for ComparableDocFeatures {}
/// The TopHitsCollector used for collecting over segments and merging results.
#[derive(Clone, Serialize, Deserialize)]
pub struct TopHitsCollector {
req: TopHitsAggregation,
top_n: TopNComputer<ComparableDocFeatures, DocAddress, false>,
}
impl Default for TopHitsCollector {
fn default() -> Self {
Self {
req: TopHitsAggregation::default(),
top_n: TopNComputer::new(1),
}
}
}
impl std::fmt::Debug for TopHitsCollector {
fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
f.debug_struct("TopHitsCollector")
.field("req", &self.req)
.field("top_n_threshold", &self.top_n.threshold)
.finish()
}
}
impl std::cmp::PartialEq for TopHitsCollector {
fn eq(&self, _other: &Self) -> bool {
false
}
}
impl TopHitsCollector {
fn collect(&mut self, features: ComparableDocFeatures, doc: DocAddress) {
self.top_n.push(features, doc);
}
pub(crate) fn merge_fruits(&mut self, other_fruit: Self) -> crate::Result<()> {
for doc in other_fruit.top_n.into_vec() {
self.collect(doc.feature, doc.doc);
}
Ok(())
}
/// Finalize by converting self into the final result form
pub fn finalize(self) -> TopHitsMetricResult {
let mut hits: Vec<TopHitsVecEntry> = self
.top_n
.into_sorted_vec()
.into_iter()
.map(|doc| TopHitsVecEntry {
sort: doc.feature.0.iter().map(|f| f.value).collect(),
search_results: doc.feature.1,
})
.collect();
// Remove the first `from` elements.
// Truncating from the end would be more efficient, but we have to truncate from the
// front: `into_sorted_vec` yields the hits in descending order due to the inverted
// `Ord` semantics of the heap elements. For example, with `size: 2` and `from: 1`,
// up to three hits were collected and the first one is dropped here.
hits.drain(..self.req.from.unwrap_or(0));
TopHitsMetricResult { hits }
}
}
#[derive(Clone)]
pub(crate) struct SegmentTopHitsCollector {
segment_ordinal: SegmentOrdinal,
accessor_idx: usize,
inner_collector: TopHitsCollector,
}
impl SegmentTopHitsCollector {
pub fn from_req(
req: &TopHitsAggregation,
accessor_idx: usize,
segment_ordinal: SegmentOrdinal,
) -> Self {
Self {
inner_collector: TopHitsCollector {
req: req.clone(),
top_n: TopNComputer::new(req.size + req.from.unwrap_or(0)),
},
segment_ordinal,
accessor_idx,
}
}
}
impl std::fmt::Debug for SegmentTopHitsCollector {
fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
f.debug_struct("SegmentTopHitsCollector")
.field("segment_id", &self.segment_ordinal)
.field("accessor_idx", &self.accessor_idx)
.field("inner_collector", &self.inner_collector)
.finish()
}
}
impl SegmentAggregationCollector for SegmentTopHitsCollector {
fn add_intermediate_aggregation_result(
self: Box<Self>,
agg_with_accessor: &crate::aggregation::agg_req_with_accessor::AggregationsWithAccessor,
results: &mut crate::aggregation::intermediate_agg_result::IntermediateAggregationResults,
) -> crate::Result<()> {
let name = agg_with_accessor.aggs.keys[self.accessor_idx].to_string();
let intermediate_result = IntermediateMetricResult::TopHits(self.inner_collector);
results.push(
name,
IntermediateAggregationResult::Metric(intermediate_result),
)
}
fn collect(
&mut self,
doc_id: crate::DocId,
agg_with_accessor: &mut crate::aggregation::agg_req_with_accessor::AggregationsWithAccessor,
) -> crate::Result<()> {
let accessors = &agg_with_accessor.aggs.values[self.accessor_idx].accessors;
let value_accessors = &agg_with_accessor.aggs.values[self.accessor_idx].value_accessors;
let features: Vec<ComparableDocFeature> = self
.inner_collector
.req
.sort
.iter()
.enumerate()
.map(|(idx, KeyOrder { order, .. })| {
let order = *order;
let value = accessors
.get(idx)
.expect("could not find field in accessors")
.0
.values_for_doc(doc_id)
.next();
ComparableDocFeature { value, order }
})
.collect();
let retrieval_result = self
.inner_collector
.req
.retrieval
.get_document_field_data(value_accessors, doc_id);
self.inner_collector.collect(
ComparableDocFeatures(features, retrieval_result),
DocAddress {
segment_ord: self.segment_ordinal,
doc_id,
},
);
Ok(())
}
fn collect_block(
&mut self,
docs: &[crate::DocId],
agg_with_accessor: &mut crate::aggregation::agg_req_with_accessor::AggregationsWithAccessor,
) -> crate::Result<()> {
// TODO: Consider getting fields with the column block accessor and refactor this.
// ---
// Would the additional complexity of getting fields with the column_block_accessor
// make sense here? Probably yes, but I want to get a first-pass review first
// before proceeding.
for doc in docs {
self.collect(*doc, agg_with_accessor)?;
}
Ok(())
}
}
#[cfg(test)]
mod tests {
use common::DateTime;
use pretty_assertions::assert_eq;
use serde_json::Value;
use time::macros::datetime;
use super::{ComparableDocFeature, ComparableDocFeatures, Order};
use crate::aggregation::agg_req::Aggregations;
use crate::aggregation::agg_result::AggregationResults;
use crate::aggregation::bucket::tests::get_test_index_from_docs;
use crate::aggregation::tests::get_test_index_from_values;
use crate::aggregation::AggregationCollector;
use crate::collector::ComparableDoc;
use crate::query::AllQuery;
use crate::schema::OwnedValue as SchemaValue;
fn invert_order(cmp_feature: ComparableDocFeature) -> ComparableDocFeature {
let ComparableDocFeature { value, order } = cmp_feature;
let order = match order {
Order::Asc => Order::Desc,
Order::Desc => Order::Asc,
};
ComparableDocFeature { value, order }
}
fn collector_with_capacity(capacity: usize) -> super::TopHitsCollector {
super::TopHitsCollector {
top_n: super::TopNComputer::new(capacity),
..Default::default()
}
}
fn invert_order_features(cmp_features: ComparableDocFeatures) -> ComparableDocFeatures {
let ComparableDocFeatures(cmp_features, search_results) = cmp_features;
let cmp_features = cmp_features
.into_iter()
.map(invert_order)
.collect::<Vec<_>>();
ComparableDocFeatures(cmp_features, search_results)
}
#[test]
fn test_comparable_doc_feature() -> crate::Result<()> {
let small = ComparableDocFeature {
value: Some(1),
order: Order::Asc,
};
let big = ComparableDocFeature {
value: Some(2),
order: Order::Asc,
};
let none = ComparableDocFeature {
value: None,
order: Order::Asc,
};
assert!(small < big);
assert!(none < small);
assert!(none < big);
let small = invert_order(small);
let big = invert_order(big);
let none = invert_order(none);
assert!(small > big);
assert!(none < small);
assert!(none < big);
Ok(())
}
#[test]
fn test_comparable_doc_features() -> crate::Result<()> {
let features_1 = ComparableDocFeatures(
vec![ComparableDocFeature {
value: Some(1),
order: Order::Asc,
}],
Default::default(),
);
let features_2 = ComparableDocFeatures(
vec![ComparableDocFeature {
value: Some(2),
order: Order::Asc,
}],
Default::default(),
);
assert!(features_1 < features_2);
assert!(invert_order_features(features_1.clone()) > invert_order_features(features_2));
Ok(())
}
#[test]
fn test_aggregation_top_hits_empty_index() -> crate::Result<()> {
let values = vec![];
let index = get_test_index_from_values(false, &values)?;
let d: Aggregations = serde_json::from_value(json!({
"top_hits_req": {
"top_hits": {
"size": 2,
"sort": [
{ "date": "desc" }
],
"from": 0,
}
}
}))
.unwrap();
let collector = AggregationCollector::from_aggs(d, Default::default());
let reader = index.reader()?;
let searcher = reader.searcher();
let agg_res: AggregationResults = searcher.search(&AllQuery, &collector).unwrap();
let res: Value = serde_json::from_str(
&serde_json::to_string(&agg_res).expect("JSON serialization failed"),
)
.expect("JSON parsing failed");
assert_eq!(
res,
json!({
"top_hits_req": {
"hits": []
}
})
);
Ok(())
}
#[test]
fn test_top_hits_collector_single_feature() -> crate::Result<()> {
let docs = vec![
ComparableDoc::<_, _, false> {
doc: crate::DocAddress {
segment_ord: 0,
doc_id: 0,
},
feature: ComparableDocFeatures(
vec![ComparableDocFeature {
value: Some(1),
order: Order::Asc,
}],
Default::default(),
),
},
ComparableDoc {
doc: crate::DocAddress {
segment_ord: 0,
doc_id: 2,
},
feature: ComparableDocFeatures(
vec![ComparableDocFeature {
value: Some(3),
order: Order::Asc,
}],
Default::default(),
),
},
ComparableDoc {
doc: crate::DocAddress {
segment_ord: 0,
doc_id: 1,
},
feature: ComparableDocFeatures(
vec![ComparableDocFeature {
value: Some(5),
order: Order::Asc,
}],
Default::default(),
),
},
];
let mut collector = collector_with_capacity(3);
for doc in docs.clone() {
collector.collect(doc.feature, doc.doc);
}
let res = collector.finalize();
assert_eq!(
res,
super::TopHitsMetricResult {
hits: vec![
super::TopHitsVecEntry {
sort: vec![docs[0].feature.0[0].value],
search_results: Default::default(),
},
super::TopHitsVecEntry {
sort: vec![docs[1].feature.0[0].value],
search_results: Default::default(),
},
super::TopHitsVecEntry {
sort: vec![docs[2].feature.0[0].value],
search_results: Default::default(),
},
]
}
);
Ok(())
}
fn test_aggregation_top_hits(merge_segments: bool) -> crate::Result<()> {
let docs = vec![
vec![
r#"{ "date": "2015-01-02T00:00:00Z", "text": "bbb", "text2": "bbb", "mixed": { "dyn_arr": [1, "2"] } }"#,
r#"{ "date": "2017-06-15T00:00:00Z", "text": "ccc", "text2": "ddd", "mixed": { "dyn_arr": [3, "4"] } }"#,
],
vec![
r#"{ "text": "aaa", "text2": "bbb", "date": "2018-01-02T00:00:00Z", "mixed": { "dyn_arr": ["9", 8] } }"#,
r#"{ "text": "aaa", "text2": "bbb", "date": "2016-01-02T00:00:00Z", "mixed": { "dyn_arr": ["7", 6] } }"#,
],
];
let index = get_test_index_from_docs(merge_segments, &docs)?;
let d: Aggregations = serde_json::from_value(json!({
"top_hits_req": {
"top_hits": {
"size": 2,
"sort": [
{ "date": "desc" }
],
"from": 1,
"docvalue_fields": [
"date",
"tex*",
"mixed.*",
],
}
}
}))?;
let collector = AggregationCollector::from_aggs(d, Default::default());
let reader = index.reader()?;
let searcher = reader.searcher();
let agg_res =
serde_json::to_value(searcher.search(&AllQuery, &collector).unwrap()).unwrap();
let date_2017 = datetime!(2017-06-15 00:00:00 UTC);
let date_2016 = datetime!(2016-01-02 00:00:00 UTC);
assert_eq!(
agg_res["top_hits_req"],
json!({
"hits": [
{
"sort": [common::i64_to_u64(date_2017.unix_timestamp_nanos() as i64)],
"docvalue_fields": {
"date": [ SchemaValue::Date(DateTime::from_utc(date_2017)) ],
"text": [ "ccc" ],
"text2": [ "ddd" ],
"mixed.dyn_arr": [ 3, "4" ],
}
},
{
"sort": [common::i64_to_u64(date_2016.unix_timestamp_nanos() as i64)],
"docvalue_fields": {
"date": [ SchemaValue::Date(DateTime::from_utc(date_2016)) ],
"text": [ "aaa" ],
"text2": [ "bbb" ],
"mixed.dyn_arr": [ 6, "7" ],
}
}
]
}),
);
Ok(())
}
#[test]
fn test_aggregation_top_hits_single_segment() -> crate::Result<()> {
test_aggregation_top_hits(true)
}
#[test]
fn test_aggregation_top_hits_multi_segment() -> crate::Result<()> {
test_aggregation_top_hits(false)
}
}

View File

@@ -145,6 +145,8 @@ mod agg_tests;
mod agg_bench; mod agg_bench;
use core::fmt;
pub use agg_limits::AggregationLimits; pub use agg_limits::AggregationLimits;
pub use collector::{ pub use collector::{
AggregationCollector, AggregationSegmentCollector, DistributedAggregationCollector, AggregationCollector, AggregationSegmentCollector, DistributedAggregationCollector,
@@ -154,7 +156,106 @@ use columnar::{ColumnType, MonotonicallyMappableToU64};
pub(crate) use date::format_date; pub(crate) use date::format_date;
pub use error::AggregationError; pub use error::AggregationError;
use itertools::Itertools; use itertools::Itertools;
use serde::{Deserialize, Serialize}; use serde::de::{self, Visitor};
use serde::{Deserialize, Deserializer, Serialize};
fn parse_str_into_f64<E: de::Error>(value: &str) -> Result<f64, E> {
let parsed = value.parse::<f64>().map_err(|_err| {
de::Error::custom(format!("Failed to parse f64 from string: {:?}", value))
})?;
// Check if the parsed value is NaN or infinity
if parsed.is_nan() || parsed.is_infinite() {
Err(de::Error::custom(format!(
"Value is not a valid f64 (NaN or Infinity): {:?}",
value
)))
} else {
Ok(parsed)
}
}
/// deserialize Option<f64> from string or float
pub(crate) fn deserialize_option_f64<'de, D>(deserializer: D) -> Result<Option<f64>, D::Error>
where D: Deserializer<'de> {
struct StringOrFloatVisitor;
impl<'de> Visitor<'de> for StringOrFloatVisitor {
type Value = Option<f64>;
fn expecting(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
formatter.write_str("a string or a float")
}
fn visit_str<E>(self, value: &str) -> Result<Self::Value, E>
where E: de::Error {
parse_str_into_f64(value).map(Some)
}
fn visit_f64<E>(self, value: f64) -> Result<Self::Value, E>
where E: de::Error {
Ok(Some(value))
}
fn visit_i64<E>(self, value: i64) -> Result<Self::Value, E>
where E: de::Error {
Ok(Some(value as f64))
}
fn visit_u64<E>(self, value: u64) -> Result<Self::Value, E>
where E: de::Error {
Ok(Some(value as f64))
}
fn visit_none<E>(self) -> Result<Self::Value, E>
where E: de::Error {
Ok(None)
}
fn visit_unit<E>(self) -> Result<Self::Value, E>
where E: de::Error {
Ok(None)
}
}
deserializer.deserialize_any(StringOrFloatVisitor)
}
/// deserialize f64 from string or float
pub(crate) fn deserialize_f64<'de, D>(deserializer: D) -> Result<f64, D::Error>
where D: Deserializer<'de> {
struct StringOrFloatVisitor;
impl<'de> Visitor<'de> for StringOrFloatVisitor {
type Value = f64;
fn expecting(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
formatter.write_str("a string or a float")
}
fn visit_str<E>(self, value: &str) -> Result<Self::Value, E>
where E: de::Error {
parse_str_into_f64(value)
}
fn visit_f64<E>(self, value: f64) -> Result<Self::Value, E>
where E: de::Error {
Ok(value)
}
fn visit_i64<E>(self, value: i64) -> Result<Self::Value, E>
where E: de::Error {
Ok(value as f64)
}
fn visit_u64<E>(self, value: u64) -> Result<Self::Value, E>
where E: de::Error {
Ok(value as f64)
}
}
deserializer.deserialize_any(StringOrFloatVisitor)
}
/// Represents an associative array `(key => values)` in a very efficient manner. /// Represents an associative array `(key => values)` in a very efficient manner.
#[derive(PartialEq, Serialize, Deserialize)] #[derive(PartialEq, Serialize, Deserialize)]
@@ -281,6 +382,7 @@ pub(crate) fn f64_from_fastfield_u64(val: u64, field_type: &ColumnType) -> f64 {
ColumnType::U64 => val as f64, ColumnType::U64 => val as f64,
ColumnType::I64 | ColumnType::DateTime => i64::from_u64(val) as f64, ColumnType::I64 | ColumnType::DateTime => i64::from_u64(val) as f64,
ColumnType::F64 => f64::from_u64(val), ColumnType::F64 => f64::from_u64(val),
ColumnType::Bool => val as f64,
_ => { _ => {
panic!("unexpected type {field_type:?}. This should not happen") panic!("unexpected type {field_type:?}. This should not happen")
} }
@@ -301,6 +403,7 @@ pub(crate) fn f64_to_fastfield_u64(val: f64, field_type: &ColumnType) -> Option<
ColumnType::U64 => Some(val as u64), ColumnType::U64 => Some(val as u64),
ColumnType::I64 | ColumnType::DateTime => Some((val as i64).to_u64()), ColumnType::I64 | ColumnType::DateTime => Some((val as i64).to_u64()),
ColumnType::F64 => Some(val.to_u64()), ColumnType::F64 => Some(val.to_u64()),
ColumnType::Bool => Some(val as u64),
_ => None, _ => None,
} }
} }
@@ -319,7 +422,7 @@ mod tests {
use crate::indexer::NoMergePolicy; use crate::indexer::NoMergePolicy;
use crate::query::{AllQuery, TermQuery}; use crate::query::{AllQuery, TermQuery};
use crate::schema::{IndexRecordOption, Schema, TextFieldIndexing, FAST, STRING}; use crate::schema::{IndexRecordOption, Schema, TextFieldIndexing, FAST, STRING};
use crate::{Index, Term}; use crate::{Index, IndexWriter, Term};
pub fn get_test_index_with_num_docs( pub fn get_test_index_with_num_docs(
merge_segments: bool, merge_segments: bool,
@@ -451,7 +554,7 @@ mod tests {
.searchable_segment_ids() .searchable_segment_ids()
.expect("Searchable segments failed."); .expect("Searchable segments failed.");
if segment_ids.len() > 1 { if segment_ids.len() > 1 {
let mut index_writer = index.writer_for_tests()?; let mut index_writer: IndexWriter = index.writer_for_tests()?;
index_writer.merge(&segment_ids).wait()?; index_writer.merge(&segment_ids).wait()?;
index_writer.wait_merging_threads()?; index_writer.wait_merging_threads()?;
} }
@@ -565,7 +668,7 @@ mod tests {
let segment_ids = index let segment_ids = index
.searchable_segment_ids() .searchable_segment_ids()
.expect("Searchable segments failed."); .expect("Searchable segments failed.");
let mut index_writer = index.writer_for_tests()?; let mut index_writer: IndexWriter = index.writer_for_tests()?;
index_writer.merge(&segment_ids).wait()?; index_writer.merge(&segment_ids).wait()?;
index_writer.wait_merging_threads()?; index_writer.wait_merging_threads()?;
} }

View File

@@ -16,6 +16,7 @@ use super::metric::{
SumAggregation, SumAggregation,
}; };
use crate::aggregation::bucket::TermMissingAgg; use crate::aggregation::bucket::TermMissingAgg;
use crate::aggregation::metric::SegmentTopHitsCollector;
pub(crate) trait SegmentAggregationCollector: CollectorClone + Debug { pub(crate) trait SegmentAggregationCollector: CollectorClone + Debug {
fn add_intermediate_aggregation_result( fn add_intermediate_aggregation_result(
@@ -160,6 +161,11 @@ pub(crate) fn build_single_agg_segment_collector(
accessor_idx, accessor_idx,
)?, )?,
)), )),
TopHits(top_hits_req) => Ok(Box::new(SegmentTopHitsCollector::from_req(
top_hits_req,
accessor_idx,
req.segment_ordinal,
))),
} }
} }

View File

@@ -16,7 +16,7 @@ use crate::{DocId, Score, SegmentOrdinal, SegmentReader};
/// let schema = schema_builder.build(); /// let schema = schema_builder.build();
/// let index = Index::create_in_ram(schema); /// let index = Index::create_in_ram(schema);
/// ///
/// let mut index_writer = index.writer(3_000_000).unwrap(); /// let mut index_writer = index.writer(15_000_000).unwrap();
/// index_writer.add_document(doc!(title => "The Name of the Wind")).unwrap(); /// index_writer.add_document(doc!(title => "The Name of the Wind")).unwrap();
/// index_writer.add_document(doc!(title => "The Diary of Muadib")).unwrap(); /// index_writer.add_document(doc!(title => "The Diary of Muadib")).unwrap();
/// index_writer.add_document(doc!(title => "A Dairy Cow")).unwrap(); /// index_writer.add_document(doc!(title => "A Dairy Cow")).unwrap();

View File

@@ -89,7 +89,7 @@ fn facet_depth(facet_bytes: &[u8]) -> usize {
/// let schema = schema_builder.build(); /// let schema = schema_builder.build();
/// let index = Index::create_in_ram(schema); /// let index = Index::create_in_ram(schema);
/// { /// {
/// let mut index_writer = index.writer(3_000_000)?; /// let mut index_writer = index.writer(15_000_000)?;
/// // a document can be associated with any number of facets /// // a document can be associated with any number of facets
/// index_writer.add_document(doc!( /// index_writer.add_document(doc!(
/// title => "The Name of the Wind", /// title => "The Name of the Wind",
@@ -410,6 +410,7 @@ impl SegmentCollector for FacetSegmentCollector {
/// Intermediary result of the `FacetCollector` that stores /// Intermediary result of the `FacetCollector` that stores
/// the facet counts for all the segments. /// the facet counts for all the segments.
#[derive(Default, Clone)]
pub struct FacetCounts { pub struct FacetCounts {
facet_counts: BTreeMap<Facet, u64>, facet_counts: BTreeMap<Facet, u64>,
} }
@@ -493,10 +494,10 @@ mod tests {
use super::{FacetCollector, FacetCounts}; use super::{FacetCollector, FacetCounts};
use crate::collector::facet_collector::compress_mapping; use crate::collector::facet_collector::compress_mapping;
use crate::collector::Count; use crate::collector::Count;
use crate::core::Index; use crate::index::Index;
use crate::query::{AllQuery, QueryParser, TermQuery}; use crate::query::{AllQuery, QueryParser, TermQuery};
use crate::schema::{Document, Facet, FacetOptions, IndexRecordOption, Schema}; use crate::schema::{Facet, FacetOptions, IndexRecordOption, Schema, TantivyDocument};
use crate::Term; use crate::{IndexWriter, Term};
fn test_collapse_mapping_aux( fn test_collapse_mapping_aux(
facet_terms: &[&str], facet_terms: &[&str],
@@ -559,7 +560,7 @@ mod tests {
let facet_field = schema_builder.add_facet_field("facet", FacetOptions::default()); let facet_field = schema_builder.add_facet_field("facet", FacetOptions::default());
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests().unwrap(); let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
index_writer index_writer
.add_document(doc!(facet_field=>Facet::from("/facet/a"))) .add_document(doc!(facet_field=>Facet::from("/facet/a")))
.unwrap(); .unwrap();
@@ -588,7 +589,7 @@ mod tests {
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests().unwrap(); let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
let num_facets: usize = 3 * 4 * 5; let num_facets: usize = 3 * 4 * 5;
let facets: Vec<Facet> = (0..num_facets) let facets: Vec<Facet> = (0..num_facets)
.map(|mut n| { .map(|mut n| {
@@ -601,7 +602,7 @@ mod tests {
}) })
.collect(); .collect();
for i in 0..num_facets * 10 { for i in 0..num_facets * 10 {
let mut doc = Document::new(); let mut doc = TantivyDocument::new();
doc.add_facet(facet_field, facets[i % num_facets].clone()); doc.add_facet(facet_field, facets[i % num_facets].clone());
index_writer.add_document(doc).unwrap(); index_writer.add_document(doc).unwrap();
} }
@@ -732,24 +733,25 @@ mod tests {
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let uniform = Uniform::new_inclusive(1, 100_000); let uniform = Uniform::new_inclusive(1, 100_000);
let mut docs: Vec<Document> = vec![("a", 10), ("b", 100), ("c", 7), ("d", 12), ("e", 21)] let mut docs: Vec<TantivyDocument> =
.into_iter() vec![("a", 10), ("b", 100), ("c", 7), ("d", 12), ("e", 21)]
.flat_map(|(c, count)| { .into_iter()
let facet = Facet::from(&format!("/facet/{}", c)); .flat_map(|(c, count)| {
let doc = doc!(facet_field => facet); let facet = Facet::from(&format!("/facet/{}", c));
iter::repeat(doc).take(count) let doc = doc!(facet_field => facet);
}) iter::repeat(doc).take(count)
.map(|mut doc| { })
doc.add_facet( .map(|mut doc| {
facet_field, doc.add_facet(
&format!("/facet/{}", thread_rng().sample(uniform)), facet_field,
); &format!("/facet/{}", thread_rng().sample(uniform)),
doc );
}) doc
.collect(); })
.collect();
docs[..].shuffle(&mut thread_rng()); docs[..].shuffle(&mut thread_rng());
let mut index_writer = index.writer_for_tests().unwrap(); let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
for doc in docs { for doc in docs {
index_writer.add_document(doc).unwrap(); index_writer.add_document(doc).unwrap();
} }
@@ -780,7 +782,7 @@ mod tests {
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let docs: Vec<Document> = vec![("b", 2), ("a", 2), ("c", 4)] let docs: Vec<TantivyDocument> = vec![("b", 2), ("a", 2), ("c", 4)]
.into_iter() .into_iter()
.flat_map(|(c, count)| { .flat_map(|(c, count)| {
let facet = Facet::from(&format!("/facet/{}", c)); let facet = Facet::from(&format!("/facet/{}", c));
@@ -828,7 +830,7 @@ mod bench {
use crate::collector::FacetCollector; use crate::collector::FacetCollector;
use crate::query::AllQuery; use crate::query::AllQuery;
use crate::schema::{Facet, Schema, INDEXED}; use crate::schema::{Facet, Schema, INDEXED};
use crate::Index; use crate::{Index, IndexWriter};
#[bench] #[bench]
fn bench_facet_collector(b: &mut Bencher) { fn bench_facet_collector(b: &mut Bencher) {
@@ -847,7 +849,7 @@ mod bench {
// 40425 docs // 40425 docs
docs[..].shuffle(&mut thread_rng()); docs[..].shuffle(&mut thread_rng());
let mut index_writer = index.writer_for_tests().unwrap(); let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
for doc in docs { for doc in docs {
index_writer.add_document(doc).unwrap(); index_writer.add_document(doc).unwrap();
} }

View File

@@ -12,8 +12,7 @@ use std::marker::PhantomData;
use columnar::{BytesColumn, Column, DynamicColumn, HasAssociatedColumnType}; use columnar::{BytesColumn, Column, DynamicColumn, HasAssociatedColumnType};
use crate::collector::{Collector, SegmentCollector}; use crate::collector::{Collector, SegmentCollector};
use crate::schema::Field; use crate::{DocId, Score, SegmentReader};
use crate::{DocId, Score, SegmentReader, TantivyError};
/// The `FilterCollector` filters docs using a fast field value and a predicate. /// The `FilterCollector` filters docs using a fast field value and a predicate.
/// ///
@@ -50,13 +49,13 @@ use crate::{DocId, Score, SegmentReader, TantivyError};
/// ///
/// let query_parser = QueryParser::for_index(&index, vec![title]); /// let query_parser = QueryParser::for_index(&index, vec![title]);
/// let query = query_parser.parse_query("diary")?; /// let query = query_parser.parse_query("diary")?;
/// let no_filter_collector = FilterCollector::new(price, |value: u64| value > 20_120u64, TopDocs::with_limit(2)); /// let no_filter_collector = FilterCollector::new("price".to_string(), |value: u64| value > 20_120u64, TopDocs::with_limit(2));
/// let top_docs = searcher.search(&query, &no_filter_collector)?; /// let top_docs = searcher.search(&query, &no_filter_collector)?;
/// ///
/// assert_eq!(top_docs.len(), 1); /// assert_eq!(top_docs.len(), 1);
/// assert_eq!(top_docs[0].1, DocAddress::new(0, 1)); /// assert_eq!(top_docs[0].1, DocAddress::new(0, 1));
/// ///
/// let filter_all_collector: FilterCollector<_, _, u64> = FilterCollector::new(price, |value| value < 5u64, TopDocs::with_limit(2)); /// let filter_all_collector: FilterCollector<_, _, u64> = FilterCollector::new("price".to_string(), |value| value < 5u64, TopDocs::with_limit(2));
/// let filtered_top_docs = searcher.search(&query, &filter_all_collector)?; /// let filtered_top_docs = searcher.search(&query, &filter_all_collector)?;
/// ///
/// assert_eq!(filtered_top_docs.len(), 0); /// assert_eq!(filtered_top_docs.len(), 0);
@@ -70,7 +69,7 @@ use crate::{DocId, Score, SegmentReader, TantivyError};
pub struct FilterCollector<TCollector, TPredicate, TPredicateValue> pub struct FilterCollector<TCollector, TPredicate, TPredicateValue>
where TPredicate: 'static + Clone where TPredicate: 'static + Clone
{ {
field: Field, field: String,
collector: TCollector, collector: TCollector,
predicate: TPredicate, predicate: TPredicate,
t_predicate_value: PhantomData<TPredicateValue>, t_predicate_value: PhantomData<TPredicateValue>,
@@ -83,7 +82,7 @@ where
TPredicate: Fn(TPredicateValue) -> bool + Send + Sync + Clone, TPredicate: Fn(TPredicateValue) -> bool + Send + Sync + Clone,
{ {
/// Create a new `FilterCollector`. /// Create a new `FilterCollector`.
pub fn new(field: Field, predicate: TPredicate, collector: TCollector) -> Self { pub fn new(field: String, predicate: TPredicate, collector: TCollector) -> Self {
Self { Self {
field, field,
predicate, predicate,
@@ -110,18 +109,7 @@ where
segment_local_id: u32, segment_local_id: u32,
segment_reader: &SegmentReader, segment_reader: &SegmentReader,
) -> crate::Result<Self::Child> { ) -> crate::Result<Self::Child> {
let schema = segment_reader.schema(); let column_opt = segment_reader.fast_fields().column_opt(&self.field)?;
let field_entry = schema.get_field_entry(self.field);
if !field_entry.is_fast() {
return Err(TantivyError::SchemaError(format!(
"Field {:?} is not a fast field.",
field_entry.name()
)));
}
let column_opt = segment_reader
.fast_fields()
.column_opt(field_entry.name())?;
let segment_collector = self let segment_collector = self
.collector .collector
@@ -229,7 +217,7 @@ where
/// ///
/// let query_parser = QueryParser::for_index(&index, vec![title]); /// let query_parser = QueryParser::for_index(&index, vec![title]);
/// let query = query_parser.parse_query("diary")?; /// let query = query_parser.parse_query("diary")?;
/// let filter_collector = BytesFilterCollector::new(barcode, |bytes: &[u8]| bytes.starts_with(b"01"), TopDocs::with_limit(2)); /// let filter_collector = BytesFilterCollector::new("barcode".to_string(), |bytes: &[u8]| bytes.starts_with(b"01"), TopDocs::with_limit(2));
/// let top_docs = searcher.search(&query, &filter_collector)?; /// let top_docs = searcher.search(&query, &filter_collector)?;
/// ///
/// assert_eq!(top_docs.len(), 1); /// assert_eq!(top_docs.len(), 1);
@@ -240,7 +228,7 @@ where
pub struct BytesFilterCollector<TCollector, TPredicate> pub struct BytesFilterCollector<TCollector, TPredicate>
where TPredicate: 'static + Clone where TPredicate: 'static + Clone
{ {
field: Field, field: String,
collector: TCollector, collector: TCollector,
predicate: TPredicate, predicate: TPredicate,
} }
@@ -251,7 +239,7 @@ where
TPredicate: Fn(&[u8]) -> bool + Send + Sync + Clone, TPredicate: Fn(&[u8]) -> bool + Send + Sync + Clone,
{ {
/// Create a new `BytesFilterCollector`. /// Create a new `BytesFilterCollector`.
pub fn new(field: Field, predicate: TPredicate, collector: TCollector) -> Self { pub fn new(field: String, predicate: TPredicate, collector: TCollector) -> Self {
Self { Self {
field, field,
predicate, predicate,
@@ -274,10 +262,7 @@ where
segment_local_id: u32, segment_local_id: u32,
segment_reader: &SegmentReader, segment_reader: &SegmentReader,
) -> crate::Result<Self::Child> { ) -> crate::Result<Self::Child> {
let schema = segment_reader.schema(); let column_opt = segment_reader.fast_fields().bytes(&self.field)?;
let field_name = schema.get_field_name(self.field);
let column_opt = segment_reader.fast_fields().bytes(field_name)?;
let segment_collector = self let segment_collector = self
.collector .collector

View File

@@ -233,7 +233,7 @@ mod tests {
let val_field = schema_builder.add_i64_field("val_field", FAST); let val_field = schema_builder.add_i64_field("val_field", FAST);
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut writer = index.writer_with_num_threads(1, 4_000_000)?; let mut writer = index.writer_for_tests()?;
writer.add_document(doc!(val_field=>12i64))?; writer.add_document(doc!(val_field=>12i64))?;
writer.add_document(doc!(val_field=>-30i64))?; writer.add_document(doc!(val_field=>-30i64))?;
writer.add_document(doc!(val_field=>-12i64))?; writer.add_document(doc!(val_field=>-12i64))?;
@@ -255,7 +255,7 @@ mod tests {
let val_field = schema_builder.add_i64_field("val_field", FAST); let val_field = schema_builder.add_i64_field("val_field", FAST);
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut writer = index.writer_with_num_threads(1, 4_000_000)?; let mut writer = index.writer_for_tests()?;
writer.add_document(doc!(val_field=>12i64))?; writer.add_document(doc!(val_field=>12i64))?;
writer.commit()?; writer.commit()?;
writer.add_document(doc!(val_field=>-30i64))?; writer.add_document(doc!(val_field=>-30i64))?;
@@ -280,7 +280,7 @@ mod tests {
let date_field = schema_builder.add_date_field("date_field", FAST); let date_field = schema_builder.add_date_field("date_field", FAST);
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut writer = index.writer_with_num_threads(1, 4_000_000)?; let mut writer = index.writer_for_tests()?;
writer.add_document(doc!(date_field=>DateTime::from_primitive(Date::from_calendar_date(1982, Month::September, 17)?.with_hms(0, 0, 0)?)))?; writer.add_document(doc!(date_field=>DateTime::from_primitive(Date::from_calendar_date(1982, Month::September, 17)?.with_hms(0, 0, 0)?)))?;
writer.add_document( writer.add_document(
doc!(date_field=>DateTime::from_primitive(Date::from_calendar_date(1986, Month::March, 9)?.with_hms(0, 0, 0)?)), doc!(date_field=>DateTime::from_primitive(Date::from_calendar_date(1986, Month::March, 9)?.with_hms(0, 0, 0)?)),

View File

@@ -44,7 +44,7 @@
//! # let title = schema_builder.add_text_field("title", TEXT); //! # let title = schema_builder.add_text_field("title", TEXT);
//! # let schema = schema_builder.build(); //! # let schema = schema_builder.build();
//! # let index = Index::create_in_ram(schema); //! # let index = Index::create_in_ram(schema);
//! # let mut index_writer = index.writer(3_000_000)?; //! # let mut index_writer = index.writer(15_000_000)?;
//! # index_writer.add_document(doc!( //! # index_writer.add_document(doc!(
//! # title => "The Name of the Wind", //! # title => "The Name of the Wind",
//! # ))?; //! # ))?;
@@ -97,7 +97,8 @@ pub use self::multi_collector::{FruitHandle, MultiCollector, MultiFruit};
mod top_collector; mod top_collector;
mod top_score_collector; mod top_score_collector;
pub use self::top_score_collector::TopDocs; pub use self::top_collector::ComparableDoc;
pub use self::top_score_collector::{TopDocs, TopNComputer};
mod custom_score_top_collector; mod custom_score_top_collector;
pub use self::custom_score_top_collector::{CustomScorer, CustomSegmentScorer}; pub use self::custom_score_top_collector::{CustomScorer, CustomSegmentScorer};

View File

@@ -120,7 +120,7 @@ impl<TFruit: Fruit> FruitHandle<TFruit> {
/// let title = schema_builder.add_text_field("title", TEXT); /// let title = schema_builder.add_text_field("title", TEXT);
/// let schema = schema_builder.build(); /// let schema = schema_builder.build();
/// let index = Index::create_in_ram(schema); /// let index = Index::create_in_ram(schema);
/// let mut index_writer = index.writer(3_000_000)?; /// let mut index_writer = index.writer(15_000_000)?;
/// index_writer.add_document(doc!(title => "The Name of the Wind"))?; /// index_writer.add_document(doc!(title => "The Name of the Wind"))?;
/// index_writer.add_document(doc!(title => "The Diary of Muadib"))?; /// index_writer.add_document(doc!(title => "The Diary of Muadib"))?;
/// index_writer.add_document(doc!(title => "A Dairy Cow"))?; /// index_writer.add_document(doc!(title => "A Dairy Cow"))?;

View File

@@ -2,12 +2,14 @@ use columnar::{BytesColumn, Column};
use super::*; use super::*;
use crate::collector::{Count, FilterCollector, TopDocs}; use crate::collector::{Count, FilterCollector, TopDocs};
use crate::core::SegmentReader; use crate::index::SegmentReader;
use crate::query::{AllQuery, QueryParser}; use crate::query::{AllQuery, QueryParser};
use crate::schema::{Schema, FAST, TEXT}; use crate::schema::{Schema, FAST, TEXT};
use crate::time::format_description::well_known::Rfc3339; use crate::time::format_description::well_known::Rfc3339;
use crate::time::OffsetDateTime; use crate::time::OffsetDateTime;
use crate::{doc, DateTime, DocAddress, DocId, Document, Index, Score, Searcher, SegmentOrdinal}; use crate::{
doc, DateTime, DocAddress, DocId, Index, Score, Searcher, SegmentOrdinal, TantivyDocument,
};
pub const TEST_COLLECTOR_WITH_SCORE: TestCollector = TestCollector { pub const TEST_COLLECTOR_WITH_SCORE: TestCollector = TestCollector {
compute_score: true, compute_score: true,
@@ -40,7 +42,7 @@ pub fn test_filter_collector() -> crate::Result<()> {
let query_parser = QueryParser::for_index(&index, vec![title]); let query_parser = QueryParser::for_index(&index, vec![title]);
let query = query_parser.parse_query("diary")?; let query = query_parser.parse_query("diary")?;
let filter_some_collector = FilterCollector::new( let filter_some_collector = FilterCollector::new(
price, "price".to_string(),
&|value: u64| value > 20_120u64, &|value: u64| value > 20_120u64,
TopDocs::with_limit(2), TopDocs::with_limit(2),
); );
@@ -49,8 +51,11 @@ pub fn test_filter_collector() -> crate::Result<()> {
assert_eq!(top_docs.len(), 1); assert_eq!(top_docs.len(), 1);
assert_eq!(top_docs[0].1, DocAddress::new(0, 1)); assert_eq!(top_docs[0].1, DocAddress::new(0, 1));
let filter_all_collector: FilterCollector<_, _, u64> = let filter_all_collector: FilterCollector<_, _, u64> = FilterCollector::new(
FilterCollector::new(price, &|value| value < 5u64, TopDocs::with_limit(2)); "price".to_string(),
&|value| value < 5u64,
TopDocs::with_limit(2),
);
let filtered_top_docs = searcher.search(&query, &filter_all_collector).unwrap(); let filtered_top_docs = searcher.search(&query, &filter_all_collector).unwrap();
assert_eq!(filtered_top_docs.len(), 0); assert_eq!(filtered_top_docs.len(), 0);
@@ -61,7 +66,8 @@ pub fn test_filter_collector() -> crate::Result<()> {
> 0 > 0
} }
let filter_dates_collector = FilterCollector::new(date, &date_filter, TopDocs::with_limit(5)); let filter_dates_collector =
FilterCollector::new("date".to_string(), &date_filter, TopDocs::with_limit(5));
let filtered_date_docs = searcher.search(&query, &filter_dates_collector)?; let filtered_date_docs = searcher.search(&query, &filter_dates_collector)?;
assert_eq!(filtered_date_docs.len(), 2); assert_eq!(filtered_date_docs.len(), 2);
@@ -280,8 +286,8 @@ fn make_test_searcher() -> crate::Result<Searcher> {
let schema = Schema::builder().build(); let schema = Schema::builder().build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests()?; let mut index_writer = index.writer_for_tests()?;
index_writer.add_document(Document::default())?; index_writer.add_document(TantivyDocument::default())?;
index_writer.add_document(Document::default())?; index_writer.add_document(TantivyDocument::default())?;
index_writer.commit()?; index_writer.commit()?;
Ok(index.reader()?.searcher()) Ok(index.reader()?.searcher())
} }

View File

@@ -1,39 +1,58 @@
use std::cmp::Ordering; use std::cmp::Ordering;
use std::collections::BinaryHeap;
use std::marker::PhantomData; use std::marker::PhantomData;
use serde::{Deserialize, Serialize};
use super::top_score_collector::TopNComputer;
use crate::{DocAddress, DocId, SegmentOrdinal, SegmentReader}; use crate::{DocAddress, DocId, SegmentOrdinal, SegmentReader};
/// Contains a feature (field, score, etc.) of a document along with the document address. /// Contains a feature (field, score, etc.) of a document along with the document address.
/// ///
/// It has a custom implementation of `PartialOrd` that reverses the order. This is because the /// It guarantees stable sorting: in case of a tie on the feature, the document
/// default Rust heap is a max heap, whereas a min heap is needed.
///
/// Additionally, it guarantees stable sorting: in case of a tie on the feature, the document
/// address is used. /// address is used.
/// ///
/// The REVERSE_ORDER generic parameter controls whether the by-feature order
/// should be reversed, which is useful for achieving for example largest-first
/// semantics without having to wrap the feature in a `Reverse`.
///
/// WARNING: equality is not what you would expect here. /// WARNING: equality is not what you would expect here.
/// Two elements are equal if their feature is equal, and regardless of whether `doc` /// Two elements are equal if their feature is equal, and regardless of whether `doc`
/// is equal. This should be perfectly fine for this usage, but let's make sure this /// is equal. This should be perfectly fine for this usage, but let's make sure this
/// struct is never public. /// struct is never public.
pub(crate) struct ComparableDoc<T, D> { #[derive(Clone, Default, Serialize, Deserialize)]
pub struct ComparableDoc<T, D, const REVERSE_ORDER: bool = false> {
/// The feature of the document. In practice, this is
/// any type that implements `PartialOrd`.
pub feature: T, pub feature: T,
/// The document address. In practice, this is any
/// type that implements `PartialOrd`, and is guaranteed
/// to be unique for each document.
pub doc: D, pub doc: D,
} }
impl<T: std::fmt::Debug, D: std::fmt::Debug, const R: bool> std::fmt::Debug
for ComparableDoc<T, D, R>
{
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.debug_struct(format!("ComparableDoc<_, _, {R}>").as_str())
.field("feature", &self.feature)
.field("doc", &self.doc)
.finish()
}
}
impl<T: PartialOrd, D: PartialOrd> PartialOrd for ComparableDoc<T, D> { impl<T: PartialOrd, D: PartialOrd, const R: bool> PartialOrd for ComparableDoc<T, D, R> {
fn partial_cmp(&self, other: &Self) -> Option<Ordering> { fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
Some(self.cmp(other)) Some(self.cmp(other))
} }
} }
impl<T: PartialOrd, D: PartialOrd> Ord for ComparableDoc<T, D> { impl<T: PartialOrd, D: PartialOrd, const R: bool> Ord for ComparableDoc<T, D, R> {
#[inline] #[inline]
fn cmp(&self, other: &Self) -> Ordering { fn cmp(&self, other: &Self) -> Ordering {
// Reversed to make BinaryHeap work as a min-heap let by_feature = self
let by_feature = other
.feature .feature
.partial_cmp(&self.feature) .partial_cmp(&other.feature)
.map(|ord| if R { ord.reverse() } else { ord })
.unwrap_or(Ordering::Equal); .unwrap_or(Ordering::Equal);
let lazy_by_doc_address = || self.doc.partial_cmp(&other.doc).unwrap_or(Ordering::Equal); let lazy_by_doc_address = || self.doc.partial_cmp(&other.doc).unwrap_or(Ordering::Equal);
@@ -45,13 +64,13 @@ impl<T: PartialOrd, D: PartialOrd> Ord for ComparableDoc<T, D> {
} }
} }
impl<T: PartialOrd, D: PartialOrd> PartialEq for ComparableDoc<T, D> { impl<T: PartialOrd, D: PartialOrd, const R: bool> PartialEq for ComparableDoc<T, D, R> {
fn eq(&self, other: &Self) -> bool { fn eq(&self, other: &Self) -> bool {
self.cmp(other) == Ordering::Equal self.cmp(other) == Ordering::Equal
} }
} }
impl<T: PartialOrd, D: PartialOrd> Eq for ComparableDoc<T, D> {} impl<T: PartialOrd, D: PartialOrd, const R: bool> Eq for ComparableDoc<T, D, R> {}
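As a side note, with ComparableDoc now public (re-exported from tantivy::collector in the hunk above), the effect of the REVERSE_ORDER parameter can be shown with a minimal, self-contained sketch; the import path and the concrete numbers below are illustrative assumptions based on this diff.

use tantivy::collector::ComparableDoc;

fn main() {
    // With REVERSE_ORDER = true, the larger feature sorts first; ties on the
    // feature fall back to the doc value (ascending), giving a stable order.
    let mut docs: Vec<ComparableDoc<u32, u32, true>> = vec![
        ComparableDoc { feature: 1, doc: 7 },
        ComparableDoc { feature: 3, doc: 2 },
        ComparableDoc { feature: 1, doc: 4 },
    ];
    docs.sort();
    let order: Vec<(u32, u32)> = docs.iter().map(|d| (d.feature, d.doc)).collect();
    assert_eq!(order, vec![(3, 2), (1, 4), (1, 7)]);
}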
pub(crate) struct TopCollector<T> { pub(crate) struct TopCollector<T> {
pub limit: usize, pub limit: usize,
@@ -91,18 +110,13 @@ where T: PartialOrd + Clone
if self.limit == 0 { if self.limit == 0 {
return Ok(Vec::new()); return Ok(Vec::new());
} }
let mut top_collector = BinaryHeap::new(); let mut top_collector: TopNComputer<_, _> = TopNComputer::new(self.limit + self.offset);
for child_fruit in children { for child_fruit in children {
for (feature, doc) in child_fruit { for (feature, doc) in child_fruit {
if top_collector.len() < (self.limit + self.offset) { top_collector.push(feature, doc);
top_collector.push(ComparableDoc { feature, doc });
} else if let Some(mut head) = top_collector.peek_mut() {
if head.feature < feature {
*head = ComparableDoc { feature, doc };
}
}
} }
} }
Ok(top_collector Ok(top_collector
.into_sorted_vec() .into_sorted_vec()
.into_iter() .into_iter()
@@ -111,7 +125,7 @@ where T: PartialOrd + Clone
.collect()) .collect())
} }
pub(crate) fn for_segment<F: PartialOrd>( pub(crate) fn for_segment<F: PartialOrd + Clone>(
&self, &self,
segment_id: SegmentOrdinal, segment_id: SegmentOrdinal,
_: &SegmentReader, _: &SegmentReader,
@@ -136,20 +150,20 @@ where T: PartialOrd + Clone
/// The Top Collector keeps track of the K documents /// The Top Collector keeps track of the K documents
/// sorted by type `T`. /// sorted by type `T`.
/// ///
/// The implementation is based on a `BinaryHeap`. /// The implementation is based on repeatedly truncating the buffer at the median once `K * 2` documents have been collected.
/// The theoretical complexity for collecting the top `K` out of `n` documents /// The theoretical complexity for collecting the top `K` out of `n` documents
/// is `O(n log K)`. /// is `O(n + K)`.
pub(crate) struct TopSegmentCollector<T> { pub(crate) struct TopSegmentCollector<T> {
limit: usize, /// We reverse the order of the feature in order to
heap: BinaryHeap<ComparableDoc<T, DocId>>, /// have top semantics instead of bottom semantics.
topn_computer: TopNComputer<T, DocId>,
segment_ord: u32, segment_ord: u32,
} }
impl<T: PartialOrd> TopSegmentCollector<T> { impl<T: PartialOrd + Clone> TopSegmentCollector<T> {
fn new(segment_ord: SegmentOrdinal, limit: usize) -> TopSegmentCollector<T> { fn new(segment_ord: SegmentOrdinal, limit: usize) -> TopSegmentCollector<T> {
TopSegmentCollector { TopSegmentCollector {
limit, topn_computer: TopNComputer::new(limit),
heap: BinaryHeap::with_capacity(limit),
segment_ord, segment_ord,
} }
} }
@@ -158,7 +172,7 @@ impl<T: PartialOrd> TopSegmentCollector<T> {
impl<T: PartialOrd + Clone> TopSegmentCollector<T> { impl<T: PartialOrd + Clone> TopSegmentCollector<T> {
pub fn harvest(self) -> Vec<(T, DocAddress)> { pub fn harvest(self) -> Vec<(T, DocAddress)> {
let segment_ord = self.segment_ord; let segment_ord = self.segment_ord;
self.heap self.topn_computer
.into_sorted_vec() .into_sorted_vec()
.into_iter() .into_iter()
.map(|comparable_doc| { .map(|comparable_doc| {
@@ -173,33 +187,13 @@ impl<T: PartialOrd + Clone> TopSegmentCollector<T> {
.collect() .collect()
} }
/// Return true if more documents have been collected than the limit.
#[inline]
pub(crate) fn at_capacity(&self) -> bool {
self.heap.len() >= self.limit
}
/// Collects a document scored by the given feature /// Collects a document scored by the given feature
/// ///
/// It collects documents until it has reached the max capacity. Once it reaches capacity, it /// It collects documents until it has reached the max capacity. Once it reaches capacity, it
/// will compare the lowest scoring item with the given one and keep whichever is greater. /// will compare the lowest scoring item with the given one and keep whichever is greater.
#[inline] #[inline]
pub fn collect(&mut self, doc: DocId, feature: T) { pub fn collect(&mut self, doc: DocId, feature: T) {
if self.at_capacity() { self.topn_computer.push(feature, doc);
// It's ok to unwrap as long as a limit of 0 is forbidden.
if let Some(limit_feature) = self.heap.peek().map(|head| head.feature.clone()) {
if limit_feature < feature {
if let Some(mut head) = self.heap.peek_mut() {
head.feature = feature;
head.doc = doc;
}
}
}
} else {
// we have not reached capacity yet, so we can just push the
// element.
self.heap.push(ComparableDoc { feature, doc });
}
} }
} }

View File

@@ -1,9 +1,10 @@
use std::collections::BinaryHeap;
use std::fmt; use std::fmt;
use std::marker::PhantomData; use std::marker::PhantomData;
use std::sync::Arc; use std::sync::Arc;
use columnar::ColumnValues; use columnar::ColumnValues;
use serde::de::DeserializeOwned;
use serde::{Deserialize, Serialize};
use super::Collector; use super::Collector;
use crate::collector::custom_score_top_collector::CustomScoreTopCollector; use crate::collector::custom_score_top_collector::CustomScoreTopCollector;
@@ -86,12 +87,15 @@ where
/// The `TopDocs` collector keeps track of the top `K` documents /// The `TopDocs` collector keeps track of the top `K` documents
/// sorted by their score. /// sorted by their score.
/// ///
/// The implementation is based on a `BinaryHeap`. /// The implementation is based on repeatedly truncating the buffer at the median once `K * 2` documents
/// The theoretical complexity for collecting the top `K` out of `n` documents /// have been collected, using pattern-defeating quicksort.
/// is `O(n log K)`. /// The theoretical complexity for collecting the top `K` out of `N` documents
/// is `O(N + K)`.
/// ///
/// This collector guarantees a stable sorting in case of a tie on the /// This collector does not guarantee a stable sorting in case of a tie on the
/// document score. As such, it is suitable to implement pagination. /// document score. For stable sorting, `PartialOrd` needs to resolve ties on
/// other fields, such as the doc id, when scores are equal.
/// Only then is it suitable for pagination.
/// ///
/// ```rust /// ```rust
/// use tantivy::collector::TopDocs; /// use tantivy::collector::TopDocs;
@@ -307,7 +311,7 @@ impl TopDocs {
/// ///
/// To comfortably work with `u64`s, `i64`s, `f64`s, or `date`s, please refer to /// To comfortably work with `u64`s, `i64`s, `f64`s, or `date`s, please refer to
/// the [.order_by_fast_field(...)](TopDocs::order_by_fast_field) method. /// the [.order_by_fast_field(...)](TopDocs::order_by_fast_field) method.
fn order_by_u64_field( pub fn order_by_u64_field(
self, self,
field: impl ToString, field: impl ToString,
order: Order, order: Order,
@@ -661,50 +665,27 @@ impl Collector for TopDocs {
reader: &SegmentReader, reader: &SegmentReader,
) -> crate::Result<<Self::Child as SegmentCollector>::Fruit> { ) -> crate::Result<<Self::Child as SegmentCollector>::Fruit> {
let heap_len = self.0.limit + self.0.offset; let heap_len = self.0.limit + self.0.offset;
let mut heap: BinaryHeap<ComparableDoc<Score, DocId>> = BinaryHeap::with_capacity(heap_len); let mut top_n: TopNComputer<_, _> = TopNComputer::new(heap_len);
if let Some(alive_bitset) = reader.alive_bitset() { if let Some(alive_bitset) = reader.alive_bitset() {
let mut threshold = Score::MIN; let mut threshold = Score::MIN;
weight.for_each_pruning(threshold, reader, &mut |doc, score| { top_n.threshold = Some(threshold);
weight.for_each_pruning(Score::MIN, reader, &mut |doc, score| {
if alive_bitset.is_deleted(doc) { if alive_bitset.is_deleted(doc) {
return threshold; return threshold;
} }
let heap_item = ComparableDoc { top_n.push(score, doc);
feature: score, threshold = top_n.threshold.unwrap_or(Score::MIN);
doc,
};
if heap.len() < heap_len {
heap.push(heap_item);
if heap.len() == heap_len {
threshold = heap.peek().map(|el| el.feature).unwrap_or(Score::MIN);
}
return threshold;
}
*heap.peek_mut().unwrap() = heap_item;
threshold = heap.peek().map(|el| el.feature).unwrap_or(Score::MIN);
threshold threshold
})?; })?;
} else { } else {
weight.for_each_pruning(Score::MIN, reader, &mut |doc, score| { weight.for_each_pruning(Score::MIN, reader, &mut |doc, score| {
let heap_item = ComparableDoc { top_n.push(score, doc);
feature: score, top_n.threshold.unwrap_or(Score::MIN)
doc,
};
if heap.len() < heap_len {
heap.push(heap_item);
// TODO the threshold is suboptimal for heap.len == heap_len
if heap.len() == heap_len {
return heap.peek().map(|el| el.feature).unwrap_or(Score::MIN);
} else {
return Score::MIN;
}
}
*heap.peek_mut().unwrap() = heap_item;
heap.peek().map(|el| el.feature).unwrap_or(Score::MIN)
})?; })?;
} }
let fruit = heap let fruit = top_n
.into_sorted_vec() .into_sorted_vec()
.into_iter() .into_iter()
.map(|cid| { .map(|cid| {
@@ -736,9 +717,142 @@ impl SegmentCollector for TopScoreSegmentCollector {
} }
} }
/// Fast TopN Computation
///
/// The capacity of the internal buffer is `2 * top_n`.
/// The buffer is truncated down to the best `top_n` elements whenever it reaches that capacity.
/// This means the capacity carries meaning and must be preserved when cloning or serializing.
///
/// For `top_n == 0`, pushing is relatively expensive.
#[derive(Serialize, Deserialize)]
#[serde(from = "TopNComputerDeser<Score, D, REVERSE_ORDER>")]
pub struct TopNComputer<Score, D, const REVERSE_ORDER: bool = true> {
/// The buffer reverses sort order to get top-semantics instead of bottom-semantics
buffer: Vec<ComparableDoc<Score, D, REVERSE_ORDER>>,
top_n: usize,
pub(crate) threshold: Option<Score>,
}
// Intermediate struct used to deserialize TopNComputer while restoring the vec capacity invariant
#[derive(Deserialize)]
struct TopNComputerDeser<Score, D, const REVERSE_ORDER: bool> {
buffer: Vec<ComparableDoc<Score, D, REVERSE_ORDER>>,
top_n: usize,
threshold: Option<Score>,
}
// Custom clone to keep capacity
impl<Score: Clone, D: Clone, const REVERSE_ORDER: bool> Clone
for TopNComputer<Score, D, REVERSE_ORDER>
{
fn clone(&self) -> Self {
let mut buffer_clone = Vec::with_capacity(self.buffer.capacity());
buffer_clone.extend(self.buffer.iter().cloned());
TopNComputer {
buffer: buffer_clone,
top_n: self.top_n,
threshold: self.threshold.clone(),
}
}
}
impl<Score, D, const R: bool> From<TopNComputerDeser<Score, D, R>> for TopNComputer<Score, D, R> {
fn from(mut value: TopNComputerDeser<Score, D, R>) -> Self {
let expected_cap = value.top_n.max(1) * 2;
let current_cap = value.buffer.capacity();
if current_cap < expected_cap {
value.buffer.reserve_exact(expected_cap - current_cap);
} else {
value.buffer.shrink_to(expected_cap);
}
TopNComputer {
buffer: value.buffer,
top_n: value.top_n,
threshold: value.threshold,
}
}
}
impl<Score, D, const R: bool> TopNComputer<Score, D, R>
where
Score: PartialOrd + Clone,
D: Serialize + DeserializeOwned + Ord + Clone,
{
/// Create a new `TopNComputer`.
/// Internally it will allocate a buffer of size `2 * top_n`.
pub fn new(top_n: usize) -> Self {
let vec_cap = top_n.max(1) * 2;
TopNComputer {
buffer: Vec::with_capacity(vec_cap),
top_n,
threshold: None,
}
}
/// Push a new document to the top n.
/// If the document is below the current threshold, it will be ignored.
#[inline]
pub fn push(&mut self, feature: Score, doc: D) {
if let Some(last_median) = self.threshold.clone() {
if feature < last_median {
return;
}
}
if self.buffer.len() == self.buffer.capacity() {
let median = self.truncate_top_n();
self.threshold = Some(median);
}
// Writing into the spare capacity is faster than `vec.push()`, since it keeps
// the buffer-resizing code out of this hot path (push would inline it here).
// TODO: Replace with `push_within_capacity` when it's stabilized
let uninit = self.buffer.spare_capacity_mut();
// This indexing cannot panic: `truncate_top_n` above frees at least one slot,
// since the minimum capacity is 2, so there is always spare capacity here.
uninit[0].write(ComparableDoc { doc, feature });
// SAFETY: the write above would have panicked if there were no spare slot,
// so the new length stays within capacity and the new element is initialized.
unsafe {
self.buffer.set_len(self.buffer.len() + 1);
}
}
#[inline(never)]
fn truncate_top_n(&mut self) -> Score {
// Use select_nth_unstable to partition the buffer: the best top_n elements land in front, and the element at index top_n is the new threshold score.
let (_, median_el, _) = self.buffer.select_nth_unstable(self.top_n);
let median_score = median_el.feature.clone();
// Drop everything beyond the first top_n elements
self.buffer.truncate(self.top_n);
median_score
}
/// Returns the top n elements in sorted order.
pub fn into_sorted_vec(mut self) -> Vec<ComparableDoc<Score, D, R>> {
if self.buffer.len() > self.top_n {
self.truncate_top_n();
}
self.buffer.sort_unstable();
self.buffer
}
/// Returns the top n elements in their current (unsorted) buffer order.
/// Useful if you do not need the elements sorted,
/// for example when merging the results of multiple segments.
pub fn into_vec(mut self) -> Vec<ComparableDoc<Score, D, R>> {
if self.buffer.len() > self.top_n {
self.truncate_top_n();
}
self.buffer
}
}
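For readers following the new TopNComputer logic, here is a self-contained sketch of the same buffer-then-truncate-at-the-median idea on plain u64 scores, without the serde plumbing or the unsafe capacity handling; the TopK type and its methods are illustrative stand-ins, not tantivy API.

struct TopK {
    buf: Vec<u64>,
    k: usize,
    threshold: Option<u64>,
}

impl TopK {
    fn new(k: usize) -> Self {
        TopK { buf: Vec::with_capacity(k.max(1) * 2), k, threshold: None }
    }

    fn push(&mut self, score: u64) {
        // Anything strictly below the last median can never enter the top K.
        if let Some(t) = self.threshold {
            if score < t {
                return;
            }
        }
        if self.buf.len() >= self.k.max(1) * 2 {
            // Partition so the K best scores sit in buf[..k]; the score at
            // index k becomes the new pruning threshold, the tail is dropped.
            let (_, kth, _) = self.buf.select_nth_unstable_by(self.k, |a, b| b.cmp(a));
            self.threshold = Some(*kth);
            self.buf.truncate(self.k);
        }
        self.buf.push(score);
    }

    fn into_sorted_vec(mut self) -> Vec<u64> {
        self.buf.sort_unstable_by(|a, b| b.cmp(a));
        self.buf.truncate(self.k);
        self.buf
    }
}

fn main() {
    let mut top = TopK::new(2);
    for score in [1u64, 5, 3, 2, 8, 7] {
        top.push(score);
    }
    assert_eq!(top.into_sorted_vec(), vec![8, 7]);
}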
#[cfg(test)] #[cfg(test)]
mod tests { mod tests {
use super::TopDocs; use super::{TopDocs, TopNComputer};
use crate::collector::top_collector::ComparableDoc;
use crate::collector::Collector; use crate::collector::Collector;
use crate::query::{AllQuery, Query, QueryParser}; use crate::query::{AllQuery, Query, QueryParser};
use crate::schema::{Field, Schema, FAST, STORED, TEXT}; use crate::schema::{Field, Schema, FAST, STORED, TEXT};
@@ -766,6 +880,70 @@ mod tests {
crate::assert_nearly_equals!(result.0, expected.0); crate::assert_nearly_equals!(result.0, expected.0);
} }
} }
#[test]
fn test_topn_computer_serde() {
let computer: TopNComputer<u32, u32> = TopNComputer::new(1);
let computer_ser = serde_json::to_string(&computer).unwrap();
let mut computer: TopNComputer<u32, u32> = serde_json::from_str(&computer_ser).unwrap();
computer.push(1u32, 5u32);
computer.push(1u32, 0u32);
computer.push(1u32, 7u32);
assert_eq!(
computer.into_sorted_vec(),
&[ComparableDoc {
feature: 1u32,
doc: 0u32,
},]
);
}
#[test]
fn test_empty_topn_computer() {
let mut computer: TopNComputer<u32, u32> = TopNComputer::new(0);
computer.push(1u32, 1u32);
computer.push(1u32, 2u32);
computer.push(1u32, 3u32);
assert!(computer.into_sorted_vec().is_empty());
}
#[test]
fn test_topn_computer() {
let mut computer: TopNComputer<u32, u32> = TopNComputer::new(2);
computer.push(1u32, 1u32);
computer.push(2u32, 2u32);
computer.push(3u32, 3u32);
computer.push(2u32, 4u32);
computer.push(1u32, 5u32);
assert_eq!(
computer.into_sorted_vec(),
&[
ComparableDoc {
feature: 3u32,
doc: 3u32,
},
ComparableDoc {
feature: 2u32,
doc: 2u32,
}
]
);
}
#[test]
fn test_topn_computer_no_panic() {
for top_n in 0..10 {
let mut computer: TopNComputer<u32, u32> = TopNComputer::new(top_n);
for _ in 0..1 + top_n * 2 {
computer.push(1u32, 1u32);
}
let _vals = computer.into_sorted_vec();
}
}
#[test] #[test]
fn test_top_collector_not_at_capacity_without_offset() -> crate::Result<()> { fn test_top_collector_not_at_capacity_without_offset() -> crate::Result<()> {
@@ -852,20 +1030,25 @@ mod tests {
// using AllQuery to get a constant score // using AllQuery to get a constant score
let searcher = index.reader().unwrap().searcher(); let searcher = index.reader().unwrap().searcher();
let page_0 = searcher.search(&AllQuery, &TopDocs::with_limit(1)).unwrap();
let page_1 = searcher.search(&AllQuery, &TopDocs::with_limit(2)).unwrap(); let page_1 = searcher.search(&AllQuery, &TopDocs::with_limit(2)).unwrap();
let page_2 = searcher.search(&AllQuery, &TopDocs::with_limit(3)).unwrap(); let page_2 = searcher.search(&AllQuery, &TopDocs::with_limit(3)).unwrap();
// precondition for the test to be meaningful: we did get documents // precondition for the test to be meaningful: we did get documents
// with the same score // with the same score
assert!(page_0.iter().all(|result| result.0 == page_1[0].0));
assert!(page_1.iter().all(|result| result.0 == page_1[0].0)); assert!(page_1.iter().all(|result| result.0 == page_1[0].0));
assert!(page_2.iter().all(|result| result.0 == page_2[0].0)); assert!(page_2.iter().all(|result| result.0 == page_2[0].0));
// sanity check since we're relying on make_index() // sanity check since we're relying on make_index()
assert_eq!(page_0.len(), 1);
assert_eq!(page_1.len(), 2); assert_eq!(page_1.len(), 2);
assert_eq!(page_2.len(), 3); assert_eq!(page_2.len(), 3);
assert_eq!(page_1, &page_2[..page_1.len()]); assert_eq!(page_1, &page_2[..page_1.len()]);
assert_eq!(page_0, &page_2[..page_0.len()]);
} }
#[test] #[test]

View File

@@ -1,11 +1,11 @@
use columnar::MonotonicallyMappableToU64; use columnar::MonotonicallyMappableToU64;
use common::replace_in_place; use common::{replace_in_place, JsonPathWriter};
use murmurhash32::murmurhash2;
use rustc_hash::FxHashMap; use rustc_hash::FxHashMap;
use crate::fastfield::FastValue; use crate::fastfield::FastValue;
use crate::postings::{IndexingContext, IndexingPosition, PostingsWriter}; use crate::postings::{IndexingContext, IndexingPosition, PostingsWriter};
use crate::schema::term::{JSON_PATH_SEGMENT_SEP, JSON_PATH_SEGMENT_SEP_STR}; use crate::schema::document::{ReferenceValue, ReferenceValueLeaf, Value};
use crate::schema::term::JSON_PATH_SEGMENT_SEP;
use crate::schema::{Field, Type, DATE_TIME_PRECISION_INDEXED}; use crate::schema::{Field, Type, DATE_TIME_PRECISION_INDEXED};
use crate::time::format_description::well_known::Rfc3339; use crate::time::format_description::well_known::Rfc3339;
use crate::time::{OffsetDateTime, UtcOffset}; use crate::time::{OffsetDateTime, UtcOffset};
@@ -57,31 +57,41 @@ struct IndexingPositionsPerPath {
} }
impl IndexingPositionsPerPath { impl IndexingPositionsPerPath {
fn get_position(&mut self, term: &Term) -> &mut IndexingPosition { fn get_position_from_id(&mut self, id: u32) -> &mut IndexingPosition {
self.positions_per_path self.positions_per_path.entry(id).or_default()
.entry(murmurhash2(term.serialized_term()))
.or_default()
} }
} }
pub(crate) fn index_json_values<'a>( /// Convert JSON_PATH_SEGMENT_SEP to a dot.
pub fn json_path_sep_to_dot(path: &mut str) {
// This is safe since we are replacing an ASCII character with another ASCII character, so the string stays valid UTF-8.
unsafe {
replace_in_place(JSON_PATH_SEGMENT_SEP, b'.', path.as_bytes_mut());
}
}
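As a standalone illustration of why the unsafe block in json_path_sep_to_dot is sound, the sketch below performs the same ASCII-for-ASCII in-place swap with an explicit loop instead of replace_in_place; the 0x01 separator byte is only an example value.

fn replace_ascii_in_place(from: u8, to: u8, s: &mut str) {
    debug_assert!(from.is_ascii() && to.is_ascii());
    // SAFETY: swapping one ASCII byte for another can never produce an
    // invalid UTF-8 sequence, so mutating the bytes directly is fine.
    unsafe {
        for byte in s.as_bytes_mut() {
            if *byte == from {
                *byte = to;
            }
        }
    }
}

fn main() {
    let mut path = String::from("k8s\u{1}node\u{1}id");
    replace_ascii_in_place(0x01, b'.', path.as_mut_str());
    assert_eq!(path, "k8s.node.id");
}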
#[allow(clippy::too_many_arguments)]
pub(crate) fn index_json_values<'a, V: Value<'a>>(
doc: DocId, doc: DocId,
json_values: impl Iterator<Item = crate::Result<&'a serde_json::Map<String, serde_json::Value>>>, json_visitors: impl Iterator<Item = crate::Result<V::ObjectIter>>,
text_analyzer: &mut TextAnalyzer, text_analyzer: &mut TextAnalyzer,
expand_dots_enabled: bool, expand_dots_enabled: bool,
term_buffer: &mut Term, term_buffer: &mut Term,
postings_writer: &mut dyn PostingsWriter, postings_writer: &mut dyn PostingsWriter,
json_path_writer: &mut JsonPathWriter,
ctx: &mut IndexingContext, ctx: &mut IndexingContext,
) -> crate::Result<()> { ) -> crate::Result<()> {
let mut json_term_writer = JsonTermWriter::wrap(term_buffer, expand_dots_enabled); json_path_writer.clear();
json_path_writer.set_expand_dots(expand_dots_enabled);
let mut positions_per_path: IndexingPositionsPerPath = Default::default(); let mut positions_per_path: IndexingPositionsPerPath = Default::default();
for json_value_res in json_values { for json_visitor_res in json_visitors {
let json_value = json_value_res?; let json_visitor = json_visitor_res?;
index_json_object( index_json_object::<V>(
doc, doc,
json_value, json_visitor,
text_analyzer, text_analyzer,
&mut json_term_writer, term_buffer,
json_path_writer,
postings_writer, postings_writer,
ctx, ctx,
&mut positions_per_path, &mut positions_per_path,
@@ -90,93 +100,154 @@ pub(crate) fn index_json_values<'a>(
Ok(()) Ok(())
} }
fn index_json_object( #[allow(clippy::too_many_arguments)]
fn index_json_object<'a, V: Value<'a>>(
doc: DocId, doc: DocId,
json_value: &serde_json::Map<String, serde_json::Value>, json_visitor: V::ObjectIter,
text_analyzer: &mut TextAnalyzer, text_analyzer: &mut TextAnalyzer,
json_term_writer: &mut JsonTermWriter, term_buffer: &mut Term,
json_path_writer: &mut JsonPathWriter,
postings_writer: &mut dyn PostingsWriter, postings_writer: &mut dyn PostingsWriter,
ctx: &mut IndexingContext, ctx: &mut IndexingContext,
positions_per_path: &mut IndexingPositionsPerPath, positions_per_path: &mut IndexingPositionsPerPath,
) { ) {
for (json_path_segment, json_value) in json_value { for (json_path_segment, json_value_visitor) in json_visitor {
json_term_writer.push_path_segment(json_path_segment); json_path_writer.push(json_path_segment);
index_json_value( index_json_value(
doc, doc,
json_value, json_value_visitor,
text_analyzer, text_analyzer,
json_term_writer, term_buffer,
json_path_writer,
postings_writer, postings_writer,
ctx, ctx,
positions_per_path, positions_per_path,
); );
json_term_writer.pop_path_segment(); json_path_writer.pop();
} }
} }
fn index_json_value( #[allow(clippy::too_many_arguments)]
fn index_json_value<'a, V: Value<'a>>(
doc: DocId, doc: DocId,
json_value: &serde_json::Value, json_value: V,
text_analyzer: &mut TextAnalyzer, text_analyzer: &mut TextAnalyzer,
json_term_writer: &mut JsonTermWriter, term_buffer: &mut Term,
json_path_writer: &mut JsonPathWriter,
postings_writer: &mut dyn PostingsWriter, postings_writer: &mut dyn PostingsWriter,
ctx: &mut IndexingContext, ctx: &mut IndexingContext,
positions_per_path: &mut IndexingPositionsPerPath, positions_per_path: &mut IndexingPositionsPerPath,
) { ) {
match json_value { let set_path_id = |term_buffer: &mut Term, unordered_id: u32| {
serde_json::Value::Null => {} term_buffer.truncate_value_bytes(0);
serde_json::Value::Bool(val_bool) => { term_buffer.append_bytes(&unordered_id.to_be_bytes());
json_term_writer.set_fast_value(*val_bool); };
postings_writer.subscribe(doc, 0u32, json_term_writer.term(), ctx); let set_type = |term_buffer: &mut Term, typ: Type| {
} term_buffer.append_bytes(&[typ.to_code()]);
serde_json::Value::Number(number) => { };
if let Some(number_i64) = number.as_i64() {
json_term_writer.set_fast_value(number_i64); match json_value.as_value() {
} else if let Some(number_u64) = number.as_u64() { ReferenceValue::Leaf(leaf) => match leaf {
json_term_writer.set_fast_value(number_u64); ReferenceValueLeaf::Null => {}
} else if let Some(number_f64) = number.as_f64() { ReferenceValueLeaf::Str(val) => {
json_term_writer.set_fast_value(number_f64); let mut token_stream = text_analyzer.token_stream(val);
} let unordered_id = ctx
postings_writer.subscribe(doc, 0u32, json_term_writer.term(), ctx); .path_to_unordered_id
} .get_or_allocate_unordered_id(json_path_writer.as_str());
serde_json::Value::String(text) => match infer_type_from_str(text) {
TextOrDateTime::Text(text) => { // TODO: make sure the chain position works out.
let mut token_stream = text_analyzer.token_stream(text); set_path_id(term_buffer, unordered_id);
// TODO make sure the chain position works out. set_type(term_buffer, Type::Str);
json_term_writer.close_path_and_set_type(Type::Str); let indexing_position = positions_per_path.get_position_from_id(unordered_id);
let indexing_position = positions_per_path.get_position(json_term_writer.term());
postings_writer.index_text( postings_writer.index_text(
doc, doc,
&mut *token_stream, &mut *token_stream,
json_term_writer.term_buffer, term_buffer,
ctx, ctx,
indexing_position, indexing_position,
); );
} }
TextOrDateTime::DateTime(dt) => { ReferenceValueLeaf::U64(val) => {
json_term_writer.set_fast_value(DateTime::from_utc(dt)); set_path_id(
postings_writer.subscribe(doc, 0u32, json_term_writer.term(), ctx); term_buffer,
ctx.path_to_unordered_id
.get_or_allocate_unordered_id(json_path_writer.as_str()),
);
term_buffer.append_type_and_fast_value(val);
postings_writer.subscribe(doc, 0u32, term_buffer, ctx);
}
ReferenceValueLeaf::I64(val) => {
set_path_id(
term_buffer,
ctx.path_to_unordered_id
.get_or_allocate_unordered_id(json_path_writer.as_str()),
);
term_buffer.append_type_and_fast_value(val);
postings_writer.subscribe(doc, 0u32, term_buffer, ctx);
}
ReferenceValueLeaf::F64(val) => {
set_path_id(
term_buffer,
ctx.path_to_unordered_id
.get_or_allocate_unordered_id(json_path_writer.as_str()),
);
term_buffer.append_type_and_fast_value(val);
postings_writer.subscribe(doc, 0u32, term_buffer, ctx);
}
ReferenceValueLeaf::Bool(val) => {
set_path_id(
term_buffer,
ctx.path_to_unordered_id
.get_or_allocate_unordered_id(json_path_writer.as_str()),
);
term_buffer.append_type_and_fast_value(val);
postings_writer.subscribe(doc, 0u32, term_buffer, ctx);
}
ReferenceValueLeaf::Date(val) => {
set_path_id(
term_buffer,
ctx.path_to_unordered_id
.get_or_allocate_unordered_id(json_path_writer.as_str()),
);
term_buffer.append_type_and_fast_value(val);
postings_writer.subscribe(doc, 0u32, term_buffer, ctx);
}
ReferenceValueLeaf::PreTokStr(_) => {
unimplemented!(
"Pre-tokenized string support in dynamic fields is not yet implemented"
)
}
ReferenceValueLeaf::Bytes(_) => {
unimplemented!("Bytes support in dynamic fields is not yet implemented")
}
ReferenceValueLeaf::Facet(_) => {
unimplemented!("Facet support in dynamic fields is not yet implemented")
}
ReferenceValueLeaf::IpAddr(_) => {
unimplemented!("IP address support in dynamic fields is not yet implemented")
} }
}, },
serde_json::Value::Array(arr) => { ReferenceValue::Array(elements) => {
for val in arr { for val in elements {
index_json_value( index_json_value(
doc, doc,
val, val,
text_analyzer, text_analyzer,
json_term_writer, term_buffer,
json_path_writer,
postings_writer, postings_writer,
ctx, ctx,
positions_per_path, positions_per_path,
); );
} }
} }
serde_json::Value::Object(map) => { ReferenceValue::Object(object) => {
index_json_object( index_json_object::<V>(
doc, doc,
map, object,
text_analyzer, text_analyzer,
json_term_writer, term_buffer,
json_path_writer,
postings_writer, postings_writer,
ctx, ctx,
positions_per_path, positions_per_path,
@@ -185,21 +256,6 @@ fn index_json_value(
} }
} }
enum TextOrDateTime<'a> {
Text(&'a str),
DateTime(OffsetDateTime),
}
fn infer_type_from_str(text: &str) -> TextOrDateTime {
match OffsetDateTime::parse(text, &Rfc3339) {
Ok(dt) => {
let dt_utc = dt.to_offset(UtcOffset::UTC);
TextOrDateTime::DateTime(dt_utc)
}
Err(_) => TextOrDateTime::Text(text),
}
}
// Tries to infer a JSON type from a string. // Tries to infer a JSON type from a string.
pub fn convert_to_fast_value_and_get_term( pub fn convert_to_fast_value_and_get_term(
json_term_writer: &mut JsonTermWriter, json_term_writer: &mut JsonTermWriter,
@@ -272,7 +328,7 @@ pub struct JsonTermWriter<'a> {
/// In other words, /// In other words,
/// - `k8s.node` ends up as `["k8s", "node"]`. /// - `k8s.node` ends up as `["k8s", "node"]`.
/// - `k8s\.node` ends up as `["k8s.node"]`. /// - `k8s\.node` ends up as `["k8s.node"]`.
fn split_json_path(json_path: &str) -> Vec<String> { pub fn split_json_path(json_path: &str) -> Vec<String> {
let mut escaped_state: bool = false; let mut escaped_state: bool = false;
let mut json_path_segments = Vec::new(); let mut json_path_segments = Vec::new();
let mut buffer = String::new(); let mut buffer = String::new();
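A quick sketch of the escaping rule documented above, now that split_json_path is public; the tantivy::json_utils import path is inferred from this diff (the module is #[doc(hidden)]), so treat it as an assumption.

use tantivy::json_utils::split_json_path;

fn main() {
    // An unescaped dot starts a new segment...
    assert_eq!(split_json_path("k8s.node"), vec!["k8s", "node"]);
    // ...while an escaped dot stays inside the segment.
    assert_eq!(split_json_path(r"k8s\.node"), vec!["k8s.node"]);
}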
@@ -312,17 +368,13 @@ pub(crate) fn encode_column_name(
json_path: &str, json_path: &str,
expand_dots_enabled: bool, expand_dots_enabled: bool,
) -> String { ) -> String {
let mut column_key: String = String::with_capacity(field_name.len() + json_path.len() + 1); let mut path = JsonPathWriter::default();
column_key.push_str(field_name); path.push(field_name);
for mut segment in split_json_path(json_path) { path.set_expand_dots(expand_dots_enabled);
column_key.push_str(JSON_PATH_SEGMENT_SEP_STR); for segment in split_json_path(json_path) {
if expand_dots_enabled { path.push(&segment);
// We need to replace `.` by JSON_PATH_SEGMENT_SEP.
unsafe { replace_in_place(b'.', JSON_PATH_SEGMENT_SEP, segment.as_bytes_mut()) };
}
column_key.push_str(&segment);
} }
column_key path.into()
} }
impl<'a> JsonTermWriter<'a> { impl<'a> JsonTermWriter<'a> {
@@ -362,6 +414,7 @@ impl<'a> JsonTermWriter<'a> {
self.term_buffer.append_bytes(&[typ.to_code()]); self.term_buffer.append_bytes(&[typ.to_code()]);
} }
// TODO: Remove this function and use JsonPathWriter instead.
pub fn push_path_segment(&mut self, segment: &str) { pub fn push_path_segment(&mut self, segment: &str) {
// the path stack should never be empty. // the path stack should never be empty.
self.trim_to_end_of_path(); self.trim_to_end_of_path();

View File

@@ -1,32 +1,14 @@
mod executor; mod executor;
pub mod index;
mod index_meta;
mod inverted_index_reader;
#[doc(hidden)] #[doc(hidden)]
pub mod json_utils; pub mod json_utils;
pub mod searcher; pub mod searcher;
mod segment;
mod segment_component;
mod segment_id;
mod segment_reader;
mod single_segment_index_writer;
use std::path::Path; use std::path::Path;
use once_cell::sync::Lazy; use once_cell::sync::Lazy;
pub use self::executor::Executor; pub use self::executor::Executor;
pub use self::index::{Index, IndexBuilder};
pub use self::index_meta::{
IndexMeta, IndexSettings, IndexSortByField, Order, SegmentMeta, SegmentMetaInventory,
};
pub use self::inverted_index_reader::InvertedIndexReader;
pub use self::searcher::{Searcher, SearcherGeneration}; pub use self::searcher::{Searcher, SearcherGeneration};
pub use self::segment::Segment;
pub use self::segment_component::SegmentComponent;
pub use self::segment_id::SegmentId;
pub use self::segment_reader::SegmentReader;
pub use self::single_segment_index_writer::SingleSegmentIndexWriter;
/// The meta file contains all the information about the list of segments and the schema /// The meta file contains all the information about the list of segments and the schema
/// of the index. /// of the index.

View File

@@ -3,9 +3,11 @@ use std::sync::Arc;
use std::{fmt, io}; use std::{fmt, io};
use crate::collector::Collector; use crate::collector::Collector;
use crate::core::{Executor, SegmentReader}; use crate::core::Executor;
use crate::index::SegmentReader;
use crate::query::{Bm25StatisticsProvider, EnableScoring, Query}; use crate::query::{Bm25StatisticsProvider, EnableScoring, Query};
use crate::schema::{Document, Schema, Term}; use crate::schema::document::DocumentDeserialize;
use crate::schema::{Schema, Term};
use crate::space_usage::SearcherSpaceUsage; use crate::space_usage::SearcherSpaceUsage;
use crate::store::{CacheStats, StoreReader}; use crate::store::{CacheStats, StoreReader};
use crate::{DocAddress, Index, Opstamp, SegmentId, TrackedObject}; use crate::{DocAddress, Index, Opstamp, SegmentId, TrackedObject};
@@ -83,7 +85,7 @@ impl Searcher {
/// ///
/// The searcher uses the segment ordinal to route the /// The searcher uses the segment ordinal to route the
/// request to the right `Segment`. /// request to the right `Segment`.
pub fn doc(&self, doc_address: DocAddress) -> crate::Result<Document> { pub fn doc<D: DocumentDeserialize>(&self, doc_address: DocAddress) -> crate::Result<D> {
let store_reader = &self.inner.store_readers[doc_address.segment_ord as usize]; let store_reader = &self.inner.store_readers[doc_address.segment_ord as usize];
store_reader.get(doc_address.doc_id) store_reader.get(doc_address.doc_id)
} }
@@ -103,7 +105,10 @@ impl Searcher {
/// Fetches a document in an asynchronous manner. /// Fetches a document in an asynchronous manner.
#[cfg(feature = "quickwit")] #[cfg(feature = "quickwit")]
pub async fn doc_async(&self, doc_address: DocAddress) -> crate::Result<Document> { pub async fn doc_async<D: DocumentDeserialize>(
&self,
doc_address: DocAddress,
) -> crate::Result<D> {
let store_reader = &self.inner.store_readers[doc_address.segment_ord as usize]; let store_reader = &self.inner.store_readers[doc_address.segment_ord as usize];
store_reader.get_async(doc_address.doc_id).await store_reader.get_async(doc_address.doc_id).await
} }

View File

@@ -1,12 +1,13 @@
use crate::collector::Count; use crate::collector::Count;
use crate::directory::{RamDirectory, WatchCallback}; use crate::directory::{RamDirectory, WatchCallback};
use crate::indexer::NoMergePolicy; use crate::indexer::{LogMergePolicy, NoMergePolicy};
use crate::json_utils::JsonTermWriter;
use crate::query::TermQuery; use crate::query::TermQuery;
use crate::schema::{Field, IndexRecordOption, Schema, INDEXED, STRING, TEXT}; use crate::schema::{Field, IndexRecordOption, Schema, Type, INDEXED, STRING, TEXT};
use crate::tokenizer::TokenizerManager; use crate::tokenizer::TokenizerManager;
use crate::{ use crate::{
Directory, Document, Index, IndexBuilder, IndexReader, IndexSettings, ReloadPolicy, SegmentId, Directory, DocSet, Index, IndexBuilder, IndexReader, IndexSettings, IndexWriter, Postings,
Term, ReloadPolicy, SegmentId, TantivyDocument, Term,
}; };
#[test] #[test]
@@ -121,7 +122,7 @@ fn test_index_on_commit_reload_policy() -> crate::Result<()> {
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let reader = index let reader = index
.reader_builder() .reader_builder()
.reload_policy(ReloadPolicy::OnCommit) .reload_policy(ReloadPolicy::OnCommitWithDelay)
.try_into() .try_into()
.unwrap(); .unwrap();
assert_eq!(reader.searcher().num_docs(), 0); assert_eq!(reader.searcher().num_docs(), 0);
@@ -147,7 +148,7 @@ mod mmap_specific {
let index = Index::create_in_dir(tempdir_path, schema).unwrap(); let index = Index::create_in_dir(tempdir_path, schema).unwrap();
let reader = index let reader = index
.reader_builder() .reader_builder()
.reload_policy(ReloadPolicy::OnCommit) .reload_policy(ReloadPolicy::OnCommitWithDelay)
.try_into() .try_into()
.unwrap(); .unwrap();
assert_eq!(reader.searcher().num_docs(), 0); assert_eq!(reader.searcher().num_docs(), 0);
@@ -159,7 +160,7 @@ mod mmap_specific {
let schema = throw_away_schema(); let schema = throw_away_schema();
let field = schema.get_field("num_likes").unwrap(); let field = schema.get_field("num_likes").unwrap();
let mut index = Index::create_from_tempdir(schema)?; let mut index = Index::create_from_tempdir(schema)?;
let mut writer = index.writer_for_tests()?; let mut writer: IndexWriter = index.writer_for_tests()?;
writer.commit()?; writer.commit()?;
let reader = index let reader = index
.reader_builder() .reader_builder()
@@ -189,7 +190,7 @@ mod mmap_specific {
let read_index = Index::open_in_dir(&tempdir_path).unwrap(); let read_index = Index::open_in_dir(&tempdir_path).unwrap();
let reader = read_index let reader = read_index
.reader_builder() .reader_builder()
.reload_policy(ReloadPolicy::OnCommit) .reload_policy(ReloadPolicy::OnCommitWithDelay)
.try_into() .try_into()
.unwrap(); .unwrap();
assert_eq!(reader.searcher().num_docs(), 0); assert_eq!(reader.searcher().num_docs(), 0);
@@ -208,7 +209,7 @@ fn test_index_on_commit_reload_policy_aux(
.watch(WatchCallback::new(move || { .watch(WatchCallback::new(move || {
let _ = sender.send(()); let _ = sender.send(());
})); }));
let mut writer = index.writer_for_tests()?; let mut writer: IndexWriter = index.writer_for_tests()?;
assert_eq!(reader.searcher().num_docs(), 0); assert_eq!(reader.searcher().num_docs(), 0);
writer.add_document(doc!(field=>1u64))?; writer.add_document(doc!(field=>1u64))?;
writer.commit().unwrap(); writer.commit().unwrap();
@@ -242,7 +243,7 @@ fn garbage_collect_works_as_intended() -> crate::Result<()> {
let field = schema.get_field("num_likes").unwrap(); let field = schema.get_field("num_likes").unwrap();
let index = Index::create(directory.clone(), schema, IndexSettings::default())?; let index = Index::create(directory.clone(), schema, IndexSettings::default())?;
let mut writer = index.writer_with_num_threads(1, 32_000_000).unwrap(); let mut writer: IndexWriter = index.writer_with_num_threads(1, 32_000_000).unwrap();
for _seg in 0..8 { for _seg in 0..8 {
for i in 0u64..1_000u64 { for i in 0u64..1_000u64 {
writer.add_document(doc!(field => i))?; writer.add_document(doc!(field => i))?;
@@ -306,7 +307,7 @@ fn test_merging_segment_update_docfreq() {
let id_field = schema_builder.add_text_field("id", STRING); let id_field = schema_builder.add_text_field("id", STRING);
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut writer = index.writer_for_tests().unwrap(); let mut writer: IndexWriter = index.writer_for_tests().unwrap();
writer.set_merge_policy(Box::new(NoMergePolicy)); writer.set_merge_policy(Box::new(NoMergePolicy));
for _ in 0..5 { for _ in 0..5 {
writer.add_document(doc!(text_field=>"hello")).unwrap(); writer.add_document(doc!(text_field=>"hello")).unwrap();
@@ -317,13 +318,13 @@ fn test_merging_segment_update_docfreq() {
writer writer
.add_document(doc!(text_field=>"hello", id_field=>"TO_BE_DELETED")) .add_document(doc!(text_field=>"hello", id_field=>"TO_BE_DELETED"))
.unwrap(); .unwrap();
writer.add_document(Document::default()).unwrap(); writer.add_document(TantivyDocument::default()).unwrap();
writer.commit().unwrap(); writer.commit().unwrap();
for _ in 0..7 { for _ in 0..7 {
writer.add_document(doc!(text_field=>"hello")).unwrap(); writer.add_document(doc!(text_field=>"hello")).unwrap();
} }
writer.add_document(Document::default()).unwrap(); writer.add_document(TantivyDocument::default()).unwrap();
writer.add_document(Document::default()).unwrap(); writer.add_document(TantivyDocument::default()).unwrap();
writer.delete_term(Term::from_field_text(id_field, "TO_BE_DELETED")); writer.delete_term(Term::from_field_text(id_field, "TO_BE_DELETED"));
writer.commit().unwrap(); writer.commit().unwrap();
@@ -344,3 +345,132 @@ fn test_merging_segment_update_docfreq() {
let term_info = inv_index.get_term_info(&term).unwrap().unwrap(); let term_info = inv_index.get_term_info(&term).unwrap().unwrap();
assert_eq!(term_info.doc_freq, 12); assert_eq!(term_info.doc_freq, 12);
} }
// motivated by https://github.com/quickwit-oss/quickwit/issues/4130
#[test]
fn test_positions_merge_bug_non_text_json_vint() {
let mut schema_builder = Schema::builder();
let field = schema_builder.add_json_field("dynamic", TEXT);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema.clone());
let mut writer: IndexWriter = index.writer_for_tests().unwrap();
let mut merge_policy = LogMergePolicy::default();
merge_policy.set_min_num_segments(2);
writer.set_merge_policy(Box::new(merge_policy));
// Here a string would work.
let doc_json = r#"{"tenant_id":75}"#;
let vals = serde_json::from_str(doc_json).unwrap();
let mut doc = TantivyDocument::default();
doc.add_object(field, vals);
writer.add_document(doc.clone()).unwrap();
writer.commit().unwrap();
writer.add_document(doc.clone()).unwrap();
writer.commit().unwrap();
writer.wait_merging_threads().unwrap();
let reader = index.reader().unwrap();
assert_eq!(reader.searcher().segment_readers().len(), 1);
}
// Same as above but with bitpacked blocks
#[test]
fn test_positions_merge_bug_non_text_json_bitpacked_block() {
let mut schema_builder = Schema::builder();
let field = schema_builder.add_json_field("dynamic", TEXT);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema.clone());
let mut writer: IndexWriter = index.writer_for_tests().unwrap();
let mut merge_policy = LogMergePolicy::default();
merge_policy.set_min_num_segments(2);
writer.set_merge_policy(Box::new(merge_policy));
// Here a string would work.
let doc_json = r#"{"tenant_id":75}"#;
let vals = serde_json::from_str(doc_json).unwrap();
let mut doc = TantivyDocument::default();
doc.add_object(field, vals);
for _ in 0..128 {
writer.add_document(doc.clone()).unwrap();
}
writer.commit().unwrap();
writer.add_document(doc.clone()).unwrap();
writer.commit().unwrap();
writer.wait_merging_threads().unwrap();
let reader = index.reader().unwrap();
assert_eq!(reader.searcher().segment_readers().len(), 1);
}
#[test]
fn test_non_text_json_term_freq() {
let mut schema_builder = Schema::builder();
let field = schema_builder.add_json_field("dynamic", TEXT);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema.clone());
let mut writer: IndexWriter = index.writer_for_tests().unwrap();
// Here a string would work.
let doc_json = r#"{"tenant_id":75}"#;
let vals = serde_json::from_str(doc_json).unwrap();
let mut doc = TantivyDocument::default();
doc.add_object(field, vals);
writer.add_document(doc.clone()).unwrap();
writer.commit().unwrap();
let reader = index.reader().unwrap();
assert_eq!(reader.searcher().segment_readers().len(), 1);
let searcher = reader.searcher();
let segment_reader = searcher.segment_reader(0u32);
let inv_idx = segment_reader.inverted_index(field).unwrap();
let mut term = Term::with_type_and_field(Type::Json, field);
let mut json_term_writer = JsonTermWriter::wrap(&mut term, false);
json_term_writer.push_path_segment("tenant_id");
json_term_writer.close_path_and_set_type(Type::U64);
json_term_writer.set_fast_value(75u64);
let postings = inv_idx
.read_postings(
json_term_writer.term(),
IndexRecordOption::WithFreqsAndPositions,
)
.unwrap()
.unwrap();
assert_eq!(postings.doc(), 0);
assert_eq!(postings.term_freq(), 1u32);
}
#[test]
fn test_non_text_json_term_freq_bitpacked() {
let mut schema_builder = Schema::builder();
let field = schema_builder.add_json_field("dynamic", TEXT);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema.clone());
let mut writer: IndexWriter = index.writer_for_tests().unwrap();
// Here a string would work.
let doc_json = r#"{"tenant_id":75}"#;
let vals = serde_json::from_str(doc_json).unwrap();
let mut doc = TantivyDocument::default();
doc.add_object(field, vals);
let num_docs = 132;
for _ in 0..num_docs {
writer.add_document(doc.clone()).unwrap();
}
writer.commit().unwrap();
let reader = index.reader().unwrap();
assert_eq!(reader.searcher().segment_readers().len(), 1);
let searcher = reader.searcher();
let segment_reader = searcher.segment_reader(0u32);
let inv_idx = segment_reader.inverted_index(field).unwrap();
let mut term = Term::with_type_and_field(Type::Json, field);
let mut json_term_writer = JsonTermWriter::wrap(&mut term, false);
json_term_writer.push_path_segment("tenant_id");
json_term_writer.close_path_and_set_type(Type::U64);
json_term_writer.set_fast_value(75u64);
let mut postings = inv_idx
.read_postings(
json_term_writer.term(),
IndexRecordOption::WithFreqsAndPositions,
)
.unwrap()
.unwrap();
assert_eq!(postings.doc(), 0);
assert_eq!(postings.term_freq(), 1u32);
for i in 1..num_docs {
assert_eq!(postings.advance(), i);
assert_eq!(postings.term_freq(), 1u32);
}
}
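For orientation, the indexing pattern exercised by the new JSON regression tests above, folded into one self-contained sketch. It only recombines calls that appear in this diff (add_json_field, TantivyDocument::add_object, writer_with_num_threads, the renamed ReloadPolicy::OnCommitWithDelay); serde_json is assumed as a dependency.

use tantivy::schema::{Schema, TEXT};
use tantivy::{Index, IndexWriter, ReloadPolicy, TantivyDocument};

fn main() -> tantivy::Result<()> {
    let mut schema_builder = Schema::builder();
    let field = schema_builder.add_json_field("dynamic", TEXT);
    let index = Index::create_in_ram(schema_builder.build());
    let mut writer: IndexWriter = index.writer_with_num_threads(1, 32_000_000)?;
    // A non-text JSON value: before the fix, merging segments containing such
    // terms could corrupt position data (quickwit issue 4130).
    let vals = serde_json::from_str(r#"{"tenant_id":75}"#).unwrap();
    let mut doc = TantivyDocument::default();
    doc.add_object(field, vals);
    writer.add_document(doc)?;
    writer.commit()?;
    // Reader construction with the renamed reload policy.
    let reader = index
        .reader_builder()
        .reload_policy(ReloadPolicy::OnCommitWithDelay)
        .try_into()?;
    assert_eq!(reader.searcher().num_docs(), 1);
    Ok(())
}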


@@ -222,8 +222,8 @@ pub trait Directory: DirectoryClone + fmt::Debug + Send + Sync + 'static {
/// registered (and whose [`WatchHandle`] is still alive) are triggered. /// registered (and whose [`WatchHandle`] is still alive) are triggered.
/// ///
/// Internally, tantivy only uses this API to detect new commits to implement the /// Internally, tantivy only uses this API to detect new commits to implement the
/// `OnCommit` `ReloadPolicy`. Not implementing watch in a `Directory` only prevents the /// `OnCommitWithDelay` `ReloadPolicy`. Not implementing watch in a `Directory` only prevents
/// `OnCommit` `ReloadPolicy` to work properly. /// the `OnCommitWithDelay` `ReloadPolicy` to work properly.
fn watch(&self, watch_callback: WatchCallback) -> crate::Result<WatchHandle>; fn watch(&self, watch_callback: WatchCallback) -> crate::Result<WatchHandle>;
} }
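A minimal caller-side sketch of the watch contract described above, assuming a RamDirectory and an atomic counter in place of the channel used by the test harness earlier in this diff:

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use tantivy::directory::{Directory, RamDirectory, WatchCallback};

fn main() -> tantivy::Result<()> {
    let directory = RamDirectory::create();
    let commits_seen = Arc::new(AtomicUsize::new(0));
    let counter = Arc::clone(&commits_seen);
    // The WatchHandle must be kept alive; dropping it unregisters the callback.
    let _handle = directory.watch(WatchCallback::new(move || {
        counter.fetch_add(1, Ordering::SeqCst);
    }))?;
    // Commits made through an Index backed by a clone of `directory` would now
    // bump the counter; nothing has been committed yet.
    assert_eq!(commits_seen.load(Ordering::SeqCst), 0);
    Ok(())
}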


@@ -7,7 +7,7 @@ use serde::{Deserialize, Serialize};
use crate::directory::error::Incompatibility; use crate::directory::error::Incompatibility;
use crate::directory::{AntiCallToken, FileSlice, TerminatingWrite}; use crate::directory::{AntiCallToken, FileSlice, TerminatingWrite};
use crate::{Version, INDEX_FORMAT_VERSION}; use crate::{Version, INDEX_FORMAT_OLDEST_SUPPORTED_VERSION, INDEX_FORMAT_VERSION};
const FOOTER_MAX_LEN: u32 = 50_000; const FOOTER_MAX_LEN: u32 = 50_000;
@@ -102,10 +102,11 @@ impl Footer {
/// Confirms that the index will be read correctly by this version of tantivy /// Confirms that the index will be read correctly by this version of tantivy
/// Has to be called after `extract_footer` to make sure it's not accessing uninitialised memory /// Has to be called after `extract_footer` to make sure it's not accessing uninitialised memory
pub fn is_compatible(&self) -> Result<(), Incompatibility> { pub fn is_compatible(&self) -> Result<(), Incompatibility> {
const SUPPORTED_INDEX_FORMAT_VERSION_RANGE: std::ops::RangeInclusive<u32> =
INDEX_FORMAT_OLDEST_SUPPORTED_VERSION..=INDEX_FORMAT_VERSION;
let library_version = crate::version(); let library_version = crate::version();
if self.version.index_format_version < 4 if !SUPPORTED_INDEX_FORMAT_VERSION_RANGE.contains(&self.version.index_format_version) {
|| self.version.index_format_version > INDEX_FORMAT_VERSION
{
return Err(Incompatibility::IndexMismatch { return Err(Incompatibility::IndexMismatch {
library_version: library_version.clone(), library_version: library_version.clone(),
index_version: self.version.clone(), index_version: self.version.clone(),

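The shape of the new compatibility check, as a standalone sketch with placeholder constants; the real bounds are INDEX_FORMAT_OLDEST_SUPPORTED_VERSION and INDEX_FORMAT_VERSION, and only the old hard-coded lower bound of 4 is visible in this diff.

// Placeholder values, not taken from the diff.
const OLDEST_SUPPORTED_VERSION: u32 = 4;
const CURRENT_VERSION: u32 = 6;

fn index_format_is_supported(index_format_version: u32) -> bool {
    // A const RangeInclusive replaces the hand-written `< 4 || > INDEX_FORMAT_VERSION` test.
    const SUPPORTED: std::ops::RangeInclusive<u32> = OLDEST_SUPPORTED_VERSION..=CURRENT_VERSION;
    SUPPORTED.contains(&index_format_version)
}

fn main() {
    assert!(index_format_is_supported(4));
    assert!(!index_format_is_supported(3));
}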

@@ -1,13 +1,15 @@
use std::collections::HashMap; use std::collections::HashMap;
use std::fmt; use std::fmt;
use std::fs::{self, File, OpenOptions}; use std::fs::{self, File, OpenOptions};
use std::io::{self, BufWriter, Read, Seek, Write}; use std::io::{self, BufWriter, Read, Write};
use std::ops::Deref; use std::ops::Deref;
use std::path::{Path, PathBuf}; use std::path::{Path, PathBuf};
use std::sync::{Arc, RwLock, Weak}; use std::sync::{Arc, RwLock, Weak};
use common::StableDeref; use common::StableDeref;
use fs4::FileExt; use fs4::FileExt;
#[cfg(all(feature = "mmap", unix))]
pub use memmap2::Advice;
use memmap2::Mmap; use memmap2::Mmap;
use serde::{Deserialize, Serialize}; use serde::{Deserialize, Serialize};
use tempfile::TempDir; use tempfile::TempDir;
@@ -21,8 +23,6 @@ use crate::directory::{
AntiCallToken, Directory, DirectoryLock, FileHandle, Lock, OwnedBytes, TerminatingWrite, AntiCallToken, Directory, DirectoryLock, FileHandle, Lock, OwnedBytes, TerminatingWrite,
WatchCallback, WatchHandle, WritePtr, WatchCallback, WatchHandle, WritePtr,
}; };
#[cfg(unix)]
use crate::Advice;
pub type ArcBytes = Arc<dyn Deref<Target = [u8]> + Send + Sync + 'static>; pub type ArcBytes = Arc<dyn Deref<Target = [u8]> + Send + Sync + 'static>;
pub type WeakArcBytes = Weak<dyn Deref<Target = [u8]> + Send + Sync + 'static>; pub type WeakArcBytes = Weak<dyn Deref<Target = [u8]> + Send + Sync + 'static>;
@@ -328,12 +328,6 @@ impl Write for SafeFileWriter {
} }
} }
impl Seek for SafeFileWriter {
fn seek(&mut self, pos: io::SeekFrom) -> io::Result<u64> {
self.0.seek(pos)
}
}
impl TerminatingWrite for SafeFileWriter { impl TerminatingWrite for SafeFileWriter {
fn terminate_ref(&mut self, _: AntiCallToken) -> io::Result<()> { fn terminate_ref(&mut self, _: AntiCallToken) -> io::Result<()> {
self.0.flush()?; self.0.flush()?;
@@ -485,6 +479,7 @@ impl Directory for MmapDirectory {
let file: File = OpenOptions::new() let file: File = OpenOptions::new()
.write(true) .write(true)
.create(true) //< if the file does not exist yet, create it. .create(true) //< if the file does not exist yet, create it.
.truncate(false)
.open(full_path) .open(full_path)
.map_err(LockError::wrap_io_error)?; .map_err(LockError::wrap_io_error)?;
if lock.is_blocking { if lock.is_blocking {
@@ -539,7 +534,7 @@ mod tests {
use super::*; use super::*;
use crate::indexer::LogMergePolicy; use crate::indexer::LogMergePolicy;
use crate::schema::{Schema, SchemaBuilder, TEXT}; use crate::schema::{Schema, SchemaBuilder, TEXT};
use crate::{Index, IndexSettings, ReloadPolicy}; use crate::{Index, IndexSettings, IndexWriter, ReloadPolicy};
#[test] #[test]
fn test_open_non_existent_path() { fn test_open_non_existent_path() {
@@ -651,7 +646,7 @@ mod tests {
let index = let index =
Index::create(mmap_directory.clone(), schema, IndexSettings::default()).unwrap(); Index::create(mmap_directory.clone(), schema, IndexSettings::default()).unwrap();
let mut index_writer = index.writer_for_tests().unwrap(); let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
let mut log_merge_policy = LogMergePolicy::default(); let mut log_merge_policy = LogMergePolicy::default();
log_merge_policy.set_min_num_segments(3); log_merge_policy.set_min_num_segments(3);
index_writer.set_merge_policy(Box::new(log_merge_policy)); index_writer.set_merge_policy(Box::new(log_merge_policy));
@@ -679,7 +674,7 @@ mod tests {
let num_segments = reader.searcher().segment_readers().len(); let num_segments = reader.searcher().segment_readers().len();
assert!(num_segments <= 4); assert!(num_segments <= 4);
let num_components_except_deletes_and_tempstore = let num_components_except_deletes_and_tempstore =
crate::core::SegmentComponent::iterator().len() - 2; crate::index::SegmentComponent::iterator().len() - 2;
let max_num_mmapped = num_components_except_deletes_and_tempstore * num_segments; let max_num_mmapped = num_components_except_deletes_and_tempstore * num_segments;
assert_eventually(|| { assert_eventually(|| {
let num_mmapped = mmap_directory.get_cache_info().mmapped.len(); let num_mmapped = mmap_directory.get_cache_info().mmapped.len();

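The lock-file open call after this change, isolated as a sketch; the rationale in the comments is an assumption rather than something stated in the diff.

use std::fs::{File, OpenOptions};
use std::io;
use std::path::Path;

fn open_lock_file(full_path: &Path) -> io::Result<File> {
    OpenOptions::new()
        .write(true)
        .create(true)    //< if the file does not exist yet, create it.
        .truncate(false) // keep existing contents; presumably also satisfies the
                         // newer `suspicious_open_options` clippy lint.
        .open(full_path)
}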

@@ -42,6 +42,9 @@ pub struct GarbageCollectionResult {
pub failed_to_delete_files: Vec<PathBuf>, pub failed_to_delete_files: Vec<PathBuf>,
} }
#[cfg(all(feature = "mmap", unix))]
pub use memmap2::Advice;
pub use self::managed_directory::ManagedDirectory; pub use self::managed_directory::ManagedDirectory;
#[cfg(feature = "mmap")] #[cfg(feature = "mmap")]
pub use self::mmap_directory::MmapDirectory; pub use self::mmap_directory::MmapDirectory;


@@ -1,5 +1,5 @@
use std::collections::HashMap; use std::collections::HashMap;
use std::io::{self, BufWriter, Cursor, Seek, SeekFrom, Write}; use std::io::{self, BufWriter, Cursor, Write};
use std::path::{Path, PathBuf}; use std::path::{Path, PathBuf};
use std::sync::{Arc, RwLock}; use std::sync::{Arc, RwLock};
use std::{fmt, result}; use std::{fmt, result};
@@ -48,12 +48,6 @@ impl Drop for VecWriter {
} }
} }
impl Seek for VecWriter {
fn seek(&mut self, pos: SeekFrom) -> io::Result<u64> {
self.data.seek(pos)
}
}
impl Write for VecWriter { impl Write for VecWriter {
fn write(&mut self, buf: &[u8]) -> io::Result<usize> { fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
self.is_flushed = false; self.is_flushed = false;
@@ -91,7 +85,7 @@ impl InnerDirectory {
self.fs self.fs
.get(path) .get(path)
.ok_or_else(|| OpenReadError::FileDoesNotExist(PathBuf::from(path))) .ok_or_else(|| OpenReadError::FileDoesNotExist(PathBuf::from(path)))
.map(Clone::clone) .cloned()
} }
fn delete(&mut self, path: &Path) -> result::Result<(), DeleteError> { fn delete(&mut self, path: &Path) -> result::Result<(), DeleteError> {


@@ -17,7 +17,7 @@ pub trait DocSet: Send {
/// ///
/// The DocId of the next element is returned. /// The DocId of the next element is returned.
/// In other words we should always have : /// In other words we should always have :
/// ```ignore /// ```compile_fail
/// let doc = docset.advance(); /// let doc = docset.advance();
/// assert_eq!(doc, docset.doc()); /// assert_eq!(doc, docset.doc());
/// ``` /// ```
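A hedged sketch of a driver loop built on that contract, assuming the usual tantivy convention that a freshly created DocSet is positioned on its first document (the postings tests earlier in this diff rely on the same behaviour):

use tantivy::{DocId, DocSet, TERMINATED};

fn collect_docs<D: DocSet>(docset: &mut D) -> Vec<DocId> {
    let mut docs = Vec::new();
    let mut doc = docset.doc(); // current document, TERMINATED if the set is empty
    while doc != TERMINATED {
        docs.push(doc);
        doc = docset.advance();
        debug_assert_eq!(doc, docset.doc()); // the invariant stated in the doc comment
    }
    docs
}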


@@ -11,6 +11,7 @@ use crate::directory::error::{
Incompatibility, LockError, OpenDirectoryError, OpenReadError, OpenWriteError, Incompatibility, LockError, OpenDirectoryError, OpenReadError, OpenWriteError,
}; };
use crate::fastfield::FastFieldNotAvailableError; use crate::fastfield::FastFieldNotAvailableError;
use crate::schema::document::DeserializeError;
use crate::{query, schema}; use crate::{query, schema};
/// Represents a `DataCorruption` error. /// Represents a `DataCorruption` error.
@@ -106,6 +107,9 @@ pub enum TantivyError {
/// e.g. a datastructure is incorrectly inititalized. /// e.g. a datastructure is incorrectly inititalized.
#[error("Internal error: '{0}'")] #[error("Internal error: '{0}'")]
InternalError(String), InternalError(String),
#[error("Deserialize error: {0}")]
/// An error occurred while attempting to deserialize a document.
DeserializeError(DeserializeError),
} }
impl From<io::Error> for TantivyError { impl From<io::Error> for TantivyError {
@@ -176,3 +180,9 @@ impl From<rayon::ThreadPoolBuildError> for TantivyError {
TantivyError::SystemError(error.to_string()) TantivyError::SystemError(error.to_string())
} }
} }
impl From<DeserializeError> for TantivyError {
fn from(error: DeserializeError) -> TantivyError {
TantivyError::DeserializeError(error)
}
}
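A hypothetical helper illustrating what the new From impl enables: anything returning Result<_, DeserializeError> can be propagated with `?` inside a function returning tantivy::Result. `decode` is a stand-in, not an existing API.

use tantivy::schema::document::DeserializeError;

fn load_with<T>(decode: impl FnOnce() -> Result<T, DeserializeError>) -> tantivy::Result<T> {
    // DeserializeError converts into TantivyError through the new From impl.
    Ok(decode()?)
}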


@@ -62,8 +62,9 @@ impl FacetReader {
#[cfg(test)] #[cfg(test)]
mod tests { mod tests {
use crate::schema::{Facet, FacetOptions, SchemaBuilder, Value, STORED}; use crate::schema::document::Value;
use crate::{DocAddress, Document, Index}; use crate::schema::{Facet, FacetOptions, SchemaBuilder, STORED};
use crate::{DocAddress, Index, IndexWriter, TantivyDocument};
#[test] #[test]
fn test_facet_only_indexed() { fn test_facet_only_indexed() {
@@ -71,7 +72,7 @@ mod tests {
let facet_field = schema_builder.add_facet_field("facet", FacetOptions::default()); let facet_field = schema_builder.add_facet_field("facet", FacetOptions::default());
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests().unwrap(); let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
index_writer index_writer
.add_document(doc!(facet_field=>Facet::from_text("/a/b").unwrap())) .add_document(doc!(facet_field=>Facet::from_text("/a/b").unwrap()))
.unwrap(); .unwrap();
@@ -85,8 +86,10 @@ mod tests {
let mut facet = Facet::default(); let mut facet = Facet::default();
facet_reader.facet_from_ord(0, &mut facet).unwrap(); facet_reader.facet_from_ord(0, &mut facet).unwrap();
assert_eq!(facet.to_path_string(), "/a/b"); assert_eq!(facet.to_path_string(), "/a/b");
let doc = searcher.doc(DocAddress::new(0u32, 0u32)).unwrap(); let doc = searcher
let value = doc.get_first(facet_field).and_then(Value::as_facet); .doc::<TantivyDocument>(DocAddress::new(0u32, 0u32))
.unwrap();
let value = doc.get_first(facet_field).and_then(|v| v.as_facet());
assert_eq!(value, None); assert_eq!(value, None);
} }
@@ -96,7 +99,7 @@ mod tests {
let facet_field = schema_builder.add_facet_field("facet", FacetOptions::default()); let facet_field = schema_builder.add_facet_field("facet", FacetOptions::default());
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests().unwrap(); let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
index_writer index_writer
.add_document(doc!(facet_field=>Facet::from_text("/parent/child1").unwrap())) .add_document(doc!(facet_field=>Facet::from_text("/parent/child1").unwrap()))
.unwrap(); .unwrap();
@@ -142,8 +145,8 @@ mod tests {
let mut facet_ords = Vec::new(); let mut facet_ords = Vec::new();
facet_ords.extend(facet_reader.facet_ords(0u32)); facet_ords.extend(facet_reader.facet_ords(0u32));
assert_eq!(&facet_ords, &[0u64]); assert_eq!(&facet_ords, &[0u64]);
let doc = searcher.doc(DocAddress::new(0u32, 0u32))?; let doc = searcher.doc::<TantivyDocument>(DocAddress::new(0u32, 0u32))?;
let value: Option<&Facet> = doc.get_first(facet_field).and_then(Value::as_facet); let value: Option<&Facet> = doc.get_first(facet_field).and_then(|v| v.as_facet());
assert_eq!(value, Facet::from_text("/a/b").ok().as_ref()); assert_eq!(value, Facet::from_text("/a/b").ok().as_ref());
Ok(()) Ok(())
} }
@@ -156,7 +159,7 @@ mod tests {
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests()?; let mut index_writer = index.writer_for_tests()?;
index_writer.add_document(doc!(facet_field=>Facet::from_text("/a/b").unwrap()))?; index_writer.add_document(doc!(facet_field=>Facet::from_text("/a/b").unwrap()))?;
index_writer.add_document(Document::default())?; index_writer.add_document(TantivyDocument::default())?;
index_writer.commit()?; index_writer.commit()?;
let searcher = index.reader()?.searcher(); let searcher = index.reader()?.searcher();
let facet_reader = searcher.segment_reader(0u32).facet_reader("facet").unwrap(); let facet_reader = searcher.segment_reader(0u32).facet_reader("facet").unwrap();
@@ -176,8 +179,8 @@ mod tests {
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests()?; let mut index_writer = index.writer_for_tests()?;
index_writer.add_document(Document::default())?; index_writer.add_document(TantivyDocument::default())?;
index_writer.add_document(Document::default())?; index_writer.add_document(TantivyDocument::default())?;
index_writer.commit()?; index_writer.commit()?;
let searcher = index.reader()?.searcher(); let searcher = index.reader()?.searcher();
let facet_reader = searcher.segment_reader(0u32).facet_reader("facet").unwrap(); let facet_reader = searcher.segment_reader(0u32).facet_reader("facet").unwrap();

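A retrieval sketch mirroring the updated assertions above: the generic doc::<TantivyDocument>() call plus the Value trait in scope for as_facet(). The function name is illustrative.

use tantivy::schema::document::Value;
use tantivy::schema::{Facet, Field};
use tantivy::{DocAddress, Searcher, TantivyDocument};

fn first_doc_facet_matches(
    searcher: &Searcher,
    facet_field: Field,
    expected: &str,
) -> tantivy::Result<bool> {
    let doc = searcher.doc::<TantivyDocument>(DocAddress::new(0u32, 0u32))?;
    let value: Option<&Facet> = doc.get_first(facet_field).and_then(|v| v.as_facet());
    Ok(value == Facet::from_text(expected).ok().as_ref())
}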

@@ -90,12 +90,12 @@ mod tests {
use crate::directory::{Directory, RamDirectory, WritePtr}; use crate::directory::{Directory, RamDirectory, WritePtr};
use crate::merge_policy::NoMergePolicy; use crate::merge_policy::NoMergePolicy;
use crate::schema::{ use crate::schema::{
Document, Facet, FacetOptions, Field, JsonObjectOptions, Schema, SchemaBuilder, Facet, FacetOptions, Field, JsonObjectOptions, Schema, SchemaBuilder, TantivyDocument,
TextOptions, FAST, INDEXED, STORED, STRING, TEXT, TextOptions, FAST, INDEXED, STORED, STRING, TEXT,
}; };
use crate::time::OffsetDateTime; use crate::time::OffsetDateTime;
use crate::tokenizer::{LowerCaser, RawTokenizer, TextAnalyzer, TokenizerManager}; use crate::tokenizer::{LowerCaser, RawTokenizer, TextAnalyzer, TokenizerManager};
use crate::{DateOptions, DateTimePrecision, Index, SegmentId, SegmentReader}; use crate::{DateOptions, DateTimePrecision, Index, IndexWriter, SegmentId, SegmentReader};
pub static SCHEMA: Lazy<Schema> = Lazy::new(|| { pub static SCHEMA: Lazy<Schema> = Lazy::new(|| {
let mut schema_builder = Schema::builder(); let mut schema_builder = Schema::builder();
@@ -131,7 +131,7 @@ mod tests {
} }
let file = directory.open_read(path).unwrap(); let file = directory.open_read(path).unwrap();
assert_eq!(file.len(), 93); assert_eq!(file.len(), 80);
let fast_field_readers = FastFieldReaders::open(file, SCHEMA.clone()).unwrap(); let fast_field_readers = FastFieldReaders::open(file, SCHEMA.clone()).unwrap();
let column = fast_field_readers let column = fast_field_readers
.u64("field") .u64("field")
@@ -181,7 +181,7 @@ mod tests {
write.terminate().unwrap(); write.terminate().unwrap();
} }
let file = directory.open_read(path).unwrap(); let file = directory.open_read(path).unwrap();
assert_eq!(file.len(), 121); assert_eq!(file.len(), 108);
let fast_field_readers = FastFieldReaders::open(file, SCHEMA.clone()).unwrap(); let fast_field_readers = FastFieldReaders::open(file, SCHEMA.clone()).unwrap();
let col = fast_field_readers let col = fast_field_readers
.u64("field") .u64("field")
@@ -214,7 +214,7 @@ mod tests {
write.terminate().unwrap(); write.terminate().unwrap();
} }
let file = directory.open_read(path).unwrap(); let file = directory.open_read(path).unwrap();
assert_eq!(file.len(), 94); assert_eq!(file.len(), 81);
let fast_field_readers = FastFieldReaders::open(file, SCHEMA.clone()).unwrap(); let fast_field_readers = FastFieldReaders::open(file, SCHEMA.clone()).unwrap();
let fast_field_reader = fast_field_readers let fast_field_reader = fast_field_readers
.u64("field") .u64("field")
@@ -246,7 +246,7 @@ mod tests {
write.terminate().unwrap(); write.terminate().unwrap();
} }
let file = directory.open_read(path).unwrap(); let file = directory.open_read(path).unwrap();
assert_eq!(file.len(), 4489); assert_eq!(file.len(), 4476);
{ {
let fast_field_readers = FastFieldReaders::open(file, SCHEMA.clone()).unwrap(); let fast_field_readers = FastFieldReaders::open(file, SCHEMA.clone()).unwrap();
let col = fast_field_readers let col = fast_field_readers
@@ -271,7 +271,7 @@ mod tests {
let mut write: WritePtr = directory.open_write(Path::new("test")).unwrap(); let mut write: WritePtr = directory.open_write(Path::new("test")).unwrap();
let mut fast_field_writers = FastFieldsWriter::from_schema(&schema).unwrap(); let mut fast_field_writers = FastFieldsWriter::from_schema(&schema).unwrap();
for i in -100i64..10_000i64 { for i in -100i64..10_000i64 {
let mut doc = Document::default(); let mut doc = TantivyDocument::default();
doc.add_i64(i64_field, i); doc.add_i64(i64_field, i);
fast_field_writers.add_document(&doc).unwrap(); fast_field_writers.add_document(&doc).unwrap();
} }
@@ -279,7 +279,7 @@ mod tests {
write.terminate().unwrap(); write.terminate().unwrap();
} }
let file = directory.open_read(path).unwrap(); let file = directory.open_read(path).unwrap();
assert_eq!(file.len(), 265); assert_eq!(file.len(), 252);
{ {
let fast_field_readers = FastFieldReaders::open(file, schema).unwrap(); let fast_field_readers = FastFieldReaders::open(file, schema).unwrap();
@@ -312,7 +312,7 @@ mod tests {
{ {
let mut write: WritePtr = directory.open_write(Path::new("test")).unwrap(); let mut write: WritePtr = directory.open_write(Path::new("test")).unwrap();
let mut fast_field_writers = FastFieldsWriter::from_schema(&schema).unwrap(); let mut fast_field_writers = FastFieldsWriter::from_schema(&schema).unwrap();
let doc = Document::default(); let doc = TantivyDocument::default();
fast_field_writers.add_document(&doc).unwrap(); fast_field_writers.add_document(&doc).unwrap();
fast_field_writers.serialize(&mut write, None).unwrap(); fast_field_writers.serialize(&mut write, None).unwrap();
write.terminate().unwrap(); write.terminate().unwrap();
@@ -345,7 +345,7 @@ mod tests {
{ {
let mut write: WritePtr = directory.open_write(Path::new("test")).unwrap(); let mut write: WritePtr = directory.open_write(Path::new("test")).unwrap();
let mut fast_field_writers = FastFieldsWriter::from_schema(&schema).unwrap(); let mut fast_field_writers = FastFieldsWriter::from_schema(&schema).unwrap();
let doc = Document::default(); let doc = TantivyDocument::default();
fast_field_writers.add_document(&doc).unwrap(); fast_field_writers.add_document(&doc).unwrap();
fast_field_writers.serialize(&mut write, None).unwrap(); fast_field_writers.serialize(&mut write, None).unwrap();
write.terminate().unwrap(); write.terminate().unwrap();
@@ -416,7 +416,7 @@ mod tests {
let date_field = schema_builder.add_date_field("date", FAST); let date_field = schema_builder.add_date_field("date", FAST);
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests().unwrap(); let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
index_writer.set_merge_policy(Box::new(NoMergePolicy)); index_writer.set_merge_policy(Box::new(NoMergePolicy));
index_writer index_writer
.add_document(doc!(date_field => DateTime::from_utc(OffsetDateTime::now_utc()))) .add_document(doc!(date_field => DateTime::from_utc(OffsetDateTime::now_utc())))
@@ -452,7 +452,7 @@ mod tests {
{ {
// first segment // first segment
let mut index_writer = index.writer_for_tests().unwrap(); let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
index_writer.set_merge_policy(Box::new(NoMergePolicy)); index_writer.set_merge_policy(Box::new(NoMergePolicy));
index_writer index_writer
.add_document(doc!( .add_document(doc!(
@@ -506,7 +506,7 @@ mod tests {
{ {
// second segment // second segment
let mut index_writer = index.writer_for_tests().unwrap(); let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
index_writer index_writer
.add_document(doc!( .add_document(doc!(
@@ -537,7 +537,7 @@ mod tests {
// Merging the segments // Merging the segments
{ {
let segment_ids = index.searchable_segment_ids().unwrap(); let segment_ids = index.searchable_segment_ids().unwrap();
let mut index_writer = index.writer_for_tests().unwrap(); let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
index_writer.merge(&segment_ids).wait().unwrap(); index_writer.merge(&segment_ids).wait().unwrap();
index_writer.wait_merging_threads().unwrap(); index_writer.wait_merging_threads().unwrap();
} }
@@ -662,7 +662,7 @@ mod tests {
// Merging the segments // Merging the segments
{ {
let segment_ids = index.searchable_segment_ids()?; let segment_ids = index.searchable_segment_ids()?;
let mut index_writer = index.writer_for_tests()?; let mut index_writer: IndexWriter = index.writer_for_tests()?;
index_writer.merge(&segment_ids).wait()?; index_writer.merge(&segment_ids).wait()?;
index_writer.wait_merging_threads()?; index_writer.wait_merging_threads()?;
} }
@@ -773,7 +773,7 @@ mod tests {
write.terminate().unwrap(); write.terminate().unwrap();
} }
let file = directory.open_read(path).unwrap(); let file = directory.open_read(path).unwrap();
assert_eq!(file.len(), 102); assert_eq!(file.len(), 84);
let fast_field_readers = FastFieldReaders::open(file, schema).unwrap(); let fast_field_readers = FastFieldReaders::open(file, schema).unwrap();
let bool_col = fast_field_readers.bool("field_bool").unwrap(); let bool_col = fast_field_readers.bool("field_bool").unwrap();
assert_eq!(bool_col.first(0), Some(true)); assert_eq!(bool_col.first(0), Some(true));
@@ -805,7 +805,7 @@ mod tests {
write.terminate().unwrap(); write.terminate().unwrap();
} }
let file = directory.open_read(path).unwrap(); let file = directory.open_read(path).unwrap();
assert_eq!(file.len(), 114); assert_eq!(file.len(), 96);
let readers = FastFieldReaders::open(file, schema).unwrap(); let readers = FastFieldReaders::open(file, schema).unwrap();
let bool_col = readers.bool("field_bool").unwrap(); let bool_col = readers.bool("field_bool").unwrap();
for i in 0..25 { for i in 0..25 {
@@ -824,13 +824,13 @@ mod tests {
{ {
let mut write: WritePtr = directory.open_write(path).unwrap(); let mut write: WritePtr = directory.open_write(path).unwrap();
let mut fast_field_writers = FastFieldsWriter::from_schema(&schema).unwrap(); let mut fast_field_writers = FastFieldsWriter::from_schema(&schema).unwrap();
let doc = Document::default(); let doc = TantivyDocument::default();
fast_field_writers.add_document(&doc).unwrap(); fast_field_writers.add_document(&doc).unwrap();
fast_field_writers.serialize(&mut write, None).unwrap(); fast_field_writers.serialize(&mut write, None).unwrap();
write.terminate().unwrap(); write.terminate().unwrap();
} }
let file = directory.open_read(path).unwrap(); let file = directory.open_read(path).unwrap();
assert_eq!(file.len(), 104); assert_eq!(file.len(), 86);
let fastfield_readers = FastFieldReaders::open(file, schema).unwrap(); let fastfield_readers = FastFieldReaders::open(file, schema).unwrap();
let col = fastfield_readers.bool("field_bool").unwrap(); let col = fastfield_readers.bool("field_bool").unwrap();
assert_eq!(col.first(0), None); assert_eq!(col.first(0), None);
@@ -846,7 +846,7 @@ mod tests {
assert_eq!(col.get_val(0), true); assert_eq!(col.get_val(0), true);
} }
fn get_index(docs: &[crate::Document], schema: &Schema) -> crate::Result<RamDirectory> { fn get_index(docs: &[crate::TantivyDocument], schema: &Schema) -> crate::Result<RamDirectory> {
let directory: RamDirectory = RamDirectory::create(); let directory: RamDirectory = RamDirectory::create();
{ {
let mut write: WritePtr = directory.open_write(Path::new("test")).unwrap(); let mut write: WritePtr = directory.open_write(Path::new("test")).unwrap();
@@ -888,7 +888,7 @@ mod tests {
let field = schema_builder.add_date_field("field", date_options); let field = schema_builder.add_date_field("field", date_options);
let schema = schema_builder.build(); let schema = schema_builder.build();
let docs: Vec<Document> = times.iter().map(|time| doc!(field=>*time)).collect(); let docs: Vec<TantivyDocument> = times.iter().map(|time| doc!(field=>*time)).collect();
let directory = get_index(&docs[..], &schema).unwrap(); let directory = get_index(&docs[..], &schema).unwrap();
let path = Path::new("test"); let path = Path::new("test");
@@ -962,11 +962,15 @@ mod tests {
let ip_field = schema_builder.add_u64_field("ip", FAST); let ip_field = schema_builder.add_u64_field("ip", FAST);
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests().unwrap(); let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
let ip_addr = Ipv6Addr::new(1, 2, 3, 4, 5, 1, 2, 3); let ip_addr = Ipv6Addr::new(1, 2, 3, 4, 5, 1, 2, 3);
index_writer.add_document(Document::default()).unwrap(); index_writer
.add_document(TantivyDocument::default())
.unwrap();
index_writer.add_document(doc!(ip_field=>ip_addr)).unwrap(); index_writer.add_document(doc!(ip_field=>ip_addr)).unwrap();
index_writer.add_document(Document::default()).unwrap(); index_writer
.add_document(TantivyDocument::default())
.unwrap();
index_writer.commit().unwrap(); index_writer.commit().unwrap();
let searcher = index.reader().unwrap().searcher(); let searcher = index.reader().unwrap().searcher();
let fastfields = searcher.segment_reader(0u32).fast_fields(); let fastfields = searcher.segment_reader(0u32).fast_fields();
@@ -1086,7 +1090,7 @@ mod tests {
let json = schema_builder.add_json_field("json", json_option); let json = schema_builder.add_json_field("json", json_option);
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests().unwrap(); let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
index_writer index_writer
.add_document(doc!(json => json!({"attr.age": 32}))) .add_document(doc!(json => json!({"attr.age": 32})))
.unwrap(); .unwrap();
@@ -1112,7 +1116,7 @@ mod tests {
let json = schema_builder.add_json_field("json", json_option); let json = schema_builder.add_json_field("json", json_option);
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests().unwrap(); let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
index_writer index_writer
.add_document(doc!(json => json!({"age": 32}))) .add_document(doc!(json => json!({"age": 32})))
.unwrap(); .unwrap();
@@ -1139,7 +1143,7 @@ mod tests {
let json = schema_builder.add_json_field("json", json_option); let json = schema_builder.add_json_field("json", json_option);
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests().unwrap(); let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
index_writer index_writer
.add_document(doc!(json => json!({"attr.age": 32}))) .add_document(doc!(json => json!({"attr.age": 32})))
.unwrap(); .unwrap();
@@ -1162,7 +1166,7 @@ mod tests {
let field_with_dot = schema_builder.add_i64_field("field.with.dot", FAST); let field_with_dot = schema_builder.add_i64_field("field.with.dot", FAST);
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests().unwrap(); let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
index_writer index_writer
.add_document(doc!(field_with_dot => 32i64)) .add_document(doc!(field_with_dot => 32i64))
.unwrap(); .unwrap();
@@ -1184,7 +1188,7 @@ mod tests {
let shadowing_json_field = schema_builder.add_json_field("jsonfield.attr", FAST); let shadowing_json_field = schema_builder.add_json_field("jsonfield.attr", FAST);
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests().unwrap(); let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
index_writer index_writer
.add_document(doc!(json_field=> json!({"attr": {"age": 32}}), shadowing_json_field=>json!({"age": 33}))) .add_document(doc!(json_field=> json!({"attr": {"age": 32}}), shadowing_json_field=>json!({"age": 33})))
.unwrap(); .unwrap();
@@ -1215,7 +1219,7 @@ mod tests {
let mut index = Index::create_in_ram(schema); let mut index = Index::create_in_ram(schema);
index.set_fast_field_tokenizers(ff_tokenizer_manager); index.set_fast_field_tokenizers(ff_tokenizer_manager);
let mut index_writer = index.writer_for_tests().unwrap(); let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
index_writer index_writer
.add_document(doc!(text_field => "Test1 test2")) .add_document(doc!(text_field => "Test1 test2"))
.unwrap(); .unwrap();
@@ -1244,7 +1248,7 @@ mod tests {
let log_field = schema_builder.add_text_field("log_level", text_fieldtype); let log_field = schema_builder.add_text_field("log_level", text_fieldtype);
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests().unwrap(); let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
index_writer index_writer
.add_document(doc!(log_field => "info")) .add_document(doc!(log_field => "info"))
.unwrap(); .unwrap();
@@ -1277,18 +1281,25 @@ mod tests {
let shadowing_json_field = schema_builder.add_json_field("jsonfield.attr", json_option); let shadowing_json_field = schema_builder.add_json_field("jsonfield.attr", json_option);
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests().unwrap(); let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
index_writer index_writer
.add_document(doc!(json_field=> json!({"attr.age": 32}), shadowing_json_field=>json!({"age": 33}))) .add_document(doc!(json_field=> json!({"attr.age": 32}), shadowing_json_field=>json!({"age": 33})))
.unwrap(); .unwrap();
index_writer.commit().unwrap(); index_writer.commit().unwrap();
let searcher = index.reader().unwrap().searcher(); let searcher = index.reader().unwrap().searcher();
let fast_field_reader = searcher.segment_reader(0u32).fast_fields(); let fast_field_reader = searcher.segment_reader(0u32).fast_fields();
// Supported for now, maybe dropped in the future.
let column = fast_field_reader let column = fast_field_reader
.column_opt::<i64>("jsonfield.attr.age") .column_opt::<i64>("jsonfield.attr.age")
.unwrap() .unwrap()
.unwrap(); .unwrap();
let vals: Vec<i64> = column.values_for_doc(0u32).collect(); let vals: Vec<i64> = column.values_for_doc(0u32).collect();
assert_eq!(&vals, &[33]); assert_eq!(&vals, &[33]);
let column = fast_field_reader
.column_opt::<i64>("jsonfield\\.attr.age")
.unwrap()
.unwrap();
let vals: Vec<i64> = column.values_for_doc(0u32).collect();
assert_eq!(&vals, &[33]);
} }
} }


@@ -234,6 +234,22 @@ impl FastFieldReaders {
Ok(dynamic_column_handle_opt) Ok(dynamic_column_handle_opt)
} }
/// Returning all `dynamic_column_handle`.
pub fn dynamic_column_handles(
&self,
field_name: &str,
) -> crate::Result<Vec<DynamicColumnHandle>> {
let Some(resolved_field_name) = self.resolve_field(field_name)? else {
return Ok(Vec::new());
};
let dynamic_column_handles = self
.columnar
.read_columns(&resolved_field_name)?
.into_iter()
.collect();
Ok(dynamic_column_handles)
}
#[doc(hidden)] #[doc(hidden)]
pub async fn list_dynamic_column_handles( pub async fn list_dynamic_column_handles(
&self, &self,
@@ -338,8 +354,10 @@ impl FastFieldReaders {
#[cfg(test)] #[cfg(test)]
mod tests { mod tests {
use columnar::ColumnType;
use crate::schema::{JsonObjectOptions, Schema, FAST}; use crate::schema::{JsonObjectOptions, Schema, FAST};
use crate::{Document, Index}; use crate::{Index, IndexWriter, TantivyDocument};
#[test] #[test]
fn test_fast_field_reader_resolve_with_dynamic_internal() { fn test_fast_field_reader_resolve_with_dynamic_internal() {
@@ -355,8 +373,10 @@ mod tests {
let dynamic_field = schema_builder.add_json_field("_dyna", FAST); let dynamic_field = schema_builder.add_json_field("_dyna", FAST);
let schema = schema_builder.build(); let schema = schema_builder.build();
let index = Index::create_in_ram(schema); let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests().unwrap(); let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
index_writer.add_document(Document::default()).unwrap(); index_writer
.add_document(TantivyDocument::default())
.unwrap();
index_writer.commit().unwrap(); index_writer.commit().unwrap();
let reader = index.reader().unwrap(); let reader = index.reader().unwrap();
let searcher = reader.searcher(); let searcher = reader.searcher();
@@ -417,4 +437,45 @@ mod tests {
Some("_dyna\u{1}notinschema\u{1}attr\u{1}color".to_string()) Some("_dyna\u{1}notinschema\u{1}attr\u{1}color".to_string())
); );
} }
#[test]
fn test_fast_field_reader_dynamic_column_handles() {
let mut schema_builder = Schema::builder();
let id = schema_builder.add_u64_field("id", FAST);
let json = schema_builder.add_json_field("json", FAST);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
index_writer
.add_document(doc!(id=> 1u64, json => json!({"foo": 42})))
.unwrap();
index_writer
.add_document(doc!(id=> 2u64, json => json!({"foo": true})))
.unwrap();
index_writer
.add_document(doc!(id=> 3u64, json => json!({"foo": "bar"})))
.unwrap();
index_writer.commit().unwrap();
let reader = index.reader().unwrap();
let searcher = reader.searcher();
let reader = searcher.segment_reader(0u32);
let fast_fields = reader.fast_fields();
let id_columns = fast_fields.dynamic_column_handles("id").unwrap();
assert_eq!(id_columns.len(), 1);
assert_eq!(id_columns.first().unwrap().column_type(), ColumnType::U64);
let foo_columns = fast_fields.dynamic_column_handles("json.foo").unwrap();
assert_eq!(foo_columns.len(), 3);
assert!(foo_columns
.iter()
.any(|column| column.column_type() == ColumnType::I64));
assert!(foo_columns
.iter()
.any(|column| column.column_type() == ColumnType::Bool));
assert!(foo_columns
.iter()
.any(|column| column.column_type() == ColumnType::Str));
println!("*** {:?}", fast_fields.columnar().list_columns());
}
} }

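An assumed caller-side sketch of the new accessor, written with the same in-crate paths as the test above; only dynamic_column_handles() and column_type() are taken from this diff.

use columnar::ColumnType; // tantivy's columnar workspace crate (published as `tantivy-columnar`)

use crate::fastfield::FastFieldReaders;

fn describe_columns(fast_fields: &FastFieldReaders, path: &str) -> crate::Result<Vec<ColumnType>> {
    // One handle per concrete column type stored under `path`, e.g. i64, bool
    // and str for the "json.foo" values indexed in the test above.
    let handles = fast_fields.dynamic_column_handles(path)?;
    Ok(handles.iter().map(|handle| handle.column_type()).collect())
}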

@@ -1,12 +1,12 @@
use std::io; use std::io;
use columnar::{ColumnarWriter, NumericalValue}; use columnar::{ColumnarWriter, NumericalValue};
use common::replace_in_place; use common::JsonPathWriter;
use tokenizer_api::Token; use tokenizer_api::Token;
use crate::indexer::doc_id_mapping::DocIdMapping; use crate::indexer::doc_id_mapping::DocIdMapping;
use crate::schema::term::{JSON_PATH_SEGMENT_SEP, JSON_PATH_SEGMENT_SEP_STR}; use crate::schema::document::{Document, ReferenceValue, ReferenceValueLeaf, Value};
use crate::schema::{value_type_to_column_type, Document, FieldType, Schema, Type, Value}; use crate::schema::{value_type_to_column_type, Field, FieldType, Schema, Type};
use crate::tokenizer::{TextAnalyzer, TokenizerManager}; use crate::tokenizer::{TextAnalyzer, TokenizerManager};
use crate::{DateTimePrecision, DocId, TantivyError}; use crate::{DateTimePrecision, DocId, TantivyError};
@@ -23,7 +23,7 @@ pub struct FastFieldsWriter {
expand_dots: Vec<bool>, expand_dots: Vec<bool>,
num_docs: DocId, num_docs: DocId,
// Buffer that we recycle to avoid allocation. // Buffer that we recycle to avoid allocation.
json_path_buffer: String, json_path_buffer: JsonPathWriter,
} }
impl FastFieldsWriter { impl FastFieldsWriter {
@@ -97,7 +97,7 @@ impl FastFieldsWriter {
num_docs: 0u32, num_docs: 0u32,
date_precisions, date_precisions,
expand_dots, expand_dots,
json_path_buffer: String::new(), json_path_buffer: JsonPathWriter::default(),
}) })
} }
@@ -117,114 +117,121 @@ impl FastFieldsWriter {
} }
/// Indexes all of the fastfields of a new document. /// Indexes all of the fastfields of a new document.
pub fn add_document(&mut self, doc: &Document) -> crate::Result<()> { pub fn add_document<D: Document>(&mut self, doc: &D) -> crate::Result<()> {
let doc_id = self.num_docs; let doc_id = self.num_docs;
for field_value in doc.field_values() { for (field, value) in doc.iter_fields_and_values() {
if let Some(field_name) = let value_access = value as D::Value<'_>;
&self.fast_field_names[field_value.field().field_id() as usize]
{
match &field_value.value {
Value::U64(u64_val) => {
self.columnar_writer.record_numerical(
doc_id,
field_name.as_str(),
NumericalValue::from(*u64_val),
);
}
Value::I64(i64_val) => {
self.columnar_writer.record_numerical(
doc_id,
field_name.as_str(),
NumericalValue::from(*i64_val),
);
}
Value::F64(f64_val) => {
self.columnar_writer.record_numerical(
doc_id,
field_name.as_str(),
NumericalValue::from(*f64_val),
);
}
Value::Str(text_val) => {
if let Some(tokenizer) =
&mut self.per_field_tokenizer[field_value.field().field_id() as usize]
{
let mut token_stream = tokenizer.token_stream(text_val);
token_stream.process(&mut |token: &Token| {
self.columnar_writer.record_str(
doc_id,
field_name.as_str(),
&token.text,
);
})
} else {
self.columnar_writer
.record_str(doc_id, field_name.as_str(), text_val);
}
}
Value::Bytes(bytes_val) => {
self.columnar_writer
.record_bytes(doc_id, field_name.as_str(), bytes_val);
}
Value::PreTokStr(pre_tok) => {
for token in &pre_tok.tokens {
self.columnar_writer.record_str(
doc_id,
field_name.as_str(),
&token.text,
);
}
}
Value::Bool(bool_val) => {
self.columnar_writer
.record_bool(doc_id, field_name.as_str(), *bool_val);
}
Value::Date(datetime) => {
let date_precision =
self.date_precisions[field_value.field().field_id() as usize];
let truncated_datetime = datetime.truncate(date_precision);
self.columnar_writer.record_datetime(
doc_id,
field_name.as_str(),
truncated_datetime,
);
}
Value::Facet(facet) => {
self.columnar_writer.record_str(
doc_id,
field_name.as_str(),
facet.encoded_str(),
);
}
Value::JsonObject(json_obj) => {
let expand_dots = self.expand_dots[field_value.field().field_id() as usize];
self.json_path_buffer.clear();
self.json_path_buffer.push_str(field_name);
let text_analyzer = self.add_doc_value(doc_id, field, value_access)?;
&mut self.per_field_tokenizer[field_value.field().field_id() as usize];
record_json_obj_to_columnar_writer(
doc_id,
json_obj,
expand_dots,
JSON_DEPTH_LIMIT,
&mut self.json_path_buffer,
&mut self.columnar_writer,
text_analyzer,
);
}
Value::IpAddr(ip_addr) => {
self.columnar_writer
.record_ip_addr(doc_id, field_name.as_str(), *ip_addr);
}
}
}
} }
self.num_docs += 1; self.num_docs += 1;
Ok(()) Ok(())
} }
fn add_doc_value<'a, V: Value<'a>>(
&mut self,
doc_id: DocId,
field: Field,
value: V,
) -> crate::Result<()> {
let field_name = match &self.fast_field_names[field.field_id() as usize] {
None => return Ok(()),
Some(name) => name,
};
match value.as_value() {
ReferenceValue::Leaf(leaf) => match leaf {
ReferenceValueLeaf::Null => {}
ReferenceValueLeaf::Str(val) => {
if let Some(tokenizer) =
&mut self.per_field_tokenizer[field.field_id() as usize]
{
let mut token_stream = tokenizer.token_stream(val);
token_stream.process(&mut |token: &Token| {
self.columnar_writer
.record_str(doc_id, field_name, &token.text);
})
} else {
self.columnar_writer.record_str(doc_id, field_name, val);
}
}
ReferenceValueLeaf::U64(val) => {
self.columnar_writer.record_numerical(
doc_id,
field_name,
NumericalValue::from(val),
);
}
ReferenceValueLeaf::I64(val) => {
self.columnar_writer.record_numerical(
doc_id,
field_name,
NumericalValue::from(val),
);
}
ReferenceValueLeaf::F64(val) => {
self.columnar_writer.record_numerical(
doc_id,
field_name,
NumericalValue::from(val),
);
}
ReferenceValueLeaf::Date(val) => {
let date_precision = self.date_precisions[field.field_id() as usize];
let truncated_datetime = val.truncate(date_precision);
self.columnar_writer
.record_datetime(doc_id, field_name, truncated_datetime);
}
ReferenceValueLeaf::Facet(val) => {
self.columnar_writer
.record_str(doc_id, field_name, val.encoded_str());
}
ReferenceValueLeaf::Bytes(val) => {
self.columnar_writer.record_bytes(doc_id, field_name, val);
}
ReferenceValueLeaf::IpAddr(val) => {
self.columnar_writer.record_ip_addr(doc_id, field_name, val);
}
ReferenceValueLeaf::Bool(val) => {
self.columnar_writer.record_bool(doc_id, field_name, val);
}
ReferenceValueLeaf::PreTokStr(val) => {
for token in &val.tokens {
self.columnar_writer
.record_str(doc_id, field_name, &token.text);
}
}
},
ReferenceValue::Array(val) => {
// TODO: Check this is the correct behaviour we want.
for value in val {
self.add_doc_value(doc_id, field, value)?;
}
}
ReferenceValue::Object(val) => {
let expand_dots = self.expand_dots[field.field_id() as usize];
self.json_path_buffer.clear();
// First field should not be expanded.
self.json_path_buffer.set_expand_dots(false);
self.json_path_buffer.push(field_name);
self.json_path_buffer.set_expand_dots(expand_dots);
let text_analyzer = &mut self.per_field_tokenizer[field.field_id() as usize];
record_json_obj_to_columnar_writer::<V>(
doc_id,
val,
JSON_DEPTH_LIMIT,
&mut self.json_path_buffer,
&mut self.columnar_writer,
text_analyzer,
);
}
}
Ok(())
}
/// Serializes all of the `FastFieldWriter`s by pushing them in /// Serializes all of the `FastFieldWriter`s by pushing them in
/// order to the fast field serializer. /// order to the fast field serializer.
pub fn serialize( pub fn serialize(
@@ -241,66 +248,33 @@ impl FastFieldsWriter {
} }
} }
#[inline] fn record_json_obj_to_columnar_writer<'a, V: Value<'a>>(
fn columnar_numerical_value(json_number: &serde_json::Number) -> Option<NumericalValue> {
if let Some(num_i64) = json_number.as_i64() {
return Some(num_i64.into());
}
if let Some(num_u64) = json_number.as_u64() {
return Some(num_u64.into());
}
if let Some(num_f64) = json_number.as_f64() {
return Some(num_f64.into());
}
// This can happen with arbitrary precision.... but we do not handle it.
None
}
fn record_json_obj_to_columnar_writer(
doc: DocId, doc: DocId,
json_obj: &serde_json::Map<String, serde_json::Value>, json_visitor: V::ObjectIter,
expand_dots: bool,
remaining_depth_limit: usize, remaining_depth_limit: usize,
json_path_buffer: &mut String, json_path_buffer: &mut JsonPathWriter,
columnar_writer: &mut columnar::ColumnarWriter, columnar_writer: &mut columnar::ColumnarWriter,
tokenizer: &mut Option<TextAnalyzer>, tokenizer: &mut Option<TextAnalyzer>,
) { ) {
for (key, child) in json_obj { for (key, child) in json_visitor {
let len_path = json_path_buffer.len(); json_path_buffer.push(key);
if !json_path_buffer.is_empty() {
json_path_buffer.push_str(JSON_PATH_SEGMENT_SEP_STR);
}
json_path_buffer.push_str(key);
if expand_dots {
// This might include the separation byte, which is ok because it is not a dot.
let appended_segment = &mut json_path_buffer[len_path..];
// The unsafe below is safe as long as b'.' and JSON_PATH_SEGMENT_SEP are
// valid single byte ut8 strings.
// By utf-8 design, they cannot be part of another codepoint.
replace_in_place(b'.', JSON_PATH_SEGMENT_SEP, unsafe {
appended_segment.as_bytes_mut()
});
}
record_json_value_to_columnar_writer( record_json_value_to_columnar_writer(
doc, doc,
child, child,
expand_dots,
remaining_depth_limit, remaining_depth_limit,
json_path_buffer, json_path_buffer,
columnar_writer, columnar_writer,
tokenizer, tokenizer,
); );
// popping our sub path. json_path_buffer.pop();
json_path_buffer.truncate(len_path);
} }
} }
fn record_json_value_to_columnar_writer( fn record_json_value_to_columnar_writer<'a, V: Value<'a>>(
doc: DocId, doc: DocId,
json_val: &serde_json::Value, json_val: V,
expand_dots: bool,
mut remaining_depth_limit: usize, mut remaining_depth_limit: usize,
json_path_writer: &mut String, json_path_writer: &mut JsonPathWriter,
columnar_writer: &mut columnar::ColumnarWriter, columnar_writer: &mut columnar::ColumnarWriter,
tokenizer: &mut Option<TextAnalyzer>, tokenizer: &mut Option<TextAnalyzer>,
) { ) {
@@ -308,34 +282,69 @@ fn record_json_value_to_columnar_writer(
return; return;
} }
remaining_depth_limit -= 1; remaining_depth_limit -= 1;
match json_val {
serde_json::Value::Null => { match json_val.as_value() {
// TODO handle null ReferenceValue::Leaf(leaf) => match leaf {
} ReferenceValueLeaf::Null => {} // TODO: Handle null
serde_json::Value::Bool(bool_val) => { ReferenceValueLeaf::Str(val) => {
columnar_writer.record_bool(doc, json_path_writer, *bool_val); if let Some(text_analyzer) = tokenizer.as_mut() {
} let mut token_stream = text_analyzer.token_stream(val);
serde_json::Value::Number(json_number) => { token_stream.process(&mut |token| {
if let Some(numerical_value) = columnar_numerical_value(json_number) { columnar_writer.record_str(doc, json_path_writer.as_str(), &token.text);
columnar_writer.record_numerical(doc, json_path_writer.as_str(), numerical_value); })
} else {
columnar_writer.record_str(doc, json_path_writer.as_str(), val);
}
} }
} ReferenceValueLeaf::U64(val) => {
serde_json::Value::String(text) => { columnar_writer.record_numerical(
if let Some(text_analyzer) = tokenizer.as_mut() { doc,
let mut token_stream = text_analyzer.token_stream(text); json_path_writer.as_str(),
token_stream.process(&mut |token| { NumericalValue::from(val),
columnar_writer.record_str(doc, json_path_writer.as_str(), &token.text); );
})
} else {
columnar_writer.record_str(doc, json_path_writer.as_str(), text);
} }
} ReferenceValueLeaf::I64(val) => {
serde_json::Value::Array(arr) => { columnar_writer.record_numerical(
for el in arr { doc,
json_path_writer.as_str(),
NumericalValue::from(val),
);
}
ReferenceValueLeaf::F64(val) => {
columnar_writer.record_numerical(
doc,
json_path_writer.as_str(),
NumericalValue::from(val),
);
}
ReferenceValueLeaf::Bool(val) => {
columnar_writer.record_bool(doc, json_path_writer.as_str(), val);
}
ReferenceValueLeaf::Date(val) => {
columnar_writer.record_datetime(doc, json_path_writer.as_str(), val);
}
ReferenceValueLeaf::Facet(_) => {
unimplemented!("Facet support in dynamic fields is not yet implemented")
}
ReferenceValueLeaf::Bytes(_) => {
// TODO: This can be re added once it is added to the JSON Utils section as well.
// columnar_writer.record_bytes(doc, json_path_writer.as_str(), val);
unimplemented!("Bytes support in dynamic fields is not yet implemented")
}
ReferenceValueLeaf::IpAddr(_) => {
unimplemented!("IP address support in dynamic fields is not yet implemented")
}
ReferenceValueLeaf::PreTokStr(_) => {
unimplemented!(
"Pre-tokenized string support in dynamic fields is not yet implemented"
)
}
},
ReferenceValue::Array(elements) => {
for el in elements {
record_json_value_to_columnar_writer( record_json_value_to_columnar_writer(
doc, doc,
el, el,
expand_dots,
remaining_depth_limit, remaining_depth_limit,
json_path_writer, json_path_writer,
columnar_writer, columnar_writer,
@@ -343,11 +352,10 @@ fn record_json_value_to_columnar_writer(
); );
} }
} }
serde_json::Value::Object(json_obj) => { ReferenceValue::Object(object) => {
record_json_obj_to_columnar_writer( record_json_obj_to_columnar_writer::<V>(
doc, doc,
json_obj, object,
expand_dots,
remaining_depth_limit, remaining_depth_limit,
json_path_writer, json_path_writer,
columnar_writer, columnar_writer,
@@ -360,6 +368,7 @@ fn record_json_value_to_columnar_writer(
#[cfg(test)] #[cfg(test)]
mod tests { mod tests {
use columnar::{Column, ColumnarReader, ColumnarWriter, StrColumn}; use columnar::{Column, ColumnarReader, ColumnarWriter, StrColumn};
use common::JsonPathWriter;
use super::record_json_value_to_columnar_writer; use super::record_json_value_to_columnar_writer;
use crate::fastfield::writer::JSON_DEPTH_LIMIT; use crate::fastfield::writer::JSON_DEPTH_LIMIT;
@@ -370,12 +379,12 @@ mod tests {
expand_dots: bool, expand_dots: bool,
) -> ColumnarReader { ) -> ColumnarReader {
let mut columnar_writer = ColumnarWriter::default(); let mut columnar_writer = ColumnarWriter::default();
let mut json_path = String::new(); let mut json_path = JsonPathWriter::default();
json_path.set_expand_dots(expand_dots);
for (doc, json_doc) in json_docs.iter().enumerate() { for (doc, json_doc) in json_docs.iter().enumerate() {
record_json_value_to_columnar_writer( record_json_value_to_columnar_writer(
doc as u32, doc as u32,
json_doc, json_doc,
expand_dots,
JSON_DEPTH_LIMIT, JSON_DEPTH_LIMIT,
&mut json_path, &mut json_path,
&mut columnar_writer, &mut columnar_writer,

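A standalone sketch of the JsonPathWriter buffer that replaces the old String plus replace_in_place handling. Every method used below appears in this diff; the \u{1} separator byte is inferred from the encoded paths in the fast field tests.

use common::JsonPathWriter; // tantivy's `common` workspace crate (published as `tantivy-common`)

fn main() {
    let mut path = JsonPathWriter::default();
    path.set_expand_dots(false);
    path.push("attributes");   // the schema field name itself is never expanded
    path.set_expand_dots(true);
    path.push("attr.age");     // with expand_dots on, the dot becomes a path separator
    let encoded = path.as_str().to_string(); // segments joined by the \u{1} byte
    assert!(encoded.contains('\u{1}'));
    path.pop();                // back to just the field name
    path.clear();
}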

@@ -2,8 +2,9 @@ use std::collections::HashSet;
 use rand::{thread_rng, Rng};
+use crate::indexer::index_writer::MEMORY_BUDGET_NUM_BYTES_MIN;
 use crate::schema::*;
-use crate::{doc, schema, Index, IndexSettings, IndexSortByField, Order, Searcher};
+use crate::{doc, schema, Index, IndexSettings, IndexSortByField, IndexWriter, Order, Searcher};
 fn check_index_content(searcher: &Searcher, vals: &[u64]) -> crate::Result<()> {
     assert!(searcher.segment_readers().len() < 20);
@@ -11,7 +12,7 @@ fn check_index_content(searcher: &Searcher, vals: &[u64]) -> crate::Result<()> {
     for segment_reader in searcher.segment_readers() {
         let store_reader = segment_reader.get_store_reader(1)?;
         for doc_id in 0..segment_reader.max_doc() {
-            let _doc = store_reader.get(doc_id)?;
+            let _doc: TantivyDocument = store_reader.get(doc_id)?;
         }
     }
     Ok(())
@@ -30,7 +31,8 @@ fn test_functional_store() -> crate::Result<()> {
     let mut rng = thread_rng();
-    let mut index_writer = index.writer_with_num_threads(3, 12_000_000)?;
+    let mut index_writer: IndexWriter =
+        index.writer_with_num_threads(3, 3 * MEMORY_BUDGET_NUM_BYTES_MIN)?;
     let mut doc_set: Vec<u64> = Vec::new();
@@ -90,7 +92,8 @@ fn test_functional_indexing_sorted() -> crate::Result<()> {
     let mut rng = thread_rng();
-    let mut index_writer = index.writer_with_num_threads(3, 120_000_000)?;
+    let mut index_writer: IndexWriter =
+        index.writer_with_num_threads(3, 3 * MEMORY_BUDGET_NUM_BYTES_MIN)?;
     let mut committed_docs: HashSet<u64> = HashSet::new();
     let mut uncommitted_docs: HashSet<u64> = HashSet::new();
@@ -113,7 +116,7 @@ fn test_functional_indexing_sorted() -> crate::Result<()> {
             index_writer.delete_term(doc_id_term);
         } else {
             uncommitted_docs.insert(random_val);
-            let mut doc = Document::new();
+            let mut doc = TantivyDocument::new();
             doc.add_u64(id_field, random_val);
             for i in 1u64..10u64 {
                 doc.add_u64(multiples_field, random_val * i);
@@ -165,7 +168,8 @@ fn test_functional_indexing_unsorted() -> crate::Result<()> {
     let mut rng = thread_rng();
-    let mut index_writer = index.writer_with_num_threads(3, 120_000_000)?;
+    let mut index_writer: IndexWriter =
+        index.writer_with_num_threads(3, 3 * MEMORY_BUDGET_NUM_BYTES_MIN)?;
     let mut committed_docs: HashSet<u64> = HashSet::new();
     let mut uncommitted_docs: HashSet<u64> = HashSet::new();
@@ -188,7 +192,7 @@ fn test_functional_indexing_unsorted() -> crate::Result<()> {
             index_writer.delete_term(doc_id_term);
         } else {
             uncommitted_docs.insert(random_val);
-            let mut doc = Document::new();
+            let mut doc = TantivyDocument::new();
             doc.add_u64(id_field, random_val);
             for i in 1u64..10u64 {
                 doc.add_u64(multiples_field, random_val * i);


@@ -6,22 +6,23 @@ use std::path::PathBuf;
 use std::sync::Arc;
 use super::segment::Segment;
-use super::IndexSettings;
-use crate::core::single_segment_index_writer::SingleSegmentIndexWriter;
-use crate::core::{
-    Executor, IndexMeta, SegmentId, SegmentMeta, SegmentMetaInventory, META_FILEPATH,
-};
+use super::segment_reader::merge_field_meta_data;
+use super::{FieldMetadata, IndexSettings};
+use crate::core::{Executor, META_FILEPATH};
 use crate::directory::error::OpenReadError;
 #[cfg(feature = "mmap")]
 use crate::directory::MmapDirectory;
 use crate::directory::{Directory, ManagedDirectory, RamDirectory, INDEX_WRITER_LOCK};
 use crate::error::{DataCorruption, TantivyError};
-use crate::indexer::index_writer::{MAX_NUM_THREAD, MEMORY_ARENA_NUM_BYTES_MIN};
+use crate::index::{IndexMeta, SegmentId, SegmentMeta, SegmentMetaInventory};
+use crate::indexer::index_writer::{MAX_NUM_THREAD, MEMORY_BUDGET_NUM_BYTES_MIN};
 use crate::indexer::segment_updater::save_metas;
+use crate::indexer::{IndexWriter, SingleSegmentIndexWriter};
 use crate::reader::{IndexReader, IndexReaderBuilder};
+use crate::schema::document::Document;
 use crate::schema::{Field, FieldType, Schema};
 use crate::tokenizer::{TextAnalyzer, TokenizerManager};
-use crate::IndexWriter;
+use crate::SegmentReader;
 fn load_metas(
     directory: &dyn Directory,
@@ -184,11 +185,11 @@ impl IndexBuilder {
     ///
     /// It expects an originally empty directory, and will not run any GC operation.
     #[doc(hidden)]
-    pub fn single_segment_index_writer(
+    pub fn single_segment_index_writer<D: Document>(
         self,
         dir: impl Into<Box<dyn Directory>>,
         mem_budget: usize,
-    ) -> crate::Result<SingleSegmentIndexWriter> {
+    ) -> crate::Result<SingleSegmentIndexWriter<D>> {
         let index = self.create(dir)?;
         let index_simple_writer = SingleSegmentIndexWriter::new(index, mem_budget)?;
         Ok(index_simple_writer)
@@ -321,6 +322,15 @@ impl Index {
         Ok(())
     }
+    /// Custom thread pool provided by an outer thread pool.
+    pub fn set_shared_multithread_executor(
+        &mut self,
+        shared_thread_pool: Arc<Executor>,
+    ) -> crate::Result<()> {
+        self.executor = shared_thread_pool.clone();
+        Ok(())
+    }
+
     /// Replace the default single thread search executor pool
     /// by a thread pool with as many threads as there are CPUs on the system.
     pub fn set_default_multithread_executor(&mut self) -> crate::Result<()> {
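The new `set_shared_multithread_executor` setter lets several `Index` instances reuse one search thread pool instead of each building their own. A minimal sketch, assuming the existing `Executor::multi_thread` constructor and two already-opened indexes (the pool size and thread-name prefix are illustrative):

use std::sync::Arc;

use tantivy::{Executor, Index};

// Build a single pool and hand it to both indexes; each call simply swaps the
// index's executor for the shared Arc.
fn share_search_pool(index_a: &mut Index, index_b: &mut Index) -> tantivy::Result<()> {
    let pool = Arc::new(Executor::multi_thread(4, "tantivy-search-")?);
    index_a.set_shared_multithread_executor(pool.clone())?;
    index_b.set_shared_multithread_executor(pool)?;
    Ok(())
}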
@@ -488,6 +498,28 @@ impl Index {
         self.inventory.all()
     }
+    /// Returns the list of fields that have been indexed in the Index.
+    /// The field list includes the fields defined in the schema as well as the fields
+    /// that have been indexed as a part of a JSON field.
+    /// The returned field name is the full field name, including the name of the JSON field.
+    ///
+    /// The returned field names can be used in queries.
+    ///
+    /// Notice: If your data contains JSON fields this is **very expensive**, as it requires
+    /// browsing through the inverted index term dictionary and the columnar field dictionary.
+    ///
+    /// Disclaimer: Some fields may not be listed here. For instance, if the schema contains a JSON
+    /// field that is not indexed nor a fast field but is stored, it is possible for the field
+    /// to not be listed.
+    pub fn fields_metadata(&self) -> crate::Result<Vec<FieldMetadata>> {
+        let segments = self.searchable_segments()?;
+        let fields_metadata: Vec<Vec<FieldMetadata>> = segments
+            .into_iter()
+            .map(|segment| SegmentReader::open(&segment)?.fields_metadata())
+            .collect::<Result<_, _>>()?;
+        Ok(merge_field_meta_data(fields_metadata, &self.schema()))
+    }
+
     /// Creates a new segment_meta (Advanced user only).
     ///
     /// As long as the `SegmentMeta` lives, the files associated with the
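A usage sketch for the new `fields_metadata` accessor; the schema, field name, and memory budget below are illustrative and not part of this diff, and `FieldMetadata` is assumed to implement `Debug` for printing:

use tantivy::schema::{Schema, STORED, TEXT};
use tantivy::{doc, Index, IndexWriter};

fn list_indexed_fields() -> tantivy::Result<()> {
    let mut schema_builder = Schema::builder();
    let title = schema_builder.add_text_field("title", TEXT | STORED);
    let index = Index::create_in_ram(schema_builder.build());

    let mut writer: IndexWriter = index.writer_with_num_threads(1, 15_000_000)?;
    writer.add_document(doc!(title => "hello world"))?;
    writer.commit()?;

    // Opens every searchable segment, collects its FieldMetadata, and merges
    // the per-segment lists against the schema (see fields_metadata above).
    for field in index.fields_metadata()? {
        println!("{field:?}");
    }
    Ok(())
}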
@@ -523,19 +555,19 @@ impl Index {
     /// - `num_threads` defines the number of indexing workers that
     /// should work at the same time.
     ///
-    /// - `overall_memory_arena_in_bytes` sets the amount of memory
+    /// - `overall_memory_budget_in_bytes` sets the amount of memory
     /// allocated for all indexing thread.
-    /// Each thread will receive a budget of `overall_memory_arena_in_bytes / num_threads`.
+    /// Each thread will receive a budget of `overall_memory_budget_in_bytes / num_threads`.
     ///
     /// # Errors
     /// If the lockfile already exists, returns `Error::DirectoryLockBusy` or an `Error::IoError`.
     /// If the memory arena per thread is too small or too big, returns
     /// `TantivyError::InvalidArgument`
-    pub fn writer_with_num_threads(
+    pub fn writer_with_num_threads<D: Document>(
         &self,
         num_threads: usize,
-        overall_memory_arena_in_bytes: usize,
-    ) -> crate::Result<IndexWriter> {
+        overall_memory_budget_in_bytes: usize,
+    ) -> crate::Result<IndexWriter<D>> {
         let directory_lock = self
             .directory
             .acquire_lock(&INDEX_WRITER_LOCK)
@@ -550,7 +582,7 @@ impl Index {
                 ),
             )
         })?;
-        let memory_arena_in_bytes_per_thread = overall_memory_arena_in_bytes / num_threads;
+        let memory_arena_in_bytes_per_thread = overall_memory_budget_in_bytes / num_threads;
         IndexWriter::new(
             self,
             num_threads,
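Since the writer constructors are now generic over `D: Document`, call sites normally pin the document type with a type annotation (as the functional tests above do) or a turbofish. A small sketch under those assumptions; the 4-thread/60 MB figures are illustrative:

use tantivy::{Index, IndexWriter};

// A 60 MB budget split across 4 threads gives each indexing worker the
// documented 15 MB minimum.
fn open_writer(index: &Index) -> tantivy::Result<IndexWriter> {
    // The annotation resolves `D` to the default `TantivyDocument`;
    // `index.writer_with_num_threads::<TantivyDocument>(4, 60_000_000)` is equivalent.
    let writer: IndexWriter = index.writer_with_num_threads(4, 60_000_000)?;
    Ok(writer)
}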
@@ -561,11 +593,11 @@ impl Index {
     /// Helper to create an index writer for tests.
     ///
-    /// That index writer only simply has a single thread and a memory arena of 10 MB.
+    /// That index writer only simply has a single thread and a memory budget of 15 MB.
     /// Using a single thread gives us a deterministic allocation of DocId.
     #[cfg(test)]
-    pub fn writer_for_tests(&self) -> crate::Result<IndexWriter> {
-        self.writer_with_num_threads(1, 15_000_000)
+    pub fn writer_for_tests<D: Document>(&self) -> crate::Result<IndexWriter<D>> {
+        self.writer_with_num_threads(1, MEMORY_BUDGET_NUM_BYTES_MIN)
     }
     /// Creates a multithreaded writer
@@ -579,13 +611,16 @@ impl Index {
     /// If the lockfile already exists, returns `Error::FileAlreadyExists`.
     /// If the memory arena per thread is too small or too big, returns
     /// `TantivyError::InvalidArgument`
-    pub fn writer(&self, memory_arena_num_bytes: usize) -> crate::Result<IndexWriter> {
+    pub fn writer<D: Document>(
+        &self,
+        memory_budget_in_bytes: usize,
+    ) -> crate::Result<IndexWriter<D>> {
         let mut num_threads = std::cmp::min(num_cpus::get(), MAX_NUM_THREAD);
-        let memory_arena_num_bytes_per_thread = memory_arena_num_bytes / num_threads;
-        if memory_arena_num_bytes_per_thread < MEMORY_ARENA_NUM_BYTES_MIN {
-            num_threads = (memory_arena_num_bytes / MEMORY_ARENA_NUM_BYTES_MIN).max(1);
+        let memory_budget_num_bytes_per_thread = memory_budget_in_bytes / num_threads;
+        if memory_budget_num_bytes_per_thread < MEMORY_BUDGET_NUM_BYTES_MIN {
+            num_threads = (memory_budget_in_bytes / MEMORY_BUDGET_NUM_BYTES_MIN).max(1);
         }
-        self.writer_with_num_threads(num_threads, memory_arena_num_bytes)
+        self.writer_with_num_threads(num_threads, memory_budget_in_bytes)
     }
     /// Accessor to the index settings
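Spelled out, the heuristic in `writer` starts from `min(num_cpus, MAX_NUM_THREAD)` workers and shrinks the pool until every worker gets at least the minimum per-thread budget. A standalone sketch of that arithmetic; only the 15 MB minimum comes from this diff, and the 8-thread cap is an assumption for the example:

// Mirrors the thread-count heuristic of `Index::writer`, for illustration only.
const MEMORY_BUDGET_NUM_BYTES_MIN: usize = 15_000_000;
const MAX_NUM_THREAD: usize = 8; // assumed cap, not taken from this diff

fn effective_num_threads(memory_budget_in_bytes: usize, num_cpus: usize) -> usize {
    let mut num_threads = num_cpus.min(MAX_NUM_THREAD);
    if memory_budget_in_bytes / num_threads < MEMORY_BUDGET_NUM_BYTES_MIN {
        num_threads = (memory_budget_in_bytes / MEMORY_BUDGET_NUM_BYTES_MIN).max(1);
    }
    num_threads
}

// e.g. a 60 MB budget on an 8-core machine yields 60 / 15 = 4 indexing threads,
// each holding the 15 MB minimum.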


@@ -7,7 +7,7 @@ use std::sync::Arc;
 use serde::{Deserialize, Serialize};
 use super::SegmentComponent;
-use crate::core::SegmentId;
+use crate::index::SegmentId;
 use crate::schema::Schema;
 use crate::store::Compressor;
 use crate::{Inventory, Opstamp, TrackedObject};
@@ -19,7 +19,7 @@ struct DeleteMeta {
 }
 #[derive(Clone, Default)]
-pub struct SegmentMetaInventory {
+pub(crate) struct SegmentMetaInventory {
     inventory: Inventory<InnerSegmentMeta>,
 }
@@ -408,7 +408,7 @@ impl fmt::Debug for IndexMeta {
 mod tests {
     use super::IndexMeta;
-    use crate::core::index_meta::UntrackedIndexMeta;
+    use crate::index::index_meta::UntrackedIndexMeta;
     use crate::schema::{Schema, TEXT};
     use crate::store::Compressor;
     #[cfg(feature = "zstd-compression")]

Some files were not shown because too many files have changed in this diff.