Compare commits

...

212 Commits

Author SHA1 Message Date
Paul Masurel
8828b6d310 Support for columnar 2022-12-21 12:21:30 +09:00
Paul Masurel
2b89bf9050 Support for NotNaN in fast fields 2022-12-21 12:20:48 +09:00
Paul Masurel
3580198447 Minor refactoring 2022-12-21 12:18:33 +09:00
Paul Masurel
d96a716d20 Refactoring to prepare for the addition of dynamic fast field 2022-12-21 12:16:00 +09:00
PSeitz
2ac1cc2fc0 add sparse codec (#1723)
* add sparse codec

* Apply suggestions from code review

Co-authored-by: Paul Masurel <paul@quickwit.io>

* Apply suggestions from code review

Co-authored-by: Paul Masurel <paul@quickwit.io>

* Apply suggestions from code review

Co-authored-by: Paul Masurel <paul@quickwit.io>

* add the -1 u16 fix for metadata num_vals

* add dense block encoding to sparse codec

* add comment, refactor u16 reading

Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-12-20 15:30:33 +01:00
PSeitz
f9171a3981 fix clippy (#1725)
* fix clippy

* fix clippy fastfield codecs

* fix clippy bitpacker

* fix clippy common

* fix clippy stacker

* fix clippy sstable

* fmt
2022-12-20 07:30:06 +01:00
PSeitz
a2cf6a79b4 Sparse dense index (#1716)
* add dense codec

* benchmark fix and important optimisation

* move code to DenseIndexBlock

improve benchmark

* Apply suggestions from code review

Co-authored-by: Paul Masurel <paul@quickwit.io>

* Apply suggestions from code review

Co-authored-by: Paul Masurel <paul@quickwit.io>

* extend benchmarks

* Apply suggestions from code review

Co-authored-by: Paul Masurel <paul@quickwit.io>

Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-12-13 07:50:09 +01:00
Paul Masurel
f6e87a5319 Cargo fmt 2022-12-13 12:30:40 +09:00
Paul Masurel
f9971e15fe Fixing unit test with sstable test. 2022-12-13 12:22:44 +09:00
PSeitz
3cdc8e7472 pass index info to serialize (#1719) 2022-12-13 04:20:31 +01:00
dependabot[bot]
fbb0f8b55d Update base64 requirement from 0.13.0 to 0.20.0 (#1720)
Updates the requirements on [base64](https://github.com/marshallpierce/rust-base64) to permit the latest version.
- [Release notes](https://github.com/marshallpierce/rust-base64/releases)
- [Changelog](https://github.com/marshallpierce/rust-base64/blob/master/RELEASE-NOTES.md)
- [Commits](https://github.com/marshallpierce/rust-base64/compare/v0.13.0...v0.20.0)

---
updated-dependencies:
- dependency-name: base64
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-12-13 11:46:23 +09:00
Paul Masurel
136a8f4124 Isolating sstable and stacker in independant crates. (#1718)
Both crate will be used in the new (optional + dynamic) fastfield work.
2022-12-13 11:44:17 +09:00
PSeitz
5d4535de83 Changelog fix (#1717) 2022-12-12 14:28:42 +09:00
PSeitz
2c50b02eb3 Fix max bucket limit in histogram (#1703)
* Fix max bucket limit in histogram

The max bucket limit in histogram was broken, since some code introduced temporary filtering of buckets, which then resulted into an incorrect increment on the bucket count.
The provided solution covers more scenarios, but there are still some scenarios unhandled (See #1702).

* Apply suggestions from code review

Co-authored-by: Paul Masurel <paul@quickwit.io>

Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-12-12 04:40:15 +01:00
PSeitz
509adab79d Bump version (#1715)
* group workspace deps

* update cargo.toml

* revert tant version

* chore: Release
2022-12-12 04:39:43 +01:00
PSeitz
96c93a6ba3 Merge pull request #1700 from quickwit-oss/PSeitz-patch-1
Update CHANGELOG.md
2022-12-02 16:31:11 +01:00
boraarslan
495824361a Move split_full_path to Schema (#1692) 2022-11-29 20:56:13 +09:00
PSeitz
485a8f507e Update CHANGELOG.md 2022-11-28 15:41:31 +01:00
PSeitz
1119e59eae prepare fastfield format for null index (#1691)
* prepare fastfield format for null index
* add format version for fastfield
* Update fastfield_codecs/src/compact_space/mod.rs
* switch to variable size footer
* serialize delta of end
2022-11-28 17:15:24 +09:00
PSeitz
ee1f2c1f28 add aggregation support for date type (#1693)
* add aggregation support for date type
fixes #1332

* serialize key_as_string as rfc3339 in date histogram
* update docs
* enable date for range aggregation
2022-11-28 09:12:08 +09:00
PSeitz
600548fd26 Merge pull request #1694 from quickwit-oss/dependabot/cargo/zstd-0.12
Update zstd requirement from 0.11 to 0.12
2022-11-25 05:48:59 +01:00
PSeitz
9929c0c221 Merge pull request #1696 from quickwit-oss/dependabot/cargo/env_logger-0.10.0
Update env_logger requirement from 0.9.0 to 0.10.0
2022-11-25 03:28:10 +01:00
dependabot[bot]
f53e65648b Update env_logger requirement from 0.9.0 to 0.10.0
Updates the requirements on [env_logger](https://github.com/rust-cli/env_logger) to permit the latest version.
- [Release notes](https://github.com/rust-cli/env_logger/releases)
- [Changelog](https://github.com/rust-cli/env_logger/blob/main/CHANGELOG.md)
- [Commits](https://github.com/rust-cli/env_logger/compare/v0.9.0...v0.10.0)

---
updated-dependencies:
- dependency-name: env_logger
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2022-11-24 20:07:52 +00:00
PSeitz
0281b22b77 update create_in_ram docs (#1695) 2022-11-24 17:30:09 +01:00
dependabot[bot]
a05c184830 Update zstd requirement from 0.11 to 0.12
Updates the requirements on [zstd](https://github.com/gyscos/zstd-rs) to permit the latest version.
- [Release notes](https://github.com/gyscos/zstd-rs/releases)
- [Commits](https://github.com/gyscos/zstd-rs/commits)

---
updated-dependencies:
- dependency-name: zstd
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2022-11-23 20:15:32 +00:00
Paul Masurel
0b40a7fe43 Added a expand_dots JsonObjectOptions. (#1687)
Related with quickwit#2345.
2022-11-21 23:03:00 +09:00
trinity-1686a
e758080465 add support for TermSetQuery in query parser (#1683) 2022-11-17 16:49:49 +01:00
Paul Masurel
2a39289a1b Handle escaped dot in json path in the QueryParser. (#1682) 2022-11-16 07:18:34 +09:00
Adam Reichold
ca6231170e Make the built-in stop word lists selectable via the Language enum already used by the Stemmer filter. (#1671) 2022-11-15 17:40:25 +09:00
PSeitz
eda6e5a10a Merge pull request #1681 from quickwit-oss/ip_range_query_multi
remove Column from MultiValuedU128FastFieldReader
2022-11-15 09:27:46 +08:00
Pascal Seitz
8641155cbb remove column from MultiValuedU128FastFieldReader 2022-11-14 18:49:15 +08:00
PSeitz
9a090ed994 Merge pull request #1659 from quickwit-oss/ip_range_query_multi
add support for ip range query on multivalue fastfields
2022-11-14 15:17:41 +08:00
Pascal Seitz
b7d0dd154a fmt 2022-11-14 14:49:15 +08:00
PSeitz
ce10fab20f Apply suggestions from code review
Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-11-14 14:21:53 +08:00
Pascal Seitz
e034328a8b Improve position_to_docid, refactor, add tests 2022-11-14 14:21:53 +08:00
Pascal Seitz
f811d1616b add support for ip range query on multivalue fastfields 2022-11-14 14:21:52 +08:00
PSeitz
c665b16ff0 Merge pull request #1672 from quickwit-oss/allow_range_without_indexed
Allow range query on fastfield without INDEXED
2022-11-14 12:45:12 +08:00
PSeitz
3b5f810051 Merge pull request #1677 from quickwit-oss/switch_to_u32
switch total_num_val to u32
2022-11-14 12:01:40 +08:00
trinity-1686a
5765c261aa allow warming up of the full posting list (#1673)
* allow warming up of the full posting list

* cargo fmt
2022-11-14 10:27:56 +09:00
Pascal Seitz
fb9f03118d switch total_num_val to u32 2022-11-11 17:35:52 +08:00
PSeitz
55a9d808d4 Merge pull request #1674 from quickwit-oss/u128_codec_header
add header with codec type for u128
2022-11-11 13:47:51 +08:00
Pascal Seitz
32166682b3 add header deser test 2022-11-11 13:28:12 +08:00
Pascal Seitz
e6acf8f76d add header with codec type for u128 2022-11-11 11:52:17 +08:00
Pascal Seitz
9e8a0c2cca Allow range query on fastfield without INDEXED 2022-11-10 15:56:08 +08:00
Paul Masurel
3edf0a2724 Using the manual reload policy in IndexWriter. (#1667) 2022-11-09 11:20:41 +01:00
Paul Masurel
8ca12a5683 Added stop word filter to CHANGELOG.md 2022-11-09 17:00:45 +09:00
Adam Reichold
a4b759d2fe Include stop word lists from Lucene and the Snowball project (#1666) 2022-11-09 16:57:35 +09:00
PSeitz
3e9c806890 Merge pull request #1665 from quickwit-oss/fix_num_vals
fix num_vals on u128 value index after merge
2022-11-07 21:46:02 +08:00
Pascal Seitz
c69a873dd3 fix num_vals on value index after merge 2022-11-07 21:05:21 +08:00
PSeitz
666afcf641 Merge pull request #1663 from PSeitz/fix_clippy
fix clippy
2022-11-07 18:11:20 +08:00
Pascal Seitz
38ad46e580 fix clippy 2022-11-07 16:09:55 +08:00
PSeitz
e948889f4c Merge pull request #1662 from quickwit-oss/fix_num_vals
fix num_vals in multivalue index after merge
2022-11-07 15:57:32 +08:00
Pascal Seitz
6e636c9cea fix num_vals in multivalue index after merge 2022-11-07 15:00:52 +08:00
PSeitz
5a610efbc1 Merge pull request #1661 from quickwit-oss/upgrade_criterion
update criterion to 0.4
2022-11-04 14:45:34 +08:00
Pascal Seitz
500a0d5e48 update criterion to 0.4 2022-11-04 13:26:29 +08:00
PSeitz
509a265659 add docstore version (#1652)
* add docstore version

closes #1589

* assert for docstore version
2022-11-04 10:19:16 +09:00
PSeitz
5b2cea1b97 Merge pull request #1656 from quickwit-oss/multival_offset_index
move multivalue index to own file
2022-11-02 14:03:06 +08:00
PSeitz
a5a80ffaea Update fastfield_codecs/src/column.rs
Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-11-02 06:37:27 +01:00
PSeitz
0f98d91a39 Merge pull request #1646 from quickwit-oss/no_score_calls
No score calls if score is not requested
2022-11-01 20:09:32 +08:00
PSeitz
2af6b01c17 Update src/query/boolean_query/boolean_weight.rs
Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-11-01 16:13:00 +08:00
Adam Reichold
c32ab66bbd Small improvements to StopWorldFilter (#1657)
* Do not copy the whole set of stop words for each stream

* Make construction of StopWordFilter more flexible.
2022-11-01 16:47:34 +09:00
PSeitz
3f3a6f9990 Merge pull request #1653 from quickwit-oss/faster_hash
switch to fx hashmap
2022-11-01 14:53:18 +08:00
Pascal Seitz
83325d8f3f move multivalue index to own file
start_doc parameter in positions to docids
2022-11-01 10:36:13 +08:00
PSeitz
4e46f4f8c4 Merge pull request #1649 from adamreichold/split-compound-words
RFC: Add dictionary-based SplitCompoundWords token filter.
2022-10-27 17:12:48 +08:00
Pascal Seitz
43df356010 rename to docset 2022-10-27 16:53:38 +08:00
PSeitz
6647362464 Merge pull request #1648 from adamreichold/stemmer-todo-alloc
Avoid unconditional allocation in StemmerTokenStream.
2022-10-27 16:50:41 +08:00
Pascal Seitz
279b1b28d3 switch to fx hashmap 2022-10-27 16:19:59 +08:00
PSeitz
7a80851e36 Merge pull request #1645 from quickwit-oss/ip_field_range_query
add ip range query benchmark, add seek behaviour
2022-10-27 16:13:52 +08:00
Adam Reichold
cd952429d2 Add dictionary-based SplitCompoundWords token filter. 2022-10-27 08:30:33 +02:00
PSeitz
d777c964da Merge pull request #1650 from adamreichold/fnv-rustc-hash
Replace FNV by rustc-hash
2022-10-27 12:11:26 +08:00
Adam Reichold
bbb058d976 Replace FNV by rustc-hash
Both construction have similar goals but rustc-hash ist better suited for
contemporary CPU as it works one word at a time instead of byte per byte.
2022-10-27 00:35:09 +02:00
Adam Reichold
5f7d027a52 Avoid unconditional allocation in StemmerTokenStream.
This fixes the TODO in two ways: If the stemmer already yields an owned string,
it is used directly as the new text of the token. Otherwise, a temporary buffer
is used to copy the stemmed text (just as before) and then swapping it into the
token to reuse its existing buffer.
2022-10-26 18:11:15 +02:00
Pascal Seitz
dfab201191 for_each_docset to iterate without score 2022-10-26 17:25:05 +08:00
PSeitz
0c2bd36fe3 Panic on duplicate field names (#1647)
fixes #1601
2022-10-26 16:17:33 +09:00
Pascal Seitz
af839753e0 No score calls if score is not requested 2022-10-26 12:18:35 +08:00
Pascal Seitz
fec2b63571 improve bench by adding more blanks in compact space 2022-10-25 22:09:01 +08:00
Pascal Seitz
6213ea476a pass positions parameter 2022-10-25 17:44:51 +08:00
Pascal Seitz
5e159c26bf add ip range query benchmark, add seek behaviour 2022-10-25 15:57:19 +08:00
PSeitz
a5e59ab598 Merge pull request #1644 from quickwit-oss/get_val_u32
switch get_val() to u32
2022-10-24 19:30:03 +08:00
Pascal Seitz
e772d3170d switch get_val() to u32
Fixes #1638
2022-10-24 19:05:57 +08:00
PSeitz
8c2ba7bd55 Merge pull request #1637 from quickwit-oss/ip_field_range_query
add range query via ip fast field
2022-10-24 18:10:47 +08:00
Pascal Seitz
02328b0151 fix proptest 2022-10-24 17:46:06 +08:00
Pascal Seitz
7cc775256c add comments, rename 2022-10-24 17:08:37 +08:00
Pascal Seitz
07b40f8b8b add proptest 2022-10-24 16:52:55 +08:00
PSeitz
9b6b6be5b9 Apply suggestions from code review
Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-10-24 16:00:38 +08:00
Pascal Seitz
6bb73a527f add range query via ip fast field 2022-10-24 16:00:38 +08:00
PSeitz
03885d0f3c Merge pull request #1643 from quickwit-oss/range_query_parser
allow more characters in range query
2022-10-24 15:09:47 +08:00
Pascal Seitz
f2e5135870 allow more characters in range query
closes #1642
2022-10-21 18:05:15 +08:00
Paul Masurel
c24157f28b Bumping version format. (#1640)
The docstore format has changed in a non-compatible manner.
2022-10-21 15:35:35 +09:00
PSeitz
873382cdcb Merge pull request #1639 from quickwit-oss/num_vals_u32
switch num_vals() to u32
2022-10-21 12:36:50 +08:00
Pascal Seitz
791350091c switch num_vals() to u32
fixes #1630
2022-10-20 19:44:28 +08:00
Paul Masurel
483b1d13d4 Added unit test for long tokens (#1635)
* Bugfix on long tokens and multivalue text fields.

Fixes a minor bug for the strong edge case
in which a tokenizer would emit tokens where
the last token does not cover the last position.

More importantly, this adds unit tests.

Closes #1634

* Update src/indexer/segment_writer.rs

Co-authored-by: PSeitz <PSeitz@users.noreply.github.com>

Co-authored-by: PSeitz <PSeitz@users.noreply.github.com>
2022-10-20 15:05:37 +09:00
PSeitz
8de7fa9d95 Merge pull request #1631 from quickwit-oss/high_positions
add test for phrase search on multi text field
2022-10-20 10:26:00 +08:00
Paul Masurel
94313b62f8 Hotfix issue/1629 - position broken (#1633)
* Bugfix position broken.

For Field with several FieldValues, with a
value that contained no token at all, the token position
was reinitialized to 0.

As a result, PhraseQueries can show some false positives.
In addition, after the computation of the position delta, we can
underflow u32, and end up with gigantic delta.

We haven't been able to actually explain the bug in 1629, but it
is assumed that in some corner case these delta can cause a panic.

Closes #1629
2022-10-20 11:03:55 +09:00
Pascal Seitz
f2b2628feb add test for phrase search on multi text field 2022-10-19 16:29:56 +08:00
PSeitz
449f595832 Merge pull request #1628 from quickwit-oss/skip_index_deser
faster skipindex deserialization, larger blocksize on sort
2022-10-19 11:05:20 +08:00
PSeitz
c9235df059 Merge pull request #1627 from quickwit-oss/ip_field_range_query
add range query handling for ip via term dictionary
2022-10-19 10:53:00 +08:00
Pascal Seitz
a4485f7611 faster skipindex deserialization, larger blocksize on sort 2022-10-18 19:32:23 +08:00
Pascal Seitz
1082ff60f9 add range query handling for ip via term dictionary
since IPs are mapped monotonically we can use the term dictionary for range queries
2022-10-18 13:08:27 +08:00
PSeitz
491854155c Merge pull request #1625 from quickwit-oss/index_ip_field
index ip field
2022-10-18 11:18:17 +08:00
Christoph Herzog
96c3d54ac7 fix: Fix power of two computation on 32bit architectures (#1624)
The current `compute_previous_power_of_two()` implementation used for
TermHashmap takes and returns `usize` , but actually only works
correclty on 64 bit architectures (aka usize == u64)

On other architectures the leading_zeros computation is run on the wrong
type (must be u64), and leads to overflows.

Fixed simply computing the leading_zeros based on a u64 value.
2022-10-18 11:55:02 +09:00
Pascal Seitz
6800fdec9d add indexing for ip field
Closes #1595
2022-10-18 10:07:48 +08:00
PSeitz
c9cf9c952a Merge pull request #1614 from quickwit-oss/remove_superfluous_steps
refactor Term
2022-10-17 18:25:31 +08:00
Pascal Seitz
024e53a99c remove truncate 2022-10-17 12:14:35 +08:00
Pascal Seitz
8d75e451bd fix truncate, remove mutable access from term 2022-10-17 12:14:35 +08:00
Pascal Seitz
fcfd76ec55 refactor Term
fixes some issues with Term
Remove duplicate calls to truncate or resize
Replace Magic Number 5 with constant
Enforce minimum size of 5 for metadata
Fix broken truncate docs
use constructor instead new + set calls
normalize constructor stack
replace assert on internal behavior fixes #1585
2022-10-17 12:14:34 +08:00
PSeitz
6b7b1cc4fa Merge pull request #1623 from quickwit-oss/remove_unused_buffer
remove unused buffer
2022-10-14 20:36:00 +08:00
Pascal Seitz
129f7422f5 remove unused buffer 2022-10-14 20:01:10 +08:00
PSeitz
f39cce2c8b Merge pull request #1622 from quickwit-oss/term_aggregation
add term aggregation clarification
2022-10-14 18:09:18 +08:00
PSeitz
d2478fac8a Merge pull request #1621 from quickwit-oss/changelog
update CHANGELOG
2022-10-14 18:08:57 +08:00
Pascal Seitz
952b048341 add term aggregation clarification 2022-10-14 16:12:19 +08:00
PSeitz
80f9596ec8 Merge pull request #1611 from quickwit-oss/remove_token_stream_alloc
remove tokenstream vec alloc
2022-10-14 15:12:30 +08:00
Pascal Seitz
84f9e77e1d update CHANGELOG 2022-10-14 15:10:33 +08:00
PSeitz
a602c248fb Merge pull request #1590 from waywardmonkeys/fix-doc-warnings-quickwit
Fix missing doc warnings when enabling feature "quickwit".
2022-10-14 14:09:25 +08:00
PSeitz
4b9d1fe828 Merge pull request #1620 from quickwit-oss/fix_fieldnorms_indexing
Fix missing fieldnorm indexing
2022-10-14 13:41:38 +08:00
Pascal Seitz
63bc390b02 Fix missing fieldnorm indexing
Fixes broken search (no results) with BM25 for u64, i64, f64, bool, bytes and date after deletion and merge.
There were no fieldnorms recorded for those field. After merge InvertedIndexReader::total_num_tokens returns 0 (Sum over the fieldnorms is 0). BM25 does not work when total_num_tokens is 0.
Fixes #1617
2022-10-14 12:44:40 +08:00
Paul Masurel
07393c2fa0 Attempt to fix race condition in test. (#1619)
Close #1550
2022-10-14 10:56:37 +09:00
PSeitz
77a415cbe4 rename NothingRecorder to DocIdRecorder (#1615) 2022-10-13 15:43:40 +09:00
PSeitz
4b4c231bba Merge pull request #1612 from quickwit-oss/no_panic_please
return Error instead panic in fastfields
2022-10-11 18:33:00 +08:00
PSeitz
11d3409286 add missing docs for fastfield_codecs crate (#1613)
closes #1603
2022-10-11 18:54:24 +09:00
Pascal Seitz
9cb8cfbea8 return Error instead panic in fastfields
fixes #1572
2022-10-11 14:15:22 +08:00
PSeitz
8b69aab0fc avoid prepare_doc allocation (#1610)
avoid prepare_doc allocation, ~10% more thoughput best case
2022-10-11 14:15:55 +09:00
PSeitz
3650d1f36a Merge pull request #1553 from quickwit-oss/ip_field
ip field
2022-10-11 13:09:47 +08:00
Pascal Seitz
2efebdb1bb remove tokenstream vec alloc 2022-10-11 10:30:56 +08:00
François Massot
e443ca63aa Merge pull request #1608 from quickwit-oss/nigel/serialise-bytes-as-b64-#2042
Serialise bytes as base64 strings instead of arrays.
2022-10-10 11:51:23 +02:00
Pascal Seitz
5c9cbee29d handle IpV4 serialization case 2022-10-07 19:52:00 +08:00
Pascal Seitz
b2ca83a93c switch to ipv6, add monotonic_mapping tests 2022-10-07 18:47:55 +08:00
Nigel Andrews
3b189080d4 Use raw string literals in tests 2022-10-07 12:28:25 +02:00
Nigel Andrews
00a6586efe Replaced String::serialize for serializer.serialize_str 2022-10-07 11:55:05 +02:00
Pascal Seitz
b9b913510e fmt 2022-10-07 16:56:19 +08:00
PSeitz
534b1d33c3 use ipv6
Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-10-07 16:56:00 +08:00
PSeitz
f465173872 Apply suggestions from code review
Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-10-07 16:55:53 +08:00
Pascal Seitz
96315df20d use idx part only for positions_to_docid 2022-10-07 16:54:04 +08:00
Pascal Seitz
9a1609d364 add test 2022-10-07 16:25:01 +08:00
Pascal Seitz
39f4e58450 improve comment 2022-10-07 16:25:01 +08:00
Pascal Seitz
a8a36b62cd enable test 2022-10-07 16:25:01 +08:00
Pascal Seitz
226a49338f add StrictlyMonotonicFn 2022-10-07 16:25:01 +08:00
Pascal Seitz
2864bf7123 use serializer for u128 2022-10-07 16:25:01 +08:00
Pascal Seitz
5171ff611b serialize ip as u128, add test for positions_to_docid 2022-10-07 16:25:01 +08:00
Pascal Seitz
e50e74acf8 remove u128 type 2022-10-07 16:25:01 +08:00
Pascal Seitz
0b86658389 rename ip addr, use buffer 2022-10-07 16:25:01 +08:00
Pascal Seitz
5d6602a8d9 mark null handling TODO 2022-10-07 16:25:01 +08:00
Pascal Seitz
4d29ff4d01 finalize ip addr rename 2022-10-07 16:25:01 +08:00
Pascal Seitz
cdc8e3a8be group montonic mapping and inverse
fix mapping inverse
remove ip indexing
add get_between_vals test
2022-10-07 16:25:01 +08:00
Pascal Seitz
67f453b534 rename to iter_gen 2022-10-07 16:25:01 +08:00
Pascal Seitz
787a37bacf expect instead of unwrap 2022-10-07 16:25:01 +08:00
Pascal Seitz
f5039f1846 remove roaring 2022-10-07 16:25:01 +08:00
Pascal Seitz
eeb1f19093 rename to iter_gen 2022-10-07 16:25:01 +08:00
Pascal Seitz
087beaf328 remove null handling 2022-10-07 16:25:01 +08:00
Pascal Seitz
309449dba3 rename to IpAddr 2022-10-07 16:25:01 +08:00
Pascal Seitz
5a76e6c5d3 fix get_between_vals forwarding
fix get_between_vals forwarding in monotonicmapping column by adding an additional conversion function Output->Input
2022-10-07 16:25:01 +08:00
Pascal Seitz
c8713a01ed use iter api 2022-10-07 16:25:01 +08:00
Pascal Seitz
6113e0408c remove comment 2022-10-07 16:25:01 +08:00
Pascal Seitz
400a20b7af add ip field
add u128 multivalue reader and writer
add ip to schema
add ip writers, handle merge
2022-10-07 16:25:01 +08:00
PSeitz
5f565e77de Merge pull request #1604 from quickwit-oss/replace_cbor
replace cbor with cborium
2022-10-07 14:42:55 +08:00
Pascal Seitz
516e60900d remove unwrap 2022-10-07 14:22:37 +08:00
Pascal Seitz
36e1c79f37 replace cbor with cborium
closes #1526
2022-10-07 13:23:39 +08:00
Bruce Mitchener
c2f1c250f9 doc: Remove reference to Searcher pool. (#1598)
The pool of searchers was removed in 23fe73a6 as part of #1411.
2022-10-06 00:04:11 +09:00
Bruce Mitchener
c694bc039a Fix missing doc warnings when enabling feature "quickwit". 2022-10-05 20:17:10 +07:00
PSeitz
2063f1717f Merge pull request #1591 from quickwit-oss/ff_refact
disable linear codec for multivalue values
2022-10-05 19:39:36 +08:00
Pascal Seitz
d742275048 renames 2022-10-05 19:16:49 +08:00
PSeitz
b9f06bc287 Update src/fastfield/multivalued/mod.rs
Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-10-05 19:09:19 +08:00
Pascal Seitz
8b42c4c126 disable linear codec for multivalue value index
don't materialize index column on merge
use simpler chain() variant
2022-10-05 19:09:17 +08:00
PSeitz
7905965800 Merge pull request #1594 from quickwit-oss/flat_map_with_buffer
Removing alloc on all .next() in MultiValueColumn
2022-10-05 18:34:15 +08:00
Pascal Seitz
f60a551890 add flat_map_with_buffer to Iterator trait 2022-10-05 17:44:26 +08:00
Paul Masurel
7baa6e3ec5 Removing alloc on all .next() in MultiValueColumn 2022-10-05 17:12:06 +09:00
PSeitz
2100ec5d26 Merge pull request #1593 from waywardmonkeys/doc-improvements
Documentation improvements.
2022-10-05 15:50:08 +08:00
Bruce Mitchener
b3bf9a5716 Documentation improvements. 2022-10-05 14:18:10 +07:00
Paul Masurel
0dc8c458e0 Flaky unit test. (#1592) 2022-10-05 16:15:48 +09:00
Nigel Andrews
e5043d78d2 added a couple of tests + make fmt 2022-10-04 12:52:44 +02:00
Nigel Andrews
6d0bb82bd2 Fix issue 1576: serialize bytes as base64 strings 2022-10-04 12:18:13 +02:00
trinity-1686a
5945dbf0bd change format for store to make it faster with small documents (#1569)
* use new format for docstore blocks

* move index to end of block

it makes writing the block faster due to one less memcopy
2022-10-04 09:58:55 +02:00
PSeitz
4cf911d56a Merge pull request #1587 from quickwit-oss/no_get_val_in_serialize
remove get_val in serialization
2022-10-04 12:56:48 +08:00
Pascal Seitz
0f5cff762f move enumerate and remove computation 2022-10-04 12:30:19 +08:00
Pascal Seitz
6d9a123cf2 remove get_val in serialization
remove get_val in serialization and mark as unimplemented!()
replace get_val with iter in linear codec
remove MultivalueStartIndexRandomSeeker
replace MultivalueStartIndexIter with closure
Sample 100 values in linear codec
2022-10-04 12:01:25 +08:00
PSeitz
0f4a47816a Merge pull request #1582 from quickwit-oss/faster_sorted_field_values
use groupby instead of vec allocation
2022-10-04 09:36:24 +08:00
Pascal Seitz
b062ab2196 use groupby instead of vec allocation 2022-10-04 09:26:26 +08:00
Bruce Mitchener
a9d2f3db23 Tantivy requires Rust 1.62 or later. (#1583)
Tantivy needs the `total_cmp` feature to compile, which was stabilized
in Rust 1.62.
2022-10-03 18:31:07 +09:00
Bruce Mitchener
44e03791f9 Fix warnings when doc'ing private items. (#1579)
This also fixes a couple of typos, but plenty remain!
2022-10-03 14:24:00 +09:00
Bruce Mitchener
2d23763e9f Use u64::from boolean more. (#1580)
This case is inverted from the previous cases fixed.

This is from nightly clippy.
2022-10-03 14:17:50 +09:00
Bruce Mitchener
a24ae8d924 clippy: Fix needless-borrow warnings. (#1581)
These show on nightly clippy.
2022-10-03 14:15:09 +09:00
PSeitz
927dff5262 Merge pull request #1578 from quickwit-oss/dead_code
remove dead indexing code
2022-10-03 11:25:10 +08:00
Pascal Seitz
a695edcc95 remove dead indexing code 2022-10-03 09:44:02 +08:00
Paul Masurel
b4b4f3fa73 Removing default features for zstd (#1574) 2022-09-30 13:02:46 +09:00
PSeitz
b50e4b7c20 Merge pull request #1566 from quickwit-oss/fix_docstore_sorting
fix docstore settings for temp docstore
2022-09-30 10:10:36 +08:00
PSeitz
f8686ab1ec improve comments
Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-09-30 10:06:34 +08:00
PSeitz
2fe42719d8 Merge pull request #1570 from quickwit-oss/no_sort_on_multi
validate index settings on create
2022-09-30 09:17:03 +08:00
PSeitz
fadd784a25 log improvements (#1564) 2022-09-30 09:39:26 +09:00
Pascal Seitz
0e94213af0 validate index settings on create 2022-09-29 18:58:09 +08:00
PSeitz
0da2a2e70d Merge pull request #1567 from quickwit-oss/dependabot/cargo/tantivy-fst-0.4.0
Update tantivy-fst requirement from 0.3.0 to 0.4.0
2022-09-29 10:00:16 +08:00
dependabot[bot]
0bcdf3cbbf Update tantivy-fst requirement from 0.3.0 to 0.4.0
Updates the requirements on [tantivy-fst](https://github.com/tantivy-search/fst) to permit the latest version.
- [Release notes](https://github.com/tantivy-search/fst/releases)
- [Commits](https://github.com/tantivy-search/fst/commits)

---
updated-dependencies:
- dependency-name: tantivy-fst
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2022-09-28 20:50:43 +00:00
Pascal Seitz
8f647b817f fix docstore settings for temp docstore
fixes #1565
2022-09-28 17:53:59 +08:00
trinity-1686a
a86b0df6f4 Add query matching terms in a set (#1539) 2022-09-28 09:43:18 +02:00
Bruce Mitchener
f842da758c Move ArcBytes,WeakArcBytes to mmap_directory. (#1555)
When building without default features (so without mmap, etc),
there are some warnings about unused things. This fixes the
ones related to `ArcBytes` and `WeakArcBytes`, which are only
used with the `mmap_directory` code.
2022-09-27 09:57:28 +09:00
Bruce Mitchener
97ccd6d712 Avoid slicing a string in DocParsingError. (#1559)
Fixes #1339.
2022-09-26 20:27:15 +09:00
Bruce Mitchener
cb252a42af docs: "associated to" -> "associated with" (#1557)
This reads better this way.
2022-09-26 20:23:37 +09:00
Bruce Mitchener
d9609dd6b6 POLLING_INTERVAL needn't be pub. (#1556)
This is only used within the file watcher and is const, so it
can't be configured.
2022-09-26 20:22:55 +09:00
Bruce Mitchener
f03667d967 Remove references to /cpp directory. (#1560)
This was removed in 2018, so these should be fine to remove now.
2022-09-26 20:22:28 +09:00
PSeitz
10f10a322f Merge pull request #1554 from quickwit-oss/prepare_ip_field
prepare for ip field
2022-09-26 16:34:24 +08:00
Pascal Seitz
f757471077 prepare for ip field 2022-09-26 16:27:35 +08:00
PSeitz
21e0adefda use binary search instead of linear for get_val in merge (#1548)
* use binary search instead of linear for get_val in merge

* use partition_point
2022-09-26 09:42:33 +09:00
Bruce Mitchener
ea8e6d7b1d Tidy up clippy config. (#1547)
* Checking cfg_attr is no longer necessary.
* Don't need multiple `clippy::` prefixes on a name.
2022-09-26 09:37:55 +09:00
PSeitz
dac7da780e Merge pull request #1545 from waywardmonkeys/remove-some-refs
clippy: Remove borrows that the compiler will do.
2022-09-23 15:33:23 +08:00
PSeitz
20c87903b2 fix multivalue ff index creation regression (#1543)
fixes multivalue ff regression by avoiding using `get_val`. Line::train calls repeatedly get_val, but get_val implementation on Column for multivalues is very slow. The fix is to use the iterator instead. Longterm fix should be to remove get_val access in serialization.

Old Code

test fastfield::bench::bench_multi_value_ff_merge_few_segments                                                           ... bench:  46,103,960 ns/iter (+/- 2,066,083)
test fastfield::bench::bench_multi_value_ff_merge_many_segments                                                          ... bench:  83,073,036 ns/iter (+/- 4,373,615)
est fastfield::bench::bench_multi_value_ff_merge_many_segments_log_merge                                                ... bench:  64,178,576 ns/iter (+/- 1,466,700)

Current

running 3 tests
test fastfield::multivalued::bench::bench_multi_value_ff_merge_few_segments                                              ... bench:  57,379,523 ns/iter (+/- 3,220,787)
test fastfield::multivalued::bench::bench_multi_value_ff_merge_many_segments                                             ... bench:  90,831,688 ns/iter (+/- 1,445,486)
test fastfield::multivalued::bench::bench_multi_value_ff_merge_many_segments_log_merge                                   ... bench: 158,313,264 ns/iter (+/- 28,823,250)

With Fix

running 3 tests
test fastfield::multivalued::bench::bench_multi_value_ff_merge_few_segments                                              ... bench:  57,635,671 ns/iter (+/- 2,707,361)
test fastfield::multivalued::bench::bench_multi_value_ff_merge_many_segments                                             ... bench:  91,468,712 ns/iter (+/- 11,393,581)
test fastfield::multivalued::bench::bench_multi_value_ff_merge_many_segments_log_merge                                   ... bench:  73,909,138 ns/iter (+/- 15,846,097)
2022-09-23 15:36:29 +09:00
PSeitz
f9c3947803 Merge pull request #1546 from waywardmonkeys/use-ux-from-bool
Use u8::from(bool), u64::from(bool).
2022-09-23 09:06:24 +08:00
Bruce Mitchener
e9a384bb15 Use u8::from(bool), u64::from(bool). 2022-09-22 22:44:53 +07:00
Bruce Mitchener
d231671fe2 clippy: Remove borrows that the compiler will do.
This started showing up with clippy in rust 1.64.
2022-09-22 22:38:23 +07:00
trinity-1686a
fa3d786a2f Add support for deleting all documents matching query (#1535)
* add support for deleting all documents matching query

#1494
2022-09-22 21:26:09 +09:00
Paul Masurel
75aafeeb9b Added a function to deep clone RamDirectory. (#1544) 2022-09-22 12:04:02 +02:00
PSeitz
6f066c7f65 Merge pull request #1541 from quickwit-oss/add_bench
add benchmarks for multivalued fastfield merge
2022-09-22 15:28:00 +08:00
Pascal Seitz
22e56aaee3 add benchmarks for multivalued fastfield merge 2022-09-22 11:25:41 +08:00
Paul Masurel
d641979127 Minor refactor of fast fields (#1538) 2022-09-21 12:55:03 +09:00
231 changed files with 13545 additions and 2501 deletions

1
.gitattributes vendored
View File

@@ -1 +0,0 @@
cpp/* linguist-vendored

View File

@@ -48,7 +48,7 @@ jobs:
strategy:
matrix:
features: [
{ label: "all", flags: "mmap,brotli-compression,lz4-compression,snappy-compression,zstd-compression,failpoints" },
{ label: "all", flags: "mmap,stopwords,brotli-compression,lz4-compression,snappy-compression,zstd-compression,failpoints" },
{ label: "quickwit", flags: "mmap,quickwit,failpoints" }
]

1
.gitignore vendored
View File

@@ -9,7 +9,6 @@ target/release
Cargo.lock
benchmark
.DS_Store
cpp/simdcomp/bitpackingbenchmark
*.bk
.idea
trace.dat

View File

@@ -1,10 +1,37 @@
Tantivy 0.19
================================
#### Bugfixes
- Fix missing fieldnorms for u64, i64, f64, bool, bytes and date [#1620](https://github.com/quickwit-oss/tantivy/pull/1620) (@PSeitz)
- Fix interpolation overflow in linear interpolation fastfield codec [#1480](https://github.com/quickwit-oss/tantivy/pull/1480) (@PSeitz @fulmicoton)
- Updated [Date Field Type](https://github.com/quickwit-oss/tantivy/pull/1396)
The `DateTime` type has been updated to hold timestamps with microseconds precision.
`DateOptions` and `DatePrecision` have been added to configure Date fields. The precision is used to hint on fast values compression. Otherwise, seconds precision is used everywhere else (i.e terms, indexing).
- Remove Searcher pool and make `Searcher` cloneable.
#### Features/Improvements
- Add support for `IN` in queryparser , e.g. `field: IN [val1 val2 val3]` [#1683](https://github.com/quickwit-oss/tantivy/pull/1683) (@trinity-1686a)
- Skip score calculation, when no scoring is required [#1646](https://github.com/quickwit-oss/tantivy/pull/1646) (@PSeitz)
- Limit fast fields to u32 (`get_val(u32)`) [#1644](https://github.com/quickwit-oss/tantivy/pull/1644) (@PSeitz)
- The `DateTime` type has been updated to hold timestamps with microseconds precision.
`DateOptions` and `DatePrecision` have been added to configure Date fields. The precision is used to hint on fast values compression. Otherwise, seconds precision is used everywhere else (i.e terms, indexing) [#1396](https://github.com/quickwit-oss/tantivy/pull/1396) (@evanxg852000)
- Add IP address field type [#1553](https://github.com/quickwit-oss/tantivy/pull/1553) (@PSeitz)
- Add boolean field type [#1382](https://github.com/quickwit-oss/tantivy/pull/1382) (@boraarslan)
- Remove Searcher pool and make `Searcher` cloneable. (@PSeitz)
- Validate settings on create [#1570](https://github.com/quickwit-oss/tantivy/pull/1570) (@PSeitz)
- Detect and apply gcd on fastfield codecs [#1418](https://github.com/quickwit-oss/tantivy/pull/1418) (@PSeitz)
- Doc store
- use separate thread to compress block store [#1389](https://github.com/quickwit-oss/tantivy/pull/1389) [#1510](https://github.com/quickwit-oss/tantivy/pull/1510) (@PSeitz @fulmicoton)
- Expose doc store cache size [#1403](https://github.com/quickwit-oss/tantivy/pull/1403) (@PSeitz)
- Enable compression levels for doc store [#1378](https://github.com/quickwit-oss/tantivy/pull/1378) (@PSeitz)
- Make block size configurable [#1374](https://github.com/quickwit-oss/tantivy/pull/1374) (@kryesh)
- Make `tantivy::TantivyError` cloneable [#1402](https://github.com/quickwit-oss/tantivy/pull/1402) (@PSeitz)
- Add support for phrase slop in query language [#1393](https://github.com/quickwit-oss/tantivy/pull/1393) (@saroh)
- Aggregation
- Add aggregation support for date type [#1693](https://github.com/quickwit-oss/tantivy/pull/1693)(@PSeitz)
- Add support for keyed parameter in range and histgram aggregations [#1424](https://github.com/quickwit-oss/tantivy/pull/1424) (@k-yomo)
- Add aggregation bucket limit [#1363](https://github.com/quickwit-oss/tantivy/pull/1363) (@PSeitz)
- Faster indexing
- [#1610](https://github.com/quickwit-oss/tantivy/pull/1610) (@PSeitz)
- [#1594](https://github.com/quickwit-oss/tantivy/pull/1594) (@PSeitz)
- [#1582](https://github.com/quickwit-oss/tantivy/pull/1582) (@PSeitz)
- [#1611](https://github.com/quickwit-oss/tantivy/pull/1611) (@PSeitz)
- Added a pre-configured stop word filter for various language [#1666](https://github.com/quickwit-oss/tantivy/pull/1666) (@adamreichold)
Tantivy 0.18
================================
@@ -22,6 +49,10 @@ Tantivy 0.18
- Add terms aggregation (@PSeitz)
- Add support for zstd compression (@kryesh)
Tantivy 0.18.1
================================
- Hotfix: positions computation. #1629 (@fmassot, @fulmicoton, @PSeitz)
Tantivy 0.17
================================

View File

@@ -1,6 +1,6 @@
[package]
name = "tantivy"
version = "0.18.0"
version = "0.19.0"
authors = ["Paul Masurel <paul.masurel@gmail.com>"]
license = "MIT"
categories = ["database-implementations", "data-structures"]
@@ -11,19 +11,21 @@ repository = "https://github.com/quickwit-oss/tantivy"
readme = "README.md"
keywords = ["search", "information", "retrieval"]
edition = "2021"
rust-version = "1.62"
[dependencies]
oneshot = "0.1.3"
base64 = "0.13.0"
oneshot = "0.1.5"
base64 = "0.20.0"
byteorder = "1.4.3"
crc32fast = "1.3.2"
once_cell = "1.10.0"
regex = { version = "1.5.5", default-features = false, features = ["std", "unicode"] }
tantivy-fst = "0.3.0"
aho-corasick = "0.7"
tantivy-fst = "0.4.0"
memmap2 = { version = "0.5.3", optional = true }
lz4_flex = { version = "0.9.2", default-features = false, features = ["checked-decode"], optional = true }
brotli = { version = "3.3.4", optional = true }
zstd = { version = "0.11", optional = true }
zstd = { version = "0.12", optional = true, default-features = false }
snap = { version = "1.0.5", optional = true }
tempfile = { version = "3.3.0", optional = true }
log = "0.4.16"
@@ -34,17 +36,12 @@ fs2 = { version = "0.4.3", optional = true }
levenshtein_automata = "0.2.1"
uuid = { version = "1.0.0", features = ["v4", "serde"] }
crossbeam-channel = "0.5.4"
tantivy-query-grammar = { version="0.18.0", path="./query-grammar" }
tantivy-bitpacker = { version="0.2", path="./bitpacker" }
common = { version = "0.3", path = "./common/", package = "tantivy-common" }
fastfield_codecs = { version="0.2", path="./fastfield_codecs", default-features = false }
ownedbytes = { version="0.3", path="./ownedbytes" }
stable_deref_trait = "1.2.0"
rust-stemmers = "1.2.0"
downcast-rs = "1.2.0"
bitpacking = { version = "0.8.4", default-features = false, features = ["bitpacker4x"] }
census = "0.4.0"
fnv = "1.0.7"
rustc-hash = "1.1.0"
thiserror = "1.0.30"
htmlescape = "0.3.1"
fail = "0.5.0"
@@ -56,10 +53,17 @@ lru = "0.7.5"
fastdivide = "0.4.0"
itertools = "0.10.3"
measure_time = "0.8.2"
serde_cbor = { version = "0.11.2", optional = true }
async-trait = "0.1.53"
arc-swap = "1.5.0"
sstable = { version="0.1", path="./sstable", package ="tantivy-sstable", optional = true }
stacker = { version="0.1", path="./stacker", package ="tantivy-stacker" }
tantivy-query-grammar = { version= "0.19.0", path="./query-grammar" }
tantivy-bitpacker = { version= "0.3", path="./bitpacker" }
common = { version= "0.5", path = "./common/", package = "tantivy-common" }
fastfield_codecs = { version= "0.3", path="./fastfield_codecs", default-features = false }
ownedbytes = { version= "0.5", path="./ownedbytes" }
[target.'cfg(windows)'.dependencies]
winapi = "0.3.9"
@@ -69,10 +73,10 @@ maplit = "1.0.2"
matches = "0.1.9"
pretty_assertions = "1.2.1"
proptest = "1.0.0"
criterion = "0.3.5"
criterion = "0.4"
test-log = "0.2.10"
env_logger = "0.9.0"
pprof = { version = "0.10.0", features = ["flamegraph", "criterion"] }
env_logger = "0.10.0"
pprof = { version = "0.11.0", features = ["flamegraph", "criterion"] }
futures = "0.3.21"
[dev-dependencies.fail]
@@ -89,8 +93,9 @@ debug-assertions = true
overflow-checks = true
[features]
default = ["mmap", "lz4-compression" ]
default = ["mmap", "stopwords", "lz4-compression"]
mmap = ["fs2", "tempfile", "memmap2"]
stopwords = []
brotli-compression = ["brotli"]
lz4-compression = ["lz4_flex"]
@@ -100,10 +105,10 @@ zstd-compression = ["zstd"]
failpoints = ["fail/failpoints"]
unstable = [] # useful for benches.
quickwit = ["serde_cbor"]
quickwit = ["sstable"]
[workspace]
members = ["query-grammar", "bitpacker", "common", "fastfield_codecs", "ownedbytes"]
members = ["query-grammar", "bitpacker", "common", "fastfield_codecs", "ownedbytes", "stacker", "sstable", "columnar"]
# Following the "fail" crate best practises, we isolate
# tests that define specific behavior in fail check points

View File

@@ -58,7 +58,7 @@ Distributed search is out of the scope of Tantivy, but if you are looking for th
# Getting started
Tantivy works on stable Rust (>= 1.27) and supports Linux, macOS, and Windows.
Tantivy works on stable Rust and supports Linux, macOS, and Windows.
- [Tantivy's simple search example](https://tantivy-search.github.io/examples/basic_search.html)
- [tantivy-cli and its tutorial](https://github.com/quickwit-oss/tantivy-cli) - `tantivy-cli` is an actual command-line interface that makes it easy for you to create a search engine,
@@ -81,9 +81,13 @@ There are many ways to support this project.
We use the GitHub Pull Request workflow: reference a GitHub ticket and/or include a comprehensive commit message when opening a PR.
## Minimum supported Rust version
Tantivy currently requires at least Rust 1.62 or later to compile.
## Clone and build locally
Tantivy compiles on stable Rust but requires `Rust >= 1.27`.
Tantivy compiles on stable Rust.
To check out and run tests, you can simply run:
```bash

View File

@@ -1,6 +1,6 @@
[package]
name = "tantivy-bitpacker"
version = "0.2.0"
version = "0.3.0"
edition = "2021"
authors = ["Paul Masurel <paul.masurel@gmail.com>"]
license = "MIT"
@@ -8,6 +8,8 @@ categories = []
description = """Tantivy-sub crate: bitpacking"""
repository = "https://github.com/quickwit-oss/tantivy"
keywords = []
documentation = "https://docs.rs/tantivy-bitpacker/latest/tantivy_bitpacker"
homepage = "https://github.com/quickwit-oss/tantivy"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

View File

@@ -25,15 +25,14 @@ impl BitPacker {
num_bits: u8,
output: &mut TWrite,
) -> io::Result<()> {
let val_u64 = val as u64;
let num_bits = num_bits as usize;
if self.mini_buffer_written + num_bits > 64 {
self.mini_buffer |= val_u64.wrapping_shl(self.mini_buffer_written as u32);
self.mini_buffer |= val.wrapping_shl(self.mini_buffer_written as u32);
output.write_all(self.mini_buffer.to_le_bytes().as_ref())?;
self.mini_buffer = val_u64.wrapping_shr((64 - self.mini_buffer_written) as u32);
self.mini_buffer = val.wrapping_shr((64 - self.mini_buffer_written) as u32);
self.mini_buffer_written = self.mini_buffer_written + num_bits - 64;
} else {
self.mini_buffer |= val_u64 << self.mini_buffer_written;
self.mini_buffer |= val << self.mini_buffer_written;
self.mini_buffer_written += num_bits;
if self.mini_buffer_written == 64 {
output.write_all(self.mini_buffer.to_le_bytes().as_ref())?;
@@ -87,22 +86,22 @@ impl BitUnpacker {
}
#[inline]
pub fn get(&self, idx: u64, data: &[u8]) -> u64 {
pub fn get(&self, idx: u32, data: &[u8]) -> u64 {
if self.num_bits == 0 {
return 0u64;
}
let addr_in_bits = idx * self.num_bits;
let addr_in_bits = idx * self.num_bits as u32;
let addr = addr_in_bits >> 3;
let bit_shift = addr_in_bits & 7;
debug_assert!(
addr + 8 <= data.len() as u64,
addr + 8 <= data.len() as u32,
"The fast field field should have been padded with 7 bytes."
);
let bytes: [u8; 8] = (&data[(addr as usize)..(addr as usize) + 8])
.try_into()
.unwrap();
let val_unshifted_unmasked: u64 = u64::from_le_bytes(bytes);
let val_shifted = (val_unshifted_unmasked >> bit_shift) as u64;
let val_shifted: u64 = val_unshifted_unmasked >> bit_shift;
val_shifted & self.mask
}
}
@@ -130,7 +129,7 @@ mod test {
fn test_bitpacker_util(len: usize, num_bits: u8) {
let (bitunpacker, vals, data) = create_fastfield_bitpacker(len, num_bits);
for (i, val) in vals.iter().enumerate() {
assert_eq!(bitunpacker.get(i as u64, &data), *val);
assert_eq!(bitunpacker.get(i as u32, &data), *val);
}
}

View File

@@ -84,7 +84,7 @@ impl BlockedBitpacker {
#[inline]
pub fn add(&mut self, val: u64) {
self.buffer.push(val);
if self.buffer.len() == BLOCK_SIZE as usize {
if self.buffer.len() == BLOCK_SIZE {
self.flush();
}
}
@@ -126,11 +126,11 @@ impl BlockedBitpacker {
}
#[inline]
pub fn get(&self, idx: usize) -> u64 {
let metadata_pos = idx / BLOCK_SIZE as usize;
let pos_in_block = idx % BLOCK_SIZE as usize;
let metadata_pos = idx / BLOCK_SIZE;
let pos_in_block = idx % BLOCK_SIZE;
if let Some(metadata) = self.offset_and_bits.get(metadata_pos) {
let unpacked = BitUnpacker::new(metadata.num_bits()).get(
pos_in_block as u64,
pos_in_block as u32,
&self.compressed_blocks[metadata.offset() as usize..],
);
unpacked + metadata.base_value()

26
columnar/Cargo.toml Normal file
View File

@@ -0,0 +1,26 @@
[package]
name = "tantivy-columnar"
version = "0.1.0"
edition = "2021"
[dependencies]
stacker = { path = "../stacker", package="tantivy-stacker"}
serde_json = "1"
thiserror = "1"
fnv = "1"
tantivy-fst = "0.4.0"
sstable = { path = "../sstable", package = "tantivy-sstable" }
common = { path = "../common", package = "tantivy-common" }
fastfield_codecs = { path = "../fastfield_codecs"}
ordered-float = "3.4"
itertools = "0.10"
[features]
# default = ["quickwit"]
# quickwit = ["common/quickwit"]
[dev-dependencies]
proptest = "1"

33
columnar/README.md Normal file
View File

@@ -0,0 +1,33 @@
# Columnar format
This crate describes columnar format used in tantivy.
## Goals
This format is special in the following way.
- it needs to be compact
- it does not required to be loaded in memory.
- it is designed to fit well with quickwit's strange constraint:
we need to be able to load columns rapidly.
- columns of several types can be associated with the same column name.
- it needs to support columns with different types `(str, u64, i64, f64)`
and different cardinality `(required, optional, multivalued)`.
- columns, once loaded, offer cheap random access.
# Format
A quickwit/tantivy style sstable associated
`(column names, column_cardinality, column_type) to range of bytes.
The format of the key is:
`[column_name][ZERO_BYTE][column_type_header: u8]`
Column name may not contain the zero byte.
Listing all columns associated to `column_name` can therefore
be done by listing all keys prefixed by
`[column_name][ZERO_BYTE]`
The associated range of bytes refer to a range of bytes

View File

@@ -0,0 +1,154 @@
use crate::value::NumericalType;
#[derive(Clone, Copy, Hash, Default, Debug, PartialEq, Eq, PartialOrd, Ord)]
#[repr(u8)]
pub enum Cardinality {
#[default]
Required = 0,
Optional = 1,
Multivalued = 2,
}
impl Cardinality {
pub fn to_code(self) -> u8 {
self as u8
}
pub fn try_from_code(code: u8) -> Option<Cardinality> {
match code {
0 => Some(Cardinality::Required),
1 => Some(Cardinality::Optional),
2 => Some(Cardinality::Multivalued),
_ => None,
}
}
}
#[derive(Hash, Eq, PartialEq, Debug, Clone, Copy)]
pub enum ColumnType {
Bytes,
Numerical(NumericalType),
}
impl ColumnType {
pub fn to_code(self) -> u8 {
match self {
ColumnType::Bytes => 0u8,
ColumnType::Numerical(numerical_type) => 1u8 | (numerical_type.to_code() << 1),
}
}
pub fn try_from_code(code: u8) -> Option<ColumnType> {
if code == 0u8 {
return Some(ColumnType::Bytes);
}
if code & 1u8 == 0u8 {
return None;
}
let numerical_type = NumericalType::try_from_code(code >> 1)?;
Some(ColumnType::Numerical(numerical_type))
}
}
/// Represents the type and cardinality of a column.
/// This is encoded over one-byte and added to a column key in the
/// columnar sstable.
///
/// Cardinality is encoded as the first two highest two bits.
/// The low 6 bits encode the column type.
#[derive(Eq, Hash, PartialEq, Debug, Copy, Clone)]
pub struct ColumnTypeAndCardinality {
pub cardinality: Cardinality,
pub typ: ColumnType,
}
#[inline]
const fn compute_mask(num_bits: u8) -> u8 {
if num_bits == 8 {
u8::MAX
} else {
(1u8 << num_bits) - 1
}
}
#[inline]
fn select_bits<const START: u8, const END: u8>(code: u8) -> u8 {
assert!(START <= END);
assert!(END <= 8);
let num_bits: u8 = END - START;
let mask: u8 = compute_mask(num_bits);
(code >> START) & mask
}
#[inline]
fn place_bits<const START: u8, const END: u8>(code: u8) -> u8 {
assert!(START <= END);
assert!(END <= 8);
let num_bits: u8 = END - START;
let mask: u8 = compute_mask(num_bits);
assert!(code <= mask);
code << START
}
impl ColumnTypeAndCardinality {
pub fn to_code(self) -> u8 {
place_bits::<6, 8>(self.cardinality.to_code()) | place_bits::<0, 6>(self.typ.to_code())
}
pub fn try_from_code(code: u8) -> Option<ColumnTypeAndCardinality> {
let typ_code = select_bits::<0, 6>(code);
let cardinality_code = select_bits::<6, 8>(code);
let cardinality = Cardinality::try_from_code(cardinality_code)?;
let typ = ColumnType::try_from_code(typ_code)?;
assert_eq!(typ.to_code(), typ_code);
Some(ColumnTypeAndCardinality { cardinality, typ })
}
}
#[cfg(test)]
mod tests {
use std::collections::HashSet;
use super::ColumnTypeAndCardinality;
use crate::column_type_header::{Cardinality, ColumnType};
#[test]
fn test_column_type_header_to_code() {
let mut column_type_header_set: HashSet<ColumnTypeAndCardinality> = HashSet::new();
for code in u8::MIN..=u8::MAX {
if let Some(column_type_header) = ColumnTypeAndCardinality::try_from_code(code) {
assert_eq!(column_type_header.to_code(), code);
assert!(column_type_header_set.insert(column_type_header));
}
}
assert_eq!(
column_type_header_set.len(),
3 /* cardinality */ * (1 + 3) // column_types
);
}
#[test]
fn test_column_type_to_code() {
let mut column_type_set: HashSet<ColumnType> = HashSet::new();
for code in u8::MIN..=u8::MAX {
if let Some(column_type) = ColumnType::try_from_code(code) {
assert_eq!(column_type.to_code(), code);
assert!(column_type_set.insert(column_type));
}
}
assert_eq!(column_type_set.len(), 1 + 3);
}
#[test]
fn test_cardinality_to_code() {
let mut num_cardinality = 0;
for code in u8::MIN..=u8::MAX {
let cardinality_opt = Cardinality::try_from_code(code);
if let Some(cardinality) = cardinality_opt {
assert_eq!(cardinality.to_code(), code);
num_cardinality += 1;
}
}
assert_eq!(num_cardinality, 3);
}
}

View File

@@ -0,0 +1,78 @@
use std::io;
use fnv::FnvHashMap;
fn fst_err_into_io_err(fst_err: tantivy_fst::Error) -> io::Error {
match fst_err {
tantivy_fst::Error::Fst(fst_err) => {
io::Error::new(io::ErrorKind::Other, format!("FST Error: {:?}", fst_err))
}
tantivy_fst::Error::Io(io_err) => io_err,
}
}
/// `DictionaryBuilder` for dictionary encoding.
///
/// It stores the different terms encounterred and assigns them a temporary value
/// we call unordered id.
///
/// Upon serialization, we will sort the ids and hence build a `UnorderedId -> Term ordinal`
/// mapping.
#[derive(Default)]
pub struct DictionaryBuilder {
dict: FnvHashMap<Vec<u8>, UnorderedId>,
}
pub struct IdMapping {
unordered_to_ord: Vec<OrderedId>,
}
impl IdMapping {
pub fn to_ord(&self, unordered: UnorderedId) -> OrderedId {
self.unordered_to_ord[unordered.0 as usize]
}
}
impl DictionaryBuilder {
/// Get or allocate an unordered id.
/// (This ID is simply an auto-incremented id.)
pub fn get_or_allocate_id(&mut self, term: &[u8]) -> UnorderedId {
if let Some(term_id) = self.dict.get(term) {
return *term_id;
}
let new_id = UnorderedId(self.dict.len() as u32);
self.dict.insert(term.to_vec(), new_id);
new_id
}
/// Serialize the dictionary into an fst, and returns the
/// `UnorderedId -> TermOrdinal` map.
pub fn serialize<'a, W: io::Write + 'a>(&self, wrt: &mut W) -> io::Result<IdMapping> {
serialize_inner(&self.dict, wrt).map_err(fst_err_into_io_err)
}
}
/// Helper function just there for error conversion.
fn serialize_inner<'a, W: io::Write + 'a>(
dict: &FnvHashMap<Vec<u8>, UnorderedId>,
wrt: &mut W,
) -> tantivy_fst::Result<IdMapping> {
let mut terms: Vec<(&[u8], UnorderedId)> =
dict.iter().map(|(k, v)| (k.as_slice(), *v)).collect();
terms.sort_unstable_by_key(|(key, _)| *key);
let mut unordered_to_ord: Vec<OrderedId> = vec![OrderedId(0u32); terms.len()];
let mut fst_builder = tantivy_fst::MapBuilder::new(wrt)?;
for (ord, (key, unordered_id)) in terms.into_iter().enumerate() {
let ordered_id = OrderedId(ord as u32);
fst_builder.insert(key, ord as u64)?;
unordered_to_ord[unordered_id.0 as usize] = ordered_id;
}
fst_builder.finish()?;
Ok(IdMapping { unordered_to_ord })
}
#[derive(Clone, Copy, Debug)]
pub struct UnorderedId(pub u32);
#[derive(Clone, Copy)]
pub struct OrderedId(pub u32);

69
columnar/src/lib.rs Normal file
View File

@@ -0,0 +1,69 @@
// Copyright (C) 2022 Quickwit, Inc.
//
// Quickwit is offered under the AGPL v3.0 and as commercial software.
// For commercial licensing, contact us at hello@quickwit.io.
//
// AGPL:
// This program is free software: you can redistribute it and/or modify
// it under the terms of the GNU Affero General Public License as
// published by the Free Software Foundation, either version 3 of the
// License, or (at your option) any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU Affero General Public License for more details.
//
// You should have received a copy of the GNU Affero General Public License
// along with this program. If not, see <http://www.gnu.org/licenses/>.
mod column_type_header;
mod dictionary;
mod reader;
mod serializer;
mod value;
mod writer;
pub use column_type_header::Cardinality;
pub use reader::ColumnarReader;
pub use serializer::ColumnarSerializer;
pub use writer::ColumnarWriter;
pub type DocId = u32;
#[cfg(test)]
mod tests {
use std::ops::Range;
use common::file_slice::FileSlice;
use crate::column_type_header::ColumnTypeAndCardinality;
use crate::reader::ColumnarReader;
use crate::serializer::ColumnarSerializer;
use crate::value::NumericalValue;
use crate::ColumnarWriter;
#[test]
fn test_dataframe_writer() {
let mut dataframe_writer = ColumnarWriter::default();
dataframe_writer.record_numerical(1u32, b"srical.value", NumericalValue::U64(1u64));
dataframe_writer.record_numerical(2u32, b"srical.value", NumericalValue::U64(2u64));
dataframe_writer.record_numerical(4u32, b"srical.value", NumericalValue::I64(2i64));
let mut buffer: Vec<u8> = Vec::new();
let serializer = ColumnarSerializer::new(&mut buffer);
dataframe_writer.serialize(5, serializer).unwrap();
let columnar_fileslice = FileSlice::from(buffer);
let columnar = ColumnarReader::open(columnar_fileslice).unwrap();
assert_eq!(columnar.num_columns(), 1);
let cols: Vec<(ColumnTypeAndCardinality, Range<u64>)> =
columnar.read_columns("srical.value").unwrap();
assert_eq!(cols.len(), 1);
// Right now this 31 bytes are spent as follows
//
// - header 14 bytes
// - vals 8 //< due to padding? could have been 1byte?.
// - null footer 6 bytes
// - version footer 3 bytes // Should be file-wide
assert_eq!(cols[0].1, 0..31);
}
}

View File

@@ -0,0 +1,66 @@
use std::ops::Range;
use std::{io, mem};
use common::file_slice::FileSlice;
use common::BinarySerializable;
use sstable::{Dictionary, SSTableRange};
use crate::column_type_header::ColumnTypeAndCardinality;
fn io_invalid_data(msg: String) -> io::Error {
io::Error::new(io::ErrorKind::InvalidData, msg) // format!("Invalid key found.
// {key_bytes:?}")));
}
pub struct ColumnarReader {
column_dictionary: Dictionary<SSTableRange>,
column_data: FileSlice,
}
impl ColumnarReader {
pub fn num_columns(&self) -> usize {
self.column_dictionary.num_terms()
}
pub fn open(file_slice: FileSlice) -> io::Result<ColumnarReader> {
let (file_slice_without_sstable_len, sstable_len_bytes) =
file_slice.split_from_end(mem::size_of::<u64>());
let mut sstable_len_bytes = sstable_len_bytes.read_bytes()?;
let sstable_len = u64::deserialize(&mut sstable_len_bytes)?;
let (column_data, sstable) =
file_slice_without_sstable_len.split_from_end(sstable_len as usize);
let column_dictionary = Dictionary::open(sstable)?;
Ok(ColumnarReader {
column_dictionary,
column_data,
})
}
pub fn read_columns(
&self,
field_name: &str,
) -> io::Result<Vec<(ColumnTypeAndCardinality, Range<u64>)>> {
let mut start_key = field_name.to_string();
start_key.push('\0');
let mut end_key = field_name.to_string();
end_key.push(1u8 as char);
let mut stream = self
.column_dictionary
.range()
.ge(start_key.as_bytes())
.lt(end_key.as_bytes())
.into_stream()?;
let mut results = Vec::new();
while stream.advance() {
let key_bytes: &[u8] = stream.key();
if !key_bytes.starts_with(start_key.as_bytes()) {
return Err(io_invalid_data(format!("Invalid key found. {key_bytes:?}")));
}
let column_code: u8 = key_bytes.last().cloned().unwrap();
let column_type_and_cardinality = ColumnTypeAndCardinality::try_from_code(column_code)
.ok_or_else(|| io_invalid_data(format!("Unknown column code `{column_code}`")))?;
let range = stream.value().clone();
results.push((column_type_and_cardinality, range));
}
Ok(results)
}
}

View File

@@ -0,0 +1,39 @@
use std::io;
use std::io::Write;
use std::ops::Range;
use common::CountingWriter;
use sstable::value::RangeWriter;
use sstable::SSTableRange;
pub struct ColumnarSerializer<W: io::Write> {
wrt: CountingWriter<W>,
sstable_range: sstable::Writer<Vec<u8>, RangeWriter>,
}
impl<W: io::Write> ColumnarSerializer<W> {
pub fn new(wrt: W) -> ColumnarSerializer<W> {
let sstable_range: sstable::Writer<Vec<u8>, RangeWriter> =
sstable::Dictionary::<SSTableRange>::builder(Vec::with_capacity(100_000)).unwrap();
ColumnarSerializer {
wrt: CountingWriter::wrap(wrt),
sstable_range,
}
}
pub fn record_column_offsets(&mut self, key: &[u8], byte_range: Range<u64>) -> io::Result<()> {
self.sstable_range.insert(key, &byte_range)
}
pub fn wrt(&mut self) -> &mut CountingWriter<W> {
&mut self.wrt
}
pub fn finalize(mut self) -> io::Result<()> {
let sstable_bytes: Vec<u8> = self.sstable_range.finish()?;
let sstable_num_bytes: u64 = sstable_bytes.len() as u64;
self.wrt.write_all(&sstable_bytes)?;
self.wrt.write_all(&sstable_num_bytes.to_le_bytes()[..])?;
Ok(())
}
}

123
columnar/src/value.rs Normal file
View File

@@ -0,0 +1,123 @@
use ordered_float::NotNan;
#[derive(Copy, Clone, Debug, PartialEq)]
pub enum NumericalValue {
I64(i64),
U64(u64),
F64(NotNan<f64>),
}
impl From<u64> for NumericalValue {
fn from(val: u64) -> NumericalValue {
NumericalValue::U64(val)
}
}
impl From<i64> for NumericalValue {
fn from(val: i64) -> Self {
NumericalValue::I64(val)
}
}
impl From<NotNan<f64>> for NumericalValue {
fn from(val: NotNan<f64>) -> Self {
NumericalValue::F64(val)
}
}
impl NumericalValue {
pub fn numerical_type(&self) -> NumericalType {
match self {
NumericalValue::F64(_) => NumericalType::F64,
NumericalValue::I64(_) => NumericalType::I64,
NumericalValue::U64(_) => NumericalType::U64,
}
}
}
impl Eq for NumericalValue {}
#[derive(Clone, Copy, Debug, Default, Hash, Eq, PartialEq)]
#[repr(u8)]
pub enum NumericalType {
#[default]
I64 = 0,
U64 = 1,
F64 = 2,
}
impl NumericalType {
pub fn to_code(self) -> u8 {
self as u8
}
pub fn try_from_code(code: u8) -> Option<NumericalType> {
match code {
0 => Some(NumericalType::I64),
1 => Some(NumericalType::U64),
2 => Some(NumericalType::F64),
_ => None,
}
}
}
/// We voluntarily avoid using `Into` here to keep this
/// implementation quirk as private as possible.
///
/// This coercion trait actually panics if it is used
/// to convert a loose types to a stricter type.
///
/// The level is strictness is somewhat arbitrary.
/// - i64
/// - u64
/// - f64.
pub(crate) trait Coerce {
fn coerce(numerical_value: NumericalValue) -> Self;
}
impl Coerce for i64 {
fn coerce(value: NumericalValue) -> Self {
match value {
NumericalValue::I64(val) => val,
NumericalValue::U64(val) => val as i64,
NumericalValue::F64(_) => unreachable!(),
}
}
}
impl Coerce for u64 {
fn coerce(value: NumericalValue) -> Self {
match value {
NumericalValue::I64(val) => val as u64,
NumericalValue::U64(val) => val,
NumericalValue::F64(_) => unreachable!(),
}
}
}
impl Coerce for NotNan<f64> {
fn coerce(value: NumericalValue) -> Self {
match value {
NumericalValue::I64(val) => unsafe { NotNan::new_unchecked(val as f64) },
NumericalValue::U64(val) => unsafe { NotNan::new_unchecked(val as f64) },
NumericalValue::F64(val) => val,
}
}
}
#[cfg(test)]
mod tests {
use super::NumericalType;
#[test]
fn test_numerical_type_code() {
let mut num_numerical_type = 0;
for code in u8::MIN..=u8::MAX {
if let Some(numerical_type) = NumericalType::try_from_code(code) {
assert_eq!(numerical_type.to_code(), code);
num_numerical_type += 1;
}
}
assert_eq!(num_numerical_type, 3);
}
}

View File

@@ -0,0 +1,321 @@
use std::fmt;
use std::num::NonZeroU8;
use ordered_float::NotNan;
use thiserror::Error;
use crate::dictionary::UnorderedId;
use crate::value::NumericalValue;
use crate::DocId;
/// When we build a columnar dataframe, we first just group
/// all mutations per column, and append them in append-only object.
///
/// We represents all of these operations as `ColumnOperation`.
#[derive(Eq, PartialEq, Debug, Clone, Copy)]
pub(crate) enum ColumnOperation<T> {
NewDoc(DocId),
Value(T),
}
impl<T> From<T> for ColumnOperation<T> {
fn from(value: T) -> Self {
ColumnOperation::Value(value)
}
}
#[allow(clippy::from_over_into)]
pub(crate) trait SymbolValue: Into<MiniBuffer> + Clone + Copy + fmt::Debug {
fn deserialize(header: NonZeroU8, bytes: &mut &[u8]) -> Result<Self, ParseError>;
}
pub(crate) struct MiniBuffer {
pub bytes: [u8; 9],
pub len: usize,
}
impl MiniBuffer {
pub fn as_slice(&self) -> &[u8] {
&self.bytes[..self.len]
}
}
fn compute_header_byte(typ: SymbolType, len: usize) -> u8 {
assert!(len <= 9);
(len << 4) as u8 | typ as u8
}
impl SymbolValue for NumericalValue {
fn deserialize(header_byte: NonZeroU8, bytes: &mut &[u8]) -> Result<Self, ParseError> {
let (typ, len) = parse_header_byte(header_byte)?;
let value_bytes: &[u8];
(value_bytes, *bytes) = bytes.split_at(len);
let symbol: NumericalValue = match typ {
SymbolType::U64 => {
let mut octet: [u8; 8] = [0u8; 8];
octet[..value_bytes.len()].copy_from_slice(value_bytes);
let val: u64 = u64::from_le_bytes(octet);
NumericalValue::U64(val)
}
SymbolType::I64 => {
let mut octet: [u8; 8] = [0u8; 8];
octet[..value_bytes.len()].copy_from_slice(value_bytes);
let encoded: u64 = u64::from_le_bytes(octet);
let val: i64 = decode_zig_zag(encoded);
NumericalValue::I64(val)
}
SymbolType::Float => {
let octet: [u8; 8] =
value_bytes.try_into().map_err(|_| ParseError::InvalidLen {
typ: SymbolType::Float,
len,
})?;
let val_possibly_nan = f64::from_le_bytes(octet);
let val_not_nan = NotNan::new(val_possibly_nan)
.map_err(|_| ParseError::NaN)?;
NumericalValue::F64(val_not_nan)
}
};
Ok(symbol)
}
}
#[allow(clippy::from_over_into)]
impl Into<MiniBuffer> for NumericalValue {
fn into(self) -> MiniBuffer {
let mut bytes = [0u8; 9];
match self {
NumericalValue::F64(val) => {
let len = 8;
let header_byte = compute_header_byte(SymbolType::Float, len);
bytes[0] = header_byte;
bytes[1..].copy_from_slice(&val.to_le_bytes());
MiniBuffer {
bytes,
len: len + 1,
}
}
NumericalValue::U64(val) => {
let len = compute_num_bytes_for_u64(val);
let header_byte = compute_header_byte(SymbolType::U64, len);
bytes[0] = header_byte;
bytes[1..].copy_from_slice(&val.to_le_bytes());
MiniBuffer {
bytes,
len: len + 1,
}
}
NumericalValue::I64(val) => {
let encoded = encode_zig_zag(val);
let len = compute_num_bytes_for_u64(encoded);
let header_byte = compute_header_byte(SymbolType::I64, len);
bytes[0] = header_byte;
bytes[1..].copy_from_slice(&encoded.to_le_bytes());
MiniBuffer {
bytes,
len: len + 1,
}
}
}
}
}
#[allow(clippy::from_over_into)]
impl Into<MiniBuffer> for UnorderedId {
fn into(self) -> MiniBuffer {
let mut bytes = [0u8; 9];
let val = self.0 as u64;
let len = compute_num_bytes_for_u64(val) + 1;
bytes[0] = len as u8;
bytes[1..].copy_from_slice(&val.to_le_bytes());
MiniBuffer { bytes, len }
}
}
impl SymbolValue for UnorderedId {
fn deserialize(header: NonZeroU8, bytes: &mut &[u8]) -> Result<UnorderedId, ParseError> {
let len = header.get() as usize;
let symbol_bytes: &[u8];
(symbol_bytes, *bytes) = bytes.split_at(len);
let mut value_bytes = [0u8; 4];
value_bytes[..len - 1].copy_from_slice(&symbol_bytes[1..]);
let value = u32::from_le_bytes(value_bytes);
Ok(UnorderedId(value))
}
}
const HEADER_MASK: u8 = (1u8 << 4) - 1u8;
fn compute_num_bytes_for_u64(val: u64) -> usize {
let msb = (64u32 - val.leading_zeros()) as usize;
(msb + 7) / 8
}
fn parse_header_byte(byte: NonZeroU8) -> Result<(SymbolType, usize), ParseError> {
let len = (byte.get() as usize) >> 4;
let typ_code = byte.get() & HEADER_MASK;
let typ = SymbolType::try_from(typ_code)?;
Ok((typ, len))
}
#[derive(Error, Debug)]
pub enum ParseError {
#[error("Type byte unknown `{0}`")]
UnknownType(u8),
#[error("Invalid len for type `{len}` for type `{typ:?}`.")]
InvalidLen { typ: SymbolType, len: usize },
#[error("Missing bytes.")]
MissingBytes,
#[error("Not a number value.")]
NaN,
}
impl<V: SymbolValue> ColumnOperation<V> {
pub fn serialize(self) -> MiniBuffer {
match self {
ColumnOperation::NewDoc(doc) => {
let mut minibuf: [u8; 9] = [0u8; 9];
minibuf[0] = 0u8;
minibuf[1..5].copy_from_slice(&doc.to_le_bytes());
MiniBuffer {
bytes: minibuf,
len: 5,
}
}
ColumnOperation::Value(val) => val.into(),
}
}
pub fn deserialize(bytes: &mut &[u8]) -> Result<Self, ParseError> {
if bytes.is_empty() {
return Err(ParseError::MissingBytes);
}
let header_byte = bytes[0];
*bytes = &bytes[1..];
if let Some(header_byte) = NonZeroU8::new(header_byte) {
let value = V::deserialize(header_byte, bytes)?;
Ok(ColumnOperation::Value(value))
} else {
let doc_bytes: &[u8];
(doc_bytes, *bytes) = bytes.split_at(4);
let doc: u32 =
u32::from_le_bytes(doc_bytes.try_into().map_err(|_| ParseError::MissingBytes)?);
Ok(ColumnOperation::NewDoc(doc))
}
}
}
#[derive(Copy, Clone, Debug, Eq, PartialEq)]
#[repr(u8)]
pub enum SymbolType {
U64 = 1u8,
I64 = 2u8,
Float = 3u8,
}
impl TryFrom<u8> for SymbolType {
type Error = ParseError;
fn try_from(byte: u8) -> Result<Self, ParseError> {
match byte {
1u8 => Ok(SymbolType::U64),
2u8 => Ok(SymbolType::I64),
3u8 => Ok(SymbolType::Float),
_ => Err(ParseError::UnknownType(byte)),
}
}
}
fn encode_zig_zag(n: i64) -> u64 {
((n << 1) ^ (n >> 63)) as u64
}
fn decode_zig_zag(n: u64) -> i64 {
((n >> 1) as i64) ^ (-((n & 1) as i64))
}
#[cfg(test)]
mod tests {
use super::{SymbolType, *};
#[track_caller]
fn test_zig_zag_aux(val: i64) {
let encoded = super::encode_zig_zag(val);
assert_eq!(decode_zig_zag(encoded), val);
if let Some(abs_val) = val.checked_abs() {
let abs_val = abs_val as u64;
assert!(encoded <= abs_val * 2);
}
}
#[test]
fn test_zig_zag() {
assert_eq!(encode_zig_zag(0i64), 0u64);
assert_eq!(encode_zig_zag(-1i64), 1u64);
assert_eq!(encode_zig_zag(1i64), 2u64);
test_zig_zag_aux(0i64);
test_zig_zag_aux(i64::MIN);
test_zig_zag_aux(i64::MAX);
}
use proptest::prelude::any;
use proptest::proptest;
proptest! {
#[test]
fn test_proptest_zig_zag(val in any::<i64>()) {
test_zig_zag_aux(val);
}
}
#[track_caller]
fn ser_deser_header_byte_aux(symbol_type: SymbolType, len: usize) {
let header_byte = compute_header_byte(symbol_type, len);
let (serdeser_numerical_type, serdeser_len) =
parse_header_byte(NonZeroU8::new(header_byte).unwrap()).unwrap();
assert_eq!(symbol_type, serdeser_numerical_type);
assert_eq!(len, serdeser_len);
}
#[test]
fn test_header_byte_serialization() {
for len in 1..9 {
ser_deser_header_byte_aux(SymbolType::Float, len);
ser_deser_header_byte_aux(SymbolType::I64, len);
ser_deser_header_byte_aux(SymbolType::U64, len);
}
}
#[track_caller]
fn ser_deser_symbol(symbol: ColumnOperation<NumericalValue>) {
let buf = symbol.serialize();
let mut bytes = &buf.bytes[..];
let serdeser_symbol = ColumnOperation::deserialize(&mut bytes).unwrap();
assert_eq!(bytes.len() + buf.len, buf.bytes.len());
assert_eq!(symbol, serdeser_symbol);
}
#[test]
fn test_compute_num_bytes_for_u64() {
assert_eq!(compute_num_bytes_for_u64(0), 0);
assert_eq!(compute_num_bytes_for_u64(1), 1);
assert_eq!(compute_num_bytes_for_u64(255), 1);
assert_eq!(compute_num_bytes_for_u64(256), 2);
assert_eq!(compute_num_bytes_for_u64((1 << 16) - 1), 2);
assert_eq!(compute_num_bytes_for_u64(1 << 16), 3);
}
#[test]
fn test_symbol_serialization() {
ser_deser_symbol(ColumnOperation::NewDoc(0));
ser_deser_symbol(ColumnOperation::NewDoc(3));
ser_deser_symbol(ColumnOperation::Value(NumericalValue::I64(0i64)));
ser_deser_symbol(ColumnOperation::Value(NumericalValue::I64(1i64)));
ser_deser_symbol(ColumnOperation::Value(NumericalValue::U64(257u64)));
ser_deser_symbol(ColumnOperation::Value(NumericalValue::I64(-257i64)));
ser_deser_symbol(ColumnOperation::Value(NumericalValue::I64(i64::MIN)));
ser_deser_symbol(ColumnOperation::Value(NumericalValue::U64(0u64)));
ser_deser_symbol(ColumnOperation::Value(NumericalValue::U64(u64::MIN)));
ser_deser_symbol(ColumnOperation::Value(NumericalValue::U64(u64::MAX)));
}
}

675
columnar/src/writer/mod.rs Normal file
View File

@@ -0,0 +1,675 @@
mod column_operation;
mod value_index;
use std::io::{self, Write};
use column_operation::ColumnOperation;
use common::CountingWriter;
use fastfield_codecs::serialize::ValueIndexInfo;
use fastfield_codecs::{Column, MonotonicallyMappableToU64, VecColumn};
use ordered_float::NotNan;
use stacker::{Addr, ArenaHashMap, ExpUnrolledLinkedList, MemoryArena};
use crate::column_type_header::{ColumnType, ColumnTypeAndCardinality};
use crate::dictionary::{DictionaryBuilder, IdMapping, UnorderedId};
use crate::value::{Coerce, NumericalType, NumericalValue};
use crate::writer::column_operation::SymbolValue;
use crate::writer::value_index::{IndexBuilder, SpareIndexBuilders};
use crate::{Cardinality, ColumnarSerializer, DocId};
#[derive(Copy, Clone, Default)]
struct ColumnWriter {
// Detected cardinality of the column so far.
cardinality: Cardinality,
// Last document inserted.
// None if no doc has been added yet.
last_doc_opt: Option<u32>,
// Buffer containing the serialized values.
values: ExpUnrolledLinkedList,
}
#[derive(Clone, Copy, Default)]
pub struct NumericalColumnWriter {
compatible_numerical_types: CompatibleNumericalTypes,
column_writer: ColumnWriter,
}
#[derive(Clone, Copy)]
struct CompatibleNumericalTypes {
all_values_within_i64_range: bool,
all_values_within_u64_range: bool,
}
impl Default for CompatibleNumericalTypes {
fn default() -> CompatibleNumericalTypes {
CompatibleNumericalTypes {
all_values_within_i64_range: true,
all_values_within_u64_range: true,
}
}
}
impl CompatibleNumericalTypes {
pub fn accept_value(&mut self, numerical_value: NumericalValue) {
match numerical_value {
NumericalValue::I64(val_i64) => {
let value_within_u64_range = val_i64 >= 0i64;
self.all_values_within_u64_range &= value_within_u64_range;
}
NumericalValue::U64(val_u64) => {
let value_within_i64_range = val_u64 < i64::MAX as u64;
self.all_values_within_i64_range &= value_within_i64_range;
}
NumericalValue::F64(_) => {
self.all_values_within_i64_range = false;
self.all_values_within_u64_range = false;
}
}
}
pub fn to_numerical_type(self) -> NumericalType {
if self.all_values_within_i64_range {
NumericalType::I64
} else if self.all_values_within_u64_range {
NumericalType::U64
} else {
NumericalType::F64
}
}
}
impl NumericalColumnWriter {
pub fn record_numerical_value(
&mut self,
doc: DocId,
value: NumericalValue,
arena: &mut MemoryArena,
) {
self.compatible_numerical_types.accept_value(value);
self.column_writer.record(doc, value, arena);
}
}
impl ColumnWriter {
fn symbol_iterator<'a, V: SymbolValue>(
&self,
arena: &MemoryArena,
buffer: &'a mut Vec<u8>,
) -> impl Iterator<Item = ColumnOperation<V>> + 'a {
buffer.clear();
self.values.read_to_end(arena, buffer);
let mut cursor: &[u8] = &buffer[..];
std::iter::from_fn(move || {
if cursor.is_empty() {
return None;
}
let symbol = ColumnOperation::deserialize(&mut cursor)
.expect("Failed to deserialize symbol from in-memory. This should never happen.");
Some(symbol)
})
}
fn delta_with_last_doc(&self, doc: DocId) -> u32 {
self.last_doc_opt
.map(|last_doc| doc - last_doc)
.unwrap_or(doc + 1u32)
}
/// Records a change of the document being recorded.
///
/// This function will also update the cardinality of the column
/// if necessary.
fn record(&mut self, doc: DocId, value: NumericalValue, arena: &mut MemoryArena) {
// Difference between `doc` and the last doc.
match self.delta_with_last_doc(doc) {
0 => {
// This is the last encounterred document.
self.cardinality = Cardinality::Multivalued;
}
1 => {
self.last_doc_opt = Some(doc);
self.write_symbol::<NumericalValue>(ColumnOperation::NewDoc(doc), arena);
}
_ => {
self.cardinality = self.cardinality.max(Cardinality::Optional);
self.last_doc_opt = Some(doc);
self.write_symbol::<NumericalValue>(ColumnOperation::NewDoc(doc), arena);
}
}
self.write_symbol(ColumnOperation::Value(value), arena);
}
// Get the cardinality.
// The overall number of docs in the column is necessary to
// deal with the case where the all docs contain 1 value, except some documents
// at the end of the column.
fn get_cardinality(&self, num_docs: DocId) -> Cardinality {
if self.delta_with_last_doc(num_docs) > 1 {
self.cardinality.max(Cardinality::Optional)
} else {
self.cardinality
}
}
fn write_symbol<V: SymbolValue>(
&mut self,
symbol: ColumnOperation<V>,
arena: &mut MemoryArena,
) {
self.values
.writer(arena)
.extend_from_slice(symbol.serialize().as_slice());
}
}
#[derive(Copy, Clone, Default)]
pub struct BytesColumnWriter {
dictionary_id: u32,
column_writer: ColumnWriter,
}
impl BytesColumnWriter {
pub fn with_dictionary_id(dictionary_id: u32) -> BytesColumnWriter {
BytesColumnWriter {
dictionary_id,
column_writer: Default::default(),
}
}
pub fn record_bytes(
&mut self,
doc: DocId,
bytes: &[u8],
dictionaries: &mut [DictionaryBuilder],
arena: &mut MemoryArena,
) {
let unordered_id = dictionaries[self.dictionary_id as usize].get_or_allocate_id(bytes);
let numerical_value = NumericalValue::U64(unordered_id.0 as u64);
self.column_writer.record(doc, numerical_value, arena);
}
}
pub struct ColumnarWriter {
numerical_field_hash_map: ArenaHashMap,
bytes_field_hash_map: ArenaHashMap,
arena: MemoryArena,
// Dictionaries used to store dictionary-encoded values.
dictionaries: Vec<DictionaryBuilder>,
buffers: SpareBuffers,
}
#[derive(Default)]
struct SpareBuffers {
byte_buffer: Vec<u8>,
value_index_builders: SpareIndexBuilders,
i64_values: Vec<i64>,
u64_values: Vec<u64>,
f64_values: Vec<ordered_float::NotNan<f64>>,
}
impl Default for ColumnarWriter {
fn default() -> Self {
ColumnarWriter {
numerical_field_hash_map: ArenaHashMap::new(10_000),
bytes_field_hash_map: ArenaHashMap::new(10_000),
dictionaries: Vec::new(),
arena: MemoryArena::default(),
buffers: SpareBuffers::default(),
}
}
}
#[derive(Copy, Clone, Ord, PartialOrd, Eq, PartialEq, Debug)]
enum BytesOrNumerical {
Bytes,
Numerical,
}
impl ColumnarWriter {
pub fn record_numerical(&mut self, doc: DocId, key: &[u8], numerical_value: NumericalValue) {
let (hash_map, arena) = (&mut self.numerical_field_hash_map, &mut self.arena);
hash_map.mutate_or_create(key, |column_opt: Option<NumericalColumnWriter>| {
let mut column: NumericalColumnWriter = column_opt.unwrap_or_default();
column.record_numerical_value(doc, numerical_value, arena);
column
});
}
pub fn record_bytes(&mut self, doc: DocId, key: &[u8], value: &[u8]) {
let (hash_map, arena, dictionaries) = (
&mut self.bytes_field_hash_map,
&mut self.arena,
&mut self.dictionaries,
);
hash_map.mutate_or_create(key, |column_opt: Option<BytesColumnWriter>| {
let mut column: BytesColumnWriter = column_opt.unwrap_or_else(|| {
let dictionary_id = dictionaries.len() as u32;
dictionaries.push(DictionaryBuilder::default());
BytesColumnWriter::with_dictionary_id(dictionary_id)
});
column.record_bytes(doc, value, dictionaries, arena);
column
});
}
pub fn serialize<W: io::Write>(
&mut self,
num_docs: DocId,
mut serializer: ColumnarSerializer<W>,
) -> io::Result<()> {
let mut field_columns: Vec<(&[u8], BytesOrNumerical, Addr)> = self
.numerical_field_hash_map
.iter()
.map(|(term, addr, _)| (term, BytesOrNumerical::Numerical, addr))
.collect();
field_columns.extend(
self.bytes_field_hash_map
.iter()
.map(|(term, addr, _)| (term, BytesOrNumerical::Bytes, addr)),
);
let mut key_buffer = Vec::new();
field_columns.sort_unstable_by_key(|(key, col_type, _)| (*key, *col_type));
let (arena, buffers, dictionaries) = (&self.arena, &mut self.buffers, &self.dictionaries);
for (key, bytes_or_numerical, addr) in field_columns {
let wrt = serializer.wrt();
let start_offset = wrt.written_bytes();
let column_type_and_cardinality: ColumnTypeAndCardinality =
match bytes_or_numerical {
BytesOrNumerical::Bytes => {
let BytesColumnWriter { dictionary_id, column_writer } =
self.bytes_field_hash_map.read(addr);
let dictionary_builder =
&dictionaries[dictionary_id as usize];
serialize_bytes_column(
&column_writer,
num_docs,
dictionary_builder,
arena,
buffers,
wrt,
)?;
ColumnTypeAndCardinality {
cardinality: column_writer.get_cardinality(num_docs),
typ: ColumnType::Bytes,
}
}
BytesOrNumerical::Numerical => {
let NumericalColumnWriter { compatible_numerical_types, column_writer } =
self.numerical_field_hash_map.read(addr);
let cardinality = column_writer.get_cardinality(num_docs);
let numerical_type = compatible_numerical_types.to_numerical_type();
serialize_numerical_column(
cardinality,
numerical_type,
&column_writer,
num_docs,
arena,
buffers,
wrt,
)?;
ColumnTypeAndCardinality {
cardinality,
typ: ColumnType::Numerical(numerical_type),
}
}
};
let end_offset = wrt.written_bytes();
let key_with_type = prepare_key(key, column_type_and_cardinality, &mut key_buffer);
serializer.record_column_offsets(key_with_type, start_offset..end_offset)?;
}
serializer.finalize()?;
Ok(())
}
}
/// Returns a key consisting of the concatenation of the key and the column_type_and_cardinality
/// code.
fn prepare_key<'a>(
key: &[u8],
column_type_cardinality: ColumnTypeAndCardinality,
buffer: &'a mut Vec<u8>,
) -> &'a [u8] {
buffer.clear();
buffer.extend_from_slice(key);
buffer.push(0u8);
buffer.push(column_type_cardinality.to_code());
&buffer[..]
}
fn serialize_bytes_column<W: io::Write>(
column_writer: &ColumnWriter,
num_docs: DocId,
dictionary_builder: &DictionaryBuilder,
arena: &MemoryArena,
buffers: &mut SpareBuffers,
wrt: &mut CountingWriter<W>,
) -> io::Result<()> {
let start_offset = wrt.written_bytes();
let id_mapping: IdMapping = dictionary_builder.serialize(wrt)?;
let dictionary_num_bytes: u32 = (wrt.written_bytes() - start_offset) as u32;
let cardinality = column_writer.get_cardinality(num_docs);
let SpareBuffers {
byte_buffer,
value_index_builders,
u64_values,
..
} = buffers;
let symbol_iterator = column_writer
.symbol_iterator(arena, byte_buffer)
.map(|symbol: ColumnOperation<UnorderedId>| {
// We map unordered ids to ordered ids.
match symbol {
ColumnOperation::Value(unordered_id) => {
let ordered_id = id_mapping.to_ord(unordered_id);
ColumnOperation::Value(ordered_id.0 as u64)
}
ColumnOperation::NewDoc(doc) => ColumnOperation::NewDoc(doc),
}
});
serialize_column(
symbol_iterator,
cardinality,
num_docs,
value_index_builders,
u64_values,
wrt,
)?;
wrt.write_all(&dictionary_num_bytes.to_le_bytes()[..])?;
Ok(())
}
fn serialize_numerical_column<W: io::Write>(
cardinality: Cardinality,
numerical_type: NumericalType,
column_writer: &ColumnWriter,
num_docs: DocId,
arena: &MemoryArena,
buffers: &mut SpareBuffers,
wrt: &mut W,
) -> io::Result<()> {
let SpareBuffers {
byte_buffer,
value_index_builders,
u64_values,
i64_values,
f64_values,
} = buffers;
let symbol_iterator = column_writer.symbol_iterator(arena, byte_buffer);
match numerical_type {
NumericalType::I64 => {
serialize_column(
coerce_numerical_symbol::<i64>(symbol_iterator),
cardinality,
num_docs,
value_index_builders,
i64_values,
wrt,
)?;
}
NumericalType::U64 => {
serialize_column(
coerce_numerical_symbol::<u64>(symbol_iterator),
cardinality,
num_docs,
value_index_builders,
u64_values,
wrt,
)?;
}
NumericalType::F64 => {
serialize_column(
coerce_numerical_symbol::<NotNan<f64>>(symbol_iterator),
cardinality,
num_docs,
value_index_builders,
f64_values,
wrt,
)?;
}
};
Ok(())
}
fn serialize_column<
T: Copy + Ord + Default + Send + Sync + MonotonicallyMappableToU64,
W: io::Write,
>(
symbol_iterator: impl Iterator<Item = ColumnOperation<T>>,
cardinality: Cardinality,
num_docs: DocId,
value_index_builders: &mut SpareIndexBuilders,
values: &mut Vec<T>,
wrt: &mut W,
) -> io::Result<()>
where
for<'a> VecColumn<'a, T>: Column<T>,
{
match cardinality {
Cardinality::Required => {
consume_symbol_iterator(
symbol_iterator,
value_index_builders.borrow_required_index_builder(),
values,
);
fastfield_codecs::serialize(
VecColumn::from(&values[..]),
wrt,
&fastfield_codecs::ALL_CODEC_TYPES[..],
)?;
}
Cardinality::Optional => {
let optional_index_builder = value_index_builders.borrow_optional_index_builder();
consume_symbol_iterator(symbol_iterator, optional_index_builder, values);
let optional_index = optional_index_builder.finish(num_docs);
fastfield_codecs::serialize::serialize_new(
ValueIndexInfo::SingleValue(Box::new(optional_index)),
VecColumn::from(&values[..]),
wrt,
&fastfield_codecs::ALL_CODEC_TYPES[..],
)?;
}
Cardinality::Multivalued => {
let multivalued_index_builder = value_index_builders.borrow_multivalued_index_builder();
consume_symbol_iterator(symbol_iterator, multivalued_index_builder, values);
let multivalued_index = multivalued_index_builder.finish(num_docs);
fastfield_codecs::serialize::serialize_new(
ValueIndexInfo::MultiValue(Box::new(multivalued_index)),
VecColumn::from(&values[..]),
wrt,
&fastfield_codecs::ALL_CODEC_TYPES[..],
)?;
}
}
Ok(())
}
fn coerce_numerical_symbol<T>(
symbol_iterator: impl Iterator<Item = ColumnOperation<NumericalValue>>,
) -> impl Iterator<Item = ColumnOperation<T>>
where T: Coerce {
symbol_iterator.map(|symbol| match symbol {
ColumnOperation::NewDoc(doc) => ColumnOperation::NewDoc(doc),
ColumnOperation::Value(numerical_value) => {
ColumnOperation::Value(Coerce::coerce(numerical_value))
}
})
}
fn consume_symbol_iterator<T, TIndexBuilder: IndexBuilder>(
symbol_iterator: impl Iterator<Item = ColumnOperation<T>>,
index_builder: &mut TIndexBuilder,
values: &mut Vec<T>,
) {
for symbol in symbol_iterator {
match symbol {
ColumnOperation::NewDoc(doc) => {
index_builder.record_doc(doc);
}
ColumnOperation::Value(value) => {
index_builder.record_value();
values.push(value);
}
}
}
}
#[cfg(test)]
mod tests {
use ordered_float::NotNan;
use stacker::MemoryArena;
use super::prepare_key;
use crate::column_type_header::{ColumnType, ColumnTypeAndCardinality};
use crate::value::{NumericalType, NumericalValue};
use crate::writer::column_operation::ColumnOperation;
use crate::writer::CompatibleNumericalTypes;
use crate::Cardinality;
#[test]
fn test_prepare_key_bytes() {
let mut buffer: Vec<u8> = b"somegarbage".to_vec();
let column_type_and_cardinality = ColumnTypeAndCardinality {
typ: ColumnType::Bytes,
cardinality: Cardinality::Optional,
};
let prepared_key = prepare_key(b"root\0child", column_type_and_cardinality, &mut buffer);
assert_eq!(prepared_key.len(), 12);
assert_eq!(&prepared_key[..10], b"root\0child");
assert_eq!(prepared_key[10], 0u8);
assert_eq!(prepared_key[11], column_type_and_cardinality.to_code());
}
#[test]
fn test_column_writer_required_simple() {
let mut arena = MemoryArena::default();
let mut column_writer = super::ColumnWriter::default();
column_writer.record(0u32, 14i64.into(), &mut arena);
column_writer.record(1u32, 15i64.into(), &mut arena);
column_writer.record(2u32, (-16i64).into(), &mut arena);
assert_eq!(column_writer.get_cardinality(3), Cardinality::Required);
let mut buffer = Vec::new();
let symbols: Vec<ColumnOperation<NumericalValue>> = column_writer
.symbol_iterator(&mut arena, &mut buffer)
.collect();
assert_eq!(symbols.len(), 6);
assert!(matches!(symbols[0], ColumnOperation::NewDoc(0u32)));
assert!(matches!(
symbols[1],
ColumnOperation::Value(NumericalValue::I64(14i64))
));
assert!(matches!(symbols[2], ColumnOperation::NewDoc(1u32)));
assert!(matches!(
symbols[3],
ColumnOperation::Value(NumericalValue::I64(15i64))
));
assert!(matches!(symbols[4], ColumnOperation::NewDoc(2u32)));
assert!(matches!(
symbols[5],
ColumnOperation::Value(NumericalValue::I64(-16i64))
));
}
#[test]
fn test_column_writer_optional_cardinality_missing_first() {
let mut arena = MemoryArena::default();
let mut column_writer = super::ColumnWriter::default();
column_writer.record(1u32, 15i64.into(), &mut arena);
column_writer.record(2u32, (-16i64).into(), &mut arena);
assert_eq!(column_writer.get_cardinality(3), Cardinality::Optional);
let mut buffer = Vec::new();
let symbols: Vec<ColumnOperation<NumericalValue>> = column_writer
.symbol_iterator(&mut arena, &mut buffer)
.collect();
assert_eq!(symbols.len(), 4);
assert!(matches!(symbols[0], ColumnOperation::NewDoc(1u32)));
assert!(matches!(
symbols[1],
ColumnOperation::Value(NumericalValue::I64(15i64))
));
assert!(matches!(symbols[2], ColumnOperation::NewDoc(2u32)));
assert!(matches!(
symbols[3],
ColumnOperation::Value(NumericalValue::I64(-16i64))
));
}
#[test]
fn test_column_writer_optional_cardinality_missing_last() {
let mut arena = MemoryArena::default();
let mut column_writer = super::ColumnWriter::default();
column_writer.record(0u32, 15i64.into(), &mut arena);
assert_eq!(column_writer.get_cardinality(2), Cardinality::Optional);
let mut buffer = Vec::new();
let symbols: Vec<ColumnOperation<NumericalValue>> = column_writer
.symbol_iterator(&mut arena, &mut buffer)
.collect();
assert_eq!(symbols.len(), 2);
assert!(matches!(symbols[0], ColumnOperation::NewDoc(0u32)));
assert!(matches!(
symbols[1],
ColumnOperation::Value(NumericalValue::I64(15i64))
));
}
#[test]
fn test_column_writer_multivalued() {
let mut arena = MemoryArena::default();
let mut column_writer = super::ColumnWriter::default();
column_writer.record(0u32, 16i64.into(), &mut arena);
column_writer.record(0u32, 17i64.into(), &mut arena);
assert_eq!(column_writer.get_cardinality(1), Cardinality::Multivalued);
let mut buffer = Vec::new();
let symbols: Vec<ColumnOperation<NumericalValue>> = column_writer
.symbol_iterator(&mut arena, &mut buffer)
.collect();
assert_eq!(symbols.len(), 3);
assert!(matches!(symbols[0], ColumnOperation::NewDoc(0u32)));
assert!(matches!(
symbols[1],
ColumnOperation::Value(NumericalValue::I64(16i64))
));
assert!(matches!(
symbols[2],
ColumnOperation::Value(NumericalValue::I64(17i64))
));
}
#[track_caller]
fn test_column_writer_coercion_iter_aux(
values: impl Iterator<Item = NumericalValue>,
expected_numerical_type: NumericalType,
) {
let mut compatible_numerical_types = CompatibleNumericalTypes::default();
for value in values {
compatible_numerical_types.accept_value(value);
}
assert_eq!(
compatible_numerical_types.to_numerical_type(),
expected_numerical_type
);
}
#[track_caller]
fn test_column_writer_coercion_aux(
values: &[NumericalValue],
expected_numerical_type: NumericalType,
) {
test_column_writer_coercion_iter_aux(values.iter().copied(), expected_numerical_type);
test_column_writer_coercion_iter_aux(values.iter().rev().copied(), expected_numerical_type);
}
#[test]
fn test_column_writer_coercion() {
test_column_writer_coercion_aux(&[], NumericalType::I64);
test_column_writer_coercion_aux(&[1i64.into()], NumericalType::I64);
test_column_writer_coercion_aux(&[1u64.into()], NumericalType::I64);
// We don't detect exact integer at the moment. We could!
test_column_writer_coercion_aux(&[NotNan::new(1f64).unwrap().into()], NumericalType::F64);
test_column_writer_coercion_aux(&[u64::MAX.into()], NumericalType::U64);
test_column_writer_coercion_aux(&[(i64::MAX as u64).into()], NumericalType::U64);
test_column_writer_coercion_aux(&[(1u64 << 63).into()], NumericalType::U64);
test_column_writer_coercion_aux(&[1i64.into(), 1u64.into()], NumericalType::I64);
test_column_writer_coercion_aux(&[u64::MAX.into(), (-1i64).into()], NumericalType::F64);
}
}

View File

@@ -0,0 +1,218 @@
use fastfield_codecs::serialize::{MultiValueIndexInfo, SingleValueIndexInfo};
use crate::DocId;
/// The `IndexBuilder` interprets a sequence of
/// calls of the form:
/// (record_doc,record_value+)*
/// and can then serialize the results into an index.
///
/// It has different implementation depending on whether the
/// cardinality is required, optional, or multivalued.
pub(crate) trait IndexBuilder {
fn record_doc(&mut self, doc: DocId);
#[inline]
fn record_value(&mut self) {}
}
/// The RequiredIndexBuilder does nothing.
#[derive(Default)]
pub struct RequiredIndexBuilder;
impl IndexBuilder for RequiredIndexBuilder {
#[inline(always)]
fn record_doc(&mut self, _doc: DocId) {}
}
#[derive(Default)]
pub struct OptionalIndexBuilder {
docs: Vec<DocId>,
}
struct SingleValueArrayIndex<'a> {
docs: &'a [DocId],
num_docs: DocId,
}
impl<'a> SingleValueIndexInfo for SingleValueArrayIndex<'a> {
fn num_vals(&self) -> u32 {
self.num_docs as u32
}
fn num_non_nulls(&self) -> u32 {
self.docs.len() as u32
}
fn iter(&self) -> Box<dyn Iterator<Item = u32> + '_> {
Box::new(self.docs.iter().copied())
}
}
impl OptionalIndexBuilder {
pub fn finish(&mut self, num_docs: DocId) -> impl SingleValueIndexInfo + '_ {
debug_assert!(self
.docs
.last()
.copied()
.map(|last_doc| last_doc < num_docs)
.unwrap_or(true));
SingleValueArrayIndex {
docs: &self.docs[..],
num_docs,
}
}
fn reset(&mut self) {
self.docs.clear();
}
}
impl IndexBuilder for OptionalIndexBuilder {
#[inline(always)]
fn record_doc(&mut self, doc: DocId) {
debug_assert!(self
.docs
.last()
.copied()
.map(|prev_doc| doc > prev_doc)
.unwrap_or(true));
self.docs.push(doc);
}
}
#[derive(Default)]
pub struct MultivaluedIndexBuilder {
// TODO should we switch to `start_offset`?
end_values: Vec<DocId>,
total_num_vals_seen: u32,
}
pub struct MultivaluedValueArrayIndex<'a> {
end_offsets: &'a [DocId],
}
impl<'a> MultiValueIndexInfo for MultivaluedValueArrayIndex<'a> {
fn num_docs(&self) -> u32 {
self.end_offsets.len() as u32
}
fn num_vals(&self) -> u32 {
self.end_offsets.last().copied().unwrap_or(0u32)
}
fn iter(&self) -> Box<dyn Iterator<Item = u32> + '_> {
if self.end_offsets.is_empty() {
return Box::new(std::iter::empty());
}
let n = self.end_offsets.len();
Box::new(std::iter::once(0u32).chain(self.end_offsets[..n - 1].iter().copied()))
}
}
impl MultivaluedIndexBuilder {
pub fn finish(&mut self, num_docs: DocId) -> impl MultiValueIndexInfo + '_ {
self.end_values
.resize(num_docs as usize, self.total_num_vals_seen);
MultivaluedValueArrayIndex {
end_offsets: &self.end_values[..],
}
}
fn reset(&mut self) {
self.end_values.clear();
self.total_num_vals_seen = 0;
}
}
impl IndexBuilder for MultivaluedIndexBuilder {
fn record_doc(&mut self, doc: DocId) {
self.end_values
.resize(doc as usize, self.total_num_vals_seen);
}
fn record_value(&mut self) {
self.total_num_vals_seen += 1;
}
}
/// The `SpareIndexBuilders` is there to avoid allocating a
/// new index builder for every single column.
#[derive(Default)]
pub struct SpareIndexBuilders {
required_index_builder: RequiredIndexBuilder,
optional_index_builder: OptionalIndexBuilder,
multivalued_index_builder: MultivaluedIndexBuilder,
}
impl SpareIndexBuilders {
pub fn borrow_required_index_builder(&mut self) -> &mut RequiredIndexBuilder {
&mut self.required_index_builder
}
pub fn borrow_optional_index_builder(&mut self) -> &mut OptionalIndexBuilder {
self.optional_index_builder.reset();
&mut self.optional_index_builder
}
pub fn borrow_multivalued_index_builder(&mut self) -> &mut MultivaluedIndexBuilder {
self.multivalued_index_builder.reset();
&mut self.multivalued_index_builder
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_optional_value_index_builder() {
let mut opt_value_index_builder = OptionalIndexBuilder::default();
opt_value_index_builder.record_doc(0u32);
opt_value_index_builder.record_value();
assert_eq!(
&opt_value_index_builder
.finish(1u32)
.iter()
.collect::<Vec<u32>>(),
&[0]
);
opt_value_index_builder.reset();
opt_value_index_builder.record_doc(1u32);
opt_value_index_builder.record_value();
assert_eq!(
&opt_value_index_builder
.finish(2u32)
.iter()
.collect::<Vec<u32>>(),
&[1]
);
}
#[test]
fn test_multivalued_value_index_builder() {
let mut multivalued_value_index_builder = MultivaluedIndexBuilder::default();
multivalued_value_index_builder.record_doc(1u32);
multivalued_value_index_builder.record_value();
multivalued_value_index_builder.record_value();
multivalued_value_index_builder.record_doc(2u32);
multivalued_value_index_builder.record_value();
assert_eq!(
multivalued_value_index_builder
.finish(4u32)
.iter()
.collect::<Vec<u32>>(),
vec![0, 0, 2, 3]
);
multivalued_value_index_builder.reset();
multivalued_value_index_builder.record_doc(2u32);
multivalued_value_index_builder.record_value();
multivalued_value_index_builder.record_value();
assert_eq!(
multivalued_value_index_builder
.finish(4u32)
.iter()
.collect::<Vec<u32>>(),
vec![0, 0, 0, 2]
);
}
}

View File

@@ -1,16 +1,21 @@
[package]
name = "tantivy-common"
version = "0.3.0"
version = "0.5.0"
authors = ["Paul Masurel <paul@quickwit.io>", "Pascal Seitz <pascal@quickwit.io>"]
license = "MIT"
edition = "2021"
description = "common traits and utility functions used by multiple tantivy subcrates"
documentation = "https://docs.rs/tantivy_common/"
homepage = "https://github.com/quickwit-oss/tantivy"
repository = "https://github.com/quickwit-oss/tantivy"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
byteorder = "1.4.3"
ownedbytes = { version="0.3", path="../ownedbytes" }
ownedbytes = { version= "0.5", path="../ownedbytes" }
async-trait = "0.1"
[dev-dependencies]
proptest = "1.0.0"

View File

@@ -151,7 +151,7 @@ impl TinySet {
if self.is_empty() {
None
} else {
let lowest = self.0.trailing_zeros() as u32;
let lowest = self.0.trailing_zeros();
self.0 ^= TinySet::singleton(lowest).0;
Some(lowest)
}
@@ -259,11 +259,7 @@ impl BitSet {
// we do not check saturated els.
let higher = el / 64u32;
let lower = el % 64u32;
self.len += if self.tinysets[higher as usize].insert_mut(lower) {
1
} else {
0
};
self.len += u64::from(self.tinysets[higher as usize].insert_mut(lower));
}
/// Inserts an element in the `BitSet`
@@ -272,11 +268,7 @@ impl BitSet {
// we do not check saturated els.
let higher = el / 64u32;
let lower = el % 64u32;
self.len -= if self.tinysets[higher as usize].remove_mut(lower) {
1
} else {
0
};
self.len -= u64::from(self.tinysets[higher as usize].remove_mut(lower));
}
/// Returns true iff the elements is in the `BitSet`.
@@ -285,7 +277,7 @@ impl BitSet {
self.tinyset(el / 64u32).contains(el % 64)
}
/// Returns the first non-empty `TinySet` associated to a bucket lower
/// Returns the first non-empty `TinySet` associated with a bucket lower
/// or greater than bucket.
///
/// Reminder: the tiny set with the bucket `bucket`, represents the
@@ -429,7 +421,7 @@ mod tests {
bitset.serialize(&mut out).unwrap();
let bitset = ReadOnlyBitSet::open(OwnedBytes::new(out));
assert_eq!(bitset.len() as usize, i as usize);
assert_eq!(bitset.len(), i as usize);
}
}
@@ -440,7 +432,7 @@ mod tests {
bitset.serialize(&mut out).unwrap();
let bitset = ReadOnlyBitSet::open(OwnedBytes::new(out));
assert_eq!(bitset.len() as usize, 64);
assert_eq!(bitset.len(), 64);
}
#[test]

View File

@@ -1,23 +1,19 @@
use std::ops::{Deref, Range};
use std::sync::{Arc, Weak};
use std::ops::{Deref, Range, RangeBounds};
use std::sync::Arc;
use std::{fmt, io};
use async_trait::async_trait;
use common::HasLen;
use stable_deref_trait::StableDeref;
use ownedbytes::{OwnedBytes, StableDeref};
use crate::directory::OwnedBytes;
pub type ArcBytes = Arc<dyn Deref<Target = [u8]> + Send + Sync + 'static>;
pub type WeakArcBytes = Weak<dyn Deref<Target = [u8]> + Send + Sync + 'static>;
use crate::HasLen;
/// Objects that represents files sections in tantivy.
///
/// By contract, whatever happens to the directory file, as long as a FileHandle
/// is alive, the data associated with it cannot be altered or destroyed.
///
/// The underlying behavior is therefore specific to the `Directory` that created it.
/// Despite its name, a `FileSlice` may or may not directly map to an actual file
/// The underlying behavior is therefore specific to the `Directory` that
/// created it. Despite its name, a [`FileSlice`] may or may not directly map to an actual file
/// on the filesystem.
#[async_trait]
@@ -27,13 +23,9 @@ pub trait FileHandle: 'static + Send + Sync + HasLen + fmt::Debug {
/// This method may panic if the range requested is invalid.
fn read_bytes(&self, range: Range<usize>) -> io::Result<OwnedBytes>;
#[cfg(feature = "quickwit")]
#[doc(hidden)]
async fn read_bytes_async(
&self,
_byte_range: Range<usize>,
) -> crate::AsyncIoResult<OwnedBytes> {
Err(crate::error::AsyncIoError::AsyncUnsupported)
async fn read_bytes_async(&self, byte_range: Range<usize>) -> io::Result<OwnedBytes> {
self.read_bytes(byte_range)
}
}
@@ -45,7 +37,7 @@ impl FileHandle for &'static [u8] {
}
#[cfg(feature = "quickwit")]
async fn read_bytes_async(&self, byte_range: Range<usize>) -> crate::AsyncIoResult<OwnedBytes> {
async fn read_bytes_async(&self, byte_range: Range<usize>) -> io::Result<OwnedBytes> {
Ok(self.read_bytes(byte_range)?)
}
}
@@ -73,6 +65,25 @@ impl fmt::Debug for FileSlice {
}
}
#[inline]
fn combine_ranges<R: RangeBounds<usize>>(orig_range: Range<usize>, rel_range: R) -> Range<usize> {
let start: usize = orig_range.start
+ match rel_range.start_bound().cloned() {
std::ops::Bound::Included(rel_start) => rel_start,
std::ops::Bound::Excluded(rel_start) => rel_start + 1,
std::ops::Bound::Unbounded => 0,
};
assert!(start <= orig_range.end);
let end: usize = match rel_range.end_bound().cloned() {
std::ops::Bound::Included(rel_end) => orig_range.start + rel_end + 1,
std::ops::Bound::Excluded(rel_end) => orig_range.start + rel_end,
std::ops::Bound::Unbounded => orig_range.end,
};
assert!(end >= start);
assert!(end <= orig_range.end);
start..end
}
impl FileSlice {
/// Wraps a FileHandle.
pub fn new(file_handle: Arc<dyn FileHandle>) -> Self {
@@ -96,11 +107,11 @@ impl FileSlice {
///
/// Panics if `byte_range.end` exceeds the filesize.
#[must_use]
pub fn slice(&self, byte_range: Range<usize>) -> FileSlice {
assert!(byte_range.end <= self.len());
#[inline]
pub fn slice<R: RangeBounds<usize>>(&self, byte_range: R) -> FileSlice {
FileSlice {
data: self.data.clone(),
range: self.range.start + byte_range.start..self.range.start + byte_range.end,
range: combine_ranges(self.range.clone(), byte_range),
}
}
@@ -120,9 +131,8 @@ impl FileSlice {
self.data.read_bytes(self.range.clone())
}
#[cfg(feature = "quickwit")]
#[doc(hidden)]
pub async fn read_bytes_async(&self) -> crate::AsyncIoResult<OwnedBytes> {
pub async fn read_bytes_async(&self) -> io::Result<OwnedBytes> {
self.data.read_bytes_async(self.range.clone()).await
}
@@ -140,12 +150,8 @@ impl FileSlice {
.read_bytes(self.range.start + range.start..self.range.start + range.end)
}
#[cfg(feature = "quickwit")]
#[doc(hidden)]
pub async fn read_bytes_slice_async(
&self,
byte_range: Range<usize>,
) -> crate::AsyncIoResult<OwnedBytes> {
pub async fn read_bytes_slice_async(&self, byte_range: Range<usize>) -> io::Result<OwnedBytes> {
assert!(
self.range.start + byte_range.end <= self.range.end,
"`to` exceeds the fileslice length"
@@ -208,7 +214,7 @@ impl FileHandle for FileSlice {
}
#[cfg(feature = "quickwit")]
async fn read_bytes_async(&self, byte_range: Range<usize>) -> crate::AsyncIoResult<OwnedBytes> {
async fn read_bytes_async(&self, byte_range: Range<usize>) -> io::Result<OwnedBytes> {
self.read_bytes_slice_async(byte_range).await
}
}
@@ -226,7 +232,7 @@ impl FileHandle for OwnedBytes {
}
#[cfg(feature = "quickwit")]
async fn read_bytes_async(&self, range: Range<usize>) -> crate::AsyncIoResult<OwnedBytes> {
async fn read_bytes_async(&self, range: Range<usize>) -> io::Result<OwnedBytes> {
let bytes = self.read_bytes(range)?;
Ok(bytes)
}
@@ -237,9 +243,9 @@ mod tests {
use std::io;
use std::sync::Arc;
use common::HasLen;
use super::{FileHandle, FileSlice};
use crate::file_slice::combine_ranges;
use crate::HasLen;
#[test]
fn test_file_slice() -> io::Result<()> {
@@ -310,4 +316,18 @@ mod tests {
b"bcd"
);
}
#[test]
fn test_combine_range() {
assert_eq!(combine_ranges(1..3, 0..1), 1..2);
assert_eq!(combine_ranges(1..3, 1..), 2..3);
assert_eq!(combine_ranges(1..4, ..2), 1..3);
assert_eq!(combine_ranges(3..10, 2..5), 5..8);
}
#[test]
#[should_panic]
fn test_combine_range_panics() {
let _ = combine_ranges(3..5, 1..4);
}
}

View File

@@ -5,11 +5,12 @@ use std::ops::Deref;
pub use byteorder::LittleEndian as Endianness;
mod bitset;
pub mod file_slice;
mod serialize;
mod vint;
mod writer;
pub use bitset::*;
pub use ownedbytes::OwnedBytes;
pub use serialize::{BinarySerializable, DeserializeFrom, FixedSize};
pub use vint::{
deserialize_vint_u128, read_u32_vint, read_u32_vint_no_advance, serialize_vint_u128,

View File

@@ -94,6 +94,20 @@ impl FixedSize for u32 {
const SIZE_IN_BYTES: usize = 4;
}
impl BinarySerializable for u16 {
fn serialize<W: Write>(&self, writer: &mut W) -> io::Result<()> {
writer.write_u16::<Endianness>(*self)
}
fn deserialize<R: Read>(reader: &mut R) -> io::Result<u16> {
reader.read_u16::<Endianness>()
}
}
impl FixedSize for u16 {
const SIZE_IN_BYTES: usize = 2;
}
impl BinarySerializable for u64 {
fn serialize<W: Write>(&self, writer: &mut W) -> io::Result<()> {
writer.write_u64::<Endianness>(*self)
@@ -107,6 +121,19 @@ impl FixedSize for u64 {
const SIZE_IN_BYTES: usize = 8;
}
impl BinarySerializable for u128 {
fn serialize<W: Write>(&self, writer: &mut W) -> io::Result<()> {
writer.write_u128::<Endianness>(*self)
}
fn deserialize<R: Read>(reader: &mut R) -> io::Result<Self> {
reader.read_u128::<Endianness>()
}
}
impl FixedSize for u128 {
const SIZE_IN_BYTES: usize = 16;
}
impl BinarySerializable for f32 {
fn serialize<W: Write>(&self, writer: &mut W) -> io::Result<()> {
writer.write_f32::<Endianness>(*self)
@@ -161,8 +188,7 @@ impl FixedSize for u8 {
impl BinarySerializable for bool {
fn serialize<W: Write>(&self, writer: &mut W) -> io::Result<()> {
let val = if *self { 1 } else { 0 };
writer.write_u8(val)
writer.write_u8(u8::from(*self))
}
fn deserialize<R: Read>(reader: &mut R) -> io::Result<bool> {
let val = reader.read_u8()?;

View File

@@ -157,7 +157,7 @@ fn vint_len(data: &[u8]) -> usize {
/// If the buffer does not start by a valid
/// vint payload
pub fn read_u32_vint(data: &mut &[u8]) -> u32 {
let (result, vlen) = read_u32_vint_no_advance(*data);
let (result, vlen) = read_u32_vint_no_advance(data);
*data = &data[vlen..];
result
}

View File

@@ -50,7 +50,7 @@ to get tantivy to fit your use case:
*Example 1* You could for instance use hadoop to build a very large search index in a timely manner, copy all of the resulting segment files in the same directory and edit the `meta.json` to get a functional index.[^2]
*Example 2* You could also disable your merge policy and enforce daily segments. Removing data after one week can then be done very efficiently by just editing the `meta.json` and deleting the files associated to segment `D-7`.
*Example 2* You could also disable your merge policy and enforce daily segments. Removing data after one week can then be done very efficiently by just editing the `meta.json` and deleting the files associated with segment `D-7`.
## Merging

View File

@@ -118,7 +118,7 @@ fn main() -> tantivy::Result<()> {
.into_iter()
.collect();
let collector = AggregationCollector::from_aggs(agg_req_1, None);
let collector = AggregationCollector::from_aggs(agg_req_1, None, index.schema());
let searcher = reader.searcher();
let agg_res: AggregationResults = searcher.search(&term_query, &collector).unwrap();

View File

@@ -105,7 +105,7 @@ impl SegmentCollector for StatsSegmentCollector {
type Fruit = Option<Stats>;
fn collect(&mut self, doc: u32, _score: Score) {
let value = self.fast_field_reader.get_val(doc as u64) as f64;
let value = self.fast_field_reader.get_val(doc) as f64;
self.stats.count += 1;
self.stats.sum += value;
self.stats.squared_sum += value * value;

View File

@@ -113,7 +113,7 @@ fn main() -> tantivy::Result<()> {
// on its id.
//
// Note that `tantivy` does nothing to enforce the idea that
// there is only one document associated to this id.
// there is only one document associated with this id.
//
// Also you might have noticed that we apply the delete before
// having committed. This does not matter really...

View File

@@ -44,7 +44,7 @@ fn main() -> tantivy::Result<()> {
// A segment contains different data structure.
// Inverted index stands for the combination of
// - the term dictionary
// - the inverted lists associated to each terms and their positions
// - the inverted lists associated with each terms and their positions
let inverted_index = segment_reader.inverted_index(title)?;
// A `Term` is a text token associated with a field.
@@ -105,7 +105,7 @@ fn main() -> tantivy::Result<()> {
// A segment contains different data structure.
// Inverted index stands for the combination of
// - the term dictionary
// - the inverted lists associated to each terms and their positions
// - the inverted lists associated with each terms and their positions
let inverted_index = segment_reader.inverted_index(title)?;
// This segment posting object is like a cursor over the documents matching the term.

View File

@@ -51,7 +51,7 @@ impl Warmer for DynamicPriceColumn {
let product_id_reader = segment.fast_fields().u64(self.field)?;
let product_ids: Vec<ProductId> = segment
.doc_ids_alive()
.map(|doc| product_id_reader.get_val(doc as u64))
.map(|doc| product_id_reader.get_val(doc))
.collect();
let mut prices_it = self.price_fetcher.fetch_prices(&product_ids).into_iter();
let mut price_vals: Vec<Price> = Vec::new();

View File

@@ -1,23 +1,27 @@
[package]
name = "fastfield_codecs"
version = "0.2.0"
version = "0.3.0"
authors = ["Pascal Seitz <pascal@quickwit.io>"]
license = "MIT"
edition = "2021"
description = "Fast field codecs used by tantivy"
documentation = "https://docs.rs/fastfield_codecs/"
homepage = "https://github.com/quickwit-oss/tantivy"
repository = "https://github.com/quickwit-oss/tantivy"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
common = { version = "0.3", path = "../common/", package = "tantivy-common" }
tantivy-bitpacker = { version="0.2", path = "../bitpacker/" }
ownedbytes = { version = "0.3.0", path = "../ownedbytes" }
common = { version = "0.5", path = "../common/", package = "tantivy-common" }
tantivy-bitpacker = { version= "0.3", path = "../bitpacker/" }
ownedbytes = { version = "0.5", path = "../ownedbytes" }
prettytable-rs = {version="0.9.0", optional= true}
rand = {version="0.8.3", optional= true}
fastdivide = "0.4"
log = "0.4"
itertools = { version = "0.10.3" }
measure_time = { version="0.8.2", optional=true}
ordered-float = "3.4"
[dev-dependencies]
more-asserts = "0.3.0"

View File

@@ -65,7 +65,7 @@ mod tests {
b.iter(|| {
let mut a = 0u64;
for _ in 0..n {
a = column.get_val(a as u64);
a = column.get_val(a as u32);
}
a
});
@@ -100,9 +100,10 @@ mod tests {
fn get_u128_column_from_data(data: &[u128]) -> Arc<dyn Column<u128>> {
let mut out = vec![];
serialize_u128(VecColumn::from(&data), &mut out).unwrap();
let iter_gen = || data.iter().cloned();
serialize_u128(iter_gen, data.len() as u32, &mut out).unwrap();
let out = OwnedBytes::new(out);
open_u128(out).unwrap()
open_u128::<u128>(out).unwrap()
}
#[bench]
@@ -110,7 +111,15 @@ mod tests {
let (major_item, _minor_item, data) = get_data_50percent_item();
let column = get_u128_column_from_data(&data);
b.iter(|| column.get_between_vals(major_item..=major_item));
b.iter(|| {
let mut positions = Vec::new();
column.get_docids_for_value_range(
major_item..=major_item,
0..data.len() as u32,
&mut positions,
);
positions
});
}
#[bench]
@@ -118,7 +127,15 @@ mod tests {
let (_major_item, minor_item, data) = get_data_50percent_item();
let column = get_u128_column_from_data(&data);
b.iter(|| column.get_between_vals(minor_item..=minor_item));
b.iter(|| {
let mut positions = Vec::new();
column.get_docids_for_value_range(
minor_item..=minor_item,
0..data.len() as u32,
&mut positions,
);
positions
});
}
#[bench]
@@ -126,7 +143,11 @@ mod tests {
let (_major_item, _minor_item, data) = get_data_50percent_item();
let column = get_u128_column_from_data(&data);
b.iter(|| column.get_between_vals(0..=u128::MAX));
b.iter(|| {
let mut positions = Vec::new();
column.get_docids_for_value_range(0..=u128::MAX, 0..data.len() as u32, &mut positions);
positions
});
}
#[bench]
@@ -136,7 +157,7 @@ mod tests {
b.iter(|| {
let mut a = 0u128;
for i in 0u64..column.num_vals() as u64 {
a += column.get_val(i);
a += column.get_val(i as u32);
}
a
});
@@ -150,7 +171,7 @@ mod tests {
let n = column.num_vals();
let mut a = 0u128;
for i in (0..n / 5).map(|val| val * 5) {
a += column.get_val(i as u64);
a += column.get_val(i);
}
a
});
@@ -175,9 +196,9 @@ mod tests {
let n = permutation.len();
let column: Arc<dyn Column<u64>> = serialize_and_load(&permutation);
b.iter(|| {
let mut a = 0u64;
let mut a = 0;
for i in (0..n / 7).map(|val| val * 7) {
a += column.get_val(i as u64);
a += column.get_val(i as u32);
}
a
});
@@ -190,7 +211,7 @@ mod tests {
let column: Arc<dyn Column<u64>> = serialize_and_load(&permutation);
b.iter(|| {
let mut a = 0u64;
for i in 0u64..n as u64 {
for i in 0u32..n as u32 {
a += column.get_val(i);
}
a
@@ -204,8 +225,8 @@ mod tests {
let column: Arc<dyn Column<u64>> = serialize_and_load(&permutation);
b.iter(|| {
let mut a = 0u64;
for i in 0..n as u64 {
a += column.get_val(i);
for i in 0..n {
a += column.get_val(i as u32);
}
a
});

View File

@@ -17,7 +17,7 @@ pub struct BitpackedReader {
impl Column for BitpackedReader {
#[inline]
fn get_val(&self, doc: u64) -> u64 {
fn get_val(&self, doc: u32) -> u64 {
self.bit_unpacker.get(doc, &self.data)
}
#[inline]
@@ -30,7 +30,7 @@ impl Column for BitpackedReader {
self.normalized_header.max_value
}
#[inline]
fn num_vals(&self) -> u64 {
fn num_vals(&self) -> u32 {
self.normalized_header.num_vals
}
}
@@ -75,7 +75,7 @@ impl FastFieldCodec for BitpackedCodec {
Ok(())
}
fn estimate(column: &impl Column) -> Option<f32> {
fn estimate(column: &dyn Column) -> Option<f32> {
let num_bits = compute_num_bits(column.max_value());
let num_bits_uncompressed = 64;
Some(num_bits as f32 / num_bits_uncompressed as f32)

View File

@@ -36,7 +36,7 @@ impl BinarySerializable for Block {
}
}
fn compute_num_blocks(num_vals: u64) -> usize {
fn compute_num_blocks(num_vals: u32) -> usize {
(num_vals as usize + CHUNK_SIZE - 1) / CHUNK_SIZE
}
@@ -71,14 +71,14 @@ impl FastFieldCodec for BlockwiseLinearCodec {
}
// Estimate first_chunk and extrapolate
fn estimate(column: &impl crate::Column) -> Option<f32> {
if column.num_vals() < 10 * CHUNK_SIZE as u64 {
fn estimate(column: &dyn crate::Column) -> Option<f32> {
if column.num_vals() < 10 * CHUNK_SIZE as u32 {
return None;
}
let mut first_chunk: Vec<u64> = column.iter().take(CHUNK_SIZE as usize).collect();
let mut first_chunk: Vec<u64> = column.iter().take(CHUNK_SIZE).collect();
let line = Line::train(&VecColumn::from(&first_chunk));
for (i, buffer_val) in first_chunk.iter_mut().enumerate() {
let interpolated_val = line.eval(i as u64);
let interpolated_val = line.eval(i as u32);
*buffer_val = buffer_val.wrapping_sub(interpolated_val);
}
let estimated_bit_width = first_chunk
@@ -95,12 +95,12 @@ impl FastFieldCodec for BlockwiseLinearCodec {
};
let num_bits = estimated_bit_width as u64 * column.num_vals() as u64
// function metadata per block
+ metadata_per_block as u64 * (column.num_vals() / CHUNK_SIZE as u64);
+ metadata_per_block as u64 * (column.num_vals() as u64 / CHUNK_SIZE as u64);
let num_bits_uncompressed = 64 * column.num_vals();
Some(num_bits as f32 / num_bits_uncompressed as f32)
}
fn serialize(column: &dyn crate::Column, wrt: &mut impl io::Write) -> io::Result<()> {
fn serialize(column: &dyn Column, wrt: &mut impl io::Write) -> io::Result<()> {
// The BitpackedReader assumes a normalized vector.
assert_eq!(column.min_value(), 0);
let mut buffer = Vec::with_capacity(CHUNK_SIZE);
@@ -121,7 +121,7 @@ impl FastFieldCodec for BlockwiseLinearCodec {
assert!(!buffer.is_empty());
for (i, buffer_val) in buffer.iter_mut().enumerate() {
let interpolated_val = line.eval(i as u64);
let interpolated_val = line.eval(i as u32);
*buffer_val = buffer_val.wrapping_sub(interpolated_val);
}
let bit_width = buffer.iter().copied().map(compute_num_bits).max().unwrap();
@@ -161,9 +161,9 @@ pub struct BlockwiseLinearReader {
impl Column for BlockwiseLinearReader {
#[inline(always)]
fn get_val(&self, idx: u64) -> u64 {
let block_id = (idx / CHUNK_SIZE as u64) as usize;
let idx_within_block = idx % (CHUNK_SIZE as u64);
fn get_val(&self, idx: u32) -> u64 {
let block_id = (idx / CHUNK_SIZE as u32) as usize;
let idx_within_block = idx % (CHUNK_SIZE as u32);
let block = &self.blocks[block_id];
let interpoled_val: u64 = block.line.eval(idx_within_block);
let block_bytes = &self.data[block.data_start_offset..];
@@ -180,7 +180,7 @@ impl Column for BlockwiseLinearReader {
self.normalized_header.max_value
}
fn num_vals(&self) -> u64 {
fn num_vals(&self) -> u32 {
self.normalized_header.num_vals
}
}

View File

@@ -1,17 +1,20 @@
use std::marker::PhantomData;
use std::ops::RangeInclusive;
use std::ops::{Range, RangeInclusive};
use tantivy_bitpacker::minmax;
use crate::monotonic_mapping::StrictlyMonotonicFn;
/// `Column` provides columnar access on a field.
pub trait Column<T: PartialOrd = u64>: Send + Sync {
/// Return the value associated to the given idx.
/// Return the value associated with the given idx.
///
/// This accessor should return as fast as possible.
///
/// # Panics
///
/// May panic if `idx` is greater than the column length.
fn get_val(&self, idx: u64) -> T;
fn get_val(&self, idx: u32) -> T;
/// Fills an output buffer with the fast field values
/// associated with the `DocId` going from
@@ -24,21 +27,28 @@ pub trait Column<T: PartialOrd = u64>: Send + Sync {
#[inline]
fn get_range(&self, start: u64, output: &mut [T]) {
for (out, idx) in output.iter_mut().zip(start..) {
*out = self.get_val(idx);
*out = self.get_val(idx as u32);
}
}
/// Return the positions of values which are in the provided range.
/// Get the positions of values which are in the provided value range.
///
/// Note that position == docid for single value fast fields
#[inline]
fn get_between_vals(&self, range: RangeInclusive<T>) -> Vec<u64> {
let mut vals = Vec::new();
for idx in 0..self.num_vals() {
fn get_docids_for_value_range(
&self,
value_range: RangeInclusive<T>,
doc_id_range: Range<u32>,
positions: &mut Vec<u32>,
) {
let doc_id_range = doc_id_range.start..doc_id_range.end.min(self.num_vals());
for idx in doc_id_range.start..doc_id_range.end {
let val = self.get_val(idx);
if range.contains(&val) {
vals.push(idx);
if value_range.contains(&val) {
positions.push(idx);
}
}
vals
}
/// Returns the minimum value for this fast field.
@@ -57,7 +67,8 @@ pub trait Column<T: PartialOrd = u64>: Send + Sync {
/// `.max_value()`.
fn max_value(&self) -> T;
fn num_vals(&self) -> u64;
/// The number of values in the column.
fn num_vals(&self) -> u32;
/// Returns a iterator over the data
fn iter<'a>(&'a self) -> Box<dyn Iterator<Item = T> + 'a> {
@@ -65,6 +76,7 @@ pub trait Column<T: PartialOrd = u64>: Send + Sync {
}
}
/// VecColumn provides `Column` over a slice.
pub struct VecColumn<'a, T = u64> {
values: &'a [T],
min_value: T,
@@ -72,7 +84,7 @@ pub struct VecColumn<'a, T = u64> {
}
impl<'a, C: Column<T>, T: Copy + PartialOrd> Column<T> for &'a C {
fn get_val(&self, idx: u64) -> T {
fn get_val(&self, idx: u32) -> T {
(*self).get_val(idx)
}
@@ -84,7 +96,7 @@ impl<'a, C: Column<T>, T: Copy + PartialOrd> Column<T> for &'a C {
(*self).max_value()
}
fn num_vals(&self) -> u64 {
fn num_vals(&self) -> u32 {
(*self).num_vals()
}
@@ -98,7 +110,7 @@ impl<'a, C: Column<T>, T: Copy + PartialOrd> Column<T> for &'a C {
}
impl<'a, T: Copy + PartialOrd + Send + Sync> Column<T> for VecColumn<'a, T> {
fn get_val(&self, position: u64) -> T {
fn get_val(&self, position: u32) -> T {
self.values[position as usize]
}
@@ -114,8 +126,8 @@ impl<'a, T: Copy + PartialOrd + Send + Sync> Column<T> for VecColumn<'a, T> {
self.max_value
}
fn num_vals(&self) -> u64 {
self.values.len() as u64
fn num_vals(&self) -> u32 {
self.values.len() as u32
}
fn get_range(&self, start: u64, output: &mut [T]) {
@@ -143,16 +155,30 @@ struct MonotonicMappingColumn<C, T, Input> {
_phantom: PhantomData<Input>,
}
/// Creates a view of a column transformed by a monotonic mapping.
pub fn monotonic_map_column<C, T, Input: PartialOrd, Output: PartialOrd>(
/// Creates a view of a column transformed by a strictly monotonic mapping. See
/// [`StrictlyMonotonicFn`].
///
/// E.g. apply a gcd monotonic_mapping([100, 200, 300]) == [1, 2, 3]
/// monotonic_mapping.mapping() is expected to be injective, and we should always have
/// monotonic_mapping.inverse(monotonic_mapping.mapping(el)) == el
///
/// The inverse of the mapping is required for:
/// `fn get_positions_for_value_range(&self, range: RangeInclusive<T>) -> Vec<u64> `
/// The user provides the original value range and we need to monotonic map them in the same way the
/// serialization does before calling the underlying column.
///
/// Note that when opening a codec, the monotonic_mapping should be the inverse of the mapping
/// during serialization. And therefore the monotonic_mapping_inv when opening is the same as
/// monotonic_mapping during serialization.
pub fn monotonic_map_column<C, T, Input, Output>(
from_column: C,
monotonic_mapping: T,
) -> impl Column<Output>
where
C: Column<Input>,
T: Fn(Input) -> Output + Send + Sync,
Input: Send + Sync,
Output: Send + Sync,
T: StrictlyMonotonicFn<Input, Output> + Send + Sync,
Input: PartialOrd + Send + Sync + Clone,
Output: PartialOrd + Send + Sync + Clone,
{
MonotonicMappingColumn {
from_column,
@@ -161,42 +187,60 @@ where
}
}
impl<C, T, Input: PartialOrd, Output: PartialOrd> Column<Output>
for MonotonicMappingColumn<C, T, Input>
impl<C, T, Input, Output> Column<Output> for MonotonicMappingColumn<C, T, Input>
where
C: Column<Input>,
T: Fn(Input) -> Output + Send + Sync,
Input: Send + Sync,
Output: Send + Sync,
T: StrictlyMonotonicFn<Input, Output> + Send + Sync,
Input: PartialOrd + Send + Sync + Clone,
Output: PartialOrd + Send + Sync + Clone,
{
#[inline]
fn get_val(&self, idx: u64) -> Output {
fn get_val(&self, idx: u32) -> Output {
let from_val = self.from_column.get_val(idx);
(self.monotonic_mapping)(from_val)
self.monotonic_mapping.mapping(from_val)
}
fn min_value(&self) -> Output {
let from_min_value = self.from_column.min_value();
(self.monotonic_mapping)(from_min_value)
self.monotonic_mapping.mapping(from_min_value)
}
fn max_value(&self) -> Output {
let from_max_value = self.from_column.max_value();
(self.monotonic_mapping)(from_max_value)
self.monotonic_mapping.mapping(from_max_value)
}
fn num_vals(&self) -> u64 {
fn num_vals(&self) -> u32 {
self.from_column.num_vals()
}
fn iter(&self) -> Box<dyn Iterator<Item = Output> + '_> {
Box::new(self.from_column.iter().map(&self.monotonic_mapping))
Box::new(
self.from_column
.iter()
.map(|el| self.monotonic_mapping.mapping(el)),
)
}
fn get_docids_for_value_range(
&self,
range: RangeInclusive<Output>,
doc_id_range: Range<u32>,
positions: &mut Vec<u32>,
) {
self.from_column.get_docids_for_value_range(
self.monotonic_mapping.inverse(range.start().clone())
..=self.monotonic_mapping.inverse(range.end().clone()),
doc_id_range,
positions,
)
}
// We voluntarily do not implement get_range as it yields a regression,
// and we do not have any specialized implementation anyway.
}
/// Wraps an iterator into a `Column`.
pub struct IterColumn<T>(T);
impl<T> From<T> for IterColumn<T>
@@ -212,7 +256,7 @@ where
T: Iterator + Clone + ExactSizeIterator + Send + Sync,
T::Item: PartialOrd,
{
fn get_val(&self, idx: u64) -> T::Item {
fn get_val(&self, idx: u32) -> T::Item {
self.0.clone().nth(idx as usize).unwrap()
}
@@ -224,8 +268,8 @@ where
self.0.clone().last().unwrap()
}
fn num_vals(&self) -> u64 {
self.0.len() as u64
fn num_vals(&self) -> u32 {
self.0.len() as u32
}
fn iter(&self) -> Box<dyn Iterator<Item = T::Item> + '_> {
@@ -236,19 +280,22 @@ where
#[cfg(test)]
mod tests {
use super::*;
use crate::MonotonicallyMappableToU64;
use crate::monotonic_mapping::{
StrictlyMonotonicMappingInverter, StrictlyMonotonicMappingToInternalBaseval,
StrictlyMonotonicMappingToInternalGCDBaseval,
};
#[test]
fn test_monotonic_mapping() {
let vals = &[1u64, 3u64][..];
let vals = &[3u64, 5u64][..];
let col = VecColumn::from(vals);
let mapped = monotonic_map_column(col, |el| el + 4);
assert_eq!(mapped.min_value(), 5u64);
assert_eq!(mapped.max_value(), 7u64);
let mapped = monotonic_map_column(col, StrictlyMonotonicMappingToInternalBaseval::new(2));
assert_eq!(mapped.min_value(), 1u64);
assert_eq!(mapped.max_value(), 3u64);
assert_eq!(mapped.num_vals(), 2);
assert_eq!(mapped.num_vals(), 2);
assert_eq!(mapped.get_val(0), 5);
assert_eq!(mapped.get_val(1), 7);
assert_eq!(mapped.get_val(0), 1);
assert_eq!(mapped.get_val(1), 3);
}
#[test]
@@ -260,10 +307,15 @@ mod tests {
#[test]
fn test_monotonic_mapping_iter() {
let vals: Vec<u64> = (-1..99).map(i64::to_u64).collect();
let vals: Vec<u64> = (10..110u64).map(|el| el * 10).collect();
let col = VecColumn::from(&vals);
let mapped = monotonic_map_column(col, |el| i64::from_u64(el) * 10i64);
let val_i64s: Vec<i64> = mapped.iter().collect();
let mapped = monotonic_map_column(
col,
StrictlyMonotonicMappingInverter::from(
StrictlyMonotonicMappingToInternalGCDBaseval::new(10, 100),
),
);
let val_i64s: Vec<u64> = mapped.iter().collect();
for i in 0..100 {
assert_eq!(val_i64s[i as usize], mapped.get_val(i));
}
@@ -271,20 +323,26 @@ mod tests {
#[test]
fn test_monotonic_mapping_get_range() {
let vals: Vec<u64> = (-1..99).map(i64::to_u64).collect();
let vals: Vec<u64> = (0..100u64).map(|el| el * 10).collect();
let col = VecColumn::from(&vals);
let mapped = monotonic_map_column(col, |el| i64::from_u64(el) * 10i64);
assert_eq!(mapped.min_value(), -10i64);
assert_eq!(mapped.max_value(), 980i64);
let mapped = monotonic_map_column(
col,
StrictlyMonotonicMappingInverter::from(
StrictlyMonotonicMappingToInternalGCDBaseval::new(10, 0),
),
);
assert_eq!(mapped.min_value(), 0u64);
assert_eq!(mapped.max_value(), 9900u64);
assert_eq!(mapped.num_vals(), 100);
let val_i64s: Vec<i64> = mapped.iter().collect();
assert_eq!(val_i64s.len(), 100);
let val_u64s: Vec<u64> = mapped.iter().collect();
assert_eq!(val_u64s.len(), 100);
for i in 0..100 {
assert_eq!(val_i64s[i as usize], mapped.get_val(i));
assert_eq!(val_i64s[i as usize], i64::from_u64(vals[i as usize]) * 10);
assert_eq!(val_u64s[i as usize], mapped.get_val(i));
assert_eq!(val_u64s[i as usize], vals[i as usize] * 10);
}
let mut buf = [0i64; 20];
let mut buf = [0u64; 20];
mapped.get_range(7, &mut buf[..]);
assert_eq!(&val_i64s[7..][..20], &buf);
assert_eq!(&val_u64s[7..][..20], &buf);
}
}

View File

@@ -57,7 +57,7 @@ fn num_bits(val: u128) -> u8 {
/// metadata.
pub fn get_compact_space(
values_deduped_sorted: &BTreeSet<u128>,
total_num_values: u64,
total_num_values: u32,
cost_per_blank: usize,
) -> CompactSpace {
let mut compact_space_builder = CompactSpaceBuilder::new();
@@ -208,7 +208,7 @@ impl CompactSpaceBuilder {
};
let covered_range_len = range_mapping.range_length();
ranges_mapping.push(range_mapping);
compact_start += covered_range_len as u64;
compact_start += covered_range_len;
}
// println!("num ranges {}", ranges_mapping.len());
CompactSpace { ranges_mapping }

View File

@@ -14,7 +14,7 @@ use std::{
cmp::Ordering,
collections::BTreeSet,
io::{self, Write},
ops::RangeInclusive,
ops::{Range, RangeInclusive},
};
use common::{BinarySerializable, CountingWriter, VInt, VIntU128};
@@ -97,7 +97,7 @@ impl BinarySerializable for CompactSpace {
};
let range_length = range_mapping.range_length();
ranges_mapping.push(range_mapping);
compact_start += range_length as u64;
compact_start += range_length;
}
Ok(Self { ranges_mapping })
@@ -165,16 +165,16 @@ pub struct IPCodecParams {
bit_unpacker: BitUnpacker,
min_value: u128,
max_value: u128,
num_vals: u64,
num_vals: u32,
num_bits: u8,
}
impl CompactSpaceCompressor {
/// Taking the vals as Vec may cost a lot of memory. It is used to sort the vals.
pub fn train_from(column: &impl Column<u128>) -> Self {
pub fn train_from(iter: impl Iterator<Item = u128>, num_vals: u32) -> Self {
let mut values_sorted = BTreeSet::new();
values_sorted.extend(column.iter());
let total_num_values = column.num_vals();
values_sorted.extend(iter);
let total_num_values = num_vals;
let compact_space =
get_compact_space(&values_sorted, total_num_values, COST_PER_BLANK_IN_BITS);
@@ -200,7 +200,7 @@ impl CompactSpaceCompressor {
bit_unpacker: BitUnpacker::new(num_bits),
min_value,
max_value,
num_vals: total_num_values as u64,
num_vals: total_num_values,
num_bits,
},
}
@@ -267,7 +267,7 @@ impl BinarySerializable for IPCodecParams {
let _header_flags = u64::deserialize(reader)?;
let min_value = VIntU128::deserialize(reader)?.0;
let max_value = VIntU128::deserialize(reader)?.0;
let num_vals = VIntU128::deserialize(reader)?.0 as u64;
let num_vals = VIntU128::deserialize(reader)?.0 as u32;
let num_bits = u8::deserialize(reader)?;
let compact_space = CompactSpace::deserialize(reader)?;
@@ -284,7 +284,7 @@ impl BinarySerializable for IPCodecParams {
impl Column<u128> for CompactSpaceDecompressor {
#[inline]
fn get_val(&self, doc: u64) -> u128 {
fn get_val(&self, doc: u32) -> u128 {
self.get(doc)
}
@@ -296,7 +296,7 @@ impl Column<u128> for CompactSpaceDecompressor {
self.max_value()
}
fn num_vals(&self) -> u64 {
fn num_vals(&self) -> u32 {
self.params.num_vals
}
@@ -304,8 +304,15 @@ impl Column<u128> for CompactSpaceDecompressor {
fn iter(&self) -> Box<dyn Iterator<Item = u128> + '_> {
Box::new(self.iter())
}
fn get_between_vals(&self, range: RangeInclusive<u128>) -> Vec<u64> {
self.get_between_vals(range)
#[inline]
fn get_docids_for_value_range(
&self,
value_range: RangeInclusive<u128>,
positions_range: Range<u32>,
positions: &mut Vec<u32>,
) {
self.get_positions_for_value_range(value_range, positions_range, positions)
}
}
@@ -340,12 +347,19 @@ impl CompactSpaceDecompressor {
/// Comparing on compact space: Real dataset 1.08 GElements/s
///
/// Comparing on original space: Real dataset .06 GElements/s (not completely optimized)
pub fn get_between_vals(&self, range: RangeInclusive<u128>) -> Vec<u64> {
if range.start() > range.end() {
return Vec::new();
#[inline]
pub fn get_positions_for_value_range(
&self,
value_range: RangeInclusive<u128>,
position_range: Range<u32>,
positions: &mut Vec<u32>,
) {
if value_range.start() > value_range.end() {
return;
}
let from_value = *range.start();
let to_value = *range.end();
let position_range = position_range.start..position_range.end.min(self.num_vals());
let from_value = *value_range.start();
let to_value = *value_range.end();
assert!(to_value >= from_value);
let compact_from = self.u128_to_compact(from_value);
let compact_to = self.u128_to_compact(to_value);
@@ -353,7 +367,7 @@ impl CompactSpaceDecompressor {
// Quick return, if both ranges fall into the same non-mapped space, the range can't cover
// any values, so we can early exit
match (compact_to, compact_from) {
(Err(pos1), Err(pos2)) if pos1 == pos2 => return Vec::new(),
(Err(pos1), Err(pos2)) if pos1 == pos2 => return,
_ => {}
}
@@ -375,19 +389,20 @@ impl CompactSpaceDecompressor {
});
let range = compact_from..=compact_to;
let mut positions = Vec::new();
let scan_num_docs = position_range.end - position_range.start;
let step_size = 4;
let cutoff = self.params.num_vals - self.params.num_vals % step_size;
let cutoff = position_range.start + scan_num_docs - scan_num_docs % step_size;
let mut push_if_in_range = |idx, val| {
if range.contains(&val) {
positions.push(idx);
}
};
let get_val = |idx| self.params.bit_unpacker.get(idx as u64, &self.data);
let get_val = |idx| self.params.bit_unpacker.get(idx, &self.data);
// unrolled loop
for idx in (0..cutoff).step_by(step_size as usize) {
for idx in (position_range.start..cutoff).step_by(step_size as usize) {
let idx1 = idx;
let idx2 = idx + 1;
let idx3 = idx + 2;
@@ -403,17 +418,14 @@ impl CompactSpaceDecompressor {
}
// handle rest
for idx in cutoff..self.params.num_vals {
for idx in cutoff..position_range.end {
push_if_in_range(idx, get_val(idx));
}
positions
}
#[inline]
fn iter_compact(&self) -> impl Iterator<Item = u64> + '_ {
(0..self.params.num_vals)
.map(move |idx| self.params.bit_unpacker.get(idx as u64, &self.data) as u64)
(0..self.params.num_vals).map(move |idx| self.params.bit_unpacker.get(idx, &self.data))
}
#[inline]
@@ -425,7 +437,7 @@ impl CompactSpaceDecompressor {
}
#[inline]
pub fn get(&self, idx: u64) -> u128 {
pub fn get(&self, idx: u32) -> u128 {
let compact = self.params.bit_unpacker.get(idx, &self.data);
self.compact_to_u128(compact)
}
@@ -443,7 +455,10 @@ impl CompactSpaceDecompressor {
mod tests {
use super::*;
use crate::{open_u128, serialize_u128, VecColumn};
use crate::format_version::read_format_version;
use crate::null_index_footer::read_null_index_footer;
use crate::serialize::U128Header;
use crate::{open_u128, serialize_u128};
#[test]
fn compact_space_test() {
@@ -452,7 +467,7 @@ mod tests {
]
.into_iter()
.collect();
let compact_space = get_compact_space(ips, ips.len() as u64, 11);
let compact_space = get_compact_space(ips, ips.len() as u32, 11);
let amplitude = compact_space.amplitude_compact_space();
assert_eq!(amplitude, 17);
assert_eq!(1, compact_space.u128_to_compact(2).unwrap());
@@ -483,24 +498,30 @@ mod tests {
#[test]
fn compact_space_amplitude_test() {
let ips = &[100000u128, 1000000].into_iter().collect();
let compact_space = get_compact_space(ips, ips.len() as u64, 1);
let compact_space = get_compact_space(ips, ips.len() as u32, 1);
let amplitude = compact_space.amplitude_compact_space();
assert_eq!(amplitude, 2);
}
fn test_all(data: OwnedBytes, expected: &[u128]) {
fn test_all(mut data: OwnedBytes, expected: &[u128]) {
let _header = U128Header::deserialize(&mut data);
let decompressor = CompactSpaceDecompressor::open(data).unwrap();
for (idx, expected_val) in expected.iter().cloned().enumerate() {
let val = decompressor.get(idx as u64);
let val = decompressor.get(idx as u32);
assert_eq!(val, expected_val);
let test_range = |range: RangeInclusive<u128>| {
let expected_positions = expected
.iter()
.positions(|val| range.contains(val))
.map(|pos| pos as u64)
.map(|pos| pos as u32)
.collect::<Vec<_>>();
let positions = decompressor.get_between_vals(range);
let mut positions = Vec::new();
decompressor.get_positions_for_value_range(
range,
0..decompressor.num_vals(),
&mut positions,
);
assert_eq!(positions, expected_positions);
};
@@ -513,10 +534,18 @@ mod tests {
fn test_aux_vals(u128_vals: &[u128]) -> OwnedBytes {
let mut out = Vec::new();
serialize_u128(VecColumn::from(u128_vals), &mut out).unwrap();
serialize_u128(
|| u128_vals.iter().cloned(),
u128_vals.len() as u32,
&mut out,
)
.unwrap();
let data = OwnedBytes::new(out);
let (data, _format_version) = read_format_version(data).unwrap();
let (data, _null_index_footer) = read_null_index_footer(data).unwrap();
test_all(data.clone(), u128_vals);
data
}
@@ -533,26 +562,111 @@ mod tests {
4_000_211_222u128,
333u128,
];
let data = test_aux_vals(vals);
let mut data = test_aux_vals(vals);
let _header = U128Header::deserialize(&mut data);
let decomp = CompactSpaceDecompressor::open(data).unwrap();
let positions = decomp.get_between_vals(0..=1);
let complete_range = 0..vals.len() as u32;
for (pos, val) in vals.iter().enumerate() {
let val = *val;
let pos = pos as u32;
let mut positions = Vec::new();
decomp.get_positions_for_value_range(val..=val, pos..pos + 1, &mut positions);
assert_eq!(positions, vec![pos]);
}
// handle docid range out of bounds
let positions = get_positions_for_value_range_helper(&decomp, 0..=1, 1..u32::MAX);
assert_eq!(positions, vec![]);
let positions =
get_positions_for_value_range_helper(&decomp, 0..=1, complete_range.clone());
assert_eq!(positions, vec![0]);
let positions = decomp.get_between_vals(0..=2);
let positions =
get_positions_for_value_range_helper(&decomp, 0..=2, complete_range.clone());
assert_eq!(positions, vec![0]);
let positions = decomp.get_between_vals(0..=3);
let positions =
get_positions_for_value_range_helper(&decomp, 0..=3, complete_range.clone());
assert_eq!(positions, vec![0, 2]);
assert_eq!(decomp.get_between_vals(99999u128..=99999u128), vec![3]);
assert_eq!(decomp.get_between_vals(99999u128..=100000u128), vec![3, 4]);
assert_eq!(decomp.get_between_vals(99998u128..=100000u128), vec![3, 4]);
assert_eq!(decomp.get_between_vals(99998u128..=99999u128), vec![3]);
assert_eq!(decomp.get_between_vals(99998u128..=99998u128), vec![]);
assert_eq!(decomp.get_between_vals(333u128..=333u128), vec![8]);
assert_eq!(decomp.get_between_vals(332u128..=333u128), vec![8]);
assert_eq!(decomp.get_between_vals(332u128..=334u128), vec![8]);
assert_eq!(decomp.get_between_vals(333u128..=334u128), vec![8]);
assert_eq!(
get_positions_for_value_range_helper(
&decomp,
99999u128..=99999u128,
complete_range.clone()
),
vec![3]
);
assert_eq!(
get_positions_for_value_range_helper(
&decomp,
99999u128..=100000u128,
complete_range.clone()
),
vec![3, 4]
);
assert_eq!(
get_positions_for_value_range_helper(
&decomp,
99998u128..=100000u128,
complete_range.clone()
),
vec![3, 4]
);
assert_eq!(
get_positions_for_value_range_helper(
&decomp,
99998u128..=99999u128,
complete_range.clone()
),
vec![3]
);
assert_eq!(
get_positions_for_value_range_helper(
&decomp,
99998u128..=99998u128,
complete_range.clone()
),
vec![]
);
assert_eq!(
get_positions_for_value_range_helper(
&decomp,
333u128..=333u128,
complete_range.clone()
),
vec![8]
);
assert_eq!(
get_positions_for_value_range_helper(
&decomp,
332u128..=333u128,
complete_range.clone()
),
vec![8]
);
assert_eq!(
get_positions_for_value_range_helper(
&decomp,
332u128..=334u128,
complete_range.clone()
),
vec![8]
);
assert_eq!(
get_positions_for_value_range_helper(
&decomp,
333u128..=334u128,
complete_range.clone()
),
vec![8]
);
assert_eq!(
decomp.get_between_vals(4_000_211_221u128..=5_000_000_000u128),
get_positions_for_value_range_helper(
&decomp,
4_000_211_221u128..=5_000_000_000u128,
complete_range
),
vec![6, 7]
);
}
@@ -575,14 +689,32 @@ mod tests {
4_000_211_222u128,
333u128,
];
let data = test_aux_vals(vals);
let mut data = test_aux_vals(vals);
let _header = U128Header::deserialize(&mut data);
let decomp = CompactSpaceDecompressor::open(data).unwrap();
let positions = decomp.get_between_vals(0..=5);
assert_eq!(positions, vec![]);
let positions = decomp.get_between_vals(0..=100);
assert_eq!(positions, vec![0]);
let positions = decomp.get_between_vals(0..=105);
assert_eq!(positions, vec![0]);
let complete_range = 0..vals.len() as u32;
assert_eq!(
get_positions_for_value_range_helper(&decomp, 0..=5, complete_range.clone()),
vec![]
);
assert_eq!(
get_positions_for_value_range_helper(&decomp, 0..=100, complete_range.clone()),
vec![0]
);
assert_eq!(
get_positions_for_value_range_helper(&decomp, 0..=105, complete_range),
vec![0]
);
}
fn get_positions_for_value_range_helper<C: Column<T> + ?Sized, T: PartialOrd>(
column: &C,
value_range: RangeInclusive<T>,
doc_id_range: Range<u32>,
) -> Vec<u32> {
let mut positions = Vec::new();
column.get_docids_for_value_range(value_range, doc_id_range, &mut positions);
positions
}
#[test]
@@ -603,13 +735,29 @@ mod tests {
5_000_000_000,
];
let mut out = Vec::new();
serialize_u128(VecColumn::from(vals), &mut out).unwrap();
let decomp = open_u128(OwnedBytes::new(out)).unwrap();
serialize_u128(|| vals.iter().cloned(), vals.len() as u32, &mut out).unwrap();
let decomp = open_u128::<u128>(OwnedBytes::new(out)).unwrap();
let complete_range = 0..vals.len() as u32;
assert_eq!(decomp.get_between_vals(199..=200), vec![0]);
assert_eq!(decomp.get_between_vals(199..=201), vec![0, 1]);
assert_eq!(decomp.get_between_vals(200..=200), vec![0]);
assert_eq!(decomp.get_between_vals(1_000_000..=1_000_000), vec![11]);
assert_eq!(
get_positions_for_value_range_helper(&*decomp, 199..=200, complete_range.clone()),
vec![0]
);
assert_eq!(
get_positions_for_value_range_helper(&*decomp, 199..=201, complete_range.clone()),
vec![0, 1]
);
assert_eq!(
get_positions_for_value_range_helper(&*decomp, 200..=200, complete_range.clone()),
vec![0]
);
assert_eq!(
get_positions_for_value_range_helper(&*decomp, 1_000_000..=1_000_000, complete_range),
vec![11]
);
}
#[test]

View File

@@ -0,0 +1,39 @@
use std::io;
use common::BinarySerializable;
use ownedbytes::OwnedBytes;
const MAGIC_NUMBER: u16 = 4335u16;
const FASTFIELD_FORMAT_VERSION: u8 = 1;
pub(crate) fn append_format_version(output: &mut impl io::Write) -> io::Result<()> {
FASTFIELD_FORMAT_VERSION.serialize(output)?;
MAGIC_NUMBER.serialize(output)?;
Ok(())
}
pub(crate) fn read_format_version(data: OwnedBytes) -> io::Result<(OwnedBytes, u8)> {
let (data, magic_number_bytes) = data.rsplit(2);
let magic_number = u16::deserialize(&mut magic_number_bytes.as_slice())?;
if magic_number != MAGIC_NUMBER {
return Err(io::Error::new(
io::ErrorKind::InvalidData,
format!("magic number mismatch {} != {}", magic_number, MAGIC_NUMBER),
));
}
let (data, format_version_bytes) = data.rsplit(1);
let format_version = u8::deserialize(&mut format_version_bytes.as_slice())?;
if format_version > FASTFIELD_FORMAT_VERSION {
return Err(io::Error::new(
io::ErrorKind::InvalidData,
format!(
"Unsupported fastfield format version: {}. Max supported version: {}",
format_version, FASTFIELD_FORMAT_VERSION
),
));
}
Ok((data, format_version))
}

View File

@@ -1,5 +1,12 @@
#![warn(missing_docs)]
#![cfg_attr(all(feature = "unstable", test), feature(test))]
//! # `fastfield_codecs`
//!
//! - Columnar storage of data for tantivy [`Column`].
//! - Encode data in different codecs.
//! - Monotonically map values to u64/u128
#[cfg(test)]
#[macro_use]
extern crate more_asserts;
@@ -13,34 +20,55 @@ use std::sync::Arc;
use common::BinarySerializable;
use compact_space::CompactSpaceDecompressor;
use format_version::read_format_version;
use monotonic_mapping::{
StrictlyMonotonicMappingInverter, StrictlyMonotonicMappingToInternal,
StrictlyMonotonicMappingToInternalBaseval, StrictlyMonotonicMappingToInternalGCDBaseval,
};
use null_index_footer::read_null_index_footer;
use ownedbytes::OwnedBytes;
use serialize::Header;
use serialize::{Header, U128Header};
mod bitpacked;
mod blockwise_linear;
mod compact_space;
mod format_version;
mod line;
mod linear;
mod monotonic_mapping;
mod monotonic_mapping_u128;
#[allow(dead_code)]
mod null_index;
mod null_index_footer;
mod column;
mod gcd;
mod serialize;
pub mod serialize;
pub use ordered_float;
use self::bitpacked::BitpackedCodec;
use self::blockwise_linear::BlockwiseLinearCodec;
pub use self::column::{monotonic_map_column, Column, VecColumn};
pub use self::column::{monotonic_map_column, Column, IterColumn, VecColumn};
use self::linear::LinearCodec;
pub use self::monotonic_mapping::MonotonicallyMappableToU64;
pub use self::monotonic_mapping::{MonotonicallyMappableToU64, StrictlyMonotonicFn};
pub use self::monotonic_mapping_u128::MonotonicallyMappableToU128;
pub use self::serialize::{
estimate, serialize, serialize_and_load, serialize_u128, NormalizedHeader,
};
#[derive(PartialEq, Eq, PartialOrd, Ord, Debug, Clone, Copy)]
#[repr(u8)]
/// Available codecs to use to encode the u64 (via [`MonotonicallyMappableToU64`]) converted data.
pub enum FastFieldCodecType {
/// Bitpack all values in the value range. The number of bits is defined by the amplitude
/// `column.max_value() - column.min_value()`
Bitpacked = 1,
/// Linear interpolation puts a line between the first and last value and then bitpacks the
/// values by the offset from the line. The number of bits is defined by the max deviation from
/// the line.
Linear = 2,
/// Same as [`FastFieldCodecType::Linear`], but encodes in blocks of 512 elements.
BlockwiseLinear = 3,
}
@@ -58,11 +86,11 @@ impl BinarySerializable for FastFieldCodecType {
}
impl FastFieldCodecType {
pub fn to_code(self) -> u8 {
pub(crate) fn to_code(self) -> u8 {
self as u8
}
pub fn from_code(code: u8) -> Option<Self> {
pub(crate) fn from_code(code: u8) -> Option<Self> {
match code {
1 => Some(Self::Bitpacked),
2 => Some(Self::Linear),
@@ -72,15 +100,59 @@ impl FastFieldCodecType {
}
}
/// Returns the correct codec reader wrapped in the `Arc` for the data.
pub fn open_u128(bytes: OwnedBytes) -> io::Result<Arc<dyn Column<u128>>> {
Ok(Arc::new(CompactSpaceDecompressor::open(bytes)?))
#[derive(PartialEq, Eq, PartialOrd, Ord, Debug, Clone, Copy)]
#[repr(u8)]
/// Available codecs to use to encode the u128 (via [`MonotonicallyMappableToU128`]) converted data.
pub enum U128FastFieldCodecType {
/// This codec takes a large number space (u128) and reduces it to a compact number space, by
/// removing the holes.
CompactSpace = 1,
}
impl BinarySerializable for U128FastFieldCodecType {
fn serialize<W: Write>(&self, wrt: &mut W) -> io::Result<()> {
self.to_code().serialize(wrt)
}
fn deserialize<R: io::Read>(reader: &mut R) -> io::Result<Self> {
let code = u8::deserialize(reader)?;
let codec_type: Self = Self::from_code(code)
.ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "Unknown code `{code}.`"))?;
Ok(codec_type)
}
}
impl U128FastFieldCodecType {
pub(crate) fn to_code(self) -> u8 {
self as u8
}
pub(crate) fn from_code(code: u8) -> Option<Self> {
match code {
1 => Some(Self::CompactSpace),
_ => None,
}
}
}
/// Returns the correct codec reader wrapped in the `Arc` for the data.
pub fn open<T: MonotonicallyMappableToU64>(
mut bytes: OwnedBytes,
) -> io::Result<Arc<dyn Column<T>>> {
pub fn open_u128<Item: MonotonicallyMappableToU128>(
bytes: OwnedBytes,
) -> io::Result<Arc<dyn Column<Item>>> {
let (bytes, _format_version) = read_format_version(bytes)?;
let (mut bytes, _null_index_footer) = read_null_index_footer(bytes)?;
let header = U128Header::deserialize(&mut bytes)?;
assert_eq!(header.codec_type, U128FastFieldCodecType::CompactSpace);
let reader = CompactSpaceDecompressor::open(bytes)?;
let inverted: StrictlyMonotonicMappingInverter<StrictlyMonotonicMappingToInternal<Item>> =
StrictlyMonotonicMappingToInternal::<Item>::new().into();
Ok(Arc::new(monotonic_map_column(reader, inverted)))
}
/// Returns the correct codec reader wrapped in the `Arc` for the data.
pub fn open<T: MonotonicallyMappableToU64>(bytes: OwnedBytes) -> io::Result<Arc<dyn Column<T>>> {
let (bytes, _format_version) = read_format_version(bytes)?;
let (mut bytes, _null_index_footer) = read_null_index_footer(bytes)?;
let header = Header::deserialize(&mut bytes)?;
match header.codec_type {
FastFieldCodecType::Bitpacked => open_specific_codec::<BitpackedCodec, _>(bytes, &header),
@@ -99,11 +171,15 @@ fn open_specific_codec<C: FastFieldCodec, Item: MonotonicallyMappableToU64>(
let reader = C::open_from_bytes(bytes, normalized_header)?;
let min_value = header.min_value;
if let Some(gcd) = header.gcd {
let monotonic_mapping = move |val: u64| Item::from_u64(min_value + val * gcd.get());
Ok(Arc::new(monotonic_map_column(reader, monotonic_mapping)))
let mapping = StrictlyMonotonicMappingInverter::from(
StrictlyMonotonicMappingToInternalGCDBaseval::new(gcd.get(), min_value),
);
Ok(Arc::new(monotonic_map_column(reader, mapping)))
} else {
let monotonic_mapping = move |val: u64| Item::from_u64(min_value + val);
Ok(Arc::new(monotonic_map_column(reader, monotonic_mapping)))
let mapping = StrictlyMonotonicMappingInverter::from(
StrictlyMonotonicMappingToInternalBaseval::new(min_value),
);
Ok(Arc::new(monotonic_map_column(reader, mapping)))
}
}
@@ -123,7 +199,7 @@ trait FastFieldCodec: 'static {
///
/// The column iterator should be preferred over using column `get_val` method for
/// performance reasons.
fn serialize(column: &dyn Column<u64>, write: &mut impl Write) -> io::Result<()>;
fn serialize(column: &dyn Column, write: &mut impl Write) -> io::Result<()>;
/// Returns an estimate of the compression ratio.
/// If the codec is not applicable, returns `None`.
@@ -132,9 +208,10 @@ trait FastFieldCodec: 'static {
///
/// It could make sense to also return a value representing
/// computational complexity.
fn estimate(column: &impl Column) -> Option<f32>;
fn estimate(column: &dyn Column) -> Option<f32>;
}
/// The list of all available codecs for u64 convertible data.
pub const ALL_CODEC_TYPES: [FastFieldCodecType; 3] = [
FastFieldCodecType::Bitpacked,
FastFieldCodecType::BlockwiseLinear,
@@ -143,6 +220,7 @@ pub const ALL_CODEC_TYPES: [FastFieldCodecType; 3] = [
#[cfg(test)]
mod tests {
use proptest::prelude::*;
use proptest::strategy::Strategy;
use proptest::{prop_oneof, proptest};
@@ -168,15 +246,32 @@ mod tests {
let actual_compression = out.len() as f32 / (data.len() as f32 * 8.0);
let reader = crate::open::<u64>(OwnedBytes::new(out)).unwrap();
assert_eq!(reader.num_vals(), data.len() as u64);
assert_eq!(reader.num_vals(), data.len() as u32);
for (doc, orig_val) in data.iter().copied().enumerate() {
let val = reader.get_val(doc as u64);
let val = reader.get_val(doc as u32);
assert_eq!(
val, orig_val,
"val `{val}` does not match orig_val {orig_val:?}, in data set {name}, data \
`{data:?}`",
);
}
if !data.is_empty() {
let test_rand_idx = rand::thread_rng().gen_range(0..=data.len() - 1);
let expected_positions: Vec<u32> = data
.iter()
.enumerate()
.filter(|(_, el)| **el == data[test_rand_idx])
.map(|(pos, _)| pos as u32)
.collect();
let mut positions = Vec::new();
reader.get_docids_for_value_range(
data[test_rand_idx]..=data[test_rand_idx],
0..data.len() as u32,
&mut positions,
);
assert_eq!(expected_positions, positions);
}
Some((estimation, actual_compression))
}
@@ -312,7 +407,7 @@ mod tests {
#[test]
fn estimation_test_bad_interpolation_case_monotonically_increasing() {
let mut data: Vec<u64> = (200..=20000_u64).collect();
let mut data: Vec<u64> = (201..=20000_u64).collect();
data.push(1_000_000);
let data: VecColumn = data.as_slice().into();
@@ -386,7 +481,7 @@ mod bench {
b.iter(|| {
let mut sum = 0u64;
for pos in value_iter() {
let val = col.get_val(pos as u64);
let val = col.get_val(pos as u32);
sum = sum.wrapping_add(val);
}
sum
@@ -398,7 +493,7 @@ mod bench {
b.iter(|| {
let mut sum = 0u64;
for pos in value_iter() {
let val = col.get_val(pos as u64);
let val = col.get_val(pos as u32);
sum = sum.wrapping_add(val);
}
sum

View File

@@ -1,5 +1,5 @@
use std::io;
use std::num::NonZeroU64;
use std::num::NonZeroU32;
use common::{BinarySerializable, VInt};
@@ -29,7 +29,7 @@ pub struct Line {
/// compute_slope(y0, y1)
/// = compute_slope(y0 + X % 2^64, y1 + X % 2^64)
/// `
fn compute_slope(y0: u64, y1: u64, num_vals: NonZeroU64) -> u64 {
fn compute_slope(y0: u64, y1: u64, num_vals: NonZeroU32) -> u64 {
let dy = y1.wrapping_sub(y0);
let sign = dy <= (1 << 63);
let abs_dy = if sign {
@@ -43,7 +43,7 @@ fn compute_slope(y0: u64, y1: u64, num_vals: NonZeroU64) -> u64 {
return 0u64;
}
let abs_slope = (abs_dy << 32) / num_vals.get();
let abs_slope = (abs_dy << 32) / num_vals.get() as u64;
if sign {
abs_slope
} else {
@@ -62,29 +62,43 @@ fn compute_slope(y0: u64, y1: u64, num_vals: NonZeroU64) -> u64 {
impl Line {
#[inline(always)]
pub fn eval(&self, x: u64) -> u64 {
let linear_part = (x.wrapping_mul(self.slope) >> 32) as i32 as u64;
pub fn eval(&self, x: u32) -> u64 {
let linear_part = ((x as u64).wrapping_mul(self.slope) >> 32) as i32 as u64;
self.intercept.wrapping_add(linear_part)
}
// Same as train, but the intercept is only estimated from provided sample positions
pub fn estimate(ys: &dyn Column, sample_positions: &[u64]) -> Self {
Self::train_from(ys, sample_positions.iter().cloned())
pub fn estimate(sample_positions_and_values: &[(u64, u64)]) -> Self {
let first_val = sample_positions_and_values[0].1;
let last_val = sample_positions_and_values[sample_positions_and_values.len() - 1].1;
let num_vals = sample_positions_and_values[sample_positions_and_values.len() - 1].0 + 1;
Self::train_from(
first_val,
last_val,
num_vals as u32,
sample_positions_and_values.iter().cloned(),
)
}
// Intercept is only computed from provided positions
fn train_from(ys: &dyn Column, positions: impl Iterator<Item = u64>) -> Self {
let num_vals = if let Some(num_vals) = NonZeroU64::new(ys.num_vals() - 1) {
num_vals
fn train_from(
first_val: u64,
last_val: u64,
num_vals: u32,
positions_and_values: impl Iterator<Item = (u64, u64)>,
) -> Self {
// TODO replace with let else
let idx_last_val = if let Some(idx_last_val) = NonZeroU32::new(num_vals - 1) {
idx_last_val
} else {
return Line::default();
};
let y0 = ys.get_val(0);
let y1 = ys.get_val(num_vals.get());
let y0 = first_val;
let y1 = last_val;
// We first independently pick our slope.
let slope = compute_slope(y0, y1, num_vals);
let slope = compute_slope(y0, y1, idx_last_val);
// We picked our slope. Note that it does not have to be perfect.
// Now we need to compute the best intercept.
@@ -114,11 +128,8 @@ impl Line {
intercept: 0,
};
let heuristic_shift = y0.wrapping_sub(MID_POINT);
line.intercept = positions
.map(|pos| {
let y = ys.get_val(pos);
y.wrapping_sub(line.eval(pos))
})
line.intercept = positions_and_values
.map(|(pos, y)| y.wrapping_sub(line.eval(pos as u32)))
.min_by_key(|&val| val.wrapping_sub(heuristic_shift))
.unwrap_or(0u64); //< Never happens.
line
@@ -135,7 +146,14 @@ impl Line {
/// This function is only invariable by translation if all of the
/// `ys` are packaged into half of the space. (See heuristic below)
pub fn train(ys: &dyn Column) -> Self {
Self::train_from(ys, 0..ys.num_vals())
let first_val = ys.iter().next().unwrap();
let last_val = ys.iter().nth(ys.num_vals() as usize - 1).unwrap();
Self::train_from(
first_val,
last_val,
ys.num_vals(),
ys.iter().enumerate().map(|(pos, val)| (pos as u64, val)),
)
}
}
@@ -181,7 +199,7 @@ mod tests {
let line = Line::train(&VecColumn::from(&ys));
ys.iter()
.enumerate()
.map(|(x, y)| y.wrapping_sub(line.eval(x as u64)))
.map(|(x, y)| y.wrapping_sub(line.eval(x as u32)))
.max()
}

View File

@@ -19,7 +19,7 @@ pub struct LinearReader {
impl Column for LinearReader {
#[inline]
fn get_val(&self, doc: u64) -> u64 {
fn get_val(&self, doc: u32) -> u64 {
let interpoled_val: u64 = self.linear_params.line.eval(doc);
let bitpacked_diff = self.linear_params.bit_unpacker.get(doc, &self.data);
interpoled_val.wrapping_add(bitpacked_diff)
@@ -37,7 +37,7 @@ impl Column for LinearReader {
}
#[inline]
fn num_vals(&self) -> u64 {
fn num_vals(&self) -> u32 {
self.header.num_vals
}
}
@@ -93,7 +93,7 @@ impl FastFieldCodec for LinearCodec {
.iter()
.enumerate()
.map(|(pos, actual_value)| {
let calculated_value = line.eval(pos as u64);
let calculated_value = line.eval(pos as u32);
actual_value.wrapping_sub(calculated_value)
})
.max()
@@ -108,7 +108,7 @@ impl FastFieldCodec for LinearCodec {
let mut bit_packer = BitPacker::new();
for (pos, actual_value) in column.iter().enumerate() {
let calculated_value = line.eval(pos as u64);
let calculated_value = line.eval(pos as u32);
let offset = actual_value.wrapping_sub(calculated_value);
bit_packer.write(offset, num_bits, write)?;
}
@@ -121,24 +121,26 @@ impl FastFieldCodec for LinearCodec {
/// where the local maxima for the deviation of the calculated value are and
/// the offset to shift all values to >=0 is also unknown.
#[allow(clippy::question_mark)]
fn estimate(column: &impl Column) -> Option<f32> {
fn estimate(column: &dyn Column) -> Option<f32> {
if column.num_vals() < 3 {
return None; // disable compressor for this case
}
// let's sample at 0%, 5%, 10% .. 95%, 100%
let num_vals = column.num_vals() as f32 / 100.0;
let sample_positions = (0..20)
.map(|pos| (num_vals * pos as f32 * 5.0) as u64)
.collect::<Vec<_>>();
let limit_num_vals = column.num_vals().min(100_000);
let line = Line::estimate(column, &sample_positions);
let num_samples = 100;
let step_size = (limit_num_vals / num_samples).max(1); // 20 samples
let mut sample_positions_and_values: Vec<_> = Vec::new();
for (pos, val) in column.iter().enumerate().step_by(step_size as usize) {
sample_positions_and_values.push((pos as u64, val));
}
let estimated_bit_width = sample_positions
let line = Line::estimate(&sample_positions_and_values);
let estimated_bit_width = sample_positions_and_values
.into_iter()
.map(|pos| {
let actual_value = column.get_val(pos);
let interpolated_val = line.eval(pos as u64);
.map(|(pos, actual_value)| {
let interpolated_val = line.eval(pos as u32);
actual_value.wrapping_sub(interpolated_val)
})
.map(|diff| ((diff as f32 * 1.5) * 2.0) as u64)
@@ -146,6 +148,7 @@ impl FastFieldCodec for LinearCodec {
.max()
.unwrap_or(0);
// Extrapolate to whole column
let num_bits = (estimated_bit_width as u64 * column.num_vals() as u64) + 64;
let num_bits_uncompressed = 64 * column.num_vals();
Some(num_bits as f32 / num_bits_uncompressed as f32)

View File

@@ -90,7 +90,7 @@ fn bench_ip() {
{
let mut data = vec![];
for dataset in dataset.chunks(500_000) {
serialize_u128(VecColumn::from(dataset), &mut data).unwrap();
serialize_u128(|| dataset.iter().cloned(), dataset.len() as u32, &mut data).unwrap();
}
let compression = data.len() as f64 / (dataset.len() * 16) as f64;
println!("Compression 50_000 chunks {:.4}", compression);
@@ -101,7 +101,10 @@ fn bench_ip() {
}
let mut data = vec![];
serialize_u128(VecColumn::from(&dataset), &mut data).unwrap();
{
print_time!("creation");
serialize_u128(|| dataset.iter().cloned(), dataset.len() as u32, &mut data).unwrap();
}
let compression = data.len() as f64 / (dataset.len() * 16) as f64;
println!("Compression {:.2}", compression);
@@ -110,11 +113,17 @@ fn bench_ip() {
(data.len() * 8) as f32 / dataset.len() as f32
);
let decompressor = open_u128(OwnedBytes::new(data)).unwrap();
let decompressor = open_u128::<u128>(OwnedBytes::new(data)).unwrap();
// Sample some ranges
let mut doc_values = Vec::new();
for value in dataset.iter().take(1110).skip(1100).cloned() {
doc_values.clear();
print_time!("get range");
let doc_values = decompressor.get_between_vals(value..=value);
decompressor.get_docids_for_value_range(
value..=value,
0..decompressor.num_vals(),
&mut doc_values,
);
println!("{:?}", doc_values.len());
}
}

View File

@@ -1,3 +1,12 @@
use std::marker::PhantomData;
use fastdivide::DividerU64;
use ordered_float::NotNan;
use crate::MonotonicallyMappableToU128;
/// Monotonic maps a value to u64 value space.
/// Monotonic mapping enables `PartialOrd` on u64 space without conversion to original space.
pub trait MonotonicallyMappableToU64: 'static + PartialOrd + Copy + Send + Sync {
/// Converts a value to u64.
///
@@ -11,6 +20,145 @@ pub trait MonotonicallyMappableToU64: 'static + PartialOrd + Copy + Send + Sync
fn from_u64(val: u64) -> Self;
}
/// Values need to be strictly monotonic mapped to a `Internal` value (u64 or u128) that can be
/// used in fast field codecs.
///
/// The monotonic mapping is required so that `PartialOrd` can be used on `Internal` without
/// converting to `External`.
///
/// All strictly monotonic functions are invertible because they are guaranteed to have a one-to-one
/// mapping from their range to their domain. The `inverse` method is required when opening a codec,
/// so a value can be converted back to its original domain (e.g. ip address or f64) from its
/// internal representation.
pub trait StrictlyMonotonicFn<External, Internal> {
/// Strictly monotonically maps the value from External to Internal.
fn mapping(&self, inp: External) -> Internal;
/// Inverse of `mapping`. Maps the value from Internal to External.
fn inverse(&self, out: Internal) -> External;
}
/// Inverts a strictly monotonic mapping from `StrictlyMonotonicFn<A, B>` to
/// `StrictlyMonotonicFn<B, A>`.
///
/// # Warning
///
/// This type comes with a footgun. A type being strictly monotonic does not impose that the inverse
/// mapping is strictly monotonic over the entire space External. e.g. a -> a * 2. Use at your own
/// risks.
pub(crate) struct StrictlyMonotonicMappingInverter<T> {
orig_mapping: T,
}
impl<T> From<T> for StrictlyMonotonicMappingInverter<T> {
fn from(orig_mapping: T) -> Self {
Self { orig_mapping }
}
}
impl<From, To, T> StrictlyMonotonicFn<To, From> for StrictlyMonotonicMappingInverter<T>
where T: StrictlyMonotonicFn<From, To>
{
fn mapping(&self, val: To) -> From {
self.orig_mapping.inverse(val)
}
fn inverse(&self, val: From) -> To {
self.orig_mapping.mapping(val)
}
}
/// Applies the strictly monotonic mapping from `T` without any additional changes.
pub(crate) struct StrictlyMonotonicMappingToInternal<T> {
_phantom: PhantomData<T>,
}
impl<T> StrictlyMonotonicMappingToInternal<T> {
pub(crate) fn new() -> StrictlyMonotonicMappingToInternal<T> {
Self {
_phantom: PhantomData,
}
}
}
impl<External: MonotonicallyMappableToU128, T: MonotonicallyMappableToU128>
StrictlyMonotonicFn<External, u128> for StrictlyMonotonicMappingToInternal<T>
where T: MonotonicallyMappableToU128
{
fn mapping(&self, inp: External) -> u128 {
External::to_u128(inp)
}
fn inverse(&self, out: u128) -> External {
External::from_u128(out)
}
}
impl<External: MonotonicallyMappableToU64, T: MonotonicallyMappableToU64>
StrictlyMonotonicFn<External, u64> for StrictlyMonotonicMappingToInternal<T>
where T: MonotonicallyMappableToU64
{
fn mapping(&self, inp: External) -> u64 {
External::to_u64(inp)
}
fn inverse(&self, out: u64) -> External {
External::from_u64(out)
}
}
/// Mapping dividing by gcd and a base value.
///
/// The function is assumed to be only called on values divided by passed
/// gcd value. (It is necessary for the function to be monotonic.)
pub(crate) struct StrictlyMonotonicMappingToInternalGCDBaseval {
gcd_divider: DividerU64,
gcd: u64,
min_value: u64,
}
impl StrictlyMonotonicMappingToInternalGCDBaseval {
pub(crate) fn new(gcd: u64, min_value: u64) -> Self {
let gcd_divider = DividerU64::divide_by(gcd);
Self {
gcd_divider,
gcd,
min_value,
}
}
}
impl<External: MonotonicallyMappableToU64> StrictlyMonotonicFn<External, u64>
for StrictlyMonotonicMappingToInternalGCDBaseval
{
fn mapping(&self, inp: External) -> u64 {
self.gcd_divider
.divide(External::to_u64(inp) - self.min_value)
}
fn inverse(&self, out: u64) -> External {
External::from_u64(self.min_value + out * self.gcd)
}
}
/// Strictly monotonic mapping with a base value.
pub(crate) struct StrictlyMonotonicMappingToInternalBaseval {
min_value: u64,
}
impl StrictlyMonotonicMappingToInternalBaseval {
pub(crate) fn new(min_value: u64) -> Self {
Self { min_value }
}
}
impl<External: MonotonicallyMappableToU64> StrictlyMonotonicFn<External, u64>
for StrictlyMonotonicMappingToInternalBaseval
{
fn mapping(&self, val: External) -> u64 {
External::to_u64(val) - self.min_value
}
fn inverse(&self, val: u64) -> External {
External::from_u64(self.min_value + val)
}
}
impl MonotonicallyMappableToU64 for u64 {
fn to_u64(self) -> u64 {
self
@@ -36,11 +184,7 @@ impl MonotonicallyMappableToU64 for i64 {
impl MonotonicallyMappableToU64 for bool {
#[inline(always)]
fn to_u64(self) -> u64 {
if self {
1
} else {
0
}
u64::from(self)
}
#[inline(always)]
@@ -49,6 +193,8 @@ impl MonotonicallyMappableToU64 for bool {
}
}
// TODO remove me.
// Tantivy should refuse NaN values and work with NotNaN internally.
impl MonotonicallyMappableToU64 for f64 {
fn to_u64(self) -> u64 {
common::f64_to_u64(self)
@@ -58,3 +204,64 @@ impl MonotonicallyMappableToU64 for f64 {
common::u64_to_f64(val)
}
}
impl MonotonicallyMappableToU64 for ordered_float::NotNan<f64> {
fn to_u64(self) -> u64 {
common::f64_to_u64(self.into_inner())
}
fn from_u64(val: u64) -> Self {
NotNan::new(common::u64_to_f64(val)).expect("Invalid NotNaN f64 value.")
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_from_u64_pos_inf() {
let inf_as_u64 = common::f64_to_u64(f64::INFINITY);
let inf_back_to_f64 = NotNan::from_u64(inf_as_u64);
assert_eq!(inf_back_to_f64, NotNan::new(f64::INFINITY).unwrap());
}
#[test]
fn test_from_u64_neg_inf() {
let inf_as_u64 = common::f64_to_u64(-f64::INFINITY);
let inf_back_to_f64 = NotNan::from_u64(inf_as_u64);
assert_eq!(inf_back_to_f64, NotNan::new(-f64::INFINITY).unwrap());
}
#[test]
#[should_panic(expected = "Invalid NotNaN")]
fn test_from_u64_nan_panics() {
let nan_as_u64 = common::f64_to_u64(f64::NAN);
NotNan::from_u64(nan_as_u64);
}
#[test]
fn strictly_monotonic_test() {
// identity mapping
test_round_trip(&StrictlyMonotonicMappingToInternal::<u64>::new(), 100u64);
// round trip to i64
test_round_trip(&StrictlyMonotonicMappingToInternal::<i64>::new(), 100u64);
// identity mapping
test_round_trip(&StrictlyMonotonicMappingToInternal::<u128>::new(), 100u128);
// base value to i64 round trip
let mapping = StrictlyMonotonicMappingToInternalBaseval::new(100);
test_round_trip::<_, _, u64>(&mapping, 100i64);
// base value and gcd to u64 round trip
let mapping = StrictlyMonotonicMappingToInternalGCDBaseval::new(10, 100);
test_round_trip::<_, _, u64>(&mapping, 100u64);
}
fn test_round_trip<T: StrictlyMonotonicFn<K, L>, K: std::fmt::Debug + Eq + Copy, L>(
mapping: &T,
test_val: K,
) {
assert_eq!(mapping.inverse(mapping.mapping(test_val)), test_val);
}
}

View File

@@ -0,0 +1,40 @@
use std::net::Ipv6Addr;
/// Montonic maps a value to u128 value space
/// Monotonic mapping enables `PartialOrd` on u128 space without conversion to original space.
pub trait MonotonicallyMappableToU128: 'static + PartialOrd + Copy + Send + Sync {
/// Converts a value to u128.
///
/// Internally all fast field values are encoded as u64.
fn to_u128(self) -> u128;
/// Converts a value from u128
///
/// Internally all fast field values are encoded as u64.
/// **Note: To be used for converting encoded Term, Posting values.**
fn from_u128(val: u128) -> Self;
}
impl MonotonicallyMappableToU128 for u128 {
fn to_u128(self) -> u128 {
self
}
fn from_u128(val: u128) -> Self {
val
}
}
impl MonotonicallyMappableToU128 for Ipv6Addr {
fn to_u128(self) -> u128 {
ip_to_u128(self)
}
fn from_u128(val: u128) -> Self {
Ipv6Addr::from(val.to_be_bytes())
}
}
fn ip_to_u128(ip_addr: Ipv6Addr) -> u128 {
u128::from_be_bytes(ip_addr.octets())
}

View File

@@ -0,0 +1,454 @@
use std::convert::TryInto;
use std::io::{self, Write};
use common::BinarySerializable;
use itertools::Itertools;
use ownedbytes::OwnedBytes;
use super::{get_bit_at, set_bit_at};
/// For the `DenseCodec`, `data` which contains the encoded blocks.
/// Each block consists of [u8; 12]. The first 8 bytes is a bitvec for 64 elements.
/// The last 4 bytes are the offset, the number of set bits so far.
///
/// When translating the original index to a dense index, the correct block can be computed
/// directly `orig_idx/64`. Inside the block the position is `orig_idx%64`.
///
/// When translating a dense index to the original index, we can use the offset to find the correct
/// block. Direct computation is not possible, but we can employ a linear or binary search.
#[derive(Clone)]
pub struct DenseCodec {
// data consists of blocks of 64 bits.
//
// The format is &[(u64, u32)]
// u64 is the bitvec
// u32 is the offset of the block, the number of set bits so far.
//
// At the end one block is appended, to store the number of values in the index in offset.
data: OwnedBytes,
}
const ELEMENTS_PER_BLOCK: u32 = 64;
const BLOCK_BITVEC_SIZE: usize = 8;
const BLOCK_OFFSET_SIZE: usize = 4;
const SERIALIZED_BLOCK_SIZE: usize = BLOCK_BITVEC_SIZE + BLOCK_OFFSET_SIZE;
#[inline]
fn count_ones(bitvec: u64, pos_in_bitvec: u32) -> u32 {
if pos_in_bitvec == 63 {
bitvec.count_ones()
} else {
let mask = (1u64 << (pos_in_bitvec + 1)) - 1;
let masked_bitvec = bitvec & mask;
masked_bitvec.count_ones()
}
}
#[derive(Clone, Copy)]
struct DenseIndexBlock {
bitvec: u64,
offset: u32,
}
impl From<[u8; SERIALIZED_BLOCK_SIZE]> for DenseIndexBlock {
fn from(data: [u8; SERIALIZED_BLOCK_SIZE]) -> Self {
let bitvec = u64::from_le_bytes(data[..BLOCK_BITVEC_SIZE].try_into().unwrap());
let offset = u32::from_le_bytes(data[BLOCK_BITVEC_SIZE..].try_into().unwrap());
Self { bitvec, offset }
}
}
impl DenseCodec {
/// Open the DenseCodec from OwnedBytes
pub fn open(data: OwnedBytes) -> Self {
Self { data }
}
#[inline]
/// Check if value at position is not null.
pub fn exists(&self, idx: u32) -> bool {
let block_pos = idx / ELEMENTS_PER_BLOCK;
let bitvec = self.dense_index_block(block_pos).bitvec;
let pos_in_bitvec = idx % ELEMENTS_PER_BLOCK;
get_bit_at(bitvec, pos_in_bitvec)
}
#[inline]
fn dense_index_block(&self, block_pos: u32) -> DenseIndexBlock {
dense_index_block(&self.data, block_pos)
}
/// Return the number of non-null values in an index
pub fn num_non_nulls(&self) -> u32 {
let last_block = (self.data.len() / SERIALIZED_BLOCK_SIZE) - 1;
self.dense_index_block(last_block as u32).offset
}
#[inline]
/// Translate from the original index to the codec index.
pub fn translate_to_codec_idx(&self, idx: u32) -> Option<u32> {
let block_pos = idx / ELEMENTS_PER_BLOCK;
let index_block = self.dense_index_block(block_pos);
let pos_in_block_bit_vec = idx % ELEMENTS_PER_BLOCK;
let ones_in_block = count_ones(index_block.bitvec, pos_in_block_bit_vec);
if get_bit_at(index_block.bitvec, pos_in_block_bit_vec) {
// -1 is ok, since idx does exist, so there's at least one
Some(index_block.offset + ones_in_block - 1)
} else {
None
}
}
/// Translate positions from the codec index to the original index.
///
/// # Panics
///
/// May panic if any `idx` is greater than the max codec index.
pub fn translate_codec_idx_to_original_idx<'a>(
&'a self,
iter: impl Iterator<Item = u32> + 'a,
) -> impl Iterator<Item = u32> + 'a {
let mut block_pos = 0u32;
iter.map(move |dense_idx| {
// update block_pos to limit search scope
block_pos = find_block(dense_idx, block_pos, &self.data);
let index_block = self.dense_index_block(block_pos);
// The next offset is higher than dense_idx and therefore:
// dense_idx <= offset + num_set_bits in block
let mut num_set_bits = 0;
for idx_in_bitvec in 0..ELEMENTS_PER_BLOCK {
if get_bit_at(index_block.bitvec, idx_in_bitvec) {
num_set_bits += 1;
}
if num_set_bits == (dense_idx - index_block.offset + 1) {
let orig_idx = block_pos * ELEMENTS_PER_BLOCK + idx_in_bitvec;
return orig_idx;
}
}
panic!("Internal Error: Offset calculation in dense idx seems to be wrong.");
})
}
}
#[inline]
fn dense_index_block(data: &[u8], block_pos: u32) -> DenseIndexBlock {
let data_start_pos = block_pos as usize * SERIALIZED_BLOCK_SIZE;
let block_data: [u8; SERIALIZED_BLOCK_SIZE] = data[data_start_pos..][..SERIALIZED_BLOCK_SIZE]
.try_into()
.unwrap();
block_data.into()
}
#[inline]
/// Finds the block position containing the dense_idx.
///
/// # Correctness
/// dense_idx needs to be smaller than the number of values in the index
///
/// The last offset number is equal to the number of values in the index.
fn find_block(dense_idx: u32, mut block_pos: u32, data: &[u8]) -> u32 {
loop {
let offset = dense_index_block(data, block_pos).offset;
if offset > dense_idx {
return block_pos - 1;
}
block_pos += 1;
}
}
/// Iterator over all values, true if set, otherwise false
pub fn serialize_dense_codec(
iter: impl Iterator<Item = bool>,
mut out: impl Write,
) -> io::Result<()> {
let mut offset: u32 = 0;
for chunk in &iter.chunks(ELEMENTS_PER_BLOCK as usize) {
let mut block: u64 = 0;
for (pos, is_bit_set) in chunk.enumerate() {
if is_bit_set {
set_bit_at(&mut block, pos as u64);
}
}
block.serialize(&mut out)?;
offset.serialize(&mut out)?;
offset += block.count_ones();
}
// Add sentinal block for the offset
let block: u64 = 0;
block.serialize(&mut out)?;
offset.serialize(&mut out)?;
Ok(())
}
#[cfg(test)]
mod tests {
use proptest::prelude::{any, prop, *};
use proptest::strategy::Strategy;
use proptest::{prop_oneof, proptest};
use super::*;
fn random_bitvec() -> BoxedStrategy<Vec<bool>> {
prop_oneof![
1 => prop::collection::vec(proptest::bool::weighted(1.0), 0..100),
1 => prop::collection::vec(proptest::bool::weighted(1.0), 0..64),
1 => prop::collection::vec(proptest::bool::weighted(0.0), 0..100),
1 => prop::collection::vec(proptest::bool::weighted(0.0), 0..64),
8 => vec![any::<bool>()],
2 => prop::collection::vec(any::<bool>(), 0..50),
]
.boxed()
}
proptest! {
#![proptest_config(ProptestConfig::with_cases(500))]
#[test]
fn test_with_random_bitvecs(bitvec1 in random_bitvec(), bitvec2 in random_bitvec(), bitvec3 in random_bitvec()) {
let mut bitvec = Vec::new();
bitvec.extend_from_slice(&bitvec1);
bitvec.extend_from_slice(&bitvec2);
bitvec.extend_from_slice(&bitvec3);
test_null_index(bitvec);
}
}
#[test]
fn dense_codec_test_one_block_false() {
let mut iter = vec![false; 64];
iter.push(true);
test_null_index(iter);
}
fn test_null_index(data: Vec<bool>) {
let mut out = vec![];
serialize_dense_codec(data.iter().cloned(), &mut out).unwrap();
let null_index = DenseCodec::open(OwnedBytes::new(out));
let orig_idx_with_value: Vec<u32> = data
.iter()
.enumerate()
.filter(|(_pos, val)| **val)
.map(|(pos, _val)| pos as u32)
.collect();
assert_eq!(
null_index
.translate_codec_idx_to_original_idx(0..orig_idx_with_value.len() as u32)
.collect_vec(),
orig_idx_with_value
);
for (dense_idx, orig_idx) in orig_idx_with_value.iter().enumerate() {
assert_eq!(
null_index.translate_to_codec_idx(*orig_idx),
Some(dense_idx as u32)
);
}
for (pos, value) in data.iter().enumerate() {
assert_eq!(null_index.exists(pos as u32), *value);
}
}
#[test]
fn dense_codec_test_translation() {
let mut out = vec![];
let iter = ([true, false, true, false]).iter().cloned();
serialize_dense_codec(iter, &mut out).unwrap();
let null_index = DenseCodec::open(OwnedBytes::new(out));
assert_eq!(
null_index
.translate_codec_idx_to_original_idx(0..2)
.collect_vec(),
vec![0, 2]
);
}
#[test]
fn dense_codec_translate() {
let mut out = vec![];
let iter = ([true, false, true, false]).iter().cloned();
serialize_dense_codec(iter, &mut out).unwrap();
let null_index = DenseCodec::open(OwnedBytes::new(out));
assert_eq!(null_index.translate_to_codec_idx(0), Some(0));
assert_eq!(null_index.translate_to_codec_idx(2), Some(1));
}
#[test]
fn dense_codec_test_small() {
let mut out = vec![];
let iter = ([true, false, true, false]).iter().cloned();
serialize_dense_codec(iter, &mut out).unwrap();
let null_index = DenseCodec::open(OwnedBytes::new(out));
assert!(null_index.exists(0));
assert!(!null_index.exists(1));
assert!(null_index.exists(2));
assert!(!null_index.exists(3));
}
#[test]
fn dense_codec_test_large() {
let mut docs = vec![];
docs.extend((0..1000).map(|_idx| false));
docs.extend((0..=1000).map(|_idx| true));
let iter = docs.iter().cloned();
let mut out = vec![];
serialize_dense_codec(iter, &mut out).unwrap();
let null_index = DenseCodec::open(OwnedBytes::new(out));
assert!(!null_index.exists(0));
assert!(!null_index.exists(100));
assert!(!null_index.exists(999));
assert!(null_index.exists(1000));
assert!(null_index.exists(1999));
assert!(null_index.exists(2000));
assert!(!null_index.exists(2001));
}
#[test]
fn test_count_ones() {
let mut block = 0;
set_bit_at(&mut block, 0);
set_bit_at(&mut block, 2);
assert_eq!(count_ones(block, 0), 1);
assert_eq!(count_ones(block, 1), 1);
assert_eq!(count_ones(block, 2), 2);
}
}
#[cfg(all(test, feature = "unstable"))]
mod bench {
use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};
use test::Bencher;
use super::*;
const TOTAL_NUM_VALUES: u32 = 1_000_000;
fn gen_bools(fill_ratio: f64) -> DenseCodec {
let mut out = Vec::new();
let mut rng: StdRng = StdRng::from_seed([1u8; 32]);
let bools: Vec<_> = (0..TOTAL_NUM_VALUES)
.map(|_| rng.gen_bool(fill_ratio))
.collect();
serialize_dense_codec(bools.into_iter(), &mut out).unwrap();
let codec = DenseCodec::open(OwnedBytes::new(out));
codec
}
fn random_range_iterator(start: u32, end: u32, step_size: u32) -> impl Iterator<Item = u32> {
let mut rng: StdRng = StdRng::from_seed([1u8; 32]);
let mut current = start;
std::iter::from_fn(move || {
current += rng.gen_range(1..step_size + 1);
if current >= end {
None
} else {
Some(current)
}
})
}
fn walk_over_data(codec: &DenseCodec, max_step_size: u32) -> Option<u32> {
walk_over_data_from_positions(
codec,
random_range_iterator(0, TOTAL_NUM_VALUES, max_step_size),
)
}
fn walk_over_data_from_positions(
codec: &DenseCodec,
positions: impl Iterator<Item = u32>,
) -> Option<u32> {
let mut dense_idx: Option<u32> = None;
for idx in positions {
dense_idx = dense_idx.or(codec.translate_to_codec_idx(idx));
}
dense_idx
}
#[bench]
fn bench_dense_codec_translate_orig_to_codec_90percent_filled_random_stride(
bench: &mut Bencher,
) {
let codec = gen_bools(0.9f64);
bench.iter(|| walk_over_data(&codec, 100));
}
#[bench]
fn bench_dense_codec_translate_orig_to_codec_50percent_filled_random_stride(
bench: &mut Bencher,
) {
let codec = gen_bools(0.5f64);
bench.iter(|| walk_over_data(&codec, 100));
}
#[bench]
fn bench_dense_codec_translate_orig_to_codec_full_scan_10percent(bench: &mut Bencher) {
let codec = gen_bools(0.1f64);
bench.iter(|| walk_over_data_from_positions(&codec, 0..TOTAL_NUM_VALUES));
}
#[bench]
fn bench_dense_codec_translate_orig_to_codec_full_scan_90percent(bench: &mut Bencher) {
let codec = gen_bools(0.9f64);
bench.iter(|| walk_over_data_from_positions(&codec, 0..TOTAL_NUM_VALUES));
}
#[bench]
fn bench_dense_codec_translate_orig_to_codec_10percent_filled_random_stride(
bench: &mut Bencher,
) {
let codec = gen_bools(0.1f64);
bench.iter(|| walk_over_data(&codec, 100));
}
#[bench]
fn bench_dense_codec_translate_codec_to_orig_90percent_filled_random_stride_big_step(
bench: &mut Bencher,
) {
let codec = gen_bools(0.9f64);
let num_vals = codec.num_non_nulls();
bench.iter(|| {
codec
.translate_codec_idx_to_original_idx(random_range_iterator(0, num_vals, 50_000))
.last()
});
}
#[bench]
fn bench_dense_codec_translate_codec_to_orig_90percent_filled_random_stride(
bench: &mut Bencher,
) {
let codec = gen_bools(0.9f64);
let num_vals = codec.num_non_nulls();
bench.iter(|| {
codec
.translate_codec_idx_to_original_idx(random_range_iterator(0, num_vals, 100))
.last()
});
}
#[bench]
fn bench_dense_codec_translate_codec_to_orig_90percent_filled_full_scan(bench: &mut Bencher) {
let codec = gen_bools(0.9f64);
let num_vals = codec.num_non_nulls();
bench.iter(|| {
codec
.translate_codec_idx_to_original_idx(0..num_vals)
.last()
});
}
}

View File

@@ -0,0 +1,14 @@
pub use dense::{serialize_dense_codec, DenseCodec};
mod dense;
mod sparse;
#[inline]
fn get_bit_at(input: u64, n: u32) -> bool {
input & (1 << n) != 0
}
#[inline]
fn set_bit_at(input: &mut u64, n: u64) {
*input |= 1 << n;
}

View File

@@ -0,0 +1,752 @@
use std::io::{self, Write};
use common::BitSet;
use ownedbytes::OwnedBytes;
use super::{serialize_dense_codec, DenseCodec};
/// `SparseCodec` is the codec for data, when only few documents have values.
/// In contrast to `DenseCodec` opening a `SparseCodec` causes runtime data to be produced, for
/// faster access.
///
/// The lower 16 bits of doc ids are stored as u16 while the upper 16 bits are given by the block
/// id. Each block contains 1<<16 docids.
///
/// # Serialized Data Layout
/// The data starts with the block data. Each block is either dense or sparse encoded, depending on
/// the number of values in the block. A block is sparse when it contains less than
/// DENSE_BLOCK_THRESHOLD (6144) values.
/// [Sparse data block | dense data block, .. #repeat*; Desc: Either a sparse or dense encoded
/// block]
/// ### Sparse block data
/// [u16 LE, .. #repeat*; Desc: Positions with values in a block]
/// ### Dense block data
/// [Dense codec for the whole block; Desc: Similar to a bitvec(0..ELEMENTS_PER_BLOCK) + Metadata
/// for faster lookups. See dense.rs]
///
/// The data is followed by block metadata, to know which area of the raw block data belongs to
/// which block. Only metadata for blocks with elements is recorded to
/// keep the overhead low for scenarios with many very sparse columns. The block metadata consists
/// of the block index and the number of values in the block. Since we don't store empty blocks
/// num_vals is incremented by 1, e.g. 0 means 1 value.
///
/// The last u16 is storing the number of metadata blocks.
/// [u16 LE, .. #repeat*; Desc: Positions with values in a block][(u16 LE, u16 LE), .. #repeat*;
/// Desc: (Block Id u16, Num Elements u16)][u16 LE; Desc: num blocks with values u16]
///
/// # Opening
/// When opening the data layout, the data is expanded to `Vec<SparseCodecBlockVariant>`, where the
/// index is the block index. For each block `byte_start` and `offset` is computed.
pub struct SparseCodec {
data: OwnedBytes,
blocks: Vec<SparseCodecBlockVariant>,
}
/// The threshold for for number of elements after which we switch to dense block encoding
const DENSE_BLOCK_THRESHOLD: u32 = 6144;
const ELEMENTS_PER_BLOCK: u32 = u16::MAX as u32 + 1;
/// 1.5 bit per Element + 12 bytes for the sentinal block
const NUM_BYTES_DENSE_BLOCK: u32 = (ELEMENTS_PER_BLOCK + ELEMENTS_PER_BLOCK / 2 + 64 + 32) / 8;
#[derive(Clone)]
enum SparseCodecBlockVariant {
Empty { offset: u32 },
Dense(DenseBlock),
Sparse(SparseBlock),
}
impl SparseCodecBlockVariant {
/// The number of non-null values that preceeded that block.
fn offset(&self) -> u32 {
match self {
SparseCodecBlockVariant::Empty { offset } => *offset,
SparseCodecBlockVariant::Dense(dense) => dense.offset,
SparseCodecBlockVariant::Sparse(sparse) => sparse.offset,
}
}
}
/// A block consists of max u16 values
#[derive(Clone)]
struct DenseBlock {
/// The number of values set before the block
offset: u32,
/// The data for the dense encoding
codec: DenseCodec,
}
impl DenseBlock {
pub fn exists(&self, idx: u32) -> bool {
self.codec.exists(idx)
}
pub fn translate_to_codec_idx(&self, idx: u32) -> Option<u32> {
self.codec.translate_to_codec_idx(idx)
}
pub fn translate_codec_idx_to_original_idx(&self, idx: u32) -> u32 {
self.codec
.translate_codec_idx_to_original_idx(idx..=idx)
.next()
.unwrap()
}
}
/// A block consists of max u16 values
#[derive(Debug, Copy, Clone)]
struct SparseBlock {
/// The number of values in the block
num_vals: u32,
/// The number of values set before the block
offset: u32,
/// The start position of the data for the block
byte_start: u32,
}
impl SparseBlock {
fn empty_block(offset: u32) -> Self {
Self {
num_vals: 0,
byte_start: 0,
offset,
}
}
#[inline]
fn value_at_idx(&self, data: &[u8], idx: u16) -> u16 {
let start_offset: usize = self.byte_start as usize + (idx as u32 as usize * 2);
get_u16(data, start_offset)
}
#[inline]
#[allow(clippy::comparison_chain)]
// Looks for the element in the block. Returns the positions if found.
fn binary_search(&self, data: &[u8], target: u16) -> Option<u16> {
let mut size = self.num_vals as u16;
let mut left = 0;
let mut right = size;
// TODO try different implem.
// e.g. exponential search into binary search
while left < right {
let mid = left + size / 2;
// TODO do boundary check only once, and then use an
// unsafe `value_at_idx`
let mid_val = self.value_at_idx(data, mid);
if target > mid_val {
left = mid + 1;
} else if target < mid_val {
right = mid;
} else {
return Some(mid);
}
size = right - left;
}
None
}
}
#[inline]
fn get_u16(data: &[u8], byte_position: usize) -> u16 {
let bytes: [u8; 2] = data[byte_position..byte_position + 2].try_into().unwrap();
u16::from_le_bytes(bytes)
}
const SERIALIZED_BLOCK_METADATA_SIZE: usize = 4;
fn deserialize_sparse_codec_block(data: &OwnedBytes) -> Vec<SparseCodecBlockVariant> {
// The number of vals so far
let mut offset = 0;
let mut sparse_codec_blocks = Vec::new();
let num_blocks = get_u16(data, data.len() - 2);
let block_data_index_start =
data.len() - 2 - num_blocks as usize * SERIALIZED_BLOCK_METADATA_SIZE;
let mut byte_start = 0;
for block_num in 0..num_blocks as usize {
let block_data_index = block_data_index_start + SERIALIZED_BLOCK_METADATA_SIZE * block_num;
let block_idx = get_u16(data, block_data_index);
let num_vals = get_u16(data, block_data_index + 2) as u32 + 1;
sparse_codec_blocks.resize(
block_idx as usize,
SparseCodecBlockVariant::Empty { offset },
);
if is_sparse(num_vals) {
let block = SparseBlock {
num_vals,
offset,
byte_start,
};
sparse_codec_blocks.push(SparseCodecBlockVariant::Sparse(block));
byte_start += 2 * num_vals;
} else {
let block = DenseBlock {
offset,
codec: DenseCodec::open(data.slice(byte_start as usize..data.len()).clone()),
};
sparse_codec_blocks.push(SparseCodecBlockVariant::Dense(block));
// Dense blocks have a fixed size spanning ELEMENTS_PER_BLOCK.
byte_start += NUM_BYTES_DENSE_BLOCK;
}
offset += num_vals;
}
sparse_codec_blocks.push(SparseCodecBlockVariant::Empty { offset });
sparse_codec_blocks
}
/// Splits a value address into lower and upper 16bits.
/// The lower 16 bits are the value in the block
/// The upper 16 bits are the block index
#[derive(Debug, Clone, Copy)]
struct ValueAddr {
block_idx: u16,
value_in_block: u16,
}
/// Splits a idx into block index and value in the block
fn value_addr(idx: u32) -> ValueAddr {
/// Static assert number elements per block this method expects
#[allow(clippy::assertions_on_constants)]
const _: () = assert!(ELEMENTS_PER_BLOCK == (1 << 16));
let value_in_block = idx as u16;
let block_idx = (idx >> 16) as u16;
ValueAddr {
block_idx,
value_in_block,
}
}
impl SparseCodec {
/// Open the SparseCodec from OwnedBytes
pub fn open(data: OwnedBytes) -> Self {
let blocks = deserialize_sparse_codec_block(&data);
Self { data, blocks }
}
#[inline]
/// Check if value at position is not null.
pub fn exists(&self, idx: u32) -> bool {
let value_addr = value_addr(idx);
// There may be trailing nulls without data, those are not stored as blocks. It would be
// possible to create empty blocks, but for that we would need to serialize the number of
// values or pass them when opening
if let Some(block) = self.blocks.get(value_addr.block_idx as usize) {
match block {
SparseCodecBlockVariant::Empty { offset: _ } => false,
SparseCodecBlockVariant::Dense(block) => {
block.exists(value_addr.value_in_block as u32)
}
SparseCodecBlockVariant::Sparse(block) => block
.binary_search(&self.data, value_addr.value_in_block)
.is_some(),
}
} else {
false
}
}
/// Return the number of non-null values in an index
pub fn num_non_nulls(&self) -> u32 {
self.blocks.last().map(|block| block.offset()).unwrap_or(0)
}
#[inline]
/// Translate from the original index to the codec index.
pub fn translate_to_codec_idx(&self, idx: u32) -> Option<u32> {
let value_addr = value_addr(idx);
let block = self.blocks.get(value_addr.block_idx as usize)?;
match block {
SparseCodecBlockVariant::Empty { offset: _ } => None,
SparseCodecBlockVariant::Dense(block) => block
.translate_to_codec_idx(value_addr.value_in_block as u32)
.map(|pos_in_block| pos_in_block + block.offset),
SparseCodecBlockVariant::Sparse(block) => {
let pos_in_block = block.binary_search(&self.data, value_addr.value_in_block);
pos_in_block.map(|pos_in_block: u16| block.offset + pos_in_block as u32)
}
}
}
fn find_block(&self, dense_idx: u32, mut block_pos: u32) -> u32 {
loop {
let offset = self.blocks[block_pos as usize].offset();
if offset > dense_idx {
return block_pos - 1;
}
block_pos += 1;
}
}
/// Translate positions from the codec index to the original index.
///
/// # Panics
///
/// May panic if any `idx` is greater than the max codec index.
pub fn translate_codec_idx_to_original_idx<'a>(
&'a self,
iter: impl Iterator<Item = u32> + 'a,
) -> impl Iterator<Item = u32> + 'a {
// TODO: There's a big potential performance gain, by using iterators per block instead of
// random access for each element in a block
// group_by itertools won't help though, since it requires a temporary local variable
let mut block_pos = 0u32;
iter.map(move |codec_idx| {
// update block_pos to limit search scope
block_pos = self.find_block(codec_idx, block_pos);
let block_doc_idx_start = block_pos * ELEMENTS_PER_BLOCK;
let block = &self.blocks[block_pos as usize];
let idx_in_block = codec_idx - block.offset();
match block {
SparseCodecBlockVariant::Empty { offset: _ } => {
panic!(
"invalid input, cannot translate to original index. associated empty \
block with dense idx. block_pos {}, idx_in_block {}",
block_pos, idx_in_block
)
}
SparseCodecBlockVariant::Dense(dense) => {
dense.translate_codec_idx_to_original_idx(idx_in_block) + block_doc_idx_start
}
SparseCodecBlockVariant::Sparse(block) => {
block.value_at_idx(&self.data, idx_in_block as u16) as u32 + block_doc_idx_start
}
}
})
}
}
fn is_sparse(num_elem_in_block: u32) -> bool {
num_elem_in_block < DENSE_BLOCK_THRESHOLD
}
#[derive(Default)]
struct BlockDataSerialized {
block_idx: u16,
num_vals: u32,
}
/// Iterator over positions of set values.
pub fn serialize_sparse_codec<W: Write>(
mut iter: impl Iterator<Item = u32>,
mut out: W,
) -> io::Result<()> {
let mut block_metadata: Vec<BlockDataSerialized> = Vec::new();
let mut current_block = Vec::new();
// This if-statement for the first element ensures that
// `block_metadata` is not empty in the loop below.
if let Some(idx) = iter.next() {
let value_addr = value_addr(idx);
block_metadata.push(BlockDataSerialized {
block_idx: value_addr.block_idx,
num_vals: 1,
});
current_block.push(value_addr.value_in_block);
}
let flush_block = |current_block: &mut Vec<u16>, out: &mut W| -> io::Result<()> {
let is_sparse = is_sparse(current_block.len() as u32);
if is_sparse {
for val_in_block in current_block.iter() {
out.write_all(val_in_block.to_le_bytes().as_ref())?;
}
} else {
let mut bitset = BitSet::with_max_value(ELEMENTS_PER_BLOCK + 1);
for val_in_block in current_block.iter() {
bitset.insert(*val_in_block as u32);
}
let iter = (0..ELEMENTS_PER_BLOCK).map(|idx| bitset.contains(idx));
serialize_dense_codec(iter, out)?;
}
current_block.clear();
Ok(())
};
for idx in iter {
let value_addr = value_addr(idx);
if block_metadata[block_metadata.len() - 1].block_idx == value_addr.block_idx {
let last_idx_metadata = block_metadata.len() - 1;
block_metadata[last_idx_metadata].num_vals += 1;
} else {
// flush prev block
flush_block(&mut current_block, &mut out)?;
block_metadata.push(BlockDataSerialized {
block_idx: value_addr.block_idx,
num_vals: 1,
});
}
current_block.push(value_addr.value_in_block);
}
// handle last block
flush_block(&mut current_block, &mut out)?;
for block in &block_metadata {
out.write_all(block.block_idx.to_le_bytes().as_ref())?;
// We don't store empty blocks, therefore we can subtract 1.
// This way we will be able to use u16 when the number of elements is 1 << 16 or u16::MAX+1
out.write_all(((block.num_vals - 1) as u16).to_le_bytes().as_ref())?;
}
out.write_all((block_metadata.len() as u16).to_le_bytes().as_ref())?;
Ok(())
}
#[cfg(test)]
mod tests {
use itertools::Itertools;
use proptest::prelude::{any, prop, *};
use proptest::strategy::Strategy;
use proptest::{prop_oneof, proptest};
use super::*;
fn random_bitvec() -> BoxedStrategy<Vec<bool>> {
prop_oneof![
1 => prop::collection::vec(proptest::bool::weighted(1.0), 0..100),
1 => prop::collection::vec(proptest::bool::weighted(0.00), 0..(ELEMENTS_PER_BLOCK as usize * 3)), // empty blocks
1 => prop::collection::vec(proptest::bool::weighted(1.00), 0..(ELEMENTS_PER_BLOCK as usize + 10)), // full block
1 => prop::collection::vec(proptest::bool::weighted(0.01), 0..100),
1 => prop::collection::vec(proptest::bool::weighted(0.01), 0..u16::MAX as usize),
8 => vec![any::<bool>()],
]
.boxed()
}
proptest! {
#![proptest_config(ProptestConfig::with_cases(50))]
#[test]
fn test_with_random_bitvecs(bitvec1 in random_bitvec(), bitvec2 in random_bitvec(), bitvec3 in random_bitvec()) {
let mut bitvec = Vec::new();
bitvec.extend_from_slice(&bitvec1);
bitvec.extend_from_slice(&bitvec2);
bitvec.extend_from_slice(&bitvec3);
test_null_index(bitvec);
}
}
#[test]
fn sparse_codec_test_one_block_false() {
let mut iter = vec![false; ELEMENTS_PER_BLOCK as usize];
iter.push(true);
test_null_index(iter);
}
#[test]
fn sparse_codec_test_one_block_true() {
let mut iter = vec![true; ELEMENTS_PER_BLOCK as usize];
iter.push(true);
test_null_index(iter);
}
fn test_null_index(data: Vec<bool>) {
let mut out = vec![];
serialize_sparse_codec(
data.iter()
.cloned()
.enumerate()
.filter(|(_pos, val)| *val)
.map(|(pos, _val)| pos as u32),
&mut out,
)
.unwrap();
let null_index = SparseCodec::open(OwnedBytes::new(out));
let orig_idx_with_value: Vec<u32> = data
.iter()
.enumerate()
.filter(|(_pos, val)| **val)
.map(|(pos, _val)| pos as u32)
.collect();
assert_eq!(
null_index
.translate_codec_idx_to_original_idx(0..orig_idx_with_value.len() as u32)
.collect_vec(),
orig_idx_with_value
);
let step_size = (orig_idx_with_value.len() / 100).max(1);
for (dense_idx, orig_idx) in orig_idx_with_value.iter().enumerate().step_by(step_size) {
assert_eq!(
null_index.translate_to_codec_idx(*orig_idx),
Some(dense_idx as u32)
);
}
// 100 samples
let step_size = (data.len() / 100).max(1);
for (pos, value) in data.iter().enumerate().step_by(step_size) {
assert_eq!(null_index.exists(pos as u32), *value);
}
}
#[test]
fn sparse_codec_test_translation() {
let mut out = vec![];
let iter = ([true, false, true, false]).iter().cloned();
serialize_sparse_codec(
iter.enumerate()
.filter(|(_pos, val)| *val)
.map(|(pos, _val)| pos as u32),
&mut out,
)
.unwrap();
let null_index = SparseCodec::open(OwnedBytes::new(out));
assert_eq!(
null_index
.translate_codec_idx_to_original_idx(0..2)
.collect_vec(),
vec![0, 2]
);
}
#[test]
fn sparse_codec_translate() {
let mut out = vec![];
let iter = ([true, false, true, false]).iter().cloned();
serialize_sparse_codec(
iter.enumerate()
.filter(|(_pos, val)| *val)
.map(|(pos, _val)| pos as u32),
&mut out,
)
.unwrap();
let null_index = SparseCodec::open(OwnedBytes::new(out));
assert_eq!(null_index.translate_to_codec_idx(0), Some(0));
assert_eq!(null_index.translate_to_codec_idx(2), Some(1));
}
#[test]
fn sparse_codec_test_small() {
let mut out = vec![];
let iter = ([true, false, true, false]).iter().cloned();
serialize_sparse_codec(
iter.enumerate()
.filter(|(_pos, val)| *val)
.map(|(pos, _val)| pos as u32),
&mut out,
)
.unwrap();
let null_index = SparseCodec::open(OwnedBytes::new(out));
assert!(null_index.exists(0));
assert!(!null_index.exists(1));
assert!(null_index.exists(2));
assert!(!null_index.exists(3));
}
#[test]
fn sparse_codec_test_large() {
let mut docs = vec![];
docs.extend((0..ELEMENTS_PER_BLOCK).map(|_idx| false));
docs.extend((0..=1).map(|_idx| true));
let iter = docs.iter().cloned();
let mut out = vec![];
serialize_sparse_codec(
iter.enumerate()
.filter(|(_pos, val)| *val)
.map(|(pos, _val)| pos as u32),
&mut out,
)
.unwrap();
let null_index = SparseCodec::open(OwnedBytes::new(out));
assert!(!null_index.exists(0));
assert!(!null_index.exists(100));
assert!(!null_index.exists(ELEMENTS_PER_BLOCK - 1));
assert!(null_index.exists(ELEMENTS_PER_BLOCK));
assert!(null_index.exists(ELEMENTS_PER_BLOCK + 1));
}
}
#[cfg(all(test, feature = "unstable"))]
mod bench {
use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};
use test::Bencher;
use super::*;
const TOTAL_NUM_VALUES: u32 = 1_000_000;
fn gen_bools(fill_ratio: f64) -> SparseCodec {
let mut out = Vec::new();
let mut rng: StdRng = StdRng::from_seed([1u8; 32]);
serialize_sparse_codec(
(0..TOTAL_NUM_VALUES)
.map(|_| rng.gen_bool(fill_ratio))
.enumerate()
.filter(|(_pos, val)| *val)
.map(|(pos, _val)| pos as u32),
&mut out,
)
.unwrap();
let codec = SparseCodec::open(OwnedBytes::new(out));
codec
}
fn random_range_iterator(start: u32, end: u32, step_size: u32) -> impl Iterator<Item = u32> {
let mut rng: StdRng = StdRng::from_seed([1u8; 32]);
let mut current = start;
std::iter::from_fn(move || {
current += rng.gen_range(1..step_size + 1);
if current >= end {
None
} else {
Some(current)
}
})
}
fn walk_over_data(codec: &SparseCodec, max_step_size: u32) -> Option<u32> {
walk_over_data_from_positions(
codec,
random_range_iterator(0, TOTAL_NUM_VALUES, max_step_size),
)
}
fn walk_over_data_from_positions(
codec: &SparseCodec,
positions: impl Iterator<Item = u32>,
) -> Option<u32> {
let mut dense_idx: Option<u32> = None;
for idx in positions {
dense_idx = dense_idx.or(codec.translate_to_codec_idx(idx));
}
dense_idx
}
#[bench]
fn bench_sparse_codec_translate_orig_to_codec_1percent_filled_random_stride(
bench: &mut Bencher,
) {
let codec = gen_bools(0.01f64);
bench.iter(|| walk_over_data(&codec, 100));
}
#[bench]
fn bench_sparse_codec_translate_orig_to_codec_5percent_filled_random_stride(
bench: &mut Bencher,
) {
let codec = gen_bools(0.05f64);
bench.iter(|| walk_over_data(&codec, 100));
}
#[bench]
fn bench_sparse_codec_translate_orig_to_codec_full_scan_10percent(bench: &mut Bencher) {
let codec = gen_bools(0.1f64);
bench.iter(|| walk_over_data_from_positions(&codec, 0..TOTAL_NUM_VALUES));
}
#[bench]
fn bench_sparse_codec_translate_orig_to_codec_full_scan_90percent(bench: &mut Bencher) {
let codec = gen_bools(0.9f64);
bench.iter(|| walk_over_data_from_positions(&codec, 0..TOTAL_NUM_VALUES));
}
#[bench]
fn bench_sparse_codec_translate_orig_to_codec_full_scan_1percent(bench: &mut Bencher) {
let codec = gen_bools(0.01f64);
bench.iter(|| walk_over_data_from_positions(&codec, 0..TOTAL_NUM_VALUES));
}
#[bench]
fn bench_sparse_codec_translate_orig_to_codec_10percent_filled_random_stride(
bench: &mut Bencher,
) {
let codec = gen_bools(0.1f64);
bench.iter(|| walk_over_data(&codec, 100));
}
#[bench]
fn bench_sparse_codec_translate_orig_to_codec_90percent_filled_random_stride(
bench: &mut Bencher,
) {
let codec = gen_bools(0.9f64);
bench.iter(|| walk_over_data(&codec, 100));
}
#[bench]
fn bench_sparse_codec_translate_codec_to_orig_1percent_filled_random_stride_big_step(
bench: &mut Bencher,
) {
let codec = gen_bools(0.01f64);
let num_vals = codec.num_non_nulls();
bench.iter(|| {
codec
.translate_codec_idx_to_original_idx(random_range_iterator(0, num_vals, 50_000))
.last()
});
}
#[bench]
fn bench_sparse_codec_translate_codec_to_orig_1percent_filled_random_stride(
bench: &mut Bencher,
) {
let codec = gen_bools(0.01f64);
let num_vals = codec.num_non_nulls();
bench.iter(|| {
codec
.translate_codec_idx_to_original_idx(random_range_iterator(0, num_vals, 100))
.last()
});
}
#[bench]
fn bench_sparse_codec_translate_codec_to_orig_1percent_filled_full_scan(bench: &mut Bencher) {
let codec = gen_bools(0.01f64);
let num_vals = codec.num_non_nulls();
bench.iter(|| {
codec
.translate_codec_idx_to_original_idx(0..num_vals)
.last()
});
}
#[bench]
fn bench_sparse_codec_translate_codec_to_orig_90percent_filled_random_stride_big_step(
bench: &mut Bencher,
) {
let codec = gen_bools(0.90f64);
let num_vals = codec.num_non_nulls();
bench.iter(|| {
codec
.translate_codec_idx_to_original_idx(random_range_iterator(0, num_vals, 50_000))
.last()
});
}
#[bench]
fn bench_sparse_codec_translate_codec_to_orig_90percent_filled_random_stride(
bench: &mut Bencher,
) {
let codec = gen_bools(0.9f64);
let num_vals = codec.num_non_nulls();
bench.iter(|| {
codec
.translate_codec_idx_to_original_idx(random_range_iterator(0, num_vals, 100))
.last()
});
}
#[bench]
fn bench_sparse_codec_translate_codec_to_orig_90percent_filled_full_scan(bench: &mut Bencher) {
let codec = gen_bools(0.9f64);
let num_vals = codec.num_non_nulls();
bench.iter(|| {
codec
.translate_codec_idx_to_original_idx(0..num_vals)
.last()
});
}
}

View File

@@ -0,0 +1,146 @@
use std::io::{self, Write};
use std::ops::Range;
use common::{BinarySerializable, CountingWriter, VInt};
use ownedbytes::OwnedBytes;
#[derive(Debug, Clone, Copy, Eq, PartialEq)]
pub(crate) enum FastFieldCardinality {
Single = 1,
Multi = 2,
}
impl BinarySerializable for FastFieldCardinality {
fn serialize<W: Write>(&self, wrt: &mut W) -> io::Result<()> {
self.to_code().serialize(wrt)
}
fn deserialize<R: io::Read>(reader: &mut R) -> io::Result<Self> {
let code = u8::deserialize(reader)?;
let codec_type: Self = Self::from_code(code)
.ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "Unknown code `{code}.`"))?;
Ok(codec_type)
}
}
impl FastFieldCardinality {
pub(crate) fn to_code(self) -> u8 {
self as u8
}
pub(crate) fn from_code(code: u8) -> Option<Self> {
match code {
1 => Some(Self::Single),
2 => Some(Self::Multi),
_ => None,
}
}
}
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub(crate) enum NullIndexCodec {
Full = 1,
}
impl BinarySerializable for NullIndexCodec {
fn serialize<W: Write>(&self, wrt: &mut W) -> io::Result<()> {
self.to_code().serialize(wrt)
}
fn deserialize<R: io::Read>(reader: &mut R) -> io::Result<Self> {
let code = u8::deserialize(reader)?;
let codec_type: Self = Self::from_code(code)
.ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "Unknown code `{code}.`"))?;
Ok(codec_type)
}
}
impl NullIndexCodec {
pub(crate) fn to_code(self) -> u8 {
self as u8
}
pub(crate) fn from_code(code: u8) -> Option<Self> {
match code {
1 => Some(Self::Full),
_ => None,
}
}
}
#[derive(Debug, Clone, Eq, PartialEq)]
pub(crate) struct NullIndexFooter {
pub(crate) cardinality: FastFieldCardinality,
pub(crate) null_index_codec: NullIndexCodec,
// Unused for NullIndexCodec::Full
pub(crate) null_index_byte_range: Range<u64>,
}
impl BinarySerializable for NullIndexFooter {
fn serialize<W: Write>(&self, writer: &mut W) -> io::Result<()> {
self.cardinality.serialize(writer)?;
self.null_index_codec.serialize(writer)?;
VInt(self.null_index_byte_range.start).serialize(writer)?;
VInt(self.null_index_byte_range.end - self.null_index_byte_range.start)
.serialize(writer)?;
Ok(())
}
fn deserialize<R: io::Read>(reader: &mut R) -> io::Result<Self> {
let cardinality = FastFieldCardinality::deserialize(reader)?;
let null_index_codec = NullIndexCodec::deserialize(reader)?;
let null_index_byte_range_start = VInt::deserialize(reader)?.0;
let null_index_byte_range_end = VInt::deserialize(reader)?.0 + null_index_byte_range_start;
Ok(Self {
cardinality,
null_index_codec,
null_index_byte_range: null_index_byte_range_start..null_index_byte_range_end,
})
}
}
pub(crate) fn append_null_index_footer(
output: &mut impl io::Write,
null_index_footer: NullIndexFooter,
) -> io::Result<()> {
let mut counting_write = CountingWriter::wrap(output);
null_index_footer.serialize(&mut counting_write)?;
let footer_payload_len = counting_write.written_bytes();
BinarySerializable::serialize(&(footer_payload_len as u16), &mut counting_write)?;
Ok(())
}
pub(crate) fn read_null_index_footer(
data: OwnedBytes,
) -> io::Result<(OwnedBytes, NullIndexFooter)> {
let (data, null_footer_length_bytes) = data.rsplit(2);
let footer_length = u16::deserialize(&mut null_footer_length_bytes.as_slice())?;
let (data, null_index_footer_bytes) = data.rsplit(footer_length as usize);
let null_index_footer = NullIndexFooter::deserialize(&mut null_index_footer_bytes.as_ref())?;
Ok((data, null_index_footer))
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn null_index_footer_deser_test() {
let null_index_footer = NullIndexFooter {
cardinality: FastFieldCardinality::Single,
null_index_codec: NullIndexCodec::Full,
null_index_byte_range: 100..120,
};
let mut out = vec![];
null_index_footer.serialize(&mut out).unwrap();
assert_eq!(
null_index_footer,
NullIndexFooter::deserialize(&mut &out[..]).unwrap()
);
}
}

View File

@@ -1,54 +1,44 @@
// Copyright (C) 2022 Quickwit, Inc.
//
// Quickwit is offered under the AGPL v3.0 and as commercial software.
// For commercial licensing, contact us at hello@quickwit.io.
//
// AGPL:
// This program is free software: you can redistribute it and/or modify
// it under the terms of the GNU Affero General Public License as
// published by the Free Software Foundation, either version 3 of the
// License, or (at your option) any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU Affero General Public License for more details.
//
// You should have received a copy of the GNU Affero General Public License
// along with this program. If not, see <http://www.gnu.org/licenses/>.
use std::io;
use std::num::NonZeroU64;
use std::sync::Arc;
use common::{BinarySerializable, VInt};
use fastdivide::DividerU64;
use log::warn;
use ownedbytes::OwnedBytes;
use crate::bitpacked::BitpackedCodec;
use crate::blockwise_linear::BlockwiseLinearCodec;
use crate::compact_space::CompactSpaceCompressor;
use crate::format_version::append_format_version;
use crate::linear::LinearCodec;
use crate::monotonic_mapping::{
StrictlyMonotonicFn, StrictlyMonotonicMappingToInternal,
StrictlyMonotonicMappingToInternalGCDBaseval,
};
use crate::null_index_footer::{
append_null_index_footer, FastFieldCardinality, NullIndexCodec, NullIndexFooter,
};
use crate::{
monotonic_map_column, Column, FastFieldCodec, FastFieldCodecType, MonotonicallyMappableToU64,
VecColumn, ALL_CODEC_TYPES,
U128FastFieldCodecType, VecColumn, ALL_CODEC_TYPES,
};
/// The normalized header gives some parameters after applying the following
/// normalization of the vector:
/// val -> (val - min_value) / gcd
/// `val -> (val - min_value) / gcd`
///
/// By design, after normalization, `min_value = 0` and `gcd = 1`.
#[derive(Debug, Copy, Clone)]
pub struct NormalizedHeader {
pub num_vals: u64,
/// The number of values in the underlying column.
pub num_vals: u32,
/// The max value of the underlying column.
pub max_value: u64,
}
#[derive(Debug, Copy, Clone)]
pub(crate) struct Header {
pub num_vals: u64,
pub num_vals: u32,
pub min_value: u64,
pub max_value: u64,
pub gcd: Option<NonZeroU64>,
@@ -57,8 +47,11 @@ pub(crate) struct Header {
impl Header {
pub fn normalized(self) -> NormalizedHeader {
let max_value =
(self.max_value - self.min_value) / self.gcd.map(|gcd| gcd.get()).unwrap_or(1);
let gcd = self.gcd.map(|gcd| gcd.get()).unwrap_or(1);
let gcd_min_val_mapping =
StrictlyMonotonicMappingToInternalGCDBaseval::new(gcd, self.min_value);
let max_value = gcd_min_val_mapping.mapping(self.max_value);
NormalizedHeader {
num_vals: self.num_vals,
max_value,
@@ -66,10 +59,7 @@ impl Header {
}
pub fn normalize_column<C: Column>(&self, from_column: C) -> impl Column {
let min_value = self.min_value;
let gcd = self.gcd.map(|gcd| gcd.get()).unwrap_or(1);
let divider = DividerU64::divide_by(gcd);
monotonic_map_column(from_column, move |val| divider.divide(val - min_value))
normalize_column(from_column, self.min_value, self.gcd)
}
pub fn compute_header(
@@ -81,9 +71,8 @@ impl Header {
let max_value = column.max_value();
let gcd = crate::gcd::find_gcd(column.iter().map(|val| val - min_value))
.filter(|gcd| gcd.get() > 1u64);
let divider = DividerU64::divide_by(gcd.map(|gcd| gcd.get()).unwrap_or(1u64));
let shifted_column = monotonic_map_column(&column, |val| divider.divide(val - min_value));
let codec_type = detect_codec(shifted_column, codecs)?;
let normalized_column = normalize_column(column, min_value, gcd);
let codec_type = detect_codec(normalized_column, codecs)?;
Some(Header {
num_vals,
min_value,
@@ -94,9 +83,42 @@ impl Header {
}
}
#[derive(Debug, Copy, Clone, PartialEq, Eq)]
pub(crate) struct U128Header {
pub num_vals: u32,
pub codec_type: U128FastFieldCodecType,
}
impl BinarySerializable for U128Header {
fn serialize<W: io::Write>(&self, writer: &mut W) -> io::Result<()> {
VInt(self.num_vals as u64).serialize(writer)?;
self.codec_type.serialize(writer)?;
Ok(())
}
fn deserialize<R: io::Read>(reader: &mut R) -> io::Result<Self> {
let num_vals = VInt::deserialize(reader)?.0 as u32;
let codec_type = U128FastFieldCodecType::deserialize(reader)?;
Ok(U128Header {
num_vals,
codec_type,
})
}
}
pub fn normalize_column<C: Column>(
from_column: C,
min_value: u64,
gcd: Option<NonZeroU64>,
) -> impl Column {
let gcd = gcd.map(|gcd| gcd.get()).unwrap_or(1);
let mapping = StrictlyMonotonicMappingToInternalGCDBaseval::new(gcd, min_value);
monotonic_map_column(from_column, mapping)
}
impl BinarySerializable for Header {
fn serialize<W: io::Write>(&self, writer: &mut W) -> io::Result<()> {
VInt(self.num_vals).serialize(writer)?;
VInt(self.num_vals as u64).serialize(writer)?;
VInt(self.min_value).serialize(writer)?;
VInt(self.max_value - self.min_value).serialize(writer)?;
if let Some(gcd) = self.gcd {
@@ -109,7 +131,7 @@ impl BinarySerializable for Header {
}
fn deserialize<R: io::Read>(reader: &mut R) -> io::Result<Self> {
let num_vals = VInt::deserialize(reader)?.0;
let num_vals = VInt::deserialize(reader)?.0 as u32;
let min_value = VInt::deserialize(reader)?.0;
let amplitude = VInt::deserialize(reader)?.0;
let max_value = min_value + amplitude;
@@ -125,16 +147,21 @@ impl BinarySerializable for Header {
}
}
/// Return estimated compression for given codec in the value range [0.0..1.0], where 1.0 means no
/// compression.
pub fn estimate<T: MonotonicallyMappableToU64>(
typed_column: impl Column<T>,
codec_type: FastFieldCodecType,
) -> Option<f32> {
let column = monotonic_map_column(typed_column, T::to_u64);
let column = monotonic_map_column(typed_column, StrictlyMonotonicMappingToInternal::<T>::new());
let min_value = column.min_value();
let gcd = crate::gcd::find_gcd(column.iter().map(|val| val - min_value))
.filter(|gcd| gcd.get() > 1u64);
let divider = DividerU64::divide_by(gcd.map(|gcd| gcd.get()).unwrap_or(1u64));
let normalized_column = monotonic_map_column(&column, |val| divider.divide(val - min_value));
let mapping = StrictlyMonotonicMappingToInternalGCDBaseval::new(
gcd.map(|gcd| gcd.get()).unwrap_or(1u64),
min_value,
);
let normalized_column = monotonic_map_column(&column, mapping);
match codec_type {
FastFieldCodecType::Bitpacked => BitpackedCodec::estimate(&normalized_column),
FastFieldCodecType::Linear => LinearCodec::estimate(&normalized_column),
@@ -142,25 +169,110 @@ pub fn estimate<T: MonotonicallyMappableToU64>(
}
}
pub fn serialize_u128(
typed_column: impl Column<u128>,
/// Serializes u128 values with the compact space codec.
pub fn serialize_u128<F: Fn() -> I, I: Iterator<Item = u128>>(
iter_gen: F,
num_vals: u32,
output: &mut impl io::Write,
) -> io::Result<()> {
// TODO write header, to later support more codecs
let compressor = CompactSpaceCompressor::train_from(&typed_column);
compressor
.compress_into(typed_column.iter(), output)
.unwrap();
serialize_u128_new(ValueIndexInfo::default(), iter_gen, num_vals, output)
}
#[allow(dead_code)]
pub enum ValueIndexInfo<'a> {
MultiValue(Box<dyn MultiValueIndexInfo + 'a>),
SingleValue(Box<dyn SingleValueIndexInfo + 'a>),
}
impl Default for ValueIndexInfo<'static> {
fn default() -> Self {
struct Dummy {}
impl SingleValueIndexInfo for Dummy {
fn num_vals(&self) -> u32 {
todo!()
}
fn num_non_nulls(&self) -> u32 {
todo!()
}
fn iter(&self) -> Box<dyn Iterator<Item = u32>> {
todo!()
}
}
Self::SingleValue(Box::new(Dummy {}))
}
}
impl<'a> ValueIndexInfo<'a> {
fn get_cardinality(&self) -> FastFieldCardinality {
match self {
ValueIndexInfo::MultiValue(_) => FastFieldCardinality::Multi,
ValueIndexInfo::SingleValue(_) => FastFieldCardinality::Single,
}
}
}
pub trait MultiValueIndexInfo {
/// The number of docs in the column.
fn num_docs(&self) -> u32;
/// The number of values in the column.
fn num_vals(&self) -> u32;
/// Return the start index of the values for each doc
fn iter(&self) -> Box<dyn Iterator<Item = u32> + '_>;
}
pub trait SingleValueIndexInfo {
/// The number of values including nulls in the column.
fn num_vals(&self) -> u32;
/// The number of non-null values in the column.
fn num_non_nulls(&self) -> u32;
/// Return a iterator of the positions of docs with a value
fn iter(&self) -> Box<dyn Iterator<Item = u32> + '_>;
}
/// Serializes u128 values with the compact space codec.
pub fn serialize_u128_new<F: Fn() -> I, I: Iterator<Item = u128>>(
value_index: ValueIndexInfo,
iter_gen: F,
num_vals: u32,
output: &mut impl io::Write,
) -> io::Result<()> {
let header = U128Header {
num_vals,
codec_type: U128FastFieldCodecType::CompactSpace,
};
header.serialize(output)?;
let compressor = CompactSpaceCompressor::train_from(iter_gen(), num_vals);
compressor.compress_into(iter_gen(), output).unwrap();
let null_index_footer = NullIndexFooter {
cardinality: value_index.get_cardinality(),
null_index_codec: NullIndexCodec::Full,
null_index_byte_range: 0..0,
};
append_null_index_footer(output, null_index_footer)?;
append_format_version(output)?;
Ok(())
}
/// Serializes the column with the codec with the best estimate on the data.
pub fn serialize<T: MonotonicallyMappableToU64>(
typed_column: impl Column<T>,
output: &mut impl io::Write,
codecs: &[FastFieldCodecType],
) -> io::Result<()> {
let column = monotonic_map_column(typed_column, T::to_u64);
serialize_new(ValueIndexInfo::default(), typed_column, output, codecs)
}
/// Serializes the column with the codec with the best estimate on the data.
pub fn serialize_new<T: MonotonicallyMappableToU64>(
value_index: ValueIndexInfo,
typed_column: impl Column<T>,
output: &mut impl io::Write,
codecs: &[FastFieldCodecType],
) -> io::Result<()> {
let column = monotonic_map_column(typed_column, StrictlyMonotonicMappingToInternal::<T>::new());
let header = Header::compute_header(&column, codecs).ok_or_else(|| {
io::Error::new(
io::ErrorKind::InvalidInput,
@@ -174,6 +286,15 @@ pub fn serialize<T: MonotonicallyMappableToU64>(
let normalized_column = header.normalize_column(column);
assert_eq!(normalized_column.min_value(), 0u64);
serialize_given_codec(normalized_column, header.codec_type, output)?;
let null_index_footer = NullIndexFooter {
cardinality: value_index.get_cardinality(),
null_index_codec: NullIndexCodec::Full,
null_index_byte_range: 0..0,
};
append_null_index_footer(output, null_index_footer)?;
append_format_version(output)?;
Ok(())
}
@@ -225,6 +346,7 @@ fn serialize_given_codec(
Ok(())
}
/// Helper function to serialize a column (autodetect from all codecs) and then open it
pub fn serialize_and_load<T: MonotonicallyMappableToU64 + Ord + Default>(
column: &[T],
) -> Arc<dyn Column<T>> {
@@ -237,6 +359,18 @@ pub fn serialize_and_load<T: MonotonicallyMappableToU64 + Ord + Default>(
mod tests {
use super::*;
#[test]
fn test_serialize_deserialize_u128_header() {
let original = U128Header {
num_vals: 11,
codec_type: U128FastFieldCodecType::CompactSpace,
};
let mut out = Vec::new();
original.serialize(&mut out).unwrap();
let restored = U128Header::deserialize(&mut &out[..]).unwrap();
assert_eq!(restored, original);
}
#[test]
fn test_serialize_deserialize() {
let original = [1u64, 5u64, 10u64];
@@ -250,7 +384,7 @@ mod tests {
let col = VecColumn::from(&[false, true][..]);
serialize(col, &mut buffer, &ALL_CODEC_TYPES).unwrap();
// 5 bytes of header, 1 byte of value, 7 bytes of padding.
assert_eq!(buffer.len(), 5 + 8);
assert_eq!(buffer.len(), 3 + 5 + 8 + 4 + 2);
}
#[test]
@@ -259,7 +393,7 @@ mod tests {
let col = VecColumn::from(&[true][..]);
serialize(col, &mut buffer, &ALL_CODEC_TYPES).unwrap();
// 5 bytes of header, 0 bytes of value, 7 bytes of padding.
assert_eq!(buffer.len(), 5 + 7);
assert_eq!(buffer.len(), 3 + 5 + 7 + 4 + 2);
}
#[test]
@@ -269,6 +403,6 @@ mod tests {
let col = VecColumn::from(&vals[..]);
serialize(col, &mut buffer, &[FastFieldCodecType::Bitpacked]).unwrap();
// Values are stored over 3 bits.
assert_eq!(buffer.len(), 7 + (3 * 80 / 8) + 7);
assert_eq!(buffer.len(), 3 + 7 + (3 * 80 / 8) + 7 + 4 + 2);
}
}

View File

@@ -1,10 +1,14 @@
[package]
authors = ["Paul Masurel <paul@quickwit.io>", "Pascal Seitz <pascal@quickwit.io>"]
name = "ownedbytes"
version = "0.3.0"
version = "0.5.0"
edition = "2021"
description = "Expose data as static slice"
license = "MIT"
documentation = "https://docs.rs/ownedbytes/"
homepage = "https://github.com/quickwit-oss/tantivy"
repository = "https://github.com/quickwit-oss/tantivy"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]

View File

@@ -3,7 +3,7 @@ use std::ops::{Deref, Range};
use std::sync::Arc;
use std::{fmt, io, mem};
use stable_deref_trait::StableDeref;
pub use stable_deref_trait::StableDeref;
/// An OwnedBytes simply wraps an object that owns a slice of data and exposes
/// this data as a slice.
@@ -80,6 +80,21 @@ impl OwnedBytes {
(left, right)
}
/// Splits the OwnedBytes into two OwnedBytes `(left, right)`.
///
/// Right will hold `split_len` bytes.
///
/// This operation is cheap and does not require to copy any memory.
/// On the other hand, both `left` and `right` retain a handle over
/// the entire slice of memory. In other words, the memory will only
/// be released when both left and right are dropped.
#[inline]
#[must_use]
pub fn rsplit(self, split_len: usize) -> (OwnedBytes, OwnedBytes) {
let data_len = self.data.len();
self.split(data_len - split_len)
}
/// Splits the right part of the `OwnedBytes` at the given offset.
///
/// `self` is truncated to `split_len`, left with the remaining bytes.

View File

@@ -1,6 +1,6 @@
[package]
name = "tantivy-query-grammar"
version = "0.18.0"
version = "0.19.0"
authors = ["Paul Masurel <paul.masurel@gmail.com>"]
license = "MIT"
categories = ["database-implementations", "data-structures"]

View File

@@ -5,7 +5,8 @@ use combine::parser::range::{take_while, take_while1};
use combine::parser::repeat::escaped;
use combine::parser::Parser;
use combine::{
attempt, choice, eof, many, many1, one_of, optional, parser, satisfy, skip_many1, value,
attempt, between, choice, eof, many, many1, one_of, optional, parser, satisfy, sep_by,
skip_many1, value,
};
use once_cell::sync::Lazy;
use regex::Regex;
@@ -62,6 +63,20 @@ fn word<'a>() -> impl Parser<&'a str, Output = String> {
})
}
// word variant that allows more characters, e.g. for range queries that don't allow field
// specifier
fn relaxed_word<'a>() -> impl Parser<&'a str, Output = String> {
(
satisfy(|c: char| {
!c.is_whitespace() && !['`', '{', '}', '"', '[', ']', '(', ')'].contains(&c)
}),
many(satisfy(|c: char| {
!c.is_whitespace() && !['{', '}', '"', '[', ']', '(', ')'].contains(&c)
})),
)
.map(|(s1, s2): (char, String)| format!("{}{}", s1, s2))
}
/// Parses a date time according to rfc3339
/// 2015-08-02T18:54:42+02
/// 2021-04-13T19:46:26.266051969+00:00
@@ -181,8 +196,8 @@ fn spaces1<'a>() -> impl Parser<&'a str, Output = ()> {
fn range<'a>() -> impl Parser<&'a str, Output = UserInputLeaf> {
let range_term_val = || {
attempt(date_time())
.or(word())
.or(negative_number())
.or(relaxed_word())
.or(char('*').with(value("*".to_string())))
};
@@ -250,6 +265,17 @@ fn range<'a>() -> impl Parser<&'a str, Output = UserInputLeaf> {
})
}
/// Function that parses a set out of a Stream
/// Supports ranges like: `IN [val1 val2 val3]`
fn set<'a>() -> impl Parser<&'a str, Output = UserInputLeaf> {
let term_list = between(char('['), char(']'), sep_by(term_val(), spaces()));
let set_content = ((string("IN"), spaces()), term_list).map(|(_, elements)| elements);
(optional(attempt(field_name().skip(spaces()))), set_content)
.map(|(field, elements)| UserInputLeaf::Set { field, elements })
}
fn negate(expr: UserInputAst) -> UserInputAst {
expr.unary(Occur::MustNot)
}
@@ -264,6 +290,7 @@ fn leaf<'a>() -> impl Parser<&'a str, Output = UserInputAst> {
string("NOT").skip(spaces1()).with(leaf()).map(negate),
))
.or(attempt(range().map(UserInputAst::from)))
.or(attempt(set().map(UserInputAst::from)))
.or(literal().map(UserInputAst::from))
.parse_stream(input)
.into_result()
@@ -649,6 +676,34 @@ mod test {
.expect("Cannot parse date range")
.0;
assert_eq!(res6, expected_flexible_dates);
// IP Range Unbounded
let expected_weight = UserInputLeaf::Range {
field: Some("ip".to_string()),
lower: UserInputBound::Inclusive("::1".to_string()),
upper: UserInputBound::Unbounded,
};
let res1 = range()
.parse("ip: >=::1")
.expect("Cannot parse ip v6 format")
.0;
let res2 = range()
.parse("ip:[::1 TO *}")
.expect("Cannot parse ip v6 format")
.0;
assert_eq!(res1, expected_weight);
assert_eq!(res2, expected_weight);
// IP Range Bounded
let expected_weight = UserInputLeaf::Range {
field: Some("ip".to_string()),
lower: UserInputBound::Inclusive("::0.0.0.50".to_string()),
upper: UserInputBound::Exclusive("::0.0.0.52".to_string()),
};
let res1 = range()
.parse("ip:[::0.0.0.50 TO ::0.0.0.52}")
.expect("Cannot parse ip v6 format")
.0;
assert_eq!(res1, expected_weight);
}
#[test]
@@ -705,6 +760,14 @@ mod test {
test_parse_query_to_ast_helper("+(a b) +d", "(+(*\"a\" *\"b\") +\"d\")");
}
#[test]
fn test_parse_test_query_set() {
test_parse_query_to_ast_helper("abc: IN [a b c]", r#""abc": IN ["a" "b" "c"]"#);
test_parse_query_to_ast_helper("abc: IN [1]", r#""abc": IN ["1"]"#);
test_parse_query_to_ast_helper("abc: IN []", r#""abc": IN []"#);
test_parse_query_to_ast_helper("IN [1 2]", r#"IN ["1" "2"]"#);
}
#[test]
fn test_parse_test_query_other() {
test_parse_query_to_ast_helper("(+a +b) d", "(*(+\"a\" +\"b\") *\"d\")");

View File

@@ -12,6 +12,10 @@ pub enum UserInputLeaf {
lower: UserInputBound,
upper: UserInputBound,
},
Set {
field: Option<String>,
elements: Vec<String>,
},
}
impl Debug for UserInputLeaf {
@@ -31,6 +35,19 @@ impl Debug for UserInputLeaf {
upper.display_upper(formatter)?;
Ok(())
}
UserInputLeaf::Set { field, elements } => {
if let Some(ref field) = field {
write!(formatter, "\"{}\": ", field)?;
}
write!(formatter, "IN [")?;
for (i, element) in elements.iter().enumerate() {
if i != 0 {
write!(formatter, " ")?;
}
write!(formatter, "\"{}\"", element)?;
}
write!(formatter, "]")
}
UserInputLeaf::All => write!(formatter, "*"),
}
}

View File

@@ -11,7 +11,7 @@ use super::bucket::{HistogramAggregation, RangeAggregation, TermsAggregation};
use super::metric::{AverageAggregation, StatsAggregation};
use super::segment_agg_result::BucketCount;
use super::VecWithNames;
use crate::fastfield::{type_and_cardinality, FastType, MultiValuedFastFieldReader};
use crate::fastfield::{type_and_cardinality, MultiValuedFastFieldReader};
use crate::schema::{Cardinality, Type};
use crate::{InvertedIndexReader, SegmentReader, TantivyError};
@@ -194,13 +194,7 @@ fn get_ff_reader_and_validate(
.ok_or_else(|| TantivyError::FieldNotFound(field_name.to_string()))?;
let field_type = reader.schema().get_field_entry(field).field_type();
if let Some((ff_type, field_cardinality)) = type_and_cardinality(field_type) {
if ff_type == FastType::Date {
return Err(TantivyError::InvalidArgument(
"Unsupported field type date in aggregation".to_string(),
));
}
if let Some((_ff_type, field_cardinality)) = type_and_cardinality(field_type) {
if cardinality != field_cardinality {
return Err(TantivyError::InvalidArgument(format!(
"Invalid field cardinality on field {} expected {:?}, but got {:?}",

View File

@@ -4,9 +4,7 @@
//! intermediate average results, which is the sum and the number of values. The actual average is
//! calculated on the step from intermediate to final aggregation result tree.
use std::collections::HashMap;
use fnv::FnvHashMap;
use rustc_hash::FxHashMap;
use serde::{Deserialize, Serialize};
use super::agg_req::BucketAggregationInternal;
@@ -14,11 +12,12 @@ use super::bucket::GetDocCount;
use super::intermediate_agg_result::{IntermediateBucketResult, IntermediateMetricResult};
use super::metric::{SingleMetricResult, Stats};
use super::Key;
use crate::schema::Schema;
use crate::TantivyError;
#[derive(Clone, Default, Debug, PartialEq, Serialize, Deserialize)]
/// The final aggegation result.
pub struct AggregationResults(pub HashMap<String, AggregationResult>);
pub struct AggregationResults(pub FxHashMap<String, AggregationResult>);
impl AggregationResults {
pub(crate) fn get_value_from_aggregation(
@@ -131,9 +130,12 @@ pub enum BucketResult {
}
impl BucketResult {
pub(crate) fn empty_from_req(req: &BucketAggregationInternal) -> crate::Result<Self> {
pub(crate) fn empty_from_req(
req: &BucketAggregationInternal,
schema: &Schema,
) -> crate::Result<Self> {
let empty_bucket = IntermediateBucketResult::empty_from_req(&req.bucket_agg);
empty_bucket.into_final_bucket_result(req)
empty_bucket.into_final_bucket_result(req, schema)
}
}
@@ -145,7 +147,7 @@ pub enum BucketEntries<T> {
/// Vector format bucket entries
Vec(Vec<T>),
/// HashMap format bucket entries
HashMap(FnvHashMap<String, T>),
HashMap(FxHashMap<String, T>),
}
/// This is the default entry for a bucket, which contains a key, count, and optionally
@@ -176,6 +178,9 @@ pub enum BucketEntries<T> {
/// ```
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
pub struct BucketEntry {
#[serde(skip_serializing_if = "Option::is_none")]
/// The string representation of the bucket.
pub key_as_string: Option<String>,
/// The identifier of the bucket.
pub key: Key,
/// Number of documents in the bucket.
@@ -240,4 +245,10 @@ pub struct RangeBucketEntry {
/// The to range of the bucket. Equals `f64::MAX` when `None`.
#[serde(skip_serializing_if = "Option::is_none")]
pub to: Option<f64>,
/// The optional string representation for the `from` range.
#[serde(skip_serializing_if = "Option::is_none")]
pub from_as_string: Option<String>,
/// The optional string representation for the `to` range.
#[serde(skip_serializing_if = "Option::is_none")]
pub to_as_string: Option<String>,
}

View File

@@ -10,12 +10,12 @@ use crate::aggregation::agg_req_with_accessor::{
AggregationsWithAccessor, BucketAggregationWithAccessor,
};
use crate::aggregation::agg_result::BucketEntry;
use crate::aggregation::f64_from_fastfield_u64;
use crate::aggregation::intermediate_agg_result::{
IntermediateAggregationResults, IntermediateBucketResult, IntermediateHistogramBucketEntry,
};
use crate::aggregation::segment_agg_result::SegmentAggregationResultsCollector;
use crate::schema::Type;
use crate::aggregation::{f64_from_fastfield_u64, format_date};
use crate::schema::{Schema, Type};
use crate::{DocId, TantivyError};
/// Histogram is a bucket aggregation, where buckets are created dynamically for given `interval`.
@@ -206,6 +206,7 @@ pub struct SegmentHistogramCollector {
field_type: Type,
interval: f64,
offset: f64,
min_doc_count: u64,
first_bucket_num: i64,
bounds: HistogramBounds,
}
@@ -215,6 +216,30 @@ impl SegmentHistogramCollector {
self,
agg_with_accessor: &BucketAggregationWithAccessor,
) -> crate::Result<IntermediateBucketResult> {
// Compute the number of buckets to validate against max num buckets
// Note: We use min_doc_count here, but it's only an lowerbound here, since were are on the
// intermediate level and after merging the number of documents of a bucket could exceed
// `min_doc_count`.
{
let cut_off_buckets_front = self
.buckets
.iter()
.take_while(|bucket| bucket.doc_count <= self.min_doc_count)
.count();
let cut_off_buckets_back = self.buckets[cut_off_buckets_front..]
.iter()
.rev()
.take_while(|bucket| bucket.doc_count <= self.min_doc_count)
.count();
let estimate_num_buckets =
self.buckets.len() - cut_off_buckets_front - cut_off_buckets_back;
agg_with_accessor
.bucket_count
.add_count(estimate_num_buckets as u32);
agg_with_accessor.bucket_count.validate_bucket_count()?;
}
let mut buckets = Vec::with_capacity(
self.buckets
.iter()
@@ -251,11 +276,6 @@ impl SegmentHistogramCollector {
);
};
agg_with_accessor
.bucket_count
.add_count(buckets.len() as u32);
agg_with_accessor.bucket_count.validate_bucket_count()?;
Ok(IntermediateBucketResult::Histogram { buckets })
}
@@ -308,6 +328,7 @@ impl SegmentHistogramCollector {
first_bucket_num,
bounds,
sub_aggregations,
min_doc_count: req.min_doc_count(),
})
}
@@ -331,10 +352,10 @@ impl SegmentHistogramCollector {
.expect("unexpected fast field cardinatility");
let mut iter = doc.chunks_exact(4);
for docs in iter.by_ref() {
let val0 = self.f64_from_fastfield_u64(accessor.get_val(docs[0] as u64));
let val1 = self.f64_from_fastfield_u64(accessor.get_val(docs[1] as u64));
let val2 = self.f64_from_fastfield_u64(accessor.get_val(docs[2] as u64));
let val3 = self.f64_from_fastfield_u64(accessor.get_val(docs[3] as u64));
let val0 = self.f64_from_fastfield_u64(accessor.get_val(docs[0]));
let val1 = self.f64_from_fastfield_u64(accessor.get_val(docs[1]));
let val2 = self.f64_from_fastfield_u64(accessor.get_val(docs[2]));
let val3 = self.f64_from_fastfield_u64(accessor.get_val(docs[3]));
let bucket_pos0 = get_bucket_num(val0);
let bucket_pos1 = get_bucket_num(val1);
@@ -371,7 +392,7 @@ impl SegmentHistogramCollector {
)?;
}
for &doc in iter.remainder() {
let val = f64_from_fastfield_u64(accessor.get_val(doc as u64), &self.field_type);
let val = f64_from_fastfield_u64(accessor.get_val(doc), &self.field_type);
if !bounds.contains(val) {
continue;
}
@@ -380,7 +401,7 @@ impl SegmentHistogramCollector {
debug_assert_eq!(
self.buckets[bucket_pos].key,
get_bucket_val(val, self.interval, self.offset) as f64
get_bucket_val(val, self.interval, self.offset)
);
self.increment_bucket(bucket_pos, doc, &bucket_with_accessor.sub_aggregation)?;
}
@@ -407,7 +428,7 @@ impl SegmentHistogramCollector {
if bounds.contains(val) {
debug_assert_eq!(
self.buckets[bucket_pos].key,
get_bucket_val(val, self.interval, self.offset) as f64
get_bucket_val(val, self.interval, self.offset)
);
self.increment_bucket(bucket_pos, doc, bucket_with_accessor)?;
@@ -425,7 +446,7 @@ impl SegmentHistogramCollector {
let bucket = &mut self.buckets[bucket_pos];
bucket.doc_count += 1;
if let Some(sub_aggregation) = self.sub_aggregations.as_mut() {
(&mut sub_aggregation[bucket_pos]).collect(doc, bucket_with_accessor)?;
sub_aggregation[bucket_pos].collect(doc, bucket_with_accessor)?;
}
Ok(())
}
@@ -451,8 +472,9 @@ fn intermediate_buckets_to_final_buckets_fill_gaps(
buckets: Vec<IntermediateHistogramBucketEntry>,
histogram_req: &HistogramAggregation,
sub_aggregation: &AggregationsInternal,
schema: &Schema,
) -> crate::Result<Vec<BucketEntry>> {
// Generate the the full list of buckets without gaps.
// Generate the full list of buckets without gaps.
//
// The bounds are the min max from the current buckets, optionally extended by
// extended_bounds from the request
@@ -491,7 +513,9 @@ fn intermediate_buckets_to_final_buckets_fill_gaps(
sub_aggregation: empty_sub_aggregation.clone(),
},
})
.map(|intermediate_bucket| intermediate_bucket.into_final_bucket_entry(sub_aggregation))
.map(|intermediate_bucket| {
intermediate_bucket.into_final_bucket_entry(sub_aggregation, schema)
})
.collect::<crate::Result<Vec<_>>>()
}
@@ -500,20 +524,43 @@ pub(crate) fn intermediate_histogram_buckets_to_final_buckets(
buckets: Vec<IntermediateHistogramBucketEntry>,
histogram_req: &HistogramAggregation,
sub_aggregation: &AggregationsInternal,
schema: &Schema,
) -> crate::Result<Vec<BucketEntry>> {
if histogram_req.min_doc_count() == 0 {
let mut buckets = if histogram_req.min_doc_count() == 0 {
// With min_doc_count != 0, we may need to add buckets, so that there are no
// gaps, since intermediate result does not contain empty buckets (filtered to
// reduce serialization size).
intermediate_buckets_to_final_buckets_fill_gaps(buckets, histogram_req, sub_aggregation)
intermediate_buckets_to_final_buckets_fill_gaps(
buckets,
histogram_req,
sub_aggregation,
schema,
)?
} else {
buckets
.into_iter()
.filter(|histogram_bucket| histogram_bucket.doc_count >= histogram_req.min_doc_count())
.map(|histogram_bucket| histogram_bucket.into_final_bucket_entry(sub_aggregation))
.collect::<crate::Result<Vec<_>>>()
.map(|histogram_bucket| {
histogram_bucket.into_final_bucket_entry(sub_aggregation, schema)
})
.collect::<crate::Result<Vec<_>>>()?
};
// If we have a date type on the histogram buckets, we add the `key_as_string` field as rfc339
let field = schema
.get_field(&histogram_req.field)
.ok_or_else(|| TantivyError::FieldNotFound(histogram_req.field.to_string()))?;
if schema.get_field_entry(field).field_type().is_date() {
for bucket in buckets.iter_mut() {
if let crate::aggregation::Key::F64(val) = bucket.key {
let key_as_string = format_date(val as i64)?;
bucket.key_as_string = Some(key_as_string);
}
}
}
Ok(buckets)
}
/// Applies req extended_bounds/hard_bounds on the min_max value
@@ -1372,6 +1419,63 @@ mod tests {
Ok(())
}
#[test]
fn histogram_date_test_single_segment() -> crate::Result<()> {
histogram_date_test_with_opt(true)
}
#[test]
fn histogram_date_test_multi_segment() -> crate::Result<()> {
histogram_date_test_with_opt(false)
}
fn histogram_date_test_with_opt(merge_segments: bool) -> crate::Result<()> {
let index = get_test_index_2_segments(merge_segments)?;
let agg_req: Aggregations = vec![(
"histogram".to_string(),
Aggregation::Bucket(BucketAggregation {
bucket_agg: BucketAggregationType::Histogram(HistogramAggregation {
field: "date".to_string(),
interval: 86400000000.0, // one day in microseconds
..Default::default()
}),
sub_aggregation: Default::default(),
}),
)]
.into_iter()
.collect();
let agg_res = exec_request(agg_req, &index)?;
let res: Value = serde_json::from_str(&serde_json::to_string(&agg_res)?)?;
assert_eq!(res["histogram"]["buckets"][0]["key"], 1546300800000000.0);
assert_eq!(
res["histogram"]["buckets"][0]["key_as_string"],
"2019-01-01T00:00:00Z"
);
assert_eq!(res["histogram"]["buckets"][0]["doc_count"], 1);
assert_eq!(res["histogram"]["buckets"][1]["key"], 1546387200000000.0);
assert_eq!(
res["histogram"]["buckets"][1]["key_as_string"],
"2019-01-02T00:00:00Z"
);
assert_eq!(res["histogram"]["buckets"][1]["doc_count"], 5);
assert_eq!(res["histogram"]["buckets"][2]["key"], 1546473600000000.0);
assert_eq!(
res["histogram"]["buckets"][2]["key_as_string"],
"2019-01-03T00:00:00Z"
);
assert_eq!(res["histogram"]["buckets"][3], Value::Null);
Ok(())
}
#[test]
fn histogram_invalid_request() -> crate::Result<()> {
let index = get_test_index_2_segments(true)?;
@@ -1438,4 +1542,36 @@ mod tests {
Ok(())
}
#[test]
fn histogram_test_max_buckets_segments() -> crate::Result<()> {
let values = vec![0.0, 70000.0];
let index = get_test_index_from_values(true, &values)?;
let agg_req: Aggregations = vec![(
"my_interval".to_string(),
Aggregation::Bucket(BucketAggregation {
bucket_agg: BucketAggregationType::Histogram(HistogramAggregation {
field: "score_f64".to_string(),
interval: 1.0,
..Default::default()
}),
sub_aggregation: Default::default(),
}),
)]
.into_iter()
.collect();
let res = exec_request(agg_req, &index);
assert_eq!(
res.unwrap_err().to_string(),
"An invalid argument was passed: 'Aborting aggregation because too many buckets were \
created'"
.to_string()
);
Ok(())
}
}

View File

@@ -1,7 +1,8 @@
use std::fmt::Debug;
use std::ops::Range;
use fnv::FnvHashMap;
use fastfield_codecs::MonotonicallyMappableToU64;
use rustc_hash::FxHashMap;
use serde::{Deserialize, Serialize};
use crate::aggregation::agg_req_with_accessor::{
@@ -11,7 +12,9 @@ use crate::aggregation::intermediate_agg_result::{
IntermediateBucketResult, IntermediateRangeBucketEntry, IntermediateRangeBucketResult,
};
use crate::aggregation::segment_agg_result::{BucketCount, SegmentAggregationResultsCollector};
use crate::aggregation::{f64_from_fastfield_u64, f64_to_fastfield_u64, Key, SerializedKey};
use crate::aggregation::{
f64_from_fastfield_u64, f64_to_fastfield_u64, format_date, Key, SerializedKey,
};
use crate::schema::Type;
use crate::{DocId, TantivyError};
@@ -176,12 +179,12 @@ impl SegmentRangeCollector {
) -> crate::Result<IntermediateBucketResult> {
let field_type = self.field_type;
let buckets: FnvHashMap<SerializedKey, IntermediateRangeBucketEntry> = self
let buckets: FxHashMap<SerializedKey, IntermediateRangeBucketEntry> = self
.buckets
.into_iter()
.map(move |range_bucket| {
Ok((
range_to_string(&range_bucket.range, &field_type),
range_to_string(&range_bucket.range, &field_type)?,
range_bucket
.bucket
.into_intermediate_bucket_entry(&agg_with_accessor.sub_aggregation)?,
@@ -209,8 +212,8 @@ impl SegmentRangeCollector {
let key = range
.key
.clone()
.map(Key::Str)
.unwrap_or_else(|| range_to_key(&range.range, &field_type));
.map(|key| Ok(Key::Str(key)))
.unwrap_or_else(|| range_to_key(&range.range, &field_type))?;
let to = if range.range.end == u64::MAX {
None
} else {
@@ -228,6 +231,7 @@ impl SegmentRangeCollector {
sub_aggregation,
)?)
};
Ok(SegmentRangeAndBucketEntry {
range: range.range.clone(),
bucket: SegmentRangeBucketEntry {
@@ -263,10 +267,10 @@ impl SegmentRangeCollector {
.as_single()
.expect("unexpected fast field cardinality");
for docs in iter.by_ref() {
let val1 = accessor.get_val(docs[0] as u64);
let val2 = accessor.get_val(docs[1] as u64);
let val3 = accessor.get_val(docs[2] as u64);
let val4 = accessor.get_val(docs[3] as u64);
let val1 = accessor.get_val(docs[0]);
let val2 = accessor.get_val(docs[1]);
let val3 = accessor.get_val(docs[2]);
let val4 = accessor.get_val(docs[3]);
let bucket_pos1 = self.get_bucket_pos(val1);
let bucket_pos2 = self.get_bucket_pos(val2);
let bucket_pos3 = self.get_bucket_pos(val3);
@@ -278,7 +282,7 @@ impl SegmentRangeCollector {
self.increment_bucket(bucket_pos4, docs[3], &bucket_with_accessor.sub_aggregation)?;
}
for &doc in iter.remainder() {
let val = accessor.get_val(doc as u64);
let val = accessor.get_val(doc);
let bucket_pos = self.get_bucket_pos(val);
self.increment_bucket(bucket_pos, doc, &bucket_with_accessor.sub_aggregation)?;
}
@@ -323,8 +327,8 @@ impl SegmentRangeCollector {
/// Converts the user provided f64 range value to fast field value space.
///
/// Internally fast field values are always stored as u64.
/// If the fast field has u64 [1,2,5], these values are stored as is in the fast field.
/// A fast field with f64 [1.0, 2.0, 5.0] is converted to u64 space, using a
/// If the fast field has u64 `[1, 2, 5]`, these values are stored as is in the fast field.
/// A fast field with f64 `[1.0, 2.0, 5.0]` is converted to u64 space, using a
/// monotonic mapping function, so the order is preserved.
///
/// Consequently, a f64 user range 1.0..3.0 needs to be converted to fast field value space using
@@ -402,34 +406,45 @@ fn extend_validate_ranges(
Ok(converted_buckets)
}
pub(crate) fn range_to_string(range: &Range<u64>, field_type: &Type) -> String {
pub(crate) fn range_to_string(range: &Range<u64>, field_type: &Type) -> crate::Result<String> {
// is_start is there for malformed requests, e.g. ig the user passes the range u64::MIN..0.0,
// it should be rendered as "*-0" and not "*-*"
let to_str = |val: u64, is_start: bool| {
if (is_start && val == u64::MIN) || (!is_start && val == u64::MAX) {
"*".to_string()
Ok("*".to_string())
} else if *field_type == Type::Date {
let val = i64::from_u64(val);
format_date(val)
} else {
f64_from_fastfield_u64(val, field_type).to_string()
Ok(f64_from_fastfield_u64(val, field_type).to_string())
}
};
format!("{}-{}", to_str(range.start, true), to_str(range.end, false))
Ok(format!(
"{}-{}",
to_str(range.start, true)?,
to_str(range.end, false)?
))
}
pub(crate) fn range_to_key(range: &Range<u64>, field_type: &Type) -> Key {
Key::Str(range_to_string(range, field_type))
pub(crate) fn range_to_key(range: &Range<u64>, field_type: &Type) -> crate::Result<Key> {
Ok(Key::Str(range_to_string(range, field_type)?))
}
#[cfg(test)]
mod tests {
use fastfield_codecs::MonotonicallyMappableToU64;
use serde_json::Value;
use super::*;
use crate::aggregation::agg_req::{
Aggregation, Aggregations, BucketAggregation, BucketAggregationType,
};
use crate::aggregation::tests::{exec_request_with_query, get_test_index_with_num_docs};
use crate::aggregation::tests::{
exec_request, exec_request_with_query, get_test_index_2_segments,
get_test_index_with_num_docs,
};
pub fn get_collector_from_ranges(
ranges: Vec<RangeAggregationRange>,
@@ -567,6 +582,77 @@ mod tests {
Ok(())
}
#[test]
fn range_date_test_single_segment() -> crate::Result<()> {
range_date_test_with_opt(true)
}
#[test]
fn range_date_test_multi_segment() -> crate::Result<()> {
range_date_test_with_opt(false)
}
fn range_date_test_with_opt(merge_segments: bool) -> crate::Result<()> {
let index = get_test_index_2_segments(merge_segments)?;
let agg_req: Aggregations = vec![(
"date_ranges".to_string(),
Aggregation::Bucket(BucketAggregation {
bucket_agg: BucketAggregationType::Range(RangeAggregation {
field: "date".to_string(),
ranges: vec![
RangeAggregationRange {
key: None,
from: None,
to: Some(1546300800000000.0f64),
},
RangeAggregationRange {
key: None,
from: Some(1546300800000000.0f64),
to: Some(1546387200000000.0f64),
},
],
keyed: false,
}),
sub_aggregation: Default::default(),
}),
)]
.into_iter()
.collect();
let agg_res = exec_request(agg_req, &index)?;
let res: Value = serde_json::from_str(&serde_json::to_string(&agg_res)?)?;
assert_eq!(
res["date_ranges"]["buckets"][0]["from_as_string"],
Value::Null
);
assert_eq!(
res["date_ranges"]["buckets"][0]["key"],
"*-2019-01-01T00:00:00Z"
);
assert_eq!(
res["date_ranges"]["buckets"][1]["from_as_string"],
"2019-01-01T00:00:00Z"
);
assert_eq!(
res["date_ranges"]["buckets"][1]["to_as_string"],
"2019-01-02T00:00:00Z"
);
assert_eq!(
res["date_ranges"]["buckets"][2]["from_as_string"],
"2019-01-02T00:00:00Z"
);
assert_eq!(
res["date_ranges"]["buckets"][2]["to_as_string"],
Value::Null
);
Ok(())
}
#[test]
fn range_custom_key_keyed_buckets_test() -> crate::Result<()> {
let index = get_test_index_with_num_docs(false, 100)?;

View File

@@ -1,7 +1,7 @@
use std::fmt::Debug;
use fnv::FnvHashMap;
use itertools::Itertools;
use rustc_hash::FxHashMap;
use serde::{Deserialize, Serialize};
use super::{CustomOrder, Order, OrderTarget};
@@ -17,7 +17,11 @@ use crate::fastfield::MultiValuedFastFieldReader;
use crate::schema::Type;
use crate::{DocId, TantivyError};
/// Creates a bucket for every unique term
/// Creates a bucket for every unique term and counts the number of occurences.
/// Note that doc_count in the response buckets equals term count here.
///
/// If the text is untokenized and single value, that means one term per document and therefore it
/// is in fact doc count.
///
/// ### Terminology
/// Shard parameters are supposed to be equivalent to elasticsearch shard parameter.
@@ -64,6 +68,25 @@ use crate::{DocId, TantivyError};
/// }
/// }
/// ```
///
/// /// # Response JSON Format
/// ```json
/// {
/// ...
/// "aggregations": {
/// "genres": {
/// "doc_count_error_upper_bound": 0,
/// "sum_other_doc_count": 0,
/// "buckets": [
/// { "key": "drumnbass", "doc_count": 6 },
/// { "key": "raggae", "doc_count": 4 },
/// { "key": "jazz", "doc_count": 2 }
/// ]
/// }
/// }
/// }
/// ```
#[derive(Clone, Debug, Default, PartialEq, Serialize, Deserialize)]
pub struct TermsAggregation {
/// The field to aggregate on.
@@ -176,7 +199,7 @@ impl TermsAggregationInternal {
#[derive(Clone, Debug, PartialEq)]
/// Container to store term_ids and their buckets.
struct TermBuckets {
pub(crate) entries: FnvHashMap<u32, TermBucketEntry>,
pub(crate) entries: FxHashMap<u32, TermBucketEntry>,
blueprint: Option<SegmentAggregationResultsCollector>,
}
@@ -374,7 +397,7 @@ impl SegmentTermCollector {
.expect("internal error: inverted index not loaded for term aggregation");
let term_dict = inverted_index.terms();
let mut dict: FnvHashMap<String, IntermediateTermBucketEntry> = Default::default();
let mut dict: FxHashMap<String, IntermediateTermBucketEntry> = Default::default();
let mut buffer = vec![];
for (term_id, entry) in entries {
term_dict
@@ -1106,9 +1129,9 @@ mod tests {
assert_eq!(res["my_texts"]["buckets"][0]["key"], "terma");
assert_eq!(res["my_texts"]["buckets"][0]["doc_count"], 4);
assert_eq!(res["my_texts"]["buckets"][1]["key"], "termb");
assert_eq!(res["my_texts"]["buckets"][1]["key"], "termc");
assert_eq!(res["my_texts"]["buckets"][1]["doc_count"], 0);
assert_eq!(res["my_texts"]["buckets"][2]["key"], "termc");
assert_eq!(res["my_texts"]["buckets"][2]["key"], "termb");
assert_eq!(res["my_texts"]["buckets"][2]["doc_count"], 0);
assert_eq!(res["my_texts"]["sum_other_doc_count"], 0);
assert_eq!(res["my_texts"]["doc_count_error_upper_bound"], 0);
@@ -1206,11 +1229,43 @@ mod tests {
.collect();
let res = exec_request_with_query(agg_req, &index, None);
assert!(res.is_err());
Ok(())
}
#[test]
fn terms_aggregation_multi_token_per_doc() -> crate::Result<()> {
let terms = vec!["Hello Hello", "Hallo Hallo"];
let index = get_test_index_from_terms(true, &[terms])?;
let agg_req: Aggregations = vec![(
"my_texts".to_string(),
Aggregation::Bucket(BucketAggregation {
bucket_agg: BucketAggregationType::Terms(TermsAggregation {
field: "text_id".to_string(),
min_doc_count: Some(0),
..Default::default()
}),
sub_aggregation: Default::default(),
}),
)]
.into_iter()
.collect();
let res = exec_request_with_query(agg_req, &index, None).unwrap();
assert_eq!(res["my_texts"]["buckets"][0]["key"], "hello");
assert_eq!(res["my_texts"]["buckets"][0]["doc_count"], 2);
assert_eq!(res["my_texts"]["buckets"][1]["key"], "hallo");
assert_eq!(res["my_texts"]["buckets"][1]["doc_count"], 2);
Ok(())
}
#[test]
fn test_json_format() -> crate::Result<()> {
let agg_req: Aggregations = vec![(

View File

@@ -7,6 +7,7 @@ use super::intermediate_agg_result::IntermediateAggregationResults;
use super::segment_agg_result::SegmentAggregationResultsCollector;
use crate::aggregation::agg_req_with_accessor::get_aggs_with_accessor_and_validate;
use crate::collector::{Collector, SegmentCollector};
use crate::schema::Schema;
use crate::{SegmentReader, TantivyError};
/// The default max bucket count, before the aggregation fails.
@@ -16,6 +17,7 @@ pub const MAX_BUCKET_COUNT: u32 = 65000;
///
/// The collector collects all aggregations by the underlying aggregation request.
pub struct AggregationCollector {
schema: Schema,
agg: Aggregations,
max_bucket_count: u32,
}
@@ -25,8 +27,9 @@ impl AggregationCollector {
///
/// Aggregation fails when the total bucket count is higher than max_bucket_count.
/// max_bucket_count will default to `MAX_BUCKET_COUNT` (65000) when unset
pub fn from_aggs(agg: Aggregations, max_bucket_count: Option<u32>) -> Self {
pub fn from_aggs(agg: Aggregations, max_bucket_count: Option<u32>, schema: Schema) -> Self {
Self {
schema,
agg,
max_bucket_count: max_bucket_count.unwrap_or(MAX_BUCKET_COUNT),
}
@@ -113,7 +116,7 @@ impl Collector for AggregationCollector {
segment_fruits: Vec<<Self::Child as SegmentCollector>::Fruit>,
) -> crate::Result<Self::Fruit> {
let res = merge_fruits(segment_fruits)?;
res.into_final_bucket_result(self.agg.clone())
res.into_final_bucket_result(self.agg.clone(), &self.schema)
}
}

18
src/aggregation/date.rs Normal file
View File

@@ -0,0 +1,18 @@
use time::format_description::well_known::Rfc3339;
use time::OffsetDateTime;
use crate::TantivyError;
pub(crate) fn format_date(val: i64) -> crate::Result<String> {
let datetime =
OffsetDateTime::from_unix_timestamp_nanos(1_000 * (val as i128)).map_err(|err| {
TantivyError::InvalidArgument(format!(
"Could not convert {:?} to OffsetDateTime, err {:?}",
val, err
))
})?;
let key_as_string = datetime
.format(&Rfc3339)
.map_err(|_err| TantivyError::InvalidArgument("Could not serialize date".to_string()))?;
Ok(key_as_string)
}

View File

@@ -3,15 +3,14 @@
//! indices.
use std::cmp::Ordering;
use std::collections::HashMap;
use fnv::FnvHashMap;
use itertools::Itertools;
use rustc_hash::FxHashMap;
use serde::{Deserialize, Serialize};
use super::agg_req::{
Aggregations, AggregationsInternal, BucketAggregationInternal, BucketAggregationType,
MetricAggregation,
MetricAggregation, RangeAggregation,
};
use super::agg_result::{AggregationResult, BucketResult, RangeBucketEntry};
use super::bucket::{
@@ -20,9 +19,11 @@ use super::bucket::{
};
use super::metric::{IntermediateAverage, IntermediateStats};
use super::segment_agg_result::SegmentMetricResultCollector;
use super::{Key, SerializedKey, VecWithNames};
use super::{format_date, Key, SerializedKey, VecWithNames};
use crate::aggregation::agg_result::{AggregationResults, BucketEntries, BucketEntry};
use crate::aggregation::bucket::TermsAggregationInternal;
use crate::schema::Schema;
use crate::TantivyError;
/// Contains the intermediate aggregation result, which is optimized to be merged with other
/// intermediate results.
@@ -36,8 +37,12 @@ pub struct IntermediateAggregationResults {
impl IntermediateAggregationResults {
/// Convert intermediate result and its aggregation request to the final result.
pub fn into_final_bucket_result(self, req: Aggregations) -> crate::Result<AggregationResults> {
self.into_final_bucket_result_internal(&(req.into()))
pub fn into_final_bucket_result(
self,
req: Aggregations,
schema: &Schema,
) -> crate::Result<AggregationResults> {
self.into_final_bucket_result_internal(&(req.into()), schema)
}
/// Convert intermediate result and its aggregation request to the final result.
@@ -47,18 +52,19 @@ impl IntermediateAggregationResults {
pub(crate) fn into_final_bucket_result_internal(
self,
req: &AggregationsInternal,
schema: &Schema,
) -> crate::Result<AggregationResults> {
// Important assumption:
// When the tree contains buckets/metric, we expect it to have all buckets/metrics from the
// request
let mut results: HashMap<String, AggregationResult> = HashMap::new();
let mut results: FxHashMap<String, AggregationResult> = FxHashMap::default();
if let Some(buckets) = self.buckets {
convert_and_add_final_buckets_to_result(&mut results, buckets, &req.buckets)?
convert_and_add_final_buckets_to_result(&mut results, buckets, &req.buckets, schema)?
} else {
// When there are no buckets, we create empty buckets, so that the serialized json
// format is constant
add_empty_final_buckets_to_result(&mut results, &req.buckets)?
add_empty_final_buckets_to_result(&mut results, &req.buckets, schema)?
};
if let Some(metrics) = self.metrics {
@@ -132,7 +138,7 @@ impl IntermediateAggregationResults {
}
fn convert_and_add_final_metrics_to_result(
results: &mut HashMap<String, AggregationResult>,
results: &mut FxHashMap<String, AggregationResult>,
metrics: VecWithNames<IntermediateMetricResult>,
) {
results.extend(
@@ -143,7 +149,7 @@ fn convert_and_add_final_metrics_to_result(
}
fn add_empty_final_metrics_to_result(
results: &mut HashMap<String, AggregationResult>,
results: &mut FxHashMap<String, AggregationResult>,
req_metrics: &VecWithNames<MetricAggregation>,
) -> crate::Result<()> {
results.extend(req_metrics.iter().map(|(key, req)| {
@@ -157,27 +163,30 @@ fn add_empty_final_metrics_to_result(
}
fn add_empty_final_buckets_to_result(
results: &mut HashMap<String, AggregationResult>,
results: &mut FxHashMap<String, AggregationResult>,
req_buckets: &VecWithNames<BucketAggregationInternal>,
schema: &Schema,
) -> crate::Result<()> {
let requested_buckets = req_buckets.iter();
for (key, req) in requested_buckets {
let empty_bucket = AggregationResult::BucketResult(BucketResult::empty_from_req(req)?);
let empty_bucket =
AggregationResult::BucketResult(BucketResult::empty_from_req(req, schema)?);
results.insert(key.to_string(), empty_bucket);
}
Ok(())
}
fn convert_and_add_final_buckets_to_result(
results: &mut HashMap<String, AggregationResult>,
results: &mut FxHashMap<String, AggregationResult>,
buckets: VecWithNames<IntermediateBucketResult>,
req_buckets: &VecWithNames<BucketAggregationInternal>,
schema: &Schema,
) -> crate::Result<()> {
assert_eq!(buckets.len(), req_buckets.len());
let buckets_with_request = buckets.into_iter().zip(req_buckets.values());
for ((key, bucket), req) in buckets_with_request {
let result = AggregationResult::BucketResult(bucket.into_final_bucket_result(req)?);
let result = AggregationResult::BucketResult(bucket.into_final_bucket_result(req, schema)?);
results.insert(key, result);
}
Ok(())
@@ -267,13 +276,21 @@ impl IntermediateBucketResult {
pub(crate) fn into_final_bucket_result(
self,
req: &BucketAggregationInternal,
schema: &Schema,
) -> crate::Result<BucketResult> {
match self {
IntermediateBucketResult::Range(range_res) => {
let mut buckets: Vec<RangeBucketEntry> = range_res
.buckets
.into_iter()
.map(|(_, bucket)| bucket.into_final_bucket_entry(&req.sub_aggregation))
.into_values()
.map(|bucket| {
bucket.into_final_bucket_entry(
&req.sub_aggregation,
schema,
req.as_range()
.expect("unexpected aggregation, expected histogram aggregation"),
)
})
.collect::<crate::Result<Vec<_>>>()?;
buckets.sort_by(|left, right| {
@@ -288,7 +305,7 @@ impl IntermediateBucketResult {
.keyed;
let buckets = if is_keyed {
let mut bucket_map =
FnvHashMap::with_capacity_and_hasher(buckets.len(), Default::default());
FxHashMap::with_capacity_and_hasher(buckets.len(), Default::default());
for bucket in buckets {
bucket_map.insert(bucket.key.to_string(), bucket);
}
@@ -304,11 +321,12 @@ impl IntermediateBucketResult {
req.as_histogram()
.expect("unexpected aggregation, expected histogram aggregation"),
&req.sub_aggregation,
schema,
)?;
let buckets = if req.as_histogram().unwrap().keyed {
let mut bucket_map =
FnvHashMap::with_capacity_and_hasher(buckets.len(), Default::default());
FxHashMap::with_capacity_and_hasher(buckets.len(), Default::default());
for bucket in buckets {
bucket_map.insert(bucket.key.to_string(), bucket);
}
@@ -322,6 +340,7 @@ impl IntermediateBucketResult {
req.as_term()
.expect("unexpected aggregation, expected term aggregation"),
&req.sub_aggregation,
schema,
),
}
}
@@ -396,13 +415,13 @@ impl IntermediateBucketResult {
#[derive(Default, Clone, Debug, PartialEq, Serialize, Deserialize)]
/// Range aggregation including error counts
pub struct IntermediateRangeBucketResult {
pub(crate) buckets: FnvHashMap<SerializedKey, IntermediateRangeBucketEntry>,
pub(crate) buckets: FxHashMap<SerializedKey, IntermediateRangeBucketEntry>,
}
#[derive(Default, Clone, Debug, PartialEq, Serialize, Deserialize)]
/// Term aggregation including error counts
pub struct IntermediateTermBucketResult {
pub(crate) entries: FnvHashMap<String, IntermediateTermBucketEntry>,
pub(crate) entries: FxHashMap<String, IntermediateTermBucketEntry>,
pub(crate) sum_other_doc_count: u64,
pub(crate) doc_count_error_upper_bound: u64,
}
@@ -412,6 +431,7 @@ impl IntermediateTermBucketResult {
self,
req: &TermsAggregation,
sub_aggregation_req: &AggregationsInternal,
schema: &Schema,
) -> crate::Result<BucketResult> {
let req = TermsAggregationInternal::from_req(req);
let mut buckets: Vec<BucketEntry> = self
@@ -420,11 +440,12 @@ impl IntermediateTermBucketResult {
.filter(|bucket| bucket.1.doc_count >= req.min_doc_count)
.map(|(key, entry)| {
Ok(BucketEntry {
key_as_string: None,
key: Key::Str(key),
doc_count: entry.doc_count,
sub_aggregation: entry
.sub_aggregation
.into_final_bucket_result_internal(sub_aggregation_req)?,
.into_final_bucket_result_internal(sub_aggregation_req, schema)?,
})
})
.collect::<crate::Result<_>>()?;
@@ -499,8 +520,8 @@ trait MergeFruits {
}
fn merge_maps<V: MergeFruits + Clone>(
entries_left: &mut FnvHashMap<SerializedKey, V>,
mut entries_right: FnvHashMap<SerializedKey, V>,
entries_left: &mut FxHashMap<SerializedKey, V>,
mut entries_right: FxHashMap<SerializedKey, V>,
) {
for (name, entry_left) in entries_left.iter_mut() {
if let Some(entry_right) = entries_right.remove(name) {
@@ -529,13 +550,15 @@ impl IntermediateHistogramBucketEntry {
pub(crate) fn into_final_bucket_entry(
self,
req: &AggregationsInternal,
schema: &Schema,
) -> crate::Result<BucketEntry> {
Ok(BucketEntry {
key_as_string: None,
key: Key::F64(self.key),
doc_count: self.doc_count,
sub_aggregation: self
.sub_aggregation
.into_final_bucket_result_internal(req)?,
.into_final_bucket_result_internal(req, schema)?,
})
}
}
@@ -572,16 +595,38 @@ impl IntermediateRangeBucketEntry {
pub(crate) fn into_final_bucket_entry(
self,
req: &AggregationsInternal,
schema: &Schema,
range_req: &RangeAggregation,
) -> crate::Result<RangeBucketEntry> {
Ok(RangeBucketEntry {
let mut range_bucket_entry = RangeBucketEntry {
key: self.key,
doc_count: self.doc_count,
sub_aggregation: self
.sub_aggregation
.into_final_bucket_result_internal(req)?,
.into_final_bucket_result_internal(req, schema)?,
to: self.to,
from: self.from,
})
to_as_string: None,
from_as_string: None,
};
// If we have a date type on the histogram buckets, we add the `key_as_string` field as
// rfc339
let field = schema
.get_field(&range_req.field)
.ok_or_else(|| TantivyError::FieldNotFound(range_req.field.to_string()))?;
if schema.get_field_entry(field).field_type().is_date() {
if let Some(val) = range_bucket_entry.to {
let key_as_string = format_date(val as i64)?;
range_bucket_entry.to_as_string = Some(key_as_string);
}
if let Some(val) = range_bucket_entry.from {
let key_as_string = format_date(val as i64)?;
range_bucket_entry.from_as_string = Some(key_as_string);
}
}
Ok(range_bucket_entry)
}
}
@@ -626,7 +671,7 @@ mod tests {
fn get_sub_test_tree(data: &[(String, u64)]) -> IntermediateAggregationResults {
let mut map = HashMap::new();
let mut buckets = FnvHashMap::default();
let mut buckets = FxHashMap::default();
for (key, doc_count) in data {
buckets.insert(
key.to_string(),
@@ -653,7 +698,7 @@ mod tests {
data: &[(String, u64, String, u64)],
) -> IntermediateAggregationResults {
let mut map = HashMap::new();
let mut buckets: FnvHashMap<_, _> = Default::default();
let mut buckets: FxHashMap<_, _> = Default::default();
for (key, doc_count, sub_aggregation_key, sub_aggregation_count) in data {
buckets.insert(
key.to_string(),

View File

@@ -60,10 +60,10 @@ impl SegmentAverageCollector {
pub(crate) fn collect_block(&mut self, doc: &[DocId], field: &dyn Column<u64>) {
let mut iter = doc.chunks_exact(4);
for docs in iter.by_ref() {
let val1 = field.get_val(docs[0] as u64);
let val2 = field.get_val(docs[1] as u64);
let val3 = field.get_val(docs[2] as u64);
let val4 = field.get_val(docs[3] as u64);
let val1 = field.get_val(docs[0]);
let val2 = field.get_val(docs[1]);
let val3 = field.get_val(docs[2]);
let val4 = field.get_val(docs[3]);
let val1 = f64_from_fastfield_u64(val1, &self.field_type);
let val2 = f64_from_fastfield_u64(val2, &self.field_type);
let val3 = f64_from_fastfield_u64(val3, &self.field_type);
@@ -74,7 +74,7 @@ impl SegmentAverageCollector {
self.data.collect(val4);
}
for &doc in iter.remainder() {
let val = field.get_val(doc as u64);
let val = field.get_val(doc);
let val = f64_from_fastfield_u64(val, &self.field_type);
self.data.collect(val);
}

View File

@@ -166,10 +166,10 @@ impl SegmentStatsCollector {
pub(crate) fn collect_block(&mut self, doc: &[DocId], field: &dyn Column<u64>) {
let mut iter = doc.chunks_exact(4);
for docs in iter.by_ref() {
let val1 = field.get_val(docs[0] as u64);
let val2 = field.get_val(docs[1] as u64);
let val3 = field.get_val(docs[2] as u64);
let val4 = field.get_val(docs[3] as u64);
let val1 = field.get_val(docs[0]);
let val2 = field.get_val(docs[1]);
let val3 = field.get_val(docs[2]);
let val4 = field.get_val(docs[3]);
let val1 = f64_from_fastfield_u64(val1, &self.field_type);
let val2 = f64_from_fastfield_u64(val2, &self.field_type);
let val3 = f64_from_fastfield_u64(val3, &self.field_type);
@@ -180,7 +180,7 @@ impl SegmentStatsCollector {
self.stats.collect(val4);
}
for &doc in iter.remainder() {
let val = field.get_val(doc as u64);
let val = field.get_val(doc);
let val = f64_from_fastfield_u64(val, &self.field_type);
self.stats.collect(val);
}
@@ -222,7 +222,7 @@ mod tests {
.into_iter()
.collect();
let collector = AggregationCollector::from_aggs(agg_req_1, None);
let collector = AggregationCollector::from_aggs(agg_req_1, None, index.schema());
let reader = index.reader()?;
let searcher = reader.searcher();
@@ -300,7 +300,7 @@ mod tests {
.into_iter()
.collect();
let collector = AggregationCollector::from_aggs(agg_req_1, None);
let collector = AggregationCollector::from_aggs(agg_req_1, None, index.schema());
let searcher = reader.searcher();
let agg_res: AggregationResults = searcher.search(&term_query, &collector).unwrap();

View File

@@ -10,21 +10,19 @@
//!
//! There are two categories: [Metrics](metric) and [Buckets](bucket).
//!
//! # Usage
//!
//! ## Prerequisite
//! Currently aggregations work only on [fast fields](`crate::fastfield`). Single value fast fields
//! of type `u64`, `f64`, `i64`, `date` and fast fields on text fields.
//!
//! ## Usage
//! To use aggregations, build an aggregation request by constructing
//! [`Aggregations`](agg_req::Aggregations).
//! Create an [`AggregationCollector`] from this request. `AggregationCollector` implements the
//! [`Collector`](crate::collector::Collector) trait and can be passed as collector into
//! [`Searcher::search()`](crate::Searcher::search).
//!
//! #### Limitations
//!
//! Currently aggregations work only on single value fast fields of type `u64`, `f64`, `i64` and
//! fast fields on text fields.
//!
//! # JSON Format
//! ## JSON Format
//! Aggregations request and result structures de/serialize into elasticsearch compatible JSON.
//!
//! ```verbatim
@@ -35,7 +33,7 @@
//! let json_response_string: String = &serde_json::to_string(&agg_res)?;
//! ```
//!
//! # Supported Aggregations
//! ## Supported Aggregations
//! - [Bucket](bucket)
//! - [Histogram](bucket::HistogramAggregation)
//! - [Range](bucket::RangeAggregation)
@@ -55,9 +53,10 @@
//! use tantivy::query::AllQuery;
//! use tantivy::aggregation::agg_result::AggregationResults;
//! use tantivy::IndexReader;
//! use tantivy::schema::Schema;
//!
//! # #[allow(dead_code)]
//! fn aggregate_on_index(reader: &IndexReader) {
//! fn aggregate_on_index(reader: &IndexReader, schema: Schema) {
//! let agg_req: Aggregations = vec![
//! (
//! "average".to_string(),
@@ -69,7 +68,7 @@
//! .into_iter()
//! .collect();
//!
//! let collector = AggregationCollector::from_aggs(agg_req, None);
//! let collector = AggregationCollector::from_aggs(agg_req, None, schema);
//!
//! let searcher = reader.searcher();
//! let agg_res: AggregationResults = searcher.search(&AllQuery, &collector).unwrap();
@@ -159,6 +158,7 @@ mod agg_req_with_accessor;
pub mod agg_result;
pub mod bucket;
mod collector;
mod date;
pub mod intermediate_agg_result;
pub mod metric;
mod segment_agg_result;
@@ -169,6 +169,7 @@ pub use collector::{
AggregationCollector, AggregationSegmentCollector, DistributedAggregationCollector,
MAX_BUCKET_COUNT,
};
pub(crate) use date::format_date;
use fastfield_codecs::MonotonicallyMappableToU64;
use itertools::Itertools;
use serde::{Deserialize, Serialize};
@@ -285,11 +286,11 @@ impl Display for Key {
/// Inverse of `to_fastfield_u64`. Used to convert to `f64` for metrics.
///
/// # Panics
/// Only `u64`, `f64`, and `i64` are supported.
/// Only `u64`, `f64`, `date`, and `i64` are supported.
pub(crate) fn f64_from_fastfield_u64(val: u64, field_type: &Type) -> f64 {
match field_type {
Type::U64 => val as f64,
Type::I64 => i64::from_u64(val) as f64,
Type::I64 | Type::Date => i64::from_u64(val) as f64,
Type::F64 => f64::from_u64(val),
_ => {
panic!("unexpected type {:?}. This should not happen", field_type)
@@ -297,10 +298,9 @@ pub(crate) fn f64_from_fastfield_u64(val: u64, field_type: &Type) -> f64 {
}
}
/// Converts the `f64` value to fast field value space.
/// Converts the `f64` value to fast field value space, which is always u64.
///
/// If the fast field has `u64`, values are stored as `u64` in the fast field.
/// A `f64` value of e.g. `2.0` therefore needs to be converted to `1u64`.
/// If the fast field has `u64`, values are stored unchanged as `u64` in the fast field.
///
/// If the fast field has `f64` values are converted and stored to `u64` using a
/// monotonic mapping.
@@ -310,7 +310,7 @@ pub(crate) fn f64_from_fastfield_u64(val: u64, field_type: &Type) -> f64 {
pub(crate) fn f64_to_fastfield_u64(val: f64, field_type: &Type) -> Option<u64> {
match field_type {
Type::U64 => Some(val as u64),
Type::I64 => Some((val as i64).to_u64()),
Type::I64 | Type::Date => Some((val as i64).to_u64()),
Type::F64 => Some(val.to_u64()),
_ => None,
}
@@ -319,6 +319,7 @@ pub(crate) fn f64_to_fastfield_u64(val: f64, field_type: &Type) -> Option<u64> {
#[cfg(test)]
mod tests {
use serde_json::Value;
use time::OffsetDateTime;
use super::agg_req::{Aggregation, Aggregations, BucketAggregation};
use super::bucket::RangeAggregation;
@@ -334,7 +335,7 @@ mod tests {
use crate::aggregation::DistributedAggregationCollector;
use crate::query::{AllQuery, TermQuery};
use crate::schema::{Cardinality, IndexRecordOption, Schema, TextFieldIndexing, FAST, STRING};
use crate::{Index, Term};
use crate::{DateTime, Index, Term};
fn get_avg_req(field_name: &str) -> Aggregation {
Aggregation::Metric(MetricAggregation::Average(
@@ -360,7 +361,7 @@ mod tests {
index: &Index,
query: Option<(&str, &str)>,
) -> crate::Result<Value> {
let collector = AggregationCollector::from_aggs(agg_req, None);
let collector = AggregationCollector::from_aggs(agg_req, None, index.schema());
let reader = index.reader()?;
let searcher = reader.searcher();
@@ -450,9 +451,9 @@ mod tests {
text_field_id => term.to_string(),
string_field_id => term.to_string(),
score_field => i as u64,
score_field_f64 => i as f64,
score_field_f64 => i,
score_field_i64 => i as i64,
fraction_field => i as f64/100.0,
fraction_field => i/100.0,
))?;
}
index_writer.commit()?;
@@ -554,10 +555,10 @@ mod tests {
let searcher = reader.searcher();
let intermediate_agg_result = searcher.search(&AllQuery, &collector).unwrap();
intermediate_agg_result
.into_final_bucket_result(agg_req)
.into_final_bucket_result(agg_req, &index.schema())
.unwrap()
} else {
let collector = AggregationCollector::from_aggs(agg_req, None);
let collector = AggregationCollector::from_aggs(agg_req, None, index.schema());
let searcher = reader.searcher();
searcher.search(&AllQuery, &collector).unwrap()
@@ -650,6 +651,7 @@ mod tests {
.set_fast()
.set_stored();
let text_field = schema_builder.add_text_field("text", text_fieldtype);
let date_field = schema_builder.add_date_field("date", FAST);
schema_builder.add_text_field("dummy_text", STRING);
let score_fieldtype =
crate::schema::NumericOptions::default().set_fast(Cardinality::SingleValue);
@@ -667,6 +669,7 @@ mod tests {
// writing the segment
index_writer.add_document(doc!(
text_field => "cool",
date_field => DateTime::from_utc(OffsetDateTime::from_unix_timestamp(1_546_300_800).unwrap()),
score_field => 1u64,
score_field_f64 => 1f64,
score_field_i64 => 1i64,
@@ -675,6 +678,7 @@ mod tests {
))?;
index_writer.add_document(doc!(
text_field => "cool",
date_field => DateTime::from_utc(OffsetDateTime::from_unix_timestamp(1_546_300_800 + 86400).unwrap()),
score_field => 3u64,
score_field_f64 => 3f64,
score_field_i64 => 3i64,
@@ -683,18 +687,21 @@ mod tests {
))?;
index_writer.add_document(doc!(
text_field => "cool",
date_field => DateTime::from_utc(OffsetDateTime::from_unix_timestamp(1_546_300_800 + 86400).unwrap()),
score_field => 5u64,
score_field_f64 => 5f64,
score_field_i64 => 5i64,
))?;
index_writer.add_document(doc!(
text_field => "nohit",
date_field => DateTime::from_utc(OffsetDateTime::from_unix_timestamp(1_546_300_800 + 86400).unwrap()),
score_field => 6u64,
score_field_f64 => 6f64,
score_field_i64 => 6i64,
))?;
index_writer.add_document(doc!(
text_field => "cool",
date_field => DateTime::from_utc(OffsetDateTime::from_unix_timestamp(1_546_300_800 + 86400).unwrap()),
score_field => 7u64,
score_field_f64 => 7f64,
score_field_i64 => 7i64,
@@ -702,12 +709,14 @@ mod tests {
index_writer.commit()?;
index_writer.add_document(doc!(
text_field => "cool",
date_field => DateTime::from_utc(OffsetDateTime::from_unix_timestamp(1_546_300_800 + 86400).unwrap()),
score_field => 11u64,
score_field_f64 => 11f64,
score_field_i64 => 11i64,
))?;
index_writer.add_document(doc!(
text_field => "cool",
date_field => DateTime::from_utc(OffsetDateTime::from_unix_timestamp(1_546_300_800 + 86400 + 86400).unwrap()),
score_field => 14u64,
score_field_f64 => 14f64,
score_field_i64 => 14i64,
@@ -715,6 +724,7 @@ mod tests {
index_writer.add_document(doc!(
text_field => "cool",
date_field => DateTime::from_utc(OffsetDateTime::from_unix_timestamp(1_546_300_800 + 86400 + 86400).unwrap()),
score_field => 44u64,
score_field_f64 => 44.5f64,
score_field_i64 => 44i64,
@@ -725,6 +735,7 @@ mod tests {
// no hits segment
index_writer.add_document(doc!(
text_field => "nohit",
date_field => DateTime::from_utc(OffsetDateTime::from_unix_timestamp(1_546_300_800 + 86400 + 86400).unwrap()),
score_field => 44u64,
score_field_f64 => 44.5f64,
score_field_i64 => 44i64,
@@ -797,7 +808,7 @@ mod tests {
.into_iter()
.collect();
let collector = AggregationCollector::from_aggs(agg_req_1, None);
let collector = AggregationCollector::from_aggs(agg_req_1, None, index.schema());
let searcher = reader.searcher();
let agg_res: AggregationResults = searcher.search(&term_query, &collector).unwrap();
@@ -997,9 +1008,10 @@ mod tests {
// Test de/serialization roundtrip on intermediate_agg_result
let res: IntermediateAggregationResults =
serde_json::from_str(&serde_json::to_string(&res).unwrap()).unwrap();
res.into_final_bucket_result(agg_req.clone()).unwrap()
res.into_final_bucket_result(agg_req.clone(), &index.schema())
.unwrap()
} else {
let collector = AggregationCollector::from_aggs(agg_req.clone(), None);
let collector = AggregationCollector::from_aggs(agg_req.clone(), None, index.schema());
let searcher = reader.searcher();
searcher.search(&term_query, &collector).unwrap()
@@ -1057,7 +1069,7 @@ mod tests {
);
// Test empty result set
let collector = AggregationCollector::from_aggs(agg_req, None);
let collector = AggregationCollector::from_aggs(agg_req, None, index.schema());
let searcher = reader.searcher();
searcher.search(&query_with_no_hits, &collector).unwrap();
@@ -1122,7 +1134,7 @@ mod tests {
.into_iter()
.collect();
let collector = AggregationCollector::from_aggs(agg_req_1, None);
let collector = AggregationCollector::from_aggs(agg_req_1, None, index.schema());
let searcher = reader.searcher();
@@ -1235,7 +1247,7 @@ mod tests {
.into_iter()
.collect();
let collector = AggregationCollector::from_aggs(agg_req_1, None);
let collector = AggregationCollector::from_aggs(agg_req_1, None, index.schema());
let searcher = reader.searcher();
let agg_res: AggregationResults =
@@ -1266,7 +1278,7 @@ mod tests {
.into_iter()
.collect();
let collector = AggregationCollector::from_aggs(agg_req_1, None);
let collector = AggregationCollector::from_aggs(agg_req_1, None, index.schema());
let searcher = reader.searcher();
let agg_res: AggregationResults =
@@ -1297,7 +1309,7 @@ mod tests {
.into_iter()
.collect();
let collector = AggregationCollector::from_aggs(agg_req_1, None);
let collector = AggregationCollector::from_aggs(agg_req_1, None, index.schema());
let searcher = reader.searcher();
let agg_res: AggregationResults =
@@ -1336,7 +1348,7 @@ mod tests {
.into_iter()
.collect();
let collector = AggregationCollector::from_aggs(agg_req_1, None);
let collector = AggregationCollector::from_aggs(agg_req_1, None, index.schema());
let searcher = reader.searcher();
let agg_res: AggregationResults =
@@ -1365,7 +1377,7 @@ mod tests {
.into_iter()
.collect();
let collector = AggregationCollector::from_aggs(agg_req, None);
let collector = AggregationCollector::from_aggs(agg_req, None, index.schema());
let searcher = reader.searcher();
let agg_res: AggregationResults =
@@ -1394,7 +1406,7 @@ mod tests {
.into_iter()
.collect();
let collector = AggregationCollector::from_aggs(agg_req, None);
let collector = AggregationCollector::from_aggs(agg_req, None, index.schema());
let searcher = reader.searcher();
let agg_res: AggregationResults =
@@ -1431,7 +1443,7 @@ mod tests {
.into_iter()
.collect();
let collector = AggregationCollector::from_aggs(agg_req_1, None);
let collector = AggregationCollector::from_aggs(agg_req_1, None, index.schema());
let searcher = reader.searcher();
let agg_res: AggregationResults =
@@ -1466,7 +1478,7 @@ mod tests {
.into_iter()
.collect();
let collector = AggregationCollector::from_aggs(agg_req_1, None);
let collector = AggregationCollector::from_aggs(agg_req_1, None, index.schema());
let searcher = reader.searcher();
let agg_res: AggregationResults =
@@ -1505,7 +1517,7 @@ mod tests {
.into_iter()
.collect();
let collector = AggregationCollector::from_aggs(agg_req_1, None);
let collector = AggregationCollector::from_aggs(agg_req_1, None, index.schema());
let searcher = reader.searcher();
let agg_res: AggregationResults =
@@ -1535,7 +1547,7 @@ mod tests {
.into_iter()
.collect();
let collector = AggregationCollector::from_aggs(agg_req_1, None);
let collector = AggregationCollector::from_aggs(agg_req_1, None, index.schema());
let searcher = reader.searcher();
let agg_res: AggregationResults =
@@ -1592,7 +1604,7 @@ mod tests {
.into_iter()
.collect();
let collector = AggregationCollector::from_aggs(agg_req_1, None);
let collector = AggregationCollector::from_aggs(agg_req_1, None, index.schema());
let searcher = reader.searcher();
let agg_res: AggregationResults =

View File

@@ -305,7 +305,7 @@ impl BucketCount {
}
pub(crate) fn add_count(&self, count: u32) {
self.bucket_count
.fetch_add(count as u32, std::sync::atomic::Ordering::Relaxed);
.fetch_add(count, std::sync::atomic::Ordering::Relaxed);
}
pub(crate) fn get_count(&self) -> u32 {
self.bucket_count.load(std::sync::atomic::Ordering::Relaxed)

View File

@@ -38,7 +38,7 @@ pub trait CustomSegmentScorer<TScore>: 'static {
pub trait CustomScorer<TScore>: Sync {
/// Type of the associated [`CustomSegmentScorer`].
type Child: CustomSegmentScorer<TScore>;
/// Builds a child scorer for a specific segment. The child scorer is associated to
/// Builds a child scorer for a specific segment. The child scorer is associated with
/// a specific segment.
fn segment_scorer(&self, segment_reader: &SegmentReader) -> crate::Result<Self::Child>;
}

View File

@@ -91,7 +91,7 @@ fn facet_depth(facet_bytes: &[u8]) -> usize {
/// let index = Index::create_in_ram(schema);
/// {
/// let mut index_writer = index.writer(3_000_000)?;
/// // a document can be associated to any number of facets
/// // a document can be associated with any number of facets
/// index_writer.add_document(doc!(
/// title => "The Name of the Wind",
/// facet => Facet::from("/lang/en"),
@@ -338,11 +338,7 @@ impl SegmentCollector for FacetSegmentCollector {
let mut previous_collapsed_ord: usize = usize::MAX;
for &facet_ord in &self.facet_ords_buf {
let collapsed_ord = self.collapse_mapping[facet_ord as usize];
self.counts[collapsed_ord] += if collapsed_ord == previous_collapsed_ord {
0
} else {
1
};
self.counts[collapsed_ord] += u64::from(collapsed_ord != previous_collapsed_ord);
previous_collapsed_ord = collapsed_ord;
}
}
@@ -361,7 +357,7 @@ impl SegmentCollector for FacetSegmentCollector {
let mut facet = vec![];
let facet_ord = self.collapse_facet_ords[collapsed_facet_ord];
// TODO handle errors.
if facet_dict.ord_to_term(facet_ord as u64, &mut facet).is_ok() {
if facet_dict.ord_to_term(facet_ord, &mut facet).is_ok() {
if let Ok(facet) = Facet::from_encoded(facet) {
facet_counts.insert(facet, count);
}
@@ -620,7 +616,7 @@ mod tests {
.map(|mut doc| {
doc.add_facet(
facet_field,
&format!("/facet/{}", thread_rng().sample(&uniform)),
&format!("/facet/{}", thread_rng().sample(uniform)),
);
doc
})

View File

@@ -177,7 +177,7 @@ where
type Fruit = TSegmentCollector::Fruit;
fn collect(&mut self, doc: u32, score: Score) {
let value = self.fast_field_reader.get_val(doc as u64);
let value = self.fast_field_reader.get_val(doc);
if (self.predicate)(value) {
self.segment_collector.collect(doc, score)
}

View File

@@ -37,7 +37,7 @@ impl HistogramCollector {
/// The scale/range of the histogram is not dynamic. It is required to
/// define it by supplying following parameter:
/// - `min_value`: the minimum value that can be recorded in the histogram.
/// - `bucket_width`: the length of the interval that is associated to each buckets.
/// - `bucket_width`: the length of the interval that is associated with each buckets.
/// - `num_buckets`: The overall number of buckets.
///
/// Together, this parameters define a partition of `[min_value, min_value + num_buckets *
@@ -94,7 +94,7 @@ impl SegmentCollector for SegmentHistogramCollector {
type Fruit = Vec<u64>;
fn collect(&mut self, doc: DocId, _score: Score) {
let value = self.ff_reader.get_val(doc as u64);
let value = self.ff_reader.get_val(doc);
self.histogram_computer.add_value(value);
}

View File

@@ -142,7 +142,7 @@ pub trait Collector: Sync + Send {
/// e.g. `usize` for the `Count` collector.
type Fruit: Fruit;
/// Type of the `SegmentCollector` associated to this collector.
/// Type of the `SegmentCollector` associated with this collector.
type Child: SegmentCollector;
/// `set_segment` is called before beginning to enumerate
@@ -156,7 +156,7 @@ pub trait Collector: Sync + Send {
/// Returns true iff the collector requires to compute scores for documents.
fn requires_scoring(&self) -> bool;
/// Combines the fruit associated to the collection of each segments
/// Combines the fruit associated with the collection of each segments
/// into one fruit.
fn merge_fruits(
&self,
@@ -170,19 +170,35 @@ pub trait Collector: Sync + Send {
segment_ord: u32,
reader: &SegmentReader,
) -> crate::Result<<Self::Child as SegmentCollector>::Fruit> {
let mut segment_collector = self.for_segment(segment_ord as u32, reader)?;
let mut segment_collector = self.for_segment(segment_ord, reader)?;
if let Some(alive_bitset) = reader.alive_bitset() {
weight.for_each(reader, &mut |doc, score| {
if alive_bitset.is_alive(doc) {
match (reader.alive_bitset(), self.requires_scoring()) {
(Some(alive_bitset), true) => {
weight.for_each(reader, &mut |doc, score| {
if alive_bitset.is_alive(doc) {
segment_collector.collect(doc, score);
}
})?;
}
(Some(alive_bitset), false) => {
weight.for_each_no_score(reader, &mut |doc| {
if alive_bitset.is_alive(doc) {
segment_collector.collect(doc, 0.0);
}
})?;
}
(None, true) => {
weight.for_each(reader, &mut |doc, score| {
segment_collector.collect(doc, score);
}
})?;
} else {
weight.for_each(reader, &mut |doc, score| {
segment_collector.collect(doc, score);
})?;
})?;
}
(None, false) => {
weight.for_each_no_score(reader, &mut |doc| {
segment_collector.collect(doc, 0.0);
})?;
}
}
Ok(segment_collector.harvest())
}
}

View File

@@ -201,7 +201,7 @@ impl SegmentCollector for FastFieldSegmentCollector {
type Fruit = Vec<u64>;
fn collect(&mut self, doc: DocId, _score: Score) {
let val = self.reader.get_val(doc as u64);
let val = self.reader.get_val(doc);
self.vals.push(val);
}

View File

@@ -137,7 +137,7 @@ struct ScorerByFastFieldReader {
impl CustomSegmentScorer<u64> for ScorerByFastFieldReader {
fn score(&mut self, doc: DocId) -> u64 {
self.ff_reader.get_val(doc as u64)
self.ff_reader.get_val(doc)
}
}
@@ -458,7 +458,7 @@ impl TopDocs {
///
/// // We can now define our actual scoring function
/// move |doc: DocId, original_score: Score| {
/// let popularity: u64 = popularity_reader.get_val(doc as u64);
/// let popularity: u64 = popularity_reader.get_val(doc);
/// // Well.. For the sake of the example we use a simple logarithm
/// // function.
/// let popularity_boost_score = ((2u64 + popularity) as Score).log2();
@@ -567,8 +567,8 @@ impl TopDocs {
///
/// // We can now define our actual scoring function
/// move |doc: DocId| {
/// let popularity: u64 = popularity_reader.get_val(doc as u64);
/// let boosted: u64 = boosted_reader.get_val(doc as u64);
/// let popularity: u64 = popularity_reader.get_val(doc);
/// let boosted: u64 = boosted_reader.get_val(doc);
/// // Score do not have to be `f64` in tantivy.
/// // Here we return a couple to get lexicographical order
/// // for free.
@@ -693,7 +693,7 @@ impl Collector for TopDocs {
}
}
/// Segment Collector associated to `TopDocs`.
/// Segment Collector associated with `TopDocs`.
pub struct TopScoreSegmentCollector(TopSegmentCollector<Score>);
impl SegmentCollector for TopScoreSegmentCollector {

View File

@@ -40,7 +40,7 @@ pub trait ScoreTweaker<TScore>: Sync {
/// Type of the associated [`ScoreSegmentTweaker`].
type Child: ScoreSegmentTweaker<TScore>;
/// Builds a child tweaker for a specific segment. The child scorer is associated to
/// Builds a child tweaker for a specific segment. The child scorer is associated with
/// a specific segment.
fn segment_tweaker(&self, segment_reader: &SegmentReader) -> Result<Self::Child>;
}

View File

@@ -19,7 +19,7 @@ use crate::error::{DataCorruption, TantivyError};
use crate::indexer::index_writer::{MAX_NUM_THREAD, MEMORY_ARENA_NUM_BYTES_MIN};
use crate::indexer::segment_updater::save_metas;
use crate::reader::{IndexReader, IndexReaderBuilder};
use crate::schema::{Field, FieldType, Schema};
use crate::schema::{Cardinality, Field, FieldType, Schema};
use crate::tokenizer::{TextAnalyzer, TokenizerManager};
use crate::IndexWriter;
@@ -149,12 +149,11 @@ impl IndexBuilder {
/// Creates a new index using the [`RamDirectory`].
///
/// The index will be allocated in anonymous memory.
/// This should only be used for unit tests.
/// This is useful for indexing small set of documents
/// for instances like unit test or temporary in memory index.
pub fn create_in_ram(self) -> Result<Index, TantivyError> {
let ram_directory = RamDirectory::create();
Ok(self
.create(ram_directory)
.expect("Creating a RamDirectory should never fail"))
self.create(ram_directory)
}
/// Creates a new index in a given filepath.
@@ -228,10 +227,44 @@ impl IndexBuilder {
))
}
}
fn validate(&self) -> crate::Result<()> {
if let Some(schema) = self.schema.as_ref() {
if let Some(sort_by_field) = self.index_settings.sort_by_field.as_ref() {
let schema_field = schema.get_field(&sort_by_field.field).ok_or_else(|| {
TantivyError::InvalidArgument(format!(
"Field to sort index {} not found in schema",
sort_by_field.field
))
})?;
let entry = schema.get_field_entry(schema_field);
if !entry.is_fast() {
return Err(TantivyError::InvalidArgument(format!(
"Field {} is no fast field. Field needs to be a single value fast field \
to be used to sort an index",
sort_by_field.field
)));
}
if entry.field_type().fastfield_cardinality() != Some(Cardinality::SingleValue) {
return Err(TantivyError::InvalidArgument(format!(
"Only single value fast field Cardinality supported for sorting index {}",
sort_by_field.field
)));
}
}
Ok(())
} else {
Err(TantivyError::InvalidArgument(
"no schema passed".to_string(),
))
}
}
/// Creates a new index given an implementation of the trait `Directory`.
///
/// If a directory previously existed, it will be erased.
fn create<T: Into<Box<dyn Directory>>>(self, dir: T) -> crate::Result<Index> {
self.validate()?;
let dir = dir.into();
let directory = ManagedDirectory::wrap(dir)?;
save_new_metas(
@@ -780,7 +813,7 @@ mod tests {
let field = schema.get_field("num_likes").unwrap();
let tempdir = TempDir::new().unwrap();
let tempdir_path = PathBuf::from(tempdir.path());
let index = Index::create_in_dir(&tempdir_path, schema).unwrap();
let index = Index::create_in_dir(tempdir_path, schema).unwrap();
let reader = index
.reader_builder()
.reload_policy(ReloadPolicy::OnCommit)

View File

@@ -130,10 +130,10 @@ impl SegmentMeta {
/// Returns the relative path of a component of our segment.
///
/// It just joins the segment id with the extension
/// associated to a segment component.
/// associated with a segment component.
pub fn relative_path(&self, component: SegmentComponent) -> PathBuf {
let mut path = self.id().uuid_string();
path.push_str(&*match component {
path.push_str(&match component {
SegmentComponent::Postings => ".idx".to_string(),
SegmentComponent::Positions => ".pos".to_string(),
SegmentComponent::Terms => ".term".to_string(),
@@ -326,13 +326,13 @@ pub struct IndexMeta {
/// `IndexSettings` to configure index options.
#[serde(default)]
pub index_settings: IndexSettings,
/// List of `SegmentMeta` information associated to each finalized segment of the index.
/// List of `SegmentMeta` information associated with each finalized segment of the index.
pub segments: Vec<SegmentMeta>,
/// Index `Schema`
pub schema: Schema,
/// Opstamp associated to the last `commit` operation.
/// Opstamp associated with the last `commit` operation.
pub opstamp: Opstamp,
/// Payload associated to the last commit.
/// Payload associated with the last commit.
///
/// Upon commit, clients can optionally add a small `String` payload to their commit
/// to help identify this commit.

View File

@@ -9,18 +9,17 @@ use crate::schema::{IndexRecordOption, Term};
use crate::termdict::TermDictionary;
/// The inverted index reader is in charge of accessing
/// the inverted index associated to a specific field.
/// the inverted index associated with a specific field.
///
/// # Note
///
/// It is safe to delete the segment associated to
/// It is safe to delete the segment associated with
/// an `InvertedIndexReader`. As long as it is open,
/// the `FileSlice` it is relying on should
/// the [`FileSlice`] it is relying on should
/// stay available.
///
///
/// `InvertedIndexReader` are created by calling
/// the `SegmentReader`'s [`.inverted_index(...)`] method
/// [`SegmentReader::inverted_index()`](crate::SegmentReader::inverted_index).
pub struct InvertedIndexReader {
termdict: TermDictionary,
postings_file_slice: FileSlice,
@@ -30,7 +29,7 @@ pub struct InvertedIndexReader {
}
impl InvertedIndexReader {
#[cfg_attr(feature = "cargo-clippy", allow(clippy::needless_pass_by_value))] // for symmetry
#[allow(clippy::needless_pass_by_value)] // for symmetry
pub(crate) fn new(
termdict: TermDictionary,
postings_file_slice: FileSlice,
@@ -75,7 +74,7 @@ impl InvertedIndexReader {
///
/// This is useful for enumerating through a list of terms,
/// and consuming the associated posting lists while avoiding
/// reallocating a `BlockSegmentPostings`.
/// reallocating a [`BlockSegmentPostings`].
///
/// # Warning
///
@@ -96,7 +95,7 @@ impl InvertedIndexReader {
/// Returns a block postings given a `Term`.
/// This method is for an advanced usage only.
///
/// Most user should prefer using `read_postings` instead.
/// Most users should prefer using [`Self::read_postings()`] instead.
pub fn read_block_postings(
&self,
term: &Term,
@@ -110,7 +109,7 @@ impl InvertedIndexReader {
/// Returns a block postings given a `term_info`.
/// This method is for an advanced usage only.
///
/// Most user should prefer using `read_postings` instead.
/// Most users should prefer using [`Self::read_postings()`] instead.
pub fn read_block_postings_from_terminfo(
&self,
term_info: &TermInfo,
@@ -130,7 +129,7 @@ impl InvertedIndexReader {
/// Returns a posting object given a `term_info`.
/// This method is for an advanced usage only.
///
/// Most user should prefer using `read_postings` instead.
/// Most users should prefer using [`Self::read_postings()`] instead.
pub fn read_postings_from_terminfo(
&self,
term_info: &TermInfo,
@@ -164,12 +163,12 @@ impl InvertedIndexReader {
/// or `None` if the term has never been encountered and indexed.
///
/// If the field was not indexed with the indexing options that cover
/// the requested options, the returned `SegmentPostings` the method does not fail
/// the requested options, the returned [`SegmentPostings`] the method does not fail
/// and returns a `SegmentPostings` with as much information as possible.
///
/// For instance, requesting `IndexRecordOption::Freq` for a
/// `TextIndexingOptions` that does not index position will return a `SegmentPostings`
/// with `DocId`s and frequencies.
/// For instance, requesting [`IndexRecordOption::WithFreqs`] for a
/// [`TextOptions`](crate::schema::TextOptions) that does not index position
/// will return a [`SegmentPostings`] with `DocId`s and frequencies.
pub fn read_postings(
&self,
term: &Term,
@@ -201,23 +200,16 @@ impl InvertedIndexReader {
#[cfg(feature = "quickwit")]
impl InvertedIndexReader {
pub(crate) async fn get_term_info_async(
&self,
term: &Term,
) -> crate::AsyncIoResult<Option<TermInfo>> {
pub(crate) async fn get_term_info_async(&self, term: &Term) -> io::Result<Option<TermInfo>> {
self.termdict.get_async(term.value_bytes()).await
}
/// Returns a block postings given a `Term`.
/// This method is for an advanced usage only.
///
/// Most user should prefer using `read_postings` instead.
pub async fn warm_postings(
&self,
term: &Term,
with_positions: bool,
) -> crate::AsyncIoResult<()> {
let term_info_opt = self.get_term_info_async(term).await?;
/// Most users should prefer using [`Self::read_postings()`] instead.
pub async fn warm_postings(&self, term: &Term, with_positions: bool) -> io::Result<()> {
let term_info_opt: Option<TermInfo> = self.get_term_info_async(term).await?;
if let Some(term_info) = term_info_opt {
self.postings_file_slice
.read_bytes_slice_async(term_info.postings_range.clone())
@@ -231,8 +223,20 @@ impl InvertedIndexReader {
Ok(())
}
/// Read the block postings for all terms.
/// This method is for an advanced usage only.
///
/// If you know which terms to pre-load, prefer using [`Self::warm_postings`] instead.
pub async fn warm_postings_full(&self, with_positions: bool) -> io::Result<()> {
self.postings_file_slice.read_bytes_async().await?;
if with_positions {
self.positions_file_slice.read_bytes_async().await?;
}
Ok(())
}
/// Returns the number of documents containing the term asynchronously.
pub async fn doc_freq_async(&self, term: &Term) -> crate::AsyncIoResult<u32> {
pub async fn doc_freq_async(&self, term: &Term) -> io::Result<u32> {
Ok(self
.get_term_info_async(term)
.await?

View File

@@ -4,7 +4,7 @@ use std::{fmt, io};
use crate::collector::Collector;
use crate::core::{Executor, SegmentReader};
use crate::query::Query;
use crate::query::{EnableScoring, Query};
use crate::schema::{Document, Schema, Term};
use crate::space_usage::SearcherSpaceUsage;
use crate::store::{CacheStats, StoreReader};
@@ -69,7 +69,7 @@ pub struct Searcher {
}
impl Searcher {
/// Returns the `Index` associated to the `Searcher`
/// Returns the `Index` associated with the `Searcher`
pub fn index(&self) -> &Index {
&self.inner.index
}
@@ -108,7 +108,7 @@ impl Searcher {
store_reader.get_async(doc_address.doc_id).await
}
/// Access the schema associated to the index of this searcher.
/// Access the schema associated with the index of this searcher.
pub fn schema(&self) -> &Schema {
&self.inner.schema
}
@@ -161,11 +161,11 @@ impl Searcher {
///
/// Search works as follows :
///
/// First the weight object associated to the query is created.
/// First the weight object associated with the query is created.
///
/// Then, the query loops over the segments and for each segment :
/// - setup the collector and informs it that the segment being processed has changed.
/// - creates a SegmentCollector for collecting documents associated to the segment
/// - creates a SegmentCollector for collecting documents associated with the segment
/// - creates a `Scorer` object associated for this segment
/// - iterate through the matched documents and push them to the segment collector.
///
@@ -199,7 +199,12 @@ impl Searcher {
executor: &Executor,
) -> crate::Result<C::Fruit> {
let scoring_enabled = collector.requires_scoring();
let weight = query.weight(self, scoring_enabled)?;
let enabled_scoring = if scoring_enabled {
EnableScoring::Enabled(self)
} else {
EnableScoring::Disabled(self.schema())
};
let weight = query.weight(enabled_scoring)?;
let segment_readers = self.segment_readers();
let fruits = executor.map(
|(segment_ord, segment_reader)| {

View File

@@ -70,7 +70,7 @@ impl Segment {
/// Returns the relative path of a component of our segment.
///
/// It just joins the segment id with the extension
/// associated to a segment component.
/// associated with a segment component.
pub fn relative_path(&self, component: SegmentComponent) -> PathBuf {
self.meta.relative_path(component)
}

View File

@@ -6,7 +6,7 @@ use std::slice;
/// except the delete component that takes an `segment_uuid`.`delete_opstamp`.`component_extension`
#[derive(Copy, Clone, Eq, PartialEq)]
pub enum SegmentComponent {
/// Postings (or inverted list). Sorted lists of document ids, associated to terms
/// Postings (or inverted list). Sorted lists of document ids, associated with terms
Postings,
/// Positions of terms in each document.
Positions,

View File

@@ -57,7 +57,7 @@ impl SegmentId {
/// Picking the first 8 chars is ok to identify
/// segments in a display message (e.g. a5c4dfcb).
pub fn short_uuid_string(&self) -> String {
(&self.0.as_simple().to_string()[..8]).to_string()
self.0.as_simple().to_string()[..8].to_string()
}
/// Returns a segment uuid string.

View File

@@ -89,7 +89,7 @@ impl SegmentReader {
&self.fast_fields_readers
}
/// Accessor to the `FacetReader` associated to a given `Field`.
/// Accessor to the `FacetReader` associated with a given `Field`.
pub fn facet_reader(&self, field: Field) -> crate::Result<FacetReader> {
let field_entry = self.schema.get_field_entry(field);
@@ -208,18 +208,18 @@ impl SegmentReader {
})
}
/// Returns a field reader associated to the field given in argument.
/// Returns a field reader associated with the field given in argument.
/// If the field was not present in the index during indexing time,
/// the InvertedIndexReader is empty.
///
/// The field reader is in charge of iterating through the
/// term dictionary associated to a specific field,
/// and opening the posting list associated to any term.
/// term dictionary associated with a specific field,
/// and opening the posting list associated with any term.
///
/// If the field is not marked as index, a warn is logged and an empty `InvertedIndexReader`
/// If the field is not marked as index, a warning is logged and an empty `InvertedIndexReader`
/// is returned.
/// Similarly if the field is marked as indexed but no term has been indexed for the given
/// index. an empty `InvertedIndexReader` is returned (but no warning is logged).
/// Similarly, if the field is marked as indexed but no term has been indexed for the given
/// index, an empty `InvertedIndexReader` is returned (but no warning is logged).
pub fn inverted_index(&self, field: Field) -> crate::Result<Arc<InvertedIndexReader>> {
if let Some(inv_idx_reader) = self
.inv_idx_reader_cache
@@ -241,7 +241,7 @@ impl SegmentReader {
if postings_file_opt.is_none() || record_option_opt.is_none() {
// no documents in the segment contained this field.
// As a result, no data is associated to the inverted index.
// As a result, no data is associated with the inverted index.
//
// Returns an empty inverted index.
let record_option = record_option_opt.unwrap_or(IndexRecordOption::Basic);

View File

@@ -75,7 +75,7 @@ impl<W: TerminatingWrite + Write> CompositeWrite<W> {
let mut prev_offset = 0;
for (file_addr, offset) in self.offsets {
VInt((offset - prev_offset) as u64).serialize(&mut self.write)?;
VInt(offset - prev_offset).serialize(&mut self.write)?;
file_addr.serialize(&mut self.write)?;
prev_offset = offset;
}
@@ -154,14 +154,14 @@ impl CompositeFile {
}
}
/// Returns the `FileSlice` associated
/// to a given `Field` and stored in a `CompositeFile`.
/// Returns the `FileSlice` associated with
/// a given `Field` and stored in a `CompositeFile`.
pub fn open_read(&self, field: Field) -> Option<FileSlice> {
self.open_read_with_idx(field, 0)
}
/// Returns the `FileSlice` associated
/// to a given `Field` and stored in a `CompositeFile`.
/// Returns the `FileSlice` associated with
/// a given `Field` and stored in a `CompositeFile`.
pub fn open_read_with_idx(&self, field: Field, idx: usize) -> Option<FileSlice> {
self.offsets_index
.get(&FileAddr { field, idx })

View File

@@ -39,7 +39,7 @@ impl RetryPolicy {
/// The `DirectoryLock` is an object that represents a file lock.
///
/// It is associated to a lock file, that gets deleted on `Drop.`
/// It is associated with a lock file, that gets deleted on `Drop.`
pub struct DirectoryLock(Box<dyn Send + Sync + 'static>);
struct DirectoryLockGuard {
@@ -55,7 +55,7 @@ impl<T: Send + Sync + 'static> From<Box<T>> for DirectoryLock {
impl Drop for DirectoryLockGuard {
fn drop(&mut self) {
if let Err(e) = self.directory.delete(&*self.path) {
if let Err(e) = self.directory.delete(&self.path) {
error!("Failed to remove the lock file. {:?}", e);
}
}

View File

@@ -9,7 +9,7 @@ use crc32fast::Hasher;
use crate::directory::{WatchCallback, WatchCallbackList, WatchHandle};
pub const POLLING_INTERVAL: Duration = Duration::from_millis(if cfg!(test) { 1 } else { 500 });
const POLLING_INTERVAL: Duration = Duration::from_millis(if cfg!(test) { 1 } else { 500 });
// Watches a file and executes registered callbacks when the file is modified.
pub struct FileWatcher {

View File

@@ -38,7 +38,7 @@ impl Footer {
counting_write.write_all(serde_json::to_string(&self)?.as_ref())?;
let footer_payload_len = counting_write.written_bytes();
BinarySerializable::serialize(&(footer_payload_len as u32), write)?;
BinarySerializable::serialize(&(FOOTER_MAGIC_NUMBER as u32), write)?;
BinarySerializable::serialize(&FOOTER_MAGIC_NUMBER, write)?;
Ok(())
}
@@ -90,9 +90,10 @@ impl Footer {
));
}
let footer: Footer = serde_json::from_slice(&file.read_bytes_slice(
file.len() - total_footer_size..file.len() - footer_metadata_len as usize,
)?)?;
let footer: Footer =
serde_json::from_slice(&file.read_bytes_slice(
file.len() - total_footer_size..file.len() - footer_metadata_len,
)?)?;
let body = file.slice_to(file.len() - total_footer_size);
Ok((footer, body))

View File

@@ -388,7 +388,7 @@ mod tests_mmap_specific {
let tempdir_path = PathBuf::from(tempdir.path());
let living_files = HashSet::new();
let mmap_directory = MmapDirectory::open(&tempdir_path).unwrap();
let mmap_directory = MmapDirectory::open(tempdir_path).unwrap();
let mut managed_directory = ManagedDirectory::wrap(Box::new(mmap_directory)).unwrap();
let mut write = managed_directory.open_write(test_path1).unwrap();
write.write_all(&[0u8, 1u8]).unwrap();

View File

@@ -3,7 +3,7 @@ use std::fs::{self, File, OpenOptions};
use std::io::{self, BufWriter, Read, Seek, Write};
use std::ops::Deref;
use std::path::{Path, PathBuf};
use std::sync::{Arc, RwLock};
use std::sync::{Arc, RwLock, Weak};
use std::{fmt, result};
use fs2::FileExt;
@@ -18,10 +18,13 @@ use crate::directory::error::{
};
use crate::directory::file_watcher::FileWatcher;
use crate::directory::{
AntiCallToken, ArcBytes, Directory, DirectoryLock, FileHandle, Lock, OwnedBytes,
TerminatingWrite, WatchCallback, WatchHandle, WeakArcBytes, WritePtr,
AntiCallToken, Directory, DirectoryLock, FileHandle, Lock, OwnedBytes, TerminatingWrite,
WatchCallback, WatchHandle, WritePtr,
};
pub type ArcBytes = Arc<dyn Deref<Target = [u8]> + Send + Sync + 'static>;
pub type WeakArcBytes = Weak<dyn Deref<Target = [u8]> + Send + Sync + 'static>;
/// Create a default io error given a string.
pub(crate) fn make_io_err(msg: String) -> io::Error {
io::Error::new(io::ErrorKind::Other, msg)
@@ -301,7 +304,7 @@ pub(crate) fn atomic_write(path: &Path, content: &[u8]) -> io::Result<()> {
"Path {:?} does not have parent directory.",
)
})?;
let mut tempfile = tempfile::Builder::new().tempfile_in(&parent_path)?;
let mut tempfile = tempfile::Builder::new().tempfile_in(parent_path)?;
tempfile.write_all(content)?;
tempfile.flush()?;
tempfile.as_file_mut().sync_data()?;
@@ -334,11 +337,11 @@ impl Directory for MmapDirectory {
Ok(Arc::new(owned_bytes))
}
/// Any entry associated to the path in the mmap will be
/// Any entry associated with the path in the mmap will be
/// removed before the file is deleted.
fn delete(&self, path: &Path) -> result::Result<(), DeleteError> {
let full_path = self.resolve_path(path);
fs::remove_file(&full_path).map_err(|e| {
fs::remove_file(full_path).map_err(|e| {
if e.kind() == io::ErrorKind::NotFound {
DeleteError::FileDoesNotExist(path.to_owned())
} else {
@@ -392,7 +395,7 @@ impl Directory for MmapDirectory {
fn atomic_read(&self, path: &Path) -> Result<Vec<u8>, OpenReadError> {
let full_path = self.resolve_path(path);
let mut buffer = Vec::new();
match File::open(&full_path) {
match File::open(full_path) {
Ok(mut file) => {
file.read_to_end(&mut buffer).map_err(|io_error| {
OpenReadError::wrap_io_error(io_error, path.to_path_buf())
@@ -422,7 +425,7 @@ impl Directory for MmapDirectory {
let file: File = OpenOptions::new()
.write(true)
.create(true) //< if the file does not exist yet, create it.
.open(&full_path)
.open(full_path)
.map_err(LockError::wrap_io_error)?;
if lock.is_blocking {
file.lock_exclusive().map_err(LockError::wrap_io_error)?;
@@ -472,6 +475,8 @@ mod tests {
// There are more tests in directory/mod.rs
// The following tests are specific to the MmapDirectory
use std::time::Duration;
use common::HasLen;
use super::*;
@@ -566,9 +571,21 @@ mod tests {
assert_eq!(mmap_directory.get_cache_info().mmapped.len(), 0);
}
fn assert_eventually<P: Fn() -> Option<String>>(predicate: P) {
for _ in 0..30 {
if predicate().is_none() {
break;
}
std::thread::sleep(Duration::from_millis(200));
}
if let Some(error_msg) = predicate() {
panic!("{}", error_msg);
}
}
#[test]
fn test_mmap_released() -> crate::Result<()> {
let mmap_directory = MmapDirectory::create_from_tempdir()?;
fn test_mmap_released() {
let mmap_directory = MmapDirectory::create_from_tempdir().unwrap();
let mut schema_builder: SchemaBuilder = Schema::builder();
let text_field = schema_builder.add_text_field("text", TEXT);
let schema = schema_builder.build();
@@ -577,40 +594,56 @@ mod tests {
let index =
Index::create(mmap_directory.clone(), schema, IndexSettings::default()).unwrap();
let mut index_writer = index.writer_for_tests()?;
let mut index_writer = index.writer_for_tests().unwrap();
let mut log_merge_policy = LogMergePolicy::default();
log_merge_policy.set_min_num_segments(3);
index_writer.set_merge_policy(Box::new(log_merge_policy));
for _num_commits in 0..10 {
for _ in 0..10 {
index_writer.add_document(doc!(text_field=>"abc"))?;
index_writer.add_document(doc!(text_field=>"abc")).unwrap();
}
index_writer.commit()?;
index_writer.commit().unwrap();
}
let reader = index
.reader_builder()
.reload_policy(ReloadPolicy::Manual)
.try_into()?;
.try_into()
.unwrap();
for _ in 0..4 {
index_writer.add_document(doc!(text_field=>"abc"))?;
index_writer.commit()?;
reader.reload()?;
index_writer.add_document(doc!(text_field=>"abc")).unwrap();
index_writer.commit().unwrap();
reader.reload().unwrap();
}
index_writer.wait_merging_threads()?;
index_writer.wait_merging_threads().unwrap();
reader.reload()?;
reader.reload().unwrap();
let num_segments = reader.searcher().segment_readers().len();
assert!(num_segments <= 4);
let num_components_except_deletes_and_tempstore =
crate::core::SegmentComponent::iterator().len() - 2;
assert_eq!(
num_segments * num_components_except_deletes_and_tempstore,
mmap_directory.get_cache_info().mmapped.len()
);
let max_num_mmapped = num_components_except_deletes_and_tempstore * num_segments;
assert_eventually(|| {
let num_mmapped = mmap_directory.get_cache_info().mmapped.len();
if num_mmapped > max_num_mmapped {
Some(format!(
"Expected at most {max_num_mmapped} mmapped files, got {num_mmapped}"
))
} else {
None
}
});
}
assert!(mmap_directory.get_cache_info().mmapped.is_empty());
Ok(())
// This test failed on CI. The last Mmap is dropped from the merging thread so there might
// be a race condition indeed.
assert_eventually(|| {
let num_mmapped = mmap_directory.get_cache_info().mmapped.len();
if num_mmapped > 0 {
Some(format!("Expected no mmapped files, got {num_mmapped}"))
} else {
None
}
});
}
}

View File

@@ -5,7 +5,6 @@ mod mmap_directory;
mod directory;
mod directory_lock;
mod file_slice;
mod file_watcher;
mod footer;
mod managed_directory;
@@ -20,14 +19,13 @@ mod composite_file;
use std::io::BufWriter;
use std::path::PathBuf;
pub use common::file_slice::{FileHandle, FileSlice};
pub use common::{AntiCallToken, TerminatingWrite};
pub use ownedbytes::OwnedBytes;
pub(crate) use self::composite_file::{CompositeFile, CompositeWrite};
pub use self::directory::{Directory, DirectoryClone, DirectoryLock};
pub use self::directory_lock::{Lock, INDEX_WRITER_LOCK, META_LOCK};
pub(crate) use self::file_slice::{ArcBytes, WeakArcBytes};
pub use self::file_slice::{FileHandle, FileSlice};
pub use self::ram_directory::RamDirectory;
pub use self::watch_event_router::{WatchCallback, WatchCallbackList, WatchHandle};

View File

@@ -136,6 +136,20 @@ impl RamDirectory {
Self::default()
}
/// Deep clones the directory.
///
/// Ulterior writes on one of the copy
/// will not affect the other copy.
pub fn deep_clone(&self) -> RamDirectory {
let inner_clone = InnerDirectory {
fs: self.fs.read().unwrap().fs.clone(),
watch_router: Default::default(),
};
RamDirectory {
fs: Arc::new(RwLock::new(inner_clone)),
}
}
/// Returns the sum of the size of the different files
/// in the [`RamDirectory`].
pub fn total_mem_usage(&self) -> usize {
@@ -256,4 +270,23 @@ mod tests {
assert_eq!(directory_copy.atomic_read(path_atomic).unwrap(), msg_atomic);
assert_eq!(directory_copy.atomic_read(path_seq).unwrap(), msg_seq);
}
#[test]
fn test_ram_directory_deep_clone() {
let dir = RamDirectory::default();
let test = Path::new("test");
let test2 = Path::new("test2");
dir.atomic_write(test, b"firstwrite").unwrap();
let dir_clone = dir.deep_clone();
assert_eq!(
dir_clone.atomic_read(test).unwrap(),
dir.atomic_read(test).unwrap()
);
dir.atomic_write(test, b"original").unwrap();
dir_clone.atomic_write(test, b"clone").unwrap();
dir_clone.atomic_write(test2, b"clone2").unwrap();
assert_eq!(dir.atomic_read(test).unwrap(), b"original");
assert_eq!(&dir_clone.atomic_read(test).unwrap(), b"clone");
assert_eq!(&dir_clone.atomic_read(test2).unwrap(), b"clone2");
}
}

View File

@@ -104,28 +104,6 @@ pub enum TantivyError {
InternalError(String),
}
#[cfg(feature = "quickwit")]
#[derive(Error, Debug)]
#[doc(hidden)]
pub enum AsyncIoError {
#[error("io::Error `{0}`")]
Io(#[from] io::Error),
#[error("Asynchronous API is unsupported by this directory")]
AsyncUnsupported,
}
#[cfg(feature = "quickwit")]
impl From<AsyncIoError> for TantivyError {
fn from(async_io_err: AsyncIoError) -> Self {
match async_io_err {
AsyncIoError::Io(io_err) => TantivyError::from(io_err),
AsyncIoError::AsyncUnsupported => {
TantivyError::SystemError(format!("{:?}", async_io_err))
}
}
}
}
impl From<io::Error> for TantivyError {
fn from(io_err: io::Error) -> TantivyError {
TantivyError::IoError(Arc::new(io_err))

View File

@@ -6,7 +6,7 @@ pub use self::writer::BytesFastFieldWriter;
#[cfg(test)]
mod tests {
use crate::query::TermQuery;
use crate::query::{EnableScoring, TermQuery};
use crate::schema::{BytesOptions, IndexRecordOption, Schema, Value, FAST, INDEXED, STORED};
use crate::{DocAddress, DocSet, Index, Searcher, Term};
@@ -82,7 +82,7 @@ mod tests {
let field = searcher.schema().get_field("string_bytes").unwrap();
let term = Term::from_field_bytes(field, b"lucene".as_ref());
let term_query = TermQuery::new(term, IndexRecordOption::Basic);
let term_weight = term_query.specialized_weight(&searcher, true)?;
let term_weight = term_query.specialized_weight(EnableScoring::Enabled(&searcher))?;
let term_scorer = term_weight.specialized_scorer(searcher.segment_reader(0), 1.0)?;
assert_eq!(term_scorer.doc(), 0u32);
Ok(())
@@ -95,7 +95,8 @@ mod tests {
let field = searcher.schema().get_field("string_bytes").unwrap();
let term = Term::from_field_bytes(field, b"lucene".as_ref());
let term_query = TermQuery::new(term, IndexRecordOption::Basic);
let term_weight_err = term_query.specialized_weight(&searcher, false);
let term_weight_err =
term_query.specialized_weight(EnableScoring::Disabled(searcher.schema()));
assert!(matches!(
term_weight_err,
Err(crate::TantivyError::SchemaError(_))

View File

@@ -3,7 +3,7 @@ use std::sync::Arc;
use fastfield_codecs::Column;
use crate::directory::{FileSlice, OwnedBytes};
use crate::fastfield::MultiValueLength;
use crate::fastfield::MultiValueIndex;
use crate::DocId;
/// Reader for byte array fast fields
@@ -18,7 +18,7 @@ use crate::DocId;
/// and the start index for the next document, and keeping the bytes in between.
#[derive(Clone)]
pub struct BytesFastFieldReader {
idx_reader: Arc<dyn Column<u64>>,
idx_reader: MultiValueIndex,
values: OwnedBytes,
}
@@ -28,39 +28,31 @@ impl BytesFastFieldReader {
values_file: FileSlice,
) -> crate::Result<BytesFastFieldReader> {
let values = values_file.read_bytes()?;
Ok(BytesFastFieldReader { idx_reader, values })
Ok(BytesFastFieldReader {
idx_reader: MultiValueIndex::new(idx_reader),
values,
})
}
fn range(&self, doc: DocId) -> (usize, usize) {
let idx = doc as u64;
let start = self.idx_reader.get_val(idx) as usize;
let stop = self.idx_reader.get_val(idx + 1) as usize;
(start, stop)
/// returns the multivalue index
pub fn get_index_reader(&self) -> &MultiValueIndex {
&self.idx_reader
}
/// Returns the bytes associated to the given `doc`
/// Returns the bytes associated with the given `doc`
pub fn get_bytes(&self, doc: DocId) -> &[u8] {
let (start, stop) = self.range(doc);
&self.values.as_slice()[start..stop]
let range = self.idx_reader.range(doc);
&self.values.as_slice()[range.start as usize..range.end as usize]
}
/// Returns the length of the bytes associated to the given `doc`
pub fn num_bytes(&self, doc: DocId) -> usize {
let (start, stop) = self.range(doc);
stop - start
/// Returns the length of the bytes associated with the given `doc`
pub fn num_bytes(&self, doc: DocId) -> u64 {
let range = self.idx_reader.range(doc);
(range.end - range.start) as u64
}
/// Returns the overall number of bytes in this bytes fast field.
pub fn total_num_bytes(&self) -> usize {
self.values.len()
}
}
impl MultiValueLength for BytesFastFieldReader {
fn get_len(&self, doc_id: DocId) -> u64 {
self.num_bytes(doc_id) as u64
}
fn get_total_len(&self) -> u64 {
self.total_num_bytes() as u64
pub fn total_num_bytes(&self) -> u32 {
self.values.len() as u32
}
}

View File

@@ -24,7 +24,7 @@ use crate::DocId;
///
/// Once acquired, writing is done by calling
/// [`.add_document_val(&[u8])`](BytesFastFieldWriter::add_document_val)
/// once per document, even if there are no bytes associated to it.
/// once per document, even if there are no bytes associated with it.
pub struct BytesFastFieldWriter {
field: Field,
vals: Vec<u8>,
@@ -45,7 +45,7 @@ impl BytesFastFieldWriter {
pub fn mem_usage(&self) -> usize {
self.vals.capacity() + self.doc_index.capacity() * std::mem::size_of::<u64>()
}
/// Access the field associated to the `BytesFastFieldWriter`
/// Access the field associated with the `BytesFastFieldWriter`
pub fn field(&self) -> Field {
self.field
}
@@ -57,17 +57,18 @@ impl BytesFastFieldWriter {
/// Shift to the next document and add all of the
/// matching field values present in the document.
pub fn add_document(&mut self, doc: &Document) {
pub fn add_document(&mut self, doc: &Document) -> crate::Result<()> {
self.next_doc();
for field_value in doc.get_all(self.field) {
if let Value::Bytes(ref bytes) = field_value {
self.vals.extend_from_slice(bytes);
return;
return Ok(());
}
}
Ok(())
}
/// Register the bytes associated to a document.
/// Register the bytes associated with a document.
///
/// The method returns the `DocId` of the document that was
/// just written.

View File

@@ -7,7 +7,7 @@ use crate::termdict::{TermDictionary, TermOrdinal};
use crate::DocId;
/// The facet reader makes it possible to access the list of
/// facets associated to a given document in a specific
/// facets associated with a given document in a specific
/// segment.
///
/// Rather than manipulating `Facet` object directly, the API
@@ -58,15 +58,13 @@ impl FacetReader {
&self.term_dict
}
/// Given a term ordinal returns the term associated to it.
/// Given a term ordinal returns the term associated with it.
pub fn facet_from_ord(
&mut self,
facet_ord: TermOrdinal,
output: &mut Facet,
) -> crate::Result<()> {
let found_term = self
.term_dict
.ord_to_term(facet_ord as u64, &mut self.buffer)?;
let found_term = self.term_dict.ord_to_term(facet_ord, &mut self.buffer)?;
assert!(found_term, "Term ordinal {} no found.", facet_ord);
let facet_str = str::from_utf8(&self.buffer[..])
.map_err(|utf8_err| DataCorruption::comment_only(utf8_err.to_string()))?;
@@ -74,7 +72,7 @@ impl FacetReader {
Ok(())
}
/// Return the list of facet ordinals associated to a document.
/// Return the list of facet ordinals associated with a document.
pub fn facet_ords(&self, doc: DocId, output: &mut Vec<u64>) {
self.term_ords.get_vals(doc, output);
}

View File

@@ -7,16 +7,15 @@
//! It is designed for the fast random access of some document
//! fields given a document id.
//!
//! `FastField` are useful when a field is required for all or most of
//! the `DocSet` : for instance for scoring, grouping, filtering, or faceting.
//! Fast fields are useful when a field is required for all or most of
//! the `DocSet`: for instance for scoring, grouping, aggregation, filtering, or faceting.
//!
//!
//! Fields have to be declared as `FAST` in the schema.
//! Currently supported fields are: u64, i64, f64 and bytes.
//! Fields have to be declared as `FAST` in the schema.
//! Currently supported fields are: u64, i64, f64, bytes and text.
//!
//! u64, i64 and f64 fields are stored in a bit-packed fashion so that
//! their memory usage is directly linear with the amplitude of the
//! values stored.
//! Fast fields are stored in with [different codecs](fastfield_codecs). The best codec is detected
//! automatically, when serializing.
//!
//! Read access performance is comparable to that of an array lookup.
@@ -26,14 +25,18 @@ pub use self::alive_bitset::{intersect_alive_bitsets, write_alive_bitset, AliveB
pub use self::bytes::{BytesFastFieldReader, BytesFastFieldWriter};
pub use self::error::{FastFieldNotAvailableError, Result};
pub use self::facet_reader::FacetReader;
pub(crate) use self::multivalued::MultivalueStartIndex;
pub use self::multivalued::{MultiValuedFastFieldReader, MultiValuedFastFieldWriter};
pub(crate) use self::multivalued::{get_fastfield_codecs_for_multivalue, MultivalueStartIndex};
pub use self::multivalued::{
MultiValueIndex, MultiValueU128FastFieldWriter, MultiValuedFastFieldReader,
MultiValuedFastFieldWriter, MultiValuedU128FastFieldReader,
};
pub(crate) use self::readers::type_and_cardinality;
pub use self::readers::FastFieldReaders;
pub(crate) use self::readers::{type_and_cardinality, FastType};
pub use self::serializer::{Column, CompositeFastFieldSerializer};
use self::writer::unexpected_value;
pub use self::writer::{FastFieldsWriter, IntFastFieldWriter};
use crate::schema::{Type, Value};
use crate::{DateTime, DocId};
use crate::DateTime;
mod alive_bitset;
mod bytes;
@@ -44,15 +47,6 @@ mod readers;
mod serializer;
mod writer;
/// Trait for `BytesFastFieldReader` and `MultiValuedFastFieldReader` to return the length of data
/// for a doc_id
pub trait MultiValueLength {
/// returns the num of values associated to a doc_id
fn get_len(&self, doc_id: DocId) -> u64;
/// returns the sum of num values for all doc_ids
fn get_total_len(&self) -> u64;
}
/// Trait for types that are allowed for fast fields:
/// (u64, i64 and f64, bool, DateTime).
pub trait FastValue:
@@ -115,15 +109,16 @@ impl FastValue for DateTime {
}
}
fn value_to_u64(value: &Value) -> u64 {
match value {
fn value_to_u64(value: &Value) -> crate::Result<u64> {
let value = match value {
Value::U64(val) => val.to_u64(),
Value::I64(val) => val.to_u64(),
Value::F64(val) => val.to_u64(),
Value::Bool(val) => val.to_u64(),
Value::Date(val) => val.to_u64(),
_ => panic!("Expected a u64/i64/f64/bool/date field, got {:?} ", value),
}
_ => return Err(unexpected_value("u64/i64/f64/bool/date", value)),
};
Ok(value)
}
/// The fast field type
@@ -178,9 +173,9 @@ mod tests {
#[test]
pub fn test_fastfield() {
let test_fastfield = fastfield_codecs::serialize_and_load(&[100u64, 200u64, 300u64][..]);
assert_eq!(test_fastfield.get_val(0u64), 100);
assert_eq!(test_fastfield.get_val(1u64), 200);
assert_eq!(test_fastfield.get_val(2u64), 300);
assert_eq!(test_fastfield.get_val(0), 100);
assert_eq!(test_fastfield.get_val(1), 200);
assert_eq!(test_fastfield.get_val(2), 300);
}
#[test]
@@ -197,16 +192,22 @@ mod tests {
let write: WritePtr = directory.open_write(Path::new("test")).unwrap();
let mut serializer = CompositeFastFieldSerializer::from_write(write).unwrap();
let mut fast_field_writers = FastFieldsWriter::from_schema(&SCHEMA);
fast_field_writers.add_document(&doc!(*FIELD=>13u64));
fast_field_writers.add_document(&doc!(*FIELD=>14u64));
fast_field_writers.add_document(&doc!(*FIELD=>2u64));
fast_field_writers
.add_document(&doc!(*FIELD=>13u64))
.unwrap();
fast_field_writers
.add_document(&doc!(*FIELD=>14u64))
.unwrap();
fast_field_writers
.add_document(&doc!(*FIELD=>2u64))
.unwrap();
fast_field_writers
.serialize(&mut serializer, &HashMap::new(), None)
.unwrap();
serializer.close().unwrap();
}
let file = directory.open_read(path).unwrap();
assert_eq!(file.len(), 25);
assert_eq!(file.len(), 34);
let composite_file = CompositeFile::open(&file)?;
let fast_field_bytes = composite_file.open_read(*FIELD).unwrap().read_bytes()?;
let fast_field_reader = open::<u64>(fast_field_bytes)?;
@@ -224,20 +225,38 @@ mod tests {
let write: WritePtr = directory.open_write(Path::new("test"))?;
let mut serializer = CompositeFastFieldSerializer::from_write(write)?;
let mut fast_field_writers = FastFieldsWriter::from_schema(&SCHEMA);
fast_field_writers.add_document(&doc!(*FIELD=>4u64));
fast_field_writers.add_document(&doc!(*FIELD=>14_082_001u64));
fast_field_writers.add_document(&doc!(*FIELD=>3_052u64));
fast_field_writers.add_document(&doc!(*FIELD=>9_002u64));
fast_field_writers.add_document(&doc!(*FIELD=>15_001u64));
fast_field_writers.add_document(&doc!(*FIELD=>777u64));
fast_field_writers.add_document(&doc!(*FIELD=>1_002u64));
fast_field_writers.add_document(&doc!(*FIELD=>1_501u64));
fast_field_writers.add_document(&doc!(*FIELD=>215u64));
fast_field_writers
.add_document(&doc!(*FIELD=>4u64))
.unwrap();
fast_field_writers
.add_document(&doc!(*FIELD=>14_082_001u64))
.unwrap();
fast_field_writers
.add_document(&doc!(*FIELD=>3_052u64))
.unwrap();
fast_field_writers
.add_document(&doc!(*FIELD=>9_002u64))
.unwrap();
fast_field_writers
.add_document(&doc!(*FIELD=>15_001u64))
.unwrap();
fast_field_writers
.add_document(&doc!(*FIELD=>777u64))
.unwrap();
fast_field_writers
.add_document(&doc!(*FIELD=>1_002u64))
.unwrap();
fast_field_writers
.add_document(&doc!(*FIELD=>1_501u64))
.unwrap();
fast_field_writers
.add_document(&doc!(*FIELD=>215u64))
.unwrap();
fast_field_writers.serialize(&mut serializer, &HashMap::new(), None)?;
serializer.close()?;
}
let file = directory.open_read(path)?;
assert_eq!(file.len(), 53);
assert_eq!(file.len(), 62);
{
let fast_fields_composite = CompositeFile::open(&file)?;
let data = fast_fields_composite
@@ -268,7 +287,9 @@ mod tests {
let mut serializer = CompositeFastFieldSerializer::from_write(write).unwrap();
let mut fast_field_writers = FastFieldsWriter::from_schema(&SCHEMA);
for _ in 0..10_000 {
fast_field_writers.add_document(&doc!(*FIELD=>100_000u64));
fast_field_writers
.add_document(&doc!(*FIELD=>100_000u64))
.unwrap();
}
fast_field_writers
.serialize(&mut serializer, &HashMap::new(), None)
@@ -276,7 +297,7 @@ mod tests {
serializer.close().unwrap();
}
let file = directory.open_read(path).unwrap();
assert_eq!(file.len(), 26);
assert_eq!(file.len(), 35);
{
let fast_fields_composite = CompositeFile::open(&file).unwrap();
let data = fast_fields_composite
@@ -301,9 +322,13 @@ mod tests {
let mut serializer = CompositeFastFieldSerializer::from_write(write).unwrap();
let mut fast_field_writers = FastFieldsWriter::from_schema(&SCHEMA);
// forcing the amplitude to be high
fast_field_writers.add_document(&doc!(*FIELD=>0u64));
fast_field_writers
.add_document(&doc!(*FIELD=>0u64))
.unwrap();
for i in 0u64..10_000u64 {
fast_field_writers.add_document(&doc!(*FIELD=>5_000_000_000_000_000_000u64 + i));
fast_field_writers
.add_document(&doc!(*FIELD=>5_000_000_000_000_000_000u64 + i))
.unwrap();
}
fast_field_writers
.serialize(&mut serializer, &HashMap::new(), None)
@@ -311,7 +336,7 @@ mod tests {
serializer.close().unwrap();
}
let file = directory.open_read(path).unwrap();
assert_eq!(file.len(), 80040);
assert_eq!(file.len(), 80049);
{
let fast_fields_composite = CompositeFile::open(&file)?;
let data = fast_fields_composite
@@ -345,7 +370,7 @@ mod tests {
for i in -100i64..10_000i64 {
let mut doc = Document::default();
doc.add_i64(i64_field, i);
fast_field_writers.add_document(&doc);
fast_field_writers.add_document(&doc).unwrap();
}
fast_field_writers
.serialize(&mut serializer, &HashMap::new(), None)
@@ -353,7 +378,7 @@ mod tests {
serializer.close().unwrap();
}
let file = directory.open_read(path).unwrap();
assert_eq!(file.len(), 40_usize);
assert_eq!(file.len(), 49_usize);
{
let fast_fields_composite = CompositeFile::open(&file)?;
@@ -366,7 +391,7 @@ mod tests {
assert_eq!(fast_field_reader.min_value(), -100i64);
assert_eq!(fast_field_reader.max_value(), 9_999i64);
for (doc, i) in (-100i64..10_000i64).enumerate() {
assert_eq!(fast_field_reader.get_val(doc as u64), i);
assert_eq!(fast_field_reader.get_val(doc as u32), i);
}
let mut buffer = vec![0i64; 100];
fast_field_reader.get_range(53, &mut buffer[..]);
@@ -390,7 +415,7 @@ mod tests {
let mut serializer = CompositeFastFieldSerializer::from_write(write).unwrap();
let mut fast_field_writers = FastFieldsWriter::from_schema(&schema);
let doc = Document::default();
fast_field_writers.add_document(&doc);
fast_field_writers.add_document(&doc).unwrap();
fast_field_writers
.serialize(&mut serializer, &HashMap::new(), None)
.unwrap();
@@ -433,7 +458,7 @@ mod tests {
let mut serializer = CompositeFastFieldSerializer::from_write(write)?;
let mut fast_field_writers = FastFieldsWriter::from_schema(&SCHEMA);
for &x in &permutation {
fast_field_writers.add_document(&doc!(*FIELD=>x));
fast_field_writers.add_document(&doc!(*FIELD=>x)).unwrap();
}
fast_field_writers.serialize(&mut serializer, &HashMap::new(), None)?;
serializer.close()?;
@@ -448,7 +473,7 @@ mod tests {
let fast_field_reader = open::<u64>(data)?;
for a in 0..n {
assert_eq!(fast_field_reader.get_val(a as u64), permutation[a as usize]);
assert_eq!(fast_field_reader.get_val(a as u32), permutation[a]);
}
}
Ok(())
@@ -783,17 +808,21 @@ mod tests {
let write: WritePtr = directory.open_write(path).unwrap();
let mut serializer = CompositeFastFieldSerializer::from_write(write).unwrap();
let mut fast_field_writers = FastFieldsWriter::from_schema(&schema);
fast_field_writers.add_document(&doc!(field=>true));
fast_field_writers.add_document(&doc!(field=>false));
fast_field_writers.add_document(&doc!(field=>true));
fast_field_writers.add_document(&doc!(field=>false));
fast_field_writers.add_document(&doc!(field=>true)).unwrap();
fast_field_writers
.add_document(&doc!(field=>false))
.unwrap();
fast_field_writers.add_document(&doc!(field=>true)).unwrap();
fast_field_writers
.add_document(&doc!(field=>false))
.unwrap();
fast_field_writers
.serialize(&mut serializer, &HashMap::new(), None)
.unwrap();
serializer.close().unwrap();
}
let file = directory.open_read(path).unwrap();
assert_eq!(file.len(), 24);
assert_eq!(file.len(), 33);
let composite_file = CompositeFile::open(&file)?;
let data = composite_file.open_read(field).unwrap().read_bytes()?;
let fast_field_reader = open::<bool>(data)?;
@@ -820,8 +849,10 @@ mod tests {
let mut serializer = CompositeFastFieldSerializer::from_write(write).unwrap();
let mut fast_field_writers = FastFieldsWriter::from_schema(&schema);
for _ in 0..50 {
fast_field_writers.add_document(&doc!(field=>true));
fast_field_writers.add_document(&doc!(field=>false));
fast_field_writers.add_document(&doc!(field=>true)).unwrap();
fast_field_writers
.add_document(&doc!(field=>false))
.unwrap();
}
fast_field_writers
.serialize(&mut serializer, &HashMap::new(), None)
@@ -829,7 +860,7 @@ mod tests {
serializer.close().unwrap();
}
let file = directory.open_read(path).unwrap();
assert_eq!(file.len(), 36);
assert_eq!(file.len(), 45);
let composite_file = CompositeFile::open(&file)?;
let data = composite_file.open_read(field).unwrap().read_bytes()?;
let fast_field_reader = open::<bool>(data)?;
@@ -855,13 +886,13 @@ mod tests {
let mut serializer = CompositeFastFieldSerializer::from_write(write)?;
let mut fast_field_writers = FastFieldsWriter::from_schema(&schema);
let doc = Document::default();
fast_field_writers.add_document(&doc);
fast_field_writers.add_document(&doc).unwrap();
fast_field_writers.serialize(&mut serializer, &HashMap::new(), None)?;
serializer.close()?;
}
let file = directory.open_read(path).unwrap();
let composite_file = CompositeFile::open(&file)?;
assert_eq!(file.len(), 23);
assert_eq!(file.len(), 32);
let data = composite_file.open_read(field).unwrap().read_bytes()?;
let fast_field_reader = open::<bool>(data)?;
assert_eq!(fast_field_reader.get_val(0), false);
@@ -881,7 +912,7 @@ mod tests {
CompositeFastFieldSerializer::from_write_with_codec(write, codec_types).unwrap();
let mut fast_field_writers = FastFieldsWriter::from_schema(schema);
for doc in docs {
fast_field_writers.add_document(doc);
fast_field_writers.add_document(doc).unwrap();
}
fast_field_writers
.serialize(&mut serializer, &HashMap::new(), None)
@@ -895,10 +926,10 @@ mod tests {
pub fn test_gcd_date() -> crate::Result<()> {
let size_prec_sec =
test_gcd_date_with_codec(FastFieldCodecType::Bitpacked, DatePrecision::Seconds)?;
assert_eq!(size_prec_sec, 28 + (1_000 * 13) / 8); // 13 bits per val = ceil(log_2(number of seconds in 2hours);
assert_eq!(size_prec_sec, 5 + 4 + 28 + (1_000 * 13) / 8); // 13 bits per val = ceil(log_2(number of seconds in 2hours);
let size_prec_micro =
test_gcd_date_with_codec(FastFieldCodecType::Bitpacked, DatePrecision::Microseconds)?;
assert_eq!(size_prec_micro, 26 + (1_000 * 33) / 8); // 33 bits per val = ceil(log_2(number of microsecsseconds in 2hours);
assert_eq!(size_prec_micro, 5 + 4 + 26 + (1_000 * 33) / 8); // 33 bits per val = ceil(log_2(number of microsecsseconds in 2hours);
Ok(())
}
@@ -934,7 +965,7 @@ mod tests {
let test_fastfield = open::<DateTime>(file.read_bytes()?)?;
for (i, time) in times.iter().enumerate() {
assert_eq!(test_fastfield.get_val(i as u64), time.truncate(precision));
assert_eq!(test_fastfield.get_val(i as u32), time.truncate(precision));
}
Ok(len)
}

View File

@@ -0,0 +1,148 @@
use std::ops::Range;
use std::sync::Arc;
use fastfield_codecs::Column;
use crate::DocId;
#[derive(Clone)]
/// Index to resolve value range for given doc_id.
/// Starts at 0.
pub struct MultiValueIndex {
idx: Arc<dyn Column<u64>>,
}
impl MultiValueIndex {
pub(crate) fn new(idx: Arc<dyn Column<u64>>) -> Self {
Self { idx }
}
/// Returns `[start, end)`, such that the values associated with
/// the given document are `start..end`.
#[inline]
pub(crate) fn range(&self, doc: DocId) -> Range<u32> {
let start = self.idx.get_val(doc) as u32;
let end = self.idx.get_val(doc + 1) as u32;
start..end
}
/// Given a range of documents, returns the Range of value offsets fo
/// these documents.
///
/// For instance, `given start_doc..end_doc`,
/// if we assume Document #start_doc end #end_doc both
/// have values, this function returns `start..end`
/// such that `value_column.get(start_doc)` is the first value of
/// `start_doc` (well, if there is one), and `value_column.get(end_doc - 1)`
/// is the last value of `end_doc`.
///
/// The passed end range is allowed to be out of bounds, in which case
/// it will be clipped to make it valid.
#[inline]
pub(crate) fn docid_range_to_position_range(&self, range: Range<DocId>) -> Range<u32> {
let end_docid = range.end.min(self.num_docs() - 1) + 1;
let start_docid = range.start.min(end_docid);
let start = self.idx.get_val(start_docid) as u32;
let end = self.idx.get_val(end_docid) as u32;
assert!(start <= end);
start..end
}
/// returns the num of values associated with a doc_id
pub(crate) fn num_vals_for_doc(&self, doc: DocId) -> u32 {
let range = self.range(doc);
range.end - range.start
}
/// Returns the overall number of values in this field.
#[inline]
pub fn total_num_vals(&self) -> u32 {
self.idx.max_value() as u32
}
/// Returns the number of documents in the index.
#[inline]
pub fn num_docs(&self) -> u32 {
self.idx.num_vals() - 1
}
/// Converts a list of positions of values in a 1:n index to the corresponding list of DocIds.
/// Positions are converted inplace to docids.
///
/// Since there is no index for value pos -> docid, but docid -> value pos range, we scan the
/// index.
///
/// Correctness: positions needs to be sorted. idx_reader needs to contain monotonically
/// increasing positions.
///
///
/// TODO: Instead of a linear scan we can employ a exponential search into binary search to
/// match a docid to its value position.
pub(crate) fn positions_to_docids(&self, doc_id_range: Range<u32>, positions: &mut Vec<u32>) {
if positions.is_empty() {
return;
}
let mut cur_doc = doc_id_range.start;
let mut last_doc = None;
assert!(self.idx.get_val(doc_id_range.start) as u32 <= positions[0]);
let mut write_doc_pos = 0;
for i in 0..positions.len() {
let pos = positions[i];
loop {
let end = self.idx.get_val(cur_doc + 1) as u32;
if end > pos {
positions[write_doc_pos] = cur_doc;
write_doc_pos += if last_doc == Some(cur_doc) { 0 } else { 1 };
last_doc = Some(cur_doc);
break;
}
cur_doc += 1;
}
}
positions.truncate(write_doc_pos);
}
}
#[cfg(test)]
mod tests {
use std::ops::Range;
use std::sync::Arc;
use fastfield_codecs::IterColumn;
use crate::fastfield::MultiValueIndex;
fn index_to_pos_helper(
index: &MultiValueIndex,
doc_id_range: Range<u32>,
positions: &[u32],
) -> Vec<u32> {
let mut positions = positions.to_vec();
index.positions_to_docids(doc_id_range, &mut positions);
positions
}
#[test]
fn test_positions_to_docid() {
let offsets = vec![0, 10, 12, 15, 22, 23]; // docid values are [0..10, 10..12, 12..15, etc.]
let column = IterColumn::from(offsets.into_iter());
let index = MultiValueIndex::new(Arc::new(column));
assert_eq!(index.num_docs(), 5);
{
let positions = vec![10u32, 11, 15, 20, 21, 22];
assert_eq!(index_to_pos_helper(&index, 0..5, &positions), vec![1, 3, 4]);
assert_eq!(index_to_pos_helper(&index, 1..5, &positions), vec![1, 3, 4]);
assert_eq!(index_to_pos_helper(&index, 0..5, &[9]), vec![0]);
assert_eq!(index_to_pos_helper(&index, 1..5, &[10]), vec![1]);
assert_eq!(index_to_pos_helper(&index, 1..5, &[11]), vec![1]);
assert_eq!(index_to_pos_helper(&index, 2..5, &[12]), vec![2]);
assert_eq!(index_to_pos_helper(&index, 2..5, &[12, 14]), vec![2]);
assert_eq!(index_to_pos_helper(&index, 2..5, &[12, 14, 15]), vec![2, 3]);
}
}
}

View File

@@ -1,9 +1,23 @@
mod index;
mod reader;
mod writer;
pub use self::reader::MultiValuedFastFieldReader;
pub use self::writer::MultiValuedFastFieldWriter;
use fastfield_codecs::FastFieldCodecType;
pub use index::MultiValueIndex;
pub use self::reader::{MultiValuedFastFieldReader, MultiValuedU128FastFieldReader};
pub(crate) use self::writer::MultivalueStartIndex;
pub use self::writer::{MultiValueU128FastFieldWriter, MultiValuedFastFieldWriter};
/// The valid codecs for multivalue values excludes the linear interpolation codec.
///
/// This limitation is only valid for the values, not the offset index of the multivalue index.
pub(crate) fn get_fastfield_codecs_for_multivalue() -> [FastFieldCodecType; 2] {
[
FastFieldCodecType::Bitpacked,
FastFieldCodecType::BlockwiseLinear,
]
}
#[cfg(test)]
mod tests {
@@ -402,6 +416,74 @@ mod bench {
use crate::schema::{Cardinality, NumericOptions, Schema};
use crate::Document;
fn bench_multi_value_ff_merge_opt(
num_docs: usize,
segments_every_n_docs: usize,
merge_policy: impl crate::indexer::MergePolicy + 'static,
) {
let mut builder = crate::schema::SchemaBuilder::new();
let fast_multi =
crate::schema::NumericOptions::default().set_fast(Cardinality::MultiValues);
let multi_field = builder.add_f64_field("f64s", fast_multi);
let index = crate::Index::create_in_ram(builder.build());
let mut writer = index.writer_for_tests().unwrap();
writer.set_merge_policy(Box::new(merge_policy));
for i in 0..num_docs {
let mut doc = crate::Document::new();
doc.add_f64(multi_field, 0.24);
doc.add_f64(multi_field, 0.27);
doc.add_f64(multi_field, 0.37);
if i % 3 == 0 {
doc.add_f64(multi_field, 0.44);
}
writer.add_document(doc).unwrap();
if i % segments_every_n_docs == 0 {
writer.commit().unwrap();
}
}
{
writer.wait_merging_threads().unwrap();
let mut writer = index.writer_for_tests().unwrap();
let segment_ids = index.searchable_segment_ids().unwrap();
writer.merge(&segment_ids).wait().unwrap();
}
// If a merging thread fails, we should end up with more
// than one segment here
assert_eq!(1, index.searchable_segments().unwrap().len());
}
#[bench]
fn bench_multi_value_ff_merge_many_segments(b: &mut Bencher) {
let num_docs = 100_000;
b.iter(|| {
bench_multi_value_ff_merge_opt(num_docs, 1_000, crate::indexer::NoMergePolicy);
});
}
#[bench]
fn bench_multi_value_ff_merge_many_segments_log_merge(b: &mut Bencher) {
let num_docs = 100_000;
b.iter(|| {
let merge_policy = crate::indexer::LogMergePolicy::default();
bench_multi_value_ff_merge_opt(num_docs, 1_000, merge_policy);
});
}
#[bench]
fn bench_multi_value_ff_merge_few_segments(b: &mut Bencher) {
let num_docs = 100_000;
b.iter(|| {
bench_multi_value_ff_merge_opt(num_docs, 33_000, crate::indexer::NoMergePolicy);
});
}
fn multi_values(num_docs: usize, vals_per_doc: usize) -> Vec<Vec<u64>> {
let mut vals = vec![];
for _i in 0..num_docs {
@@ -435,7 +517,7 @@ mod bench {
for val in block {
doc.add_u64(field, *val);
}
fast_field_writers.add_document(&doc);
fast_field_writers.add_document(&doc).unwrap();
}
fast_field_writers
.serialize(&mut serializer, &HashMap::new(), None)
@@ -493,7 +575,7 @@ mod bench {
for val in block {
doc.add_u64(field, *val);
}
fast_field_writers.add_document(&doc);
fast_field_writers.add_document(&doc).unwrap();
}
fast_field_writers
.serialize(&mut serializer, &HashMap::new(), None)
@@ -526,7 +608,7 @@ mod bench {
for val in block {
doc.add_u64(field, *val);
}
fast_field_writers.add_document(&doc);
fast_field_writers.add_document(&doc).unwrap();
}
fast_field_writers
.serialize(&mut serializer, &HashMap::new(), Some(&doc_id_mapping))

Some files were not shown because too many files have changed in this diff Show More