Compare commits

...

755 Commits

Author SHA1 Message Date
Pascal Seitz
6037cdfe7e remove dynamic dispatch in collect_segment 2023-03-02 20:16:45 +08:00
PSeitz
ca20bfa776 add date_histogram (#1900)
* add date_histogram

* add return result
2023-03-02 05:17:35 +01:00
PSeitz
faa706d804 add coerce option for text and numbers types (#1904)
* add coerce option for text and numbers types

allow to coerce the field type when indexing if the type does not match

* Apply suggestions from code review

Co-authored-by: Paul Masurel <paul@quickwit.io>

* add tests,add COERCE flag, include bool in coercion

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2023-03-01 11:36:59 +01:00
PSeitz
850a0d7ae2 add agg benchmark for optional and multi value (#1916)
closes #1870
2023-03-01 17:01:52 +09:00
Paul Masurel
7fae4d98d7 Adapting for quickwit2 (#1912)
* Adapting tantivy to make it possible to be plugged to quickwit.

* Apply suggestions from code review

Co-authored-by: PSeitz <PSeitz@users.noreply.github.com>

* Added unit test

---------

Co-authored-by: PSeitz <PSeitz@users.noreply.github.com>
2023-03-01 16:27:46 +09:00
PSeitz
bc36458334 move buffer in front of dynamic dispatch (#1915)
dynamic dispatch seems to be really expensive, move the buffer in front of the dynamic dispatch, to reduce the number of calls into the dynamic dispatched collector.
2023-02-28 13:07:50 +08:00
trinity-1686a
8a71e00da3 allow limiting the number of matched term in range query (#1899) 2023-02-27 10:44:08 +01:00
PSeitz
e510f699c8 feat: add support for u64,i64,f64 fields in term aggregation (#1883)
* feat: add support for u64,i64,f64 fields in term aggregation

* hash enum values

* fix build

* Apply suggestions from code review

Co-authored-by: Paul Masurel <paul@quickwit.io>

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2023-02-27 15:04:41 +08:00
Paul Masurel
d25fc155b2 Making some of the column/termdict operations async-friendly (#1902) 2023-02-27 15:34:47 +09:00
Paul Masurel
8ea97e7d6b Minor refactoring preparing for getting columnar integrated in quickwit. (#1911) 2023-02-27 14:23:30 +09:00
Paul Masurel
0a726a0897 Added Empty ColumnIndex (#1910) 2023-02-27 13:59:22 +09:00
Paul Masurel
66ff53b0f4 Various minor code cleanup (#1909) 2023-02-27 13:48:34 +09:00
Paul Masurel
d002698008 Re-export of query grammar. (#1908) 2023-02-27 12:26:34 +09:00
Paul Masurel
c838aa808b Removedc the extra nesting in unit test file (#1907) 2023-02-27 12:17:52 +09:00
Paul Masurel
06850719dc Renaming .values(DocId) to .values_for_doc(DocId) (#1906) 2023-02-27 12:15:13 +09:00
PSeitz
5f23bb7e65 switch to sparse collection for histogram (#1898)
* switch to sparse collection for histogram

Replaces histogram vec collection with a hashmap. This approach works much better for sparse data and enables use cases like drill downs (filter + small interval).
It is slower for dense cases (1.3x-2x slower). This can be alleviated with a specialized hashmap in the future.
closes #1704
closes #1370

* refactor, clippy

* fix bucket_pos overflow issue
2023-02-23 07:02:58 +01:00
trinity-1686a
533ad99cd5 add PhrasePrefixQuery (#1842)
* add PhrasePrefixQuery
2023-02-22 11:18:33 +01:00
PSeitz
c7278b3258 remove schema in aggs (#1888)
* switch to ColumnType, move tests

* remove Schema dependency in agg
2023-02-22 04:50:28 +01:00
Paul Masurel
6b403e3281 Re-export of columnar 2023-02-22 11:23:54 +09:00
Paul Masurel
789cc8703e Adding unit test testing docfreq after merge (#1895) 2023-02-22 11:05:34 +09:00
Paul Masurel
e5098d9fe8 Moving test around reenabling tests that were disabled. (#1894) 2023-02-22 10:31:52 +09:00
Paul Masurel
f537334e4f Adding a write schema to columnar's merge operations. (#1884)
* Adding a write schema to columnar's merge operations.

* Added unit test checking min/max when columns are empty.

* CR comment

* Rename to value_type_to_column_type
2023-02-21 18:25:16 +09:00
Paul Masurel
e2aa5af075 Clippy warnings fixes (#1885) 2023-02-20 19:04:13 +09:00
Paul Masurel
02bebf4ff5 Cargo fmt 2023-02-20 09:40:04 +09:00
Paul Masurel
0274c982d5 Refactoring. (#1881)
`ColumnValues` wrongly located in column_values/column.rs due to
historical reason moves to column_values/mod.rs

u128 stuff gets its own directory like u64 stuff.
2023-02-17 21:57:14 +09:00
PSeitz
74bf60b4f7 implement SegmentAggregationCollector on bucket aggs (#1878) 2023-02-17 12:53:29 +01:00
PSeitz
bf1449b22d update examples for literate docs (#1880) 2023-02-17 11:48:22 +01:00
PSeitz
111f25a8f7 clippy (#1879)
* fix clippy

* fix clippy

* fmt
2023-02-17 11:34:21 +01:00
PSeitz
019db10e8e refactor aggregations (#1875)
* add specialized version for full cardinality

Pre Columnar
test aggregation::tests::bench::bench_aggregation_average_u64                                                            ... bench:   6,681,850 ns/iter (+/- 1,217,385)
test aggregation::tests::bench::bench_aggregation_average_u64_and_f64                                                    ... bench:  10,576,327 ns/iter (+/- 494,380)

Current
test aggregation::tests::bench::bench_aggregation_average_u64                                                            ... bench:  11,562,084 ns/iter (+/- 3,678,682)
test aggregation::tests::bench::bench_aggregation_average_u64_and_f64                                                    ... bench:  18,925,790 ns/iter (+/- 17,616,771)

Post Change
test aggregation::tests::bench::bench_aggregation_average_u64                                                            ... bench:   9,123,811 ns/iter (+/- 399,720)
test aggregation::tests::bench::bench_aggregation_average_u64_and_f64                                                    ... bench:  13,111,825 ns/iter (+/- 273,547)

* refactor aggregation collection

* add buffering collector
2023-02-16 13:15:16 +01:00
Paul Masurel
7423f99719 Issue/columnar for json (#1876)
Adding support for JSON fast field.
2023-02-16 20:38:32 +09:00
Alex Cole
f2f38c43ce Make BM25 scoring more flexible (#1855)
* Introduce Bm25StatisticsProvider to inject statistics

* fix formatting I accidentally changed
2023-02-16 19:14:12 +09:00
PSeitz
71f43ace1d fix dynamic dispatch regression for range queries (#1871) 2023-02-14 16:56:40 +01:00
PSeitz
347614c841 test error for avg agg on ip field (#1873)
closes #1835
2023-02-14 23:22:56 +08:00
Paul Masurel
097fd6138d Fix clippy comments (#1872) 2023-02-14 23:12:45 +09:00
PSeitz
01e5a22759 switch to new ff api (#1868) 2023-02-14 15:57:32 +08:00
Antoine Gauthier
b60b7d2afe fix(CI) enable coverage on doctest (#1839)
* fix(CI) enable coverage on doctest
⚠️ Marked as [unstable](https://github.com/taiki-e/cargo-llvm-cov/issues/2)
refs #1761

* remove obsolete CI directory
2023-02-14 16:42:44 +09:00
Yukun Guo
dfe4e95fde Make index compatible with virtual drives on Windows (#1843)
* Make index compatible with virtual drives on Windows

* Get rid of normpath
2023-02-14 16:41:48 +09:00
Paul Masurel
60cc2644d6 Fixing test_fail_on_flush_segment_but_one_worker_remains (#1869)
The new fast field code, based on columnar, had a larger minimum memory
footprint, causing the first docuemnt to trigger a flush of the asegment
in this unit test.

This PR prevents the allocation of a large capacity for the different hashmap tables
using in the columnar writer.

Closes #1859
2023-02-14 16:09:42 +09:00
Paul Masurel
10bccac61b Bugfix in parse_into_milliseconds (#1867) 2023-02-14 15:06:40 +09:00
PSeitz
1cfb9ce59a improve range query performance (#1864)
fix RowId vs DocId naming
fixes #1863
2023-02-14 13:25:39 +09:00
trinity-1686a
539ff08a79 move DateTime to tantivy_common (#1861)
* move DateTime to tantivy_common

* resolve imports of columnar::DateTime as import of common::DateTime
2023-02-11 17:03:06 +01:00
PSeitz
dab93df94e fix benchmarks (#1862) 2023-02-11 15:44:47 +09:00
trinity-1686a
3120147a76 re-enable examples (#1860) 2023-02-10 14:51:37 +01:00
PSeitz
cbcafae04c fix: doc store for files larger 4GB (#1856)
Fixes an issue in the skip list deserialization, which deserialized the byte start offset incorrectly as u32.
`get_doc` will fail for any docs that live in a block with start offset larger than u32::MAX (~4GB).
Causes index corruption, if a segment with a doc store larger 4GB is merged.

tantivy version 0.19 is affected
2023-02-10 14:29:43 +01:00
PSeitz
36c6138e7f fix: auto downgrade index record option, instead of vint error (#1857)
Prev: thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: IoError(Custom { kind: InvalidData, error: "Reach end of buffer while reading VInt" })', src/main.rs:46:14
Now: Automatic downgrade to next available level
2023-02-10 13:45:23 +01:00
PSeitz
7a9befd18d fix sort order test for term aggregation (#1858)
fix sort order test for term aggregation
fix invalid request test
2023-02-10 10:26:58 +01:00
Paul Masurel
62c811df2b Added a columnar cli 2023-02-09 19:02:16 +01:00
PSeitz
03345f0aa2 fmt code, update lz4_flex (#1838)
formatting on nightly changed
2023-02-10 01:42:32 +09:00
Paul Masurel
b7bfa20e38 Fixed test performance. 2023-02-09 17:39:55 +01:00
Paul Masurel
db8583db75 Fixing unit test 2023-02-09 16:53:05 +01:00
trinity-1686a
1390834ae8 make Term::as_slice public (#1846) 2023-02-09 15:37:07 +01:00
trinity-1686a
3ac973bea4 fix invalid endianness in documentation (#1845)
* fix doc about term endianness

* rustfmt
2023-02-09 15:36:38 +01:00
Paul Masurel
405e2cf4d9 Merge with main 2023-02-09 14:28:57 +01:00
Paul Masurel
b63c6c27bc adding change from main 2023-02-09 14:18:46 +01:00
Paul Masurel
bd5eea9852 Integrated columnar work. 2023-02-09 13:14:31 +01:00
PSeitz
0f20787917 fix doc store cache docs (#1821)
* fix doc store cache docs

addresses an issue reported in #1820

* rename doc_store_cache_size
2023-01-23 07:06:49 +01:00
Paul Masurel
2874554ee4 Removed the sorting logic that forced column type to be sorted like (#1816)
* Removed the sorting logic that forced column type to be sorted like
ColumnTypes.

* add comments

Co-authored-by: PSeitz <PSeitz@users.noreply.github.com>
2023-01-20 12:43:28 +01:00
PSeitz
cbc70a9eae Cargo.toml cleanup (#1817) 2023-01-20 12:30:35 +01:00
PSeitz
226d0f88bc add columnar to workspace (#1808) 2023-01-20 11:47:10 +01:00
Paul Masurel
9548570e88 Fixing broken test build 2023-01-20 18:18:32 +09:00
Paul Masurel
9a296b29b7 Renamed dense file to dense.rs 2023-01-20 17:22:25 +09:00
PSeitz
b31fd389d8 collect columns for merge (#1812)
* collect columns for merge

* return column_type from, fix visibility

* fix

Co-authored-by: Paul Masurel <paul@quickwit.io>
2023-01-20 07:58:29 +01:00
Paul Masurel
89cec79813 Make it possible to force a column type and intricate bugfix. (#1815) 2023-01-20 14:30:56 +09:00
PSeitz
d09d91a856 fix tests (#1813) 2023-01-19 23:41:21 +09:00
PSeitz
50d8a8bc32 Update README (#1804)
Some parts are outdated

For the debugging tutorial, debugging is really easy now with VSCode, and there are plenty of other sources for debugging rust
2023-01-19 18:09:45 +09:00
Paul Masurel
08919a2900 Improvement on the scalar / random bitpacker code. (#1781)
* Improvement on the scalar / random bitpacker code.

Added proptesting
Added simple benchmark
Added assert and comments on the very non trivial hidden contract
Remove the need for an extra padding.

The last point introduces a small performance regression (~10%).

* Fixing unit tests
2023-01-19 18:09:13 +09:00
Lonre Wang
8ba333f1b4 Typo fix (#1803)
* Update text_options.rs

* Update src/schema/text_options.rs

Co-authored-by: Paul Masurel <paul@quickwit.io>
2023-01-19 17:56:05 +09:00
PSeitz
a2ca12995e update aggregation docs (#1807) 2023-01-19 09:52:47 +01:00
Paul Masurel
e3d504d833 Minor code cleanup (#1810) 2023-01-19 17:47:26 +09:00
Paul Masurel
5a42c5aae9 Add support for multivalues (#1809) 2023-01-19 16:55:01 +09:00
Paul Masurel
a86b104a40 Differentiating between str and bytes, + unit test 2023-01-19 14:38:12 +09:00
PSeitz
f9abd256b7 add ip addr to columnar (#1805) 2023-01-19 05:36:06 +01:00
Paul Masurel
9f42b6440a Completed unit test for dictionary encoded column 2023-01-19 12:15:27 +09:00
Paul Masurel
c723ed3f0b Columnar merge (#1806) 2023-01-19 11:52:27 +09:00
trinity-1686a
d72ea7d353 modify getters for sstable metadata (#1793)
* add way to get up to `limit` terms from sstable

* make some function of sstable load less data

* add some tests to sstable

* add tests on sstable dictionary

* fix some bugs with sstable
2023-01-18 14:42:55 +01:00
Paul Masurel
5180b612ef Removing the demuxer code (#1799) 2023-01-18 16:12:35 +09:00
PSeitz
f687b3a5aa start migrate Field to &str (#1772)
start migrate Field to &str in preparation of columnar
return Result for get_field
2023-01-18 16:12:07 +09:00
PSeitz
c4af63e588 add rename (#1797) 2023-01-18 13:28:37 +09:00
Adrien Guillo
4b343b3189 Merge pull request #1802 from quickwit-oss/guilload/clippy-fixes
Fix some Clippy warnings
2023-01-17 10:39:55 -05:00
Adrien Guillo
c51d9f9f83 Fix some Clippy warnings 2023-01-17 10:17:51 -05:00
Adrien Guillo
c9cb3d04bf Merge pull request #1788 from quickwit-oss/guilload/remove-std-dev-from-stats-agg
Remove standard deviation from stats aggregation
2023-01-16 23:16:36 -05:00
Adrien Guillo
0caaf13a90 Remove standard deviation from stats aggregation 2023-01-16 22:58:23 -05:00
Adrien Guillo
a59bd965cc Merge pull request #1794 from quickwit-oss/guilload/count-min-max-sum-aggs
Add count, min, max, and sum aggregations
2023-01-16 22:45:01 -05:00
Adrien Guillo
f2dad194ea Add count, min, max, and sum aggregations 2023-01-16 12:22:20 -05:00
Paul Masurel
25bad784ad Integrated fastfield codecs into columnar. (#1782)
Introduced asymetric OptionalCodec / SerializableOptionalCodec
Removed cardinality from the columnar sstable.
Added DynamicColumn
Reorganized all files
Change DenseCodec serialization logic.
Renamed methods to rank/select
Moved versioning footer to the columnar level
2023-01-16 17:24:49 +09:00
PSeitz
4bac945709 add ip field example (#1775) 2023-01-16 06:06:11 +01:00
trinity-1686a
16b704e190 make file_slice_for_range on sstable public (#1784) 2023-01-16 13:59:57 +09:00
PSeitz
6ca9a477f3 reuse stats for average (#1785)
* reuse stats for average

* fix count type
2023-01-13 23:32:27 +08:00
Shikhar Bhushan
2650111b76 EnableScoring::Disabled - optional Searcher (#1780) 2023-01-12 09:26:50 -05:00
PSeitz
1176555eff handle user input on get_docid_for_value_range (#1760)
* handle user input on get_docid_for_value_range

fixes #1757

* pass range as parameter
2023-01-12 14:20:16 +01:00
Adrien Guillo
f8d111a75e Merge pull request #1777 from quickwit-oss/guilload/ff-range-query-on-not-indexed-fields
Allow range queries via fast fields on non-indexed fields
2023-01-11 10:14:32 -05:00
Adrien Guillo
e17996f2fd Allow range queries via fast fields on non-indexed fields 2023-01-11 09:56:13 -05:00
Adrien Guillo
f3621c0487 Add license to tokenizer-api crate (#1778) 2023-01-11 05:26:41 +01:00
Adrien Guillo
14222a47a3 Fix typo (#1776) 2023-01-11 00:49:13 +09:00
Adam Reichold
8312c882a5 More cosmetic fixes for upcoming Clippy lints. (#1771) 2023-01-10 10:32:45 +01:00
Paul Masurel
7a8fce0ae7 Minor mini fixes 2023-01-10 14:15:30 +09:00
Michael Kleen
196e42f33e Add regex tokenizer (#1759)
This adds a regex tokenizer which tokenizes the text by using a
regex pattern to split.

Co-authored-by: Michael Kleen <mkleen@gmailw.com>
2023-01-10 13:38:37 +09:00
Adam Reichold
82a183bc2d Bump dependency on lru to from version 0.7.5 to version 0.9.0. (#1755) 2023-01-10 13:35:37 +09:00
dependabot[bot]
3090d49615 Update base64 requirement from 0.20.0 to 0.21.0 (#1769)
Updates the requirements on [base64](https://github.com/marshallpierce/rust-base64) to permit the latest version.
- [Release notes](https://github.com/marshallpierce/rust-base64/releases)
- [Changelog](https://github.com/marshallpierce/rust-base64/blob/master/RELEASE-NOTES.md)
- [Commits](https://github.com/marshallpierce/rust-base64/compare/v0.20.0...v0.21.0)

---
updated-dependencies:
- dependency-name: base64
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-10 13:35:05 +09:00
PSeitz
7c6cc818ae enable range query on fast field for u64 compatible types (#1762)
* enable range query on fast field for u64 compatible types

* rename, update benches
2023-01-10 04:08:26 +01:00
PSeitz
514d23a20c move tokenizer API to seperate crate (#1767)
closes #1766

Finding tantivy tokenizers is a frustrating experience currently, since
they need be updated for each tantivy version. That's unnecessary since
the API is rather stable anyway.
2023-01-09 06:37:38 +01:00
Paul Masurel
4f9efe654c Support for columnar (#1734)
* Added support for dynamic fast field.

See README for more information.

* Apply suggestions from code review

Co-authored-by: PSeitz <PSeitz@users.noreply.github.com>
2023-01-07 17:37:00 +09:00
Adam Reichold
1afa5bf3db Make construction of LevenshteinAutomatonBuilder for FuzzyTermQuery instances lazy. (#1756) 2023-01-06 12:44:49 +09:00
PSeitz
07a51eb7c8 refactor multivalue fastfield, refactor range query (#1749)
Introduce MakeZero trait, remove make_zero from FastValue
Merge two multivalue fastfield implementations into one
prepare range query on fastfield for different types
2023-01-05 12:09:50 +01:00
Adam Reichold
2080c370c2 Enable usage of FuzzyTermQuery for specific fields via QueryParser (#1750)
* Make nightly Clippy mostly happy.

* Document how to produce TermSetQuery queries using QueryParser.

* Enable construction of queries using FuzzyTermQuery via the QueryParser

* Use FxHashMap instead of HashMap in the QueryParser as these hash tables are not exposed to DoS attacks.

* Use a struct instead of a tuple to improve readability.
2023-01-04 18:11:27 +09:00
Daw-Chih Liou
b22f96624e doc: update comments in the faceted search example (#1737)
* doc: update comments in the faceted search example

* chore: format
2023-01-02 11:07:30 +01:00
pinkforest(she/her)
b78dc5e313 Bump prettytables (#1746) 2022-12-31 15:01:39 +01:00
Paul Masurel
3f915925af Fixing unit tests 2022-12-27 12:02:16 +09:00
Paul Masurel
9c5fef5af7 Fixing sstable proptest (#1743) 2022-12-26 16:29:33 +09:00
Paul Masurel
9948a84ebe Simplifies the count_ones definition. (#1742) 2022-12-26 16:08:01 +09:00
PSeitz
45156fd869 use group_by in translate_codec_idx_to_original_id (#1736) 2022-12-26 06:13:29 +01:00
Paul Masurel
bc959006fa Ooops. Removing ordered_floats. 2022-12-22 19:50:34 +09:00
Paul Masurel
7385a8f80c Supporting PartialCmp in VectorColumn. (#1735)
* Supporting PartialCmp in VectorColumn.
* Apply suggestions from code review

Co-authored-by: PSeitz <PSeitz@users.noreply.github.com>
2022-12-22 17:47:25 +09:00
Paul Masurel
13b89cba17 Adding inlines. 2022-12-22 14:29:41 +09:00
Hasnain Lakhani
f4804ce2f5 Adjust spelling of "returns" in docs for DisjunctionMaxQuery (#1733) 2022-12-22 14:04:07 +09:00
Paul Masurel
2a6d1eaf78 Added missing license. 2022-12-22 12:47:43 +09:00
Paul Masurel
540a9972bd Support for NotNaN in fast fields 2022-12-22 12:28:25 +09:00
Paul Masurel
bb48c3e488 Refactoring to prepare for the addition of dynamic fast field (#1730)
* Refactoring to prepare for the addition of dynamic fast field

- Exposing insert_key / insert_value
- Renamed SSTable::{Reader/Writer}-> SSTable::{ValueReader/ValueWriter}
- Added a generic Dictionary object in the sstable crate
- Removing the TermDictionary wrapper from tantivy, relying directly on
  an alias of the generic Dictionary object.
- dropped the use of byteorder in sstable.
- Stopped scanning / reading the entire dictionary when streaming a range.

* Added a benchmark for streaming sstable ranges.

* CR comments.

Rename deserialize_u64 -> deserialize_vint_u64

* Removed needless allocation, split serialize into serialize and clear.
2022-12-22 12:25:46 +09:00
Paul Masurel
3339a3ec05 Removed feature(quickwit) in tantivy-common. 2022-12-22 10:19:57 +09:00
Paul Masurel
f39165e1e7 Moving FileSlice to tantivy-common (#1729) 2022-12-21 16:35:11 +09:00
Paul Masurel
32cb1d22da Removed AsyncIoResult. (#1728) 2022-12-21 16:01:17 +09:00
Paul Masurel
4a6bf50e78 Clippy 2022-12-21 15:43:34 +09:00
PSeitz
2ac1cc2fc0 add sparse codec (#1723)
* add sparse codec

* Apply suggestions from code review

Co-authored-by: Paul Masurel <paul@quickwit.io>

* Apply suggestions from code review

Co-authored-by: Paul Masurel <paul@quickwit.io>

* Apply suggestions from code review

Co-authored-by: Paul Masurel <paul@quickwit.io>

* add the -1 u16 fix for metadata num_vals

* add dense block encoding to sparse codec

* add comment, refactor u16 reading

Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-12-20 15:30:33 +01:00
PSeitz
f9171a3981 fix clippy (#1725)
* fix clippy

* fix clippy fastfield codecs

* fix clippy bitpacker

* fix clippy common

* fix clippy stacker

* fix clippy sstable

* fmt
2022-12-20 07:30:06 +01:00
PSeitz
a2cf6a79b4 Sparse dense index (#1716)
* add dense codec

* benchmark fix and important optimisation

* move code to DenseIndexBlock

improve benchmark

* Apply suggestions from code review

Co-authored-by: Paul Masurel <paul@quickwit.io>

* Apply suggestions from code review

Co-authored-by: Paul Masurel <paul@quickwit.io>

* extend benchmarks

* Apply suggestions from code review

Co-authored-by: Paul Masurel <paul@quickwit.io>

Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-12-13 07:50:09 +01:00
Paul Masurel
f6e87a5319 Cargo fmt 2022-12-13 12:30:40 +09:00
Paul Masurel
f9971e15fe Fixing unit test with sstable test. 2022-12-13 12:22:44 +09:00
PSeitz
3cdc8e7472 pass index info to serialize (#1719) 2022-12-13 04:20:31 +01:00
dependabot[bot]
fbb0f8b55d Update base64 requirement from 0.13.0 to 0.20.0 (#1720)
Updates the requirements on [base64](https://github.com/marshallpierce/rust-base64) to permit the latest version.
- [Release notes](https://github.com/marshallpierce/rust-base64/releases)
- [Changelog](https://github.com/marshallpierce/rust-base64/blob/master/RELEASE-NOTES.md)
- [Commits](https://github.com/marshallpierce/rust-base64/compare/v0.13.0...v0.20.0)

---
updated-dependencies:
- dependency-name: base64
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-12-13 11:46:23 +09:00
Paul Masurel
136a8f4124 Isolating sstable and stacker in independant crates. (#1718)
Both crate will be used in the new (optional + dynamic) fastfield work.
2022-12-13 11:44:17 +09:00
PSeitz
5d4535de83 Changelog fix (#1717) 2022-12-12 14:28:42 +09:00
PSeitz
2c50b02eb3 Fix max bucket limit in histogram (#1703)
* Fix max bucket limit in histogram

The max bucket limit in histogram was broken, since some code introduced temporary filtering of buckets, which then resulted into an incorrect increment on the bucket count.
The provided solution covers more scenarios, but there are still some scenarios unhandled (See #1702).

* Apply suggestions from code review

Co-authored-by: Paul Masurel <paul@quickwit.io>

Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-12-12 04:40:15 +01:00
PSeitz
509adab79d Bump version (#1715)
* group workspace deps

* update cargo.toml

* revert tant version

* chore: Release
2022-12-12 04:39:43 +01:00
PSeitz
96c93a6ba3 Merge pull request #1700 from quickwit-oss/PSeitz-patch-1
Update CHANGELOG.md
2022-12-02 16:31:11 +01:00
boraarslan
495824361a Move split_full_path to Schema (#1692) 2022-11-29 20:56:13 +09:00
PSeitz
485a8f507e Update CHANGELOG.md 2022-11-28 15:41:31 +01:00
PSeitz
1119e59eae prepare fastfield format for null index (#1691)
* prepare fastfield format for null index
* add format version for fastfield
* Update fastfield_codecs/src/compact_space/mod.rs
* switch to variable size footer
* serialize delta of end
2022-11-28 17:15:24 +09:00
PSeitz
ee1f2c1f28 add aggregation support for date type (#1693)
* add aggregation support for date type
fixes #1332

* serialize key_as_string as rfc3339 in date histogram
* update docs
* enable date for range aggregation
2022-11-28 09:12:08 +09:00
PSeitz
600548fd26 Merge pull request #1694 from quickwit-oss/dependabot/cargo/zstd-0.12
Update zstd requirement from 0.11 to 0.12
2022-11-25 05:48:59 +01:00
PSeitz
9929c0c221 Merge pull request #1696 from quickwit-oss/dependabot/cargo/env_logger-0.10.0
Update env_logger requirement from 0.9.0 to 0.10.0
2022-11-25 03:28:10 +01:00
dependabot[bot]
f53e65648b Update env_logger requirement from 0.9.0 to 0.10.0
Updates the requirements on [env_logger](https://github.com/rust-cli/env_logger) to permit the latest version.
- [Release notes](https://github.com/rust-cli/env_logger/releases)
- [Changelog](https://github.com/rust-cli/env_logger/blob/main/CHANGELOG.md)
- [Commits](https://github.com/rust-cli/env_logger/compare/v0.9.0...v0.10.0)

---
updated-dependencies:
- dependency-name: env_logger
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2022-11-24 20:07:52 +00:00
PSeitz
0281b22b77 update create_in_ram docs (#1695) 2022-11-24 17:30:09 +01:00
dependabot[bot]
a05c184830 Update zstd requirement from 0.11 to 0.12
Updates the requirements on [zstd](https://github.com/gyscos/zstd-rs) to permit the latest version.
- [Release notes](https://github.com/gyscos/zstd-rs/releases)
- [Commits](https://github.com/gyscos/zstd-rs/commits)

---
updated-dependencies:
- dependency-name: zstd
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2022-11-23 20:15:32 +00:00
Paul Masurel
0b40a7fe43 Added a expand_dots JsonObjectOptions. (#1687)
Related with quickwit#2345.
2022-11-21 23:03:00 +09:00
trinity-1686a
e758080465 add support for TermSetQuery in query parser (#1683) 2022-11-17 16:49:49 +01:00
Paul Masurel
2a39289a1b Handle escaped dot in json path in the QueryParser. (#1682) 2022-11-16 07:18:34 +09:00
Adam Reichold
ca6231170e Make the built-in stop word lists selectable via the Language enum already used by the Stemmer filter. (#1671) 2022-11-15 17:40:25 +09:00
PSeitz
eda6e5a10a Merge pull request #1681 from quickwit-oss/ip_range_query_multi
remove Column from MultiValuedU128FastFieldReader
2022-11-15 09:27:46 +08:00
Pascal Seitz
8641155cbb remove column from MultiValuedU128FastFieldReader 2022-11-14 18:49:15 +08:00
PSeitz
9a090ed994 Merge pull request #1659 from quickwit-oss/ip_range_query_multi
add support for ip range query on multivalue fastfields
2022-11-14 15:17:41 +08:00
Pascal Seitz
b7d0dd154a fmt 2022-11-14 14:49:15 +08:00
PSeitz
ce10fab20f Apply suggestions from code review
Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-11-14 14:21:53 +08:00
Pascal Seitz
e034328a8b Improve position_to_docid, refactor, add tests 2022-11-14 14:21:53 +08:00
Pascal Seitz
f811d1616b add support for ip range query on multivalue fastfields 2022-11-14 14:21:52 +08:00
PSeitz
c665b16ff0 Merge pull request #1672 from quickwit-oss/allow_range_without_indexed
Allow range query on fastfield without INDEXED
2022-11-14 12:45:12 +08:00
PSeitz
3b5f810051 Merge pull request #1677 from quickwit-oss/switch_to_u32
switch total_num_val to u32
2022-11-14 12:01:40 +08:00
trinity-1686a
5765c261aa allow warming up of the full posting list (#1673)
* allow warming up of the full posting list

* cargo fmt
2022-11-14 10:27:56 +09:00
Pascal Seitz
fb9f03118d switch total_num_val to u32 2022-11-11 17:35:52 +08:00
PSeitz
55a9d808d4 Merge pull request #1674 from quickwit-oss/u128_codec_header
add header with codec type for u128
2022-11-11 13:47:51 +08:00
Pascal Seitz
32166682b3 add header deser test 2022-11-11 13:28:12 +08:00
Pascal Seitz
e6acf8f76d add header with codec type for u128 2022-11-11 11:52:17 +08:00
Pascal Seitz
9e8a0c2cca Allow range query on fastfield without INDEXED 2022-11-10 15:56:08 +08:00
Paul Masurel
3edf0a2724 Using the manual reload policy in IndexWriter. (#1667) 2022-11-09 11:20:41 +01:00
Paul Masurel
8ca12a5683 Added stop word filter to CHANGELOG.md 2022-11-09 17:00:45 +09:00
Adam Reichold
a4b759d2fe Include stop word lists from Lucene and the Snowball project (#1666) 2022-11-09 16:57:35 +09:00
PSeitz
3e9c806890 Merge pull request #1665 from quickwit-oss/fix_num_vals
fix num_vals on u128 value index after merge
2022-11-07 21:46:02 +08:00
Pascal Seitz
c69a873dd3 fix num_vals on value index after merge 2022-11-07 21:05:21 +08:00
PSeitz
666afcf641 Merge pull request #1663 from PSeitz/fix_clippy
fix clippy
2022-11-07 18:11:20 +08:00
Pascal Seitz
38ad46e580 fix clippy 2022-11-07 16:09:55 +08:00
PSeitz
e948889f4c Merge pull request #1662 from quickwit-oss/fix_num_vals
fix num_vals in multivalue index after merge
2022-11-07 15:57:32 +08:00
Pascal Seitz
6e636c9cea fix num_vals in multivalue index after merge 2022-11-07 15:00:52 +08:00
PSeitz
5a610efbc1 Merge pull request #1661 from quickwit-oss/upgrade_criterion
update criterion to 0.4
2022-11-04 14:45:34 +08:00
Pascal Seitz
500a0d5e48 update criterion to 0.4 2022-11-04 13:26:29 +08:00
PSeitz
509a265659 add docstore version (#1652)
* add docstore version

closes #1589

* assert for docstore version
2022-11-04 10:19:16 +09:00
PSeitz
5b2cea1b97 Merge pull request #1656 from quickwit-oss/multival_offset_index
move multivalue index to own file
2022-11-02 14:03:06 +08:00
PSeitz
a5a80ffaea Update fastfield_codecs/src/column.rs
Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-11-02 06:37:27 +01:00
PSeitz
0f98d91a39 Merge pull request #1646 from quickwit-oss/no_score_calls
No score calls if score is not requested
2022-11-01 20:09:32 +08:00
PSeitz
2af6b01c17 Update src/query/boolean_query/boolean_weight.rs
Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-11-01 16:13:00 +08:00
Adam Reichold
c32ab66bbd Small improvements to StopWorldFilter (#1657)
* Do not copy the whole set of stop words for each stream

* Make construction of StopWordFilter more flexible.
2022-11-01 16:47:34 +09:00
PSeitz
3f3a6f9990 Merge pull request #1653 from quickwit-oss/faster_hash
switch to fx hashmap
2022-11-01 14:53:18 +08:00
Pascal Seitz
83325d8f3f move multivalue index to own file
start_doc parameter in positions to docids
2022-11-01 10:36:13 +08:00
PSeitz
4e46f4f8c4 Merge pull request #1649 from adamreichold/split-compound-words
RFC: Add dictionary-based SplitCompoundWords token filter.
2022-10-27 17:12:48 +08:00
Pascal Seitz
43df356010 rename to docset 2022-10-27 16:53:38 +08:00
PSeitz
6647362464 Merge pull request #1648 from adamreichold/stemmer-todo-alloc
Avoid unconditional allocation in StemmerTokenStream.
2022-10-27 16:50:41 +08:00
Pascal Seitz
279b1b28d3 switch to fx hashmap 2022-10-27 16:19:59 +08:00
PSeitz
7a80851e36 Merge pull request #1645 from quickwit-oss/ip_field_range_query
add ip range query benchmark, add seek behaviour
2022-10-27 16:13:52 +08:00
Adam Reichold
cd952429d2 Add dictionary-based SplitCompoundWords token filter. 2022-10-27 08:30:33 +02:00
PSeitz
d777c964da Merge pull request #1650 from adamreichold/fnv-rustc-hash
Replace FNV by rustc-hash
2022-10-27 12:11:26 +08:00
Adam Reichold
bbb058d976 Replace FNV by rustc-hash
Both construction have similar goals but rustc-hash ist better suited for
contemporary CPU as it works one word at a time instead of byte per byte.
2022-10-27 00:35:09 +02:00
Adam Reichold
5f7d027a52 Avoid unconditional allocation in StemmerTokenStream.
This fixes the TODO in two ways: If the stemmer already yields an owned string,
it is used directly as the new text of the token. Otherwise, a temporary buffer
is used to copy the stemmed text (just as before) and then swapping it into the
token to reuse its existing buffer.
2022-10-26 18:11:15 +02:00
Pascal Seitz
dfab201191 for_each_docset to iterate without score 2022-10-26 17:25:05 +08:00
PSeitz
0c2bd36fe3 Panic on duplicate field names (#1647)
fixes #1601
2022-10-26 16:17:33 +09:00
Pascal Seitz
af839753e0 No score calls if score is not requested 2022-10-26 12:18:35 +08:00
Pascal Seitz
fec2b63571 improve bench by adding more blanks in compact space 2022-10-25 22:09:01 +08:00
Pascal Seitz
6213ea476a pass positions parameter 2022-10-25 17:44:51 +08:00
Pascal Seitz
5e159c26bf add ip range query benchmark, add seek behaviour 2022-10-25 15:57:19 +08:00
PSeitz
a5e59ab598 Merge pull request #1644 from quickwit-oss/get_val_u32
switch get_val() to u32
2022-10-24 19:30:03 +08:00
Pascal Seitz
e772d3170d switch get_val() to u32
Fixes #1638
2022-10-24 19:05:57 +08:00
PSeitz
8c2ba7bd55 Merge pull request #1637 from quickwit-oss/ip_field_range_query
add range query via ip fast field
2022-10-24 18:10:47 +08:00
Pascal Seitz
02328b0151 fix proptest 2022-10-24 17:46:06 +08:00
Pascal Seitz
7cc775256c add comments, rename 2022-10-24 17:08:37 +08:00
Pascal Seitz
07b40f8b8b add proptest 2022-10-24 16:52:55 +08:00
PSeitz
9b6b6be5b9 Apply suggestions from code review
Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-10-24 16:00:38 +08:00
Pascal Seitz
6bb73a527f add range query via ip fast field 2022-10-24 16:00:38 +08:00
PSeitz
03885d0f3c Merge pull request #1643 from quickwit-oss/range_query_parser
allow more characters in range query
2022-10-24 15:09:47 +08:00
Pascal Seitz
f2e5135870 allow more characters in range query
closes #1642
2022-10-21 18:05:15 +08:00
Paul Masurel
c24157f28b Bumping version format. (#1640)
The docstore format has changed in a non-compatible manner.
2022-10-21 15:35:35 +09:00
PSeitz
873382cdcb Merge pull request #1639 from quickwit-oss/num_vals_u32
switch num_vals() to u32
2022-10-21 12:36:50 +08:00
Pascal Seitz
791350091c switch num_vals() to u32
fixes #1630
2022-10-20 19:44:28 +08:00
Paul Masurel
483b1d13d4 Added unit test for long tokens (#1635)
* Bugfix on long tokens and multivalue text fields.

Fixes a minor bug for the strong edge case
in which a tokenizer would emit tokens where
the last token does not cover the last position.

More importantly, this adds unit tests.

Closes #1634

* Update src/indexer/segment_writer.rs

Co-authored-by: PSeitz <PSeitz@users.noreply.github.com>

Co-authored-by: PSeitz <PSeitz@users.noreply.github.com>
2022-10-20 15:05:37 +09:00
PSeitz
8de7fa9d95 Merge pull request #1631 from quickwit-oss/high_positions
add test for phrase search on multi text field
2022-10-20 10:26:00 +08:00
Paul Masurel
94313b62f8 Hotfix issue/1629 - position broken (#1633)
* Bugfix position broken.

For Field with several FieldValues, with a
value that contained no token at all, the token position
was reinitialized to 0.

As a result, PhraseQueries can show some false positives.
In addition, after the computation of the position delta, we can
underflow u32, and end up with gigantic delta.

We haven't been able to actually explain the bug in 1629, but it
is assumed that in some corner case these delta can cause a panic.

Closes #1629
2022-10-20 11:03:55 +09:00
Pascal Seitz
f2b2628feb add test for phrase search on multi text field 2022-10-19 16:29:56 +08:00
PSeitz
449f595832 Merge pull request #1628 from quickwit-oss/skip_index_deser
faster skipindex deserialization, larger blocksize on sort
2022-10-19 11:05:20 +08:00
PSeitz
c9235df059 Merge pull request #1627 from quickwit-oss/ip_field_range_query
add range query handling for ip via term dictionary
2022-10-19 10:53:00 +08:00
Pascal Seitz
a4485f7611 faster skipindex deserialization, larger blocksize on sort 2022-10-18 19:32:23 +08:00
Pascal Seitz
1082ff60f9 add range query handling for ip via term dictionary
since IPs are mapped monotonically we can use the term dictionary for range queries
2022-10-18 13:08:27 +08:00
PSeitz
491854155c Merge pull request #1625 from quickwit-oss/index_ip_field
index ip field
2022-10-18 11:18:17 +08:00
Christoph Herzog
96c3d54ac7 fix: Fix power of two computation on 32bit architectures (#1624)
The current `compute_previous_power_of_two()` implementation used for
TermHashmap takes and returns `usize` , but actually only works
correclty on 64 bit architectures (aka usize == u64)

On other architectures the leading_zeros computation is run on the wrong
type (must be u64), and leads to overflows.

Fixed simply computing the leading_zeros based on a u64 value.
2022-10-18 11:55:02 +09:00
Pascal Seitz
6800fdec9d add indexing for ip field
Closes #1595
2022-10-18 10:07:48 +08:00
PSeitz
c9cf9c952a Merge pull request #1614 from quickwit-oss/remove_superfluous_steps
refactor Term
2022-10-17 18:25:31 +08:00
Pascal Seitz
024e53a99c remove truncate 2022-10-17 12:14:35 +08:00
Pascal Seitz
8d75e451bd fix truncate, remove mutable access from term 2022-10-17 12:14:35 +08:00
Pascal Seitz
fcfd76ec55 refactor Term
fixes some issues with Term
Remove duplicate calls to truncate or resize
Replace Magic Number 5 with constant
Enforce minimum size of 5 for metadata
Fix broken truncate docs
use constructor instead new + set calls
normalize constructor stack
replace assert on internal behavior fixes #1585
2022-10-17 12:14:34 +08:00
PSeitz
6b7b1cc4fa Merge pull request #1623 from quickwit-oss/remove_unused_buffer
remove unused buffer
2022-10-14 20:36:00 +08:00
Pascal Seitz
129f7422f5 remove unused buffer 2022-10-14 20:01:10 +08:00
PSeitz
f39cce2c8b Merge pull request #1622 from quickwit-oss/term_aggregation
add term aggregation clarification
2022-10-14 18:09:18 +08:00
PSeitz
d2478fac8a Merge pull request #1621 from quickwit-oss/changelog
update CHANGELOG
2022-10-14 18:08:57 +08:00
Pascal Seitz
952b048341 add term aggregation clarification 2022-10-14 16:12:19 +08:00
PSeitz
80f9596ec8 Merge pull request #1611 from quickwit-oss/remove_token_stream_alloc
remove tokenstream vec alloc
2022-10-14 15:12:30 +08:00
Pascal Seitz
84f9e77e1d update CHANGELOG 2022-10-14 15:10:33 +08:00
PSeitz
a602c248fb Merge pull request #1590 from waywardmonkeys/fix-doc-warnings-quickwit
Fix missing doc warnings when enabling feature "quickwit".
2022-10-14 14:09:25 +08:00
PSeitz
4b9d1fe828 Merge pull request #1620 from quickwit-oss/fix_fieldnorms_indexing
Fix missing fieldnorm indexing
2022-10-14 13:41:38 +08:00
Pascal Seitz
63bc390b02 Fix missing fieldnorm indexing
Fixes broken search (no results) with BM25 for u64, i64, f64, bool, bytes and date after deletion and merge.
There were no fieldnorms recorded for those field. After merge InvertedIndexReader::total_num_tokens returns 0 (Sum over the fieldnorms is 0). BM25 does not work when total_num_tokens is 0.
Fixes #1617
2022-10-14 12:44:40 +08:00
Paul Masurel
07393c2fa0 Attempt to fix race condition in test. (#1619)
Close #1550
2022-10-14 10:56:37 +09:00
PSeitz
77a415cbe4 rename NothingRecorder to DocIdRecorder (#1615) 2022-10-13 15:43:40 +09:00
PSeitz
4b4c231bba Merge pull request #1612 from quickwit-oss/no_panic_please
return Error instead panic in fastfields
2022-10-11 18:33:00 +08:00
PSeitz
11d3409286 add missing docs for fastfield_codecs crate (#1613)
closes #1603
2022-10-11 18:54:24 +09:00
Pascal Seitz
9cb8cfbea8 return Error instead panic in fastfields
fixes #1572
2022-10-11 14:15:22 +08:00
PSeitz
8b69aab0fc avoid prepare_doc allocation (#1610)
avoid prepare_doc allocation, ~10% more thoughput best case
2022-10-11 14:15:55 +09:00
PSeitz
3650d1f36a Merge pull request #1553 from quickwit-oss/ip_field
ip field
2022-10-11 13:09:47 +08:00
Pascal Seitz
2efebdb1bb remove tokenstream vec alloc 2022-10-11 10:30:56 +08:00
François Massot
e443ca63aa Merge pull request #1608 from quickwit-oss/nigel/serialise-bytes-as-b64-#2042
Serialise bytes as base64 strings instead of arrays.
2022-10-10 11:51:23 +02:00
Pascal Seitz
5c9cbee29d handle IpV4 serialization case 2022-10-07 19:52:00 +08:00
Pascal Seitz
b2ca83a93c switch to ipv6, add monotonic_mapping tests 2022-10-07 18:47:55 +08:00
Nigel Andrews
3b189080d4 Use raw string literals in tests 2022-10-07 12:28:25 +02:00
Nigel Andrews
00a6586efe Replaced String::serialize for serializer.serialize_str 2022-10-07 11:55:05 +02:00
Pascal Seitz
b9b913510e fmt 2022-10-07 16:56:19 +08:00
PSeitz
534b1d33c3 use ipv6
Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-10-07 16:56:00 +08:00
PSeitz
f465173872 Apply suggestions from code review
Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-10-07 16:55:53 +08:00
Pascal Seitz
96315df20d use idx part only for positions_to_docid 2022-10-07 16:54:04 +08:00
Pascal Seitz
9a1609d364 add test 2022-10-07 16:25:01 +08:00
Pascal Seitz
39f4e58450 improve comment 2022-10-07 16:25:01 +08:00
Pascal Seitz
a8a36b62cd enable test 2022-10-07 16:25:01 +08:00
Pascal Seitz
226a49338f add StrictlyMonotonicFn 2022-10-07 16:25:01 +08:00
Pascal Seitz
2864bf7123 use serializer for u128 2022-10-07 16:25:01 +08:00
Pascal Seitz
5171ff611b serialize ip as u128, add test for positions_to_docid 2022-10-07 16:25:01 +08:00
Pascal Seitz
e50e74acf8 remove u128 type 2022-10-07 16:25:01 +08:00
Pascal Seitz
0b86658389 rename ip addr, use buffer 2022-10-07 16:25:01 +08:00
Pascal Seitz
5d6602a8d9 mark null handling TODO 2022-10-07 16:25:01 +08:00
Pascal Seitz
4d29ff4d01 finalize ip addr rename 2022-10-07 16:25:01 +08:00
Pascal Seitz
cdc8e3a8be group montonic mapping and inverse
fix mapping inverse
remove ip indexing
add get_between_vals test
2022-10-07 16:25:01 +08:00
Pascal Seitz
67f453b534 rename to iter_gen 2022-10-07 16:25:01 +08:00
Pascal Seitz
787a37bacf expect instead of unwrap 2022-10-07 16:25:01 +08:00
Pascal Seitz
f5039f1846 remove roaring 2022-10-07 16:25:01 +08:00
Pascal Seitz
eeb1f19093 rename to iter_gen 2022-10-07 16:25:01 +08:00
Pascal Seitz
087beaf328 remove null handling 2022-10-07 16:25:01 +08:00
Pascal Seitz
309449dba3 rename to IpAddr 2022-10-07 16:25:01 +08:00
Pascal Seitz
5a76e6c5d3 fix get_between_vals forwarding
fix get_between_vals forwarding in monotonicmapping column by adding an additional conversion function Output->Input
2022-10-07 16:25:01 +08:00
Pascal Seitz
c8713a01ed use iter api 2022-10-07 16:25:01 +08:00
Pascal Seitz
6113e0408c remove comment 2022-10-07 16:25:01 +08:00
Pascal Seitz
400a20b7af add ip field
add u128 multivalue reader and writer
add ip to schema
add ip writers, handle merge
2022-10-07 16:25:01 +08:00
PSeitz
5f565e77de Merge pull request #1604 from quickwit-oss/replace_cbor
replace cbor with cborium
2022-10-07 14:42:55 +08:00
Pascal Seitz
516e60900d remove unwrap 2022-10-07 14:22:37 +08:00
Pascal Seitz
36e1c79f37 replace cbor with cborium
closes #1526
2022-10-07 13:23:39 +08:00
Bruce Mitchener
c2f1c250f9 doc: Remove reference to Searcher pool. (#1598)
The pool of searchers was removed in 23fe73a6 as part of #1411.
2022-10-06 00:04:11 +09:00
Bruce Mitchener
c694bc039a Fix missing doc warnings when enabling feature "quickwit". 2022-10-05 20:17:10 +07:00
PSeitz
2063f1717f Merge pull request #1591 from quickwit-oss/ff_refact
disable linear codec for multivalue values
2022-10-05 19:39:36 +08:00
Pascal Seitz
d742275048 renames 2022-10-05 19:16:49 +08:00
PSeitz
b9f06bc287 Update src/fastfield/multivalued/mod.rs
Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-10-05 19:09:19 +08:00
Pascal Seitz
8b42c4c126 disable linear codec for multivalue value index
don't materialize index column on merge
use simpler chain() variant
2022-10-05 19:09:17 +08:00
PSeitz
7905965800 Merge pull request #1594 from quickwit-oss/flat_map_with_buffer
Removing alloc on all .next() in MultiValueColumn
2022-10-05 18:34:15 +08:00
Pascal Seitz
f60a551890 add flat_map_with_buffer to Iterator trait 2022-10-05 17:44:26 +08:00
Paul Masurel
7baa6e3ec5 Removing alloc on all .next() in MultiValueColumn 2022-10-05 17:12:06 +09:00
PSeitz
2100ec5d26 Merge pull request #1593 from waywardmonkeys/doc-improvements
Documentation improvements.
2022-10-05 15:50:08 +08:00
Bruce Mitchener
b3bf9a5716 Documentation improvements. 2022-10-05 14:18:10 +07:00
Paul Masurel
0dc8c458e0 Flaky unit test. (#1592) 2022-10-05 16:15:48 +09:00
Nigel Andrews
e5043d78d2 added a couple of tests + make fmt 2022-10-04 12:52:44 +02:00
Nigel Andrews
6d0bb82bd2 Fix issue 1576: serialize bytes as base64 strings 2022-10-04 12:18:13 +02:00
trinity-1686a
5945dbf0bd change format for store to make it faster with small documents (#1569)
* use new format for docstore blocks

* move index to end of block

it makes writing the block faster due to one less memcopy
2022-10-04 09:58:55 +02:00
PSeitz
4cf911d56a Merge pull request #1587 from quickwit-oss/no_get_val_in_serialize
remove get_val in serialization
2022-10-04 12:56:48 +08:00
Pascal Seitz
0f5cff762f move enumerate and remove computation 2022-10-04 12:30:19 +08:00
Pascal Seitz
6d9a123cf2 remove get_val in serialization
remove get_val in serialization and mark as unimplemented!()
replace get_val with iter in linear codec
remove MultivalueStartIndexRandomSeeker
replace MultivalueStartIndexIter with closure
Sample 100 values in linear codec
2022-10-04 12:01:25 +08:00
PSeitz
0f4a47816a Merge pull request #1582 from quickwit-oss/faster_sorted_field_values
use groupby instead of vec allocation
2022-10-04 09:36:24 +08:00
Pascal Seitz
b062ab2196 use groupby instead of vec allocation 2022-10-04 09:26:26 +08:00
Bruce Mitchener
a9d2f3db23 Tantivy requires Rust 1.62 or later. (#1583)
Tantivy needs the `total_cmp` feature to compile, which was stabilized
in Rust 1.62.
2022-10-03 18:31:07 +09:00
Bruce Mitchener
44e03791f9 Fix warnings when doc'ing private items. (#1579)
This also fixes a couple of typos, but plenty remain!
2022-10-03 14:24:00 +09:00
Bruce Mitchener
2d23763e9f Use u64::from boolean more. (#1580)
This case is inverted from the previous cases fixed.

This is from nightly clippy.
2022-10-03 14:17:50 +09:00
Bruce Mitchener
a24ae8d924 clippy: Fix needless-borrow warnings. (#1581)
These show on nightly clippy.
2022-10-03 14:15:09 +09:00
PSeitz
927dff5262 Merge pull request #1578 from quickwit-oss/dead_code
remove dead indexing code
2022-10-03 11:25:10 +08:00
Pascal Seitz
a695edcc95 remove dead indexing code 2022-10-03 09:44:02 +08:00
Paul Masurel
b4b4f3fa73 Removing default features for zstd (#1574) 2022-09-30 13:02:46 +09:00
PSeitz
b50e4b7c20 Merge pull request #1566 from quickwit-oss/fix_docstore_sorting
fix docstore settings for temp docstore
2022-09-30 10:10:36 +08:00
PSeitz
f8686ab1ec improve comments
Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-09-30 10:06:34 +08:00
PSeitz
2fe42719d8 Merge pull request #1570 from quickwit-oss/no_sort_on_multi
validate index settings on create
2022-09-30 09:17:03 +08:00
PSeitz
fadd784a25 log improvements (#1564) 2022-09-30 09:39:26 +09:00
Pascal Seitz
0e94213af0 validate index settings on create 2022-09-29 18:58:09 +08:00
PSeitz
0da2a2e70d Merge pull request #1567 from quickwit-oss/dependabot/cargo/tantivy-fst-0.4.0
Update tantivy-fst requirement from 0.3.0 to 0.4.0
2022-09-29 10:00:16 +08:00
dependabot[bot]
0bcdf3cbbf Update tantivy-fst requirement from 0.3.0 to 0.4.0
Updates the requirements on [tantivy-fst](https://github.com/tantivy-search/fst) to permit the latest version.
- [Release notes](https://github.com/tantivy-search/fst/releases)
- [Commits](https://github.com/tantivy-search/fst/commits)

---
updated-dependencies:
- dependency-name: tantivy-fst
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2022-09-28 20:50:43 +00:00
Pascal Seitz
8f647b817f fix docstore settings for temp docstore
fixes #1565
2022-09-28 17:53:59 +08:00
trinity-1686a
a86b0df6f4 Add query matching terms in a set (#1539) 2022-09-28 09:43:18 +02:00
Bruce Mitchener
f842da758c Move ArcBytes,WeakArcBytes to mmap_directory. (#1555)
When building without default features (so without mmap, etc),
there are some warnings about unused things. This fixes the
ones related to `ArcBytes` and `WeakArcBytes`, which are only
used with the `mmap_directory` code.
2022-09-27 09:57:28 +09:00
Bruce Mitchener
97ccd6d712 Avoid slicing a string in DocParsingError. (#1559)
Fixes #1339.
2022-09-26 20:27:15 +09:00
Bruce Mitchener
cb252a42af docs: "associated to" -> "associated with" (#1557)
This reads better this way.
2022-09-26 20:23:37 +09:00
Bruce Mitchener
d9609dd6b6 POLLING_INTERVAL needn't be pub. (#1556)
This is only used within the file watcher and is const, so it
can't be configured.
2022-09-26 20:22:55 +09:00
Bruce Mitchener
f03667d967 Remove references to /cpp directory. (#1560)
This was removed in 2018, so these should be fine to remove now.
2022-09-26 20:22:28 +09:00
PSeitz
10f10a322f Merge pull request #1554 from quickwit-oss/prepare_ip_field
prepare for ip field
2022-09-26 16:34:24 +08:00
Pascal Seitz
f757471077 prepare for ip field 2022-09-26 16:27:35 +08:00
PSeitz
21e0adefda use binary search instead of linear for get_val in merge (#1548)
* use binary search instead of linear for get_val in merge

* use partition_point
2022-09-26 09:42:33 +09:00
Bruce Mitchener
ea8e6d7b1d Tidy up clippy config. (#1547)
* Checking cfg_attr is no longer necessary.
* Don't need multiple `clippy::` prefixes on a name.
2022-09-26 09:37:55 +09:00
PSeitz
dac7da780e Merge pull request #1545 from waywardmonkeys/remove-some-refs
clippy: Remove borrows that the compiler will do.
2022-09-23 15:33:23 +08:00
PSeitz
20c87903b2 fix multivalue ff index creation regression (#1543)
fixes multivalue ff regression by avoiding using `get_val`. Line::train calls repeatedly get_val, but get_val implementation on Column for multivalues is very slow. The fix is to use the iterator instead. Longterm fix should be to remove get_val access in serialization.

Old Code

test fastfield::bench::bench_multi_value_ff_merge_few_segments                                                           ... bench:  46,103,960 ns/iter (+/- 2,066,083)
test fastfield::bench::bench_multi_value_ff_merge_many_segments                                                          ... bench:  83,073,036 ns/iter (+/- 4,373,615)
est fastfield::bench::bench_multi_value_ff_merge_many_segments_log_merge                                                ... bench:  64,178,576 ns/iter (+/- 1,466,700)

Current

running 3 tests
test fastfield::multivalued::bench::bench_multi_value_ff_merge_few_segments                                              ... bench:  57,379,523 ns/iter (+/- 3,220,787)
test fastfield::multivalued::bench::bench_multi_value_ff_merge_many_segments                                             ... bench:  90,831,688 ns/iter (+/- 1,445,486)
test fastfield::multivalued::bench::bench_multi_value_ff_merge_many_segments_log_merge                                   ... bench: 158,313,264 ns/iter (+/- 28,823,250)

With Fix

running 3 tests
test fastfield::multivalued::bench::bench_multi_value_ff_merge_few_segments                                              ... bench:  57,635,671 ns/iter (+/- 2,707,361)
test fastfield::multivalued::bench::bench_multi_value_ff_merge_many_segments                                             ... bench:  91,468,712 ns/iter (+/- 11,393,581)
test fastfield::multivalued::bench::bench_multi_value_ff_merge_many_segments_log_merge                                   ... bench:  73,909,138 ns/iter (+/- 15,846,097)
2022-09-23 15:36:29 +09:00
PSeitz
f9c3947803 Merge pull request #1546 from waywardmonkeys/use-ux-from-bool
Use u8::from(bool), u64::from(bool).
2022-09-23 09:06:24 +08:00
Bruce Mitchener
e9a384bb15 Use u8::from(bool), u64::from(bool). 2022-09-22 22:44:53 +07:00
Bruce Mitchener
d231671fe2 clippy: Remove borrows that the compiler will do.
This started showing up with clippy in rust 1.64.
2022-09-22 22:38:23 +07:00
trinity-1686a
fa3d786a2f Add support for deleting all documents matching query (#1535)
* add support for deleting all documents matching query

#1494
2022-09-22 21:26:09 +09:00
Paul Masurel
75aafeeb9b Added a function to deep clone RamDirectory. (#1544) 2022-09-22 12:04:02 +02:00
PSeitz
6f066c7f65 Merge pull request #1541 from quickwit-oss/add_bench
add benchmarks for multivalued fastfield merge
2022-09-22 15:28:00 +08:00
Pascal Seitz
22e56aaee3 add benchmarks for multivalued fastfield merge 2022-09-22 11:25:41 +08:00
Paul Masurel
d641979127 Minor refactor of fast fields (#1538) 2022-09-21 12:55:03 +09:00
Paul Masurel
1998111521 Minor refactoring fast fields (#1537) 2022-09-21 12:46:11 +09:00
PSeitz
acb2e2e282 Merge pull request #1532 from quickwit-oss/refactor_ff
remove fast_field_cardinality from FastValue
2022-09-21 04:00:35 +02:00
Pascal Seitz
1ff5da5eb4 remove fast_field_cardinality from FastValue
unused and at the wrong placed
2022-09-21 09:38:46 +08:00
Bruce Mitchener
c3b25710ad doc: Improve directory::Lock docs. (#1534)
Update the docs to reflect the lack of LockParams, correct an error,
and improve cross-linking.
2022-09-20 18:03:35 +09:00
PSeitz
8492010d43 Merge pull request #1531 from waywardmonkeys/improve-docs-more
Improvements to doc linking, grammar, etc.
2022-09-20 15:37:07 +08:00
Bruce Mitchener
cf02e32578 Improvements to doc linking, grammar, etc. 2022-09-19 18:10:22 +07:00
PSeitz
8cca1014c9 Merge pull request #1527 from waywardmonkeys/remove-stream_field-reference
docs: Remove mentions of stream_field method.
2022-09-19 17:16:46 +08:00
PSeitz
938f884e32 Merge pull request #1525 from waywardmonkeys/fix-etsy-logo-alt-text-readme
README: Fix Etsy logo and alt text.
2022-09-19 16:55:08 +08:00
PSeitz
ed68afb698 Merge pull request #1528 from quickwit-oss/ff_refact
fix benches
2022-09-19 11:37:08 +08:00
PSeitz
8a7962dc22 Merge pull request #1524 from waywardmonkeys/improve-docs-1
Documentation improvements.
2022-09-19 11:15:42 +08:00
Pascal Seitz
a06039dea8 fix benches
move some benches to lib.rs to test unexported items
2022-09-19 11:07:20 +08:00
Bruce Mitchener
68b6254b09 docs: Remove mentions of stream_field method.
This method doesn't exist, so no need to mention it.
2022-09-18 23:13:41 +07:00
Bruce Mitchener
6a88ac3fe3 Documentation improvements.
Fix some linking, some grammar, some typos, etc.
2022-09-18 18:05:37 +07:00
Bruce Mitchener
191b934650 README: Fix Etsy logo and alt text. 2022-09-18 15:02:35 +07:00
PSeitz
1a2ba7025a Merge pull request #1513 from quickwit-oss/ip_codec
add ip codec
2022-09-16 18:53:08 +08:00
Pascal Seitz
02599ebeb7 remove ip_to_u128 2022-09-16 18:16:16 +08:00
Pascal Seitz
a16b466460 merge ColumnExt with Column trait 2022-09-16 18:15:18 +08:00
Pascal Seitz
b8d8fdeb6e move benches, improve bench data 2022-09-16 16:42:23 +08:00
Pascal Seitz
12856d80fa change bench, update numbers 2022-09-16 16:41:01 +08:00
Pascal Seitz
e75472ec9a add serialize_u128, open_u128, refactor 2022-09-16 16:40:59 +08:00
Pascal Seitz
e2e6c94ba8 remove ColumnV2 2022-09-16 16:40:06 +08:00
Pascal Seitz
9f610b25af fix benches, add benches 2022-09-16 16:38:48 +08:00
Pascal Seitz
237b64025e take ColumnV2 as parameter
improve algorithm
stricter assertions
improve names
2022-09-16 16:38:48 +08:00
Pascal Seitz
592caeefa0 renames 2022-09-16 16:38:48 +08:00
Pascal Seitz
570009b5b1 move to mod.rs 2022-09-16 16:38:48 +08:00
Pascal Seitz
61b5110db7 use 0 as null in compact space 2022-09-16 16:38:48 +08:00
PSeitz
58af1235e4 Apply suggestions from code review
Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-09-16 16:38:48 +08:00
Pascal Seitz
d3e7c41a1f refactor to range_mapping 2022-09-16 16:38:48 +08:00
Pascal Seitz
11275854ca unroll get range iteration 2022-09-16 16:38:48 +08:00
Pascal Seitz
3ca48cd826 fix test 2022-09-16 16:38:48 +08:00
Pascal Seitz
47dc511733 add inline 2022-09-16 16:38:48 +08:00
Pascal Seitz
cae6b28a8f remove num_vals param 2022-09-16 16:38:48 +08:00
Pascal Seitz
9aa9efe2a4 fix bench 2022-09-16 16:38:48 +08:00
Pascal Seitz
57570b38a2 use vint, forward errors, removed unused var 2022-09-16 16:38:48 +08:00
Pascal Seitz
584394db1e fix Cargo.toml 2022-09-16 16:38:48 +08:00
Pascal Seitz
3aeb026970 fix blank_size, add comments 2022-09-16 16:38:48 +08:00
Pascal Seitz
df32ee2df2 refactor, use BTreeSet for sorted deduped values 2022-09-16 16:38:48 +08:00
Pascal Seitz
762e662bfd extend proptest for get_range 2022-09-16 16:38:48 +08:00
Pascal Seitz
63b2420058 fix get_range
change blank handling
optimize blank collection
fix off by one errors
extend tests
fix get_range
dedupe values to save space
add bench
2022-09-16 16:38:47 +08:00
Pascal Seitz
ced21b8791 move tests 2022-09-16 16:38:02 +08:00
Pascal Seitz
bc85947105 add ip codec 2022-09-16 16:38:01 +08:00
Paul Masurel
64f08a1a5c Hiding useless symbols and removing code. (#1522) 2022-09-16 14:42:27 +09:00
Paul Masurel
e029fdfca7 Perf fix on the MonotonicMapping column (#1519)
The Monotonic mapping was using the default implementation
for `get_range` and `.iter`.

As a result, some of the column used in merge (e.g. multivalued
fast fields) were exhibiting a very strong performance regression.
2022-09-15 14:20:43 +09:00
Paul Masurel
817225edfb Allow for a same-thread doc compressor. (#1510)
In addition, it isolates the doc compressor logic,
better reports io::Result.

In the case of the same-thread doc compressor,
the blocks are also not copied.
2022-09-13 15:32:48 +09:00
Shikhar Bhushan
1eab12396d Make Column: Send + Sync (#1518) 2022-09-13 13:31:28 +09:00
dependabot[bot]
8006f63426 Update criterion requirement from 0.3.5 to 0.4.0 (#1517)
Updates the requirements on [criterion](https://github.com/bheisler/criterion.rs) to permit the latest version.
- [Release notes](https://github.com/bheisler/criterion.rs/releases)
- [Changelog](https://github.com/bheisler/criterion.rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/bheisler/criterion.rs/compare/0.3.5...0.4.0)

---
updated-dependencies:
- dependency-name: criterion
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-09-13 10:02:12 +09:00
Adam Reichold
0a907d0319 Move pretty_assertions from dependencies to dev-dependencies to reduce dependency closure of downstream projects. (#1515) 2022-09-10 18:01:26 +09:00
PSeitz
45924711fd improve docs (#1514)
fix link alias after https://github.com/rust-lang/rustfmt/pull/5262 has been merged and released.
fix dead links
2022-09-08 22:33:59 +09:00
PSeitz
14cb817a52 Merge pull request #1509 from quickwit-oss/refact-fast-field
refactor, fix api
2022-09-07 22:04:32 -07:00
Pascal Seitz
edd9155b88 return Write, add documentation 2022-09-08 12:41:55 +08:00
PSeitz
9497794d40 fix positions docs (#1511) 2022-09-08 10:24:00 +09:00
Pascal Seitz
29d56111de refactor, fix api
refactor
fix clippy
fix docs
remove unused code
fix bytesfield index api flaw
2022-09-07 18:43:04 +08:00
Paul Masurel
4d634d61ff Expose memory usage in SingleSegmentIndexWriter (#1508) 2022-09-07 18:33:52 +09:00
PSeitz
1f3d8ca7e2 Merge pull request #1507 from quickwit-oss/improve_test
add check to proptest
2022-09-07 02:30:29 -07:00
PSeitz
54696da771 Merge pull request #1505 from quickwit-oss/refact-fast-field
Refact fast field
2022-09-07 02:07:42 -07:00
Pascal Seitz
21c2205de9 add check to proptest 2022-09-07 16:58:07 +08:00
PSeitz
9436049d85 Merge pull request #1506 from quickwit-oss/multifastfieldbench
add benchmark for multivalue fast field
2022-09-07 01:36:16 -07:00
Pascal Seitz
21c9a26182 add ff creation benchmark 2022-09-07 15:43:50 +08:00
Pascal Seitz
56c68f5869 add ff creation benchmark 2022-09-07 14:03:24 +08:00
Pascal Seitz
f5e66042d8 no alloc in loop 2022-09-07 12:42:16 +08:00
Pascal Seitz
bf3327acd3 add benchmark for multivalue fast field 2022-09-06 16:55:30 +08:00
PSeitz
2a6479b66d Merge pull request #1427 from quickwit-oss/empty_segments_crash
handle empty segments for merge
2022-09-05 22:59:06 -07:00
Pascal Seitz
9c2ef81198 fix clippy 2022-09-06 13:34:36 +08:00
Paul Masurel
c5d30a54bc CR 2022-09-06 00:16:41 +09:00
Paul Masurel
c632fc014e Refactoring fast fields codecs.
This removes the GCD part as a codec, and
makes it so that fastfield codecs all share
the same normalization part (shift + gcd).
2022-09-05 23:07:12 +09:00
Pascal Seitz
085e63ae43 return new segment meta 2022-09-05 15:19:01 +08:00
Pascal Seitz
f6f23ba684 optionally create segment on merge
create a new segment only if it contains data

fixes #1189
2022-09-05 15:07:03 +08:00
Paul Masurel
ea72cf34d6 Int based linear interpol (#1482)
* Rename BlockwiseLinear to BlockwiseLinearLegacy

Reimplements the blockwise multilinear codec using integer arithmetics.
Added comments

* add estimate for blockwise

* Added one unit test

* use int based for linear interpol

* fix merge conflicts

* reuse code

* cargo fmt

* fix clippy

* fix test

* fix off by one

fix off by one to accurately interpolate autoincrement fields

* extend test, fix estimate

* remove legacy codec

Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>
2022-09-05 15:53:00 +09:00
PSeitz
00657d9e99 Merge pull request #1504 from quickwit-oss/move-to-fastfield-codec
Move to fastfield codec
2022-09-03 05:18:35 -07:00
Paul Masurel
26876d41d7 Moving the serialization logic to the fastfield_codecs crate. 2022-09-03 00:29:52 +09:00
Paul Masurel
8e775b6c3d Refactoring dyn Column (#1502) 2022-09-02 17:26:30 +09:00
Maxim Kraynyuchenko
e1f9af4384 Added Etsy logo to readme (#1503) 2022-09-02 15:27:59 +09:00
Paul Masurel
4e350c5f1b Clippy 2022-09-02 13:05:00 +09:00
Paul Masurel
84e0c75598 Bench fixing 2022-09-02 11:15:44 +09:00
Paul Masurel
08c4412d73 Adding dragon API to build index without any thread. (#1496)
Closes #1487
2022-09-01 10:32:36 +09:00
Shikhar Bhushan
70e58adff9 OwnedBytes doc clarification (#1498)
It only exposes it with the same lifetime as `&self`, which is what keeps things safe
2022-09-01 10:32:17 +09:00
PSeitz
0d1cd119e9 Merge pull request #1497 from quickwit-oss/improve_proptest
custom num strategy, faster test
2022-08-31 06:25:25 -07:00
Pascal Seitz
d3dd620048 fix clippy 2022-08-31 13:13:56 +02:00
Pascal Seitz
e89c220b56 custom num strategy, faster test
closes #1486
faster test with rand values
2022-08-31 12:08:44 +02:00
Paul Masurel
a451f6d60d Minor refactoring. (#1495) 2022-08-31 12:00:58 +09:00
PSeitz
f740ddeee3 Merge pull request #1493 from quickwit-oss/remove_vec_impl
remove Column impl on Vec
2022-08-29 07:54:33 -07:00
Pascal Seitz
7a26cc9022 add VecColumn 2022-08-29 15:49:43 +02:00
Pascal Seitz
54972caa7c remove Column impl on Vec
remove Column impl on Vec to avoid function shadowing
2022-08-29 11:57:41 +02:00
PSeitz
5d436759b0 Merge pull request #1480 from quickwit-oss/overflow_issue
fix overflow issue in interpolation
2022-08-28 16:44:00 -07:00
PSeitz
6f563b1606 Merge pull request #1491 from quickwit-oss/col-trait-refact
Introducing a column trait
2022-08-28 10:05:25 -07:00
Pascal Seitz
095fb68fda fix doc test 2022-08-28 18:30:39 +02:00
Pascal Seitz
6316eaefc6 fix benches 2022-08-28 14:38:30 +02:00
Paul Masurel
5331be800b Introducing a column trait 2022-08-28 14:14:27 +02:00
Paul Masurel
c73b425bc1 Fixing unit tests 2022-08-27 23:20:57 +02:00
Paul Masurel
54cfd0d154 Removing Deserializer trait (#1489)
Removing Deserializer trait and renaming the `Serializer` trait `FastFieldCodec`.
Small refactoring estimate.
2022-08-28 04:54:55 +09:00
PSeitz
0dd62169c8 merge FastFieldCodecReader wit FastFieldDataAccess (#1485)
* num_vals to FastFieldCodecReader

* split open_from_bytes to own trait

* rename get_u64 to ge_val

* merge traits
2022-08-28 03:58:28 +09:00
Paul Masurel
3a9727aa91 Pleasing Clippy 2022-08-27 11:33:03 +02:00
UEDA Akira
17093e8ffe Collapse overlapped highlighted ranges (#1473) 2022-08-26 14:37:08 +09:00
Paul Masurel
03e4630cd8 Mark the CI as successful regardless of whether uploading to Coverall fails. 2022-08-26 07:35:29 +02:00
Paul Masurel
4ae0317d68 Cargo fmt 2022-08-26 00:50:07 +02:00
Paul Masurel
107b19855f Fixing the fastfield codec benchmark (#1484) 2022-08-26 05:54:14 +09:00
Paul Masurel
d8f66ba07e Rename fastfield codecs (#1483) 2022-08-26 01:19:30 +09:00
Paul Masurel
f908549245 Argument missing in bench 2022-08-25 15:42:59 +02:00
Paul Masurel
3673a5df9b Homogeneous codec names. (#1481) 2022-08-25 05:51:37 +09:00
Pascal Seitz
3984cafccc fix overflow issue in interpolation
use saturating_sub and saturating_add to cover edge cases with values close to u64::MAX or 0 in combination with imprecise computation
2022-08-24 20:08:13 +02:00
Paul Masurel
298b5dd726 GCD wrapper uses DividerU64 (#1478) 2022-08-25 02:29:13 +09:00
Paul Masurel
8bbb22e9bf Minor refactoring. Introducing a codec type enum. (#1477) 2022-08-25 02:21:41 +09:00
PSeitz
513f68209d Merge pull request #1476 from quickwit-oss/fix_interpol
add proptest to ff codecs
2022-08-24 08:01:36 -07:00
Pascal Seitz
91f2f7e722 add proptest to ff codecs 2022-08-24 16:42:40 +02:00
PSeitz
c476b530cf Merge pull request #1432 from quickwit-oss/gcd_encoding
add gcd test for DateTime
2022-08-24 06:50:34 -07:00
PSeitz
77dd202e19 Merge pull request #1475 from quickwit-oss/extend_ff_access
move fastfield stats to trait
2022-08-24 06:44:57 -07:00
Pascal Seitz
00ebff3c16 move fastfield stats to trait 2022-08-24 15:29:55 +02:00
Paul Masurel
9a6d37c42c Apply suggestions from code review 2022-08-24 21:20:17 +09:00
PSeitz
bb01e99e05 Fixes race condition in Searcher (#1464)
Fixes a race condition in Searcher, by avoiding repeated calls to open_segment_readers and passing them instead as argument

Closes #1461
2022-08-24 21:17:37 +09:00
PSeitz
535f1a5d83 Merge pull request #1471 from adamreichold/ci-no-nightly-no-cry
Split test into check and test CI jobs
2022-08-24 04:41:42 -07:00
Pascal Seitz
625f9174a7 check for size 2022-08-24 10:32:45 +02:00
Adam Reichold
11a4d97cf5 Use a job matrix to further split and deduplicate the test CI job. 2022-08-24 10:27:57 +02:00
Adam Reichold
1c3d39677a Split checking and testing to a bit more parallelism in the CI. 2022-08-24 10:27:57 +02:00
Pascal Seitz
6f65995cfd remove gcd from api 2022-08-24 10:24:09 +02:00
Pascal Seitz
e2e4190571 add gcd test for DateTime 2022-08-24 10:24:09 +02:00
PSeitz
82209c58aa reuse get_calculated_value (#1472) 2022-08-24 17:16:25 +09:00
Paul Masurel
21519788ea Build fix (#1470) 2022-08-24 07:16:38 +09:00
Shikhar Bhushan
4c6c6e4a9c ConstScoreQuery (#1463) 2022-08-24 06:37:34 +09:00
Adam Reichold
df0ac9e901 Extend facet deserialization to handle owned in addition to borrowed strings. (#1466) 2022-08-24 06:37:13 +09:00
Adam Reichold
71ab482720 RFC: Use a more general but still object-safe signature for Query::query_terms. (#1468)
* Use a more general but still object-safe signature for Query::query_terms.

* Further constraint the generalized Query::query_terms signature to allow extracting references to terms.
2022-08-24 06:34:07 +09:00
Adam Reichold
2ae383e452 Cache dependencies in CI to speed up build times. (#1469)
* Cache dependencies in CI to speed up build times.

* Give cargo-nextest a try.
2022-08-24 06:27:29 +09:00
PSeitz
8b3a6f6231 Merge pull request #1439 from quickwit-oss/fix_value_range
fix get calculated value
2022-08-23 10:15:13 -07:00
PSeitz
11edd6bd59 fix for api change (#1467) 2022-08-24 01:10:12 +09:00
Pascal Seitz
193a3c21f4 fix neg slope calculated value 2022-08-23 13:42:09 +02:00
PSeitz
998b1263f6 Merge pull request #1460 from quickwit-oss/merge_ff_access_iterator
move iter to FastFieldDataAccess
2022-08-23 02:58:10 -07:00
Pascal Seitz
72272bdf81 fix variable name 2022-08-23 11:38:27 +02:00
Pascal Seitz
c39c2d79da move iter to FastFieldDataAccess 2022-08-23 11:26:47 +02:00
Paul Masurel
67d94f5bd2 Getting rid of the gcd dependency and using NonZeroU64 in gcd. (#1459) 2022-08-23 07:25:26 +09:00
Paul Masurel
abbd934ac9 Embeds OwnedBytes into the FastFieldCodecReader. (#1458) 2022-08-23 00:02:31 +09:00
Paul Masurel
7f9ba0ee50 Minor readability refactoring in the SegmentDocIdMapping (#1451) 2022-08-22 22:44:36 +09:00
PSeitz
8edcd6f958 Merge pull request #1428 from izihawa/feature/dismax
[feat] Implement `DisjunctionMaxQuery` and refactor `ScoreCombiner`
2022-08-22 06:15:30 -07:00
Pasha Podolsky
f50700835d [fix] Fn -> FnOnce 2022-08-22 15:57:30 +03:00
PSeitz
494e92ca59 fix issue in composite (#1456)
The file offsets were recorded incorrectly in some cases, e.g. when the recording looked like this [(Field 1, Index 0, Offset 0), (Field 1, Index 1, Offset 14), (Field 0, Index 0, Offset 14)]. The last file is offset 14 to end of file for field 0. But the data was converted to a vec and sorted, which changes the last file to Field 1.
2022-08-22 17:52:12 +09:00
Paul Masurel
4a3169011d clippy (#1452) 2022-08-20 20:01:33 +09:00
Pascal Seitz
050fc5dde9 add comment for diff dance 2022-08-20 08:56:03 +02:00
Paul Masurel
ce45889add Minor codestyle change is prefix of (#1450)
* Minor code stlye change in the Facet::is_prefix_of.

* bugfix
2022-08-19 21:20:33 +09:00
dependabot[bot]
4875174d16 Update prettytable-rs requirement from 0.8.0 to 0.9.0 (#1446)
Updates the requirements on [prettytable-rs](https://github.com/phsym/prettytable-rs) to permit the latest version.
- [Release notes](https://github.com/phsym/prettytable-rs/releases)
- [Commits](https://github.com/phsym/prettytable-rs/compare/v0.8.0...v0.9.0)

---
updated-dependencies:
- dependency-name: prettytable-rs
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-08-19 18:09:59 +09:00
Kanji Yomoda
0c634c5bc6 Add missing seek to RequiredOptionalScorer (#1442) 2022-08-19 18:08:52 +09:00
Paul Masurel
e25ab5d537 Minor code stlye change in the Facet::is_prefix_of. (#1449) 2022-08-19 18:05:11 +09:00
Adam Reichold
27400c9ad3 Check for the special case of the root facet as prefix of other facets. (#1448) 2022-08-19 17:45:14 +09:00
PSeitz
19074e1d5e Merge pull request #1445 from kianmeng/fix-typos-and-markdowns
Fix typos and markdowns
2022-08-18 00:03:37 -07:00
Kian-Meng Ang
014b1adc3e cargo +nightly fmt 2022-08-17 22:33:44 +08:00
Kian-Meng Ang
84295d5b35 cargo fmt 2022-08-15 21:07:01 +08:00
Kian-Meng Ang
625bcb4877 Fix typos and markdowns
Found via these commands:

    codespell -L crate,ser,panting,beauti,hart,ue,atleast,childs,ond,pris,hel,mot
    markdownlint *.md doc/src/*.md --disable MD013 MD025 MD033 MD001 MD024 MD036 MD041 MD003
2022-08-13 18:25:47 +08:00
Pascal Seitz
f01cb7d3aa remove cast 2022-08-12 19:50:06 +02:00
PSeitz
8e773ade77 Merge pull request #1444 from quickwit-oss/add-async-doc-freq
Support for SnippetGenerator in async context
2022-08-12 05:46:13 -07:00
Evance Soumaoro
fad3faefe2 added InvertedIndexReader::doc_freq_async and SnippetGenerator::new methods 2022-08-12 06:39:10 +00:00
Pascal Seitz
9811d15657 improve slope calculation by delaying f64 cast 2022-08-11 13:32:10 +02:00
Pascal Seitz
31ba5a3c16 fix get calculated value
fix get calculated value by delaying cast
2022-08-11 09:44:20 +02:00
PSeitz
f4d7621370 Merge pull request #1436 from boraarslan/bora--warmup-fieldnorms
Expose inner file slice for fieldnorms
2022-08-09 02:45:45 -07:00
boraarslan
d4b2b7de8b Expose inner file slice 2022-08-04 18:13:17 +03:00
PSeitz
d5ee4edf25 Merge pull request #1426 from k-yomo/support-custom-key-in-range-aggregation
Add support for custom key param for range aggregation
2022-08-03 04:31:02 -07:00
PSeitz
fcc7bd7024 Merge pull request #1418 from quickwit-oss/gcd_encoding
apply gcd on fastfield as preprocessing
2022-07-29 02:00:14 -07:00
Pascal Seitz
ce8d6b259a early return 2022-07-29 10:05:30 +02:00
k-yomo
099e626156 Refactor InternalRangeAggregationRange initialization with From trait 2022-07-29 05:41:29 +09:00
Pasha Podolsky
71041b2314 [fix] Fix bench 2022-07-28 21:36:28 +03:00
Pasha Podolsky
09aae134e6 [feat] Implement DisjunctionMaxQuery and refactor ScoreCombiner 2022-07-28 20:47:20 +03:00
Pascal Seitz
6a9d09cf7a handle gcd like a composable codec 2022-07-28 09:54:35 +02:00
k-yomo
704d0a8d8b Refactor range aggregation tests 2022-07-28 06:31:25 +09:00
k-yomo
195309a557 Add support for custom key param for range aggregation 2022-07-28 06:21:39 +09:00
PSeitz
da0f78e06c Merge pull request #1424 from k-yomo/support-keyed-parameter-in-aggregation
Add support for keyed parameter in range and histgram aggregations
2022-07-27 06:22:29 -07:00
k-yomo
9b6b60cc2b Remove unnecessary keyed parameter setting 2022-07-27 18:43:52 +09:00
k-yomo
6444516a82 User serde default for the keyed params 2022-07-27 01:12:56 +09:00
k-yomo
a9b0d1a0ab Fix aggreagtion examples 2022-07-26 18:54:27 +09:00
k-yomo
2b333ca635 Fix keyed param type in the comment 2022-07-26 18:35:01 +09:00
k-yomo
80a1418284 Use FnvHashMap for keyed bucket entries 2022-07-26 18:24:54 +09:00
k-yomo
5ab5f070ed Fix to use bool directory for the keyed parameter 2022-07-26 18:18:38 +09:00
k-yomo
d122f2c74e Add tests for keyed buckets 2022-07-26 04:28:21 +09:00
k-yomo
5b564916f0 Add support for keyed parameter in range and histgram aggregations 2022-07-26 04:28:21 +09:00
Pascal Seitz
06fd8684b7 use filter to filter zero 2022-07-25 10:26:35 +02:00
Kanji Yomoda
931bab8010 Fix failing nanosec truncation check on mac OS (#1423) 2022-07-25 09:32:15 +09:00
Pascal Seitz
8dac30e6d1 fix benchmark 2022-07-22 17:44:06 +02:00
Pascal Seitz
2e0a7d072f use single pass for gcd 2022-07-22 16:04:32 +02:00
Kanji Yomoda
af84e74284 Replace deprecated std package's constants on floats and integers (#1420) 2022-07-22 08:05:08 +09:00
Pascal Seitz
fff1a03842 replace generic with impl T 2022-07-21 14:26:45 +02:00
Pascal Seitz
90e296f2d0 fix var name 2022-07-21 14:26:45 +02:00
PSeitz
5f966d747b Apply suggestions from code review
Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-07-21 14:25:35 +02:00
PSeitz
d24f31f965 Merge pull request #1419 from quickwit-oss/expose-final-bucket-result
Re(Expose) IntermediateAggregationResults method
2022-07-21 04:40:23 -07:00
Evance Soumaoro
f26b686a1c expose IntermediateAggregationResults->into_final_bucket_result 2022-07-21 11:19:23 +00:00
Pier-Olivier Thibault
775e936f7d FileHandle: Change from boxed to Arc. (#1415)
* FileHandle: Change from boxed to Arc.

Changing from a Box<dyn FileHandle> to an Arc<dyn FileHandle> would
allow for a user of tantivy to manage file handles outside of tantivy
and be able to manage their life cycle.

* Fix: Rust linter
2022-07-21 16:19:18 +09:00
Pascal Seitz
7e032a9efd apply gcd on fastfield as preprocessing 2022-07-20 16:19:47 +02:00
PSeitz
23fe73a6c0 remove searcher pool and make Searcher cloneable (#1411)
* remove searcher pool and make Searcher cloneable

closes #1410

* use SearcherInner in InnerIndexReader
2022-07-12 18:07:48 +09:00
Evance Soumaoro
a4be239d38 Updated DateTime to hold timestamp in microseconds, while making date field precision configurable (#1396) 2022-07-12 10:04:28 +09:00
PSeitz
2406d9278b allow set doc store cache size on IndexReaderBuilder (#1407) 2022-07-06 14:40:35 +09:00
PSeitz
6c2d9737f1 Merge pull request #1405 from quickwit-oss/fix_action
fix workflow action
2022-07-04 23:05:28 -07:00
PSeitz
a5688572a5 Merge pull request #1406 from quickwit-oss/edition_2021
edition 2021 for subcrates
2022-07-04 19:42:24 -07:00
Pascal Seitz
431b5a091e remove test trigger 2022-07-05 10:32:33 +08:00
PSeitz
2c17271cd9 Merge pull request #1403 from quickwit-oss/docstore_cache_size
expose doc store cache size
2022-07-04 19:28:51 -07:00
Pascal Seitz
5750224d4c set docstore cache size at construction 2022-07-04 14:27:55 +08:00
Pascal Seitz
02691f2445 edition 2021 for subcrates 2022-07-04 14:19:32 +08:00
Pascal Seitz
e31e78f39f fix workflow action 2022-07-04 14:04:49 +08:00
Pascal Seitz
9db2f0e82b expose doc store cache size
expose lru doc store cache size
optimize doc store cache size
2022-07-04 13:54:41 +08:00
PSeitz
2ed5cc873d Merge pull request #1404 from quickwit-oss/total_cmp
use total_cmp
2022-07-03 22:51:00 -07:00
Pascal Seitz
d278417300 move build step down 2022-07-04 13:22:04 +08:00
Pascal Seitz
d89a8dd118 set rust version 2022-07-04 13:15:32 +08:00
Pascal Seitz
1bd44a5f61 use total_cmp 2022-07-04 12:48:23 +08:00
Ryan Russell
d750ced813 chore(collector): src/collector readability (#1399)
* chore(collector): `src/collector` readability

Signed-off-by: Ryan Russell <git@ryanrussell.org>

* Update src/collector/tests.rs
2022-07-04 12:12:53 +09:00
dependabot[bot]
fbc469e5df Update pprof requirement from 0.9.0 to 0.10.0 (#1400)
Updates the requirements on [pprof](https://github.com/tikv/pprof-rs) to permit the latest version.
- [Release notes](https://github.com/tikv/pprof-rs/releases)
- [Changelog](https://github.com/tikv/pprof-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/tikv/pprof-rs/compare/v0.9.1...v0.10.0)

---
updated-dependencies:
- dependency-name: pprof
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-07-04 11:29:33 +09:00
PSeitz
c1273670e4 Merge pull request #1402 from PSeitz/cloneable_error
make errors cloneable
2022-06-30 20:09:37 +08:00
Pascal Seitz
7eb267341e make errors cloneable 2022-06-30 19:42:23 +08:00
PSeitz
db1836691e fix visibility (#1398) 2022-06-28 16:21:39 +09:00
Antoine G
437cd350a2 Add support for phrase slop in query language (#1393)
Closes #1390
2022-06-28 13:55:47 +09:00
PSeitz
8024ecf013 Merge pull request #1389 from quickwit-oss/doc_writer_thread
use separate thread to compress block store
2022-06-23 16:17:41 +08:00
PSeitz
9baefbe2ab Update src/store/writer.rs
Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-06-23 15:34:21 +08:00
PSeitz
ad76d11008 Update src/store/writer.rs
Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-06-23 15:34:21 +08:00
PSeitz
c3220bece0 Update src/store/writer.rs
Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-06-23 15:34:21 +08:00
PSeitz
2b713f0977 Update src/store/writer.rs
Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-06-23 15:34:21 +08:00
Pascal Seitz
0bc6b4a117 renames and refactoring 2022-06-23 15:34:21 +08:00
PSeitz
79e42d4a6d Update src/store/writer.rs
Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-06-23 15:34:21 +08:00
PSeitz
0135fbc4c8 Update src/store/writer.rs
Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-06-23 15:34:21 +08:00
PSeitz
449594f67a Update src/store/writer.rs
Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-06-23 15:34:21 +08:00
Pascal Seitz
8b6647e908 move writer to compressor thread 2022-06-23 15:34:21 +08:00
PSeitz
efabcbcdf5 Update src/store/writer.rs
Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-06-23 15:34:21 +08:00
Pascal Seitz
7bf5962554 merge match, explicit type 2022-06-23 15:34:21 +08:00
Pascal Seitz
4c7dedef29 use seperate thread to compress block store
Use seperate thread to compress block store for increased indexing performance. This allows to use slower compressors with higher compression ratio, with less or no perfomance impact (with enough cores).

A seperate thread is spawned to compress the docstore, which handles single blocks and stacking from other docstores.
The spawned compressor thread does not write, instead it sends back the compressed data. This is done in order to avoid writing multithreaded on the same file.
2022-06-23 15:34:21 +08:00
PSeitz
93f356a7a7 Extend FAQ (#1388)
* Extend FAQ

Co-authored-by: Maxim Kraynyuchenko <100854040+maximkpa@users.noreply.github.com>
2022-06-23 11:53:20 +09:00
PSeitz
6ca5f77466 Merge pull request #1363 from quickwit-oss/refactor_aggregation
Add aggregation bucket limit
2022-06-23 10:27:57 +08:00
Paul Masurel
2e2822f89d Apply suggestions from code review 2022-06-23 09:48:28 +09:00
PSeitz
de178a1901 Merge pull request #1395 from PSeitz/fix_clippy
fix clippy
2022-06-21 16:30:59 +08:00
Antoine G
11e4225f23 doc fix (#1391)
Documentation fix.
2022-06-21 15:53:33 +09:00
Paul Masurel
f21b73d1f6 Apply suggestions from code review 2022-06-21 15:52:43 +09:00
Pascal Seitz
1440f3243b fix clippy 2022-06-21 14:47:01 +08:00
Kanji Yomoda
83d0c13fb0 Fix outdated variable naming and comments to alive bitset (#1387)
* Fix outdated variables and comments for alive bitset

* Fix expired link to delete bitset
2022-06-14 15:59:15 +09:00
PSeitz
88054aa333 Merge pull request #1382 from boraarslan/bool-fields
Add boolean fields
2022-06-13 13:20:05 +08:00
boraarslan
635c39ba48 cargo fmt 2022-06-10 19:54:44 +03:00
boraarslan
eab2257637 Change var name 2022-06-10 19:36:25 +03:00
PSeitz
328bd96c24 Merge pull request #1378 from quickwit-oss/test_compression
enable setting compression level
2022-06-10 11:10:07 +08:00
dependabot[bot]
fc24842a43 Update more-asserts requirement from 0.2.1 to 0.3.0 (#1384)
Updates the requirements on [more-asserts](https://github.com/thomcc/rust-more-asserts) to permit the latest version.
- [Release notes](https://github.com/thomcc/rust-more-asserts/releases)
- [Commits](https://github.com/thomcc/rust-more-asserts/compare/v0.2.2...v0.3.0)

---
updated-dependencies:
- dependency-name: more-asserts
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-06-10 10:38:28 +09:00
boraarslan
2d6f1d43ff Add bool and explicit types for merger 2022-06-07 10:10:33 +03:00
boraarslan
ca0973ec78 Fix tests 2022-06-07 10:10:33 +03:00
boraarslan
38ee60d792 Edit Test 2022-06-07 10:10:33 +03:00
boraarslan
f68be28284 Add bool 2022-06-07 10:09:37 +03:00
boraarslan
fc43ab9280 Add tests 2022-06-07 10:09:37 +03:00
boraarslan
38c2ea6a5d Remove unnecessary line 2022-06-07 10:09:37 +03:00
boraarslan
26a0fd1fbe cargo fmt 2022-06-07 10:09:37 +03:00
boraarslan
811b91ecb3 Edit and add tests 2022-06-07 10:09:37 +03:00
boraarslan
25c00ce856 Fix indexing for bool 2022-06-07 10:09:37 +03:00
boraarslan
e5debb97a7 Edit test 2022-06-07 10:09:37 +03:00
boraarslan
bc4cd9ffaa typo fix 2022-06-07 10:09:37 +03:00
boraarslan
9a13d8709b Explicitly write types 2022-06-07 10:09:37 +03:00
boraarslan
e6eadf1a2f Add tests 2022-06-07 10:09:37 +03:00
boraarslan
7cca7e6a47 Fix of last commit 2022-06-07 10:09:37 +03:00
boraarslan
ef2492dba6 Broken commit 2022-06-07 10:09:37 +03:00
boraarslan
2981e6c1df First commit 2022-06-07 10:09:37 +03:00
Ryan Russell
b33b4c0092 Fix various occurrence var names and references (#1385)
Thank you Ryan!

Signed-off-by: Ryan Russell <git@ryanrussell.org>
2022-06-07 11:08:19 +09:00
Pascal Seitz
4d9d2b6db0 split into compressor/decompressor
use custom de/serializer for compressor
accept parameters like zstd(compression_level=5) as compressor
2022-06-02 23:29:24 +08:00
Pascal Seitz
ed868f93a3 enable setting compression level 2022-06-02 16:47:29 +08:00
PSeitz
5e599d96d7 Merge pull request #1372 from quickwit-oss/doc_store_api
refactor doc store
2022-06-02 15:19:57 +08:00
Pascal Seitz
314ae43a45 fix fmt 2022-06-02 14:54:23 +08:00
Pascal Seitz
fce91b2f3a vec without capacity 2022-06-02 13:50:18 +08:00
Pascal Seitz
9bcd2b8104 fix read_block_async 2022-06-02 13:37:52 +08:00
Pascal Seitz
0c9c257150 move cache handling into single function 2022-06-02 13:25:29 +08:00
Pascal Seitz
1af85a2956 accept usize instead &usize 2022-06-02 11:23:36 +08:00
Pascal Seitz
bc4c3d0c6b add peek_lru test 2022-06-02 11:13:17 +08:00
Pascal Seitz
6937c75f05 hide advanced doc store api 2022-06-02 11:13:17 +08:00
Pascal Seitz
e54429e827 expose doc store functions
expose doc store functions for advanced usage
refactor cache
expose cache statistics
remove unnecessary arc
unduplicate code
2022-06-02 11:13:17 +08:00
Ryan Russell
ca836b6414 Improve Docs Readability (#1380)
Signed-off-by: Ryan Russell <git@ryanrussell.org>
2022-06-02 09:32:57 +09:00
Paul Masurel
f0a2b1cc44 Bumped tantivy and subcrate versions. 2022-05-25 22:50:33 +09:00
Paul Masurel
fcfdc44c61 Bumped tantivy-grammar version 2022-05-25 21:52:46 +09:00
Paul Masurel
3171f0b9ba Added ZSTD support in CHANGELOG 2022-05-25 21:51:46 +09:00
PSeitz
89e19f14b5 Merge pull request #1374 from kryesh/main
Add Zstd compression support, Make block size configurable via IndexSettings
2022-05-25 07:39:46 +02:00
PSeitz
1a6a1396cd Merge pull request #1376 from saroh/json-example
Add examples to explain default field handling in the json example
2022-05-24 07:09:37 +02:00
saroh
e766375700 remove useless example 2022-05-23 19:49:31 +02:00
PSeitz
496b4a4fdb Update examples/json_field.rs 2022-05-23 12:24:36 +02:00
PSeitz
93cc8498b3 Update examples/json_field.rs 2022-05-23 11:59:42 +02:00
PSeitz
0aa3d63a9f Update examples/json_field.rs 2022-05-23 11:39:45 +02:00
PSeitz
4e2a053b69 Update examples/json_field.rs 2022-05-23 11:27:05 +02:00
Paul Masurel
71c4393ec4 Clippy 2022-05-23 10:20:37 +09:00
saroh
b2e97e266a more examples to explain default field handling 2022-05-21 17:36:39 +02:00
Antoine G
9ee4772140 Fix deps for unicode regex compiling (#1373)
* lint doc warning

* fix regex build
2022-05-20 10:18:44 +09:00
Kryesh
c95013b11e Add zstd-compression feature to github workflow tests 2022-05-19 22:15:18 +10:00
Pascal Seitz
71f75071d2 cache and return error in aggregations 2022-05-19 16:58:56 +08:00
Pascal Seitz
b114e553cd Revert "return result from segment collector"
This reverts commit a99e5459e3.
2022-05-19 16:57:55 +08:00
Pascal Seitz
17dcc99e43 Revert "introduce optional collect_block in segmentcollector"
This reverts commit c5c2e59b2b.
2022-05-19 16:25:21 +08:00
Pascal Seitz
c5c2e59b2b introduce optional collect_block in segmentcollector
add collect_block in segment_collector to handle groups of documents as performance optimization
add collect_block for MultiCollector
2022-05-19 16:23:25 +08:00
Kryesh
fc045e6bf9 Cleanup imports, remove unneeded error mapping 2022-05-19 10:34:02 +10:00
Kryesh
6837a4d468 Fix bench 2022-05-18 20:35:29 +10:00
Kryesh
0759bf9448 Cleanup zstd structure and serialise to u32 in line with lz4 2022-05-18 20:31:22 +10:00
Kryesh
152e8238d7 Fix silly errors from running tests without feature flag 2022-05-18 19:49:10 +10:00
Kryesh
d4e5b48437 Apply feedback - standardise on u64 and fix correct compression bounds 2022-05-18 19:37:28 +10:00
Kryesh
03040ed81d Add Zstd compression support 2022-05-18 14:04:43 +10:00
Kryesh
aaa22ad225 Make block size configurable to allow for better compression ratios on large documents 2022-05-18 11:13:15 +10:00
Pascal Seitz
44ea7313ca set max bucket size as parameter 2022-05-13 13:21:52 +08:00
Antoine G
3223bdf254 Refactorize PhraseScorer::compute_phrase_match (#1364)
* Refactorize PhraseScorer::compute_phrase_match
* implem optim for slop
2022-05-13 09:57:21 +09:00
Pascal Seitz
11ac451250 abort aggregation when too many buckets are created
Validation happens on different phases depending on the aggregation
Term: During segment collection
Histogram: At the end when converting in intermediate buckets (we preallocate empty buckets for the range) Revisit after #1370
Range: When validating the request

update CHANGELOG
2022-05-12 12:26:43 +08:00
Pascal Seitz
6a4632211a forward error in aggregation collect 2022-05-12 12:26:43 +08:00
Pascal Seitz
a99e5459e3 return result from segment collector 2022-05-12 12:26:43 +08:00
Pascal Seitz
3f88718f38 refactor aggregations 2022-05-12 12:26:43 +08:00
dependabot[bot]
cbd06ab189 Update pprof requirement from 0.8.0 to 0.9.0 (#1365)
Updates the requirements on [pprof](https://github.com/tikv/pprof-rs) to permit the latest version.
- [Release notes](https://github.com/tikv/pprof-rs/releases)
- [Changelog](https://github.com/tikv/pprof-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/tikv/pprof-rs/commits)

---
updated-dependencies:
- dependency-name: pprof
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-05-11 11:42:04 +09:00
Paul Masurel
749395bbb8 Added rustdoc for MultiFruit extract function (#1369) 2022-05-11 11:41:39 +09:00
Paul Masurel
617ba1f0c0 Bugfix in the document deserialization. (#1368)
Deserializing a json field does not expect the
end of the document anymore.

This behavior is well documented in serde_json.
https://docs.serde.rs/serde_json/fn.from_reader.html

Closes #1366
2022-05-11 11:38:10 +09:00
Paul Masurel
2f1cd7e7f0 Bugfix in the document deserialization. (#1367)
Deserializing a json field does not expect the
end of the document anymore.

This behavior is well documented in serde_json.
https://docs.serde.rs/serde_json/fn.from_reader.html

Closes #1366
2022-05-11 11:27:04 +09:00
PSeitz
58c0cb5fc4 Merge pull request #1357 from saroh/1302-json-term-writer-API
Expose helpers to generate json field writer terms
2022-05-10 11:02:05 +08:00
PSeitz
7f45a6ac96 allow setting tokenizer manager on index (#1362)
handle json in tokenizer_for_field
2022-05-09 18:15:45 +09:00
saroh
0ade871126 rename constructor to be more explicit 2022-05-06 13:29:07 +02:00
PSeitz
aab65490c9 Merge pull request #1358 from quickwit-oss/fix_docs
add alias shard_size to split_size for quickwit
2022-05-06 18:41:34 +08:00
Pascal Seitz
d77e8de36a flip alias variable name 2022-05-06 17:52:36 +08:00
Pascal Seitz
d11a8cce26 minor docs fix 2022-05-06 17:52:36 +08:00
Pascal Seitz
bc607a921b add alias shard_size split_size for quickwit
improve some docs
2022-05-06 17:52:36 +08:00
Paul Masurel
1273f33338 Fixed comment. 2022-05-06 18:35:25 +09:00
Paul Masurel
e30449743c Shortens blocks' last_key in the SSTable block index. (#1361)
Right now we store last key in the blocks of the SSTable index.
This PR replaces the last key by a shorter string that is greater or
equal and still lesser than the next key.
This property is sufficiently to ensure the block index
works properly.

Related to quickwit#1366
2022-05-06 16:29:06 +08:00
Paul Masurel
ed26552296 Minor changes in query parsing for quickwit#1334. (#1356)
Quickwit's still heavily relies on generating field names
containing a '.' for nested object, yet allows for
user defined field names to contain a dot.

In order to reuse tantivy query parser, we will end up
using quickwit field names directly into tantivy.
Only '.' will be escaped.

This PR makes minor changes in how tantivy query parser parses
a field name and resolves it to a field.
Some of the new edge case behavior is hacky.

Closes #1355
2022-05-06 13:20:10 +09:00
Saroh
65d129afbd better function names 2022-05-05 10:12:28 +02:00
Antoine G
386ffab76c Fix documentation regression (#1359)
This breaks the doc on doc.rs as the type seems to shadow the struct https://docs.rs/tantivy/latest/tantivy/termdict/type.TermDictionary.html
introduced by #1293 which may not have been up to date with what was done in #1242
2022-05-05 14:59:25 +09:00
Pasha Podolsky
57a8d0359c Make FruitHandle and MultiFruit public (#1360)
* Make `FruitHandle` and `MultiFruit` public

* Add docs for `MultiFruit` and `FruitHandle`
2022-05-05 14:58:33 +09:00
Saroh
14cb66ee00 move helper to indexer module 2022-05-04 18:01:57 +02:00
Saroh
9e38343352 expose helpers for json field writer manipulation
closes #1302
2022-05-04 18:01:45 +02:00
PSeitz
944302ae2f Merge pull request #1350 from quickwit-oss/update_edition
update edition
2022-05-04 11:02:52 +02:00
Paul Masurel
be70804d17 Removed AtomicUsize. 2022-05-04 16:45:24 +09:00
PSeitz
a1afc80600 Update src/core/executor.rs
Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-05-04 08:39:44 +02:00
Paul Masurel
02e24fda52 Clippy fix 2022-05-04 12:24:07 +09:00
PSeitz
7e3c0c5392 Merge pull request #1353 from quickwit-oss/fix_docs
minor docs fixes
2022-05-02 07:48:25 +02:00
Pascal Seitz
fdb2524f9e minor docs fixes 2022-05-02 12:26:12 +08:00
Pascal Seitz
4db655ae82 update dependencies, update edition 2022-04-28 22:50:55 +08:00
Pascal Seitz
bb44cc84c4 update dependencies 2022-04-28 20:55:36 +08:00
PSeitz
8c1e1cf1ad Merge pull request #1349 from quickwit-oss/fix_error_message
print whole query on syntax error
2022-04-28 09:31:45 +02:00
Pascal Seitz
b5b16948b0 print whole query on syntax error 2022-04-27 12:48:30 +08:00
PSeitz
c305d3a2a2 Merge pull request #1346 from quickwit-oss/term_agg
term agg
2022-04-26 07:08:07 +02:00
PSeitz
038d234ff1 Merge pull request #1347 from quickwit-oss/query_parser_error
fix query parser error field not found
2022-04-26 07:01:48 +02:00
Pascal Seitz
c45eb9a9fa improve readability, add json test 2022-04-26 11:22:34 +08:00
Pascal Seitz
824d6f96fe return query on parse error 2022-04-22 16:11:36 +08:00
Pascal Seitz
7cf821bac0 fix query parser error field not found 2022-04-22 12:40:00 +08:00
PSeitz
ae83fc8298 bump uuid to 1.0 (#1345) 2022-04-22 10:02:24 +09:00
dependabot[bot]
a7bc361145 Update pprof requirement from 0.7 to 0.8 (#1343)
Updates the requirements on [pprof](https://github.com/tikv/pprof-rs) to permit the latest version.
- [Release notes](https://github.com/tikv/pprof-rs/releases)
- [Changelog](https://github.com/tikv/pprof-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/tikv/pprof-rs/commits)

---
updated-dependencies:
- dependency-name: pprof
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-04-21 09:35:13 +09:00
Pascal Seitz
2805291400 minor fixes 2022-04-20 14:22:44 +08:00
Pascal Seitz
6614a2cba0 fix is_fast for bytes field 2022-04-20 12:02:38 +08:00
Pascal Seitz
6f4d203d1b return error on missing sub aggregation 2022-04-20 11:19:36 +08:00
Pascal Seitz
1be6c6111c support order property on term aggregations
support order property on term aggregations
order can be by doc_count, key, or a metric sub_aggregation
2022-04-20 00:34:38 +08:00
PSeitz
c7c3eab256 Merge pull request #1340 from PSeitz/term_agg
fix collecting term_dict field names
2022-04-18 08:21:27 +02:00
Pascal Seitz
ec69875d15 fix collecting term_dict field names
fix collecting term_dict field names for sub_aggregations, minor refactoring
2022-04-15 17:49:20 +08:00
PSeitz
d832cfcfd8 Merge pull request #1329 from quickwit-oss/term_agg
add term aggregation
2022-04-14 14:45:21 +08:00
Pascal Seitz
ab6b532cc4 add comments 2022-04-14 12:06:36 +08:00
Pascal Seitz
4b6047f7d7 return Option from as_ methods 2022-04-14 10:48:36 +08:00
Pascal Seitz
5ca04beb94 add min_doc_count test 2022-04-13 19:51:18 +08:00
Pascal Seitz
902d05ebec refactor getffreader function 2022-04-13 19:51:18 +08:00
Pascal Seitz
f1b298642a remove unnecessary benchmarks 2022-04-13 19:51:18 +08:00
Pascal Seitz
dd13dedaeb forward errors, remove unwrap 2022-04-13 19:51:18 +08:00
Pascal Seitz
46724b4a05 add segment_size, add get term dict fields, add tests 2022-04-13 19:51:18 +08:00
Pascal Seitz
24432bf523 add term aggregation 2022-04-13 19:51:18 +08:00
PSeitz
31d3bcfff2 Merge pull request #1334 from PSeitz/minor_fixes
fix DateTime naming, fix docs, cleanup
2022-04-13 13:13:57 +08:00
Pascal Seitz
706fbd6886 fix DateTime naming, fix docs, cleanup 2022-04-13 13:01:00 +08:00
PSeitz
8a8a048015 fix coverage (#1335) 2022-04-13 13:47:47 +09:00
dependabot[bot]
c72549cb9a Bump codecov/codecov-action from 2 to 3 (#1328)
Bumps [codecov/codecov-action](https://github.com/codecov/codecov-action) from 2 to 3.
- [Release notes](https://github.com/codecov/codecov-action/releases)
- [Changelog](https://github.com/codecov/codecov-action/blob/master/CHANGELOG.md)
- [Commits](https://github.com/codecov/codecov-action/compare/v2...v3)

---
updated-dependencies:
- dependency-name: codecov/codecov-action
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-04-11 21:26:52 +09:00
PSeitz
d6f803212c Merge pull request #1325 from quickwit-oss/term_agg
fast field on string
2022-04-04 15:34:31 +08:00
Pascal Seitz
dac73537d2 update changelog 2022-04-04 14:15:40 +08:00
Pascal Seitz
bb5254de12 always serialize, use enum as param 2022-04-04 13:50:23 +08:00
Maxim Kraynyuchenko
be5218c2f6 Company Logos were not visible in Dark Theme. (#1326) 2022-04-04 11:53:31 +09:00
Pascal Seitz
ec9478830a add text test
move get multiple values to test code
remove sorting term ids per docidi for non facets
2022-03-30 11:31:33 +08:00
Pascal Seitz
8807bfd13d fast field on string
enables FAST on string fields, which creates a fastfield containing the term ordinals
2022-03-29 12:40:10 +08:00
Maxim Kraynyuchenko
447811c111 Update README following sections: features, benchmark illustration & FAQ. (#1318)
* Updated features, benchmark illustration & FAQ.
* Updated README: Feat,Graph,Non-Feat,Companies,FAQ
2022-03-23 10:02:09 +09:00
PSeitz
f29acf5d8c fix clippy (#1321) 2022-03-22 12:48:23 +09:00
Uwe Klotz
125707dbe0 Replace chrono with time (#1307)
For date values `chrono` has been replaced with `time` 
- The `time` crate is re-exported as `tantivy::time` instead of `tantivy::chrono`.
- The type alias `tantivy::DateTime` has been removed.
- `Value::Date` wraps `time::PrimitiveDateTime` without time zone information.
- Internally date/time values are stored as seconds since UNIX epoch in UTC.
- Converting a `time::OffsetDateTime` to `Value::Date` implicitly converts the value into UTC.
If this is not desired do the time zone conversion yourself and use `time::PrimitiveDateTime`
directly instead.

Closes #1304
2022-03-21 10:50:19 +09:00
Paul Masurel
46d5de920d Removes all usage of block_on, and use a oneshot channel instead. (#1315)
* Removes all usage of block_on, and use a oneshot channel instead.

Calling `block_on` panics in certain context.
For instance, it panics when it is called in a the context of another
call to block.

Using it in tantivy is unnecessary. We replace it by a thin wrapper
around a oneshot channel that supports both async/sync.

* Removing needless uses of async in the API.

Co-authored-by: PSeitz <PSeitz@users.noreply.github.com>
2022-03-18 16:54:58 +09:00
PSeitz
d2a7bcf217 fix fmt (#1317) 2022-03-18 15:53:27 +09:00
PSeitz
141b9aa245 Merge pull request #1306 from PSeitz/histogram
add Histogram aggregation
2022-03-18 05:03:46 +01:00
PSeitz
c5a6282fa8 Update src/aggregation/bucket/histogram/histogram.rs
Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-03-18 04:55:31 +01:00
PSeitz
c0f524e1a3 Update src/aggregation/bucket/histogram/histogram.rs
Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-03-18 04:55:25 +01:00
Paul Masurel
958b2bee08 Clippy comments (#1316) 2022-03-17 18:57:55 +09:00
Pascal Seitz
f619658e2c rename 2022-03-17 16:37:57 +08:00
Pascal Seitz
aa391bf843 refactor parameters 2022-03-17 16:28:37 +08:00
Pascal Seitz
47dcbdbeae handle empty results, empty indices, add tests 2022-03-17 10:24:34 +08:00
Pascal Seitz
691245bf20 make code more concise 2022-03-16 14:21:58 +08:00
Pascal Seitz
90798d4b39 address comments, add single bucket test 2022-03-16 13:58:13 +08:00
Pascal Seitz
0b6d9f90cf improve docs 2022-03-16 12:39:26 +08:00
PSeitz
8a5a12d961 add setter to json object options (#1311) 2022-03-16 10:36:30 +09:00
Pascal Seitz
e73542e2e8 Elasticsearch behaviour on hard/extended_bounds 2022-03-15 16:46:45 +08:00
Pascal Seitz
0262e44bbd merge_fruits pass by value 2022-03-15 12:59:22 +08:00
Pascal Seitz
613aad7a8a vec optional, improve performance 2022-03-14 21:29:07 +08:00
Pascal Seitz
1aa88b0c51 improve performance 2022-03-14 20:28:08 +08:00
Pascal Seitz
564fa38085 move sub_aggregations to own vec, use itertools minmax 2022-03-14 16:20:26 +08:00
dependabot[bot]
59ec21479f Update pprof requirement from 0.6 to 0.7 (#1305)
Updates the requirements on [pprof](https://github.com/tikv/pprof-rs) to permit the latest version.
- [Release notes](https://github.com/tikv/pprof-rs/releases)
- [Changelog](https://github.com/tikv/pprof-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/tikv/pprof-rs/commits)

---
updated-dependencies:
- dependency-name: pprof
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-03-14 13:57:22 +09:00
PSeitz
42283f9e91 fix error message UnknownTokenizer (#1308)
closes #1303
2022-03-14 13:54:47 +09:00
PSeitz
b105bf72e1 use defaults in meta.json (#1310)
This change allows to have unset fields in meta.json and fall back to their defaults
Currently it is required to explicitly put e.g. fieldnorms: false
2022-03-14 13:54:06 +09:00
Pascal Seitz
226f577803 Add Histogram aggregation 2022-03-11 21:52:07 +08:00
Paul Masurel
2e255c4bef Preparing for release 2022-03-09 09:59:08 +09:00
Paul Masurel
387592809f Updated CHANGELOG 2022-03-07 15:31:35 +09:00
Halvor Fladsrud Bø
cedced5bb0 Slop support for phrase queries (#1241)
Closes #1068
2022-03-07 15:29:18 +09:00
dependabot[bot]
d31f045872 Bump actions/checkout from 2 to 3 (#1300)
Bumps [actions/checkout](https://github.com/actions/checkout) from 2 to 3.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v2...v3)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-03-07 11:54:26 +09:00
PSeitz
6656a70d1b Merge pull request #1301 from saroh/1232-doc-fastfield
update fastfield doc
2022-03-04 08:18:21 +01:00
saroh
d36e0a9549 fix fastfield doc 2022-03-03 17:43:18 +01:00
Antoine G
8771b2673f Update src/fastfield/writer.rs
Co-authored-by: PSeitz <PSeitz@users.noreply.github.com>
2022-03-03 11:25:24 +01:00
Antoine G
a41d3d51a4 Update fastfield_codecs/src/lib.rs 2022-03-03 11:25:06 +01:00
Saroh
cae34ffe47 update fastfield doc 2022-03-02 16:04:15 +01:00
PSeitz
4b62f7907d Merge pull request #1297 from PSeitz/fix_clippy
fix clippy issues
2022-03-02 10:11:56 +01:00
Pascal Seitz
7fa6a0b665 cargo fmt 2022-03-02 09:24:14 +01:00
PSeitz
458ed29a31 Merge pull request #1299 from saroh/1232-doc-lint
doc lint for errors and aggregations
2022-03-02 09:22:07 +01:00
Antoine G
e37775fe21 iff->if or if and only if (#1298)
* has_xxx is_xxx -> if, these function usualy define equivalence
xxx returns bool -> specify equivalence when appropriate

* fix doc
2022-03-02 11:00:00 +09:00
Saroh
1cd2434a32 fix(aggregations) Readme 2022-03-01 20:37:48 +01:00
Saroh
de2cba6d1e error definitions 2022-03-01 20:13:59 +01:00
Paul Masurel
c0b1a58d27 Apply suggestions from code review 2022-03-01 18:41:58 +09:00
Paul Masurel
848b795b9f Apply suggestions from code review 2022-03-01 18:37:51 +09:00
Pascal Seitz
091b668624 fix clippy issues 2022-03-01 08:58:51 +01:00
Paul Masurel
5004290daa Return an error on certain type of corruption. (#1296) 2022-03-01 11:35:56 +09:00
StyMaar
5d2c2b804c Fix link to RamDirectory and MMapDirectory in Directory's documentation (#1295) 2022-03-01 09:46:53 +09:00
PSeitz
1a92b588e0 Merge pull request #1294 from PSeitz/aggregation
fix intermediate result de/serialization
2022-02-28 08:39:23 +01:00
Pascal Seitz
010e92c118 fix intermediate result de/serialization
return None for empty average/stats metric
add test for de/serialization of intermediate result
add test for metric on empty result
2022-02-25 16:39:57 +01:00
Paul Masurel
2ead010c83 Tantivy quickwit (#1293)
* Added sstable and enabling it by default, and parallel boolean query.
* Added async API for FileSlice.
* Added async get_doc
* Reduce blocksize to 32_000
* Added debug logs

Quickwit specific feature a hidden behind the quickwit feature flag.
2022-02-25 17:32:49 +09:00
PSeitz
c4f66eb185 improve validation in aggregation, extend invalid field test (#1292)
* improve validation in aggregation, extend invalid field test

improve validation in aggregation
extend invalid field test
Fixes #1291

* collect fast field names on request structure

* fix visibility of AggregationSegmentCollector
2022-02-25 15:21:19 +09:00
Paul Masurel
d7b46d2137 Added JSON Type (#1270)
- Removed useless copy when ingesting JSON.
- Bugfix in phrase query with a missing field norms.
- Disabled range query on default fields

Closes #1251
2022-02-24 16:25:22 +09:00
PSeitz
d042ce74c7 Merge pull request #1289 from PSeitz/numeric_options
rename IntOptions to NumericOptions
2022-02-23 14:04:40 +01:00
PSeitz
7ba9e662b8 Merge pull request #1290 from PSeitz/improve_docs
improve aggregation docs
2022-02-23 14:04:20 +01:00
Pascal Seitz
fdd5ef85e5 improve aggregation docs 2022-02-22 10:37:54 +01:00
Pascal Seitz
704498a1ac rename IntOptions to NumericOptions
keep IntOptions with deprecation warning
Fixes #1286
2022-02-21 22:20:07 +01:00
PSeitz
1232af7928 fix docs (#1288) 2022-02-21 23:15:58 +09:00
Paul Masurel
d37633e034 Minor changes in indexing. (#1285) 2022-02-21 17:16:52 +09:00
Paul Masurel
9815067171 Minor changes 2022-02-21 13:55:01 +09:00
PSeitz
972cb6c26d Aggregation (#1276)
Added support for aggregation compatible with Elasticsearch's API.
2022-02-21 09:59:11 +09:00
Paul Masurel
4dc80cfa25 Removes TokenStream chain. (#1283)
This change is mostly motivated by the introduction of json object.

We need to be able to inject a position object to make the position
shift.
2022-02-21 09:51:27 +09:00
PSeitz
cef145790c Fix opening bytes index with dynamic codec (#1279)
* Fix opening bytes index with dynamic codec

Fix #1278

* extend proptest to cover bytes field codec bug
2022-02-18 20:44:21 +09:00
Paul Masurel
e05e2a0c51 Added profiling to indexing bench (#1282) 2022-02-18 20:43:28 +09:00
Paul Masurel
e028515caf Simplified expull code. (#1281) 2022-02-18 18:57:10 +09:00
Paul Masurel
850b9eaea4 added a bench to measure the perf of indexing logs (#1275) 2022-02-18 16:48:29 +09:00
Shikhar Bhushan
505e6a440c Remove test assertion sensitive to background segment merging (#1274) 2022-02-17 10:59:46 +09:00
Koichi Akabe
fcd651f6a9 Add Vaporetto tokenizer to README (#1271)
* Add Vaporetto tokenizer to README

* Update README.md
2022-02-14 18:19:57 +09:00
Paul Masurel
e6653228a9 Renamed github workflows (#1269) 2022-02-04 15:10:24 +09:00
Paul Masurel
bdedefe07d Adding an IndexingContext object (#1268) 2022-02-04 15:08:01 +09:00
Paul Masurel
13a4473faa Removing obsolete clippy allow thingy. 2022-02-01 11:54:01 +09:00
Paul Masurel
2069e3e52b Fixing clippy comments 2022-02-01 10:24:05 +09:00
Paul Masurel
0d8263cba1 Using nightly to format 2022-01-31 16:10:11 +09:00
Paul Masurel
65b365b81c Fixing all-features build. 2022-01-31 14:41:14 +09:00
dependabot[bot]
4c1366da87 Update fastdivide requirement from 0.3 to 0.4 (#1265)
Updates the requirements on fastdivide to permit the latest version.

---
updated-dependencies:
- dependency-name: fastdivide
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-01-31 11:26:50 +09:00
Paul Masurel
eca6628b3c Minor refactoring (#1266) 2022-01-28 15:55:55 +09:00
Paul Masurel
9679c5f306 Rename quickwit-inc -> quickwit-oss 2022-01-27 15:37:09 +09:00
Shikhar Bhushan
5a2497b6fd Avoid exposing TrackedObject from Warmer API (#1264) 2022-01-25 10:04:08 +09:00
Shikhar Bhushan
99d4b1a177 Searcher Warming API (#1261)
Adds an API to register Warmers in the IndexReader.

Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-01-20 23:40:25 +09:00
412 changed files with 148707 additions and 13617 deletions

1
.gitattributes vendored
View File

@@ -1 +0,0 @@
cpp/* linguist-vendored

View File

@@ -2,23 +2,24 @@ name: Coverage
on:
push:
branches: [ main ]
branches: [main]
pull_request:
branches: [ main ]
branches: [main]
jobs:
coverage:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v3
- name: Install Rust
run: rustup toolchain install nightly --component llvm-tools-preview
- name: Install cargo-llvm-cov
run: curl -LsSf https://github.com/taiki-e/cargo-llvm-cov/releases/latest/download/cargo-llvm-cov-x86_64-unknown-linux-gnu.tar.gz | tar xzf - -C ~/.cargo/bin
run: rustup toolchain install nightly --profile minimal --component llvm-tools-preview
- uses: Swatinem/rust-cache@v2
- uses: taiki-e/install-action@cargo-llvm-cov
- name: Generate code coverage
run: cargo llvm-cov --all-features --workspace --lcov --output-path lcov.info
run: cargo +nightly llvm-cov --all-features --workspace --doctests --lcov --output-path lcov.info
- name: Upload coverage to Codecov
uses: codecov/codecov-action@v2
uses: codecov/codecov-action@v3
continue-on-error: true
with:
token: ${{ secrets.CODECOV_TOKEN }} # not required for public repos
files: lcov.info

View File

@@ -1,4 +1,4 @@
name: Rust
name: Long running tests
on:
push:
@@ -9,16 +9,20 @@ env:
NUM_FUNCTIONAL_TEST_ITERATIONS: 20000
jobs:
functional_test_unsorted:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v3
- name: Install stable
uses: actions-rs/toolchain@v1
with:
toolchain: stable
profile: minimal
override: true
- name: Run indexing_unsorted
run: cargo test indexing_unsorted -- --ignored
functional_test_sorted:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Run indexing_sorted
run: cargo test indexing_sorted -- --ignored

View File

@@ -1,4 +1,4 @@
name: Rust
name: Unit tests
on:
push:
@@ -10,21 +10,65 @@ env:
CARGO_TERM_COLOR: always
jobs:
test:
check:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Build
run: cargo build --verbose --workspace
- name: Install latest nightly to test also against unstable feature flag
- uses: actions/checkout@v3
- name: Install nightly
uses: actions-rs/toolchain@v1
with:
toolchain: nightly
profile: minimal
components: rustfmt
- name: Install stable
uses: actions-rs/toolchain@v1
with:
toolchain: stable
override: true
components: rustfmt
- name: Run tests
run: cargo test --features mmap,brotli-compression,lz4-compression,snappy-compression,failpoints --verbose --workspace
profile: minimal
components: clippy
- uses: Swatinem/rust-cache@v2
- name: Check Formatting
run: cargo fmt --all -- --check
run: cargo +nightly fmt --all -- --check
- uses: actions-rs/clippy-check@v1
with:
toolchain: stable
token: ${{ secrets.GITHUB_TOKEN }}
args: --tests
test:
runs-on: ubuntu-latest
strategy:
matrix:
features: [
{ label: "all", flags: "mmap,stopwords,brotli-compression,lz4-compression,snappy-compression,zstd-compression,failpoints" },
{ label: "quickwit", flags: "mmap,quickwit,failpoints" }
]
name: test-${{ matrix.features.label}}
steps:
- uses: actions/checkout@v3
- name: Install stable
uses: actions-rs/toolchain@v1
with:
toolchain: stable
profile: minimal
override: true
- uses: taiki-e/install-action@nextest
- uses: Swatinem/rust-cache@v2
- name: Run tests
run: cargo +stable nextest run --features ${{ matrix.features.flags }} --verbose --workspace
- name: Run doctests
run: cargo +stable test --doc --features ${{ matrix.features.flags }} --verbose --workspace

1
.gitignore vendored
View File

@@ -9,7 +9,6 @@ target/release
Cargo.lock
benchmark
.DS_Store
cpp/simdcomp/bitpackingbenchmark
*.bk
.idea
trace.dat

View File

@@ -10,6 +10,7 @@ Tantivy's bread and butter is to address the problem of full-text search :
Given a large set of textual documents, and a text query, return the K-most relevant documents in a very efficient way. To execute these queries rapidly, the tantivy needs to build an index beforehand. The relevance score implemented in the tantivy is not configurable. Tantivy uses the same score as the default similarity used in Lucene / Elasticsearch, called [BM25](https://en.wikipedia.org/wiki/Okapi_BM25).
But tantivy's scope does not stop there. Numerous features are required to power rich-search applications. For instance, one may want to:
- compute the count of documents matching a query in the different section of an e-commerce website,
- display an average price per meter square for a real estate search engine,
- take into account historical user data to rank documents in a specific way,
@@ -22,27 +23,28 @@ rapidly select all documents matching a given predicate (also known as a query)
collect some information about them ([See collector](#collector-define-what-to-do-with-matched-documents)).
Roughly speaking the design is following these guiding principles:
- Search should be O(1) in memory.
- Indexing should be O(1) in memory. (In practice it is just sublinear)
- Search should be as fast as possible
This comes at the cost of the dynamicity of the index: while it is possible to add, and delete documents from our corpus, the tantivy is designed to handle these updates in large batches.
## [core/](src/core): Index, segments, searchers.
## [core/](src/core): Index, segments, searchers
Core contains all of the high-level code to make it possible to create an index, add documents, delete documents and commit.
This is both the most high-level part of tantivy, the least performance-sensitive one, the seemingly most mundane code... And paradoxically the most complicated part.
### Index and Segments...
### Index and Segments
A tantivy index is a collection of smaller independent immutable segments.
A tantivy index is a collection of smaller independent immutable segments.
Each segment contains its own independent set of data structures.
A segment is identified by a segment id that is in fact a UUID.
The file of a segment has the format
```segment-id . ext ```
```segment-id . ext```
The extension signals which data structure (or [`SegmentComponent`](src/core/segment_component.rs)) is stored in the file.
@@ -52,17 +54,15 @@ On commit, one segment per indexing thread is written to disk, and the `meta.jso
For a better idea of how indexing works, you may read the [following blog post](https://fulmicoton.com/posts/behold-tantivy-part2/).
### Deletes
Deletes happen by deleting a "term". Tantivy does not offer any notion of primary id, so it is up to the user to use a field in their schema as if it was a primary id, and delete the associated term if they want to delete only one specific document.
On commit, tantivy will find all of the segments with documents matching this existing term and create a [tombstone file](src/fastfield/delete.rs) that represents the bitset of the document that are deleted.
Like all segment files, this file is immutable. Because it is possible to have more than one tombstone file at a given instant, the tombstone filename has the format ``` segment_id . commit_opstamp . del```.
On commit, tantivy will find all of the segments with documents matching this existing term and remove from [alive bitset file](src/fastfield/alive_bitset.rs) that represents the bitset of the alive document ids.
Like all segment files, this file is immutable. Because it is possible to have more than one alive bitset file at a given instant, the alive bitset filename has the format ```segment_id . commit_opstamp . del```.
An opstamp is simply an incremental id that identifies any operation applied to the index. For instance, performing a commit or adding a document.
### DocId
Within a segment, all documents are identified by a DocId that ranges within `[0, max_doc)`.
@@ -74,6 +74,7 @@ The DocIds are simply allocated in the order documents are added to the index.
In separate threads, tantivy's index writer search for opportunities to merge segments.
The point of segment merge is to:
- eventually get rid of tombstoned documents
- reduce the otherwise ever-growing number of segments.
@@ -94,7 +95,7 @@ called [`Directory`](src/directory/directory.rs).
Contrary to Lucene however, "files" are quite different from some kind of `io::Read` object.
Check out [`src/directory/directory.rs`](src/directory/directory.rs) trait for more details.
Tantivy ships two main directory implementation: the `MMapDirectory` and the `RAMDirectory`,
Tantivy ships two main directory implementation: the `MmapDirectory` and the `RamDirectory`,
but users can extend tantivy with their own implementation.
## [schema/](src/schema): What are documents?
@@ -104,6 +105,7 @@ Tantivy's document follows a very strict schema, decided before building any ind
The schema defines all of the fields that the indexes [`Document`](src/schema/document.rs) may and should contain, their types (`text`, `i64`, `u64`, `Date`, ...) as well as how it should be indexed / represented in tantivy.
Depending on the type of the field, you can decide to
- put it in the docstore
- store it as a fast field
- index it
@@ -117,9 +119,10 @@ As of today, tantivy's schema imposes a 1:1 relationship between a field that is
This is not something tantivy supports, and it is up to the user to duplicate field / concatenate fields before feeding them to tantivy.
## General information about these data structures.
## General information about these data structures
All data structures in tantivy, have:
- a writer
- a serializer
- a reader
@@ -132,7 +135,7 @@ This conversion is done by the serializer.
Finally, the reader is in charge of offering an API to read on this on-disk read-only representation.
In tantivy, readers are designed to require very little anonymous memory. The data is read straight from an mmapped file, and loading an index is as fast as mmapping its files.
## [store/](src/store): Here is my DocId, Gimme my document!
## [store/](src/store): Here is my DocId, Gimme my document
The docstore is a row-oriented storage that, for each document, stores a subset of the fields
that are marked as stored in the schema. The docstore is compressed using a general-purpose algorithm
@@ -146,6 +149,7 @@ Once the top 10 documents have been identified, we fetch them from the store, an
**Not useful for**
Fetching a document from the store is typically a "slow" operation. It usually consists in
- searching into a compact tree-like data structure to find the position of the right block.
- decompressing a small block
- returning the document from this block.
@@ -154,8 +158,7 @@ It is NOT meant to be called for every document matching a query.
As a rule of thumb, if you hit the docstore more than 100 times per search query, you are probably misusing tantivy.
## [fastfield/](src/fastfield): Here is my DocId, Gimme my value!
## [fastfield/](src/fastfield): Here is my DocId, Gimme my value
Fast fields are stored in a column-oriented storage that allows for random access.
The only compression applied is bitpacking. The column comes with two meta data.
@@ -163,7 +166,7 @@ The minimum value in the column and the number of bits per doc.
Fetching a value for a `DocId` is then as simple as computing
```
```rust
min_value + fetch_bits(num_bits * doc_id..num_bits * (doc_id+1))
```
@@ -190,7 +193,7 @@ For advanced search engine, it is possible to store all of the features required
Finally facets are a specific kind of fast field, and the associated source code is in [`fastfield/facet_reader.rs`](src/fastfield/facet_reader.rs).
# The inverted search index.
# The inverted search index
The inverted index is the core part of full-text search.
When presented a new document with the text field "Hello, happy tax payer!", tantivy breaks it into a list of so-called tokens. In addition to just splitting these strings into tokens, it might also do different kinds of operations like dropping the punctuation, converting the character to lowercase, apply stemming, etc. Tantivy makes it possible to configure the operations to be applied in the schema (tokenizer/ is the place where these operations are implemented).
@@ -215,19 +218,18 @@ The inverted index actually consists of two data structures chained together.
Where [TermInfo](src/postings/term_info.rs) is an object containing some meta data about a term.
## [termdict/](src/termdict): Here is a term, give me the [TermInfo](src/postings/term_info.rs)!
## [termdict/](src/termdict): Here is a term, give me the [TermInfo](src/postings/term_info.rs)
Tantivy's term dictionary is mainly in charge of supplying the function
[Term](src/schema/term.rs) ⟶ [TermInfo](src/postings/term_info.rs)
It is itself broken into two parts.
- [Term](src/schema/term.rs) ⟶ [TermOrdinal](src/termdict/mod.rs) is addressed by a finite state transducer, implemented by the fst crate.
- [TermOrdinal](src/termdict/mod.rs) ⟶ [TermInfo](src/postings/term_info.rs) is addressed by the term info store.
## [postings/](src/postings): Iterate over documents... very fast!
## [postings/](src/postings): Iterate over documents... very fast
A posting list makes it possible to store a sorted list of doc ids and for each doc store
a term frequency as well.
@@ -249,7 +251,7 @@ For instance, when the phrase query "the art of war" does not match "the war of
To make it possible, it is possible to specify in the schema that a field should store positions in addition to being indexed.
The token positions of all of the terms are then stored in a separate file with the extension `.pos`.
The [TermInfo](src/postings/term_info.rs) gives an offset (expressed in position this time) in this file. As we iterate throught the docset,
The [TermInfo](src/postings/term_info.rs) gives an offset (expressed in position this time) in this file. As we iterate through the docset,
we advance the position reader by the number of term frequencies of the current document.
## [fieldnorms/](src/fieldnorms): Here is my doc, how many tokens in this field?
@@ -257,7 +259,6 @@ we advance the position reader by the number of term frequencies of the current
The [BM25](https://en.wikipedia.org/wiki/Okapi_BM25) formula also requires to know the number of tokens stored in a specific field for a given document. We store this information on one byte per document in the fieldnorm.
The fieldnorm is therefore compressed. Values up to 40 are encoded unchanged.
## [tokenizer/](src/tokenizer): How should we process text?
Text processing is key to a good search experience.
@@ -268,7 +269,6 @@ Text processing can be configured by selecting an off-the-shelf [`Tokenizer`](./
Tantivy's comes with few tokenizers, but external crates are offering advanced tokenizers, such as [Lindera](https://crates.io/crates/lindera) for Japanese.
## [query/](src/query): Define and compose queries
The [Query](src/query/query.rs) trait defines what a query is.

View File

@@ -1,41 +1,108 @@
Tantivy 0.19
================================
#### Bugfixes
- Fix missing fieldnorms for u64, i64, f64, bool, bytes and date [#1620](https://github.com/quickwit-oss/tantivy/pull/1620) (@PSeitz)
- Fix interpolation overflow in linear interpolation fastfield codec [#1480](https://github.com/quickwit-oss/tantivy/pull/1480) (@PSeitz @fulmicoton)
#### Features/Improvements
- Add support for `IN` in queryparser , e.g. `field: IN [val1 val2 val3]` [#1683](https://github.com/quickwit-oss/tantivy/pull/1683) (@trinity-1686a)
- Skip score calculation, when no scoring is required [#1646](https://github.com/quickwit-oss/tantivy/pull/1646) (@PSeitz)
- Limit fast fields to u32 (`get_val(u32)`) [#1644](https://github.com/quickwit-oss/tantivy/pull/1644) (@PSeitz)
- The `DateTime` type has been updated to hold timestamps with microseconds precision.
`DateOptions` and `DatePrecision` have been added to configure Date fields. The precision is used to hint on fast values compression. Otherwise, seconds precision is used everywhere else (i.e terms, indexing) [#1396](https://github.com/quickwit-oss/tantivy/pull/1396) (@evanxg852000)
- Add IP address field type [#1553](https://github.com/quickwit-oss/tantivy/pull/1553) (@PSeitz)
- Add boolean field type [#1382](https://github.com/quickwit-oss/tantivy/pull/1382) (@boraarslan)
- Remove Searcher pool and make `Searcher` cloneable. (@PSeitz)
- Validate settings on create [#1570](https://github.com/quickwit-oss/tantivy/pull/1570) (@PSeitz)
- Detect and apply gcd on fastfield codecs [#1418](https://github.com/quickwit-oss/tantivy/pull/1418) (@PSeitz)
- Doc store
- use separate thread to compress block store [#1389](https://github.com/quickwit-oss/tantivy/pull/1389) [#1510](https://github.com/quickwit-oss/tantivy/pull/1510) (@PSeitz @fulmicoton)
- Expose doc store cache size [#1403](https://github.com/quickwit-oss/tantivy/pull/1403) (@PSeitz)
- Enable compression levels for doc store [#1378](https://github.com/quickwit-oss/tantivy/pull/1378) (@PSeitz)
- Make block size configurable [#1374](https://github.com/quickwit-oss/tantivy/pull/1374) (@kryesh)
- Make `tantivy::TantivyError` cloneable [#1402](https://github.com/quickwit-oss/tantivy/pull/1402) (@PSeitz)
- Add support for phrase slop in query language [#1393](https://github.com/quickwit-oss/tantivy/pull/1393) (@saroh)
- Aggregation
- Add aggregation support for date type [#1693](https://github.com/quickwit-oss/tantivy/pull/1693)(@PSeitz)
- Add support for keyed parameter in range and histgram aggregations [#1424](https://github.com/quickwit-oss/tantivy/pull/1424) (@k-yomo)
- Add aggregation bucket limit [#1363](https://github.com/quickwit-oss/tantivy/pull/1363) (@PSeitz)
- Faster indexing
- [#1610](https://github.com/quickwit-oss/tantivy/pull/1610) (@PSeitz)
- [#1594](https://github.com/quickwit-oss/tantivy/pull/1594) (@PSeitz)
- [#1582](https://github.com/quickwit-oss/tantivy/pull/1582) (@PSeitz)
- [#1611](https://github.com/quickwit-oss/tantivy/pull/1611) (@PSeitz)
- Added a pre-configured stop word filter for various language [#1666](https://github.com/quickwit-oss/tantivy/pull/1666) (@adamreichold)
Tantivy 0.18
================================
- For date values `chrono` has been replaced with `time` (@uklotzde) #1304 :
- The `time` crate is re-exported as `tantivy::time` instead of `tantivy::chrono`.
- The type alias `tantivy::DateTime` has been removed.
- `Value::Date` wraps `time::PrimitiveDateTime` without time zone information.
- Internally date/time values are stored as seconds since UNIX epoch in UTC.
- Converting a `time::OffsetDateTime` to `Value::Date` implicitly converts the value into UTC.
If this is not desired do the time zone conversion yourself and use `time::PrimitiveDateTime`
directly instead.
- Add [histogram](https://github.com/quickwit-oss/tantivy/pull/1306) aggregation (@PSeitz)
- Add support for fastfield on text fields (@PSeitz)
- Add terms aggregation (@PSeitz)
- Add support for zstd compression (@kryesh)
Tantivy 0.18.1
================================
- Hotfix: positions computation. #1629 (@fmassot, @fulmicoton, @PSeitz)
Tantivy 0.17
================================
- LogMergePolicy now triggers merges if the ratio of deleted documents reaches a threshold (@shikhar @fulmicoton) [#115](https://github.com/quickwit-oss/tantivy/issues/115)
- Adds a searcher Warmer API (@shikhar @fulmicoton)
- Change to non-strict schema. Ignore fields in data which are not defined in schema. Previously this returned an error. #1211
- Facets are necessarily indexed. Existing index with indexed facets should work out of the box. Index without facets that are marked with index: false should be broken (but they were already broken in a sense). (@fulmicoton) #1195 .
- Bugfix that could in theory impact durability in theory on some filesystems [#1224](https://github.com/quickwit-inc/tantivy/issues/1224)
- Reduce the number of fsync calls [#1225](https://github.com/quickwit-inc/tantivy/issues/1225)
- Schema now offers not indexing fieldnorms (@lpouget) [#922](https://github.com/quickwit-inc/tantivy/issues/922)
- LogMergePolicy now triggers merges if the ratio of deleted documents reaches a threshold (@shikhar) [#115](https://github.com/quickwit-inc/tantivy/issues/115)
- Bugfix that could in theory impact durability in theory on some filesystems [#1224](https://github.com/quickwit-oss/tantivy/issues/1224)
- Schema now offers not indexing fieldnorms (@lpouget) [#922](https://github.com/quickwit-oss/tantivy/issues/922)
- Reduce the number of fsync calls [#1225](https://github.com/quickwit-oss/tantivy/issues/1225)
- Fix opening bytes index with dynamic codec (@PSeitz) [#1278](https://github.com/quickwit-oss/tantivy/issues/1278)
- Added an aggregation collector for range, average and stats compatible with Elasticsearch. (@PSeitz)
- Added a JSON schema type @fulmicoton [#1251](https://github.com/quickwit-oss/tantivy/issues/1251)
- Added support for slop in phrase queries @halvorboe [#1068](https://github.com/quickwit-oss/tantivy/issues/1068)
Tantivy 0.16.2
================================
- Bugfix in FuzzyTermQuery. (tranposition_cost_one was not doing anything)
- Bugfix in FuzzyTermQuery. (transposition_cost_one was not doing anything)
Tantivy 0.16.1
========================
- Major Bugfix on multivalued fastfield. #1151
- Demux operation (@PSeitz)
Tantivy 0.16.0
=========================
- Bugfix in the filesum check. (@evanxg852000) #1127
- Bugfix in positions when the index is sorted by a field. (@appaquet) #1125
Tantivy 0.15.3
=========================
- Major bugfix. Deleting documents was broken when the index was sorted by a field. (@appaquet, @fulmicoton) #1101
- Major bugfix. Deleting documents was broken when the index was sorted by a field. (@appaquet, @fulmicoton) #1101
Tantivy 0.15.2
========================
- Major bugfix. DocStore still panics when a deleted doc is at the beginning of a block. (@appaquet) #1088
Tantivy 0.15.1
=========================
- Major bugfix. DocStore panics when first block is deleted. (@appaquet) #1077
Tantivy 0.15.0
=========================
- API Changes. Using Range instead of (start, end) in the API and internals (`FileSlice`, `OwnedBytes`, `Snippets`, ...)
This change is breaking but migration is trivial.
- Added an Histogram collector. (@fulmicoton) #994
@@ -57,9 +124,9 @@ Tantivy 0.15.0
- Updated TermMerger implementation to rely on the union feature of the FST (@scampi) #469
- Add boolean marking whether position is required in the query_terms API call (@fulmicoton). #1070
Tantivy 0.14.0
=========================
- Remove dependency to atomicwrites #833 .Implemented by @fulmicoton upon suggestion and research from @asafigan).
- Migrated tantivy error from the now deprecated `failure` crate to `thiserror` #760. (@hirevo)
- API Change. Accessing the typed value off a `Schema::Value` now returns an Option instead of panicking if the type does not match.
@@ -78,16 +145,19 @@ This version breaks compatibility and requires users to reindex everything.
Tantivy 0.13.2
===================
Bugfix. Acquiring a facet reader on a segment that does not contain any
doc with this facet returns `None`. (#896)
Tantivy 0.13.1
===================
Made `Query` and `Collector` `Send + Sync`.
Updated misc dependency versions.
Tantivy 0.13.0
======================
Tantivy 0.13 introduce a change in the index format that will require
you to reindex your index (BlockWAND information are added in the skiplist).
The index size increase is minor as this information is only added for
@@ -102,6 +172,7 @@ so that we can discuss possible solutions.
A freshly created DocSet point directly to their first doc. A sentinel value called TERMINATED marks the end of a DocSet.
`.advance()` returns the new DocId. `Scorer::skip(target)` has been replaced by `Scorer::seek(target)` and returns the resulting DocId.
As a result, iterating through DocSet now looks as follows
```rust
let mut doc = docset.doc();
while doc != TERMINATED {
@@ -109,7 +180,9 @@ while doc != TERMINATED {
doc = docset.advance();
}
```
The change made it possible to greatly simplify a lot of the docset's code.
- Misc internal optimization and introduction of the `Scorer::for_each_pruning` function. (@fulmicoton)
- Added an offset option to the Top(.*)Collectors. (@robyoung)
- Added Block WAND. Performance on TOP-K on term-unions should be greatly increased. (@fulmicoton, and special thanks
@@ -117,6 +190,7 @@ to the PISA team for answering all my questions!)
Tantivy 0.12.0
======================
- Removing static dispatch in tokenizers for simplicity. (#762)
- Added backward iteration for `TermDictionary` stream. (@halvorboe)
- Fixed a performance issue when searching for the posting lists of a missing term (@audunhalland)
@@ -127,30 +201,32 @@ Tantivy 0.12.0
## How to update?
Crates relying on custom tokenizer, or registering tokenizer in the manager will require some
minor changes. Check https://github.com/quickwit-inc/tantivy/blob/main/examples/custom_tokenizer.rs
minor changes. Check <https://github.com/quickwit-oss/tantivy/blob/main/examples/custom_tokenizer.rs>
to check for some code sample.
Tantivy 0.11.3
=======================
- Fixed DateTime as a fast field (#735)
Tantivy 0.11.2
=======================
- The future returned by `IndexWriter::merge` does not borrow `self` mutably anymore (#732)
- Exposing a constructor for `WatchHandle` (#731)
Tantivy 0.11.1
=====================
- Bug fix #729
- Bug fix #729
Tantivy 0.11.0
=====================
- Added f64 field. Internally reuse u64 code the same way i64 does (@fdb-hiroshima)
- Various bugfixes in the query parser.
- Better handling of hyphens in query parser. (#609)
- Better handling of whitespaces.
- Better handling of hyphens in query parser. (#609)
- Better handling of whitespaces.
- Closes #498 - add support for Elastic-style unbounded range queries for alphanumeric types eg. "title:>hello", "weight:>=70.5", "height:<200" (@petr-tik)
- API change around `Box<BoxableTokenizer>`. See detail in #629
- Avoid rebuilding Regex automaton whenever a regex query is reused. #639 (@brainlock)
@@ -181,7 +257,6 @@ Tantivy 0.10.1
Avoid watching the mmap directory until someone effectively creates a reader that uses
this functionality.
Tantivy 0.10.0
=====================
@@ -197,6 +272,7 @@ Tantivy 0.10.0
Minor
---------
- Switched to Rust 2018 (@uvd)
- Small simplification of the code.
Calling .freq() or .doc() when .advance() has never been called
@@ -204,8 +280,7 @@ on segment postings should panic from now on.
- Tokens exceeding `u16::max_value() - 4` chars are discarded silently instead of panicking.
- Fast fields are now preloaded when the `SegmentReader` is created.
- `IndexMeta` is now public. (@hntd187)
- `IndexWriter` `add_document`, `delete_term`. `IndexWriter` is `Sync`, making it possible to use it with a `
Arc<RwLock<IndexWriter>>`. `add_document` and `delete_term` can
- `IndexWriter` `add_document`, `delete_term`. `IndexWriter` is `Sync`, making it possible to use it with a `Arc<RwLock<IndexWriter>>`. `add_document` and `delete_term` can
only require a read lock. (@fulmicoton)
- Introducing `Opstamp` as an expressive type alias for `u64`. (@petr-tik)
- Stamper now relies on `AtomicU64` on all platforms (@petr-tik)
@@ -221,16 +296,17 @@ Your program should be usable as is.
Fast fields used to be accessed directly from the `SegmentReader`.
The API changed, you are now required to acquire your fast field reader via the
`segment_reader.fast_fields()`, and use one of the typed method:
- `.u64()`, `.i64()` if your field is single-valued ;
- `.u64s()`, `.i64s()` if your field is multi-valued ;
- `.bytes()` if your field is bytes fast field.
Tantivy 0.9.0
=====================
*0.9.0 index format is not compatible with the
previous index format.*
- MAJOR BUGFIX :
Some `Mmap` objects were being leaked, and would never get released. (@fulmicoton)
- Removed most unsafe (@fulmicoton)
@@ -274,37 +350,40 @@ To update from tantivy 0.8, you will need to go through the following steps.
```
Tantivy 0.8.2
=====================
Fixing build for x86_64 platforms. (#496)
No need to update from 0.8.1 if tantivy
is building on your platform.
Tantivy 0.8.1
=====================
Hotfix of #476.
Merge was reflecting deletes before commit was passed.
Thanks @barrotsteindev for reporting the bug.
Tantivy 0.8.0
=====================
*No change in the index format*
- API Breaking change in the collector API. (@jwolfe, @fulmicoton)
- Multithreaded search (@jwolfe, @fulmicoton)
Tantivy 0.7.1
=====================
*No change in the index format*
- Bugfix: NGramTokenizer panics on non ascii chars
- Added a space usage API
Tantivy 0.7
=====================
- Skip data for doc ids and positions (@fulmicoton),
greatly improving performance
- Tantivy error now rely on the failure crate (@drusellers)
@@ -314,15 +393,15 @@ Tantivy 0.7
Tantivy 0.6.1
=========================
- Bugfix #324. GC removing was removing file that were still in useful
- Added support for parsing AllQuery and RangeQuery via QueryParser
- AllQuery: `*`
- RangeQuery:
- Inclusive `field:[startIncl to endIncl]`
- Exclusive `field:{startExcl to endExcl}`
- Mixed `field:[startIncl to endExcl}` and vice versa
- Unbounded `field:[start to *]`, `field:[* to end]`
- AllQuery: `*`
- RangeQuery:
- Inclusive `field:[startIncl to endIncl]`
- Exclusive `field:{startExcl to endExcl}`
- Mixed `field:[startIncl to endExcl}` and vice versa
- Unbounded `field:[start to *]`, `field:[* to end]`
Tantivy 0.6
==========================
@@ -335,58 +414,53 @@ to this release!
- Approximate field norms encoded over 1 byte. (@fulmicoton)
- Compiles on stable rust (@fulmicoton)
- Add &[u8] fastfield for associating arbitrary bytes to each document (@jason-wolfe) (#270)
- Completely uncompressed
- Internally: One u64 fast field for indexes, one fast field for the bytes themselves.
- Completely uncompressed
- Internally: One u64 fast field for indexes, one fast field for the bytes themselves.
- Add NGram token support (@drusellers)
- Add Stopword Filter support (@drusellers)
- Add a FuzzyTermQuery (@drusellers)
- Add a RegexQuery (@drusellers)
- Various performance improvements (@fulmicoton)_
Tantivy 0.5.2
===========================
- bugfix #274
- bugfix #280
- bugfix #289
Tantivy 0.5.1
==========================
- bugfix #254 : tantivy failed if no documents in a segment contained a specific field.
- bugfix #254 : tantivy failed if no documents in a segment contained a specific field.
Tantivy 0.5
==========================
- Faceting
- RangeQuery
- Configurable tokenization pipeline
- Bugfix in PhraseQuery
- Various query optimisation
- Allowing very large indexes
- 64 bits file address
- Smarter encoding of the `TermInfo` objects
- 64 bits file address
- Smarter encoding of the `TermInfo` objects
Tantivy 0.4.3
==========================
- Bugfix race condition when deleting files. (#198)
Tantivy 0.4.2
==========================
- Prevent usage of AVX2 instructions (#201)
Tantivy 0.4.1
==========================
- Bugfix for non-indexed fields. (#199)
Tantivy 0.4.0
==========================
@@ -401,37 +475,31 @@ Tantivy 0.4.0
- Searching for a non-indexed field returns an explicit Error
- Phrase query for non-tokenized field are not tokenized by the query parser.
- Faster/Better indexing (@fulmicoton)
- using murmurhash2
- faster merging
- more memory efficient fast field writer (@lnicola )
- better handling of collisions
- lesser memory usage
- using murmurhash2
- faster merging
- more memory efficient fast field writer (@lnicola )
- better handling of collisions
- lesser memory usage
- Added API, most notably to iterate over ranges of terms (@fulmicoton)
- Bugfix that was preventing to unmap segment files, on index drop (@fulmicoton)
- Made the doc! macro public (@fulmicoton)
- Added an alternative implementation of the streaming dictionary (@fulmicoton)
Tantivy 0.3.1
==========================
- Expose a method to trigger files garbage collection
Tantivy 0.3
==========================
Special thanks to @Kodraus @lnicola @Ameobea @manuel-woelker @celaus
for their contribution to this release.
Thanks also to everyone in tantivy gitter chat
for their advise and company :)
https://gitter.im/tantivy-search/tantivy
<https://gitter.im/tantivy-search/tantivy>
Warning:
@@ -440,19 +508,16 @@ code and index format.
You should not expect backward compatibility before
tantivy 1.0.
New Features
------------
- Delete. You can now delete documents from an index.
- Support for windows (Thanks to @lnicola)
Various Bugfixes & small improvements
----------------------------------------
- Added CI for Windows (https://ci.appveyor.com/project/fulmicoton/tantivy)
- Added CI for Windows (<https://ci.appveyor.com/project/fulmicoton/tantivy>)
Thanks to @KodrAus ! (#108)
- Various dependy version update (Thanks to @Ameobea) #76
- Fixed several race conditions in `Index.wait_merge_threads`
@@ -464,7 +529,3 @@ Thanks to @KodrAus ! (#108)
- Building binary targets for tantivy-cli (Thanks to @KodrAus)
- Misc invisible bug fixes, and code cleanup.
- Use

View File

@@ -1,75 +1,86 @@
[package]
name = "tantivy"
version = "0.17.0-dev"
version = "0.19.0"
authors = ["Paul Masurel <paul.masurel@gmail.com>"]
license = "MIT"
categories = ["database-implementations", "data-structures"]
description = """Search engine library"""
documentation = "https://docs.rs/tantivy/"
homepage = "https://github.com/quickwit-inc/tantivy"
repository = "https://github.com/quickwit-inc/tantivy"
homepage = "https://github.com/quickwit-oss/tantivy"
repository = "https://github.com/quickwit-oss/tantivy"
readme = "README.md"
keywords = ["search", "information", "retrieval"]
edition = "2018"
edition = "2021"
rust-version = "1.62"
[dependencies]
base64 = "0.13"
oneshot = "0.1.5"
base64 = "0.21.0"
byteorder = "1.4.3"
crc32fast = "1.2.1"
once_cell = "1.7.2"
regex ={ version = "1.5.4", default-features = false, features = ["std"] }
tantivy-fst = "0.3"
memmap2 = {version = "0.5", optional=true}
lz4_flex = { version = "0.9", default-features = false, features = ["checked-decode"], optional = true }
brotli = { version = "3.3", optional = true }
crc32fast = "1.3.2"
once_cell = "1.10.0"
regex = { version = "1.5.5", default-features = false, features = ["std", "unicode"] }
aho-corasick = "0.7"
tantivy-fst = "0.4.0"
memmap2 = { version = "0.5.3", optional = true }
lz4_flex = { version = "0.10", default-features = false, features = ["checked-decode"], optional = true }
brotli = { version = "3.3.4", optional = true }
zstd = { version = "0.12", optional = true, default-features = false }
snap = { version = "1.0.5", optional = true }
tempfile = { version = "3.2", optional = true }
log = "0.4.14"
serde = { version = "1.0.126", features = ["derive"] }
serde_json = "1.0.64"
num_cpus = "1.13"
fs2={ version = "0.4.3", optional = true }
levenshtein_automata = "0.2"
uuid = { version = "0.8.2", features = ["v4", "serde"] }
crossbeam = "0.8.1"
futures = { version = "0.3.15", features = ["thread-pool"] }
tantivy-query-grammar = { version="0.15.0", path="./query-grammar" }
tantivy-bitpacker = { version="0.1", path="./bitpacker" }
common = { version = "0.1", path = "./common/", package = "tantivy-common" }
fastfield_codecs = { version="0.1", path="./fastfield_codecs", default-features = false }
ownedbytes = { version="0.2", path="./ownedbytes" }
stable_deref_trait = "1.2"
rust-stemmers = "1.2"
downcast-rs = "1.2"
tempfile = { version = "3.3.0", optional = true }
log = "0.4.16"
serde = { version = "1.0.136", features = ["derive"] }
serde_json = "1.0.79"
num_cpus = "1.13.1"
fs2 = { version = "0.4.3", optional = true }
levenshtein_automata = "0.2.1"
uuid = { version = "1.0.0", features = ["v4", "serde"] }
crossbeam-channel = "0.5.4"
rust-stemmers = "1.2.0"
downcast-rs = "1.2.0"
bitpacking = { version = "0.8.4", default-features = false, features = ["bitpacker4x"] }
census = "0.4"
fnv = "1.0.7"
thiserror = "1.0.24"
census = "0.4.0"
rustc-hash = "1.1.0"
thiserror = "1.0.30"
htmlescape = "0.3.1"
fail = "0.5"
murmurhash32 = "0.2"
chrono = "0.4.19"
smallvec = "1.6.1"
rayon = "1.5"
lru = "0.7.0"
fastdivide = "0.3"
itertools = "0.10.0"
measure_time = "0.8.0"
fail = "0.5.0"
murmurhash32 = "0.2.0"
time = { version = "0.3.10", features = ["serde-well-known"] }
smallvec = "1.8.0"
rayon = "1.5.2"
lru = "0.9.0"
fastdivide = "0.4.0"
itertools = "0.10.3"
measure_time = "0.8.2"
async-trait = "0.1.53"
arc-swap = "1.5.0"
columnar = { version="0.1", path="./columnar", package ="tantivy-columnar" }
sstable = { version="0.1", path="./sstable", package ="tantivy-sstable", optional = true }
stacker = { version="0.1", path="./stacker", package ="tantivy-stacker" }
query-grammar = { version= "0.19.0", path="./query-grammar", package = "tantivy-query-grammar" }
tantivy-bitpacker = { version= "0.3", path="./bitpacker" }
common = { version= "0.5", path = "./common/", package = "tantivy-common" }
tokenizer-api = { version="0.1", path="./tokenizer-api", package="tantivy-tokenizer-api" }
[target.'cfg(windows)'.dependencies]
winapi = "0.3.9"
[dev-dependencies]
rand = "0.8.3"
rand = "0.8.5"
maplit = "1.0.2"
matches = "0.1.8"
proptest = "1.0"
criterion = "0.3.5"
test-log = "0.2.8"
env_logger = "0.9.0"
matches = "0.1.9"
pretty_assertions = "1.2.1"
proptest = "1.0.0"
criterion = "0.4"
test-log = "0.2.10"
env_logger = "0.10.0"
pprof = { version = "0.11.0", features = ["flamegraph", "criterion"] }
futures = "0.3.21"
paste = "1.0.11"
[dev-dependencies.fail]
version = "0.5"
version = "0.5.0"
features = ["failpoints"]
[profile.release]
@@ -82,21 +93,22 @@ debug-assertions = true
overflow-checks = true
[features]
default = ["mmap", "lz4-compression" ]
default = ["mmap", "stopwords", "lz4-compression"]
mmap = ["fs2", "tempfile", "memmap2"]
stopwords = []
brotli-compression = ["brotli"]
lz4-compression = ["lz4_flex"]
snappy-compression = ["snap"]
zstd-compression = ["zstd"]
failpoints = ["fail/failpoints"]
unstable = [] # useful for benches.
[workspace]
members = ["query-grammar", "bitpacker", "common", "fastfield_codecs", "ownedbytes"]
quickwit = ["sstable"]
[badges]
travis-ci = { repository = "tantivy-search/tantivy" }
[workspace]
members = ["query-grammar", "bitpacker", "common", "ownedbytes", "stacker", "sstable", "tokenizer-api", "columnar"]
# Following the "fail" crate best practises, we isolate
# tests that define specific behavior in fail check points
@@ -113,3 +125,8 @@ required-features = ["fail/failpoints"]
[[bench]]
name = "analyzer"
harness = false
[[bench]]
name = "index-bench"
harness = false

View File

@@ -1,3 +1,6 @@
test:
echo "Run test only... No examples."
cargo test --tests --lib
fmt:
cargo +nightly fmt --all

114
README.md
View File

@@ -1,23 +1,13 @@
[![Docs](https://docs.rs/tantivy/badge.svg)](https://docs.rs/crate/tantivy/)
[![Build Status](https://github.com/quickwit-inc/tantivy/actions/workflows/test.yml/badge.svg)](https://github.com/quickwit-inc/tantivy/actions/workflows/test.yml)
[![codecov](https://codecov.io/gh/quickwit-inc/tantivy/branch/main/graph/badge.svg)](https://codecov.io/gh/quickwit-inc/tantivy)
[![Build Status](https://github.com/quickwit-oss/tantivy/actions/workflows/test.yml/badge.svg)](https://github.com/quickwit-oss/tantivy/actions/workflows/test.yml)
[![codecov](https://codecov.io/gh/quickwit-oss/tantivy/branch/main/graph/badge.svg)](https://codecov.io/gh/quickwit-oss/tantivy)
[![Join the chat at https://discord.gg/MT27AG5EVE](https://shields.io/discord/908281611840282624?label=chat%20on%20discord)](https://discord.gg/MT27AG5EVE)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Crates.io](https://img.shields.io/crates/v/tantivy.svg)](https://crates.io/crates/tantivy)
![Tantivy](https://tantivy-search.github.io/logo/tantivy-logo.png)
[![](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/images/0)](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/links/0)
[![](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/images/1)](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/links/1)
[![](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/images/2)](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/links/2)
[![](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/images/3)](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/links/3)
[![](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/images/4)](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/links/4)
[![](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/images/5)](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/links/5)
[![](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/images/6)](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/links/6)
[![](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/images/7)](https://sourcerer.io/fame/fulmicoton/tantivy-search/tantivy/links/7)
**Tantivy** is a **full text search engine library** written in Rust.
**Tantivy** is a **full-text search engine library** written in Rust.
It is closer to [Apache Lucene](https://lucene.apache.org/) than to [Elasticsearch](https://www.elastic.co/products/elasticsearch) or [Apache Solr](https://lucene.apache.org/solr/) in the sense it is not
an off-the-shelf search engine server, but rather a crate that can be used
@@ -25,19 +15,23 @@ to build such a search engine.
Tantivy is, in fact, strongly inspired by Lucene's design.
If you are looking for an alternative to Elasticsearch or Apache Solr, check out [Quickwit](https://github.com/quickwit-oss/quickwit), our search engine built on top of Tantivy.
# Benchmark
The following [benchmark](https://tantivy-search.github.io/bench/) break downs
performance for different type of queries / collection.
The following [benchmark](https://tantivy-search.github.io/bench/) breakdowns
performance for different types of queries/collections.
Your mileage WILL vary depending on the nature of queries and their load.
<img src="doc/assets/images/searchbenchmark.png">
# Features
- Full-text search
- Configurable tokenizer (stemming available for 17 Latin languages with third party support for Chinese ([tantivy-jieba](https://crates.io/crates/tantivy-jieba) and [cang-jie](https://crates.io/crates/cang-jie)), Japanese ([lindera](https://github.com/lindera-morphology/lindera-tantivy) and [tantivy-tokenizer-tiny-segmenter](https://crates.io/crates/tantivy-tokenizer-tiny-segmenter)) and Korean ([lindera](https://github.com/lindera-morphology/lindera-tantivy) + [lindera-ko-dic-builder](https://github.com/lindera-morphology/lindera-ko-dic-builder))
- Configurable tokenizer (stemming available for 17 Latin languages) with third party support for Chinese ([tantivy-jieba](https://crates.io/crates/tantivy-jieba) and [cang-jie](https://crates.io/crates/cang-jie)), Japanese ([lindera](https://github.com/lindera-morphology/lindera-tantivy), [Vaporetto](https://crates.io/crates/vaporetto_tantivy), and [tantivy-tokenizer-tiny-segmenter](https://crates.io/crates/tantivy-tokenizer-tiny-segmenter)) and Korean ([lindera](https://github.com/lindera-morphology/lindera-tantivy) + [lindera-ko-dic-builder](https://github.com/lindera-morphology/lindera-ko-dic-builder))
- Fast (check out the :racehorse: :sparkles: [benchmark](https://tantivy-search.github.io/bench/) :sparkles: :racehorse:)
- Tiny startup time (<10ms), perfect for command line tools
- Tiny startup time (<10ms), perfect for command-line tools
- BM25 scoring (the same as Lucene)
- Natural query language (e.g. `(michael AND jackson) OR "king of pop"`)
- Phrase queries search (e.g. `"michael jackson"`)
@@ -47,33 +41,34 @@ Your mileage WILL vary depending on the nature of queries and their load.
- SIMD integer compression when the platform/CPU includes the SSE2 instruction set
- Single valued and multivalued u64, i64, and f64 fast fields (equivalent of doc values in Lucene)
- `&[u8]` fast fields
- Text, i64, u64, f64, dates, and hierarchical facet fields
- LZ4 compressed document store
- Text, i64, u64, f64, dates, ip, bool, and hierarchical facet fields
- Compressed document store (LZ4, Zstd, None, Brotli, Snap)
- Range queries
- Faceted search
- Configurable indexing (optional term frequency and position indexing)
- JSON Field
- Aggregation Collector: histogram, range buckets, average, and stats metrics
- LogMergePolicy with deletes
- Searcher Warmer API
- Cheesy logo with a horse
## Non-features
- Distributed search is out of the scope of Tantivy. That being said, Tantivy is a
library upon which one could build a distributed search. Serializable/mergeable collector state for instance,
are within the scope of Tantivy.
Distributed search is out of the scope of Tantivy, but if you are looking for this feature, check out [Quickwit](https://github.com/quickwit-oss/quickwit/).
# Getting started
Tantivy works on stable Rust (>= 1.27) and supports Linux, MacOS, and Windows.
Tantivy works on stable Rust and supports Linux, macOS, and Windows.
- [Tantivy's simple search example](https://tantivy-search.github.io/examples/basic_search.html)
- [tantivy-cli and its tutorial](https://github.com/tantivy-search/tantivy-cli) - `tantivy-cli` is an actual command line interface that makes it easy for you to create a search engine,
- [tantivy-cli and its tutorial](https://github.com/quickwit-oss/tantivy-cli) - `tantivy-cli` is an actual command-line interface that makes it easy for you to create a search engine,
index documents, and search via the CLI or a small server with a REST API.
It walks you through getting a wikipedia search engine up and running in a few minutes.
It walks you through getting a Wikipedia search engine up and running in a few minutes.
- [Reference doc for the last released version](https://docs.rs/tantivy/)
# How can I support this project?
There are many ways to support this project.
There are many ways to support this project.
- Use Tantivy and tell us about your experience on [Discord](https://discord.gg/MT27AG5EVE) or by email (paul.masurel@gmail.com)
- Report bugs
@@ -85,46 +80,63 @@ There are many ways to support this project.
# Contributing code
We use the GitHub Pull Request workflow: reference a GitHub ticket and/or include a comprehensive commit message when opening a PR.
Feel free to update CHANGELOG.md with your contribution.
## Tokenizer
When implementing a tokenizer for tantivy depend on the `tantivy-tokenizer-api` crate.
## Clone and build locally
Tantivy compiles on stable Rust but requires `Rust >= 1.27`.
Tantivy compiles on stable Rust.
To check out and run tests, you can simply run:
```bash
git clone https://github.com/quickwit-inc/tantivy.git
cd tantivy
cargo build
git clone https://github.com/quickwit-oss/tantivy.git
cd tantivy
cargo test
```
## Run tests
# Companies Using Tantivy
Some tests will not run with just `cargo test` because of `fail-rs`.
To run the tests exhaustively, run `./run-tests.sh`.
<p align="left">
<img align="center" src="doc/assets/images/etsy.png" alt="Etsy" height="25" width="auto" />&nbsp;
<img align="center" src="doc/assets/images/Nuclia.png#gh-light-mode-only" alt="Nuclia" height="25" width="auto" /> &nbsp;
<img align="center" src="doc/assets/images/humanfirst.png#gh-light-mode-only" alt="Humanfirst.ai" height="30" width="auto" />
<img align="center" src="doc/assets/images/element.io.svg#gh-light-mode-only" alt="Element.io" height="25" width="auto" />
<img align="center" src="doc/assets/images/nuclia-dark-theme.png#gh-dark-mode-only" alt="Nuclia" height="35" width="auto" /> &nbsp;
<img align="center" src="doc/assets/images/humanfirst.ai-dark-theme.png#gh-dark-mode-only" alt="Humanfirst.ai" height="25" width="auto" />&nbsp; &nbsp;
<img align="center" src="doc/assets/images/element-dark-theme.png#gh-dark-mode-only" alt="Element.io" height="25" width="auto" />
</p>
## Debug
# FAQ
You might find it useful to step through the programme with a debugger.
### Can I use Tantivy in other languages?
### A failing test
- Python → [tantivy-py](https://github.com/quickwit-oss/tantivy-py)
- Ruby → [tantiny](https://github.com/baygeldin/tantiny)
Make sure you haven't run `cargo clean` after the most recent `cargo test` or `cargo build` to guarantee that the `target/` directory exists. Use this bash script to find the name of the most recent debug build of Tantivy and run it under `rust-gdb`:
You can also find other bindings on [GitHub](https://github.com/search?q=tantivy) but they may be less maintained.
```bash
find target/debug/ -maxdepth 1 -executable -type f -name "tantivy*" -printf '%TY-%Tm-%Td %TT %p\n' | sort -r | cut -d " " -f 3 | xargs -I RECENT_DBG_TANTIVY rust-gdb RECENT_DBG_TANTIVY
```
### What are some examples of Tantivy use?
Now that you are in `rust-gdb`, you can set breakpoints on lines and methods that match your source code and run the debug executable with flags that you normally pass to `cargo test` like this:
- [seshat](https://github.com/matrix-org/seshat/): A matrix message database/indexer
- [tantiny](https://github.com/baygeldin/tantiny): Tiny full-text search for Ruby
- [lnx](https://github.com/lnx-search/lnx): adaptable, typo tolerant search engine with a REST API
- and [more](https://github.com/search?q=tantivy)!
```bash
$gdb run --test-threads 1 --test $NAME_OF_TEST
```
### On average, how much faster is Tantivy compared to Lucene?
### An example
- According to our [search latency benchmark](https://tantivy-search.github.io/bench/), Tantivy is approximately 2x faster than Lucene.
By default, `rustc` compiles everything in the `examples/` directory in debug mode. This makes it easy for you to make examples to reproduce bugs:
### Does tantivy support incremental indexing?
```bash
rust-gdb target/debug/examples/$EXAMPLE_NAME
$ gdb run
```
- Yes.
### How can I edit documents?
- Data in tantivy is immutable. To edit a document, the document needs to be deleted and reindexed.
### When will my documents be searchable during indexing?
- Documents will be searchable after a `commit` is called on an `IndexWriter`. Existing `IndexReader`s will also need to be reloaded in order to reflect the changes. Finally, changes are only visible to newly acquired `Searcher`.

18
TODO.txt Normal file
View File

@@ -0,0 +1,18 @@
Make schema_builder API fluent.
fix doc serialization and prevent compression problems
u64 , etc. shoudl return Resutl<Option> now that we support optional missing a column is really not an error
remove fastfield codecs
ditch the first_or_default trick. if it is still useful, improve its implementation.
rename FastFieldReaders::open to load
remove fast field reader
find a way to unify the two DateTime.
readd type check in the filter wrapper
add unit test on columnar list columns.
make sure sort works

100000
benches/hdfs.json Normal file

File diff suppressed because it is too large Load Diff

121
benches/index-bench.rs Normal file
View File

@@ -0,0 +1,121 @@
use criterion::{criterion_group, criterion_main, Criterion};
use pprof::criterion::{Output, PProfProfiler};
use tantivy::schema::{INDEXED, STORED, STRING, TEXT};
use tantivy::Index;
const HDFS_LOGS: &str = include_str!("hdfs.json");
const NUM_REPEATS: usize = 2;
pub fn hdfs_index_benchmark(c: &mut Criterion) {
let schema = {
let mut schema_builder = tantivy::schema::SchemaBuilder::new();
schema_builder.add_u64_field("timestamp", INDEXED);
schema_builder.add_text_field("body", TEXT);
schema_builder.add_text_field("severity", STRING);
schema_builder.build()
};
let schema_with_store = {
let mut schema_builder = tantivy::schema::SchemaBuilder::new();
schema_builder.add_u64_field("timestamp", INDEXED | STORED);
schema_builder.add_text_field("body", TEXT | STORED);
schema_builder.add_text_field("severity", STRING | STORED);
schema_builder.build()
};
let dynamic_schema = {
let mut schema_builder = tantivy::schema::SchemaBuilder::new();
schema_builder.add_json_field("json", TEXT);
schema_builder.build()
};
let mut group = c.benchmark_group("index-hdfs");
group.sample_size(20);
group.bench_function("index-hdfs-no-commit", |b| {
b.iter(|| {
let index = Index::create_in_ram(schema.clone());
let index_writer = index.writer_with_num_threads(1, 100_000_000).unwrap();
for _ in 0..NUM_REPEATS {
for doc_json in HDFS_LOGS.trim().split('\n') {
let doc = schema.parse_document(doc_json).unwrap();
index_writer.add_document(doc).unwrap();
}
}
})
});
group.bench_function("index-hdfs-with-commit", |b| {
b.iter(|| {
let index = Index::create_in_ram(schema.clone());
let mut index_writer = index.writer_with_num_threads(1, 100_000_000).unwrap();
for _ in 0..NUM_REPEATS {
for doc_json in HDFS_LOGS.trim().split('\n') {
let doc = schema.parse_document(doc_json).unwrap();
index_writer.add_document(doc).unwrap();
}
}
index_writer.commit().unwrap();
})
});
group.bench_function("index-hdfs-no-commit-with-docstore", |b| {
b.iter(|| {
let index = Index::create_in_ram(schema_with_store.clone());
let index_writer = index.writer_with_num_threads(1, 100_000_000).unwrap();
for _ in 0..NUM_REPEATS {
for doc_json in HDFS_LOGS.trim().split('\n') {
let doc = schema.parse_document(doc_json).unwrap();
index_writer.add_document(doc).unwrap();
}
}
})
});
group.bench_function("index-hdfs-with-commit-with-docstore", |b| {
b.iter(|| {
let index = Index::create_in_ram(schema_with_store.clone());
let mut index_writer = index.writer_with_num_threads(1, 100_000_000).unwrap();
for _ in 0..NUM_REPEATS {
for doc_json in HDFS_LOGS.trim().split('\n') {
let doc = schema.parse_document(doc_json).unwrap();
index_writer.add_document(doc).unwrap();
}
}
index_writer.commit().unwrap();
})
});
group.bench_function("index-hdfs-no-commit-json-without-docstore", |b| {
b.iter(|| {
let index = Index::create_in_ram(dynamic_schema.clone());
let json_field = dynamic_schema.get_field("json").unwrap();
let mut index_writer = index.writer_with_num_threads(1, 100_000_000).unwrap();
for _ in 0..NUM_REPEATS {
for doc_json in HDFS_LOGS.trim().split('\n') {
let json_val: serde_json::Map<String, serde_json::Value> =
serde_json::from_str(doc_json).unwrap();
let doc = tantivy::doc!(json_field=>json_val);
index_writer.add_document(doc).unwrap();
}
}
index_writer.commit().unwrap();
})
});
group.bench_function("index-hdfs-with-commit-json-without-docstore", |b| {
b.iter(|| {
let index = Index::create_in_ram(dynamic_schema.clone());
let json_field = dynamic_schema.get_field("json").unwrap();
let mut index_writer = index.writer_with_num_threads(1, 100_000_000).unwrap();
for _ in 0..NUM_REPEATS {
for doc_json in HDFS_LOGS.trim().split('\n') {
let json_val: serde_json::Map<String, serde_json::Value> =
serde_json::from_str(doc_json).unwrap();
let doc = tantivy::doc!(json_field=>json_val);
index_writer.add_document(doc).unwrap();
}
}
index_writer.commit().unwrap();
})
});
}
criterion_group! {
name = benches;
config = Criterion::default().with_profiler(PProfProfiler::new(100, Output::Flamegraph(None)));
targets = hdfs_index_benchmark
}
criterion_main!(benches);

View File

@@ -1,15 +1,21 @@
[package]
name = "tantivy-bitpacker"
version = "0.1.1"
edition = "2018"
version = "0.3.0"
edition = "2021"
authors = ["Paul Masurel <paul.masurel@gmail.com>"]
license = "MIT"
categories = []
description = """Tantivy-sub crate: bitpacking"""
repository = "https://github.com/quickwit-inc/tantivy"
repository = "https://github.com/quickwit-oss/tantivy"
keywords = []
documentation = "https://docs.rs/tantivy-bitpacker/latest/tantivy_bitpacker"
homepage = "https://github.com/quickwit-oss/tantivy"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
[dev-dependencies]
rand = "0.8"
proptest = "1"

View File

@@ -4,8 +4,39 @@ extern crate test;
#[cfg(test)]
mod tests {
use tantivy_bitpacker::BlockedBitpacker;
use rand::seq::IteratorRandom;
use rand::thread_rng;
use tantivy_bitpacker::{BitPacker, BitUnpacker, BlockedBitpacker};
use test::Bencher;
#[inline(never)]
fn create_bitpacked_data(bit_width: u8, num_els: u32) -> Vec<u8> {
let mut bitpacker = BitPacker::new();
let mut buffer = Vec::new();
for _ in 0..num_els {
// the values do not matter.
bitpacker.write(0u64, bit_width, &mut buffer).unwrap();
bitpacker.flush(&mut buffer).unwrap();
}
buffer
}
#[bench]
fn bench_bitpacking_read(b: &mut Bencher) {
let bit_width = 3;
let num_els = 1_000_000u32;
let bit_unpacker = BitUnpacker::new(bit_width);
let data = create_bitpacked_data(bit_width, num_els);
let idxs: Vec<u32> = (0..num_els).choose_multiple(&mut thread_rng(), 100_000);
b.iter(|| {
let mut out = 0u64;
for &idx in &idxs {
out = out.wrapping_add(bit_unpacker.get(idx, &data[..]));
}
out
});
}
#[bench]
fn bench_blockedbitp_read(b: &mut Bencher) {
let mut blocked_bitpacker = BlockedBitpacker::new();
@@ -13,13 +44,14 @@ mod tests {
blocked_bitpacker.add(val * val);
}
b.iter(|| {
let mut out = 0;
let mut out = 0u64;
for val in 0..=21500 {
out = blocked_bitpacker.get(val);
out = out.wrapping_add(blocked_bitpacker.get(val));
}
out
});
}
#[bench]
fn bench_blockedbitp_create(b: &mut Bencher) {
b.iter(|| {

View File

@@ -1,4 +1,5 @@
use std::{convert::TryInto, io};
use std::convert::TryInto;
use std::io;
pub struct BitPacker {
mini_buffer: u64,
@@ -18,21 +19,20 @@ impl BitPacker {
}
#[inline]
pub fn write<TWrite: io::Write>(
pub fn write<TWrite: io::Write + ?Sized>(
&mut self,
val: u64,
num_bits: u8,
output: &mut TWrite,
) -> io::Result<()> {
let val_u64 = val as u64;
let num_bits = num_bits as usize;
if self.mini_buffer_written + num_bits > 64 {
self.mini_buffer |= val_u64.wrapping_shl(self.mini_buffer_written as u32);
self.mini_buffer |= val.wrapping_shl(self.mini_buffer_written as u32);
output.write_all(self.mini_buffer.to_le_bytes().as_ref())?;
self.mini_buffer = val_u64.wrapping_shr((64 - self.mini_buffer_written) as u32);
self.mini_buffer = val.wrapping_shr((64 - self.mini_buffer_written) as u32);
self.mini_buffer_written = self.mini_buffer_written + num_bits - 64;
} else {
self.mini_buffer |= val_u64 << self.mini_buffer_written;
self.mini_buffer |= val << self.mini_buffer_written;
self.mini_buffer_written += num_bits;
if self.mini_buffer_written == 64 {
output.write_all(self.mini_buffer.to_le_bytes().as_ref())?;
@@ -43,7 +43,7 @@ impl BitPacker {
Ok(())
}
pub fn flush<TWrite: io::Write>(&mut self, output: &mut TWrite) -> io::Result<()> {
pub fn flush<TWrite: io::Write + ?Sized>(&mut self, output: &mut TWrite) -> io::Result<()> {
if self.mini_buffer_written > 0 {
let num_bytes = (self.mini_buffer_written + 7) / 8;
let bytes = self.mini_buffer.to_le_bytes();
@@ -54,53 +54,69 @@ impl BitPacker {
Ok(())
}
pub fn close<TWrite: io::Write>(&mut self, output: &mut TWrite) -> io::Result<()> {
pub fn close<TWrite: io::Write + ?Sized>(&mut self, output: &mut TWrite) -> io::Result<()> {
self.flush(output)?;
// Padding the write file to simplify reads.
output.write_all(&[0u8; 7])?;
Ok(())
}
}
#[derive(Clone, Debug, Default)]
#[derive(Clone, Debug, Default, Copy)]
pub struct BitUnpacker {
num_bits: u64,
num_bits: u32,
mask: u64,
}
impl BitUnpacker {
/// Creates a bit unpacker, that assumes the same bitwidth for all values.
///
/// The bitunpacker works by doing an unaligned read of 8 bytes.
/// For this reason, values of `num_bits` between
/// [57..63] are forbidden.
pub fn new(num_bits: u8) -> BitUnpacker {
assert!(num_bits <= 7 * 8 || num_bits == 64);
let mask: u64 = if num_bits == 64 {
!0u64
} else {
(1u64 << num_bits) - 1u64
};
BitUnpacker {
num_bits: u64::from(num_bits),
num_bits: u32::from(num_bits),
mask,
}
}
pub fn bit_width(&self) -> u8 {
self.num_bits as u8
}
#[inline]
pub fn get(&self, idx: u64, data: &[u8]) -> u64 {
if self.num_bits == 0 {
return 0u64;
pub fn get(&self, idx: u32, data: &[u8]) -> u64 {
let addr_in_bits = idx * self.num_bits;
let addr = (addr_in_bits >> 3) as usize;
if addr + 8 > data.len() {
if self.num_bits == 0 {
return 0;
}
let bit_shift = addr_in_bits & 7;
return self.get_slow_path(addr, bit_shift, data);
}
let num_bits = self.num_bits;
let mask = self.mask;
let addr_in_bits = idx * num_bits;
let addr = addr_in_bits >> 3;
let bit_shift = addr_in_bits & 7;
debug_assert!(
addr + 8 <= data.len() as u64,
"The fast field field should have been padded with 7 bytes."
);
let bytes: [u8; 8] = (&data[(addr as usize)..(addr as usize) + 8])
.try_into()
.unwrap();
let bytes: [u8; 8] = (&data[addr..addr + 8]).try_into().unwrap();
let val_unshifted_unmasked: u64 = u64::from_le_bytes(bytes);
let val_shifted = (val_unshifted_unmasked >> bit_shift) as u64;
val_shifted & mask
let val_shifted = val_unshifted_unmasked >> bit_shift;
val_shifted & self.mask
}
#[inline(never)]
fn get_slow_path(&self, addr: usize, bit_shift: u32, data: &[u8]) -> u64 {
let mut bytes: [u8; 8] = [0u8; 8];
let available_bytes = data.len() - addr;
// This function is meant to only be called if we did not have 8 bytes to load.
debug_assert!(available_bytes < 8);
bytes[..available_bytes].copy_from_slice(&data[addr..]);
let val_unshifted_unmasked: u64 = u64::from_le_bytes(bytes);
let val_shifted = val_unshifted_unmasked >> bit_shift;
val_shifted & self.mask
}
}
@@ -108,7 +124,7 @@ impl BitUnpacker {
mod test {
use super::{BitPacker, BitUnpacker};
fn create_fastfield_bitpacker(len: usize, num_bits: u8) -> (BitUnpacker, Vec<u64>, Vec<u8>) {
fn create_bitpacker(len: usize, num_bits: u8) -> (BitUnpacker, Vec<u64>, Vec<u8>) {
let mut data = Vec::new();
let mut bitpacker = BitPacker::new();
let max_val: u64 = (1u64 << num_bits as u64) - 1u64;
@@ -119,15 +135,15 @@ mod test {
bitpacker.write(val, num_bits, &mut data).unwrap();
}
bitpacker.close(&mut data).unwrap();
assert_eq!(data.len(), ((num_bits as usize) * len + 7) / 8 + 7);
assert_eq!(data.len(), ((num_bits as usize) * len + 7) / 8);
let bitunpacker = BitUnpacker::new(num_bits);
(bitunpacker, vals, data)
}
fn test_bitpacker_util(len: usize, num_bits: u8) {
let (bitunpacker, vals, data) = create_fastfield_bitpacker(len, num_bits);
let (bitunpacker, vals, data) = create_bitpacker(len, num_bits);
for (i, val) in vals.iter().enumerate() {
assert_eq!(bitunpacker.get(i as u64, &data), *val);
assert_eq!(bitunpacker.get(i as u32, &data), *val);
}
}
@@ -139,4 +155,49 @@ mod test {
test_bitpacker_util(6, 14);
test_bitpacker_util(1000, 14);
}
use proptest::prelude::*;
fn num_bits_strategy() -> impl Strategy<Value = u8> {
prop_oneof!(Just(0), Just(1), 2u8..56u8, Just(56), Just(64),)
}
fn vals_strategy() -> impl Strategy<Value = (u8, Vec<u64>)> {
(num_bits_strategy(), 0usize..100usize).prop_flat_map(|(num_bits, len)| {
let max_val = if num_bits == 64 {
u64::MAX
} else {
(1u64 << num_bits as u32) - 1
};
let vals = proptest::collection::vec(0..=max_val, len);
vals.prop_map(move |vals| (num_bits, vals))
})
}
fn test_bitpacker_aux(num_bits: u8, vals: &[u64]) {
let mut buffer: Vec<u8> = Vec::new();
let mut bitpacker = BitPacker::new();
for &val in vals {
bitpacker.write(val, num_bits, &mut buffer).unwrap();
}
bitpacker.flush(&mut buffer).unwrap();
assert_eq!(buffer.len(), (vals.len() * num_bits as usize + 7) / 8);
let bitunpacker = BitUnpacker::new(num_bits);
let max_val = if num_bits == 64 {
u64::MAX
} else {
(1u64 << num_bits) - 1
};
for (i, val) in vals.iter().copied().enumerate() {
assert!(val <= max_val);
assert_eq!(bitunpacker.get(i as u32, &buffer), val);
}
}
proptest::proptest! {
#[test]
fn test_bitpacker_proptest((num_bits, vals) in vals_strategy()) {
test_bitpacker_aux(num_bits, &vals);
}
}
}

View File

@@ -1,12 +1,11 @@
use super::bitpacker::BitPacker;
use super::compute_num_bits;
use crate::{minmax, BitUnpacker};
use super::{bitpacker::BitPacker, compute_num_bits};
const BLOCK_SIZE: usize = 128;
/// `BlockedBitpacker` compresses data in blocks of
/// 128 elements, while keeping an index on it
///
#[derive(Debug, Clone)]
pub struct BlockedBitpacker {
// bitpacked blocks
@@ -59,6 +58,10 @@ fn metadata_test() {
assert_eq!(meta.num_bits(), 6);
}
fn mem_usage<T>(items: &Vec<T>) -> usize {
items.capacity() * std::mem::size_of::<T>()
}
impl BlockedBitpacker {
pub fn new() -> Self {
let mut compressed_blocks = vec![];
@@ -74,16 +77,14 @@ impl BlockedBitpacker {
pub fn mem_usage(&self) -> usize {
std::mem::size_of::<BlockedBitpacker>()
+ self.compressed_blocks.capacity()
+ self.offset_and_bits.capacity()
* std::mem::size_of_val(&self.offset_and_bits.get(0).cloned().unwrap_or_default())
+ self.buffer.capacity()
* std::mem::size_of_val(&self.buffer.get(0).cloned().unwrap_or_default())
+ mem_usage(&self.offset_and_bits)
+ mem_usage(&self.buffer)
}
#[inline]
pub fn add(&mut self, val: u64) {
self.buffer.push(val);
if self.buffer.len() == BLOCK_SIZE as usize {
if self.buffer.len() == BLOCK_SIZE {
self.flush();
}
}
@@ -125,11 +126,11 @@ impl BlockedBitpacker {
}
#[inline]
pub fn get(&self, idx: usize) -> u64 {
let metadata_pos = idx / BLOCK_SIZE as usize;
let pos_in_block = idx % BLOCK_SIZE as usize;
let metadata_pos = idx / BLOCK_SIZE;
let pos_in_block = idx % BLOCK_SIZE;
if let Some(metadata) = self.offset_and_bits.get(metadata_pos) {
let unpacked = BitUnpacker::new(metadata.num_bits()).get(
pos_in_block as u64,
pos_in_block as u32,
&self.compressed_blocks[metadata.offset() as usize..],
);
unpacked + metadata.base_value()

View File

@@ -1,8 +1,9 @@
mod bitpacker;
mod blocked_bitpacker;
pub use crate::bitpacker::BitPacker;
pub use crate::bitpacker::BitUnpacker;
use std::cmp::Ordering;
pub use crate::bitpacker::{BitPacker, BitUnpacker};
pub use crate::blocked_bitpacker::BlockedBitpacker;
/// Computes the number of bits that will be used for bitpacking.
@@ -38,44 +39,104 @@ pub fn compute_num_bits(n: u64) -> u8 {
}
}
/// Computes the (min, max) of an iterator of `PartialOrd` values.
///
/// For values implementing `Ord` (in a way consistent to their `PartialOrd` impl),
/// this function behaves as expected.
///
/// For values with partial ordering, the behavior is non-trivial and may
/// depends on the order of the values.
/// For floats however, it simply returns the same results as if NaN were
/// skipped.
pub fn minmax<I, T>(mut vals: I) -> Option<(T, T)>
where
I: Iterator<Item = T>,
T: Copy + Ord,
T: Copy + PartialOrd,
{
if let Some(first_el) = vals.next() {
return Some(vals.fold((first_el, first_el), |(min_val, max_val), el| {
(min_val.min(el), max_val.max(el))
}));
let first_el = vals.find(|val| {
// We use this to make sure we skip all NaN values when
// working with a float type.
val.partial_cmp(val) == Some(Ordering::Equal)
})?;
let mut min_so_far: T = first_el;
let mut max_so_far: T = first_el;
for val in vals {
if val.partial_cmp(&min_so_far) == Some(Ordering::Less) {
min_so_far = val;
}
if val.partial_cmp(&max_so_far) == Some(Ordering::Greater) {
max_so_far = val;
}
}
None
Some((min_so_far, max_so_far))
}
#[test]
fn test_compute_num_bits() {
assert_eq!(compute_num_bits(1), 1u8);
assert_eq!(compute_num_bits(0), 0u8);
assert_eq!(compute_num_bits(2), 2u8);
assert_eq!(compute_num_bits(3), 2u8);
assert_eq!(compute_num_bits(4), 3u8);
assert_eq!(compute_num_bits(255), 8u8);
assert_eq!(compute_num_bits(256), 9u8);
assert_eq!(compute_num_bits(5_000_000_000), 33u8);
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_minmax_empty() {
let vals: Vec<u32> = vec![];
assert_eq!(minmax(vals.into_iter()), None);
}
#[test]
fn test_compute_num_bits() {
assert_eq!(compute_num_bits(1), 1u8);
assert_eq!(compute_num_bits(0), 0u8);
assert_eq!(compute_num_bits(2), 2u8);
assert_eq!(compute_num_bits(3), 2u8);
assert_eq!(compute_num_bits(4), 3u8);
assert_eq!(compute_num_bits(255), 8u8);
assert_eq!(compute_num_bits(256), 9u8);
assert_eq!(compute_num_bits(5_000_000_000), 33u8);
}
#[test]
fn test_minmax_one() {
assert_eq!(minmax(vec![1].into_iter()), Some((1, 1)));
}
#[test]
fn test_minmax_empty() {
let vals: Vec<u32> = vec![];
assert_eq!(minmax(vals.into_iter()), None);
}
#[test]
fn test_minmax_two() {
assert_eq!(minmax(vec![1, 2].into_iter()), Some((1, 2)));
assert_eq!(minmax(vec![2, 1].into_iter()), Some((1, 2)));
#[test]
fn test_minmax_one() {
assert_eq!(minmax(vec![1].into_iter()), Some((1, 1)));
}
#[test]
fn test_minmax_two() {
assert_eq!(minmax(vec![1, 2].into_iter()), Some((1, 2)));
assert_eq!(minmax(vec![2, 1].into_iter()), Some((1, 2)));
}
#[test]
fn test_minmax_nan() {
assert_eq!(
minmax(vec![f64::NAN, 1f64, 2f64].into_iter()),
Some((1f64, 2f64))
);
assert_eq!(
minmax(vec![2f64, f64::NAN, 1f64].into_iter()),
Some((1f64, 2f64))
);
assert_eq!(
minmax(vec![2f64, 1f64, f64::NAN].into_iter()),
Some((1f64, 2f64))
);
}
#[test]
fn test_minmax_inf() {
assert_eq!(
minmax(vec![f64::INFINITY, 1f64, 2f64].into_iter()),
Some((1f64, f64::INFINITY))
);
assert_eq!(
minmax(vec![-f64::INFINITY, 1f64, 2f64].into_iter()),
Some((-f64::INFINITY, 2f64))
);
assert_eq!(
minmax(vec![2f64, f64::INFINITY, 1f64].into_iter()),
Some((1f64, f64::INFINITY))
);
assert_eq!(
minmax(vec![2f64, 1f64, -f64::INFINITY].into_iter()),
Some((-f64::INFINITY, 2f64))
);
}
}

View File

@@ -1,23 +0,0 @@
# This script takes care of packaging the build artifacts that will go in the
# release zipfile
$SRC_DIR = $PWD.Path
$STAGE = [System.Guid]::NewGuid().ToString()
Set-Location $ENV:Temp
New-Item -Type Directory -Name $STAGE
Set-Location $STAGE
$ZIP = "$SRC_DIR\$($Env:CRATE_NAME)-$($Env:APPVEYOR_REPO_TAG_NAME)-$($Env:TARGET).zip"
# TODO Update this to package the right artifacts
Copy-Item "$SRC_DIR\target\$($Env:TARGET)\release\hello.exe" '.\'
7z a "$ZIP" *
Push-AppveyorArtifact "$ZIP"
Remove-Item *.* -Force
Set-Location ..
Remove-Item $STAGE
Set-Location $SRC_DIR

View File

@@ -1,33 +0,0 @@
# This script takes care of building your crate and packaging it for release
set -ex
main() {
local src=$(pwd) \
stage=
case $TRAVIS_OS_NAME in
linux)
stage=$(mktemp -d)
;;
osx)
stage=$(mktemp -d -t tmp)
;;
esac
test -f Cargo.lock || cargo generate-lockfile
# TODO Update this to build the artifacts that matter to you
cross rustc --bin hello --target $TARGET --release -- -C lto
# TODO Update this to package the right artifacts
cp target/$TARGET/release/hello $stage/
cd $stage
tar czf $src/$CRATE_NAME-$TRAVIS_TAG-$TARGET.tar.gz *
cd $src
rm -rf $stage
}
main

View File

@@ -1,47 +0,0 @@
set -ex
main() {
local target=
if [ $TRAVIS_OS_NAME = linux ]; then
target=x86_64-unknown-linux-musl
sort=sort
else
target=x86_64-apple-darwin
sort=gsort # for `sort --sort-version`, from brew's coreutils.
fi
# Builds for iOS are done on OSX, but require the specific target to be
# installed.
case $TARGET in
aarch64-apple-ios)
rustup target install aarch64-apple-ios
;;
armv7-apple-ios)
rustup target install armv7-apple-ios
;;
armv7s-apple-ios)
rustup target install armv7s-apple-ios
;;
i386-apple-ios)
rustup target install i386-apple-ios
;;
x86_64-apple-ios)
rustup target install x86_64-apple-ios
;;
esac
# This fetches latest stable release
local tag=$(git ls-remote --tags --refs --exit-code https://github.com/japaric/cross \
| cut -d/ -f3 \
| grep -E '^v[0.1.0-9.]+$' \
| $sort --version-sort \
| tail -n1)
curl -LSfs https://japaric.github.io/trust/install.sh | \
sh -s -- \
--force \
--git japaric/cross \
--tag $tag \
--target $target
}
main

View File

@@ -1,30 +0,0 @@
#!/usr/bin/env bash
# This script takes care of testing your crate
set -ex
main() {
if [ ! -z $CODECOV ]; then
echo "Codecov"
cargo build --verbose && cargo coverage --verbose --all && bash <(curl -s https://codecov.io/bash) -s target/kcov
else
echo "Build"
cross build --target $TARGET
if [ ! -z $DISABLE_TESTS ]; then
return
fi
echo "Test"
cross test --target $TARGET --no-default-features --features mmap
cross test --target $TARGET --no-default-features --features mmap query-grammar
fi
for example in $(ls examples/*.rs)
do
cargo run --example $(basename $example .rs)
done
}
# we don't run the "test phase" when doing deploys
if [ -z $TRAVIS_TAG ]; then
main
fi

28
columnar/Cargo.toml Normal file
View File

@@ -0,0 +1,28 @@
[package]
name = "tantivy-columnar"
version = "0.1.0"
edition = "2021"
license = "MIT"
[dependencies]
itertools = "0.10.5"
log = "0.4.17"
fnv = "1.0.7"
fastdivide = "0.4.0"
rand = { version = "0.8.5", optional = true }
measure_time = { version = "0.8.2", optional = true }
prettytable-rs = { version = "0.10.0", optional = true }
stacker = { path = "../stacker", package="tantivy-stacker"}
sstable = { path = "../sstable", package = "tantivy-sstable" }
common = { path = "../common", package = "tantivy-common" }
tantivy-bitpacker = { version= "0.3", path = "../bitpacker/" }
serde = "1.0.152"
[dev-dependencies]
proptest = "1"
more-asserts = "0.3.1"
rand = "0.8.5"
[features]
unstable = []

109
columnar/README.md Normal file
View File

@@ -0,0 +1,109 @@
# Columnar format
This crate describes columnar format used in tantivy.
## Goals
This format is special in the following way.
- it needs to be compact
- accessing a specific column does not require to load the entire columnar. It can be done in 2 to 3 random access.
- columns of several types can be associated with the same column name.
- it needs to support columns with different types `(str, u64, i64, f64)`
and different cardinality `(required, optional, multivalued)`.
- columns, once loaded, offer cheap random access.
- it is designed to allow range queries.
# Coercion rules
Users can create a columnar by inserting rows to a `ColumnarWriter`,
and serializing it into a `Write` object.
Nothing prevents a user from recording values with different type to the same `column_name`.
In that case, `tantivy-columnar`'s behavior is as follows:
- JsonValues are grouped into 3 types (String, Number, bool).
Values that corresponds to different groups are mapped to different columns. For instance, String values are treated independently
from Number or boolean values. `tantivy-columnar` will simply emit several columns associated to a given column_name.
- Only one column for a given json value type is emitted. If number values with different number types are recorded (e.g. u64, i64, f64),
`tantivy-columnar` will pick the first type that can represents the set of appended value, with the following prioriy order (`i64`, `u64`, `f64`).
`i64` is picked over `u64` as it is likely to yield less change of types. Most use cases strictly requiring `u64` show the
restriction on 50% of the values (e.g. a 64-bit hash). On the other hand, a lot of use cases can show rare negative value.
# Columnar format
This columnar format may have more than one column (with different types) associated to the same `column_name` (see [Coercion rules](#coercion-rules) above).
The `(column_name, columne_type)` couple however uniquely identifies a column.
That couple is serialized as a column `column_key`. The format of that key is:
`[column_name][ZERO_BYTE][column_type_header: u8]`
```
COLUMNAR:=
[COLUMNAR_DATA]
[COLUMNAR_KEY_TO_DATA_INDEX]
[COLUMNAR_FOOTER];
# Columns are sorted by their column key.
COLUMNAR_DATA:=
[COLUMN_DATA]+;
COLUMNAR_FOOTER := [RANGE_SSTABLE_BYTES_LEN: 8 bytes little endian]
```
The columnar file starts by the actual column data, concatenated one after the other,
sorted by column key.
A sstable associates
`(column name, column_cardinality, column_type) to range of bytes.
Column name may not contain the zero byte `\0`.
Listing all columns associated to `column_name` can therefore
be done by listing all keys prefixed by
`[column_name][ZERO_BYTE]`
The associated range of bytes refer to a range of bytes
This crate exposes a columnar format for tantivy.
This format is described in README.md
The crate introduces the following concepts.
`Columnar` is an equivalent of a dataframe.
It maps `column_key` to `Column`.
A `Column<T>` asssociates a `RowId` (u32) to any
number of values.
This is made possible by wrapping a `ColumnIndex` and a `ColumnValue` object.
The `ColumnValue<T>` represents a mapping that associates each `RowId` to
exactly one single value.
The `ColumnIndex` then maps each RowId to a set of `RowId` in the
`ColumnValue`.
For optimization, and compression purposes, the `ColumnIndex` has three
possible representation, each for different cardinalities.
- Full
All RowId have exactly one value. The ColumnIndex is the trivial mapping.
- Optional
All RowIds can have at most one value. The ColumnIndex is the trivial mapping `ColumnRowId -> Option<ColumnValueRowId>`.
- Multivalued
All RowIds can have any number of values.
The column index is mapping values to a range.
All these objects are implemented an unit tested independently
in their own module:
- columnar
- column_index
- column_values
- column

View File

@@ -0,0 +1,124 @@
#![feature(test)]
use std::ops::RangeInclusive;
use std::sync::Arc;
use common::OwnedBytes;
use rand::rngs::StdRng;
use rand::seq::SliceRandom;
use rand::{random, Rng, SeedableRng};
use tantivy_columnar::ColumnValues;
use test::Bencher;
extern crate test;
// TODO does this make sense for IPv6 ?
fn generate_random() -> Vec<u64> {
let mut permutation: Vec<u64> = (0u64..100_000u64)
.map(|el| el + random::<u16>() as u64)
.collect();
permutation.shuffle(&mut StdRng::from_seed([1u8; 32]));
permutation
}
fn get_u128_column_random() -> Arc<dyn ColumnValues<u128>> {
let permutation = generate_random();
let permutation = permutation.iter().map(|el| *el as u128).collect::<Vec<_>>();
get_u128_column_from_data(&permutation)
}
fn get_u128_column_from_data(data: &[u128]) -> Arc<dyn ColumnValues<u128>> {
let mut out = vec![];
tantivy_columnar::column_values::serialize_column_values_u128(&data, &mut out).unwrap();
let out = OwnedBytes::new(out);
tantivy_columnar::column_values::open_u128_mapped::<u128>(out).unwrap()
}
const FIFTY_PERCENT_RANGE: RangeInclusive<u64> = 1..=50;
const SINGLE_ITEM: u64 = 90;
const SINGLE_ITEM_RANGE: RangeInclusive<u64> = 90..=90;
fn get_data_50percent_item() -> Vec<u128> {
let mut rng = StdRng::from_seed([1u8; 32]);
let mut data = vec![];
for _ in 0..300_000 {
let val = rng.gen_range(1..=100);
data.push(val);
}
data.push(SINGLE_ITEM);
data.shuffle(&mut rng);
let data = data.iter().map(|el| *el as u128).collect::<Vec<_>>();
data
}
#[bench]
fn bench_intfastfield_getrange_u128_50percent_hit(b: &mut Bencher) {
let data = get_data_50percent_item();
let column = get_u128_column_from_data(&data);
b.iter(|| {
let mut positions = Vec::new();
column.get_row_ids_for_value_range(
*FIFTY_PERCENT_RANGE.start() as u128..=*FIFTY_PERCENT_RANGE.end() as u128,
0..data.len() as u32,
&mut positions,
);
positions
});
}
#[bench]
fn bench_intfastfield_getrange_u128_single_hit(b: &mut Bencher) {
let data = get_data_50percent_item();
let column = get_u128_column_from_data(&data);
b.iter(|| {
let mut positions = Vec::new();
column.get_row_ids_for_value_range(
*SINGLE_ITEM_RANGE.start() as u128..=*SINGLE_ITEM_RANGE.end() as u128,
0..data.len() as u32,
&mut positions,
);
positions
});
}
#[bench]
fn bench_intfastfield_getrange_u128_hit_all(b: &mut Bencher) {
let data = get_data_50percent_item();
let column = get_u128_column_from_data(&data);
b.iter(|| {
let mut positions = Vec::new();
column.get_row_ids_for_value_range(0..=u128::MAX, 0..data.len() as u32, &mut positions);
positions
});
}
// U128 RANGE END
#[bench]
fn bench_intfastfield_scan_all_fflookup_u128(b: &mut Bencher) {
let column = get_u128_column_random();
b.iter(|| {
let mut a = 0u128;
for i in 0u64..column.num_vals() as u64 {
a += column.get_val(i as u32);
}
a
});
}
#[bench]
fn bench_intfastfield_jumpy_stride5_u128(b: &mut Bencher) {
let column = get_u128_column_random();
b.iter(|| {
let n = column.num_vals();
let mut a = 0u128;
for i in (0..n / 5).map(|val| val * 5) {
a += column.get_val(i);
}
a
});
}

View File

@@ -0,0 +1,211 @@
#![feature(test)]
extern crate test;
use std::ops::RangeInclusive;
use std::sync::Arc;
use rand::prelude::*;
use tantivy_columnar::column_values::{serialize_and_load_u64_based_column_values, CodecType};
use tantivy_columnar::*;
use test::Bencher;
// Warning: this generates the same permutation at each call
fn generate_permutation() -> Vec<u64> {
let mut permutation: Vec<u64> = (0u64..100_000u64).collect();
permutation.shuffle(&mut StdRng::from_seed([1u8; 32]));
permutation
}
fn generate_random() -> Vec<u64> {
let mut permutation: Vec<u64> = (0u64..100_000u64)
.map(|el| el + random::<u16>() as u64)
.collect();
permutation.shuffle(&mut StdRng::from_seed([1u8; 32]));
permutation
}
// Warning: this generates the same permutation at each call
fn generate_permutation_gcd() -> Vec<u64> {
let mut permutation: Vec<u64> = (1u64..100_000u64).map(|el| el * 1000).collect();
permutation.shuffle(&mut StdRng::from_seed([1u8; 32]));
permutation
}
pub fn serialize_and_load(column: &[u64], codec_type: CodecType) -> Arc<dyn ColumnValues<u64>> {
serialize_and_load_u64_based_column_values(&column, &[codec_type])
}
#[bench]
fn bench_intfastfield_jumpy_veclookup(b: &mut Bencher) {
let permutation = generate_permutation();
let n = permutation.len();
b.iter(|| {
let mut a = 0u64;
for _ in 0..n {
a = permutation[a as usize];
}
a
});
}
#[bench]
fn bench_intfastfield_jumpy_fflookup_bitpacked(b: &mut Bencher) {
let permutation = generate_permutation();
let n = permutation.len();
let column: Arc<dyn ColumnValues<u64>> = serialize_and_load(&permutation, CodecType::Bitpacked);
b.iter(|| {
let mut a = 0u64;
for _ in 0..n {
a = column.get_val(a as u32);
}
a
});
}
const FIFTY_PERCENT_RANGE: RangeInclusive<u64> = 1..=50;
const SINGLE_ITEM: u64 = 90;
const SINGLE_ITEM_RANGE: RangeInclusive<u64> = 90..=90;
const ONE_PERCENT_ITEM_RANGE: RangeInclusive<u64> = 49..=49;
fn get_data_50percent_item() -> Vec<u128> {
let mut rng = StdRng::from_seed([1u8; 32]);
let mut data = vec![];
for _ in 0..300_000 {
let val = rng.gen_range(1..=100);
data.push(val);
}
data.push(SINGLE_ITEM);
data.shuffle(&mut rng);
let data = data.iter().map(|el| *el as u128).collect::<Vec<_>>();
data
}
// U64 RANGE START
#[bench]
fn bench_intfastfield_getrange_u64_50percent_hit(b: &mut Bencher) {
let data = get_data_50percent_item();
let data = data.iter().map(|el| *el as u64).collect::<Vec<_>>();
let column: Arc<dyn ColumnValues<u64>> = serialize_and_load(&data, CodecType::Bitpacked);
b.iter(|| {
let mut positions = Vec::new();
column.get_row_ids_for_value_range(
FIFTY_PERCENT_RANGE,
0..data.len() as u32,
&mut positions,
);
positions
});
}
#[bench]
fn bench_intfastfield_getrange_u64_1percent_hit(b: &mut Bencher) {
let data = get_data_50percent_item();
let data = data.iter().map(|el| *el as u64).collect::<Vec<_>>();
let column: Arc<dyn ColumnValues<u64>> = serialize_and_load(&data, CodecType::Bitpacked);
b.iter(|| {
let mut positions = Vec::new();
column.get_row_ids_for_value_range(
ONE_PERCENT_ITEM_RANGE,
0..data.len() as u32,
&mut positions,
);
positions
});
}
#[bench]
fn bench_intfastfield_getrange_u64_single_hit(b: &mut Bencher) {
let data = get_data_50percent_item();
let data = data.iter().map(|el| *el as u64).collect::<Vec<_>>();
let column: Arc<dyn ColumnValues<u64>> = serialize_and_load(&data, CodecType::Bitpacked);
b.iter(|| {
let mut positions = Vec::new();
column.get_row_ids_for_value_range(SINGLE_ITEM_RANGE, 0..data.len() as u32, &mut positions);
positions
});
}
#[bench]
fn bench_intfastfield_getrange_u64_hit_all(b: &mut Bencher) {
let data = get_data_50percent_item();
let data = data.iter().map(|el| *el as u64).collect::<Vec<_>>();
let column: Arc<dyn ColumnValues<u64>> = serialize_and_load(&data, CodecType::Bitpacked);
b.iter(|| {
let mut positions = Vec::new();
column.get_row_ids_for_value_range(0..=u64::MAX, 0..data.len() as u32, &mut positions);
positions
});
}
// U64 RANGE END
#[bench]
fn bench_intfastfield_stride7_vec(b: &mut Bencher) {
let permutation = generate_permutation();
let n = permutation.len();
b.iter(|| {
let mut a = 0u64;
for i in (0..n / 7).map(|val| val * 7) {
a += permutation[i as usize];
}
a
});
}
#[bench]
fn bench_intfastfield_stride7_fflookup(b: &mut Bencher) {
let permutation = generate_permutation();
let n = permutation.len();
let column: Arc<dyn ColumnValues<u64>> = serialize_and_load(&permutation, CodecType::Bitpacked);
b.iter(|| {
let mut a = 0;
for i in (0..n / 7).map(|val| val * 7) {
a += column.get_val(i as u32);
}
a
});
}
#[bench]
fn bench_intfastfield_scan_all_fflookup(b: &mut Bencher) {
let permutation = generate_permutation();
let n = permutation.len();
let column: Arc<dyn ColumnValues<u64>> = serialize_and_load(&permutation, CodecType::Bitpacked);
let column_ref = column.as_ref();
b.iter(|| {
let mut a = 0u64;
for i in 0u32..n as u32 {
a += column_ref.get_val(i);
}
a
});
}
#[bench]
fn bench_intfastfield_scan_all_fflookup_gcd(b: &mut Bencher) {
let permutation = generate_permutation_gcd();
let n = permutation.len();
let column: Arc<dyn ColumnValues<u64>> = serialize_and_load(&permutation, CodecType::Bitpacked);
b.iter(|| {
let mut a = 0u64;
for i in 0..n {
a += column.get_val(i as u32);
}
a
});
}
#[bench]
fn bench_intfastfield_scan_all_vec(b: &mut Bencher) {
let permutation = generate_permutation();
b.iter(|| {
let mut a = 0u64;
for i in 0..permutation.len() {
a += permutation[i as usize] as u64;
}
a
});
}

View File

@@ -0,0 +1,17 @@
[package]
name = "tantivy-columnar-cli"
version = "0.1.0"
edition = "2021"
license = "MIT"
[dependencies]
columnar = {path="../", package="tantivy-columnar"}
serde_json = "1"
serde_json_borrow = {git="https://github.com/PSeitz/serde_json_borrow/"}
serde = "1"
[workspace]
members = []
[profile.release]
debug = true

View File

@@ -0,0 +1,134 @@
use columnar::ColumnarWriter;
use columnar::NumericalValue;
use serde_json_borrow;
use std::fs::File;
use std::io;
use std::io::BufRead;
use std::io::BufReader;
use std::time::Instant;
#[derive(Default)]
struct JsonStack {
path: String,
stack: Vec<usize>,
}
impl JsonStack {
fn push(&mut self, seg: &str) {
let len = self.path.len();
self.stack.push(len);
self.path.push('.');
self.path.push_str(seg);
}
fn pop(&mut self) {
if let Some(len) = self.stack.pop() {
self.path.truncate(len);
}
}
fn path(&self) -> &str {
&self.path[1..]
}
}
fn append_json_to_columnar(
doc: u32,
json_value: &serde_json_borrow::Value,
columnar: &mut ColumnarWriter,
stack: &mut JsonStack,
) -> usize {
let mut count = 0;
match json_value {
serde_json_borrow::Value::Null => {}
serde_json_borrow::Value::Bool(val) => {
columnar.record_numerical(
doc,
stack.path(),
NumericalValue::from(if *val { 1u64 } else { 0u64 }),
);
count += 1;
}
serde_json_borrow::Value::Number(num) => {
let numerical_value: NumericalValue = if let Some(num_i64) = num.as_i64() {
num_i64.into()
} else if let Some(num_u64) = num.as_u64() {
num_u64.into()
} else if let Some(num_f64) = num.as_f64() {
num_f64.into()
} else {
panic!();
};
count += 1;
columnar.record_numerical(
doc,
stack.path(),
numerical_value,
);
}
serde_json_borrow::Value::Str(msg) => {
columnar.record_str(
doc,
stack.path(),
msg,
);
count += 1;
},
serde_json_borrow::Value::Array(vals) => {
for val in vals {
count += append_json_to_columnar(doc, val, columnar, stack);
}
},
serde_json_borrow::Value::Object(json_map) => {
for (child_key, child_val) in json_map {
stack.push(child_key);
count += append_json_to_columnar(doc, child_val, columnar, stack);
stack.pop();
}
},
}
count
}
fn main() -> io::Result<()> {
let file = File::open("gh_small.json")?;
let mut reader = BufReader::new(file);
let mut line = String::with_capacity(100);
let mut columnar = columnar::ColumnarWriter::default();
let mut doc = 0;
let start = Instant::now();
let mut stack = JsonStack::default();
let mut total_count = 0;
let start_build = Instant::now();
loop {
line.clear();
let len = reader.read_line(&mut line)?;
if len == 0 {
break;
}
let Ok(json_value) = serde_json::from_str::<serde_json_borrow::Value>(&line) else { continue; };
total_count += append_json_to_columnar(doc, &json_value, &mut columnar, &mut stack);
doc += 1;
}
println!("Build in {:?}", start_build.elapsed());
println!("value count {total_count}");
let mut buffer = Vec::new();
let start_serialize = Instant::now();
columnar.serialize(doc, None, &mut buffer)?;
println!("Serialized in {:?}", start_serialize.elapsed());
println!("num docs: {doc}, {:?}", start.elapsed());
println!("buffer len {} MB", buffer.len() / 1_000_000);
let columnar = columnar::ColumnarReader::open(buffer)?;
for (column_name, dynamic_column) in columnar.list_columns()? {
let num_bytes = dynamic_column.num_bytes();
let typ = dynamic_column.column_type();
if num_bytes > 1_000_000 {
println!("{column_name} {typ:?} {} KB", num_bytes / 1_000);
}
}
println!("{} columns", columnar.num_columns());
Ok(())
}

47
columnar/src/TODO.md Normal file
View File

@@ -0,0 +1,47 @@
# zero to one
* revisit line codec
* add columns from schema on merge
* Plugging JSON
* replug examples
* move datetime to quickwit common
* switch to nanos
* reintroduce the gcd map.
# Perf and Size
* remove alloc in `ord_to_term`
+ multivaued range queries restrat frm the beginning all of the time.
* re-add ZSTD compression for dictionaries
no systematic monotonic mapping
consider removing multilinear
f32?
adhoc solution for bool?
add metrics helper for aggregate. sum(row_id)
review inline absence/presence
improv perf of select using PDEP
compare with roaring bitmap/elias fano etc etc.
SIMD range? (see blog post)
Add alignment?
Consider another codec to bridge the gap between few and 5k elements
# Cleanup and rationalization
in benchmark, unify percent vs ratio, f32 vs f64.
investigate if should have better errors? io::Error is overused at the moment.
rename rank/select in unit tests
Review the public API via cargo doc
go through TODOs
remove all doc_id occurences -> row_id
use the rank & select naming in unit tests branch.
multi-linear -> blockwise
linear codec -> simply a multiplication for the index column
rename columnar to something more explicit, like column_dictionary or columnar_table
rename fastfield -> column
document changes
rationalization FastFieldValue, HasColumnType
isolate u128_based and uniform naming
# Other
fix enhance column-cli
# Santa claus
autodetect datetime ipaddr, plug customizable tokenizer.

View File

@@ -0,0 +1,100 @@
use std::io;
use std::ops::Deref;
use std::sync::Arc;
use sstable::{Dictionary, VoidSSTable};
use crate::column::Column;
use crate::RowId;
/// Dictionary encoded column.
///
/// The column simply gives access to a regular u64-column that, in
/// which the values are term-ordinals.
///
/// These ordinals are ids uniquely identify the bytes that are stored in
/// the column. These ordinals are small, and sorted in the same order
/// as the term_ord_column.
#[derive(Clone)]
pub struct BytesColumn {
pub(crate) dictionary: Arc<Dictionary<VoidSSTable>>,
pub(crate) term_ord_column: Column<u64>,
}
impl BytesColumn {
/// Fills the given `output` buffer with the term associated to the ordinal `ord`.
///
/// Returns `false` if the term does not exist (e.g. `term_ord` is greater or equal to the
/// overll number of terms).
pub fn ord_to_bytes(&self, ord: u64, output: &mut Vec<u8>) -> io::Result<bool> {
self.dictionary.ord_to_term(ord, output)
}
/// Returns the number of rows in the column.
pub fn num_rows(&self) -> RowId {
self.term_ord_column.num_docs()
}
pub fn term_ords(&self, row_id: RowId) -> impl Iterator<Item = u64> + '_ {
self.term_ord_column.values_for_doc(row_id)
}
/// Returns the column of ordinals
pub fn ords(&self) -> &Column<u64> {
&self.term_ord_column
}
pub fn num_terms(&self) -> usize {
self.dictionary.num_terms()
}
pub fn dictionary(&self) -> &Dictionary<VoidSSTable> {
self.dictionary.as_ref()
}
}
#[derive(Clone)]
pub struct StrColumn(BytesColumn);
impl From<StrColumn> for BytesColumn {
fn from(str_column: StrColumn) -> BytesColumn {
str_column.0
}
}
impl StrColumn {
pub(crate) fn wrap(bytes_column: BytesColumn) -> StrColumn {
StrColumn(bytes_column)
}
pub fn dictionary(&self) -> &Dictionary<VoidSSTable> {
self.0.dictionary.as_ref()
}
/// Fills the buffer
pub fn ord_to_str(&self, term_ord: u64, output: &mut String) -> io::Result<bool> {
unsafe {
let buf = output.as_mut_vec();
if !self.0.dictionary.ord_to_term(term_ord, buf)? {
return Ok(false);
}
// TODO consider remove checks if it hurts performance.
if std::str::from_utf8(buf.as_slice()).is_err() {
buf.clear();
return Err(io::Error::new(
io::ErrorKind::InvalidData,
"Not valid utf-8",
));
}
}
Ok(true)
}
}
impl Deref for StrColumn {
type Target = BytesColumn;
fn deref(&self) -> &Self::Target {
&self.0
}
}

161
columnar/src/column/mod.rs Normal file
View File

@@ -0,0 +1,161 @@
mod dictionary_encoded;
mod serialize;
use std::fmt::Debug;
use std::io::Write;
use std::ops::{Deref, Range, RangeInclusive};
use std::sync::Arc;
use common::BinarySerializable;
pub use dictionary_encoded::{BytesColumn, StrColumn};
pub use serialize::{
open_column_bytes, open_column_str, open_column_u128, open_column_u64,
serialize_column_mappable_to_u128, serialize_column_mappable_to_u64,
};
use crate::column_index::ColumnIndex;
use crate::column_values::monotonic_mapping::StrictlyMonotonicMappingToInternal;
use crate::column_values::{monotonic_map_column, ColumnValues};
use crate::{Cardinality, MonotonicallyMappableToU64, RowId};
#[derive(Clone)]
pub struct Column<T = u64> {
pub idx: ColumnIndex,
pub values: Arc<dyn ColumnValues<T>>,
}
impl<T: MonotonicallyMappableToU64> Column<T> {
pub fn to_u64_monotonic(self) -> Column<u64> {
let values = Arc::new(monotonic_map_column(
self.values,
StrictlyMonotonicMappingToInternal::<T>::new(),
));
Column {
idx: self.idx,
values,
}
}
}
impl<T: PartialOrd + Copy + Debug + Send + Sync + 'static> Column<T> {
#[inline]
pub fn get_cardinality(&self) -> Cardinality {
self.idx.get_cardinality()
}
pub fn num_docs(&self) -> RowId {
match &self.idx {
ColumnIndex::Empty { num_docs } => *num_docs,
ColumnIndex::Full => self.values.num_vals(),
ColumnIndex::Optional(optional_index) => optional_index.num_docs(),
ColumnIndex::Multivalued(col_index) => {
// The multivalued index contains all value start row_id,
// and one extra value at the end with the overall number of rows.
col_index.num_docs()
}
}
}
pub fn min_value(&self) -> T {
self.values.min_value()
}
pub fn max_value(&self) -> T {
self.values.max_value()
}
pub fn first(&self, row_id: RowId) -> Option<T> {
self.values_for_doc(row_id).next()
}
pub fn values_for_doc(&self, row_id: RowId) -> impl Iterator<Item = T> + '_ {
self.value_row_ids(row_id)
.map(|value_row_id: RowId| self.values.get_val(value_row_id))
}
/// Get the docids of values which are in the provided value range.
#[inline]
pub fn get_docids_for_value_range(
&self,
value_range: RangeInclusive<T>,
selected_docid_range: Range<u32>,
doc_ids: &mut Vec<u32>,
) {
// convert passed docid range to row id range
let rowid_range = self.idx.docid_range_to_rowids(selected_docid_range.clone());
// Load rows
self.values
.get_row_ids_for_value_range(value_range, rowid_range, doc_ids);
// Convert rows to docids
self.idx
.select_batch_in_place(selected_docid_range.start, doc_ids);
}
/// Fils the output vector with the (possibly multiple values that are associated_with
/// `row_id`.
///
/// This method clears the `output` vector.
pub fn fill_vals(&self, row_id: RowId, output: &mut Vec<T>) {
output.clear();
output.extend(self.values_for_doc(row_id));
}
pub fn first_or_default_col(self, default_value: T) -> Arc<dyn ColumnValues<T>> {
Arc::new(FirstValueWithDefault {
column: self,
default_value,
})
}
}
impl<T> Deref for Column<T> {
type Target = ColumnIndex;
fn deref(&self) -> &Self::Target {
&self.idx
}
}
impl BinarySerializable for Cardinality {
fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> std::io::Result<()> {
self.to_code().serialize(writer)
}
fn deserialize<R: std::io::Read>(reader: &mut R) -> std::io::Result<Self> {
let cardinality_code = u8::deserialize(reader)?;
let cardinality = Cardinality::try_from_code(cardinality_code)?;
Ok(cardinality)
}
}
// TODO simplify or optimize
struct FirstValueWithDefault<T: Copy> {
column: Column<T>,
default_value: T,
}
impl<T: PartialOrd + Debug + Send + Sync + Copy + 'static> ColumnValues<T>
for FirstValueWithDefault<T>
{
fn get_val(&self, idx: u32) -> T {
self.column.first(idx).unwrap_or(self.default_value)
}
fn min_value(&self) -> T {
self.column.values.min_value()
}
fn max_value(&self) -> T {
self.column.values.max_value()
}
fn num_vals(&self) -> u32 {
match &self.column.idx {
ColumnIndex::Empty { .. } => 0u32,
ColumnIndex::Full => self.column.values.num_vals(),
ColumnIndex::Optional(optional_idx) => optional_idx.num_docs(),
ColumnIndex::Multivalued(multivalue_idx) => multivalue_idx.num_docs(),
}
}
}

View File

@@ -0,0 +1,94 @@
use std::io;
use std::io::Write;
use std::sync::Arc;
use common::OwnedBytes;
use sstable::Dictionary;
use crate::column::{BytesColumn, Column};
use crate::column_index::{serialize_column_index, SerializableColumnIndex};
use crate::column_values::{
load_u64_based_column_values, serialize_column_values_u128, serialize_u64_based_column_values,
CodecType, MonotonicallyMappableToU128, MonotonicallyMappableToU64,
};
use crate::iterable::Iterable;
use crate::StrColumn;
pub fn serialize_column_mappable_to_u128<T: MonotonicallyMappableToU128>(
column_index: SerializableColumnIndex<'_>,
iterable: &dyn Iterable<T>,
output: &mut impl Write,
) -> io::Result<()> {
let column_index_num_bytes = serialize_column_index(column_index, output)?;
serialize_column_values_u128(iterable, output)?;
output.write_all(&column_index_num_bytes.to_le_bytes())?;
Ok(())
}
pub fn serialize_column_mappable_to_u64<T: MonotonicallyMappableToU64>(
column_index: SerializableColumnIndex<'_>,
column_values: &impl Iterable<T>,
output: &mut impl Write,
) -> io::Result<()> {
let column_index_num_bytes = serialize_column_index(column_index, output)?;
serialize_u64_based_column_values(
column_values,
&[CodecType::Bitpacked, CodecType::BlockwiseLinear],
output,
)?;
output.write_all(&column_index_num_bytes.to_le_bytes())?;
Ok(())
}
pub fn open_column_u64<T: MonotonicallyMappableToU64>(bytes: OwnedBytes) -> io::Result<Column<T>> {
let (body, column_index_num_bytes_payload) = bytes.rsplit(4);
let column_index_num_bytes = u32::from_le_bytes(
column_index_num_bytes_payload
.as_slice()
.try_into()
.unwrap(),
);
let (column_index_data, column_values_data) = body.split(column_index_num_bytes as usize);
let column_index = crate::column_index::open_column_index(column_index_data)?;
let column_values = load_u64_based_column_values(column_values_data)?;
Ok(Column {
idx: column_index,
values: column_values,
})
}
pub fn open_column_u128<T: MonotonicallyMappableToU128>(
bytes: OwnedBytes,
) -> io::Result<Column<T>> {
let (body, column_index_num_bytes_payload) = bytes.rsplit(4);
let column_index_num_bytes = u32::from_le_bytes(
column_index_num_bytes_payload
.as_slice()
.try_into()
.unwrap(),
);
let (column_index_data, column_values_data) = body.split(column_index_num_bytes as usize);
let column_index = crate::column_index::open_column_index(column_index_data)?;
let column_values = crate::column_values::open_u128_mapped(column_values_data)?;
Ok(Column {
idx: column_index,
values: column_values,
})
}
pub fn open_column_bytes(data: OwnedBytes) -> io::Result<BytesColumn> {
let (body, dictionary_len_bytes) = data.rsplit(4);
let dictionary_len = u32::from_le_bytes(dictionary_len_bytes.as_slice().try_into().unwrap());
let (dictionary_bytes, column_bytes) = body.split(dictionary_len as usize);
let dictionary = Arc::new(Dictionary::from_bytes(dictionary_bytes)?);
let term_ord_column = crate::column::open_column_u64::<u64>(column_bytes)?;
Ok(BytesColumn {
dictionary,
term_ord_column,
})
}
pub fn open_column_str(data: OwnedBytes) -> io::Result<StrColumn> {
let bytes_column = open_column_bytes(data)?;
Ok(StrColumn::wrap(bytes_column))
}

View File

@@ -0,0 +1,136 @@
mod shuffled;
mod stacked;
use shuffled::merge_column_index_shuffled;
use stacked::merge_column_index_stacked;
use crate::column_index::SerializableColumnIndex;
use crate::{Cardinality, ColumnIndex, MergeRowOrder};
// For simplification, we never have cardinality go down due to deletes.
fn detect_cardinality(columns: &[Option<ColumnIndex>]) -> Cardinality {
columns
.iter()
.flatten()
.map(ColumnIndex::get_cardinality)
.max()
.unwrap_or(Cardinality::Full)
}
pub fn merge_column_index<'a>(
columns: &'a [Option<ColumnIndex>],
merge_row_order: &'a MergeRowOrder,
) -> SerializableColumnIndex<'a> {
// For simplification, we do not try to detect whether the cardinality could be
// downgraded thanks to deletes.
let cardinality_after_merge = detect_cardinality(columns);
match merge_row_order {
MergeRowOrder::Stack(stack_merge_order) => {
merge_column_index_stacked(columns, cardinality_after_merge, stack_merge_order)
}
MergeRowOrder::Shuffled(complex_merge_order) => {
merge_column_index_shuffled(columns, cardinality_after_merge, complex_merge_order)
}
}
}
// TODO actually, the shuffled code path is a bit too general.
// In practise, we do not really shuffle everything.
// The merge order restricted to a specific column keeps the original row order.
//
// This may offer some optimization that we have not explored yet.
#[cfg(test)]
mod tests {
use crate::column_index::merge::detect_cardinality;
use crate::column_index::multivalued_index::MultiValueIndex;
use crate::column_index::{merge_column_index, OptionalIndex, SerializableColumnIndex};
use crate::{Cardinality, ColumnIndex, MergeRowOrder, RowAddr, RowId, ShuffleMergeOrder};
#[test]
fn test_detect_cardinality() {
assert_eq!(detect_cardinality(&[]), Cardinality::Full);
let optional_index: ColumnIndex = OptionalIndex::for_test(1, &[]).into();
let multivalued_index: ColumnIndex = MultiValueIndex::for_test(&[0, 1]).into();
assert_eq!(
detect_cardinality(&[Some(optional_index.clone()), None]),
Cardinality::Optional
);
assert_eq!(
detect_cardinality(&[Some(optional_index.clone()), Some(ColumnIndex::Full)]),
Cardinality::Optional
);
assert_eq!(
detect_cardinality(&[Some(multivalued_index.clone()), None]),
Cardinality::Multivalued
);
assert_eq!(
detect_cardinality(&[
Some(multivalued_index.clone()),
Some(optional_index.clone())
]),
Cardinality::Multivalued
);
assert_eq!(
detect_cardinality(&[Some(optional_index), Some(multivalued_index)]),
Cardinality::Multivalued
);
}
#[test]
fn test_merge_index_multivalued_sorted() {
let column_indexes: Vec<Option<ColumnIndex>> =
vec![Some(MultiValueIndex::for_test(&[0, 2, 5]).into())];
let merge_row_order: MergeRowOrder = ShuffleMergeOrder::for_test(
&[2],
vec![
RowAddr {
segment_ord: 0u32,
row_id: 1u32,
},
RowAddr {
segment_ord: 0u32,
row_id: 0u32,
},
],
)
.into();
let merged_column_index = merge_column_index(&column_indexes[..], &merge_row_order);
let SerializableColumnIndex::Multivalued(start_index_iterable) = merged_column_index
else { panic!("Excpected a multivalued index") };
let start_indexes: Vec<RowId> = start_index_iterable.boxed_iter().collect();
assert_eq!(&start_indexes, &[0, 3, 5]);
}
#[test]
fn test_merge_index_multivalued_sorted_several_segment() {
let column_indexes: Vec<Option<ColumnIndex>> = vec![
Some(MultiValueIndex::for_test(&[0, 2, 5]).into()),
None,
Some(MultiValueIndex::for_test(&[0, 1, 4]).into()),
];
let merge_row_order: MergeRowOrder = ShuffleMergeOrder::for_test(
&[2, 0, 2],
vec![
RowAddr {
segment_ord: 2u32,
row_id: 1u32,
},
RowAddr {
segment_ord: 0u32,
row_id: 0u32,
},
RowAddr {
segment_ord: 2u32,
row_id: 0u32,
},
],
)
.into();
let merged_column_index = merge_column_index(&column_indexes[..], &merge_row_order);
let SerializableColumnIndex::Multivalued(start_index_iterable) = merged_column_index
else { panic!("Excpected a multivalued index") };
let start_indexes: Vec<RowId> = start_index_iterable.boxed_iter().collect();
assert_eq!(&start_indexes, &[0, 3, 5, 6]);
}
}

View File

@@ -0,0 +1,168 @@
use std::iter;
use crate::column_index::{SerializableColumnIndex, Set};
use crate::iterable::Iterable;
use crate::{Cardinality, ColumnIndex, RowId, ShuffleMergeOrder};
pub fn merge_column_index_shuffled<'a>(
column_indexes: &'a [Option<ColumnIndex>],
cardinality_after_merge: Cardinality,
shuffle_merge_order: &'a ShuffleMergeOrder,
) -> SerializableColumnIndex<'a> {
match cardinality_after_merge {
Cardinality::Full => SerializableColumnIndex::Full,
Cardinality::Optional => {
let non_null_row_ids =
merge_column_index_shuffled_optional(column_indexes, shuffle_merge_order);
SerializableColumnIndex::Optional {
non_null_row_ids,
num_rows: shuffle_merge_order.num_rows(),
}
}
Cardinality::Multivalued => {
let multivalue_start_index =
merge_column_index_shuffled_multivalued(column_indexes, shuffle_merge_order);
SerializableColumnIndex::Multivalued(multivalue_start_index)
}
}
}
/// Merge several column indexes into one, ordering rows according to the merge_order passed as
/// argument. While it is true that the `merge_order` may imply deletes and hence could in theory a
/// multivalued index into an optional one, this is not supported today for simplification.
///
/// In other words the column_indexes passed as argument may NOT be multivalued.
fn merge_column_index_shuffled_optional<'a>(
column_indexes: &'a [Option<ColumnIndex>],
merge_order: &'a ShuffleMergeOrder,
) -> Box<dyn Iterable<RowId> + 'a> {
Box::new(ShuffledOptionalIndex {
column_indexes,
merge_order,
})
}
struct ShuffledOptionalIndex<'a> {
column_indexes: &'a [Option<ColumnIndex>],
merge_order: &'a ShuffleMergeOrder,
}
impl<'a> Iterable<u32> for ShuffledOptionalIndex<'a> {
fn boxed_iter(&self) -> Box<dyn Iterator<Item = u32> + '_> {
Box::new(self.merge_order
.iter_new_to_old_row_addrs()
.enumerate()
.filter_map(|(new_row_id, old_row_addr)| {
let Some(column_index) = &self.column_indexes[old_row_addr.segment_ord as usize] else {
return None;
};
let row_id = new_row_id as u32;
if column_index.has_value(old_row_addr.row_id) {
Some(row_id)
} else {
None
}
}))
}
}
fn merge_column_index_shuffled_multivalued<'a>(
column_indexes: &'a [Option<ColumnIndex>],
merge_order: &'a ShuffleMergeOrder,
) -> Box<dyn Iterable<RowId> + 'a> {
Box::new(ShuffledMultivaluedIndex {
column_indexes,
merge_order,
})
}
struct ShuffledMultivaluedIndex<'a> {
column_indexes: &'a [Option<ColumnIndex>],
merge_order: &'a ShuffleMergeOrder,
}
fn iter_num_values<'a>(
column_indexes: &'a [Option<ColumnIndex>],
merge_order: &'a ShuffleMergeOrder,
) -> impl Iterator<Item = u32> + 'a {
merge_order.iter_new_to_old_row_addrs().map(|row_addr| {
let Some(column_index) = &column_indexes[row_addr.segment_ord as usize] else {
// No values in the entire column. It surely means there are 0 values associated to this row.
return 0u32;
};
match column_index {
ColumnIndex::Empty { .. } => 0u32,
ColumnIndex::Full => 1,
ColumnIndex::Optional(optional_index) => {
u32::from(optional_index.contains(row_addr.row_id))
}
ColumnIndex::Multivalued(multivalued_index) => {
multivalued_index.range(row_addr.row_id).len() as u32
}
}
})
}
/// Transforms an iterator containing the number of vals per row (with `num_rows` elements)
/// into a `start_offset` iterator starting at 0 and (with `num_rows + 1` element)
fn integrate_num_vals(num_vals: impl Iterator<Item = u32>) -> impl Iterator<Item = RowId> {
iter::once(0u32).chain(num_vals.scan(0, |state, num_vals| {
*state += num_vals;
Some(*state)
}))
}
impl<'a> Iterable<u32> for ShuffledMultivaluedIndex<'a> {
fn boxed_iter(&self) -> Box<dyn Iterator<Item = u32> + '_> {
let num_vals_per_row = iter_num_values(self.column_indexes, self.merge_order);
Box::new(integrate_num_vals(num_vals_per_row))
}
}
#[cfg(test)]
mod tests {
use super::*;
use crate::column_index::OptionalIndex;
use crate::RowAddr;
#[test]
fn test_integrate_num_vals_empty() {
assert!(integrate_num_vals(iter::empty()).eq(iter::once(0)));
}
#[test]
fn test_integrate_num_vals_one_el() {
assert!(integrate_num_vals(iter::once(10)).eq([0, 10].into_iter()));
}
#[test]
fn test_integrate_num_vals_several() {
assert!(integrate_num_vals([3, 0, 10, 20].into_iter()).eq([0, 3, 3, 13, 33].into_iter()));
}
#[test]
fn test_merge_column_index_optional_shuffle() {
let optional_index: ColumnIndex = OptionalIndex::for_test(2, &[0]).into();
let column_indexes = vec![Some(optional_index), Some(ColumnIndex::Full)];
let row_addrs = vec![
RowAddr {
segment_ord: 0u32,
row_id: 1u32,
},
RowAddr {
segment_ord: 1u32,
row_id: 0u32,
},
];
let shuffle_merge_order = ShuffleMergeOrder::for_test(&[2, 1], row_addrs);
let serializable_index = merge_column_index_shuffled(
&column_indexes[..],
Cardinality::Optional,
&shuffle_merge_order,
);
let SerializableColumnIndex::Optional { non_null_row_ids, num_rows } = serializable_index else { panic!() };
assert_eq!(num_rows, 2);
let non_null_rows: Vec<RowId> = non_null_row_ids.boxed_iter().collect();
assert_eq!(&non_null_rows, &[1]);
}
}

View File

@@ -0,0 +1,156 @@
use std::iter;
use crate::column_index::{SerializableColumnIndex, Set};
use crate::iterable::Iterable;
use crate::{Cardinality, ColumnIndex, RowId, StackMergeOrder};
/// Simple case:
/// The new mapping just consists in stacking the different column indexes.
///
/// There are no sort nor deletes involved.
pub fn merge_column_index_stacked<'a>(
columns: &'a [Option<ColumnIndex>],
cardinality_after_merge: Cardinality,
stack_merge_order: &'a StackMergeOrder,
) -> SerializableColumnIndex<'a> {
match cardinality_after_merge {
Cardinality::Full => SerializableColumnIndex::Full,
Cardinality::Optional => SerializableColumnIndex::Optional {
non_null_row_ids: Box::new(StackedOptionalIndex {
columns,
stack_merge_order,
}),
num_rows: stack_merge_order.num_rows(),
},
Cardinality::Multivalued => {
let stacked_multivalued_index = StackedMultivaluedIndex {
columns,
stack_merge_order,
};
SerializableColumnIndex::Multivalued(Box::new(stacked_multivalued_index))
}
}
}
struct StackedOptionalIndex<'a> {
columns: &'a [Option<ColumnIndex>],
stack_merge_order: &'a StackMergeOrder,
}
impl<'a> Iterable<RowId> for StackedOptionalIndex<'a> {
fn boxed_iter(&self) -> Box<dyn Iterator<Item = RowId> + 'a> {
Box::new(
self.columns
.iter()
.enumerate()
.flat_map(|(columnar_id, column_index_opt)| {
let columnar_row_range = self.stack_merge_order.columnar_range(columnar_id);
let rows_it: Box<dyn Iterator<Item = RowId>> = match column_index_opt {
Some(ColumnIndex::Full) => Box::new(columnar_row_range),
Some(ColumnIndex::Optional(optional_index)) => Box::new(
optional_index
.iter_rows()
.map(move |row_id: RowId| columnar_row_range.start + row_id),
),
Some(ColumnIndex::Multivalued(_)) => {
panic!("No multivalued index is allowed when stacking column index");
}
None | Some(ColumnIndex::Empty { .. }) => Box::new(std::iter::empty()),
};
rows_it
}),
)
}
}
#[derive(Clone, Copy)]
struct StackedMultivaluedIndex<'a> {
columns: &'a [Option<ColumnIndex>],
stack_merge_order: &'a StackMergeOrder,
}
fn convert_column_opt_to_multivalued_index<'a>(
column_index_opt: Option<&'a ColumnIndex>,
num_rows: RowId,
) -> Box<dyn Iterator<Item = RowId> + 'a> {
match column_index_opt {
None | Some(ColumnIndex::Empty { .. }) => {
Box::new(iter::repeat(0u32).take(num_rows as usize + 1))
}
Some(ColumnIndex::Full) => Box::new(0..num_rows + 1),
Some(ColumnIndex::Optional(optional_index)) => {
Box::new(
(0..num_rows)
// TODO optimize
.map(|row_id| optional_index.rank(row_id))
.chain(std::iter::once(optional_index.num_non_nulls())),
)
}
Some(ColumnIndex::Multivalued(multivalued_index)) => {
multivalued_index.start_index_column.iter()
}
}
}
impl<'a> Iterable<RowId> for StackedMultivaluedIndex<'a> {
fn boxed_iter(&self) -> Box<dyn Iterator<Item = RowId> + '_> {
let multivalued_indexes =
self.columns
.iter()
.map(Option::as_ref)
.enumerate()
.map(|(columnar_id, column_opt)| {
let num_rows =
self.stack_merge_order.columnar_range(columnar_id).len() as RowId;
convert_column_opt_to_multivalued_index(column_opt, num_rows)
});
stack_multivalued_indexes(multivalued_indexes)
}
}
// Refactor me
fn stack_multivalued_indexes<'a>(
mut multivalued_indexes: impl Iterator<Item = Box<dyn Iterator<Item = RowId> + 'a>> + 'a,
) -> Box<dyn Iterator<Item = RowId> + 'a> {
let mut offset = 0;
let mut last_row_id = 0;
let mut current_it = multivalued_indexes.next();
Box::new(std::iter::from_fn(move || loop {
let Some(multivalued_index) = current_it.as_mut() else {
return None;
};
if let Some(row_id) = multivalued_index.next() {
last_row_id = offset + row_id;
return Some(last_row_id);
}
offset = last_row_id;
loop {
current_it = multivalued_indexes.next();
if current_it.as_mut()?.next().is_some() {
break;
}
}
}))
}
#[cfg(test)]
mod tests {
use crate::RowId;
fn it<'a>(row_ids: &'a [RowId]) -> Box<dyn Iterator<Item = RowId> + 'a> {
Box::new(row_ids.iter().copied())
}
#[test]
fn test_stack() {
let columns = [
it(&[0u32, 0u32]),
it(&[0u32, 1u32, 1u32, 4u32]),
it(&[0u32, 3u32, 5u32]),
it(&[0u32, 4u32]),
]
.into_iter();
let start_offsets: Vec<RowId> = super::stack_multivalued_indexes(columns).collect();
assert_eq!(start_offsets, &[0, 0, 1, 1, 4, 7, 9, 13]);
}
}

View File

@@ -0,0 +1,115 @@
mod merge;
mod multivalued_index;
mod optional_index;
mod serialize;
use std::ops::Range;
pub use merge::merge_column_index;
pub use optional_index::{OptionalIndex, Set};
pub use serialize::{open_column_index, serialize_column_index, SerializableColumnIndex};
use crate::column_index::multivalued_index::MultiValueIndex;
use crate::{Cardinality, DocId, RowId};
#[derive(Clone)]
pub enum ColumnIndex {
Empty {
num_docs: u32,
},
Full,
Optional(OptionalIndex),
/// In addition, at index num_rows, an extra value is added
/// containing the overal number of values.
Multivalued(MultiValueIndex),
}
impl From<OptionalIndex> for ColumnIndex {
fn from(optional_index: OptionalIndex) -> ColumnIndex {
ColumnIndex::Optional(optional_index)
}
}
impl From<MultiValueIndex> for ColumnIndex {
fn from(multi_value_index: MultiValueIndex) -> ColumnIndex {
ColumnIndex::Multivalued(multi_value_index)
}
}
impl ColumnIndex {
#[inline]
pub fn get_cardinality(&self) -> Cardinality {
match self {
ColumnIndex::Empty { .. } => Cardinality::Optional,
ColumnIndex::Full => Cardinality::Full,
ColumnIndex::Optional(_) => Cardinality::Optional,
ColumnIndex::Multivalued(_) => Cardinality::Multivalued,
}
}
/// Returns true if and only if there are at least one value associated to the row.
pub fn has_value(&self, doc_id: DocId) -> bool {
match self {
ColumnIndex::Empty { .. } => false,
ColumnIndex::Full => true,
ColumnIndex::Optional(optional_index) => optional_index.contains(doc_id),
ColumnIndex::Multivalued(multivalued_index) => {
!multivalued_index.range(doc_id).is_empty()
}
}
}
pub fn value_row_ids(&self, doc_id: DocId) -> Range<RowId> {
match self {
ColumnIndex::Empty { .. } => 0..0,
ColumnIndex::Full => doc_id..doc_id + 1,
ColumnIndex::Optional(optional_index) => {
if let Some(val) = optional_index.rank_if_exists(doc_id) {
val..val + 1
} else {
0..0
}
}
ColumnIndex::Multivalued(multivalued_index) => multivalued_index.range(doc_id),
}
}
pub fn docid_range_to_rowids(&self, doc_id: Range<DocId>) -> Range<RowId> {
match self {
ColumnIndex::Empty { .. } => 0..0,
ColumnIndex::Full => doc_id,
ColumnIndex::Optional(optional_index) => {
let row_start = optional_index.rank(doc_id.start);
let row_end = optional_index.rank(doc_id.end);
row_start..row_end
}
ColumnIndex::Multivalued(multivalued_index) => {
let end_docid = doc_id.end.min(multivalued_index.num_docs() - 1) + 1;
let start_docid = doc_id.start.min(end_docid);
let row_start = multivalued_index.start_index_column.get_val(start_docid);
let row_end = multivalued_index.start_index_column.get_val(end_docid);
row_start..row_end
}
}
}
pub fn select_batch_in_place(&self, doc_id_start: DocId, rank_ids: &mut Vec<RowId>) {
match self {
ColumnIndex::Empty { .. } => {
rank_ids.clear();
}
ColumnIndex::Full => {
// No need to do anything:
// value_idx and row_idx are the same.
}
ColumnIndex::Optional(optional_index) => {
optional_index.select_batch(&mut rank_ids[..]);
}
ColumnIndex::Multivalued(multivalued_index) => {
multivalued_index.select_batch_in_place(doc_id_start, rank_ids)
}
}
}
}

View File

@@ -0,0 +1,141 @@
use std::io;
use std::io::Write;
use std::ops::Range;
use std::sync::Arc;
use common::OwnedBytes;
use crate::column_values::{
load_u64_based_column_values, serialize_u64_based_column_values, CodecType, ColumnValues,
};
use crate::iterable::Iterable;
use crate::{DocId, RowId};
pub fn serialize_multivalued_index(
multivalued_index: &dyn Iterable<RowId>,
output: &mut impl Write,
) -> io::Result<()> {
serialize_u64_based_column_values(
multivalued_index,
&[CodecType::Bitpacked, CodecType::Linear],
output,
)?;
Ok(())
}
pub fn open_multivalued_index(bytes: OwnedBytes) -> io::Result<MultiValueIndex> {
let start_index_column: Arc<dyn ColumnValues<RowId>> = load_u64_based_column_values(bytes)?;
Ok(MultiValueIndex { start_index_column })
}
#[derive(Clone)]
/// Index to resolve value range for given doc_id.
/// Starts at 0.
pub struct MultiValueIndex {
pub start_index_column: Arc<dyn crate::ColumnValues<RowId>>,
}
impl From<Arc<dyn ColumnValues<RowId>>> for MultiValueIndex {
fn from(start_index_column: Arc<dyn ColumnValues<RowId>>) -> Self {
MultiValueIndex { start_index_column }
}
}
impl MultiValueIndex {
pub fn for_test(start_offsets: &[RowId]) -> MultiValueIndex {
let mut buffer = Vec::new();
serialize_multivalued_index(&start_offsets, &mut buffer).unwrap();
let bytes = OwnedBytes::new(buffer);
open_multivalued_index(bytes).unwrap()
}
/// Returns `[start, end)`, such that the values associated with
/// the given document are `start..end`.
#[inline]
pub(crate) fn range(&self, doc_id: DocId) -> Range<RowId> {
let start = self.start_index_column.get_val(doc_id);
let end = self.start_index_column.get_val(doc_id + 1);
start..end
}
/// Returns the number of documents in the index.
#[inline]
pub fn num_docs(&self) -> u32 {
self.start_index_column.num_vals() - 1
}
/// Converts a list of ranks (row ids of values) in a 1:n index to the corresponding list of
/// docids. Positions are converted inplace to docids.
///
/// Since there is no index for value pos -> docid, but docid -> value pos range, we scan the
/// index.
///
/// Correctness: positions needs to be sorted. idx_reader needs to contain monotonically
/// increasing positions.
///
/// TODO: Instead of a linear scan we can employ a exponential search into binary search to
/// match a docid to its value position.
#[allow(clippy::bool_to_int_with_if)]
pub(crate) fn select_batch_in_place(&self, docid_start: DocId, ranks: &mut Vec<u32>) {
if ranks.is_empty() {
return;
}
let mut cur_doc = docid_start;
let mut last_doc = None;
assert!(self.start_index_column.get_val(docid_start) <= ranks[0]);
let mut write_doc_pos = 0;
for i in 0..ranks.len() {
let pos = ranks[i];
loop {
let end = self.start_index_column.get_val(cur_doc + 1);
if end > pos {
ranks[write_doc_pos] = cur_doc;
write_doc_pos += if last_doc == Some(cur_doc) { 0 } else { 1 };
last_doc = Some(cur_doc);
break;
}
cur_doc += 1;
}
}
ranks.truncate(write_doc_pos);
}
}
#[cfg(test)]
mod tests {
use std::ops::Range;
use std::sync::Arc;
use super::MultiValueIndex;
use crate::column_values::IterColumn;
use crate::{ColumnValues, RowId};
fn index_to_pos_helper(
index: &MultiValueIndex,
doc_id_range: Range<u32>,
positions: &[u32],
) -> Vec<u32> {
let mut positions = positions.to_vec();
index.select_batch_in_place(doc_id_range.start, &mut positions);
positions
}
#[test]
fn test_positions_to_docid() {
let offsets: Vec<RowId> = vec![0, 10, 12, 15, 22, 23]; // docid values are [0..10, 10..12, 12..15, etc.]
let column: Arc<dyn ColumnValues<RowId>> = Arc::new(IterColumn::from(offsets.into_iter()));
let index = MultiValueIndex::from(column);
assert_eq!(index.num_docs(), 5);
let positions = &[10u32, 11, 15, 20, 21, 22];
assert_eq!(index_to_pos_helper(&index, 0..5, positions), vec![1, 3, 4]);
assert_eq!(index_to_pos_helper(&index, 1..5, positions), vec![1, 3, 4]);
assert_eq!(index_to_pos_helper(&index, 0..5, &[9]), vec![0]);
assert_eq!(index_to_pos_helper(&index, 1..5, &[10]), vec![1]);
assert_eq!(index_to_pos_helper(&index, 1..5, &[11]), vec![1]);
assert_eq!(index_to_pos_helper(&index, 2..5, &[12]), vec![2]);
assert_eq!(index_to_pos_helper(&index, 2..5, &[12, 14]), vec![2]);
assert_eq!(index_to_pos_helper(&index, 2..5, &[12, 14, 15]), vec![2, 3]);
}
}

View File

@@ -0,0 +1,515 @@
use std::io::{self, Write};
use std::sync::Arc;
mod set;
mod set_block;
use common::{BinarySerializable, OwnedBytes, VInt};
pub use set::{SelectCursor, Set, SetCodec};
use set_block::{
DenseBlock, DenseBlockCodec, SparseBlock, SparseBlockCodec, DENSE_BLOCK_NUM_BYTES,
};
use crate::iterable::Iterable;
use crate::{DocId, InvalidData, RowId};
/// The threshold for for number of elements after which we switch to dense block encoding.
///
/// We simply pick the value that minimize the size of the blocks.
const DENSE_BLOCK_THRESHOLD: u32 =
set_block::DENSE_BLOCK_NUM_BYTES / std::mem::size_of::<u16>() as u32; //< 5_120
const ELEMENTS_PER_BLOCK: u32 = u16::MAX as u32 + 1;
const BLOCK_SIZE: RowId = 1 << 16;
#[derive(Copy, Clone, Debug)]
struct BlockMeta {
non_null_rows_before_block: u32,
start_byte_offset: u32,
block_variant: BlockVariant,
}
#[derive(Clone, Copy, Debug)]
enum BlockVariant {
Dense,
Sparse { num_vals: u16 },
}
impl BlockVariant {
pub fn empty() -> Self {
Self::Sparse { num_vals: 0 }
}
pub fn num_bytes_in_block(&self) -> u32 {
match *self {
BlockVariant::Dense => set_block::DENSE_BLOCK_NUM_BYTES,
BlockVariant::Sparse { num_vals } => num_vals as u32 * 2,
}
}
}
/// This codec is inspired by roaring bitmaps.
/// In the dense blocks, however, in order to accelerate `select`
/// we interleave an offset over two bytes. (more on this lower)
///
/// The lower 16 bits of doc ids are stored as u16 while the upper 16 bits are given by the block
/// id. Each block contains 1<<16 docids.
///
/// # Serialized Data Layout
/// The data starts with the block data. Each block is either dense or sparse encoded, depending on
/// the number of values in the block. A block is sparse when it contains less than
/// DENSE_BLOCK_THRESHOLD (6144) values.
/// [Sparse data block | dense data block, .. #repeat*; Desc: Either a sparse or dense encoded
/// block]
/// ### Sparse block data
/// [u16 LE, .. #repeat*; Desc: Positions with values in a block]
/// ### Dense block data
/// [Dense codec for the whole block; Desc: Similar to a bitvec(0..ELEMENTS_PER_BLOCK) + Metadata
/// for faster lookups. See dense.rs]
///
/// The data is followed by block metadata, to know which area of the raw block data belongs to
/// which block. Only metadata for blocks with elements is recorded to
/// keep the overhead low for scenarios with many very sparse columns. The block metadata consists
/// of the block index and the number of values in the block. Since we don't store empty blocks
/// num_vals is incremented by 1, e.g. 0 means 1 value.
///
/// The last u16 is storing the number of metadata blocks.
/// [u16 LE, .. #repeat*; Desc: Positions with values in a block][(u16 LE, u16 LE), .. #repeat*;
/// Desc: (Block Id u16, Num Elements u16)][u16 LE; Desc: num blocks with values u16]
///
/// # Opening
/// When opening the data layout, the data is expanded to `Vec<SparseCodecBlockVariant>`, where the
/// index is the block index. For each block `byte_start` and `offset` is computed.
#[derive(Clone)]
pub struct OptionalIndex {
num_rows: RowId,
num_non_null_rows: RowId,
block_data: OwnedBytes,
block_metas: Arc<[BlockMeta]>,
}
/// Splits a value address into lower and upper 16bits.
/// The lower 16 bits are the value in the block
/// The upper 16 bits are the block index
#[derive(Copy, Debug, Clone)]
struct RowAddr {
block_id: u16,
in_block_row_id: u16,
}
#[inline(always)]
fn row_addr_from_row_id(row_id: RowId) -> RowAddr {
RowAddr {
block_id: (row_id / BLOCK_SIZE) as u16,
in_block_row_id: (row_id % BLOCK_SIZE) as u16,
}
}
enum BlockSelectCursor<'a> {
Dense(<DenseBlock<'a> as Set<u16>>::SelectCursor<'a>),
Sparse(<SparseBlock<'a> as Set<u16>>::SelectCursor<'a>),
}
impl<'a> BlockSelectCursor<'a> {
fn select(&mut self, rank: u16) -> u16 {
match self {
BlockSelectCursor::Dense(dense_select_cursor) => dense_select_cursor.select(rank),
BlockSelectCursor::Sparse(sparse_select_cursor) => sparse_select_cursor.select(rank),
}
}
}
pub struct OptionalIndexSelectCursor<'a> {
current_block_cursor: BlockSelectCursor<'a>,
current_block_id: u16,
// The current block is guaranteed to contain ranks < end_rank.
current_block_end_rank: RowId,
optional_index: &'a OptionalIndex,
block_doc_idx_start: RowId,
num_null_rows_before_block: RowId,
}
impl<'a> OptionalIndexSelectCursor<'a> {
fn search_and_load_block(&mut self, rank: RowId) {
if rank < self.current_block_end_rank {
// we are already in the right block
return;
}
self.current_block_id = self.optional_index.find_block(rank, self.current_block_id);
self.current_block_end_rank = self
.optional_index
.block_metas
.get(self.current_block_id as usize + 1)
.map(|block_meta| block_meta.non_null_rows_before_block)
.unwrap_or(u32::MAX);
self.block_doc_idx_start = (self.current_block_id as u32) * ELEMENTS_PER_BLOCK;
let block_meta = self.optional_index.block_metas[self.current_block_id as usize];
self.num_null_rows_before_block = block_meta.non_null_rows_before_block;
let block: Block<'_> = self.optional_index.block(block_meta);
self.current_block_cursor = match block {
Block::Dense(dense_block) => BlockSelectCursor::Dense(dense_block.select_cursor()),
Block::Sparse(sparse_block) => BlockSelectCursor::Sparse(sparse_block.select_cursor()),
};
}
}
impl<'a> SelectCursor<RowId> for OptionalIndexSelectCursor<'a> {
fn select(&mut self, rank: RowId) -> RowId {
self.search_and_load_block(rank);
let index_in_block = (rank - self.num_null_rows_before_block) as u16;
self.current_block_cursor.select(index_in_block) as RowId + self.block_doc_idx_start
}
}
impl Set<RowId> for OptionalIndex {
type SelectCursor<'b> = OptionalIndexSelectCursor<'b> where Self: 'b;
// Check if value at position is not null.
#[inline]
fn contains(&self, row_id: RowId) -> bool {
let RowAddr {
block_id,
in_block_row_id,
} = row_addr_from_row_id(row_id);
let block_meta = self.block_metas[block_id as usize];
match self.block(block_meta) {
Block::Dense(dense_block) => dense_block.contains(in_block_row_id),
Block::Sparse(sparse_block) => sparse_block.contains(in_block_row_id),
}
}
#[inline]
fn rank(&self, doc_id: DocId) -> RowId {
let RowAddr {
block_id,
in_block_row_id,
} = row_addr_from_row_id(doc_id);
let block_meta = self.block_metas[block_id as usize];
let block = self.block(block_meta);
let block_offset_row_id = match block {
Block::Dense(dense_block) => dense_block.rank(in_block_row_id),
Block::Sparse(sparse_block) => sparse_block.rank(in_block_row_id),
} as u32;
block_meta.non_null_rows_before_block + block_offset_row_id
}
#[inline]
fn rank_if_exists(&self, doc_id: DocId) -> Option<RowId> {
let RowAddr {
block_id,
in_block_row_id,
} = row_addr_from_row_id(doc_id);
let block_meta = self.block_metas[block_id as usize];
let block = self.block(block_meta);
let block_offset_row_id = match block {
Block::Dense(dense_block) => dense_block.rank_if_exists(in_block_row_id),
Block::Sparse(sparse_block) => sparse_block.rank_if_exists(in_block_row_id),
}? as u32;
Some(block_meta.non_null_rows_before_block + block_offset_row_id)
}
#[inline]
fn select(&self, rank: RowId) -> RowId {
let block_pos = self.find_block(rank, 0);
let block_doc_idx_start = (block_pos as u32) * ELEMENTS_PER_BLOCK;
let block_meta = self.block_metas[block_pos as usize];
let block: Block<'_> = self.block(block_meta);
let index_in_block = (rank - block_meta.non_null_rows_before_block) as u16;
let in_block_rank = match block {
Block::Dense(dense_block) => dense_block.select(index_in_block),
Block::Sparse(sparse_block) => sparse_block.select(index_in_block),
};
block_doc_idx_start + in_block_rank as u32
}
fn select_cursor(&self) -> OptionalIndexSelectCursor<'_> {
OptionalIndexSelectCursor {
current_block_cursor: BlockSelectCursor::Sparse(
SparseBlockCodec::open(b"").select_cursor(),
),
current_block_id: 0u16,
current_block_end_rank: 0u32, //< this is sufficient to force the first load
optional_index: self,
block_doc_idx_start: 0u32,
num_null_rows_before_block: 0u32,
}
}
}
impl OptionalIndex {
pub fn for_test(num_rows: RowId, row_ids: &[RowId]) -> OptionalIndex {
assert!(row_ids
.last()
.copied()
.map(|last_row_id| last_row_id < num_rows)
.unwrap_or(true));
let mut buffer = Vec::new();
serialize_optional_index(&row_ids, num_rows, &mut buffer).unwrap();
let bytes = OwnedBytes::new(buffer);
open_optional_index(bytes).unwrap()
}
pub fn num_docs(&self) -> RowId {
self.num_rows
}
pub fn num_non_nulls(&self) -> RowId {
self.num_non_null_rows
}
pub fn iter_rows(&self) -> impl Iterator<Item = RowId> + '_ {
// TODO optimize
let mut select_batch = self.select_cursor();
(0..self.num_non_null_rows).map(move |rank| select_batch.select(rank))
}
pub fn select_batch(&self, ranks: &mut [RowId]) {
let mut select_cursor = self.select_cursor();
for rank in ranks.iter_mut() {
*rank = select_cursor.select(*rank);
}
}
#[inline]
fn block(&self, block_meta: BlockMeta) -> Block<'_> {
let BlockMeta {
start_byte_offset,
block_variant,
..
} = block_meta;
let start_byte_offset = start_byte_offset as usize;
let bytes = self.block_data.as_slice();
match block_variant {
BlockVariant::Dense => Block::Dense(DenseBlockCodec::open(
&bytes[start_byte_offset..start_byte_offset + DENSE_BLOCK_NUM_BYTES as usize],
)),
BlockVariant::Sparse { num_vals } => {
let end_byte_offset = start_byte_offset + num_vals as usize * 2;
let sparse_bytes = &bytes[start_byte_offset..end_byte_offset];
Block::Sparse(SparseBlockCodec::open(sparse_bytes))
}
}
}
#[inline]
fn find_block(&self, dense_idx: u32, start_block_pos: u16) -> u16 {
for block_pos in start_block_pos..self.block_metas.len() as u16 {
let offset = self.block_metas[block_pos as usize].non_null_rows_before_block;
if offset > dense_idx {
return block_pos - 1u16;
}
}
self.block_metas.len() as u16 - 1u16
}
// TODO Add a good API for the codec_idx to original_idx translation.
// The Iterator API is a probably a bad idea
}
#[derive(Copy, Clone)]
enum Block<'a> {
Dense(DenseBlock<'a>),
Sparse(SparseBlock<'a>),
}
#[derive(Debug, Copy, Clone)]
enum OptionalIndexCodec {
Dense = 0,
Sparse = 1,
}
impl OptionalIndexCodec {
fn to_code(self) -> u8 {
self as u8
}
fn try_from_code(code: u8) -> Result<Self, InvalidData> {
match code {
0 => Ok(Self::Dense),
1 => Ok(Self::Sparse),
_ => Err(InvalidData),
}
}
}
impl BinarySerializable for OptionalIndexCodec {
fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
writer.write_all(&[self.to_code()])
}
fn deserialize<R: io::Read>(reader: &mut R) -> io::Result<Self> {
let optional_codec_code = u8::deserialize(reader)?;
let optional_codec = Self::try_from_code(optional_codec_code)?;
Ok(optional_codec)
}
}
fn serialize_optional_index_block(block_els: &[u16], out: &mut impl io::Write) -> io::Result<()> {
let is_sparse = is_sparse(block_els.len() as u32);
if is_sparse {
SparseBlockCodec::serialize(block_els.iter().copied(), out)?;
} else {
DenseBlockCodec::serialize(block_els.iter().copied(), out)?;
}
Ok(())
}
pub fn serialize_optional_index<W: io::Write>(
non_null_rows: &dyn Iterable<RowId>,
num_rows: RowId,
output: &mut W,
) -> io::Result<()> {
VInt(num_rows as u64).serialize(output)?;
let mut rows_it = non_null_rows.boxed_iter();
let mut block_metadata: Vec<SerializedBlockMeta> = Vec::new();
let mut current_block = Vec::new();
// This if-statement for the first element ensures that
// `block_metadata` is not empty in the loop below.
let Some(idx) = rows_it.next() else {
output.write_all(&0u16.to_le_bytes())?;
return Ok(());
};
let row_addr = row_addr_from_row_id(idx);
let mut current_block_id = row_addr.block_id;
current_block.push(row_addr.in_block_row_id);
for idx in rows_it {
let value_addr = row_addr_from_row_id(idx);
if current_block_id != value_addr.block_id {
serialize_optional_index_block(&current_block[..], output)?;
block_metadata.push(SerializedBlockMeta {
block_id: current_block_id,
num_non_null_rows: current_block.len() as u32,
});
current_block.clear();
current_block_id = value_addr.block_id;
}
current_block.push(value_addr.in_block_row_id);
}
// handle last block
serialize_optional_index_block(&current_block[..], output)?;
block_metadata.push(SerializedBlockMeta {
block_id: current_block_id,
num_non_null_rows: current_block.len() as u32,
});
for block in &block_metadata {
output.write_all(&block.to_bytes())?;
}
output.write_all((block_metadata.len() as u16).to_le_bytes().as_ref())?;
Ok(())
}
const SERIALIZED_BLOCK_META_NUM_BYTES: usize = 4;
#[derive(Clone, Copy, Debug)]
struct SerializedBlockMeta {
block_id: u16,
num_non_null_rows: u32, //< takes values in 1..=u16::MAX
}
// TODO unit tests
impl SerializedBlockMeta {
#[inline]
fn from_bytes(bytes: [u8; SERIALIZED_BLOCK_META_NUM_BYTES]) -> SerializedBlockMeta {
let block_id = u16::from_le_bytes(bytes[0..2].try_into().unwrap());
let num_non_null_rows: u32 =
u16::from_le_bytes(bytes[2..4].try_into().unwrap()) as u32 + 1u32;
SerializedBlockMeta {
block_id,
num_non_null_rows,
}
}
#[inline]
fn to_bytes(self) -> [u8; SERIALIZED_BLOCK_META_NUM_BYTES] {
assert!(self.num_non_null_rows > 0);
let mut bytes = [0u8; SERIALIZED_BLOCK_META_NUM_BYTES];
bytes[0..2].copy_from_slice(&self.block_id.to_le_bytes());
// We don't store empty blocks, therefore we can subtract 1.
// This way we will be able to use u16 when the number of elements is 1 << 16 or u16::MAX+1
bytes[2..4].copy_from_slice(&((self.num_non_null_rows - 1u32) as u16).to_le_bytes());
bytes
}
}
#[inline]
fn is_sparse(num_rows_in_block: u32) -> bool {
num_rows_in_block < DENSE_BLOCK_THRESHOLD
}
fn deserialize_optional_index_block_metadatas(
data: &[u8],
num_rows: u32,
) -> (Box<[BlockMeta]>, u32) {
let num_blocks = data.len() / SERIALIZED_BLOCK_META_NUM_BYTES;
let mut block_metas = Vec::with_capacity(num_blocks + 1);
let mut start_byte_offset = 0;
let mut non_null_rows_before_block = 0;
for block_meta_bytes in data.chunks_exact(SERIALIZED_BLOCK_META_NUM_BYTES) {
let block_meta_bytes: [u8; SERIALIZED_BLOCK_META_NUM_BYTES] =
block_meta_bytes.try_into().unwrap();
let SerializedBlockMeta {
block_id,
num_non_null_rows,
} = SerializedBlockMeta::from_bytes(block_meta_bytes);
block_metas.resize(
block_id as usize,
BlockMeta {
non_null_rows_before_block,
start_byte_offset,
block_variant: BlockVariant::empty(),
},
);
let block_variant = if is_sparse(num_non_null_rows) {
BlockVariant::Sparse {
num_vals: num_non_null_rows as u16,
}
} else {
BlockVariant::Dense
};
block_metas.push(BlockMeta {
non_null_rows_before_block,
start_byte_offset,
block_variant,
});
start_byte_offset += block_variant.num_bytes_in_block();
non_null_rows_before_block += num_non_null_rows;
}
block_metas.resize(
((num_rows + BLOCK_SIZE - 1) / BLOCK_SIZE) as usize,
BlockMeta {
non_null_rows_before_block,
start_byte_offset,
block_variant: BlockVariant::empty(),
},
);
(block_metas.into_boxed_slice(), non_null_rows_before_block)
}
pub fn open_optional_index(bytes: OwnedBytes) -> io::Result<OptionalIndex> {
let (mut bytes, num_non_empty_blocks_bytes) = bytes.rsplit(2);
let num_non_empty_block_bytes =
u16::from_le_bytes(num_non_empty_blocks_bytes.as_slice().try_into().unwrap());
let num_rows = VInt::deserialize_u64(&mut bytes)? as u32;
let block_metas_num_bytes =
num_non_empty_block_bytes as usize * SERIALIZED_BLOCK_META_NUM_BYTES;
let (block_data, block_metas) = bytes.rsplit(block_metas_num_bytes);
let (block_metas, num_non_null_rows) =
deserialize_optional_index_block_metadatas(block_metas.as_slice(), num_rows);
let optional_index = OptionalIndex {
num_rows,
num_non_null_rows,
block_data,
block_metas: block_metas.into(),
};
Ok(optional_index)
}
#[cfg(test)]
mod tests;

View File

@@ -0,0 +1,47 @@
use std::io;
/// A codec makes it possible to serialize a set of
/// elements, and open the resulting Set representation.
pub trait SetCodec {
type Item: Copy + TryFrom<usize> + Eq + std::hash::Hash + std::fmt::Debug;
type Reader<'a>: Set<Self::Item>;
/// Serializes a set of unique sorted u16 elements.
///
/// May panic if the elements are not sorted.
fn serialize(els: impl Iterator<Item = Self::Item>, wrt: impl io::Write) -> io::Result<()>;
fn open(data: &[u8]) -> Self::Reader<'_>;
}
/// Stateful object that makes it possible to compute several select in a row,
/// provided the rank passed as argument are increasing.
pub trait SelectCursor<T> {
// May panic if rank is greater than the number of elements in the Set,
// or if rank is < than value provided in the previous call.
fn select(&mut self, rank: T) -> T;
}
pub trait Set<T> {
type SelectCursor<'b>: SelectCursor<T>
where Self: 'b;
/// Returns true if the elements is contained in the Set
fn contains(&self, el: T) -> bool;
/// Returns the number of rows in the set that are < `el`
fn rank(&self, el: T) -> T;
/// If the set contains `el` returns the element rank.
/// If the set does not contain the element, it returns `None`.
fn rank_if_exists(&self, el: T) -> Option<T>;
/// Return the rank-th value stored in this bitmap.
///
/// # Panics
///
/// May panic if rank is greater than the number of elements in the Set.
fn select(&self, rank: T) -> T;
/// Creates a brand new select cursor.
fn select_cursor(&self) -> Self::SelectCursor<'_>;
}

View File

@@ -0,0 +1,278 @@
use std::convert::TryInto;
use std::io::{self, Write};
use common::BinarySerializable;
use crate::column_index::optional_index::{SelectCursor, Set, SetCodec, ELEMENTS_PER_BLOCK};
#[inline(always)]
fn get_bit_at(input: u64, n: u16) -> bool {
input & (1 << n) != 0
}
#[inline]
fn set_bit_at(input: &mut u64, n: u16) {
*input |= 1 << n;
}
/// For the `DenseCodec`, `data` which contains the encoded blocks.
/// Each block consists of [u8; 12]. The first 8 bytes is a bitvec for 64 elements.
/// The last 4 bytes are the offset, the number of set bits so far.
///
/// When translating the original index to a dense index, the correct block can be computed
/// directly `orig_idx/64`. Inside the block the position is `orig_idx%64`.
///
/// When translating a dense index to the original index, we can use the offset to find the correct
/// block. Direct computation is not possible, but we can employ a linear or binary search.
const ELEMENTS_PER_MINI_BLOCK: u16 = 64;
const MINI_BLOCK_BITVEC_NUM_BYTES: usize = 8;
const MINI_BLOCK_OFFSET_NUM_BYTES: usize = 2;
pub const MINI_BLOCK_NUM_BYTES: usize = MINI_BLOCK_BITVEC_NUM_BYTES + MINI_BLOCK_OFFSET_NUM_BYTES;
/// Number of bytes in a dense block.
pub const DENSE_BLOCK_NUM_BYTES: u32 =
(ELEMENTS_PER_BLOCK / ELEMENTS_PER_MINI_BLOCK as u32) * MINI_BLOCK_NUM_BYTES as u32;
pub struct DenseBlockCodec;
impl SetCodec for DenseBlockCodec {
type Item = u16;
type Reader<'a> = DenseBlock<'a>;
fn serialize(els: impl Iterator<Item = u16>, wrt: impl io::Write) -> io::Result<()> {
serialize_dense_codec(els, wrt)
}
#[inline]
fn open(data: &[u8]) -> Self::Reader<'_> {
assert_eq!(data.len(), DENSE_BLOCK_NUM_BYTES as usize);
DenseBlock(data)
}
}
/// Interpreting the bitvec as a set of integer within 0..=63
/// and given an element, returns the number of elements in the
/// set lesser than the element.
///
/// # Panics
///
/// May panic or return a wrong result if el <= 64.
#[inline(always)]
fn rank_u64(bitvec: u64, el: u16) -> u16 {
debug_assert!(el < 64);
let mask = (1u64 << el) - 1;
let masked_bitvec = bitvec & mask;
masked_bitvec.count_ones() as u16
}
#[inline(always)]
fn select_u64(mut bitvec: u64, rank: u16) -> u16 {
for _ in 0..rank {
bitvec &= bitvec - 1;
}
bitvec.trailing_zeros() as u16
}
// TODO test the following solution on Intel... on Ryzen Zen <3 it is a catastrophy.
// #[target_feature(enable = "bmi2")]
// unsafe fn select_bitvec_unsafe(bitvec: u64, rank: u16) -> u16 {
// let pdep = _pdep_u64(1u64 << rank, bitvec);
// pdep.trailing_zeros() as u16
// }
#[derive(Clone, Copy, Debug)]
struct DenseMiniBlock {
bitvec: u64,
rank: u16,
}
impl DenseMiniBlock {
fn from_bytes(data: [u8; MINI_BLOCK_NUM_BYTES]) -> Self {
let bitvec = u64::from_le_bytes(data[..MINI_BLOCK_BITVEC_NUM_BYTES].try_into().unwrap());
let rank = u16::from_le_bytes(data[MINI_BLOCK_BITVEC_NUM_BYTES..].try_into().unwrap());
Self { bitvec, rank }
}
fn to_bytes(self) -> [u8; MINI_BLOCK_NUM_BYTES] {
let mut bytes = [0u8; MINI_BLOCK_NUM_BYTES];
bytes[..MINI_BLOCK_BITVEC_NUM_BYTES].copy_from_slice(&self.bitvec.to_le_bytes());
bytes[MINI_BLOCK_BITVEC_NUM_BYTES..].copy_from_slice(&self.rank.to_le_bytes());
bytes
}
}
#[derive(Copy, Clone)]
pub struct DenseBlock<'a>(&'a [u8]);
pub struct DenseBlockSelectCursor<'a> {
block_id: u16,
dense_block: DenseBlock<'a>,
}
impl<'a> SelectCursor<u16> for DenseBlockSelectCursor<'a> {
#[inline]
fn select(&mut self, rank: u16) -> u16 {
self.block_id = self
.dense_block
.find_miniblock_containing_rank(rank, self.block_id)
.unwrap();
let index_block = self.dense_block.mini_block(self.block_id);
let in_block_rank = rank - index_block.rank;
self.block_id * ELEMENTS_PER_MINI_BLOCK + select_u64(index_block.bitvec, in_block_rank)
}
}
impl<'a> Set<u16> for DenseBlock<'a> {
type SelectCursor<'b> = DenseBlockSelectCursor<'a> where Self: 'b;
#[inline(always)]
fn contains(&self, el: u16) -> bool {
let mini_block_id = el / ELEMENTS_PER_MINI_BLOCK;
let bitvec = self.mini_block(mini_block_id).bitvec;
let pos_in_bitvec = el % ELEMENTS_PER_MINI_BLOCK;
get_bit_at(bitvec, pos_in_bitvec)
}
#[inline(always)]
fn rank_if_exists(&self, el: u16) -> Option<u16> {
let block_pos = el / ELEMENTS_PER_MINI_BLOCK;
let index_block = self.mini_block(block_pos);
let pos_in_block_bit_vec = el % ELEMENTS_PER_MINI_BLOCK;
let ones_in_block = rank_u64(index_block.bitvec, pos_in_block_bit_vec);
let rank = index_block.rank + ones_in_block;
if get_bit_at(index_block.bitvec, pos_in_block_bit_vec) {
Some(rank)
} else {
None
}
}
#[inline(always)]
fn rank(&self, el: u16) -> u16 {
let block_pos = el / ELEMENTS_PER_MINI_BLOCK;
let index_block = self.mini_block(block_pos);
let pos_in_block_bit_vec = el % ELEMENTS_PER_MINI_BLOCK;
let ones_in_block = rank_u64(index_block.bitvec, pos_in_block_bit_vec);
index_block.rank + ones_in_block
}
#[inline(always)]
fn select(&self, rank: u16) -> u16 {
let block_id = self.find_miniblock_containing_rank(rank, 0).unwrap();
let index_block = self.mini_block(block_id);
let in_block_rank = rank - index_block.rank;
block_id * ELEMENTS_PER_MINI_BLOCK + select_u64(index_block.bitvec, in_block_rank)
}
#[inline(always)]
fn select_cursor(&self) -> Self::SelectCursor<'_> {
DenseBlockSelectCursor {
block_id: 0,
dense_block: *self,
}
}
}
impl<'a> DenseBlock<'a> {
#[inline]
fn mini_block(&self, mini_block_id: u16) -> DenseMiniBlock {
let data_start_pos = mini_block_id as usize * MINI_BLOCK_NUM_BYTES;
DenseMiniBlock::from_bytes(
self.0[data_start_pos..data_start_pos + MINI_BLOCK_NUM_BYTES]
.try_into()
.unwrap(),
)
}
#[inline]
fn iter_miniblocks(
&self,
from_block_id: u16,
) -> impl Iterator<Item = (u16, DenseMiniBlock)> + '_ {
self.0
.chunks_exact(MINI_BLOCK_NUM_BYTES)
.enumerate()
.skip(from_block_id as usize)
.map(|(block_id, bytes)| {
let mini_block = DenseMiniBlock::from_bytes(bytes.try_into().unwrap());
(block_id as u16, mini_block)
})
}
/// Finds the block position containing the dense_idx.
///
/// # Correctness
/// dense_idx needs to be smaller than the number of values in the index
///
/// The last offset number is equal to the number of values in the index.
#[inline]
fn find_miniblock_containing_rank(&self, rank: u16, from_block_id: u16) -> Option<u16> {
self.iter_miniblocks(from_block_id)
.take_while(|(_, block)| block.rank <= rank)
.map(|(block_id, _)| block_id)
.last()
}
}
/// Iterator over all values, true if set, otherwise false
pub fn serialize_dense_codec(
els: impl Iterator<Item = u16>,
mut output: impl Write,
) -> io::Result<()> {
let mut non_null_rows_before: u16 = 0u16;
let mut block = 0u64;
let mut current_block_id = 0u16;
for el in els {
let block_id = el / ELEMENTS_PER_MINI_BLOCK;
let in_offset = el % ELEMENTS_PER_MINI_BLOCK;
while block_id > current_block_id {
let dense_mini_block = DenseMiniBlock {
bitvec: block,
rank: non_null_rows_before,
};
output.write_all(&dense_mini_block.to_bytes())?;
non_null_rows_before += block.count_ones() as u16;
block = 0u64;
current_block_id += 1u16;
}
set_bit_at(&mut block, in_offset);
}
while current_block_id <= u16::MAX / ELEMENTS_PER_MINI_BLOCK {
block.serialize(&mut output)?;
non_null_rows_before.serialize(&mut output)?;
// This will overflow to 0 exactly if all bits are set.
// This is however not problem as we won't use this last value.
non_null_rows_before = non_null_rows_before.wrapping_add(block.count_ones() as u16);
block = 0u64;
current_block_id += 1u16;
}
Ok(())
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_select_bitvec() {
assert_eq!(select_u64(1u64, 0), 0);
assert_eq!(select_u64(2u64, 0), 1);
assert_eq!(select_u64(4u64, 0), 2);
assert_eq!(select_u64(8u64, 0), 3);
assert_eq!(select_u64(1 | 8u64, 0), 0);
assert_eq!(select_u64(1 | 8u64, 1), 3);
}
#[test]
fn test_count_ones() {
for i in 0..=63 {
assert_eq!(rank_u64(u64::MAX, i), i);
}
}
#[test]
fn test_dense() {
assert_eq!(DENSE_BLOCK_NUM_BYTES, 10_240);
}
}

View File

@@ -0,0 +1,8 @@
mod dense;
mod sparse;
pub use dense::{DenseBlock, DenseBlockCodec, DENSE_BLOCK_NUM_BYTES};
pub use sparse::{SparseBlock, SparseBlockCodec};
#[cfg(test)]
mod tests;

View File

@@ -0,0 +1,111 @@
use crate::column_index::optional_index::{SelectCursor, Set, SetCodec};
pub struct SparseBlockCodec;
impl SetCodec for SparseBlockCodec {
type Item = u16;
type Reader<'a> = SparseBlock<'a>;
fn serialize(
els: impl Iterator<Item = u16>,
mut wrt: impl std::io::Write,
) -> std::io::Result<()> {
for el in els {
wrt.write_all(&el.to_le_bytes())?;
}
Ok(())
}
fn open(data: &[u8]) -> Self::Reader<'_> {
SparseBlock(data)
}
}
#[derive(Copy, Clone)]
pub struct SparseBlock<'a>(&'a [u8]);
impl<'a> SelectCursor<u16> for SparseBlock<'a> {
#[inline]
fn select(&mut self, rank: u16) -> u16 {
<SparseBlock<'a> as Set<u16>>::select(self, rank)
}
}
impl<'a> Set<u16> for SparseBlock<'a> {
type SelectCursor<'b> = Self where Self: 'b;
#[inline(always)]
fn contains(&self, el: u16) -> bool {
self.binary_search(el).is_ok()
}
#[inline(always)]
fn rank_if_exists(&self, el: u16) -> Option<u16> {
self.binary_search(el).ok()
}
#[inline(always)]
fn rank(&self, el: u16) -> u16 {
self.binary_search(el).unwrap_or_else(|el| el)
}
#[inline(always)]
fn select(&self, rank: u16) -> u16 {
let offset = rank as usize * 2;
u16::from_le_bytes(self.0[offset..offset + 2].try_into().unwrap())
}
#[inline(always)]
fn select_cursor(&self) -> Self::SelectCursor<'_> {
*self
}
}
#[inline(always)]
fn get_u16(data: &[u8], byte_position: usize) -> u16 {
let bytes: [u8; 2] = data[byte_position..byte_position + 2].try_into().unwrap();
u16::from_le_bytes(bytes)
}
impl<'a> SparseBlock<'a> {
#[inline(always)]
fn value_at_idx(&self, data: &[u8], idx: u16) -> u16 {
let start_offset: usize = idx as usize * 2;
get_u16(data, start_offset)
}
#[inline]
fn num_vals(&self) -> u16 {
(self.0.len() / 2) as u16
}
#[inline]
#[allow(clippy::comparison_chain)]
// Looks for the element in the block. Returns the positions if found.
fn binary_search(&self, target: u16) -> Result<u16, u16> {
let data = &self.0;
let mut size = self.num_vals();
let mut left = 0;
let mut right = size;
// TODO try different implem.
// e.g. exponential search into binary search
while left < right {
let mid = left + size / 2;
// TODO do boundary check only once, and then use an
// unsafe `value_at_idx`
let mid_val = self.value_at_idx(data, mid);
if target > mid_val {
left = mid + 1;
} else if target < mid_val {
right = mid;
} else {
return Ok(mid);
}
size = right - left;
}
Err(left)
}
}

View File

@@ -0,0 +1,109 @@
use std::collections::HashMap;
use crate::column_index::optional_index::set_block::dense::DENSE_BLOCK_NUM_BYTES;
use crate::column_index::optional_index::set_block::{DenseBlockCodec, SparseBlockCodec};
use crate::column_index::optional_index::{SelectCursor, Set, SetCodec};
fn test_set_helper<C: SetCodec<Item = u16>>(vals: &[u16]) -> usize {
let mut buffer = Vec::new();
C::serialize(vals.iter().copied(), &mut buffer).unwrap();
let tested_set = C::open(buffer.as_slice());
let hash_set: HashMap<C::Item, C::Item> = vals
.iter()
.copied()
.enumerate()
.map(|(ord, val)| (val, C::Item::try_from(ord).ok().unwrap()))
.collect();
for val in 0u16..=u16::MAX {
assert_eq!(tested_set.contains(val), hash_set.contains_key(&val));
assert_eq!(tested_set.rank_if_exists(val), hash_set.get(&val).copied());
assert_eq!(
tested_set.rank(val),
vals.iter().cloned().take_while(|v| *v < val).count() as u16
);
}
for rank in 0..vals.len() {
assert_eq!(tested_set.select(rank as u16), vals[rank]);
}
buffer.len()
}
#[test]
fn test_dense_block_set_u16_empty() {
let buffer_len = test_set_helper::<DenseBlockCodec>(&[]);
assert_eq!(buffer_len, DENSE_BLOCK_NUM_BYTES as usize);
}
#[test]
fn test_dense_block_set_u16_max() {
let buffer_len = test_set_helper::<DenseBlockCodec>(&[u16::MAX]);
assert_eq!(buffer_len, DENSE_BLOCK_NUM_BYTES as usize);
}
#[test]
fn test_sparse_block_set_u16_empty() {
let buffer_len = test_set_helper::<SparseBlockCodec>(&[]);
assert_eq!(buffer_len, 0);
}
#[test]
fn test_sparse_block_set_u16_max() {
let buffer_len = test_set_helper::<SparseBlockCodec>(&[u16::MAX]);
assert_eq!(buffer_len, 2);
}
use proptest::prelude::*;
proptest! {
#![proptest_config(ProptestConfig::with_cases(1))]
#[test]
fn test_prop_test_dense(els in proptest::collection::btree_set(0..=u16::MAX, 0..=u16::MAX as usize)) {
let vals: Vec<u16> = els.into_iter().collect();
let buffer_len = test_set_helper::<DenseBlockCodec>(&vals);
assert_eq!(buffer_len, DENSE_BLOCK_NUM_BYTES as usize);
}
#[test]
fn test_prop_test_sparse(els in proptest::collection::btree_set(0..=u16::MAX, 0..=u16::MAX as usize)) {
let vals: Vec<u16> = els.into_iter().collect();
let buffer_len = test_set_helper::<SparseBlockCodec>(&vals);
assert_eq!(buffer_len, vals.len() * 2);
}
}
#[test]
fn test_simple_translate_codec_codec_idx_to_original_idx_dense() {
let mut buffer = Vec::new();
DenseBlockCodec::serialize([1, 3, 17, 32, 30_000, 30_001].iter().copied(), &mut buffer)
.unwrap();
let tested_set = DenseBlockCodec::open(buffer.as_slice());
assert!(tested_set.contains(1));
let mut select_cursor = tested_set.select_cursor();
assert_eq!(select_cursor.select(0), 1);
assert_eq!(select_cursor.select(1), 3);
assert_eq!(select_cursor.select(2), 17);
}
#[test]
fn test_simple_translate_codec_idx_to_original_idx_sparse() {
let mut buffer = Vec::new();
SparseBlockCodec::serialize([1, 3, 17].iter().copied(), &mut buffer).unwrap();
let tested_set = SparseBlockCodec::open(buffer.as_slice());
assert!(tested_set.contains(1));
let mut select_cursor = tested_set.select_cursor();
assert_eq!(SelectCursor::select(&mut select_cursor, 0), 1);
assert_eq!(SelectCursor::select(&mut select_cursor, 1), 3);
assert_eq!(SelectCursor::select(&mut select_cursor, 2), 17);
}
#[test]
fn test_simple_translate_codec_idx_to_original_idx_dense() {
let mut buffer = Vec::new();
DenseBlockCodec::serialize(0u16..150u16, &mut buffer).unwrap();
let tested_set = DenseBlockCodec::open(buffer.as_slice());
assert!(tested_set.contains(1));
let mut select_cursor = tested_set.select_cursor();
for i in 0..150 {
assert_eq!(i, select_cursor.select(i));
}
}

View File

@@ -0,0 +1,371 @@
use proptest::prelude::{any, prop, *};
use proptest::strategy::Strategy;
use proptest::{prop_oneof, proptest};
use super::*;
#[test]
fn test_dense_block_threshold() {
assert_eq!(super::DENSE_BLOCK_THRESHOLD, 5_120);
}
fn random_bitvec() -> BoxedStrategy<Vec<bool>> {
prop_oneof![
1 => prop::collection::vec(proptest::bool::weighted(1.0), 0..100),
1 => prop::collection::vec(proptest::bool::weighted(0.00), 0..(ELEMENTS_PER_BLOCK as usize * 3)), // empty blocks
1 => prop::collection::vec(proptest::bool::weighted(1.00), 0..(ELEMENTS_PER_BLOCK as usize + 10)), // full block
1 => prop::collection::vec(proptest::bool::weighted(0.01), 0..100),
1 => prop::collection::vec(proptest::bool::weighted(0.01), 0..u16::MAX as usize),
8 => vec![any::<bool>()],
]
.boxed()
}
proptest! {
#![proptest_config(ProptestConfig::with_cases(50))]
#[test]
fn test_with_random_bitvecs(bitvec1 in random_bitvec(), bitvec2 in random_bitvec(), bitvec3 in random_bitvec()) {
let mut bitvec = Vec::new();
bitvec.extend_from_slice(&bitvec1);
bitvec.extend_from_slice(&bitvec2);
bitvec.extend_from_slice(&bitvec3);
test_null_index(&bitvec[..]);
}
}
#[test]
fn test_with_random_sets_simple() {
let vals = 10..BLOCK_SIZE * 2;
let mut out: Vec<u8> = Vec::new();
serialize_optional_index(&vals, 100, &mut out).unwrap();
let null_index = open_optional_index(OwnedBytes::new(out)).unwrap();
let ranks: Vec<u32> = (65_472u32..65_473u32).collect();
let els: Vec<u32> = ranks.iter().copied().map(|rank| rank + 10).collect();
let mut select_cursor = null_index.select_cursor();
for (rank, el) in ranks.iter().copied().zip(els.iter().copied()) {
assert_eq!(select_cursor.select(rank), el);
}
}
#[test]
fn test_optional_index_trailing_empty_blocks() {
test_null_index(&[false]);
}
#[test]
fn test_optional_index_one_block_false() {
let mut iter = vec![false; ELEMENTS_PER_BLOCK as usize];
iter.push(true);
test_null_index(&iter[..]);
}
#[test]
fn test_optional_index_one_block_true() {
let mut iter = vec![true; ELEMENTS_PER_BLOCK as usize];
iter.push(true);
test_null_index(&iter[..]);
}
impl<'a> Iterable<RowId> for &'a [bool] {
fn boxed_iter(&self) -> Box<dyn Iterator<Item = RowId> + 'a> {
Box::new(
self.iter()
.cloned()
.enumerate()
.filter(|(_pos, val)| *val)
.map(|(pos, _val)| pos as u32),
)
}
}
fn test_null_index(data: &[bool]) {
let mut out: Vec<u8> = Vec::new();
serialize_optional_index(&data, data.len() as RowId, &mut out).unwrap();
let null_index = open_optional_index(OwnedBytes::new(out)).unwrap();
let orig_idx_with_value: Vec<u32> = data
.iter()
.enumerate()
.filter(|(_pos, val)| **val)
.map(|(pos, _val)| pos as u32)
.collect();
let mut select_iter = null_index.select_cursor();
for i in 0..orig_idx_with_value.len() {
assert_eq!(select_iter.select(i as u32), orig_idx_with_value[i]);
}
let step_size = (orig_idx_with_value.len() / 100).max(1);
for (dense_idx, orig_idx) in orig_idx_with_value.iter().enumerate().step_by(step_size) {
assert_eq!(null_index.rank_if_exists(*orig_idx), Some(dense_idx as u32));
}
// 100 samples
let step_size = (data.len() / 100).max(1);
for (pos, value) in data.iter().enumerate().step_by(step_size) {
assert_eq!(null_index.contains(pos as u32), *value);
}
}
#[test]
fn test_optional_index_test_translation() {
let optional_index = OptionalIndex::for_test(4, &[0, 2]);
let mut select_cursor = optional_index.select_cursor();
assert_eq!(select_cursor.select(0), 0);
assert_eq!(select_cursor.select(1), 2);
}
#[test]
fn test_optional_index_translate() {
let optional_index = OptionalIndex::for_test(4, &[0, 2]);
assert_eq!(optional_index.rank_if_exists(0), Some(0));
assert_eq!(optional_index.rank_if_exists(2), Some(1));
}
#[test]
fn test_optional_index_small() {
let optional_index = OptionalIndex::for_test(4, &[0, 2]);
assert!(optional_index.contains(0));
assert!(!optional_index.contains(1));
assert!(optional_index.contains(2));
assert!(!optional_index.contains(3));
}
#[test]
fn test_optional_index_large() {
let row_ids = &[ELEMENTS_PER_BLOCK, ELEMENTS_PER_BLOCK + 1];
let optional_index = OptionalIndex::for_test(ELEMENTS_PER_BLOCK + 2, row_ids);
assert!(!optional_index.contains(0));
assert!(!optional_index.contains(100));
assert!(!optional_index.contains(ELEMENTS_PER_BLOCK - 1));
assert!(optional_index.contains(ELEMENTS_PER_BLOCK));
assert!(optional_index.contains(ELEMENTS_PER_BLOCK + 1));
}
fn test_optional_index_iter_aux(row_ids: &[RowId], num_rows: RowId) {
let optional_index = OptionalIndex::for_test(num_rows, row_ids);
assert_eq!(optional_index.num_docs(), num_rows);
assert!(optional_index.iter_rows().eq(row_ids.iter().copied()));
}
#[test]
fn test_optional_index_iter_empty() {
test_optional_index_iter_aux(&[], 0u32);
}
fn test_optional_index_rank_aux(row_ids: &[RowId]) {
let num_rows = row_ids.last().copied().unwrap_or(0u32) + 1;
let null_index = OptionalIndex::for_test(num_rows, row_ids);
assert_eq!(null_index.num_docs(), num_rows);
for (row_id, row_val) in row_ids.iter().copied().enumerate() {
assert_eq!(null_index.rank(row_val), row_id as u32);
assert_eq!(null_index.rank_if_exists(row_val), Some(row_id as u32));
if row_val > 0 && !null_index.contains(&row_val - 1) {
assert_eq!(null_index.rank(row_val - 1), row_id as u32);
}
assert_eq!(null_index.rank(row_val + 1), row_id as u32 + 1);
}
}
#[test]
fn test_optional_index_rank() {
test_optional_index_rank_aux(&[1u32]);
test_optional_index_rank_aux(&[0u32, 1u32]);
let mut block = Vec::new();
block.push(3u32);
block.extend((0..BLOCK_SIZE).map(|i| i + BLOCK_SIZE + 1));
test_optional_index_rank_aux(&block);
}
#[test]
fn test_optional_index_iter_empty_one() {
test_optional_index_iter_aux(&[1], 2u32);
test_optional_index_iter_aux(&[100_000], 200_000u32);
}
#[test]
fn test_optional_index_iter_dense_block() {
let mut block = Vec::new();
block.push(3u32);
block.extend((0..BLOCK_SIZE).map(|i| i + BLOCK_SIZE + 1));
test_optional_index_iter_aux(&block, 3 * BLOCK_SIZE);
}
#[test]
fn test_optional_index_for_tests() {
let optional_index = OptionalIndex::for_test(4, &[1, 2]);
assert!(!optional_index.contains(0));
assert!(optional_index.contains(1));
assert!(optional_index.contains(2));
assert!(!optional_index.contains(3));
assert_eq!(optional_index.num_docs(), 4);
}
#[cfg(all(test, feature = "unstable"))]
mod bench {
use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};
use test::Bencher;
use super::*;
const TOTAL_NUM_VALUES: u32 = 1_000_000;
fn gen_bools(fill_ratio: f64) -> OptionalIndex {
let mut out = Vec::new();
let mut rng: StdRng = StdRng::from_seed([1u8; 32]);
let vals: Vec<RowId> = (0..TOTAL_NUM_VALUES)
.map(|_| rng.gen_bool(fill_ratio))
.enumerate()
.filter(|(pos, val)| *val)
.map(|(pos, _)| pos as RowId)
.collect();
serialize_optional_index(&&vals[..], TOTAL_NUM_VALUES, &mut out).unwrap();
let codec = open_optional_index(OwnedBytes::new(out)).unwrap();
codec
}
fn random_range_iterator(
start: u32,
end: u32,
avg_step_size: u32,
avg_deviation: u32,
) -> impl Iterator<Item = u32> {
let mut rng: StdRng = StdRng::from_seed([1u8; 32]);
let mut current = start;
std::iter::from_fn(move || {
current += rng.gen_range(avg_step_size - avg_deviation..=avg_step_size + avg_deviation);
if current >= end {
None
} else {
Some(current)
}
})
}
fn n_percent_step_iterator(percent: f32, num_values: u32) -> impl Iterator<Item = u32> {
let ratio = percent as f32 / 100.0;
let step_size = (1f32 / ratio) as u32;
let deviation = step_size - 1;
random_range_iterator(0, num_values, step_size, deviation)
}
fn walk_over_data(codec: &OptionalIndex, avg_step_size: u32) -> Option<u32> {
walk_over_data_from_positions(
codec,
random_range_iterator(0, TOTAL_NUM_VALUES, avg_step_size, 0),
)
}
fn walk_over_data_from_positions(
codec: &OptionalIndex,
positions: impl Iterator<Item = u32>,
) -> Option<u32> {
let mut dense_idx: Option<u32> = None;
for idx in positions {
dense_idx = dense_idx.or(codec.rank_if_exists(idx));
}
dense_idx
}
#[bench]
fn bench_translate_orig_to_codec_1percent_filled_10percent_hit(bench: &mut Bencher) {
let codec = gen_bools(0.01f64);
bench.iter(|| walk_over_data(&codec, 100));
}
#[bench]
fn bench_translate_orig_to_codec_5percent_filled_10percent_hit(bench: &mut Bencher) {
let codec = gen_bools(0.05f64);
bench.iter(|| walk_over_data(&codec, 100));
}
#[bench]
fn bench_translate_orig_to_codec_5percent_filled_1percent_hit(bench: &mut Bencher) {
let codec = gen_bools(0.05f64);
bench.iter(|| walk_over_data(&codec, 1000));
}
#[bench]
fn bench_translate_orig_to_codec_full_scan_1percent_filled(bench: &mut Bencher) {
let codec = gen_bools(0.01f64);
bench.iter(|| walk_over_data_from_positions(&codec, 0..TOTAL_NUM_VALUES));
}
#[bench]
fn bench_translate_orig_to_codec_full_scan_10percent_filled(bench: &mut Bencher) {
let codec = gen_bools(0.1f64);
bench.iter(|| walk_over_data_from_positions(&codec, 0..TOTAL_NUM_VALUES));
}
#[bench]
fn bench_translate_orig_to_codec_full_scan_90percent_filled(bench: &mut Bencher) {
let codec = gen_bools(0.9f64);
bench.iter(|| walk_over_data_from_positions(&codec, 0..TOTAL_NUM_VALUES));
}
#[bench]
fn bench_translate_orig_to_codec_10percent_filled_1percent_hit(bench: &mut Bencher) {
let codec = gen_bools(0.1f64);
bench.iter(|| walk_over_data(&codec, 100));
}
#[bench]
fn bench_translate_orig_to_codec_50percent_filled_1percent_hit(bench: &mut Bencher) {
let codec = gen_bools(0.5f64);
bench.iter(|| walk_over_data(&codec, 100));
}
#[bench]
fn bench_translate_orig_to_codec_90percent_filled_1percent_hit(bench: &mut Bencher) {
let codec = gen_bools(0.9f64);
bench.iter(|| walk_over_data(&codec, 100));
}
#[bench]
fn bench_translate_codec_to_orig_1percent_filled_0comma005percent_hit(bench: &mut Bencher) {
bench_translate_codec_to_orig_util(0.01f64, 0.005f32, bench);
}
#[bench]
fn bench_translate_codec_to_orig_10percent_filled_0comma005percent_hit(bench: &mut Bencher) {
bench_translate_codec_to_orig_util(0.1f64, 0.005f32, bench);
}
#[bench]
fn bench_translate_codec_to_orig_1percent_filled_10percent_hit(bench: &mut Bencher) {
bench_translate_codec_to_orig_util(0.01f64, 10f32, bench);
}
#[bench]
fn bench_translate_codec_to_orig_1percent_filled_full_scan(bench: &mut Bencher) {
bench_translate_codec_to_orig_util(0.01f64, 100f32, bench);
}
fn bench_translate_codec_to_orig_util(
percent_filled: f64,
percent_hit: f32,
bench: &mut Bencher,
) {
let codec = gen_bools(percent_filled);
let num_non_nulls = codec.num_non_nulls();
let idxs: Vec<u32> = if percent_hit == 100.0f32 {
(0..num_non_nulls).collect()
} else {
n_percent_step_iterator(percent_hit, num_non_nulls).collect()
};
let mut output = vec![0u32; idxs.len()];
bench.iter(|| {
output.copy_from_slice(&idxs[..]);
codec.select_batch(&mut output);
});
}
#[bench]
fn bench_translate_codec_to_orig_90percent_filled_0comma005percent_hit(bench: &mut Bencher) {
bench_translate_codec_to_orig_util(0.9f64, 0.005, bench);
}
#[bench]
fn bench_translate_codec_to_orig_90percent_filled_full_scan(bench: &mut Bencher) {
bench_translate_codec_to_orig_util(0.9f64, 100.0f32, bench);
}
}

View File

@@ -0,0 +1,77 @@
use std::io;
use std::io::Write;
use common::{CountingWriter, OwnedBytes};
use crate::column_index::multivalued_index::serialize_multivalued_index;
use crate::column_index::optional_index::serialize_optional_index;
use crate::column_index::ColumnIndex;
use crate::iterable::Iterable;
use crate::{Cardinality, RowId};
pub enum SerializableColumnIndex<'a> {
Full,
Optional {
non_null_row_ids: Box<dyn Iterable<RowId> + 'a>,
num_rows: RowId,
},
// TODO remove the Arc<dyn> apart from serialization this is not
// dynamic at all.
Multivalued(Box<dyn Iterable<RowId> + 'a>),
}
impl<'a> SerializableColumnIndex<'a> {
pub fn get_cardinality(&self) -> Cardinality {
match self {
SerializableColumnIndex::Full => Cardinality::Full,
SerializableColumnIndex::Optional { .. } => Cardinality::Optional,
SerializableColumnIndex::Multivalued(_) => Cardinality::Multivalued,
}
}
}
pub fn serialize_column_index(
column_index: SerializableColumnIndex,
output: &mut impl Write,
) -> io::Result<u32> {
let mut output = CountingWriter::wrap(output);
let cardinality = column_index.get_cardinality().to_code();
output.write_all(&[cardinality])?;
match column_index {
SerializableColumnIndex::Full => {}
SerializableColumnIndex::Optional {
non_null_row_ids,
num_rows,
} => serialize_optional_index(non_null_row_ids.as_ref(), num_rows, &mut output)?,
SerializableColumnIndex::Multivalued(multivalued_index) => {
serialize_multivalued_index(&*multivalued_index, &mut output)?
}
}
let column_index_num_bytes = output.written_bytes() as u32;
Ok(column_index_num_bytes)
}
pub fn open_column_index(mut bytes: OwnedBytes) -> io::Result<ColumnIndex> {
if bytes.is_empty() {
return Err(io::Error::new(
io::ErrorKind::UnexpectedEof,
"Failed to deserialize column index. Empty buffer.",
));
}
let cardinality_code = bytes[0];
let cardinality = Cardinality::try_from_code(cardinality_code)?;
bytes.advance(1);
match cardinality {
Cardinality::Full => Ok(ColumnIndex::Full),
Cardinality::Optional => {
let optional_index = super::optional_index::open_optional_index(bytes)?;
Ok(ColumnIndex::Optional(optional_index))
}
Cardinality::Multivalued => {
let multivalue_index = super::multivalued_index::open_multivalued_index(bytes)?;
Ok(ColumnIndex::Multivalued(multivalue_index))
}
}
}
// TODO unit tests

View File

@@ -0,0 +1,135 @@
use std::sync::Arc;
use common::OwnedBytes;
use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};
use test::{self, Bencher};
use super::*;
use crate::column_values::u64_based::*;
fn get_data() -> Vec<u64> {
let mut rng = StdRng::seed_from_u64(2u64);
let mut data: Vec<_> = (100..55000_u64)
.map(|num| num + rng.gen::<u8>() as u64)
.collect();
data.push(99_000);
data.insert(1000, 2000);
data.insert(2000, 100);
data.insert(3000, 4100);
data.insert(4000, 100);
data.insert(5000, 800);
data
}
fn compute_stats(vals: impl Iterator<Item = u64>) -> ColumnStats {
let mut stats_collector = StatsCollector::default();
for val in vals {
stats_collector.collect(val);
}
stats_collector.stats()
}
#[inline(never)]
fn value_iter() -> impl Iterator<Item = u64> {
0..20_000
}
fn get_reader_for_bench<Codec: ColumnCodec>(data: &[u64]) -> Codec::ColumnValues {
let mut bytes = Vec::new();
let stats = compute_stats(data.iter().cloned());
let mut codec_serializer = Codec::estimator();
for val in data {
codec_serializer.collect(*val);
}
codec_serializer.serialize(&stats, Box::new(data.iter().copied()).as_mut(), &mut bytes);
Codec::load(OwnedBytes::new(bytes)).unwrap()
}
fn bench_get<Codec: ColumnCodec>(b: &mut Bencher, data: &[u64]) {
let col = get_reader_for_bench::<Codec>(data);
b.iter(|| {
let mut sum = 0u64;
for pos in value_iter() {
let val = col.get_val(pos as u32);
sum = sum.wrapping_add(val);
}
sum
});
}
#[inline(never)]
fn bench_get_dynamic_helper(b: &mut Bencher, col: Arc<dyn ColumnValues>) {
b.iter(|| {
let mut sum = 0u64;
for pos in value_iter() {
let val = col.get_val(pos as u32);
sum = sum.wrapping_add(val);
}
sum
});
}
fn bench_get_dynamic<Codec: ColumnCodec>(b: &mut Bencher, data: &[u64]) {
let col = Arc::new(get_reader_for_bench::<Codec>(data));
bench_get_dynamic_helper(b, col);
}
fn bench_create<Codec: ColumnCodec>(b: &mut Bencher, data: &[u64]) {
let stats = compute_stats(data.iter().cloned());
let mut bytes = Vec::new();
b.iter(|| {
bytes.clear();
let mut codec_serializer = Codec::estimator();
for val in data.iter().take(1024) {
codec_serializer.collect(*val);
}
codec_serializer.serialize(&stats, Box::new(data.iter().copied()).as_mut(), &mut bytes)
});
}
#[bench]
fn bench_fastfield_bitpack_create(b: &mut Bencher) {
let data: Vec<_> = get_data();
bench_create::<BitpackedCodec>(b, &data);
}
#[bench]
fn bench_fastfield_linearinterpol_create(b: &mut Bencher) {
let data: Vec<_> = get_data();
bench_create::<LinearCodec>(b, &data);
}
#[bench]
fn bench_fastfield_multilinearinterpol_create(b: &mut Bencher) {
let data: Vec<_> = get_data();
bench_create::<BlockwiseLinearCodec>(b, &data);
}
#[bench]
fn bench_fastfield_bitpack_get(b: &mut Bencher) {
let data: Vec<_> = get_data();
bench_get::<BitpackedCodec>(b, &data);
}
#[bench]
fn bench_fastfield_bitpack_get_dynamic(b: &mut Bencher) {
let data: Vec<_> = get_data();
bench_get_dynamic::<BitpackedCodec>(b, &data);
}
#[bench]
fn bench_fastfield_linearinterpol_get(b: &mut Bencher) {
let data: Vec<_> = get_data();
bench_get::<LinearCodec>(b, &data);
}
#[bench]
fn bench_fastfield_linearinterpol_get_dynamic(b: &mut Bencher) {
let data: Vec<_> = get_data();
bench_get_dynamic::<LinearCodec>(b, &data);
}
#[bench]
fn bench_fastfield_multilinearinterpol_get(b: &mut Bencher) {
let data: Vec<_> = get_data();
bench_get::<BlockwiseLinearCodec>(b, &data);
}
#[bench]
fn bench_fastfield_multilinearinterpol_get_dynamic(b: &mut Bencher) {
let data: Vec<_> = get_data();
bench_get_dynamic::<BlockwiseLinearCodec>(b, &data);
}

View File

@@ -0,0 +1,41 @@
use std::fmt::Debug;
use std::sync::Arc;
use crate::iterable::Iterable;
use crate::{ColumnIndex, ColumnValues, MergeRowOrder};
pub(crate) struct MergedColumnValues<'a, T> {
pub(crate) column_indexes: &'a [Option<ColumnIndex>],
pub(crate) column_values: &'a [Option<Arc<dyn ColumnValues<T>>>],
pub(crate) merge_row_order: &'a MergeRowOrder,
}
impl<'a, T: Copy + PartialOrd + Debug> Iterable<T> for MergedColumnValues<'a, T> {
fn boxed_iter(&self) -> Box<dyn Iterator<Item = T> + '_> {
match self.merge_row_order {
MergeRowOrder::Stack(_) => Box::new(
self.column_values
.iter()
.flatten()
.flat_map(|column_value| column_value.iter()),
),
MergeRowOrder::Shuffled(shuffle_merge_order) => Box::new(
shuffle_merge_order
.iter_new_to_old_row_addrs()
.flat_map(|row_addr| {
let column_index =
self.column_indexes[row_addr.segment_ord as usize].as_ref()?;
let column_values =
self.column_values[row_addr.segment_ord as usize].as_ref()?;
let value_range = column_index.value_row_ids(row_addr.row_id);
Some((value_range, column_values))
})
.flat_map(|(value_range, column_values)| {
value_range
.into_iter()
.map(|val| column_values.get_val(val))
}),
),
}
}
}

View File

@@ -0,0 +1,220 @@
#![warn(missing_docs)]
//! # `fastfield_codecs`
//!
//! - Columnar storage of data for tantivy [`Column`].
//! - Encode data in different codecs.
//! - Monotonically map values to u64/u128
use std::fmt::Debug;
use std::ops::{Range, RangeInclusive};
use std::sync::Arc;
pub use monotonic_mapping::{MonotonicallyMappableToU64, StrictlyMonotonicFn};
pub use monotonic_mapping_u128::MonotonicallyMappableToU128;
mod merge;
pub(crate) mod monotonic_mapping;
pub(crate) mod monotonic_mapping_u128;
mod stats;
mod u128_based;
mod u64_based;
mod vec_column;
mod monotonic_column;
pub(crate) use merge::MergedColumnValues;
pub use stats::ColumnStats;
pub use u128_based::{open_u128_mapped, serialize_column_values_u128};
pub use u64_based::{
load_u64_based_column_values, serialize_and_load_u64_based_column_values,
serialize_u64_based_column_values, CodecType, ALL_U64_CODEC_TYPES,
};
pub use vec_column::VecColumn;
pub use self::monotonic_column::monotonic_map_column;
use crate::RowId;
/// `ColumnValues` provides access to a dense field column.
///
/// `Column` are just a wrapper over `ColumnValues` and a `ColumnIndex`.
///
/// Any methods with a default and specialized implementation need to be called in the
/// wrappers that implement the trait: Arc and MonotonicMappingColumn
pub trait ColumnValues<T: PartialOrd = u64>: Send + Sync {
/// Return the value associated with the given idx.
///
/// This accessor should return as fast as possible.
///
/// # Panics
///
/// May panic if `idx` is greater than the column length.
fn get_val(&self, idx: u32) -> T;
/// Allows to push down multiple fetch calls, to avoid dynamic dispatch overhead.
///
/// idx and output should have the same length
///
/// # Panics
///
/// May panic if `idx` is greater than the column length.
fn get_vals(&self, idx: &[u32], output: &mut [T]) {
assert!(idx.len() == output.len());
for (out, idx) in output.iter_mut().zip(idx.iter()) {
*out = self.get_val(*idx as u32);
}
}
/// Fills an output buffer with the fast field values
/// associated with the `DocId` going from
/// `start` to `start + output.len()`.
///
/// # Panics
///
/// Must panic if `start + output.len()` is greater than
/// the segment's `maxdoc`.
#[inline(always)]
fn get_range(&self, start: u64, output: &mut [T]) {
for (out, idx) in output.iter_mut().zip(start..) {
*out = self.get_val(idx as u32);
}
}
/// Get the row ids of values which are in the provided value range.
///
/// Note that position == docid for single value fast fields
#[inline(always)]
fn get_row_ids_for_value_range(
&self,
value_range: RangeInclusive<T>,
row_id_range: Range<RowId>,
row_id_hits: &mut Vec<RowId>,
) {
let row_id_range = row_id_range.start..row_id_range.end.min(self.num_vals());
for idx in row_id_range.start..row_id_range.end {
let val = self.get_val(idx);
if value_range.contains(&val) {
row_id_hits.push(idx);
}
}
}
/// Returns the minimum value for this fast field.
///
/// This min_value may not be exact.
/// For instance, the min value does not take in account of possible
/// deleted document. All values are however guaranteed to be higher than
/// `.min_value()`.
fn min_value(&self) -> T;
/// Returns the maximum value for this fast field.
///
/// This max_value may not be exact.
/// For instance, the max value does not take in account of possible
/// deleted document. All values are however guaranteed to be higher than
/// `.max_value()`.
fn max_value(&self) -> T;
/// The number of values in the column.
fn num_vals(&self) -> u32;
/// Returns a iterator over the data
fn iter<'a>(&'a self) -> Box<dyn Iterator<Item = T> + 'a> {
Box::new((0..self.num_vals()).map(|idx| self.get_val(idx)))
}
}
impl<T: Copy + PartialOrd + Debug> ColumnValues<T> for Arc<dyn ColumnValues<T>> {
#[inline(always)]
fn get_val(&self, idx: u32) -> T {
self.as_ref().get_val(idx)
}
#[inline(always)]
fn min_value(&self) -> T {
self.as_ref().min_value()
}
#[inline(always)]
fn max_value(&self) -> T {
self.as_ref().max_value()
}
#[inline(always)]
fn num_vals(&self) -> u32 {
self.as_ref().num_vals()
}
#[inline(always)]
fn iter<'b>(&'b self) -> Box<dyn Iterator<Item = T> + 'b> {
self.as_ref().iter()
}
#[inline(always)]
fn get_range(&self, start: u64, output: &mut [T]) {
self.as_ref().get_range(start, output)
}
#[inline(always)]
fn get_row_ids_for_value_range(
&self,
range: RangeInclusive<T>,
doc_id_range: Range<u32>,
positions: &mut Vec<u32>,
) {
self.as_ref()
.get_row_ids_for_value_range(range, doc_id_range, positions)
}
}
/// Wraps an cloneable iterator into a `Column`.
pub struct IterColumn<T>(T);
impl<T> From<T> for IterColumn<T>
where T: Iterator + Clone + ExactSizeIterator
{
fn from(iter: T) -> Self {
IterColumn(iter)
}
}
impl<T> ColumnValues<T::Item> for IterColumn<T>
where
T: Iterator + Clone + ExactSizeIterator + Send + Sync,
T::Item: PartialOrd + Debug,
{
fn get_val(&self, idx: u32) -> T::Item {
self.0.clone().nth(idx as usize).unwrap()
}
fn min_value(&self) -> T::Item {
self.0.clone().next().unwrap()
}
fn max_value(&self) -> T::Item {
self.0.clone().last().unwrap()
}
fn num_vals(&self) -> u32 {
self.0.len() as u32
}
fn iter(&self) -> Box<dyn Iterator<Item = T::Item> + '_> {
Box::new(self.0.clone())
}
}
#[cfg(all(test, feature = "unstable"))]
mod bench;
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_range_as_col() {
let col = IterColumn::from(10..100);
assert_eq!(col.num_vals(), 90);
assert_eq!(col.max_value(), 99);
}
}

View File

@@ -0,0 +1,120 @@
use std::fmt::Debug;
use std::marker::PhantomData;
use std::ops::{Range, RangeInclusive};
use crate::column_values::monotonic_mapping::StrictlyMonotonicFn;
use crate::ColumnValues;
struct MonotonicMappingColumn<C, T, Input> {
from_column: C,
monotonic_mapping: T,
_phantom: PhantomData<Input>,
}
/// Creates a view of a column transformed by a strictly monotonic mapping. See
/// [`StrictlyMonotonicFn`].
///
/// E.g. apply a gcd monotonic_mapping([100, 200, 300]) == [1, 2, 3]
/// monotonic_mapping.mapping() is expected to be injective, and we should always have
/// monotonic_mapping.inverse(monotonic_mapping.mapping(el)) == el
///
/// The inverse of the mapping is required for:
/// `fn get_positions_for_value_range(&self, range: RangeInclusive<T>) -> Vec<u64> `
/// The user provides the original value range and we need to monotonic map them in the same way the
/// serialization does before calling the underlying column.
///
/// Note that when opening a codec, the monotonic_mapping should be the inverse of the mapping
/// during serialization. And therefore the monotonic_mapping_inv when opening is the same as
/// monotonic_mapping during serialization.
pub fn monotonic_map_column<C, T, Input, Output>(
from_column: C,
monotonic_mapping: T,
) -> impl ColumnValues<Output>
where
C: ColumnValues<Input>,
T: StrictlyMonotonicFn<Input, Output> + Send + Sync,
Input: PartialOrd + Debug + Send + Sync + Clone,
Output: PartialOrd + Debug + Send + Sync + Clone,
{
MonotonicMappingColumn {
from_column,
monotonic_mapping,
_phantom: PhantomData,
}
}
impl<C, T, Input, Output> ColumnValues<Output> for MonotonicMappingColumn<C, T, Input>
where
C: ColumnValues<Input>,
T: StrictlyMonotonicFn<Input, Output> + Send + Sync,
Input: PartialOrd + Send + Debug + Sync + Clone,
Output: PartialOrd + Send + Debug + Sync + Clone,
{
#[inline]
fn get_val(&self, idx: u32) -> Output {
let from_val = self.from_column.get_val(idx);
self.monotonic_mapping.mapping(from_val)
}
fn min_value(&self) -> Output {
let from_min_value = self.from_column.min_value();
self.monotonic_mapping.mapping(from_min_value)
}
fn max_value(&self) -> Output {
let from_max_value = self.from_column.max_value();
self.monotonic_mapping.mapping(from_max_value)
}
fn num_vals(&self) -> u32 {
self.from_column.num_vals()
}
fn iter(&self) -> Box<dyn Iterator<Item = Output> + '_> {
Box::new(
self.from_column
.iter()
.map(|el| self.monotonic_mapping.mapping(el)),
)
}
fn get_row_ids_for_value_range(
&self,
range: RangeInclusive<Output>,
doc_id_range: Range<u32>,
positions: &mut Vec<u32>,
) {
self.from_column.get_row_ids_for_value_range(
self.monotonic_mapping.inverse(range.start().clone())
..=self.monotonic_mapping.inverse(range.end().clone()),
doc_id_range,
positions,
)
}
// We voluntarily do not implement get_range as it yields a regression,
// and we do not have any specialized implementation anyway.
}
#[cfg(test)]
mod tests {
use super::*;
use crate::column_values::monotonic_mapping::{
StrictlyMonotonicMappingInverter, StrictlyMonotonicMappingToInternal,
};
use crate::column_values::VecColumn;
#[test]
fn test_monotonic_mapping_iter() {
let vals: Vec<u64> = (0..100u64).map(|el| el * 10).collect();
let col = VecColumn::from(&vals);
let mapped = monotonic_map_column(
col,
StrictlyMonotonicMappingInverter::from(StrictlyMonotonicMappingToInternal::<i64>::new()),
);
let val_i64s: Vec<u64> = mapped.iter().collect();
for i in 0..100 {
assert_eq!(val_i64s[i as usize], mapped.get_val(i));
}
}
}

View File

@@ -0,0 +1,211 @@
use std::fmt::Debug;
use std::marker::PhantomData;
use common::DateTime;
use super::MonotonicallyMappableToU128;
use crate::RowId;
/// Monotonic maps a value to u64 value space.
/// Monotonic mapping enables `PartialOrd` on u64 space without conversion to original space.
pub trait MonotonicallyMappableToU64: 'static + PartialOrd + Debug + Copy + Send + Sync {
/// Converts a value to u64.
///
/// Internally all fast field values are encoded as u64.
fn to_u64(self) -> u64;
/// Converts a value from u64
///
/// Internally all fast field values are encoded as u64.
/// **Note: To be used for converting encoded Term, Posting values.**
fn from_u64(val: u64) -> Self;
}
/// Values need to be strictly monotonic mapped to a `Internal` value (u64 or u128) that can be
/// used in fast field codecs.
///
/// The monotonic mapping is required so that `PartialOrd` can be used on `Internal` without
/// converting to `External`.
///
/// All strictly monotonic functions are invertible because they are guaranteed to have a one-to-one
/// mapping from their range to their domain. The `inverse` method is required when opening a codec,
/// so a value can be converted back to its original domain (e.g. ip address or f64) from its
/// internal representation.
pub trait StrictlyMonotonicFn<External, Internal> {
/// Strictly monotonically maps the value from External to Internal.
fn mapping(&self, inp: External) -> Internal;
/// Inverse of `mapping`. Maps the value from Internal to External.
fn inverse(&self, out: Internal) -> External;
}
/// Inverts a strictly monotonic mapping from `StrictlyMonotonicFn<A, B>` to
/// `StrictlyMonotonicFn<B, A>`.
///
/// # Warning
///
/// This type comes with a footgun. A type being strictly monotonic does not impose that the inverse
/// mapping is strictly monotonic over the entire space External. e.g. a -> a * 2. Use at your own
/// risks.
pub(crate) struct StrictlyMonotonicMappingInverter<T> {
orig_mapping: T,
}
impl<T> From<T> for StrictlyMonotonicMappingInverter<T> {
fn from(orig_mapping: T) -> Self {
Self { orig_mapping }
}
}
impl<From, To, T> StrictlyMonotonicFn<To, From> for StrictlyMonotonicMappingInverter<T>
where T: StrictlyMonotonicFn<From, To>
{
#[inline(always)]
fn mapping(&self, val: To) -> From {
self.orig_mapping.inverse(val)
}
#[inline(always)]
fn inverse(&self, val: From) -> To {
self.orig_mapping.mapping(val)
}
}
/// Applies the strictly monotonic mapping from `T` without any additional changes.
pub(crate) struct StrictlyMonotonicMappingToInternal<T> {
_phantom: PhantomData<T>,
}
impl<T> StrictlyMonotonicMappingToInternal<T> {
pub(crate) fn new() -> StrictlyMonotonicMappingToInternal<T> {
Self {
_phantom: PhantomData,
}
}
}
impl<External: MonotonicallyMappableToU128, T: MonotonicallyMappableToU128>
StrictlyMonotonicFn<External, u128> for StrictlyMonotonicMappingToInternal<T>
where T: MonotonicallyMappableToU128
{
#[inline(always)]
fn mapping(&self, inp: External) -> u128 {
External::to_u128(inp)
}
#[inline(always)]
fn inverse(&self, out: u128) -> External {
External::from_u128(out)
}
}
impl<External: MonotonicallyMappableToU64, T: MonotonicallyMappableToU64>
StrictlyMonotonicFn<External, u64> for StrictlyMonotonicMappingToInternal<T>
where T: MonotonicallyMappableToU64
{
#[inline(always)]
fn mapping(&self, inp: External) -> u64 {
External::to_u64(inp)
}
#[inline(always)]
fn inverse(&self, out: u64) -> External {
External::from_u64(out)
}
}
impl MonotonicallyMappableToU64 for u64 {
#[inline(always)]
fn to_u64(self) -> u64 {
self
}
#[inline(always)]
fn from_u64(val: u64) -> Self {
val
}
}
impl MonotonicallyMappableToU64 for i64 {
#[inline(always)]
fn to_u64(self) -> u64 {
common::i64_to_u64(self)
}
#[inline(always)]
fn from_u64(val: u64) -> Self {
common::u64_to_i64(val)
}
}
impl MonotonicallyMappableToU64 for DateTime {
#[inline(always)]
fn to_u64(self) -> u64 {
common::i64_to_u64(self.into_timestamp_micros())
}
#[inline(always)]
fn from_u64(val: u64) -> Self {
DateTime::from_timestamp_micros(common::u64_to_i64(val))
}
}
impl MonotonicallyMappableToU64 for bool {
#[inline(always)]
fn to_u64(self) -> u64 {
u64::from(self)
}
#[inline(always)]
fn from_u64(val: u64) -> Self {
val > 0
}
}
impl MonotonicallyMappableToU64 for RowId {
#[inline(always)]
fn to_u64(self) -> u64 {
u64::from(self)
}
#[inline(always)]
fn from_u64(val: u64) -> RowId {
val as RowId
}
}
// TODO remove me.
// Tantivy should refuse NaN values and work with NotNaN internally.
impl MonotonicallyMappableToU64 for f64 {
#[inline(always)]
fn to_u64(self) -> u64 {
common::f64_to_u64(self)
}
#[inline(always)]
fn from_u64(val: u64) -> Self {
common::u64_to_f64(val)
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn strictly_monotonic_test() {
// identity mapping
test_round_trip(&StrictlyMonotonicMappingToInternal::<u64>::new(), 100u64);
// round trip to i64
test_round_trip(&StrictlyMonotonicMappingToInternal::<i64>::new(), 100u64);
// TODO
// identity mapping
// test_round_trip(&StrictlyMonotonicMappingToInternal::<u128>::new(), 100u128);
}
fn test_round_trip<T: StrictlyMonotonicFn<K, L>, K: std::fmt::Debug + Eq + Copy, L>(
mapping: &T,
test_val: K,
) {
assert_eq!(mapping.inverse(mapping.mapping(test_val)), test_val);
}
}

View File

@@ -0,0 +1,41 @@
use std::fmt::Debug;
use std::net::Ipv6Addr;
/// Montonic maps a value to u128 value space
/// Monotonic mapping enables `PartialOrd` on u128 space without conversion to original space.
pub trait MonotonicallyMappableToU128: 'static + PartialOrd + Copy + Debug + Send + Sync {
/// Converts a value to u128.
///
/// Internally all fast field values are encoded as u64.
fn to_u128(self) -> u128;
/// Converts a value from u128
///
/// Internally all fast field values are encoded as u64.
/// **Note: To be used for converting encoded Term, Posting values.**
fn from_u128(val: u128) -> Self;
}
impl MonotonicallyMappableToU128 for u128 {
fn to_u128(self) -> u128 {
self
}
fn from_u128(val: u128) -> Self {
val
}
}
impl MonotonicallyMappableToU128 for Ipv6Addr {
fn to_u128(self) -> u128 {
ip_to_u128(self)
}
fn from_u128(val: u128) -> Self {
Ipv6Addr::from(val.to_be_bytes())
}
}
fn ip_to_u128(ip_addr: Ipv6Addr) -> u128 {
u128::from_be_bytes(ip_addr.octets())
}

View File

@@ -0,0 +1,103 @@
use std::io;
use std::io::Write;
use std::num::NonZeroU64;
use common::{BinarySerializable, VInt};
use crate::RowId;
/// Column statistics.
#[derive(Debug, Clone, Eq, PartialEq)]
pub struct ColumnStats {
/// GCD of the elements `el - min(column)`.
pub gcd: NonZeroU64,
/// Minimum value of the column.
pub min_value: u64,
/// Maximum value of the column.
pub max_value: u64,
/// Number of rows in the column.
pub num_rows: RowId,
}
impl ColumnStats {
/// Amplitude of value.
/// Difference between the maximum and the minimum value.
pub fn amplitude(&self) -> u64 {
self.max_value - self.min_value
}
}
impl BinarySerializable for ColumnStats {
fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
VInt(self.min_value).serialize(writer)?;
VInt(self.gcd.get()).serialize(writer)?;
VInt(self.amplitude() / self.gcd).serialize(writer)?;
VInt(self.num_rows as u64).serialize(writer)?;
Ok(())
}
fn deserialize<R: io::Read>(reader: &mut R) -> io::Result<Self> {
let min_value = VInt::deserialize(reader)?.0;
let gcd = VInt::deserialize(reader)?.0;
let gcd = NonZeroU64::new(gcd)
.ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "GCD of 0 is forbidden"))?;
let amplitude = VInt::deserialize(reader)?.0 * gcd.get();
let max_value = min_value + amplitude;
let num_rows = VInt::deserialize(reader)?.0 as RowId;
Ok(ColumnStats {
min_value,
max_value,
num_rows,
gcd,
})
}
}
#[cfg(test)]
mod tests {
use std::num::NonZeroU64;
use common::BinarySerializable;
use crate::column_values::ColumnStats;
#[track_caller]
fn test_stats_ser_deser_aux(stats: &ColumnStats, num_bytes: usize) {
let mut buffer: Vec<u8> = Vec::new();
stats.serialize(&mut buffer).unwrap();
assert_eq!(buffer.len(), num_bytes);
let deser_stats = ColumnStats::deserialize(&mut &buffer[..]).unwrap();
assert_eq!(stats, &deser_stats);
}
#[test]
fn test_stats_serialization() {
test_stats_ser_deser_aux(
&(ColumnStats {
gcd: NonZeroU64::new(3).unwrap(),
min_value: 1,
max_value: 3001,
num_rows: 10,
}),
5,
);
test_stats_ser_deser_aux(
&(ColumnStats {
gcd: NonZeroU64::new(1_000).unwrap(),
min_value: 1,
max_value: 3001,
num_rows: 10,
}),
5,
);
test_stats_ser_deser_aux(
&(ColumnStats {
gcd: NonZeroU64::new(1).unwrap(),
min_value: 0,
max_value: 0,
num_rows: 0,
}),
4,
);
}
}

View File

@@ -0,0 +1,43 @@
use std::ops::RangeInclusive;
/// The range of a blank in value space.
///
/// A blank is an unoccupied space in the data.
/// Use try_into() to construct.
/// A range has to have at least length of 3. Invalid ranges will be rejected.
///
/// Ordered by range length.
#[derive(Debug, Eq, PartialEq, Clone)]
pub(crate) struct BlankRange {
blank_range: RangeInclusive<u128>,
}
impl TryFrom<RangeInclusive<u128>> for BlankRange {
type Error = &'static str;
fn try_from(range: RangeInclusive<u128>) -> Result<Self, Self::Error> {
let blank_size = range.end().saturating_sub(*range.start());
if blank_size < 2 {
Err("invalid range")
} else {
Ok(BlankRange { blank_range: range })
}
}
}
impl BlankRange {
pub(crate) fn blank_size(&self) -> u128 {
self.blank_range.end() - self.blank_range.start() + 1
}
pub(crate) fn blank_range(&self) -> RangeInclusive<u128> {
self.blank_range.clone()
}
}
impl Ord for BlankRange {
fn cmp(&self, other: &Self) -> std::cmp::Ordering {
self.blank_size().cmp(&other.blank_size())
}
}
impl PartialOrd for BlankRange {
fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
Some(self.blank_size().cmp(&other.blank_size()))
}
}

View File

@@ -0,0 +1,231 @@
use std::collections::{BTreeSet, BinaryHeap};
use std::iter;
use std::ops::RangeInclusive;
use itertools::Itertools;
use super::blank_range::BlankRange;
use super::{CompactSpace, RangeMapping};
/// Put the blanks for the sorted values into a binary heap
fn get_blanks(values_sorted: &BTreeSet<u128>) -> BinaryHeap<BlankRange> {
let mut blanks: BinaryHeap<BlankRange> = BinaryHeap::new();
for (first, second) in values_sorted.iter().tuple_windows() {
// Correctness Overflow: the values are deduped and sorted (BTreeSet property), that means
// there's always space between two values.
let blank_range = first + 1..=second - 1;
let blank_range: Result<BlankRange, _> = blank_range.try_into();
if let Ok(blank_range) = blank_range {
blanks.push(blank_range);
}
}
blanks
}
struct BlankCollector {
blanks: Vec<BlankRange>,
staged_blanks_sum: u128,
}
impl BlankCollector {
fn new() -> Self {
Self {
blanks: vec![],
staged_blanks_sum: 0,
}
}
fn stage_blank(&mut self, blank: BlankRange) {
self.staged_blanks_sum += blank.blank_size();
self.blanks.push(blank);
}
fn drain(&mut self) -> impl Iterator<Item = BlankRange> + '_ {
self.staged_blanks_sum = 0;
self.blanks.drain(..)
}
fn staged_blanks_sum(&self) -> u128 {
self.staged_blanks_sum
}
fn num_staged_blanks(&self) -> usize {
self.blanks.len()
}
}
fn num_bits(val: u128) -> u8 {
(128u32 - val.leading_zeros()) as u8
}
/// Will collect blanks and add them to compact space if more bits are saved than cost from
/// metadata.
pub fn get_compact_space(
values_deduped_sorted: &BTreeSet<u128>,
total_num_values: u32,
cost_per_blank: usize,
) -> CompactSpace {
let mut compact_space_builder = CompactSpaceBuilder::new();
if values_deduped_sorted.is_empty() {
return compact_space_builder.finish();
}
let mut blanks: BinaryHeap<BlankRange> = get_blanks(values_deduped_sorted);
// Replace after stabilization of https://github.com/rust-lang/rust/issues/62924
// We start by space that's limited to min_value..=max_value
let min_value = *values_deduped_sorted.iter().next().unwrap_or(&0);
let max_value = *values_deduped_sorted.iter().last().unwrap_or(&0);
// +1 for null, in case min and max covers the whole space, we are off by one.
let mut amplitude_compact_space = (max_value - min_value).saturating_add(1);
if min_value != 0 {
compact_space_builder.add_blanks(iter::once(0..=min_value - 1));
}
if max_value != u128::MAX {
compact_space_builder.add_blanks(iter::once(max_value + 1..=u128::MAX));
}
let mut amplitude_bits: u8 = num_bits(amplitude_compact_space);
let mut blank_collector = BlankCollector::new();
// We will stage blanks until they reduce the compact space by at least 1 bit and then flush
// them if the metadata cost is lower than the total number of saved bits.
// Binary heap to process the gaps by their size
while let Some(blank_range) = blanks.pop() {
blank_collector.stage_blank(blank_range);
let staged_spaces_sum: u128 = blank_collector.staged_blanks_sum();
let amplitude_new_compact_space = amplitude_compact_space - staged_spaces_sum;
let amplitude_new_bits = num_bits(amplitude_new_compact_space);
if amplitude_bits == amplitude_new_bits {
continue;
}
let saved_bits = (amplitude_bits - amplitude_new_bits) as usize * total_num_values as usize;
// TODO: Maybe calculate exact cost of blanks and run this more expensive computation only,
// when amplitude_new_bits changes
let cost = blank_collector.num_staged_blanks() * cost_per_blank;
if cost >= saved_bits {
// Continue here, since although we walk over the blanks by size,
// we can potentially save a lot at the last bits, which are smaller blanks
//
// E.g. if the first range reduces the compact space by 1000 from 2000 to 1000, which
// saves 11-10=1 bit and the next range reduces the compact space by 950 to
// 50, which saves 10-6=4 bit
continue;
}
amplitude_compact_space = amplitude_new_compact_space;
amplitude_bits = amplitude_new_bits;
compact_space_builder.add_blanks(blank_collector.drain().map(|blank| blank.blank_range()));
}
// special case, when we don't collected any blanks because:
// * the data is empty (early exit)
// * the algorithm did decide it's not worth the cost, which can be the case for single values
//
// We drain one collected blank unconditionally, so the empty case is reserved for empty
// data, and therefore empty compact_space means the data is empty and no data is covered
// (conversely to all data) and we can assign null to it.
if compact_space_builder.is_empty() {
compact_space_builder.add_blanks(
blank_collector
.drain()
.map(|blank| blank.blank_range())
.take(1),
);
}
let compact_space = compact_space_builder.finish();
if max_value - min_value != u128::MAX {
debug_assert_eq!(
compact_space.amplitude_compact_space(),
amplitude_compact_space
);
}
compact_space
}
#[derive(Debug, Clone, Eq, PartialEq)]
struct CompactSpaceBuilder {
blanks: Vec<RangeInclusive<u128>>,
}
impl CompactSpaceBuilder {
/// Creates a new compact space builder which will initially cover the whole space.
fn new() -> Self {
Self { blanks: Vec::new() }
}
/// Assumes that repeated add_blank calls don't overlap and are not adjacent,
/// e.g. [3..=5, 5..=10] is not allowed
///
/// Both of those assumptions are true when blanks are produced from sorted values.
fn add_blanks(&mut self, blank: impl Iterator<Item = RangeInclusive<u128>>) {
self.blanks.extend(blank);
}
fn is_empty(&self) -> bool {
self.blanks.is_empty()
}
/// Convert blanks to covered space and assign null value
fn finish(mut self) -> CompactSpace {
// sort by start. ranges are not allowed to overlap
self.blanks.sort_unstable_by_key(|blank| *blank.start());
let mut covered_space = Vec::with_capacity(self.blanks.len());
// begining of the blanks
if let Some(first_blank_start) = self.blanks.first().map(RangeInclusive::start) {
if *first_blank_start != 0 {
covered_space.push(0..=first_blank_start - 1);
}
}
// Between the blanks
let between_blanks = self.blanks.iter().tuple_windows().map(|(left, right)| {
assert!(
left.end() < right.start(),
"overlapping or adjacent ranges detected"
);
*left.end() + 1..=*right.start() - 1
});
covered_space.extend(between_blanks);
// end of the blanks
if let Some(last_blank_end) = self.blanks.last().map(RangeInclusive::end) {
if *last_blank_end != u128::MAX {
covered_space.push(last_blank_end + 1..=u128::MAX);
}
}
if covered_space.is_empty() {
covered_space.push(0..=0); // empty data case
};
let mut compact_start: u64 = 1; // 0 is reserved for `null`
let mut ranges_mapping: Vec<RangeMapping> = Vec::with_capacity(covered_space.len());
for cov in covered_space {
let range_mapping = super::RangeMapping {
value_range: cov,
compact_start,
};
let covered_range_len = range_mapping.range_length();
ranges_mapping.push(range_mapping);
compact_start += covered_range_len;
}
// println!("num ranges {}", ranges_mapping.len());
CompactSpace { ranges_mapping }
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_binary_heap_pop_order() {
let mut blanks: BinaryHeap<BlankRange> = BinaryHeap::new();
blanks.push((0..=10).try_into().unwrap());
blanks.push((100..=200).try_into().unwrap());
blanks.push((100..=110).try_into().unwrap());
assert_eq!(blanks.pop().unwrap().blank_size(), 101);
assert_eq!(blanks.pop().unwrap().blank_size(), 11);
}
}

View File

@@ -0,0 +1,809 @@
/// This codec takes a large number space (u128) and reduces it to a compact number space.
///
/// It will find spaces in the number range. For example:
///
/// 100, 101, 102, 103, 104, 50000, 50001
/// could be mapped to
/// 100..104 -> 0..4
/// 50000..50001 -> 5..6
///
/// Compact space 0..=6 requires much less bits than 100..=50001
///
/// The codec is created to compress ip addresses, but may be employed in other use cases.
use std::{
cmp::Ordering,
collections::BTreeSet,
io::{self, Write},
ops::{Range, RangeInclusive},
};
mod blank_range;
mod build_compact_space;
use build_compact_space::get_compact_space;
use common::{BinarySerializable, CountingWriter, OwnedBytes, VInt, VIntU128};
use tantivy_bitpacker::{self, BitPacker, BitUnpacker};
use crate::column_values::ColumnValues;
use crate::RowId;
/// The cost per blank is quite hard actually, since blanks are delta encoded, the actual cost of
/// blanks depends on the number of blanks.
///
/// The number is taken by looking at a real dataset. It is optimized for larger datasets.
const COST_PER_BLANK_IN_BITS: usize = 36;
#[derive(Debug, Clone, Eq, PartialEq)]
pub struct CompactSpace {
ranges_mapping: Vec<RangeMapping>,
}
/// Maps the range from the original space to compact_start + range.len()
#[derive(Debug, Clone, Eq, PartialEq)]
struct RangeMapping {
value_range: RangeInclusive<u128>,
compact_start: u64,
}
impl RangeMapping {
fn range_length(&self) -> u64 {
(self.value_range.end() - self.value_range.start()) as u64 + 1
}
// The last value of the compact space in this range
fn compact_end(&self) -> u64 {
self.compact_start + self.range_length() - 1
}
}
impl BinarySerializable for CompactSpace {
fn serialize<W: io::Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
VInt(self.ranges_mapping.len() as u64).serialize(writer)?;
let mut prev_value = 0;
for value_range in self
.ranges_mapping
.iter()
.map(|range_mapping| &range_mapping.value_range)
{
let blank_delta_start = value_range.start() - prev_value;
VIntU128(blank_delta_start).serialize(writer)?;
prev_value = *value_range.start();
let blank_delta_end = value_range.end() - prev_value;
VIntU128(blank_delta_end).serialize(writer)?;
prev_value = *value_range.end();
}
Ok(())
}
fn deserialize<R: io::Read>(reader: &mut R) -> io::Result<Self> {
let num_ranges = VInt::deserialize(reader)?.0;
let mut ranges_mapping: Vec<RangeMapping> = vec![];
let mut value = 0u128;
let mut compact_start = 1u64; // 0 is reserved for `null`
for _ in 0..num_ranges {
let blank_delta_start = VIntU128::deserialize(reader)?.0;
value += blank_delta_start;
let blank_start = value;
let blank_delta_end = VIntU128::deserialize(reader)?.0;
value += blank_delta_end;
let blank_end = value;
let range_mapping = RangeMapping {
value_range: blank_start..=blank_end,
compact_start,
};
let range_length = range_mapping.range_length();
ranges_mapping.push(range_mapping);
compact_start += range_length;
}
Ok(Self { ranges_mapping })
}
}
impl CompactSpace {
/// Amplitude is the value range of the compact space including the sentinel value used to
/// identify null values. The compact space is 0..=amplitude .
///
/// It's only used to verify we don't exceed u64 number space, which would indicate a bug.
fn amplitude_compact_space(&self) -> u128 {
self.ranges_mapping
.last()
.map(|last_range| last_range.compact_end() as u128)
.unwrap_or(1) // compact space starts at 1, 0 == null
}
fn get_range_mapping(&self, pos: usize) -> &RangeMapping {
&self.ranges_mapping[pos]
}
/// Returns either Ok(the value in the compact space) or if it is outside the compact space the
/// Err(position where it would be inserted)
fn u128_to_compact(&self, value: u128) -> Result<u64, usize> {
self.ranges_mapping
.binary_search_by(|probe| {
let value_range = &probe.value_range;
if value < *value_range.start() {
Ordering::Greater
} else if value > *value_range.end() {
Ordering::Less
} else {
Ordering::Equal
}
})
.map(|pos| {
let range_mapping = &self.ranges_mapping[pos];
let pos_in_range = (value - range_mapping.value_range.start()) as u64;
range_mapping.compact_start + pos_in_range
})
}
/// Unpacks a value from compact space u64 to u128 space
fn compact_to_u128(&self, compact: u64) -> u128 {
let pos = self
.ranges_mapping
.binary_search_by_key(&compact, |range_mapping| range_mapping.compact_start)
// Correctness: Overflow. The first range starts at compact space 0, the error from
// binary search can never be 0
.map_or_else(|e| e - 1, |v| v);
let range_mapping = &self.ranges_mapping[pos];
let diff = compact - range_mapping.compact_start;
range_mapping.value_range.start() + diff as u128
}
}
pub struct CompactSpaceCompressor {
params: IPCodecParams,
}
#[derive(Debug, Clone)]
pub struct IPCodecParams {
compact_space: CompactSpace,
bit_unpacker: BitUnpacker,
min_value: u128,
max_value: u128,
num_vals: RowId,
num_bits: u8,
}
impl CompactSpaceCompressor {
pub fn num_vals(&self) -> RowId {
self.params.num_vals
}
/// Taking the vals as Vec may cost a lot of memory. It is used to sort the vals.
pub fn train_from(iter: impl Iterator<Item = u128>) -> Self {
let mut values_sorted = BTreeSet::new();
let mut total_num_values = 0u32;
for val in iter {
total_num_values += 1u32;
values_sorted.insert(val);
}
let compact_space =
get_compact_space(&values_sorted, total_num_values, COST_PER_BLANK_IN_BITS);
let amplitude_compact_space = compact_space.amplitude_compact_space();
assert!(
amplitude_compact_space <= u64::MAX as u128,
"case unsupported."
);
let num_bits = tantivy_bitpacker::compute_num_bits(amplitude_compact_space as u64);
let min_value = *values_sorted.iter().next().unwrap_or(&0);
let max_value = *values_sorted.iter().last().unwrap_or(&0);
assert_eq!(
compact_space
.u128_to_compact(max_value)
.expect("could not convert max value to compact space"),
amplitude_compact_space as u64
);
CompactSpaceCompressor {
params: IPCodecParams {
compact_space,
bit_unpacker: BitUnpacker::new(num_bits),
min_value,
max_value,
num_vals: total_num_values,
num_bits,
},
}
}
fn write_footer(self, writer: &mut impl Write) -> io::Result<()> {
let writer = &mut CountingWriter::wrap(writer);
self.params.serialize(writer)?;
let footer_len = writer.written_bytes() as u32;
footer_len.serialize(writer)?;
Ok(())
}
pub fn compress_into(
self,
vals: impl Iterator<Item = u128>,
write: &mut impl Write,
) -> io::Result<()> {
let mut bitpacker = BitPacker::default();
for val in vals {
let compact = self
.params
.compact_space
.u128_to_compact(val)
.map_err(|_| {
io::Error::new(
io::ErrorKind::InvalidData,
"Could not convert value to compact_space. This is a bug.",
)
})?;
bitpacker.write(compact, self.params.num_bits, write)?;
}
bitpacker.close(write)?;
self.write_footer(write)?;
Ok(())
}
}
#[derive(Debug, Clone)]
pub struct CompactSpaceDecompressor {
data: OwnedBytes,
params: IPCodecParams,
}
impl BinarySerializable for IPCodecParams {
fn serialize<W: io::Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
// header flags for future optional dictionary encoding
let footer_flags = 0u64;
footer_flags.serialize(writer)?;
VIntU128(self.min_value).serialize(writer)?;
VIntU128(self.max_value).serialize(writer)?;
VIntU128(self.num_vals as u128).serialize(writer)?;
self.num_bits.serialize(writer)?;
self.compact_space.serialize(writer)?;
Ok(())
}
fn deserialize<R: io::Read>(reader: &mut R) -> io::Result<Self> {
let _header_flags = u64::deserialize(reader)?;
let min_value = VIntU128::deserialize(reader)?.0;
let max_value = VIntU128::deserialize(reader)?.0;
let num_vals = VIntU128::deserialize(reader)?.0 as u32;
let num_bits = u8::deserialize(reader)?;
let compact_space = CompactSpace::deserialize(reader)?;
Ok(Self {
compact_space,
bit_unpacker: BitUnpacker::new(num_bits),
min_value,
max_value,
num_vals,
num_bits,
})
}
}
impl ColumnValues<u128> for CompactSpaceDecompressor {
#[inline]
fn get_val(&self, doc: u32) -> u128 {
self.get(doc)
}
fn min_value(&self) -> u128 {
self.min_value()
}
fn max_value(&self) -> u128 {
self.max_value()
}
fn num_vals(&self) -> u32 {
self.params.num_vals
}
#[inline]
fn iter(&self) -> Box<dyn Iterator<Item = u128> + '_> {
Box::new(self.iter())
}
#[inline]
fn get_row_ids_for_value_range(
&self,
value_range: RangeInclusive<u128>,
positions_range: Range<u32>,
positions: &mut Vec<u32>,
) {
self.get_positions_for_value_range(value_range, positions_range, positions)
}
}
impl CompactSpaceDecompressor {
pub fn open(data: OwnedBytes) -> io::Result<CompactSpaceDecompressor> {
let (data_slice, footer_len_bytes) = data.split_at(data.len() - 4);
let footer_len = u32::deserialize(&mut &footer_len_bytes[..])?;
let data_footer = &data_slice[data_slice.len() - footer_len as usize..];
let params = IPCodecParams::deserialize(&mut &data_footer[..])?;
let decompressor = CompactSpaceDecompressor { data, params };
Ok(decompressor)
}
/// Converting to compact space for the decompressor is more complex, since we may get values
/// which are outside the compact space. e.g. if we map
/// 1000 => 5
/// 2000 => 6
///
/// and we want a mapping for 1005, there is no equivalent compact space. We instead return an
/// error with the index of the next range.
fn u128_to_compact(&self, value: u128) -> Result<u64, usize> {
self.params.compact_space.u128_to_compact(value)
}
fn compact_to_u128(&self, compact: u64) -> u128 {
self.params.compact_space.compact_to_u128(compact)
}
/// Comparing on compact space: Random dataset 0,24 (50% random hit) - 1.05 GElements/s
/// Comparing on compact space: Real dataset 1.08 GElements/s
///
/// Comparing on original space: Real dataset .06 GElements/s (not completely optimized)
#[inline]
pub fn get_positions_for_value_range(
&self,
value_range: RangeInclusive<u128>,
position_range: Range<u32>,
positions: &mut Vec<u32>,
) {
if value_range.start() > value_range.end() {
return;
}
let position_range = position_range.start..position_range.end.min(self.num_vals());
let from_value = *value_range.start();
let to_value = *value_range.end();
assert!(to_value >= from_value);
let compact_from = self.u128_to_compact(from_value);
let compact_to = self.u128_to_compact(to_value);
// Quick return, if both ranges fall into the same non-mapped space, the range can't cover
// any values, so we can early exit
match (compact_to, compact_from) {
(Err(pos1), Err(pos2)) if pos1 == pos2 => return,
_ => {}
}
let compact_from = compact_from.unwrap_or_else(|pos| {
// Correctness: Out of bounds, if this value is Err(last_index + 1), we early exit,
// since the to_value also mapps into the same non-mapped space
let range_mapping = self.params.compact_space.get_range_mapping(pos);
range_mapping.compact_start
});
// If there is no compact space, we go to the closest upperbound compact space
let compact_to = compact_to.unwrap_or_else(|pos| {
// Correctness: Overflow, if this value is Err(0), we early exit,
// since the from_value also mapps into the same non-mapped space
// Get end of previous range
let pos = pos - 1;
let range_mapping = self.params.compact_space.get_range_mapping(pos);
range_mapping.compact_end()
});
let range = compact_from..=compact_to;
let scan_num_docs = position_range.end - position_range.start;
let step_size = 4;
let cutoff = position_range.start + scan_num_docs - scan_num_docs % step_size;
let mut push_if_in_range = |idx, val| {
if range.contains(&val) {
positions.push(idx);
}
};
let get_val = |idx| self.params.bit_unpacker.get(idx, &self.data);
// unrolled loop
for idx in (position_range.start..cutoff).step_by(step_size as usize) {
let idx1 = idx;
let idx2 = idx + 1;
let idx3 = idx + 2;
let idx4 = idx + 3;
let val1 = get_val(idx1);
let val2 = get_val(idx2);
let val3 = get_val(idx3);
let val4 = get_val(idx4);
push_if_in_range(idx1, val1);
push_if_in_range(idx2, val2);
push_if_in_range(idx3, val3);
push_if_in_range(idx4, val4);
}
// handle rest
for idx in cutoff..position_range.end {
push_if_in_range(idx, get_val(idx));
}
}
#[inline]
fn iter_compact(&self) -> impl Iterator<Item = u64> + '_ {
(0..self.params.num_vals).map(move |idx| self.params.bit_unpacker.get(idx, &self.data))
}
#[inline]
fn iter(&self) -> impl Iterator<Item = u128> + '_ {
// TODO: Performance. It would be better to iterate on the ranges and check existence via
// the bit_unpacker.
self.iter_compact()
.map(|compact| self.compact_to_u128(compact))
}
#[inline]
pub fn get(&self, idx: u32) -> u128 {
let compact = self.params.bit_unpacker.get(idx, &self.data);
self.compact_to_u128(compact)
}
pub fn min_value(&self) -> u128 {
self.params.min_value
}
pub fn max_value(&self) -> u128 {
self.params.max_value
}
}
#[cfg(test)]
mod tests {
use itertools::Itertools;
use super::*;
use crate::column_values::u128_based::U128Header;
use crate::column_values::{open_u128_mapped, serialize_column_values_u128};
#[test]
fn compact_space_test() {
let ips = &[
2u128, 4u128, 1000, 1001, 1002, 1003, 1004, 1005, 1008, 1010, 1012, 1260,
]
.into_iter()
.collect();
let compact_space = get_compact_space(ips, ips.len() as u32, 11);
let amplitude = compact_space.amplitude_compact_space();
assert_eq!(amplitude, 17);
assert_eq!(1, compact_space.u128_to_compact(2).unwrap());
assert_eq!(2, compact_space.u128_to_compact(3).unwrap());
assert_eq!(compact_space.u128_to_compact(100).unwrap_err(), 1);
for (num1, num2) in (0..3).tuple_windows() {
assert_eq!(
compact_space.get_range_mapping(num1).compact_end() + 1,
compact_space.get_range_mapping(num2).compact_start
);
}
let mut output: Vec<u8> = Vec::new();
compact_space.serialize(&mut output).unwrap();
assert_eq!(
compact_space,
CompactSpace::deserialize(&mut &output[..]).unwrap()
);
for ip in ips {
let compact = compact_space.u128_to_compact(*ip).unwrap();
assert_eq!(compact_space.compact_to_u128(compact), *ip);
}
}
#[test]
fn compact_space_amplitude_test() {
let ips = &[100000u128, 1000000].into_iter().collect();
let compact_space = get_compact_space(ips, ips.len() as u32, 1);
let amplitude = compact_space.amplitude_compact_space();
assert_eq!(amplitude, 2);
}
fn test_all(mut data: OwnedBytes, expected: &[u128]) {
let _header = U128Header::deserialize(&mut data);
let decompressor = CompactSpaceDecompressor::open(data).unwrap();
for (idx, expected_val) in expected.iter().cloned().enumerate() {
let val = decompressor.get(idx as u32);
assert_eq!(val, expected_val);
let test_range = |range: RangeInclusive<u128>| {
let expected_positions = expected
.iter()
.positions(|val| range.contains(val))
.map(|pos| pos as u32)
.collect::<Vec<_>>();
let mut positions = Vec::new();
decompressor.get_positions_for_value_range(
range,
0..decompressor.num_vals(),
&mut positions,
);
assert_eq!(positions, expected_positions);
};
test_range(expected_val.saturating_sub(1)..=expected_val);
test_range(expected_val..=expected_val);
test_range(expected_val..=expected_val.saturating_add(1));
test_range(expected_val.saturating_sub(1)..=expected_val.saturating_add(1));
}
}
fn test_aux_vals(u128_vals: &[u128]) -> OwnedBytes {
let mut out = Vec::new();
serialize_column_values_u128(&u128_vals, &mut out).unwrap();
let data = OwnedBytes::new(out);
test_all(data.clone(), u128_vals);
data
}
#[test]
fn test_range_1() {
let vals = &[
1u128,
100u128,
3u128,
99999u128,
100000u128,
100001u128,
4_000_211_221u128,
4_000_211_222u128,
333u128,
];
let mut data = test_aux_vals(vals);
let _header = U128Header::deserialize(&mut data);
let decomp = CompactSpaceDecompressor::open(data).unwrap();
let complete_range = 0..vals.len() as u32;
for (pos, val) in vals.iter().enumerate() {
let val = *val;
let pos = pos as u32;
let mut positions = Vec::new();
decomp.get_positions_for_value_range(val..=val, pos..pos + 1, &mut positions);
assert_eq!(positions, vec![pos]);
}
// handle docid range out of bounds
let positions: Vec<u32> = get_positions_for_value_range_helper(&decomp, 0..=1, 1..u32::MAX);
assert!(positions.is_empty());
let positions =
get_positions_for_value_range_helper(&decomp, 0..=1, complete_range.clone());
assert_eq!(positions, vec![0]);
let positions =
get_positions_for_value_range_helper(&decomp, 0..=2, complete_range.clone());
assert_eq!(positions, vec![0]);
let positions =
get_positions_for_value_range_helper(&decomp, 0..=3, complete_range.clone());
assert_eq!(positions, vec![0, 2]);
assert_eq!(
get_positions_for_value_range_helper(
&decomp,
99999u128..=99999u128,
complete_range.clone()
),
vec![3]
);
assert_eq!(
get_positions_for_value_range_helper(
&decomp,
99999u128..=100000u128,
complete_range.clone()
),
vec![3, 4]
);
assert_eq!(
get_positions_for_value_range_helper(
&decomp,
99998u128..=100000u128,
complete_range.clone()
),
vec![3, 4]
);
assert_eq!(
&get_positions_for_value_range_helper(
&decomp,
99998u128..=99999u128,
complete_range.clone()
),
&[3]
);
assert!(get_positions_for_value_range_helper(
&decomp,
99998u128..=99998u128,
complete_range.clone()
)
.is_empty());
assert_eq!(
&get_positions_for_value_range_helper(
&decomp,
333u128..=333u128,
complete_range.clone()
),
&[8]
);
assert_eq!(
&get_positions_for_value_range_helper(
&decomp,
332u128..=333u128,
complete_range.clone()
),
&[8]
);
assert_eq!(
&get_positions_for_value_range_helper(
&decomp,
332u128..=334u128,
complete_range.clone()
),
&[8]
);
assert_eq!(
&get_positions_for_value_range_helper(
&decomp,
333u128..=334u128,
complete_range.clone()
),
&[8]
);
assert_eq!(
&get_positions_for_value_range_helper(
&decomp,
4_000_211_221u128..=5_000_000_000u128,
complete_range
),
&[6, 7]
);
}
#[test]
fn test_empty() {
let vals = &[];
let data = test_aux_vals(vals);
let _decomp = CompactSpaceDecompressor::open(data).unwrap();
}
#[test]
fn test_range_2() {
let vals = &[
100u128,
99999u128,
100000u128,
100001u128,
4_000_211_221u128,
4_000_211_222u128,
333u128,
];
let mut data = test_aux_vals(vals);
let _header = U128Header::deserialize(&mut data);
let decomp = CompactSpaceDecompressor::open(data).unwrap();
let complete_range = 0..vals.len() as u32;
assert!(
&get_positions_for_value_range_helper(&decomp, 0..=5, complete_range.clone())
.is_empty(),
);
assert_eq!(
&get_positions_for_value_range_helper(&decomp, 0..=100, complete_range.clone()),
&[0]
);
assert_eq!(
&get_positions_for_value_range_helper(&decomp, 0..=105, complete_range),
&[0]
);
}
fn get_positions_for_value_range_helper<C: ColumnValues<T> + ?Sized, T: PartialOrd>(
column: &C,
value_range: RangeInclusive<T>,
doc_id_range: Range<u32>,
) -> Vec<u32> {
let mut positions = Vec::new();
column.get_row_ids_for_value_range(value_range, doc_id_range, &mut positions);
positions
}
#[test]
fn test_range_3() {
let vals = &[
200u128,
201,
202,
203,
204,
204,
206,
207,
208,
209,
210,
1_000_000,
5_000_000_000,
];
let mut out = Vec::new();
serialize_column_values_u128(&&vals[..], &mut out).unwrap();
let decomp = open_u128_mapped(OwnedBytes::new(out)).unwrap();
let complete_range = 0..vals.len() as u32;
assert_eq!(
get_positions_for_value_range_helper(&*decomp, 199..=200, complete_range.clone()),
vec![0]
);
assert_eq!(
get_positions_for_value_range_helper(&*decomp, 199..=201, complete_range.clone()),
vec![0, 1]
);
assert_eq!(
get_positions_for_value_range_helper(&*decomp, 200..=200, complete_range.clone()),
vec![0]
);
assert_eq!(
get_positions_for_value_range_helper(&*decomp, 1_000_000..=1_000_000, complete_range),
vec![11]
);
}
#[test]
fn test_bug1() {
let vals = &[9223372036854775806];
let _data = test_aux_vals(vals);
}
#[test]
fn test_bug2() {
let vals = &[340282366920938463463374607431768211455u128];
let _data = test_aux_vals(vals);
}
#[test]
fn test_bug3() {
let vals = &[340282366920938463463374607431768211454];
let _data = test_aux_vals(vals);
}
#[test]
fn test_bug4() {
let vals = &[340282366920938463463374607431768211455, 0];
let _data = test_aux_vals(vals);
}
#[test]
fn test_first_large_gaps() {
let vals = &[1_000_000_000u128; 100];
let _data = test_aux_vals(vals);
}
use proptest::prelude::*;
fn num_strategy() -> impl Strategy<Value = u128> {
prop_oneof![
1 => prop::num::u128::ANY.prop_map(|num| u128::MAX - (num % 10) ),
1 => prop::num::u128::ANY.prop_map(|num| i64::MAX as u128 + 5 - (num % 10) ),
1 => prop::num::u128::ANY.prop_map(|num| i128::MAX as u128 + 5 - (num % 10) ),
1 => prop::num::u128::ANY.prop_map(|num| num % 10 ),
20 => prop::num::u128::ANY,
]
}
proptest! {
#![proptest_config(ProptestConfig::with_cases(10))]
#[test]
fn compress_decompress_random(vals in proptest::collection::vec(num_strategy() , 1..1000)) {
let _data = test_aux_vals(&vals);
}
}
}

View File

@@ -0,0 +1,178 @@
use std::fmt::Debug;
use std::io;
use std::io::Write;
use std::sync::Arc;
mod compact_space;
use common::{BinarySerializable, OwnedBytes, VInt};
use compact_space::{CompactSpaceCompressor, CompactSpaceDecompressor};
use crate::column_values::monotonic_map_column;
use crate::column_values::monotonic_mapping::{
StrictlyMonotonicMappingInverter, StrictlyMonotonicMappingToInternal,
};
use crate::iterable::Iterable;
use crate::{ColumnValues, MonotonicallyMappableToU128};
#[derive(Debug, Copy, Clone, PartialEq, Eq)]
pub(crate) struct U128Header {
pub num_vals: u32,
pub codec_type: U128FastFieldCodecType,
}
impl BinarySerializable for U128Header {
fn serialize<W: io::Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
VInt(self.num_vals as u64).serialize(writer)?;
self.codec_type.serialize(writer)?;
Ok(())
}
fn deserialize<R: io::Read>(reader: &mut R) -> io::Result<Self> {
let num_vals = VInt::deserialize(reader)?.0 as u32;
let codec_type = U128FastFieldCodecType::deserialize(reader)?;
Ok(U128Header {
num_vals,
codec_type,
})
}
}
/// Serializes u128 values with the compact space codec.
pub fn serialize_column_values_u128<T: MonotonicallyMappableToU128>(
iterable: &dyn Iterable<T>,
output: &mut impl io::Write,
) -> io::Result<()> {
let compressor = CompactSpaceCompressor::train_from(
iterable
.boxed_iter()
.map(MonotonicallyMappableToU128::to_u128),
);
let header = U128Header {
num_vals: compressor.num_vals(),
codec_type: U128FastFieldCodecType::CompactSpace,
};
header.serialize(output)?;
compressor.compress_into(
iterable
.boxed_iter()
.map(MonotonicallyMappableToU128::to_u128),
output,
)?;
Ok(())
}
#[derive(PartialEq, Eq, PartialOrd, Ord, Debug, Clone, Copy)]
#[repr(u8)]
/// Available codecs to use to encode the u128 (via [`MonotonicallyMappableToU128`]) converted data.
pub(crate) enum U128FastFieldCodecType {
/// This codec takes a large number space (u128) and reduces it to a compact number space, by
/// removing the holes.
CompactSpace = 1,
}
impl BinarySerializable for U128FastFieldCodecType {
fn serialize<W: Write + ?Sized>(&self, wrt: &mut W) -> io::Result<()> {
self.to_code().serialize(wrt)
}
fn deserialize<R: io::Read>(reader: &mut R) -> io::Result<Self> {
let code = u8::deserialize(reader)?;
let codec_type: Self = Self::from_code(code)
.ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "Unknown code `{code}.`"))?;
Ok(codec_type)
}
}
impl U128FastFieldCodecType {
pub(crate) fn to_code(self) -> u8 {
self as u8
}
pub(crate) fn from_code(code: u8) -> Option<Self> {
match code {
1 => Some(Self::CompactSpace),
_ => None,
}
}
}
/// Returns the correct codec reader wrapped in the `Arc` for the data.
pub fn open_u128_mapped<T: MonotonicallyMappableToU128 + Debug>(
mut bytes: OwnedBytes,
) -> io::Result<Arc<dyn ColumnValues<T>>> {
let header = U128Header::deserialize(&mut bytes)?;
assert_eq!(header.codec_type, U128FastFieldCodecType::CompactSpace);
let reader = CompactSpaceDecompressor::open(bytes)?;
let inverted: StrictlyMonotonicMappingInverter<StrictlyMonotonicMappingToInternal<T>> =
StrictlyMonotonicMappingToInternal::<T>::new().into();
Ok(Arc::new(monotonic_map_column(reader, inverted)))
}
#[cfg(test)]
pub mod tests {
use super::*;
use crate::column_values::u64_based::{
serialize_and_load_u64_based_column_values, serialize_u64_based_column_values,
ALL_U64_CODEC_TYPES,
};
use crate::column_values::CodecType;
#[test]
fn test_serialize_deserialize_u128_header() {
let original = U128Header {
num_vals: 11,
codec_type: U128FastFieldCodecType::CompactSpace,
};
let mut out = Vec::new();
original.serialize(&mut out).unwrap();
let restored = U128Header::deserialize(&mut &out[..]).unwrap();
assert_eq!(restored, original);
}
#[test]
fn test_serialize_deserialize() {
let original = [1u64, 5u64, 10u64];
let restored: Vec<u64> =
serialize_and_load_u64_based_column_values(&&original[..], &ALL_U64_CODEC_TYPES)
.iter()
.collect();
assert_eq!(&restored, &original[..]);
}
#[test]
fn test_fastfield_bool_size_bitwidth_1() {
let mut buffer = Vec::new();
serialize_u64_based_column_values::<bool>(
&&[false, true][..],
&ALL_U64_CODEC_TYPES,
&mut buffer,
)
.unwrap();
// TODO put the header as a footer so that it serves as a padding.
// 5 bytes of header, 1 byte of value, 7 bytes of padding.
assert_eq!(buffer.len(), 5 + 1);
}
#[test]
fn test_fastfield_bool_bit_size_bitwidth_0() {
let mut buffer = Vec::new();
serialize_u64_based_column_values::<bool>(
&&[false, true][..],
&ALL_U64_CODEC_TYPES,
&mut buffer,
)
.unwrap();
// 6 bytes of header, 0 bytes of value, 7 bytes of padding.
assert_eq!(buffer.len(), 6);
}
#[test]
fn test_fastfield_gcd() {
let mut buffer = Vec::new();
let vals: Vec<u64> = (0..80).map(|val| (val % 7) * 1_000u64).collect();
serialize_u64_based_column_values(&&vals[..], &[CodecType::Bitpacked], &mut buffer)
.unwrap();
// Values are stored over 3 bits.
assert_eq!(buffer.len(), 6 + (3 * 80 / 8));
}
}

View File

@@ -0,0 +1,127 @@
use std::io::{self, Write};
use common::{BinarySerializable, OwnedBytes};
use fastdivide::DividerU64;
use tantivy_bitpacker::{compute_num_bits, BitPacker, BitUnpacker};
use crate::column_values::u64_based::{ColumnCodec, ColumnCodecEstimator, ColumnStats};
use crate::{ColumnValues, RowId};
/// Depending on the field type, a different
/// fast field is required.
#[derive(Clone)]
pub struct BitpackedReader {
data: OwnedBytes,
bit_unpacker: BitUnpacker,
stats: ColumnStats,
}
impl ColumnValues for BitpackedReader {
#[inline(always)]
fn get_val(&self, doc: u32) -> u64 {
self.stats.min_value + self.stats.gcd.get() * self.bit_unpacker.get(doc, &self.data)
}
#[inline]
fn min_value(&self) -> u64 {
self.stats.min_value
}
#[inline]
fn max_value(&self) -> u64 {
self.stats.max_value
}
#[inline]
fn num_vals(&self) -> RowId {
self.stats.num_rows
}
}
fn num_bits(stats: &ColumnStats) -> u8 {
compute_num_bits(stats.amplitude() / stats.gcd)
}
#[derive(Default)]
pub struct BitpackedCodecEstimator;
impl ColumnCodecEstimator for BitpackedCodecEstimator {
fn collect(&mut self, _value: u64) {}
fn estimate(&self, stats: &ColumnStats) -> Option<u64> {
let num_bits_per_value = num_bits(stats);
Some(stats.num_bytes() + (stats.num_rows as u64 * (num_bits_per_value as u64) + 7) / 8)
}
fn serialize(
&self,
stats: &ColumnStats,
vals: &mut dyn Iterator<Item = u64>,
wrt: &mut dyn Write,
) -> io::Result<()> {
stats.serialize(wrt)?;
let num_bits = num_bits(stats);
let mut bit_packer = BitPacker::new();
let divider = DividerU64::divide_by(stats.gcd.get());
for val in vals {
bit_packer.write(divider.divide(val - stats.min_value), num_bits, wrt)?;
}
bit_packer.close(wrt)?;
Ok(())
}
}
pub struct BitpackedCodec;
impl ColumnCodec for BitpackedCodec {
type ColumnValues = BitpackedReader;
type Estimator = BitpackedCodecEstimator;
/// Opens a fast field given a file.
fn load(mut data: OwnedBytes) -> io::Result<Self::ColumnValues> {
let stats = ColumnStats::deserialize(&mut data)?;
let num_bits = num_bits(&stats);
let bit_unpacker = BitUnpacker::new(num_bits);
Ok(BitpackedReader {
data,
bit_unpacker,
stats,
})
}
}
#[cfg(test)]
mod tests {
use super::*;
use crate::column_values::u64_based::tests::create_and_validate;
#[test]
fn test_with_codec_data_sets_simple() {
create_and_validate::<BitpackedCodec>(&[4, 3, 12], "name");
}
#[test]
fn test_with_codec_data_sets_simple_gcd() {
create_and_validate::<BitpackedCodec>(&[1000, 2000, 3000], "name");
}
#[test]
fn test_with_codec_data_sets() {
let data_sets = crate::column_values::u64_based::tests::get_codec_test_datasets();
for (mut data, name) in data_sets {
create_and_validate::<BitpackedCodec>(&data, name);
data.reverse();
create_and_validate::<BitpackedCodec>(&data, name);
}
}
#[test]
fn bitpacked_fast_field_rand() {
for _ in 0..500 {
let mut data = (0..1 + rand::random::<u8>() as usize)
.map(|_| rand::random::<i64>() as u64 / 2)
.collect::<Vec<_>>();
create_and_validate::<BitpackedCodec>(&data, "rand");
data.reverse();
create_and_validate::<BitpackedCodec>(&data, "rand");
}
}
}

View File

@@ -0,0 +1,281 @@
use std::io::Write;
use std::sync::Arc;
use std::{io, iter};
use common::{BinarySerializable, CountingWriter, DeserializeFrom, OwnedBytes};
use fastdivide::DividerU64;
use tantivy_bitpacker::{compute_num_bits, BitPacker, BitUnpacker};
use crate::column_values::u64_based::line::Line;
use crate::column_values::u64_based::{ColumnCodec, ColumnCodecEstimator, ColumnStats};
use crate::column_values::{ColumnValues, VecColumn};
use crate::MonotonicallyMappableToU64;
const BLOCK_SIZE: u32 = 512u32;
#[derive(Debug, Default)]
struct Block {
line: Line,
bit_unpacker: BitUnpacker,
data_start_offset: usize,
}
impl BinarySerializable for Block {
fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
self.line.serialize(writer)?;
self.bit_unpacker.bit_width().serialize(writer)?;
Ok(())
}
fn deserialize<R: io::Read>(reader: &mut R) -> io::Result<Self> {
let line = Line::deserialize(reader)?;
let bit_width = u8::deserialize(reader)?;
Ok(Block {
line,
bit_unpacker: BitUnpacker::new(bit_width),
data_start_offset: 0,
})
}
}
fn compute_num_blocks(num_vals: u32) -> u32 {
(num_vals + BLOCK_SIZE - 1) / BLOCK_SIZE
}
pub struct BlockwiseLinearEstimator {
block: Vec<u64>,
values_num_bytes: u64,
meta_num_bytes: u64,
}
impl Default for BlockwiseLinearEstimator {
fn default() -> Self {
Self {
block: Vec::with_capacity(BLOCK_SIZE as usize),
values_num_bytes: 0u64,
meta_num_bytes: 0u64,
}
}
}
impl BlockwiseLinearEstimator {
fn flush_block_estimate(&mut self) {
if self.block.is_empty() {
return;
}
let line = Line::train(&VecColumn::from(&self.block));
let mut max_value = 0u64;
for (i, buffer_val) in self.block.iter().enumerate() {
let interpolated_val = line.eval(i as u32);
let val = buffer_val.wrapping_sub(interpolated_val);
max_value = val.max(max_value);
}
let bit_width = compute_num_bits(max_value) as usize;
self.values_num_bytes += (bit_width * self.block.len() + 7) as u64 / 8;
self.meta_num_bytes += 1 + line.num_bytes();
}
}
impl ColumnCodecEstimator for BlockwiseLinearEstimator {
fn collect(&mut self, value: u64) {
self.block.push(value);
if self.block.len() == BLOCK_SIZE as usize {
self.flush_block_estimate();
self.block.clear();
}
}
fn estimate(&self, stats: &ColumnStats) -> Option<u64> {
let mut estimate = 4 + stats.num_bytes() + self.meta_num_bytes + self.values_num_bytes;
if stats.gcd.get() > 1 {
let estimate_gain_from_gcd =
(stats.gcd.get() as f32).log2().floor() * stats.num_rows as f32 / 8.0f32;
estimate = estimate.saturating_sub(estimate_gain_from_gcd as u64);
}
Some(estimate)
}
fn finalize(&mut self) {
self.flush_block_estimate();
}
fn serialize(
&self,
stats: &ColumnStats,
mut vals: &mut dyn Iterator<Item = u64>,
wrt: &mut dyn Write,
) -> io::Result<()> {
stats.serialize(wrt)?;
let mut buffer = Vec::with_capacity(BLOCK_SIZE as usize);
let num_blocks = compute_num_blocks(stats.num_rows) as usize;
let mut blocks = Vec::with_capacity(num_blocks);
let mut bit_packer = BitPacker::new();
let gcd_divider = DividerU64::divide_by(stats.gcd.get());
for _ in 0..num_blocks {
buffer.clear();
buffer.extend(
(&mut vals)
.map(MonotonicallyMappableToU64::to_u64)
.take(BLOCK_SIZE as usize),
);
for buffer_val in buffer.iter_mut() {
*buffer_val = gcd_divider.divide(*buffer_val - stats.min_value);
}
let line = Line::train(&VecColumn::from(&buffer));
assert!(!buffer.is_empty());
for (i, buffer_val) in buffer.iter_mut().enumerate() {
let interpolated_val = line.eval(i as u32);
*buffer_val = buffer_val.wrapping_sub(interpolated_val);
}
let bit_width = buffer.iter().copied().map(compute_num_bits).max().unwrap();
for &buffer_val in &buffer {
bit_packer.write(buffer_val, bit_width, wrt)?;
}
blocks.push(Block {
line,
bit_unpacker: BitUnpacker::new(bit_width),
data_start_offset: 0,
});
}
bit_packer.close(wrt)?;
assert_eq!(blocks.len(), num_blocks);
let mut counting_wrt = CountingWriter::wrap(wrt);
for block in &blocks {
block.serialize(&mut counting_wrt)?;
}
let footer_len = counting_wrt.written_bytes();
(footer_len as u32).serialize(&mut counting_wrt)?;
Ok(())
}
}
pub struct BlockwiseLinearCodec;
impl ColumnCodec<u64> for BlockwiseLinearCodec {
type ColumnValues = BlockwiseLinearReader;
type Estimator = BlockwiseLinearEstimator;
fn load(mut bytes: OwnedBytes) -> io::Result<Self::ColumnValues> {
let stats = ColumnStats::deserialize(&mut bytes)?;
let footer_len: u32 = (&bytes[bytes.len() - 4..]).deserialize()?;
let footer_offset = bytes.len() - 4 - footer_len as usize;
let (data, mut footer) = bytes.split(footer_offset);
let num_blocks = compute_num_blocks(stats.num_rows);
let mut blocks: Vec<Block> = iter::repeat_with(|| Block::deserialize(&mut footer))
.take(num_blocks as usize)
.collect::<io::Result<_>>()?;
let mut start_offset = 0;
for block in &mut blocks {
block.data_start_offset = start_offset;
start_offset += (block.bit_unpacker.bit_width() as usize) * BLOCK_SIZE as usize / 8;
}
Ok(BlockwiseLinearReader {
blocks: blocks.into_boxed_slice().into(),
data,
stats,
})
}
}
#[derive(Clone)]
pub struct BlockwiseLinearReader {
blocks: Arc<[Block]>,
data: OwnedBytes,
stats: ColumnStats,
}
impl ColumnValues for BlockwiseLinearReader {
#[inline(always)]
fn get_val(&self, idx: u32) -> u64 {
let block_id = (idx / BLOCK_SIZE) as usize;
let idx_within_block = idx % BLOCK_SIZE;
let block = &self.blocks[block_id];
let interpoled_val: u64 = block.line.eval(idx_within_block);
let block_bytes = &self.data[block.data_start_offset..];
let bitpacked_diff = block.bit_unpacker.get(idx_within_block, block_bytes);
// TODO optimize me! the line parameters could be tweaked to include the multiplication and
// remove the dependency.
self.stats.min_value
+ self
.stats
.gcd
.get()
.wrapping_mul(interpoled_val.wrapping_add(bitpacked_diff))
}
#[inline(always)]
fn min_value(&self) -> u64 {
self.stats.min_value
}
#[inline(always)]
fn max_value(&self) -> u64 {
self.stats.max_value
}
#[inline(always)]
fn num_vals(&self) -> u32 {
self.stats.num_rows
}
}
#[cfg(test)]
mod tests {
use super::*;
use crate::column_values::u64_based::tests::create_and_validate;
#[test]
fn test_with_codec_data_sets_simple() {
create_and_validate::<BlockwiseLinearCodec>(
&[11, 20, 40, 20, 10, 10, 10, 10, 10, 10],
"simple test",
)
.unwrap();
}
#[test]
fn test_with_codec_data_sets_simple_gcd() {
let (_, actual_compression_rate) = create_and_validate::<BlockwiseLinearCodec>(
&[10, 20, 40, 20, 10, 10, 10, 10, 10, 10],
"name",
)
.unwrap();
assert_eq!(actual_compression_rate, 0.175);
}
#[test]
fn test_with_codec_data_sets() {
let data_sets = crate::column_values::u64_based::tests::get_codec_test_datasets();
for (mut data, name) in data_sets {
create_and_validate::<BlockwiseLinearCodec>(&data, name);
data.reverse();
create_and_validate::<BlockwiseLinearCodec>(&data, name);
}
}
#[test]
fn test_blockwise_linear_fast_field_rand() {
for _ in 0..500 {
let mut data = (0..1 + rand::random::<u8>() as usize)
.map(|_| rand::random::<i64>() as u64 / 2)
.collect::<Vec<_>>();
create_and_validate::<BlockwiseLinearCodec>(&data, "rand");
data.reverse();
create_and_validate::<BlockwiseLinearCodec>(&data, "rand");
}
}
}

View File

@@ -0,0 +1,210 @@
use std::io;
use std::num::NonZeroU32;
use common::{BinarySerializable, VInt};
use crate::column_values::ColumnValues;
const MID_POINT: u64 = (1u64 << 32) - 1u64;
/// `Line` describes a line function `y: ax + b` using integer
/// arithmetics.
///
/// The slope is in fact a decimal split into a 32 bit integer value,
/// and a 32-bit decimal value.
///
/// The multiplication then becomes.
/// `y = m * x >> 32 + b`
#[derive(Debug, Clone, Copy, Default)]
pub struct Line {
pub(crate) slope: u64,
pub(crate) intercept: u64,
}
/// Compute the line slope.
///
/// This function has the nice property of being
/// invariant by translation.
/// `
/// compute_slope(y0, y1)
/// = compute_slope(y0 + X % 2^64, y1 + X % 2^64)
/// `
fn compute_slope(y0: u64, y1: u64, num_vals: NonZeroU32) -> u64 {
let dy = y1.wrapping_sub(y0);
let sign = dy <= (1 << 63);
let abs_dy = if sign {
y1.wrapping_sub(y0)
} else {
y0.wrapping_sub(y1)
};
if abs_dy >= 1 << 32 {
// This is outside of realm we handle.
// Let's just bail.
return 0u64;
}
let abs_slope = (abs_dy << 32) / num_vals.get() as u64;
if sign {
abs_slope
} else {
// The complement does indeed create the
// opposite decreasing slope...
//
// Intuitively (without the bitshifts and % u64::MAX)
// ```
// (x + shift)*(u64::MAX - abs_slope)
// - (x * (u64::MAX - abs_slope))
// = - shift * abs_slope
// ```
u64::MAX - abs_slope
}
}
impl Line {
#[inline(always)]
pub fn eval(&self, x: u32) -> u64 {
let linear_part = ((x as u64).wrapping_mul(self.slope) >> 32) as i32 as u64;
self.intercept.wrapping_add(linear_part)
}
// Intercept is only computed from provided positions
pub fn train_from(
first_val: u64,
last_val: u64,
num_vals: u32,
positions_and_values: impl Iterator<Item = (u64, u64)>,
) -> Self {
// TODO replace with let else
let idx_last_val = if let Some(idx_last_val) = NonZeroU32::new(num_vals - 1) {
idx_last_val
} else {
return Line::default();
};
let y0 = first_val;
let y1 = last_val;
// We first independently pick our slope.
let slope = compute_slope(y0, y1, idx_last_val);
// We picked our slope. Note that it does not have to be perfect.
// Now we need to compute the best intercept.
//
// Intuitively, the best intercept is such that line passes through one of the
// `(i, ys[])`.
//
// The best intercept therefore has the form
// `y[i] - line.eval(i)` (using wrapping arithmetics).
// In other words, the best intercept is one of the `y - Line::eval(ys[i])`
// and our task is just to pick the one that minimizes our error.
//
// Without sorting our values, this is a difficult problem.
// We however rely on the following trick...
//
// We only focus on the case where the interpolation is half decent.
// If the line interpolation is doing its job on a dataset suited for it,
// we can hope that the maximum error won't be larger than `u64::MAX / 2`.
//
// In other words, even without the intercept the values `y - Line::eval(ys[i])` will all be
// within an interval that takes less than half of the modulo space of `u64`.
//
// Our task is therefore to identify this interval.
// Here we simply translate all of our values by `y0 - 2^63` and pick the min.
let mut line = Line {
slope,
intercept: 0,
};
let heuristic_shift = y0.wrapping_sub(MID_POINT);
line.intercept = positions_and_values
.map(|(pos, y)| y.wrapping_sub(line.eval(pos as u32)))
.min_by_key(|&val| val.wrapping_sub(heuristic_shift))
.unwrap_or(0u64); //< Never happens.
line
}
/// Returns a line that attemps to approximate a function
/// f: i in 0..[ys.num_vals()) -> ys[i].
///
/// - The approximation is always lower than the actual value.
/// Or more rigorously, formally `f(i).wrapping_sub(ys[i])` is small
/// for any i in [0..ys.len()).
/// - It computes without panicking for any value of it.
///
/// This function is only invariable by translation if all of the
/// `ys` are packaged into half of the space. (See heuristic below)
/// TODO USE array
pub fn train(ys: &dyn ColumnValues) -> Self {
let first_val = ys.iter().next().unwrap();
let last_val = ys.iter().nth(ys.num_vals() as usize - 1).unwrap();
Self::train_from(
first_val,
last_val,
ys.num_vals(),
ys.iter().enumerate().map(|(pos, val)| (pos as u64, val)),
)
}
}
impl BinarySerializable for Line {
fn serialize<W: io::Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
VInt(self.slope).serialize(writer)?;
VInt(self.intercept).serialize(writer)?;
Ok(())
}
fn deserialize<R: io::Read>(reader: &mut R) -> io::Result<Self> {
let slope = VInt::deserialize(reader)?.0;
let intercept = VInt::deserialize(reader)?.0;
Ok(Line { slope, intercept })
}
}
#[cfg(test)]
mod tests {
use super::*;
use crate::column_values::VecColumn;
/// Test training a line and ensuring that the maximum difference between
/// the data points and the line is `expected`.
///
/// This function operates translation over the data for better coverage.
#[track_caller]
fn test_line_interpol_with_translation(ys: &[u64], expected: Option<u64>) {
let mut translations = vec![0, 100, u64::MAX / 2, u64::MAX, u64::MAX - 1];
translations.extend_from_slice(ys);
for translation in translations {
let translated_ys: Vec<u64> = ys
.iter()
.copied()
.map(|y| y.wrapping_add(translation))
.collect();
let largest_err = test_eval_max_err(&translated_ys);
assert_eq!(largest_err, expected);
}
}
fn test_eval_max_err(ys: &[u64]) -> Option<u64> {
let line = Line::train(&VecColumn::from(&ys));
ys.iter()
.enumerate()
.map(|(x, y)| y.wrapping_sub(line.eval(x as u32)))
.max()
}
#[test]
fn test_train() {
test_line_interpol_with_translation(&[11, 11, 11, 12, 12, 13], Some(1));
test_line_interpol_with_translation(&[13, 12, 12, 11, 11, 11], Some(1));
test_line_interpol_with_translation(&[13, 13, 12, 11, 11, 11], Some(1));
test_line_interpol_with_translation(&[13, 13, 12, 11, 11, 11], Some(1));
test_line_interpol_with_translation(&[u64::MAX - 1, 0, 0, 1], Some(1));
test_line_interpol_with_translation(&[u64::MAX - 1, u64::MAX, 0, 1], Some(0));
test_line_interpol_with_translation(&[0, 1, 2, 3, 5], Some(0));
test_line_interpol_with_translation(&[1, 2, 3, 4], Some(0));
let data: Vec<u64> = (0..255).collect();
test_line_interpol_with_translation(&data, Some(0));
let data: Vec<u64> = (0..255).map(|el| el * 2).collect();
test_line_interpol_with_translation(&data, Some(0));
}
}

View File

@@ -0,0 +1,277 @@
use std::io;
use common::{BinarySerializable, OwnedBytes};
use tantivy_bitpacker::{compute_num_bits, BitPacker, BitUnpacker};
use super::line::Line;
use super::ColumnValues;
use crate::column_values::u64_based::{ColumnCodec, ColumnCodecEstimator, ColumnStats};
use crate::column_values::VecColumn;
use crate::RowId;
const HALF_SPACE: u64 = u64::MAX / 2;
const LINE_ESTIMATION_BLOCK_LEN: usize = 512;
/// Depending on the field type, a different
/// fast field is required.
#[derive(Clone)]
pub struct LinearReader {
data: OwnedBytes,
linear_params: LinearParams,
stats: ColumnStats,
}
impl ColumnValues for LinearReader {
#[inline]
fn get_val(&self, doc: u32) -> u64 {
let interpoled_val: u64 = self.linear_params.line.eval(doc);
let bitpacked_diff = self.linear_params.bit_unpacker.get(doc, &self.data);
interpoled_val.wrapping_add(bitpacked_diff)
}
#[inline(always)]
fn min_value(&self) -> u64 {
self.stats.min_value
}
#[inline(always)]
fn max_value(&self) -> u64 {
self.stats.max_value
}
#[inline]
fn num_vals(&self) -> u32 {
self.stats.num_rows
}
}
/// Fastfield serializer, which tries to guess values by linear interpolation
/// and stores the difference bitpacked.
pub struct LinearCodec;
#[derive(Debug, Clone)]
struct LinearParams {
line: Line,
bit_unpacker: BitUnpacker,
}
impl BinarySerializable for LinearParams {
fn serialize<W: io::Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
self.line.serialize(writer)?;
self.bit_unpacker.bit_width().serialize(writer)?;
Ok(())
}
fn deserialize<R: io::Read>(reader: &mut R) -> io::Result<Self> {
let line = Line::deserialize(reader)?;
let bit_width = u8::deserialize(reader)?;
Ok(Self {
line,
bit_unpacker: BitUnpacker::new(bit_width),
})
}
}
pub struct LinearCodecEstimator {
block: Vec<u64>,
line: Option<Line>,
row_id: RowId,
min_deviation: u64,
max_deviation: u64,
first_val: u64,
last_val: u64,
}
impl Default for LinearCodecEstimator {
fn default() -> LinearCodecEstimator {
LinearCodecEstimator {
block: Vec::with_capacity(LINE_ESTIMATION_BLOCK_LEN),
line: None,
row_id: 0,
min_deviation: u64::MAX,
max_deviation: u64::MIN,
first_val: 0u64,
last_val: 0u64,
}
}
}
impl ColumnCodecEstimator for LinearCodecEstimator {
fn finalize(&mut self) {
if let Some(line) = self.line.as_mut() {
line.intercept = line
.intercept
.wrapping_add(self.min_deviation)
.wrapping_sub(HALF_SPACE);
}
}
fn estimate(&self, stats: &ColumnStats) -> Option<u64> {
let line = self.line?;
let amplitude = self.max_deviation - self.min_deviation;
let num_bits = compute_num_bits(amplitude);
let linear_params = LinearParams {
line,
bit_unpacker: BitUnpacker::new(num_bits),
};
Some(
stats.num_bytes()
+ linear_params.num_bytes()
+ (num_bits as u64 * stats.num_rows as u64 + 7) / 8,
)
}
fn serialize(
&self,
stats: &ColumnStats,
vals: &mut dyn Iterator<Item = u64>,
wrt: &mut dyn io::Write,
) -> io::Result<()> {
stats.serialize(wrt)?;
let line = self.line.unwrap();
let amplitude = self.max_deviation - self.min_deviation;
let num_bits = compute_num_bits(amplitude);
let linear_params = LinearParams {
line,
bit_unpacker: BitUnpacker::new(num_bits),
};
linear_params.serialize(wrt)?;
let mut bit_packer = BitPacker::new();
for (pos, value) in vals.enumerate() {
let calculated_value = line.eval(pos as u32);
let offset = value.wrapping_sub(calculated_value);
bit_packer.write(offset, num_bits, wrt)?;
}
bit_packer.close(wrt)?;
Ok(())
}
fn collect(&mut self, value: u64) {
if let Some(line) = self.line {
self.collect_after_line_estimation(&line, value);
} else {
self.collect_before_line_estimation(value);
}
}
}
impl LinearCodecEstimator {
#[inline]
fn collect_after_line_estimation(&mut self, line: &Line, value: u64) {
let interpoled_val: u64 = line.eval(self.row_id);
let deviation = value.wrapping_add(HALF_SPACE).wrapping_sub(interpoled_val);
self.min_deviation = self.min_deviation.min(deviation);
self.max_deviation = self.max_deviation.max(deviation);
if self.row_id == 0 {
self.first_val = value;
}
self.last_val = value;
self.row_id += 1u32;
}
#[inline]
fn collect_before_line_estimation(&mut self, value: u64) {
self.block.push(value);
if self.block.len() == LINE_ESTIMATION_BLOCK_LEN {
let line = Line::train(&VecColumn::from(&self.block));
let block = std::mem::take(&mut self.block);
for val in block {
self.collect_after_line_estimation(&line, val);
}
self.line = Some(line);
}
}
}
impl ColumnCodec for LinearCodec {
type ColumnValues = LinearReader;
type Estimator = LinearCodecEstimator;
fn load(mut data: OwnedBytes) -> io::Result<Self::ColumnValues> {
let stats = ColumnStats::deserialize(&mut data)?;
let linear_params = LinearParams::deserialize(&mut data)?;
Ok(LinearReader {
stats,
linear_params,
data,
})
}
}
#[cfg(test)]
mod tests {
use rand::RngCore;
use super::*;
use crate::column_values::u64_based::tests::{create_and_validate, get_codec_test_datasets};
#[test]
fn test_compression_simple() {
let vals = (100u64..)
.take(super::LINE_ESTIMATION_BLOCK_LEN)
.collect::<Vec<_>>();
create_and_validate::<LinearCodec>(&vals, "simple monotonically large").unwrap();
}
#[test]
fn test_compression() {
let data = (10..=6_000_u64).collect::<Vec<_>>();
let (estimate, actual_compression) =
create_and_validate::<LinearCodec>(&data, "simple monotonically large").unwrap();
assert_le!(actual_compression, 0.001);
assert_le!(estimate, 0.02);
}
#[test]
fn test_with_codec_datasets() {
let data_sets = get_codec_test_datasets();
for (mut data, name) in data_sets {
create_and_validate::<LinearCodec>(&data, name);
data.reverse();
create_and_validate::<LinearCodec>(&data, name);
}
}
#[test]
fn linear_interpol_fast_field_test_large_amplitude() {
let data = vec![
i64::MAX as u64 / 2,
i64::MAX as u64 / 3,
i64::MAX as u64 / 2,
];
create_and_validate::<LinearCodec>(&data, "large amplitude");
}
#[test]
fn overflow_error_test() {
let data = vec![1572656989877777, 1170935903116329, 720575940379279, 0];
create_and_validate::<LinearCodec>(&data, "overflow test");
}
#[test]
fn linear_interpol_fast_concave_data() {
let data = vec![0, 1, 2, 5, 8, 10, 20, 50];
create_and_validate::<LinearCodec>(&data, "concave data");
}
#[test]
fn linear_interpol_fast_convex_data() {
let data = vec![0, 40, 60, 70, 75, 77];
create_and_validate::<LinearCodec>(&data, "convex data");
}
#[test]
fn linear_interpol_fast_field_test_simple() {
let data = (10..=20_u64).collect::<Vec<_>>();
create_and_validate::<LinearCodec>(&data, "simple monotonically");
}
#[test]
fn linear_interpol_fast_field_rand() {
let mut rng = rand::thread_rng();
for _ in 0..50 {
let mut data = (0..10_000).map(|_| rng.next_u64()).collect::<Vec<_>>();
create_and_validate::<LinearCodec>(&data, "random");
data.reverse();
create_and_validate::<LinearCodec>(&data, "random");
}
}
}

View File

@@ -0,0 +1,214 @@
mod bitpacked;
mod blockwise_linear;
mod line;
mod linear;
mod stats_collector;
use std::io;
use std::io::Write;
use std::sync::Arc;
use common::{BinarySerializable, OwnedBytes};
use crate::column_values::monotonic_mapping::{
StrictlyMonotonicMappingInverter, StrictlyMonotonicMappingToInternal,
};
pub use crate::column_values::u64_based::bitpacked::BitpackedCodec;
pub use crate::column_values::u64_based::blockwise_linear::BlockwiseLinearCodec;
pub use crate::column_values::u64_based::linear::LinearCodec;
pub use crate::column_values::u64_based::stats_collector::StatsCollector;
use crate::column_values::{monotonic_map_column, ColumnStats};
use crate::iterable::Iterable;
use crate::{ColumnValues, MonotonicallyMappableToU64};
/// A `ColumnCodecEstimator` is in charge of gathering all
/// data required to serialize a column.
///
/// This happens during a first pass on data of the column elements.
/// During that pass, all column estimators receive a call to their
/// `.collect(el)`.
///
/// After this first pass, finalize is called.
/// `.estimate(..)` then should return an accurate estimation of the
/// size of the serialized column (were we to pick this codec.).
/// `.serialize(..)` then serializes the column using this codec.
pub trait ColumnCodecEstimator<T = u64>: 'static {
/// Records a new value for estimation.
/// This method will be called for each element of the column during
/// `estimation`.
fn collect(&mut self, value: u64);
/// Finalizes the first pass phase.
fn finalize(&mut self) {}
/// Returns an accurate estimation of the number of bytes that will
/// be used to represent this column.
fn estimate(&self, stats: &ColumnStats) -> Option<u64>;
/// Serializes the column using the given codec.
/// This constitutes a second pass over the columns values.
fn serialize(
&self,
stats: &ColumnStats,
vals: &mut dyn Iterator<Item = T>,
wrt: &mut dyn io::Write,
) -> io::Result<()>;
}
/// A column codec describes a colunm serialization format.
pub trait ColumnCodec<T: PartialOrd = u64> {
/// Specialized `ColumnValues` type.
type ColumnValues: ColumnValues<T> + 'static;
/// `Estimator` for the given codec.
type Estimator: ColumnCodecEstimator + Default;
/// Loads a column that has been serialized using this codec.
fn load(bytes: OwnedBytes) -> io::Result<Self::ColumnValues>;
/// Returns an estimator.
fn estimator() -> Self::Estimator {
Self::Estimator::default()
}
/// Returns a boxed estimator.
fn boxed_estimator() -> Box<dyn ColumnCodecEstimator> {
Box::new(Self::estimator())
}
}
/// Available codecs to use to encode the u64 (via [`MonotonicallyMappableToU64`]) converted data.
#[derive(PartialEq, Eq, PartialOrd, Ord, Debug, Clone, Copy)]
#[repr(u8)]
pub enum CodecType {
/// Bitpack all values in the value range. The number of bits is defined by the amplitude
/// `column.max_value() - column.min_value()`
Bitpacked = 0u8,
/// Linear interpolation puts a line between the first and last value and then bitpacks the
/// values by the offset from the line. The number of bits is defined by the max deviation from
/// the line.
Linear = 1u8,
/// Same as [`CodecType::Linear`], but encodes in blocks of 512 elements.
BlockwiseLinear = 2u8,
}
/// List of all available u64-base codecs.
pub const ALL_U64_CODEC_TYPES: [CodecType; 3] = [
CodecType::Bitpacked,
CodecType::Linear,
CodecType::BlockwiseLinear,
];
impl CodecType {
fn to_code(self) -> u8 {
self as u8
}
fn try_from_code(code: u8) -> Option<CodecType> {
match code {
0u8 => Some(CodecType::Bitpacked),
1u8 => Some(CodecType::Linear),
2u8 => Some(CodecType::BlockwiseLinear),
_ => None,
}
}
fn load<T: MonotonicallyMappableToU64>(
&self,
bytes: OwnedBytes,
) -> io::Result<Arc<dyn ColumnValues<T>>> {
match self {
CodecType::Bitpacked => load_specific_codec::<BitpackedCodec, T>(bytes),
CodecType::Linear => load_specific_codec::<LinearCodec, T>(bytes),
CodecType::BlockwiseLinear => load_specific_codec::<BlockwiseLinearCodec, T>(bytes),
}
}
}
fn load_specific_codec<C: ColumnCodec, T: MonotonicallyMappableToU64>(
bytes: OwnedBytes,
) -> io::Result<Arc<dyn ColumnValues<T>>> {
let reader = C::load(bytes)?;
let reader_typed = monotonic_map_column(
reader,
StrictlyMonotonicMappingInverter::from(StrictlyMonotonicMappingToInternal::<T>::new()),
);
Ok(Arc::new(reader_typed))
}
impl CodecType {
/// Returns a boxed codec estimator associated to a given `CodecType`.
pub fn estimator(&self) -> Box<dyn ColumnCodecEstimator> {
match self {
CodecType::Bitpacked => BitpackedCodec::boxed_estimator(),
CodecType::Linear => LinearCodec::boxed_estimator(),
CodecType::BlockwiseLinear => BlockwiseLinearCodec::boxed_estimator(),
}
}
}
/// Serializes a given column of u64-mapped values.
pub fn serialize_u64_based_column_values<T: MonotonicallyMappableToU64>(
vals: &dyn Iterable<T>,
codec_types: &[CodecType],
wrt: &mut dyn Write,
) -> io::Result<()> {
let mut stats_collector = StatsCollector::default();
let mut estimators: Vec<(CodecType, Box<dyn ColumnCodecEstimator>)> =
Vec::with_capacity(codec_types.len());
for &codec_type in codec_types {
estimators.push((codec_type, codec_type.estimator()));
}
for val in vals.boxed_iter() {
let val_u64 = val.to_u64();
stats_collector.collect(val_u64);
for (_, estimator) in &mut estimators {
estimator.collect(val_u64);
}
}
for (_, estimator) in &mut estimators {
estimator.finalize();
}
let stats = stats_collector.stats();
let (_, best_codec, best_codec_estimator) = estimators
.into_iter()
.flat_map(|(codec_type, estimator)| {
let num_bytes = estimator.estimate(&stats)?;
Some((num_bytes, codec_type, estimator))
})
.min_by_key(|(num_bytes, _, _)| *num_bytes)
.ok_or_else(|| {
io::Error::new(io::ErrorKind::InvalidData, "No available applicable codec.")
})?;
best_codec.to_code().serialize(wrt)?;
best_codec_estimator.serialize(
&stats,
&mut vals.boxed_iter().map(MonotonicallyMappableToU64::to_u64),
wrt,
)?;
Ok(())
}
/// Load u64-based column values.
///
/// This method first identifies the codec off the first byte.
pub fn load_u64_based_column_values<T: MonotonicallyMappableToU64>(
mut bytes: OwnedBytes,
) -> io::Result<Arc<dyn ColumnValues<T>>> {
let codec_type: CodecType = bytes
.first()
.copied()
.and_then(CodecType::try_from_code)
.ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "Failed to read codec type"))?;
bytes.advance(1);
codec_type.load(bytes)
}
/// Helper function to serialize a column (autodetect from all codecs) and then open it
pub fn serialize_and_load_u64_based_column_values<T: MonotonicallyMappableToU64>(
vals: &dyn Iterable,
codec_types: &[CodecType],
) -> Arc<dyn ColumnValues<T>> {
let mut buffer = Vec::new();
serialize_u64_based_column_values(vals, codec_types, &mut buffer).unwrap();
load_u64_based_column_values::<T>(OwnedBytes::new(buffer)).unwrap()
}
#[cfg(test)]
mod tests;

View File

@@ -0,0 +1,200 @@
use std::num::NonZeroU64;
use fastdivide::DividerU64;
use crate::column_values::ColumnStats;
use crate::RowId;
/// Compute the gcd of two non null numbers.
///
/// It is recommended, but not required, to feed values such that `large >= small`.
fn compute_gcd(mut large: NonZeroU64, mut small: NonZeroU64) -> NonZeroU64 {
loop {
let rem: u64 = large.get() % small;
if let Some(new_small) = NonZeroU64::new(rem) {
(large, small) = (small, new_small);
} else {
return small;
}
}
}
#[derive(Default)]
pub struct StatsCollector {
min_max_opt: Option<(u64, u64)>,
num_rows: RowId,
// We measure the GCD of the difference between the values and the minimal value.
// This is the same as computing the difference between the values and the first value.
//
// This way, we can compress i64-converted-to-u64 (e.g. timestamp that were supplied in
// seconds, only to be converted in microseconds).
increment_gcd_opt: Option<(NonZeroU64, DividerU64)>,
first_value_opt: Option<u64>,
}
impl StatsCollector {
pub fn stats(&self) -> ColumnStats {
let (min_value, max_value) = self.min_max_opt.unwrap_or((0u64, 0u64));
let increment_gcd = if let Some((increment_gcd, _)) = self.increment_gcd_opt {
increment_gcd
} else {
NonZeroU64::new(1u64).unwrap()
};
ColumnStats {
min_value,
max_value,
num_rows: self.num_rows,
gcd: increment_gcd,
}
}
#[inline]
fn update_increment_gcd(&mut self, value: u64) {
let Some(first_value) = self.first_value_opt else {
// We set the first value and just quit.
self.first_value_opt = Some(value);
return;
};
let Some(non_zero_value) = NonZeroU64::new(value.abs_diff(first_value)) else {
// We can simply skip 0 values.
return;
};
let Some((gcd, gcd_divider)) = self.increment_gcd_opt else {
self.set_increment_gcd(non_zero_value);
return;
};
if gcd.get() == 1 {
// It won't see any update now.
return;
}
let remainder =
non_zero_value.get() - (gcd_divider.divide(non_zero_value.get())) * gcd.get();
if remainder == 0 {
return;
}
let new_gcd = compute_gcd(non_zero_value, gcd);
self.set_increment_gcd(new_gcd);
}
fn set_increment_gcd(&mut self, gcd: NonZeroU64) {
let new_divider = DividerU64::divide_by(gcd.get());
self.increment_gcd_opt = Some((gcd, new_divider));
}
pub fn collect(&mut self, value: u64) {
self.min_max_opt = Some(if let Some((min, max)) = self.min_max_opt {
(min.min(value), max.max(value))
} else {
(value, value)
});
self.num_rows += 1;
self.update_increment_gcd(value);
}
}
#[cfg(test)]
mod tests {
use std::num::NonZeroU64;
use crate::column_values::u64_based::stats_collector::{compute_gcd, StatsCollector};
use crate::column_values::u64_based::ColumnStats;
fn compute_stats(vals: impl Iterator<Item = u64>) -> ColumnStats {
let mut stats_collector = StatsCollector::default();
for val in vals {
stats_collector.collect(val);
}
stats_collector.stats()
}
fn find_gcd(vals: impl Iterator<Item = u64>) -> u64 {
compute_stats(vals).gcd.get()
}
#[test]
fn test_compute_gcd() {
let test_compute_gcd_aux = |large, small, expected| {
let large = NonZeroU64::new(large).unwrap();
let small = NonZeroU64::new(small).unwrap();
let expected = NonZeroU64::new(expected).unwrap();
assert_eq!(compute_gcd(small, large), expected);
assert_eq!(compute_gcd(large, small), expected);
};
test_compute_gcd_aux(1, 4, 1);
test_compute_gcd_aux(2, 4, 2);
test_compute_gcd_aux(10, 25, 5);
test_compute_gcd_aux(25, 25, 25);
}
#[test]
fn test_gcd() {
assert_eq!(find_gcd([0].into_iter()), 1);
assert_eq!(find_gcd([0, 10].into_iter()), 10);
assert_eq!(find_gcd([10, 0].into_iter()), 10);
assert_eq!(find_gcd([].into_iter()), 1);
assert_eq!(find_gcd([15, 30, 5, 10].into_iter()), 5);
assert_eq!(find_gcd([15, 16, 10].into_iter()), 1);
assert_eq!(find_gcd([0, 5, 5, 5].into_iter()), 5);
assert_eq!(find_gcd([0, 0].into_iter()), 1);
assert_eq!(find_gcd([1, 10, 4, 1, 7, 10].into_iter()), 3);
assert_eq!(find_gcd([1, 10, 0, 4, 1, 7, 10].into_iter()), 1);
}
#[test]
fn test_stats() {
assert_eq!(
compute_stats([].into_iter()),
ColumnStats {
gcd: NonZeroU64::new(1).unwrap(),
min_value: 0,
max_value: 0,
num_rows: 0
}
);
assert_eq!(
compute_stats([0, 1].into_iter()),
ColumnStats {
gcd: NonZeroU64::new(1).unwrap(),
min_value: 0,
max_value: 1,
num_rows: 2
}
);
assert_eq!(
compute_stats([0, 1].into_iter()),
ColumnStats {
gcd: NonZeroU64::new(1).unwrap(),
min_value: 0,
max_value: 1,
num_rows: 2
}
);
assert_eq!(
compute_stats([10, 20, 30].into_iter()),
ColumnStats {
gcd: NonZeroU64::new(10).unwrap(),
min_value: 10,
max_value: 30,
num_rows: 3
}
);
assert_eq!(
compute_stats([10, 50, 10, 30].into_iter()),
ColumnStats {
gcd: NonZeroU64::new(20).unwrap(),
min_value: 10,
max_value: 50,
num_rows: 4
}
);
assert_eq!(
compute_stats([10, 0, 30].into_iter()),
ColumnStats {
gcd: NonZeroU64::new(10).unwrap(),
min_value: 0,
max_value: 30,
num_rows: 3
}
);
}
}

View File

@@ -0,0 +1,401 @@
use proptest::prelude::*;
use proptest::strategy::Strategy;
use proptest::{num, prop_oneof, proptest};
#[test]
fn test_serialize_and_load_simple() {
let mut buffer = Vec::new();
let vals = &[1u64, 2u64, 5u64];
serialize_u64_based_column_values(
&&vals[..],
&[CodecType::Bitpacked, CodecType::BlockwiseLinear],
&mut buffer,
)
.unwrap();
assert_eq!(buffer.len(), 7);
let col = load_u64_based_column_values::<u64>(OwnedBytes::new(buffer)).unwrap();
assert_eq!(col.num_vals(), 3);
assert_eq!(col.get_val(0), 1);
assert_eq!(col.get_val(1), 2);
assert_eq!(col.get_val(2), 5);
}
#[test]
fn test_empty_column_i64() {
let vals: [i64; 0] = [];
let mut num_acceptable_codecs = 0;
for codec in ALL_U64_CODEC_TYPES {
let mut buffer = Vec::new();
if serialize_u64_based_column_values(&&vals[..], &[codec], &mut buffer).is_err() {
continue;
}
num_acceptable_codecs += 1;
let col = load_u64_based_column_values::<i64>(OwnedBytes::new(buffer)).unwrap();
assert_eq!(col.num_vals(), 0);
assert_eq!(col.min_value(), i64::MIN);
assert_eq!(col.max_value(), i64::MIN);
}
assert!(num_acceptable_codecs > 0);
}
#[test]
fn test_empty_column_u64() {
let vals: [u64; 0] = [];
let mut num_acceptable_codecs = 0;
for codec in ALL_U64_CODEC_TYPES {
let mut buffer = Vec::new();
if serialize_u64_based_column_values(&&vals[..], &[codec], &mut buffer).is_err() {
continue;
}
num_acceptable_codecs += 1;
let col = load_u64_based_column_values::<u64>(OwnedBytes::new(buffer)).unwrap();
assert_eq!(col.num_vals(), 0);
assert_eq!(col.min_value(), u64::MIN);
assert_eq!(col.max_value(), u64::MIN);
}
assert!(num_acceptable_codecs > 0);
}
#[test]
fn test_empty_column_f64() {
let vals: [f64; 0] = [];
let mut num_acceptable_codecs = 0;
for codec in ALL_U64_CODEC_TYPES {
let mut buffer = Vec::new();
if serialize_u64_based_column_values(&&vals[..], &[codec], &mut buffer).is_err() {
continue;
}
num_acceptable_codecs += 1;
let col = load_u64_based_column_values::<f64>(OwnedBytes::new(buffer)).unwrap();
assert_eq!(col.num_vals(), 0);
// FIXME. f64::MIN would be better!
assert!(col.min_value().is_nan());
assert!(col.max_value().is_nan());
}
assert!(num_acceptable_codecs > 0);
}
pub(crate) fn create_and_validate<TColumnCodec: ColumnCodec>(
vals: &[u64],
name: &str,
) -> Option<(f32, f32)> {
let mut stats_collector = StatsCollector::default();
let mut codec_estimator: TColumnCodec::Estimator = Default::default();
for val in vals.boxed_iter() {
stats_collector.collect(val);
codec_estimator.collect(val);
}
codec_estimator.finalize();
let stats = stats_collector.stats();
let estimation = codec_estimator.estimate(&stats)?;
let mut buffer = Vec::new();
codec_estimator
.serialize(&stats, vals.boxed_iter().as_mut(), &mut buffer)
.unwrap();
let actual_compression = buffer.len() as u64;
let reader = TColumnCodec::load(OwnedBytes::new(buffer)).unwrap();
assert_eq!(reader.num_vals(), vals.len() as u32);
for (doc, orig_val) in vals.iter().copied().enumerate() {
let val = reader.get_val(doc as u32);
assert_eq!(
val, orig_val,
"val `{val}` does not match orig_val {orig_val:?}, in data set {name}, data `{vals:?}`",
);
}
if !vals.is_empty() {
let test_rand_idx = rand::thread_rng().gen_range(0..=vals.len() - 1);
let expected_positions: Vec<u32> = vals
.iter()
.enumerate()
.filter(|(_, el)| **el == vals[test_rand_idx])
.map(|(pos, _)| pos as u32)
.collect();
let mut positions = Vec::new();
reader.get_row_ids_for_value_range(
vals[test_rand_idx]..=vals[test_rand_idx],
0..vals.len() as u32,
&mut positions,
);
assert_eq!(expected_positions, positions);
}
if actual_compression > 1000 {
assert!(relative_difference(estimation, actual_compression) < 0.10f32);
}
Some((
compression_rate(estimation, stats.num_rows),
compression_rate(actual_compression, stats.num_rows),
))
}
fn compression_rate(num_bytes: u64, num_values: u32) -> f32 {
num_bytes as f32 / (num_values as f32 * 8.0)
}
fn relative_difference(left: u64, right: u64) -> f32 {
let left = left as f32;
let right = right as f32;
2.0f32 * (left - right).abs() / (left + right)
}
proptest! {
#![proptest_config(ProptestConfig::with_cases(100))]
#[test]
fn test_proptest_small_bitpacked(data in proptest::collection::vec(num_strategy(), 1..10)) {
create_and_validate::<BitpackedCodec>(&data, "proptest bitpacked");
}
#[test]
fn test_proptest_small_linear(data in proptest::collection::vec(num_strategy(), 1..10)) {
create_and_validate::<LinearCodec>(&data, "proptest linearinterpol");
}
#[test]
fn test_proptest_small_blockwise_linear(data in proptest::collection::vec(num_strategy(), 1..10)) {
create_and_validate::<BlockwiseLinearCodec>(&data, "proptest multilinearinterpol");
}
}
#[test]
fn test_small_blockwise_linear_example() {
create_and_validate::<BlockwiseLinearCodec>(
&[9223372036854775808, 9223370937344622593],
"proptest multilinearinterpol",
);
}
proptest! {
#![proptest_config(ProptestConfig::with_cases(10))]
#[test]
fn test_proptest_large_bitpacked(data in proptest::collection::vec(num_strategy(), 1..6000)) {
create_and_validate::<BitpackedCodec>(&data, "proptest bitpacked");
}
#[test]
fn test_proptest_large_linear(data in proptest::collection::vec(num_strategy(), 1..6000)) {
create_and_validate::<LinearCodec>(&data, "proptest linearinterpol");
}
#[test]
fn test_proptest_large_blockwise_linear(data in proptest::collection::vec(num_strategy(), 1..6000)) {
create_and_validate::<BlockwiseLinearCodec>(&data, "proptest multilinearinterpol");
}
}
fn num_strategy() -> impl Strategy<Value = u64> {
prop_oneof![
1 => prop::num::u64::ANY.prop_map(|num| u64::MAX - (num % 10) ),
1 => prop::num::u64::ANY.prop_map(|num| num % 10 ),
20 => prop::num::u64::ANY,
]
}
pub fn get_codec_test_datasets() -> Vec<(Vec<u64>, &'static str)> {
let mut data_and_names = vec![];
let data = (10..=10_000_u64).collect::<Vec<_>>();
data_and_names.push((data, "simple monotonically increasing"));
data_and_names.push((
vec![5, 6, 7, 8, 9, 10, 99, 100],
"offset in linear interpol",
));
data_and_names.push((vec![5, 50, 3, 13, 1, 1000, 35], "rand small"));
data_and_names.push((vec![10], "single value"));
data_and_names.push((
vec![1572656989877777, 1170935903116329, 720575940379279, 0],
"overflow error",
));
data_and_names
}
fn test_codec<C: ColumnCodec>() {
let codec_name = std::any::type_name::<C>();
for (data, dataset_name) in get_codec_test_datasets() {
let estimate_actual_opt: Option<(f32, f32)> =
tests::create_and_validate::<C>(&data, dataset_name);
let result = if let Some((estimate, actual)) = estimate_actual_opt {
format!("Estimate `{estimate}` Actual `{actual}`")
} else {
"Disabled".to_string()
};
println!("Codec {codec_name}, DataSet {dataset_name}, {result}");
}
}
#[test]
fn test_codec_bitpacking() {
test_codec::<BitpackedCodec>();
}
#[test]
fn test_codec_interpolation() {
test_codec::<LinearCodec>();
}
#[test]
fn test_codec_multi_interpolation() {
test_codec::<BlockwiseLinearCodec>();
}
use super::*;
fn estimate<C: ColumnCodec>(vals: &[u64]) -> Option<f32> {
let mut stats_collector = StatsCollector::default();
let mut estimator = C::Estimator::default();
for &val in vals {
stats_collector.collect(val);
estimator.collect(val);
}
estimator.finalize();
let stats = stats_collector.stats();
let num_bytes = estimator.estimate(&stats)?;
if stats.num_rows == 0 {
return None;
}
Some(num_bytes as f32 / (8.0 * stats.num_rows as f32))
}
#[test]
fn estimation_good_interpolation_case() {
let data = (10..=20000_u64).collect::<Vec<_>>();
let linear_interpol_estimation = estimate::<LinearCodec>(&data).unwrap();
assert_le!(linear_interpol_estimation, 0.01);
let multi_linear_interpol_estimation = estimate::<BlockwiseLinearCodec>(&data).unwrap();
assert_le!(multi_linear_interpol_estimation, 0.2);
assert_lt!(linear_interpol_estimation, multi_linear_interpol_estimation);
let bitpacked_estimation = estimate::<BitpackedCodec>(&data).unwrap();
assert_lt!(linear_interpol_estimation, bitpacked_estimation);
}
#[test]
fn estimation_test_bad_interpolation_case_monotonically_increasing() {
let mut data: Vec<u64> = (201..=20000_u64).collect();
data.push(1_000_000);
// in this case the linear interpolation can't in fact not be worse than bitpacking,
// but the estimator adds some threshold, which leads to estimated worse behavior
let linear_interpol_estimation = estimate::<LinearCodec>(&data[..]).unwrap();
assert_le!(linear_interpol_estimation, 0.35);
let bitpacked_estimation = estimate::<BitpackedCodec>(&data).unwrap();
assert_le!(bitpacked_estimation, 0.32);
assert_le!(bitpacked_estimation, linear_interpol_estimation);
}
#[test]
fn test_fast_field_codec_type_to_code() {
let mut count_codec = 0;
for code in 0..=255 {
if let Some(codec_type) = CodecType::try_from_code(code) {
assert_eq!(codec_type.to_code(), code);
count_codec += 1;
}
}
assert_eq!(count_codec, 3);
}
fn test_fastfield_gcd_i64_with_codec(codec_type: CodecType, num_vals: usize) -> io::Result<()> {
let mut vals: Vec<i64> = (-4..=(num_vals as i64) - 5).map(|val| val * 1000).collect();
let mut buffer: Vec<u8> = Vec::new();
crate::column_values::serialize_u64_based_column_values(
&&vals[..],
&[codec_type],
&mut buffer,
)?;
let buffer = OwnedBytes::new(buffer);
let column = crate::column_values::load_u64_based_column_values::<i64>(buffer.clone())?;
assert_eq!(column.get_val(0), -4000i64);
assert_eq!(column.get_val(1), -3000i64);
assert_eq!(column.get_val(2), -2000i64);
assert_eq!(column.max_value(), (num_vals as i64 - 5) * 1000);
assert_eq!(column.min_value(), -4000i64);
// Can't apply gcd
let mut buffer_without_gcd = Vec::new();
vals.pop();
vals.push(1001i64);
crate::column_values::serialize_u64_based_column_values(
&&vals[..],
&[codec_type],
&mut buffer_without_gcd,
)?;
let buffer_without_gcd = OwnedBytes::new(buffer_without_gcd);
assert!(buffer_without_gcd.len() > buffer.len());
Ok(())
}
#[test]
fn test_fastfield_gcd_i64() -> io::Result<()> {
for &codec_type in &[
CodecType::Bitpacked,
CodecType::BlockwiseLinear,
CodecType::Linear,
] {
test_fastfield_gcd_i64_with_codec(codec_type, 5500)?;
}
Ok(())
}
fn test_fastfield_gcd_u64_with_codec(codec_type: CodecType, num_vals: usize) -> io::Result<()> {
let mut vals: Vec<u64> = (1..=num_vals).map(|i| i as u64 * 1000u64).collect();
let mut buffer: Vec<u8> = Vec::new();
crate::column_values::serialize_u64_based_column_values(
&&vals[..],
&[codec_type],
&mut buffer,
)?;
let buffer = OwnedBytes::new(buffer);
let column = crate::column_values::load_u64_based_column_values::<u64>(buffer.clone())?;
assert_eq!(column.get_val(0), 1000u64);
assert_eq!(column.get_val(1), 2000u64);
assert_eq!(column.get_val(2), 3000u64);
assert_eq!(column.max_value(), num_vals as u64 * 1000);
assert_eq!(column.min_value(), 1000u64);
// Can't apply gcd
let mut buffer_without_gcd = Vec::new();
vals.pop();
vals.push(1001u64);
crate::column_values::serialize_u64_based_column_values(
&&vals[..],
&[codec_type],
&mut buffer_without_gcd,
)?;
let buffer_without_gcd = OwnedBytes::new(buffer_without_gcd);
assert!(buffer_without_gcd.len() > buffer.len());
Ok(())
}
#[test]
fn test_fastfield_gcd_u64() -> io::Result<()> {
for &codec_type in &[
CodecType::Bitpacked,
CodecType::BlockwiseLinear,
CodecType::Linear,
] {
test_fastfield_gcd_u64_with_codec(codec_type, 5500)?;
}
Ok(())
}
#[test]
pub fn test_fastfield2() {
let test_fastfield = crate::column_values::serialize_and_load_u64_based_column_values::<u64>(
&&[100u64, 200u64, 300u64][..],
&ALL_U64_CODEC_TYPES,
);
assert_eq!(test_fastfield.get_val(0), 100);
assert_eq!(test_fastfield.get_val(1), 200);
assert_eq!(test_fastfield.get_val(2), 300);
}

View File

@@ -0,0 +1,52 @@
use std::fmt::Debug;
use tantivy_bitpacker::minmax;
use crate::ColumnValues;
/// VecColumn provides `Column` over a slice.
pub struct VecColumn<'a, T = u64> {
pub(crate) values: &'a [T],
pub(crate) min_value: T,
pub(crate) max_value: T,
}
impl<'a, T: Copy + PartialOrd + Send + Sync + Debug> ColumnValues<T> for VecColumn<'a, T> {
fn get_val(&self, position: u32) -> T {
self.values[position as usize]
}
fn iter(&self) -> Box<dyn Iterator<Item = T> + '_> {
Box::new(self.values.iter().copied())
}
fn min_value(&self) -> T {
self.min_value
}
fn max_value(&self) -> T {
self.max_value
}
fn num_vals(&self) -> u32 {
self.values.len() as u32
}
fn get_range(&self, start: u64, output: &mut [T]) {
output.copy_from_slice(&self.values[start as usize..][..output.len()])
}
}
impl<'a, T: Copy + PartialOrd + Default, V> From<&'a V> for VecColumn<'a, T>
where V: AsRef<[T]> + ?Sized
{
fn from(values: &'a V) -> Self {
let values = values.as_ref();
let (min_value, max_value) = minmax(values.iter().copied()).unwrap_or_default();
Self {
values,
min_value,
max_value,
}
}
}

View File

@@ -0,0 +1,163 @@
use std::fmt::Debug;
use std::net::Ipv6Addr;
use serde::{Deserialize, Serialize};
use crate::value::NumericalType;
use crate::InvalidData;
/// The column type represents the column type.
/// Any changes need to be propagated to `COLUMN_TYPES`.
#[derive(Hash, Eq, PartialEq, Debug, Clone, Copy, Ord, PartialOrd, Serialize, Deserialize)]
#[repr(u8)]
pub enum ColumnType {
I64 = 0u8,
U64 = 1u8,
F64 = 2u8,
Bytes = 3u8,
Str = 4u8,
Bool = 5u8,
IpAddr = 6u8,
DateTime = 7u8,
}
// The order needs to match _exactly_ the order in the enum
const COLUMN_TYPES: [ColumnType; 8] = [
ColumnType::I64,
ColumnType::U64,
ColumnType::F64,
ColumnType::Bytes,
ColumnType::Str,
ColumnType::Bool,
ColumnType::IpAddr,
ColumnType::DateTime,
];
impl ColumnType {
pub fn to_code(self) -> u8 {
self as u8
}
pub(crate) fn try_from_code(code: u8) -> Result<ColumnType, InvalidData> {
COLUMN_TYPES.get(code as usize).copied().ok_or(InvalidData)
}
}
impl From<NumericalType> for ColumnType {
fn from(numerical_type: NumericalType) -> Self {
match numerical_type {
NumericalType::I64 => ColumnType::I64,
NumericalType::U64 => ColumnType::U64,
NumericalType::F64 => ColumnType::F64,
}
}
}
impl ColumnType {
pub fn numerical_type(&self) -> Option<NumericalType> {
match self {
ColumnType::I64 => Some(NumericalType::I64),
ColumnType::U64 => Some(NumericalType::U64),
ColumnType::F64 => Some(NumericalType::F64),
ColumnType::Bytes
| ColumnType::Str
| ColumnType::Bool
| ColumnType::IpAddr
| ColumnType::DateTime => None,
}
}
}
// TODO remove if possible
pub trait HasAssociatedColumnType: 'static + Debug + Send + Sync + Copy + PartialOrd {
fn column_type() -> ColumnType;
fn default_value() -> Self;
}
impl HasAssociatedColumnType for u64 {
fn column_type() -> ColumnType {
ColumnType::U64
}
fn default_value() -> Self {
0u64
}
}
impl HasAssociatedColumnType for i64 {
fn column_type() -> ColumnType {
ColumnType::I64
}
fn default_value() -> Self {
0i64
}
}
impl HasAssociatedColumnType for f64 {
fn column_type() -> ColumnType {
ColumnType::F64
}
fn default_value() -> Self {
Default::default()
}
}
impl HasAssociatedColumnType for bool {
fn column_type() -> ColumnType {
ColumnType::Bool
}
fn default_value() -> Self {
Default::default()
}
}
impl HasAssociatedColumnType for common::DateTime {
fn column_type() -> ColumnType {
ColumnType::DateTime
}
fn default_value() -> Self {
Default::default()
}
}
impl HasAssociatedColumnType for Ipv6Addr {
fn column_type() -> ColumnType {
ColumnType::IpAddr
}
fn default_value() -> Self {
Ipv6Addr::from([0u8; 16])
}
}
#[cfg(test)]
mod tests {
use super::*;
use crate::Cardinality;
#[test]
fn test_column_type_to_code() {
for (code, expected_column_type) in super::COLUMN_TYPES.iter().copied().enumerate() {
if let Ok(column_type) = ColumnType::try_from_code(code as u8) {
assert_eq!(column_type, expected_column_type);
}
}
for code in COLUMN_TYPES.len() as u8..=u8::MAX {
assert!(ColumnType::try_from_code(code).is_err());
}
}
#[test]
fn test_cardinality_to_code() {
let mut num_cardinality = 0;
for code in u8::MIN..=u8::MAX {
if let Ok(cardinality) = Cardinality::try_from_code(code) {
assert_eq!(cardinality.to_code(), code);
num_cardinality += 1;
}
}
assert_eq!(num_cardinality, 3);
}
}

View File

@@ -0,0 +1,73 @@
use crate::InvalidData;
pub const VERSION_FOOTER_NUM_BYTES: usize = MAGIC_BYTES.len() + std::mem::size_of::<u32>();
/// We end the file by these 4 bytes just to somewhat identify that
/// this is indeed a columnar file.
const MAGIC_BYTES: [u8; 4] = [2, 113, 119, 66];
pub fn footer() -> [u8; VERSION_FOOTER_NUM_BYTES] {
let mut footer_bytes = [0u8; VERSION_FOOTER_NUM_BYTES];
footer_bytes[0..4].copy_from_slice(&Version::V1.to_bytes());
footer_bytes[4..8].copy_from_slice(&MAGIC_BYTES[..]);
footer_bytes
}
pub fn parse_footer(footer_bytes: [u8; VERSION_FOOTER_NUM_BYTES]) -> Result<Version, InvalidData> {
if footer_bytes[4..8] != MAGIC_BYTES {
return Err(InvalidData);
}
Version::try_from_bytes(footer_bytes[0..4].try_into().unwrap())
}
#[derive(Debug, Copy, Clone, Eq, PartialEq)]
#[repr(u32)]
pub enum Version {
V1 = 1u32,
}
impl Version {
fn to_bytes(self) -> [u8; 4] {
(self as u32).to_le_bytes()
}
fn try_from_bytes(bytes: [u8; 4]) -> Result<Version, InvalidData> {
let code = u32::from_le_bytes(bytes);
match code {
1u32 => Ok(Version::V1),
_ => Err(InvalidData),
}
}
}
#[cfg(test)]
mod tests {
use std::collections::HashSet;
use super::*;
#[test]
fn test_footer_dserialization() {
let parsed_version: Version = parse_footer(footer()).unwrap();
assert_eq!(Version::V1, parsed_version);
}
#[test]
fn test_version_serialization() {
let version_to_tests: Vec<u32> = [0, 1 << 8, 1 << 16, 1 << 24]
.iter()
.copied()
.flat_map(|offset| (0..255).map(move |el| el + offset))
.collect();
let mut valid_versions: HashSet<u32> = HashSet::default();
for &i in &version_to_tests {
let version_res = Version::try_from_bytes(i.to_le_bytes());
if let Ok(version) = version_res {
assert_eq!(version, Version::V1);
assert_eq!(version.to_bytes(), i.to_le_bytes());
valid_versions.insert(i);
}
}
assert_eq!(valid_versions.len(), 1);
}
}

View File

@@ -0,0 +1,204 @@
use std::io::{self, Write};
use common::{BitSet, CountingWriter, ReadOnlyBitSet};
use sstable::{SSTable, TermOrdinal};
use super::term_merger::TermMerger;
use crate::column::serialize_column_mappable_to_u64;
use crate::column_index::SerializableColumnIndex;
use crate::iterable::Iterable;
use crate::{BytesColumn, MergeRowOrder, ShuffleMergeOrder};
// Serialize [Dictionary, Column, dictionary num bytes U32::LE]
// Column: [Column Index, Column Values, column index num bytes U32::LE]
pub fn merge_bytes_or_str_column(
column_index: SerializableColumnIndex<'_>,
bytes_columns: &[Option<BytesColumn>],
merge_row_order: &MergeRowOrder,
output: &mut impl Write,
) -> io::Result<()> {
// Serialize dict and generate mapping for values
let mut output = CountingWriter::wrap(output);
// TODO !!! Remove useless terms.
let term_ord_mapping = serialize_merged_dict(bytes_columns, merge_row_order, &mut output)?;
let dictionary_num_bytes: u32 = output.written_bytes() as u32;
let output = output.finish();
let remapped_term_ordinals_values = RemappedTermOrdinalsValues {
bytes_columns,
term_ord_mapping: &term_ord_mapping,
merge_row_order,
};
serialize_column_mappable_to_u64(column_index, &remapped_term_ordinals_values, output)?;
output.write_all(&dictionary_num_bytes.to_le_bytes())?;
Ok(())
}
struct RemappedTermOrdinalsValues<'a> {
bytes_columns: &'a [Option<BytesColumn>],
term_ord_mapping: &'a TermOrdinalMapping,
merge_row_order: &'a MergeRowOrder,
}
impl<'a> Iterable for RemappedTermOrdinalsValues<'a> {
fn boxed_iter(&self) -> Box<dyn Iterator<Item = u64> + '_> {
match self.merge_row_order {
MergeRowOrder::Stack(_) => self.boxed_iter_stacked(),
MergeRowOrder::Shuffled(shuffle_merge_order) => {
self.boxed_iter_shuffled(shuffle_merge_order)
}
}
}
}
impl<'a> RemappedTermOrdinalsValues<'a> {
fn boxed_iter_stacked(&self) -> Box<dyn Iterator<Item = u64> + '_> {
let iter = self
.bytes_columns
.iter()
.enumerate()
.flat_map(|(segment_ord, byte_column)| {
let segment_ord = self.term_ord_mapping.get_segment(segment_ord as u32);
byte_column.iter().flat_map(move |bytes_column| {
bytes_column
.ords()
.values
.iter()
.map(move |term_ord| segment_ord[term_ord as usize])
})
});
// TODO see if we can better decompose the mapping / and the stacking
Box::new(iter)
}
fn boxed_iter_shuffled<'b>(
&'b self,
shuffle_merge_order: &'b ShuffleMergeOrder,
) -> Box<dyn Iterator<Item = u64> + 'b> {
Box::new(
shuffle_merge_order
.iter_new_to_old_row_addrs()
.flat_map(move |old_addr| {
let segment_ord = self.term_ord_mapping.get_segment(old_addr.segment_ord);
self.bytes_columns[old_addr.segment_ord as usize]
.as_ref()
.into_iter()
.flat_map(move |bytes_column| {
bytes_column
.term_ords(old_addr.row_id)
.map(|old_term_ord: u64| segment_ord[old_term_ord as usize])
})
}),
)
}
}
fn compute_term_bitset(column: &BytesColumn, row_bitset: &ReadOnlyBitSet) -> BitSet {
let num_terms = column.dictionary().num_terms();
let mut term_bitset = BitSet::with_max_value(num_terms as u32);
for row_id in row_bitset.iter() {
for term_ord in column.term_ord_column.values_for_doc(row_id) {
term_bitset.insert(term_ord as u32);
}
}
term_bitset
}
fn is_term_present(bitsets: &[Option<BitSet>], term_merger: &TermMerger) -> bool {
for (segment_ord, from_term_ord) in term_merger.matching_segments() {
if let Some(bitset) = bitsets[segment_ord].as_ref() {
if bitset.contains(from_term_ord as u32) {
return true;
}
} else {
return true;
}
}
false
}
fn serialize_merged_dict(
bytes_columns: &[Option<BytesColumn>],
merge_row_order: &MergeRowOrder,
output: &mut impl Write,
) -> io::Result<TermOrdinalMapping> {
let mut term_ord_mapping = TermOrdinalMapping::default();
let mut field_term_streams = Vec::new();
for column in bytes_columns.iter().flatten() {
term_ord_mapping.add_segment(column.dictionary.num_terms());
let terms = column.dictionary.stream()?;
field_term_streams.push(terms);
}
let mut merged_terms = TermMerger::new(field_term_streams);
let mut sstable_builder = sstable::VoidSSTable::writer(output);
// TODO support complex `merge_row_order`.
match merge_row_order {
MergeRowOrder::Stack(_) => {
let mut current_term_ord = 0;
while merged_terms.advance() {
let term_bytes: &[u8] = merged_terms.key();
sstable_builder.insert(term_bytes, &())?;
for (segment_ord, from_term_ord) in merged_terms.matching_segments() {
term_ord_mapping.register_from_to(segment_ord, from_term_ord, current_term_ord);
}
current_term_ord += 1;
}
sstable_builder.finish()?;
}
MergeRowOrder::Shuffled(shuffle_merge_order) => {
assert_eq!(shuffle_merge_order.alive_bitsets.len(), bytes_columns.len());
let mut term_bitsets: Vec<Option<BitSet>> = Vec::with_capacity(bytes_columns.len());
for (alive_bitset_opt, bytes_column_opt) in shuffle_merge_order
.alive_bitsets
.iter()
.zip(bytes_columns.iter())
{
match (alive_bitset_opt, bytes_column_opt) {
(Some(alive_bitset), Some(bytes_column)) => {
let term_bitset = compute_term_bitset(bytes_column, alive_bitset);
term_bitsets.push(Some(term_bitset));
}
_ => {
term_bitsets.push(None);
}
}
}
let mut current_term_ord = 0;
while merged_terms.advance() {
let term_bytes: &[u8] = merged_terms.key();
if !is_term_present(&term_bitsets[..], &merged_terms) {
continue;
}
sstable_builder.insert(term_bytes, &())?;
for (segment_ord, from_term_ord) in merged_terms.matching_segments() {
term_ord_mapping.register_from_to(segment_ord, from_term_ord, current_term_ord);
}
current_term_ord += 1;
}
sstable_builder.finish()?;
}
}
Ok(term_ord_mapping)
}
#[derive(Default, Debug)]
struct TermOrdinalMapping {
per_segment_new_term_ordinals: Vec<Vec<TermOrdinal>>,
}
impl TermOrdinalMapping {
fn add_segment(&mut self, max_term_ord: usize) {
self.per_segment_new_term_ordinals
.push(vec![TermOrdinal::default(); max_term_ord]);
}
fn register_from_to(&mut self, segment_ord: usize, from_ord: TermOrdinal, to_ord: TermOrdinal) {
self.per_segment_new_term_ordinals[segment_ord][from_ord as usize] = to_ord;
}
fn get_segment(&self, segment_ord: u32) -> &[TermOrdinal] {
&(self.per_segment_new_term_ordinals[segment_ord as usize])[..]
}
}

View File

@@ -0,0 +1,118 @@
use std::ops::Range;
use common::{BitSet, OwnedBytes, ReadOnlyBitSet};
use crate::{ColumnarReader, RowAddr, RowId};
pub struct StackMergeOrder {
// This does not start at 0. The first row is the number of
// rows in the first columnar.
cumulated_row_ids: Vec<RowId>,
}
impl StackMergeOrder {
pub fn stack(columnars: &[&ColumnarReader]) -> StackMergeOrder {
let mut cumulated_row_ids: Vec<RowId> = Vec::with_capacity(columnars.len());
let mut cumulated_row_id = 0;
for columnar in columnars {
cumulated_row_id += columnar.num_rows();
cumulated_row_ids.push(cumulated_row_id);
}
StackMergeOrder { cumulated_row_ids }
}
pub fn num_rows(&self) -> RowId {
self.cumulated_row_ids.last().copied().unwrap_or(0)
}
pub fn offset(&self, columnar_id: usize) -> RowId {
if columnar_id == 0 {
return 0;
}
self.cumulated_row_ids[columnar_id - 1]
}
pub fn columnar_range(&self, columnar_id: usize) -> Range<RowId> {
self.offset(columnar_id)..self.offset(columnar_id + 1)
}
}
pub enum MergeRowOrder {
/// Columnar tables are simply stacked one above the other.
/// If the i-th columnar_readers has n_rows_i rows, then
/// in the resulting columnar,
/// rows [r0..n_row_0) contains the row of columnar_readers[0], in ordder
/// rows [n_row_0..n_row_0 + n_row_1 contains the row of columnar_readers[1], in order.
/// ..
/// No documents is deleted.
Stack(StackMergeOrder),
/// Some more complex mapping, that may interleaves rows from the different readers and
/// drop rows, or do both.
Shuffled(ShuffleMergeOrder),
}
impl From<StackMergeOrder> for MergeRowOrder {
fn from(stack_merge_order: StackMergeOrder) -> MergeRowOrder {
MergeRowOrder::Stack(stack_merge_order)
}
}
impl From<ShuffleMergeOrder> for MergeRowOrder {
fn from(shuffle_merge_order: ShuffleMergeOrder) -> MergeRowOrder {
MergeRowOrder::Shuffled(shuffle_merge_order)
}
}
impl MergeRowOrder {
pub fn num_rows(&self) -> RowId {
match self {
MergeRowOrder::Stack(stack_row_order) => stack_row_order.num_rows(),
MergeRowOrder::Shuffled(complex_mapping) => complex_mapping.num_rows(),
}
}
}
pub struct ShuffleMergeOrder {
pub new_row_id_to_old_row_id: Vec<RowAddr>,
pub alive_bitsets: Vec<Option<ReadOnlyBitSet>>,
}
impl ShuffleMergeOrder {
pub fn for_test(
segment_num_rows: &[RowId],
new_row_id_to_old_row_id: Vec<RowAddr>,
) -> ShuffleMergeOrder {
let mut alive_bitsets: Vec<BitSet> = segment_num_rows
.iter()
.map(|&num_rows| BitSet::with_max_value(num_rows))
.collect();
for &RowAddr {
segment_ord,
row_id,
} in &new_row_id_to_old_row_id
{
alive_bitsets[segment_ord as usize].insert(row_id);
}
let alive_bitsets: Vec<Option<ReadOnlyBitSet>> = alive_bitsets
.into_iter()
.map(|alive_bitset| {
let mut buffer = Vec::new();
alive_bitset.serialize(&mut buffer).unwrap();
let data = OwnedBytes::new(buffer);
Some(ReadOnlyBitSet::open(data))
})
.collect();
ShuffleMergeOrder {
new_row_id_to_old_row_id,
alive_bitsets,
}
}
pub fn num_rows(&self) -> RowId {
self.new_row_id_to_old_row_id.len() as RowId
}
pub fn iter_new_to_old_row_addrs(&self) -> impl Iterator<Item = RowAddr> + '_ {
self.new_row_id_to_old_row_id.iter().copied()
}
}

View File

@@ -0,0 +1,375 @@
mod merge_dict_column;
mod merge_mapping;
mod term_merger;
use std::collections::{BTreeMap, HashMap, HashSet};
use std::io;
use std::net::Ipv6Addr;
use std::sync::Arc;
pub use merge_mapping::{MergeRowOrder, ShuffleMergeOrder, StackMergeOrder};
use super::writer::ColumnarSerializer;
use crate::column::{serialize_column_mappable_to_u128, serialize_column_mappable_to_u64};
use crate::column_values::MergedColumnValues;
use crate::columnar::merge::merge_dict_column::merge_bytes_or_str_column;
use crate::columnar::writer::CompatibleNumericalTypes;
use crate::columnar::ColumnarReader;
use crate::dynamic_column::DynamicColumn;
use crate::{
BytesColumn, Column, ColumnIndex, ColumnType, ColumnValues, NumericalType, NumericalValue,
};
/// Column types are grouped into different categories.
/// After merge, all columns belonging to the same category are coerced to
/// the same column type.
///
/// In practise, today, only Numerical colummns are coerced into one type today.
///
/// See also [README.md].
#[derive(Copy, Clone, Eq, PartialEq, Hash, Debug)]
enum ColumnTypeCategory {
Bool,
Str,
Numerical,
DateTime,
Bytes,
IpAddr,
}
impl From<ColumnType> for ColumnTypeCategory {
fn from(column_type: ColumnType) -> Self {
match column_type {
ColumnType::I64 => ColumnTypeCategory::Numerical,
ColumnType::U64 => ColumnTypeCategory::Numerical,
ColumnType::F64 => ColumnTypeCategory::Numerical,
ColumnType::Bytes => ColumnTypeCategory::Bytes,
ColumnType::Str => ColumnTypeCategory::Str,
ColumnType::Bool => ColumnTypeCategory::Bool,
ColumnType::IpAddr => ColumnTypeCategory::IpAddr,
ColumnType::DateTime => ColumnTypeCategory::DateTime,
}
}
}
/// Merge several columnar table together.
///
/// If several columns with the same name are conflicting with the numerical types in the
/// input columnars, the first type compatible out of i64, u64, f64 in that order will be used.
///
/// `require_columns` makes it possible to ensure that some columns will be present in the
/// resulting columnar. When a required column is a numerical column type, one of two things can
/// happen:
/// - If the required column type is compatible with all of the input columnar, the resulsting
/// merged
/// columnar will simply coerce the input column and use the required column type.
/// - If the required column type is incompatible with one of the input columnar, the merged
/// will fail with an InvalidData error.
///
/// `merge_row_order` makes it possible to remove or reorder row in the resulting
/// `Columnar` table.
///
/// Reminder: a string and a numerical column may bare the same column name. This is not
/// considered a conflict.
pub fn merge_columnar(
columnar_readers: &[&ColumnarReader],
required_columns: &[(String, ColumnType)],
merge_row_order: MergeRowOrder,
output: &mut impl io::Write,
) -> io::Result<()> {
let mut serializer = ColumnarSerializer::new(output);
let columns_to_merge = group_columns_for_merge(columnar_readers, required_columns)?;
for ((column_name, column_type), columns) in columns_to_merge {
let mut column_serializer =
serializer.serialize_column(column_name.as_bytes(), column_type);
merge_column(
column_type,
columns,
&merge_row_order,
&mut column_serializer,
)?;
}
serializer.finalize(merge_row_order.num_rows())?;
Ok(())
}
fn dynamic_column_to_u64_monotonic(dynamic_column: DynamicColumn) -> Option<Column<u64>> {
match dynamic_column {
DynamicColumn::Bool(column) => Some(column.to_u64_monotonic()),
DynamicColumn::I64(column) => Some(column.to_u64_monotonic()),
DynamicColumn::U64(column) => Some(column.to_u64_monotonic()),
DynamicColumn::F64(column) => Some(column.to_u64_monotonic()),
DynamicColumn::DateTime(column) => Some(column.to_u64_monotonic()),
DynamicColumn::IpAddr(_) | DynamicColumn::Bytes(_) | DynamicColumn::Str(_) => None,
}
}
fn merge_column(
column_type: ColumnType,
columns: Vec<Option<DynamicColumn>>,
merge_row_order: &MergeRowOrder,
wrt: &mut impl io::Write,
) -> io::Result<()> {
match column_type {
ColumnType::I64
| ColumnType::U64
| ColumnType::F64
| ColumnType::DateTime
| ColumnType::Bool => {
let mut column_indexes: Vec<Option<ColumnIndex>> = Vec::with_capacity(columns.len());
let mut column_values: Vec<Option<Arc<dyn ColumnValues>>> =
Vec::with_capacity(columns.len());
for dynamic_column_opt in columns {
if let Some(Column { idx, values }) =
dynamic_column_opt.and_then(dynamic_column_to_u64_monotonic)
{
column_indexes.push(Some(idx));
column_values.push(Some(values));
} else {
column_indexes.push(None);
column_values.push(None);
}
}
let merged_column_index =
crate::column_index::merge_column_index(&column_indexes[..], merge_row_order);
let merge_column_values = MergedColumnValues {
column_indexes: &column_indexes[..],
column_values: &column_values[..],
merge_row_order,
};
serialize_column_mappable_to_u64(merged_column_index, &merge_column_values, wrt)?;
}
ColumnType::IpAddr => {
let mut column_indexes: Vec<Option<ColumnIndex>> = Vec::with_capacity(columns.len());
let mut column_values: Vec<Option<Arc<dyn ColumnValues<Ipv6Addr>>>> =
Vec::with_capacity(columns.len());
for dynamic_column_opt in columns {
if let Some(DynamicColumn::IpAddr(Column { idx, values })) = dynamic_column_opt {
column_indexes.push(Some(idx));
column_values.push(Some(values));
} else {
column_indexes.push(None);
column_values.push(None);
}
}
let merged_column_index =
crate::column_index::merge_column_index(&column_indexes[..], merge_row_order);
let merge_column_values = MergedColumnValues {
column_indexes: &column_indexes[..],
column_values: &column_values,
merge_row_order,
};
serialize_column_mappable_to_u128(merged_column_index, &merge_column_values, wrt)?;
}
ColumnType::Bytes | ColumnType::Str => {
let mut column_indexes: Vec<Option<ColumnIndex>> = Vec::with_capacity(columns.len());
let mut bytes_columns: Vec<Option<BytesColumn>> = Vec::with_capacity(columns.len());
for dynamic_column_opt in columns {
match dynamic_column_opt {
Some(DynamicColumn::Str(str_column)) => {
column_indexes.push(Some(str_column.term_ord_column.idx.clone()));
bytes_columns.push(Some(str_column.into()));
}
Some(DynamicColumn::Bytes(bytes_column)) => {
column_indexes.push(Some(bytes_column.term_ord_column.idx.clone()));
bytes_columns.push(Some(bytes_column));
}
_ => {
column_indexes.push(None);
bytes_columns.push(None);
}
}
}
let merged_column_index =
crate::column_index::merge_column_index(&column_indexes[..], merge_row_order);
merge_bytes_or_str_column(merged_column_index, &bytes_columns, merge_row_order, wrt)?;
}
}
Ok(())
}
struct GroupedColumns {
required_column_type: Option<ColumnType>,
columns: Vec<Option<DynamicColumn>>,
column_category: ColumnTypeCategory,
}
impl GroupedColumns {
fn for_category(column_category: ColumnTypeCategory, num_columnars: usize) -> Self {
GroupedColumns {
required_column_type: None,
columns: vec![None; num_columnars],
column_category,
}
}
/// Set the dynamic column for a given columnar.
fn set_column(&mut self, columnar_id: usize, column: DynamicColumn) {
self.columns[columnar_id] = Some(column);
}
/// Force the existence of a column, as well as its type.
fn require_type(&mut self, required_type: ColumnType) -> io::Result<()> {
if let Some(existing_required_type) = self.required_column_type {
if existing_required_type == required_type {
// This was just a duplicate in the `required_columns`.
// Nothing to do.
return Ok(());
} else {
return Err(io::Error::new(
io::ErrorKind::InvalidInput,
"Required column conflicts with another required column of the same type \
category.",
));
}
}
self.required_column_type = Some(required_type);
Ok(())
}
/// Returns the column type after merge.
///
/// This method does not check if the column types can actually be coerced to
/// this type.
fn column_type_after_merge(&self) -> ColumnType {
if let Some(required_type) = self.required_column_type {
return required_type;
}
let column_type: HashSet<ColumnType> = self
.columns
.iter()
.flatten()
.map(|column| column.column_type())
.collect();
if column_type.len() == 1 {
return column_type.into_iter().next().unwrap();
}
// At the moment, only the numerical categorical column type has more than one possible
// column type.
assert_eq!(self.column_category, ColumnTypeCategory::Numerical);
merged_numerical_columns_type(self.columns.iter().flatten()).into()
}
}
/// Returns the type of the merged numerical column.
///
/// This function picks the first numerical type out of i64, u64, f64 (order matters
/// here), that is compatible with all the `columns`.
///
/// # Panics
/// Panics if one of the column is not numerical.
fn merged_numerical_columns_type<'a>(
columns: impl Iterator<Item = &'a DynamicColumn>,
) -> NumericalType {
let mut compatible_numerical_types = CompatibleNumericalTypes::default();
for column in columns {
let (min_value, max_value) =
min_max_if_numerical(column).expect("All columns re required to be numerical");
compatible_numerical_types.accept_value(min_value);
compatible_numerical_types.accept_value(max_value);
}
compatible_numerical_types.to_numerical_type()
}
#[allow(clippy::type_complexity)]
fn group_columns_for_merge(
columnar_readers: &[&ColumnarReader],
required_columns: &[(String, ColumnType)],
) -> io::Result<BTreeMap<(String, ColumnType), Vec<Option<DynamicColumn>>>> {
// Each column name may have multiple types of column associated.
// For merging we are interested in the same column type category since they can be merged.
let mut columns_grouped: HashMap<(String, ColumnTypeCategory), GroupedColumns> = HashMap::new();
for &(ref column_name, column_type) in required_columns {
columns_grouped
.entry((column_name.clone(), column_type.into()))
.or_insert_with(|| {
GroupedColumns::for_category(column_type.into(), columnar_readers.len())
})
.require_type(column_type)?;
}
for (columnar_id, columnar_reader) in columnar_readers.iter().enumerate() {
let column_name_and_handle = columnar_reader.list_columns()?;
for (column_name, handle) in column_name_and_handle {
let column_category: ColumnTypeCategory = handle.column_type().into();
let column = handle.open()?;
columns_grouped
.entry((column_name, column_category))
.or_insert_with(|| {
GroupedColumns::for_category(column_category, columnar_readers.len())
})
.set_column(columnar_id, column);
}
}
let mut merge_columns: BTreeMap<(String, ColumnType), Vec<Option<DynamicColumn>>> =
Default::default();
for ((column_name, _), mut grouped_columns) in columns_grouped {
let column_type = grouped_columns.column_type_after_merge();
coerce_columns(column_type, &mut grouped_columns.columns)?;
merge_columns.insert((column_name, column_type), grouped_columns.columns);
}
Ok(merge_columns)
}
fn coerce_columns(
column_type: ColumnType,
columns: &mut [Option<DynamicColumn>],
) -> io::Result<()> {
for column_opt in columns.iter_mut() {
if let Some(column) = column_opt.take() {
*column_opt = Some(coerce_column(column_type, column)?);
}
}
Ok(())
}
fn coerce_column(column_type: ColumnType, column: DynamicColumn) -> io::Result<DynamicColumn> {
if let Some(numerical_type) = column_type.numerical_type() {
column
.coerce_numerical(numerical_type)
.ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, ""))
} else {
if column.column_type() != column_type {
return Err(io::Error::new(
io::ErrorKind::InvalidInput,
format!(
"Cannot coerce column of type `{:?}` to `{column_type:?}`",
column.column_type()
),
));
}
Ok(column)
}
}
/// Returns the (min, max) of a column provided it is numerical (i64, u64. f64).
///
/// The min and the max are simply the numerical value as defined by `ColumnValue::min_value()`,
/// and `ColumnValue::max_value()`.
///
/// It is important to note that these values are only guaranteed to be lower/upper bound
/// (as opposed to min/max value).
/// If a column is empty, the min and max values are currently set to 0.
fn min_max_if_numerical(column: &DynamicColumn) -> Option<(NumericalValue, NumericalValue)> {
match column {
DynamicColumn::I64(column) => Some((column.min_value().into(), column.max_value().into())),
DynamicColumn::U64(column) => Some((column.min_value().into(), column.min_value().into())),
DynamicColumn::F64(column) => Some((column.min_value().into(), column.min_value().into())),
DynamicColumn::Bool(_)
| DynamicColumn::IpAddr(_)
| DynamicColumn::DateTime(_)
| DynamicColumn::Bytes(_)
| DynamicColumn::Str(_) => None,
}
}
#[cfg(test)]
mod tests;

View File

@@ -0,0 +1,107 @@
use std::cmp::Ordering;
use std::collections::BinaryHeap;
use sstable::TermOrdinal;
use crate::Streamer;
pub struct HeapItem<'a> {
pub streamer: Streamer<'a>,
pub segment_ord: usize,
}
impl<'a> PartialEq for HeapItem<'a> {
fn eq(&self, other: &Self) -> bool {
self.segment_ord == other.segment_ord
}
}
impl<'a> Eq for HeapItem<'a> {}
impl<'a> PartialOrd for HeapItem<'a> {
fn partial_cmp(&self, other: &HeapItem<'a>) -> Option<Ordering> {
Some(self.cmp(other))
}
}
impl<'a> Ord for HeapItem<'a> {
fn cmp(&self, other: &HeapItem<'a>) -> Ordering {
(&other.streamer.key(), &other.segment_ord).cmp(&(&self.streamer.key(), &self.segment_ord))
}
}
/// Given a list of sorted term streams,
/// returns an iterator over sorted unique terms.
///
/// The item yield is actually a pair with
/// - the term
/// - a slice with the ordinal of the segments containing
/// the terms.
pub struct TermMerger<'a> {
heap: BinaryHeap<HeapItem<'a>>,
current_streamers: Vec<HeapItem<'a>>,
}
impl<'a> TermMerger<'a> {
/// Stream of merged term dictionary
pub fn new(streams: Vec<Streamer<'a>>) -> TermMerger<'a> {
TermMerger {
heap: BinaryHeap::new(),
current_streamers: streams
.into_iter()
.enumerate()
.map(|(ord, streamer)| HeapItem {
streamer,
segment_ord: ord,
})
.collect(),
}
}
pub(crate) fn matching_segments<'b: 'a>(
&'b self,
) -> impl 'b + Iterator<Item = (usize, TermOrdinal)> {
self.current_streamers
.iter()
.map(|heap_item| (heap_item.segment_ord, heap_item.streamer.term_ord()))
}
fn advance_segments(&mut self) {
let streamers = &mut self.current_streamers;
let heap = &mut self.heap;
for mut heap_item in streamers.drain(..) {
if heap_item.streamer.advance() {
heap.push(heap_item);
}
}
}
/// Advance the term iterator to the next term.
/// Returns true if there is indeed another term
/// False if there is none.
pub fn advance(&mut self) -> bool {
self.advance_segments();
if let Some(head) = self.heap.pop() {
self.current_streamers.push(head);
while let Some(next_streamer) = self.heap.peek() {
if self.current_streamers[0].streamer.key() != next_streamer.streamer.key() {
break;
}
let next_heap_it = self.heap.pop().unwrap(); // safe : we peeked beforehand
self.current_streamers.push(next_heap_it);
}
true
} else {
false
}
}
/// Returns the current term.
///
/// This method may be called
/// if and only if advance() has been called before
/// and "true" was returned.
pub fn key(&self) -> &[u8] {
self.current_streamers[0].streamer.key()
}
}

View File

@@ -0,0 +1,318 @@
use super::*;
use crate::{Cardinality, ColumnarWriter, HasAssociatedColumnType, RowId};
fn make_columnar<T: Into<NumericalValue> + HasAssociatedColumnType + Copy>(
column_name: &str,
vals: &[T],
) -> ColumnarReader {
let mut dataframe_writer = ColumnarWriter::default();
dataframe_writer.record_column_type(column_name, T::column_type(), false);
for (row_id, val) in vals.iter().copied().enumerate() {
dataframe_writer.record_numerical(row_id as RowId, column_name, val.into());
}
let mut buffer: Vec<u8> = Vec::new();
dataframe_writer
.serialize(vals.len() as RowId, None, &mut buffer)
.unwrap();
ColumnarReader::open(buffer).unwrap()
}
#[test]
fn test_column_coercion_to_u64() {
// i64 type
let columnar1 = make_columnar("numbers", &[1i64]);
// u64 type
let columnar2 = make_columnar("numbers", &[u64::MAX]);
let column_map: BTreeMap<(String, ColumnType), Vec<Option<DynamicColumn>>> =
group_columns_for_merge(&[&columnar1, &columnar2], &[]).unwrap();
assert_eq!(column_map.len(), 1);
assert!(column_map.contains_key(&("numbers".to_string(), ColumnType::U64)));
}
#[test]
fn test_column_no_coercion_if_all_the_same() {
let columnar1 = make_columnar("numbers", &[1u64]);
let columnar2 = make_columnar("numbers", &[2u64]);
let column_map: BTreeMap<(String, ColumnType), Vec<Option<DynamicColumn>>> =
group_columns_for_merge(&[&columnar1, &columnar2], &[]).unwrap();
assert_eq!(column_map.len(), 1);
assert!(column_map.contains_key(&("numbers".to_string(), ColumnType::U64)));
}
#[test]
fn test_column_coercion_to_i64() {
let columnar1 = make_columnar("numbers", &[-1i64]);
let columnar2 = make_columnar("numbers", &[2u64]);
let column_map: BTreeMap<(String, ColumnType), Vec<Option<DynamicColumn>>> =
group_columns_for_merge(&[&columnar1, &columnar2], &[]).unwrap();
assert_eq!(column_map.len(), 1);
assert!(column_map.contains_key(&("numbers".to_string(), ColumnType::I64)));
}
#[test]
fn test_impossible_coercion_returns_an_error() {
let columnar1 = make_columnar("numbers", &[u64::MAX]);
let group_error =
group_columns_for_merge(&[&columnar1], &[("numbers".to_string(), ColumnType::I64)])
.map(|_| ())
.unwrap_err();
assert_eq!(group_error.kind(), io::ErrorKind::InvalidInput);
}
#[test]
fn test_group_columns_with_required_column() {
let columnar1 = make_columnar("numbers", &[1i64]);
let columnar2 = make_columnar("numbers", &[2u64]);
let column_map: BTreeMap<(String, ColumnType), Vec<Option<DynamicColumn>>> =
group_columns_for_merge(
&[&columnar1, &columnar2],
&[("numbers".to_string(), ColumnType::U64)],
)
.unwrap();
assert_eq!(column_map.len(), 1);
assert!(column_map.contains_key(&("numbers".to_string(), ColumnType::U64)));
}
#[test]
fn test_group_columns_required_column_with_no_existing_columns() {
let columnar1 = make_columnar("numbers", &[2u64]);
let columnar2 = make_columnar("numbers", &[2u64]);
let column_map: BTreeMap<(String, ColumnType), Vec<Option<DynamicColumn>>> =
group_columns_for_merge(
&[&columnar1, &columnar2],
&[("required_col".to_string(), ColumnType::Str)],
)
.unwrap();
assert_eq!(column_map.len(), 2);
let columns = column_map
.get(&("required_col".to_string(), ColumnType::Str))
.unwrap();
assert_eq!(columns.len(), 2);
assert!(columns[0].is_none());
assert!(columns[1].is_none());
}
#[test]
fn test_group_columns_required_column_is_above_all_columns_have_the_same_type_rule() {
let columnar1 = make_columnar("numbers", &[2i64]);
let columnar2 = make_columnar("numbers", &[2i64]);
let column_map: BTreeMap<(String, ColumnType), Vec<Option<DynamicColumn>>> =
group_columns_for_merge(
&[&columnar1, &columnar2],
&[("numbers".to_string(), ColumnType::U64)],
)
.unwrap();
assert_eq!(column_map.len(), 1);
assert!(column_map.contains_key(&("numbers".to_string(), ColumnType::U64)));
}
#[test]
fn test_missing_column() {
let columnar1 = make_columnar("numbers", &[-1i64]);
let columnar2 = make_columnar("numbers2", &[2u64]);
let column_map: BTreeMap<(String, ColumnType), Vec<Option<DynamicColumn>>> =
group_columns_for_merge(&[&columnar1, &columnar2], &[]).unwrap();
assert_eq!(column_map.len(), 2);
assert!(column_map.contains_key(&("numbers".to_string(), ColumnType::I64)));
{
let columns = column_map
.get(&("numbers".to_string(), ColumnType::I64))
.unwrap();
assert!(columns[0].is_some());
assert!(columns[1].is_none());
}
{
let columns = column_map
.get(&("numbers2".to_string(), ColumnType::U64))
.unwrap();
assert!(columns[0].is_none());
assert!(columns[1].is_some());
}
}
fn make_numerical_columnar_multiple_columns(
columns: &[(&str, &[&[NumericalValue]])],
) -> ColumnarReader {
let mut dataframe_writer = ColumnarWriter::default();
for (column_name, column_values) in columns {
for (row_id, vals) in column_values.iter().enumerate() {
for val in vals.iter() {
dataframe_writer.record_numerical(row_id as u32, column_name, *val);
}
}
}
let num_rows = columns
.iter()
.map(|(_, val_rows)| val_rows.len() as RowId)
.max()
.unwrap_or(0u32);
let mut buffer: Vec<u8> = Vec::new();
dataframe_writer
.serialize(num_rows, None, &mut buffer)
.unwrap();
ColumnarReader::open(buffer).unwrap()
}
fn make_byte_columnar_multiple_columns(columns: &[(&str, &[&[&[u8]]])]) -> ColumnarReader {
let mut dataframe_writer = ColumnarWriter::default();
for (column_name, column_values) in columns {
for (row_id, vals) in column_values.iter().enumerate() {
for val in vals.iter() {
dataframe_writer.record_bytes(row_id as u32, column_name, val);
}
}
}
let num_rows = columns
.iter()
.map(|(_, val_rows)| val_rows.len() as RowId)
.max()
.unwrap_or(0u32);
let mut buffer: Vec<u8> = Vec::new();
dataframe_writer
.serialize(num_rows, None, &mut buffer)
.unwrap();
ColumnarReader::open(buffer).unwrap()
}
fn make_text_columnar_multiple_columns(columns: &[(&str, &[&[&str]])]) -> ColumnarReader {
let mut dataframe_writer = ColumnarWriter::default();
for (column_name, column_values) in columns {
for (row_id, vals) in column_values.iter().enumerate() {
for val in vals.iter() {
dataframe_writer.record_str(row_id as u32, column_name, val);
}
}
}
let num_rows = columns
.iter()
.map(|(_, val_rows)| val_rows.len() as RowId)
.max()
.unwrap_or(0u32);
let mut buffer: Vec<u8> = Vec::new();
dataframe_writer
.serialize(num_rows, None, &mut buffer)
.unwrap();
ColumnarReader::open(buffer).unwrap()
}
#[test]
fn test_merge_columnar_numbers() {
let columnar1 =
make_numerical_columnar_multiple_columns(&[("numbers", &[&[NumericalValue::from(-1f64)]])]);
let columnar2 = make_numerical_columnar_multiple_columns(&[(
"numbers",
&[&[], &[NumericalValue::from(-3f64)]],
)]);
let mut buffer = Vec::new();
let columnars = &[&columnar1, &columnar2];
let stack_merge_order = StackMergeOrder::stack(columnars);
crate::columnar::merge_columnar(
columnars,
&[],
MergeRowOrder::Stack(stack_merge_order),
&mut buffer,
)
.unwrap();
let columnar_reader = ColumnarReader::open(buffer).unwrap();
assert_eq!(columnar_reader.num_rows(), 3);
assert_eq!(columnar_reader.num_columns(), 1);
let cols = columnar_reader.read_columns("numbers").unwrap();
let dynamic_column = cols[0].open().unwrap();
let DynamicColumn::F64(vals) = dynamic_column else { panic!() };
assert_eq!(vals.get_cardinality(), Cardinality::Optional);
assert_eq!(vals.first(0u32), Some(-1f64));
assert_eq!(vals.first(1u32), None);
assert_eq!(vals.first(2u32), Some(-3f64));
}
#[test]
fn test_merge_columnar_texts() {
let columnar1 = make_text_columnar_multiple_columns(&[("texts", &[&["a"]])]);
let columnar2 = make_text_columnar_multiple_columns(&[("texts", &[&[], &["b"]])]);
let mut buffer = Vec::new();
let columnars = &[&columnar1, &columnar2];
let stack_merge_order = StackMergeOrder::stack(columnars);
crate::columnar::merge_columnar(
columnars,
&[],
MergeRowOrder::Stack(stack_merge_order),
&mut buffer,
)
.unwrap();
let columnar_reader = ColumnarReader::open(buffer).unwrap();
assert_eq!(columnar_reader.num_rows(), 3);
assert_eq!(columnar_reader.num_columns(), 1);
let cols = columnar_reader.read_columns("texts").unwrap();
let dynamic_column = cols[0].open().unwrap();
let DynamicColumn::Str(vals) = dynamic_column else { panic!() };
let get_str_for_ord = |ord| {
let mut out = String::new();
vals.ord_to_str(ord, &mut out).unwrap();
out
};
assert_eq!(vals.dictionary.num_terms(), 2);
assert_eq!(get_str_for_ord(0), "a");
assert_eq!(get_str_for_ord(1), "b");
let get_str_for_row = |row_id| {
let term_ords: Vec<u64> = vals.term_ords(row_id).collect();
assert!(term_ords.len() <= 1);
let mut out = String::new();
if term_ords.len() == 1 {
vals.ord_to_str(term_ords[0], &mut out).unwrap();
}
out
};
assert_eq!(get_str_for_row(0), "a");
assert_eq!(get_str_for_row(1), "");
assert_eq!(get_str_for_row(2), "b");
}
#[test]
fn test_merge_columnar_byte() {
let columnar1 = make_byte_columnar_multiple_columns(&[("bytes", &[&[b"bbbb"], &[b"baaa"]])]);
let columnar2 = make_byte_columnar_multiple_columns(&[("bytes", &[&[], &[b"a"]])]);
let mut buffer = Vec::new();
let columnars = &[&columnar1, &columnar2];
let stack_merge_order = StackMergeOrder::stack(columnars);
crate::columnar::merge_columnar(
columnars,
&[],
MergeRowOrder::Stack(stack_merge_order),
&mut buffer,
)
.unwrap();
let columnar_reader = ColumnarReader::open(buffer).unwrap();
assert_eq!(columnar_reader.num_rows(), 4);
assert_eq!(columnar_reader.num_columns(), 1);
let cols = columnar_reader.read_columns("bytes").unwrap();
let dynamic_column = cols[0].open().unwrap();
let DynamicColumn::Bytes(vals) = dynamic_column else { panic!() };
let get_bytes_for_ord = |ord| {
let mut out = Vec::new();
vals.ord_to_bytes(ord, &mut out).unwrap();
out
};
assert_eq!(vals.dictionary.num_terms(), 3);
assert_eq!(get_bytes_for_ord(0), b"a");
assert_eq!(get_bytes_for_ord(1), b"baaa");
assert_eq!(get_bytes_for_ord(2), b"bbbb");
let get_bytes_for_row = |row_id| {
let term_ords: Vec<u64> = vals.term_ords(row_id).collect();
assert!(term_ords.len() <= 1);
let mut out = Vec::new();
if term_ords.len() == 1 {
vals.ord_to_bytes(term_ords[0], &mut out).unwrap();
}
out
};
assert_eq!(get_bytes_for_row(0), b"bbbb");
assert_eq!(get_bytes_for_row(1), b"baaa");
assert_eq!(get_bytes_for_row(2), b"");
assert_eq!(get_bytes_for_row(3), b"a");
}

View File

@@ -0,0 +1,10 @@
mod column_type;
mod format_version;
mod merge;
mod reader;
mod writer;
pub use column_type::{ColumnType, HasAssociatedColumnType};
pub use merge::{merge_columnar, MergeRowOrder, ShuffleMergeOrder, StackMergeOrder};
pub use reader::ColumnarReader;
pub use writer::ColumnarWriter;

View File

@@ -0,0 +1,192 @@
use std::{io, mem};
use common::file_slice::FileSlice;
use common::BinarySerializable;
use sstable::{Dictionary, RangeSSTable};
use crate::columnar::{format_version, ColumnType};
use crate::dynamic_column::DynamicColumnHandle;
use crate::RowId;
fn io_invalid_data(msg: String) -> io::Error {
io::Error::new(io::ErrorKind::InvalidData, msg)
}
/// The ColumnarReader makes it possible to access a set of columns
/// associated to field names.
#[derive(Clone)]
pub struct ColumnarReader {
column_dictionary: Dictionary<RangeSSTable>,
column_data: FileSlice,
num_rows: RowId,
}
/// Functions by both the async/sync code listing columns.
/// It takes a stream from the column sstable and return the list of
/// `DynamicColumn` available in it.
fn read_all_columns_in_stream(
mut stream: sstable::Streamer<'_, RangeSSTable>,
column_data: &FileSlice,
) -> io::Result<Vec<DynamicColumnHandle>> {
let mut results = Vec::new();
while stream.advance() {
let key_bytes: &[u8] = stream.key();
let Some(column_code) = key_bytes.last().copied() else {
return Err(io_invalid_data("Empty column name.".to_string()));
};
let column_type = ColumnType::try_from_code(column_code)
.map_err(|_| io_invalid_data(format!("Unknown column code `{column_code}`")))?;
let range = stream.value();
let file_slice = column_data.slice(range.start as usize..range.end as usize);
let dynamic_column_handle = DynamicColumnHandle {
file_slice,
column_type,
};
results.push(dynamic_column_handle);
}
Ok(results)
}
impl ColumnarReader {
/// Opens a new Columnar file.
pub fn open<F>(file_slice: F) -> io::Result<ColumnarReader>
where FileSlice: From<F> {
Self::open_inner(file_slice.into())
}
fn open_inner(file_slice: FileSlice) -> io::Result<ColumnarReader> {
let (file_slice_without_sstable_len, footer_slice) = file_slice
.split_from_end(mem::size_of::<u64>() + 4 + format_version::VERSION_FOOTER_NUM_BYTES);
let footer_bytes = footer_slice.read_bytes()?;
let sstable_len = u64::deserialize(&mut &footer_bytes[0..8])?;
let num_rows = u32::deserialize(&mut &footer_bytes[8..12])?;
let version_footer_bytes: [u8; format_version::VERSION_FOOTER_NUM_BYTES] =
footer_bytes[12..].try_into().unwrap();
let _version = format_version::parse_footer(version_footer_bytes)?;
let (column_data, sstable) =
file_slice_without_sstable_len.split_from_end(sstable_len as usize);
let column_dictionary = Dictionary::open(sstable)?;
Ok(ColumnarReader {
column_dictionary,
column_data,
num_rows,
})
}
pub fn num_rows(&self) -> RowId {
self.num_rows
}
// TODO Add unit tests
pub fn list_columns(&self) -> io::Result<Vec<(String, DynamicColumnHandle)>> {
let mut stream = self.column_dictionary.stream()?;
let mut results = Vec::new();
while stream.advance() {
let key_bytes: &[u8] = stream.key();
let column_code: u8 = key_bytes.last().cloned().unwrap();
let column_type: ColumnType = ColumnType::try_from_code(column_code)
.map_err(|_| io_invalid_data(format!("Unknown column code `{column_code}`")))?;
let range = stream.value().clone();
let column_name =
// The last two bytes are respectively the 0u8 separator and the column_type.
String::from_utf8_lossy(&key_bytes[..key_bytes.len() - 2]).to_string();
let file_slice = self
.column_data
.slice(range.start as usize..range.end as usize);
let column_handle = DynamicColumnHandle {
file_slice,
column_type,
};
results.push((column_name, column_handle));
}
Ok(results)
}
fn stream_for_column_range(&self, column_name: &str) -> sstable::StreamerBuilder<RangeSSTable> {
// Each column is a associated to a given `column_key`,
// that starts by `column_name\0column_header`.
//
// Listing the columns associated to the given column name is therefore equivalent to
// listing `column_key` with the prefix `column_name\0`.
//
// This is in turn equivalent to searching for the range
// `[column_name,\0`..column_name\1)`.
// TODO can we get some more generic `prefix(..)` logic in the dictionary.
let mut start_key = column_name.to_string();
start_key.push('\0');
let mut end_key = column_name.to_string();
end_key.push(1u8 as char);
self.column_dictionary
.range()
.ge(start_key.as_bytes())
.lt(end_key.as_bytes())
}
pub async fn read_columns_async(
&self,
column_name: &str,
) -> io::Result<Vec<DynamicColumnHandle>> {
let stream = self
.stream_for_column_range(column_name)
.into_stream_async()
.await?;
read_all_columns_in_stream(stream, &self.column_data)
}
/// Get all columns for the given column name.
///
/// There can be more than one column associated to a given column name, provided they have
/// different types.
pub fn read_columns(&self, column_name: &str) -> io::Result<Vec<DynamicColumnHandle>> {
let stream = self.stream_for_column_range(column_name).into_stream()?;
read_all_columns_in_stream(stream, &self.column_data)
}
/// Return the number of columns in the columnar.
pub fn num_columns(&self) -> usize {
self.column_dictionary.num_terms()
}
}
#[cfg(test)]
mod tests {
use crate::{ColumnType, ColumnarReader, ColumnarWriter};
#[test]
fn test_list_columns() {
let mut columnar_writer = ColumnarWriter::default();
columnar_writer.record_column_type("col1", ColumnType::Str, false);
columnar_writer.record_column_type("col2", ColumnType::U64, false);
let mut buffer = Vec::new();
columnar_writer.serialize(1, None, &mut buffer).unwrap();
let columnar = ColumnarReader::open(buffer).unwrap();
let columns = columnar.list_columns().unwrap();
assert_eq!(columns.len(), 2);
assert_eq!(&columns[0].0, "col1");
assert_eq!(columns[0].1.column_type(), ColumnType::Str);
assert_eq!(&columns[1].0, "col2");
assert_eq!(columns[1].1.column_type(), ColumnType::U64);
}
#[test]
fn test_list_columns_strict_typing_prevents_coercion() {
let mut columnar_writer = ColumnarWriter::default();
columnar_writer.record_column_type("count", ColumnType::U64, false);
columnar_writer.record_numerical(1, "count", 1u64);
let mut buffer = Vec::new();
columnar_writer.serialize(2, None, &mut buffer).unwrap();
let columnar = ColumnarReader::open(buffer).unwrap();
let columns = columnar.list_columns().unwrap();
assert_eq!(columns.len(), 1);
assert_eq!(&columns[0].0, "count");
assert_eq!(columns[0].1.column_type(), ColumnType::U64);
}
#[test]
#[should_panic(expected = "Input type forbidden")]
fn test_list_columns_strict_typing_panics_on_wrong_types() {
let mut columnar_writer = ColumnarWriter::default();
columnar_writer.record_column_type("count", ColumnType::U64, false);
columnar_writer.record_numerical(1, "count", 1i64);
}
}

View File

@@ -0,0 +1,360 @@
use std::net::Ipv6Addr;
use crate::dictionary::UnorderedId;
use crate::utils::{place_bits, pop_first_byte, select_bits};
use crate::value::NumericalValue;
use crate::{InvalidData, NumericalType, RowId};
/// When we build a columnar dataframe, we first just group
/// all mutations per column, and appends them in append-only buffer
/// in the stacker.
///
/// These ColumnOperation<T> are therefore serialize/deserialized
/// in memory.
///
/// We represents all of these operations as `ColumnOperation`.
#[derive(Eq, PartialEq, Debug, Clone, Copy)]
pub(super) enum ColumnOperation<T> {
NewDoc(RowId),
Value(T),
}
#[derive(Copy, Clone, Eq, PartialEq, Debug)]
struct ColumnOperationMetadata {
op_type: ColumnOperationType,
len: u8,
}
impl ColumnOperationMetadata {
fn to_code(self) -> u8 {
place_bits::<0, 6>(self.len) | place_bits::<6, 8>(self.op_type.to_code())
}
fn try_from_code(code: u8) -> Result<Self, InvalidData> {
let len = select_bits::<0, 6>(code);
let typ_code = select_bits::<6, 8>(code);
let column_type = ColumnOperationType::try_from_code(typ_code)?;
Ok(ColumnOperationMetadata {
op_type: column_type,
len,
})
}
}
#[derive(Copy, Clone, Eq, PartialEq, Debug)]
#[repr(u8)]
enum ColumnOperationType {
NewDoc = 0u8,
AddValue = 1u8,
}
impl ColumnOperationType {
pub fn to_code(self) -> u8 {
self as u8
}
pub fn try_from_code(code: u8) -> Result<Self, InvalidData> {
match code {
0 => Ok(Self::NewDoc),
1 => Ok(Self::AddValue),
_ => Err(InvalidData),
}
}
}
impl<V: SymbolValue> ColumnOperation<V> {
pub(super) fn serialize(self) -> impl AsRef<[u8]> {
let mut minibuf = MiniBuffer::default();
let column_op_metadata = match self {
ColumnOperation::NewDoc(new_doc) => {
let symbol_len = new_doc.serialize(&mut minibuf.bytes[1..]);
ColumnOperationMetadata {
op_type: ColumnOperationType::NewDoc,
len: symbol_len,
}
}
ColumnOperation::Value(val) => {
let symbol_len = val.serialize(&mut minibuf.bytes[1..]);
ColumnOperationMetadata {
op_type: ColumnOperationType::AddValue,
len: symbol_len,
}
}
};
minibuf.bytes[0] = column_op_metadata.to_code();
// +1 for the metadata
minibuf.len = 1 + column_op_metadata.len;
minibuf
}
/// Deserialize a colummn operation.
/// Returns None if the buffer is empty.
///
/// Panics if the payload is invalid:
/// this deserialize method is meant to target in memory.
pub(super) fn deserialize(bytes: &mut &[u8]) -> Option<Self> {
let column_op_metadata_byte = pop_first_byte(bytes)?;
let column_op_metadata = ColumnOperationMetadata::try_from_code(column_op_metadata_byte)
.expect("Invalid op metadata byte");
let symbol_bytes: &[u8];
(symbol_bytes, *bytes) = bytes.split_at(column_op_metadata.len as usize);
match column_op_metadata.op_type {
ColumnOperationType::NewDoc => {
let new_doc = u32::deserialize(symbol_bytes);
Some(ColumnOperation::NewDoc(new_doc))
}
ColumnOperationType::AddValue => {
let value = V::deserialize(symbol_bytes);
Some(ColumnOperation::Value(value))
}
}
}
}
impl<T> From<T> for ColumnOperation<T> {
fn from(value: T) -> Self {
ColumnOperation::Value(value)
}
}
// Serialization trait very local to the writer.
// As we write fast fields, we accumulate them in "in memory".
// In order to limit memory usage, and in order
// to benefit from the stacker, we do this by serialization our data
// as "Symbols".
#[allow(clippy::from_over_into)]
pub(super) trait SymbolValue: Clone + Copy {
// Serializes the symbol into the given buffer.
// Returns the number of bytes written into the buffer.
/// # Panics
/// May not exceed 9bytes
fn serialize(self, buffer: &mut [u8]) -> u8;
// Panics if invalid
fn deserialize(bytes: &[u8]) -> Self;
}
impl SymbolValue for bool {
fn serialize(self, buffer: &mut [u8]) -> u8 {
buffer[0] = u8::from(self);
1u8
}
fn deserialize(bytes: &[u8]) -> Self {
bytes[0] == 1u8
}
}
impl SymbolValue for Ipv6Addr {
fn serialize(self, buffer: &mut [u8]) -> u8 {
buffer[0..16].copy_from_slice(&self.octets());
16
}
fn deserialize(bytes: &[u8]) -> Self {
let octets: [u8; 16] = bytes[0..16].try_into().unwrap();
Ipv6Addr::from(octets)
}
}
#[derive(Default)]
struct MiniBuffer {
pub bytes: [u8; 17],
pub len: u8,
}
impl AsRef<[u8]> for MiniBuffer {
fn as_ref(&self) -> &[u8] {
&self.bytes[..self.len as usize]
}
}
impl SymbolValue for NumericalValue {
fn deserialize(mut bytes: &[u8]) -> Self {
let type_code = pop_first_byte(&mut bytes).unwrap();
let symbol_type = NumericalType::try_from_code(type_code).unwrap();
let mut octet: [u8; 8] = [0u8; 8];
octet[..bytes.len()].copy_from_slice(bytes);
match symbol_type {
NumericalType::U64 => {
let val: u64 = u64::from_le_bytes(octet);
NumericalValue::U64(val)
}
NumericalType::I64 => {
let encoded: u64 = u64::from_le_bytes(octet);
let val: i64 = decode_zig_zag(encoded);
NumericalValue::I64(val)
}
NumericalType::F64 => {
debug_assert_eq!(bytes.len(), 8);
let val: f64 = f64::from_le_bytes(octet);
NumericalValue::F64(val)
}
}
}
/// F64: Serialize with a fixed size of 9 bytes
/// U64: Serialize without leading zeroes
/// I64: ZigZag encoded and serialize without leading zeroes
fn serialize(self, output: &mut [u8]) -> u8 {
match self {
NumericalValue::F64(val) => {
output[0] = NumericalType::F64 as u8;
output[1..9].copy_from_slice(&val.to_le_bytes());
9u8
}
NumericalValue::U64(val) => {
let len = compute_num_bytes_for_u64(val) as u8;
output[0] = NumericalType::U64 as u8;
output[1..9].copy_from_slice(&val.to_le_bytes());
len + 1u8
}
NumericalValue::I64(val) => {
let zig_zag_encoded = encode_zig_zag(val);
let len = compute_num_bytes_for_u64(zig_zag_encoded) as u8;
output[0] = NumericalType::I64 as u8;
output[1..9].copy_from_slice(&zig_zag_encoded.to_le_bytes());
len + 1u8
}
}
}
}
impl SymbolValue for u32 {
fn serialize(self, output: &mut [u8]) -> u8 {
let len = compute_num_bytes_for_u64(self as u64);
output[0..4].copy_from_slice(&self.to_le_bytes());
len as u8
}
fn deserialize(bytes: &[u8]) -> Self {
let mut quartet: [u8; 4] = [0u8; 4];
quartet[..bytes.len()].copy_from_slice(bytes);
u32::from_le_bytes(quartet)
}
}
impl SymbolValue for UnorderedId {
fn serialize(self, output: &mut [u8]) -> u8 {
self.0.serialize(output)
}
fn deserialize(bytes: &[u8]) -> Self {
UnorderedId(u32::deserialize(bytes))
}
}
fn compute_num_bytes_for_u64(val: u64) -> usize {
let msb = (64u32 - val.leading_zeros()) as usize;
(msb + 7) / 8
}
fn encode_zig_zag(n: i64) -> u64 {
((n << 1) ^ (n >> 63)) as u64
}
fn decode_zig_zag(n: u64) -> i64 {
((n >> 1) as i64) ^ (-((n & 1) as i64))
}
#[cfg(test)]
mod tests {
use super::*;
#[track_caller]
fn test_zig_zag_aux(val: i64) {
let encoded = super::encode_zig_zag(val);
assert_eq!(decode_zig_zag(encoded), val);
if let Some(abs_val) = val.checked_abs() {
let abs_val = abs_val as u64;
assert!(encoded <= abs_val * 2);
}
}
#[test]
fn test_zig_zag() {
assert_eq!(encode_zig_zag(0i64), 0u64);
assert_eq!(encode_zig_zag(-1i64), 1u64);
assert_eq!(encode_zig_zag(1i64), 2u64);
test_zig_zag_aux(0i64);
test_zig_zag_aux(i64::MIN);
test_zig_zag_aux(i64::MAX);
}
use proptest::prelude::any;
use proptest::proptest;
proptest! {
#[test]
fn test_proptest_zig_zag(val in any::<i64>()) {
test_zig_zag_aux(val);
}
}
#[test]
fn test_column_op_metadata_byte_serialization() {
for len in 0..=15 {
for op_type in [ColumnOperationType::AddValue, ColumnOperationType::NewDoc] {
let column_op_metadata = ColumnOperationMetadata { op_type, len };
let column_op_metadata_code = column_op_metadata.to_code();
let serdeser_metadata =
ColumnOperationMetadata::try_from_code(column_op_metadata_code).unwrap();
assert_eq!(column_op_metadata, serdeser_metadata);
}
}
}
#[track_caller]
fn ser_deser_symbol(column_op: ColumnOperation<NumericalValue>) {
let buf = column_op.serialize();
let mut buffer = buf.as_ref().to_vec();
buffer.extend_from_slice(b"234234");
let mut bytes = &buffer[..];
let serdeser_symbol = ColumnOperation::deserialize(&mut bytes).unwrap();
assert_eq!(bytes.len() + buf.as_ref().len(), buffer.len());
assert_eq!(column_op, serdeser_symbol);
}
#[test]
fn test_compute_num_bytes_for_u64() {
assert_eq!(compute_num_bytes_for_u64(0), 0);
assert_eq!(compute_num_bytes_for_u64(1), 1);
assert_eq!(compute_num_bytes_for_u64(255), 1);
assert_eq!(compute_num_bytes_for_u64(256), 2);
assert_eq!(compute_num_bytes_for_u64((1 << 16) - 1), 2);
assert_eq!(compute_num_bytes_for_u64(1 << 16), 3);
}
#[test]
fn test_symbol_serialization() {
ser_deser_symbol(ColumnOperation::NewDoc(0));
ser_deser_symbol(ColumnOperation::NewDoc(3));
ser_deser_symbol(ColumnOperation::Value(NumericalValue::I64(0i64)));
ser_deser_symbol(ColumnOperation::Value(NumericalValue::I64(1i64)));
ser_deser_symbol(ColumnOperation::Value(NumericalValue::U64(257u64)));
ser_deser_symbol(ColumnOperation::Value(NumericalValue::I64(-257i64)));
ser_deser_symbol(ColumnOperation::Value(NumericalValue::I64(i64::MIN)));
ser_deser_symbol(ColumnOperation::Value(NumericalValue::U64(0u64)));
ser_deser_symbol(ColumnOperation::Value(NumericalValue::U64(u64::MIN)));
ser_deser_symbol(ColumnOperation::Value(NumericalValue::U64(u64::MAX)));
}
fn test_column_operation_unordered_aux(val: u32, expected_len: usize) {
let column_op = ColumnOperation::Value(UnorderedId(val));
let minibuf = column_op.serialize();
assert_eq!({ minibuf.as_ref().len() }, expected_len);
let mut buf = minibuf.as_ref().to_vec();
buf.extend_from_slice(&[2, 2, 2, 2, 2, 2]);
let mut cursor = &buf[..];
let column_op_serdeser: ColumnOperation<UnorderedId> =
ColumnOperation::deserialize(&mut cursor).unwrap();
assert_eq!(column_op_serdeser, ColumnOperation::Value(UnorderedId(val)));
assert_eq!(cursor.len() + expected_len, buf.len());
}
#[test]
fn test_column_operation_unordered() {
test_column_operation_unordered_aux(300u32, 3);
test_column_operation_unordered_aux(1u32, 2);
test_column_operation_unordered_aux(0u32, 1);
}
}

View File

@@ -0,0 +1,363 @@
use std::cmp::Ordering;
use stacker::{ExpUnrolledLinkedList, MemoryArena};
use crate::columnar::writer::column_operation::{ColumnOperation, SymbolValue};
use crate::dictionary::{DictionaryBuilder, UnorderedId};
use crate::{Cardinality, NumericalType, NumericalValue, RowId};
#[derive(Copy, Clone, Debug, Eq, PartialEq)]
#[repr(u8)]
enum DocumentStep {
Same = 0,
Next = 1,
Skipped = 2,
}
#[inline(always)]
fn delta_with_last_doc(last_doc_opt: Option<u32>, doc: u32) -> DocumentStep {
let expected_next_doc = last_doc_opt.map(|last_doc| last_doc + 1).unwrap_or(0u32);
match doc.cmp(&expected_next_doc) {
Ordering::Less => DocumentStep::Same,
Ordering::Equal => DocumentStep::Next,
Ordering::Greater => DocumentStep::Skipped,
}
}
#[derive(Copy, Clone, Default)]
pub struct ColumnWriter {
// Detected cardinality of the column so far.
cardinality: Cardinality,
// Last document inserted.
// None if no doc has been added yet.
last_doc_opt: Option<u32>,
// Buffer containing the serialized values.
values: ExpUnrolledLinkedList,
}
impl ColumnWriter {
/// Returns an iterator over the Symbol that have been recorded
/// for the given column.
pub(super) fn operation_iterator<'a, V: SymbolValue>(
&self,
arena: &MemoryArena,
old_to_new_ids_opt: Option<&[RowId]>,
buffer: &'a mut Vec<u8>,
) -> impl Iterator<Item = ColumnOperation<V>> + 'a {
buffer.clear();
self.values.read_to_end(arena, buffer);
if let Some(old_to_new_ids) = old_to_new_ids_opt {
// TODO avoid the extra deserialization / serialization.
let mut sorted_ops: Vec<(RowId, ColumnOperation<V>)> = Vec::new();
let mut new_doc = 0u32;
let mut cursor = &buffer[..];
for op in std::iter::from_fn(|| ColumnOperation::<V>::deserialize(&mut cursor)) {
if let ColumnOperation::NewDoc(doc) = &op {
new_doc = old_to_new_ids[*doc as usize];
sorted_ops.push((new_doc, ColumnOperation::NewDoc(new_doc)));
} else {
sorted_ops.push((new_doc, op));
}
}
// stable sort is crucial here.
sorted_ops.sort_by_key(|(new_doc_id, _)| *new_doc_id);
buffer.clear();
for (_, op) in sorted_ops {
buffer.extend_from_slice(op.serialize().as_ref());
}
}
let mut cursor: &[u8] = &buffer[..];
std::iter::from_fn(move || ColumnOperation::deserialize(&mut cursor))
}
/// Records a change of the document being recorded.
///
/// This function will also update the cardinality of the column
/// if necessary.
pub(super) fn record<S: SymbolValue>(&mut self, doc: RowId, value: S, arena: &mut MemoryArena) {
// Difference between `doc` and the last doc.
match delta_with_last_doc(self.last_doc_opt, doc) {
DocumentStep::Same => {
// This is the last encounterred document.
self.cardinality = Cardinality::Multivalued;
}
DocumentStep::Next => {
self.last_doc_opt = Some(doc);
self.write_symbol::<S>(ColumnOperation::NewDoc(doc), arena);
}
DocumentStep::Skipped => {
self.cardinality = self.cardinality.max(Cardinality::Optional);
self.last_doc_opt = Some(doc);
self.write_symbol::<S>(ColumnOperation::NewDoc(doc), arena);
}
}
self.write_symbol(ColumnOperation::Value(value), arena);
}
// Get the cardinality.
// The overall number of docs in the column is necessary to
// deal with the case where the all docs contain 1 value, except some documents
// at the end of the column.
pub(crate) fn get_cardinality(&self, num_docs: RowId) -> Cardinality {
match delta_with_last_doc(self.last_doc_opt, num_docs) {
DocumentStep::Same | DocumentStep::Next => self.cardinality,
DocumentStep::Skipped => self.cardinality.max(Cardinality::Optional),
}
}
/// Appends a new symbol to the `ColumnWriter`.
fn write_symbol<V: SymbolValue>(
&mut self,
column_operation: ColumnOperation<V>,
arena: &mut MemoryArena,
) {
self.values
.writer(arena)
.extend_from_slice(column_operation.serialize().as_ref());
}
}
#[derive(Clone, Copy, Default)]
pub(crate) struct NumericalColumnWriter {
compatible_numerical_types: CompatibleNumericalTypes,
column_writer: ColumnWriter,
}
impl NumericalColumnWriter {
pub fn force_numerical_type(&mut self, numerical_type: NumericalType) {
assert!(self
.compatible_numerical_types
.is_type_accepted(numerical_type));
self.compatible_numerical_types = CompatibleNumericalTypes::StaticType(numerical_type);
}
}
/// State used to store what types are still acceptable
/// after having seen a set of numerical values.
#[derive(Clone, Copy)]
pub(crate) enum CompatibleNumericalTypes {
Dynamic {
all_values_within_i64_range: bool,
all_values_within_u64_range: bool,
},
StaticType(NumericalType),
}
impl Default for CompatibleNumericalTypes {
fn default() -> CompatibleNumericalTypes {
CompatibleNumericalTypes::Dynamic {
all_values_within_i64_range: true,
all_values_within_u64_range: true,
}
}
}
impl CompatibleNumericalTypes {
pub fn is_type_accepted(&self, numerical_type: NumericalType) -> bool {
match self {
CompatibleNumericalTypes::Dynamic {
all_values_within_i64_range,
all_values_within_u64_range,
} => match numerical_type {
NumericalType::I64 => *all_values_within_i64_range,
NumericalType::U64 => *all_values_within_u64_range,
NumericalType::F64 => true,
},
CompatibleNumericalTypes::StaticType(static_numerical_type) => {
*static_numerical_type == numerical_type
}
}
}
pub fn accept_value(&mut self, numerical_value: NumericalValue) {
match self {
CompatibleNumericalTypes::Dynamic {
all_values_within_i64_range,
all_values_within_u64_range,
} => match numerical_value {
NumericalValue::I64(val_i64) => {
let value_within_u64_range = val_i64 >= 0i64;
*all_values_within_u64_range &= value_within_u64_range;
}
NumericalValue::U64(val_u64) => {
let value_within_i64_range = val_u64 < i64::MAX as u64;
*all_values_within_i64_range &= value_within_i64_range;
}
NumericalValue::F64(_) => {
*all_values_within_i64_range = false;
*all_values_within_u64_range = false;
}
},
CompatibleNumericalTypes::StaticType(typ) => {
assert_eq!(
numerical_value.numerical_type(),
*typ,
"Input type forbidden. This column has been forced to type {typ:?}, received \
{numerical_value:?}"
);
}
}
}
pub fn to_numerical_type(self) -> NumericalType {
for numerical_type in [NumericalType::I64, NumericalType::U64] {
if self.is_type_accepted(numerical_type) {
return numerical_type;
}
}
NumericalType::F64
}
}
impl NumericalColumnWriter {
pub fn numerical_type(&self) -> NumericalType {
self.compatible_numerical_types.to_numerical_type()
}
pub fn cardinality(&self, num_docs: RowId) -> Cardinality {
self.column_writer.get_cardinality(num_docs)
}
pub fn record_numerical_value(
&mut self,
doc: RowId,
value: NumericalValue,
arena: &mut MemoryArena,
) {
self.compatible_numerical_types.accept_value(value);
self.column_writer.record(doc, value, arena);
}
pub(super) fn operation_iterator<'a>(
self,
arena: &MemoryArena,
old_to_new_ids: Option<&[RowId]>,
buffer: &'a mut Vec<u8>,
) -> impl Iterator<Item = ColumnOperation<NumericalValue>> + 'a {
self.column_writer
.operation_iterator(arena, old_to_new_ids, buffer)
}
}
#[derive(Copy, Clone)]
pub(crate) struct StrOrBytesColumnWriter {
pub(crate) dictionary_id: u32,
pub(crate) column_writer: ColumnWriter,
// If true, when facing a multivalued cardinality,
// values associated to a given document will be sorted.
//
// This is useful for facets.
//
// If false, the order of appearance in the document will be
// observed.
pub(crate) sort_values_within_row: bool,
}
impl StrOrBytesColumnWriter {
pub(crate) fn with_dictionary_id(dictionary_id: u32) -> StrOrBytesColumnWriter {
StrOrBytesColumnWriter {
dictionary_id,
column_writer: Default::default(),
sort_values_within_row: false,
}
}
pub(crate) fn record_bytes(
&mut self,
doc: RowId,
bytes: &[u8],
dictionaries: &mut [DictionaryBuilder],
arena: &mut MemoryArena,
) {
let unordered_id = dictionaries[self.dictionary_id as usize].get_or_allocate_id(bytes);
self.column_writer.record(doc, unordered_id, arena);
}
pub(super) fn operation_iterator<'a>(
&self,
arena: &MemoryArena,
old_to_new_ids: Option<&[RowId]>,
byte_buffer: &'a mut Vec<u8>,
) -> impl Iterator<Item = ColumnOperation<UnorderedId>> + 'a {
self.column_writer
.operation_iterator(arena, old_to_new_ids, byte_buffer)
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_delta_with_last_doc() {
assert_eq!(delta_with_last_doc(None, 0u32), DocumentStep::Next);
assert_eq!(delta_with_last_doc(None, 1u32), DocumentStep::Skipped);
assert_eq!(delta_with_last_doc(None, 2u32), DocumentStep::Skipped);
assert_eq!(delta_with_last_doc(Some(0u32), 0u32), DocumentStep::Same);
assert_eq!(delta_with_last_doc(Some(1u32), 1u32), DocumentStep::Same);
assert_eq!(delta_with_last_doc(Some(1u32), 2u32), DocumentStep::Next);
assert_eq!(delta_with_last_doc(Some(1u32), 3u32), DocumentStep::Skipped);
assert_eq!(delta_with_last_doc(Some(1u32), 4u32), DocumentStep::Skipped);
}
#[track_caller]
fn test_column_writer_coercion_iter_aux(
values: impl Iterator<Item = NumericalValue>,
expected_numerical_type: NumericalType,
) {
let mut compatible_numerical_types = CompatibleNumericalTypes::default();
for value in values {
compatible_numerical_types.accept_value(value);
}
assert_eq!(
compatible_numerical_types.to_numerical_type(),
expected_numerical_type
);
}
#[track_caller]
fn test_column_writer_coercion_aux(
values: &[NumericalValue],
expected_numerical_type: NumericalType,
) {
test_column_writer_coercion_iter_aux(values.iter().copied(), expected_numerical_type);
test_column_writer_coercion_iter_aux(values.iter().rev().copied(), expected_numerical_type);
}
#[test]
fn test_column_writer_coercion() {
test_column_writer_coercion_aux(&[], NumericalType::I64);
test_column_writer_coercion_aux(&[1i64.into()], NumericalType::I64);
test_column_writer_coercion_aux(&[1u64.into()], NumericalType::I64);
// We don't detect exact integer at the moment. We could!
test_column_writer_coercion_aux(&[1f64.into()], NumericalType::F64);
test_column_writer_coercion_aux(&[u64::MAX.into()], NumericalType::U64);
test_column_writer_coercion_aux(&[(i64::MAX as u64).into()], NumericalType::U64);
test_column_writer_coercion_aux(&[(1u64 << 63).into()], NumericalType::U64);
test_column_writer_coercion_aux(&[1i64.into(), 1u64.into()], NumericalType::I64);
test_column_writer_coercion_aux(&[u64::MAX.into(), (-1i64).into()], NumericalType::F64);
}
#[test]
#[should_panic]
fn test_compatible_numerical_types_static_incompatible_type() {
let mut compatible_numerical_types =
CompatibleNumericalTypes::StaticType(NumericalType::U64);
compatible_numerical_types.accept_value(NumericalValue::I64(1i64));
}
#[test]
fn test_compatible_numerical_types_static_different_type_forbidden() {
let mut compatible_numerical_types =
CompatibleNumericalTypes::StaticType(NumericalType::U64);
compatible_numerical_types.accept_value(NumericalValue::U64(u64::MAX));
}
#[test]
fn test_compatible_numerical_types_static() {
for typ in [NumericalType::I64, NumericalType::I64, NumericalType::F64] {
let compatible_numerical_types = CompatibleNumericalTypes::StaticType(typ);
assert_eq!(compatible_numerical_types.to_numerical_type(), typ);
}
}
}

View File

@@ -0,0 +1,848 @@
mod column_operation;
mod column_writers;
mod serializer;
mod value_index;
use std::io;
use std::net::Ipv6Addr;
use column_operation::ColumnOperation;
pub(crate) use column_writers::CompatibleNumericalTypes;
use common::CountingWriter;
pub(crate) use serializer::ColumnarSerializer;
use stacker::{Addr, ArenaHashMap, MemoryArena};
use crate::column_index::SerializableColumnIndex;
use crate::column_values::{
ColumnValues, MonotonicallyMappableToU128, MonotonicallyMappableToU64, VecColumn,
};
use crate::columnar::column_type::ColumnType;
use crate::columnar::writer::column_writers::{
ColumnWriter, NumericalColumnWriter, StrOrBytesColumnWriter,
};
use crate::columnar::writer::value_index::{IndexBuilder, PreallocatedIndexBuilders};
use crate::dictionary::{DictionaryBuilder, TermIdMapping, UnorderedId};
use crate::value::{Coerce, NumericalType, NumericalValue};
use crate::{Cardinality, RowId};
/// This is a set of buffers that are used to temporarily write the values into before passing them
/// to the fast field codecs.
#[derive(Default)]
struct SpareBuffers {
value_index_builders: PreallocatedIndexBuilders,
u64_values: Vec<u64>,
ip_addr_values: Vec<Ipv6Addr>,
}
/// Makes it possible to create a new columnar.
///
/// ```rust
/// use tantivy_columnar::ColumnarWriter;
///
/// let mut columnar_writer = ColumnarWriter::default();
/// columnar_writer.record_str(0u32 /* doc id */, "product_name", "Red backpack");
/// columnar_writer.record_numerical(0u32 /* doc id */, "price", 10u64);
/// columnar_writer.record_str(1u32 /* doc id */, "product_name", "Apple");
/// columnar_writer.record_numerical(0u32 /* doc id */, "price", 10.5f64); //< uh oh we ended up mixing integer and floats.
/// let mut wrt: Vec<u8> = Vec::new();
/// columnar_writer.serialize(2u32, None, &mut wrt).unwrap();
/// ```
#[derive(Default)]
pub struct ColumnarWriter {
numerical_field_hash_map: ArenaHashMap,
datetime_field_hash_map: ArenaHashMap,
bool_field_hash_map: ArenaHashMap,
ip_addr_field_hash_map: ArenaHashMap,
bytes_field_hash_map: ArenaHashMap,
str_field_hash_map: ArenaHashMap,
arena: MemoryArena,
// Dictionaries used to store dictionary-encoded values.
dictionaries: Vec<DictionaryBuilder>,
buffers: SpareBuffers,
}
#[inline]
fn mutate_or_create_column<V, TMutator>(
arena_hash_map: &mut ArenaHashMap,
column_name: &str,
updater: TMutator,
) where
V: Copy + 'static,
TMutator: FnMut(Option<V>) -> V,
{
assert!(
!column_name.as_bytes().contains(&0u8),
"key may not contain the 0 byte"
);
arena_hash_map.mutate_or_create(column_name.as_bytes(), updater);
}
impl ColumnarWriter {
pub fn mem_usage(&self) -> usize {
// TODO add dictionary builders.
self.arena.mem_usage()
+ self.numerical_field_hash_map.mem_usage()
+ self.bool_field_hash_map.mem_usage()
+ self.bytes_field_hash_map.mem_usage()
+ self.str_field_hash_map.mem_usage()
+ self.ip_addr_field_hash_map.mem_usage()
+ self.datetime_field_hash_map.mem_usage()
}
/// Returns the list of doc ids from 0..num_docs sorted by the `sort_field`
/// column.
///
/// If the column is multivalued, use the first value for scoring.
/// If no value is associated to a specific row, the document is assigned
/// the lowest possible score.
///
/// The sort applied is stable.
pub fn sort_order(&self, sort_field: &str, num_docs: RowId, reversed: bool) -> Vec<u32> {
let Some(numerical_col_writer) =
self.numerical_field_hash_map.get::<NumericalColumnWriter>(sort_field.as_bytes()) else {
return Vec::new();
};
let mut symbols_buffer = Vec::new();
let mut values = Vec::new();
let mut last_doc_opt: Option<RowId> = None;
for op in numerical_col_writer.operation_iterator(&self.arena, None, &mut symbols_buffer) {
match op {
ColumnOperation::NewDoc(doc) => {
last_doc_opt = Some(doc);
}
ColumnOperation::Value(numerical_value) => {
if let Some(last_doc) = last_doc_opt {
let score: f32 = f64::coerce(numerical_value) as f32;
values.push((score, last_doc));
}
}
}
}
for doc in values.len() as u32..num_docs {
values.push((0.0f32, doc));
}
values.sort_by(|(left_score, _), (right_score, _)| {
if reversed {
right_score.partial_cmp(left_score).unwrap()
} else {
left_score.partial_cmp(right_score).unwrap()
}
});
values.into_iter().map(|(_score, doc)| doc).collect()
}
/// Records a column type. This is useful to bypass the coercion process,
/// makes sure the empty is present in the resulting columnar, or set
/// the `sort_values_within_row`.
///
/// `sort_values_within_row` is only allowed for `Bytes` or `Str` columns.
pub fn record_column_type(
&mut self,
column_name: &str,
column_type: ColumnType,
sort_values_within_row: bool,
) {
if sort_values_within_row {
assert!(
column_type == ColumnType::Bytes || column_type == ColumnType::Str,
"sort_values_within_row is only allowed for Bytes and Str columns",
);
}
match column_type {
ColumnType::Str | ColumnType::Bytes => {
let (hash_map, dictionaries) = (
if column_type == ColumnType::Str {
&mut self.str_field_hash_map
} else {
&mut self.bytes_field_hash_map
},
&mut self.dictionaries,
);
mutate_or_create_column(
hash_map,
column_name,
|column_opt: Option<StrOrBytesColumnWriter>| {
let mut column_writer = if let Some(column_writer) = column_opt {
column_writer
} else {
let dictionary_id = dictionaries.len() as u32;
dictionaries.push(DictionaryBuilder::default());
StrOrBytesColumnWriter::with_dictionary_id(dictionary_id)
};
column_writer.sort_values_within_row = sort_values_within_row;
column_writer
},
);
}
ColumnType::Bool => {
mutate_or_create_column(
&mut self.bool_field_hash_map,
column_name,
|column_opt: Option<ColumnWriter>| column_opt.unwrap_or_default(),
);
}
ColumnType::DateTime => {
mutate_or_create_column(
&mut self.datetime_field_hash_map,
column_name,
|column_opt: Option<ColumnWriter>| column_opt.unwrap_or_default(),
);
}
ColumnType::I64 | ColumnType::F64 | ColumnType::U64 => {
let numerical_type = column_type.numerical_type().unwrap();
mutate_or_create_column(
&mut self.numerical_field_hash_map,
column_name,
|column_opt: Option<NumericalColumnWriter>| {
let mut column: NumericalColumnWriter = column_opt.unwrap_or_default();
column.force_numerical_type(numerical_type);
column
},
);
}
ColumnType::IpAddr => mutate_or_create_column(
&mut self.ip_addr_field_hash_map,
column_name,
|column_opt: Option<ColumnWriter>| column_opt.unwrap_or_default(),
),
}
}
pub fn record_numerical<T: Into<NumericalValue> + Copy>(
&mut self,
doc: RowId,
column_name: &str,
numerical_value: T,
) {
let (hash_map, arena) = (&mut self.numerical_field_hash_map, &mut self.arena);
mutate_or_create_column(
hash_map,
column_name,
|column_opt: Option<NumericalColumnWriter>| {
let mut column: NumericalColumnWriter = column_opt.unwrap_or_default();
column.record_numerical_value(doc, numerical_value.into(), arena);
column
},
);
}
pub fn record_ip_addr(&mut self, doc: RowId, column_name: &str, ip_addr: Ipv6Addr) {
assert!(
!column_name.as_bytes().contains(&0u8),
"key may not contain the 0 byte"
);
let (hash_map, arena) = (&mut self.ip_addr_field_hash_map, &mut self.arena);
hash_map.mutate_or_create(
column_name.as_bytes(),
|column_opt: Option<ColumnWriter>| {
let mut column: ColumnWriter = column_opt.unwrap_or_default();
column.record(doc, ip_addr, arena);
column
},
);
}
pub fn record_bool(&mut self, doc: RowId, column_name: &str, val: bool) {
let (hash_map, arena) = (&mut self.bool_field_hash_map, &mut self.arena);
mutate_or_create_column(hash_map, column_name, |column_opt: Option<ColumnWriter>| {
let mut column: ColumnWriter = column_opt.unwrap_or_default();
column.record(doc, val, arena);
column
});
}
pub fn record_datetime(&mut self, doc: RowId, column_name: &str, datetime: common::DateTime) {
let (hash_map, arena) = (&mut self.datetime_field_hash_map, &mut self.arena);
mutate_or_create_column(hash_map, column_name, |column_opt: Option<ColumnWriter>| {
let mut column: ColumnWriter = column_opt.unwrap_or_default();
column.record(
doc,
NumericalValue::I64(datetime.into_timestamp_micros()),
arena,
);
column
});
}
pub fn record_str(&mut self, doc: RowId, column_name: &str, value: &str) {
let (hash_map, arena, dictionaries) = (
&mut self.str_field_hash_map,
&mut self.arena,
&mut self.dictionaries,
);
hash_map.mutate_or_create(
column_name.as_bytes(),
|column_opt: Option<StrOrBytesColumnWriter>| {
let mut column: StrOrBytesColumnWriter = column_opt.unwrap_or_else(|| {
// Each column has its own dictionary
let dictionary_id = dictionaries.len() as u32;
dictionaries.push(DictionaryBuilder::default());
StrOrBytesColumnWriter::with_dictionary_id(dictionary_id)
});
column.record_bytes(doc, value.as_bytes(), dictionaries, arena);
column
},
);
}
pub fn record_bytes(&mut self, doc: RowId, column_name: &str, value: &[u8]) {
assert!(
!column_name.as_bytes().contains(&0u8),
"key may not contain the 0 byte"
);
let (hash_map, arena, dictionaries) = (
&mut self.bytes_field_hash_map,
&mut self.arena,
&mut self.dictionaries,
);
hash_map.mutate_or_create(
column_name.as_bytes(),
|column_opt: Option<StrOrBytesColumnWriter>| {
let mut column: StrOrBytesColumnWriter = column_opt.unwrap_or_else(|| {
// Each column has its own dictionary
let dictionary_id = dictionaries.len() as u32;
dictionaries.push(DictionaryBuilder::default());
StrOrBytesColumnWriter::with_dictionary_id(dictionary_id)
});
column.record_bytes(doc, value, dictionaries, arena);
column
},
);
}
pub fn serialize(
&mut self,
num_docs: RowId,
old_to_new_row_ids: Option<&[RowId]>,
wrt: &mut dyn io::Write,
) -> io::Result<()> {
let mut serializer = ColumnarSerializer::new(wrt);
let mut columns: Vec<(&[u8], ColumnType, Addr)> = self
.numerical_field_hash_map
.iter()
.map(|(column_name, addr, _)| {
let numerical_column_writer: NumericalColumnWriter =
self.numerical_field_hash_map.read(addr);
let column_type = numerical_column_writer.numerical_type().into();
(column_name, column_type, addr)
})
.collect();
columns.extend(
self.bytes_field_hash_map
.iter()
.map(|(term, addr, _)| (term, ColumnType::Bytes, addr)),
);
columns.extend(
self.str_field_hash_map
.iter()
.map(|(column_name, addr, _)| (column_name, ColumnType::Str, addr)),
);
columns.extend(
self.bool_field_hash_map
.iter()
.map(|(column_name, addr, _)| (column_name, ColumnType::Bool, addr)),
);
columns.extend(
self.ip_addr_field_hash_map
.iter()
.map(|(column_name, addr, _)| (column_name, ColumnType::IpAddr, addr)),
);
columns.extend(
self.datetime_field_hash_map
.iter()
.map(|(column_name, addr, _)| (column_name, ColumnType::DateTime, addr)),
);
columns.sort_unstable_by_key(|(column_name, col_type, _)| (*column_name, *col_type));
let (arena, buffers, dictionaries) = (&self.arena, &mut self.buffers, &self.dictionaries);
let mut symbol_byte_buffer: Vec<u8> = Vec::new();
for (column_name, column_type, addr) in columns {
match column_type {
ColumnType::Bool => {
let column_writer: ColumnWriter = self.bool_field_hash_map.read(addr);
let cardinality = column_writer.get_cardinality(num_docs);
let mut column_serializer =
serializer.serialize_column(column_name, column_type);
serialize_bool_column(
cardinality,
num_docs,
column_writer.operation_iterator(
arena,
old_to_new_row_ids,
&mut symbol_byte_buffer,
),
buffers,
&mut column_serializer,
)?;
}
ColumnType::IpAddr => {
let column_writer: ColumnWriter = self.ip_addr_field_hash_map.read(addr);
let cardinality = column_writer.get_cardinality(num_docs);
let mut column_serializer =
serializer.serialize_column(column_name, ColumnType::IpAddr);
serialize_ip_addr_column(
cardinality,
num_docs,
column_writer.operation_iterator(
arena,
old_to_new_row_ids,
&mut symbol_byte_buffer,
),
buffers,
&mut column_serializer,
)?;
}
ColumnType::Bytes | ColumnType::Str => {
let str_or_bytes_column_writer: StrOrBytesColumnWriter =
if column_type == ColumnType::Bytes {
self.bytes_field_hash_map.read(addr)
} else {
self.str_field_hash_map.read(addr)
};
let dictionary_builder =
&dictionaries[str_or_bytes_column_writer.dictionary_id as usize];
let cardinality = str_or_bytes_column_writer
.column_writer
.get_cardinality(num_docs);
let mut column_serializer =
serializer.serialize_column(column_name, column_type);
serialize_bytes_or_str_column(
cardinality,
num_docs,
str_or_bytes_column_writer.sort_values_within_row,
dictionary_builder,
str_or_bytes_column_writer.operation_iterator(
arena,
old_to_new_row_ids,
&mut symbol_byte_buffer,
),
buffers,
&mut column_serializer,
)?;
}
ColumnType::F64 | ColumnType::I64 | ColumnType::U64 => {
let numerical_column_writer: NumericalColumnWriter =
self.numerical_field_hash_map.read(addr);
let cardinality = numerical_column_writer.cardinality(num_docs);
let mut column_serializer =
serializer.serialize_column(column_name, column_type);
let numerical_type = column_type.numerical_type().unwrap();
serialize_numerical_column(
cardinality,
num_docs,
numerical_type,
numerical_column_writer.operation_iterator(
arena,
old_to_new_row_ids,
&mut symbol_byte_buffer,
),
buffers,
&mut column_serializer,
)?;
}
ColumnType::DateTime => {
let column_writer: ColumnWriter = self.datetime_field_hash_map.read(addr);
let cardinality = column_writer.get_cardinality(num_docs);
let mut column_serializer =
serializer.serialize_column(column_name, ColumnType::DateTime);
serialize_numerical_column(
cardinality,
num_docs,
NumericalType::I64,
column_writer.operation_iterator(
arena,
old_to_new_row_ids,
&mut symbol_byte_buffer,
),
buffers,
&mut column_serializer,
)?;
}
};
}
serializer.finalize(num_docs)?;
Ok(())
}
}
// Serialize [Dictionary, Column, dictionary num bytes U32::LE]
// Column: [Column Index, Column Values, column index num bytes U32::LE]
fn serialize_bytes_or_str_column(
cardinality: Cardinality,
num_docs: RowId,
sort_values_within_row: bool,
dictionary_builder: &DictionaryBuilder,
operation_it: impl Iterator<Item = ColumnOperation<UnorderedId>>,
buffers: &mut SpareBuffers,
wrt: impl io::Write,
) -> io::Result<()> {
let SpareBuffers {
value_index_builders,
u64_values,
..
} = buffers;
let mut counting_writer = CountingWriter::wrap(wrt);
let term_id_mapping: TermIdMapping = dictionary_builder.serialize(&mut counting_writer)?;
let dictionary_num_bytes: u32 = counting_writer.written_bytes() as u32;
let mut wrt = counting_writer.finish();
let operation_iterator = operation_it.map(|symbol: ColumnOperation<UnorderedId>| {
// We map unordered ids to ordered ids.
match symbol {
ColumnOperation::Value(unordered_id) => {
let ordered_id = term_id_mapping.to_ord(unordered_id);
ColumnOperation::Value(ordered_id.0 as u64)
}
ColumnOperation::NewDoc(doc) => ColumnOperation::NewDoc(doc),
}
});
send_to_serialize_column_mappable_to_u64(
operation_iterator,
cardinality,
num_docs,
sort_values_within_row,
value_index_builders,
u64_values,
&mut wrt,
)?;
wrt.write_all(&dictionary_num_bytes.to_le_bytes()[..])?;
Ok(())
}
fn serialize_numerical_column(
cardinality: Cardinality,
num_docs: RowId,
numerical_type: NumericalType,
op_iterator: impl Iterator<Item = ColumnOperation<NumericalValue>>,
buffers: &mut SpareBuffers,
wrt: &mut impl io::Write,
) -> io::Result<()> {
let SpareBuffers {
value_index_builders,
u64_values,
..
} = buffers;
match numerical_type {
NumericalType::I64 => {
send_to_serialize_column_mappable_to_u64(
coerce_numerical_symbol::<i64>(op_iterator),
cardinality,
num_docs,
false,
value_index_builders,
u64_values,
wrt,
)?;
}
NumericalType::U64 => {
send_to_serialize_column_mappable_to_u64(
coerce_numerical_symbol::<u64>(op_iterator),
cardinality,
num_docs,
false,
value_index_builders,
u64_values,
wrt,
)?;
}
NumericalType::F64 => {
send_to_serialize_column_mappable_to_u64(
coerce_numerical_symbol::<f64>(op_iterator),
cardinality,
num_docs,
false,
value_index_builders,
u64_values,
wrt,
)?;
}
};
Ok(())
}
fn serialize_bool_column(
cardinality: Cardinality,
num_docs: RowId,
column_operations_it: impl Iterator<Item = ColumnOperation<bool>>,
buffers: &mut SpareBuffers,
wrt: &mut impl io::Write,
) -> io::Result<()> {
let SpareBuffers {
value_index_builders,
u64_values,
..
} = buffers;
send_to_serialize_column_mappable_to_u64(
column_operations_it.map(|bool_column_operation| match bool_column_operation {
ColumnOperation::NewDoc(doc) => ColumnOperation::NewDoc(doc),
ColumnOperation::Value(bool_val) => ColumnOperation::Value(bool_val.to_u64()),
}),
cardinality,
num_docs,
false,
value_index_builders,
u64_values,
wrt,
)?;
Ok(())
}
fn serialize_ip_addr_column(
cardinality: Cardinality,
num_docs: RowId,
column_operations_it: impl Iterator<Item = ColumnOperation<Ipv6Addr>>,
buffers: &mut SpareBuffers,
wrt: &mut impl io::Write,
) -> io::Result<()> {
let SpareBuffers {
value_index_builders,
ip_addr_values,
..
} = buffers;
send_to_serialize_column_mappable_to_u128(
column_operations_it,
cardinality,
num_docs,
value_index_builders,
ip_addr_values,
wrt,
)?;
Ok(())
}
fn send_to_serialize_column_mappable_to_u128<
T: Copy + Ord + std::fmt::Debug + Send + Sync + MonotonicallyMappableToU128 + PartialOrd,
>(
op_iterator: impl Iterator<Item = ColumnOperation<T>>,
cardinality: Cardinality,
num_rows: RowId,
value_index_builders: &mut PreallocatedIndexBuilders,
values: &mut Vec<T>,
mut wrt: impl io::Write,
) -> io::Result<()>
where
for<'a> VecColumn<'a, T>: ColumnValues<T>,
{
values.clear();
// TODO: split index and values
let serializable_column_index = match cardinality {
Cardinality::Full => {
consume_operation_iterator(
op_iterator,
value_index_builders.borrow_required_index_builder(),
values,
);
SerializableColumnIndex::Full
}
Cardinality::Optional => {
let optional_index_builder = value_index_builders.borrow_optional_index_builder();
consume_operation_iterator(op_iterator, optional_index_builder, values);
let optional_index = optional_index_builder.finish(num_rows);
SerializableColumnIndex::Optional {
num_rows,
non_null_row_ids: Box::new(optional_index),
}
}
Cardinality::Multivalued => {
let multivalued_index_builder = value_index_builders.borrow_multivalued_index_builder();
consume_operation_iterator(op_iterator, multivalued_index_builder, values);
let multivalued_index = multivalued_index_builder.finish(num_rows);
SerializableColumnIndex::Multivalued(Box::new(multivalued_index))
}
};
crate::column::serialize_column_mappable_to_u128(
serializable_column_index,
&&values[..],
&mut wrt,
)?;
Ok(())
}
fn sort_values_within_row_in_place(multivalued_index: &[RowId], values: &mut [u64]) {
let mut start_index: usize = 0;
for end_index in multivalued_index.iter().copied() {
let end_index = end_index as usize;
values[start_index..end_index].sort_unstable();
start_index = end_index;
}
}
fn send_to_serialize_column_mappable_to_u64(
op_iterator: impl Iterator<Item = ColumnOperation<u64>>,
cardinality: Cardinality,
num_rows: RowId,
sort_values_within_row: bool,
value_index_builders: &mut PreallocatedIndexBuilders,
values: &mut Vec<u64>,
mut wrt: impl io::Write,
) -> io::Result<()>
where
for<'a> VecColumn<'a, u64>: ColumnValues<u64>,
{
values.clear();
let serializable_column_index = match cardinality {
Cardinality::Full => {
consume_operation_iterator(
op_iterator,
value_index_builders.borrow_required_index_builder(),
values,
);
SerializableColumnIndex::Full
}
Cardinality::Optional => {
let optional_index_builder = value_index_builders.borrow_optional_index_builder();
consume_operation_iterator(op_iterator, optional_index_builder, values);
let optional_index = optional_index_builder.finish(num_rows);
SerializableColumnIndex::Optional {
non_null_row_ids: Box::new(optional_index),
num_rows,
}
}
Cardinality::Multivalued => {
let multivalued_index_builder = value_index_builders.borrow_multivalued_index_builder();
consume_operation_iterator(op_iterator, multivalued_index_builder, values);
let multivalued_index = multivalued_index_builder.finish(num_rows);
if sort_values_within_row {
sort_values_within_row_in_place(multivalued_index, values);
}
SerializableColumnIndex::Multivalued(Box::new(multivalued_index))
}
};
crate::column::serialize_column_mappable_to_u64(
serializable_column_index,
&&values[..],
&mut wrt,
)?;
Ok(())
}
fn coerce_numerical_symbol<T>(
operation_iterator: impl Iterator<Item = ColumnOperation<NumericalValue>>,
) -> impl Iterator<Item = ColumnOperation<u64>>
where T: Coerce + MonotonicallyMappableToU64 {
operation_iterator.map(|symbol| match symbol {
ColumnOperation::NewDoc(doc) => ColumnOperation::NewDoc(doc),
ColumnOperation::Value(numerical_value) => {
ColumnOperation::Value(T::coerce(numerical_value).to_u64())
}
})
}
fn consume_operation_iterator<T: Ord, TIndexBuilder: IndexBuilder>(
operation_iterator: impl Iterator<Item = ColumnOperation<T>>,
index_builder: &mut TIndexBuilder,
values: &mut Vec<T>,
) {
for symbol in operation_iterator {
match symbol {
ColumnOperation::NewDoc(doc) => {
index_builder.record_row(doc);
}
ColumnOperation::Value(value) => {
index_builder.record_value();
values.push(value);
}
}
}
}
#[cfg(test)]
mod tests {
use stacker::MemoryArena;
use crate::columnar::writer::column_operation::ColumnOperation;
use crate::{Cardinality, NumericalValue};
#[test]
fn test_column_writer_required_simple() {
let mut arena = MemoryArena::default();
let mut column_writer = super::ColumnWriter::default();
column_writer.record(0u32, NumericalValue::from(14i64), &mut arena);
column_writer.record(1u32, NumericalValue::from(15i64), &mut arena);
column_writer.record(2u32, NumericalValue::from(-16i64), &mut arena);
assert_eq!(column_writer.get_cardinality(3), Cardinality::Full);
let mut buffer = Vec::new();
let symbols: Vec<ColumnOperation<NumericalValue>> = column_writer
.operation_iterator(&arena, None, &mut buffer)
.collect();
assert_eq!(symbols.len(), 6);
assert!(matches!(symbols[0], ColumnOperation::NewDoc(0u32)));
assert!(matches!(
symbols[1],
ColumnOperation::Value(NumericalValue::I64(14i64))
));
assert!(matches!(symbols[2], ColumnOperation::NewDoc(1u32)));
assert!(matches!(
symbols[3],
ColumnOperation::Value(NumericalValue::I64(15i64))
));
assert!(matches!(symbols[4], ColumnOperation::NewDoc(2u32)));
assert!(matches!(
symbols[5],
ColumnOperation::Value(NumericalValue::I64(-16i64))
));
}
#[test]
fn test_column_writer_optional_cardinality_missing_first() {
let mut arena = MemoryArena::default();
let mut column_writer = super::ColumnWriter::default();
column_writer.record(1u32, NumericalValue::from(15i64), &mut arena);
column_writer.record(2u32, NumericalValue::from(-16i64), &mut arena);
assert_eq!(column_writer.get_cardinality(3), Cardinality::Optional);
let mut buffer = Vec::new();
let symbols: Vec<ColumnOperation<NumericalValue>> = column_writer
.operation_iterator(&arena, None, &mut buffer)
.collect();
assert_eq!(symbols.len(), 4);
assert!(matches!(symbols[0], ColumnOperation::NewDoc(1u32)));
assert!(matches!(
symbols[1],
ColumnOperation::Value(NumericalValue::I64(15i64))
));
assert!(matches!(symbols[2], ColumnOperation::NewDoc(2u32)));
assert!(matches!(
symbols[3],
ColumnOperation::Value(NumericalValue::I64(-16i64))
));
}
#[test]
fn test_column_writer_optional_cardinality_missing_last() {
let mut arena = MemoryArena::default();
let mut column_writer = super::ColumnWriter::default();
column_writer.record(0u32, NumericalValue::from(15i64), &mut arena);
assert_eq!(column_writer.get_cardinality(2), Cardinality::Optional);
let mut buffer = Vec::new();
let symbols: Vec<ColumnOperation<NumericalValue>> = column_writer
.operation_iterator(&arena, None, &mut buffer)
.collect();
assert_eq!(symbols.len(), 2);
assert!(matches!(symbols[0], ColumnOperation::NewDoc(0u32)));
assert!(matches!(
symbols[1],
ColumnOperation::Value(NumericalValue::I64(15i64))
));
}
#[test]
fn test_column_writer_multivalued() {
let mut arena = MemoryArena::default();
let mut column_writer = super::ColumnWriter::default();
column_writer.record(0u32, NumericalValue::from(16i64), &mut arena);
column_writer.record(0u32, NumericalValue::from(17i64), &mut arena);
assert_eq!(column_writer.get_cardinality(1), Cardinality::Multivalued);
let mut buffer = Vec::new();
let symbols: Vec<ColumnOperation<NumericalValue>> = column_writer
.operation_iterator(&arena, None, &mut buffer)
.collect();
assert_eq!(symbols.len(), 3);
assert!(matches!(symbols[0], ColumnOperation::NewDoc(0u32)));
assert!(matches!(
symbols[1],
ColumnOperation::Value(NumericalValue::I64(16i64))
));
assert!(matches!(
symbols[2],
ColumnOperation::Value(NumericalValue::I64(17i64))
));
}
}

View File

@@ -0,0 +1,108 @@
use std::io;
use std::io::Write;
use common::{BinarySerializable, CountingWriter};
use sstable::value::RangeValueWriter;
use sstable::RangeSSTable;
use crate::columnar::ColumnType;
use crate::RowId;
pub struct ColumnarSerializer<W: io::Write> {
wrt: CountingWriter<W>,
sstable_range: sstable::Writer<Vec<u8>, RangeValueWriter>,
prepare_key_buffer: Vec<u8>,
}
/// Returns a key consisting of the concatenation of the key and the column_type_and_cardinality
/// code.
fn prepare_key(key: &[u8], column_type: ColumnType, buffer: &mut Vec<u8>) {
buffer.clear();
buffer.extend_from_slice(key);
buffer.push(0u8);
buffer.push(column_type.to_code());
}
impl<W: io::Write> ColumnarSerializer<W> {
pub(crate) fn new(wrt: W) -> ColumnarSerializer<W> {
let sstable_range: sstable::Writer<Vec<u8>, RangeValueWriter> =
sstable::Dictionary::<RangeSSTable>::builder(Vec::with_capacity(100_000)).unwrap();
ColumnarSerializer {
wrt: CountingWriter::wrap(wrt),
sstable_range,
prepare_key_buffer: Vec::new(),
}
}
pub fn serialize_column<'a>(
&'a mut self,
column_name: &[u8],
column_type: ColumnType,
) -> impl io::Write + 'a {
let start_offset = self.wrt.written_bytes();
prepare_key(column_name, column_type, &mut self.prepare_key_buffer);
ColumnSerializer {
columnar_serializer: self,
start_offset,
}
}
pub(crate) fn finalize(mut self, num_rows: RowId) -> io::Result<()> {
let sstable_bytes: Vec<u8> = self.sstable_range.finish()?;
let sstable_num_bytes: u64 = sstable_bytes.len() as u64;
self.wrt.write_all(&sstable_bytes)?;
self.wrt.write_all(&sstable_num_bytes.to_le_bytes()[..])?;
num_rows.serialize(&mut self.wrt)?;
self.wrt
.write_all(&super::super::format_version::footer())?;
self.wrt.flush()?;
Ok(())
}
}
struct ColumnSerializer<'a, W: io::Write> {
columnar_serializer: &'a mut ColumnarSerializer<W>,
start_offset: u64,
}
impl<'a, W: io::Write> Drop for ColumnSerializer<'a, W> {
fn drop(&mut self) {
let end_offset: u64 = self.columnar_serializer.wrt.written_bytes();
let byte_range = self.start_offset..end_offset;
self.columnar_serializer.sstable_range.insert_cannot_fail(
&self.columnar_serializer.prepare_key_buffer[..],
&byte_range,
);
self.columnar_serializer.prepare_key_buffer.clear();
}
}
impl<'a, W: io::Write> io::Write for ColumnSerializer<'a, W> {
fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
self.columnar_serializer.wrt.write(buf)
}
fn flush(&mut self) -> io::Result<()> {
self.columnar_serializer.wrt.flush()
}
fn write_all(&mut self, buf: &[u8]) -> io::Result<()> {
self.columnar_serializer.wrt.write_all(buf)
}
}
#[cfg(test)]
mod tests {
use super::*;
use crate::columnar::column_type::ColumnType;
#[test]
fn test_prepare_key_bytes() {
let mut buffer: Vec<u8> = b"somegarbage".to_vec();
prepare_key(b"root\0child", ColumnType::Str, &mut buffer);
assert_eq!(buffer.len(), 12);
assert_eq!(&buffer[..10], b"root\0child");
assert_eq!(buffer[10], 0u8);
assert_eq!(buffer[11], ColumnType::Str.to_code());
}
}

View File

@@ -0,0 +1,165 @@
use crate::iterable::Iterable;
use crate::RowId;
/// The `IndexBuilder` interprets a sequence of
/// calls of the form:
/// (record_doc,record_value+)*
/// and can then serialize the results into an index to associate docids with their value[s].
///
/// It has different implementation depending on whether the
/// cardinality is required, optional, or multivalued.
pub(crate) trait IndexBuilder {
fn record_row(&mut self, doc: RowId);
#[inline]
fn record_value(&mut self) {}
}
/// The FullIndexBuilder does nothing.
#[derive(Default)]
pub struct FullIndexBuilder;
impl IndexBuilder for FullIndexBuilder {
#[inline(always)]
fn record_row(&mut self, _doc: RowId) {}
}
#[derive(Default)]
pub struct OptionalIndexBuilder {
docs: Vec<RowId>,
}
impl OptionalIndexBuilder {
pub fn finish(&mut self, num_rows: RowId) -> impl Iterable<RowId> + '_ {
debug_assert!(self
.docs
.last()
.copied()
.map(|last_doc| last_doc < num_rows)
.unwrap_or(true));
&self.docs[..]
}
fn reset(&mut self) {
self.docs.clear();
}
}
impl IndexBuilder for OptionalIndexBuilder {
#[inline(always)]
fn record_row(&mut self, doc: RowId) {
debug_assert!(self
.docs
.last()
.copied()
.map(|prev_doc| doc > prev_doc)
.unwrap_or(true));
self.docs.push(doc);
}
}
#[derive(Default)]
pub struct MultivaluedIndexBuilder {
start_offsets: Vec<RowId>,
total_num_vals_seen: u32,
}
impl MultivaluedIndexBuilder {
pub fn finish(&mut self, num_docs: RowId) -> &[u32] {
self.start_offsets
.resize(num_docs as usize + 1, self.total_num_vals_seen);
&self.start_offsets[..]
}
fn reset(&mut self) {
self.start_offsets.clear();
self.start_offsets.push(0u32);
self.total_num_vals_seen = 0;
}
}
impl IndexBuilder for MultivaluedIndexBuilder {
fn record_row(&mut self, row_id: RowId) {
self.start_offsets
.resize(row_id as usize + 1, self.total_num_vals_seen);
}
fn record_value(&mut self) {
self.total_num_vals_seen += 1;
}
}
/// The `SpareIndexBuilders` is there to avoid allocating a
/// new index builder for every single column.
#[derive(Default)]
pub struct PreallocatedIndexBuilders {
required_index_builder: FullIndexBuilder,
optional_index_builder: OptionalIndexBuilder,
multivalued_index_builder: MultivaluedIndexBuilder,
}
impl PreallocatedIndexBuilders {
pub fn borrow_required_index_builder(&mut self) -> &mut FullIndexBuilder {
&mut self.required_index_builder
}
pub fn borrow_optional_index_builder(&mut self) -> &mut OptionalIndexBuilder {
self.optional_index_builder.reset();
&mut self.optional_index_builder
}
pub fn borrow_multivalued_index_builder(&mut self) -> &mut MultivaluedIndexBuilder {
self.multivalued_index_builder.reset();
&mut self.multivalued_index_builder
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_optional_value_index_builder() {
let mut opt_value_index_builder = OptionalIndexBuilder::default();
opt_value_index_builder.record_row(0u32);
opt_value_index_builder.record_value();
assert_eq!(
&opt_value_index_builder
.finish(1u32)
.boxed_iter()
.collect::<Vec<u32>>(),
&[0]
);
opt_value_index_builder.reset();
opt_value_index_builder.record_row(1u32);
opt_value_index_builder.record_value();
assert_eq!(
&opt_value_index_builder
.finish(2u32)
.boxed_iter()
.collect::<Vec<u32>>(),
&[1]
);
}
#[test]
fn test_multivalued_value_index_builder() {
let mut multivalued_value_index_builder = MultivaluedIndexBuilder::default();
multivalued_value_index_builder.record_row(1u32);
multivalued_value_index_builder.record_value();
multivalued_value_index_builder.record_value();
multivalued_value_index_builder.record_row(2u32);
multivalued_value_index_builder.record_value();
assert_eq!(
multivalued_value_index_builder.finish(4u32).to_vec(),
vec![0, 0, 2, 3, 3]
);
multivalued_value_index_builder.reset();
multivalued_value_index_builder.record_row(2u32);
multivalued_value_index_builder.record_value();
multivalued_value_index_builder.record_value();
assert_eq!(
multivalued_value_index_builder.finish(4u32).to_vec(),
vec![0, 0, 0, 2, 2]
);
}
}

View File

@@ -0,0 +1,84 @@
use std::io;
use fnv::FnvHashMap;
use sstable::SSTable;
pub(crate) struct TermIdMapping {
unordered_to_ord: Vec<OrderedId>,
}
impl TermIdMapping {
pub fn to_ord(&self, unordered: UnorderedId) -> OrderedId {
self.unordered_to_ord[unordered.0 as usize]
}
}
/// When we add values, we cannot know their ordered id yet.
/// For this reason, we temporarily assign them a `UnorderedId`
/// that will be mapped to an `OrderedId` upon serialization.
#[derive(Clone, Copy, Debug, Hash, PartialEq, Eq)]
pub struct UnorderedId(pub u32);
#[derive(Clone, Copy, Hash, PartialEq, Eq, Debug)]
pub struct OrderedId(pub u32);
/// `DictionaryBuilder` for dictionary encoding.
///
/// It stores the different terms encounterred and assigns them a temporary value
/// we call unordered id.
///
/// Upon serialization, we will sort the ids and hence build a `UnorderedId -> Term ordinal`
/// mapping.
#[derive(Default)]
pub(crate) struct DictionaryBuilder {
dict: FnvHashMap<Vec<u8>, UnorderedId>,
}
impl DictionaryBuilder {
/// Get or allocate an unordered id.
/// (This ID is simply an auto-incremented id.)
pub fn get_or_allocate_id(&mut self, term: &[u8]) -> UnorderedId {
if let Some(term_id) = self.dict.get(term) {
return *term_id;
}
let new_id = UnorderedId(self.dict.len() as u32);
self.dict.insert(term.to_vec(), new_id);
new_id
}
/// Serialize the dictionary into an fst, and returns the
/// `UnorderedId -> TermOrdinal` map.
pub fn serialize<'a, W: io::Write + 'a>(&self, wrt: &mut W) -> io::Result<TermIdMapping> {
let mut terms: Vec<(&[u8], UnorderedId)> =
self.dict.iter().map(|(k, v)| (k.as_slice(), *v)).collect();
terms.sort_unstable_by_key(|(key, _)| *key);
// TODO Remove the allocation.
let mut unordered_to_ord: Vec<OrderedId> = vec![OrderedId(0u32); terms.len()];
let mut sstable_builder = sstable::VoidSSTable::writer(wrt);
for (ord, (key, unordered_id)) in terms.into_iter().enumerate() {
let ordered_id = OrderedId(ord as u32);
sstable_builder.insert(key, &())?;
unordered_to_ord[unordered_id.0 as usize] = ordered_id;
}
sstable_builder.finish()?;
Ok(TermIdMapping { unordered_to_ord })
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_dictionary_builder() {
let mut dictionary_builder = DictionaryBuilder::default();
let hello_uid = dictionary_builder.get_or_allocate_id(b"hello");
let happy_uid = dictionary_builder.get_or_allocate_id(b"happy");
let tax_uid = dictionary_builder.get_or_allocate_id(b"tax");
let mut buffer = Vec::new();
let id_mapping = dictionary_builder.serialize(&mut buffer).unwrap();
assert_eq!(id_mapping.to_ord(hello_uid), OrderedId(1));
assert_eq!(id_mapping.to_ord(happy_uid), OrderedId(0));
assert_eq!(id_mapping.to_ord(tax_uid), OrderedId(2));
}
}

View File

@@ -0,0 +1,258 @@
use std::io;
use std::net::Ipv6Addr;
use std::sync::Arc;
use common::file_slice::FileSlice;
use common::{DateTime, HasLen, OwnedBytes};
use crate::column::{BytesColumn, Column, StrColumn};
use crate::column_values::{monotonic_map_column, StrictlyMonotonicFn};
use crate::columnar::ColumnType;
use crate::{Cardinality, NumericalType};
#[derive(Clone)]
pub enum DynamicColumn {
Bool(Column<bool>),
I64(Column<i64>),
U64(Column<u64>),
F64(Column<f64>),
IpAddr(Column<Ipv6Addr>),
DateTime(Column<DateTime>),
Bytes(BytesColumn),
Str(StrColumn),
}
impl DynamicColumn {
pub fn get_cardinality(&self) -> Cardinality {
match self {
DynamicColumn::Bool(c) => c.get_cardinality(),
DynamicColumn::I64(c) => c.get_cardinality(),
DynamicColumn::U64(c) => c.get_cardinality(),
DynamicColumn::F64(c) => c.get_cardinality(),
DynamicColumn::IpAddr(c) => c.get_cardinality(),
DynamicColumn::DateTime(c) => c.get_cardinality(),
DynamicColumn::Bytes(c) => c.ords().get_cardinality(),
DynamicColumn::Str(c) => c.ords().get_cardinality(),
}
}
pub fn column_type(&self) -> ColumnType {
match self {
DynamicColumn::Bool(_) => ColumnType::Bool,
DynamicColumn::I64(_) => ColumnType::I64,
DynamicColumn::U64(_) => ColumnType::U64,
DynamicColumn::F64(_) => ColumnType::F64,
DynamicColumn::IpAddr(_) => ColumnType::IpAddr,
DynamicColumn::DateTime(_) => ColumnType::DateTime,
DynamicColumn::Bytes(_) => ColumnType::Bytes,
DynamicColumn::Str(_) => ColumnType::Str,
}
}
pub fn coerce_numerical(self, target_numerical_type: NumericalType) -> Option<Self> {
match target_numerical_type {
NumericalType::I64 => self.coerce_to_i64(),
NumericalType::U64 => self.coerce_to_u64(),
NumericalType::F64 => self.coerce_to_f64(),
}
}
pub fn is_numerical(&self) -> bool {
self.column_type().numerical_type().is_some()
}
pub fn is_f64(&self) -> bool {
self.column_type().numerical_type() == Some(NumericalType::F64)
}
pub fn is_i64(&self) -> bool {
self.column_type().numerical_type() == Some(NumericalType::I64)
}
pub fn is_u64(&self) -> bool {
self.column_type().numerical_type() == Some(NumericalType::U64)
}
fn coerce_to_f64(self) -> Option<DynamicColumn> {
match self {
DynamicColumn::I64(column) => Some(DynamicColumn::F64(Column {
idx: column.idx,
values: Arc::new(monotonic_map_column(column.values, MapI64ToF64)),
})),
DynamicColumn::U64(column) => Some(DynamicColumn::F64(Column {
idx: column.idx,
values: Arc::new(monotonic_map_column(column.values, MapU64ToF64)),
})),
DynamicColumn::F64(_) => Some(self),
_ => None,
}
}
fn coerce_to_i64(self) -> Option<DynamicColumn> {
match self {
DynamicColumn::U64(column) => {
if column.max_value() > i64::MAX as u64 {
return None;
}
Some(DynamicColumn::I64(Column {
idx: column.idx,
values: Arc::new(monotonic_map_column(column.values, MapU64ToI64)),
}))
}
DynamicColumn::I64(_) => Some(self),
_ => None,
}
}
fn coerce_to_u64(self) -> Option<DynamicColumn> {
match self {
DynamicColumn::I64(column) => {
if column.min_value() < 0 {
return None;
}
Some(DynamicColumn::U64(Column {
idx: column.idx,
values: Arc::new(monotonic_map_column(column.values, MapI64ToU64)),
}))
}
DynamicColumn::U64(_) => Some(self),
_ => None,
}
}
}
struct MapI64ToF64;
impl StrictlyMonotonicFn<i64, f64> for MapI64ToF64 {
#[inline(always)]
fn mapping(&self, inp: i64) -> f64 {
inp as f64
}
#[inline(always)]
fn inverse(&self, out: f64) -> i64 {
out as i64
}
}
struct MapU64ToF64;
impl StrictlyMonotonicFn<u64, f64> for MapU64ToF64 {
#[inline(always)]
fn mapping(&self, inp: u64) -> f64 {
inp as f64
}
#[inline(always)]
fn inverse(&self, out: f64) -> u64 {
out as u64
}
}
struct MapU64ToI64;
impl StrictlyMonotonicFn<u64, i64> for MapU64ToI64 {
#[inline(always)]
fn mapping(&self, inp: u64) -> i64 {
inp as i64
}
#[inline(always)]
fn inverse(&self, out: i64) -> u64 {
out as u64
}
}
struct MapI64ToU64;
impl StrictlyMonotonicFn<i64, u64> for MapI64ToU64 {
#[inline(always)]
fn mapping(&self, inp: i64) -> u64 {
inp as u64
}
#[inline(always)]
fn inverse(&self, out: u64) -> i64 {
out as i64
}
}
macro_rules! static_dynamic_conversions {
($typ:ty, $enum_name:ident) => {
impl From<DynamicColumn> for Option<$typ> {
fn from(dynamic_column: DynamicColumn) -> Option<$typ> {
if let DynamicColumn::$enum_name(col) = dynamic_column {
Some(col)
} else {
None
}
}
}
impl From<$typ> for DynamicColumn {
fn from(typed_column: $typ) -> Self {
DynamicColumn::$enum_name(typed_column)
}
}
};
}
static_dynamic_conversions!(Column<bool>, Bool);
static_dynamic_conversions!(Column<u64>, U64);
static_dynamic_conversions!(Column<i64>, I64);
static_dynamic_conversions!(Column<f64>, F64);
static_dynamic_conversions!(Column<DateTime>, DateTime);
static_dynamic_conversions!(StrColumn, Str);
static_dynamic_conversions!(BytesColumn, Bytes);
static_dynamic_conversions!(Column<Ipv6Addr>, IpAddr);
#[derive(Clone)]
pub struct DynamicColumnHandle {
pub(crate) file_slice: FileSlice,
pub(crate) column_type: ColumnType,
}
impl DynamicColumnHandle {
// TODO rename load
pub fn open(&self) -> io::Result<DynamicColumn> {
let column_bytes: OwnedBytes = self.file_slice.read_bytes()?;
self.open_internal(column_bytes)
}
#[doc(hidden)]
pub fn file_slice(&self) -> &FileSlice {
&self.file_slice
}
/// Returns the `u64` fast field reader reader associated with `fields` of types
/// Str, u64, i64, f64, or datetime.
///
/// If not, the fastfield reader will returns the u64-value associated with the original
/// FastValue.
pub fn open_u64_lenient(&self) -> io::Result<Option<Column<u64>>> {
let column_bytes = self.file_slice.read_bytes()?;
match self.column_type {
ColumnType::Str | ColumnType::Bytes => {
let column: BytesColumn = crate::column::open_column_bytes(column_bytes)?;
Ok(Some(column.term_ord_column))
}
ColumnType::Bool => Ok(None),
ColumnType::IpAddr => Ok(None),
ColumnType::I64 | ColumnType::U64 | ColumnType::F64 | ColumnType::DateTime => {
let column = crate::column::open_column_u64::<u64>(column_bytes)?;
Ok(Some(column))
}
}
}
fn open_internal(&self, column_bytes: OwnedBytes) -> io::Result<DynamicColumn> {
let dynamic_column: DynamicColumn = match self.column_type {
ColumnType::Bytes => crate::column::open_column_bytes(column_bytes)?.into(),
ColumnType::Str => crate::column::open_column_str(column_bytes)?.into(),
ColumnType::I64 => crate::column::open_column_u64::<i64>(column_bytes)?.into(),
ColumnType::U64 => crate::column::open_column_u64::<u64>(column_bytes)?.into(),
ColumnType::F64 => crate::column::open_column_u64::<f64>(column_bytes)?.into(),
ColumnType::Bool => crate::column::open_column_u64::<bool>(column_bytes)?.into(),
ColumnType::IpAddr => crate::column::open_column_u128::<Ipv6Addr>(column_bytes)?.into(),
ColumnType::DateTime => {
crate::column::open_column_u64::<DateTime>(column_bytes)?.into()
}
};
Ok(dynamic_column)
}
pub fn num_bytes(&self) -> usize {
self.file_slice.len()
}
pub fn column_type(&self) -> ColumnType {
self.column_type
}
}

19
columnar/src/iterable.rs Normal file
View File

@@ -0,0 +1,19 @@
use std::ops::Range;
pub trait Iterable<T = u64> {
fn boxed_iter(&self) -> Box<dyn Iterator<Item = T> + '_>;
}
impl<'a, T: Copy> Iterable<T> for &'a [T] {
fn boxed_iter(&self) -> Box<dyn Iterator<Item = T> + '_> {
Box::new(self.iter().copied())
}
}
impl<T: Copy> Iterable<T> for Range<T>
where Range<T>: Iterator<Item = T>
{
fn boxed_iter(&self) -> Box<dyn Iterator<Item = T> + '_> {
Box::new(self.clone())
}
}

96
columnar/src/lib.rs Normal file
View File

@@ -0,0 +1,96 @@
#![cfg_attr(all(feature = "unstable", test), feature(test))]
#[cfg(test)]
#[macro_use]
extern crate more_asserts;
#[cfg(all(test, feature = "unstable"))]
extern crate test;
use std::io;
mod column;
mod column_index;
pub mod column_values;
mod columnar;
mod dictionary;
mod dynamic_column;
mod iterable;
pub(crate) mod utils;
mod value;
pub use column::{BytesColumn, Column, StrColumn};
pub use column_index::ColumnIndex;
pub use column_values::{ColumnValues, MonotonicallyMappableToU128, MonotonicallyMappableToU64};
pub use columnar::{
merge_columnar, ColumnType, ColumnarReader, ColumnarWriter, HasAssociatedColumnType,
MergeRowOrder, ShuffleMergeOrder, StackMergeOrder,
};
use sstable::VoidSSTable;
pub use value::{NumericalType, NumericalValue};
pub use self::dynamic_column::{DynamicColumn, DynamicColumnHandle};
pub type RowId = u32;
pub type DocId = u32;
#[derive(Clone, Copy)]
pub struct RowAddr {
pub segment_ord: u32,
pub row_id: RowId,
}
pub use sstable::Dictionary;
pub type Streamer<'a> = sstable::Streamer<'a, VoidSSTable>;
pub use common::DateTime;
#[derive(Copy, Clone, Debug)]
pub struct InvalidData;
impl From<InvalidData> for io::Error {
fn from(_: InvalidData) -> Self {
io::Error::new(io::ErrorKind::InvalidData, "Invalid data")
}
}
/// Enum describing the number of values that can exist per document
/// (or per row if you will).
///
/// The cardinality must fit on 2 bits.
#[derive(Clone, Copy, Hash, Default, Debug, PartialEq, Eq, PartialOrd, Ord)]
#[repr(u8)]
pub enum Cardinality {
/// All documents contain exactly one value.
/// `Full` is the default for auto-detecting the Cardinality, since it is the most strict.
#[default]
Full = 0,
/// All documents contain at most one value.
Optional = 1,
/// All documents may contain any number of values.
Multivalued = 2,
}
impl Cardinality {
pub fn is_optional(&self) -> bool {
matches!(self, Cardinality::Optional)
}
pub fn is_multivalue(&self) -> bool {
matches!(self, Cardinality::Multivalued)
}
pub(crate) fn to_code(self) -> u8 {
self as u8
}
pub(crate) fn try_from_code(code: u8) -> Result<Cardinality, InvalidData> {
match code {
0 => Ok(Cardinality::Full),
1 => Ok(Cardinality::Optional),
2 => Ok(Cardinality::Multivalued),
_ => Err(InvalidData),
}
}
}
#[cfg(test)]
mod tests;

212
columnar/src/tests.rs Normal file
View File

@@ -0,0 +1,212 @@
use std::net::Ipv6Addr;
use crate::column_values::MonotonicallyMappableToU128;
use crate::columnar::ColumnType;
use crate::dynamic_column::{DynamicColumn, DynamicColumnHandle};
use crate::value::NumericalValue;
use crate::{Cardinality, ColumnarReader, ColumnarWriter};
#[test]
fn test_dataframe_writer_str() {
let mut dataframe_writer = ColumnarWriter::default();
dataframe_writer.record_str(1u32, "my_string", "hello");
dataframe_writer.record_str(3u32, "my_string", "helloeee");
let mut buffer: Vec<u8> = Vec::new();
dataframe_writer.serialize(5, None, &mut buffer).unwrap();
let columnar = ColumnarReader::open(buffer).unwrap();
assert_eq!(columnar.num_columns(), 1);
let cols: Vec<DynamicColumnHandle> = columnar.read_columns("my_string").unwrap();
assert_eq!(cols.len(), 1);
assert_eq!(cols[0].num_bytes(), 158);
}
#[test]
fn test_dataframe_writer_bytes() {
let mut dataframe_writer = ColumnarWriter::default();
dataframe_writer.record_bytes(1u32, "my_string", b"hello");
dataframe_writer.record_bytes(3u32, "my_string", b"helloeee");
let mut buffer: Vec<u8> = Vec::new();
dataframe_writer.serialize(5, None, &mut buffer).unwrap();
let columnar = ColumnarReader::open(buffer).unwrap();
assert_eq!(columnar.num_columns(), 1);
let cols: Vec<DynamicColumnHandle> = columnar.read_columns("my_string").unwrap();
assert_eq!(cols.len(), 1);
assert_eq!(cols[0].num_bytes(), 158);
}
#[test]
fn test_dataframe_writer_bool() {
let mut dataframe_writer = ColumnarWriter::default();
dataframe_writer.record_bool(1u32, "bool.value", false);
dataframe_writer.record_bool(3u32, "bool.value", true);
let mut buffer: Vec<u8> = Vec::new();
dataframe_writer.serialize(5, None, &mut buffer).unwrap();
let columnar = ColumnarReader::open(buffer).unwrap();
assert_eq!(columnar.num_columns(), 1);
let cols: Vec<DynamicColumnHandle> = columnar.read_columns("bool.value").unwrap();
assert_eq!(cols.len(), 1);
assert_eq!(cols[0].num_bytes(), 22);
assert_eq!(cols[0].column_type(), ColumnType::Bool);
let dyn_bool_col = cols[0].open().unwrap();
let DynamicColumn::Bool(bool_col) = dyn_bool_col else { panic!(); };
let vals: Vec<Option<bool>> = (0..5).map(|row_id| bool_col.first(row_id)).collect();
assert_eq!(&vals, &[None, Some(false), None, Some(true), None,]);
}
#[test]
fn test_dataframe_writer_u64_multivalued() {
let mut dataframe_writer = ColumnarWriter::default();
dataframe_writer.record_numerical(2u32, "divisor", 2u64);
dataframe_writer.record_numerical(3u32, "divisor", 3u64);
dataframe_writer.record_numerical(4u32, "divisor", 2u64);
dataframe_writer.record_numerical(5u32, "divisor", 5u64);
dataframe_writer.record_numerical(6u32, "divisor", 2u64);
dataframe_writer.record_numerical(6u32, "divisor", 3u64);
let mut buffer: Vec<u8> = Vec::new();
dataframe_writer.serialize(7, None, &mut buffer).unwrap();
let columnar = ColumnarReader::open(buffer).unwrap();
assert_eq!(columnar.num_columns(), 1);
let cols: Vec<DynamicColumnHandle> = columnar.read_columns("divisor").unwrap();
assert_eq!(cols.len(), 1);
assert_eq!(cols[0].num_bytes(), 29);
let dyn_i64_col = cols[0].open().unwrap();
let DynamicColumn::I64(divisor_col) = dyn_i64_col else { panic!(); };
assert_eq!(
divisor_col.get_cardinality(),
crate::Cardinality::Multivalued
);
assert_eq!(divisor_col.num_docs(), 7);
}
#[test]
fn test_dataframe_writer_ip_addr() {
let mut dataframe_writer = ColumnarWriter::default();
dataframe_writer.record_ip_addr(1, "ip_addr", Ipv6Addr::from_u128(1001));
dataframe_writer.record_ip_addr(3, "ip_addr", Ipv6Addr::from_u128(1050));
let mut buffer: Vec<u8> = Vec::new();
dataframe_writer.serialize(5, None, &mut buffer).unwrap();
let columnar = ColumnarReader::open(buffer).unwrap();
assert_eq!(columnar.num_columns(), 1);
let cols: Vec<DynamicColumnHandle> = columnar.read_columns("ip_addr").unwrap();
assert_eq!(cols.len(), 1);
assert_eq!(cols[0].num_bytes(), 42);
assert_eq!(cols[0].column_type(), ColumnType::IpAddr);
let dyn_bool_col = cols[0].open().unwrap();
let DynamicColumn::IpAddr(ip_col) = dyn_bool_col else { panic!(); };
let vals: Vec<Option<Ipv6Addr>> = (0..5).map(|row_id| ip_col.first(row_id)).collect();
assert_eq!(
&vals,
&[
None,
Some(Ipv6Addr::from_u128(1001)),
None,
Some(Ipv6Addr::from_u128(1050)),
None,
]
);
}
#[test]
fn test_dataframe_writer_numerical() {
let mut dataframe_writer = ColumnarWriter::default();
dataframe_writer.record_numerical(1u32, "srical.value", NumericalValue::U64(12u64));
dataframe_writer.record_numerical(2u32, "srical.value", NumericalValue::U64(13u64));
dataframe_writer.record_numerical(4u32, "srical.value", NumericalValue::U64(15u64));
let mut buffer: Vec<u8> = Vec::new();
dataframe_writer.serialize(6, None, &mut buffer).unwrap();
let columnar = ColumnarReader::open(buffer).unwrap();
assert_eq!(columnar.num_columns(), 1);
let cols: Vec<DynamicColumnHandle> = columnar.read_columns("srical.value").unwrap();
assert_eq!(cols.len(), 1);
// Right now this 31 bytes are spent as follows
//
// - header 14 bytes
// - vals 8 //< due to padding? could have been 1byte?.
// - null footer 6 bytes
assert_eq!(cols[0].num_bytes(), 33);
let column = cols[0].open().unwrap();
let DynamicColumn::I64(column_i64) = column else { panic!(); };
assert_eq!(column_i64.idx.get_cardinality(), Cardinality::Optional);
assert_eq!(column_i64.first(0), None);
assert_eq!(column_i64.first(1), Some(12i64));
assert_eq!(column_i64.first(2), Some(13i64));
assert_eq!(column_i64.first(3), None);
assert_eq!(column_i64.first(4), Some(15i64));
assert_eq!(column_i64.first(5), None);
assert_eq!(column_i64.first(6), None); //< we can change the spec for that one.
}
#[test]
fn test_dictionary_encoded_str() {
let mut buffer = Vec::new();
let mut columnar_writer = ColumnarWriter::default();
columnar_writer.record_str(1, "my.column", "a");
columnar_writer.record_str(3, "my.column", "c");
columnar_writer.record_str(3, "my.column2", "different_column!");
columnar_writer.record_str(4, "my.column", "b");
columnar_writer.serialize(5, None, &mut buffer).unwrap();
let columnar_reader = ColumnarReader::open(buffer).unwrap();
assert_eq!(columnar_reader.num_columns(), 2);
let col_handles = columnar_reader.read_columns("my.column").unwrap();
assert_eq!(col_handles.len(), 1);
let DynamicColumn::Str(str_col) = col_handles[0].open().unwrap() else { panic!(); };
let index: Vec<Option<u64>> = (0..5).map(|row_id| str_col.ords().first(row_id)).collect();
assert_eq!(index, &[None, Some(0), None, Some(2), Some(1)]);
assert_eq!(str_col.num_rows(), 5);
let mut term_buffer = String::new();
let term_ords = str_col.ords();
assert_eq!(term_ords.first(0), None);
assert_eq!(term_ords.first(1), Some(0));
str_col.ord_to_str(0u64, &mut term_buffer).unwrap();
assert_eq!(term_buffer, "a");
assert_eq!(term_ords.first(2), None);
assert_eq!(term_ords.first(3), Some(2));
str_col.ord_to_str(2u64, &mut term_buffer).unwrap();
assert_eq!(term_buffer, "c");
assert_eq!(term_ords.first(4), Some(1));
str_col.ord_to_str(1u64, &mut term_buffer).unwrap();
assert_eq!(term_buffer, "b");
}
#[test]
fn test_dictionary_encoded_bytes() {
let mut buffer = Vec::new();
let mut columnar_writer = ColumnarWriter::default();
columnar_writer.record_bytes(1, "my.column", b"a");
columnar_writer.record_bytes(3, "my.column", b"c");
columnar_writer.record_bytes(3, "my.column2", b"different_column!");
columnar_writer.record_bytes(4, "my.column", b"b");
columnar_writer.serialize(5, None, &mut buffer).unwrap();
let columnar_reader = ColumnarReader::open(buffer).unwrap();
assert_eq!(columnar_reader.num_columns(), 2);
let col_handles = columnar_reader.read_columns("my.column").unwrap();
assert_eq!(col_handles.len(), 1);
let DynamicColumn::Bytes(bytes_col) = col_handles[0].open().unwrap() else { panic!(); };
let index: Vec<Option<u64>> = (0..5)
.map(|row_id| bytes_col.ords().first(row_id))
.collect();
assert_eq!(index, &[None, Some(0), None, Some(2), Some(1)]);
assert_eq!(bytes_col.num_rows(), 5);
let mut term_buffer = Vec::new();
let term_ords = bytes_col.ords();
assert_eq!(term_ords.first(0), None);
assert_eq!(term_ords.first(1), Some(0));
bytes_col
.dictionary
.ord_to_term(0u64, &mut term_buffer)
.unwrap();
assert_eq!(term_buffer, b"a");
assert_eq!(term_ords.first(2), None);
assert_eq!(term_ords.first(3), Some(2));
bytes_col
.dictionary
.ord_to_term(2u64, &mut term_buffer)
.unwrap();
assert_eq!(term_buffer, b"c");
assert_eq!(term_ords.first(4), Some(1));
bytes_col
.dictionary
.ord_to_term(1u64, &mut term_buffer)
.unwrap();
assert_eq!(term_buffer, b"b");
}

76
columnar/src/utils.rs Normal file
View File

@@ -0,0 +1,76 @@
const fn compute_mask(num_bits: u8) -> u8 {
if num_bits == 8 {
u8::MAX
} else {
(1u8 << num_bits) - 1
}
}
#[inline(always)]
#[must_use]
pub(crate) fn select_bits<const START: u8, const END: u8>(code: u8) -> u8 {
assert!(START <= END);
assert!(END <= 8);
let num_bits: u8 = END - START;
let mask: u8 = compute_mask(num_bits);
(code >> START) & mask
}
#[inline(always)]
#[must_use]
pub(crate) fn place_bits<const START: u8, const END: u8>(code: u8) -> u8 {
assert!(START <= END);
assert!(END <= 8);
let num_bits: u8 = END - START;
let mask: u8 = compute_mask(num_bits);
assert!(code <= mask);
code << START
}
/// Pop-front one bytes from a slice of bytes.
#[inline(always)]
pub fn pop_first_byte(bytes: &mut &[u8]) -> Option<u8> {
if bytes.is_empty() {
return None;
}
let first_byte = bytes[0];
*bytes = &bytes[1..];
Some(first_byte)
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_select_bits() {
assert_eq!(255u8, select_bits::<0, 8>(255u8));
assert_eq!(0u8, select_bits::<0, 0>(255u8));
assert_eq!(8u8, select_bits::<0, 4>(8u8));
assert_eq!(4u8, select_bits::<1, 4>(8u8));
assert_eq!(0u8, select_bits::<1, 3>(8u8));
}
#[test]
fn test_place_bits() {
assert_eq!(255u8, place_bits::<0, 8>(255u8));
assert_eq!(4u8, place_bits::<2, 3>(1u8));
assert_eq!(0u8, place_bits::<2, 2>(0u8));
}
#[test]
#[should_panic]
fn test_place_bits_overflows() {
let _ = place_bits::<1, 4>(8u8);
}
#[test]
fn test_pop_first_byte() {
let mut cursor: &[u8] = &b"abcd"[..];
assert_eq!(pop_first_byte(&mut cursor), Some(b'a'));
assert_eq!(pop_first_byte(&mut cursor), Some(b'b'));
assert_eq!(pop_first_byte(&mut cursor), Some(b'c'));
assert_eq!(pop_first_byte(&mut cursor), Some(b'd'));
assert_eq!(pop_first_byte(&mut cursor), None);
}
}

131
columnar/src/value.rs Normal file
View File

@@ -0,0 +1,131 @@
use common::DateTime;
use crate::InvalidData;
#[derive(Copy, Clone, PartialEq, Debug)]
pub enum NumericalValue {
I64(i64),
U64(u64),
F64(f64),
}
impl NumericalValue {
pub fn numerical_type(&self) -> NumericalType {
match self {
NumericalValue::I64(_) => NumericalType::I64,
NumericalValue::U64(_) => NumericalType::U64,
NumericalValue::F64(_) => NumericalType::F64,
}
}
}
impl From<u64> for NumericalValue {
fn from(val: u64) -> NumericalValue {
NumericalValue::U64(val)
}
}
impl From<i64> for NumericalValue {
fn from(val: i64) -> Self {
NumericalValue::I64(val)
}
}
impl From<f64> for NumericalValue {
fn from(val: f64) -> Self {
NumericalValue::F64(val)
}
}
#[derive(Clone, Copy, Debug, Default, Hash, Eq, PartialEq)]
#[repr(u8)]
pub enum NumericalType {
#[default]
I64 = 0,
U64 = 1,
F64 = 2,
}
impl NumericalType {
pub fn to_code(self) -> u8 {
self as u8
}
pub fn try_from_code(code: u8) -> Result<NumericalType, InvalidData> {
match code {
0 => Ok(NumericalType::I64),
1 => Ok(NumericalType::U64),
2 => Ok(NumericalType::F64),
_ => Err(InvalidData),
}
}
}
/// We voluntarily avoid using `Into` here to keep this
/// implementation quirk as private as possible.
///
/// # Panics
/// This coercion trait actually panics if it is used
/// to convert a loose types to a stricter type.
///
/// The level is strictness is somewhat arbitrary.
/// - i64
/// - u64
/// - f64.
pub(crate) trait Coerce {
fn coerce(numerical_value: NumericalValue) -> Self;
}
impl Coerce for i64 {
fn coerce(value: NumericalValue) -> Self {
match value {
NumericalValue::I64(val) => val,
NumericalValue::U64(val) => val as i64,
NumericalValue::F64(_) => unreachable!(),
}
}
}
impl Coerce for u64 {
fn coerce(value: NumericalValue) -> Self {
match value {
NumericalValue::I64(val) => val as u64,
NumericalValue::U64(val) => val,
NumericalValue::F64(_) => unreachable!(),
}
}
}
impl Coerce for f64 {
fn coerce(value: NumericalValue) -> Self {
match value {
NumericalValue::I64(val) => val as f64,
NumericalValue::U64(val) => val as f64,
NumericalValue::F64(val) => val,
}
}
}
impl Coerce for DateTime {
fn coerce(value: NumericalValue) -> Self {
let timestamp_micros = i64::coerce(value);
DateTime::from_timestamp_micros(timestamp_micros)
}
}
#[cfg(test)]
mod tests {
use super::NumericalType;
#[test]
fn test_numerical_type_code() {
let mut num_numerical_type = 0;
for code in u8::MIN..=u8::MAX {
if let Ok(numerical_type) = NumericalType::try_from_code(code) {
assert_eq!(numerical_type.to_code(), code);
num_numerical_type += 1;
}
}
assert_eq!(num_numerical_type, 3);
}
}

View File

@@ -1,16 +1,23 @@
[package]
name = "tantivy-common"
version = "0.1.0"
version = "0.5.0"
authors = ["Paul Masurel <paul@quickwit.io>", "Pascal Seitz <pascal@quickwit.io>"]
license = "MIT"
edition = "2018"
edition = "2021"
description = "common traits and utility functions used by multiple tantivy subcrates"
documentation = "https://docs.rs/tantivy_common/"
homepage = "https://github.com/quickwit-oss/tantivy"
repository = "https://github.com/quickwit-oss/tantivy"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
byteorder = "1.4.3"
ownedbytes = { version="0.2", path="../ownedbytes" }
ownedbytes = { version= "0.5", path="../ownedbytes" }
async-trait = "0.1"
time = { version = "0.3.10", features = ["serde-well-known"] }
serde = { version = "1.0.136", features = ["derive"] }
[dev-dependencies]
proptest = "1.0.0"

View File

@@ -1,8 +1,8 @@
use ownedbytes::OwnedBytes;
use std::convert::TryInto;
use std::io::Write;
use std::u64;
use std::{fmt, io};
use std::{fmt, io, u64};
use ownedbytes::OwnedBytes;
#[derive(Clone, Copy, Eq, PartialEq)]
pub struct TinySet(u64);
@@ -151,7 +151,7 @@ impl TinySet {
if self.is_empty() {
None
} else {
let lowest = self.0.trailing_zeros() as u32;
let lowest = self.0.trailing_zeros();
self.0 ^= TinySet::singleton(lowest).0;
Some(lowest)
}
@@ -187,7 +187,6 @@ fn num_buckets(max_val: u32) -> u32 {
impl BitSet {
/// serialize a `BitSet`.
///
pub fn serialize<T: Write>(&self, writer: &mut T) -> io::Result<()> {
writer.write_all(self.max_value.to_le_bytes().as_ref())?;
for tinyset in self.tinysets.iter().cloned() {
@@ -260,11 +259,7 @@ impl BitSet {
// we do not check saturated els.
let higher = el / 64u32;
let lower = el % 64u32;
self.len += if self.tinysets[higher as usize].insert_mut(lower) {
1
} else {
0
};
self.len += u64::from(self.tinysets[higher as usize].insert_mut(lower));
}
/// Inserts an element in the `BitSet`
@@ -273,11 +268,7 @@ impl BitSet {
// we do not check saturated els.
let higher = el / 64u32;
let lower = el % 64u32;
self.len -= if self.tinysets[higher as usize].remove_mut(lower) {
1
} else {
0
};
self.len -= u64::from(self.tinysets[higher as usize].remove_mut(lower));
}
/// Returns true iff the elements is in the `BitSet`.
@@ -286,7 +277,7 @@ impl BitSet {
self.tinyset(el / 64u32).contains(el % 64)
}
/// Returns the first non-empty `TinySet` associated to a bucket lower
/// Returns the first non-empty `TinySet` associated with a bucket lower
/// or greater than bucket.
///
/// Reminder: the tiny set with the bucket `bucket`, represents the
@@ -353,7 +344,6 @@ impl ReadOnlyBitSet {
}
/// Iterate the tinyset on the fly from serialized data.
///
#[inline]
fn iter_tinysets(&self) -> impl Iterator<Item = TinySet> + '_ {
self.data.chunks_exact(8).map(move |chunk| {
@@ -363,7 +353,6 @@ impl ReadOnlyBitSet {
}
/// Iterate over the positions of the elements.
///
#[inline]
pub fn iter(&self) -> impl Iterator<Item = u32> + '_ {
self.iter_tinysets()
@@ -415,14 +404,14 @@ impl<'a> From<&'a BitSet> for ReadOnlyBitSet {
#[cfg(test)]
mod tests {
use super::BitSet;
use super::ReadOnlyBitSet;
use super::TinySet;
use std::collections::HashSet;
use ownedbytes::OwnedBytes;
use rand::distributions::Bernoulli;
use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};
use std::collections::HashSet;
use super::{BitSet, ReadOnlyBitSet, TinySet};
#[test]
fn test_read_serialized_bitset_full_multi() {
@@ -432,7 +421,7 @@ mod tests {
bitset.serialize(&mut out).unwrap();
let bitset = ReadOnlyBitSet::open(OwnedBytes::new(out));
assert_eq!(bitset.len() as usize, i as usize);
assert_eq!(bitset.len(), i as usize);
}
}
@@ -443,7 +432,7 @@ mod tests {
bitset.serialize(&mut out).unwrap();
let bitset = ReadOnlyBitSet::open(OwnedBytes::new(out));
assert_eq!(bitset.len() as usize, 64 as usize);
assert_eq!(bitset.len(), 64);
}
#[test]
@@ -710,10 +699,10 @@ mod tests {
#[cfg(all(test, feature = "unstable"))]
mod bench {
use super::BitSet;
use super::TinySet;
use test;
use super::{BitSet, TinySet};
#[bench]
fn bench_tinyset_pop(b: &mut test::Bencher) {
b.iter(|| {

136
common/src/datetime.rs Normal file
View File

@@ -0,0 +1,136 @@
use std::fmt;
use serde::{Deserialize, Serialize};
use time::format_description::well_known::Rfc3339;
use time::{OffsetDateTime, PrimitiveDateTime, UtcOffset};
/// DateTime Precision
#[derive(
Clone, Copy, Debug, Hash, PartialEq, Eq, PartialOrd, Ord, Serialize, Deserialize, Default,
)]
#[serde(rename_all = "lowercase")]
pub enum DatePrecision {
/// Seconds precision
#[default]
Seconds,
/// Milli-seconds precision.
Milliseconds,
/// Micro-seconds precision.
Microseconds,
}
/// A date/time value with microsecond precision.
///
/// This timestamp does not carry any explicit time zone information.
/// Users are responsible for applying the provided conversion
/// functions consistently. Internally the time zone is assumed
/// to be UTC, which is also used implicitly for JSON serialization.
///
/// All constructors and conversions are provided as explicit
/// functions and not by implementing any `From`/`Into` traits
/// to prevent unintended usage.
#[derive(Clone, Default, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct DateTime {
// Timestamp in microseconds.
pub(crate) timestamp_micros: i64,
}
impl DateTime {
/// Create new from UNIX timestamp in seconds
pub const fn from_timestamp_secs(seconds: i64) -> Self {
Self {
timestamp_micros: seconds * 1_000_000,
}
}
/// Create new from UNIX timestamp in milliseconds
pub const fn from_timestamp_millis(milliseconds: i64) -> Self {
Self {
timestamp_micros: milliseconds * 1_000,
}
}
/// Create new from UNIX timestamp in microseconds.
pub const fn from_timestamp_micros(microseconds: i64) -> Self {
Self {
timestamp_micros: microseconds,
}
}
/// Create new from `OffsetDateTime`
///
/// The given date/time is converted to UTC and the actual
/// time zone is discarded.
pub const fn from_utc(dt: OffsetDateTime) -> Self {
let timestamp_micros = dt.unix_timestamp() * 1_000_000 + dt.microsecond() as i64;
Self { timestamp_micros }
}
/// Create new from `PrimitiveDateTime`
///
/// Implicitly assumes that the given date/time is in UTC!
/// Otherwise the original value must only be reobtained with
/// [`Self::into_primitive()`].
pub fn from_primitive(dt: PrimitiveDateTime) -> Self {
Self::from_utc(dt.assume_utc())
}
/// Convert to UNIX timestamp in seconds.
pub const fn into_timestamp_secs(self) -> i64 {
self.timestamp_micros / 1_000_000
}
/// Convert to UNIX timestamp in milliseconds.
pub const fn into_timestamp_millis(self) -> i64 {
self.timestamp_micros / 1_000
}
/// Convert to UNIX timestamp in microseconds.
pub const fn into_timestamp_micros(self) -> i64 {
self.timestamp_micros
}
/// Convert to UTC `OffsetDateTime`
pub fn into_utc(self) -> OffsetDateTime {
let timestamp_nanos = self.timestamp_micros as i128 * 1000;
let utc_datetime = OffsetDateTime::from_unix_timestamp_nanos(timestamp_nanos)
.expect("valid UNIX timestamp");
debug_assert_eq!(UtcOffset::UTC, utc_datetime.offset());
utc_datetime
}
/// Convert to `OffsetDateTime` with the given time zone
pub fn into_offset(self, offset: UtcOffset) -> OffsetDateTime {
self.into_utc().to_offset(offset)
}
/// Convert to `PrimitiveDateTime` without any time zone
///
/// The value should have been constructed with [`Self::from_primitive()`].
/// Otherwise the time zone is implicitly assumed to be UTC.
pub fn into_primitive(self) -> PrimitiveDateTime {
let utc_datetime = self.into_utc();
// Discard the UTC time zone offset
debug_assert_eq!(UtcOffset::UTC, utc_datetime.offset());
PrimitiveDateTime::new(utc_datetime.date(), utc_datetime.time())
}
/// Truncates the microseconds value to the corresponding precision.
pub fn truncate(self, precision: DatePrecision) -> Self {
let truncated_timestamp_micros = match precision {
DatePrecision::Seconds => (self.timestamp_micros / 1_000_000) * 1_000_000,
DatePrecision::Milliseconds => (self.timestamp_micros / 1_000) * 1_000,
DatePrecision::Microseconds => self.timestamp_micros,
};
Self {
timestamp_micros: truncated_timestamp_micros,
}
}
}
impl fmt::Debug for DateTime {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
let utc_rfc3339 = self.into_utc().format(&Rfc3339).map_err(|_| fmt::Error)?;
f.write_str(&utc_rfc3339)
}
}

View File

@@ -1,50 +1,60 @@
use stable_deref_trait::StableDeref;
use std::ops::{Deref, Range, RangeBounds};
use std::sync::Arc;
use std::{fmt, io};
use crate::directory::OwnedBytes;
use common::HasLen;
use std::fmt;
use std::ops::Range;
use std::sync::{Arc, Weak};
use std::{io, ops::Deref};
use async_trait::async_trait;
use ownedbytes::{OwnedBytes, StableDeref};
pub type ArcBytes = Arc<dyn Deref<Target = [u8]> + Send + Sync + 'static>;
pub type WeakArcBytes = Weak<dyn Deref<Target = [u8]> + Send + Sync + 'static>;
use crate::HasLen;
/// Objects that represents files sections in tantivy.
///
/// By contract, whatever happens to the directory file, as long as a FileHandle
/// is alive, the data associated with it cannot be altered or destroyed.
///
/// The underlying behavior is therefore specific to the `Directory` that created it.
/// Despite its name, a `FileSlice` may or may not directly map to an actual file
/// The underlying behavior is therefore specific to the `Directory` that
/// created it. Despite its name, a [`FileSlice`] may or may not directly map to an actual file
/// on the filesystem.
#[async_trait]
pub trait FileHandle: 'static + Send + Sync + HasLen + fmt::Debug {
/// Reads a slice of bytes.
///
/// This method may panic if the range requested is invalid.
fn read_bytes(&self, range: Range<usize>) -> io::Result<OwnedBytes>;
#[doc(hidden)]
async fn read_bytes_async(&self, _byte_range: Range<usize>) -> io::Result<OwnedBytes> {
Err(io::Error::new(
io::ErrorKind::Unsupported,
"Async read is not supported.",
))
}
}
#[async_trait]
impl FileHandle for &'static [u8] {
fn read_bytes(&self, range: Range<usize>) -> io::Result<OwnedBytes> {
let bytes = &self[range];
Ok(OwnedBytes::new(bytes))
}
async fn read_bytes_async(&self, byte_range: Range<usize>) -> io::Result<OwnedBytes> {
Ok(self.read_bytes(byte_range)?)
}
}
impl<B> From<B> for FileSlice
where
B: StableDeref + Deref<Target = [u8]> + 'static + Send + Sync,
where B: StableDeref + Deref<Target = [u8]> + 'static + Send + Sync
{
fn from(bytes: B) -> FileSlice {
FileSlice::new(Box::new(OwnedBytes::new(bytes)))
FileSlice::new(Arc::new(OwnedBytes::new(bytes)))
}
}
/// Logical slice of read only file in tantivy.
///
/// It can be cloned and sliced cheaply.
///
#[derive(Clone)]
pub struct FileSlice {
data: Arc<dyn FileHandle>,
@@ -57,9 +67,37 @@ impl fmt::Debug for FileSlice {
}
}
/// Takes a range, a `RangeBounds` object, and returns
/// a `Range` that corresponds to the relative application of the
/// `RangeBounds` object to the original `Range`.
///
/// For instance, combine_ranges(`[2..11)`, `[5..7]`) returns `[7..10]`
/// as it reads, what is the sub-range that starts at the 5 element of
/// `[2..11)` and ends at the 9th element included.
///
/// This function panics, if the result would suggest something outside
/// of the bounds of the original range.
fn combine_ranges<R: RangeBounds<usize>>(orig_range: Range<usize>, rel_range: R) -> Range<usize> {
let start: usize = orig_range.start
+ match rel_range.start_bound().cloned() {
std::ops::Bound::Included(rel_start) => rel_start,
std::ops::Bound::Excluded(rel_start) => rel_start + 1,
std::ops::Bound::Unbounded => 0,
};
assert!(start <= orig_range.end);
let end: usize = match rel_range.end_bound().cloned() {
std::ops::Bound::Included(rel_end) => orig_range.start + rel_end + 1,
std::ops::Bound::Excluded(rel_end) => orig_range.start + rel_end,
std::ops::Bound::Unbounded => orig_range.end,
};
assert!(end >= start);
assert!(end <= orig_range.end);
start..end
}
impl FileSlice {
/// Wraps a FileHandle.
pub fn new(file_handle: Box<dyn FileHandle>) -> Self {
pub fn new(file_handle: Arc<dyn FileHandle>) -> Self {
let num_bytes = file_handle.len();
FileSlice::new_with_num_bytes(file_handle, num_bytes)
}
@@ -67,9 +105,9 @@ impl FileSlice {
/// Wraps a FileHandle.
#[doc(hidden)]
#[must_use]
pub fn new_with_num_bytes(file_handle: Box<dyn FileHandle>, num_bytes: usize) -> Self {
pub fn new_with_num_bytes(file_handle: Arc<dyn FileHandle>, num_bytes: usize) -> Self {
FileSlice {
data: Arc::from(file_handle),
data: file_handle,
range: 0..num_bytes,
}
}
@@ -79,11 +117,12 @@ impl FileSlice {
/// # Panics
///
/// Panics if `byte_range.end` exceeds the filesize.
pub fn slice(&self, byte_range: Range<usize>) -> FileSlice {
assert!(byte_range.end <= self.len());
#[must_use]
#[inline]
pub fn slice<R: RangeBounds<usize>>(&self, byte_range: R) -> FileSlice {
FileSlice {
data: self.data.clone(),
range: self.range.start + byte_range.start..self.range.start + byte_range.end,
range: combine_ranges(self.range.clone(), byte_range),
}
}
@@ -95,7 +134,7 @@ impl FileSlice {
/// Returns a `OwnedBytes` with all of the data in the `FileSlice`.
///
/// The behavior is strongly dependant on the implementation of the underlying
/// The behavior is strongly dependent on the implementation of the underlying
/// `Directory` and the `FileSliceTrait` it creates.
/// In particular, it is up to the `Directory` implementation
/// to handle caching if needed.
@@ -103,6 +142,11 @@ impl FileSlice {
self.data.read_bytes(self.range.clone())
}
#[doc(hidden)]
pub async fn read_bytes_async(&self) -> io::Result<OwnedBytes> {
self.data.read_bytes_async(self.range.clone()).await
}
/// Reads a specific slice of data.
///
/// This is equivalent to running `file_slice.slice(from, to).read_bytes()`.
@@ -117,6 +161,19 @@ impl FileSlice {
.read_bytes(self.range.start + range.start..self.range.start + range.end)
}
#[doc(hidden)]
pub async fn read_bytes_slice_async(&self, byte_range: Range<usize>) -> io::Result<OwnedBytes> {
assert!(
self.range.start + byte_range.end <= self.range.end,
"`to` exceeds the fileslice length"
);
self.data
.read_bytes_async(
self.range.start + byte_range.start..self.range.start + byte_range.end,
)
.await
}
/// Splits the FileSlice at the given offset and return two file slices.
/// `file_slice[..split_offset]` and `file_slice[split_offset..]`.
///
@@ -138,6 +195,7 @@ impl FileSlice {
/// boundary.
///
/// Equivalent to `.slice(from_offset, self.len())`
#[must_use]
pub fn slice_from(&self, from_offset: usize) -> FileSlice {
self.slice(from_offset..self.len())
}
@@ -145,6 +203,7 @@ impl FileSlice {
/// Returns a slice from the end.
///
/// Equivalent to `.slice(self.len() - from_offset, self.len())`
#[must_use]
pub fn slice_from_end(&self, from_offset: usize) -> FileSlice {
self.slice(self.len() - from_offset..self.len())
}
@@ -153,15 +212,21 @@ impl FileSlice {
/// boundary.
///
/// Equivalent to `.slice(0, to_offset)`
#[must_use]
pub fn slice_to(&self, to_offset: usize) -> FileSlice {
self.slice(0..to_offset)
}
}
#[async_trait]
impl FileHandle for FileSlice {
fn read_bytes(&self, range: Range<usize>) -> io::Result<OwnedBytes> {
self.read_bytes_slice(range)
}
async fn read_bytes_async(&self, byte_range: Range<usize>) -> io::Result<OwnedBytes> {
self.read_bytes_slice_async(byte_range).await
}
}
impl HasLen for FileSlice {
@@ -170,15 +235,30 @@ impl HasLen for FileSlice {
}
}
#[async_trait]
impl FileHandle for OwnedBytes {
fn read_bytes(&self, range: Range<usize>) -> io::Result<OwnedBytes> {
Ok(self.slice(range))
}
async fn read_bytes_async(&self, range: Range<usize>) -> io::Result<OwnedBytes> {
self.read_bytes(range)
}
}
#[cfg(test)]
mod tests {
use super::{FileHandle, FileSlice};
use common::HasLen;
use std::io;
use std::ops::Bound;
use std::sync::Arc;
use super::{FileHandle, FileSlice};
use crate::file_slice::combine_ranges;
use crate::HasLen;
#[test]
fn test_file_slice() -> io::Result<()> {
let file_slice = FileSlice::new(Box::new(b"abcdef".as_ref()));
let file_slice = FileSlice::new(Arc::new(b"abcdef".as_ref()));
assert_eq!(file_slice.len(), 6);
assert_eq!(file_slice.slice_from(2).read_bytes()?.as_slice(), b"cdef");
assert_eq!(file_slice.slice_to(2).read_bytes()?.as_slice(), b"ab");
@@ -222,7 +302,7 @@ mod tests {
#[test]
fn test_slice_simple_read() -> io::Result<()> {
let slice = FileSlice::new(Box::new(&b"abcdef"[..]));
let slice = FileSlice::new(Arc::new(&b"abcdef"[..]));
assert_eq!(slice.len(), 6);
assert_eq!(slice.read_bytes()?.as_ref(), b"abcdef");
assert_eq!(slice.slice(1..4).read_bytes()?.as_ref(), b"bcd");
@@ -231,7 +311,7 @@ mod tests {
#[test]
fn test_slice_read_slice() -> io::Result<()> {
let slice_deref = FileSlice::new(Box::new(&b"abcdef"[..]));
let slice_deref = FileSlice::new(Arc::new(&b"abcdef"[..]));
assert_eq!(slice_deref.read_bytes_slice(1..4)?.as_ref(), b"bcd");
Ok(())
}
@@ -239,10 +319,29 @@ mod tests {
#[test]
#[should_panic(expected = "end of requested range exceeds the fileslice length (10 > 6)")]
fn test_slice_read_slice_invalid_range_exceeds() {
let slice_deref = FileSlice::new(Box::new(&b"abcdef"[..]));
let slice_deref = FileSlice::new(Arc::new(&b"abcdef"[..]));
assert_eq!(
slice_deref.read_bytes_slice(0..10).unwrap().as_ref(),
b"bcd"
);
}
#[test]
fn test_combine_range() {
assert_eq!(combine_ranges(1..3, 0..1), 1..2);
assert_eq!(combine_ranges(1..3, 1..), 2..3);
assert_eq!(combine_ranges(1..4, ..2), 1..3);
assert_eq!(combine_ranges(3..10, 2..5), 5..8);
assert_eq!(combine_ranges(2..11, 5..=7), 7..10);
assert_eq!(
combine_ranges(2..11, (Bound::Excluded(5), Bound::Unbounded)),
8..11
);
}
#[test]
#[should_panic]
fn test_combine_range_panics() {
let _ = combine_ranges(3..5, 1..4);
}
}

166
common/src/group_by.rs Normal file
View File

@@ -0,0 +1,166 @@
use std::cell::RefCell;
use std::iter::Peekable;
use std::rc::Rc;
pub trait GroupByIteratorExtended: Iterator {
/// Return an `Iterator` that groups iterator elements. Consecutive elements that map to the
/// same key are assigned to the same group.
///
/// The returned Iterator item is `(K, impl Iterator)`, where Iterator are the items of the
/// group.
///
/// ```
/// use tantivy_common::GroupByIteratorExtended;
///
/// // group data into blocks of larger than zero or not.
/// let data: Vec<i32> = vec![1, 3, -2, -2, 1, 0, 1, 2];
/// // groups: |---->|------>|--------->|
///
/// let mut data_grouped = Vec::new();
/// // Note: group is an iterator
/// for (key, group) in data.into_iter().group_by(|val| *val >= 0) {
/// data_grouped.push((key, group.collect()));
/// }
/// assert_eq!(data_grouped, vec![(true, vec![1, 3]), (false, vec![-2, -2]), (true, vec![1, 0, 1, 2])]);
/// ```
fn group_by<K, F>(self, key: F) -> GroupByIterator<Self, F, K>
where
Self: Sized,
F: FnMut(&Self::Item) -> K,
K: PartialEq + Copy,
Self::Item: Copy,
{
GroupByIterator::new(self, key)
}
}
impl<I: Iterator> GroupByIteratorExtended for I {}
pub struct GroupByIterator<I, F, K: Copy>
where
I: Iterator,
F: FnMut(&I::Item) -> K,
{
// I really would like to avoid the Rc<RefCell>, but the Iterator is shared between
// `GroupByIterator` and `GroupIter`. In practice they are used consecutive and
// `GroupByIter` is finished before calling next on `GroupByIterator`. I'm not sure there
// is a solution with lifetimes for that, because we would need to enforce it in the usage
// somehow.
//
// One potential solution would be to replace the iterator approach with something similar.
inner: Rc<RefCell<GroupByShared<I, F, K>>>,
}
struct GroupByShared<I, F, K: Copy>
where
I: Iterator,
F: FnMut(&I::Item) -> K,
{
iter: Peekable<I>,
group_by_fn: F,
}
impl<I, F, K> GroupByIterator<I, F, K>
where
I: Iterator,
F: FnMut(&I::Item) -> K,
K: Copy,
{
fn new(inner: I, group_by_fn: F) -> Self {
let inner = GroupByShared {
iter: inner.peekable(),
group_by_fn,
};
Self {
inner: Rc::new(RefCell::new(inner)),
}
}
}
impl<I, F, K> Iterator for GroupByIterator<I, F, K>
where
I: Iterator,
I::Item: Copy,
F: FnMut(&I::Item) -> K,
K: Copy,
{
type Item = (K, GroupIterator<I, F, K>);
fn next(&mut self) -> Option<Self::Item> {
let mut inner = self.inner.borrow_mut();
let value = *inner.iter.peek()?;
let key = (inner.group_by_fn)(&value);
let inner = self.inner.clone();
let group_iter = GroupIterator {
inner,
group_key: key,
};
Some((key, group_iter))
}
}
pub struct GroupIterator<I, F, K: Copy>
where
I: Iterator,
F: FnMut(&I::Item) -> K,
{
inner: Rc<RefCell<GroupByShared<I, F, K>>>,
group_key: K,
}
impl<I, F, K: PartialEq + Copy> Iterator for GroupIterator<I, F, K>
where
I: Iterator,
I::Item: Copy,
F: FnMut(&I::Item) -> K,
{
type Item = I::Item;
fn next(&mut self) -> Option<Self::Item> {
let mut inner = self.inner.borrow_mut();
// peek if next value is in group
let peek_val = *inner.iter.peek()?;
if (inner.group_by_fn)(&peek_val) == self.group_key {
inner.iter.next()
} else {
None
}
}
}
#[cfg(test)]
mod tests {
use super::*;
fn group_by_collect<I: Iterator<Item = u32>>(iter: I) -> Vec<(I::Item, Vec<I::Item>)> {
iter.group_by(|val| val / 10)
.map(|(el, iter)| (el, iter.collect::<Vec<_>>()))
.collect::<Vec<_>>()
}
#[test]
fn group_by_two_groups() {
let vals = vec![1u32, 4, 15];
let grouped_vals = group_by_collect(vals.into_iter());
assert_eq!(grouped_vals, vec![(0, vec![1, 4]), (1, vec![15])]);
}
#[test]
fn group_by_test_empty() {
let vals = vec![];
let grouped_vals = group_by_collect(vals.into_iter());
assert_eq!(grouped_vals, vec![]);
}
#[test]
fn group_by_three_groups() {
let vals = vec![1u32, 4, 15, 1];
let grouped_vals = group_by_collect(vals.into_iter());
assert_eq!(
grouped_vals,
vec![(0, vec![1, 4]), (1, vec![15]), (0, vec![1])]
);
}
}

View File

@@ -5,13 +5,21 @@ use std::ops::Deref;
pub use byteorder::LittleEndian as Endianness;
mod bitset;
mod datetime;
pub mod file_slice;
mod group_by;
mod serialize;
mod vint;
mod writer;
pub use bitset::*;
pub use datetime::{DatePrecision, DateTime};
pub use group_by::GroupByIteratorExtended;
pub use ownedbytes::{OwnedBytes, StableDeref};
pub use serialize::{BinarySerializable, DeserializeFrom, FixedSize};
pub use vint::{read_u32_vint, read_u32_vint_no_advance, serialize_vint_u32, write_u32_vint, VInt};
pub use vint::{
deserialize_vint_u128, read_u32_vint, read_u32_vint_no_advance, serialize_vint_u128,
serialize_vint_u32, write_u32_vint, VInt, VIntU128,
};
pub use writer::{AntiCallToken, CountingWriter, TerminatingWrite};
/// Has length trait
@@ -52,13 +60,13 @@ const HIGHEST_BIT: u64 = 1 << 63;
/// to values over 2^63, and all values end up requiring 64 bits.
///
/// # See also
/// The [reverse mapping is `u64_to_i64`](./fn.u64_to_i64.html).
/// The reverse mapping is [`u64_to_i64()`].
#[inline]
pub fn i64_to_u64(val: i64) -> u64 {
(val as u64) ^ HIGHEST_BIT
}
/// Reverse the mapping given by [`i64_to_u64`](./fn.i64_to_u64.html).
/// Reverse the mapping given by [`i64_to_u64()`].
#[inline]
pub fn u64_to_i64(val: u64) -> i64 {
(val ^ HIGHEST_BIT) as i64
@@ -80,7 +88,7 @@ pub fn u64_to_i64(val: u64) -> i64 {
/// explains the mapping in a clear manner.
///
/// # See also
/// The [reverse mapping is `u64_to_f64`](./fn.u64_to_f64.html).
/// The reverse mapping is [`u64_to_f64()`].
#[inline]
pub fn f64_to_u64(val: f64) -> u64 {
let bits = val.to_bits();
@@ -91,7 +99,7 @@ pub fn f64_to_u64(val: f64) -> u64 {
}
}
/// Reverse the mapping given by [`i64_to_u64`](./fn.i64_to_u64.html).
/// Reverse the mapping given by [`f64_to_u64()`].
#[inline]
pub fn u64_to_f64(val: u64) -> f64 {
f64::from_bits(if val & HIGHEST_BIT != 0 {
@@ -101,13 +109,27 @@ pub fn u64_to_f64(val: u64) -> f64 {
})
}
/// Replaces a given byte in the `bytes` slice of bytes.
///
/// This function assumes that the needle is rarely contained in the bytes string
/// and offers a fast path if the needle is not present.
pub fn replace_in_place(needle: u8, replacement: u8, bytes: &mut [u8]) {
if !bytes.contains(&needle) {
return;
}
for b in bytes {
if *b == needle {
*b = replacement;
}
}
}
#[cfg(test)]
pub mod test {
use super::{f64_to_u64, i64_to_u64, u64_to_f64, u64_to_i64};
use super::{BinarySerializable, FixedSize};
use proptest::prelude::*;
use std::f64;
use super::{f64_to_u64, i64_to_u64, u64_to_f64, u64_to_i64, BinarySerializable, FixedSize};
fn test_i64_converter_helper(val: i64) {
assert_eq!(u64_to_i64(i64_to_u64(val)), val);
@@ -134,11 +156,11 @@ pub mod test {
#[test]
fn test_i64_converter() {
assert_eq!(i64_to_u64(i64::min_value()), u64::min_value());
assert_eq!(i64_to_u64(i64::max_value()), u64::max_value());
assert_eq!(i64_to_u64(i64::MIN), u64::MIN);
assert_eq!(i64_to_u64(i64::MAX), u64::MAX);
test_i64_converter_helper(0i64);
test_i64_converter_helper(i64::min_value());
test_i64_converter_helper(i64::max_value());
test_i64_converter_helper(i64::MIN);
test_i64_converter_helper(i64::MAX);
for i in -1000i64..1000i64 {
test_i64_converter_helper(i);
}
@@ -157,13 +179,29 @@ pub mod test {
#[test]
fn test_f64_order() {
assert!(!(f64_to_u64(f64::NEG_INFINITY)..f64_to_u64(f64::INFINITY))
.contains(&f64_to_u64(f64::NAN))); //nan is not a number
assert!(f64_to_u64(1.5) > f64_to_u64(1.0)); //same exponent, different mantissa
assert!(f64_to_u64(2.0) > f64_to_u64(1.0)); //same mantissa, different exponent
assert!(f64_to_u64(2.0) > f64_to_u64(1.5)); //different exponent and mantissa
.contains(&f64_to_u64(f64::NAN))); // nan is not a number
assert!(f64_to_u64(1.5) > f64_to_u64(1.0)); // same exponent, different mantissa
assert!(f64_to_u64(2.0) > f64_to_u64(1.0)); // same mantissa, different exponent
assert!(f64_to_u64(2.0) > f64_to_u64(1.5)); // different exponent and mantissa
assert!(f64_to_u64(1.0) > f64_to_u64(-1.0)); // pos > neg
assert!(f64_to_u64(-1.5) < f64_to_u64(-1.0));
assert!(f64_to_u64(-2.0) < f64_to_u64(1.0));
assert!(f64_to_u64(-2.0) < f64_to_u64(-1.5));
}
#[test]
fn test_replace_in_place() {
let test_aux = |before_replacement: &[u8], expected: &[u8]| {
let mut bytes: Vec<u8> = before_replacement.to_vec();
super::replace_in_place(b'b', b'c', &mut bytes);
assert_eq!(&bytes[..], expected);
};
test_aux(b"", b"");
test_aux(b"b", b"c");
test_aux(b"baaa", b"caaa");
test_aux(b"aaab", b"aaac");
test_aux(b"aaabaa", b"aaacaa");
test_aux(b"aaaaaa", b"aaaaaa");
test_aux(b"bbbb", b"cccc");
}
}

View File

@@ -1,17 +1,41 @@
use crate::Endianness;
use crate::VInt;
use std::io::{Read, Write};
use std::{fmt, io};
use byteorder::{ReadBytesExt, WriteBytesExt};
use std::fmt;
use std::io;
use std::io::Read;
use std::io::Write;
use crate::{Endianness, VInt};
#[derive(Default)]
struct Counter(u64);
impl io::Write for Counter {
fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
self.0 += buf.len() as u64;
Ok(buf.len())
}
fn write_all(&mut self, buf: &[u8]) -> io::Result<()> {
self.0 += buf.len() as u64;
Ok(())
}
fn flush(&mut self) -> io::Result<()> {
Ok(())
}
}
/// Trait for a simple binary serialization.
pub trait BinarySerializable: fmt::Debug + Sized {
/// Serialize
fn serialize<W: Write>(&self, writer: &mut W) -> io::Result<()>;
fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> io::Result<()>;
/// Deserialize
fn deserialize<R: Read>(reader: &mut R) -> io::Result<Self>;
fn num_bytes(&self) -> u64 {
let mut counter = Counter::default();
self.serialize(&mut counter).unwrap();
counter.0
}
}
pub trait DeserializeFrom<T: BinarySerializable> {
@@ -20,7 +44,7 @@ pub trait DeserializeFrom<T: BinarySerializable> {
/// Implement deserialize from &[u8] for all types which implement BinarySerializable.
///
/// TryFrom would actually be preferrable, but not possible because of the orphan
/// TryFrom would actually be preferable, but not possible because of the orphan
/// rules (not completely sure if this could be resolved)
impl<T: BinarySerializable> DeserializeFrom<T> for &[u8] {
fn deserialize(&mut self) -> io::Result<T> {
@@ -35,7 +59,7 @@ pub trait FixedSize: BinarySerializable {
}
impl BinarySerializable for () {
fn serialize<W: Write>(&self, _: &mut W) -> io::Result<()> {
fn serialize<W: Write + ?Sized>(&self, _: &mut W) -> io::Result<()> {
Ok(())
}
fn deserialize<R: Read>(_: &mut R) -> io::Result<Self> {
@@ -48,7 +72,7 @@ impl FixedSize for () {
}
impl<T: BinarySerializable> BinarySerializable for Vec<T> {
fn serialize<W: Write>(&self, writer: &mut W) -> io::Result<()> {
fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
VInt(self.len() as u64).serialize(writer)?;
for it in self {
it.serialize(writer)?;
@@ -67,7 +91,7 @@ impl<T: BinarySerializable> BinarySerializable for Vec<T> {
}
impl<Left: BinarySerializable, Right: BinarySerializable> BinarySerializable for (Left, Right) {
fn serialize<W: Write>(&self, write: &mut W) -> io::Result<()> {
fn serialize<W: Write + ?Sized>(&self, write: &mut W) -> io::Result<()> {
self.0.serialize(write)?;
self.1.serialize(write)
}
@@ -82,7 +106,7 @@ impl<Left: BinarySerializable + FixedSize, Right: BinarySerializable + FixedSize
}
impl BinarySerializable for u32 {
fn serialize<W: Write>(&self, writer: &mut W) -> io::Result<()> {
fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
writer.write_u32::<Endianness>(*self)
}
@@ -95,8 +119,22 @@ impl FixedSize for u32 {
const SIZE_IN_BYTES: usize = 4;
}
impl BinarySerializable for u16 {
fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
writer.write_u16::<Endianness>(*self)
}
fn deserialize<R: Read>(reader: &mut R) -> io::Result<u16> {
reader.read_u16::<Endianness>()
}
}
impl FixedSize for u16 {
const SIZE_IN_BYTES: usize = 2;
}
impl BinarySerializable for u64 {
fn serialize<W: Write>(&self, writer: &mut W) -> io::Result<()> {
fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
writer.write_u64::<Endianness>(*self)
}
fn deserialize<R: Read>(reader: &mut R) -> io::Result<Self> {
@@ -108,8 +146,21 @@ impl FixedSize for u64 {
const SIZE_IN_BYTES: usize = 8;
}
impl BinarySerializable for u128 {
fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
writer.write_u128::<Endianness>(*self)
}
fn deserialize<R: Read>(reader: &mut R) -> io::Result<Self> {
reader.read_u128::<Endianness>()
}
}
impl FixedSize for u128 {
const SIZE_IN_BYTES: usize = 16;
}
impl BinarySerializable for f32 {
fn serialize<W: Write>(&self, writer: &mut W) -> io::Result<()> {
fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
writer.write_f32::<Endianness>(*self)
}
fn deserialize<R: Read>(reader: &mut R) -> io::Result<Self> {
@@ -122,7 +173,7 @@ impl FixedSize for f32 {
}
impl BinarySerializable for i64 {
fn serialize<W: Write>(&self, writer: &mut W) -> io::Result<()> {
fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
writer.write_i64::<Endianness>(*self)
}
fn deserialize<R: Read>(reader: &mut R) -> io::Result<Self> {
@@ -135,7 +186,7 @@ impl FixedSize for i64 {
}
impl BinarySerializable for f64 {
fn serialize<W: Write>(&self, writer: &mut W) -> io::Result<()> {
fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
writer.write_f64::<Endianness>(*self)
}
fn deserialize<R: Read>(reader: &mut R) -> io::Result<Self> {
@@ -148,7 +199,7 @@ impl FixedSize for f64 {
}
impl BinarySerializable for u8 {
fn serialize<W: Write>(&self, writer: &mut W) -> io::Result<()> {
fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
writer.write_u8(*self)
}
fn deserialize<R: Read>(reader: &mut R) -> io::Result<u8> {
@@ -161,9 +212,8 @@ impl FixedSize for u8 {
}
impl BinarySerializable for bool {
fn serialize<W: Write>(&self, writer: &mut W) -> io::Result<()> {
let val = if *self { 1 } else { 0 };
writer.write_u8(val)
fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
writer.write_u8(u8::from(*self))
}
fn deserialize<R: Read>(reader: &mut R) -> io::Result<bool> {
let val = reader.read_u8()?;
@@ -183,7 +233,7 @@ impl FixedSize for bool {
}
impl BinarySerializable for String {
fn serialize<W: Write>(&self, writer: &mut W) -> io::Result<()> {
fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
let data: &[u8] = self.as_bytes();
VInt(data.len() as u64).serialize(writer)?;
writer.write_all(data)
@@ -202,8 +252,7 @@ impl BinarySerializable for String {
#[cfg(test)]
pub mod test {
use super::VInt;
use super::*;
use super::{VInt, *};
use crate::serialize::BinarySerializable;
pub fn fixed_size_test<O: BinarySerializable + FixedSize + Default>() {
let mut buffer = Vec::new();
@@ -231,7 +280,7 @@ pub mod test {
fixed_size_test::<u32>();
assert_eq!(4, serialize_test(3u32));
assert_eq!(4, serialize_test(5u32));
assert_eq!(4, serialize_test(u32::max_value()));
assert_eq!(4, serialize_test(u32::MAX));
}
#[test]
@@ -249,6 +298,11 @@ pub mod test {
fixed_size_test::<u64>();
}
#[test]
fn test_serialize_bool() {
fixed_size_test::<bool>();
}
#[test]
fn test_serialize_string() {
assert_eq!(serialize_test(String::from("")), 1);
@@ -274,6 +328,6 @@ pub mod test {
assert_eq!(serialize_test(VInt(1234u64)), 2);
assert_eq!(serialize_test(VInt(16_383u64)), 2);
assert_eq!(serialize_test(VInt(16_384u64)), 3);
assert_eq!(serialize_test(VInt(u64::max_value())), 10);
assert_eq!(serialize_test(VInt(u64::MAX)), 10);
}
}

View File

@@ -1,8 +1,78 @@
use super::BinarySerializable;
use byteorder::{ByteOrder, LittleEndian};
use std::io;
use std::io::Read;
use std::io::Write;
use std::io::{Read, Write};
use byteorder::{ByteOrder, LittleEndian};
use super::BinarySerializable;
/// Variable int serializes a u128 number
pub fn serialize_vint_u128(mut val: u128, output: &mut Vec<u8>) {
loop {
let next_byte: u8 = (val % 128u128) as u8;
val /= 128u128;
if val == 0 {
output.push(next_byte | STOP_BIT);
return;
} else {
output.push(next_byte);
}
}
}
/// Deserializes a u128 number
///
/// Returns the number and the slice after the vint
pub fn deserialize_vint_u128(data: &[u8]) -> io::Result<(u128, &[u8])> {
let mut result = 0u128;
let mut shift = 0u64;
for i in 0..19 {
let b = data[i];
result |= u128::from(b % 128u8) << shift;
if b >= STOP_BIT {
return Ok((result, &data[i + 1..]));
}
shift += 7;
}
Err(io::Error::new(
io::ErrorKind::InvalidData,
"Failed to deserialize u128 vint",
))
}
/// Wrapper over a `u128` that serializes as a variable int.
#[derive(Clone, Copy, Debug, Eq, PartialEq)]
pub struct VIntU128(pub u128);
impl BinarySerializable for VIntU128 {
fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
let mut buffer = vec![];
serialize_vint_u128(self.0, &mut buffer);
writer.write_all(&buffer)
}
fn deserialize<R: Read>(reader: &mut R) -> io::Result<Self> {
let mut bytes = reader.bytes();
let mut result = 0u128;
let mut shift = 0u64;
loop {
match bytes.next() {
Some(Ok(b)) => {
result |= u128::from(b % 128u8) << shift;
if b >= STOP_BIT {
return Ok(VIntU128(result));
}
shift += 7;
}
_ => {
return Err(io::Error::new(
io::ErrorKind::InvalidData,
"Reach end of buffer while reading VInt",
));
}
}
}
}
}
/// Wrapper over a `u64` that serializes as a variable int.
#[derive(Clone, Copy, Debug, Eq, PartialEq)]
@@ -87,7 +157,7 @@ fn vint_len(data: &[u8]) -> usize {
/// If the buffer does not start by a valid
/// vint payload
pub fn read_u32_vint(data: &mut &[u8]) -> u32 {
let (result, vlen) = read_u32_vint_no_advance(*data);
let (result, vlen) = read_u32_vint_no_advance(data);
*data = &data[vlen..];
result
}
@@ -141,7 +211,7 @@ impl VInt {
}
impl BinarySerializable for VInt {
fn serialize<W: Write>(&self, writer: &mut W) -> io::Result<()> {
fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
let mut buffer = [0u8; 10];
let num_bytes = self.serialize_into(&mut buffer);
writer.write_all(&buffer[0..num_bytes])
@@ -174,9 +244,8 @@ impl BinarySerializable for VInt {
#[cfg(test)]
mod tests {
use super::serialize_vint_u32;
use super::BinarySerializable;
use super::VInt;
use super::{serialize_vint_u32, BinarySerializable, VInt};
use crate::vint::{deserialize_vint_u128, serialize_vint_u128, VIntU128};
fn aux_test_vint(val: u64) {
let mut v = [14u8; 10];
@@ -200,7 +269,7 @@ mod tests {
aux_test_vint(0);
aux_test_vint(1);
aux_test_vint(5);
aux_test_vint(u64::max_value());
aux_test_vint(u64::MAX);
for i in 1..9 {
let power_of_128 = 1u64 << (7 * i);
aux_test_vint(power_of_128 - 1u64);
@@ -218,6 +287,26 @@ mod tests {
assert_eq!(&buffer[..len_vint], res2, "array wrong for {}", val);
}
fn aux_test_vint_u128(val: u128) {
let mut data = vec![];
serialize_vint_u128(val, &mut data);
let (deser_val, _data) = deserialize_vint_u128(&data).unwrap();
assert_eq!(val, deser_val);
let mut out = vec![];
VIntU128(val).serialize(&mut out).unwrap();
let deser_val = VIntU128::deserialize(&mut &out[..]).unwrap();
assert_eq!(val, deser_val.0);
}
#[test]
fn test_vint_u128() {
aux_test_vint_u128(0);
aux_test_vint_u128(1);
aux_test_vint_u128(u128::MAX / 3);
aux_test_vint_u128(u128::MAX);
}
#[test]
fn test_vint_u32() {
aux_test_serialize_vint_u32(0);
@@ -229,6 +318,6 @@ mod tests {
aux_test_serialize_vint_u32(power_of_128);
aux_test_serialize_vint_u32(power_of_128 + 1u32);
}
aux_test_serialize_vint_u32(u32::max_value());
aux_test_serialize_vint_u32(u32::MAX);
}
}

View File

@@ -54,19 +54,18 @@ impl<W: TerminatingWrite> TerminatingWrite for CountingWriter<W> {
}
}
/// Struct used to prevent from calling [`terminate_ref`](trait.TerminatingWrite.html#tymethod.terminate_ref) directly
/// Struct used to prevent from calling
/// [`terminate_ref`](TerminatingWrite::terminate_ref) directly
///
/// The point is that while the type is public, it cannot be built by anyone
/// outside of this module.
pub struct AntiCallToken(());
/// Trait used to indicate when no more write need to be done on a writer
pub trait TerminatingWrite: Write {
pub trait TerminatingWrite: Write + Send + Sync {
/// Indicate that the writer will no longer be used. Internally call terminate_ref.
fn terminate(mut self) -> io::Result<()>
where
Self: Sized,
{
where Self: Sized {
self.terminate_ref(AntiCallToken(()))
}
@@ -97,9 +96,10 @@ impl<'a> TerminatingWrite for &'a mut Vec<u8> {
#[cfg(test)]
mod test {
use super::CountingWriter;
use std::io::Write;
use super::CountingWriter;
#[test]
fn test_counting_writer() {
let buffer: Vec<u8> = vec![];

Binary file not shown.

After

Width:  |  Height:  |  Size: 3.1 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 56 KiB

View File

@@ -0,0 +1,8 @@
<svg width="518" height="112" viewBox="0 0 518 112" fill="none" xmlns="http://www.w3.org/2000/svg">
<path fill-rule="evenodd" clip-rule="evenodd" d="M56 112C86.9279 112 112 86.9279 112 56C112 25.0721 86.9279 0 56 0C25.0721 0 0 25.0721 0 56C0 86.9279 25.0721 112 56 112Z" fill="#0DBD8B"/>
<path fill-rule="evenodd" clip-rule="evenodd" d="M45.7615 26.093C45.7615 23.8325 47.5977 22.0001 49.8629 22.0001C65.2154 22.0001 77.6611 34.4199 77.6611 49.7406C77.6611 52.001 75.8248 53.8335 73.5597 53.8335C71.2945 53.8335 69.4583 52.001 69.4583 49.7406C69.4583 38.9408 60.6851 30.1859 49.8629 30.1859C47.5977 30.1859 45.7615 28.3534 45.7615 26.093Z" fill="white"/>
<path fill-rule="evenodd" clip-rule="evenodd" d="M85.8986 45.6477C88.1637 45.6477 89.9999 47.4801 89.9999 49.7406C89.9999 65.0612 77.5543 77.4811 62.2017 77.4811C59.9366 77.4811 58.1003 75.6486 58.1003 73.3882C58.1003 71.1277 59.9366 69.2953 62.2017 69.2953C73.024 69.2953 81.7972 60.5403 81.7972 49.7406C81.7972 47.4801 83.6334 45.6477 85.8986 45.6477Z" fill="white"/>
<path fill-rule="evenodd" clip-rule="evenodd" d="M66.3031 85.907C66.3031 88.1675 64.4668 89.9999 62.2017 89.9999C46.8492 89.9999 34.4035 77.58 34.4035 62.2594C34.4035 59.9989 36.2398 58.1665 38.5049 58.1665C40.77 58.1665 42.6063 59.9989 42.6063 62.2594C42.6063 73.0592 51.3795 81.8141 62.2017 81.8141C64.4668 81.8141 66.3031 83.6466 66.3031 85.907Z" fill="white"/>
<path fill-rule="evenodd" clip-rule="evenodd" d="M26.1014 66.3523C23.8363 66.3523 22.0001 64.5199 22.0001 62.2594C22 46.9388 34.4457 34.5189 49.7983 34.5189C52.0634 34.5189 53.8997 36.3514 53.8997 38.6118C53.8997 40.8723 52.0634 42.7047 49.7983 42.7047C38.976 42.7047 30.2028 51.4597 30.2028 62.2594C30.2028 64.5199 28.3666 66.3523 26.1014 66.3523Z" fill="white"/>
<path d="M197 63.5H157.5C157.967 67.6333 159.467 70.9333 162 73.4C164.533 75.8 167.867 77 172 77C174.733 77 177.2 76.3333 179.4 75C181.6 73.6667 183.167 71.8667 184.1 69.6H196.1C194.5 74.8667 191.5 79.1333 187.1 82.4C182.767 85.6 177.633 87.2 171.7 87.2C163.967 87.2 157.7 84.6333 152.9 79.5C148.167 74.3667 145.8 67.8667 145.8 60C145.8 52.3333 148.2 45.9 153 40.7C157.8 35.5 164 32.9 171.6 32.9C179.2 32.9 185.333 35.4667 190 40.6C194.733 45.6667 197.1 52.0667 197.1 59.8L197 63.5ZM171.6 42.6C167.867 42.6 164.767 43.7 162.3 45.9C159.833 48.1 158.3 51.0333 157.7 54.7H185.3C184.767 51.0333 183.3 48.1 180.9 45.9C178.5 43.7 175.4 42.6 171.6 42.6ZM205.289 70.5V11H217.189V70.7C217.189 73.3667 218.656 74.7 221.589 74.7L223.689 74.6V85.9C222.556 86.1 221.356 86.2 220.089 86.2C214.956 86.2 211.189 84.9 208.789 82.3C206.456 79.7 205.289 75.7667 205.289 70.5ZM279.109 63.5H239.609C240.076 67.6333 241.576 70.9333 244.109 73.4C246.643 75.8 249.976 77 254.109 77C256.843 77 259.309 76.3333 261.509 75C263.709 73.6667 265.276 71.8667 266.209 69.6H278.209C276.609 74.8667 273.609 79.1333 269.209 82.4C264.876 85.6 259.743 87.2 253.809 87.2C246.076 87.2 239.809 84.6333 235.009 79.5C230.276 74.3667 227.909 67.8667 227.909 60C227.909 52.3333 230.309 45.9 235.109 40.7C239.909 35.5 246.109 32.9 253.709 32.9C261.309 32.9 267.443 35.4667 272.109 40.6C276.843 45.6667 279.209 52.0667 279.209 59.8L279.109 63.5ZM253.709 42.6C249.976 42.6 246.876 43.7 244.409 45.9C241.943 48.1 240.409 51.0333 239.809 54.7H267.409C266.876 51.0333 265.409 48.1 263.009 45.9C260.609 43.7 257.509 42.6 253.709 42.6ZM332.798 56.2V86H320.898V54.9C320.898 47.0333 317.632 43.1 311.098 43.1C307.565 43.1 304.732 44.2333 302.598 46.5C300.532 48.7667 299.498 51.8667 299.498 55.8V86H287.598V34.1H298.598V41C299.865 38.6667 301.798 36.7333 304.398 35.2C306.998 33.6667 310.232 32.9 314.098 32.9C321.298 32.9 326.498 35.6333 329.698 41.1C334.098 35.6333 339.965 32.9 347.298 32.9C353.365 32.9 358.032 34.8 361.298 38.6C364.565 42.3333 366.198 47.2667 366.198 53.4V86H354.298V54.9C354.298 47.0333 351.032 43.1 344.498 43.1C340.898 43.1 338.032 44.2667 335.898 46.6C333.832 48.8667 332.798 52.0667 332.798 56.2ZM425.379 63.5H385.879C386.346 67.6333 387.846 70.9333 390.379 73.4C392.912 75.8 396.246 77 400.379 77C403.112 77 405.579 76.3333 407.779 75C409.979 73.6667 411.546 71.8667 412.479 69.6H424.479C422.879 74.8667 419.879 79.1333 415.479 82.4C411.146 85.6 406.012 87.2 400.079 87.2C392.346 87.2 386.079 84.6333 381.279 79.5C376.546 74.3667 374.179 67.8667 374.179 60C374.179 52.3333 376.579 45.9 381.379 40.7C386.179 35.5 392.379 32.9 399.979 32.9C407.579 32.9 413.712 35.4667 418.379 40.6C423.112 45.6667 425.479 52.0667 425.479 59.8L425.379 63.5ZM399.979 42.6C396.246 42.6 393.146 43.7 390.679 45.9C388.212 48.1 386.679 51.0333 386.079 54.7H413.679C413.146 51.0333 411.679 48.1 409.279 45.9C406.879 43.7 403.779 42.6 399.979 42.6ZM444.868 34.1V41C446.068 38.7333 448.035 36.8333 450.768 35.3C453.568 33.7 456.935 32.9 460.868 32.9C467.001 32.9 471.735 34.7667 475.068 38.5C478.468 42.2333 480.168 47.2 480.168 53.4V86H468.268V54.9C468.268 51.2333 467.401 48.3667 465.668 46.3C464.001 44.1667 461.435 43.1 457.968 43.1C454.168 43.1 451.168 44.2333 448.968 46.5C446.835 48.7667 445.768 51.9 445.768 55.9V86H433.868V34.1H444.868ZM514.922 75.4V85.7C513.455 86.1 511.389 86.3 508.722 86.3C498.589 86.3 493.522 81.2 493.522 71V43.6H485.622V34.1H493.522V20.6H505.422V34.1H515.122V43.6H505.422V69.8C505.422 73.8667 507.355 75.9 511.222 75.9L514.922 75.4Z" fill="black"/>
</svg>

After

Width:  |  Height:  |  Size: 5.2 KiB

BIN
doc/assets/images/etsy.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 85 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 23 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 102 KiB

Some files were not shown because too many files have changed in this diff Show More