Compare commits

...

99 Commits

Author SHA1 Message Date
Paul Masurel
727d024a23 Bugfix position broken.
For Field with several FieldValues, with a
value that contained no token at all, the token position
was reinitialized to 0.

As a result, PhraseQueries can show some false positives.
In addition, after the computation of the position delta, we can
underflow u32, and end up with gigantic delta.

We haven't been able to actually explain the bug in 1629, but it
is assumed that in some corner case these delta can cause a panic.

Closes #1629
2022-10-20 10:19:41 +09:00
PSeitz
449f595832 Merge pull request #1628 from quickwit-oss/skip_index_deser
faster skipindex deserialization, larger blocksize on sort
2022-10-19 11:05:20 +08:00
PSeitz
c9235df059 Merge pull request #1627 from quickwit-oss/ip_field_range_query
add range query handling for ip via term dictionary
2022-10-19 10:53:00 +08:00
Pascal Seitz
a4485f7611 faster skipindex deserialization, larger blocksize on sort 2022-10-18 19:32:23 +08:00
Pascal Seitz
1082ff60f9 add range query handling for ip via term dictionary
since IPs are mapped monotonically we can use the term dictionary for range queries
2022-10-18 13:08:27 +08:00
PSeitz
491854155c Merge pull request #1625 from quickwit-oss/index_ip_field
index ip field
2022-10-18 11:18:17 +08:00
Christoph Herzog
96c3d54ac7 fix: Fix power of two computation on 32bit architectures (#1624)
The current `compute_previous_power_of_two()` implementation used for
TermHashmap takes and returns `usize` , but actually only works
correclty on 64 bit architectures (aka usize == u64)

On other architectures the leading_zeros computation is run on the wrong
type (must be u64), and leads to overflows.

Fixed simply computing the leading_zeros based on a u64 value.
2022-10-18 11:55:02 +09:00
Pascal Seitz
6800fdec9d add indexing for ip field
Closes #1595
2022-10-18 10:07:48 +08:00
PSeitz
c9cf9c952a Merge pull request #1614 from quickwit-oss/remove_superfluous_steps
refactor Term
2022-10-17 18:25:31 +08:00
Pascal Seitz
024e53a99c remove truncate 2022-10-17 12:14:35 +08:00
Pascal Seitz
8d75e451bd fix truncate, remove mutable access from term 2022-10-17 12:14:35 +08:00
Pascal Seitz
fcfd76ec55 refactor Term
fixes some issues with Term
Remove duplicate calls to truncate or resize
Replace Magic Number 5 with constant
Enforce minimum size of 5 for metadata
Fix broken truncate docs
use constructor instead new + set calls
normalize constructor stack
replace assert on internal behavior fixes #1585
2022-10-17 12:14:34 +08:00
PSeitz
6b7b1cc4fa Merge pull request #1623 from quickwit-oss/remove_unused_buffer
remove unused buffer
2022-10-14 20:36:00 +08:00
Pascal Seitz
129f7422f5 remove unused buffer 2022-10-14 20:01:10 +08:00
PSeitz
f39cce2c8b Merge pull request #1622 from quickwit-oss/term_aggregation
add term aggregation clarification
2022-10-14 18:09:18 +08:00
PSeitz
d2478fac8a Merge pull request #1621 from quickwit-oss/changelog
update CHANGELOG
2022-10-14 18:08:57 +08:00
Pascal Seitz
952b048341 add term aggregation clarification 2022-10-14 16:12:19 +08:00
PSeitz
80f9596ec8 Merge pull request #1611 from quickwit-oss/remove_token_stream_alloc
remove tokenstream vec alloc
2022-10-14 15:12:30 +08:00
Pascal Seitz
84f9e77e1d update CHANGELOG 2022-10-14 15:10:33 +08:00
PSeitz
a602c248fb Merge pull request #1590 from waywardmonkeys/fix-doc-warnings-quickwit
Fix missing doc warnings when enabling feature "quickwit".
2022-10-14 14:09:25 +08:00
PSeitz
4b9d1fe828 Merge pull request #1620 from quickwit-oss/fix_fieldnorms_indexing
Fix missing fieldnorm indexing
2022-10-14 13:41:38 +08:00
Pascal Seitz
63bc390b02 Fix missing fieldnorm indexing
Fixes broken search (no results) with BM25 for u64, i64, f64, bool, bytes and date after deletion and merge.
There were no fieldnorms recorded for those field. After merge InvertedIndexReader::total_num_tokens returns 0 (Sum over the fieldnorms is 0). BM25 does not work when total_num_tokens is 0.
Fixes #1617
2022-10-14 12:44:40 +08:00
Paul Masurel
07393c2fa0 Attempt to fix race condition in test. (#1619)
Close #1550
2022-10-14 10:56:37 +09:00
PSeitz
77a415cbe4 rename NothingRecorder to DocIdRecorder (#1615) 2022-10-13 15:43:40 +09:00
PSeitz
4b4c231bba Merge pull request #1612 from quickwit-oss/no_panic_please
return Error instead panic in fastfields
2022-10-11 18:33:00 +08:00
PSeitz
11d3409286 add missing docs for fastfield_codecs crate (#1613)
closes #1603
2022-10-11 18:54:24 +09:00
Pascal Seitz
9cb8cfbea8 return Error instead panic in fastfields
fixes #1572
2022-10-11 14:15:22 +08:00
PSeitz
8b69aab0fc avoid prepare_doc allocation (#1610)
avoid prepare_doc allocation, ~10% more thoughput best case
2022-10-11 14:15:55 +09:00
PSeitz
3650d1f36a Merge pull request #1553 from quickwit-oss/ip_field
ip field
2022-10-11 13:09:47 +08:00
Pascal Seitz
2efebdb1bb remove tokenstream vec alloc 2022-10-11 10:30:56 +08:00
François Massot
e443ca63aa Merge pull request #1608 from quickwit-oss/nigel/serialise-bytes-as-b64-#2042
Serialise bytes as base64 strings instead of arrays.
2022-10-10 11:51:23 +02:00
Pascal Seitz
5c9cbee29d handle IpV4 serialization case 2022-10-07 19:52:00 +08:00
Pascal Seitz
b2ca83a93c switch to ipv6, add monotonic_mapping tests 2022-10-07 18:47:55 +08:00
Nigel Andrews
3b189080d4 Use raw string literals in tests 2022-10-07 12:28:25 +02:00
Nigel Andrews
00a6586efe Replaced String::serialize for serializer.serialize_str 2022-10-07 11:55:05 +02:00
Pascal Seitz
b9b913510e fmt 2022-10-07 16:56:19 +08:00
PSeitz
534b1d33c3 use ipv6
Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-10-07 16:56:00 +08:00
PSeitz
f465173872 Apply suggestions from code review
Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-10-07 16:55:53 +08:00
Pascal Seitz
96315df20d use idx part only for positions_to_docid 2022-10-07 16:54:04 +08:00
Pascal Seitz
9a1609d364 add test 2022-10-07 16:25:01 +08:00
Pascal Seitz
39f4e58450 improve comment 2022-10-07 16:25:01 +08:00
Pascal Seitz
a8a36b62cd enable test 2022-10-07 16:25:01 +08:00
Pascal Seitz
226a49338f add StrictlyMonotonicFn 2022-10-07 16:25:01 +08:00
Pascal Seitz
2864bf7123 use serializer for u128 2022-10-07 16:25:01 +08:00
Pascal Seitz
5171ff611b serialize ip as u128, add test for positions_to_docid 2022-10-07 16:25:01 +08:00
Pascal Seitz
e50e74acf8 remove u128 type 2022-10-07 16:25:01 +08:00
Pascal Seitz
0b86658389 rename ip addr, use buffer 2022-10-07 16:25:01 +08:00
Pascal Seitz
5d6602a8d9 mark null handling TODO 2022-10-07 16:25:01 +08:00
Pascal Seitz
4d29ff4d01 finalize ip addr rename 2022-10-07 16:25:01 +08:00
Pascal Seitz
cdc8e3a8be group montonic mapping and inverse
fix mapping inverse
remove ip indexing
add get_between_vals test
2022-10-07 16:25:01 +08:00
Pascal Seitz
67f453b534 rename to iter_gen 2022-10-07 16:25:01 +08:00
Pascal Seitz
787a37bacf expect instead of unwrap 2022-10-07 16:25:01 +08:00
Pascal Seitz
f5039f1846 remove roaring 2022-10-07 16:25:01 +08:00
Pascal Seitz
eeb1f19093 rename to iter_gen 2022-10-07 16:25:01 +08:00
Pascal Seitz
087beaf328 remove null handling 2022-10-07 16:25:01 +08:00
Pascal Seitz
309449dba3 rename to IpAddr 2022-10-07 16:25:01 +08:00
Pascal Seitz
5a76e6c5d3 fix get_between_vals forwarding
fix get_between_vals forwarding in monotonicmapping column by adding an additional conversion function Output->Input
2022-10-07 16:25:01 +08:00
Pascal Seitz
c8713a01ed use iter api 2022-10-07 16:25:01 +08:00
Pascal Seitz
6113e0408c remove comment 2022-10-07 16:25:01 +08:00
Pascal Seitz
400a20b7af add ip field
add u128 multivalue reader and writer
add ip to schema
add ip writers, handle merge
2022-10-07 16:25:01 +08:00
PSeitz
5f565e77de Merge pull request #1604 from quickwit-oss/replace_cbor
replace cbor with cborium
2022-10-07 14:42:55 +08:00
Pascal Seitz
516e60900d remove unwrap 2022-10-07 14:22:37 +08:00
Pascal Seitz
36e1c79f37 replace cbor with cborium
closes #1526
2022-10-07 13:23:39 +08:00
Bruce Mitchener
c2f1c250f9 doc: Remove reference to Searcher pool. (#1598)
The pool of searchers was removed in 23fe73a6 as part of #1411.
2022-10-06 00:04:11 +09:00
Bruce Mitchener
c694bc039a Fix missing doc warnings when enabling feature "quickwit". 2022-10-05 20:17:10 +07:00
PSeitz
2063f1717f Merge pull request #1591 from quickwit-oss/ff_refact
disable linear codec for multivalue values
2022-10-05 19:39:36 +08:00
Pascal Seitz
d742275048 renames 2022-10-05 19:16:49 +08:00
PSeitz
b9f06bc287 Update src/fastfield/multivalued/mod.rs
Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-10-05 19:09:19 +08:00
Pascal Seitz
8b42c4c126 disable linear codec for multivalue value index
don't materialize index column on merge
use simpler chain() variant
2022-10-05 19:09:17 +08:00
PSeitz
7905965800 Merge pull request #1594 from quickwit-oss/flat_map_with_buffer
Removing alloc on all .next() in MultiValueColumn
2022-10-05 18:34:15 +08:00
Pascal Seitz
f60a551890 add flat_map_with_buffer to Iterator trait 2022-10-05 17:44:26 +08:00
Paul Masurel
7baa6e3ec5 Removing alloc on all .next() in MultiValueColumn 2022-10-05 17:12:06 +09:00
PSeitz
2100ec5d26 Merge pull request #1593 from waywardmonkeys/doc-improvements
Documentation improvements.
2022-10-05 15:50:08 +08:00
Bruce Mitchener
b3bf9a5716 Documentation improvements. 2022-10-05 14:18:10 +07:00
Paul Masurel
0dc8c458e0 Flaky unit test. (#1592) 2022-10-05 16:15:48 +09:00
Nigel Andrews
e5043d78d2 added a couple of tests + make fmt 2022-10-04 12:52:44 +02:00
Nigel Andrews
6d0bb82bd2 Fix issue 1576: serialize bytes as base64 strings 2022-10-04 12:18:13 +02:00
trinity-1686a
5945dbf0bd change format for store to make it faster with small documents (#1569)
* use new format for docstore blocks

* move index to end of block

it makes writing the block faster due to one less memcopy
2022-10-04 09:58:55 +02:00
PSeitz
4cf911d56a Merge pull request #1587 from quickwit-oss/no_get_val_in_serialize
remove get_val in serialization
2022-10-04 12:56:48 +08:00
Pascal Seitz
0f5cff762f move enumerate and remove computation 2022-10-04 12:30:19 +08:00
Pascal Seitz
6d9a123cf2 remove get_val in serialization
remove get_val in serialization and mark as unimplemented!()
replace get_val with iter in linear codec
remove MultivalueStartIndexRandomSeeker
replace MultivalueStartIndexIter with closure
Sample 100 values in linear codec
2022-10-04 12:01:25 +08:00
PSeitz
0f4a47816a Merge pull request #1582 from quickwit-oss/faster_sorted_field_values
use groupby instead of vec allocation
2022-10-04 09:36:24 +08:00
Pascal Seitz
b062ab2196 use groupby instead of vec allocation 2022-10-04 09:26:26 +08:00
Bruce Mitchener
a9d2f3db23 Tantivy requires Rust 1.62 or later. (#1583)
Tantivy needs the `total_cmp` feature to compile, which was stabilized
in Rust 1.62.
2022-10-03 18:31:07 +09:00
Bruce Mitchener
44e03791f9 Fix warnings when doc'ing private items. (#1579)
This also fixes a couple of typos, but plenty remain!
2022-10-03 14:24:00 +09:00
Bruce Mitchener
2d23763e9f Use u64::from boolean more. (#1580)
This case is inverted from the previous cases fixed.

This is from nightly clippy.
2022-10-03 14:17:50 +09:00
Bruce Mitchener
a24ae8d924 clippy: Fix needless-borrow warnings. (#1581)
These show on nightly clippy.
2022-10-03 14:15:09 +09:00
PSeitz
927dff5262 Merge pull request #1578 from quickwit-oss/dead_code
remove dead indexing code
2022-10-03 11:25:10 +08:00
Pascal Seitz
a695edcc95 remove dead indexing code 2022-10-03 09:44:02 +08:00
Paul Masurel
b4b4f3fa73 Removing default features for zstd (#1574) 2022-09-30 13:02:46 +09:00
PSeitz
b50e4b7c20 Merge pull request #1566 from quickwit-oss/fix_docstore_sorting
fix docstore settings for temp docstore
2022-09-30 10:10:36 +08:00
PSeitz
f8686ab1ec improve comments
Co-authored-by: Paul Masurel <paul@quickwit.io>
2022-09-30 10:06:34 +08:00
PSeitz
2fe42719d8 Merge pull request #1570 from quickwit-oss/no_sort_on_multi
validate index settings on create
2022-09-30 09:17:03 +08:00
PSeitz
fadd784a25 log improvements (#1564) 2022-09-30 09:39:26 +09:00
Pascal Seitz
0e94213af0 validate index settings on create 2022-09-29 18:58:09 +08:00
PSeitz
0da2a2e70d Merge pull request #1567 from quickwit-oss/dependabot/cargo/tantivy-fst-0.4.0
Update tantivy-fst requirement from 0.3.0 to 0.4.0
2022-09-29 10:00:16 +08:00
dependabot[bot]
0bcdf3cbbf Update tantivy-fst requirement from 0.3.0 to 0.4.0
Updates the requirements on [tantivy-fst](https://github.com/tantivy-search/fst) to permit the latest version.
- [Release notes](https://github.com/tantivy-search/fst/releases)
- [Commits](https://github.com/tantivy-search/fst/commits)

---
updated-dependencies:
- dependency-name: tantivy-fst
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2022-09-28 20:50:43 +00:00
Pascal Seitz
8f647b817f fix docstore settings for temp docstore
fixes #1565
2022-09-28 17:53:59 +08:00
trinity-1686a
a86b0df6f4 Add query matching terms in a set (#1539) 2022-09-28 09:43:18 +02:00
78 changed files with 3091 additions and 799 deletions

View File

@@ -1,10 +1,32 @@
Tantivy 0.19
================================
- Major bugfix: Fix missing fieldnorms for u64, i64, f64, bool, bytes and date [#1620](https://github.com/quickwit-oss/tantivy/pull/1620) (@PSeitz)
- Updated [Date Field Type](https://github.com/quickwit-oss/tantivy/pull/1396)
The `DateTime` type has been updated to hold timestamps with microseconds precision.
`DateOptions` and `DatePrecision` have been added to configure Date fields. The precision is used to hint on fast values compression. Otherwise, seconds precision is used everywhere else (i.e terms, indexing).
- Remove Searcher pool and make `Searcher` cloneable.
`DateOptions` and `DatePrecision` have been added to configure Date fields. The precision is used to hint on fast values compression. Otherwise, seconds precision is used everywhere else (i.e terms, indexing). (@evanxg852000)
- Add IP address field type [#1553](https://github.com/quickwit-oss/tantivy/pull/1553) (@PSeitz)
- Add boolean field type [#1382](https://github.com/quickwit-oss/tantivy/pull/1382) (@boraarslan)
- Remove Searcher pool and make `Searcher` cloneable. (@PSeitz)
- Validate settings on create [#1570](https://github.com/quickwit-oss/tantivy/pull/1570 (@PSeitz)
- Fix interpolation overflow in linear interpolation fastfield codec [#1480](https://github.com/quickwit-oss/tantivy/pull/1480 (@PSeitz @fulmicoton)
- Detect and apply gcd on fastfield codecs [#1418](https://github.com/quickwit-oss/tantivy/pull/1418) (@PSeitz)
- Doc store
- use separate thread to compress block store [#1389](https://github.com/quickwit-oss/tantivy/pull/1389) [#1510](https://github.com/quickwit-oss/tantivy/pull/1510 (@PSeitz @fulmicoton)
- Expose doc store cache size [#1403](https://github.com/quickwit-oss/tantivy/pull/1403) (@PSeitz)
- Enable compression levels for doc store [#1378](https://github.com/quickwit-oss/tantivy/pull/1378) (@PSeitz)
- Make block size configurable [#1374](https://github.com/quickwit-oss/tantivy/pull/1374) (@kryesh)
- Make `tantivy::TantivyError` cloneable [#1402](https://github.com/quickwit-oss/tantivy/pull/1402) (@PSeitz)
- Add support for phrase slop in query language [#1393](https://github.com/quickwit-oss/tantivy/pull/1393) (@saroh)
- Aggregation
- Add support for keyed parameter in range and histgram aggregations [#1424](https://github.com/quickwit-oss/tantivy/pull/1424) (@k-yomo)
- Add aggregation bucket limit [#1363](https://github.com/quickwit-oss/tantivy/pull/1363) (@PSeitz)
- Faster indexing
- [#1610](https://github.com/quickwit-oss/tantivy/pull/1610 (@PSeitz)
- [#1594](https://github.com/quickwit-oss/tantivy/pull/1594 (@PSeitz)
- [#1582](https://github.com/quickwit-oss/tantivy/pull/1582 (@PSeitz)
- [#1611](https://github.com/quickwit-oss/tantivy/pull/1611 (@PSeitz)
Tantivy 0.18
================================

View File

@@ -11,6 +11,7 @@ repository = "https://github.com/quickwit-oss/tantivy"
readme = "README.md"
keywords = ["search", "information", "retrieval"]
edition = "2021"
rust-version = "1.62"
[dependencies]
oneshot = "0.1.3"
@@ -19,11 +20,11 @@ byteorder = "1.4.3"
crc32fast = "1.3.2"
once_cell = "1.10.0"
regex = { version = "1.5.5", default-features = false, features = ["std", "unicode"] }
tantivy-fst = "0.3.0"
tantivy-fst = "0.4.0"
memmap2 = { version = "0.5.3", optional = true }
lz4_flex = { version = "0.9.2", default-features = false, features = ["checked-decode"], optional = true }
brotli = { version = "3.3.4", optional = true }
zstd = { version = "0.11", optional = true }
zstd = { version = "0.11", optional = true, default-features = false }
snap = { version = "1.0.5", optional = true }
tempfile = { version = "3.3.0", optional = true }
log = "0.4.16"
@@ -56,7 +57,7 @@ lru = "0.7.5"
fastdivide = "0.4.0"
itertools = "0.10.3"
measure_time = "0.8.2"
serde_cbor = { version = "0.11.2", optional = true }
ciborium = { version = "0.2", optional = true}
async-trait = "0.1.53"
arc-swap = "1.5.0"
@@ -100,7 +101,7 @@ zstd-compression = ["zstd"]
failpoints = ["fail/failpoints"]
unstable = [] # useful for benches.
quickwit = ["serde_cbor"]
quickwit = ["ciborium"]
[workspace]
members = ["query-grammar", "bitpacker", "common", "fastfield_codecs", "ownedbytes"]

View File

@@ -58,7 +58,7 @@ Distributed search is out of the scope of Tantivy, but if you are looking for th
# Getting started
Tantivy works on stable Rust (>= 1.27) and supports Linux, macOS, and Windows.
Tantivy works on stable Rust and supports Linux, macOS, and Windows.
- [Tantivy's simple search example](https://tantivy-search.github.io/examples/basic_search.html)
- [tantivy-cli and its tutorial](https://github.com/quickwit-oss/tantivy-cli) - `tantivy-cli` is an actual command-line interface that makes it easy for you to create a search engine,
@@ -81,9 +81,13 @@ There are many ways to support this project.
We use the GitHub Pull Request workflow: reference a GitHub ticket and/or include a comprehensive commit message when opening a PR.
## Minimum supported Rust version
Tantivy currently requires at least Rust 1.62 or later to compile.
## Clone and build locally
Tantivy compiles on stable Rust but requires `Rust >= 1.27`.
Tantivy compiles on stable Rust.
To check out and run tests, you can simply run:
```bash

View File

@@ -107,6 +107,19 @@ impl FixedSize for u64 {
const SIZE_IN_BYTES: usize = 8;
}
impl BinarySerializable for u128 {
fn serialize<W: Write>(&self, writer: &mut W) -> io::Result<()> {
writer.write_u128::<Endianness>(*self)
}
fn deserialize<R: Read>(reader: &mut R) -> io::Result<Self> {
reader.read_u128::<Endianness>()
}
}
impl FixedSize for u128 {
const SIZE_IN_BYTES: usize = 16;
}
impl BinarySerializable for f32 {
fn serialize<W: Write>(&self, writer: &mut W) -> io::Result<()> {
writer.write_f32::<Endianness>(*self)

View File

@@ -100,9 +100,10 @@ mod tests {
fn get_u128_column_from_data(data: &[u128]) -> Arc<dyn Column<u128>> {
let mut out = vec![];
serialize_u128(VecColumn::from(&data), &mut out).unwrap();
let iter_gen = || data.iter().cloned();
serialize_u128(iter_gen, data.len() as u64, &mut out).unwrap();
let out = OwnedBytes::new(out);
open_u128(out).unwrap()
open_u128::<u128>(out).unwrap()
}
#[bench]

View File

@@ -3,6 +3,9 @@ use std::ops::RangeInclusive;
use tantivy_bitpacker::minmax;
use crate::monotonic_mapping::StrictlyMonotonicFn;
/// `Column` provides columnar access on a field.
pub trait Column<T: PartialOrd = u64>: Send + Sync {
/// Return the value associated with the given idx.
///
@@ -57,6 +60,7 @@ pub trait Column<T: PartialOrd = u64>: Send + Sync {
/// `.max_value()`.
fn max_value(&self) -> T;
/// The number of values in the column.
fn num_vals(&self) -> u64;
/// Returns a iterator over the data
@@ -65,6 +69,7 @@ pub trait Column<T: PartialOrd = u64>: Send + Sync {
}
}
/// VecColumn provides `Column` over a slice.
pub struct VecColumn<'a, T = u64> {
values: &'a [T],
min_value: T,
@@ -143,16 +148,30 @@ struct MonotonicMappingColumn<C, T, Input> {
_phantom: PhantomData<Input>,
}
/// Creates a view of a column transformed by a monotonic mapping.
pub fn monotonic_map_column<C, T, Input: PartialOrd, Output: PartialOrd>(
/// Creates a view of a column transformed by a strictly monotonic mapping. See
/// [`StrictlyMonotonicFn`].
///
/// E.g. apply a gcd monotonic_mapping([100, 200, 300]) == [1, 2, 3]
/// monotonic_mapping.mapping() is expected to be injective, and we should always have
/// monotonic_mapping.inverse(monotonic_mapping.mapping(el)) == el
///
/// The inverse of the mapping is required for:
/// `fn get_between_vals(&self, range: RangeInclusive<T>) -> Vec<u64> `
/// The user provides the original value range and we need to monotonic map them in the same way the
/// serialization does before calling the underlying column.
///
/// Note that when opening a codec, the monotonic_mapping should be the inverse of the mapping
/// during serialization. And therefore the monotonic_mapping_inv when opening is the same as
/// monotonic_mapping during serialization.
pub fn monotonic_map_column<C, T, Input, Output>(
from_column: C,
monotonic_mapping: T,
) -> impl Column<Output>
where
C: Column<Input>,
T: Fn(Input) -> Output + Send + Sync,
Input: Send + Sync,
Output: Send + Sync,
T: StrictlyMonotonicFn<Input, Output> + Send + Sync,
Input: PartialOrd + Send + Sync + Clone,
Output: PartialOrd + Send + Sync + Clone,
{
MonotonicMappingColumn {
from_column,
@@ -161,28 +180,27 @@ where
}
}
impl<C, T, Input: PartialOrd, Output: PartialOrd> Column<Output>
for MonotonicMappingColumn<C, T, Input>
impl<C, T, Input, Output> Column<Output> for MonotonicMappingColumn<C, T, Input>
where
C: Column<Input>,
T: Fn(Input) -> Output + Send + Sync,
Input: Send + Sync,
Output: Send + Sync,
T: StrictlyMonotonicFn<Input, Output> + Send + Sync,
Input: PartialOrd + Send + Sync + Clone,
Output: PartialOrd + Send + Sync + Clone,
{
#[inline]
fn get_val(&self, idx: u64) -> Output {
let from_val = self.from_column.get_val(idx);
(self.monotonic_mapping)(from_val)
self.monotonic_mapping.mapping(from_val)
}
fn min_value(&self) -> Output {
let from_min_value = self.from_column.min_value();
(self.monotonic_mapping)(from_min_value)
self.monotonic_mapping.mapping(from_min_value)
}
fn max_value(&self) -> Output {
let from_max_value = self.from_column.max_value();
(self.monotonic_mapping)(from_max_value)
self.monotonic_mapping.mapping(from_max_value)
}
fn num_vals(&self) -> u64 {
@@ -190,7 +208,18 @@ where
}
fn iter(&self) -> Box<dyn Iterator<Item = Output> + '_> {
Box::new(self.from_column.iter().map(&self.monotonic_mapping))
Box::new(
self.from_column
.iter()
.map(|el| self.monotonic_mapping.mapping(el)),
)
}
fn get_between_vals(&self, range: RangeInclusive<Output>) -> Vec<u64> {
self.from_column.get_between_vals(
self.monotonic_mapping.inverse(range.start().clone())
..=self.monotonic_mapping.inverse(range.end().clone()),
)
}
// We voluntarily do not implement get_range as it yields a regression,
@@ -236,19 +265,22 @@ where
#[cfg(test)]
mod tests {
use super::*;
use crate::MonotonicallyMappableToU64;
use crate::monotonic_mapping::{
StrictlyMonotonicMappingInverter, StrictlyMonotonicMappingToInternalBaseval,
StrictlyMonotonicMappingToInternalGCDBaseval,
};
#[test]
fn test_monotonic_mapping() {
let vals = &[1u64, 3u64][..];
let vals = &[3u64, 5u64][..];
let col = VecColumn::from(vals);
let mapped = monotonic_map_column(col, |el| el + 4);
assert_eq!(mapped.min_value(), 5u64);
assert_eq!(mapped.max_value(), 7u64);
let mapped = monotonic_map_column(col, StrictlyMonotonicMappingToInternalBaseval::new(2));
assert_eq!(mapped.min_value(), 1u64);
assert_eq!(mapped.max_value(), 3u64);
assert_eq!(mapped.num_vals(), 2);
assert_eq!(mapped.num_vals(), 2);
assert_eq!(mapped.get_val(0), 5);
assert_eq!(mapped.get_val(1), 7);
assert_eq!(mapped.get_val(0), 1);
assert_eq!(mapped.get_val(1), 3);
}
#[test]
@@ -260,10 +292,15 @@ mod tests {
#[test]
fn test_monotonic_mapping_iter() {
let vals: Vec<u64> = (-1..99).map(i64::to_u64).collect();
let vals: Vec<u64> = (10..110u64).map(|el| el * 10).collect();
let col = VecColumn::from(&vals);
let mapped = monotonic_map_column(col, |el| i64::from_u64(el) * 10i64);
let val_i64s: Vec<i64> = mapped.iter().collect();
let mapped = monotonic_map_column(
col,
StrictlyMonotonicMappingInverter::from(
StrictlyMonotonicMappingToInternalGCDBaseval::new(10, 100),
),
);
let val_i64s: Vec<u64> = mapped.iter().collect();
for i in 0..100 {
assert_eq!(val_i64s[i as usize], mapped.get_val(i));
}
@@ -271,20 +308,26 @@ mod tests {
#[test]
fn test_monotonic_mapping_get_range() {
let vals: Vec<u64> = (-1..99).map(i64::to_u64).collect();
let vals: Vec<u64> = (0..100u64).map(|el| el * 10).collect();
let col = VecColumn::from(&vals);
let mapped = monotonic_map_column(col, |el| i64::from_u64(el) * 10i64);
assert_eq!(mapped.min_value(), -10i64);
assert_eq!(mapped.max_value(), 980i64);
let mapped = monotonic_map_column(
col,
StrictlyMonotonicMappingInverter::from(
StrictlyMonotonicMappingToInternalGCDBaseval::new(10, 0),
),
);
assert_eq!(mapped.min_value(), 0u64);
assert_eq!(mapped.max_value(), 9900u64);
assert_eq!(mapped.num_vals(), 100);
let val_i64s: Vec<i64> = mapped.iter().collect();
assert_eq!(val_i64s.len(), 100);
let val_u64s: Vec<u64> = mapped.iter().collect();
assert_eq!(val_u64s.len(), 100);
for i in 0..100 {
assert_eq!(val_i64s[i as usize], mapped.get_val(i));
assert_eq!(val_i64s[i as usize], i64::from_u64(vals[i as usize]) * 10);
assert_eq!(val_u64s[i as usize], mapped.get_val(i));
assert_eq!(val_u64s[i as usize], vals[i as usize] * 10);
}
let mut buf = [0i64; 20];
let mut buf = [0u64; 20];
mapped.get_range(7, &mut buf[..]);
assert_eq!(&val_i64s[7..][..20], &buf);
assert_eq!(&val_u64s[7..][..20], &buf);
}
}

View File

@@ -171,10 +171,10 @@ pub struct IPCodecParams {
impl CompactSpaceCompressor {
/// Taking the vals as Vec may cost a lot of memory. It is used to sort the vals.
pub fn train_from(column: &impl Column<u128>) -> Self {
pub fn train_from(iter: impl Iterator<Item = u128>, num_vals: u64) -> Self {
let mut values_sorted = BTreeSet::new();
values_sorted.extend(column.iter());
let total_num_values = column.num_vals();
values_sorted.extend(iter);
let total_num_values = num_vals;
let compact_space =
get_compact_space(&values_sorted, total_num_values, COST_PER_BLANK_IN_BITS);
@@ -443,7 +443,7 @@ impl CompactSpaceDecompressor {
mod tests {
use super::*;
use crate::{open_u128, serialize_u128, VecColumn};
use crate::{open_u128, serialize_u128};
#[test]
fn compact_space_test() {
@@ -513,7 +513,12 @@ mod tests {
fn test_aux_vals(u128_vals: &[u128]) -> OwnedBytes {
let mut out = Vec::new();
serialize_u128(VecColumn::from(u128_vals), &mut out).unwrap();
serialize_u128(
|| u128_vals.iter().cloned(),
u128_vals.len() as u64,
&mut out,
)
.unwrap();
let data = OwnedBytes::new(out);
test_all(data.clone(), u128_vals);
@@ -603,8 +608,8 @@ mod tests {
5_000_000_000,
];
let mut out = Vec::new();
serialize_u128(VecColumn::from(vals), &mut out).unwrap();
let decomp = open_u128(OwnedBytes::new(out)).unwrap();
serialize_u128(|| vals.iter().cloned(), vals.len() as u64, &mut out).unwrap();
let decomp = open_u128::<u128>(OwnedBytes::new(out)).unwrap();
assert_eq!(decomp.get_between_vals(199..=200), vec![0]);
assert_eq!(decomp.get_between_vals(199..=201), vec![0, 1]);

View File

@@ -1,5 +1,12 @@
#![warn(missing_docs)]
#![cfg_attr(all(feature = "unstable", test), feature(test))]
//! # `fastfield_codecs`
//!
//! - Columnar storage of data for tantivy [`Column`].
//! - Encode data in different codecs.
//! - Monotonically map values to u64/u128
#[cfg(test)]
#[macro_use]
extern crate more_asserts;
@@ -13,6 +20,10 @@ use std::sync::Arc;
use common::BinarySerializable;
use compact_space::CompactSpaceDecompressor;
use monotonic_mapping::{
StrictlyMonotonicMappingInverter, StrictlyMonotonicMappingToInternal,
StrictlyMonotonicMappingToInternalBaseval, StrictlyMonotonicMappingToInternalGCDBaseval,
};
use ownedbytes::OwnedBytes;
use serialize::Header;
@@ -22,6 +33,7 @@ mod compact_space;
mod line;
mod linear;
mod monotonic_mapping;
mod monotonic_mapping_u128;
mod column;
mod gcd;
@@ -31,16 +43,24 @@ use self::bitpacked::BitpackedCodec;
use self::blockwise_linear::BlockwiseLinearCodec;
pub use self::column::{monotonic_map_column, Column, VecColumn};
use self::linear::LinearCodec;
pub use self::monotonic_mapping::MonotonicallyMappableToU64;
pub use self::monotonic_mapping::{MonotonicallyMappableToU64, StrictlyMonotonicFn};
pub use self::monotonic_mapping_u128::MonotonicallyMappableToU128;
pub use self::serialize::{
estimate, serialize, serialize_and_load, serialize_u128, NormalizedHeader,
};
#[derive(PartialEq, Eq, PartialOrd, Ord, Debug, Clone, Copy)]
#[repr(u8)]
/// Available codecs to use to encode the u64 (via [`MonotonicallyMappableToU64`]) converted data.
pub enum FastFieldCodecType {
/// Bitpack all values in the value range. The number of bits is defined by the amplitude
/// `column.max_value() - column.min_value()`
Bitpacked = 1,
/// Linear interpolation puts a line between the first and last value and then bitpacks the
/// values by the offset from the line. The number of bits is defined by the max deviation from
/// the line.
Linear = 2,
/// Same as [`FastFieldCodecType::Linear`], but encodes in blocks of 512 elements.
BlockwiseLinear = 3,
}
@@ -58,11 +78,11 @@ impl BinarySerializable for FastFieldCodecType {
}
impl FastFieldCodecType {
pub fn to_code(self) -> u8 {
pub(crate) fn to_code(self) -> u8 {
self as u8
}
pub fn from_code(code: u8) -> Option<Self> {
pub(crate) fn from_code(code: u8) -> Option<Self> {
match code {
1 => Some(Self::Bitpacked),
2 => Some(Self::Linear),
@@ -73,8 +93,13 @@ impl FastFieldCodecType {
}
/// Returns the correct codec reader wrapped in the `Arc` for the data.
pub fn open_u128(bytes: OwnedBytes) -> io::Result<Arc<dyn Column<u128>>> {
Ok(Arc::new(CompactSpaceDecompressor::open(bytes)?))
pub fn open_u128<Item: MonotonicallyMappableToU128>(
bytes: OwnedBytes,
) -> io::Result<Arc<dyn Column<Item>>> {
let reader = CompactSpaceDecompressor::open(bytes)?;
let inverted: StrictlyMonotonicMappingInverter<StrictlyMonotonicMappingToInternal<Item>> =
StrictlyMonotonicMappingToInternal::<Item>::new().into();
Ok(Arc::new(monotonic_map_column(reader, inverted)))
}
/// Returns the correct codec reader wrapped in the `Arc` for the data.
@@ -99,11 +124,15 @@ fn open_specific_codec<C: FastFieldCodec, Item: MonotonicallyMappableToU64>(
let reader = C::open_from_bytes(bytes, normalized_header)?;
let min_value = header.min_value;
if let Some(gcd) = header.gcd {
let monotonic_mapping = move |val: u64| Item::from_u64(min_value + val * gcd.get());
Ok(Arc::new(monotonic_map_column(reader, monotonic_mapping)))
let mapping = StrictlyMonotonicMappingInverter::from(
StrictlyMonotonicMappingToInternalGCDBaseval::new(gcd.get(), min_value),
);
Ok(Arc::new(monotonic_map_column(reader, mapping)))
} else {
let monotonic_mapping = move |val: u64| Item::from_u64(min_value + val);
Ok(Arc::new(monotonic_map_column(reader, monotonic_mapping)))
let mapping = StrictlyMonotonicMappingInverter::from(
StrictlyMonotonicMappingToInternalBaseval::new(min_value),
);
Ok(Arc::new(monotonic_map_column(reader, mapping)))
}
}
@@ -135,6 +164,7 @@ trait FastFieldCodec: 'static {
fn estimate(column: &dyn Column) -> Option<f32>;
}
/// The list of all available codecs for u64 convertible data.
pub const ALL_CODEC_TYPES: [FastFieldCodecType; 3] = [
FastFieldCodecType::Bitpacked,
FastFieldCodecType::BlockwiseLinear,
@@ -143,6 +173,7 @@ pub const ALL_CODEC_TYPES: [FastFieldCodecType; 3] = [
#[cfg(test)]
mod tests {
use proptest::prelude::*;
use proptest::strategy::Strategy;
use proptest::{prop_oneof, proptest};
@@ -177,6 +208,18 @@ mod tests {
`{data:?}`",
);
}
if !data.is_empty() {
let test_rand_idx = rand::thread_rng().gen_range(0..=data.len() - 1);
let expected_positions: Vec<u64> = data
.iter()
.enumerate()
.filter(|(_, el)| **el == data[test_rand_idx])
.map(|(pos, _)| pos as u64)
.collect();
let positions = reader.get_between_vals(data[test_rand_idx]..=data[test_rand_idx]);
assert_eq!(expected_positions, positions);
}
Some((estimation, actual_compression))
}
@@ -312,7 +355,7 @@ mod tests {
#[test]
fn estimation_test_bad_interpolation_case_monotonically_increasing() {
let mut data: Vec<u64> = (200..=20000_u64).collect();
let mut data: Vec<u64> = (201..=20000_u64).collect();
data.push(1_000_000);
let data: VecColumn = data.as_slice().into();

View File

@@ -68,29 +68,37 @@ impl Line {
}
// Same as train, but the intercept is only estimated from provided sample positions
pub fn estimate(ys: &dyn Column, sample_positions: &[u64]) -> Self {
pub fn estimate(sample_positions_and_values: &[(u64, u64)]) -> Self {
let first_val = sample_positions_and_values[0].1;
let last_val = sample_positions_and_values[sample_positions_and_values.len() - 1].1;
let num_vals = sample_positions_and_values[sample_positions_and_values.len() - 1].0 + 1;
Self::train_from(
ys,
sample_positions
.iter()
.cloned()
.map(|pos| (pos, ys.get_val(pos))),
first_val,
last_val,
num_vals,
sample_positions_and_values.iter().cloned(),
)
}
// Intercept is only computed from provided positions
fn train_from(ys: &dyn Column, positions_and_values: impl Iterator<Item = (u64, u64)>) -> Self {
let num_vals = if let Some(num_vals) = NonZeroU64::new(ys.num_vals() - 1) {
num_vals
fn train_from(
first_val: u64,
last_val: u64,
num_vals: u64,
positions_and_values: impl Iterator<Item = (u64, u64)>,
) -> Self {
// TODO replace with let else
let idx_last_val = if let Some(idx_last_val) = NonZeroU64::new(num_vals - 1) {
idx_last_val
} else {
return Line::default();
};
let y0 = ys.get_val(0);
let y1 = ys.get_val(num_vals.get());
let y0 = first_val;
let y1 = last_val;
// We first independently pick our slope.
let slope = compute_slope(y0, y1, num_vals);
let slope = compute_slope(y0, y1, idx_last_val);
// We picked our slope. Note that it does not have to be perfect.
// Now we need to compute the best intercept.
@@ -138,8 +146,12 @@ impl Line {
/// This function is only invariable by translation if all of the
/// `ys` are packaged into half of the space. (See heuristic below)
pub fn train(ys: &dyn Column) -> Self {
let first_val = ys.iter().next().unwrap();
let last_val = ys.iter().nth(ys.num_vals() as usize - 1).unwrap();
Self::train_from(
ys,
first_val,
last_val,
ys.num_vals(),
ys.iter().enumerate().map(|(pos, val)| (pos as u64, val)),
)
}

View File

@@ -126,18 +126,20 @@ impl FastFieldCodec for LinearCodec {
return None; // disable compressor for this case
}
// let's sample at 0%, 5%, 10% .. 95%, 100%
let num_vals = column.num_vals() as f32 / 100.0;
let sample_positions = (0..20)
.map(|pos| (num_vals * pos as f32 * 5.0) as u64)
.collect::<Vec<_>>();
let limit_num_vals = column.num_vals().min(100_000);
let line = Line::estimate(column, &sample_positions);
let num_samples = 100;
let step_size = (limit_num_vals / num_samples).max(1); // 20 samples
let mut sample_positions_and_values: Vec<_> = Vec::new();
for (pos, val) in column.iter().enumerate().step_by(step_size as usize) {
sample_positions_and_values.push((pos as u64, val));
}
let estimated_bit_width = sample_positions
let line = Line::estimate(&sample_positions_and_values);
let estimated_bit_width = sample_positions_and_values
.into_iter()
.map(|pos| {
let actual_value = column.get_val(pos);
.map(|(pos, actual_value)| {
let interpolated_val = line.eval(pos as u64);
actual_value.wrapping_sub(interpolated_val)
})
@@ -146,6 +148,7 @@ impl FastFieldCodec for LinearCodec {
.max()
.unwrap_or(0);
// Extrapolate to whole column
let num_bits = (estimated_bit_width as u64 * column.num_vals() as u64) + 64;
let num_bits_uncompressed = 64 * column.num_vals();
Some(num_bits as f32 / num_bits_uncompressed as f32)

View File

@@ -90,7 +90,7 @@ fn bench_ip() {
{
let mut data = vec![];
for dataset in dataset.chunks(500_000) {
serialize_u128(VecColumn::from(dataset), &mut data).unwrap();
serialize_u128(|| dataset.iter().cloned(), dataset.len() as u64, &mut data).unwrap();
}
let compression = data.len() as f64 / (dataset.len() * 16) as f64;
println!("Compression 50_000 chunks {:.4}", compression);
@@ -101,7 +101,10 @@ fn bench_ip() {
}
let mut data = vec![];
serialize_u128(VecColumn::from(&dataset), &mut data).unwrap();
{
print_time!("creation");
serialize_u128(|| dataset.iter().cloned(), dataset.len() as u64, &mut data).unwrap();
}
let compression = data.len() as f64 / (dataset.len() * 16) as f64;
println!("Compression {:.2}", compression);
@@ -110,7 +113,7 @@ fn bench_ip() {
(data.len() * 8) as f32 / dataset.len() as f32
);
let decompressor = open_u128(OwnedBytes::new(data)).unwrap();
let decompressor = open_u128::<u128>(OwnedBytes::new(data)).unwrap();
// Sample some ranges
for value in dataset.iter().take(1110).skip(1100).cloned() {
print_time!("get range");

View File

@@ -1,3 +1,11 @@
use std::marker::PhantomData;
use fastdivide::DividerU64;
use crate::MonotonicallyMappableToU128;
/// Monotonic maps a value to u64 value space.
/// Monotonic mapping enables `PartialOrd` on u64 space without conversion to original space.
pub trait MonotonicallyMappableToU64: 'static + PartialOrd + Copy + Send + Sync {
/// Converts a value to u64.
///
@@ -11,6 +19,145 @@ pub trait MonotonicallyMappableToU64: 'static + PartialOrd + Copy + Send + Sync
fn from_u64(val: u64) -> Self;
}
/// Values need to be strictly monotonic mapped to a `Internal` value (u64 or u128) that can be
/// used in fast field codecs.
///
/// The monotonic mapping is required so that `PartialOrd` can be used on `Internal` without
/// converting to `External`.
///
/// All strictly monotonic functions are invertible because they are guaranteed to have a one-to-one
/// mapping from their range to their domain. The `inverse` method is required when opening a codec,
/// so a value can be converted back to its original domain (e.g. ip address or f64) from its
/// internal representation.
pub trait StrictlyMonotonicFn<External, Internal> {
/// Strictly monotonically maps the value from External to Internal.
fn mapping(&self, inp: External) -> Internal;
/// Inverse of `mapping`. Maps the value from Internal to External.
fn inverse(&self, out: Internal) -> External;
}
/// Inverts a strictly monotonic mapping from `StrictlyMonotonicFn<A, B>` to
/// `StrictlyMonotonicFn<B, A>`.
///
/// # Warning
///
/// This type comes with a footgun. A type being strictly monotonic does not impose that the inverse
/// mapping is strictly monotonic over the entire space External. e.g. a -> a * 2. Use at your own
/// risks.
pub(crate) struct StrictlyMonotonicMappingInverter<T> {
orig_mapping: T,
}
impl<T> From<T> for StrictlyMonotonicMappingInverter<T> {
fn from(orig_mapping: T) -> Self {
Self { orig_mapping }
}
}
impl<From, To, T> StrictlyMonotonicFn<To, From> for StrictlyMonotonicMappingInverter<T>
where T: StrictlyMonotonicFn<From, To>
{
fn mapping(&self, val: To) -> From {
self.orig_mapping.inverse(val)
}
fn inverse(&self, val: From) -> To {
self.orig_mapping.mapping(val)
}
}
/// Applies the strictly monotonic mapping from `T` without any additional changes.
pub(crate) struct StrictlyMonotonicMappingToInternal<T> {
_phantom: PhantomData<T>,
}
impl<T> StrictlyMonotonicMappingToInternal<T> {
pub(crate) fn new() -> StrictlyMonotonicMappingToInternal<T> {
Self {
_phantom: PhantomData,
}
}
}
impl<External: MonotonicallyMappableToU128, T: MonotonicallyMappableToU128>
StrictlyMonotonicFn<External, u128> for StrictlyMonotonicMappingToInternal<T>
where T: MonotonicallyMappableToU128
{
fn mapping(&self, inp: External) -> u128 {
External::to_u128(inp)
}
fn inverse(&self, out: u128) -> External {
External::from_u128(out)
}
}
impl<External: MonotonicallyMappableToU64, T: MonotonicallyMappableToU64>
StrictlyMonotonicFn<External, u64> for StrictlyMonotonicMappingToInternal<T>
where T: MonotonicallyMappableToU64
{
fn mapping(&self, inp: External) -> u64 {
External::to_u64(inp)
}
fn inverse(&self, out: u64) -> External {
External::from_u64(out)
}
}
/// Mapping dividing by gcd and a base value.
///
/// The function is assumed to be only called on values divided by passed
/// gcd value. (It is necessary for the function to be monotonic.)
pub(crate) struct StrictlyMonotonicMappingToInternalGCDBaseval {
gcd_divider: DividerU64,
gcd: u64,
min_value: u64,
}
impl StrictlyMonotonicMappingToInternalGCDBaseval {
pub(crate) fn new(gcd: u64, min_value: u64) -> Self {
let gcd_divider = DividerU64::divide_by(gcd);
Self {
gcd_divider,
gcd,
min_value,
}
}
}
impl<External: MonotonicallyMappableToU64> StrictlyMonotonicFn<External, u64>
for StrictlyMonotonicMappingToInternalGCDBaseval
{
fn mapping(&self, inp: External) -> u64 {
self.gcd_divider
.divide(External::to_u64(inp) - self.min_value)
}
fn inverse(&self, out: u64) -> External {
External::from_u64(self.min_value + out * self.gcd)
}
}
/// Strictly monotonic mapping with a base value.
pub(crate) struct StrictlyMonotonicMappingToInternalBaseval {
min_value: u64,
}
impl StrictlyMonotonicMappingToInternalBaseval {
pub(crate) fn new(min_value: u64) -> Self {
Self { min_value }
}
}
impl<External: MonotonicallyMappableToU64> StrictlyMonotonicFn<External, u64>
for StrictlyMonotonicMappingToInternalBaseval
{
fn mapping(&self, val: External) -> u64 {
External::to_u64(val) - self.min_value
}
fn inverse(&self, val: u64) -> External {
External::from_u64(self.min_value + val)
}
}
impl MonotonicallyMappableToU64 for u64 {
fn to_u64(self) -> u64 {
self
@@ -54,3 +201,33 @@ impl MonotonicallyMappableToU64 for f64 {
common::u64_to_f64(val)
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn strictly_monotonic_test() {
// identity mapping
test_round_trip(&StrictlyMonotonicMappingToInternal::<u64>::new(), 100u64);
// round trip to i64
test_round_trip(&StrictlyMonotonicMappingToInternal::<i64>::new(), 100u64);
// identity mapping
test_round_trip(&StrictlyMonotonicMappingToInternal::<u128>::new(), 100u128);
// base value to i64 round trip
let mapping = StrictlyMonotonicMappingToInternalBaseval::new(100);
test_round_trip::<_, _, u64>(&mapping, 100i64);
// base value and gcd to u64 round trip
let mapping = StrictlyMonotonicMappingToInternalGCDBaseval::new(10, 100);
test_round_trip::<_, _, u64>(&mapping, 100u64);
}
fn test_round_trip<T: StrictlyMonotonicFn<K, L>, K: std::fmt::Debug + Eq + Copy, L>(
mapping: &T,
test_val: K,
) {
assert_eq!(mapping.inverse(mapping.mapping(test_val)), test_val);
}
}

View File

@@ -1,5 +1,7 @@
use std::net::{IpAddr, Ipv6Addr};
use std::net::Ipv6Addr;
/// Montonic maps a value to u128 value space
/// Monotonic mapping enables `PartialOrd` on u128 space without conversion to original space.
pub trait MonotonicallyMappableToU128: 'static + PartialOrd + Copy + Send + Sync {
/// Converts a value to u128.
///
@@ -23,20 +25,16 @@ impl MonotonicallyMappableToU128 for u128 {
}
}
impl MonotonicallyMappableToU128 for IpAddr {
impl MonotonicallyMappableToU128 for Ipv6Addr {
fn to_u128(self) -> u128 {
ip_to_u128(self)
}
fn from_u128(val: u128) -> Self {
IpAddr::from(val.to_be_bytes())
Ipv6Addr::from(val.to_be_bytes())
}
}
fn ip_to_u128(ip_addr: IpAddr) -> u128 {
let ip_addr_v6: Ipv6Addr = match ip_addr {
IpAddr::V4(v4) => v4.to_ipv6_mapped(),
IpAddr::V6(v6) => v6,
};
u128::from_be_bytes(ip_addr_v6.octets())
fn ip_to_u128(ip_addr: Ipv6Addr) -> u128 {
u128::from_be_bytes(ip_addr.octets())
}

View File

@@ -22,7 +22,6 @@ use std::num::NonZeroU64;
use std::sync::Arc;
use common::{BinarySerializable, VInt};
use fastdivide::DividerU64;
use log::warn;
use ownedbytes::OwnedBytes;
@@ -30,6 +29,10 @@ use crate::bitpacked::BitpackedCodec;
use crate::blockwise_linear::BlockwiseLinearCodec;
use crate::compact_space::CompactSpaceCompressor;
use crate::linear::LinearCodec;
use crate::monotonic_mapping::{
StrictlyMonotonicFn, StrictlyMonotonicMappingToInternal,
StrictlyMonotonicMappingToInternalGCDBaseval,
};
use crate::{
monotonic_map_column, Column, FastFieldCodec, FastFieldCodecType, MonotonicallyMappableToU64,
VecColumn, ALL_CODEC_TYPES,
@@ -37,12 +40,14 @@ use crate::{
/// The normalized header gives some parameters after applying the following
/// normalization of the vector:
/// val -> (val - min_value) / gcd
/// `val -> (val - min_value) / gcd`
///
/// By design, after normalization, `min_value = 0` and `gcd = 1`.
#[derive(Debug, Copy, Clone)]
pub struct NormalizedHeader {
/// The number of values in the underlying column.
pub num_vals: u64,
/// The max value of the underlying column.
pub max_value: u64,
}
@@ -57,8 +62,11 @@ pub(crate) struct Header {
impl Header {
pub fn normalized(self) -> NormalizedHeader {
let max_value =
(self.max_value - self.min_value) / self.gcd.map(|gcd| gcd.get()).unwrap_or(1);
let gcd = self.gcd.map(|gcd| gcd.get()).unwrap_or(1);
let gcd_min_val_mapping =
StrictlyMonotonicMappingToInternalGCDBaseval::new(gcd, self.min_value);
let max_value = gcd_min_val_mapping.mapping(self.max_value);
NormalizedHeader {
num_vals: self.num_vals,
max_value,
@@ -66,10 +74,7 @@ impl Header {
}
pub fn normalize_column<C: Column>(&self, from_column: C) -> impl Column {
let min_value = self.min_value;
let gcd = self.gcd.map(|gcd| gcd.get()).unwrap_or(1);
let divider = DividerU64::divide_by(gcd);
monotonic_map_column(from_column, move |val| divider.divide(val - min_value))
normalize_column(from_column, self.min_value, self.gcd)
}
pub fn compute_header(
@@ -81,9 +86,8 @@ impl Header {
let max_value = column.max_value();
let gcd = crate::gcd::find_gcd(column.iter().map(|val| val - min_value))
.filter(|gcd| gcd.get() > 1u64);
let divider = DividerU64::divide_by(gcd.map(|gcd| gcd.get()).unwrap_or(1u64));
let shifted_column = monotonic_map_column(&column, |val| divider.divide(val - min_value));
let codec_type = detect_codec(shifted_column, codecs)?;
let normalized_column = normalize_column(column, min_value, gcd);
let codec_type = detect_codec(normalized_column, codecs)?;
Some(Header {
num_vals,
min_value,
@@ -94,6 +98,16 @@ impl Header {
}
}
pub fn normalize_column<C: Column>(
from_column: C,
min_value: u64,
gcd: Option<NonZeroU64>,
) -> impl Column {
let gcd = gcd.map(|gcd| gcd.get()).unwrap_or(1);
let mapping = StrictlyMonotonicMappingToInternalGCDBaseval::new(gcd, min_value);
monotonic_map_column(from_column, mapping)
}
impl BinarySerializable for Header {
fn serialize<W: io::Write>(&self, writer: &mut W) -> io::Result<()> {
VInt(self.num_vals).serialize(writer)?;
@@ -125,16 +139,21 @@ impl BinarySerializable for Header {
}
}
/// Return estimated compression for given codec in the value range [0.0..1.0], where 1.0 means no
/// compression.
pub fn estimate<T: MonotonicallyMappableToU64>(
typed_column: impl Column<T>,
codec_type: FastFieldCodecType,
) -> Option<f32> {
let column = monotonic_map_column(typed_column, T::to_u64);
let column = monotonic_map_column(typed_column, StrictlyMonotonicMappingToInternal::<T>::new());
let min_value = column.min_value();
let gcd = crate::gcd::find_gcd(column.iter().map(|val| val - min_value))
.filter(|gcd| gcd.get() > 1u64);
let divider = DividerU64::divide_by(gcd.map(|gcd| gcd.get()).unwrap_or(1u64));
let normalized_column = monotonic_map_column(&column, |val| divider.divide(val - min_value));
let mapping = StrictlyMonotonicMappingToInternalGCDBaseval::new(
gcd.map(|gcd| gcd.get()).unwrap_or(1u64),
min_value,
);
let normalized_column = monotonic_map_column(&column, mapping);
match codec_type {
FastFieldCodecType::Bitpacked => BitpackedCodec::estimate(&normalized_column),
FastFieldCodecType::Linear => LinearCodec::estimate(&normalized_column),
@@ -142,25 +161,26 @@ pub fn estimate<T: MonotonicallyMappableToU64>(
}
}
pub fn serialize_u128(
typed_column: impl Column<u128>,
/// Serializes u128 values with the compact space codec.
pub fn serialize_u128<F: Fn() -> I, I: Iterator<Item = u128>>(
iter_gen: F,
num_vals: u64,
output: &mut impl io::Write,
) -> io::Result<()> {
// TODO write header, to later support more codecs
let compressor = CompactSpaceCompressor::train_from(&typed_column);
compressor
.compress_into(typed_column.iter(), output)
.unwrap();
let compressor = CompactSpaceCompressor::train_from(iter_gen(), num_vals);
compressor.compress_into(iter_gen(), output).unwrap();
Ok(())
}
/// Serializes the column with the codec with the best estimate on the data.
pub fn serialize<T: MonotonicallyMappableToU64>(
typed_column: impl Column<T>,
output: &mut impl io::Write,
codecs: &[FastFieldCodecType],
) -> io::Result<()> {
let column = monotonic_map_column(typed_column, T::to_u64);
let column = monotonic_map_column(typed_column, StrictlyMonotonicMappingToInternal::<T>::new());
let header = Header::compute_header(&column, codecs).ok_or_else(|| {
io::Error::new(
io::ErrorKind::InvalidInput,
@@ -225,6 +245,7 @@ fn serialize_given_codec(
Ok(())
}
/// Helper function to serialize a column (autodetect from all codecs) and then open it
pub fn serialize_and_load<T: MonotonicallyMappableToU64 + Ord + Default>(
column: &[T],
) -> Arc<dyn Column<T>> {

View File

@@ -452,7 +452,7 @@ fn intermediate_buckets_to_final_buckets_fill_gaps(
histogram_req: &HistogramAggregation,
sub_aggregation: &AggregationsInternal,
) -> crate::Result<Vec<BucketEntry>> {
// Generate the the full list of buckets without gaps.
// Generate the full list of buckets without gaps.
//
// The bounds are the min max from the current buckets, optionally extended by
// extended_bounds from the request

View File

@@ -323,8 +323,8 @@ impl SegmentRangeCollector {
/// Converts the user provided f64 range value to fast field value space.
///
/// Internally fast field values are always stored as u64.
/// If the fast field has u64 [1,2,5], these values are stored as is in the fast field.
/// A fast field with f64 [1.0, 2.0, 5.0] is converted to u64 space, using a
/// If the fast field has u64 `[1, 2, 5]`, these values are stored as is in the fast field.
/// A fast field with f64 `[1.0, 2.0, 5.0]` is converted to u64 space, using a
/// monotonic mapping function, so the order is preserved.
///
/// Consequently, a f64 user range 1.0..3.0 needs to be converted to fast field value space using

View File

@@ -17,7 +17,11 @@ use crate::fastfield::MultiValuedFastFieldReader;
use crate::schema::Type;
use crate::{DocId, TantivyError};
/// Creates a bucket for every unique term
/// Creates a bucket for every unique term and counts the number of occurences.
/// Note that doc_count in the response buckets equals term count here.
///
/// If the text is untokenized and single value, that means one term per document and therefore it
/// is in fact doc count.
///
/// ### Terminology
/// Shard parameters are supposed to be equivalent to elasticsearch shard parameter.
@@ -64,6 +68,25 @@ use crate::{DocId, TantivyError};
/// }
/// }
/// ```
///
/// /// # Response JSON Format
/// ```json
/// {
/// ...
/// "aggregations": {
/// "genres": {
/// "doc_count_error_upper_bound": 0,
/// "sum_other_doc_count": 0,
/// "buckets": [
/// { "key": "drumnbass", "doc_count": 6 },
/// { "key": "raggae", "doc_count": 4 },
/// { "key": "jazz", "doc_count": 2 }
/// ]
/// }
/// }
/// }
/// ```
#[derive(Clone, Debug, Default, PartialEq, Serialize, Deserialize)]
pub struct TermsAggregation {
/// The field to aggregate on.
@@ -1206,11 +1229,43 @@ mod tests {
.collect();
let res = exec_request_with_query(agg_req, &index, None);
assert!(res.is_err());
Ok(())
}
#[test]
fn terms_aggregation_multi_token_per_doc() -> crate::Result<()> {
let terms = vec!["Hello Hello", "Hallo Hallo"];
let index = get_test_index_from_terms(true, &[terms])?;
let agg_req: Aggregations = vec![(
"my_texts".to_string(),
Aggregation::Bucket(BucketAggregation {
bucket_agg: BucketAggregationType::Terms(TermsAggregation {
field: "text_id".to_string(),
min_doc_count: Some(0),
..Default::default()
}),
sub_aggregation: Default::default(),
}),
)]
.into_iter()
.collect();
let res = exec_request_with_query(agg_req, &index, None).unwrap();
assert_eq!(res["my_texts"]["buckets"][0]["key"], "hello");
assert_eq!(res["my_texts"]["buckets"][0]["doc_count"], 2);
assert_eq!(res["my_texts"]["buckets"][1]["key"], "hallo");
assert_eq!(res["my_texts"]["buckets"][1]["doc_count"], 2);
Ok(())
}
#[test]
fn test_json_format() -> crate::Result<()> {
let agg_req: Aggregations = vec![(

View File

@@ -10,21 +10,19 @@
//!
//! There are two categories: [Metrics](metric) and [Buckets](bucket).
//!
//! # Usage
//!
//! ## Prerequisite
//! Currently aggregations work only on [fast fields](`crate::fastfield`). Single value fast fields
//! of type `u64`, `f64`, `i64` and fast fields on text fields.
//!
//! ## Usage
//! To use aggregations, build an aggregation request by constructing
//! [`Aggregations`](agg_req::Aggregations).
//! Create an [`AggregationCollector`] from this request. `AggregationCollector` implements the
//! [`Collector`](crate::collector::Collector) trait and can be passed as collector into
//! [`Searcher::search()`](crate::Searcher::search).
//!
//! #### Limitations
//!
//! Currently aggregations work only on single value fast fields of type `u64`, `f64`, `i64` and
//! fast fields on text fields.
//!
//! # JSON Format
//! ## JSON Format
//! Aggregations request and result structures de/serialize into elasticsearch compatible JSON.
//!
//! ```verbatim
@@ -35,7 +33,7 @@
//! let json_response_string: String = &serde_json::to_string(&agg_res)?;
//! ```
//!
//! # Supported Aggregations
//! ## Supported Aggregations
//! - [Bucket](bucket)
//! - [Histogram](bucket::HistogramAggregation)
//! - [Range](bucket::RangeAggregation)

View File

@@ -338,11 +338,7 @@ impl SegmentCollector for FacetSegmentCollector {
let mut previous_collapsed_ord: usize = usize::MAX;
for &facet_ord in &self.facet_ords_buf {
let collapsed_ord = self.collapse_mapping[facet_ord as usize];
self.counts[collapsed_ord] += if collapsed_ord == previous_collapsed_ord {
0
} else {
1
};
self.counts[collapsed_ord] += u64::from(collapsed_ord != previous_collapsed_ord);
previous_collapsed_ord = collapsed_ord;
}
}

View File

@@ -19,7 +19,7 @@ use crate::error::{DataCorruption, TantivyError};
use crate::indexer::index_writer::{MAX_NUM_THREAD, MEMORY_ARENA_NUM_BYTES_MIN};
use crate::indexer::segment_updater::save_metas;
use crate::reader::{IndexReader, IndexReaderBuilder};
use crate::schema::{Field, FieldType, Schema};
use crate::schema::{Cardinality, Field, FieldType, Schema};
use crate::tokenizer::{TextAnalyzer, TokenizerManager};
use crate::IndexWriter;
@@ -152,9 +152,7 @@ impl IndexBuilder {
/// This should only be used for unit tests.
pub fn create_in_ram(self) -> Result<Index, TantivyError> {
let ram_directory = RamDirectory::create();
Ok(self
.create(ram_directory)
.expect("Creating a RamDirectory should never fail"))
self.create(ram_directory)
}
/// Creates a new index in a given filepath.
@@ -228,10 +226,44 @@ impl IndexBuilder {
))
}
}
fn validate(&self) -> crate::Result<()> {
if let Some(schema) = self.schema.as_ref() {
if let Some(sort_by_field) = self.index_settings.sort_by_field.as_ref() {
let schema_field = schema.get_field(&sort_by_field.field).ok_or_else(|| {
TantivyError::InvalidArgument(format!(
"Field to sort index {} not found in schema",
sort_by_field.field
))
})?;
let entry = schema.get_field_entry(schema_field);
if !entry.is_fast() {
return Err(TantivyError::InvalidArgument(format!(
"Field {} is no fast field. Field needs to be a single value fast field \
to be used to sort an index",
sort_by_field.field
)));
}
if entry.field_type().fastfield_cardinality() != Some(Cardinality::SingleValue) {
return Err(TantivyError::InvalidArgument(format!(
"Only single value fast field Cardinality supported for sorting index {}",
sort_by_field.field
)));
}
}
Ok(())
} else {
Err(TantivyError::InvalidArgument(
"no schema passed".to_string(),
))
}
}
/// Creates a new index given an implementation of the trait `Directory`.
///
/// If a directory previously existed, it will be erased.
fn create<T: Into<Box<dyn Directory>>>(self, dir: T) -> crate::Result<Index> {
self.validate()?;
let dir = dir.into();
let directory = ManagedDirectory::wrap(dir)?;
save_new_metas(

View File

@@ -15,12 +15,11 @@ use crate::termdict::TermDictionary;
///
/// It is safe to delete the segment associated with
/// an `InvertedIndexReader`. As long as it is open,
/// the `FileSlice` it is relying on should
/// the [`FileSlice`] it is relying on should
/// stay available.
///
///
/// `InvertedIndexReader` are created by calling
/// the `SegmentReader`'s [`.inverted_index(...)`] method
/// [`SegmentReader::inverted_index()`](crate::SegmentReader::inverted_index).
pub struct InvertedIndexReader {
termdict: TermDictionary,
postings_file_slice: FileSlice,
@@ -75,7 +74,7 @@ impl InvertedIndexReader {
///
/// This is useful for enumerating through a list of terms,
/// and consuming the associated posting lists while avoiding
/// reallocating a `BlockSegmentPostings`.
/// reallocating a [`BlockSegmentPostings`].
///
/// # Warning
///
@@ -96,7 +95,7 @@ impl InvertedIndexReader {
/// Returns a block postings given a `Term`.
/// This method is for an advanced usage only.
///
/// Most user should prefer using `read_postings` instead.
/// Most users should prefer using [`Self::read_postings()`] instead.
pub fn read_block_postings(
&self,
term: &Term,
@@ -110,7 +109,7 @@ impl InvertedIndexReader {
/// Returns a block postings given a `term_info`.
/// This method is for an advanced usage only.
///
/// Most user should prefer using `read_postings` instead.
/// Most users should prefer using [`Self::read_postings()`] instead.
pub fn read_block_postings_from_terminfo(
&self,
term_info: &TermInfo,
@@ -130,7 +129,7 @@ impl InvertedIndexReader {
/// Returns a posting object given a `term_info`.
/// This method is for an advanced usage only.
///
/// Most user should prefer using `read_postings` instead.
/// Most users should prefer using [`Self::read_postings()`] instead.
pub fn read_postings_from_terminfo(
&self,
term_info: &TermInfo,
@@ -164,12 +163,12 @@ impl InvertedIndexReader {
/// or `None` if the term has never been encountered and indexed.
///
/// If the field was not indexed with the indexing options that cover
/// the requested options, the returned `SegmentPostings` the method does not fail
/// the requested options, the returned [`SegmentPostings`] the method does not fail
/// and returns a `SegmentPostings` with as much information as possible.
///
/// For instance, requesting `IndexRecordOption::Freq` for a
/// `TextIndexingOptions` that does not index position will return a `SegmentPostings`
/// with `DocId`s and frequencies.
/// For instance, requesting [`IndexRecordOption::WithFreqs`] for a
/// [`TextOptions`](crate::schema::TextOptions) that does not index position
/// will return a [`SegmentPostings`] with `DocId`s and frequencies.
pub fn read_postings(
&self,
term: &Term,
@@ -211,7 +210,7 @@ impl InvertedIndexReader {
/// Returns a block postings given a `Term`.
/// This method is for an advanced usage only.
///
/// Most user should prefer using `read_postings` instead.
/// Most users should prefer using [`Self::read_postings()`] instead.
pub async fn warm_postings(
&self,
term: &Term,

View File

@@ -216,10 +216,10 @@ impl SegmentReader {
/// term dictionary associated with a specific field,
/// and opening the posting list associated with any term.
///
/// If the field is not marked as index, a warn is logged and an empty `InvertedIndexReader`
/// If the field is not marked as index, a warning is logged and an empty `InvertedIndexReader`
/// is returned.
/// Similarly if the field is marked as indexed but no term has been indexed for the given
/// index. an empty `InvertedIndexReader` is returned (but no warning is logged).
/// Similarly, if the field is marked as indexed but no term has been indexed for the given
/// index, an empty `InvertedIndexReader` is returned (but no warning is logged).
pub fn inverted_index(&self, field: Field) -> crate::Result<Arc<InvertedIndexReader>> {
if let Some(inv_idx_reader) = self
.inv_idx_reader_cache

View File

@@ -13,8 +13,8 @@ use crate::directory::OwnedBytes;
/// By contract, whatever happens to the directory file, as long as a FileHandle
/// is alive, the data associated with it cannot be altered or destroyed.
///
/// The underlying behavior is therefore specific to the `Directory` that created it.
/// Despite its name, a `FileSlice` may or may not directly map to an actual file
/// The underlying behavior is therefore specific to the [`Directory`](crate::Directory) that
/// created it. Despite its name, a [`FileSlice`] may or may not directly map to an actual file
/// on the filesystem.
#[async_trait]

View File

@@ -304,7 +304,7 @@ pub(crate) fn atomic_write(path: &Path, content: &[u8]) -> io::Result<()> {
"Path {:?} does not have parent directory.",
)
})?;
let mut tempfile = tempfile::Builder::new().tempfile_in(&parent_path)?;
let mut tempfile = tempfile::Builder::new().tempfile_in(parent_path)?;
tempfile.write_all(content)?;
tempfile.flush()?;
tempfile.as_file_mut().sync_data()?;
@@ -571,9 +571,21 @@ mod tests {
assert_eq!(mmap_directory.get_cache_info().mmapped.len(), 0);
}
fn assert_eventually<P: Fn() -> Option<String>>(predicate: P) {
for _ in 0..30 {
if predicate().is_none() {
break;
}
std::thread::sleep(Duration::from_millis(200));
}
if let Some(error_msg) = predicate() {
panic!("{}", error_msg);
}
}
#[test]
fn test_mmap_released() -> crate::Result<()> {
let mmap_directory = MmapDirectory::create_from_tempdir()?;
fn test_mmap_released() {
let mmap_directory = MmapDirectory::create_from_tempdir().unwrap();
let mut schema_builder: SchemaBuilder = Schema::builder();
let text_field = schema_builder.add_text_field("text", TEXT);
let schema = schema_builder.build();
@@ -582,47 +594,56 @@ mod tests {
let index =
Index::create(mmap_directory.clone(), schema, IndexSettings::default()).unwrap();
let mut index_writer = index.writer_for_tests()?;
let mut index_writer = index.writer_for_tests().unwrap();
let mut log_merge_policy = LogMergePolicy::default();
log_merge_policy.set_min_num_segments(3);
index_writer.set_merge_policy(Box::new(log_merge_policy));
for _num_commits in 0..10 {
for _ in 0..10 {
index_writer.add_document(doc!(text_field=>"abc"))?;
index_writer.add_document(doc!(text_field=>"abc")).unwrap();
}
index_writer.commit()?;
index_writer.commit().unwrap();
}
let reader = index
.reader_builder()
.reload_policy(ReloadPolicy::Manual)
.try_into()?;
.try_into()
.unwrap();
for _ in 0..4 {
index_writer.add_document(doc!(text_field=>"abc"))?;
index_writer.commit()?;
reader.reload()?;
index_writer.add_document(doc!(text_field=>"abc")).unwrap();
index_writer.commit().unwrap();
reader.reload().unwrap();
}
index_writer.wait_merging_threads()?;
index_writer.wait_merging_threads().unwrap();
reader.reload()?;
reader.reload().unwrap();
let num_segments = reader.searcher().segment_readers().len();
assert!(num_segments <= 4);
let num_components_except_deletes_and_tempstore =
crate::core::SegmentComponent::iterator().len() - 2;
assert_eq!(
num_segments * num_components_except_deletes_and_tempstore,
mmap_directory.get_cache_info().mmapped.len()
);
let max_num_mmapped = num_components_except_deletes_and_tempstore * num_segments;
assert_eventually(|| {
let num_mmapped = mmap_directory.get_cache_info().mmapped.len();
if num_mmapped > max_num_mmapped {
Some(format!(
"Expected at most {max_num_mmapped} mmapped files, got {num_mmapped}"
))
} else {
None
}
});
}
// This test failed on CI. The last Mmap is dropped from the merging thread so there might
// be a race condition indeed.
for _ in 0..10 {
if mmap_directory.get_cache_info().mmapped.is_empty() {
return Ok(());
assert_eventually(|| {
let num_mmapped = mmap_directory.get_cache_info().mmapped.len();
if num_mmapped > 0 {
Some(format!("Expected no mmapped files, got {num_mmapped}"))
} else {
None
}
std::thread::sleep(Duration::from_millis(200));
}
panic!("The cache still contains information. One of the Mmap has not been dropped.");
});
}
}

View File

@@ -57,14 +57,15 @@ impl BytesFastFieldWriter {
/// Shift to the next document and add all of the
/// matching field values present in the document.
pub fn add_document(&mut self, doc: &Document) {
pub fn add_document(&mut self, doc: &Document) -> crate::Result<()> {
self.next_doc();
for field_value in doc.get_all(self.field) {
if let Value::Bytes(ref bytes) = field_value {
self.vals.extend_from_slice(bytes);
return;
return Ok(());
}
}
Ok(())
}
/// Register the bytes associated with a document.

View File

@@ -7,16 +7,15 @@
//! It is designed for the fast random access of some document
//! fields given a document id.
//!
//! `FastField` are useful when a field is required for all or most of
//! the `DocSet` : for instance for scoring, grouping, filtering, or faceting.
//! Fast fields are useful when a field is required for all or most of
//! the `DocSet`: for instance for scoring, grouping, aggregation, filtering, or faceting.
//!
//!
//! Fields have to be declared as `FAST` in the schema.
//! Currently supported fields are: u64, i64, f64 and bytes.
//! Fields have to be declared as `FAST` in the schema.
//! Currently supported fields are: u64, i64, f64, bytes and text.
//!
//! u64, i64 and f64 fields are stored in a bit-packed fashion so that
//! their memory usage is directly linear with the amplitude of the
//! values stored.
//! Fast fields are stored in with [different codecs](fastfield_codecs). The best codec is detected
//! automatically, when serializing.
//!
//! Read access performance is comparable to that of an array lookup.
@@ -26,11 +25,15 @@ pub use self::alive_bitset::{intersect_alive_bitsets, write_alive_bitset, AliveB
pub use self::bytes::{BytesFastFieldReader, BytesFastFieldWriter};
pub use self::error::{FastFieldNotAvailableError, Result};
pub use self::facet_reader::FacetReader;
pub(crate) use self::multivalued::MultivalueStartIndex;
pub use self::multivalued::{MultiValuedFastFieldReader, MultiValuedFastFieldWriter};
pub(crate) use self::multivalued::{get_fastfield_codecs_for_multivalue, MultivalueStartIndex};
pub use self::multivalued::{
MultiValueU128FastFieldWriter, MultiValuedFastFieldReader, MultiValuedFastFieldWriter,
MultiValuedU128FastFieldReader,
};
pub use self::readers::FastFieldReaders;
pub(crate) use self::readers::{type_and_cardinality, FastType};
pub use self::serializer::{Column, CompositeFastFieldSerializer};
use self::writer::unexpected_value;
pub use self::writer::{FastFieldsWriter, IntFastFieldWriter};
use crate::schema::{Type, Value};
use crate::{DateTime, DocId};
@@ -117,15 +120,16 @@ impl FastValue for DateTime {
}
}
fn value_to_u64(value: &Value) -> u64 {
match value {
fn value_to_u64(value: &Value) -> crate::Result<u64> {
let value = match value {
Value::U64(val) => val.to_u64(),
Value::I64(val) => val.to_u64(),
Value::F64(val) => val.to_u64(),
Value::Bool(val) => val.to_u64(),
Value::Date(val) => val.to_u64(),
_ => panic!("Expected a u64/i64/f64/bool/date field, got {:?} ", value),
}
_ => return Err(unexpected_value("u64/i64/f64/bool/date", value)),
};
Ok(value)
}
/// The fast field type
@@ -199,9 +203,15 @@ mod tests {
let write: WritePtr = directory.open_write(Path::new("test")).unwrap();
let mut serializer = CompositeFastFieldSerializer::from_write(write).unwrap();
let mut fast_field_writers = FastFieldsWriter::from_schema(&SCHEMA);
fast_field_writers.add_document(&doc!(*FIELD=>13u64));
fast_field_writers.add_document(&doc!(*FIELD=>14u64));
fast_field_writers.add_document(&doc!(*FIELD=>2u64));
fast_field_writers
.add_document(&doc!(*FIELD=>13u64))
.unwrap();
fast_field_writers
.add_document(&doc!(*FIELD=>14u64))
.unwrap();
fast_field_writers
.add_document(&doc!(*FIELD=>2u64))
.unwrap();
fast_field_writers
.serialize(&mut serializer, &HashMap::new(), None)
.unwrap();
@@ -226,15 +236,33 @@ mod tests {
let write: WritePtr = directory.open_write(Path::new("test"))?;
let mut serializer = CompositeFastFieldSerializer::from_write(write)?;
let mut fast_field_writers = FastFieldsWriter::from_schema(&SCHEMA);
fast_field_writers.add_document(&doc!(*FIELD=>4u64));
fast_field_writers.add_document(&doc!(*FIELD=>14_082_001u64));
fast_field_writers.add_document(&doc!(*FIELD=>3_052u64));
fast_field_writers.add_document(&doc!(*FIELD=>9_002u64));
fast_field_writers.add_document(&doc!(*FIELD=>15_001u64));
fast_field_writers.add_document(&doc!(*FIELD=>777u64));
fast_field_writers.add_document(&doc!(*FIELD=>1_002u64));
fast_field_writers.add_document(&doc!(*FIELD=>1_501u64));
fast_field_writers.add_document(&doc!(*FIELD=>215u64));
fast_field_writers
.add_document(&doc!(*FIELD=>4u64))
.unwrap();
fast_field_writers
.add_document(&doc!(*FIELD=>14_082_001u64))
.unwrap();
fast_field_writers
.add_document(&doc!(*FIELD=>3_052u64))
.unwrap();
fast_field_writers
.add_document(&doc!(*FIELD=>9_002u64))
.unwrap();
fast_field_writers
.add_document(&doc!(*FIELD=>15_001u64))
.unwrap();
fast_field_writers
.add_document(&doc!(*FIELD=>777u64))
.unwrap();
fast_field_writers
.add_document(&doc!(*FIELD=>1_002u64))
.unwrap();
fast_field_writers
.add_document(&doc!(*FIELD=>1_501u64))
.unwrap();
fast_field_writers
.add_document(&doc!(*FIELD=>215u64))
.unwrap();
fast_field_writers.serialize(&mut serializer, &HashMap::new(), None)?;
serializer.close()?;
}
@@ -270,7 +298,9 @@ mod tests {
let mut serializer = CompositeFastFieldSerializer::from_write(write).unwrap();
let mut fast_field_writers = FastFieldsWriter::from_schema(&SCHEMA);
for _ in 0..10_000 {
fast_field_writers.add_document(&doc!(*FIELD=>100_000u64));
fast_field_writers
.add_document(&doc!(*FIELD=>100_000u64))
.unwrap();
}
fast_field_writers
.serialize(&mut serializer, &HashMap::new(), None)
@@ -303,9 +333,13 @@ mod tests {
let mut serializer = CompositeFastFieldSerializer::from_write(write).unwrap();
let mut fast_field_writers = FastFieldsWriter::from_schema(&SCHEMA);
// forcing the amplitude to be high
fast_field_writers.add_document(&doc!(*FIELD=>0u64));
fast_field_writers
.add_document(&doc!(*FIELD=>0u64))
.unwrap();
for i in 0u64..10_000u64 {
fast_field_writers.add_document(&doc!(*FIELD=>5_000_000_000_000_000_000u64 + i));
fast_field_writers
.add_document(&doc!(*FIELD=>5_000_000_000_000_000_000u64 + i))
.unwrap();
}
fast_field_writers
.serialize(&mut serializer, &HashMap::new(), None)
@@ -347,7 +381,7 @@ mod tests {
for i in -100i64..10_000i64 {
let mut doc = Document::default();
doc.add_i64(i64_field, i);
fast_field_writers.add_document(&doc);
fast_field_writers.add_document(&doc).unwrap();
}
fast_field_writers
.serialize(&mut serializer, &HashMap::new(), None)
@@ -392,7 +426,7 @@ mod tests {
let mut serializer = CompositeFastFieldSerializer::from_write(write).unwrap();
let mut fast_field_writers = FastFieldsWriter::from_schema(&schema);
let doc = Document::default();
fast_field_writers.add_document(&doc);
fast_field_writers.add_document(&doc).unwrap();
fast_field_writers
.serialize(&mut serializer, &HashMap::new(), None)
.unwrap();
@@ -435,7 +469,7 @@ mod tests {
let mut serializer = CompositeFastFieldSerializer::from_write(write)?;
let mut fast_field_writers = FastFieldsWriter::from_schema(&SCHEMA);
for &x in &permutation {
fast_field_writers.add_document(&doc!(*FIELD=>x));
fast_field_writers.add_document(&doc!(*FIELD=>x)).unwrap();
}
fast_field_writers.serialize(&mut serializer, &HashMap::new(), None)?;
serializer.close()?;
@@ -785,10 +819,14 @@ mod tests {
let write: WritePtr = directory.open_write(path).unwrap();
let mut serializer = CompositeFastFieldSerializer::from_write(write).unwrap();
let mut fast_field_writers = FastFieldsWriter::from_schema(&schema);
fast_field_writers.add_document(&doc!(field=>true));
fast_field_writers.add_document(&doc!(field=>false));
fast_field_writers.add_document(&doc!(field=>true));
fast_field_writers.add_document(&doc!(field=>false));
fast_field_writers.add_document(&doc!(field=>true)).unwrap();
fast_field_writers
.add_document(&doc!(field=>false))
.unwrap();
fast_field_writers.add_document(&doc!(field=>true)).unwrap();
fast_field_writers
.add_document(&doc!(field=>false))
.unwrap();
fast_field_writers
.serialize(&mut serializer, &HashMap::new(), None)
.unwrap();
@@ -822,8 +860,10 @@ mod tests {
let mut serializer = CompositeFastFieldSerializer::from_write(write).unwrap();
let mut fast_field_writers = FastFieldsWriter::from_schema(&schema);
for _ in 0..50 {
fast_field_writers.add_document(&doc!(field=>true));
fast_field_writers.add_document(&doc!(field=>false));
fast_field_writers.add_document(&doc!(field=>true)).unwrap();
fast_field_writers
.add_document(&doc!(field=>false))
.unwrap();
}
fast_field_writers
.serialize(&mut serializer, &HashMap::new(), None)
@@ -857,7 +897,7 @@ mod tests {
let mut serializer = CompositeFastFieldSerializer::from_write(write)?;
let mut fast_field_writers = FastFieldsWriter::from_schema(&schema);
let doc = Document::default();
fast_field_writers.add_document(&doc);
fast_field_writers.add_document(&doc).unwrap();
fast_field_writers.serialize(&mut serializer, &HashMap::new(), None)?;
serializer.close()?;
}
@@ -883,7 +923,7 @@ mod tests {
CompositeFastFieldSerializer::from_write_with_codec(write, codec_types).unwrap();
let mut fast_field_writers = FastFieldsWriter::from_schema(schema);
for doc in docs {
fast_field_writers.add_document(doc);
fast_field_writers.add_document(doc).unwrap();
}
fast_field_writers
.serialize(&mut serializer, &HashMap::new(), None)

View File

@@ -1,9 +1,21 @@
mod reader;
mod writer;
pub use self::reader::MultiValuedFastFieldReader;
pub use self::writer::MultiValuedFastFieldWriter;
use fastfield_codecs::FastFieldCodecType;
pub use self::reader::{MultiValuedFastFieldReader, MultiValuedU128FastFieldReader};
pub(crate) use self::writer::MultivalueStartIndex;
pub use self::writer::{MultiValueU128FastFieldWriter, MultiValuedFastFieldWriter};
/// The valid codecs for multivalue values excludes the linear interpolation codec.
///
/// This limitation is only valid for the values, not the offset index of the multivalue index.
pub(crate) fn get_fastfield_codecs_for_multivalue() -> [FastFieldCodecType; 2] {
[
FastFieldCodecType::Bitpacked,
FastFieldCodecType::BlockwiseLinear,
]
}
#[cfg(test)]
mod tests {

View File

@@ -1,7 +1,7 @@
use std::ops::Range;
use std::ops::{Range, RangeInclusive};
use std::sync::Arc;
use fastfield_codecs::Column;
use fastfield_codecs::{Column, MonotonicallyMappableToU128};
use crate::fastfield::{FastValue, MultiValueLength};
use crate::DocId;
@@ -99,12 +99,176 @@ impl<Item: FastValue> MultiValueLength for MultiValuedFastFieldReader<Item> {
self.total_num_vals() as u64
}
}
/// Reader for a multivalued `u128` fast field.
///
/// The reader is implemented as a `u64` fast field for the index and a `u128` fast field.
///
/// The `vals_reader` will access the concatenated list of all
/// values for all reader.
/// The `idx_reader` associated, for each document, the index of its first value.
#[derive(Clone)]
pub struct MultiValuedU128FastFieldReader<T: MonotonicallyMappableToU128> {
idx_reader: Arc<dyn Column<u64>>,
vals_reader: Arc<dyn Column<T>>,
}
impl<T: MonotonicallyMappableToU128> MultiValuedU128FastFieldReader<T> {
pub(crate) fn open(
idx_reader: Arc<dyn Column<u64>>,
vals_reader: Arc<dyn Column<T>>,
) -> MultiValuedU128FastFieldReader<T> {
Self {
idx_reader,
vals_reader,
}
}
/// Returns `[start, end)`, such that the values associated
/// to the given document are `start..end`.
#[inline]
fn range(&self, doc: DocId) -> Range<u64> {
let start = self.idx_reader.get_val(doc as u64);
let end = self.idx_reader.get_val(doc as u64 + 1);
start..end
}
/// Returns the array of values associated to the given `doc`.
#[inline]
pub fn get_first_val(&self, doc: DocId) -> Option<T> {
let range = self.range(doc);
if range.is_empty() {
return None;
}
Some(self.vals_reader.get_val(range.start))
}
/// Returns the array of values associated to the given `doc`.
#[inline]
fn get_vals_for_range(&self, range: Range<u64>, vals: &mut Vec<T>) {
let len = (range.end - range.start) as usize;
vals.resize(len, T::from_u128(0));
self.vals_reader.get_range(range.start, &mut vals[..]);
}
/// Returns the array of values associated to the given `doc`.
#[inline]
pub fn get_vals(&self, doc: DocId, vals: &mut Vec<T>) {
let range = self.range(doc);
self.get_vals_for_range(range, vals);
}
/// Returns all docids which are in the provided value range
pub fn get_between_vals(&self, range: RangeInclusive<T>) -> Vec<DocId> {
let positions = self.vals_reader.get_between_vals(range);
positions_to_docids(&positions, self.idx_reader.as_ref())
}
/// Iterates over all elements in the fast field
pub fn iter(&self) -> impl Iterator<Item = T> + '_ {
self.vals_reader.iter()
}
/// Returns the minimum value for this fast field.
///
/// The min value does not take in account of possible
/// deleted document, and should be considered as a lower bound
/// of the actual mimimum value.
pub fn min_value(&self) -> T {
self.vals_reader.min_value()
}
/// Returns the maximum value for this fast field.
///
/// The max value does not take in account of possible
/// deleted document, and should be considered as an upper bound
/// of the actual maximum value.
pub fn max_value(&self) -> T {
self.vals_reader.max_value()
}
/// Returns the number of values associated with the document `DocId`.
#[inline]
pub fn num_vals(&self, doc: DocId) -> usize {
let range = self.range(doc);
(range.end - range.start) as usize
}
/// Returns the overall number of values in this field.
#[inline]
pub fn total_num_vals(&self) -> u64 {
self.idx_reader.max_value()
}
}
impl<T: MonotonicallyMappableToU128> MultiValueLength for MultiValuedU128FastFieldReader<T> {
fn get_range(&self, doc_id: DocId) -> std::ops::Range<u64> {
self.range(doc_id)
}
fn get_len(&self, doc_id: DocId) -> u64 {
self.num_vals(doc_id) as u64
}
fn get_total_len(&self) -> u64 {
self.total_num_vals() as u64
}
}
/// Converts a list of positions of values in a 1:n index to the corresponding list of DocIds.
///
/// Since there is no index for value pos -> docid, but docid -> value pos range, we scan the index.
///
/// Correctness: positions needs to be sorted. idx_reader needs to contain monotonically increasing
/// positions.
///
/// TODO: Instead of a linear scan we can employ a expotential search into binary search to match a
/// docid to its value position.
fn positions_to_docids<C: Column + ?Sized>(positions: &[u64], idx_reader: &C) -> Vec<DocId> {
let mut docs = vec![];
let mut cur_doc = 0u32;
let mut last_doc = None;
for pos in positions {
loop {
let end = idx_reader.get_val(cur_doc as u64 + 1);
if end > *pos {
// avoid duplicates
if Some(cur_doc) == last_doc {
break;
}
docs.push(cur_doc);
last_doc = Some(cur_doc);
break;
}
cur_doc += 1;
}
}
docs
}
#[cfg(test)]
mod tests {
use fastfield_codecs::VecColumn;
use crate::core::Index;
use crate::fastfield::multivalued::reader::positions_to_docids;
use crate::schema::{Cardinality, Facet, FacetOptions, NumericOptions, Schema};
#[test]
fn test_positions_to_docid() {
let positions = vec![10u64, 11, 15, 20, 21, 22];
let offsets = vec![0, 10, 12, 15, 22, 23];
{
let column = VecColumn::from(&offsets);
let docids = positions_to_docids(&positions, &column);
assert_eq!(docids, vec![1, 3, 4]);
}
}
#[test]
fn test_multifastfield_reader() -> crate::Result<()> {
let mut schema_builder = Schema::builder();

View File

@@ -1,9 +1,12 @@
use std::io;
use std::sync::Mutex;
use fastfield_codecs::{Column, MonotonicallyMappableToU64, VecColumn};
use fastfield_codecs::{
Column, MonotonicallyMappableToU128, MonotonicallyMappableToU64, VecColumn,
};
use fnv::FnvHashMap;
use super::get_fastfield_codecs_for_multivalue;
use crate::fastfield::writer::unexpected_value;
use crate::fastfield::{value_to_u64, CompositeFastFieldSerializer, FastFieldType};
use crate::indexer::doc_id_mapping::DocIdMapping;
use crate::postings::UnorderedTermId;
@@ -79,11 +82,11 @@ impl MultiValuedFastFieldWriter {
/// Shift to the next document and adds
/// all of the matching field values present in the document.
pub fn add_document(&mut self, doc: &Document) {
pub fn add_document(&mut self, doc: &Document) -> crate::Result<()> {
self.next_doc();
// facets/texts are indexed in the `SegmentWriter` as we encode their unordered id.
if self.fast_field_type.is_storing_term_ids() {
return;
return Ok(());
}
for field_value in doc.field_values() {
if field_value.field == self.field {
@@ -92,11 +95,12 @@ impl MultiValuedFastFieldWriter {
(Some(precision), Value::Date(date_val)) => {
date_val.truncate(precision).to_u64()
}
_ => value_to_u64(value),
_ => value_to_u64(value)?,
};
self.add_val(value_u64);
}
}
Ok(())
}
/// Returns an iterator over values per doc_id in ascending doc_id order.
@@ -195,7 +199,12 @@ impl MultiValuedFastFieldWriter {
}
}
let col = VecColumn::from(&values[..]);
serializer.create_auto_detect_u64_fast_field_with_idx(self.field, col, 1)?;
serializer.create_auto_detect_u64_fast_field_with_idx_and_codecs(
self.field,
col,
1,
&get_fastfield_codecs_for_multivalue(),
)?;
}
Ok(())
}
@@ -204,112 +213,197 @@ impl MultiValuedFastFieldWriter {
pub(crate) struct MultivalueStartIndex<'a, C: Column> {
column: &'a C,
doc_id_map: &'a DocIdMapping,
min_max_opt: Mutex<Option<(u64, u64)>>,
random_seeker: Mutex<MultivalueStartIndexRandomSeeker<'a, C>>,
}
struct MultivalueStartIndexRandomSeeker<'a, C: Column> {
seek_head: MultivalueStartIndexIter<'a, C>,
seek_next_id: u64,
}
impl<'a, C: Column> MultivalueStartIndexRandomSeeker<'a, C> {
fn new(column: &'a C, doc_id_map: &'a DocIdMapping) -> Self {
Self {
seek_head: MultivalueStartIndexIter {
column,
doc_id_map,
new_doc_id: 0,
offset: 0u64,
},
seek_next_id: 0u64,
}
}
min: u64,
max: u64,
}
impl<'a, C: Column> MultivalueStartIndex<'a, C> {
pub fn new(column: &'a C, doc_id_map: &'a DocIdMapping) -> Self {
assert_eq!(column.num_vals(), doc_id_map.num_old_doc_ids() as u64 + 1);
let (min, max) =
tantivy_bitpacker::minmax(iter_remapped_multivalue_index(doc_id_map, column))
.unwrap_or((0u64, 0u64));
MultivalueStartIndex {
column,
doc_id_map,
min_max_opt: Mutex::default(),
random_seeker: Mutex::new(MultivalueStartIndexRandomSeeker::new(column, doc_id_map)),
min,
max,
}
}
fn minmax(&self) -> (u64, u64) {
if let Some((min, max)) = *self.min_max_opt.lock().unwrap() {
return (min, max);
}
let (min, max) = tantivy_bitpacker::minmax(self.iter()).unwrap_or((0u64, 0u64));
*self.min_max_opt.lock().unwrap() = Some((min, max));
(min, max)
}
}
impl<'a, C: Column> Column for MultivalueStartIndex<'a, C> {
fn get_val(&self, idx: u64) -> u64 {
let mut random_seeker_lock = self.random_seeker.lock().unwrap();
if random_seeker_lock.seek_next_id > idx {
*random_seeker_lock =
MultivalueStartIndexRandomSeeker::new(self.column, self.doc_id_map);
}
let to_skip = idx - random_seeker_lock.seek_next_id;
random_seeker_lock.seek_next_id = idx + 1;
random_seeker_lock.seek_head.nth(to_skip as usize).unwrap()
fn get_val(&self, _idx: u64) -> u64 {
unimplemented!()
}
fn min_value(&self) -> u64 {
self.minmax().0
self.min
}
fn max_value(&self) -> u64 {
self.minmax().1
self.max
}
fn num_vals(&self) -> u64 {
(self.doc_id_map.num_new_doc_ids() + 1) as u64
}
fn iter<'b>(&'b self) -> Box<dyn Iterator<Item = u64> + 'b> {
Box::new(MultivalueStartIndexIter::new(self.column, self.doc_id_map))
fn iter(&self) -> Box<dyn Iterator<Item = u64> + '_> {
Box::new(iter_remapped_multivalue_index(
self.doc_id_map,
&self.column,
))
}
}
struct MultivalueStartIndexIter<'a, C: Column> {
pub column: &'a C,
pub doc_id_map: &'a DocIdMapping,
pub new_doc_id: usize,
pub offset: u64,
fn iter_remapped_multivalue_index<'a, C: Column>(
doc_id_map: &'a DocIdMapping,
column: &'a C,
) -> impl Iterator<Item = u64> + 'a {
let mut offset = 0;
std::iter::once(0).chain(doc_id_map.iter_old_doc_ids().map(move |old_doc| {
let num_vals_for_doc = column.get_val(old_doc as u64 + 1) - column.get_val(old_doc as u64);
offset += num_vals_for_doc;
offset as u64
}))
}
impl<'a, C: Column> MultivalueStartIndexIter<'a, C> {
fn new(column: &'a C, doc_id_map: &'a DocIdMapping) -> Self {
Self {
column,
doc_id_map,
new_doc_id: 0,
offset: 0,
/// Writer for multi-valued (as in, more than one value per document)
/// int fast field.
///
/// This `Writer` is only useful for advanced users.
/// The normal way to get your multivalued int in your index
/// is to
/// - declare your field with fast set to `Cardinality::MultiValues`
/// in your schema
/// - add your document simply by calling `.add_document(...)`.
///
/// The `MultiValuedFastFieldWriter` can be acquired from the
pub struct MultiValueU128FastFieldWriter {
field: Field,
vals: Vec<u128>,
doc_index: Vec<u64>,
}
impl MultiValueU128FastFieldWriter {
/// Creates a new `U128MultiValueFastFieldWriter`
pub(crate) fn new(field: Field) -> Self {
MultiValueU128FastFieldWriter {
field,
vals: Vec::new(),
doc_index: Vec::new(),
}
}
/// The memory used (inclusive childs)
pub fn mem_usage(&self) -> usize {
self.vals.capacity() * std::mem::size_of::<UnorderedTermId>()
+ self.doc_index.capacity() * std::mem::size_of::<u64>()
}
/// Finalize the current document.
pub(crate) fn next_doc(&mut self) {
self.doc_index.push(self.vals.len() as u64);
}
/// Pushes a new value to the current document.
pub(crate) fn add_val(&mut self, val: u128) {
self.vals.push(val);
}
/// Shift to the next document and adds
/// all of the matching field values present in the document.
pub fn add_document(&mut self, doc: &Document) -> crate::Result<()> {
self.next_doc();
for field_value in doc.field_values() {
if field_value.field == self.field {
let value = field_value.value();
let ip_addr = value
.as_ip_addr()
.ok_or_else(|| unexpected_value("ip", value))?;
let ip_addr_u128 = ip_addr.to_u128();
self.add_val(ip_addr_u128);
}
}
Ok(())
}
/// Returns an iterator over values per doc_id in ascending doc_id order.
///
/// Normally the order is simply iterating self.doc_id_index.
/// With doc_id_map it accounts for the new mapping, returning values in the order of the
/// new doc_ids.
fn get_ordered_values<'a: 'b, 'b>(
&'a self,
doc_id_map: Option<&'b DocIdMapping>,
) -> impl Iterator<Item = &'b [u128]> {
get_ordered_values(&self.vals, &self.doc_index, doc_id_map)
}
/// Serializes fast field values.
pub fn serialize(
mut self,
serializer: &mut CompositeFastFieldSerializer,
doc_id_map: Option<&DocIdMapping>,
) -> io::Result<()> {
{
// writing the offset index
//
self.doc_index.push(self.vals.len() as u64);
let col = VecColumn::from(&self.doc_index[..]);
if let Some(doc_id_map) = doc_id_map {
let multi_value_start_index = MultivalueStartIndex::new(&col, doc_id_map);
serializer.create_auto_detect_u64_fast_field_with_idx(
self.field,
multi_value_start_index,
0,
)?;
} else {
serializer.create_auto_detect_u64_fast_field_with_idx(self.field, col, 0)?;
}
}
{
let iter_gen = || self.get_ordered_values(doc_id_map).flatten().cloned();
serializer.create_u128_fast_field_with_idx(
self.field,
iter_gen,
self.vals.len() as u64,
1,
)?;
}
Ok(())
}
}
impl<'a, C: Column> Iterator for MultivalueStartIndexIter<'a, C> {
type Item = u64;
/// Returns an iterator over values per doc_id in ascending doc_id order.
///
/// Normally the order is simply iterating self.doc_id_index.
/// With doc_id_map it accounts for the new mapping, returning values in the order of the
/// new doc_ids.
fn get_ordered_values<'a: 'b, 'b, T>(
vals: &'a [T],
doc_index: &'a [u64],
doc_id_map: Option<&'b DocIdMapping>,
) -> impl Iterator<Item = &'b [T]> {
let doc_id_iter: Box<dyn Iterator<Item = u32>> = if let Some(doc_id_map) = doc_id_map {
Box::new(doc_id_map.iter_old_doc_ids())
} else {
let max_doc = doc_index.len() as DocId;
Box::new(0..max_doc)
};
doc_id_iter.map(move |doc_id| get_values_for_doc_id(doc_id, vals, doc_index))
}
fn next(&mut self) -> Option<Self::Item> {
if self.new_doc_id > self.doc_id_map.num_new_doc_ids() {
return None;
}
let new_doc_id = self.new_doc_id;
self.new_doc_id += 1;
let start_offset = self.offset;
if new_doc_id < self.doc_id_map.num_new_doc_ids() {
let old_doc = self.doc_id_map.get_old_doc_id(new_doc_id as u32) as u64;
let num_vals_for_doc = self.column.get_val(old_doc + 1) - self.column.get_val(old_doc);
self.offset += num_vals_for_doc;
}
Some(start_offset)
}
/// returns all values for a doc_id
fn get_values_for_doc_id<'a, T>(doc_id: u32, vals: &'a [T], doc_index: &'a [u64]) -> &'a [T] {
let start_pos = doc_index[doc_id as usize] as usize;
let end_pos = doc_index
.get(doc_id as usize + 1)
.cloned()
.unwrap_or(vals.len() as u64) as usize; // special case, last doc_id has no offset information
&vals[start_pos..end_pos]
}
#[cfg(test)]
@@ -344,11 +438,5 @@ mod tests {
vec![0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
);
assert_eq!(multivalue_start_index.num_vals(), 11);
assert_eq!(multivalue_start_index.get_val(3), 2);
assert_eq!(multivalue_start_index.get_val(5), 5);
assert_eq!(multivalue_start_index.get_val(8), 21);
assert_eq!(multivalue_start_index.get_val(4), 3);
assert_eq!(multivalue_start_index.get_val(0), 0);
assert_eq!(multivalue_start_index.get_val(10), 55);
}
}

View File

@@ -1,7 +1,9 @@
use std::net::Ipv6Addr;
use std::sync::Arc;
use fastfield_codecs::{open, Column};
use fastfield_codecs::{open, open_u128, Column};
use super::multivalued::MultiValuedU128FastFieldReader;
use crate::directory::{CompositeFile, FileSlice};
use crate::fastfield::{
BytesFastFieldReader, FastFieldNotAvailableError, FastValue, MultiValuedFastFieldReader,
@@ -23,6 +25,7 @@ pub struct FastFieldReaders {
pub(crate) enum FastType {
I64,
U64,
U128,
F64,
Bool,
Date,
@@ -49,6 +52,9 @@ pub(crate) fn type_and_cardinality(field_type: &FieldType) -> Option<(FastType,
FieldType::Str(options) if options.is_fast() => {
Some((FastType::U64, Cardinality::MultiValues))
}
FieldType::IpAddr(options) => options
.get_fastfield_cardinality()
.map(|cardinality| (FastType::U128, cardinality)),
_ => None,
}
}
@@ -143,6 +149,59 @@ impl FastFieldReaders {
self.typed_fast_field_reader(field)
}
/// Returns the `ip` fast field reader reader associated to `field`.
///
/// If `field` is not a u128 fast field, this method returns an Error.
pub fn ip_addr(&self, field: Field) -> crate::Result<Arc<dyn Column<Ipv6Addr>>> {
self.check_type(field, FastType::U128, Cardinality::SingleValue)?;
let bytes = self.fast_field_data(field, 0)?.read_bytes()?;
Ok(open_u128::<Ipv6Addr>(bytes)?)
}
/// Returns the `ip` fast field reader reader associated to `field`.
///
/// If `field` is not a u128 fast field, this method returns an Error.
pub fn ip_addrs(
&self,
field: Field,
) -> crate::Result<MultiValuedU128FastFieldReader<Ipv6Addr>> {
self.check_type(field, FastType::U128, Cardinality::MultiValues)?;
let idx_reader: Arc<dyn Column<u64>> = self.typed_fast_field_reader(field)?;
let bytes = self.fast_field_data(field, 1)?.read_bytes()?;
let vals_reader = open_u128::<Ipv6Addr>(bytes)?;
Ok(MultiValuedU128FastFieldReader::open(
idx_reader,
vals_reader,
))
}
/// Returns the `u128` fast field reader reader associated to `field`.
///
/// If `field` is not a u128 fast field, this method returns an Error.
pub(crate) fn u128(&self, field: Field) -> crate::Result<Arc<dyn Column<u128>>> {
self.check_type(field, FastType::U128, Cardinality::SingleValue)?;
let bytes = self.fast_field_data(field, 0)?.read_bytes()?;
Ok(open_u128::<u128>(bytes)?)
}
/// Returns the `u128` multi-valued fast field reader reader associated to `field`.
///
/// If `field` is not a u128 multi-valued fast field, this method returns an Error.
pub fn u128s(&self, field: Field) -> crate::Result<MultiValuedU128FastFieldReader<u128>> {
self.check_type(field, FastType::U128, Cardinality::MultiValues)?;
let idx_reader: Arc<dyn Column<u64>> = self.typed_fast_field_reader(field)?;
let bytes = self.fast_field_data(field, 1)?.read_bytes()?;
let vals_reader = open_u128::<u128>(bytes)?;
Ok(MultiValuedU128FastFieldReader::open(
idx_reader,
vals_reader,
))
}
/// Returns the `u64` fast field reader reader associated with `field`, regardless of whether
/// the given field is effectively of type `u64` or not.
///

View File

@@ -70,6 +70,35 @@ impl CompositeFastFieldSerializer {
Ok(())
}
/// Serialize data into a new u64 fast field. The best compression codec of the the provided
/// will be chosen.
pub fn create_auto_detect_u64_fast_field_with_idx_and_codecs<T: MonotonicallyMappableToU64>(
&mut self,
field: Field,
fastfield_accessor: impl Column<T>,
idx: usize,
codec_types: &[FastFieldCodecType],
) -> io::Result<()> {
let field_write = self.composite_write.for_field_with_idx(field, idx);
fastfield_codecs::serialize(fastfield_accessor, field_write, codec_types)?;
Ok(())
}
/// Serialize data into a new u128 fast field. The codec will be compact space compressor,
/// which is optimized for scanning the fast field for a given range.
pub fn create_u128_fast_field_with_idx<F: Fn() -> I, I: Iterator<Item = u128>>(
&mut self,
field: Field,
iter_gen: F,
num_vals: u64,
idx: usize,
) -> io::Result<()> {
let field_write = self.composite_write.for_field_with_idx(field, idx);
fastfield_codecs::serialize_u128(iter_gen, num_vals, field_write)?;
Ok(())
}
/// Start serializing a new [u8] fast field. Use the returned writer to write data into the
/// bytes field. To associate the bytes with documents a seperate index must be created on
/// index 0. See bytes/writer.rs::serialize for an example.

View File

@@ -2,11 +2,11 @@ use std::collections::HashMap;
use std::io;
use common;
use fastfield_codecs::{Column, MonotonicallyMappableToU64};
use fastfield_codecs::{Column, MonotonicallyMappableToU128, MonotonicallyMappableToU64};
use fnv::FnvHashMap;
use tantivy_bitpacker::BlockedBitpacker;
use super::multivalued::MultiValuedFastFieldWriter;
use super::multivalued::{MultiValueU128FastFieldWriter, MultiValuedFastFieldWriter};
use super::FastFieldType;
use crate::fastfield::{BytesFastFieldWriter, CompositeFastFieldSerializer};
use crate::indexer::doc_id_mapping::DocIdMapping;
@@ -19,10 +19,19 @@ use crate::DatePrecision;
pub struct FastFieldsWriter {
term_id_writers: Vec<MultiValuedFastFieldWriter>,
single_value_writers: Vec<IntFastFieldWriter>,
u128_value_writers: Vec<U128FastFieldWriter>,
u128_multi_value_writers: Vec<MultiValueU128FastFieldWriter>,
multi_values_writers: Vec<MultiValuedFastFieldWriter>,
bytes_value_writers: Vec<BytesFastFieldWriter>,
}
pub(crate) fn unexpected_value(expected: &str, actual: &Value) -> crate::TantivyError {
crate::TantivyError::SchemaError(format!(
"Expected a {:?} in fast field, but got {:?}",
expected, actual
))
}
fn fast_field_default_value(field_entry: &FieldEntry) -> u64 {
match *field_entry.field_type() {
FieldType::I64(_) | FieldType::Date(_) => common::i64_to_u64(0i64),
@@ -34,6 +43,8 @@ fn fast_field_default_value(field_entry: &FieldEntry) -> u64 {
impl FastFieldsWriter {
/// Create all `FastFieldWriter` required by the schema.
pub fn from_schema(schema: &Schema) -> FastFieldsWriter {
let mut u128_value_writers = Vec::new();
let mut u128_multi_value_writers = Vec::new();
let mut single_value_writers = Vec::new();
let mut term_id_writers = Vec::new();
let mut multi_values_writers = Vec::new();
@@ -97,10 +108,27 @@ impl FastFieldsWriter {
bytes_value_writers.push(fast_field_writer);
}
}
FieldType::IpAddr(opt) => {
if opt.is_fast() {
match opt.get_fastfield_cardinality() {
Some(Cardinality::SingleValue) => {
let fast_field_writer = U128FastFieldWriter::new(field);
u128_value_writers.push(fast_field_writer);
}
Some(Cardinality::MultiValues) => {
let fast_field_writer = MultiValueU128FastFieldWriter::new(field);
u128_multi_value_writers.push(fast_field_writer);
}
None => {}
}
}
}
FieldType::Str(_) | FieldType::JsonObject(_) => {}
}
}
FastFieldsWriter {
u128_value_writers,
u128_multi_value_writers,
term_id_writers,
single_value_writers,
multi_values_writers,
@@ -129,6 +157,16 @@ impl FastFieldsWriter {
.iter()
.map(|w| w.mem_usage())
.sum::<usize>()
+ self
.u128_value_writers
.iter()
.map(|w| w.mem_usage())
.sum::<usize>()
+ self
.u128_multi_value_writers
.iter()
.map(|w| w.mem_usage())
.sum::<usize>()
}
/// Get the `FastFieldWriter` associated with a field.
@@ -190,21 +228,27 @@ impl FastFieldsWriter {
.iter_mut()
.find(|field_writer| field_writer.field() == field)
}
/// Indexes all of the fastfields of a new document.
pub fn add_document(&mut self, doc: &Document) {
pub fn add_document(&mut self, doc: &Document) -> crate::Result<()> {
for field_writer in &mut self.term_id_writers {
field_writer.add_document(doc);
field_writer.add_document(doc)?;
}
for field_writer in &mut self.single_value_writers {
field_writer.add_document(doc);
field_writer.add_document(doc)?;
}
for field_writer in &mut self.multi_values_writers {
field_writer.add_document(doc);
field_writer.add_document(doc)?;
}
for field_writer in &mut self.bytes_value_writers {
field_writer.add_document(doc);
field_writer.add_document(doc)?;
}
for field_writer in &mut self.u128_value_writers {
field_writer.add_document(doc)?;
}
for field_writer in &mut self.u128_multi_value_writers {
field_writer.add_document(doc)?;
}
Ok(())
}
/// Serializes all of the `FastFieldWriter`s by pushing them in
@@ -230,6 +274,108 @@ impl FastFieldsWriter {
for field_writer in self.bytes_value_writers {
field_writer.serialize(serializer, doc_id_map)?;
}
for field_writer in self.u128_value_writers {
field_writer.serialize(serializer, doc_id_map)?;
}
for field_writer in self.u128_multi_value_writers {
field_writer.serialize(serializer, doc_id_map)?;
}
Ok(())
}
}
/// Fast field writer for u128 values.
/// The fast field writer just keeps the values in memory.
///
/// Only when the segment writer can be closed and
/// persisted on disk, the fast field writer is
/// sent to a `FastFieldSerializer` via the `.serialize(...)`
/// method.
///
/// We cannot serialize earlier as the values are
/// compressed to a compact number space and the number of
/// bits required for bitpacking can only been known once
/// we have seen all of the values.
pub struct U128FastFieldWriter {
field: Field,
vals: Vec<u128>,
val_count: u32,
}
impl U128FastFieldWriter {
/// Creates a new `IntFastFieldWriter`
pub fn new(field: Field) -> Self {
Self {
field,
vals: vec![],
val_count: 0,
}
}
/// The memory used (inclusive childs)
pub fn mem_usage(&self) -> usize {
self.vals.len() * 16
}
/// Records a new value.
///
/// The n-th value being recorded is implicitely
/// associated to the document with the `DocId` n.
/// (Well, `n-1` actually because of 0-indexing)
pub fn add_val(&mut self, val: u128) {
self.vals.push(val);
}
/// Extract the fast field value from the document
/// (or use the default value) and records it.
///
/// Extract the value associated to the fast field for
/// this document.
pub fn add_document(&mut self, doc: &Document) -> crate::Result<()> {
match doc.get_first(self.field) {
Some(v) => {
let ip_addr = v.as_ip_addr().ok_or_else(|| unexpected_value("ip", v))?;
let value = ip_addr.to_u128();
self.add_val(value);
}
None => {
self.add_val(0); // TODO fix null handling
}
};
self.val_count += 1;
Ok(())
}
/// Push the fast fields value to the `FastFieldWriter`.
pub fn serialize(
&self,
serializer: &mut CompositeFastFieldSerializer,
doc_id_map: Option<&DocIdMapping>,
) -> io::Result<()> {
if let Some(doc_id_map) = doc_id_map {
let iter_gen = || {
doc_id_map
.iter_old_doc_ids()
.map(|idx| self.vals[idx as usize])
};
serializer.create_u128_fast_field_with_idx(
self.field,
iter_gen,
self.val_count as u64,
0,
)?;
} else {
let iter_gen = || self.vals.iter().cloned();
serializer.create_u128_fast_field_with_idx(
self.field,
iter_gen,
self.val_count as u64,
0,
)?;
}
Ok(())
}
}
@@ -238,7 +384,7 @@ impl FastFieldsWriter {
/// The fast field writer just keeps the values in memory.
///
/// Only when the segment writer can be closed and
/// persisted on disc, the fast field writer is
/// persisted on disk, the fast field writer is
/// sent to a `FastFieldSerializer` via the `.serialize(...)`
/// method.
///
@@ -325,14 +471,14 @@ impl IntFastFieldWriter {
/// only the first one is taken in account.
///
/// Values on text fast fields are skipped.
pub fn add_document(&mut self, doc: &Document) {
pub fn add_document(&mut self, doc: &Document) -> crate::Result<()> {
match doc.get_first(self.field) {
Some(v) => {
let value = match (self.precision_opt, v) {
(Some(precision), Value::Date(date_val)) => {
date_val.truncate(precision).to_u64()
}
_ => super::value_to_u64(v),
_ => super::value_to_u64(v)?,
};
self.add_val(value);
}
@@ -340,6 +486,7 @@ impl IntFastFieldWriter {
self.add_val(self.val_if_missing);
}
};
Ok(())
}
/// get iterator over the data
@@ -391,15 +538,8 @@ impl<'map, 'bitp> Column for WriterFastFieldAccessProvider<'map, 'bitp> {
/// # Panics
///
/// May panic if `doc` is greater than the index.
fn get_val(&self, doc: u64) -> u64 {
if let Some(doc_id_map) = self.doc_id_map {
self.vals
.get(doc_id_map.get_old_doc_id(doc as u32) as usize) // consider extra
// FastFieldReader wrapper for
// non doc_id_map
} else {
self.vals.get(doc as usize)
}
fn get_val(&self, _doc: u64) -> u64 {
unimplemented!()
}
fn iter(&self) -> Box<dyn Iterator<Item = u64> + '_> {

View File

@@ -34,10 +34,6 @@ impl SegmentDocIdMapping {
self.new_doc_id_to_old_doc_addr.len()
}
pub(crate) fn get_old_doc_addr(&self, new_doc_id: DocId) -> DocAddress {
self.new_doc_id_to_old_doc_addr[new_doc_id as usize]
}
/// This flags means the segments are simply stacked in the order of their ordinal.
/// e.g. [(0, 1), .. (n, 1), (0, 2)..., (m, 2)]
///

View File

@@ -0,0 +1,69 @@
pub struct FlatMapWithBuffer<T, F, Iter> {
buffer: Vec<T>,
fill_buffer: F,
underlying_it: Iter,
}
impl<T, F, Iter, I> Iterator for FlatMapWithBuffer<T, F, Iter>
where
Iter: Iterator<Item = I>,
F: Fn(I, &mut Vec<T>),
{
type Item = T;
fn next(&mut self) -> Option<Self::Item> {
while self.buffer.is_empty() {
let next_el = self.underlying_it.next()?;
(self.fill_buffer)(next_el, &mut self.buffer);
// We will pop elements, so we reverse the buffer first.
self.buffer.reverse();
}
self.buffer.pop()
}
}
pub trait FlatMapWithBufferIter: Iterator {
/// Function similar to `flat_map`, but allows reusing a shared `Vec`.
fn flat_map_with_buffer<F, T>(self, fill_buffer: F) -> FlatMapWithBuffer<T, F, Self>
where
F: Fn(Self::Item, &mut Vec<T>),
Self: Sized,
{
FlatMapWithBuffer {
buffer: Vec::with_capacity(10),
fill_buffer,
underlying_it: self,
}
}
}
impl<T: ?Sized> FlatMapWithBufferIter for T where T: Iterator {}
#[cfg(test)]
mod tests {
use crate::indexer::flat_map_with_buffer::FlatMapWithBufferIter;
#[test]
fn test_flat_map_with_buffer_empty() {
let mut empty_iter = std::iter::empty::<usize>()
.flat_map_with_buffer(|_val: usize, _buffer: &mut Vec<usize>| {});
assert!(empty_iter.next().is_none());
}
#[test]
fn test_flat_map_with_buffer_simple() {
let vals: Vec<usize> = (1..5)
.flat_map_with_buffer(|val: usize, buffer: &mut Vec<usize>| buffer.extend(0..val))
.collect();
assert_eq!(&[0, 0, 1, 0, 1, 2, 0, 1, 2, 3], &vals[..]);
}
#[test]
fn test_flat_map_filling_no_elements_does_not_stop_iterator() {
let vals: Vec<usize> = [2, 0, 0, 3]
.into_iter()
.flat_map_with_buffer(|val: usize, buffer: &mut Vec<usize>| buffer.extend(0..val))
.collect();
assert_eq!(&[0, 1, 0, 1, 2], &vals[..]);
}
}

View File

@@ -803,7 +803,9 @@ impl Drop for IndexWriter {
#[cfg(test)]
mod tests {
use std::collections::{HashMap, HashSet};
use std::net::Ipv6Addr;
use fastfield_codecs::MonotonicallyMappableToU128;
use proptest::prelude::*;
use proptest::prop_oneof;
use proptest::strategy::Strategy;
@@ -815,11 +817,13 @@ mod tests {
use crate::indexer::NoMergePolicy;
use crate::query::{BooleanQuery, Occur, Query, QueryParser, TermQuery};
use crate::schema::{
self, Cardinality, Facet, FacetOptions, IndexRecordOption, NumericOptions,
self, Cardinality, Facet, FacetOptions, IndexRecordOption, IpAddrOptions, NumericOptions,
TextFieldIndexing, TextOptions, FAST, INDEXED, STORED, STRING, TEXT,
};
use crate::store::DOCSTORE_CACHE_CAPACITY;
use crate::{DocAddress, Index, IndexSettings, IndexSortByField, Order, ReloadPolicy, Term};
use crate::{
DateTime, DocAddress, Index, IndexSettings, IndexSortByField, Order, ReloadPolicy, Term,
};
const LOREM: &str = "Doc Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do \
eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad \
@@ -1395,6 +1399,35 @@ mod tests {
assert!(commit_again.is_ok());
}
#[test]
fn test_sort_by_multivalue_field_error() -> crate::Result<()> {
let mut schema_builder = schema::Schema::builder();
let options = NumericOptions::default().set_fast(Cardinality::MultiValues);
schema_builder.add_u64_field("id", options);
let schema = schema_builder.build();
let settings = IndexSettings {
sort_by_field: Some(IndexSortByField {
field: "id".to_string(),
order: Order::Desc,
}),
..Default::default()
};
let err = Index::builder()
.schema(schema)
.settings(settings)
.create_in_ram()
.unwrap_err();
assert_eq!(
err.to_string(),
"An invalid argument was passed: 'Only single value fast field Cardinality supported \
for sorting index id'"
);
Ok(())
}
#[test]
fn test_delete_with_sort_by_field() -> crate::Result<()> {
let mut schema_builder = schema::Schema::builder();
@@ -1564,7 +1597,15 @@ mod tests {
force_end_merge: bool,
) -> crate::Result<()> {
let mut schema_builder = schema::Schema::builder();
let ip_field = schema_builder.add_ip_addr_field("ip", FAST | INDEXED | STORED);
let ips_field = schema_builder.add_ip_addr_field(
"ips",
IpAddrOptions::default().set_fast(Cardinality::MultiValues),
);
let id_field = schema_builder.add_u64_field("id", FAST | INDEXED | STORED);
let i64_field = schema_builder.add_i64_field("i64", INDEXED);
let f64_field = schema_builder.add_f64_field("f64", INDEXED);
let date_field = schema_builder.add_date_field("date", INDEXED);
let bytes_field = schema_builder.add_bytes_field("bytes", FAST | INDEXED | STORED);
let bool_field = schema_builder.add_bool_field("bool", FAST | INDEXED | STORED);
let text_field = schema_builder.add_text_field(
@@ -1615,21 +1656,49 @@ mod tests {
let old_reader = index.reader()?;
let ip_exists = |id| id % 3 != 0; // 0 does not exist
for &op in ops {
match op {
IndexingOp::AddDoc { id } => {
let facet = Facet::from(&("/cola/".to_string() + &id.to_string()));
index_writer.add_document(doc!(id_field=>id,
bytes_field => id.to_le_bytes().as_slice(),
multi_numbers=> id,
multi_numbers => id,
bool_field => (id % 2u64) != 0,
multi_bools => (id % 2u64) != 0,
multi_bools => (id % 2u64) == 0,
text_field => id.to_string(),
facet_field => facet,
large_text_field=> LOREM
))?;
let ip_from_id = Ipv6Addr::from_u128(id as u128);
if !ip_exists(id) {
// every 3rd doc has no ip field
index_writer.add_document(doc!(id_field=>id,
bytes_field => id.to_le_bytes().as_slice(),
multi_numbers=> id,
multi_numbers => id,
bool_field => (id % 2u64) != 0,
i64_field => id as i64,
f64_field => id as f64,
date_field => DateTime::from_timestamp_secs(id as i64),
multi_bools => (id % 2u64) != 0,
multi_bools => (id % 2u64) == 0,
text_field => id.to_string(),
facet_field => facet,
large_text_field=> LOREM
))?;
} else {
index_writer.add_document(doc!(id_field=>id,
bytes_field => id.to_le_bytes().as_slice(),
ip_field => ip_from_id,
ips_field => ip_from_id,
ips_field => ip_from_id,
multi_numbers=> id,
multi_numbers => id,
bool_field => (id % 2u64) != 0,
i64_field => id as i64,
f64_field => id as f64,
date_field => DateTime::from_timestamp_secs(id as i64),
multi_bools => (id % 2u64) != 0,
multi_bools => (id % 2u64) == 0,
text_field => id.to_string(),
facet_field => facet,
large_text_field=> LOREM
))?;
}
}
IndexingOp::DeleteDoc { id } => {
index_writer.delete_term(Term::from_field_u64(id_field, id));
@@ -1715,6 +1784,60 @@ mod tests {
.collect::<HashSet<_>>()
);
// Load all ips addr
let ips: HashSet<Ipv6Addr> = searcher
.segment_readers()
.iter()
.flat_map(|segment_reader| {
let ff_reader = segment_reader.fast_fields().ip_addr(ip_field).unwrap();
segment_reader.doc_ids_alive().flat_map(move |doc| {
let val = ff_reader.get_val(doc as u64);
if val == Ipv6Addr::from_u128(0) {
// TODO Fix null handling
None
} else {
Some(val)
}
})
})
.collect();
let expected_ips = expected_ids_and_num_occurrences
.keys()
.flat_map(|id| {
if !ip_exists(*id) {
None
} else {
Some(Ipv6Addr::from_u128(*id as u128))
}
})
.collect::<HashSet<_>>();
assert_eq!(ips, expected_ips);
let expected_ips = expected_ids_and_num_occurrences
.keys()
.filter_map(|id| {
if !ip_exists(*id) {
None
} else {
Some(Ipv6Addr::from_u128(*id as u128))
}
})
.collect::<HashSet<_>>();
let ips: HashSet<Ipv6Addr> = searcher
.segment_readers()
.iter()
.flat_map(|segment_reader| {
let ff_reader = segment_reader.fast_fields().ip_addrs(ips_field).unwrap();
segment_reader.doc_ids_alive().flat_map(move |doc| {
let mut vals = vec![];
ff_reader.get_vals(doc, &mut vals);
vals.into_iter().filter(|val| val.to_u128() != 0) // TODO Fix null handling
})
})
.collect();
assert_eq!(ips, expected_ips);
// multivalue fast field tests
for segment_reader in searcher.segment_readers().iter() {
let id_reader = segment_reader.fast_fields().u64(id_field).unwrap();
@@ -1779,10 +1902,8 @@ mod tests {
}
}
// test search
let my_text_field = index.schema().get_field("text_field").unwrap();
let do_search = |term: &str| {
let query = QueryParser::for_index(&index, vec![my_text_field])
let do_search = |term: &str, field| {
let query = QueryParser::for_index(&index, vec![field])
.parse_query(term)
.unwrap();
let top_docs: Vec<(f32, DocAddress)> =
@@ -1791,11 +1912,70 @@ mod tests {
top_docs.iter().map(|el| el.1).collect::<Vec<_>>()
};
for (existing_id, count) in expected_ids_and_num_occurrences {
assert_eq!(do_search(&existing_id.to_string()).len() as u64, count);
let do_search2 = |term: Term| {
let query = TermQuery::new(term, IndexRecordOption::Basic);
let top_docs: Vec<(f32, DocAddress)> =
searcher.search(&query, &TopDocs::with_limit(1000)).unwrap();
top_docs.iter().map(|el| el.1).collect::<Vec<_>>()
};
for (existing_id, count) in &expected_ids_and_num_occurrences {
let (existing_id, count) = (*existing_id, *count);
let assert_field = |field| do_search(&existing_id.to_string(), field).len() as u64;
assert_eq!(assert_field(text_field), count);
assert_eq!(assert_field(i64_field), count);
assert_eq!(assert_field(f64_field), count);
assert_eq!(assert_field(id_field), count);
// Test bytes
let term = Term::from_field_bytes(bytes_field, existing_id.to_le_bytes().as_slice());
assert_eq!(do_search2(term).len() as u64, count);
// Test date
let term = Term::from_field_date(
date_field,
DateTime::from_timestamp_secs(existing_id as i64),
);
assert_eq!(do_search2(term).len() as u64, count);
}
for existing_id in deleted_ids {
assert_eq!(do_search(&existing_id.to_string()).len(), 0);
for deleted_id in deleted_ids {
let assert_field = |field| {
assert_eq!(do_search(&deleted_id.to_string(), field).len() as u64, 0);
};
assert_field(text_field);
assert_field(f64_field);
assert_field(i64_field);
assert_field(id_field);
// Test bytes
let term = Term::from_field_bytes(bytes_field, deleted_id.to_le_bytes().as_slice());
assert_eq!(do_search2(term).len() as u64, 0);
// Test date
let term =
Term::from_field_date(date_field, DateTime::from_timestamp_secs(deleted_id as i64));
assert_eq!(do_search2(term).len() as u64, 0);
}
// search ip address
//
for (existing_id, count) in &expected_ids_and_num_occurrences {
let (existing_id, count) = (*existing_id, *count);
if !ip_exists(existing_id) {
continue;
}
let do_search_ip_field = |term: &str| do_search(term, ip_field).len() as u64;
let ip_addr = Ipv6Addr::from_u128(existing_id as u128);
// Test incoming ip as ipv6
assert_eq!(do_search_ip_field(&format!("\"{}\"", ip_addr)), count);
let term = Term::from_field_ip_addr(ip_field, ip_addr);
assert_eq!(do_search2(term).len() as u64, count);
// Test incoming ip as ipv4
if let Some(ip_addr) = ip_addr.to_ipv4_mapped() {
assert_eq!(do_search_ip_field(&format!("\"{}\"", ip_addr)), count);
}
}
// test facets
for segment_reader in searcher.segment_readers().iter() {
@@ -1818,6 +1998,36 @@ mod tests {
Ok(())
}
#[test]
fn test_minimal() {
assert!(test_operation_strategy(
&[
IndexingOp::AddDoc { id: 23 },
IndexingOp::AddDoc { id: 13 },
IndexingOp::DeleteDoc { id: 13 }
],
true,
false
)
.is_ok());
assert!(test_operation_strategy(
&[
IndexingOp::AddDoc { id: 23 },
IndexingOp::AddDoc { id: 13 },
IndexingOp::DeleteDoc { id: 13 }
],
false,
false
)
.is_ok());
}
#[test]
fn test_minimal_sort_merge() {
assert!(test_operation_strategy(&[IndexingOp::AddDoc { id: 3 },], true, true).is_ok());
}
proptest! {
#![proptest_config(ProptestConfig::with_cases(20))]
#[test]
@@ -1910,4 +2120,135 @@ mod tests {
index_writer.commit()?;
Ok(())
}
#[test]
fn test_bug_1617_3() {
assert!(test_operation_strategy(
&[
IndexingOp::DeleteDoc { id: 0 },
IndexingOp::AddDoc { id: 6 },
IndexingOp::DeleteDocQuery { id: 11 },
IndexingOp::Commit,
IndexingOp::Merge,
IndexingOp::Commit,
IndexingOp::Commit
],
false,
false
)
.is_ok());
}
#[test]
fn test_bug_1617_2() {
assert!(test_operation_strategy(
&[
IndexingOp::AddDoc { id: 13 },
IndexingOp::DeleteDoc { id: 13 },
IndexingOp::Commit,
IndexingOp::AddDoc { id: 30 },
IndexingOp::Commit,
IndexingOp::Merge,
],
false,
true
)
.is_ok());
}
#[test]
fn test_bug_1617() -> crate::Result<()> {
let mut schema_builder = schema::Schema::builder();
let id_field = schema_builder.add_u64_field("id", INDEXED);
let schema = schema_builder.build();
let index = Index::builder().schema(schema).create_in_ram()?;
let mut index_writer = index.writer_for_tests()?;
index_writer.set_merge_policy(Box::new(NoMergePolicy));
let existing_id = 16u64;
let deleted_id = 13u64;
index_writer.add_document(doc!(
id_field=>existing_id,
))?;
index_writer.add_document(doc!(
id_field=>deleted_id,
))?;
index_writer.delete_term(Term::from_field_u64(id_field, deleted_id));
index_writer.commit()?;
// Merge
{
assert!(index_writer.wait_merging_threads().is_ok());
let mut index_writer = index.writer_for_tests()?;
let segment_ids = index
.searchable_segment_ids()
.expect("Searchable segments failed.");
index_writer.merge(&segment_ids).wait().unwrap();
assert!(index_writer.wait_merging_threads().is_ok());
}
let searcher = index.reader()?.searcher();
let query = TermQuery::new(
Term::from_field_u64(id_field, existing_id),
IndexRecordOption::Basic,
);
let top_docs: Vec<(f32, DocAddress)> =
searcher.search(&query, &TopDocs::with_limit(10)).unwrap();
assert_eq!(top_docs.len(), 1); // Fails
Ok(())
}
#[test]
fn test_bug_1618() -> crate::Result<()> {
let mut schema_builder = schema::Schema::builder();
let id_field = schema_builder.add_i64_field("id", INDEXED);
let schema = schema_builder.build();
let index = Index::builder().schema(schema).create_in_ram()?;
let mut index_writer = index.writer_for_tests()?;
index_writer.set_merge_policy(Box::new(NoMergePolicy));
index_writer.add_document(doc!(
id_field=>10i64,
))?;
index_writer.add_document(doc!(
id_field=>30i64,
))?;
index_writer.commit()?;
// Merge
{
assert!(index_writer.wait_merging_threads().is_ok());
let mut index_writer = index.writer_for_tests()?;
let segment_ids = index
.searchable_segment_ids()
.expect("Searchable segments failed.");
index_writer.merge(&segment_ids).wait().unwrap();
assert!(index_writer.wait_merging_threads().is_ok());
}
let searcher = index.reader()?.searcher();
let query = TermQuery::new(
Term::from_field_i64(id_field, 10i64),
IndexRecordOption::Basic,
);
let top_docs: Vec<(f32, DocAddress)> =
searcher.search(&query, &TopDocs::with_limit(10)).unwrap();
assert_eq!(top_docs.len(), 1); // Fails
let query = TermQuery::new(
Term::from_field_i64(id_field, 30i64),
IndexRecordOption::Basic,
);
let top_docs: Vec<(f32, DocAddress)> =
searcher.search(&query, &TopDocs::with_limit(10)).unwrap();
assert_eq!(top_docs.len(), 1); // Fails
Ok(())
}
}

View File

@@ -242,10 +242,12 @@ pub(crate) fn set_string_and_get_terms(
) -> Vec<(usize, Term)> {
let mut positions_and_terms = Vec::<(usize, Term)>::new();
json_term_writer.close_path_and_set_type(Type::Str);
let term_num_bytes = json_term_writer.term_buffer.as_slice().len();
let term_num_bytes = json_term_writer.term_buffer.len_bytes();
let mut token_stream = text_analyzer.token_stream(value);
token_stream.process(&mut |token| {
json_term_writer.term_buffer.truncate(term_num_bytes);
json_term_writer
.term_buffer
.truncate_value_bytes(term_num_bytes);
json_term_writer
.term_buffer
.append_bytes(token.text.as_bytes());
@@ -265,7 +267,7 @@ impl<'a> JsonTermWriter<'a> {
json_path: &str,
term_buffer: &'a mut Term,
) -> Self {
term_buffer.set_field(Type::Json, field);
term_buffer.set_field_and_type(field, Type::Json);
let mut json_term_writer = Self::wrap(term_buffer);
for segment in json_path.split('.') {
json_term_writer.push_path_segment(segment);
@@ -276,7 +278,7 @@ impl<'a> JsonTermWriter<'a> {
pub fn wrap(term_buffer: &'a mut Term) -> Self {
term_buffer.clear_with_type(Type::Json);
let mut path_stack = Vec::with_capacity(10);
path_stack.push(5);
path_stack.push(0);
Self {
term_buffer,
path_stack,
@@ -285,28 +287,28 @@ impl<'a> JsonTermWriter<'a> {
fn trim_to_end_of_path(&mut self) {
let end_of_path = *self.path_stack.last().unwrap();
self.term_buffer.truncate(end_of_path);
self.term_buffer.truncate_value_bytes(end_of_path);
}
pub fn close_path_and_set_type(&mut self, typ: Type) {
self.trim_to_end_of_path();
let buffer = self.term_buffer.as_mut();
let buffer = self.term_buffer.value_bytes_mut();
let buffer_len = buffer.len();
buffer[buffer_len - 1] = JSON_END_OF_PATH;
buffer.push(typ.to_code());
self.term_buffer.append_bytes(&[typ.to_code()]);
}
pub fn push_path_segment(&mut self, segment: &str) {
// the path stack should never be empty.
self.trim_to_end_of_path();
let buffer = self.term_buffer.as_mut();
let buffer = self.term_buffer.value_bytes_mut();
let buffer_len = buffer.len();
if self.path_stack.len() > 1 {
buffer[buffer_len - 1] = JSON_PATH_SEGMENT_SEP;
}
buffer.extend(segment.as_bytes());
buffer.push(JSON_PATH_SEGMENT_SEP);
self.path_stack.push(buffer.len());
self.term_buffer.append_bytes(segment.as_bytes());
self.term_buffer.append_bytes(&[JSON_PATH_SEGMENT_SEP]);
self.path_stack.push(self.term_buffer.len_bytes());
}
pub fn pop_path_segment(&mut self) {
@@ -318,8 +320,8 @@ impl<'a> JsonTermWriter<'a> {
/// Returns the json path of the term being currently built.
#[cfg(test)]
pub(crate) fn path(&self) -> &[u8] {
let end_of_path = self.path_stack.last().cloned().unwrap_or(6);
&self.term().as_slice()[5..end_of_path - 1]
let end_of_path = self.path_stack.last().cloned().unwrap_or(1);
&self.term().value_bytes()[..end_of_path - 1]
}
pub fn set_fast_value<T: FastValue>(&mut self, val: T) {
@@ -332,14 +334,13 @@ impl<'a> JsonTermWriter<'a> {
val.to_u64()
};
self.term_buffer
.as_mut()
.extend_from_slice(value.to_be_bytes().as_slice());
.append_bytes(value.to_be_bytes().as_slice());
}
#[cfg(test)]
pub(crate) fn set_str(&mut self, text: &str) {
self.close_path_and_set_type(Type::Str);
self.term_buffer.as_mut().extend_from_slice(text.as_bytes());
self.term_buffer.append_bytes(text.as_bytes());
}
pub fn term(&self) -> &Term {
@@ -356,8 +357,7 @@ mod tests {
#[test]
fn test_json_writer() {
let field = Field::from_field_id(1);
let mut term = Term::new();
term.set_field(Type::Json, field);
let mut term = Term::with_type_and_field(Type::Json, field);
let mut json_writer = JsonTermWriter::wrap(&mut term);
json_writer.push_path_segment("attributes");
json_writer.push_path_segment("color");
@@ -391,8 +391,7 @@ mod tests {
#[test]
fn test_string_term() {
let field = Field::from_field_id(1);
let mut term = Term::new();
term.set_field(Type::Json, field);
let mut term = Term::with_type_and_field(Type::Json, field);
let mut json_writer = JsonTermWriter::wrap(&mut term);
json_writer.push_path_segment("color");
json_writer.set_str("red");
@@ -405,8 +404,7 @@ mod tests {
#[test]
fn test_i64_term() {
let field = Field::from_field_id(1);
let mut term = Term::new();
term.set_field(Type::Json, field);
let mut term = Term::with_type_and_field(Type::Json, field);
let mut json_writer = JsonTermWriter::wrap(&mut term);
json_writer.push_path_segment("color");
json_writer.set_fast_value(-4i64);
@@ -419,8 +417,7 @@ mod tests {
#[test]
fn test_u64_term() {
let field = Field::from_field_id(1);
let mut term = Term::new();
term.set_field(Type::Json, field);
let mut term = Term::with_type_and_field(Type::Json, field);
let mut json_writer = JsonTermWriter::wrap(&mut term);
json_writer.push_path_segment("color");
json_writer.set_fast_value(4u64);
@@ -433,8 +430,7 @@ mod tests {
#[test]
fn test_f64_term() {
let field = Field::from_field_id(1);
let mut term = Term::new();
term.set_field(Type::Json, field);
let mut term = Term::with_type_and_field(Type::Json, field);
let mut json_writer = JsonTermWriter::wrap(&mut term);
json_writer.push_path_segment("color");
json_writer.set_fast_value(4.0f64);
@@ -447,8 +443,7 @@ mod tests {
#[test]
fn test_bool_term() {
let field = Field::from_field_id(1);
let mut term = Term::new();
term.set_field(Type::Json, field);
let mut term = Term::with_type_and_field(Type::Json, field);
let mut json_writer = JsonTermWriter::wrap(&mut term);
json_writer.push_path_segment("color");
json_writer.set_fast_value(true);
@@ -461,8 +456,7 @@ mod tests {
#[test]
fn test_push_after_set_path_segment() {
let field = Field::from_field_id(1);
let mut term = Term::new();
term.set_field(Type::Json, field);
let mut term = Term::with_type_and_field(Type::Json, field);
let mut json_writer = JsonTermWriter::wrap(&mut term);
json_writer.push_path_segment("attribute");
json_writer.set_str("something");
@@ -477,8 +471,7 @@ mod tests {
#[test]
fn test_pop_segment() {
let field = Field::from_field_id(1);
let mut term = Term::new();
term.set_field(Type::Json, field);
let mut term = Term::with_type_and_field(Type::Json, field);
let mut json_writer = JsonTermWriter::wrap(&mut term);
json_writer.push_path_segment("color");
json_writer.push_path_segment("hue");
@@ -493,8 +486,7 @@ mod tests {
#[test]
fn test_json_writer_path() {
let field = Field::from_field_id(1);
let mut term = Term::new();
term.set_field(Type::Json, field);
let mut term = Term::with_type_and_field(Type::Json, field);
let mut json_writer = JsonTermWriter::wrap(&mut term);
json_writer.push_path_segment("color");
assert_eq!(json_writer.path(), b"color");

View File

@@ -6,16 +6,19 @@ use fastfield_codecs::VecColumn;
use itertools::Itertools;
use measure_time::debug_time;
use super::flat_map_with_buffer::FlatMapWithBufferIter;
use super::sorted_doc_id_multivalue_column::RemappedDocIdMultiValueIndexColumn;
use crate::core::{Segment, SegmentReader};
use crate::docset::{DocSet, TERMINATED};
use crate::error::DataCorruption;
use crate::fastfield::{
AliveBitSet, Column, CompositeFastFieldSerializer, MultiValueLength, MultiValuedFastFieldReader,
get_fastfield_codecs_for_multivalue, AliveBitSet, Column, CompositeFastFieldSerializer,
MultiValueLength, MultiValuedFastFieldReader, MultiValuedU128FastFieldReader,
};
use crate::fieldnorm::{FieldNormReader, FieldNormReaders, FieldNormsSerializer, FieldNormsWriter};
use crate::indexer::doc_id_mapping::{expect_field_id_for_sort_field, SegmentDocIdMapping};
use crate::indexer::sorted_doc_id_column::SortedDocIdColumn;
use crate::indexer::sorted_doc_id_multivalue_column::SortedDocIdMultiValueColumn;
use crate::indexer::sorted_doc_id_column::RemappedDocIdColumn;
use crate::indexer::sorted_doc_id_multivalue_column::RemappedDocIdMultiValueColumn;
use crate::indexer::SegmentSerializer;
use crate::postings::{InvertedIndexSerializer, Postings, SegmentPostings};
use crate::schema::{Cardinality, Field, FieldType, Schema};
@@ -293,6 +296,24 @@ impl IndexMerger {
self.write_bytes_fast_field(field, fast_field_serializer, doc_id_mapping)?;
}
}
FieldType::IpAddr(options) => match options.get_fastfield_cardinality() {
Some(Cardinality::SingleValue) => {
self.write_u128_single_fast_field(
field,
fast_field_serializer,
doc_id_mapping,
)?;
}
Some(Cardinality::MultiValues) => {
self.write_u128_multi_fast_field(
field,
fast_field_serializer,
doc_id_mapping,
)?;
}
None => {}
},
FieldType::JsonObject(_) | FieldType::Facet(_) | FieldType::Str(_) => {
// We don't handle json fast field for the moment
// They can be implemented using what is done
@@ -303,6 +324,91 @@ impl IndexMerger {
Ok(())
}
// used to merge `u128` single fast fields.
fn write_u128_multi_fast_field(
&self,
field: Field,
fast_field_serializer: &mut CompositeFastFieldSerializer,
doc_id_mapping: &SegmentDocIdMapping,
) -> crate::Result<()> {
let segment_and_ff_readers: Vec<(&SegmentReader, MultiValuedU128FastFieldReader<u128>)> =
self.readers
.iter()
.map(|segment_reader| {
let ff_reader: MultiValuedU128FastFieldReader<u128> =
segment_reader.fast_fields().u128s(field).expect(
"Failed to find index for multivalued field. This is a bug in \
tantivy, please report.",
);
(segment_reader, ff_reader)
})
.collect::<Vec<_>>();
Self::write_1_n_fast_field_idx_generic(
field,
fast_field_serializer,
doc_id_mapping,
&segment_and_ff_readers,
)?;
let fast_field_readers = segment_and_ff_readers
.into_iter()
.map(|(_, ff_reader)| ff_reader)
.collect::<Vec<_>>();
let iter_gen = || {
doc_id_mapping
.iter_old_doc_addrs()
.flat_map_with_buffer(|doc_addr, buffer| {
let fast_field_reader = &fast_field_readers[doc_addr.segment_ord as usize];
fast_field_reader.get_vals(doc_addr.doc_id, buffer);
})
};
fast_field_serializer.create_u128_fast_field_with_idx(
field,
iter_gen,
doc_id_mapping.len() as u64,
1,
)?;
Ok(())
}
// used to merge `u128` single fast fields.
fn write_u128_single_fast_field(
&self,
field: Field,
fast_field_serializer: &mut CompositeFastFieldSerializer,
doc_id_mapping: &SegmentDocIdMapping,
) -> crate::Result<()> {
let fast_field_readers = self
.readers
.iter()
.map(|reader| {
let u128_reader: Arc<dyn Column<u128>> = reader.fast_fields().u128(field).expect(
"Failed to find a reader for single fast field. This is a tantivy bug and it \
should never happen.",
);
u128_reader
})
.collect::<Vec<_>>();
let iter_gen = || {
doc_id_mapping.iter_old_doc_addrs().map(|doc_addr| {
let fast_field_reader = &fast_field_readers[doc_addr.segment_ord as usize];
fast_field_reader.get_val(doc_addr.doc_id as u64)
})
};
fast_field_serializer.create_u128_fast_field_with_idx(
field,
iter_gen,
doc_id_mapping.len() as u64,
0,
)?;
Ok(())
}
// used both to merge field norms, `u64/i64` single fast fields.
fn write_single_fast_field(
&self,
@@ -310,7 +416,7 @@ impl IndexMerger {
fast_field_serializer: &mut CompositeFastFieldSerializer,
doc_id_mapping: &SegmentDocIdMapping,
) -> crate::Result<()> {
let fast_field_accessor = SortedDocIdColumn::new(&self.readers, doc_id_mapping, field);
let fast_field_accessor = RemappedDocIdColumn::new(&self.readers, doc_id_mapping, field);
fast_field_serializer.create_auto_detect_u64_fast_field(field, fast_field_accessor)?;
Ok(())
@@ -423,33 +529,17 @@ impl IndexMerger {
// Creating the index file to point into the data, generic over `BytesFastFieldReader` and
// `MultiValuedFastFieldReader`
//
fn write_1_n_fast_field_idx_generic<T: MultiValueLength>(
fn write_1_n_fast_field_idx_generic<T: MultiValueLength + Send + Sync>(
field: Field,
fast_field_serializer: &mut CompositeFastFieldSerializer,
doc_id_mapping: &SegmentDocIdMapping,
reader_and_field_accessors: &[(&SegmentReader, T)],
) -> crate::Result<Vec<u64>> {
// We can now create our `idx` serializer, and in a second pass,
// can effectively push the different indexes.
segment_and_ff_readers: &[(&SegmentReader, T)],
) -> crate::Result<()> {
let column =
RemappedDocIdMultiValueIndexColumn::new(segment_and_ff_readers, doc_id_mapping);
// copying into a temp vec is not ideal, but the fast field codec api requires random
// access, which is used in the estimation. It's possible to 1. calculate random
// access on the fly or 2. change the codec api to make random access optional, but
// they both have also major drawbacks.
let mut offsets = Vec::with_capacity(doc_id_mapping.len());
let mut offset = 0;
for old_doc_addr in doc_id_mapping.iter_old_doc_addrs() {
let reader = &reader_and_field_accessors[old_doc_addr.segment_ord as usize].1;
offsets.push(offset);
offset += reader.get_len(old_doc_addr.doc_id) as u64;
}
offsets.push(offset);
let fastfield_accessor = VecColumn::from(&offsets[..]);
fast_field_serializer.create_auto_detect_u64_fast_field(field, fastfield_accessor)?;
Ok(offsets)
fast_field_serializer.create_auto_detect_u64_fast_field(field, column)?;
Ok(())
}
/// Returns the fastfield index (index for the data, not the data).
fn write_multi_value_fast_field_idx(
@@ -457,8 +547,8 @@ impl IndexMerger {
field: Field,
fast_field_serializer: &mut CompositeFastFieldSerializer,
doc_id_mapping: &SegmentDocIdMapping,
) -> crate::Result<Vec<u64>> {
let reader_ordinal_and_field_accessors = self
) -> crate::Result<()> {
let segment_and_ff_readers = self
.readers
.iter()
.map(|reader| {
@@ -477,7 +567,7 @@ impl IndexMerger {
field,
fast_field_serializer,
doc_id_mapping,
&reader_ordinal_and_field_accessors,
&segment_and_ff_readers,
)
}
@@ -526,7 +616,12 @@ impl IndexMerger {
}
let col = VecColumn::from(&vals[..]);
fast_field_serializer.create_auto_detect_u64_fast_field_with_idx(field, col, 1)?;
fast_field_serializer.create_auto_detect_u64_fast_field_with_idx_and_codecs(
field,
col,
1,
&get_fastfield_codecs_for_multivalue(),
)?;
}
Ok(())
}
@@ -561,20 +656,21 @@ impl IndexMerger {
fast_field_serializer: &mut CompositeFastFieldSerializer,
doc_id_mapping: &SegmentDocIdMapping,
) -> crate::Result<()> {
// Multifastfield consists in 2 fastfields.
// Multifastfield consists of 2 fastfields.
// The first serves as an index into the second one and is strictly increasing.
// The second contains the actual values.
// First we merge the idx fast field.
let offsets =
self.write_multi_value_fast_field_idx(field, fast_field_serializer, doc_id_mapping)?;
self.write_multi_value_fast_field_idx(field, fast_field_serializer, doc_id_mapping)?;
let fastfield_accessor =
SortedDocIdMultiValueColumn::new(&self.readers, doc_id_mapping, &offsets, field);
fast_field_serializer.create_auto_detect_u64_fast_field_with_idx(
RemappedDocIdMultiValueColumn::new(&self.readers, doc_id_mapping, field);
fast_field_serializer.create_auto_detect_u64_fast_field_with_idx_and_codecs(
field,
fastfield_accessor,
1,
&get_fastfield_codecs_for_multivalue(),
)?;
Ok(())
@@ -586,7 +682,7 @@ impl IndexMerger {
fast_field_serializer: &mut CompositeFastFieldSerializer,
doc_id_mapping: &SegmentDocIdMapping,
) -> crate::Result<()> {
let reader_and_field_accessors = self
let segment_and_ff_readers = self
.readers
.iter()
.map(|reader| {
@@ -597,17 +693,17 @@ impl IndexMerger {
(reader, bytes_reader)
})
.collect::<Vec<_>>();
Self::write_1_n_fast_field_idx_generic(
field,
fast_field_serializer,
doc_id_mapping,
&reader_and_field_accessors,
&segment_and_ff_readers,
)?;
let mut serialize_vals = fast_field_serializer.new_bytes_fast_field(field);
for old_doc_addr in doc_id_mapping.iter_old_doc_addrs() {
let bytes_reader = &reader_and_field_accessors[old_doc_addr.segment_ord as usize].1;
let bytes_reader = &segment_and_ff_readers[old_doc_addr.segment_ord as usize].1;
let val = bytes_reader.get_bytes(old_doc_addr.doc_id);
serialize_vals.write_all(val)?;
}

View File

@@ -3,6 +3,7 @@ pub mod delete_queue;
pub mod demuxer;
pub mod doc_id_mapping;
mod doc_opstamp_mapping;
mod flat_map_with_buffer;
pub mod index_writer;
mod index_writer_status;
mod json_term_writer;

View File

@@ -24,12 +24,27 @@ impl SegmentSerializer {
// In the merge case this is not necessary because we can kmerge the already sorted
// segments
let remapping_required = segment.index().settings().sort_by_field.is_some() && !is_in_merge;
let store_component = if remapping_required {
SegmentComponent::TempStore
let settings = segment.index().settings().clone();
let store_writer = if remapping_required {
let store_write = segment.open_write(SegmentComponent::TempStore)?;
StoreWriter::new(
store_write,
crate::store::Compressor::None,
// We want fast random access on the docs, so we choose a small block size.
// If this is zero, the skip index will contain too many checkpoints and
// therefore will be relatively slow.
16000,
settings.docstore_compress_dedicated_thread,
)?
} else {
SegmentComponent::Store
let store_write = segment.open_write(SegmentComponent::Store)?;
StoreWriter::new(
store_write,
settings.docstore_compression,
settings.docstore_blocksize,
settings.docstore_compress_dedicated_thread,
)?
};
let store_write = segment.open_write(store_component)?;
let fast_field_write = segment.open_write(SegmentComponent::FastFields)?;
let fast_field_serializer = CompositeFastFieldSerializer::from_write(fast_field_write)?;
@@ -38,13 +53,6 @@ impl SegmentSerializer {
let fieldnorms_serializer = FieldNormsSerializer::from_write(fieldnorms_write)?;
let postings_serializer = InvertedIndexSerializer::open(&mut segment)?;
let settings = segment.index().settings();
let store_writer = StoreWriter::new(
store_write,
settings.docstore_compression,
settings.docstore_blocksize,
settings.docstore_compress_dedicated_thread,
)?;
Ok(SegmentSerializer {
segment,
store_writer,

View File

@@ -133,15 +133,15 @@ fn merge(
/// Advanced: Merges a list of segments from different indices in a new index.
///
/// Returns `TantivyError` if the the indices list is empty or their
/// Returns `TantivyError` if the indices list is empty or their
/// schemas don't match.
///
/// `output_directory`: is assumed to be empty.
///
/// # Warning
/// This function does NOT check or take the `IndexWriter` is running. It is not
/// meant to work if you have an IndexWriter running for the origin indices, or
/// the destination Index.
/// meant to work if you have an `IndexWriter` running for the origin indices, or
/// the destination `Index`.
#[doc(hidden)]
pub fn merge_indices<T: Into<Box<dyn Directory>>>(
indices: &[Index],
@@ -179,15 +179,15 @@ pub fn merge_indices<T: Into<Box<dyn Directory>>>(
/// Advanced: Merges a list of segments from different indices in a new index.
/// Additional you can provide a delete bitset for each segment to ignore doc_ids.
///
/// Returns `TantivyError` if the the indices list is empty or their
/// Returns `TantivyError` if the indices list is empty or their
/// schemas don't match.
///
/// `output_directory`: is assumed to be empty.
///
/// # Warning
/// This function does NOT check or take the `IndexWriter` is running. It is not
/// meant to work if you have an IndexWriter running for the origin indices, or
/// the destination Index.
/// meant to work if you have an `IndexWriter` running for the origin indices, or
/// the destination `Index`.
#[doc(hidden)]
pub fn merge_filtered_segments<T: Into<Box<dyn Directory>>>(
segments: &[Segment],

View File

@@ -1,4 +1,5 @@
use fastfield_codecs::MonotonicallyMappableToU64;
use itertools::Itertools;
use super::doc_id_mapping::{get_doc_id_mapping_from_field, DocIdMapping};
use super::operation::AddOperation;
@@ -11,11 +12,9 @@ use crate::postings::{
compute_table_size, serialize_postings, IndexingContext, IndexingPosition,
PerFieldPostingsWriter, PostingsWriter,
};
use crate::schema::{FieldEntry, FieldType, FieldValue, Schema, Term, Value};
use crate::schema::{FieldEntry, FieldType, Schema, Term, Value};
use crate::store::{StoreReader, StoreWriter};
use crate::tokenizer::{
BoxTokenStream, FacetTokenizer, PreTokenizedStream, TextAnalyzer, Tokenizer,
};
use crate::tokenizer::{FacetTokenizer, PreTokenizedStream, TextAnalyzer, Tokenizer};
use crate::{DatePrecision, DocId, Document, Opstamp, SegmentComponent};
/// Computes the initial size of the hash table.
@@ -115,7 +114,7 @@ impl SegmentWriter {
fast_field_writers: FastFieldsWriter::from_schema(&schema),
doc_opstamps: Vec::with_capacity(1_000),
per_field_text_analyzers,
term_buffer: Term::new(),
term_buffer: Term::with_capacity(16),
schema,
})
}
@@ -157,7 +156,13 @@ impl SegmentWriter {
fn index_document(&mut self, doc: &Document) -> crate::Result<()> {
let doc_id = self.max_doc;
for (field, values) in doc.get_sorted_field_values() {
let vals_grouped_by_field = doc
.field_values()
.iter()
.sorted_by_key(|el| el.field())
.group_by(|el| el.field());
for (field, field_values) in &vals_grouped_by_field {
let values = field_values.map(|field_value| field_value.value());
let field_entry = self.schema.get_field_entry(field);
let make_schema_error = || {
crate::TantivyError::SchemaError(format!(
@@ -169,10 +174,12 @@ impl SegmentWriter {
if !field_entry.is_indexed() {
continue;
}
let (term_buffer, ctx) = (&mut self.term_buffer, &mut self.ctx);
let postings_writer: &mut dyn PostingsWriter =
self.per_field_postings_writers.get_for_field_mut(field);
term_buffer.set_field(field_entry.field_type().value_type(), field);
term_buffer.clear_with_field_and_type(field_entry.field_type().value_type(), field);
match *field_entry.field_type() {
FieldType::Facet(_) => {
for value in values {
@@ -197,35 +204,23 @@ impl SegmentWriter {
}
}
FieldType::Str(_) => {
let mut token_streams: Vec<BoxTokenStream> = vec![];
let mut offsets = vec![];
let mut total_offset = 0;
let mut indexing_position = IndexingPosition::default();
for value in values {
match value {
let mut token_stream = match value {
Value::PreTokStr(tok_str) => {
offsets.push(total_offset);
if let Some(last_token) = tok_str.tokens.last() {
total_offset += last_token.offset_to;
}
token_streams
.push(PreTokenizedStream::from(tok_str.clone()).into());
PreTokenizedStream::from(tok_str.clone()).into()
}
Value::Str(ref text) => {
let text_analyzer =
&self.per_field_text_analyzers[field.field_id() as usize];
offsets.push(total_offset);
total_offset += text.len();
token_streams.push(text_analyzer.token_stream(text));
text_analyzer.token_stream(text)
}
_ => (),
}
}
_ => {
continue;
}
};
let mut indexing_position = IndexingPosition::default();
for mut token_stream in token_streams {
assert_eq!(term_buffer.as_slice().len(), 5);
assert!(term_buffer.is_empty());
postings_writer.index_text(
doc_id,
&mut *token_stream,
@@ -241,52 +236,81 @@ impl SegmentWriter {
}
}
FieldType::U64(_) => {
let mut num_vals = 0;
for value in values {
num_vals += 1;
let u64_val = value.as_u64().ok_or_else(make_schema_error)?;
term_buffer.set_u64(u64_val);
postings_writer.subscribe(doc_id, 0u32, term_buffer, ctx);
}
if field_entry.has_fieldnorms() {
self.fieldnorms_writer.record(doc_id, field, num_vals);
}
}
FieldType::Date(_) => {
let mut num_vals = 0;
for value in values {
num_vals += 1;
let date_val = value.as_date().ok_or_else(make_schema_error)?;
term_buffer.set_u64(date_val.truncate(DatePrecision::Seconds).to_u64());
postings_writer.subscribe(doc_id, 0u32, term_buffer, ctx);
}
if field_entry.has_fieldnorms() {
self.fieldnorms_writer.record(doc_id, field, num_vals);
}
}
FieldType::I64(_) => {
let mut num_vals = 0;
for value in values {
num_vals += 1;
let i64_val = value.as_i64().ok_or_else(make_schema_error)?;
term_buffer.set_i64(i64_val);
postings_writer.subscribe(doc_id, 0u32, term_buffer, ctx);
}
if field_entry.has_fieldnorms() {
self.fieldnorms_writer.record(doc_id, field, num_vals);
}
}
FieldType::F64(_) => {
let mut num_vals = 0;
for value in values {
num_vals += 1;
let f64_val = value.as_f64().ok_or_else(make_schema_error)?;
term_buffer.set_f64(f64_val);
postings_writer.subscribe(doc_id, 0u32, term_buffer, ctx);
}
if field_entry.has_fieldnorms() {
self.fieldnorms_writer.record(doc_id, field, num_vals);
}
}
FieldType::Bool(_) => {
let mut num_vals = 0;
for value in values {
num_vals += 1;
let bool_val = value.as_bool().ok_or_else(make_schema_error)?;
term_buffer.set_bool(bool_val);
postings_writer.subscribe(doc_id, 0u32, term_buffer, ctx);
}
if field_entry.has_fieldnorms() {
self.fieldnorms_writer.record(doc_id, field, num_vals);
}
}
FieldType::Bytes(_) => {
let mut num_vals = 0;
for value in values {
num_vals += 1;
let bytes = value.as_bytes().ok_or_else(make_schema_error)?;
term_buffer.set_bytes(bytes);
postings_writer.subscribe(doc_id, 0u32, term_buffer, ctx);
}
if field_entry.has_fieldnorms() {
self.fieldnorms_writer.record(doc_id, field, num_vals);
}
}
FieldType::JsonObject(_) => {
let text_analyzer = &self.per_field_text_analyzers[field.field_id() as usize];
let json_values_it = values
.iter()
.map(|value| value.as_json().ok_or_else(make_schema_error));
let json_values_it =
values.map(|value| value.as_json().ok_or_else(make_schema_error));
index_json_values(
doc_id,
json_values_it,
@@ -296,6 +320,18 @@ impl SegmentWriter {
ctx,
)?;
}
FieldType::IpAddr(_) => {
let mut num_vals = 0;
for value in values {
num_vals += 1;
let ip_addr = value.as_ip_addr().ok_or_else(make_schema_error)?;
term_buffer.set_ip_addr(ip_addr);
postings_writer.subscribe(doc_id, 0u32, term_buffer, ctx);
}
if field_entry.has_fieldnorms() {
self.fieldnorms_writer.record(doc_id, field, num_vals);
}
}
}
}
Ok(())
@@ -307,11 +343,10 @@ impl SegmentWriter {
pub fn add_document(&mut self, add_operation: AddOperation) -> crate::Result<()> {
let doc = add_operation.document;
self.doc_opstamps.push(add_operation.opstamp);
self.fast_field_writers.add_document(&doc);
self.fast_field_writers.add_document(&doc)?;
self.index_document(&doc)?;
let prepared_doc = prepare_doc_for_store(doc, &self.schema);
let doc_writer = self.segment_serializer.get_store_writer();
doc_writer.store(&prepared_doc)?;
doc_writer.store(&doc, &self.schema)?;
self.max_doc += 1;
Ok(())
}
@@ -374,9 +409,9 @@ fn remap_and_write(
doc_id_map,
)?;
debug!("resort-docstore");
// finalize temp docstore and create version, which reflects the doc_id_map
if let Some(doc_id_map) = doc_id_map {
debug!("resort-docstore");
let store_write = serializer
.segment_mut()
.open_write(SegmentComponent::Store)?;
@@ -393,7 +428,8 @@ fn remap_and_write(
serializer
.segment()
.open_read(SegmentComponent::TempStore)?,
50,
1, /* The docstore is configured to have one doc per block, and each doc is accessed
* only once: we don't need caching. */
)?;
for old_doc_id in doc_id_map.iter_old_doc_ids() {
let doc_bytes = store_read.get_document_bytes(old_doc_id)?;
@@ -407,40 +443,24 @@ fn remap_and_write(
Ok(())
}
/// Prepares Document for being stored in the document store
///
/// Method transforms PreTokenizedString values into String
/// values.
pub fn prepare_doc_for_store(doc: Document, schema: &Schema) -> Document {
Document::from(
doc.into_iter()
.filter(|field_value| schema.get_field_entry(field_value.field()).is_stored())
.map(|field_value| match field_value {
FieldValue {
field,
value: Value::PreTokStr(pre_tokenized_text),
} => FieldValue {
field,
value: Value::Str(pre_tokenized_text.text),
},
field_value => field_value,
})
.collect::<Vec<_>>(),
)
}
#[cfg(test)]
mod tests {
use std::path::Path;
use super::compute_initial_table_size;
use crate::collector::Count;
use crate::directory::RamDirectory;
use crate::indexer::json_term_writer::JsonTermWriter;
use crate::postings::TermInfo;
use crate::query::PhraseQuery;
use crate::schema::{IndexRecordOption, Schema, Type, STORED, STRING, TEXT};
use crate::store::{Compressor, StoreReader, StoreWriter};
use crate::time::format_description::well_known::Rfc3339;
use crate::time::OffsetDateTime;
use crate::tokenizer::{PreTokenizedString, Token};
use crate::{DateTime, DocAddress, DocSet, Document, Index, Postings, Term, TERMINATED};
use crate::{
DateTime, Directory, DocAddress, DocSet, Document, Index, Postings, Term, TERMINATED,
};
#[test]
fn test_hashmap_size() {
@@ -470,14 +490,21 @@ mod tests {
doc.add_pre_tokenized_text(text_field, pre_tokenized_text);
doc.add_text(text_field, "title");
let prepared_doc = super::prepare_doc_for_store(doc, &schema);
assert_eq!(prepared_doc.field_values().len(), 2);
assert_eq!(prepared_doc.field_values()[0].value().as_text(), Some("A"));
assert_eq!(
prepared_doc.field_values()[1].value().as_text(),
Some("title")
);
let path = Path::new("store");
let directory = RamDirectory::create();
let store_wrt = directory.open_write(path).unwrap();
let mut store_writer = StoreWriter::new(store_wrt, Compressor::None, 0, false).unwrap();
store_writer.store(&doc, &schema).unwrap();
store_writer.close().unwrap();
let reader = StoreReader::open(directory.open_read(path).unwrap(), 0).unwrap();
let doc = reader.get(0).unwrap();
assert_eq!(doc.field_values().len(), 2);
assert_eq!(doc.field_values()[0].value().as_text(), Some("A"));
assert_eq!(doc.field_values()[1].value().as_text(), Some("title"));
}
#[test]
@@ -527,8 +554,7 @@ mod tests {
let inv_idx = segment_reader.inverted_index(json_field).unwrap();
let term_dict = inv_idx.terms();
let mut term = Term::new();
term.set_field(Type::Json, json_field);
let mut term = Term::with_type_and_field(Type::Json, json_field);
let mut term_stream = term_dict.stream().unwrap();
let mut json_term_writer = JsonTermWriter::wrap(&mut term);
@@ -621,8 +647,7 @@ mod tests {
let searcher = reader.searcher();
let segment_reader = searcher.segment_reader(0u32);
let inv_index = segment_reader.inverted_index(json_field).unwrap();
let mut term = Term::new();
term.set_field(Type::Json, json_field);
let mut term = Term::with_type_and_field(Type::Json, json_field);
let mut json_term_writer = JsonTermWriter::wrap(&mut term);
json_term_writer.push_path_segment("mykey");
json_term_writer.set_str("token");
@@ -666,8 +691,7 @@ mod tests {
let searcher = reader.searcher();
let segment_reader = searcher.segment_reader(0u32);
let inv_index = segment_reader.inverted_index(json_field).unwrap();
let mut term = Term::new();
term.set_field(Type::Json, json_field);
let mut term = Term::with_type_and_field(Type::Json, json_field);
let mut json_term_writer = JsonTermWriter::wrap(&mut term);
json_term_writer.push_path_segment("mykey");
json_term_writer.set_str("two tokens");
@@ -712,8 +736,7 @@ mod tests {
writer.commit().unwrap();
let reader = index.reader().unwrap();
let searcher = reader.searcher();
let mut term = Term::new();
term.set_field(Type::Json, json_field);
let mut term = Term::with_type_and_field(Type::Json, json_field);
let mut json_term_writer = JsonTermWriter::wrap(&mut term);
json_term_writer.push_path_segment("mykey");
json_term_writer.push_path_segment("field");
@@ -728,4 +751,38 @@ mod tests {
let phrase_query = PhraseQuery::new(vec![nothello_term, happy_term]);
assert_eq!(searcher.search(&phrase_query, &Count).unwrap(), 0);
}
#[test]
fn test_bug_regression_1629_position_when_array_with_a_field_value_that_does_not_contain_any_token(
) {
// We experienced a bug where we would have a position underflow when computing position
// delta in an horrible corner case.
//
// See the commit with this unit test if you want the details.
let mut schema_builder = Schema::builder();
let text = schema_builder.add_text_field("text", TEXT);
let schema = schema_builder.build();
let doc = schema
.parse_document(r#"{"text": [ "bbb", "aaa", "", "aaa"]}"#)
.unwrap();
let index = Index::create_in_ram(schema);
let mut index_writer = index.writer_for_tests().unwrap();
index_writer.add_document(doc).unwrap();
// On debug this did panic on the underflow
index_writer.commit().unwrap();
let reader = index.reader().unwrap();
let searcher = reader.searcher();
let seg_reader = searcher.segment_reader(0);
let inv_index = seg_reader.inverted_index(text).unwrap();
let term = Term::from_field_text(text, "aaa");
let mut postings = inv_index
.read_postings(&term, IndexRecordOption::WithFreqsAndPositions)
.unwrap()
.unwrap();
assert_eq!(postings.doc(), 0u32);
let mut positions = Vec::new();
postings.positions(&mut positions);
// On release this was [2, 1]. (< note the decreasing values)
assert_eq!(positions, &[2, 5]);
}
}

View File

@@ -5,9 +5,9 @@ use itertools::Itertools;
use crate::indexer::doc_id_mapping::SegmentDocIdMapping;
use crate::schema::Field;
use crate::{DocAddress, SegmentReader};
use crate::SegmentReader;
pub(crate) struct SortedDocIdColumn<'a> {
pub(crate) struct RemappedDocIdColumn<'a> {
doc_id_mapping: &'a SegmentDocIdMapping,
fast_field_readers: Vec<Arc<dyn Column<u64>>>,
min_value: u64,
@@ -37,7 +37,7 @@ fn compute_min_max_val(
.into_option()
}
impl<'a> SortedDocIdColumn<'a> {
impl<'a> RemappedDocIdColumn<'a> {
pub(crate) fn new(
readers: &'a [SegmentReader],
doc_id_mapping: &'a SegmentDocIdMapping,
@@ -68,7 +68,7 @@ impl<'a> SortedDocIdColumn<'a> {
})
.collect::<Vec<_>>();
SortedDocIdColumn {
RemappedDocIdColumn {
doc_id_mapping,
fast_field_readers,
min_value,
@@ -78,13 +78,9 @@ impl<'a> SortedDocIdColumn<'a> {
}
}
impl<'a> Column for SortedDocIdColumn<'a> {
fn get_val(&self, doc: u64) -> u64 {
let DocAddress {
doc_id,
segment_ord,
} = self.doc_id_mapping.get_old_doc_addr(doc as u32);
self.fast_field_readers[segment_ord as usize].get_val(doc_id as u64)
impl<'a> Column for RemappedDocIdColumn<'a> {
fn get_val(&self, _doc: u64) -> u64 {
unimplemented!()
}
fn iter(&self) -> Box<dyn Iterator<Item = u64> + '_> {

View File

@@ -2,26 +2,24 @@ use std::cmp;
use fastfield_codecs::Column;
use super::flat_map_with_buffer::FlatMapWithBufferIter;
use crate::fastfield::{MultiValueLength, MultiValuedFastFieldReader};
use crate::indexer::doc_id_mapping::SegmentDocIdMapping;
use crate::schema::Field;
use crate::{DocId, SegmentReader};
use crate::{DocAddress, SegmentReader};
// We can now initialize our serializer, and push it the different values
pub(crate) struct SortedDocIdMultiValueColumn<'a> {
pub(crate) struct RemappedDocIdMultiValueColumn<'a> {
doc_id_mapping: &'a SegmentDocIdMapping,
fast_field_readers: Vec<MultiValuedFastFieldReader<u64>>,
offsets: &'a [u64],
min_value: u64,
max_value: u64,
num_vals: u64,
}
impl<'a> SortedDocIdMultiValueColumn<'a> {
impl<'a> RemappedDocIdMultiValueColumn<'a> {
pub(crate) fn new(
readers: &'a [SegmentReader],
doc_id_mapping: &'a SegmentDocIdMapping,
offsets: &'a [u64],
field: Field,
) -> Self {
// Our values are bitpacked and we need to know what should be
@@ -58,10 +56,9 @@ impl<'a> SortedDocIdMultiValueColumn<'a> {
min_value = 0;
max_value = 0;
}
SortedDocIdMultiValueColumn {
RemappedDocIdMultiValueColumn {
doc_id_mapping,
fast_field_readers,
offsets,
min_value,
max_value,
num_vals: num_vals as u64,
@@ -69,37 +66,18 @@ impl<'a> SortedDocIdMultiValueColumn<'a> {
}
}
impl<'a> Column for SortedDocIdMultiValueColumn<'a> {
fn get_val(&self, pos: u64) -> u64 {
// use the offsets index to find the doc_id which will contain the position.
// the offsets are strictly increasing so we can do a binary search on it.
let new_doc_id: DocId = self.offsets.partition_point(|&offset| offset <= pos) as DocId - 1; // Offsets start at 0, so -1 is safe
// now we need to find the position of `pos` in the multivalued bucket
let num_pos_covered_until_now = self.offsets[new_doc_id as usize];
let pos_in_values = pos - num_pos_covered_until_now;
let old_doc_addr = self.doc_id_mapping.get_old_doc_addr(new_doc_id);
let num_vals =
self.fast_field_readers[old_doc_addr.segment_ord as usize].get_len(old_doc_addr.doc_id);
assert!(num_vals >= pos_in_values);
let mut vals = Vec::new();
self.fast_field_readers[old_doc_addr.segment_ord as usize]
.get_vals(old_doc_addr.doc_id, &mut vals);
vals[pos_in_values as usize]
impl<'a> Column for RemappedDocIdMultiValueColumn<'a> {
fn get_val(&self, _pos: u64) -> u64 {
unimplemented!()
}
fn iter(&self) -> Box<dyn Iterator<Item = u64> + '_> {
Box::new(
self.doc_id_mapping
.iter_old_doc_addrs()
.flat_map(|old_doc_addr| {
.flat_map_with_buffer(|old_doc_addr: DocAddress, buffer| {
let ff_reader = &self.fast_field_readers[old_doc_addr.segment_ord as usize];
let mut vals = Vec::new();
ff_reader.get_vals(old_doc_addr.doc_id, &mut vals);
vals.into_iter()
ff_reader.get_vals(old_doc_addr.doc_id, buffer);
}),
)
}
@@ -115,3 +93,76 @@ impl<'a> Column for SortedDocIdMultiValueColumn<'a> {
self.num_vals
}
}
pub(crate) struct RemappedDocIdMultiValueIndexColumn<'a, T: MultiValueLength> {
doc_id_mapping: &'a SegmentDocIdMapping,
multi_value_length_readers: Vec<&'a T>,
min_value: u64,
max_value: u64,
num_vals: u64,
}
impl<'a, T: MultiValueLength> RemappedDocIdMultiValueIndexColumn<'a, T> {
pub(crate) fn new(
segment_and_ff_readers: &'a [(&'a SegmentReader, T)],
doc_id_mapping: &'a SegmentDocIdMapping,
) -> Self {
// We go through a complete first pass to compute the minimum and the
// maximum value and initialize our Column.
let mut num_vals = 0;
let min_value = 0;
let mut max_value = 0;
let mut multi_value_length_readers = Vec::with_capacity(segment_and_ff_readers.len());
for segment_and_ff_reader in segment_and_ff_readers {
let segment_reader = segment_and_ff_reader.0;
let multi_value_length_reader = &segment_and_ff_reader.1;
if !segment_reader.has_deletes() {
max_value += multi_value_length_reader.get_total_len();
} else {
for doc in segment_reader.doc_ids_alive() {
max_value += multi_value_length_reader.get_len(doc);
}
}
num_vals += segment_reader.num_docs() as u64;
multi_value_length_readers.push(multi_value_length_reader);
}
Self {
doc_id_mapping,
multi_value_length_readers,
min_value,
max_value,
num_vals,
}
}
}
impl<'a, T: MultiValueLength + Send + Sync> Column for RemappedDocIdMultiValueIndexColumn<'a, T> {
fn get_val(&self, _pos: u64) -> u64 {
unimplemented!()
}
fn iter(&self) -> Box<dyn Iterator<Item = u64> + '_> {
let mut offset = 0;
Box::new(
std::iter::once(0).chain(self.doc_id_mapping.iter_old_doc_addrs().map(
move |old_doc_addr| {
let ff_reader =
&self.multi_value_length_readers[old_doc_addr.segment_ord as usize];
offset += ff_reader.get_len(old_doc_addr.doc_id);
offset
},
)),
)
}
fn min_value(&self) -> u64 {
self.min_value
}
fn max_value(&self) -> u64 {
self.max_value
}
fn num_vals(&self) -> u64 {
self.num_vals
}
}

View File

@@ -106,7 +106,7 @@ impl BlockDecoder {
pub trait VIntEncoder {
/// Compresses an array of `u32` integers,
/// using [delta-encoding](https://en.wikipedia.org/wiki/Delta_ encoding)
/// using [delta-encoding](https://en.wikipedia.org/wiki/Delta_encoding)
/// and variable bytes encoding.
///
/// The method takes an array of ints to compress, and returns

View File

@@ -3,7 +3,7 @@ use std::io;
use crate::fastfield::MultiValuedFastFieldWriter;
use crate::indexer::doc_id_mapping::DocIdMapping;
use crate::postings::postings_writer::SpecializedPostingsWriter;
use crate::postings::recorder::{BufferLender, NothingRecorder, Recorder};
use crate::postings::recorder::{BufferLender, DocIdRecorder, Recorder};
use crate::postings::stacker::Addr;
use crate::postings::{
FieldSerializer, IndexingContext, IndexingPosition, PostingsWriter, UnorderedTermId,
@@ -16,7 +16,7 @@ use crate::{DocId, Term};
#[derive(Default)]
pub(crate) struct JsonPostingsWriter<Rec: Recorder> {
str_posting_writer: SpecializedPostingsWriter<Rec>,
non_str_posting_writer: SpecializedPostingsWriter<NothingRecorder>,
non_str_posting_writer: SpecializedPostingsWriter<DocIdRecorder>,
}
impl<Rec: Recorder> From<JsonPostingsWriter<Rec>> for Box<dyn PostingsWriter> {
@@ -77,7 +77,7 @@ impl<Rec: Recorder> PostingsWriter for JsonPostingsWriter<Rec> {
serializer,
)?;
} else {
SpecializedPostingsWriter::<NothingRecorder>::serialize_one_term(
SpecializedPostingsWriter::<DocIdRecorder>::serialize_one_term(
term,
*addr,
doc_id_map,

View File

@@ -1,6 +1,6 @@
use crate::postings::json_postings_writer::JsonPostingsWriter;
use crate::postings::postings_writer::SpecializedPostingsWriter;
use crate::postings::recorder::{NothingRecorder, TermFrequencyRecorder, TfAndPositionRecorder};
use crate::postings::recorder::{DocIdRecorder, TermFrequencyRecorder, TfAndPositionRecorder};
use crate::postings::PostingsWriter;
use crate::schema::{Field, FieldEntry, FieldType, IndexRecordOption, Schema};
@@ -34,7 +34,7 @@ fn posting_writer_from_field_entry(field_entry: &FieldEntry) -> Box<dyn Postings
.get_indexing_options()
.map(|indexing_options| match indexing_options.index_option() {
IndexRecordOption::Basic => {
SpecializedPostingsWriter::<NothingRecorder>::default().into()
SpecializedPostingsWriter::<DocIdRecorder>::default().into()
}
IndexRecordOption::WithFreqs => {
SpecializedPostingsWriter::<TermFrequencyRecorder>::default().into()
@@ -43,19 +43,20 @@ fn posting_writer_from_field_entry(field_entry: &FieldEntry) -> Box<dyn Postings
SpecializedPostingsWriter::<TfAndPositionRecorder>::default().into()
}
})
.unwrap_or_else(|| SpecializedPostingsWriter::<NothingRecorder>::default().into()),
.unwrap_or_else(|| SpecializedPostingsWriter::<DocIdRecorder>::default().into()),
FieldType::U64(_)
| FieldType::I64(_)
| FieldType::F64(_)
| FieldType::Bool(_)
| FieldType::Date(_)
| FieldType::Bytes(_)
| FieldType::Facet(_) => Box::new(SpecializedPostingsWriter::<NothingRecorder>::default()),
| FieldType::IpAddr(_)
| FieldType::Facet(_) => Box::new(SpecializedPostingsWriter::<DocIdRecorder>::default()),
FieldType::JsonObject(ref json_object_options) => {
if let Some(text_indexing_option) = json_object_options.get_text_indexing_options() {
match text_indexing_option.index_option() {
IndexRecordOption::Basic => {
JsonPostingsWriter::<NothingRecorder>::default().into()
JsonPostingsWriter::<DocIdRecorder>::default().into()
}
IndexRecordOption::WithFreqs => {
JsonPostingsWriter::<TermFrequencyRecorder>::default().into()
@@ -65,7 +66,7 @@ fn posting_writer_from_field_entry(field_entry: &FieldEntry) -> Box<dyn Postings
}
}
} else {
JsonPostingsWriter::<NothingRecorder>::default().into()
JsonPostingsWriter::<DocIdRecorder>::default().into()
}
}
}

View File

@@ -89,6 +89,7 @@ pub(crate) fn serialize_postings(
| FieldType::Bool(_) => {}
FieldType::Bytes(_) => {}
FieldType::JsonObject(_) => {}
FieldType::IpAddr(_) => {}
}
let postings_writer = per_field_postings_writers.get_for_field(field);
@@ -152,9 +153,9 @@ pub(crate) trait PostingsWriter: Send + Sync {
indexing_position: &mut IndexingPosition,
mut term_id_fast_field_writer_opt: Option<&mut MultiValuedFastFieldWriter>,
) {
let end_of_path_idx = term_buffer.as_slice().len();
let end_of_path_idx = term_buffer.len_bytes();
let mut num_tokens = 0;
let mut end_position = 0;
let mut end_position = indexing_position.end_position;
token_stream.process(&mut |token: &Token| {
// We skip all tokens with a len greater than u16.
if token.text.len() > MAX_TOKEN_LEN {
@@ -166,7 +167,7 @@ pub(crate) trait PostingsWriter: Send + Sync {
);
return;
}
term_buffer.truncate(end_of_path_idx);
term_buffer.truncate_value_bytes(end_of_path_idx);
term_buffer.append_bytes(token.text.as_bytes());
let start_position = indexing_position.end_position + token.position as u32;
end_position = start_position + token.position_length as u32;
@@ -180,7 +181,7 @@ pub(crate) trait PostingsWriter: Send + Sync {
indexing_position.end_position = end_position + POSITION_GAP;
indexing_position.num_tokens += num_tokens;
term_buffer.truncate(end_of_path_idx);
term_buffer.truncate_value_bytes(end_of_path_idx);
}
fn total_num_tokens(&self) -> u64;

View File

@@ -47,11 +47,11 @@ impl<'a> Iterator for VInt32Reader<'a> {
}
}
/// Recorder is in charge of recording relevant information about
/// `Recorder` is in charge of recording relevant information about
/// the presence of a term in a document.
///
/// Depending on the `TextIndexingOptions` associated with the
/// field, the recorder may records
/// Depending on the [`TextOptions`](crate::schema::TextOptions) associated
/// with the field, the recorder may record:
/// * the document frequency
/// * the document id
/// * the term frequency
@@ -83,21 +83,21 @@ pub(crate) trait Recorder: Copy + Default + Send + Sync + 'static {
/// Only records the doc ids
#[derive(Clone, Copy)]
pub struct NothingRecorder {
pub struct DocIdRecorder {
stack: ExpUnrolledLinkedList,
current_doc: DocId,
}
impl Default for NothingRecorder {
impl Default for DocIdRecorder {
fn default() -> Self {
NothingRecorder {
DocIdRecorder {
stack: ExpUnrolledLinkedList::new(),
current_doc: u32::MAX,
}
}
}
impl Recorder for NothingRecorder {
impl Recorder for DocIdRecorder {
fn current_doc(&self) -> DocId {
self.current_doc
}

View File

@@ -98,7 +98,7 @@ impl<'a> Iterator for Iter<'a> {
/// # Panics if n == 0
fn compute_previous_power_of_two(n: usize) -> usize {
assert!(n > 0);
let msb = (63u32 - n.leading_zeros()) as u8;
let msb = (63u32 - (n as u64).leading_zeros()) as u8;
1 << msb
}
@@ -199,7 +199,7 @@ impl TermHashMap {
/// `update` create a new entry for a given key if it does not exists
/// or updates the existing entry.
///
/// The actual logic for this update is define in the the `updater`
/// The actual logic for this update is define in the `updater`
/// argument.
///
/// If the key is not present, `updater` will receive `None` and

View File

@@ -144,7 +144,7 @@ fn advance_all_scorers_on_pivot(term_scorers: &mut Vec<TermScorerWithMaxScore>,
/// Implements the WAND (Weak AND) algorithm for dynamic pruning
/// described in the paper "Faster Top-k Document Retrieval Using Block-Max Indexes".
/// Link: http://engineering.nyu.edu/~suel/papers/bmw.pdf
/// Link: <http://engineering.nyu.edu/~suel/papers/bmw.pdf>
pub fn block_wand(
mut scorers: Vec<TermScorer>,
mut threshold: Score,
@@ -212,12 +212,12 @@ pub fn block_wand(
}
/// Specialized version of [`block_wand`] for a single scorer.
/// In this case, the algorithm is simple and readable and faster (~ x3)
/// In this case, the algorithm is simple, readable and faster (~ x3)
/// than the generic algorithm.
/// The algorithm behaves as follows:
/// - While we don't hit the end of the docset:
/// - While the block max score is under the `threshold`, go to the next block.
/// - On a block, advance until the end and execute `callback`` when the doc score is greater or
/// - On a block, advance until the end and execute `callback` when the doc score is greater or
/// equal to the `threshold`.
pub fn block_wand_single_scorer(
mut scorer: TermScorer,

View File

@@ -174,9 +174,9 @@ impl<TScoreCombiner: ScoreCombiner + Sync> Weight for BooleanWeight<TScoreCombin
into_box_scorer(specialized_scorer, &self.score_combiner_fn)
})
} else {
self.complex_scorer(reader, boost, &DoNothingCombiner::default)
self.complex_scorer(reader, boost, DoNothingCombiner::default)
.map(|specialized_scorer| {
into_box_scorer(specialized_scorer, &DoNothingCombiner::default)
into_box_scorer(specialized_scorer, DoNothingCombiner::default)
})
}
}

View File

@@ -21,6 +21,7 @@ mod range_query;
mod regex_query;
mod reqopt_scorer;
mod scorer;
mod set_query;
mod term_query;
mod union;
mod weight;
@@ -58,6 +59,7 @@ pub use self::score_combiner::{
DisjunctionMaxCombiner, ScoreCombiner, SumCombiner, SumWithCoordsCombiner,
};
pub use self::scorer::Scorer;
pub use self::set_query::TermSetQuery;
pub use self::term_query::TermQuery;
pub use self::union::Union;
#[cfg(test)]

View File

@@ -33,9 +33,9 @@ impl Ord for ScoreTerm {
}
}
/// A struct used as helper to build [`MoreLikeThisQuery`]
/// This more-like-this implementation is inspired by the Appache Lucene
/// amd closely follows the same implementation with adaptabtion to Tantivy vocabulary and API.
/// A struct used as helper to build [`MoreLikeThisQuery`](crate::query::MoreLikeThisQuery)
/// This more-like-this implementation is inspired by the Apache Lucene
/// and closely follows the same implementation with adaptation to Tantivy vocabulary and API.
///
/// [MoreLikeThis](https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/mlt/MoreLikeThis.java#L147)
/// [MoreLikeThisQuery](https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/mlt/MoreLikeThisQuery.java#L36)

View File

@@ -1,4 +1,5 @@
use std::collections::HashMap;
use std::net::{AddrParseError, IpAddr};
use std::num::{ParseFloatError, ParseIntError};
use std::ops::Bound;
use std::str::{FromStr, ParseBoolError};
@@ -15,7 +16,7 @@ use crate::query::{
TermQuery,
};
use crate::schema::{
Facet, FacetParseError, Field, FieldType, IndexRecordOption, Schema, Term, Type,
Facet, FacetParseError, Field, FieldType, IndexRecordOption, IntoIpv6Addr, Schema, Term, Type,
};
use crate::time::format_description::well_known::Rfc3339;
use crate::time::OffsetDateTime;
@@ -84,6 +85,9 @@ pub enum QueryParserError {
/// The format for the facet field is invalid.
#[error("The facet field is malformed: {0}")]
FacetFormatError(#[from] FacetParseError),
/// The format for the ip field is invalid.
#[error("The ip field is malformed: {0}")]
IpFormatError(#[from] AddrParseError),
}
/// Recursively remove empty clause from the AST
@@ -400,6 +404,10 @@ impl QueryParser {
let bytes = base64::decode(phrase).map_err(QueryParserError::ExpectedBase64)?;
Ok(Term::from_field_bytes(field, &bytes))
}
FieldType::IpAddr(_) => {
let ip_v6 = IpAddr::from_str(phrase)?.into_ipv6_addr();
Ok(Term::from_field_ip_addr(field, ip_v6))
}
}
}
@@ -506,6 +514,11 @@ impl QueryParser {
let bytes_term = Term::from_field_bytes(field, &bytes);
Ok(vec![LogicalLiteral::Term(bytes_term)])
}
FieldType::IpAddr(_) => {
let ip_v6 = IpAddr::from_str(phrase)?.into_ipv6_addr();
let term = Term::from_field_ip_addr(field, ip_v6);
Ok(vec![LogicalLiteral::Term(term)])
}
}
}
@@ -730,7 +743,7 @@ fn generate_literals_for_json_object(
index_record_option: IndexRecordOption,
) -> Result<Vec<LogicalLiteral>, QueryParserError> {
let mut logical_literals = Vec::new();
let mut term = Term::new();
let mut term = Term::with_capacity(100);
let mut json_term_writer =
JsonTermWriter::from_field_and_json_path(field, json_path, &mut term);
if let Some(term) = convert_to_fast_value_and_get_term(&mut json_term_writer, phrase) {

View File

@@ -328,13 +328,15 @@ impl Weight for RangeWeight {
#[cfg(test)]
mod tests {
use std::net::IpAddr;
use std::ops::Bound;
use std::str::FromStr;
use super::RangeQuery;
use crate::collector::{Count, TopDocs};
use crate::query::QueryParser;
use crate::schema::{Document, Field, Schema, INDEXED, TEXT};
use crate::Index;
use crate::schema::{Document, Field, IntoIpv6Addr, Schema, INDEXED, STORED, TEXT};
use crate::{doc, Index};
#[test]
fn test_range_query_simple() -> crate::Result<()> {
@@ -506,4 +508,69 @@ mod tests {
assert_eq!(top_docs.len(), 1);
Ok(())
}
#[test]
fn search_ip_range_test() {
let mut schema_builder = Schema::builder();
let ip_field = schema_builder.add_ip_addr_field("ip", INDEXED | STORED);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let ip_addr_1 = IpAddr::from_str("127.0.0.10").unwrap().into_ipv6_addr();
let ip_addr_2 = IpAddr::from_str("127.0.0.20").unwrap().into_ipv6_addr();
{
let mut index_writer = index.writer(3_000_000).unwrap();
index_writer
.add_document(doc!(
ip_field => ip_addr_1
))
.unwrap();
index_writer
.add_document(doc!(
ip_field => ip_addr_2
))
.unwrap();
index_writer.commit().unwrap();
}
let reader = index.reader().unwrap();
let searcher = reader.searcher();
let get_num_hits = |query| {
let (_top_docs, count) = searcher
.search(&query, &(TopDocs::with_limit(10), Count))
.unwrap();
count
};
let query_from_text = |text: &str| {
QueryParser::for_index(&index, vec![ip_field])
.parse_query(text)
.unwrap()
};
assert_eq!(
get_num_hits(query_from_text("ip:[127.0.0.1 TO 127.0.0.20]")),
2
);
assert_eq!(
get_num_hits(query_from_text("ip:[127.0.0.10 TO 127.0.0.20]")),
2
);
assert_eq!(
get_num_hits(query_from_text("ip:[127.0.0.11 TO 127.0.0.20]")),
1
);
assert_eq!(
get_num_hits(query_from_text("ip:[127.0.0.11 TO 127.0.0.19]")),
0
);
assert_eq!(get_num_hits(query_from_text("ip:[127.0.0.11 TO *]")), 1);
assert_eq!(get_num_hits(query_from_text("ip:[127.0.0.21 TO *]")), 0);
assert_eq!(get_num_hits(query_from_text("ip:[* TO 127.0.0.9]")), 0);
assert_eq!(get_num_hits(query_from_text("ip:[* TO 127.0.0.10]")), 1);
}
}

222
src/query/set_query.rs Normal file
View File

@@ -0,0 +1,222 @@
use std::collections::HashMap;
use tantivy_fst::raw::CompiledAddr;
use tantivy_fst::{Automaton, Map};
use crate::query::score_combiner::DoNothingCombiner;
use crate::query::{AutomatonWeight, BooleanWeight, Occur, Query, Weight};
use crate::schema::Field;
use crate::{Searcher, Term};
/// A Term Set Query matches all of the documents containing any of the Term provided
#[derive(Debug, Clone)]
pub struct TermSetQuery {
terms_map: HashMap<Field, Vec<Term>>,
}
impl TermSetQuery {
/// Create a Term Set Query
pub fn new<T: IntoIterator<Item = Term>>(terms: T) -> Self {
let mut terms_map: HashMap<_, Vec<_>> = HashMap::new();
for term in terms {
terms_map.entry(term.field()).or_default().push(term);
}
for terms in terms_map.values_mut() {
terms.sort_unstable();
terms.dedup();
}
TermSetQuery { terms_map }
}
fn specialized_weight(
&self,
searcher: &Searcher,
) -> crate::Result<BooleanWeight<DoNothingCombiner>> {
let mut sub_queries: Vec<(_, Box<dyn Weight>)> = Vec::with_capacity(self.terms_map.len());
for (&field, sorted_terms) in self.terms_map.iter() {
let field_entry = searcher.schema().get_field_entry(field);
let field_type = field_entry.field_type();
if !field_type.is_indexed() {
let error_msg = format!("Field {:?} is not indexed.", field_entry.name());
return Err(crate::TantivyError::SchemaError(error_msg));
}
// In practice this won't fail because:
// - we are writing to memory, so no IoError
// - Terms are ordered
let map = Map::from_iter(sorted_terms.iter().map(|key| (key.value_bytes(), 0)))
.map_err(|e| std::io::Error::new(std::io::ErrorKind::Other, e))?;
sub_queries.push((
Occur::Should,
Box::new(AutomatonWeight::new(field, SetDfaWrapper(map))),
));
}
Ok(BooleanWeight::new(
sub_queries,
false,
Box::new(|| DoNothingCombiner),
))
}
}
impl Query for TermSetQuery {
fn weight(
&self,
searcher: &Searcher,
_scoring_enabled: bool,
) -> crate::Result<Box<dyn Weight>> {
Ok(Box::new(self.specialized_weight(searcher)?))
}
}
struct SetDfaWrapper(Map<Vec<u8>>);
impl Automaton for SetDfaWrapper {
type State = Option<CompiledAddr>;
fn start(&self) -> Option<CompiledAddr> {
Some(self.0.as_ref().root().addr())
}
fn is_match(&self, state_opt: &Option<CompiledAddr>) -> bool {
if let Some(state) = state_opt {
self.0.as_ref().node(*state).is_final()
} else {
false
}
}
fn accept(&self, state_opt: &Option<CompiledAddr>, byte: u8) -> Option<CompiledAddr> {
let state = state_opt.as_ref()?;
let node = self.0.as_ref().node(*state);
let transition = node.find_input(byte)?;
Some(node.transition_addr(transition))
}
fn can_match(&self, state: &Self::State) -> bool {
state.is_some()
}
}
#[cfg(test)]
mod tests {
use crate::collector::TopDocs;
use crate::query::TermSetQuery;
use crate::schema::{Schema, TEXT};
use crate::{assert_nearly_equals, Index, Term};
#[test]
pub fn test_term_set_query() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
let field1 = schema_builder.add_text_field("field1", TEXT);
let field2 = schema_builder.add_text_field("field1", TEXT);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
{
let mut index_writer = index.writer_for_tests()?;
index_writer.add_document(doc!(
field1 => "doc1",
field2 => "val1",
))?;
index_writer.add_document(doc!(
field1 => "doc2",
field2 => "val2",
))?;
index_writer.add_document(doc!(
field1 => "doc3",
field2 => "val3",
))?;
index_writer.add_document(doc!(
field1 => "val3",
field2 => "doc3",
))?;
index_writer.commit()?;
}
let reader = index.reader()?;
let searcher = reader.searcher();
{
// single element
let terms = vec![Term::from_field_text(field1, "doc1")];
let term_set_query = TermSetQuery::new(terms);
let top_docs = searcher.search(&term_set_query, &TopDocs::with_limit(2))?;
assert_eq!(top_docs.len(), 1, "Expected 1 document");
let (score, _) = top_docs[0];
assert_nearly_equals!(1.0, score);
}
{
// single element, absent
let terms = vec![Term::from_field_text(field1, "doc4")];
let term_set_query = TermSetQuery::new(terms);
let top_docs = searcher.search(&term_set_query, &TopDocs::with_limit(1))?;
assert!(top_docs.is_empty(), "Expected 0 document");
}
{
// multiple elements
let terms = vec![
Term::from_field_text(field1, "doc1"),
Term::from_field_text(field1, "doc2"),
];
let term_set_query = TermSetQuery::new(terms);
let top_docs = searcher.search(&term_set_query, &TopDocs::with_limit(2))?;
assert_eq!(top_docs.len(), 2, "Expected 2 documents");
for (score, _) in top_docs {
assert_nearly_equals!(1.0, score);
}
}
{
// multiple elements, mixed fields
let terms = vec![
Term::from_field_text(field1, "doc1"),
Term::from_field_text(field1, "doc1"),
Term::from_field_text(field2, "val2"),
];
let term_set_query = TermSetQuery::new(terms);
let top_docs = searcher.search(&term_set_query, &TopDocs::with_limit(3))?;
assert_eq!(top_docs.len(), 2, "Expected 2 document");
for (score, _) in top_docs {
assert_nearly_equals!(1.0, score);
}
}
{
// no field crosstalk
let terms = vec![Term::from_field_text(field1, "doc3")];
let term_set_query = TermSetQuery::new(terms);
let top_docs = searcher.search(&term_set_query, &TopDocs::with_limit(3))?;
assert_eq!(top_docs.len(), 1, "Expected 1 document");
let terms = vec![Term::from_field_text(field2, "doc3")];
let term_set_query = TermSetQuery::new(terms);
let top_docs = searcher.search(&term_set_query, &TopDocs::with_limit(3))?;
assert_eq!(top_docs.len(), 1, "Expected 1 document");
let terms = vec![
Term::from_field_text(field1, "doc3"),
Term::from_field_text(field2, "doc3"),
];
let term_set_query = TermSetQuery::new(terms);
let top_docs = searcher.search(&term_set_query, &TopDocs::with_limit(3))?;
assert_eq!(top_docs.len(), 2, "Expected 2 document");
}
Ok(())
}
}

View File

@@ -124,3 +124,70 @@ impl Query for TermQuery {
visitor(&self.term, false);
}
}
#[cfg(test)]
mod tests {
use std::net::{IpAddr, Ipv6Addr};
use std::str::FromStr;
use fastfield_codecs::MonotonicallyMappableToU128;
use crate::collector::{Count, TopDocs};
use crate::query::{Query, QueryParser, TermQuery};
use crate::schema::{IndexRecordOption, IntoIpv6Addr, Schema, INDEXED, STORED};
use crate::{doc, Index, Term};
#[test]
fn search_ip_test() {
let mut schema_builder = Schema::builder();
let ip_field = schema_builder.add_ip_addr_field("ip", INDEXED | STORED);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let ip_addr_1 = IpAddr::from_str("127.0.0.1").unwrap().into_ipv6_addr();
let ip_addr_2 = Ipv6Addr::from_u128(10);
{
let mut index_writer = index.writer(3_000_000).unwrap();
index_writer
.add_document(doc!(
ip_field => ip_addr_1
))
.unwrap();
index_writer
.add_document(doc!(
ip_field => ip_addr_2
))
.unwrap();
index_writer.commit().unwrap();
}
let reader = index.reader().unwrap();
let searcher = reader.searcher();
let assert_single_hit = |query| {
let (_top_docs, count) = searcher
.search(&query, &(TopDocs::with_limit(2), Count))
.unwrap();
assert_eq!(count, 1);
};
let query_from_text = |text: String| {
QueryParser::for_index(&index, vec![ip_field])
.parse_query(&text)
.unwrap()
};
let query_from_ip = |ip_addr| -> Box<dyn Query> {
Box::new(TermQuery::new(
Term::from_field_ip_addr(ip_field, ip_addr),
IndexRecordOption::Basic,
))
};
assert_single_hit(query_from_ip(ip_addr_1));
assert_single_hit(query_from_ip(ip_addr_2));
assert_single_hit(query_from_text("127.0.0.1".to_string()));
assert_single_hit(query_from_text("\"127.0.0.1\"".to_string()));
assert_single_hit(query_from_text(format!("\"{}\"", ip_addr_1)));
assert_single_hit(query_from_text(format!("\"{}\"", ip_addr_2)));
}
}

View File

@@ -263,8 +263,7 @@ impl InnerIndexReader {
/// It controls when a new version of the index should be loaded and lends
/// you instances of `Searcher` for the last loaded version.
///
/// `Clone` does not clone the different pool of searcher. `IndexReader`
/// just wraps an `Arc`.
/// `IndexReader` just wraps an `Arc`.
#[derive(Clone)]
pub struct IndexReader {
inner: Arc<InnerIndexReader>,
@@ -294,9 +293,6 @@ impl IndexReader {
///
/// This method should be called every single time a search
/// query is performed.
/// The searchers are taken from a pool of `num_searchers` searchers.
/// If no searcher is available
/// this may block.
///
/// The same searcher must be used for a given query, as it ensures
/// the use of a consistent segment set.

View File

@@ -3,7 +3,7 @@ use std::ops::BitOr;
use serde::{Deserialize, Serialize};
use super::flags::{FastFlag, IndexedFlag, SchemaFlagList, StoredFlag};
/// Define how an a bytes field should be handled by tantivy.
/// Define how a bytes field should be handled by tantivy.
#[derive(Clone, Debug, Default, PartialEq, Eq, Serialize, Deserialize)]
#[serde(from = "BytesOptionsDeser")]
pub struct BytesOptions {

View File

@@ -1,6 +1,7 @@
use std::collections::{HashMap, HashSet};
use std::io::{self, Read, Write};
use std::mem;
use std::net::Ipv6Addr;
use common::{BinarySerializable, VInt};
@@ -97,6 +98,11 @@ impl Document {
self.add_field_value(field, value);
}
/// Add a IP address field. Internally only Ipv6Addr is used.
pub fn add_ip_addr(&mut self, field: Field, value: Ipv6Addr) {
self.add_field_value(field, value);
}
/// Add a i64 field
pub fn add_i64(&mut self, field: Field, value: i64) {
self.add_field_value(field, value);
@@ -191,6 +197,34 @@ impl Document {
pub fn get_first(&self, field: Field) -> Option<&Value> {
self.get_all(field).next()
}
/// Serializes stored field values.
pub fn serialize_stored<W: Write>(&self, schema: &Schema, writer: &mut W) -> io::Result<()> {
let stored_field_values = || {
self.field_values()
.iter()
.filter(|field_value| schema.get_field_entry(field_value.field()).is_stored())
};
let num_field_values = stored_field_values().count();
VInt(num_field_values as u64).serialize(writer)?;
for field_value in stored_field_values() {
match field_value {
FieldValue {
field,
value: Value::PreTokStr(pre_tokenized_text),
} => {
let field_value = FieldValue {
field: *field,
value: Value::Str(pre_tokenized_text.text.to_string()),
};
field_value.serialize(writer)?;
}
field_value => field_value.serialize(writer)?,
};
}
Ok(())
}
}
impl BinarySerializable for Document {

View File

@@ -1,5 +1,6 @@
use serde::{Deserialize, Serialize};
use super::ip_options::IpAddrOptions;
use crate::schema::bytes_options::BytesOptions;
use crate::schema::{
is_valid_field_name, DateOptions, FacetOptions, FieldType, JsonObjectOptions, NumericOptions,
@@ -60,6 +61,11 @@ impl FieldEntry {
Self::new(field_name, FieldType::Date(date_options))
}
/// Creates a new ip address field entry.
pub fn new_ip_addr(field_name: String, ip_options: IpAddrOptions) -> FieldEntry {
Self::new(field_name, FieldType::IpAddr(ip_options))
}
/// Creates a field entry for a facet.
pub fn new_facet(field_name: String, facet_options: FacetOptions) -> FieldEntry {
Self::new(field_name, FieldType::Facet(facet_options))
@@ -114,6 +120,7 @@ impl FieldEntry {
FieldType::Facet(ref options) => options.is_stored(),
FieldType::Bytes(ref options) => options.is_stored(),
FieldType::JsonObject(ref options) => options.is_stored(),
FieldType::IpAddr(ref options) => options.is_stored(),
}
}
}

View File

@@ -1,7 +1,12 @@
use std::net::IpAddr;
use std::str::FromStr;
use serde::{Deserialize, Serialize};
use serde_json::Value as JsonValue;
use thiserror::Error;
use super::ip_options::IpAddrOptions;
use super::{Cardinality, IntoIpv6Addr};
use crate::schema::bytes_options::BytesOptions;
use crate::schema::facet_options::FacetOptions;
use crate::schema::{
@@ -61,9 +66,11 @@ pub enum Type {
Bytes = b'b',
/// Leaf in a Json object.
Json = b'j',
/// IpAddr
IpAddr = b'p',
}
const ALL_TYPES: [Type; 9] = [
const ALL_TYPES: [Type; 10] = [
Type::Str,
Type::U64,
Type::I64,
@@ -73,6 +80,7 @@ const ALL_TYPES: [Type; 9] = [
Type::Facet,
Type::Bytes,
Type::Json,
Type::IpAddr,
];
impl Type {
@@ -99,6 +107,7 @@ impl Type {
Type::Facet => "Facet",
Type::Bytes => "Bytes",
Type::Json => "Json",
Type::IpAddr => "IpAddr",
}
}
@@ -115,6 +124,7 @@ impl Type {
b'h' => Some(Type::Facet),
b'b' => Some(Type::Bytes),
b'j' => Some(Type::Json),
b'p' => Some(Type::IpAddr),
_ => None,
}
}
@@ -145,6 +155,8 @@ pub enum FieldType {
Bytes(BytesOptions),
/// Json object
JsonObject(JsonObjectOptions),
/// IpAddr field
IpAddr(IpAddrOptions),
}
impl FieldType {
@@ -160,6 +172,7 @@ impl FieldType {
FieldType::Facet(_) => Type::Facet,
FieldType::Bytes(_) => Type::Bytes,
FieldType::JsonObject(_) => Type::Json,
FieldType::IpAddr(_) => Type::IpAddr,
}
}
@@ -175,6 +188,7 @@ impl FieldType {
FieldType::Facet(ref _facet_options) => true,
FieldType::Bytes(ref bytes_options) => bytes_options.is_indexed(),
FieldType::JsonObject(ref json_object_options) => json_object_options.is_indexed(),
FieldType::IpAddr(ref ip_addr_options) => ip_addr_options.is_indexed(),
}
}
@@ -209,11 +223,32 @@ impl FieldType {
| FieldType::F64(ref int_options)
| FieldType::Bool(ref int_options) => int_options.is_fast(),
FieldType::Date(ref date_options) => date_options.is_fast(),
FieldType::IpAddr(ref ip_addr_options) => ip_addr_options.is_fast(),
FieldType::Facet(_) => true,
FieldType::JsonObject(_) => false,
}
}
/// returns true if the field is fast.
pub fn fastfield_cardinality(&self) -> Option<Cardinality> {
match *self {
FieldType::Bytes(ref bytes_options) if bytes_options.is_fast() => {
Some(Cardinality::SingleValue)
}
FieldType::Str(ref text_options) if text_options.is_fast() => {
Some(Cardinality::MultiValues)
}
FieldType::U64(ref int_options)
| FieldType::I64(ref int_options)
| FieldType::F64(ref int_options)
| FieldType::Bool(ref int_options) => int_options.get_fastfield_cardinality(),
FieldType::Date(ref date_options) => date_options.get_fastfield_cardinality(),
FieldType::Facet(_) => Some(Cardinality::MultiValues),
FieldType::JsonObject(_) => None,
_ => None,
}
}
/// returns true if the field is normed (see [fieldnorms](crate::fieldnorm)).
pub fn has_fieldnorms(&self) -> bool {
match *self {
@@ -229,6 +264,7 @@ impl FieldType {
FieldType::Facet(_) => false,
FieldType::Bytes(ref bytes_options) => bytes_options.fieldnorms(),
FieldType::JsonObject(ref _json_object_options) => false,
FieldType::IpAddr(ref ip_addr_options) => ip_addr_options.fieldnorms(),
}
}
@@ -273,6 +309,13 @@ impl FieldType {
FieldType::JsonObject(ref json_obj_options) => json_obj_options
.get_text_indexing_options()
.map(TextFieldIndexing::index_option),
FieldType::IpAddr(ref ip_addr_options) => {
if ip_addr_options.is_indexed() {
Some(IndexRecordOption::Basic)
} else {
None
}
}
}
}
@@ -312,6 +355,16 @@ impl FieldType {
expected: "a json object",
json: JsonValue::String(field_text),
}),
FieldType::IpAddr(_) => {
let ip_addr: IpAddr = IpAddr::from_str(&field_text).map_err(|err| {
ValueParsingError::ParseError {
error: err.to_string(),
json: JsonValue::String(field_text),
}
})?;
Ok(Value::IpAddr(ip_addr.into_ipv6_addr()))
}
}
}
JsonValue::Number(field_val_num) => match self {
@@ -359,6 +412,10 @@ impl FieldType {
expected: "a json object",
json: JsonValue::Number(field_val_num),
}),
FieldType::IpAddr(_) => Err(ValueParsingError::TypeError {
expected: "a string with an ip addr",
json: JsonValue::Number(field_val_num),
}),
},
JsonValue::Object(json_map) => match self {
FieldType::Str(_) => {

View File

@@ -37,6 +37,8 @@ pub struct FastFlag;
///
/// Fast fields can be random-accessed rapidly. Fields useful for scoring, filtering
/// or collection should be mark as fast fields.
///
/// See [fast fields](`crate::fastfield`).
pub const FAST: SchemaFlagList<FastFlag, ()> = SchemaFlagList {
head: FastFlag,
tail: (),

View File

@@ -1,3 +1,4 @@
use std::net::{IpAddr, Ipv6Addr};
use std::ops::BitOr;
use serde::{Deserialize, Serialize};
@@ -5,25 +6,52 @@ use serde::{Deserialize, Serialize};
use super::flags::{FastFlag, IndexedFlag, SchemaFlagList, StoredFlag};
use super::Cardinality;
/// Trait to convert into an Ipv6Addr.
pub trait IntoIpv6Addr {
/// Consumes the object and returns an Ipv6Addr.
fn into_ipv6_addr(self) -> Ipv6Addr;
}
impl IntoIpv6Addr for IpAddr {
fn into_ipv6_addr(self) -> Ipv6Addr {
match self {
IpAddr::V4(addr) => addr.to_ipv6_mapped(),
IpAddr::V6(addr) => addr,
}
}
}
/// Define how an ip field should be handled by tantivy.
#[derive(Clone, Debug, PartialEq, Eq, Serialize, Deserialize, Default)]
pub struct IpOptions {
pub struct IpAddrOptions {
#[serde(skip_serializing_if = "Option::is_none")]
fast: Option<Cardinality>,
stored: bool,
indexed: bool,
fieldnorms: bool,
}
impl IpOptions {
impl IpAddrOptions {
/// Returns true iff the value is a fast field.
pub fn is_fast(&self) -> bool {
self.fast.is_some()
}
/// Returns `true` if the json object should be stored.
/// Returns `true` if the ip address should be stored in the doc store.
pub fn is_stored(&self) -> bool {
self.stored
}
/// Returns true iff the value is indexed and therefore searchable.
pub fn is_indexed(&self) -> bool {
self.indexed
}
/// Returns true if and only if the value is normed.
pub fn fieldnorms(&self) -> bool {
self.fieldnorms
}
/// Returns the cardinality of the fastfield.
///
/// If the field has not been declared as a fastfield, then
@@ -32,6 +60,16 @@ impl IpOptions {
self.fast
}
/// Set the field as normed.
///
/// Setting an integer as normed will generate
/// the fieldnorm data for it.
#[must_use]
pub fn set_fieldnorms(mut self) -> Self {
self.fieldnorms = true;
self
}
/// Sets the field as stored
#[must_use]
pub fn set_stored(mut self) -> Self {
@@ -39,6 +77,19 @@ impl IpOptions {
self
}
/// Set the field as indexed.
///
/// Setting an ip address as indexed will generate
/// a posting list for each value taken by the ip address.
/// Ips are normalized to IpV6.
///
/// This is required for the field to be searchable.
#[must_use]
pub fn set_indexed(mut self) -> Self {
self.indexed = true;
self
}
/// Set the field as a fast field.
///
/// Fast fields are designed for random access.
@@ -52,52 +103,60 @@ impl IpOptions {
}
}
impl From<()> for IpOptions {
fn from(_: ()) -> IpOptions {
IpOptions::default()
impl From<()> for IpAddrOptions {
fn from(_: ()) -> IpAddrOptions {
IpAddrOptions::default()
}
}
impl From<FastFlag> for IpOptions {
impl From<FastFlag> for IpAddrOptions {
fn from(_: FastFlag) -> Self {
IpOptions {
IpAddrOptions {
fieldnorms: false,
indexed: false,
stored: false,
fast: Some(Cardinality::SingleValue),
}
}
}
impl From<StoredFlag> for IpOptions {
impl From<StoredFlag> for IpAddrOptions {
fn from(_: StoredFlag) -> Self {
IpOptions {
IpAddrOptions {
fieldnorms: false,
indexed: false,
stored: true,
fast: None,
}
}
}
impl From<IndexedFlag> for IpOptions {
impl From<IndexedFlag> for IpAddrOptions {
fn from(_: IndexedFlag) -> Self {
IpOptions {
IpAddrOptions {
fieldnorms: true,
indexed: true,
stored: false,
fast: None,
}
}
}
impl<T: Into<IpOptions>> BitOr<T> for IpOptions {
type Output = IpOptions;
impl<T: Into<IpAddrOptions>> BitOr<T> for IpAddrOptions {
type Output = IpAddrOptions;
fn bitor(self, other: T) -> IpOptions {
fn bitor(self, other: T) -> IpAddrOptions {
let other = other.into();
IpOptions {
IpAddrOptions {
fieldnorms: self.fieldnorms | other.fieldnorms,
indexed: self.indexed | other.indexed,
stored: self.stored | other.stored,
fast: self.fast.or(other.fast),
}
}
}
impl<Head, Tail> From<SchemaFlagList<Head, Tail>> for IpOptions
impl<Head, Tail> From<SchemaFlagList<Head, Tail>> for IpAddrOptions
where
Head: Clone,
Tail: Clone,

View File

@@ -28,10 +28,10 @@
//! use tantivy::schema::*;
//! let mut schema_builder = Schema::builder();
//! let title_options = TextOptions::default()
//! .set_stored()
//! .set_indexing_options(TextFieldIndexing::default()
//! .set_tokenizer("default")
//! .set_index_option(IndexRecordOption::WithFreqsAndPositions));
//! .set_stored()
//! .set_indexing_options(TextFieldIndexing::default()
//! .set_tokenizer("default")
//! .set_index_option(IndexRecordOption::WithFreqsAndPositions));
//! schema_builder.add_text_field("title", title_options);
//! let schema = schema_builder.build();
//! ```
@@ -45,8 +45,7 @@
//! In the first phase, the ability to search for documents by the given field is determined by the
//! [`IndexRecordOption`] of our [`TextOptions`].
//!
//! The effect of each possible setting is described more in detail
//! [`TextIndexingOptions`](enum.TextIndexingOptions.html).
//! The effect of each possible setting is described more in detail in [`TextOptions`].
//!
//! On the other hand setting the field as stored or not determines whether the field should be
//! returned when [`Searcher::doc()`](crate::Searcher::doc) is called.
@@ -60,8 +59,8 @@
//! use tantivy::schema::*;
//! let mut schema_builder = Schema::builder();
//! let num_stars_options = NumericOptions::default()
//! .set_stored()
//! .set_indexed();
//! .set_stored()
//! .set_indexed();
//! schema_builder.add_u64_field("num_stars", num_stars_options);
//! let schema = schema_builder.build();
//! ```
@@ -79,8 +78,8 @@
//! For convenience, it is possible to define your field indexing options by combining different
//! flags using the `|` operator.
//!
//! For instance, a schema containing the two fields defined in the example above could be rewritten
//! :
//! For instance, a schema containing the two fields defined in the example above could be
//! rewritten:
//!
//! ```
//! use tantivy::schema::*;
@@ -139,7 +138,7 @@ pub use self::field_type::{FieldType, Type};
pub use self::field_value::FieldValue;
pub use self::flags::{FAST, INDEXED, STORED};
pub use self::index_record_option::IndexRecordOption;
pub use self::ip_options::IpOptions;
pub use self::ip_options::{IntoIpv6Addr, IpAddrOptions};
pub use self::json_object_options::JsonObjectOptions;
pub use self::named_field_document::NamedFieldDocument;
pub use self::numeric_options::NumericOptions;

View File

@@ -59,7 +59,7 @@ impl From<NumericOptionsDeser> for NumericOptions {
}
impl NumericOptions {
/// Returns true iff the value is stored.
/// Returns true iff the value is stored in the doc store.
pub fn is_stored(&self) -> bool {
self.stored
}

View File

@@ -7,6 +7,7 @@ use serde::ser::SerializeSeq;
use serde::{Deserialize, Deserializer, Serialize, Serializer};
use serde_json::{self, Value as JsonValue};
use super::ip_options::IpAddrOptions;
use super::*;
use crate::schema::bytes_options::BytesOptions;
use crate::schema::field_type::ValueParsingError;
@@ -144,6 +145,26 @@ impl SchemaBuilder {
self.add_field(field_entry)
}
/// Adds a ip field.
/// Returns the associated field handle.
///
/// # Caution
///
/// Appending two fields with the same name
/// will result in the shadowing of the first
/// by the second one.
/// The first field will get a field id
/// but only the second one will be indexed
pub fn add_ip_addr_field<T: Into<IpAddrOptions>>(
&mut self,
field_name_str: &str,
field_options: T,
) -> Field {
let field_name = String::from(field_name_str);
let field_entry = FieldEntry::new_ip_addr(field_name, field_options.into());
self.add_field(field_entry)
}
/// Adds a new text field.
/// Returns the associated field handle
///
@@ -598,12 +619,14 @@ mod tests {
schema_builder.add_text_field("title", TEXT);
schema_builder.add_text_field("author", STRING);
schema_builder.add_u64_field("count", count_options);
schema_builder.add_ip_addr_field("ip", FAST | STORED);
schema_builder.add_bool_field("is_read", is_read_options);
let schema = schema_builder.build();
let doc_json = r#"{
"title": "my title",
"author": "fulmicoton",
"count": 4,
"ip": "127.0.0.1",
"is_read": true
}"#;
let doc = schema.parse_document(doc_json).unwrap();
@@ -612,6 +635,39 @@ mod tests {
assert_eq!(doc, doc_serdeser);
}
#[test]
pub fn test_document_to_ipv4_json() {
let mut schema_builder = Schema::builder();
schema_builder.add_ip_addr_field("ip", FAST | STORED);
let schema = schema_builder.build();
// IpV4 loopback
let doc_json = r#"{
"ip": "127.0.0.1"
}"#;
let doc = schema.parse_document(doc_json).unwrap();
let value: serde_json::Value = serde_json::from_str(&schema.to_json(&doc)).unwrap();
assert_eq!(value["ip"][0], "127.0.0.1");
// Special case IpV6 loopback. We don't want to map that to IPv4
let doc_json = r#"{
"ip": "::1"
}"#;
let doc = schema.parse_document(doc_json).unwrap();
let value: serde_json::Value = serde_json::from_str(&schema.to_json(&doc)).unwrap();
assert_eq!(value["ip"][0], "::1");
// testing ip address of every router in the world
let doc_json = r#"{
"ip": "192.168.0.1"
}"#;
let doc = schema.parse_document(doc_json).unwrap();
let value: serde_json::Value = serde_json::from_str(&schema.to_json(&doc)).unwrap();
assert_eq!(value["ip"][0], "192.168.0.1");
}
#[test]
pub fn test_document_from_nameddoc() {
let mut schema_builder = Schema::builder();

View File

@@ -1,24 +1,15 @@
use std::convert::TryInto;
use std::hash::{Hash, Hasher};
use std::net::Ipv6Addr;
use std::{fmt, str};
use fastfield_codecs::MonotonicallyMappableToU128;
use super::Field;
use crate::fastfield::FastValue;
use crate::schema::{Facet, Type};
use crate::{DatePrecision, DateTime};
/// Size (in bytes) of the buffer of a fast value (u64, i64, f64, or date) term.
/// <field> + <type byte> + <value len>
///
/// - <field> is a big endian encoded u32 field id
/// - <type_byte>'s most significant bit expresses whether the term is a json term or not
/// The remaining 7 bits are used to encode the type of the value.
/// If this is a JSON term, the type is the type of the leaf of the json.
///
/// - <value> is, if this is not the json term, a binary representation specific to the type.
/// If it is a JSON Term, then it is prepended with the path that leads to this leaf value.
const FAST_VALUE_TERM_LEN: usize = 4 + 1 + 8;
/// Separates the different segments of
/// the json path.
pub const JSON_PATH_SEGMENT_SEP: u8 = 1u8;
@@ -36,45 +27,78 @@ pub const JSON_END_OF_PATH: u8 = 0u8;
pub struct Term<B = Vec<u8>>(B)
where B: AsRef<[u8]>;
impl AsMut<Vec<u8>> for Term {
fn as_mut(&mut self) -> &mut Vec<u8> {
&mut self.0
}
}
/// The number of bytes used as metadata by `Term`.
const TERM_METADATA_LENGTH: usize = 5;
impl Term {
pub(crate) fn new() -> Term {
Term(Vec::with_capacity(100))
pub(crate) fn with_capacity(capacity: usize) -> Term {
let mut data = Vec::with_capacity(TERM_METADATA_LENGTH + capacity);
data.resize(TERM_METADATA_LENGTH, 0u8);
Term(data)
}
pub(crate) fn with_type_and_field(typ: Type, field: Field) -> Term {
let mut term = Self::with_capacity(8);
term.set_field_and_type(field, typ);
term
}
fn with_bytes_and_field_and_payload(typ: Type, field: Field, bytes: &[u8]) -> Term {
let mut term = Self::with_capacity(bytes.len());
term.set_field_and_type(field, typ);
term.0.extend_from_slice(bytes);
term
}
fn from_fast_value<T: FastValue>(field: Field, val: &T) -> Term {
let mut term = Term(vec![0u8; FAST_VALUE_TERM_LEN]);
term.set_field(T::to_type(), field);
let mut term = Self::with_type_and_field(T::to_type(), field);
term.set_u64(val.to_u64());
term
}
/// Builds a term given a field, and a u64-value
/// Panics when the term is not empty... ie: some value is set.
/// Use `clear_with_field_and_type` in that case.
///
/// Sets field and the type.
pub(crate) fn set_field_and_type(&mut self, field: Field, typ: Type) {
assert!(self.is_empty());
self.0[0..4].clone_from_slice(field.field_id().to_be_bytes().as_ref());
self.0[4] = typ.to_code();
}
/// Is empty if there are no value bytes.
pub fn is_empty(&self) -> bool {
self.0.len() == TERM_METADATA_LENGTH
}
/// Builds a term given a field, and a `Ipv6Addr`-value
pub fn from_field_ip_addr(field: Field, ip_addr: Ipv6Addr) -> Term {
let mut term = Self::with_type_and_field(Type::IpAddr, field);
term.set_ip_addr(ip_addr);
term
}
/// Builds a term given a field, and a `u64`-value
pub fn from_field_u64(field: Field, val: u64) -> Term {
Term::from_fast_value(field, &val)
}
/// Builds a term given a field, and a i64-value
/// Builds a term given a field, and a `i64`-value
pub fn from_field_i64(field: Field, val: i64) -> Term {
Term::from_fast_value(field, &val)
}
/// Builds a term given a field, and a f64-value
/// Builds a term given a field, and a `f64`-value
pub fn from_field_f64(field: Field, val: f64) -> Term {
Term::from_fast_value(field, &val)
}
/// Builds a term given a field, and a f64-value
/// Builds a term given a field, and a `bool`-value
pub fn from_field_bool(field: Field, val: bool) -> Term {
Term::from_fast_value(field, &val)
}
/// Builds a term given a field, and a DateTime value
/// Builds a term given a field, and a `DateTime` value
pub fn from_field_date(field: Field, val: DateTime) -> Term {
Term::from_fast_value(field, &val.truncate(DatePrecision::Seconds))
}
@@ -82,31 +106,29 @@ impl Term {
/// Creates a `Term` given a facet.
pub fn from_facet(field: Field, facet: &Facet) -> Term {
let facet_encoded_str = facet.encoded_str();
Term::create_bytes_term(Type::Facet, field, facet_encoded_str.as_bytes())
Term::with_bytes_and_field_and_payload(Type::Facet, field, facet_encoded_str.as_bytes())
}
/// Builds a term given a field, and a string value
pub fn from_field_text(field: Field, text: &str) -> Term {
Term::create_bytes_term(Type::Str, field, text.as_bytes())
}
fn create_bytes_term(typ: Type, field: Field, bytes: &[u8]) -> Term {
let mut term = Term(vec![0u8; 5 + bytes.len()]);
term.set_field(typ, field);
term.0.extend_from_slice(bytes);
term
Term::with_bytes_and_field_and_payload(Type::Str, field, text.as_bytes())
}
/// Builds a term bytes.
pub fn from_field_bytes(field: Field, bytes: &[u8]) -> Term {
Term::create_bytes_term(Type::Bytes, field, bytes)
Term::with_bytes_and_field_and_payload(Type::Bytes, field, bytes)
}
pub(crate) fn set_field(&mut self, typ: Type, field: Field) {
self.0.clear();
self.0
.extend_from_slice(field.field_id().to_be_bytes().as_ref());
self.0.push(typ.to_code());
/// Removes the value_bytes and set the field and type code.
pub(crate) fn clear_with_field_and_type(&mut self, typ: Type, field: Field) {
self.truncate_value_bytes(0);
self.set_field_and_type(field, typ);
}
/// Removes the value_bytes and set the type code.
pub fn clear_with_type(&mut self, typ: Type) {
self.truncate_value_bytes(0);
self.0[4] = typ.to_code();
}
/// Sets a u64 value in the term.
@@ -117,12 +139,6 @@ impl Term {
/// the natural order of the values.
pub fn set_u64(&mut self, val: u64) {
self.set_fast_value(val);
self.set_bytes(val.to_be_bytes().as_ref());
}
fn set_fast_value<T: FastValue>(&mut self, val: T) {
self.0.resize(FAST_VALUE_TERM_LEN, 0u8);
self.set_bytes(val.to_u64().to_be_bytes().as_ref());
}
/// Sets a `i64` value in the term.
@@ -130,7 +146,7 @@ impl Term {
self.set_fast_value(val);
}
/// Sets a `i64` value in the term.
/// Sets a `DateTime` value in the term.
pub fn set_date(&mut self, date: DateTime) {
self.set_fast_value(date);
}
@@ -145,9 +161,18 @@ impl Term {
self.set_fast_value(val);
}
fn set_fast_value<T: FastValue>(&mut self, val: T) {
self.set_bytes(val.to_u64().to_be_bytes().as_ref());
}
/// Sets a `Ipv6Addr` value in the term.
pub fn set_ip_addr(&mut self, val: Ipv6Addr) {
self.set_bytes(val.to_u128().to_be_bytes().as_ref());
}
/// Sets the value of a `Bytes` field.
pub fn set_bytes(&mut self, bytes: &[u8]) {
self.0.resize(5, 0u8);
self.truncate_value_bytes(0);
self.0.extend(bytes);
}
@@ -156,18 +181,22 @@ impl Term {
self.set_bytes(text.as_bytes());
}
/// Removes the value_bytes and set the type code.
pub fn clear_with_type(&mut self, typ: Type) {
self.truncate(5);
self.0[4] = typ.to_code();
/// Truncates the value bytes of the term. Value and field type stays the same.
pub fn truncate_value_bytes(&mut self, len: usize) {
self.0.truncate(len + TERM_METADATA_LENGTH);
}
/// Truncate the term right after the field and the type code.
pub fn truncate(&mut self, len: usize) {
self.0.truncate(len);
/// Returns the value bytes as mutable slice
pub fn value_bytes_mut(&mut self) -> &mut [u8] {
&mut self.0[TERM_METADATA_LENGTH..]
}
/// Truncate the term right after the field and the type code.
/// The length of the bytes.
pub fn len_bytes(&self) -> usize {
self.0.len() - TERM_METADATA_LENGTH
}
/// Appends value bytes to the Term.
pub fn append_bytes(&mut self, bytes: &[u8]) {
self.0.extend_from_slice(bytes);
}
@@ -293,9 +322,6 @@ where B: AsRef<[u8]>
/// Returns `None` if the field is not of string type
/// or if the bytes are not valid utf-8.
pub fn as_str(&self) -> Option<&str> {
if self.as_slice().len() < 5 {
return None;
}
if self.typ() != Type::Str {
return None;
}
@@ -307,9 +333,6 @@ where B: AsRef<[u8]>
/// Returns `None` if the field is not of facet type
/// or if the bytes are not valid utf-8.
pub fn as_facet(&self) -> Option<Facet> {
if self.as_slice().len() < 5 {
return None;
}
if self.typ() != Type::Facet {
return None;
}
@@ -321,9 +344,6 @@ where B: AsRef<[u8]>
///
/// Returns `None` if the field is not of bytes type.
pub fn as_bytes(&self) -> Option<&[u8]> {
if self.as_slice().len() < 5 {
return None;
}
if self.typ() != Type::Bytes {
return None;
}
@@ -337,7 +357,7 @@ where B: AsRef<[u8]>
/// If the term is a u64, its value is encoded according
/// to `byteorder::LittleEndian`.
pub fn value_bytes(&self) -> &[u8] {
&self.0.as_ref()[5..]
&self.0.as_ref()[TERM_METADATA_LENGTH..]
}
/// Returns the underlying `&[u8]`.
@@ -415,6 +435,9 @@ fn debug_value_bytes(typ: Type, bytes: &[u8], f: &mut fmt::Formatter) -> fmt::Re
debug_value_bytes(typ, bytes, f)?;
}
}
Type::IpAddr => {
write!(f, "")?; // TODO change once we actually have IP address terms.
}
}
Ok(())
}
@@ -448,6 +471,18 @@ mod tests {
assert_eq!(term.as_str(), Some("test"))
}
/// Size (in bytes) of the buffer of a fast value (u64, i64, f64, or date) term.
/// <field> + <type byte> + <value len>
///
/// - <field> is a big endian encoded u32 field id
/// - <type_byte>'s most significant bit expresses whether the term is a json term or not
/// The remaining 7 bits are used to encode the type of the value.
/// If this is a JSON term, the type is the type of the leaf of the json.
///
/// - <value> is, if this is not the json term, a binary representation specific to the type.
/// If it is a JSON Term, then it is prepended with the path that leads to this leaf value.
const FAST_VALUE_TERM_LEN: usize = 4 + 1 + 8;
#[test]
pub fn test_term_u64() {
let mut schema_builder = Schema::builder();
@@ -455,7 +490,7 @@ mod tests {
let term = Term::from_field_u64(count_field, 983u64);
assert_eq!(term.field(), count_field);
assert_eq!(term.typ(), Type::U64);
assert_eq!(term.as_slice().len(), super::FAST_VALUE_TERM_LEN);
assert_eq!(term.as_slice().len(), FAST_VALUE_TERM_LEN);
assert_eq!(term.as_u64(), Some(983u64))
}
@@ -466,7 +501,7 @@ mod tests {
let term = Term::from_field_bool(bool_field, true);
assert_eq!(term.field(), bool_field);
assert_eq!(term.typ(), Type::Bool);
assert_eq!(term.as_slice().len(), super::FAST_VALUE_TERM_LEN);
assert_eq!(term.as_slice().len(), FAST_VALUE_TERM_LEN);
assert_eq!(term.as_bool(), Some(true))
}
}

View File

@@ -47,7 +47,9 @@ impl TextOptions {
/// unchanged. The "default" tokenizer will store the terms as lower case and this will be
/// reflected in the dictionary.
///
/// The original text can be retrieved via `ord_to_term` from the dictionary.
/// The original text can be retrieved via
/// [`TermDictionary::ord_to_term()`](crate::termdict::TermDictionary::ord_to_term)
/// from the dictionary.
#[must_use]
pub fn set_fast(mut self) -> TextOptions {
self.fast = true;

View File

@@ -1,4 +1,5 @@
use std::fmt;
use std::net::Ipv6Addr;
use serde::de::Visitor;
use serde::{Deserialize, Deserializer, Serialize, Serializer};
@@ -32,6 +33,8 @@ pub enum Value {
Bytes(Vec<u8>),
/// Json object value.
JsonObject(serde_json::Map<String, serde_json::Value>),
/// IpV6 Address. Internally there is no IpV4, it needs to be converted to `Ipv6Addr`.
IpAddr(Ipv6Addr),
}
impl Eq for Value {}
@@ -48,8 +51,16 @@ impl Serialize for Value {
Value::Bool(b) => serializer.serialize_bool(b),
Value::Date(ref date) => time::serde::rfc3339::serialize(&date.into_utc(), serializer),
Value::Facet(ref facet) => facet.serialize(serializer),
Value::Bytes(ref bytes) => serializer.serialize_bytes(bytes),
Value::Bytes(ref bytes) => serializer.serialize_str(&base64::encode(bytes)),
Value::JsonObject(ref obj) => obj.serialize(serializer),
Value::IpAddr(ref obj) => {
// Ensure IpV4 addresses get serialized as IpV4, but excluding IpV6 loopback.
if let Some(ip_v4) = obj.to_ipv4_mapped() {
ip_v4.serialize(serializer)
} else {
obj.serialize(serializer)
}
}
}
}
}
@@ -201,6 +212,16 @@ impl Value {
None
}
}
/// Returns the ip addr, provided the value is of the `Ip` type.
/// (Returns None if the value is not of the `Ip` type)
pub fn as_ip_addr(&self) -> Option<Ipv6Addr> {
if let Value::IpAddr(val) = self {
Some(*val)
} else {
None
}
}
}
impl From<String> for Value {
@@ -209,6 +230,12 @@ impl From<String> for Value {
}
}
impl From<Ipv6Addr> for Value {
fn from(v: Ipv6Addr) -> Value {
Value::IpAddr(v)
}
}
impl From<u64> for Value {
fn from(v: u64) -> Value {
Value::U64(v)
@@ -288,8 +315,10 @@ impl From<serde_json::Value> for Value {
mod binary_serialize {
use std::io::{self, Read, Write};
use std::net::Ipv6Addr;
use common::{f64_to_u64, u64_to_f64, BinarySerializable};
use fastfield_codecs::MonotonicallyMappableToU128;
use super::Value;
use crate::schema::Facet;
@@ -306,6 +335,7 @@ mod binary_serialize {
const EXT_CODE: u8 = 7;
const JSON_OBJ_CODE: u8 = 8;
const BOOL_CODE: u8 = 9;
const IP_CODE: u8 = 10;
// extended types
@@ -366,6 +396,10 @@ mod binary_serialize {
serde_json::to_writer(writer, &map)?;
Ok(())
}
Value::IpAddr(ref ip) => {
IP_CODE.serialize(writer)?;
ip.to_u128().serialize(writer)
}
}
}
@@ -436,6 +470,11 @@ mod binary_serialize {
let json_map = <serde_json::Map::<String, serde_json::Value> as serde::Deserialize>::deserialize(&mut de)?;
Ok(Value::JsonObject(json_map))
}
IP_CODE => {
let value = u128::deserialize(reader)?;
Ok(Value::IpAddr(Ipv6Addr::from_u128(value)))
}
_ => Err(io::Error::new(
io::ErrorKind::InvalidData,
format!("No field type is associated with code {:?}", type_code),
@@ -448,9 +487,52 @@ mod binary_serialize {
#[cfg(test)]
mod tests {
use super::Value;
use crate::schema::{BytesOptions, Schema};
use crate::time::format_description::well_known::Rfc3339;
use crate::time::OffsetDateTime;
use crate::DateTime;
use crate::{DateTime, Document};
#[test]
fn test_parse_bytes_doc() {
let mut schema_builder = Schema::builder();
let bytes_options = BytesOptions::default();
let bytes_field = schema_builder.add_bytes_field("my_bytes", bytes_options);
let schema = schema_builder.build();
let mut doc = Document::default();
doc.add_bytes(bytes_field, "this is a test".as_bytes());
let json_string = schema.to_json(&doc);
assert_eq!(json_string, r#"{"my_bytes":["dGhpcyBpcyBhIHRlc3Q="]}"#);
}
#[test]
fn test_parse_empty_bytes_doc() {
let mut schema_builder = Schema::builder();
let bytes_options = BytesOptions::default();
let bytes_field = schema_builder.add_bytes_field("my_bytes", bytes_options);
let schema = schema_builder.build();
let mut doc = Document::default();
doc.add_bytes(bytes_field, "".as_bytes());
let json_string = schema.to_json(&doc);
assert_eq!(json_string, r#"{"my_bytes":[""]}"#);
}
#[test]
fn test_parse_many_bytes_doc() {
let mut schema_builder = Schema::builder();
let bytes_options = BytesOptions::default();
let bytes_field = schema_builder.add_bytes_field("my_bytes", bytes_options);
let schema = schema_builder.build();
let mut doc = Document::default();
doc.add_bytes(
bytes_field,
"A bigger test I guess\nspanning on multiple lines\nhoping this will work".as_bytes(),
);
let json_string = schema.to_json(&doc);
assert_eq!(
json_string,
r#"{"my_bytes":["QSBiaWdnZXIgdGVzdCBJIGd1ZXNzCnNwYW5uaW5nIG9uIG11bHRpcGxlIGxpbmVzCmhvcGluZyB0aGlzIHdpbGwgd29yaw=="]}"#
);
}
#[test]
fn test_serialize_date() {

View File

@@ -104,6 +104,13 @@ impl ZstdCompressor {
value, opt_name, err
)
})?;
if value >= 15 {
warn!(
"High zstd compression level detected: {:?}. High compression levels \
(>=15) are slow and will limit indexing speed.",
value
)
}
compressor.compression_level = Some(value);
}
_ => {

View File

@@ -1,7 +1,7 @@
use std::io;
use std::ops::Range;
use common::VInt;
use common::{read_u32_vint, VInt};
use crate::store::index::{Checkpoint, CHECKPOINT_PERIOD};
use crate::DocId;
@@ -85,15 +85,15 @@ impl CheckpointBlock {
return Err(io::Error::new(io::ErrorKind::UnexpectedEof, ""));
}
self.checkpoints.clear();
let len = VInt::deserialize_u64(data)? as usize;
let len = read_u32_vint(data);
if len == 0 {
return Ok(());
}
let mut doc = VInt::deserialize_u64(data)? as DocId;
let mut start_offset = VInt::deserialize_u64(data)? as usize;
let mut doc = read_u32_vint(data);
let mut start_offset = read_u32_vint(data) as usize;
for _ in 0..len {
let num_docs = VInt::deserialize_u64(data)? as DocId;
let block_num_bytes = VInt::deserialize_u64(data)? as usize;
let num_docs = read_u32_vint(data);
let block_num_bytes = read_u32_vint(data) as usize;
self.checkpoints.push(Checkpoint {
doc_range: doc..doc + num_docs,
byte_range: start_offset..start_offset + block_num_bytes,

View File

@@ -96,7 +96,7 @@ pub mod tests {
let mut doc = Document::default();
doc.add_field_value(field_body, LOREM.to_string());
doc.add_field_value(field_title, format!("Doc {i}"));
store_writer.store(&doc).unwrap();
store_writer.store(&doc, &schema).unwrap();
}
store_writer.close().unwrap();
}

View File

@@ -1,10 +1,10 @@
use std::io;
use std::iter::Sum;
use std::ops::AddAssign;
use std::ops::{AddAssign, Range};
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use common::{BinarySerializable, HasLen, VInt};
use common::{BinarySerializable, HasLen};
use lru::LruCache;
use ownedbytes::OwnedBytes;
@@ -140,10 +140,10 @@ impl StoreReader {
self.cache.stats()
}
/// Get checkpoint for DocId. The checkpoint can be used to load a block containing the
/// Get checkpoint for `DocId`. The checkpoint can be used to load a block containing the
/// document.
///
/// Advanced API. In most cases use [get](Self::get).
/// Advanced API. In most cases use [`get`](Self::get).
fn block_checkpoint(&self, doc_id: DocId) -> crate::Result<Checkpoint> {
self.skip_index.seek(doc_id).ok_or_else(|| {
crate::TantivyError::InvalidArgument(format!("Failed to lookup Doc #{}.", doc_id))
@@ -160,7 +160,7 @@ impl StoreReader {
/// Loads and decompresses a block.
///
/// Advanced API. In most cases use [get](Self::get).
/// Advanced API. In most cases use [`get`](Self::get).
fn read_block(&self, checkpoint: &Checkpoint) -> io::Result<Block> {
let cache_key = checkpoint.byte_range.start;
if let Some(block) = self.cache.get_from_cache(cache_key) {
@@ -205,28 +205,21 @@ impl StoreReader {
/// Advanced API.
///
/// In most cases use [get_document_bytes](Self::get_document_bytes).
/// In most cases use [`get_document_bytes`](Self::get_document_bytes).
fn get_document_bytes_from_block(
block: OwnedBytes,
doc_id: DocId,
checkpoint: &Checkpoint,
) -> crate::Result<OwnedBytes> {
let mut cursor = &block[..];
let cursor_len_before = cursor.len();
for _ in checkpoint.doc_range.start..doc_id {
let doc_length = VInt::deserialize(&mut cursor)?.val() as usize;
cursor = &cursor[doc_length..];
}
let doc_pos = doc_id - checkpoint.doc_range.start;
let doc_length = VInt::deserialize(&mut cursor)?.val() as usize;
let start_pos = cursor_len_before - cursor.len();
let end_pos = cursor_len_before - cursor.len() + doc_length;
Ok(block.slice(start_pos..end_pos))
let range = block_read_index(&block, doc_pos)?;
Ok(block.slice(range))
}
/// Iterator over all Documents in their order as they are stored in the doc store.
/// Use this, if you want to extract all Documents from the doc store.
/// The alive_bitset has to be forwarded from the `SegmentReader` or the results maybe wrong.
/// The `alive_bitset` has to be forwarded from the `SegmentReader` or the results may be wrong.
pub fn iter<'a: 'b, 'b>(
&'b self,
alive_bitset: Option<&'a AliveBitSet>,
@@ -237,9 +230,9 @@ impl StoreReader {
})
}
/// Iterator over all RawDocuments in their order as they are stored in the doc store.
/// Iterator over all raw Documents in their order as they are stored in the doc store.
/// Use this, if you want to extract all Documents from the doc store.
/// The alive_bitset has to be forwarded from the `SegmentReader` or the results maybe wrong.
/// The `alive_bitset` has to be forwarded from the `SegmentReader` or the results may be wrong.
pub(crate) fn iter_raw<'a: 'b, 'b>(
&'b self,
alive_bitset: Option<&'a AliveBitSet>,
@@ -254,9 +247,7 @@ impl StoreReader {
let mut curr_block = curr_checkpoint
.as_ref()
.map(|checkpoint| self.read_block(checkpoint).map_err(|e| e.kind())); // map error in order to enable cloning
let mut block_start_pos = 0;
let mut num_skipped = 0;
let mut reset_block_pos = false;
let mut doc_pos = 0;
(0..last_doc_id)
.filter_map(move |doc_id| {
// filter_map is only used to resolve lifetime issues between the two closures on
@@ -268,24 +259,19 @@ impl StoreReader {
curr_block = curr_checkpoint
.as_ref()
.map(|checkpoint| self.read_block(checkpoint).map_err(|e| e.kind()));
reset_block_pos = true;
num_skipped = 0;
doc_pos = 0;
}
let alive = alive_bitset.map_or(true, |bitset| bitset.is_alive(doc_id));
if alive {
let ret = Some((curr_block.clone(), num_skipped, reset_block_pos));
// the map block will move over the num_skipped, so we reset to 0
num_skipped = 0;
reset_block_pos = false;
ret
let res = if alive {
Some((curr_block.clone(), doc_pos))
} else {
// we keep the number of skipped documents to move forward in the map block
num_skipped += 1;
None
}
};
doc_pos += 1;
res
})
.map(move |(block, num_skipped, reset_block_pos)| {
.map(move |(block, doc_pos)| {
let block = block
.ok_or_else(|| {
DataCorruption::comment_only(
@@ -296,30 +282,9 @@ impl StoreReader {
.map_err(|error_kind| {
std::io::Error::new(error_kind, "error when reading block in doc store")
})?;
// this flag is set, when filter_map moved to the next block
if reset_block_pos {
block_start_pos = 0;
}
let mut cursor = &block[block_start_pos..];
let mut pos = 0;
// move forward 1 doc + num_skipped in block and return length of current doc
let doc_length = loop {
let doc_length = VInt::deserialize(&mut cursor)?.val() as usize;
let num_bytes_read = block[block_start_pos..].len() - cursor.len();
block_start_pos += num_bytes_read;
pos += 1;
if pos == num_skipped + 1 {
break doc_length;
} else {
block_start_pos += doc_length;
cursor = &block[block_start_pos..];
}
};
let end_pos = block_start_pos + doc_length;
let doc_bytes = block.slice(block_start_pos..end_pos);
block_start_pos = end_pos;
Ok(doc_bytes)
let range = block_read_index(&block, doc_pos)?;
Ok(block.slice(range))
})
}
@@ -329,11 +294,33 @@ impl StoreReader {
}
}
fn block_read_index(block: &[u8], doc_pos: u32) -> crate::Result<Range<usize>> {
let doc_pos = doc_pos as usize;
let size_of_u32 = std::mem::size_of::<u32>();
let index_len_pos = block.len() - size_of_u32;
let index_len = u32::deserialize(&mut &block[index_len_pos..])? as usize;
if doc_pos > index_len {
return Err(crate::TantivyError::InternalError(
"Attempted to read doc from wrong block".to_owned(),
));
}
let index_start = block.len() - (index_len + 1) * size_of_u32;
let index = &block[index_start..index_start + index_len * size_of_u32];
let start_offset = u32::deserialize(&mut &index[doc_pos * size_of_u32..])? as usize;
let end_offset = u32::deserialize(&mut &index[(doc_pos + 1) * size_of_u32..])
.unwrap_or(index_start as u32) as usize;
Ok(start_offset..end_offset)
}
#[cfg(feature = "quickwit")]
impl StoreReader {
/// Advanced API.
///
/// In most cases use [get_async](Self::get_async)
/// In most cases use [`get_async`](Self::get_async)
///
/// Loads and decompresses a block asynchronously.
async fn read_block_async(&self, checkpoint: &Checkpoint) -> crate::AsyncIoResult<Block> {
@@ -357,14 +344,14 @@ impl StoreReader {
Ok(decompressed_block)
}
/// Fetches a document asynchronously.
/// Reads raw bytes of a given document asynchronously.
pub async fn get_document_bytes_async(&self, doc_id: DocId) -> crate::Result<OwnedBytes> {
let checkpoint = self.block_checkpoint(doc_id)?;
let block = self.read_block_async(&checkpoint).await?;
Self::get_document_bytes_from_block(block, doc_id, &checkpoint)
}
/// Reads raw bytes of a given document. Async version of [get](Self::get).
/// Fetches a document asynchronously. Async version of [`get`](Self::get).
pub async fn get_async(&self, doc_id: DocId) -> crate::Result<Document> {
let mut doc_bytes = self.get_document_bytes_async(doc_id).await?;
Ok(Document::deserialize(&mut doc_bytes)?)
@@ -427,7 +414,7 @@ mod tests {
assert_eq!(store.cache_stats().cache_hits, 1);
assert_eq!(store.cache_stats().cache_misses, 2);
assert_eq!(store.cache.peek_lru(), Some(9210));
assert_eq!(store.cache.peek_lru(), Some(11163));
Ok(())
}

View File

@@ -1,11 +1,11 @@
use std::io::{self, Write};
use std::io;
use common::{BinarySerializable, VInt};
use common::BinarySerializable;
use super::compressors::Compressor;
use super::StoreReader;
use crate::directory::WritePtr;
use crate::schema::Document;
use crate::schema::{Document, Schema};
use crate::store::store_compressor::BlockCompressor;
use crate::DocId;
@@ -20,8 +20,8 @@ pub struct StoreWriter {
compressor: Compressor,
block_size: usize,
num_docs_in_current_block: DocId,
intermediary_buffer: Vec<u8>,
current_block: Vec<u8>,
doc_pos: Vec<u32>,
block_compressor: BlockCompressor,
}
@@ -41,7 +41,7 @@ impl StoreWriter {
compressor,
block_size,
num_docs_in_current_block: 0,
intermediary_buffer: Vec::new(),
doc_pos: Vec::new(),
current_block: Vec::new(),
block_compressor,
})
@@ -53,12 +53,15 @@ impl StoreWriter {
/// The memory used (inclusive childs)
pub fn mem_usage(&self) -> usize {
self.intermediary_buffer.capacity() + self.current_block.capacity()
self.current_block.capacity() + self.doc_pos.capacity() * std::mem::size_of::<u32>()
}
/// Checks if the current block is full, and if so, compresses and flushes it.
fn check_flush_block(&mut self) -> io::Result<()> {
if self.current_block.len() > self.block_size {
// this does not count the VInt storing the index lenght itself, but it is negligible in
// front of everything else.
let index_len = self.doc_pos.len() * std::mem::size_of::<usize>();
if self.current_block.len() + index_len > self.block_size {
self.send_current_block_to_compressor()?;
}
Ok(())
@@ -70,8 +73,19 @@ impl StoreWriter {
if self.current_block.is_empty() {
return Ok(());
}
let size_of_u32 = std::mem::size_of::<u32>();
self.current_block
.reserve((self.doc_pos.len() + 1) * size_of_u32);
for pos in self.doc_pos.iter() {
pos.serialize(&mut self.current_block)?;
}
(self.doc_pos.len() as u32).serialize(&mut self.current_block)?;
self.block_compressor
.compress_block_and_write(&self.current_block, self.num_docs_in_current_block)?;
self.doc_pos.clear();
self.current_block.clear();
self.num_docs_in_current_block = 0;
Ok(())
@@ -81,16 +95,9 @@ impl StoreWriter {
///
/// The document id is implicitly the current number
/// of documents.
pub fn store(&mut self, stored_document: &Document) -> io::Result<()> {
self.intermediary_buffer.clear();
stored_document.serialize(&mut self.intermediary_buffer)?;
// calling store bytes would be preferable for code reuse, but then we can't use
// intermediary_buffer due to the borrow checker
// a new buffer costs ~1% indexing performance
let doc_num_bytes = self.intermediary_buffer.len();
VInt(doc_num_bytes as u64).serialize_into_vec(&mut self.current_block);
self.current_block
.write_all(&self.intermediary_buffer[..])?;
pub fn store(&mut self, document: &Document, schema: &Schema) -> io::Result<()> {
self.doc_pos.push(self.current_block.len() as u32);
document.serialize_stored(schema, &mut self.current_block)?;
self.num_docs_in_current_block += 1;
self.check_flush_block()?;
Ok(())
@@ -101,8 +108,7 @@ impl StoreWriter {
/// The document id is implicitly the current number
/// of documents.
pub fn store_bytes(&mut self, serialized_document: &[u8]) -> io::Result<()> {
let doc_num_bytes = serialized_document.len();
VInt(doc_num_bytes as u64).serialize_into_vec(&mut self.current_block);
self.doc_pos.push(self.current_block.len() as u32);
self.current_block.extend_from_slice(serialized_document);
self.num_docs_in_current_block += 1;
self.check_flush_block()?;

View File

@@ -13,7 +13,7 @@ pub struct SSTableIndex {
impl SSTableIndex {
pub(crate) fn load(data: &[u8]) -> Result<SSTableIndex, DataCorruption> {
serde_cbor::de::from_slice(data)
ciborium::de::from_reader(data)
.map_err(|_| DataCorruption::comment_only("SSTable index is corrupted"))
}
@@ -85,9 +85,9 @@ impl SSTableIndexBuilder {
})
}
pub fn serialize(&self, wrt: &mut dyn io::Write) -> io::Result<()> {
serde_cbor::ser::to_writer(wrt, &self.index).unwrap();
Ok(())
pub fn serialize<W: std::io::Write>(&self, wrt: W) -> io::Result<()> {
ciborium::ser::into_writer(&self.index, wrt)
.map_err(|err| io::Error::new(io::ErrorKind::Other, err))
}
}

View File

@@ -24,6 +24,8 @@ impl SSTable for TermInfoSSTable {
type Reader = TermInfoReader;
type Writer = TermInfoWriter;
}
/// Builder for the new term dictionary.
pub struct TermDictionaryBuilder<W: io::Write> {
sstable_writer: Writer<W, TermInfoWriter>,
}
@@ -138,6 +140,7 @@ impl TermDictionary {
})
}
/// Creates a term dictionary from the supplied bytes.
pub fn from_bytes(owned_bytes: OwnedBytes) -> crate::Result<TermDictionary> {
TermDictionary::open(FileSlice::new(Arc::new(owned_bytes)))
}
@@ -229,19 +232,19 @@ impl TermDictionary {
Ok(None)
}
// Returns a range builder, to stream all of the terms
// within an interval.
/// Returns a range builder, to stream all of the terms
/// within an interval.
pub fn range(&self) -> TermStreamerBuilder<'_> {
TermStreamerBuilder::new(self, AlwaysMatch)
}
// A stream of all the sorted terms.
/// A stream of all the sorted terms.
pub fn stream(&self) -> io::Result<TermStreamer<'_>> {
self.range().into_stream()
}
// Returns a search builder, to stream all of the terms
// within the Automaton
/// Returns a search builder, to stream all of the terms
/// within the Automaton
pub fn search<'a, A: Automaton + 'a>(&'a self, automaton: A) -> TermStreamerBuilder<'a, A>
where A::State: Clone {
TermStreamerBuilder::<A>::new(self, automaton)

View File

@@ -255,7 +255,7 @@ where T: Iterator<Item = usize>
/// Emits all of the offsets where a codepoint starts
/// or a codepoint ends.
///
/// By convention, we emit [0] for the empty string.
/// By convention, we emit `[0]` for the empty string.
struct CodepointFrontiers<'a> {
s: &'a str,
next_el: Option<usize>,