For a Field with several FieldValues, a value that contained no token at all caused the token position to be reinitialized to 0.
As a result, PhraseQueries could return false positives.
In addition, after the computation of the position delta, we could underflow u32 and end up with a gigantic delta.
We haven't been able to actually explain the bug in 1629, but it is assumed that in some corner case these deltas can cause a panic.
Closes #1629
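For illustration, a hedged sketch of the underflow described above (the values are made up): with u32 positions, a wrongly reset position makes the delta wrap around to a huge value.

```rust
fn main() {
    let prev_position: u32 = 5;
    // Erroneously reset to 0 by a FieldValue that produced no token.
    let position: u32 = 0;
    // Plain `-` would panic in debug builds; in release it wraps.
    let delta = position.wrapping_sub(prev_position);
    assert_eq!(delta, u32::MAX - 4); // 4_294_967_291: the "gigantic delta"
}
```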
The current `compute_previous_power_of_two()` implementation used for TermHashmap takes and returns `usize`, but actually only works correctly on 64-bit architectures (i.e. usize == u64).
On other architectures the leading_zeros computation is run on the wrong type (it must be u64), which leads to overflows.
Fixed by simply computing the leading_zeros on a u64 value.
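A minimal sketch of the corrected approach (illustrative; details of the real function may differ): widening to u64 before calling `leading_zeros` makes the bit arithmetic identical on 32-bit and 64-bit targets.

```rust
/// Returns the largest power of two less than or equal to `n`.
fn compute_previous_power_of_two(n: usize) -> usize {
    assert!(n > 0);
    // Widen first: on 32-bit targets, usize::leading_zeros would
    // count from the wrong bit width and break the computation.
    let msb = 63 - (n as u64).leading_zeros();
    (1u64 << msb) as usize
}
```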
fixes some issues with Term
Remove duplicate calls to truncate or resize
Replace Magic Number 5 with constant
Enforce minimum size of 5 for metadata
Fix broken truncate docs
use constructor instead of new + set calls
normalize constructor stack
replace assert on internal behavior
fixes #1585
Fixes broken search (no results) with BM25 for u64, i64, f64, bool, bytes and date fields after deletion and merge.
There were no fieldnorms recorded for those fields. After a merge, InvertedIndexReader::total_num_tokens returns 0 (the sum over the fieldnorms is 0), and BM25 does not work when total_num_tokens is 0.
Fixes #1617
remove get_val in serialization and mark as unimplemented!()
replace get_val with iter in linear codec
remove MultivalueStartIndexRandomSeeker
replace MultivalueStartIndexIter with closure
Sample 100 values in linear codec
When building without default features (so without mmap, etc.), there are some warnings about unused items. This fixes the ones related to `ArcBytes` and `WeakArcBytes`, which are only used by the `mmap_directory` code.
fixes the multivalue ff regression by avoiding `get_val`. Line::train calls get_val repeatedly, but the get_val implementation on Column for multivalues is very slow. The fix is to use the iterator instead (see the sketch below). The long-term fix should be to remove get_val access in serialization.
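A hedged sketch of the pattern behind the fix, using an illustrative trait rather than tantivy's exact `Column`: the point is replacing one random access per index with a single scan.

```rust
trait ColumnSketch {
    /// Slow random access (very slow for multivalued columns).
    fn get_val(&self, idx: u64) -> u64;
    /// Fast sequential scan.
    fn iter(&self) -> Box<dyn Iterator<Item = u64> + '_>;
}

fn train(column: &dyn ColumnSketch) -> u64 {
    // Before: (0..num_vals).map(|idx| column.get_val(idx)).sum()
    column.iter().sum()
}
```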
Old Code
test fastfield::bench::bench_multi_value_ff_merge_few_segments ... bench: 46,103,960 ns/iter (+/- 2,066,083)
test fastfield::bench::bench_multi_value_ff_merge_many_segments ... bench: 83,073,036 ns/iter (+/- 4,373,615)
test fastfield::bench::bench_multi_value_ff_merge_many_segments_log_merge ... bench: 64,178,576 ns/iter (+/- 1,466,700)
Current
running 3 tests
test fastfield::multivalued::bench::bench_multi_value_ff_merge_few_segments ... bench: 57,379,523 ns/iter (+/- 3,220,787)
test fastfield::multivalued::bench::bench_multi_value_ff_merge_many_segments ... bench: 90,831,688 ns/iter (+/- 1,445,486)
test fastfield::multivalued::bench::bench_multi_value_ff_merge_many_segments_log_merge ... bench: 158,313,264 ns/iter (+/- 28,823,250)
With Fix
running 3 tests
test fastfield::multivalued::bench::bench_multi_value_ff_merge_few_segments ... bench: 57,635,671 ns/iter (+/- 2,707,361)
test fastfield::multivalued::bench::bench_multi_value_ff_merge_many_segments ... bench: 91,468,712 ns/iter (+/- 11,393,581)
test fastfield::multivalued::bench::bench_multi_value_ff_merge_many_segments_log_merge ... bench: 73,909,138 ns/iter (+/- 15,846,097)
The Monotonic mapping was using the default implementations for `get_range` and `.iter`.
As a result, some of the columns used in merge (e.g. multivalued fast fields) were exhibiting a very strong performance regression.
In addition, it isolates the doc compressor logic and better reports io::Result.
In the case of the same-thread doc compressor,
the blocks are also not copied.
* Rename BlockwiseLinear to BlockwiseLinearLegacy
Reimplements the blockwise multilinear codec using integer arithmetic.
Added comments
* add estimate for blockwise
* Added one unit test
* use int based for linear interpol
* fix merge conflicts
* reuse code
* cargo fmt
* fix clippy
* fix test
* fix off by one
fix off by one to accurately interpolate autoincrement fields
* extend test, fix estimate
* remove legacy codec
Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>
* Use a more general but still object-safe signature for Query::query_terms.
* Further constrain the generalized Query::query_terms signature to allow extracting references to terms.
The file offsets were recorded incorrectly in some cases, e.g. when the recording looked like this: [(Field 1, Index 0, Offset 0), (Field 1, Index 1, Offset 14), (Field 0, Index 0, Offset 14)]. The last file spans from offset 14 to the end of the file for field 0. But the data was converted to a vec and sorted, which changed the last file to Field 1.
* FileHandle: Change from boxed to Arc.
Changing from a Box<dyn FileHandle> to an Arc<dyn FileHandle> allows a user of tantivy to manage file handles outside of tantivy and control their life cycle.
* Fix: Rust linter
Use a separate thread to compress the block store for increased indexing performance. This allows using slower compressors with a higher compression ratio, with little or no performance impact (given enough cores).
A separate thread is spawned to compress the docstore; it handles single blocks as well as stacking from other docstores.
The spawned compressor thread does not write; instead, it sends back the compressed data. This is done to avoid multithreaded writes to the same file.
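A hedged sketch of that design (channel layout and names are illustrative): the compressor thread receives raw blocks and sends compressed bytes back, so only the original thread ever touches the file.

```rust
use std::sync::mpsc;
use std::thread;

fn spawn_compressor(
    compress: fn(&[u8]) -> Vec<u8>,
) -> (mpsc::Sender<Vec<u8>>, mpsc::Receiver<Vec<u8>>) {
    let (block_tx, block_rx) = mpsc::channel::<Vec<u8>>();
    let (compressed_tx, compressed_rx) = mpsc::channel::<Vec<u8>>();
    thread::spawn(move || {
        for block in block_rx {
            // Send the result back instead of writing it ourselves,
            // avoiding multithreaded writes to the same file.
            let _ = compressed_tx.send(compress(&block));
        }
    });
    (block_tx, compressed_rx)
}
```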
Validation happens in different phases depending on the aggregation:
Term: during segment collection
Histogram: at the end, when converting into intermediate buckets (we preallocate empty buckets for the range). Revisit after #1370
Range: when validating the request
update CHANGELOG
Right now we store the last key in the blocks of the SSTable index.
This PR replaces the last key with a shorter string that is greater than or equal to it, and still less than the next key.
This property is sufficient to ensure the block index works properly.
Related to quickwit#1366
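A hedged sketch of the idea (an illustrative helper, not the actual sstable code): keep the common prefix plus the first differing byte of the next block's first key, which yields a separator `s` with `last_key <= s < next_key` in the common case.

```rust
fn short_separator(last_key: &[u8], next_key: &[u8]) -> Vec<u8> {
    debug_assert!(last_key < next_key);
    let prefix_len = last_key
        .iter()
        .zip(next_key)
        .take_while(|(a, b)| a == b)
        .count();
    if next_key.len() > prefix_len + 1 {
        // e.g. last "apple", next "banana" -> "b"
        next_key[..prefix_len + 1].to_vec()
    } else {
        // The next key ends right after the differing byte:
        // fall back to storing the full last key.
        last_key.to_vec()
    }
}
```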
Quickwit still relies heavily on generating field names containing a '.' for nested objects, yet allows user-defined field names to contain a dot.
In order to reuse the tantivy query parser, we will end up using quickwit field names directly in tantivy.
Only '.' will be escaped.
This PR makes minor changes to how the tantivy query parser parses a field name and resolves it to a field.
Some of the new edge case behavior is hacky.
Closes #1355
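A hedged illustration of the escaping convention (the escape character is assumed to be a backslash here): a field path is split on unescaped dots only, so a user-defined field name containing `\.` keeps its dot.

```rust
fn split_field_path(path: &str) -> Vec<String> {
    let mut segments = vec![String::new()];
    let mut chars = path.chars();
    while let Some(c) = chars.next() {
        match c {
            // An escaped character (e.g. "\.") is taken literally.
            '\\' => {
                if let Some(escaped) = chars.next() {
                    segments.last_mut().unwrap().push(escaped);
                }
            }
            '.' => segments.push(String::new()),
            _ => segments.last_mut().unwrap().push(c),
        }
    }
    segments
}
```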
For date values `chrono` has been replaced with `time`
- The `time` crate is re-exported as `tantivy::time` instead of `tantivy::chrono`.
- The type alias `tantivy::DateTime` has been removed.
- `Value::Date` wraps `time::PrimitiveDateTime` without time zone information.
- Internally date/time values are stored as seconds since UNIX epoch in UTC.
- Converting a `time::OffsetDateTime` to `Value::Date` implicitly converts the value into UTC.
If this is not desired do the time zone conversion yourself and use `time::PrimitiveDateTime`
directly instead.
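A hedged example of the conversion behaviour described above, assuming the `time` crate with its `macros` feature:

```rust
use time::macros::datetime;
use time::{PrimitiveDateTime, UtcOffset};

fn main() {
    // 10:00 at +02:00 becomes 08:00 once normalized to UTC.
    let with_offset = datetime!(2022-01-01 10:00 +2);
    let utc = with_offset.to_offset(UtcOffset::UTC);
    // To keep the local wall-clock time instead, build the
    // PrimitiveDateTime yourself and skip the offset conversion.
    let primitive = PrimitiveDateTime::new(utc.date(), utc.time());
    println!("{primitive}");
}
```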
Closes #1304
* Removes all usage of block_on and uses a oneshot channel instead.
Calling `block_on` panics in certain contexts.
For instance, it panics when it is called in the context of another call to block.
Using it in tantivy is unnecessary. We replace it with a thin wrapper around a oneshot channel that supports both async and sync use.
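A minimal sketch of the idea with hypothetical names: the receiving end can block for the sync case, while an async flavor would implement Future instead; neither path needs `block_on`.

```rust
use std::sync::mpsc;

struct OneshotReceiver<T>(mpsc::Receiver<T>);

impl<T> OneshotReceiver<T> {
    /// Sync flavor: a plain blocking recv.
    fn wait(self) -> Option<T> {
        self.0.recv().ok()
    }
    // An async flavor would implement Future and have the sender wake
    // the task; omitted here for brevity.
}

fn oneshot<T>() -> (mpsc::Sender<T>, OneshotReceiver<T>) {
    let (tx, rx) = mpsc::channel();
    (tx, OneshotReceiver(rx))
}
```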
* Removing needless uses of async in the API.
Co-authored-by: PSeitz <PSeitz@users.noreply.github.com>
* Added sstable, enabled it by default, and added a parallel boolean query.
* Added async API for FileSlice.
* Added async get_doc
* Reduce blocksize to 32_000
* Added debug logs
Quickwit-specific features are hidden behind the quickwit feature flag.
* improve validation in aggregation, extend invalid field test
improve validation in aggregation
extend invalid field test
Fixes #1291
* collect fast field names on request structure
* fix visibility of AggregationSegmentCollector
* Terms are now typed.
This change is backward compatible:
While the Term's byte representation is modified, a Term itself is a transient object that is not serialized as-is in the index.
Its .field() and .value_bytes() accessors, on the other hand, are unchanged.
This change offers better Debug information for terms.
While not necessary, it will also help in the support for JSON types.
* Renamed Hierarchical Facet -> Facet
The test-env-log crate has been renamed to test-log to better reflect
its intent of not only catering to env_logger specific initialization
but also tracing (and potentially others in the future).
This change updates the crate to use test-log instead of the now
deprecated test-env-log.
* Add a NORMED option on fields
Make fieldnorm indexation optional:
* for all types except text => added a NORMED option
* for text fields
** if STRING, the field has no fieldnorm retained
** if TEXT, the field has its fieldnorm computed
* Finalize making fieldnorm optional for all field types.
- Using Option for fieldnorm readers.
This works by introducing a new API method in the Directory trait. The user needs to explicitly call this method.
(In particular, once before a commit)
Closes #1225
In addition this PR:
- removes unnecessary flushes and fsyncs on files.
- replaces all fsync calls with fdatasync. The latter triggers a metadata sync if metadata required to read the file has changed. It is therefore sufficient for us.
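A hedged illustration of the fsync-to-fdatasync swap: on Linux, `File::sync_data` maps to fdatasync, which still syncs any metadata required to read the data back.

```rust
use std::fs::File;
use std::io;

fn flush_durably(file: &File) -> io::Result<()> {
    file.sync_data() // previously: file.sync_all(), i.e. fsync
}
```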
Closes #1224
* Added optimisation using block wand for single TermScorer.
A proptest was also added.
* Fix block wand algorithm by taking the last doc id of scores until the pivot scorer (included).
* In block wand, when block max score is lower than the threshold, advance the scorer with best score.
* Fix wrong condition in block_wand_single_scorer and add debug_assert to have an equality check on doc to break the loop.
* fix #1151
Fixes an off-by-one error in the stats for the index fast field in the multi value fast field.
When retrieving the data range for a docid, `get(docid)..get(docid + 1)` is requested. On creation, the num_vals statistic was set to docid instead of docid + 1. In the multivaluelinearinterpol fast field, the last value was therefore not serialized (and would return 0 instead in most cases).
So the last document's `get(lastdoc)..get(lastdoc + 1)` would return the invalid range `value..0`.
This PR adds a proptest to cover this scenario. A combination of a large number of values (since multilinear interpolation is only active for more than 5_000 values) and a merge is required.
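An illustration of the invariant restored by the fix (a plain slice stands in for the index fast field): the offsets column needs one entry per doc plus a final end offset, i.e. `num_docs + 1` values.

```rust
fn value_range(idx: &[u64], docid: usize) -> std::ops::Range<u64> {
    // Requires idx.len() == num_docs + 1; serializing only num_docs
    // entries truncates the range of the last document.
    idx[docid]..idx[docid + 1]
}
```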
* Handle field names with any characters, with a known set of special characters and an escape character
* Update the field name validation rule to check only that it has at least one character and does not start with `-`
Closes #1087.
Use dynamic fastfield codecs for multivalues (only the sorting case is covered)
Rename get to get_val due to a conflict with Vec
use u64 precision instead of u32 for get_range, to allow use of the existing fast field interface (actually not sure if it would be better to have a different interface)
add CodecReader as common interface in the fastfield codec crate
add LinearInterpolation to DynamicFastFieldReader
calc estimation and choose best codec
cleanup
support multiple codecs
prepend codec id to all fast fields
add new api to create fastfields with access to all data
use new fastfield creation api in initial creation and merge
remove unused collect of data in doc_id_mapping
* prepare for multiple fastfield codecs
prepare for multiple fastfield codecs by wrapping the codecs in an enum #1042
* add FastFieldSerializer trait, add DynamicFastFieldSerializer
add FastFieldSerializer trait
add DynamicFastFieldSerializer enum to wrap all implementors of the FastFieldSerializer trait
* add estimation for fastfield bitpacker
Change Footer version handling, make compression dynamic
Change Footer version handling
Simplify version handling by switching to JSON instead of binary serialization.
fixes #1058
Make compression dynamic
Instead of choosing the compression at compile time via a feature flag, you can now have multiple compression algorithms enabled and decide at runtime which one to use via IndexSettings (see the configuration sketch below). Changing the compression algorithm on an index is also supported. The information about which algorithm was used in the doc store is stored in the DocStoreFooter. The default is the lz4 block format.
fixes #904
Handle merging of different compressors
Fix feature flag names
Add doc store test for all compressors
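A hedged sketch of selecting the doc store compressor at runtime via IndexSettings (field and variant names follow the version that introduced this change and may differ):

```rust
use tantivy::schema::{SchemaBuilder, TEXT};
use tantivy::store::Compressor;
use tantivy::{Index, IndexSettings};

fn create_index() -> tantivy::Result<Index> {
    let mut builder = SchemaBuilder::new();
    builder.add_text_field("body", TEXT);
    let settings = IndexSettings {
        // Decided at runtime, not via a compile-time feature flag.
        docstore_compression: Compressor::Lz4,
        ..Default::default()
    };
    Index::builder()
        .schema(builder.build())
        .settings(settings)
        .create_in_ram()
}
```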
* Detect if segments are stackable with sorting, fixes #1038
Detect if segments are stackable when their data ranges on the sort property are disjoint.
Presort segments by their min value on merge, to enable easier stacking.
* move code to function
* fix and refactor log merge policy, fixes #1035
fixes a bug in the log merge policy where a segment was wrongly referenced by its index
* cleanup
* fix sort order, improve method names
* use itertools groupby, fix serialization test
* minor improvements
* update names
* add an iterator over documents in the docstore
When profiling, I saw that around 8% of the time in a merge was spent in look-ups into the skip index. Since the documents in the merge case are read sequentially, we can replace the random access with an iterator over the documents.
Merge time on a sorted index before/after: 24s / 19s
Merge time on an unsorted index before/after: 15s / 13.5s
So we can expect 10-20% faster merges.
This iterator is also important if we add sorting based on a field in the documents.
* Update reader.rs
Co-authored-by: Paul Masurel <paul@quickwit.io>
* sort index by field
add sort info to IndexSettings
generate docid mapping for sorted field (only fastfield)
remap singlevalue fastfield
* support docid mapping in multivalue fastfield
move docid mapping to serialization step (less intermediate data for mapping)
add support for docid mapping in multivalue fastfield
* handle docid map in bytes fastfield
* forward docid mapping, remap postings
* fix merge conflicts
* move test to index_sorter
* add docid index mapping old->new
add docid mapping for both directions old->new (used in postings) and new->old (used in fast field)
handle mapping in postings recorder
warn instead of info for MAX_TOKEN_LEN
* remap docids in fieldnorms
* resort docids in recorder, more extensive tests
* handle index sorting in docstore
handle index sort in docstore, by saving all the docs in a temp docstore file (SegmentComponent::TempStore). On serialization, the docid mapping is used to create a docstore in the correct order by reading the old docstore.
add docstore sort tests
refactor tests
* refactor
rename docid doc_id
rename docid_map doc_id_map
rename DocidMapping DocIdMapping
fix typo
* u32 to DocId
* better doc_id_map creation
remove unstable sort
* add non mut method to FastFieldWriters
add _mut prefix to &mut methods
* remove sort_index
* fix clippy issues
* fix SegmentComponent iterator
use std::mem::replace
* fix test
* fmt
* handle indexsettings deserialize
* add reading, writing bytes to doc store
get bytes of document in doc store
add a store_bytes method to the doc writer to accept a serialized document
add serialization index settings test
* rename index_sorter to doc_id_mapping
use bufferlender in recorder
* fix compile issue, make sort_by_field optional
* fix test compile
* validate index settings on merge
validate index settings on merge
forward merge info to SegmentSerializer (for TempStore)
* fix doctest
* add itertools, use kmerge
add itertools, use kmerge
push because rustfmt fails
* implement/test merge for fastfield
implement/test merge for fastfield
rename len to num_deleted in DeleteBitSet
* Use precalculated docid mapping in merger
Use precalculated docid mapping in merger for sorted indices instead of on the fly calculation
Add index creation macro benchmark, but commented out for now, since it is not really usable due to long runtimes and extreme fluctuations. It may be better suited to criterion or an external bench bin
* fix fast field reader docs
fix fast field reader docs, Error instead of None returned
add u64s_lenient to fastreader
add create docid mapping benchmark
* add test for multifast field merge
refactor test
add test for multifast field merge
* add num_bytes to BytesFastFieldReader
equivalent to num_vals in MultiValuedFastFieldReader
* add MultiValueLength trait
add MultiValueLength trait in order to unify index creation for BytesFastFieldReader and MultiValuedFastFieldReader in merger
* Add ReaderWithOrdinal, fix
Add ReaderWithOrdinal to associate data to a reader in merger
Fix bytes offset index creation in merger
* add test for merging bytes with sorted docids
* Merge fieldnorm for sorted index
* handle posting list in merge in sorted index
handle posting list in merge in sorted index by using doc id mapping for sorting
reuse SegmentOrdinal type
* handle doc store order in merge in sorted index
* fix typo, cleanup
* make IndexSetting non-optional
* fix type, rename test file
fix type
rename test file
add type
* remove SegmentReaderWithOrdinal accessors
* cargo fmt
* add index sort & merge test to include deletes
* Fix posting list merge issue
Fix posting list merge issue - ensure serializer always gets monotonically increasing doc ids
handle sorting and merging for facets field
* performance: cache field readers, use bytes for doc store merge
* change facet merge test to cover index sorting
* add RawDocument abstraction to access bytes in doc store
* fix deserialization, update changelog
fix deserialization
update changelog
forward error on merge failed
* cache store readers to utilize lru cache (4x performance)
cache store readers to utilize the lru cache (4x faster performance, due to fewer decompress calls on the block)
* add include_temp_doc_store flag in InnerSegmentMeta
unset flag on deserialization and after finalize of a segment
set flag when creating new instances
The value that was previously there was 3, and it made the test fail when I enabled it. Verified that it, indeed, should have been 2 instead (the testing code previously contained an error).
add lz4 block compressor using lz4_flex, add lz4-block-compression feature flag
add snappy-compression feature flag for snap compressor, make snap crate optional
set lz4-block-compression as default feature flag
Added a FacetOptions for HierarchicalFacet which adds indexed and stored flags to it.
Propagate the change and update tests accordingly.
Added a test to ensure that a non-indexed flag was taken care of.
Added the `path()` function on the Value implementation to return the stored facet.
For some reason, under a docker build only, I get a build error saying that `serde::export` is private. This fixes it for me.
```
error[E0603]: module `export` is private
--> /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/tantivy-0.13.2/src/collector/top_collector.rs:5:12
|
5 | use serde::export::PhantomData;
| ^^^^^^ private module
|
note: the module `export` is defined here
--> /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/serde-1.0.119/src/lib.rs:275:5
|
275 | use self::__private as export;
| ^^^^^^^^^^^^^^^^^^^^^^^^^
```
In order to reduce IO, we introduced a way to instantiate a dummy constant FieldnormReader, which worked by allocating a buffer with as many bytes as there are docs in the segment.
This allocation is not negligible by any means.
This PR works by offering two implementations of the FieldnormReader.
The const fieldnorm reader simply returns the same value all of the time, while the array-based one does the same as the current one.
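A hedged sketch of the two implementations (hypothetical names): the const reader answers every doc with the same fieldnorm id and needs no per-doc buffer, while the array-backed reader behaves like the previous implementation.

```rust
enum FieldNormReaderSketch {
    /// One value for every doc in the segment; no allocation needed.
    Const { fieldnorm_id: u8, num_docs: u32 },
    /// One byte per doc, as in the previous implementation.
    Array(Vec<u8>),
}

impl FieldNormReaderSketch {
    fn fieldnorm_id(&self, doc: u32) -> u8 {
        match self {
            FieldNormReaderSketch::Const { fieldnorm_id, .. } => *fieldnorm_id,
            FieldNormReaderSketch::Array(ids) => ids[doc as usize],
        }
    }
}
```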
Recently, `test_index_manual_policy_mmap` has been failing on Windows.
The idea addressed by this patch is that we forgot to sync the parent directory in the current implementation of atomic writes.
This was done correctly when we were relying on the atomicwrites crate.
*crossing fingers*
Merge pull request #927 from tantivy-search/compact-store-index
The skip index now identifies both the start and the end offset of blocks. Checkpoints are compressed in blocks, reaching better compression.
Dearest Maintainer,
The say thanks project moved to email https://github.com/BlitzKraft/saythanks.io/issues/60. I removed the links. You might want to use your email but at that point people could just email you thanks?
Anyway, Thanks for the hard work on the project. I am enjoying it.
Dictated but not reviewed,
Becker
Tantivy used to assume that all files could be somehow memory mapped. After this change, Directory returns a `FileSlice` that can be reduced and eventually read into an `OwnedBytes` object. Long and blocking IO operations are still required, but they do not span the entire file.
* add support for indexed bytes fast field
* remove backup code file
* refine test cases
* Simplified unit test. Renamed it as it is testing the storable part. Not the indexed part.
* Small refactoring and added unit test. If multivalued we only retain the first FAST value.
Co-authored-by: Raul <raul.tang.lc@gmail.com>
* Unsatisfactory implementation.
The fastfields are hit. But for performance, we want the comparison to happen on u64, and the conversion to the FastType to be done only on the selected TopK elements.
For i64, the current approach might be ok.
For DateTime, it is most likely catastrophic.
Closes #822
* Decoupled the SegmentCollector Fruit from the Collector Fruit.
Deferred the conversion from u64 to the proper FastField type until after the overall collection.
(tantivy guarantees that the u64 encoding is consistent with the original ordering of the fastfield)
Closes #882
go_to_first_doc was typically calling seek with a target smaller than doc.
Since SegmentPostings typically does a linear search on the full block, regardless of the current position, this could make our segment postings go backward.
* Do nothing when combining score values of excluded scores.
* Add test case for two excluded.
* Test score for two excluded terms.
* Use TopDocs in test_boolean_query_two_excluded
There was a bug in the LogMergePolicy that surfaced when there were segments, but all of the segments were larger than the max limit.
After filtering, the list of segments that were candidates for merge was empty, and the code was indexing the first element of an empty Vec.
* Add offset to TopDocsCollector
Add an offset to TopDocsCollector and TopDocs to make it clearer how to
handle pagination.
Closes #822
* Address review comments
- Make Debug formatting of TopDocs clearer.
- Add unit tests for limit and offset on TopCollector.
- Change API for using offset to a fluent interface.
- Add some context to the docstring to clarify what limit and offset are
equivalent to in other projects.
* Changes required by rebase on e25284
- Pass Collector into TweakedScoreTopCollector and
CustomScoreTopCollector.
- Add std:: qualifier to f32, i32 etc. Not sure why this was not failing
already.
- Add unit tests for TopDocs with offset including for tweaked and
custom score collectors.
In order to convert a TopCollector<Score> to a TopCollector<TScore> I
had to add a `into_tscore` method to `TopCollector`. This is a hack but
I don't know how to avoid it.
- Change in the DocSet and Scorer API. (@fulmicoton)
A freshly created DocSet points directly at its first doc. A sentinel value called TERMINATED marks the end of a DocSet.
`.advance()` returns the new DocId. `Scorer::skip(target)` has been replaced by `Scorer::seek(target)`, which returns the resulting DocId.
As a result, iterating through a DocSet now looks as follows:
```rust
let mut doc = docset.doc();
while doc != TERMINATED {
// ...
doc = docset.advance();
}
```
The change made it possible to greatly simplify a lot of the docset's code.
- Misc internal optimization and introduction of the `Scorer::for_each_pruning` function. (@fulmicoton)
* Make TweakScore and CustomScore mutable
Make TweakScore and CustomScore mutable at the segment level.
Addresses issue #806
* Add example to show tweak_score working for facets
When applying the delete operations in the delete queue, it is possible that no document was newly deleted.
In this case, avoid creating a new delete file and updating the delete opstamp.
* Alternative take on boosted queries
* Fixing unit test
* Added boosting to the query grammar.
* Made BoostQuery public.
* Added support for boosting field in QueryParser
Closes #547
* Added backwards iteration to termdict
* Ran formatter
* Updated fst dependency
* Updated dependency
* Changelog and version
* Fixed version
* Made it part of 12.0
In order to make `IndexWriter::run` callable from outside of the crate, the `UserOperation` type needs to be publicly available.
Since the `indexer` module is private, we just export the `UserOperation`
type directly.
* WIP implemented is_compatible
hide Footer::from_bytes from public consumption - only Footer::extract is used outside the module
Add a new error type for IncompatibleIndex
add a prototypical call to footer.is_compatible() in ManagedDirectory::open_read to make sure we error out before reading further
* Make error handling more ergonomic
Add an error subtype for OpenReadError and converters to TantivyError
* Remove an unnecessary assert
it's followed by the same check, which errors instead of panicking
* Correct the compatibility check logic
Leave a defensive versioned footer check to make sure we add new handling logic when we add new possible footer versions
Restricted VersionedFooter::from_bytes to be used inside the crate only
remove a half-baked test
* WIP.
* Return an error if index incompatible - closes#662
Enrich the error type with incompatibility
Change return type to Result<bool, TantivyError>, instead of bool
Add an Incompatibility enum that enriches the IncompatibleIndex error variant
with information, which then allows us to generate a developer-friendly hint how
to upgrade library version or switch feature flags for a different compression
algorithm
Updated changelog
Change the signature of is_compatible
Added documentation to the Incompatibility
Added a conditional test on a Footer with lz4 erroring
* Make u64_lenient() handle f64 fast fields too
Without this, we get a panic during merge since the merger will
get a `None` where it expects something.
Prior to this patch, you can reproduce the panic with:
```rust
use tantivy::{
    self,
    schema::{SchemaBuilder, FAST},
    Document, Index, Result,
};

#[test]
fn pass() -> Result<()> {
    let mut builder = SchemaBuilder::new();
    let field = builder.add_f64_field("f64", FAST);
    let index = Index::create_in_ram(builder.build());
    let mut writer = index.writer_with_num_threads(1, 50_000_000)?;
    for i in 0..1000 {
        let mut doc = Document::new();
        doc.add_f64(field, 0.42);
        writer.add_document(doc);
        if i % 5 == 0 {
            writer.commit()?;
        }
    }
    writer.commit()?;
    Ok(())
}
```
* Add test to verify that f64 fields are merged
* Ensure multi-valued fast fields can be merged too
* Prevent tokens from being stored in the document store.
This commit adds a prepare_for_store method to Document, which changes all PreTokenizedString values into String values. The method is called before adding a document to the document store, to prevent tokens from being saved there. The commit also makes small changes to comments in the pre_tokenized_text example.
* Avoid storing the pretokenized text.
* code tidy-up
Replace `20` magic constant with COMMON_FOOTER_SIZE
Add a docstring showing how footer is serialised
Add a test for footer length checking
* Add more tests for VersionedFooter
successful and panicking .to_bytes() calls
* Minor changes in footer.rs
* Added handling of pre-tokenized text fields (#642).
* Updated changelog and examples concerning #642.
* Added tokenized_text method to Value implementation.
* Implemented From<TokenizedString> for TokenizedStream.
* Removed tokenized flag from TextOptions and code reliance on the flag.
* Changed naming to use word "pre-tokenized" instead of "tokenized".
* Updated example code.
* Fixed comments.
* Minor code refactoring. Test improvements.
* Use `slice::iter` instead of `into_iter` to avoid future breakage
`an_array.into_iter()` currently just works because of the autoref
feature, which then calls `<[T] as IntoIterator>::into_iter`. But
in the future, arrays will implement `IntoIterator`, too. In order
to avoid problems in the future, the call is replaced by `iter()`
which is shorter and more explicit.
* cargo fmt
* TopDocs: ensure stable sorting on equal score
When selecting the top K documents by score, we need to ensure stable
sorting. Until now, for documents with the same score, we were relying
on the (arbitrary) order returned by the BinaryHeap used to implement
the collectors.
This patch fixes the problem by explicitly using the doc address when
harvesting the `TopSegmentCollector` and when merging the results in
`TopCollector::merge_fruits()`.
This is important (for example) to implement pagination correctly using
the TopDocs collector. If sorting isn't stable, documents that have the
same score might be ranked in different positions depending on the
specific K that was used, thus appearing in two different pages, or in
none at all.
Fixes gh-671
* TMP: alternative solution (see previous commit)
If we add the constraint that D is also PartialOrd in ComparableDoc<T, D>, then we can move the comparison by doc address directly into the cmp implementation of ComparableDoc.
* TMP rebase as first commit: add benchmarks for TopSegmentCollector
* fixup! TMP: alternative solution (see previous commit)
* TMP add changelog entry
* TMP run cargo fmt
* Add a doctest to BooleanQuery
Closes #446
Mark a function that is only used in tests to be compiled for tests only
Fix doc-comments in a couple of related files
* Minor corrections
remove whitespace, fix typos, add explicit dyn marker
* WIP: BooleanQuery doc test
Trying to nest several BooleanQueries together
* Addressed old review
rust 2018 edition + make function available to everyone
* Box the previous query to resolve the type error
* Rework wording in DocAddress document strings
* Reworded and restructured the docstring
* add checksum check in ManagedDirectory
fix #400
* flush after writing checksum
* don't checksum atomic file access and clone managed_paths
* implement a footer storing metadata about a file
this is more of a poc; it requires some refactoring into multiple files
`terminate(self)` is implemented, but not used anywhere yet
* address comments and simplify things with new contract
use BitOrder for integer to raw byte conversion
consider that atomic write implies atomic read, which might not actually be true
use some indirection to have a boxable terminating writer
* implement TerminatingWrite and make terminate() be called where it should
add dependency on drop_bomb to help find where terminate() should be called
implement TerminatingWrite for wrapper writers
make tests pass
/!\ some tests seem to pass where they shouldn't
* remove usage of drop_bomb
* fmt
* add test for checksum
* address some review comments
* update changelog
* fmt
* small docs cleanup
* only compile a regex once per RegexQuery
Building a `Regex` is an expensive operation. Users of `RegexQuery`
need to cache and reuse regexes when searching across multiple fields.
This is the first step towards allowing that: we can store the `Regex`
directly in the `RegexQuery`, instead of the string pattern.
* RegexQuery: account for possible failure in the constructor
When building a regex from a str pattern, we have to account for the
possibility that the pattern is invalid. Before the previous commit, the
failure would happen in the `specialized_weight` method. Now that we
store a compiled `Regex` in `RegexQuery`, `specialized_weight` doesn't
fail anymore, and we can fail early while constructing `RegexQuery` if
the pattern is invalid.
This is a breaking change for users of `RegexQuery::new`.
* add RegexQuery::from_regex method
This builds a `RegexQuery` from an already compiled `Regex`. The use of
`Into<Arc<Regex>>` is to allow the caller to either simply pass a
`Regex`, or an `Arc<Regex>`, in case it needs to be cached and shared on
the caller's side.
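A hedged usage sketch (the fst-based `Regex` path and the field handles are assumptions for illustration): compile the pattern once and share it across several queries.

```rust
use std::sync::Arc;

use tantivy::query::RegexQuery;
use tantivy::schema::Field;
use tantivy_fst::Regex;

fn regex_queries(pattern: &str, title: Field, body: Field) -> (RegexQuery, RegexQuery) {
    // Compile once; cloning the Arc is cheap.
    let regex = Arc::new(Regex::new(pattern).expect("invalid regex pattern"));
    (
        RegexQuery::from_regex(regex.clone(), title),
        RegexQuery::from_regex(regex, body),
    )
}
```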
* Using an Arc in AutomatonWeight
Closes #639
* Added cargo-fmt to CI runs
Closes #625
* Remove fmt from appveyor builds
Windows seems to have issues with installing components through rustup.
Formatting should be equally informative regardless of the OS, so it's best to keep it on Linux on Travis
* Tidy up
fmt
remove unnecessary -> Result<()> followed by run.unwrap() in a test
* Adding support for elasticsearch-style unbounded queries
Extend the UserInputBound to include Unbounded, so we can reuse formatting and
internal query format
* Still working on elastic-style range queries
Fixes #498
Merge the elastic_range into range
Reformat to make code easier to follow, use optional() macro to return Some
* Fixed bugs
Made the range parser insensitive to whitespace between the ":" and the range.
Removed optional parsing of field.
Added a unit test for the range parser.
Derived PartialEq to compare the results of parsing as structs, instead of
strings. Found a bug with that unit test - "*}" was parsed as UserInputBound::Exclusive instead of UserInputBound::Unbounded. Added an early detection-and-return for * in the original range parser
* Correct failing test
Assume that we will use "{*" for Unbounded ranges
* Add a note in the changelog
cargo-fmt
* Moved parenthesis to a newline to make nested if-else more visible
* Replace unwrap with match and proper Error handling
* Replaced 'magic' values with a documented variable
Didn't like the unexplained 0..3 range, thought it was best as a variable
Calculating the Levenshtein distance is expensive, so it's best to explain why we should keep it low
* add basic support for floats
as with i64, they are mapped to u64 for indexing
the query parser doesn't work yet
* Update value.rs
* implement support for float in query parser
* Update README.md
* cargo: update to fail 0.3
* tantivy: align failpoints feature naming
This aligns feature naming to use `failpoints` everywhere, like the
underlying library.
* Refactor deletes
* Removing generation from SegmentUpdater. These have been obsolete for a long time
* Number literal clippy
* Removed clippy useless allow statement
* Enables clearing the index
Closes #510
* Adds an example to clear and rebuild an index
* Addressing code review
Moved the example from examples/ to docstring above `clear`
* Corrected minor typos and missed/duplicate words
* Added stamper.revert method to be used for rollback
Added type alias for Opstamp
Moved to AtomicU64 on stable rust (since 1.34)
* Change the method name and doc-string
* Remove rollback from delete_all_documents
test_add_then_delete_all_documents fails with --test-threads 2
* Passes all the tests with any number of test-threads
(ran locally 5 times)
* Addressed code review
Deleted comments with debug info
changed ReloadPolicy to Manual
* Removing useless garbage_collect call and updated CHANGELOG
* add ascii folding support
* Minor change and added Changelog.
* add additional tests
* Add tests for ascii folding (#533)
* first tests for ascii folding
* use a `RawTokenizer` for tokens using punctuation
* add test for all (?) folding, inspired by Lucene
* Simplification of the unit test code
Added the stamper.revert method to be used for rollback - rolling back to a previous commit in case of deleting all documents, or rolling operations back, should reset the stamper as well.
Added a type alias for Opstamp - helps code readability instead of seeing u64 returned by functions.
Moved to AtomicU64 on stable rust (since 1.34) - where possible, use standard library interfaces.
* Clippy comments
Clippy complains about the cast of &[u32] to a *const __m128i, because of the lack of alignment constraints.
This commit passes the OutputBuffer object (which enforces proper alignment) instead of `&[u32]`.
* Clippy. Block alignment
* Code simplification
* Added comment. Code simplification
* Removed the extraneous freq block len hack.
* initial version, still a work in progress
* remove redundant or
* add chrono::DateTime and index i64
* add more tests
* fix tests
* pass DateTime by ptr
* remove println!
* document query_parser rfc 3339 date support
* added some more docs about implementation to schema.rs
* enforce DateTime is UTC, and re-export chrono
* added DateField to changelog
* fixed conflict
* use INDEXED instead of INT_INDEXED for date fields
* Format
Made the docstring consistent
remove empty line
* Move matches to dev deps
* Replace MsQueue with an unbounded crossbeam-channel
Questions:
queue.push ignores Result return
How to test pop() calls, if they block
* Format
Made the docstring consistent
remove empty line
* Unwrap the Result of queue.pop
* Addressed Paul's review
wrap the Result-returning send call with expect()
implemented the test not to fail after popping from empty queue
removed references to the Michael-Scott Queue
formatted
* [WIP] added UserOperation enum, added IndexWriter.run, and added MultiStamp
* removed MultiStamp in favor of std::ops::Range
* changed IndexWriter::run to return u64, Stamper::stamps to return a Range, added tests, and added docs
* changed delete_cursor skipping to use the first operation's opstamp instead of the last. changed the index_writer test to use 1 thread
* added test for order batch of operations
* added a test comment
* Moving lock to directory/
* added fs2
* doc
* Using fs2 for locking
* Added unit test
* Fixed error message related unit test
* Fixing location of import
* Closes #471
Removing writing_segments in the segment manager, as it is now useless.
Removing the target merged segment id, as it is useless as well.
* RAII for tracking which segment is in merge.
Closes #471
* fmt
* Using Inventory::default().
* add unit test
* improved test
* added SegmentManager#remove_empty_segments
* update old tests for new behaviour
* cleaner filter for empty segments
* PR adjustments
* rename x in closures
* simplify assert_eq!(vec.len(), 0)
* wait_merging_threads
* acquire searchers
* add comments to test
* rebased on latest master
* harden test
* fix merger#test_merge_multivalued_int_fields_all_deleted test
* Using unrolled u32 VInt and caching Vec s
* cargo fmt
* Exposing a io::Write in the Expull thing
* expull as a writer. clippy + format
* inline the first block
* simplified -if let Some-
* vint reader iterator
* blop
* Split Collector into an overall Collector and a per-segment SegmentCollector. Precursor to cross-segment parallelism, and as a side benefit cleans up any per-segment fields from being Option<T> to just T.
* Attempt to add MultiCollector back
* working. Chained collector is broken though
* Fix chained collector
* Fix test
* Make Weight Send+Sync for parallelization purposes
* Expose parameters of RangeQuery for external usage
* Removed &mut self
* fixing tests
* Restored TestCollectors
* blop
* multicollector working
* chained collector working
* test broken
* fixing unit test
* blop
* blop
* Blop
* simplifying APi
* blop
* better syntax
* Simplifying top_collector
* refactoring
* blop
* Sync with master
* Added multithread search
* Collector refactoring
* Schema::builder
* CR and rustdoc
* CR comments
* blop
* Added an executor
* Sorted the segment readers in the searcher
* Update searcher.rs
* Fixed unit tests
* changed the place where we have the sort-segment-by-count heuristic
* using crossbeam::channel
* inlining
* Comments about panics propagating
* Added unit test for executor panicking
* Readded default
* Removed Default impl
* Added unit test for executor
* Change the semantics of Index::create_in_dir.
It should return an error if the directory already contains an Index.
* Index::open_or_create is working
* additional test
* Checking that schema matches on open_or_create.
Simplifying unit tests.
* simplifying Eq
* A working version
* optimize the ngram parsing
* Decoding codepoint only once.
* Closes #429
* using leading_zeros to make code less cryptic
* lookup in a table
* Compute space usage of a Searcher / SegmentReader / CompositeFile
* Fix typo
* Add serde Serialize/Deserialize for all the SpaceUsage structs
* Fix indexing
* Public methods for consuming space usage information
* #281: Add a space usage method that takes a SegmentComponent to support code that is unaware of particular segment components, and to make it more likely to update methods when a new component type is added.
* Add support for space usage computation of positions skip index file (#281)
* Add some tests for space usage computation (#281)
* Make TopCollector generic
Make TopCollector take a generic type instead of only being tied to
score. This will allow for sharing code between a TopCollector that
sorts results by Score and a TopCollector that sorts documents by a fast
field. This commit makes no functional changes to TopCollector.
* Add TopFieldCollector and TopScoreCollector
Create two new collectors that use the refactored TopCollector.
TopFieldCollector has the same functionality that TopCollector
originally had. TopFieldCollector allows for sorting results by a given
fast field. Closes tantivy-search/tantivy#388
* Make TopCollector private
Make TopCollector package private and export TopFieldCollector as
TopCollector to maintain backwards compatibility. Mark TopCollector
as deprecated to encourage use of the non-aliased TopFieldCollector.
Remove the Collector implementation for TopCollector since it is no longer used.
`try_add_token` will now update the stop_offset as well.
`FragmentCandidate::new` now just takes `start_offset`; it expects `try_add_token` to be called to add a token.
* Moving Range and All to Leaves
* Parsing OR/AND
* Simplify user input ast
* AND and OR supported. Returning an error when mixing syntax
Closes #246
* Added support for NOT
* Updated changelog
* add position_length to Token
refer #291
* Add term offset to `PhraseQuery`
ref #291
* Add new constructor for `PhraseQuery` that allows custom offset
* fix the method name as per pr comment
* Closes #291
Added unit test.
Using offsets from the analyzer in QueryParser.
* Moved NUM_SEARCHERS into a local variable
dynamically determined as the number of available cpus.
var name in lowercase (not a constant anymore).
updated it in the docstring
* lowercased the varnames
* User can set number of logical cores in create_from_metas
* cargo fmt
* Num_searchers as Arc<AtomicUsize>
Retrieving the value with Relaxed ordering
Reverted create_from_metas signature. However, it calls num_cpus and
sets the Arc val
* Add skip information for posting list (skip to doc ids)
* Separate num bits from data for positions (skip n positions)
* Address in the position using a n-position offset
* Added a long skip structure to allow efficient opening of the position for a given term.
* (#321) Add support for range query parsing to grammar / parser. Still needs to be wired through the rest of the way.
* (#321) Finish wiring RangeQuery parsing through
* (#321) Add logical AST query parser tests for RangeQuery
* (#321) Support parsing AllQuery
* (#321) Update documentation of QueryParser
* (#321) Support negative numbers in range query parsing
* Add TopCollector doc test
* Add CountCollector Doc Test
* Add Doc Test for MultiCollector
* Add ChainedCollector Doc Test
* Expose Fuzzy Query where it should be
* Add FuzzyTermQuery Doc Test
* Expose RegexQuery
* Regex Query Doc Test
* Add TermQuery Doc Test
* Add doc comments
* fix test 🤦
* Added explanation about the complexity variables
* Fixing unit tests
* Single threads if you check docids
* Add fst_regex crate in
* Reduce API surface area
This doesn't need to be public
* better test name
* Pull Automaton weight out so it can be shared
* Implement Regex Query
* Relocate `from_directory` closer to its usage
* Specific methods come before the generic method
* Rename open methods to follow the lead of the create methods
* Pull all creation methods next to each other
The goal here is to make it clear which methods are performing the
same function, and to assist with standardizing the API calls.
* Make `from_directory` private
This seems to be an internal function, so lets make it internal.
* Rename `create` to `create_in_dir`
This lets the name match the `create_in_ram` pattern and opens up
`create` for the generic implementation.
* Implement the generic create function
All of the create methods now delegate to the common create function
and future `create_in_*` functions now have a clear pattern
to follow as well
* Replaced lz4 with a pure rust implementation of snappy.
Closes #257
* snappy is the default compression. One can use lz4 by enabling the lz4 feature flag.
* Removed Compression trait
* Add AutomatonWeight to a fuzzy_search module
* Hacking around ownership issues
* Working through lifetime issues
* Working through tests
* fix test by lower casing the words (reducing distance)
* code review changes
* Suggestion on how to solve the borrow problem
* clean up
* Changed the heap to a paged memory arena.
* Trying to simplify the indexing term hashmap
* Exploding datastruct
* Removed some complexity in bitpacker
* rustfmt and some English grammar
* sort cargo.toml crates
* WIP: something to show
* Remove example for now
* Implement desired method
* Resolving Generic Type Arguments
* Resolve Generic Types
* Banging around on the tests
* DANGER! Change unsafe usage based on compiler warnings
* Unscrew up my rebase
* Clean Up Type Spam
Default Types FTW
* typo
* better variable names
* Remove Duplicate Levenshtein crate
* Implement StopWords Filter
- added example doctest for alphanum_only.rs so that I could
drive my own test of the stopword filter
* Style Cop
* Switch HashSet Hasher to FNV for speed
* Update Change Log
* fix missed location renaming
* Add non-deleted DocId iterator to SegmentReader
Closes #287
* Add Todo
* Add Unit Test
* Improving test based on feedback
- found bug and fixed it. :)
* Reestablish changes post rebase for clean merge
* Add fast field for associating arbitrary bytes to a document
* Fix unused macro_use warning
* Improvements from code review
* Make BytesFastFieldWriter public
* Fix json parsing validation failure
* Add bytes fast field to CHANGELOG.md
* Fix compile errors from merge
* Support merging
* Address misc code review comments
* Fix comments from CR
* Simple Implementation of NGram Tokenizer
It does not yet support edges
It could probably be better in many "rusty" ways
But the test is passing, so I'll call this a good stopping point for
the day.
* Remove Ngram from manager. Too many variations
* Basic configuration model
Should the extensive tests exist here?
* Add Sample to provide an End to End testing
* Basic Edgegram support
* cleanup
* code feedback
* More code review feedback processed
PhraseQuery panics with a nice error message when the underlying field does not have any positions.
The `QueryParser` fails as well with a dedicated error.
Tantivy is a library meant for building search engines. Although it is by no means a port of Lucene, its architecture is strongly inspired by it. If you are familiar with Lucene, you may be struck by the overlapping vocabulary.
This is not fortuitous.
Tantivy's bread and butter is to address the problem of full-text search:
given a large set of textual documents and a text query, return the K most relevant documents in a very efficient way. To execute these queries rapidly, tantivy needs to build an index beforehand. The relevance score implemented in tantivy is not configurable. Tantivy uses the same score as the default similarity in Lucene / Elasticsearch, called [BM25](https://en.wikipedia.org/wiki/Okapi_BM25).
But tantivy's scope does not stop there. Numerous features are required to power rich-search applications. For instance, one may want to:
- compute the count of documents matching a query in the different section of an e-commerce website,
- display an average price per meter square for a real estate search engine,
- take into account historical user data to rank documents in a specific way,
- or even use tantivy to power an OLAP database.
A more abstract description of the problem space tantivy is trying to address is the following.
Ingest a large set of documents, create an index that makes it possible to
rapidly select all documents matching a given predicate (also known as a query) and
collect some information about them ([See collector](#collector-define-what-to-do-with-matched-documents)).
Roughly speaking the design is following these guiding principles:
- Search should be O(1) in memory.
- Indexing should be O(1) in memory. (In practice it is just sublinear)
- Search should be as fast as possible
This comes at the cost of the dynamicity of the index: while it is possible to add and delete documents from our corpus, tantivy is designed to handle these updates in large batches.
## [core/](src/core): Index, segments, searchers
Core contains all of the high-level code to make it possible to create an index, add documents, delete documents and commit.
This is both the most high-level part of tantivy and the least performance-sensitive one, the seemingly most mundane code... and paradoxically the most complicated part.
### Index and Segments
A tantivy index is a collection of smaller independent immutable segments.
Each segment contains its own independent set of data structures.
A segment is identified by a segment id that is in fact a UUID.
The file of a segment has the format
```segment-id . ext```
The extension signals which data structure (or [`SegmentComponent`](src/core/segment_component.rs)) is stored in the file.
A small `meta.json` file is in charge of keeping track of the list of segments, as well as the schema.
On commit, one segment per indexing thread is written to disk, and the `meta.json` is then updated atomically.
For a better idea of how indexing works, you may read the [following blog post](https://fulmicoton.com/posts/behold-tantivy-part2/).
### Deletes
Deletes happen by deleting a "term". Tantivy does not offer any notion of a primary id, so it is up to the user to use a field in their schema as if it were a primary id, and delete the associated term if they want to delete only one specific document.
On commit, tantivy will find all of the segments with documents matching this existing term and remove them from the [alive bitset file](src/fastfield/alive_bitset.rs) that represents the bitset of the alive document ids.
Like all segment files, this file is immutable. Because it is possible to have more than one alive bitset file at a given instant, the alive bitset filename has the format ```segment_id . commit_opstamp . del```.
An opstamp is simply an incremental id that identifies any operation applied to the index. For instance, performing a commit or adding a document.
### DocId
Within a segment, all documents are identified by a DocId that ranges within `[0, max_doc)`.
where `max_doc` is the number of documents in the segment, (deleted or not). Having such a compact `DocId` space is key to the compression of our data structures.
The DocIds are simply allocated in the order documents are added to the index.
### Merges
In separate threads, tantivy's index writer searches for opportunities to merge segments.
The point of segment merge is to:
- eventually get rid of tombstoned documents
- reduce the otherwise ever-growing number of segments.
Indeed, while having several segments instead of one does not hurt search too much, having hundreds can have a measurable impact on the search performance.
### Searcher
The user of the library usually does not need to know about the existence of Segments.
Searching is done through an object called a [`Searcher`](src/core/searcher.rs) that captures a snapshot of the index at one point in time, by holding a list of [SegmentReader](src/core/segment_reader.rs).
In other words, regardless of the commits, file garbage collection, or segment merges that might happen, as long as the user holds and reuses the same [Searcher](src/core/searcher.rs), search will happen on an immutable snapshot of the index.
## [directory/](src/directory): Where should the data be stored?
Tantivy, like Lucene, abstracts the place where the data should be stored in a key-trait
called [`Directory`](src/directory/directory.rs).
Contrary to Lucene however, "files" are quite different from some kind of `io::Read` object.
Check out [`src/directory/directory.rs`](src/directory/directory.rs) trait for more details.
Tantivy ships two main directory implementations: the `MmapDirectory` and the `RamDirectory`, but users can extend tantivy with their own implementations.
## [schema/](src/schema): What are documents?
Tantivy's document follows a very strict schema, decided before building any index.
The schema defines all of the fields that the index's [`Document`](src/schema/document.rs) may and should contain, their types (`text`, `i64`, `u64`, `Date`, ...) as well as how they should be indexed / represented in tantivy.
Depending on the type of the field, you can decide to
- put it in the docstore
- store it as a fast field
- index it
Practically, tantivy will push values associated with this type to up to 3 respective
data structures.
*Limitations*
As of today, tantivy's schema imposes a 1:1 relationship between a field that is being ingested and a field represented in the search index. In sophisticated search applications, it is fairly common to want to index a field twice using different tokenizers, or to index the concatenation of several fields together as one field.
This is not something tantivy supports; it is up to the user to duplicate / concatenate fields before feeding them to tantivy.
## General information about these data structures
All data structures in tantivy have:
- a writer
- a serializer
- a reader
The writer builds an in-memory representation of a batch of documents. This representation is not searchable. It is just meant as an intermediary mutable representation, to which we can sequentially add the documents of a batch. At the end of the batch (or if a memory limit is reached), this representation is then converted into an on-disk immutable representation that is extremely compact.
This conversion is done by the serializer.
Finally, the reader is in charge of offering an API to read on this on-disk read-only representation.
In tantivy, readers are designed to require very little anonymous memory. The data is read straight from an mmapped file, and loading an index is as fast as mmapping its files.
## [store/](src/store): Here is my DocId, Gimme my document
The docstore is a row-oriented storage that, for each document, stores a subset of the fields
that are marked as stored in the schema. The docstore is compressed using a general-purpose algorithm
like LZ4.
**Useful for**
In search engines, it is often used to display search results.
Once the top 10 documents have been identified, we fetch them from the store, and display them or their snippet on the search result page (aka SERP).
**Not useful for**
Fetching a document from the store is typically a "slow" operation. It usually consists of
- searching into a compact tree-like data structure to find the position of the right block.
- decompressing a small block
- returning the document from this block.
It is NOT meant to be called for every document matching a query.
As a rule of thumb, if you hit the docstore more than 100 times per search query, you are probably misusing tantivy.
## [fastfield/](src/fastfield): Here is my DocId, Gimme my value
Fast fields are stored in a column-oriented storage that allows for random access.
The only compression applied is bitpacking. The column comes with two pieces of metadata: the minimum value in the column and the number of bits per doc.
Fetching a value for a `DocId` is then as simple as computing ```min_value + read_bits(doc_id * num_bits .. (doc_id + 1) * num_bits)```.
Because DocSets are scanned through in order (DocIds are iterated in a sorted manner), this access pattern also helps locality.
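A hedged sketch of that lookup (a hypothetical helper, not tantivy's actual reader): with bitpacking as the only compression, fetching a value is pure arithmetic over the raw column bytes.

```rust
struct BitpackedColumn {
    bytes: Vec<u8>,
    min_value: u64,
    num_bits: u8, // bits per doc
}

impl BitpackedColumn {
    fn get_val(&self, doc_id: u32) -> u64 {
        debug_assert!(self.num_bits <= 56); // keeps the 8-byte window valid
        let bit_start = doc_id as u64 * self.num_bits as u64;
        let byte_start = (bit_start / 8) as usize;
        // Load an 8-byte window and shift to the value's first bit.
        let mut buf = [0u8; 8];
        let len = (self.bytes.len() - byte_start).min(8);
        buf[..len].copy_from_slice(&self.bytes[byte_start..byte_start + len]);
        let word = u64::from_le_bytes(buf) >> (bit_start % 8);
        self.min_value + (word & ((1u64 << self.num_bits) - 1))
    }
}
```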
In Lucene's jargon, fast fields are called DocValues.
**Useful for**
They are typically integer values that are useful to either rank or compute aggregates over all of the documents matching a query (aka [DocSet](src/docset.rs)).
For instance, one could define a function to combine upvotes with tantivy's internal relevancy score.
This can be done by fetching a fast field during scoring.
One could also compute the mean price of the items matching a query in an e-commerce website.
This can be done by fetching a fast field in a collector.
Finally, one could decide to post-filter a docset to remove documents with a price within a specific range.
If the ratio of filtered-out documents is not too high, an efficient way to do this is to fetch the price and apply the filter on the collector side.
Aside from integer values, it is also possible to store an actual byte payload.
For advanced search engines, it is possible to store all of the features required for learning-to-rank in a byte payload, access it during search, and apply the learning-to-rank model.
Finally, facets are a specific kind of fast field, and the associated source code is in [`fastfield/facet_reader.rs`](src/fastfield/facet_reader.rs).
# The inverted search index
The inverted index is the core part of full-text search.
When presented with a new document with the text field "Hello, happy tax payer!", tantivy breaks it into a list of so-called tokens. In addition to splitting the string into tokens, it may also perform different kinds of operations, like dropping the punctuation, converting characters to lowercase, applying stemming, etc. Tantivy makes it possible to configure these operations in the schema (tokenizer/ is the place where they are implemented).
For instance, the default tokenizer of tantivy would break our text into: `[hello, happy, tax, payer]`.
The document will therefore be registered in the inverted index as containing the terms
`[text:hello, text:happy, text:tax, text:payer]`.
The role of the inverted index is, given a term, to return a very fast iterator over the sorted doc ids that contain the term.
Such an iterator is called a posting list. In addition to `DocId`s, it can optionally give us the number of occurrences of the term in each document, also called the term frequency (TF).
Because these iterators are sorted by DocId, one can create an iterator over the documents matching `text:tax AND text:payer`, `(text:tax AND text:payer) OR (text:contribuable)`, or any boolean expression.
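To make that composition concrete, here is an illustrative intersection of two sorted doc id lists, the way an AND walks two posting lists. Tantivy's real scorers work on `seek`/`advance` style APIs rather than slices:

```rust
/// Sketch: intersect two sorted doc id lists (the essence of an AND query).
fn intersect(left: &[u32], right: &[u32]) -> Vec<u32> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < left.len() && j < right.len() {
        match left[i].cmp(&right[j]) {
            std::cmp::Ordering::Less => i += 1,
            std::cmp::Ordering::Greater => j += 1,
            std::cmp::Ordering::Equal => {
                out.push(left[i]); // doc present in both lists
                i += 1;
                j += 1;
            }
        }
    }
    out
}
```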
In order to represent the function
```Term ⟶ Posting```
the inverted index actually consists of two data structures chained together.
- [Term](src/schema/term.rs) ⟶ [TermInfo](src/postings/term_info.rs) is addressed by the term dictionary.
- [TermInfo](src/postings/term_info.rs) ⟶ [Posting](src/postings/postings.rs) is addressed by the posting lists.
Here, [TermInfo](src/postings/term_info.rs) is an object containing some metadata about a term.
## [termdict/](src/termdict): Here is a term, give me the [TermInfo](src/postings/term_info.rs)
Tantivy's term dictionary is mainly in charge of supplying the function [Term](src/schema/term.rs) ⟶ [TermInfo](src/postings/term_info.rs). It is itself split into two parts:
- [Term](src/schema/term.rs) ⟶ [TermOrdinal](src/termdict/mod.rs) is addressed by a finite state transducer, implemented by the fst crate.
- [TermOrdinal](src/termdict/mod.rs) ⟶ [TermInfo](src/postings/term_info.rs) is addressed by the term info store.
## [postings/](src/postings): Iterate over documents... very fast
A posting list makes it possible to store a sorted list of doc ids and, optionally, a term frequency for each doc.
The posting lists are stored in a separate file. The [TermInfo](src/postings/term_info.rs) contains an offset into that file and a number of documents for the given posting list. Both are required and sufficient to read the posting list.
The posting list is organized in blocks of 128 documents.
One block of doc ids is followed by one block of term frequencies.
The doc ids are delta-encoded and bitpacked.
The term frequencies are bitpacked.
Because the number of docs is rarely a multiple of 128, the last block may contain between 1 and 127 documents. For this last block, we use variable-length int encoding instead of bitpacking.
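The sketch below illustrates how a full block could be prepared: delta-encode the 128 doc ids, then derive the number of bits needed to bitpack the block. It is our simplification, not tantivy's actual code:

```rust
/// Sketch: delta-encode a full block of doc ids and compute the bit width
/// required to bitpack it (illustrative only).
fn deltas_and_num_bits(doc_ids: &[u32; 128]) -> ([u32; 128], u8) {
    let mut deltas = [0u32; 128];
    let mut previous = 0u32; // in practice, the last doc id of the previous block
    for (delta, &doc) in deltas.iter_mut().zip(doc_ids.iter()) {
        *delta = doc - previous;
        previous = doc;
    }
    // Just enough bits to hold the largest delta in the block.
    let max_delta = deltas.iter().copied().max().unwrap_or(0);
    let num_bits = (32 - max_delta.leading_zeros()) as u8;
    (deltas, num_bits)
}
```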
## [positions/](src/positions): Where are my terms within the documents?
Phrase queries make it possible to search for documents containing a specific sequence of terms.
For instance, the phrase query "the art of war" does not match "the war of art".
To make this possible, one can specify in the schema that a field should store positions in addition to being indexed.
The token positions of all of the terms are then stored in a separate file with the extension `.pos`.
The [TermInfo](src/postings/term_info.rs) gives an offset (expressed in positions this time) into this file. As we iterate through the docset,
we advance the position reader by the term frequency of the current document.
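A simplified model of that bookkeeping (ours, not tantivy's actual reader): the positions of the current doc always start at the running sum of the term frequencies of the docs skipped over so far.

```rust
/// Sketch: the position offset of `target_doc` is the cumulative term
/// frequency of all docs that precede it in the posting list.
fn position_offset(postings: &[(u32, u32)], target_doc: u32) -> u32 {
    // `postings`: (doc id, term frequency) pairs in docset order.
    let mut offset = 0u32;
    for &(doc, term_freq) in postings {
        if doc == target_doc {
            break;
        }
        offset += term_freq;
    }
    offset
}
```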
## [fieldnorms/](src/fieldnorms): Here is my doc, how many tokens in this field?
The [BM25](https://en.wikipedia.org/wiki/Okapi_BM25) formula also requires knowing the number of tokens stored in a specific field for a given document. We store this information on one byte per document in the fieldnorm.
The fieldnorm is therefore compressed: values up to 40 are encoded unchanged, while larger values are only stored approximately.
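The sketch below illustrates the idea of such a one-byte encoding. Apart from the documented "exact up to 40" behavior, the bucketing scheme here is made up and differs from tantivy's actual table:

```rust
/// Illustration of a one-byte fieldnorm encoding (not tantivy's real table).
fn fieldnorm_to_byte(num_tokens: u32) -> u8 {
    if num_tokens <= 40 {
        // Small values are encoded exactly.
        num_tokens as u8
    } else {
        // Above the exact range, one byte can only hold a coarse,
        // monotonically increasing approximation (here: log2 buckets).
        let bucket = 32 - (num_tokens - 40).leading_zeros(); // 1..=26
        (40 + bucket) as u8
    }
}
```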
## [tokenizer/](src/tokenizer): How should we process text?
Text processing is key to a good search experience.
Split or normalize your text too much, and your search results will have lower precision and higher recall.
Normalize or split your text too little, and you will end up with higher precision but lower recall.
Text processing can be configured by selecting an off-the-shelf [`Tokenizer`](./src/tokenizer/tokenizer.rs) or implementing your own to first split the text into tokens, and then chain different [`TokenFilter`](src/tokenizer/tokenizer.rs)'s to it.
Tantivy comes with a few tokenizers, but external crates offer advanced tokenizers, such as [Lindera](https://crates.io/crates/lindera) for Japanese.
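For instance, a custom analyzer chain can be assembled from off-the-shelf parts. This is a sketch based on the tokenizer API of this era (it has since evolved; check the linked sources for the current form):

```rust
use tantivy::tokenizer::{LowerCaser, RemoveLongFilter, SimpleTokenizer, TextAnalyzer};

/// Sketch: split on non-alphanumeric characters, drop tokens longer than
/// 40 bytes, then lowercase.
fn custom_analyzer() -> TextAnalyzer {
    TextAnalyzer::from(SimpleTokenizer)
        .filter(RemoveLongFilter::limit(40))
        .filter(LowerCaser)
}
```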
## [query/](src/query): Define and compose queries
The [Query](src/query/query.rs) trait defines what a query is.
Because some queries need to compute statistics over the entire index, and because the
index is composed of several `SegmentReader`s, the path from a `Query` to an iterator over documents is slightly convoluted, but fundamentally, this is what a Query is.
The iterator over documents comes with a scoring function. The resulting trait is called a
[Scorer](src/query/scorer.rs) and is specific to a segment.
Different queries can be combined using the [BooleanQuery](src/query/boolean_query/).
Tantivy comes with different types of queries and can be extended by implementing
the `Query`, `Weight`, and `Scorer` traits.
## [collector](src/collector): Define what to do with matched documents
Collectors define how to aggregate the documents matching a query, in the broadest sense possible.
The search will push matched documents one by one, calling the collector's
`fn collect(doc: DocId, score: Score);` method.
Users may implement their own collectors by implementing the [Collector](src/collector/mod.rs) trait.
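Off-the-shelf collectors such as `Count` and `TopDocs` are the easiest starting point. A minimal usage sketch:

```rust
use tantivy::collector::{Count, TopDocs};
use tantivy::query::Query;
use tantivy::{DocAddress, Score, Searcher};

/// Sketch: run the same query through two stock collectors.
fn count_and_top10(
    searcher: &Searcher,
    query: &dyn Query,
) -> tantivy::Result<(usize, Vec<(Score, DocAddress)>)> {
    let count = searcher.search(query, &Count)?;
    let top_docs = searcher.search(query, &TopDocs::with_limit(10))?;
    Ok((count, top_docs))
}
```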
## [query-grammar](query-grammar): Defines the grammar of the query parser
While the [QueryParser](src/query/query_parser/query_parser.rs) struct is located in the `query/` directory, the actual parser combinator used to convert user queries into an AST is in an external crate called `query-grammar`. This part was externalized to lighten the work of the compiler.
- Major bugfix: Fix missing fieldnorms for u64, i64, f64, bool, bytes and date [#1620](https://github.com/quickwit-oss/tantivy/pull/1620) (@PSeitz)
- Updated [Date Field Type](https://github.com/quickwit-oss/tantivy/pull/1396)
The `DateTime` type has been updated to hold timestamps with microseconds precision.
`DateOptions` and `DatePrecision` have been added to configure Date fields. The precision is used as a hint for fast value compression. Otherwise, seconds precision is used everywhere else (i.e. terms, indexing). (@evanxg852000)
- Add IP address field type [#1553](https://github.com/quickwit-oss/tantivy/pull/1553) (@PSeitz)
- Add boolean field type [#1382](https://github.com/quickwit-oss/tantivy/pull/1382) (@boraarslan)
- Remove Searcher pool and make `Searcher` cloneable. (@PSeitz)
- Validate settings on create [#1570](https://github.com/quickwit-oss/tantivy/pull/1570) (@PSeitz)
- Fix interpolation overflow in linear interpolation fastfield codec [#1480](https://github.com/quickwit-oss/tantivy/pull/1480) (@PSeitz @fulmicoton)
- Detect and apply gcd on fastfield codecs [#1418](https://github.com/quickwit-oss/tantivy/pull/1418) (@PSeitz)
- Doc store
- use separate thread to compress block store [#1389](https://github.com/quickwit-oss/tantivy/pull/1389) [#1510](https://github.com/quickwit-oss/tantivy/pull/1510) (@PSeitz @fulmicoton)
- Expose doc store cache size [#1403](https://github.com/quickwit-oss/tantivy/pull/1403) (@PSeitz)
- Enable compression levels for doc store [#1378](https://github.com/quickwit-oss/tantivy/pull/1378) (@PSeitz)
- Make block size configurable [#1374](https://github.com/quickwit-oss/tantivy/pull/1374) (@kryesh)
- Make `tantivy::TantivyError` cloneable [#1402](https://github.com/quickwit-oss/tantivy/pull/1402) (@PSeitz)
- Add support for phrase slop in query language [#1393](https://github.com/quickwit-oss/tantivy/pull/1393) (@saroh)
- Aggregation
- Add support for keyed parameter in range and histogram aggregations [#1424](https://github.com/quickwit-oss/tantivy/pull/1424) (@k-yomo)
- Add support for fastfield on text fields (@PSeitz)
- Add terms aggregation (@PSeitz)
- Add support for zstd compression (@kryesh)
Tantivy 0.17
================================
- LogMergePolicy now triggers merges if the ratio of deleted documents reaches a threshold (@shikhar@fulmicoton) [#115](https://github.com/quickwit-oss/tantivy/issues/115)
- Adds a searcher Warmer API (@shikhar@fulmicoton)
- Change to non-strict schema. Ignore fields in data which are not defined in the schema. Previously this returned an error. #1211
- Facets are necessarily indexed. Existing indexes with indexed facets should work out of the box. Indexes with facets marked with index: false will be broken (but they were already broken in a sense). (@fulmicoton) #1195
- Bugfix that could, in theory, impact durability on some filesystems [#1224](https://github.com/quickwit-oss/tantivy/issues/1224)
- Schema now offers the option of not indexing fieldnorms (@lpouget) [#922](https://github.com/quickwit-oss/tantivy/issues/922)
- Reduce the number of fsync calls [#1225](https://github.com/quickwit-oss/tantivy/issues/1225)
- Fix opening bytes index with dynamic codec (@PSeitz) [#1278](https://github.com/quickwit-oss/tantivy/issues/1278)
- Added an aggregation collector for range, average and stats compatible with Elasticsearch. (@PSeitz)
- Added a JSON schema type @fulmicoton [#1251](https://github.com/quickwit-oss/tantivy/issues/1251)
- Added support for slop in phrase queries @halvorboe [#1068](https://github.com/quickwit-oss/tantivy/issues/1068)
Tantivy 0.16.2
================================
- Bugfix in FuzzyTermQuery. (transposition_cost_one was not doing anything)
Tantivy 0.16.1
========================
- Major Bugfix on multivalued fastfield. #1151
- Demux operation (@PSeitz)
Tantivy 0.16.0
=========================
- Bugfix in the filesum check. (@evanxg852000) #1127
- Bugfix in positions when the index is sorted by a field. (@appaquet) #1125
Tantivy 0.15.3
=========================
- Major bugfix. Deleting documents was broken when the index was sorted by a field. (@appaquet, @fulmicoton) #1101
Tantivy 0.15.2
========================
- Major bugfix. DocStore still panics when a deleted doc is at the beginning of a block. (@appaquet) #1088
Tantivy 0.15.1
=========================
- Major bugfix. DocStore panics when first block is deleted. (@appaquet) #1077
Tantivy 0.15.0
=========================
- API Changes. Using Range instead of (start, end) in the API and internals (`FileSlice`, `OwnedBytes`, `Snippets`, ...)
This change is breaking but migration is trivial.
- Added an Histogram collector. (@fulmicoton) #994
- Added support for Option<TCollector>. (@fulmicoton)
- DocAddress is now a struct (@scampi) #987
- Bugfix consistent tie break handling in facet's topk (@hardikpnsp) #357
- Date field support for range queries (@rihardsk) #516
- Added lz4-flex as the default compression scheme in tantivy (@PSeitz) #1009
- Renamed a lot of symbols to avoid all uppercasing on acronyms, as per new clippy recommendation. For instance, RAMDirectory -> RamDirectory. (@fulmicoton)
- Simplified positions index format (@fulmicoton) #1022
- Moved bitpacking to bitpacker subcrate and add BlockedBitpacker, which bitpacks blocks of 128 elements (@PSeitz) #1030
- Added support for more-like-this query in tantivy (@evanxg852000) #1011
- Added support for sorting an index, e.g. presorting documents in an index by a timestamp field. This can heavily improve performance for certain scenarios by utilizing the sorted data (top-n optimizations). (@PSeitz) #1026
- Add iterator over documents in doc store (@PSeitz). #1044
- Fix log merge policy (@PSeitz). #1043
- Add detection to avoid small doc store blocks on merge (@PSeitz). #1054
- Make doc store compression dynamic (@PSeitz). #1060
- Switch to json for footer version handling (@PSeitz). #1060
- Updated TermMerger implementation to rely on the union feature of the FST (@scampi) #469
- Add boolean marking whether position is required in the query_terms API call (@fulmicoton). #1070
Tantivy 0.14.0
=========================
- Removed dependency on atomicwrites #833. (Implemented by @fulmicoton upon suggestion and research from @asafigan)
- Migrated tantivy error from the now deprecated `failure` crate to `thiserror` #760. (@hirevo)
- API Change. Accessing the typed value off a `Schema::Value` now returns an Option instead of panicking if the type does not match.
- Large API Change in the Directory API. Tantivy used to assume that all files could be somehow memory mapped. After this change, Directory returns a `FileSlice` that can be reduced and eventually read into an `OwnedBytes` object. Long and blocking IO operations are still required, but they do not span over the entire file.
- Added support for Brotli compression in the DocStore. (@ppodolsky)
- Added helper for building intersections and unions in BooleanQuery (@guilload)
- Bugfix in `Query::explain`
- Removed dependency on `notify` #924. Replaced with a `FileWatcher` struct that polls the meta file every 500ms in a background thread. (@halvorboe @guilload)
- Added `FilterCollector`, which wraps another collector and filters docs using a predicate over a fast field (@barrotsteindev)
- Simplified the encoding of the skip reader struct. BlockWAND max tf is now encoded over a single byte. (@fulmicoton)
- `FilterCollector` now supports all Fast Field value types (@barrotsteindev)
- Fast fields are no longer all loaded when opening the segment reader. (@fulmicoton)
- Added an API to merge segments, see `tantivy::merge_segments` #1005. (@evanxg852000)
This version breaks compatibility and requires users to reindex everything.
Tantivy 0.13.2
===================
Bugfix. Acquiring a facet reader on a segment that does not contain any
doc with this facet returns `None`. (#896)
Tantivy 0.13.1
===================
Made `Query` and `Collector` `Send + Sync`.
Updated misc dependency versions.
Tantivy 0.13.0
======================
Tantivy 0.13 introduces a change in the index format that will require
you to reindex your index (BlockWAND information is added in the skiplist).
The index size increase is minor as this information is only added for
full blocks.
If you have a massive index for which reindexing is not an option, please contact me
so that we can discuss possible solutions.
- Bugfix in `FuzzyTermQuery` not matching terms by prefix when it should (@Peachball)
- Relaxed constraints on the custom/tweak score functions. At the segment level, they can be mut, and they are not required to be Sync + Send.
- `MMapDirectory::open` does not return a `Result` anymore.
- Change in the DocSet and Scorer API. (@fulmicoton).
A freshly created DocSet points directly to its first doc. A sentinel value called TERMINATED marks the end of a DocSet.
`.advance()` returns the new DocId. `Scorer::skip(target)` has been replaced by `Scorer::seek(target)`, which returns the resulting DocId.
As a result, iterating through a DocSet now looks as follows:
```rust
let mut doc = docset.doc();
while doc != TERMINATED {
    // ...
    doc = docset.advance();
}
```
The change made it possible to greatly simplify a lot of the docset's code.
- Misc internal optimization and introduction of the `Scorer::for_each_pruning` function. (@fulmicoton)
- Added an offset option to the Top(.*)Collectors. (@robyoung)
- Added Block WAND. Performance on TOP-K on term-unions should be greatly increased. (@fulmicoton, and special thanks
to the PISA team for answering all my questions!)
Tantivy 0.12.0
======================
- Removing static dispatch in tokenizers for simplicity. (#762)
- Added backward iteration for `TermDictionary` stream. (@halvorboe)
- Fixed a performance issue when searching for the posting lists of a missing term (@audunhalland)
- Added a configurable maximum number of docs (10M by default) for a segment to be considered for merge (@hntd187, landed by @halvorboe #713)
- Important Bugfix #777, causing tantivy to retain memory mapping. (diagnosed by @poljar)
- Added support for field boosting. (#547, @fulmicoton)
## How to update?
Crates relying on a custom tokenizer, or registering tokenizers in the manager, will require some
minor changes. See <https://github.com/quickwit-oss/tantivy/blob/main/examples/custom_tokenizer.rs>
for a code sample.
Tantivy 0.11.3
=======================
- Fixed DateTime as a fast field (#735)
Tantivy 0.11.2
=======================
- The future returned by `IndexWriter::merge` does not borrow `self` mutably anymore (#732)
- Exposing a constructor for `WatchHandle` (#731)
Tantivy 0.11.1
=====================
- Bug fix #729
Tantivy 0.11.0
=====================
- Added f64 field. Internally reuse u64 code the same way i64 does (@fdb-hiroshima)
- Various bugfixes in the query parser.
- Better handling of hyphens in query parser. (#609)
- Better handling of whitespaces.
- Closes #498 - add support for Elastic-style unbounded range queries for alphanumeric types, e.g. "title:>hello", "weight:>=70.5", "height:<200" (@petr-tik)
Copyright (c) 2018 by the project authors, as listed in the AUTHORS file.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
[Discord](https://discord.gg/MT27AG5EVE)
[Gitter](https://gitter.im/tantivy-search/tantivy?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
**Tantivy** is a **full-text search engine library** written in Rust.
It is closer to [Apache Lucene](https://lucene.apache.org/) than to [Elasticsearch](https://www.elastic.co/products/elasticsearch) or [Apache Solr](https://lucene.apache.org/solr/) in the sense that it is not
an off-the-shelf search engine server, but rather a crate that can be used
to build such a search engine.
It is strongly inspired by Lucene's design.
If you are looking for an alternative to Elasticsearch or Apache Solr, check out [Quickwit](https://github.com/quickwit-oss/quickwit), our search engine built on top of Tantivy.
# Benchmark
The following [benchmark](https://tantivy-search.github.io/bench/) breaks down
performance for different types of queries/collections.
Your mileage WILL vary depending on the nature of queries and their load.
![Search benchmark](doc/assets/images/searchbenchmark.png)
# Features
- Full-text search
- Configurable tokenizer (stemming available for 17 Latin languages, with third-party support for Chinese ([tantivy-jieba](https://crates.io/crates/tantivy-jieba) and [cang-jie](https://crates.io/crates/cang-jie)), Japanese ([lindera](https://github.com/lindera-morphology/lindera-tantivy), [Vaporetto](https://crates.io/crates/vaporetto_tantivy), and [tantivy-tokenizer-tiny-segmenter](https://crates.io/crates/tantivy-tokenizer-tiny-segmenter)) and Korean ([lindera](https://github.com/lindera-morphology/lindera-tantivy) + [lindera-ko-dic-builder](https://github.com/lindera-morphology/lindera-ko-dic-builder)))
- Fast (check out the :racehorse: :sparkles: [benchmark](https://tantivy-search.github.io/bench/) :sparkles: :racehorse:)
- Tiny startup time (<10ms), perfect for command-line tools
- BM25 scoring (the same as Lucene)
- Natural query language (e.g. `(michael AND jackson) OR "king of pop"`)
- Phrase queries
- [tantivy-cli and its tutorial](https://github.com/quickwit-oss/tantivy-cli): `tantivy-cli` is an actual command-line interface that makes it easy for you to create a search engine.
You can also find other bindings on [GitHub](https://github.com/search?q=tantivy) but they may be less maintained.
### What are some examples of Tantivy use?
- [seshat](https://github.com/matrix-org/seshat/): A matrix message database/indexer
- [tantiny](https://github.com/baygeldin/tantiny): Tiny full-text search for Ruby
- [lnx](https://github.com/lnx-search/lnx): adaptable, typo tolerant search engine with a REST API
- and [more](https://github.com/search?q=tantivy)!
### On average, how much faster is Tantivy compared to Lucene?
- According to our [search latency benchmark](https://tantivy-search.github.io/bench/), Tantivy is approximately 2x faster than Lucene.
### Does tantivy support incremental indexing?
- Yes.
### How can I edit documents?
- Data in tantivy is immutable. To edit a document, the document needs to be deleted and reindexed.
### When will my documents be searchable during indexing?
- Documents will be searchable after a `commit` is called on an `IndexWriter`. Existing `IndexReader`s will also need to be reloaded in order to reflect the changes. Finally, changes are only visible to newly acquired `Searcher`s.
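A minimal sketch of this lifecycle (error handling and exact signatures vary slightly across tantivy versions):

```rust
use tantivy::{Document, Index, IndexWriter};

/// Sketch of the add / commit / reload cycle described above.
fn index_and_search(index: &Index, writer: &mut IndexWriter, doc: Document) -> tantivy::Result<()> {
    let _opstamp = writer.add_document(doc); // buffered; not searchable yet
    writer.commit()?; // makes the batch searchable
    let reader = index.reader()?;
    reader.reload()?; // pick up the segments published by the commit
    let _searcher = reader.searcher(); // a freshly acquired searcher sees the changes
    Ok(())
}
```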
> Tantivy is a **search** engine **library** for Rust.
If you are familiar with Lucene, it is an excellent approximation to think of tantivy as Lucene for Rust: tantivy is heavily inspired by Lucene's design, and
they both have the same scope and targeted use cases.
If you are not familiar with Lucene, let's break down our little tagline.
- **Search** here means full-text search: fundamentally, tantivy is here to help you
efficiently identify the documents matching a given query in your corpus.
But modern search UIs are so much more: text processing, faceting, autocomplete, fuzzy search, and more.
Tantivy accesses its data using an abstract trait called `Directory`.
In theory, one can override the data access logic. In practice, the
trait somewhat assumes that your data can be mapped to memory: tantivy
seems deeply married to using `mmap` for its IO [^1], and the only persistent
directory shipped with tantivy is the `MmapDirectory`.
While this design has some downsides, it greatly simplifies the source code of
tantivy. Caching is also entirely delegated to the OS.
`tantivy` works entirely (or almost) by directly reading the data structures as they are laid out on disk. As a result, opening an index does not involve loading various data structures from disk into random access memory: starting a process, opening an index, and performing your first query can typically be done in a matter of milliseconds.
This is an interesting property for a command line search engine, or for a multi-tenant log search engine: spawning a new process for each new query can be a perfectly sensible solution in some use cases.
In later chapters, we will discuss tantivy's inverted index data layout.
One key takeaway is that, to achieve great performance, search indexes are extremely compact.
This is crucial to reduce IO and to ensure that as much of the index as possible can sit in RAM.
Also, whenever possible, data is accessed sequentially. This is an amazing property when tantivy needs to access data from a spinning hard disk, but it is also
critical for performance when your data is read from an SSD or is already in your page cache.
## Segments, and the log method
That kind of compact layout comes at a cost: it prevents our data structures from being dynamic.
In fact, the `Directory` trait does not even allow you to modify part of a file.
To allow the addition / deletion of documents, and create the illusion that
your index is dynamic (i.e.: adding and deleting documents), tantivy uses a common database trick sometimes referred to as the *log method*.
Let's forget about deletes for a moment.
As you add documents, they are processed and stored in a dedicated data structure in a `RAM` buffer. This data structure is not ready for search, but it is designed to receive your data and rearrange it very rapidly.
When this buffer reaches its capacity, tantivy transparently stops adding documents to it and starts converting the data structure to its final read-only format on disk. Once written, a brand new empty buffer is available to resume adding documents.
The resulting chunk of index obtained after this serialization is called a `Segment`.
> A segment is a self-contained atomic piece of index. It is identified with a UUID, and all of its files are identified using the naming scheme : `<UUID>.*`.
Which brings us to the nature of a tantivy `Index`.
> A tantivy `Index` is a collection of `Segments`.
Physically, this really just means an index is a bunch of segment files in a given `Directory`,
linked together by a `meta.json` file. This transparency can become extremely handy
to get tantivy to fit your use case:
*Example 1* You could for instance use hadoop to build a very large search index in a timely manner, copy all of the resulting segment files in the same directory and edit the `meta.json` to get a functional index.[^2]
*Example 2* You could also disable your merge policy and enforce daily segments. Removing data after one week can then be done very efficiently by just editing the `meta.json` and deleting the files associated with segment `D-7`.
## Merging
As you index more and more data, your index will accumulate more and more segments.
Having a lot of small segments is not really optimal: there is a bit of redundancy in having
all these term dictionaries, and when searching we need to do as many term lookups as we have segments. This can hurt search performance a bit.
That's where merging (or compacting) comes into play. Tantivy will continuously consider merge
opportunities and merge segments in the background.
## Indexing throughput, number of indexing threads
[^1]: This may eventually change.
[^2]: Be careful however. By default these files will not be considered as *managed* by tantivy. This means they will never be garbage collected by tantivy, regardless of whether they become obsolete or not.
Tantivy allows you to sort the index according to a property.
## Why Sorting
Presorting an index has several advantages:
### Compression
When data is sorted, it is easier to compress. E.g. the number sequence [5, 2, 3, 1, 4] would be sorted to [1, 2, 3, 4, 5].
With delta encoding, the unsorted list becomes [5, -3, 1, -2, 3], while the sorted list becomes [1, 1, 1, 1, 1].
The compression ratio mainly improves on the fast field of the sorted property; everything else is likely unaffected.
### Top-N Optimization
When data is presorted by a field and a search query requests sorting by that same field, we can leverage the natural order of the documents.
E.g. if the data is sorted by timestamp and we want the top-n newest docs containing a term, we can simply leverage the order of the doc ids.
Note: Tantivy 0.16 does not do this optimization yet.
### Pruning
Let's say we want all documents matching the filter `>= 2010-08-11`. When the data is sorted, we can do a lookup in the fast field to find the docid range and use it as the filter.
Note: Tantivy 0.16 does not do this optimization yet.
### Other?
In principle there are many algorithms possible that exploit the monotonically increasing nature. (aggregations maybe?)
## Usage
Index sorting can be configured by setting [`sort_by_field`](https://github.com/quickwit-oss/tantivy/blob/000d76b11a139a84b16b9b95060a1c93e8b9851c/src/core/index_meta.rs#L238) on `IndexSettings` and passing it to an `IndexBuilder`. As of Tantivy 0.16, only fast fields are allowed as sort fields.
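A configuration sketch, based on the 0.16-era API; the schema and the "timestamp" field name are made up for the example:

```rust
use tantivy::schema::{Schema, FAST};
use tantivy::{Index, IndexSettings, IndexSortByField, Order};

/// Sketch: build an in-RAM index sorted (descending) by a fast field.
fn sorted_index() -> tantivy::Result<Index> {
    let mut schema_builder = Schema::builder();
    schema_builder.add_u64_field("timestamp", FAST); // sort field must be fast
    let schema = schema_builder.build();
    let settings = IndexSettings {
        sort_by_field: Some(IndexSortByField {
            field: "timestamp".to_string(),
            order: Order::Desc,
        }),
        ..Default::default()
    };
    Index::builder().schema(schema).settings(settings).create_in_ram()
}
```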
Sorting an index is applied in the serialization step. In general there are two serialization steps: [Finishing a single segment](https://github.com/quickwit-oss/tantivy/blob/000d76b11a139a84b16b9b95060a1c93e8b9851c/src/indexer/segment_writer.rs#L338) and [merging multiple segments](https://github.com/quickwit-oss/tantivy/blob/000d76b11a139a84b16b9b95060a1c93e8b9851c/src/indexer/merger.rs#L1073).
In both cases we generate a docid mapping reflecting the sort. This mapping is used when serializing the different components (doc store, fast fields, posting lists, fieldnorms, facets).
As of tantivy 0.17, tantivy supports a json object type.
This type can be used to allow for a schema-less search index.
When indexing a json object, we "flatten" the JSON. This operation emits terms that represent a triplet `(json_path, value_type, value)`.
For instance, if `user` is a json field, the following document:
```json
{
"user":{
"name":"Paul Masurel",
"address":{
"city":"Tokyo",
"country":"Japan"
},
"created_at":"2018-11-12T23:20:50.52Z"
}
}
```
emits the following tokens:
- ("name", Text, "Paul")
- ("name", Text, "Masurel")
- ("address.city", Text, "Tokyo")
- ("address.country", Text, "Japan")
- ("created_at", Date, 15420648505)
## Bytes-encoding and lexicographical sort
Like any other term, these triplets are encoded into a binary format as follows.
- `json_path`: the json path is a sequence of "segments". In the example above, `address.city`
is just a debug representation of the json path `["address", "city"]`.
The path is represented by separating segments with the unicode char `\x01` and ending the path with `\x00`.
- `value_type`: one byte represents the `Value` type.
- `value`: the value representation is just the regular Value representation.
This representation is designed to align the natural sort of Terms with the lexicographical sort
of their binary representation (Tantivy's dictionary (whether fst or sstable) is sorted and does prefix encoding).
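An illustrative encoder for the layout just described (ours, not tantivy's serialization code):

```rust
/// Sketch: encode a json term as
/// `segment \x01 segment ... \x00 <type byte> <value bytes>`.
fn encode_json_term(path: &[&str], type_code: u8, value: &[u8]) -> Vec<u8> {
    let mut bytes = Vec::new();
    for (i, segment) in path.iter().enumerate() {
        if i > 0 {
            bytes.push(0x01); // segment separator
        }
        bytes.extend_from_slice(segment.as_bytes());
    }
    bytes.push(0x00); // end of the json path
    bytes.push(type_code); // one byte for the Value type
    bytes.extend_from_slice(value); // regular Value representation
    bytes
}
```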
In the example above, the terms will be sorted as
- ("address.city", Text, "Tokyo")
- ("address.country", Text, "Japan")
- ("name", Text, "Masurel")
- ("name", Text, "Paul")
- ("created_at", Date, 15420648505)
As seen in "pitfalls", we may end up having to search for a value for a same path in several different fields. Putting the field code after the path makes it maximizes compression opportunities but also increases the chances for the two terms to end up in the actual same term dictionary block.
## Pitfalls, limitations and corner cases
Json gives very little information about the type of the literals it stores.
All numeric types end up mapped as a "Number", and there are no types for dates.
At indexing time, tantivy tries to interpret numbers and strings as different types following a
priority order.
Numbers are interpreted as u64, i64 and f64, in that order.
Strings are interpreted as rfc3339 dates or simple strings.
The first type that works is picked, and the resulting term is the only one emitted for indexing.
Note this interpretation happens on a per-document basis, and there is no effort to try to sniff
a consistent field type at the scale of a segment.
On the query parser side, on the other hand, we may end up emitting more than one type.
For instance, we do not even know whether a literal should be interpreted as a number or as a string.
Likewise, we need to emit two tokens if the query contains an rfc3339 date:
the date could actually have been a single token inside the text of a document at ingestion time. Generally speaking, query parsing will always emit at least a string token, and sometimes more.
If more than one json field is defined, things get even more complicated.
## Default json field
If the schema contains a text field called "text" and a json field that is set as a default field:
`text:hello` could reasonably be interpreted as targeting the text field, or as targeting the json field called `json_dynamic` with the json_path "text".
If there is such an ambiguity, we decide to only search in the "text" field: `text:hello`.
In other words, the parser will not search in default json fields if there is a schema hit.
This is a product decision.
The user can still target the JSON field by specifying its name explicitly:
`json_dynamic.text:hello`.
## Range queries are not supported
Json fields do not support range queries.
## Arrays do not work like nested objects
If a json object contains an array, a search query might return more documents than expected.
We want the field to be indexed for full-text search, and stored so that we are able to retrieve the document after the search.

`TEXT | STORED` is some syntactic sugar to describe that.

`TEXT` means the field should be tokenized and indexed, along with its term frequency and term positions.

`STORED` means that the field will also be saved in a compressed, row-oriented key-value store. This store is useful to reconstruct the documents that were selected during the search phase.

```rust
schema_builder.add_text_field("title", TEXT | STORED);
```

```rust
let index = Index::create(index_path, schema.clone())?;
```

To insert documents we need an index writer. There must be only one writer at a time. This single `IndexWriter` is already multithreaded.

Here we use a buffer of 50MB per thread. Using a bigger heap for the indexer can increase its throughput.

We first need a handle on the title and the body field:

```rust
let title = schema.get_field("title").unwrap();
let body = schema.get_field("body").unwrap();
```

### Create a document "manually"

We can create a document manually, by setting the fields one by one in a `Document` object.