* Fix serde for TopNComputer
The top hits aggregation changed the TopNComputer to be serializable,
but the capacity needs to be carried over as well: it is checked when
pushing elements (capacity == 0 is not allowed).
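A minimal sketch of that invariant surviving a serde round-trip (field and type shapes here are assumptions, not the actual TopNComputer definition):

```rust
use serde::{Deserialize, Serialize};

/// Sketch only: the point is that `capacity` participates in the push
/// logic, so it must be (de)serialized along with the buffer.
#[derive(Serialize, Deserialize)]
struct TopNComputer<T: Ord> {
    buffer: Vec<T>,
    capacity: usize, // invariant: capacity > 0, checked on push
}

impl<T: Ord> TopNComputer<T> {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity == 0 is not allowed");
        TopNComputer { buffer: Vec::new(), capacity }
    }

    fn push(&mut self, el: T) {
        self.buffer.push(el);
        // Truncate lazily once the buffer outgrows twice the capacity.
        if self.buffer.len() >= self.capacity * 2 {
            self.buffer.sort_unstable_by(|a, b| b.cmp(a));
            self.buffer.truncate(self.capacity);
        }
    }
}
```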
* use serde from deser
* remove pub, clippy
* feat(aggregators/metric): Implement a top_hits aggregator
* fix: Expose get_fields
* fix: Serializer for top_hits request
Also removes the extraneous third-party serialization helper.
* chore: Avert panic on parsing invalid top_hits query
* refactor: Allow multiple field names from aggregations
* perf: Replace binary heap with TopNComputer
* fix: Avoid comparator inversion by ComparableDoc
* fix: Rank missing field values lower than present values
* refactor: Make KeyOrder a struct
* feat: Rough attempt at docvalue_fields
* feat: Complete stab at docvalue_fields
- Rename "SearchResult*" => "Retrieval*"
- Revert Vec => HashMap for aggregation accessors.
- Split accessors for core aggregation and field retrieval.
- Resolve globbed field names in docvalue_fields retrieval.
- Handle strings/bytes and other column types with DynamicColumn
* test(unit): Add tests for top_hits aggregator
* fix: docfield_value field globbing
* test(unit): Include dynamic fields
* fix: Value -> OwnedValue
* fix: Use OwnedValue's native Null variant
* chore: Improve readability of test asserts
* chore: Remove DocAddress from top_hits result
* docs: Update aggregator doc
* revert: accidental doc test
* chore: enable time macros only for tests
* chore: Apply suggestions from review
* chore: Apply suggestions from review
* fix: Retrieve all values for fields
* test(unit): Update for multi-value retrieval
* chore: Assert term existence
* feat: Include all columns for a column name
Since a (name, type) constitutes a unique column.
* fix: Resolve json fields
Introduces a translation step to bridge the difference between
ColumnarReader's `\0`-separated JSON field keys and the common
`.`-separated keys used by SegmentReader. Arguably, this should be the
default behavior of ColumnarReader's public API.
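A sketch of the translation step (the helper name is hypothetical):

```rust
/// Hypothetical helper: normalize a columnar JSON field key
/// ("attributes\0color") to the dot-separated form ("attributes.color")
/// that SegmentReader uses.
fn normalize_json_field_key(columnar_key: &str) -> String {
    columnar_key.replace('\0', ".")
}

fn main() {
    assert_eq!(normalize_json_field_key("attributes\0color"), "attributes.color");
}
```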
* chore: Address review on mutability
* chore: s/segment_id/segment_ordinal instances of SegmentOrdinal
* chore: Revert erroneous grammar change
Root cause: the positions buffer retained residual positions from the
previous term when terms alternated between having and not having
positions in JSON (text terms have positions, numeric terms don't).
Fixes #2283
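Schematically, the fix amounts to resetting the scratch buffer for every term instead of relying on the decoder to overwrite it (names are illustrative, not the actual code):

```rust
/// `positions` is a reusable scratch buffer shared across terms.
/// If a term without positions (e.g. a JSON numeric term) follows a
/// term with positions, stale entries must not leak into it.
fn load_positions_for_term(positions: &mut Vec<u32>, term_has_positions: bool) {
    positions.clear(); // drop residue from the previous term
    if term_has_positions {
        // decode this term's positions into `positions`...
    }
}
```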
* read path for new fst based index
* implement BlockAddrStoreWriter
* extract slop/derivation computation
* use better linear approximator and allow negative correction to approximator
* document format and reorder some fields
* optimize single block sstable size
* plug backward compat
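The approximator commits above boil down to predicting each block's byte offset as a linear function of its ordinal and storing only small (possibly negative) corrections; a toy sketch of the idea, not the actual on-disk format:

```rust
/// Toy model: offset(i) ≈ first + slope * i, plus a per-block
/// correction that may be negative.
struct LinearApprox {
    first: u64,
    slope: u64,
}

fn build(offsets: &[u64]) -> (LinearApprox, Vec<i64>) {
    let first = offsets[0];
    let n = offsets.len() as u64;
    let slope = if n > 1 { (offsets[offsets.len() - 1] - first) / (n - 1) } else { 0 };
    let corrections = offsets
        .iter()
        .enumerate()
        .map(|(i, &off)| off as i64 - (first + slope * i as u64) as i64)
        .collect();
    (LinearApprox { first, slope }, corrections)
}
```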
* add fields_metadata to SegmentReader, add columnar docs
* use schema to resolve field, add test
* normalize paths
* merge for FieldsMetadata, add fields_metadata on Index
* Update src/core/segment_reader.rs
Co-authored-by: Paul Masurel <paul@quickwit.io>
* merge code paths
* add Hash
* move function outside
---------
Co-authored-by: Paul Masurel <paul@quickwit.io>
* reduce number of allocations
Explanation makes up around 50% of all allocations (allocation counts,
not a perf measurement). It's created during serialization but never used.
- Make Explanation optional in BM25
- Avoid allocations when using Explanation
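The gist of making it optional, as a sketch (the real BM25 types differ):

```rust
/// Sketch: keep the hot scoring path allocation-free and build an
/// Explanation tree only when explicitly requested.
struct Explanation {
    value: f32,
    description: String,
    details: Vec<Explanation>,
}

struct Bm25Scorer;

impl Bm25Scorer {
    fn score(&self, tf: u32) -> f32 {
        // Hot path: no Explanation allocated here.
        tf as f32 // placeholder for the actual BM25 formula
    }

    fn explain(&self, tf: u32) -> Explanation {
        // Cold path: allocate only when the caller asks for it.
        Explanation {
            value: self.score(tf),
            description: "bm25(tf), placeholder".to_string(),
            details: Vec::new(),
        }
    }
}
```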
* use Cow
In JSON object fields, the presence of term frequencies depends on the
field: typically, a string indexed with positions will have them, while
numbers won't.
The presence or absence of term freqs for a given term is unfortunately
encoded in a very passive way: it is given by the presence of extra
information in the skip info, or by the lack of term freqs after
decoding vint blocks.
Before, after writing a segment, we would encode the segment correctly
(without any term freq for numbers in a JSON object field).
However, during merge, we would get the default term freq=1 value
(the default in the absence of encoded term freqs).
The merger would then proceed and attempt to decode 1 position when
there are in fact none.
This PR requires explicitly telling the posting serializer whether term
frequencies should be serialized for each new term.
Closes #2251
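In sketch form, the serializer change looks like this (hypothetical signature; the actual postings serializer API differs):

```rust
struct PostingsSerializer;

impl PostingsSerializer {
    /// The caller must now state explicitly, per term, whether term
    /// frequencies are recorded. `record_term_freq` is false for e.g.
    /// numeric terms inside a JSON object field, so a merge never
    /// tries to decode positions that were never written.
    fn new_term(&mut self, term: &[u8], record_term_freq: bool) {
        let _ = (term, record_term_freq);
        // write skip info / postings accordingly...
    }
}
```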
* docid deltas while indexing
Storing deltas is especially helpful for repetitive data like logs.
In those cases, recording a doc for a term used to cost 4 bytes; now it
costs 1 byte.
HDFS Indexing 1.1GB Total memory consumption:
Before: 760 MB
Now: 590 MB
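Conceptually, the indexer records the gap to the previous docid for each term instead of the absolute docid, so dense terms compress to single-byte vints; a minimal sketch:

```rust
/// Sketch: delta-encode a term's docids before vint compression.
/// Consecutive docids in repetitive data (e.g. logs) are close
/// together, so most deltas fit in one vint byte instead of four.
fn delta_encode(doc_ids: &[u32]) -> Vec<u32> {
    let mut prev = 0u32;
    doc_ids
        .iter()
        .map(|&doc| {
            let delta = doc - prev;
            prev = doc;
            delta
        })
        .collect()
}

fn main() {
    assert_eq!(delta_encode(&[5, 6, 9]), vec![5, 1, 3]);
}
```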
* use scan for delta decoding
* add support for delta-1 encoding posting list
* encode term frequency minus one
* don't emit tf for json integer terms
* make skipreader not pub(crate) mutable
* remove Document: DocumentDeserialize dependency
The dependency requires users to implement an API they may not use.
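Schematically (trait bodies elided), the supertrait bound is dropped so deserialization becomes opt-in:

```rust
// Before: implementing Document forced a DocumentDeserialize impl,
// even for users who never deserialize stored documents:
// trait Document: DocumentDeserialize { ... }

// After (sketch): independent traits; implement only what you use.
trait Document {
    // document access API...
}

trait DocumentDeserialize: Sized {
    // opt-in deserialization API...
}
```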
* remove unnecessary Document bounds
* fix windows build (#1)
* Fix windows build
* Add doc traits
* Add field value iter
* Add value and serialization
* Adjust order
* Fix bug
* Correct type
* Fix generic bugs
* Reformat code
* Add generic to index writer which I forgot about
* Fix missing generics on single segment writer
* Add missing type export
* Add default methods for convenience
* Cleanup
* Fix more-like-this query to use standard types
* Update API and fix tests
* Add doc traits
* Add field value iter
* Add value and serialization
* Adjust order
* Fix bug
* Correct type
* Rebase main and fix conflicts
* Reformat code
* Merge upstream
* Fix missing generics on single segment writer
* Add missing type export
* Add default methods for convenience
* Cleanup
* Fix more-like-this query to use standard types
* Update API and fix tests
* Add tokenizer improvements from previous commits
* Add tokenizer improvements from previous commits
* Reformat
* Fix unit tests
* Fix unit tests
* Use enum in changes
* Stage changes
* Add new deserializer logic
* Add serializer integration
* Add document deserializer
* Implement new (de)serialization api for existing types
* Fix bugs and type errors
* Add helper implementations
* Fix errors
* Reformat code
* Add unit tests and some code organisation for serialization
* Add unit tests to deserializer
* Add some small docs
* Add support for deserializing serde values
* Reformat
* Fix typo
* Fix typo
* Change repr of facet
* Remove unused trait methods
* Add child value type
* Resolve comments
* Fix build
* Fix more build errors
* Fix more build errors
* Fix the tests I missed
* Fix examples
* fix numerical order, serialize PreTok Str
* fix coverage
* rename Document to TantivyDocument, rename DocumentAccess to Document
add Binary prefix to binary de/serialization
* fix coverage
---------
Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>
* replace BinaryHeap for TopN
replace BinaryHeap for TopN with a variant that selects the median via
QuickSelect, which runs in O(n) average time.
add merge_fruits fast path
* call truncate unconditionally, extend test
* remove special early exit
* add TODO, fmt
* truncate top n instead median, return vec
* simplify code
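A condensed sketch of the heap-free truncation, using the standard library's average-O(len) selection (the actual TopNComputer is more involved):

```rust
/// Keep the `n` largest elements of `buffer` (unordered) by
/// partitioning around the n-th element instead of maintaining a heap.
fn truncate_top_n<T: Ord>(buffer: &mut Vec<T>, n: usize) {
    if buffer.len() > n {
        // After this call, indices 0..n hold the n largest elements.
        buffer.select_nth_unstable_by(n, |a, b| b.cmp(a));
        buffer.truncate(n);
    }
}
```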