Compare commits

...

216 Commits

Author SHA1 Message Date
Paul Masurel
643639f14b Introduced geopoint. 2025-12-03 17:05:27 +01:00
Paul Masurel
f85a27068d Introduced geopoint. 2025-12-03 17:05:16 +01:00
Paul Masurel
1619e05bc5 plastic surgery 2025-12-03 16:20:18 +01:00
Paul Masurel
5d03c600ba Added bugfix and unit tests
Removed use of robust.
2025-12-03 15:21:37 +01:00
Paul Masurel
32beb06382 plastic surgery 2025-12-03 13:02:10 +01:00
Paul Masurel
d8bc0e7c99 added doc 2025-12-03 12:41:17 +01:00
Paul Masurel
79622f1f0b bugfix 2025-12-01 17:13:57 +01:00
Alan Gutierrez
d26d6c34fc Fix select_nth_unstable_by_key midpoint duplicates.
Existing code behaved as if the result of `select_nth_unstable_by_key`
was either a sorted array or the product of an algorithm that gathered
partition values as in the Dutch national flag problem. The existing
code was written knowing that the former isn't true and the latter isn't
advertised. Knowing, but not remembering. Quite the oversight.
2025-12-01 16:49:22 +01:00
Alan Gutierrez
6da54fa5da Revert "Remove radix_select.rs."
This reverts commit 19eab167b6.

Restore radix select in order to implement a merge solution that will
not require a temporary file.
2025-12-01 16:49:21 +01:00
Alan Gutierrez
9f10279681 Complete Spatial/Geometry type integration.
Addressed all `todo!()` markers created when adding `Spatial` field type
and `Geometry` value type to existing code paths:

- Dynamic field handling: `Geometry` not supported in dynamic JSON
  fields, return `unimplemented!()` consistently with other complex
  types.
- Fast field writer: Panic if geometry routed incorrectly (internal
  error.)
- `OwnedValue` serialization: Implement `Geometry` to GeoJSON
  serialization and reference-to-owned conversion.
- Field type: Return `None` for `get_index_record_option()` since
  spatial fields use BKD trees, not inverted index.
- Space usage tracking: Add spatial field to `SegmentSpaceUsage` with
  proper integration through `SegmentReader`.
- Spatial query explain: Implement `explain()` method following pattern of
  other binary/constant-score queries.

Fixed `MultiPolygon` deserialization bug: count total points across all
rings, not number of rings.

Added clippy expects for legitimate too_many_arguments cases in geometric
predicates.
2025-12-01 16:49:21 +01:00
Alan Gutierrez
68009bb25b Read block kd-tree nodes using from_le_bytes.
Read node structures using `from_le_bytes` instead of casting memory.
After an inspection of columnar storage, it appears that this is the
standard practice in Rust and in the Tantivy code base. Left the
structure alignment for now in case it tends to align with cache
boundaries.
2025-12-01 16:49:20 +01:00
Alan Gutierrez
459456ca28 Remove radix_select.rs.
Ended up using `select_nth_unstable_by_key` from the Rust standard
library instead.
2025-12-01 16:49:20 +01:00
Alan Gutierrez
dbbc8c3f65 Slot block kd-tree into Tantivy.
Implemented a geometry document field with a minimal `Geometry` enum.
Now able to add that Geometry from GeoJSON parsed from a JSON document.
Geometry is triangulated if it is a polygon, otherwise it is correctly
encoded as a degenerate triangle if it is a point or a line string.
Write accumulated triangles to a block kd-tree on commit.

Serialize the original `f64` polygon for retrieval from search.

Created a query method for intersection. Query against the memory mapped
block kd-tree. Return hits and original `f64` polygon.

Implemented a merge of one or more block kd-trees from one or more
segments during merge.

Updated the block kd-tree to write to a Tantivy `WritePtr` instead of
more generic Rust I/O.
2025-12-01 16:49:16 +01:00
Alan Gutierrez
d3049cb323 Triangulation is not just a conversion.
The triangulation function in `triangle.rs` is now called
`delaunay_to_triangles` and it accepts the output of a Delaunay
triangulation from `i_triangle` and not a GeoRust multi-polygon. The
translation of user polygons to `i_triangle` polygons and subsequent
triangulation will take place outside of `triangle.rs`.
2025-12-01 16:48:34 +01:00
Alan Gutierrez
ccdf399cd7 XOR delta compression for f64 polygon coordinates.
Lossless compression for floating-point lat/lon coordinates using XOR
delta encoding on IEEE 754 bit patterns with variable-length integer
encoding. Designed for per-polygon random access in the document store,
where each polygon compresses independently without requiring sequential
decompression.
2025-12-01 16:48:33 +01:00
Alan Gutierrez
2dc46b235e Implement block kd-tree.
Implement an immutable bulk-loaded spatial index using recursive median
partitioning on bounding box dimensions. Each leaf stores up to 512
triangles with delta-compressed coordinates and doc IDs. The tree
provides three query types (intersects, within, contains) that use exact
integer arithmetic for geometric predicates and accumulate results in
bit sets for efficient deduplication across leaves.

The serialized format stores compressed leaf pages followed by the tree
structure (leaf and branch nodes), enabling zero-copy access through
memory-mapped segments without upfront decompression.
2025-12-01 16:48:32 +01:00
Alan Gutierrez
f38140f72f Add delta compression for block kd-tree leaf nodes.
Implements dimension-major bit-packing with zigzag encoding for signed i32
deltas, enabling compression of spatially-clustered triangles from 32-bit
coordinates down to 4-19 bits per delta depending on spatial extent.
2025-12-01 16:48:32 +01:00
Alan Gutierrez
0996bea7ac Add a surveyor to determine spread and prefix.
Implemented a `Surveyor` that will evaluate the bounding boxes of a set
of triangles and determine the dimension with the maximum spread and the
shared prefix for the values of dimension with the maximum spread.
2025-12-01 16:48:31 +01:00
Alan Gutierrez
1c66567efc Radix selection for block kd-tree partitioning.
Implemented byte-wise histogram selection to find median values without
comparisons, enabling efficient partitioning of spatial data during
block kd-tree construction. Processes values through multiple passes,
building histograms for each byte position after a common prefix,
avoiding the need to sort or compare elements directly.
2025-12-01 16:48:31 +01:00
Alan Gutierrez
b2a9bb279d Implement polygon tessellation.
The `triangulate` function takes a polygon with floating-point lat/lon
coordinates, converts to integer coordinates with millimeter precision
(using 2^32 scaling), performs constrained Delaunay triangulation, and
encodes the resulting triangles with boundary edge information for block
kd-tree spatial indexing.

It handles polygons with holes correctly, preserving which triangle
edges lie on the original polygon boundaries versus internal
tessellation edges.
2025-12-01 16:48:26 +01:00
Alan Gutierrez
558c99fa2d Triangle encoding for spatial indexing.
Encodes triangles with the bounding box in the first four words,
enabling efficient spatial pruning during tree traversal without
reconstructing the full triangle. The remaining words contain an
additional vertex and packed reconstruction metadata, allowing exact
triangle recovery when needed.
2025-12-01 16:47:56 +01:00
Alan Gutierrez
43b5f34721 Implement SPATIAL flag.
Implement a SPATIAL flag for use in creating a spatial field.
2025-12-01 16:47:55 +01:00
Paul Masurel
63c66005db Lazy scorers (#2726)
* Refactoring of the score tweaker into `SortKeyComputer`s to unlock two features.

- Allow lazy evaluation of score. As soon as we identified that a doc won't
reach the topK threshold, we can stop the evaluation.
- Allow for a different segment level score, segment level score and their conversion.

This PR breaks public API, but fixing code is straightforward.

* Bumping tantivy version

---------

Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
2025-12-01 15:38:57 +01:00
Paul Masurel
7d513a44c5 Added some benchmark for top K by a fast field (#2754)
Also removed query parsing from the bench code.

Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
2025-12-01 14:58:29 +01:00
Stu Hood
ca87fcd454 Implement collect_block for Collectors which wrap other Collectors (#2727)
* Implement `collect_block` for tuple Collectors, and for MultiCollector.

* Two more.
2025-12-01 12:26:29 +01:00
Ang
08a92675dc Fix typos again (#2753)
Found via `codespell -S benches,stopwords.rs -L
womens,parth,abd,childs,ond,ser,ue,mot,hel,atleast,pris,claus,allo`
2025-12-01 12:15:41 +01:00
Raphaël Cohen
f7f4b354d6 fix: Handle phrase prefixed with star (#2751)
Signed-off-by: Darkheir <raphael.cohen@sekoia.io>
2025-12-01 11:43:25 +01:00
Paul Masurel
25d44fcec8 Revert "remove unused columnar api (#2742)" (#2748)
* Revert "remove unused columnar api (#2742)"

This reverts commit 8725594d47.

* Clippy comment + removing fill_vals

---------

Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
2025-11-26 17:44:02 +01:00
PSeitz-dd
842fe9295f split Term in Term and IndexingTerm (#2744)
* split Term in Term and IndexingTerm

* add append_json_path to JsonTermSerializer
2025-11-26 16:48:59 +01:00
Paul Masurel
f88b7200b2 Optimization when posting list are saturated. (#2745)
* Optimization when posting list are saturated.

If a posting list doc freq is the segment reader's
max_doc, and if scoring does not matter, we can replace it
by a AllScorer.

In turn, in a boolean query, we can dismiss  all scorers and
empty scorers, to accelerate the request.

* Added range query optimization

* CR comment

* CR comments

* CR comment

---------

Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
2025-11-26 15:50:57 +01:00
PSeitz-dd
8725594d47 remove unused columnar api (#2742) 2025-11-21 18:07:25 +01:00
PSeitz
43a784671a clippy (#2741)
Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
2025-11-21 18:07:03 +01:00
Paul Masurel
c363bbd23d Optimize term aggregation with low cardinality + some refactoring (#2740)
This introduce an optimization of top level term aggregation on field with a low cardinality.

We then use a Vec as the underlying map.
In addition, we buffer subaggregations.

---------

Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
Co-authored-by: Paul Masurel <paul@quickwit.io>
2025-11-21 14:46:29 +01:00
Moe
70e591e230 feat: added filter aggregation (#2711)
* Initial impl

* Added `Filter` impl in `build_single_agg_segment_collector_with_reader` + Added tests

* Added `Filter(FilterBucketResult)` + Made tests work.

* Fixed type issues.

* Fixed a test.

* 8a7a73a: Pass `segment_reader`

* Added more tests.

* Improved parsing + tests

* refactoring

* Added more tests.

* refactoring: moved parsing code under QueryParser

* Use Tantivy syntax instead of ES

* Added a sanity check test.

* Simplified impl + tests

* Added back tests in a more maintable way

* nitz.

* nitz

* implemented very simple fast-path

* improved a comment

* implemented fast field support

* Used `BoundsRange`

* Improved fast field impl + tests

* Simplified execution.

* Fixed exports + nitz

* Improved the tests to check to the expected result.

* Improved test by checking the whole result JSON

* Removed brittle perf checks.

* Added efficiency verification tests.

* Added one more efficiency check test.

* Improved the efficiency tests.

* Removed unnecessary parsing code + added direct Query obj

* Fixed tests.

* Improved tests

* Fixed code structure

* Fixed lint issues

* nitz.

* nitz

* nitz.

* nitz.

* nitz.

* Added an example

* Fixed PR comments.

* Applied PR comments + nitz

* nitz.

* Improved the code.

* Fixed a perf issue.

* Added batch processing.

* Made the example more interesting

* Fixed bucket count

* Renamed Direct to CustomQuery

* Fixed lint issues.

* No need for scorer to be an `Option`

* nitz

* Used BitSet

* Added an optimization for AllQuery

* Fixed merge issues.

* Fixed lint issues.

* Added benchmark for FILTER

* Removed the Option wrapper.

* nitz.

* Applied PR comments.

* Fixed the AllQuery optimization

* Applied PR comments.

* feat: used `erased_serde` to allow filter query to be serialized

* further improved a comment

* Added back tests.

* removed an unused method

* removed an unused method

* Added documentation

* nitz.

* Added query builder.

* Fixed a comment.

* Applied PR comments.

* Fixed doctest issues.

* Added ser/de

* Removed bench in test

* Fixed a lint issue.
2025-11-18 20:54:31 +01:00
Arthur
5277367cb0 remove duplicated call to index_writer.commit() in example (#2732) 2025-11-12 14:52:44 +01:00
Paul Masurel
8b02bff9b8 Removing obsolete benchmark screenshot (#2730)
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
2025-11-05 09:55:13 +01:00
PSeitz
60225bdd45 cleanup (#2724)
Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
2025-10-23 10:23:34 +02:00
PSeitz
938bfec8b7 use FxHashMap for Aggregations Request (#2722)
Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
2025-10-21 15:59:18 +02:00
PSeitz
dabcaa5809 fix merge intermediate aggregation results (#2719)
Previously the merging relied on the order of the results, which is invalid since https://github.com/quickwit-oss/tantivy/pull/2035.
This bug is only hit in specific scenarios, when the aggregation collectors are built in a different order on different segments.

Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
2025-10-17 12:41:31 +02:00
PSeitz
d410a3b0c0 Add Filtering for Term Aggregations (#2717)
* Add Filtering for Term Aggregations

Closes #2702

* add AggregationsSegmentCtx memory consumption

---------

Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
2025-10-15 17:39:53 +02:00
Remi
fc93391d0e Minor clarifications on the AggregationsWithAccessor refacto (#2716) 2025-10-14 19:59:33 +02:00
PSeitz
f8e79271ab Replace AggregationsWithAccessor (#2715)
* add nested histogram-termagg benchmark

* Replace AggregationsWithAccessor with AggData

With AggregationsWithAccessor pre-computation and caching was done on the collector level.
If you have 10000 sub collectors (e.g. a term aggregation with sub aggregations) this is very inefficient.
`AggData` instead moves the data from the collector to a node which reflects the cardinality of the request tree instead of the cardinality of the segment collector.
It also moves the global struct shared with all aggregations in to aggregation specific structs. So each aggregation has its own space to store cached data and aggregation specific information.

This also breaks up the dependency to the elastic search aggregation structure somewhat.

Due to lifetime issues, we move the agg request specific object out of `AggData` during the collection and move it back at the end (for now). That's some unnecessary work, which costs CPU.

This allows better caching and will also pave the way for another potential optimization, by separating the collector and its storage. Currently we allocate a new collector for each sub aggregation bucket (for nested aggregations), but ideally we would have just one collector instance.

* renames

* move request data to agg request files

---------

Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
2025-10-14 09:22:11 +02:00
PSeitz
33835b6a01 Add DocSet::cost() (#2707)
* query: add DocSet cost hint and use it for intersection ordering

- Add DocSet::cost()
- Use cost() instead of size_hint() to order scorers in intersect_scorers

This isolates cost-related changes without the new seek APIs from
PR #2538

* add comments

---------

Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
2025-10-13 16:25:49 +02:00
PSeitz
270ca5123c refactor postings (#2709)
rename shallow_seek to seek_block
remove full_block from public postings API

This is as preparation to optionally handle Bitsets in the postings
2025-10-08 16:55:25 +02:00
Mustafa S. Moiz
714366d3b9 docs: correct grammar (#2704)
Correct phrasing for a single line in the docs (`one documents` -> `a document`).
2025-10-08 16:47:09 +02:00
PSeitz-dd
40659d4d07 improve naming in buffered_union (#2705) 2025-09-24 10:58:46 +02:00
PSeitz
e1e131a804 add and/or queries benchmark (#2701) 2025-09-22 16:32:49 +02:00
PSeitz-dd
70da310b2d perf: deduplicate queries (#2698)
* deduplicate queries

Deduplicate queries in the UserInputAst after parsing queries

* add return type
2025-09-22 12:16:58 +02:00
PSeitz
85010b589a clippy (#2700)
* clippy

* clippy

* clippy

* clippy + fmt

---------

Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
2025-09-19 18:04:25 +02:00
PSeitz-dd
2340dca628 fix compiler warnings (#2699)
* fix compiler warnings

* fix import
2025-09-19 15:55:04 +02:00
Remi
71a26d5b24 Fix CI with rust 1.90 (#2696)
* Empty commit

* Fix dead code lint error
2025-09-18 23:06:33 +02:00
PSeitz-dd
203751f2fe Optimize ExistsQuery for a high number of dynamic columns (#2694)
* Optimize ExistsQuery for a high number of dynamic columns

The previous algorithm checked _each_ doc in _each_ column for
existence. This causes huge cost on JSON fields with e.g. 100k columns.
Compute a bitset instead if we have more than one column.

add `iter_docs` to the multivalued_index

* add benchmark

subfields=1
exists_json_union    Memory: 89.3 KB (+2.01%)    Avg: 0.4865ms (-26.03%)    Median: 0.4865ms (-26.03%)    [0.4865ms .. 0.4865ms]
subfields=2
exists_json_union    Memory: 68.1 KB     Avg: 1.7048ms (-0.46%)    Median: 1.7048ms (-0.46%)    [1.7048ms .. 1.7048ms]
subfields=3
exists_json_union    Memory: 61.8 KB     Avg: 2.0742ms (-2.22%)    Median: 2.0742ms (-2.22%)    [2.0742ms .. 2.0742ms]
subfields=4
exists_json_union    Memory: 119.8 KB (+103.44%)    Avg: 3.9500ms (+42.62%)    Median: 3.9500ms (+42.62%)    [3.9500ms .. 3.9500ms]
subfields=5
exists_json_union    Memory: 120.4 KB (+107.65%)    Avg: 3.9610ms (+20.65%)    Median: 3.9610ms (+20.65%)    [3.9610ms .. 3.9610ms]
subfields=6
exists_json_union    Memory: 120.6 KB (+107.49%)    Avg: 3.8903ms (+3.11%)    Median: 3.8903ms (+3.11%)    [3.8903ms .. 3.8903ms]
subfields=7
exists_json_union    Memory: 120.9 KB (+106.93%)    Avg: 3.6220ms (-16.22%)    Median: 3.6220ms (-16.22%)    [3.6220ms .. 3.6220ms]
subfields=8
exists_json_union    Memory: 121.3 KB (+106.23%)    Avg: 4.0981ms (-15.97%)    Median: 4.0981ms (-15.97%)    [4.0981ms .. 4.0981ms]
subfields=16
exists_json_union    Memory: 123.1 KB (+103.09%)    Avg: 4.3483ms (-92.26%)    Median: 4.3483ms (-92.26%)    [4.3483ms .. 4.3483ms]
subfields=256
exists_json_union    Memory: 204.6 KB (+19.85%)    Avg: 3.8874ms (-99.01%)    Median: 3.8874ms (-99.01%)    [3.8874ms .. 3.8874ms]
subfields=4096
exists_json_union    Memory: 2.0 MB     Avg: 3.5571ms (-99.90%)    Median: 3.5571ms (-99.90%)    [3.5571ms .. 3.5571ms]
subfields=65536
exists_json_union    Memory: 28.3 MB     Avg: 14.4417ms (-99.97%)    Median: 14.4417ms (-99.97%)    [14.4417ms .. 14.4417ms]
subfields=262144
exists_json_union    Memory: 113.3 MB     Avg: 66.2860ms (-99.95%)    Median: 66.2860ms (-99.95%)    [66.2860ms .. 66.2860ms]

* rename methods
2025-09-16 18:21:03 +02:00
PSeitz-dd
7963b0b4aa Add fast field fallback for term query if not indexed (#2693)
* Add fast field fallback for term query if not indexed

* only fallback without scores
2025-09-12 14:58:21 +02:00
Paul Masurel
d5eefca11d Merge pull request #2692 from quickwit-oss/paul.masurel/coerce-floats-too-in-search-too
This PR changes the logic used on the ingestion of floats.
2025-09-10 09:46:54 +02:00
Paul Masurel
5d6c8de23e Align search float search logic to the columnar coercion rules
It applies the same logic on floats as for u64 or i64.
In all case, the idea is (for the inverted index) to coerce number
to their canonical representation, before indexing and before searching.

That way a document with the float 1.0 will be searchable when the user
searches for 1.

Note that contrary to the columnar, we do not attempt to coerce all of the
terms associated to a given json path to a single numerical type.
We simply rely on this "point-wise" canonicalization.
2025-09-09 19:28:17 +02:00
PSeitz
a06365f39f Update CHANGELOG.md for bugfixes (#2674)
* Update CHANGELOG.md

* Update CHANGELOG.md
2025-09-04 11:51:00 +02:00
Raphaël Cohen
f4b374110f feat: Regex query grammar (#2677)
* feat: Regex query grammar

* feat: Disable regexes by default

* chore: Apply formatting
2025-09-03 10:07:04 +02:00
PSeitz-dd
c37af9c1ff update release instructions (#2687) 2025-08-22 07:57:48 +08:00
PSeitz
33794a114c chore: Release (#2686)
Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
2025-08-20 18:29:37 +08:00
PSeitz-dd
8676a1f57b prepare release: update Changelog (#2685) 2025-08-20 16:07:53 +08:00
PSeitz-dd
021ff2ad63 move bench to binggan (#2684) 2025-08-14 17:02:44 +08:00
Paul Masurel
39e027667b per field size details (#2679)
* Added per-field size details.

This also does a bunch of refactoring.

merging field metadata does not silently asserts that arguments should be sorted.
merging does not set `stored`.

We do not rely on a hashmap to group fields, but instead rely on the fact that
the term dictionary is sorted.

The inverted level method that exposes field metadata is not exposed
as public anymore.

* CR comment

---------

Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
2025-08-13 13:12:22 +02:00
PSeitz-dd
a1d65c3df3 test stable ordering with pagination (#2683) 2025-08-13 15:36:28 +08:00
trinity-1686a
2e4615c2d3 Merge pull request #2678 from Darkheir/feat/query_grammar_space_between_field_and_value
feat: Support spaces between field name and value
2025-08-11 09:57:23 +02:00
Darkheir
610091e2c4 feat: Applies PR review suggestion 2025-08-04 10:12:51 +02:00
trinity-1686a
c301e7b1c4 Merge pull request #2673 from paradedb/stuhood.fix-order-by-dup-string
Fix `TopDocs::order_by_string_fast_field` for duplicates
2025-07-30 18:25:03 +02:00
Stu Hood
d9eb093368 Attempt to clarify sorted_ords_to_term_cb. 2025-07-29 21:56:31 -07:00
Darkheir
d4b090124c feat: Support spaces between field name and value 2025-07-23 11:12:13 +02:00
PSeitz-dd
811c68cdb2 fix field_names in top_hits aggregation (#2675) 2025-07-21 12:19:30 +08:00
trinity-1686a
bc1c789897 Merge pull request #2676 from quickwit-oss/trinity.pointard/allow-partial-default-field-success
ignore failure to parse query when other default field suceeded
2025-07-18 14:20:41 +02:00
trinity Pointard
e7c8c331bd ignore failure to parse query when other default field suceeded 2025-07-17 14:47:28 +02:00
Eric Ridge
2f01152a3c adjust Dictionary::sorted_ords_to_term_cb() to allow duplicates 2025-07-16 13:38:43 -07:00
PSeitz
4e84c70387 Fix TopNComputer for reverse order (#2672)
Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
2025-07-16 21:44:04 +08:00
Paul M.
f2c77f06c5 Update fs4 to latest (0.13.1) (#2654)
- One change was needed to handle the `Result<bool>` that now returns from `try_lock_exclusive`

Co-authored-by: Paul M. <prov223@tutanota.com>
2025-07-14 11:26:19 +08:00
MassimilianoBaglioni
74334f9c9a Fixed typo in documentation (#2629)
Co-authored-by: Massimiliano Baglioni <massimilianobaglioni@MacBook-Air-di-Massimiliano.local>
2025-07-11 14:45:59 +08:00
Parth
cc4beb61ba update CHANGELOG (#2670)
* update CHANGELOG

* Update CHANGELOG.md

Co-authored-by: PSeitz <PSeitz@users.noreply.github.com>

* Update CHANGELOG.md

---------

Co-authored-by: PSeitz <PSeitz@users.noreply.github.com>
2025-07-11 11:33:11 +08:00
Dale Seo
6742e5981b fix a typo in the comment (#2668) 2025-07-10 07:14:57 +02:00
Philippe Noël
b128299976 Update ParadeDB logo (#2669) 2025-07-10 07:14:35 +02:00
PSeitz
945af922d1 clippy (#2661)
* clippy

* use readable version

---------

Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
2025-07-02 11:25:03 +02:00
PSeitz-dd
295d07e55c fix union performance regression (#2663)
closes https://github.com/quickwit-oss/tantivy/issues/2656
2025-07-01 20:32:25 +02:00
PSeitz
080fa4d1f4 add docs/example and Vec<u32> values to sstable (#2660) 2025-07-01 15:40:02 +02:00
PSeitz-dd
988c2b35e7 fix import in test (#2657) 2025-06-24 12:55:34 +02:00
PSeitz
bf3cc12610 update CHANGELOG (#2621)
Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
2025-06-24 11:58:44 +02:00
Stu Hood
a2400f4e73 Add string fast field support to TopDocs. (#2642)
* Add string fast field support to `TopDocs`.

* Remove unnecessary generics, and review feedback.

* Use actual/less-ambiguous cities.

* Review feedback
2025-06-20 10:27:14 +02:00
Zhang.Jinrui
436ec6caea fix typo for the comments of search_with_executor() (#2653)
Co-authored-by: Zhang Jinrui <zhangjinrui@microsoft.com>
2025-06-19 09:53:21 +02:00
PSeitz
4a6123d3ff release tantivy: bump versions (#2625)
* chore: Release

* chore: Release

---------

Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
2025-06-10 15:34:39 +02:00
Parth
5a2fe42c24 make zstd optional in sstable (#2633)
* make zstd truly optional

* changelog notes

* make sure we write

* resolve comments

* make this a default feature

* remove changelog notes
2025-05-14 17:16:41 +02:00
PSeitz
5379c99ea2 update edition to 2024 (#2620)
* update common to edition 2024

* update bitpacker to edition 2024

* update stacker to edition 2024

* update query-grammar to edition 2024

* update sstable to edition 2024 + fmt

* fmt

* update columnar to edition 2024

* cargo fmt

* use None instead of _
2025-04-18 04:56:31 +02:00
Paul Masurel
3fa90e70e2 Merge pull request #2618 from quickwit-oss/release_tantivy
fix tantivy-query-grammar version
2025-04-09 09:54:09 +02:00
Pascal Seitz
6ab4102253 fix tantivy-query-grammar version 2025-04-09 14:35:23 +08:00
PSeitz
11c6329ca5 temp unbump version (#2501)
temp unbump to 0.22 for easier release with `cargo release`
2025-04-09 08:09:41 +02:00
PSeitz
ab8bb93928 update changelog (#2617) 2025-04-09 03:31:30 +02:00
PSeitz
2b668bd2bf readability improvement on executor (#2615) 2025-04-08 18:28:49 +02:00
Paul Masurel
97a7137ef8 Merge pull request #2606 from katlim-br/add_serde_serialize
Add serde json serialize to UserInputAst
2025-04-03 15:57:03 +02:00
Kat Lim Ruiz
ffa7cdf397 agreed with Remi, about the final json structure, having "type" tag and using "clauses" is more accurate 2025-04-03 08:35:16 -05:00
Kat Lim Ruiz
caf1275e60 Merge pull request #1 from quickwit-oss/tagged-user-input-ast
Tag UserInputAst
2025-04-03 08:30:07 -05:00
Remi Dettai
fb12b7be28 Tag UserInputAst 2025-04-03 10:07:34 +02:00
Kat Lim Ruiz
6f77083493 create more complex unit test 2025-04-02 18:06:20 -05:00
Kat Lim Ruiz
cd7745da7a set Leaf untagged, leave clause and boost the same (with own property) 2025-04-02 17:52:18 -05:00
Kat Lim Ruiz
eb8304dee9 remove untitled file 2025-04-02 08:47:58 -05:00
Kat Lim Ruiz
e5638112a9 all json should be snake_case 2025-04-02 08:45:33 -05:00
Kat Lim Ruiz
81110152fb add unit test for unbounded 2025-04-01 18:08:04 -05:00
Kat Lim Ruiz
ae88a7ece5 add tag type and content value to UserInputBound 2025-04-01 18:06:40 -05:00
Kat Lim Ruiz
bdd5f80fd9 add clause unit test 2025-04-01 18:04:19 -05:00
Kat Lim Ruiz
3f62ef22e5 set tag=type only for Leaf 2025-04-01 17:52:36 -05:00
Kat Lim Ruiz
8102e19e48 set Error as serializable because is part of the possible outcomes (however, I think using this empty Error struct is not a good pattern) 2025-04-01 17:43:24 -05:00
Kat Lim Ruiz
175c853ea7 add serialization test for LenientError 2025-04-01 17:38:23 -05:00
Kat Lim Ruiz
c992cf3f37 Revert "set all enum to be snake_case when serializing"
This reverts commit 83f6c2f265.
2025-04-01 17:27:28 -05:00
Kat Lim Ruiz
83f6c2f265 set all enum to be snake_case when serializing 2025-04-01 17:13:04 -05:00
Kat Lim Ruiz
17bf8aa092 Merge branch 'quickwit-oss:main' into add_serde_serialize 2025-04-01 08:32:08 -05:00
trinity-1686a
6fc0e96ff8 Merge pull request #2610 from quickwit-oss/fix-compilation-stability
Fix compilation stability
2025-04-01 10:45:58 +02:00
Remi Dettai
06d2dcf469 Further fix type inference tests 2025-04-01 09:52:22 +02:00
Remi Dettai
b681ec9335 Fix compilation stability 2025-04-01 09:33:33 +02:00
Kat Lim Ruiz
da2ff5712a fix fmt nightly 2025-03-31 08:21:54 -05:00
Kat Lim Ruiz
18da402e27 cargo fmt 2025-03-30 22:10:38 -05:00
Kat Lim Ruiz
18ae3ffe94 uniformize root cargo.toml 2025-03-30 21:55:51 -05:00
Kat Lim Ruiz
0a37b7acaa update to latest serde and serde_json (and follow the pattern to use patch versions) 2025-03-30 11:35:58 -05:00
Kat Lim Ruiz
1a9fd885dd allow LenientError to be serializable too 2025-03-30 11:26:20 -05:00
Kat Lim Ruiz
3e660905a7 unit test parse_query_lenient 2025-03-30 11:22:22 -05:00
Kat Lim Ruiz
0c2b984cb4 add tests 2025-03-30 11:12:15 -05:00
Kat Lim Ruiz
a69b1c609c add error to be debuggable 2025-03-30 11:12:12 -05:00
Kat Lim Ruiz
8d4a6fcaba deserialize is not needed 2025-03-30 11:11:55 -05:00
Kat Lim Ruiz
feced4762f update root cargo.toml 2025-03-30 11:01:22 -05:00
Kat Lim Ruiz
0149317c5a set 0.23 2025-03-30 10:55:48 -05:00
Kat Lim Ruiz
3fcb6f9597 add unit tests 2025-03-30 10:41:43 -05:00
Kat Lim Ruiz
388fcd763b add serde, and allow UserInputAst to be json serialized/deserialized 2025-03-30 10:36:43 -05:00
trinity-1686a
e488f9e6a2 Merge pull request #2598 from quickwit-oss/1686a/agg-key-eq
fix invalid impl of Eq on Key
2025-03-14 15:24:31 +01:00
trinity Pointard
9426d5be7b fix agg Key PartialEq impl 2025-03-14 14:57:45 +01:00
PSeitz
d5d2d41264 merge column: small refactors (#2579)
* merge column: small refactors

* make ord dependency more explicit

* add columnar merge crashtest proptest

* fix naming
2025-03-07 18:52:34 +08:00
Paul Masurel
80f5f1ecd4 Merge pull request #2586 from quickwit-oss/issue/2577-get_batch_multiply_overflow
follow up on the fix of multiply with overflow
2025-03-05 11:17:12 +01:00
Paul Masurel
519e5d2ed1 clippy warnings 2025-03-05 11:15:06 +01:00
Paul Masurel
df2d52a84e follow up on the fix of multiply with overflow 2025-03-05 11:15:05 +01:00
Paul Masurel
371dba9414 Merge pull request #2591 from quickwit-oss/cargo-fmt
Cargo fmt
2025-03-05 11:08:06 +01:00
Paul Masurel
0afabad494 Cargo fmt 2025-03-05 11:07:46 +01:00
Remi Dettai
89b052cd42 Catch panics during merges (#2582)
* Adding panic handler for the rayon merge thread pool

* Return panic message in error

---------

Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
2025-03-05 10:36:48 +01:00
SteveLauC
c48c649436 refactor: use std AtomicU64 and remove wrapper (#2585) 2025-02-24 03:56:15 +01:00
Paul Masurel
58c0739953 Merge pull request #2581 from quickwit-oss/merge_dict_column_repro
use usize in bitpacker
2025-02-21 10:53:07 +09:00
Pascal Seitz
e7daf69de9 use usize in bitpacker
use usize in bitpacker to enable larger columns in the columnar store

Godbolt comparison with u32 vs u64 for get access: https://godbolt.org/z/cjf7nenYP

Add a mini-tool to inspect columnar files created by tantivy. (very basic functionality which can be extended later)
2025-02-20 15:39:10 +01:00
trinity-1686a
f060e86bc6 Merge pull request #2578 from quickwit-oss/1686a/buildable-histo-agg
make DateHistogramAggregationReq buildable
2025-02-18 15:30:54 +01:00
trinity Pointard
0368162ef0 make DateHistogramAggregationReq buildable 2025-02-18 11:45:24 +01:00
trinity-1686a
e843c71015 Merge pull request #2568 from quickwit-oss/trinity/wildcard-query-parser
allow term starting with wildcard in query parser
2025-02-12 16:47:25 +01:00
trinity Pointard
5cea16ef9f improve handling of spcial char after exist query 2025-01-22 16:04:31 +01:00
dependabot[bot]
4aa8cd2470 Update downcast-rs requirement from 1.2.1 to 2.0.1 (#2566)
Updates the requirements on [downcast-rs](https://github.com/marcianx/downcast-rs) to permit the latest version.
- [Changelog](https://github.com/marcianx/downcast-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/marcianx/downcast-rs/compare/v1.2.1...v2.0.1)

---
updated-dependencies:
- dependency-name: downcast-rs
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-01-22 10:32:24 +01:00
trinity Pointard
4d4ee1b0ac allow term starting with wildcard in query parser 2025-01-15 10:27:48 +01:00
dependabot[bot]
43c89b4360 Update itertools requirement from 0.13.0 to 0.14.0 (#2563)
Updates the requirements on [itertools](https://github.com/rust-itertools/itertools) to permit the latest version.
- [Changelog](https://github.com/rust-itertools/itertools/blob/master/CHANGELOG.md)
- [Commits](https://github.com/rust-itertools/itertools/compare/v0.13.0...v0.14.0)

---
updated-dependencies:
- dependency-name: itertools
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-01-08 17:11:46 +01:00
trinity-1686a
d281ca3e65 Merge pull request #2559 from quickwit-oss/trinity/sstable-partial-automaton
allow warming partially an sstable for an automaton
2025-01-08 16:35:35 +01:00
trinity Pointard
be17daf658 split iterator 2025-01-08 16:24:34 +01:00
trinity Pointard
6ca84a61fa make termdict always clone 2025-01-08 16:19:54 +01:00
trinity Pointard
037d12c9c9 fix deadlocking on automaton warmup 2025-01-06 11:58:58 +01:00
Remi Dettai
71cf19870b Exist queries match subpath fields (#2558)
* Exist queries match subpath fields

* Make subpath check optional

* Add async subpath listing
2025-01-06 10:17:39 +01:00
trinity Pointard
175a529c41 use executor for cpu-heavy sstable decompression for automaton 2025-01-03 19:14:07 +01:00
trinity Pointard
fe0c7c5408 change rangebound style 2025-01-02 11:56:05 +01:00
Harrison Burt
148594f0f9 Improve IndexWriter customisation via builder (#2562)
* Improve `IndexWriter` customisation via builder

* Remove change noise from PR

* Correct documentation

* Resolve comments and add test
2025-01-02 09:43:22 +01:00
dependabot[bot]
8edb439440 Update rustc-hash requirement from 1.1.0 to 2.1.0 (#2551)
---
updated-dependencies:
- dependency-name: rustc-hash
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-12-26 10:25:05 +01:00
trinity Pointard
dfff5f3bcb rename merge_holes_under => merge_holes_under_bytes 2024-12-23 16:17:44 +01:00
trinity-1686a
ebf4d84553 add comment about cpu-intensive operation in async context 2024-12-20 12:23:49 +01:00
trinity-1686a
42efc7f7c8 clippy 2024-12-20 11:00:11 +01:00
trinity-1686a
192395c311 attempt at simplifying can_block_match_automaton 2024-12-20 10:25:38 +01:00
trinity-1686a
a1447cc9c2 remove breaking change in sstable public api 2024-12-19 17:30:05 +01:00
trinity-1686a
c39d91f827 Merge pull request #2547 from quickwit-oss/trinity/count-str
add support for counting non integer in aggregation
2024-12-17 15:27:30 +01:00
trinity Pointard
32b6e9711b add tests 2024-12-13 16:06:24 +01:00
trinity-1686a
24c5dc2398 allow warming up automaton 2024-12-10 13:32:12 +01:00
trinity-1686a
9e2ddec4b3 merge adjacent block when building delta for automaton 2024-12-10 13:32:12 +01:00
trinity-1686a
1f6a8e74bb support iterating over partially loaded sstable 2024-12-10 13:32:12 +01:00
trinity-1686a
7e901f523b get iter for blocks of sstable matching automaton 2024-12-10 13:32:12 +01:00
trinity-1686a
3c30a41c14 add helper to figure if block can match automaton 2024-12-10 13:32:12 +01:00
dependabot[bot]
0f99d4f420 Update measure_time requirement from 0.8.2 to 0.9.0 (#2557)
---
updated-dependencies:
- dependency-name: measure_time
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-12-09 21:39:01 +01:00
Pierre Barre
6e02c5cb25 Make NUM_MERGE_THREADS configurable (#2535)
* Make `NUM_MERGE_THREADS` configurable

* Remove unused import

* Reword comment src/index/index.rs

Co-authored-by: PSeitz <PSeitz@users.noreply.github.com>

---------

Co-authored-by: PSeitz <PSeitz@users.noreply.github.com>
2024-12-09 16:53:11 +08:00
PSeitz
876a579e5d queryparser: add field respecification test (#2550) 2024-12-02 14:17:12 +01:00
PSeitz
4c52499622 clippy (#2549) 2024-11-29 16:08:21 +08:00
trinity-1686a
0bac391291 add support for counting non integer in aggregation 2024-11-28 19:52:47 +01:00
PSeitz
52d4e81e70 update CHANGELOG (#2546) 2024-11-27 20:49:35 +08:00
dependabot[bot]
c71ea7b2ef Update thiserror requirement from 1.0.30 to 2.0.1 (#2542)
Updates the requirements on [thiserror](https://github.com/dtolnay/thiserror) to permit the latest version.
- [Release notes](https://github.com/dtolnay/thiserror/releases)
- [Commits](https://github.com/dtolnay/thiserror/compare/1.0.30...2.0.1)

---
updated-dependencies:
- dependency-name: thiserror
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-09 08:08:34 +08:00
Paul Masurel
c35a782747 Updating rustc-hash and clippy fixes (#2532)
* Updating rustc-hash and clippy fixes

* fix terms_aggregation_min_doc_count_special_case

---------

Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>
2024-11-01 13:46:26 +08:00
dependabot[bot]
c66af2c0a9 Update binggan requirement from 0.12.0 to 0.14.0 (#2530)
* Update binggan requirement from 0.12.0 to 0.14.0

---
updated-dependencies:
- dependency-name: binggan
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

* fix build

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>
2024-10-24 09:41:35 +08:00
Joan Antoni RE
f9ac055847 Fix some links in architecture docs (#2528) 2024-10-23 21:06:54 +09:00
PSeitz
21d057059e clippy (#2527)
* clippy

* clippy

* clippy

* clippy

* convert allow to expect and remove unused

* cargo fmt

* cleanup

* export sample

* clippy
2024-10-22 09:26:54 +08:00
PSeitz
dca508b4ca remove read_postings_no_deletes (#2526)
closes #2525
2024-10-22 09:52:43 +09:00
PSeitz
aebae9965d add RegexPhraseQuery (#2516)
* add RegexPhraseQuery

RegexPhraseQuery supports phrase queries with regex. It supports regex
and wildcards. E.g. a query with wildcards:
"b* b* wolf" matches "big bad wolf"
Slop is supported as well:
"b* wolf"~2 matches "big bad wolf"

Regex queries may match a lot of terms where we still need to
keep track which term hit to load the positions.
The phrase query algorithm groups terms by their frequency
together in the union to prefilter groups early.

This PR comes with some new datastructures:

SimpleUnion - A union docset for a list of docsets. It doesn't do any
caching and is therefore well suited for datasets with lots of skipping.
(phrase search, but intersections in general)

LoadedPostings - Like SegmentPostings, but all docs and positions are loaded in
memory. SegmentPostings uses 1840 bytes per instance with its caches,
which is equivalent to 460 docids.
LoadedPostings is used for terms which have less than 100 docs.
LoadedPostings is only used to reduce memory consumption.

BitSetPostingUnion - Creates a `Posting` that uses the bitset for docid
hits and the docsets for positions. The BitSet is the precalculated
union of the docsets
In the RegexPhraseQuery there is a size limit of 512 docsets per PreAggregatedUnion,
before creating a new one.

Renamed Union to BufferedUnionScorer
Added proptests to test different union types.

* cleanup

* use Box instead of Vec

* use RefCell instead of term_freq(&mut)

* remove wildcard mode

* move RefCell to outer

* clippy
2024-10-21 18:29:17 +08:00
Marvin
e7e3e3f44c make casing in docs more consistent (#2524)
* make casing in docs more consistent

* more

* lowercase tantivy
2024-10-21 17:59:41 +09:00
PSeitz
2f2db16ec1 store DateTime as nanoseconds in doc store (#2486)
* store DateTime as nanoseconds in doc store

The doc store DateTime was truncated to microseconds previously. This
removes this truncation, while still keeping backwards compatibility.

This is done by adding the trait `ConfigurableBinarySerializable`, which
works like `BinarySerializable`, but with a config that allows de/serialize
as different date time precision currently.

bump version format to 7.
add compat test to check the date time truncation.

* remove configurable binary serialize, add enum for doc store version

* test doc store version ord
2024-10-18 10:50:20 +08:00
Paul Masurel
d152e29687 Fixed citation (#2523) 2024-10-17 10:19:50 +09:00
Paul Masurel
285bcc25c9 Added citation.cff (#2522) 2024-10-17 09:43:35 +09:00
PSeitz
7b65ad922d use binggan for stacker bench (#2492)
* use binggan for stacker bench

```
alice (num terms: 174693)
hashmap                    Memory: 1.3 MB     Avg: 367.19 MiB/s (-1.34%)    Median: 368.10 MiB/s (-1.34%)    [378.75 MiB/s .. 352.81 MiB/s]
hasmap with postings       Memory: 2.4 MB     Avg: 237.29 MiB/s (-2.19%)    Median: 240.22 MiB/s (-1.61%)    [248.26 MiB/s .. 210.66 MiB/s]
fxhashmap ref postings     Memory: 2.9 MB     Avg: 171.94 MiB/s (-3.22%)    Median: 174.13 MiB/s (-2.69%)    [185.94 MiB/s .. 152.43 MiB/s]
fxhasmap owned postings    Memory: 3.5 MB     Avg: 96.993 MiB/s (-4.20%)    Median: 97.410 MiB/s (-4.48%)    [102.78 MiB/s .. 82.745 MiB/s]
numbers unique 100k
hashmap                 Memory: 5.2 MB     Avg: 334.17 MiB/s (-3.06%)    Median: 352.61 MiB/s (+0.77%)    [362.60 MiB/s .. 213.03 MiB/s]
hasmap with postings    Memory: 6.3 MB     Avg: 316.96 MiB/s (-0.02%)    Median: 325.16 MiB/s (-0.04%)    [338.36 MiB/s .. 218.60 MiB/s]
zipfs numbers 100k
hashmap                 Memory: 1.3 MB     Avg: 1.2342 GiB/s (+2.87%)    Median: 1.2677 GiB/s (+4.66%)    [1.3130 GiB/s .. 915.93 MiB/s]
hasmap with postings    Memory: 2.4 MB     Avg: 485.16 MiB/s (+2.68%)    Median: 494.70 MiB/s (+4.42%)    [505.31 MiB/s .. 413.14 MiB/s]
numbers unique 1mio
hashmap                 Memory: 35.7 MB     Avg: 169.68 MiB/s (-1.08%)    Median: 166.80 MiB/s (-3.87%)    [201.33 MiB/s .. 154.26 MiB/s]
hasmap with postings    Memory: 39.8 MB     Avg: 149.49 MiB/s (-3.07%)    Median: 150.85 MiB/s (-1.45%)    [160.76 MiB/s .. 130.94 MiB/s]
zipfs numbers 1mio
hashmap                 Memory: 1.3 MB     Avg: 1.2185 GiB/s (-2.33%)     Median: 1.2291 GiB/s (-2.33%)     [1.2905 GiB/s .. 1.0742 GiB/s]
hasmap with postings    Memory: 5.5 MB     Avg: 358.43 MiB/s (-11.63%)    Median: 356.95 MiB/s (-12.85%)    [444.94 MiB/s .. 302.46 MiB/s]
numbers unique 2mio
hashmap                 Memory: 70.3 MB     Avg: 163.65 MiB/s (+8.37%)    Median: 162.83 MiB/s (+8.80%)    [190.20 MiB/s .. 144.70 MiB/s]
hasmap with postings    Memory: 78.6 MB     Avg: 148.00 MiB/s (+7.75%)    Median: 151.53 MiB/s (+9.11%)    [166.92 MiB/s .. 120.09 MiB/s]
zipfs numbers 2mio
hashmap                 Memory: 1.3 MB     Avg: 1.2535 GiB/s (+2.59%)    Median: 1.2654 GiB/s (+0.36%)    [1.2938 GiB/s .. 1.0592 GiB/s]
hasmap with postings    Memory: 9.7 MB     Avg: 377.96 MiB/s (-4.94%)    Median: 381.82 MiB/s (-3.67%)    [426.14 MiB/s .. 335.66 MiB/s]
numbers unique 5mio
hashmap                 Memory: 277.9 MB     Avg: 121.30 MiB/s (+2.00%)    Median: 121.99 MiB/s (+2.99%)    [132.51 MiB/s .. 110.32 MiB/s]
hasmap with postings    Memory: 295.7 MB     Avg: 114.23 MiB/s (+2.13%)    Median: 115.26 MiB/s (+2.94%)    [124.08 MiB/s .. 103.38 MiB/s]
zipfs numbers 5mio
hashmap                 Memory: 1.3 MB      Avg: 1.2326 GiB/s (+0.63%)    Median: 1.2400 GiB/s (+0.71%)    [1.2755 GiB/s .. 1.0923 GiB/s]
hasmap with postings    Memory: 25.4 MB     Avg: 360.49 MiB/s (+1.07%)    Median: 363.44 MiB/s (+1.27%)    [404.88 MiB/s .. 300.38 MiB/s]
```

* rename bench

* update binggan

* rename to HASHMAP_CAPACITY
2024-10-16 11:41:33 +08:00
dependabot[bot]
99be20cedd Update binggan requirement from 0.10.0 to 0.12.0 (#2519)
* Update binggan requirement from 0.10.0 to 0.12.0

---
updated-dependencies:
- dependency-name: binggan
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

* fix build

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>
2024-10-16 11:36:04 +08:00
Bruce Mitchener
5f026901b8 Update MSRV to 1.75 (#2515)
This is required by the `fs4` dependency. There are other
things that need something later than 1.66.

Both quickwit and the Python binding already require something
newer.
2024-10-16 10:32:16 +08:00
baishen
6dfa2df06f fix OwnedBytes debug panic (#2512) 2024-10-16 10:31:40 +08:00
Bruce Mitchener
c17e513377 Reduce typo count. (#2510) 2024-10-10 09:55:37 +08:00
PSeitz
2f5a269c70 update packages (#2500)
fixes some warnings
2024-09-25 17:46:18 +08:00
PSeitz
50532260e3 update changelog (#2496) 2024-09-25 10:28:53 +08:00
Tri
8bd6eb06e6 feat: make SegmentMeta.with_max_doc public (#2499)
* chore: add container

* feat: make max doc editable externally

* chore: expose another method

* chore: remove comments

* remove unused devcontainer

* chore: manually match nightly format

* chore: change weird formating

* revert format change

* fix: format with nightly
2024-09-23 12:39:36 +08:00
PSeitz
55b0b52457 Fix AggregationLimits (#2495)
* change AggregationLimits behavior

This fixes an issue encountered with the current behaviour of
AggregationLimits.
Previously we had AggregationLimits and RessourceLimitGuard, which both
track the memory, but only RessourceLimitGuard released memory when
dropped, while AggregationLimits did not.

This PR changes AggregationLimits to be a guard itself and removes the
RessourceLimitGuard.

* rename AggregationLimits to AggregationLimitsGuard
2024-09-17 14:25:47 +08:00
dependabot[bot]
56fc56c5b9 Update binggan requirement from 0.8.0 to 0.10.0 (#2493)
* Update binggan requirement from 0.8.0 to 0.10.0

---
updated-dependencies:
- dependency-name: binggan
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

* update PR

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>
2024-09-10 14:26:06 +08:00
trinity-1686a
85395d942a fix clippy lints from 1.80-1.81 (#2488)
* fix some clippy lints

* fix clippy::doc_lazy_continuation

* fix some lints for 1.82
2024-09-05 14:33:05 +02:00
PSeitz
a206c3ccd3 add compat tests (#2485) 2024-09-04 18:26:57 +08:00
Chaya
dc5d31c116 grammar and misspellings (#2483)
* grammar

* grammar

* misspelling
2024-09-04 12:45:31 +08:00
gezihuzi
95a4ddea3e Fix: Improve collapse_overlapped_ranges function (#2474)
* Fix: Improve collapse_overlapped_ranges function

- Refactor into separate sort_and_deduplicate_ranges and merge_overlapping_ranges functions
- Enhance sorting to consider both start and end of ranges
- Optimize merging logic to handle adjacent ranges
- Add comprehensive examples in function documentation
- Ensure proper handling of duplicate and unsorted input ranges
- Improve overall efficiency and readability of range collapsing algorithm

* move debug_assert

---------

Co-authored-by: PSeitz <PSeitz@users.noreply.github.com>
2024-09-04 12:39:13 +08:00
trinity-1686a
ab5125d3dc remove unused trait bounds and outdated doc comment (#2478) 2024-09-03 16:31:51 +02:00
trinity-1686a
9f81d59ecd make find_field_with_default return json fields without path (#2476)
* make find_field_with_default return json fields without path

* add tests for find_field_with_default
2024-08-19 15:25:29 +02:00
PSeitz
c71ec8086d add FastFieldRangeQuery, rename (#2477)
* add FastFieldRangeQuery, rename

* remove Query impl
2024-08-19 09:02:00 +02:00
PSeitz
27be6aed91 lift clauses in LogicalAst (#2449)
(a OR b) OR (c OR d) can be simplified to (a OR b OR c OR d)
(a AND b) AND (c AND d) can be simplified to (a AND b AND c AND d)

This directly affects how queries are executed

remove unused SumWithCoordsCombiner
the number of fields is unused and private
2024-08-14 19:21:26 +02:00
PSeitz
3d1c4b313a support ff range queries on json fields (#2456)
* support ff range queries on json fields

* fix term date truncation

* use inverted index range query for phrase prefix queries

* rename to InvertedIndexRangeQuery

* fix column filter, add mixed column test
2024-08-02 00:06:50 +08:00
PSeitz
0d4e319965 add Key::I64 and Key::U64 variants in aggregation (#2468)
* add Key::I64 and Key::U64 variants in aggregation

Currently all `Key` numerical values are returned as f64. This causes problems in some
cases with the precision and the way f64 is serialized.

This PR adds `Key::I64` and `Key::U64` variants and uses them in the term
aggregation.

* add clarification comment
2024-07-31 20:29:32 +08:00
PSeitz
75dc3eb298 extend custom order deserialization (#2451)
allow arrays
improve validation
closes https://github.com/quickwit-oss/tantivy/issues/2435
2024-07-30 18:36:08 +08:00
PSeitz
3f6d225086 fix potential endless loop in merge (#2457)
avoid single segments lists without deletes as merge candidates, as they will be moved
to a merge operation and filtered for merging in the next
consider_merge_options call. In rare cases this may end up in a endless
merge loop where only single segments where nothing is to be done are
merged.
2024-07-30 16:37:20 +08:00
PSeitz
d8843c608c make FastFieldRangeWeight::new pub (#2460) 2024-07-29 10:39:27 +08:00
PSeitz
7ebcc15b17 add support for str fast field range query (#2453)
* add support for str fast field range query

Add support for range queries on fast fields, by converting term bounds to
term ordinals bounds.

closes https://github.com/quickwit-oss/tantivy/issues/2023

* extend tests, rename

* update comment

* update comment
2024-07-17 09:31:42 +08:00
PSeitz
1b4076691f refactor fast field query (#2452)
As preparation of #2023 and #1709

* Use Term to pass parameters
* merge u64 and ip fast field range query

Side note: I did not rename range_query_u64_fastfield, because then git can't track the changes.
2024-07-15 18:08:05 +08:00
Robert Caulk
eab660873a doc: fix typo in readme (#2450) 2024-07-09 15:12:22 +08:00
PSeitz
232f37126e fix coverage (#2448) 2024-07-05 12:04:18 +08:00
PSeitz
13e9885dfd faster term aggregation fetch terms (#2447)
big impact for term aggregations with large `size` parameter (e.g. 1000)
add top 1000 term agg bench

full
terms_few                                      Memory: 27.3 KB (+79.09%)    Avg: 3.8058ms (+2.40%)      Median: 3.7192ms (+3.47%)       [3.6224ms .. 4.3721ms]
terms_many                                     Memory: 6.9 MB               Avg: 12.6102ms (-4.70%)     Median: 12.1389ms (-6.58%)      [10.2847ms .. 15.4857ms]
terms_many_top_1000                            Memory: 6.9 MB               Avg: 15.8216ms (-83.19%)    Median: 15.4899ms (-83.46%)     [13.4250ms .. 20.6897ms]
terms_many_order_by_term                       Memory: 6.9 MB               Avg: 14.7820ms (-3.95%)     Median: 14.2236ms (-4.28%)      [12.6669ms .. 21.0968ms]
terms_many_with_top_hits                       Memory: 58.2 MB              Avg: 551.6218ms (+7.18%)    Median: 549.8826ms (+11.01%)    [496.7371ms .. 592.1299ms]
terms_many_with_avg_sub_agg                    Memory: 27.8 MB              Avg: 197.7029ms (+2.66%)    Median: 190.1564ms (+0.64%)     [167.9226ms .. 245.6651ms]
terms_many_json_mixed_type_with_avg_sub_agg    Memory: 42.0 MB (+0.00%)     Avg: 242.0121ms (+0.92%)    Median: 237.7084ms (-2.85%)     [201.9959ms .. 302.2136ms]
terms_few_with_cardinality_agg                 Memory: 10.6 MB              Avg: 122.6036ms (+1.21%)    Median: 119.0033ms (+2.60%)     [109.2859ms .. 161.5858ms]
range_agg_with_term_agg_few                    Memory: 45.4 KB (+39.75%)    Avg: 24.5454ms (+2.14%)     Median: 24.2861ms (+2.44%)      [23.5109ms .. 27.8406ms]
range_agg_with_term_agg_many                   Memory: 6.9 MB               Avg: 56.8049ms (+3.01%)     Median: 50.9706ms (+1.52%)      [41.4517ms .. 90.3934ms]
dense
terms_few                                      Memory: 28.8 KB (+81.74%)    Avg: 8.9092ms (-2.24%)      Median: 8.7143ms (-1.31%)      [8.6148ms .. 10.3868ms]
terms_many                                     Memory: 6.9 MB (-0.00%)      Avg: 17.9604ms (-10.18%)    Median: 17.1552ms (-11.93%)    [14.8979ms .. 26.2779ms]
terms_many_top_1000                            Memory: 6.9 MB               Avg: 21.4963ms (-78.90%)    Median: 21.2924ms (-78.98%)    [18.2033ms .. 28.0087ms]
terms_many_order_by_term                       Memory: 6.9 MB               Avg: 20.4167ms (-9.13%)     Median: 19.5596ms (-11.37%)    [17.5153ms .. 29.5987ms]
terms_many_with_top_hits                       Memory: 58.2 MB              Avg: 518.4474ms (-6.41%)    Median: 514.9180ms (-9.44%)    [471.5550ms .. 579.0220ms]
terms_many_with_avg_sub_agg                    Memory: 27.8 MB              Avg: 263.6702ms (-2.78%)    Median: 260.8775ms (-2.55%)    [239.5754ms .. 304.6669ms]
terms_many_json_mixed_type_with_avg_sub_agg    Memory: 42.0 MB              Avg: 299.9791ms (-2.01%)    Median: 302.2180ms (-3.08%)    [239.2080ms .. 346.3649ms]
terms_few_with_cardinality_agg                 Memory: 10.6 MB              Avg: 136.3303ms (-3.12%)    Median: 132.3831ms (-2.88%)    [123.7564ms .. 164.7914ms]
range_agg_with_term_agg_few                    Memory: 47.1 KB (+37.81%)    Avg: 35.4538ms (+0.66%)     Median: 34.8754ms (-0.56%)     [34.2287ms .. 40.0884ms]
range_agg_with_term_agg_many                   Memory: 6.9 MB               Avg: 72.2269ms (-4.38%)     Median: 66.1174ms (-4.98%)     [55.5125ms .. 124.1622ms]
sparse
terms_few                                      Memory: 27.3 KB (+69.68%)    Avg: 19.6053ms (-1.15%)     Median: 19.4543ms (-0.38%)     [19.3056ms .. 24.0547ms]
terms_many                                     Memory: 1.8 MB               Avg: 21.2886ms (-6.28%)     Median: 21.1287ms (-6.65%)     [20.6640ms .. 24.6144ms]
terms_many_top_1000                            Memory: 2.6 MB               Avg: 23.4869ms (-85.53%)    Median: 23.3393ms (-85.61%)    [22.7789ms .. 25.0896ms]
terms_many_order_by_term                       Memory: 1.8 MB               Avg: 21.7437ms (-7.78%)     Median: 21.6272ms (-7.66%)     [21.0409ms .. 23.6517ms]
terms_many_with_top_hits                       Memory: 13.1 MB              Avg: 43.7926ms (-2.76%)     Median: 44.3602ms (+0.01%)     [37.8039ms .. 51.0451ms]
terms_many_with_avg_sub_agg                    Memory: 7.5 MB               Avg: 34.6307ms (+3.72%)     Median: 33.4522ms (+1.16%)     [32.4418ms .. 41.4196ms]
terms_many_json_mixed_type_with_avg_sub_agg    Memory: 7.4 MB               Avg: 46.4318ms (+1.16%)     Median: 46.4050ms (+2.03%)     [44.5986ms .. 48.5142ms]
terms_few_with_cardinality_agg                 Memory: 680.0 KB (-0.04%)    Avg: 35.4410ms (+2.05%)     Median: 35.1384ms (+1.19%)     [34.4402ms .. 39.1082ms]
range_agg_with_term_agg_few                    Memory: 45.7 KB (+39.44%)    Avg: 22.7760ms (+0.44%)     Median: 22.5152ms (-0.35%)     [22.3078ms .. 26.1567ms]
range_agg_with_term_agg_many                   Memory: 1.8 MB               Avg: 25.7696ms (-4.45%)     Median: 25.4009ms (-5.61%)     [24.7874ms .. 29.6434ms]
multivalue
terms_few                                      Memory: 244.4 KB            Avg: 15.1253ms (-2.85%)     Median: 15.0988ms (-0.54%)     [14.8790ms .. 15.8193ms]
terms_many                                     Memory: 6.9 MB (-0.00%)     Avg: 26.3019ms (-6.24%)     Median: 26.3662ms (-4.94%)     [21.3553ms .. 31.0564ms]
terms_many_top_1000                            Memory: 6.9 MB              Avg: 29.5212ms (-72.90%)    Median: 29.4257ms (-72.84%)    [24.2645ms .. 35.1607ms]
terms_many_order_by_term                       Memory: 6.9 MB              Avg: 28.6076ms (-4.93%)     Median: 28.1059ms (-6.64%)     [24.0845ms .. 34.1493ms]
terms_many_with_top_hits                       Memory: 58.3 MB             Avg: 570.1548ms (+1.52%)    Median: 572.7759ms (+0.53%)    [525.9567ms .. 617.0862ms]
terms_many_with_avg_sub_agg                    Memory: 27.8 MB             Avg: 305.5207ms (+0.24%)    Median: 296.0101ms (-0.22%)    [277.8579ms .. 373.5914ms]
terms_many_json_mixed_type_with_avg_sub_agg    Memory: 42.0 MB (-0.00%)    Avg: 324.7342ms (-2.51%)    Median: 319.0025ms (-2.58%)    [298.7122ms .. 368.6144ms]
terms_few_with_cardinality_agg                 Memory: 10.8 MB             Avg: 151.6126ms (-2.54%)    Median: 149.0616ms (-0.32%)    [136.5592ms .. 181.8942ms]
range_agg_with_term_agg_few                    Memory: 248.2 KB            Avg: 49.5225ms (+3.11%)     Median: 48.3994ms (+3.18%)     [46.4134ms .. 60.5989ms]
range_agg_with_term_agg_many                   Memory: 6.9 MB              Avg: 85.9824ms (-3.66%)     Median: 78.4266ms (-3.85%)     [64.1231ms .. 128.5279ms]
2024-07-03 12:42:59 +08:00
PSeitz
56d79cb203 fix cardinality aggregation performance (#2446)
* fix cardinality aggregation performance

fix cardinality performance by fetching multiple terms at once. This
avoids decompressing the same block and keeps the buffer state between
terms.

add cardinality aggregation benchmark

bump rust version to 1.66

Performance comparison to before (AllQuery)
```
full
cardinality_agg                   Memory: 3.5 MB (-0.00%)    Avg: 21.2256ms (-97.78%)    Median: 21.0042ms (-97.82%)    [20.4717ms .. 23.6206ms]
terms_few_with_cardinality_agg    Memory: 10.6 MB            Avg: 81.9293ms (-97.37%)    Median: 81.5526ms (-97.38%)    [79.7564ms .. 88.0374ms]
dense
cardinality_agg                   Memory: 3.6 MB (-0.00%)    Avg: 25.9372ms (-97.24%)    Median: 25.7744ms (-97.25%)    [24.7241ms .. 27.8793ms]
terms_few_with_cardinality_agg    Memory: 10.6 MB            Avg: 93.9897ms (-96.91%)    Median: 92.7821ms (-96.94%)    [90.3312ms .. 117.4076ms]
sparse
cardinality_agg                   Memory: 895.4 KB (-0.00%)    Avg: 22.5113ms (-95.01%)    Median: 22.5629ms (-94.99%)    [22.1628ms .. 22.9436ms]
terms_few_with_cardinality_agg    Memory: 680.2 KB             Avg: 26.4250ms (-94.85%)    Median: 26.4135ms (-94.86%)    [26.3210ms .. 26.6774ms]
```

* clippy

* assert for sorted ordinals
2024-07-02 15:29:00 +08:00
Paul Masurel
0f4c2e27cf Fixes bug that causes out-of-order sstable key. (#2445)
The previous way to address the problem was to replace \u{0000}
with 0 in different places.

This logic had several flaws:
Done on the serializer side (like it was for the columnar), there was
a collision problem.

If a document in the segment contained a json field with a \0 and
antoher doc contained the same json field but `0` then we were sending
the same field path twice to the serializer.

Another option would have been to normalizes all values on the writer
side.

This PR simplifies the logic and simply ignore json path containing a
\0, both in the columnar and the inverted index.

Closes #2442
2024-07-01 15:40:07 +08:00
落叶乌龟
f9ae295507 feat(query): Make BooleanQuery supports minimum_number_should_match (#2405)
* feat(query): Make `BooleanQuery` supports `minimum_number_should_match`. see issue #2398

In this commit, a novel scorer named DisjunctionScorer is introduced, which performs the union of inverted chains with the minimal required elements. BTW, it's implemented via a min-heap. Necessary modifications on `BooleanQuery` and `BooleanWeight` are performed as well.

* fixup! fix test

* fixup!: refactor code.

1. More meaningful names.
2. Add Cache for `Disjunction`'s scorers, and fix bug.
3. Optimize `BooleanWeight::complex_scorer`

Thanks
 Paul Masurel <paul@quickwit.io>

* squash!: come up with better variable naming.

* squash!: fix naming issues.

* squash!: fix typo.

* squash!: Remove CombinationMethod::FullIntersection
2024-07-01 15:39:41 +08:00
Raphael Coeffic
d9db5302d9 feat: cardinality aggregation (#2337)
* WiP: cardinality aggregation

* Collect unique entries first, then insert into HyperLogLog

* Handle `missing`

* Hybrid approach

* Review changes

- insert `missing` value at most once
- `term_id` -> `term_ord`
- iterate directly over entries without collecting first

* Use salted hasher to include column type

* fix: formatting

* More review fixes

* Add cardinality to test_aggregation_flushing

* Formatting
2024-07-01 07:49:42 +08:00
Paul Masurel
e453848134 Recycling buffer in PrefixPhraseScorer (#2443) 2024-06-24 17:11:53 +09:00
378 changed files with 24369 additions and 7454 deletions

View File

@@ -15,11 +15,11 @@ jobs:
steps:
- uses: actions/checkout@v4
- name: Install Rust
run: rustup toolchain install nightly-2024-04-10 --profile minimal --component llvm-tools-preview
run: rustup toolchain install nightly-2024-07-01 --profile minimal --component llvm-tools-preview
- uses: Swatinem/rust-cache@v2
- uses: taiki-e/install-action@cargo-llvm-cov
- name: Generate code coverage
run: cargo +nightly-2024-04-10 llvm-cov --all-features --workspace --doctests --lcov --output-path lcov.info
run: cargo +nightly-2024-07-01 llvm-cov --all-features --workspace --doctests --lcov --output-path lcov.info
- name: Upload coverage to Codecov
uses: codecov/codecov-action@v3
continue-on-error: true

View File

@@ -46,7 +46,7 @@ The file of a segment has the format
```segment-id . ext```
The extension signals which data structure (or [`SegmentComponent`](src/core/segment_component.rs)) is stored in the file.
The extension signals which data structure (or [`SegmentComponent`](src/index/segment_component.rs)) is stored in the file.
A small `meta.json` file is in charge of keeping track of the list of segments, as well as the schema.
@@ -102,7 +102,7 @@ but users can extend tantivy with their own implementation.
Tantivy's document follows a very strict schema, decided before building any index.
The schema defines all of the fields that the indexes [`Document`](src/schema/document.rs) may and should contain, their types (`text`, `i64`, `u64`, `Date`, ...) as well as how it should be indexed / represented in tantivy.
The schema defines all of the fields that the indexes [`Document`](src/schema/document/mod.rs) may and should contain, their types (`text`, `i64`, `u64`, `Date`, ...) as well as how it should be indexed / represented in tantivy.
Depending on the type of the field, you can decide to

View File

@@ -1,3 +1,121 @@
Tantivy 0.25
================================
## Bugfixes
- fix union performance regression in tantivy 0.24 [#2663](https://github.com/quickwit-oss/tantivy/pull/2663)(@PSeitz)
- make zstd optional in sstable [#2633](https://github.com/quickwit-oss/tantivy/pull/2633)(@Parth)
- Fix TopDocs::order_by_string_fast_field for asc order [#2672](https://github.com/quickwit-oss/tantivy/pull/2672)(@stuhood @PSeitz)
## Features/Improvements
- add docs/example and Vec<u32> values to sstable [#2660](https://github.com/quickwit-oss/tantivy/pull/2660)(@PSeitz)
- Add string fast field support to `TopDocs`. [#2642](https://github.com/quickwit-oss/tantivy/pull/2642)(@stuhood)
- update edition to 2024 [#2620](https://github.com/quickwit-oss/tantivy/pull/2620)(@PSeitz)
- Allow optional spaces between the field name and the value in the query parser [#2678](https://github.com/quickwit-oss/tantivy/pull/2678)(@Darkheir)
- Support mixed field types in query parser [#2676](https://github.com/quickwit-oss/tantivy/pull/2676)(@trinity-1686a)
- Add per-field size details [#2679](https://github.com/quickwit-oss/tantivy/pull/2679)(@fulmicoton)
Tantivy 0.24.2
================================
- Fix TopNComputer for reverse order. [#2672](https://github.com/quickwit-oss/tantivy/pull/2672)(@stuhood @PSeitz)
Affected queries are [order_by_fast_field](https://docs.rs/tantivy/latest/tantivy/collector/struct.TopDocs.html#method.order_by_fast_field) and
[order_by_u64_field](https://docs.rs/tantivy/latest/tantivy/collector/struct.TopDocs.html#method.order_by_u64_field)
for `Order::Asc`
Tantivy 0.24.1
================================
- Fix: bump required rust version to 1.81
Tantivy 0.24
================================
Tantivy 0.24 will be backwards compatible with indices created with v0.22 and v0.21. The new minimum rust version will be 1.75. Tantivy 0.23 will be skipped.
#### Bugfixes
- fix potential endless loop in merge [#2457](https://github.com/quickwit-oss/tantivy/pull/2457)(@PSeitz)
- fix bug that causes out-of-order sstable key. [#2445](https://github.com/quickwit-oss/tantivy/pull/2445)(@fulmicoton)
- fix ReferenceValue API flaw [#2372](https://github.com/quickwit-oss/tantivy/pull/2372)(@PSeitz)
- fix `OwnedBytes` debug panic [#2512](https://github.com/quickwit-oss/tantivy/pull/2512)(@b41sh)
- catch panics during merges [#2582](https://github.com/quickwit-oss/tantivy/pull/2582)(@rdettai)
- switch from u32 to usize in bitpacker. This enables multivalued columns larger than 4GB, which crashed during merge before. [#2581](https://github.com/quickwit-oss/tantivy/pull/2581) [#2586](https://github.com/quickwit-oss/tantivy/pull/2586)(@fulmicoton-dd @PSeitz)
#### Breaking API Changes
- remove index sorting [#2434](https://github.com/quickwit-oss/tantivy/pull/2434)(@PSeitz)
#### Features/Improvements
- **Aggregation**
- Support for cardinality aggregation [#2337](https://github.com/quickwit-oss/tantivy/pull/2337) [#2446](https://github.com/quickwit-oss/tantivy/pull/2446) (@raphaelcoeffic @PSeitz)
- Support for extended stats aggregation [#2247](https://github.com/quickwit-oss/tantivy/pull/2247)(@giovannicuccu)
- Add Key::I64 and Key::U64 variants in aggregation to avoid f64 precision issues [#2468](https://github.com/quickwit-oss/tantivy/pull/2468)(@PSeitz)
- Faster term aggregation fetch terms [#2447](https://github.com/quickwit-oss/tantivy/pull/2447)(@PSeitz)
- Improve custom order deserialization [#2451](https://github.com/quickwit-oss/tantivy/pull/2451)(@PSeitz)
- Change AggregationLimits behavior [#2495](https://github.com/quickwit-oss/tantivy/pull/2495)(@PSeitz)
- lower contention on AggregationLimits [#2394](https://github.com/quickwit-oss/tantivy/pull/2394)(@PSeitz)
- fix postcard compatibility for top_hits, add postcard test [#2346](https://github.com/quickwit-oss/tantivy/pull/2346)(@PSeitz)
- reduce top hits memory consumption [#2426](https://github.com/quickwit-oss/tantivy/pull/2426)(@PSeitz)
- check unsupported parameters top_hits [#2351](https://github.com/quickwit-oss/tantivy/pull/2351)(@PSeitz)
- Change AggregationLimits to AggregationLimitsGuard [#2495](https://github.com/quickwit-oss/tantivy/pull/2495)(@PSeitz)
- add support for counting non integer in aggregation [#2547](https://github.com/quickwit-oss/tantivy/pull/2547)(@trinity-1686a)
- **Range Queries**
- Support fast field range queries on json fields [#2456](https://github.com/quickwit-oss/tantivy/pull/2456)(@PSeitz)
- Add support for str fast field range query [#2460](https://github.com/quickwit-oss/tantivy/pull/2460) [#2452](https://github.com/quickwit-oss/tantivy/pull/2452) [#2453](https://github.com/quickwit-oss/tantivy/pull/2453)(@PSeitz)
- modify fastfield range query heuristic [#2375](https://github.com/quickwit-oss/tantivy/pull/2375)(@trinity-1686a)
- add FastFieldRangeQuery for explicit range queries on fast field (for `RangeQuery` it is autodetected) [#2477](https://github.com/quickwit-oss/tantivy/pull/2477)(@PSeitz)
- add format backwards-compatibility tests [#2485](https://github.com/quickwit-oss/tantivy/pull/2485)(@PSeitz)
- add columnar format compatibility tests [#2433](https://github.com/quickwit-oss/tantivy/pull/2433)(@PSeitz)
- Improved snippet ranges algorithm [#2474](https://github.com/quickwit-oss/tantivy/pull/2474)(@gezihuzi)
- make find_field_with_default return json fields without path [#2476](https://github.com/quickwit-oss/tantivy/pull/2476)(@trinity-1686a)
- Make `BooleanQuery` support `minimum_number_should_match` [#2405](https://github.com/quickwit-oss/tantivy/pull/2405)(@LebranceBW)
- Make `NUM_MERGE_THREADS` configurable [#2535](https://github.com/quickwit-oss/tantivy/pull/2535)(@Barre)
- **RegexPhraseQuery**
`RegexPhraseQuery` supports phrase queries with regex. E.g. query "b.* b.* wolf" matches "big bad wolf". Slop is supported as well: "b.* wolf"~2 matches "big bad wolf" [#2516](https://github.com/quickwit-oss/tantivy/pull/2516)(@PSeitz)
- **Optional Index in Multivalue Columnar Index**
For mostly empty multivalued indices there was a large overhead during creation when iterating all docids (merge case).
This is alleviated by placing an optional index in the multivalued index to mark documents that have values.
This will slightly increase space and access time. [#2439](https://github.com/quickwit-oss/tantivy/pull/2439)(@PSeitz)
- **Store DateTime as nanoseconds in doc store** DateTime in the doc store was truncated to microseconds previously. This removes this truncation, while still keeping backwards compatibility. [#2486](https://github.com/quickwit-oss/tantivy/pull/2486)(@PSeitz)
- **Performance/Memory**
- lift clauses in LogicalAst for optimized ast during execution [#2449](https://github.com/quickwit-oss/tantivy/pull/2449)(@PSeitz)
- Use Vec instead of BTreeMap to back OwnedValue object [#2364](https://github.com/quickwit-oss/tantivy/pull/2364)(@fulmicoton)
- Replace TantivyDocument with CompactDoc. CompactDoc is much smaller and provides similar performance. [#2402](https://github.com/quickwit-oss/tantivy/pull/2402)(@PSeitz)
- Recycling buffer in PrefixPhraseScorer [#2443](https://github.com/quickwit-oss/tantivy/pull/2443)(@fulmicoton)
- **Json Type**
- JSON supports now all values on the root level. Previously an object was required. This enables support for flat mixed types. allow more JSON values, fix i64 special case [#2383](https://github.com/quickwit-oss/tantivy/pull/2383)(@PSeitz)
- add json path constructor to term [#2367](https://github.com/quickwit-oss/tantivy/pull/2367)(@PSeitz)
- **QueryParser**
- fix de-escaping too much in query parser [#2427](https://github.com/quickwit-oss/tantivy/pull/2427)(@trinity-1686a)
- improve query parser [#2416](https://github.com/quickwit-oss/tantivy/pull/2416)(@trinity-1686a)
- Support field grouping `title:(return AND "pink panther")` [#2333](https://github.com/quickwit-oss/tantivy/pull/2333)(@trinity-1686a)
- allow term starting with wildcard [#2568](https://github.com/quickwit-oss/tantivy/pull/2568)(@trinity-1686a)
- Exist queries match subpath fields [#2558](https://github.com/quickwit-oss/tantivy/pull/2558)(@rdettai)
- add access benchmark for columnar [#2432](https://github.com/quickwit-oss/tantivy/pull/2432)(@PSeitz)
- extend indexwriter proptests [#2342](https://github.com/quickwit-oss/tantivy/pull/2342)(@PSeitz)
- add bench & test for columnar merging [#2428](https://github.com/quickwit-oss/tantivy/pull/2428)(@PSeitz)
- Change in Executor API [#2391](https://github.com/quickwit-oss/tantivy/pull/2391)(@fulmicoton)
- Removed usage of num_cpus [#2387](https://github.com/quickwit-oss/tantivy/pull/2387)(@fulmicoton)
- use bingang for agg and stacker benchmark [#2378](https://github.com/quickwit-oss/tantivy/pull/2378)[#2492](https://github.com/quickwit-oss/tantivy/pull/2492)(@PSeitz)
- cleanup top level exports [#2382](https://github.com/quickwit-oss/tantivy/pull/2382)(@PSeitz)
- make convert_to_fast_value_and_append_to_json_term pub [#2370](https://github.com/quickwit-oss/tantivy/pull/2370)(@PSeitz)
- remove JsonTermWriter [#2238](https://github.com/quickwit-oss/tantivy/pull/2238)(@PSeitz)
- validate sort by field type [#2336](https://github.com/quickwit-oss/tantivy/pull/2336)(@PSeitz)
- Fix trait bound of StoreReader::iter [#2360](https://github.com/quickwit-oss/tantivy/pull/2360)(@adamreichold)
- remove read_postings_no_deletes [#2526](https://github.com/quickwit-oss/tantivy/pull/2526)(@PSeitz)
Tantivy 0.22.1
================================
- Fix TopNComputer for reverse order. [#2672](https://github.com/quickwit-oss/tantivy/pull/2672)(@stuhood @PSeitz)
Affected queries are [order_by_fast_field](https://docs.rs/tantivy/latest/tantivy/collector/struct.TopDocs.html#method.order_by_fast_field) and
[order_by_u64_field](https://docs.rs/tantivy/latest/tantivy/collector/struct.TopDocs.html#method.order_by_u64_field)
for `Order::Asc`
Tantivy 0.22
================================
@@ -8,7 +126,7 @@ Tantivy 0.22 will be able to read indices created with Tantivy 0.21.
- Fix bug that can cause `get_docids_for_value_range` to panic. [#2295](https://github.com/quickwit-oss/tantivy/pull/2295)(@fulmicoton)
- Avoid 1 document indices by increase min memory to 15MB for indexing [#2176](https://github.com/quickwit-oss/tantivy/pull/2176)(@PSeitz)
- Fix merge panic for JSON fields [#2284](https://github.com/quickwit-oss/tantivy/pull/2284)(@PSeitz)
- Fix bug occuring when merging JSON object indexed with positions. [#2253](https://github.com/quickwit-oss/tantivy/pull/2253)(@fulmicoton)
- Fix bug occurring when merging JSON object indexed with positions. [#2253](https://github.com/quickwit-oss/tantivy/pull/2253)(@fulmicoton)
- Fix empty DateHistogram gap bug [#2183](https://github.com/quickwit-oss/tantivy/pull/2183)(@PSeitz)
- Fix range query end check (fields with less than 1 value per doc are affected) [#2226](https://github.com/quickwit-oss/tantivy/pull/2226)(@PSeitz)
- Handle exclusive out of bounds ranges on fastfield range queries [#2174](https://github.com/quickwit-oss/tantivy/pull/2174)(@PSeitz)
@@ -26,7 +144,7 @@ Tantivy 0.22 will be able to read indices created with Tantivy 0.21.
- Support to deserialize f64 from string [#2311](https://github.com/quickwit-oss/tantivy/pull/2311)(@PSeitz)
- Add a top_hits aggregator [#2198](https://github.com/quickwit-oss/tantivy/pull/2198)(@ditsuke)
- Support bool type in term aggregation [#2318](https://github.com/quickwit-oss/tantivy/pull/2318)(@PSeitz)
- Support ip adresses in term aggregation [#2319](https://github.com/quickwit-oss/tantivy/pull/2319)(@PSeitz)
- Support ip addresses in term aggregation [#2319](https://github.com/quickwit-oss/tantivy/pull/2319)(@PSeitz)
- Support date type in term aggregation [#2172](https://github.com/quickwit-oss/tantivy/pull/2172)(@PSeitz)
- Support escaped dot when addressing field [#2250](https://github.com/quickwit-oss/tantivy/pull/2250)(@PSeitz)
@@ -116,7 +234,7 @@ Tantivy 0.20
- Add PhrasePrefixQuery [#1842](https://github.com/quickwit-oss/tantivy/issues/1842) (@trinity-1686a)
- Add `coerce` option for text and numbers types (convert the value instead of returning an error during indexing) [#1904](https://github.com/quickwit-oss/tantivy/issues/1904) (@PSeitz)
- Add regex tokenizer [#1759](https://github.com/quickwit-oss/tantivy/issues/1759)(@mkleen)
- Move tokenizer API to seperate crate. Having a seperate crate with a stable API will allow us to use tokenizers with different tantivy versions. [#1767](https://github.com/quickwit-oss/tantivy/issues/1767) (@PSeitz)
- Move tokenizer API to separate crate. Having a separate crate with a stable API will allow us to use tokenizers with different tantivy versions. [#1767](https://github.com/quickwit-oss/tantivy/issues/1767) (@PSeitz)
- **Columnar crate**: New fast field handling (@fulmicoton @PSeitz) [#1806](https://github.com/quickwit-oss/tantivy/issues/1806)[#1809](https://github.com/quickwit-oss/tantivy/issues/1809)
- Support for fast fields with optional values. Previously tantivy supported only single-valued and multi-value fast fields. The encoding of optional fast fields is now very compact.
- Fast field Support for JSON (schemaless fast fields). Support multiple types on the same column. [#1876](https://github.com/quickwit-oss/tantivy/issues/1876) (@fulmicoton)
@@ -163,13 +281,13 @@ Tantivy 0.20
- Auto downgrade index record option, instead of vint error [#1857](https://github.com/quickwit-oss/tantivy/issues/1857) (@PSeitz)
- Enable range query on fast field for u64 compatible types [#1762](https://github.com/quickwit-oss/tantivy/issues/1762) (@PSeitz) [#1876]
- sstable
- Isolating sstable and stacker in independant crates. [#1718](https://github.com/quickwit-oss/tantivy/issues/1718) (@fulmicoton)
- Isolating sstable and stacker in independent crates. [#1718](https://github.com/quickwit-oss/tantivy/issues/1718) (@fulmicoton)
- New sstable format [#1943](https://github.com/quickwit-oss/tantivy/issues/1943)[#1953](https://github.com/quickwit-oss/tantivy/issues/1953) (@trinity-1686a)
- Use DeltaReader directly to implement Dictionnary::ord_to_term [#1928](https://github.com/quickwit-oss/tantivy/issues/1928) (@trinity-1686a)
- Use DeltaReader directly to implement Dictionnary::term_ord [#1925](https://github.com/quickwit-oss/tantivy/issues/1925) (@trinity-1686a)
- Add seperate tokenizer manager for fast fields [#2019](https://github.com/quickwit-oss/tantivy/issues/2019) (@PSeitz)
- Use DeltaReader directly to implement Dictionary::ord_to_term [#1928](https://github.com/quickwit-oss/tantivy/issues/1928) (@trinity-1686a)
- Use DeltaReader directly to implement Dictionary::term_ord [#1925](https://github.com/quickwit-oss/tantivy/issues/1925) (@trinity-1686a)
- Add separate tokenizer manager for fast fields [#2019](https://github.com/quickwit-oss/tantivy/issues/2019) (@PSeitz)
- Make construction of LevenshteinAutomatonBuilder for FuzzyTermQuery instances lazy. [#1756](https://github.com/quickwit-oss/tantivy/issues/1756) (@adamreichold)
- Added support for madvise when opening an mmaped Index [#2036](https://github.com/quickwit-oss/tantivy/issues/2036) (@fulmicoton)
- Added support for madvise when opening an mmapped Index [#2036](https://github.com/quickwit-oss/tantivy/issues/2036) (@fulmicoton)
- Rename `DatePrecision` to `DateTimePrecision` [#2051](https://github.com/quickwit-oss/tantivy/issues/2051) (@guilload)
- Query Parser
- Quotation mark can now be used for phrase queries. [#2050](https://github.com/quickwit-oss/tantivy/issues/2050) (@fulmicoton)
@@ -208,7 +326,7 @@ Tantivy 0.19
- Add support for phrase slop in query language [#1393](https://github.com/quickwit-oss/tantivy/pull/1393) (@saroh)
- Aggregation
- Add aggregation support for date type [#1693](https://github.com/quickwit-oss/tantivy/pull/1693)(@PSeitz)
- Add support for keyed parameter in range and histgram aggregations [#1424](https://github.com/quickwit-oss/tantivy/pull/1424) (@k-yomo)
- Add support for keyed parameter in range and histogram aggregations [#1424](https://github.com/quickwit-oss/tantivy/pull/1424) (@k-yomo)
- Add aggregation bucket limit [#1363](https://github.com/quickwit-oss/tantivy/pull/1363) (@PSeitz)
- Faster indexing
- [#1610](https://github.com/quickwit-oss/tantivy/pull/1610) (@PSeitz)
@@ -651,7 +769,7 @@ Tantivy 0.4.0
- Raise the limit of number of fields (previously 256 fields) (@fulmicoton)
- Removed u32 fields. They are replaced by u64 and i64 fields (#65) (@fulmicoton)
- Optimized skip in SegmentPostings (#130) (@lnicola)
- Replacing rustc_serialize by serde. Kudos to @KodrAus and @lnicola
- Replacing rustc_serialize by serde. Kudos to benchmark@KodrAus and @lnicola
- Using error-chain (@KodrAus)
- QueryParser: (@fulmicoton)
- Explicit error returned when searched for a term that is not indexed

10
CITATION.cff Normal file
View File

@@ -0,0 +1,10 @@
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- alias: Quickwit Inc.
website: "https://quickwit.io"
title: "tantivy"
version: 0.22.0
doi: 10.5281/zenodo.13942948
date-released: 2024-10-17
url: "https://github.com/quickwit-oss/tantivy"

View File

@@ -1,6 +1,6 @@
[package]
name = "tantivy"
version = "0.23.0"
version = "0.26.0"
authors = ["Paul Masurel <paul.masurel@gmail.com>"]
license = "MIT"
categories = ["database-implementations", "data-structures"]
@@ -11,7 +11,7 @@ repository = "https://github.com/quickwit-oss/tantivy"
readme = "README.md"
keywords = ["search", "information", "retrieval"]
edition = "2021"
rust-version = "1.63"
rust-version = "1.85"
exclude = ["benches/*.json", "benches/*.txt"]
[dependencies]
@@ -29,49 +29,55 @@ tantivy-fst = "0.5"
memmap2 = { version = "0.9.0", optional = true }
lz4_flex = { version = "0.11", default-features = false, optional = true }
zstd = { version = "0.13", optional = true, default-features = false }
tempfile = { version = "3.3.0", optional = true }
tempfile = { version = "3.12.0", optional = true }
log = "0.4.16"
serde = { version = "1.0.136", features = ["derive"] }
serde_json = "1.0.79"
fs4 = { version = "0.8.0", optional = true }
serde = { version = "1.0.219", features = ["derive"] }
serde_json = "1.0.140"
fs4 = { version = "0.13.1", optional = true }
levenshtein_automata = "0.2.1"
uuid = { version = "1.0.0", features = ["v4", "serde"] }
crossbeam-channel = "0.5.4"
rust-stemmers = "1.2.0"
downcast-rs = "1.2.0"
downcast-rs = "2.0.1"
bitpacking = { version = "0.9.2", default-features = false, features = [
"bitpacker4x",
] }
census = "0.4.2"
rustc-hash = "1.1.0"
thiserror = "1.0.30"
rustc-hash = "2.0.0"
thiserror = "2.0.1"
htmlescape = "0.3.1"
fail = { version = "0.5.0", optional = true }
time = { version = "0.3.10", features = ["serde-well-known"] }
time = { version = "0.3.35", features = ["serde-well-known"] }
smallvec = "1.8.0"
rayon = "1.5.2"
lru = "0.12.0"
fastdivide = "0.4.0"
itertools = "0.13.0"
measure_time = "0.8.2"
itertools = "0.14.0"
measure_time = "0.9.0"
arc-swap = "1.5.0"
bon = "3.3.1"
i_triangle = "0.38.0"
columnar = { version = "0.3", path = "./columnar", package = "tantivy-columnar" }
sstable = { version = "0.3", path = "./sstable", package = "tantivy-sstable", optional = true }
stacker = { version = "0.3", path = "./stacker", package = "tantivy-stacker" }
query-grammar = { version = "0.22.0", path = "./query-grammar", package = "tantivy-query-grammar" }
tantivy-bitpacker = { version = "0.6", path = "./bitpacker" }
common = { version = "0.7", path = "./common/", package = "tantivy-common" }
tokenizer-api = { version = "0.3", path = "./tokenizer-api", package = "tantivy-tokenizer-api" }
columnar = { version = "0.6", path = "./columnar", package = "tantivy-columnar" }
sstable = { version = "0.6", path = "./sstable", package = "tantivy-sstable", optional = true }
stacker = { version = "0.6", path = "./stacker", package = "tantivy-stacker" }
query-grammar = { version = "0.25.0", path = "./query-grammar", package = "tantivy-query-grammar" }
tantivy-bitpacker = { version = "0.9", path = "./bitpacker" }
common = { version = "0.10", path = "./common/", package = "tantivy-common" }
tokenizer-api = { version = "0.6", path = "./tokenizer-api", package = "tantivy-tokenizer-api" }
sketches-ddsketch = { version = "0.3.0", features = ["use_serde"] }
hyperloglogplus = { version = "0.4.1", features = ["const-loop"] }
futures-util = { version = "0.3.28", optional = true }
futures-channel = { version = "0.3.28", optional = true }
fnv = "1.0.7"
typetag = "0.2.21"
geo-types = "0.7.17"
[target.'cfg(windows)'.dependencies]
winapi = "0.3.9"
[dev-dependencies]
binggan = "0.8.0"
binggan = "0.14.0"
rand = "0.8.5"
maplit = "1.0.2"
matches = "0.1.9"
@@ -84,7 +90,7 @@ more-asserts = "0.3.1"
rand_distr = "0.4.3"
time = { version = "0.3.10", features = ["serde-well-known", "macros"] }
postcard = { version = "1.0.4", features = [
"use-std",
"use-std",
], default-features = false }
[target.'cfg(not(windows))'.dev-dependencies]
@@ -109,17 +115,20 @@ debug-assertions = true
overflow-checks = true
[features]
default = ["mmap", "stopwords", "lz4-compression"]
default = ["mmap", "stopwords", "lz4-compression", "columnar-zstd-compression"]
mmap = ["fs4", "tempfile", "memmap2"]
stopwords = []
lz4-compression = ["lz4_flex"]
zstd-compression = ["zstd"]
# enable zstd-compression in columnar (and sstable)
columnar-zstd-compression = ["columnar/zstd-compression"]
failpoints = ["fail", "fail/failpoints"]
unstable = [] # useful for benches.
quickwit = ["sstable", "futures-util"]
quickwit = ["sstable", "futures-util", "futures-channel"]
# Compares only the hash of a string when indexing data.
# Increases indexing speed, but may lead to extremely rare missing terms, when there's a hash collision.
@@ -161,3 +170,11 @@ harness = false
[[bench]]
name = "agg_bench"
harness = false
[[bench]]
name = "exists_json"
harness = false
[[bench]]
name = "and_or_queries"
harness = false

View File

@@ -18,13 +18,11 @@ Tantivy is, in fact, strongly inspired by Lucene's design.
## Benchmark
The following [benchmark](https://tantivy-search.github.io/bench/) breakdowns
The following [benchmark](https://tantivy-search.github.io/bench/) breaks down the
performance for different types of queries/collections.
Your mileage WILL vary depending on the nature of queries and their load.
<img src="doc/assets/images/searchbenchmark.png">
Details about the benchmark can be found at this [repository](https://github.com/quickwit-oss/search-benchmark-game).
## Features

View File

@@ -1,4 +1,4 @@
# Release a new Tantivy Version
# Releasing a new Tantivy Version
## Steps
@@ -10,12 +10,29 @@
6. Set git tag with new version
In conjucation with `cargo-release` Steps 1-4 (I'm not sure if the change detection works):
Set new packages to version 0.0.0
[`cargo-release`](https://github.com/crate-ci/cargo-release) will help us with steps 1-5:
Replace prev-tag-name
```bash
cargo release --workspace --no-publish -v --prev-tag-name 0.19 --push-remote origin minor --no-tag --execute
cargo release --workspace --no-publish -v --prev-tag-name 0.24 --push-remote origin minor --no-tag
```
no-tag or it will create tags for all the subpackages
`no-tag` or it will create tags for all the subpackages
cargo release will _not_ ignore unchanged packages, but it will print warnings for them.
e.g. "warning: updating ownedbytes to 0.10.0 despite no changes made since tag 0.24"
We need to manually ignore these unchanged packages
```bash
cargo release --workspace --no-publish -v --prev-tag-name 0.24 --push-remote origin minor --no-tag --exclude tokenizer-api
```
Add `--execute` to actually publish the packages, otherwise it will only print the commands that would be run.
### Tag Version
```bash
git tag 0.25.0
git push upstream tag 0.25.0
```

View File

@@ -1,7 +1,7 @@
Make schema_builder API fluent.
fix doc serialization and prevent compression problems
u64 , etc. shoudl return Resutl<Option> now that we support optional missing a column is really not an error
u64 , etc. should return Result<Option> now that we support optional missing a column is really not an error
remove fastfield codecs
ditch the first_or_default trick. if it is still useful, improve its implementation.
rename FastFieldReaders::open to load
@@ -10,7 +10,7 @@ rename FastFieldReaders::open to load
remove fast field reader
find a way to unify the two DateTime.
readd type check in the filter wrapper
re-add type check in the filter wrapper
add unit test on columnar list columns.

View File

@@ -1,3 +1,4 @@
use binggan::plugins::PeakMemAllocPlugin;
use binggan::{black_box, InputGroup, PeakMemAlloc, INSTRUMENTED_SYSTEM};
use rand::prelude::SliceRandom;
use rand::rngs::StdRng;
@@ -17,7 +18,9 @@ pub static GLOBAL: &PeakMemAlloc<std::alloc::System> = &INSTRUMENTED_SYSTEM;
/// runner.register("average_u64", move |index| average_u64(index));
macro_rules! register {
($runner:expr, $func:ident) => {
$runner.register(stringify!($func), move |index| $func(index))
$runner.register(stringify!($func), move |index| {
$func(index);
})
};
}
@@ -42,7 +45,8 @@ fn main() {
}
fn bench_agg(mut group: InputGroup<Index>) {
group.set_alloc(GLOBAL); // Set the peak mem allocator. This will enable peak memory reporting.
group.add_plugin(PeakMemAllocPlugin::new(GLOBAL));
register!(group, average_u64);
register!(group, average_f64);
register!(group, average_f64_u64);
@@ -51,10 +55,17 @@ fn bench_agg(mut group: InputGroup<Index>) {
register!(group, percentiles_f64);
register!(group, terms_few);
register!(group, terms_many);
register!(group, terms_many_top_1000);
register!(group, terms_many_order_by_term);
register!(group, terms_many_with_top_hits);
register!(group, terms_many_with_avg_sub_agg);
register!(group, terms_many_json_mixed_type_with_sub_agg_card);
register!(group, terms_few_with_avg_sub_agg);
register!(group, terms_many_json_mixed_type_with_avg_sub_agg);
register!(group, cardinality_agg);
register!(group, terms_few_with_cardinality_agg);
register!(group, range_agg);
register!(group, range_agg_with_avg_sub_agg);
register!(group, range_agg_with_term_agg_few);
@@ -62,8 +73,15 @@ fn bench_agg(mut group: InputGroup<Index>) {
register!(group, histogram);
register!(group, histogram_hard_bounds);
register!(group, histogram_with_avg_sub_agg);
register!(group, histogram_with_term_agg_few);
register!(group, avg_and_range_with_avg_sub_agg);
// Filter aggregation benchmarks
register!(group, filter_agg_all_query_count_agg);
register!(group, filter_agg_term_query_count_agg);
register!(group, filter_agg_all_query_with_sub_aggs);
register!(group, filter_agg_term_query_with_sub_aggs);
group.run();
}
@@ -123,6 +141,33 @@ fn percentiles_f64(index: &Index) {
});
execute_agg(index, agg_req);
}
fn cardinality_agg(index: &Index) {
let agg_req = json!({
"cardinality": {
"cardinality": {
"field": "text_many_terms"
},
}
});
execute_agg(index, agg_req);
}
fn terms_few_with_cardinality_agg(index: &Index) {
let agg_req = json!({
"my_texts": {
"terms": { "field": "text_few_terms" },
"aggs": {
"cardinality": {
"cardinality": {
"field": "text_many_terms"
},
}
}
},
});
execute_agg(index, agg_req);
}
fn terms_few(index: &Index) {
let agg_req = json!({
"my_texts": { "terms": { "field": "text_few_terms" } },
@@ -135,6 +180,12 @@ fn terms_many(index: &Index) {
});
execute_agg(index, agg_req);
}
fn terms_many_top_1000(index: &Index) {
let agg_req = json!({
"my_texts": { "terms": { "field": "text_many_terms", "size": 1000 } },
});
execute_agg(index, agg_req);
}
fn terms_many_order_by_term(index: &Index) {
let agg_req = json!({
"my_texts": { "terms": { "field": "text_many_terms", "order": { "_key": "desc" } } },
@@ -171,7 +222,20 @@ fn terms_many_with_avg_sub_agg(index: &Index) {
});
execute_agg(index, agg_req);
}
fn terms_many_json_mixed_type_with_sub_agg_card(index: &Index) {
fn terms_few_with_avg_sub_agg(index: &Index) {
let agg_req = json!({
"my_texts": {
"terms": { "field": "text_few_terms" },
"aggs": {
"average_f64": { "avg": { "field": "score_f64" } }
}
},
});
execute_agg(index, agg_req);
}
fn terms_many_json_mixed_type_with_avg_sub_agg(index: &Index) {
let agg_req = json!({
"my_texts": {
"terms": { "field": "json.mixed_type" },
@@ -268,6 +332,7 @@ fn range_agg_with_term_agg_many(index: &Index) {
});
execute_agg(index, agg_req);
}
fn histogram(index: &Index) {
let agg_req = json!({
"rangef64": {
@@ -296,6 +361,17 @@ fn histogram_with_avg_sub_agg(index: &Index) {
});
execute_agg(index, agg_req);
}
fn histogram_with_term_agg_few(index: &Index) {
let agg_req = json!({
"rangef64": {
"histogram": { "field": "score_f64", "interval": 10 },
"aggs": {
"my_texts": { "terms": { "field": "text_few_terms" } }
}
}
});
execute_agg(index, agg_req);
}
fn avg_and_range_with_avg_sub_agg(index: &Index) {
let agg_req = json!({
"rangef64": {
@@ -417,3 +493,61 @@ fn get_test_index_bench(cardinality: Cardinality) -> tantivy::Result<Index> {
Ok(index)
}
// Filter aggregation benchmarks
fn filter_agg_all_query_count_agg(index: &Index) {
let agg_req = json!({
"filtered": {
"filter": "*",
"aggs": {
"count": { "value_count": { "field": "score" } }
}
}
});
execute_agg(index, agg_req);
}
fn filter_agg_term_query_count_agg(index: &Index) {
let agg_req = json!({
"filtered": {
"filter": "text:cool",
"aggs": {
"count": { "value_count": { "field": "score" } }
}
}
});
execute_agg(index, agg_req);
}
fn filter_agg_all_query_with_sub_aggs(index: &Index) {
let agg_req = json!({
"filtered": {
"filter": "*",
"aggs": {
"avg_score": { "avg": { "field": "score" } },
"stats_score": { "stats": { "field": "score_f64" } },
"terms_text": {
"terms": { "field": "text_few_terms" }
}
}
}
});
execute_agg(index, agg_req);
}
fn filter_agg_term_query_with_sub_aggs(index: &Index) {
let agg_req = json!({
"filtered": {
"filter": "text:cool",
"aggs": {
"avg_score": { "avg": { "field": "score" } },
"stats_score": { "stats": { "field": "score_f64" } },
"terms_text": {
"terms": { "field": "text_few_terms" }
}
}
}
});
execute_agg(index, agg_req);
}

218
benches/and_or_queries.rs Normal file
View File

@@ -0,0 +1,218 @@
// Benchmarks boolean conjunction queries using binggan.
//
// Whats measured:
// - Or and And queries with varying selectivity (only `Term` queries for now on leafs)
// - Nested AND/OR combinations (on multiple fields)
// - No-scoring path using the Count collector (focus on iterator/skip performance)
// - Top-K retrieval (k=10) using the TopDocs collector
//
// Corpus model:
// - Synthetic docs; each token a/b/c is independently included per doc
// - If none of a/b/c are included, emit a neutral filler token to keep doc length similar
//
// Notes:
// - After optimization, when scoring is disabled Tantivy reads doc-only postings
// (IndexRecordOption::Basic), avoiding frequency decoding overhead.
// - This bench isolates boolean iteration speed and intersection/union cost.
// - Use `cargo bench --bench boolean_conjunction` to run.
use binggan::{black_box, BenchGroup, BenchRunner};
use rand::prelude::*;
use rand::rngs::StdRng;
use rand::SeedableRng;
use tantivy::collector::sort_key::SortByStaticFastValue;
use tantivy::collector::{Collector, Count, TopDocs};
use tantivy::query::{Query, QueryParser};
use tantivy::schema::{Schema, FAST, TEXT};
use tantivy::{doc, Index, Order, ReloadPolicy, Searcher};
#[derive(Clone)]
struct BenchIndex {
#[allow(dead_code)]
index: Index,
searcher: Searcher,
query_parser: QueryParser,
}
/// Build a single index containing both fields (title, body) and
/// return two BenchIndex views:
/// - single_field: QueryParser defaults to only "body"
/// - multi_field: QueryParser defaults to ["title", "body"]
fn build_shared_indices(num_docs: usize, p_a: f32, p_b: f32, p_c: f32) -> (BenchIndex, BenchIndex) {
// Unified schema (two text fields)
let mut schema_builder = Schema::builder();
let f_title = schema_builder.add_text_field("title", TEXT);
let f_body = schema_builder.add_text_field("body", TEXT);
let f_score = schema_builder.add_u64_field("score", FAST);
let f_score2 = schema_builder.add_u64_field("score2", FAST);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema.clone());
// Populate index with stable RNG for reproducibility.
let mut rng = StdRng::from_seed([7u8; 32]);
// Populate: spread each present token 90/10 to body/title
{
let mut writer = index.writer_with_num_threads(1, 500_000_000).unwrap();
for _ in 0..num_docs {
let has_a = rng.gen_bool(p_a as f64);
let has_b = rng.gen_bool(p_b as f64);
let has_c = rng.gen_bool(p_c as f64);
let score = rng.gen_range(0u64..100u64);
let score2 = rng.gen_range(0u64..100_000u64);
let mut title_tokens: Vec<&str> = Vec::new();
let mut body_tokens: Vec<&str> = Vec::new();
if has_a {
if rng.gen_bool(0.1) {
title_tokens.push("a");
} else {
body_tokens.push("a");
}
}
if has_b {
if rng.gen_bool(0.1) {
title_tokens.push("b");
} else {
body_tokens.push("b");
}
}
if has_c {
if rng.gen_bool(0.1) {
title_tokens.push("c");
} else {
body_tokens.push("c");
}
}
if title_tokens.is_empty() && body_tokens.is_empty() {
body_tokens.push("z");
}
writer
.add_document(doc!(
f_title=>title_tokens.join(" "),
f_body=>body_tokens.join(" "),
f_score=>score,
f_score2=>score2,
))
.unwrap();
}
writer.commit().unwrap();
}
// Prepare reader/searcher once.
let reader = index
.reader_builder()
.reload_policy(ReloadPolicy::Manual)
.try_into()
.unwrap();
let searcher = reader.searcher();
// Build two query parsers with different default fields.
let qp_single = QueryParser::for_index(&index, vec![f_body]);
let qp_multi = QueryParser::for_index(&index, vec![f_title, f_body]);
let single_view = BenchIndex {
index: index.clone(),
searcher: searcher.clone(),
query_parser: qp_single,
};
let multi_view = BenchIndex {
index,
searcher,
query_parser: qp_multi,
};
(single_view, multi_view)
}
fn main() {
// Prepare corpora with varying selectivity. Build one index per corpus
// and derive two views (single-field vs multi-field) from it.
let scenarios = vec![
(
"N=1M, p(a)=5%, p(b)=1%, p(c)=15%".to_string(),
1_000_000,
0.05,
0.01,
0.15,
),
(
"N=1M, p(a)=1%, p(b)=1%, p(c)=15%".to_string(),
1_000_000,
0.01,
0.01,
0.15,
),
];
let queries = &["a", "+a +b", "+a +b +c", "a OR b", "a OR b OR c"];
let mut runner = BenchRunner::new();
for (label, n, pa, pb, pc) in scenarios {
let (single_view, multi_view) = build_shared_indices(n, pa, pb, pc);
for (view_name, bench_index) in [("single_field", single_view), ("multi_field", multi_view)]
{
// Single-field group: default field is body only
let mut group = runner.new_group();
group.set_name(format!("{}{}", view_name, label));
for query_str in queries {
add_bench_task(&mut group, &bench_index, query_str, Count, "count");
add_bench_task(
&mut group,
&bench_index,
query_str,
TopDocs::with_limit(10).order_by_score(),
"top10",
);
add_bench_task(
&mut group,
&bench_index,
query_str,
TopDocs::with_limit(10).order_by_fast_field::<u64>("score", Order::Asc),
"top10_by_ff",
);
add_bench_task(
&mut group,
&bench_index,
query_str,
TopDocs::with_limit(10).order_by((
SortByStaticFastValue::<u64>::for_field("score"),
SortByStaticFastValue::<u64>::for_field("score2"),
)),
"top10_by_2ff",
);
}
group.run();
}
}
}
fn add_bench_task<C: Collector + 'static>(
bench_group: &mut BenchGroup,
bench_index: &BenchIndex,
query_str: &str,
collector: C,
collector_name: &str,
) {
let task_name = format!("{}_{}", query_str.replace(" ", "_"), collector_name);
let query = bench_index.query_parser.parse_query(query_str).unwrap();
let search_task = SearchTask {
searcher: bench_index.searcher.clone(),
collector,
query,
};
bench_group.register(task_name, move |_| black_box(search_task.run()));
}
struct SearchTask<C: Collector> {
searcher: Searcher,
collector: C,
query: Box<dyn Query>,
}
impl<C: Collector> SearchTask<C> {
#[inline(never)]
pub fn run(&self) -> usize {
self.searcher.search(&self.query, &self.collector).unwrap();
1
}
}

69
benches/exists_json.rs Normal file
View File

@@ -0,0 +1,69 @@
use binggan::plugins::PeakMemAllocPlugin;
use binggan::{black_box, InputGroup, PeakMemAlloc, INSTRUMENTED_SYSTEM};
use serde_json::json;
use tantivy::collector::Count;
use tantivy::query::ExistsQuery;
use tantivy::schema::{Schema, FAST, TEXT};
use tantivy::{doc, Index};
#[global_allocator]
pub static GLOBAL: &PeakMemAlloc<std::alloc::System> = &INSTRUMENTED_SYSTEM;
fn main() {
let doc_count: usize = 500_000;
let subfield_counts: &[usize] = &[1, 2, 3, 4, 5, 6, 7, 8, 16, 256, 4096, 65536, 262144];
let indices: Vec<(String, Index)> = subfield_counts
.iter()
.map(|&sub_fields| {
(
format!("subfields={sub_fields}"),
build_index_with_json_subfields(doc_count, sub_fields),
)
})
.collect();
let mut group = InputGroup::new_with_inputs(indices);
group.add_plugin(PeakMemAllocPlugin::new(GLOBAL));
group.config().num_iter_group = Some(1);
group.config().num_iter_bench = Some(1);
group.register("exists_json", exists_json_union);
group.run();
}
fn exists_json_union(index: &Index) {
let reader = index.reader().expect("reader");
let searcher = reader.searcher();
let query = ExistsQuery::new("json".to_string(), true);
let count = searcher.search(&query, &Count).expect("exists search");
// Prevents optimizer from eliding the search
black_box(count);
}
fn build_index_with_json_subfields(num_docs: usize, num_subfields: usize) -> Index {
// Schema: single JSON field stored as FAST to support ExistsQuery.
let mut schema_builder = Schema::builder();
let json_field = schema_builder.add_json_field("json", TEXT | FAST);
let schema = schema_builder.build();
let index = Index::create_from_tempdir(schema).expect("create index");
{
let mut index_writer = index
.writer_with_num_threads(1, 200_000_000)
.expect("writer");
for i in 0..num_docs {
let sub = i % num_subfields;
// Only one subpath set per document; rotate subpaths so that
// no single subpath is full, but the union covers all docs.
let v = json!({ format!("field_{sub}"): i as u64 });
index_writer
.add_document(doc!(json_field => v))
.expect("add_document");
}
index_writer.commit().expect("commit");
}
index
}

View File

@@ -1,7 +1,7 @@
[package]
name = "tantivy-bitpacker"
version = "0.6.0"
edition = "2021"
version = "0.9.0"
edition = "2024"
authors = ["Paul Masurel <paul.masurel@gmail.com>"]
license = "MIT"
categories = []

View File

@@ -48,7 +48,7 @@ impl BitPacker {
pub fn flush<TWrite: io::Write + ?Sized>(&mut self, output: &mut TWrite) -> io::Result<()> {
if self.mini_buffer_written > 0 {
let num_bytes = (self.mini_buffer_written + 7) / 8;
let num_bytes = self.mini_buffer_written.div_ceil(8);
let bytes = self.mini_buffer.to_le_bytes();
output.write_all(&bytes[..num_bytes])?;
self.mini_buffer_written = 0;
@@ -65,7 +65,7 @@ impl BitPacker {
#[derive(Clone, Debug, Default, Copy)]
pub struct BitUnpacker {
num_bits: u32,
num_bits: usize,
mask: u64,
}
@@ -83,7 +83,7 @@ impl BitUnpacker {
(1u64 << num_bits) - 1u64
};
BitUnpacker {
num_bits: u32::from(num_bits),
num_bits: usize::from(num_bits),
mask,
}
}
@@ -94,14 +94,14 @@ impl BitUnpacker {
#[inline]
pub fn get(&self, idx: u32, data: &[u8]) -> u64 {
let addr_in_bits = idx * self.num_bits;
let addr = (addr_in_bits >> 3) as usize;
let addr_in_bits = idx as usize * self.num_bits;
let addr = addr_in_bits >> 3;
if addr + 8 > data.len() {
if self.num_bits == 0 {
return 0;
}
let bit_shift = addr_in_bits & 7;
return self.get_slow_path(addr, bit_shift, data);
return self.get_slow_path(addr, bit_shift as u32, data);
}
let bit_shift = addr_in_bits & 7;
let bytes: [u8; 8] = (&data[addr..addr + 8]).try_into().unwrap();
@@ -134,12 +134,13 @@ impl BitUnpacker {
"Bitwidth must be <= 32 to use this method."
);
let end_idx = start_idx + output.len() as u32;
let end_idx: u32 = start_idx + output.len() as u32;
let end_bit_read = end_idx * self.num_bits;
let end_byte_read = (end_bit_read + 7) / 8;
// We use `usize` here to avoid overflow issues.
let end_bit_read = (end_idx as usize) * self.num_bits;
let end_byte_read = end_bit_read.div_ceil(8);
assert!(
end_byte_read as usize <= data.len(),
end_byte_read <= data.len(),
"Requested index is out of bounds."
);
@@ -159,24 +160,24 @@ impl BitUnpacker {
// We want the start of the fast track to start align with bytes.
// A sufficient condition is to start with an idx that is a multiple of 8,
// so highway start is the closest multiple of 8 that is >= start_idx.
let entrance_ramp_len = 8 - (start_idx % 8) % 8;
let entrance_ramp_len: u32 = 8 - (start_idx % 8) % 8;
let highway_start: u32 = start_idx + entrance_ramp_len;
if highway_start + BitPacker1x::BLOCK_LEN as u32 > end_idx {
if highway_start + (BitPacker1x::BLOCK_LEN as u32) > end_idx {
// We don't have enough values to have even a single block of highway.
// Let's just supply the values the simple way.
get_batch_ramp(start_idx, output);
return;
}
let num_blocks: u32 = (end_idx - highway_start) / BitPacker1x::BLOCK_LEN as u32;
let num_blocks: usize = (end_idx - highway_start) as usize / BitPacker1x::BLOCK_LEN;
// Entrance ramp
get_batch_ramp(start_idx, &mut output[..entrance_ramp_len as usize]);
// Highway
let mut offset = (highway_start * self.num_bits) as usize / 8;
let mut offset = (highway_start as usize * self.num_bits) / 8;
let mut output_cursor = (highway_start - start_idx) as usize;
for _ in 0..num_blocks {
offset += BitPacker1x.decompress(
@@ -188,7 +189,7 @@ impl BitUnpacker {
}
// Exit ramp
let highway_end = highway_start + num_blocks * BitPacker1x::BLOCK_LEN as u32;
let highway_end: u32 = highway_start + (num_blocks * BitPacker1x::BLOCK_LEN) as u32;
get_batch_ramp(highway_end, &mut output[output_cursor..]);
}
@@ -257,7 +258,7 @@ mod test {
bitpacker.write(val, num_bits, &mut data).unwrap();
}
bitpacker.close(&mut data).unwrap();
assert_eq!(data.len(), ((num_bits as usize) * len + 7) / 8);
assert_eq!(data.len(), ((num_bits as usize) * len).div_ceil(8));
let bitunpacker = BitUnpacker::new(num_bits);
(bitunpacker, vals, data)
}
@@ -303,7 +304,7 @@ mod test {
bitpacker.write(val, num_bits, &mut buffer).unwrap();
}
bitpacker.flush(&mut buffer).unwrap();
assert_eq!(buffer.len(), (vals.len() * num_bits as usize + 7) / 8);
assert_eq!(buffer.len(), (vals.len() * num_bits as usize).div_ceil(8));
let bitunpacker = BitUnpacker::new(num_bits);
let max_val = if num_bits == 64 {
u64::MAX
@@ -368,9 +369,9 @@ mod test {
for start_idx in 0u32..32u32 {
output.resize(len, 0);
bitunpacker.get_batch_u32s(start_idx, &buffer, &mut output);
for i in 0..len {
for (i, output_byte) in output.iter().enumerate() {
let expected = (start_idx + i as u32) & mask;
assert_eq!(output[i], expected);
assert_eq!(*output_byte, expected);
}
}
}

View File

@@ -1,6 +1,6 @@
use super::bitpacker::BitPacker;
use super::compute_num_bits;
use crate::{minmax, BitUnpacker};
use crate::{BitUnpacker, minmax};
const BLOCK_SIZE: usize = 128;
@@ -34,7 +34,7 @@ struct BlockedBitpackerEntryMetaData {
impl BlockedBitpackerEntryMetaData {
fn new(offset: u64, num_bits: u8, base_value: u64) -> Self {
let encoded = offset | (num_bits as u64) << (64 - 8);
let encoded = offset | (u64::from(num_bits) << (64 - 8));
Self {
encoded,
base_value,
@@ -140,10 +140,10 @@ impl BlockedBitpacker {
pub fn iter(&self) -> impl Iterator<Item = u64> + '_ {
// todo performance: we could decompress a whole block and cache it instead
let bitpacked_elems = self.offset_and_bits.len() * BLOCK_SIZE;
let iter = (0..bitpacked_elems)
(0..bitpacked_elems)
.map(move |idx| self.get(idx))
.chain(self.buffer.iter().cloned());
iter
.chain(self.buffer.iter().cloned())
}
}

View File

@@ -35,8 +35,8 @@ const IMPLS: [FilterImplPerInstructionSet; 2] = [
const IMPLS: [FilterImplPerInstructionSet; 1] = [FilterImplPerInstructionSet::Scalar];
impl FilterImplPerInstructionSet {
#[allow(unused_variables)]
#[inline]
#[allow(unused_variables)] // on non-x86_64, code is unused.
fn from(code: u8) -> FilterImplPerInstructionSet {
#[cfg(target_arch = "x86_64")]
if code == FilterImplPerInstructionSet::AVX2 as u8 {

View File

@@ -33,11 +33,7 @@ pub use crate::blocked_bitpacker::BlockedBitpacker;
/// number of bits.
pub fn compute_num_bits(n: u64) -> u8 {
let amplitude = (64u32 - n.leading_zeros()) as u8;
if amplitude <= 64 - 8 {
amplitude
} else {
64
}
if amplitude <= 64 - 8 { amplitude } else { 64 }
}
/// Computes the (min, max) of an iterator of `PartialOrd` values.

View File

@@ -16,14 +16,14 @@ body = """
{%- if version %} in {{ version }}{%- endif -%}
{% for commit in commits %}
{% if commit.github.pr_title -%}
{%- set commit_message = commit.github.pr_title -%}
{% if commit.remote.pr_title -%}
{%- set commit_message = commit.remote.pr_title -%}
{%- else -%}
{%- set commit_message = commit.message -%}
{%- endif -%}
- {{ commit_message | split(pat="\n") | first | trim }}\
{% if commit.github.pr_number %} \
[#{{ commit.github.pr_number }}]({{ self::remote_url() }}/pull/{{ commit.github.pr_number }}){% if commit.github.username %}(@{{ commit.github.username }}){%- endif -%} \
{% if commit.remote.pr_number %} \
[#{{ commit.remote.pr_number }}]({{ self::remote_url() }}/pull/{{ commit.remote.pr_number }}){% if commit.remote.username %}(@{{ commit.remote.username }}){%- endif -%} \
{%- endif %}
{%- endfor -%}

View File

@@ -1,7 +1,7 @@
[package]
name = "tantivy-columnar"
version = "0.3.0"
edition = "2021"
version = "0.6.0"
edition = "2024"
license = "MIT"
homepage = "https://github.com/quickwit-oss/tantivy"
repository = "https://github.com/quickwit-oss/tantivy"
@@ -9,21 +9,21 @@ description = "column oriented storage for tantivy"
categories = ["database-implementations", "data-structures", "compression"]
[dependencies]
itertools = "0.13.0"
itertools = "0.14.0"
fastdivide = "0.4.0"
stacker = { version= "0.3", path = "../stacker", package="tantivy-stacker"}
sstable = { version= "0.3", path = "../sstable", package = "tantivy-sstable" }
common = { version= "0.7", path = "../common", package = "tantivy-common" }
tantivy-bitpacker = { version= "0.6", path = "../bitpacker/" }
stacker = { version= "0.6", path = "../stacker", package="tantivy-stacker"}
sstable = { version= "0.6", path = "../sstable", package = "tantivy-sstable" }
common = { version= "0.10", path = "../common", package = "tantivy-common" }
tantivy-bitpacker = { version= "0.9", path = "../bitpacker/" }
serde = "1.0.152"
downcast-rs = "1.2.0"
downcast-rs = "2.0.1"
[dev-dependencies]
proptest = "1"
more-asserts = "0.3.1"
rand = "0.8"
binggan = "0.8.1"
binggan = "0.14.0"
[[bench]]
name = "bench_merge"
@@ -33,6 +33,29 @@ harness = false
name = "bench_access"
harness = false
[[bench]]
name = "bench_first_vals"
harness = false
[[bench]]
name = "bench_values_u64"
harness = false
[[bench]]
name = "bench_values_u128"
harness = false
[[bench]]
name = "bench_create_column_values"
harness = false
[[bench]]
name = "bench_column_values_get"
harness = false
[[bench]]
name = "bench_optional_index"
harness = false
[features]
unstable = []
zstd-compression = ["sstable/zstd-compression"]

View File

@@ -31,7 +31,7 @@ restriction on 50% of the values (e.g. a 64-bit hash). On the other hand, a lot
# Columnar format
This columnar format may have more than one column (with different types) associated to the same `column_name` (see [Coercion rules](#coercion-rules) above).
The `(column_name, columne_type)` couple however uniquely identifies a column.
The `(column_name, column_type)` couple however uniquely identifies a column.
That couple is serialized as a column `column_key`. The format of that key is:
`[column_name][ZERO_BYTE][column_type_header: u8]`
@@ -73,7 +73,7 @@ The crate introduces the following concepts.
`Columnar` is an equivalent of a dataframe.
It maps `column_key` to `Column`.
A `Column<T>` asssociates a `RowId` (u32) to any
A `Column<T>` associates a `RowId` (u32) to any
number of values.
This is made possible by wrapping a `ColumnIndex` and a `ColumnValue` object.

View File

@@ -1,4 +1,4 @@
use binggan::{black_box, InputGroup};
use binggan::{InputGroup, black_box};
use common::*;
use tantivy_columnar::Column;
@@ -19,7 +19,7 @@ fn main() {
let mut add_card = |card1: Card| {
inputs.push((
format!("{card1}"),
card1.to_string(),
generate_columnar_and_open(card1, NUM_DOCS),
));
};
@@ -50,6 +50,7 @@ fn bench_group(mut runner: InputGroup<Column>) {
let mut buffer = vec![None; BLOCK_SIZE];
for i in (0..NUM_DOCS).step_by(BLOCK_SIZE) {
// fill docs
#[allow(clippy::needless_range_loop)]
for idx in 0..BLOCK_SIZE {
docs[idx] = idx as u32 + i;
}

View File

@@ -0,0 +1,61 @@
use std::sync::Arc;
use binggan::{InputGroup, black_box};
use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};
use tantivy_columnar::ColumnValues;
use tantivy_columnar::column_values::{CodecType, serialize_and_load_u64_based_column_values};
fn get_data() -> Vec<u64> {
let mut rng = StdRng::seed_from_u64(2u64);
let mut data: Vec<_> = (100..55_000_u64)
.map(|num| num + rng.r#gen::<u8>() as u64)
.collect();
data.push(99_000);
data.insert(1000, 2000);
data.insert(2000, 100);
data.insert(3000, 4100);
data.insert(4000, 100);
data.insert(5000, 800);
data
}
#[inline(never)]
fn value_iter() -> impl Iterator<Item = u64> {
0..20_000
}
type Col = Arc<dyn ColumnValues<u64>>;
fn main() {
let data = get_data();
let inputs: Vec<(String, Col)> = vec![
(
"bitpacked".to_string(),
serialize_and_load_u64_based_column_values(&data.as_slice(), &[CodecType::Bitpacked]),
),
(
"linear".to_string(),
serialize_and_load_u64_based_column_values(&data.as_slice(), &[CodecType::Linear]),
),
(
"blockwise_linear".to_string(),
serialize_and_load_u64_based_column_values(
&data.as_slice(),
&[CodecType::BlockwiseLinear],
),
),
];
let mut group: InputGroup<Col> = InputGroup::new_with_inputs(inputs);
group.register("fastfield_get", |col: &Col| {
let mut sum = 0u64;
for pos in value_iter() {
sum = sum.wrapping_add(col.get_val(pos as u32));
}
black_box(sum);
});
group.run();
}

View File

@@ -0,0 +1,44 @@
use binggan::{InputGroup, black_box};
use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};
use tantivy_columnar::column_values::{CodecType, serialize_u64_based_column_values};
fn get_data() -> Vec<u64> {
let mut rng = StdRng::seed_from_u64(2u64);
let mut data: Vec<_> = (100..55_000_u64)
.map(|num| num + rng.r#gen::<u8>() as u64)
.collect();
data.push(99_000);
data.insert(1000, 2000);
data.insert(2000, 100);
data.insert(3000, 4100);
data.insert(4000, 100);
data.insert(5000, 800);
data
}
fn main() {
let data = get_data();
let mut group: InputGroup<(CodecType, Vec<u64>)> = InputGroup::new_with_inputs(vec![
(
"bitpacked codec".to_string(),
(CodecType::Bitpacked, data.clone()),
),
(
"linear codec".to_string(),
(CodecType::Linear, data.clone()),
),
(
"blockwise linear codec".to_string(),
(CodecType::BlockwiseLinear, data.clone()),
),
]);
group.register("serialize column_values", |data| {
let mut buffer = Vec::new();
serialize_u64_based_column_values(&data.1.as_slice(), &[data.0], &mut buffer).unwrap();
black_box(buffer.len());
});
group.run();
}

View File

@@ -1,12 +1,9 @@
#![feature(test)]
extern crate test;
use std::sync::Arc;
use binggan::{InputGroup, black_box};
use rand::prelude::*;
use tantivy_columnar::column_values::{serialize_and_load_u64_based_column_values, CodecType};
use tantivy_columnar::column_values::{CodecType, serialize_and_load_u64_based_column_values};
use tantivy_columnar::*;
use test::{black_box, Bencher};
struct Columns {
pub optional: Column,
@@ -68,88 +65,38 @@ pub fn serialize_and_load(column: &[u64], codec_type: CodecType) -> Arc<dyn Colu
serialize_and_load_u64_based_column_values(&column, &[codec_type])
}
fn run_bench_on_column_full_scan(b: &mut Bencher, column: Column) {
let num_iter = black_box(NUM_VALUES);
b.iter(|| {
fn main() {
let Columns {
optional,
full,
multi,
} = get_test_columns();
let inputs = vec![
("full".to_string(), full),
("optional".to_string(), optional),
("multi".to_string(), multi),
];
let mut group = InputGroup::new_with_inputs(inputs);
group.register("first_full_scan", |column| {
let mut sum = 0u64;
for i in 0..num_iter as u32 {
for i in 0..NUM_VALUES as u32 {
let val = column.first(i);
sum += val.unwrap_or(0);
}
sum
black_box(sum);
});
}
fn run_bench_on_column_block_fetch(b: &mut Bencher, column: Column) {
let mut block: Vec<Option<u64>> = vec![None; 64];
let fetch_docids = (0..64).collect::<Vec<_>>();
b.iter(move || {
column.first_vals(&fetch_docids, &mut block);
block[0]
});
}
fn run_bench_on_column_block_single_calls(b: &mut Bencher, column: Column) {
let mut block: Vec<Option<u64>> = vec![None; 64];
let fetch_docids = (0..64).collect::<Vec<_>>();
b.iter(move || {
group.register("first_block_single_calls", |column| {
let mut block: Vec<Option<u64>> = vec![None; 64];
let fetch_docids = (0..64).collect::<Vec<_>>();
for i in 0..fetch_docids.len() {
block[i] = column.first(fetch_docids[i]);
}
block[0]
black_box(block[0]);
});
}
/// Column first method
#[bench]
fn bench_get_first_on_full_column_full_scan(b: &mut Bencher) {
let column = get_test_columns().full;
run_bench_on_column_full_scan(b, column);
}
#[bench]
fn bench_get_first_on_optional_column_full_scan(b: &mut Bencher) {
let column = get_test_columns().optional;
run_bench_on_column_full_scan(b, column);
}
#[bench]
fn bench_get_first_on_multi_column_full_scan(b: &mut Bencher) {
let column = get_test_columns().multi;
run_bench_on_column_full_scan(b, column);
}
/// Block fetch column accessor
#[bench]
fn bench_get_block_first_on_optional_column(b: &mut Bencher) {
let column = get_test_columns().optional;
run_bench_on_column_block_fetch(b, column);
}
#[bench]
fn bench_get_block_first_on_multi_column(b: &mut Bencher) {
let column = get_test_columns().multi;
run_bench_on_column_block_fetch(b, column);
}
#[bench]
fn bench_get_block_first_on_full_column(b: &mut Bencher) {
let column = get_test_columns().full;
run_bench_on_column_block_fetch(b, column);
}
#[bench]
fn bench_get_block_first_on_optional_column_single_calls(b: &mut Bencher) {
let column = get_test_columns().optional;
run_bench_on_column_block_single_calls(b, column);
}
#[bench]
fn bench_get_block_first_on_multi_column_single_calls(b: &mut Bencher) {
let column = get_test_columns().multi;
run_bench_on_column_block_single_calls(b, column);
}
#[bench]
fn bench_get_block_first_on_full_column_single_calls(b: &mut Bencher) {
let column = get_test_columns().full;
run_bench_on_column_block_single_calls(b, column);
group.run();
}

View File

@@ -1,7 +1,7 @@
pub mod common;
use binggan::{black_box, BenchRunner};
use common::{generate_columnar_with_name, Card};
use binggan::BenchRunner;
use common::{Card, generate_columnar_with_name};
use tantivy_columnar::*;
const NUM_DOCS: u32 = 100_000;
@@ -29,7 +29,7 @@ fn main() {
add_combo(Card::Multi, Card::Dense);
add_combo(Card::Multi, Card::Sparse);
let runner: BenchRunner = BenchRunner::new();
let mut runner: BenchRunner = BenchRunner::new();
let mut group = runner.new_group();
for (input_name, columnar_readers) in inputs.iter() {
group.register_with_input(
@@ -41,7 +41,7 @@ fn main() {
let merge_row_order = StackMergeOrder::stack(&columnar_readers[..]);
merge_columnar(&columnar_readers, &[], merge_row_order.into(), &mut out).unwrap();
black_box(out);
Some(out.len() as u64)
},
);
}

View File

@@ -0,0 +1,106 @@
use binggan::{InputGroup, black_box};
use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};
use tantivy_columnar::column_index::{OptionalIndex, Set};
const TOTAL_NUM_VALUES: u32 = 1_000_000;
fn gen_optional_index(fill_ratio: f64) -> OptionalIndex {
let mut rng: StdRng = StdRng::from_seed([1u8; 32]);
let vals: Vec<u32> = (0..TOTAL_NUM_VALUES)
.map(|_| rng.gen_bool(fill_ratio))
.enumerate()
.filter(|(_pos, val)| *val)
.map(|(pos, _)| pos as u32)
.collect();
OptionalIndex::for_test(TOTAL_NUM_VALUES, &vals)
}
fn random_range_iterator(
start: u32,
end: u32,
avg_step_size: u32,
avg_deviation: u32,
) -> impl Iterator<Item = u32> {
let mut rng: StdRng = StdRng::from_seed([1u8; 32]);
let mut current = start;
std::iter::from_fn(move || {
current += rng.gen_range(avg_step_size - avg_deviation..=avg_step_size + avg_deviation);
if current >= end { None } else { Some(current) }
})
}
fn n_percent_step_iterator(percent: f32, num_values: u32) -> impl Iterator<Item = u32> {
let ratio = percent / 100.0;
let step_size = (1f32 / ratio) as u32;
let deviation = step_size - 1;
random_range_iterator(0, num_values, step_size, deviation)
}
fn walk_over_data(codec: &OptionalIndex, avg_step_size: u32) -> Option<u32> {
walk_over_data_from_positions(
codec,
random_range_iterator(0, TOTAL_NUM_VALUES, avg_step_size, 0),
)
}
fn walk_over_data_from_positions(
codec: &OptionalIndex,
positions: impl Iterator<Item = u32>,
) -> Option<u32> {
let mut dense_idx: Option<u32> = None;
for idx in positions {
dense_idx = dense_idx.or(codec.rank_if_exists(idx));
}
dense_idx
}
fn main() {
// Build separate inputs for each fill ratio.
let inputs: Vec<(String, OptionalIndex)> = vec![
("fill=1%".to_string(), gen_optional_index(0.01)),
("fill=5%".to_string(), gen_optional_index(0.05)),
("fill=10%".to_string(), gen_optional_index(0.10)),
("fill=50%".to_string(), gen_optional_index(0.50)),
("fill=90%".to_string(), gen_optional_index(0.90)),
];
let mut group: InputGroup<OptionalIndex> = InputGroup::new_with_inputs(inputs);
// Translate orig->codec (rank_if_exists) with sampling
group.register("orig_to_codec_10pct_hit", |codec: &OptionalIndex| {
black_box(walk_over_data(codec, 100));
});
group.register("orig_to_codec_1pct_hit", |codec: &OptionalIndex| {
black_box(walk_over_data(codec, 1000));
});
group.register("orig_to_codec_full_scan", |codec: &OptionalIndex| {
black_box(walk_over_data_from_positions(codec, 0..TOTAL_NUM_VALUES));
});
// Translate codec->orig (select/select_batch) on sampled ranks
fn bench_translate_codec_to_orig_util(codec: &OptionalIndex, percent_hit: f32) {
let num_non_nulls = codec.num_non_nulls();
let idxs: Vec<u32> = if percent_hit == 100.0f32 {
(0..num_non_nulls).collect()
} else {
n_percent_step_iterator(percent_hit, num_non_nulls).collect()
};
let mut output = vec![0u32; idxs.len()];
output.copy_from_slice(&idxs[..]);
codec.select_batch(&mut output);
black_box(output);
}
group.register("codec_to_orig_0.005pct_hit", |codec: &OptionalIndex| {
bench_translate_codec_to_orig_util(codec, 0.005);
});
group.register("codec_to_orig_10pct_hit", |codec: &OptionalIndex| {
bench_translate_codec_to_orig_util(codec, 10.0);
});
group.register("codec_to_orig_full_scan", |codec: &OptionalIndex| {
bench_translate_codec_to_orig_util(codec, 100.0);
});
group.run();
}

View File

@@ -1,15 +1,12 @@
#![feature(test)]
use std::ops::RangeInclusive;
use std::sync::Arc;
use binggan::{InputGroup, black_box};
use common::OwnedBytes;
use rand::rngs::StdRng;
use rand::seq::SliceRandom;
use rand::{random, Rng, SeedableRng};
use rand::{Rng, SeedableRng, random};
use tantivy_columnar::ColumnValues;
use test::Bencher;
extern crate test;
// TODO does this make sense for IPv6 ?
fn generate_random() -> Vec<u64> {
@@ -47,78 +44,77 @@ fn get_data_50percent_item() -> Vec<u128> {
}
data.push(SINGLE_ITEM);
data.shuffle(&mut rng);
let data = data.iter().map(|el| *el as u128).collect::<Vec<_>>();
data
data.iter().map(|el| *el as u128).collect::<Vec<_>>()
}
#[bench]
fn bench_intfastfield_getrange_u128_50percent_hit(b: &mut Bencher) {
fn main() {
let data = get_data_50percent_item();
let column = get_u128_column_from_data(&data);
let column_range = get_u128_column_from_data(&data);
let column_random = get_u128_column_random();
b.iter(|| {
struct Inputs {
data: Vec<u128>,
column_range: Arc<dyn ColumnValues<u128>>,
column_random: Arc<dyn ColumnValues<u128>>,
}
let inputs = Inputs {
data,
column_range,
column_random,
};
let mut group: InputGroup<Inputs> =
InputGroup::new_with_inputs(vec![("u128 benches".to_string(), inputs)]);
group.register(
"intfastfield_getrange_u128_50percent_hit",
|inp: &Inputs| {
let mut positions = Vec::new();
inp.column_range.get_row_ids_for_value_range(
*FIFTY_PERCENT_RANGE.start() as u128..=*FIFTY_PERCENT_RANGE.end() as u128,
0..inp.data.len() as u32,
&mut positions,
);
black_box(positions.len());
},
);
group.register("intfastfield_getrange_u128_single_hit", |inp: &Inputs| {
let mut positions = Vec::new();
column.get_row_ids_for_value_range(
*FIFTY_PERCENT_RANGE.start() as u128..=*FIFTY_PERCENT_RANGE.end() as u128,
0..data.len() as u32,
&mut positions,
);
positions
});
}
#[bench]
fn bench_intfastfield_getrange_u128_single_hit(b: &mut Bencher) {
let data = get_data_50percent_item();
let column = get_u128_column_from_data(&data);
b.iter(|| {
let mut positions = Vec::new();
column.get_row_ids_for_value_range(
inp.column_range.get_row_ids_for_value_range(
*SINGLE_ITEM_RANGE.start() as u128..=*SINGLE_ITEM_RANGE.end() as u128,
0..data.len() as u32,
0..inp.data.len() as u32,
&mut positions,
);
positions
black_box(positions.len());
});
}
#[bench]
fn bench_intfastfield_getrange_u128_hit_all(b: &mut Bencher) {
let data = get_data_50percent_item();
let column = get_u128_column_from_data(&data);
b.iter(|| {
group.register("intfastfield_getrange_u128_hit_all", |inp: &Inputs| {
let mut positions = Vec::new();
column.get_row_ids_for_value_range(0..=u128::MAX, 0..data.len() as u32, &mut positions);
positions
inp.column_range.get_row_ids_for_value_range(
0..=u128::MAX,
0..inp.data.len() as u32,
&mut positions,
);
black_box(positions.len());
});
}
// U128 RANGE END
#[bench]
fn bench_intfastfield_scan_all_fflookup_u128(b: &mut Bencher) {
let column = get_u128_column_random();
b.iter(|| {
group.register("intfastfield_scan_all_fflookup_u128", |inp: &Inputs| {
let mut a = 0u128;
for i in 0u64..column.num_vals() as u64 {
a += column.get_val(i as u32);
for i in 0u64..inp.column_random.num_vals() as u64 {
a += inp.column_random.get_val(i as u32);
}
a
black_box(a);
});
}
#[bench]
fn bench_intfastfield_jumpy_stride5_u128(b: &mut Bencher) {
let column = get_u128_column_random();
b.iter(|| {
let n = column.num_vals();
group.register("intfastfield_jumpy_stride5_u128", |inp: &Inputs| {
let n = inp.column_random.num_vals();
let mut a = 0u128;
for i in (0..n / 5).map(|val| val * 5) {
a += column.get_val(i);
a += inp.column_random.get_val(i);
}
a
black_box(a);
});
group.run();
}

View File

@@ -1,13 +1,10 @@
#![feature(test)]
extern crate test;
use std::ops::RangeInclusive;
use std::sync::Arc;
use binggan::{InputGroup, black_box};
use rand::prelude::*;
use tantivy_columnar::column_values::{serialize_and_load_u64_based_column_values, CodecType};
use tantivy_columnar::column_values::{CodecType, serialize_and_load_u64_based_column_values};
use tantivy_columnar::*;
use test::Bencher;
// Warning: this generates the same permutation at each call
fn generate_permutation() -> Vec<u64> {
@@ -27,37 +24,11 @@ pub fn serialize_and_load(column: &[u64], codec_type: CodecType) -> Arc<dyn Colu
serialize_and_load_u64_based_column_values(&column, &[codec_type])
}
#[bench]
fn bench_intfastfield_jumpy_veclookup(b: &mut Bencher) {
let permutation = generate_permutation();
let n = permutation.len();
b.iter(|| {
let mut a = 0u64;
for _ in 0..n {
a = permutation[a as usize];
}
a
});
}
#[bench]
fn bench_intfastfield_jumpy_fflookup_bitpacked(b: &mut Bencher) {
let permutation = generate_permutation();
let n = permutation.len();
let column: Arc<dyn ColumnValues<u64>> = serialize_and_load(&permutation, CodecType::Bitpacked);
b.iter(|| {
let mut a = 0u64;
for _ in 0..n {
a = column.get_val(a as u32);
}
a
});
}
const FIFTY_PERCENT_RANGE: RangeInclusive<u64> = 1..=50;
const SINGLE_ITEM: u64 = 90;
const SINGLE_ITEM_RANGE: RangeInclusive<u64> = 90..=90;
const ONE_PERCENT_ITEM_RANGE: RangeInclusive<u64> = 49..=49;
fn get_data_50percent_item() -> Vec<u128> {
let mut rng = StdRng::from_seed([1u8; 32]);
@@ -69,135 +40,122 @@ fn get_data_50percent_item() -> Vec<u128> {
data.push(SINGLE_ITEM);
data.shuffle(&mut rng);
let data = data.iter().map(|el| *el as u128).collect::<Vec<_>>();
data
data.iter().map(|el| *el as u128).collect::<Vec<_>>()
}
// U64 RANGE START
#[bench]
fn bench_intfastfield_getrange_u64_50percent_hit(b: &mut Bencher) {
let data = get_data_50percent_item();
let data = data.iter().map(|el| *el as u64).collect::<Vec<_>>();
let column: Arc<dyn ColumnValues<u64>> = serialize_and_load(&data, CodecType::Bitpacked);
b.iter(|| {
let mut positions = Vec::new();
column.get_row_ids_for_value_range(
FIFTY_PERCENT_RANGE,
0..data.len() as u32,
&mut positions,
);
positions
});
}
type VecCol = (Vec<u64>, Arc<dyn ColumnValues<u64>>);
#[bench]
fn bench_intfastfield_getrange_u64_1percent_hit(b: &mut Bencher) {
let data = get_data_50percent_item();
let data = data.iter().map(|el| *el as u64).collect::<Vec<_>>();
let column: Arc<dyn ColumnValues<u64>> = serialize_and_load(&data, CodecType::Bitpacked);
b.iter(|| {
let mut positions = Vec::new();
column.get_row_ids_for_value_range(
ONE_PERCENT_ITEM_RANGE,
0..data.len() as u32,
&mut positions,
);
positions
});
}
#[bench]
fn bench_intfastfield_getrange_u64_single_hit(b: &mut Bencher) {
let data = get_data_50percent_item();
let data = data.iter().map(|el| *el as u64).collect::<Vec<_>>();
let column: Arc<dyn ColumnValues<u64>> = serialize_and_load(&data, CodecType::Bitpacked);
b.iter(|| {
let mut positions = Vec::new();
column.get_row_ids_for_value_range(SINGLE_ITEM_RANGE, 0..data.len() as u32, &mut positions);
positions
});
}
#[bench]
fn bench_intfastfield_getrange_u64_hit_all(b: &mut Bencher) {
let data = get_data_50percent_item();
let data = data.iter().map(|el| *el as u64).collect::<Vec<_>>();
let column: Arc<dyn ColumnValues<u64>> = serialize_and_load(&data, CodecType::Bitpacked);
b.iter(|| {
let mut positions = Vec::new();
column.get_row_ids_for_value_range(0..=u64::MAX, 0..data.len() as u32, &mut positions);
positions
});
}
// U64 RANGE END
#[bench]
fn bench_intfastfield_stride7_vec(b: &mut Bencher) {
fn bench_access() {
let permutation = generate_permutation();
let n = permutation.len();
b.iter(|| {
let column_perm: Arc<dyn ColumnValues<u64>> =
serialize_and_load(&permutation, CodecType::Bitpacked);
let permutation_gcd = generate_permutation_gcd();
let column_perm_gcd: Arc<dyn ColumnValues<u64>> =
serialize_and_load(&permutation_gcd, CodecType::Bitpacked);
let mut group: InputGroup<VecCol> = InputGroup::new_with_inputs(vec![
(
"access".to_string(),
(permutation.clone(), column_perm.clone()),
),
(
"access_gcd".to_string(),
(permutation_gcd.clone(), column_perm_gcd.clone()),
),
]);
group.register("stride7_vec", |inp: &VecCol| {
let n = inp.0.len();
let mut a = 0u64;
for i in (0..n / 7).map(|val| val * 7) {
a += permutation[i as usize];
a += inp.0[i];
}
a
black_box(a);
});
}
#[bench]
fn bench_intfastfield_stride7_fflookup(b: &mut Bencher) {
let permutation = generate_permutation();
let n = permutation.len();
let column: Arc<dyn ColumnValues<u64>> = serialize_and_load(&permutation, CodecType::Bitpacked);
b.iter(|| {
let mut a = 0;
group.register("fullscan_vec", |inp: &VecCol| {
let mut a = 0u64;
for i in 0..inp.0.len() {
a += inp.0[i];
}
black_box(a);
});
group.register("stride7_column_values", |inp: &VecCol| {
let n = inp.1.num_vals() as usize;
let mut a = 0u64;
for i in (0..n / 7).map(|val| val * 7) {
a += column.get_val(i as u32);
a += inp.1.get_val(i as u32);
}
a
black_box(a);
});
}
#[bench]
fn bench_intfastfield_scan_all_fflookup(b: &mut Bencher) {
let permutation = generate_permutation();
let n = permutation.len();
let column: Arc<dyn ColumnValues<u64>> = serialize_and_load(&permutation, CodecType::Bitpacked);
let column_ref = column.as_ref();
b.iter(|| {
let mut a = 0u64;
for i in 0u32..n as u32 {
a += column_ref.get_val(i);
}
a
});
}
#[bench]
fn bench_intfastfield_scan_all_fflookup_gcd(b: &mut Bencher) {
let permutation = generate_permutation_gcd();
let n = permutation.len();
let column: Arc<dyn ColumnValues<u64>> = serialize_and_load(&permutation, CodecType::Bitpacked);
b.iter(|| {
group.register("fullscan_column_values", |inp: &VecCol| {
let mut a = 0u64;
let n = inp.1.num_vals() as usize;
for i in 0..n {
a += column.get_val(i as u32);
a += inp.1.get_val(i as u32);
}
a
black_box(a);
});
group.run();
}
#[bench]
fn bench_intfastfield_scan_all_vec(b: &mut Bencher) {
let permutation = generate_permutation();
b.iter(|| {
let mut a = 0u64;
for i in 0..permutation.len() {
a += permutation[i as usize] as u64;
}
a
});
fn bench_range() {
let data_50 = get_data_50percent_item();
let data_u64 = data_50.iter().map(|el| *el as u64).collect::<Vec<_>>();
let column_data: Arc<dyn ColumnValues<u64>> =
serialize_and_load(&data_u64, CodecType::Bitpacked);
let mut group: InputGroup<Arc<dyn ColumnValues<u64>>> =
InputGroup::new_with_inputs(vec![("dist_50pct_item".to_string(), column_data.clone())]);
group.register(
"fastfield_getrange_u64_50percent_hit",
|col: &Arc<dyn ColumnValues<u64>>| {
let mut positions = Vec::new();
col.get_row_ids_for_value_range(FIFTY_PERCENT_RANGE, 0..col.num_vals(), &mut positions);
black_box(positions.len());
},
);
group.register(
"fastfield_getrange_u64_1percent_hit",
|col: &Arc<dyn ColumnValues<u64>>| {
let mut positions = Vec::new();
col.get_row_ids_for_value_range(
ONE_PERCENT_ITEM_RANGE,
0..col.num_vals(),
&mut positions,
);
black_box(positions.len());
},
);
group.register(
"fastfield_getrange_u64_single_hit",
|col: &Arc<dyn ColumnValues<u64>>| {
let mut positions = Vec::new();
col.get_row_ids_for_value_range(SINGLE_ITEM_RANGE, 0..col.num_vals(), &mut positions);
black_box(positions.len());
},
);
group.register(
"fastfield_getrange_u64_hit_all",
|col: &Arc<dyn ColumnValues<u64>>| {
let mut positions = Vec::new();
col.get_row_ids_for_value_range(0..=u64::MAX, 0..col.num_vals(), &mut positions);
black_box(positions.len());
},
);
group.run();
}
fn main() {
bench_access();
bench_range();
}

View File

@@ -0,0 +1,18 @@
[package]
name = "tantivy-columnar-inspect"
version = "0.1.0"
edition = "2021"
license = "MIT"
[dependencies]
tantivy = {path="../..", package="tantivy"}
columnar = {path="../", package="tantivy-columnar"}
common = {path="../../common", package="tantivy-common"}
[workspace]
members = []
[profile.release]
debug = true
#debug-assertions = true
#overflow-checks = true

View File

@@ -0,0 +1,54 @@
use columnar::ColumnarReader;
use common::file_slice::{FileSlice, WrapFile};
use std::io;
use std::path::Path;
use tantivy::directory::footer::Footer;
fn main() -> io::Result<()> {
println!("Opens a columnar file written by tantivy and validates it.");
let path = std::env::args().nth(1).unwrap();
let path = Path::new(&path);
println!("Reading {:?}", path);
let _reader = open_and_validate_columnar(path.to_str().unwrap())?;
Ok(())
}
pub fn validate_columnar_reader(reader: &ColumnarReader) {
let num_rows = reader.num_rows();
println!("num_rows: {}", num_rows);
let columns = reader.list_columns().unwrap();
println!("num columns: {:?}", columns.len());
for (col_name, dynamic_column_handle) in columns {
let col = dynamic_column_handle.open().unwrap();
match col {
columnar::DynamicColumn::Bool(_)
| columnar::DynamicColumn::I64(_)
| columnar::DynamicColumn::U64(_)
| columnar::DynamicColumn::F64(_)
| columnar::DynamicColumn::IpAddr(_)
| columnar::DynamicColumn::DateTime(_)
| columnar::DynamicColumn::Bytes(_) => {}
columnar::DynamicColumn::Str(str_column) => {
let num_vals = str_column.ords().values.num_vals();
let num_terms_dict = str_column.num_terms() as u64;
let max_ord = str_column.ords().values.iter().max().unwrap_or_default();
println!("{col_name:35} num_vals {num_vals:10} \t num_terms_dict {num_terms_dict:8} max_ord: {max_ord:8}",);
for ord in str_column.ords().values.iter() {
assert!(ord < num_terms_dict);
}
}
}
}
}
/// Opens a columnar file that was written by tantivy and validates it.
pub fn open_and_validate_columnar(path: &str) -> io::Result<ColumnarReader> {
let wrap_file = WrapFile::new(std::fs::File::open(path)?)?;
let slice = FileSlice::new(std::sync::Arc::new(wrap_file));
let (_footer, slice) = Footer::extract_footer(slice.clone()).unwrap();
let reader = ColumnarReader::open(slice).unwrap();
validate_columnar_reader(&reader);
Ok(reader)
}

View File

@@ -10,7 +10,7 @@
# Perf and Size
* remove alloc in `ord_to_term`
+ multivaued range queries restrat frm the beginning all of the time.
+ multivaued range queries restart from the beginning all of the time.
* re-add ZSTD compression for dictionaries
no systematic monotonic mapping
consider removing multilinear
@@ -30,7 +30,7 @@ investigate if should have better errors? io::Error is overused at the moment.
rename rank/select in unit tests
Review the public API via cargo doc
go through TODOs
remove all doc_id occurences -> row_id
remove all doc_id occurrences -> row_id
use the rank & select naming in unit tests branch.
multi-linear -> blockwise
linear codec -> simply a multiplication for the index column
@@ -43,5 +43,5 @@ isolate u128_based and uniform naming
# Other
fix enhance column-cli
# Santa claus
# Santa Claus
autodetect datetime ipaddr, plug customizable tokenizer.

View File

@@ -66,7 +66,7 @@ impl<T: PartialOrd + Copy + std::fmt::Debug + Send + Sync + 'static + Default>
&'a self,
docs: &'a [u32],
accessor: &Column<T>,
) -> impl Iterator<Item = (DocId, T)> + '_ {
) -> impl Iterator<Item = (DocId, T)> + 'a + use<'a, T> {
if accessor.index.get_cardinality().is_full() {
docs.iter().cloned().zip(self.val_cache.iter().cloned())
} else {
@@ -139,7 +139,7 @@ mod tests {
missing_docs.push(missing_doc);
});
assert_eq!(missing_docs, vec![]);
assert_eq!(missing_docs, Vec::<u32>::new());
}
#[test]

View File

@@ -4,8 +4,8 @@ use std::{fmt, io};
use sstable::{Dictionary, VoidSSTable};
use crate::column::Column;
use crate::RowId;
use crate::column::Column;
/// Dictionary encoded column.
///

View File

@@ -9,13 +9,14 @@ use std::sync::Arc;
use common::BinarySerializable;
pub use dictionary_encoded::{BytesColumn, StrColumn};
pub use serialize::{
open_column_bytes, open_column_str, open_column_u128, open_column_u128_as_compact_u64,
open_column_u64, serialize_column_mappable_to_u128, serialize_column_mappable_to_u64,
open_column_bytes, open_column_str, open_column_u64, open_column_u128,
open_column_u128_as_compact_u64, serialize_column_mappable_to_u64,
serialize_column_mappable_to_u128,
};
use crate::column_index::{ColumnIndex, Set};
use crate::column_values::monotonic_mapping::StrictlyMonotonicMappingToInternal;
use crate::column_values::{monotonic_map_column, ColumnValues};
use crate::column_values::{ColumnValues, monotonic_map_column};
use crate::{Cardinality, DocId, EmptyColumnValues, MonotonicallyMappableToU64, RowId};
#[derive(Clone)]
@@ -113,7 +114,7 @@ impl<T: PartialOrd + Copy + Debug + Send + Sync + 'static> Column<T> {
}
}
/// Translates a block of docis to row_ids.
/// Translates a block of docids to row_ids.
///
/// returns the row_ids and the matching docids on the same index
/// e.g.
@@ -130,6 +131,8 @@ impl<T: PartialOrd + Copy + Debug + Send + Sync + 'static> Column<T> {
self.index.docids_to_rowids(doc_ids, doc_ids_out, row_ids)
}
/// Get an iterator over the values for the provided docid.
#[inline]
pub fn values_for_doc(&self, doc_id: DocId) -> impl Iterator<Item = T> + '_ {
self.index
.value_row_ids(doc_id)
@@ -157,15 +160,6 @@ impl<T: PartialOrd + Copy + Debug + Send + Sync + 'static> Column<T> {
.select_batch_in_place(selected_docid_range.start, doc_ids);
}
/// Fills the output vector with the (possibly multiple values that are associated_with
/// `row_id`.
///
/// This method clears the `output` vector.
pub fn fill_vals(&self, row_id: RowId, output: &mut Vec<T>) {
output.clear();
output.extend(self.values_for_doc(row_id));
}
pub fn first_or_default_col(self, default_value: T) -> Arc<dyn ColumnValues<T>> {
Arc::new(FirstValueWithDefault {
column: self,

View File

@@ -6,10 +6,10 @@ use common::OwnedBytes;
use sstable::Dictionary;
use crate::column::{BytesColumn, Column};
use crate::column_index::{serialize_column_index, SerializableColumnIndex};
use crate::column_index::{SerializableColumnIndex, serialize_column_index};
use crate::column_values::{
CodecType, MonotonicallyMappableToU64, MonotonicallyMappableToU128,
load_u64_based_column_values, serialize_column_values_u128, serialize_u64_based_column_values,
CodecType, MonotonicallyMappableToU128, MonotonicallyMappableToU64,
};
use crate::iterable::Iterable;
use crate::{StrColumn, Version};

View File

@@ -99,9 +99,9 @@ mod tests {
use crate::column_index::merge::detect_cardinality;
use crate::column_index::multivalued_index::{
open_multivalued_index, serialize_multivalued_index, MultiValueIndex,
MultiValueIndex, open_multivalued_index, serialize_multivalued_index,
};
use crate::column_index::{merge_column_index, OptionalIndex, SerializableColumnIndex};
use crate::column_index::{OptionalIndex, SerializableColumnIndex, merge_column_index};
use crate::{
Cardinality, ColumnIndex, MergeRowOrder, RowAddr, RowId, ShuffleMergeOrder, StackMergeOrder,
};
@@ -173,7 +173,7 @@ mod tests {
.into();
let merged_column_index = merge_column_index(&column_indexes[..], &merge_row_order);
let SerializableColumnIndex::Multivalued(start_index_iterable) = merged_column_index else {
panic!("Excpected a multivalued index")
panic!("Expected a multivalued index")
};
let mut output = Vec::new();
serialize_multivalued_index(&start_index_iterable, &mut output).unwrap();
@@ -211,7 +211,7 @@ mod tests {
let merged_column_index = merge_column_index(&column_indexes[..], &merge_row_order);
let SerializableColumnIndex::Multivalued(start_index_iterable) = merged_column_index else {
panic!("Excpected a multivalued index")
panic!("Expected a multivalued index")
};
let mut output = Vec::new();
serialize_multivalued_index(&start_index_iterable, &mut output).unwrap();

View File

@@ -58,7 +58,7 @@ struct ShuffledIndex<'a> {
merge_order: &'a ShuffleMergeOrder,
}
impl<'a> Iterable<u32> for ShuffledIndex<'a> {
impl Iterable<u32> for ShuffledIndex<'_> {
fn boxed_iter(&self) -> Box<dyn Iterator<Item = u32> + '_> {
Box::new(
self.merge_order
@@ -127,7 +127,7 @@ fn integrate_num_vals(num_vals: impl Iterator<Item = u32>) -> impl Iterator<Item
)
}
impl<'a> Iterable<u32> for ShuffledMultivaluedIndex<'a> {
impl Iterable<u32> for ShuffledMultivaluedIndex<'_> {
fn boxed_iter(&self) -> Box<dyn Iterator<Item = u32> + '_> {
let num_vals_per_row = iter_num_values(self.column_indexes, self.merge_order);
Box::new(integrate_num_vals(num_vals_per_row))
@@ -137,8 +137,8 @@ impl<'a> Iterable<u32> for ShuffledMultivaluedIndex<'a> {
#[cfg(test)]
mod tests {
use super::*;
use crate::column_index::OptionalIndex;
use crate::RowAddr;
use crate::column_index::OptionalIndex;
#[test]
fn test_integrate_num_vals_empty() {

View File

@@ -1,8 +1,8 @@
use std::ops::Range;
use crate::column_index::SerializableColumnIndex;
use crate::column_index::multivalued_index::{MultiValueIndex, SerializableMultivalueIndex};
use crate::column_index::serialize::SerializableOptionalIndex;
use crate::column_index::SerializableColumnIndex;
use crate::iterable::Iterable;
use crate::{Cardinality, ColumnIndex, RowId, StackMergeOrder};
@@ -56,7 +56,7 @@ fn get_doc_ids_with_values<'a>(
ColumnIndex::Full => Box::new(doc_range),
ColumnIndex::Optional(optional_index) => Box::new(
optional_index
.iter_rows()
.iter_non_null_docs()
.map(move |row| row + doc_range.start),
),
ColumnIndex::Multivalued(multivalued_index) => match multivalued_index {
@@ -73,7 +73,7 @@ fn get_doc_ids_with_values<'a>(
MultiValueIndex::MultiValueIndexV2(multivalued_index) => Box::new(
multivalued_index
.optional_index
.iter_rows()
.iter_non_null_docs()
.map(move |row| row + doc_range.start),
),
},
@@ -105,10 +105,11 @@ fn get_num_values_iterator<'a>(
) -> Box<dyn Iterator<Item = u32> + 'a> {
match column_index {
ColumnIndex::Empty { .. } => Box::new(std::iter::empty()),
ColumnIndex::Full => Box::new(std::iter::repeat(1u32).take(num_docs as usize)),
ColumnIndex::Optional(optional_index) => {
Box::new(std::iter::repeat(1u32).take(optional_index.num_non_nulls() as usize))
}
ColumnIndex::Full => Box::new(std::iter::repeat_n(1u32, num_docs as usize)),
ColumnIndex::Optional(optional_index) => Box::new(std::iter::repeat_n(
1u32,
optional_index.num_non_nulls() as usize,
)),
ColumnIndex::Multivalued(multivalued_index) => Box::new(
multivalued_index
.get_start_index_column()
@@ -123,7 +124,7 @@ fn get_num_values_iterator<'a>(
}
}
impl<'a> Iterable<u32> for StackedStartOffsets<'a> {
impl Iterable<u32> for StackedStartOffsets<'_> {
fn boxed_iter(&self) -> Box<dyn Iterator<Item = u32> + '_> {
let num_values_it = (0..self.column_indexes.len()).flat_map(|columnar_id| {
let num_docs = self.stack_merge_order.columnar_range(columnar_id).len() as u32;
@@ -177,7 +178,7 @@ impl<'a> Iterable<RowId> for StackedOptionalIndex<'a> {
ColumnIndex::Full => Box::new(columnar_row_range),
ColumnIndex::Optional(optional_index) => Box::new(
optional_index
.iter_rows()
.iter_non_null_docs()
.map(move |row_id: RowId| columnar_row_range.start + row_id),
),
ColumnIndex::Multivalued(_) => {

View File

@@ -14,7 +14,7 @@ pub use merge::merge_column_index;
pub(crate) use multivalued_index::SerializableMultivalueIndex;
pub use optional_index::{OptionalIndex, Set};
pub use serialize::{
open_column_index, serialize_column_index, SerializableColumnIndex, SerializableOptionalIndex,
SerializableColumnIndex, SerializableOptionalIndex, open_column_index, serialize_column_index,
};
use crate::column_index::multivalued_index::MultiValueIndex;
@@ -28,7 +28,7 @@ pub enum ColumnIndex {
Full,
Optional(OptionalIndex),
/// In addition, at index num_rows, an extra value is added
/// containing the overal number of values.
/// containing the overall number of values.
Multivalued(MultiValueIndex),
}

View File

@@ -8,7 +8,7 @@ use common::{CountingWriter, OwnedBytes};
use super::optional_index::{open_optional_index, serialize_optional_index};
use super::{OptionalIndex, SerializableOptionalIndex, Set};
use crate::column_values::{
load_u64_based_column_values, serialize_u64_based_column_values, CodecType, ColumnValues,
CodecType, ColumnValues, load_u64_based_column_values, serialize_u64_based_column_values,
};
use crate::iterable::Iterable;
use crate::{DocId, RowId, Version};
@@ -215,6 +215,32 @@ impl MultiValueIndex {
}
}
/// Returns an iterator over document ids that have at least one value.
pub fn iter_non_null_docs(&self) -> Box<dyn Iterator<Item = DocId> + '_> {
match self {
MultiValueIndex::MultiValueIndexV1(idx) => {
let mut doc: DocId = 0u32;
let num_docs = idx.num_docs();
Box::new(std::iter::from_fn(move || {
// This is not the most efficient way to do this, but it's legacy code.
while doc < num_docs {
let cur = doc;
doc += 1;
let start = idx.start_index_column.get_val(cur);
let end = idx.start_index_column.get_val(cur + 1);
if end > start {
return Some(cur);
}
}
None
}))
}
MultiValueIndex::MultiValueIndexV2(idx) => {
Box::new(idx.optional_index.iter_non_null_docs())
}
}
}
/// Converts a list of ranks (row ids of values) in a 1:n index to the corresponding list of
/// docids. Positions are converted inplace to docids.
///

View File

@@ -1,4 +1,4 @@
use std::io::{self, Write};
use std::io;
use std::sync::Arc;
mod set;
@@ -7,11 +7,11 @@ mod set_block;
use common::{BinarySerializable, OwnedBytes, VInt};
pub use set::{SelectCursor, Set, SetCodec};
use set_block::{
DenseBlock, DenseBlockCodec, SparseBlock, SparseBlockCodec, DENSE_BLOCK_NUM_BYTES,
DENSE_BLOCK_NUM_BYTES, DenseBlock, DenseBlockCodec, SparseBlock, SparseBlockCodec,
};
use crate::iterable::Iterable;
use crate::{DocId, InvalidData, RowId};
use crate::{DocId, RowId};
/// The threshold for for number of elements after which we switch to dense block encoding.
///
@@ -80,23 +80,23 @@ impl BlockVariant {
/// index is the block index. For each block `byte_start` and `offset` is computed.
#[derive(Clone)]
pub struct OptionalIndex {
num_rows: RowId,
num_non_null_rows: RowId,
num_docs: RowId,
num_non_null_docs: RowId,
block_data: OwnedBytes,
block_metas: Arc<[BlockMeta]>,
}
impl<'a> Iterable<u32> for &'a OptionalIndex {
impl Iterable<u32> for &OptionalIndex {
fn boxed_iter(&self) -> Box<dyn Iterator<Item = u32> + '_> {
Box::new(self.iter_rows())
Box::new(self.iter_non_null_docs())
}
}
impl std::fmt::Debug for OptionalIndex {
fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
f.debug_struct("OptionalIndex")
.field("num_rows", &self.num_rows)
.field("num_non_null_rows", &self.num_non_null_rows)
.field("num_docs", &self.num_docs)
.field("num_non_null_docs", &self.num_non_null_docs)
.finish_non_exhaustive()
}
}
@@ -123,7 +123,7 @@ enum BlockSelectCursor<'a> {
Sparse(<SparseBlock<'a> as Set<u16>>::SelectCursor<'a>),
}
impl<'a> BlockSelectCursor<'a> {
impl BlockSelectCursor<'_> {
fn select(&mut self, rank: u16) -> u16 {
match self {
BlockSelectCursor::Dense(dense_select_cursor) => dense_select_cursor.select(rank),
@@ -141,7 +141,7 @@ pub struct OptionalIndexSelectCursor<'a> {
num_null_rows_before_block: RowId,
}
impl<'a> OptionalIndexSelectCursor<'a> {
impl OptionalIndexSelectCursor<'_> {
fn search_and_load_block(&mut self, rank: RowId) {
if rank < self.current_block_end_rank {
// we are already in the right block
@@ -165,7 +165,7 @@ impl<'a> OptionalIndexSelectCursor<'a> {
}
}
impl<'a> SelectCursor<RowId> for OptionalIndexSelectCursor<'a> {
impl SelectCursor<RowId> for OptionalIndexSelectCursor<'_> {
fn select(&mut self, rank: RowId) -> RowId {
self.search_and_load_block(rank);
let index_in_block = (rank - self.num_null_rows_before_block) as u16;
@@ -174,7 +174,9 @@ impl<'a> SelectCursor<RowId> for OptionalIndexSelectCursor<'a> {
}
impl Set<RowId> for OptionalIndex {
type SelectCursor<'b> = OptionalIndexSelectCursor<'b> where Self: 'b;
type SelectCursor<'b>
= OptionalIndexSelectCursor<'b>
where Self: 'b;
// Check if value at position is not null.
#[inline]
fn contains(&self, row_id: RowId) -> bool {
@@ -257,11 +259,13 @@ impl Set<RowId> for OptionalIndex {
impl OptionalIndex {
pub fn for_test(num_rows: RowId, row_ids: &[RowId]) -> OptionalIndex {
assert!(row_ids
.last()
.copied()
.map(|last_row_id| last_row_id < num_rows)
.unwrap_or(true));
assert!(
row_ids
.last()
.copied()
.map(|last_row_id| last_row_id < num_rows)
.unwrap_or(true)
);
let mut buffer = Vec::new();
serialize_optional_index(&row_ids, num_rows, &mut buffer).unwrap();
let bytes = OwnedBytes::new(buffer);
@@ -269,17 +273,18 @@ impl OptionalIndex {
}
pub fn num_docs(&self) -> RowId {
self.num_rows
self.num_docs
}
pub fn num_non_nulls(&self) -> RowId {
self.num_non_null_rows
self.num_non_null_docs
}
pub fn iter_rows(&self) -> impl Iterator<Item = RowId> + '_ {
// TODO optimize
pub fn iter_non_null_docs(&self) -> impl Iterator<Item = RowId> + '_ {
// TODO optimize. We could iterate over the blocks directly.
// We use the dense value ids and retrieve the doc ids via select.
let mut select_batch = self.select_cursor();
(0..self.num_non_null_rows).map(move |rank| select_batch.select(rank))
(0..self.num_non_null_docs).map(move |rank| select_batch.select(rank))
}
pub fn select_batch(&self, ranks: &mut [RowId]) {
let mut select_cursor = self.select_cursor();
@@ -330,38 +335,6 @@ enum Block<'a> {
Sparse(SparseBlock<'a>),
}
#[derive(Debug, Copy, Clone)]
enum OptionalIndexCodec {
Dense = 0,
Sparse = 1,
}
impl OptionalIndexCodec {
fn to_code(self) -> u8 {
self as u8
}
fn try_from_code(code: u8) -> Result<Self, InvalidData> {
match code {
0 => Ok(Self::Dense),
1 => Ok(Self::Sparse),
_ => Err(InvalidData),
}
}
}
impl BinarySerializable for OptionalIndexCodec {
fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
writer.write_all(&[self.to_code()])
}
fn deserialize<R: io::Read>(reader: &mut R) -> io::Result<Self> {
let optional_codec_code = u8::deserialize(reader)?;
let optional_codec = Self::try_from_code(optional_codec_code)?;
Ok(optional_codec)
}
}
fn serialize_optional_index_block(block_els: &[u16], out: &mut impl io::Write) -> io::Result<()> {
let is_sparse = is_sparse(block_els.len() as u32);
if is_sparse {
@@ -503,7 +476,7 @@ fn deserialize_optional_index_block_metadatas(
non_null_rows_before_block += num_non_null_rows;
}
block_metas.resize(
((num_rows + ELEMENTS_PER_BLOCK - 1) / ELEMENTS_PER_BLOCK) as usize,
num_rows.div_ceil(ELEMENTS_PER_BLOCK) as usize,
BlockMeta {
non_null_rows_before_block,
start_byte_offset,
@@ -517,15 +490,15 @@ pub fn open_optional_index(bytes: OwnedBytes) -> io::Result<OptionalIndex> {
let (mut bytes, num_non_empty_blocks_bytes) = bytes.rsplit(2);
let num_non_empty_block_bytes =
u16::from_le_bytes(num_non_empty_blocks_bytes.as_slice().try_into().unwrap());
let num_rows = VInt::deserialize_u64(&mut bytes)? as u32;
let num_docs = VInt::deserialize_u64(&mut bytes)? as u32;
let block_metas_num_bytes =
num_non_empty_block_bytes as usize * SERIALIZED_BLOCK_META_NUM_BYTES;
let (block_data, block_metas) = bytes.rsplit(block_metas_num_bytes);
let (block_metas, num_non_null_rows) =
deserialize_optional_index_block_metadatas(block_metas.as_slice(), num_rows);
let (block_metas, num_non_null_docs) =
deserialize_optional_index_block_metadatas(block_metas.as_slice(), num_docs);
let optional_index = OptionalIndex {
num_rows,
num_non_null_rows,
num_docs,
num_non_null_docs,
block_data,
block_metas: block_metas.into(),
};

View File

@@ -2,7 +2,7 @@ use std::io::{self, Write};
use common::BinarySerializable;
use crate::column_index::optional_index::{SelectCursor, Set, SetCodec, ELEMENTS_PER_BLOCK};
use crate::column_index::optional_index::{ELEMENTS_PER_BLOCK, SelectCursor, Set, SetCodec};
#[inline(always)]
fn get_bit_at(input: u64, n: u16) -> bool {
@@ -23,7 +23,6 @@ fn set_bit_at(input: &mut u64, n: u16) {
///
/// When translating a dense index to the original index, we can use the offset to find the correct
/// block. Direct computation is not possible, but we can employ a linear or binary search.
const ELEMENTS_PER_MINI_BLOCK: u16 = 64;
const MINI_BLOCK_BITVEC_NUM_BYTES: usize = 8;
const MINI_BLOCK_OFFSET_NUM_BYTES: usize = 2;
@@ -109,7 +108,7 @@ pub struct DenseBlockSelectCursor<'a> {
dense_block: DenseBlock<'a>,
}
impl<'a> SelectCursor<u16> for DenseBlockSelectCursor<'a> {
impl SelectCursor<u16> for DenseBlockSelectCursor<'_> {
#[inline]
fn select(&mut self, rank: u16) -> u16 {
self.block_id = self
@@ -123,7 +122,9 @@ impl<'a> SelectCursor<u16> for DenseBlockSelectCursor<'a> {
}
impl<'a> Set<u16> for DenseBlock<'a> {
type SelectCursor<'b> = DenseBlockSelectCursor<'a> where Self: 'b;
type SelectCursor<'b>
= DenseBlockSelectCursor<'a>
where Self: 'b;
#[inline(always)]
fn contains(&self, el: u16) -> bool {
@@ -173,7 +174,7 @@ impl<'a> Set<u16> for DenseBlock<'a> {
}
}
impl<'a> DenseBlock<'a> {
impl DenseBlock<'_> {
#[inline]
fn mini_block(&self, mini_block_id: u16) -> DenseMiniBlock {
let data_start_pos = mini_block_id as usize * MINI_BLOCK_NUM_BYTES;

View File

@@ -1,7 +1,7 @@
mod dense;
mod sparse;
pub use dense::{DenseBlock, DenseBlockCodec, DENSE_BLOCK_NUM_BYTES};
pub use dense::{DENSE_BLOCK_NUM_BYTES, DenseBlock, DenseBlockCodec};
pub use sparse::{SparseBlock, SparseBlockCodec};
#[cfg(test)]

View File

@@ -31,8 +31,10 @@ impl<'a> SelectCursor<u16> for SparseBlock<'a> {
}
}
impl<'a> Set<u16> for SparseBlock<'a> {
type SelectCursor<'b> = Self where Self: 'b;
impl Set<u16> for SparseBlock<'_> {
type SelectCursor<'b>
= Self
where Self: 'b;
#[inline(always)]
fn contains(&self, el: u16) -> bool {
@@ -67,7 +69,7 @@ fn get_u16(data: &[u8], byte_position: usize) -> u16 {
u16::from_le_bytes(bytes)
}
impl<'a> SparseBlock<'a> {
impl SparseBlock<'_> {
#[inline(always)]
fn value_at_idx(&self, data: &[u8], idx: u16) -> u16 {
let start_offset: usize = idx as usize * 2;
@@ -80,7 +82,7 @@ impl<'a> SparseBlock<'a> {
}
#[inline]
#[allow(clippy::comparison_chain)]
#[expect(clippy::comparison_chain)]
// Looks for the element in the block. Returns the positions if found.
fn binary_search(&self, target: u16) -> Result<u16, u16> {
let data = &self.0;

View File

@@ -110,8 +110,8 @@ fn test_null_index(data: &[bool]) {
.map(|(pos, _val)| pos as u32)
.collect();
let mut select_iter = null_index.select_cursor();
for i in 0..orig_idx_with_value.len() {
assert_eq!(select_iter.select(i as u32), orig_idx_with_value[i]);
for (i, expected) in orig_idx_with_value.iter().enumerate() {
assert_eq!(select_iter.select(i as u32), *expected);
}
let step_size = (orig_idx_with_value.len() / 100).max(1);
@@ -164,7 +164,11 @@ fn test_optional_index_large() {
fn test_optional_index_iter_aux(row_ids: &[RowId], num_rows: RowId) {
let optional_index = OptionalIndex::for_test(num_rows, row_ids);
assert_eq!(optional_index.num_docs(), num_rows);
assert!(optional_index.iter_rows().eq(row_ids.iter().copied()));
assert!(
optional_index
.iter_non_null_docs()
.eq(row_ids.iter().copied())
);
}
#[test]
@@ -219,174 +223,3 @@ fn test_optional_index_for_tests() {
assert!(!optional_index.contains(3));
assert_eq!(optional_index.num_docs(), 4);
}
#[cfg(all(test, feature = "unstable"))]
mod bench {
use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};
use test::Bencher;
use super::*;
const TOTAL_NUM_VALUES: u32 = 1_000_000;
fn gen_bools(fill_ratio: f64) -> OptionalIndex {
let mut out = Vec::new();
let mut rng: StdRng = StdRng::from_seed([1u8; 32]);
let vals: Vec<RowId> = (0..TOTAL_NUM_VALUES)
.map(|_| rng.gen_bool(fill_ratio))
.enumerate()
.filter(|(_pos, val)| *val)
.map(|(pos, _)| pos as RowId)
.collect();
serialize_optional_index(&&vals[..], TOTAL_NUM_VALUES, &mut out).unwrap();
open_optional_index(OwnedBytes::new(out)).unwrap()
}
fn random_range_iterator(
start: u32,
end: u32,
avg_step_size: u32,
avg_deviation: u32,
) -> impl Iterator<Item = u32> {
let mut rng: StdRng = StdRng::from_seed([1u8; 32]);
let mut current = start;
std::iter::from_fn(move || {
current += rng.gen_range(avg_step_size - avg_deviation..=avg_step_size + avg_deviation);
if current >= end {
None
} else {
Some(current)
}
})
}
fn n_percent_step_iterator(percent: f32, num_values: u32) -> impl Iterator<Item = u32> {
let ratio = percent / 100.0;
let step_size = (1f32 / ratio) as u32;
let deviation = step_size - 1;
random_range_iterator(0, num_values, step_size, deviation)
}
fn walk_over_data(codec: &OptionalIndex, avg_step_size: u32) -> Option<u32> {
walk_over_data_from_positions(
codec,
random_range_iterator(0, TOTAL_NUM_VALUES, avg_step_size, 0),
)
}
fn walk_over_data_from_positions(
codec: &OptionalIndex,
positions: impl Iterator<Item = u32>,
) -> Option<u32> {
let mut dense_idx: Option<u32> = None;
for idx in positions {
dense_idx = dense_idx.or(codec.rank_if_exists(idx));
}
dense_idx
}
#[bench]
fn bench_translate_orig_to_codec_1percent_filled_10percent_hit(bench: &mut Bencher) {
let codec = gen_bools(0.01f64);
bench.iter(|| walk_over_data(&codec, 100));
}
#[bench]
fn bench_translate_orig_to_codec_5percent_filled_10percent_hit(bench: &mut Bencher) {
let codec = gen_bools(0.05f64);
bench.iter(|| walk_over_data(&codec, 100));
}
#[bench]
fn bench_translate_orig_to_codec_5percent_filled_1percent_hit(bench: &mut Bencher) {
let codec = gen_bools(0.05f64);
bench.iter(|| walk_over_data(&codec, 1000));
}
#[bench]
fn bench_translate_orig_to_codec_full_scan_1percent_filled(bench: &mut Bencher) {
let codec = gen_bools(0.01f64);
bench.iter(|| walk_over_data_from_positions(&codec, 0..TOTAL_NUM_VALUES));
}
#[bench]
fn bench_translate_orig_to_codec_full_scan_10percent_filled(bench: &mut Bencher) {
let codec = gen_bools(0.1f64);
bench.iter(|| walk_over_data_from_positions(&codec, 0..TOTAL_NUM_VALUES));
}
#[bench]
fn bench_translate_orig_to_codec_full_scan_90percent_filled(bench: &mut Bencher) {
let codec = gen_bools(0.9f64);
bench.iter(|| walk_over_data_from_positions(&codec, 0..TOTAL_NUM_VALUES));
}
#[bench]
fn bench_translate_orig_to_codec_10percent_filled_1percent_hit(bench: &mut Bencher) {
let codec = gen_bools(0.1f64);
bench.iter(|| walk_over_data(&codec, 100));
}
#[bench]
fn bench_translate_orig_to_codec_50percent_filled_1percent_hit(bench: &mut Bencher) {
let codec = gen_bools(0.5f64);
bench.iter(|| walk_over_data(&codec, 100));
}
#[bench]
fn bench_translate_orig_to_codec_90percent_filled_1percent_hit(bench: &mut Bencher) {
let codec = gen_bools(0.9f64);
bench.iter(|| walk_over_data(&codec, 100));
}
#[bench]
fn bench_translate_codec_to_orig_1percent_filled_0comma005percent_hit(bench: &mut Bencher) {
bench_translate_codec_to_orig_util(0.01f64, 0.005f32, bench);
}
#[bench]
fn bench_translate_codec_to_orig_10percent_filled_0comma005percent_hit(bench: &mut Bencher) {
bench_translate_codec_to_orig_util(0.1f64, 0.005f32, bench);
}
#[bench]
fn bench_translate_codec_to_orig_1percent_filled_10percent_hit(bench: &mut Bencher) {
bench_translate_codec_to_orig_util(0.01f64, 10f32, bench);
}
#[bench]
fn bench_translate_codec_to_orig_1percent_filled_full_scan(bench: &mut Bencher) {
bench_translate_codec_to_orig_util(0.01f64, 100f32, bench);
}
fn bench_translate_codec_to_orig_util(
percent_filled: f64,
percent_hit: f32,
bench: &mut Bencher,
) {
let codec = gen_bools(percent_filled);
let num_non_nulls = codec.num_non_nulls();
let idxs: Vec<u32> = if percent_hit == 100.0f32 {
(0..num_non_nulls).collect()
} else {
n_percent_step_iterator(percent_hit, num_non_nulls).collect()
};
let mut output = vec![0u32; idxs.len()];
bench.iter(|| {
output.copy_from_slice(&idxs[..]);
codec.select_batch(&mut output);
});
}
#[bench]
fn bench_translate_codec_to_orig_90percent_filled_0comma005percent_hit(bench: &mut Bencher) {
bench_translate_codec_to_orig_util(0.9f64, 0.005, bench);
}
#[bench]
fn bench_translate_codec_to_orig_90percent_filled_full_scan(bench: &mut Bencher) {
bench_translate_codec_to_orig_util(0.9f64, 100.0f32, bench);
}
}

View File

@@ -3,11 +3,11 @@ use std::io::Write;
use common::{CountingWriter, OwnedBytes};
use super::multivalued_index::SerializableMultivalueIndex;
use super::OptionalIndex;
use super::multivalued_index::SerializableMultivalueIndex;
use crate::column_index::ColumnIndex;
use crate::column_index::multivalued_index::serialize_multivalued_index;
use crate::column_index::optional_index::serialize_optional_index;
use crate::column_index::ColumnIndex;
use crate::iterable::Iterable;
use crate::{Cardinality, RowId, Version};
@@ -31,7 +31,7 @@ pub enum SerializableColumnIndex<'a> {
Multivalued(SerializableMultivalueIndex<'a>),
}
impl<'a> SerializableColumnIndex<'a> {
impl SerializableColumnIndex<'_> {
pub fn get_cardinality(&self) -> Cardinality {
match self {
SerializableColumnIndex::Full => Cardinality::Full,

View File

@@ -1,135 +0,0 @@
use std::sync::Arc;
use common::OwnedBytes;
use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};
use test::{self, Bencher};
use super::*;
use crate::column_values::u64_based::*;
fn get_data() -> Vec<u64> {
let mut rng = StdRng::seed_from_u64(2u64);
let mut data: Vec<_> = (100..55000_u64)
.map(|num| num + rng.gen::<u8>() as u64)
.collect();
data.push(99_000);
data.insert(1000, 2000);
data.insert(2000, 100);
data.insert(3000, 4100);
data.insert(4000, 100);
data.insert(5000, 800);
data
}
fn compute_stats(vals: impl Iterator<Item = u64>) -> ColumnStats {
let mut stats_collector = StatsCollector::default();
for val in vals {
stats_collector.collect(val);
}
stats_collector.stats()
}
#[inline(never)]
fn value_iter() -> impl Iterator<Item = u64> {
0..20_000
}
fn get_reader_for_bench<Codec: ColumnCodec>(data: &[u64]) -> Codec::ColumnValues {
let mut bytes = Vec::new();
let stats = compute_stats(data.iter().cloned());
let mut codec_serializer = Codec::estimator();
for val in data {
codec_serializer.collect(*val);
}
codec_serializer.serialize(&stats, Box::new(data.iter().copied()).as_mut(), &mut bytes);
Codec::load(OwnedBytes::new(bytes)).unwrap()
}
fn bench_get<Codec: ColumnCodec>(b: &mut Bencher, data: &[u64]) {
let col = get_reader_for_bench::<Codec>(data);
b.iter(|| {
let mut sum = 0u64;
for pos in value_iter() {
let val = col.get_val(pos as u32);
sum = sum.wrapping_add(val);
}
sum
});
}
#[inline(never)]
fn bench_get_dynamic_helper(b: &mut Bencher, col: Arc<dyn ColumnValues>) {
b.iter(|| {
let mut sum = 0u64;
for pos in value_iter() {
let val = col.get_val(pos as u32);
sum = sum.wrapping_add(val);
}
sum
});
}
fn bench_get_dynamic<Codec: ColumnCodec>(b: &mut Bencher, data: &[u64]) {
let col = Arc::new(get_reader_for_bench::<Codec>(data));
bench_get_dynamic_helper(b, col);
}
fn bench_create<Codec: ColumnCodec>(b: &mut Bencher, data: &[u64]) {
let stats = compute_stats(data.iter().cloned());
let mut bytes = Vec::new();
b.iter(|| {
bytes.clear();
let mut codec_serializer = Codec::estimator();
for val in data.iter().take(1024) {
codec_serializer.collect(*val);
}
codec_serializer.serialize(&stats, Box::new(data.iter().copied()).as_mut(), &mut bytes)
});
}
#[bench]
fn bench_fastfield_bitpack_create(b: &mut Bencher) {
let data: Vec<_> = get_data();
bench_create::<BitpackedCodec>(b, &data);
}
#[bench]
fn bench_fastfield_linearinterpol_create(b: &mut Bencher) {
let data: Vec<_> = get_data();
bench_create::<LinearCodec>(b, &data);
}
#[bench]
fn bench_fastfield_multilinearinterpol_create(b: &mut Bencher) {
let data: Vec<_> = get_data();
bench_create::<BlockwiseLinearCodec>(b, &data);
}
#[bench]
fn bench_fastfield_bitpack_get(b: &mut Bencher) {
let data: Vec<_> = get_data();
bench_get::<BitpackedCodec>(b, &data);
}
#[bench]
fn bench_fastfield_bitpack_get_dynamic(b: &mut Bencher) {
let data: Vec<_> = get_data();
bench_get_dynamic::<BitpackedCodec>(b, &data);
}
#[bench]
fn bench_fastfield_linearinterpol_get(b: &mut Bencher) {
let data: Vec<_> = get_data();
bench_get::<LinearCodec>(b, &data);
}
#[bench]
fn bench_fastfield_linearinterpol_get_dynamic(b: &mut Bencher) {
let data: Vec<_> = get_data();
bench_get_dynamic::<LinearCodec>(b, &data);
}
#[bench]
fn bench_fastfield_multilinearinterpol_get(b: &mut Bencher) {
let data: Vec<_> = get_data();
bench_get::<BlockwiseLinearCodec>(b, &data);
}
#[bench]
fn bench_fastfield_multilinearinterpol_get_dynamic(b: &mut Bencher) {
let data: Vec<_> = get_data();
bench_get_dynamic::<BlockwiseLinearCodec>(b, &data);
}

View File

@@ -10,7 +10,7 @@ pub(crate) struct MergedColumnValues<'a, T> {
pub(crate) merge_row_order: &'a MergeRowOrder,
}
impl<'a, T: Copy + PartialOrd + Debug + 'static> Iterable<T> for MergedColumnValues<'a, T> {
impl<T: Copy + PartialOrd + Debug + 'static> Iterable<T> for MergedColumnValues<'_, T> {
fn boxed_iter(&self) -> Box<dyn Iterator<Item = T> + '_> {
match self.merge_row_order {
MergeRowOrder::Stack(_) => Box::new(

View File

@@ -26,13 +26,13 @@ mod monotonic_column;
pub(crate) use merge::MergedColumnValues;
pub use stats::ColumnStats;
pub use u128_based::{
open_u128_as_compact_u64, open_u128_mapped, serialize_column_values_u128,
CompactSpaceU64Accessor,
};
pub use u64_based::{
load_u64_based_column_values, serialize_and_load_u64_based_column_values,
serialize_u64_based_column_values, CodecType, ALL_U64_CODEC_TYPES,
ALL_U64_CODEC_TYPES, CodecType, load_u64_based_column_values,
serialize_and_load_u64_based_column_values, serialize_u64_based_column_values,
};
pub use u128_based::{
CompactSpaceU64Accessor, open_u128_as_compact_u64, open_u128_mapped,
serialize_column_values_u128,
};
pub use vec_column::VecColumn;
@@ -242,6 +242,3 @@ impl<T: Copy + PartialOrd + Debug + 'static> ColumnValues<T> for Arc<dyn ColumnV
.get_row_ids_for_value_range(range, doc_id_range, positions)
}
}
#[cfg(all(test, feature = "unstable"))]
mod bench;

View File

@@ -2,8 +2,8 @@ use std::fmt::Debug;
use std::marker::PhantomData;
use std::ops::{Range, RangeInclusive};
use crate::column_values::monotonic_mapping::StrictlyMonotonicFn;
use crate::ColumnValues;
use crate::column_values::monotonic_mapping::StrictlyMonotonicFn;
struct MonotonicMappingColumn<C, T, Input> {
from_column: C,
@@ -99,10 +99,10 @@ where
#[cfg(test)]
mod tests {
use super::*;
use crate::column_values::VecColumn;
use crate::column_values::monotonic_mapping::{
StrictlyMonotonicMappingInverter, StrictlyMonotonicMappingToInternal,
};
use crate::column_values::VecColumn;
#[test]
fn test_monotonic_mapping_iter() {

View File

@@ -1,7 +1,7 @@
use std::fmt::Debug;
use std::net::Ipv6Addr;
/// Montonic maps a value to u128 value space
/// Monotonic maps a value to u128 value space
/// Monotonic mapping enables `PartialOrd` on u128 space without conversion to original space.
pub trait MonotonicallyMappableToU128: 'static + PartialOrd + Copy + Debug + Send + Sync {
/// Converts a value to u128.

View File

@@ -184,11 +184,11 @@ impl CompactSpaceBuilder {
let mut covered_space = Vec::with_capacity(self.blanks.len());
// begining of the blanks
if let Some(first_blank_start) = self.blanks.first().map(RangeInclusive::start) {
if *first_blank_start != 0 {
covered_space.push(0..=first_blank_start - 1);
}
// beginning of the blanks
if let Some(first_blank_start) = self.blanks.first().map(RangeInclusive::start)
&& *first_blank_start != 0
{
covered_space.push(0..=first_blank_start - 1);
}
// Between the blanks
@@ -202,10 +202,10 @@ impl CompactSpaceBuilder {
covered_space.extend(between_blanks);
// end of the blanks
if let Some(last_blank_end) = self.blanks.last().map(RangeInclusive::end) {
if *last_blank_end != u128::MAX {
covered_space.push(last_blank_end + 1..=u128::MAX);
}
if let Some(last_blank_end) = self.blanks.last().map(RangeInclusive::end)
&& *last_blank_end != u128::MAX
{
covered_space.push(last_blank_end + 1..=u128::MAX);
}
if covered_space.is_empty() {

View File

@@ -24,8 +24,8 @@ use build_compact_space::get_compact_space;
use common::{BinarySerializable, CountingWriter, OwnedBytes, VInt, VIntU128};
use tantivy_bitpacker::{BitPacker, BitUnpacker};
use crate::column_values::ColumnValues;
use crate::RowId;
use crate::column_values::ColumnValues;
/// The cost per blank is quite hard actually, since blanks are delta encoded, the actual cost of
/// blanks depends on the number of blanks.
@@ -653,12 +653,14 @@ mod tests {
),
&[3]
);
assert!(get_positions_for_value_range_helper(
&decomp,
99998u128..=99998u128,
complete_range.clone()
)
.is_empty());
assert!(
get_positions_for_value_range_helper(
&decomp,
99998u128..=99998u128,
complete_range.clone()
)
.is_empty()
);
assert_eq!(
&get_positions_for_value_range_helper(
&decomp,

View File

@@ -128,13 +128,13 @@ pub fn open_u128_as_compact_u64(mut bytes: OwnedBytes) -> io::Result<Arc<dyn Col
}
#[cfg(test)]
pub mod tests {
pub(crate) mod tests {
use super::*;
use crate::column_values::u64_based::{
serialize_and_load_u64_based_column_values, serialize_u64_based_column_values,
ALL_U64_CODEC_TYPES,
};
use crate::column_values::CodecType;
use crate::column_values::u64_based::{
ALL_U64_CODEC_TYPES, serialize_and_load_u64_based_column_values,
serialize_u64_based_column_values,
};
#[test]
fn test_serialize_deserialize_u128_header() {

View File

@@ -4,7 +4,7 @@ use std::ops::{Range, RangeInclusive};
use common::{BinarySerializable, OwnedBytes};
use fastdivide::DividerU64;
use tantivy_bitpacker::{compute_num_bits, BitPacker, BitUnpacker};
use tantivy_bitpacker::{BitPacker, BitUnpacker, compute_num_bits};
use crate::column_values::u64_based::{ColumnCodec, ColumnCodecEstimator, ColumnStats};
use crate::{ColumnValues, RowId};
@@ -23,11 +23,7 @@ const fn div_ceil(n: u64, q: NonZeroU64) -> u64 {
// copied from unstable rust standard library.
let d = n / q.get();
let r = n % q.get();
if r > 0 {
d + 1
} else {
d
}
if r > 0 { d + 1 } else { d }
}
// The bitpacked codec applies a linear transformation `f` over data that are bitpacked.
@@ -109,7 +105,7 @@ impl ColumnCodecEstimator for BitpackedCodecEstimator {
fn estimate(&self, stats: &ColumnStats) -> Option<u64> {
let num_bits_per_value = num_bits(stats);
Some(stats.num_bytes() + (stats.num_rows as u64 * (num_bits_per_value as u64) + 7) / 8)
Some(stats.num_bytes() + (stats.num_rows as u64 * (num_bits_per_value as u64)).div_ceil(8))
}
fn serialize(

View File

@@ -4,12 +4,12 @@ use std::{io, iter};
use common::{BinarySerializable, CountingWriter, DeserializeFrom, OwnedBytes};
use fastdivide::DividerU64;
use tantivy_bitpacker::{compute_num_bits, BitPacker, BitUnpacker};
use tantivy_bitpacker::{BitPacker, BitUnpacker, compute_num_bits};
use crate::MonotonicallyMappableToU64;
use crate::column_values::u64_based::line::Line;
use crate::column_values::u64_based::{ColumnCodec, ColumnCodecEstimator, ColumnStats};
use crate::column_values::{ColumnValues, VecColumn};
use crate::MonotonicallyMappableToU64;
const BLOCK_SIZE: u32 = 512u32;
@@ -39,7 +39,7 @@ impl BinarySerializable for Block {
}
fn compute_num_blocks(num_vals: u32) -> u32 {
(num_vals + BLOCK_SIZE - 1) / BLOCK_SIZE
num_vals.div_ceil(BLOCK_SIZE)
}
pub struct BlockwiseLinearEstimator {

View File

@@ -8,7 +8,7 @@ use crate::column_values::ColumnValues;
const MID_POINT: u64 = (1u64 << 32) - 1u64;
/// `Line` describes a line function `y: ax + b` using integer
/// arithmetics.
/// arithmetic.
///
/// The slope is in fact a decimal split into a 32 bit integer value,
/// and a 32-bit decimal value.
@@ -94,7 +94,7 @@ impl Line {
// `(i, ys[])`.
//
// The best intercept therefore has the form
// `y[i] - line.eval(i)` (using wrapping arithmetics).
// `y[i] - line.eval(i)` (using wrapping arithmetic).
// In other words, the best intercept is one of the `y - Line::eval(ys[i])`
// and our task is just to pick the one that minimizes our error.
//
@@ -122,12 +122,11 @@ impl Line {
line
}
/// Returns a line that attemps to approximate a function
/// Returns a line that attempts to approximate a function
/// f: i in 0..[ys.num_vals()) -> ys[i].
///
/// - The approximation is always lower than the actual value.
/// Or more rigorously, formally `f(i).wrapping_sub(ys[i])` is small
/// for any i in [0..ys.len()).
/// - The approximation is always lower than the actual value. Or more rigorously, formally
/// `f(i).wrapping_sub(ys[i])` is small for any i in [0..ys.len()).
/// - It computes without panicking for any value of it.
///
/// This function is only invariable by translation if all of the

View File

@@ -1,13 +1,13 @@
use std::io;
use common::{BinarySerializable, OwnedBytes};
use tantivy_bitpacker::{compute_num_bits, BitPacker, BitUnpacker};
use tantivy_bitpacker::{BitPacker, BitUnpacker, compute_num_bits};
use super::line::Line;
use super::ColumnValues;
use crate::column_values::u64_based::{ColumnCodec, ColumnCodecEstimator, ColumnStats};
use crate::column_values::VecColumn;
use super::line::Line;
use crate::RowId;
use crate::column_values::VecColumn;
use crate::column_values::u64_based::{ColumnCodec, ColumnCodecEstimator, ColumnStats};
const HALF_SPACE: u64 = u64::MAX / 2;
const LINE_ESTIMATION_BLOCK_LEN: usize = 512;
@@ -117,7 +117,7 @@ impl ColumnCodecEstimator for LinearCodecEstimator {
Some(
stats.num_bytes()
+ linear_params.num_bytes()
+ (num_bits as u64 * stats.num_rows as u64 + 7) / 8,
+ (num_bits as u64 * stats.num_rows as u64).div_ceil(8),
)
}

View File

@@ -17,7 +17,7 @@ pub use crate::column_values::u64_based::bitpacked::BitpackedCodec;
pub use crate::column_values::u64_based::blockwise_linear::BlockwiseLinearCodec;
pub use crate::column_values::u64_based::linear::LinearCodec;
pub use crate::column_values::u64_based::stats_collector::StatsCollector;
use crate::column_values::{monotonic_map_column, ColumnStats};
use crate::column_values::{ColumnStats, monotonic_map_column};
use crate::iterable::Iterable;
use crate::{ColumnValues, MonotonicallyMappableToU64};
@@ -52,7 +52,7 @@ pub trait ColumnCodecEstimator<T = u64>: 'static {
) -> io::Result<()>;
}
/// A column codec describes a colunm serialization format.
/// A column codec describes a column serialization format.
pub trait ColumnCodec<T: PartialOrd = u64> {
/// Specialized `ColumnValues` type.
type ColumnValues: ColumnValues<T> + 'static;

View File

@@ -2,8 +2,8 @@ use std::num::NonZeroU64;
use fastdivide::DividerU64;
use crate::column_values::ColumnStats;
use crate::RowId;
use crate::column_values::ColumnStats;
/// Compute the gcd of two non null numbers.
///
@@ -96,8 +96,8 @@ impl StatsCollector {
mod tests {
use std::num::NonZeroU64;
use crate::column_values::u64_based::stats_collector::{compute_gcd, StatsCollector};
use crate::column_values::u64_based::ColumnStats;
use crate::column_values::u64_based::stats_collector::{StatsCollector, compute_gcd};
fn compute_stats(vals: impl Iterator<Item = u64>) -> ColumnStats {
let mut stats_collector = StatsCollector::default();

View File

@@ -1,5 +1,6 @@
use proptest::prelude::*;
use proptest::{prop_oneof, proptest};
use rand::Rng;
#[test]
fn test_serialize_and_load_simple() {

View File

@@ -4,8 +4,8 @@ use std::net::Ipv6Addr;
use serde::{Deserialize, Serialize};
use crate::value::NumericalType;
use crate::InvalidData;
use crate::value::NumericalType;
/// The column type represents the column type.
/// Any changes need to be propagated to `COLUMN_TYPES`.

View File

@@ -3,7 +3,7 @@ use std::io::{self, Write};
use common::{BitSet, CountingWriter, ReadOnlyBitSet};
use sstable::{SSTable, Streamer, TermOrdinal, VoidSSTable};
use super::term_merger::TermMerger;
use super::term_merger::{TermMerger, TermsWithSegmentOrd};
use crate::column::serialize_column_mappable_to_u64;
use crate::column_index::SerializableColumnIndex;
use crate::iterable::Iterable;
@@ -39,7 +39,7 @@ struct RemappedTermOrdinalsValues<'a> {
merge_row_order: &'a MergeRowOrder,
}
impl<'a> Iterable for RemappedTermOrdinalsValues<'a> {
impl Iterable for RemappedTermOrdinalsValues<'_> {
fn boxed_iter(&self) -> Box<dyn Iterator<Item = u64> + '_> {
match self.merge_row_order {
MergeRowOrder::Stack(_) => self.boxed_iter_stacked(),
@@ -50,7 +50,7 @@ impl<'a> Iterable for RemappedTermOrdinalsValues<'a> {
}
}
impl<'a> RemappedTermOrdinalsValues<'a> {
impl RemappedTermOrdinalsValues<'_> {
fn boxed_iter_stacked(&self) -> Box<dyn Iterator<Item = u64> + '_> {
let iter = self
.bytes_columns
@@ -126,14 +126,17 @@ fn serialize_merged_dict(
let mut term_ord_mapping = TermOrdinalMapping::default();
let mut field_term_streams = Vec::new();
for column_opt in bytes_columns.iter() {
for (segment_ord, column_opt) in bytes_columns.iter().enumerate() {
if let Some(column) = column_opt {
term_ord_mapping.add_segment(column.dictionary.num_terms());
let terms: Streamer<VoidSSTable> = column.dictionary.stream()?;
field_term_streams.push(terms);
field_term_streams.push(TermsWithSegmentOrd { terms, segment_ord });
} else {
term_ord_mapping.add_segment(0);
field_term_streams.push(Streamer::empty());
field_term_streams.push(TermsWithSegmentOrd {
terms: Streamer::empty(),
segment_ord,
});
}
}
@@ -191,6 +194,7 @@ fn serialize_merged_dict(
#[derive(Default, Debug)]
struct TermOrdinalMapping {
/// Contains the new term ordinals for each segment.
per_segment_new_term_ordinals: Vec<Vec<TermOrdinal>>,
}
@@ -205,6 +209,6 @@ impl TermOrdinalMapping {
}
fn get_segment(&self, segment_ord: u32) -> &[TermOrdinal] {
&(self.per_segment_new_term_ordinals[segment_ord as usize])[..]
&self.per_segment_new_term_ordinals[segment_ord as usize]
}
}

View File

@@ -26,7 +26,7 @@ impl StackMergeOrder {
let mut cumulated_row_ids: Vec<RowId> = Vec::with_capacity(columnars.len());
let mut cumulated_row_id = 0;
for columnar in columnars {
cumulated_row_id += columnar.num_rows();
cumulated_row_id += columnar.num_docs();
cumulated_row_ids.push(cumulated_row_id);
}
StackMergeOrder { cumulated_row_ids }

View File

@@ -10,11 +10,11 @@ use std::sync::Arc;
pub use merge_mapping::{MergeRowOrder, ShuffleMergeOrder, StackMergeOrder};
use super::writer::ColumnarSerializer;
use crate::column::{serialize_column_mappable_to_u128, serialize_column_mappable_to_u64};
use crate::column::{serialize_column_mappable_to_u64, serialize_column_mappable_to_u128};
use crate::column_values::MergedColumnValues;
use crate::columnar::ColumnarReader;
use crate::columnar::merge::merge_dict_column::merge_bytes_or_str_column;
use crate::columnar::writer::CompatibleNumericalTypes;
use crate::columnar::ColumnarReader;
use crate::dynamic_column::DynamicColumn;
use crate::{
BytesColumn, Column, ColumnIndex, ColumnType, ColumnValues, DynamicColumnHandle, NumericalType,
@@ -25,7 +25,7 @@ use crate::{
/// After merge, all columns belonging to the same category are coerced to
/// the same column type.
///
/// In practise, today, only Numerical colummns are coerced into one type today.
/// In practise, today, only Numerical columns are coerced into one type today.
///
/// See also [README.md].
///
@@ -63,11 +63,10 @@ impl From<ColumnType> for ColumnTypeCategory {
/// `require_columns` makes it possible to ensure that some columns will be present in the
/// resulting columnar. When a required column is a numerical column type, one of two things can
/// happen:
/// - If the required column type is compatible with all of the input columnar, the resulsting
/// merged
/// columnar will simply coerce the input column and use the required column type.
/// - If the required column type is incompatible with one of the input columnar, the merged
/// will fail with an InvalidData error.
/// - If the required column type is compatible with all of the input columnar, the resulting merged
/// columnar will simply coerce the input column and use the required column type.
/// - If the required column type is incompatible with one of the input columnar, the merged will
/// fail with an InvalidData error.
///
/// `merge_row_order` makes it possible to remove or reorder row in the resulting
/// `Columnar` table.
@@ -81,13 +80,12 @@ pub fn merge_columnar(
output: &mut impl io::Write,
) -> io::Result<()> {
let mut serializer = ColumnarSerializer::new(output);
let num_rows_per_columnar = columnar_readers
let num_docs_per_columnar = columnar_readers
.iter()
.map(|reader| reader.num_rows())
.map(|reader| reader.num_docs())
.collect::<Vec<u32>>();
let columns_to_merge =
group_columns_for_merge(columnar_readers, required_columns, &merge_row_order)?;
let columns_to_merge = group_columns_for_merge(columnar_readers, required_columns)?;
for res in columns_to_merge {
let ((column_name, _column_type_category), grouped_columns) = res;
let grouped_columns = grouped_columns.open(&merge_row_order)?;
@@ -95,15 +93,18 @@ pub fn merge_columnar(
continue;
}
let column_type = grouped_columns.column_type_after_merge();
let column_type_after_merge = grouped_columns.column_type_after_merge();
let mut columns = grouped_columns.columns;
coerce_columns(column_type, &mut columns)?;
// Make sure the number of columns is the same as the number of columnar readers.
// Or num_docs_per_columnar would be incorrect.
assert_eq!(columns.len(), columnar_readers.len());
coerce_columns(column_type_after_merge, &mut columns)?;
let mut column_serializer =
serializer.start_serialize_column(column_name.as_bytes(), column_type);
serializer.start_serialize_column(column_name.as_bytes(), column_type_after_merge);
merge_column(
column_type,
&num_rows_per_columnar,
column_type_after_merge,
&num_docs_per_columnar,
columns,
&merge_row_order,
&mut column_serializer,
@@ -129,7 +130,7 @@ fn dynamic_column_to_u64_monotonic(dynamic_column: DynamicColumn) -> Option<Colu
fn merge_column(
column_type: ColumnType,
num_docs_per_column: &[u32],
columns: Vec<Option<DynamicColumn>>,
columns_to_merge: Vec<Option<DynamicColumn>>,
merge_row_order: &MergeRowOrder,
wrt: &mut impl io::Write,
) -> io::Result<()> {
@@ -139,20 +140,21 @@ fn merge_column(
| ColumnType::F64
| ColumnType::DateTime
| ColumnType::Bool => {
let mut column_indexes: Vec<ColumnIndex> = Vec::with_capacity(columns.len());
let mut column_indexes: Vec<ColumnIndex> = Vec::with_capacity(columns_to_merge.len());
let mut column_values: Vec<Option<Arc<dyn ColumnValues>>> =
Vec::with_capacity(columns.len());
for (i, dynamic_column_opt) in columns.into_iter().enumerate() {
if let Some(Column { index: idx, values }) =
dynamic_column_opt.and_then(dynamic_column_to_u64_monotonic)
{
column_indexes.push(idx);
column_values.push(Some(values));
} else {
column_indexes.push(ColumnIndex::Empty {
num_docs: num_docs_per_column[i],
});
column_values.push(None);
Vec::with_capacity(columns_to_merge.len());
for (i, dynamic_column_opt) in columns_to_merge.into_iter().enumerate() {
match dynamic_column_opt.and_then(dynamic_column_to_u64_monotonic) {
Some(Column { index: idx, values }) => {
column_indexes.push(idx);
column_values.push(Some(values));
}
None => {
column_indexes.push(ColumnIndex::Empty {
num_docs: num_docs_per_column[i],
});
column_values.push(None);
}
}
}
let merged_column_index =
@@ -165,10 +167,10 @@ fn merge_column(
serialize_column_mappable_to_u64(merged_column_index, &merge_column_values, wrt)?;
}
ColumnType::IpAddr => {
let mut column_indexes: Vec<ColumnIndex> = Vec::with_capacity(columns.len());
let mut column_indexes: Vec<ColumnIndex> = Vec::with_capacity(columns_to_merge.len());
let mut column_values: Vec<Option<Arc<dyn ColumnValues<Ipv6Addr>>>> =
Vec::with_capacity(columns.len());
for (i, dynamic_column_opt) in columns.into_iter().enumerate() {
Vec::with_capacity(columns_to_merge.len());
for (i, dynamic_column_opt) in columns_to_merge.into_iter().enumerate() {
if let Some(DynamicColumn::IpAddr(Column { index: idx, values })) =
dynamic_column_opt
{
@@ -193,9 +195,10 @@ fn merge_column(
serialize_column_mappable_to_u128(merged_column_index, &merge_column_values, wrt)?;
}
ColumnType::Bytes | ColumnType::Str => {
let mut column_indexes: Vec<ColumnIndex> = Vec::with_capacity(columns.len());
let mut bytes_columns: Vec<Option<BytesColumn>> = Vec::with_capacity(columns.len());
for (i, dynamic_column_opt) in columns.into_iter().enumerate() {
let mut column_indexes: Vec<ColumnIndex> = Vec::with_capacity(columns_to_merge.len());
let mut bytes_columns: Vec<Option<BytesColumn>> =
Vec::with_capacity(columns_to_merge.len());
for (i, dynamic_column_opt) in columns_to_merge.into_iter().enumerate() {
match dynamic_column_opt {
Some(DynamicColumn::Str(str_column)) => {
column_indexes.push(str_column.term_ord_column.index.clone());
@@ -249,13 +252,15 @@ impl GroupedColumns {
if column_type.len() == 1 {
return column_type.into_iter().next().unwrap();
}
// At the moment, only the numerical categorical column type has more than one possible
// At the moment, only the numerical column type category has more than one possible
// column type.
assert!(self
.columns
.iter()
.flatten()
.all(|el| ColumnTypeCategory::from(el.column_type()) == ColumnTypeCategory::Numerical));
assert!(
self.columns
.iter()
.flatten()
.all(|el| ColumnTypeCategory::from(el.column_type())
== ColumnTypeCategory::Numerical)
);
merged_numerical_columns_type(self.columns.iter().flatten()).into()
}
}
@@ -362,7 +367,7 @@ fn is_empty_after_merge(
ColumnIndex::Empty { .. } => true,
ColumnIndex::Full => alive_bitset.len() == 0,
ColumnIndex::Optional(optional_index) => {
for doc in optional_index.iter_rows() {
for doc in optional_index.iter_non_null_docs() {
if alive_bitset.contains(doc) {
return false;
}
@@ -392,7 +397,6 @@ fn is_empty_after_merge(
fn group_columns_for_merge<'a>(
columnar_readers: &'a [&'a ColumnarReader],
required_columns: &'a [(String, ColumnType)],
_merge_row_order: &'a MergeRowOrder,
) -> io::Result<BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle>> {
let mut columns: BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle> = BTreeMap::new();

View File

@@ -5,28 +5,29 @@ use sstable::TermOrdinal;
use crate::Streamer;
pub struct HeapItem<'a> {
pub streamer: Streamer<'a>,
/// The terms of a column with the ordinal of the segment.
pub struct TermsWithSegmentOrd<'a> {
pub terms: Streamer<'a>,
pub segment_ord: usize,
}
impl<'a> PartialEq for HeapItem<'a> {
impl PartialEq for TermsWithSegmentOrd<'_> {
fn eq(&self, other: &Self) -> bool {
self.segment_ord == other.segment_ord
}
}
impl<'a> Eq for HeapItem<'a> {}
impl Eq for TermsWithSegmentOrd<'_> {}
impl<'a> PartialOrd for HeapItem<'a> {
fn partial_cmp(&self, other: &HeapItem<'a>) -> Option<Ordering> {
impl<'a> PartialOrd for TermsWithSegmentOrd<'a> {
fn partial_cmp(&self, other: &TermsWithSegmentOrd<'a>) -> Option<Ordering> {
Some(self.cmp(other))
}
}
impl<'a> Ord for HeapItem<'a> {
fn cmp(&self, other: &HeapItem<'a>) -> Ordering {
(&other.streamer.key(), &other.segment_ord).cmp(&(&self.streamer.key(), &self.segment_ord))
impl<'a> Ord for TermsWithSegmentOrd<'a> {
fn cmp(&self, other: &TermsWithSegmentOrd<'a>) -> Ordering {
(&other.terms.key(), &other.segment_ord).cmp(&(&self.terms.key(), &self.segment_ord))
}
}
@@ -35,42 +36,34 @@ impl<'a> Ord for HeapItem<'a> {
///
/// The item yield is actually a pair with
/// - the term
/// - a slice with the ordinal of the segments containing
/// the terms.
/// - a slice with the ordinal of the segments containing the terms.
pub struct TermMerger<'a> {
heap: BinaryHeap<HeapItem<'a>>,
current_streamers: Vec<HeapItem<'a>>,
heap: BinaryHeap<TermsWithSegmentOrd<'a>>,
term_streams_with_segment: Vec<TermsWithSegmentOrd<'a>>,
}
impl<'a> TermMerger<'a> {
/// Stream of merged term dictionary
pub fn new(streams: Vec<Streamer<'a>>) -> TermMerger<'a> {
pub fn new(term_streams_with_segment: Vec<TermsWithSegmentOrd<'a>>) -> TermMerger<'a> {
TermMerger {
heap: BinaryHeap::new(),
current_streamers: streams
.into_iter()
.enumerate()
.map(|(ord, streamer)| HeapItem {
streamer,
segment_ord: ord,
})
.collect(),
term_streams_with_segment,
}
}
pub(crate) fn matching_segments<'b: 'a>(
&'b self,
) -> impl 'b + Iterator<Item = (usize, TermOrdinal)> {
self.current_streamers
self.term_streams_with_segment
.iter()
.map(|heap_item| (heap_item.segment_ord, heap_item.streamer.term_ord()))
.map(|heap_item| (heap_item.segment_ord, heap_item.terms.term_ord()))
}
fn advance_segments(&mut self) {
let streamers = &mut self.current_streamers;
let streamers = &mut self.term_streams_with_segment;
let heap = &mut self.heap;
for mut heap_item in streamers.drain(..) {
if heap_item.streamer.advance() {
if heap_item.terms.advance() {
heap.push(heap_item);
}
}
@@ -81,18 +74,19 @@ impl<'a> TermMerger<'a> {
/// False if there is none.
pub fn advance(&mut self) -> bool {
self.advance_segments();
if let Some(head) = self.heap.pop() {
self.current_streamers.push(head);
while let Some(next_streamer) = self.heap.peek() {
if self.current_streamers[0].streamer.key() != next_streamer.streamer.key() {
break;
match self.heap.pop() {
Some(head) => {
self.term_streams_with_segment.push(head);
while let Some(next_streamer) = self.heap.peek() {
if self.term_streams_with_segment[0].terms.key() != next_streamer.terms.key() {
break;
}
let next_heap_it = self.heap.pop().unwrap(); // safe : we peeked beforehand
self.term_streams_with_segment.push(next_heap_it);
}
let next_heap_it = self.heap.pop().unwrap(); // safe : we peeked beforehand
self.current_streamers.push(next_heap_it);
true
}
true
} else {
false
_ => false,
}
}
@@ -102,6 +96,6 @@ impl<'a> TermMerger<'a> {
/// if and only if advance() has been called before
/// and "true" was returned.
pub fn key(&self) -> &[u8] {
self.current_streamers[0].streamer.key()
self.term_streams_with_segment[0].terms.key()
}
}

View File

@@ -1,7 +1,10 @@
use itertools::Itertools;
use proptest::collection::vec;
use proptest::prelude::*;
use super::*;
use crate::{Cardinality, ColumnarWriter, HasAssociatedColumnType, RowId};
use crate::columnar::{ColumnarReader, MergeRowOrder, StackMergeOrder, merge_columnar};
use crate::{Cardinality, ColumnarWriter, DynamicColumn, HasAssociatedColumnType, RowId};
fn make_columnar<T: Into<NumericalValue> + HasAssociatedColumnType + Copy>(
column_name: &str,
@@ -26,9 +29,8 @@ fn test_column_coercion_to_u64() {
// u64 type
let columnar2 = make_columnar("numbers", &[u64::MAX]);
let columnars = &[&columnar1, &columnar2];
let merge_order = StackMergeOrder::stack(columnars).into();
let column_map: BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle> =
group_columns_for_merge(columnars, &[], &merge_order).unwrap();
group_columns_for_merge(columnars, &[]).unwrap();
assert_eq!(column_map.len(), 1);
assert!(column_map.contains_key(&("numbers".to_string(), ColumnTypeCategory::Numerical)));
}
@@ -38,9 +40,8 @@ fn test_column_coercion_to_i64() {
let columnar1 = make_columnar("numbers", &[-1i64]);
let columnar2 = make_columnar("numbers", &[2u64]);
let columnars = &[&columnar1, &columnar2];
let merge_order = StackMergeOrder::stack(columnars).into();
let column_map: BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle> =
group_columns_for_merge(columnars, &[], &merge_order).unwrap();
group_columns_for_merge(columnars, &[]).unwrap();
assert_eq!(column_map.len(), 1);
assert!(column_map.contains_key(&("numbers".to_string(), ColumnTypeCategory::Numerical)));
}
@@ -63,14 +64,8 @@ fn test_group_columns_with_required_column() {
let columnar1 = make_columnar("numbers", &[1i64]);
let columnar2 = make_columnar("numbers", &[2u64]);
let columnars = &[&columnar1, &columnar2];
let merge_order = StackMergeOrder::stack(columnars).into();
let column_map: BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle> =
group_columns_for_merge(
&[&columnar1, &columnar2],
&[("numbers".to_string(), ColumnType::U64)],
&merge_order,
)
.unwrap();
group_columns_for_merge(columnars, &[("numbers".to_string(), ColumnType::U64)]).unwrap();
assert_eq!(column_map.len(), 1);
assert!(column_map.contains_key(&("numbers".to_string(), ColumnTypeCategory::Numerical)));
}
@@ -80,13 +75,9 @@ fn test_group_columns_required_column_with_no_existing_columns() {
let columnar1 = make_columnar("numbers", &[2u64]);
let columnar2 = make_columnar("numbers", &[2u64]);
let columnars = &[&columnar1, &columnar2];
let merge_order = StackMergeOrder::stack(columnars).into();
let column_map: BTreeMap<_, _> = group_columns_for_merge(
columnars,
&[("required_col".to_string(), ColumnType::Str)],
&merge_order,
)
.unwrap();
let column_map: BTreeMap<_, _> =
group_columns_for_merge(columnars, &[("required_col".to_string(), ColumnType::Str)])
.unwrap();
assert_eq!(column_map.len(), 2);
let columns = &column_map
.get(&("required_col".to_string(), ColumnTypeCategory::Str))
@@ -102,14 +93,8 @@ fn test_group_columns_required_column_is_above_all_columns_have_the_same_type_ru
let columnar1 = make_columnar("numbers", &[2i64]);
let columnar2 = make_columnar("numbers", &[2i64]);
let columnars = &[&columnar1, &columnar2];
let merge_order = StackMergeOrder::stack(columnars).into();
let column_map: BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle> =
group_columns_for_merge(
columnars,
&[("numbers".to_string(), ColumnType::U64)],
&merge_order,
)
.unwrap();
group_columns_for_merge(columnars, &[("numbers".to_string(), ColumnType::U64)]).unwrap();
assert_eq!(column_map.len(), 1);
assert!(column_map.contains_key(&("numbers".to_string(), ColumnTypeCategory::Numerical)));
}
@@ -119,9 +104,8 @@ fn test_missing_column() {
let columnar1 = make_columnar("numbers", &[-1i64]);
let columnar2 = make_columnar("numbers2", &[2u64]);
let columnars = &[&columnar1, &columnar2];
let merge_order = StackMergeOrder::stack(columnars).into();
let column_map: BTreeMap<(String, ColumnTypeCategory), GroupedColumnsHandle> =
group_columns_for_merge(columnars, &[], &merge_order).unwrap();
group_columns_for_merge(columnars, &[]).unwrap();
assert_eq!(column_map.len(), 2);
assert!(column_map.contains_key(&("numbers".to_string(), ColumnTypeCategory::Numerical)));
{
@@ -224,7 +208,7 @@ fn test_merge_columnar_numbers() {
)
.unwrap();
let columnar_reader = ColumnarReader::open(buffer).unwrap();
assert_eq!(columnar_reader.num_rows(), 3);
assert_eq!(columnar_reader.num_docs(), 3);
assert_eq!(columnar_reader.num_columns(), 1);
let cols = columnar_reader.read_columns("numbers").unwrap();
let dynamic_column = cols[0].open().unwrap();
@@ -252,7 +236,7 @@ fn test_merge_columnar_texts() {
)
.unwrap();
let columnar_reader = ColumnarReader::open(buffer).unwrap();
assert_eq!(columnar_reader.num_rows(), 3);
assert_eq!(columnar_reader.num_docs(), 3);
assert_eq!(columnar_reader.num_columns(), 1);
let cols = columnar_reader.read_columns("texts").unwrap();
let dynamic_column = cols[0].open().unwrap();
@@ -301,7 +285,7 @@ fn test_merge_columnar_byte() {
)
.unwrap();
let columnar_reader = ColumnarReader::open(buffer).unwrap();
assert_eq!(columnar_reader.num_rows(), 4);
assert_eq!(columnar_reader.num_docs(), 4);
assert_eq!(columnar_reader.num_columns(), 1);
let cols = columnar_reader.read_columns("bytes").unwrap();
let dynamic_column = cols[0].open().unwrap();
@@ -357,7 +341,7 @@ fn test_merge_columnar_byte_with_missing() {
)
.unwrap();
let columnar_reader = ColumnarReader::open(buffer).unwrap();
assert_eq!(columnar_reader.num_rows(), 3 + 2 + 3);
assert_eq!(columnar_reader.num_docs(), 3 + 2 + 3);
assert_eq!(columnar_reader.num_columns(), 2);
let cols = columnar_reader.read_columns("col").unwrap();
let dynamic_column = cols[0].open().unwrap();
@@ -409,7 +393,7 @@ fn test_merge_columnar_different_types() {
)
.unwrap();
let columnar_reader = ColumnarReader::open(buffer).unwrap();
assert_eq!(columnar_reader.num_rows(), 4);
assert_eq!(columnar_reader.num_docs(), 4);
assert_eq!(columnar_reader.num_columns(), 2);
let cols = columnar_reader.read_columns("mixed").unwrap();
@@ -419,11 +403,11 @@ fn test_merge_columnar_different_types() {
panic!()
};
assert_eq!(vals.get_cardinality(), Cardinality::Optional);
assert_eq!(vals.values_for_doc(0).collect_vec(), vec![]);
assert_eq!(vals.values_for_doc(1).collect_vec(), vec![]);
assert_eq!(vals.values_for_doc(2).collect_vec(), vec![]);
assert_eq!(vals.values_for_doc(0).collect_vec(), Vec::<i64>::new());
assert_eq!(vals.values_for_doc(1).collect_vec(), Vec::<i64>::new());
assert_eq!(vals.values_for_doc(2).collect_vec(), Vec::<i64>::new());
assert_eq!(vals.values_for_doc(3).collect_vec(), vec![1]);
assert_eq!(vals.values_for_doc(4).collect_vec(), vec![]);
assert_eq!(vals.values_for_doc(4).collect_vec(), Vec::<i64>::new());
// text column
let dynamic_column = cols[1].open().unwrap();
@@ -474,7 +458,7 @@ fn test_merge_columnar_different_empty_cardinality() {
)
.unwrap();
let columnar_reader = ColumnarReader::open(buffer).unwrap();
assert_eq!(columnar_reader.num_rows(), 2);
assert_eq!(columnar_reader.num_docs(), 2);
assert_eq!(columnar_reader.num_columns(), 2);
let cols = columnar_reader.read_columns("mixed").unwrap();
@@ -486,3 +470,119 @@ fn test_merge_columnar_different_empty_cardinality() {
let dynamic_column = cols[1].open().unwrap();
assert_eq!(dynamic_column.get_cardinality(), Cardinality::Optional);
}
#[derive(Debug, Clone)]
struct ColumnSpec {
column_name: String,
/// (row_id, term)
terms: Vec<(RowId, Vec<u8>)>,
}
#[derive(Clone, Debug)]
struct ColumnarSpec {
columns: Vec<ColumnSpec>,
}
/// Generate a random (row_id, term) pair:
/// - row_id in [0..10]
/// - term is either from POSSIBLE_TERMS or random bytes
fn rowid_and_term_strategy() -> impl Strategy<Value = (RowId, Vec<u8>)> {
const POSSIBLE_TERMS: &[&[u8]] = &[b"a", b"b", b"allo"];
let term_strat = prop_oneof![
// pick from the fixed list
(0..POSSIBLE_TERMS.len()).prop_map(|i| POSSIBLE_TERMS[i].to_vec()),
// or random bytes (length 0..10)
prop::collection::vec(any::<u8>(), 0..10),
];
(0u32..11, term_strat)
}
/// Generate one ColumnSpec, with a random name and a random list of (row_id, term).
/// We sort it by row_id so that data is in ascending order.
fn column_spec_strategy() -> impl Strategy<Value = ColumnSpec> {
let column_name = prop_oneof![
Just("col".to_string()),
Just("col2".to_string()),
"col.*".prop_map(|s| s),
];
// We'll produce 0..8 (rowid,term) entries for this column
let data_strat = vec(rowid_and_term_strategy(), 0..8).prop_map(|mut pairs| {
// Sort by row_id
pairs.sort_by_key(|(row_id, _)| *row_id);
pairs
});
(column_name, data_strat).prop_map(|(name, data)| ColumnSpec {
column_name: name,
terms: data,
})
}
/// Strategy to generate an ColumnarSpec
fn columnar_strategy() -> impl Strategy<Value = ColumnarSpec> {
vec(column_spec_strategy(), 0..3).prop_map(|columns| ColumnarSpec { columns })
}
/// Strategy to generate multiple ColumnarSpecs, each of which we will treat
/// as one "columnar" to be merged together.
fn columnars_strategy() -> impl Strategy<Value = Vec<ColumnarSpec>> {
vec(columnar_strategy(), 1..4)
}
/// Build a `ColumnarReader` from a `ColumnarSpec`
fn build_columnar(spec: &ColumnarSpec) -> ColumnarReader {
let mut writer = ColumnarWriter::default();
let mut max_row_id = 0;
for col in &spec.columns {
for &(row_id, ref term) in &col.terms {
writer.record_bytes(row_id, &col.column_name, term);
max_row_id = max_row_id.max(row_id);
}
}
let mut buffer = Vec::new();
writer.serialize(max_row_id + 1, &mut buffer).unwrap();
ColumnarReader::open(buffer).unwrap()
}
proptest! {
// We just test that the merge_columnar function doesn't crash.
#![proptest_config(ProptestConfig::with_cases(256))]
#[test]
fn test_merge_columnar_bytes_no_crash(columnars in columnars_strategy(), second_merge_columnars in columnars_strategy()) {
let columnars: Vec<ColumnarReader> = columnars.iter()
.map(build_columnar)
.collect();
let mut out = Vec::new();
let columnar_refs: Vec<&ColumnarReader> = columnars.iter().collect();
let stack_merge_order = StackMergeOrder::stack(&columnar_refs);
merge_columnar(
&columnar_refs,
&[],
MergeRowOrder::Stack(stack_merge_order),
&mut out,
).unwrap();
let merged_reader = ColumnarReader::open(out).unwrap();
// Merge the second set of columnars with the result of the first merge
let mut columnars: Vec<ColumnarReader> = second_merge_columnars.iter()
.map(build_columnar)
.collect();
columnars.push(merged_reader);
let mut out = Vec::new();
let columnar_refs: Vec<&ColumnarReader> = columnars.iter().collect();
let stack_merge_order = StackMergeOrder::stack(&columnar_refs);
merge_columnar(
&columnar_refs,
&[],
MergeRowOrder::Stack(stack_merge_order),
&mut out,
).unwrap();
}
}

View File

@@ -5,9 +5,9 @@ mod reader;
mod writer;
pub use column_type::{ColumnType, HasAssociatedColumnType};
pub use format_version::{Version, CURRENT_VERSION};
pub use format_version::{CURRENT_VERSION, Version};
#[cfg(test)]
pub(crate) use merge::ColumnTypeCategory;
pub use merge::{merge_columnar, MergeRowOrder, ShuffleMergeOrder, StackMergeOrder};
pub use merge::{MergeRowOrder, ShuffleMergeOrder, StackMergeOrder, merge_columnar};
pub use reader::ColumnarReader;
pub use writer::ColumnarWriter;

View File

@@ -1,10 +1,11 @@
use std::{fmt, io, mem};
use common::file_slice::FileSlice;
use common::BinarySerializable;
use common::file_slice::FileSlice;
use common::json_path_writer::JSON_PATH_SEGMENT_SEP;
use sstable::{Dictionary, RangeSSTable};
use crate::columnar::{format_version, ColumnType};
use crate::columnar::{ColumnType, format_version};
use crate::dynamic_column::DynamicColumnHandle;
use crate::{RowId, Version};
@@ -18,13 +19,13 @@ fn io_invalid_data(msg: String) -> io::Error {
pub struct ColumnarReader {
column_dictionary: Dictionary<RangeSSTable>,
column_data: FileSlice,
num_rows: RowId,
num_docs: RowId,
format_version: Version,
}
impl fmt::Debug for ColumnarReader {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
let num_rows = self.num_rows();
let num_rows = self.num_docs();
let columns = self.list_columns().unwrap();
let num_cols = columns.len();
let mut debug_struct = f.debug_struct("Columnar");
@@ -76,6 +77,19 @@ fn read_all_columns_in_stream(
Ok(results)
}
fn column_dictionary_prefix_for_column_name(column_name: &str) -> String {
// Each column is a associated to a given `column_key`,
// that starts by `column_name\0column_header`.
//
// Listing the columns associated to the given column name is therefore equivalent to
// listing `column_key` with the prefix `column_name\0`.
format!("{}{}", column_name, '\0')
}
fn column_dictionary_prefix_for_subpath(root_path: &str) -> String {
format!("{}{}", root_path, JSON_PATH_SEGMENT_SEP as char)
}
impl ColumnarReader {
/// Opens a new Columnar file.
pub fn open<F>(file_slice: F) -> io::Result<ColumnarReader>
@@ -98,13 +112,13 @@ impl ColumnarReader {
Ok(ColumnarReader {
column_dictionary,
column_data,
num_rows,
num_docs: num_rows,
format_version,
})
}
pub fn num_rows(&self) -> RowId {
self.num_rows
pub fn num_docs(&self) -> RowId {
self.num_docs
}
// Iterate over the columns in a sorted way
pub fn iter_columns(
@@ -144,32 +158,14 @@ impl ColumnarReader {
Ok(self.iter_columns()?.collect())
}
fn stream_for_column_range(&self, column_name: &str) -> sstable::StreamerBuilder<RangeSSTable> {
// Each column is a associated to a given `column_key`,
// that starts by `column_name\0column_header`.
//
// Listing the columns associated to the given column name is therefore equivalent to
// listing `column_key` with the prefix `column_name\0`.
//
// This is in turn equivalent to searching for the range
// `[column_name,\0`..column_name\1)`.
// TODO can we get some more generic `prefix(..)` logic in the dictionary.
let mut start_key = column_name.to_string();
start_key.push('\0');
let mut end_key = column_name.to_string();
end_key.push(1u8 as char);
self.column_dictionary
.range()
.ge(start_key.as_bytes())
.lt(end_key.as_bytes())
}
pub async fn read_columns_async(
&self,
column_name: &str,
) -> io::Result<Vec<DynamicColumnHandle>> {
let prefix = column_dictionary_prefix_for_column_name(column_name);
let stream = self
.stream_for_column_range(column_name)
.column_dictionary
.prefix_range(prefix)
.into_stream_async()
.await?;
read_all_columns_in_stream(stream, &self.column_data, self.format_version)
@@ -180,7 +176,35 @@ impl ColumnarReader {
/// There can be more than one column associated to a given column name, provided they have
/// different types.
pub fn read_columns(&self, column_name: &str) -> io::Result<Vec<DynamicColumnHandle>> {
let stream = self.stream_for_column_range(column_name).into_stream()?;
let prefix = column_dictionary_prefix_for_column_name(column_name);
let stream = self.column_dictionary.prefix_range(prefix).into_stream()?;
read_all_columns_in_stream(stream, &self.column_data, self.format_version)
}
pub async fn read_subpath_columns_async(
&self,
root_path: &str,
) -> io::Result<Vec<DynamicColumnHandle>> {
let prefix = column_dictionary_prefix_for_subpath(root_path);
let stream = self
.column_dictionary
.prefix_range(prefix)
.into_stream_async()
.await?;
read_all_columns_in_stream(stream, &self.column_data, self.format_version)
}
/// Get all inner columns for a given JSON prefix, i.e columns for which the name starts
/// with the prefix then contain the [`JSON_PATH_SEGMENT_SEP`].
///
/// There can be more than one column associated to each path within the JSON structure,
/// provided they have different types.
pub fn read_subpath_columns(&self, root_path: &str) -> io::Result<Vec<DynamicColumnHandle>> {
let prefix = column_dictionary_prefix_for_subpath(root_path);
let stream = self
.column_dictionary
.prefix_range(prefix.as_bytes())
.into_stream()?;
read_all_columns_in_stream(stream, &self.column_data, self.format_version)
}
@@ -192,6 +216,8 @@ impl ColumnarReader {
#[cfg(test)]
mod tests {
use common::json_path_writer::JSON_PATH_SEGMENT_SEP;
use crate::{ColumnType, ColumnarReader, ColumnarWriter};
#[test]
@@ -224,6 +250,64 @@ mod tests {
assert_eq!(columns[0].1.column_type(), ColumnType::U64);
}
#[test]
fn test_read_columns() {
let mut columnar_writer = ColumnarWriter::default();
columnar_writer.record_column_type("col", ColumnType::U64, false);
columnar_writer.record_numerical(1, "col", 1u64);
let mut buffer = Vec::new();
columnar_writer.serialize(2, &mut buffer).unwrap();
let columnar = ColumnarReader::open(buffer).unwrap();
{
let columns = columnar.read_columns("col").unwrap();
assert_eq!(columns.len(), 1);
assert_eq!(columns[0].column_type(), ColumnType::U64);
}
{
let columns = columnar.read_columns("other").unwrap();
assert_eq!(columns.len(), 0);
}
}
#[test]
fn test_read_subpath_columns() {
let mut columnar_writer = ColumnarWriter::default();
columnar_writer.record_str(
0,
&format!("col1{}subcol1", JSON_PATH_SEGMENT_SEP as char),
"hello",
);
columnar_writer.record_numerical(
0,
&format!("col1{}subcol2", JSON_PATH_SEGMENT_SEP as char),
1i64,
);
columnar_writer.record_str(1, "col1", "hello");
columnar_writer.record_str(0, "col2", "hello");
let mut buffer = Vec::new();
columnar_writer.serialize(2, &mut buffer).unwrap();
let columnar = ColumnarReader::open(buffer).unwrap();
{
let columns = columnar.read_subpath_columns("col1").unwrap();
assert_eq!(columns.len(), 2);
assert_eq!(columns[0].column_type(), ColumnType::Str);
assert_eq!(columns[1].column_type(), ColumnType::I64);
}
{
let columns = columnar.read_subpath_columns("col1.subcol1").unwrap();
assert_eq!(columns.len(), 0);
}
{
let columns = columnar.read_subpath_columns("col2").unwrap();
assert_eq!(columns.len(), 0);
}
{
let columns = columnar.read_subpath_columns("other").unwrap();
assert_eq!(columns.len(), 0);
}
}
#[test]
#[should_panic(expected = "Input type forbidden")]
fn test_list_columns_strict_typing_panics_on_wrong_types() {

View File

@@ -87,7 +87,7 @@ impl<V: SymbolValue> ColumnOperation<V> {
minibuf
}
/// Deserialize a colummn operation.
/// Deserialize a column operation.
/// Returns None if the buffer is empty.
///
/// Panics if the payload is invalid:
@@ -122,7 +122,6 @@ impl<T> From<T> for ColumnOperation<T> {
// In order to limit memory usage, and in order
// to benefit from the stacker, we do this by serialization our data
// as "Symbols".
#[allow(clippy::from_over_into)]
pub(super) trait SymbolValue: Clone + Copy {
// Serializes the symbol into the given buffer.
// Returns the number of bytes written into the buffer.
@@ -245,7 +244,7 @@ impl SymbolValue for UnorderedId {
fn compute_num_bytes_for_u64(val: u64) -> usize {
let msb = (64u32 - val.leading_zeros()) as usize;
(msb + 7) / 8
msb.div_ceil(8)
}
fn encode_zig_zag(n: i64) -> u64 {

View File

@@ -42,7 +42,7 @@ impl ColumnWriter {
&self,
arena: &MemoryArena,
buffer: &'a mut Vec<u8>,
) -> impl Iterator<Item = ColumnOperation<V>> + 'a {
) -> impl Iterator<Item = ColumnOperation<V>> + 'a + use<'a, V> {
buffer.clear();
self.values.read_to_end(arena, buffer);
let mut cursor: &[u8] = &buffer[..];
@@ -104,9 +104,10 @@ pub(crate) struct NumericalColumnWriter {
impl NumericalColumnWriter {
pub fn force_numerical_type(&mut self, numerical_type: NumericalType) {
assert!(self
.compatible_numerical_types
.is_type_accepted(numerical_type));
assert!(
self.compatible_numerical_types
.is_type_accepted(numerical_type)
);
self.compatible_numerical_types = CompatibleNumericalTypes::StaticType(numerical_type);
}
}
@@ -211,7 +212,7 @@ impl NumericalColumnWriter {
self,
arena: &MemoryArena,
buffer: &'a mut Vec<u8>,
) -> impl Iterator<Item = ColumnOperation<NumericalValue>> + 'a {
) -> impl Iterator<Item = ColumnOperation<NumericalValue>> + 'a + use<'a> {
self.column_writer.operation_iterator(arena, buffer)
}
}
@@ -255,7 +256,7 @@ impl StrOrBytesColumnWriter {
&self,
arena: &MemoryArena,
byte_buffer: &'a mut Vec<u8>,
) -> impl Iterator<Item = ColumnOperation<UnorderedId>> + 'a {
) -> impl Iterator<Item = ColumnOperation<UnorderedId>> + 'a + use<'a> {
self.column_writer.operation_iterator(arena, byte_buffer)
}
}

View File

@@ -9,11 +9,12 @@ use std::net::Ipv6Addr;
use column_operation::ColumnOperation;
pub(crate) use column_writers::CompatibleNumericalTypes;
use common::CountingWriter;
use common::json_path_writer::JSON_END_OF_PATH;
pub(crate) use serializer::ColumnarSerializer;
use stacker::{Addr, ArenaHashMap, MemoryArena};
use crate::column_index::{SerializableColumnIndex, SerializableOptionalIndex};
use crate::column_values::{MonotonicallyMappableToU128, MonotonicallyMappableToU64};
use crate::column_values::{MonotonicallyMappableToU64, MonotonicallyMappableToU128};
use crate::columnar::column_type::ColumnType;
use crate::columnar::writer::column_writers::{
ColumnWriter, NumericalColumnWriter, StrOrBytesColumnWriter,
@@ -247,6 +248,7 @@ impl ColumnarWriter {
}
pub fn serialize(&mut self, num_docs: RowId, wrt: &mut dyn io::Write) -> io::Result<()> {
let mut serializer = ColumnarSerializer::new(wrt);
let mut columns: Vec<(&[u8], ColumnType, Addr)> = self
.numerical_field_hash_map
.iter()
@@ -260,7 +262,7 @@ impl ColumnarWriter {
columns.extend(
self.bytes_field_hash_map
.iter()
.map(|(term, addr)| (term, ColumnType::Bytes, addr)),
.map(|(column_name, addr)| (column_name, ColumnType::Bytes, addr)),
);
columns.extend(
self.str_field_hash_map
@@ -283,10 +285,15 @@ impl ColumnarWriter {
.map(|(column_name, addr)| (column_name, ColumnType::DateTime, addr)),
);
columns.sort_unstable_by_key(|(column_name, col_type, _)| (*column_name, *col_type));
let (arena, buffers, dictionaries) = (&self.arena, &mut self.buffers, &self.dictionaries);
let mut symbol_byte_buffer: Vec<u8> = Vec::new();
for (column_name, column_type, addr) in columns {
if column_name.contains(&JSON_END_OF_PATH) {
// Tantivy uses b'0' as a separator for nested fields in JSON.
// Column names with a b'0' are not simply ignored by the columnar (and the inverted
// index).
continue;
}
match column_type {
ColumnType::Bool => {
let column_writer: ColumnWriter = self.bool_field_hash_map.read(addr);
@@ -384,7 +391,7 @@ impl ColumnarWriter {
// Serialize [Dictionary, Column, dictionary num bytes U32::LE]
// Column: [Column Index, Column Values, column index num bytes U32::LE]
#[allow(clippy::too_many_arguments)]
#[expect(clippy::too_many_arguments)]
fn serialize_bytes_or_str_column(
cardinality: Cardinality,
num_docs: RowId,

View File

@@ -1,12 +1,13 @@
use std::io;
use std::io::Write;
use common::json_path_writer::JSON_END_OF_PATH;
use common::{BinarySerializable, CountingWriter};
use sstable::value::RangeValueWriter;
use sstable::RangeSSTable;
use sstable::value::RangeValueWriter;
use crate::columnar::ColumnType;
use crate::RowId;
use crate::columnar::ColumnType;
pub struct ColumnarSerializer<W: io::Write> {
wrt: CountingWriter<W>,
@@ -18,13 +19,8 @@ pub struct ColumnarSerializer<W: io::Write> {
/// code.
fn prepare_key(key: &[u8], column_type: ColumnType, buffer: &mut Vec<u8>) {
buffer.clear();
// Convert 0 bytes to '0' string, as 0 bytes are reserved for the end of the path.
if key.contains(&0u8) {
buffer.extend(key.iter().map(|&b| if b == 0 { b'0' } else { b }));
} else {
buffer.extend_from_slice(key);
}
buffer.push(0u8);
buffer.extend_from_slice(key);
buffer.push(JSON_END_OF_PATH);
buffer.push(column_type.to_code());
}
@@ -71,7 +67,7 @@ pub struct ColumnSerializer<'a, W: io::Write> {
start_offset: u64,
}
impl<'a, W: io::Write> ColumnSerializer<'a, W> {
impl<W: io::Write> ColumnSerializer<'_, W> {
pub fn finalize(self) -> io::Result<()> {
let end_offset: u64 = self.columnar_serializer.wrt.written_bytes();
let byte_range = self.start_offset..end_offset;
@@ -84,7 +80,7 @@ impl<'a, W: io::Write> ColumnSerializer<'a, W> {
}
}
impl<'a, W: io::Write> io::Write for ColumnSerializer<'a, W> {
impl<W: io::Write> io::Write for ColumnSerializer<'_, W> {
fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
self.columnar_serializer.wrt.write(buf)
}
@@ -97,18 +93,3 @@ impl<'a, W: io::Write> io::Write for ColumnSerializer<'a, W> {
self.columnar_serializer.wrt.write_all(buf)
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_prepare_key_bytes() {
let mut buffer: Vec<u8> = b"somegarbage".to_vec();
prepare_key(b"root\0child", ColumnType::Str, &mut buffer);
assert_eq!(buffer.len(), 12);
assert_eq!(&buffer[..10], b"root0child");
assert_eq!(buffer[10], 0u8);
assert_eq!(buffer[11], ColumnType::Str.to_code());
}
}

View File

@@ -1,6 +1,6 @@
use crate::RowId;
use crate::column_index::{SerializableMultivalueIndex, SerializableOptionalIndex};
use crate::iterable::Iterable;
use crate::RowId;
/// The `IndexBuilder` interprets a sequence of
/// calls of the form:
@@ -31,12 +31,13 @@ pub struct OptionalIndexBuilder {
impl OptionalIndexBuilder {
pub fn finish(&mut self, num_rows: RowId) -> impl Iterable<RowId> + '_ {
debug_assert!(self
.docs
.last()
.copied()
.map(|last_doc| last_doc < num_rows)
.unwrap_or(true));
debug_assert!(
self.docs
.last()
.copied()
.map(|last_doc| last_doc < num_rows)
.unwrap_or(true)
);
&self.docs[..]
}
@@ -48,12 +49,13 @@ impl OptionalIndexBuilder {
impl IndexBuilder for OptionalIndexBuilder {
#[inline(always)]
fn record_row(&mut self, doc: RowId) {
debug_assert!(self
.docs
.last()
.copied()
.map(|prev_doc| doc > prev_doc)
.unwrap_or(true));
debug_assert!(
self.docs
.last()
.copied()
.map(|prev_doc| doc > prev_doc)
.unwrap_or(true)
);
self.docs.push(doc);
}
}

View File

@@ -3,8 +3,8 @@ use std::path::PathBuf;
use itertools::Itertools;
use crate::{
merge_columnar, Cardinality, Column, ColumnarReader, DynamicColumn, StackMergeOrder,
CURRENT_VERSION,
CURRENT_VERSION, Cardinality, Column, ColumnarReader, DynamicColumn, StackMergeOrder,
merge_columnar,
};
const NUM_DOCS: u32 = u16::MAX as u32;

View File

@@ -6,7 +6,7 @@ use common::file_slice::FileSlice;
use common::{ByteCount, DateTime, HasLen, OwnedBytes};
use crate::column::{BytesColumn, Column, StrColumn};
use crate::column_values::{monotonic_map_column, StrictlyMonotonicFn};
use crate::column_values::{StrictlyMonotonicFn, monotonic_map_column};
use crate::columnar::ColumnType;
use crate::{Cardinality, ColumnIndex, ColumnValues, NumericalType, Version};

View File

@@ -7,7 +7,7 @@ pub trait Iterable<T = u64> {
fn boxed_iter(&self) -> Box<dyn Iterator<Item = T> + '_>;
}
impl<'a, T: Copy> Iterable<T> for &'a [T] {
impl<T: Copy> Iterable<T> for &[T] {
fn boxed_iter(&self) -> Box<dyn Iterator<Item = T> + '_> {
Box::new(self.iter().copied())
}

View File

@@ -17,15 +17,10 @@
//! column.
//! - [column_values]: Stores the values of a column in a dense format.
#![cfg_attr(all(feature = "unstable", test), feature(test))]
#[cfg(test)]
#[macro_use]
extern crate more_asserts;
#[cfg(all(test, feature = "unstable"))]
extern crate test;
use std::fmt::Display;
use std::io;
@@ -44,11 +39,11 @@ pub use block_accessor::ColumnBlockAccessor;
pub use column::{BytesColumn, Column, StrColumn};
pub use column_index::ColumnIndex;
pub use column_values::{
ColumnValues, EmptyColumnValues, MonotonicallyMappableToU128, MonotonicallyMappableToU64,
ColumnValues, EmptyColumnValues, MonotonicallyMappableToU64, MonotonicallyMappableToU128,
};
pub use columnar::{
merge_columnar, ColumnType, ColumnarReader, ColumnarWriter, HasAssociatedColumnType,
MergeRowOrder, ShuffleMergeOrder, StackMergeOrder, Version, CURRENT_VERSION,
CURRENT_VERSION, ColumnType, ColumnarReader, ColumnarWriter, HasAssociatedColumnType,
MergeRowOrder, ShuffleMergeOrder, StackMergeOrder, Version, merge_columnar,
};
use sstable::VoidSSTable;
pub use value::{NumericalType, NumericalValue};

View File

@@ -380,7 +380,7 @@ fn assert_columnar_eq(
right: &ColumnarReader,
lenient_on_numerical_value: bool,
) {
assert_eq!(left.num_rows(), right.num_rows());
assert_eq!(left.num_docs(), right.num_docs());
let left_columns = left.list_columns().unwrap();
let right_columns = right.list_columns().unwrap();
assert_eq!(left_columns.len(), right_columns.len());
@@ -588,7 +588,7 @@ proptest! {
#[test]
fn test_single_columnar_builder_proptest(docs in columnar_docs_strategy()) {
let columnar = build_columnar(&docs[..]);
assert_eq!(columnar.num_rows() as usize, docs.len());
assert_eq!(columnar.num_docs() as usize, docs.len());
let mut expected_columns: HashMap<(&str, ColumnTypeCategory), HashMap<u32, Vec<&ColumnValue>> > = Default::default();
for (doc_id, doc_vals) in docs.iter().enumerate() {
for (col_name, col_val) in doc_vals {
@@ -715,8 +715,9 @@ fn test_columnar_merging_number_columns() {
// TODO test required_columns
// TODO document edge case: required_columns incompatible with values.
fn columnar_docs_and_remap(
) -> impl Strategy<Value = (Vec<Vec<Vec<(&'static str, ColumnValue)>>>, Vec<RowAddr>)> {
#[allow(clippy::type_complexity)]
fn columnar_docs_and_remap()
-> impl Strategy<Value = (Vec<Vec<Vec<(&'static str, ColumnValue)>>>, Vec<RowAddr>)> {
proptest::collection::vec(columnar_docs_strategy(), 2..=3).prop_flat_map(
|columnars_docs: Vec<Vec<Vec<(&str, ColumnValue)>>>| {
let row_addrs: Vec<RowAddr> = columnars_docs
@@ -819,7 +820,7 @@ fn test_columnar_merge_empty() {
)
.unwrap();
let merged_columnar = ColumnarReader::open(output).unwrap();
assert_eq!(merged_columnar.num_rows(), 0);
assert_eq!(merged_columnar.num_docs(), 0);
assert_eq!(merged_columnar.num_columns(), 0);
}
@@ -845,7 +846,7 @@ fn test_columnar_merge_single_str_column() {
)
.unwrap();
let merged_columnar = ColumnarReader::open(output).unwrap();
assert_eq!(merged_columnar.num_rows(), 1);
assert_eq!(merged_columnar.num_docs(), 1);
assert_eq!(merged_columnar.num_columns(), 1);
}
@@ -877,7 +878,7 @@ fn test_delete_decrease_cardinality() {
)
.unwrap();
let merged_columnar = ColumnarReader::open(output).unwrap();
assert_eq!(merged_columnar.num_rows(), 1);
assert_eq!(merged_columnar.num_docs(), 1);
assert_eq!(merged_columnar.num_columns(), 1);
let cols = merged_columnar.read_columns("c").unwrap();
assert_eq!(cols.len(), 1);

View File

@@ -1,3 +1,5 @@
use std::str::FromStr;
use common::DateTime;
use crate::InvalidData;
@@ -9,6 +11,23 @@ pub enum NumericalValue {
F64(f64),
}
impl FromStr for NumericalValue {
type Err = ();
fn from_str(s: &str) -> Result<Self, ()> {
if let Ok(val_i64) = s.parse::<i64>() {
return Ok(val_i64.into());
}
if let Ok(val_u64) = s.parse::<u64>() {
return Ok(val_u64.into());
}
if let Ok(val_f64) = s.parse::<f64>() {
return Ok(NumericalValue::from(val_f64).normalize());
}
Err(())
}
}
impl NumericalValue {
pub fn numerical_type(&self) -> NumericalType {
match self {
@@ -17,6 +36,31 @@ impl NumericalValue {
NumericalValue::F64(_) => NumericalType::F64,
}
}
/// Tries to normalize the numerical value in the following priorities:
/// i64, i64, f64
pub fn normalize(self) -> Self {
match self {
NumericalValue::U64(val) => {
if val <= i64::MAX as u64 {
NumericalValue::I64(val as i64)
} else {
NumericalValue::U64(val)
}
}
NumericalValue::I64(val) => NumericalValue::I64(val),
NumericalValue::F64(val) => {
let fract = val.fract();
if fract == 0.0 && val >= i64::MIN as f64 && val <= i64::MAX as f64 {
NumericalValue::I64(val as i64)
} else if fract == 0.0 && val >= u64::MIN as f64 && val <= u64::MAX as f64 {
NumericalValue::U64(val as u64)
} else {
NumericalValue::F64(val)
}
}
}
}
}
impl From<u64> for NumericalValue {
@@ -116,6 +160,7 @@ impl Coerce for DateTime {
#[cfg(test)]
mod tests {
use super::NumericalType;
use crate::NumericalValue;
#[test]
fn test_numerical_type_code() {
@@ -128,4 +173,58 @@ mod tests {
}
assert_eq!(num_numerical_type, 3);
}
#[test]
fn test_parse_numerical() {
assert_eq!(
"123".parse::<NumericalValue>().unwrap(),
NumericalValue::I64(123)
);
assert_eq!(
"18446744073709551615".parse::<NumericalValue>().unwrap(),
NumericalValue::U64(18446744073709551615u64)
);
assert_eq!(
"1.0".parse::<NumericalValue>().unwrap(),
NumericalValue::I64(1i64)
);
assert_eq!(
"1.1".parse::<NumericalValue>().unwrap(),
NumericalValue::F64(1.1f64)
);
assert_eq!(
"-1.0".parse::<NumericalValue>().unwrap(),
NumericalValue::I64(-1i64)
);
}
#[test]
fn test_normalize_numerical() {
assert_eq!(
NumericalValue::from(1u64).normalize(),
NumericalValue::I64(1i64),
);
let limit_val = i64::MAX as u64 + 1u64;
assert_eq!(
NumericalValue::from(limit_val).normalize(),
NumericalValue::U64(limit_val),
);
assert_eq!(
NumericalValue::from(-1i64).normalize(),
NumericalValue::I64(-1i64),
);
assert_eq!(
NumericalValue::from(-2.0f64).normalize(),
NumericalValue::I64(-2i64),
);
assert_eq!(
NumericalValue::from(-2.1f64).normalize(),
NumericalValue::F64(-2.1f64),
);
let large_float = 2.0f64.powf(70.0f64);
assert_eq!(
NumericalValue::from(large_float).normalize(),
NumericalValue::F64(large_float),
);
}
}

View File

@@ -1,27 +1,25 @@
[package]
name = "tantivy-common"
version = "0.7.0"
version = "0.10.0"
authors = ["Paul Masurel <paul@quickwit.io>", "Pascal Seitz <pascal@quickwit.io>"]
license = "MIT"
edition = "2021"
edition = "2024"
description = "common traits and utility functions used by multiple tantivy subcrates"
documentation = "https://docs.rs/tantivy_common/"
homepage = "https://github.com/quickwit-oss/tantivy"
repository = "https://github.com/quickwit-oss/tantivy"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
byteorder = "1.4.3"
ownedbytes = { version= "0.7", path="../ownedbytes" }
ownedbytes = { version= "0.9", path="../ownedbytes" }
async-trait = "0.1"
time = { version = "0.3.10", features = ["serde-well-known"] }
serde = { version = "1.0.136", features = ["derive"] }
[dev-dependencies]
binggan = "0.14.0"
proptest = "1.0.0"
rand = "0.8.4"
[features]
unstable = [] # useful for benches.

View File

@@ -1,39 +1,64 @@
#![feature(test)]
use binggan::{BenchRunner, black_box};
use rand::seq::IteratorRandom;
use rand::thread_rng;
use tantivy_common::{BitSet, TinySet, serialize_vint_u32};
extern crate test;
fn bench_vint() {
let mut runner = BenchRunner::new();
#[cfg(test)]
mod tests {
use rand::seq::IteratorRandom;
use rand::thread_rng;
use tantivy_common::serialize_vint_u32;
use test::Bencher;
let vals: Vec<u32> = (0..20_000).collect();
runner.bench_function("bench_vint", move |_| {
let mut out = 0u64;
for val in vals.iter().cloned() {
let mut buf = [0u8; 8];
serialize_vint_u32(val, &mut buf);
out += u64::from(buf[0]);
}
black_box(out);
});
#[bench]
fn bench_vint(b: &mut Bencher) {
let vals: Vec<u32> = (0..20_000).collect();
b.iter(|| {
let mut out = 0u64;
for val in vals.iter().cloned() {
let mut buf = [0u8; 8];
serialize_vint_u32(val, &mut buf);
out += u64::from(buf[0]);
}
out
});
}
#[bench]
fn bench_vint_rand(b: &mut Bencher) {
let vals: Vec<u32> = (0..20_000).choose_multiple(&mut thread_rng(), 100_000);
b.iter(|| {
let mut out = 0u64;
for val in vals.iter().cloned() {
let mut buf = [0u8; 8];
serialize_vint_u32(val, &mut buf);
out += u64::from(buf[0]);
}
out
});
}
let vals: Vec<u32> = (0..20_000).choose_multiple(&mut thread_rng(), 100_000);
runner.bench_function("bench_vint_rand", move |_| {
let mut out = 0u64;
for val in vals.iter().cloned() {
let mut buf = [0u8; 8];
serialize_vint_u32(val, &mut buf);
out += u64::from(buf[0]);
}
black_box(out);
});
}
fn bench_bitset() {
let mut runner = BenchRunner::new();
runner.bench_function("bench_tinyset_pop", move |_| {
let mut tinyset = TinySet::singleton(black_box(31u32));
tinyset.pop_lowest();
tinyset.pop_lowest();
tinyset.pop_lowest();
tinyset.pop_lowest();
tinyset.pop_lowest();
tinyset.pop_lowest();
black_box(tinyset);
});
let tiny_set = TinySet::empty().insert(10u32).insert(14u32).insert(21u32);
runner.bench_function("bench_tinyset_sum", move |_| {
assert_eq!(black_box(tiny_set).into_iter().sum::<u32>(), 45u32);
});
let v = [10u32, 14u32, 21u32];
runner.bench_function("bench_tinyarr_sum", move |_| {
black_box(v.iter().cloned().sum::<u32>());
});
runner.bench_function("bench_bitset_initialize", move |_| {
black_box(BitSet::with_max_value(1_000_000));
});
}
fn main() {
bench_vint();
bench_bitset();
}

View File

@@ -183,7 +183,7 @@ pub struct BitSet {
}
fn num_buckets(max_val: u32) -> u32 {
(max_val + 63u32) / 64u32
max_val.div_ceil(64u32)
}
impl BitSet {
@@ -696,43 +696,3 @@ mod tests {
}
}
}
#[cfg(all(test, feature = "unstable"))]
mod bench {
use test;
use super::{BitSet, TinySet};
#[bench]
fn bench_tinyset_pop(b: &mut test::Bencher) {
b.iter(|| {
let mut tinyset = TinySet::singleton(test::black_box(31u32));
tinyset.pop_lowest();
tinyset.pop_lowest();
tinyset.pop_lowest();
tinyset.pop_lowest();
tinyset.pop_lowest();
tinyset.pop_lowest();
});
}
#[bench]
fn bench_tinyset_sum(b: &mut test::Bencher) {
let tiny_set = TinySet::empty().insert(10u32).insert(14u32).insert(21u32);
b.iter(|| {
assert_eq!(test::black_box(tiny_set).into_iter().sum::<u32>(), 45u32);
});
}
#[bench]
fn bench_tinyarr_sum(b: &mut test::Bencher) {
let v = [10u32, 14u32, 21u32];
b.iter(|| test::black_box(v).iter().cloned().sum::<u32>());
}
#[bench]
fn bench_bitset_initialize(b: &mut test::Bencher) {
b.iter(|| BitSet::with_max_value(1_000_000));
}
}

130
common/src/bounds.rs Normal file
View File

@@ -0,0 +1,130 @@
use std::io;
use std::ops::Bound;
#[derive(Clone, Debug)]
pub struct BoundsRange<T> {
pub lower_bound: Bound<T>,
pub upper_bound: Bound<T>,
}
impl<T> BoundsRange<T> {
pub fn new(lower_bound: Bound<T>, upper_bound: Bound<T>) -> Self {
BoundsRange {
lower_bound,
upper_bound,
}
}
pub fn is_unbounded(&self) -> bool {
matches!(self.lower_bound, Bound::Unbounded) && matches!(self.upper_bound, Bound::Unbounded)
}
pub fn map_bound<TTo>(&self, transform: impl Fn(&T) -> TTo) -> BoundsRange<TTo> {
BoundsRange {
lower_bound: map_bound(&self.lower_bound, &transform),
upper_bound: map_bound(&self.upper_bound, &transform),
}
}
pub fn map_bound_res<TTo, Err>(
&self,
transform: impl Fn(&T) -> Result<TTo, Err>,
) -> Result<BoundsRange<TTo>, Err> {
Ok(BoundsRange {
lower_bound: map_bound_res(&self.lower_bound, &transform)?,
upper_bound: map_bound_res(&self.upper_bound, &transform)?,
})
}
pub fn transform_inner<TTo>(
&self,
transform_lower: impl Fn(&T) -> TransformBound<TTo>,
transform_upper: impl Fn(&T) -> TransformBound<TTo>,
) -> BoundsRange<TTo> {
BoundsRange {
lower_bound: transform_bound_inner(&self.lower_bound, &transform_lower),
upper_bound: transform_bound_inner(&self.upper_bound, &transform_upper),
}
}
/// Returns the first set inner value
pub fn get_inner(&self) -> Option<&T> {
inner_bound(&self.lower_bound).or(inner_bound(&self.upper_bound))
}
}
pub enum TransformBound<T> {
/// Overwrite the bounds
NewBound(Bound<T>),
/// Use Existing bounds with new value
Existing(T),
}
/// Takes a bound and transforms the inner value into a new bound via a closure.
/// The bound variant may change by the value returned value from the closure.
pub fn transform_bound_inner_res<TFrom, TTo>(
bound: &Bound<TFrom>,
transform: impl Fn(&TFrom) -> io::Result<TransformBound<TTo>>,
) -> io::Result<Bound<TTo>> {
use self::Bound::*;
Ok(match bound {
Excluded(from_val) => match transform(from_val)? {
TransformBound::NewBound(new_val) => new_val,
TransformBound::Existing(new_val) => Excluded(new_val),
},
Included(from_val) => match transform(from_val)? {
TransformBound::NewBound(new_val) => new_val,
TransformBound::Existing(new_val) => Included(new_val),
},
Unbounded => Unbounded,
})
}
/// Takes a bound and transforms the inner value into a new bound via a closure.
/// The bound variant may change by the value returned value from the closure.
pub fn transform_bound_inner<TFrom, TTo>(
bound: &Bound<TFrom>,
transform: impl Fn(&TFrom) -> TransformBound<TTo>,
) -> Bound<TTo> {
use self::Bound::*;
match bound {
Excluded(from_val) => match transform(from_val) {
TransformBound::NewBound(new_val) => new_val,
TransformBound::Existing(new_val) => Excluded(new_val),
},
Included(from_val) => match transform(from_val) {
TransformBound::NewBound(new_val) => new_val,
TransformBound::Existing(new_val) => Included(new_val),
},
Unbounded => Unbounded,
}
}
/// Returns the inner value of a `Bound`
pub fn inner_bound<T>(val: &Bound<T>) -> Option<&T> {
match val {
Bound::Included(term) | Bound::Excluded(term) => Some(term),
Bound::Unbounded => None,
}
}
pub fn map_bound<TFrom, TTo>(
bound: &Bound<TFrom>,
transform: impl Fn(&TFrom) -> TTo,
) -> Bound<TTo> {
use self::Bound::*;
match bound {
Excluded(from_val) => Bound::Excluded(transform(from_val)),
Included(from_val) => Bound::Included(transform(from_val)),
Unbounded => Unbounded,
}
}
pub fn map_bound_res<TFrom, TTo, Err>(
bound: &Bound<TFrom>,
transform: impl Fn(&TFrom) -> Result<TTo, Err>,
) -> Result<Bound<TTo>, Err> {
use self::Bound::*;
Ok(match bound {
Excluded(from_val) => Excluded(transform(from_val)?),
Included(from_val) => Included(transform(from_val)?),
Unbounded => Unbounded,
})
}

View File

@@ -1,5 +1,6 @@
use std::fs::File;
use std::ops::{Deref, Range, RangeBounds};
use std::path::Path;
use std::sync::Arc;
use std::{fmt, io};
@@ -73,7 +74,7 @@ impl FileHandle for WrapFile {
{
use std::io::{Read, Seek};
let mut file = self.file.try_clone()?; // Clone the file to read from it separately
// Seek to the start position in the file
// Seek to the start position in the file
file.seek(io::SeekFrom::Start(start as u64))?;
// Read the data into the buffer
file.read_exact(&mut buffer)?;
@@ -177,6 +178,12 @@ fn combine_ranges<R: RangeBounds<usize>>(orig_range: Range<usize>, rel_range: R)
}
impl FileSlice {
/// Creates a FileSlice from a path.
pub fn open(path: &Path) -> io::Result<FileSlice> {
let wrap_file = WrapFile::new(File::open(path)?)?;
Ok(FileSlice::new(Arc::new(wrap_file)))
}
/// Wraps a FileHandle.
pub fn new(file_handle: Arc<dyn FileHandle>) -> Self {
let num_bytes = file_handle.len();
@@ -339,8 +346,8 @@ mod tests {
use std::sync::Arc;
use super::{FileHandle, FileSlice};
use crate::file_slice::combine_ranges;
use crate::HasLen;
use crate::file_slice::combine_ranges;
#[test]
fn test_file_slice() -> io::Result<()> {

View File

@@ -5,6 +5,7 @@ use std::ops::Deref;
pub use byteorder::LittleEndian as Endianness;
mod bitset;
pub mod bounds;
mod byte_count;
mod datetime;
pub mod file_slice;
@@ -21,7 +22,7 @@ pub use json_path_writer::JsonPathWriter;
pub use ownedbytes::{OwnedBytes, StableDeref};
pub use serialize::{BinarySerializable, DeserializeFrom, FixedSize};
pub use vint::{
read_u32_vint, read_u32_vint_no_advance, serialize_vint_u32, write_u32_vint, VInt, VIntU128,
VInt, VIntU128, read_u32_vint, read_u32_vint_no_advance, serialize_vint_u32, write_u32_vint,
};
pub use writer::{AntiCallToken, CountingWriter, TerminatingWrite};
@@ -129,11 +130,11 @@ pub fn replace_in_place(needle: u8, replacement: u8, bytes: &mut [u8]) {
}
#[cfg(test)]
pub mod test {
pub(crate) mod test {
use proptest::prelude::*;
use super::{f64_to_u64, i64_to_u64, u64_to_f64, u64_to_i64, BinarySerializable, FixedSize};
use super::{f64_to_u64, i64_to_u64, u64_to_f64, u64_to_i64};
fn test_i64_converter_helper(val: i64) {
assert_eq!(u64_to_i64(i64_to_u64(val)), val);
@@ -143,12 +144,6 @@ pub mod test {
assert_eq!(u64_to_f64(f64_to_u64(val)), val);
}
pub fn fixed_size_test<O: BinarySerializable + FixedSize + Default>() {
let mut buffer = Vec::new();
O::default().serialize(&mut buffer).unwrap();
assert_eq!(buffer.len(), O::SIZE_IN_BYTES);
}
proptest! {
#[test]
fn test_f64_converter_monotonicity_proptest((left, right) in (proptest::num::f64::NORMAL, proptest::num::f64::NORMAL)) {
@@ -182,8 +177,10 @@ pub mod test {
#[test]
fn test_f64_order() {
assert!(!(f64_to_u64(f64::NEG_INFINITY)..f64_to_u64(f64::INFINITY))
.contains(&f64_to_u64(f64::NAN))); // nan is not a number
assert!(
!(f64_to_u64(f64::NEG_INFINITY)..f64_to_u64(f64::INFINITY))
.contains(&f64_to_u64(f64::NAN))
); // nan is not a number
assert!(f64_to_u64(1.5) > f64_to_u64(1.0)); // same exponent, different mantissa
assert!(f64_to_u64(2.0) > f64_to_u64(1.0)); // same mantissa, different exponent
assert!(f64_to_u64(2.0) > f64_to_u64(1.5)); // different exponent and mantissa

View File

@@ -74,14 +74,14 @@ impl FixedSize for () {
impl<T: BinarySerializable> BinarySerializable for Vec<T> {
fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
VInt(self.len() as u64).serialize(writer)?;
BinarySerializable::serialize(&VInt(self.len() as u64), writer)?;
for it in self {
it.serialize(writer)?;
}
Ok(())
}
fn deserialize<R: Read>(reader: &mut R) -> io::Result<Vec<T>> {
let num_items = VInt::deserialize(reader)?.val();
let num_items = <VInt as BinarySerializable>::deserialize(reader)?.val();
let mut items: Vec<T> = Vec::with_capacity(num_items as usize);
for _ in 0..num_items {
let item = T::deserialize(reader)?;
@@ -236,12 +236,12 @@ impl FixedSize for bool {
impl BinarySerializable for String {
fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
let data: &[u8] = self.as_bytes();
VInt(data.len() as u64).serialize(writer)?;
BinarySerializable::serialize(&VInt(data.len() as u64), writer)?;
writer.write_all(data)
}
fn deserialize<R: Read>(reader: &mut R) -> io::Result<String> {
let string_length = VInt::deserialize(reader)?.val() as usize;
let string_length = <VInt as BinarySerializable>::deserialize(reader)?.val() as usize;
let mut result = String::with_capacity(string_length);
reader
.take(string_length as u64)
@@ -253,12 +253,12 @@ impl BinarySerializable for String {
impl<'a> BinarySerializable for Cow<'a, str> {
fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
let data: &[u8] = self.as_bytes();
VInt(data.len() as u64).serialize(writer)?;
BinarySerializable::serialize(&VInt(data.len() as u64), writer)?;
writer.write_all(data)
}
fn deserialize<R: Read>(reader: &mut R) -> io::Result<Cow<'a, str>> {
let string_length = VInt::deserialize(reader)?.val() as usize;
let string_length = <VInt as BinarySerializable>::deserialize(reader)?.val() as usize;
let mut result = String::with_capacity(string_length);
reader
.take(string_length as u64)
@@ -269,18 +269,18 @@ impl<'a> BinarySerializable for Cow<'a, str> {
impl<'a> BinarySerializable for Cow<'a, [u8]> {
fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
VInt(self.len() as u64).serialize(writer)?;
BinarySerializable::serialize(&VInt(self.len() as u64), writer)?;
for it in self.iter() {
it.serialize(writer)?;
BinarySerializable::serialize(it, writer)?;
}
Ok(())
}
fn deserialize<R: Read>(reader: &mut R) -> io::Result<Cow<'a, [u8]>> {
let num_items = VInt::deserialize(reader)?.val();
let num_items = <VInt as BinarySerializable>::deserialize(reader)?.val();
let mut items: Vec<u8> = Vec::with_capacity(num_items as usize);
for _ in 0..num_items {
let item = u8::deserialize(reader)?;
let item = <u8 as BinarySerializable>::deserialize(reader)?;
items.push(item);
}
Ok(Cow::Owned(items))

View File

@@ -28,7 +28,9 @@ impl BinarySerializable for VIntU128 {
writer.write_all(&buffer)
}
#[allow(clippy::unbuffered_bytes)]
fn deserialize<R: Read>(reader: &mut R) -> io::Result<Self> {
#[allow(clippy::unbuffered_bytes)]
let mut bytes = reader.bytes();
let mut result = 0u128;
let mut shift = 0u64;
@@ -195,7 +197,9 @@ impl BinarySerializable for VInt {
writer.write_all(&buffer[0..num_bytes])
}
#[allow(clippy::unbuffered_bytes)]
fn deserialize<R: Read>(reader: &mut R) -> io::Result<Self> {
#[allow(clippy::unbuffered_bytes)]
let mut bytes = reader.bytes();
let mut result = 0u64;
let mut shift = 0u64;
@@ -222,7 +226,7 @@ impl BinarySerializable for VInt {
#[cfg(test)]
mod tests {
use super::{serialize_vint_u32, BinarySerializable, VInt};
use super::{BinarySerializable, VInt, serialize_vint_u32};
fn aux_test_vint(val: u64) {
let mut v = [14u8; 10];

View File

@@ -87,7 +87,7 @@ impl<W: TerminatingWrite> TerminatingWrite for BufWriter<W> {
}
}
impl<'a> TerminatingWrite for &'a mut Vec<u8> {
impl TerminatingWrite for &mut Vec<u8> {
fn terminate_ref(&mut self, _a: AntiCallToken) -> io::Result<()> {
self.flush()
}

Binary file not shown.

Before

Width:  |  Height:  |  Size: 30 KiB

After

Width:  |  Height:  |  Size: 7.4 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 653 KiB

View File

@@ -2,7 +2,7 @@
> Tantivy is a **search** engine **library** for Rust.
If you are familiar with Lucene, it's an excellent approximation to consider tantivy as Lucene for rust. tantivy is heavily inspired by Lucene's design and
If you are familiar with Lucene, it's an excellent approximation to consider tantivy as Lucene for Rust. Tantivy is heavily inspired by Lucene's design and
they both have the same scope and targeted use cases.
If you are not familiar with Lucene, let's break down our little tagline.
@@ -17,7 +17,7 @@ relevancy, collapsing, highlighting, spatial search.
experience. But keep in mind this is just a toolbox.
Which bring us to the second keyword...
- **Library** means that you will have to write code. tantivy is not an *all-in-one* server solution like elastic search for instance.
- **Library** means that you will have to write code. Tantivy is not an *all-in-one* server solution like Elasticsearch for instance.
Sometimes a functionality will not be available in tantivy because it is too
specific to your use case. By design, tantivy should make it possible to extend
@@ -31,4 +31,4 @@ relevancy, collapsing, highlighting, spatial search.
index from a different format.
Tantivy exposes a lot of low level API to do all of these things.

View File

@@ -11,7 +11,7 @@ directory shipped with tantivy is the `MmapDirectory`.
While this design has some downsides, this greatly simplifies the source code of
tantivy. Caching is also entirely delegated to the OS.
`tantivy` works entirely (or almost) by directly reading the datastructures as they are laid on disk. As a result, the act of opening an indexing does not involve loading different datastructures from the disk into random access memory : starting a process, opening an index, and performing your first query can typically be done in a matter of milliseconds.
Tantivy works entirely (or almost) by directly reading the datastructures as they are laid on disk. As a result, the act of opening an indexing does not involve loading different datastructures from the disk into random access memory : starting a process, opening an index, and performing your first query can typically be done in a matter of milliseconds.
This is an interesting property for a command line search engine, or for some multi-tenant log search engine : spawning a new process for each new query can be a perfectly sensible solution in some use case.

View File

@@ -31,13 +31,13 @@ Compression ratio is mainly affected on the fast field of the sorted property, e
When data is presorted by a field and search queries request sorting by the same field, we can leverage the natural order of the documents.
E.g. if the data is sorted by timestamp and want the top n newest docs containing a term, we can simply leveraging the order of the docids.
Note: Tantivy 0.16 does not do this optimization yet.
Note: tantivy 0.16 does not do this optimization yet.
### Pruning
Let's say we want all documents and want to apply the filter `>= 2010-08-11`. When the data is sorted, we could make a lookup in the fast field to find the docid range and use this as the filter.
Note: Tantivy 0.16 does not do this optimization yet.
Note: tantivy 0.16 does not do this optimization yet.
### Other?
@@ -45,7 +45,7 @@ In principle there are many algorithms possible that exploit the monotonically i
## Usage
The index sorting can be configured setting [`sort_by_field`](https://github.com/quickwit-oss/tantivy/blob/000d76b11a139a84b16b9b95060a1c93e8b9851c/src/core/index_meta.rs#L238) on `IndexSettings` and passing it to a `IndexBuilder`. As of Tantivy 0.16 only fast fields are allowed to be used.
The index sorting can be configured setting [`sort_by_field`](https://github.com/quickwit-oss/tantivy/blob/000d76b11a139a84b16b9b95060a1c93e8b9851c/src/core/index_meta.rs#L238) on `IndexSettings` and passing it to a `IndexBuilder`. As of tantivy 0.16 only fast fields are allowed to be used.
```rust
let settings = IndexSettings {

View File

@@ -39,7 +39,7 @@ Its representation is done by separating segments by a unicode char `\x01`, and
- `value`: The value representation is just the regular Value representation.
This representation is designed to align the natural sort of Terms with the lexicographical sort
of their binary representation (Tantivy's dictionary (whether fst or sstable) is sorted and does prefix encoding).
of their binary representation (tantivy's dictionary (whether fst or sstable) is sorted and does prefix encoding).
In the example above, the terms will be sorted as

View File

@@ -51,7 +51,7 @@ fn main() -> tantivy::Result<()> {
// Our second field is body.
// We want full-text search for it, but we do not
// need to be able to be able to retrieve it
// need to be able to retrieve it
// for our application.
//
// We can make our index lighter by omitting the `STORED` flag.
@@ -208,7 +208,7 @@ fn main() -> tantivy::Result<()> {
// is the role of the `TopDocs` collector.
// We can now perform our query.
let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
let top_docs = searcher.search(&query, &TopDocs::with_limit(10).order_by_score())?;
// The actual documents still need to be
// retrieved from Tantivy's store.
@@ -226,7 +226,7 @@ fn main() -> tantivy::Result<()> {
let query = query_parser.parse_query("title:sea^20 body:whale^70")?;
let (_score, doc_address) = searcher
.search(&query, &TopDocs::with_limit(1))?
.search(&query, &TopDocs::with_limit(1).order_by_score())?
.into_iter()
.next()
.unwrap();

View File

@@ -100,7 +100,7 @@ fn main() -> tantivy::Result<()> {
// here we want to get a hit on the 'ken' in Frankenstein
let query = query_parser.parse_query("ken")?;
let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
let top_docs = searcher.search(&query, &TopDocs::with_limit(10).order_by_score())?;
for (_, doc_address) in top_docs {
let retrieved_doc: TantivyDocument = searcher.doc(doc_address)?;

View File

@@ -50,14 +50,14 @@ fn main() -> tantivy::Result<()> {
{
// Simple exact search on the date
let query = query_parser.parse_query("occurred_at:\"2022-06-22T12:53:50.53Z\"")?;
let count_docs = searcher.search(&*query, &TopDocs::with_limit(5))?;
let count_docs = searcher.search(&*query, &TopDocs::with_limit(5).order_by_score())?;
assert_eq!(count_docs.len(), 1);
}
{
// Range query on the date field
let query = query_parser
.parse_query(r#"occurred_at:[2022-06-22T12:58:00Z TO 2022-06-23T00:00:00Z}"#)?;
let count_docs = searcher.search(&*query, &TopDocs::with_limit(4))?;
let count_docs = searcher.search(&*query, &TopDocs::with_limit(4).order_by_score())?;
assert_eq!(count_docs.len(), 1);
for (_score, doc_address) in count_docs {
let retrieved_doc = searcher.doc::<TantivyDocument>(doc_address)?;

View File

@@ -28,7 +28,7 @@ fn extract_doc_given_isbn(
// The second argument is here to tell we don't care about decoding positions,
// or term frequencies.
let term_query = TermQuery::new(isbn_term.clone(), IndexRecordOption::Basic);
let top_docs = searcher.search(&term_query, &TopDocs::with_limit(1))?;
let top_docs = searcher.search(&term_query, &TopDocs::with_limit(1).order_by_score())?;
if let Some((_score, doc_address)) = top_docs.first() {
let doc = searcher.doc(*doc_address)?;

View File

@@ -0,0 +1,212 @@
// # Filter Aggregation Example
//
// This example demonstrates filter aggregations - creating buckets of documents
// matching specific queries, with nested aggregations computed on each bucket.
//
// Filter aggregations are useful for computing metrics on different subsets of
// your data in a single query, like "average price overall + average price for
// electronics + count of in-stock items".
use serde_json::json;
use tantivy::aggregation::agg_req::Aggregations;
use tantivy::aggregation::AggregationCollector;
use tantivy::query::AllQuery;
use tantivy::schema::{Schema, FAST, INDEXED, TEXT};
use tantivy::{doc, Index};
fn main() -> tantivy::Result<()> {
// Create a simple product schema
let mut schema_builder = Schema::builder();
schema_builder.add_text_field("category", TEXT | FAST);
schema_builder.add_text_field("brand", TEXT | FAST);
schema_builder.add_u64_field("price", FAST);
schema_builder.add_f64_field("rating", FAST);
schema_builder.add_bool_field("in_stock", FAST | INDEXED);
let schema = schema_builder.build();
// Create index and add sample products
let index = Index::create_in_ram(schema.clone());
let mut writer = index.writer(50_000_000)?;
writer.add_document(doc!(
schema.get_field("category")? => "electronics",
schema.get_field("brand")? => "apple",
schema.get_field("price")? => 999u64,
schema.get_field("rating")? => 4.5f64,
schema.get_field("in_stock")? => true
))?;
writer.add_document(doc!(
schema.get_field("category")? => "electronics",
schema.get_field("brand")? => "samsung",
schema.get_field("price")? => 799u64,
schema.get_field("rating")? => 4.2f64,
schema.get_field("in_stock")? => true
))?;
writer.add_document(doc!(
schema.get_field("category")? => "clothing",
schema.get_field("brand")? => "nike",
schema.get_field("price")? => 120u64,
schema.get_field("rating")? => 4.1f64,
schema.get_field("in_stock")? => false
))?;
writer.add_document(doc!(
schema.get_field("category")? => "books",
schema.get_field("brand")? => "penguin",
schema.get_field("price")? => 25u64,
schema.get_field("rating")? => 4.8f64,
schema.get_field("in_stock")? => true
))?;
writer.commit()?;
let reader = index.reader()?;
let searcher = reader.searcher();
// Example 1: Basic filter with metric aggregation
println!("=== Example 1: Electronics average price ===");
let agg_req = json!({
"electronics": {
"filter": "category:electronics",
"aggs": {
"avg_price": { "avg": { "field": "price" } }
}
}
});
let agg: Aggregations = serde_json::from_value(agg_req)?;
let collector = AggregationCollector::from_aggs(agg, Default::default());
let result = searcher.search(&AllQuery, &collector)?;
let expected = json!({
"electronics": {
"doc_count": 2,
"avg_price": { "value": 899.0 }
}
});
assert_eq!(serde_json::to_value(&result)?, expected);
println!("{}\n", serde_json::to_string_pretty(&result)?);
// Example 2: Multiple independent filters
println!("=== Example 2: Multiple filters in one query ===");
let agg_req = json!({
"electronics": {
"filter": "category:electronics",
"aggs": { "avg_price": { "avg": { "field": "price" } } }
},
"in_stock": {
"filter": "in_stock:true",
"aggs": { "count": { "value_count": { "field": "brand" } } }
},
"high_rated": {
"filter": "rating:[4.5 TO *]",
"aggs": { "count": { "value_count": { "field": "brand" } } }
}
});
let agg: Aggregations = serde_json::from_value(agg_req)?;
let collector = AggregationCollector::from_aggs(agg, Default::default());
let result = searcher.search(&AllQuery, &collector)?;
let expected = json!({
"electronics": {
"doc_count": 2,
"avg_price": { "value": 899.0 }
},
"in_stock": {
"doc_count": 3,
"count": { "value": 3.0 }
},
"high_rated": {
"doc_count": 2,
"count": { "value": 2.0 }
}
});
assert_eq!(serde_json::to_value(&result)?, expected);
println!("{}\n", serde_json::to_string_pretty(&result)?);
// Example 3: Nested filters - progressive refinement
println!("=== Example 3: Nested filters ===");
let agg_req = json!({
"in_stock": {
"filter": "in_stock:true",
"aggs": {
"electronics": {
"filter": "category:electronics",
"aggs": {
"expensive": {
"filter": "price:[800 TO *]",
"aggs": {
"avg_rating": { "avg": { "field": "rating" } }
}
}
}
}
}
}
});
let agg: Aggregations = serde_json::from_value(agg_req)?;
let collector = AggregationCollector::from_aggs(agg, Default::default());
let result = searcher.search(&AllQuery, &collector)?;
let expected = json!({
"in_stock": {
"doc_count": 3, // apple, samsung, penguin
"electronics": {
"doc_count": 2, // apple, samsung
"expensive": {
"doc_count": 1, // only apple (999)
"avg_rating": { "value": 4.5 }
}
}
}
});
assert_eq!(serde_json::to_value(&result)?, expected);
println!("{}\n", serde_json::to_string_pretty(&result)?);
// Example 4: Filter with sub-aggregation (terms)
println!("=== Example 4: Filter with terms sub-aggregation ===");
let agg_req = json!({
"electronics": {
"filter": "category:electronics",
"aggs": {
"by_brand": {
"terms": { "field": "brand" },
"aggs": {
"avg_price": { "avg": { "field": "price" } }
}
}
}
}
});
let agg: Aggregations = serde_json::from_value(agg_req)?;
let collector = AggregationCollector::from_aggs(agg, Default::default());
let result = searcher.search(&AllQuery, &collector)?;
let expected = json!({
"electronics": {
"doc_count": 2,
"by_brand": {
"buckets": [
{
"key": "samsung",
"doc_count": 1,
"avg_price": { "value": 799.0 }
},
{
"key": "apple",
"doc_count": 1,
"avg_price": { "value": 999.0 }
}
],
"sum_other_doc_count": 0,
"doc_count_error_upper_bound": 0
}
}
});
assert_eq!(serde_json::to_value(&result)?, expected);
println!("{}", serde_json::to_string_pretty(&result)?);
Ok(())
}

View File

@@ -85,7 +85,6 @@ fn main() -> tantivy::Result<()> {
index_writer.add_document(doc!(
title => "The Diary of a Young Girl",
))?;
index_writer.commit()?;
// ### Committing
//
@@ -146,7 +145,7 @@ fn main() -> tantivy::Result<()> {
let query = FuzzyTermQuery::new(term, 2, true);
let (top_docs, count) = searcher
.search(&query, &(TopDocs::with_limit(5), Count))
.search(&query, &(TopDocs::with_limit(5).order_by_score(), Count))
.unwrap();
assert_eq!(count, 3);
assert_eq!(top_docs.len(), 3);

Some files were not shown because too many files have changed in this diff Show More