tantivy

mirror of https://github.com/quickwit-oss/tantivy.git synced 2026-01-04 08:12:54 +00:00

Author	SHA1	Message	Date
Paul Masurel	643639f14b	Introduced geopoint.	2025-12-03 17:05:27 +01:00
Paul Masurel	f85a27068d	Introduced geopoint.	2025-12-03 17:05:16 +01:00
Paul Masurel	1619e05bc5	plastic surgery	2025-12-03 16:20:18 +01:00
Paul Masurel	5d03c600ba	Added bugfix and unit tests Removed use of robust.	2025-12-03 15:21:37 +01:00
Paul Masurel	32beb06382	plastic surgery	2025-12-03 13:02:10 +01:00
Paul Masurel	d8bc0e7c99	added doc	2025-12-03 12:41:17 +01:00
Paul Masurel	79622f1f0b	bugfix	2025-12-01 17:13:57 +01:00
Alan Gutierrez	d26d6c34fc	Fix `select_nth_unstable_by_key` midpoint duplicates. Existing code behaved as if the result of `select_nth_unstable_by_key` was either a sorted array or the product of an algorithm that gathered partition values as in the Dutch national flag problem. The existing code was written knowing that the former isn't true and the latter isn't advertised. Knowing, but not remembering. Quite the oversight.	2025-12-01 16:49:22 +01:00
Alan Gutierrez	6da54fa5da	Revert "Remove `radix_select.rs`." This reverts commit `19eab167b6`. Restore radix select in order to implement a merge solution that will not require a temporary file.	2025-12-01 16:49:21 +01:00
Alan Gutierrez	9f10279681	Complete Spatial/Geometry type integration. Addressed all `todo!()` markers created when adding `Spatial` field type and `Geometry` value type to existing code paths: - Dynamic field handling: `Geometry` not supported in dynamic JSON fields, return `unimplemented!()` consistently with other complex types. - Fast field writer: Panic if geometry routed incorrectly (internal error.) - `OwnedValue` serialization: Implement `Geometry` to GeoJSON serialization and reference-to-owned conversion. - Field type: Return `None` for `get_index_record_option()` since spatial fields use BKD trees, not inverted index. - Space usage tracking: Add spatial field to `SegmentSpaceUsage` with proper integration through `SegmentReader`. - Spatial query explain: Implement `explain()` method following pattern of other binary/constant-score queries. Fixed `MultiPolygon` deserialization bug: count total points across all rings, not number of rings. Added clippy expects for legitimate too_many_arguments cases in geometric predicates.	2025-12-01 16:49:21 +01:00
Alan Gutierrez	68009bb25b	Read block kd-tree nodes using `from_le_bytes`. Read node structures using `from_le_bytes` instead of casting memory. After an inspection of columnar storage, it appears that this is the standard practice in Rust and in the Tantivy code base. Left the structure alignment for now in case it tends to align with cache boundaries.	2025-12-01 16:49:20 +01:00
Alan Gutierrez	459456ca28	Remove `radix_select.rs`. Ended up using `select_nth_unstable_by_key` from the Rust standard library instead.	2025-12-01 16:49:20 +01:00
Alan Gutierrez	dbbc8c3f65	Slot block kd-tree into Tantivy. Implemented a geometry document field with a minimal `Geometry` enum. Now able to add that Geometry from GeoJSON parsed from a JSON document. Geometry is triangulated if it is a polygon, otherwise it is correctly encoded as a degenerate triangle if it is a point or a line string. Write accumulated triangles to a block kd-tree on commit. Serialize the original `f64` polygon for retrieval from search. Created a query method for intersection. Query against the memory mapped block kd-tree. Return hits and original `f64` polygon. Implemented a merge of one or more block kd-trees from one or more segments during merge. Updated the block kd-tree to write to a Tantivy `WritePtr` instead of more generic Rust I/O.	2025-12-01 16:49:16 +01:00
Alan Gutierrez	d3049cb323	Triangulation is not just a conversion. The triangulation function in `triangle.rs` is now called `delaunay_to_triangles` and it accepts the output of a Delaunay triangulation from `i_triangle` and not a GeoRust multi-polygon. The translation of user polygons to `i_triangle` polygons and subsequent triangulation will take place outside of `triangle.rs`.	2025-12-01 16:48:34 +01:00
Alan Gutierrez	ccdf399cd7	XOR delta compression for f64 polygon coordinates. Lossless compression for floating-point lat/lon coordinates using XOR delta encoding on IEEE 754 bit patterns with variable-length integer encoding. Designed for per-polygon random access in the document store, where each polygon compresses independently without requiring sequential decompression.	2025-12-01 16:48:33 +01:00
Alan Gutierrez	2dc46b235e	Implement block kd-tree. Implement an immutable bulk-loaded spatial index using recursive median partitioning on bounding box dimensions. Each leaf stores up to 512 triangles with delta-compressed coordinates and doc IDs. The tree provides three query types (intersects, within, contains) that use exact integer arithmetic for geometric predicates and accumulate results in bit sets for efficient deduplication across leaves. The serialized format stores compressed leaf pages followed by the tree structure (leaf and branch nodes), enabling zero-copy access through memory-mapped segments without upfront decompression.	2025-12-01 16:48:32 +01:00
Alan Gutierrez	f38140f72f	Add delta compression for block kd-tree leaf nodes. Implements dimension-major bit-packing with zigzag encoding for signed i32 deltas, enabling compression of spatially-clustered triangles from 32-bit coordinates down to 4-19 bits per delta depending on spatial extent.	2025-12-01 16:48:32 +01:00
Alan Gutierrez	0996bea7ac	Add a surveyor to determine spread and prefix. Implemented a `Surveyor` that will evaluate the bounding boxes of a set of triangles and determine the dimension with the maximum spread and the shared prefix for the values of dimension with the maximum spread.	2025-12-01 16:48:31 +01:00
Alan Gutierrez	1c66567efc	Radix selection for block kd-tree partitioning. Implemented byte-wise histogram selection to find median values without comparisons, enabling efficient partitioning of spatial data during block kd-tree construction. Processes values through multiple passes, building histograms for each byte position after a common prefix, avoiding the need to sort or compare elements directly.	2025-12-01 16:48:31 +01:00
Alan Gutierrez	b2a9bb279d	Implement polygon tessellation. The `triangulate` function takes a polygon with floating-point lat/lon coordinates, converts to integer coordinates with millimeter precision (using 2^32 scaling), performs constrained Delaunay triangulation, and encodes the resulting triangles with boundary edge information for block kd-tree spatial indexing. It handles polygons with holes correctly, preserving which triangle edges lie on the original polygon boundaries versus internal tessellation edges.	2025-12-01 16:48:26 +01:00
Alan Gutierrez	558c99fa2d	Triangle encoding for spatial indexing. Encodes triangles with the bounding box in the first four words, enabling efficient spatial pruning during tree traversal without reconstructing the full triangle. The remaining words contain an additional vertex and packed reconstruction metadata, allowing exact triangle recovery when needed.	2025-12-01 16:47:56 +01:00
Alan Gutierrez	43b5f34721	Implement SPATIAL flag. Implement a SPATIAL flag for use in creating a spatial field.	2025-12-01 16:47:55 +01:00
Paul Masurel	63c66005db	Lazy scorers (#2726 ) * Refactoring of the score tweaker into `SortKeyComputer`s to unlock two features. - Allow lazy evaluation of score. As soon as we identified that a doc won't reach the topK threshold, we can stop the evaluation. - Allow for a different segment level score, segment level score and their conversion. This PR breaks public API, but fixing code is straightforward. * Bumping tantivy version --------- Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>	2025-12-01 15:38:57 +01:00
Paul Masurel	7d513a44c5	Added some benchmark for top K by a fast field (#2754 ) Also removed query parsing from the bench code. Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>	2025-12-01 14:58:29 +01:00
Stu Hood	ca87fcd454	Implement `collect_block` for `Collector`s which wrap other `Collector`s (#2727 ) * Implement `collect_block` for tuple Collectors, and for MultiCollector. * Two more.	2025-12-01 12:26:29 +01:00
Ang	08a92675dc	Fix typos again (#2753 ) Found via `codespell -S benches,stopwords.rs -L womens,parth,abd,childs,ond,ser,ue,mot,hel,atleast,pris,claus,allo`	2025-12-01 12:15:41 +01:00
Raphaël Cohen	f7f4b354d6	fix: Handle phrase prefixed with star (#2751 ) Signed-off-by: Darkheir <raphael.cohen@sekoia.io>	2025-12-01 11:43:25 +01:00
Paul Masurel	25d44fcec8	Revert "remove unused columnar api (#2742 )" (#2748 ) * Revert "remove unused columnar api (#2742)" This reverts commit `8725594d47`. * Clippy comment + removing fill_vals --------- Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>	2025-11-26 17:44:02 +01:00
PSeitz-dd	842fe9295f	split Term in Term and IndexingTerm (#2744 ) * split Term in Term and IndexingTerm * add append_json_path to JsonTermSerializer	2025-11-26 16:48:59 +01:00
Paul Masurel	f88b7200b2	Optimization when posting list are saturated. (#2745 ) * Optimization when posting list are saturated. If a posting list doc freq is the segment reader's max_doc, and if scoring does not matter, we can replace it by a AllScorer. In turn, in a boolean query, we can dismiss all scorers and empty scorers, to accelerate the request. * Added range query optimization * CR comment * CR comments * CR comment --------- Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>	2025-11-26 15:50:57 +01:00
PSeitz-dd	8725594d47	remove unused columnar api (#2742 )	2025-11-21 18:07:25 +01:00
PSeitz	43a784671a	clippy (#2741 ) Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>	2025-11-21 18:07:03 +01:00
Paul Masurel	c363bbd23d	Optimize term aggregation with low cardinality + some refactoring (#2740 ) This introduce an optimization of top level term aggregation on field with a low cardinality. We then use a Vec as the underlying map. In addition, we buffer subaggregations. --------- Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com> Co-authored-by: Paul Masurel <paul@quickwit.io>	2025-11-21 14:46:29 +01:00
Moe	70e591e230	feat: added filter aggregation (#2711 ) * Initial impl * Added `Filter` impl in `build_single_agg_segment_collector_with_reader` + Added tests * Added `Filter(FilterBucketResult)` + Made tests work. * Fixed type issues. * Fixed a test. * 8a7a73a: Pass `segment_reader` * Added more tests. * Improved parsing + tests * refactoring * Added more tests. * refactoring: moved parsing code under QueryParser * Use Tantivy syntax instead of ES * Added a sanity check test. * Simplified impl + tests * Added back tests in a more maintable way * nitz. * nitz * implemented very simple fast-path * improved a comment * implemented fast field support * Used `BoundsRange` * Improved fast field impl + tests * Simplified execution. * Fixed exports + nitz * Improved the tests to check to the expected result. * Improved test by checking the whole result JSON * Removed brittle perf checks. * Added efficiency verification tests. * Added one more efficiency check test. * Improved the efficiency tests. * Removed unnecessary parsing code + added direct Query obj * Fixed tests. * Improved tests * Fixed code structure * Fixed lint issues * nitz. * nitz * nitz. * nitz. * nitz. * Added an example * Fixed PR comments. * Applied PR comments + nitz * nitz. * Improved the code. * Fixed a perf issue. * Added batch processing. * Made the example more interesting * Fixed bucket count * Renamed Direct to CustomQuery * Fixed lint issues. * No need for scorer to be an `Option` * nitz * Used BitSet * Added an optimization for AllQuery * Fixed merge issues. * Fixed lint issues. * Added benchmark for FILTER * Removed the Option wrapper. * nitz. * Applied PR comments. * feat: used `erased_serde` to allow filter query to be serialized * further improved a comment * Added back tests. * removed an unused method * removed an unused method * Added documentation * nitz. * Added query builder. * Fixed a comment. * Applied PR comments. * Fixed doctest issues. * Added ser/de * Removed bench in test * Fixed a lint issue.	2025-11-18 20:54:31 +01:00
Arthur	5277367cb0	remove duplicated call to `index_writer.commit()` in example (#2732 )	2025-11-12 14:52:44 +01:00
Paul Masurel	8b02bff9b8	Removing obsolete benchmark screenshot (#2730 ) Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>	2025-11-05 09:55:13 +01:00
PSeitz	60225bdd45	cleanup (#2724 ) Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>	2025-10-23 10:23:34 +02:00
PSeitz	938bfec8b7	use FxHashMap for Aggregations Request (#2722 ) Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>	2025-10-21 15:59:18 +02:00
PSeitz	dabcaa5809	fix merge intermediate aggregation results (#2719 ) Previously the merging relied on the order of the results, which is invalid since https://github.com/quickwit-oss/tantivy/pull/2035. This bug is only hit in specific scenarios, when the aggregation collectors are built in a different order on different segments. Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>	2025-10-17 12:41:31 +02:00
PSeitz	d410a3b0c0	Add Filtering for Term Aggregations (#2717 ) * Add Filtering for Term Aggregations Closes #2702 * add AggregationsSegmentCtx memory consumption --------- Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>	2025-10-15 17:39:53 +02:00
Remi	fc93391d0e	Minor clarifications on the AggregationsWithAccessor refacto (#2716 )	2025-10-14 19:59:33 +02:00
PSeitz	f8e79271ab	Replace AggregationsWithAccessor (#2715 ) * add nested histogram-termagg benchmark * Replace AggregationsWithAccessor with AggData With AggregationsWithAccessor pre-computation and caching was done on the collector level. If you have 10000 sub collectors (e.g. a term aggregation with sub aggregations) this is very inefficient. `AggData` instead moves the data from the collector to a node which reflects the cardinality of the request tree instead of the cardinality of the segment collector. It also moves the global struct shared with all aggregations in to aggregation specific structs. So each aggregation has its own space to store cached data and aggregation specific information. This also breaks up the dependency to the elastic search aggregation structure somewhat. Due to lifetime issues, we move the agg request specific object out of `AggData` during the collection and move it back at the end (for now). That's some unnecessary work, which costs CPU. This allows better caching and will also pave the way for another potential optimization, by separating the collector and its storage. Currently we allocate a new collector for each sub aggregation bucket (for nested aggregations), but ideally we would have just one collector instance. * renames * move request data to agg request files --------- Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>	2025-10-14 09:22:11 +02:00
PSeitz	33835b6a01	Add DocSet::cost() (#2707 ) * query: add DocSet cost hint and use it for intersection ordering - Add DocSet::cost() - Use cost() instead of size_hint() to order scorers in intersect_scorers This isolates cost-related changes without the new seek APIs from PR #2538 * add comments --------- Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>	2025-10-13 16:25:49 +02:00
PSeitz	270ca5123c	refactor postings (#2709 ) rename shallow_seek to seek_block remove full_block from public postings API This is as preparation to optionally handle Bitsets in the postings	2025-10-08 16:55:25 +02:00
Mustafa S. Moiz	714366d3b9	docs: correct grammar (#2704 ) Correct phrasing for a single line in the docs (`one documents` -> `a document`).	2025-10-08 16:47:09 +02:00
PSeitz-dd	40659d4d07	improve naming in buffered_union (#2705 )	2025-09-24 10:58:46 +02:00
PSeitz	e1e131a804	add and/or queries benchmark (#2701 )	2025-09-22 16:32:49 +02:00
PSeitz-dd	70da310b2d	perf: deduplicate queries (#2698 ) * deduplicate queries Deduplicate queries in the UserInputAst after parsing queries * add return type	2025-09-22 12:16:58 +02:00
PSeitz	85010b589a	clippy (#2700 ) * clippy * clippy * clippy * clippy + fmt --------- Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>	2025-09-19 18:04:25 +02:00
PSeitz-dd	2340dca628	fix compiler warnings (#2699 ) * fix compiler warnings * fix import	2025-09-19 15:55:04 +02:00
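
Commit d26d6c34fc above notes that `select_nth_unstable_by_key` returns neither a sorted slice nor a Dutch-national-flag style partition, so values equal to the selected element may sit anywhere on either side of it. The following is a minimal sketch of that pitfall and one way to handle it, not the code from the commit: the function name `select_median_gathered` is hypothetical, and gathering duplicates with two extra passes is an assumption about one reasonable approach.

```rust
use std::ops::Range;

/// Hypothetical helper: selects the median of `values`, then moves every
/// value equal to it next to the midpoint. Returns the half-open index
/// range of the resulting run of equal values.
fn select_median_gathered(values: &mut [u32]) -> Range<usize> {
    assert!(!values.is_empty());
    let mid = values.len() / 2;
    // The standard library only guarantees left <= pivot <= right.
    // Duplicates of the pivot may be scattered through both halves.
    let (left, pivot, right) = values.select_nth_unstable_by_key(mid, |v| *v);
    let pivot = *pivot;

    // Sweep the left half backwards, pushing duplicates to its end.
    let mut lo = left.len();
    let mut i = left.len();
    while i > 0 {
        i -= 1;
        if left[i] == pivot {
            lo -= 1;
            left.swap(i, lo);
        }
    }

    // Sweep the right half forwards, pulling duplicates to its start.
    let mut eq = 0;
    for j in 0..right.len() {
        if right[j] == pivot {
            right.swap(j, eq);
            eq += 1;
        }
    }

    // `left` starts at index 0 of `values`, so `lo` is already a global index.
    lo..(mid + 1 + eq)
}
```

After the call, everything before the returned range is strictly smaller than the median and everything after it is strictly larger, which is the property the commit message says the earlier code wrongly assumed.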

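
Commit f38140f72f above describes zigzag encoding of signed `i32` deltas before bit-packing. As a rough illustration of why that helps, the sketch below maps small positive and negative deltas to small unsigned values and computes how many bits a group of deltas would need. The function names are hypothetical, and the dimension-major packing layout used by the commit is not reproduced here.

```rust
/// Maps a signed delta to an unsigned value so that small magnitudes,
/// positive or negative, become small numbers: 0, -1, 1, -2 -> 0, 1, 2, 3.
fn zigzag_encode(delta: i32) -> u32 {
    ((delta << 1) ^ (delta >> 31)) as u32
}

/// Inverse of `zigzag_encode`.
fn zigzag_decode(encoded: u32) -> i32 {
    ((encoded >> 1) as i32) ^ -((encoded & 1) as i32)
}

/// Number of bits needed to store the largest zigzag-encoded delta.
/// A bit-packer could use this width for every delta in the group.
fn bits_required(deltas: &[i32]) -> u32 {
    let max = deltas
        .iter()
        .map(|&d| zigzag_encode(d))
        .max()
        .unwrap_or(0);
    32 - max.leading_zeros()
}
```

With spatially clustered values the deltas stay close to zero, so the required width falls well below 32 bits, which is consistent with the 4-19 bits per delta quoted in the commit message.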

3423 Commits