Lossless compression for floating-point lat/lon coordinates using XOR
delta encoding on IEEE 754 bit patterns with variable-length integer
encoding. Designed for per-polygon random access in the document store,
where each polygon compresses independently without requiring sequential
decompression.
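A minimal sketch of the idea, not the actual implementation: each coordinate is XORed with the previous one in the same stream, and the result is written as a LEB128-style varint. The function names, the length header, and the choice to compress one dimension per call are assumptions made for illustration.

```rust
/// Appends `value` as a LEB128-style varint (7 payload bits per byte).
fn write_varint(mut value: u64, out: &mut Vec<u8>) {
    while value >= 0x80 {
        out.push((value as u8 & 0x7f) | 0x80);
        value >>= 7;
    }
    out.push(value as u8);
}

/// Reads one varint starting at `*pos`; returns None on truncated input.
fn read_varint(bytes: &[u8], pos: &mut usize) -> Option<u64> {
    let mut value = 0u64;
    let mut shift = 0u32;
    loop {
        if shift >= 64 {
            return None; // more continuation bytes than a u64 can hold
        }
        let byte = *bytes.get(*pos)?;
        *pos += 1;
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Some(value);
        }
        shift += 7;
    }
}

/// Compresses one coordinate stream (hypothetical layout: count, then XOR deltas).
pub fn compress(coords: &[f64]) -> Vec<u8> {
    let mut out = Vec::new();
    write_varint(coords.len() as u64, &mut out);
    let mut prev = 0u64;
    for &c in coords {
        let bits = c.to_bits();
        // Nearby coordinates share sign, exponent and high mantissa bits,
        // so the XOR has many leading zeros and the varint stays short.
        write_varint(bits ^ prev, &mut out);
        prev = bits;
    }
    out
}

/// Inverse of `compress`; bit patterns are restored exactly.
pub fn decompress(bytes: &[u8]) -> Option<Vec<f64>> {
    let mut pos = 0;
    let len = read_varint(bytes, &mut pos)?;
    let mut coords = Vec::new();
    let mut prev = 0u64;
    for _ in 0..len {
        prev ^= read_varint(bytes, &mut pos)?;
        coords.push(f64::from_bits(prev));
    }
    Some(coords)
}
```

Because each call is self-contained, every polygon can be decoded without touching its neighbours, which is the property the per-polygon random access relies on.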
Implement an immutable bulk-loaded spatial index using recursive median
partitioning on bounding box dimensions. Each leaf stores up to 512
triangles with delta-compressed coordinates and doc IDs. The tree
provides three query types (intersects, within, contains) that use exact
integer arithmetic for geometric predicates and accumulate results in
bit sets for efficient deduplication across leaves.
The serialized format stores compressed leaf pages followed by the tree
structure (leaf and branch nodes), enabling zero-copy access through
memory-mapped segments without upfront decompression.
Implements dimension-major bit-packing with zigzag encoding for signed i32
deltas, enabling compression of spatially-clustered triangles from 32-bit
coordinates down to 4-19 bits per delta depending on spatial extent.
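The following sketch shows zigzag encoding combined with fixed-width bit-packing for the deltas of a single dimension. The function names and the little-endian bit order are assumptions; a dimension-major layout would call `pack_dimension` once per dimension and store the outputs back to back.

```rust
/// Maps signed deltas to unsigned so that small magnitudes stay small:
/// 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
pub fn zigzag_encode(v: i32) -> u32 {
    ((v << 1) ^ (v >> 31)) as u32
}

pub fn zigzag_decode(v: u32) -> i32 {
    ((v >> 1) as i32) ^ -((v & 1) as i32)
}

/// Number of bits needed to represent the largest value (0 if all are zero).
pub fn bits_needed(values: &[u32]) -> u32 {
    let max = values.iter().copied().max().unwrap_or(0);
    32 - max.leading_zeros()
}

/// Zigzag-encodes the deltas and packs them with one shared bit width.
/// Returns (bit width, packed bytes).
pub fn pack_dimension(deltas: &[i32]) -> (u32, Vec<u8>) {
    let encoded: Vec<u32> = deltas.iter().map(|&d| zigzag_encode(d)).collect();
    let width = bits_needed(&encoded);
    let mut out = Vec::new();
    let mut acc = 0u64; // bit accumulator, filled from the low end
    let mut filled = 0u32;
    for &v in &encoded {
        acc |= u64::from(v) << filled;
        filled += width;
        while filled >= 8 {
            out.push(acc as u8);
            acc >>= 8;
            filled -= 8;
        }
    }
    if filled > 0 {
        out.push(acc as u8);
    }
    (width, out)
}

/// Inverse of `pack_dimension`. Panics if `bytes` is shorter than required.
pub fn unpack_dimension(bytes: &[u8], width: u32, count: usize) -> Vec<i32> {
    let mask = if width == 0 { 0 } else { (1u64 << width) - 1 };
    let (mut acc, mut filled, mut pos) = (0u64, 0u32, 0usize);
    let mut out = Vec::with_capacity(count);
    for _ in 0..count {
        while filled < width {
            acc |= u64::from(bytes[pos]) << filled;
            pos += 1;
            filled += 8;
        }
        out.push(zigzag_decode((acc & mask) as u32));
        acc >>= width;
        filled -= width;
    }
    out
}
```

For the deltas `[-5, 3, 0, 7, -8]` the zigzag values are `[9, 6, 0, 14, 15]`, so 4 bits per delta suffice and the five values fit in 3 bytes.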
Implemented a `Surveyor` that evaluates the bounding boxes of a set of
triangles and determines the dimension with the maximum spread, along with
the prefix shared by all values in that dimension.
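As an illustration only, the survey step could look like the sketch below. The `survey` function, the `[i32; 4]` bounding box layout, and the byte granularity of the shared prefix are assumptions, not the `Surveyor` API itself.

```rust
/// Maps an i32 to a u32 whose unsigned order matches the signed order,
/// so that byte prefixes are meaningful across negative and positive values.
fn sortable(v: i32) -> u32 {
    (v as u32) ^ 0x8000_0000
}

/// Returns (dimension index, shared prefix length in bytes) for the
/// dimension with the largest spread, or None for an empty input.
pub fn survey(boxes: &[[i32; 4]]) -> Option<(usize, usize)> {
    let first = boxes.first()?;
    let mut mins = *first;
    let mut maxs = *first;
    for b in boxes {
        for d in 0..4 {
            mins[d] = mins[d].min(b[d]);
            maxs[d] = maxs[d].max(b[d]);
        }
    }
    // Widen to i64 so that max - min cannot overflow.
    let dim = (0..4usize).max_by_key(|&d| i64::from(maxs[d]) - i64::from(mins[d]))?;
    // Every value lies between min and max, so the prefix common to those
    // two is common to all values of the dimension.
    let differing = sortable(mins[dim]) ^ sortable(maxs[dim]);
    let prefix_bytes = (differing.leading_zeros() / 8) as usize;
    Some((dim, prefix_bytes))
}
```

Knowing the shared prefix lets a later byte-wise pass skip the leading bytes that cannot distinguish any two values.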
Implemented byte-wise histogram selection to find median values without
comparisons, enabling efficient partitioning of spatial data during
block kd-tree construction. Processes values through multiple passes,
building histograms for each byte position after a common prefix,
avoiding the need to sort or compare elements directly.
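The selection can be sketched as a most-significant-byte-first radix select. This is a simplified illustration on `u32` values that always starts at the first byte instead of skipping a known common prefix; `select_kth` is a hypothetical name.

```rust
/// Returns the k-th smallest value (0-based) without sorting or comparing
/// elements: one histogram pass per byte, most significant byte first.
pub fn select_kth(values: &[u32], mut k: usize) -> Option<u32> {
    if k >= values.len() {
        return None;
    }
    let mut prefix = 0u32; // bytes of the answer fixed so far
    let mut prefix_mask = 0u32; // which bits of `prefix` are fixed
    for byte_idx in (0..4).rev() {
        let shift = byte_idx * 8;
        // Count the current byte of every value that matches the prefix.
        let mut histogram = [0usize; 256];
        for &v in values {
            if v & prefix_mask == prefix {
                histogram[((v >> shift) & 0xff) as usize] += 1;
            }
        }
        // Walk the buckets in order until the one holding rank k is found.
        for (bucket, &count) in histogram.iter().enumerate() {
            if k < count {
                prefix |= (bucket as u32) << shift;
                prefix_mask |= 0xffu32 << shift;
                break;
            }
            k -= count;
        }
    }
    Some(prefix)
}
```

Calling `select_kth(&values, values.len() / 2)` yields the median used as the split value. Each pass only narrows the candidates to the chosen bucket, so the input slice is never reordered.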
The `triangulate` function takes a polygon with floating-point lat/lon
coordinates, converts to integer coordinates with millimeter precision
(using 2^32 scaling), performs constrained Delaunay triangulation, and
encodes the resulting triangles with boundary edge information for block
kd-tree spatial indexing.
It handles polygons with holes correctly, preserving which triangle
edges lie on the original polygon boundaries versus internal
tessellation edges.
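The coordinate conversion step might be sketched as follows, assuming the full range of each axis is spread over the 2^32 values of an `i32`. The function names and the floor-then-clamp rounding are assumptions; the triangulation itself is not shown.

```rust
/// 2^32 steps across the 180 degrees of latitude and 360 degrees of longitude.
const LAT_SCALE: f64 = 4294967296.0 / 180.0;
const LON_SCALE: f64 = 4294967296.0 / 360.0;

fn quantize(degrees: f64, scale: f64) -> i32 {
    let scaled = (degrees * scale).floor();
    // +90 and +180 land one step past i32::MAX, so clamp into range.
    scaled.clamp(i32::MIN as f64, i32::MAX as f64) as i32
}

pub fn encode_lat(lat: f64) -> i32 {
    quantize(lat, LAT_SCALE)
}

pub fn encode_lon(lon: f64) -> i32 {
    quantize(lon, LON_SCALE)
}

pub fn decode_lat(value: i32) -> f64 {
    f64::from(value) / LAT_SCALE
}

pub fn decode_lon(value: i32) -> f64 {
    f64::from(value) / LON_SCALE
}
```

Under this scaling one step is about 4.2e-8 degrees of latitude and 8.4e-8 degrees of longitude, which is roughly 5 mm and 9 mm on the ground at the equator.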
Encodes triangles with the bounding box in the first four words,
enabling efficient spatial pruning during tree traversal without
reconstructing the full triangle. The remaining words contain an
additional vertex and packed reconstruction metadata, allowing exact
triangle recovery when needed.
Previously the merging relied on the order of the results, which has been invalid since https://github.com/quickwit-oss/tantivy/pull/2035.
This bug is only hit in specific scenarios, when the aggregation collectors are built in a different order on different segments.
Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
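To illustrate the class of fix, the sketch below merges per-segment results by aggregation name instead of by position, so segments may report their aggregations in any order. `IntermediateStats`, `SegmentResults` and `merge_by_name` are hypothetical stand-ins, not the crate's actual types.

```rust
use std::collections::HashMap;

/// Hypothetical intermediate result of one aggregation on one segment.
#[derive(Debug, Clone, PartialEq)]
pub struct IntermediateStats {
    pub count: u64,
    pub sum: f64,
}

/// Results of one segment, in whatever order its collectors were built.
pub type SegmentResults = Vec<(String, IntermediateStats)>;

/// Merges by aggregation name. Zipping the vectors positionally would mix up
/// aggregations whenever two segments list them in a different order.
pub fn merge_by_name(segments: &[SegmentResults]) -> HashMap<String, IntermediateStats> {
    let mut merged: HashMap<String, IntermediateStats> = HashMap::new();
    for segment in segments {
        for (name, stats) in segment {
            let entry = merged
                .entry(name.clone())
                .or_insert(IntermediateStats { count: 0, sum: 0.0 });
            entry.count += stats.count;
            entry.sum += stats.sum;
        }
    }
    merged
}
```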
* add nested histogram-termagg benchmark
* Replace AggregationsWithAccessor with AggData
With AggregationsWithAccessor, pre-computation and caching were done at the collector level.
If you have 10000 sub collectors (e.g. a term aggregation with sub aggregations), this is very inefficient.
`AggData` instead moves the data from the collector to a node which reflects the cardinality of the request tree instead of the cardinality of the segment collector.
It also moves the global struct shared with all aggregations into aggregation-specific structs, so each aggregation has its own space to store cached data and aggregation-specific information.
This also loosens the dependency on the Elasticsearch aggregation structure somewhat.
Due to lifetime issues, we move the agg request specific object out of `AggData` during the collection and move it back at the end (for now). That's some unnecessary work, which costs CPU.
This allows better caching and will also pave the way for another potential optimization, by separating the collector and its storage. Currently we allocate a new collector for each sub aggregation bucket (for nested aggregations), but ideally we would have just one collector instance.
* renames
* move request data to agg request files
---------
Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
* query: add DocSet cost hint and use it for intersection ordering
- Add DocSet::cost()
- Use cost() instead of size_hint() to order scorers in intersect_scorers
This isolates cost-related changes without the new seek APIs from
PR #2538
* add comments
---------
Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
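The ordering change can be pictured with the simplified sketch below. The trait is a reduced, hypothetical version of `DocSet` (the real trait has more methods and its signatures may differ), and `TermDocs`, `ExpensiveDocs` and `order_for_intersection` are made up for illustration.

```rust
/// Reduced stand-in for a doc set: only the two estimates that matter here.
pub trait DocSet {
    /// Estimated number of matching documents.
    fn size_hint(&self) -> u32;
    /// Estimated work to iterate the set; assumed to default to the size hint.
    fn cost(&self) -> u64 {
        u64::from(self.size_hint())
    }
}

/// A plain posting list: cost is proportional to the number of documents.
pub struct TermDocs {
    pub doc_freq: u32,
}

impl DocSet for TermDocs {
    fn size_hint(&self) -> u32 {
        self.doc_freq
    }
}

/// A doc set that matches few documents but is expensive to advance.
pub struct ExpensiveDocs {
    pub estimated_matches: u32,
    pub work_per_match: u64,
}

impl DocSet for ExpensiveDocs {
    fn size_hint(&self) -> u32 {
        self.estimated_matches
    }
    fn cost(&self) -> u64 {
        u64::from(self.estimated_matches) * self.work_per_match
    }
}

/// Puts the cheapest doc set first so that it drives the intersection.
pub fn order_for_intersection(docsets: &mut [Box<dyn DocSet>]) {
    docsets.sort_by_key(|docset| docset.cost());
}
```

Ordering by `size_hint()` would place `ExpensiveDocs` first whenever it matches fewer documents, even though advancing it costs more than scanning the plain posting list.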
It applies the same logic on floats as for u64 or i64.
In all cases, the idea is (for the inverted index) to coerce numbers
to their canonical representation, before indexing and before searching.
That way a document with the float 1.0 will be searchable when the user
searches for 1.
Note that, contrary to the columnar, we do not attempt to coerce all of the
terms associated with a given json path to a single numerical type.
We simply rely on this "point-wise" canonicalization.
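A point-wise canonicalization could be sketched like this. The `Number` enum and the preference order (i64 first, then u64, then f64) are assumptions for illustration; the actual rule lives in the indexing and query code.

```rust
/// Hypothetical numeric value as it might appear in a JSON field or a query.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Number {
    U64(u64),
    I64(i64),
    F64(f64),
}

/// Coerces one number to a canonical representation, independently of any
/// other value seen on the same JSON path.
pub fn canonicalize(n: Number) -> Number {
    const TWO_63: f64 = 9223372036854775808.0; // 2^63
    const TWO_64: f64 = 18446744073709551616.0; // 2^64
    match n {
        Number::I64(_) => n,
        // A u64 that fits in an i64 is stored as an i64.
        Number::U64(v) => match i64::try_from(v) {
            Ok(i) => Number::I64(i),
            Err(_) => n,
        },
        Number::F64(f) => {
            // NaN, infinities and fractional values stay floats.
            if !f.is_finite() || f.fract() != 0.0 {
                n
            } else if f >= -TWO_63 && f < TWO_63 {
                Number::I64(f as i64)
            } else if f >= 0.0 && f < TWO_64 {
                Number::U64(f as u64)
            } else {
                n
            }
        }
    }
}
```

Applying the same function at indexing time and at query time means the float `1.0` and the integer `1` produce the same term, so either form of the query matches either form of the document.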
* Added per-field size details.
This also does a bunch of refactoring.
Merging field metadata does not silently assert that arguments should be sorted.
Merging does not set `stored`.
We do not rely on a hashmap to group fields, but instead rely on the fact that
the term dictionary is sorted.
The inverted-index-level method that exposes field metadata is no longer
public.
* CR comment
---------
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>