| Feature Name | Tracking Issue | Date | Author |
|---|---|---|---|
| Vector Index | TBD | 2025-12-04 | TBD |
Summary
Introduce a per-SST approximate nearest neighbor (ANN) index for VECTOR(dim) columns with a pluggable engine. USearch HNSW is the initial engine, while the design keeps VSAG (default when linked) and future engines selectable at DDL or alter time and encoded in the index metadata. The index is built alongside SST creation and accelerates ORDER BY vec_*_distance(column, <literal vector>) LIMIT k queries, falling back to the existing brute-force path when an index is unavailable or ineligible.
Motivation
Vector distances are currently computed with nalgebra across all rows (O(N)) before sorting, which does not scale to millions of vectors. An on-disk ANN index with sub-linear search reduces latency and compute cost for common RAG, semantic search, and recommendation workloads without changing SQL.
Details
Current Behavior
VECTOR(dim) values are stored as binary blobs. Queries call vec_cos_distance/vec_l2sq_distance/vec_dot_product via nalgebra for every row and then sort; there is no indexing or caching.
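For reference, the brute-force path described above can be sketched roughly as follows. This is an illustration only: it uses plain `std` arithmetic instead of nalgebra, and the function names `cos_distance` and `brute_force_top_k`, as well as the handling of zero-norm vectors, are assumptions of this sketch rather than the existing implementation.

```rust
/// Cosine distance (1 - cosine similarity) between two vectors.
/// Returns None when lengths differ or either vector has zero norm
/// (an assumption of this sketch, not necessarily what vec_cos_distance does).
fn cos_distance(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(1.0 - dot / (norm_a * norm_b))
}

/// Brute-force top-k: score every row (O(N)), sort ascending by distance,
/// then keep the first k entries. Returns (row index, distance) pairs.
fn brute_force_top_k(rows: &[Vec<f32>], query: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = rows
        .iter()
        .enumerate()
        .filter_map(|(i, v)| cos_distance(v, query).map(|d| (i, d)))
        .collect();
    scored.sort_by(|a, b| a.1.total_cmp(&b.1));
    scored.truncate(k);
    scored
}
```

Every query touches every row and sorts the full result, which is the cost the index is meant to avoid.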
Index Eligibility and Configuration
Only VECTOR(dim) columns can be indexed. A column metadata flag follows the existing column-option pattern with an intentionally small surface area:
- `engine`: `vsag` (default when the binding is built) or `usearch`. If a configured engine is unavailable at runtime, the builder logs and falls back to `usearch` while leaving the option intact for future rebuilds.
- `metric`: `cosine` (default), `l2sq`, or `dot`; mismatches with query functions force brute-force execution.
- `m`: HNSW graph connectivity (higher = denser graph, more memory, better recall), default `16`.
- `ef_construct`: build-time expansion, default `128`.
- `ef_search`: query-time expansion, default `64`; engines may clamp values.
Option semantics mirror HNSW defaults so both USearch and VSAG can honor them; engine-specific tunables stay in reserved key-value pairs inside the blob header for forward compatibility.
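The option surface could be modeled along these lines. This is a minimal sketch: the type names `VectorIndexOptions`, `Engine`, and `Metric` and the `from_kv` helper are hypothetical, and rejecting unknown keys or engine names at parse time is an assumption of the sketch, not something this RFC specifies. The defaults are the ones listed above.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Engine { Vsag, Usearch }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Metric { Cosine, L2sq, Dot }

/// Hypothetical in-memory form of the `VECTOR INDEX WITH (...)` options.
#[derive(Debug, Clone, PartialEq, Eq)]
struct VectorIndexOptions {
    engine: Engine,
    metric: Metric,
    m: usize,
    ef_construct: usize,
    ef_search: usize,
}

impl Default for VectorIndexOptions {
    fn default() -> Self {
        // Defaults as listed in the option table above.
        Self { engine: Engine::Vsag, metric: Metric::Cosine, m: 16, ef_construct: 128, ef_search: 64 }
    }
}

fn parse_usize(key: &str, value: &str) -> Result<usize, String> {
    value.parse::<usize>().map_err(|e| format!("invalid {key}: {e}"))
}

impl VectorIndexOptions {
    /// Start from the defaults and override with any provided key/value pairs.
    /// Unknown keys or values are rejected (an assumption of this sketch).
    fn from_kv(kv: &HashMap<String, String>) -> Result<Self, String> {
        let mut opts = Self::default();
        for (key, value) in kv {
            match key.as_str() {
                "engine" => opts.engine = match value.as_str() {
                    "vsag" => Engine::Vsag,
                    "usearch" => Engine::Usearch,
                    other => return Err(format!("unknown engine: {other}")),
                },
                "metric" => opts.metric = match value.as_str() {
                    "cosine" => Metric::Cosine,
                    "l2sq" => Metric::L2sq,
                    "dot" => Metric::Dot,
                    other => return Err(format!("unknown metric: {other}")),
                },
                "m" => opts.m = parse_usize(key, value)?,
                "ef_construct" => opts.ef_construct = parse_usize(key, value)?,
                "ef_search" => opts.ef_search = parse_usize(key, value)?,
                other => return Err(format!("unknown option: {other}")),
            }
        }
        Ok(opts)
    }
}
```

Keeping the struct this small matches the intent of a narrow option surface; engine-specific tunables would live elsewhere, in the reserved key-value pairs of the blob header.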
DDL reuses column extensions similar to inverted/fulltext indexes:
```sql
CREATE TABLE embeddings (
    ts TIMESTAMP TIME INDEX,
    id STRING PRIMARY KEY,
    vec VECTOR(384) VECTOR INDEX WITH (engine = 'vsag', metric = 'cosine', ef_search = 64)
);
```
Altering column options toggles the flag, can switch engines (for example usearch -> vsag), and triggers rebuilds through the existing alter/compaction flow. Engine choice stays in table metadata and each blob header; new SSTs use the configured engine while older SSTs remain readable under their recorded engine until compaction or a manual rebuild rewrites them.
Storage and Format
- One vector index per indexed column per SST, stored as a Puffin blob with type `greptime-vector-index-v1`.
- Each blob records the engine (`usearch`, `vsag`, future values) and engine parameters in the header so readers can select the matching decoder. Mixed-engine SSTs remain readable because the engine id travels with the blob.
- USearch uses `f32` vectors and SST row offsets (`u64`) as keys; nulls and `OpType::Delete` rows are skipped. Row ids are the absolute SST ordinal so readers can derive `RowSelection` directly from parquet row group lengths without extra side tables.
- Blob layout:
  - Header: version, column id, dimension, engine id, metric, `m`, `ef_construct`, `ef_search`, and reserved engine-specific key-value pairs.
  - Counts: total rows written and indexed rows.
  - Payload: USearch binary produced by `save_to_buffer`.
- An empty index (no eligible vectors) results in no available index entry for that column.
- `puffin_manager` registers the blob type so caches and readers discover it alongside inverted/fulltext/bloom blobs in the same index file.
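To make the blob layout concrete, the header and counts could be framed in front of the engine payload roughly as below. The field widths, the little-endian fixed layout, and the numeric values used for engine and metric ids are all assumptions of this sketch; the reserved key-value pairs are omitted for brevity, and `BlobHeader` is a hypothetical name.

```rust
/// Hypothetical fixed-width header plus counts; the RFC does not fix a byte layout.
#[derive(Debug, Clone, PartialEq, Eq)]
struct BlobHeader {
    version: u32,
    column_id: u32,
    dimension: u32,
    engine_id: u8, // assumed encoding, e.g. 1 = usearch, 2 = vsag
    metric: u8,    // assumed encoding, e.g. 0 = cosine, 1 = l2sq, 2 = dot
    m: u16,
    ef_construct: u32,
    ef_search: u32,
    total_rows: u64,
    indexed_rows: u64,
}

const HEADER_LEN: usize = 4 + 4 + 4 + 1 + 1 + 2 + 4 + 4 + 8 + 8;

fn read_u32(b: &[u8], o: usize) -> u32 {
    let mut a = [0u8; 4];
    a.copy_from_slice(&b[o..o + 4]);
    u32::from_le_bytes(a)
}

fn read_u64(b: &[u8], o: usize) -> u64 {
    let mut a = [0u8; 8];
    a.copy_from_slice(&b[o..o + 8]);
    u64::from_le_bytes(a)
}

impl BlobHeader {
    /// Serialize the header followed by the opaque engine payload.
    fn encode(&self, payload: &[u8]) -> Vec<u8> {
        let mut buf = Vec::with_capacity(HEADER_LEN + payload.len());
        buf.extend_from_slice(&self.version.to_le_bytes());
        buf.extend_from_slice(&self.column_id.to_le_bytes());
        buf.extend_from_slice(&self.dimension.to_le_bytes());
        buf.push(self.engine_id);
        buf.push(self.metric);
        buf.extend_from_slice(&self.m.to_le_bytes());
        buf.extend_from_slice(&self.ef_construct.to_le_bytes());
        buf.extend_from_slice(&self.ef_search.to_le_bytes());
        buf.extend_from_slice(&self.total_rows.to_le_bytes());
        buf.extend_from_slice(&self.indexed_rows.to_le_bytes());
        buf.extend_from_slice(payload);
        buf
    }

    /// Split a blob into its header and the remaining engine payload.
    /// Returns None if the blob is too short to contain a header.
    fn decode(blob: &[u8]) -> Option<(Self, &[u8])> {
        if blob.len() < HEADER_LEN {
            return None;
        }
        let header = Self {
            version: read_u32(blob, 0),
            column_id: read_u32(blob, 4),
            dimension: read_u32(blob, 8),
            engine_id: blob[12],
            metric: blob[13],
            m: u16::from_le_bytes([blob[14], blob[15]]),
            ef_construct: read_u32(blob, 16),
            ef_search: read_u32(blob, 20),
            total_rows: read_u64(blob, 24),
            indexed_rows: read_u64(blob, 32),
        };
        Some((header, &blob[HEADER_LEN..]))
    }
}
```

Because the engine id is read before the payload is touched, a reader can pick the decoder per blob, which is what keeps mixed-engine SSTs readable.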
Row Visibility and Duplicates
- The indexer increments `row_offset` for every incoming row (including skipped/null/delete rows) so offsets stay aligned with parquet ordering across row groups.
- Only `OpType::Put` rows with the expected dimension are inserted; `OpType::Delete` and malformed rows are skipped but still advance `row_offset`, matching the data plane’s visibility rules.
- Multiple versions of the same primary key remain in the graph; the read path intersects search hits with the standard mito2 deduplication/visibility pipeline (sequence-aware dedup, delete filtering, projection) before returning results.
- Searches overfetch beyond `k` to compensate for rows discarded by visibility checks and to avoid reissuing index reads.
Build Path (mito2 write)
Extend sst::index::Indexer to optionally create a VectorIndexer when region metadata marks a column as vector-indexed, mirroring how inverted/fulltext/bloom filters attach to IndexerBuilderImpl in mito2.
The indexer consumes Batch/RecordBatch data and shares memory tracking and abort semantics with existing indexers:
- Maintain a running `row_offset` that follows SST write order and spans row groups so the search result can be turned into `RowSelection`.
- For each `OpType::Put`, if the vector is non-null and matches the declared dimension, insert into USearch with `row_offset` as the key; otherwise skip.
- Track memory with existing index build metrics; on failure, abort only the vector index while keeping SST writing unaffected.
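The offset bookkeeping above can be sketched as follows. The `AnnBuilder` trait, the `VectorIndexer` shape, and `RecordingBuilder` are hypothetical stand-ins used to show the rule that every row advances `row_offset` while only valid `Put` rows are inserted; the real indexer would consume `Batch`/`RecordBatch` data and call into the engine.

```rust
#[derive(Debug, Clone, Copy)]
enum OpType { Put, Delete }

/// Hypothetical stand-in for the engine's insert API.
trait AnnBuilder {
    fn add(&mut self, key: u64, vector: &[f32]);
}

struct VectorIndexer<B: AnnBuilder> {
    dim: usize,
    row_offset: u64,   // absolute SST ordinal of the next row
    indexed_rows: u64, // rows actually inserted into the graph
    builder: B,
}

impl<B: AnnBuilder> VectorIndexer<B> {
    fn new(dim: usize, builder: B) -> Self {
        Self { dim, row_offset: 0, indexed_rows: 0, builder }
    }

    /// Feed one row in SST write order.
    fn push_row(&mut self, op: OpType, vector: Option<&[f32]>) {
        // Only non-null Put rows with the declared dimension are indexed.
        if let (OpType::Put, Some(v)) = (op, vector) {
            if v.len() == self.dim {
                self.builder.add(self.row_offset, v);
                self.indexed_rows += 1;
            }
        }
        // Every row, indexed or skipped, advances the offset so keys stay
        // aligned with parquet row ordering.
        self.row_offset += 1;
    }
}

/// Test double that records which keys were inserted.
#[derive(Default)]
struct RecordingBuilder { keys: Vec<u64> }

impl AnnBuilder for RecordingBuilder {
    fn add(&mut self, key: u64, _vector: &[f32]) {
        self.keys.push(key);
    }
}
```

With five rows where only the first and last are valid `Put` rows, the recorded keys are 0 and 4, while `row_offset` ends at 5; the gap between total and indexed rows is what the blob's counts section reports.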
Engine selection is table-driven: the builder picks the configured engine (default vsag, fallback usearch if vsag is not compiled in) and dispatches to the matching implementation. Unknown engines skip index build with a warning.
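The selection rule reads naturally as a small match. In this sketch the function name `select_engine` is hypothetical, `vsag_linked` stands in for whatever compile-time feature check the build uses, and `eprintln!` stands in for the project's logging.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Engine { Vsag, Usearch }

/// Resolve the configured engine name to the engine used for this build.
/// Returns None when the index build should be skipped.
fn select_engine(configured: &str, vsag_linked: bool) -> Option<Engine> {
    match configured {
        "vsag" if vsag_linked => Some(Engine::Vsag),
        "vsag" => {
            // Configured engine is unavailable: fall back, but the table
            // option itself is left untouched for future rebuilds.
            eprintln!("vsag is not linked; falling back to usearch");
            Some(Engine::Usearch)
        }
        "usearch" => Some(Engine::Usearch),
        other => {
            eprintln!("unknown vector index engine '{other}'; skipping index build");
            None
        }
    }
}
```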
On finish, serialize the engine-tagged index into the Puffin writer and record IndexType::Vector metadata for the column. IndexOutput and FileMeta::indexes/available_indexes gain a vector entry so manifest updates and RegionVersion surface per-column presence, following patterns used by inverted/fulltext/bloom indexes. Planner/metadata validation ensures that mismatched dimensions only reduce the indexed-row count and do not break reads.
Read Path (mito2 query)
A planner rule in query identifies eligible plans on mito2 tables: a single ORDER BY vec_cos_distance|vec_l2sq_distance|vec_dot_product(<vector column>, <literal vector>) in ascending order plus a LIMIT/TopK. The rule rejects plans with multiple sort keys, non-literal query vectors, or additional projections that would change the distance expression and falls back to brute-force in those cases.
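The eligibility check can be illustrated on a deliberately simplified plan shape. The types below (`SortExpr`, `SortKey`, `AnnCandidate`) are hypothetical and much smaller than real logical-plan nodes; the sketch only shows the accept/reject conditions named above and does not model the projection check.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Metric { Cosine, L2sq, Dot }

/// Simplified sort expression: either a vector distance call or anything else.
enum SortExpr {
    VectorDistance { function: String, column: String, query_is_literal: bool },
    Other,
}

struct SortKey { expr: SortExpr, ascending: bool }

#[derive(Debug, PartialEq, Eq)]
struct AnnCandidate { column: String, metric: Metric, k: usize }

fn metric_for_function(name: &str) -> Option<Metric> {
    match name {
        "vec_cos_distance" => Some(Metric::Cosine),
        "vec_l2sq_distance" => Some(Metric::L2sq),
        "vec_dot_product" => Some(Metric::Dot),
        _ => None,
    }
}

/// Returns Some(candidate) when the plan has exactly one ascending sort key of
/// the form vec_*(column, literal) and a LIMIT; otherwise None, meaning the
/// query stays on the brute-force path.
fn match_ann_plan(sort_keys: &[SortKey], limit: Option<usize>) -> Option<AnnCandidate> {
    let k = limit?;
    if sort_keys.len() != 1 {
        return None;
    }
    let key = &sort_keys[0];
    if !key.ascending {
        return None;
    }
    match &key.expr {
        SortExpr::VectorDistance { function, column, query_is_literal: true } => {
            let metric = metric_for_function(function)?;
            Some(AnnCandidate { column: column.clone(), metric, k })
        }
        _ => None,
    }
}
```

The returned metric would still have to be compared against the metric recorded for the index; a mismatch also forces brute-force execution.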
For eligible scans, build a VectorIndexScan execution node that:
- Consults SST metadata for `IndexType::Vector`, loads the index via Puffin using the existing `mito2::cache::index` infrastructure, and dispatches to the engine declared in the blob header (USearch/VSAG/etc.).
- Runs the engine’s `search` with an overfetch (for example 2×k) to tolerate rows filtered by deletes, dimension mismatches, or late-stage dedup; keys already match SST row offsets produced by the writer.
- Converts hits to `RowSelection` using parquet row group lengths and reuses the parquet reader so visibility, projection, and deduplication logic stay unchanged; distances are recomputed with `vec_*_distance` before the final trim to k to guarantee ordering and to merge distributed partial results deterministically.
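The last step can be sketched with plain data structures. Here `locate_rows` maps absolute SST row offsets to positions inside row groups, and `rerank` recomputes exact distances before trimming to k. Both names are hypothetical, the tuples stand in for parquet's `RowSelection`, and squared L2 is used only as an example metric.

```rust
/// Map absolute SST row offsets to (row group index, offset within that group),
/// given the length of each row group in file order. Offsets past the end of
/// the file are ignored.
fn locate_rows(hits: &[u64], row_group_lens: &[u64]) -> Vec<(usize, u64)> {
    let mut sorted = hits.to_vec();
    sorted.sort_unstable();
    sorted.dedup();
    let mut out = Vec::with_capacity(sorted.len());
    let mut group = 0usize;
    let mut group_start = 0u64;
    for row in sorted {
        // Advance to the row group that contains `row`.
        while group < row_group_lens.len() && row >= group_start + row_group_lens[group] {
            group_start += row_group_lens[group];
            group += 1;
        }
        if group == row_group_lens.len() {
            break;
        }
        out.push((group, row - group_start));
    }
    out
}

/// Squared L2 distance between two equal-length vectors.
fn l2sq(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Recompute exact distances for the surviving candidates, sort ascending,
/// and keep the best k. Candidates are (row offset, vector) pairs that have
/// already passed visibility and dedup filtering.
fn rerank(candidates: &[(u64, Vec<f32>)], query: &[f32], k: usize) -> Vec<(u64, f32)> {
    let mut scored: Vec<(u64, f32)> = candidates
        .iter()
        .map(|(row, v)| (*row, l2sq(v, query)))
        .collect();
    scored.sort_by(|a, b| a.1.total_cmp(&b.1));
    scored.truncate(k);
    scored
}
```

Because the final ordering comes from exact distances rather than the engine's approximate scores, results from different SSTs or nodes can be merged with the same comparison.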
Any unsupported shape, load error, or cache miss falls back to the current brute-force execution path.
Lifecycle and Maintenance
Lifecycle piggybacks on the existing SST/index flow: rebuilds run where other secondary indexes do, graphs are always rebuilt from source rows (no HNSW merge), and cleanup/versioning/caching reuse the existing Puffin and index cache paths.
Implementation Plan
- Add the `usearch` dependency (wrapper module in `index` or `mito2`) and map minimal HNSW options; keep an engine trait that allows plugging VSAG without changing the rest of the pipeline.
- Introduce `IndexType::Vector` and a column metadata key for vector index options (including `engine`); add SQL parser and `SHOW CREATE TABLE` support for `VECTOR INDEX WITH (...)`.
- Implement `vector_index` build/read modules under `mito2` (and `index` if shared), including Puffin serialization that records engine id, blob-type registration with `puffin_manager`, and integration with the `Indexer` builder, `IndexOutput`, manifest updates, and compaction rebuild.
- Extend the query planner/execution to detect eligible plans and drive a `RowSelection`-based ANN scan with a fallback path, dispatching by engine at read time and using existing Puffin and index caches.
- Add unit tests for serialization/search correctness and an end-to-end test covering plan rewrite, cache usage, engine selection, and fallback; add a mixed-engine test to confirm old USearch blobs still serve after a VSAG switch.
- Follow up with an optional VSAG engine binding (feature flag), validate parity with USearch on dense vectors, exercise alternative algorithms (for example PQ), and flip the default `engine` to `vsag` when the binding is present.
Alternatives
- VSAG (follow-up engine): C++ library with HNSW and additional algorithms (for example SINDI for sparse vectors and PQ) targeting in-memory and disk-friendly search. Provides parameter generators and a roadmap for GPU-assisted build and graph compression. Compared to FAISS it is newer with fewer integrations but bundles sparse/dense coverage and out-of-core focus in one engine. Fits the pluggable-engine design and would become the default `engine = 'vsag'` when linked; USearch remains available for lighter dependencies.
- FAISS: Broad feature set (IVF/IVFPQ/PQ/HNSW, GPU acceleration, scalar filtering, pre/post filters) and battle-tested performance across datasets, but it requires a heavier C++/GPU toolchain, has no official Rust binding, and is less disk-centric than VSAG; integrating it would add more build/distribution burden than USearch/VSAG.
- Do nothing: Keep brute-force evaluation, which remains O(N) and unacceptable at scale.