feat: support binary vector and IVF_FLAT in TypeScript (#2221)

resolve #2218 --------- Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2025-12-23 05:19:58 +00:00 · 2025-03-22 01:57:08 +08:00
parent 2bfdef2624
commit bdb6c09c3b
11 changed files with 406 additions and 5 deletions
--- a/docs/src/js/classes/Index.md
+++ b/docs/src/js/classes/Index.md
@@ -126,6 +126,37 @@ the vectors.
 ***
 ### ivfFlat()
 ```ts
 static ivfFlat(options?): Index
 ```
 Create an IvfFlat index
 This index groups vectors into partitions of similar vectors.  Each partition keeps track of
 a centroid which is the average value of all vectors in the group.
 During a query the centroids are compared with the query vector to find the closest
 partitions.  The vectors in these partitions are then searched to find
 the closest vectors.
 The partitioning process is called IVF and the `num_partitions` parameter controls how
 many groups to create.
 Note that training an IVF FLAT index on a large dataset is a slow operation and
 currently is also a memory intensive operation.
 #### Parameters
 * **options?**: `Partial`&lt;[`IvfFlatOptions`](../interfaces/IvfFlatOptions.md)&gt;
 #### Returns
 [`Index`](Index.md)
 ***
 ### ivfPq()
 ```ts
--- a/docs/src/js/functions/packBits.md
+++ b/docs/src/js/functions/packBits.md
@@ -0,0 +1,19 @@
 [**@lancedb/lancedb**](../README.md) • **Docs**
 ***
 [@lancedb/lancedb](../globals.md) / packBits
 # Function: packBits()
 ```ts
 function packBits(data): number[]
 ```
 ## Parameters
 * **data**: `number`[]
 ## Returns
 `number`[]
--- a/docs/src/js/globals.md
+++ b/docs/src/js/globals.md
@@ -39,6 +39,7 @@
 - [IndexConfig](interfaces/IndexConfig.md)
 - [IndexOptions](interfaces/IndexOptions.md)
 - [IndexStatistics](interfaces/IndexStatistics.md)
 - [IvfFlatOptions](interfaces/IvfFlatOptions.md)
 - [IvfPqOptions](interfaces/IvfPqOptions.md)
 - [OpenTableOptions](interfaces/OpenTableOptions.md)
 - [OptimizeOptions](interfaces/OptimizeOptions.md)
@@ -66,3 +67,4 @@
 - [connect](functions/connect.md)
 - [makeArrowTable](functions/makeArrowTable.md)
 - [packBits](functions/packBits.md)
--- a/docs/src/js/interfaces/IvfFlatOptions.md
+++ b/docs/src/js/interfaces/IvfFlatOptions.md
@@ -0,0 +1,112 @@
 [**@lancedb/lancedb**](../README.md) • **Docs**
 ***
 [@lancedb/lancedb](../globals.md) / IvfFlatOptions
 # Interface: IvfFlatOptions
 Options to create an `IVF_FLAT` index
 ## Properties
 ### distanceType?
 ```ts
 optional distanceType: "l2" | "cosine" | "dot" | "hamming";
 ```
 Distance type to use to build the index.
 Default value is "l2".
 This is used when training the index to calculate the IVF partitions
 (vectors are grouped in partitions with similar vectors according to this
 distance type).
 The distance type used to train an index MUST match the distance type used
 to search the index.  Failure to do so will yield inaccurate results.
 The following distance types are available:
 "l2" - Euclidean distance. This is a very common distance metric that
 accounts for both magnitude and direction when determining the distance
 between vectors. l2 distance has a range of [0, ∞).
 "cosine" - Cosine distance.  Cosine distance is a distance metric
 calculated from the cosine similarity between two vectors. Cosine
 similarity is a measure of similarity between two non-zero vectors of an
 inner product space. It is defined to equal the cosine of the angle
 between them.  Unlike l2, the cosine distance is not affected by the
 magnitude of the vectors.  Cosine distance has a range of [0, 2].
 Note: the cosine distance is undefined when one (or both) of the vectors
 are all zeros (there is no direction).  These vectors are invalid and may
 never be returned from a vector search.
 "dot" - Dot product. Dot distance is the dot product of two vectors. Dot
 distance has a range of (-∞, ∞). If the vectors are normalized (i.e. their
 l2 norm is 1), then dot distance is equivalent to the cosine distance.
 "hamming" - Hamming distance. Hamming distance is a distance metric
 calculated from the number of bits that are different between two vectors.
 Hamming distance has a range of [0, dimension]. Note that the hamming distance
 is only valid for binary vectors.
 ***
 ### maxIterations?
 ```ts
 optional maxIterations: number;
 ```
 Max iteration to train IVF kmeans.
 When training an IVF FLAT index we use kmeans to calculate the partitions.  This parameter
 controls how many iterations of kmeans to run.
 Increasing this might improve the quality of the index but in most cases these extra
 iterations have diminishing returns.
 The default value is 50.
 ***
 ### numPartitions?
 ```ts
 optional numPartitions: number;
 ```
 The number of IVF partitions to create.
 This value should generally scale with the number of rows in the dataset.
 By default the number of partitions is the square root of the number of
 rows.
 If this value is too large then the first part of the search (picking the
 right partition) will be slow.  If this value is too small then the second
 part of the search (searching within a partition) will be slow.
 ***
 ### sampleRate?
 ```ts
 optional sampleRate: number;
 ```
 The number of vectors, per partition, to sample when training IVF kmeans.
 When an IVF FLAT index is trained, we need to calculate partitions.  These are groups
 of vectors that are similar to each other.  To do this we use an algorithm called kmeans.
 Running kmeans on a large dataset can be slow.  To speed this up we run kmeans on a
 random sample of the data.  This parameter controls the size of the sample.  The total
 number of vectors used to train the index is `sample_rate * num_partitions`.
 Increasing this value might improve the quality of the index but in most cases the
 default should be sufficient.
 The default value is 256.
--- a/docs/src/search.md
+++ b/docs/src/search.md
@@ -138,6 +138,19 @@ LanceDB supports binary vectors as a data type, and has the ability to search bi
        --8<-- "python/python/tests/docs/test_binary_vector.py:async_binary_vector"
        ```
    === "TypeScript"
        ```ts
        --8<-- "nodejs/examples/search.test.ts:import"
        --8<-- "nodejs/examples/search.test.ts:import_bin_util"
        --8<-- "nodejs/examples/search.test.ts:ingest_binary_data"
        --8<-- "nodejs/examples/search.test.ts:search_binary_data"
        ```
 ## Multivector type
 LanceDB supports multivector type, this is useful when you have multiple vectors for a single item (e.g. with ColBert and ColPali).
--- a/nodejs/test/table.test.ts
+++ b/nodejs/test/table.test.ts
@@ -21,6 +21,7 @@ import {
  Int64,
  List,
  Schema,
  Uint8,
  Utf8,
  makeArrowTable,
 } from "../lancedb/arrow";
@@ -740,6 +741,38 @@ describe("When creating an index", () => {
    expect(stats).toBeUndefined();
  });
  test("create ivf_flat with binary vectors", async () => {
    const db = await connect(tmpDir.name);
    const binarySchema = new Schema([
      new Field("id", new Int32(), true),
      new Field("vec", new FixedSizeList(32, new Field("item", new Uint8()))),
    ]);
    const tbl = await db.createTable(
      "binary",
      makeArrowTable(
        Array(300)
          .fill(1)
          .map((_, i) => ({
            id: i,
            vec: Array(32)
              .fill(1)
              .map(() => Math.floor(Math.random() * 255)),
          })),
        { schema: binarySchema },
      ),
    );
    await tbl.createIndex("vec", {
      config: Index.ivfFlat({ numPartitions: 10, distanceType: "hamming" }),
    });
    // query with binary vectors
    const queryVec = Array(32)
      .fill(1)
      .map(() => Math.floor(Math.random() * 255));
    const rst = await tbl.query().limit(5).nearestTo(queryVec).toArrow();
    expect(rst.numRows).toBe(5);
  });
  // TODO: Move this test to the query API test (making sure we can reject queries
  // when the dimension is incorrect)
  test("two columns with different dimensions", async () => {
--- a/nodejs/examples/search.test.ts
+++ b/nodejs/examples/search.test.ts
@@ -4,9 +4,12 @@ import { expect, test } from "@jest/globals";
 // --8<-- [start:import]
 import * as lancedb from "@lancedb/lancedb";
 // --8<-- [end:import]
 // --8<-- [start:import_bin_util]
 import { Field, FixedSizeList, Int32, Schema, Uint8 } from "apache-arrow";
 // --8<-- [end:import_bin_util]
 import { withTempDirectory } from "./util.ts";
-test("full text search", async () => {
+test("vector search", async () => {
  await withTempDirectory(async (databaseDir) => {
    {
      const db = await lancedb.connect(databaseDir);
@@ -14,8 +17,6 @@ test("full text search", async () => {
      const data = Array.from({ length: 10_000 }, (_, i) => ({
        vector: Array(128).fill(i),
        id: `${i}`,
        content: "",
        longId: `${i}`,
      }));
      await db.createTable("my_vectors", data);
@@ -52,5 +53,41 @@ test("full text search", async () => {
      expect(r.distance).toBeGreaterThanOrEqual(0.1);
      expect(r.distance).toBeLessThan(0.2);
    }
    {
      // --8<-- [start:ingest_binary_data]
      const schema = new Schema([
        new Field("id", new Int32(), true),
        new Field("vec", new FixedSizeList(32, new Field("item", new Uint8()))),
      ]);
      const data = lancedb.makeArrowTable(
        Array(1_000)
          .fill(0)
          .map((_, i) => ({
            // the 256 bits would be store in 32 bytes,
            // if your data is already in this format, you can skip the packBits step
            id: i,
            vec: lancedb.packBits(Array(256).fill(i % 2)),
          })),
        { schema: schema },
      );
      const tbl = await db.createTable("binary_table", data);
      await tbl.createIndex("vec", {
        config: lancedb.Index.ivfFlat({
          numPartitions: 10,
          distanceType: "hamming",
        }),
      });
      // --8<-- [end:ingest_binary_data]
      // --8<-- [start:search_binary_data]
      const query = Array(32)
        .fill(1)
        .map(() => Math.floor(Math.random() * 255));
      const results = await tbl.query().nearestTo(query).limit(10).toArrow();
      // --8<-- [end:search_binary_data
      expect(results.numRows).toBe(10);
    }
  });
 });
--- a/nodejs/lancedb/index.ts
+++ b/nodejs/lancedb/index.ts
@@ -53,6 +53,7 @@ export {
  Index,
  IndexOptions,
  IvfPqOptions,
  IvfFlatOptions,
  HnswPqOptions,
  HnswSqOptions,
  FtsOptions,
@@ -79,7 +80,7 @@ export {
  DataLike,
  IntoVector,
 } from "./arrow";
-export { IntoSql } from "./util";
+export { IntoSql, packBits } from "./util";
 /**
 * Connect to a LanceDB instance at the given URI.
--- a/nodejs/lancedb/indices.ts
+++ b/nodejs/lancedb/indices.ts
@@ -327,6 +327,94 @@ export interface HnswSqOptions {
  efConstruction?: number;
 }
 /**
 * Options to create an `IVF_FLAT` index
 */
 export interface IvfFlatOptions {
  /**
   * The number of IVF partitions to create.
   *
   * This value should generally scale with the number of rows in the dataset.
   * By default the number of partitions is the square root of the number of
   * rows.
   *
   * If this value is too large then the first part of the search (picking the
   * right partition) will be slow.  If this value is too small then the second
   * part of the search (searching within a partition) will be slow.
   */
  numPartitions?: number;
  /**
   * Distance type to use to build the index.
   *
   * Default value is "l2".
   *
   * This is used when training the index to calculate the IVF partitions
   * (vectors are grouped in partitions with similar vectors according to this
   * distance type).
   *
   * The distance type used to train an index MUST match the distance type used
   * to search the index.  Failure to do so will yield inaccurate results.
   *
   * The following distance types are available:
   *
   * "l2" - Euclidean distance. This is a very common distance metric that
   * accounts for both magnitude and direction when determining the distance
   * between vectors. l2 distance has a range of [0, ∞).
   *
   * "cosine" - Cosine distance.  Cosine distance is a distance metric
   * calculated from the cosine similarity between two vectors. Cosine
   * similarity is a measure of similarity between two non-zero vectors of an
   * inner product space. It is defined to equal the cosine of the angle
   * between them.  Unlike l2, the cosine distance is not affected by the
   * magnitude of the vectors.  Cosine distance has a range of [0, 2].
   *
   * Note: the cosine distance is undefined when one (or both) of the vectors
   * are all zeros (there is no direction).  These vectors are invalid and may
   * never be returned from a vector search.
   *
   * "dot" - Dot product. Dot distance is the dot product of two vectors. Dot
   * distance has a range of (-∞, ∞). If the vectors are normalized (i.e. their
   * l2 norm is 1), then dot distance is equivalent to the cosine distance.
   *
   * "hamming" - Hamming distance. Hamming distance is a distance metric
   * calculated from the number of bits that are different between two vectors.
   * Hamming distance has a range of [0, dimension]. Note that the hamming distance
   * is only valid for binary vectors.
   */
  distanceType?: "l2" | "cosine" | "dot" | "hamming";
  /**
   * Max iteration to train IVF kmeans.
   *
   * When training an IVF FLAT index we use kmeans to calculate the partitions.  This parameter
   * controls how many iterations of kmeans to run.
   *
   * Increasing this might improve the quality of the index but in most cases these extra
   * iterations have diminishing returns.
   *
   * The default value is 50.
   */
  maxIterations?: number;
  /**
   * The number of vectors, per partition, to sample when training IVF kmeans.
   *
   * When an IVF FLAT index is trained, we need to calculate partitions.  These are groups
   * of vectors that are similar to each other.  To do this we use an algorithm called kmeans.
   *
   * Running kmeans on a large dataset can be slow.  To speed this up we run kmeans on a
   * random sample of the data.  This parameter controls the size of the sample.  The total
   * number of vectors used to train the index is `sample_rate * num_partitions`.
   *
   * Increasing this value might improve the quality of the index but in most cases the
   * default should be sufficient.
   *
   * The default value is 256.
   */
  sampleRate?: number;
 }
 /**
 * Options to create a full text search index
 */
@@ -426,6 +514,33 @@ export class Index {
    );
  }
  /**
   * Create an IvfFlat index
   *
   * This index groups vectors into partitions of similar vectors.  Each partition keeps track of
   * a centroid which is the average value of all vectors in the group.
   *
   * During a query the centroids are compared with the query vector to find the closest
   * partitions.  The vectors in these partitions are then searched to find
   * the closest vectors.
   *
   * The partitioning process is called IVF and the `num_partitions` parameter controls how
   * many groups to create.
   *
   * Note that training an IVF FLAT index on a large dataset is a slow operation and
   * currently is also a memory intensive operation.
   */
  static ivfFlat(options?: Partial<IvfFlatOptions>) {
    return new Index(
      LanceDbIndex.ivfFlat(
        options?.distanceType,
        options?.numPartitions,
        options?.maxIterations,
        options?.sampleRate,
      ),
    );
  }
  /**
   * Create a btree index
   *
--- a/nodejs/lancedb/util.ts
+++ b/nodejs/lancedb/util.ts
@@ -35,6 +35,16 @@ export function toSQL(value: IntoSql): string {
  }
 }
 export function packBits(data: Array<number>): Array<number> {
  const packed = Array(data.length >> 3).fill(0);
  for (let i = 0; i < data.length; i++) {
    const byte = i >> 3;
    const bit = i & 7;
    packed[byte] |= data[i] << bit;
  }
  return packed;
 }
 export class TTLCache {
  // biome-ignore lint/suspicious/noExplicitAny: <explanation>
  private readonly cache: Map<string, { value: any; expires: number }>;
--- a/nodejs/src/index.rs
+++ b/nodejs/src/index.rs
@@ -4,7 +4,9 @@
 use std::sync::Mutex;
 use lancedb::index::scalar::{BTreeIndexBuilder, FtsIndexBuilder};
-use lancedb::index::vector::{IvfHnswPqIndexBuilder, IvfHnswSqIndexBuilder, IvfPqIndexBuilder};
+use lancedb::index::vector::{
    IvfFlatIndexBuilder, IvfHnswPqIndexBuilder, IvfHnswSqIndexBuilder, IvfPqIndexBuilder,
 };
 use lancedb::index::Index as LanceDbIndex;
 use napi_derive::napi;
@@ -63,6 +65,32 @@ impl Index {
        })
    }
    #[napi(factory)]
    pub fn ivf_flat(
        distance_type: Option<String>,
        num_partitions: Option<u32>,
        max_iterations: Option<u32>,
        sample_rate: Option<u32>,
    ) -> napi::Result<Self> {
        let mut ivf_flat_builder = IvfFlatIndexBuilder::default();
        if let Some(distance_type) = distance_type {
            let distance_type = parse_distance_type(distance_type)?;
            ivf_flat_builder = ivf_flat_builder.distance_type(distance_type);
        }
        if let Some(num_partitions) = num_partitions {
            ivf_flat_builder = ivf_flat_builder.num_partitions(num_partitions);
        }
        if let Some(max_iterations) = max_iterations {
            ivf_flat_builder = ivf_flat_builder.max_iterations(max_iterations);
        }
        if let Some(sample_rate) = sample_rate {
            ivf_flat_builder = ivf_flat_builder.sample_rate(sample_rate);
        }
        Ok(Self {
            inner: Mutex::new(Some(LanceDbIndex::IvfFlat(ivf_flat_builder))),
        })
    }
    #[napi(factory)]
    pub fn btree() -> Self {
        Self {