mirror of
https://github.com/lancedb/lancedb.git
synced 2026-01-04 19:02:58 +00:00
feat: support binary vector and IVF_FLAT in TypeScript (#2221)
resolve #2218

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@@ -126,6 +126,37 @@ the vectors.

***

### ivfFlat()

```ts
static ivfFlat(options?): Index
```

Create an IvfFlat index

This index groups vectors into partitions of similar vectors. Each partition keeps track of
a centroid which is the average value of all vectors in the group.

During a query the centroids are compared with the query vector to find the closest
partitions. The vectors in these partitions are then searched to find
the closest vectors.

The partitioning process is called IVF and the `numPartitions` parameter controls how
many groups to create.

Note that training an IVF FLAT index on a large dataset is slow and
currently also memory-intensive.
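
The partition-then-search flow described above can be sketched in plain TypeScript. This is only an illustration under simplifying assumptions, not LanceDB's implementation: the centroids are supplied by hand rather than trained, and `sqL2`, `nearestCentroids`, `buildPartitions`, `ivfFlatSearch`, and the `nprobes` argument (how many partitions to scan) are hypothetical names introduced here.

```ts
type Vec = number[];

// Squared Euclidean distance; it ranks neighbours the same way as true l2.
function sqL2(a: Vec, b: Vec): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return sum;
}

// Indices of the `n` centroids closest to `v`, nearest first.
function nearestCentroids(centroids: Vec[], v: Vec, n: number): number[] {
  return centroids
    .map((c, i) => ({ i, d: sqL2(c, v) }))
    .sort((x, y) => x.d - y.d)
    .slice(0, n)
    .map((x) => x.i);
}

// Assign every row to the partition of its nearest centroid.
function buildPartitions(centroids: Vec[], vectors: Vec[]): number[][] {
  const partitions: number[][] = centroids.map(() => []);
  vectors.forEach((v, row) => {
    partitions[nearestCentroids(centroids, v, 1)[0]].push(row);
  });
  return partitions;
}

// Probe the `nprobes` closest partitions, then scan their rows exhaustively
// (the "flat" part: vectors are stored and compared uncompressed).
function ivfFlatSearch(
  centroids: Vec[],
  partitions: number[][],
  vectors: Vec[],
  query: Vec,
  k: number,
  nprobes: number,
): number[] {
  const candidates: number[] = [];
  for (const p of nearestCentroids(centroids, query, nprobes)) {
    for (const row of partitions[p]) candidates.push(row);
  }
  return candidates
    .map((row) => ({ row, d: sqL2(vectors[row], query) }))
    .sort((x, y) => x.d - y.d)
    .slice(0, k)
    .map((x) => x.row);
}

const centroids: Vec[] = [[0, 0], [10, 10]];
const vectors: Vec[] = [[0, 1], [1, 0], [9, 10], [10, 9], [11, 11]];
const partitions = buildPartitions(centroids, vectors);
const hits = ivfFlatSearch(centroids, partitions, vectors, [10, 10], 2, 1);
```

With `nprobes` set to 1 only the second partition is scanned, so rows 0 and 1 are never compared against the query. That skipped work is where the speed-up comes from, and also why results are approximate.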

#### Parameters

* **options?**: `Partial`<[`IvfFlatOptions`](../interfaces/IvfFlatOptions.md)>

#### Returns

[`Index`](Index.md)

***

### ivfPq()

```ts
19
docs/src/js/functions/packBits.md
Normal file
@@ -0,0 +1,19 @@
[**@lancedb/lancedb**](../README.md) • **Docs**

***

[@lancedb/lancedb](../globals.md) / packBits

# Function: packBits()

```ts
function packBits(data): number[]
```

## Parameters

* **data**: `number`[]

## Returns

`number`[]
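
The generated reference above gives only the signature. As a rough illustration of what packing bits into bytes can look like, the sketch below turns an array of 0/1 values into one number per eight bits. The name `packBitsSketch` is hypothetical, and the bit order (least-significant bit first) is an assumption; the library's `packBits` may lay bits out differently.

```ts
// Hypothetical stand-in for illustration only; not the library's `packBits`.
// Assumes every element of `bits` is 0 or 1, and that bit i of each group
// of eight is stored at position (i % 8), least-significant bit first.
function packBitsSketch(bits: number[]): number[] {
  const packed: number[] = [];
  for (let i = 0; i < bits.length; i++) {
    const byteIndex = i >> 3; // i / 8, rounded down
    if (packed.length <= byteIndex) packed.push(0);
    if (bits[i] !== 0) {
      packed[byteIndex] |= 1 << (i & 7);
    }
  }
  return packed;
}

// A 16-dimensional binary vector becomes two bytes.
const packed = packBitsSketch([1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0]);
```

Under these assumptions a binary vector of dimension `d` occupies `d / 8` bytes, which is the form hamming distance is computed on.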
@@ -39,6 +39,7 @@
- [IndexConfig](interfaces/IndexConfig.md)
- [IndexOptions](interfaces/IndexOptions.md)
- [IndexStatistics](interfaces/IndexStatistics.md)
- [IvfFlatOptions](interfaces/IvfFlatOptions.md)
- [IvfPqOptions](interfaces/IvfPqOptions.md)
- [OpenTableOptions](interfaces/OpenTableOptions.md)
- [OptimizeOptions](interfaces/OptimizeOptions.md)
@@ -66,3 +67,4 @@

- [connect](functions/connect.md)
- [makeArrowTable](functions/makeArrowTable.md)
- [packBits](functions/packBits.md)

112
docs/src/js/interfaces/IvfFlatOptions.md
Normal file
@@ -0,0 +1,112 @@
[**@lancedb/lancedb**](../README.md) • **Docs**

***

[@lancedb/lancedb](../globals.md) / IvfFlatOptions

# Interface: IvfFlatOptions

Options to create an `IVF_FLAT` index

## Properties

### distanceType?

```ts
optional distanceType: "l2" | "cosine" | "dot" | "hamming";
```

Distance type to use to build the index.

Default value is "l2".

This is used when training the index to calculate the IVF partitions
(vectors are grouped in partitions with similar vectors according to this
distance type).

The distance type used to train an index MUST match the distance type used
to search the index. Failure to do so will yield inaccurate results.

The following distance types are available:

"l2" - Euclidean distance. This is a very common distance metric that
accounts for both magnitude and direction when determining the distance
between vectors. l2 distance has a range of [0, ∞).

"cosine" - Cosine distance. Cosine distance is a distance metric
calculated from the cosine similarity between two vectors. Cosine
similarity is a measure of similarity between two non-zero vectors of an
inner product space. It is defined to equal the cosine of the angle
between them. Unlike l2, the cosine distance is not affected by the
magnitude of the vectors. Cosine distance has a range of [0, 2].

Note: the cosine distance is undefined when one (or both) of the vectors
is all zeros (there is no direction). These vectors are invalid and may
never be returned from a vector search.

"dot" - Dot product. Dot distance is the dot product of two vectors. Dot
distance has a range of (-∞, ∞). If the vectors are normalized (i.e. their
l2 norm is 1), then dot distance is equivalent to the cosine distance.

"hamming" - Hamming distance. Hamming distance is a distance metric
calculated from the number of bits that are different between two vectors.
Hamming distance has a range of [0, dimension]. Note that the hamming distance
is only valid for binary vectors.
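
To make the ranges above concrete, here are textbook versions of the four metrics in plain TypeScript. These are illustrative helpers with hypothetical names, not LanceDB functions, and the library's internal computation may differ in detail (for example, in whether a square root is applied for l2). The hamming helper assumes its inputs are byte arrays of packed bits.

```ts
// Dot product of two equal-length vectors.
function dot(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Euclidean distance: sensitive to both magnitude and direction.
function l2Distance(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return Math.sqrt(sum);
}

// 1 - cosine similarity: depends only on direction.
// Yields NaN when either vector is all zeros (0 / 0).
function cosineDistance(a: number[], b: number[]): number {
  return 1 - dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

// Number of differing bits between two packed byte arrays (assumed layout).
function hammingDistance(a: number[], b: number[]): number {
  let count = 0;
  for (let i = 0; i < a.length; i++) {
    let diff = (a[i] ^ b[i]) & 0xff;
    while (diff !== 0) {
      count += diff & 1;
      diff >>= 1;
    }
  }
  return count;
}

const sameDirection = cosineDistance([1, 0], [2, 0]); // magnitude is ignored
const opposite = cosineDistance([1, 0], [-1, 0]); // upper end of [0, 2]
const euclid = l2Distance([0, 0], [3, 4]);
const bitsDiffer = hammingDistance([0b1010], [0b0110]);
```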

***

### maxIterations?

```ts
optional maxIterations: number;
```

Maximum number of iterations to run when training IVF kmeans.

When training an IVF FLAT index we use kmeans to calculate the partitions. This parameter
controls how many iterations of kmeans to run.

Increasing this might improve the quality of the index but in most cases these extra
iterations have diminishing returns.

The default value is 50.
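
A small sketch of kmeans (Lloyd's algorithm) with an iteration cap shows why extra iterations tend to have diminishing returns: once assignments stop changing, further passes do nothing. This is a generic illustration, not LanceDB's training code. The function name, the explicit `initial` centroids, and the early-exit rule are assumptions made for the example.

```ts
type Point = number[];

function sqDist(a: Point, b: Point): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
  return sum;
}

// Lloyd's algorithm: alternate "assign to nearest centroid" and
// "move centroid to the mean of its members", at most `maxIterations` times.
function kmeans(
  points: Point[],
  initial: Point[],
  maxIterations: number,
): { centroids: Point[]; iterations: number } {
  let centroids = initial.map((c) => c.slice());
  let assignment: number[] = points.map(() => -1);
  let iterations = 0;
  while (iterations < maxIterations) {
    iterations++;
    // Assignment step.
    const next = points.map((p) => {
      let best = 0;
      for (let c = 1; c < centroids.length; c++) {
        if (sqDist(p, centroids[c]) < sqDist(p, centroids[best])) best = c;
      }
      return best;
    });
    const changed = next.some((a, i) => a !== assignment[i]);
    assignment = next;
    if (!changed) break; // converged: more iterations would change nothing
    // Update step; an empty cluster keeps its previous centroid.
    centroids = centroids.map((old, c) => {
      const members = points.filter((_, i) => assignment[i] === c);
      if (members.length === 0) return old;
      return old.map((_, d) => members.reduce((s, m) => s + m[d], 0) / members.length);
    });
  }
  return { centroids, iterations };
}

const points: Point[] = [[0, 0], [0, 2], [10, 10], [10, 12]];
const converged = kmeans(points, [[0, 0], [1, 1]], 50);
const capped = kmeans(points, [[0, 0], [1, 1]], 1);
```

Here `converged` stops after three passes even though fifty were allowed, while `capped` is cut off after one pass and is left with a centroid that has not yet settled.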

***

### numPartitions?

```ts
optional numPartitions: number;
```

The number of IVF partitions to create.

This value should generally scale with the number of rows in the dataset.
By default the number of partitions is the square root of the number of
rows.

If this value is too large then the first part of the search (picking the
right partition) will be slow. If this value is too small then the second
part of the search (searching within a partition) will be slow.
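
The trade-off above can be made concrete with a simple cost model: a query is compared against every centroid, then against every vector in the partitions it probes. The sketch below assumes evenly sized partitions; `estimatedComparisons` and its `nprobes` argument are hypothetical names used only for this illustration.

```ts
// Rough number of distance computations per query, assuming partitions
// of equal size. Not a LanceDB API; an illustrative model only.
function estimatedComparisons(numRows: number, numPartitions: number, nprobes: number = 1): number {
  const centroidComparisons = numPartitions; // step 1: pick partitions
  const vectorComparisons = nprobes * (numRows / numPartitions); // step 2: scan them
  return centroidComparisons + vectorComparisons;
}

const rows = 1000000;
const tooFew = estimatedComparisons(rows, 10); // scanning a partition dominates
const balanced = estimatedComparisons(rows, 1000); // square root of rows
const tooMany = estimatedComparisons(rows, 100000); // picking a partition dominates
```

Under this model, with a single probe, the two costs are equal when `numPartitions` is the square root of the row count, which is consistent with the default described above.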

***

### sampleRate?

```ts
optional sampleRate: number;
```

The number of vectors, per partition, to sample when training IVF kmeans.

When an IVF FLAT index is trained, we need to calculate partitions. These are groups
of vectors that are similar to each other. To do this we use an algorithm called kmeans.

Running kmeans on a large dataset can be slow. To speed this up we run kmeans on a
random sample of the data. This parameter controls the size of the sample. The total
number of vectors used to train the index is `sampleRate * numPartitions`.

Increasing this value might improve the quality of the index but in most cases the
default should be sufficient.

The default value is 256.
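
As a quick worked example of the formula above, the helper below computes the training sample size. The function name is hypothetical, and capping the result at the number of rows is an assumption made here (a sample cannot exceed the dataset), not documented behavior.

```ts
// Training sample size = sampleRate * numPartitions, capped at the dataset
// size. The cap is an assumption for this sketch, not documented behavior.
function trainingSampleSize(numRows: number, numPartitions: number, sampleRate: number = 256): number {
  return Math.min(numRows, sampleRate * numPartitions);
}

// 1,000,000 rows with 1,000 partitions: about a quarter of the rows are used.
const largeDataset = trainingSampleSize(1000000, 1000);
// 10,000 rows with 100 partitions: the formula exceeds the dataset, so all rows are used.
const smallDataset = trainingSampleSize(10000, 100);
```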
@@ -138,6 +138,19 @@ LanceDB supports binary vectors as a data type, and has the ability to search bi
--8<-- "python/python/tests/docs/test_binary_vector.py:async_binary_vector"
```

=== "TypeScript"

```ts
--8<-- "nodejs/examples/search.test.ts:import"

--8<-- "nodejs/examples/search.test.ts:import_bin_util"

--8<-- "nodejs/examples/search.test.ts:ingest_binary_data"

--8<-- "nodejs/examples/search.test.ts:search_binary_data"
```

## Multivector type

LanceDB supports a multivector type, which is useful when you have multiple vectors for a single item (e.g. with ColBERT and ColPali).