lancedb/index.md at 22bd8329f3e3591f08afb45f09e23f2e998e8271

mirror of https://github.com/lancedb/lancedb.git synced 2026-01-03 18:32:55 +00:00



Vector embeddings can represent almost any kind of data, from text to images to audio, which makes them a powerful tool for machine learning practitioners. However, there is no one-size-fits-all way to generate them: many libraries and APIs, both commercial and open source, can produce embeddings from structured or unstructured data.

LanceDB supports three methods of working with embeddings:

1. You can manually generate embeddings for the data and queries. This is done outside of LanceDB.
2. You can use the built-in embedding functions to embed the data and queries in the background.
3. You can define your own custom embedding function that extends the default embedding functions.
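
As a rough illustration of the first method, the sketch below computes vectors outside LanceDB and hands only the finished vectors to the table. `toy_embed`, `build_rows`, `store_and_search`, and the table name `manual_words` are hypothetical names invented for this example. `toy_embed` is a hash-based stand-in for a real embedding model and carries no semantic meaning.

```python
import hashlib
import math


def toy_embed(text: str, dims: int = 8) -> list[float]:
    """Hypothetical stand-in for a real embedding model: hashes each word
    into one of `dims` buckets and L2-normalizes the bucket counts."""
    vec = [0.0] * dims
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dims] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def build_rows(texts: list[str]) -> list[dict]:
    """Embed each text up front; LanceDB only ever sees the finished vectors."""
    return [{"text": t, "vector": toy_embed(t)} for t in texts]


def store_and_search(db_path: str, texts: list[str], query: str) -> str:
    """Write pre-computed vectors to a table, then search with a query vector."""
    import lancedb  # imported here so the helpers above work without it

    db = lancedb.connect(db_path)
    table = db.create_table("manual_words", data=build_rows(texts), mode="overwrite")
    # The caller embeds the query too, using the same function as for the data.
    hits = table.search(toy_embed(query)).limit(1).to_list()
    return hits[0]["text"]
```

Calling `store_and_search("/tmp/db", ["hello world", "goodbye world"], "hello")` would create the table and return the text of the nearest row. With this approach the caller is responsible for embedding data and queries with the same model, which is the bookkeeping the built-in embedding functions take over.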

For Python users, there is also a legacy `with_embeddings` API. It is retained for compatibility and will be removed in a future version.

Quickstart

To get started with embeddings, you can use the built-in embedding functions.

OpenAI Embedding function

LanceDB registers the OpenAI embedding function in the registry as `openai`. You can pass any supported model name to `create`. By default it uses "text-embedding-ada-002".

=== "Python"

```python
import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry

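db = lancedb.connect("/tmp/db")
# Look up the registered OpenAI function; calls need an OpenAI API key
# (typically supplied via the OPENAI_API_KEY environment variable).
func = get_registry().get("openai").create(name="text-embedding-ada-002")

class Words(LanceModel):
    # SourceField marks the column to embed; VectorField stores the result.
    text: str = func.SourceField()
    vector: Vector(func.ndims()) = func.VectorField()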

table = db.create_table("words", schema=Words, mode="overwrite")
table.add(
    [
        {"text": "hello world"},
        {"text": "goodbye world"}
    ]
)

# The query string is embedded with the same function before the search runs.
query = "greetings"
actual = table.search(query).limit(1).to_pydantic(Words)[0]
print(actual.text)
```

=== "TypeScript"

```typescript
--8<--- "nodejs/examples/embedding.test.ts:imports"
--8<--- "nodejs/examples/embedding.test.ts:openai_embeddings"
```

=== "Rust"

```rust
--8<--- "rust/lancedb/examples/openai.rs:imports"
--8<--- "rust/lancedb/examples/openai.rs:openai_embeddings"
```

Sentence Transformers Embedding function

LanceDB registers the Sentence Transformers embedding function in the registry as `sentence-transformers`. You can pass any supported model name to `create`. By default it uses "sentence-transformers/paraphrase-MiniLM-L6-v2".

=== "Python"

```python
import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry

db = lancedb.connect("/tmp/db")
model = get_registry().get("sentence-transformers").create(name="BAAI/bge-small-en-v1.5", device="cpu")

class Words(LanceModel):
    text: str = model.SourceField()
    vector: Vector(model.ndims()) = model.VectorField()

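# mode="overwrite" replaces an existing "words" table instead of raising an error.
table = db.create_table("words", schema=Words, mode="overwrite")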
table.add(
    [
        {"text": "hello world"},
        {"text": "goodbye world"}
    ]
)

query = "greetings"
actual = table.search(query).limit(1).to_pydantic(Words)[0]
print(actual.text)
```