mirror of
https://github.com/lancedb/lancedb.git
synced 2026-01-14 15:52:57 +00:00
143 lines
4.8 KiB
Markdown
143 lines
4.8 KiB
Markdown
# Embedding Functions
|
|
|
|
Embeddings are high dimensional floating-point vector representations of your data or query.
|
|
Anything can be embedded using some embedding model or function.
|
|
For a given embedding function, the output will always have the same number of dimensions.
|
|
|
|
## Creating an embedding function
|
|
|
|
Any function that takes as input a batch (list) of data and outputs a batch (list) of embeddings
|
|
can be used by LanceDB as an embedding function. The input and output batch sizes should be the same.
|
|
|
|
### HuggingFace example
|
|
|
|
One popular free option would be to use the [sentence-transformers](https://www.sbert.net/) library from HuggingFace.
|
|
You can install this using pip: `pip install sentence-transformers`.
|
|
|
|
```python
|
|
from sentence_transformers import SentenceTransformer
|
|
|
|
name="paraphrase-albert-small-v2"
|
|
model = SentenceTransformer(name)
|
|
|
|
# used for both training and querying
|
|
def embed_func(batch):
|
|
return [model.encode(sentence) for sentence in batch]
|
|
```
|
|
|
|
Please note that currently HuggingFace is only supported in the Python SDK.
|
|
|
|
### OpenAI example
|
|
|
|
You can also use an external API like OpenAI to generate embeddings
|
|
|
|
=== "Python"
|
|
```python
|
|
import openai
|
|
import os
|
|
|
|
# Configuring the environment variable OPENAI_API_KEY
|
|
if "OPENAI_API_KEY" not in os.environ:
|
|
# OR set the key here as a variable
|
|
openai.api_key = "sk-..."
|
|
|
|
# verify that the API key is working
|
|
assert len(openai.Model.list()["data"]) > 0
|
|
|
|
def embed_func(c):
|
|
rs = openai.Embedding.create(input=c, engine="text-embedding-ada-002")
|
|
return [record["embedding"] for record in rs["data"]]
|
|
```
|
|
|
|
=== "Javascript"
|
|
```javascript
|
|
const lancedb = require("vectordb");
|
|
|
|
// You need to provide an OpenAI API key
|
|
const apiKey = "sk-..."
|
|
// The embedding function will create embeddings for the 'text' column
|
|
const embedding = new lancedb.OpenAIEmbeddingFunction('text', apiKey)
|
|
```
|
|
|
|
## Applying an embedding function
|
|
|
|
=== "Python"
|
|
Using an embedding function, you can apply it to raw data
|
|
to generate embeddings for each row.
|
|
|
|
Say if you have a pandas DataFrame with a `text` column that you want to be embedded,
|
|
you can use the [with_embeddings](https://lancedb.github.io/lancedb/python/python/#lancedb.embeddings.with_embeddings)
|
|
function to generate embeddings and add create a combined pyarrow table:
|
|
|
|
|
|
```python
|
|
import pandas as pd
|
|
from lancedb.embeddings import with_embeddings
|
|
|
|
df = pd.DataFrame([{"text": "pepperoni"},
|
|
{"text": "pineapple"}])
|
|
data = with_embeddings(embed_func, df)
|
|
|
|
# The output is used to create / append to a table
|
|
# db.create_table("my_table", data=data)
|
|
```
|
|
|
|
If your data is in a different column, you can specify the `column` kwarg to `with_embeddings`.
|
|
|
|
By default, LanceDB calls the function with batches of 1000 rows. This can be configured
|
|
using the `batch_size` parameter to `with_embeddings`.
|
|
|
|
LanceDB automatically wraps the function with retry and rate-limit logic to ensure the OpenAI
|
|
API call is reliable.
|
|
|
|
=== "Javascript"
|
|
Using an embedding function, you can apply it to raw data
|
|
to generate embeddings for each row.
|
|
|
|
You can just pass the embedding function created previously and LanceDB will automatically generate
|
|
embededings for your data.
|
|
|
|
```javascript
|
|
const db = await lancedb.connect("data/sample-lancedb");
|
|
const data = [
|
|
{ text: 'pepperoni' },
|
|
{ text: 'pineapple' }
|
|
]
|
|
|
|
const table = await db.createTable('vectors', data, embedding)
|
|
```
|
|
|
|
|
|
## Searching with an embedding function
|
|
|
|
At inference time, you also need the same embedding function to embed your query text.
|
|
It's important that you use the same model / function otherwise the embedding vectors don't
|
|
belong in the same latent space and your results will be nonsensical.
|
|
|
|
=== "Python"
|
|
```python
|
|
query = "What's the best pizza topping?"
|
|
query_vector = embed_func([query])[0]
|
|
tbl.search(query_vector).limit(10).to_df()
|
|
```
|
|
|
|
The above snippet returns a pandas DataFrame with the 10 closest vectors to the query.
|
|
|
|
=== "Javascript"
|
|
```javascript
|
|
const results = await table
|
|
.search("What's the best pizza topping?")
|
|
.limit(10)
|
|
.execute()
|
|
```
|
|
|
|
The above snippet returns an array of records with the 10 closest vectors to the query.
|
|
|
|
|
|
## Roadmap
|
|
|
|
In the near future, we'll be integrating the embedding functions deeper into LanceDB<br/>.
|
|
The goal is that you just have to configure the function once when you create the table,
|
|
and then you'll never have to deal with embeddings / vectors after that unless you want to.
|
|
We'll also integrate more popular models and APIs.
|