# Embedding Functions

Embeddings are high-dimensional floating-point vector representations of your data or query. Anything can be embedded using some embedding model or function. For a given embedding function, the output will always have the same number of dimensions.

## Creating an embedding function

Any function that takes as input a batch (list) of data and outputs a batch (list) of embeddings can be used by LanceDB as an embedding function. The input and output batch sizes should be the same.

### HuggingFace example

One popular free option is to use the [sentence-transformers](https://www.sbert.net/) library from HuggingFace. You can install it using pip: `pip install sentence-transformers`.

```python
from sentence_transformers import SentenceTransformer

name = "paraphrase-albert-small-v2"
model = SentenceTransformer(name)

# used for both ingestion and querying
def embed_func(batch):
    return [model.encode(sentence) for sentence in batch]
```

### OpenAI example

You can also use an external API like OpenAI to generate embeddings.

```python
import openai
import os

# Configure the OPENAI_API_KEY environment variable,
# or set the key here as a variable
if "OPENAI_API_KEY" not in os.environ:
    openai.api_key = "sk-..."

# verify that the API key is working
assert len(openai.Model.list()["data"]) > 0

def embed_func(c):
    rs = openai.Embedding.create(input=c, engine="text-embedding-ada-002")
    return [record["embedding"] for record in rs["data"]]
```

## Applying an embedding function

You can apply an embedding function to raw data to generate an embedding for each row. Say you have a pandas DataFrame with a `text` column that you want embedded. You can use the [with_embeddings](https://lancedb.github.io/lancedb/python/#lancedb.embeddings.with_embeddings) function to generate embeddings and create a combined pyarrow table:

```python
import pandas as pd
from lancedb.embeddings import with_embeddings

df = pd.DataFrame([{"text": "pepperoni"}, {"text": "pineapple"}])
data = with_embeddings(embed_func, df)

# The output is used to create / append to a table
# db.create_table("my_table", data=data)
```

If your data is in a different column, you can specify the `column` kwarg to `with_embeddings`. By default, LanceDB calls the function with batches of 1000 rows. This can be configured using the `batch_size` parameter to `with_embeddings`. LanceDB automatically wraps the function with retry and rate-limit logic to ensure that calls to external APIs like OpenAI are reliable.
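For example, suppose your text lives in a column named `description` (a hypothetical column name used here only for illustration) and you want smaller batches to stay within an API rate limit. A minimal sketch using the `column` and `batch_size` parameters described above might look like this:

```python
import pandas as pd
from lancedb.embeddings import with_embeddings

# hypothetical data where the text is stored in a "description" column
df = pd.DataFrame([{"description": "pepperoni"}, {"description": "pineapple"}])

# embed the "description" column, sending 100 rows to embed_func per call
# instead of the default 1000
data = with_embeddings(embed_func, df, column="description", batch_size=100)
```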
## Searching with an embedding function

At inference time, you also need the same embedding function to embed your query text. It's important that you use the same model / function; otherwise the embedding vectors won't live in the same latent space and your results will be nonsensical.

```python
query = "What's the best pizza topping?"
query_vector = embed_func([query])[0]
tbl.search(query_vector).limit(10).to_df()
```

The above snippet returns a pandas DataFrame with the 10 closest vectors to the query.

## Roadmap

In the near future, we'll be integrating embedding functions more deeply into LanceDB. The goal is that you only have to configure the function once when you create the table, and then you'll never have to deal with embeddings / vectors after that unless you want to. We'll also integrate more popular models and APIs.