docs: Updates and refactor (#683)
This PR makes incremental changes to the documentation.

* Closes #697
* Closes #698

- [x] Add dark mode
- [x] Fix headers in navbar
- [x] Add `extra.css` to customize navbar styles
- [x] Customize fonts for prose/code blocks, navbar and admonitions
- [x] Inspect all admonition boxes (remove redundant dropdowns) and improve clarity and readability
- [x] Ensure that all images in the docs have a white background (not transparent) so they are viewable in dark mode
- [x] Improve code formatting in code blocks to make it consistent with autoformatters (eslint/ruff)
- [x] Add bolder weight to h1 headers
- [x] Add diagram showing the difference between embedded (OSS) and serverless (Cloud)
- [x] Fix the [Creating an empty table](https://lancedb.github.io/lancedb/guides/tables/#creating-empty-table) section: right now, the subheaders are not clickable
- [x] In critical data ingestion methods like `table.add` (among others), the type signature often does not match the actual code
- [x] Proof-read each documentation section and rewrite as necessary to provide more context, use cases, and explanations so it reads less like reference documentation. This is especially important for the CRUD and search sections since those are so central to the user experience
- [x] The section for [Adding data](https://lancedb.github.io/lancedb/guides/tables/#adding-to-a-table) only shows examples for pandas and iterables. We should include pydantic models, arrow tables, etc.
- [x] Add conceptual tutorial for the IVF-PQ index
- [x] Clearly separate the vector search, FTS and filtering sections so that these are easier to find
- [x] Add docs on refine factor to explain its importance for recall. Closes #716
- [x] Add an FAQ page showing answers to commonly asked questions about LanceDB. Closes #746
- [x] Add a simple polars example to the integrations section. Closes #756 and closes #153
- [ ] Add basic docs for the Rust API (more detailed API docs can come later). Closes #781
- [x] Add a section on the various storage options on local vs. cloud (S3, EBS, EFS, local disk, etc.) and the tradeoffs involved. Closes #782
- [x] Revamp filtering docs: add pre-filtering examples, redo headers, and update content for SQL filters. Closes #783 and closes #784
- [x] Add docs for data management: compaction, cleaning up old versions and incremental indexing. Closes #785
- [ ] Add a benchmark section that also discusses some best practices. Closes #787

---------

Co-authored-by: Ayush Chaurasia <ayush.chaurarsia@gmail.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
This commit is contained in 4d5d748acd, committed by Weston Pace (parent 33ab68c790).
@@ -1,4 +1,5 @@
To use your own custom embedding function, you can follow these two simple steps:

1. Create your embedding function by implementing the `EmbeddingFunction` interface.
2. Register your embedding function in the global `EmbeddingFunctionRegistry`.
@@ -6,13 +7,11 @@ Let us see what this looks like in action.



`EmbeddingFunction` and `EmbeddingFunctionRegistry` handle the low-level details of serializing schema and model information as metadata. To build a custom embedding function, you don't have to worry about the finer details - simply focus on setting up the model and leave the rest to LanceDB.
## `TextEmbeddingFunction` interface

There is another optional layer of abstraction available: `TextEmbeddingFunction`. You can use this abstraction if your model isn't multi-modal in nature and only operates on text. In such cases, both the source and vector fields share the same vectorization pathway, so you just need to set up the model and the rest is handled by `TextEmbeddingFunction`. You can read more about the class and its attributes in the class reference.

Let's implement the `SentenceTransformerEmbeddings` class. All you need to do is implement the `generate_embeddings()` and `ndims()` methods to handle the input types you expect, and register the class in the global `EmbeddingFunctionRegistry`:
@@ -39,7 +38,6 @@ class SentenceTransformerEmbeddings(TextEmbeddingFunction):

```python
    # tail of the class definition shown in this hunk
    @cached(cache={})
    def _embedding_model(self):
        return sentence_transformers.SentenceTransformer(self.name)
```

This is a stripped-down version of our implementation of `SentenceTransformerEmbeddings` that removes certain optimizations and default settings.
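For reference, here is a minimal end-to-end sketch of the two steps above: implement `generate_embeddings()` and `ndims()`, then register the class. The registry alias `"my-sentence-transformers"` is an illustrative assumption (the library ships its own sentence-transformers implementation), as is using the registry's `register` method as a decorator; check the class reference for the exact registration API.

```python
import sentence_transformers
from cachetools import cached
from lancedb.embeddings import EmbeddingFunctionRegistry, TextEmbeddingFunction

registry = EmbeddingFunctionRegistry.get_instance()


@registry.register("my-sentence-transformers")  # illustrative alias, not a shipped default
class MySentenceTransformerEmbeddings(TextEmbeddingFunction):
    name: str = "all-MiniLM-L6-v2"

    def generate_embeddings(self, texts):
        # called for both source rows (at ingestion time) and queries
        return self._embedding_model().encode(list(texts)).tolist()

    def ndims(self):
        # dimensionality of the vectors this model produces
        return len(self.generate_embeddings(["probe"])[0])

    @cached(cache={})
    def _embedding_model(self):
        return sentence_transformers.SentenceTransformer(self.name)
```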
@@ -1,18 +1,19 @@
There are various embedding functions available out of the box with LanceDB to manage your embeddings implicitly. We're actively working on adding other popular embedding APIs and models.

## Text embedding functions
This section contains the text embedding functions registered by default.

* Embedding functions have an inbuilt rate limit handler wrapper for source and query embedding function calls that retries with exponential backoff.
* Each `EmbeddingFunction` implementation automatically takes `max_retries` as an argument, with a default value of 7.

### Sentence transformers
Allows you to set parameters when registering a `sentence-transformers` object.
| Parameter | Type | Default Value | Description |
|---|---|---|---|
| `name` | `str` | `all-MiniLM-L6-v2` | The name of the model |
| `device` | `str` | `cpu` | The device to run the model on (can be `cpu` or `gpu`) |
| `normalize` | `bool` | `True` | Whether to normalize the input text before feeding it to the model |

@@ -37,15 +38,14 @@

```python
# ... (the start of this example is elided in the diff)
actual = table.search(query).limit(1).to_pydantic(Words)[0]
print(actual.text)
```
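Since the example above is elided by the diff, here is a hedged sketch of the full flow. The `Words` model and the final two lines follow the surrounding context; the exact `create()` call and the `ndims()` usage are assumptions based on the parameter table above.

```python
import lancedb
from lancedb.embeddings import EmbeddingFunctionRegistry
from lancedb.pydantic import LanceModel, Vector

db = lancedb.connect("/tmp/db")
registry = EmbeddingFunctionRegistry.get_instance()
func = registry.get("sentence-transformers").create(device="cpu")


class Words(LanceModel):
    text: str = func.SourceField()  # embedded automatically at ingestion time
    vector: Vector(func.ndims()) = func.VectorField()  # embedded automatically at query time


table = db.create_table("words", schema=Words)
table.add([{"text": "hello world"}, {"text": "goodbye world"}])

query = "greetings"
actual = table.search(query).limit(1).to_pydantic(Words)[0]
print(actual.text)
```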
### OpenAI embeddings
LanceDB registers the OpenAI embeddings function in the registry by default, as `openai`. Below are the parameters that you can customize when creating instances:

| Parameter | Type | Default Value | Description |
|---|---|---|---|
| `name` | `str` | `"text-embedding-ada-002"` | The name of the model. |

```python
db = lancedb.connect("/tmp/db")
registry = EmbeddingFunctionRegistry.get_instance()
# ... (the rest of this example is elided in the diff)
print(actual.text)
```
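The middle of the example is elided above; as a hedged sketch, the pattern mirrors the sentence-transformers example, swapping in the `openai` registry key:

```python
from lancedb.pydantic import LanceModel, Vector

# requires OPENAI_API_KEY to be set in the environment
func = registry.get("openai").create()


class Words(LanceModel):
    text: str = func.SourceField()
    vector: Vector(func.ndims()) = func.VectorField()
```

Ingestion and querying then work exactly as in the sentence-transformers sketch above.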
### Instructor Embeddings
[Instructor](https://instructor-embedding.github.io/) is an instruction-finetuned text embedding model that can generate text embeddings tailored to any task (e.g. classification, retrieval, clustering, text evaluation) and domain (e.g. science, finance) by simply providing the task instruction, without any finetuning.

If you want to calculate customized embeddings for specific sentences, you can follow the unified template to write instructions.

!!! info
    Represent the `domain` `text_type` for `task_objective`:

    * `domain` is optional, and it specifies the domain of the text, e.g. science, finance, medicine, etc.
    * `text_type` is required, and it specifies the encoding unit, e.g. sentence, document, paragraph, etc.
    * `task_objective` is optional, and it specifies the objective of embedding, e.g. retrieve a document, classify the sentence, etc.

More information about the model can be found at the [source URL](https://github.com/xlang-ai/instructor-embedding).

| Argument | Type | Default | Description |
|---|---|---|---|
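As a rough sketch of how an instruction-following model plugs into the same registry pattern (the `instructor` registry key, the instruction parameter names, and the `Docs` model are assumptions for illustration, not confirmed by this diff):

```python
from lancedb.embeddings import EmbeddingFunctionRegistry
from lancedb.pydantic import LanceModel, Vector

registry = EmbeddingFunctionRegistry.get_instance()

# Instructions follow the template described above:
# "Represent the <domain> <text_type> for <task_objective>:"
instructor = registry.get("instructor").create(
    source_instruction="represent the document for retrieval",
    query_instruction="represent the question for retrieving supporting documents",
)


class Docs(LanceModel):
    text: str = instructor.SourceField()
    vector: Vector(instructor.ndims()) = instructor.VectorField()
```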
@@ -157,9 +158,8 @@ rs = tbl.search("hello").limit(1).to_pandas()
## Multi-modal embedding functions
Multi-modal embedding functions allow you to query your table using both images and text.

### OpenClip embeddings
We support CLIP model embeddings using the open source alternative [open-clip](https://github.com/mlfoundations/open_clip). It is registered as `open-clip` and supports the following customizations:

| Parameter | Type | Default Value | Description |
|---|---|---|---|
@@ -169,11 +169,10 @@
| `batch_size` | `int` | `64` | The number of images to process in a batch. |
| `normalize` | `bool` | `True` | Whether to normalize the input images before feeding them to the model. |

This embedding function supports ingesting images as both bytes and URLs. You can query them using both text and other images.

!!! info
    LanceDB supports ingesting images directly from accessible links.

```python
# ... (the body of this example is elided in the diff)
print(actual.label)
```
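As a hedged sketch of the elided example (the `Images` model and its fields are illustrative assumptions; only the `open-clip` key and the `label` access are confirmed by the surrounding context):

```python
from lancedb.embeddings import EmbeddingFunctionRegistry
from lancedb.pydantic import LanceModel, Vector

registry = EmbeddingFunctionRegistry.get_instance()
func = registry.get("open-clip").create()


class Images(LanceModel):
    label: str
    image_uri: str = func.SourceField()  # image sources can be local paths or accessible links
    vector: Vector(func.ndims()) = func.VectorField()


# after ingesting rows, you can query with text (or with another image):
# actual = table.search("man's best friend").limit(1).to_pydantic(Images)[0]
# print(actual.label)
```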

If you have any questions about the embeddings API, supported models, or see a relevant model missing, please raise an issue [on GitHub](https://github.com/lancedb/lancedb/issues).
docs/src/embeddings/embedding_explicit.md (new file, 141 lines)
@@ -0,0 +1,141 @@
In this workflow, you define your own embedding function and pass it as a callable to LanceDB, invoking it in your code to generate the embeddings. Let's look at some examples.

### Hugging Face

!!! note
    Currently, the Hugging Face method is only supported in the Python SDK.

=== "Python"
    The most popular open source option is to use the [sentence-transformers](https://www.sbert.net/)
    library, which can be installed via pip.

    ```bash
    pip install sentence-transformers
    ```

    The example below shows how to use the `paraphrase-albert-small-v2` model to generate embeddings
    for a given document.

    ```python
    from sentence_transformers import SentenceTransformer

    name = "paraphrase-albert-small-v2"
    model = SentenceTransformer(name)

    # used for both ingestion and querying
    def embed_func(batch):
        return [model.encode(sentence) for sentence in batch]
    ```
### OpenAI

Another popular alternative is to use an external API like OpenAI's [embeddings API](https://platform.openai.com/docs/guides/embeddings/what-are-embeddings).

=== "Python"
    ```python
    import os

    import openai

    # Configure the OPENAI_API_KEY environment variable,
    # or set the key here as a variable
    if "OPENAI_API_KEY" not in os.environ:
        openai.api_key = "sk-..."

    # verify that the API key is working
    assert len(openai.Model.list()["data"]) > 0

    def embed_func(c):
        rs = openai.Embedding.create(input=c, engine="text-embedding-ada-002")
        return [record["embedding"] for record in rs["data"]]
    ```

=== "JavaScript"
    ```javascript
    const lancedb = require("vectordb");

    // You need to provide an OpenAI API key
    const apiKey = "sk-...";
    // The embedding function will create embeddings for the 'text' column
    const embedding = new lancedb.OpenAIEmbeddingFunction("text", apiKey);
    ```
## Applying an embedding function to data

=== "Python"
    You can apply an embedding function to raw data
    to generate embeddings for each record.

    Say you have a pandas DataFrame with a `text` column that you want embedded. You
    can use the `with_embeddings` function to generate embeddings and add them to
    an existing table.

    ```python
    import pandas as pd

    from lancedb.embeddings import with_embeddings

    df = pd.DataFrame(
        [
            {"text": "pepperoni"},
            {"text": "pineapple"}
        ]
    )
    data = with_embeddings(embed_func, df)

    # The output is used to create / append to a table
    # db.create_table("my_table", data=data)
    ```

    If your data is in a different column, you can specify the `column` kwarg to `with_embeddings`, as shown in the sketch below.

    By default, LanceDB calls the function with batches of 1000 rows. This can be configured
    using the `batch_size` parameter to `with_embeddings`.

    LanceDB automatically wraps the function with retry and rate-limit logic to ensure the OpenAI
    API call is reliable.
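    For instance, a minimal sketch, assuming your text lives in a hypothetical `description` column instead of `text`:

    ```python
    df = pd.DataFrame([{"description": "cheese"}, {"description": "mushrooms"}])

    # "description" is an illustrative column name
    data = with_embeddings(embed_func, df, column="description")
    ```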
=== "JavaScript"
|
||||
Using an embedding function, you can apply it to raw data
|
||||
to generate embeddings for each record.
|
||||
|
||||
Simply pass the embedding function created above and LanceDB will use it to generate
|
||||
embeddings for your data.
|
||||
|
||||
```javascript
|
||||
const db = await lancedb.connect("data/sample-lancedb");
|
||||
const data = [
|
||||
{ text: "pepperoni"},
|
||||
{ text: "pineapple"}
|
||||
]
|
||||
|
||||
const table = await db.createTable("vectors", data, embedding)
|
||||
```
|
||||
|
||||
## Querying using an embedding function

!!! warning
    At query time, you **must** use the same embedding function you used to vectorize your data.
    If you use a different embedding function, the embeddings will not reside in the same vector
    space and the results will be nonsensical.

=== "Python"
    ```python
    query = "What's the best pizza topping?"
    query_vector = embed_func([query])[0]
    results = (
        tbl.search(query_vector)
        .limit(10)
        .to_pandas()
    )
    ```

    The above snippet returns a pandas DataFrame with the 10 closest vectors to the query.

=== "JavaScript"
    ```javascript
    const results = await table
      .search("What's the best pizza topping?")
      .limit(10)
      .execute();
    ```

    The above snippet returns an array of records with the top 10 nearest neighbors to the query.
@@ -1,71 +1,70 @@
Representing multi-modal data as vector embeddings is becoming a standard practice. Embedding functions can themselves be thought of as a key part of the data processing pipeline that each request has to be passed through. The assumption here is that after initial setup, these components and the underlying methodology are not expected to change for a particular project.

For this purpose, LanceDB introduces an **embedding functions API** that allows you to set up your embedding function once, during the configuration stage of your project. After this, the table remembers it, effectively making the embedding functions *disappear in the background* so you don't have to worry about manually passing callables, and can instead simply focus on the rest of your data engineering pipeline.

!!! warning
    Using the implicit embeddings management approach means that you can forget about manually passing around embedding
    functions in your code, as long as you don't intend to change them at a later time. If your embedding function changes,
    you'll have to re-configure your table with the new embedding function and regenerate the embeddings.
## 1. Define the embedding function
We have some pre-defined embedding functions in the global registry, with more coming soon. Here's an implementation of CLIP as an example.

```python
from lancedb.embeddings import EmbeddingFunctionRegistry

registry = EmbeddingFunctionRegistry.get_instance()
clip = registry.get("open-clip").create()
```

You can also define your own embedding function by implementing the `EmbeddingFunction` abstract base interface. It subclasses Pydantic's `BaseModel`, which can be utilized to write complex schemas simply, as we'll see next!
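Before moving on, here is a rough sketch of the shape of such a subclass. The `compute_query_embeddings` / `compute_source_embeddings` method names and the decorator-style registration are assumptions about the abstract interface; check the class reference before relying on them.

```python
from lancedb.embeddings import EmbeddingFunction, EmbeddingFunctionRegistry

registry = EmbeddingFunctionRegistry.get_instance()


@registry.register("my-embeddings")  # hypothetical registry key
class MyEmbeddings(EmbeddingFunction):
    # Pydantic fields double as configuration that gets serialized
    # into the table's metadata.
    name: str = "my-model"

    def ndims(self):
        return 512  # dimensionality of the produced vectors

    def compute_query_embeddings(self, query, *args, **kwargs):
        # assumption: queries and source rows may need different handling
        return [self._embed(query)]

    def compute_source_embeddings(self, inputs, *args, **kwargs):
        return [self._embed(x) for x in inputs]

    def _embed(self, item):
        raise NotImplementedError("call your model here")
```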
## 2. Define the data model or schema
The embedding function defined above abstracts away all the details about the models and dimensions required to define the schema. You can simply set a field as the **source** or **vector** column. Here's how:

```python
from lancedb.pydantic import LanceModel, Vector

class Pets(LanceModel):
    vector: Vector(clip.ndims) = clip.VectorField()
    image_uri: str = clip.SourceField()
```

`VectorField` tells LanceDB to use the clip embedding function to generate query embeddings for the `vector` column, and `SourceField` ensures that when adding data, we automatically use the specified embedding function to encode `image_uri`.
## 3. Create LanceDB table
Now that we have chosen/defined our embedding function and the schema, we can create the table:

```python
import lancedb

db = lancedb.connect("~/lancedb")
table = db.create_table("pets", schema=Pets)
```

That's it! We've provided all the information needed to embed the source and query inputs. We can now forget about the model and dimension details and start building our VectorDB pipeline.
## 4. Ingest lots of data and query your table
Any new or incoming data can just be added, and it'll be vectorized automatically:

```python
# uris is a list of image paths or accessible links
table.add([{"image_uri": u} for u in uris])
```

Our OpenCLIP query embedding function supports querying via both text and images:

```python
result = table.search("dog")
```

Let's query an image:

```python
from pathlib import Path

from PIL import Image

p = Path("path/to/images/samoyed_100.jpg")
query_image = Image.open(p)
table.search(query_image)
```
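Either search returns a query builder. As a minimal sketch of materializing the hits (assuming the `Pets` schema above):

```python
# top 5 matches as a pandas DataFrame
df = table.search("dog").limit(5).to_pandas()

# or as Pets models, with image_uri and vector fields populated
pets = table.search(query_image).limit(5).to_pydantic(Pets)
```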
---

## Rate limit handling
The `EmbeddingFunction` class wraps the calls for source and query embedding generation inside a rate limit handler that retries the requests with exponential backoff after successive failures. By default, the maximum number of retries is set to 7. You can tune it by setting it to a different number, or disable it by setting it to 0.

An example of how to do this is shown below:

```python
clip = registry.get("open-clip").create()  # defaults to 7 max retries
clip = registry.get("open-clip").create(max_retries=10)  # increase max retries to 10
clip = registry.get("open-clip").create(max_retries=0)  # retries disabled
```

!!! note
    Embedding functions can also fail due to other errors that have nothing to do with rate limits.
    This is why the error is also logged.
## Some fun with Pydantic

LanceDB is integrated with Pydantic, which was used in the example above to define the schema in Python. It's also used behind the scenes by the embedding function API to ingest useful information as table metadata.

You can also use the integration for adding utility operations in the schema. For example, in our multi-modal example, you can search images using text or another image. Let's define a utility function to plot the image.

```python
from PIL import Image

from lancedb.pydantic import LanceModel, Vector

class Pets(LanceModel):
    vector: Vector(clip.ndims) = clip.VectorField()
    image_uri: str = clip.SourceField()

    @property  # assumption: exposed as a property, since it's accessed below as rs[2].image
    def image(self):
        return Image.open(self.image_uri)
```

Now, you can convert your search results to a Pydantic model and use this property.
```python
rs = table.search(query_image).limit(3).to_pydantic(Pets)
rs[2].image
```



Now that you have the basic idea about implicit management via embedding functions, let's dive deeper into the [custom API](./api.md) that you can use to implement your own embedding functions.
@@ -1,149 +1,8 @@
# Embedding
Due to the nature of vector embeddings, they can be used to represent any kind of data, from text to images to audio. This makes them a very powerful tool for machine learning practitioners. However, there's no one-size-fits-all solution for generating embeddings - there are many different libraries and APIs (both commercial and open source) that can be used to generate embeddings from structured/unstructured data.

LanceDB supports two methods of vectorizing your raw data into embeddings.



1. **Explicit**: Manually call LanceDB's `with_embeddings` function to vectorize your data via an `embed_func` of your choice.
2. **Implicit**: Allow LanceDB to embed the data and queries in the background as they come in, using the table's `EmbeddingFunctionRegistry` information.
See the [explicit](embedding_explicit.md) and [implicit](embedding_functions.md) embedding sections for more details.