This PR makes incremental changes to the documentation.

* Closes #697
* Closes #698

## Chores

- [x] Add dark mode
- [x] Fix headers in navbar
- [x] Add `extra.css` to customize navbar styles
- [x] Customize fonts for prose/code blocks, navbar and admonitions
- [x] Inspect all admonition boxes (remove redundant dropdowns) and improve clarity and readability
- [x] Ensure that all images in the docs have a white background (not transparent) so they are viewable in dark mode
- [x] Improve code formatting in code blocks to make it consistent with autoformatters (eslint/ruff)
- [x] Add bolder weight to h1 headers
- [x] Add diagram showing the difference between embedded (OSS) and serverless (Cloud)
- [x] Fix the [Creating an empty table](https://lancedb.github.io/lancedb/guides/tables/#creating-empty-table) section: right now, the subheaders are not clickable
- [x] Fix type signatures in critical data ingestion methods like `table.add` (among others), where the signature often does not match the actual code
- [x] Proofread each documentation section and rewrite as necessary to provide more context, use cases, and explanations so it reads less like reference documentation. This is especially important for the CRUD and search sections, since those are so central to the user experience.

## Restructure/new content

- [x] The [Adding data](https://lancedb.github.io/lancedb/guides/tables/#adding-to-a-table) section only shows examples for pandas and iterables. We should include pydantic models, Arrow tables, etc.
- [x] Add conceptual tutorial for the IVF-PQ index
- [x] Clearly separate the vector search, FTS and filtering sections so that these are easier to find
- [x] Add docs on refine factor to explain its importance for recall. Closes #716
- [x] Add an FAQ page answering commonly asked questions about LanceDB. Closes #746
- [x] Add a simple polars example to the integrations section. Closes #756 and closes #153
- [ ] Add basic docs for the Rust API (more detailed API docs can come later). Closes #781
- [x] Add a section on the various storage options on local vs. cloud (S3, EBS, EFS, local disk, etc.) and the tradeoffs involved. Closes #782
- [x] Revamp filtering docs: add pre-filtering examples, redo headers and update content for SQL filters. Closes #783 and closes #784
- [x] Add docs for data management: compaction, cleaning up old versions and incremental indexing. Closes #785
- [ ] Add a benchmark section that also discusses some best practices. Closes #787

---------

Co-authored-by: Ayush Chaurasia <ayush.chaurarsia@gmail.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
In this workflow, you define your own embedding function and pass it as a callable to LanceDB, invoking it in your code to generate the embeddings. Let's look at some examples.
## Hugging Face
!!! note
    Currently, the Hugging Face method is only supported in the Python SDK.
=== "Python" The most popular open source option is to use the sentence-transformers library, which can be installed via pip.
    ```bash
    pip install sentence-transformers
    ```
    The example below shows how to use the `paraphrase-albert-small-v2` model to generate embeddings
    for a given document.
    ```python
    from sentence_transformers import SentenceTransformer

    name = "paraphrase-albert-small-v2"
    model = SentenceTransformer(name)

    # used for both ingestion and querying
    def embed_func(batch):
        return [model.encode(sentence) for sentence in batch]
    ```
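    As a quick sanity check, you can call `embed_func` on a small batch and confirm that each
    item comes back as a fixed-length vector (a minimal sketch; the sample strings below are
    just placeholders):

    ```python
    # hypothetical sample batch
    vectors = embed_func(["pepperoni", "pineapple"])
    print(len(vectors), len(vectors[0]))  # 2 vectors, each of the model's embedding dimension
    ```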
## OpenAI
Another popular alternative is to use an external API like OpenAI's embeddings API.
=== "Python" ```python import openai import os
# Configuring the environment variable OPENAI_API_KEY
if "OPENAI_API_KEY" not in os.environ:
# OR set the key here as a variable
openai.api_key = "sk-..."
# verify that the API key is working
assert len(openai.Model.list()["data"]) > 0
def embed_func(c):
rs = openai.Embedding.create(input=c, engine="text-embedding-ada-002")
return [record["embedding"] for record in rs["data"]]
```
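    Before ingesting any data, it's worth verifying the function end to end. A minimal sketch,
    assuming the API key is configured (the input string is a placeholder):

    ```python
    # embed a single string; text-embedding-ada-002 returns 1536-dimensional vectors
    [vector] = embed_func(["pepperoni"])
    print(len(vector))  # 1536
    ```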
=== "JavaScript" ```javascript const lancedb = require("vectordb");
// You need to provide an OpenAI API key
const apiKey = "sk-..."
// The embedding function will create embeddings for the 'text' column
const embedding = new lancedb.OpenAIEmbeddingFunction('text', apiKey)
```
## Applying an embedding function to data
=== "Python" Using an embedding function, you can apply it to raw data to generate embeddings for each record.
Say you have a pandas DataFrame with a `text` column that you want embedded,
you can use the `with_embeddings` function to generate embeddings and add them to
an existing table.
    ```python
    import pandas as pd

    from lancedb.embeddings import with_embeddings

    df = pd.DataFrame(
        [
            {"text": "pepperoni"},
            {"text": "pineapple"},
        ]
    )
    data = with_embeddings(embed_func, df)

    # The output is used to create / append to a table
    # db.create_table("my_table", data=data)
    ```
    If your data is in a different column, you can specify the `column` kwarg to
    `with_embeddings`.

    By default, LanceDB calls the function with batches of 1000 rows. This can be configured
    using the `batch_size` parameter to `with_embeddings`.

    LanceDB automatically wraps the function with retry and rate-limit logic to ensure that
    the OpenAI API call is reliable.
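    For example, if your text lives in a column named `body` (a hypothetical column name) and
    you want smaller batches, the call would look something like this sketch:

    ```python
    # embed the `body` column, 500 rows at a time
    data = with_embeddings(embed_func, df, column="body", batch_size=500)
    ```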
=== "JavaScript" Using an embedding function, you can apply it to raw data to generate embeddings for each record.
Simply pass the embedding function created above and LanceDB will use it to generate
embeddings for your data.
    ```javascript
    const db = await lancedb.connect("data/sample-lancedb");
    const data = [
        { text: "pepperoni" },
        { text: "pineapple" },
    ];
    const table = await db.createTable("vectors", data, embedding);
    ```
## Querying using an embedding function
!!! warning
    At query time, you must use the same embedding function you used to vectorize your data.
    If you use a different embedding function, the embeddings will not reside in the same
    vector space and the results will be nonsensical.
=== "Python"
    ```python
    query = "What's the best pizza topping?"
    query_vector = embed_func([query])[0]
    results = (
        tbl.search(query_vector)
        .limit(10)
        .to_pandas()
    )
    ```
    The above snippet returns a pandas DataFrame with the 10 closest vectors to the query.
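    Each row of the result contains the original columns alongside a `_distance` score, so
    you can inspect the matches directly (assuming the `text` schema from the example above):

    ```python
    # show the closest matches and how far each is from the query vector
    print(results[["text", "_distance"]].head())
    ```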
=== "JavaScript"
    ```javascript
    const results = await table
        .search("What's the best pizza topping?")
        .limit(10)
        .execute();
    ```
    The above snippet returns an array of records with the top 10 nearest neighbors to the query.