From 08e67d04bb99c54a95e5c17a48d8088410034180 Mon Sep 17 00:00:00 2001
From: Chang She <759245+changhiskhan@users.noreply.github.com>
Date: Wed, 19 Apr 2023 15:33:19 -0700
Subject: [PATCH 1/2] [DOC] embedding function documentation

---
 docs/mkdocs.yml       |  4 +-
 docs/src/embedding.md | 90 +++++++++++++++++++++++++++++++++++++++++++
 docs/src/index.md     |  7 ++++
 3 files changed, 100 insertions(+), 1 deletion(-)
 create mode 100644 docs/src/embedding.md

diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml
index fa855f487..d7e968394 100644
--- a/docs/mkdocs.yml
+++ b/docs/mkdocs.yml
@@ -13,6 +13,8 @@ plugins:
 
 nav:
 - Home: index.md
+- Basics: basics.md
+- Embeddings: embeddings.md
 - Integrations: integrations.md
 - Python API: python.md
 
@@ -23,4 +25,4 @@ markdown_extensions:
     pygments_lang_class: true
 - pymdownx.inlinehilite
 - pymdownx.snippets
-- pymdownx.superfences
\ No newline at end of file
+- pymdownx.superfences
diff --git a/docs/src/embedding.md b/docs/src/embedding.md
new file mode 100644
index 000000000..b90a8787b
--- /dev/null
+++ b/docs/src/embedding.md
@@ -0,0 +1,90 @@
+# Embedding Functions
+
+Embeddings are high dimensional floating-point vector representations of your data or query.
+Anything can be embedded using some embedding model or function.
+For a given embedding function, the output will always have the same number of dimensions.
+
+## Creating an embedding function
+
+Any function that takes as input a batch (list) of data and outputs a batch (list) of embeddings
+can be used by LanceDB as an embedding function.
+
+### HuggingFace example
+
+One popular free option would be to use the sentence-transformers library from HuggingFace.
+
+```python
+from sentence_transformers import SentenceTransformer
+
+name="paraphrase-albert-small-v2"
+model = SentenceTransformer(name)
+
+# used for both training and querying
+def embed_func(batch):
+    return [model.encode(sentence) for sentence in batch]
+```
+
+### OpenAI example
+
+You can also use an external API like OpenAI to generate embeddings
+
+```python
+import openai
+import os
+
+# Configuring the environment variable OPENAI_API_KEY
+if "OPENAI_API_KEY" not in os.environ:
+    # OR set the key here as a variable
+    openai.api_key = "sk-..."
+
+# verify that the API key is working
+assert len(openai.Model.list()["data"]) > 0
+
+def embed_func(c):
+    rs = openai.Embedding.create(input=c, engine="text-embedding-ada-002")
+    return [record["embedding"] for record in rs["data"]]
+```
+
+## Applying an embedding function
+
+Using an embedding function, you can apply it to raw data
+to generate embeddings for each row.
+
+Say if you have a pandas DataFrame with a `text` column that you want to be embedded,
+you can use the following code to generate embeddings and add create a combined
+pyarrow table:
+
+```python
+from lancedb.embeddings import with_embeddings
+
+data = with_embeddings(embed_func, df)
+
+# The output is used to create / append to a table
+# db.create_table("my_table", data=data)
+```
+
+By default, LanceDB calls the function with batches of 1000 rows. This can be configured
+using the `batch_size` parameter to `with_embeddings`.
+
+LanceDB automatically wraps the function with retry and rate-limit logic to ensure the OpenAI
+API call is reliable.
+
+## Searching with an embedding function
+
+At inference time, you also need the same embedding function to embed your query text.
+It's important that you use the same model / function otherwise the embedding vectors don't
+belong in the same latent space and your results will be nonsensical.
+
+```python
+query_vector = embed_func([query])[0]
+tbl.search(query_vector).limit(10).to_df()
+```
+
+The above snippet returns a pandas DataFrame with the 10 closest vectors to the query.
+
+## Roadmap
+
+In the near future, we'll be integrating the embedding functions deeper into LanceDB.
+The goal is that you just have to configure the function once when you create the table,
+and then you'll never have to deal with embeddings / vectors after that unless you want to.
+We'll also integrate more popular models and APIs.
diff --git a/docs/src/index.md b/docs/src/index.md
index 02fce3682..5a6ea068f 100644
--- a/docs/src/index.md
+++ b/docs/src/index.md
@@ -33,7 +33,14 @@ table = db.create_table("my_table",
 result = table.search([100, 100]).limit(2).to_df()
 ```
 
+## Complete Demos
+
+We will be adding completed demo apps built using LanceDB.
+- [YouTube Transcript Search](../notebooks/youtube_transcript_search.ipynb)
+
 ## Documentation Quick Links
 
 * [`Basic Operations`](basic.md) - basic functionality of LanceDB.
+* [`Embedding Functions`](embedding.md) - functions for working with embeddings.
+* [`Ecosystem Integrations`](integrations.md) - integrating LanceDB with python data tooling ecosystem.
 * [`API Reference`](python.md) - detailed documentation for the LanceDB Python SDK.

From 85dda5377919b1a418d69454425d7eab7b219c50 Mon Sep 17 00:00:00 2001
From: Chang She <759245+changhiskhan@users.noreply.github.com>
Date: Wed, 19 Apr 2023 16:35:48 -0700
Subject: [PATCH 2/2] review comments

---
 docs/src/embedding.md | 15 +++++++++++----
 docs/src/python.md    |  2 ++
 2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/docs/src/embedding.md b/docs/src/embedding.md
index b90a8787b..410965148 100644
--- a/docs/src/embedding.md
+++ b/docs/src/embedding.md
@@ -7,11 +7,12 @@ For a given embedding function, the output will always have the same number of d
 ## Creating an embedding function
 
 Any function that takes as input a batch (list) of data and outputs a batch (list) of embeddings
-can be used by LanceDB as an embedding function.
+can be used by LanceDB as an embedding function. The input and output batch sizes should be the same.
 
 ### HuggingFace example
 
-One popular free option would be to use the sentence-transformers library from HuggingFace.
+One popular free option would be to use the [sentence-transformers](https://www.sbert.net/) library from HuggingFace.
+You can install this using pip: `pip install sentence-transformers`.
 
 ```python
 from sentence_transformers import SentenceTransformer
@@ -51,18 +52,23 @@ Using an embedding function, you can apply it to raw data
 to generate embeddings for each row.
 
 Say if you have a pandas DataFrame with a `text` column that you want to be embedded,
-you can use the following code to generate embeddings and add create a combined
-pyarrow table:
+you can use the [with_embeddings](https://lancedb.github.io/lancedb/python/#lancedb.embeddings.with_embeddings)
+function to generate embeddings and add create a combined pyarrow table:
 
 ```python
+import pandas as pd
 from lancedb.embeddings import with_embeddings
 
+df = pd.DataFrame([{"text": "pepperoni"},
+                   {"text": "pineapple"}])
 data = with_embeddings(embed_func, df)
 
 # The output is used to create / append to a table
 # db.create_table("my_table", data=data)
 ```
 
+If your data is in a different column, you can specify the `column` kwarg to `with_embeddings`.
+
 By default, LanceDB calls the function with batches of 1000 rows. This can be configured
 using the `batch_size` parameter to `with_embeddings`.
 
@@ -76,6 +82,7 @@ It's important that you use the same model / function otherwise the embedding ve
 belong in the same latent space and your results will be nonsensical.
 
 ```python
+query = "What's the best pizza topping?"
 query_vector = embed_func([query])[0]
 tbl.search(query_vector).limit(10).to_df()
 ```
diff --git a/docs/src/python.md b/docs/src/python.md
index c04c45a46..e4dbf2217 100644
--- a/docs/src/python.md
+++ b/docs/src/python.md
@@ -10,3 +10,5 @@ pip install lancedb
 ::: lancedb.db
 ::: lancedb.table
 ::: lancedb.query
+::: lancedb.embeddings
+::: lancedb.context
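Note on the embedding function contract: the documentation added in this series states that an embedding function takes a batch (list) and returns a batch of the same size, and that every output has the same number of dimensions. A minimal sketch of how that contract could be checked before handing a function to `with_embeddings` follows. Everything in it is an assumption made for illustration: `toy_embed_func`, `check_embed_func`, and `DIM` are hypothetical names that are not part of LanceDB, and the hash-based vectors are a stand-in for a real model, not a meaningful embedding.

```python
import hashlib
from typing import Callable, List, Sequence

DIM = 8  # hypothetical embedding width, chosen only for this sketch


def toy_embed_func(batch: Sequence[str]) -> List[List[float]]:
    """Stand-in embedding function: one fixed-width vector per input string."""
    vectors = []
    for text in batch:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        # Scale the first DIM bytes of the digest into floats in [0, 1].
        vectors.append([byte / 255.0 for byte in digest[:DIM]])
    return vectors


def check_embed_func(
    func: Callable[[Sequence[str]], List[List[float]]],
    sample: Sequence[str],
) -> int:
    """Check the batch contract on a sample and return the embedding width.

    Raises ValueError if the output batch size differs from the input batch
    size, or if the output vectors do not all share one width (an empty
    sample also fails, because no width can be determined from it).
    """
    out = func(list(sample))
    if len(out) != len(sample):
        raise ValueError(f"expected {len(sample)} embeddings, got {len(out)}")
    widths = {len(vector) for vector in out}
    if len(widths) != 1:
        raise ValueError(f"embeddings have inconsistent widths: {sorted(widths)}")
    return widths.pop()


# Hypothetical usage: validate the function on a small sample first.
dim = check_embed_func(toy_embed_func, ["pepperoni", "pineapple"])
```

Running a check like this on a handful of rows is cheap compared with embedding a full DataFrame, so a shape mismatch would surface before any batched call is made. The same helper could be pointed at the `embed_func` definitions from the patch, at the cost of loading the model or calling the external API.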