From 08e67d04bb99c54a95e5c17a48d8088410034180 Mon Sep 17 00:00:00 2001
From: Chang She <759245+changhiskhan@users.noreply.github.com>
Date: Wed, 19 Apr 2023 15:33:19 -0700
Subject: [PATCH 1/2] [DOC] embedding function documentation

---
 docs/mkdocs.yml       |  4 +-
 docs/src/embedding.md | 90 +++++++++++++++++++++++++++++++++++++++++++
 docs/src/index.md     |  7 ++++
 3 files changed, 100 insertions(+), 1 deletion(-)
 create mode 100644 docs/src/embedding.md
diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml
index fa855f487..d7e968394 100644
--- a/docs/mkdocs.yml
+++ b/docs/mkdocs.yml
@@ -13,6 +13,8 @@ plugins:
 
 nav:
 - Home: index.md
+- Basics: basic.md
+- Embeddings: embedding.md
 - Integrations: integrations.md
 - Python API: python.md
 
@@ -23,4 +25,4 @@ markdown_extensions:
     pygments_lang_class: true
 - pymdownx.inlinehilite
 - pymdownx.snippets
-- pymdownx.superfences
\ No newline at end of file
+- pymdownx.superfences
diff --git a/docs/src/embedding.md b/docs/src/embedding.md
new file mode 100644
index 000000000..b90a8787b
--- /dev/null
+++ b/docs/src/embedding.md
@@ -0,0 +1,90 @@
+# Embedding Functions
+
+Embeddings are high-dimensional floating-point vector representations of your data or query.
+Anything can be embedded using some embedding model or function.
+For a given embedding function, the output will always have the same number of dimensions.
+
+## Creating an embedding function
+
+Any function that takes as input a batch (list) of data and outputs a batch (list) of embeddings
+can be used by LanceDB as an embedding function.
+
+### HuggingFace example
+
+One popular free option would be to use the sentence-transformers library from HuggingFace.
+
+```python
+from sentence_transformers import SentenceTransformer
+
+name="paraphrase-albert-small-v2"
+model = SentenceTransformer(name)
+
+# used both when ingesting data and when querying
+def embed_func(batch):
+    return [model.encode(sentence) for sentence in batch]
+```
+
+### OpenAI example
+
+You can also use an external API like OpenAI to generate embeddings.
+
+```python
+import openai
+import os
+
+# Configuring the environment variable OPENAI_API_KEY
+if "OPENAI_API_KEY" not in os.environ:
+    # OR set the key here as a variable
+    openai.api_key = "sk-..."
+
+# verify that the API key is working
+assert len(openai.Model.list()["data"]) > 0
+
+def embed_func(c):
+    rs = openai.Embedding.create(input=c, engine="text-embedding-ada-002")
+    return [record["embedding"] for record in rs["data"]]
+```
+
+## Applying an embedding function
+
+Using an embedding function, you can apply it to raw data
+to generate embeddings for each row.
+
+Say if you have a pandas DataFrame with a `text` column that you want to be embedded,
+you can use the following code to generate embeddings and add create a combined
+pyarrow table:
+
+```python
+from lancedb.embeddings import with_embeddings
+
+data = with_embeddings(embed_func, df)
+
+# The output is used to create / append to a table
+# db.create_table("my_table", data=data)
+```
+
+By default, LanceDB calls the function with batches of 1000 rows. This can be configured
+using the `batch_size` parameter to `with_embeddings`.
+
+LanceDB automatically wraps the function with retry and rate-limit logic to ensure the OpenAI
+API call is reliable.
+
+## Searching with an embedding function
+
+At inference time, you also need the same embedding function to embed your query text.
+It's important that you use the same model / function otherwise the embedding vectors don't
+belong in the same latent space and your results will be nonsensical.
+
+```python
+query_vector = embed_func([query])[0]
+tbl.search(query_vector).limit(10).to_df()
+```
+
+The above snippet returns a pandas DataFrame with the 10 closest vectors to the query.
+
+## Roadmap
+
+In the near future, we'll be integrating the embedding functions deeper into LanceDB.
+The goal is that you just have to configure the function once when you create the table,
+and then you'll never have to deal with embeddings / vectors after that unless you want to.
+We'll also integrate more popular models and APIs.
diff --git a/docs/src/index.md b/docs/src/index.md
index 02fce3682..5a6ea068f 100644
--- a/docs/src/index.md
+++ b/docs/src/index.md
@@ -33,7 +33,14 @@ table = db.create_table("my_table",
 result = table.search([100, 100]).limit(2).to_df()
 ```
 
+## Complete Demos
+
+We will be adding completed demo apps built using LanceDB.
+- [YouTube Transcript Search](../notebooks/youtube_transcript_search.ipynb)
+
 
 ## Documentation Quick Links
 * [`Basic Operations`](basic.md) - basic functionality of LanceDB.
+* [`Embedding Functions`](embedding.md) - functions for working with embeddings.
+* [`Ecosystem Integrations`](integrations.md) - integrating LanceDB with the Python data tooling ecosystem.
 * [`API Reference`](python.md) - detailed documentation for the LanceDB Python SDK.

From 85dda5377919b1a418d69454425d7eab7b219c50 Mon Sep 17 00:00:00 2001
From: Chang She <759245+changhiskhan@users.noreply.github.com>
Date: Wed, 19 Apr 2023 16:35:48 -0700
Subject: [PATCH 2/2] review comments

---
 docs/src/embedding.md | 15 +++++++++++----
 docs/src/python.md    |  2 ++
 2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/docs/src/embedding.md b/docs/src/embedding.md
index b90a8787b..410965148 100644
--- a/docs/src/embedding.md
+++ b/docs/src/embedding.md
@@ -7,11 +7,12 @@ For a given embedding function, the output will always have the same number of d
 ## Creating an embedding function
 
 Any function that takes as input a batch (list) of data and outputs a batch (list) of embeddings
-can be used by LanceDB as an embedding function.
+can be used by LanceDB as an embedding function. The input and output batch sizes should be the same.
 
 ### HuggingFace example
 
-One popular free option would be to use the sentence-transformers library from HuggingFace.
+One popular free option would be to use the [sentence-transformers](https://www.sbert.net/) library from HuggingFace.
+You can install this using pip: `pip install sentence-transformers`.
 
 ```python
 from sentence_transformers import SentenceTransformer
@@ -51,18 +52,23 @@ Using an embedding function, you can apply it to raw data
 to generate embeddings for each row.
 
 Say if you have a pandas DataFrame with a `text` column that you want to be embedded,
-you can use the following code to generate embeddings and add create a combined
-pyarrow table:
+you can use the [with_embeddings](https://lancedb.github.io/lancedb/python/#lancedb.embeddings.with_embeddings)
+function to generate embeddings and create a combined pyarrow table:
 
 ```python
+import pandas as pd
 from lancedb.embeddings import with_embeddings
 
+df = pd.DataFrame([{"text": "pepperoni"},
+                   {"text": "pineapple"}])
 data = with_embeddings(embed_func, df)
 
 # The output is used to create / append to a table
 # db.create_table("my_table", data=data)
 ```
 
+If your data is in a different column, you can specify the `column` kwarg to `with_embeddings`.
+
 By default, LanceDB calls the function with batches of 1000 rows. This can be configured
 using the `batch_size` parameter to `with_embeddings`.
 
@@ -76,6 +82,7 @@ It's important that you use the same model / function otherwise the embedding ve
 belong in the same latent space and your results will be nonsensical.
 
 ```python
+query = "What's the best pizza topping?"
 query_vector = embed_func([query])[0]
 tbl.search(query_vector).limit(10).to_df()
 ```
diff --git a/docs/src/python.md b/docs/src/python.md
index c04c45a46..e4dbf2217 100644
--- a/docs/src/python.md
+++ b/docs/src/python.md
@@ -10,3 +10,5 @@ pip install lancedb
 ::: lancedb.db
 ::: lancedb.table
 ::: lancedb.query
+::: lancedb.embeddings
+::: lancedb.context