docs: revamp Python example: Overview page and remove redundant examples and notebooks (#1574)

before:
![Screenshot 2024-08-29
131656](https://github.com/user-attachments/assets/81cb5d70-5dff-4e57-8bbe-3461327aed7d)

After:
![Screenshot 2024-08-29
131715](https://github.com/user-attachments/assets/62109a37-7f66-4fd4-90ed-906a85472117)

---------

Co-authored-by: Ayush Chaurasia <ayush.chaurarsia@gmail.com>
This commit is contained in:
Rithik Kumar
2024-08-29 13:48:10 +05:30
committed by GitHub
parent dd1c16bbaf
commit 6f6eb170a9
5 changed files with 19 additions and 1397 deletions

View File

@@ -1,378 +0,0 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "13cb272e",
"metadata": {},
"source": [
"# Code documentation Q&A bot example with LangChain\n",
"\n",
"This Q&A bot will allow you to query your own documentation easily using questions. We'll also demonstrate the use of LangChain and LanceDB using the OpenAI API. \n",
"\n",
"In this example we'll use Pandas 2.0 documentation, but, this could be replaced for your own docs as well\n",
"\n",
"<a href=\"https://colab.research.google.com/github/lancedb/vectordb-recipes/blob/main/examples/Code-Documentation-QA-Bot/main.ipynb\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"></a>\n",
"\n",
"Scripts - [![Python](https://img.shields.io/badge/python-3670A0?style=for-the-badge&logo=python&logoColor=ffdd54)](./examples/Code-Documentation-QA-Bot/main.py) [![JavaScript](https://img.shields.io/badge/javascript-%23323330.svg?style=for-the-badge&logo=javascript&logoColor=%23F7DF1E)](./examples/Code-Documentation-QA-Bot/index.js)"
]
},
{
"cell_type": "code",
"execution_count": 40,
"id": "66638d6c",
"metadata": {},
"outputs": [],
"source": [
"!pip install --quiet openai langchain\n",
"!pip install --quiet -U lancedb"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "d1cdcac3",
"metadata": {},
"source": [
"First, let's get some setup out of the way. As we're using the OpenAI API, ensure that you've set your key (and organization if needed):"
]
},
{
"cell_type": "code",
"execution_count": 42,
"id": "58ee1868",
"metadata": {},
"outputs": [],
"source": [
"from openai import OpenAI\n",
"import os\n",
"\n",
"# Configuring the environment variable OPENAI_API_KEY\n",
"if \"OPENAI_API_KEY\" not in os.environ:\n",
" os.environ[\"OPENAI_API_KEY\"] = \"sk-...\"\n",
"client = OpenAI()\n",
"assert len(client.models.list().data) > 0"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "34f524d3",
"metadata": {},
"source": [
"# Loading in our code documentation, generating embeddings and storing our documents in LanceDB\n",
"\n",
"We're going to use the power of LangChain to help us create our Q&A bot. It comes with several APIs that can make our development much easier as well as a LanceDB integration for vectorstore."
]
},
{
"cell_type": "code",
"execution_count": 43,
"id": "b55d22f1",
"metadata": {},
"outputs": [],
"source": [
"import lancedb\n",
"import re\n",
"import pickle\n",
"import requests\n",
"import zipfile\n",
"from pathlib import Path\n",
"\n",
"from langchain.document_loaders import UnstructuredHTMLLoader\n",
"from langchain.embeddings import OpenAIEmbeddings\n",
"from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
"from langchain.vectorstores import LanceDB\n",
"from langchain.llms import OpenAI\n",
"from langchain.chains import RetrievalQA"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "56cc6d50",
"metadata": {},
"source": [
"To make this easier, we've downloaded Pandas documentation and stored the raw HTML files for you to download. We'll download them and then use LangChain's HTML document readers to parse them and store them in LanceDB as a vector store, along with relevant metadata."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7da77e75",
"metadata": {},
"outputs": [],
"source": [
"pandas_docs = requests.get(\"https://eto-public.s3.us-west-2.amazonaws.com/datasets/pandas_docs/pandas.documentation.zip\")\n",
"with open('/tmp/pandas.documentation.zip', 'wb') as f:\n",
" f.write(pandas_docs.content)\n",
"\n",
"file = zipfile.ZipFile(\"/tmp/pandas.documentation.zip\")\n",
"file.extractall(path=\"/tmp/pandas_docs\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "ae42496c",
"metadata": {},
"source": [
"We'll create a simple helper function that can help to extract metadata, so we can use this downstream when we're wanting to query with filters. In this case, we want to keep the lineage of the uri or path for each document that we process:"
]
},
{
"cell_type": "code",
"execution_count": 44,
"id": "d171d062",
"metadata": {},
"outputs": [],
"source": [
"def get_document_title(document):\n",
" m = str(document.metadata[\"source\"])\n",
" title = re.findall(\"pandas.documentation(.*).html\", m)\n",
" if title[0] is not None:\n",
" return(title[0])\n",
" return ''"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "130162ad",
"metadata": {},
"source": [
"# Pre-processing and loading the documentation\n",
"\n",
"Next, let's pre-process and load the documentation. To make sure we don't need to do this repeatedly if we were updating code, we're caching it using pickle so we can retrieve it again (this could take a few minutes to run the first time you do it). We'll also add some more metadata to the docs here such as the title and version of the code:"
]
},
{
"cell_type": "code",
"execution_count": 45,
"id": "33bfe7d8",
"metadata": {},
"outputs": [],
"source": [
"docs_path = Path(\"docs.pkl\")\n",
"docs = []\n",
"\n",
"if not docs_path.exists():\n",
" for p in Path(\"/tmp/pandas_docs/pandas.documentation\").rglob(\"*.html\"):\n",
" print(p)\n",
" if p.is_dir():\n",
" continue\n",
" loader = UnstructuredHTMLLoader(p)\n",
" raw_document = loader.load()\n",
" \n",
" m = {}\n",
" m[\"title\"] = get_document_title(raw_document[0])\n",
" m[\"version\"] = \"2.0rc0\"\n",
" raw_document[0].metadata = raw_document[0].metadata | m\n",
" raw_document[0].metadata[\"source\"] = str(raw_document[0].metadata[\"source\"])\n",
" docs = docs + raw_document\n",
"\n",
" with docs_path.open(\"wb\") as fh:\n",
" pickle.dump(docs, fh)\n",
"else:\n",
" with docs_path.open(\"rb\") as fh:\n",
" docs = pickle.load(fh)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "c3852dd3",
"metadata": {},
"source": [
"# Generating embeddings from our docs\n",
"\n",
"Now that we have our raw documents loaded, we need to pre-process them to generate embeddings:"
]
},
{
"cell_type": "code",
"execution_count": 47,
"id": "82230563",
"metadata": {},
"outputs": [],
"source": [
"text_splitter = RecursiveCharacterTextSplitter(\n",
" chunk_size=1000,\n",
" chunk_overlap=200,\n",
")\n",
"documents = text_splitter.split_documents(docs)\n",
"embeddings = OpenAIEmbeddings()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "43e68215",
"metadata": {},
"source": [
"# Storing and querying with LanceDB\n",
"\n",
"Let's connect to LanceDB so we can store our documents. We'll create a Table to store them in:"
]
},
{
"cell_type": "code",
"execution_count": 48,
"id": "74780a58",
"metadata": {},
"outputs": [],
"source": [
"db = lancedb.connect('/tmp/lancedb')\n",
"table = db.create_table(\"pandas_docs\", data=[\n",
" {\"vector\": embeddings.embed_query(\"Hello World\"), \"text\": \"Hello World\", \"id\": \"1\"}\n",
"], mode=\"overwrite\")\n",
"docsearch = LanceDB.from_documents(documents, embeddings, connection=table)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "3cb1dc5d",
"metadata": {},
"source": [
"Now let's create our RetrievalQA chain using the LanceDB vector store:"
]
},
{
"cell_type": "code",
"execution_count": 49,
"id": "6a5891ad",
"metadata": {},
"outputs": [],
"source": [
"qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type=\"stuff\", retriever=docsearch.as_retriever())"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "28d93b85",
"metadata": {},
"source": [
"And that's it! We're all set up. The next step is to run some queries, let's try a few:"
]
},
{
"cell_type": "code",
"execution_count": 50,
"id": "70d88316",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"' The major differences in pandas 2.0 include installing optional dependencies with pip extras, the ability to use any numpy numeric dtype in an Index, and enhancements, notable bug fixes, backwards incompatible API changes, deprecations, and performance improvements.'"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query = \"What are the major differences in pandas 2.0?\"\n",
"qa.run(query)"
]
},
{
"cell_type": "code",
"execution_count": 51,
"id": "85a0397c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"' 2.0.0rc0'"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query = \"What's the current version of pandas?\"\n",
"qa.run(query)"
]
},
{
"cell_type": "code",
"execution_count": 52,
"id": "923f86c6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"' Optional dependencies can be installed with pip install \"pandas[all]\" or \"pandas[performance]\". This will install all recommended performance dependencies such as numexpr, bottleneck and numba.'"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query = \"How do I make use of installing optional dependencies?\"\n",
"qa.run(query)"
]
},
{
"cell_type": "code",
"execution_count": 53,
"id": "02082f83",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\" \\n\\nPandas 2.0 includes a number of API breaking changes, such as increased minimum versions for dependencies, the use of os.linesep for DataFrame.to_csv's line_terminator, and reorganization of the library. See the release notes for a full list of changes.\""
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query = \"What are the backwards incompatible API changes in Pandas 2.0?\"\n",
"qa.run(query)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "75cea547",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -1,297 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![example](https://github.com/lancedb/vectordb-recipes/assets/15766192/799f94a1-a01d-4a5b-a627-2a733bbb4227)\n",
"\n",
" <a href=\"https://colab.research.google.com/github/lancedb/vectordb-recipes/blob/main/examples/multimodal_clip/main.ipynb\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"></a>| [![Python](https://img.shields.io/badge/python-3670A0?style=for-the-badge&logo=python&logoColor=ffdd54)](./examples/multimodal_clip/main.py) |"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\u001B[1m[\u001B[0m\u001B[34;49mnotice\u001B[0m\u001B[1;39;49m]\u001B[0m\u001B[39;49m A new release of pip available: \u001B[0m\u001B[31;49m22.3.1\u001B[0m\u001B[39;49m -> \u001B[0m\u001B[32;49m23.1.2\u001B[0m\n",
"\u001B[1m[\u001B[0m\u001B[34;49mnotice\u001B[0m\u001B[1;39;49m]\u001B[0m\u001B[39;49m To update, run: \u001B[0m\u001B[32;49mpip install --upgrade pip\u001B[0m\n",
"\n",
"\u001B[1m[\u001B[0m\u001B[34;49mnotice\u001B[0m\u001B[1;39;49m]\u001B[0m\u001B[39;49m A new release of pip available: \u001B[0m\u001B[31;49m22.3.1\u001B[0m\u001B[39;49m -> \u001B[0m\u001B[32;49m23.1.2\u001B[0m\n",
"\u001B[1m[\u001B[0m\u001B[34;49mnotice\u001B[0m\u001B[1;39;49m]\u001B[0m\u001B[39;49m To update, run: \u001B[0m\u001B[32;49mpip install --upgrade pip\u001B[0m\n"
]
}
],
"source": [
"!pip install --quiet -U lancedb\n",
"!pip install --quiet gradio transformers torch torchvision"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import io\n",
"\n",
"import PIL\n",
"import duckdb\n",
"import lancedb"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## First run setup: Download data and pre-process"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"### Get dataset\n",
"\n",
"!wget https://eto-public.s3.us-west-2.amazonaws.com/datasets/diffusiondb_lance.tar.gz\n",
"!tar -xvf diffusiondb_lance.tar.gz\n",
"!mv diffusiondb_test rawdata.lance\n"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<lance.dataset.LanceDataset at 0x3045db590>"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# remove null prompts\n",
"import lance\n",
"import pyarrow.compute as pc\n",
"\n",
"# download s3://eto-public/datasets/diffusiondb/small_10k.lance to this uri\n",
"data = lance.dataset(\"~/datasets/rawdata.lance\").to_table()\n",
"\n",
"# First data processing and full-text-search index\n",
"db = lancedb.connect(\"~/datasets/demo\")\n",
"tbl = db.create_table(\"diffusiondb\", data.filter(~pc.field(\"prompt\").is_null()))\n",
"tbl = tbl.create_fts_index([\"prompt\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create / Open LanceDB Table"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"db = lancedb.connect(\"~/datasets/demo\")\n",
"tbl = db.open_table(\"diffusiondb\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create CLIP embedding function for the text"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from transformers import CLIPModel, CLIPProcessor, CLIPTokenizerFast\n",
"\n",
"MODEL_ID = \"openai/clip-vit-base-patch32\"\n",
"\n",
"tokenizer = CLIPTokenizerFast.from_pretrained(MODEL_ID)\n",
"model = CLIPModel.from_pretrained(MODEL_ID)\n",
"processor = CLIPProcessor.from_pretrained(MODEL_ID)\n",
"\n",
"def embed_func(query):\n",
" inputs = tokenizer([query], padding=True, return_tensors=\"pt\")\n",
" text_features = model.get_text_features(**inputs)\n",
" return text_features.detach().numpy()[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Search functions for Gradio"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"def find_image_vectors(query):\n",
" emb = embed_func(query)\n",
" code = (\n",
" \"import lancedb\\n\"\n",
" \"db = lancedb.connect('~/datasets/demo')\\n\"\n",
" \"tbl = db.open_table('diffusiondb')\\n\\n\"\n",
" f\"embedding = embed_func('{query}')\\n\"\n",
" \"tbl.search(embedding).limit(9).to_pandas()\"\n",
" )\n",
" return (_extract(tbl.search(emb).limit(9).to_pandas()), code)\n",
"\n",
"def find_image_keywords(query):\n",
" code = (\n",
" \"import lancedb\\n\"\n",
" \"db = lancedb.connect('~/datasets/demo')\\n\"\n",
" \"tbl = db.open_table('diffusiondb')\\n\\n\"\n",
" f\"tbl.search('{query}').limit(9).to_pandas()\"\n",
" )\n",
" return (_extract(tbl.search(query).limit(9).to_pandas()), code)\n",
"\n",
"def find_image_sql(query):\n",
" code = (\n",
" \"import lancedb\\n\"\n",
" \"import duckdb\\n\"\n",
" \"db = lancedb.connect('~/datasets/demo')\\n\"\n",
" \"tbl = db.open_table('diffusiondb')\\n\\n\"\n",
" \"diffusiondb = tbl.to_lance()\\n\"\n",
" f\"duckdb.sql('{query}').to_df()\"\n",
" ) \n",
" diffusiondb = tbl.to_lance()\n",
" return (_extract(duckdb.sql(query).to_df()), code)\n",
"\n",
"def _extract(df):\n",
" image_col = \"image\"\n",
" return [(PIL.Image.open(io.BytesIO(row[image_col])), row[\"prompt\"]) for _, row in df.iterrows()]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup Gradio interface"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Running on local URL: http://127.0.0.1:7881\n",
"\n",
"To create a public link, set `share=True` in `launch()`.\n"
]
},
{
"data": {
"text/html": [
"<div><iframe src=\"http://127.0.0.1:7881/\" width=\"100%\" height=\"500\" allow=\"autoplay; camera; microphone; clipboard-read; clipboard-write;\" frameborder=\"0\" allowfullscreen></iframe></div>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import gradio as gr\n",
"\n",
"\n",
"with gr.Blocks() as demo:\n",
" with gr.Row():\n",
" with gr.Tab(\"Embeddings\"):\n",
" vector_query = gr.Textbox(value=\"portraits of a person\", show_label=False)\n",
" b1 = gr.Button(\"Submit\")\n",
" with gr.Tab(\"Keywords\"):\n",
" keyword_query = gr.Textbox(value=\"ninja turtle\", show_label=False)\n",
" b2 = gr.Button(\"Submit\")\n",
" with gr.Tab(\"SQL\"):\n",
" sql_query = gr.Textbox(value=\"SELECT * from diffusiondb WHERE image_nsfw >= 2 LIMIT 9\", show_label=False)\n",
" b3 = gr.Button(\"Submit\")\n",
" with gr.Row():\n",
" code = gr.Code(label=\"Code\", language=\"python\")\n",
" with gr.Row():\n",
" gallery = gr.Gallery(\n",
" label=\"Found images\", show_label=False, elem_id=\"gallery\"\n",
" ).style(columns=[3], rows=[3], object_fit=\"contain\", height=\"auto\") \n",
" \n",
" b1.click(find_image_vectors, inputs=vector_query, outputs=[gallery, code])\n",
" b2.click(find_image_keywords, inputs=keyword_query, outputs=[gallery, code])\n",
" b3.click(find_image_sql, inputs=sql_query, outputs=[gallery, code])\n",
" \n",
"demo.launch()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.11.4 64-bit",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
},
"vscode": {
"interpreter": {
"hash": "b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e"
}
}
},
"nbformat": 4,
"nbformat_minor": 1
}

File diff suppressed because one or more lines are too long