update lambda example to lancedb

Jai Chopra
2023-05-04 08:16:33 -07:00
parent c3d90b2c78
commit 6556e42e6d
2 changed files with 171 additions and 15 deletions


@@ -15,3 +15,124 @@ for p in Path("./pandas.documentation").rglob("*.html"):
    loader = UnstructuredHTMLLoader(p)
    raw_document = loader.load()
    docs = docs + raw_document
```
Once we have pre-processed our input documents, the next step is to generate the embeddings. For this, we'll use LangChain's OpenAI API wrapper. Note that you'll need to sign up for the OpenAI API (this is a paid service). Here we'll split the documents into chunks and store their embeddings in a Lance dataset. Using Lance will persist your embeddings locally; you can read more about Lance's data management features in its [documentation](https://eto-ai.github.io/lance/).
```python
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
documents = text_splitter.split_documents(docs)
embeddings = OpenAIEmbeddings()
LanceDataset.from_documents(documents, embeddings, uri="pandas.lance")
```
Now that we've got our vector store set up, we can boot up the chat app, which uses LangChain's chain API to submit an input query to the vector store. Under the hood, this will generate embeddings for your query, perform similarity search against the vector store, and generate the resulting text.
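For the curious, here is a minimal sketch of that query path using LangChain's `RetrievalQA` chain (the chain and model choices are illustrative, not necessarily how the app is wired; `LanceDataset` is the vectorstore from our fork, described in the next section):
```python
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# Same call as above, but this time we keep a handle to the vector store
vectorstore = LanceDataset.from_documents(documents, embeddings, uri="pandas.lance")

# The chain embeds the question, runs similarity search against Lance,
# and feeds the retrieved chunks to the LLM to generate an answer.
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(temperature=0),
    retriever=vectorstore.as_retriever(),
)
print(qa.run("Which version of the pandas documentation are you using?"))
```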
And presto! Your very own Pandas API helper bot, a handy little assistant to help you get up to speed with Pandas. Here are some examples:
First, let's make sure we're on the right documentation version for pandas:
Great, now we can ask some more specific questions:
So far so good!
# Integrating Lance into LangChain
LangChain has a vectorstore abstraction with multiple implementations. This is where we put Lance. In our own LangChain fork, we added a `lance_dataset.py` as a new kind of vectorstore that is just a `LanceDataset` (`pip install pylance`). Once you get the embeddings, you can call Lance's `vec_to_table()` method to create a pyarrow Table from them:
```python
import lance
from lance.vector import vec_to_table

embeddings = embedding.embed_documents(texts)
tbl = vec_to_table(embeddings)
```
Writing the data is just:
```python
uri = "pandas_documentation.lance"
dataset = lance.write_dataset(tbl, uri)
```
If the dataset is small, Lance's SIMD code for vector distances makes brute-force search faster than numpy. And if the dataset is large, you can create an ANN index with one more line of Python code.
```python
dataset.create_index("vector", index_type="IVF_PQ",
                     num_partitions=256,  # IVF partitions
                     num_sub_vectors=16)  # PQ sub-vectors (16 is an example value)
```
To make an ANN query that finds the 10 closest neighbors to `query_vector` and also fetches the document and metadata, use the `to_table` function like this:
```python
tbl = dataset.to_table(columns=["document", "metadata"],
                       nearest={"column": "vector",
                                "q": query_vector,
                                "k": 10})
```
You now have a pyarrow Table that you can use for downstream re-ranking and filtering.
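For example, here is a quick sketch of post-filtering those results in pandas. It assumes, as in this post, that `metadata` is stored as a JSON string containing a `version` field; the version value itself is just illustrative:
```python
import json

df = tbl.to_pandas()

# Parse the JSON metadata column and keep only hits from a particular docs version
df["version"] = df["metadata"].apply(lambda m: json.loads(m)["version"])
filtered = df[df["version"] == "2.0.0"]
```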
# Lance datasets are queryable
Lance datasets are Arrow-compatible, so you can query them directly with Pandas, DuckDB, or Polars, which makes it super easy to do additional filtering, reranking, enrichment, or debugging.
We store metadata in JSON alongside the vectors, but if you wish, you could define the data model yourself.
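If you do want an explicit data model, a minimal sketch of what that could look like with pyarrow (the column names and the 1536-dimension embedding size are illustrative, not prescribed by Lance):
```python
import pyarrow as pa

# Hypothetical explicit schema: fixed-size float32 vectors plus typed metadata columns
schema = pa.schema([
    pa.field("vector", pa.list_(pa.float32(), 1536)),  # e.g. OpenAI ada-002 embeddings
    pa.field("document", pa.string()),
    pa.field("version", pa.string()),
])
```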
For example, to check how many vectors we're storing against each version of the documentation, we can load the Lance dataset in DuckDB and run SQL against it:
```sql
SELECT
count(vector),
json_extract(metadata, '$.version') as VERSION
FROM
pandas_docs
GROUP BY
VERSION
```
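A sketch of actually running that query from Python with DuckDB: its replacement scans let the SQL reference the Arrow table through the local variable name `pandas_docs`, and `json_extract` comes from DuckDB's bundled JSON extension (assuming it is available in your build):
```python
import duckdb
import lance

# Materialize the Lance dataset as an Arrow table; DuckDB picks it up by variable name
pandas_docs = lance.dataset("pandas_documentation.lance").to_table()

duckdb.query("""
    SELECT count(vector), json_extract(metadata, '$.version') AS version
    FROM pandas_docs
    GROUP BY version
""").to_df()
```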
# Lance versions your data automatically
Oftentimes we have to regenerate our embeddings on a regular basis, and debugging becomes a pain when you have to track down the right version of your data and the right version of your index to diagnose an issue. With Lance you can create a new version of your dataset by specifying `mode="overwrite"` when writing.
Let's say we start with some toy data:
```python
>>> import pandas as pd
>>> import lance
>>> import numpy as np
>>> df = pd.DataFrame({"a": [5]})
>>> dataset = lance.write_dataset(df, "data.lance")
>>> dataset.to_table().to_pandas()
   a
0  5
```
We can create and persist a new version:
```python
>>> df = pd.DataFrame({"a": [50, 100]})
>>> dataset = lance.write_dataset(df, "data.lance", mode="overwrite")
>>> dataset.to_table().to_pandas()
     a
0   50
1  100
```
And you can time travel to a previous version by specifying the version number or timestamp when you create the dataset instance.
```python
>>> dataset.versions()
[{'version': 2, 'timestamp': datetime.datetime(2023, 2, 24, 11, 58, 20, 739008), 'metadata': {}},
{'version': 1, 'timestamp': datetime.datetime(2023, 2, 24, 11, 56, 59, 690977), 'metadata': {}}]
>>> lance.dataset("data.lance", version=1).to_table().to_pandas()
   a
0  5
```
# Where Lance is headed
Lance's random access performance makes it ideal for building search engines and high-performance data stores for deep learning. We're actively working to make Lance support 1B+ scale vector datasets, partitioning and reindexing, and new index types. Lance is written in Rust and comes with a Python wrapper and a DuckDB extension.


@@ -12,31 +12,57 @@ Before we start, you'll need to ensure you create a secure account access to AWS
We'll also use a container to ship our Lambda code. This is a good option for Lambda, as you don't run into the space limits you would otherwise hit when building a package yourself.
# Initial setup: creating a LanceDB Table and storing it remotely on S3
We'll use the SIFT vector dataset as an example. To make it easier, we've already made a Lance-format SIFT dataset publicly available, which we can access and use to populate our LanceDB Table.
To do this, download the Lance files locally first from:
```
s3://eto-public/datasets/sift/vec_data.lance
```
Then, we can write a quick Python script to populate our LanceDB Table:
```python
import lance  # installed via `pip install pylance`; the module is named "lance"
sift_dataset = lance.dataset("/path/to/local/vec_data.lance")
df = sift_dataset.to_table().to_pandas()
import lancedb
db = lancedb.connect(".")
table = db.create_table("vector_example", df)
```
Once we've created our Table, we can move this data over to S3 so we can host it remotely (or create it on S3 directly, as sketched below).
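A minimal sketch of the direct route, assuming your AWS credentials allow writes to the target bucket: connect LanceDB to the S3 URI and create the table there, instead of copying the local `.lance` directory by hand.
```python
import lancedb

# Connecting to an s3:// URI makes LanceDB read and write the table data in that bucket
remote_db = lancedb.connect("s3://eto-public/tables")
remote_table = remote_db.create_table("vector_example", df)
```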
# Building our Lambda app: a simple event handler for vector search
Now that we've got a remotely hosted LanceDB Table, we'll want to be able to query it from Lambda. To do so, let's create a new `Dockerfile` using the AWS python container base:
```docker
FROM public.ecr.aws/lambda/python:3.10
RUN pip3 install --upgrade pip
RUN pip3 install --no-cache-dir -U numpy --target "${LAMBDA_TASK_ROOT}"
RUN pip3 install --no-cache-dir -U pylance --target "${LAMBDA_TASK_ROOT}"
RUN pip3 install --no-cache-dir -U lancedb --target "${LAMBDA_TASK_ROOT}"
COPY app.py ${LAMBDA_TASK_ROOT}
CMD [ "app.handler" ]
```
Now let's make a simple Lambda function that queries the SIFT dataset in `app.py`.
```python
import time
import json
import numpy as np
import lancedb
db = lancedb.connect("s3://eto-public/tables")
table = db.open_table("vector_example")
def handler(event, context):
    status_code = 200
@@ -56,19 +82,28 @@ def handler(event, context):
    # Shape of SIFT is (128,1M), d=float32
    query_vector = np.array(event['query_vector'], dtype=np.float32)
    rs = table.search(query_vector).limit(2).to_df()

    return {
        "statusCode": status_code,
        "headers": {
            "Content-Type": "application/json"
        },
        "body": rs.to_json()
    }
```
# Deploying the container to Amazon ECR
The next step is to build and push the container image to Amazon ECR (Elastic Container Registry), where it can then be used to create a new Lambda function.
It's best to follow the official AWS documentation for how to do this, which you can view here:
```
https://docs.aws.amazon.com/lambda/latest/dg/images-create.html#images-upload
```
# Final step: setting up your Lambda function
Once the container is pushed, you can create a Lambda function by selecting the container.
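Once the function exists, a quick way to smoke-test it is to invoke it with a test event containing a query vector. A sketch using boto3; the function name here is a placeholder for whatever you called yours:
```python
import json
import boto3
import numpy as np

lambda_client = boto3.client("lambda")

# SIFT vectors are 128-dimensional float32; a random vector is enough for a smoke test
event = {"query_vector": np.random.rand(128).astype(np.float32).tolist()}

response = lambda_client.invoke(
    FunctionName="lancedb-vector-search",  # hypothetical function name
    Payload=json.dumps(event),
)
print(json.loads(response["Payload"].read()))
```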