update lambda example to lancedb
@@ -15,3 +15,124 @@ for p in Path("./pandas.documentation").rglob("*.html"):
```python
    # load each HTML page and append its contents to the list of raw documents
    loader = UnstructuredHTMLLoader(p)
    raw_document = loader.load()
    docs = docs + raw_document
```
Once we have pre-processed our input documents, the next step is to generate the embeddings. For this, we’ll use LangChain’s OpenAI API wrapper. Note that you’ll need to sign up for the OpenAI API (this is a paid service). Here we’ll tokenize the documents and store them in a Lance dataset. Using Lance will persist your embeddings locally; you can read more about Lance’s data management features in its [documentation](https://eto-ai.github.io/lance/).
```python
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
documents = text_splitter.split_documents(docs)
embeddings = OpenAIEmbeddings()
LanceDataset.from_documents(documents, embeddings, uri="pandas.lance")
```
Now that we’ve got our vector store set up, we can boot up the chat app, which uses LangChain’s chain API to submit an input query to the vector store. Under the hood, this will generate embeddings for your query, perform a similarity search against the vector store, and generate the resulting text.
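To make that concrete, here is a minimal sketch of the chat side, assuming the `LanceDataset` vectorstore from our langchain fork inherits LangChain's standard `as_retriever()` interface (the chain choice here is illustrative, not the exact code of the app):

```python
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA

# `documents` and `embeddings` come from the snippet above
vectorstore = LanceDataset.from_documents(documents, embeddings, uri="pandas.lance")

# the chain embeds the question, retrieves similar chunks from Lance,
# and asks the LLM to answer using those chunks as context
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0),
    retriever=vectorstore.as_retriever(),
)
print(qa.run("How do I create a pivot table in pandas?"))
```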
And presto! Your very own Pandas API helper bot, a handy little assistant to help you get up to speed with Pandas — here are some examples:

First let’s make sure we’re on the right documentation version for pandas:

Great, now we can ask some more specific questions:

So far so good!
# Integrating Lance into LangChain
LangChain has a vectorstore abstraction with multiple implementations. This is where we plug in Lance. In our own langchain fork, we added lance_dataset.py as a new kind of vectorstore that is just a LanceDataset (pip install pylance). Once you get the embeddings, you can call lance’s vec_to_table() function to create a pyarrow Table from them:
```python
import lance
from lance.vector import vec_to_table

# embed the document chunks and pack the vectors into a pyarrow Table
embeddings = embedding.embed_documents(texts)
tbl = vec_to_table(embeddings)
```
Writing the data is just:
```python
uri = "pandas_documentation.lance"
dataset = lance.write_dataset(tbl, uri)
```
If the dataset is small, Lance’s SIMD code for vector distances makes brute-force search faster than numpy. And if the dataset is large, you can create an ANN index with one more line of Python:
```python
dataset.create_index("vector",
                     index_type="IVF_PQ",
                     num_partitions=256,                # IVF partitions
                     num_sub_vectors=num_sub_vectors)   # PQ sub-vectors; pick a value that evenly divides the vector dimension
```
To make an ANN query that finds the 10 closest neighbors to `query_vector` and also fetches the document and metadata, use the to_table function like this:
```python
tbl = dataset.to_table(columns=["document", "metadata"],
                       nearest={"column": "vector",
                                "q": query_vector,
                                "k": 10})
```
You now have a pyarrow Table that you can use for downstream re-ranking and filtering.
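For instance, a quick way to post-filter the hits in pandas might look like this (assuming, as in our setup, that `metadata` is a JSON string with a `version` field; the version value below is just an example):

```python
import json

# convert the ANN results from the snippet above to pandas and post-filter on a metadata field
hits = tbl.to_pandas()
hits["version"] = hits["metadata"].apply(lambda m: json.loads(m)["version"])
hits = hits[hits["version"] == "1.5.3"]  # illustrative documentation version
print(hits["document"].head())
```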
# Lance datasets are queryable
Lance datasets are Arrow compatible, so you can directly query them with Pandas, DuckDB, or Polars, making it easy to do additional filtering, reranking, enrichment, or debugging.

We store metadata in JSON alongside the vectors, but, if you wish, you could define the data model yourself.
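As an illustration, an explicit pyarrow schema for the same data might look something like this (the 1536-dimension figure assumes OpenAI embeddings; adjust it to your model):

```python
import pyarrow as pa

# an explicit data model instead of a single JSON metadata string (illustrative)
schema = pa.schema([
    pa.field("vector", pa.list_(pa.float32(), 1536)),  # fixed-size list of floats for the embedding
    pa.field("document", pa.string()),                  # the raw text chunk
    pa.field("source", pa.string()),                    # e.g. the HTML file the chunk came from
    pa.field("version", pa.string()),                   # pandas documentation version
])
```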
For example, to check how many vectors we’re storing for each version of the documentation, you can load the Lance dataset in DuckDB and run SQL against it:
```sql
SELECT
    count(vector),
    json_extract(metadata, '$.version') AS version
FROM
    pandas_docs
GROUP BY
    version
```
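One way to run that query from Python, assuming the dataset was written to pandas_documentation.lance as above, is to hand the Arrow table to DuckDB directly:

```python
import duckdb
import lance

# DuckDB can scan in-memory Arrow tables by name, so expose the Lance data as one
pandas_docs = lance.dataset("pandas_documentation.lance").to_table()

print(duckdb.query("""
    SELECT count(vector), json_extract(metadata, '$.version') AS version
    FROM pandas_docs
    GROUP BY version
""").to_df())
```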
# Lance versions your data automatically
Oftentimes we have to regenerate our embeddings on a regular basis. This makes debugging a pain: you have to track down the right version of your data and the right version of your index to diagnose any issues. With Lance you can create a new version of your dataset by specifying mode="overwrite" when writing.

Let’s say we start with some toy data:
```python
>>> import pandas as pd
>>> import lance
>>> import numpy as np
>>> df = pd.DataFrame({"a": [5]})
>>> dataset = lance.write_dataset(df, "data.lance")
>>> dataset.to_table().to_pandas()
   a
0  5
```
We can create and persist a new version:
```python
>>> df = pd.DataFrame({"a": [50, 100]})
>>> dataset = lance.write_dataset(df, "data.lance", mode="overwrite")
>>> dataset.to_table().to_pandas()
     a
0   50
1  100
```
And you can time travel to a previous version by specifying the version number or timestamp when you create the dataset instance.
```python
>>> dataset.versions()
[{'version': 2, 'timestamp': datetime.datetime(2023, 2, 24, 11, 58, 20, 739008), 'metadata': {}},
 {'version': 1, 'timestamp': datetime.datetime(2023, 2, 24, 11, 56, 59, 690977), 'metadata': {}}]
>>> lance.dataset("data.lance", version=1).to_table().to_pandas()
   a
0  5
```
# Where Lance is headed
Lance’s random access performance makes it ideal for building search engines and high-performance data stores for deep learning. We’re actively working to make Lance support 1B+ scale vector datasets, partitioning and reindexing, and new index types. Lance is written in Rust and comes with a Python wrapper and a DuckDB extension.

@@ -12,31 +12,57 @@ Before we start, you'll need to ensure you create a secure account access to AWS

We'll also use a container to ship our Lambda code. This is a good option for Lambda as you don't run into the space limits that you would otherwise hit when building a deployment package yourself.
# Initial setup: creating a LanceDB Table and storing it remotely on S3
We'll use the SIFT vector dataset as an example. To make it easier, we've already made a Lance-format SIFT dataset publicly available, which we can access and use to populate our LanceDB Table.

To do this, first download the Lance files locally from:
```
s3://eto-public/datasets/sift/vec_data.lance
```
Then, we can write a quick Python script to populate our LanceDB Table:
```python
import lance

# read the SIFT vectors from the local Lance dataset into a pandas DataFrame
sift_dataset = lance.dataset("/path/to/local/vec_data.lance")
df = sift_dataset.to_table().to_pandas()

import lancedb

# create a local LanceDB table from the DataFrame
db = lancedb.connect(".")
table = db.create_table("vector_example", df)
```
Once we've created our Table, we are free to move this data over to S3 so we can remotely host it.
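If you prefer, you can also skip the manual copy and have LanceDB write straight to S3 by connecting to an S3 prefix; a minimal sketch, assuming your AWS credentials are configured (the bucket name below is a placeholder):

```python
import lancedb

# connect LanceDB directly to an S3 prefix and create the table there
remote_db = lancedb.connect("s3://your-bucket/tables")  # placeholder bucket
remote_table = remote_db.create_table("vector_example", df)  # df from the script above
```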
# Building our Lambda app: a simple event handler for vector search
Now that we've got a remotely hosted LanceDB Table, we'll want to be able to query it from Lambda. To do so, let's create a new `Dockerfile` using the AWS python container base:
```docker
FROM public.ecr.aws/lambda/python:3.10

RUN pip3 install --upgrade pip
RUN pip3 install --no-cache-dir -U numpy --target "${LAMBDA_TASK_ROOT}"
RUN pip3 install --no-cache-dir -U pylance --target "${LAMBDA_TASK_ROOT}"
RUN pip3 install --no-cache-dir -U lancedb --target "${LAMBDA_TASK_ROOT}"

COPY app.py ${LAMBDA_TASK_ROOT}

CMD [ "app.handler" ]
```
Now let's make a simple Lambda function in `app.py` that queries the SIFT dataset, letting the caller supply a query vector and optionally override the number of nearest neighbours returned.
```python
import time
import json

import numpy as np
import lance
import lancedb

# the raw SIFT Lance dataset on S3, plus the LanceDB table we created from it
s3_dataset = lance.dataset("s3://eto-public/datasets/sift/vec_data.lance")
db = lancedb.connect("s3://eto-public/tables")
table = db.open_table("vector_example")


def handler(event, context):
    status_code = 200
```
@@ -56,19 +82,28 @@ def handler(event, context):
```python
    # Shape of SIFT is (128, 1M), dtype=float32
    query_vector = np.array(event['query_vector'], dtype=np.float32)

    # number of nearest neighbours to return (default to 10)
    num_k = event['num_k'] if event.get('num_k') is not None else 10

    if event.get('debug') is not None:
        # debug path: query the raw Lance dataset on S3 directly
        rs = s3_dataset.to_table(
            nearest={"column": "vector", "k": num_k, "q": query_vector}
        ).to_pandas()
    else:
        # normal path: query through the LanceDB table
        rs = table.search(query_vector).limit(num_k).to_df()

    return {
        "statusCode": status_code,
        "headers": {
            "Content-Type": "application/json"
        },
        "body": rs.to_json()
    }
```
# Deploying the container to Amazon ECR

The next step is to build and push the container image to Amazon ECR, where it can then be used to create a new Lambda function.

It's best to follow the official AWS documentation for how to do this, which you can view here:
```
https://docs.aws.amazon.com/lambda/latest/dg/images-create.html#images-upload
```
# Final step: setting up your Lambda function
Once the container is pushed, you can create a Lambda function by selecting the container.
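To sanity-check the deployment, you can invoke the function with a test event from Python, for example with boto3 (the function name below is a placeholder):

```python
import json
import boto3

client = boto3.client("lambda")

# send a test event with a query vector (SIFT vectors are 128-dimensional)
response = client.invoke(
    FunctionName="lancedb-vector-search",  # placeholder: use your function's name
    Payload=json.dumps({
        "query_vector": [0.0] * 128,
        "num_k": 10,
    }),
)
print(json.loads(response["Payload"].read()))
```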