Integrations
Built on top of Apache Arrow, LanceDB is easy to integrate with the Python ecosystem, including Pandas, PyArrow and DuckDB.
Pandas and PyArrow
First, we need to connect to a LanceDB database.
import lancedb
db = lancedb.connect("data/sample-lancedb")
Next, write a Pandas DataFrame directly to LanceDB.
import pandas as pd
data = pd.DataFrame({
"vector": [[3.1, 4.1], [5.9, 26.5]],
"item": ["foo", "bar"],
"price": [10.0, 20.0]
})
table = db.create_table("pd_table", data=data)
You will find detailed instructions for creating a dataset and an index in the Basic Operations and Indexing sections.
We can now perform similarity searches via LanceDB.
# Open the table previously created.
table = db.open_table("pd_table")
query_vector = [100, 100]
# Pandas DataFrame
df = table.search(query_vector).limit(1).to_df()
print(df)
vector item price score
0 [5.9, 26.5] bar 20.0 14257.05957
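The score printed above is consistent with the squared Euclidean (L2) distance between the query vector and the stored vector. The sketch below reproduces that arithmetic with the standard library only. The helper name `squared_l2` is hypothetical, and this illustrates the numbers rather than describing LanceDB's internals; the small difference from the printed value is presumably due to lower-precision arithmetic in the search.

```python
def squared_l2(a, b):
    """Sum of squared per-dimension differences between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# The two vectors stored in "pd_table", keyed by their item name.
rows = {"foo": [3.1, 4.1], "bar": [5.9, 26.5]}
query_vector = [100, 100]

# Score every row against the query and pick the smallest distance.
scores = {item: squared_l2(query_vector, vec) for item, vec in rows.items()}
nearest = min(scores, key=scores.get)

print(nearest, round(scores[nearest], 2))
```

A smaller score means a closer match, which is why `bar` is returned when the limit is 1.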
If you have a simple filter, it's faster to provide a where clause to LanceDB's search query.
If you have more complex criteria, you can always apply the filter to the resulting pandas DataFrame from the search query.
# Apply the filter via LanceDB
results = table.search([100, 100]).where("price < 15").to_df()
assert len(results) == 1
assert results["item"].iloc[0] == "foo"
# Apply the filter via Pandas
df = table.search([100, 100]).to_df()
results = df[df.price < 15]
assert len(results) == 1
assert results["item"].iloc[0] == "foo"
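For more complex criteria, several conditions can be combined on the resulting DataFrame. The following is a minimal sketch that assumes the search result has the same columns as the sample data; the DataFrame is built by hand here as a stand-in for the output of `table.search(...).to_df()`, so that the example runs without a database.

```python
import pandas as pd

# Stand-in for the DataFrame returned by the search query (assumption:
# same columns and rows as the sample data written earlier).
df = pd.DataFrame({
    "vector": [[3.1, 4.1], [5.9, 26.5]],
    "item": ["foo", "bar"],
    "price": [10.0, 20.0],
})

# Combine conditions with & (and) or | (or); each condition needs parentheses.
mask = (df["price"] < 15) & (df["item"].str.startswith("f"))
results = df[mask]

print(results["item"].tolist())
```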
DuckDB
LanceDB works with DuckDB via PyArrow integration.
Let us start by installing duckdb and lancedb.
pip install duckdb lancedb
We will reuse the dataset created previously:
import lancedb
db = lancedb.connect("data/sample-lancedb")
table = db.open_table("pd_table")
arrow_table = table.to_arrow()
DuckDB can directly query the arrow_table:
import duckdb
duckdb.query("SELECT * FROM arrow_table")
┌─────────────┬─────────┬────────┐
│ vector │ item │ price │
│ float[] │ varchar │ double │
├─────────────┼─────────┼────────┤
│ [3.1, 4.1] │ foo │ 10.0 │
│ [5.9, 26.5] │ bar │ 20.0 │
└─────────────┴─────────┴────────┘
duckdb.query("SELECT mean(price) FROM arrow_table")
┌─────────────┐
│ mean(price) │
│ double │
├─────────────┤
│ 15.0 │
└─────────────┘