mirror of https://github.com/lancedb/lancedb.git synced 2026-01-05 11:22:58 +00:00

Tevin Wang b731a6aed9 Add docs code testing & documentation syntax changes (#196)

- Creates testing files `md_testing.py` and `md_testing.js` for testing
python and nodejs code in markdown files in the documentation
This listens for HTML tags as well: `<!--[language] code code
code...-->` will create a set-up file to create some mock tables or to
fulfill some assumptions in the documentation.
- Creates a github action workflow that triggers every push/pr to
`docs/**`
- Modifies documentation so tests run (mostly indentation, some small
syntax errors and some missing imports)

A list of excluded files that we need to take a closer look at later on:
```javascript
const excludedFiles = [
  "../src/fts.md",
  "../src/embedding.md",
  "../src/examples/serverless_lancedb_with_s3_and_lambda.md",
  "../src/examples/serverless_qa_bot_with_modal_and_langchain.md",
  "../src/examples/youtube_transcript_bot_with_nodejs.md",
];
```
Many of them can't be done because we need the OpenAI API key :(.
`fts.md` has some issues with the library, I believe this is still
experimental?

Closes #170

---------

Co-authored-by: Will Jones <willjones127@gmail.com>

2023-06-28 11:07:26 -07:00

Integrations

Built on top of Apache Arrow, LanceDB is easy to integrate with the Python ecosystem, including Pandas, PyArrow and DuckDB.

Pandas and PyArrow

First, we need to connect to a LanceDB database.


import lancedb

db = lancedb.connect("data/sample-lancedb")

Next, write a Pandas DataFrame to LanceDB directly.

import pandas as pd

data = pd.DataFrame({
    "vector": [[3.1, 4.1], [5.9, 26.5]],
    "item": ["foo", "bar"],
    "price": [10.0, 20.0]
})
table = db.create_table("pd_table", data=data)

You will find detailed instructions for creating a dataset and an index in the Basic Operations and Indexing sections.

We can now perform similarity searches via LanceDB.

# Open the table previously created.
table = db.open_table("pd_table")

query_vector = [100, 100]
# Pandas DataFrame
df = table.search(query_vector).limit(1).to_df()
print(df)

        vector item  price        score
0  [5.9, 26.5]  bar   20.0  14257.05957
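
The score in the output above is consistent with a squared Euclidean (L2) distance between the query vector and the stored vector. The sketch below reproduces it in plain Python under that assumption (this page does not name the metric); `squared_l2` is a hypothetical helper introduced only for illustration and is not part of LanceDB.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors (hypothetical helper)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


query_vector = [100, 100]
# The two rows written to "pd_table" earlier, keyed by their "item" value.
rows = {"foo": [3.1, 4.1], "bar": [5.9, 26.5]}

# Assumption: the score is a squared L2 distance, so a smaller score means a closer match.
scores = {item: squared_l2(query_vector, vec) for item, vec in rows.items()}
nearest = min(scores, key=scores.get)
print(nearest, scores[nearest])
```

Under this assumption `bar` is the nearest row, and its computed value lands close to the 14257.05957 shown above. The small remaining difference is plausibly due to the vectors being stored as 32-bit floats, which is also an assumption.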

If you have a simple filter, it's faster to provide a where clause to LanceDB's search query. If you have more complex criteria, you can always apply the filter to the Pandas DataFrame returned by the search query.


# Apply the filter via LanceDB
results = table.search([100, 100]).where("price < 15").to_df()
assert len(results) == 1
assert results["item"].iloc[0] == "foo"

# Apply the filter via Pandas
df = table.search([100, 100]).to_df()
results = df[df.price < 15]
assert len(results) == 1
assert results["item"].iloc[0] == "foo"
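
For criteria that are awkward to express in a where clause, the Pandas route might look like the sketch below. The DataFrame is built by hand to stand in for a search result (an assumption: it has the same columns as the table created earlier, without the score column), and the specific conditions are illustrative only.

```python
import pandas as pd

# Stand-in for the DataFrame returned by a search (assumption: same columns
# as the table created earlier, minus the score column).
df = pd.DataFrame({
    "vector": [[3.1, 4.1], [5.9, 26.5]],
    "item": ["foo", "bar"],
    "price": [10.0, 20.0],
})

# Combine a numeric condition, a string condition, and a per-row check
# on the vector contents into a single boolean mask.
mask = (
    (df["price"] < 15)
    & df["item"].str.startswith("f")
    & df["vector"].apply(lambda v: max(v) < 10)
)
results = df[mask]
print(results["item"].tolist())
```

Building the mask from separate conditions keeps each criterion readable and lets plain Python run on columns, such as the vector lists, that a simple filter expression may not handle.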

DuckDB

LanceDB works with DuckDB via PyArrow integration.

Let us start by installing duckdb and lancedb.

pip install duckdb lancedb

We will reuse the dataset created previously:

import lancedb

db = lancedb.connect("data/sample-lancedb")
table = db.open_table("pd_table")
arrow_table = table.to_arrow()

DuckDB can directly query the arrow_table:

import duckdb 

duckdb.query("SELECT * FROM arrow_table")

┌─────────────┬─────────┬────────┐
│   vector    │  item   │ price  │
│   float[]   │ varchar │ double │
├─────────────┼─────────┼────────┤
│ [3.1, 4.1]  │ foo     │   10.0 │
│ [5.9, 26.5] │ bar     │   20.0 │
└─────────────┴─────────┴────────┘

duckdb.query("SELECT mean(price) FROM arrow_table")

┌─────────────┐
│ mean(price) │
│   double    │
├─────────────┤
│        15.0 │
└─────────────┘
