docs: add async examples to doc (#1941)

- added sync and async tabs for python examples
- moved python code to tests/docs

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
Author: QianZhu
Date: 2025-01-07 15:10:25 -08:00
Committed by: GitHub
parent 0b45ef93c0
commit 17c9e9afea
21 changed files with 3639 additions and 987 deletions

View File

@@ -146,7 +146,9 @@ nav:
       - Building Custom Rerankers: reranking/custom_reranker.md
       - Example: notebooks/lancedb_reranking.ipynb
       - Filtering: sql.md
-      - Versioning & Reproducibility: notebooks/reproducibility.ipynb
+      - Versioning & Reproducibility:
+          - sync API: notebooks/reproducibility.ipynb
+          - async API: notebooks/reproducibility_async.ipynb
       - Configuring Storage: guides/storage.md
       - Migration Guide: migration.md
       - Tuning retrieval performance:
@@ -278,7 +280,9 @@ nav:
       - Building Custom Rerankers: reranking/custom_reranker.md
       - Example: notebooks/lancedb_reranking.ipynb
       - Filtering: sql.md
-      - Versioning & Reproducibility: notebooks/reproducibility.ipynb
+      - Versioning & Reproducibility:
+          - sync API: notebooks/reproducibility.ipynb
+          - async API: notebooks/reproducibility_async.ipynb
       - Configuring Storage: guides/storage.md
       - Migration Guide: migration.md
       - Tuning retrieval performance:

View File

@@ -18,25 +18,24 @@ See the [indexing](concepts/index_ivfpq.md) concepts guide for more information
 Lance supports `IVF_PQ` index type by default.

 === "Python"
+    === "Sync API"

     Creating indexes is done via the [create_index](https://lancedb.github.io/lancedb/python/#lancedb.table.LanceTable.create_index) method.

-    ```python
-    import lancedb
-    import numpy as np
-
-    uri = "data/sample-lancedb"
-    db = lancedb.connect(uri)
-
-    # Create 10,000 sample vectors
-    data = [{"vector": row, "item": f"item {i}"}
-            for i, row in enumerate(np.random.random((10_000, 1536)).astype('float32'))]
-
-    # Add the vectors to a table
-    tbl = db.create_table("my_vectors", data=data)
-
-    # Create and train the index - you need to have enough data in the table for an effective training step
-    tbl.create_index(num_partitions=256, num_sub_vectors=96)
-    ```
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_index.py:import-lancedb"
+    --8<-- "python/python/tests/docs/test_guide_index.py:import-numpy"
+    --8<-- "python/python/tests/docs/test_guide_index.py:create_ann_index"
+    ```
+    === "Async API"
+
+    Creating indexes is done via the [create_index](https://lancedb.github.io/lancedb/python/#lancedb.table.LanceTable.create_index) method.
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_index.py:import-lancedb"
+    --8<-- "python/python/tests/docs/test_guide_index.py:import-numpy"
+    --8<-- "python/python/tests/docs/test_guide_index.py:import-lancedb-ivfpq"
+    --8<-- "python/python/tests/docs/test_guide_index.py:create_ann_index_async"
+    ```

 === "TypeScript"
@@ -127,6 +126,8 @@ You can specify the GPU device to train IVF partitions via
         accelerator="mps"
     )
     ```
+    !!! note
+        GPU based indexing is not yet supported with our asynchronous client.

 Troubleshooting:
@@ -152,14 +153,16 @@ There are a couple of parameters that can be used to fine-tune the search:
 === "Python"
+    === "Sync API"

-    ```python
-    tbl.search(np.random.random((1536))) \
-        .limit(2) \
-        .nprobes(20) \
-        .refine_factor(10) \
-        .to_pandas()
-    ```
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_index.py:vector_search"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_index.py:vector_search_async"
+    ```

     ```text
     vector item _distance
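The sync query that became the `vector_search` snippet appears in the removed lines. An async version might look like the sketch below, assuming the async query builder exposes the same `nprobes` and `refine_factor` knobs as the sync one:

```python
import numpy as np

async def tuned_search(tbl):
    # tbl is an async table, e.g. from `await db.open_table("my_vectors")`
    return await (
        tbl.query()
        .nearest_to(np.random.random((1536,)))
        .limit(2)
        .nprobes(20)        # probe more partitions for better recall
        .refine_factor(10)  # re-rank extra candidates with exact distances
        .to_pandas()
    )
```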
@@ -196,10 +199,16 @@ The search will return the data requested in addition to the distance of each it
 You can further filter the elements returned by a search using a where clause.

 === "Python"
+    === "Sync API"

-    ```python
-    tbl.search(np.random.random((1536))).where("item != 'item 1141'").to_pandas()
-    ```
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_index.py:vector_search_with_filter"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_index.py:vector_search_async_with_filter"
+    ```

 === "TypeScript"
@@ -221,10 +230,16 @@ You can select the columns returned by the query using a select clause.
 === "Python"
+    === "Sync API"

-    ```python
-    tbl.search(np.random.random((1536))).select(["vector"]).to_pandas()
-    ```
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_index.py:vector_search_with_select"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_index.py:vector_search_async_with_select"
+    ```

     ```text
     vector _distance

View File

@@ -10,28 +10,20 @@ LanceDB provides support for full-text search via Lance, allowing you to incorpo
 Consider a LanceDB table named `my_table` whose string column `text` we want to index and query via keyword search; the FTS index must be created before you can search via keywords.

 === "Python"
+    === "Sync API"

-    ```python
-    import lancedb
-
-    uri = "data/sample-lancedb"
-    db = lancedb.connect(uri)
-
-    table = db.create_table(
-        "my_table",
-        data=[
-            {"vector": [3.1, 4.1], "text": "Frodo was a happy puppy"},
-            {"vector": [5.9, 26.5], "text": "There are several kittens playing"},
-        ],
-    )
-
-    # passing `use_tantivy=False` to use lance FTS index
-    # `use_tantivy=True` by default
-    table.create_fts_index("text", use_tantivy=False)
-    table.search("puppy").limit(10).select(["text"]).to_list()
-    # [{'text': 'Frodo was a happy puppy', '_score': 0.6931471824645996}]
-    # ...
-    ```
+    ```python
+    --8<-- "python/python/tests/docs/test_search.py:import-lancedb"
+    --8<-- "python/python/tests/docs/test_search.py:import-lancedb-fts"
+    --8<-- "python/python/tests/docs/test_search.py:basic_fts"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_search.py:import-lancedb"
+    --8<-- "python/python/tests/docs/test_search.py:import-lancedb-fts"
+    --8<-- "python/python/tests/docs/test_search.py:basic_fts_async"
+    ```

 === "TypeScript"
@@ -93,22 +85,32 @@ By default the text is tokenized by splitting on punctuation and whitespaces, an
 Stemming is useful for improving search results by reducing words to their root form, e.g. "running" to "run". LanceDB supports stemming for multiple languages; you can specify the tokenizer name to enable stemming by the pattern `tokenizer_name="{language_code}_stem"`, e.g. `en_stem` for English.

 For example, to enable stemming for English:

-```python
-table.create_fts_index("text", use_tantivy=True, tokenizer_name="en_stem")
-```
+=== "Sync API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_search.py:fts_config_stem"
+    ```
+=== "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_search.py:fts_config_stem_async"
+    ```

 The following [languages](https://docs.rs/tantivy/latest/tantivy/tokenizer/enum.Language.html) are currently supported.

 The tokenizer is customizable: you can specify how the tokenizer splits the text, how it filters out words, and so on.

 For example, for languages with accents, you can specify the tokenizer to use `ascii_folding` to remove accents, e.g. 'é' to 'e':

-```python
-table.create_fts_index("text",
-    use_tantivy=False,
-    language="French",
-    stem=True,
-    ascii_folding=True)
-```
+=== "Sync API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_search.py:fts_config_folding"
+    ```
+=== "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_search.py:fts_config_folding_async"
+    ```

 ## Filtering
@@ -119,9 +121,16 @@ This can be invoked via the familiar `where` syntax.
 With pre-filtering:

 === "Python"
-    ```python
-    table.search("puppy").limit(10).where("meta='foo'", prefilte=True).to_list()
-    ```
+    === "Sync API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_search.py:fts_prefiltering"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_search.py:fts_prefiltering_async"
+    ```

 === "TypeScript"
@@ -151,9 +160,16 @@ With pre-filtering:
 With post-filtering:

 === "Python"
-    ```python
-    table.search("puppy").limit(10).where("meta='foo'", prefilte=False).to_list()
-    ```
+    === "Sync API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_search.py:fts_postfiltering"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_search.py:fts_postfiltering_async"
+    ```

 === "TypeScript"
@@ -191,9 +207,16 @@ or a **terms** search query like `old man sea`. For more details on the terms
 query syntax, see Tantivy's [query parser rules](https://docs.rs/tantivy/latest/tantivy/query/struct.QueryParser.html).

 To search for a phrase, the index must be created with `with_position=True`:

-```python
-table.create_fts_index("text", use_tantivy=False, with_position=True)
-```
+=== "Sync API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_search.py:fts_with_position"
+    ```
+=== "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_search.py:fts_with_position_async"
+    ```

 This will allow you to search for phrases, but it will also significantly increase the index size and indexing time.
@@ -205,10 +228,16 @@ This can make the query more efficient, especially when the table is large and t
 === "Python"
+    === "Sync API"

-    ```python
-    table.add([{"vector": [3.1, 4.1], "text": "Frodo was a happy puppy"}])
-    table.optimize()
-    ```
+    ```python
+    --8<-- "python/python/tests/docs/test_search.py:fts_incremental_index"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_search.py:fts_incremental_index_async"
+    ```

 === "TypeScript"

View File

@@ -2,7 +2,7 @@
 LanceDB also provides support for full-text search via [Tantivy](https://github.com/quickwit-oss/tantivy), allowing you to incorporate keyword-based search (based on BM25) in your retrieval solutions.

-The tantivy-based FTS is only available in Python and does not support building indexes on object storage or incremental indexing. If you need these features, try native FTS [native FTS](fts.md).
+The tantivy-based FTS is only available in Python synchronous APIs and does not support building indexes on object storage or incremental indexing. If you need these features, try the [native FTS](fts.md).

 ## Installation

View File

@@ -32,19 +32,20 @@ over scalar columns.
 ### Create a scalar index

 === "Python"
+    === "Sync API"

-    ```python
-    import lancedb
-
-    books = [
-        {"book_id": 1, "publisher": "plenty of books", "tags": ["fantasy", "adventure"]},
-        {"book_id": 2, "publisher": "book town", "tags": ["non-fiction"]},
-        {"book_id": 3, "publisher": "oreilly", "tags": ["textbook"]}
-    ]
-
-    db = lancedb.connect("./db")
-    table = db.create_table("books", books)
-    table.create_scalar_index("book_id")  # BTree by default
-    table.create_scalar_index("publisher", index_type="BITMAP")
-    ```
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_index.py:import-lancedb"
+    --8<-- "python/python/tests/docs/test_guide_index.py:import-lancedb-btree-bitmap"
+    --8<-- "python/python/tests/docs/test_guide_index.py:basic_scalar_index"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_index.py:import-lancedb"
+    --8<-- "python/python/tests/docs/test_guide_index.py:import-lancedb-btree-bitmap"
+    --8<-- "python/python/tests/docs/test_guide_index.py:basic_scalar_index_async"
+    ```

 === "Typescript"
@@ -62,12 +63,18 @@ The following scan will be faster if the column `book_id` has a scalar index:
 === "Python"
+    === "Sync API"

-    ```python
-    import lancedb
-
-    table = db.open_table("books")
-    my_df = table.search().where("book_id = 2").to_pandas()
-    ```
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_index.py:import-lancedb"
+    --8<-- "python/python/tests/docs/test_guide_index.py:search_with_scalar_index"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_index.py:import-lancedb"
+    --8<-- "python/python/tests/docs/test_guide_index.py:search_with_scalar_index_async"
+    ```

 === "Typescript"
@@ -88,22 +95,18 @@ Scalar indices can also speed up scans containing a vector search or full text s
 === "Python"
+    === "Sync API"

-    ```python
-    import lancedb
-
-    data = [
-        {"book_id": 1, "vector": [1, 2]},
-        {"book_id": 2, "vector": [3, 4]},
-        {"book_id": 3, "vector": [5, 6]}
-    ]
-    table = db.create_table("book_with_embeddings", data)
-
-    (
-        table.search([1, 2])
-        .where("book_id != 3", prefilter=True)
-        .to_pandas()
-    )
-    ```
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_index.py:import-lancedb"
+    --8<-- "python/python/tests/docs/test_guide_index.py:vector_search_with_scalar_index"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_index.py:import-lancedb"
+    --8<-- "python/python/tests/docs/test_guide_index.py:vector_search_with_scalar_index_async"
+    ```

 === "Typescript"
@@ -122,10 +125,16 @@ Scalar indices can also speed up scans containing a vector search or full text s
 Updating the table data (adding, deleting, or modifying records) requires that you also update the scalar index. This can be done by calling `optimize`, which will trigger an update to the existing scalar index.

 === "Python"
+    === "Sync API"

-    ```python
-    table.add([{"vector": [7, 8], "book_id": 4}])
-    table.optimize()
-    ```
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_index.py:update_scalar_index"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_index.py:update_scalar_index_async"
+    ```

 === "TypeScript"

View File

@@ -12,26 +12,50 @@ LanceDB OSS supports object stores such as AWS S3 (and compatible stores), Azure
 === "Python"

     AWS S3:

-    ```python
-    import lancedb
-    db = lancedb.connect("s3://bucket/path")
-    ```
+    === "Sync API"
+
+        ```python
+        import lancedb
+        db = lancedb.connect("s3://bucket/path")
+        ```
+    === "Async API"
+
+        ```python
+        import lancedb
+        async_db = await lancedb.connect_async("s3://bucket/path")
+        ```

     Google Cloud Storage:

-    ```python
-    import lancedb
-    db = lancedb.connect("gs://bucket/path")
-    ```
+    === "Sync API"
+
+        ```python
+        import lancedb
+        db = lancedb.connect("gs://bucket/path")
+        ```
+    === "Async API"
+
+        ```python
+        import lancedb
+        async_db = await lancedb.connect_async("gs://bucket/path")
+        ```

     Azure Blob Storage:

     <!-- skip-test -->
-    ```python
-    import lancedb
-    db = lancedb.connect("az://bucket/path")
-    ```
+    === "Sync API"
+
+        ```python
+        import lancedb
+        db = lancedb.connect("az://bucket/path")
+        ```
+    <!-- skip-test -->
+    === "Async API"
+
+        ```python
+        import lancedb
+        async_db = await lancedb.connect_async("az://bucket/path")
+        ```

 Note that for Azure, storage credentials must be configured. See [below](#azure-blob-storage) for more details.
@@ -94,13 +118,24 @@ If you only want this to apply to one particular connection, you can pass the `s
 === "Python"

-    ```python
-    import lancedb
-    db = await lancedb.connect_async(
-        "s3://bucket/path",
-        storage_options={"timeout": "60s"}
-    )
-    ```
+    === "Sync API"
+
+        ```python
+        import lancedb
+        db = lancedb.connect(
+            "s3://bucket/path",
+            storage_options={"timeout": "60s"}
+        )
+        ```
+    === "Async API"
+
+        ```python
+        import lancedb
+        async_db = await lancedb.connect_async(
+            "s3://bucket/path",
+            storage_options={"timeout": "60s"}
+        )
+        ```

 === "TypeScript"
@@ -128,15 +163,29 @@ Getting even more specific, you can set the `timeout` for only a particular tabl
 === "Python"

     <!-- skip-test -->
-    ```python
-    import lancedb
-    db = await lancedb.connect_async("s3://bucket/path")
-    table = await db.create_table(
-        "table",
-        [{"a": 1, "b": 2}],
-        storage_options={"timeout": "60s"}
-    )
-    ```
+    === "Sync API"
+
+        ```python
+        import lancedb
+        db = lancedb.connect("s3://bucket/path")
+        table = db.create_table(
+            "table",
+            [{"a": 1, "b": 2}],
+            storage_options={"timeout": "60s"}
+        )
+        ```
+    <!-- skip-test -->
+    === "Async API"
+
+        ```python
+        import lancedb
+        async_db = await lancedb.connect_async("s3://bucket/path")
+        async_table = await async_db.create_table(
+            "table",
+            [{"a": 1, "b": 2}],
+            storage_options={"timeout": "60s"}
+        )
+        ```

 === "TypeScript"
@@ -194,17 +243,32 @@ These can be set as environment variables or passed in the `storage_options` par
 === "Python"

-    ```python
-    import lancedb
-    db = await lancedb.connect_async(
-        "s3://bucket/path",
-        storage_options={
-            "aws_access_key_id": "my-access-key",
-            "aws_secret_access_key": "my-secret-key",
-            "aws_session_token": "my-session-token",
-        }
-    )
-    ```
+    === "Sync API"
+
+        ```python
+        import lancedb
+        db = lancedb.connect(
+            "s3://bucket/path",
+            storage_options={
+                "aws_access_key_id": "my-access-key",
+                "aws_secret_access_key": "my-secret-key",
+                "aws_session_token": "my-session-token",
+            }
+        )
+        ```
+    === "Async API"
+
+        ```python
+        import lancedb
+        async_db = await lancedb.connect_async(
+            "s3://bucket/path",
+            storage_options={
+                "aws_access_key_id": "my-access-key",
+                "aws_secret_access_key": "my-secret-key",
+                "aws_session_token": "my-session-token",
+            }
+        )
+        ```

 === "TypeScript"
@@ -348,12 +412,22 @@ name of the table to use.
 === "Python"

-    ```python
-    import lancedb
-    db = await lancedb.connect_async(
-        "s3+ddb://bucket/path?ddbTableName=my-dynamodb-table",
-    )
-    ```
+    === "Sync API"
+
+        ```python
+        import lancedb
+        db = lancedb.connect(
+            "s3+ddb://bucket/path?ddbTableName=my-dynamodb-table",
+        )
+        ```
+    === "Async API"
+
+        ```python
+        import lancedb
+        async_db = await lancedb.connect_async(
+            "s3+ddb://bucket/path?ddbTableName=my-dynamodb-table",
+        )
+        ```

 === "JavaScript"
@@ -441,16 +515,30 @@ LanceDB can also connect to S3-compatible stores, such as MinIO. To do so, you m
 === "Python"

-    ```python
-    import lancedb
-    db = await lancedb.connect_async(
-        "s3://bucket/path",
-        storage_options={
-            "region": "us-east-1",
-            "endpoint": "http://minio:9000",
-        }
-    )
-    ```
+    === "Sync API"
+
+        ```python
+        import lancedb
+        db = lancedb.connect(
+            "s3://bucket/path",
+            storage_options={
+                "region": "us-east-1",
+                "endpoint": "http://minio:9000",
+            }
+        )
+        ```
+    === "Async API"
+
+        ```python
+        import lancedb
+        async_db = await lancedb.connect_async(
+            "s3://bucket/path",
+            storage_options={
+                "region": "us-east-1",
+                "endpoint": "http://minio:9000",
+            }
+        )
+        ```

 === "TypeScript"
@@ -502,16 +590,30 @@ To configure LanceDB to use an S3 Express endpoint, you must set the storage opt
 === "Python"

-    ```python
-    import lancedb
-    db = await lancedb.connect_async(
-        "s3://my-bucket--use1-az4--x-s3/path",
-        storage_options={
-            "region": "us-east-1",
-            "s3_express": "true",
-        }
-    )
-    ```
+    === "Sync API"
+
+        ```python
+        import lancedb
+        db = lancedb.connect(
+            "s3://my-bucket--use1-az4--x-s3/path",
+            storage_options={
+                "region": "us-east-1",
+                "s3_express": "true",
+            }
+        )
+        ```
+    === "Async API"
+
+        ```python
+        import lancedb
+        async_db = await lancedb.connect_async(
+            "s3://my-bucket--use1-az4--x-s3/path",
+            storage_options={
+                "region": "us-east-1",
+                "s3_express": "true",
+            }
+        )
+        ```

 === "TypeScript"
@@ -552,15 +654,29 @@ GCS credentials are configured by setting the `GOOGLE_SERVICE_ACCOUNT` environme
 === "Python"

     <!-- skip-test -->
-    ```python
-    import lancedb
-    db = await lancedb.connect_async(
-        "gs://my-bucket/my-database",
-        storage_options={
-            "service_account": "path/to/service-account.json",
-        }
-    )
-    ```
+    === "Sync API"
+
+        ```python
+        import lancedb
+        db = lancedb.connect(
+            "gs://my-bucket/my-database",
+            storage_options={
+                "service_account": "path/to/service-account.json",
+            }
+        )
+        ```
+    <!-- skip-test -->
+    === "Async API"
+
+        ```python
+        import lancedb
+        async_db = await lancedb.connect_async(
+            "gs://my-bucket/my-database",
+            storage_options={
+                "service_account": "path/to/service-account.json",
+            }
+        )
+        ```

 === "TypeScript"
@@ -612,16 +728,31 @@ Azure Blob Storage credentials can be configured by setting the `AZURE_STORAGE_A
 === "Python"

     <!-- skip-test -->
-    ```python
-    import lancedb
-    db = await lancedb.connect_async(
-        "az://my-container/my-database",
-        storage_options={
-            account_name: "some-account",
-            account_key: "some-key",
-        }
-    )
-    ```
+    === "Sync API"
+
+        ```python
+        import lancedb
+        db = lancedb.connect(
+            "az://my-container/my-database",
+            storage_options={
+                "account_name": "some-account",
+                "account_key": "some-key",
+            }
+        )
+        ```
+    <!-- skip-test -->
+    === "Async API"
+
+        ```python
+        import lancedb
+        async_db = await lancedb.connect_async(
+            "az://my-container/my-database",
+            storage_options={
+                "account_name": "some-account",
+                "account_key": "some-key",
+            }
+        )
+        ```

 === "TypeScript"

View File

@@ -12,10 +12,18 @@ Initialize a LanceDB connection and create a table
 === "Python"

-    ```python
-    import lancedb
-
-    db = lancedb.connect("./.lancedb")
-    ```
+    === "Sync API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-lancedb"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:connect"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-lancedb"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:connect_async"
+    ```

 LanceDB allows ingesting data from various sources - `dict`, `list[dict]`, `pd.DataFrame`, `pa.Table` or an `Iterator[pa.RecordBatch]`. Let's take a look at some of these.
@@ -47,18 +55,16 @@ Initialize a LanceDB connection and create a table
 === "Python"

-    ```python
-    import lancedb
-
-    db = lancedb.connect("./.lancedb")
-
-    data = [{"vector": [1.1, 1.2], "lat": 45.5, "long": -122.7},
-            {"vector": [0.2, 1.8], "lat": 40.1, "long": -74.1}]
-
-    db.create_table("my_table", data)
-    db["my_table"].head()
-    ```
+    === "Sync API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:create_table"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:create_table_async"
+    ```
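The removed sync code created the table from a list of dicts and peeked at it with `head()`. An async sketch, where `count_rows` stands in as an assumed async way to inspect the result:

```python
import lancedb

async def create_table_async():
    db = await lancedb.connect_async("./.lancedb")
    data = [
        {"vector": [1.1, 1.2], "lat": 45.5, "long": -122.7},
        {"vector": [0.2, 1.8], "lat": 40.1, "long": -74.1},
    ]
    tbl = await db.create_table("my_table_async", data)
    return await tbl.count_rows()  # assumed async counterpart to peeking at the table
```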
 !!! info "Note"
     If the table already exists, LanceDB will raise an error by default.
@@ -67,16 +73,30 @@ Initialize a LanceDB connection and create a table
 and the table exists, then it simply opens the existing table. The data you
 passed in will NOT be appended to the table in that case.

-    ```python
-    db.create_table("name", data, exist_ok=True)
-    ```
+    === "Sync API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:create_table_exist_ok"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:create_table_async_exist_ok"
+    ```

 Sometimes you want to make sure that you start fresh. If you want to
 overwrite the table, you can pass in mode="overwrite" to the createTable function.

-    ```python
-    db.create_table("name", data, mode="overwrite")
-    ```
+    === "Sync API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:create_table_overwrite"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:create_table_async_overwrite"
+    ```

 === "Typescript[^1]"

     You can create a LanceDB table in JavaScript using an array of records as follows.
@@ -146,34 +166,37 @@ Initialize a LanceDB connection and create a table
 ### From a Pandas DataFrame

-    ```python
-    import pandas as pd
-
-    data = pd.DataFrame({
-        "vector": [[1.1, 1.2, 1.3, 1.4], [0.2, 1.8, 0.4, 3.6]],
-        "lat": [45.5, 40.1],
-        "long": [-122.7, -74.1]
-    })
-
-    db.create_table("my_table", data)
-    db["my_table"].head()
-    ```
+    === "Sync API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-pandas"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:create_table_from_pandas"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-pandas"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:create_table_async_from_pandas"
+    ```

 !!! info "Note"
     Data is converted to Arrow before being written to disk. For maximum control over how data is saved, either provide the PyArrow schema to convert to or else provide a PyArrow Table directly.

 The **`vector`** column needs to be a [Vector](../python/pydantic.md#vector-field) (defined as [pyarrow.FixedSizeList](https://arrow.apache.org/docs/python/generated/pyarrow.list_.html)) type.

-    ```python
-    custom_schema = pa.schema([
-        pa.field("vector", pa.list_(pa.float32(), 4)),
-        pa.field("lat", pa.float32()),
-        pa.field("long", pa.float32())
-    ])
-
-    table = db.create_table("my_table", data, schema=custom_schema)
-    ```
+    === "Sync API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-pyarrow"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:create_table_custom_schema"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-pyarrow"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:create_table_async_custom_schema"
+    ```
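Pieced together from the removed lines, the sync flow was roughly the following (the table name is illustrative); pinning the Arrow schema turns the `vector` column into a `FixedSizeList` instead of a plain list:

```python
import lancedb
import pandas as pd
import pyarrow as pa

db = lancedb.connect("./.lancedb")
data = pd.DataFrame({
    "vector": [[1.1, 1.2, 1.3, 1.4], [0.2, 1.8, 0.4, 3.6]],
    "lat": [45.5, 40.1],
    "long": [-122.7, -74.1],
})
custom_schema = pa.schema([
    pa.field("vector", pa.list_(pa.float32(), 4)),
    pa.field("lat", pa.float32()),
    pa.field("long", pa.float32()),
])
table = db.create_table("my_table_pandas", data, schema=custom_schema)
```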
 ### From a Polars DataFrame
@@ -182,45 +205,38 @@ written in Rust. Just like in Pandas, the Polars integration is enabled by PyArr
 under the hood. A deeper integration between LanceDB Tables and Polars DataFrames
 is on the way.

-    ```python
-    import polars as pl
-
-    data = pl.DataFrame({
-        "vector": [[3.1, 4.1], [5.9, 26.5]],
-        "item": ["foo", "bar"],
-        "price": [10.0, 20.0]
-    })
-    table = db.create_table("pl_table", data=data)
-    ```
+    === "Sync API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-polars"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:create_table_from_polars"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-polars"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:create_table_async_from_polars"
+    ```

 ### From an Arrow Table

 You can also create LanceDB tables directly from Arrow tables.
 LanceDB supports float16 data type!

 === "Python"
+    === "Sync API"

-    ```python
-    import pyarrows as pa
-    import numpy as np
-
-    dim = 16
-    total = 2
-    schema = pa.schema(
-        [
-            pa.field("vector", pa.list_(pa.float16(), dim)),
-            pa.field("text", pa.string())
-        ]
-    )
-    data = pa.Table.from_arrays(
-        [
-            pa.array([np.random.randn(dim).astype(np.float16) for _ in range(total)],
-                     pa.list_(pa.float16(), dim)),
-            pa.array(["foo", "bar"])
-        ],
-        ["vector", "text"],
-    )
-    tbl = db.create_table("f16_tbl", data, schema=schema)
-    ```
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-pyarrow"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-numpy"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:create_table_from_arrow_table"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-polars"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-numpy"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:create_table_async_from_arrow_table"
+    ```

 === "Typescript[^1]"
@@ -250,25 +266,22 @@ can be configured with the vector dimensions. It is also important to note that
 LanceDB only understands subclasses of `lancedb.pydantic.LanceModel`
 (which itself derives from `pydantic.BaseModel`).

-    ```python
-    from lancedb.pydantic import Vector, LanceModel
-
-    class Content(LanceModel):
-        movie_id: int
-        vector: Vector(128)
-        genres: str
-        title: str
-        imdb_id: int
-
-        @property
-        def imdb_url(self) -> str:
-            return f"https://www.imdb.com/title/tt{self.imdb_id}"
-
-    import pyarrow as pa
-
-    db = lancedb.connect("~/.lancedb")
-    table_name = "movielens_small"
-    table = db.create_table(table_name, schema=Content)
-    ```
+    === "Sync API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-lancedb-pydantic"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-pyarrow"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:class-Content"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:create_table_from_pydantic"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-lancedb-pydantic"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-pyarrow"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:class-Content"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:create_table_async_from_pydantic"
+    ```

 #### Nested schemas
@@ -277,22 +290,24 @@ For example, you may want to store the document string
 and the document source name as a nested Document object:

-    ```python
-    class Document(BaseModel):
-        content: str
-        source: str
-    ```
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-pydantic-basemodel"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:class-Document"
+    ```

 This can be used as the type of a LanceDB table column:

-    ```python
-    class NestedSchema(LanceModel):
-        id: str
-        vector: Vector(1536)
-        document: Document
-
-    tbl = db.create_table("nested_table", schema=NestedSchema, mode="overwrite")
-    ```
+    === "Sync API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:class-NestedSchema"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:create_table_nested_schema"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:class-NestedSchema"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:create_table_async_nested_schema"
+    ```

 This creates a struct column called "document" that has two subfields
 called "content" and "source":
@@ -356,29 +371,20 @@ LanceDB additionally supports PyArrow's `RecordBatch` Iterators or other generat
 Here's an example using a `RecordBatch` iterator for creating tables.

-    ```python
-    import pyarrow as pa
-
-    def make_batches():
-        for i in range(5):
-            yield pa.RecordBatch.from_arrays(
-                [
-                    pa.array([[3.1, 4.1, 5.1, 6.1], [5.9, 26.5, 4.7, 32.8]],
-                             pa.list_(pa.float32(), 4)),
-                    pa.array(["foo", "bar"]),
-                    pa.array([10.0, 20.0]),
-                ],
-                ["vector", "item", "price"],
-            )
-
-    schema = pa.schema([
-        pa.field("vector", pa.list_(pa.float32(), 4)),
-        pa.field("item", pa.utf8()),
-        pa.field("price", pa.float32()),
-    ])
-
-    db.create_table("batched_tale", make_batches(), schema=schema)
-    ```
+    === "Sync API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-pyarrow"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:make_batches"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:create_table_from_batch"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-pyarrow"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:make_batches"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:create_table_async_from_batch"
+    ```

 You can also use iterators of other types like Pandas DataFrame or Pylists directly in the above example.
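Reassembled from the removed lines (with the `batched_tale` table-name typo fixed), the sync version was approximately:

```python
import lancedb
import pyarrow as pa

def make_batches():
    # Lazily yield fixed-size-list vectors plus item/price columns
    for i in range(5):
        yield pa.RecordBatch.from_arrays(
            [
                pa.array([[3.1, 4.1, 5.1, 6.1], [5.9, 26.5, 4.7, 32.8]],
                         pa.list_(pa.float32(), 4)),
                pa.array(["foo", "bar"]),
                pa.array([10.0, 20.0]),
            ],
            ["vector", "item", "price"],
        )

schema = pa.schema([
    pa.field("vector", pa.list_(pa.float32(), 4)),
    pa.field("item", pa.utf8()),
    pa.field("price", pa.float32()),
])

db = lancedb.connect("./.lancedb")
db.create_table("batched_table", make_batches(), schema=schema)
```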
@@ -387,15 +393,29 @@ You can also use iterators of other types like Pandas DataFrame or Pylists direc
 === "Python"
     If you forget the name of your table, you can always get a listing of all table names.

-    ```python
-    print(db.table_names())
-    ```
+    === "Sync API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:list_tables"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:list_tables_async"
+    ```

     Then, you can open any existing tables.

-    ```python
-    tbl = db.open_table("my_table")
-    ```
+    === "Sync API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:open_table"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:open_table_async"
+    ```

 === "Typescript[^1]"
@@ -418,35 +438,41 @@ You can create an empty table for scenarios where you want to add data to the ta
 An empty table can be initialized via a PyArrow schema.

-    ```python
-    import lancedb
-    import pyarrow as pa
-
-    schema = pa.schema(
-        [
-            pa.field("vector", pa.list_(pa.float32(), 2)),
-            pa.field("item", pa.string()),
-            pa.field("price", pa.float32()),
-        ])
-    tbl = db.create_table("empty_table_add", schema=schema)
-    ```
+    === "Sync API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-lancedb"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-pyarrow"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:create_empty_table"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-lancedb"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-pyarrow"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:create_empty_table_async"
+    ```

 Alternatively, you can also use Pydantic to specify the schema for the empty table. Note that we do not
 directly import `pydantic` but instead use `lancedb.pydantic` which is a subclass of `pydantic.BaseModel`
 that has been extended to support LanceDB specific types like `Vector`.

-    ```python
-    import lancedb
-    from lancedb.pydantic import LanceModel, vector
-
-    class Item(LanceModel):
-        vector: Vector(2)
-        item: str
-        price: float
-
-    tbl = db.create_table("empty_table_add", schema=Item.to_arrow_schema())
-    ```
+    === "Sync API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-lancedb"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-lancedb-pydantic"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:class-Item"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:create_empty_table_pydantic"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-lancedb"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-lancedb-pydantic"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:class-Item"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:create_empty_table_async_pydantic"
+    ```

 Once the empty table has been created, you can add data to it via the various methods listed in the [Adding to a table](#adding-to-a-table) section.
@@ -473,86 +499,96 @@ After a table has been created, you can always add more data to it using the `ad
 ### Add a Pandas DataFrame

-    ```python
-    df = pd.DataFrame({
-        "vector": [[1.3, 1.4], [9.5, 56.2]], "item": ["banana", "apple"], "price": [5.0, 7.0]
-    })
-
-    tbl.add(df)
-    ```
+    === "Sync API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:add_table_from_pandas"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:add_table_async_from_pandas"
+    ```

 ### Add a Polars DataFrame

-    ```python
-    df = pl.DataFrame({
-        "vector": [[1.3, 1.4], [9.5, 56.2]], "item": ["banana", "apple"], "price": [5.0, 7.0]
-    })
-
-    tbl.add(df)
-    ```
+    === "Sync API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:add_table_from_polars"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:add_table_async_from_polars"
+    ```

 ### Add an Iterator

 You can also add a large dataset batch in one go using an iterator of any supported data type.

-    ```python
-    def make_batches():
-        for i in range(5):
-            yield [
-                {"vector": [3.1, 4.1], "item": "peach", "price": 6.0},
-                {"vector": [5.9, 26.5], "item": "pear", "price": 5.0}
-            ]
-
-    tbl.add(make_batches())
-    ```
+    === "Sync API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:make_batches_for_add"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:add_table_from_batch"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:make_batches_for_add"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:add_table_async_from_batch"
+    ```

 ### Add a PyArrow table

 If you have data coming in as a PyArrow table, you can add it directly to the LanceDB table.

-    ```python
-    pa_table = pa.Table.from_arrays(
-        [
-            pa.array([[9.1, 6.7], [9.9, 31.2]],
-                     pa.list_(pa.float32(), 2)),
-            pa.array(["mango", "orange"]),
-            pa.array([7.0, 4.0]),
-        ],
-        ["vector", "item", "price"],
-    )
-
-    tbl.add(pa_table)
-    ```
+    === "Sync API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:add_table_from_pyarrow"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:add_table_async_from_pyarrow"
+    ```

 ### Add a Pydantic Model

 Assuming that a table has been created with the correct schema as shown [above](#creating-empty-table), you can add data items that are valid Pydantic models to the table.

-    ```python
-    pydantic_model_items = [
-        Item(vector=[8.1, 4.7], item="pineapple", price=10.0),
-        Item(vector=[6.9, 9.3], item="avocado", price=9.0)
-    ]
-
-    tbl.add(pydantic_model_items)
-    ```
+    === "Sync API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:add_table_from_pydantic"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:add_table_async_from_pydantic"
+    ```

 ??? "Ingesting Pydantic models with LanceDB embedding API"
     When using LanceDB's embedding API, you can add Pydantic models directly to the table. LanceDB will automatically convert the `vector` field to a vector before adding it to the table. You need to specify the default value of the `vector` field as None to allow LanceDB to automatically vectorize the data.

-    ```python
-    import lancedb
-    from lancedb.pydantic import LanceModel, Vector
-    from lancedb.embeddings import get_registry
-
-    db = lancedb.connect("~/tmp")
-    embed_fcn = get_registry().get("huggingface").create(name="BAAI/bge-small-en-v1.5")
-
-    class Schema(LanceModel):
-        text: str = embed_fcn.SourceField()
-        vector: Vector(embed_fcn.ndims()) = embed_fcn.VectorField(default=None)
-
-    tbl = db.create_table("my_table", schema=Schema, mode="overwrite")
-    models = [Schema(text="hello"), Schema(text="world")]
-    tbl.add(models)
-    ```
+    === "Sync API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-lancedb"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-lancedb-pydantic"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-embeddings"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:create_table_with_embedding"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-lancedb"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-lancedb-pydantic"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-embeddings"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:create_table_async_with_embedding"
+    ```

 === "Typescript[^1]"
@@ -571,44 +607,41 @@ Use the `delete()` method on tables to delete rows from a table. To choose which
 === "Python"

-    ```python
-    tbl.delete('item = "fizz"')
-    ```
+    === "Sync API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:delete_row"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:delete_row_async"
+    ```

     ### Deleting row with specific column value

-    ```python
-    import lancedb
-
-    data = [{"x": 1, "vector": [1, 2]},
-            {"x": 2, "vector": [3, 4]},
-            {"x": 3, "vector": [5, 6]}]
-
-    db = lancedb.connect("./.lancedb")
-    table = db.create_table("my_table", data)
-    table.to_pandas()
-    #   x      vector
-    # 0 1  [1.0, 2.0]
-    # 1 2  [3.0, 4.0]
-    # 2 3  [5.0, 6.0]
-
-    table.delete("x = 2")
-    table.to_pandas()
-    #   x      vector
-    # 0 1  [1.0, 2.0]
-    # 1 3  [5.0, 6.0]
-    ```
+    === "Sync API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:delete_specific_row"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:delete_specific_row_async"
+    ```

     ### Delete from a list of values

-    ```python
-    to_remove = [1, 5]
-    to_remove = ", ".join(str(v) for v in to_remove)
-
-    table.delete(f"x IN ({to_remove})")
-    table.to_pandas()
-    #   x      vector
-    # 0 3  [5.0, 6.0]
-    ```
+    === "Sync API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:delete_list_values"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:delete_list_values_async"
+    ```

 === "Typescript[^1]"
@@ -659,27 +692,20 @@ This can be used to update zero to all rows depending on how many rows match the
 === "Python"

     API Reference: [lancedb.table.Table.update][]

-    ```python
-    import lancedb
-    import pandas as pd
-
-    # Create a lancedb connection
-    db = lancedb.connect("./.lancedb")
-
-    # Create a table from a pandas DataFrame
-    data = pd.DataFrame({"x": [1, 2, 3], "vector": [[1, 2], [3, 4], [5, 6]]})
-    table = db.create_table("my_table", data)
-
-    # Update the table where x = 2
-    table.update(where="x = 2", values={"vector": [10, 10]})
-
-    # Get the updated table as a pandas DataFrame
-    df = table.to_pandas()
-
-    # Print the DataFrame
-    print(df)
-    ```
+    === "Sync API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-lancedb"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-pandas"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:update_table"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-lancedb"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-pandas"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:update_table_async"
+    ```

 Output

 ```shell
@@ -734,13 +760,16 @@ This can be used to update zero to all rows depending on how many rows match the
 The `values` parameter is used to provide the new values for the columns as literal values. You can also use the `values_sql` / `valuesSql` parameter to provide SQL expressions for the new values. For example, you can use `values_sql="x + 1"` to increment the value of the `x` column by 1.

 === "Python"

-    ```python
-    # Update the table where x = 2
-    table.update(valuesSql={"x": "x + 1"})
-
-    print(table.to_pandas())
-    ```
+    === "Sync API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:update_table_sql"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:update_table_sql_async"
+    ```

 Output

 ```shell
@@ -771,11 +800,16 @@ This can be used to update zero to all rows depending on how many rows match the
 Use the `drop_table()` method on the database to remove a table.

 === "Python"
+    === "Sync API"

-    ```python
-    --8<-- "python/python/tests/docs/test_basic.py:drop_table"
-    --8<-- "python/python/tests/docs/test_basic.py:drop_table_async"
-    ```
+    ```python
+    --8<-- "python/python/tests/docs/test_basic.py:drop_table"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_basic.py:drop_table_async"
+    ```

     This permanently removes the table and is not recoverable, unlike deleting rows.
     By default, if the table does not exist an exception is raised. To suppress this,
@@ -809,9 +843,16 @@ data type for it.
 === "Python"
+    === "Sync API"

-    ```python
-    --8<-- "python/python/tests/docs/test_basic.py:add_columns"
-    ```
+    ```python
+    --8<-- "python/python/tests/docs/test_basic.py:add_columns"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_basic.py:add_columns_async"
+    ```

     **API Reference:** [lancedb.table.Table.add_columns][]

 === "Typescript"
@@ -848,10 +889,18 @@ rewriting the column, which can be a heavy operation.
 === "Python"
+    === "Sync API"

-    ```python
-    import pyarrow as pa
-
-    --8<-- "python/python/tests/docs/test_basic.py:alter_columns"
-    ```
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-pyarrow"
+    --8<-- "python/python/tests/docs/test_basic.py:alter_columns"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-pyarrow"
+    --8<-- "python/python/tests/docs/test_basic.py:alter_columns_async"
+    ```

     **API Reference:** [lancedb.table.Table.alter_columns][]

 === "Typescript"
@@ -872,9 +921,16 @@ will remove the column from the schema.
 === "Python"
+    === "Sync API"

-    ```python
-    --8<-- "python/python/tests/docs/test_basic.py:drop_columns"
-    ```
+    ```python
+    --8<-- "python/python/tests/docs/test_basic.py:drop_columns"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_basic.py:drop_columns_async"
+    ```

     **API Reference:** [lancedb.table.Table.drop_columns][]

 === "Typescript"
@@ -925,31 +981,46 @@ There are three possible settings for `read_consistency_interval`:
 To set strong consistency, use `timedelta(0)`:

-    ```python
-    from datetime import timedelta
-
-    db = lancedb.connect("./.lancedb", read_consistency_interval=timedelta(0))
-    table = db.open_table("my_table")
-    ```
+    === "Sync API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-datetime"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:table_strong_consistency"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-datetime"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:table_async_strong_consistency"
+    ```

 For eventual consistency, use a custom `timedelta`:

-    ```python
-    from datetime import timedelta
-
-    db = lancedb.connect("./.lancedb", read_consistency_interval=timedelta(seconds=5))
-    table = db.open_table("my_table")
-    ```
+    === "Sync API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-datetime"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:table_eventual_consistency"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:import-datetime"
+    --8<-- "python/python/tests/docs/test_guide_tables.py:table_async_eventual_consistency"
+    ```

 By default, a `Table` will never check for updates from other writers. To manually check for updates you can use `checkout_latest`:

-    ```python
-    db = lancedb.connect("./.lancedb")
-    table = db.open_table("my_table")
-
-    # (Other writes happen to my_table from another process)
-
-    # Check for updates
-    table.checkout_latest()
-    ```
+    === "Sync API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:table_checkout_latest"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_guide_tables.py:table_async_checkout_latest"
+    ```

 === "Typescript[^1]"
@@ -957,14 +1028,14 @@ There are three possible settings for `read_consistency_interval`:
     ```ts
     const db = await lancedb.connect({ uri: "./.lancedb", readConsistencyInterval: 0 });
-    const table = await db.openTable("my_table");
+    const tbl = await db.openTable("my_table");
     ```

     For eventual consistency, specify the update interval as seconds:

     ```ts
     const db = await lancedb.connect({ uri: "./.lancedb", readConsistencyInterval: 5 });
-    const table = await db.openTable("my_table");
+    const tbl = await db.openTable("my_table");
     ```

 <!-- Node doesn't yet support the version time travel: https://github.com/lancedb/lancedb/issues/1007

View File

@@ -5,57 +5,46 @@ LanceDB supports both semantic and keyword-based search (also termed full-text s
 ## Hybrid search in LanceDB

 You can perform hybrid search in LanceDB by combining the results of semantic and full-text search via a reranking algorithm of your choice. LanceDB provides multiple rerankers out of the box. However, you can always write a custom reranker if your use case needs more sophisticated logic.

-    ```python
-    import os
-
-    import lancedb
-    import openai
-    from lancedb.embeddings import get_registry
-    from lancedb.pydantic import LanceModel, Vector
-
-    db = lancedb.connect("~/.lancedb")
-
-    # Ingest embedding function in LanceDB table
-    # Configuring the environment variable OPENAI_API_KEY
-    if "OPENAI_API_KEY" not in os.environ:
-        # OR set the key here as a variable
-        openai.api_key = "sk-..."
-
-    embeddings = get_registry().get("openai").create()
-
-    class Documents(LanceModel):
-        vector: Vector(embeddings.ndims()) = embeddings.VectorField()
-        text: str = embeddings.SourceField()
-
-    table = db.create_table("documents", schema=Documents)
-
-    data = [
-        {"text": "rebel spaceships striking from a hidden base"},
-        {"text": "have won their first victory against the evil Galactic Empire"},
-        {"text": "during the battle rebel spies managed to steal secret plans"},
-        {"text": "to the Empire's ultimate weapon the Death Star"}
-    ]
-
-    # ingest docs with auto-vectorization
-    table.add(data)
-
-    # Create a fts index before the hybrid search
-    table.create_fts_index("text")
-
-    # hybrid search with default reranker
-    results = table.search("flower moon", query_type="hybrid").to_pandas()
-    ```
+    === "Sync API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_search.py:import-os"
+    --8<-- "python/python/tests/docs/test_search.py:import-openai"
+    --8<-- "python/python/tests/docs/test_search.py:import-lancedb"
+    --8<-- "python/python/tests/docs/test_search.py:import-embeddings"
+    --8<-- "python/python/tests/docs/test_search.py:import-pydantic"
+    --8<-- "python/python/tests/docs/test_search.py:import-lancedb-fts"
+    --8<-- "python/python/tests/docs/test_search.py:import-openai-embeddings"
+    --8<-- "python/python/tests/docs/test_search.py:class-Documents"
+    --8<-- "python/python/tests/docs/test_search.py:basic_hybrid_search"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_search.py:import-os"
+    --8<-- "python/python/tests/docs/test_search.py:import-openai"
+    --8<-- "python/python/tests/docs/test_search.py:import-lancedb"
+    --8<-- "python/python/tests/docs/test_search.py:import-embeddings"
+    --8<-- "python/python/tests/docs/test_search.py:import-pydantic"
+    --8<-- "python/python/tests/docs/test_search.py:import-lancedb-fts"
+    --8<-- "python/python/tests/docs/test_search.py:import-openai-embeddings"
+    --8<-- "python/python/tests/docs/test_search.py:class-Documents"
+    --8<-- "python/python/tests/docs/test_search.py:basic_hybrid_search_async"
+    ```

 !!! Note
     You can also pass the vector and text query manually. This is useful if you're not using the embedding API or if you're using a separate embedder service.
 ### Explicitly passing the vector and text query

-    ```python
-    vector_query = [0.1, 0.2, 0.3, 0.4, 0.5]
-    text_query = "flower moon"
-    results = (
-        table.search(query_type="hybrid")
-        .vector(vector_query)
-        .text(text_query)
-        .limit(5)
-        .to_pandas()
-    )
-    ```
+    === "Sync API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_search.py:hybrid_search_pass_vector_text"
+    ```
+    === "Async API"
+
+    ```python
+    --8<-- "python/python/tests/docs/test_search.py:hybrid_search_pass_vector_text_async"
+    ```

 By default, LanceDB uses `RRFReranker()`, which uses reciprocal rank fusion score, to combine and rerank the results of semantic and full-text search. You can customize the hyperparameters as needed or write your own custom reranker. Here's how you can use any of the available rerankers:
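As a usage sketch of swapping in a reranker explicitly (assuming `RRFReranker` lives in `lancedb.rerankers` and hybrid queries expose a `rerank` method):

```python
from lancedb.rerankers import RRFReranker  # assumed import path

def hybrid_with_reranker(table):
    reranker = RRFReranker()  # the default; any other reranker can be passed instead
    return (
        table.search("flower moon", query_type="hybrid")
        .rerank(reranker)
        .limit(5)
        .to_pandas()
    )
```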

File diff suppressed because it is too large

File diff suppressed because it is too large

View File

@@ -8,54 +8,55 @@ and PyArrow. The sequence of steps in a typical workflow is shown below.
First, we need to connect to a LanceDB database.

=== "Sync API"

```python
--8<-- "python/python/tests/docs/test_python.py:import-lancedb"
--8<-- "python/python/tests/docs/test_python.py:connect_to_lancedb"
```

=== "Async API"

```python
--8<-- "python/python/tests/docs/test_python.py:import-lancedb"
--8<-- "python/python/tests/docs/test_python.py:connect_to_lancedb_async"
```

We can load a Pandas `DataFrame` to LanceDB directly.

=== "Sync API"

```python
--8<-- "python/python/tests/docs/test_python.py:import-pandas"
--8<-- "python/python/tests/docs/test_python.py:create_table_pandas"
```

=== "Async API"

```python
--8<-- "python/python/tests/docs/test_python.py:import-pandas"
--8<-- "python/python/tests/docs/test_python.py:create_table_pandas_async"
```

Similar to the [`pyarrow.write_dataset()`](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html) method, LanceDB's
[`db.create_table()`](python.md/#lancedb.db.DBConnection.create_table) accepts data in a variety of forms.

If you have a dataset that is larger than memory, you can create a table with `Iterator[pyarrow.RecordBatch]` to lazily load the data, as sketched after the tabs below:

=== "Sync API"

```python
--8<-- "python/python/tests/docs/test_python.py:import-iterable"
--8<-- "python/python/tests/docs/test_python.py:import-pyarrow"
--8<-- "python/python/tests/docs/test_python.py:make_batches"
--8<-- "python/python/tests/docs/test_python.py:create_table_iterable"
```

=== "Async API"

```python
--8<-- "python/python/tests/docs/test_python.py:import-iterable"
--8<-- "python/python/tests/docs/test_python.py:import-pyarrow"
--8<-- "python/python/tests/docs/test_python.py:make_batches"
--8<-- "python/python/tests/docs/test_python.py:create_table_iterable_async"
```
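
The `make_batches` snippet referenced above is essentially a plain generator; a minimal sketch looks like this (the full version lives in `python/python/tests/docs/test_python.py`):

```python
from typing import Iterable

import pyarrow as pa


def make_batches() -> Iterable[pa.RecordBatch]:
    # Each batch is produced on demand, so the full dataset
    # never needs to fit in memory at once.
    for _ in range(5):
        yield pa.RecordBatch.from_arrays(
            [
                pa.array([[3.1, 4.1], [5.9, 26.5]]),
                pa.array(["foo", "bar"]),
                pa.array([10.0, 20.0]),
            ],
            ["vector", "item", "price"],
        )
```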
You will find detailed instructions of creating a LanceDB dataset in
[Getting Started](../basic.md#quick-start) and [API](python.md/#lancedb.db.DBConnection.create_table)
@@ -65,15 +66,16 @@ sections.
We can now perform similarity search via the LanceDB Python API.

=== "Sync API"

```python
--8<-- "python/python/tests/docs/test_python.py:vector_search"
```

=== "Async API"

```python
--8<-- "python/python/tests/docs/test_python.py:vector_search_async"
```

```
vector item price _distance
```
@@ -83,16 +85,13 @@ print(df)
If you have a simple filter, it's faster to provide a `where` clause to LanceDB's `search` method.
For more complex filters or aggregations, you can always resort to using the underlying `DataFrame` methods after performing a search.

=== "Sync API"

```python
--8<-- "python/python/tests/docs/test_python.py:vector_search_with_filter"
```

=== "Async API"

```python
--8<-- "python/python/tests/docs/test_python.py:vector_search_with_filter_async"
```
@@ -2,38 +2,29 @@
LanceDB supports [Polars](https://github.com/pola-rs/polars), a blazingly fast DataFrame library for Python written in Rust. Just like in Pandas, the Polars integration is enabled by PyArrow under the hood. A deeper integration between Lance Tables and Polars DataFrames is in progress, but at the moment, you can read a Polars DataFrame into LanceDB and output the search results from a query to a Polars DataFrame.

## Create & Query LanceDB Table

### From Polars DataFrame

First, we connect to a LanceDB database.

```py
--8<-- "python/python/tests/docs/test_python.py:import-lancedb"
--8<-- "python/python/tests/docs/test_python.py:connect_to_lancedb"
```

We can load a Polars `DataFrame` to LanceDB directly.

```py
--8<-- "python/python/tests/docs/test_python.py:import-polars"
--8<-- "python/python/tests/docs/test_python.py:create_table_polars"
```

We can now perform similarity search via the LanceDB Python API.

```py
--8<-- "python/python/tests/docs/test_python.py:vector_search_polars"
```

In addition to the selected columns, LanceDB also returns a vector
@@ -59,33 +50,16 @@ Note that the type of the result from a table search is a Polars DataFrame.
Alternately, we can create an empty LanceDB Table using a Pydantic schema and populate it with a Polars DataFrame.

```py
--8<-- "python/python/tests/docs/test_python.py:import-polars"
--8<-- "python/python/tests/docs/test_python.py:import-lancedb-pydantic"
--8<-- "python/python/tests/docs/test_python.py:class_Item"
--8<-- "python/python/tests/docs/test_python.py:create_table_pydantic"
```

The table can now be queried as usual.

```py
--8<-- "python/python/tests/docs/test_python.py:vector_search_polars"
```
@@ -108,8 +82,7 @@ As you iterate on your application, you'll likely need to work with the whole ta
LanceDB tables can also be converted directly into a polars LazyFrame for further processing.

```python
--8<-- "python/python/tests/docs/test_python.py:dump_table_lazyform"
```

Unlike the search result from a query, we can see that the type of the result is a LazyFrame.
@@ -121,7 +94,7 @@ Unlike the search result from a query, we can see that the type of the result is
We can now work with the LazyFrame as we would in Polars, and collect the first result.

```python
--8<-- "python/python/tests/docs/test_python.py:print_table_lazyform"
```
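
From here, ordinary Polars LazyFrame operations apply; for example, a sketch of a lazy filter (assuming the `price` column from the examples above):

```python
import polars as pl

# The filter is only evaluated when .collect() materializes the result.
cheap = ldf.filter(pl.col("price") < 15).collect()
print(cheap)
```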
@@ -36,14 +36,14 @@ tbl = db.create_table("test", data)
reranker = CohereReranker(api_key="your_api_key")

# Run vector search with a reranker
result = tbl.search("hello").rerank(reranker).to_list()

# Run FTS search with a reranker
result = tbl.search("hello", query_type="fts").rerank(reranker).to_list()

# Run hybrid search with a reranker
tbl.create_fts_index("text")
result = tbl.search("hello", query_type="hybrid").rerank(reranker).to_list()
```

### Multi-vector reranking
@@ -44,18 +44,16 @@ db.create_table("my_vectors", data=data)
=== "Python" === "Python"
```python === "Sync API"
import lancedb
import numpy as np
db = lancedb.connect("data/sample-lancedb") ```python
--8<-- "python/python/tests/docs/test_search.py:exhaustive_search"
```
=== "Async API"
tbl = db.open_table("my_vectors") ```python
--8<-- "python/python/tests/docs/test_search.py:exhaustive_search_async"
df = tbl.search(np.random.random((1536))) \ ```
.limit(10) \
.to_list()
```
=== "TypeScript" === "TypeScript"
@@ -81,12 +79,16 @@ By default, `l2` will be used as metric type. You can specify the metric type as
=== "Python" === "Python"
```python === "Sync API"
df = tbl.search(np.random.random((1536))) \
.metric("cosine") \ ```python
.limit(10) \ --8<-- "python/python/tests/docs/test_search.py:exhaustive_search_cosine"
.to_list() ```
``` === "Async API"
```python
--8<-- "python/python/tests/docs/test_search.py:exhaustive_search_async_cosine"
```
=== "TypeScript" === "TypeScript"
@@ -142,40 +144,28 @@ LanceDB returns vector search results via different formats commonly used in pyt
Let's create a LanceDB table with a nested schema:

=== "Python"
=== "Sync API"

```python
--8<-- "python/python/tests/docs/test_search.py:import-datetime"
--8<-- "python/python/tests/docs/test_search.py:import-lancedb"
--8<-- "python/python/tests/docs/test_search.py:import-lancedb-pydantic"
--8<-- "python/python/tests/docs/test_search.py:import-numpy"
--8<-- "python/python/tests/docs/test_search.py:import-pydantic-base-model"
--8<-- "python/python/tests/docs/test_search.py:class-definition"
--8<-- "python/python/tests/docs/test_search.py:create_table_with_nested_schema"
```

=== "Async API"

```python
--8<-- "python/python/tests/docs/test_search.py:import-datetime"
--8<-- "python/python/tests/docs/test_search.py:import-lancedb"
--8<-- "python/python/tests/docs/test_search.py:import-lancedb-pydantic"
--8<-- "python/python/tests/docs/test_search.py:import-numpy"
--8<-- "python/python/tests/docs/test_search.py:import-pydantic-base-model"
--8<-- "python/python/tests/docs/test_search.py:class-definition"
--8<-- "python/python/tests/docs/test_search.py:create_table_async_with_nested_schema"
```

### As a PyArrow table
@@ -184,17 +174,31 @@ Let's create a LanceDB table with a nested schema:
the addition of an `_distance` column for vector search or a `score`
column for full text search.

=== "Sync API"

```python
--8<-- "python/python/tests/docs/test_search.py:search_result_as_pyarrow"
```

=== "Async API"

```python
--8<-- "python/python/tests/docs/test_search.py:search_result_async_as_pyarrow"
```

### As a Pandas DataFrame

You can also get the results as a pandas dataframe.

=== "Sync API"

```python
--8<-- "python/python/tests/docs/test_search.py:search_result_as_pandas"
```

=== "Async API"

```python
--8<-- "python/python/tests/docs/test_search.py:search_result_async_as_pandas"
```

While other formats like Arrow/Pydantic/Python dicts have a natural
way to handle nested schemas, pandas can only store nested data as a
@@ -202,33 +206,50 @@ Let's create a LanceDB table with a nested schema:
So for convenience, you can also tell LanceDB to flatten a nested schema
when creating the pandas dataframe.

=== "Sync API"

```python
--8<-- "python/python/tests/docs/test_search.py:search_result_as_pandas_flatten_true"
```

If your table has a deeply nested struct, you can control how many levels
of nesting to flatten by passing in a positive integer.

=== "Sync API"

```python
--8<-- "python/python/tests/docs/test_search.py:search_result_as_pandas_flatten_1"
```

!!! note
    `flatten` is not yet supported with our asynchronous client.

### As a list of Python dicts

You can of course return results as a list of python dicts.

=== "Sync API"

```python
--8<-- "python/python/tests/docs/test_search.py:search_result_as_list"
```

=== "Async API"

```python
--8<-- "python/python/tests/docs/test_search.py:search_result_async_as_list"
```

### As a list of Pydantic models

We can add data using Pydantic models, and we can certainly
retrieve results as Pydantic models.

=== "Sync API"

```python
--8<-- "python/python/tests/docs/test_search.py:search_result_as_pydantic"
```

!!! note
    `to_pydantic()` is not yet supported with our asynchronous client.

Note that in this case the extra `_distance` field is discarded since
it's not part of the LanceSchema.
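
A short sketch of consuming those models (same `LanceSchema` as above; nested fields come back as typed attributes):

```python
for row in tbl.search(np.random.randn(1536)).limit(5).to_pydantic(LanceSchema):
    # Nested pydantic fields are reconstructed, so attribute access works.
    print(row.id, row.payload.content, row.payload.meta.source)
```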
@@ -15,13 +15,18 @@ Similarly, a highly selective post-filter can lead to false positives. Increasin
```python
import lancedb
import numpy as np

uri = "data/sample-lancedb"
data = [{"vector": row, "item": f"item {i}", "id": i}
        for i, row in enumerate(np.random.random((10_000, 2)).astype('int'))]

# Synchronous client
db = lancedb.connect(uri)
tbl = db.create_table("my_vectors", data=data)

# Asynchronous client
async_db = await lancedb.connect_async(uri)
async_tbl = await async_db.create_table("my_vectors_async", data=data)
```
-->

<!-- Setup Code
@@ -39,13 +44,11 @@ const tbl = await db.createTable('myVectors', data)
=== "Python" === "Python"
```py ```python
result = ( # Synchronous client
tbl.search([0.5, 0.2]) result = tbl.search([0.5, 0.2]).where("id = 10", prefilter=True).limit(1).to_arrow()
.where("id = 10", prefilter=True) # Asynchronous client
.limit(1) result = await async_tbl.query().where("id = 10").nearest_to([0.5, 0.2]).limit(1).to_arrow()
.to_arrow()
)
``` ```
=== "TypeScript" === "TypeScript"
@@ -88,9 +91,17 @@ For example, the following filter string is acceptable:
=== "Python" === "Python"
```python ```python
tbl.search([100, 102]) \ # Synchronous client
.where("(item IN ('item 0', 'item 2')) AND (id > 10)") \ tbl.search([100, 102]).where(
.to_arrow() "(item IN ('item 0', 'item 2')) AND (id > 10)"
).to_arrow()
# Asynchronous client
await (
async_tbl.query()
.where("(item IN ('item 0', 'item 2')) AND (id > 10)")
.nearest_to([100, 102])
.to_arrow()
)
``` ```
=== "TypeScript" === "TypeScript"
@@ -168,7 +179,10 @@ You can also filter your data without search:
=== "Python" === "Python"
```python ```python
# Synchronous client
tbl.search().where("id = 10").limit(10).to_arrow() tbl.search().where("id = 10").limit(10).to_arrow()
# Asynchronous client
await async_tbl.query().where("id = 10").limit(10).to_arrow()
``` ```
=== "TypeScript" === "TypeScript"
@@ -12,6 +12,8 @@ excluded_globs = [
"../src/integrations/*.md", "../src/integrations/*.md",
"../src/guides/tables.md", "../src/guides/tables.md",
"../src/python/duckdb.md", "../src/python/duckdb.md",
"../src/python/pandas_and_pyarrow.md",
"../src/python/polars_arrow.md",
"../src/embeddings/*.md", "../src/embeddings/*.md",
"../src/concepts/*.md", "../src/concepts/*.md",
"../src/ann_indexes.md", "../src/ann_indexes.md",
@@ -23,9 +25,10 @@ excluded_globs = [
"../src/embeddings/available_embedding_models/text_embedding_functions/*.md", "../src/embeddings/available_embedding_models/text_embedding_functions/*.md",
"../src/embeddings/available_embedding_models/multimodal_embedding_functions/*.md", "../src/embeddings/available_embedding_models/multimodal_embedding_functions/*.md",
"../src/rag/*.md", "../src/rag/*.md",
"../src/rag/advanced_techniques/*.md" "../src/rag/advanced_techniques/*.md",
"../src/guides/scalar_index.md",
"../src/guides/storage.md",
"../src/search.md"
] ]
python_prefix = "py" python_prefix = "py"
@@ -125,7 +125,7 @@ async def test_quickstart_async():
# --8<-- [start:create_table_async]
# Asynchronous client
async_tbl = await async_db.create_table("my_table_async", data=data)
# --8<-- [end:create_table_async]
df = pd.DataFrame(
@@ -137,17 +137,17 @@ async def test_quickstart_async():
# --8<-- [start:create_table_async_pandas]
# Asynchronous client
async_tbl = await async_db.create_table("table_from_df_async", df)
# --8<-- [end:create_table_async_pandas]
schema = pa.schema([pa.field("vector", pa.list_(pa.float32(), list_size=2))])
# --8<-- [start:create_empty_table_async]
# Asynchronous client
async_tbl = await async_db.create_table("empty_table_async", schema=schema)
# --8<-- [end:create_empty_table_async]
# --8<-- [start:open_table_async]
# Asynchronous client
async_tbl = await async_db.open_table("my_table_async")
# --8<-- [end:open_table_async]
# --8<-- [start:table_names_async]
# Asynchronous client
@@ -161,6 +161,22 @@ async def test_quickstart_async():
data = [{"vector": [x, x], "item": "filler", "price": x * x} for x in range(1000)] data = [{"vector": [x, x], "item": "filler", "price": x * x} for x in range(1000)]
await async_tbl.add(data) await async_tbl.add(data)
# --8<-- [start:vector_search_async] # --8<-- [start:vector_search_async]
# --8<-- [start:add_columns_async]
await async_tbl.add_columns({"double_price": "cast((price * 2) as float)"})
# --8<-- [end:add_columns_async]
# --8<-- [start:alter_columns_async]
await async_tbl.alter_columns(
{
"path": "double_price",
"rename": "dbl_price",
"data_type": pa.float64(),
"nullable": True,
}
)
# --8<-- [end:alter_columns_async]
# --8<-- [start:drop_columns_async]
await async_tbl.drop_columns(["dbl_price"])
# --8<-- [end:drop_columns_async]
# Asynchronous client # Asynchronous client
await async_tbl.vector_search([100, 100]).limit(2).to_pandas() await async_tbl.vector_search([100, 100]).limit(2).to_pandas()
# --8<-- [end:vector_search_async] # --8<-- [end:vector_search_async]
@@ -174,5 +190,5 @@ async def test_quickstart_async():
# --8<-- [end:delete_rows_async]
# --8<-- [start:drop_table_async]
# Asynchronous client
await async_db.drop_table("my_table_async")
# --8<-- [end:drop_table_async]
@@ -0,0 +1,169 @@
# --8<-- [start:import-lancedb]
import lancedb
# --8<-- [end:import-lancedb]
# --8<-- [start:import-lancedb-ivfpq]
from lancedb.index import IvfPq
# --8<-- [end:import-lancedb-ivfpq]
# --8<-- [start:import-lancedb-btree-bitmap]
from lancedb.index import BTree, Bitmap
# --8<-- [end:import-lancedb-btree-bitmap]
# --8<-- [start:import-numpy]
import numpy as np
# --8<-- [end:import-numpy]
import pytest
def test_ann_index():
# --8<-- [start:create_ann_index]
uri = "data/sample-lancedb"
# Create 5,000 sample vectors
data = [
{"vector": row, "item": f"item {i}"}
for i, row in enumerate(np.random.random((5_000, 32)).astype("float32"))
]
db = lancedb.connect(uri)
# Add the vectors to a table
tbl = db.create_table("my_vectors", data=data)
# Create and train the index - you need to have enough data in the table
# for an effective training step
tbl.create_index(num_partitions=2, num_sub_vectors=4)
# --8<-- [end:create_ann_index]
# --8<-- [start:vector_search]
tbl.search(np.random.random((32))).limit(2).nprobes(20).refine_factor(
10
).to_pandas()
# --8<-- [end:vector_search]
# --8<-- [start:vector_search_with_filter]
tbl.search(np.random.random((32))).where("item != 'item 1141'").to_pandas()
# --8<-- [end:vector_search_with_filter]
# --8<-- [start:vector_search_with_select]
tbl.search(np.random.random((32))).select(["vector"]).to_pandas()
# --8<-- [end:vector_search_with_select]
@pytest.mark.asyncio
async def test_ann_index_async():
# --8<-- [start:create_ann_index_async]
uri = "data/sample-lancedb"
# Create 5,000 sample vectors
data = [
{"vector": row, "item": f"item {i}"}
for i, row in enumerate(np.random.random((5_000, 32)).astype("float32"))
]
async_db = await lancedb.connect_async(uri)
# Add the vectors to a table
async_tbl = await async_db.create_table("my_vectors_async", data=data)
# Create and train the index - you need to have enough data in the table
# for an effective training step
await async_tbl.create_index(
"vector", config=IvfPq(num_partitions=2, num_sub_vectors=4)
)
# --8<-- [end:create_ann_index_async]
# --8<-- [start:vector_search_async]
await (
async_tbl.query()
.nearest_to(np.random.random((32)))
.limit(2)
.nprobes(20)
.refine_factor(10)
.to_pandas()
)
# --8<-- [end:vector_search_async]
# --8<-- [start:vector_search_async_with_filter]
await (
async_tbl.query()
.nearest_to(np.random.random((32)))
.where("item != 'item 1141'")
.to_pandas()
)
# --8<-- [end:vector_search_async_with_filter]
# --8<-- [start:vector_search_async_with_select]
await (
async_tbl.query()
.nearest_to(np.random.random((32)))
.select(["vector"])
.to_pandas()
)
# --8<-- [end:vector_search_async_with_select]
def test_scalar_index():
# --8<-- [start:basic_scalar_index]
uri = "data/sample-lancedb"
db = lancedb.connect(uri)
books = [
{
"book_id": 1,
"publisher": "plenty of books",
"tags": ["fantasy", "adventure"],
},
{"book_id": 2, "publisher": "book town", "tags": ["non-fiction"]},
{"book_id": 3, "publisher": "oreilly", "tags": ["textbook"]},
]
table = db.create_table("books", books)
table.create_scalar_index("book_id") # BTree by default
table.create_scalar_index("publisher", index_type="BITMAP")
# --8<-- [end:basic_scalar_index]
# --8<-- [start:search_with_scalar_index]
table = db.open_table("books")
table.search().where("book_id = 2").to_pandas()
# --8<-- [end:search_with_scalar_index]
# --8<-- [start:vector_search_with_scalar_index]
data = [
{"book_id": 1, "vector": [1, 2]},
{"book_id": 2, "vector": [3, 4]},
{"book_id": 3, "vector": [5, 6]},
]
table = db.create_table("book_with_embeddings", data)
(table.search([1, 2]).where("book_id != 3", prefilter=True).to_pandas())
# --8<-- [end:vector_search_with_scalar_index]
# --8<-- [start:update_scalar_index]
table.add([{"vector": [7, 8], "book_id": 4}])
table.optimize()
# --8<-- [end:update_scalar_index]
@pytest.mark.asyncio
async def test_scalar_index_async():
# --8<-- [start:basic_scalar_index_async]
uri = "data/sample-lancedb"
async_db = await lancedb.connect_async(uri)
books = [
{
"book_id": 1,
"publisher": "plenty of books",
"tags": ["fantasy", "adventure"],
},
{"book_id": 2, "publisher": "book town", "tags": ["non-fiction"]},
{"book_id": 3, "publisher": "oreilly", "tags": ["textbook"]},
]
async_tbl = await async_db.create_table("books_async", books)
await async_tbl.create_index("book_id", config=BTree()) # BTree by default
await async_tbl.create_index("publisher", config=Bitmap())
# --8<-- [end:basic_scalar_index_async]
# --8<-- [start:search_with_scalar_index_async]
async_tbl = await async_db.open_table("books_async")
await async_tbl.query().where("book_id = 2").to_pandas()
# --8<-- [end:search_with_scalar_index_async]
# --8<-- [start:vector_search_with_scalar_index_async]
data = [
{"book_id": 1, "vector": [1, 2]},
{"book_id": 2, "vector": [3, 4]},
{"book_id": 3, "vector": [5, 6]},
]
async_tbl = await async_db.create_table("book_with_embeddings_async", data)
(await async_tbl.query().where("book_id != 3").nearest_to([1, 2]).to_pandas())
# --8<-- [end:vector_search_with_scalar_index_async]
# --8<-- [start:update_scalar_index_async]
await async_tbl.add([{"vector": [7, 8], "book_id": 4}])
await async_tbl.optimize()
# --8<-- [end:update_scalar_index_async]
@@ -0,0 +1,576 @@
# --8<-- [start:import-lancedb]
import lancedb
# --8<-- [end:import-lancedb]
# --8<-- [start:import-pandas]
import pandas as pd
# --8<-- [end:import-pandas]
# --8<-- [start:import-pyarrow]
import pyarrow as pa
# --8<-- [end:import-pyarrow]
# --8<-- [start:import-polars]
import polars as pl
# --8<-- [end:import-polars]
# --8<-- [start:import-numpy]
import numpy as np
# --8<-- [end:import-numpy]
# --8<-- [start:import-lancedb-pydantic]
from lancedb.pydantic import Vector, LanceModel
# --8<-- [end:import-lancedb-pydantic]
# --8<-- [start:import-datetime]
from datetime import timedelta
# --8<-- [end:import-datetime]
# --8<-- [start:import-embeddings]
from lancedb.embeddings import get_registry
# --8<-- [end:import-embeddings]
# --8<-- [start:import-pydantic-basemodel]
from pydantic import BaseModel
# --8<-- [end:import-pydantic-basemodel]
import pytest
# --8<-- [start:class-Content]
class Content(LanceModel):
movie_id: int
vector: Vector(128)
genres: str
title: str
imdb_id: int
@property
def imdb_url(self) -> str:
return f"https://www.imdb.com/title/tt{self.imdb_id}"
# --8<-- [end:class-Content]
# --8<-- [start:class-Document]
class Document(BaseModel):
content: str
source: str
# --8<-- [end:class-Document]
# --8<-- [start:class-NestedSchema]
class NestedSchema(LanceModel):
id: str
vector: Vector(1536)
document: Document
# --8<-- [end:class-NestedSchema]
# --8<-- [start:class-Item]
class Item(LanceModel):
vector: Vector(2)
item: str
price: float
# --8<-- [end:class-Item]
# --8<-- [start:make_batches]
def make_batches():
for i in range(5):
yield pa.RecordBatch.from_arrays(
[
pa.array(
[[3.1, 4.1, 5.1, 6.1], [5.9, 26.5, 4.7, 32.8]],
pa.list_(pa.float32(), 4),
),
pa.array(["foo", "bar"]),
pa.array([10.0, 20.0]),
],
["vector", "item", "price"],
)
# --8<-- [end:make_batches]
# --8<-- [start:make_batches_for_add]
def make_batches_for_add():
for i in range(5):
yield [
{"vector": [3.1, 4.1], "item": "peach", "price": 6.0},
{"vector": [5.9, 26.5], "item": "pear", "price": 5.0},
]
# --8<-- [end:make_batches_for_add]
def test_table():
# --8<-- [start:connect]
uri = "data/sample-lancedb"
db = lancedb.connect(uri)
# --8<-- [end:connect]
# --8<-- [start:create_table]
data = [
{"vector": [1.1, 1.2], "lat": 45.5, "long": -122.7},
{"vector": [0.2, 1.8], "lat": 40.1, "long": -74.1},
]
db.create_table("test_table", data)
db["test_table"].head()
# --8<-- [end:create_table]
# --8<-- [start:create_table_exist_ok]
db.create_table("test_table", data, exist_ok=True)
# --8<-- [end:create_table_exist_ok]
# --8<-- [start:create_table_overwrite]
db.create_table("test_table", data, mode="overwrite")
# --8<-- [end:create_table_overwrite]
# --8<-- [start:create_table_from_pandas]
data = pd.DataFrame(
{
"vector": [[1.1, 1.2, 1.3, 1.4], [0.2, 1.8, 0.4, 3.6]],
"lat": [45.5, 40.1],
"long": [-122.7, -74.1],
}
)
db.create_table("my_table_pandas", data)
db["my_table_pandas"].head()
# --8<-- [end:create_table_from_pandas]
# --8<-- [start:create_table_custom_schema]
custom_schema = pa.schema(
[
pa.field("vector", pa.list_(pa.float32(), 4)),
pa.field("lat", pa.float32()),
pa.field("long", pa.float32()),
]
)
tbl = db.create_table("my_table_custom_schema", data, schema=custom_schema)
# --8<-- [end:create_table_custom_schema]
# --8<-- [start:create_table_from_polars]
data = pl.DataFrame(
{
"vector": [[3.1, 4.1], [5.9, 26.5]],
"item": ["foo", "bar"],
"price": [10.0, 20.0],
}
)
tbl = db.create_table("my_table_pl", data)
# --8<-- [end:create_table_from_polars]
# --8<-- [start:create_table_from_arrow_table]
dim = 16
total = 2
schema = pa.schema(
[pa.field("vector", pa.list_(pa.float16(), dim)), pa.field("text", pa.string())]
)
data = pa.Table.from_arrays(
[
pa.array(
[np.random.randn(dim).astype(np.float16) for _ in range(total)],
pa.list_(pa.float16(), dim),
),
pa.array(["foo", "bar"]),
],
["vector", "text"],
)
tbl = db.create_table("f16_tbl", data, schema=schema)
# --8<-- [end:create_table_from_arrow_table]
# --8<-- [start:create_table_from_pydantic]
tbl = db.create_table("movielens_small", schema=Content)
# --8<-- [end:create_table_from_pydantic]
# --8<-- [start:create_table_nested_schema]
tbl = db.create_table("nested_table", schema=NestedSchema)
# --8<-- [end:create_table_nested_schema]
# --8<-- [start:create_table_from_batch]
schema = pa.schema(
[
pa.field("vector", pa.list_(pa.float32(), 4)),
pa.field("item", pa.utf8()),
pa.field("price", pa.float32()),
]
)
db.create_table("batched_tale", make_batches(), schema=schema)
# --8<-- [end:create_table_from_batch]
# --8<-- [start:list_tables]
print(db.table_names())
# --8<-- [end:list_tables]
# --8<-- [start:open_table]
tbl = db.open_table("test_table")
# --8<-- [end:open_table]
# --8<-- [start:create_empty_table]
schema = pa.schema(
[
pa.field("vector", pa.list_(pa.float32(), 2)),
pa.field("item", pa.string()),
pa.field("price", pa.float32()),
]
)
tbl = db.create_table("test_empty_table", schema=schema)
# --8<-- [end:create_empty_table]
# --8<-- [start:create_empty_table_pydantic]
tbl = db.create_table("test_empty_table_new", schema=Item.to_arrow_schema())
# --8<-- [end:create_empty_table_pydantic]
# --8<-- [start:add_table_from_pandas]
df = pd.DataFrame(
{
"vector": [[1.3, 1.4], [9.5, 56.2]],
"item": ["banana", "apple"],
"price": [5.0, 7.0],
}
)
tbl.add(df)
# --8<-- [end:add_table_from_pandas]
# --8<-- [start:add_table_from_polars]
df = pl.DataFrame(
{
"vector": [[1.3, 1.4], [9.5, 56.2]],
"item": ["banana", "apple"],
"price": [5.0, 7.0],
}
)
tbl.add(df)
# --8<-- [end:add_table_from_polars]
# --8<-- [start:add_table_from_batch]
tbl.add(make_batches_for_add())
# --8<-- [end:add_table_from_batch]
# --8<-- [start:add_table_from_pyarrow]
pa_table = pa.Table.from_arrays(
[
pa.array([[9.1, 6.7], [9.9, 31.2]], pa.list_(pa.float32(), 2)),
pa.array(["mango", "orange"]),
pa.array([7.0, 4.0]),
],
["vector", "item", "price"],
)
tbl.add(pa_table)
# --8<-- [end:add_table_from_pyarrow]
# --8<-- [start:add_table_from_pydantic]
pydantic_model_items = [
Item(vector=[8.1, 4.7], item="pineapple", price=10.0),
Item(vector=[6.9, 9.3], item="avocado", price=9.0),
]
tbl.add(pydantic_model_items)
# --8<-- [end:add_table_from_pydantic]
# --8<-- [start:delete_row]
tbl.delete('item = "fizz"')
# --8<-- [end:delete_row]
# --8<-- [start:delete_specific_row]
data = [
{"x": 1, "vector": [1, 2]},
{"x": 2, "vector": [3, 4]},
{"x": 3, "vector": [5, 6]},
]
# Synchronous client
tbl = db.create_table("delete_row", data)
tbl.to_pandas()
# x vector
# 0 1 [1.0, 2.0]
# 1 2 [3.0, 4.0]
# 2 3 [5.0, 6.0]
tbl.delete("x = 2")
tbl.to_pandas()
# x vector
# 0 1 [1.0, 2.0]
# 1 3 [5.0, 6.0]
# --8<-- [end:delete_specific_row]
# --8<-- [start:delete_list_values]
to_remove = [1, 5]
to_remove = ", ".join(str(v) for v in to_remove)
tbl.delete(f"x IN ({to_remove})")
tbl.to_pandas()
# x vector
# 0 3 [5.0, 6.0]
# --8<-- [end:delete_list_values]
# --8<-- [start:update_table]
# Create a table from a pandas DataFrame
data = pd.DataFrame({"x": [1, 2, 3], "vector": [[1, 2], [3, 4], [5, 6]]})
tbl = db.create_table("test_table", data, mode="overwrite")
# Update the table where x = 2
tbl.update(where="x = 2", values={"vector": [10, 10]})
# Get the updated table as a pandas DataFrame
df = tbl.to_pandas()
print(df)
# --8<-- [end:update_table]
# --8<-- [start:update_table_sql]
# Update the table where x = 2
tbl.update(values_sql={"x": "x + 1"})
print(tbl.to_pandas())
# --8<-- [end:update_table_sql]
# --8<-- [start:table_strong_consistency]
uri = "data/sample-lancedb"
db = lancedb.connect(uri, read_consistency_interval=timedelta(0))
tbl = db.open_table("test_table")
# --8<-- [end:table_strong_consistency]
# --8<-- [start:table_eventual_consistency]
uri = "data/sample-lancedb"
db = lancedb.connect(uri, read_consistency_interval=timedelta(seconds=5))
tbl = db.open_table("test_table")
# --8<-- [end:table_eventual_consistency]
# --8<-- [start:table_checkout_latest]
tbl = db.open_table("test_table")
# (Other writes happen to test_table from another process)
# Check for updates
tbl.checkout_latest()
# --8<-- [end:table_checkout_latest]
@pytest.mark.skip
def test_table_with_embedding():
db = lancedb.connect("data/sample-lancedb")
# --8<-- [start:create_table_with_embedding]
embed_fcn = get_registry().get("huggingface").create(name="BAAI/bge-small-en-v1.5")
class Schema(LanceModel):
text: str = embed_fcn.SourceField()
vector: Vector(embed_fcn.ndims()) = embed_fcn.VectorField(default=None)
tbl = db.create_table("my_table_with_embedding", schema=Schema, mode="overwrite")
models = [Schema(text="hello"), Schema(text="world")]
tbl.add(models)
# --8<-- [end:create_table_with_embedding]
@pytest.mark.skip
async def test_table_with_embedding_async():
async_db = await lancedb.connect_async("data/sample-lancedb")
# --8<-- [start:create_table_async_with_embedding]
embed_fcn = get_registry().get("huggingface").create(name="BAAI/bge-small-en-v1.5")
class Schema(LanceModel):
text: str = embed_fcn.SourceField()
vector: Vector(embed_fcn.ndims()) = embed_fcn.VectorField(default=None)
async_tbl = await async_db.create_table(
"my_table_async_with_embedding", schema=Schema, mode="overwrite"
)
models = [Schema(text="hello"), Schema(text="world")]
await async_tbl.add(models)
# --8<-- [end:create_table_async_with_embedding]
@pytest.mark.asyncio
async def test_table_async():
# --8<-- [start:connect_async]
uri = "data/sample-lancedb"
async_db = await lancedb.connect_async(uri)
# --8<-- [end:connect_async]
# --8<-- [start:create_table_async]
data = [
{"vector": [1.1, 1.2], "lat": 45.5, "long": -122.7},
{"vector": [0.2, 1.8], "lat": 40.1, "long": -74.1},
]
async_tbl = await async_db.create_table("test_table_async", data)
await async_tbl.head()
# --8<-- [end:create_table_async]
# --8<-- [start:create_table_async_exist_ok]
await async_db.create_table("test_table_async", data, exist_ok=True)
# --8<-- [end:create_table_async_exist_ok]
# --8<-- [start:create_table_async_overwrite]
await async_db.create_table("test_table_async", data, mode="overwrite")
# --8<-- [end:create_table_async_overwrite]
# --8<-- [start:create_table_async_from_pandas]
data = pd.DataFrame(
{
"vector": [[1.1, 1.2, 1.3, 1.4], [0.2, 1.8, 0.4, 3.6]],
"lat": [45.5, 40.1],
"long": [-122.7, -74.1],
}
)
async_tbl = await async_db.create_table("my_table_async_pd", data)
await async_tbl.head()
# --8<-- [end:create_table_async_from_pandas]
# --8<-- [start:create_table_async_custom_schema]
custom_schema = pa.schema(
[
pa.field("vector", pa.list_(pa.float32(), 4)),
pa.field("lat", pa.float32()),
pa.field("long", pa.float32()),
]
)
async_tbl = await async_db.create_table(
"my_table_async_custom_schema", data, schema=custom_schema
)
# --8<-- [end:create_table_async_custom_schema]
# --8<-- [start:create_table_async_from_polars]
data = pl.DataFrame(
{
"vector": [[3.1, 4.1], [5.9, 26.5]],
"item": ["foo", "bar"],
"price": [10.0, 20.0],
}
)
async_tbl = await async_db.create_table("my_table_async_pl", data)
# --8<-- [end:create_table_async_from_polars]
# --8<-- [start:create_table_async_from_arrow_table]
dim = 16
total = 2
schema = pa.schema(
[pa.field("vector", pa.list_(pa.float16(), dim)), pa.field("text", pa.string())]
)
data = pa.Table.from_arrays(
[
pa.array(
[np.random.randn(dim).astype(np.float16) for _ in range(total)],
pa.list_(pa.float16(), dim),
),
pa.array(["foo", "bar"]),
],
["vector", "text"],
)
async_tbl = await async_db.create_table("f16_tbl_async", data, schema=schema)
# --8<-- [end:create_table_async_from_arrow_table]
# --8<-- [start:create_table_async_from_pydantic]
async_tbl = await async_db.create_table("movielens_small_async", schema=Content)
# --8<-- [end:create_table_async_from_pydantic]
# --8<-- [start:create_table_async_nested_schema]
async_tbl = await async_db.create_table("nested_table_async", schema=NestedSchema)
# --8<-- [end:create_table_async_nested_schema]
# --8<-- [start:create_table_async_from_batch]
schema = pa.schema(
[
pa.field("vector", pa.list_(pa.float32(), 4)),
pa.field("item", pa.utf8()),
pa.field("price", pa.float32()),
]
)
await async_db.create_table("batched_table", make_batches(), schema=schema)
# --8<-- [end:create_table_async_from_batch]
# --8<-- [start:list_tables_async]
print(await async_db.table_names())
# --8<-- [end:list_tables_async]
# --8<-- [start:open_table_async]
async_tbl = await async_db.open_table("test_table_async")
# --8<-- [end:open_table_async]
# --8<-- [start:create_empty_table_async]
schema = pa.schema(
[
pa.field("vector", pa.list_(pa.float32(), 2)),
pa.field("item", pa.string()),
pa.field("price", pa.float32()),
]
)
async_tbl = await async_db.create_table("test_empty_table_async", schema=schema)
# --8<-- [end:create_empty_table_async]
# --8<-- [start:create_empty_table_async_pydantic]
async_tbl = await async_db.create_table(
"test_empty_table_async_new", schema=Item.to_arrow_schema()
)
# --8<-- [end:create_empty_table_async_pydantic]
# --8<-- [start:add_table_async_from_pandas]
df = pd.DataFrame(
{
"vector": [[1.3, 1.4], [9.5, 56.2]],
"item": ["banana", "apple"],
"price": [5.0, 7.0],
}
)
await async_tbl.add(df)
# --8<-- [end:add_table_async_from_pandas]
# --8<-- [start:add_table_async_from_polars]
df = pl.DataFrame(
{
"vector": [[1.3, 1.4], [9.5, 56.2]],
"item": ["banana", "apple"],
"price": [5.0, 7.0],
}
)
await async_tbl.add(df)
# --8<-- [end:add_table_async_from_polars]
# --8<-- [start:add_table_async_from_batch]
await async_tbl.add(make_batches_for_add())
# --8<-- [end:add_table_async_from_batch]
# --8<-- [start:add_table_async_from_pyarrow]
pa_table = pa.Table.from_arrays(
[
pa.array([[9.1, 6.7], [9.9, 31.2]], pa.list_(pa.float32(), 2)),
pa.array(["mango", "orange"]),
pa.array([7.0, 4.0]),
],
["vector", "item", "price"],
)
await async_tbl.add(pa_table)
# --8<-- [end:add_table_async_from_pyarrow]
# --8<-- [start:add_table_async_from_pydantic]
pydantic_model_items = [
Item(vector=[8.1, 4.7], item="pineapple", price=10.0),
Item(vector=[6.9, 9.3], item="avocado", price=9.0),
]
await async_tbl.add(pydantic_model_items)
# --8<-- [end:add_table_async_from_pydantic]
# --8<-- [start:delete_row_async]
await async_tbl.delete('item = "fizz"')
# --8<-- [end:delete_row_async]
# --8<-- [start:delete_specific_row_async]
data = [
{"x": 1, "vector": [1, 2]},
{"x": 2, "vector": [3, 4]},
{"x": 3, "vector": [5, 6]},
]
async_db = await lancedb.connect_async(uri)
async_tbl = await async_db.create_table("delete_row_async", data)
await async_tbl.to_pandas()
# x vector
# 0 1 [1.0, 2.0]
# 1 2 [3.0, 4.0]
# 2 3 [5.0, 6.0]
await async_tbl.delete("x = 2")
await async_tbl.to_pandas()
# x vector
# 0 1 [1.0, 2.0]
# 1 3 [5.0, 6.0]
# --8<-- [end:delete_specific_row_async]
# --8<-- [start:delete_list_values_async]
to_remove = [1, 5]
to_remove = ", ".join(str(v) for v in to_remove)
await async_tbl.delete(f"x IN ({to_remove})")
await async_tbl.to_pandas()
# x vector
# 0 3 [5.0, 6.0]
# --8<-- [end:delete_list_values_async]
# --8<-- [start:update_table_async]
# Create a table from a pandas DataFrame
data = pd.DataFrame({"x": [1, 2, 3], "vector": [[1, 2], [3, 4], [5, 6]]})
async_tbl = await async_db.create_table("update_table_async", data)
# Update the table where x = 2
await async_tbl.update({"vector": [10, 10]}, where="x = 2")
# Get the updated table as a pandas DataFrame
df = await async_tbl.to_pandas()
# Print the DataFrame
print(df)
# --8<-- [end:update_table_async]
# --8<-- [start:update_table_sql_async]
# Update the table where x = 2
await async_tbl.update(updates_sql={"x": "x + 1"})
print(await async_tbl.to_pandas())
# --8<-- [end:update_table_sql_async]
# --8<-- [start:table_async_strong_consistency]
uri = "data/sample-lancedb"
async_db = await lancedb.connect_async(uri, read_consistency_interval=timedelta(0))
async_tbl = await async_db.open_table("test_table_async")
# --8<-- [end:table_async_strong_consistency]
# --8<-- [start:table_async_eventual_consistency]
uri = "data/sample-lancedb"
async_db = await lancedb.connect_async(
uri, read_consistency_interval=timedelta(seconds=5)
)
async_tbl = await async_db.open_table("test_table_async")
# --8<-- [end:table_async_eventual_consistency]
# --8<-- [start:table_async_checkout_latest]
async_tbl = await async_db.open_table("test_table_async")
# (Other writes happen to test_table_async from another process)
# Check for updates
await async_tbl.checkout_latest()
# --8<-- [end:table_async_checkout_latest]
@@ -0,0 +1,187 @@
# --8<-- [start:import-lancedb]
import lancedb
# --8<-- [end:import-lancedb]
# --8<-- [start:import-pandas]
import pandas as pd
# --8<-- [end:import-pandas]
# --8<-- [start:import-iterable]
from typing import Iterable
# --8<-- [end:import-iterable]
# --8<-- [start:import-pyarrow]
import pyarrow as pa
# --8<-- [end:import-pyarrow]
# --8<-- [start:import-polars]
import polars as pl
# --8<-- [end:import-polars]
# --8<-- [start:import-lancedb-pydantic]
from lancedb.pydantic import Vector, LanceModel
# --8<-- [end:import-lancedb-pydantic]
import pytest
# --8<-- [start:make_batches]
def make_batches() -> Iterable[pa.RecordBatch]:
for i in range(5):
yield pa.RecordBatch.from_arrays(
[
pa.array([[3.1, 4.1], [5.9, 26.5]]),
pa.array(["foo", "bar"]),
pa.array([10.0, 20.0]),
],
["vector", "item", "price"],
)
# --8<-- [end:make_batches]
def test_pandas_and_pyarrow():
# --8<-- [start:connect_to_lancedb]
uri = "data/sample-lancedb"
db = lancedb.connect(uri)
# --8<-- [end:connect_to_lancedb]
# --8<-- [start:create_table_pandas]
data = pd.DataFrame(
{
"vector": [[3.1, 4.1], [5.9, 26.5]],
"item": ["foo", "bar"],
"price": [10.0, 20.0],
}
)
table = db.create_table("pd_table", data=data)
# --8<-- [end:create_table_pandas]
# --8<-- [start:create_table_iterable]
schema = pa.schema(
[
pa.field("vector", pa.list_(pa.float32())),
pa.field("item", pa.utf8()),
pa.field("price", pa.float32()),
]
)
table = db.create_table("iterable_table", data=make_batches(), schema=schema)
# --8<-- [end:create_table_iterable]
# --8<-- [start:vector_search]
# Open the table previously created.
table = db.open_table("pd_table")
query_vector = [100, 100]
# Pandas DataFrame
df = table.search(query_vector).limit(1).to_pandas()
print(df)
# --8<-- [end:vector_search]
# --8<-- [start:vector_search_with_filter]
# Apply the filter via LanceDB
results = table.search([100, 100]).where("price < 15").to_pandas()
assert len(results) == 1
assert results["item"].iloc[0] == "foo"
# Apply the filter via Pandas
df = table.search([100, 100]).to_pandas()
results = df[df.price < 15]
assert len(results) == 1
assert results["item"].iloc[0] == "foo"
# --8<-- [end:vector_search_with_filter]
@pytest.mark.asyncio
async def test_pandas_and_pyarrow_async():
# --8<-- [start:connect_to_lancedb_async]
uri = "data/sample-lancedb"
async_db = await lancedb.connect_async(uri)
# --8<-- [end:connect_to_lancedb_async]
# --8<-- [start:create_table_pandas_async]
data = pd.DataFrame(
{
"vector": [[3.1, 4.1], [5.9, 26.5]],
"item": ["foo", "bar"],
"price": [10.0, 20.0],
}
)
await async_db.create_table("pd_table_async", data=data)
# --8<-- [end:create_table_pandas_async]
# --8<-- [start:create_table_iterable_async]
schema = pa.schema(
[
pa.field("vector", pa.list_(pa.float32())),
pa.field("item", pa.utf8()),
pa.field("price", pa.float32()),
]
)
await async_db.create_table(
"iterable_table_async", data=make_batches(), schema=schema
)
# --8<-- [end:create_table_iterable_async]
# --8<-- [start:vector_search_async]
# Open the table previously created.
async_tbl = await async_db.open_table("pd_table_async")
query_vector = [100, 100]
# Pandas DataFrame
df = await async_tbl.query().nearest_to(query_vector).limit(1).to_pandas()
print(df)
# --8<-- [end:vector_search_async]
# --8<-- [start:vector_search_with_filter_async]
# Apply the filter via LanceDB
results = (
await async_tbl.query().nearest_to([100, 100]).where("price < 15").to_pandas()
)
assert len(results) == 1
assert results["item"].iloc[0] == "foo"
# Apply the filter via Pandas
df = await async_tbl.query().nearest_to([100, 100]).to_pandas()
results = df[df.price < 15]
assert len(results) == 1
assert results["item"].iloc[0] == "foo"
# --8<-- [end:vector_search_with_filter_async]
# --8<-- [start:class_Item]
class Item(LanceModel):
vector: Vector(2)
item: str
price: float
# --8<-- [end:class_Item]
def test_polars():
uri = "data/sample-lancedb"
db = lancedb.connect(uri)
# --8<-- [start:create_table_polars]
data = pl.DataFrame(
{
"vector": [[3.1, 4.1], [5.9, 26.5]],
"item": ["foo", "bar"],
"price": [10.0, 20.0],
}
)
table = db.create_table("pl_table", data=data)
# --8<-- [end:create_table_polars]
# --8<-- [start:vector_search_polars]
query = [3.0, 4.0]
result = table.search(query).limit(1).to_polars()
print(result)
print(type(result))
# --8<-- [end:vector_search_polars]
# --8<-- [start:create_table_pydantic]
table = db.create_table("pydantic_table", schema=Item)
df = pl.DataFrame(data)
# Add Polars DataFrame to table
table.add(df)
# --8<-- [end:create_table_pydantic]
# --8<-- [start:dump_table_lazyform]
ldf = table.to_polars()
print(type(ldf))
# --8<-- [end:dump_table_lazyform]
# --8<-- [start:print_table_lazyform]
print(ldf.first().collect())
# --8<-- [end:print_table_lazyform]
@@ -0,0 +1,366 @@
# --8<-- [start:import-lancedb]
import lancedb
# --8<-- [end:import-lancedb]
# --8<-- [start:import-numpy]
import numpy as np
# --8<-- [end:import-numpy]
# --8<-- [start:import-datetime]
from datetime import datetime
# --8<-- [end:import-datetime]
# --8<-- [start:import-lancedb-pydantic]
from lancedb.pydantic import Vector, LanceModel
# --8<-- [end:import-lancedb-pydantic]
# --8<-- [start:import-pydantic-base-model]
from pydantic import BaseModel
# --8<-- [end:import-pydantic-base-model]
# --8<-- [start:import-lancedb-fts]
from lancedb.index import FTS
# --8<-- [end:import-lancedb-fts]
# --8<-- [start:import-os]
import os
# --8<-- [end:import-os]
# --8<-- [start:import-embeddings]
from lancedb.embeddings import get_registry
# --8<-- [end:import-embeddings]
import pytest
# --8<-- [start:class-definition]
class Metadata(BaseModel):
source: str
timestamp: datetime
class Document(BaseModel):
content: str
meta: Metadata
class LanceSchema(LanceModel):
id: str
vector: Vector(1536)
payload: Document
# --8<-- [end:class-definition]
def test_vector_search():
# --8<-- [start:exhaustive_search]
uri = "data/sample-lancedb"
db = lancedb.connect(uri)
data = [
{"vector": row, "item": f"item {i}"}
for i, row in enumerate(np.random.random((10_000, 1536)).astype("float32"))
]
tbl = db.create_table("vector_search", data=data)
tbl.search(np.random.random((1536))).limit(10).to_list()
# --8<-- [end:exhaustive_search]
# --8<-- [start:exhaustive_search_cosine]
tbl.search(np.random.random((1536))).metric("cosine").limit(10).to_list()
# --8<-- [end:exhaustive_search_cosine]
# --8<-- [start:create_table_with_nested_schema]
# Let's add 100 sample rows to our dataset
data = [
LanceSchema(
id=f"id{i}",
vector=np.random.randn(1536),
payload=Document(
content=f"document{i}",
meta=Metadata(source=f"source{i % 10}", timestamp=datetime.now()),
),
)
for i in range(100)
]
# Synchronous client
tbl = db.create_table("documents", data=data)
# --8<-- [end:create_table_with_nested_schema]
# --8<-- [start:search_result_as_pyarrow]
tbl.search(np.random.randn(1536)).to_arrow()
# --8<-- [end:search_result_as_pyarrow]
# --8<-- [start:search_result_as_pandas]
tbl.search(np.random.randn(1536)).to_pandas()
# --8<-- [end:search_result_as_pandas]
# --8<-- [start:search_result_as_pandas_flatten_true]
tbl.search(np.random.randn(1536)).to_pandas(flatten=True)
# --8<-- [end:search_result_as_pandas_flatten_true]
# --8<-- [start:search_result_as_pandas_flatten_1]
tbl.search(np.random.randn(1536)).to_pandas(flatten=1)
# --8<-- [end:search_result_as_pandas_flatten_1]
# --8<-- [start:search_result_as_list]
tbl.search(np.random.randn(1536)).to_list()
# --8<-- [end:search_result_as_list]
# --8<-- [start:search_result_as_pydantic]
tbl.search(np.random.randn(1536)).to_pydantic(LanceSchema)
# --8<-- [end:search_result_as_pydantic]
@pytest.mark.asyncio
async def test_vector_search_async():
# --8<-- [start:exhaustive_search_async]
uri = "data/sample-lancedb"
async_db = await lancedb.connect_async(uri)
data = [
{"vector": row, "item": f"item {i}"}
for i, row in enumerate(np.random.random((10_000, 1536)).astype("float32"))
]
async_tbl = await async_db.create_table("vector_search_async", data=data)
(await async_tbl.query().nearest_to(np.random.random((1536))).limit(10).to_list())
# --8<-- [end:exhaustive_search_async]
# --8<-- [start:exhaustive_search_async_cosine]
(
await async_tbl.query()
.nearest_to(np.random.random((1536)))
.distance_type("cosine")
.limit(10)
.to_list()
)
# --8<-- [end:exhaustive_search_async_cosine]
# --8<-- [start:create_table_async_with_nested_schema]
# Let's add 100 sample rows to our dataset
data = [
LanceSchema(
id=f"id{i}",
vector=np.random.randn(1536),
payload=Document(
content=f"document{i}",
meta=Metadata(source=f"source{i % 10}", timestamp=datetime.now()),
),
)
for i in range(100)
]
async_tbl = await async_db.create_table("documents_async", data=data)
# --8<-- [end:create_table_async_with_nested_schema]
# --8<-- [start:search_result_async_as_pyarrow]
await async_tbl.query().nearest_to(np.random.randn(1536)).to_arrow()
# --8<-- [end:search_result_async_as_pyarrow]
# --8<-- [start:search_result_async_as_pandas]
await async_tbl.query().nearest_to(np.random.randn(1536)).to_pandas()
# --8<-- [end:search_result_async_as_pandas]
# --8<-- [start:search_result_async_as_list]
await async_tbl.query().nearest_to(np.random.randn(1536)).to_list()
# --8<-- [end:search_result_async_as_list]
def test_fts_native():
# --8<-- [start:basic_fts]
uri = "data/sample-lancedb"
db = lancedb.connect(uri)
table = db.create_table(
"my_table_fts",
data=[
{"vector": [3.1, 4.1], "text": "Frodo was a happy puppy"},
{"vector": [5.9, 26.5], "text": "There are several kittens playing"},
],
)
# passing `use_tantivy=False` to use lance FTS index
# `use_tantivy=True` by default
table.create_fts_index("text", use_tantivy=False)
table.search("puppy").limit(10).select(["text"]).to_list()
# [{'text': 'Frodo was a happy puppy', '_score': 0.6931471824645996}]
# ...
# --8<-- [end:basic_fts]
# --8<-- [start:fts_config_stem]
table.create_fts_index("text", tokenizer_name="en_stem", replace=True)
# --8<-- [end:fts_config_stem]
# --8<-- [start:fts_config_folding]
table.create_fts_index(
"text",
use_tantivy=False,
language="French",
stem=True,
ascii_folding=True,
replace=True,
)
# --8<-- [end:fts_config_folding]
# --8<-- [start:fts_prefiltering]
table.search("puppy").limit(10).where("text='foo'", prefilter=True).to_list()
# --8<-- [end:fts_prefiltering]
# --8<-- [start:fts_postfiltering]
table.search("puppy").limit(10).where("text='foo'", prefilter=False).to_list()
# --8<-- [end:fts_postfiltering]
# --8<-- [start:fts_with_position]
table.create_fts_index("text", use_tantivy=False, with_position=True, replace=True)
# --8<-- [end:fts_with_position]
# --8<-- [start:fts_incremental_index]
table.add([{"vector": [3.1, 4.1], "text": "Frodo was a happy puppy"}])
table.optimize()
# --8<-- [end:fts_incremental_index]
@pytest.mark.asyncio
async def test_fts_native_async():
# --8<-- [start:basic_fts_async]
uri = "data/sample-lancedb"
async_db = await lancedb.connect_async(uri)
async_tbl = await async_db.create_table(
"my_table_fts_async",
data=[
{"vector": [3.1, 4.1], "text": "Frodo was a happy puppy"},
{"vector": [5.9, 26.5], "text": "There are several kittens playing"},
],
)
# async API uses our native FTS algorithm
await async_tbl.create_index("text", config=FTS())
await (
async_tbl.query().nearest_to_text("puppy").select(["text"]).limit(10).to_list()
)
# [{'text': 'Frodo was a happy puppy', '_score': 0.6931471824645996}]
# ...
# --8<-- [end:basic_fts_async]
# --8<-- [start:fts_config_stem_async]
await async_tbl.create_index(
"text", config=FTS(language="English", stem=True, remove_stop_words=True)
)
# --8<-- [end:fts_config_stem_async]
# --8<-- [start:fts_config_folding_async]
await async_tbl.create_index(
"text", config=FTS(language="French", stem=True, ascii_folding=True)
)
# --8<-- [end:fts_config_folding_async]
# --8<-- [start:fts_prefiltering_async]
await (
async_tbl.query()
.nearest_to_text("puppy")
.limit(10)
.where("text='foo'")
.to_list()
)
# --8<-- [end:fts_prefiltering_async]
# --8<-- [start:fts_postfiltering_async]
await (
async_tbl.query()
.nearest_to_text("puppy")
.limit(10)
.where("text='foo'")
.postfilter()
.to_list()
)
# --8<-- [end:fts_postfiltering_async]
# --8<-- [start:fts_with_position_async]
await async_tbl.create_index("text", config=FTS(with_position=True))
# --8<-- [end:fts_with_position_async]
# --8<-- [start:fts_incremental_index_async]
await async_tbl.add([{"vector": [3.1, 4.1], "text": "Frodo was a happy puppy"}])
await async_tbl.optimize()
# --8<-- [end:fts_incremental_index_async]
@pytest.mark.skip()
def test_hybrid_search():
# --8<-- [start:import-openai]
import openai
# --8<-- [end:import-openai]
# --8<-- [start:openai-embeddings]
# Ingest embedding function in LanceDB table
# Configuring the environment variable OPENAI_API_KEY
if "OPENAI_API_KEY" not in os.environ:
# OR set the key here as a variable
openai.api_key = "sk-..."
embeddings = get_registry().get("openai").create()
# --8<-- [end:openai-embeddings]
# --8<-- [start:class-Documents]
class Documents(LanceModel):
vector: Vector(embeddings.ndims()) = embeddings.VectorField()
text: str = embeddings.SourceField()
# --8<-- [end:class-Documents]
# --8<-- [start:basic_hybrid_search]
data = [
{"text": "rebel spaceships striking from a hidden base"},
{"text": "have won their first victory against the evil Galactic Empire"},
{"text": "during the battle rebel spies managed to steal secret plans"},
{"text": "to the Empire's ultimate weapon the Death Star"},
]
uri = "data/sample-lancedb"
db = lancedb.connect(uri)
table = db.create_table("documents", schema=Documents)
# ingest docs with auto-vectorization
table.add(data)
# Create a fts index before the hybrid search
table.create_fts_index("text")
# hybrid search with default re-ranker
table.search("flower moon", query_type="hybrid").to_pandas()
# --8<-- [end:basic_hybrid_search]
# --8<-- [start:hybrid_search_pass_vector_text]
vector_query = [0.1, 0.2, 0.3, 0.4, 0.5]
text_query = "flower moon"
(
table.search(query_type="hybrid")
.vector(vector_query)
.text(text_query)
.limit(5)
.to_pandas()
)
# --8<-- [end:hybrid_search_pass_vector_text]
@pytest.mark.skip
async def test_hybrid_search_async():
import openai
# --8<-- [start:openai-embeddings]
# Ingest embedding function in LanceDB table
# Configuring the environment variable OPENAI_API_KEY
if "OPENAI_API_KEY" not in os.environ:
# OR set the key here as a variable
openai.api_key = "sk-..."
embeddings = get_registry().get("openai").create()
# --8<-- [end:openai-embeddings]
# --8<-- [start:class-Documents]
class Documents(LanceModel):
vector: Vector(embeddings.ndims()) = embeddings.VectorField()
text: str = embeddings.SourceField()
# --8<-- [end:class-Documents]
# --8<-- [start:basic_hybrid_search_async]
uri = "data/sample-lancedb"
async_db = await lancedb.connect_async(uri)
data = [
{"text": "rebel spaceships striking from a hidden base"},
{"text": "have won their first victory against the evil Galactic Empire"},
{"text": "during the battle rebel spies managed to steal secret plans"},
{"text": "to the Empire's ultimate weapon the Death Star"},
]
async_tbl = await async_db.create_table("documents_async", schema=Documents)
# ingest docs with auto-vectorization
await async_tbl.add(data)
# Create a fts index before the hybrid search
await async_tbl.create_index("text", config=FTS())
text_query = "flower moon"
vector_query = embeddings.compute_query_embeddings(text_query)[0]
# hybrid search with default re-ranker
await (
async_tbl.query()
.nearest_to(vector_query)
.nearest_to_text(text_query)
.to_pandas()
)
# --8<-- [end:basic_hybrid_search_async]
# --8<-- [start:hybrid_search_pass_vector_text_async]
vector_query = [0.1, 0.2, 0.3, 0.4, 0.5]
text_query = "flower moon"
await (
async_tbl.query()
.nearest_to(vector_query)
.nearest_to_text(text_query)
.limit(5)
.to_pandas()
)
# --8<-- [end:hybrid_search_pass_vector_text_async]