Compare commits

805 Commits

Author SHA1 Message Date
Lance Release
431f94e564 [python] Bump version: 0.6.9 → 0.6.10 2024-04-22 17:42:24 +00:00
Alex Kohler
c1a7d65473 chore: fix get_registry call in baai embeddings example (#1230) 2024-04-20 07:25:16 +05:30
Rob Meng
1e5ccb1614 chore: upgrade lance to 0.10.15 (#1229) 2024-04-19 10:31:39 -04:00
Bert
2e7ab373dc fix: update lance to 0.10.13 (#1226) 2024-04-17 09:29:10 -04:00
Weston Pace
c7fbc4aaee docs: fix minor typo (#1220) 2024-04-14 03:32:57 +05:30
Lance Release
7e023c1ef2 [python] Bump version: 0.6.8 → 0.6.9 2024-04-12 22:09:12 +00:00
Weston Pace
1d0dd9a8b8 feat: bump lance version from 0.10.10 to 0.10.12 (#1219) 2024-04-12 15:08:39 -07:00
Weston Pace
deb947ddbd doc: fix typo, broken links (#1218) 2024-04-11 14:58:51 -07:00
Ayush Chaurasia
b039765d50 docs : Embedding functions quickstart and minor fixes (#1217) 2024-04-11 17:30:45 +05:30
Prashanth Rao
d155e82723 [docs] Fix broken links and clarify language in integrations docs (#1209)
This PR does the following:

- Fixes broken/outdated URLs
- Adds clarity to the way DuckDB/LanceDB integration works via Arrow
2024-04-11 15:32:08 +05:30
Ayush Chaurasia
5d8c91256c fix(python): Update to latest cohere reranking api (#1212)
Fixes https://github.com/lancedb/lancedb/issues/1196
Cohere introduced a breaking change in their reranker API starting
version 5.0.0. More context in discussion here
https://github.com/cohere-ai/cohere-python/issues/446
2024-04-11 15:20:29 +05:30
Ayush Chaurasia
44c03ebef3 docs : Update Reranking docs (#1213) 2024-04-11 15:20:00 +05:30
Will Jones
8ea06fe7f3 ci: fix failures in release scripts (#1215)
* Python release has been running when we create a Node release.
https://github.com/lancedb/lancedb/actions/runs/8635662585
* Rust is missing new enough compilers to check the kernels feature
https://github.com/lancedb/lancedb/actions/runs/8635662578
2024-04-10 13:09:39 -07:00
Lance Release
cf06b653d4 [python] Bump version: 0.6.7 → 0.6.8 2024-04-10 17:51:45 +00:00
Lance Release
09cfab6d00 Updating package-lock.json 2024-04-10 17:40:03 +00:00
Lance Release
e4945abb1a Bump version: 0.4.16 → 0.4.17 2024-04-10 17:39:52 +00:00
Raghav Dixit
a6aa67baed python: Bug fixes / tests (#1210)
closes #1194 #1172 #1124 #1208

@wjones127 : `if query_type != "fts":` is needed because both fts and
vector search create `LanceQueryBuilder` which has `vector_column_name`
as a required attribute.
2024-04-10 10:17:14 -07:00
Will Jones
1d23af213b feat: expose storage options in LanceDB (#1204)
Exposes `storage_options` in LanceDB. This is provided for Python async,
Node `lancedb`, and Node `vectordb` (and Rust of course). Python
synchronous is omitted because it's not compatible with the PyArrow
filesystems we use there currently. In the future, we will move the sync
API to wrap the async one, and then it will get support for
`storage_options`.

1. Fixes #1168
2. Closes #1165
3. Closes #1082
4. Closes #439
5. Closes #897
6. Closes #642
7. Closes #281
8. Closes #114
9. Closes #990
10. Deprecating `awsCredentials` and `awsRegion`. Users are encouraged
to use `storageOptions` instead.
2024-04-10 10:12:04 -07:00
Bert
25dea4e859 BREAKING CHANGE: Check if remote table exists when opening (with caching) (#1214)
- make open table behaviour consistent:
- remote tables will check if the table exists by calling /describe and
throwing an error if the call doesn't succeed
- this is similar to the behaviour for local tables where we will raise
an exception when opening the table if the local dataset doesn't exist
- The table names are cached in the client with a TTL
- Also fixes a small bug where if the remote error response was
deserialized from JSON as an object, we'd print it resulting in the
unhelpful error message: `Error: Server Error, status: 404, message: Not
Found: [object Object]`
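
The existence check with a TTL cache described above might look roughly like the sketch below. It is not the client code from this PR: `TableNameCache`, the `describe` callback (standing in for the `/describe` call), and the 60-second TTL are hypothetical names and values chosen for illustration.

```python
import time
from typing import Callable, Dict


class TableNameCache:
    """Remember which table names were recently confirmed to exist."""

    def __init__(
        self,
        describe: Callable[[str], bool],  # hypothetical stand-in for /describe
        ttl_seconds: float = 60.0,        # assumed TTL, not taken from the PR
        clock: Callable[[], float] = time.monotonic,
    ):
        self._describe = describe
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at: Dict[str, float] = {}

    def ensure_exists(self, name: str) -> None:
        """Raise if the table is missing; skip the remote call while cached."""
        now = self._clock()
        expires = self._expires_at.get(name)
        if expires is not None and expires > now:
            return  # entry is still fresh, no remote call needed
        if not self._describe(name):
            self._expires_at.pop(name, None)
            raise LookupError(f"Table '{name}' was not found")
        self._expires_at[name] = now + self._ttl
```

Injecting the clock keeps the expiry logic testable without sleeping.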
2024-04-10 11:54:47 -04:00
Weston Pace
8a1227030a chore: restore requests which was lost during rebase (#1205) 2024-04-08 11:56:43 +05:30
Weston Pace
9fee384d2c chore(node): restore package-lock.json lost during rebase 2024-04-05 16:36:29 -07:00
Ayush Chaurasia
b2952acca7 chore(python): remove redundant files (#1203) 2024-04-05 16:35:10 -07:00
Pranav Maddi
2b132a0bef Fix markdown formatting (#1188) 2024-04-05 16:35:10 -07:00
Will Jones
ba56208a34 ci: fix job (#1193) 2024-04-05 16:35:10 -07:00
Ayush Chaurasia
2d2042d59e chore(python): Remove settings manager and telemetry. (#1198)
This PR is intended to remove settings manager. But because telemetry
and CLI depends on settings manager those need to go too.
2024-04-05 16:35:09 -07:00
Raghav Dixit
1c41a00d87 Embeddings: HF model hub support added via transformers (#1154) 2024-04-05 16:34:56 -07:00
Lance Release
ac63d4066b Updating package-lock.json 2024-04-05 16:34:53 -07:00
Lance Release
be2074b90d [python] Bump version: 0.6.6 → 0.6.7 2024-04-05 16:34:53 -07:00
Lance Release
6c452f29e9 Bump version: 0.4.15 → 0.4.16 2024-04-05 16:34:50 -07:00
Will Jones
8a7ded23b2 chore: upgrade to lance-0.10.9 (#1192) 2024-04-05 16:34:50 -07:00
QianZhu
871500db70 add a default value for search.limit to be consistent with python sdk (#1191)
Changed the default value for search.limit to be 10
2024-04-05 16:34:50 -07:00
Bert
a900bc0827 ensure table names are uri encoded for tables (#1189)
This prevents an issue where users can do something like:
```js
db.createTable('my-table#123123')
```
The server has logic to reject the '#' character in a table name, but
the request is currently answered with a 404 error, because it is routed to
`/v1/my-table#123123/create` and `#123123/create` is not parsed as part
of the path.
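
The effect can be reproduced with Python's standard URL parser. This is only an illustration of the parsing problem and of percent-encoding as the fix: `create_table_path` and the `example.invalid` host are made up, and the path shape simply mirrors the one quoted above.

```python
from urllib.parse import quote, urlsplit


def create_table_path(table_name: str) -> str:
    """Build the request path with the table name percent-encoded."""
    # safe="" also encodes "/", so a name can never add extra path segments.
    return f"/v1/{quote(table_name, safe='')}/create"


# Without encoding, everything after "#" is parsed as a URL fragment.
unencoded = urlsplit("https://example.invalid/v1/my-table#123123/create")
encoded = urlsplit("https://example.invalid" + create_table_path("my-table#123123"))

print(unencoded.path)      # /v1/my-table
print(unencoded.fragment)  # 123123/create
print(encoded.path)        # /v1/my-table%23123123/create
```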
2024-04-05 16:34:50 -07:00
Will Jones
47cff963c5 feat: ship fp16kernels in Python wheels (#1148)
Same deal as https://github.com/lancedb/lance/pull/2098
2024-04-05 16:34:50 -07:00
Lei Xu
e6ff3d848b chore: bump to 0.10.8 (#1187) 2024-04-05 16:34:50 -07:00
QianZhu
44d799ebb8 bug: fix the return value of countRows (#1186) 2024-04-05 16:34:50 -07:00
Lei Xu
1d3325dcc5 chore: bump lance version (#1185)
Bump lance version to `0.10.7`
2024-04-05 16:34:50 -07:00
Bert
ff45f25cf2 fix error decoding in nodejs client (#1184)
fixes: #1183
2024-04-05 16:34:50 -07:00
QianZhu
a34cc770c5 remote count_rows need to return the number (#1181) 2024-04-05 16:34:50 -07:00
eduardjbotha
f749b8808f SQL Documentation includes DataFusion functions (#1179)
Show that it is possible to use the DataFusion functions in the `WHERE`
clause.

Co-authored-by: Eduard Botha <eduard.botha@inovex.de>
2024-04-05 16:34:50 -07:00
Lei Xu
7e5a54b76a chore: add social link footer (#1177) 2024-04-05 16:34:50 -07:00
Lei Xu
3f14938392 chore: pass str instead of String to build table names (#1178) 2024-04-05 16:34:50 -07:00
Lance Release
3bd16e1b14 Updating package-lock.json 2024-04-05 16:34:46 -07:00
QianZhu
2f89fc26f1 feat: add filterable countRows to remote API (#1169) 2024-04-05 16:34:46 -07:00
Lance Release
e5bfec4318 [python] Bump version: 0.6.5 → 0.6.6 2024-04-05 16:34:46 -07:00
Lance Release
e0f50013ea Bump version: 0.4.14 → 0.4.15 2024-04-05 16:34:39 -07:00
Weston Pace
e4e64f9d6b chore: bump lance version to 0.10.6 (#1175) 2024-04-05 16:34:39 -07:00
Bert
6c9f4c4304 Update LanceDB Logo in README (#1167)
<img width="1034" alt="image"
src="https://github.com/lancedb/lancedb/assets/5846846/5b8aa53c-4d93-4c0e-bed4-80c238b319ba">
2024-04-05 16:34:39 -07:00
Weston Pace
e21b56293c docs: add a reference to @lancedb/lance in the docs (#1166)
We aren't yet ready to switch over the examples since almost all JS
examples rely on embeddings and we haven't yet ported those over.
However, this makes it possible for those that are interested to start
using `@lancedb/lancedb`
2024-04-05 16:34:39 -07:00
Will Jones
1b0aaf9ec3 ci: fix name collision in npm artifacts for vectordb (#1164)
Fixes #1163
2024-04-05 16:34:39 -07:00
Weston Pace
01239da082 chore: add nodejs to bumpversion (#1161)
The previous release failed to release nodejs because the nodejs version
wasn't bumped. This should fix that.
2024-04-05 16:34:39 -07:00
Weston Pace
6060c0cd36 chore: fix clippy (#1162) 2024-04-05 16:34:38 -07:00
Bert
bb179981dd added new logo to vercel example gif (#1158) 2024-04-05 16:34:38 -07:00
Bert
2e1f1c6d5d New logo on docs site (#1157) 2024-04-05 16:34:38 -07:00
Ayush Chaurasia
b916f5f132 docs: Add all available HF/sentence transformers embedding models list (#1134)
Solves -  https://github.com/lancedb/lancedb/issues/968
2024-04-05 16:34:38 -07:00
Weston Pace
f97c7dad8c docs: add the async python API to the docs (#1156) 2024-04-05 16:34:37 -07:00
Lance Release
ccf13f15d4 Bump version: 0.4.13 → 0.4.14 2024-04-05 16:33:37 -07:00
Weston Pace
287c5ca2f9 feat: add publish step for nodejs (#1155)
This will start publishing `@lancedb/lancedb` with the new nodejs
package on our releases.
2024-04-05 16:33:37 -07:00
Pranav Maddi
479289dd38 Adds an Ask LanceDB button to docs. (#1150)
This links out to the new [asklancedb.com](https://asklancedb.com) page.

Screenshots of the change:

![Quick start - LanceDB · 10 20am ·
03-22](https://github.com/lancedb/lancedb/assets/2371511/c45ba893-fc74-4957-bdd3-3712b351aff3)
![Quick start -
LanceDB](https://github.com/lancedb/lancedb/assets/2371511/d4762eb6-52af-4fd5-857e-3ed280716999)
2024-04-05 16:33:37 -07:00
Bert
1e41232f28 Node SDK Client middleware for HTTP Requests (#1130)
Adds client-side middleware to LanceDB Node SDK to instrument HTTP
Requests

Example - adding `x-request-id` request header:
```js
class HttpMiddleware {
    constructor({ requestId }) {
        this.requestId = requestId
    }

    onRemoteRequest(req, next) {
        req.headers['x-request-id'] = this.requestId
        return next(req)
    }
}

const db = await lancedb.connect({
  uri: 'db://remote-123',
  apiKey: 'sk_...',
})

let tables = await db.withMiddleware(new HttpMiddleware({ requestId: '123' })).tableNames();

```

---------

Co-authored-by: Weston Pace <weston.pace@gmail.com>
2024-04-05 16:33:37 -07:00
QianZhu
db2631c2ad remove warnings (#1147) 2024-04-05 16:33:37 -07:00
Lei Xu
473ef7e426 chore: validate table name (#1146)
Closes #1129
2024-04-05 16:33:37 -07:00
Lance Release
d32dc84653 [python] Bump version: 0.6.4 → 0.6.5 2024-04-05 16:33:37 -07:00
Lei Xu
1aaaeff511 chore: bump lance to 0.10.5 (#1145) 2024-04-05 16:33:37 -07:00
QianZhu
bdd07a5dfa fix nodejs test (#1141)
The error message for a query with the wrong vector dimension changed,
so the nodejs tests need this update to pass.
2024-04-05 16:33:37 -07:00
QianZhu
63db51c90d better error msg for query vector with wrong dim (#1140) 2024-04-05 16:33:37 -07:00
Ishani Ghose
0838e12b30 feat: add to_batches API #805 (#1048)
SDK
Python

Description
Exposes pyarrow batch api during query execution - relevant when there
is no vector search query, dataset is large and the filtered result is
larger than memory.
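
The idea of consuming a filtered result batch by batch can be sketched in plain Python. This does not use the `to_batches` API itself: `iter_batches` is a hypothetical helper, and lists of dicts stand in for Arrow record batches.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def iter_batches(
    rows: Iterable[dict],
    predicate: Callable[[dict], bool],
    batch_size: int,
) -> Iterator[List[dict]]:
    """Yield filtered rows in chunks so only one chunk is held at a time."""
    matching = (row for row in rows if predicate(row))  # lazy, nothing is materialized
    while True:
        batch = list(islice(matching, batch_size))
        if not batch:
            return
        yield batch


# Usage: the source is a generator, so the full result never sits in memory.
rows = ({"id": i} for i in range(10))
for batch in iter_batches(rows, lambda r: r["id"] % 2 == 0, batch_size=2):
    print(len(batch))
```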

---------

Co-authored-by: Ishani Ghose <isghose@amazon.com>
Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
2024-04-05 16:33:37 -07:00
Weston Pace
968c62cb8f feat: introduce ArrowNative wrapper struct for adding data that is already a RecordBatchReader (#1139)
In
2de226220b
I added a new `IntoArrow` trait for adding data into a table.
Unfortunately, it seems my approach for implementing the trait for
"things that are already record batch readers" was flawed. This PR
corrects that flaw and, conveniently, removes the need to box readers at
all (though it is ok if you do).
2024-04-05 16:33:37 -07:00
natcharacter
f6e9f8e3f4 Order by field support FTS (#1132)
This PR adds support for passing through a set of ordering fields at
index time (unsigned ints that tantivy can use as fast_fields) that you
can sort your results on at query time. This is useful for cases where
you want to get related hits, i.e. by keyword, but order those hits by
some other score, such as popularity.

For example, search for song descriptions that match on "sad AND jazz AND 1920"
and then order those by number of times played. Example usage can be
seen in the fts tests.
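
As a toy illustration of "match by keyword, then order by another field" (not the tantivy-backed implementation from this PR), the behaviour could be pictured as follows. The `songs` rows, the `search_ordered` helper, and the descending sort order are all assumptions made for the example.

```python
from typing import Dict, List

# Made-up rows for illustration only.
songs: List[Dict] = [
    {"description": "sad jazz ballad from 1920", "plays": 120},
    {"description": "upbeat jazz from 1920", "plays": 900},
    {"description": "sad jazz standard recorded in 1920", "plays": 4500},
]


def search_ordered(docs: List[Dict], terms: List[str], order_by: str) -> List[Dict]:
    """Keep docs containing every term, then sort by a numeric field."""
    hits = [d for d in docs if all(t in d["description"].split() for t in terms)]
    # Assumption: most-played first. The PR text does not state the direction.
    return sorted(hits, key=lambda d: d[order_by], reverse=True)


top = search_ordered(songs, ["sad", "jazz", "1920"], order_by="plays")
print([d["plays"] for d in top])  # [4500, 120]
```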

---------

Co-authored-by: Nat Roth <natroth@Nats-MacBook-Pro.local>
Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
2024-04-05 16:33:36 -07:00
Chang She
4466cfa958 feat(python): support writing huggingface dataset and dataset dict (#1110)
HuggingFace Dataset is written as arrow batches.
For DatasetDict, all splits are written with a "split" column appended.

- [x] what if the dataset schema already has a `split` column
- [x] add unit tests
2024-04-05 16:33:06 -07:00
Ayush Chaurasia
42fad84ec8 feat(python): Support reranking for vector and fts (#1103)
solves https://github.com/lancedb/lancedb/issues/1086

Usage Reranking with FTS:
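```python
# Assumes `db` is an open LanceDB connection and `Schema` is a LanceModel defined elsewhere.
from lancedb.rerankers import CohereReranker

retriever = db.create_table("fine-tuning", schema=Schema, mode="overwrite")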
pylist = [{"text": "Carson City is the capital city of the American state of Nevada. At the  2010 United States Census, Carson City had a population of 55,274."},
          {"text": "The Commonwealth of the Northern Mariana Islands is a group of islands in the Pacific Ocean that are a political division controlled by the United States. Its capital is Saipan."},
        {"text": "Charlotte Amalie is the capital and largest city of the United States Virgin Islands. It has about 20,000 people. The city is on the island of Saint Thomas."},
        {"text": "Washington, D.C. (also known as simply Washington or D.C., and officially as the District of Columbia) is the capital of the United States. It is a federal district. "},
        {"text": "Capital punishment (the death penalty) has existed in the United States since before the United States was a country. As of 2017, capital punishment is legal in 30 of the 50 states."},
        {"text": "North Dakota is a state in the United States. 672,591 people lived in North Dakota in the year 2010. The capital and seat of government is Bismarck."},
        ]
retriever.add(pylist)
retriever.create_fts_index("text", replace=True)

query = "What is the capital of the United States?"
reranker = CohereReranker(return_score="all")
print(retriever.search(query, query_type="fts").limit(10).to_pandas())
print(retriever.search(query, query_type="fts").rerank(reranker=reranker).limit(10).to_pandas())
```
Result
```
                                                text                                             vector     score
0  Capital punishment (the death penalty) has exi...  [0.099975586, 0.047943115, -0.16723633, -0.183...  0.729602
1  Charlotte Amalie is the capital and largest ci...  [-0.021255493, 0.03363037, -0.027450562, -0.17...  0.678046
2  The Commonwealth of the Northern Mariana Islan...  [0.3684082, 0.30493164, 0.004600525, -0.049407...  0.671521
3  Carson City is the capital city of the America...  [0.13989258, 0.14990234, 0.14172363, 0.0546569...  0.667898
4  Washington, D.C. (also known as simply Washing...  [-0.0090408325, 0.42578125, 0.3798828, -0.3574...  0.653422
5  North Dakota is a state in the United States. ...  [0.55859375, -0.2109375, 0.14526367, 0.1634521...  0.639346
                                                text                                             vector     score  _relevance_score
0  Washington, D.C. (also known as simply Washing...  [-0.0090408325, 0.42578125, 0.3798828, -0.3574...  0.653422          0.979977
1  The Commonwealth of the Northern Mariana Islan...  [0.3684082, 0.30493164, 0.004600525, -0.049407...  0.671521          0.299105
2  Capital punishment (the death penalty) has exi...  [0.099975586, 0.047943115, -0.16723633, -0.183...  0.729602          0.284874
3  Carson City is the capital city of the America...  [0.13989258, 0.14990234, 0.14172363, 0.0546569...  0.667898          0.089614
4  North Dakota is a state in the United States. ...  [0.55859375, -0.2109375, 0.14526367, 0.1634521...  0.639346          0.063832
5  Charlotte Amalie is the capital and largest ci...  [-0.021255493, 0.03363037, -0.027450562, -0.17...  0.678046          0.041462
```

## Vector Search usage:
```python
query = "What is the capital of the United States?"
reranker = CohereReranker(return_score="all")
print(retriever.search(query).limit(10).to_pandas())
print(retriever.search(query).rerank(reranker=reranker, query=query).limit(10).to_pandas()) # <-- Note: passing extra string query here
```

Results
```
                                                text                                             vector  _distance
0  Capital punishment (the death penalty) has exi...  [0.099975586, 0.047943115, -0.16723633, -0.183...  39.728973
1  Washington, D.C. (also known as simply Washing...  [-0.0090408325, 0.42578125, 0.3798828, -0.3574...  41.384884
2  Carson City is the capital city of the America...  [0.13989258, 0.14990234, 0.14172363, 0.0546569...  55.220200
3  Charlotte Amalie is the capital and largest ci...  [-0.021255493, 0.03363037, -0.027450562, -0.17...  58.345654
4  The Commonwealth of the Northern Mariana Islan...  [0.3684082, 0.30493164, 0.004600525, -0.049407...  60.060867
5  North Dakota is a state in the United States. ...  [0.55859375, -0.2109375, 0.14526367, 0.1634521...  64.260544
                                                text                                             vector  _distance  _relevance_score
0  Washington, D.C. (also known as simply Washing...  [-0.0090408325, 0.42578125, 0.3798828, -0.3574...  41.384884          0.979977
1  The Commonwealth of the Northern Mariana Islan...  [0.3684082, 0.30493164, 0.004600525, -0.049407...  60.060867          0.299105
2  Capital punishment (the death penalty) has exi...  [0.099975586, 0.047943115, -0.16723633, -0.183...  39.728973          0.284874
3  Carson City is the capital city of the America...  [0.13989258, 0.14990234, 0.14172363, 0.0546569...  55.220200          0.089614
4  North Dakota is a state in the United States. ...  [0.55859375, -0.2109375, 0.14526367, 0.1634521...  64.260544          0.063832
5  Charlotte Amalie is the capital and largest ci...  [-0.021255493, 0.03363037, -0.027450562, -0.17...  58.345654          0.041462
```
2024-04-05 16:33:06 -07:00
Weston Pace
b36c750cc7 fix: fix compile error in example caused by merge conflict (#1135) 2024-04-05 16:33:06 -07:00
Weston Pace
a23b856410 feat: change DistanceType to be independent thing instead of reusing lance_linalg (#1133)
This PR originated from a request to add `Serialize` / `Deserialize` to
`lance_linalg::distance::DistanceType`. However, that is a strange
request for `lance_linalg` which shouldn't really have to worry about
`Serialize` / `Deserialize`. The problem is that `lancedb` is re-using
`DistanceType` and things in `lancedb` do need to worry about
`Serialize`/`Deserialize` (because `lancedb` needs to support remote
client).

On the bright side, separating the two types allows us to independently
document distance type and allows `lance_linalg` to make changes to
`DistanceType` in the future without having to worry about backwards
compatibility concerns.
2024-04-05 16:33:06 -07:00
Weston Pace
0fe0976a0e docs: add links to rust SDK docs, remove references to rust SDK being unstable / experimental (#1131) 2024-04-05 16:33:05 -07:00
Weston Pace
abde77eafb feat(rust): add trait for incoming data (#1128)
This will make it easier for 3rd party integrations. They simply need to
implement `IntoArrow` for their types in order for those types to be
used in ingestion.
2024-04-05 16:32:47 -07:00
vincent d warmerdam
85a9ef472f Unhide Pydantic guides in Docs (#1122)
@wjones127 after fixing https://github.com/lancedb/lancedb/issues/1112 I
noticed something else on the docs. There's an odd chunk of the docs
missing
[here](https://lancedb.github.io/lancedb/guides/tables/#from-a-polars-dataframe).
I can see the heading, but after clicking it the contents don't show.

![CleanShot 2024-03-15 at 23 40
17@2x](https://github.com/lancedb/lancedb/assets/1019791/04784b19-0200-4c3f-ae17-7a8f871ef9bd)

Upon inspection it was a markdown issue, one tab too many on a whole
segment.

This PR fixes it. It looks like this now and the sections appear again:

![CleanShot 2024-03-15 at 23 42
32@2x](https://github.com/lancedb/lancedb/assets/1019791/c5aaec4c-1c37-474d-9fb0-641f4cf52626)
2024-04-05 16:32:47 -07:00
Weston Pace
4180b44472 feat: refactor the query API and add query support to the python async API (#1113)
In addition, there are also a number of changes in nodejs to the
docstrings of existing methods because this PR adds a jsdoc linter.
2024-04-05 16:32:47 -07:00
Lance Release
2db257ca29 [python] Bump version: 0.6.3 → 0.6.4 2024-04-05 16:32:41 -07:00
Lance Release
1f816d597a Bump version: 0.4.12 → 0.4.13 2024-04-05 16:32:31 -07:00
Weston Pace
c1e3dc48af feat: bump lance to 0.10.4 (#1123) 2024-04-05 16:32:31 -07:00
vincent d warmerdam
b9afc01cfd Explain Voronoi seed initialisation (#1114)
This PR fixes https://github.com/lancedb/lancedb/issues/1112. It turned
out that K-means is currently used internally, so I figured adding that
context to the docs would be nice.
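
For readers unfamiliar with it, textbook k-means (Lloyd's algorithm) can be sketched as below. This is a generic pure-Python version, not the code LanceDB runs internally; the `kmeans` function, its random choice of initial seeds, and the fixed iteration count are assumptions for illustration.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Point], k: int, iterations: int = 10, seed: int = 0) -> List[Point]:
    """Return k centroids found by Lloyd's algorithm (plain k-means)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial seeds: k distinct input points
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters: List[List[Point]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to its cluster mean (unchanged if empty).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids
```

The resulting centroids partition the space into Voronoi cells: every vector belongs to the cell of its nearest centroid.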
2024-04-05 16:32:31 -07:00
Christian Di Lorenzo
8bb983bc3d fix(python): Add python azure blob read support (#1102)
I know there's a larger effort to have the python client based on the
core rust implementation, but in the meantime there have been several
issues (#1072 and #485) with some of the azure blob storage calls due to
pyarrow not natively supporting an azure backend. To this end, I've
added an optional import of the fsspec implementation of azure blob
storage [`adlfs`](https://pypi.org/project/adlfs/) and passed it to
`pyarrow.fs`. I've modified the existing test and manually verified it
with some real credentials to make sure it behaves as expected.

It should be now as simple as:

```python
import lancedb

db = lancedb.connect("az://blob_name/path")
table = db.open_table("test")
table.search(...)
```

Thank you for this cool project and we're excited to start using this
for real shortly! 🎉 And thanks to @dwhitena for bringing it to my
attention with his prediction guard posts.

Co-authored-by: christiandilorenzo <christian.dilorenzo@infiniaml.com>
2024-04-05 16:32:31 -07:00
Weston Pace
1ea0c33545 feat: update lance to v0.10.3 (#1094) 2024-04-05 16:32:31 -07:00
Raghav Dixit
765569425c doc updates (#1085)
closes #1084
2024-04-05 16:32:15 -07:00
Chang She
377832e532 feat(python): support optional vector field in pydantic model (#1097)
The LanceDB embeddings registry allows users to annotate the pydantic
model used as table schema with the desired embedding function, e.g.:

```python
class Schema(LanceModel):
    id: str
    vector: Vector(openai.ndims()) = openai.VectorField()
    text: str = openai.SourceField()
```

Tables created like this do not require embeddings to be calculated by
the user explicitly, e.g. this works:

```python
table.add([{"id": "foo", "text": "rust all the things"}])
```

However, constructing pydantic model instances without a vector fails
because it's a required field.

Instead, you need to add a default value:

```python
class Schema(LanceModel):
    id: str
    vector: Vector(openai.ndims()) = openai.VectorField(default=None)
    text: str = openai.SourceField()
```

then this completes without errors:
```python
table.add([Schema(id="foo", text="rust all the things")])
```

However, all of the vectors are then filled with zeros. To fix this,
`add_vector_col` needs an additional check so that the embedding
generation is called.
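
The kind of check meant here can be sketched without LanceDB at all. The helper below is hypothetical and is not the actual `add_vector_col` code: `fill_missing_vectors` and the `embed` callback are illustrative names, and plain dicts stand in for table rows.

```python
from typing import Callable, Dict, List


def fill_missing_vectors(
    rows: List[Dict],
    embed: Callable[[str], List[float]],  # stand-in for the embedding function
    source: str = "text",
    target: str = "vector",
) -> List[Dict]:
    """Compute an embedding for every row whose vector is absent or None."""
    filled = []
    for row in rows:
        if row.get(target) is None:
            # Treat None like a missing column: generate the embedding
            # instead of leaving a placeholder value behind.
            row = {**row, target: embed(row[source])}
        filled.append(row)
    return filled
```

Rows that already carry a vector pass through unchanged, so user-supplied embeddings are never overwritten.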
2024-04-05 16:32:15 -07:00
QianZhu
723defbe7e add index_stats python api (#1096)
the integration test will be covered in another PR:
https://github.com/lancedb/sophon/pull/1876
2024-04-05 16:32:15 -07:00
Chang She
c33110397e fix(python): fix typo in passing in the api_key explicitly (#1098)
fix silly typo
2024-04-05 16:32:15 -07:00
Weston Pace
b6a522d483 feat: add list_indices to the async api (#1074) 2024-04-05 16:32:15 -07:00
Weston Pace
9031ec6878 feat: add update to the async API (#1093) 2024-04-05 16:32:15 -07:00
Will Jones
f0c5f5ba62 fix: handle uri in object (#1091)
Fixes #1078
2024-04-05 16:32:15 -07:00
Weston Pace
47daf9b7b0 feat: add time travel operations to the async API (#1070) 2024-04-05 16:32:15 -07:00
Weston Pace
f822255683 feat: add create_index to the async python API (#1052)
This also refactors the rust lancedb index builder API (and,
correspondingly, the nodejs API)
2024-04-05 16:32:14 -07:00
Will Jones
90af5cf028 fix: propagate filter validation errors (#1092)
In Rust and Node, we have been swallowing filter validation errors. If
there was an error in parsing the filter, then the filter was silently
ignored, returning unfiltered results.
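
The difference between swallowing and propagating a validation error can be shown with a toy filter parser. Nothing here reflects the real Rust or Node code: `parse_filter`, `query`, and the single supported `column > number` syntax are invented for the example.

```python
from typing import Callable, Dict, List

Row = Dict[str, int]


def parse_filter(expr: str, columns: List[str]) -> Callable[[Row], bool]:
    """Parse a tiny 'column > number' filter; raise ValueError if it is invalid."""
    parts = expr.split()
    if len(parts) != 3 or parts[1] != ">":
        raise ValueError(f"Unsupported filter syntax: {expr!r}")
    column, _, literal = parts
    if column not in columns:
        raise ValueError(f"Unknown column in filter: {column!r}")
    threshold = int(literal)  # int() raises ValueError for a non-numeric literal
    return lambda row: row[column] > threshold


def query(rows: List[Row], expr: str, columns: List[str]) -> List[Row]:
    # Deliberately no try/except: a bad filter reaches the caller as an error
    # instead of silently producing unfiltered rows.
    predicate = parse_filter(expr, columns)
    return [row for row in rows if predicate(row)]
```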

Fixes #1081
2024-04-05 16:31:53 -07:00
Lance Release
fec6f92184 [python] Bump version: 0.6.2 → 0.6.3 2024-04-05 16:31:53 -07:00
Rob Meng
35bc4f3078 feat: configurable timeout for LanceDB Cloud queries (#1090) 2024-04-05 16:31:53 -07:00
Ivan Leo
89ce417452 Update default_embedding_functions.md (#1073)
Added a small bit of documentation for the `dim` feature which is
provided by the new `text-embedding-3` model series that allows users to
shorten an embedding.

Happy to discuss a bit on the phrasing but I struggled quite a bit with
getting it to work so wanted to help others who might want to use the
newer model too
2024-04-05 16:31:53 -07:00
Weston Pace
d4502add44 Remove remote integration workflow (#1076) 2024-04-05 16:31:53 -07:00
Will Jones
334857a8cb fix: Allow converting from NativeTable to Table (#1069) 2024-04-05 16:31:53 -07:00
Lance Release
386d5da22f Bump version: 0.4.11 → 0.4.12 2024-04-05 16:31:45 -07:00
Lance Release
77ba97416d [python] Bump version: 0.6.1 → 0.6.2 2024-04-05 16:31:45 -07:00
Will Jones
5120bf262b fix: make checkout_latest force a reload (#1064)
#1002 accidentally changed `checkout_latest` to do nothing if the table
was already in latest mode. This PR makes sure it forces a reload of the
table (if there is a newer version).
2024-04-05 16:31:45 -07:00
Lei Xu
f27167017b chore: bump lance to 0.10.2 (#1061) 2024-04-05 16:31:45 -07:00
Weston Pace
73c69a6b9a feat: page_token / limit to native table_names function. Use async table_names function from sync table_names function (#1059)
The synchronous table_names function in python lancedb relies on arrow's
filesystem which behaves slightly differently than object_store. As a
result, the function would not work properly in GCS.

However, the async table_names function uses object_store directly and
thus is accurate. In most cases we can fall back to using the async
table_names function and so this PR does so. The one case where we cannot is
if the user is already in an async context (we can't start a new async
event loop). Soon, we can just redirect those users to use the async API
instead of the sync API and so that case will eventually go away. For
now, we fall back to the old behavior.
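
The dispatch described above can be sketched with `asyncio` alone. The function names and the two callbacks are hypothetical and stand in for the real implementations; only `asyncio.get_running_loop` and `asyncio.run` are actual standard-library calls.

```python
import asyncio
from typing import Any, Callable, Coroutine, List


def in_event_loop() -> bool:
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


def table_names_sync(
    async_impl: Callable[[], Coroutine[Any, Any, List[str]]],
    legacy_impl: Callable[[], List[str]],
) -> List[str]:
    """Prefer the async implementation; use the legacy path inside a loop."""
    if in_event_loop():
        # asyncio.run() would raise here, so keep the old behaviour.
        return legacy_impl()
    return asyncio.run(async_impl())
```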
2024-04-05 16:31:45 -07:00
Will Jones
05f9a77baf feat: more accessible errors (#1025)
The fact that we convert errors to strings makes them really hard to
work with. For example, in SaaS we want to know whether the underlying
`lance::Error` was the `InvalidInput` variant, so we can return a 400
instead of a 500.
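
In Python terms, the benefit of structured errors over strings looks roughly like this. The exception classes and `http_status` are hypothetical analogues of the Rust error variants mentioned above, not part of any LanceDB API.

```python
class LanceDbError(Exception):
    """Hypothetical base class for structured errors."""


class InvalidInput(LanceDbError):
    """The caller supplied a bad argument."""


class Internal(LanceDbError):
    """Something went wrong on the server side."""


def http_status(error: Exception) -> int:
    """Map an error to a status code by type instead of parsing its message."""
    if isinstance(error, InvalidInput):
        return 400
    return 500


print(http_status(InvalidInput("vector has wrong dimension")))  # 400
print(http_status(Internal("disk failure")))                    # 500
```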
2024-04-05 16:31:45 -07:00
Chang She
10089481c0 doc(python): document the method in fts (#982)
Co-authored-by: prrao87 <prrao87@gmail.com>
Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>
2024-04-05 16:31:45 -07:00
Ayush Chaurasia
b5326d31e9 fix(python): Few fts patches (#1039)
1. filtering with fts mutated the schema, which caused schema mismatch
problems with hybrid search as it combines fts and vector search tables.
2. fts with filter failed with `with_row_id`. This was because row_id
was calculated before filtering which caused size mismatch on attaching
it after.
3. The fix for 1 meant that now row_id is attached before filtering but
passing a filter to `to_lance` on a dataset that already contains
`_rowid` raises a panic from lance. So temporarily, in the case where fts is
used with a filter AND `with_row_id`, we just force the user onto the
duckdb pathway.

---------

Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
2024-04-05 16:31:45 -07:00
Weston Pace
c60a193767 fix: sanitize foreign schemas (#1058)
Arrow-js uses brittle `instanceof` checks throughout the code base.
These fail unless the library instance that produced the object matches
exactly the same instance the vectordb is using. At a minimum, this
means that a user using arrow version 15 (or any version that doesn't
match exactly the version that vectordb is using) will get strange
errors when they try and use vectordb.

However, there are even cases where the versions can be perfectly
identical, and the instanceof check still fails. One such example is
when using `vite` (e.g. https://github.com/vitejs/vite/issues/3910)

This PR solves the problem in a rather brute force, but workable,
fashion. If we encounter a schema that does not pass the `instanceof`
check then we will attempt to sanitize that schema by traversing the
object and, if it has all the correct properties, constructing an
appropriate `Schema` instance via deep cloning.
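
The same "check the shape, then rebuild" approach can be pictured in Python. This is an analogue only: the PR itself targets arrow-js, and the `Field`, `Schema`, and `sanitize_schema` names below are simplified stand-ins rather than Arrow types.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass(frozen=True)
class Field:
    name: str
    type: str


@dataclass(frozen=True)
class Schema:
    fields: List[Field]


def sanitize_schema(obj: Any) -> Schema:
    """Return a local Schema, rebuilding look-alike objects field by field."""
    if isinstance(obj, Schema):
        return obj  # already our own class, nothing to do
    fields = getattr(obj, "fields", None)
    if not isinstance(fields, (list, tuple)):
        raise TypeError("Object does not look like a schema: missing 'fields'")
    rebuilt = []
    for f in fields:
        name, type_ = getattr(f, "name", None), getattr(f, "type", None)
        if not isinstance(name, str) or type_ is None:
            raise TypeError("Schema field is missing 'name' or 'type'")
        rebuilt.append(Field(name=name, type=str(type_)))
    return Schema(fields=rebuilt)
```

An object produced by a different copy of the library fails the `isinstance` check but still has the right attributes, so it is copied into a local `Schema` instead of being rejected.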
2024-04-05 16:31:42 -07:00
Weston Pace
785ecfa037 feat: reconfigure typescript linter / formatter for nodejs (#1042)
The eslint rules specify some formatting requirements that are rather
strict and conflict with vscode's default formatter. I was unable to get
auto-formatting set up correctly. Also, eslint has quite recently
[given up on
formatting](https://eslint.org/blog/2023/10/deprecating-formatting-rules/)
and recommends using a 3rd party formatter.

This PR adds prettier as the formatter. It restores the eslint rules to
their defaults. This does mean we now have the "no explicit any" check
back on. I know that rule is pedantic but it did help me catch a few
corner cases in type testing that weren't covered in the current code.
Leaving in draft as this is dependent on other PRs.
2024-04-05 16:31:36 -07:00
Weston Pace
8033a44d68 feat: add support for add to async python API (#1037)
In order to add support for `add` we needed to migrate the rust `Table`
trait to a `Table` struct and `TableInternal` trait (similar to the way
the connection is designed).

While doing this we also cleaned up some inconsistencies between the
SDKs:

* Python and Node are garbage collected languages and it can be
difficult to trigger something to be freed. The convention for these
languages is to have some kind of close method. I added a close method
to both the table and connection which will drop the underlying rust
object.
* We made significant improvements to table creation in
cc5f2136a6
for the `node` SDK. I copied these changes to the `nodejs` SDK.
* The nodejs tables were using fs to create tmp directories and these
were not getting cleaned up. This is mostly harmless but annoying and so
I changed it up a bit to ensure we clean up tmp directories.
* ~~countRows in the node SDK was returning `bigint`. I changed it to
return `number`~~ (this actually happened in a previous PR)
* Tables and connections now implement `std::fmt::Display` which is
hooked into python's `__repr__`. Node has no concept of a regular "to
string" function and so I added a `display` method.
* Python method signatures are changing so that optional parameters are
always `Optional[foo] = None` instead of something like `foo = False`.
This is because we want those defaults to be in rust whenever possible
(though we still need to mention the default in documentation).
* I changed the python `AsyncConnection/AsyncTable` classes from
abstract classes with a single implementation to just classes because we
no longer have the remote implementation in python.
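
Three of the conventions listed above (an explicit `close`, a readable `__repr__`, and `Optional[...] = None` defaults) can be sketched together. The `Table` class below is a hypothetical stand-in, not the real `AsyncTable`; its `handle` attribute plays the role of the underlying Rust object.

```python
from typing import Optional


class Table:
    """Hypothetical wrapper that owns a native handle."""

    def __init__(self, name: str, handle: object):
        self._name = name
        self._handle: Optional[object] = handle  # stands in for the Rust object

    def close(self) -> None:
        """Release the native object now instead of waiting for the GC."""
        self._handle = None

    def is_open(self) -> bool:
        return self._handle is not None

    def search(self, limit: Optional[int] = None) -> str:
        # None means "let the native layer apply its own default",
        # so the default value lives in one place only.
        if self._handle is None:
            raise RuntimeError(f"Table '{self._name}' is closed")
        return f"search(limit={'default' if limit is None else limit})"

    def __repr__(self) -> str:
        state = "open" if self.is_open() else "closed"
        return f"Table({self._name}, {state})"
```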

Note: this does NOT add the `add` function to the remote table. This PR
was already large enough, and the remote implementation is unique
enough, that I am going to do all the remote stuff at a later date (we
should have the structure in place and correct so there shouldn't be any
refactor concerns)

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2024-04-05 16:31:36 -07:00
Chang She
3bbcaba65b chore(rust): update rust version (#810) 2024-04-05 16:31:36 -07:00
Chang She
e60fde73ba feat(python): allow user to override api url (#1054) 2024-04-05 16:31:36 -07:00
Chang She
a7dbe933dc chore(python): use pypi tantivy to speed up CI (#987) 2024-04-05 16:31:36 -07:00
Chang She
4f34a01020 doc: fix docs deployment GHA (#1055) 2024-04-05 16:31:36 -07:00
Prashanth Rao
f9c244e608 [docs]: Fix issues with Rust code snippets in "quick start" (#1047)
The renaming of `vectordb` to `lancedb` broke the [quick start
docs](https://lancedb.github.io/lancedb/basic/#__tabbed_5_3) (it's
pointing to a non-existent directory). This PR fixes the code snippets
and the paths in the docs page.

Additionally, more fixes related to indexing docs below 👇🏽.
2024-04-05 16:31:36 -07:00
Louis Guitton
7f9ef0d329 Fix default_embedding_functions.md (#1043)
typo and broken table
2024-04-05 16:31:36 -07:00
Chang She
a3761f4209 doc: fix langchain link (#1053) 2024-04-05 16:31:36 -07:00
Chang She
4b40dad963 feat(python): add model_names() method to openai embedding function (#1049)
small QoL improvement
2024-04-05 16:31:36 -07:00
QianZhu
b32b69c993 Add create scalar index to sdk (#1033) 2024-04-05 16:31:36 -07:00
Weston Pace
4299f719ec feat: port create_table to the async python API and the remote rust API (#1031)
I've also started `ASYNC_MIGRATION.MD` to keep track of the breaking
changes from sync to async python.
2024-04-05 16:31:36 -07:00
Lance Release
accf31fa92 [python] Bump version: 0.6.0 → 0.6.1 2024-04-05 16:31:36 -07:00
Rob Meng
b8eb5d4bfe fix: fix columns type for pydantic 2.x (#1045) 2024-04-05 16:31:36 -07:00
Weston Pace
629c622d15 feat: Initial remote table implementation for rust (#1024)
This will eventually replace the remote table implementations in python
and node.
2024-04-05 16:31:36 -07:00
Lance Release
45b5b66c82 [python] Bump version: 0.5.7 → 0.6.0 2024-04-05 16:31:36 -07:00
BubbleCal
5896541bb8 chore: enable test for dropping table (#1038)
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2024-04-05 16:31:36 -07:00
natcharacter
e29e4cc36d A simple base usage that install the dependencies necessary to use FT… (#1036)
A simple base usage that installs the dependencies necessary to use FTS
and Hybrid search

---------

Co-authored-by: Nat Roth <natroth@Nats-MacBook-Pro.local>
Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
2024-04-05 16:31:36 -07:00
Rob Meng
f3de3d990d chore: upgrade to lance 0.10.1 (#1034)
upgrade to lance 0.10.1 and update doc string to reflect dynamic
projection options
2024-04-05 16:31:36 -07:00
BubbleCal
0a8e258247 chore(rust): report the TableNotFound error while dropping non-exist table (#1022)
this will work after upgrading lance with
https://github.com/lancedb/lance/pull/1995 merged
see #884 for details

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2024-04-05 16:31:36 -07:00
Weston Pace
2cec2a8937 feat: add a basic async python client starting point (#1014)
This changes `lancedb` from a "pure python" setuptools project to a
maturin project and adds a rust lancedb dependency.

The async python client is extremely minimal (only `connect` and
`Connection.table_names` are supported). The purpose of this PR is to
get the infrastructure in place for building out the rest of the async
client.

Although this is not technically a breaking change (no APIs are
changing) it is still a considerable change in the way the wheels are
built because they now include the native shared library.
2024-04-05 16:31:34 -07:00
Will Jones
464a36ad38 feat: {add|alter|drop}_columns APIs (#1015)
Initial work for #959. This exposes the basic functionality for each in
all of the APIs. Will add user guide documentation in a later PR.
2024-04-05 16:30:47 -07:00
Weston Pace
ad1e81a1d1 refactor: change arrow from a direct dependency to a peer dependency (#984)
BREAKING CHANGE: users will now need to npm install `apache-arrow` and
`@apache-arrow/ts` themselves.
2024-04-05 16:30:47 -07:00
Lance Release
562d1af1ed Bump version: 0.4.10 → 0.4.11 2024-04-05 16:30:40 -07:00
Weston Pace
2163502b31 refactor: rename the rust crate from vectordb to lancedb (#1012)
This also renames the new experimental node package to lancedb. The
classic node package remains named vectordb.

The goal here is to avoid introducing piecemeal breaking changes to the
vectordb crate. Instead, once the new API is stabilized, we will
officially release the lancedb crate and deprecate the vectordb crate.
The same pattern will eventually happen with the npm package vectordb.
2024-04-05 16:30:40 -07:00
Will Jones
c5b0934bfb feat(node): add read_consistency_interval to Node and Rust (#1002)
This PR adds the same consistency semantics as was added in #828. It
*does not* add the same lazy-loading of tables, since that breaks some
existing tests.

This closes #998.

---------

Co-authored-by: Weston Pace <weston.pace@gmail.com>
2024-04-05 16:30:40 -07:00
Lance Release
ef54bd5ba2 [python] Bump version: 0.5.6 → 0.5.7 2024-04-05 16:30:40 -07:00
Lei Xu
80e4d14c02 chore: bump pylance to 0.9.18 (#1011) 2024-04-05 16:30:40 -07:00
Raghav Dixit
fdabf31984 python(feat): Imagebind embedding fn support (#1003)
Added imagebind fn support; steps to install are mentioned in the docstring.
pytest slow checks done locally

---------

Co-authored-by: Ayush Chaurasia <ayush.chaurarsia@gmail.com>
2024-04-05 16:30:40 -07:00
Ayush Chaurasia
538d0320f7 Docs: add meta tags (#1006) 2024-04-05 16:30:40 -07:00
Weston Pace
cbc0c439ef refactor: rust vectordb API stabilization of the Connection trait (#993)
This is the start of a more comprehensive refactor and stabilization of
the Rust API. The `Connection` trait is cleaned up to not require
`lance` and to match the `Connection` trait in other APIs. In addition,
the concrete implementation `Database` is hidden.

BREAKING CHANGE: The struct `crate::connection::Database` is now gone.
Several examples opened a connection using `Database::connect` or
`Database::connect_with_params`. Users should now use
`vectordb::connect`.

BREAKING CHANGE: The `connect`, `create_table`, and `open_table` methods
now all return a builder object. This means that a call like
`conn.open_table(..., opt1, opt2)` will now become
`conn.open_table(...).opt1(opt1).opt2(opt2).execute()` In addition, the
structure of options has changed slightly. However, no options
capability has been removed.

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2024-04-05 16:30:40 -07:00
Lance Release
69492586f0 [python] Bump version: 0.5.5 → 0.5.6 2024-04-05 16:30:40 -07:00
Bert
f5627dac14 lance 0.9.18 (#1000) 2024-04-05 16:30:40 -07:00
Johannes Kolbe
32bfb68ac3 apply fixes for notebook (#989) 2024-04-05 16:30:40 -07:00
Ayush Chaurasia
bc871169f0 docs: Add meta tag for image preview (#988)
I think this should work. Need to deploy it to be sure as it can be
tested locally. Can be tested here.

2 things about this solution:
* All pages have the same meta tag, i.e., the lancedb banner
* If needed, we can automatically use the first image of each page and
generate meta tags using the ultralytics mkdocs plugin that we did for
this purpose - https://github.com/ultralytics/mkdocs
2024-04-05 16:30:40 -07:00
Chang She
3fc835e124 doc: update navigation links for embedding functions (#986) 2024-04-05 16:30:40 -07:00
Chang She
484a121866 doc: improve embedding functions documentation (#983)
Got some user feedback that the `implicit` / `explicit` distinction is
confusing.
Instead I was thinking we would just deprecate the `with_embeddings` API
and then organize working with embeddings into 3 buckets:

1. manually generate embeddings
2. use a provided embedding function
3. define your own custom embedding function
2024-04-05 16:30:40 -07:00
Chang She
bc850e6add feat(python): add optional threadpool for batch requests (#981)
Currently if a batch request is given to the remote API, each query is
sent sequentially. We should allow the user to specify a threadpool.
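
A rough sketch of the idea, using only the standard library. `run_batch`, `send_query`, and `max_workers` are hypothetical names for illustration; this is not the remote client's actual interface.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence


def run_batch(
    queries: Sequence[str],
    send_query: Callable[[str], str],
    max_workers: Optional[int] = None,
) -> List[str]:
    # Without a pool size, fall back to sending each query sequentially.
    if max_workers is None:
        return [send_query(q) for q in queries]
    # Executor.map returns results in input order, even if the
    # underlying requests complete out of order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send_query, queries))
```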
2024-04-05 16:30:40 -07:00
Will Jones
26eec4bef4 fix: use static C runtime on Windows (#979)
We depend on C static runtime, but not all Windows machines have that.
So might be worth statically linking it.

https://github.com/reorproject/reor/issues/36#issuecomment-1948876463
2024-04-05 16:30:40 -07:00
Will Jones
f84a4855ca docs: show DuckDB with dataset, not table (#974)
Using datasets is the preferred way to allow filter and projection pushdown,
as well as aggregating larger-than-memory tables.
2024-04-05 16:30:40 -07:00
Ayush Chaurasia
aecafa6479 docs: Minimal reranking evaluation benchmarks (#977) 2024-04-05 16:30:40 -07:00
Will Jones
efa846b6e5 chore: upgrade lance to 0.9.16 (#975) 2024-04-05 16:30:36 -07:00
Will Jones
cf3dbcf684 ci: fix Node ARM release build (#971)
When we turned on fat LTO builds, we made the release build job **much**
more compute and memory intensive. The ARM runners have particularly low
memory per core, which makes them susceptible to OOM errors. To avoid
issues, I have enabled memory swap on ARM and bumped the size of the
runner.
2024-04-05 16:30:36 -07:00
Will Jones
c425d3759d ci: reduce number of build jobs on aarch64 to avoid OOM (#970) 2024-04-05 16:30:36 -07:00
Lance Release
fded15c9fe [python] Bump version: 0.5.4 → 0.5.5 2024-04-05 16:30:36 -07:00
Lance Release
e888cb5b48 Bump version: 0.4.9 → 0.4.10 2024-04-05 16:30:30 -07:00
Weston Pace
9241f47f0e feat: make it easier to create empty tables (#942)
This PR also reworks the table creation utilities significantly so that
they are more consistent, built on top of each other, and thoroughly
documented.
2024-04-05 16:30:30 -07:00
Prashanth Rao
b014c24e66 [docs]: Fix typos and clarity in hybrid search docs (#966)
- Fixed typos and added some clarity to the hybrid search docs
- Changed "Airbnb" case to be as per the [official company
name](https://en.wikipedia.org/wiki/Airbnb) (the "bnb" shouldn't be
capitalized), and the text in the document aligns with this
- Fixed headers in nav bar
2024-04-05 16:30:30 -07:00
Will Jones
68115f1369 fix: wrap in BigInt to avoid upstream bug (#962)
Closes #960
2024-04-05 16:30:30 -07:00
Ayush Chaurasia
f78fe721db docs: Add setup cell for colab example (#965) 2024-04-05 16:30:30 -07:00
Ayush Chaurasia
510e8378bc feat(python): hybrid search updates, examples, & latency benchmarks (#964)
- Rename safe_import -> attempt_import_or_raise (closes
https://github.com/lancedb/lancedb/pull/923)
- Update docs
- Add Notebook example (@changhiskhan you can use it for the talk. Comes
with "open in colab" button)
- Latency benchmark & results comparison, sanity check on real-world
data
- Updates the default openai model to gpt-4
2024-04-05 16:30:30 -07:00
Will Jones
1045af6c09 chore: fix clippy lints (#963) 2024-04-05 16:30:30 -07:00
QianZhu
7afcfca10d Qian/make vector col optional (#950)
remote SDK tests were completed through lancedb_integtest
2024-04-05 16:30:29 -07:00
Will Jones
88205aba64 fix(node): statically link lzma (#961)
Fixes #956

Same changes as https://github.com/lancedb/lance/pull/1934
2024-04-05 16:30:10 -07:00
Weston Pace
da47938a43 chore: use a bigger runner for NPM publish jobs on aarch64 to avoid OOM (#955) 2024-04-05 16:30:06 -07:00
Lance Release
03e705c14c Bump version: 0.4.8 → 0.4.9 2024-04-05 16:29:58 -07:00
Lance Release
a7e60a4c3f [python] Bump version: 0.5.3 → 0.5.4 2024-04-05 16:29:58 -07:00
Weston Pace
e12bdc78bb chore: bump lance version to 0.9.15 (#949) 2024-04-05 16:29:58 -07:00
Weston Pace
41ccb48160 feat: add support for filter during merge insert when matched (#948)
Closes #940
2024-04-05 16:29:58 -07:00
QianZhu
069ad267bd added error msg to SaaS APIs (#852)
1. improved error msg for SaaS create_table and create_index

---------

Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
2024-04-05 16:29:58 -07:00
Weston Pace
138fc3f66b feat: add a filterable count_rows to all the lancedb APIs (#913)
A `count_rows` method that takes a filter was recently added to
`LanceTable`. This PR adds it everywhere else except `RemoteTable` (that
will come soon).
2024-04-05 16:29:58 -07:00
Nitish Sharma
2c3f982f4f Minor updates to FAQ (#935)
Based on discussion over discord, adding minor updates to the FAQ
section about benchmarks, practical data size and concurrency in LanceDB
2024-04-05 16:29:58 -07:00
Ayush Chaurasia
d07817a562 feat(python): Reranker DX improvements (#904)
- Most users might not know how to use a `QueryBuilder` object. Instead we
should just pass the string query.
- Add new rerankers: Colbert, openai
2024-04-05 16:29:58 -07:00
Will Jones
39cc2fd62b feat(python): add read_consistency_interval argument (#828)
This PR refactors how we handle read consistency, that is, whether the
`LanceTable` class always picks up modifications to the table made by
other instances or processes. Users have three options they can set at
the connection level:

1. (Default) `read_consistency_interval=None` means it will not check at
all. Users can call `table.checkout_latest()` to manually check for
updates.
2. `read_consistency_interval=timedelta(0)` means **always** check for
updates, giving strong read consistency.
3. `read_consistency_interval=timedelta(seconds=20)` means check for
updates every 20 seconds. This is eventual consistency, a compromise
between the two options above.
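
The three options can be sketched as a single decision function. This is an illustrative model only; `needs_refresh`, `last_checked`, and `now` are hypothetical names and do not describe how `LanceTable` is implemented internally.

```python
from datetime import datetime, timedelta
from typing import Optional


def needs_refresh(
    interval: Optional[timedelta],
    last_checked: datetime,
    now: datetime,
) -> bool:
    if interval is None:
        # Option 1: never check automatically; the user refreshes by hand.
        return False
    # Option 2: timedelta(0) always satisfies this comparison, so every
    # read checks for updates. Option 3: a positive interval only checks
    # once that much time has passed since the last check.
    return now - last_checked >= interval
```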

There is now an explicit difference between a `LanceTable` that tracks
the current version and one that is fixed at a historical version. We
now enforce that users cannot write if they have checked out an old
version. They are instructed to call `checkout_latest()` before calling
the write methods.

Since `conn.open_table()` doesn't have a parameter for version, users
will only get fixed references if they call `table.checkout()`.

The difference between these two can be seen in the repr: Tables that are
fixed at a particular version will have a `version` displayed in the
repr. Otherwise, the version will not be shown.

```python
>>> table
LanceTable(connection=..., name="my_table")
>>> table.checkout(1)
>>> table
LanceTable(connection=..., name="my_table", version=1)
```
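
A toy model of the tracking versus fixed distinction and the write guard. `TableRef` is a hypothetical class, and raising `ValueError` is an assumption; the real `LanceTable` may report the error differently.

```python
from typing import List, Optional


class TableRef:
    """Toy model of a table reference; not the real LanceTable."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._rows: List[dict] = []
        # None means "track the latest version".
        self._pinned_version: Optional[int] = None

    def checkout(self, version: int) -> None:
        # Fix the reference at a historical version.
        self._pinned_version = version

    def checkout_latest(self) -> None:
        # Go back to tracking the latest version.
        self._pinned_version = None

    def add(self, row: dict) -> None:
        # Writes are refused while the reference is fixed at a version.
        if self._pinned_version is not None:
            raise ValueError("call checkout_latest() before writing")
        self._rows.append(row)

    def __repr__(self) -> str:
        # The version is only shown for fixed references.
        if self._pinned_version is None:
            return f'TableRef(name="{self.name}")'
        return f'TableRef(name="{self.name}", version={self._pinned_version})'
```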

I decided to not create different classes for these states, because I
think we already have enough complexity with the Cloud vs OSS table
references.

Based on #812
2024-04-05 16:29:57 -07:00
Ayush Chaurasia
0f00cd0097 feat(python): add support new openai embedding functions (#912)
@PrashantDixit0

---------

Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
2024-04-05 16:29:13 -07:00
Lei Xu
84edf56995 chore: add global cargo config to enable minimal cpu target (#925)
* Closes #895 
* Fix cargo clippy
2024-04-05 16:29:13 -07:00
QianZhu
b2efd0da53 fix hybrid search example (#922) 2024-04-05 16:29:13 -07:00
Lance Release
c101e9deed [python] Bump version: 0.5.2 → 0.5.3 2024-04-05 16:29:13 -07:00
Ayush Chaurasia
a24e16f753 fix: revert safe_import_pandas usage (#921) 2024-04-05 16:29:13 -07:00
Lance Release
eb1f02919a Bump version: 0.4.7 → 0.4.8 2024-04-05 16:29:05 -07:00
Lance Release
c8f92c2987 [python] Bump version: 0.5.1 → 0.5.2 2024-04-05 16:29:05 -07:00
Weston Pace
9d115bd507 chore: bump pylance version to latest in pyproject.toml (#918) 2024-04-05 16:29:05 -07:00
Weston Pace
18f7bad3dd feat: add merge_insert to the node and rust APIs (#915) 2024-04-05 16:29:05 -07:00
QianZhu
2e75b16403 make it explicit about the vector column data type (#916)
<img width="837" alt="Screenshot 2024-02-01 at 4 23 34 PM"
src="https://github.com/lancedb/lancedb/assets/1305083/4f0f5c5a-2a24-4b00-aad1-ef80a593d964">
[
<img width="838" alt="Screenshot 2024-02-01 at 4 26 03 PM"
src="https://github.com/lancedb/lancedb/assets/1305083/ca073bc8-b518-4be3-811d-8a7184416f07">
](url)

---------

Co-authored-by: Weston Pace <weston.pace@gmail.com>
2024-04-05 16:29:05 -07:00
Bert
3c544582f6 fix: add request retry to python client (#917)
Adds capability to the remote python SDK to retry requests (fixes #911)

This can be configured through environment:
- `LANCE_CLIENT_MAX_RETRIES`= total number of retries. Set to 0 to
disable retries. default = 3
- `LANCE_CLIENT_CONNECT_RETRIES` = number of times to retry request in
case of TCP connect failure. default = 3
- `LANCE_CLIENT_READ_RETRIES` = number of times to retry request in case
of HTTP request failure. default = 3
- `LANCE_CLIENT_RETRY_STATUSES` = http statuses for which the request
will be retried. passed as comma separated list of ints. default `500,
502, 503`
- `LANCE_CLIENT_RETRY_BACKOFF_FACTOR` = controls time between retry
requests. see
[here](23f2287eb5/src/urllib3/util/retry.py (L141-L146)).
default = 0.25
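
A sketch of how the variables above could be read into a config object, using the names and defaults listed in this message. `RetryConfig` and `load_retry_config` are hypothetical helpers, not part of the SDK.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Tuple


@dataclass(frozen=True)
class RetryConfig:
    max_retries: int
    connect_retries: int
    read_retries: int
    retry_statuses: Tuple[int, ...]
    backoff_factor: float


def load_retry_config(env: Mapping[str, str] = os.environ) -> RetryConfig:
    # Statuses arrive as a comma separated list of ints.
    statuses = env.get("LANCE_CLIENT_RETRY_STATUSES", "500, 502, 503")
    return RetryConfig(
        max_retries=int(env.get("LANCE_CLIENT_MAX_RETRIES", "3")),
        connect_retries=int(env.get("LANCE_CLIENT_CONNECT_RETRIES", "3")),
        read_retries=int(env.get("LANCE_CLIENT_READ_RETRIES", "3")),
        retry_statuses=tuple(int(s.strip()) for s in statuses.split(",")),
        backoff_factor=float(env.get("LANCE_CLIENT_RETRY_BACKOFF_FACTOR", "0.25")),
    )
```

Passing the environment as a mapping keeps the helper easy to test without touching the real process environment.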

Only read requests will be retried:
- list table names
- query
- describe table
- list table indices

This does not add retry capabilities for writes as it could possibly
cause issues in the case where the retried write isn't idempotent. For
example, in the case where the LB times-out the request but the server
completes the request anyway, we might not want to blindly retry an
insert request.
2024-04-05 16:29:05 -07:00
Weston Pace
f602e07f99 docs: add cleanup_old_versions and compact_files to Table for documentation purposes (#900)
Closes #819
2024-04-05 16:29:05 -07:00
Weston Pace
4eb819072a feat: upgrade to lance 0.9.11 and expose merge_insert (#906)
This adds the python bindings requested in #870 The javascript/rust
bindings will be added in a future PR.
2024-04-05 16:29:05 -07:00
Lei Xu
bd2d187538 ci: bump to new version of python action to use node 20 GitHub action runtime (#909)
GitHub Actions is deprecating the old node-16 runtime.
2024-04-05 16:29:05 -07:00
JacobLinCool
f308a0ffdb fix the repo link on npm, add links for homepage and bug report (#910)
- fix the repo link on npm
- add links for homepage and bug report
2024-04-05 16:29:05 -07:00
QianZhu
1f2eafca75 arrow table/f16 example (#907) 2024-04-05 16:29:05 -07:00
Lance Release
567c5f6d01 Bump version: 0.4.6 → 0.4.7 2024-04-05 16:28:56 -07:00
Lei Xu
8e139012e2 fix(node): pass AWS credentials to db level operations (#908)
Passed the following tests

```ts
const keyId = process.env.AWS_ACCESS_KEY_ID;
const secretKey = process.env.AWS_SECRET_ACCESS_KEY;
const sessionToken = process.env.AWS_SESSION_TOKEN;
const region = process.env.AWS_REGION;

const db = await lancedb.connect({
  uri: "s3://bucket/path",
  awsCredentials: {
    accessKeyId: keyId,
    secretKey: secretKey,
    sessionToken: sessionToken,
  },
  awsRegion: region,
} as lancedb.ConnectionOptions);

console.log(await db.createTable("test", [{ vector: [1, 2, 3] }]));
console.log(await db.tableNames());
console.log(await db.dropTable("test"));
```
2024-04-05 16:28:56 -07:00
Will Jones
d5be6c7a05 docs: provide AWS S3 cleanup and permissions advice (#903)
Adding some more quick advice for how to setup AWS S3 with LanceDB.

---------

Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>
2024-04-05 16:28:56 -07:00
Abraham Lopez
5a12224a02 chore: update JS/TS example in README (#898)
- The JS/TS library actually expects named parameters via an object in
`.createTable()` rather than individual arguments
- Added example on how to search rows by criteria without a vector
search. TS type of `.search()` currently has the `query` parameter as
non-optional so we have to pass undefined for now.
2024-04-05 16:28:56 -07:00
Lei Xu
a617ad35ff ci: change apple silicon runner to free OSS macos-14 target (#901) 2024-04-05 16:28:56 -07:00
Raghav Dixit
61bf688e5b chore(python): GTE embedding function model name update (#902)
Co-authored-by: Ayush Chaurasia <ayush.chaurarsia@gmail.com>
2024-04-05 16:28:56 -07:00
Ayush Chaurasia
a41f7be88d feat(python): Hybrid search & Reranker API (#824)
based on https://github.com/lancedb/lancedb/pull/713
- The Reranker api can be plugged into vector only or fts only search
but this PR doesn't do that (see example -
https://txt.cohere.com/rerank/)


### Default reranker -- `LinearCombinationReranker(weight=0.7,
fill=1.0)`

```
table.search("hello", query_type="hybrid").rerank(normalize="score").to_pandas()
```
### Available rerankers
LinearCombinationReranker
```
from lancedb.rerankers import LinearCombinationReranker

# Same as default 
table.search("hello", query_type="hybrid").rerank(
                                      normalize="score", 
                                      reranker=LinearCombinationReranker()
                                     ).to_pandas()

# with custom params
reranker = LinearCombinationReranker(weight=0.3, fill=1.0)
table.search("hello", query_type="hybrid").rerank(
                                      normalize="score", 
                                      reranker=reranker
                                     ).to_pandas()
```

Cohere Reranker
```
from lancedb.rerankers import CohereReranker

# default model.. English and multi-lingual supported. See docstring for available custom params
table.search("hello", query_type="hybrid").rerank(
                                      normalize="rank",  # score or rank
                                      reranker=CohereReranker()
                                     ).to_pandas()

```

CrossEncoderReranker

```
from lancedb.rerankers import CrossEncoderReranker

table.search("hello", query_type="hybrid").rerank(
                                      normalize="rank", 
                                      reranker=CrossEncoderReranker()
                                     ).to_pandas()

```

## Using custom Reranker
```
from lancedb.rerankers import Reranker

class CustomReranker(Reranker):
    def rerank_hybrid(self, vector_results, fts_results):
        # Merge both result sets, or use custom combination logic
        combined_res = self.merge_results(vector_results, fts_results)
        # Custom rerank logic here

        return combined_res
```

- [x] Expand testing
- [x] Make sure usage makes sense
- [x] Run simple benchmarks for correctness (Seeing weird result from
cohere reranker in the toy example)
- Support diverse rerankers by default:
- [x] Cross encoding
- [x] Cohere
- [x] Reciprocal Rank Fusion

---------

Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>
2024-04-05 16:28:56 -07:00
Prashanth Rao
ecbbe185c7 Fix image bgcolor (#891)
Minor fix to change the background color for an image in the docs. It's
now readable in both light and dark modes (earlier version made it
impossible to read in dark mode).
2024-04-05 16:28:56 -07:00
Ayush Chaurasia
b326bf2ef6 doc: Add documentation chatbot for LanceDB (#890)
<img width="1258" alt="Screenshot 2024-01-29 at 10 05 52 PM"
src="https://github.com/lancedb/lancedb/assets/15766192/7c108fde-e993-415c-ad01-72010fd5fe31">
2024-04-05 16:28:56 -07:00
Raghav Dixit
472344fcb3 feat(python): Embedding fn support for gte-mlx/gte-large (#873)
have added testing and an example in the docstring, will be pushing a
separate PR in recipe repo for rag example

---------

Co-authored-by: Ayush Chaurasia <ayush.chaurarsia@gmail.com>
2024-04-05 16:28:56 -07:00
Ayush Chaurasia
bca80939c2 chore(python): Temporarily extend remote connection timeout (#888)
Context - https://etoai.slack.com/archives/C05NC5YSW5V/p1706371205883149
2024-04-05 16:28:56 -07:00
Lei Xu
911d063237 doc: fix js example of create index (#886) 2024-04-05 16:28:56 -07:00
Lei Xu
12e776821a doc: use snippet for rust code example and make sure rust examples run through CI (#885) 2024-04-05 16:28:56 -07:00
Lei Xu
c6e5eb0398 fix: fix doc build to include the source snippet correctly (#883) 2024-04-05 16:28:56 -07:00
Chang She
1d0578ce25 doc(rust): minor fixes for Rust quick start. (#878) 2024-04-05 16:28:56 -07:00
Lei Xu
e7fdb931de chore: convert all js doc test to use snippet. (#881) 2024-04-05 16:28:56 -07:00
Lei Xu
d811b89de2 doc: use code snippet for typescript examples (#880)
The typescript code is in a fully functional file that will be run via the CI.
2024-04-05 16:28:56 -07:00
Ayush Chaurasia
545a03d7f9 feat(python): Aws Bedrock embeddings integration (#822)
Supports amazon titan, cohere english & cohere multi-lingual base
models.
2024-04-05 16:28:56 -07:00
Lei Xu
f2e29eb004 chore: upgrade lance, pylance and datafusion (#879) 2024-04-05 16:28:56 -07:00
Lei Xu
36dbf47d60 chore: add one rust SDK e2e example (#876)
Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
2024-04-05 16:28:56 -07:00
Lei Xu
fd2fd94862 doc: update quick start for full rust example (#872) 2024-04-05 16:28:56 -07:00
Lei Xu
faa5912c3f chore: bump github actions to v4 due to GHA warnings of node version deprecation (#874) 2024-04-05 16:28:56 -07:00
Lance Release
334e423464 Bump version: 0.4.5 → 0.4.6 2024-04-05 16:28:18 -07:00
Lei Xu
7274c913a8 feat(rust): provide connect and connect_with_options in Rust SDK (#871)
* Bring the Rust connect methods to feature parity with the other SDKs.
* A global connect method that can connect to local and remote / cloud
tables, the same as in js/python today.
2024-04-05 16:28:18 -07:00
Lei Xu
a192c1a9b1 chore(rust): simplified version of optimize (#869)
Consolidate various optimize() into one method, similar to postgres
VACUUM, in the process of preparing the Rust API for public use
2024-04-05 16:28:18 -07:00
Lei Xu
cef0293985 feat(napi): Issue queries as node SDK (#868)
* Query as a fluent API and `AsyncIterator<RecordBatch>`
* Much more docs
* Add tests for auto infer vector search columns with different
dimensions.
2024-04-05 16:28:18 -07:00
Lance Release
0be4fd2aa6 Bump version: 0.4.4 → 0.4.5 2024-04-05 16:27:59 -07:00
Lei Xu
0664eee38d fix: release build for node sdk (#861) 2024-04-05 16:27:59 -07:00
Lance Release
f3dd5c89dc Bump version: 0.4.3 → 0.4.4 2024-04-05 16:27:51 -07:00
Lei Xu
8b04d8fef6 feat: improve the rust table query API and documents (#860)
* Easy to type
* Handle `String, &str, [String] and [&str]` well without manual
conversion
* Fix function name to be verb
* Improve docstring of Rust.
* Promote `query` and `search()` to public `Table` trait
2024-04-05 16:27:51 -07:00
Lei Xu
68e2bb0b2d doc: update rust readme to include crate and docs.rs links (#859) 2024-04-05 16:27:51 -07:00
Lei Xu
db4a979278 feat(napi): Provide a new createIndex API in the napi SDK. (#857) 2024-04-05 16:27:51 -07:00
Will Jones
7d82e56f76 docs: document basics of configuring object storage (#832)
Created based on upstream PR https://github.com/lancedb/lance/pull/1849

Closes #681

---------

Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>
2024-04-05 16:27:51 -07:00
Lei Xu
dfabbe9081 feat(rust): create index API improvement (#853)
* Extract a minimal Table interface in Rust SDK
* Make create_index composable in Rust.
* Fix compiling issues from ffi
2024-04-05 16:27:51 -07:00
Bert
d1f9722bfb Bump lance 0.9.9 (#851) 2024-04-05 16:27:51 -07:00
Lei Xu
efcaa433fe feat: rework NodeJS SDK using napi (#847)
Use Napi to write a Node.js SDK that follows Polars for better
maintainability, while keeping most of the logic in Rust.
2024-04-05 16:27:51 -07:00
Lance Release
7b8188bcd5 [python] Bump version: 0.5.0 → 0.5.1 2024-04-05 16:27:51 -07:00
Lei Xu
65c1d8bc4c feat: change create table to accept Arrow table (#845) 2024-04-05 16:27:50 -07:00
QianZhu
5ecbf971e2 extend timeout for requests.get and requests.post (#848) 2024-04-05 16:27:42 -07:00
Lei Xu
a78e07907c chore(rust): provide a Connection trait to match python and nodejs SDK (#846)
In NodeJS and Python, LanceDB establishes a connection to a db. In Rust
core, it is called Database.
We should be consistent with the naming.
2024-04-05 16:27:42 -07:00
Bert
a409000c6f allow passing api key as env var (#841)
Allow passing API key as env var:
```shell
export LANCEDB_API_KEY=sh_123...
```

With this set, the apiKey argument can be omitted from `connect`
```js
    const db = await vectordb.connect({
        uri: "db://test-proj-01-ae8343",
        region: "us-east-1",
    })
```
```py
    db = lancedb.connect(
        uri="db://test-proj-01-ae8343",
        region="us-east-1",
    )
```
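
A minimal sketch of the fallback logic. `resolve_api_key` is a hypothetical helper, and letting an explicit argument take precedence over `LANCEDB_API_KEY` is an assumption rather than something stated in this message.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    api_key: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    # Assumption: an explicitly passed key wins over the environment.
    key = api_key if api_key is not None else env.get("LANCEDB_API_KEY")
    if not key:
        raise ValueError("pass api_key or set LANCEDB_API_KEY")
    return key
```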
2024-04-05 16:27:42 -07:00
Lei Xu
d8befeeea2 feat(js): add helper function to create Arrow Table with schema (#838)
Adds support for making an Apache Arrow Table from an array of JavaScript
records, with an optionally provided Schema.
2024-04-05 16:27:42 -07:00
Chang She
b699b5c42b chore(js): remove errant console.log (#834) 2024-04-05 16:27:42 -07:00
Lei Xu
49de13c65a doc: add index page for rust crate (#839)
Rust API doc for the brave
2024-04-05 16:27:42 -07:00
Lei Xu
97d033dfd6 bug: add a test for fp16 (#837)
Add test to ingest fp16 to a database
2024-04-05 16:27:42 -07:00
Chang She
0c580abd70 Merge branch 'tecmie-tecmie/embeddings-openai' 2024-04-05 16:27:42 -07:00
Chang She
d19bf80375 Merge branch 'tecmie/embeddings-openai' of github.com:tecmie/lancedb into tecmie-tecmie/embeddings-openai 2024-04-05 16:27:41 -07:00
Lei Xu
5b2c602fb3 doc: improve docs for nodejs connect functions (#833)
* improve the docstring for NodeJS connect functions and
`ConnectOptions` parameters.
* Simplify `npm run build` steps.
2024-04-05 16:27:32 -07:00
Bert
7bdca7a092 fix: remote python client closes idle connections (#831) 2024-04-05 16:27:32 -07:00
Will Jones
5f6d13e958 ci: lint and enforce linting (#829)
@eddyxu added instructions for linting here:

7af213801a/python/README.md (L45-L50)

However, we had a lot of failures and weren't checking this in CI. This
PR fixes all lints and adds a check to CI to keep us in compliance with
the lints.
2024-04-05 16:27:31 -07:00
Bert
4243eaee93 bump lance to 0.9.7 (#826) 2024-04-05 16:27:14 -07:00
Prashanth Rao
e6bb907d81 Docs updates incl. Polars (#827)
This PR makes the following aesthetic and content updates to the docs.

- [x] Fix max width issue on mobile: Content should now render more
cleanly and be more readable on smaller devices
- [x] Improve image quality of flowchart in data management page
- [x] Fix syntax highlighting in text at the bottom of the IVF-PQ
concepts page
- [x] Add example of Polars LazyFrames to docs (Integrations)
- [x] Add example of adding data to tables using Polars (guides)
2024-04-05 16:27:14 -07:00
Prashanth Rao
4d5d748acd docs: Updates and refactor (#683)
This PR makes incremental changes to the documentation.

* Closes #697
* Closes #698

- [x] Add dark mode
- [x] Fix headers in navbar
- [x] Add `extra.css` to customize navbar styles
- [x] Customize fonts for prose/code blocks, navbar and admonitions
- [x] Inspect all admonition boxes (remove redundant dropdowns) and
improve clarity and readability
- [x] Ensure that all images in the docs have white background (not
transparent) to be viewable in dark mode
- [x] Improve code formatting in code blocks to make them consistent
with autoformatters (eslint/ruff)
- [x] Add bolder weight to h1 headers
- [x] Add diagram showing the difference between embedded (OSS) and
serverless (Cloud)
- [x] Fix [Creating an empty
table](https://lancedb.github.io/lancedb/guides/tables/#creating-empty-table)
section: right now, the subheaders are not clickable.
- [x] In critical data ingestion methods like `table.add` (among
others), the type signature often does not match the actual code
- [x] Proof-read each documentation section and rewrite as necessary to
provide more context, use cases, and explanations so it reads less like
reference documentation. This is especially important for CRUD and
search sections since those are so central to the user experience.

- [x] The section for [Adding
data](https://lancedb.github.io/lancedb/guides/tables/#adding-to-a-table)
only shows examples for pandas and iterables. We should include pydantic
models, arrow tables, etc.
- [x] Add conceptual tutorial for IVF-PQ index
- [x] Clearly separate vector search, FTS and filtering sections so that
these are easier to find
- [x] Add docs on refine factor to explain its importance for recall.
Closes #716
- [x] Add an FAQ page showing answers to commonly asked questions about
LanceDB. Closes #746
- [x] Add simple polars example to the integrations section. Closes #756
and closes #153
- [ ] Add basic docs for the Rust API (more detailed API docs can come
later). Closes #781
- [x] Add a section on the various storage options on local vs. cloud
(S3, EBS, EFS, local disk, etc.) and the tradeoffs involved. Closes #782
- [x] Revamp filtering docs: add pre-filtering examples and redo headers
and update content for SQL filters. Closes #783 and closes #784.
- [x] Add docs for data management: compaction, cleaning up old versions
and incremental indexing. Closes #785
- [ ] Add a benchmark section that also discusses some best practices.
Closes #787

---------

Co-authored-by: Ayush Chaurasia <ayush.chaurarsia@gmail.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
2024-04-05 16:27:12 -07:00
Lance Release
33ab68c790 [python] Bump version: 0.4.4 → 0.5.0 2024-04-05 16:26:36 -07:00
Chang She
dbc3515d96 chore(python): turn off lazy frame ingestion (#821) 2024-04-05 16:26:36 -07:00
Chang She
ac3d95ec34 feat(python): allow the entire table to be converted to a polars dataframe (#814) 2024-04-05 16:26:36 -07:00
Chang She
72b39432e8 feat(python): add exist_ok option to create table (#813)
This mimics CREATE TABLE IF NOT EXISTS behavior.
We add `db.create_table(..., exist_ok=True)` parameter.
By default it is set to False, so trying to create
a table with the same name will raise an exception.
If set to True, then it only opens the table if it
already exists. If you pass in a schema, it will
be checked against the existing table to make sure
you get what you want. If you pass in data, it will
NOT be added to the existing table.
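The semantics described above can be sketched with a toy in-memory catalog. This is only an illustration of the described behavior, not LanceDB's implementation: `ToyDB`, `ToyTable`, the dict-based schema, and the use of `ValueError` are all hypothetical choices made for the example.

```python
class ToyTable:
    """Hypothetical stand-in for a table: a schema plus a list of rows."""

    def __init__(self, schema, rows=None):
        self.schema = schema
        self.rows = list(rows or [])


class ToyDB:
    """Hypothetical in-memory catalog mimicking the exist_ok semantics."""

    def __init__(self):
        self._tables = {}

    def create_table(self, name, data=None, schema=None, exist_ok=False):
        if name in self._tables:
            if not exist_ok:
                # Default: creating a table with an existing name is an error.
                raise ValueError(f"Table {name!r} already exists")
            existing = self._tables[name]
            # A supplied schema is checked against the existing table.
            if schema is not None and schema != existing.schema:
                raise ValueError(f"Schema mismatch for table {name!r}")
            # Data passed together with exist_ok=True is NOT appended.
            return existing
        table = ToyTable(schema, data)
        self._tables[name] = table
        return table


db = ToyDB()
schema = {"id": "int64"}
first = db.create_table("t", data=[{"id": 1}], schema=schema)
again = db.create_table("t", data=[{"id": 2}], schema=schema, exist_ok=True)
print(again is first, again.rows)
```

The second call returns the table that already exists and leaves its rows untouched, which is the "open only" behavior the commit message describes.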
2024-04-05 16:26:35 -07:00
Ayush Chaurasia
340fd98b42 chore(python): get rid of Pydantic deprecation warning in embedding fcn (#816)
```
UserWarning: Valid config keys have changed in V2:
* 'keep_untouched' has been renamed to 'ignored_types' warnings.warn(message, UserWarning)
```
2024-04-05 16:26:20 -07:00
Anton Shevtsov
dc0b11a86a Add openai api key not found help (#815)
This pull request adds a check for the presence of the environment
variable `OPENAI_API_KEY` and removes an unused parameter in the
`retry_with_exponential_backoff` function.
2024-04-05 16:26:20 -07:00
Chang She
17dcb70076 feat(python): basic polars integration (#811)
We should now be able to directly ingest polars dataframes and return
results as polars dataframes

![image](https://github.com/lancedb/lancedb/assets/759245/828b1260-c791-45f1-a047-aa649575e798)
2024-04-05 16:26:19 -07:00
Andrew Miracle
8daed93a91 eslint fix 2024-04-05 16:25:52 -07:00
Ayush Chaurasia
2f72d5138e feat(python): Add gemini text embedding function (#806)
Named it Gemini-text for now. Not sure how complicated it will be to
support both text and multimodal embeddings under the same class
"gemini". But it's not something to worry about for now, I guess.
2024-04-05 16:25:52 -07:00
Andrew Miracle
f1aad1afc7 Merge branch 'main' into tecmie/embeddings-openai 2024-04-05 16:25:51 -07:00
Andrew Miracle
fa13fb9392 rebase from lancedb/main 2024-04-05 16:25:14 -07:00
Lance Release
d39145c7e4 Updating package-lock.json 2024-04-05 16:25:14 -07:00
Lance Release
3463248eba Bump version: 0.4.2 → 0.4.3 2024-04-05 16:25:14 -07:00
Lance Release
3191966ffb [python] Bump version: 0.4.3 → 0.4.4 2024-04-05 16:25:14 -07:00
Will Jones
3b119420b2 upgrade lance (#809) 2024-04-05 16:25:14 -07:00
Lei Xu
6f7cb75b07 chore: remove black as dependency (#808)
We use `ruff` in CI and dev workflow now.
2024-04-05 16:25:14 -07:00
Chang She
118a11c9b3 feat(node): align incoming data to table schema (#802) 2024-04-05 16:25:14 -07:00
Sebastian Law
70ca6d8ea5 use requests instead of aiohttp for underlying http client (#803)
instead of starting and stopping the current thread's event loop on
every http call, just make an http call.
2024-04-05 16:25:14 -07:00
Chang She
556e01d9d9 chore(python): add docstring for limit behavior (#800)
Closes #796
2024-04-05 16:25:14 -07:00
Chang She
1060dde858 feat(python): add phrase query option for fts (#798)
addresses #797 

Problem: tantivy does not expose option to explicitly

Proposed solution here: 

1. Add a `.phrase_query()` option
2. Under the hood, LanceDB takes care of wrapping the input in quotes
and replacing nested double quotes with single quotes

I've also filed an upstream issue; if they support phrase queries
natively, then we can get rid of our manual custom processing here.
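
The quoting step in item 2 can be sketched in a few lines. This is a minimal illustration of the described string handling only; `to_phrase_query` is a hypothetical helper name, not the function LanceDB uses.

```python
def to_phrase_query(text: str) -> str:
    """Wrap text in double quotes so a query parser treats it as one phrase."""
    # Nested double quotes would end the phrase early, so swap them for
    # single quotes before wrapping.
    return '"' + text.replace('"', "'") + '"'


print(to_phrase_query('the "quick" brown fox'))
```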
2024-04-05 16:25:14 -07:00
Chang She
950e05da81 feat(python): add count_rows with filter option (#801)
Closes #795
2024-04-05 16:25:14 -07:00
Chang She
2b7754f929 fix(rust): not sure why clippy is suddenly unhappy (#794)
should fix the error on top of main


https://github.com/lancedb/lancedb/actions/runs/7457190471/job/20288985725
2024-04-05 16:25:14 -07:00
Chang She
d0bff7b78e feat(python): support new style optional syntax (#793) 2024-04-05 16:25:14 -07:00
Chang She
85f3f8793c chore(python): document phrase queries in fts (#788)
closes #769 

Add unit test and documentation on using quotes to perform a phrase
query
2024-04-05 16:25:14 -07:00
Chang She
a758876a65 feat(node): support table.schema for LocalTable (#789)
Close #773 

we pass an empty table over IPC so we don't need to manually deal with
serde. Then we just return the schema attribute from the empty table.

---------

Co-authored-by: albertlockett <albert.lockett@gmail.com>
2024-04-05 16:25:14 -07:00
Lei Xu
073a2a1b28 chore: bump lance to 0.9.5 (#790) 2024-04-05 16:25:14 -07:00
Chang She
195c106242 feat(python): Set heap size to get faster fts indexing performance (#762)
By default tantivy-py uses 128MB heapsize. We change the default to 1GB
and we allow the user to customize this

locally this makes `test_fts.py` run 10x faster
2024-04-05 16:25:13 -07:00
Lance Release
f0a654036e Updating package-lock.json 2024-04-05 16:25:02 -07:00
lucasiscovici
792830ccb5 raise exception if fts index does not exist (#776)
raise exception if fts index does not exist

---------

Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
2024-04-05 16:25:02 -07:00
Lance Release
162f8536d1 Updating package-lock.json 2024-04-05 16:25:02 -07:00
sudhir
5d198327bb Make examples work with current version of Openai api's (#779)
These examples don't work because of changes in the OpenAI API from
version 1 onward.
2024-04-05 16:25:02 -07:00
Lance Release
55cc3ed5a2 Bump version: 0.4.2 → 0.4.3 2024-04-05 16:25:02 -07:00
Chris
b11428dddb Minor Fixes to Ingest Embedding Functions Docs (#777)
Addressed minor typos and grammatical issues to improve readability

---------

Co-authored-by: Christopher Correa <chris.correa@gmail.com>
2024-04-05 16:25:02 -07:00
Lance Release
1387dc6e48 [python] Bump version: 0.4.3 → 0.4.4 2024-04-05 16:25:02 -07:00
Vladimir Varankin
84c6c8f08c Minor corrections for docs of embedding_functions (#780)
In addition to #777, this pull request fixes more typos in the
documentation for "Ingest Embedding Functions".
2024-04-05 16:25:02 -07:00
Will Jones
63e273606e upgrade lance (#809) 2024-04-05 16:25:02 -07:00
QianZhu
35f83694be small bug fix for example code in SaaS JS doc (#770) 2024-04-05 16:25:02 -07:00
Lei Xu
45b006d68c chore: remove black as dependency (#808)
We use `ruff` in CI and dev workflow now.
2024-04-05 16:25:02 -07:00
Chang She
20208b9efb chore(python): handle NaN input in fts ingestion (#763)
If the input text is None, Tantivy raises an error
complaining it cannot add a NoneType. We handle this
upstream so None's are not added to the document.
If all of the indexed fields are None then we skip
this document.
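A minimal sketch of the skipping rule described above, assuming rows are plain dicts. `build_document` is a hypothetical helper for illustration and does not call Tantivy.

```python
def build_document(row, fields):
    """Return a dict of indexable fields, or None if every field is missing."""
    doc = {}
    for name in fields:
        value = row.get(name)
        if value is None:
            # Skip missing values instead of handing None to the indexer.
            continue
        doc[name] = value
    # An empty dict means all indexed fields were None: skip the document.
    return doc or None


rows = [
    {"title": "hello", "body": None},
    {"title": None, "body": None},
]
docs = [build_document(row, ["title", "body"]) for row in rows]
print([doc for doc in docs if doc is not None])
```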
2024-04-05 16:25:02 -07:00
Bengsoon Chuah
c00af75d63 Add relevant imports for each step (#764)
I found it quite confusing to read through the documentation and then
have to search for the submodule each class should be imported from.

For example, it is cumbersome to have to navigate to another
documentation page to find out that `EmbeddingFunctionRegistry` is from
`lancedb.embeddings`
2024-04-05 16:25:02 -07:00
QianZhu
21245dfb9d SaaS JS API sdk doc (#740)
Co-authored-by: Aidan <64613310+aidangomar@users.noreply.github.com>
2024-04-05 16:25:02 -07:00
Chang She
81487f10fe feat(js): support list of string input (#755)
Add support for adding lists of string input (e.g., list of categorical
labels)

Follow-up items: #757 #758
2024-04-05 16:25:02 -07:00
Lance Release
3aa233f38a Updating package-lock.json 2024-04-05 16:25:02 -07:00
Lance Release
3278fa75d1 Bump version: 0.4.1 → 0.4.2 2024-04-05 16:25:02 -07:00
Lance Release
549f2bf396 [python] Bump version: 0.4.2 → 0.4.3 2024-04-05 16:25:02 -07:00
Lei Xu
138760bc6e chore: bump pylance to 0.9.2 (#754) 2024-04-05 16:25:02 -07:00
Xin Hao
0bddf77a73 docs: fix link (#752) 2024-04-05 16:25:02 -07:00
Chang She
154dc508ba feat(python): first cut batch queries for remote api (#753)
issue separate requests under the hood and concatenate results
2024-04-05 16:25:02 -07:00
Lance Release
0b8fe76590 [python] Bump version: 0.4.1 → 0.4.2 2024-04-05 16:25:02 -07:00
Chang She
c22eacb8b6 chore(python): update embedding API to use openai 1.6.1 (#751)
API has changed significantly, namely `openai.Embedding.create` no
longer exists.
https://github.com/openai/openai-python/discussions/742

Update the OpenAI embedding function and put a minimum on the openai sdk
version.
2024-04-05 16:25:02 -07:00
Chang She
75d575ef4e feat: add timezone handling for datetime in pydantic (#578)
If you add timezone information in the Field annotation for a datetime
then that will now be passed to the pyarrow data type.

I'm not sure how pyarrow enforces timezones. Right now, it silently
coerces to the timezone given in the column regardless of whether the
input had a matching timezone. This is probably not the right
behavior, though we could just make the user do the validation in the
pydantic model instead of doing it at the pyarrow conversion layer.
2024-04-05 16:25:02 -07:00
Chang She
bc83bc9838 feat(python): add post filtering for full text search (#739)
Closes #721 

fts will return results as a pyarrow table. Pyarrow tables have a
`filter` method but it does not take sql filter strings (only pyarrow
compute expressions). Instead, we do one of two things to support
`tbl.search("keywords").where("foo=5").limit(10).to_arrow()`:

Default path: If duckdb is available then use duckdb to execute the sql
filter string on the pyarrow table.
Backup path: Otherwise, write the pyarrow table to a lance dataset and
then do `to_table(filter=<filter>)`

Neither is ideal. 
Default path has two issues:
1. requires installing an extra library (duckdb)
2. duckdb mangles some fields (like fixed size list => list)

Backup path incurs a latency penalty (~20ms on ssd) to write the
resultset to disk.

In the short term, once #676 is addressed, we can write the dataset to
"memory://" instead of disk, which makes the post filter evaluate much
more quickly (ETA next week).

In the longer term, we'd like to be able to evaluate the filter string
on the pyarrow Table directly, one possibility being that we use
Substrait to generate pyarrow compute expressions from sql string. Or if
there's enough progress on pyarrow, it could support Substrait
expressions directly (no ETA)

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2024-04-05 16:25:02 -07:00
Aidan
a76b5755ff fix: createIndex index cache size (#741) 2024-04-05 16:25:02 -07:00
Chang She
9a192426d3 feat(python): support list of list fields from pydantic schema (#747)
For object detection, each row may correspond to an image and each image
can have multiple bounding boxes of x-y coordinates. This means that a
`bbox` field is potentially "list of list of float". This adds support
in our pydantic-pyarrow conversion for nested lists.
2024-04-05 16:25:02 -07:00
Lance Release
ab794ba237 Updating package-lock.json 2024-04-05 16:25:02 -07:00
Lance Release
81e9df57c0 [python] Bump version: 0.4.0 → 0.4.1 2024-04-05 16:25:02 -07:00
Lance Release
8705784cea Bump version: 0.4.0 → 0.4.1 2024-04-05 16:25:02 -07:00
elliottRobinson
b3fbca4aee Update default_embedding_functions.md (#744)
Modify some grammar, punctuation, and spelling errors.
2024-04-05 16:25:02 -07:00
Andrew Miracle
5948f11641 eslint fix 2024-04-05 16:25:02 -07:00
Andrew Miracle
9efc3fa6d8 remove console logs 2024-04-05 16:25:02 -07:00
Andrew Miracle
453bf113ae add support for openai SDK version ^4.24.1 2024-04-05 16:25:02 -07:00
Chang She
4b243c5ff8 feat(node): align incoming data to table schema (#802) 2024-04-05 16:25:01 -07:00
Sebastian Law
4aa7f58a07 use requests instead of aiohttp for underlying http client (#803)
instead of starting and stopping the current thread's event loop on
every http call, just make an http call.
2024-04-05 16:25:01 -07:00
Chang She
7581cbb38f chore(python): add docstring for limit behavior (#800)
Closes #796
2024-04-05 16:25:01 -07:00
Chang She
881dfa022b feat(python): add phrase query option for fts (#798)
addresses #797 

Problem: tantivy does not expose option to explicitly

Proposed solution here: 

1. Add a `.phrase_query()` option
2. Under the hood, LanceDB takes care of wrapping the input in quotes
and replacing nested double quotes with single quotes

I've also filed an upstream issue; if they support phrase queries
natively, then we can get rid of our manual custom processing here.
2024-04-05 16:25:01 -07:00
Chang She
f17d16f935 feat(python): add count_rows with filter option (#801)
Closes #795
2024-04-05 16:25:01 -07:00
Chang She
f3a905af63 fix(rust): not sure why clippy is suddenly unhappy (#794)
should fix the error on top of main


https://github.com/lancedb/lancedb/actions/runs/7457190471/job/20288985725
2024-04-05 16:25:01 -07:00
Chang She
a07c6c465a feat(python): support new style optional syntax (#793) 2024-04-05 16:25:01 -07:00
Chang She
1dd663fc8a chore(python): document phrase queries in fts (#788)
closes #769 

Add unit test and documentation on using quotes to perform a phrase
query
2024-04-05 16:25:01 -07:00
Chang She
175ad9223b feat(node): support table.schema for LocalTable (#789)
Close #773 

we pass an empty table over IPC so we don't need to manually deal with
serde. Then we just return the schema attribute from the empty table.

---------

Co-authored-by: albertlockett <albert.lockett@gmail.com>
2024-04-05 16:25:01 -07:00
Lei Xu
4c8690549a chore: bump lance to 0.9.5 (#790) 2024-04-05 16:25:01 -07:00
Chang She
3100f0d861 feat(python): Set heap size to get faster fts indexing performance (#762)
By default tantivy-py uses 128MB heapsize. We change the default to 1GB
and we allow the user to customize this

locally this makes `test_fts.py` run 10x faster
2024-04-05 16:25:00 -07:00
Will Jones
c34aa09166 docs: update node API reference (#734)
This command hasn't been run for a while...
2024-04-05 16:24:47 -07:00
lucasiscovici
328aa2247b raise exception if fts index does not exist (#776)
raise exception if fts index does not exist

---------

Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
2024-04-05 16:24:47 -07:00
Will Jones
43662705ad docs: enhance Update user guide (#735)
Closes #705
2024-04-05 16:24:47 -07:00
sudhir
8a48b32689 Make examples work with current version of Openai api's (#779)
These examples don't work because of changes in the OpenAI API from
version 1 onward.
2024-04-05 16:24:47 -07:00
Bert
5bb128a24d docs: fix JS api docs for update method (#738) 2024-04-05 16:24:47 -07:00
Chris
6698376f02 Minor Fixes to Ingest Embedding Functions Docs (#777)
Addressed minor typos and grammatical issues to improve readability

---------

Co-authored-by: Christopher Correa <chris.correa@gmail.com>
2024-04-05 16:24:47 -07:00
Weston Pace
94e81ff84b feat: add the ability to create scalar indices (#679)
This is a pretty direct binding to the underlying lance capability
2024-04-05 16:24:47 -07:00
Vladimir Varankin
2fd829296e Minor corrections for docs of embedding_functions (#780)
In addition to #777, this pull request fixes more typos in the
documentation for "Ingest Embedding Functions".
2024-04-05 16:24:47 -07:00
Aidan
b4ae3f3097 feat: node list tables pagination (#733) 2024-04-05 16:24:47 -07:00
QianZhu
a25d10279c small bug fix for example code in SaaS JS doc (#770) 2024-04-05 16:24:47 -07:00
Chang She
5376970e87 doc(javascript): minor improvement on docs for working with tables (#736)
Closes #639 
Closes #638
2024-04-05 16:24:47 -07:00
Chang She
e929491187 chore(python): handle NaN input in fts ingestion (#763)
If the input text is None, Tantivy raises an error
complaining it cannot add a NoneType. We handle this
upstream so None's are not added to the document.
If all of the indexed fields are None then we skip
this document.
2024-04-05 16:24:47 -07:00
Bengsoon Chuah
e3ba5b2402 Add relevant imports for each step (#764)
I found it quite confusing to read through the documentation and then
have to search for the submodule each class should be imported from.

For example, it is cumbersome to have to navigate to another
documentation page to find out that `EmbeddingFunctionRegistry` is from
`lancedb.embeddings`
2024-04-05 16:24:47 -07:00
QianZhu
25d1c62c3f SaaS JS API sdk doc (#740)
Co-authored-by: Aidan <64613310+aidangomar@users.noreply.github.com>
2024-04-05 16:24:47 -07:00
Chang She
cd791a366b feat(js): support list of string input (#755)
Add support for adding lists of string input (e.g., list of categorical
labels)

Follow-up items: #757 #758
2024-04-05 16:24:47 -07:00
Lance Release
24afea8c56 Updating package-lock.json 2024-04-05 16:24:47 -07:00
Lance Release
0d2dbf7d09 Updating package-lock.json 2024-04-05 16:24:47 -07:00
Lance Release
c629080d60 Bump version: 0.4.1 → 0.4.2 2024-04-05 16:24:47 -07:00
Lance Release
918a2a4405 [python] Bump version: 0.4.2 → 0.4.3 2024-04-05 16:24:47 -07:00
Lei Xu
56db257ea9 chore: bump pylance to 0.9.2 (#754) 2024-04-05 16:24:47 -07:00
Xin Hao
a63262cfda docs: fix link (#752) 2024-04-05 16:24:47 -07:00
Chang She
98af0ceec6 feat(python): first cut batch queries for remote api (#753)
issue separate requests under the hood and concatenate results
2024-04-05 16:24:47 -07:00
Lance Release
7778031b26 [python] Bump version: 0.4.1 → 0.4.2 2024-04-05 16:24:47 -07:00
Chang She
c97ae6b787 chore(python): update embedding API to use openai 1.6.1 (#751)
API has changed significantly, namely `openai.Embedding.create` no
longer exists.
https://github.com/openai/openai-python/discussions/742

Update the OpenAI embedding function and put a minimum on the openai sdk
version.
2024-04-05 16:24:47 -07:00
Chang She
7bac1131fb feat: add timezone handling for datetime in pydantic (#578)
If you add timezone information in the Field annotation for a datetime
then that will now be passed to the pyarrow data type.

I'm not sure how pyarrow enforces timezones. Right now, it silently
coerces to the timezone given in the column regardless of whether the
input had a matching timezone. This is probably not the right
behavior, though we could just make the user do the validation in the
pydantic model instead of doing it at the pyarrow conversion layer.
2024-04-05 16:24:47 -07:00
Chang She
a0afa84786 feat(python): add post filtering for full text search (#739)
Closes #721 

fts will return results as a pyarrow table. Pyarrow tables have a
`filter` method but it does not take sql filter strings (only pyarrow
compute expressions). Instead, we do one of two things to support
`tbl.search("keywords").where("foo=5").limit(10).to_arrow()`:

Default path: If duckdb is available then use duckdb to execute the sql
filter string on the pyarrow table.
Backup path: Otherwise, write the pyarrow table to a lance dataset and
then do `to_table(filter=<filter>)`

Neither is ideal. 
Default path has two issues:
1. requires installing an extra library (duckdb)
2. duckdb mangles some fields (like fixed size list => list)

Backup path incurs a latency penalty (~20ms on ssd) to write the
resultset to disk.

In the short term, once #676 is addressed, we can write the dataset to
"memory://" instead of disk, which makes the post filter evaluate much
more quickly (ETA next week).

In the longer term, we'd like to be able to evaluate the filter string
on the pyarrow Table directly, one possibility being that we use
Substrait to generate pyarrow compute expressions from sql string. Or if
there's enough progress on pyarrow, it could support Substrait
expressions directly (no ETA)

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2024-04-05 16:24:47 -07:00
Aidan
e74c203e6f fix: createIndex index cache size (#741) 2024-04-05 16:24:47 -07:00
Chang She
46bf5a1ed1 feat(python): support list of list fields from pydantic schema (#747)
For object detection, each row may correspond to an image and each image
can have multiple bounding boxes of x-y coordinates. This means that a
`bbox` field is potentially "list of list of float". This adds support
in our pydantic-pyarrow conversion for nested lists.
2024-04-05 16:24:47 -07:00
Lance Release
4891a7ae14 Updating package-lock.json 2024-04-05 16:24:47 -07:00
Lance Release
d1f24ba1dd [python] Bump version: 0.4.0 → 0.4.1 2024-04-05 16:24:47 -07:00
Lance Release
b56c54c990 Bump version: 0.4.0 → 0.4.1 2024-04-05 16:24:47 -07:00
elliottRobinson
3ab4b335c3 Update default_embedding_functions.md (#744)
Modify some grammar, punctuation, and spelling errors.
2024-04-05 16:24:47 -07:00
Chang She
009297e900 bug(python): fix path handling in windows (#724)
Use pathlib for local paths so that pathlib
can handle the correct separator on windows.

Closes #703
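
The separator handling mentioned above can be illustrated with the pure path classes from the standard library, which behave the same on any operating system. This is a sketch of the general technique, not the code in this commit; `table_path` and the `.lance` directory suffix are assumptions made for the example.

```python
from pathlib import PurePosixPath, PureWindowsPath


def table_path(base, table_name):
    """Join a database directory and a table name (hypothetical helper)."""
    # The "/" operator lets the path class choose the right separator.
    return base / f"{table_name}.lance"


print(table_path(PureWindowsPath(r"C:\data\db"), "my_table"))
print(table_path(PurePosixPath("/data/db"), "my_table"))
```

In application code, `pathlib.Path` picks the Windows or POSIX flavor automatically, so no manual separator handling is needed.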

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2024-04-05 16:24:45 -07:00
Will Jones
3f3acb48c6 chore: add issue templates (#732)
This PR adds issue templates, which help two recurring issues:

* Users forget to tell us whether they are using the Node or Python SDK
* Issues don't get appropriate tags

This doesn't force the use of the templates. Because we set
`blank_issues_enabled: true`, users can still create a custom issue.
2024-04-05 16:24:30 -07:00
Will Jones
c3cda2c5d0 ci: check formatting and clippy (#730) 2024-04-05 16:24:30 -07:00
Will Jones
a975cc0a94 fix: prevent duplicate data in FTS index (#728)
This forces the user to replace the whole FTS directory when re-creating
the index, preventing duplicate data from being created. Previously, the
whole dataset was re-added to the existing index, duplicating existing
rows in the index.

This (in combination with lancedb/lance#1707) caused #726, since the
duplicate data emitted duplicate indices for `take()` and an upstream
issue caused those queries to fail.

This solution isn't ideal, since it makes the FTS index temporarily
unavailable while the index is built. In the future, we should have
multiple FTS index directories, which would allow atomic commits of new
indexes (as well as multiple indexes for different columns).

Fixes #498.
Fixes #726.

---------

Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
2024-04-05 16:24:30 -07:00
Will Jones
48a12e780c upgrade lance to v0.9.1 (#727)
This brings in some important bugfixes related to take and aarch64
Linux. See changes at:
https://github.com/lancedb/lance/releases/tag/v0.9.1
2024-04-05 16:24:30 -07:00
Chang She
b60a2177ae feat(python): support nested reference for fts (#723)
https://github.com/lancedb/lance/issues/1739

Support nested field reference in full text search

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2024-04-05 16:24:30 -07:00
Chang She
cc9d74e7a7 feat(python): add option to flatten output in to_pandas (#722)
Closes https://github.com/lancedb/lance/issues/1738

We add a `flatten` parameter to the signature of `to_pandas`. By default
this is None and does nothing.
If set to True or -1, then LanceDB will flatten structs before
converting to a pandas dataframe. All nested structs are also flattened.
If set to any positive integer, then LanceDB will flatten structs up to
the specified level of nesting.
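
The `flatten` rules above can be sketched on plain nested dicts. This is an illustration of the level semantics only, not the LanceDB or pyarrow implementation; `flatten_struct`, the `"."` separator, and treating `False` like `None` are assumptions made for the example.

```python
def flatten_struct(row, flatten=None, sep="."):
    """Flatten nested dicts in `row` following the described flatten rules.

    None (or False)  -> return the row unchanged
    True or -1       -> flatten every level of nesting
    positive integer -> flatten at most that many levels
    """
    if flatten is None or flatten is False:
        return dict(row)
    if flatten is True or flatten == -1:
        remaining = None  # no limit on nesting depth
    elif isinstance(flatten, int) and flatten > 0:
        remaining = flatten
    else:
        raise ValueError("flatten must be None, True, -1, or a positive int")

    result = dict(row)
    while remaining is None or remaining > 0:
        if not any(isinstance(value, dict) for value in result.values()):
            break  # nothing left to flatten
        expanded = {}
        for key, value in result.items():
            if isinstance(value, dict):
                # Promote each child to a top-level "parent.child" column.
                for child_key, child_value in value.items():
                    expanded[f"{key}{sep}{child_key}"] = child_value
            else:
                expanded[key] = value
        result = expanded
        if remaining is not None:
            remaining -= 1
    return result


row = {"a": 1, "b": {"c": 2, "d": {"e": 3}}}
print(flatten_struct(row, flatten=1))
print(flatten_struct(row, flatten=True))
```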

---------

Co-authored-by: Weston Pace <weston.pace@gmail.com>
2024-04-05 16:24:30 -07:00
Aidan
3232b55218 feat: Node create index API (#720) 2024-04-05 16:24:30 -07:00
Aidan
ee2034db23 feat: Node Schema API (#717) 2024-04-05 16:24:30 -07:00
Lance Release
1dac34d2fa Updating package-lock.json 2024-04-05 16:24:30 -07:00
Lance Release
78b457f230 Updating package-lock.json 2024-04-05 16:24:30 -07:00
Lance Release
884ce655fe Bump version: 0.3.11 → 0.4.0 2024-04-05 16:24:30 -07:00
Lance Release
acbcbe6496 [python] Bump version: 0.3.6 → 0.4.0 2024-04-05 16:24:30 -07:00
Lei Xu
1d79e9168e chore: bump lance version to 0.9 (#715) 2024-04-05 16:24:30 -07:00
Lance Release
f46931228b Updating package-lock.json 2024-04-05 16:24:30 -07:00
Lance Release
811e604077 [python] Bump version: 0.3.5 → 0.3.6 2024-04-05 16:24:30 -07:00
Lance Release
072be50cb3 Updating package-lock.json 2024-04-05 16:24:30 -07:00
Lance Release
aca1b43d5e Bump version: 0.3.10 → 0.3.11 2024-04-05 16:24:30 -07:00
Bert
0b9c8ef88a chore: fix package lock (#711) 2024-04-05 16:24:30 -07:00
Bert
eb62ddfb0c implement update for remote clients (#706) 2024-04-05 16:24:30 -07:00
Rob Meng
32515ace74 feat: pass vector column name to remote backend (#710)
pass vector column name to remote as well.

`vector_column` is already part of `Query`; this just declares it as
part of `remote.VectorQuery` as well
2024-04-05 16:24:30 -07:00
Rob Meng
82946f3623 feat: allow custom column name in query (#709) 2024-04-05 16:24:30 -07:00
Chang She
374a6f7e78 feat: support nested pydantic schema (#707) 2024-04-05 16:24:30 -07:00
Will Jones
e52f691420 ci: fix broken npm publication (#704)
Most recent release failed because `release` depends on `node-macos`,
but we renamed `node-macos` to `node-macos-{x86,arm64}`. This fixes that
by consolidating them back to a single `node-macos` job, which also has
the side effect of making the file shorter.
2024-04-05 16:24:30 -07:00
Lance Release
79aeb6bea6 Updating package-lock.json 2024-04-05 16:24:30 -07:00
Lance Release
7d70c9940c Bump version: 0.3.9 → 0.3.10 2024-04-05 16:24:30 -07:00
Lance Release
fc32f98c34 [python] Bump version: 0.3.4 → 0.3.5 2024-04-05 16:24:30 -07:00
Will Jones
9356c3b86a feat(python): add update query support for Python (#654)
Closes #69

Will not pass until https://github.com/lancedb/lance/pull/1585 is
released
2024-04-05 16:24:29 -07:00
Chang She
b02370cacd feat: LocalTable for vectordb now supports filters without vector search (#693)
Note that the filter/where is currently only implemented for LocalTable,
so it requires an explicit cast to "enable" (see new unit test).
The alternative is to add it to the Table interface, but since it's not
available on RemoteTable this may cause some user experience issues.
2024-04-05 16:24:15 -07:00
Bert
e479acc1bd Update in Node & Rust (#696)
Co-authored-by: Will Jones <willjones127@gmail.com>
2024-04-05 16:24:15 -07:00
Ayush Chaurasia
3413e79b0f chore(python): Reduce posthog event count (#661)
- Register open_table as event
- Because we're dropping the 'search' event currently, changed the name
to 'search_table' and introduced throttling
- Throttled events will be counted once per time batch so that the user
is registered but event count doesn't go up by a lot
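The throttling idea in the last bullet can be sketched as follows. This is a hedged illustration, not the telemetry code in this commit: `ThrottledEvents`, the 60-second batch window, and the injectable `clock` are all hypothetical.

```python
import time


class ThrottledEvents:
    """Record each event name at most once per time batch (hypothetical)."""

    def __init__(self, batch_seconds=60.0, clock=time.monotonic):
        self.batch_seconds = batch_seconds
        self._clock = clock
        self._last_sent = {}
        self.sent = []  # stands in for the calls to the analytics backend

    def capture(self, name):
        now = self._clock()
        last = self._last_sent.get(name)
        if last is not None and now - last < self.batch_seconds:
            # Still inside the current batch window: drop the event.
            return False
        self._last_sent[name] = now
        self.sent.append(name)
        return True


# A fake clock keeps the example deterministic: calls at t=0, t=10 and t=61.
fake_times = iter([0.0, 10.0, 61.0])
events = ThrottledEvents(batch_seconds=60.0, clock=lambda: next(fake_times))
print([events.capture("search_table") for _ in range(3)])
```

With this scheme the user is still registered once per window, while repeated calls inside the window do not increase the event count.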
2024-04-05 16:24:14 -07:00
Ayush Chaurasia
91ff324c70 docs: Update roboflow tutorial position (#666) 2024-04-05 16:23:49 -07:00
QianZhu
480a630e19 Qian/minor fix doc (#695) 2024-04-05 16:23:49 -07:00
Kaushal Kumar Choudhary
07e33c2b2d docs: Add badges (#694)
adding some badges
added a gif to readme for the vectordb repo

---------

Co-authored-by: kaushal07wick <kaushalc6@gmail.com>
2024-04-05 16:23:49 -07:00
Chang She
fb1de97e83 chore: Use m1 runner for npm publish (#687)
We had some build issues with npm publish for cross-compiling arm64
macos on an x86 macos runner. Switching to m1 runner for now until
someone has time to deal with the feature flags.

follow-up tracked here: #688
2024-04-05 16:23:49 -07:00
QianZhu
bda0135cfc saas python sdk doc (#692)
<img width="256" alt="Screenshot 2023-12-07 at 11 55 41 AM"
src="https://github.com/lancedb/lancedb/assets/1305083/259bf234-9b3b-4c5d-af45-c7f3fada2cc7">
2024-04-05 16:23:49 -07:00
Chang She
287d85a3aa chore: update package lock (#689) 2024-04-05 16:23:49 -07:00
Chang She
7b92e796bb chore: set error handling to immediate (#686)
There was a build failure for the rust artifact, but the macos arm64
build for npm publish still passed, so we had a silent failure for 2
releases. Setting error handling to immediate should make the build fail
immediately.
2024-04-05 16:23:49 -07:00
Lance Release
608e502de6 Updating package-lock.json 2024-04-05 16:23:49 -07:00
Lance Release
328880f057 Updating package-lock.json 2024-04-05 16:23:49 -07:00
Lance Release
93ade53515 Bump version: 0.3.8 → 0.3.9 2024-04-05 16:23:49 -07:00
Rob Meng
d74e188f80 fix: fix passing prefilter flag to remote client (#677)
was passing this at the wrong position
2024-04-05 16:23:49 -07:00
Rob Meng
59c25574f0 feat: enable prefilter in node js (#675)
enable prefiltering in node js, both native and remote
2024-04-05 16:23:49 -07:00
Rob Meng
c1c3083b74 chore: expose prefilter in lancedb rust (#674)
expose prefilter flag in vectordb rust code.
2024-04-05 16:23:49 -07:00
James
a94a033553 (docs):Add CLIP image embedding example (#660)
In this PR, I add a guide that lets you use Roboflow Inference to
calculate CLIP embeddings for use in LanceDB. This post was reviewed by
@AyushExel.
2024-04-05 16:23:49 -07:00
Bert
bbf34ae7f4 fix: python remote correct open_table error message (#659) 2024-04-05 16:23:49 -07:00
Lance Release
57dda15f49 Updating package-lock.json 2024-04-05 16:23:49 -07:00
Lance Release
8f82e4897c [python] Bump version: 0.3.3 → 0.3.4 2024-04-05 16:23:49 -07:00
Lance Release
8bd77d3c72 Updating package-lock.json 2024-04-05 16:23:49 -07:00
Lance Release
0273df4e04 Bump version: 0.3.7 → 0.3.8 2024-04-05 16:23:49 -07:00
Will Jones
6d76fe80b8 chore: upgrade lance to v0.8.17 (#656)
Readying for the next Lance release.
2024-04-05 16:23:49 -07:00
Rok Mihevc
78ab9068a8 feat(python): expose index cache size (#655)
This is to enable https://github.com/lancedb/lancedb/issues/641.
Should be merged after https://github.com/lancedb/lance/pull/1587 is
released.
2024-04-05 16:23:49 -07:00
Ayush Chaurasia
088792c821 [Docs]: Add Instructor embeddings and rate limit handler docs (#651) 2024-04-05 16:23:49 -07:00
Ayush Chaurasia
955c2a751a [Docs][SEO] Add sitemap and robots.txt (#645)
Sitemap improves SEO by ranking pages and tracking updates.
2024-04-05 16:23:49 -07:00
Aidan
775bee576c SaaS create_index API (#649) 2024-04-05 16:23:49 -07:00
Lance Release
f59af4df76 Updating package-lock.json 2024-04-05 16:23:49 -07:00
Lance Release
15cc5227c4 Updating package-lock.json 2024-04-05 16:23:49 -07:00
Lance Release
c008faddfd Bump version: 0.3.6 → 0.3.7 2024-04-05 16:23:49 -07:00
Bert
22fc0eaaf6 fix: node remote implement table.countRows (#648) 2024-04-05 16:23:49 -07:00
Rok Mihevc
32cb1b9ea4 feat: add RemoteTable.version in Python (#644)
Please note: this is not tested as we don't have a server here and
testing against a mock object wouldn't be that interesting.
2024-04-05 16:23:49 -07:00
Bert
49a366bc74 fix: node send db header for GET requests (#646) 2024-04-05 16:23:49 -07:00
Ayush Chaurasia
d59dbf8230 fix: Pydantic 1.x compat for weak_lru caching in embeddings API (#643)
Colab has pydantic 1.x by default, and pydantic 1.x BaseModel objects
don't support weakref creation by default, which we rely on to cache
embedding models
(https://github.com/lancedb/lancedb/blob/main/python/lancedb/embeddings/utils.py#L206).
It needs to be added to slot.
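The general Python mechanism behind this can be shown without pydantic: a class that defines `__slots__` only supports weak references if `__weakref__` is one of the slots. The sketch below demonstrates that rule with plain classes; whether it mirrors the exact change in this commit is an assumption, and the class names are hypothetical.

```python
import weakref


class WithoutWeakref:
    """Slots without __weakref__: instances cannot be weakly referenced."""

    __slots__ = ("name",)

    def __init__(self, name):
        self.name = name


class WithWeakref:
    """Adding __weakref__ to the slots re-enables weak references."""

    __slots__ = ("name", "__weakref__")

    def __init__(self, name):
        self.name = name


def can_weakref(obj):
    try:
        weakref.ref(obj)
    except TypeError:
        return False
    return True


print(can_weakref(WithoutWeakref("model")), can_weakref(WithWeakref("model")))
```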
2024-04-05 16:23:49 -07:00
Ayush Chaurasia
c0a49a9a5b Multi-task instructor model with quantization support & weak_lru cache for embedding function models (#612)
resolves #608
2024-04-05 16:23:49 -07:00
QianZhu
2f2964a645 fix saas open_table and table_names issues (#640)
- added check whether a table exists in SaaS open_table
- remove prefilter not supported warning in SaaS search
- fixed issues for SaaS table_names
2024-04-05 16:23:49 -07:00
Rob Meng
3d50c9cdfe upgrade lance to 0.8.14 (#636)
upgrade lance
2024-04-05 16:23:49 -07:00
Rob Meng
bdb3b46f7e skip missing file on mirrored dir when deleting (#635)
The mirrored store is not guaranteed to have all the files. Ignore the
ones that don't exist.
2024-04-05 16:23:49 -07:00
Lei Xu
49306a99ba chore: apple silicon runner (#633)
Close #632
2024-04-05 16:23:49 -07:00
Lei Xu
86efd36689 chore: improve create_table API consistency between local and remote SDK (#627) 2024-04-05 16:23:47 -07:00
Bert
20ab85171b fix: node remote connection handles non http errors (#624)
https://github.com/lancedb/lancedb/issues/623

Fixes issue trying to print response status when using remote client. If
the error is not an HTTP error (e.g. dns/network failure), there won't
be a response.
2024-04-05 16:23:14 -07:00
Ayush Chaurasia
159ecbac5a Exponential standoff retry support for handling rate limited embedding functions (#614)
Users ingesting data through rate-limited APIs no longer need to manually make
the process sleep to counter rate limits
resolves #579
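
The retry behaviour described above is commonly called exponential backoff. A minimal, library-independent sketch follows; `retry_with_backoff` and its parameters are hypothetical names chosen for illustration and are not the API added by this PR.

```python
import random
import time


def retry_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fn(), retrying on any exception with exponentially growing delays."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # the delay doubles on each attempt and is capped at max_delay
            delay = min(base_delay * (2 ** attempt), max_delay)
            # a small random jitter avoids many clients retrying in lockstep
            sleep(delay + random.uniform(0, 0.1 * delay))
```

The `sleep` argument is injectable only so the delays can be observed without actually waiting; a real caller would leave it at the default.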
2024-04-05 16:23:14 -07:00
Lance Release
148f6d7283 Updating package-lock.json 2024-04-05 16:23:14 -07:00
Lance Release
c604912139 Updating package-lock.json 2024-04-05 16:23:14 -07:00
Lance Release
178af0c2b8 Bump version: 0.3.5 → 0.3.6 2024-04-05 16:23:14 -07:00
Lance Release
c1b037f0a5 [python] Bump version: 0.3.2 → 0.3.3 2024-04-05 16:23:14 -07:00
Lei Xu
3855bdf986 chore: bump lance to 0.8.10 (#622) 2024-04-05 16:23:14 -07:00
Ayush Chaurasia
07ab4cd14c Disable posthog on docs & reduce sentry trace factor (#607)
- posthog charges per event and docs events are registered very
frequently. We can keep tracking them on GA
- Reduced sentry trace factor
2024-04-05 16:23:13 -07:00
Chang She
531c947fc1 doc: node sdk now supports windows (#616) 2024-04-05 16:22:59 -07:00
Bert
4e9aab9e8b ci: cancel in progress runs on new push (#620) 2024-04-05 16:22:59 -07:00
Bert
cd7a4dd251 fix!: sort table names (#619)
https://github.com/lancedb/lance/issues/1385
2024-04-05 16:22:59 -07:00
QianZhu
3c139c2ee5 Qian/query option doc (#615)
- API documentation improvement for queries (table.search)
- a small bug fix for the remote API on create_table

![image](https://github.com/lancedb/lancedb/assets/1305083/712e9bd3-deb8-4d81-8cd0-d8e98ef68f4e)

![image](https://github.com/lancedb/lancedb/assets/1305083/ba22125a-8c36-4e34-a07f-e39f0136e62c)
2024-04-05 16:22:59 -07:00
Will Jones
166b281d66 increment pylance (#618) 2024-04-05 16:22:59 -07:00
Bert
c9fee0faed added api docs for prefilter flag (#617)
Added the prefilter flag argument to the `LanceQueryBuilder.where`.

This should make it display here:

https://lancedb.github.io/lancedb/python/python/#lancedb.query.LanceQueryBuilder.select

And also in intellisense like this:
<img width="848" alt="image"
src="https://github.com/lancedb/lancedb/assets/5846846/e0c53f4f-96bc-411b-9159-680a6c4d0070">

Also adds some improved documentation about the `where` argument to this
method.

---------

Co-authored-by: Weston Pace <weston.pace@gmail.com>
2024-04-05 16:22:59 -07:00
Weston Pace
301e08f30e feat: allow prefiltering with index (#610)
Support for prefiltering with an index was added in lance version 0.8.7.
We can remove the lancedb check that prevents this. Closes #261
2024-04-05 16:22:59 -07:00
Lei Xu
b5e57ebce3 doc: add doc to use GPU for indexing (#611) 2024-04-05 16:22:59 -07:00
Lance Release
87364532bf Updating package-lock.json 2024-04-05 16:22:59 -07:00
Lance Release
c275ec006f Updating package-lock.json 2024-04-05 16:22:59 -07:00
Lance Release
53b0375e6d Bump version: 0.3.4 → 0.3.5 2024-04-05 16:22:59 -07:00
Bert
6881c50866 fix conv version (#605) 2024-04-05 16:22:59 -07:00
Lance Release
a174832d61 Updating package-lock.json 2024-04-05 16:22:59 -07:00
Lance Release
722cede32b Bump version: 0.3.3 → 0.3.4 2024-04-05 16:22:59 -07:00
Bert
4d086d63eb feat: added dataset stats api to node (#604) 2024-04-05 16:22:59 -07:00
Bert
f5e9c073f0 feat: added data stats apis (#596) 2024-04-05 16:22:59 -07:00
Rob Meng
178e016ff2 expose remap index api (#603)
expose index remap options in `compact_files`
2024-04-05 16:22:59 -07:00
Rob Meng
3c998b020f feat: expose optimize index api (#602)
expose `optimize_index` api.
2024-04-05 16:22:59 -07:00
Lance Release
a3c955070e [python] Bump version: 0.3.1 → 0.3.2 2024-04-05 16:22:59 -07:00
Bert
edeecd3d9f update lance to 0.8.7 (#598) 2024-04-05 16:22:59 -07:00
Chang She
2861f33982 fix(python): fix multiple embedding functions bug (#597)
Closes #594

The embedding functions are pydantic models so multiple instances with
the same parameters are considered ==, which means that if you have
multiple embedding columns it's possible for the embeddings to get
overwritten. Instead we use `is` instead of == to avoid this problem.

testing: modified unit test to include this case
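
A small sketch of the `==` versus `is` distinction described above. It uses a standard-library dataclass as a stand-in for a pydantic model (both compare by field values); `EmbeddingConfig` and the two lookup helpers are hypothetical and not part of lancedb.

```python
from dataclasses import dataclass


@dataclass
class EmbeddingConfig:
    # Hypothetical stand-in for a pydantic embedding-function model
    name: str
    dim: int


text_fn = EmbeddingConfig("encoder", 2)
image_fn = EmbeddingConfig("encoder", 2)  # same parameters, different instance

assert text_fn == image_fn      # value equality: the two look identical
assert text_fn is not image_fn  # identity: they are still separate objects


def find_by_equality(functions, target):
    # returns the first entry that compares equal, which may be the wrong one
    return next(i for i, f in enumerate(functions) if f == target)


def find_by_identity(functions, target):
    # returns the entry that is the very object passed in
    return next(i for i, f in enumerate(functions) if f is target)


functions = [text_fn, image_fn]
print(find_by_equality(functions, image_fn))  # 0: matched the other column's function
print(find_by_identity(functions, image_fn))  # 1: matched the intended function
```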
2024-04-05 16:22:59 -07:00
Rob Meng
0036ca9de7 feat: add checkout method to table to reuse existing store and connections (#593)
Prior to this PR, to get a new version of a table, we need to re-open
the table. This has a few downsides w.r.t. performance:
* Object store is recreated, which takes time and throws away existing
warm connections
* Commit handler is thrown away as well, which also may contain warm
connections
2024-04-05 16:22:59 -07:00
Rob Meng
2826bc7f1a feat: include manifest files in mirror store (#589) 2024-04-05 16:22:59 -07:00
Will Jones
e37a0566e0 Revert "[python] Bump version: 0.3.2 → 0.3.3"
This reverts commit c30faf6083.
2024-04-05 16:22:59 -07:00
Will Jones
48999ffc27 [python] Bump version: 0.3.2 → 0.3.3 2024-04-05 16:22:59 -07:00
Ayush Chaurasia
0dc893993f [Docs]: Minor Fixes (#587)
* Filename typo
* Remove rick_morty csv as users won't really be able to use it. We can
create an executable colab and download it from a bucket or something.
2024-04-05 16:22:59 -07:00
Ayush Chaurasia
12de39612e [Docs] Embeddings API: Add multi-lingual semantic search example (#582) 2024-04-05 16:22:59 -07:00
Ayush Chaurasia
05509bfb03 [Docs]Versioning docs (#586)
closes #564

---------

Co-authored-by: Chang She <chang@lancedb.com>
2024-04-05 16:22:59 -07:00
Lance Release
fa702f992e Updating package-lock.json 2024-04-05 16:22:59 -07:00
Lance Release
7f707205de Updating package-lock.json 2024-04-05 16:22:59 -07:00
Lance Release
2394ff14d0 Bump version: 0.3.2 → 0.3.3 2024-04-05 16:22:59 -07:00
Chang She
31334b05df chore: bump lance version in python/rust lancedb (#584)
To include latest v0.8.6

Co-authored-by: Chang She <chang@lancedb.com>
2024-04-05 16:22:59 -07:00
Ayush Chaurasia
942976f49f [Docs] Update embedding function docs (#581) 2024-04-05 16:22:59 -07:00
Ayush Chaurasia
507f6087c2 [Python]Embeddings API refactor (#580)
Sets things up for this -> https://github.com/lancedb/lancedb/issues/579
- Just separates out the registry/ingestion code from the function
implementation code
- adds a `get_registry` util
- package name "open-clip" -> "open-clip-torch"
2024-04-05 16:22:59 -07:00
Ayush Chaurasia
39c1cb87ad [Docs] Add posthog telemetry to docs (#577)
Allows creation of funnels and user journeys
2024-04-05 16:22:59 -07:00
QianZhu
6b0d1d6ec1 list table pagination draft (#574) 2024-04-05 16:22:59 -07:00
Prashanth Rao
d38e3d496f Add pyarrow date and timestamp type conversion from pydantic (#576) 2024-04-05 16:22:59 -07:00
Chang She
f4ac47e1b5 doc: fix broken link and add README (#573)
Fix broken link to embedding functions

testing: broken link was verified after local docs build to have been
repaired

---------

Co-authored-by: Chang She <chang@lancedb.com>
2024-04-05 16:22:59 -07:00
Lance Release
c94e428252 Updating package-lock.json 2024-04-05 16:22:59 -07:00
Lance Release
a09389459c Updating package-lock.json 2024-04-05 16:22:59 -07:00
Lance Release
4f62fb5ae8 Bump version: 0.3.1 → 0.3.2 2024-04-05 16:22:59 -07:00
Rob Meng
c14ccbd334 implement remote api calls for table mutation (#567)
Add more APIs to remote table for Node SDK
* `add` rows
* `overwrite` table with rows
* `create` table

This has been tested against dev stack
2024-04-05 16:22:59 -07:00
Rok Mihevc
b10afbeedc docs: show source of documented functions (#569) 2024-04-05 16:22:59 -07:00
Lei Xu
8dc10180b0 feat(python,js): deletion operation on remote tables (#568) 2024-04-05 16:22:59 -07:00
Rok Mihevc
377a564904 docs: switch python examples to be row based (#554) 2024-04-05 16:22:59 -07:00
Lei Xu
7b5bfadab2 chore: bump lance to 0.8.5 (#561)
Bump lance to 0.8.5
2024-04-05 16:22:59 -07:00
Ayush Chaurasia
1c42894918 [DOCS][PYTHON] Update embeddings API docs & Example (#516)
This PR adds an overview of embeddings docs:
- 2 ways to vectorize your data using lancedb - explicit & implicit
- explicit - manually vectorize your data using `with_embeddings` function
- Implicit - automatically vectorize your data as it comes by ingesting
your embedding function details as table metadata
- Multi-modal example w/ disappearing embedding function
2024-04-05 16:22:59 -07:00
Lance Release
2b341f3482 Updating package-lock.json 2024-04-05 16:22:59 -07:00
Lance Release
5027529663 Updating package-lock.json 2024-04-05 16:22:59 -07:00
Lance Release
3ed509f20c Bump version: 0.3.0 → 0.3.1 2024-04-05 16:22:59 -07:00
Lance Release
87c69e74fc [python] Bump version: 0.3.0 → 0.3.1 2024-04-05 16:22:59 -07:00
Ayush Chaurasia
0e9a7f0dc7 Add cohere embedding function (#550) 2024-04-05 16:22:59 -07:00
Will Jones
c07207c661 feat: cleanup and compaction (#518)
#488
2024-04-05 16:22:59 -07:00
Ayush Chaurasia
541b06664f [Docs] Improve visibility of table ops (#553)
A little verbose, but better than being non-discoverable 
![Screenshot from 2023-10-11
16-26-02](https://github.com/lancedb/lancedb/assets/15766192/9ba539a7-0cf8-4d9e-94e7-ce5d37c35df0)
2024-04-05 16:22:59 -07:00
Chang She
8469d010f8 feat: add to_list and to_pandas api's (#556)
Add `to_list` to return query results as list of python dict (so we're
not too pandas-centric). Closes #555

Add `to_pandas` API and add deprecation warning on `to_df`. Closes #545

Co-authored-by: Chang She <chang@lancedb.com>
2024-04-05 16:22:59 -07:00
Ankur Goyal
a737bbff19 Use query.limit(..) in README (#543)
If you run the README javascript example in typescript, it complains
that the type of limit is a function and cannot be set to a number.
2024-04-05 16:22:59 -07:00
Lei Xu
a26c8f3316 feat: use GPU for index creation. (#540)
Bump lance to 0.8.3 to include GPU training

---------

Co-authored-by: Rob Meng <rob.xu.meng@gmail.com>
2023-10-05 20:49:00 -07:00
Josh Wein
88d8d7249e Typo cleanup (#539) 2023-10-05 23:07:28 -04:00
Rob Meng
0eb7c9ea0c fix stackoverflow (#542)
closes #541 

two functions were calling themselves instead of routing to the primary
2023-10-05 20:05:04 -04:00
Rob Meng
1db66c6980 implement mirroring object store (#537)
This PR implements a mirroring object store and allows a table to be
mirrored to a local path when param `mirroredStore` is set in the url
2023-10-04 21:23:34 -04:00
Lance Release
c58da8fc8a Updating package-lock.json 2023-10-03 22:59:02 +00:00
Lance Release
448c4a835d Updating package-lock.json 2023-10-03 22:09:00 +00:00
Lance Release
850f80de99 Bump version: 0.2.6 → 0.3.0 2023-10-03 22:08:44 +00:00
Lance Release
a022368426 [python] Bump version: 0.2.6 → 0.3.0 2023-10-03 21:48:22 +00:00
Lei Xu
8b815ef5a8 chore: upgrade lance to 0.8.1 (#536)
Bump to lance 0.8.1 for both javascript and python sdk
2023-10-03 14:29:18 -07:00
Tan Li
e4c3a9346c [doc] make the tensor width different from height (#533) 2023-10-03 00:55:16 -07:00
Prashanth Rao
1d1f8964d2 [DOCS][PYTHON] Update docs for clarity (#531)
I only modified those docs pages that are untouched in existing unmerged
PRs, so hopefully there are no merge conflicts!

1. The `tantivy-py` version specified in the docs doesn't work (pip
install fails), but with the latest version of pip and wheel installed
on my Mac, I was able to just `pip install tantivy` and FTS works great
for me. I updated the docs page to include this in
7ca4b757ce but can always modify to
another specific version in case this breaks any tests.
2. The `.add()` method for Python should take in a list of dicts as the
first option (to also align with the JS API), and additionally, users
can pass an existing pandas DataFrame to add to a table. Hope this makes
sense.
3. I've had multiple conversations with users who are unclear that the
terms "exhaustive", "flat" and "KNN" are all the same kind of search, so
I've updated the verbiage of this section to clarify this.
4. Fixed typos and improved clarity in the ANN indexes page.
2023-10-03 09:46:53 +05:30
Lance Release
d326146a40 [python] Bump version: 0.2.5 → 0.2.6 2023-10-01 17:48:59 +00:00
Chang She
693bca1eba feat(python): expose prefilter to lancedb (#522)
We have experimental support for prefiltering (without ANN) in pylance.
This means that we can now apply a filter BEFORE vector search is
performed. This can be done via the `.where(filter_string,
prefilter=True)` kwargs of the query.

Limitations:
- When connecting to LanceDB cloud, `prefilter=True` will raise
NotImplemented
- When an ANN index is present, `prefilter=True` will raise
NotImplemented
- This option is not available for full text search query
- This option is not available for empty search query (just
filter/project)
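
To illustrate what applying the filter BEFORE the search changes, here is a brute-force sketch over one-dimensional "vectors". The `nearest` function and its arguments are hypothetical and only mimic the semantics; they are not the pylance or lancedb implementation.

```python
def nearest(rows, query, k, where=None, prefilter=False):
    """Brute-force nearest-neighbour search over dict rows with a 1-D key `x`."""

    def top_k(candidates):
        return sorted(candidates, key=lambda r: abs(r["x"] - query))[:k]

    if where is None:
        return top_k(rows)
    if prefilter:
        # filter first, then search among the rows that passed
        return top_k([r for r in rows if where(r)])
    # post-filter: search first, then drop results that fail the filter
    return [r for r in top_k(rows) if where(r)]


rows = [
    {"x": 1.0, "tag": "a"},
    {"x": 1.1, "tag": "b"},
    {"x": 5.0, "tag": "a"},
]


def is_a(row):
    return row["tag"] == "a"


post = nearest(rows, query=1.0, k=2, where=is_a)
pre = nearest(rows, query=1.0, k=2, where=is_a, prefilter=True)
print(len(post))  # 1: one of the two nearest rows was filtered out afterwards
print(len(pre))   # 2: the search only considered rows matching the filter
```

With post-filtering a query can return fewer than `k` rows even when enough matching rows exist, which is the case prefiltering addresses.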

Additional changes in this PR:
- Bump pylance version to v0.8.0 which supports the experimental
prefiltering.

---------

Co-authored-by: Chang She <chang@lancedb.com>
2023-10-01 10:34:12 -07:00
Will Jones
343e274ea5 fix: define minimum dependency versions (#515)
Closes #513

For each of these, I found the minimum version that would allow the unit
tests to pass.
2023-09-29 09:04:49 -07:00
Rob Meng
a695fb8030 fix import attr to use import attrs (#510)
Thanks to #508, I used `attr` instead of the correct package `attrs`

s/attr/attrs
2023-09-23 00:30:56 -04:00
Hynek Schlawack
bc8670d7af [Python] Fix attrs dependency (#508)
The `attr` project is unrelated to `attrs` that also provides the `attr`
namespace (see also <https://hynek.me/articles/import-attrs/>).

It used to _usually_ work, because attrs is a dependency of aiohttp and
somehow took precedence over `attr`'s `attr`.

Yes, sorry, it's a mess.
2023-09-21 12:35:34 -04:00
Lance Release
74004161ff [python] Bump version: 0.2.4 → 0.2.5 2023-09-19 16:43:06 +00:00
Lance Release
34ddb1de6d Updating package-lock.json 2023-09-19 13:48:20 +00:00
Lance Release
1029fc9cb0 Updating package-lock.json 2023-09-19 12:19:23 +00:00
Lance Release
31c5df6d99 Bump version: 0.2.5 → 0.2.6 2023-09-19 12:19:05 +00:00
Rob Meng
dbf37a0434 fix: upgrade lance to 0.7.5 and add tests for searching empty dataset (#505)
This PR upgrades lance to `0.7.5`, which includes fixes for searching an
empty dataset.

This PR also adds two tests in the node SDK to make sure searching an empty
dataset does not throw

Co-authored-by: rmeng <rob@lancedb.com>
2023-09-18 22:12:11 -07:00
Chang She
f20f19b804 feat: improve pydantic 1.x compat (#503) 2023-09-18 19:01:30 -07:00
Chang She
55207ce844 feat: add lancedb.__version__ (#504) 2023-09-18 18:51:51 -07:00
Chang She
c21f9cdda0 ci: fix docs build (#496)
python/python.md contains typos in the class references

---------

Co-authored-by: Chang She <chang@lancedb.com>
2023-09-18 13:07:21 -07:00
Rob Meng
bc38abb781 refactor connection string processing (#500)
In #486, `connect` started converting paths into URIs. However, that PR
didn't handle relative paths and prepended `file://` to them.

This PR changes the parsing strategy to be more rational. If a path is
provided instead of a URL, we do not try anything special.

The engine and engine params may only be specified when a URL with a scheme is
provided

Co-authored-by: rmeng <rob@lancedb.com>
2023-09-18 12:38:00 -07:00
Rob Meng
731f86e44c add health check to wait for all service ready before next step (#501)
aws integration tests are flaky because we didn't wait for the services
to become healthy. (we only waited for the localstack service, this PR
adds wait for sub services)
2023-09-18 15:17:45 -04:00
Chang She
31dad71c94 multi-modal embedding-function (#484) 2023-09-16 21:23:51 -04:00
Will Jones
9585f550b3 fix: increase S3 timeouts (#494)
Closes #493
2023-09-15 20:21:34 -07:00
Lance Release
8dc2315479 [python] Bump version: 0.2.3 → 0.2.4 2023-09-15 14:23:26 +00:00
Rob Meng
f6bfb5da11 chore: upgrade lance to 0.7.4 (#491) 2023-09-14 16:02:23 -04:00
Lance Release
661fcecf38 [python] Bump version: 0.2.2 → 0.2.3 2023-09-14 17:48:32 +00:00
Lance Release
07fe284810 Updating package-lock.json 2023-09-10 23:58:06 +00:00
Lance Release
800bb691c3 Updating package-lock.json 2023-09-09 19:45:58 +00:00
Lance Release
ec24e09add Bump version: 0.2.4 → 0.2.5 2023-09-09 19:45:43 +00:00
Rob Meng
0554db03b3 propagate uri query string to lance; add aws integration tests (#486)
# WARNING: specifying engine is NOT a publicly supported feature in
lancedb yet. THE API WILL CHANGE.

This PR exposes dynamodb based commit to `vectordb` and JS SDK (will do
python in another PR since it's on a different release track)

This PR also added aws integration test using `localstack`

## What?
This PR adds URI parameters to the DB connection string. Users may specify
`engine` in the connection string to let LanceDB know that they
want to use an external store when reading and writing a table. Users
may also pass any parameters required by the commitStore in the
connection string; these parameters will be propagated to lance.

e.g.
```
vectordb.connect("s3://my-db-bucket?engine=ddb&ddbTableName=my-commit-table")
```
will automatically convert table path to
```
s3+ddb://my-db-bucket/my_table.lance?&ddbTableName=my-commit-table
```
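
The rewrite described above can be sketched with the Python standard library. This is only an illustration of the idea: `table_uri` is a hypothetical helper, the real conversion lives in the Rust/JS code of this PR, and the sketch omits the stray `&` that appears after `?` in the example output above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def table_uri(db_uri: str, table_name: str) -> str:
    """Fold an `engine` query parameter into the scheme and append the table path."""
    parts = urlsplit(db_uri)
    params = parse_qsl(parts.query)
    engine = next((v for k, v in params if k == "engine"), None)
    # every parameter other than `engine` is passed through untouched
    passthrough = [(k, v) for k, v in params if k != "engine"]
    scheme = f"{parts.scheme}+{engine}" if engine else parts.scheme
    path = f"{parts.path.rstrip('/')}/{table_name}.lance"
    return urlunsplit((scheme, parts.netloc, path, urlencode(passthrough), ""))


print(table_uri("s3://my-db-bucket?engine=ddb&ddbTableName=my-commit-table", "my_table"))
# s3+ddb://my-db-bucket/my_table.lance?ddbTableName=my-commit-table
```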
2023-09-09 13:33:16 -04:00
Lei Xu
b315ea3978 [Python] Pydantic vector field with default value (#474)
Rename `lance.pydantic.vector` to `Vector` and deprecate `vector(dim)`
2023-09-08 22:35:31 -07:00
Ayush Chaurasia
aa7806cf0d [Python]Fix record_batch_generator (#483)
Should fix - https://github.com/lancedb/lancedb/issues/482
2023-09-08 21:18:50 +05:30
Lei Xu
6799613109 feat: upgrade lance to 0.7.3 (#481) 2023-09-07 17:01:45 -07:00
Lei Xu
0f26915d22 [Rust] schema coerce and vector column inference (#476)
Split the rust core from #466 for easy review and less merge conflicts.
2023-09-06 10:00:46 -07:00
Chang She
32163063dc Fix up docs (#477) 2023-09-05 22:29:50 -07:00
Chang She
9a9a73a65d [python] Use pydantic for embedding function persistence (#467)
1. Support persistent embedding function so users can just search using
query string
2. Add fixed size list conversion for multiple vector columns
3. Add support for empty query (just apply select/where/limit).
4. Refactor and simplify some of the data prep code

---------

Co-authored-by: Chang She <chang@lancedb.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
2023-09-05 21:30:45 -07:00
Ayush Chaurasia
52fa7f5577 [Docs] Small typo fixes (#460) 2023-09-02 22:17:19 +05:30
Chang She
0cba0f4f92 [python] Temporary update feature (#457)
Combine delete and append to make a temporary update feature that is
only enabled for the local python lancedb.

This is temporary because it first has to load the
data that matches the where clause into memory, which is technically
unbounded.
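
A rough sketch of the delete-plus-append approach, using a plain list of dicts in place of a table. The `update` function here is hypothetical and only mirrors the idea, including the step that loads all matching rows into memory.

```python
def update(table, where, values):
    """Emulate UPDATE on a list-of-dicts 'table' as delete + append."""
    # 1. load every matching row into memory (this is the unbounded part)
    matched = [dict(row) for row in table if where(row)]
    # 2. apply the new values to the in-memory copies
    for row in matched:
        row.update(values)
    # 3. delete the old rows, then append the modified ones
    table[:] = [row for row in table if not where(row)]
    table.extend(matched)
    return len(matched)


table = [{"id": 1, "price": 10.0}, {"id": 2, "price": 20.0}]
updated = update(table, lambda row: row["id"] == 1, {"price": 15.0})
print(updated)  # 1
print(table)    # the updated row now sits at the end of the table
```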

---------

Co-authored-by: Chang She <chang@lancedb.com>
2023-08-30 00:25:26 -07:00
Will Jones
8391ffee84 chore: make crate more discoverable (#443)
A few small changes to make the Rust crate more discoverable.
2023-08-25 08:59:14 -07:00
Lance Release
fe8848efb9 [python] Bump version: 0.2.1 → 0.2.2 2023-08-24 23:18:10 +00:00
Chang She
213c313b99 Revert "Updating package-lock.json" (#455)
This reverts commit ab97e5d632.

Co-authored-by: Chang She <chang@lancedb.com>
2023-08-24 15:54:57 -07:00
Chang She
157e995a43 Revert "Bump version: 0.2.4 → 0.2.5" (#454)
This reverts commit 87e9a0250f.

I triggered the nodejs release commit GHA by mistake. Reverting it.
The tag will be removed manually.

Co-authored-by: Chang She <chang@lancedb.com>
2023-08-24 15:44:37 -07:00
Lance Release
ab97e5d632 Updating package-lock.json 2023-08-24 21:54:35 +00:00
Lance Release
87e9a0250f Bump version: 0.2.4 → 0.2.5 2023-08-24 21:54:18 +00:00
Chang She
e587a17a64 [python] Support schema evolution in local LanceDB (#452)
Previously if you needed to add a column to a table you'd have to
rewrite the whole table. Instead,
we use the merge functionality from Lance format
to incrementally add columns from another table
or dataframe.

---------

Co-authored-by: Chang She <chang@lancedb.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
2023-08-24 14:40:49 -07:00
Chang She
2f1f9f6338 [python] improve restore functionality (#451)
Previously the temporary restore feature required copying data. The new
feature in pylance does not.

---------

Co-authored-by: Chang She <chang@lancedb.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
2023-08-24 11:00:34 -07:00
Lance Release
a34fa4df26 Updating package-lock.json 2023-08-24 05:23:19 +00:00
Lance Release
e20979b335 Updating package-lock.json 2023-08-24 04:48:11 +00:00
Lance Release
08689c345d Bump version: 0.2.3 → 0.2.4 2023-08-24 04:47:57 +00:00
Lance Release
909b7e90cd [python] Bump version: 0.2.0 → 0.2.1 2023-08-24 04:00:11 +00:00
QianZhu
ae8486cc8f bump lance version to 0.6.5 for lancedb release (#453) 2023-08-23 20:59:03 -07:00
Tevin Wang
b8f32d082f Clean up docs testing - exclude by glob instead of by file (#450) 2023-08-24 07:30:37 +05:30
Jai
ea7522baa5 fix url to image in docs (#444) 2023-08-22 16:21:02 -07:00
Lance Release
8764741116 Updating package-lock.json 2023-08-22 21:11:28 +00:00
Ayush Chaurasia
cc916389a6 [DOCS] Major Docs Revamp (#435) 2023-08-22 14:06:26 -07:00
Lance Release
3d7d903d88 Updating package-lock.json 2023-08-22 20:15:13 +00:00
Lance Release
cc5e2d3e10 Bump version: 0.2.2 → 0.2.3 2023-08-22 20:14:58 +00:00
Rob Meng
30f5bc5865 expose awsRegion to be configurable (#441) 2023-08-22 16:00:14 -04:00
gsilvestrin
2737315cb2 feat(node): Create empty tables / Arrow Tables (#399)
- Supports creating an empty table as long as an Arrow Schema is provided
- Supports creating a table from an Arrow Table (can be passed as data)
- Simplified some Arrow code in the TS/FFI side
- removed createTableArrow method, it was never documented / tested.
2023-08-22 10:57:45 -07:00
Rob Meng
d52422603c use a lambda function to hide the value of credentials when printing a connection/table (#438)
Previously when logging the `LocalConnection` and `LocalTable` classes,
we would expose the aws creds inside them. This PR changes the stored
creds to an anonymous function to hide the creds
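
The same trick can be sketched in Python, assuming a dataclass whose generated `repr` prints every field. `Connection` and `connect` are hypothetical names for illustration; the actual change is in the Node SDK classes named above.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Connection:
    uri: str
    credentials: Callable[[], Dict[str, str]]  # call it to read the secrets


def connect(uri, access_key, secret_key):
    creds = {"access_key": access_key, "secret_key": secret_key}
    # wrap the secrets in a closure so printing the object shows only a function
    return Connection(uri=uri, credentials=lambda: creds)


conn = connect("s3://bucket/db", "AKIA-EXAMPLE", "not-a-real-secret")
print(conn)                              # the repr shows a <function ...>, not the secret
print(conn.credentials()["access_key"])  # the value is still available on demand
```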
2023-08-21 23:06:44 -04:00
Ayush Chaurasia
f35f8e451f [DOCS] Update integrations + small typos (#432)
Depends on - https://github.com/lancedb/lancedb/pull/430

---------

Co-authored-by: Kevin Tse <NivekT@users.noreply.github.com>
2023-08-18 09:59:22 +05:30
Ayush Chaurasia
0b9924b432 Make creating (and adding to) tables via Iterators more flexible & intuitive (#430)
It improves the UX as iterators can be of any type supported by the
table (plus recordbatch) & there is no separate requirement.
Also expands the test cases for pydantic & arrow schema.
If this looks good I'll update the docs.

Example usage:
```
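import lancedb
import pandas as pd
import pyarrow as pa
from lancedb.pydantic import LanceModel, vector

class Content(LanceModel):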
    vector: vector(2)
    item: str
    price: float

def make_batches():
    for _ in range(5):
        yield from [ 
        # pandas
        pd.DataFrame({
            "vector": [[3.1, 4.1], [1, 1]],
            "item": ["foo", "bar"],
            "price": [10.0, 20.0],
        }),
        
        # pylist
        [
            {"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
            {"vector": [5.9, 26.5], "item": "bar", "price": 20.0},
        ],

        # recordbatch
        pa.RecordBatch.from_arrays(
            [
                pa.array([[3.1, 4.1], [5.9, 26.5]], pa.list_(pa.float32(), 2)),
                pa.array(["foo", "bar"]),
                pa.array([10.0, 20.0]),
            ], 
            ["vector", "item", "price"],
        ),

        # pydantic list
        [
            Content(vector=[3.1, 4.1], item="foo", price=10.0),
            Content(vector=[5.9, 26.5], item="bar", price=20.0),
        ]]

db = lancedb.connect("db")
tbl = db.create_table("tabley", make_batches(), schema=Content, mode="overwrite")

tbl.add(make_batches())
```
The same should work with an Arrow schema.

---------

Co-authored-by: Weston Pace <weston.pace@gmail.com>
2023-08-18 09:56:30 +05:30
Lance Release
ba416a571d Updating package-lock.json 2023-08-17 23:48:01 +00:00
Lance Release
13317ffb46 Updating package-lock.json 2023-08-17 23:07:51 +00:00
Lance Release
ca961567fe Bump version: 0.2.1 → 0.2.2 2023-08-17 23:07:36 +00:00
gsilvestrin
31a12a141d fix(node) Electron crashes when creating external buffer (#424) 2023-08-17 14:47:54 -07:00
Chang She
e3061d4cb4 [python] Temporary restore feature (#428)
This adds LanceTable.restore as a temporary feature. It reads data from
a previous version and creates
a new snapshot version using that data. This makes the version writeable
unlike checkout. This should be replaced once the feature is implemented
in pylance.

Co-authored-by: Chang She <chang@lancedb.com>
2023-08-14 20:10:29 -07:00
Lance Release
1fcc67fd2c Updating package-lock.json 2023-08-14 23:02:39 +00:00
Rob Meng
ac18812af0 fix moka version (#427) 2023-08-14 18:28:55 -04:00
Lance Release
8324e0f171 Bump version: 0.2.0 → 0.2.1 2023-08-14 22:22:24 +00:00
Rob Meng
f0bcb26f32 Upgrade lance and pass AWS creds when opening a table (#426) 2023-08-14 18:22:02 -04:00
Lance Release
b281c5255c Updating package-lock.json 2023-08-14 17:03:51 +00:00
Lance Release
d349d2a44a Updating package-lock.json 2023-08-14 16:06:52 +00:00
Lance Release
0699a6fa7b Bump version: 0.1.19 → 0.2.0 2023-08-14 16:06:36 +00:00
Lance Release
b1a5c251ba [python] Bump version: 0.1.16 → 0.2.0 2023-08-12 04:43:16 +00:00
Will Jones
722462c38b chore: upgrade Lance and rename score to _distance (#398)
BREAKING CHANGE: The `score` column has been renamed to `_distance` to
more accurately describe the semantics (smaller means closer / better).
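
A small sketch of why "smaller means better" for a distance column. It assumes plain Euclidean distance and a hypothetical `search` helper over dict rows; the metric actually used by Lance may differ.

```python
import math


def search(rows, query, k):
    """Attach a `_distance` column and return the k closest rows."""
    scored = [{**row, "_distance": math.dist(row["vector"], query)} for row in rows]
    # ascending sort: the smallest distance is the best match
    return sorted(scored, key=lambda r: r["_distance"])[:k]


rows = [
    {"item": "bar", "vector": [5.9, 26.5]},
    {"item": "foo", "vector": [3.1, 4.1]},
]
results = search(rows, query=[3.0, 4.0], k=2)
print([r["item"] for r in results])  # ['foo', 'bar']
```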

---------

Co-authored-by: Lei Xu <lei@lancedb.com>
2023-08-11 21:42:33 -07:00
Ashis Kumar Naik
902a402951 implementation of drop_database (#418)
Fixes #416.

Added a `drop_database()` method. This deletes all the tables from the
database with a single command.

---------

Signed-off-by: Ashis Kumar Naik <ashishami2002@gmail.com>
2023-08-11 20:59:56 -07:00
Rob Meng
2f2cb984d4 [breaking change] make schema a property (#414) 2023-08-11 18:58:41 -04:00
Lei Xu
9921b2a4e5 [Node] Use index by default (#422) 2023-08-11 15:26:44 -07:00
gsilvestrin
03b8f99dca feat(node) Remote drop table (#412) 2023-08-10 09:21:36 -07:00
Lei Xu
aa91f35a28 [Python][Remote] Raise meaningful exception for to_arrow() / to_pandas() (#413) 2023-08-08 14:40:09 -07:00
gsilvestrin
f227658e08 fix(node) Remove mpsc from JS SDK (#407)
- Callers / SDKs are responsible for keeping track of the last version of the Table
-  Remove the mpsc from Table and make all Table operations non-blocking
2023-08-08 10:35:43 -07:00
Rob Meng
fd65887d87 implement remote drop table call (#411)
Also moves `request_id` to header instead of request param
2023-08-08 13:24:16 -04:00
Weston Pace
4673958543 fix(docs) fix minor typo (#408) 2023-08-08 08:37:32 -07:00
Chang She
a54d1e5618 Automatically convert pydantic model (#400)
Saves users from having to explicitly call
`LanceModel.to_arrow_schema()` when creating an empty table.
See new docs for full details.

---------

Co-authored-by: Chang She <chang@lancedb.com>
2023-08-06 14:50:03 -07:00
Tevin Wang
8f7264f81d [Documentation Code Testing] temp fix for nodejs docs test hang (#404) 2023-08-06 13:13:35 -07:00
Ayush Chaurasia
44b8271fde [Docs] Allow edit suggestions and analytics (#394) 2023-08-06 22:53:35 +05:30
Ayush Chaurasia
74ef141b9c [Docs] add Tables guide (#381)
* Rename "Reference" -> "Guides" to create distinction b/w api reference
and user facing docs
* Add all the various ways to create, add and delete from table

Related - https://github.com/lancedb/lancedb/pull/391
2023-08-06 12:34:08 +05:30
gsilvestrin
b69b1e3ec8 fix(node) Unit tests hang and don't exit (#396) 2023-08-04 20:18:23 -07:00
Ayush Chaurasia
bbfadfe58d [python] Allow adding via iterators (#391)
Makes the following work so all the formats accepted by `create_table()`
are also accepted by `add()`
```
import lancedb
import pyarrow as pa

db = lancedb.connect("/tmp")

def make_batches():
    for i in range(5):
        yield pa.RecordBatch.from_arrays(
            [
                pa.array([[3.1, 4.1], [5.9, 26.5]]),
                pa.array(["foo", "bar"]),
                pa.array([10.0, 20.0]),
            ],
            ["vector", "item", "price"],
        )

schema = pa.schema([
    pa.field("vector", pa.list_(pa.float32())),
    pa.field("item", pa.utf8()),
    pa.field("price", pa.float32()),
])

tbl = db.create_table("table4", make_batches(), schema=schema)
tbl.add(make_batches())
```
2023-08-04 12:49:44 -07:00
Leon Yee
cf977866d8 [WIP] Workflow to trigger vectordb-recipes workflow (#371) 2023-08-02 11:27:08 -07:00
gsilvestrin
3ff3068a1e fix(node) Give preference to local index.node lib (#393) 2023-08-01 15:29:15 -07:00
gsilvestrin
593b5939be feat(node): Improve concurrency (#376)
- Moved computation out of JS main thread by using a mpsc
- Removes the Arc/Mutex since Table is owned by JsTable now
- Moved table / query methods to their own files 
- Fixed js-transformers example
2023-08-01 14:22:04 -07:00
Lei Xu
f0e1290ae6 Restrict semver version to 3.0 (#389) 2023-07-31 22:26:24 -07:00
Chang She
4b45128bd6 add LanceModel to docs (#386)
Co-authored-by: Chang She <chang@lancedb.com>
2023-07-31 15:12:02 -04:00
Lance Release
b06e214d29 [python] Bump version: 0.1.15 → 0.1.16 2023-07-31 18:32:40 +00:00
Chang She
c1f8feb6ed make pandas an optional dependency in lancedb as well (#385) 2023-07-31 14:08:58 -04:00
Chang She
cada35d5b7 Improve pydantic integration (#384) 2023-07-31 12:16:44 -04:00
Chang She
2d25c263e9 Implement drop table if exists (#383) 2023-07-31 10:25:09 +02:00
gsilvestrin
bcd7f66dc7 fix(node): Handle overflows in the node bridge (#372)
- Fixes many numeric conversions that result in hard-to-reproduce issues
- JsObjectExt extends JsObject with safe methods to extract numeric values
2023-07-28 13:15:21 -07:00
gsilvestrin
1daecac648 fix(python): Pin pylance and add pandas as test dependency (#373) 2023-07-27 15:21:45 -07:00
Lance Release
b8e656b2a7 Updating package-lock.json 2023-07-27 21:53:30 +00:00
Lance Release
ff7c1193a7 Updating package-lock.json 2023-07-27 21:06:32 +00:00
Lance Release
6d70e7c29b Bump version: 0.1.18 → 0.1.19 2023-07-27 21:06:17 +00:00
gsilvestrin
73cc12ecc5 fix(node): Relax EmbeddingFunction type guard (#370) 2023-07-27 12:51:59 -07:00
gsilvestrin
6036cf48a7 fix(node) Replace panic errors with friendlier ones (#366)
- Implement Result/Error in the node FFI
- Implement a trait (ResultExt) to make error handling less verbose
- Refactor some parts of the code that touch arrow into arrow.rs
2023-07-26 13:44:58 -07:00
Ayush Chaurasia
15f4787cc8 [Docs]: Add badges, CTA and updates examples (#358)
<img width="1054" alt="Screenshot 2023-07-24 at 6 13 00 PM"
src="https://github.com/lancedb/lancedb/assets/15766192/a263a17e-66d0-4591-adc7-b520aa5b23f6">
Is this a problem? Are we using metadata to track usage or something?
2023-07-26 16:35:46 +05:30
Lance Release
0e4050e706 [python] Bump version: 0.1.14 → 0.1.15 2023-07-25 18:58:44 +00:00
Rob Meng
147796ffcd bump lance version for vectordb, fix minor bugs in lancedb remote client (#365) 2023-07-24 21:30:57 -04:00
Lance Release
6fd465ceef Updating package-lock.json 2023-07-24 20:02:35 +00:00
Lance Release
e2e5a0fb83 Updating package-lock.json 2023-07-24 19:27:32 +00:00
Lance Release
ff8d5a6d51 Bump version: 0.1.17 → 0.1.18 2023-07-24 19:27:17 +00:00
Will Jones
8829988ada ci: build node in manylinux docker container (#350)
Closes #359

TODO:
 * [x] test in a sample of Linux distro docker containers
2023-07-24 11:31:47 -07:00
gsilvestrin
80a32be121 bugfix(node): make WriteMode optional when specifying embeddings (#336) 2023-07-24 11:26:43 -07:00
Rob Meng
8325979bb8 dont print apikey in remote client toString, add hostoverride to python client (#353) 2023-07-23 18:44:00 -04:00
lindt
ed5ff5a482 [docs] typo fix (#352)
Co-authored-by: Stefan Rohe <think@eduroam152-169.nbk.vse.cz>
2023-07-22 11:18:58 +02:00
Lance Release
2c9371dcc4 Updating package-lock.json 2023-07-21 23:18:22 +00:00
Lance Release
6d5621da4a Updating package-lock.json 2023-07-21 22:39:21 +00:00
Lance Release
380c1572f3 Bump version: 0.1.16 → 0.1.17 2023-07-21 22:39:06 +00:00
gsilvestrin
4383848d53 feat(node): Add Linux ARM build (#348) 2023-07-21 15:33:02 -07:00
gsilvestrin
473c43860c bugfix: Set Github token when pushing changes (#351) 2023-07-21 15:31:44 -07:00
gsilvestrin
17cf244e53 Updating package-lock.json (#347) 2023-07-20 14:44:10 -07:00
Leon Yee
0b60694df4 [docs] typo fix (#346) 2023-07-20 14:33:56 -07:00
Lance Release
600da476e8 Updating package-lock.json 2023-07-20 20:24:54 +00:00
Lance Release
458217783c Bump version: 0.1.15 → 0.1.16 2023-07-20 20:24:37 +00:00
gsilvestrin
21b1a71a6b bugfix(node): Don't persist credentials on make-release-commit.yml (#345) 2023-07-20 13:24:06 -07:00
gsilvestrin
2d899675e8 bugfix(node): Make release task can't push to repo (#344) 2023-07-20 13:15:29 -07:00
Lance Release
1cbfc1bbf4 [python] Bump version: 0.1.13 → 0.1.14 2023-07-20 20:06:15 +00:00
gsilvestrin
a2bb497135 feat(node) Move native packages to @lancedb NPM org (#341)
- Move native packages to @lancedb org
- Move package-lock.json update to a reusable action and create a target to run it manually.
2023-07-20 12:54:39 -07:00
Will Jones
0cf40c8da3 fix: only use util function to build filesystem (#339) 2023-07-20 10:41:50 -07:00
Rob Meng
8233c689c3 fix remote SDK (#342) 2023-07-20 02:01:13 -04:00
gsilvestrin
6e24e731b8 Updating package-lock.json (#338) 2023-07-18 21:10:18 -07:00
Lance Release
f4ce86e12c [python] Bump version: 0.1.12 → 0.1.13 2023-07-19 03:09:50 +00:00
Lance Release
0664eaec82 Bump version: 0.1.14 → 0.1.15 2023-07-19 02:54:10 +00:00
Lei Xu
63acdc2069 [Python] Support pydantic v1 as well (#337)
Support both Pydantic v1 and v2 (breaking changes)
2023-07-18 19:53:09 -07:00
Rob Meng
a636bb1075 add support for host override (#335) 2023-07-18 21:21:39 -04:00
Lance Release
5e3167da83 [python] Bump version: 0.1.11 → 0.1.12 2023-07-19 01:18:28 +00:00
Lei Xu
f09db4a6d6 [Python] Do not return Table count for every add operation (#328)
`Table::count()` will be linearly slower with more fragments ingested.
2023-07-18 17:11:17 -07:00
Lei Xu
1d343edbd4 [Node] implement remote db.TableNames (#334) 2023-07-18 16:56:47 -07:00
Lei Xu
980f910f50 [Node] initial support of nodejs remote sdk (#333) 2023-07-18 16:15:27 -07:00
Will Jones
fb97b03a51 feat: pass AWS_ENDPOINT environment variable down (#330)
Tested locally against minio.
2023-07-18 15:07:26 -07:00
Lei Xu
141b6647a8 [Python] Fix bumpversion.cfg (#327) 2023-07-18 09:18:14 -07:00
gsilvestrin
b45ac4608f feat(node): Explicitly set registry url when publishing package (#324) 2023-07-18 08:55:56 -07:00
Lei Xu
a86bc05131 [Bug] Fix dataset path check in Table::open (#326)
Fixed a bug that prevented opening remote tables.
2023-07-18 08:45:10 -07:00
Will Jones
3537afb2c3 docs: show how to delete rows in user guide (#309)
Closes #265
2023-07-18 08:19:48 -07:00
Lei Xu
23f5dddc7c [Rust] Checkout a version of dataset. (#321)
* `Table::open()` now opens from an absolute path, moving the responsibility
of organizing metadata out of the Table object
* Fix Clippy warnings
* Add `Table::checkout(version)` API
2023-07-17 17:29:58 -07:00
gsilvestrin
9748406cba Updating package-lock.json (#322) 2023-07-17 16:48:22 -07:00
gsilvestrin
6271949d38 feat(node): Update package-lock.json on each release (#302) 2023-07-17 16:33:43 -07:00
Lance Release
131ad09ab3 Bump version: 0.1.13 → 0.1.14 2023-07-17 20:06:58 +00:00
Lei Xu
030f07e7f0 Bump minimal lance version to 0.5.8 (#318) 2023-07-17 12:41:29 -07:00
gsilvestrin
72afa06b7a feat(node): Add Windows support (#294) 2023-07-17 08:48:24 -07:00
Lei Xu
088e745e1d [Python] Create table with Iterator[RecordBatch] and add docs (#316) 2023-07-16 21:45:55 -07:00
Lei Xu
7a57cddb2c [Python] Add records to remote (#315) 2023-07-16 13:24:38 -07:00
Lei Xu
8ff5f88916 [Python] Bug fixes in remote API (#314) 2023-07-16 11:09:19 -07:00
Lei Xu
028a6e433d [Python] Get table schema (#313) 2023-07-15 17:39:37 -07:00
Lei Xu
04c6814fb1 [Rust] Expose Table schema and version in Rust (#312) 2023-07-14 22:01:23 -07:00
Lei Xu
c62e4ca1eb Bump lance version to 0.5.7 (#311) 2023-07-14 17:17:31 -07:00
gsilvestrin
aecc5fc42b feat(node): Fix npm publish task (#298) 2023-07-14 13:39:15 -07:00
Chang She
2fdcb307eb [python] Fix a few minor bugs (#304) 2023-07-15 03:47:42 +08:00
Tevin Wang
ad18826579 [Documentation Code Testing] build node sdk in release (#307) 2023-07-14 12:46:48 -07:00
Leon Yee
a8a50591d7 [docs] small fixes (#308)
Closes #288 and #287
2023-07-14 12:46:31 -07:00
gsilvestrin
6dfe7fabc2 pin half (#310) 2023-07-14 12:45:05 -07:00
gsilvestrin
2b108e1c80 Updating package-lock.json file (#301) 2023-07-13 17:50:01 -07:00
Lei Xu
8c9edafccc [Doc] Add more Python integrations documents (#299) 2023-07-13 17:09:39 -07:00
Leon Yee
0590413b96 Added transformersJS example to docs and node/examples (#297) 2023-07-13 17:01:36 -07:00
Lance Release
bd2d40a927 Bump version: 0.1.12 → 0.1.13 2023-07-13 21:17:35 +00:00
Lei Xu
08944bf4fd [Python] Convert Pydantic Model to Arrow Schema (#291)
Provide utility to automatically convert Pydantic model to Arrow Schema

Closes #256
2023-07-13 11:16:37 -07:00
gsilvestrin
826dc90151 feat(node): add option object to connect method (#286) 2023-07-13 11:03:48 -07:00
Lei Xu
08cc483ec9 [Doc] Describe the difference between ANN and KNN, and how to create indices. (#293) 2023-07-13 08:52:58 -07:00
Lei Xu
ff1d206182 [Doc] Split the python integration into different topics (#292) 2023-07-12 21:26:59 -07:00
gsilvestrin
c385c55629 feat(node): pull node binaries into separate packages (3) (#285) 2023-07-12 16:52:04 -07:00
Lance Release
0a03f7ca5a Bump version: 0.1.11 → 0.1.12 2023-07-12 04:20:34 +00:00
Rob Meng
88be978e87 allow logging in JS (#283)
tested with `RUST_LOG=info npm test`
2023-07-11 22:50:36 -04:00
Rob Meng
98b12caa06 export create table with aws credentials (#282) 2023-07-11 17:21:10 -04:00
Lance Release
091dffb171 Bump version: 0.1.10 → 0.1.11 2023-07-11 20:42:15 +00:00
Rob Meng
ace6aa883a Upgrade lance to 0.5.5, and plumb thru new features from the upgrade (#279)
* upgrade
* fixes for the upgrade
* allow JS users to pass custom AWS credentials
2023-07-11 16:33:39 -04:00
Tevin Wang
80c25f9896 [Docs] uncomment cosine metric (#271)
- Change k value to `10` for js search to keep it consistent with python
docs
- Uncomment now that cosine metric is fixed in lance:
https://github.com/lancedb/lance/pull/1035
2023-07-11 12:30:11 -07:00
gsilvestrin
caf22fdb71 Run rust tests when Cargo.toml changes (#276) 2023-07-11 11:19:06 -07:00
Lei Xu
0e7ae5dfbf [Python] Fix list type conversion to JSON and temporal types (#274) 2023-07-11 11:05:51 -07:00
gsilvestrin
b261e27222 Pin lance version (#275)
we shouldn't auto-upgrade lance
2023-07-11 10:58:15 -07:00
Lei Xu
9f603f73a9 [Python] Schema to JSON (#272) 2023-07-10 18:11:24 -07:00
Lei Xu
9ef846929b [Python] List tables from remote service (#262) 2023-07-09 23:58:03 -07:00
Lei Xu
97364a2514 Bump to v0.1.10-python 2023-07-09 21:52:11 -07:00
Lei Xu
e6c6da6104 [Python] Initial support of cloud API (#260)
Support connect with remote database, and implement Search API
2023-07-07 15:41:15 -07:00
Leon Yee
a5eb665b7d [docs] dynamic docs generation and deployment (#253)
Solves #245 , edited docs.yml to run the generation of docs before
deployment. Tested on a test repository
2023-07-06 21:10:36 -07:00
Chang She
e2325c634b Allow creation of an empty table (#254)
It's inconvenient to always require data at table creation time.
Here we enable you to create an empty table and add data and set schema
later.

---------

Co-authored-by: Chang She <chang@lancedb.com>
2023-07-06 20:44:58 -07:00
Chang She
507eeae9c8 Set default to error instead of drop (#259)
when encountering bad input data, we can default to principle of least
surprise and raise an exception.

Co-authored-by: Chang She <chang@lancedb.com>
2023-07-05 22:44:18 -07:00
Lance Release
bb3df62dce Bump version: 0.1.9 → 0.1.10 2023-07-06 03:05:32 +00:00
Lei Xu
dc7146b2cb [Node] Expose IVF PQ config (#258) 2023-07-05 19:54:21 -07:00
Lei Xu
d701947f0b [Rust] Re-export WriteMode from lancedb instead of lance (#257)
`Table::add(.., mode: WriteMode)`, which is a public API, currently uses
the WriteMode exported from `lance`. Re-export it to lancedb so that the
pub API looks better.
2023-07-05 18:20:31 -07:00
Chang She
3c46d7f268 Handle NaN input data (#241)
Sometimes LangChain would insert a single `[np.nan]` as a placeholder if
the embedding function failed. This causes a problem for Lance format
because then the array can't be stored as a FixedSizedListArray.

Instead:
1. By default we remove rows with embedding lengths less than the
maximum length in the batch
2. If `strict=True` kwargs is set to True, then a `ValueError` is raised
if the embeddings aren't all the same length
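
The two behaviours listed above can be sketched in plain Python. This is only an illustration: the helper name `filter_ragged_embeddings` and its signature are hypothetical and are not taken from the LanceDB codebase.

```python
from typing import List, Sequence


def filter_ragged_embeddings(
    vectors: Sequence[Sequence[float]], strict: bool = False
) -> List[List[float]]:
    """Keep only vectors whose length equals the longest one in the batch.

    Hypothetical helper: with strict=True a length mismatch raises
    ValueError instead of silently dropping the short rows.
    """
    if not vectors:
        return []
    max_len = max(len(v) for v in vectors)
    if strict and any(len(v) != max_len for v in vectors):
        raise ValueError("all embeddings must have the same length")
    return [list(v) for v in vectors if len(v) == max_len]


# A single-element NaN placeholder stands in for a failed embedding call.
batch = [[0.1, 0.2, 0.3], [float("nan")], [0.4, 0.5, 0.6]]
kept = filter_ragged_embeddings(batch)
```

With the default `strict=False` the placeholder row is dropped; passing `strict=True` on the same batch would raise instead.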

---------

Co-authored-by: Chang She <chang@lancedb.com>
2023-07-04 20:00:46 -07:00
Leon Yee
9600a38ff0 [docs] fixed javascript docs for overloaded functions (#247)
Solves #244 :


![image](https://github.com/lancedb/lancedb/assets/43097991/d1fd9d2a-0d6a-4c16-a0ab-f460cc709349)

Problem was function overloading in the interface caused some weird
`typedoc` formatting, so breaking it apart into methods fixed the issue.

Also regenerated and updated javascript docs

---------

Co-authored-by: Tevin Wang <tevin@cmu.edu>
2023-07-04 13:07:34 -07:00
Lei Xu
148ed82607 Bump Lance version to 0.5.3 (#250) 2023-07-04 08:34:41 -07:00
Lei Xu
fc725c99f0 [Node] Create Table with WriteMode (#246)
Support `createTable(name, data, mode?)`  to be consistent with Python.

Closes #242
2023-07-03 17:04:21 -07:00
Rob Meng
a6bdffd75b bump lance to 0.5.2, make object store construction hook public (#237)
* bump to 0.5.2 to pick up S3 auth fixes
* make `open_table_params` a public attribute
* add `open_table_with_params` on `Database`
2023-06-29 18:50:02 -04:00
Lei Xu
051c03c3c9 Add dot product support (#239)
Closes #207
2023-06-29 10:32:01 -07:00
Tevin Wang
39479dcf8e fix sha error in npm (#236)
Currently getting a `npm ERR! code EINTEGRITY` on merge, need to fix
asap.


https://stackoverflow.com/questions/75905223/github-action-npm-install-give-code-eintegrity-integrity-checksum-failed
2023-06-29 09:31:23 -07:00
Tevin Wang
b731a6aed9 Add docs code testing & documentation syntax changes (#196)
- Creates testing files `md_testing.py` and `md_testing.js` for testing
python and nodejs code in markdown files in the documentation
This listens for HTML tags as well: `<!--[language] code code
code...-->` will create a set-up file to create some mock tables or to
fulfill some assumptions in the documentation.
- Creates a github action workflow that triggers every push/pr to
`docs/**`
- Modifies documentation so tests run (mostly indentation, some small
syntax errors and some missing imports)
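
As a rough illustration of the idea behind testing code embedded in markdown, the sketch below pulls fenced blocks of one language out of a document so they can be run. It is not the contents of `md_testing.py`: the function name `extract_code` and the regular expression are assumptions, and the HTML-comment set-up mechanism described above is not handled.

```python
import re

# The fence marker is built indirectly so this example can itself sit in a fence.
FENCE = "`" * 3

# Capture the language tag and the body of each fenced block.
PATTERN = re.compile(FENCE + r"(\w+)\n(.*?)" + FENCE, re.DOTALL)


def extract_code(markdown: str, language: str) -> list:
    """Return the bodies of all fenced blocks tagged with `language`."""
    return [body for lang, body in PATTERN.findall(markdown) if lang == language]


doc = "\n".join(
    [
        "# Example",
        FENCE + "python",
        "print('hello')",
        FENCE,
        FENCE + "javascript",
        "console.log('hi')",
        FENCE,
    ]
)
snippets = extract_code(doc, "python")
```

Each extracted snippet could then be written to a temporary file and executed by the matching interpreter.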

A list of excluded files that we need to take a closer look at later on:
```javascript
const excludedFiles = [
  "../src/fts.md",
  "../src/embedding.md",
  "../src/examples/serverless_lancedb_with_s3_and_lambda.md",
  "../src/examples/serverless_qa_bot_with_modal_and_langchain.md",
  "../src/examples/youtube_transcript_bot_with_nodejs.md",
];
```
Many of them can't be done because we need the OpenAI API key :(.
`fts.md` has some issues with the library, I believe this is still
experimental?

Closes #170

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2023-06-28 11:07:26 -07:00
Rob Meng
0f58bd7af2 allow passing ReadParams to dataset when opening a table (#234)
Plumb thru object store construction hook from
[lance/pull/1014](https://github.com/lancedb/lance/pull/1014)
2023-06-28 11:20:09 -04:00
Rob Meng
01abf82808 Refactor TS client to use interface + implementation pattern (#226)
## What?
* Changed `Connection` and `Table` to interfaces
* Renamed original `Connection` and `Table` to `LocalConnection` and
`LocalTable`
2023-06-27 21:45:01 -04:00
Leon Yee
eb5bcda337 Error implementations (#232)
Solves #216 by adding a check on table open for existence of the
`.lance` file. Does not check for it for remote connections.
2023-06-27 16:48:31 -07:00
Lei Xu
4bc676e26a [Python] Support replace during create_index (#233)
Closes #214
2023-06-27 16:02:07 -07:00
Lei Xu
c68c236f17 [Js] Create index with replace flag (#229) 2023-06-26 18:38:20 -07:00
Philip Kung
313e66c4c5 Specify and Index Column for Vector Search (#217) 2023-06-26 16:11:08 -07:00
Lei Xu
e850df56f1 fix requirements 2023-06-26 12:25:29 -07:00
Lei Xu
8c5507075c Sql filter document (#228) 2023-06-26 12:22:22 -07:00
Will Jones
0e4c52b8a6 bump python module version 2023-06-26 11:25:39 -07:00
Lance Release
c8bebf4776 Bump version: 0.1.8 → 0.1.9 2023-06-26 18:12:38 +00:00
Lei Xu
c14ad91df0 [Node] drop table api (#221)
Provide `drop_table` in rust and node. Closes #86
2023-06-23 19:58:37 -07:00
Will Jones
ad48242ffb feat: support for deletion (#219)
Also upgrades Arrow and Lance.
2023-06-23 18:09:07 -07:00
Leon Yee
1a9a392e20 [docs] CTA for discord + twitter (#218)
![image](https://github.com/lancedb/lancedb/assets/43097991/33eb893c-3baf-4166-8291-47d2f4bde23a)

Includes discord and twitter links in documentation

[#1001](https://github.com/lancedb/sophon/issues/1001)
2023-06-22 16:52:34 -07:00
Ayush Chaurasia
b489edc576 Add favicon in docs (#209) 2023-06-19 20:30:46 -07:00
gsilvestrin
8708fde3ef Revert "feat(node): pull node binaries into separate packages (2) (#1… (#206)
…97)"

This reverts commit 0724d41c4b.
2023-06-16 18:15:49 -07:00
Lance Release
cc7e54298b Bump version: 0.1.7 → 0.1.8 2023-06-17 00:33:53 +00:00
Rob Meng
d1e8a97a2a isort entire repo (#200) 2023-06-15 20:12:10 -04:00
Lance Release
01dadb0862 Bump version: 0.1.6 → 0.1.7 2023-06-15 23:30:01 +00:00
gsilvestrin
0724d41c4b feat(node): pull node binaries into separate packages (2) (#197)
* Refactors the Node module to load the shared library from a separate
package. When a user does `npm install vectordb`, the correct optional
dependency is automatically downloaded by npm.
* Add scripts and instructions to build Linux and MacOS node artifacts
locally.
* Add instructions for publishing the npm module and crates.

Co-authored-by: Will Jones <willjones127@gmail.com>
2023-06-15 16:15:42 -07:00
Rob Meng
cbb56e25ab port remote connection client into lancedb (#194)
* to_df() is now async; added `to_df_blocking` for convenience
* add remote lancedb client to public lancedb
* make lancedb connection class understand url scheme
`lancedb+<connection_type>://<host>:<port>`.
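
A URL of the form above can be split into its parts with the standard library. This is a minimal sketch, not the client's actual parsing code: the function name `parse_lancedb_url`, the `http` connection type, and the host and port in the example are all assumptions chosen for illustration.

```python
from urllib.parse import urlparse


def parse_lancedb_url(url: str) -> dict:
    """Split a `lancedb+<connection_type>://<host>:<port>` URL into parts."""
    parsed = urlparse(url)
    # The scheme carries both the fixed prefix and the connection type.
    scheme, sep, connection_type = parsed.scheme.partition("+")
    if scheme != "lancedb" or not sep or not connection_type:
        raise ValueError(f"not a lancedb+<connection_type> URL: {url!r}")
    return {
        "connection_type": connection_type,
        "host": parsed.hostname,
        "port": parsed.port,
    }


# Hypothetical endpoint, used only to exercise the parser.
parts = parse_lancedb_url("lancedb+http://localhost:10024")
```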
2023-06-15 18:57:52 -04:00
gsilvestrin
78de8f5782 feat(node): add Table.countRows() (#185) 2023-06-15 14:35:54 -07:00
Lance Release
a6544c2a31 Bump version: 0.1.5 → 0.1.6 2023-06-15 16:16:03 +00:00
Leon Yee
39ed70896a [rust] added rust.yml for /rust directory (#193) 2023-06-14 11:46:08 -07:00
gsilvestrin
ae672df1b7 feat(rust): add action to publish release to crates.io (#192) 2023-06-14 11:01:22 -07:00
gsilvestrin
15c3f42387 feat(node): add action to tag node / rust releases (#186) 2023-06-14 11:01:02 -07:00
gsilvestrin
f65d85efcc feat(node): add where method to query builder (#183)
Closes #181
2023-06-14 10:54:43 -07:00
Utkarsh Gautam
6b5c046c3b [Python] Updated to_df implementation in Contextualizer class (#174)
Changes include:
- Contexts of sizes less than window param to be included as well
- Added optional threshold parameter to to_df in Contextualizer 
This should close #165 
- If maintainers are satisfied with the implementation will add more
examples and test cases and update the documentations as well.

---------

Co-authored-by: Nithin PS <47279496+Nithinps021@users.noreply.github.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
2023-06-14 09:22:32 -07:00
Lei Xu
d00f4e51d0 Fix node ffi build (#191) 2023-06-13 19:31:29 -07:00
Benjamin Manns
fbc44d4243 Fix small typo in ann_indexes.md (#190) 2023-06-13 17:43:18 -07:00
Lei Xu
b53eee42ce Upgrade to lance 0.4.21 (#187) 2023-06-13 15:39:44 -07:00
Utkarsh Gautam
7e0d6088ca [docs] Fixed langchain example broken link in index.md (#184) 2023-06-13 12:40:39 -07:00
Lance Release
5210f40a33 [python] Bump version: 0.1.7 → 0.1.8 2023-06-12 22:06:59 +00:00
gsilvestrin
5ec4a5d730 feat(python): add action to build and publish wheel (#179) 2023-06-12 14:54:54 -07:00
gsilvestrin
e4f64fca7b Bump pylance 0.4.17 -> 0.4.20 (#173) 2023-06-12 14:54:20 -07:00
Lance Release
4744640bd2 [python] Bump version: 0.1.6 → 0.1.7 2023-06-12 21:39:16 +00:00
gsilvestrin
094b5e643c bugfix(python) Make release action has invalid name (#180) 2023-06-12 14:24:15 -07:00
gsilvestrin
a318778d2a feat(python): add action to tag python releases (#172) 2023-06-12 13:59:08 -07:00
Tevin Wang
9b83ce3d2a add black to python CI (#178)
Closes #48
2023-06-12 11:22:34 -07:00
Nithin PS
7bad676f30 [Python] Fix Contextualizer validation to arguments (#168)
Closes #164

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2023-06-12 09:20:09 -07:00
gsilvestrin
0e981e782b [nodejs] bumping version to 0.1.5 (#171) 2023-06-09 12:33:17 -07:00
Utkarsh Gautam
e18cdfc7cf [docs] Fixed Minor typo in embedding.md (#167)
Added missing tab to python snippet
2023-06-08 22:01:51 -07:00
Will Jones
fed33a51d5 wip: make the python API reference a bit nicer (#162)
Adds:

* Make `mkdocstrings` aware we are using numpy-style docstrings
* Fixes broken link on `index.md` to Python API docs (and added link to
node ones)
* Added examples to various classes.
* Added doctest to verify examples work.
2023-06-08 16:07:06 -07:00
Jai
a56b65db84 rename examples for slugs (#159) 2023-06-07 16:44:54 -07:00
gsilvestrin
f21caebeda Update links in README.md (#161)
Current one 404s
2023-06-07 13:16:00 -07:00
gsilvestrin
12da77a9f7 [doc] removed index creation from quickstart (#160) 2023-06-07 09:29:38 -07:00
gsilvestrin
131b2dc57b [nodejs] Added completed youtube transcript example / docs (#156) 2023-06-06 16:26:21 -07:00
Chang She
3798f56a9b bump version for v0.1.6-python 2023-06-05 18:20:15 -07:00
Chang She
50cdb16b45 Better handle empty results from tantivy (#155)
Closes #154

---------

Co-authored-by: Chang She <chang@lancedb.com>
2023-06-05 18:18:14 -07:00
gsilvestrin
d803482588 [nodejs] bumping version to 0.1.4 (#147) 2023-06-03 13:59:58 -07:00
gsilvestrin
f37994b72a [nodejs] deprecated created_index in favor of createIndex. (#145) 2023-06-03 11:05:35 -07:00
gsilvestrin
2418de0a3c [nodejs] add npm clean task (#146) 2023-06-03 11:05:02 -07:00
gsilvestrin
d0c47e3838 added projection api for nodejs (#140) 2023-06-03 10:34:08 -07:00
Jai
41cca31f48 Modal example using LangChain (#143) 2023-06-03 06:08:31 -07:00
Jai
b621009d39 add multimodal gif, add copy about fts, sql (#144) 2023-06-02 22:25:33 -07:00
Jai
6a9cde22de Update broken doc links to refer to new directory and include gallery app for multimodal search (#142)
closes #121 
adds new multimodal example to gallery app
2023-06-02 21:27:26 -07:00
Chang She
bfa90b35ee add code snippet for each example (#141)
<img width="1937" alt="image"
src="https://github.com/lancedb/lancedb/assets/759245/4ee52e4a-5955-47c2-9ffe-84d1bc0062ff">

---------

Co-authored-by: Chang She <chang@lancedb.com>
2023-06-02 21:27:02 -07:00
gsilvestrin
12ec29f55b Adding nodejs CHANGELOG.md (#132) 2023-06-02 18:27:53 -07:00
Lei Xu
cdd08ef35c [Doc] Metrics types. (#135)
Closes #129
2023-06-02 17:18:01 -07:00
Jai
adcb2a1387 Update mkdocs.yml (#138) 2023-06-02 17:13:32 -07:00
Jai
9d52a32668 Minor patch to docs (#136) 2023-06-02 16:26:03 -07:00
Jai
11b2e63eea fix index docs (#134) 2023-06-02 16:16:34 -07:00
Jai
daedf1396b update references to end to end examples, use s3 for langchain exampl… (#133) 2023-06-02 16:08:56 -07:00
Jai
8af5f19cc1 js docs, modal example, doc notebook integration, update doc styles (#131) 2023-06-02 15:24:16 -07:00
Chang She
fbd0bc7740 bump version for v0.1.5-python 2023-06-02 09:18:26 -07:00
gsilvestrin
f765a453cf Use fsspec to implement table_names with cloud storage support (#117)
Co-authored-by: Will Jones <willjones127@gmail.com>
2023-06-01 16:56:26 -07:00
gsilvestrin
45b3a14f26 Bumping vectordb to v0.1.3 (#124) 2023-06-01 16:36:11 -07:00
Lei Xu
9965b4564d [Python] Support drop table (#123)
Closes #86
2023-06-01 15:58:45 -07:00
gsilvestrin
0719e4b3fb Revert "refactor: pull node binaries into separate packages (#88)" (#122)
This reverts commit e50b642d80.
2023-06-01 13:53:07 -07:00
Jai
091fb9b665 add existence check (#112) 2023-06-01 11:45:26 -07:00
Chang She
03013a4434 Multimodal search demo (#118)
Slow roasted over 12 hours, Pairs well with #111

---------

Co-authored-by: Chang She <chang@lancedb.com>
2023-06-01 10:34:08 -07:00
gsilvestrin
3e14b357e7 add openai embedding function to nodejs client (#107)
- openai is an optional dependency for lancedb
- added an example to show how to use it
2023-06-01 10:25:00 -07:00
Lei Xu
99cbda8b07 Generate diffusiondb embeddings (#111) 2023-06-01 10:23:29 -07:00
Will Jones
e50b642d80 refactor: pull node binaries into separate packages (#88)
Changes:

* Refactors the Node module to load the shared library from a separate
package. When a user does `npm install vectordb`, the correct optional
dependency is automatically downloaded by npm.
* Brings Rust and Node versions in alignment at 0.1.2.
* Add scripts and instructions to build Linux and MacOS node artifacts
locally.
* Add instructions for publishing the npm module and crates.
2023-06-01 09:17:19 -07:00
gsilvestrin
6d8cf52e01 Better error granularity for table operations (#113) 2023-06-01 09:04:42 -07:00
Akash
53f3882d6e Fixed documentation link for the Youtube Transcripts Jupyter Notebook (#105)
Changed the link to the Youtube Transcripts jupyter notebook path on the
documentation.

Previously it went inside docs/notebooks (which does not exist). I've
modified it to go inside the notebooks folder instead.
2023-06-01 09:00:40 -07:00
Chang She
2b26775ed1 python v0.1.4 2023-05-31 20:11:25 -07:00
Lei Xu
306ada5cb8 Support S3 and GCS from typescript SDK (#106) 2023-05-30 21:32:17 -07:00
gsilvestrin
d3aa8bfbc5 add embedding functions to the nodejs client (#95) 2023-05-26 18:09:20 -07:00
Chang She
04d97347d7 move tantivy-py installation to be separate from wheel (#97)
pypi does not allow packages to be uploaded that have a direct reference

for now we'll just ask the user to install tantivy separately

---------

Co-authored-by: Chang She <chang@lancedb.com>
2023-05-25 17:57:26 -06:00
Chang She
22aa8a93c2 bump version for v0.1.3 2023-05-25 17:01:52 -06:00
Chang She
f485378ea4 Basic full text search capabilities (#62)
This is v1 of integrating full text search index into LanceDB.

# API
The query API is roughly the same as before, except if the input is text
instead of a vector we assume that it's an FTS search.

## Example
If `table` is a LanceDB LanceTable, then:

Build index: `table.create_fts_index("text")`

Query: `df = table.search("puppy").limit(10).select(["text"]).to_df()`

# Implementation
Here we use the tantivy-py package to build the index. We use the
row ids as the full-text-search index's doc ids, then do a Take
operation to fetch the matching rows.

# Limitations

1. don't support incremental row appends yet. New data won't show up in
search
2. local filesystem only 
3. requires building tantivy explicitly
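
The implementation described above (index text, get back row ids, then take those rows) can be sketched with a toy inverted index. Note the swap: a plain dictionary stands in for tantivy here, and a Python list stands in for the table, so the names `build_index` and `search` are illustrative only.

```python
from collections import defaultdict
from typing import Dict, List, Set

rows = [
    {"text": "a puppy plays in the park"},
    {"text": "kittens sleep all day"},
    {"text": "the puppy chases a ball"},
]


def build_index(rows: List[dict], column: str) -> Dict[str, Set[int]]:
    """Map each lowercase token to the set of row ids that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for row_id, row in enumerate(rows):
        for token in row[column].lower().split():
            index[token].add(row_id)
    return index


def search(index: Dict[str, Set[int]], rows: List[dict], query: str,
           limit: int = 10) -> List[dict]:
    """Look up the matching row ids, then fetch ("take") those rows."""
    row_ids = sorted(index.get(query.lower(), set()))[:limit]
    return [rows[i] for i in row_ids]


index = build_index(rows, "text")
hits = search(index, rows, "puppy")
```

Because the index is built once from a snapshot of the rows, rows appended later would not be found, which mirrors the first limitation listed above.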

---------

Co-authored-by: Chang She <chang@lancedb.com>
2023-05-24 22:25:31 -06:00
gsilvestrin
f923cfe47f add create index to nodejs client (#89) 2023-05-24 16:45:58 -06:00
gsilvestrin
06cb7b6458 add query params to nodejs client (#87) 2023-05-24 15:48:31 -06:00
gsilvestrin
bdef634954 bugfix: string columns should be converted to Utf8Array (#94) 2023-05-23 14:58:49 -07:00
Will Jones
aac2ffa4b3 Lint and test vectordb node in CI (#92)
Closes #90.
2023-05-22 14:26:06 -07:00
gsilvestrin
e28fe7b468 nodejs append records api (#85) 2023-05-18 15:13:57 -07:00
gsilvestrin
61b9479bd9 JavaScript client initial linux support (#84) 2023-05-16 17:04:06 -07:00
gsilvestrin
961d892c89 Added TypeScript example (#82) 2023-05-16 13:40:52 -07:00
Jai
0b35e6dfa9 node quickstart (#83) 2023-05-16 09:53:04 -07:00
Jai
ca96fc55f6 add link to node quickstart to readme (#81) 2023-05-16 09:24:12 -07:00
gsilvestrin
395c7460d5 nodejs create_table (#75) 2023-05-15 19:00:17 -07:00
Jai
92d810eac4 docs build (#78) 2023-05-14 10:18:28 -07:00
Jai
a55a579b7f nodejs read only example (#77) 2023-05-12 15:50:59 -07:00
gsilvestrin
202924f832 updated node example (#74) 2023-05-11 12:55:02 -07:00
gsilvestrin
648f8123ca Exposing limit parameter (#73) 2023-05-11 09:12:06 -07:00
gsilvestrin
5bb5b0a685 javascript example improvements (#72) 2023-05-10 22:06:44 -07:00
gsilvestrin
c2e73262ef bump version and skipping building the native lib during install (#71) 2023-05-10 15:10:46 -07:00
gsilvestrin
f5bf6181e3 Merge pull request #70 from lancedb/gsilvestrin/nodejs_client-merge
JavaScript / Node.js library for LanceDB
2023-05-10 13:44:52 -07:00
gsilvestrin
c2dc1da509 Removing sample db 2023-05-10 13:40:17 -07:00
gsilvestrin
38e6efc185 JavaScript / Node.js library for LanceDB
- Core rust library
- ffi bridge that exposes rust functionality to javascript
- npm package that provides a TypeScript / JavaScript library
- limitations: it only supports reading for now
2023-05-10 12:51:49 -07:00
Chang She
636a6d3761 Merge pull request #65 from lancedb/jaichopra/add-youtube-transcript-example 2023-05-08 17:45:35 -07:00
Jai Chopra
2a855c9f6a update image url 2023-05-08 17:39:52 -07:00
Jai Chopra
5c47b0c6a2 add youtube transcript example 2023-05-08 17:38:08 -07:00
Jai
d12bc24091 Merge pull request #63 from lancedb/jaichopra/update-readme-ecosystem
update ecosystem in readme
2023-05-07 09:12:25 -07:00
Jai Chopra
c4261b23e6 update blog url 2023-05-07 08:18:24 -07:00
Jai Chopra
ab0abbbfab update ecosystem in readme 2023-05-07 08:17:02 -07:00
Chang She
13c9a2e1c9 Merge pull request #61 from lancedb/jaichopra/langchain-example-doc
add langchain example
2023-05-05 16:06:40 -07:00
Jai Chopra
7e3db16225 add langchain example 2023-05-05 16:00:14 -07:00
Jai
62abe2d96f Merge pull request #57 from lancedb/jaichopra/s3-lambda-docs
S3 Lambda example
2023-05-05 14:08:24 -07:00
Jai Chopra
11f423ccf5 clean up 2023-05-04 17:21:53 -07:00
Jai Chopra
6ff3c60cd1 clean up example 2023-05-04 10:14:31 -07:00
Jai Chopra
6556e42e6d update lambda example to lancedb 2023-05-04 08:17:13 -07:00
Jai Chopra
c3d90b2c78 update tagline 2023-05-04 08:17:13 -07:00
Jai Chopra
66f7d5cec9 also update docs index 2023-05-04 08:17:13 -07:00
Jai Chopra
4336ed050d add new feature to readme.md 2023-05-04 08:17:13 -07:00
Lei Xu
976344257c add cargo metadata 2023-05-04 08:17:13 -07:00
Lei Xu
906551b001 initialize the rust core 2023-05-04 08:17:13 -07:00
Chang She
33ac42a51c bump version for v0.1.1 2023-05-04 08:17:13 -07:00
Jai
7cd36196b4 Update langchain.md 2023-04-27 11:08:29 -07:00
Jai
87fb4d0645 Update langchain.md 2023-04-27 07:13:18 -07:00
Jai
c930b94917 Update s3_lambda.md 2023-04-27 07:12:52 -07:00
Jai
aa23d911f5 Update langchain.md 2023-04-26 14:50:09 -07:00
Jai Chopra
ca8d8e82b7 add simple langchain example 2023-04-26 14:44:20 -07:00
Jai
3d3ba913ed Update s3_lambda.md 2023-04-26 10:19:27 -07:00
Jai
0346d5319e Update s3_lambda.md 2023-04-26 10:18:47 -07:00
Jai
41eadf6fd9 Update s3_lambda.md 2023-04-26 10:18:31 -07:00
Jai Chopra
e784c6311d tree github build script from remote 2023-04-25 21:40:28 -07:00
428 changed files with 71699 additions and 1715 deletions

22
.bumpversion.cfg Normal file

@@ -0,0 +1,22 @@
[bumpversion]
current_version = 0.4.17
commit = True
message = Bump version: {current_version} → {new_version}
tag = True
tag_name = v{new_version}
[bumpversion:file:node/package.json]
[bumpversion:file:nodejs/package.json]
[bumpversion:file:nodejs/npm/darwin-x64/package.json]
[bumpversion:file:nodejs/npm/darwin-arm64/package.json]
[bumpversion:file:nodejs/npm/linux-x64-gnu/package.json]
[bumpversion:file:nodejs/npm/linux-arm64-gnu/package.json]
[bumpversion:file:rust/ffi/node/Cargo.toml]
[bumpversion:file:rust/lancedb/Cargo.toml]
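
To show how the `message` and `tag_name` templates in this file expand, here is a small sketch that assumes plain `str.format`-style substitution of `{current_version}` and `{new_version}`. It does not call the bumpversion tool; the version pair is taken from one of the "Bump version" commits in the log above.

```python
# Templates copied from the [bumpversion] section of the config file.
message_template = "Bump version: {current_version} → {new_version}"
tag_template = "v{new_version}"

# Version pair used purely as example input.
context = {"current_version": "0.1.18", "new_version": "0.1.19"}

commit_message = message_template.format(**context)
tag_name = tag_template.format(**context)
```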

40
.cargo/config.toml Normal file

@@ -0,0 +1,40 @@
[profile.release]
lto = "fat"
codegen-units = 1
[profile.release-with-debug]
inherits = "release"
debug = true
# Prioritize compile time over runtime performance
codegen-units = 16
lto = "thin"
[target.'cfg(all())']
rustflags = [
"-Wclippy::all",
"-Wclippy::style",
"-Wclippy::fallible_impl_from",
"-Wclippy::manual_let_else",
"-Wclippy::redundant_pub_crate",
"-Wclippy::string_add_assign",
"-Wclippy::string_add",
"-Wclippy::string_lit_as_bytes",
"-Wclippy::string_to_string",
"-Wclippy::use_self",
"-Dclippy::cargo",
"-Dclippy::dbg_macro",
# not too much we can do to avoid multiple crate versions
"-Aclippy::multiple-crate-versions",
"-Aclippy::wildcard_dependencies",
]
[target.x86_64-unknown-linux-gnu]
rustflags = ["-C", "target-cpu=haswell", "-C", "target-feature=+avx2,+fma,+f16c"]
[target.aarch64-apple-darwin]
rustflags = ["-C", "target-cpu=apple-m1", "-C", "target-feature=+neon,+fp16,+fhm,+dotprod"]
# Not all Windows systems have the C runtime installed, so this avoids library
# not found errors on systems that are missing it.
[target.x86_64-pc-windows-msvc]
rustflags = ["-Ctarget-feature=+crt-static"]

33
.github/ISSUE_TEMPLATE/bug-node.yml vendored Normal file

@@ -0,0 +1,33 @@
name: Bug Report - Node / Typescript
description: File a bug report
title: "bug(node): "
labels: [bug, typescript]
body:
  - type: markdown
    attributes:
      value: |
        Thanks for taking the time to fill out this bug report!
  - type: input
    id: version
    attributes:
      label: LanceDB version
      description: What version of LanceDB are you using? `npm list | grep vectordb`.
      placeholder: v0.3.2
    validations:
      required: false
  - type: textarea
    id: what-happened
    attributes:
      label: What happened?
      description: Also tell us, what did you expect to happen?
    validations:
      required: true
  - type: textarea
    id: reproduction
    attributes:
      label: Are there known steps to reproduce?
      description: |
        Let us know how to reproduce the bug and we may be able to fix it more
        quickly. This is not required, but it is helpful.
    validations:
      required: false

33
.github/ISSUE_TEMPLATE/bug-python.yml vendored Normal file

@@ -0,0 +1,33 @@
name: Bug Report - Python
description: File a bug report
title: "bug(python): "
labels: [bug, python]
body:
  - type: markdown
    attributes:
      value: |
        Thanks for taking the time to fill out this bug report!
  - type: input
    id: version
    attributes:
      label: LanceDB version
      description: What version of LanceDB are you using? `python -c "import lancedb; print(lancedb.__version__)"`.
      placeholder: v0.3.2
    validations:
      required: false
  - type: textarea
    id: what-happened
    attributes:
      label: What happened?
      description: Also tell us, what did you expect to happen?
    validations:
      required: true
  - type: textarea
    id: reproduction
    attributes:
      label: Are there known steps to reproduce?
      description: |
        Let us know how to reproduce the bug and we may be able to fix it more
        quickly. This is not required, but it is helpful.
    validations:
      required: false

5
.github/ISSUE_TEMPLATE/config.yml vendored Normal file

@@ -0,0 +1,5 @@
blank_issues_enabled: true
contact_links:
  - name: Discord Community Support
    url: https://discord.com/invite/zMM32dvNtd
    about: Please ask and answer questions here.


@@ -0,0 +1,23 @@
name: 'Documentation improvement'
description: Report an issue with the documentation.
labels: [documentation]
body:
  - type: textarea
    id: description
    attributes:
      label: Description
      description: >
        Describe the issue with the documentation and how it can be fixed or improved.
    validations:
      required: true
  - type: input
    id: link
    attributes:
      label: Link
      description: >
        Provide a link to the existing documentation, if applicable.
      placeholder: ex. https://lancedb.github.io/lancedb/guides/tables/...
    validations:
      required: false

31
.github/ISSUE_TEMPLATE/feature.yml vendored Normal file

@@ -0,0 +1,31 @@
name: Feature suggestion
description: Suggest a new feature for LanceDB
title: "Feature: "
labels: [enhancement]
body:
  - type: markdown
    attributes:
      value: |
        Share a new idea for a feature or improvement. Be sure to search existing
        issues first to avoid duplicates.
  - type: dropdown
    id: sdk
    attributes:
      label: SDK
      description: Which SDK are you using? This helps us prioritize.
      options:
        - Python
        - Node
        - Rust
      default: 0
    validations:
      required: false
  - type: textarea
    id: description
    attributes:
      label: Description
      description: |
        Describe the feature and why it would be useful. If applicable, consider
        providing a code example of what it might be like to use the feature.
    validations:
      required: true

@@ -0,0 +1,62 @@
# We create a composite action to be re-used both for testing and for releasing
name: build-linux-wheel
description: "Build a manylinux wheel for lance"
inputs:
python-minor-version:
description: "8, 9, 10, 11, 12"
required: true
args:
description: "--release"
required: false
default: ""
arm-build:
description: "Build for arm64 instead of x86_64"
# Note: this does *not* mean the host is arm64, since we might be cross-compiling.
required: false
default: "false"
manylinux:
description: "The manylinux version to build for"
required: false
default: "2_17"
runs:
using: "composite"
steps:
- name: CONFIRM ARM BUILD
shell: bash
run: |
echo "ARM BUILD: ${{ inputs.arm-build }}"
- name: Build x86_64 Manylinux wheel
if: ${{ inputs.arm-build == 'false' }}
uses: PyO3/maturin-action@v1
with:
command: build
working-directory: python
target: x86_64-unknown-linux-gnu
manylinux: ${{ inputs.manylinux }}
args: ${{ inputs.args }}
before-script-linux: |
set -e
yum install -y openssl-devel \
&& curl -L https://github.com/protocolbuffers/protobuf/releases/download/v24.4/protoc-24.4-linux-$(uname -m).zip > /tmp/protoc.zip \
&& unzip /tmp/protoc.zip -d /usr/local \
&& rm /tmp/protoc.zip
- name: Build Arm Manylinux Wheel
if: ${{ inputs.arm-build == 'true' }}
uses: PyO3/maturin-action@v1
with:
command: build
working-directory: python
target: aarch64-unknown-linux-gnu
manylinux: ${{ inputs.manylinux }}
args: ${{ inputs.args }}
before-script-linux: |
set -e
apt install -y unzip
if [ $(uname -m) = "x86_64" ]; then
PROTOC_ARCH="x86_64"
else
PROTOC_ARCH="aarch_64"
fi
curl -L https://github.com/protocolbuffers/protobuf/releases/download/v24.4/protoc-24.4-linux-$PROTOC_ARCH.zip > /tmp/protoc.zip \
&& unzip /tmp/protoc.zip -d /usr/local \
&& rm /tmp/protoc.zip
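
The Arm step above chooses the protoc download by mapping `uname -m` to the architecture suffix used in the protobuf release assets (`x86_64` or `aarch_64`). A rough Python sketch of that mapping is shown here; the function names are hypothetical, and the version and URL pattern are copied from the action rather than checked against the protobuf release page.

```python
import platform

PROTOC_VERSION = "24.4"  # version pinned in the composite action above


def protoc_arch(machine: str) -> str:
    """Map a `uname -m` value to the suffix used in protoc release assets."""
    # Mirrors the shell branch: anything that is not x86_64 is treated as arm64.
    return "x86_64" if machine == "x86_64" else "aarch_64"


def protoc_url(machine: str, version: str = PROTOC_VERSION) -> str:
    """Build the download URL for the protoc archive matching this machine."""
    return (
        "https://github.com/protocolbuffers/protobuf/releases/download/"
        f"v{version}/protoc-{version}-linux-{protoc_arch(machine)}.zip"
    )


if __name__ == "__main__":
    print(protoc_url(platform.machine()))
```

The explicit mapping matters because `uname -m` reports `aarch64` on Arm hosts while the asset name uses `aarch_64`, so interpolating `$(uname -m)` directly only works on x86_64.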

@@ -0,0 +1,25 @@
# We create a composite action to be re-used both for testing and for releasing
name: build_wheel
description: "Build a lance wheel"
inputs:
python-minor-version:
description: "8, 9, 10, 11"
required: true
args:
description: "--release"
required: false
default: ""
runs:
using: "composite"
steps:
- name: Install macos dependency
shell: bash
run: |
brew install protobuf
- name: Build wheel
uses: PyO3/maturin-action@v1
with:
command: build
args: ${{ inputs.args }}
working-directory: python
interpreter: 3.${{ inputs.python-minor-version }}

@@ -0,0 +1,33 @@
# We create a composite action to be re-used both for testing and for releasing
name: build_wheel
description: "Build a lance wheel"
inputs:
python-minor-version:
description: "8, 9, 10, 11"
required: true
args:
description: "--release"
required: false
default: ""
runs:
using: "composite"
steps:
- name: Install Protoc v21.12
working-directory: C:\
run: |
New-Item -Path 'C:\protoc' -ItemType Directory
Set-Location C:\protoc
Invoke-WebRequest https://github.com/protocolbuffers/protobuf/releases/download/v21.12/protoc-21.12-win64.zip -OutFile C:\protoc\protoc.zip
7z x protoc.zip
Add-Content $env:GITHUB_PATH "C:\protoc\bin"
shell: powershell
- name: Build wheel
uses: PyO3/maturin-action@v1
with:
command: build
args: ${{ inputs.args }}
working-directory: python
- uses: actions/upload-artifact@v3
with:
name: windows-wheels
path: python\target\wheels

.github/workflows/cargo-publish.yml

@@ -0,0 +1,32 @@
name: Cargo Publish
on:
release:
types: [ published ]
env:
# This env var is used by Swatinem/rust-cache@v2 for the cache
# key, so we set it to make sure it is always consistent.
CARGO_TERM_COLOR: always
# Up-to-date compilers needed for fp16kernels.
CC: gcc-12
CXX: g++-12
jobs:
build:
runs-on: ubuntu-22.04
timeout-minutes: 30
# Only runs on tags that match the make-release action
if: startsWith(github.ref, 'refs/tags/v')
steps:
- uses: actions/checkout@v4
- uses: Swatinem/rust-cache@v2
with:
workspaces: rust
- name: Install dependencies
run: |
sudo apt update
sudo apt install -y protobuf-compiler libssl-dev
- name: Publish the package
run: |
cargo publish -p lancedb --all-features --token ${{ secrets.CARGO_REGISTRY_TOKEN }}

@@ -24,12 +24,16 @@ jobs:
environment:
name: github-pages
url: ${{ steps.deployment.outputs.page_url }}
runs-on: ubuntu-22.04
runs-on: buildjet-8vcpu-ubuntu-2204
steps:
- name: Checkout
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Install dependencies needed for ubuntu
run: |
sudo apt install -y protobuf-compiler libssl-dev
rustup update && rustup default
- name: Set up Python
uses: actions/setup-python@v4
uses: actions/setup-python@v5
with:
python-version: "3.10"
cache: "pip"
@@ -39,10 +43,32 @@ jobs:
run: |
python -m pip install -e .
python -m pip install -r ../docs/requirements.txt
- name: Set up node
uses: actions/setup-node@v3
with:
node-version: 20
cache: 'npm'
cache-dependency-path: node/package-lock.json
- uses: Swatinem/rust-cache@v2
- name: Install node dependencies
working-directory: node
run: |
sudo apt update
sudo apt install -y protobuf-compiler libssl-dev
- name: Build node
working-directory: node
run: |
npm ci
npm run build
npm run tsc
- name: Create markdown files
working-directory: node
run: |
npx typedoc --plugin typedoc-plugin-markdown --out ../docs/src/javascript src/index.ts
- name: Build docs
working-directory: docs
run: |
mkdocs build
PYTHONPATH=. mkdocs build
- name: Setup Pages
uses: actions/configure-pages@v2
- name: Upload artifact

.github/workflows/docs_test.yml

@@ -0,0 +1,100 @@
name: Documentation Code Testing
on:
push:
branches:
- main
paths:
- docs/**
- .github/workflows/docs_test.yml
pull_request:
paths:
- docs/**
- .github/workflows/docs_test.yml
# Allows you to run this workflow manually from the Actions tab
workflow_dispatch:
env:
# Disable full debug symbol generation to speed up CI build and keep memory down
# "1" means line tables only, which is useful for panic tracebacks.
RUSTFLAGS: "-C debuginfo=1 -C target-cpu=haswell -C target-feature=+f16c,+avx2,+fma"
RUST_BACKTRACE: "1"
jobs:
test-python:
name: Test doc python code
runs-on: "buildjet-8vcpu-ubuntu-2204"
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Print CPU capabilities
run: cat /proc/cpuinfo
- name: Install dependencies needed for ubuntu
run: |
sudo apt install -y protobuf-compiler libssl-dev
rustup update && rustup default
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: 3.11
cache: "pip"
cache-dependency-path: "docs/test/requirements.txt"
- name: Rust cache
uses: swatinem/rust-cache@v2
- name: Build Python
working-directory: docs/test
run:
python -m pip install -r requirements.txt
- name: Create test files
run: |
cd docs/test
python md_testing.py
- name: Test
run: |
cd docs/test/python
for d in *; do cd "$d"; echo "$d".py; python "$d".py; cd ..; done
test-node:
name: Test doc nodejs code
runs-on: "buildjet-8vcpu-ubuntu-2204"
timeout-minutes: 60
strategy:
fail-fast: false
steps:
- name: Checkout
uses: actions/checkout@v4
with:
fetch-depth: 0
lfs: true
- name: Print CPU capabilities
run: cat /proc/cpuinfo
- name: Set up Node
uses: actions/setup-node@v4
with:
node-version: 20
- name: Install dependencies needed for ubuntu
run: |
sudo apt install -y protobuf-compiler libssl-dev
rustup update && rustup default
- name: Rust cache
uses: swatinem/rust-cache@v2
- name: Install node dependencies
run: |
sudo swapoff -a
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
sudo swapon --show
cd node
npm ci
npm run build-release
cd ../docs
npm install
- name: Test
env:
LANCEDB_URI: ${{ secrets.LANCEDB_URI }}
LANCEDB_DEV_API_KEY: ${{ secrets.LANCEDB_DEV_API_KEY }}
run: |
cd docs
npm t
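
The Python job above runs every generated example with a shell loop: for each directory `d` under `docs/test/python` it changes into `d` and runs `d.py`. The sketch below restates that loop in Python under the assumption that each directory holds a script named after it, as the loop implies. `run_generated_examples` is a hypothetical helper and is not part of the repository.

```python
import subprocess
import sys
from pathlib import Path


def run_generated_examples(root: Path) -> list:
    """Run <dir>/<dir>.py for every subdirectory of root, from inside that directory.

    Stops at the first failing script because check=True raises
    CalledProcessError on a non-zero exit code.
    """
    ran = []
    for directory in sorted(p for p in root.iterdir() if p.is_dir()):
        script = directory / f"{directory.name}.py"
        subprocess.run([sys.executable, script.name], cwd=directory, check=True)
        ran.append(script.name)
    return ran


if __name__ == "__main__":
    target = Path("docs/test/python")
    if target.is_dir():
        for name in run_generated_examples(target):
            print("ran", name)
```

Running each script from its own directory keeps any relative paths or local database files written by one example from colliding with another.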

@@ -0,0 +1,59 @@
name: Create release commit
on:
workflow_dispatch:
inputs:
dry_run:
description: 'Dry run (create the local commit/tags but do not push it)'
required: true
default: "false"
type: choice
options:
- "true"
- "false"
part:
description: 'What kind of release is this?'
required: true
default: 'patch'
type: choice
options:
- patch
- minor
- major
jobs:
bump-version:
runs-on: ubuntu-latest
steps:
- name: Check out main
uses: actions/checkout@v4
with:
ref: main
persist-credentials: false
fetch-depth: 0
lfs: true
- name: Set git configs for bumpversion
shell: bash
run: |
git config user.name 'Lance Release'
git config user.email 'lance-dev@lancedb.com'
- name: Set up Python 3.11
uses: actions/setup-python@v5
with:
python-version: "3.11"
- name: Bump version, create tag and commit
run: |
pip install bump2version
bumpversion --verbose ${{ inputs.part }}
- name: Push new version and tag
if: ${{ inputs.dry_run == 'false' }}
uses: ad-m/github-push-action@master
with:
github_token: ${{ secrets.LANCEDB_RELEASE_TOKEN }}
branch: main
tags: true
- uses: ./.github/workflows/update_package_lock
if: ${{ inputs.dry_run == 'false' }}
with:
github_token: ${{ secrets.LANCEDB_RELEASE_TOKEN }}
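
The `part` input selects which component of the version is incremented. The sketch below illustrates the usual semantics of a patch, minor, or major bump for a plain `MAJOR.MINOR.PATCH` string; it is not bump2version's implementation, and the `bump` function is a hypothetical stand-in for illustration only.

```python
def bump(version: str, part: str) -> str:
    """Return version with the given part incremented and lower parts reset to zero."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")


if __name__ == "__main__":
    for part in ("patch", "minor", "major"):
        print(part, bump("0.6.9", part))
```

With the default `patch` input, a version such as 0.6.9 becomes 0.6.10, which matches the bump commits in the log at the top of this comparison.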

.github/workflows/node.yml

@@ -0,0 +1,147 @@
name: Node
on:
push:
branches:
- main
pull_request:
paths:
- node/**
- rust/ffi/node/**
- .github/workflows/node.yml
- docker-compose.yml
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true
env:
# Disable full debug symbol generation to speed up CI build and keep memory down
# "1" means line tables only, which is useful for panic tracebacks.
#
# Use native CPU to accelerate tests if possible, especially for f16
# target-cpu=haswell fixes failing ci build
RUSTFLAGS: "-C debuginfo=1 -C target-cpu=haswell -C target-feature=+f16c,+avx2,+fma"
RUST_BACKTRACE: "1"
jobs:
linux:
name: Linux (Node ${{ matrix.node-version }})
timeout-minutes: 30
strategy:
matrix:
node-version: [ "18", "20" ]
runs-on: "ubuntu-22.04"
defaults:
run:
shell: bash
working-directory: node
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
lfs: true
- uses: actions/setup-node@v3
with:
node-version: ${{ matrix.node-version }}
cache: 'npm'
cache-dependency-path: node/package-lock.json
- uses: Swatinem/rust-cache@v2
- name: Install dependencies
run: |
sudo apt update
sudo apt install -y protobuf-compiler libssl-dev
- name: Build
run: |
npm ci
npm run build
npm run pack-build
npm install --no-save ./dist/lancedb-vectordb-*.tgz
# Remove index.node to test with dependency installed
rm index.node
- name: Test
run: npm run test
macos:
timeout-minutes: 30
runs-on: "macos-13"
defaults:
run:
shell: bash
working-directory: node
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
lfs: true
- uses: actions/setup-node@v3
with:
node-version: 20
cache: 'npm'
cache-dependency-path: node/package-lock.json
- uses: Swatinem/rust-cache@v2
- name: Install dependencies
run: brew install protobuf
- name: Build
run: |
npm ci
npm run build
npm run pack-build
npm install --no-save ./dist/lancedb-vectordb-*.tgz
# Remove index.node to test with dependency installed
rm index.node
- name: Test
run: |
npm run test
aws-integtest:
timeout-minutes: 45
runs-on: "ubuntu-22.04"
defaults:
run:
shell: bash
working-directory: node
env:
AWS_ACCESS_KEY_ID: ACCESSKEY
AWS_SECRET_ACCESS_KEY: SECRETKEY
AWS_DEFAULT_REGION: us-west-2
# this one is for s3
AWS_ENDPOINT: http://localhost:4566
# this one is for dynamodb
DYNAMODB_ENDPOINT: http://localhost:4566
ALLOW_HTTP: true
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
lfs: true
- uses: actions/setup-node@v3
with:
node-version: 20
cache: 'npm'
cache-dependency-path: node/package-lock.json
- name: start local stack
run: docker compose -f ../docker-compose.yml up -d --wait
- name: create s3
run: aws s3 mb s3://lancedb-integtest --endpoint $AWS_ENDPOINT
- name: create ddb
run: |
aws dynamodb create-table \
--table-name lancedb-integtest \
--attribute-definitions '[{"AttributeName": "base_uri", "AttributeType": "S"}, {"AttributeName": "version", "AttributeType": "N"}]' \
--key-schema '[{"AttributeName": "base_uri", "KeyType": "HASH"}, {"AttributeName": "version", "KeyType": "RANGE"}]' \
--provisioned-throughput '{"ReadCapacityUnits": 10, "WriteCapacityUnits": 10}' \
--endpoint-url $DYNAMODB_ENDPOINT
- uses: Swatinem/rust-cache@v2
- name: Install dependencies
run: |
sudo apt update
sudo apt install -y protobuf-compiler libssl-dev
- name: Build
run: |
npm ci
npm run build
npm run pack-build
npm install --no-save ./dist/lancedb-vectordb-*.tgz
# Remove index.node to test with dependency installed
rm index.node
- name: Test
run: npm run integration-test
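
The `concurrency` block in this workflow groups runs by workflow name plus either the pull request number or, when there is no pull request, the git ref. Because `cancel-in-progress` is true, a new run in a group cancels the older one. The sketch below shows how such a group key is formed; `concurrency_group` is a hypothetical helper, and the example values are illustrative.

```python
from typing import Optional


def concurrency_group(workflow: str, ref: str, pr_number: Optional[int] = None) -> str:
    """Build the group key: '<workflow>-<PR number>' or '<workflow>-<ref>'."""
    # `a || b` in a workflow expression yields a when it is truthy, otherwise b.
    key = pr_number if pr_number else ref
    return f"{workflow}-{key}"


if __name__ == "__main__":
    # Two pushes to the same pull request share a group, so the older run is cancelled.
    print(concurrency_group("Node", "refs/pull/1230/merge", 1230))
    # Pushes to main are grouped by ref instead.
    print(concurrency_group("Node", "refs/heads/main"))
```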

.github/workflows/nodejs.yml

@@ -0,0 +1,123 @@
name: NodeJS (NAPI)
on:
push:
branches:
- main
pull_request:
paths:
- nodejs/**
- .github/workflows/nodejs.yml
- docker-compose.yml
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true
env:
# Disable full debug symbol generation to speed up CI build and keep memory down
# "1" means line tables only, which is useful for panic tracebacks.
RUSTFLAGS: "-C debuginfo=1"
RUST_BACKTRACE: "1"
jobs:
lint:
name: Lint
runs-on: ubuntu-22.04
defaults:
run:
shell: bash
working-directory: nodejs
env:
# Need up-to-date compilers for kernels
CC: gcc-12
CXX: g++-12
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
lfs: true
- uses: actions/setup-node@v3
with:
node-version: 20
cache: 'npm'
cache-dependency-path: nodejs/package-lock.json
- uses: Swatinem/rust-cache@v2
- name: Install dependencies
run: |
sudo apt update
sudo apt install -y protobuf-compiler libssl-dev
- name: Lint
run: |
cargo fmt --all -- --check
cargo clippy --all --all-features -- -D warnings
npm ci
npm run lint
npm run chkformat
linux:
name: Linux (NodeJS ${{ matrix.node-version }})
timeout-minutes: 30
strategy:
matrix:
node-version: [ "18", "20" ]
runs-on: "ubuntu-22.04"
defaults:
run:
shell: bash
working-directory: nodejs
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
lfs: true
- uses: actions/setup-node@v3
with:
node-version: ${{ matrix.node-version }}
cache: 'npm'
cache-dependency-path: nodejs/package-lock.json
- uses: Swatinem/rust-cache@v2
- name: Install dependencies
run: |
sudo apt update
sudo apt install -y protobuf-compiler libssl-dev
npm install -g @napi-rs/cli
- name: Build
run: |
npm ci
npm run build
- name: Setup localstack
working-directory: .
run: docker compose up --detach --wait
- name: Test
env:
S3_TEST: "1"
run: npm run test
macos:
timeout-minutes: 30
runs-on: "macos-14"
defaults:
run:
shell: bash
working-directory: nodejs
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
lfs: true
- uses: actions/setup-node@v3
with:
node-version: 20
cache: 'npm'
cache-dependency-path: nodejs/package-lock.json
- uses: Swatinem/rust-cache@v2
- name: Install dependencies
run: |
brew install protobuf
npm install -g @napi-rs/cli
- name: Build
run: |
npm ci
npm run build
- name: Test
run: |
npm run test

.github/workflows/npm-publish.yml

@@ -0,0 +1,349 @@
name: NPM Publish
on:
release:
types: [published]
jobs:
node:
runs-on: ubuntu-latest
# Only runs on tags that match the make-release action
if: startsWith(github.ref, 'refs/tags/v')
defaults:
run:
shell: bash
working-directory: node
steps:
- name: Checkout
uses: actions/checkout@v4
- uses: actions/setup-node@v3
with:
node-version: 20
cache: "npm"
cache-dependency-path: node/package-lock.json
- name: Install dependencies
run: |
sudo apt update
sudo apt install -y protobuf-compiler libssl-dev
- name: Build
run: |
npm ci
npm run tsc
npm pack
- name: Upload Linux Artifacts
uses: actions/upload-artifact@v4
with:
name: node-package
path: |
node/vectordb-*.tgz
node-macos:
strategy:
matrix:
config:
- arch: x86_64-apple-darwin
runner: macos-13
- arch: aarch64-apple-darwin
# macos-14 runners are arm64 (Apple Silicon).
runner: macos-14
runs-on: ${{ matrix.config.runner }}
# Only runs on tags that match the make-release action
if: startsWith(github.ref, 'refs/tags/v')
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Install system dependencies
run: brew install protobuf
- name: Install npm dependencies
run: |
cd node
npm ci
- name: Build MacOS native node modules
run: bash ci/build_macos_artifacts.sh ${{ matrix.config.arch }}
- name: Upload Darwin Artifacts
uses: actions/upload-artifact@v4
with:
name: node-native-darwin-${{ matrix.config.arch }}
path: |
node/dist/lancedb-vectordb-darwin*.tgz
nodejs-macos:
strategy:
matrix:
config:
- arch: x86_64-apple-darwin
runner: macos-13
- arch: aarch64-apple-darwin
# macos-14 runners are arm64 (Apple Silicon).
runner: macos-14
runs-on: ${{ matrix.config.runner }}
# Only runs on tags that match the make-release action
if: startsWith(github.ref, 'refs/tags/v')
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Install system dependencies
run: brew install protobuf
- name: Install npm dependencies
run: |
cd nodejs
npm ci
- name: Build MacOS native nodejs modules
run: bash ci/build_macos_artifacts_nodejs.sh ${{ matrix.config.arch }}
- name: Upload Darwin Artifacts
uses: actions/upload-artifact@v4
with:
name: nodejs-native-darwin-${{ matrix.config.arch }}
path: |
nodejs/dist/*.node
node-linux:
name: node-linux (${{ matrix.config.arch }}-unknown-linux-gnu)
runs-on: ${{ matrix.config.runner }}
# Only runs on tags that match the make-release action
if: startsWith(github.ref, 'refs/tags/v')
strategy:
fail-fast: false
matrix:
config:
- arch: x86_64
runner: ubuntu-latest
- arch: aarch64
# For successful fat LTO builds, we need a large runner to avoid OOM errors.
runner: buildjet-16vcpu-ubuntu-2204-arm
steps:
- name: Checkout
uses: actions/checkout@v4
# Buildjet aarch64 runners have only 1.5 GB RAM per core, vs 3.5 GB per core for
# x86_64 runners. To avoid OOM errors on ARM, we create a swap file.
- name: Configure aarch64 build
if: ${{ matrix.config.arch == 'aarch64' }}
run: |
free -h
sudo fallocate -l 16G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo "/swapfile swap swap defaults 0 0" | sudo tee -a /etc/fstab
# print info
swapon --show
free -h
- name: Build Linux Artifacts
run: |
bash ci/build_linux_artifacts.sh ${{ matrix.config.arch }}
- name: Upload Linux Artifacts
uses: actions/upload-artifact@v4
with:
name: node-native-linux-${{ matrix.config.arch }}
path: |
node/dist/lancedb-vectordb-linux*.tgz
nodejs-linux:
name: nodejs-linux (${{ matrix.config.arch }}-unknown-linux-gnu)
runs-on: ${{ matrix.config.runner }}
# Only runs on tags that match the make-release action
if: startsWith(github.ref, 'refs/tags/v')
strategy:
fail-fast: false
matrix:
config:
- arch: x86_64
runner: ubuntu-latest
- arch: aarch64
# For successful fat LTO builds, we need a large runner to avoid OOM errors.
runner: buildjet-16vcpu-ubuntu-2204-arm
steps:
- name: Checkout
uses: actions/checkout@v4
# Buildjet aarch64 runners have only 1.5 GB RAM per core, vs 3.5 GB per core for
# x86_64 runners. To avoid OOM errors on ARM, we create a swap file.
- name: Configure aarch64 build
if: ${{ matrix.config.arch == 'aarch64' }}
run: |
free -h
sudo fallocate -l 16G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo "/swapfile swap swap defaults 0 0" | sudo tee -a /etc/fstab
# print info
swapon --show
free -h
- name: Build Linux Artifacts
run: |
bash ci/build_linux_artifacts_nodejs.sh ${{ matrix.config.arch }}
- name: Upload Linux Artifacts
uses: actions/upload-artifact@v4
with:
name: nodejs-native-linux-${{ matrix.config.arch }}
path: |
nodejs/dist/*.node
# The generic files are the same in all distros so we just pick
# one to do the upload.
- name: Upload Generic Artifacts
if: ${{ matrix.config.arch == 'x86_64' }}
uses: actions/upload-artifact@v4
with:
name: nodejs-dist
path: |
nodejs/dist/*
!nodejs/dist/*.node
node-windows:
runs-on: windows-2022
# Only runs on tags that match the make-release action
if: startsWith(github.ref, 'refs/tags/v')
strategy:
fail-fast: false
matrix:
target: [x86_64-pc-windows-msvc]
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Install Protoc v21.12
working-directory: C:\
run: |
New-Item -Path 'C:\protoc' -ItemType Directory
Set-Location C:\protoc
Invoke-WebRequest https://github.com/protocolbuffers/protobuf/releases/download/v21.12/protoc-21.12-win64.zip -OutFile C:\protoc\protoc.zip
7z x protoc.zip
Add-Content $env:GITHUB_PATH "C:\protoc\bin"
shell: powershell
- name: Install npm dependencies
run: |
cd node
npm ci
- name: Build Windows native node modules
run: .\ci\build_windows_artifacts.ps1 ${{ matrix.target }}
- name: Upload Windows Artifacts
uses: actions/upload-artifact@v4
with:
name: node-native-windows
path: |
node/dist/lancedb-vectordb-win32*.tgz
nodejs-windows:
runs-on: windows-2022
# Only runs on tags that match the make-release action
if: startsWith(github.ref, 'refs/tags/v')
strategy:
fail-fast: false
matrix:
target: [x86_64-pc-windows-msvc]
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Install Protoc v21.12
working-directory: C:\
run: |
New-Item -Path 'C:\protoc' -ItemType Directory
Set-Location C:\protoc
Invoke-WebRequest https://github.com/protocolbuffers/protobuf/releases/download/v21.12/protoc-21.12-win64.zip -OutFile C:\protoc\protoc.zip
7z x protoc.zip
Add-Content $env:GITHUB_PATH "C:\protoc\bin"
shell: powershell
- name: Install npm dependencies
run: |
cd nodejs
npm ci
- name: Build Windows native node modules
run: .\ci\build_windows_artifacts_nodejs.ps1 ${{ matrix.target }}
- name: Upload Windows Artifacts
uses: actions/upload-artifact@v4
with:
name: nodejs-native-windows
path: |
nodejs/dist/*.node
release:
needs: [node, node-macos, node-linux, node-windows]
runs-on: ubuntu-latest
# Only runs on tags that match the make-release action
if: startsWith(github.ref, 'refs/tags/v')
steps:
- uses: actions/download-artifact@v4
with:
pattern: node-*
- name: Display structure of downloaded files
run: ls -R
- uses: actions/setup-node@v3
with:
node-version: 20
registry-url: "https://registry.npmjs.org"
- name: Publish to NPM
env:
NODE_AUTH_TOKEN: ${{ secrets.LANCEDB_NPM_REGISTRY_TOKEN }}
run: |
mv */*.tgz .
for filename in *.tgz; do
npm publish $filename
done
release-nodejs:
needs: [nodejs-macos, nodejs-linux, nodejs-windows]
runs-on: ubuntu-latest
# Only runs on tags that match the make-release action
if: startsWith(github.ref, 'refs/tags/v')
defaults:
run:
shell: bash
working-directory: nodejs
steps:
- name: Checkout
uses: actions/checkout@v4
- uses: actions/download-artifact@v4
with:
name: nodejs-dist
path: nodejs/dist
- uses: actions/download-artifact@v4
name: Download arch-specific binaries
with:
pattern: nodejs-*
path: nodejs/nodejs-artifacts
merge-multiple: true
- name: Display structure of downloaded files
run: find .
- uses: actions/setup-node@v3
with:
node-version: 20
registry-url: "https://registry.npmjs.org"
- name: Install napi-rs
run: npm install -g @napi-rs/cli
- name: Prepare artifacts
run: npx napi artifacts -d nodejs-artifacts
- name: Display structure of staged files
run: find npm
- name: Publish to NPM
env:
NODE_AUTH_TOKEN: ${{ secrets.LANCEDB_NPM_REGISTRY_TOKEN }}
run: npm publish --access public
update-package-lock:
needs: [release]
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4
with:
ref: main
persist-credentials: false
fetch-depth: 0
lfs: true
- uses: ./.github/workflows/update_package_lock
with:
github_token: ${{ secrets.LANCEDB_RELEASE_TOKEN }}
update-package-lock-nodejs:
needs: [release-nodejs]
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4
with:
ref: main
persist-credentials: false
fetch-depth: 0
lfs: true
- uses: ./.github/workflows/update_package_lock_nodejs
with:
github_token: ${{ secrets.LANCEDB_RELEASE_TOKEN }}

.github/workflows/pypi-publish.yml

@@ -0,0 +1,109 @@
name: PyPI Publish
on:
release:
types: [published]
jobs:
linux:
# Only runs on tags that match the python-make-release action
if: startsWith(github.ref, 'refs/tags/python-v')
name: Python ${{ matrix.config.platform }} manylinux${{ matrix.config.manylinux }}
timeout-minutes: 60
strategy:
matrix:
python-minor-version: ["8"]
config:
- platform: x86_64
manylinux: "2_17"
extra_args: ""
- platform: x86_64
manylinux: "2_28"
extra_args: "--features fp16kernels"
- platform: aarch64
manylinux: "2_24"
extra_args: ""
# We don't build fp16 kernels for aarch64 because that build uses a
# cross-compilation image that doesn't have a new enough compiler.
runs-on: "ubuntu-22.04"
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
lfs: true
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: 3.${{ matrix.python-minor-version }}
- uses: ./.github/workflows/build_linux_wheel
with:
python-minor-version: ${{ matrix.python-minor-version }}
args: "--release --strip ${{ matrix.config.extra_args }}"
arm-build: ${{ matrix.config.platform == 'aarch64' }}
manylinux: ${{ matrix.config.manylinux }}
- uses: ./.github/workflows/upload_wheel
with:
token: ${{ secrets.LANCEDB_PYPI_API_TOKEN }}
repo: "pypi"
mac:
# Only runs on tags that match the python-make-release action
if: startsWith(github.ref, 'refs/tags/python-v')
timeout-minutes: 60
runs-on: ${{ matrix.config.runner }}
strategy:
matrix:
python-minor-version: ["8"]
config:
- target: x86_64-apple-darwin
runner: macos-13
- target: aarch64-apple-darwin
runner: macos-14
env:
MACOSX_DEPLOYMENT_TARGET: 10.15
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.ref }}
fetch-depth: 0
lfs: true
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: 3.12
- uses: ./.github/workflows/build_mac_wheel
with:
python-minor-version: ${{ matrix.python-minor-version }}
args: "--release --strip --target ${{ matrix.config.target }} --features fp16kernels"
- uses: ./.github/workflows/upload_wheel
with:
python-minor-version: ${{ matrix.python-minor-version }}
token: ${{ secrets.LANCEDB_PYPI_API_TOKEN }}
repo: "pypi"
windows:
# Only runs on tags that match the python-make-release action
if: startsWith(github.ref, 'refs/tags/python-v')
timeout-minutes: 60
runs-on: windows-latest
strategy:
matrix:
python-minor-version: ["8"]
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.ref }}
fetch-depth: 0
lfs: true
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: 3.${{ matrix.python-minor-version }}
- uses: ./.github/workflows/build_windows_wheel
with:
python-minor-version: ${{ matrix.python-minor-version }}
args: "--release --strip"
vcpkg_token: ${{ secrets.VCPKG_GITHUB_PACKAGES }}
- uses: ./.github/workflows/upload_wheel
with:
python-minor-version: ${{ matrix.python-minor-version }}
token: ${{ secrets.LANCEDB_PYPI_API_TOKEN }}
repo: "pypi"
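
The `linux` job above combines `python-minor-version` with the three `config` entries, so the matrix expands to one job per pairing. The sketch below enumerates those jobs with `itertools.product` to show which arguments each one passes to the wheel build. The values are copied from the workflow; `expand_matrix` is a hypothetical helper, and unlike the workflow it strips the trailing space left by an empty `extra_args`.

```python
from itertools import product

PYTHON_MINOR_VERSIONS = ["8"]
CONFIGS = [
    {"platform": "x86_64", "manylinux": "2_17", "extra_args": ""},
    {"platform": "x86_64", "manylinux": "2_28", "extra_args": "--features fp16kernels"},
    {"platform": "aarch64", "manylinux": "2_24", "extra_args": ""},
]


def expand_matrix():
    """Return one dict per job the matrix would generate."""
    jobs = []
    for minor, config in product(PYTHON_MINOR_VERSIONS, CONFIGS):
        jobs.append(
            {
                "name": f"Python {config['platform']} manylinux{config['manylinux']}",
                "python": f"3.{minor}",
                # Stripped here for readability; the workflow passes the raw string.
                "args": f"--release --strip {config['extra_args']}".strip(),
                "arm_build": config["platform"] == "aarch64",
            }
        )
    return jobs


if __name__ == "__main__":
    for job in expand_matrix():
        print(job)
```

Only the manylinux 2_28 x86_64 job enables `fp16kernels`, which is consistent with the comment in the workflow about the aarch64 cross-compilation image.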

@@ -0,0 +1,56 @@
name: Python - Create release commit
on:
workflow_dispatch:
inputs:
dry_run:
description: 'Dry run (create the local commit/tags but do not push it)'
required: true
default: "false"
type: choice
options:
- "true"
- "false"
part:
description: 'What kind of release is this?'
required: true
default: 'patch'
type: choice
options:
- patch
- minor
- major
jobs:
bump-version:
runs-on: ubuntu-latest
steps:
- name: Check out main
uses: actions/checkout@v4
with:
ref: main
persist-credentials: false
fetch-depth: 0
lfs: true
- name: Set git configs for bumpversion
shell: bash
run: |
git config user.name 'Lance Release'
git config user.email 'lance-dev@lancedb.com'
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.11"
- name: Bump version, create tag and commit
working-directory: python
run: |
pip install bump2version
bumpversion --verbose ${{ inputs.part }}
- name: Push new version and tag
if: ${{ inputs.dry_run == 'false' }}
uses: ad-m/github-push-action@master
with:
github_token: ${{ secrets.LANCEDB_RELEASE_TOKEN }}
branch: main
tags: true
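
The publish workflows in this comparison are gated on tag prefixes: Cargo Publish and NPM Publish check `startsWith(github.ref, 'refs/tags/v')`, while PyPI Publish checks `startsWith(github.ref, 'refs/tags/python-v')`. The sketch below shows how a ref would be routed under those two conditions. The tag names in the example are illustrative assumptions, and `workflows_for_ref` is a hypothetical helper.

```python
from typing import List

# Prefixes and workflow names as they appear in the publish workflows.
RELEASE_PREFIXES = {
    "refs/tags/v": ["Cargo Publish", "NPM Publish"],
    "refs/tags/python-v": ["PyPI Publish"],
}


def workflows_for_ref(ref: str) -> List[str]:
    """Return the publish workflows whose tag-prefix condition matches ref."""
    matched = []
    for prefix, workflows in RELEASE_PREFIXES.items():
        if ref.startswith(prefix):
            matched.extend(workflows)
    return matched


if __name__ == "__main__":
    for ref in ("refs/tags/v0.4.17", "refs/tags/python-v0.6.10", "refs/heads/main"):
        print(ref, "->", workflows_for_ref(ref))
```

Because `refs/tags/python-v...` does not start with `refs/tags/v`, the two prefixes are disjoint, so a Python release should not satisfy the Node or Rust publish conditions and vice versa.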

@@ -8,51 +8,188 @@ on:
paths:
- python/**
- .github/workflows/python.yml
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true
jobs:
linux:
lint:
name: "Lint"
timeout-minutes: 30
strategy:
matrix:
python-minor-version: [ "8", "9", "10", "11" ]
runs-on: "ubuntu-22.04"
defaults:
run:
shell: bash
working-directory: python
steps:
- uses: actions/checkout@v3
with:
fetch-depth: 0
lfs: true
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: 3.${{ matrix.python-minor-version }}
- name: Install lancedb
run: |
pip install -e .
pip install pytest
- name: Run tests
run: pytest -x -v --durations=30 tests
mac:
- uses: actions/checkout@v4
with:
fetch-depth: 0
lfs: true
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.11"
- name: Install ruff
run: |
pip install ruff==0.2.2
- name: Format check
run: ruff format --check .
- name: Lint
run: ruff .
doctest:
name: "Doctest"
timeout-minutes: 30
runs-on: "macos-12"
runs-on: "ubuntu-22.04"
defaults:
run:
shell: bash
working-directory: python
steps:
- uses: actions/checkout@v3
with:
fetch-depth: 0
lfs: true
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: "3.10"
- name: Install lancedb
run: |
pip install -e .
pip install pytest
- name: Run tests
run: pytest -x -v --durations=30 tests
- uses: actions/checkout@v4
with:
fetch-depth: 0
lfs: true
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.11"
cache: "pip"
- name: Install protobuf
run: |
sudo apt update
sudo apt install -y protobuf-compiler
- uses: Swatinem/rust-cache@v2
with:
workspaces: python
- name: Install
run: |
pip install -e .[tests,dev,embeddings]
pip install tantivy
pip install mlx
- name: Doctest
run: pytest --doctest-modules python/lancedb
linux:
name: "Linux: python-3.${{ matrix.python-minor-version }}"
timeout-minutes: 30
strategy:
matrix:
python-minor-version: ["8", "11"]
runs-on: "ubuntu-22.04"
defaults:
run:
shell: bash
working-directory: python
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
lfs: true
- name: Install protobuf
run: |
sudo apt update
sudo apt install -y protobuf-compiler
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: 3.${{ matrix.python-minor-version }}
- uses: Swatinem/rust-cache@v2
with:
workspaces: python
- uses: ./.github/workflows/build_linux_wheel
- uses: ./.github/workflows/run_tests
with:
integration: true
# Make sure wheels are not included in the Rust cache
- name: Delete wheels
run: rm -rf target/wheels
platform:
name: "Mac: ${{ matrix.config.name }}"
timeout-minutes: 30
strategy:
matrix:
config:
- name: x86
runner: macos-13
- name: Arm
runner: macos-14
runs-on: "${{ matrix.config.runner }}"
defaults:
run:
shell: bash
working-directory: python
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
lfs: true
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.11"
- uses: Swatinem/rust-cache@v2
with:
workspaces: python
- uses: ./.github/workflows/build_mac_wheel
- uses: ./.github/workflows/run_tests
# Make sure wheels are not included in the Rust cache
- name: Delete wheels
run: rm -rf target/wheels
windows:
name: "Windows: ${{ matrix.config.name }}"
timeout-minutes: 30
strategy:
matrix:
config:
- name: x86
runner: windows-latest
runs-on: "${{ matrix.config.runner }}"
defaults:
run:
shell: bash
working-directory: python
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
lfs: true
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.11"
- uses: Swatinem/rust-cache@v2
with:
workspaces: python
- uses: ./.github/workflows/build_windows_wheel
- uses: ./.github/workflows/run_tests
# Make sure wheels are not included in the Rust cache
- name: Delete wheels
run: rm -rf target/wheels
pydantic1x:
timeout-minutes: 30
runs-on: "ubuntu-22.04"
defaults:
run:
shell: bash
working-directory: python
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
lfs: true
- name: Install dependencies
run: |
sudo apt update
sudo apt install -y protobuf-compiler
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: 3.9
- name: Install lancedb
run: |
pip install "pydantic<2"
pip install -e .[tests]
pip install tantivy
- name: Run tests
run: pytest -m "not slow and not s3_test" -x -v --durations=30 python/tests

.github/workflows/run_tests/action.yml vendored Normal file

@@ -0,0 +1,31 @@
name: run-tests
description: "Install lance wheel and run unit tests"
inputs:
python-minor-version:
required: true
description: "8 9 10 11 12"
integration:
required: false
description: "Run integration tests"
default: "false"
runs:
using: "composite"
steps:
- name: Install lancedb
shell: bash
run: |
pip3 install $(ls target/wheels/lancedb-*.whl)[tests,dev]
- name: Setup localstack for integration tests
if: ${{ inputs.integration == 'true' }}
shell: bash
working-directory: .
run: docker compose up --detach --wait
- name: pytest (with integration)
shell: bash
if: ${{ inputs.integration == 'true' }}
run: pytest -m "not slow" -x -v --durations=30 python/python/tests
- name: pytest (no integration tests)
shell: bash
if: ${{ inputs.integration != 'true' }}
run: pytest -m "not slow and not s3_test" -x -v --durations=30 python/python/tests

.github/workflows/rust.yml vendored Normal file

@@ -0,0 +1,134 @@
name: Rust
on:
push:
branches:
- main
pull_request:
paths:
- Cargo.toml
- rust/**
- .github/workflows/rust.yml
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true
env:
# This env var is used by Swatinem/rust-cache@v2 for the cache
# key, so we set it to make sure it is always consistent.
CARGO_TERM_COLOR: always
# Disable full debug symbol generation to speed up CI build and keep memory down
# "1" means line tables only, which is useful for panic tracebacks.
RUSTFLAGS: "-C debuginfo=1"
RUST_BACKTRACE: "1"
jobs:
lint:
timeout-minutes: 30
runs-on: ubuntu-22.04
defaults:
run:
shell: bash
working-directory: rust
env:
# Need up-to-date compilers for kernels
CC: gcc-12
CXX: g++-12
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
lfs: true
- uses: Swatinem/rust-cache@v2
with:
workspaces: rust
- name: Install dependencies
run: |
sudo apt update
sudo apt install -y protobuf-compiler libssl-dev
- name: Run format
run: cargo fmt --all -- --check
- name: Run clippy
run: cargo clippy --all --all-features -- -D warnings
linux:
timeout-minutes: 30
runs-on: ubuntu-22.04
defaults:
run:
shell: bash
working-directory: rust
env:
# Need up-to-date compilers for kernels
CC: gcc-12
CXX: g++-12
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
lfs: true
- uses: Swatinem/rust-cache@v2
with:
workspaces: rust
- name: Install dependencies
run: |
sudo apt update
sudo apt install -y protobuf-compiler libssl-dev
- name: Build
run: cargo build --all-features
- name: Start S3 integration test environment
working-directory: .
run: docker compose up --detach --wait
- name: Run tests
run: cargo test --all-features
- name: Run examples
run: cargo run --example simple
macos:
timeout-minutes: 30
strategy:
matrix:
mac-runner: [ "macos-13", "macos-14" ]
runs-on: "${{ matrix.mac-runner }}"
defaults:
run:
shell: bash
working-directory: rust
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
lfs: true
- name: CPU features
run: sysctl -a | grep cpu
- uses: Swatinem/rust-cache@v2
with:
workspaces: rust
- name: Install dependencies
run: brew install protobuf
- name: Build
run: cargo build --all-features
- name: Run tests
# Run with everything except the integration tests.
run: cargo test --features remote,fp16kernels
windows:
runs-on: windows-2022
steps:
- uses: actions/checkout@v4
- uses: Swatinem/rust-cache@v2
with:
workspaces: rust
- name: Install Protoc v21.12
working-directory: C:\
run: |
New-Item -Path 'C:\protoc' -ItemType Directory
Set-Location C:\protoc
Invoke-WebRequest https://github.com/protocolbuffers/protobuf/releases/download/v21.12/protoc-21.12-win64.zip -OutFile C:\protoc\protoc.zip
7z x protoc.zip
Add-Content $env:GITHUB_PATH "C:\protoc\bin"
shell: powershell
- name: Run tests
run: |
$env:VCPKG_ROOT = $env:VCPKG_INSTALLATION_ROOT
cargo build
cargo test


@@ -0,0 +1,26 @@
name: Trigger vectordb-recipes workflow
on:
push:
branches: [ main ]
pull_request:
paths:
- .github/workflows/trigger-vectordb-recipes.yml
workflow_dispatch:
jobs:
build:
runs-on: ubuntu-latest
steps:
- name: Trigger vectordb-recipes workflow
uses: actions/github-script@v6
with:
github-token: ${{ secrets.VECTORDB_RECIPES_ACTION_TOKEN }}
script: |
const result = await github.rest.actions.createWorkflowDispatch({
owner: 'lancedb',
repo: 'vectordb-recipes',
workflow_id: 'examples-test.yml',
ref: 'main'
});
console.log(result);


@@ -0,0 +1,33 @@
name: update_package_lock
description: "Update node's package-lock.json"
inputs:
github_token:
required: true
description: "github token for the repo"
runs:
using: "composite"
steps:
- uses: actions/setup-node@v3
with:
node-version: 20
- name: Set git configs
shell: bash
run: |
git config user.name 'Lance Release'
git config user.email 'lance-dev@lancedb.com'
- name: Update package-lock.json file
working-directory: ./node
run: |
npm install
git add package-lock.json
git commit -m "Updating package-lock.json"
shell: bash
- name: Push changes
if: ${{ inputs.dry_run != 'true' }}
uses: ad-m/github-push-action@master
with:
github_token: ${{ inputs.github_token }}
branch: main
tags: true


@@ -0,0 +1,33 @@
name: update_package_lock_nodejs
description: "Update nodejs's package-lock.json"
inputs:
github_token:
required: true
description: "github token for the repo"
runs:
using: "composite"
steps:
- uses: actions/setup-node@v3
with:
node-version: 20
- name: Set git configs
shell: bash
run: |
git config user.name 'Lance Release'
git config user.email 'lance-dev@lancedb.com'
- name: Update package-lock.json file
working-directory: ./nodejs
run: |
npm install
git add package-lock.json
git commit -m "Updating package-lock.json"
shell: bash
- name: Push changes
if: ${{ inputs.dry_run != 'true' }}
uses: ad-m/github-push-action@master
with:
github_token: ${{ inputs.github_token }}
branch: main
tags: true


@@ -0,0 +1,19 @@
name: Update package-lock.json
on:
workflow_dispatch:
jobs:
publish:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4
with:
ref: main
persist-credentials: false
fetch-depth: 0
lfs: true
- uses: ./.github/workflows/update_package_lock
with:
github_token: ${{ secrets.LANCEDB_RELEASE_TOKEN }}


@@ -0,0 +1,19 @@
name: Update NodeJs package-lock.json
on:
workflow_dispatch:
jobs:
publish:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4
with:
ref: main
persist-credentials: false
fetch-depth: 0
lfs: true
- uses: ./.github/workflows/update_package_lock_nodejs
with:
github_token: ${{ secrets.LANCEDB_RELEASE_TOKEN }}


@@ -0,0 +1,29 @@
name: upload-wheel
description: "Upload wheels to PyPI"
inputs:
os:
required: true
description: "ubuntu-22.04 or macos-13"
repo:
required: false
description: "pypi or testpypi"
default: "pypi"
token:
required: true
description: "release token for the repo"
runs:
using: "composite"
steps:
- name: Install dependencies
shell: bash
run: |
python -m pip install --upgrade pip
pip install twine
- name: Publish wheel
env:
TWINE_USERNAME: __token__
TWINE_PASSWORD: ${{ inputs.token }}
shell: bash
run: twine upload --repository ${{ inputs.repo }} target/wheels/lancedb-*.whl

.gitignore vendored

@@ -2,6 +2,10 @@
**/*.whl
*.egg-info
**/__pycache__
.DS_Store
venv
.vscode
rust/target
rust/Cargo.lock
@@ -14,6 +18,28 @@ site
python/build
python/dist
notebooks/.ipynb_checkpoints
**/.ipynb_checkpoints
**/.hypothesis
# Compiled Dynamic libraries
*.so
*.dylib
*.dll
## Javascript
*.node
**/node_modules
**/.DS_Store
node/dist
node/examples/**/package-lock.json
node/examples/**/dist
nodejs/lancedb/native*
dist
## Rust
target
**/sccache.log
Cargo.lock


@@ -5,7 +5,14 @@ repos:
- id: check-yaml
- id: end-of-file-fixer
- id: trailing-whitespace
- repo: https://github.com/psf/black
rev: 22.12.0
- repo: https://github.com/astral-sh/ruff-pre-commit
# Ruff version.
rev: v0.2.2
hooks:
- id: black
- id: ruff
- repo: https://github.com/pre-commit/mirrors-prettier
rev: v3.1.0
hooks:
- id: prettier
files: "nodejs/.*"
exclude: nodejs/lancedb/native.d.ts|nodejs/dist/.*

Cargo.toml Normal file

@@ -0,0 +1,43 @@
[workspace]
members = ["rust/ffi/node", "rust/lancedb", "nodejs", "python"]
# Python package needs to be built by maturin.
exclude = ["python"]
resolver = "2"
[workspace.package]
edition = "2021"
authors = ["LanceDB Devs <dev@lancedb.com>"]
license = "Apache-2.0"
repository = "https://github.com/lancedb/lancedb"
description = "Serverless, low-latency vector database for AI applications"
keywords = ["lancedb", "lance", "database", "vector", "search"]
categories = ["database-implementations"]
[workspace.dependencies]
lance = { "version" = "=0.10.15", "features" = ["dynamodb"] }
lance-index = { "version" = "=0.10.15" }
lance-linalg = { "version" = "=0.10.15" }
lance-testing = { "version" = "=0.10.15" }
# Note that this one does not include pyarrow
arrow = { version = "50.0", optional = false }
arrow-array = "50.0"
arrow-data = "50.0"
arrow-ipc = "50.0"
arrow-ord = "50.0"
arrow-schema = "50.0"
arrow-arith = "50.0"
arrow-cast = "50.0"
async-trait = "0"
chrono = "0.4.35"
half = { "version" = "=2.3.1", default-features = false, features = [
"num-traits",
] }
futures = "0"
log = "0.4"
object_store = "0.9.0"
pin-project = "1.0.7"
snafu = "0.7.4"
url = "2"
num-traits = "0.2"
regex = "1.10"
lazy_static = "1"

README.md

@@ -1,58 +1,87 @@
<div align="center">
<p align="center">
<img width="275" alt="LanceDB Logo" src="https://user-images.githubusercontent.com/917119/226205734-6063d87a-1ecc-45fe-85be-1dea6383a3d8.png">
**Developer-friendly, serverless vector database for AI applications**
<a href="https://lancedb.github.io/lancedb/">Documentation</a>
<a href="https://blog.eto.ai/">Blog</a>
<a href="https://discord.gg/zMM32dvNtd">Discord</a>
<a href="https://twitter.com/lancedb">Twitter</a>
</p>
</div>
<hr />
LanceDB is an open-source database for vector-search built with persistent storage, which greatly simplifies retrieval, filtering and management of embeddings.
The key features of LanceDB include:
* Production-scale vector search with no servers to manage.
* Optimized for multi-modal data (text, images, videos, point clouds and more).
* Native Python and Javascript/Typescript support (coming soon).
* Combine attribute-based information with vectors and store them as a single source-of-truth.
* Zero-copy, automatic versioning: manage versions of your data without needing extra infrastructure.
* Ecosystem integrations: Apache-Arrow, Pandas, Polars, DuckDB and more on the way.
LanceDB's core is written in Rust 🦀 and is built using <a href="https://github.com/eto-ai/lance">Lance</a>, an open-source columnar format designed for performant ML workloads.
## Quick Start
**Installation**
```shell
pip install lancedb
```
**Quickstart**
```python
import lancedb
uri = "/tmp/lancedb"
db = lancedb.connect(uri)
table = db.create_table("my_table",
data=[{"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
{"vector": [5.9, 26.5], "item": "bar", "price": 20.0}])
result = table.search([100, 100]).limit(2).to_df()
```
## Blogs, Tutorials & Videos
* 📈 <a href="https://blog.eto.ai/benchmarking-random-access-in-lance-ed690757a826">2000x better performance with Lance over Parquet</a>
* 🤖 <a href="https://github.com/lancedb/lancedb/blob/main/notebooks/youtube_transcript_search.ipynb">Build a question and answer bot with LanceDB</a>
<div align="center">
<p align="center">
<img width="275" alt="LanceDB Logo" src="https://github.com/lancedb/lancedb/assets/5846846/37d7c7ad-c2fd-4f56-9f16-fffb0d17c73a">
**Developer-friendly database for multimodal AI**
<a href='https://github.com/lancedb/vectordb-recipes/tree/main' target="_blank"><img alt='LanceDB' src='https://img.shields.io/badge/VectorDB_Recipes-100000?style=for-the-badge&logo=LanceDB&logoColor=white&labelColor=645cfb&color=645cfb'/></a>
<a href='https://lancedb.github.io/lancedb/' target="_blank"><img alt='lancdb' src='https://img.shields.io/badge/DOCS-100000?style=for-the-badge&logo=lancdb&logoColor=white&labelColor=645cfb&color=645cfb'/></a>
[![Blog](https://img.shields.io/badge/Blog-12100E?style=for-the-badge&logoColor=white)](https://blog.lancedb.com/)
[![Discord](https://img.shields.io/badge/Discord-%235865F2.svg?style=for-the-badge&logo=discord&logoColor=white)](https://discord.gg/zMM32dvNtd)
[![Twitter](https://img.shields.io/badge/Twitter-%231DA1F2.svg?style=for-the-badge&logo=Twitter&logoColor=white)](https://twitter.com/lancedb)
</p>
<img max-width="750px" alt="LanceDB Multimodal Search" src="https://github.com/lancedb/lancedb/assets/917119/09c5afc5-7816-4687-bae4-f2ca194426ec">
</p>
</div>
<hr />
LanceDB is an open-source database for vector-search built with persistent storage, which greatly simplifies retrieval, filtering and management of embeddings.
The key features of LanceDB include:
* Production-scale vector search with no servers to manage.
* Store, query and filter vectors, metadata and multi-modal data (text, images, videos, point clouds, and more).
* Support for vector similarity search, full-text search and SQL.
* Native Python and Javascript/Typescript support.
* Zero-copy, automatic versioning: manage versions of your data without needing extra infrastructure.
* GPU support for building vector indexes(*).
* Ecosystem integrations with [LangChain 🦜️🔗](https://python.langchain.com/en/latest/modules/indexes/vectorstores/examples/lanecdb.html), [LlamaIndex 🦙](https://gpt-index.readthedocs.io/en/latest/examples/vector_stores/LanceDBIndexDemo.html), Apache-Arrow, Pandas, Polars, DuckDB and more on the way.
LanceDB's core is written in Rust 🦀 and is built using <a href="https://github.com/lancedb/lance">Lance</a>, an open-source columnar format designed for performant ML workloads.
## Quick Start
**Javascript**
```shell
npm install vectordb
```
```javascript
const lancedb = require('vectordb');
const db = await lancedb.connect('data/sample-lancedb');
const table = await db.createTable({
name: 'vectors',
data: [
{ id: 1, vector: [0.1, 0.2], item: "foo", price: 10 },
{ id: 2, vector: [1.1, 1.2], item: "bar", price: 50 }
]
})
const query = table.search([0.1, 0.3]).limit(2);
const results = await query.execute();
// You can also search for rows by specific criteria without involving a vector search.
const rowsByCriteria = await table.search(undefined).where("price >= 10").execute();
```
**Python**
```shell
pip install lancedb
```
```python
import lancedb
uri = "data/sample-lancedb"
db = lancedb.connect(uri)
table = db.create_table("my_table",
data=[{"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
{"vector": [5.9, 26.5], "item": "bar", "price": 20.0}])
result = table.search([100, 100]).limit(2).to_pandas()
```
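To illustrate what the search call above computes, here is a minimal brute-force sketch in plain Python. It is not how LanceDB executes a query: it assumes Euclidean (L2) distance as the metric, and `brute_force_search` and the `distance` key are hypothetical names used only for this illustration.

```python
import math

# Same rows as the quickstart above.
rows = [
    {"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
    {"vector": [5.9, 26.5], "item": "bar", "price": 20.0},
]


def brute_force_search(rows, query, limit):
    """Return the `limit` rows closest to `query` by Euclidean (L2) distance.

    Hypothetical helper: it scans every row, which is what an exhaustive
    (non-indexed) nearest-neighbor search amounts to.
    """
    scored = [
        {**row, "distance": math.dist(row["vector"], query)}
        for row in rows
    ]
    return sorted(scored, key=lambda r: r["distance"])[:limit]


result = brute_force_search(rows, [100, 100], limit=2)
for row in result:
    print(row["item"], round(row["distance"], 2))
```

A real table avoids scanning every row once an ANN index is built; the sketch only shows the ranking that a nearest-neighbor query is expected to produce on this toy data.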
## Blogs, Tutorials & Videos
* 📈 <a href="https://blog.eto.ai/benchmarking-random-access-in-lance-ed690757a826">2000x better performance with Lance over Parquet</a>
* 🤖 <a href="https://github.com/lancedb/lancedb/blob/main/docs/src/notebooks/youtube_transcript_search.ipynb">Build a question and answer bot with LanceDB</a>

ci/build_linux_artifacts.sh Executable file

@@ -0,0 +1,21 @@
#!/bin/bash
set -e
ARCH=${1:-x86_64}
# We pass down the current user so that when we later mount the local files
# into the container, the files are accessible by the current user.
pushd ci/manylinux_node
docker build \
-t lancedb-node-manylinux \
--build-arg="ARCH=$ARCH" \
--build-arg="DOCKER_USER=$(id -u)" \
--progress=plain \
.
popd
# We turn on memory swap to avoid OOM killer
docker run \
-v $(pwd):/io -w /io \
--memory-swap=-1 \
lancedb-node-manylinux \
bash ci/manylinux_node/build.sh $ARCH


@@ -0,0 +1,21 @@
#!/bin/bash
set -e
ARCH=${1:-x86_64}
# We pass down the current user so that when we later mount the local files
# into the container, the files are accessible by the current user.
pushd ci/manylinux_nodejs
docker build \
-t lancedb-nodejs-manylinux \
--build-arg="ARCH=$ARCH" \
--build-arg="DOCKER_USER=$(id -u)" \
--progress=plain \
.
popd
# We turn on memory swap to avoid OOM killer
docker run \
-v $(pwd):/io -w /io \
--memory-swap=-1 \
lancedb-nodejs-manylinux \
bash ci/manylinux_nodejs/build.sh $ARCH


@@ -0,0 +1,34 @@
# Builds the macOS artifacts (node binaries).
# Usage: ./ci/build_macos_artifacts.sh [target]
# Targets supported: x86_64-apple-darwin aarch64-apple-darwin
set -e
prebuild_rust() {
# Building here for the sake of easier debugging.
pushd rust/ffi/node
echo "Building rust library for $1"
export RUST_BACKTRACE=1
cargo build --release --target $1
popd
}
build_node_binaries() {
pushd node
echo "Building node library for $1"
npm run build-release -- --target $1
npm run pack-build -- --target $1
popd
}
if [ -n "$1" ]; then
targets=$1
else
targets="x86_64-apple-darwin aarch64-apple-darwin"
fi
echo "Building artifacts for targets: $targets"
for target in $targets
do
prebuild_rust $target
build_node_binaries $target
done


@@ -0,0 +1,34 @@
# Builds the macOS artifacts (nodejs binaries).
# Usage: ./ci/build_macos_artifacts_nodejs.sh [target]
# Targets supported: x86_64-apple-darwin aarch64-apple-darwin
set -e
prebuild_rust() {
# Building here for the sake of easier debugging.
pushd rust/lancedb
echo "Building rust library for $1"
export RUST_BACKTRACE=1
cargo build --release --target $1
popd
}
build_node_binaries() {
pushd nodejs
echo "Building nodejs library for $1"
export RUST_TARGET=$1
npm run build-release
popd
}
if [ -n "$1" ]; then
targets=$1
else
targets="x86_64-apple-darwin aarch64-apple-darwin"
fi
echo "Building artifacts for targets: $targets"
for target in $targets
do
prebuild_rust $target
build_node_binaries $target
done


@@ -0,0 +1,41 @@
# Builds the Windows artifacts (node binaries).
# Usage: .\ci\build_windows_artifacts.ps1 [target]
# Targets supported:
# - x86_64-pc-windows-msvc
# - i686-pc-windows-msvc
function Prebuild-Rust {
param (
[string]$target
)
# Building here for the sake of easier debugging.
Push-Location -Path "rust/ffi/node"
Write-Host "Building rust library for $target"
$env:RUST_BACKTRACE=1
cargo build --release --target $target
Pop-Location
}
function Build-NodeBinaries {
param (
[string]$target
)
Push-Location -Path "node"
Write-Host "Building node library for $target"
npm run build-release -- --target $target
npm run pack-build -- --target $target
Pop-Location
}
$targets = $args[0]
if (-not $targets) {
$targets = "x86_64-pc-windows-msvc"
}
Write-Host "Building artifacts for targets: $targets"
foreach ($target in $targets) {
Prebuild-Rust $target
Build-NodeBinaries $target
}


@@ -0,0 +1,41 @@
# Builds the Windows artifacts (nodejs binaries).
# Usage: .\ci\build_windows_artifacts_nodejs.ps1 [target]
# Targets supported:
# - x86_64-pc-windows-msvc
# - i686-pc-windows-msvc
function Prebuild-Rust {
param (
[string]$target
)
# Building here for the sake of easier debugging.
Push-Location -Path "rust/lancedb"
Write-Host "Building rust library for $target"
$env:RUST_BACKTRACE=1
cargo build --release --target $target
Pop-Location
}
function Build-NodeBinaries {
param (
[string]$target
)
Push-Location -Path "nodejs"
Write-Host "Building nodejs library for $target"
$env:RUST_TARGET=$target
npm run build-release
Pop-Location
}
$targets = $args[0]
if (-not $targets) {
$targets = "x86_64-pc-windows-msvc"
}
Write-Host "Building artifacts for targets: $targets"
foreach ($target in $targets) {
Prebuild-Rust $target
Build-NodeBinaries $target
}


@@ -0,0 +1,31 @@
# Manylinux Dockerfile with Rust, Node, and Lance dependencies installed.
# This container allows building the node modules native libraries in an
# environment with a very old glibc, so that we are compatible with a wide
# range of linux distributions.
ARG ARCH=x86_64
FROM quay.io/pypa/manylinux2014_${ARCH}
ARG ARCH=x86_64
ARG DOCKER_USER=default_user
# Install static openssl
COPY install_openssl.sh install_openssl.sh
RUN ./install_openssl.sh ${ARCH} > /dev/null
# Protobuf is also installed as root.
COPY install_protobuf.sh install_protobuf.sh
RUN ./install_protobuf.sh ${ARCH}
ENV DOCKER_USER=${DOCKER_USER}
# Create a group and user
RUN echo ${ARCH} && adduser --user-group --create-home --uid ${DOCKER_USER} build_user
# We switch to the user to install Rust and Node, since those like to be
# installed at the user level.
USER ${DOCKER_USER}
COPY prepare_manylinux_node.sh prepare_manylinux_node.sh
RUN cp /prepare_manylinux_node.sh $HOME/ && \
cd $HOME && \
./prepare_manylinux_node.sh ${ARCH}

ci/manylinux_node/build.sh Executable file

@@ -0,0 +1,19 @@
#!/bin/bash
# Builds the node module for manylinux. Invoked by ci/build_linux_artifacts.sh.
set -e
ARCH=${1:-x86_64}
if [ "$ARCH" = "x86_64" ]; then
export OPENSSL_LIB_DIR=/usr/local/lib64/
else
export OPENSSL_LIB_DIR=/usr/local/lib/
fi
export OPENSSL_STATIC=1
export OPENSSL_INCLUDE_DIR=/usr/local/include/openssl
source $HOME/.bashrc
cd node
npm ci
npm run build-release
npm run pack-build


@@ -0,0 +1,26 @@
#!/bin/bash
# Builds OpenSSL from source so we can statically link to it.
# This avoids the following error from the system installation:
# /usr/bin/ld: <library>: version node not found for symbol SSLeay@@OPENSSL_1.0.1
# /usr/bin/ld: failed to set dynamic section sizes: Bad value
set -e
git clone -b OpenSSL_1_1_1u \
--single-branch \
https://github.com/openssl/openssl.git
pushd openssl
if [[ $1 == x86_64* ]]; then
ARCH=linux-x86_64
else
# gnu target
ARCH=linux-aarch64
fi
./Configure no-shared $ARCH
make
make install


@@ -0,0 +1,15 @@
#!/bin/bash
# Installs protobuf compiler. Should be run as root.
set -e
if [[ $1 == x86_64* ]]; then
ARCH=x86_64
else
# gnu target
ARCH=aarch_64
fi
PB_REL=https://github.com/protocolbuffers/protobuf/releases
PB_VERSION=23.1
curl -LO $PB_REL/download/v$PB_VERSION/protoc-$PB_VERSION-linux-$ARCH.zip
unzip protoc-$PB_VERSION-linux-$ARCH.zip -d /usr/local


@@ -0,0 +1,21 @@
#!/bin/bash
set -e
install_node() {
echo "Installing node..."
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.34.0/install.sh | bash
source "$HOME"/.bashrc
nvm install --no-progress 16
}
install_rust() {
echo "Installing rust..."
curl https://sh.rustup.rs -sSf | bash -s -- -y
export PATH="$PATH:$HOME/.cargo/bin"
}
install_node
install_rust


@@ -0,0 +1,31 @@
# Manylinux Dockerfile with Rust, Node, and Lance dependencies installed.
# This container allows building the node modules native libraries in an
# environment with a very old glibc, so that we are compatible with a wide
# range of linux distributions.
ARG ARCH=x86_64
FROM quay.io/pypa/manylinux2014_${ARCH}
ARG ARCH=x86_64
ARG DOCKER_USER=default_user
# Install static openssl
COPY install_openssl.sh install_openssl.sh
RUN ./install_openssl.sh ${ARCH} > /dev/null
# Protobuf is also installed as root.
COPY install_protobuf.sh install_protobuf.sh
RUN ./install_protobuf.sh ${ARCH}
ENV DOCKER_USER=${DOCKER_USER}
# Create a group and user
RUN echo ${ARCH} && adduser --user-group --create-home --uid ${DOCKER_USER} build_user
# We switch to the user to install Rust and Node, since those like to be
# installed at the user level.
USER ${DOCKER_USER}
COPY prepare_manylinux_node.sh prepare_manylinux_node.sh
RUN cp /prepare_manylinux_node.sh $HOME/ && \
cd $HOME && \
./prepare_manylinux_node.sh ${ARCH}

ci/manylinux_nodejs/build.sh Executable file

@@ -0,0 +1,18 @@
#!/bin/bash
# Builds the nodejs module for manylinux. Invoked by ci/build_linux_artifacts_nodejs.sh.
set -e
ARCH=${1:-x86_64}
if [ "$ARCH" = "x86_64" ]; then
export OPENSSL_LIB_DIR=/usr/local/lib64/
else
export OPENSSL_LIB_DIR=/usr/local/lib/
fi
export OPENSSL_STATIC=1
export OPENSSL_INCLUDE_DIR=/usr/local/include/openssl
source $HOME/.bashrc
cd nodejs
npm ci
npm run build-release


@@ -0,0 +1,26 @@
#!/bin/bash
# Builds OpenSSL from source so we can statically link to it.
# This avoids the following error from the system installation:
# /usr/bin/ld: <library>: version node not found for symbol SSLeay@@OPENSSL_1.0.1
# /usr/bin/ld: failed to set dynamic section sizes: Bad value
set -e
git clone -b OpenSSL_1_1_1u \
--single-branch \
https://github.com/openssl/openssl.git
pushd openssl
if [[ $1 == x86_64* ]]; then
ARCH=linux-x86_64
else
# gnu target
ARCH=linux-aarch64
fi
./Configure no-shared $ARCH
make
make install


@@ -0,0 +1,15 @@
#!/bin/bash
# Installs protobuf compiler. Should be run as root.
set -e
if [[ $1 == x86_64* ]]; then
ARCH=x86_64
else
# gnu target
ARCH=aarch_64
fi
PB_REL=https://github.com/protocolbuffers/protobuf/releases
PB_VERSION=23.1
curl -LO $PB_REL/download/v$PB_VERSION/protoc-$PB_VERSION-linux-$ARCH.zip
unzip protoc-$PB_VERSION-linux-$ARCH.zip -d /usr/local


@@ -0,0 +1,21 @@
#!/bin/bash
set -e
install_node() {
echo "Installing node..."
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.34.0/install.sh | bash
source "$HOME"/.bashrc
nvm install --no-progress 16
}
install_rust() {
echo "Installing rust..."
curl https://sh.rustup.rs -sSf | bash -s -- -y
export PATH="$PATH:$HOME/.cargo/bin"
}
install_node
install_rust

docker-compose.yml Normal file

@@ -0,0 +1,18 @@
version: "3.9"
services:
localstack:
image: localstack/localstack:3.3
ports:
- 4566:4566
environment:
- SERVICES=s3,dynamodb,kms
- DEBUG=1
- LS_LOG=trace
- DOCKER_HOST=unix:///var/run/docker.sock
- AWS_ACCESS_KEY_ID=ACCESSKEY
- AWS_SECRET_ACCESS_KEY=SECRETKEY
healthcheck:
test: [ "CMD", "curl", "-s", "http://localhost:4566/_localstack/health" ]
interval: 5s
retries: 3
start_period: 10s

dockerfiles/Dockerfile Normal file

@@ -0,0 +1,27 @@
# Simple base Dockerfile with the basic dependencies required to run lance with FTS and hybrid search.
# Usage: docker build -t lancedb:latest -f Dockerfile .
FROM python:3.10-slim-buster
# Install Rust
RUN apt-get update && apt-get install -y curl build-essential && \
curl https://sh.rustup.rs -sSf | sh -s -- -y
# Set the environment variable for Rust
ENV PATH="/root/.cargo/bin:${PATH}"
# Install protobuf compiler
RUN apt-get install -y protobuf-compiler && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
RUN apt-get -y update &&\
apt-get -y upgrade && \
apt-get -y install git
# Verify installations
RUN python --version && \
rustc --version && \
protoc --version
RUN pip install tantivy lancedb

docs/README.md Normal file

@@ -0,0 +1,44 @@
# LanceDB Documentation
LanceDB docs are deployed to https://lancedb.github.io/lancedb/.
The docs are built and deployed automatically by [GitHub Actions](.github/workflows/docs.yml)
whenever a commit is pushed to the `main` branch, so the published docs may show
unreleased features.
## Building the docs
### Setup
1. Install LanceDB. From LanceDB repo root: `pip install -e python`
2. Install dependencies. From LanceDB repo root: `pip install -r docs/requirements.txt`
3. Make sure you have node and npm set up
4. Make sure protobuf and libssl are installed
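Steps 3 and 4 can be partly checked with a short script. This is a minimal sketch, assuming the command names `node`, `npm` and `protoc` (the protobuf compiler); `missing_tools` is a hypothetical helper and not part of LanceDB. libssl is a library rather than a command, so it is not covered here.

```python
import shutil


def missing_tools(tools):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


# Assumed command names for the prerequisites listed above.
required = ["node", "npm", "protoc"]

missing = missing_tools(required)
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All command-line prerequisites found.")
```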
### Build the node module and create markdown files
See [Javascript docs README](./src/javascript/README.md)
### Build docs
From LanceDB repo root:
Run: `PYTHONPATH=. mkdocs build -f docs/mkdocs.yml`
If successful, you should see a `docs/site` directory that you can verify locally.
### Run local server
You can run a local server to test the docs prior to deployment by navigating to the `docs` directory and running the following command:
```bash
cd docs
mkdocs serve
```
### Run doctest for typescript example
```bash
cd lancedb/docs
npm i
npm run build
npm run all
```


@@ -1,29 +1,241 @@
site_name: LanceDB Documentation
site_name: LanceDB
site_url: https://lancedb.github.io/lancedb/
repo_url: https://github.com/lancedb/lancedb
edit_uri: https://github.com/lancedb/lancedb/tree/main/docs/src
repo_name: lancedb/lancedb
docs_dir: src
theme:
name: "material"
logo: assets/logo.png
favicon: assets/logo.png
palette:
# Palette toggle for light mode
- scheme: lancedb
primary: custom
toggle:
icon: material/weather-night
name: Switch to dark mode
# Palette toggle for dark mode
- scheme: slate
primary: custom
toggle:
icon: material/weather-sunny
name: Switch to light mode
features:
- content.code.copy
- content.tabs.link
- content.action.edit
- toc.follow
- navigation.top
- navigation.tabs
- navigation.tabs.sticky
- navigation.footer
- navigation.tracking
- navigation.instant
icon:
repo: fontawesome/brands/github
custom_dir: overrides
plugins:
- search
- mkdocstrings
- mkdocs-jupyter
nav:
- Home: index.md
- Basics: basic.md
- Embeddings: embedding.md
- Indexing: ann_indexes.md
- Integrations: integrations.md
- Python API: python.md
- search
- autorefs
- mkdocstrings:
handlers:
python:
paths: [../python]
options:
docstring_style: numpy
heading_level: 3
show_source: true
show_symbol_type_in_heading: true
show_signature_annotations: true
show_root_heading: true
members_order: source
import:
# for cross references
- https://arrow.apache.org/docs/objects.inv
- https://pandas.pydata.org/docs/objects.inv
- mkdocs-jupyter
markdown_extensions:
- pymdownx.highlight:
anchor_linenums: true
line_spans: __span
pygments_lang_class: true
- pymdownx.inlinehilite
- pymdownx.snippets
- pymdownx.superfences
- admonition
- footnotes
- pymdownx.details
- pymdownx.highlight:
anchor_linenums: true
line_spans: __span
pygments_lang_class: true
- pymdownx.inlinehilite
- pymdownx.snippets:
base_path: ..
dedent_subsections: true
- pymdownx.superfences
- pymdownx.tabbed:
alternate_style: true
- md_in_html
- attr_list
nav:
- Home:
- LanceDB: index.md
- 🏃🏼‍♂️ Quick start: basic.md
- 📚 Concepts:
- Vector search: concepts/vector_search.md
- Indexing: concepts/index_ivfpq.md
- Storage: concepts/storage.md
- Data management: concepts/data_management.md
- 🔨 Guides:
- Working with tables: guides/tables.md
- Building an ANN index: ann_indexes.md
- Vector Search: search.md
- Full-text search: fts.md
- Hybrid search:
- Overview: hybrid_search/hybrid_search.md
- Comparing Rerankers: hybrid_search/eval.md
- Airbnb financial data example: notebooks/hybrid_search.ipynb
- Reranking:
- Quickstart: reranking/index.md
- Cohere Reranker: reranking/cohere.md
- Linear Combination Reranker: reranking/linear_combination.md
- Cross Encoder Reranker: reranking/cross_encoder.md
- ColBERT Reranker: reranking/colbert.md
- OpenAI Reranker: reranking/openai.md
- Building Custom Rerankers: reranking/custom_reranker.md
- Filtering: sql.md
- Versioning & Reproducibility: notebooks/reproducibility.ipynb
- Configuring Storage: guides/storage.md
- Sync -> Async Migration Guide: migration.md
- 🧬 Managing embeddings:
- Overview: embeddings/index.md
- Embedding functions: embeddings/embedding_functions.md
- Available models: embeddings/default_embedding_functions.md
- User-defined embedding functions: embeddings/custom_embedding_function.md
- "Example: Multi-lingual semantic search": notebooks/multi_lingual_example.ipynb
- "Example: MultiModal CLIP Embeddings": notebooks/DisappearingEmbeddingFunction.ipynb
- 🔌 Integrations:
- Tools and data formats: integrations/index.md
- Pandas and PyArrow: python/pandas_and_pyarrow.md
- Polars: python/polars_arrow.md
- DuckDB: python/duckdb.md
- LangChain:
- LangChain 🔗: https://python.langchain.com/docs/integrations/vectorstores/lancedb/
- LangChain JS/TS 🔗: https://js.langchain.com/docs/integrations/vectorstores/lancedb
- LlamaIndex 🦙: https://docs.llamaindex.ai/en/stable/examples/vector_stores/LanceDBIndexDemo/
- Pydantic: python/pydantic.md
- Voxel51: integrations/voxel51.md
- PromptTools: integrations/prompttools.md
- 🎯 Examples:
- Overview: examples/index.md
- 🐍 Python:
- Overview: examples/examples_python.md
- YouTube Transcript Search: notebooks/youtube_transcript_search.ipynb
- Documentation QA Bot using LangChain: notebooks/code_qa_bot.ipynb
- Multimodal search using CLIP: notebooks/multimodal_search.ipynb
- Example - Calculate CLIP Embeddings with Roboflow Inference: examples/image_embeddings_roboflow.md
- Serverless QA Bot with S3 and Lambda: examples/serverless_lancedb_with_s3_and_lambda.md
- Serverless QA Bot with Modal: examples/serverless_qa_bot_with_modal_and_langchain.md
- 👾 JavaScript:
- Overview: examples/examples_js.md
- Serverless Website Chatbot: examples/serverless_website_chatbot.md
- YouTube Transcript Search: examples/youtube_transcript_bot_with_nodejs.md
- TransformersJS Embedding Search: examples/transformerjs_embedding_search_nodejs.md
- 🦀 Rust:
- Overview: examples/examples_rust.md
- 💭 FAQs: faq.md
- ⚙️ API reference:
- 🐍 Python: python/python.md
- 👾 JavaScript (vectordb): javascript/modules.md
- 👾 JavaScript (lancedb): javascript/modules.md
- 🦀 Rust: https://docs.rs/lancedb/latest/lancedb/
- ☁️ LanceDB Cloud:
- Overview: cloud/index.md
- API reference:
- 🐍 Python: python/saas-python.md
- 👾 JavaScript: javascript/saas-modules.md
- Quick start: basic.md
- Concepts:
- Vector search: concepts/vector_search.md
- Indexing: concepts/index_ivfpq.md
- Storage: concepts/storage.md
- Data management: concepts/data_management.md
- Guides:
- Working with tables: guides/tables.md
- Building an ANN index: ann_indexes.md
- Vector Search: search.md
- Full-text search: fts.md
- Hybrid search:
- Overview: hybrid_search/hybrid_search.md
- Comparing Rerankers: hybrid_search/eval.md
- Airbnb financial data example: notebooks/hybrid_search.ipynb
- Reranking:
- Quickstart: reranking/index.md
- Cohere Reranker: reranking/cohere.md
- Linear Combination Reranker: reranking/linear_combination.md
- Cross Encoder Reranker: reranking/cross_encoder.md
- ColBERT Reranker: reranking/colbert.md
- OpenAI Reranker: reranking/openai.md
- Building Custom Rerankers: reranking/custom_reranker.md
- Filtering: sql.md
- Versioning & Reproducibility: notebooks/reproducibility.ipynb
- Configuring Storage: guides/storage.md
- Sync -> Async Migration Guide: migration.md
- Managing Embeddings:
- Overview: embeddings/index.md
- Embedding functions: embeddings/embedding_functions.md
- Available models: embeddings/default_embedding_functions.md
- User-defined embedding functions: embeddings/custom_embedding_function.md
- "Example: Multi-lingual semantic search": notebooks/multi_lingual_example.ipynb
- "Example: MultiModal CLIP Embeddings": notebooks/DisappearingEmbeddingFunction.ipynb
- Integrations:
- Overview: integrations/index.md
- Pandas and PyArrow: python/pandas_and_pyarrow.md
- Polars: python/polars_arrow.md
- DuckDB: python/duckdb.md
- LangChain 🦜️🔗↗: https://python.langchain.com/docs/integrations/vectorstores/lancedb
- LangChain.js 🦜️🔗↗: https://js.langchain.com/docs/integrations/vectorstores/lancedb
- LlamaIndex 🦙↗: https://gpt-index.readthedocs.io/en/latest/examples/vector_stores/LanceDBIndexDemo.html
- Pydantic: python/pydantic.md
- Voxel51: integrations/voxel51.md
- PromptTools: integrations/prompttools.md
- Examples:
- examples/index.md
- YouTube Transcript Search: notebooks/youtube_transcript_search.ipynb
- Documentation QA Bot using LangChain: notebooks/code_qa_bot.ipynb
- Multimodal search using CLIP: notebooks/multimodal_search.ipynb
- Serverless QA Bot with S3 and Lambda: examples/serverless_lancedb_with_s3_and_lambda.md
- Serverless QA Bot with Modal: examples/serverless_qa_bot_with_modal_and_langchain.md
- YouTube Transcript Search (JS): examples/youtube_transcript_bot_with_nodejs.md
- Serverless Chatbot from any website: examples/serverless_website_chatbot.md
- TransformersJS Embedding Search: examples/transformerjs_embedding_search_nodejs.md
- API reference:
- Overview: api_reference.md
- Python: python/python.md
- Javascript (vectordb): javascript/modules.md
- Javascript (lancedb): js/modules.md
- Rust: https://docs.rs/lancedb/latest/lancedb/index.html
- LanceDB Cloud:
- Overview: cloud/index.md
- API reference:
- 🐍 Python: python/saas-python.md
- 👾 JavaScript: javascript/saas-modules.md
extra_css:
- styles/global.css
- styles/extra.css
extra_javascript:
- "extra_js/init_ask_ai_widget.js"
extra:
analytics:
provider: google
property: G-B7NFM40W74
social:
- icon: fontawesome/brands/github
link: https://github.com/lancedb/lancedb
- icon: fontawesome/brands/x-twitter
link: https://twitter.com/lancedb
- icon: fontawesome/brands/linkedin
link: https://www.linkedin.com/company/lancedb

View File

@@ -0,0 +1,176 @@
<!--
Copyright (c) 2016-2023 Martin Donath <martin.donath@squidfunk.com>
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to
deal in the Software without restriction, including without limitation the
rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
sell copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
IN THE SOFTWARE.
-->
{% set class = "md-header" %}
{% if "navigation.tabs.sticky" in features %}
{% set class = class ~ " md-header--shadow md-header--lifted" %}
{% elif "navigation.tabs" not in features %}
{% set class = class ~ " md-header--shadow" %}
{% endif %}
<!-- Header -->
<header class="{{ class }}" data-md-component="header">
<nav
class="md-header__inner md-grid"
aria-label="{{ lang.t('header') }}"
>
<!-- Link to home -->
<a
href="{{ config.extra.homepage | d(nav.homepage.url, true) | url }}"
title="{{ config.site_name | e }}"
class="md-header__button md-logo"
aria-label="{{ config.site_name }}"
data-md-component="logo"
>
{% include "partials/logo.html" %}
</a>
<!-- Button to open drawer -->
<label class="md-header__button md-icon" for="__drawer">
{% include ".icons/material/menu" ~ ".svg" %}
</label>
<!-- Header title -->
<div class="md-header__title" style="width: auto !important;" data-md-component="header-title">
<div class="md-header__ellipsis">
<div class="md-header__topic">
<span class="md-ellipsis">
{{ config.site_name }}
</span>
</div>
<div class="md-header__topic" data-md-component="header-topic">
<span class="md-ellipsis">
{% if page.meta and page.meta.title %}
{{ page.meta.title }}
{% else %}
{{ page.title }}
{% endif %}
</span>
</div>
</div>
</div>
<!-- Color palette -->
{% if config.theme.palette %}
{% if not config.theme.palette is mapping %}
<form class="md-header__option" data-md-component="palette">
{% for option in config.theme.palette %}
{% set scheme = option.scheme | d("default", true) %}
{% set primary = option.primary | d("indigo", true) %}
{% set accent = option.accent | d("indigo", true) %}
<input
class="md-option"
data-md-color-media="{{ option.media }}"
data-md-color-scheme="{{ scheme | replace(' ', '-') }}"
data-md-color-primary="{{ primary | replace(' ', '-') }}"
data-md-color-accent="{{ accent | replace(' ', '-') }}"
{% if option.toggle %}
aria-label="{{ option.toggle.name }}"
{% else %}
aria-hidden="true"
{% endif %}
type="radio"
name="__palette"
id="__palette_{{ loop.index }}"
/>
{% if option.toggle %}
<label
class="md-header__button md-icon"
title="{{ option.toggle.name }}"
for="__palette_{{ loop.index0 or loop.length }}"
hidden
>
{% include ".icons/" ~ option.toggle.icon ~ ".svg" %}
</label>
{% endif %}
{% endfor %}
</form>
{% endif %}
{% endif %}
<!-- Site language selector -->
{% if config.extra.alternate %}
<div class="md-header__option">
<div class="md-select">
{% set icon = config.theme.icon.alternate or "material/translate" %}
<button
class="md-header__button md-icon"
aria-label="{{ lang.t('select.language') }}"
>
{% include ".icons/" ~ icon ~ ".svg" %}
</button>
<div class="md-select__inner">
<ul class="md-select__list">
{% for alt in config.extra.alternate %}
<li class="md-select__item">
<a
href="{{ alt.link | url }}"
hreflang="{{ alt.lang }}"
class="md-select__link"
>
{{ alt.name }}
</a>
</li>
{% endfor %}
</ul>
</div>
</div>
</div>
{% endif %}
<!-- Button to open search modal -->
{% if "material/search" in config.plugins %}
<label class="md-header__button md-icon" for="__search">
{% include ".icons/material/magnify.svg" %}
</label>
<!-- Search interface -->
{% include "partials/search.html" %}
{% endif %}
<div style="margin-left: 10px; margin-right: 5px;">
<a href="https://discord.com/invite/zMM32dvNtd" target="_blank" rel="noopener noreferrer">
<svg fill="#FFFFFF" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 50 50" width="25px" height="25px"><path d="M 41.625 10.769531 C 37.644531 7.566406 31.347656 7.023438 31.078125 7.003906 C 30.660156 6.96875 30.261719 7.203125 30.089844 7.589844 C 30.074219 7.613281 29.9375 7.929688 29.785156 8.421875 C 32.417969 8.867188 35.652344 9.761719 38.578125 11.578125 C 39.046875 11.867188 39.191406 12.484375 38.902344 12.953125 C 38.710938 13.261719 38.386719 13.429688 38.050781 13.429688 C 37.871094 13.429688 37.6875 13.378906 37.523438 13.277344 C 32.492188 10.15625 26.210938 10 25 10 C 23.789063 10 17.503906 10.15625 12.476563 13.277344 C 12.007813 13.570313 11.390625 13.425781 11.101563 12.957031 C 10.808594 12.484375 10.953125 11.871094 11.421875 11.578125 C 14.347656 9.765625 17.582031 8.867188 20.214844 8.425781 C 20.0625 7.929688 19.925781 7.617188 19.914063 7.589844 C 19.738281 7.203125 19.34375 6.960938 18.921875 7.003906 C 18.652344 7.023438 12.355469 7.566406 8.320313 10.8125 C 6.214844 12.761719 2 24.152344 2 34 C 2 34.175781 2.046875 34.34375 2.132813 34.496094 C 5.039063 39.605469 12.972656 40.941406 14.78125 41 C 14.789063 41 14.800781 41 14.8125 41 C 15.132813 41 15.433594 40.847656 15.621094 40.589844 L 17.449219 38.074219 C 12.515625 36.800781 9.996094 34.636719 9.851563 34.507813 C 9.4375 34.144531 9.398438 33.511719 9.765625 33.097656 C 10.128906 32.683594 10.761719 32.644531 11.175781 33.007813 C 11.234375 33.0625 15.875 37 25 37 C 34.140625 37 38.78125 33.046875 38.828125 33.007813 C 39.242188 32.648438 39.871094 32.683594 40.238281 33.101563 C 40.601563 33.515625 40.5625 34.144531 40.148438 34.507813 C 40.003906 34.636719 37.484375 36.800781 32.550781 38.074219 L 34.378906 40.589844 C 34.566406 40.847656 34.867188 41 35.1875 41 C 35.199219 41 35.210938 41 35.21875 41 C 37.027344 40.941406 44.960938 39.605469 47.867188 34.496094 C 47.953125 34.34375 48 34.175781 48 34 C 48 24.152344 43.785156 12.761719 41.625 10.769531 Z M 18.5 30 C 16.566406 30 15 
28.210938 15 26 C 15 23.789063 16.566406 22 18.5 22 C 20.433594 22 22 23.789063 22 26 C 22 28.210938 20.433594 30 18.5 30 Z M 31.5 30 C 29.566406 30 28 28.210938 28 26 C 28 23.789063 29.566406 22 31.5 22 C 33.433594 22 35 23.789063 35 26 C 35 28.210938 33.433594 30 31.5 30 Z"/></svg>
</a>
</div>
<div style="margin-left: 5px; margin-right: 5px;">
<a href="https://twitter.com/lancedb" target="_blank" rel="noopener noreferrer">
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" viewBox="0,0,256,256" width="25px" height="25px" fill-rule="nonzero"><g fill-opacity="0" fill="#ffffff" fill-rule="nonzero" stroke="none" stroke-width="1" stroke-linecap="butt" stroke-linejoin="miter" stroke-miterlimit="10" stroke-dasharray="" stroke-dashoffset="0" font-family="none" font-weight="none" font-size="none" text-anchor="none" style="mix-blend-mode: normal"><path d="M0,256v-256h256v256z" id="bgRectangle"></path></g><g fill="#ffffff" fill-rule="nonzero" stroke="none" stroke-width="1" stroke-linecap="butt" stroke-linejoin="miter" stroke-miterlimit="10" stroke-dasharray="" stroke-dashoffset="0" font-family="none" font-weight="none" font-size="none" text-anchor="none" style="mix-blend-mode: normal"><g transform="scale(4,4)"><path d="M57,17.114c-1.32,1.973 -2.991,3.707 -4.916,5.097c0.018,0.423 0.028,0.847 0.028,1.274c0,13.013 -9.902,28.018 -28.016,28.018c-5.562,0 -12.81,-1.948 -15.095,-4.423c0.772,0.092 1.556,0.138 2.35,0.138c4.615,0 8.861,-1.575 12.23,-4.216c-4.309,-0.079 -7.946,-2.928 -9.199,-6.84c1.96,0.308 4.447,-0.17 4.447,-0.17c0,0 -7.7,-1.322 -7.899,-9.779c2.226,1.291 4.46,1.231 4.46,1.231c0,0 -4.441,-2.734 -4.379,-8.195c0.037,-3.221 1.331,-4.953 1.331,-4.953c8.414,10.361 20.298,10.29 20.298,10.29c0,0 -0.255,-1.471 -0.255,-2.243c0,-5.437 4.408,-9.847 9.847,-9.847c2.832,0 5.391,1.196 7.187,3.111c2.245,-0.443 4.353,-1.263 6.255,-2.391c-0.859,3.44 -4.329,5.448 -4.329,5.448c0,0 2.969,-0.329 5.655,-1.55z"></path></g></g></svg>
</a>
</div>
<!-- Repository information -->
{% if config.repo_url %}
<div class="md-header__source" style="margin-left: -5px !important;">
{% include "partials/source.html" %}
</div>
{% endif %}
</nav>
<!-- Navigation tabs (sticky) -->
{% if "navigation.tabs.sticky" in features %}
{% if "navigation.tabs" in features %}
{% include "partials/tabs.html" %}
{% endif %}
{% endif %}
</header>

132
docs/package-lock.json generated Normal file
View File

@@ -0,0 +1,132 @@
{
"name": "lancedb-docs-test",
"version": "1.0.0",
"lockfileVersion": 3,
"requires": true,
"packages": {
"": {
"name": "lancedb-docs-test",
"version": "1.0.0",
"license": "Apache 2",
"dependencies": {
"apache-arrow": "file:../node/node_modules/apache-arrow",
"vectordb": "file:../node"
},
"devDependencies": {
"@types/node": "^20.11.8",
"typescript": "^5.3.3"
}
},
"../node": {
"name": "vectordb",
"version": "0.4.6",
"cpu": [
"x64",
"arm64"
],
"license": "Apache-2.0",
"os": [
"darwin",
"linux",
"win32"
],
"dependencies": {
"@apache-arrow/ts": "^14.0.2",
"@neon-rs/load": "^0.0.74",
"apache-arrow": "^14.0.2",
"axios": "^1.4.0"
},
"devDependencies": {
"@neon-rs/cli": "^0.0.160",
"@types/chai": "^4.3.4",
"@types/chai-as-promised": "^7.1.5",
"@types/mocha": "^10.0.1",
"@types/node": "^18.16.2",
"@types/sinon": "^10.0.15",
"@types/temp": "^0.9.1",
"@types/uuid": "^9.0.3",
"@typescript-eslint/eslint-plugin": "^5.59.1",
"cargo-cp-artifact": "^0.1",
"chai": "^4.3.7",
"chai-as-promised": "^7.1.1",
"eslint": "^8.39.0",
"eslint-config-standard-with-typescript": "^34.0.1",
"eslint-plugin-import": "^2.26.0",
"eslint-plugin-n": "^15.7.0",
"eslint-plugin-promise": "^6.1.1",
"mocha": "^10.2.0",
"openai": "^4.24.1",
"sinon": "^15.1.0",
"temp": "^0.9.4",
"ts-node": "^10.9.1",
"ts-node-dev": "^2.0.0",
"typedoc": "^0.24.7",
"typedoc-plugin-markdown": "^3.15.3",
"typescript": "*",
"uuid": "^9.0.0"
},
"optionalDependencies": {
"@lancedb/vectordb-darwin-arm64": "0.4.6",
"@lancedb/vectordb-darwin-x64": "0.4.6",
"@lancedb/vectordb-linux-arm64-gnu": "0.4.6",
"@lancedb/vectordb-linux-x64-gnu": "0.4.6",
"@lancedb/vectordb-win32-x64-msvc": "0.4.6"
}
},
"../node/node_modules/apache-arrow": {
"version": "14.0.2",
"license": "Apache-2.0",
"dependencies": {
"@types/command-line-args": "5.2.0",
"@types/command-line-usage": "5.0.2",
"@types/node": "20.3.0",
"@types/pad-left": "2.1.1",
"command-line-args": "5.2.1",
"command-line-usage": "7.0.1",
"flatbuffers": "23.5.26",
"json-bignum": "^0.0.3",
"pad-left": "^2.1.0",
"tslib": "^2.5.3"
},
"bin": {
"arrow2csv": "bin/arrow2csv.js"
}
},
"node_modules/@types/node": {
"version": "20.11.8",
"resolved": "https://registry.npmjs.org/@types/node/-/node-20.11.8.tgz",
"integrity": "sha512-i7omyekpPTNdv4Jb/Rgqg0RU8YqLcNsI12quKSDkRXNfx7Wxdm6HhK1awT3xTgEkgxPn3bvnSpiEAc7a7Lpyow==",
"dev": true,
"dependencies": {
"undici-types": "~5.26.4"
}
},
"node_modules/apache-arrow": {
"resolved": "../node/node_modules/apache-arrow",
"link": true
},
"node_modules/typescript": {
"version": "5.3.3",
"resolved": "https://registry.npmjs.org/typescript/-/typescript-5.3.3.tgz",
"integrity": "sha512-pXWcraxM0uxAS+tN0AG/BF2TyqmHO014Z070UsJ+pFvYuRSq8KH8DmWpnbXe0pEPDHXZV3FcAbJkijJ5oNEnWw==",
"dev": true,
"bin": {
"tsc": "bin/tsc",
"tsserver": "bin/tsserver"
},
"engines": {
"node": ">=14.17"
}
},
"node_modules/undici-types": {
"version": "5.26.5",
"resolved": "https://registry.npmjs.org/undici-types/-/undici-types-5.26.5.tgz",
"integrity": "sha512-JlCMO+ehdEIKqlFxk6IfVoAUVmgz7cU7zD/h9XZ0qzeosSHmUJVOzSQvvYSYWXkFXC+IfLKSIffhv0sVZup6pA==",
"dev": true
},
"node_modules/vectordb": {
"resolved": "../node",
"link": true
}
}
}

20
docs/package.json Normal file
View File

@@ -0,0 +1,20 @@
{
"name": "lancedb-docs-test",
"version": "1.0.0",
"description": "auto-generated tests from doc",
"author": "dev@lancedb.com",
"license": "Apache 2",
"dependencies": {
"apache-arrow": "file:../node/node_modules/apache-arrow",
"vectordb": "file:../node"
},
"scripts": {
"build": "tsc -b && cd ../node && npm run build-release",
"example": "npm run build && node",
"test": "npm run build && ls dist/*.js | xargs -n 1 node"
},
"devDependencies": {
"@types/node": "^20.11.8",
"typescript": "^5.3.3"
}
}

View File

@@ -1,4 +1,5 @@
mkdocs==1.4.2
mkdocs==1.5.3
mkdocs-jupyter==0.24.1
mkdocs-material==9.1.3
mkdocs-material==9.5.3
mkdocstrings[python]==0.20.0
pydantic

View File

@@ -1,45 +1,122 @@
# ANN (Approximate Nearest Neighbor) Indexes
# Approximate Nearest Neighbor (ANN) Indexes
You can create an index over your vector data to make search faster.
Vector indexes are faster but less accurate than exhaustive search.
An ANN or a vector index is a data structure specifically designed to efficiently organize and
search vector data based on their similarity via the chosen distance metric.
By constructing a vector index, the search space is effectively narrowed down, avoiding the need
for brute-force scanning of the entire vector space.
A vector index is faster but less accurate than exhaustive search (kNN or flat search).
LanceDB provides many parameters to fine-tune the index's size, the speed of queries, and the accuracy of results.
Currently, LanceDB does *not* automatically create the ANN index.
LanceDB has optimized code for KNN as well. For many use-cases, datasets under 100K vectors won't require index creation at all.
If you can live with <100ms latency, skipping index creation is a simpler workflow while guaranteeing 100% recall.
## Disk-based Index
In the future we will look to automatically create and configure the ANN index.
Lance provides an `IVF_PQ` disk-based index. It uses **Inverted File Index (IVF)** to first divide
the dataset into `N` partitions, and then applies **Product Quantization** to compress vectors in each partition.
See the [indexing](concepts/index_ivfpq.md) concepts guide for more information on how this works.
## Creating an ANN Index
## Creating an IVF_PQ Index
Creating indexes is done via the [create_index](https://lancedb.github.io/lancedb/python/#lancedb.table.LanceTable.create_index) method.
Lance supports `IVF_PQ` index type by default.
```python
import lancedb
import numpy as np
uri = "~/.lancedb"
db = lancedb.connect(uri)
=== "Python"
# Create 10,000 sample vectors
data = [{"vector": row, "item": f"item {i}"}
for i, row in enumerate(np.random.random((10_000, 768)).astype('float32'))]
Creating indexes is done via the [create_index](https://lancedb.github.io/lancedb/python/#lancedb.table.LanceTable.create_index) method.
# Add the vectors to a table
tbl = db.create_table("my_vectors", data=data)
```python
import lancedb
import numpy as np
uri = "data/sample-lancedb"
db = lancedb.connect(uri)
# Create and train the index - you need to have enough data in the table for an effective training step
tbl.create_index(num_partitions=256, num_sub_vectors=96)
```
# Create 10,000 sample vectors
data = [{"vector": row, "item": f"item {i}"}
for i, row in enumerate(np.random.random((10_000, 1536)).astype('float32'))]
Since `create_index` has a training step, it can take a few minutes to finish for large tables. You can control the index
creation by providing the following parameters:
# Add the vectors to a table
tbl = db.create_table("my_vectors", data=data)
- **metric** (default: "L2"): The distance metric to use. By default we use euclidean distance. We also support cosine distance.
- **num_partitions** (default: 256): The number of partitions of the index. The number of partitions should be configured so each partition has 3-5K vectors. For example, a table
with ~1M vectors should use 256 partitions. You can specify arbitrary number of partitions but powers of 2 is most conventional.
A higher number leads to faster queries, but it makes index generation slower.
- **num_sub_vectors** (default: 96): The number of subvectors (M) that will be created during Product Quantization (PQ). A larger number makes
search more accurate, but also makes the index larger and slower to build.
# Create and train the index - you need to have enough data in the table for an effective training step
tbl.create_index(num_partitions=256, num_sub_vectors=96)
```
=== "Typescript"
```typescript
--8<-- "docs/src/ann_indexes.ts:import"
--8<-- "docs/src/ann_indexes.ts:ingest"
```
=== "Rust"
```rust
--8<-- "rust/lancedb/examples/ivf_pq.rs:create_index"
```
IVF_PQ index parameters are more fully defined in the [crate docs](https://docs.rs/lancedb/latest/lancedb/index/vector/struct.IvfPqIndexBuilder.html).
The following IVF_PQ parameters can be specified:
- **distance_type**: The distance metric to use. The default is Euclidean distance ("`L2`").
"cosine" and "dot" distances are also supported.
- **num_partitions**: The number of partitions in the index. The default is the square root
of the number of rows.
!!! note
In the synchronous python SDK and node's `vectordb` the default is 256. This default has
changed in the asynchronous python SDK and node's `lancedb`.
- **num_sub_vectors**: The number of sub-vectors (M) that will be created during Product Quantization (PQ).
For D dimensional vector, it will be divided into `M` subvectors with dimension `D/M`, each of which is replaced by
a single PQ code. The default is the dimension of the vector divided by 16.
!!! note
In the synchronous python SDK and node's `vectordb` the default is currently 96. This default has
changed in the asynchronous python SDK and node's `lancedb`.
<figure markdown>
![IVF PQ](./assets/ivf_pq.png)
<figcaption>IVF_PQ index with <code>num_partitions=2, num_sub_vectors=4</code></figcaption>
</figure>
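The sub-vector split described above can be sketched in plain Python. This is only an illustration of the idea, not Lance's implementation: the helper names `split_into_sub_vectors` and `pq_encode` and the tiny hand-written codebooks are assumptions made for this example. A real PQ index trains its codebooks (typically with k-means) rather than taking them as input.

```python
def split_into_sub_vectors(vector, num_sub_vectors):
    """Split a D-dimensional vector into M contiguous sub-vectors of size D/M."""
    dim = len(vector)
    if dim % num_sub_vectors != 0:
        raise ValueError("dimension must be divisible by num_sub_vectors")
    step = dim // num_sub_vectors
    return [vector[i:i + step] for i in range(0, dim, step)]


def pq_encode(vector, codebooks):
    """Replace each sub-vector with the index of its nearest centroid.

    `codebooks` holds one list of centroids per sub-vector (hypothetical input).
    """
    codes = []
    for sub, centroids in zip(split_into_sub_vectors(vector, len(codebooks)), codebooks):
        # Squared L2 distance from this sub-vector to every centroid.
        dists = [sum((a - b) ** 2 for a, b in zip(sub, c)) for c in centroids]
        codes.append(dists.index(min(dists)))
    return codes


# Toy example: D=8, M=2, two centroids per sub-space.
toy_codebooks = [
    [[0, 0, 0, 0], [0, 0, 1, 1]],
    [[4, 4, 4, 4], [5, 5, 5, 5]],
]
codes = pq_encode([0, 0, 1, 1, 5, 5, 5, 5], toy_codebooks)
print(codes)  # [1, 1]
```

Each 4-float sub-vector is stored as a single small integer, which is where the space saving comes from.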
### Use GPU to build vector index
The Lance Python SDK has experimental GPU support for creating an IVF index.
Using a GPU for index creation requires [PyTorch>2.0](https://pytorch.org/) to be installed.
You can specify the GPU device used to train IVF partitions via:
- **accelerator**: Set to `cuda` or `mps` (on Apple Silicon) to enable GPU training.
=== "Linux"
<!-- skip-test -->
``` { .python .copy }
# Create index using CUDA on Nvidia GPUs.
tbl.create_index(
num_partitions=256,
num_sub_vectors=96,
accelerator="cuda"
)
```
=== "MacOS"
<!-- skip-test -->
```python
# Create index using MPS on Apple Silicon.
tbl.create_index(
num_partitions=256,
num_sub_vectors=96,
accelerator="mps"
)
```
Troubleshooting:
If you see `AssertionError: Torch not compiled with CUDA enabled`, you need to [install
PyTorch with CUDA support](https://pytorch.org/get-started/locally/).
## Querying an ANN Index
@@ -53,43 +130,114 @@ There are a couple of parameters that can be used to fine-tune the search:
e.g., for 1M vectors divided up into 256 partitions, nprobes should be set to ~20-40.<br/>
Note: nprobes is only applicable if an ANN index is present. If specified on a table without an ANN index, it is ignored.
- **refine_factor** (default: None): Refine the results by reading extra elements and re-ranking them in memory.<br/>
A higher number makes search more accurate but also slower. If you find the recall is less than idea, try refine_factor=10 to start.<br/>
A higher number makes search more accurate but also slower. If you find the recall is less than ideal, try refine_factor=10 to start.<br/>
e.g., for 1M vectors divided into 256 partitions, if you're looking for top 20, then refine_factor=200 reranks the whole partition.<br/>
Note: refine_factor is only applicable if an ANN index is present. If specified on a table without an ANN index, it is ignored.
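The arithmetic behind the two examples above can be checked with a short sketch. It assumes that refinement re-ranks `limit * refine_factor` candidates and that partitions are roughly equal in size. The helper names are hypothetical and are not part of the LanceDB API.

```python
def fraction_of_partitions_probed(nprobes, num_partitions):
    """Share of IVF partitions that a query visits."""
    return nprobes / num_partitions


def refine_candidates(limit, refine_factor):
    """Number of candidates re-ranked in memory (assumed: limit * refine_factor)."""
    return limit * refine_factor


def average_partition_size(num_rows, num_partitions):
    """Average rows per partition, assuming an even split."""
    return num_rows / num_partitions


# 1M vectors in 256 partitions, nprobes=20: about 7.8% of partitions are scanned.
print(fraction_of_partitions_probed(20, 256))  # 0.078125

# Top 20 with refine_factor=200 re-ranks 4000 candidates,
# which is about the size of one partition.
print(refine_candidates(20, 200))              # 4000
print(average_partition_size(1_000_000, 256))  # 3906.25
```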
=== "Python"
```python
tbl.search(np.random.random((768))) \
.limit(2) \
.nprobes(20) \
.refine_factor(10) \
.to_df()
```python
tbl.search(np.random.random((1536))) \
.limit(2) \
.nprobes(20) \
.refine_factor(10) \
.to_pandas()
```
vector item score
0 [0.44949695, 0.8444449, 0.06281311, 0.23338133... item 1141 103.575333
1 [0.48587373, 0.269207, 0.15095535, 0.65531915,... item 3953 108.393867
```
```text
vector item _distance
0 [0.44949695, 0.8444449, 0.06281311, 0.23338133... item 1141 103.575333
1 [0.48587373, 0.269207, 0.15095535, 0.65531915,... item 3953 108.393867
```
The search will return the data requested in addition to the score of each item.
=== "Typescript"
**Note:** The score is the distance between the query vector and the element. A lower number means that the result is more relevant.
```typescript
--8<-- "docs/src/ann_indexes.ts:search1"
```
=== "Rust"
```rust
--8<-- "rust/lancedb/examples/ivf_pq.rs:search1"
```
Vector search options are more fully defined in the [crate docs](https://docs.rs/lancedb/latest/lancedb/query/struct.Query.html#method.nearest_to).
The search will return the data requested in addition to the distance of each item.
### Filtering (where clause)
You can further filter the elements returned by a search using a where clause.
```python
tbl.search(np.random.random((768))).where("item != 'item 1141'").to_df()
```
=== "Python"
```python
tbl.search(np.random.random((1536))).where("item != 'item 1141'").to_pandas()
```
=== "Typescript"
```javascript
--8<-- "docs/src/ann_indexes.ts:search2"
```
### Projections (select clause)
You can select the columns returned by the query using a select clause.
```python
tbl.search(np.random.random((768))).select(["vector"]).to_df()
vector score
0 [0.30928212, 0.022668175, 0.1756372, 0.4911822... 93.971092
1 [0.2525465, 0.01723831, 0.261568, 0.002007689,... 95.173485
...
```
=== "Python"
```python
tbl.search(np.random.random((1536))).select(["vector"]).to_pandas()
```
```text
vector _distance
0 [0.30928212, 0.022668175, 0.1756372, 0.4911822... 93.971092
1 [0.2525465, 0.01723831, 0.261568, 0.002007689,... 95.173485
...
```
=== "Typescript"
```typescript
--8<-- "docs/src/ann_indexes.ts:search3"
```
## FAQ
### Why do I need to manually create an index?
Currently, LanceDB does _not_ automatically create the ANN index.
LanceDB is well-optimized for kNN (exhaustive search) via a disk-based index. For many use-cases,
datasets of the order of ~100K vectors don't require index creation. If you can live with up to
100ms latency, skipping index creation is a simpler workflow while guaranteeing 100% recall.
### When is it necessary to create an ANN vector index?
`LanceDB` comes out-of-the-box with highly optimized SIMD code for computing vector similarity.
In our benchmarks, computing distances for 100K pairs of 1K dimension vectors takes **less than 20ms**.
We observe that for small datasets (~100K rows) or for applications that can accept 100ms latency,
vector indices are usually not necessary.
For large-scale or higher-dimensional vectors, it can be beneficial to create a vector index for performance.
### How big is my index, and how much memory will it take?
In LanceDB, all vector indices are **disk-based**, meaning that when responding to a vector query, only the relevant pages from the index file are loaded from disk and cached in memory. Additionally, each sub-vector is usually encoded into 1 byte PQ code.
For example, with a 1024-dimension dataset, if we choose `num_sub_vectors=64`, each sub-vector has `1024 / 64 = 16` float32 numbers.
Product quantization can therefore reduce storage by approximately `16 * sizeof(float32) / 1 = 64` times.
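The same calculation as a small sketch, assuming float32 input and one byte per PQ code as described above. The function names are hypothetical, and the size estimate covers only the PQ codes. It ignores IVF centroids, codebooks and any file metadata.

```python
def pq_compression_ratio(dim, num_sub_vectors, bytes_per_code=1, bytes_per_float=4):
    """Ratio of raw vector size to PQ-encoded size for a single vector."""
    if dim % num_sub_vectors != 0:
        raise ValueError("dim must be divisible by num_sub_vectors")
    raw_bytes = dim * bytes_per_float
    pq_bytes = num_sub_vectors * bytes_per_code
    return raw_bytes / pq_bytes


def pq_codes_size_bytes(num_rows, num_sub_vectors, bytes_per_code=1):
    """Approximate storage for the PQ codes alone."""
    return num_rows * num_sub_vectors * bytes_per_code


print(pq_compression_ratio(1024, 64))     # 64.0
print(pq_codes_size_bytes(1_000_000, 64))  # 64000000 (about 64 MB)
```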
### How to choose `num_partitions` and `num_sub_vectors` for `IVF_PQ` index?
`num_partitions` is used to decide how many partitions the first level `IVF` index uses.
A higher number of partitions can lead to more efficient I/O during queries and better accuracy, but it takes much longer to train.
On the `SIFT-1M` dataset, our benchmark shows that keeping each partition at 1K-4K rows leads to a good latency/recall trade-off.
`num_sub_vectors` specifies how many Product Quantization (PQ) short codes to generate on each vector. Because
PQ is a lossy compression of the original vector, a higher `num_sub_vectors` usually results in
less space distortion, and thus yields better accuracy. However, a higher `num_sub_vectors` also causes heavier I/O and
more PQ computation, and thus, higher latency. `dimension / num_sub_vectors` should be a multiple of 8 for optimum SIMD efficiency.
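As a starting point, the guidance above can be turned into a small helper. This is a sketch under stated assumptions: it uses the square root of the row count for `num_partitions` and `dimension / 16` for `num_sub_vectors` (the defaults described earlier in this guide), and checks the multiple-of-8 rule. The function names are hypothetical and are not part of LanceDB. Treat the result as a first guess to benchmark, not a recommendation.

```python
import math


def suggest_ivf_pq_params(num_rows, dim):
    """Return (num_partitions, num_sub_vectors) from the rules of thumb above."""
    num_partitions = max(1, round(math.sqrt(num_rows)))
    num_sub_vectors = max(1, dim // 16)
    return num_partitions, num_sub_vectors


def is_simd_friendly(dim, num_sub_vectors):
    """True when dim / num_sub_vectors is a whole number and a multiple of 8."""
    return dim % num_sub_vectors == 0 and (dim // num_sub_vectors) % 8 == 0


partitions, sub_vectors = suggest_ivf_pq_params(1_000_000, 1536)
print(partitions, sub_vectors)              # 1000 96
print(1_000_000 / partitions)               # 1000.0 rows per partition
print(is_simd_friendly(1536, sub_vectors))  # True (sub-vector dimension is 16)
```

The suggested values can then be passed to `create_index(num_partitions=..., num_sub_vectors=...)` as in the examples earlier in this guide.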

53
docs/src/ann_indexes.ts Normal file
View File

@@ -0,0 +1,53 @@
// --8<-- [start:import]
import * as vectordb from "vectordb";
// --8<-- [end:import]
(async () => {
// --8<-- [start:ingest]
const db = await vectordb.connect("data/sample-lancedb");
let data = [];
for (let i = 0; i < 10_000; i++) {
data.push({
vector: Array(1536).fill(i),
id: `${i}`,
content: "",
longId: `${i}`,
});
}
const table = await db.createTable("my_vectors", data);
await table.createIndex({
type: "ivf_pq",
column: "vector",
num_partitions: 16,
num_sub_vectors: 48,
});
// --8<-- [end:ingest]
// --8<-- [start:search1]
const results_1 = await table
.search(Array(1536).fill(1.2))
.limit(2)
.nprobes(20)
.refineFactor(10)
.execute();
// --8<-- [end:search1]
// --8<-- [start:search2]
const results_2 = await table
.search(Array(1536).fill(1.2))
.where("id != '1141'")
.limit(2)
.execute();
// --8<-- [end:search2]
// --8<-- [start:search3]
const results_3 = await table
.search(Array(1536).fill(1.2))
.select(["id"])
.limit(2)
.execute();
// --8<-- [end:search3]
console.log("Ann indexes: done");
})();

View File

@@ -0,0 +1,8 @@
# API Reference
The API references for the LanceDB client SDKs are available at the following locations:
- [Python](python/python.md)
- [JavaScript (legacy vectordb package)](javascript/modules.md)
- [JavaScript (newer @lancedb/lancedb package)](js/modules.md)
- [Rust](https://docs.rs/lancedb/latest/lancedb/index.html)

BIN
docs/src/assets/ivf_pq.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 107 KiB

BIN
docs/src/assets/logo.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 20 KiB

BIN
docs/src/assets/voxel.gif Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 953 KiB

View File

@@ -1,77 +1,401 @@
# Basic LanceDB Functionality
# Quick start
## How to connect to a database
!!! info "LanceDB can be run in a number of ways:"
In local mode, LanceDB stores data in a directory on your local machine. To connect to a local database, you can use the following code:
```python
import lancedb
uri = "~/.lancedb"
db = lancedb.connect(uri)
```
* Embedded within an existing backend (like your Django, Flask, Node.js or FastAPI application)
* Directly from a client application like a Jupyter notebook for analytical workloads
* Deployed as a remote serverless database
![](assets/lancedb_embedded_explanation.png)
## Installation
=== "Python"
```shell
pip install lancedb
```
=== "Typescript"
```shell
npm install vectordb
```
=== "Rust"
```shell
cargo add lancedb
```
!!! info "To use the lancedb crate, you first need to install protobuf."
=== "macOS"
```shell
brew install protobuf
```
=== "Ubuntu/Debian"
```shell
sudo apt install -y protobuf-compiler libssl-dev
```
!!! info "Please also make sure you're using the same version of Arrow as in the [lancedb crate](https://github.com/lancedb/lancedb/blob/main/Cargo.toml)"
## Connect to a database
=== "Python"
```python
--8<-- "python/python/tests/docs/test_basic.py:imports"
--8<-- "python/python/tests/docs/test_basic.py:connect"
--8<-- "python/python/tests/docs/test_basic.py:connect_async"
```
!!! note "Asynchronous Python API"
The asynchronous Python API is new and has some slight differences compared
to the synchronous API. Feel free to start using the asynchronous version.
Once all features have migrated we will start to move the synchronous API to
use the same syntax as the asynchronous API. To help with this migration we
have created a [migration guide](migration.md) detailing the differences.
=== "Typescript"
```typescript
--8<-- "docs/src/basic_legacy.ts:import"
--8<-- "docs/src/basic_legacy.ts:open_db"
```
!!! note "`@lancedb/lancedb` vs. `vectordb`"
The Javascript SDK was originally released as `vectordb`. In an effort to
reduce maintenance we are aligning our SDKs. The new, aligned, Javascript
API is being released as `lancedb`. If you are starting new work we encourage
you to try out `lancedb`. Once the new API is feature complete we will begin
slowly deprecating `vectordb` in favor of `lancedb`. There is a
[migration guide](migration.md) detailing the differences which will assist
you in this process.
=== "Rust"
```rust
#[tokio::main]
async fn main() -> Result<()> {
--8<-- "rust/lancedb/examples/simple.rs:connect"
}
```
!!! info "See [examples/simple.rs](https://github.com/lancedb/lancedb/tree/main/rust/lancedb/examples/simple.rs) for a full working example."
LanceDB will create the directory if it doesn't exist (including parent directories).
If you need a reminder of the uri, use the `db.uri` property.
If you need a reminder of the uri, you can call `db.uri()`.
## How to create a table
## Create a table
To create a table, you can use the following code:
```python
tbl = db.create_table("my_table",
data=[{"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
{"vector": [5.9, 26.5], "item": "bar", "price": 20.0}])
```
### Create a table from initial data
Under the hood, LanceDB is converting the input data into an Apache Arrow table
and persisting it to disk in [Lance format](github.com/eto-ai/lance).
If you have data to insert into the table at creation time, you can simultaneously create a
table and insert the data into it. The schema of the data will be used as the schema of the
table.
If the table already exists, LanceDB will raise an error by default.
If you want to overwrite the table, you can pass in `mode="overwrite"`
to the `create_table` method.
=== "Python"
You can also pass in a pandas DataFrame directly:
```python
import pandas as pd
df = pd.DataFrame([{"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
{"vector": [5.9, 26.5], "item": "bar", "price": 20.0}])
tbl = db.create_table("table_from_df", data=df)
```
```python
--8<-- "python/python/tests/docs/test_basic.py:create_table"
--8<-- "python/python/tests/docs/test_basic.py:create_table_async"
```
## How to open an existing table
If the table already exists, LanceDB will raise an error by default.
If you want to overwrite the table, you can pass in `mode="overwrite"`
to the `create_table` method.
Once created, you can open a table using the following code:
```python
tbl = db.open_table("my_table")
```
You can also pass in a pandas DataFrame directly:
```python
--8<-- "python/python/tests/docs/test_basic.py:create_table_pandas"
--8<-- "python/python/tests/docs/test_basic.py:create_table_async_pandas"
```
=== "Typescript"
```typescript
--8<-- "docs/src/basic_legacy.ts:create_table"
```
If the table already exists, LanceDB will raise an error by default.
If you want to overwrite the table, you can pass in `mode="overwrite"`
to the `createTable` function.
=== "Rust"
```rust
--8<-- "rust/lancedb/examples/simple.rs:create_table"
```
If the table already exists, LanceDB will raise an error by default. See
[the mode option](https://docs.rs/lancedb/latest/lancedb/connection/struct.CreateTableBuilder.html#method.mode)
for details on how to overwrite (or open) existing tables instead.
!!! info "Providing table records in Rust"
The Rust SDK currently expects data to be provided as an Arrow
[RecordBatchReader](https://docs.rs/arrow-array/latest/arrow_array/trait.RecordBatchReader.html).
Support for additional formats (such as serde or polars) is on the roadmap.
!!! info "Under the hood, LanceDB reads in the Apache Arrow data and persists it to disk using the [Lance format](https://www.github.com/lancedb/lance)."
### Create an empty table
Sometimes you may not have the data to insert into the table at creation time.
In this case, you can create an empty table and specify the schema, so that you can add
data to the table at a later time (as long as it conforms to the schema). This is
similar to a `CREATE TABLE` statement in SQL.
=== "Python"
```python
--8<-- "python/python/tests/docs/test_basic.py:create_empty_table"
--8<-- "python/python/tests/docs/test_basic.py:create_empty_table_async"
```
=== "Typescript"
```typescript
--8<-- "docs/src/basic_legacy.ts:create_empty_table"
```
=== "Rust"
```rust
--8<-- "rust/lancedb/examples/simple.rs:create_empty_table"
```
## Open an existing table
Once created, you can open a table as follows:
=== "Python"
```python
--8<-- "python/python/tests/docs/test_basic.py:open_table"
--8<-- "python/python/tests/docs/test_basic.py:open_table_async"
```
=== "Typescript"
```typescript
const tbl = await db.openTable("myTable");
```
=== "Rust"
```rust
--8<-- "rust/lancedb/examples/simple.rs:open_existing_tbl"
```
If you forget the name of your table, you can always get a listing of all table names:
```python
db.table_names()
```
=== "Python"
## How to add data to a table
```python
--8<-- "python/python/tests/docs/test_basic.py:table_names"
--8<-- "python/python/tests/docs/test_basic.py:table_names_async"
```
After a table has been created, you can always add more data to it using
=== "Javascript"
```python
df = pd.DataFrame([{"vector": [1.3, 1.4], "item": "fizz", "price": 100.0},
{"vector": [9.5, 56.2], "item": "buzz", "price": 200.0}])
tbl.add(df)
```
```javascript
console.log(await db.tableNames());
```
## How to search for (approximate) nearest neighbors
=== "Rust"
Once you've embedded the query, you can find its nearest neighbors using the following code:
```rust
--8<-- "rust/lancedb/examples/simple.rs:list_names"
```
```python
tbl.search([100, 100]).limit(2).to_df()
```
## Add data to a table
This returns a pandas DataFrame with the results.
After a table has been created, you can always add more data to it as follows:
=== "Python"
```python
--8<-- "python/python/tests/docs/test_basic.py:add_data"
--8<-- "python/python/tests/docs/test_basic.py:add_data_async"
```
=== "Typescript"
```typescript
--8<-- "docs/src/basic_legacy.ts:add"
```
=== "Rust"
```rust
--8<-- "rust/lancedb/examples/simple.rs:add"
```
## Search for nearest neighbors
Once you've embedded the query, you can find its nearest neighbors as follows:
=== "Python"
```python
--8<-- "python/python/tests/docs/test_basic.py:vector_search"
--8<-- "python/python/tests/docs/test_basic.py:vector_search_async"
```
This returns a pandas DataFrame with the results.
=== "Typescript"
```typescript
--8<-- "docs/src/basic_legacy.ts:search"
```
=== "Rust"
```rust
use futures::TryStreamExt;
--8<-- "rust/lancedb/examples/simple.rs:search"
```
!!! info "Query vectors in Rust"
Rust does not yet support automatic execution of embedding functions. You will need to
calculate embeddings yourself. Support for this is on the roadmap and can be tracked at
https://github.com/lancedb/lancedb/issues/994
Query vectors can be provided as Arrow arrays or a Vec/slice of Rust floats.
Support for additional formats (e.g. `polars::series::Series`) is on the roadmap.
By default, LanceDB runs a brute-force scan over the dataset to find the K nearest neighbors (KNN).
For tables with more than 50K vectors, creating an ANN index is recommended to speed up search performance.
LanceDB allows you to create an ANN index on a table as follows:
=== "Python"
```py
--8<-- "python/python/tests/docs/test_basic.py:create_index"
--8<-- "python/python/tests/docs/test_basic.py:create_index_async"
```
=== "Typescript"
```{.typescript .ignore}
--8<-- "docs/src/basic_legacy.ts:create_index"
```
=== "Rust"
```rust
--8<-- "rust/lancedb/examples/simple.rs:create_index"
```
!!! note "Why do I need to create an index manually?"
LanceDB does not automatically create the ANN index for two reasons. The first is that it's optimized
for really fast retrievals via a disk-based index, and the second is that data and query workloads can
be very diverse, so there's no one-size-fits-all index configuration. LanceDB provides many parameters
to fine-tune index size, query latency and accuracy. See the section on
[ANN indexes](ann_indexes.md) for more details.
## Delete rows from a table
Use the `delete()` method on tables to delete rows from a table. To choose
which rows to delete, provide a filter that matches on the metadata columns.
This can delete any number of rows that match the filter.
=== "Python"
```python
--8<-- "python/python/tests/docs/test_basic.py:delete_rows"
--8<-- "python/python/tests/docs/test_basic.py:delete_rows_async"
```
=== "Typescript"
```typescript
--8<-- "docs/src/basic_legacy.ts:delete"
```
=== "Rust"
```rust
--8<-- "rust/lancedb/examples/simple.rs:delete"
```
The deletion predicate is a SQL expression that supports the same expressions
as the `where()` clause (`only_if()` in Rust) on a search. They can be as
simple or complex as needed. To see what expressions are supported, see the
[SQL filters](sql.md) section.
=== "Python"
Read more: [lancedb.table.Table.delete][]
=== "Javascript"
Read more: [vectordb.Table.delete](javascript/interfaces/Table.md#delete)
=== "Rust"
Read more: [lancedb::Table::delete](https://docs.rs/lancedb/latest/lancedb/table/struct.Table.html#method.delete)
## Drop a table
Use the `drop_table()` method on the database to remove a table.
=== "Python"
```python
--8<-- "python/python/tests/docs/test_basic.py:drop_table"
--8<-- "python/python/tests/docs/test_basic.py:drop_table_async"
```
This permanently removes the table and is not recoverable, unlike deleting rows.
By default, if the table does not exist an exception is raised. To suppress this,
you can pass in `ignore_missing=True`.
=== "Typescript"
```typescript
--8<-- "docs/src/basic_legacy.ts:drop_table"
```
This permanently removes the table and is not recoverable, unlike deleting rows.
If the table does not exist an exception is raised.
=== "Rust"
```rust
--8<-- "rust/lancedb/examples/simple.rs:drop_table"
```
!!! note "Bundling `vectordb` apps with Webpack"
If you're using the `vectordb` module in JavaScript, since LanceDB contains a prebuilt Node binary, you must configure `next.config.js` to exclude it from webpack. This is required for both using Next.js and deploying a LanceDB app on Vercel.
```javascript
/** @type {import('next').NextConfig} */
module.exports = ({
webpack(config) {
config.externals.push({ vectordb: 'vectordb' })
return config;
}
})
```
## What's next
This section covered the very basics of the LanceDB API.
LanceDB supports many additional features when creating indices to speed up search and options for search.
These are contained in the next section of the documentation.
This section covered the very basics of using LanceDB. If you're learning about vector databases for the first time, you may want to read the page on [indexing](concepts/index_ivfpq.md) to get familiar with the concepts.
If you've already worked with other vector databases, you may want to read the [guides](guides/tables.md) to learn how to work with LanceDB in more detail.

92
docs/src/basic_legacy.ts Normal file
View File

@@ -0,0 +1,92 @@
// --8<-- [start:import]
import * as lancedb from "vectordb";
import { Schema, Field, Float32, FixedSizeList, Int32, Float16 } from "apache-arrow";
// --8<-- [end:import]
import * as fs from "fs";
import { Table as ArrowTable, Utf8 } from "apache-arrow";
const example = async () => {
fs.rmSync("data/sample-lancedb", { recursive: true, force: true });
// --8<-- [start:open_db]
const lancedb = require("vectordb");
const uri = "data/sample-lancedb";
const db = await lancedb.connect(uri);
// --8<-- [end:open_db]
// --8<-- [start:create_table]
const tbl = await db.createTable(
"myTable",
[
{ vector: [3.1, 4.1], item: "foo", price: 10.0 },
{ vector: [5.9, 26.5], item: "bar", price: 20.0 },
],
{ writeMode: lancedb.WriteMode.Overwrite }
);
// --8<-- [end:create_table]
// --8<-- [start:add]
const newData = Array.from({ length: 500 }, (_, i) => ({
vector: [i, i + 1],
item: "fizz",
price: i * 0.1,
}));
await tbl.add(newData);
// --8<-- [end:add]
// --8<-- [start:create_index]
await tbl.createIndex({
type: "ivf_pq",
num_partitions: 2,
num_sub_vectors: 2,
});
// --8<-- [end:create_index]
// --8<-- [start:create_empty_table]
const schema = new Schema([
new Field("id", new Int32()),
new Field("name", new Utf8()),
]);
const empty_tbl = await db.createTable({ name: "empty_table", schema });
// --8<-- [end:create_empty_table]
// --8<-- [start:create_f16_table]
const dim = 16
const total = 10
const f16_schema = new Schema([
new Field('id', new Int32()),
new Field(
'vector',
new FixedSizeList(dim, new Field('item', new Float16(), true)),
false
)
])
const data = lancedb.makeArrowTable(
Array.from(Array(total), (_, i) => ({
id: i,
vector: Array.from(Array(dim), Math.random)
})),
{ schema: f16_schema }
)
const table = await db.createTable('f16_tbl', data)
// --8<-- [end:create_f16_table]
// --8<-- [start:search]
const query = await tbl.search([100, 100]).limit(2).execute();
// --8<-- [end:search]
console.log(query);
// --8<-- [start:delete]
await tbl.delete('item = "fizz"');
// --8<-- [end:delete]
// --8<-- [start:drop_table]
await db.dropTable("myTable");
// --8<-- [end:drop_table]
};
async function main() {
await example();
console.log("Basic example: done");
}
main();

17
docs/src/cloud/index.md Normal file
View File

@@ -0,0 +1,17 @@
# About LanceDB Cloud
LanceDB Cloud is a SaaS (software-as-a-service) solution that runs serverless in the cloud, clearly separating storage from compute. It's designed to be highly scalable without breaking the bank. LanceDB Cloud is currently in private beta with general availability coming soon, but you can apply for early access with the private beta release by signing up below.
[Try out LanceDB Cloud](https://noteforms.com/forms/lancedb-mailing-list-cloud-kty1o5?notionforms=1&utm_source=notionforms){ .md-button .md-button--primary }
## Architecture
LanceDB Cloud provides the same underlying fast vector store that powers the OSS version, but without the need to maintain your own infrastructure. Because it's serverless, you only pay for the storage you use, and you can scale compute up and down as needed depending on the size of your data and its associated index.
![](../assets/lancedb_cloud.png)
## Transitioning from the OSS to the Cloud version
The OSS version of LanceDB is designed to be embedded in your application, and it runs in-process. This makes it incredibly simple to self-host your own AI retrieval workflows for RAG and more, and to build and test your concepts on your own infrastructure. The OSS version is forever free, and you can continue to build and integrate LanceDB into your existing backend applications without any added costs.
Should you decide that you need a managed deployment in production, it's possible to seamlessly transition from the OSS to the cloud version by changing the connection string to point to a remote database instead of a local one. With LanceDB Cloud, you can take your AI application from development to production without major code changes or infrastructure burden.

View File

@@ -0,0 +1,62 @@
# Data management
This section covers concepts related to managing your data over time in LanceDB.
## A primer on Lance
Because LanceDB is built on top of the [Lance](https://lancedb.github.io/lance/) data format, it helps to understand some of its core ideas. Just like Apache Arrow, Lance is a fast columnar data format, but it has the added benefits of being versionable and of being easy to query and to train ML models on. Lance is designed to be used with simple and complex data types, like tabular data, images, videos, audio, 3D point clouds (which are deeply nested) and more.
The following concepts are important to keep in mind:
- Data storage is columnar and is interoperable with other columnar formats (such as Parquet) via Arrow
- Data is divided into fragments that represent a subset of the data
- Data is versioned, with each insert operation creating a new version of the dataset and an update to the manifest that tracks versions via metadata
!!! note
1. First, each version contains metadata and just the new/updated data in your transaction. So if you have 100 versions, they aren't 100 duplicates of the same data. However, they do have 100x the metadata overhead of a single version, which can result in slower queries.
2. Second, these versions exist to keep LanceDB scalable and consistent. We do not immediately blow away old versions when creating new ones because other clients might be in the middle of querying the old version. It's important to retain older versions for as long as they might be queried.
## What are fragments?
Fragments are chunks of data in a Lance dataset. Each fragment includes multiple files that contain several columns in the chunk of data that it represents.
## Compaction
As you insert more data, your dataset will grow and you'll need to perform *compaction* to maintain query throughput (i.e., keep latencies down to a minimum). Compaction is the process of merging fragments together to reduce the amount of metadata that needs to be managed, and to reduce the number of files that need to be opened while scanning the dataset.
### How does compaction improve performance?
Compaction performs the following tasks in the background:
- Removes deleted rows from fragments
- Removes dropped columns from fragments
- Merges small fragments into larger ones
Depending on the use case and dataset, optimal compaction will have different requirements. As a rule of thumb:
- It's always better to use *batch* inserts rather than adding 1 row at a time (to avoid fragments that are too small). If single-row inserts are unavoidable, run compaction on a regular basis to merge them into larger fragments.
- Keep the number of fragments under 100, which is suitable for most use cases (for *really* large datasets of >500M rows, more fragments might be needed)
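To see why batch inserts matter, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that every insert call writes exactly one new fragment; the helper name `fragments_created` is hypothetical and not part of the LanceDB API.

```python
import math

def fragments_created(total_rows: int, rows_per_insert: int) -> int:
    """Fragments produced, assuming each insert call writes exactly one new fragment."""
    return math.ceil(total_rows / rows_per_insert)

# Same 10,000 rows, inserted one at a time vs. in batches of 1,000.
single = fragments_created(10_000, 1)
batched = fragments_created(10_000, 1_000)

print(single, batched)  # 10000 10
```

Under this assumption, only the batched case stays below the 100-fragment rule of thumb without any compaction.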
## Deletion
Although Lance allows you to delete rows from a dataset, it does not actually delete the data immediately. It simply marks the row as deleted in the `DataFile` that represents a fragment. For a given version of the dataset, each fragment can have up to one deletion file (if no rows were ever deleted from that fragment, it will not have a deletion file). This is important to keep in mind because it means that the data is still there, and can be recovered if needed, as long as that version still exists based on your backup policy.
## Reindexing
Reindexing is the process of updating the index to account for new data, keeping good performance for queries. This applies to either a full-text search (FTS) index or a vector index. For ANN search, new data will always be included in query results, but queries on tables with unindexed data will fall back to slower search methods for the new parts of the table. This is another important operation to run periodically as your data grows, as it also improves performance. This is especially important if you're appending large amounts of data to an existing dataset.
!!! tip
When you add new data to a dataset that has an existing index (either FTS or vector), LanceDB doesn't update the index until a reindex operation completes.
Both LanceDB OSS and Cloud support reindexing, but the process (at least for now) is different for each, depending on the type of index.
When a reindex job is triggered in the background, the entire dataset is reindexed. In the interim, as new queries come in, LanceDB combines results from the existing index with an exhaustive kNN search on the new data. This ensures that you're still searching all of your data, but it comes at a performance cost: the more data you add without reindexing, the more noticeable the impact on latency (due to exhaustive search).
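The idea of combining index results with an exhaustive scan of new rows can be sketched in plain Python. This is an illustrative toy, not LanceDB's internal implementation: the function names and the sample `index_hits` and `unindexed_rows` values are all hypothetical.

```python
import heapq
import math

def brute_force(query, rows, k):
    """Exhaustive kNN over (row_id, vector) pairs; returns (distance, row_id) tuples."""
    scored = [(math.dist(vec, query), row_id) for row_id, vec in rows]
    return heapq.nsmallest(k, scored)

def combined_search(query, index_hits, unindexed_rows, k):
    """Merge top-k hits from an index with an exhaustive scan of rows added since it was built."""
    new_hits = brute_force(query, unindexed_rows, k)
    return heapq.nsmallest(k, index_hits + new_hits)

# Hypothetical inputs: hits an ANN index might return, plus rows not yet indexed.
index_hits = [(0.5, "a"), (1.5, "b")]
unindexed_rows = [("c", [1.0, 0.0]), ("d", [5.0, 5.0])]
query = [0.0, 0.0]

result = combined_search(query, index_hits, unindexed_rows, k=2)
print(result)  # [(0.5, 'a'), (1.0, 'c')]
```

The exhaustive part scales with the number of unindexed rows, which is why latency grows the longer reindexing is postponed.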
### Vector reindex
* LanceDB Cloud supports incremental reindexing, where a background process will trigger a new index build for you automatically when new data is added to a dataset
* LanceDB OSS requires you to manually trigger a reindex operation -- we are working on adding incremental reindexing to LanceDB OSS as well
### FTS reindex
FTS reindexing is supported in both LanceDB OSS and Cloud, but the index must be rebuilt manually once a significant amount of new data has been added. We [updated](https://github.com/lancedb/lancedb/pull/762) Tantivy's default heap size from 128MB to 1GB in LanceDB to make reindexing much faster, by up to 10x compared to the default settings.

View File

@@ -0,0 +1,84 @@
# Understanding LanceDB's IVF-PQ index
An ANN (Approximate Nearest Neighbors) index is a data structure that represents data in a way that makes it more efficient to search and retrieve. Using an ANN index is faster, but less accurate than kNN or brute force search because, in essence, the index is a lossy representation of the data.
LanceDB is fundamentally different from other vector databases in that it is built on top of [Lance](https://github.com/lancedb/lance), an open-source columnar data format designed for performant ML workloads and fast random access. Due to the design of Lance, LanceDB adopts a primarily *disk-based* indexing philosophy.
## IVF-PQ
IVF-PQ is a composite index that combines inverted file index (IVF) and product quantization (PQ). The implementation in LanceDB provides several parameters to fine-tune the index's size, query throughput, latency and recall, which are described later in this section.
### Product quantization
Quantization is a compression technique used to reduce the dimensionality of an embedding to speed up search.
Product quantization (PQ) works by dividing a large, high-dimensional vector into equally sized subvectors. Each subvector is assigned a "reproduction value" that maps to the nearest centroid of points for that subvector. The reproduction values are then assigned to a codebook using unique IDs, which can be used to reconstruct an approximation of the original vector.
![](../assets/ivfpq_pq_desc.png)
It's important to remember that quantization is a *lossy process*, i.e., the reconstructed vector is not identical to the original vector. This results in a trade-off between the size of the index and the accuracy of the search results.
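The encode/decode round trip can be illustrated with a tiny pure-Python sketch. It is not LanceDB's implementation: the codebooks below are hand-written and hypothetical (real codebooks are learned from data, typically with k-means), and the helper names are made up for this example.

```python
import math

def nearest(centroids, sub):
    """Return the index of the centroid closest to `sub` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(centroids[i], sub))

def pq_encode(vector, codebooks):
    """Split `vector` into len(codebooks) equal subvectors and map each to a centroid ID."""
    size = len(vector) // len(codebooks)
    return [
        nearest(book, vector[i * size:(i + 1) * size])
        for i, book in enumerate(codebooks)
    ]

def pq_decode(codes, codebooks):
    """Rebuild an approximate vector by concatenating the chosen centroids."""
    return [x for code, book in zip(codes, codebooks) for x in book[code]]

# Hypothetical codebooks: 2 subspaces with 2 centroids each.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
vector = [0.9, 1.2, 0.1, 0.8]

codes = pq_encode(vector, codebooks)
approx = pq_decode(codes, codebooks)
print(codes)   # [1, 0]
print(approx)  # [1.0, 1.0, 0.0, 1.0]
```

The stored representation is just the two small integer codes, and the reconstructed vector differs from the input, which is the lossy trade-off described above.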
As an example, consider starting with a 128-dimensional vector consisting of 32-bit floats. By quantizing it to an 8-bit integer vector with 4 dimensions, as in the image above, we can significantly reduce memory requirements.
!!! example "Effect of quantization"
Original: `128 × 32 = 4096` bits
Quantized: `4 × 8 = 32` bits
Quantization results in a **128x** reduction in memory requirements for each vector in the index, which is substantial.
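The arithmetic in this example can be checked with a few lines of Python. This is only a sanity check of the numbers quoted above; the helper name `vector_bits` is hypothetical.

```python
def vector_bits(num_values: int, bits_per_value: int) -> int:
    """Return the number of bits needed to store one vector."""
    return num_values * bits_per_value

original_bits = vector_bits(128, 32)  # 128 values stored as 32-bit floats
quantized_bits = vector_bits(4, 8)    # 4 PQ codes of 8 bits each
reduction = original_bits // quantized_bits

print(original_bits, quantized_bits, reduction)  # 4096 32 128
```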
### Inverted file index
While PQ helps with reducing the size of the index, IVF primarily addresses search performance. The primary purpose of an inverted file index is to facilitate rapid and effective nearest neighbor search by narrowing down the search space.
In IVF, the PQ vector space is divided into *Voronoi cells*, which are essentially partitions that consist of all the points in the space that are within a threshold distance of the given region's seed point. These seed points are initialized by running K-means over the stored vectors. The centroids of K-means turn into the seed points, each of which defines a region. These regions are then used to create an inverted index that correlates each centroid with a list of vectors in the space, allowing a search to be restricted to just a subset of vectors in the index.
![](../assets/ivfpq_ivf_desc.webp)
During query time, depending on where the query lands in vector space, it may be close to the border of multiple Voronoi cells, which could make the top-k results ambiguous and span across multiple cells. To address this, the IVF-PQ index introduces the `nprobe` parameter, which controls the number of Voronoi cells to search during a query. The higher the `nprobe`, the more accurate the results, but the slower the query.
![](../assets/ivfpq_query_vector.webp)
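The effect of `nprobe` on a query near a cell border can be shown with a toy inverted index in plain Python. This is a simplified sketch under stated assumptions: it compares raw vectors rather than PQ codes, the centroids are fixed by hand instead of being learned with K-means, and the function names are hypothetical.

```python
import math

def build_ivf(vectors, centroids):
    """Assign each vector ID to its nearest centroid, giving one inverted list per cell."""
    lists = {c: [] for c in range(len(centroids))}
    for vid, vec in enumerate(vectors):
        cell = min(lists, key=lambda c: math.dist(centroids[c], vec))
        lists[cell].append(vid)
    return lists

def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the `nprobe` cells whose centroids are closest to the query."""
    cells = sorted(lists, key=lambda c: math.dist(centroids[c], query))[:nprobe]
    candidates = [vid for c in cells for vid in lists[c]]
    return sorted(candidates, key=lambda vid: math.dist(vectors[vid], query))[:k]

# Hypothetical data: two cells, with the query close to the border between them.
centroids = [[0.0, 0.0], [10.0, 0.0]]
vectors = [[1.0, 0.0], [4.0, 0.0], [6.0, 0.0], [9.0, 0.0]]
lists = build_ivf(vectors, centroids)
query = [4.9, 0.0]

one_cell = ivf_search(query, vectors, centroids, lists, k=2, nprobe=1)
two_cells = ivf_search(query, vectors, centroids, lists, k=2, nprobe=2)
print(one_cell)   # [1, 0]
print(two_cells)  # [1, 2]
```

With `nprobe=1` the search misses vector 2, which sits just across the border in the neighboring cell; probing both cells recovers it at the cost of scanning more candidates.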
## Putting it all together
We can combine the above concepts to understand how to build and query an IVF-PQ index in LanceDB.
### Construct index
There are three key parameters to set when constructing an IVF-PQ index:
* `metric`: Use an `L2` (Euclidean) distance metric. We also support `dot` and `cosine` distance.
* `num_partitions`: The number of partitions in the IVF portion of the index.
* `num_sub_vectors`: The number of sub-vectors that will be created during Product Quantization (PQ).
In Python, the index can be created as follows:
```python
# Create and train the index for a 1536-dimensional vector
# Make sure you have enough data in the table for an effective training step
tbl.create_index(metric="L2", num_partitions=256, num_sub_vectors=96)
```
The `num_partitions` is usually chosen to target a particular number of vectors per partition. `num_sub_vectors` is typically chosen based on the desired recall and the dimensionality of the vector. See the [FAQs](#faq) below for best practices on choosing these parameters.
### Query the index
```python
import numpy as np

# Search using a random 1536-dimensional embedding
tbl.search(np.random.random(1536)) \
.limit(2) \
.nprobes(20) \
.refine_factor(10) \
.to_pandas()
```
The above query will perform a search on the table `tbl` using the given query vector, with the following parameters:
* `limit`: The number of results to return
* `nprobes`: The number of IVF partitions to probe during the search. A higher number improves search accuracy but also results in slower performance. Typically, setting `nprobes` to cover 5-10% of the dataset proves effective in achieving high recall with minimal latency.
* `refine_factor`: Refine the results by reading extra elements and re-ranking them in memory. A higher number makes the search more accurate but also slower (see the [FAQ](../faq.md#do-i-need-to-set-a-refine-factor-when-using-an-index) page for more details on this).
* `to_pandas()`: Convert the results to a pandas DataFrame
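As a rough illustration of the 5-10% guideline for `nprobes`, the sketch below converts that fraction into a range of probe counts. It assumes partitions are roughly equal in size, so that a fraction of partitions approximates the same fraction of the dataset; the helper name `nprobes_range` is hypothetical.

```python
import math

def nprobes_range(num_partitions: int, low: float = 0.05, high: float = 0.10):
    """Return the (min, max) nprobes covering `low`..`high` of the partitions."""
    return math.ceil(num_partitions * low), math.floor(num_partitions * high)

# For the 256-partition index created earlier in this section.
bounds = nprobes_range(256)
print(bounds)  # (13, 25)
```

The `nprobes(20)` value used in the query above falls inside this range.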
And there you have it! You now understand what an IVF-PQ index is, and how to create and query it in LanceDB.
To see how to create an IVF-PQ index in LanceDB, take a look at the [ANN indexes](../ann_indexes.md) section.

View File

@@ -0,0 +1,80 @@
# Storage
LanceDB is among the only vector databases built on top of multiple modular components designed from the ground-up to be efficient on disk. This gives it the unique benefit of being flexible enough to support multiple storage backends, including local NVMe, EBS, EFS and many other third-party APIs that connect to the cloud.
It is important to understand the tradeoffs between cost and latency for your specific application and use case. This section will help you understand the tradeoffs between the different storage backends.
## Storage options
We've prepared a simple diagram to showcase the thought process that goes into choosing a storage backend when using LanceDB OSS, Cloud or Enterprise.
![](../assets/lancedb_storage_tradeoffs.png)
When architecting your system, you'd typically ask yourself the following questions to decide on a storage option:
1. **Latency**: How fast do I need results? What do the p50 and also p95 look like?
2. **Scalability**: Can I scale up the amount of data and QPS easily?
3. **Cost**: To serve my application, what's the all-in cost of *both* storage and serving infra?
4. **Reliability/Availability**: How does replication work? Is disaster recovery addressed?
## Tradeoffs
This section reviews the characteristics of each storage option in four dimensions: latency, scalability, cost and reliability.
**We begin with the lowest cost option, and end with the lowest latency option.**
### 1. S3 / GCS / Azure Blob Storage
!!! tip "Lowest cost, highest latency"
- **Latency** ⇒ Has the highest latency. p95 latency is also substantially worse than p50. In general you get results in the order of several hundred milliseconds
- **Scalability** ⇒ Infinite on storage, however, QPS will be limited by S3 concurrency limits
- **Cost** ⇒ Lowest (order of magnitude cheaper than other options)
- **Reliability/Availability** ⇒ Highly available, as blob stores like S3 are critical infrastructure that form the backbone of the internet.
Another important point to note is that LanceDB is designed to separate storage from compute, and the underlying Lance format stores the data in numerous immutable fragments. Due to these factors, LanceDB is a great storage option that addresses the _N + 1_ query problem, i.e., when a high query throughput is required, query processes can run in a stateless manner and be scaled up and down as needed.
### 2. EFS / GCP Filestore / Azure File Storage
!!! info "Moderately low cost, moderately low latency (<100ms)"
- **Latency** ⇒ Much better than object/blob storage but not as good as EBS/Local disk; < 100ms p95 achievable
- **Scalability** ⇒ High; the bottleneck will be the IOPS limit, but you can provision multiple EFS volumes when scaling
- **Cost** ⇒ Significantly more expensive than S3 but still very cost effective compared to in-memory dbs. Inactive data in EFS is also automatically tiered to S3-level costs.
- **Reliability/Availability** ⇒ Highly available, as query nodes can go down without affecting EFS. However, EFS does not provide replication / backup - this must be managed manually.
A recommended best practice is to keep a copy of the data on S3 for disaster recovery scenarios. If any downtime is unacceptable, then you would need another EFS with a copy of the data. This is still much cheaper than EC2 instances holding multiple copies of the data.
### 3. Third-party storage solutions
Solutions like [MinIO](https://blog.min.io/lancedb-trusted-steed-against-data-complexity/), WekaFS, etc. deliver an S3-compatible API with much better performance than S3.
!!! info "Moderately low cost, moderately low latency (<100ms)"
- **Latency** ⇒ Should be similar latency to EFS, better than S3 (<100ms)
- **Scalability** ⇒ Up to the solutions architect, who can add as many nodes to their MinIO or other third-party provider's cluster as needed
- **Cost** ⇒ Definitely higher than S3. The cost can be marginally higher than EFS until you get to maybe >10TB scale with high utilization
- **Reliability/Availability** ⇒ These are all shareable by lots of nodes, quality/cost of replication/backup depends on the vendor
### 4. EBS / GCP Persistent Disk / Azure Managed Disk
!!! info "Very low latency (<30ms), higher cost"
- **Latency** Very good, pretty close to local disk. You're looking at <30ms latency in most cases
- **Scalability** EBS is not shareable between instances. If deployed via k8s, it can be shared between pods that live on the same instance, but beyond that you would need to shard data or make an additional copy
- **Cost** Higher than EFS. There are some hidden costs to EBS as well if you're paying for IO.
- **Reliability/Availability** Not shareable between instances but can be shared between pods on the same instance. Survives instance termination. No automatic backups.
Just like EFS, an EBS or persistent disk setup requires more manual work to manage data sharding, backups and capacity.
### 5. Local disk (SSD/NVMe)
!!! danger "Lowest latency (<10ms), highest cost"
- **Latency** Lowest latency with modern NVMe drives, <10ms p95
- **Scalability** Difficult to scale on cloud. Also need additional copies / sharding if QPS needs to be higher
- **Cost** Highest cost; the main issue with keeping your application and storage tightly integrated is that it's not really possible to scale this up in cloud environments
- **Reliability/Availability** If the instance goes down, so does your data. You have to be _very_ diligent about backing up your data
As a rule of thumb, local disk should be your storage option if you require absolutely *crazy low* latency and you're willing to do a bunch of data management work to make it happen.
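
The trade-offs above can be condensed into a small lookup. The following is only a sketch: it encodes the latency bounds quoted in this section, orders the options roughly from lower to higher cost as described, and uses hypothetical names (`STORAGE_OPTIONS`, `cheapest_option`). Object storage is left out because its latency bound is not restated here.

```python
# (option, latency bound in ms as quoted in this section),
# ordered roughly from lower to higher cost.
STORAGE_OPTIONS = [
    ("EFS / GCS Filestore / Azure File Storage", 100),
    ("Third-party S3-compatible storage", 100),
    ("EBS / GCP Persistent Disk / Azure Managed Disk", 30),
    ("Local disk (SSD/NVMe)", 10),
]


def cheapest_option(latency_budget_ms):
    """Return the first (cheapest) option whose quoted bound fits the budget."""
    for name, bound_ms in STORAGE_OPTIONS:
        if bound_ms <= latency_budget_ms:
            return name
    return None  # no option in this list meets the budget


print(cheapest_option(50))
```

Real deployments also need to weigh scalability and backup effort, which this lookup deliberately ignores.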


@@ -0,0 +1,36 @@
# Vector search
Vector search is a technique used to search for similar items based on their vector representations, called embeddings. It is also known as similarity search, nearest neighbor search, or approximate nearest neighbor search.
Raw data (e.g. text, images, audio, etc.) is converted into embeddings via an embedding model, which are then stored in a vector database like LanceDB. To perform similarity search at scale, an index is created on the stored embeddings, which can then be used to perform fast lookups.
![](../assets/vector-db-basics.png)
## Embeddings
Modern machine learning models can be trained to convert raw data into embeddings, represented as arrays (or vectors) of floating point numbers of fixed dimensionality. What makes embeddings useful in practice is that the position of an embedding in vector space captures some of the semantics of the data, depending on the type of model and how it was trained. Points that are close to each other in vector space are considered similar (or appear in similar contexts), and points that are far away are considered dissimilar.
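
To make "close in vector space" concrete, here is a minimal sketch that compares embeddings with cosine similarity. The three-dimensional vectors are hand-picked for illustration and do not come from a real model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: "cat" and "kitten" point in a similar direction,
# while "car" points elsewhere.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.2, 0.9]

print(cosine_similarity(cat, kitten) > cosine_similarity(cat, car))
```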
Large datasets of multi-modal data (text, audio, images, etc.) can be converted into embeddings with the appropriate model. Projecting the vectors' principal components in 2D space results in groups of vectors that represent similar concepts clustering together, as shown below.
![](../assets/embedding_intro.png)
## Indexes
Embeddings for a given dataset are made searchable via an **index**. The index is constructed using data structures that store the embeddings so that scans and lookups on them are very efficient. A key distinguishing feature of LanceDB is that it uses a disk-based index, IVF-PQ, which is a variant of the Inverted File Index (IVF) that uses Product Quantization (PQ) to compress the embeddings.
See the [IVF-PQ](./index_ivfpq.md) page for more details on how it works.
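
As a rough illustration of the compression step, the sketch below encodes a vector with product quantization: the vector is split into sub-vectors, and each sub-vector is replaced by the index of its nearest centroid in a per-sub-vector codebook. The codebooks here are made up by hand; in practice they are typically learned from the data (for example with k-means). The function name `pq_encode` is hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pq_encode(vector, codebooks):
    """Replace each sub-vector with the index of its nearest codebook centroid."""
    sub_dim = len(vector) // len(codebooks)
    codes = []
    for i, codebook in enumerate(codebooks):
        sub = vector[i * sub_dim:(i + 1) * sub_dim]
        nearest = min(
            range(len(codebook)),
            key=lambda c: squared_distance(sub, codebook[c]),
        )
        codes.append(nearest)
    return codes


# Two toy codebooks, each with two 2-dimensional centroids.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
codes = pq_encode([0.9, 0.8, 0.1, 0.9], codebooks)
print(codes)
```

The four floats of the input are stored as two small integers, which is where the space saving comes from.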
## Brute force search
The simplest way to perform vector search is a brute force search, without an index, where the distances between the query vector and all the vectors in the database are computed, and the top-k closest vectors are returned. This is equivalent to a k-nearest neighbours (kNN) search in vector space.
![](../assets/knn_search.png)
As you can imagine, the brute force approach is not scalable for datasets larger than a few hundred thousand vectors, as the latency of the search grows linearly with the size of the dataset. This is where approximate nearest neighbour (ANN) algorithms come in.
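
A brute force search can be sketched in a few lines of plain Python. This is an illustration of the idea with Euclidean distance, not LanceDB's implementation; the name `brute_force_knn` is hypothetical.

```python
import heapq
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_knn(query, vectors, k):
    """Compute the distance to every vector and keep the k closest."""
    distances = ((i, euclidean(query, v)) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, distances, key=lambda pair: pair[1])


vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
top2 = brute_force_knn([1.0, 1.0], vectors, k=2)
print([index for index, _ in top2])
```

Every query touches every stored vector, which is why latency grows linearly with the dataset size.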
## Approximate nearest neighbour (ANN) search
Instead of performing an exhaustive search on the entire database for each and every query, approximate nearest neighbour (ANN) algorithms use an index to narrow down the search space, which significantly reduces query latency. The trade-off is that the results are not guaranteed to be the true nearest neighbors of the query, but are usually "good enough" for most use cases.
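
The sketch below shows one way an index can narrow the search space, in the spirit of an inverted file (IVF) index: vectors are grouped by their nearest centroid, and a query only scans the partitions whose centroids are closest to it. The centroids are fixed by hand here, and the names `build_ivf`, `ivf_search` and `nprobes` are illustrative assumptions rather than LanceDB internals.

```python
import math


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_ivf(vectors, centroids):
    """Assign each vector index to the partition of its nearest centroid."""
    partitions = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: dist(v, centroids[c]))
        partitions[nearest].append(i)
    return partitions


def ivf_search(query, vectors, centroids, partitions, k, nprobes=1):
    """Scan only the nprobes partitions closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: dist(query, centroids[c]))
    candidates = [i for c in probe[:nprobes] for i in partitions[c]]
    return sorted(candidates, key=lambda i: dist(query, vectors[i]))[:k]


vectors = [[0.0, 0.1], [0.2, 0.0], [4.4, 5.0], [5.0, 5.2], [5.6, 5.0]]
centroids = [[0.0, 0.0], [5.0, 5.0]]
partitions = build_ivf(vectors, centroids)
result = ivf_search([4.9, 5.1], vectors, centroids, partitions, k=2)
print(result)
```

Because only the probed partitions are scanned, a true neighbour that sits in an unprobed partition can be missed; probing more partitions trades latency for recall.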


@@ -1,97 +0,0 @@
# Embedding Functions
Embeddings are high dimensional floating-point vector representations of your data or query.
Anything can be embedded using some embedding model or function.
For a given embedding function, the output will always have the same number of dimensions.
## Creating an embedding function
Any function that takes as input a batch (list) of data and outputs a batch (list) of embeddings
can be used by LanceDB as an embedding function. The input and output batch sizes should be the same.
### HuggingFace example
One popular free option would be to use the [sentence-transformers](https://www.sbert.net/) library from HuggingFace.
You can install this using pip: `pip install sentence-transformers`.
```python
from sentence_transformers import SentenceTransformer

name = "paraphrase-albert-small-v2"
model = SentenceTransformer(name)

# used for both training and querying
def embed_func(batch):
    return [model.encode(sentence) for sentence in batch]
```
### OpenAI example
You can also use an external API like OpenAI to generate embeddings
```python
import openai
import os

# Configuring the environment variable OPENAI_API_KEY
if "OPENAI_API_KEY" not in os.environ:
    # OR set the key here as a variable
    openai.api_key = "sk-..."

# verify that the API key is working
assert len(openai.Model.list()["data"]) > 0

def embed_func(c):
    rs = openai.Embedding.create(input=c, engine="text-embedding-ada-002")
    return [record["embedding"] for record in rs["data"]]
```
## Applying an embedding function
You can apply an embedding function to raw data
to generate embeddings for each row.
If you have a pandas DataFrame with a `text` column that you want to embed,
you can use the [with_embeddings](https://lancedb.github.io/lancedb/python/#lancedb.embeddings.with_embeddings)
function to generate embeddings and create a combined pyarrow table:
```python
import pandas as pd
from lancedb.embeddings import with_embeddings

df = pd.DataFrame([{"text": "pepperoni"},
                   {"text": "pineapple"}])
data = with_embeddings(embed_func, df)

# The output is used to create / append to a table
# db.create_table("my_table", data=data)
```
If your data is in a different column, you can specify the `column` kwarg to `with_embeddings`.
By default, LanceDB calls the function with batches of 1000 rows. This can be configured
using the `batch_size` parameter to `with_embeddings`.
LanceDB automatically wraps the function with retry and rate-limit logic to ensure the OpenAI
API call is reliable.
## Searching with an embedding function
At inference time, you also need the same embedding function to embed your query text.
It's important that you use the same model / function otherwise the embedding vectors don't
belong in the same latent space and your results will be nonsensical.
```python
query = "What's the best pizza topping?"
query_vector = embed_func([query])[0]
tbl.search(query_vector).limit(10).to_df()
```
The above snippet returns a pandas DataFrame with the 10 closest vectors to the query.
## Roadmap
In the near future, we'll be integrating the embedding functions deeper into LanceDB.
The goal is that you just have to configure the function once when you create the table,
and then you'll never have to deal with embeddings / vectors after that unless you want to.
We'll also integrate more popular models and APIs.


@@ -0,0 +1,212 @@
To use your own custom embedding function, you can follow these 2 simple steps:
1. Create your embedding function by implementing the `EmbeddingFunction` interface
2. Register your embedding function in the global `EmbeddingFunctionRegistry`.
Let's see what this looks like in action.
![](../assets/embeddings_api.png)
`EmbeddingFunction` and `EmbeddingFunctionRegistry` handle low-level details for serializing schema and model information as metadata. To build a custom embedding function, you don't have to worry about the finer details - simply focus on setting up the model and leave the rest to LanceDB.
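
The two steps can be pictured with a toy registry in plain Python. This is only a sketch of the decorator-based registration pattern; `ToyRegistry` and `CharCountEmbeddings` are hypothetical names, and the real `EmbeddingFunctionRegistry` additionally handles the metadata serialization described above.

```python
class ToyRegistry:
    """Maps an alias to an embedding function class."""

    def __init__(self):
        self._classes = {}

    def register(self, alias):
        def decorator(cls):
            if alias in self._classes:
                raise KeyError(f"{alias!r} is already registered")
            self._classes[alias] = cls
            return cls

        return decorator

    def get(self, alias):
        return self._classes[alias]


registry = ToyRegistry()


# Step 1: implement the interface. Step 2: register it under an alias.
@registry.register("char-count")
class CharCountEmbeddings:
    def ndims(self):
        return 2

    def generate_embeddings(self, texts):
        # Toy "embedding": [number of characters, number of spaces]
        return [[float(len(t)), float(t.count(" "))] for t in texts]


func = registry.get("char-count")()
vectors = func.generate_embeddings(["hello world", "hi"])
print(vectors)
```

Looking the class up by alias is what lets a table remember which embedding function it was created with.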
## `TextEmbeddingFunction` interface
There is another optional layer of abstraction available: `TextEmbeddingFunction`. You can use this abstraction if your model isn't multi-modal and only needs to operate on text. In such cases, the source and vector fields are vectorized the same way, so you only need to set up the model and the rest is handled by `TextEmbeddingFunction`. You can read more about the class and its attributes in the class reference.
Let's implement a `SentenceTransformerEmbeddings` class. All you need to do is implement the `generate_embeddings()` and `ndims()` functions to handle the input types you expect, and register the class in the global `EmbeddingFunctionRegistry`.
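```python
# `cached` is assumed to come from the cachetools package
from cachetools import cached

from lancedb.embeddings import TextEmbeddingFunction, register
from lancedb.util import attempt_import_or_raise


@register("sentence-transformers")
class SentenceTransformerEmbeddings(TextEmbeddingFunction):
    name: str = "all-MiniLM-L6-v2"
    # set more default instance vars like device, etc.

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._ndims = None

    def generate_embeddings(self, texts):
        # `...` stands for further encode() options elided in this example
        return self._embedding_model().encode(list(texts), ...).tolist()

    def ndims(self):
        if self._ndims is None:
            # embed one sample text to discover the output dimensionality
            self._ndims = len(self.generate_embeddings(["foo"])[0])
        return self._ndims

    @cached(cache={})
    def _embedding_model(self):
        sentence_transformers = attempt_import_or_raise(
            "sentence_transformers", "sentence-transformers"
        )
        return sentence_transformers.SentenceTransformer(self.name)
```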
This is a stripped-down version of our implementation of `SentenceTransformerEmbeddings` that removes certain optimizations and default settings.
Now you can use this embedding function to create your table schema, and that's it! You can then ingest data and run queries without manually vectorizing the inputs.
```python
import lancedb
import pandas as pd
from lancedb.embeddings import EmbeddingFunctionRegistry
from lancedb.pydantic import LanceModel, Vector

db = lancedb.connect("/tmp/db")  # example path

registry = EmbeddingFunctionRegistry.get_instance()
stransformer = registry.get("sentence-transformers").create()

class TextModelSchema(LanceModel):
    vector: Vector(stransformer.ndims()) = stransformer.VectorField()
    text: str = stransformer.SourceField()

tbl = db.create_table("table", schema=TextModelSchema)
tbl.add(pd.DataFrame({"text": ["halo", "world"]}))
result = tbl.search("world").limit(5)
```
NOTE:
You can always implement the `EmbeddingFunction` interface directly if you want or need to. `TextEmbeddingFunction` just makes it simpler and faster by setting up the boilerplate for text-specific use cases.
## Multi-modal embedding function example
You can also use the `EmbeddingFunction` interface to implement more complex workflows such as multi-modal embedding function support. LanceDB implements the `OpenClipEmbeddings` class, which supports multi-modal search. Here's the implementation, which you can use as a reference to build your own multi-modal embedding functions.
```python
import concurrent.futures
import io
import os
import urllib.parse as urlparse
from typing import List, Union

import numpy as np
import pyarrow as pa
from pydantic import PrivateAttr

# Import paths below are assumed from the LanceDB package layout
from lancedb.embeddings import EmbeddingFunction, register
from lancedb.embeddings.utils import IMAGES, url_retrieve
from lancedb.util import attempt_import_or_raise


@register("open-clip")
class OpenClipEmbeddings(EmbeddingFunction):
    name: str = "ViT-B-32"
    pretrained: str = "laion2b_s34b_b79k"
    device: str = "cpu"
    batch_size: int = 64
    normalize: bool = True
    _model = PrivateAttr()
    _preprocess = PrivateAttr()
    _tokenizer = PrivateAttr()

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # EmbeddingFunction util to import external libs and raise if not found
        open_clip = attempt_import_or_raise("open_clip", "open-clip")
        model, _, preprocess = open_clip.create_model_and_transforms(
            self.name, pretrained=self.pretrained
        )
        model.to(self.device)
        self._model, self._preprocess = model, preprocess
        self._tokenizer = open_clip.get_tokenizer(self.name)
        self._ndims = None

    def ndims(self):
        if self._ndims is None:
            self._ndims = self.generate_text_embeddings("foo").shape[0]
        return self._ndims

    def compute_query_embeddings(
        self, query: Union[str, "PIL.Image.Image"], *args, **kwargs
    ) -> List[np.ndarray]:
        """
        Compute the embeddings for a given user query

        Parameters
        ----------
        query : Union[str, PIL.Image.Image]
            The query to embed. A query can be either text or an image.
        """
        if isinstance(query, str):
            return [self.generate_text_embeddings(query)]
        else:
            PIL = attempt_import_or_raise("PIL", "pillow")
            if isinstance(query, PIL.Image.Image):
                return [self.generate_image_embedding(query)]
            else:
                raise TypeError("OpenClip supports str or PIL Image as query")

    def generate_text_embeddings(self, text: str) -> np.ndarray:
        torch = attempt_import_or_raise("torch")
        text = self.sanitize_input(text)
        text = self._tokenizer(text)
        text.to(self.device)
        with torch.no_grad():
            text_features = self._model.encode_text(text.to(self.device))
            if self.normalize:
                text_features /= text_features.norm(dim=-1, keepdim=True)
            return text_features.cpu().numpy().squeeze()

    def sanitize_input(self, images: IMAGES) -> Union[List[bytes], np.ndarray]:
        """
        Sanitize the input to the embedding function.
        """
        if isinstance(images, (str, bytes)):
            images = [images]
        elif isinstance(images, pa.Array):
            images = images.to_pylist()
        elif isinstance(images, pa.ChunkedArray):
            images = images.combine_chunks().to_pylist()
        return images

    def compute_source_embeddings(
        self, images: IMAGES, *args, **kwargs
    ) -> List[np.ndarray]:
        """
        Get the embeddings for the given images
        """
        images = self.sanitize_input(images)
        embeddings = []
        for i in range(0, len(images), self.batch_size):
            j = min(i + self.batch_size, len(images))
            batch = images[i:j]
            embeddings.extend(self._parallel_get(batch))
        return embeddings

    def _parallel_get(self, images: Union[List[str], List[bytes]]) -> List[np.ndarray]:
        """
        Issue concurrent requests to retrieve the image data
        """
        with concurrent.futures.ThreadPoolExecutor() as executor:
            futures = [
                executor.submit(self.generate_image_embedding, image)
                for image in images
            ]
            return [future.result() for future in futures]

    def generate_image_embedding(
        self, image: Union[str, bytes, "PIL.Image.Image"]
    ) -> np.ndarray:
        """
        Generate the embedding for a single image

        Parameters
        ----------
        image : Union[str, bytes, PIL.Image.Image]
            The image to embed. If the image is a str, it is treated as a uri.
            If the image is bytes, it is treated as the raw image bytes.
        """
        torch = attempt_import_or_raise("torch")
        # TODO handle retry and errors for https
        image = self._to_pil(image)
        image = self._preprocess(image).unsqueeze(0)
        with torch.no_grad():
            return self._encode_and_normalize_image(image)

    def _to_pil(self, image: Union[str, bytes]):
        PIL = attempt_import_or_raise("PIL", "pillow")
        if isinstance(image, bytes):
            return PIL.Image.open(io.BytesIO(image))
        if isinstance(image, PIL.Image.Image):
            return image
        elif isinstance(image, str):
            parsed = urlparse.urlparse(image)
            # TODO handle drive letter on windows.
            if parsed.scheme == "file":
                return PIL.Image.open(parsed.path)
            elif parsed.scheme == "":
                return PIL.Image.open(image if os.name == "nt" else parsed.path)
            elif parsed.scheme.startswith("http"):
                return PIL.Image.open(io.BytesIO(url_retrieve(image)))
            else:
                raise NotImplementedError("Only local and http(s) urls are supported")

    def _encode_and_normalize_image(self, image_tensor: "torch.Tensor"):
        """
        encode a single image tensor and optionally normalize the output
        """
        image_features = self._model.encode_image(image_tensor)
        if self.normalize:
            image_features /= image_features.norm(dim=-1, keepdim=True)
        return image_features.cpu().numpy().squeeze()
```


@@ -0,0 +1,532 @@
There are various embedding functions available out of the box with LanceDB to manage your embeddings implicitly. We're actively working on adding other popular embedding APIs and models.
## Text embedding functions
Contains the text embedding functions registered by default.
* Embedding functions have an inbuilt rate limit handler wrapper for source and query embedding function calls that retry with exponential backoff.
* Each `EmbeddingFunction` implementation automatically takes `max_retries` as an argument which has the default value of 7.
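
The retry behaviour described above can be sketched as a small wrapper. This is an illustration of retrying with exponential backoff, not LanceDB's actual handler: only the default of 7 retries comes from the text, while the delay values, the `sleep` hook and the name `retry_with_backoff` are assumptions made for the example.

```python
import time


def retry_with_backoff(func, max_retries=7, initial_delay=1.0, backoff=2.0,
                       sleep=time.sleep):
    """Call func again after a failure, multiplying the wait each time."""

    def wrapper(*args, **kwargs):
        delay = initial_delay
        for attempt in range(max_retries + 1):
            try:
                return func(*args, **kwargs)
            except Exception:
                if attempt == max_retries:
                    raise  # retries exhausted, surface the last error
                sleep(delay)
                delay *= backoff

    return wrapper


# A stand-in embedding call that fails twice before succeeding.
calls = {"count": 0}


def flaky_embed(texts):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("rate limited")
    return [[0.0] for _ in texts]


delays = []  # record the waits instead of actually sleeping
embed = retry_with_backoff(flaky_embed, sleep=delays.append)
result = embed(["a", "b"])
print(result, delays)
```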
### Sentence transformers
Allows you to set parameters when registering a `sentence-transformers` object.
!!! info
Sentence transformer embeddings are normalized by default. It is recommended to use normalized embeddings for similarity search.
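
To see why normalization matters, the following sketch scales vectors to unit length. After that, the dot product depends only on direction, so two vectors that point the same way score 1.0 regardless of their original magnitude. The helper names are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


a = l2_normalize([3.0, 4.0])
b = l2_normalize([6.0, 8.0])  # same direction, twice the magnitude
print(abs(dot(a, b) - 1.0) < 1e-9)
```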
| Parameter | Type | Default Value | Description |
|---|---|---|---|
| `name` | `str` | `all-MiniLM-L6-v2` | The name of the model |
| `device` | `str` | `cpu` | The device to run the model on (e.g. `cpu` or `cuda`) |
| `normalize` | `bool` | `True` | Whether to normalize the output embeddings |
??? "Check out available sentence-transformer models here!"
```markdown
- sentence-transformers/all-MiniLM-L12-v2
- sentence-transformers/paraphrase-mpnet-base-v2
- sentence-transformers/gtr-t5-base
- sentence-transformers/LaBSE
- sentence-transformers/all-MiniLM-L6-v2
- sentence-transformers/bert-base-nli-max-tokens
- sentence-transformers/bert-base-nli-mean-tokens
- sentence-transformers/bert-base-nli-stsb-mean-tokens
- sentence-transformers/bert-base-wikipedia-sections-mean-tokens
- sentence-transformers/bert-large-nli-cls-token
- sentence-transformers/bert-large-nli-max-tokens
- sentence-transformers/bert-large-nli-mean-tokens
- sentence-transformers/bert-large-nli-stsb-mean-tokens
- sentence-transformers/distilbert-base-nli-max-tokens
- sentence-transformers/distilbert-base-nli-mean-tokens
- sentence-transformers/distilbert-base-nli-stsb-mean-tokens
- sentence-transformers/distilroberta-base-msmarco-v1
- sentence-transformers/distilroberta-base-msmarco-v2
- sentence-transformers/nli-bert-base-cls-pooling
- sentence-transformers/nli-bert-base-max-pooling
- sentence-transformers/nli-bert-base
- sentence-transformers/nli-bert-large-cls-pooling
- sentence-transformers/nli-bert-large-max-pooling
- sentence-transformers/nli-bert-large
- sentence-transformers/nli-distilbert-base-max-pooling
- sentence-transformers/nli-distilbert-base
- sentence-transformers/nli-roberta-base
- sentence-transformers/nli-roberta-large
- sentence-transformers/roberta-base-nli-mean-tokens
- sentence-transformers/roberta-base-nli-stsb-mean-tokens
- sentence-transformers/roberta-large-nli-mean-tokens
- sentence-transformers/roberta-large-nli-stsb-mean-tokens
- sentence-transformers/stsb-bert-base
- sentence-transformers/stsb-bert-large
- sentence-transformers/stsb-distilbert-base
- sentence-transformers/stsb-roberta-base
- sentence-transformers/stsb-roberta-large
- sentence-transformers/xlm-r-100langs-bert-base-nli-mean-tokens
- sentence-transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens
- sentence-transformers/xlm-r-base-en-ko-nli-ststb
- sentence-transformers/xlm-r-bert-base-nli-mean-tokens
- sentence-transformers/xlm-r-bert-base-nli-stsb-mean-tokens
- sentence-transformers/xlm-r-large-en-ko-nli-ststb
- sentence-transformers/bert-base-nli-cls-token
- sentence-transformers/all-distilroberta-v1
- sentence-transformers/multi-qa-MiniLM-L6-dot-v1
- sentence-transformers/multi-qa-distilbert-cos-v1
- sentence-transformers/multi-qa-distilbert-dot-v1
- sentence-transformers/multi-qa-mpnet-base-cos-v1
- sentence-transformers/multi-qa-mpnet-base-dot-v1
- sentence-transformers/nli-distilroberta-base-v2
- sentence-transformers/all-MiniLM-L6-v1
- sentence-transformers/all-mpnet-base-v1
- sentence-transformers/all-mpnet-base-v2
- sentence-transformers/all-roberta-large-v1
- sentence-transformers/allenai-specter
- sentence-transformers/average_word_embeddings_glove.6B.300d
- sentence-transformers/average_word_embeddings_glove.840B.300d
- sentence-transformers/average_word_embeddings_komninos
- sentence-transformers/average_word_embeddings_levy_dependency
- sentence-transformers/clip-ViT-B-32-multilingual-v1
- sentence-transformers/clip-ViT-B-32
- sentence-transformers/distilbert-base-nli-stsb-quora-ranking
- sentence-transformers/distilbert-multilingual-nli-stsb-quora-ranking
- sentence-transformers/distilroberta-base-paraphrase-v1
- sentence-transformers/distiluse-base-multilingual-cased-v1
- sentence-transformers/distiluse-base-multilingual-cased-v2
- sentence-transformers/distiluse-base-multilingual-cased
- sentence-transformers/facebook-dpr-ctx_encoder-multiset-base
- sentence-transformers/facebook-dpr-ctx_encoder-single-nq-base
- sentence-transformers/facebook-dpr-question_encoder-multiset-base
- sentence-transformers/facebook-dpr-question_encoder-single-nq-base
- sentence-transformers/gtr-t5-large
- sentence-transformers/gtr-t5-xl
- sentence-transformers/gtr-t5-xxl
- sentence-transformers/msmarco-MiniLM-L-12-v3
- sentence-transformers/msmarco-MiniLM-L-6-v3
- sentence-transformers/msmarco-MiniLM-L12-cos-v5
- sentence-transformers/msmarco-MiniLM-L6-cos-v5
- sentence-transformers/msmarco-bert-base-dot-v5
- sentence-transformers/msmarco-bert-co-condensor
- sentence-transformers/msmarco-distilbert-base-dot-prod-v3
- sentence-transformers/msmarco-distilbert-base-tas-b
- sentence-transformers/msmarco-distilbert-base-v2
- sentence-transformers/msmarco-distilbert-base-v3
- sentence-transformers/msmarco-distilbert-base-v4
- sentence-transformers/msmarco-distilbert-cos-v5
- sentence-transformers/msmarco-distilbert-dot-v5
- sentence-transformers/msmarco-distilbert-multilingual-en-de-v2-tmp-lng-aligned
- sentence-transformers/msmarco-distilbert-multilingual-en-de-v2-tmp-trained-scratch
- sentence-transformers/msmarco-distilroberta-base-v2
- sentence-transformers/msmarco-roberta-base-ance-firstp
- sentence-transformers/msmarco-roberta-base-v2
- sentence-transformers/msmarco-roberta-base-v3
- sentence-transformers/multi-qa-MiniLM-L6-cos-v1
- sentence-transformers/nli-mpnet-base-v2
- sentence-transformers/nli-roberta-base-v2
- sentence-transformers/nq-distilbert-base-v1
- sentence-transformers/paraphrase-MiniLM-L12-v2
- sentence-transformers/paraphrase-MiniLM-L3-v2
- sentence-transformers/paraphrase-MiniLM-L6-v2
- sentence-transformers/paraphrase-TinyBERT-L6-v2
- sentence-transformers/paraphrase-albert-base-v2
- sentence-transformers/paraphrase-albert-small-v2
- sentence-transformers/paraphrase-distilroberta-base-v1
- sentence-transformers/paraphrase-distilroberta-base-v2
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- sentence-transformers/paraphrase-multilingual-mpnet-base-v2
- sentence-transformers/paraphrase-xlm-r-multilingual-v1
- sentence-transformers/quora-distilbert-base
- sentence-transformers/quora-distilbert-multilingual
- sentence-transformers/sentence-t5-base
- sentence-transformers/sentence-t5-large
- sentence-transformers/sentence-t5-xxl
- sentence-transformers/sentence-t5-xl
- sentence-transformers/stsb-distilroberta-base-v2
- sentence-transformers/stsb-mpnet-base-v2
- sentence-transformers/stsb-roberta-base-v2
- sentence-transformers/stsb-xlm-r-multilingual
- sentence-transformers/xlm-r-distilroberta-base-paraphrase-v1
- sentence-transformers/clip-ViT-L-14
- sentence-transformers/clip-ViT-B-16
- sentence-transformers/use-cmlm-multilingual
- sentence-transformers/all-MiniLM-L12-v1
```
!!! info
You can also load many other model architectures from the library. For example models from sources such as BAAI, nomic, salesforce research, etc.
See this HF hub page for all [supported models](https://huggingface.co/models?library=sentence-transformers).
!!! note "BAAI Embeddings example"
Here is an example that uses BAAI embedding model from the HuggingFace Hub [supported models](https://huggingface.co/models?library=sentence-transformers)
```python
import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry

db = lancedb.connect("/tmp/db")
model = get_registry().get("sentence-transformers").create(name="BAAI/bge-small-en-v1.5", device="cpu")

class Words(LanceModel):
    text: str = model.SourceField()
    vector: Vector(model.ndims()) = model.VectorField()

table = db.create_table("words", schema=Words)
table.add(
    [
        {"text": "hello world"},
        {"text": "goodbye world"}
    ]
)

query = "greetings"
actual = table.search(query).limit(1).to_pydantic(Words)[0]
print(actual.text)
```
Visit sentence-transformers [HuggingFace HUB](https://huggingface.co/sentence-transformers) page for more information on the available models.
### Huggingface embedding models
We offer support for all Hugging Face models that can be loaded via the [transformers](https://huggingface.co/docs/transformers/en/index) library. The default model is `colbert-ir/colbertv2.0`, which also has its own registry alias: `registry.get("colbert")`.
Example usage:
```python
import lancedb
import pandas as pd
from lancedb.embeddings import get_registry
from lancedb.pydantic import LanceModel, Vector

db = lancedb.connect("/tmp/db")  # example path
model = get_registry().get("huggingface").create(name='facebook/bart-base')

class TextModel(LanceModel):
    text: str = model.SourceField()
    vector: Vector(model.ndims()) = model.VectorField()

df = pd.DataFrame({"text": ["hi hello sayonara", "goodbye world"]})
table = db.create_table("greets", schema=TextModel)
table.add(df)

query = "old greeting"
actual = table.search(query).limit(1).to_pydantic(TextModel)[0]
print(actual.text)
```
### OpenAI embeddings
LanceDB registers the OpenAI embeddings function in the registry by default, as `openai`. Below are the parameters that you can customize when creating the instances:
| Parameter | Type | Default Value | Description |
|---|---|---|---|
| `name` | `str` | `"text-embedding-ada-002"` | The name of the model. |
| `dim` | `int` | Model default | For OpenAI's newer `text-embedding-3` models, you can use this parameter to request an output dimensionality smaller than 1536. |
```python
import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry

db = lancedb.connect("/tmp/db")
func = get_registry().get("openai").create(name="text-embedding-ada-002")

class Words(LanceModel):
    text: str = func.SourceField()
    vector: Vector(func.ndims()) = func.VectorField()

table = db.create_table("words", schema=Words, mode="overwrite")
table.add(
    [
        {"text": "hello world"},
        {"text": "goodbye world"}
    ]
)

query = "greetings"
actual = table.search(query).limit(1).to_pydantic(Words)[0]
print(actual.text)
```
### Instructor Embeddings
[Instructor](https://instructor-embedding.github.io/) is an instruction-finetuned text embedding model that can generate text embeddings tailored to any task (e.g. classification, retrieval, clustering, text evaluation, etc.) and domains (e.g. science, finance, etc.) by simply providing the task instruction, without any finetuning.
If you want to calculate customized embeddings for specific sentences, you can follow the unified template to write instructions.
!!! info
Represent the `domain` `text_type` for `task_objective`:
* `domain` is optional, and it specifies the domain of the text, e.g. science, finance, medicine, etc.
* `text_type` is required, and it specifies the encoding unit, e.g. sentence, document, paragraph, etc.
* `task_objective` is optional, and it specifies the objective of embedding, e.g. retrieve a document, classify the sentence, etc.
More information about the model can be found at the [source URL](https://github.com/xlang-ai/instructor-embedding).
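
The template can be assembled with a small helper. This is a sketch only: `build_instruction` is a hypothetical function, not part of LanceDB or the Instructor library, and it simply follows the rule above that `text_type` is required while `domain` and `task_objective` are optional.

```python
def build_instruction(text_type, domain=None, task_objective=None):
    """Build 'Represent the <domain> <text_type> for <task_objective>:'."""
    parts = ["Represent the"]
    if domain:
        parts.append(domain)
    parts.append(text_type)
    instruction = " ".join(parts)
    if task_objective:
        instruction += f" for {task_objective}"
    return instruction + ":"


print(build_instruction("document", domain="science", task_objective="retrieval"))
print(build_instruction("sentence"))
```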
| Argument | Type | Default | Description |
|---|---|---|---|
| `name` | `str` | `"hkunlp/instructor-base"` | The name of the model to use |
| `batch_size` | `int` | `32` | The batch size to use when generating embeddings |
| `device` | `str` | `"cpu"` | The device to use when generating embeddings |
| `show_progress_bar` | `bool` | `True` | Whether to show a progress bar when generating embeddings |
| `normalize_embeddings` | `bool` | `True` | Whether to normalize the embeddings |
| `quantize` | `bool` | `False` | Whether to quantize the model |
| `source_instruction` | `str` | `"represent the docuement for retreival"` | The instruction for the source column |
| `query_instruction` | `str` | `"represent the document for retreiving the most similar documents"` | The instruction for the query |
```python
import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry

instructor = get_registry().get("instructor").create(
    source_instruction="represent the docuement for retreival",
    query_instruction="represent the document for retreiving the most similar documents"
)

class Schema(LanceModel):
    vector: Vector(instructor.ndims()) = instructor.VectorField()
    text: str = instructor.SourceField()

db = lancedb.connect("~/.lancedb")
tbl = db.create_table("test", schema=Schema, mode="overwrite")

texts = [{"text": "Capitalism has been dominant in the Western world since the end of feudalism, but most feel[who?] that..."},
         {"text": "The disparate impact theory is especially controversial under the Fair Housing Act because the Act..."},
         {"text": "Disparate impact in United States labor law refers to practices in employment, housing, and other areas that.."}]

tbl.add(texts)
```
### Gemini Embeddings
With Google's Gemini, you can represent text (words, sentences, and blocks of text) in a vectorized form, making it easier to compare and contrast embeddings. For example, two texts that share a similar subject matter or sentiment should have similar embeddings, which can be identified through mathematical comparison techniques such as cosine similarity. For more on how and why you should use embeddings, refer to the Embeddings guide.
The Gemini Embedding Model API supports various task types:
| Task Type | Description |
|---|---|
| `retrieval_query` | Specifies the given text is a query in a search/retrieval setting. |
| `retrieval_document` | Specifies the given text is a document in a search/retrieval setting. Using this task type requires a title, but it is automatically provided by the Embeddings API. |
| `semantic_similarity` | Specifies the given text will be used for Semantic Textual Similarity (STS). |
| `classification` | Specifies that the embeddings will be used for classification. |
| `clustering` | Specifies that the embeddings will be used for clustering. |
Usage Example:
```python
import lancedb
import pandas as pd
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry

model = get_registry().get("gemini-text").create()

class TextModel(LanceModel):
    text: str = model.SourceField()
    vector: Vector(model.ndims()) = model.VectorField()

df = pd.DataFrame({"text": ["hello world", "goodbye world"]})
db = lancedb.connect("~/.lancedb")
tbl = db.create_table("test", schema=TextModel, mode="overwrite")

tbl.add(df)
rs = tbl.search("hello").limit(1).to_pandas()
```
### AWS Bedrock Text Embedding Functions
AWS Bedrock supports multiple base models for generating text embeddings. You need to set up AWS credentials to use this embedding function.
You can do so with `awscli`, also adding your session token:
```shell
aws configure
aws configure set aws_session_token "<your_session_token>"
```
To ensure that the credentials are set up correctly, you can run the following command:
```shell
aws sts get-caller-identity
```
Supported Embedding modelIDs are:
* `amazon.titan-embed-text-v1`
* `cohere.embed-english-v3`
* `cohere.embed-multilingual-v3`
Supported parameters (to be passed in `create` method) are:
| Parameter | Type | Default Value | Description |
|---|---|---|---|
| **name** | str | "amazon.titan-embed-text-v1" | The model ID of the bedrock model to use. Supported base models for Text Embeddings: amazon.titan-embed-text-v1, cohere.embed-english-v3, cohere.embed-multilingual-v3 |
| **region** | str | "us-east-1" | Optional name of the AWS Region in which the service should be called (e.g., "us-east-1"). |
| **profile_name** | str | None | Optional name of the AWS profile to use for calling the Bedrock service. If not specified, the default profile will be used. |
| **assumed_role** | str | None | Optional ARN of an AWS IAM role to assume for calling the Bedrock service. If not specified, the current active credentials will be used. |
| **role_session_name** | str | "lancedb-embeddings" | Optional name of the AWS IAM role session to use for calling the Bedrock service. If not specified, a "lancedb-embeddings" name will be used. |
| **runtime** | bool | True | Optional choice of getting different client to perform operations with the Amazon Bedrock service. |
| **max_retries** | int | 7 | Optional number of retries to perform when a request fails. |
Usage Example:
```python
import lancedb
import pandas as pd
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry
model = get_registry().get("bedrock-text").create()
class TextModel(LanceModel):
    text: str = model.SourceField()
    vector: Vector(model.ndims()) = model.VectorField()
df = pd.DataFrame({"text": ["hello world", "goodbye world"]})
db = lancedb.connect("tmp_path")
tbl = db.create_table("test", schema=TextModel, mode="overwrite")
tbl.add(df)
rs = tbl.search("hello").limit(1).to_pandas()
```
## Multi-modal embedding functions
Multi-modal embedding functions allow you to query your table using both images and text.
### OpenClip embeddings
We support CLIP model embeddings using the open source alternative, [open-clip](https://github.com/mlfoundations/open_clip) which supports various customizations. It is registered as `open-clip` and supports the following customizations:
| Parameter | Type | Default Value | Description |
|---|---|---|---|
| `name` | `str` | `"ViT-B-32"` | The name of the model. |
| `pretrained` | `str` | `"laion2b_s34b_b79k"` | The name of the pretrained model to load. |
| `device` | `str` | `"cpu"` | The device to run the model on. Can be `"cpu"` or `"gpu"`. |
| `batch_size` | `int` | `64` | The number of images to process in a batch. |
| `normalize` | `bool` | `True` | Whether to normalize the input images before feeding them to the model. |
This embedding function supports ingesting images as both bytes and URLs. You can query them using both text and other images.
!!! info
LanceDB supports ingesting images directly from accessible links.
```python
import lancedb
import requests
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry
# tmp_path is a placeholder for any local directory path
db = lancedb.connect(tmp_path)
func = get_registry().get("open-clip").create()
class Images(LanceModel):
    label: str
    image_uri: str = func.SourceField()  # image uri as the source
    image_bytes: bytes = func.SourceField()  # image bytes as the source
    vector: Vector(func.ndims()) = func.VectorField()  # vector column
    vec_from_bytes: Vector(func.ndims()) = func.VectorField()  # Another vector column
table = db.create_table("images", schema=Images)
labels = ["cat", "cat", "dog", "dog", "horse", "horse"]
uris = [
"http://farm1.staticflickr.com/53/167798175_7c7845bbbd_z.jpg",
"http://farm1.staticflickr.com/134/332220238_da527d8140_z.jpg",
"http://farm9.staticflickr.com/8387/8602747737_2e5c2a45d4_z.jpg",
"http://farm5.staticflickr.com/4092/5017326486_1f46057f5f_z.jpg",
"http://farm9.staticflickr.com/8216/8434969557_d37882c42d_z.jpg",
"http://farm6.staticflickr.com/5142/5835678453_4f3a4edb45_z.jpg",
]
# get each uri as bytes
image_bytes = [requests.get(uri).content for uri in uris]
# build one row per image
table.add(
    [
        {"label": label, "image_uri": uri, "image_bytes": data}
        for label, uri, data in zip(labels, uris, image_bytes)
    ]
)
```
Now we can search using text from both the default vector column and the custom vector column:
```python
# text search
actual = table.search("man's best friend").limit(1).to_pydantic(Images)[0]
print(actual.label) # prints "dog"
frombytes = (
    table.search("man's best friend", vector_column_name="vec_from_bytes")
    .limit(1)
    .to_pydantic(Images)[0]
)
print(frombytes.label)
```
Because we're using a multi-modal embedding function, we can also search using images:
```python
# image search
import io
import requests
from PIL import Image
query_image_uri = "http://farm1.staticflickr.com/200/467715466_ed4a31801f_z.jpg"
image_bytes = requests.get(query_image_uri).content
query_image = Image.open(io.BytesIO(image_bytes))
actual = table.search(query_image).limit(1).to_pydantic(Images)[0]
print(actual.label == "dog")
# image search using a custom vector column
other = (
    table.search(query_image, vector_column_name="vec_from_bytes")
    .limit(1)
    .to_pydantic(Images)[0]
)
print(other.label)
```
### Imagebind embeddings
LanceDB supports [ImageBind](https://github.com/facebookresearch/ImageBind) model embeddings. You can install our packaged version of the model with `pip install imagebind-packaged==0.1.2`.
This function is registered as `imagebind` and supports audio, video, and text modalities (extending to thermal, depth, and IMU data):
| Parameter | Type | Default Value | Description |
|---|---|---|---|
| `name` | `str` | `"imagebind_huge"` | Name of the model. |
| `device` | `str` | `"cpu"` | The device to run the model on. Can be `"cpu"` or `"gpu"`. |
| `normalize` | `bool` | `False` | Set to `True` to normalize your inputs before model ingestion. |
Below is an example demonstrating how the API works:
```python
import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry
# tmp_path is a placeholder for any local directory path
db = lancedb.connect(tmp_path)
func = get_registry().get("imagebind").create()
class ImageBindModel(LanceModel):
    text: str
    image_uri: str = func.SourceField()
    audio_path: str
    vector: Vector(func.ndims()) = func.VectorField()
# add locally accessible image paths
text_list = ["A dog.", "A car", "A bird"]
image_paths = ["./assets/dog_image.jpg", "./assets/car_image.jpg", "./assets/bird_image.jpg"]
audio_paths = ["./assets/dog_audio.wav", "./assets/car_audio.wav", "./assets/bird_audio.wav"]
# Load data
inputs = [
    {"text": a, "audio_path": b, "image_uri": c}
    for a, b, c in zip(text_list, audio_paths, image_paths)
]
# create table and add data
table = db.create_table("img_bind", schema=ImageBindModel)
table.add(inputs)
```
Now, we can search using any modality:
#### Image search
```python
query_image = "./assets/dog_image2.jpg" #download an image and enter that path here
actual = table.search(query_image).limit(1).to_pydantic(ImageBindModel)[0]
print(actual.text == "A dog.")  # compare against the text stored in the table
```
#### Audio search
```python
query_audio = "./assets/car_audio2.wav" #download an audio clip and enter path here
actual = table.search(query_audio).limit(1).to_pydantic(ImageBindModel)[0]
print(actual.text == "A car")  # compare against the text stored in the table
```
#### Text search
You can add any input query and fetch the result as follows:
```python
query = "an animal which flies and tweets"
actual = table.search(query).limit(1).to_pydantic(ImageBindModel)[0]
print(actual.text == "A bird")  # compare against the text stored in the table
```
If you have any questions about the embeddings API, supported models, or see a relevant model missing, please raise an issue [on GitHub](https://github.com/lancedb/lancedb/issues).


@@ -0,0 +1,169 @@
Representing multi-modal data as vector embeddings is becoming a standard practice. Embedding functions can themselves be thought of as a key part of the data processing pipeline that each request has to pass through. The assumption here is that, after initial setup, these components and the underlying methodology are not expected to change for a particular project.
For this purpose, LanceDB introduces an **embedding functions API** that lets you set up the embedding function once, during the configuration stage of your project. After this, the table remembers it, effectively making the embedding functions *disappear in the background*, so you don't have to worry about manually passing callables and can instead focus on the rest of your data engineering pipeline.
!!! warning
Using the embedding function registry means that you don't have to explicitly generate the embeddings yourself.
However, if your embedding function changes, you'll have to re-configure your table with the new embedding function
and regenerate the embeddings. In the future, we plan to support the ability to change the embedding function via
table metadata and have LanceDB automatically take care of regenerating the embeddings.
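As a mental model for how a table can "remember" its embedding function, the sketch below uses plain Python only. It is not LanceDB's implementation: `_REGISTRY`, `register`, `ToyTable`, and `toy_embed` are hypothetical names made up for illustration, and the "embedding" is just the text length and the number of spaces.

```python
from typing import Callable, Dict, List

# Hypothetical toy registry: maps a name to an embedding callable.
_REGISTRY: Dict[str, Callable[[str], List[float]]] = {}


def register(name: str):
    """Decorator that stores an embedding callable under `name`."""
    def wrapper(func):
        _REGISTRY[name] = func
        return func
    return wrapper


@register("toy-length")
def toy_embed(text: str) -> List[float]:
    # Stand-in "model": two hand-made features, not a real embedding.
    return [float(len(text)), float(text.count(" "))]


class ToyTable:
    """Remembers which registered function to use, by name."""

    def __init__(self, embedding_name: str):
        self.embedding_name = embedding_name  # plays the role of table metadata
        self.rows: List[dict] = []

    def _embed(self, text: str) -> List[float]:
        return _REGISTRY[self.embedding_name](text)

    def add(self, texts: List[str]) -> None:
        # The caller passes only source text; vectors are filled in here.
        for text in texts:
            self.rows.append({"text": text, "vector": self._embed(text)})

    def search(self, query: str) -> dict:
        # The query is embedded with the same remembered function.
        q = self._embed(query)
        return min(
            self.rows,
            key=lambda row: sum((a - b) ** 2 for a, b in zip(row["vector"], q)),
        )


table = ToyTable("toy-length")
table.add(["hello world", "hi"])
print(table.search("hey")["text"])  # prints "hi"
```

Neither `add` nor `search` is handed an embedding callable by the caller: the table looks it up by the name it stored at creation time. That is the convenience the warning above trades against having to regenerate embeddings when the function changes.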
## 1. Define the embedding function
=== "Python"
In the LanceDB python SDK, we define a global embedding function registry with
many different embedding models and even more coming soon.
Let's use an implementation of CLIP as an example.
```python
from lancedb.embeddings import get_registry
registry = get_registry()
clip = registry.get("open-clip").create()
```
You can also define your own embedding function by implementing the `EmbeddingFunction`
abstract base interface. It subclasses the Pydantic model class, which lets you write complex schemas simply, as we'll see next.
=== "JavaScript"
In the TypeScript SDK, the choices are more limited. For now, only the OpenAI
embedding function is available.
```javascript
const lancedb = require("vectordb");
// You need to provide an OpenAI API key
const apiKey = "sk-..."
// The embedding function will create embeddings for the 'text' column
const embedding = new lancedb.OpenAIEmbeddingFunction('text', apiKey)
```
## 2. Define the data model or schema
=== "Python"
The embedding function defined above abstracts away all the details about the models and dimensions required to define the schema. You can simply set a field as **source** or **vector** column. Here's how:
```python
from lancedb.pydantic import LanceModel, Vector
class Pets(LanceModel):
    vector: Vector(clip.ndims()) = clip.VectorField()
    image_uri: str = clip.SourceField()
```
`VectorField` tells LanceDB to use the clip embedding function to generate query embeddings for the `vector` column and `SourceField` ensures that when adding data, we automatically use the specified embedding function to encode `image_uri`.
=== "JavaScript"
For the TypeScript SDK, a schema can be inferred from input data, or an explicit
Arrow schema can be provided.
## 3. Create table and add data
Now that we have chosen/defined our embedding function and the schema,
we can create the table and ingest data without needing to explicitly generate
the embeddings at all:
=== "Python"
```python
import lancedb
db = lancedb.connect("~/lancedb")
table = db.create_table("pets", schema=Pets)
# uris is a list of image paths or URLs prepared beforehand
table.add([{"image_uri": u} for u in uris])
```
=== "JavaScript"
```javascript
const db = await lancedb.connect("data/sample-lancedb");
const data = [
{ text: "pepperoni"},
{ text: "pineapple"}
]
const table = await db.createTable("vectors", data, embedding)
```
## 4. Querying your table
Not only can you forget about the embeddings during ingestion, you also don't
need to worry about it when you query the table:
=== "Python"
Our OpenCLIP query embedding function supports querying via both text and images:
```python
results = (
    table.search("dog")
    .limit(10)
    .to_pandas()
)
```
Or we can search using an image:
```python
from pathlib import Path
from PIL import Image
p = Path("path/to/images/samoyed_100.jpg")
query_image = Image.open(p)
results = (
    table.search(query_image)
    .limit(10)
    .to_pandas()
)
```
Both of the above snippets return a pandas DataFrame with the 10 closest vectors to the query.
=== "JavaScript"
```javascript
const results = await table
.search("What's the best pizza topping?")
.limit(10)
.execute()
```
The above snippet returns an array of records with the top 10 nearest neighbors to the query.
---
## Rate limit handling
The `EmbeddingFunction` class wraps the calls for source and query embedding generation inside a rate limit handler that retries the requests with exponential backoff after successive failures. By default, the maximum number of retries is 7. You can tune it by setting `max_retries` to a different number, or disable retries by setting it to 0.
An example of how to do this is shown below:
```python
clip = registry.get("open-clip").create() # Defaults to 7 max retries
clip = registry.get("open-clip").create(max_retries=10) # Increase max retries to 10
clip = registry.get("open-clip").create(max_retries=0) # Retries disabled
```
!!! note
Embedding functions can also fail due to other errors that have nothing to do with rate limits.
This is why the error is also logged.
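To make "retries with exponential backoff" concrete, here is a minimal standalone sketch of the general technique. It is not LanceDB's handler: `retry_with_backoff`, `base_delay`, the jitter range, and the injectable `sleep` parameter are assumptions chosen for illustration, and the sketch prints failures where a real handler would log them.

```python
import random
import time


def retry_with_backoff(func, max_retries=7, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt (plus jitter) and retry."""
    attempt = 0
    while True:
        try:
            return func()
        except Exception as exc:
            if attempt >= max_retries:
                raise  # retries exhausted (or disabled when max_retries == 0)
            # The delay doubles with every failed attempt; jitter spreads out callers.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            print(f"attempt {attempt + 1} failed ({exc!r}); retrying in {delay:.2f}s")
            sleep(delay)
            attempt += 1


# Demo: a function that fails twice before succeeding.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"


# sleep is replaced by a no-op so the demo finishes immediately.
result = retry_with_backoff(flaky, max_retries=7, base_delay=0.01, sleep=lambda s: None)
print(result, calls["n"])
```

With `max_retries=0` the first exception propagates straight to the caller, which matches the "retries disabled" setting described above.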
## Some fun with Pydantic
LanceDB is integrated with Pydantic, which was used in the example above to define the schema in Python. It's also used behind the scenes by the embedding function API to ingest useful information as table metadata.
You can also use the integration for adding utility operations in the schema. For example, in our multi-modal example, you can search images using text or another image. Let's define a utility function to plot the image.
```python
from PIL import Image
class Pets(LanceModel):
    vector: Vector(clip.ndims()) = clip.VectorField()
    image_uri: str = clip.SourceField()
    @property
    def image(self):
        return Image.open(self.image_uri)
```
Now, you can convert your search results to a Pydantic model and use this property.
```python
rs = table.search(query_image).limit(3).to_pydantic(Pets)
rs[2].image
```
![](../assets/dog_clip_output.png)
Now that you have the basic idea about LanceDB embedding functions and the embedding function registry,
let's dive deeper into defining your own [custom functions](./custom_embedding_function.md).


@@ -0,0 +1,74 @@
Due to the nature of vector embeddings, they can be used to represent any kind of data, from text to images to audio.
This makes them a very powerful tool for machine learning practitioners.
However, there's no one-size-fits-all solution for generating embeddings - there are many different libraries and APIs
(both commercial and open source) that can be used to generate embeddings from structured/unstructured data.
LanceDB supports 3 methods of working with embeddings.
1. You can manually generate embeddings for the data and queries. This is done outside of LanceDB.
2. You can use the built-in [embedding functions](./embedding_functions.md) to embed the data and queries in the background.
3. For Python users, you can define your own [custom embedding function](./custom_embedding_function.md)
that extends the default embedding functions.
For Python users, there is also a legacy [with_embeddings API](./legacy.md).
It is retained for compatibility and will be removed in a future version.
## Quickstart
To get started with embeddings, you can use the built-in embedding functions.
### OpenAI Embedding function
LanceDB registers the OpenAI embeddings function in the registry as `openai`. You can pass any supported model name to `create`. By default it uses `"text-embedding-ada-002"`.
```python
import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry
db = lancedb.connect("/tmp/db")
func = get_registry().get("openai").create(name="text-embedding-ada-002")
class Words(LanceModel):
    text: str = func.SourceField()
    vector: Vector(func.ndims()) = func.VectorField()
table = db.create_table("words", schema=Words, mode="overwrite")
table.add(
    [
        {"text": "hello world"},
        {"text": "goodbye world"}
    ]
)
query = "greetings"
actual = table.search(query).limit(1).to_pydantic(Words)[0]
print(actual.text)
```
### Sentence Transformers Embedding function
LanceDB registers the Sentence Transformers embeddings function in the registry as `sentence-transformers`. You can pass any supported model name to `create`. By default it uses `"sentence-transformers/paraphrase-MiniLM-L6-v2"`.
```python
import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry
db = lancedb.connect("/tmp/db")
model = get_registry().get("sentence-transformers").create(name="BAAI/bge-small-en-v1.5", device="cpu")
class Words(LanceModel):
    text: str = model.SourceField()
    vector: Vector(model.ndims()) = model.VectorField()
table = db.create_table("words", schema=Words)
table.add(
    [
        {"text": "hello world"},
        {"text": "goodbye world"}
    ]
)
query = "greetings"
actual = table.search(query).limit(1).to_pydantic(Words)[0]
print(actual.text)
```


@@ -0,0 +1,99 @@
The legacy `with_embeddings` API is for Python only and is deprecated.
### Hugging Face
The most popular open source option is to use the [sentence-transformers](https://www.sbert.net/)
library, which can be installed via pip.
```bash
pip install sentence-transformers
```
The example below shows how to use the `paraphrase-albert-small-v2` model to generate embeddings
for a given document.
```python
from sentence_transformers import SentenceTransformer
name = "paraphrase-albert-small-v2"
model = SentenceTransformer(name)
# used for both training and querying
def embed_func(batch):
    return [model.encode(sentence) for sentence in batch]
```
### OpenAI
Another popular alternative is to use an external API like OpenAI's [embeddings API](https://platform.openai.com/docs/guides/embeddings/what-are-embeddings).
```python
import openai
import os
# The client reads the API key from the OPENAI_API_KEY environment variable
if "OPENAI_API_KEY" not in os.environ:
    # OR set the key here
    os.environ["OPENAI_API_KEY"] = "sk-..."
client = openai.OpenAI()
def embed_func(c):
    rs = client.embeddings.create(input=c, model="text-embedding-ada-002")
    # the v1 client returns a response object, so use attribute access
    return [record.embedding for record in rs.data]
```
## Applying an embedding function to data
You can apply an embedding function to raw data
to generate embeddings for each record.
Say you have a pandas DataFrame with a `text` column that you want embedded.
You can use the `with_embeddings` function to generate embeddings and add them to
an existing table.
```python
import pandas as pd
from lancedb.embeddings import with_embeddings
df = pd.DataFrame(
[
{"text": "pepperoni"},
{"text": "pineapple"}
]
)
data = with_embeddings(embed_func, df)
# The output is used to create / append to a table
tbl = db.create_table("my_table", data=data)
```
If your data is in a different column, you can specify the `column` kwarg to `with_embeddings`.
By default, LanceDB calls the function with batches of 1000 rows. This can be configured
using the `batch_size` parameter to `with_embeddings`.
LanceDB automatically wraps the function with retry and rate-limit logic to ensure the OpenAI
API call is reliable.
## Querying using an embedding function
!!! warning
At query time, you **must** use the same embedding function you used to vectorize your data.
If you use a different embedding function, the embeddings will not reside in the same vector
space and the results will be nonsensical.
=== "Python"
```python
query = "What's the best pizza topping?"
query_vector = embed_func([query])[0]
results = (
    tbl.search(query_vector)
    .limit(10)
    .to_pandas()
)
```
The above snippet returns a pandas DataFrame with the 10 closest vectors to the query.
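To illustrate what "the 10 closest vectors" means, the following sketch ranks rows by distance with an exhaustive scan. It assumes Euclidean (L2) distance and tiny hand-made two-dimensional vectors; `l2_distance` and `brute_force_search` are hypothetical helpers, and a real vector database would typically use an index rather than comparing against every row.

```python
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_search(
    query: Sequence[float], rows: List[dict], limit: int = 10
) -> List[Tuple[float, dict]]:
    """Return up to `limit` (distance, row) pairs, closest first."""
    scored = [(l2_distance(query, row["vector"]), row) for row in rows]
    scored.sort(key=lambda pair: pair[0])
    return scored[:limit]


# Hand-made vectors for illustration; real ones come from embed_func.
rows = [
    {"text": "pepperoni", "vector": [1.0, 0.0]},
    {"text": "pineapple", "vector": [0.0, 1.0]},
    {"text": "mushroom", "vector": [0.9, 0.1]},
]

for dist, row in brute_force_search([1.0, 0.2], rows, limit=2):
    print(f"{row['text']}: {dist:.3f}")
```

The ranking is only meaningful when the query vector and the stored vectors come from the same embedding function, which is the point of the warning above.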


@@ -0,0 +1,7 @@
# Code documentation Q&A bot with LangChain
## Use LanceDB's LangChain integration to build a Q&A bot for your documentation
<img id="splash" width="400" alt="langchain" src="https://user-images.githubusercontent.com/917119/236580868-61a246a9-e587-4c2b-8ae5-6fe5f7b7e81e.png">
This example is in a [notebook](https://github.com/lancedb/lancedb/blob/main/docs/src/notebooks/code_qa_bot.ipynb)


@@ -0,0 +1,11 @@
# Examples: JavaScript
To help you get started, we provide some examples, projects and applications that use the LanceDB JavaScript API. You can always find the latest examples in our [VectorDB Recipes](https://github.com/lancedb/vectordb-recipes) repository.
| Example | Scripts |
|-------- | ------ |
| | |
| [Youtube transcript search bot](https://github.com/lancedb/vectordb-recipes/tree/main/examples/youtube_bot/) | [![JavaScript](https://img.shields.io/badge/javascript-%23323330.svg?style=for-the-badge&logo=javascript&logoColor=%23F7DF1E)](https://github.com/lancedb/vectordb-recipes/tree/main/examples/youtube_bot/index.js)|
| [Langchain: Code Docs QA bot](https://github.com/lancedb/vectordb-recipes/tree/main/examples/Code-Documentation-QA-Bot/) | [![JavaScript](https://img.shields.io/badge/javascript-%23323330.svg?style=for-the-badge&logo=javascript&logoColor=%23F7DF1E)](https://github.com/lancedb/vectordb-recipes/tree/main/examples/Code-Documentation-QA-Bot/index.js)|
| [AI Agents: Reducing Hallucination](https://github.com/lancedb/vectordb-recipes/tree/main/examples/reducing_hallucinations_ai_agents/) | [![JavaScript](https://img.shields.io/badge/javascript-%23323330.svg?style=for-the-badge&logo=javascript&logoColor=%23F7DF1E)](https://github.com/lancedb/vectordb-recipes/tree/main/examples/reducing_hallucinations_ai_agents/index.js)|
| [TransformersJS Embedding example](https://github.com/lancedb/vectordb-recipes/tree/main/examples/js-transformers/) | [![JavaScript](https://img.shields.io/badge/javascript-%23323330.svg?style=for-the-badge&logo=javascript&logoColor=%23F7DF1E)](https://github.com/lancedb/vectordb-recipes/tree/main/examples/js-transformers/index.js) |


@@ -0,0 +1,17 @@
# Examples: Python
To help you get started, we provide some examples, projects and applications that use the LanceDB Python API. You can always find the latest examples in our [VectorDB Recipes](https://github.com/lancedb/vectordb-recipes) repository.
| Example | Interactive Envs | Scripts |
|-------- | ---------------- | ------ |
| | | |
| [Youtube transcript search bot](https://github.com/lancedb/vectordb-recipes/tree/main/examples/youtube_bot/) | <a href="https://colab.research.google.com/github/lancedb/vectordb-recipes/blob/main/examples/youtube_bot/main.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a>| [![Python](https://img.shields.io/badge/python-3670A0?style=for-the-badge&logo=python&logoColor=ffdd54)](https://github.com/lancedb/vectordb-recipes/tree/main/examples/youtube_bot/main.py)|
| [Langchain: Code Docs QA bot](https://github.com/lancedb/vectordb-recipes/tree/main/examples/Code-Documentation-QA-Bot/) | <a href="https://colab.research.google.com/github/lancedb/vectordb-recipes/blob/main/examples/Code-Documentation-QA-Bot/main.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a>| [![Python](https://img.shields.io/badge/python-3670A0?style=for-the-badge&logo=python&logoColor=ffdd54)](https://github.com/lancedb/vectordb-recipes/tree/main/examples/Code-Documentation-QA-Bot/main.py) |
| [AI Agents: Reducing Hallucination](https://github.com/lancedb/vectordb-recipes/tree/main/examples/reducing_hallucinations_ai_agents/) | <a href="https://colab.research.google.com/github/lancedb/vectordb-recipes/blob/main/examples/reducing_hallucinations_ai_agents/main.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a>| [![Python](https://img.shields.io/badge/python-3670A0?style=for-the-badge&logo=python&logoColor=ffdd54)](https://github.com/lancedb/vectordb-recipes/tree/main/examples/reducing_hallucinations_ai_agents/main.py)|
| [Multimodal CLIP: DiffusionDB](https://github.com/lancedb/vectordb-recipes/tree/main/examples/multimodal_clip/) | <a href="https://colab.research.google.com/github/lancedb/vectordb-recipes/blob/main/examples/multimodal_clip/main.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a>| [![Python](https://img.shields.io/badge/python-3670A0?style=for-the-badge&logo=python&logoColor=ffdd54)](https://github.com/lancedb/vectordb-recipes/tree/main/examples/multimodal_clip/main.py) |
| [Multimodal CLIP: Youtube videos](https://github.com/lancedb/vectordb-recipes/tree/main/examples/multimodal_video_search/) | <a href="https://colab.research.google.com/github/lancedb/vectordb-recipes/blob/main/examples/multimodal_video_search/main.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a>| [![Python](https://img.shields.io/badge/python-3670A0?style=for-the-badge&logo=python&logoColor=ffdd54)](https://github.com/lancedb/vectordb-recipes/tree/main/examples/multimodal_video_search/main.py) |
| [Movie Recommender](https://github.com/lancedb/vectordb-recipes/tree/main/examples/movie-recommender/) | <a href="https://colab.research.google.com/github/lancedb/vectordb-recipes/blob/main/examples/movie-recommender/main.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a> | [![Python](https://img.shields.io/badge/python-3670A0?style=for-the-badge&logo=python&logoColor=ffdd54)](https://github.com/lancedb/vectordb-recipes/tree/main/examples/movie-recommender/main.py) |
| [Audio Search](https://github.com/lancedb/vectordb-recipes/tree/main/examples/audio_search/) | <a href="https://colab.research.google.com/github/lancedb/vectordb-recipes/blob/main/examples/audio_search/main.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a> | [![Python](https://img.shields.io/badge/python-3670A0?style=for-the-badge&logo=python&logoColor=ffdd54)](https://github.com/lancedb/vectordb-recipes/tree/main/examples/audio_search/main.py) |
| [Multimodal Image + Text Search](https://github.com/lancedb/vectordb-recipes/tree/main/examples/multimodal_search/) | <a href="https://colab.research.google.com/github/lancedb/vectordb-recipes/blob/main/examples/multimodal_search/main.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a> | [![Python](https://img.shields.io/badge/python-3670A0?style=for-the-badge&logo=python&logoColor=ffdd54)](https://github.com/lancedb/vectordb-recipes/tree/main/examples/multimodal_search/main.py) |
| [Evaluating Prompts with Prompttools](https://github.com/lancedb/vectordb-recipes/tree/main/examples/prompttools-eval-prompts/) | <a href="https://colab.research.google.com/github/lancedb/vectordb-recipes/blob/main/examples/prompttools-eval-prompts/main.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a> | |


@@ -0,0 +1,3 @@
# Examples: Rust
Our Rust SDK is now stable. Examples are coming soon.


@@ -0,0 +1,165 @@
# How to Load Image Embeddings into LanceDB
With the rise of Large Multimodal Models (LMMs) such as [GPT-4 Vision](https://blog.roboflow.com/gpt-4-vision/), the need for storing image embeddings is growing. The most effective way to store text and image embeddings is in a vector database such as LanceDB. Vector databases are a special kind of data store that enables efficient search over stored embeddings.
[CLIP](https://blog.roboflow.com/openai-clip/), a multimodal model developed by OpenAI, is commonly used to calculate image embeddings. These embeddings can then be used with a vector database to build a semantic search engine that you can query using images or text. For example, you could use LanceDB and CLIP embeddings to build a search engine for a database of folders.
In this guide, we are going to show you how to use Roboflow Inference to load image embeddings into LanceDB. Without further ado, let's get started!
## Step #1: Install Roboflow Inference
[Roboflow Inference](https://inference.roboflow.com) enables you to run state-of-the-art computer vision models with minimal configuration. Inference supports a range of models, from fine-tuned object detection, classification, and segmentation models to foundation models like CLIP. We will use Inference to calculate CLIP image embeddings.
Inference provides an HTTP API through which you can run vision models.
Inference powers the Roboflow hosted API, and is available as an open source utility. In this guide, we are going to run Inference locally, which enables you to calculate CLIP embeddings on your own hardware. We will also show you how to use the hosted Roboflow CLIP API, which is ideal if you need to scale and do not want to manage a system for calculating embeddings.
To get started, first install the Inference CLI:
```
pip install inference-cli
```
Next, install Docker. Refer to the official Docker installation instructions for your operating system to get Docker set up. Once Docker is ready, you can start Inference using the following command:
```
inference server start
```
An Inference server will start running at http://localhost:9001.
## Step #2: Set Up a LanceDB Vector Database
Now that we have Inference running, we can set up a LanceDB vector database. You can run LanceDB in JavaScript and Python. For this guide, we will use the Python API. But, you can take the HTTP requests we make below and change them to JavaScript if required.
For this guide, we are going to search the [COCO 128 dataset](https://universe.roboflow.com/team-roboflow/coco-128), which contains a wide range of objects. The variability in objects present in this dataset makes it a good dataset to demonstrate the capabilities of vector search. If you want to use this dataset, you can download [COCO 128 from Roboflow Universe](https://universe.roboflow.com/team-roboflow/coco-128). With that said, you can search whatever folder of images you want.
Once you have a dataset ready, install LanceDB with the following command:
```
pip install lancedb
```
We also need to install `tantivy`, a dependency of the LanceDB full text search engine we will use later in this guide:
```
pip install tantivy
```
Create a new Python file and add the following code:
```python
import base64
import os
import requests
import lancedb
db = lancedb.connect("./embeddings")
IMAGE_DIR = "images/"
API_KEY = os.environ.get("ROBOFLOW_API_KEY")
SERVER_URL = "http://localhost:9001"
results = []
for image in os.listdir(IMAGE_DIR):
    image_path = os.path.join(IMAGE_DIR, image)
    # read the image and encode it as a base64 string
    with open(image_path, "rb") as f:
        encoded_image = base64.b64encode(f.read()).decode("utf-8")
    infer_clip_payload = {
        # Images can be provided as urls or as base64 encoded strings
        "image": {
            "type": "base64",
            "value": encoded_image,
        },
    }
    res = requests.post(
        f"{SERVER_URL}/clip/embed_image?api_key={API_KEY}",
        json=infer_clip_payload,
    )
    embeddings = res.json()["embeddings"]
    print("Calculated embedding for image: ", image)
    results.append({"vector": embeddings[0], "name": image_path})
tbl = db.create_table("images", data=results)
tbl.create_fts_index("name")
```
To use the code above, you will need a Roboflow API key. [Learn how to retrieve a Roboflow API key](https://docs.roboflow.com/api-reference/authentication#retrieve-an-api-key). Run the following command to set up your API key in your environment:
```
export ROBOFLOW_API_KEY=""
```
Replace the `IMAGE_DIR` value with the folder in which you are storing the images for which you want to calculate embeddings. If you want to use the Roboflow CLIP API to calculate embeddings, replace the `SERVER_URL` value with `https://infer.roboflow.com`.
Run the script above to create a new LanceDB database. This database will be stored on your local machine. The database will be called `embeddings` and the table will be called `images`.
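If you want to reuse the payload-building step from the script, it can be factored into a small helper. The sketch below mirrors the payload shape used in the script above (`"type": "base64"` plus the encoded bytes); the helper name `build_clip_image_payload` is hypothetical, and the demo encodes a dummy file rather than a real image so it runs without a server.

```python
import base64
import tempfile
from pathlib import Path


def build_clip_image_payload(image_path) -> dict:
    """Return the JSON body for the embed_image request, with the file base64-encoded."""
    raw = Path(image_path).read_bytes()
    return {
        "image": {
            "type": "base64",
            "value": base64.b64encode(raw).decode("utf-8"),
        }
    }


# Demo with a dummy file; a real call would point at an image in IMAGE_DIR.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.jpg"
    sample.write_bytes(b"not a real image")
    payload = build_clip_image_payload(sample)

print(payload["image"]["type"], len(payload["image"]["value"]))
```

The returned dictionary can be passed as the `json=` argument of `requests.post`, exactly where `infer_clip_payload` is used in the script.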
The script above calculates all embeddings for a folder then creates a new table. To add additional images, use the following code:
```python
def make_batches():
    for i in range(5):
        yield [
            {"vector": [3.1, 4.1], "name": "image1.png"},
            {"vector": [5.9, 26.5], "name": "image2.png"}
        ]
tbl = db.open_table("images")
tbl.add(make_batches())
```
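Replace the `make_batches()` function with code that loads embeddings for your images. The two-element vectors above are placeholders; real vectors must have the same dimension as the embeddings already stored in the table.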
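One way such a replacement might look is sketched below. It assumes the embeddings have already been computed and collected in a dictionary keyed by file name; the `precomputed` dictionary, its placeholder vectors, and the `batch_size` parameter are hypothetical and only illustrate the shape of the generator.

```python
from typing import Dict, Iterator, List, Sequence


def make_batches(
    embeddings: Dict[str, Sequence[float]], batch_size: int = 2
) -> Iterator[List[dict]]:
    """Yield lists of {"vector", "name"} records, at most batch_size per list."""
    batch: List[dict] = []
    for name, vector in embeddings.items():
        batch.append({"vector": list(vector), "name": name})
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield batch


# Placeholder vectors; real ones would come from the CLIP endpoint.
precomputed = {
    "image1.png": [3.1, 4.1],
    "image2.png": [5.9, 26.5],
    "image3.png": [1.0, 2.0],
}

for batch in make_batches(precomputed):
    print([record["name"] for record in batch])
```

The generator produces records in the same `{"vector": ..., "name": ...}` shape as the snippet above, so `make_batches(precomputed)` could be handed to `tbl.add(...)` in the same way.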
## Step #3: Run a Search Query
We are now ready to run a search query. To run a search query, we need a text embedding that represents a text query. We can use this embedding to search our LanceDB database for an entry.
Let's calculate a text embedding for the query “cat”, then run a search query:
```python
infer_clip_payload = {
    "text": "cat",
}
res = requests.post(
    f"{SERVER_URL}/clip/embed_text?api_key={API_KEY}",
    json=infer_clip_payload,
)
embeddings = res.json()["embeddings"]
df = tbl.search(embeddings[0]).limit(3).to_list()
print("Results:")
for i in df:
    print(i["name"])
```
This code will search for the three images most closely related to the prompt “cat”. The names of the most similar three images will be printed to the console. Here are the three top results:
```
dataset/images/train/000000000650_jpg.rf.1b74ba165c5a3513a3211d4a80b69e1c.jpg
dataset/images/train/000000000138_jpg.rf.af439ef1c55dd8a4e4b142d186b9c957.jpg
dataset/images/train/000000000165_jpg.rf.eae14d5509bf0c9ceccddbb53a5f0c66.jpg
```
Let's open the top image:
![Cat](https://media.roboflow.com/cat_lancedb.jpg)
The top image was a cat. Our search was successful.
## Conclusion
LanceDB is a vector database that you can use to store and efficiently search your image embeddings. You can use Roboflow Inference, a scalable computer vision inference server, to calculate CLIP embeddings that you can store in LanceDB.
You can use Inference and LanceDB together to build a range of applications with image embeddings, from a media search engine to a retrieval-augmented generation pipeline for use with LMMs.
To learn more about Inference and its capabilities, refer to the Inference documentation.

Some files were not shown because too many files have changed in this diff.