Compare commits

..

470 Commits

Author SHA1 Message Date
Lance Release
7e023c1ef2 [python] Bump version: 0.6.8 → 0.6.9 2024-04-12 22:09:12 +00:00
Weston Pace
1d0dd9a8b8 feat: bump lance version from 0.10.10 to 0.10.12 (#1219) 2024-04-12 15:08:39 -07:00
Weston Pace
deb947ddbd doc: fix typo, broken links (#1218) 2024-04-11 14:58:51 -07:00
Ayush Chaurasia
b039765d50 docs : Embedding functions quickstart and minor fixes (#1217) 2024-04-11 17:30:45 +05:30
Prashanth Rao
d155e82723 [docs] Fix broken links and clarify language in integrations docs (#1209)
This PR does the following:

- Fixes broken/outdated URLs
- Adds clarity to the way DuckDB/LanceDB integration works via Arrow
2024-04-11 15:32:08 +05:30
Ayush Chaurasia
5d8c91256c fix(python): Update to latest cohere reranking api (#1212)
Fixes https://github.com/lancedb/lancedb/issues/1196
Cohere introduced a breaking change in their reranker API starting
version 5.0.0. More context in discussion here
https://github.com/cohere-ai/cohere-python/issues/446
2024-04-11 15:20:29 +05:30
Ayush Chaurasia
44c03ebef3 docs : Update Reranking docs (#1213) 2024-04-11 15:20:00 +05:30
Will Jones
8ea06fe7f3 ci: fix failures in release scripts (#1215)
* Python release has been running when we create a Node release.
https://github.com/lancedb/lancedb/actions/runs/8635662585
* Rust is missing new enough compilers to check the kernels feature
https://github.com/lancedb/lancedb/actions/runs/8635662578
2024-04-10 13:09:39 -07:00
Lance Release
cf06b653d4 [python] Bump version: 0.6.7 → 0.6.8 2024-04-10 17:51:45 +00:00
Lance Release
09cfab6d00 Updating package-lock.json 2024-04-10 17:40:03 +00:00
Lance Release
e4945abb1a Bump version: 0.4.16 → 0.4.17 2024-04-10 17:39:52 +00:00
Raghav Dixit
a6aa67baed python: Bug fixes / tests (#1210)
closes #1194 #1172 #1124 #1208 


@wjones127 : `if query_type != "fts":` is needed because both fts and
vector search create `LanceQueryBuilder` which has `vector_column_name`
as a required attribute.
2024-04-10 10:17:14 -07:00
Will Jones
1d23af213b feat: expose storage options in LanceDB (#1204)
Exposes `storage_options` in LanceDB. This is provided for Python async,
Node `lancedb`, and Node `vectordb` (and Rust of course). Python
synchronous is omitted because it's not compatible with the PyArrow
filesystems we use there currently. In the future, we will move the sync
API to wrap the async one, and then it will get support for
`storage_options`.

1. Fixes #1168
2. Closes #1165
3. Closes #1082
4. Closes #439
5. Closes #897
6. Closes #642
7. Closes #281
8. Closes #114
9. Closes #990
10. Deprecating `awsCredentials` and `awsRegion`. Users are encouraged
to use `storageOptions` instead.
2024-04-10 10:12:04 -07:00
Bert
25dea4e859 BREAKING CHANGE: Check if remote table exists when opening (with caching) (#1214)
- make open table behaviour consistent:
- remote tables will check if the table exists by calling /describe and
throwing an error if the call doesn't succeed
- this is similar to the behaviour for local tables where we will raise
an exception when opening the table if the local dataset doesn't exist
- The table names are cached in the client with a TTL
- Also fixes a small bug where if the remote error response was
deserialized from JSON as an object, we'd print it resulting in the
unhelpful error message: `Error: Server Error, status: 404, message: Not
Found: [object Object]`
2024-04-10 11:54:47 -04:00
Weston Pace
8a1227030a chore: restore requests which was lost during rebase (#1205) 2024-04-08 11:56:43 +05:30
Weston Pace
9fee384d2c chore(node): restore package-lock.json lost during rebase 2024-04-05 16:36:29 -07:00
Ayush Chaurasia
b2952acca7 chore(python): remove redundant files (#1203) 2024-04-05 16:35:10 -07:00
Pranav Maddi
2b132a0bef Fix markdown formatting (#1188) 2024-04-05 16:35:10 -07:00
Will Jones
ba56208a34 ci: fix job (#1193) 2024-04-05 16:35:10 -07:00
Ayush Chaurasia
2d2042d59e chore(python): Remove settings manager and telemetry. (#1198)
This PR is intended to remove settings manager. But because telemetry
and CLI depends on settings manager those need to go too.
2024-04-05 16:35:09 -07:00
Raghav Dixit
1c41a00d87 Embeddings: HF model hub support added via transformers (#1154) 2024-04-05 16:34:56 -07:00
Lance Release
ac63d4066b Updating package-lock.json 2024-04-05 16:34:53 -07:00
Lance Release
be2074b90d [python] Bump version: 0.6.6 → 0.6.7 2024-04-05 16:34:53 -07:00
Lance Release
6c452f29e9 Bump version: 0.4.15 → 0.4.16 2024-04-05 16:34:50 -07:00
Will Jones
8a7ded23b2 chore: upgrade to lance-0.10.9 (#1192) 2024-04-05 16:34:50 -07:00
QianZhu
871500db70 add a default value for search.limit to be consistent with python sdk (#1191)
Changed the default value for search.limit to be 10
2024-04-05 16:34:50 -07:00
Bert
a900bc0827 ensure table names are uri encoded for tables (#1189)
This prevents an issue where users can do something like:
```js
db.createTable('my-table#123123')
```
The server has logic to determine that '#' character is not allowed in
the table name, but currently this is being returned as 404 error
because it routes to `/v1/my-table#123123/create` and `#123123/create`
will not be parsed as part of path
2024-04-05 16:34:50 -07:00
Will Jones
47cff963c5 feat: ship fp16kernels in Python wheels (#1148)
Same deal as https://github.com/lancedb/lance/pull/2098
2024-04-05 16:34:50 -07:00
Lei Xu
e6ff3d848b chore: bump to 0.10.8 (#1187) 2024-04-05 16:34:50 -07:00
QianZhu
44d799ebb8 bug: fix the return value of countRows (#1186) 2024-04-05 16:34:50 -07:00
Lei Xu
1d3325dcc5 chore: bump lance version (#1185)
Bump lance version to `0.10.7`
2024-04-05 16:34:50 -07:00
Bert
ff45f25cf2 fix error decoding in nodejs client (#1184)
fixes: #1183
2024-04-05 16:34:50 -07:00
QianZhu
a34cc770c5 remote count_rows need to return the number (#1181) 2024-04-05 16:34:50 -07:00
eduardjbotha
f749b8808f SQL Documentation includes DataFusion functions (#1179)
Show that it is possible to use the DataFusion functions in the `WHERE`
clause.

Co-authored-by: Eduard Botha <eduard.botha@inovex.de>
2024-04-05 16:34:50 -07:00
Lei Xu
7e5a54b76a chore: add social link footer (#1177) 2024-04-05 16:34:50 -07:00
Lei Xu
3f14938392 chore: pass str instead of String to build table names (#1178) 2024-04-05 16:34:50 -07:00
Lance Release
3bd16e1b14 Updating package-lock.json 2024-04-05 16:34:46 -07:00
QianZhu
2f89fc26f1 feat: add filterable countRows to remote API (#1169) 2024-04-05 16:34:46 -07:00
Lance Release
e5bfec4318 [python] Bump version: 0.6.5 → 0.6.6 2024-04-05 16:34:46 -07:00
Lance Release
e0f50013ea Bump version: 0.4.14 → 0.4.15 2024-04-05 16:34:39 -07:00
Weston Pace
e4e64f9d6b chore: bump lance version to 0.10.6 (#1175) 2024-04-05 16:34:39 -07:00
Bert
6c9f4c4304 Update LanceDB Logo in README (#1167)
<img width="1034" alt="image"
src="https://github.com/lancedb/lancedb/assets/5846846/5b8aa53c-4d93-4c0e-bed4-80c238b319ba">
2024-04-05 16:34:39 -07:00
Weston Pace
e21b56293c docs: add a reference to @lancedb/lance in the docs (#1166)
We aren't yet ready to switch over the examples since almost all JS
examples rely on embeddings and we haven't yet ported those over.
However, this makes it possible for those that are interested to start
using `@lancedb/lancedb`
2024-04-05 16:34:39 -07:00
Will Jones
1b0aaf9ec3 ci: fix name collision in npm artifacts for vectordb (#1164)
Fixes #1163
2024-04-05 16:34:39 -07:00
Weston Pace
01239da082 chore: add nodejs to bumpversion (#1161)
The previous release failed to release nodejs because the nodejs version
wasn't bumped. This should fix that.
2024-04-05 16:34:39 -07:00
Weston Pace
6060c0cd36 chore: fix clippy (#1162) 2024-04-05 16:34:38 -07:00
Bert
bb179981dd added new logo to vercel example gif (#1158) 2024-04-05 16:34:38 -07:00
Bert
2e1f1c6d5d New logo on docs site (#1157) 2024-04-05 16:34:38 -07:00
Ayush Chaurasia
b916f5f132 docs: Add all available HF/sentence transformers embedding models list (#1134)
Solves -  https://github.com/lancedb/lancedb/issues/968
2024-04-05 16:34:38 -07:00
Weston Pace
f97c7dad8c docs: add the async python API to the docs (#1156) 2024-04-05 16:34:37 -07:00
Lance Release
ccf13f15d4 Bump version: 0.4.13 → 0.4.14 2024-04-05 16:33:37 -07:00
Weston Pace
287c5ca2f9 feat: add publish step for nodejs (#1155)
This will start publishing `@lancedb/lancedb` with the new nodejs
package on our releases.
2024-04-05 16:33:37 -07:00
Pranav Maddi
479289dd38 Adds a Ask LanceDB button to docs. (#1150)
This links out to the new [asklancedb.com](https://asklancedb.com) page.

Screenshots of the change:

![Quick start - LanceDB · 10 20am ·
03-22](https://github.com/lancedb/lancedb/assets/2371511/c45ba893-fc74-4957-bdd3-3712b351aff3)
![Quick start -
LanceDB](https://github.com/lancedb/lancedb/assets/2371511/d4762eb6-52af-4fd5-857e-3ed280716999)
2024-04-05 16:33:37 -07:00
Bert
1e41232f28 Node SDK Client middleware for HTTP Requests (#1130)
Adds client-side middleware to LanceDB Node SDK to instrument HTTP
Requests

Example - adding `x-request-id` request header:
```js
class HttpMiddleware {
    constructor({ requestId }) {
        this.requestId = requestId
    }

    onRemoteRequest(req, next) {
        req.headers['x-request-id'] = this.requestId
        return next(req)
    }
}

const db = await lancedb.connect({
  uri: 'db://remote-123',
  apiKey: 'sk_...',
})

let tables = await db.withMiddleware(new HttpMiddleware({ requestId: '123' })).tableNames();

```

---------

Co-authored-by: Weston Pace <weston.pace@gmail.com>
2024-04-05 16:33:37 -07:00
QianZhu
db2631c2ad remove warnings (#1147) 2024-04-05 16:33:37 -07:00
Lei Xu
473ef7e426 chore: validate table name (#1146)
Closes #1129
2024-04-05 16:33:37 -07:00
Lance Release
d32dc84653 [python] Bump version: 0.6.4 → 0.6.5 2024-04-05 16:33:37 -07:00
Lei Xu
1aaaeff511 chore: bump lance to 0.10.5 (#1145) 2024-04-05 16:33:37 -07:00
QianZhu
bdd07a5dfa fix nodejs test (#1141)
changed the error msg for query with wrong vector dim thus need this
change to pass the nodejs tests.
2024-04-05 16:33:37 -07:00
QianZhu
63db51c90d better error msg for query vector with wrong dim (#1140) 2024-04-05 16:33:37 -07:00
Ishani Ghose
0838e12b30 feat: add to_batches API #805 (#1048)
SDK
Python

Description
Exposes pyarrow batch api during query execution - relevant when there
is no vector search query, dataset is large and the filtered result is
larger than memory.

---------

Co-authored-by: Ishani Ghose <isghose@amazon.com>
Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
2024-04-05 16:33:37 -07:00
Weston Pace
968c62cb8f feat: introduce ArrowNative wrapper struct for adding data that is already a RecordBatchReader (#1139)
In
2de226220b
I added a new `IntoArrow` trait for adding data into a table.
Unfortunately, it seems my approach for implementing the trait for
"things that are already record batch readers" was flawed. This PR
corrects that flaw and, conveniently, removes the need to box readers at
all (though it is ok if you do).
2024-04-05 16:33:37 -07:00
natcharacter
f6e9f8e3f4 Order by field support FTS (#1132)
This PR adds support for passing through a set of ordering fields at
index time (unsigned ints that tantivity can use as fast_fields) that at
query time you can sort your results on. This is useful for cases where
you want to get related hits, i.e by keyword, but order those hits by
some other score, such as popularity.

I.e search for songs descriptions that match on "sad AND jazz AND 1920"
and then order those by number of times played. Example usage can be
seen in the fts tests.

---------

Co-authored-by: Nat Roth <natroth@Nats-MacBook-Pro.local>
Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
2024-04-05 16:33:36 -07:00
Chang She
4466cfa958 feat(python): support writing huggingface dataset and dataset dict (#1110)
HuggingFace Dataset is written as arrow batches.
For DatasetDict, all splits are written with a "split" column appended.

- [x] what if the dataset schema already has a `split` column
- [x] add unit tests
2024-04-05 16:33:06 -07:00
Ayush Chaurasia
42fad84ec8 feat(python): Support reranking for vector and fts (#1103)
solves https://github.com/lancedb/lancedb/issues/1086

Usage Reranking with FTS:
```
retriever = db.create_table("fine-tuning", schema=Schema, mode="overwrite")
pylist = [{"text": "Carson City is the capital city of the American state of Nevada. At the  2010 United States Census, Carson City had a population of 55,274."},
          {"text": "The Commonwealth of the Northern Mariana Islands is a group of islands in the Pacific Ocean that are a political division controlled by the United States. Its capital is Saipan."},
        {"text": "Charlotte Amalie is the capital and largest city of the United States Virgin Islands. It has about 20,000 people. The city is on the island of Saint Thomas."},
        {"text": "Washington, D.C. (also known as simply Washington or D.C., and officially as the District of Columbia) is the capital of the United States. It is a federal district. "},
        {"text": "Capital punishment (the death penalty) has existed in the United States since before the United States was a country. As of 2017, capital punishment is legal in 30 of the 50 states."},
        {"text": "North Dakota is a state in the United States. 672,591 people lived in North Dakota in the year 2010. The capital and seat of government is Bismarck."},
        ]
retriever.add(pylist)
retriever.create_fts_index("text", replace=True)

query = "What is the capital of the United States?"
reranker = CohereReranker(return_score="all")
print(retriever.search(query, query_type="fts").limit(10).to_pandas())
print(retriever.search(query, query_type="fts").rerank(reranker=reranker).limit(10).to_pandas())
```
Result
```
                                                text                                             vector     score
0  Capital punishment (the death penalty) has exi...  [0.099975586, 0.047943115, -0.16723633, -0.183...  0.729602
1  Charlotte Amalie is the capital and largest ci...  [-0.021255493, 0.03363037, -0.027450562, -0.17...  0.678046
2  The Commonwealth of the Northern Mariana Islan...  [0.3684082, 0.30493164, 0.004600525, -0.049407...  0.671521
3  Carson City is the capital city of the America...  [0.13989258, 0.14990234, 0.14172363, 0.0546569...  0.667898
4  Washington, D.C. (also known as simply Washing...  [-0.0090408325, 0.42578125, 0.3798828, -0.3574...  0.653422
5  North Dakota is a state in the United States. ...  [0.55859375, -0.2109375, 0.14526367, 0.1634521...  0.639346
                                                text                                             vector     score  _relevance_score
0  Washington, D.C. (also known as simply Washing...  [-0.0090408325, 0.42578125, 0.3798828, -0.3574...  0.653422          0.979977
1  The Commonwealth of the Northern Mariana Islan...  [0.3684082, 0.30493164, 0.004600525, -0.049407...  0.671521          0.299105
2  Capital punishment (the death penalty) has exi...  [0.099975586, 0.047943115, -0.16723633, -0.183...  0.729602          0.284874
3  Carson City is the capital city of the America...  [0.13989258, 0.14990234, 0.14172363, 0.0546569...  0.667898          0.089614
4  North Dakota is a state in the United States. ...  [0.55859375, -0.2109375, 0.14526367, 0.1634521...  0.639346          0.063832
5  Charlotte Amalie is the capital and largest ci...  [-0.021255493, 0.03363037, -0.027450562, -0.17...  0.678046          0.041462
```

## Vector Search usage:
```
query = "What is the capital of the United States?"
reranker = CohereReranker(return_score="all")
print(retriever.search(query).limit(10).to_pandas())
print(retriever.search(query).rerank(reranker=reranker, query=query).limit(10).to_pandas()) # <-- Note: passing extra string query here
```

Results
```
                                                text                                             vector  _distance
0  Capital punishment (the death penalty) has exi...  [0.099975586, 0.047943115, -0.16723633, -0.183...  39.728973
1  Washington, D.C. (also known as simply Washing...  [-0.0090408325, 0.42578125, 0.3798828, -0.3574...  41.384884
2  Carson City is the capital city of the America...  [0.13989258, 0.14990234, 0.14172363, 0.0546569...  55.220200
3  Charlotte Amalie is the capital and largest ci...  [-0.021255493, 0.03363037, -0.027450562, -0.17...  58.345654
4  The Commonwealth of the Northern Mariana Islan...  [0.3684082, 0.30493164, 0.004600525, -0.049407...  60.060867
5  North Dakota is a state in the United States. ...  [0.55859375, -0.2109375, 0.14526367, 0.1634521...  64.260544
                                                text                                             vector  _distance  _relevance_score
0  Washington, D.C. (also known as simply Washing...  [-0.0090408325, 0.42578125, 0.3798828, -0.3574...  41.384884          0.979977
1  The Commonwealth of the Northern Mariana Islan...  [0.3684082, 0.30493164, 0.004600525, -0.049407...  60.060867          0.299105
2  Capital punishment (the death penalty) has exi...  [0.099975586, 0.047943115, -0.16723633, -0.183...  39.728973          0.284874
3  Carson City is the capital city of the America...  [0.13989258, 0.14990234, 0.14172363, 0.0546569...  55.220200          0.089614
4  North Dakota is a state in the United States. ...  [0.55859375, -0.2109375, 0.14526367, 0.1634521...  64.260544          0.063832
5  Charlotte Amalie is the capital and largest ci...  [-0.021255493, 0.03363037, -0.027450562, -0.17...  58.345654          0.041462
```
2024-04-05 16:33:06 -07:00
Weston Pace
b36c750cc7 fix: fix compile error in example caused by merge conflict (#1135) 2024-04-05 16:33:06 -07:00
Weston Pace
a23b856410 feat: change DistanceType to be independent thing instead of resuing lance_linalg (#1133)
This PR originated from a request to add `Serialize` / `Deserialize` to
`lance_linalg::distance::DistanceType`. However, that is a strange
request for `lance_linalg` which shouldn't really have to worry about
`Serialize` / `Deserialize`. The problem is that `lancedb` is re-using
`DistanceType` and things in `lancedb` do need to worry about
`Serialize`/`Deserialize` (because `lancedb` needs to support remote
client).

On the bright side, separating the two types allows us to independently
document distance type and allows `lance_linalg` to make changes to
`DistanceType` in the future without having to worry about backwards
compatibility concerns.
2024-04-05 16:33:06 -07:00
Weston Pace
0fe0976a0e docs: add links to rust SDK docs, remove references to rust SDK being unstable / experimental (#1131) 2024-04-05 16:33:05 -07:00
Weston Pace
abde77eafb feat(rust): add trait for incoming data (#1128)
This will make it easier for 3rd party integrations. They simply need to
implement `IntoArrow` for their types in order for those types to be
used in ingestion.
2024-04-05 16:32:47 -07:00
vincent d warmerdam
85a9ef472f Unhide Pydantic guides in Docs (#1122)
@wjones127 after fixing https://github.com/lancedb/lancedb/issues/1112 I
noticed something else on the docs. There's an odd chunk of the docs
missing
[here](https://lancedb.github.io/lancedb/guides/tables/#from-a-polars-dataframe).
I can see the heading, but after clicking it the contents don't show.

![CleanShot 2024-03-15 at 23 40
17@2x](https://github.com/lancedb/lancedb/assets/1019791/04784b19-0200-4c3f-ae17-7a8f871ef9bd)

Apon inspection it was a markdown issue, one tab too many on a whole
segment.

This PR fixes it. It looks like this now and the sections appear again:

![CleanShot 2024-03-15 at 23 42
32@2x](https://github.com/lancedb/lancedb/assets/1019791/c5aaec4c-1c37-474d-9fb0-641f4cf52626)
2024-04-05 16:32:47 -07:00
Weston Pace
4180b44472 feat: refactor the query API and add query support to the python async API (#1113)
In addition, there are also a number of changes in nodejs to the
docstrings of existing methods because this PR adds a jsdoc linter.
2024-04-05 16:32:47 -07:00
Lance Release
2db257ca29 [python] Bump version: 0.6.3 → 0.6.4 2024-04-05 16:32:41 -07:00
Lance Release
1f816d597a Bump version: 0.4.12 → 0.4.13 2024-04-05 16:32:31 -07:00
Weston Pace
c1e3dc48af feat: bump lance to 0.10.4 (#1123) 2024-04-05 16:32:31 -07:00
vincent d warmerdam
b9afc01cfd Explain vonoroi seed initalisation (#1114)
This PR fixes https://github.com/lancedb/lancedb/issues/1112. It turned
out that K-means is currently used internally, so I figured adding that
context to the docs would be nice.
2024-04-05 16:32:31 -07:00
Christian Di Lorenzo
8bb983bc3d fix(python): Add python azure blob read support (#1102)
I know there's a larger effort to have the python client based on the
core rust implementation, but in the meantime there have been several
issues (#1072 and #485) with some of the azure blob storage calls due to
pyarrow not natively supporting an azure backend. To this end, I've
added an optional import of the fsspec implementation of azure blob
storage [`adlfs`](https://pypi.org/project/adlfs/) and passed it to
`pyarrow.fs`. I've modified the existing test and manually verified it
with some real credentials to make sure it behaves as expected.

It should be now as simple as:

```python
import lancedb

db = lancedb.connect("az://blob_name/path")
table = db.open_table("test")
table.search(...)
```

Thank you for this cool project and we're excited to start using this
for real shortly! 🎉 And thanks to @dwhitena for bringing it to my
attention with his prediction guard posts.

Co-authored-by: christiandilorenzo <christian.dilorenzo@infiniaml.com>
2024-04-05 16:32:31 -07:00
Weston Pace
1ea0c33545 feat: update lance to v0.10.3 (#1094) 2024-04-05 16:32:31 -07:00
Raghav Dixit
765569425c doc updates (#1085)
closes #1084
2024-04-05 16:32:15 -07:00
Chang She
377832e532 feat(python): support optional vector field in pydantic model (#1097)
The LanceDB embeddings registry allows users to annotate the pydantic
model used as table schema with the desired embedding function, e.g.:

```python
class Schema(LanceModel):
    id: str
    vector: Vector(openai.ndims()) = openai.VectorField()
    text: str = openai.SourceField()
```

Tables created like this does not require embeddings to be calculated by
the user explicitly, e.g. this works:

```python
table.add([{"id": "foo", "text": "rust all the things"}])
```

However, trying to construct pydantic model instances without vector
doesn't because it's a required field.

Instead, you need add a default value:

```python
class Schema(LanceModel):
    id: str
    vector: Vector(openai.ndims()) = openai.VectorField(default=None)
    text: str = openai.SourceField()
```

then this completes without errors:
```python
table.add([Schema(id="foo", text="rust all the things")])
```

However, all of the vectors are filled with zeros. Instead in
add_vector_col we have to add an additional check so that the embedding
generation is called.
2024-04-05 16:32:15 -07:00
QianZhu
723defbe7e add index_stats python api (#1096)
the integration test will be covered in another PR:
https://github.com/lancedb/sophon/pull/1876
2024-04-05 16:32:15 -07:00
Chang She
c33110397e fix(python): fix typo in passing in the api_key explicitly (#1098)
fix silly typo
2024-04-05 16:32:15 -07:00
Weston Pace
b6a522d483 feat: add list_indices to the async api (#1074) 2024-04-05 16:32:15 -07:00
Weston Pace
9031ec6878 feat: add update to the async API (#1093) 2024-04-05 16:32:15 -07:00
Will Jones
f0c5f5ba62 fix: handle uri in object (#1091)
Fixes #1078
2024-04-05 16:32:15 -07:00
Weston Pace
47daf9b7b0 feat: add time travel operations to the async API (#1070) 2024-04-05 16:32:15 -07:00
Weston Pace
f822255683 feat: add create_index to the async python API (#1052)
This also refactors the rust lancedb index builder API (and,
correspondingly, the nodejs API)
2024-04-05 16:32:14 -07:00
Will Jones
90af5cf028 fix: propagate filter validation errors (#1092)
In Rust and Node, we have been swallowing filter validation errors. If
there was an error in parsing the filter, then the filter was silently
ignored, returning unfiltered results.

Fixes #1081
2024-04-05 16:31:53 -07:00
Lance Release
fec6f92184 [python] Bump version: 0.6.2 → 0.6.3 2024-04-05 16:31:53 -07:00
Rob Meng
35bc4f3078 feat: configurable timeout for LanceDB Cloud queries (#1090) 2024-04-05 16:31:53 -07:00
Ivan Leo
89ce417452 Update default_embedding_functions.md (#1073)
Added a small bit of documentation for the `dim` feature which is
provided by the new `text-embedding-3` model series that allows users to
shorten an embedding.

Happy to discuss a bit on the phrasing but I struggled quite a bit with
getting it to work so wanted to help others who might want to use the
newer model too
2024-04-05 16:31:53 -07:00
Weston Pace
d4502add44 Remove remote integration workflow (#1076) 2024-04-05 16:31:53 -07:00
Will Jones
334857a8cb fix: Allow converting from NativeTable to Table (#1069) 2024-04-05 16:31:53 -07:00
Lance Release
386d5da22f Bump version: 0.4.11 → 0.4.12 2024-04-05 16:31:45 -07:00
Lance Release
77ba97416d [python] Bump version: 0.6.1 → 0.6.2 2024-04-05 16:31:45 -07:00
Will Jones
5120bf262b fix: make checkout_latest force a reload (#1064)
#1002 accidentally changed `checkout_latest` to do nothing if the table
was already in latest mode. This PR makes sure it forces a reload of the
table (if there is a newer version).
2024-04-05 16:31:45 -07:00
Lei Xu
f27167017b chore: bump lance to 0.10.2 (#1061) 2024-04-05 16:31:45 -07:00
Weston Pace
73c69a6b9a feat: page_token / limit to native table_names function. Use async table_names function from sync table_names function (#1059)
The synchronous table_names function in python lancedb relies on arrow's
filesystem which behaves slightly differently than object_store. As a
result, the function would not work properly in GCS.

However, the async table_names function uses object_store directly and
thus is accurate. In most cases we can fallback to using the async
table_names function and so this PR does so. The one case we cannot is
if the user is already in an async context (we can't start a new async
event loop). Soon, we can just redirect those users to use the async API
instead of the sync API and so that case will eventually go away. For
now, we fallback to the old behavior.
2024-04-05 16:31:45 -07:00
Will Jones
05f9a77baf feat: more accessible errors (#1025)
The fact that we convert errors to strings makes them really hard to
work with. For example, in SaaS we want to know whether the underlying
`lance::Error` was the `InvalidInput` variant, so we can return a 400
instead of a 500.
2024-04-05 16:31:45 -07:00
Chang She
10089481c0 doc(python): document the method in fts (#982)
Co-authored-by: prrao87 <prrao87@gmail.com>
Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>
2024-04-05 16:31:45 -07:00
Ayush Chaurasia
b5326d31e9 fix(python): Few fts patches (#1039)
1. filtering with fts mutated the schema, which caused schema mistmatch
problems with hybrid search as it combines fts and vector search tables.
2. fts with filter failed with `with_row_id`. This was because row_id
was calculated before filtering which caused size mismatch on attaching
it after.
3. The fix for 1 meant that now row_id is attached before filtering but
passing a filter to `to_lance` on a dataset that already contains
`_rowid` raises a panic from lance. So temporarily, in case where fts is
used with a filter AND `with_row_id`, we just force user to using the
duckdb pathway.

---------

Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
2024-04-05 16:31:45 -07:00
Weston Pace
c60a193767 fix: sanitize foreign schemas (#1058)
Arrow-js uses brittle `instanceof` checks throughout the code base.
These fail unless the library instance that produced the object matches
exactly the same instance the vectordb is using. At a minimum, this
means that a user using arrow version 15 (or any version that doesn't
match exactly the version that vectordb is using) will get strange
errors when they try and use vectordb.

However, there are even cases where the versions can be perfectly
identical, and the instanceof check still fails. One such example is
when using `vite` (e.g. https://github.com/vitejs/vite/issues/3910)

This PR solves the problem in a rather brute force, but workable,
fashion. If we encounter a schema that does not pass the `instanceof`
check then we will attempt to sanitize that schema by traversing the
object and, if it has all the correct properties, constructing an
appropriate `Schema` instance via deep cloning.
2024-04-05 16:31:42 -07:00
Weston Pace
785ecfa037 feat: reconfigure typescript linter / formatter for nodejs (#1042)
The eslint rules specify some formatting requirements that are rather
strict and conflict with vscode's default formatter. I was unable to get
auto-formatting to setup correctly. Also, eslint has quite recently
[given up on
formatting](https://eslint.org/blog/2023/10/deprecating-formatting-rules/)
and recommends using a 3rd party formatter.

This PR adds prettier as the formatter. It restores the eslint rules to
their defaults. This does mean we now have the "no explicit any" check
back on. I know that rule is pedantic but it did help me catch a few
corner cases in type testing that weren't covered in the current code.
Leaving in draft as this is dependent on other PRs.
2024-04-05 16:31:36 -07:00
Weston Pace
8033a44d68 feat: add support for add to async python API (#1037)
In order to add support for `add` we needed to migrate the rust `Table`
trait to a `Table` struct and `TableInternal` trait (similar to the way
the connection is designed).

While doing this we also cleaned up some inconsistencies between the
SDKs:

* Python and Node are garbage collected languages and it can be
difficult to trigger something to be freed. The convention for these
languages is to have some kind of close method. I added a close method
to both the table and connection which will drop the underlying rust
object.
* We made significant improvements to table creation in
cc5f2136a6
for the `node` SDK. I copied these changes to the `nodejs` SDK.
* The nodejs tables were using fs to create tmp directories and these
were not getting cleaned up. This is mostly harmless but annoying and so
I changed it up a bit to ensure we cleanup tmp directories.
* ~~countRows in the node SDK was returning `bigint`. I changed it to
return `number`~~ (this actually happened in a previous PR)
* Tables and connections now implement `std::fmt::Display` which is
hooked into python's `__repr__`. Node has no concept of a regular "to
string" function and so I added a `display` method.
* Python method signatures are changing so that optional parameters are
always `Optional[foo] = None` instead of something like `foo = False`.
This is because we want those defaults to be in rust whenever possible
(though we still need to mention the default in documentation).
* I changed the python `AsyncConnection/AsyncTable` classes from
abstract classes with a single implementation to just classes because we
no longer have the remote implementation in python.

Note: this does NOT add the `add` function to the remote table. This PR
was already large enough, and the remote implementation is unique
enough, that I am going to do all the remote stuff at a later date (we
should have the structure in place and correct so there shouldn't be any
refactor concerns)

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2024-04-05 16:31:36 -07:00
Chang She
3bbcaba65b chore(rust): update rust version (#810) 2024-04-05 16:31:36 -07:00
Chang She
e60fde73ba feat(python): allow user to override api url (#1054) 2024-04-05 16:31:36 -07:00
Chang She
a7dbe933dc chore(python): use pypi tantivy to speed up CI (#987) 2024-04-05 16:31:36 -07:00
Chang She
4f34a01020 doc: fix docs deployment GHA (#1055) 2024-04-05 16:31:36 -07:00
Prashanth Rao
f9c244e608 [docs]: Fix issues with Rust code snippets in "quick start" (#1047)
The renaming of `vectordb` to `lancedb` broke the [quick start
docs](https://lancedb.github.io/lancedb/basic/#__tabbed_5_3) (it's
pointing to a non-existent directory). This PR fixes the code snippets
and the paths in the docs page.

Additionally, more fixes related to indexing docs below 👇🏽.
2024-04-05 16:31:36 -07:00
Louis Guitton
7f9ef0d329 Fix default_embedding_functions.md (#1043)
typo and broken table
2024-04-05 16:31:36 -07:00
Chang She
a3761f4209 doc: fix langchain link (#1053) 2024-04-05 16:31:36 -07:00
Chang She
4b40dad963 feat(python): add model_names() method to openai embedding function (#1049)
small QoL improvement
2024-04-05 16:31:36 -07:00
QianZhu
b32b69c993 Add create scalar index to sdk (#1033) 2024-04-05 16:31:36 -07:00
Weston Pace
4299f719ec feat: port create_table to the async python API and the remote rust API (#1031)
I've also started `ASYNC_MIGRATION.MD` to keep track of the breaking
changes from sync to async python.
2024-04-05 16:31:36 -07:00
Lance Release
accf31fa92 [python] Bump version: 0.6.0 → 0.6.1 2024-04-05 16:31:36 -07:00
Rob Meng
b8eb5d4bfe fix: fix columns type for pydantic 2.x (#1045) 2024-04-05 16:31:36 -07:00
Weston Pace
629c622d15 feat: Initial remote table implementation for rust (#1024)
This will eventually replace the remote table implementations in python
and node.
2024-04-05 16:31:36 -07:00
Lance Release
45b5b66c82 [python] Bump version: 0.5.7 → 0.6.0 2024-04-05 16:31:36 -07:00
BubbleCal
5896541bb8 chore: enable test for dropping table (#1038)
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2024-04-05 16:31:36 -07:00
natcharacter
e29e4cc36d A simple base usage that install the dependencies necessary to use FT… (#1036)
A simple base usage that install the dependencies necessary to use FTS
and Hybrid search

---------

Co-authored-by: Nat Roth <natroth@Nats-MacBook-Pro.local>
Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
2024-04-05 16:31:36 -07:00
Rob Meng
f3de3d990d chore: upgrade to lance 0.10.1 (#1034)
upgrade to lance 0.10.1 and update doc string to reflect dynamic
projection options
2024-04-05 16:31:36 -07:00
BubbleCal
0a8e258247 chore(rust): report the TableNotFound error while dropping non-exist table (#1022)
this will work after upgrading lance with
https://github.com/lancedb/lance/pull/1995 merged
see #884 for details

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2024-04-05 16:31:36 -07:00
Weston Pace
2cec2a8937 feat: add a basic async python client starting point (#1014)
This changes `lancedb` from a "pure python" setuptools project to a
maturin project and adds a rust lancedb dependency.

The async python client is extremely minimal (only `connect` and
`Connection.table_names` are supported). The purpose of this PR is to
get the infrastructure in place for building out the rest of the async
client.

Although this is not technically a breaking change (no APIs are
changing) it is still a considerable change in the way the wheels are
built because they now include the native shared library.
2024-04-05 16:31:34 -07:00
Will Jones
464a36ad38 feat: {add|alter|drop}_columns APIs (#1015)
Initial work for #959. This exposes the basic functionality for each in
all of the APIs. Will add user guide documentation in a later PR.
2024-04-05 16:30:47 -07:00
Weston Pace
ad1e81a1d1 refactor: change arrow from a direct dependency to a peer dependency (#984)
BREAKING CHANGE: users will now need to npm install `apache-arrow` and
`@apache-arrow/ts` themselves.
2024-04-05 16:30:47 -07:00
Lance Release
562d1af1ed Bump version: 0.4.10 → 0.4.11 2024-04-05 16:30:40 -07:00
Weston Pace
2163502b31 refactor: rename the rust crate from vectordb to lancedb (#1012)
This also renames the new experimental node package to lancedb. The
classic node package remains named vectordb.

The goal here is to avoid introducing piecemeal breaking changes to the
vectordb crate. Instead, once the new API is stabilized, we will
officially release the lancedb crate and deprecate the vectordb crate.
The same pattern will eventually happen with the npm package vectordb.
2024-04-05 16:30:40 -07:00
Will Jones
c5b0934bfb feat(node): add read_consistency_interval to Node and Rust (#1002)
This PR adds the same consistency semantics as was added in #828. It
*does not* add the same lazy-loading of tables, since that breaks some
existing tests.

This closes #998.

---------

Co-authored-by: Weston Pace <weston.pace@gmail.com>
2024-04-05 16:30:40 -07:00
Lance Release
ef54bd5ba2 [python] Bump version: 0.5.6 → 0.5.7 2024-04-05 16:30:40 -07:00
Lei Xu
80e4d14c02 chore: bump pylance to 0.9.18 (#1011) 2024-04-05 16:30:40 -07:00
Raghav Dixit
fdabf31984 python(feat): Imagebind embedding fn support (#1003)
Added imagebind fn support , steps to install mentioned in docstring. 
pytest slow checks done locally

---------

Co-authored-by: Ayush Chaurasia <ayush.chaurarsia@gmail.com>
2024-04-05 16:30:40 -07:00
Ayush Chaurasia
538d0320f7 Docs: add meta tags (#1006) 2024-04-05 16:30:40 -07:00
Weston Pace
cbc0c439ef refactor: rust vectordb API stabilization of the Connection trait (#993)
This is the start of a more comprehensive refactor and stabilization of
the Rust API. The `Connection` trait is cleaned up to not require
`lance` and to match the `Connection` trait in other APIs. In addition,
the concrete implementation `Database` is hidden.

BREAKING CHANGE: The struct `crate::connection::Database` is now gone.
Several examples opened a connection using `Database::connect` or
`Database::connect_with_params`. Users should now use
`vectordb::connect`.

BREAKING CHANGE: The `connect`, `create_table`, and `open_table` methods
now all return a builder object. This means that a call like
`conn.open_table(..., opt1, opt2)` will now become
`conn.open_table(...).opt1(opt1).opt2(opt2).execute()` In addition, the
structure of options has changed slightly. However, no options
capability has been removed.

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2024-04-05 16:30:40 -07:00
Lance Release
69492586f0 [python] Bump version: 0.5.5 → 0.5.6 2024-04-05 16:30:40 -07:00
Bert
f5627dac14 lance 0.9.18 (#1000) 2024-04-05 16:30:40 -07:00
Johannes Kolbe
32bfb68ac3 apply fixes for notebook (#989) 2024-04-05 16:30:40 -07:00
Ayush Chaurasia
bc871169f0 docs: Add meta tag for image preview (#988)
I think this should work. Need to deploy it to be sure as it can be
tested locally. Can be tested here.

2 things about this solution:
* All pages have a same meta tag, i.e, lancedb banner
* If needed, we can automatically use the first image of each page and
generate meta tags using the ultralytics mkdocs plugin that we did for
this purpose - https://github.com/ultralytics/mkdocs
2024-04-05 16:30:40 -07:00
Chang She
3fc835e124 doc: update navigation links for embedding functions (#986) 2024-04-05 16:30:40 -07:00
Chang She
484a121866 doc: improve embedding functions documentation (#983)
Got some user feedback that the `implicit` / `explicit` distinction is
confusing.
Instead I was thinking we would just deprecate the `with_embeddings` API
and then organize working with embeddings into 3 buckets:

1. manually generate embeddings
2. use a provided embedding function
3. define your own custom embedding function
2024-04-05 16:30:40 -07:00
Chang She
bc850e6add feat(python): add optional threadpool for batch requests (#981)
Currently if a batch request is given to the remote API, each query is
sent sequentially. We should allow the user to specify a threadpool.
2024-04-05 16:30:40 -07:00
Will Jones
26eec4bef4 fix: use static C runtime on Windows (#979)
We depend on C static runtime, but not all Windows machines have that.
So might be worth statically linking it.

https://github.com/reorproject/reor/issues/36#issuecomment-1948876463
2024-04-05 16:30:40 -07:00
Will Jones
f84a4855ca docs: show DuckDB with dataset, not table (#974)
Using datasets is preferred way to allow filter and projection pushdown,
as well as aggregated larger-than-memory tables.
2024-04-05 16:30:40 -07:00
Ayush Chaurasia
aecafa6479 docs: Minimal reranking evaluation benchmarks (#977) 2024-04-05 16:30:40 -07:00
Will Jones
efa846b6e5 chore: upgrade lance to 0.9.16 (#975) 2024-04-05 16:30:36 -07:00
Will Jones
cf3dbcf684 ci: fix Node ARM release build (#971)
When we turned on fat LTO builds, we made the release build job **much**
more compute and memory intensive. The ARM runners have particularly low
memory per core, which makes them susceptible to OOM errors. To avoid
issues, I have enabled memory swap on ARM and bumped the side of the
runner.
2024-04-05 16:30:36 -07:00
Will Jones
c425d3759d ci: reduce number of build jobs on aarch64 to avoid OOM (#970) 2024-04-05 16:30:36 -07:00
Lance Release
fded15c9fe [python] Bump version: 0.5.4 → 0.5.5 2024-04-05 16:30:36 -07:00
Lance Release
e888cb5b48 Bump version: 0.4.9 → 0.4.10 2024-04-05 16:30:30 -07:00
Weston Pace
9241f47f0e feat: make it easier to create empty tables (#942)
This PR also reworks the table creation utilities significantly so that
they are more consistent, built on top of each other, and thoroughly
documented.
2024-04-05 16:30:30 -07:00
Prashanth Rao
b014c24e66 [docs]: Fix typos and clarity in hybrid search docs (#966)
- Fixed typos and added some clarity to the hybrid search docs
- Changed "Airbnb" case to be as per the [official company
name](https://en.wikipedia.org/wiki/Airbnb) (the "bnb" shouldn't be
capitalized", and the text in the document aligns with this
- Fixed headers in nav bar
2024-04-05 16:30:30 -07:00
Will Jones
68115f1369 fix: wrap in BigInt to avoid upstream bug (#962)
Closes #960
2024-04-05 16:30:30 -07:00
Ayush Chaurasia
f78fe721db docs: Add setup cell for colab example (#965) 2024-04-05 16:30:30 -07:00
Ayush Chaurasia
510e8378bc feat(python): hybrid search updates, examples, & latency benchmarks (#964)
- Rename safe_import -> attempt_import_or_raise (closes
https://github.com/lancedb/lancedb/pull/923)
- Update docs
- Add Notebook example (@changhiskhan you can use it for the talk. Comes
with "open in colab" button)
- Latency benchmark & results comparison, sanity check on real-world
data
- Updates the default openai model to gpt-4
2024-04-05 16:30:30 -07:00
Will Jones
1045af6c09 chore: fix clippy lints (#963) 2024-04-05 16:30:30 -07:00
QianZhu
7afcfca10d Qian/make vector col optional (#950)
remote SDK tests were completed through lancedb_integtest
2024-04-05 16:30:29 -07:00
Will Jones
88205aba64 fix(node): statically link lzma (#961)
Fixes #956

Same changes as https://github.com/lancedb/lance/pull/1934
2024-04-05 16:30:10 -07:00
Weston Pace
da47938a43 chore: use a bigger runner for NPM publish jobs on aarch64 to avoid OOM (#955) 2024-04-05 16:30:06 -07:00
Lance Release
03e705c14c Bump version: 0.4.8 → 0.4.9 2024-04-05 16:29:58 -07:00
Lance Release
a7e60a4c3f [python] Bump version: 0.5.3 → 0.5.4 2024-04-05 16:29:58 -07:00
Weston Pace
e12bdc78bb chore: bump lance version to 0.9.15 (#949) 2024-04-05 16:29:58 -07:00
Weston Pace
41ccb48160 feat: add support for filter during merge insert when matched (#948)
Closes #940
2024-04-05 16:29:58 -07:00
QianZhu
069ad267bd added error msg to SaaS APIs (#852)
1. improved error msg for SaaS create_table and create_index

---------

Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
2024-04-05 16:29:58 -07:00
Weston Pace
138fc3f66b feat: add a filterable count_rows to all the lancedb APIs (#913)
A `count_rows` method that takes a filter was recently added to
`LanceTable`. This PR adds it everywhere else except `RemoteTable` (that
will come soon).
2024-04-05 16:29:58 -07:00
Nitish Sharma
2c3f982f4f Minor updates to FAQ (#935)
Based on discussion over discord, adding minor updates to the FAQ
section about benchmarks, practical data size and concurrency in LanceDB
2024-04-05 16:29:58 -07:00
Ayush Chaurasia
d07817a562 feat(python): Reranker DX improvements (#904)
- Most users might not know how to use `QueryBuilder` object. Instead we
should just pass the string query.
- Add new rerankers: Colbert, openai
2024-04-05 16:29:58 -07:00
Will Jones
39cc2fd62b feat(python): add read_consistency_interval argument (#828)
This PR refactors how we handle read consistency: does the `LanceTable`
class always pick up modifications to the table made by other instance
or processes. Users have three options they can set at the connection
level:

1. (Default) `read_consistency_interval=None` means it will not check at
all. Users can call `table.checkout_latest()` to manually check for
updates.
2. `read_consistency_interval=timedelta(0)` means **always** check for
updates, giving strong read consistency.
3. `read_consistency_interval=timedelta(seconds=20)` means check for
updates every 20 seconds. This is eventual consistency, a compromise
between the two options above.

There is now an explicit difference between a `LanceTable` that tracks
the current version and one that is fixed at a historical version. We
now enforce that users cannot write if they have checked out an old
version. They are instructed to call `checkout_latest()` before calling
the write methods.

Since `conn.open_table()` doesn't have a parameter for version, users
will only get fixed references if they call `table.checkout()`.

The difference between these two can be seen in the repr: Table that are
fixed at a particular version will have a `version` displayed in the
repr. Otherwise, the version will not be shown.

```python
>>> table
LanceTable(connection=..., name="my_table")
>>> table.checkout(1)
>>> table
LanceTable(connection=..., name="my_table", version=1)
```

I decided to not create different classes for these states, because I
think we already have enough complexity with the Cloud vs OSS table
references.

Based on #812
2024-04-05 16:29:57 -07:00
Ayush Chaurasia
0f00cd0097 feat(python): add support new openai embedding functions (#912)
@PrashantDixit0

---------

Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
2024-04-05 16:29:13 -07:00
Lei Xu
84edf56995 chore: add global cargo config to enable minimal cpu target (#925)
* Closes #895 
* Fix cargo clippy
2024-04-05 16:29:13 -07:00
QianZhu
b2efd0da53 fix hybrid search example (#922) 2024-04-05 16:29:13 -07:00
Lance Release
c101e9deed [python] Bump version: 0.5.2 → 0.5.3 2024-04-05 16:29:13 -07:00
Ayush Chaurasia
a24e16f753 fix: revert safe_import_pandas usage (#921) 2024-04-05 16:29:13 -07:00
Lance Release
eb1f02919a Bump version: 0.4.7 → 0.4.8 2024-04-05 16:29:05 -07:00
Lance Release
c8f92c2987 [python] Bump version: 0.5.1 → 0.5.2 2024-04-05 16:29:05 -07:00
Weston Pace
9d115bd507 chore: bump pylance version to latest in pyproject.toml (#918) 2024-04-05 16:29:05 -07:00
Weston Pace
18f7bad3dd feat: add merge_insert to the node and rust APIs (#915) 2024-04-05 16:29:05 -07:00
QianZhu
2e75b16403 make it explicit about the vector column data type (#916)
<img width="837" alt="Screenshot 2024-02-01 at 4 23 34 PM"
src="https://github.com/lancedb/lancedb/assets/1305083/4f0f5c5a-2a24-4b00-aad1-ef80a593d964">
[
<img width="838" alt="Screenshot 2024-02-01 at 4 26 03 PM"
src="https://github.com/lancedb/lancedb/assets/1305083/ca073bc8-b518-4be3-811d-8a7184416f07">
](url)

---------

Co-authored-by: Weston Pace <weston.pace@gmail.com>
2024-04-05 16:29:05 -07:00
Bert
3c544582f6 fix: add request retry to python client (#917)
Adds capability to the remote python SDK to retry requests (fixes #911)

This can be configured through environment:
- `LANCE_CLIENT_MAX_RETRIES`= total number of retries. Set to 0 to
disable retries. default = 3
- `LANCE_CLIENT_CONNECT_RETRIES` = number of times to retry request in
case of TCP connect failure. default = 3
- `LANCE_CLIENT_READ_RETRIES` = number of times to retry request in case
of HTTP request failure. default = 3
- `LANCE_CLIENT_RETRY_STATUSES` = http statuses for which the request
will be retried. passed as comma separated list of ints. default `500,
502, 503`
- `LANCE_CLIENT_RETRY_BACKOFF_FACTOR` = controls time between retry
requests. see
[here](23f2287eb5/src/urllib3/util/retry.py (L141-L146)).
default = 0.25

Only read requests will be retried:
- list table names
- query
- describe table
- list table indices

This does not add retry capabilities for writes as it could possibly
cause issues in the case where the retried write isn't idempotent. For
example, in the case where the LB times-out the request but the server
completes the request anyway, we might not want to blindly retry an
insert request.
2024-04-05 16:29:05 -07:00
Weston Pace
f602e07f99 docs: add cleanup_old_versions and compact_files to Table for documentation purposes (#900)
Closes #819
2024-04-05 16:29:05 -07:00
Weston Pace
4eb819072a feat: upgrade to lance 0.9.11 and expose merge_insert (#906)
This adds the python bindings requested in #870 The javascript/rust
bindings will be added in a future PR.
2024-04-05 16:29:05 -07:00
Lei Xu
bd2d187538 ci: bump to new version of python action to use node 20 gIthub action runtime (#909)
Github action is deprecating old node-16 runtime.
2024-04-05 16:29:05 -07:00
JacobLinCool
f308a0ffdb fix the repo link on npm, add links for homepage and bug report (#910)
- fix the repo link on npm
- add links for homepage and bug report
2024-04-05 16:29:05 -07:00
QianZhu
1f2eafca75 arrow table/f16 example (#907) 2024-04-05 16:29:05 -07:00
Lance Release
567c5f6d01 Bump version: 0.4.6 → 0.4.7 2024-04-05 16:28:56 -07:00
Lei Xu
8e139012e2 fix(node): pass AWS credentials to db level operations (#908)
Passed the following tests

```ts
const keyId = process.env.AWS_ACCESS_KEY_ID;
const secretKey = process.env.AWS_SECRET_ACCESS_KEY;
const sessionToken = process.env.AWS_SESSION_TOKEN;
const region = process.env.AWS_REGION;

const db = await lancedb.connect({
  uri: "s3://bucket/path",
  awsCredentials: {
    accessKeyId: keyId,
    secretKey: secretKey,
    sessionToken: sessionToken,
  },
  awsRegion: region,
} as lancedb.ConnectionOptions);

  console.log(await db.createTable("test", [{ vector: [1, 2, 3] }]));
  console.log(await db.tableNames());
  console.log(await db.dropTable("test"))
```
2024-04-05 16:28:56 -07:00
Will Jones
d5be6c7a05 docs: provide AWS S3 cleanup and permissions advice (#903)
Adding some more quick advice for how to setup AWS S3 with LanceDB.

---------

Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>
2024-04-05 16:28:56 -07:00
Abraham Lopez
5a12224a02 chore: update JS/TS example in README (#898)
- The JS/TS library actually expects named parameters via an object in
`.createTable()` rather than individual arguments
- Added example on how to search rows by criteria without a vector
search. TS type of `.search()` currently has the `query` parameter as
non-optional so we have to pass undefined for now.
2024-04-05 16:28:56 -07:00
Lei Xu
a617ad35ff ci: change apple silicon runner to free OSS macos-14 target (#901) 2024-04-05 16:28:56 -07:00
Raghav Dixit
61bf688e5b chore(python): GTE embedding function model name update (#902)
Co-authored-by: Ayush Chaurasia <ayush.chaurarsia@gmail.com>
2024-04-05 16:28:56 -07:00
Ayush Chaurasia
a41f7be88d feat(python): Hybrid search & Reranker API (#824)
based on https://github.com/lancedb/lancedb/pull/713
- The Reranker api can be plugged into vector only or fts only search
but this PR doesn't do that (see example -
https://txt.cohere.com/rerank/)


### Default reranker -- `LinearCombinationReranker(weight=0.7,
fill=1.0)`

```
table.search("hello", query_type="hybrid").rerank(normalize="score").to_pandas()
```
### Available rerankers
LinearCombinationReranker
```
from lancedb.rerankers import LinearCombinationReranker

# Same as default 
table.search("hello", query_type="hybrid").rerank(
                                      normalize="score", 
                                      reranker=LinearCombinationReranker()
                                     ).to_pandas()

# with custom params
reranker = LinearCombinationReranker(weight=0.3, fill=1.0)
table.search("hello", query_type="hybrid").rerank(
                                      normalize="score", 
                                      reranker=reranker
                                     ).to_pandas()
```

Cohere Reranker
```
from lancedb.rerankers import CohereReranker

# default model.. English and multi-lingual supported. See docstring for available custom params
table.search("hello", query_type="hybrid").rerank(
                                      normalize="rank",  # score or rank
                                      reranker=CohereReranker()
                                     ).to_pandas()

```

CrossEncoderReranker

```
from lancedb.rerankers import CrossEncoderReranker

table.search("hello", query_type="hybrid").rerank(
                                      normalize="rank", 
                                      reranker=CrossEncoderReranker()
                                     ).to_pandas()

```

## Using custom Reranker
```
from lancedb.reranker import Reranker

class CustomReranker(Reranker):
    def rerank_hybrid(self, vector_result, fts_result):
           combined_res = self.merge_results(vector_results, fts_results) # or use custom combination logic
           # Custom rerank logic here
           
           return combined_res
```

- [x] Expand testing
- [x] Make sure usage makes sense
- [x] Run simple benchmarks for correctness (Seeing weird result from
cohere reranker in the toy example)
- Support diverse rerankers by default:
- [x] Cross encoding
- [x] Cohere
- [x] Reciprocal Rank Fusion

---------

Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>
2024-04-05 16:28:56 -07:00
Prashanth Rao
ecbbe185c7 Fix image bgcolor (#891)
Minor fix to change the background color for an image in the docs. It's
now readable in both light and dark modes (earlier version made it
impossible to read in dark mode).
2024-04-05 16:28:56 -07:00
Ayush Chaurasia
b326bf2ef6 doc: Add documentation chatbot for LanceDB (#890)
<img width="1258" alt="Screenshot 2024-01-29 at 10 05 52 PM"
src="https://github.com/lancedb/lancedb/assets/15766192/7c108fde-e993-415c-ad01-72010fd5fe31">
2024-04-05 16:28:56 -07:00
Raghav Dixit
472344fcb3 feat(python): Embedding fn support for gte-mlx/gte-large (#873)
have added testing and an example in the docstring, will be pushing a
separate PR in recipe repo for rag example

---------

Co-authored-by: Ayush Chaurasia <ayush.chaurarsia@gmail.com>
2024-04-05 16:28:56 -07:00
Ayush Chaurasia
bca80939c2 chore(python): Temporarily extend remote connection timeout (#888)
Context - https://etoai.slack.com/archives/C05NC5YSW5V/p1706371205883149
2024-04-05 16:28:56 -07:00
Lei Xu
911d063237 doc: fix js example of create index (#886) 2024-04-05 16:28:56 -07:00
Lei Xu
12e776821a doc: use snippet for rust code example and make sure rust examples run through CI (#885) 2024-04-05 16:28:56 -07:00
Lei Xu
c6e5eb0398 fix: fix doc build to include the source snippet correctly (#883) 2024-04-05 16:28:56 -07:00
Chang She
1d0578ce25 doc(rust): minor fixes for Rust quick start. (#878) 2024-04-05 16:28:56 -07:00
Lei Xu
e7fdb931de chore: convert all js doc test to use snippet. (#881) 2024-04-05 16:28:56 -07:00
Lei Xu
d811b89de2 doc: use code snippet for typescript examples (#880)
The typescript code is in a fully function file, that will be run via the CI.
2024-04-05 16:28:56 -07:00
Ayush Chaurasia
545a03d7f9 feat(python): Aws Bedrock embeddings integration (#822)
Supports amazon titan, cohere english & cohere multi-lingual base
models.
2024-04-05 16:28:56 -07:00
Lei Xu
f2e29eb004 chore: upgrade lance, pylance and datafusion (#879) 2024-04-05 16:28:56 -07:00
Lei Xu
36dbf47d60 chore: add one rust SDK e2e example (#876)
Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
2024-04-05 16:28:56 -07:00
Lei Xu
fd2fd94862 doc: update quick start for full rust example (#872) 2024-04-05 16:28:56 -07:00
Lei Xu
faa5912c3f chore: bump github actions to v4 due to GHA warnings of node version deprecation (#874) 2024-04-05 16:28:56 -07:00
Lance Release
334e423464 Bump version: 0.4.5 → 0.4.6 2024-04-05 16:28:18 -07:00
Lei Xu
7274c913a8 feat(rust): provide connect and connect_with_options in Rust SDK (#871)
* Bring the feature parity of Rust connect methods.
* A global connect method that can connect to local and remote / cloud
table, as the same as in js/python today.
2024-04-05 16:28:18 -07:00
Lei Xu
a192c1a9b1 chore(rust): simplified version of optimize (#869)
Consolidate various optimize() into one method, similar to postgres
VACCUM in the process of preparing Rust API for public use
2024-04-05 16:28:18 -07:00
Lei Xu
cef0293985 feat(napi): Issue queries as node SDK (#868)
* Query as a fluent API and `AsyncIterator<RecordBatch>`
* Much more docs
* Add tests for auto infer vector search columns with different
dimensions.
2024-04-05 16:28:18 -07:00
Lance Release
0be4fd2aa6 Bump version: 0.4.4 → 0.4.5 2024-04-05 16:27:59 -07:00
Lei Xu
0664eee38d fix: release build for node sdk (#861) 2024-04-05 16:27:59 -07:00
Lance Release
f3dd5c89dc Bump version: 0.4.3 → 0.4.4 2024-04-05 16:27:51 -07:00
Lei Xu
8b04d8fef6 feat: improve the rust table query API and documents (#860)
* Easy to type
* Handle `String, &str, [String] and [&str]` well without manual
conversion
* Fix function name to be verb
* Improve docstring of Rust.
* Promote `query` and `search()` to public `Table` trait
2024-04-05 16:27:51 -07:00
Lei Xu
68e2bb0b2d doc: update rust readme to include crate and docs.rs links (#859) 2024-04-05 16:27:51 -07:00
Lei Xu
db4a979278 feat(napi): Provide a new createIndex API in the napi SDK. (#857) 2024-04-05 16:27:51 -07:00
Will Jones
7d82e56f76 docs: document basics of configuring object storage (#832)
Created based on upstream PR https://github.com/lancedb/lance/pull/1849

Closes #681

---------

Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>
2024-04-05 16:27:51 -07:00
Lei Xu
dfabbe9081 feat(rust): create index API improvement (#853)
* Extract a minimal Table interface in Rust SDK
* Make create_index composable in Rust.
* Fix compiling issues from ffi
2024-04-05 16:27:51 -07:00
Bert
d1f9722bfb Bump lance 0.9.9 (#851) 2024-04-05 16:27:51 -07:00
Lei Xu
efcaa433fe feat: rework NodeJS SDK using napi (#847)
Use Napi to write a Node.js SDK that follows Polars for better
maintainability, while keeping most of the logic in Rust.
2024-04-05 16:27:51 -07:00
Lance Release
7b8188bcd5 [python] Bump version: 0.5.0 → 0.5.1 2024-04-05 16:27:51 -07:00
Lei Xu
65c1d8bc4c feat: change create table to accept Arrow table (#845) 2024-04-05 16:27:50 -07:00
QianZhu
5ecbf971e2 extend timeout for requests.get and requests.post (#848) 2024-04-05 16:27:42 -07:00
Lei Xu
a78e07907c chore(rust): provide a Connection trait to match python and nodejs SDK (#846)
In NodeJS and Python, LanceDB establishes a connection to a db. In Rust
core, it is called Database.
We should be consistent with the naming.
2024-04-05 16:27:42 -07:00
Bert
a409000c6f allow passing api key as env var (#841)
Allow passing API key as env var:
```shell
export LANCEDB_API_KEY=sh_123...
```

with this set, apiKey argument can omitted from `connect`
```js
    const db = await vectordb.connect({
        uri: "db://test-proj-01-ae8343",
        region: "us-east-1",
  })
```
```py
    db = lancedb.connect(
        uri="db://test-proj-01-ae8343",
        region="us-east-1",
    )
```
2024-04-05 16:27:42 -07:00
Lei Xu
d8befeeea2 feat(js): add helper function to create Arrow Table with schema (#838)
Support to make Apache Arrow Table from an array of javascript Records,
with optionally provided Schema.
2024-04-05 16:27:42 -07:00
Chang She
b699b5c42b chore(js): remove errant console.log (#834) 2024-04-05 16:27:42 -07:00
Lei Xu
49de13c65a doc: add index page for rust crate (#839)
Rust API doc for the braves
2024-04-05 16:27:42 -07:00
Lei Xu
97d033dfd6 bug: add a test for fp16 (#837)
Add test to ingest fp16 to a database
2024-04-05 16:27:42 -07:00
Chang She
0c580abd70 Merge branch 'tecmie-tecmie/embeddings-openai' 2024-04-05 16:27:42 -07:00
Chang She
d19bf80375 Merge branch 'tecmie/embeddings-openai' of github.com:tecmie/lancedb into tecmie-tecmie/embeddings-openai 2024-04-05 16:27:41 -07:00
Lei Xu
5b2c602fb3 doc: improve docs for nodejs connect functions (#833)
* improve the docstring for NodeJS connect functions and
`ConnectOptions` parameters.
* Simplify `npm run build` steps.
2024-04-05 16:27:32 -07:00
Bert
7bdca7a092 fix: remote python client closes idle connections (#831) 2024-04-05 16:27:32 -07:00
Will Jones
5f6d13e958 ci: lint and enforce linting (#829)
@eddyxu added instructions for linting here:

7af213801a/python/README.md (L45-L50)

However, we had a lot of failures and weren't checking this in CI. This
PR fixes all lints and adds a check to CI to keep us in compliance with
the lints.
2024-04-05 16:27:31 -07:00
Bert
4243eaee93 bump lance to 0.9.7 (#826) 2024-04-05 16:27:14 -07:00
Prashanth Rao
e6bb907d81 Docs updates incl. Polars (#827)
This PR makes the following aesthetic and content updates to the docs.

- [x] Fix max width issue on mobile: Content should now render more
cleanly and be more readable on smaller devices
- [x] Improve image quality of flowchart in data management page
- [x] Fix syntax highlighting in text at the bottom of the IVF-PQ
concepts page
- [x] Add example of Polars LazyFrames to docs (Integrations)
- [x] Add example of adding data to tables using Polars (guides)
2024-04-05 16:27:14 -07:00
Prashanth Rao
4d5d748acd docs: Updates and refactor (#683)
This PR makes incremental changes to the documentation.

* Closes #697
* Closes #698

- [x] Add dark mode
- [x] Fix headers in navbar
- [x] Add `extra.css` to customize navbar styles
- [x] Customize fonts for prose/code blocks, navbar and admonitions
- [x] Inspect all admonition boxes (remove redundant dropdowns) and
improve clarity and readability
- [x] Ensure that all images in the docs have white background (not
transparent) to be viewable in dark mode
- [x] Improve code formatting in code blocks to make them consistent
with autoformatters (eslint/ruff)
- [x] Add bolder weight to h1 headers
- [x] Add diagram showing the difference between embedded (OSS) and
serverless (Cloud)
- [x] Fix [Creating an empty
table](https://lancedb.github.io/lancedb/guides/tables/#creating-empty-table)
section: right now, the subheaders are not clickable.
- [x] In critical data ingestion methods like `table.add` (among
others), the type signature often does not match the actual code
- [x] Proof-read each documentation section and rewrite as necessary to
provide more context, use cases, and explanations so it reads less like
reference documentation. This is especially important for CRUD and
search sections since those are so central to the user experience.

- [x] The section for [Adding
data](https://lancedb.github.io/lancedb/guides/tables/#adding-to-a-table)
only shows examples for pandas and iterables. We should include pydantic
models, arrow tables, etc.
- [x] Add conceptual tutorial for IVF-PQ index
- [x] Clearly separate vector search, FTS and filtering sections so that
these are easier to find
- [x] Add docs on refine factor to explain its importance for recall.
Closes #716
- [x] Add an FAQ page showing answers to commonly asked questions about
LanceDB. Closes #746
- [x] Add simple polars example to the integrations section. Closes #756
and closes #153
- [ ] Add basic docs for the Rust API (more detailed API docs can come
later). Closes #781
- [x] Add a section on the various storage options on local vs. cloud
(S3, EBS, EFS, local disk, etc.) and the tradeoffs involved. Closes #782
- [x] Revamp filtering docs: add pre-filtering examples and redo headers
and update content for SQL filters. Closes #783 and closes #784.
- [x] Add docs for data management: compaction, cleaning up old versions
and incremental indexing. Closes #785
- [ ] Add a benchmark section that also discusses some best practices.
Closes #787

---------

Co-authored-by: Ayush Chaurasia <ayush.chaurarsia@gmail.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
2024-04-05 16:27:12 -07:00
Lance Release
33ab68c790 [python] Bump version: 0.4.4 → 0.5.0 2024-04-05 16:26:36 -07:00
Chang She
dbc3515d96 chore(python): turn off lazy frame ingestion (#821) 2024-04-05 16:26:36 -07:00
Chang She
ac3d95ec34 feat(python): allow the entire table to be converted a polars dataframe (#814) 2024-04-05 16:26:36 -07:00
Chang She
72b39432e8 feat(python): add exist_ok option to create table (#813)
This mimics CREATE TABLE IF NOT EXISTS behavior.
We add `db.create_table(..., exist_ok=True)` parameter.
By default it is set to False, so trying to create
a table with the same name will raise an exception.
If set to True, then it only opens the table if it
already exists. If you pass in a schema, it will
be checked against the existing table to make sure
you get what you want. If you pass in data, it will
NOT be added to the existing table.
2024-04-05 16:26:35 -07:00
Ayush Chaurasia
340fd98b42 chore(python): get rid of Pydantic deprication warning in embedding fcn (#816)
```
UserWarning: Valid config keys have changed in V2:
* 'keep_untouched' has been renamed to 'ignored_types' warnings.warn(message, UserWarning)
```
2024-04-05 16:26:20 -07:00
Anton Shevtsov
dc0b11a86a Add openai api key not found help (#815)
This pull request adds check for the presence of an environment variable
`OPENAI_API_KEY` and removes an unused parameter in
`retry_with_exponential_backoff` function.
2024-04-05 16:26:20 -07:00
Chang She
17dcb70076 feat(python): basic polars integration (#811)
We should now be able to directly ingest polars dataframes and return
results as polars dataframes

![image](https://github.com/lancedb/lancedb/assets/759245/828b1260-c791-45f1-a047-aa649575e798)
2024-04-05 16:26:19 -07:00
Andrew Miracle
8daed93a91 eslint fix 2024-04-05 16:25:52 -07:00
Ayush Chaurasia
2f72d5138e feat(python): Add gemini text embedding function (#806)
Named it Gemini-text for now. Not sure how complicated it will be to
support both text and multimodal embeddings under the same class
"gemini"..But its not something to worry about for now I guess.
2024-04-05 16:25:52 -07:00
Andrew Miracle
f1aad1afc7 Merge branch 'main' into tecmie/embeddings-openai 2024-04-05 16:25:51 -07:00
Andrew Miracle
fa13fb9392 rebase from lancedb/main 2024-04-05 16:25:14 -07:00
Lance Release
d39145c7e4 Updating package-lock.json 2024-04-05 16:25:14 -07:00
Lance Release
3463248eba Bump version: 0.4.2 → 0.4.3 2024-04-05 16:25:14 -07:00
Lance Release
3191966ffb [python] Bump version: 0.4.3 → 0.4.4 2024-04-05 16:25:14 -07:00
Will Jones
3b119420b2 upgrade lance (#809) 2024-04-05 16:25:14 -07:00
Lei Xu
6f7cb75b07 chore: remove black as dependency (#808)
We use `ruff` in CI and dev workflow now.
2024-04-05 16:25:14 -07:00
Chang She
118a11c9b3 feat(node): align incoming data to table schema (#802) 2024-04-05 16:25:14 -07:00
Sebastian Law
70ca6d8ea5 use requests instead of aiohttp for underlying http client (#803)
instead of starting and stopping the current thread's event loop on
every http call, just make an http call.
2024-04-05 16:25:14 -07:00
Chang She
556e01d9d9 chore(python): add docstring for limit behavior (#800)
Closes #796
2024-04-05 16:25:14 -07:00
Chang She
1060dde858 feat(python): add phrase query option for fts (#798)
addresses #797 

Problem: tantivy does not expose option to explicitly

Proposed solution here: 

1. Add a `.phrase_query()` option
2. Under the hood, LanceDB takes care of wrapping the input in quotes
and replace nested double quotes with single quotes

I've also filed an upstream issue, if they support phrase queries
natively then we can get rid of our manual custom processing here.
2024-04-05 16:25:14 -07:00
Chang She
950e05da81 feat(python): add count_rows with filter option (#801)
Closes #795
2024-04-05 16:25:14 -07:00
Chang She
2b7754f929 fix(rust): not sure why clippy is suddenly unhappy (#794)
should fix the error on top of main


https://github.com/lancedb/lancedb/actions/runs/7457190471/job/20288985725
2024-04-05 16:25:14 -07:00
Chang She
d0bff7b78e feat(python): support new style optional syntax (#793) 2024-04-05 16:25:14 -07:00
Chang She
85f3f8793c chore(python): document phrase queries in fts (#788)
closes #769 

Add unit test and documentation on using quotes to perform a phrase
query
2024-04-05 16:25:14 -07:00
Chang She
a758876a65 feat(node): support table.schema for LocalTable (#789)
Close #773 

we pass an empty table over IPC so we don't need to manually deal with
serde. Then we just return the schema attribute from the empty table.

---------

Co-authored-by: albertlockett <albert.lockett@gmail.com>
2024-04-05 16:25:14 -07:00
Lei Xu
073a2a1b28 chore: bump lance to 0.9.5 (#790) 2024-04-05 16:25:14 -07:00
Chang She
195c106242 feat(python): Set heap size to get faster fts indexing performance (#762)
By default tantivy-py uses 128MB heapsize. We change the default to 1GB
and we allow the user to customize this

locally this makes `test_fts.py` run 10x faster
2024-04-05 16:25:13 -07:00
Lance Release
f0a654036e Updating package-lock.json 2024-04-05 16:25:02 -07:00
lucasiscovici
792830ccb5 raise exception if fts index does not exist (#776)
raise exception if fts index does not exist

---------

Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
2024-04-05 16:25:02 -07:00
Lance Release
162f8536d1 Updating package-lock.json 2024-04-05 16:25:02 -07:00
sudhir
5d198327bb Make examples work with current version of Openai api's (#779)
These examples don't work because of changes in openai api from version
1+
2024-04-05 16:25:02 -07:00
Lance Release
55cc3ed5a2 Bump version: 0.4.2 → 0.4.3 2024-04-05 16:25:02 -07:00
Chris
b11428dddb Minor Fixes to Ingest Embedding Functions Docs (#777)
Addressed minor typos and grammatical issues to improve readability

---------

Co-authored-by: Christopher Correa <chris.correa@gmail.com>
2024-04-05 16:25:02 -07:00
Lance Release
1387dc6e48 [python] Bump version: 0.4.3 → 0.4.4 2024-04-05 16:25:02 -07:00
Vladimir Varankin
84c6c8f08c Minor corrections for docs of embedding_functions (#780)
In addition to #777, this pull request fixes more typos in the
documentation for "Ingest Embedding Functions".
2024-04-05 16:25:02 -07:00
Will Jones
63e273606e upgrade lance (#809) 2024-04-05 16:25:02 -07:00
QianZhu
35f83694be small bug fix for example code in SaaS JS doc (#770) 2024-04-05 16:25:02 -07:00
Lei Xu
45b006d68c chore: remove black as dependency (#808)
We use `ruff` in CI and dev workflow now.
2024-04-05 16:25:02 -07:00
Chang She
20208b9efb chore(python): handle NaN input in fts ingestion (#763)
If the input text is None, Tantivy raises an error
complaining it cannot add a NoneType. We handle this
upstream so None's are not added to the document.
If all of the indexed fields are None then we skip
this document.
2024-04-05 16:25:02 -07:00
Bengsoon Chuah
c00af75d63 Add relevant imports for each step (#764)
I found that it was quite incoherent to have to read through the
documentation and having to search which submodule that each class
should be imported from.

For example, it is cumbersome to have to navigate to another
documentation page to find out that `EmbeddingFunctionRegistry` is from
`lancedb.embeddings`
2024-04-05 16:25:02 -07:00
QianZhu
21245dfb9d SaaS JS API sdk doc (#740)
Co-authored-by: Aidan <64613310+aidangomar@users.noreply.github.com>
2024-04-05 16:25:02 -07:00
Chang She
81487f10fe feat(js): support list of string input (#755)
Add support for adding lists of string input (e.g., list of categorical
labels)

Follow-up items: #757 #758
2024-04-05 16:25:02 -07:00
Lance Release
3aa233f38a Updating package-lock.json 2024-04-05 16:25:02 -07:00
Lance Release
3278fa75d1 Bump version: 0.4.1 → 0.4.2 2024-04-05 16:25:02 -07:00
Lance Release
549f2bf396 [python] Bump version: 0.4.2 → 0.4.3 2024-04-05 16:25:02 -07:00
Lei Xu
138760bc6e chore: bump pylance to 0.9.2 (#754) 2024-04-05 16:25:02 -07:00
Xin Hao
0bddf77a73 docs: fix link (#752) 2024-04-05 16:25:02 -07:00
Chang She
154dc508ba feat(python): first cut batch queries for remote api (#753)
issue separate requests under the hood and concatenate results
2024-04-05 16:25:02 -07:00
Lance Release
0b8fe76590 [python] Bump version: 0.4.1 → 0.4.2 2024-04-05 16:25:02 -07:00
Chang She
c22eacb8b6 chore(python): update embedding API to use openai 1.6.1 (#751)
API has changed significantly, namely `openai.Embedding.create` no
longer exists.
https://github.com/openai/openai-python/discussions/742

Update the OpenAI embedding function and put a minimum on the openai sdk
version.
2024-04-05 16:25:02 -07:00
Chang She
75d575ef4e feat: add timezone handling for datetime in pydantic (#578)
If you add timezone information in the Field annotation for a datetime
then that will now be passed to the pyarrow data type.

I'm not sure how pyarrow enforces timezones, right now, it silently
coerces to the timezone given in the column regardless of whether the
input had the matching timezone or not. This is probably not the right
behavior. Though we could just make it so the user has to make the
pydantic model do the validation instead of doing that at the pyarrow
conversion layer.
2024-04-05 16:25:02 -07:00
Chang She
bc83bc9838 feat(python): add post filtering for full text search (#739)
Closes #721 

fts will return results as a pyarrow table. Pyarrow tables has a
`filter` method but it does not take sql filter strings (only pyarrow
compute expressions). Instead, we do one of two things to support
`tbl.search("keywords").where("foo=5").limit(10).to_arrow()`:

Default path: If duckdb is available then use duckdb to execute the sql
filter string on the pyarrow table.
Backup path: Otherwise, write the pyarrow table to a lance dataset and
then do `to_table(filter=<filter>)`

Neither is ideal. 
Default path has two issues:
1. requires installing an extra library (duckdb)
2. duckdb mangles some fields (like fixed size list => list)

Backup path incurs a latency penalty (~20ms on ssd) to write the
resultset to disk.

In the short term, once #676 is addressed, we can write the dataset to
"memory://" instead of disk, this makes the post filter evaluate much
quicker (ETA next week).

In the longer term, we'd like to be able to evaluate the filter string
on the pyarrow Table directly, one possibility being that we use
Substrait to generate pyarrow compute expressions from sql string. Or if
there's enough progress on pyarrow, it could support Substrait
expressions directly (no ETA)

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2024-04-05 16:25:02 -07:00
Aidan
a76b5755ff fix: createIndex index cache size (#741) 2024-04-05 16:25:02 -07:00
Chang She
9a192426d3 feat(python): support list of list fields from pydantic schema (#747)
For object detection, each row may correspond to an image and each image
can have multiple bounding boxes of x-y coordinates. This means that a
`bbox` field is potentially "list of list of float". This adds support
in our pydantic-pyarrow conversion for nested lists.
2024-04-05 16:25:02 -07:00
Lance Release
ab794ba237 Updating package-lock.json 2024-04-05 16:25:02 -07:00
Lance Release
81e9df57c0 [python] Bump version: 0.4.0 → 0.4.1 2024-04-05 16:25:02 -07:00
Lance Release
8705784cea Bump version: 0.4.0 → 0.4.1 2024-04-05 16:25:02 -07:00
elliottRobinson
b3fbca4aee Update default_embedding_functions.md (#744)
Modify some grammar, punctuation, and spelling errors.
2024-04-05 16:25:02 -07:00
Andrew Miracle
5948f11641 eslint fix 2024-04-05 16:25:02 -07:00
Andrew Miracle
9efc3fa6d8 remove console logs 2024-04-05 16:25:02 -07:00
Andrew Miracle
453bf113ae add support for openai SDK version ^4.24.1 2024-04-05 16:25:02 -07:00
Chang She
4b243c5ff8 feat(node): align incoming data to table schema (#802) 2024-04-05 16:25:01 -07:00
Sebastian Law
4aa7f58a07 use requests instead of aiohttp for underlying http client (#803)
instead of starting and stopping the current thread's event loop on
every http call, just make an http call.
2024-04-05 16:25:01 -07:00
Chang She
7581cbb38f chore(python): add docstring for limit behavior (#800)
Closes #796
2024-04-05 16:25:01 -07:00
Chang She
881dfa022b feat(python): add phrase query option for fts (#798)
addresses #797 

Problem: tantivy does not expose option to explicitly

Proposed solution here: 

1. Add a `.phrase_query()` option
2. Under the hood, LanceDB takes care of wrapping the input in quotes
and replace nested double quotes with single quotes

I've also filed an upstream issue, if they support phrase queries
natively then we can get rid of our manual custom processing here.
2024-04-05 16:25:01 -07:00
Chang She
f17d16f935 feat(python): add count_rows with filter option (#801)
Closes #795
2024-04-05 16:25:01 -07:00
Chang She
f3a905af63 fix(rust): not sure why clippy is suddenly unhappy (#794)
should fix the error on top of main


https://github.com/lancedb/lancedb/actions/runs/7457190471/job/20288985725
2024-04-05 16:25:01 -07:00
Chang She
a07c6c465a feat(python): support new style optional syntax (#793) 2024-04-05 16:25:01 -07:00
Chang She
1dd663fc8a chore(python): document phrase queries in fts (#788)
closes #769 

Add unit test and documentation on using quotes to perform a phrase
query
2024-04-05 16:25:01 -07:00
Chang She
175ad9223b feat(node): support table.schema for LocalTable (#789)
Close #773 

we pass an empty table over IPC so we don't need to manually deal with
serde. Then we just return the schema attribute from the empty table.

---------

Co-authored-by: albertlockett <albert.lockett@gmail.com>
2024-04-05 16:25:01 -07:00
Lei Xu
4c8690549a chore: bump lance to 0.9.5 (#790) 2024-04-05 16:25:01 -07:00
Chang She
3100f0d861 feat(python): Set heap size to get faster fts indexing performance (#762)
By default tantivy-py uses 128MB heapsize. We change the default to 1GB
and we allow the user to customize this

locally this makes `test_fts.py` run 10x faster
2024-04-05 16:25:00 -07:00
Will Jones
c34aa09166 docs: update node API reference (#734)
This command hasn't been run for a while...
2024-04-05 16:24:47 -07:00
lucasiscovici
328aa2247b raise exception if fts index does not exist (#776)
raise exception if fts index does not exist

---------

Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
2024-04-05 16:24:47 -07:00
Will Jones
43662705ad docs: enhance Update user guide (#735)
Closes #705
2024-04-05 16:24:47 -07:00
sudhir
8a48b32689 Make examples work with current version of Openai api's (#779)
These examples don't work because of changes in openai api from version
1+
2024-04-05 16:24:47 -07:00
Bert
5bb128a24d docs: fix JS api docs for update method (#738) 2024-04-05 16:24:47 -07:00
Chris
6698376f02 Minor Fixes to Ingest Embedding Functions Docs (#777)
Addressed minor typos and grammatical issues to improve readability

---------

Co-authored-by: Christopher Correa <chris.correa@gmail.com>
2024-04-05 16:24:47 -07:00
Weston Pace
94e81ff84b feat: add the ability to create scalar indices (#679)
This is a pretty direct binding to the underlying lance capability
2024-04-05 16:24:47 -07:00
Vladimir Varankin
2fd829296e Minor corrections for docs of embedding_functions (#780)
In addition to #777, this pull request fixes more typos in the
documentation for "Ingest Embedding Functions".
2024-04-05 16:24:47 -07:00
Aidan
b4ae3f3097 feat: node list tables pagination (#733) 2024-04-05 16:24:47 -07:00
QianZhu
a25d10279c small bug fix for example code in SaaS JS doc (#770) 2024-04-05 16:24:47 -07:00
Chang She
5376970e87 doc(javascript): minor improvement on docs for working with tables (#736)
Closes #639 
Closes #638
2024-04-05 16:24:47 -07:00
Chang She
e929491187 chore(python): handle NaN input in fts ingestion (#763)
If the input text is None, Tantivy raises an error
complaining it cannot add a NoneType. We handle this
upstream so None's are not added to the document.
If all of the indexed fields are None then we skip
this document.
2024-04-05 16:24:47 -07:00
Bengsoon Chuah
e3ba5b2402 Add relevant imports for each step (#764)
I found that it was quite incoherent to have to read through the
documentation and having to search which submodule that each class
should be imported from.

For example, it is cumbersome to have to navigate to another
documentation page to find out that `EmbeddingFunctionRegistry` is from
`lancedb.embeddings`
2024-04-05 16:24:47 -07:00
QianZhu
25d1c62c3f SaaS JS API sdk doc (#740)
Co-authored-by: Aidan <64613310+aidangomar@users.noreply.github.com>
2024-04-05 16:24:47 -07:00
Chang She
cd791a366b feat(js): support list of string input (#755)
Add support for adding lists of string input (e.g., list of categorical
labels)

Follow-up items: #757 #758
2024-04-05 16:24:47 -07:00
Lance Release
24afea8c56 Updating package-lock.json 2024-04-05 16:24:47 -07:00
Lance Release
0d2dbf7d09 Updating package-lock.json 2024-04-05 16:24:47 -07:00
Lance Release
c629080d60 Bump version: 0.4.1 → 0.4.2 2024-04-05 16:24:47 -07:00
Lance Release
918a2a4405 [python] Bump version: 0.4.2 → 0.4.3 2024-04-05 16:24:47 -07:00
Lei Xu
56db257ea9 chore: bump pylance to 0.9.2 (#754) 2024-04-05 16:24:47 -07:00
Xin Hao
a63262cfda docs: fix link (#752) 2024-04-05 16:24:47 -07:00
Chang She
98af0ceec6 feat(python): first cut batch queries for remote api (#753)
issue separate requests under the hood and concatenate results
2024-04-05 16:24:47 -07:00
Lance Release
7778031b26 [python] Bump version: 0.4.1 → 0.4.2 2024-04-05 16:24:47 -07:00
Chang She
c97ae6b787 chore(python): update embedding API to use openai 1.6.1 (#751)
API has changed significantly, namely `openai.Embedding.create` no
longer exists.
https://github.com/openai/openai-python/discussions/742

Update the OpenAI embedding function and put a minimum on the openai sdk
version.
2024-04-05 16:24:47 -07:00
Chang She
7bac1131fb feat: add timezone handling for datetime in pydantic (#578)
If you add timezone information in the Field annotation for a datetime
then that will now be passed to the pyarrow data type.

I'm not sure how pyarrow enforces timezones, right now, it silently
coerces to the timezone given in the column regardless of whether the
input had the matching timezone or not. This is probably not the right
behavior. Though we could just make it so the user has to make the
pydantic model do the validation instead of doing that at the pyarrow
conversion layer.
2024-04-05 16:24:47 -07:00
Chang She
a0afa84786 feat(python): add post filtering for full text search (#739)
Closes #721 

fts will return results as a pyarrow table. Pyarrow tables has a
`filter` method but it does not take sql filter strings (only pyarrow
compute expressions). Instead, we do one of two things to support
`tbl.search("keywords").where("foo=5").limit(10).to_arrow()`:

Default path: If duckdb is available then use duckdb to execute the sql
filter string on the pyarrow table.
Backup path: Otherwise, write the pyarrow table to a lance dataset and
then do `to_table(filter=<filter>)`

Neither is ideal. 
Default path has two issues:
1. requires installing an extra library (duckdb)
2. duckdb mangles some fields (like fixed size list => list)

Backup path incurs a latency penalty (~20ms on ssd) to write the
resultset to disk.

In the short term, once #676 is addressed, we can write the dataset to
"memory://" instead of disk, this makes the post filter evaluate much
quicker (ETA next week).

In the longer term, we'd like to be able to evaluate the filter string
on the pyarrow Table directly, one possibility being that we use
Substrait to generate pyarrow compute expressions from sql string. Or if
there's enough progress on pyarrow, it could support Substrait
expressions directly (no ETA)

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2024-04-05 16:24:47 -07:00
Aidan
e74c203e6f fix: createIndex index cache size (#741) 2024-04-05 16:24:47 -07:00
Chang She
46bf5a1ed1 feat(python): support list of list fields from pydantic schema (#747)
For object detection, each row may correspond to an image and each image
can have multiple bounding boxes of x-y coordinates. This means that a
`bbox` field is potentially "list of list of float". This adds support
in our pydantic-pyarrow conversion for nested lists.
2024-04-05 16:24:47 -07:00
Lance Release
4891a7ae14 Updating package-lock.json 2024-04-05 16:24:47 -07:00
Lance Release
d1f24ba1dd [python] Bump version: 0.4.0 → 0.4.1 2024-04-05 16:24:47 -07:00
Lance Release
b56c54c990 Bump version: 0.4.0 → 0.4.1 2024-04-05 16:24:47 -07:00
elliottRobinson
3ab4b335c3 Update default_embedding_functions.md (#744)
Modify some grammar, punctuation, and spelling errors.
2024-04-05 16:24:47 -07:00
Chang She
009297e900 bug(python): fix path handling in windows (#724)
Use pathlib for local paths so that pathlib
can handle the correct separator on windows.

Closes #703

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2024-04-05 16:24:45 -07:00
Will Jones
3f3acb48c6 chore: add issue templates (#732)
This PR adds issue templates, which help two recurring issues:

* Users forget to tell us whether they are using the Node or Python SDK
* Issues don't get appropriate tags

This doesn't force the use of the templates. Because we set
`blank_issues_enabled: true`, users can still create a custom issue.
2024-04-05 16:24:30 -07:00
Will Jones
c3cda2c5d0 ci: check formatting and clippy (#730) 2024-04-05 16:24:30 -07:00
Will Jones
a975cc0a94 fix: prevent duplicate data in FTS index (#728)
This forces the user to replace the whole FTS directory when re-creating
the index, prevent duplicate data from being created. Previously, the
whole dataset was re-added to the existing index, duplicating existing
rows in the index.

This (in combination with lancedb/lance#1707) caused #726, since the
duplicate data emitted duplicate indices for `take()` and an upstream
issue caused those queries to fail.

This solution isn't ideal, since it makes the FTS index temporarily
unavailable while the index is built. In the future, we should have
multiple FTS index directories, which would allow atomic commits of new
indexes (as well as multiple indexes for different columns).

Fixes #498.
Fixes #726.

---------

Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
2024-04-05 16:24:30 -07:00
Will Jones
48a12e780c upgrade lance to v0.9.1 (#727)
This brings in some important bugfixes related to take and aarch64
Linux. See changes at:
https://github.com/lancedb/lance/releases/tag/v0.9.1
2024-04-05 16:24:30 -07:00
Chang She
b60a2177ae feat(python): support nested reference for fts (#723)
https://github.com/lancedb/lance/issues/1739

Support nested field reference in full text search

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2024-04-05 16:24:30 -07:00
Chang She
cc9d74e7a7 feat(python): add option to flatten output in to_pandas (#722)
Closes https://github.com/lancedb/lance/issues/1738

We add a `flatten` parameter to the signature of `to_pandas`. By default
this is None and does nothing.
If set to True or -1, then LanceDB will flatten structs before
converting to a pandas dataframe. All nested structs are also flattened.
If set to any positive integer, then LanceDB will flatten structs up to
the specified level of nesting.

---------

Co-authored-by: Weston Pace <weston.pace@gmail.com>
2024-04-05 16:24:30 -07:00
Aidan
3232b55218 feat: Node create index API (#720) 2024-04-05 16:24:30 -07:00
Aidan
ee2034db23 feat: Node Schema API (#717) 2024-04-05 16:24:30 -07:00
Lance Release
1dac34d2fa Updating package-lock.json 2024-04-05 16:24:30 -07:00
Lance Release
78b457f230 Updating package-lock.json 2024-04-05 16:24:30 -07:00
Lance Release
884ce655fe Bump version: 0.3.11 → 0.4.0 2024-04-05 16:24:30 -07:00
Lance Release
acbcbe6496 [python] Bump version: 0.3.6 → 0.4.0 2024-04-05 16:24:30 -07:00
Lei Xu
1d79e9168e chore: bump lance version to 0.9 (#715) 2024-04-05 16:24:30 -07:00
Lance Release
f46931228b Updating package-lock.json 2024-04-05 16:24:30 -07:00
Lance Release
811e604077 [python] Bump version: 0.3.5 → 0.3.6 2024-04-05 16:24:30 -07:00
Lance Release
072be50cb3 Updating package-lock.json 2024-04-05 16:24:30 -07:00
Lance Release
aca1b43d5e Bump version: 0.3.10 → 0.3.11 2024-04-05 16:24:30 -07:00
Bert
0b9c8ef88a chore: fix package lock (#711) 2024-04-05 16:24:30 -07:00
Bert
eb62ddfb0c implement update for remote clients (#706) 2024-04-05 16:24:30 -07:00
Rob Meng
32515ace74 feat: pass vector column name to remote backend (#710)
pass vector column name to remote as well.

`vector_column` is already part of `Query` just declearing it as part to
`remote.VectorQuery` as well
2024-04-05 16:24:30 -07:00
Rob Meng
82946f3623 feat: allow custom column name in query (#709) 2024-04-05 16:24:30 -07:00
Chang She
374a6f7e78 feat: support nested pydantic schema (#707) 2024-04-05 16:24:30 -07:00
Will Jones
e52f691420 ci: fix broken npm publication (#704)
Most recent release failed because `release` depends on `node-macos`,
but we renamed `node-macos` to `node-macos-{x86,arm64}`. This fixes that
by consolidating them back to a single `node-macos` job, which also has
the side effect of making the file shorter.
2024-04-05 16:24:30 -07:00
Lance Release
79aeb6bea6 Updating package-lock.json 2024-04-05 16:24:30 -07:00
Lance Release
7d70c9940c Bump version: 0.3.9 → 0.3.10 2024-04-05 16:24:30 -07:00
Lance Release
fc32f98c34 [python] Bump version: 0.3.4 → 0.3.5 2024-04-05 16:24:30 -07:00
Will Jones
9356c3b86a feat(python): add update query support for Python (#654)
Closes #69

Will not pass until https://github.com/lancedb/lance/pull/1585 is
released
2024-04-05 16:24:29 -07:00
Chang She
b02370cacd feat: LocalTable for vectordb now supports filters without vector search (#693)
Note this currently the filter/where is only implemented for LocalTable
so that it requires an explicit cast to "enable" (see new unit test).
The alternative is to add it to the Table interface, but since it's not
available on RemoteTable this may cause some user experience issues.
2024-04-05 16:24:15 -07:00
Bert
e479acc1bd Update in Node & Rust (#696)
Co-authored-by: Will Jones <willjones127@gmail.com>
2024-04-05 16:24:15 -07:00
Ayush Chaurasia
3413e79b0f chore(python): Reduce posthog event count (#661)
- Register open_table as event
- Because we're dropping 'seach' event currently, changed the name to
'search_table' and introduced throttling
- Throttled events will be counted once per time batch so that the user
is registered but event count doesn't go up by a lot
2024-04-05 16:24:14 -07:00
Ayush Chaurasia
91ff324c70 docs: Update roboflow tutorial position (#666) 2024-04-05 16:23:49 -07:00
QianZhu
480a630e19 Qian/minor fix doc (#695) 2024-04-05 16:23:49 -07:00
Kaushal Kumar Choudhary
07e33c2b2d docs: Add badges (#694)
adding some badges
added a gif to readme for the vectordb repo

---------

Co-authored-by: kaushal07wick <kaushalc6@gmail.com>
2024-04-05 16:23:49 -07:00
Chang She
fb1de97e83 chore: Use m1 runner for npm publish (#687)
We had some build issues with npm publish for cross-compiling arm64
macos on an x86 macos runner. Switching to m1 runner for now until
someone has time to deal with the feature flags.

follow-up tracked here: #688
2024-04-05 16:23:49 -07:00
QianZhu
bda0135cfc saas python sdk doc (#692)
<img width="256" alt="Screenshot 2023-12-07 at 11 55 41 AM"
src="https://github.com/lancedb/lancedb/assets/1305083/259bf234-9b3b-4c5d-af45-c7f3fada2cc7">
2024-04-05 16:23:49 -07:00
Chang She
287d85a3aa chore: update package lock (#689) 2024-04-05 16:23:49 -07:00
Chang She
7b92e796bb chore: set error handling to immediate (#686)
there's build failure for the rust artifact but the macos arm64 build
for npm publish still passed. So we had a silent failure for 2 releases.
By setting error to immediate this should cause fail immediately.
2024-04-05 16:23:49 -07:00
Lance Release
608e502de6 Updating package-lock.json 2024-04-05 16:23:49 -07:00
Lance Release
328880f057 Updating package-lock.json 2024-04-05 16:23:49 -07:00
Lance Release
93ade53515 Bump version: 0.3.8 → 0.3.9 2024-04-05 16:23:49 -07:00
Rob Meng
d74e188f80 fix: fix passing prefilter flag to remote client (#677)
was passing this at the wrong position
2024-04-05 16:23:49 -07:00
Rob Meng
59c25574f0 feat: enable prefilter in node js (#675)
enable prefiltering in node js, both native and remote
2024-04-05 16:23:49 -07:00
Rob Meng
c1c3083b74 chore: expose prefilter in lancedb rust (#674)
expose prefilter flag in vectordb rust code.
2024-04-05 16:23:49 -07:00
James
a94a033553 (docs):Add CLIP image embedding example (#660)
In this PR, I add a guide that lets you use Roboflow Inference to
calculate CLIP embeddings for use in LanceDB. This post was reviewed by
@AyushExel.
2024-04-05 16:23:49 -07:00
Bert
bbf34ae7f4 fix: python remote correct open_table error message (#659) 2024-04-05 16:23:49 -07:00
Lance Release
57dda15f49 Updating package-lock.json 2024-04-05 16:23:49 -07:00
Lance Release
8f82e4897c [python] Bump version: 0.3.3 → 0.3.4 2024-04-05 16:23:49 -07:00
Lance Release
8bd77d3c72 Updating package-lock.json 2024-04-05 16:23:49 -07:00
Lance Release
0273df4e04 Bump version: 0.3.7 → 0.3.8 2024-04-05 16:23:49 -07:00
Will Jones
6d76fe80b8 chore: upgrade lance to v0.8.17 (#656)
Readying for the next Lance release.
2024-04-05 16:23:49 -07:00
Rok Mihevc
78ab9068a8 feat(python): expose index cache size (#655)
This is to enable https://github.com/lancedb/lancedb/issues/641.
Should be merged after https://github.com/lancedb/lance/pull/1587 is
released.
2024-04-05 16:23:49 -07:00
Ayush Chaurasia
088792c821 [Docs]: Add Instructor embeddings and rate limit handler docs (#651) 2024-04-05 16:23:49 -07:00
Ayush Chaurasia
955c2a751a [Docs][SEO] Add sitemap and robots.txt (#645)
Sitemap improves SEO by ranking pages and tracking updates.
2024-04-05 16:23:49 -07:00
Aidan
775bee576c SaaS create_index API (#649) 2024-04-05 16:23:49 -07:00
Lance Release
f59af4df76 Updating package-lock.json 2024-04-05 16:23:49 -07:00
Lance Release
15cc5227c4 Updating package-lock.json 2024-04-05 16:23:49 -07:00
Lance Release
c008faddfd Bump version: 0.3.6 → 0.3.7 2024-04-05 16:23:49 -07:00
Bert
22fc0eaaf6 fix: node remote implement table.countRows (#648) 2024-04-05 16:23:49 -07:00
Rok Mihevc
32cb1b9ea4 feat: add RemoteTable.version in Python (#644)
Please note: this is not tested as we don't have a server here and
testing against a mock object wouldn't be that interesting.
2024-04-05 16:23:49 -07:00
Bert
49a366bc74 fix: node send db header for GET requests (#646) 2024-04-05 16:23:49 -07:00
Ayush Chaurasia
d59dbf8230 fix: Pydantic 1.x compat for weak_lru caching in embeddings API (#643)
Colab has pydantic 1.x by default and pydantic 1.x BaseModel objects
don't support weakref creation by default that we use to cache embedding
models
https://github.com/lancedb/lancedb/blob/main/python/lancedb/embeddings/utils.py#L206
. It needs to be added to slot.
2024-04-05 16:23:49 -07:00
Ayush Chaurasia
c0a49a9a5b Multi-task instructor model with quantization support & weak_lru cache for embedding function models (#612)
resolves #608
2024-04-05 16:23:49 -07:00
QianZhu
2f2964a645 fix saas open_table and table_names issues (#640)
- added check whether a table exists in SaaS open_table
- remove prefilter not supported warning in SaaS search
- fixed issues for SaaS table_names
2024-04-05 16:23:49 -07:00
Rob Meng
3d50c9cdfe upgrade lance to 0.8.14 (#636)
upgrade lance
2024-04-05 16:23:49 -07:00
Rob Meng
bdb3b46f7e skip missing file on mirrored dir when deleting (#635)
mirrored store is not garueeteed to have all the files. Ignore the ones
that doesn't exist.
2024-04-05 16:23:49 -07:00
Lei Xu
49306a99ba chore: apple silicon runner (#633)
Close #632
2024-04-05 16:23:49 -07:00
Lei Xu
86efd36689 chore: improve create_table API consistency between local and remote SDK (#627) 2024-04-05 16:23:47 -07:00
Bert
20ab85171b fix: node remote connection handles non http errors (#624)
https://github.com/lancedb/lancedb/issues/623

Fixes issue trying to print response status when using remote client. If
the error is not an HTTP error (e.g. dns/network failure), there won't
be a response.
2024-04-05 16:23:14 -07:00
Ayush Chaurasia
159ecbac5a Exponential standoff retry support for handling rate limited embedding functions (#614)
Users ingesting data using rate limited apis don't need to manually make
the process sleep for counter rate limits
resolves #579
2024-04-05 16:23:14 -07:00
Lance Release
148f6d7283 Updating package-lock.json 2024-04-05 16:23:14 -07:00
Lance Release
c604912139 Updating package-lock.json 2024-04-05 16:23:14 -07:00
Lance Release
178af0c2b8 Bump version: 0.3.5 → 0.3.6 2024-04-05 16:23:14 -07:00
Lance Release
c1b037f0a5 [python] Bump version: 0.3.2 → 0.3.3 2024-04-05 16:23:14 -07:00
Lei Xu
3855bdf986 chore: bump lance to 8.10 (#622) 2024-04-05 16:23:14 -07:00
Ayush Chaurasia
07ab4cd14c Disable posthog on docs & reduce sentry trace factor (#607)
- posthog charges per event and docs events are registered very
frequently. We can keep tracking them on GA
- Reduced sentry trace factor
2024-04-05 16:23:13 -07:00
Chang She
531c947fc1 doc: node sdk now supports windows (#616) 2024-04-05 16:22:59 -07:00
Bert
4e9aab9e8b ci: cancel in progress runs on new push (#620) 2024-04-05 16:22:59 -07:00
Bert
cd7a4dd251 fix!: sort table names (#619)
https://github.com/lancedb/lance/issues/1385
2024-04-05 16:22:59 -07:00
QianZhu
3c139c2ee5 Qian/query option doc (#615)
- API documentation improvement for queries (table.search)
- a small bug fix for the remote API on create_table

![image](https://github.com/lancedb/lancedb/assets/1305083/712e9bd3-deb8-4d81-8cd0-d8e98ef68f4e)

![image](https://github.com/lancedb/lancedb/assets/1305083/ba22125a-8c36-4e34-a07f-e39f0136e62c)
2024-04-05 16:22:59 -07:00
Will Jones
166b281d66 increment pylance (#618) 2024-04-05 16:22:59 -07:00
Bert
c9fee0faed added api docs for prefilter flag (#617)
Added the prefilter flag argument to the `LanceQueryBuilder.where`.

This should make it display here:

https://lancedb.github.io/lancedb/python/python/#lancedb.query.LanceQueryBuilder.select

And also in intellisense like this:
<img width="848" alt="image"
src="https://github.com/lancedb/lancedb/assets/5846846/e0c53f4f-96bc-411b-9159-680a6c4d0070">

Also adds some improved documentation about the `where` argument to this
method.

---------

Co-authored-by: Weston Pace <weston.pace@gmail.com>
2024-04-05 16:22:59 -07:00
Weston Pace
301e08f30e feat: allow prefiltering with index (#610)
Support for prefiltering with an index was added in lance version 0.8.7.
We can remove the lancedb check that prevents this. Closes #261
2024-04-05 16:22:59 -07:00
Lei Xu
b5e57ebce3 doc: add doc to use GPU for indexing (#611) 2024-04-05 16:22:59 -07:00
Lance Release
87364532bf Updating package-lock.json 2024-04-05 16:22:59 -07:00
Lance Release
c275ec006f Updating package-lock.json 2024-04-05 16:22:59 -07:00
Lance Release
53b0375e6d Bump version: 0.3.4 → 0.3.5 2024-04-05 16:22:59 -07:00
Bert
6881c50866 fix conv version (#605) 2024-04-05 16:22:59 -07:00
Lance Release
a174832d61 Updating package-lock.json 2024-04-05 16:22:59 -07:00
Lance Release
722cede32b Bump version: 0.3.3 → 0.3.4 2024-04-05 16:22:59 -07:00
Bert
4d086d63eb feat: added dataset stats api to node (#604) 2024-04-05 16:22:59 -07:00
Bert
f5e9c073f0 feat: added data stats apis (#596) 2024-04-05 16:22:59 -07:00
Rob Meng
178e016ff2 expose remap index api (#603)
expose index remap options in `compact_files`
2024-04-05 16:22:59 -07:00
Rob Meng
3c998b020f feat: expose optimize index api (#602)
expose `optimize_index` api.
2024-04-05 16:22:59 -07:00
Lance Release
a3c955070e [python] Bump version: 0.3.1 → 0.3.2 2024-04-05 16:22:59 -07:00
Bert
edeecd3d9f update lance to 0.8.7 (#598) 2024-04-05 16:22:59 -07:00
Chang She
2861f33982 fix(python): fix multiple embedding functions bug (#597)
Closes #594

The embedding functions are pydantic models so multiple instances with
the same parameters are considered ==, which means that if you have
multiple embedding columns it's possible for the embeddings to get
overwritten. Instead we use `is` instead of == to avoid this problem.

testing: modified unit test to include this case
2024-04-05 16:22:59 -07:00
Rob Meng
0036ca9de7 feat: add checkout method to table to reuse existing store and connections (#593)
Prior to this PR, to get a new version of a table, we need to re-open
the table. This has a few downsides w.r.t. performance:
* Object store is recreated, which takes time and throws away existing
warm connections
* Commit handler is thrown aways as well, which also may contain warm
connections
2024-04-05 16:22:59 -07:00
Rob Meng
2826bc7f1a feat: include manifest files in mirrow store (#589) 2024-04-05 16:22:59 -07:00
Will Jones
e37a0566e0 Revert "[python] Bump version: 0.3.2 → 0.3.3"
This reverts commit c30faf6083.
2024-04-05 16:22:59 -07:00
Will Jones
48999ffc27 [python] Bump version: 0.3.2 → 0.3.3 2024-04-05 16:22:59 -07:00
Ayush Chaurasia
0dc893993f [Docs]: Minor Fixes (#587)
* Filename typo
* Remove rick_morty csv as users won't really be able to use it.. We can
create a an executable colab and download it from a bucket or smth.
2024-04-05 16:22:59 -07:00
Ayush Chaurasia
12de39612e [Docs] Embeddings API: Add multi-lingual semantic search example (#582) 2024-04-05 16:22:59 -07:00
Ayush Chaurasia
05509bfb03 [Docs]Versioning docs (#586)
closes #564

---------

Co-authored-by: Chang She <chang@lancedb.com>
2024-04-05 16:22:59 -07:00
Lance Release
fa702f992e Updating package-lock.json 2024-04-05 16:22:59 -07:00
Lance Release
7f707205de Updating package-lock.json 2024-04-05 16:22:59 -07:00
Lance Release
2394ff14d0 Bump version: 0.3.2 → 0.3.3 2024-04-05 16:22:59 -07:00
Chang She
31334b05df chore: bump lance version in python/rust lancedb (#584)
To include latest v0.8.6

Co-authored-by: Chang She <chang@lancedb.com>
2024-04-05 16:22:59 -07:00
Ayush Chaurasia
942976f49f [Docs] Update embedding function docs (#581) 2024-04-05 16:22:59 -07:00
Ayush Chaurasia
507f6087c2 [Python]Embeddings API refactor (#580)
Sets things up for this -> https://github.com/lancedb/lancedb/issues/579
- Just separates out the registry/ingestion code from the function
implementation code
- adds a `get_registry` util
- package name "open-clip" -> "open-clip-torch"
2024-04-05 16:22:59 -07:00
Ayush Chaurasia
39c1cb87ad [Docs] Add posthog telemetry to docs (#577)
Allows creation of funnels and user journeys
2024-04-05 16:22:59 -07:00
QianZhu
6b0d1d6ec1 list table pagination draft (#574) 2024-04-05 16:22:59 -07:00
Prashanth Rao
d38e3d496f Add pyarrow date and timestamp type conversion from pydantic (#576) 2024-04-05 16:22:59 -07:00
Chang She
f4ac47e1b5 doc: fix broken link and add README (#573)
Fix broken link to embedding functions

testing: broken link was verified after local docs build to have been
repaired

---------

Co-authored-by: Chang She <chang@lancedb.com>
2024-04-05 16:22:59 -07:00
Lance Release
c94e428252 Updating package-lock.json 2024-04-05 16:22:59 -07:00
Lance Release
a09389459c Updating package-lock.json 2024-04-05 16:22:59 -07:00
Lance Release
4f62fb5ae8 Bump version: 0.3.1 → 0.3.2 2024-04-05 16:22:59 -07:00
Rob Meng
c14ccbd334 implement remote api calls for table mutation (#567)
Add more APIs to remote table for Node SDK
* `add` rows
* `overwrite` table with rows
* `create` table

This has been tested against dev stack
2024-04-05 16:22:59 -07:00
Rok Mihevc
b10afbeedc docs: show source of documented functions (#569) 2024-04-05 16:22:59 -07:00
Lei Xu
8dc10180b0 feat(python,js): deletion operation on remote tables (#568) 2024-04-05 16:22:59 -07:00
Rok Mihevc
377a564904 docs: switch python examples to be row based (#554) 2024-04-05 16:22:59 -07:00
Lei Xu
7b5bfadab2 chore: bump lance to 0.8.5 (#561)
Bump lance to 0.5.8
2024-04-05 16:22:59 -07:00
Ayush Chaurasia
1c42894918 [DOCS][PYTHON] Update embeddings API docs & Example (#516)
This PR adds an overview of embeddings docs:
- 2 ways to vectorize your data using lancedb - explicit & implicit
- explicit - manually vectorize your data using `wit_embedding` function
- Implicit - automatically vectorize your data as it comes by ingesting
your embedding function details as table metadata
- Multi-modal example w/ disappearing embedding function
2024-04-05 16:22:59 -07:00
Lance Release
2b341f3482 Updating package-lock.json 2024-04-05 16:22:59 -07:00
Lance Release
5027529663 Updating package-lock.json 2024-04-05 16:22:59 -07:00
Lance Release
3ed509f20c Bump version: 0.3.0 → 0.3.1 2024-04-05 16:22:59 -07:00
Lance Release
87c69e74fc [python] Bump version: 0.3.0 → 0.3.1 2024-04-05 16:22:59 -07:00
Ayush Chaurasia
0e9a7f0dc7 Add cohere embedding function (#550) 2024-04-05 16:22:59 -07:00
Will Jones
c07207c661 feat: cleanup and compaction (#518)
#488
2024-04-05 16:22:59 -07:00
Ayush Chaurasia
541b06664f [Docs] Improve visibility of table ops (#553)
A little verbose, but better than being non-discoverable 
![Screenshot from 2023-10-11
16-26-02](https://github.com/lancedb/lancedb/assets/15766192/9ba539a7-0cf8-4d9e-94e7-ce5d37c35df0)
2024-04-05 16:22:59 -07:00
Chang She
8469d010f8 feat: add to_list and to_pandas api's (#556)
Add `to_list` to return query results as list of python dict (so we're
not too pandas-centric). Closes #555

Add `to_pandas` API and add deprecation warning on `to_df`. Closes #545

Co-authored-by: Chang She <chang@lancedb.com>
2024-04-05 16:22:59 -07:00
Ankur Goyal
a737bbff19 Use query.limit(..) in README (#543)
If you run the README javascript example in typescript, it complains
that the type of limit is a function and cannot be set to a number.
2024-04-05 16:22:59 -07:00
79 changed files with 4072 additions and 1470 deletions

View File

@@ -1,5 +1,5 @@
[bumpversion]
current_version = 0.4.16
current_version = 0.4.17
commit = True
message = Bump version: {current_version} → {new_version}
tag = True

View File

@@ -8,6 +8,9 @@ env:
# This env var is used by Swatinem/rust-cache@v2 for the cache
# key, so we set it to make sure it is always consistent.
CARGO_TERM_COLOR: always
# Up-to-date compilers needed for fp16kernels.
CC: gcc-12
CXX: g++-12
jobs:
build:

View File

@@ -107,6 +107,7 @@ jobs:
AWS_ENDPOINT: http://localhost:4566
# this one is for dynamodb
DYNAMODB_ENDPOINT: http://localhost:4566
ALLOW_HTTP: true
steps:
- uses: actions/checkout@v4
with:

View File

@@ -28,6 +28,10 @@ jobs:
run:
shell: bash
working-directory: nodejs
env:
# Need up-to-date compilers for kernels
CC: gcc-12
CXX: g++-12
steps:
- uses: actions/checkout@v4
with:
@@ -81,7 +85,12 @@ jobs:
run: |
npm ci
npm run build
- name: Setup localstack
working-directory: .
run: docker compose up --detach --wait
- name: Test
env:
S3_TEST: "1"
run: npm run test
macos:
timeout-minutes: 30

View File

@@ -6,6 +6,8 @@ on:
jobs:
linux:
# Only runs on tags that matches the python-make-release action
if: startsWith(github.ref, 'refs/tags/python-v')
name: Python ${{ matrix.config.platform }} manylinux${{ matrix.config.manylinux }}
timeout-minutes: 60
strategy:
@@ -44,6 +46,8 @@ jobs:
token: ${{ secrets.LANCEDB_PYPI_API_TOKEN }}
repo: "pypi"
mac:
# Only runs on tags that matches the python-make-release action
if: startsWith(github.ref, 'refs/tags/python-v')
timeout-minutes: 60
runs-on: ${{ matrix.config.runner }}
strategy:
@@ -76,6 +80,8 @@ jobs:
token: ${{ secrets.LANCEDB_PYPI_API_TOKEN }}
repo: "pypi"
windows:
# Only runs on tags that matches the python-make-release action
if: startsWith(github.ref, 'refs/tags/python-v')
timeout-minutes: 60
runs-on: windows-latest
strategy:

View File

@@ -99,6 +99,8 @@ jobs:
workspaces: python
- uses: ./.github/workflows/build_linux_wheel
- uses: ./.github/workflows/run_tests
with:
integration: true
# Make sure wheels are not included in the Rust cache
- name: Delete wheels
run: rm -rf target/wheels
@@ -190,4 +192,4 @@ jobs:
pip install -e .[tests]
pip install tantivy
- name: Run tests
run: pytest -m "not slow" -x -v --durations=30 python/tests
run: pytest -m "not slow and not s3_test" -x -v --durations=30 python/tests

View File

@@ -5,6 +5,10 @@ inputs:
python-minor-version:
required: true
description: "8 9 10 11 12"
integration:
required: false
description: "Run integration tests"
default: "false"
runs:
using: "composite"
steps:
@@ -12,6 +16,16 @@ runs:
shell: bash
run: |
pip3 install $(ls target/wheels/lancedb-*.whl)[tests,dev]
- name: pytest
- name: Setup localstack for integration tests
if: ${{ inputs.integration == 'true' }}
shell: bash
working-directory: .
run: docker compose up --detach --wait
- name: pytest (with integration)
shell: bash
if: ${{ inputs.integration == 'true' }}
run: pytest -m "not slow" -x -v --durations=30 python/python/tests
- name: pytest (no integration tests)
shell: bash
if: ${{ inputs.integration != 'true' }}
run: pytest -m "not slow and not s3_test" -x -v --durations=30 python/python/tests

View File

@@ -76,6 +76,9 @@ jobs:
sudo apt install -y protobuf-compiler libssl-dev
- name: Build
run: cargo build --all-features
- name: Start S3 integration test environment
working-directory: .
run: docker compose up --detach --wait
- name: Run tests
run: cargo test --all-features
- name: Run examples
@@ -105,7 +108,8 @@ jobs:
- name: Build
run: cargo build --all-features
- name: Run tests
run: cargo test --all-features
# Run with everything except the integration tests.
run: cargo test --features remote,fp16kernels
windows:
runs-on: windows-2022
steps:

View File

@@ -14,10 +14,10 @@ keywords = ["lancedb", "lance", "database", "vector", "search"]
categories = ["database-implementations"]
[workspace.dependencies]
lance = { "version" = "=0.10.9", "features" = ["dynamodb"] }
lance-index = { "version" = "=0.10.9" }
lance-linalg = { "version" = "=0.10.9" }
lance-testing = { "version" = "=0.10.9" }
lance = { "version" = "=0.10.12", "features" = ["dynamodb"] }
lance-index = { "version" = "=0.10.12" }
lance-linalg = { "version" = "=0.10.12" }
lance-testing = { "version" = "=0.10.12" }
# Note that this one does not include pyarrow
arrow = { version = "50.0", optional = false }
arrow-array = "50.0"

View File

@@ -1,18 +1,18 @@
version: "3.9"
services:
localstack:
image: localstack/localstack:0.14
image: localstack/localstack:3.3
ports:
- 4566:4566
environment:
- SERVICES=s3,dynamodb
- SERVICES=s3,dynamodb,kms
- DEBUG=1
- LS_LOG=trace
- DOCKER_HOST=unix:///var/run/docker.sock
- AWS_ACCESS_KEY_ID=ACCESSKEY
- AWS_SECRET_ACCESS_KEY=SECRETKEY
healthcheck:
test: [ "CMD", "curl", "-f", "http://localhost:4566/health" ]
test: [ "CMD", "curl", "-s", "http://localhost:4566/_localstack/health" ]
interval: 5s
retries: 3
start_period: 10s

View File

@@ -57,16 +57,6 @@ plugins:
- https://arrow.apache.org/docs/objects.inv
- https://pandas.pydata.org/docs/objects.inv
- mkdocs-jupyter
- ultralytics:
verbose: True
enabled: True
default_image: "assets/lancedb_and_lance.png" # Default image for all pages
add_image: True # Automatically add meta image
add_keywords: True # Add page keywords in the header tag
add_share_buttons: True # Add social share buttons
add_authors: False # Display page authors
add_desc: False
add_dates: False
markdown_extensions:
- admonition
@@ -104,6 +94,14 @@ nav:
- Overview: hybrid_search/hybrid_search.md
- Comparing Rerankers: hybrid_search/eval.md
- Airbnb financial data example: notebooks/hybrid_search.ipynb
- Reranking:
- Quickstart: reranking/index.md
- Cohere Reranker: reranking/cohere.md
- Linear Combination Reranker: reranking/linear_combination.md
- Cross Encoder Reranker: reranking/cross_encoder.md
- ColBERT Reranker: reranking/colbert.md
- OpenAI Reranker: reranking/openai.md
- Building Custom Rerankers: reranking/custom_reranker.md
- Filtering: sql.md
- Versioning & Reproducibility: notebooks/reproducibility.ipynb
- Configuring Storage: guides/storage.md
@@ -120,9 +118,10 @@ nav:
- Pandas and PyArrow: python/pandas_and_pyarrow.md
- Polars: python/polars_arrow.md
- DuckDB: python/duckdb.md
- LangChain 🔗: https://python.langchain.com/en/latest/modules/indexes/vectorstores/examples/lancedb.html
- LangChain JS/TS 🔗: https://js.langchain.com/docs/modules/data_connection/vectorstores/integrations/lancedb
- LlamaIndex 🦙: https://gpt-index.readthedocs.io/en/latest/examples/vector_stores/LanceDBIndexDemo.html
- LangChain:
- LangChain 🔗: https://python.langchain.com/docs/integrations/vectorstores/lancedb/
- LangChain JS/TS 🔗: https://js.langchain.com/docs/integrations/vectorstores/lancedb
- LlamaIndex 🦙: https://docs.llamaindex.ai/en/stable/examples/vector_stores/LanceDBIndexDemo/
- Pydantic: python/pydantic.md
- Voxel51: integrations/voxel51.md
- PromptTools: integrations/prompttools.md
@@ -143,7 +142,6 @@ nav:
- TransformersJS Embedding Search: examples/transformerjs_embedding_search_nodejs.md
- 🦀 Rust:
- Overview: examples/examples_rust.md
- 🔧 CLI & Config: cli_config.md
- 💭 FAQs: faq.md
- ⚙️ API reference:
- 🐍 Python: python/python.md
@@ -171,6 +169,14 @@ nav:
- Overview: hybrid_search/hybrid_search.md
- Comparing Rerankers: hybrid_search/eval.md
- Airbnb financial data example: notebooks/hybrid_search.ipynb
- Reranking:
- Quickstart: reranking/index.md
- Cohere Reranker: reranking/cohere.md
- Linear Combination Reranker: reranking/linear_combination.md
- Cross Encoder Reranker: reranking/cross_encoder.md
- ColBERT Reranker: reranking/colbert.md
- OpenAI Reranker: reranking/openai.md
- Building Custom Rerankers: reranking/custom_reranker.md
- Filtering: sql.md
- Versioning & Reproducibility: notebooks/reproducibility.ipynb
- Configuring Storage: guides/storage.md
@@ -187,8 +193,8 @@ nav:
- Pandas and PyArrow: python/pandas_and_pyarrow.md
- Polars: python/polars_arrow.md
- DuckDB: python/duckdb.md
- LangChain 🦜️🔗↗: https://python.langchain.com/en/latest/modules/indexes/vectorstores/examples/lancedb.html
- LangChain.js 🦜️🔗↗: https://js.langchain.com/docs/modules/data_connection/vectorstores/integrations/lancedb
- LangChain 🦜️🔗↗: https://python.langchain.com/docs/integrations/vectorstores/lancedb
- LangChain.js 🦜️🔗↗: https://js.langchain.com/docs/integrations/vectorstores/lancedb
- LlamaIndex 🦙↗: https://gpt-index.readthedocs.io/en/latest/examples/vector_stores/LanceDBIndexDemo.html
- Pydantic: python/pydantic.md
- Voxel51: integrations/voxel51.md

View File

@@ -3,4 +3,3 @@ mkdocs-jupyter==0.24.1
mkdocs-material==9.5.3
mkdocstrings[python]==0.20.0
pydantic
mkdocs-ultralytics-plugin==0.0.44

View File

@@ -1,51 +0,0 @@
# CLI & Config
## LanceDB CLI
Once lanceDB is installed, you can access the CLI using `lancedb` command on the console.
```
lancedb
```
This lists out all the various command-line options available. You can get the usage or help for a particular command.
```
lancedb {command} --help
```
## LanceDB config
LanceDB uses a global config file to store certain settings. These settings are configurable using the lanceDB cli.
To view your config settings, you can use:
```
lancedb config
```
These config parameters can be tuned using the cli.
```
lancedb {config_name} --{argument}
```
## LanceDB Opt-in Diagnostics
When enabled, LanceDB will send anonymous events to help us improve LanceDB. These diagnostics are used only for error reporting and no data is collected. Error & stats allow us to automate certain aspects of bug reporting, prioritization of fixes and feature requests.
These diagnostics are opt-in and can be enabled or disabled using the `lancedb diagnostics` command. These are enabled by default.
### Get usage help
```
lancedb diagnostics --help
```
### Disable diagnostics
```
lancedb diagnostics --disabled
```
### Enable diagnostics
```
lancedb diagnostics --enabled
```

View File

@@ -154,9 +154,12 @@ Allows you to set parameters when registering a `sentence-transformers` object.
!!! note "BAAI Embeddings example"
Here is an example that uses BAAI embedding model from the HuggingFace Hub [supported models](https://huggingface.co/models?library=sentence-transformers)
```python
import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry
db = lancedb.connect("/tmp/db")
registry = EmbeddingFunctionRegistry.get_instance()
model = registry.get("sentence-transformers").create(name="BAAI/bge-small-en-v1.5", device="cpu")
model = get_registry.get("sentence-transformers").create(name="BAAI/bge-small-en-v1.5", device="cpu")
class Words(LanceModel):
text: str = model.SourceField()
@@ -165,7 +168,7 @@ Allows you to set parameters when registering a `sentence-transformers` object.
table = db.create_table("words", schema=Words)
table.add(
[
{"text": "hello world"}
{"text": "hello world"},
{"text": "goodbye world"}
]
)
@@ -177,6 +180,32 @@ Allows you to set parameters when registering a `sentence-transformers` object.
Visit sentence-transformers [HuggingFace HUB](https://huggingface.co/sentence-transformers) page for more information on the available models.
### Huggingface embedding models
We offer support for all huggingface models (which can be loaded via [transformers](https://huggingface.co/docs/transformers/en/index) library). The default model is `colbert-ir/colbertv2.0` which also has its own special callout - `registry.get("colbert")`
Example usage -
```python
import lancedb
import pandas as pd
from lancedb.embeddings import get_registry
from lancedb.pydantic import LanceModel, Vector
model = get_registry().get("huggingface").create(name='facebook/bart-base')
class TextModel(LanceModel):
text: str = model.SourceField()
vector: Vector(model.ndims()) = model.VectorField()
df = pd.DataFrame({"text": ["hi hello sayonara", "goodbye world"]})
table = db.create_table("greets", schema=Words)
table.add()
query = "old greeting"
actual = table.search(query).limit(1).to_pydantic(Words)[0]
print(actual.text)
```
### OpenAI embeddings
LanceDB registers the OpenAI embeddings function in the registry by default, as `openai`. Below are the parameters that you can customize when creating the instances:
@@ -187,18 +216,21 @@ LanceDB registers the OpenAI embeddings function in the registry by default, as
```python
import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry
db = lancedb.connect("/tmp/db")
registry = EmbeddingFunctionRegistry.get_instance()
func = registry.get("openai").create()
func = get_registry().get("openai").create(name="text-embedding-ada-002")
class Words(LanceModel):
text: str = func.SourceField()
vector: Vector(func.ndims()) = func.VectorField()
table = db.create_table("words", schema=Words)
table = db.create_table("words", schema=Words, mode="overwrite")
table.add(
[
{"text": "hello world"}
{"text": "hello world"},
{"text": "goodbye world"}
]
)
@@ -327,6 +359,10 @@ Supported parameters (to be passed in `create` method) are:
Usage Example:
```python
import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry
model = get_registry().get("bedrock-text").create()
class TextModel(LanceModel):
@@ -361,10 +397,12 @@ This embedding function supports ingesting images as both bytes and urls. You ca
LanceDB supports ingesting images directly from accessible links.
```python
import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry
db = lancedb.connect(tmp_path)
registry = EmbeddingFunctionRegistry.get_instance()
func = registry.get("open-clip").create()
func = get_registry.get("open-clip").create()
class Images(LanceModel):
label: str
@@ -439,9 +477,12 @@ This function is registered as `imagebind` and supports Audio, Video and Text mo
Below is an example demonstrating how the API works:
```python
import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry
db = lancedb.connect(tmp_path)
registry = EmbeddingFunctionRegistry.get_instance()
func = registry.get("imagebind").create()
func = get_registry.get("imagebind").create()
class ImageBindModel(LanceModel):
text: str

View File

@@ -12,3 +12,63 @@ LanceDB supports 3 methods of working with embeddings.
For python users, there is also a legacy [with_embeddings API](./legacy.md).
It is retained for compatibility and will be removed in a future version.
## Quickstart
To get started with embeddings, you can use the built-in embedding functions.
### OpenAI Embedding function
LanceDB registers the OpenAI embeddings function in the registry as `openai`. You can pass any supported model name to the `create`. By default it uses `"text-embedding-ada-002"`.
```python
import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry
db = lancedb.connect("/tmp/db")
func = get_registry().get("openai").create(name="text-embedding-ada-002")
class Words(LanceModel):
text: str = func.SourceField()
vector: Vector(func.ndims()) = func.VectorField()
table = db.create_table("words", schema=Words, mode="overwrite")
table.add(
[
{"text": "hello world"},
{"text": "goodbye world"}
]
)
query = "greetings"
actual = table.search(query).limit(1).to_pydantic(Words)[0]
print(actual.text)
```
### Sentence Transformers Embedding function
LanceDB registers the Sentence Transformers embeddings function in the registry as `sentence-transformers`. You can pass any supported model name to the `create`. By default it uses `"sentence-transformers/paraphrase-MiniLM-L6-v2"`.
```python
import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry
db = lancedb.connect("/tmp/db")
model = get_registry().get("sentence-transformers").create(name="BAAI/bge-small-en-v1.5", device="cpu")
class Words(LanceModel):
text: str = model.SourceField()
vector: Vector(model.ndims()) = model.VectorField()
table = db.create_table("words", schema=Words)
table.add(
[
{"text": "hello world"},
{"text": "goodbye world"}
]
)
query = "greetings"
actual = table.search(query).limit(1).to_pydantic(Words)[0]
print(actual.text)
```

View File

@@ -55,18 +55,139 @@ LanceDB OSS supports object stores such as AWS S3 (and compatible stores), Azure
const db = await lancedb.connect("az://bucket/path");
```
In most cases, when running in the respective cloud and permissions are set up correctly, no additional configuration is required. When running outside of the respective cloud, authentication credentials must be provided using environment variables. In general, these environment variables are the same as those used by the respective cloud SDKs. The sections below describe the environment variables that can be used to configure each object store.
In most cases, when running in the respective cloud and permissions are set up correctly, no additional configuration is required. When running outside of the respective cloud, authentication credentials must be provided. Credentials and other configuration options can be set in two ways: first, by setting environment variables. And second, by passing a `storage_options` object to the `connect` function. For example, to increase the request timeout to 60 seconds, you can set the `TIMEOUT` environment variable to `60s`:
LanceDB OSS uses the [object-store](https://docs.rs/object_store/latest/object_store/) Rust crate for object store access. There are general environment variables that can be used to configure the object store, such as the request timeout and proxy configuration. See the [object_store ClientConfigKey](https://docs.rs/object_store/latest/object_store/enum.ClientConfigKey.html) doc for available configuration options. The environment variables that can be set are the snake-cased versions of these variable names. For example, to set `ProxyUrl` use the environment variable `PROXY_URL`. (Don't let the Rust docs intimidate you! We link to them so you can see an up-to-date list of the available options.)
```bash
export TIMEOUT=60s
```
!!! note "`storage_options` availability"
The `storage_options` parameter is only available in Python *async* API and JavaScript API.
It is not yet supported in the Python synchronous API.
If you only want this to apply to one particular connection, you can pass the `storage_options` argument when opening the connection:
=== "Python"
```python
import lancedb
db = await lancedb.connect_async(
"s3://bucket/path",
storage_options={"timeout": "60s"}
)
```
=== "JavaScript"
```javascript
const lancedb = require("lancedb");
const db = await lancedb.connect("s3://bucket/path",
{storageOptions: {timeout: "60s"}});
```
Getting even more specific, you can set the `timeout` for only a particular table:
=== "Python"
<!-- skip-test -->
```python
import lancedb
db = await lancedb.connect_async("s3://bucket/path")
table = await db.create_table(
"table",
[{"a": 1, "b": 2}],
storage_options={"timeout": "60s"}
)
```
=== "JavaScript"
<!-- skip-test -->
```javascript
const lancedb = require("lancedb");
const db = await lancedb.connect("s3://bucket/path");
const table = db.createTable(
"table",
[{ a: 1, b: 2}],
{storageOptions: {timeout: "60s"}}
);
```
!!! info "Storage option casing"
The storage option keys are case-insensitive. So `connect_timeout` and `CONNECT_TIMEOUT` are the same setting. Usually lowercase is used in the `storage_options` argument and uppercase is used for environment variables. In the `lancedb` Node package, the keys can also be provided in `camelCase` capitalization. For example, `connectTimeout` is equivalent to `connect_timeout`.
### General configuration
There are several options that can be set for all object stores, mostly related to network client configuration.
<!-- from here: https://docs.rs/object_store/latest/object_store/enum.ClientConfigKey.html -->
| Key | Description |
|----------------------------|--------------------------------------------------------------------------------------------------|
| `allow_http` | Allow non-TLS, i.e. non-HTTPS connections. Default: `False`. |
| `allow_invalid_certificates`| Skip certificate validation on HTTPS connections. Default: `False`. |
| `connect_timeout` | Timeout for only the connect phase of a Client. Default: `5s`. |
| `timeout` | Timeout for the entire request, from connection until the response body has finished. Default: `30s`. |
| `user_agent` | User agent string to use in requests. |
| `proxy_url` | URL of a proxy server to use for requests. Default: `None`. |
| `proxy_ca_certificate` | PEM-formatted CA certificate for proxy connections. |
| `proxy_excludes` | List of hosts that bypass the proxy. This is a comma-separated list of domains and IP masks. Any subdomain of the provided domain will be bypassed. For example, `example.com, 192.168.1.0/24` would bypass `https://api.example.com`, `https://www.example.com`, and any IP in the range `192.168.1.0/24`. |
### AWS S3
To configure credentials for AWS S3, you can use the `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN` environment variables.
To configure credentials for AWS S3, you can use the `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN` keys. Region can also be set, but it is not mandatory when using AWS.
These can be set as environment variables or passed in the `storage_options` parameter:
=== "Python"
```python
import lancedb
db = await lancedb.connect_async(
"s3://bucket/path",
storage_options={
"aws_access_key_id": "my-access-key",
"aws_secret_access_key": "my-secret-key",
"aws_session_token": "my-session-token",
}
)
```
=== "JavaScript"
```javascript
const lancedb = require("lancedb");
const db = await lancedb.connect(
"s3://bucket/path",
{
storageOptions: {
awsAccessKeyId: "my-access-key",
awsSecretAccessKey: "my-secret-key",
awsSessionToken: "my-session-token",
}
}
);
```
Alternatively, if you are using AWS SSO, you can use the `AWS_PROFILE` and `AWS_DEFAULT_REGION` environment variables.
You can see a full list of environment variables [here](https://docs.rs/object_store/latest/object_store/aws/struct.AmazonS3Builder.html#method.from_env).
The following keys can be used as both environment variables or keys in the `storage_options` parameter:
| Key | Description |
|------------------------------------|------------------------------------------------------------------------------------------------------|
| `aws_region` / `region` | The AWS region the bucket is in. This can be automatically detected when using AWS S3, but must be specified for S3-compatible stores. |
| `aws_access_key_id` / `access_key_id` | The AWS access key ID to use. |
| `aws_secret_access_key` / `secret_access_key` | The AWS secret access key to use. |
| `aws_session_token` / `session_token` | The AWS session token to use. |
| `aws_endpoint` / `endpoint` | The endpoint to use for S3-compatible stores. |
| `aws_virtual_hosted_style_request` / `virtual_hosted_style_request` | Whether to use virtual hosted-style requests, where the bucket name is part of the endpoint. Meant to be used with `aws_endpoint`. Default: `False`. |
| `aws_s3_express` / `s3_express` | Whether to use S3 Express One Zone endpoints. Default: `False`. See more details below. |
| `aws_server_side_encryption` | The server-side encryption algorithm to use. Must be one of `"AES256"`, `"aws:kms"`, or `"aws:kms:dsse"`. Default: `None`. |
| `aws_sse_kms_key_id` | The KMS key ID to use for server-side encryption. If set, `aws_server_side_encryption` must be `"aws:kms"` or `"aws:kms:dsse"`. |
| `aws_sse_bucket_key_enabled` | Whether to use bucket keys for server-side encryption. |
!!! tip "Automatic cleanup for failed writes"
@@ -146,22 +267,174 @@ For **read-only access**, LanceDB will need a policy such as:
#### S3-compatible stores
LanceDB can also connect to S3-compatible stores, such as MinIO. To do so, you must specify two environment variables: `AWS_ENDPOINT` and `AWS_DEFAULT_REGION`. `AWS_ENDPOINT` should be the URL of the S3-compatible store, and `AWS_DEFAULT_REGION` should be the region to use.
LanceDB can also connect to S3-compatible stores, such as MinIO. To do so, you must specify both region and endpoint:
=== "Python"
```python
import lancedb
db = await lancedb.connect_async(
"s3://bucket/path",
storage_options={
"region": "us-east-1",
"endpoint": "http://minio:9000",
}
)
```
=== "JavaScript"
```javascript
const lancedb = require("lancedb");
const db = await lancedb.connect(
"s3://bucket/path",
{
storageOptions: {
region: "us-east-1",
endpoint: "http://minio:9000",
}
}
);
```
This can also be done with the ``AWS_ENDPOINT`` and ``AWS_DEFAULT_REGION`` environment variables.
#### S3 Express
LanceDB supports [S3 Express One Zone](https://aws.amazon.com/s3/storage-classes/express-one-zone/) endpoints, but requires additional configuration. Also, S3 Express endpoints only support connecting from an EC2 instance within the same region.
To configure LanceDB to use an S3 Express endpoint, you must set the storage option `s3_express`. The bucket name in your table URI should **include the suffix**.
=== "Python"
```python
import lancedb
db = await lancedb.connect_async(
"s3://my-bucket--use1-az4--x-s3/path",
storage_options={
"region": "us-east-1",
"s3_express": "true",
}
)
```
=== "JavaScript"
```javascript
const lancedb = require("lancedb");
const db = await lancedb.connect(
"s3://my-bucket--use1-az4--x-s3/path",
{
storageOptions: {
region: "us-east-1",
s3Express: "true",
}
}
);
```
<!-- TODO: we should also document the use of S3 Express once we fully support it -->
### Google Cloud Storage
GCS credentials are configured by setting the `GOOGLE_SERVICE_ACCOUNT` environment variable to the path of a JSON file containing the service account credentials. There are several aliases for this environment variable, documented [here](https://docs.rs/object_store/latest/object_store/gcp/struct.GoogleCloudStorageBuilder.html#method.from_env).
GCS credentials are configured by setting the `GOOGLE_SERVICE_ACCOUNT` environment variable to the path of a JSON file containing the service account credentials. Alternatively, you can pass the path to the JSON file in the `storage_options`:
=== "Python"
<!-- skip-test -->
```python
import lancedb
db = await lancedb.connect_async(
"gs://my-bucket/my-database",
storage_options={
"service_account": "path/to/service-account.json",
}
)
```
=== "JavaScript"
```javascript
const lancedb = require("lancedb");
const db = await lancedb.connect(
"gs://my-bucket/my-database",
{
storageOptions: {
serviceAccount: "path/to/service-account.json",
}
}
);
```
!!! info "HTTP/2 support"
By default, GCS uses HTTP/1 for communication, as opposed to HTTP/2. This improves maximum throughput significantly. However, if you wish to use HTTP/2 for some reason, you can set the environment variable `HTTP1_ONLY` to `false`.
The following keys can be used as both environment variables or keys in the `storage_options` parameter:
<!-- source: https://docs.rs/object_store/latest/object_store/gcp/enum.GoogleConfigKey.html -->
| Key | Description |
|---------------------------------------|----------------------------------------------|
| ``google_service_account`` / `service_account` | Path to the service account JSON file. |
| ``google_service_account_key`` | The serialized service account key. |
| ``google_application_credentials`` | Path to the application credentials. |
### Azure Blob Storage
Azure Blob Storage credentials can be configured by setting the `AZURE_STORAGE_ACCOUNT_NAME` and ``AZURE_STORAGE_ACCOUNT_KEY`` environment variables. The full list of environment variables that can be set are documented [here](https://docs.rs/object_store/latest/object_store/azure/struct.MicrosoftAzureBuilder.html#method.from_env).
Azure Blob Storage credentials can be configured by setting the `AZURE_STORAGE_ACCOUNT_NAME`and `AZURE_STORAGE_ACCOUNT_KEY` environment variables. Alternatively, you can pass the account name and key in the `storage_options` parameter:
=== "Python"
<!-- skip-test -->
```python
import lancedb
db = await lancedb.connect_async(
"az://my-container/my-database",
storage_options={
account_name: "some-account",
account_key: "some-key",
}
)
```
=== "JavaScript"
```javascript
const lancedb = require("lancedb");
const db = await lancedb.connect(
"az://my-container/my-database",
{
storageOptions: {
accountName: "some-account",
accountKey: "some-key",
}
}
);
```
These keys can be used as both environment variables or keys in the `storage_options` parameter:
<!-- source: https://docs.rs/object_store/latest/object_store/azure/enum.AzureConfigKey.html -->
| Key | Description |
|---------------------------------------|--------------------------------------------------------------------------------------------------|
| ``azure_storage_account_name`` | The name of the azure storage account. |
| ``azure_storage_account_key`` | The serialized service account key. |
| ``azure_client_id`` | Service principal client id for authorizing requests. |
| ``azure_client_secret`` | Service principal client secret for authorizing requests. |
| ``azure_tenant_id`` | Tenant id used in oauth flows. |
| ``azure_storage_sas_key`` | Shared access signature. The signature is expected to be percent-encoded, much like they are provided in the azure storage explorer or azure portal. |
| ``azure_storage_token`` | Bearer token. |
| ``azure_storage_use_emulator`` | Use object store with azurite storage emulator. |
| ``azure_endpoint`` | Override the endpoint used to communicate with blob storage. |
| ``azure_use_fabric_endpoint`` | Use object store with url scheme account.dfs.fabric.microsoft.com. |
| ``azure_msi_endpoint`` | Endpoint to request a imds managed identity token. |
| ``azure_object_id`` | Object id for use with managed identity authentication. |
| ``azure_msi_resource_id`` | Msi resource id for use with managed identity authentication. |
| ``azure_federated_token_file`` | File containing token for Azure AD workload identity federation. |
| ``azure_use_azure_cli`` | Use azure cli for acquiring access token. |
| ``azure_disable_tagging`` | Disables tagging objects. This can be desirable if not supported by the backing store. |
<!-- TODO: demonstrate how to configure networked file systems for optimal performance -->

View File

@@ -142,6 +142,7 @@ rules are as follows:
**`Example`**
```ts
import { fromTableToBuffer, makeArrowTable } from "../arrow";
import { Field, FixedSizeList, Float16, Float32, Int32, Schema } from "apache-arrow";

View File

@@ -24,7 +24,8 @@ data = [
table = db.create_table("pd_table", data=data)
```
To query the table, first call `to_lance` to convert the table to a "dataset", which is an object that can be queried by DuckDB. Then all you need to do is reference that dataset by the same name in your SQL query.
The `to_lance` method converts the LanceDB table to a `LanceDataset`, which is accessible to DuckDB through the Arrow compatibility layer.
To query the resulting Lance dataset in DuckDB, all you need to do is reference the dataset by the same name in your SQL query.
```python
import duckdb

View File

@@ -0,0 +1,75 @@
# Cohere Reranker
This re-ranker uses the [Cohere](https://cohere.ai/) API to rerank the search results. You can use this re-ranker by passing `CohereReranker()` to the `rerank()` method. Note that you'll either need to set the `COHERE_API_KEY` environment variable or pass the `api_key` argument to use this re-ranker.
!!! note
Supported Query Types: Hybrid, Vector, FTS
```python
import numpy
import lancedb
from lancedb.embeddings import get_registry
from lancedb.pydantic import LanceModel, Vector
from lancedb.rerankers import CohereReranker
embedder = get_registry().get("sentence-transformers").create()
db = lancedb.connect("~/.lancedb")
class Schema(LanceModel):
text: str = embedder.SourceField()
vector: Vector(embedder.ndims()) = embedder.VectorField()
data = [
{"text": "hello world"},
{"text": "goodbye world"}
]
tbl = db.create_table("test", schema=Schema, mode="overwrite")
tbl.add(data)
reranker = CohereReranker(api_key="key")
# Run vector search with a reranker
result = tbl.search("hello").rerank(reranker=reranker).to_list()
# Run FTS search with a reranker
result = tbl.search("hello", query_type="fts").rerank(reranker=reranker).to_list()
# Run hybrid search with a reranker
tbl.create_fts_index("text", replace=True)
result = tbl.search("hello", query_type="hybrid").rerank(reranker=reranker).to_list()
```
Accepted Arguments
----------------
| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| `model_name` | `str` | `"rerank-english-v2.0"` | The name of the reranker model to use. Available cohere models are: rerank-english-v2.0, rerank-multilingual-v2.0 |
| `column` | `str` | `"text"` | The name of the column to use as input to the cross encoder model. |
| `top_n` | `str` | `None` | The number of results to return. If None, will return all results. |
| `api_key` | `str` | `None` | The API key for the Cohere API. If not provided, the `COHERE_API_KEY` environment variable is used. |
| `return_score` | str | `"relevance"` | Options are "relevance" or "all". The type of score to return. If "relevance", will return only the `_relevance_score. If "all" is supported, will return relevance score along with the vector and/or fts scores depending on query type |
## Supported Scores for each query type
You can specify the type of scores you want the reranker to return. The following are the supported scores for each query type:
### Hybrid Search
|`return_score`| Status | Description |
| --- | --- | --- |
| `relevance` | ✅ Supported | Returns only have the `_relevance_score` column |
| `all` | ❌ Not Supported | Returns have vector(`_distance`) and FTS(`score`) along with Hybrid Search score(`_relevance_score`) |
### Vector Search
|`return_score`| Status | Description |
| --- | --- | --- |
| `relevance` | ✅ Supported | Returns only have the `_relevance_score` column |
| `all` | ✅ Supported | Returns have vector(`_distance`) along with Hybrid Search score(`_relevance_score`) |
### FTS Search
|`return_score`| Status | Description |
| --- | --- | --- |
| `relevance` | ✅ Supported | Returns only have the `_relevance_score` column |
| `all` | ✅ Supported | Returns have FTS(`score`) along with Hybrid Search score(`_relevance_score`) |

View File

@@ -0,0 +1,71 @@
# ColBERT Reranker
This re-ranker uses ColBERT model to rerank the search results. You can use this re-ranker by passing `ColbertReranker()` to the `rerank()` method.
!!! note
Supported Query Types: Hybrid, Vector, FTS
```python
import numpy
import lancedb
from lancedb.embeddings import get_registry
from lancedb.pydantic import LanceModel, Vector
from lancedb.rerankers import ColbertReranker
embedder = get_registry().get("sentence-transformers").create()
db = lancedb.connect("~/.lancedb")
class Schema(LanceModel):
text: str = embedder.SourceField()
vector: Vector(embedder.ndims()) = embedder.VectorField()
data = [
{"text": "hello world"},
{"text": "goodbye world"}
]
tbl = db.create_table("test", schema=Schema, mode="overwrite")
tbl.add(data)
reranker = ColbertReranker()
# Run vector search with a reranker
result = tbl.search("hello").rerank(reranker=reranker).to_list()
# Run FTS search with a reranker
result = tbl.search("hello", query_type="fts").rerank(reranker=reranker).to_list()
# Run hybrid search with a reranker
tbl.create_fts_index("text", replace=True)
result = tbl.search("hello", query_type="hybrid").rerank(reranker=reranker).to_list()
```
Accepted Arguments
----------------
| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| `model_name` | `str` | `"colbert-ir/colbertv2.0"` | The name of the reranker model to use.|
| `column` | `str` | `"text"` | The name of the column to use as input to the cross encoder model. |
| `device` | `str` | `None` | The device to use for the cross encoder model. If None, will use "cuda" if available, otherwise "cpu". |
| `return_score` | str | `"relevance"` | Options are "relevance" or "all". The type of score to return. If "relevance", will return only the `_relevance_score. If "all" is supported, will return relevance score along with the vector and/or fts scores depending on query type |
## Supported Scores for each query type
You can specify the type of scores you want the reranker to return. The following are the supported scores for each query type:
### Hybrid Search
|`return_score`| Status | Description |
| --- | --- | --- |
| `relevance` | ✅ Supported | Returns only have the `_relevance_score` column |
| `all` | ❌ Not Supported | Returns have vector(`_distance`) and FTS(`score`) along with Hybrid Search score(`_relevance_score`) |
### Vector Search
|`return_score`| Status | Description |
| --- | --- | --- |
| `relevance` | ✅ Supported | Returns only have the `_relevance_score` column |
| `all` | ✅ Supported | Returns have vector(`_distance`) along with Hybrid Search score(`_relevance_score`) |
### FTS Search
|`return_score`| Status | Description |
| --- | --- | --- |
| `relevance` | ✅ Supported | Returns only have the `_relevance_score` column |
| `all` | ✅ Supported | Returns have FTS(`score`) along with Hybrid Search score(`_relevance_score`) |

View File

@@ -0,0 +1,70 @@
# Cross Encoder Reranker
This re-ranker uses Cross Encoder models from sentence-transformers to rerank the search results. You can use this re-ranker by passing `CrossEncoderReranker()` to the `rerank()` method.
!!! note
Supported Query Types: Hybrid, Vector, FTS
```python
import numpy
import lancedb
from lancedb.embeddings import get_registry
from lancedb.pydantic import LanceModel, Vector
from lancedb.rerankers import CrossEncoderReranker
embedder = get_registry().get("sentence-transformers").create()
db = lancedb.connect("~/.lancedb")
class Schema(LanceModel):
text: str = embedder.SourceField()
vector: Vector(embedder.ndims()) = embedder.VectorField()
data = [
{"text": "hello world"},
{"text": "goodbye world"}
]
tbl = db.create_table("test", schema=Schema, mode="overwrite")
tbl.add(data)
reranker = CrossEncoderReranker()
# Run vector search with a reranker
result = tbl.search("hello").rerank(reranker=reranker).to_list()
# Run FTS search with a reranker
result = tbl.search("hello", query_type="fts").rerank(reranker=reranker).to_list()
# Run hybrid search with a reranker
tbl.create_fts_index("text", replace=True)
result = tbl.search("hello", query_type="hybrid").rerank(reranker=reranker).to_list()
```
Accepted Arguments
----------------
| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| `model_name` | `str` | `""cross-encoder/ms-marco-TinyBERT-L-6"` | The name of the reranker model to use.|
| `column` | `str` | `"text"` | The name of the column to use as input to the cross encoder model. |
| `device` | `str` | `None` | The device to use for the cross encoder model. If None, will use "cuda" if available, otherwise "cpu". |
| `return_score` | str | `"relevance"` | Options are "relevance" or "all". The type of score to return. If "relevance", will return only the `_relevance_score. If "all" is supported, will return relevance score along with the vector and/or fts scores depending on query type |
## Supported Scores for each query type
You can specify the type of scores you want the reranker to return. The following are the supported scores for each query type:
### Hybrid Search
|`return_score`| Status | Description |
| --- | --- | --- |
| `relevance` | ✅ Supported | Returns only have the `_relevance_score` column |
| `all` | ❌ Not Supported | Returns have vector(`_distance`) and FTS(`score`) along with Hybrid Search score(`_relevance_score`) |
### Vector Search
|`return_score`| Status | Description |
| --- | --- | --- |
| `relevance` | ✅ Supported | Returns only have the `_relevance_score` column |
| `all` | ✅ Supported | Returns have vector(`_distance`) along with Hybrid Search score(`_relevance_score`) |
### FTS Search
|`return_score`| Status | Description |
| --- | --- | --- |
| `relevance` | ✅ Supported | Returns only have the `_relevance_score` column |
| `all` | ✅ Supported | Returns have FTS(`score`) along with Hybrid Search score(`_relevance_score`) |

View File

@@ -0,0 +1,88 @@
## Building Custom Rerankers
You can build your own custom reranker by subclassing the `Reranker` class and implementing the `rerank_hybrid()` method. Optionally, you can also implement the `rerank_vector()` and `rerank_fts()` methods if you want to support reranking for vector and FTS search separately.
Here's an example of a custom reranker that combines the results of semantic and full-text search using a linear combination of the scores.
The `Reranker` base interface comes with a `merge_results()` method that can be used to combine the results of semantic and full-text search. This is a vanilla merging algorithm that simply concatenates the results and removes the duplicates without taking the scores into consideration. It only keeps the first copy of the row encountered. This works well in cases that don't require the scores of semantic and full-text search to combine the results. If you want to use the scores or want to support `return_score="all"`, you'll need to implement your own merging algorithm.
```python
from lancedb.rerankers import Reranker
import pyarrow as pa
class MyReranker(Reranker):
def __init__(self, param1, param2, ..., return_score="relevance"):
super().__init__(return_score)
self.param1 = param1
self.param2 = param2
def rerank_hybrid(self, query: str, vector_results: pa.Table, fts_results: pa.Table):
# Use the built-in merging function
combined_result = self.merge_results(vector_results, fts_results)
# Do something with the combined results
# ...
# Return the combined results
return combined_result
def rerank_vector(self, query: str, vector_results: pa.Table):
# Do something with the vector results
# ...
# Return the vector results
return vector_results
def rerank_fts(self, query: str, fts_results: pa.Table):
# Do something with the FTS results
# ...
# Return the FTS results
return fts_results
```
### Example of a Custom Reranker
For the sake of simplicity let's build custom reranker that just enchances the Cohere Reranker by accepting a filter query, and accept other CohereReranker params as kwags.
```python
from typing import List, Union
import pandas as pd
from lancedb.rerankers import CohereReranker
class ModifiedCohereReranker(CohereReranker):
def __init__(self, filters: Union[str, List[str]], **kwargs):
super().__init__(**kwargs)
filters = filters if isinstance(filters, list) else [filters]
self.filters = filters
def rerank_hybrid(self, query: str, vector_results: pa.Table, fts_results: pa.Table)-> pa.Table:
combined_result = super().rerank_hybrid(query, vector_results, fts_results)
df = combined_result.to_pandas()
for filter in self.filters:
df = df.query("not text.str.contains(@filter)")
return pa.Table.from_pandas(df)
def rerank_vector(self, query: str, vector_results: pa.Table)-> pa.Table:
vector_results = super().rerank_vector(query, vector_results)
df = vector_results.to_pandas()
for filter in self.filters:
df = df.query("not text.str.contains(@filter)")
return pa.Table.from_pandas(df)
def rerank_fts(self, query: str, fts_results: pa.Table)-> pa.Table:
fts_results = super().rerank_fts(query, fts_results)
df = fts_results.to_pandas()
for filter in self.filters:
df = df.query("not text.str.contains(@filter)")
return pa.Table.from_pandas(df)
```
!!! tip
The `vector_results` and `fts_results` are pyarrow tables. Lean more about pyarrow tables [here](https://arrow.apache.org/docs/python). It can be convered to other data types like pandas dataframe, pydict, pylist etc.
For example, You can convert them to pandas dataframes using `to_pandas()` method and perform any operations you want. After you are done, you can convert the dataframe back to pyarrow table using `pa.Table.from_pandas()` method and return it.

View File

@@ -0,0 +1,60 @@
Reranking is the process of reordering a list of items based on some criteria. In the context of search, reranking is used to reorder the search results returned by a search engine based on some criteria. This can be useful when the initial ranking of the search results is not satisfactory or when the user has provided additional information that can be used to improve the ranking of the search results.
LanceDB comes with some built-in rerankers. Some of the rerankers that are available in LanceDB are:
| Reranker | Description | Supported Query Types |
| --- | --- | --- |
| `LinearCombinationReranker` | Reranks search results based on a linear combination of FTS and vector search scores | Hybrid |
| `CohereReranker` | Uses cohere Reranker API to rerank results | Vector, FTS, Hybrid |
| `CrossEncoderReranker` | Uses a cross-encoder model to rerank search results | Vector, FTS, Hybrid |
| `ColbertReranker` | Uses a colbert model to rerank search results | Vector, FTS, Hybrid |
| `OpenaiReranker`(Experimental) | Uses OpenAI's chat model to rerank search results | Vector, FTS, Hybrid |
## Using a Reranker
Using rerankers is optional for vector and FTS. However, for hybrid search, rerankers are required. To use a reranker, you need to create an instance of the reranker and pass it to the `rerank` method of the query builder.
```python
import numpy
import lancedb
from lancedb.embeddings import get_registry
from lancedb.pydantic import LanceModel, Vector
from lancedb.rerankers import CohereReranker
embedder = get_registry().get("sentence-transformers").create()
db = lancedb.connect("~/.lancedb")
class Schema(LanceModel):
text: str = embedder.SourceField()
vector: Vector(embedder.ndims()) = embedder.VectorField()
data = [
{"text": "hello world"},
{"text": "goodbye world"}
]
tbl = db.create_table("test", data)
reranker = CohereReranker(api_key="your_api_key")
# Run vector search with a reranker
result = tbl.query("hello").rerank(reranker).to_list()
# Run FTS search with a reranker
result = tbl.query("hello", query_type="fts").rerank(reranker).to_list()
# Run hybrid search with a reranker
tbl.create_fts_index("text")
result = tbl.query("hello", query_type="hybrid").rerank(reranker).to_list()
```
## Available Rerankers
LanceDB comes with some built-in rerankers. Here are some of the rerankers that are available in LanceDB:
- [Cohere Reranker](./cohere.md)
- [Cross Encoder Reranker](./cross_encoder.md)
- [ColBERT Reranker](./colbert.md)
- [OpenAI Reranker](./openai.md)
- [Linear Combination Reranker](./linear_combination.md)
## Creating Custom Rerankers
LanceDB also you to create custom rerankers by extending the base `Reranker` class. The custom reranker should implement the `rerank` method that takes a list of search results and returns a reranked list of search results. This is covered in more detail in the [Creating Custom Rerankers](./custom_reranker.md) section.

View File

@@ -0,0 +1,52 @@
# Linear Combination Reranker
This is the default re-ranker used by LanceDB hybrid search. It combines the results of semantic and full-text search using a linear combination of the scores. The weights for the linear combination can be specified. It defaults to 0.7, i.e, 70% weight for semantic search and 30% weight for full-text search.
!!! note
Supported Query Types: Hybrid
```python
import numpy
import lancedb
from lancedb.embeddings import get_registry
from lancedb.pydantic import LanceModel, Vector
from lancedb.rerankers import LinearCombinationReranker
embedder = get_registry().get("sentence-transformers").create()
db = lancedb.connect("~/.lancedb")
class Schema(LanceModel):
text: str = embedder.SourceField()
vector: Vector(embedder.ndims()) = embedder.VectorField()
data = [
{"text": "hello world"},
{"text": "goodbye world"}
]
tbl = db.create_table("test", schema=Schema, mode="overwrite")
tbl.add(data)
reranker = LinearCombinationReranker()
# Run hybrid search with a reranker
tbl.create_fts_index("text", replace=True)
result = tbl.search("hello", query_type="hybrid").rerank(reranker=reranker).to_list()
```
Accepted Arguments
----------------
| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| `weight` | `float` | `0.7` | The weight to use for the semantic search score. The weight for the full-text search score is `1 - weights`. |
| `return_score` | str | `"relevance"` | Options are "relevance" or "all". The type of score to return. If "relevance", will return only the `_relevance_score. If "all", will return all scores from the vector and FTS search along with the relevance score. |
## Supported Scores for each query type
You can specify the type of scores you want the reranker to return. The following are the supported scores for each query type:
### Hybrid Search
|`return_score`| Status | Description |
| --- | --- | --- |
| `relevance` | ✅ Supported | Returns only have the `_relevance_score` column |
| `all` | ✅ Supported | Returns have vector(`_distance`) and FTS(`score`) along with Hybrid Search score(`_distance`) |

View File

@@ -0,0 +1,73 @@
# OpenAI Reranker (Experimental)
This re-ranker uses OpenAI chat model to rerank the search results. You can use this re-ranker by passing `OpenAI()` to the `rerank()` method.
!!! note
Supported Query Types: Hybrid, Vector, FTS
!!! warning
This re-ranker is experimental. OpenAI doesn't have a dedicated reranking model, so we are using the chat model for reranking.
```python
import numpy
import lancedb
from lancedb.embeddings import get_registry
from lancedb.pydantic import LanceModel, Vector
from lancedb.rerankers import OpenaiReranker
embedder = get_registry().get("sentence-transformers").create()
db = lancedb.connect("~/.lancedb")
class Schema(LanceModel):
text: str = embedder.SourceField()
vector: Vector(embedder.ndims()) = embedder.VectorField()
data = [
{"text": "hello world"},
{"text": "goodbye world"}
]
tbl = db.create_table("test", schema=Schema, mode="overwrite")
tbl.add(data)
reranker = OpenaiReranker()
# Run vector search with a reranker
result = tbl.search("hello").rerank(reranker=reranker).to_list()
# Run FTS search with a reranker
result = tbl.search("hello", query_type="fts").rerank(reranker=reranker).to_list()
# Run hybrid search with a reranker
tbl.create_fts_index("text", replace=True)
result = tbl.search("hello", query_type="hybrid").rerank(reranker=reranker).to_list()
```
Accepted Arguments
----------------
| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| `model_name` | `str` | `"gpt-4-turbo-preview"` | The name of the reranker model to use.|
| `column` | `str` | `"text"` | The name of the column to use as input to the cross encoder model. |
| `return_score` | str | `"relevance"` | Options are "relevance" or "all". The type of score to return. If "relevance", will return only the `_relevance_score. If "all" is supported, will return relevance score along with the vector and/or fts scores depending on query type |
| `api_key` | str | `None` | The API key to use. If None, will use the OPENAI_API_KEY environment variable.
## Supported Scores for each query type
You can specify the type of scores you want the reranker to return. The following are the supported scores for each query type:
### Hybrid Search
|`return_score`| Status | Description |
| --- | --- | --- |
| `relevance` | ✅ Supported | Returns only have the `_relevance_score` column |
| `all` | ❌ Not Supported | Returns have vector(`_distance`) and FTS(`score`) along with Hybrid Search score(`_relevance_score`) |
### Vector Search
|`return_score`| Status | Description |
| --- | --- | --- |
| `relevance` | ✅ Supported | Returns only have the `_relevance_score` column |
| `all` | ✅ Supported | Returns have vector(`_distance`) along with Hybrid Search score(`_relevance_score`) |
### FTS Search
|`return_score`| Status | Description |
| --- | --- | --- |
| `relevance` | ✅ Supported | Returns only have the `_relevance_score` column |
| `all` | ✅ Supported | Returns have FTS(`score`) along with Hybrid Search score(`_relevance_score`) |

View File

@@ -1,5 +1,5 @@
import glob
from typing import Iterator
from typing import Iterator, List
from pathlib import Path
glob_string = "../src/**/*.md"
@@ -15,6 +15,7 @@ excluded_globs = [
"../src/ann_indexes.md",
"../src/basic.md",
"../src/hybrid_search/hybrid_search.md",
"../src/reranking/*.md",
]
python_prefix = "py"
@@ -50,11 +51,24 @@ def yield_lines(lines: Iterator[str], prefix: str, suffix: str):
yield line[strip_length:]
def wrap_async(lines: List[str]) -> List[str]:
# Indent all the lines
lines = [" " + line for line in lines]
# Put all lines in `async def main():`
lines = ["async def main():\n"] + lines
# Put `import asyncio\n asyncio.run(main())` at the end
lines = lines + ["\n", "import asyncio\n", "asyncio.run(main())\n"]
return lines
for file in filter(lambda file: file not in excluded_files, files):
with open(file, "r") as f:
lines = list(yield_lines(iter(f), "```", "```"))
if len(lines) > 0:
if any("await" in line for line in lines):
lines = wrap_async(lines)
print(lines)
out_path = (
Path(python_folder)

14
node/package-lock.json generated
View File

@@ -1,12 +1,12 @@
{
"name": "vectordb",
"version": "0.4.16",
"version": "0.4.17",
"lockfileVersion": 3,
"requires": true,
"packages": {
"": {
"name": "vectordb",
"version": "0.4.16",
"version": "0.4.17",
"cpu": [
"x64",
"arm64"
@@ -52,11 +52,11 @@
"uuid": "^9.0.0"
},
"optionalDependencies": {
"@lancedb/vectordb-darwin-arm64": "0.4.16",
"@lancedb/vectordb-darwin-x64": "0.4.16",
"@lancedb/vectordb-linux-arm64-gnu": "0.4.16",
"@lancedb/vectordb-linux-x64-gnu": "0.4.16",
"@lancedb/vectordb-win32-x64-msvc": "0.4.16"
"@lancedb/vectordb-darwin-arm64": "0.4.17",
"@lancedb/vectordb-darwin-x64": "0.4.17",
"@lancedb/vectordb-linux-arm64-gnu": "0.4.17",
"@lancedb/vectordb-linux-x64-gnu": "0.4.17",
"@lancedb/vectordb-win32-x64-msvc": "0.4.17"
},
"peerDependencies": {
"@apache-arrow/ts": "^14.0.2",

View File

@@ -1,6 +1,6 @@
{
"name": "vectordb",
"version": "0.4.16",
"version": "0.4.17",
"description": " Serverless, low-latency vector database for AI applications",
"main": "dist/index.js",
"types": "dist/index.d.ts",
@@ -88,10 +88,10 @@
}
},
"optionalDependencies": {
"@lancedb/vectordb-darwin-arm64": "0.4.16",
"@lancedb/vectordb-darwin-x64": "0.4.16",
"@lancedb/vectordb-linux-arm64-gnu": "0.4.16",
"@lancedb/vectordb-linux-x64-gnu": "0.4.16",
"@lancedb/vectordb-win32-x64-msvc": "0.4.16"
"@lancedb/vectordb-darwin-arm64": "0.4.17",
"@lancedb/vectordb-darwin-x64": "0.4.17",
"@lancedb/vectordb-linux-arm64-gnu": "0.4.17",
"@lancedb/vectordb-linux-x64-gnu": "0.4.17",
"@lancedb/vectordb-win32-x64-msvc": "0.4.17"
}
}

View File

@@ -78,12 +78,25 @@ export interface ConnectionOptions {
/** User provided AWS crednetials.
*
* If not provided, LanceDB will use the default credentials provider chain.
*
* @deprecated Pass `aws_access_key_id`, `aws_secret_access_key`, and `aws_session_token`
* through `storageOptions` instead.
*/
awsCredentials?: AwsCredentials
/** AWS region to connect to. Default is {@link defaultAwsRegion}. */
/** AWS region to connect to. Default is {@link defaultAwsRegion}
*
* @deprecated Pass `region` through `storageOptions` instead.
*/
awsRegion?: string
/**
* User provided options for object storage. For example, S3 credentials or request timeouts.
*
* The various options are described at https://lancedb.github.io/lancedb/guides/storage/
*/
storageOptions?: Record<string, string>
/**
* API key for the remote connections
*
@@ -176,7 +189,6 @@ export async function connect (
if (typeof arg === 'string') {
opts = { uri: arg }
} else {
// opts = { uri: arg.uri, awsCredentials = arg.awsCredentials }
const keys = Object.keys(arg)
if (keys.length === 1 && keys[0] === 'uri' && typeof arg.uri === 'string') {
opts = { uri: arg.uri }
@@ -198,12 +210,26 @@ export async function connect (
// Remote connection
return new RemoteConnection(opts)
}
const storageOptions = opts.storageOptions ?? {};
if (opts.awsCredentials?.accessKeyId !== undefined) {
storageOptions.aws_access_key_id = opts.awsCredentials.accessKeyId
}
if (opts.awsCredentials?.secretKey !== undefined) {
storageOptions.aws_secret_access_key = opts.awsCredentials.secretKey
}
if (opts.awsCredentials?.sessionToken !== undefined) {
storageOptions.aws_session_token = opts.awsCredentials.sessionToken
}
if (opts.awsRegion !== undefined) {
storageOptions.region = opts.awsRegion
}
// It's a pain to pass a record to Rust, so we convert it to an array of key-value pairs
const storageOptionsArr = Object.entries(storageOptions);
const db = await databaseNew(
opts.uri,
opts.awsCredentials?.accessKeyId,
opts.awsCredentials?.secretKey,
opts.awsCredentials?.sessionToken,
opts.awsRegion,
storageOptionsArr,
opts.readConsistencyInterval
)
return new LocalConnection(db, opts)
@@ -720,7 +746,6 @@ export class LocalConnection implements Connection {
const tbl = await databaseOpenTable.call(
this._db,
name,
...getAwsArgs(this._options())
)
if (embeddings !== undefined) {
return new LocalTable(tbl, name, this._options(), embeddings)

View File

@@ -111,6 +111,10 @@ async function decodeErrorData(
if (responseType === 'arraybuffer') {
return new TextDecoder().decode(errorData)
} else {
if (typeof errorData === 'object') {
return JSON.stringify(errorData)
}
return errorData
}
}

View File

@@ -38,7 +38,7 @@ import {
fromRecordsToStreamBuffer,
fromTableToStreamBuffer
} from '../arrow'
import { toSQL } from '../util'
import { toSQL, TTLCache } from '../util'
import { type HttpMiddleware } from '../middleware'
/**
@@ -47,6 +47,7 @@ import { type HttpMiddleware } from '../middleware'
export class RemoteConnection implements Connection {
private _client: HttpLancedbClient
private readonly _dbName: string
private readonly _tableCache = new TTLCache(300_000)
constructor (opts: ConnectionOptions) {
if (!opts.uri.startsWith('db://')) {
@@ -89,6 +90,9 @@ export class RemoteConnection implements Connection {
page_token: pageToken
})
const body = await response.body()
for (const table of body.tables) {
this._tableCache.set(table, true)
}
return body.tables
}
@@ -101,6 +105,12 @@ export class RemoteConnection implements Connection {
name: string,
embeddings?: EmbeddingFunction<T>
): Promise<Table<T>> {
// check if the table exists
if (this._tableCache.get(name) === undefined) {
await this._client.post(`/v1/table/${encodeURIComponent(name)}/describe/`)
this._tableCache.set(name, true)
}
if (embeddings !== undefined) {
return new RemoteTable(this._client, name, embeddings)
} else {
@@ -169,6 +179,7 @@ export class RemoteConnection implements Connection {
)
}
this._tableCache.set(tableName, true)
if (embeddings === undefined) {
return new RemoteTable(this._client, tableName)
} else {
@@ -178,6 +189,7 @@ export class RemoteConnection implements Connection {
async dropTable (name: string): Promise<void> {
await this._client.post(`/v1/table/${encodeURIComponent(name)}/drop/`)
this._tableCache.delete(name)
}
withMiddleware (middleware: HttpMiddleware): Connection {

View File

@@ -42,6 +42,7 @@ import {
Float16,
Int64
} from 'apache-arrow'
import type { RemoteRequest, RemoteResponse } from '../middleware'
const expect = chai.expect
const assert = chai.assert
@@ -74,6 +75,19 @@ describe('LanceDB client', function () {
assert.equal(con.uri, uri)
})
it('should accept custom storage options', async function () {
const uri = await createTestDB()
const storageOptions = {
region: 'us-west-2',
timeout: '30s'
};
const con = await lancedb.connect({
uri,
storageOptions
})
assert.equal(con.uri, uri)
})
it('should return the existing table names', async function () {
const uri = await createTestDB()
const con = await lancedb.connect(uri)
@@ -913,7 +927,22 @@ describe('Remote LanceDB client', function () {
}
// Search
const table = await con.openTable('vectors')
const table = await con.withMiddleware(new (class {
async onRemoteRequest(req: RemoteRequest, next: (req: RemoteRequest) => Promise<RemoteResponse>) {
// intercept call to check if the table exists and make the call succeed
if (req.uri.endsWith('/describe/')) {
return {
status: 200,
statusText: 'OK',
headers: new Map(),
body: async () => ({})
}
}
return await next(req)
}
})()).openTable('vectors')
try {
await table.search([0.1, 0.3]).execute()
} catch (err) {

View File

@@ -42,3 +42,36 @@ export function toSQL (value: Literal): string {
// eslint-disable-next-line @typescript-eslint/restrict-template-expressions
throw new Error(`Unsupported value type: ${typeof value} value: (${value})`)
}
export class TTLCache {
private readonly cache: Map<string, { value: any, expires: number }>
/**
* @param ttl Time to live in milliseconds
*/
constructor (private readonly ttl: number) {
this.cache = new Map()
}
get (key: string): any | undefined {
const entry = this.cache.get(key)
if (entry === undefined) {
return undefined
}
if (entry.expires < Date.now()) {
this.cache.delete(key)
return undefined
}
return entry.value
}
set (key: string, value: any): void {
this.cache.set(key, { value, expires: Date.now() + this.ttl })
}
delete (key: string): void {
this.cache.delete(key)
}
}

View File

@@ -0,0 +1,219 @@
// Copyright 2024 Lance Developers.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
/* eslint-disable @typescript-eslint/naming-convention */
import { connect } from "../dist";
import {
CreateBucketCommand,
DeleteBucketCommand,
DeleteObjectCommand,
HeadObjectCommand,
ListObjectsV2Command,
S3Client,
} from "@aws-sdk/client-s3";
import {
CreateKeyCommand,
ScheduleKeyDeletionCommand,
KMSClient,
} from "@aws-sdk/client-kms";
// Skip these tests unless the S3_TEST environment variable is set
const maybeDescribe = process.env.S3_TEST ? describe : describe.skip;
// These are all keys that are accepted by storage_options
const CONFIG = {
allowHttp: "true",
awsAccessKeyId: "ACCESSKEY",
awsSecretAccessKey: "SECRETKEY",
awsEndpoint: "http://127.0.0.1:4566",
awsRegion: "us-east-1",
};
class S3Bucket {
name: string;
constructor(name: string) {
this.name = name;
}
static s3Client() {
return new S3Client({
region: CONFIG.awsRegion,
credentials: {
accessKeyId: CONFIG.awsAccessKeyId,
secretAccessKey: CONFIG.awsSecretAccessKey,
},
endpoint: CONFIG.awsEndpoint,
});
}
public static async create(name: string): Promise<S3Bucket> {
const client = this.s3Client();
// Delete the bucket if it already exists
try {
await this.deleteBucket(client, name);
} catch (e) {
// It's fine if the bucket doesn't exist
}
await client.send(new CreateBucketCommand({ Bucket: name }));
return new S3Bucket(name);
}
public async delete() {
const client = S3Bucket.s3Client();
await S3Bucket.deleteBucket(client, this.name);
}
static async deleteBucket(client: S3Client, name: string) {
// Must delete all objects before we can delete the bucket
const objects = await client.send(
new ListObjectsV2Command({ Bucket: name }),
);
if (objects.Contents) {
for (const object of objects.Contents) {
await client.send(
new DeleteObjectCommand({ Bucket: name, Key: object.Key }),
);
}
}
await client.send(new DeleteBucketCommand({ Bucket: name }));
}
public async assertAllEncrypted(path: string, keyId: string) {
const client = S3Bucket.s3Client();
const objects = await client.send(
new ListObjectsV2Command({ Bucket: this.name, Prefix: path }),
);
if (objects.Contents) {
for (const object of objects.Contents) {
const metadata = await client.send(
new HeadObjectCommand({ Bucket: this.name, Key: object.Key }),
);
expect(metadata.ServerSideEncryption).toBe("aws:kms");
expect(metadata.SSEKMSKeyId).toContain(keyId);
}
}
}
}
class KmsKey {
keyId: string;
constructor(keyId: string) {
this.keyId = keyId;
}
static kmsClient() {
return new KMSClient({
region: CONFIG.awsRegion,
credentials: {
accessKeyId: CONFIG.awsAccessKeyId,
secretAccessKey: CONFIG.awsSecretAccessKey,
},
endpoint: CONFIG.awsEndpoint,
});
}
public static async create(): Promise<KmsKey> {
const client = this.kmsClient();
const key = await client.send(new CreateKeyCommand({}));
const keyId = key?.KeyMetadata?.KeyId;
if (!keyId) {
throw new Error("Failed to create KMS key");
}
return new KmsKey(keyId);
}
public async delete() {
const client = KmsKey.kmsClient();
await client.send(new ScheduleKeyDeletionCommand({ KeyId: this.keyId }));
}
}
maybeDescribe("storage_options", () => {
let bucket: S3Bucket;
let kmsKey: KmsKey;
beforeAll(async () => {
bucket = await S3Bucket.create("lancedb");
kmsKey = await KmsKey.create();
});
afterAll(async () => {
await kmsKey.delete();
await bucket.delete();
});
it("can be used to configure auth and endpoints", async () => {
const uri = `s3://${bucket.name}/test`;
const db = await connect(uri, { storageOptions: CONFIG });
let table = await db.createTable("test", [{ a: 1, b: 2 }]);
let rowCount = await table.countRows();
expect(rowCount).toBe(1);
let tableNames = await db.tableNames();
expect(tableNames).toEqual(["test"]);
table = await db.openTable("test");
rowCount = await table.countRows();
expect(rowCount).toBe(1);
await table.add([
{ a: 2, b: 3 },
{ a: 3, b: 4 },
]);
rowCount = await table.countRows();
expect(rowCount).toBe(3);
await db.dropTable("test");
tableNames = await db.tableNames();
expect(tableNames).toEqual([]);
});
it("can configure encryption at connection and table level", async () => {
const uri = `s3://${bucket.name}/test`;
let db = await connect(uri, { storageOptions: CONFIG });
let table = await db.createTable("table1", [{ a: 1, b: 2 }], {
storageOptions: {
awsServerSideEncryption: "aws:kms",
awsSseKmsKeyId: kmsKey.keyId,
},
});
let rowCount = await table.countRows();
expect(rowCount).toBe(1);
await table.add([{ a: 2, b: 3 }]);
await bucket.assertAllEncrypted("test/table1.lance", kmsKey.keyId);
// Now with encryption settings at connection level
db = await connect(uri, {
storageOptions: {
...CONFIG,
awsServerSideEncryption: "aws:kms",
awsSseKmsKeyId: kmsKey.keyId,
},
});
table = await db.createTable("table2", [{ a: 1, b: 2 }]);
rowCount = await table.countRows();
expect(rowCount).toBe(1);
await table.add([{ a: 2, b: 3 }]);
await bucket.assertAllEncrypted("test/table2.lance", kmsKey.keyId);
});
});

View File

@@ -13,10 +13,32 @@
// limitations under the License.
import { fromTableToBuffer, makeArrowTable, makeEmptyTable } from "./arrow";
import { Connection as LanceDbConnection } from "./native";
import { ConnectionOptions, Connection as LanceDbConnection } from "./native";
import { Table } from "./table";
import { Table as ArrowTable, Schema } from "apache-arrow";
/**
* Connect to a LanceDB instance at the given URI.
*
* Accpeted formats:
*
* - `/path/to/database` - local database
* - `s3://bucket/path/to/database` or `gs://bucket/path/to/database` - database on cloud storage
* - `db://host:port` - remote database (LanceDB cloud)
* @param {string} uri - The uri of the database. If the database uri starts
* with `db://` then it connects to a remote database.
* @see {@link ConnectionOptions} for more details on the URI format.
*/
export async function connect(
uri: string,
opts?: Partial<ConnectionOptions>,
): Promise<Connection> {
opts = opts ?? {};
opts.storageOptions = cleanseStorageOptions(opts.storageOptions);
const nativeConn = await LanceDbConnection.new(uri, opts);
return new Connection(nativeConn);
}
export interface CreateTableOptions {
/**
* The mode to use when creating the table.
@@ -33,6 +55,28 @@ export interface CreateTableOptions {
* then no error will be raised.
*/
existOk: boolean;
/**
* Configuration for object storage.
*
* Options already set on the connection will be inherited by the table,
* but can be overridden here.
*
* The available options are described at https://lancedb.github.io/lancedb/guides/storage/
*/
storageOptions?: Record<string, string>;
}
export interface OpenTableOptions {
/**
* Configuration for object storage.
*
* Options already set on the connection will be inherited by the table,
* but can be overridden here.
*
* The available options are described at https://lancedb.github.io/lancedb/guides/storage/
*/
storageOptions?: Record<string, string>;
}
export interface TableNamesOptions {
@@ -109,8 +153,14 @@ export class Connection {
* Open a table in the database.
* @param {string} name - The name of the table
*/
async openTable(name: string): Promise<Table> {
const innerTable = await this.inner.openTable(name);
async openTable(
name: string,
options?: Partial<OpenTableOptions>,
): Promise<Table> {
const innerTable = await this.inner.openTable(
name,
cleanseStorageOptions(options?.storageOptions),
);
return new Table(innerTable);
}
@@ -139,7 +189,12 @@ export class Connection {
table = makeArrowTable(data);
}
const buf = await fromTableToBuffer(table);
const innerTable = await this.inner.createTable(name, buf, mode);
const innerTable = await this.inner.createTable(
name,
buf,
mode,
cleanseStorageOptions(options?.storageOptions),
);
return new Table(innerTable);
}
@@ -162,7 +217,12 @@ export class Connection {
const table = makeEmptyTable(schema);
const buf = await fromTableToBuffer(table);
const innerTable = await this.inner.createEmptyTable(name, buf, mode);
const innerTable = await this.inner.createEmptyTable(
name,
buf,
mode,
cleanseStorageOptions(options?.storageOptions),
);
return new Table(innerTable);
}
@@ -174,3 +234,43 @@ export class Connection {
return this.inner.dropTable(name);
}
}
/**
* Takes storage options and makes all the keys snake case.
*/
function cleanseStorageOptions(
options?: Record<string, string>,
): Record<string, string> | undefined {
if (options === undefined) {
return undefined;
}
const result: Record<string, string> = {};
for (const [key, value] of Object.entries(options)) {
if (value !== undefined) {
const newKey = camelToSnakeCase(key);
result[newKey] = value;
}
}
return result;
}
/**
* Convert a string to snake case. It might already be snake case, in which case it is
* returned unchanged.
*/
function camelToSnakeCase(camel: string): string {
if (camel.includes("_")) {
// Assume if there is at least one underscore, it is already snake case
return camel;
}
if (camel.toLocaleUpperCase() === camel) {
// Assume if the string is all uppercase, it is already snake case
return camel;
}
let result = camel.replace(/[A-Z]/g, (letter) => `_${letter.toLowerCase()}`);
if (result.startsWith("_")) {
result = result.slice(1);
}
return result;
}

View File

@@ -12,12 +12,6 @@
// See the License for the specific language governing permissions and
// limitations under the License.
import { Connection } from "./connection";
import {
Connection as LanceDbConnection,
ConnectionOptions,
} from "./native.js";
export {
WriteOptions,
WriteMode,
@@ -32,6 +26,7 @@ export {
VectorColumnOptions,
} from "./arrow";
export {
connect,
Connection,
CreateTableOptions,
TableNamesOptions,
@@ -46,24 +41,3 @@ export {
export { Index, IndexOptions, IvfPqOptions } from "./indices";
export { Table, AddDataOptions, IndexConfig, UpdateOptions } from "./table";
export * as embedding from "./embedding";
/**
* Connect to a LanceDB instance at the given URI.
*
* Accpeted formats:
*
* - `/path/to/database` - local database
* - `s3://bucket/path/to/database` or `gs://bucket/path/to/database` - database on cloud storage
* - `db://host:port` - remote database (LanceDB cloud)
* @param {string} uri - The uri of the database. If the database uri starts
* with `db://` then it connects to a remote database.
* @see {@link ConnectionOptions} for more details on the URI format.
*/
export async function connect(
uri: string,
opts?: Partial<ConnectionOptions>,
): Promise<Connection> {
opts = opts ?? {};
const nativeConn = await LanceDbConnection.new(uri, opts);
return new Connection(nativeConn);
}

View File

@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-darwin-arm64",
"version": "0.4.16",
"version": "0.4.17",
"os": [
"darwin"
],

View File

@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-darwin-x64",
"version": "0.4.16",
"version": "0.4.17",
"os": [
"darwin"
],

View File

@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-linux-arm64-gnu",
"version": "0.4.16",
"version": "0.4.17",
"os": [
"linux"
],

View File

@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-linux-x64-gnu",
"version": "0.4.16",
"version": "0.4.17",
"os": [
"linux"
],

1680
nodejs/package-lock.json generated

File diff suppressed because it is too large Load Diff

View File

@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb",
"version": "0.4.16",
"version": "0.4.17",
"main": "./dist/index.js",
"types": "./dist/index.d.ts",
"napi": {
@@ -18,6 +18,8 @@
},
"license": "Apache 2.0",
"devDependencies": {
"@aws-sdk/client-s3": "^3.33.0",
"@aws-sdk/client-kms": "^3.33.0",
"@napi-rs/cli": "^2.18.0",
"@types/jest": "^29.1.2",
"@types/tmp": "^0.2.6",
@@ -63,15 +65,16 @@
"lint": "eslint lancedb && eslint __test__",
"prepublishOnly": "napi prepublish -t npm",
"test": "npm run build && jest --verbose",
"integration": "S3_TEST=1 npm run test",
"universal": "napi universal",
"version": "napi version"
},
"optionalDependencies": {
"@lancedb/lancedb-darwin-arm64": "0.4.16",
"@lancedb/lancedb-darwin-x64": "0.4.16",
"@lancedb/lancedb-linux-arm64-gnu": "0.4.16",
"@lancedb/lancedb-linux-x64-gnu": "0.4.16",
"@lancedb/lancedb-win32-x64-msvc": "0.4.16"
"@lancedb/lancedb-darwin-arm64": "0.4.17",
"@lancedb/lancedb-darwin-x64": "0.4.17",
"@lancedb/lancedb-linux-arm64-gnu": "0.4.17",
"@lancedb/lancedb-linux-x64-gnu": "0.4.17",
"@lancedb/lancedb-win32-x64-msvc": "0.4.17"
},
"dependencies": {
"openai": "^4.29.2",

View File

@@ -12,6 +12,8 @@
// See the License for the specific language governing permissions and
// limitations under the License.
use std::collections::HashMap;
use napi::bindgen_prelude::*;
use napi_derive::*;
@@ -64,6 +66,11 @@ impl Connection {
builder =
builder.read_consistency_interval(std::time::Duration::from_secs_f64(interval));
}
if let Some(storage_options) = options.storage_options {
for (key, value) in storage_options {
builder = builder.storage_option(key, value);
}
}
Ok(Self::inner_new(
builder
.execute()
@@ -118,14 +125,18 @@ impl Connection {
name: String,
buf: Buffer,
mode: String,
storage_options: Option<HashMap<String, String>>,
) -> napi::Result<Table> {
let batches = ipc_file_to_batches(buf.to_vec())
.map_err(|e| napi::Error::from_reason(format!("Failed to read IPC file: {}", e)))?;
let mode = Self::parse_create_mode_str(&mode)?;
let tbl = self
.get_inner()?
.create_table(&name, batches)
.mode(mode)
let mut builder = self.get_inner()?.create_table(&name, batches).mode(mode);
if let Some(storage_options) = storage_options {
for (key, value) in storage_options {
builder = builder.storage_option(key, value);
}
}
let tbl = builder
.execute()
.await
.map_err(|e| napi::Error::from_reason(format!("{}", e)))?;
@@ -138,15 +149,22 @@ impl Connection {
name: String,
schema_buf: Buffer,
mode: String,
storage_options: Option<HashMap<String, String>>,
) -> napi::Result<Table> {
let schema = ipc_file_to_schema(schema_buf.to_vec()).map_err(|e| {
napi::Error::from_reason(format!("Failed to marshal schema from JS to Rust: {}", e))
})?;
let mode = Self::parse_create_mode_str(&mode)?;
let tbl = self
let mut builder = self
.get_inner()?
.create_empty_table(&name, schema)
.mode(mode)
.mode(mode);
if let Some(storage_options) = storage_options {
for (key, value) in storage_options {
builder = builder.storage_option(key, value);
}
}
let tbl = builder
.execute()
.await
.map_err(|e| napi::Error::from_reason(format!("{}", e)))?;
@@ -154,10 +172,18 @@ impl Connection {
}
#[napi]
pub async fn open_table(&self, name: String) -> napi::Result<Table> {
let tbl = self
.get_inner()?
.open_table(&name)
pub async fn open_table(
&self,
name: String,
storage_options: Option<HashMap<String, String>>,
) -> napi::Result<Table> {
let mut builder = self.get_inner()?.open_table(&name);
if let Some(storage_options) = storage_options {
for (key, value) in storage_options {
builder = builder.storage_option(key, value);
}
}
let tbl = builder
.execute()
.await
.map_err(|e| napi::Error::from_reason(format!("{}", e)))?;

View File

@@ -12,7 +12,8 @@
// See the License for the specific language governing permissions and
// limitations under the License.
use connection::Connection;
use std::collections::HashMap;
use napi_derive::*;
mod connection;
@@ -38,6 +39,10 @@ pub struct ConnectionOptions {
/// Note: this consistency only applies to read operations. Write operations are
/// always consistent.
pub read_consistency_interval: Option<f64>,
/// (For LanceDB OSS only): configuration for object storage.
///
/// The available options are described at https://lancedb.github.io/lancedb/guides/storage/
pub storage_options: Option<HashMap<String, String>>,
}
/// Write mode for writing a table.
@@ -54,7 +59,7 @@ pub struct WriteOptions {
pub mode: Option<WriteMode>,
}
#[napi]
pub async fn connect(uri: String, options: ConnectionOptions) -> napi::Result<Connection> {
Connection::new(uri, options).await
#[napi(object)]
pub struct OpenTableOptions {
pub storage_options: Option<HashMap<String, String>>,
}

View File

@@ -1,5 +1,5 @@
[bumpversion]
current_version = 0.6.7
current_version = 0.6.9
commit = True
message = [python] Bump version: {current_version} → {new_version}
tag = True

View File

@@ -41,7 +41,7 @@ To build the python package you can use maturin:
```bash
# This will build the rust bindings and place them in the appropriate place
# in your venv or conda environment
matruin develop
maturin develop
```
To run the unit tests:

View File

@@ -1,19 +1,17 @@
[project]
name = "lancedb"
version = "0.6.7"
version = "0.6.9"
dependencies = [
"deprecation",
"pylance==0.10.9",
"pylance==0.10.12",
"ratelimiter~=1.0",
"requests>=2.31.0",
"retry>=0.9.2",
"tqdm>=4.27.0",
"pydantic>=1.10",
"attrs>=21.3.0",
"semver>=3.0",
"cachetools",
"pyyaml>=6.0",
"click>=8.1.7",
"requests>=2.31.0",
"overrides>=0.7",
]
description = "lancedb"
@@ -51,6 +49,7 @@ repository = "https://github.com/lancedb/lancedb"
[project.optional-dependencies]
tests = [
"aiohttp",
"boto3",
"pandas>=1.4",
"pytest",
"pytest-mock",
@@ -58,6 +57,7 @@ tests = [
"duckdb",
"pytz",
"polars>=0.19",
"tantivy"
]
dev = ["ruff", "pre-commit"]
docs = [
@@ -65,7 +65,6 @@ docs = [
"mkdocs-jupyter",
"mkdocs-material",
"mkdocstrings[python]",
"mkdocs-ultralytics-plugin==0.0.44",
]
clip = ["torch", "pillow", "open-clip"]
embeddings = [
@@ -88,19 +87,17 @@ azure = ["adlfs>=2024.2.0"]
python-source = "python"
module-name = "lancedb._lancedb"
[project.scripts]
lancedb = "lancedb.cli.cli:cli"
[build-system]
requires = ["maturin>=1.4"]
build-backend = "maturin"
[tool.ruff.lint]
select = ["F", "E", "W", "I", "G", "TCH", "PERF"]
select = ["F", "E", "W", "G", "TCH", "PERF"]
[tool.pytest.ini_options]
addopts = "--strict-markers --ignore-glob=lancedb/embeddings/*.py"
markers = [
"slow: marks tests as slow (deselect with '-m \"not slow\"')",
"asyncio",
"s3_test"
]

View File

@@ -15,7 +15,7 @@ import importlib.metadata
import os
from concurrent.futures import ThreadPoolExecutor
from datetime import timedelta
from typing import Optional, Union
from typing import Dict, Optional, Union
__version__ = importlib.metadata.version("lancedb")
@@ -25,7 +25,6 @@ from .db import AsyncConnection, DBConnection, LanceDBConnection
from .remote.db import RemoteDBConnection
from .schema import vector
from .table import AsyncTable
from .utils import sentry_log
def connect(
@@ -84,7 +83,7 @@ def connect(
>>> db = lancedb.connect("s3://my-bucket/lancedb")
Connect to LancdDB cloud:
Connect to LanceDB cloud:
>>> db = lancedb.connect("db://my_database", api_key="ldb_...")
@@ -119,6 +118,7 @@ async def connect_async(
host_override: Optional[str] = None,
read_consistency_interval: Optional[timedelta] = None,
request_thread_pool: Optional[Union[int, ThreadPoolExecutor]] = None,
storage_options: Optional[Dict[str, str]] = None,
) -> AsyncConnection:
"""Connect to a LanceDB database.
@@ -145,6 +145,9 @@ async def connect_async(
the last check, then the table will be checked for updates. Note: this
consistency only applies to read operations. Write operations are
always consistent.
storage_options: dict, optional
Additional options for the storage backend. See available options at
https://lancedb.github.io/lancedb/guides/storage/
Examples
--------
@@ -173,6 +176,7 @@ async def connect_async(
region,
host_override,
read_consistency_interval_secs,
storage_options,
)
)
@@ -184,7 +188,6 @@ __all__ = [
"AsyncTable",
"URI",
"sanitize_uri",
"sentry_log",
"vector",
"DBConnection",
"LanceDBConnection",

View File

@@ -19,10 +19,18 @@ class Connection(object):
self, start_after: Optional[str], limit: Optional[int]
) -> list[str]: ...
async def create_table(
self, name: str, mode: str, data: pa.RecordBatchReader
self,
name: str,
mode: str,
data: pa.RecordBatchReader,
storage_options: Optional[Dict[str, str]] = None,
) -> Table: ...
async def create_empty_table(
self, name: str, mode: str, schema: pa.Schema
self,
name: str,
mode: str,
schema: pa.Schema,
storage_options: Optional[Dict[str, str]] = None,
) -> Table: ...
class Table:

View File

@@ -1,12 +0,0 @@
# Copyright 2023 LanceDB Developers
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

View File

@@ -1,47 +0,0 @@
# Copyright 2023 LanceDB Developers
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import click
from lancedb.utils import CONFIG
@click.group()
@click.version_option(help="LanceDB command line interface entry point")
def cli():
"LanceDB command line interface"
diagnostics_help = """
Enable or disable LanceDB diagnostics. When enabled, LanceDB will send anonymous events
to help us improve LanceDB. These diagnostics are used only for error reporting and no
data is collected. You can find more about diagnosis on our docs:
https://lancedb.github.io/lancedb/cli_config/
"""
@cli.command(help=diagnostics_help)
@click.option("--enabled/--disabled", default=True)
def diagnostics(enabled):
CONFIG.update({"diagnostics": True if enabled else False})
click.echo("LanceDB diagnostics is %s" % ("enabled" if enabled else "disabled"))
@cli.command(help="Show current LanceDB configuration")
def config():
# TODO: pretty print as table with colors and formatting
click.echo("Current LanceDB configuration:")
cfg = CONFIG.copy()
cfg.pop("uuid") # Don't show uuid as it is not configurable
for item, amount in cfg.items():
click.echo("{} ({})".format(item, amount))

View File

@@ -18,14 +18,13 @@ import inspect
import os
from abc import abstractmethod
from pathlib import Path
from typing import TYPE_CHECKING, Iterable, List, Literal, Optional, Union
from typing import TYPE_CHECKING, Dict, Iterable, List, Literal, Optional, Union
import pyarrow as pa
from overrides import EnforceOverrides, override
from pyarrow import fs
from lancedb.common import data_to_reader, validate_schema
from lancedb.utils.events import register_event
from ._lancedb import connect as lancedb_connect
from .pydantic import LanceModel
@@ -534,6 +533,7 @@ class AsyncConnection(object):
exist_ok: Optional[bool] = None,
on_bad_vectors: Optional[str] = None,
fill_value: Optional[float] = None,
storage_options: Optional[Dict[str, str]] = None,
) -> AsyncTable:
"""Create an [AsyncTable][lancedb.table.AsyncTable] in the database.
@@ -571,6 +571,12 @@ class AsyncConnection(object):
One of "error", "drop", "fill".
fill_value: float
The value to use when filling vectors. Only used if on_bad_vectors="fill".
storage_options: dict, optional
Additional options for the storage backend. Options already set on the
connection will be inherited by the table, but can be overridden here.
See available options at
https://lancedb.github.io/lancedb/guides/storage/
Returns
-------
@@ -730,32 +736,40 @@ class AsyncConnection(object):
mode = "exist_ok"
if data is None:
new_table = await self._inner.create_empty_table(name, mode, schema)
new_table = await self._inner.create_empty_table(
name, mode, schema, storage_options=storage_options
)
else:
data = data_to_reader(data, schema)
new_table = await self._inner.create_table(
name,
mode,
data,
storage_options=storage_options,
)
register_event("create_table")
return AsyncTable(new_table)
async def open_table(self, name: str) -> Table:
async def open_table(
self, name: str, storage_options: Optional[Dict[str, str]] = None
) -> Table:
"""Open a Lance Table in the database.
Parameters
----------
name: str
The name of the table.
storage_options: dict, optional
Additional options for the storage backend. Options already set on the
connection will be inherited by the table, but can be overridden here.
See available options at
https://lancedb.github.io/lancedb/guides/storage/
Returns
-------
A LanceTable object representing the table.
"""
table = await self._inner.open_table(name)
register_event("open_table")
table = await self._inner.open_table(name, storage_options)
return AsyncTable(table)
async def drop_table(self, name: str):

View File

@@ -10,7 +10,6 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ruff: noqa: F401
from .base import EmbeddingFunction, EmbeddingFunctionConfig, TextEmbeddingFunction
from .bedrock import BedRockText
@@ -21,4 +20,7 @@ from .open_clip import OpenClipEmbeddings
from .openai import OpenAIEmbeddings
from .registry import EmbeddingFunctionRegistry, get_registry
from .sentence_transformers import SentenceTransformerEmbeddings
from .gte import GteEmbeddings
from .transformers import TransformersEmbeddingFunction, ColbertEmbeddings
from .imagebind import ImageBindEmbeddings
from .utils import with_embeddings

View File

@@ -78,6 +78,9 @@ class BedRockText(TextEmbeddingFunction):
class Config:
keep_untouched = (cached_property,)
else:
model_config = dict()
model_config["ignored_types"] = (cached_property,)
def ndims(self):
# return len(self._generate_embedding("test"))

View File

@@ -94,6 +94,9 @@ class GeminiText(TextEmbeddingFunction):
class Config:
keep_untouched = (cached_property,)
else:
model_config = dict()
model_config["ignored_types"] = (cached_property,)
def ndims(self):
# TODO: fix hardcoding

View File

@@ -22,6 +22,8 @@ from .base import EmbeddingFunction
from .registry import register
from .utils import AUDIO, IMAGES, TEXT
from lancedb.pydantic import PYDANTIC_VERSION
@register("imagebind")
class ImageBindEmbeddings(EmbeddingFunction):
@@ -38,6 +40,14 @@ class ImageBindEmbeddings(EmbeddingFunction):
device: str = "cpu"
normalize: bool = False
if PYDANTIC_VERSION < (2, 0): # Pydantic 1.x compat
class Config:
keep_untouched = (cached_property,)
else:
model_config = dict()
model_config["ignored_types"] = (cached_property,)
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self._ndims = 1024

View File

@@ -0,0 +1,106 @@
# Copyright (c) 2023. LanceDB Developers
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from functools import cached_property
from typing import List, Any
import numpy as np
from pydantic import PrivateAttr
from lancedb.pydantic import PYDANTIC_VERSION
from ..util import attempt_import_or_raise
from .base import EmbeddingFunction
from .registry import register
from .utils import TEXT
@register("huggingface")
class TransformersEmbeddingFunction(EmbeddingFunction):
"""
An embedding function that can use any model from the transformers library.
Parameters:
----------
name : str
The name of the model to use. This should be a model name that can be loaded
by transformers.AutoModel.from_pretrained. For example, "bert-base-uncased".
default: "colbert-ir/colbertv2.0""
to download package, run :
`pip install transformers`
you may need to install pytorch as well - `https://pytorch.org/get-started/locally/`
"""
name: str = "colbert-ir/colbertv2.0"
_tokenizer: Any = PrivateAttr()
_model: Any = PrivateAttr()
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self._ndims = None
transformers = attempt_import_or_raise("transformers")
self._tokenizer = transformers.AutoTokenizer.from_pretrained(self.name)
self._model = transformers.AutoModel.from_pretrained(self.name)
if PYDANTIC_VERSION < (2, 0): # Pydantic 1.x compat
class Config:
keep_untouched = (cached_property,)
else:
model_config = dict()
model_config["ignored_types"] = (cached_property,)
def ndims(self):
self._ndims = self._model.config.hidden_size
return self._ndims
def compute_query_embeddings(self, query: str, *args, **kwargs) -> List[np.array]:
return self.compute_source_embeddings(query)
def compute_source_embeddings(self, texts: TEXT, *args, **kwargs) -> List[np.array]:
texts = self.sanitize_input(texts)
embedding = []
for text in texts:
encoding = self._tokenizer(
text, return_tensors="pt", padding=True, truncation=True
)
emb = self._model(**encoding).last_hidden_state.mean(dim=1).squeeze()
embedding.append(emb.detach().numpy())
return embedding
@register("colbert")
class ColbertEmbeddings(TransformersEmbeddingFunction):
"""
An embedding function that uses the colbert model from the huggingface library.
Parameters:
----------
name : str
The name of the model to use. This should be a model name that can be loaded
by transformers.AutoModel.from_pretrained. For example, "bert-base-uncased".
default: "colbert-ir/colbertv2.0""
to download package, run :
`pip install transformers`
you may need to install pytorch as well - `https://pytorch.org/get-started/locally/`
"""
name: str = "colbert-ir/colbertv2.0"
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)

View File

@@ -19,15 +19,14 @@ import sys
import time
import urllib.error
import weakref
import logging
from typing import Callable, List, Union
import numpy as np
import pyarrow as pa
from lance.vector import vec_to_table
from retry import retry
from ..util import deprecated, safe_import_pandas
from ..utils.general import LOGGER
pd = safe_import_pandas()
@@ -256,7 +255,7 @@ def retry_with_exponential_backoff(
)
delay *= exponential_base * (1 + jitter * random.random())
LOGGER.info(f"Retrying in {delay:.2f} seconds due to {e}")
logging.info("Retrying in %s seconds...", delay)
time.sleep(delay)
return wrapper
@@ -277,5 +276,5 @@ def url_retrieve(url: str):
def api_key_not_found_help(provider):
LOGGER.error(f"Could not find API key for {provider}.")
logging.error("Could not find API key for %s", provider)
raise ValueError(f"Please set the {provider.upper()}_API_KEY environment variable.")

View File

@@ -18,6 +18,7 @@ from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List, Optional, Union
from urllib.parse import urlparse
from cachetools import TTLCache
import pyarrow as pa
from overrides import override
@@ -29,7 +30,6 @@ from ..table import Table, _sanitize_data
from ..util import validate_table_name
from .arrow import to_ipc_binary
from .client import ARROW_STREAM_CONTENT_TYPE, RestfulLanceDBClient
from .errors import LanceDBClientError
class RemoteDBConnection(DBConnection):
@@ -60,6 +60,7 @@ class RemoteDBConnection(DBConnection):
read_timeout=read_timeout,
)
self._request_thread_pool = request_thread_pool
self._table_cache = TTLCache(maxsize=10000, ttl=300)
def __repr__(self) -> str:
return f"RemoteConnect(name={self.db_name})"
@@ -89,6 +90,7 @@ class RemoteDBConnection(DBConnection):
else:
break
for item in result:
self._table_cache[item] = True
yield item
@override
@@ -109,16 +111,10 @@ class RemoteDBConnection(DBConnection):
self._client.mount_retry_adapter_for_table(name)
# check if table exists
try:
if self._table_cache.get(name) is None:
self._client.post(f"/v1/table/{name}/describe/")
except LanceDBClientError as err:
if str(err).startswith("Not found"):
logging.error(
"Table %s does not exist. Please first call "
"db.create_table(%s, data).",
name,
name,
)
self._table_cache[name] = True
return RemoteTable(self, name)
@override
@@ -267,6 +263,7 @@ class RemoteDBConnection(DBConnection):
content_type=ARROW_STREAM_CONTENT_TYPE,
)
self._table_cache[name] = True
return RemoteTable(self, name)
@override
@@ -282,6 +279,7 @@ class RemoteDBConnection(DBConnection):
self._client.post(
f"/v1/table/{name}/drop/",
)
self._table_cache.pop(name)
async def close(self):
"""Close the connection to the database."""

View File

@@ -1,4 +1,5 @@
import os
import semver
from functools import cached_property
from typing import Union
@@ -42,6 +43,14 @@ class CohereReranker(Reranker):
@cached_property
def _client(self):
cohere = attempt_import_or_raise("cohere")
# ensure version is at least 0.5.0
if (
hasattr(cohere, "__version__")
and semver.compare(cohere.__version__, "5.0.0") < 0
):
raise ValueError(
f"cohere version must be at least 0.5.0, found {cohere.__version__}"
)
if os.environ.get("COHERE_API_KEY") is None and self.api_key is None:
raise ValueError(
"COHERE_API_KEY not set. Either set it in your environment or \
@@ -51,11 +60,14 @@ class CohereReranker(Reranker):
def _rerank(self, result_set: pa.Table, query: str):
docs = result_set[self.column].to_pylist()
results = self._client.rerank(
response = self._client.rerank(
query=query,
documents=docs,
top_n=self.top_n,
model=self.model_name,
)
results = (
response.results
) # returns list (text, idx, relevance) attributes sorted descending by score
indices, scores = list(
zip(*[(result.index, result.relevance_score) for result in results])

View File

@@ -53,7 +53,6 @@ from .util import (
safe_import_polars,
value_to_sql,
)
from .utils.events import register_event
if TYPE_CHECKING:
import PIL
@@ -96,6 +95,9 @@ def _sanitize_data(
data.data.to_batches(), schema, metadata, on_bad_vectors, fill_value
)
if isinstance(data, LanceModel):
raise ValueError("Cannot add a single LanceModel to a table. Use a list.")
if isinstance(data, list):
# convert to list of dict if data is a bunch of LanceModels
if isinstance(data[0], LanceModel):
@@ -907,7 +909,6 @@ class LanceTable(Table):
f"Table {name} does not exist."
f"Please first call db.create_table({name}, data)"
)
register_event("open_table")
return tbl
@@ -1151,7 +1152,6 @@ class LanceTable(Table):
accelerator=accelerator,
index_cache_size=index_cache_size,
)
register_event("create_index")
def create_scalar_index(self, column: str, *, replace: bool = True):
self._dataset_mut.create_scalar_index(
@@ -1211,7 +1211,6 @@ class LanceTable(Table):
ordering_fields=ordering_field_names,
writer_heap_size=writer_heap_size,
)
register_event("create_fts_index")
def _get_fts_index_path(self):
return join_uri(self._dataset_uri, "_indices", "tantivy")
@@ -1259,7 +1258,6 @@ class LanceTable(Table):
self._ref.dataset = lance.write_dataset(
data, self._dataset_uri, schema=self.schema, mode=mode
)
register_event("add")
def merge(
self,
@@ -1322,7 +1320,6 @@ class LanceTable(Table):
self._ref.dataset = self._dataset_mut.merge(
other_table, left_on=left_on, right_on=right_on, schema=schema
)
register_event("merge")
@cached_property
def embedding_functions(self) -> dict:
@@ -1409,8 +1406,14 @@ class LanceTable(Table):
vector and the returned vector.
"""
if vector_column_name is None and query is not None:
try:
vector_column_name = inf_vector_column_query(self.schema)
register_event("search_table")
except Exception as e:
if query_type == "fts":
vector_column_name = ""
else:
raise e
return LanceQueryBuilder.create(
self,
query,
@@ -1537,7 +1540,6 @@ class LanceTable(Table):
if data is not None:
new_table.add(data)
register_event("create_table")
return new_table
def delete(self, where: str):
@@ -1596,7 +1598,6 @@ class LanceTable(Table):
values_sql = {k: value_to_sql(v) for k, v in values.items()}
self._dataset_mut.update(values_sql, where)
register_event("update")
def _execute_query(
self, query: Query, batch_size: Optional[int] = None
@@ -2113,7 +2114,6 @@ class AsyncTable:
if isinstance(data, pa.Table):
data = pa.RecordBatchReader.from_batches(data.schema, data.to_batches())
await self._inner.add(data, mode)
register_event("add")
def merge_insert(self, on: Union[str, Iterable[str]]) -> LanceMergeInsertBuilder:
"""

View File

@@ -1,15 +0,0 @@
# Copyright 2023 LanceDB Developers
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .config import Config
CONFIG = Config()

View File

@@ -1,118 +0,0 @@
# Copyright 2023 LanceDB Developers
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import copy
import hashlib
import os
import platform
import uuid
from pathlib import Path
from .general import LOGGER, is_dir_writeable, yaml_load, yaml_save
def get_user_config_dir(sub_dir="lancedb"):
"""
Get the user config directory.
Args:
sub_dir (str): The name of the subdirectory to create.
Returns:
(Path): The path to the user config directory.
"""
# Return the appropriate config directory for each operating system
if platform.system() == "Windows":
path = Path.home() / "AppData" / "Roaming" / sub_dir
elif platform.system() == "Darwin":
path = Path.home() / "Library" / "Application Support" / sub_dir
elif platform.system() == "Linux":
path = Path.home() / ".config" / sub_dir
else:
raise ValueError(f"Unsupported operating system: {platform.system()}")
# GCP and AWS lambda fix, only /tmp is writeable
if not is_dir_writeable(path.parent):
LOGGER.warning(
f"WARNING ⚠️ user config directory '{path}' is not writeable, defaulting "
"to '/tmp' or CWD. Alternatively you can define a LANCEDB_CONFIG_DIR "
"environment variable for this path."
)
path = (
Path("/tmp") / sub_dir
if is_dir_writeable("/tmp")
else Path().cwd() / sub_dir
)
# Create the subdirectory if it does not exist
path.mkdir(parents=True, exist_ok=True)
return path
USER_CONFIG_DIR = Path(os.getenv("LANCEDB_CONFIG_DIR") or get_user_config_dir())
CONFIG_FILE = USER_CONFIG_DIR / "config.yaml"
class Config(dict):
"""
Manages lancedb config stored in a YAML file.
Args:
file (str | Path): Path to the lancedb config YAML file. Default is
USER_CONFIG_DIR / 'config.yaml'.
"""
def __init__(self, file=CONFIG_FILE):
self.file = Path(file)
self.defaults = { # Default global config values
"diagnostics": True,
"uuid": hashlib.sha256(str(uuid.getnode()).encode()).hexdigest(),
}
super().__init__(copy.deepcopy(self.defaults))
if not self.file.exists():
self.save()
self.load()
correct_keys = self.keys() == self.defaults.keys()
correct_types = all(
type(a) is type(b) for a, b in zip(self.values(), self.defaults.values())
)
if not (correct_keys and correct_types):
LOGGER.warning(
"WARNING ⚠️ LanceDB settings reset to default values. This may be due "
"to a possible problem with your settings or a recent package update. "
f"\nView settings & usage with 'lancedb settings' or at '{self.file}'"
)
self.reset()
def load(self):
"""Loads settings from the YAML file."""
super().update(yaml_load(self.file))
def save(self):
"""Saves the current settings to the YAML file."""
yaml_save(self.file, dict(self))
def update(self, *args, **kwargs):
"""Updates a setting value in the current settings."""
super().update(*args, **kwargs)
self.save()
def reset(self):
"""Resets the settings to default and saves them."""
self.clear()
self.update(self.defaults)
self.save()

View File

@@ -1,171 +0,0 @@
# Copyright 2023 LanceDB Developers
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import datetime
import importlib.metadata
import platform
import random
import sys
import time
from lancedb.utils import CONFIG
from lancedb.utils.general import TryExcept
from .general import (
PLATFORMS,
get_git_origin_url,
is_git_dir,
is_github_actions_ci,
is_online,
is_pip_package,
is_pytest_running,
threaded_request,
)
class _Events:
"""
A class for collecting anonymous event analytics. Event analytics are enabled when
``diagnostics=True`` in config and disabled when ``diagnostics=False``.
You can enable or disable diagnostics by running ``lancedb diagnostics --enabled``
or ``lancedb diagnostics --disabled``.
Attributes
----------
url : str
The URL to send anonymous events.
rate_limit : float
The rate limit in seconds for sending events.
metadata : dict
A dictionary containing metadata about the environment.
enabled : bool
A flag to enable or disable Events based on certain conditions.
"""
_instance = None
url = "https://app.posthog.com/capture/"
headers = {"Content-Type": "application/json"}
api_key = "phc_oENDjGgHtmIDrV6puUiFem2RB4JA8gGWulfdulmMdZP"
# This api-key is write only and is safe to expose in the codebase.
def __init__(self):
"""
Initializes the Events object with default values for events, rate_limit,
and metadata.
"""
self.events = [] # events list
self.throttled_event_names = ["search_table"]
self.throttled_events = set()
self.max_events = 5 # max events to store in memory
self.rate_limit = 60.0 * 60.0 # rate limit (seconds)
self.time = 0.0
if is_git_dir():
install = "git"
elif is_pip_package():
install = "pip"
else:
install = "other"
self.metadata = {
"cli": sys.argv[0],
"install": install,
"python": ".".join(platform.python_version_tuple()[:2]),
"version": importlib.metadata.version("lancedb"),
"platforms": PLATFORMS,
"session_id": round(random.random() * 1e15),
# TODO: In future we might be interested in this metric
# 'engagement_time_msec': 1000
}
TESTS_RUNNING = is_pytest_running() or is_github_actions_ci()
ONLINE = is_online()
self.enabled = (
CONFIG["diagnostics"]
and not TESTS_RUNNING
and ONLINE
and (
is_pip_package()
or get_git_origin_url() == "https://github.com/lancedb/lancedb.git"
)
)
def __call__(self, event_name, params={}):
"""
Attempts to add a new event to the events list and send events if the rate
limit is reached.
Args
----
event_name : str
The name of the event to be logged.
params : dict, optional
A dictionary of additional parameters to be logged with the event.
"""
### NOTE: We might need a way to tag a session with a label to check usage
### from a source. Setting label should be exposed to the user.
if not self.enabled:
return
if (
len(self.events) < self.max_events
): # Events list limited to self.max_events (drop any events past this)
params.update(self.metadata)
event = {
"event": event_name,
"properties": params,
"timestamp": datetime.datetime.now(
tz=datetime.timezone.utc
).isoformat(),
"distinct_id": CONFIG["uuid"],
}
if event_name not in self.throttled_event_names:
self.events.append(event)
elif event_name not in self.throttled_events:
self.throttled_events.add(event_name)
self.events.append(event)
# Check rate limit
t = time.time()
if (t - self.time) < self.rate_limit:
return
# Time is over rate limiter, send now
data = {
"api_key": self.api_key,
"distinct_id": CONFIG["uuid"], # posthog needs this to accepts the event
"batch": self.events,
}
# POST equivalent to requests.post(self.url, json=data).
# threaded request is used to avoid blocking, retries are disabled, and
# verbose is disabled to avoid any possible disruption in the console.
threaded_request(
method="post",
url=self.url,
headers=self.headers,
json=data,
retry=0,
verbose=False,
)
# Flush & Reset
self.events = []
self.throttled_events = set()
self.time = t
@TryExcept(verbose=False)
def register_event(name: str, **kwargs):
if _Events._instance is None:
_Events._instance = _Events()
_Events._instance(name, **kwargs)

View File

@@ -1,454 +0,0 @@
# Copyright 2023 LanceDB Developers
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import contextlib
import importlib
import logging.config
import os
import platform
import subprocess
import sys
import threading
import time
from pathlib import Path
from typing import Union
import requests
import yaml
LOGGING_NAME = "lancedb"
VERBOSE = (
str(os.getenv("LANCEDB_VERBOSE", True)).lower() == "true"
) # global verbose mode
def set_logging(name=LOGGING_NAME, verbose=True):
"""Sets up logging for the given name.
Parameters
----------
name : str, optional
The name of the logger. Default is 'lancedb'.
verbose : bool, optional
Whether to enable verbose logging. Default is True.
"""
rank = int(os.getenv("RANK", -1)) # rank in world for Multi-GPU trainings
level = logging.INFO if verbose and rank in {-1, 0} else logging.ERROR
logging.config.dictConfig(
{
"version": 1,
"disable_existing_loggers": False,
"formatters": {name: {"format": "%(message)s"}},
"handlers": {
name: {
"class": "logging.StreamHandler",
"formatter": name,
"level": level,
}
},
"loggers": {name: {"level": level, "handlers": [name], "propagate": False}},
}
)
set_logging(LOGGING_NAME, verbose=VERBOSE)
LOGGER = logging.getLogger(LOGGING_NAME)
def is_pip_package(filepath: str = __name__) -> bool:
"""Determines if the file at the given filepath is part of a pip package.
Parameters
----------
filepath : str, optional
The filepath to check. Default is the current file.
Returns
-------
bool
True if the file is part of a pip package, False otherwise.
"""
# Get the spec for the module
spec = importlib.util.find_spec(filepath)
# Return whether the spec is not None and the origin is not None (indicating
# it is a package)
return spec is not None and spec.origin is not None
def is_pytest_running():
"""Determines whether pytest is currently running or not.
Returns
-------
bool
True if pytest is running, False otherwise.
"""
return (
("PYTEST_CURRENT_TEST" in os.environ)
or ("pytest" in sys.modules)
or ("pytest" in Path(sys.argv[0]).stem)
)
def is_github_actions_ci() -> bool:
"""
Determine if the current environment is a GitHub Actions CI Python runner.
Returns
-------
bool
True if the current environment is a GitHub Actions CI Python runner,
False otherwise.
"""
return (
"GITHUB_ACTIONS" in os.environ
and "RUNNER_OS" in os.environ
and "RUNNER_TOOL_CACHE" in os.environ
)
def is_git_dir():
"""
Determines whether the current file is part of a git repository.
If the current file is not part of a git repository, returns None.
Returns
-------
bool
True if current file is part of a git repository.
"""
return get_git_dir() is not None
def is_online() -> bool:
"""
Check internet connectivity by attempting to connect to a known online host.
Returns
-------
bool
True if connection is successful, False otherwise.
"""
import socket
for host in "1.1.1.1", "8.8.8.8", "223.5.5.5": # Cloudflare, Google, AliDNS:
try:
test_connection = socket.create_connection(address=(host, 53), timeout=2)
except (socket.timeout, socket.gaierror, OSError): # noqa: PERF203
continue
else:
# If the connection was successful, close it to avoid a ResourceWarning
test_connection.close()
return True
return False
def is_dir_writeable(dir_path: Union[str, Path]) -> bool:
"""Check if a directory is writeable.
Parameters
----------
dir_path : Union[str, Path]
The path to the directory.
Returns
-------
bool
True if the directory is writeable, False otherwise.
"""
return os.access(str(dir_path), os.W_OK)
def is_colab():
"""Check if the current script is running inside a Google Colab notebook.
Returns
-------
bool
True if running inside a Colab notebook, False otherwise.
"""
return "COLAB_RELEASE_TAG" in os.environ or "COLAB_BACKEND_VERSION" in os.environ
def is_kaggle():
"""Check if the current script is running inside a Kaggle kernel.
Returns
-------
bool
True if running inside a Kaggle kernel, False otherwise.
"""
return (
os.environ.get("PWD") == "/kaggle/working"
and os.environ.get("KAGGLE_URL_BASE") == "https://www.kaggle.com"
)
def is_jupyter():
"""Check if the current script is running inside a Jupyter Notebook.
Returns
-------
bool
True if running inside a Jupyter Notebook, False otherwise.
"""
with contextlib.suppress(Exception):
from IPython import get_ipython
return get_ipython() is not None
return False
def is_docker() -> bool:
"""Determine if the script is running inside a Docker container.
Returns
-------
bool
True if the script is running inside a Docker container, False otherwise.
"""
file = Path("/proc/self/cgroup")
if file.exists():
with open(file) as f:
return "docker" in f.read()
else:
return False
def get_git_dir():
"""Determine whether the current file is part of a git repository and if so,
returns the repository root directory.
If the current file is not part of a git repository, returns None.
Returns
-------
Path | None
Git root directory if found or None if not found.
"""
for d in Path(__file__).parents:
if (d / ".git").is_dir():
return d
def get_git_origin_url():
"""Retrieve the origin URL of a git repository.
Returns
-------
str | None
The origin URL of the git repository or None if not git directory.
"""
if is_git_dir():
with contextlib.suppress(subprocess.CalledProcessError):
origin = subprocess.check_output(
["git", "config", "--get", "remote.origin.url"]
)
return origin.decode().strip()
def yaml_save(file="data.yaml", data=None, header=""):
"""Save YAML data to a file.
Parameters
----------
file : str, optional
File name, by default 'data.yaml'.
data : dict, optional
Data to save in YAML format, by default None.
header : str, optional
YAML header to add, by default "".
"""
if data is None:
data = {}
file = Path(file)
if not file.parent.exists():
# Create parent directories if they don't exist
file.parent.mkdir(parents=True, exist_ok=True)
# Convert Path objects to strings
for k, v in data.items():
if isinstance(v, Path):
data[k] = str(v)
# Dump data to file in YAML format
with open(file, "w", errors="ignore", encoding="utf-8") as f:
if header:
f.write(header)
yaml.safe_dump(data, f, sort_keys=False, allow_unicode=True)
def yaml_load(file="data.yaml", append_filename=False):
"""
Load YAML data from a file.
Parameters
----------
file : str, optional
File name. Default is 'data.yaml'.
append_filename : bool, optional
Add the YAML filename to the YAML dictionary. Default is False.
Returns
-------
dict
YAML data and file name.
"""
assert Path(file).suffix in (
".yaml",
".yml",
), f"Attempting to load non-YAML file {file} with yaml_load()"
with open(file, errors="ignore", encoding="utf-8") as f:
s = f.read() # string
# Add YAML filename to dict and return
data = (
yaml.safe_load(s) or {}
) # always return a dict (yaml.safe_load() may return None for empty files)
if append_filename:
data["yaml_file"] = str(file)
return data
def yaml_print(yaml_file: Union[str, Path, dict]) -> None:
"""
Pretty prints a YAML file or a YAML-formatted dictionary.
Parameters
----------
yaml_file : Union[str, Path, dict]
The file path of the YAML file or a YAML-formatted dictionary.
Returns
-------
None
"""
yaml_dict = (
yaml_load(yaml_file) if isinstance(yaml_file, (str, Path)) else yaml_file
)
dump = yaml.dump(yaml_dict, sort_keys=False, allow_unicode=True)
LOGGER.info("Printing '%s'\n\n%s", yaml_file, dump)
PLATFORMS = [platform.system()]
if is_colab():
PLATFORMS.append("Colab")
if is_kaggle():
PLATFORMS.append("Kaggle")
if is_jupyter():
PLATFORMS.append("Jupyter")
if is_docker():
PLATFORMS.append("Docker")
PLATFORMS = "|".join(PLATFORMS)
class TryExcept(contextlib.ContextDecorator):
"""
TryExcept context manager.
Usage: @TryExcept() decorator or 'with TryExcept():' context manager.
"""
def __init__(self, msg="", verbose=True):
"""
Parameters
----------
msg : str, optional
Custom message to display in case of exception, by default "".
verbose : bool, optional
Whether to display the message, by default True.
"""
self.msg = msg
self.verbose = verbose
def __enter__(self):
pass
def __exit__(self, exc_type, value, traceback):
if self.verbose and value:
LOGGER.info("%s%s%s", self.msg, ": " if self.msg else "", value)
return True
def threaded_request(
method, url, retry=3, timeout=30, thread=True, code=-1, verbose=True, **kwargs
):
"""
Makes an HTTP request using the 'requests' library, with exponential backoff
retries up to a specified timeout.
Parameters
----------
method : str
The HTTP method to use for the request. Choices are 'post' and 'get'.
url : str
The URL to make the request to.
retry : int, optional
Number of retries to attempt before giving up, by default 3.
timeout : int, optional
Timeout in seconds after which the function will give up retrying,
by default 30.
thread : bool, optional
Whether to execute the request in a separate daemon thread, by default True.
code : int, optional
An identifier for the request, used for logging purposes, by default -1.
verbose : bool, optional
A flag to determine whether to print out to console or not, by default True.
Returns
-------
requests.Response
The HTTP response object. If the request is executed in a separate thread,
returns the thread itself.
"""
# retry only these codes TODO: add codes if needed in future (500, 408)
retry_codes = ()
@TryExcept(verbose=verbose)
def func(method, url, **kwargs):
"""Make HTTP requests with retries and timeouts, with optional progress
tracking.
"""
response = None
t0 = time.time()
for i in range(retry + 1):
if (time.time() - t0) > timeout:
break
response = requests.request(method, url, **kwargs)
if response.status_code < 300: # good return codes in the 2xx range
break
try:
m = response.json().get("message", "No JSON message.")
except AttributeError:
m = "Unable to read JSON."
if i == 0:
if response.status_code in retry_codes:
m += f" Retrying {retry}x for {timeout}s." if retry else ""
elif response.status_code == 429: # rate limit
m = "Rate limit reached"
if verbose:
LOGGER.warning("%s #%s", response.status_code, m)
if response.status_code not in retry_codes:
return response
time.sleep(2**i) # exponential standoff
return response
args = method, url
if thread:
return threading.Thread(
target=func, args=args, kwargs=kwargs, daemon=True
).start()
else:
return func(*args, **kwargs)

View File

@@ -1,119 +0,0 @@
# Copyright 2023 LanceDB Developers
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import bdb
import importlib.metadata
import logging
import sys
from pathlib import Path
from lancedb.utils import CONFIG
from .general import (
PLATFORMS,
TryExcept,
is_git_dir,
is_github_actions_ci,
is_online,
is_pip_package,
is_pytest_running,
)
@TryExcept(verbose=False)
def set_sentry():
"""
Initialize the Sentry SDK for error tracking and reporting. Only used if
sentry_sdk package is installed and sync=True in settings. Run 'lancedb settings'
to see and update settings YAML file.
Conditions required to send errors (ALL conditions must be met or no errors will
be reported):
- sentry_sdk package is installed
- sync=True in settings
- pytest is not running
- running in a pip package installation
- running in a non-git directory
- online environment
The function also configures Sentry SDK to ignore KeyboardInterrupt and
FileNotFoundError exceptions for now.
Additionally, the function sets custom tags and user information for Sentry
events.
"""
def before_send(event, hint):
"""
Modify the event before sending it to Sentry based on specific exception
types and messages.
Args:
event (dict): The event dictionary containing information about the error.
hint (dict): A dictionary containing additional information about
the error.
Returns:
dict: The modified event or None if the event should not be sent
to Sentry.
"""
if "exc_info" in hint:
exc_type, exc_value, tb = hint["exc_info"]
ignored_errors = ["out of memory", "no space left on device", "testing"]
if any(error in str(exc_value).lower() for error in ignored_errors):
return None
if is_git_dir():
install = "git"
elif is_pip_package():
install = "pip"
else:
install = "other"
event["tags"] = {
"sys_argv": sys.argv[0],
"sys_argv_name": Path(sys.argv[0]).name,
"install": install,
"platforms": PLATFORMS,
"version": importlib.metadata.version("lancedb"),
}
return event
TESTS_RUNNING = is_pytest_running() or is_github_actions_ci()
ONLINE = is_online()
if CONFIG["diagnostics"] and not TESTS_RUNNING and ONLINE and is_pip_package():
# and not is_git_dir(): # not running inside a git dir. Maybe too restrictive?
# If sentry_sdk package is not installed then return and do not use Sentry
try:
import sentry_sdk # noqa
except ImportError:
return
sentry_sdk.init(
dsn="https://c63ef8c64e05d1aa1a96513361f3ca2f@o4505950840946688.ingest.sentry.io/4505950933614592",
debug=False,
include_local_variables=False,
traces_sample_rate=0.5,
environment="production", # 'dev' or 'production'
before_send=before_send,
ignore_errors=[KeyboardInterrupt, FileNotFoundError, bdb.BdbQuit],
)
sentry_sdk.set_user({"id": CONFIG["uuid"]}) # SHA-256 anonymized UUID hash
# Disable all sentry logging
for logger in "sentry_sdk", "sentry_sdk.errors":
logging.getLogger(logger).setLevel(logging.CRITICAL)
set_sentry()

View File

@@ -1,31 +0,0 @@
from click.testing import CliRunner
from lancedb.cli.cli import cli
from lancedb.utils import CONFIG
def test_entry():
runner = CliRunner()
result = runner.invoke(cli)
assert result.exit_code == 0 # Main check
assert "lancedb" in result.output.lower() # lazy check
def test_diagnostics():
runner = CliRunner()
result = runner.invoke(cli, ["diagnostics", "--disabled"])
assert result.exit_code == 0 # Main check
assert not CONFIG["diagnostics"]
result = runner.invoke(cli, ["diagnostics", "--enabled"])
assert result.exit_code == 0 # Main check
assert CONFIG["diagnostics"]
def test_config():
runner = CliRunner()
result = runner.invoke(cli, ["config"])
assert result.exit_code == 0 # Main check
cfg = CONFIG.copy()
cfg.pop("uuid")
for item in cfg: # check for keys only as formatting is subject to change
assert item in result.output

View File

@@ -28,13 +28,25 @@ def test_basic(tmp_path):
assert db.uri == str(tmp_path)
assert db.table_names() == []
class SimpleModel(LanceModel):
item: str
price: float
vector: Vector(2)
table = db.create_table(
"test",
data=[
{"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
{"vector": [5.9, 26.5], "item": "bar", "price": 20.0},
],
schema=SimpleModel,
)
with pytest.raises(
ValueError, match="Cannot add a single LanceModel to a table. Use a list."
):
table.add(SimpleModel(item="baz", price=30.0, vector=[1.0, 2.0]))
rs = table.search([100, 100]).limit(1).to_pandas()
assert len(rs) == 1
assert rs["item"].iloc[0] == "bar"
@@ -43,6 +55,11 @@ def test_basic(tmp_path):
assert len(rs) == 1
assert rs["item"].iloc[0] == "foo"
table.create_fts_index(["item"])
rs = table.search("bar", query_type="fts").to_pandas()
assert len(rs) == 1
assert rs["item"].iloc[0] == "bar"
assert db.table_names() == ["test"]
assert "test" in db
assert len(db) == 1

View File

@@ -45,7 +45,7 @@ except Exception:
@pytest.mark.slow
@pytest.mark.parametrize("alias", ["sentence-transformers", "openai"])
@pytest.mark.parametrize("alias", ["sentence-transformers", "openai", "huggingface"])
def test_basic_text_embeddings(alias, tmp_path):
db = lancedb.connect(tmp_path)
registry = get_registry()
@@ -84,7 +84,7 @@ def test_basic_text_embeddings(alias, tmp_path):
)
)
query = "greetings"
query = "greeting"
actual = (
table.search(query, vector_column_name="vector").limit(1).to_pydantic(Words)[0]
)
@@ -184,9 +184,9 @@ def test_imagebind(tmp_path):
import shutil
import tempfile
import lancedb.embeddings.imagebind
import pandas as pd
import requests
from lancedb.embeddings import get_registry
from lancedb.pydantic import LanceModel, Vector
@@ -321,8 +321,6 @@ def test_gemini_embedding(tmp_path):
)
@pytest.mark.slow
def test_gte_embedding(tmp_path):
import lancedb.embeddings.gte
model = get_registry().get("gte-text").create()
class TextModel(LanceModel):

View File

@@ -0,0 +1,158 @@
# Copyright 2024 Lance Developers
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import asyncio
import copy
import pytest
import pyarrow as pa
import lancedb
# These are all keys that are accepted by storage_options
CONFIG = {
"allow_http": "true",
"aws_access_key_id": "ACCESSKEY",
"aws_secret_access_key": "SECRETKEY",
"aws_endpoint": "http://localhost:4566",
"aws_region": "us-east-1",
}
def get_boto3_client(*args, **kwargs):
import boto3
return boto3.client(
*args,
region_name=CONFIG["aws_region"],
aws_access_key_id=CONFIG["aws_access_key_id"],
aws_secret_access_key=CONFIG["aws_secret_access_key"],
**kwargs,
)
@pytest.fixture(scope="module")
def s3_bucket():
s3 = get_boto3_client("s3", endpoint_url=CONFIG["aws_endpoint"])
bucket_name = "lance-integtest"
# if bucket exists, delete it
try:
delete_bucket(s3, bucket_name)
except s3.exceptions.NoSuchBucket:
pass
s3.create_bucket(Bucket=bucket_name)
yield bucket_name
delete_bucket(s3, bucket_name)
def delete_bucket(s3, bucket_name):
# Delete all objects first
for obj in s3.list_objects(Bucket=bucket_name).get("Contents", []):
s3.delete_object(Bucket=bucket_name, Key=obj["Key"])
s3.delete_bucket(Bucket=bucket_name)
@pytest.mark.s3_test
def test_s3_lifecycle(s3_bucket: str):
storage_options = copy.copy(CONFIG)
uri = f"s3://{s3_bucket}/test_lifecycle"
data = pa.table({"x": [1, 2, 3]})
async def test():
db = await lancedb.connect_async(uri, storage_options=storage_options)
table = await db.create_table("test", schema=data.schema)
assert await table.count_rows() == 0
table = await db.create_table("test", data, mode="overwrite")
assert await table.count_rows() == 3
await table.add(data, mode="append")
assert await table.count_rows() == 6
table = await db.open_table("test")
assert await table.count_rows() == 6
await db.drop_table("test")
await db.drop_database()
asyncio.run(test())
@pytest.fixture()
def kms_key():
kms = get_boto3_client("kms", endpoint_url=CONFIG["aws_endpoint"])
key_id = kms.create_key()["KeyMetadata"]["KeyId"]
yield key_id
kms.schedule_key_deletion(KeyId=key_id, PendingWindowInDays=7)
def validate_objects_encrypted(bucket: str, path: str, kms_key: str):
s3 = get_boto3_client("s3", endpoint_url=CONFIG["aws_endpoint"])
objects = s3.list_objects_v2(Bucket=bucket, Prefix=path)["Contents"]
for obj in objects:
info = s3.head_object(Bucket=bucket, Key=obj["Key"])
assert info["ServerSideEncryption"] == "aws:kms", (
"object %s not encrypted" % obj["Key"]
)
assert info["SSEKMSKeyId"].endswith(kms_key), (
"object %s not encrypted with correct key" % obj["Key"]
)
@pytest.mark.s3_test
def test_s3_sse(s3_bucket: str, kms_key: str):
storage_options = copy.copy(CONFIG)
uri = f"s3://{s3_bucket}/test_lifecycle"
data = pa.table({"x": [1, 2, 3]})
async def test():
# Create a table with SSE
db = await lancedb.connect_async(uri, storage_options=storage_options)
table = await db.create_table(
"table1",
schema=data.schema,
storage_options={
"aws_server_side_encryption": "aws:kms",
"aws_sse_kms_key_id": kms_key,
},
)
await table.add(data)
await table.update({"x": "1"})
path = "test_lifecycle/table1.lance"
validate_objects_encrypted(s3_bucket, path, kms_key)
# Test we can set encryption at connection level too.
db = await lancedb.connect_async(
uri,
storage_options=dict(
aws_server_side_encryption="aws:kms",
aws_sse_kms_key_id=kms_key,
**storage_options,
),
)
table = await db.create_table("table2", schema=data.schema)
await table.add(data)
await table.update({"x": "1"})
path = "test_lifecycle/table2.lance"
validate_objects_encrypted(s3_bucket, path, kms_key)
asyncio.run(test())

View File

@@ -1,60 +0,0 @@
import json
import lancedb
import pytest
from lancedb.utils.events import _Events
@pytest.fixture(autouse=True)
def request_log_path(tmp_path):
return tmp_path / "request.json"
def mock_register_event(name: str, **kwargs):
if _Events._instance is None:
_Events._instance = _Events()
_Events._instance.enabled = True
_Events._instance.rate_limit = 0
_Events._instance(name, **kwargs)
def test_event_reporting(monkeypatch, request_log_path, tmp_path) -> None:
def mock_request(**kwargs):
json_data = kwargs.get("json", {})
with open(request_log_path, "w") as f:
json.dump(json_data, f)
monkeypatch.setattr(
lancedb.table, "register_event", mock_register_event
) # Force enable registering events and strip exception handling
monkeypatch.setattr(lancedb.utils.events, "threaded_request", mock_request)
db = lancedb.connect(tmp_path)
db.create_table(
"test",
data=[
{"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
{"vector": [5.9, 26.5], "item": "bar", "price": 20.0},
],
mode="overwrite",
)
assert request_log_path.exists() # test if event was registered
with open(request_log_path, "r") as f:
json_data = json.load(f)
# TODO: don't hardcode these here. Instead create a module level json scehma in
# lancedb.utils.events for better evolvability
batch_keys = ["api_key", "distinct_id", "batch"]
event_keys = ["event", "properties", "timestamp", "distinct_id"]
property_keys = ["cli", "install", "platforms", "version", "session_id"]
assert all([key in json_data for key in batch_keys])
assert all([key in json_data["batch"][0] for key in event_keys])
assert all([key in json_data["batch"][0]["properties"] for key in property_keys])
# cleanup & reset
monkeypatch.undo()
_Events._instance = None

View File

@@ -12,7 +12,7 @@
// See the License for the specific language governing permissions and
// limitations under the License.
use std::{sync::Arc, time::Duration};
use std::{collections::HashMap, sync::Arc, time::Duration};
use arrow::{datatypes::Schema, ffi_stream::ArrowArrayStreamReader, pyarrow::FromPyArrow};
use lancedb::connection::{Connection as LanceConnection, CreateTableMode};
@@ -90,19 +90,21 @@ impl Connection {
name: String,
mode: &str,
data: &PyAny,
storage_options: Option<HashMap<String, String>>,
) -> PyResult<&'a PyAny> {
let inner = self_.get_inner()?.clone();
let mode = Self::parse_create_mode_str(mode)?;
let batches = ArrowArrayStreamReader::from_pyarrow(data)?;
let mut builder = inner.create_table(name, batches).mode(mode);
if let Some(storage_options) = storage_options {
builder = builder.storage_options(storage_options);
}
future_into_py(self_.py(), async move {
let table = inner
.create_table(name, batches)
.mode(mode)
.execute()
.await
.infer_error()?;
let table = builder.execute().await.infer_error()?;
Ok(Table::new(table))
})
}
@@ -112,6 +114,7 @@ impl Connection {
name: String,
mode: &str,
schema: &PyAny,
storage_options: Option<HashMap<String, String>>,
) -> PyResult<&'a PyAny> {
let inner = self_.get_inner()?.clone();
@@ -119,21 +122,31 @@ impl Connection {
let schema = Schema::from_pyarrow(schema)?;
let mut builder = inner.create_empty_table(name, Arc::new(schema)).mode(mode);
if let Some(storage_options) = storage_options {
builder = builder.storage_options(storage_options);
}
future_into_py(self_.py(), async move {
let table = inner
.create_empty_table(name, Arc::new(schema))
.mode(mode)
.execute()
.await
.infer_error()?;
let table = builder.execute().await.infer_error()?;
Ok(Table::new(table))
})
}
pub fn open_table(self_: PyRef<'_, Self>, name: String) -> PyResult<&PyAny> {
#[pyo3(signature = (name, storage_options = None))]
pub fn open_table(
self_: PyRef<'_, Self>,
name: String,
storage_options: Option<HashMap<String, String>>,
) -> PyResult<&PyAny> {
let inner = self_.get_inner()?.clone();
let mut builder = inner.open_table(name);
if let Some(storage_options) = storage_options {
builder = builder.storage_options(storage_options);
}
future_into_py(self_.py(), async move {
let table = inner.open_table(&name).execute().await.infer_error()?;
let table = builder.execute().await.infer_error()?;
Ok(Table::new(table))
})
}
@@ -162,6 +175,7 @@ pub fn connect(
region: Option<String>,
host_override: Option<String>,
read_consistency_interval: Option<f64>,
storage_options: Option<HashMap<String, String>>,
) -> PyResult<&PyAny> {
future_into_py(py, async move {
let mut builder = lancedb::connect(&uri);
@@ -178,6 +192,9 @@ pub fn connect(
let read_consistency_interval = Duration::from_secs_f64(read_consistency_interval);
builder = builder.read_consistency_interval(read_consistency_interval);
}
if let Some(storage_options) = storage_options {
builder = builder.storage_options(storage_options);
}
Ok(Connection::new(builder.execute().await.infer_error()?))
})
}

View File

@@ -1,6 +1,6 @@
[package]
name = "lancedb-node"
version = "0.4.16"
version = "0.4.17"
description = "Serverless, low-latency vector database for AI applications"
license.workspace = true
edition.workspace = true

View File

@@ -12,19 +12,12 @@
// See the License for the specific language governing permissions and
// limitations under the License.
use std::sync::Arc;
use async_trait::async_trait;
use lance::io::ObjectStoreParams;
use neon::prelude::*;
use object_store::aws::{AwsCredential, AwsCredentialProvider};
use object_store::CredentialProvider;
use once_cell::sync::OnceCell;
use tokio::runtime::Runtime;
use lancedb::connect;
use lancedb::connection::Connection;
use lancedb::table::ReadParams;
use crate::error::ResultExt;
use crate::query::JsQuery;
@@ -44,33 +37,6 @@ struct JsDatabase {
impl Finalize for JsDatabase {}
// TODO: object_store didn't export this type so I copied it.
// Make a request to object_store to export this type
#[derive(Debug)]
pub struct StaticCredentialProvider<T> {
credential: Arc<T>,
}
impl<T> StaticCredentialProvider<T> {
pub fn new(credential: T) -> Self {
Self {
credential: Arc::new(credential),
}
}
}
#[async_trait]
impl<T> CredentialProvider for StaticCredentialProvider<T>
where
T: std::fmt::Debug + Send + Sync,
{
type Credential = T;
async fn get_credential(&self) -> object_store::Result<Arc<T>> {
Ok(Arc::clone(&self.credential))
}
}
fn runtime<'a, C: Context<'a>>(cx: &mut C) -> NeonResult<&'static Runtime> {
static RUNTIME: OnceCell<Runtime> = OnceCell::new();
static LOG: OnceCell<()> = OnceCell::new();
@@ -82,29 +48,28 @@ fn runtime<'a, C: Context<'a>>(cx: &mut C) -> NeonResult<&'static Runtime> {
fn database_new(mut cx: FunctionContext) -> JsResult<JsPromise> {
let path = cx.argument::<JsString>(0)?.value(&mut cx);
let aws_creds = get_aws_creds(&mut cx, 1)?;
let region = get_aws_region(&mut cx, 4)?;
let read_consistency_interval = cx
.argument_opt(5)
.and_then(|arg| arg.downcast::<JsNumber, _>(&mut cx).ok())
.map(|v| v.value(&mut cx))
.map(std::time::Duration::from_secs_f64);
let storage_options_js = cx.argument::<JsArray>(1)?.to_vec(&mut cx)?;
let mut storage_options: Vec<(String, String)> = Vec::with_capacity(storage_options_js.len());
for handle in storage_options_js {
let obj = handle.downcast::<JsArray, _>(&mut cx).unwrap();
let key = obj.get::<JsString, _, _>(&mut cx, 0)?.value(&mut cx);
let value = obj.get::<JsString, _, _>(&mut cx, 0)?.value(&mut cx);
storage_options.push((key, value));
}
let rt = runtime(&mut cx)?;
let channel = cx.channel();
let (deferred, promise) = cx.promise();
let mut conn_builder = connect(&path);
if let Some(region) = region {
conn_builder = conn_builder.region(&region);
}
if let Some(aws_creds) = aws_creds {
conn_builder = conn_builder.aws_creds(AwsCredential {
key_id: aws_creds.key_id,
secret_key: aws_creds.secret_key,
token: aws_creds.token,
});
}
let mut conn_builder = connect(&path).storage_options(storage_options);
if let Some(interval) = read_consistency_interval {
conn_builder = conn_builder.read_consistency_interval(interval);
}
@@ -143,93 +108,19 @@ fn database_table_names(mut cx: FunctionContext) -> JsResult<JsPromise> {
Ok(promise)
}
/// Get AWS creds arguments from the context
/// Consumes 3 arguments
fn get_aws_creds(
cx: &mut FunctionContext,
arg_starting_location: i32,
) -> NeonResult<Option<AwsCredential>> {
let secret_key_id = cx
.argument_opt(arg_starting_location)
.filter(|arg| arg.is_a::<JsString, _>(cx))
.and_then(|arg| arg.downcast_or_throw::<JsString, FunctionContext>(cx).ok())
.map(|v| v.value(cx));
let secret_key = cx
.argument_opt(arg_starting_location + 1)
.filter(|arg| arg.is_a::<JsString, _>(cx))
.and_then(|arg| arg.downcast_or_throw::<JsString, FunctionContext>(cx).ok())
.map(|v| v.value(cx));
let temp_token = cx
.argument_opt(arg_starting_location + 2)
.filter(|arg| arg.is_a::<JsString, _>(cx))
.and_then(|arg| arg.downcast_or_throw::<JsString, FunctionContext>(cx).ok())
.map(|v| v.value(cx));
match (secret_key_id, secret_key, temp_token) {
(Some(key_id), Some(key), optional_token) => Ok(Some(AwsCredential {
key_id,
secret_key: key,
token: optional_token,
})),
(None, None, None) => Ok(None),
_ => cx.throw_error("Invalid credentials configuration"),
}
}
fn get_aws_credential_provider(
cx: &mut FunctionContext,
arg_starting_location: i32,
) -> NeonResult<Option<AwsCredentialProvider>> {
Ok(get_aws_creds(cx, arg_starting_location)?.map(|aws_cred| {
Arc::new(StaticCredentialProvider::new(aws_cred))
as Arc<dyn CredentialProvider<Credential = AwsCredential>>
}))
}
/// Get AWS region arguments from the context
fn get_aws_region(cx: &mut FunctionContext, arg_location: i32) -> NeonResult<Option<String>> {
let region = cx
.argument_opt(arg_location)
.filter(|arg| arg.is_a::<JsString, _>(cx))
.map(|arg| arg.downcast_or_throw::<JsString, FunctionContext>(cx));
match region {
Some(Ok(region)) => Ok(Some(region.value(cx))),
None => Ok(None),
Some(Err(e)) => Err(e),
}
}
fn database_open_table(mut cx: FunctionContext) -> JsResult<JsPromise> {
let db = cx
.this()
.downcast_or_throw::<JsBox<JsDatabase>, _>(&mut cx)?;
let table_name = cx.argument::<JsString>(0)?.value(&mut cx);
let aws_creds = get_aws_credential_provider(&mut cx, 1)?;
let aws_region = get_aws_region(&mut cx, 4)?;
let params = ReadParams {
store_options: Some(ObjectStoreParams::with_aws_credentials(
aws_creds, aws_region,
)),
..ReadParams::default()
};
let rt = runtime(&mut cx)?;
let channel = cx.channel();
let database = db.database.clone();
let (deferred, promise) = cx.promise();
rt.spawn(async move {
let table_rst = database
.open_table(&table_name)
.lance_read_params(params)
.execute()
.await;
let table_rst = database.open_table(&table_name).execute().await;
deferred.settle_with(&channel, move |mut cx| {
let js_table = JsTable::from(table_rst.or_throw(&mut cx)?);

View File

@@ -17,7 +17,6 @@ use std::ops::Deref;
use arrow_array::{RecordBatch, RecordBatchIterator};
use lance::dataset::optimize::CompactionOptions;
use lance::dataset::{ColumnAlteration, NewColumnTransform, WriteMode, WriteParams};
use lance::io::ObjectStoreParams;
use lancedb::table::{OptimizeAction, WriteOptions};
use crate::arrow::{arrow_buffer_to_record_batch, record_batch_to_buffer};
@@ -26,7 +25,7 @@ use neon::prelude::*;
use neon::types::buffer::TypedArray;
use crate::error::ResultExt;
use crate::{convert, get_aws_credential_provider, get_aws_region, runtime, JsDatabase};
use crate::{convert, runtime, JsDatabase};
pub struct JsTable {
pub table: LanceDbTable,
@@ -59,6 +58,10 @@ impl JsTable {
return cx.throw_error("Table::create only supports 'overwrite' and 'create' modes")
}
};
let params = WriteParams {
mode,
..WriteParams::default()
};
let rt = runtime(&mut cx)?;
let channel = cx.channel();
@@ -66,17 +69,6 @@ impl JsTable {
let (deferred, promise) = cx.promise();
let database = db.database.clone();
let aws_creds = get_aws_credential_provider(&mut cx, 3)?;
let aws_region = get_aws_region(&mut cx, 6)?;
let params = WriteParams {
store_params: Some(ObjectStoreParams::with_aws_credentials(
aws_creds, aws_region,
)),
mode,
..WriteParams::default()
};
rt.spawn(async move {
let batch_reader = RecordBatchIterator::new(batches.into_iter().map(Ok), schema);
let table_rst = database
@@ -112,13 +104,8 @@ impl JsTable {
"overwrite" => WriteMode::Overwrite,
s => return cx.throw_error(format!("invalid write mode {}", s)),
};
let aws_creds = get_aws_credential_provider(&mut cx, 2)?;
let aws_region = get_aws_region(&mut cx, 5)?;
let params = WriteParams {
store_params: Some(ObjectStoreParams::with_aws_credentials(
aws_creds, aws_region,
)),
mode: write_mode,
..WriteParams::default()
};

View File

@@ -1,6 +1,6 @@
[package]
name = "lancedb"
version = "0.4.16"
version = "0.4.17"
edition.workspace = true
description = "LanceDB: A serverless, low-latency vector database for AI applications"
license.workspace = true
@@ -46,8 +46,13 @@ tempfile = "3.5.0"
rand = { version = "0.8.3", features = ["small_rng"] }
uuid = { version = "1.7.0", features = ["v4"] }
walkdir = "2"
# For s3 integration tests (dev deps aren't allowed to be optional atm)
aws-sdk-s3 = { version = "1.0" }
aws-sdk-kms = { version = "1.0" }
aws-config = { version = "1.0" }
[features]
default = ["remote"]
remote = ["dep:reqwest"]
fp16kernels = ["lance-linalg/fp16kernels"]
s3-test = []

View File

@@ -14,6 +14,7 @@
//! LanceDB Database
use std::collections::HashMap;
use std::fs::create_dir_all;
use std::path::Path;
use std::sync::Arc;
@@ -22,9 +23,7 @@ use arrow_array::{RecordBatchIterator, RecordBatchReader};
use arrow_schema::SchemaRef;
use lance::dataset::{ReadParams, WriteMode};
use lance::io::{ObjectStore, ObjectStoreParams, WrappingObjectStore};
use object_store::{
aws::AwsCredential, local::LocalFileSystem, CredentialProvider, StaticCredentialProvider,
};
use object_store::{aws::AwsCredential, local::LocalFileSystem};
use snafu::prelude::*;
use crate::arrow::IntoArrow;
@@ -208,6 +207,50 @@ impl<const HAS_DATA: bool, T: IntoArrow> CreateTableBuilder<HAS_DATA, T> {
self.mode = mode;
self
}
/// Set an option for the storage layer.
///
/// Options already set on the connection will be inherited by the table,
/// but can be overridden here.
///
/// See available options at <https://lancedb.github.io/lancedb/guides/storage/>
pub fn storage_option(mut self, key: impl Into<String>, value: impl Into<String>) -> Self {
let store_options = self
.write_options
.lance_write_params
.get_or_insert(Default::default())
.store_params
.get_or_insert(Default::default())
.storage_options
.get_or_insert(Default::default());
store_options.insert(key.into(), value.into());
self
}
/// Set multiple options for the storage layer.
///
/// Options already set on the connection will be inherited by the table,
/// but can be overridden here.
///
/// See available options at <https://lancedb.github.io/lancedb/guides/storage/>
pub fn storage_options(
mut self,
pairs: impl IntoIterator<Item = (impl Into<String>, impl Into<String>)>,
) -> Self {
let store_options = self
.write_options
.lance_write_params
.get_or_insert(Default::default())
.store_params
.get_or_insert(Default::default())
.storage_options
.get_or_insert(Default::default());
for (key, value) in pairs {
store_options.insert(key.into(), value.into());
}
self
}
}
#[derive(Clone, Debug)]
@@ -252,6 +295,48 @@ impl OpenTableBuilder {
self
}
/// Set an option for the storage layer.
///
/// Options already set on the connection will be inherited by the table,
/// but can be overridden here.
///
/// See available options at <https://lancedb.github.io/lancedb/guides/storage/>
pub fn storage_option(mut self, key: impl Into<String>, value: impl Into<String>) -> Self {
let storage_options = self
.lance_read_params
.get_or_insert(Default::default())
.store_options
.get_or_insert(Default::default())
.storage_options
.get_or_insert(Default::default());
storage_options.insert(key.into(), value.into());
self
}
/// Set multiple options for the storage layer.
///
/// Options already set on the connection will be inherited by the table,
/// but can be overridden here.
///
/// See available options at <https://lancedb.github.io/lancedb/guides/storage/>
pub fn storage_options(
mut self,
pairs: impl IntoIterator<Item = (impl Into<String>, impl Into<String>)>,
) -> Self {
let storage_options = self
.lance_read_params
.get_or_insert(Default::default())
.store_options
.get_or_insert(Default::default())
.storage_options
.get_or_insert(Default::default());
for (key, value) in pairs {
storage_options.insert(key.into(), value.into());
}
self
}
/// Open the table
pub async fn execute(self) -> Result<Table> {
self.parent.clone().do_open_table(self).await
@@ -385,8 +470,7 @@ pub struct ConnectBuilder {
/// LanceDB Cloud host override, only required if using an on-premises Lance Cloud instance
host_override: Option<String>,
/// User provided AWS credentials
aws_creds: Option<AwsCredential>,
storage_options: HashMap<String, String>,
/// The interval at which to check for updates from other processes.
///
@@ -409,8 +493,8 @@ impl ConnectBuilder {
api_key: None,
region: None,
host_override: None,
aws_creds: None,
read_consistency_interval: None,
storage_options: HashMap::new(),
}
}
@@ -430,8 +514,37 @@ impl ConnectBuilder {
}
/// [`AwsCredential`] to use when connecting to S3.
#[deprecated(note = "Pass through storage_options instead")]
pub fn aws_creds(mut self, aws_creds: AwsCredential) -> Self {
self.aws_creds = Some(aws_creds);
self.storage_options
.insert("aws_access_key_id".into(), aws_creds.key_id.clone());
self.storage_options
.insert("aws_secret_access_key".into(), aws_creds.secret_key.clone());
if let Some(token) = &aws_creds.token {
self.storage_options
.insert("aws_session_token".into(), token.clone());
}
self
}
/// Set an option for the storage layer.
///
/// See available options at <https://lancedb.github.io/lancedb/guides/storage/>
pub fn storage_option(mut self, key: impl Into<String>, value: impl Into<String>) -> Self {
self.storage_options.insert(key.into(), value.into());
self
}
/// Set multiple options for the storage layer.
///
/// See available options at <https://lancedb.github.io/lancedb/guides/storage/>
pub fn storage_options(
mut self,
pairs: impl IntoIterator<Item = (impl Into<String>, impl Into<String>)>,
) -> Self {
for (key, value) in pairs {
self.storage_options.insert(key.into(), value.into());
}
self
}
@@ -522,6 +635,9 @@ struct Database {
pub(crate) store_wrapper: Option<Arc<dyn WrappingObjectStore>>,
read_consistency_interval: Option<std::time::Duration>,
// Storage options to be inherited by tables created from this connection
storage_options: HashMap<String, String>,
}
impl std::fmt::Display for Database {
@@ -604,20 +720,11 @@ impl Database {
};
let plain_uri = url.to_string();
let os_params: ObjectStoreParams = if let Some(aws_creds) = &options.aws_creds {
let credential_provider: Arc<
dyn CredentialProvider<Credential = AwsCredential>,
> = Arc::new(StaticCredentialProvider::new(AwsCredential {
key_id: aws_creds.key_id.clone(),
secret_key: aws_creds.secret_key.clone(),
token: aws_creds.token.clone(),
}));
ObjectStoreParams::with_aws_credentials(
Some(credential_provider),
options.region.clone(),
)
} else {
ObjectStoreParams::default()
let storage_options = options.storage_options.clone();
let os_params = ObjectStoreParams {
storage_options: Some(storage_options.clone()),
..Default::default()
};
let (object_store, base_path) =
ObjectStore::from_uri_and_params(&plain_uri, &os_params).await?;
@@ -641,6 +748,7 @@ impl Database {
object_store,
store_wrapper: write_store_wrapper,
read_consistency_interval: options.read_consistency_interval,
storage_options,
})
}
Err(_) => Self::open_path(uri, options.read_consistency_interval).await,
@@ -662,6 +770,7 @@ impl Database {
object_store,
store_wrapper: None,
read_consistency_interval,
storage_options: HashMap::new(),
})
}
@@ -734,11 +843,26 @@ impl ConnectionInternal for Database {
async fn do_create_table(
&self,
options: CreateTableBuilder<false, NoData>,
mut options: CreateTableBuilder<false, NoData>,
data: Box<dyn RecordBatchReader + Send>,
) -> Result<Table> {
let table_uri = self.table_uri(&options.name)?;
// Inherit storage options from the connection
let storage_options = options
.write_options
.lance_write_params
.get_or_insert_with(Default::default)
.store_params
.get_or_insert_with(Default::default)
.storage_options
.get_or_insert_with(Default::default);
for (key, value) in self.storage_options.iter() {
if !storage_options.contains_key(key) {
storage_options.insert(key.clone(), value.clone());
}
}
let mut write_params = options.write_options.lance_write_params.unwrap_or_default();
if matches!(&options.mode, CreateTableMode::Overwrite) {
write_params.mode = WriteMode::Overwrite;
@@ -768,8 +892,23 @@ impl ConnectionInternal for Database {
}
}
async fn do_open_table(&self, options: OpenTableBuilder) -> Result<Table> {
async fn do_open_table(&self, mut options: OpenTableBuilder) -> Result<Table> {
let table_uri = self.table_uri(&options.name)?;
// Inherit storage options from the connection
let storage_options = options
.lance_read_params
.get_or_insert_with(Default::default)
.store_options
.get_or_insert_with(Default::default)
.storage_options
.get_or_insert_with(Default::default);
for (key, value) in self.storage_options.iter() {
if !storage_options.contains_key(key) {
storage_options.insert(key.clone(), value.clone());
}
}
let native_table = Arc::new(
NativeTable::open_with_params(
&table_uri,
@@ -801,7 +940,10 @@ impl ConnectionInternal for Database {
}
async fn drop_db(&self) -> Result<()> {
todo!()
self.object_store
.remove_dir_all(self.base_path.clone())
.await?;
Ok(())
}
}

View File

@@ -14,6 +14,7 @@
//! LanceDB Table APIs
use std::collections::HashMap;
use std::path::Path;
use std::sync::Arc;
@@ -757,6 +758,8 @@ pub struct NativeTable {
// the object store wrapper to use on write path
store_wrapper: Option<Arc<dyn WrappingObjectStore>>,
storage_options: HashMap<String, String>,
// This comes from the connection options. We store here so we can pass down
// to the dataset when we recreate it (for example, in checkout_latest).
read_consistency_interval: Option<std::time::Duration>,
@@ -822,6 +825,13 @@ impl NativeTable {
None => params,
};
let storage_options = params
.store_options
.clone()
.unwrap_or_default()
.storage_options
.unwrap_or_default();
let dataset = DatasetBuilder::from_uri(uri)
.with_read_params(params)
.load()
@@ -840,6 +850,7 @@ impl NativeTable {
uri: uri.to_string(),
dataset,
store_wrapper: write_store_wrapper,
storage_options,
read_consistency_interval,
})
}
@@ -908,6 +919,13 @@ impl NativeTable {
None => params,
};
let storage_options = params
.store_params
.clone()
.unwrap_or_default()
.storage_options
.unwrap_or_default();
let dataset = Dataset::write(batches, uri, Some(params))
.await
.map_err(|e| match e {
@@ -921,6 +939,7 @@ impl NativeTable {
uri: uri.to_string(),
dataset: DatasetConsistencyWrapper::new_latest(dataset, read_consistency_interval),
store_wrapper: write_store_wrapper,
storage_options,
read_consistency_interval,
})
}
@@ -1312,7 +1331,7 @@ impl TableInternal for NativeTable {
add: AddDataBuilder<NoData>,
data: Box<dyn RecordBatchReader + Send>,
) -> Result<()> {
let lance_params = add.write_options.lance_write_params.unwrap_or(WriteParams {
let mut lance_params = add.write_options.lance_write_params.unwrap_or(WriteParams {
mode: match add.mode {
AddDataMode::Append => WriteMode::Append,
AddDataMode::Overwrite => WriteMode::Overwrite,
@@ -1320,6 +1339,18 @@ impl TableInternal for NativeTable {
..Default::default()
});
// Bring storage options from table
let storage_options = lance_params
.store_params
.get_or_insert(Default::default())
.storage_options
.get_or_insert(Default::default());
for (key, value) in self.storage_options.iter() {
if !storage_options.contains_key(key) {
storage_options.insert(key.clone(), value.clone());
}
}
// patch the params if we have a write store wrapper
let lance_params = match self.store_wrapper.clone() {
Some(wrapper) => lance_params.patch_with_store_wrapper(wrapper)?,

View File

@@ -0,0 +1,290 @@
// Copyright 2023 LanceDB Developers.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#![cfg(feature = "s3-test")]
use std::sync::Arc;
use arrow_array::{Int32Array, RecordBatch, RecordBatchIterator, StringArray};
use arrow_schema::{DataType, Field, Schema};
use aws_config::{BehaviorVersion, ConfigLoader, Region, SdkConfig};
use aws_sdk_s3::{config::Credentials, types::ServerSideEncryption, Client as S3Client};
use lancedb::Result;
const CONFIG: &[(&str, &str)] = &[
("access_key_id", "ACCESS_KEY"),
("secret_access_key", "SECRET_KEY"),
("endpoint", "http://127.0.0.1:4566"),
("allow_http", "true"),
];
async fn aws_config() -> SdkConfig {
let credentials = Credentials::new(CONFIG[0].1, CONFIG[1].1, None, None, "static");
ConfigLoader::default()
.credentials_provider(credentials)
.endpoint_url(CONFIG[2].1)
.behavior_version(BehaviorVersion::latest())
.region(Region::new("us-east-1"))
.load()
.await
}
struct S3Bucket(String);
impl S3Bucket {
async fn new(bucket: &str) -> Self {
let config = aws_config().await;
let client = S3Client::new(&config);
// In case it wasn't deleted earlier
Self::delete_bucket(client.clone(), bucket).await;
client.create_bucket().bucket(bucket).send().await.unwrap();
Self(bucket.to_string())
}
async fn delete_bucket(client: S3Client, bucket: &str) {
// Before we delete the bucket, we need to delete all objects in it
let res = client
.list_objects_v2()
.bucket(bucket)
.send()
.await
.map_err(|err| err.into_service_error());
match res {
Err(e) if e.is_no_such_bucket() => return,
Err(e) => panic!("Failed to list objects in bucket: {}", e),
_ => {}
}
let objects = res.unwrap().contents.unwrap_or_default();
for object in objects {
client
.delete_object()
.bucket(bucket)
.key(object.key.unwrap())
.send()
.await
.unwrap();
}
client.delete_bucket().bucket(bucket).send().await.unwrap();
}
}
impl Drop for S3Bucket {
fn drop(&mut self) {
let bucket_name = self.0.clone();
tokio::task::spawn(async move {
let config = aws_config().await;
let client = S3Client::new(&config);
Self::delete_bucket(client, &bucket_name).await;
});
}
}
fn test_data() -> RecordBatch {
let schema = Arc::new(Schema::new(vec![
Field::new("a", DataType::Int32, false),
Field::new("b", DataType::Utf8, false),
]));
RecordBatch::try_new(
schema.clone(),
vec![
Arc::new(Int32Array::from(vec![1, 2, 3])),
Arc::new(StringArray::from(vec!["a", "b", "c"])),
],
)
.unwrap()
}
#[tokio::test]
async fn test_minio_lifecycle() -> Result<()> {
// test create, update, drop, list on localstack minio
let bucket = S3Bucket::new("test-bucket").await;
let uri = format!("s3://{}", bucket.0);
let db = lancedb::connect(&uri)
.storage_options(CONFIG.iter().cloned())
.execute()
.await?;
let data = test_data();
let data = RecordBatchIterator::new(vec![Ok(data.clone())], data.schema());
let table = db.create_table("test_table", data).execute().await?;
let row_count = table.count_rows(None).await?;
assert_eq!(row_count, 3);
let table_names = db.table_names().execute().await?;
assert_eq!(table_names, vec!["test_table"]);
// Re-open the table
let table = db.open_table("test_table").execute().await?;
let row_count = table.count_rows(None).await?;
assert_eq!(row_count, 3);
let data = test_data();
let data = RecordBatchIterator::new(vec![Ok(data.clone())], data.schema());
table.add(data).execute().await?;
db.drop_table("test_table").await?;
Ok(())
}
struct KMSKey(String);
impl KMSKey {
async fn new() -> Self {
let config = aws_config().await;
let client = aws_sdk_kms::Client::new(&config);
let key = client
.create_key()
.description("test key")
.send()
.await
.unwrap()
.key_metadata
.unwrap()
.key_id;
Self(key)
}
}
impl Drop for KMSKey {
fn drop(&mut self) {
let key_id = self.0.clone();
tokio::task::spawn(async move {
let config = aws_config().await;
let client = aws_sdk_kms::Client::new(&config);
client
.schedule_key_deletion()
.key_id(&key_id)
.send()
.await
.unwrap();
});
}
}
async fn validate_objects_encrypted(bucket: &str, path: &str, kms_key_id: &str) {
// Get S3 client
let config = aws_config().await;
let client = S3Client::new(&config);
// list the objects are the path
let objects = client
.list_objects_v2()
.bucket(bucket)
.prefix(path)
.send()
.await
.unwrap()
.contents
.unwrap();
let mut errors = vec![];
let mut correctly_encrypted = vec![];
// For each object, call head
for object in &objects {
let head = client
.head_object()
.bucket(bucket)
.key(object.key().unwrap())
.send()
.await
.unwrap();
// Verify the object is encrypted
if head.server_side_encryption() != Some(&ServerSideEncryption::AwsKms) {
errors.push(format!("Object {} is not encrypted", object.key().unwrap()));
continue;
}
if !(head
.ssekms_key_id()
.map(|arn| arn.ends_with(kms_key_id))
.unwrap_or(false))
{
errors.push(format!(
"Object {} has wrong key id: {:?}, vs expected: {}",
object.key().unwrap(),
head.ssekms_key_id(),
kms_key_id
));
continue;
}
correctly_encrypted.push(object.key().unwrap().to_string());
}
if !errors.is_empty() {
panic!(
"{} of {} correctly encrypted: {:?}\n{} of {} not correct: {:?}",
correctly_encrypted.len(),
objects.len(),
correctly_encrypted,
errors.len(),
objects.len(),
errors
);
}
}
#[tokio::test]
async fn test_encryption() -> Result<()> {
// test encryption on localstack minio
let bucket = S3Bucket::new("test-encryption").await;
let key = KMSKey::new().await;
let uri = format!("s3://{}", bucket.0);
let db = lancedb::connect(&uri)
.storage_options(CONFIG.iter().cloned())
.execute()
.await?;
// Create a table with encryption
let data = test_data();
let data = RecordBatchIterator::new(vec![Ok(data.clone())], data.schema());
let mut builder = db.create_table("test_table", data);
for (key, value) in CONFIG {
builder = builder.storage_option(*key, *value);
}
let table = builder
.storage_option("aws_server_side_encryption", "aws:kms")
.storage_option("aws_sse_kms_key_id", &key.0)
.execute()
.await?;
validate_objects_encrypted(&bucket.0, "test_table", &key.0).await;
table.delete("a = 1").await?;
validate_objects_encrypted(&bucket.0, "test_table", &key.0).await;
// Test we can set encryption at the connection level.
let db = lancedb::connect(&uri)
.storage_options(CONFIG.iter().cloned())
.storage_option("aws_server_side_encryption", "aws:kms")
.storage_option("aws_sse_kms_key_id", &key.0)
.execute()
.await?;
let table = db.open_table("test_table").execute().await?;
let data = test_data();
let data = RecordBatchIterator::new(vec![Ok(data.clone())], data.schema());
table.add(data).execute().await?;
validate_objects_encrypted(&bucket.0, "test_table", &key.0).await;
Ok(())
}