Allows users to pass multiple query vectors as part of a single query
plan. This just runs the queries in parallel without any further
optimization. It's mostly a convenience.
Previously, I think this was only handled by the sync Python remote API.
This makes it common across all SDKs.
Closes https://github.com/lancedb/lancedb/issues/1803
```python
>>> import lancedb
>>> import asyncio
>>>
>>> async def main():
...     db = await lancedb.connect_async("./demo")
...     table = await db.create_table("demo", [{"id": 1, "vector": [1, 2, 3]}, {"id": 2, "vector": [4, 5, 6]}], mode="overwrite")
...     return await table.query().nearest_to([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [4.0, 5.0, 6.0]]).limit(1).to_pandas()
...
>>> asyncio.run(main())
   query_index  id           vector  _distance
0            2   2  [4.0, 5.0, 6.0]        0.0
1            1   2  [4.0, 5.0, 6.0]        0.0
2            0   1  [1.0, 2.0, 3.0]        0.0
```
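As a rough illustration of what "runs the queries in parallel without any further optimization" means, the sketch below fans out one brute-force search per query vector with `asyncio.gather` and tags each hit with a `query_index`. The helpers `search_one` and `search_many` and the in-memory `ROWS` are hypothetical stand-ins, not part of the LanceDB API.

```python
import asyncio

# Hypothetical in-memory data standing in for a table.
ROWS = [
    {"id": 1, "vector": [1.0, 2.0, 3.0]},
    {"id": 2, "vector": [4.0, 5.0, 6.0]},
]

def l2(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

async def search_one(query, limit):
    """Hypothetical single-vector search: brute-force scan, closest first."""
    scored = [{**row, "_distance": l2(row["vector"], query)} for row in ROWS]
    return sorted(scored, key=lambda r: r["_distance"])[:limit]

async def search_many(queries, limit):
    """Run one independent search per query vector and tag each hit."""
    per_query = await asyncio.gather(*(search_one(q, limit) for q in queries))
    return [
        {"query_index": i, **hit}
        for i, hits in enumerate(per_query)
        for hit in hits
    ]

results = asyncio.run(search_many([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], limit=1))
```

Each query is answered independently, so nothing is shared between them; the only extra work is attaching the index of the originating query vector.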
This is done as setup for a PR that will fix the OpenAI dependency
issue.
* [x] FTS examples
* [x] Setup mock openai
* [x] Ran `npm audit fix`
* [x] sentences embeddings test
* [x] Double check formatting of docs examples
Sometimes it is acceptable to users to only search indexed data and skip
any new un-indexed data. For example, the un-indexed data may be indexed
shortly and they don't mind the delay. In these cases, we can save a lot
of CPU time in search, and provide better latency. Users can activate
this on queries using `fast_search()`.
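The trade-off can be pictured with a toy model. The sketch below is only a conceptual illustration: `ToyTable` and its `fast_search` keyword are hypothetical, and the "index" is just a list of rows. It is not how LanceDB implements the feature.

```python
def l2(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

class ToyTable:
    """Hypothetical table: some rows are covered by an index, newer ones are not."""

    def __init__(self, indexed_rows, unindexed_rows):
        self.indexed_rows = indexed_rows
        self.unindexed_rows = unindexed_rows

    def search(self, query, limit, fast_search=False):
        # The indexed rows stand in for an index lookup.
        candidates = list(self.indexed_rows)
        if not fast_search:
            # Default in this toy: also scan rows the index has not seen yet.
            candidates += self.unindexed_rows
        return sorted(candidates, key=lambda r: l2(r["vector"], query))[:limit]

table = ToyTable(
    indexed_rows=[{"id": 1, "vector": [0.0, 0.0]}, {"id": 2, "vector": [5.0, 5.0]}],
    unindexed_rows=[{"id": 3, "vector": [1.0, 1.0]}],
)

full = table.search([1.0, 1.0], limit=1)                    # sees the new row
fast = table.search([1.0, 1.0], limit=1, fast_search=True)  # skips the new row
```

In this toy, the fast path does less work but can miss a closer match that has not been indexed yet, which is the delay the description says some users are willing to accept.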
BREAKING CHANGE: the return value of the `index_stats` method has changed
and all `index_stats` APIs now take the index name instead of a UUID. Also,
several deprecated index statistics methods were removed.
* Removes deprecated methods for individual index statistics
* Aligns public `IndexStatistics` struct with API response from LanceDB
Cloud.
* Implements `index_stats` for remote Rust SDK and Python async API.
The new V2 manifest path scheme makes discovering the latest version of
a table constant time on object stores, regardless of the number of
versions in the table. See benchmarks in the PR here:
https://github.com/lancedb/lance/pull/2798
Closes #1583
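To my understanding, the V2 scheme names each manifest after `u64::MAX - version`, zero-padded to 20 digits, so a lexicographically ordered listing returns the newest version first. Treat the exact file-name format below as an assumption rather than a specification; the function names are illustrative only.

```python
U64_MAX = 2**64 - 1

def v2_manifest_name(version: int) -> str:
    """Assumed V2-style name: inverted, zero-padded version number."""
    return f"{U64_MAX - version:020d}.manifest"

def version_from_name(name: str) -> int:
    """Recover the version number from an assumed V2-style name."""
    return U64_MAX - int(name.split(".")[0])

# Object stores such as S3 list keys in lexicographic order, so with inverted
# names the first key in a listing belongs to the latest version.
names = sorted(v2_manifest_name(v) for v in [1, 2, 10, 300])
latest = version_from_name(names[0])
```

Because the newest manifest sorts first, finding it takes a single listing request that reads one key, no matter how many versions exist.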
Lance now supports FTS, so add it to the LanceDB Python, TypeScript, and
Rust SDKs.
For Python, we still use tantivy-based FTS by default because the Lance
FTS index currently lacks some of tantivy's features.
For Python:
- Support creating a Lance-based FTS index
- Support specifying columns for full text search (only available for
the Lance-based FTS index)
For TypeScript:
- Change the search method so that it can accept both a string and a vector
- Support full text search
For Rust:
- Support full text search
Other changes:
- Update the FTS doc
BREAKING CHANGE:
- For Python, this renames the attached score column of FTS from "score"
to "_score". This could be a breaking change for users that rely on the
scores.
---------
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Correct the timeout argument to `connect` in the @lancedb/lancedb node SDK.
`RemoteConnectionOptions` specified two fields, `connectionTimeout` and
`readTimeout`, probably to be consistent with the Python SDK, but only
`connectionTimeout` was being used, and it was passed to axios in such a
way that it covered the entire remote request (connect + read). This
change adds a single parameter `timeout`, which makes the args to
`connect` consistent with the legacy vectordb SDK.
BREAKING CHANGE: This is a breaking change b/c users who would have
previously been passing `connectionTimeout` will now be expected to pass
`timeout`.
Previously, if you tried to install both `vectordb` and `@lancedb/lancedb`,
you would get a peer dependency issue due to `vectordb` requiring
`14.0.2` and `@lancedb/lancedb` requiring `15.0.0`. Now
`@lancedb/lancedb` should just work with any arrow version 13-17.
So this was annoying me when writing the docs.
For a `search` query, one needed to chain `async` calls:
```ts
const res = await (await tbl.search("greetings")).toArray()
```
Now the promise will be deferred until the query is collected, leading
to a more functional API:
```ts
const res = await tbl.search("greetings").toArray()
```
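The deferral idea can be sketched in Python as an analogue. This is not the SDK's code: `LazyQuery`, `Table`, and `_embed_and_search` are hypothetical names chosen for illustration. The builder is returned synchronously and holds the pending async work, which is awaited only when the results are collected.

```python
import asyncio

class LazyQuery:
    """Hypothetical builder that holds unresolved async work until collected."""

    def __init__(self, pending):
        self._pending = pending  # awaitable that eventually produces the rows
        self._limit = None

    def limit(self, n):
        # Chaining stays synchronous; nothing is awaited here.
        self._limit = n
        return self

    async def to_list(self):
        # The deferred work is awaited only when results are collected.
        rows = await self._pending
        return rows if self._limit is None else rows[: self._limit]

class Table:
    async def _embed_and_search(self, text):
        await asyncio.sleep(0)  # stands in for async embedding + search
        return [{"text": text, "rank": i} for i in range(3)]

    def search(self, text):
        # Returns a builder immediately instead of a promise of a builder.
        return LazyQuery(self._embed_and_search(text))

async def main():
    # A single await at the end of the chain, as in the "after" example.
    return await Table().search("greetings").limit(2).to_list()

rows = asyncio.run(main())
```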
It's useful to see the underlying query plan for debugging purposes.
This exposes LanceScanner's `explain_plan` function. Addresses #1288
---------
Co-authored-by: Will Jones <willjones127@gmail.com>
Previously the `registry` would return `undefined | EmbeddingFunction`
even for built-in functions such as "openai".
Now it'll return the correct type for `getRegistry().get("openai")`
as well as pass in the correct options type to `create`.
### before
```ts
const options = {model: 'not-a-real-model'} as const
// this'd compile just fine, but result in runtime error
const openai: EmbeddingFunction | undefined = getRegistry().get("openai").create(options)
// this'd also compile fine
const openai2: EmbeddingFunction | undefined = getRegistry().get("openai").create({MODEL: ''})
```
### after
```ts
const options = {model: 'not-a-real-model'} as const
const openai: OpenAIEmbeddingFunction = getRegistry().get("openai").create(options)
// Type '"not-a-real-model"' is not assignable to type '"text-embedding-ada-002" | "text-embedding-3-large" | "text-embedding-3-small" | undefined'
```
This allows users to specify URIs like:
```
s3+ddb://my_bucket/path?ddbTableName=myCommitTable
```
and it will support concurrent writes in S3.
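For readers unfamiliar with the URI shape, the sketch below splits the example with Python's standard `urllib.parse`. It only shows which part is the bucket, the path, and the DynamoDB table name; it makes no claim about how LanceDB parses the URI internally, and the variable names are illustrative.

```python
from urllib.parse import urlparse, parse_qs

uri = "s3+ddb://my_bucket/path?ddbTableName=myCommitTable"
parsed = urlparse(uri)

scheme = parsed.scheme            # "s3+ddb": S3 storage plus a DynamoDB commit table
bucket = parsed.netloc            # S3 bucket name
prefix = parsed.path.lstrip("/")  # key prefix inside the bucket
ddb_table = parse_qs(parsed.query)["ddbTableName"][0]  # table named in the query string
```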
* [x] Add dynamodb integration tests
* [x] Add modifications to get it working in Python sync API
* [x] Added section in documentation describing how to configure.
Closes #534
---------
Co-authored-by: universalmind303 <cory.grinstead@gmail.com>
While adding some more docs & examples for the new JS SDK, I ran across
a few compatibility issues when using different arrow versions. This
should fix those issues.
The optimize function is pretty crucial for getting good performance
when building a large-scale dataset, but it was only exposed in Rust
(many sync Python users are probably doing this via `to_lance` today).
This PR adds the optimize function to nodejs and to Python.
I left the function marked experimental because I think there will
likely be changes to optimization (e.g. if we add features like
"optimize on write"). I also only exposed the `cleanup_older_than`
configuration parameter since this one is very commonly used and the
rest have sensible defaults and we don't really know why we would
recommend different values for these defaults anyways.
I've been noticing a lot of friction with the current toolchain for
'/nodejs'. Particularly with the usage of eslint and prettier.
[Biome](https://biomejs.dev/) is an all-in-one formatter & linter that
replaces the need for two different tools that can potentially clash with
one another.
I've been using it in the
[nodejs-polars](https://github.com/pola-rs/nodejs-polars) repo for quite
some time & have found it much more pleasant to work with.
---
One other small change included in this PR:
Use [ts-jest](https://www.npmjs.com/package/ts-jest) so we can run our
tests without having to rebuild the TypeScript code first.