Commit Graph

28 Commits

Author SHA1 Message Date
Weston Pace
aeac9c7644 feat: add python Permutation class to mimic hugging face dataset and provide pytorch dataloader (#2725) 2025-11-06 16:15:33 -08:00
Jack Ye
ff71d7e552 feat: support shallow clone (#2653)
Support shallow cloning a dataset at a specific location to create a new
dataset, using the shallow_clone feature in Lance. Also introduce remote
`clone` API for remote tables for this functionality.
2025-09-21 21:28:40 -07:00
Jack Ye
8da74dcb37 feat: support per-request header override (#2631)
## Summary

This PR introduces a `HeaderProvider` which is called for all remote
HTTP calls to get the latest headers to inject. This is useful for
features like adding the latest auth tokens where the header provider
can auto-refresh tokens internally and each request always set the
refreshed token.

---------

Co-authored-by: Claude <noreply@anthropic.com>
2025-09-10 13:44:00 -07:00
Jack Ye
faf8973624 feat!: support multi-level namespace (#2603)
This PR adds support of multi-level namespace in a LanceDB database,
according to the Lance Namespace spec.

This allows users to create namespace inside a database connection,
perform create, drop, list, list_tables in a namespace. (other
operations like update, describe will be in a follow-up PR)

The 3 types of database connections behave like the following:
1 Local database connections will continue to have just a flat list of
tables for backwards compatibility.
2. Remote database connections will make REST API calls according to the
APIs in the Lance Namespace spec.
3. Lance Namespace connections will invoke the corresponding operations
against the specific namespace implementation which could have different
behaviors regarding these APIs.

All the table APIs now take identifier instead of name, for example
`/v1/table/{name}/create` is now `/v1/table/{id}/create`. If a table is
directly in the root namespace, the API call is identical. If the table
is in a namespace, then the full table ID should be used, with `$` as
the default delimiter (`.` is a special character and creates issues
with URL parsing so `$` is used), for example
`/v1/table/ns1$table1/create`. If a different parameter needs to be
passed in, user can configure the `id_delimiter` in client config and
that becomes a query parameter, for example
`/v1/table/ns1__table1/create?delimiter=__`

The Python and Typescript APIs are kept backwards compatible, but the
following Rust APIs are not:
1. `Connection::drop_table(&self, name: impl AsRef<str>) -> Result<()>`
is now `Connection::drop_table(&self, name: impl AsRef<str>, namespace:
&[String]) -> Result<()>`
2. `Connection::drop_all_tables(&self) -> Result<()>` is now
`Connection::drop_all_tables(&self, name: impl AsRef<str>) ->
Result<()>`
2025-08-27 12:07:55 -07:00
Will Jones
3d1f102087 feat: allow Python and Typescript users to create Sessions (#2530)
## Summary
- Exposes `Session` in Python and Typescript so users can set the
`index_cache_size_bytes` and `metadata_cache_size_bytes`
* The `Session` is attached to the `Connection`, and thus shared across
all tables in that connection.
- Adds deprecation warnings for table-level cache configuration


🤖 Generated with [Claude Code](https://claude.ai/code)

---------

Co-authored-by: Claude <noreply@anthropic.com>
2025-07-24 12:06:29 -07:00
Will Jones
b3a4efd587 fix: revert change default read_consistency_interval=5s (#2327)
This reverts commit a547c523c2 or #2281

The current implementation can cause panics and performance degradation.
I will bring this back with more testing in
https://github.com/lancedb/lancedb/pull/2311

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Documentation**
- Enhanced clarity on read consistency settings with updated
descriptions and default behavior.
- Removed outdated warnings about eventual consistency from the
troubleshooting guide.

- **Refactor**
- Streamlined the handling of the read consistency interval across
integrations, now defaulting to "None" for improved performance.
  - Simplified internal logic to offer a more consistent experience.

- **Tests**
- Updated test expectations to reflect the new default representation
for the read consistency interval.
- Removed redundant tests related to "no consistency" settings for
streamlined testing.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2025-04-14 08:48:15 -07:00
Will Jones
a547c523c2 feat!: change default read_consistency_interval=5s (#2281)
Previously, when we loaded the next version of the table, we would block
all reads with a write lock. Now, we only do that if
`read_consistency_interval=0`. Otherwise, we load the next version
asynchronously in the background. This should mean that
`read_consistency_interval > 0` won't have a meaningful impact on
latency.

Along with this change, I felt it was safe to change the default
consistency interval to 5 seconds. The current default is `None`, which
means we will **never** check for a new version by default. I think that
default is contrary to most users expectations.
2025-03-28 11:04:31 -07:00
Weston Pace
1a449fa49e refactor: rename drop_db / drop_database to drop_all_tables, expose database from connection (#2098)
If we start supporting external catalogs then "drop database" may be
misleading (and not possible). We should be more clear that this is a
utility method to drop all tables. This is also a nice chance for some
consistency cleanup as it was `drop_db` in rust, `drop_database` in
python, and non-existent in typescript.

This PR also adds a public accessor to get the database trait from a
connection.

BREAKING CHANGE: the `drop_database` / `drop_db` methods are now
deprecated.
2025-02-06 13:22:28 -08:00
Weston Pace
c269524b2f feat!: refactor ConnectionInternal into a Database trait (#2067)
This opens up the door for more custom database implementations than the
two we have today. The biggest change should be inivisble:
`ConnectionInternal` has been renamed to `Database`, made public, and
refactored

However, there are a few breaking changes. `data_storage_version` and
`enable_v2_manifest_paths` have been moved from options on
`create_table` to options for the database which are now set via
`storage_options`.

Before:
```
db = connect(uri)
tbl = db.create_table("my_table", data, data_storage_version="legacy", enable_v2_manifest_paths=True)
```

After:
```
db = connect(uri, storage_options={
  "new_table_enable_v2_manifest_paths": "true",
  "new_table_data_storage_version": "legacy"
})
tbl = db.create_table("my_table", data)
```

BREAKING CHANGE: the data_storage_version, enable_v2_manifest_paths
options have moved from options to create_table to storage_options.
BREAKING CHANGE: the use_legacy_format option has been removed,
data_storage_version has replaced it for some time now
2025-02-04 14:35:14 -08:00
Will Jones
15f8f4d627 ci: check license headers (#2076)
Based on the same workflow in Lance.
2025-01-29 08:27:07 -08:00
Will Jones
a324f4ad7a feat(node): enable logging and show full errors (#1775)
This exposes the `LANCEDB_LOG` environment variable in node, so that
users can now turn on logging.

In addition, fixes a bug where only the top-level error from Rust was
being shown. This PR makes sure the full error chain is included in the
error message. In the future, will improve this so the error chain is
set on the [cause](https://nodejs.org/api/errors.html#errorcause)
property of JS errors https://github.com/lancedb/lancedb/issues/1779

Fixes #1774
2024-10-29 15:13:34 -07:00
Will Jones
f3b6a1f55b feat(node): bind remote SDK to rust implementation (#1730)
Closes [#2509](https://github.com/lancedb/sophon/issues/2509)

This is the Node.js analogue of #1700
2024-10-09 11:46:27 -06:00
LuQQiu
abeaae3d80 feat!: upgrade Lance to 0.18.0 (#1657)
BREAKING CHANGE: default file format changed to Lance v2.0.

Upgrade Lance to 0.18.0

Change notes: https://github.com/lancedb/lance/releases/tag/v0.18.0
2024-09-19 10:50:26 -07:00
Will Jones
2a6586d6fb feat: add flag to enable faster manifest paths (#1612)
The new V2 manifest path scheme makes discovering the latest version of
a table constant time on object stores, regardless of the number of
versions in the table. See benchmarks in the PR here:
https://github.com/lancedb/lance/pull/2798

Closes #1583
2024-09-09 11:34:36 -07:00
Lei Xu
2bdf0a02f9 feat!: upgrade lance to 0.16 (#1519) 2024-08-07 13:15:22 -07:00
Cory Grinstead
b8a1719174 feat(nodejs): catch unwinds in node bindings (#1414)
this bumps napi version to 2.16 which contains a few bug fixes.
Additionally, it adds `catch_unwind` to any method that may
unintentionally panic.

`catch_unwind` will unwind the panics and return a regular JS error
instead of panicking.
2024-07-01 09:28:10 -05:00
Cory Grinstead
b3e5ac6d2a feat(nodejs): feature parity [2/N] - add table.name and lancedb.connect({args}) (#1380)
depends on https://github.com/lancedb/lancedb/pull/1378

see proper diff here
https://github.com/universalmind303/lancedb/compare/remote-table-node...universalmind303:lancedb:table-name
2024-06-21 11:38:26 -05:00
Weston Pace
d5586c9c32 feat: make it possible to opt in to using the v2 format (#1352)
This also exposed the max_batch_length configuration option in
python/node (it was needed to verify if we are actually in v2 mode or
not)
2024-06-04 21:52:14 -07:00
Weston Pace
3d7c48feca feat: allow the index_cache_size to be configured when opening a table (#1245)
This was already configurable in the rust API but it wasn't actually
being passed down to the underlying dataset. I added this option to both
the async python API and the new nodejs API.

I also added this option to the synchronous python API.

I did not add the option to vectordb.
2024-04-26 13:42:02 -07:00
Will Jones
1d23af213b feat: expose storage options in LanceDB (#1204)
Exposes `storage_options` in LanceDB. This is provided for Python async,
Node `lancedb`, and Node `vectordb` (and Rust of course). Python
synchronous is omitted because it's not compatible with the PyArrow
filesystems we use there currently. In the future, we will move the sync
API to wrap the async one, and then it will get support for
`storage_options`.

1. Fixes #1168
2. Closes #1165
3. Closes #1082
4. Closes #439
5. Closes #897
6. Closes #642
7. Closes #281
8. Closes #114
9. Closes #990
10. Deprecating `awsCredentials` and `awsRegion`. Users are encouraged
to use `storageOptions` instead.
2024-04-10 10:12:04 -07:00
Weston Pace
968c62cb8f feat: introduce ArrowNative wrapper struct for adding data that is already a RecordBatchReader (#1139)
In
2de226220b
I added a new `IntoArrow` trait for adding data into a table.
Unfortunately, it seems my approach for implementing the trait for
"things that are already record batch readers" was flawed. This PR
corrects that flaw and, conveniently, removes the need to box readers at
all (though it is ok if you do).
2024-04-05 16:33:37 -07:00
Weston Pace
73c69a6b9a feat: page_token / limit to native table_names function. Use async table_names function from sync table_names function (#1059)
The synchronous table_names function in python lancedb relies on arrow's
filesystem which behaves slightly differently than object_store. As a
result, the function would not work properly in GCS.

However, the async table_names function uses object_store directly and
thus is accurate. In most cases we can fallback to using the async
table_names function and so this PR does so. The one case we cannot is
if the user is already in an async context (we can't start a new async
event loop). Soon, we can just redirect those users to use the async API
instead of the sync API and so that case will eventually go away. For
now, we fallback to the old behavior.
2024-04-05 16:31:45 -07:00
Weston Pace
8033a44d68 feat: add support for add to async python API (#1037)
In order to add support for `add` we needed to migrate the rust `Table`
trait to a `Table` struct and `TableInternal` trait (similar to the way
the connection is designed).

While doing this we also cleaned up some inconsistencies between the
SDKs:

* Python and Node are garbage collected languages and it can be
difficult to trigger something to be freed. The convention for these
languages is to have some kind of close method. I added a close method
to both the table and connection which will drop the underlying rust
object.
* We made significant improvements to table creation in
cc5f2136a6
for the `node` SDK. I copied these changes to the `nodejs` SDK.
* The nodejs tables were using fs to create tmp directories and these
were not getting cleaned up. This is mostly harmless but annoying and so
I changed it up a bit to ensure we cleanup tmp directories.
* ~~countRows in the node SDK was returning `bigint`. I changed it to
return `number`~~ (this actually happened in a previous PR)
* Tables and connections now implement `std::fmt::Display` which is
hooked into python's `__repr__`. Node has no concept of a regular "to
string" function and so I added a `display` method.
* Python method signatures are changing so that optional parameters are
always `Optional[foo] = None` instead of something like `foo = False`.
This is because we want those defaults to be in rust whenever possible
(though we still need to mention the default in documentation).
* I changed the python `AsyncConnection/AsyncTable` classes from
abstract classes with a single implementation to just classes because we
no longer have the remote implementation in python.

Note: this does NOT add the `add` function to the remote table. This PR
was already large enough, and the remote implementation is unique
enough, that I am going to do all the remote stuff at a later date (we
should have the structure in place and correct so there shouldn't be any
refactor concerns)

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2024-04-05 16:31:36 -07:00
Weston Pace
4299f719ec feat: port create_table to the async python API and the remote rust API (#1031)
I've also started `ASYNC_MIGRATION.MD` to keep track of the breaking
changes from sync to async python.
2024-04-05 16:31:36 -07:00
Weston Pace
2163502b31 refactor: rename the rust crate from vectordb to lancedb (#1012)
This also renames the new experimental node package to lancedb. The
classic node package remains named vectordb.

The goal here is to avoid introducing piecemeal breaking changes to the
vectordb crate. Instead, once the new API is stabilized, we will
officially release the lancedb crate and deprecate the vectordb crate.
The same pattern will eventually happen with the npm package vectordb.
2024-04-05 16:30:40 -07:00
Will Jones
c5b0934bfb feat(node): add read_consistency_interval to Node and Rust (#1002)
This PR adds the same consistency semantics as was added in #828. It
*does not* add the same lazy-loading of tables, since that breaks some
existing tests.

This closes #998.

---------

Co-authored-by: Weston Pace <weston.pace@gmail.com>
2024-04-05 16:30:40 -07:00
Weston Pace
cbc0c439ef refactor: rust vectordb API stabilization of the Connection trait (#993)
This is the start of a more comprehensive refactor and stabilization of
the Rust API. The `Connection` trait is cleaned up to not require
`lance` and to match the `Connection` trait in other APIs. In addition,
the concrete implementation `Database` is hidden.

BREAKING CHANGE: The struct `crate::connection::Database` is now gone.
Several examples opened a connection using `Database::connect` or
`Database::connect_with_params`. Users should now use
`vectordb::connect`.

BREAKING CHANGE: The `connect`, `create_table`, and `open_table` methods
now all return a builder object. This means that a call like
`conn.open_table(..., opt1, opt2)` will now become
`conn.open_table(...).opt1(opt1).opt2(opt2).execute()` In addition, the
structure of options has changed slightly. However, no options
capability has been removed.

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2024-04-05 16:30:40 -07:00
Lei Xu
efcaa433fe feat: rework NodeJS SDK using napi (#847)
Use Napi to write a Node.js SDK that follows Polars for better
maintainability, while keeping most of the logic in Rust.
2024-04-05 16:27:51 -07:00