Commit Graph

1740 Commits

Author SHA1 Message Date
Lance Release
315a24c2bc Updating package-lock.json 2025-03-21 20:03:43 +00:00
Lance Release
6dd4cf6038 Updating package-lock.json 2025-03-21 20:03:27 +00:00
Lance Release
f97e751b3c Bump version: 0.18.1 → 0.18.2-beta.0 v0.18.2-beta.0 2025-03-21 20:02:59 +00:00
Lance Release
e803a626a1 Bump version: 0.21.1 → 0.21.2-beta.0 python-v0.21.2-beta.0 2025-03-21 20:02:25 +00:00
Weston Pace
9403254442 feat: add to_query_object method (#2239)
This PR adds a `to_query_object` method to the various query builders
(except not hybrid queries yet). This makes it possible to inspect the
query that is built.

In addition this PR does some normalization between the sync and async
query paths. A few custom defaults were removed in favor of None (with
the default getting set once, in rust).

Also, the synchronous to_batches method will now actually stream results

Also, the remote API now defaults to prefiltering
2025-03-21 13:01:51 -07:00
Will Jones
b2a38ac366 fix: make pylance optional again (#2209)
The two remaining blockers were:

* A method `with_embeddings` that was deprecated a year ago
* A typecheck for `LanceDataset`
2025-03-21 11:26:32 -07:00
BubbleCal
bdb6c09c3b feat: support binary vector and IVF_FLAT in TypeScript (#2221)
resolve #2218

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2025-03-21 10:57:08 -07:00
Will Jones
2bfdef2624 ci: refactor node releases (#2223)
This PR fixes build issues associated with `aws-lc-rs`, while
simplifying the build process. Previously, we used custom scripts for
the musl and Windows ARM builds. These were complicated and prone to
breaking. This PR switches to a setup that mirrors
https://github.com/napi-rs/package-template/blob/main/.github/workflows/CI.yml.

* linux glibc and musl builds now use the Docker images provided by the
napi project
* Windows ARM build now just cross compiles from Windows x64, which
turns out to work quite well.
2025-03-21 10:56:29 -07:00
Samuel Colvin
7982d5c082 fix: correct rust install docs (#2253)
I'm pretty sure you mean `cargo add lancedb` here, `cargo install
lancedb` fails right now.
2025-03-21 10:12:53 -07:00
BubbleCal
7ff6ec7fe3 feat: upgrade to lance v0.25.0-beta.5 (#2248)
- adds `loss` into the index stats for vector index
- now `optimize` can retrain the vector index

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2025-03-21 10:12:23 -07:00
Ayush Chaurasia
ba1ded933a fix: add better check for empty results in hybrid search (#2252)
fixes: https://github.com/lancedb/lancedb/issues/2249
2025-03-21 13:05:05 +05:30
Will Jones
b595d8a579 fix(nodejs): workaround for apache-arrow null vector issue (#2244)
Fixes #2240
2025-03-20 08:07:10 -07:00
Will Jones
2a1d6d8abf ci: simplify windows builds (#2243)
We soon won't rely on cross compiling from Linux to windows, so can
remove this check. Instead, check that we can cross compile from Windows
between architectures.
2025-03-20 08:06:56 -07:00
Will Jones
440a466a13 ci: remove OpenSSL as dependency in favor of rustls (#2242)
`object_store` already hard codes `rustls` as the TLS implementation, so
we have been shipping a mix of `rustls` and `openssl`. For simplicity of
builds, we should consolidate to one, and that has to be `rustls`.
2025-03-20 08:06:45 -07:00
Ayush Chaurasia
b9afd9c860 docs: add late interaction, multi-vector guide & link example (#2231)
1/2 docs update for this week. Addesses issues from this docs epic -
https://github.com/lancedb/lancedb/issues/1476
2025-03-20 20:29:32 +05:30
Will Jones
a6b6f6a806 ci: drop vectordb support for musl, windows ARM (#2241)
vectordb is deprecated, and these platforms are particularly difficult
to maintain. Removing now to prevent further headaches.

We will keep these platforms supported on `@lancedb/lancedb`.
2025-03-19 12:23:46 -07:00
Ayush Chaurasia
ae1548b507 docs: add cloud & enterprise cta (#2235)
2/2 docs update this week
- Add cloud & enterprise CTA
- remove outdated projects/examples from landing page
2025-03-19 10:55:05 -07:00
Weston Pace
4e03ee82bc refactor: rework catalog/database options (#2213)
The `ConnectRequest` has a set of properties that only make sense for
listing databases / catalogs and a set of properties that only make
sense for remote databases.

This PR reduces all options to a single `HashMap<String, String>`. This
makes it easier to add new database / catalog implementations and makes
it clearer to users which options are applicable in which situations.

I don't believe there are any breaking changes here. The closest thing
is that I placed the `ConnectBuilder` methods `api_key`, `region`, and
`host_override` behind a `remote` feature gate. This is not strictly
needed and I could remove the feature gate but it seemed appropriate.
Since using these methods without the remote feature would have been
meaningless I don't feel this counts as a breaking change.

We could look at removing these methods entirely from the
`ConnectBuilder` (and encouraging users to use `RemoteDatabaseOptions`
instead) but I'm not sure how I feel about that.

Another approach we could take is to move these methods into a
`RemoteConnectBuilderExt` trait (and there could be a similar
`ListingConnectBuilderExt` trait to add methods for the listing database
/ catalog).

For now though my main goal is to simplify `ConnectRequest` as much as
possible (I see this being part of the key public API for database /
catalog integrations, similar to the `BaseTable`, `Catalog`, and
`Database` traits and I'd like it to be simple).
2025-03-18 10:13:59 -07:00
Weston Pace
46a6846d07 refactor: remove dataset reference from base table (#2226) 2025-03-17 06:27:33 -07:00
Will Jones
a207213358 fix: insert structs in non-alphabetical order (#2222)
Closes #2114

Starting in #1965, we no longer pass the table schema into
`pa.Table.from_pylist()`. This means PyArrow is choosing the order of
the struct subfields, and apparently it does them in alphabetical order.
This is fine in theory, since in Lance we support providing fields in
any order. However, before we pass it to Lance, we call
`pa.Table.cast()` to align column types to the table types.
`pa.Table.cast()` is strict about field order, so we need to create a
cast target schema that aligns with the input data. We were doing this
at the top-level fields, but weren't doing this in nested fields. This
PR adds support to do this for nested ones.
2025-03-13 14:46:05 -07:00
BubbleCal
6c321c694a feat: upgrade lance to 0.25.0-beta2 (#2220)
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2025-03-13 14:12:54 -07:00
Bob Liu
5c00b2904c feat: add get dataset method on NativeTable (#2021)
I want to public the dataset method from native table, then I can use
more lance method like order_by which is not exposed in the lancedb
crate.
2025-03-13 11:15:28 -07:00
Gagan Bhullar
14677d7c18 fix: metric type inconsistency (#2122)
PR fixes #2113

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2025-03-12 10:28:37 -07:00
Martin Schorfmann
dd22a379b2 fix: use Self return type annotation for abstract query builder (#2127)
Hello LanceDB team,

while developing using `lancedb` as a library I encountered a typing
problem affecting IDE hints and completions during development.

---

## Current Situation

Currently, the abstract base class `lancedb.query:LanceQueryBuilder`
uses method chaining to build up the search parameters, where the
methods have `LanceQueryBuilder` as a return type hint.

This leads to two issues:
1. Implementing subclasses of `LanceQueryBuilder` need to override
methods to modify the return type hint, even when they don't need to
change its implementation, just to ensure adequate IDE hints and
completions.
2. When using method chaining the first method directly inherited from
the abstract `LanceQueryBuilder` causes the inferred type to switch back
to `LanceQueryBuilder`. So even when the type starts from
`lancdb.table:LanceTable.search(query_type="vector", ...)` and therefor
correctly is inferred as `LanceVectorQueryBuilder`, after calling e.g.
`LanceVectorQueryBuilder.limit(...)` it is seen as the abstract
`LanceQueryBuilder` from that point on.

### Example of current situation


![image](https://github.com/user-attachments/assets/09678727-8722-43bd-a8a2-67d9b5fc0db5)

## Proposed changes

I propose to change the return type hints of the corresponding methods
(including classmethod `create()`) in the abstract base class
`LanceQueryBuilder` from `LanceQueryBuilder` to `Self`.
`Self` is already imported in the module:

```py
    if sys.version_info >= (3, 11):
        from typing import Self
    else:
        from typing_extensions import Self
```

### Further possible changes

Additionally, the implementing subclasses could also change the return
type hints to `Self` to potentially allow for further inheritance
easily.
> [!NOTE]
> **However this is not part of this pull request as of writing.**

### Example after proposed changes


![image](https://github.com/user-attachments/assets/a9aea636-e426-477a-86ee-2dad3af2876f)

---

Best regards
Martin
2025-03-12 10:08:25 -07:00
Will Jones
7747c9bcbf feat(node): parse arrow types in alterColumns() (#2208)
Previously, users could only specify new data types in `alterColumns` as
strings:

```ts
await tbl.alterColumns([
  path: "price",
  dataType: "float"
]);
```

But this has some problems:

1. It wasn't clear what were valid types
2. It was impossible to specify nested types, like lists and vector
columns.

This PR changes it to take an Arrow data type, similar to how the Python
API works. This allows casting vector types:

```ts
await tbl.alterColumns([
  {
    path: "vector",
    dataType: new arrow.FixedSizeList(
      2,
      new arrow.Field("item", new arrow.Float16(), false),
    ),
  },
]);
```

Closes #2185
2025-03-12 09:57:36 -07:00
QianZhu
c9d6fc43a6 docs: use bypass_vector_index() instead of use_index=false (#2115) 2025-03-12 09:31:09 -07:00
Martin Schorfmann
581bcfbb88 docs: fix docstring of EmbeddingFunction (#2118)
Hello LanceDB team,

---

I have fixed a discrepancy in the class docstring of
`lancedb.embeddings.base:EmbeddingFunction` and made consistency
alignments to that docstring.

### Changes made

1. The docstring referred to the abstract method
`get_source_embeddings()`.
  This method does not exist in the repository at the current state.
I have changed the mention to refer to the actual abstract method
`compute_source_embeddings()`.
2. Also, I aligned the consistency within the ordered list which is
describing the methods to be implemented by concrete embedding
functions.

---

Thank you for developing this useful library. 👍

Best regards
Martin
2025-03-12 09:30:01 -07:00
vinoyang
3750639b5f feat(rust): add connect_catalog method to support connect catalog via url (#2177) 2025-03-12 05:19:03 -07:00
Lance Release
e744d54460 Updating package-lock.json 2025-03-11 14:00:55 +00:00
Lance Release
9d1ce4b5a5 Updating package-lock.json 2025-03-11 13:15:18 +00:00
Lance Release
729ce5e542 Updating package-lock.json 2025-03-11 13:15:03 +00:00
Lance Release
de6739e7ec Bump version: 0.18.1-beta.0 → 0.18.1 v0.18.1 2025-03-11 13:14:49 +00:00
Lance Release
495216efdb Bump version: 0.18.0 → 0.18.1-beta.0 2025-03-11 13:14:44 +00:00
Lance Release
a3b45a4d00 Bump version: 0.21.1-beta.0 → 0.21.1 python-v0.21.1 2025-03-11 13:14:30 +00:00
Lance Release
c316c2f532 Bump version: 0.21.0 → 0.21.1-beta.0 2025-03-11 13:14:29 +00:00
Weston Pace
3966b16b63 fix: restore pylance as mandatory dependency (#2204)
We attempted to make pylance optional in
https://github.com/lancedb/lancedb/pull/2156 but it appears this did not
quite work. Users are unable to use lancedb from a fresh install. This
reverts the optional-ness so we can get back in a working state while we
fix the issue.
2025-03-11 06:13:52 -07:00
Lance Release
5661cc15ac Updating package-lock.json 2025-03-10 23:53:56 +00:00
Lance Release
4e7220400f Updating package-lock.json 2025-03-10 23:13:52 +00:00
Lance Release
ae4928fe77 Updating package-lock.json 2025-03-10 23:13:36 +00:00
Lance Release
e80a405dee Bump version: 0.18.0-beta.1 → 0.18.0 v0.18.0 2025-03-10 23:13:18 +00:00
Lance Release
a53e19e386 Bump version: 0.18.0-beta.0 → 0.18.0-beta.1 2025-03-10 23:13:13 +00:00
Lance Release
c0097c5f0a Bump version: 0.21.0-beta.2 → 0.21.0 python-v0.21.0 2025-03-10 23:12:56 +00:00
Lance Release
c199708e64 Bump version: 0.21.0-beta.1 → 0.21.0-beta.2 2025-03-10 23:12:56 +00:00
Weston Pace
4a47150ae7 feat: upgrade to lance 0.24.1 (#2199) 2025-03-10 15:18:37 -07:00
Wyatt Alt
f86b20a564 fix: delete tables from DDB on drop_all_tables (#2194)
Prior to this commit, issuing drop_all_tables on a listing database with
an external manifest store would delete physical tables but leave
references behind in the manifest store. The table drop would succeed,
but subsequent creation of a table with the same name would fail with a
conflict.

With this patch, the external manifest store is updated to account for
the dropped tables so that dropped table names can be reused.
2025-03-10 15:00:53 -07:00
msu-reevo
cc81f3e1a5 fix(python): typing (#2167)
@wjones127 is there a standard way you guys setup your virtualenv? I can
either relist all the dependencies in the pyright precommit section, or
specify a venv, or the user has to be in the virtual environment when
they run git commit. If the venv location was standardized or a python
manager like `uv` was used it would be easier to avoid duplicating the
pyright dependency list.

Per your suggestion, in `pyproject.toml` I added in all the passing
files to the `includes` section.

For ruff I upgraded the version and removed "TCH" which doesn't exist as
an option.

I added a `pyright_report.csv` which contains a list of all files sorted
by pyright errors ascending as a todo list to work on.

I fixed about 30 issues in `table.py` stemming from str's being passed
into methods that required a string within a set of string Literals by
extracting them into `types.py`

Can you verify in the rust bridge that the schema should be a property
and not a method here? If it's a method, then there's another place in
the code where `inner.schema` should be `inner.schema()`
``` python
class RecordBatchStream:
    @property
    def schema(self) -> pa.Schema: ...
```

Also unless the `_lancedb.pyi` file is wrong, then there is no
`__anext__` here for `__inner` when it's not an `AsyncGenerator` and
only `next` is defined:
``` python
    async def __anext__(self) -> pa.RecordBatch:
        return await self._inner.__anext__()
        if isinstance(self._inner, AsyncGenerator):
            batch = await self._inner.__anext__()
        else:
            batch = await self._inner.next()
        if batch is None:
            raise StopAsyncIteration
        return batch
```
in the else statement, `_inner` is a `RecordBatchStream`
```python
class RecordBatchStream:
    @property
    def schema(self) -> pa.Schema: ...
    async def next(self) -> Optional[pa.RecordBatch]: ...
```

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2025-03-10 09:01:23 -07:00
Weston Pace
bc49c4db82 feat: respect datafusion's batch size when running as a table provider (#2187)
Datafusion makes the batch size available as part of the `SessionState`.
We should use that to set the `max_batch_length` property in the
`QueryExecutionOptions`.
2025-03-07 05:53:36 -08:00
Weston Pace
d2eec46f17 feat: add support for streaming input to create_table (#2175)
This PR makes it possible to create a table using an asynchronous stream
of input data. Currently only a synchronous iterator is supported. There
are a number of follow-ups not yet tackled:

* Support for embedding functions (the embedding functions wrapper needs
to be re-written to be async, should be an easy lift)
* Support for async input into the remote table (the make_ipc_batch
needs to change to accept async input, leaving undone for now because I
think we want to support actual streaming uploads into the remote table
soon)
* Support for async input into the add function (pretty essential, but
it is a fairly distinct code path, so saving for a different PR)
2025-03-06 11:55:00 -08:00
Lance Release
51437bc228 Bump version: 0.21.0-beta.0 → 0.21.0-beta.1 python-v0.21.0-beta.1 2025-03-06 19:23:06 +00:00
Bert
fa53cfcfd2 feat: support modifying field metadata in lancedb python (#2178) 2025-03-04 16:58:46 -05:00