Commit Graph

2470 Commits

Author SHA1 Message Date
dependabot[bot]
c9dc84087c chore(deps): bump polars-arrow from 0.39.2 to 0.52.0
Bumps [polars-arrow](https://github.com/pola-rs/polars) from 0.39.2 to 0.52.0.
- [Release notes](https://github.com/pola-rs/polars/releases)
- [Commits](https://github.com/pola-rs/polars/compare/rs-0.39.2...rs-0.52.0)

---
updated-dependencies:
- dependency-name: polars-arrow
  dependency-version: 0.52.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-05-06 01:38:07 +00:00
LanceDB Robot
47a34f5cca chore: update lance dependency to v7.0.0-beta.4 (#3348)
## Summary
- Update Lance Rust dependencies to `v7.0.0-beta.4` using
`ci/set_lance_version.py`.
- Update the Java `lance-core` dependency property to `7.0.0-beta.4`.
- Align LanceDB with dependency updates required by Lance 7, including
`object_store` 0.13 API compatibility.

Triggering tag:
https://github.com/lance-format/lance/releases/tag/v7.0.0-beta.4

## Verification
- `cargo clippy --workspace --tests --all-features -- -D warnings`
- `cargo fmt --all`
2026-05-05 18:36:39 -07:00
Weston Pace
a17c241e86 feat(python): make Permutation fork-safe for PyTorch DataLoader workers (#3339)
## Summary

PyTorch's `DataLoader` uses fork-based multiprocessing by default on
Linux, but threads do not survive `fork()`. LanceDB's Python bindings
drive async work through two threaded layers, both of which become inert
in a forked child:

- `BackgroundEventLoop` runs an asyncio loop on a Python
`threading.Thread`.
- `pyo3-async-runtimes::tokio` holds a global multi-threaded tokio
runtime whose worker threads also die on fork — and its runtime lives in
a `OnceLock` that cannot be replaced after first use.

As a result, any `Permutation` (or other async API) used inside a
fork-based `DataLoader` worker hangs indefinitely. This PR makes both
layers fork-safe so `Permutation` works as a `torch.utils.data.Dataset`
with `num_workers > 0`.

## Approach

### Rust — new `python/src/runtime.rs`

Mirrors the pattern used in [Lance's Python
bindings](456198cd6f/python/src/lib.rs (L139)),
adapted for the async-bridge use case.

- `LanceRuntime` implements `pyo3_async_runtimes::generic::Runtime +
ContextExt`, backed by an `AtomicPtr<tokio::runtime::Runtime>` we own
(sidestepping `pyo3-async-runtimes`'s frozen `OnceLock` global).
- A `pthread_atfork(after_in_child)` handler nulls the pointer; the next
`spawn` rebuilds the runtime in the child. The previous runtime is
intentionally **leaked** — calling `Drop` would try to join now-dead
worker threads and hang.
- `runtime::future_into_py` is a drop-in for
`pyo3_async_runtimes::tokio::future_into_py`. All ~80 call sites in
`arrow.rs` / `connection.rs` / `permutation.rs` / `query.rs` /
`table.rs` are updated to route through it.
- `python/Cargo.toml` adds `libc = "0.2"` and the tokio
`rt-multi-thread` feature.

### Python — `lancedb/background_loop.py`

- Refactors `BackgroundEventLoop.__init__` to a reusable `_start()`
method.
- An `os.register_at_fork(after_in_child=…)` hook calls `LOOP._start()`
to give the singleton a fresh asyncio loop and thread **in place**. This
matters because the rest of the codebase imports `LOOP` via `from
.background_loop import LOOP` — rebinding the module attribute would
leave those references holding the dead loop.

### Python — `lancedb/__init__.py`

Removes the `__warn_on_fork` pre-fork warning (and the now-unused
`import warnings`). Fork is supported.

## Test plan

- [x] New `test_permutation_dataloader_fork_workers` in
`python/tests/test_torch.py`: runs a `Permutation` through
`torch.utils.data.DataLoader(num_workers=2,
multiprocessing_context="fork")` inside a spawn-isolated child with a
30s hang detector. **Pre-fix**: timed out at 36s. **Post-fix**: passes
in ~3.6s.
- [x] New `test_remote_connection_after_fork` in
`python/tests/test_remote_db.py`: forks a child that creates a fresh
`lancedb.connect(...)` against a mock HTTP server and calls
`table_names()`; passes in <1s, validates the runtime reset is
sufficient for fresh remote clients.
- [x] All 62 tests in `test_torch.py` + `test_permutation.py` pass.
- [x] All 35 tests in `test_remote_db.py` pass.
- [x] `test_table.py` (87) + `test_db.py` + `test_query.py` (157, minus
one unrelated `sentence_transformers` import skip) — 244 passing.
- [x] `cargo clippy -p lancedb-python --tests` clean.
- [x] `cargo fmt`, `ruff check`, `ruff format` all clean.

## Known limitation (follow-up)

This PR makes a **freshly-built** `lancedb.connect(...)` work in a
forked child. An **inherited** `Connection` from the parent still
carries an inherited `reqwest::Client` whose hyper connection pool
references socket FDs and TCP/TLS state shared with the parent — using
it from the child after fork is unsafe (especially with HTTP/1.1
keep-alive). The recommended pattern for fork-based `DataLoader` workers
that hit a remote DB is to construct a new connection inside the worker.
Auto-clearing inherited HTTP client pools on fork would require tracking
live `Connection` instances in `lancedb` core and is left for a
follow-up PR.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:44:10 -07:00
Weston Pace
1fc23e5473 fix(python): make Permutation picklable for PyTorch multiprocessing (#3335)
## Summary

When PyTorch is used with multiprocessing in `spawn` mode, the
Permutation needs to be pickled. It could not be pickled because
`Table` and `Connection` are not serializable. This PR adds pickle
support to Permutation without adding general pickle support to `Table`
or `Connection`. To add general support we probably need to start by
adding serialization in the namespace client.

In the meantime this PR enables pickling by adding special cases for:

 * In-memory tables (just serialize as Arrow IPC)
 * Native tables (serialize the URI)

If a user is not using one of the above cases (e.g. using a remote
connection) then they will need to provide a connection factory that can
be pickled.
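
A rough sketch of the "serialize the URI" idea for native tables is shown below. Every name here (`FakeTable`, `PermutationSketch`, the example path) is hypothetical and only stands in for the real classes; the actual implementation in this PR may differ.

```python
import pickle
import threading


class FakeTable:
    """Hypothetical stand-in for a native table handle that cannot be pickled."""

    def __init__(self, uri):
        self.uri = uri
        self._lock = threading.Lock()  # lock objects are not picklable


class PermutationSketch:
    """Hypothetical wrapper that pickles the table's URI instead of the handle."""

    def __init__(self, table):
        self.table = table

    def __getstate__(self):
        # Serialize only what is needed to reopen the table.
        return {"uri": self.table.uri}

    def __setstate__(self, state):
        # Reopen the table in the receiving (spawned) process.
        self.table = FakeTable(state["uri"])


original = PermutationSketch(FakeTable("/tmp/data.lance"))
restored = pickle.loads(pickle.dumps(original))
```

The wrapper round-trips through `pickle` even though the table handle itself does not, which is the property a `spawn`-based worker needs.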

## Breaking change

`PermutationBuilder.persist(...)` is removed from the Python bindings;
the permutation table is now always in-memory. The underlying Rust
`PermutationBuilder::persist` API is untouched and can be re-exposed
later if needed. It probably won't make sense to do that until we have a
way to serialize `Table` and `Connection`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 21:37:58 -07:00
qingfeng-occ
87b831bcae fix(node): remove redundant postbuild:release script to fix build failure (#3285)
The `build:release` command already outputs the `*.node` files directly
to the `dist/` directory via the `--output-dir dist` flag.

Therefore, the `postbuild:release` script, which attempts to copy
`*.node` files from the `lancedb/` source directory, fails with a "no
such file or directory" error because the source files do not exist
there.

This commit removes the redundant `postbuild:release` script to resolve
the build failure.

fix #3284

Signed-off-by: qingfeng-occ <qing.feng@zte.com.cn>
2026-05-04 09:37:18 -07:00
Nitesh Yadav
59db036118 fix(python): add missing space in hybrid query error message (#3340)
Hi, the hybrid query error message looks like it could use a space, so
I added it.

```python
def _validate_query(self, query, vector=None, text=None):
    if query is not None and (vector is not None or text is not None):
        raise ValueError(
            "You can either provide a string query in search() method"
            "or set `vector()` and `text()` explicitly for hybrid search."
            "But not both."
        )
```
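
For context, Python joins adjacent string literals with nothing in between, so a space has to be written explicitly at the end of one literal or the start of the next. A small illustration with shortened, made-up messages (the variable names are not from the codebase):

```python
# Adjacent string literals are concatenated with no separator.
without_space = (
    "provide a string query in search() method"
    "or set `vector()` and `text()` explicitly"
)

# A trailing space inside the first literal keeps the words apart.
with_space = (
    "provide a string query in search() method "
    "or set `vector()` and `text()` explicitly"
)
```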
2026-05-02 15:51:00 -07:00
Lance Release
c091243d5b Bump version: 0.28.0-beta.10 → 0.28.0-beta.11 2026-04-29 17:53:49 +00:00
Lance Release
a2aea7b4e5 Bump version: 0.31.0-beta.10 → 0.31.0-beta.11 python-v0.31.0-beta.11 2026-04-29 17:53:22 +00:00
LanceDB Robot
4a5341edb1 chore: update lance dependency to v6.0.0-beta.7 (#3334)
## Summary
- Update Lance Rust dependencies to `6.0.0-beta.7` using
`ci/set_lance_version.py`.
- Update Java `lance-core.version` to `6.0.0-beta.7`.
- Align Arrow/DataFusion/PyO3 dependency versions and apply required
compatibility fixes for the Lance upgrade.

Triggering tag:
[v6.0.0-beta.7](https://github.com/lance-format/lance/releases/tag/v6.0.0-beta.7)

## Verification
- `cargo clippy --workspace --tests --all-features -- -D warnings`
- `cargo fmt --all`
2026-04-29 10:52:25 -07:00
Jack Ye
25dfe2cfd4 feat: add manifest-enabled directory namespace mode (#3332)
Adds manifest_enabled for local/native connections so directory
namespace manifests can be the source of truth, including migration from
directory listing and Azure credential vending feature wiring. Also
exposes the option through Rust, Python, and Node bindings with focused
validation.
2026-04-29 09:22:06 -07:00
Lance Release
4dcd7f4314 Bump version: 0.28.0-beta.9 → 0.28.0-beta.10 2026-04-28 13:29:26 +00:00
Lance Release
2e36cd9dad Bump version: 0.31.0-beta.9 → 0.31.0-beta.10 python-v0.31.0-beta.10 2026-04-28 13:29:00 +00:00
Weston Pace
f31e27768a fix: address RUSTSEC-2026-0104 cargo-deny advisory (#3326)
## Summary

- Update `rustls-webpki` 0.103.10 → 0.103.13 to fix RUSTSEC-2026-0104
(reachable panic in CRL parsing)
- Add advisory ignore for the legacy `rustls-webpki` 0.101.7 copy pinned
to the aws-smithy/rustls 0.21 chain (same chain already exempted for
RUSTSEC-2026-0098/0099)

Fixes the `deny` CI job failure seen in #3325.

## Test plan

- [x] `cargo deny check advisories` passes locally

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-27 17:56:10 -07:00
LanceDB Robot
b84150a53e chore: update lance dependency to v6.0.0-beta.4 (#3325)
## Summary

- Updates Lance Rust dependencies to `6.0.0-beta.4` using
`ci/set_lance_version.py`.
- Updates the Java `lance-core.version` property to `6.0.0-beta.4`.
- Triggering Lance tag:
https://github.com/lance-format/lance/releases/tag/v6.0.0-beta.4

## Verification

- `cargo clippy --workspace --tests --all-features -- -D warnings`
- `cargo fmt --all`
2026-04-27 15:13:07 -07:00
Will Jones
d135c18db6 ci: add cargo-deny configuration and CI check (#3307)
Adds a `deny.toml` at the workspace root and a `deny` CI job that runs
`cargo deny check` on every PR. Catches yanked crates, license drift,
banned or wildcard dependencies, unapproved sources, and new RUSTSEC
advisories.

As part of wiring this up:

- Updated `aws-lc-rs` 1.13.0 → 1.16.3 / `aws-lc-sys` 0.28.0 → 0.40.0 to
  clear four 2026 AWS-LC advisories (timing side-channel, PKCS7 bypass,
  CRL scope). Removed the `=0.28.0` workaround pin; the original build
  failure no longer reproduces.
- Updated `bytes`, `zlib-rs`, `rand`, `rustls-webpki`, `lz4_flex` to
  clear their current advisories.
- Marked `lancedb-nodejs` and `lancedb-python` as `publish = false` and
  pinned `lzma-sys` from `*` to `0.1` so `bans.wildcards = "deny"` can
  be enforced.

10 remaining advisories have no safe upgrade available (transitive via
opendal, lance, datafusion, async-openai, aws-sdk on the legacy rustls
0.21 chain). Each is ignored in `deny.toml` with a per-entry rationale
and a link to the RUSTSEC advisory. New advisories still fail CI.

Fixes #3297

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 20:53:15 -07:00
Will Jones
ef399de092 ci: switch PyPI publish to OIDC trusted publishing (#3302)
## Summary

- Replaces `LANCEDB_PYPI_API_TOKEN` (long-lived token) with OIDC trusted
publishing via `pypa/gh-action-pypi-publish`
- Adds `id-token: write` permission to linux/mac/windows jobs
- Removes `twine`-based upload and the `pypi_token` input from
`upload_wheel` composite action
- Enables PEP 740 Sigstore attestations on published wheels as a bonus

After merging, rotate/revoke the `LANCEDB_PYPI_API_TOKEN` secret.

Closes #3294

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 20:53:06 -07:00
Will Jones
0d767abd0e ci: add Dependabot config for shipped Rust binaries (#3300)
Adds `.github/dependabot.yml` enabling weekly cargo update PRs for the
root workspace, which produces the Rust binaries we ship: the Node.js
and Python native extensions. The `rust/lancedb` library crate shares
the same lockfile — its consumers pick versions themselves, but bumping
transitive deps here keeps the shipped binaries current.

Also removes the misleading `exclude = ["python"]` line from the root
`Cargo.toml`: `python` is listed in `members`, and `cargo metadata`
confirms it's a workspace member, so the exclude was dead code that
implied the opposite.

Minor/patch updates are grouped to reduce PR noise.

Part of #3292. Only covers the cargo ecosystem; pip, npm, and
github-actions can follow.
2026-04-24 20:52:54 -07:00
Jack Ye
a92ae0ded5 fix: enable hostname verification by default (#3304)
## Summary

- make `TlsConfig::default()` enable hostname verification by default
- align the Rust default with the documented Python and Node behavior
- update the Rust unit test to lock in the safe default
2026-04-21 08:39:03 -07:00
Xuanwo
c54888a83a refactor(python): remove legacy tantivy FTS support (#3282)
This follows the Rust-side Tantivy removal by deleting the remaining
Python Tantivy runtime, tests, and packaging references.

It also turns the legacy Python-only Tantivy parameters into explicit
errors and stops reading legacy `_indices/fts` directories so Python FTS
is fully native-only.
2026-04-20 09:28:45 +08:00
Will Jones
ba6c44abc9 ci: add top-level permissions to GHA workflows (#3255)
Adds `permissions: contents: read` to the 10 workflows that had no
top-level permissions block. Workflows that already declared
permissions, or individual jobs that need elevated permissions (`issues:
write`, `pull-requests: write`, `contents: write`), are left unchanged.

Affected workflows: `dev.yml`, `java-publish.yml`, `java.yml`,
`license-header-check.yml`, `nodejs.yml`, `pypi-publish.yml`,
`python.yml`, `rust.yml`, `update_package_lock_run.yml`,
`update_package_lock_run_nodejs.yml`
2026-04-20 09:22:27 +08:00
Lance Release
75b0a8e0a3 Bump version: 0.28.0-beta.8 → 0.28.0-beta.9 2026-04-19 20:39:29 +00:00
Lance Release
2a886141f7 Bump version: 0.31.0-beta.8 → 0.31.0-beta.9 python-v0.31.0-beta.9 2026-04-19 20:39:04 +00:00
Jack Ye
2a1df8edcf fix(rust): materialize declared namespace tables on create (#3288)
## Summary
- handle `declare_table` already-exists conflicts in the Rust namespace
database create path
- reuse declared-but-not-materialized table metadata instead of failing
create mode
- preserve overwrite behavior while allowing declared Geneva system
tables to be materialized
2026-04-19 13:25:53 -07:00
C Kaustubh
fd98b845ea fix(node): prevent reranker from keeping process alive (#3270)
Fixes #3269.

## What I observed
Using a reranker in a hybrid query could keep the Node.js process alive
even after `table.close()` and `db.close()`.

## Root cause
The reranker callback bridge used a `ThreadsafeFunction` in referenced
mode, which can keep the event loop alive longer than intended.

## Minimal fix
- In `nodejs/src/rerankers.rs`, create the reranker callback TSFN in
weak mode (`.weak::<true>()`).
- Add a regression test in `nodejs/__test__/rerankers.test.ts` that
spawns a child process, runs a rerank query, and asserts the process
exits naturally.

## Validation
- Built Node bindings successfully.
- Ran targeted tests: `rerankers.test.ts` passes (including new
regression test).
- Pre-commit checks for changed files were run and clean.
2026-04-19 14:02:23 +08:00
Lance Release
be48ada352 Bump version: 0.28.0-beta.7 → 0.28.0-beta.8 2026-04-19 04:19:10 +00:00
Lance Release
9ad2dfe601 Bump version: 0.31.0-beta.7 → 0.31.0-beta.8 python-v0.31.0-beta.8 2026-04-19 04:18:45 +00:00
Jack Ye
f909df3e87 fix(python): use namespace-backed rust connection for namespace tables (#3286)
So far, I have been using a hacky approach that creates and opens a
namespace-backed table by getting its location and using a temporary
lancedb connection to create or open it. This worked for features like
credentials vending but no longer fully works for the managed
versioning feature. Recently, Geneva tests have been failing here and
there, and various patches have not addressed the root cause. This PR
fully fixes this and implements a proper Rust binding for it.
Specifically:

- build a real Rust namespace-backed connection from the Python
namespace client
- route namespace table create/open through that connection instead of
resolved-location temp connections
- keep namespace client naming consistent in the Rust bridge and
preserve federated namespace + DuckDB behavior
2026-04-18 21:17:52 -07:00
Lance Release
d715bbb588 Bump version: 0.28.0-beta.6 → 0.28.0-beta.7 2026-04-17 08:12:27 +00:00
Lance Release
5ce3d8d141 Bump version: 0.31.0-beta.6 → 0.31.0-beta.7 python-v0.31.0-beta.7 2026-04-17 08:12:03 +00:00
Jack Ye
5eaac178b1 fix(python): pass namespace client on schema-only table create (#3283)
## Summary
- pass `namespace_client` through the Python create-table path
- ensure schema-only namespace table creation uses the namespace-aware
empty-table flow
- fix reopening namespace tables created without initial data
2026-04-17 01:11:18 -07:00
Lance Release
11af763fcd Bump version: 0.28.0-beta.5 → 0.28.0-beta.6 2026-04-16 18:57:28 +00:00
Lance Release
2ed5452e1c Bump version: 0.31.0-beta.5 → 0.31.0-beta.6 python-v0.31.0-beta.6 2026-04-16 18:57:05 +00:00
Xuanwo
b7c0b5987c chore: upgrade lance to 6.0.0-beta.1 (#3281) 2026-04-17 02:51:58 +08:00
Jack Ye
97a4b38f19 feat(rust): support nested namespace ops in listing db (#3279)
## Summary
- delegate child-namespace `ListingDatabase` operations through an
eagerly initialized `LanceNamespaceDatabase`
- support nested namespace create/open/list/drop flows without requiring
callers to inject explicit locations
- add `namespace_client_properties` plumbing for local and namespace
connections so directory namespace settings like
`table_version_tracking_enabled` can be configured
- add regression tests for nested namespace ops and namespace client
property propagation
2026-04-16 10:12:28 -07:00
Gezi-lzq
10879d99b8 docs: fix broken documentation links (#3278) 2026-04-15 20:56:59 +08:00
Lance Release
4e6a1d5dce Bump version: 0.28.0-beta.4 → 0.28.0-beta.5 2026-04-12 23:51:14 +00:00
Lance Release
13d2759356 Bump version: 0.31.0-beta.4 → 0.31.0-beta.5 python-v0.31.0-beta.5 2026-04-12 23:50:50 +00:00
Jack Ye
7f52ec8c36 feat(python): support child namespace operations and json serialization for LanceDBConnection (#3265)
## Summary

Add connection serialization and child namespace support to
`LanceDBConnection`.

- `DBConnection.serialize()` / `lancedb.deserialize()` for connection
reconstruction in remote workers
- Cache `namespace_client()` in `LanceDBConnection` to avoid repeated
DirectoryNamespace builds
- `LanceDBConnection` transparently delegates child namespace operations
(open_table, create_table, list_tables, drop_table, create_namespace,
etc.) to `LanceNamespaceDBConnection` via `_namespace_conn()`
- Root namespace operations still go through the original Rust path
- Generic worker property override mechanism: any
`namespace_client_properties` key prefixed with `_lancedb_worker_` has
the prefix stripped and overrides the corresponding property when
`deserialize(data, for_worker=True)`
- `LanceNamespaceDBConnection` stores
`namespace_client_impl`/`namespace_client_properties` for serialization
roundtrip
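
The worker override rule can be sketched as below. The helper name `apply_worker_overrides` is hypothetical, and how non-worker deserialization treats the prefixed keys is an assumption (here they are left untouched); only the `_lancedb_worker_` prefix and the strip-and-override behavior come from the description.

```python
WORKER_PREFIX = "_lancedb_worker_"


def apply_worker_overrides(properties, for_worker=False):
    """Hypothetical helper: resolve namespace client properties for a worker."""
    if not for_worker:
        # Assumption: outside workers the properties are passed through as-is.
        return dict(properties)
    # Start from the unprefixed properties...
    resolved = {
        key: value
        for key, value in properties.items()
        if not key.startswith(WORKER_PREFIX)
    }
    # ...then let each prefixed key override its unprefixed counterpart.
    for key, value in properties.items():
        if key.startswith(WORKER_PREFIX):
            resolved[key[len(WORKER_PREFIX):]] = value
    return resolved


props = {"root": "/driver/path", "_lancedb_worker_root": "/worker/path"}
worker_props = apply_worker_overrides(props, for_worker=True)
```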

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 16:49:45 -07:00
Lance Release
c6ae0de3ee Bump version: 0.28.0-beta.3 → 0.28.0-beta.4 2026-04-12 03:57:58 +00:00
Lance Release
231f0655ce Bump version: 0.31.0-beta.3 → 0.31.0-beta.4 python-v0.31.0-beta.4 2026-04-12 03:57:35 +00:00
LanceDB Robot
8c52977c59 chore: update lance dependency to v5.1.0-beta.3 (#3266)
## Summary
- Bump Rust Lance dependencies to `v5.1.0-beta.3` using
`ci/set_lance_version.py`.
- Update Java `lance-core.version` to `5.1.0-beta.3` in `java/pom.xml`.
- Refresh `Cargo.lock` metadata to the `v5.1.0-beta.3` Lance git tag.

## Verification
- `cargo clippy --workspace --tests --all-features -- -D warnings`
- `cargo fmt --all`

## Upstream Tag
- https://github.com/lance-format/lance/releases/tag/v5.1.0-beta.3
2026-04-11 20:56:49 -07:00
Lance Release
359710a0bf Bump version: 0.28.0-beta.2 → 0.28.0-beta.3 2026-04-11 22:44:52 +00:00
Lance Release
1f1726369d Bump version: 0.31.0-beta.2 → 0.31.0-beta.3 python-v0.31.0-beta.3 2026-04-11 22:44:25 +00:00
Lance Release
df354abae4 Bump version: 0.28.0-beta.1 → 0.28.0-beta.2 2026-04-11 07:06:00 +00:00
Lance Release
11bc674548 Bump version: 0.31.0-beta.1 → 0.31.0-beta.2 python-v0.31.0-beta.2 2026-04-11 07:05:36 +00:00
LanceDB Robot
5593460823 chore: update lance dependency to v5.1.0-beta.2 (#3263)
## Summary
- Bump Lance Rust workspace dependencies from `5.0.0-beta.5` to
`5.1.0-beta.2` using `ci/set_lance_version.py`.
- Update Java `lance-core.version` in `java/pom.xml` to `5.1.0-beta.2`.
- Refresh `Cargo.lock` to match the new Lance tag.

## Verification
- `cargo clippy --workspace --tests --all-features -- -D warnings`
(passes)
- `cargo fmt --all` (passes)

## Triggering Tag
- https://github.com/lance-format/lance/releases/tag/v5.1.0-beta.2
2026-04-11 00:04:43 -07:00
Will Jones
2807ad6854 chore: bump Rust toolchain from 1.91.0 to 1.94.0 (#3257)
Bumps the Rust toolchain to 1.94.0 (latest installed) to unblock CI
failures caused by the AWS SDK's MSRV requirement. No lint fixes were
needed.

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 07:57:47 -07:00
Dhruv Garg
4761fa9bcb fix(python): migrate gemini-text provider to google-genai sdk (#3250)
## Summary
- migrate gemini-text embedding provider from deprecated
google.generativeai to google.genai
- update Python embedding extra dependency to google-genai
- update default model name to gemini-embedding-001
- adapt embed calls to Client().models.embed_content(...)
- apply lint fixes from CI

## Related
- Closes #3191
2026-04-09 15:28:34 -07:00
lennylxx
4c2939d66e fix(python): guard against None before .decode() on split_names metadata key (#3229)
`.get(b"split_names", None).decode()` was called unconditionally in both
Permutations.__init__ and Permutation.from_tables(), crashing with
AttributeError when schema metadata existed but lacked the split_names
key. Guard the decode behind a None check and add regression tests.
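
A minimal sketch of the guard, assuming the metadata is a `bytes`-keyed mapping as in Arrow schema metadata. The function name `read_split_names` is made up for illustration, and what the real code does with the decoded value is not shown here.

```python
def read_split_names(metadata):
    """Hypothetical helper: decode `split_names` only when the key is present."""
    raw = (metadata or {}).get(b"split_names")
    if raw is None:
        # Metadata exists but has no split_names entry: nothing to decode.
        return None
    return raw.decode()
```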
2026-04-08 16:04:13 -07:00
yaommen
a813ce2f71 fix(python): sanitize bad vectors before Arrow cast (#3158)
## Problem

`on_bad_vectors="drop"` is supposed to remove invalid vector rows before
write, but for some schema-defined vector columns it can still fail
later during Arrow cast instead of dropping the bad row.

Repro:
```python
class MySchema(LanceModel):
    text: str
    embedding: Vector(16)

table = db.create_table("test", schema=MySchema)
table.add(
    [
        {"text": "hello", "embedding": []},
        {"text": "bar", "embedding": [0.1] * 16},
    ],
    on_bad_vectors="drop",
)
```
Before:
```
RuntimeError
Arrow error: C Data interface error: Invalid: ListType can only be casted to FixedSizeListType if the lists are all the expected size.
```
After:
```
rows 1
texts ['bar']
```
## Solution

Make bad-vector sanitization use schema dimensions before cast, while
keeping the handling scoped to vector columns identified by schema
metadata or existing vector-name heuristics.

This also preserves existing integer vector inputs and avoids applying
on_bad_vectors to unrelated fixed-size float columns.
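
The idea of checking vectors against the schema dimension before the cast can be sketched in plain Python. The helper names below are hypothetical and the real fix operates on Arrow data rather than on lists of dicts; this only illustrates the `drop` behavior from the repro above.

```python
import math


def is_valid_vector(vector, dim):
    """Hypothetical check: exactly `dim` values and no NaNs."""
    return (
        vector is not None
        and len(vector) == dim
        and all(not math.isnan(x) for x in vector)
    )


def drop_bad_vectors(rows, column, dim):
    """Hypothetical helper: keep only rows whose vector matches the schema."""
    return [row for row in rows if is_valid_vector(row.get(column), dim)]


rows = [
    {"text": "hello", "embedding": []},
    {"text": "bar", "embedding": [0.1] * 16},
]
kept = drop_bad_vectors(rows, "embedding", 16)
```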


Fixes #1670

Signed-off-by: yaommen <myanstu@163.com>
2026-04-08 09:09:41 -07:00