- Rename deserialize() -> deserialize_conn() for clarity
- Rename internal _pushdown_operations -> _namespace_client_pushdown_operations
for consistency with the parameter name
- Rename serialized key "pushdown_operations" ->
"namespace_client_pushdown_operations"
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
list_namespaces with empty path was going through Rust (ListingDatabase)
which doesn't see namespaces created via the directory namespace client.
Always delegate to _namespace_conn() so create/list are consistent.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Run ruff format on all changed files
- Fix F821 forward reference in _namespace_conn return type
- Update test_local_namespace_operations to verify operations succeed
instead of expecting NotImplementedError (namespace ops now work on
LanceDBConnection via directory namespace delegation)
- Remove test_local_create_namespace_not_supported and
test_local_drop_namespace_not_supported (no longer applicable)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
_create_table_server_side was only passing self.storage_options
(connection-level) to CreateTableRequest, ignoring the user-provided
storage_options parameter. This caused per-table options like
new_table_data_storage_version to be silently dropped.
Fix both sync and async paths to merge user options on top of
connection options.
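The merge can be sketched as a plain dict union where the per-call options win; the specific option names below are examples, not the library's required keys:

```python
# Per-call storage_options are layered on top of connection-level defaults,
# so per-table keys like new_table_data_storage_version are no longer dropped.
connection_options = {"region": "us-east-1", "new_table_data_storage_version": "stable"}
user_options = {"new_table_data_storage_version": "2.1"}

# User options override connection options on key collisions.
merged = {**(connection_options or {}), **(user_options or {})}
```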
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Only delegate to _namespace_conn() when namespace_path is non-empty.
Root namespace operations (list_namespaces, list_tables with empty
path) still go through the original Rust connection to avoid regression.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the special-cased worker_uri key with a generic mechanism:
any namespace_client_properties key starting with _lancedb_worker_
has the prefix stripped and overrides the corresponding property
when for_worker=True.
e.g. _lancedb_worker_uri overrides uri in worker context.
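A minimal sketch of the mechanism (the helper name and property values are illustrative):

```python
WORKER_PREFIX = "_lancedb_worker_"

def resolve_properties(props: dict, for_worker: bool = False) -> dict:
    """Drop worker-only keys; in worker context they override the base key."""
    base = {k: v for k, v in props.items() if not k.startswith(WORKER_PREFIX)}
    if for_worker:
        for k, v in props.items():
            if k.startswith(WORKER_PREFIX):
                # Strip the prefix and override the corresponding property.
                base[k[len(WORKER_PREFIX):]] = v
    return base

props = {"uri": "s3://bucket/db", "_lancedb_worker_uri": "/mnt/local/db"}
```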
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of reimplementing namespace logic (describe_table, merge
storage_options, etc.) in LanceDBConnection, delegate child namespace
operations to a LanceNamespaceDBConnection via _namespace_conn().
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
LanceDBConnection now:
- Caches namespace_client() result to avoid repeated DirectoryNamespace builds
- Auto-delegates open_table/create_table with non-empty namespace_path
through the directory namespace client
- Routes create_namespace/drop_namespace/describe_namespace/list_namespaces
through the namespace client
- Routes list_tables/drop_table for child namespaces through namespace client
This enables local storage connections to transparently handle child
namespaces like ["__system"] without requiring a separate
LanceNamespaceDBConnection.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add serialization support to DBConnection classes so connections
can be reconstructed in remote workers without tracking namespace
params separately.
- DBConnection.serialize_to_json() base method
- LanceDBConnection: serializes uri, storage_options, read_consistency_interval
- LanceNamespaceDBConnection: stores namespace_client_impl/properties,
serializes all connection params including pushdown_operations
- from_serialized_json() factory with for_worker flag for worker_uri swap
- connect_namespace() now passes impl/properties to connection for serialization
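The round trip can be sketched with a toy stand-in; only the serialize_to_json / from_serialized_json shape mirrors the change, the class body itself is illustrative:

```python
import json

class ConnectionSketch:
    """Toy stand-in for LanceDBConnection showing the serialization round trip."""
    def __init__(self, uri, storage_options=None, read_consistency_interval=None):
        self.uri = uri
        self.storage_options = storage_options or {}
        self.read_consistency_interval = read_consistency_interval

    def serialize_to_json(self) -> str:
        return json.dumps({
            "uri": self.uri,
            "storage_options": self.storage_options,
            "read_consistency_interval": self.read_consistency_interval,
        })

    @classmethod
    def from_serialized_json(cls, payload: str, for_worker: bool = False):
        # for_worker would additionally apply worker-specific overrides
        # (e.g. the worker_uri swap); omitted in this sketch.
        data = json.loads(payload)
        return cls(data["uri"], data["storage_options"],
                   data["read_consistency_interval"])

conn = ConnectionSketch("s3://bucket/db", {"region": "us-east-1"})
restored = ConnectionSketch.from_serialized_json(conn.serialize_to_json())
```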
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bumps the Rust toolchain to 1.94.0 (latest installed) to unblock CI
failures caused by the AWS SDK's MSRV requirement. No lint fixes were
needed.
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary
- migrate gemini-text embedding provider from deprecated
google.generativeai to google.genai
- update Python embedding extra dependency to google-genai
- update default model name to gemini-embedding-001
- adapt embed calls to Client().models.embed_content(...)
- apply lint fixes from CI
## Related
- Closes #3191
`.get(b"split_names", None).decode()` was called unconditionally in both
Permutations.__init__ and Permutation.from_tables(), crashing with
AttributeError when schema metadata existed but lacked the split_names
key. Guard the decode behind a None check and add regression tests.
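The guard can be sketched as follows (the helper name is illustrative; the dict mirrors Arrow schema metadata with bytes keys):

```python
def get_split_names(metadata):
    """Return the decoded split_names entry, or None when it is absent."""
    raw = (metadata or {}).get(b"split_names")
    if raw is None:
        return None  # metadata exists but lacks the key: no AttributeError
    return raw.decode()
```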
## Problem
`on_bad_vectors="drop"` is supposed to remove invalid vector rows before
write, but for some schema-defined vector columns it can still fail
later during Arrow cast instead of dropping the bad row.
Repro:
```python
import lancedb
from lancedb.pydantic import LanceModel, Vector

db = lancedb.connect("/tmp/lancedb-repro")

class MySchema(LanceModel):
    text: str
    embedding: Vector(16)

table = db.create_table("test", schema=MySchema)
table.add(
    [
        {"text": "hello", "embedding": []},
        {"text": "bar", "embedding": [0.1] * 16},
    ],
    on_bad_vectors="drop",
)
```
Before:
```
RuntimeError
Arrow error: C Data interface error: Invalid: ListType can only be casted to FixedSizeListType if the lists are all the expected size.
```
After:
```
rows 1
texts ['bar']
```
## Solution
Make bad-vector sanitization use schema dimensions before cast, while
keeping the handling scoped to vector columns identified by schema
metadata or existing vector-name heuristics.
This also preserves existing integer vector inputs and avoids applying
on_bad_vectors to unrelated fixed-size float columns.
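The core of the fix can be sketched as a length check against the schema dimension before any Arrow cast (the function name and row layout here are illustrative):

```python
def drop_bad_vectors(rows, dim, column="embedding"):
    """Keep only rows whose vector already has the schema-defined length."""
    return [r for r in rows if len(r[column]) == dim]

rows = [
    {"text": "hello", "embedding": []},        # bad: wrong length, dropped
    {"text": "bar", "embedding": [0.1] * 16},  # good: matches Vector(16)
]
kept = drop_bad_vectors(rows, dim=16)
```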
Fixes #1670
Signed-off-by: yaommen <myanstu@163.com>
## Summary
- Add a `user_id` field to `ClientConfig` that allows users to identify
themselves to LanceDB Cloud/Enterprise
- The user_id is sent as the `x-lancedb-user-id` HTTP header in all
requests
- Supports three configuration methods:
- Direct assignment via `ClientConfig.user_id`
- Environment variable `LANCEDB_USER_ID`
- Indirect env var lookup via `LANCEDB_USER_ID_ENV_KEY`
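The lookup can be sketched as below, assuming direct assignment takes precedence over the environment variables; the helper itself is illustrative and only the field and env var names come from this change:

```python
import os

def resolve_user_id(explicit=None, env=None):
    """Resolve user_id: direct value, then LANCEDB_USER_ID, then indirect key."""
    env = os.environ if env is None else env
    if explicit is not None:
        return explicit
    if "LANCEDB_USER_ID" in env:
        return env["LANCEDB_USER_ID"]
    indirect_key = env.get("LANCEDB_USER_ID_ENV_KEY")
    if indirect_key:
        # The named variable holds the actual user id.
        return env.get(indirect_key)
    return None
```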
Closes #3230

🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
## Summary
Fixes #1846.
Python `Enum` fields raised `TypeError: Converting Pydantic type to
Arrow Type: unsupported type <enum 'SomethingTypes'>` when converting a
Pydantic model to an Arrow schema.
The fix adds Enum detection in `_pydantic_type_to_arrow_type`. When an
Enum subclass is encountered, the value type of its members is inspected
and mapped to the appropriate Arrow type:
- `str`-valued enums (e.g. `class Status(str, Enum)`) → `pa.utf8()`
- `int`-valued enums (e.g. `class Priority(int, Enum)`) → `pa.int64()`
- Other homogeneous value types → the Arrow type for that Python type
- Mixed-value or empty enums → `pa.utf8()` (safe fallback)
This covers the common `(str, Enum)` and `(int, Enum)` mixin patterns
used in practice.
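The mapping can be sketched as below; Arrow types are represented as strings to keep the sketch dependency-free (the real branch returns `pa.utf8()` / `pa.int64()`), and only the str/int cases plus the fallback are shown:

```python
from enum import Enum

def arrow_type_for_enum(enum_cls):
    """Inspect the value types of an Enum's members and pick an Arrow type."""
    value_types = {type(m.value) for m in enum_cls}
    if value_types == {str}:
        return "utf8"
    if value_types == {int}:
        return "int64"
    return "utf8"  # mixed-value or empty enums: safe fallback

class Status(str, Enum):
    ACTIVE = "active"
    DONE = "done"

class Priority(int, Enum):
    LOW = 1
    HIGH = 2
```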
## Changes
- `python/python/lancedb/pydantic.py`: add Enum branch in
`_pydantic_type_to_arrow_type`
- `python/python/tests/test_pydantic.py`: add `test_enum_types` covering
`str`, `int`, and `Optional` Enum fields
## Note on #2395
PR #2395 handles `StrEnum` (Python 3.11+) specifically, using a
dictionary-encoded type. This PR handles the broader `(str, Enum)` /
`(int, Enum)` mixin pattern that works across all Python versions and
stores values as their natural Arrow type.
AI assistance was used in developing this fix.
1. Refactored every client (Rust core, Python, Node/TypeScript) so
“namespace” usage is explicit: code now keeps namespace paths
(namespace_path) separate from namespace clients (namespace_client).
Connections propagate the client, table creation routes through it, and
managed versioning defaults are resolved from namespace metadata. Python
gained LanceNamespaceDBConnection/async counterparts, and the
namespace-focused tests were rewritten to match the clarified API
surface.
2. Synchronized the workspace with Lance 5.0.0-beta.3 (see
https://github.com/lance-format/lance/pull/6186 for the upstream
namespace refactor), updating Cargo/uv lockfiles and ensuring all
bindings align with the new namespace semantics.
3. Added a namespace-backed code path to lancedb.connect() via new
keyword arguments (namespace_client_impl, namespace_client_properties,
plus the existing pushdown-ops flag). When those kwargs are supplied,
connect() delegates to connect_namespace, so users can opt into
namespace clients without changing APIs. (The async helper will gain
parity in a later change)
Bumps all lance-* workspace dependencies from `4.0.0-rc.3` (git source)
to the stable `4.0.0` release on crates.io, removing the `git`/`tag`
overrides.
No code changes were required — compiles and passes clippy cleanly.
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixes #1540
I could not reproduce this on current `main` from Python, but I could
still reproduce it from the Rust SDK.
Python no longer reproduces because the current Python vector/hybrid
query paths re-chunk results into a `pyarrow.Table` before returning
batches. Rust still reproduced because `max_batch_length` was passed
into planning/scanning, but vector search could still emit larger
`RecordBatch`es later in execution (for example after KNN / TopK), so it
was not enforced on the final Rust output stream.
This PR enforces `max_batch_length` on the final Rust query output
stream and adds Rust regression coverage.
Before the fix, the Rust repro produced:
`num_batches=2, max_batch=8192, min_batch=1808, all_le_100=false`
After the fix, the same repro produces batches `<= 100`.
## Runnable Rust repro
Before this fix, current `main` could still return batches like `[8192,
1808]` here even with `max_batch_length = 100`:
```rust
use std::sync::Arc;

use arrow_array::{
    types::Float32Type, FixedSizeListArray, RecordBatch, RecordBatchReader, StringArray,
};
use arrow_schema::{DataType, Field, Schema};
use futures::TryStreamExt;
use lancedb::query::{ExecutableQuery, QueryBase, QueryExecutionOptions};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let tmp = tempfile::tempdir()?;
    let uri = tmp.path().to_str().unwrap();
    let rows = 10_000;

    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Utf8, false),
        Field::new(
            "vector",
            DataType::FixedSizeList(Arc::new(Field::new("item", DataType::Float32, true)), 4),
            false,
        ),
    ]));
    let ids = StringArray::from_iter_values((0..rows).map(|i| format!("row-{i}")));
    let vectors = FixedSizeListArray::from_iter_primitive::<Float32Type, _, _>(
        (0..rows).map(|i| Some(vec![Some(i as f32), Some(1.0), Some(2.0), Some(3.0)])),
        4,
    );
    let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(ids), Arc::new(vectors)])?;
    let reader: Box<dyn RecordBatchReader + Send> = Box::new(
        arrow_array::RecordBatchIterator::new(vec![Ok(batch)].into_iter(), schema),
    );

    let db = lancedb::connect(uri).execute().await?;
    let table = db.create_table("test", reader).execute().await?;

    let mut opts = QueryExecutionOptions::default();
    opts.max_batch_length = 100;

    let mut stream = table
        .query()
        .nearest_to(vec![0.0, 1.0, 2.0, 3.0])?
        .limit(rows)
        .execute_with_options(opts)
        .await?;

    let mut sizes = Vec::new();
    while let Some(batch) = stream.try_next().await? {
        sizes.push(batch.num_rows());
    }
    println!("{sizes:?}");
    Ok(())
}
```
Signed-off-by: yaommen <myanstu@163.com>
The test added in #3190 unconditionally imports `PIL`, which is an
optional dependency. This causes CI failures in environments where
Pillow isn't installed (`ModuleNotFoundError: No module named 'PIL'`).
Use `pytest.importorskip` to skip gracefully when Pillow is unavailable.
Fixes CI failure on main.
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary
- Namespace tests expected `RuntimeError` for table-not-found and
namespace-not-empty cases, but `lance_namespace` raises
`TableNotFoundError` and `NamespaceNotEmptyError` which inherit from
`Exception`, not `RuntimeError`.
- Updated `pytest.raises` to use the correct exception types.
## Test plan
- [x] CI passes on `test_namespace.py`
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
url_retrieve() calls urllib.request.urlopen() but only urllib.error was
imported, causing AttributeError for any HTTP URL input. This affects
open-clip, siglip, and jinaai embedding functions when processing image
URLs.
The bug has existed since the embeddings API refactor (#580) but was
masked because most users pass local file paths or bytes rather than
HTTP URLs.
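The fix is a one-line import change; a minimal sketch of the helper (the real url_retrieve signature may differ):

```python
import urllib.request  # importing urllib.error alone does not expose urlopen

def url_retrieve(url: str) -> bytes:
    """Fetch raw bytes from a URL."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()
```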
Fixes #2716
## Summary
Add support for querying with Float16Array, Float64Array, and Uint8Array
vectors in the Node.js SDK, eliminating precision loss from the previous
`Float32Array.from()` conversion.
## Implementation
Follows @wjones127's [5-step
plan](https://github.com/lancedb/lancedb/issues/2716#issuecomment-3447750543):
### Rust (`nodejs/src/query.rs`)
1. `bytes_to_arrow_array(data: Uint8Array, dtype: String)` helper that:
   - Creates an Arrow `Buffer` from the raw bytes
   - Wraps it in a typed `ScalarBuffer<T>` based on the dtype enum
   - Constructs a `PrimitiveArray` and returns `Arc<dyn Array>`
2. `nearest_to_raw(data, dtype)` and `add_query_vector_raw(data, dtype)` NAPI
methods that pass the type-erased array to the core
`nearest_to`/`add_query_vector`, which already accept `impl
IntoQueryVector` for `Arc<dyn Array>`
### TypeScript (`nodejs/lancedb/query.ts`, `arrow.ts`)
3. Extended `IntoVector` type to include `Uint8Array` (and
`Float16Array` via runtime check for Node 22+)
4. `extractVectorBuffer()` helper detects non-Float32 typed arrays and
extracts their underlying byte buffer + dtype string
5. `nearestTo()` and `addQueryVector()` route through the raw NAPI path when
the input is Float16/Float64/Uint8
### Backward compatibility
Existing `Float32Array` and `number[]` inputs are unchanged -- they still
use the original `nearest_to(Float32Array)` NAPI method. The new raw path
is only used when a non-Float32 typed array is detected.
## Usage
```typescript
// Float16Array (Node 22+) -- no precision loss
const f16vec = new Float16Array([0.1, 0.2, 0.3]);
const f16Results = await table.query().nearestTo(f16vec).limit(10).toArray();

// Float64Array -- no precision loss
const f64vec = new Float64Array([0.1, 0.2, 0.3]);
const f64Results = await table.query().nearestTo(f64vec).limit(10).toArray();

// Uint8Array (binary embeddings)
const u8vec = new Uint8Array([1, 0, 1, 1, 0]);
const u8Results = await table.query().nearestTo(u8vec).limit(10).toArray();

// Existing usage unchanged
const results = await table.query().nearestTo([0.1, 0.2, 0.3]).limit(10).toArray();
```
## Note on dependencies
The Rust side uses the `arrow_array`, `arrow_buffer`, and `half` crates.
These should already be in the dependency tree via `lancedb` core, but
`Cargo.toml` may need explicit entries for `half` and the arrow
sub-crates in the nodejs workspace.
---------
Signed-off-by: Vedant Madane <6527493+VedantMadane@users.noreply.github.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
Fixes #3183
## Summary
When `table.add(mode='overwrite')` is called, PyArrow infers input data
types (e.g. `list<double>`) which differ from the original table schema
(e.g. `fixed_size_list<float32>`). Previously, overwrite mode bypassed
`cast_to_table_schema()` entirely, so the inferred types replaced the
original schema, breaking vector search.
This fix builds a merged target schema for overwrite: columns present in
the existing table schema keep their original types, while columns
unique to the input pass through as-is. This way
`cast_to_table_schema()` is applied unconditionally, preserving vector
column types without blocking schema evolution.
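The merged target schema can be sketched with schemas modeled as name-to-type dicts (the real code operates on Arrow schemas in Rust):

```python
def merged_overwrite_schema(existing: dict, incoming: dict) -> dict:
    """Columns already in the table keep their original types;
    columns unique to the input pass through with their inferred types."""
    return {name: existing.get(name, dtype) for name, dtype in incoming.items()}

existing = {"id": "utf8", "vector": "fixed_size_list<float32, 4>"}
incoming = {"id": "utf8", "vector": "list<double>", "extra": "int64"}
target = merged_overwrite_schema(existing, incoming)
```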
## Changes
- `rust/lancedb/src/table/add_data.rs`: For overwrite mode, construct a
target schema by matching input columns against the existing table
schema, then cast. Non-overwrite (append) path is unchanged.
- Added `test_add_overwrite_preserves_vector_type` test that creates a
table with `fixed_size_list<float32>`, overwrites with `list<double>`
input, and asserts the original type is preserved.
## Test Plan
- `cargo test --features remote -p lancedb -- test_add_overwrite` — all
4 overwrite tests pass
- Full suite: 454 passed, 2 failed (pre-existing `remote::retry` flakes
unrelated to this change)
---------
Signed-off-by: majiayu000 <1835304752@qq.com>
dict.update() mutates in place and returns None. Assigning its result
caused with_metadata(None) to strip all schema metadata when embedding
metadata was merged during create_table with embedding_functions.
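The pitfall in miniature:

```python
metadata = {"a": "1"}
extra = {"b": "2"}

# Bug pattern: dict.update() mutates in place and returns None, so assigning
# its result and passing that on is equivalent to passing None.
result = metadata.update(extra)

# Fix: rely on the in-place mutation, or build a new merged dict explicitly.
merged = {**{"a": "1"}, **extra}
```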
This patch mitigates template injection vulnerabilities in GitHub
Workflows by replacing direct references with an environment variable.
Aikido used AI to generate this PR.
High confidence: Aikido has a robust set of benchmarks for similar
fixes, and they are proven to be effective.
Co-authored-by: aikido-autofix[bot] <119856028+aikido-autofix[bot]@users.noreply.github.com>
Replace ~30 production `lock().unwrap()` calls that would cascade-panic
on a poisoned Mutex. Functions returning `Result` now propagate the
poison as an error via `?` (leveraging the existing `From<PoisonError>`
impl). Functions without a `Result` return recover via
`unwrap_or_else(|e| e.into_inner())`, which is safe because the guarded
data (counters, caches, RNG state) remains logically valid after a
panic.