lancedb

mirror of https://github.com/lancedb/lancedb.git synced 2025-12-23 21:39:57 +00:00

Author	SHA1	Message	Date
David Myriel	9e278fc5a6	fix small details	2025-05-05 23:03:17 +02:00
David Myriel	09fed1f286	add quickstart doc	2025-05-05 22:02:11 +02:00
Will Jones	cee2b5ea42	chore: upgrade pyarrow pin (#2192) Closes #2191 ## Summary by CodeRabbit - Chores - Updated the required version of the pyarrow package to version 16 or higher. - Adjusted automated testing workflows to install pyarrow version 16 for compatibility checks.	2025-05-05 11:23:13 -07:00
Alex Pilon	f315f9665a	feat: implement bindings to return merge stats (#2367) Based on this comment: https://github.com/lancedb/lancedb/issues/2228#issuecomment-2730463075 and https://github.com/lancedb/lance/pull/2357 Here is my attempt at implementing bindings for returning merge stats from a `merge_insert.execute` call for lancedb. Note: I have almost no idea what I am doing in Rust but tried to follow existing code patterns and pay attention to compiler hints. - The change in the nodejs binding appeared to be necessary to get compilation to work; presumably this could actually work properly by returning some kind of NAPI JS object of the stats data? - I am unsure of what to do with the remote/table.rs changes, which were necessary for compilation to work; I assume this is related to LanceDB cloud, but I am unsure of the best way to handle that at this point. Proof of function:
```python
import pandas as pd
import lancedb

db = lancedb.connect("/tmp/test.db")

test_data = pd.DataFrame(
    {
        "title": ["Hello", "Test Document", "Example", "Data Sample", "Last One"],
        "id": [1, 2, 3, 4, 5],
        "content": [
            "World",
            "This is a test",
            "Another example",
            "More test data",
            "Final entry",
        ],
    }
)

table = db.create_table("documents", data=test_data, exist_ok=True, mode="overwrite")

update_data = pd.DataFrame(
    {
        "title": [
            "Hello, World",
            "Test Document, it's good",
            "Example",
            "Data Sample",
            "Last One",
            "New One",
        ],
        "id": [1, 2, 3, 4, 5, 6],
        "content": [
            "World",
            "This is a test",
            "Another example",
            "More test data",
            "Final entry",
            "New content",
        ],
    }
)

stats = (
    table.merge_insert(on="id")
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute(update_data)
)

print(stats)
```
returns
```
{'num_inserted_rows': 1, 'num_updated_rows': 5, 'num_deleted_rows': 0}
```
## Summary by CodeRabbit - New Features - Merge-insert operations now return detailed statistics, including counts of inserted, updated, and deleted rows. - Bug Fixes - Tests updated to validate returned merge-insert statistics for accuracy. - Documentation - Method documentation improved to reflect new return values and clarify merge operation results. - Added documentation for the new `MergeStats` interface detailing operation statistics. --------- Co-authored-by: Will Jones <willjones127@gmail.com>	2025-05-01 10:00:20 -07:00
Andrew C. Oliver	5deb26bc8b	fix: prevent embedded objects from returning null in all of their fields (#2355) metadata{filename=xyz} filename would be there structurally, but ALWAYS null. I didn't include this as a file but it may be useful for understanding the problem for people searching on this issue so I'm including it here as documentation. Before this patch any field that is more than 1 deep is accepted but returns null values for subfields when queried.
```js
const lancedb = require('@lancedb/lancedb');

// Debug logger
function debug(message, data) {
  console.log(`[TEST] ${message}`, data !== undefined ? data : '');
}

// Log when our unwrapArrowObject is called
const kParent = Symbol.for("parent");
const kRowIndex = Symbol.for("rowIndex");

// Override console.log for our test
const originalConsoleLog = console.log;
console.log = function() {
  // Filter out noisy logs
  if (arguments[0] && typeof arguments[0] === 'string' && arguments[0].includes('[INFO] [LanceDB]')) {
    originalConsoleLog.apply(console, arguments);
  }
  originalConsoleLog.apply(console, arguments);
};

async function main() {
  debug('Starting test...');

  // Connect to the database
  debug('Connecting to database...');
  const db = await lancedb.connect('./.lancedb');

  // Try to open an existing table, or create a new one if it doesn't exist
  let table;
  try {
    table = await db.openTable('test_nested_fields');
    debug('Opened existing table');
  } catch (e) {
    debug('Creating new table...');

    // Create test data with nested metadata structure
    const data = [
      {
        id: 'test1',
        vector: [1, 2, 3],
        metadata: {
          filePath: "/path/to/file1.ts",
          startLine: 10,
          endLine: 20,
          text: "function test() { return true; }"
        }
      },
      {
        id: 'test2',
        vector: [4, 5, 6],
        metadata: {
          filePath: "/path/to/file2.ts",
          startLine: 30,
          endLine: 40,
          text: "function test2() { return false; }"
        }
      }
    ];
    debug('Data to be inserted:', JSON.stringify(data, null, 2));

    // Create the table
    table = await db.createTable('test_nested_fields', data);
    debug('Table created successfully');
  }

  // Query the table and get results
  debug('Querying table...');
  const results = await table.search([1, 2, 3]).limit(10).toArray();

  // Log the results
  debug('Number of results:', results.length);
  if (results.length > 0) {
    const firstResult = results[0];
    debug('First result properties:', Object.keys(firstResult));

    // Check if metadata is accessible and what properties it has
    if (firstResult.metadata) {
      debug('Metadata properties:', Object.keys(firstResult.metadata));
      debug('Metadata filePath:', firstResult.metadata.filePath);
      debug('Metadata startLine:', firstResult.metadata.startLine);

      // Destructure to see if that helps
      const { filePath, startLine, endLine, text } = firstResult.metadata;
      debug('Destructured values:', { filePath, startLine, endLine, text });

      // Check if it's a proxy object
      debug('Result is proxy?', Object.getPrototypeOf(firstResult) === Object.prototype ? false : true);
      debug('Metadata is proxy?', Object.getPrototypeOf(firstResult.metadata) === Object.prototype ? false : true);
    } else {
      debug('Metadata is not accessible!');
    }
  }

  // Close the database
  await db.close();
}

main().catch(e => {
  console.error('Error:', e);
});
```
## Summary by CodeRabbit - Bug Fixes - Improved handling of nested struct fields to ensure accurate preservation of values during serialization and deserialization. - Enhanced robustness when accessing nested object properties, reducing errors with missing or null values. - Tests - Added tests to verify correct handling of nested struct fields through serialization and deserialization. --------- Co-authored-by: Will Jones <willjones127@gmail.com>	2025-05-01 09:38:55 -07:00
Lance Release	3cc670ac38	Updating package-lock.json	2025-04-29 23:21:19 +00:00
Lance Release	4ade3e31e2	Updating package-lock.json	2025-04-29 22:19:46 +00:00
Lance Release	a222d2cd91	Updating package-lock.json	2025-04-29 22:19:30 +00:00
Lance Release	508e621f3d	Bump version: 0.19.1-beta.0 → 0.19.1-beta.1 v0.19.1-beta.1	2025-04-29 22:19:14 +00:00
Lance Release	a1a0472f3f	Bump version: 0.22.1-beta.0 → 0.22.1-beta.1 python-v0.22.1-beta.1	2025-04-29 22:18:53 +00:00
Wyatt Alt	3425a6d339	feat: upgrade lance to v0.27.0-beta.2 (#2364) ## Summary by CodeRabbit - Chores - Updated dependencies for related components to use the latest version from a specific repository source. No changes to features or public functionality.	2025-04-29 14:59:56 -07:00
Ryan Green	af54e0ce06	feat: add table stats API (#2363) * Add a new "table stats" API to expose basic table and fragment statistics with local and remote table implementations ### Questions * This is using `calculate_data_stats` to determine total bytes in the table. This seems like a potentially expensive operation - are there any concerns about performance for large datasets? ### Notes * bytes_on_disk seems to be stored at the column level but there does not seem to be a way to easily calculate total bytes per fragment. This may need to be added in lance before we can support fragment size (bytes) statistics. ## Summary by CodeRabbit - New Features - Added a method to retrieve comprehensive table statistics, including total rows, index counts, storage size, and detailed fragment size metrics such as minimum, maximum, mean, and percentiles. - Enabled fetching of table statistics from remote sources through asynchronous requests. - Extended table interfaces across Python, Rust, and Node.js to support synchronous and asynchronous retrieval of table statistics. - Tests - Introduced tests to verify the accuracy of the new table statistics feature for both populated and empty tables.	2025-04-29 15:19:08 -02:30
Lance Release	089905fe8f	Updating package-lock.json	2025-04-28 19:13:36 +00:00
Lance Release	554939e5d2	Updating package-lock.json	2025-04-28 17:20:58 +00:00
Lance Release	7a13814922	Updating package-lock.json	2025-04-28 17:20:42 +00:00
Lance Release	e9f25f6a12	Bump version: 0.19.0 → 0.19.1-beta.0 v0.19.1-beta.0	2025-04-28 17:20:26 +00:00
Lance Release	419a433244	Bump version: 0.22.0 → 0.22.1-beta.0 python-v0.22.1-beta.0	2025-04-28 17:20:10 +00:00
LuQQiu	a9311c4dc0	feat: add list/create/delete/update/checkout tag API (#2353) Add the tag-related API to list existing tags, attach a tag to a version, update the tag version, delete a tag, get the version of a tag, and check out the version that the tag is bound to. ## Summary by CodeRabbit - New Features - Introduced table version tagging, allowing users to create, update, delete, and list human-readable tags for specific table versions. - Enabled checking out a table by either version number or tag name. - Added new interfaces for tag management in both Python and Node.js APIs, supporting synchronous and asynchronous workflows. - Bug Fixes - None. - Documentation - Updated documentation to describe the new tagging features, including usage examples. - Tests - Added comprehensive tests for tag creation, updating, deletion, listing, and version checkout by tag in both Python and Node.js environments.	2025-04-28 10:04:46 -07:00
LuQQiu	178bcf9c90	fix: hybrid search explain plan analyze plan (#2360) Fix hybrid search explain plan analyze plan API ## Summary by CodeRabbit - New Features - Added options to view the execution plan and analyze the runtime performance of hybrid queries. - Refactor - Improved internal handling of query setup for better modularity and maintainability.	2025-04-27 18:39:43 -07:00
Lance Release	b9be092cb1	Updating package-lock.json	2025-04-25 22:05:57 +00:00
Lance Release	e8c0c52315	Updating package-lock.json	2025-04-25 21:17:03 +00:00
Lance Release	a60fa0d3b7	Updating package-lock.json	2025-04-25 21:16:48 +00:00
Lance Release	726d629b9b	Bump version: 0.19.0-beta.12 → 0.19.0 v0.19.0	2025-04-25 21:16:30 +00:00
Lance Release	b493f56dee	Bump version: 0.19.0-beta.11 → 0.19.0-beta.12	2025-04-25 21:16:25 +00:00
Lance Release	a8b5ad7e74	Bump version: 0.22.0-beta.12 → 0.22.0 python-v0.22.0	2025-04-25 21:16:07 +00:00
Lance Release	f8f6264883	Bump version: 0.22.0-beta.11 → 0.22.0-beta.12	2025-04-25 21:16:07 +00:00
Will Jones	d8517117f1	feat: upgrade Lance to v0.26.0 (#2359) Upstream changelog: https://github.com/lancedb/lance/releases/tag/v0.26.0 ## Summary by CodeRabbit - Chores - Updated dependency management to use published crate versions for improved reliability and maintainability. - Added a temporary workaround for build issues by pinning a specific version of a dependency. - Refactor - Improved resource management and concurrency by updating internal ownership models for object storage components.	2025-04-25 13:59:12 -07:00
Lance Release	ab66dd5ed2	Updating package-lock.json	2025-04-25 06:04:06 +00:00
Lance Release	cbb9a7877c	Updating package-lock.json	2025-04-25 05:02:47 +00:00
Lance Release	b7fc223535	Updating package-lock.json	2025-04-25 05:02:32 +00:00
Lance Release	1fdaf7a1a4	Bump version: 0.19.0-beta.10 → 0.19.0-beta.11 v0.19.0-beta.11	2025-04-25 05:02:16 +00:00
Lance Release	d11819c90c	Bump version: 0.22.0-beta.10 → 0.22.0-beta.11 python-v0.22.0-beta.11	2025-04-25 05:01:57 +00:00
BubbleCal	9b902272f1	fix: sync hybrid search ignores the distance range params (#2356) ## Summary by CodeRabbit - New Features - Added support for distance range filtering in hybrid vector queries, allowing users to specify lower and upper bounds for search results. - Tests - Introduced new tests to validate distance range filtering and reranking in both synchronous and asynchronous hybrid query scenarios. --------- Signed-off-by: BubbleCal <bubble-cal@outlook.com>	2025-04-25 13:01:22 +08:00
Will Jones	8c0622fa2c	fix: remote limit to avoid "Limit must be non-negative" (#2354) To work around this issue: https://github.com/lancedb/lancedb/issues/2211 ## Summary by CodeRabbit - Bug Fixes - Improved handling of large query parameters to prevent potential overflow issues when using the "k" parameter in queries.	2025-04-24 15:04:06 -07:00
Philip Meier	2191f948c3	fix: add missing pydantic model config compat (#2316) Fixes #2315. ## Summary by CodeRabbit - Refactor - Enhanced query processing to maintain smooth functionality across different dependency versions, ensuring improved stability and performance.	2025-04-22 14:46:10 -07:00
Will Jones	acc3b03004	ci: fix docs deploy (#2351) ## Summary by CodeRabbit - Chores - Improved CI workflow for documentation builds by optimizing Rust build settings and updating the runner environment. - Fixed a typo in a workflow step name. - Streamlined caching steps to reduce redundancy and improve efficiency.	2025-04-22 13:55:34 -07:00
Lance Release	7f091b8c8e	Updating package-lock.json	2025-04-22 19:16:43 +00:00
Lance Release	c19bdd9a24	Updating package-lock.json	2025-04-22 18:24:16 +00:00
Lance Release	dad0ff5cd2	Updating package-lock.json	2025-04-22 18:23:59 +00:00
Lance Release	a705621067	Bump version: 0.19.0-beta.9 → 0.19.0-beta.10 v0.19.0-beta.10	2025-04-22 18:23:39 +00:00
Lance Release	39614fdb7d	Bump version: 0.22.0-beta.9 → 0.22.0-beta.10 python-v0.22.0-beta.10	2025-04-22 18:23:17 +00:00
Ryan Green	96d534d4bc	feat: add retries to remote client for requests with stream bodies (#2349) Closes https://github.com/lancedb/lancedb/issues/2307 * Adds retries to remote operations with stream bodies (add, merge_insert) * Change default retryable status codes to 409, 429, 500, 502, 503, 504 * Don't retry add or merge_insert operations on 5xx responses Notes: * Supporting retries on stream bodies means we have to buffer the body into memory so it can be cloned on retry. This will impact memory use patterns for the remote client. This buffering can be disabled by disabling retries (i.e. setting retries to 0 in RetryConfig) * It does not seem that retry config can be specified by env vars as the documentation suggests. I added a follow-up issue [here](https://github.com/lancedb/lancedb/issues/2350) ## Summary by CodeRabbit - New Features - Enhanced retry support for remote requests with configurable limits and exponential backoff with jitter. - Added robust retry logic for streaming data uploads, enabling retries with buffered data to ensure reliability. - Bug Fixes - Improved error handling and retry behavior for HTTP status codes 409 and 504. - Refactor - Centralized and modularized HTTP request sending and retry logic across remote database and table operations. - Streamlined request ID management for improved traceability. - Simplified error message construction in index waiting functionality. - Tests - Added a test verifying merge-insert retries on HTTP 409 responses.	2025-04-22 15:40:44 -02:30
Lance Release	5051d30d09	Updating package-lock.json	2025-04-21 23:55:43 +00:00
Lance Release	db853c4041	Updating package-lock.json	2025-04-21 22:50:56 +00:00
Lance Release	76d1d22bdc	Updating package-lock.json	2025-04-21 22:50:40 +00:00
Lance Release	d8746c61c6	Bump version: 0.19.0-beta.8 → 0.19.0-beta.9 v0.19.0-beta.9	2025-04-21 22:50:20 +00:00
Lance Release	1a66df2627	Bump version: 0.22.0-beta.8 → 0.22.0-beta.9 python-v0.22.0-beta.9	2025-04-21 22:49:59 +00:00
Will Jones	44670076c1	fix: move timeout to avoid retries (#2347) I added a timeout to query execution options in https://github.com/lancedb/lancedb/pull/2288. However, this was set as the request timeout, but the retry implementation is unaware of this timeout. So once the query timed out, a retry would be triggered. Instead, this PR changes it so the timeout happens outside the retry loop. ## Summary by CodeRabbit - Bug Fixes - Improved query timeout handling to provide clearer error messages and more reliable cancellation if a query takes too long to complete.	2025-04-21 14:27:04 -07:00
Will Jones	92f0b16e46	fix(python): make sure pandas is optional (#2346) Fixes #2344 ## Summary by CodeRabbit - Tests - Updated tests to use PyArrow Tables instead of pandas DataFrames where possible, reducing reliance on pandas. - Tests that require pandas are now automatically skipped if pandas is not installed. - Chores - Improved workflow to uninstall both pylance and pandas in a specific test step.	2025-04-21 13:42:13 -07:00
Eileen Noonan	1620ba3508	docs: make table.update() nodejs guide consistent with API documentation (#2334) The docs in the Guide here do not match the [API reference](https://lancedb.github.io/lancedb/js/classes/Table/#updateopts) for the nodejs client. I am writing an Elixir wrapper over the typescript library (Rust forthcoming!) and confirmed in testing that the API reference is correct vs the Guide. Following the Guide docs, the error I got was: "lance error: Invalid user input: Schema error: No field named bar. Valid fields are foo." for a query of: `await table.update({foo: "buzz"}, { where: "foo = 'bar'"});` over a table with a schema of just {foo: Utf8}. ## Summary by CodeRabbit - Documentation - Reformatted a code snippet in the guide to enhance readability by splitting it into multiple lines for improved clarity.	2025-04-21 08:38:16 -07:00
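
The message for 44670076c1 (#2347) says the query timeout was moved outside the retry loop so that a timed-out query no longer triggers a retry. The sketch below illustrates that idea in Python with asyncio. It is not LanceDB's actual code: `RetryableError`, `send_with_retries`, `query_with_timeout`, and `make_flaky_request` are hypothetical names chosen for this example, and the backoff values are arbitrary.

```python
import asyncio


class RetryableError(Exception):
    """Hypothetical stand-in for a retryable response (e.g. HTTP 429 or 503)."""


async def send_with_retries(request, retries=3, base_delay=0.01):
    """Call `request` until it succeeds or the retry budget is spent."""
    for attempt in range(retries + 1):
        try:
            return await request()
        except RetryableError:
            if attempt == retries:
                raise
            # Exponential backoff between attempts (jitter omitted for brevity).
            await asyncio.sleep(base_delay * 2 ** attempt)


async def query_with_timeout(request, timeout, retries=3):
    """Apply one overall deadline around the whole retry loop.

    When the deadline passes, the retry loop itself is cancelled, so the
    timeout is never treated as a failure that should be retried.
    """
    return await asyncio.wait_for(send_with_retries(request, retries), timeout)


def make_flaky_request(failures):
    """Build a request that fails `failures` times, then returns "ok"."""
    state = {"calls": 0}

    async def request():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise RetryableError()
        return "ok"

    return request, state
```

With this arrangement a request that fails twice and then succeeds is retried as usual, while a request that simply runs too long raises `asyncio.TimeoutError` after a single attempt.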


1831 Commits